From patchwork Fri Jun 23 21:28:10 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 21248
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
From: Wilco Dijkstra
To: Sebastian Pop, "libc-alpha@sourceware.org"
CC: Marcus Shawcroft, "maxim.kuvyrkov@linaro.org", Ramana Radhakrishnan,
 "ryan.arnold@linaro.org", "adhemerval.zanella@linaro.org",
 "sebpop@gmail.com", nd
Subject: Re: [PATCH] aarch64: optimize the unaligned case of memcmp
Date: Fri, 23 Jun 2017 21:28:10 +0000
References: <1498174226-16525-1-git-send-email-s.pop@samsung.com>,
 <637cf51c-160d-172f-6520-bba51058f85e@samsung.com>
In-Reply-To: <637cf51c-160d-172f-6520-bba51058f85e@samsung.com>

Sebastian Pop wrote:

> If I remove all the alignment code, I get less performance on the hikey
> A53 board.
> With this patch:

@@ -142,9 +143,23 @@ ENTRY(memcmp)
 
        .p2align 6
 .Lmisaligned8:
+
+       cmp     limit, #8
+       b.lo    .LmisalignedLt8
+
+       .p2align 5
+.Lloop_part_aligned:
+       ldr     data1, [src1], #8
+       ldr     data2, [src2], #8
+       subs    limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+       eor     diff, data1, data2      /* Non-zero if differences found. */
+       cbnz    diff, .Lnot_limit
+       b.ne    .Lloop_part_aligned
+
+.LmisalignedLt8:
        sub     limit, limit, #1
 1:
-       /* Perhaps we can do better than this.  */
        ldrb    data1w, [src1], #1
        ldrb    data2w, [src2], #1
        subs    limit, limit, #1

Where is the setup of limit_wd and limit?

I would expect the small cases to be faster since you avoid around 10
cycles of mostly ALU ops that make very little progress. So it should take
several iterations with an extra unaligned access before you're worse off.
In memcpy (which is similar, with 2 streams) I align after 96 bytes.

> With the extra patch:

Note it's more readable to write mov tmp3, 8. However, it's even better to
use a writeback of 8 in the unaligned loads, and then subtract tmp1 from
src1 and src2 - this saves 2 instructions.

Wilco

--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@ ENTRY(memcmp)
        /* Sources are not aligned align one of the sources find max offset
           from aligned boundary. */
-       and     tmp1, src1, #0x7
+       and     tmp1, src2, #0x7
        orr     tmp3, xzr, #0x8
        sub     pos, tmp3, tmp1
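
For reference, here is a rough C model of what the quoted .Lmisaligned8
hunk appears to be aiming for: compare 8 bytes at a time with unaligned
loads, then finish with the byte loop. It is only a sketch, not the glibc
code. The name memcmp_model is made up for illustration, and the
"limit_wd = limit / 8" setup is an assumption about what the hunk would
need before entering the loop, since the quoted patch does not show it.
Where the assembly branches to .Lnot_limit on a mismatch, the model simply
lets the byte loop rescan the differing word.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the misaligned path; not the glibc implementation. */
static int memcmp_model(const void *s1, const void *s2, size_t limit)
{
    const unsigned char *src1 = s1;
    const unsigned char *src2 = s2;

    /* Assumed setup missing from the quoted hunk: count of whole 8-byte words. */
    size_t limit_wd = limit / 8;

    /* Word loop: both loads may be unaligned (memcpy models a plain ldr). */
    while (limit_wd != 0) {
        uint64_t data1, data2;
        memcpy(&data1, src1, 8);
        memcpy(&data2, src2, 8);
        if (data1 != data2)
            break;              /* the byte loop below finds the exact byte */
        src1 += 8;
        src2 += 8;
        limit -= 8;
        limit_wd--;
    }

    /* Byte loop: handles limit < 8, the tail, and a mismatching word. */
    while (limit != 0) {
        if (*src1 != *src2)
            return (int)*src1 - (int)*src2;
        src1++;
        src2++;
        limit--;
    }
    return 0;
}
```

The point of the model is that both limit and limit_wd have to be kept
consistent across the word loop so the byte loop sees the right remaining
count.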