Message ID | PAWPR08MB8982F6211E8FDA162ECB72D883FD9@PAWPR08MB8982.eurprd08.prod.outlook.com |
---|---|
State | Committed |
Commit | 03c8ce5000198947a4dd7b2c14e5131738fda62b |
Headers | From: Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> Reply-To: Wilco Dijkstra <Wilco.Dijkstra@arm.com> To: 'GNU C Library' <libc-alpha@sourceware.org> CC: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Subject: [PATCH] AArch64: Optimize strlen Date: Thu, 12 Jan 2023 15:53:09 +0000 List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org> |
Series | AArch64: Optimize strlen |
Checks
Context | Check | Description |
---|---|---|
dj/TryBot-apply_patch | success | Patch applied to master at the time it was sent |
dj/TryBot-32bit | success | Build for i686 |
Commit Message
Wilco Dijkstra
Jan. 12, 2023, 3:53 p.m. UTC
Optimize strlen by unrolling the main loop. Large strings are 64% faster on modern CPUs. Passes regress. ---
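As a rough illustration of what "unrolling the main loop" means here, the C model below mirrors the pointer bookkeeping of the patched loop: the first check leaves the pointer alone, the second advances it by 32, and a hit on the second check steps back by 16 so that both exits can share one tail. This is only a sketch, not glibc code. The names `chunk_has_nul`, `first_nul`, and `strlen_unrolled_model` are hypothetical, and `memchr` stands in for the SIMD compare.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if the 16-byte chunk at p contains a NUL.
   memchr stops at the first match, so it never reads past the terminator.  */
static int chunk_has_nul (const unsigned char *p)
{
  return memchr (p, 0, 16) != NULL;
}

/* Hypothetical helper: offset of the first NUL in the chunk at p.
   The caller guarantees that the chunk contains one.  */
static size_t first_nul (const unsigned char *p)
{
  return (size_t) ((const unsigned char *) memchr (p, 0, 16) - p);
}

/* Model of the loop unrolled by two.  Invariant: the chunk at src has
   already been checked and is NUL-free.  */
size_t strlen_unrolled_model (const char *s)
{
  const unsigned char *srcin = (const unsigned char *) s;
  const unsigned char *src = srcin;

  /* The first chunk is handled before the loop.  */
  if (chunk_has_nul (src))
    return first_nul (src);

  for (;;)
    {
      /* First half: look at src + 16 without updating src.  */
      if (chunk_has_nul (src + 16))
        break;
      /* Second half: advance by two chunks, then look at the new src.  */
      src += 32;
      if (chunk_has_nul (src))
        {
          /* Step back so the NUL chunk is at src + 16, as in the
             first exit, and both exits share the same tail.  */
          src -= 16;
          break;
        }
    }

  /* Shared tail: the NUL lives in the chunk at src + 16.  */
  return (size_t) (src - srcin) + 16 + first_nul (src + 16);
}
```

The model ignores alignment; the real routine works on 16-byte aligned chunks so that a full-width load never crosses into an unmapped page.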
Comments
The 01/12/2023 15:53, Wilco Dijkstra wrote:
> Optimize strlen by unrolling the main loop. Large strings are 64% faster on
> modern CPUs. Passes regress.

please commit it, thanks.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

>
> ---
>
> diff --git a/sysdeps/aarch64/strlen.S b/sysdeps/aarch64/strlen.S
> index b3c92d9dc9b3c52e29e05ebbb89b929f177dc2cf..133ef933425fa260e61642a7840d73391168507d 100644
> --- a/sysdeps/aarch64/strlen.S
> +++ b/sysdeps/aarch64/strlen.S
> @@ -43,12 +43,9 @@
>  #define dend d2
>
>  /* Core algorithm:
> -
> -   For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
> -   per byte. We take 4 bits of every comparison byte with shift right and narrow
> -   by 4 instruction. Since the bits in the nibble mask reflect the order in
> -   which things occur in the original string, counting trailing zeros identifies
> -   exactly which byte matched. */
> +   Process the string in 16-byte aligned chunks. Compute a 64-bit mask with
> +   four bits per byte using the shrn instruction. A count trailing zeros then
> +   identifies the first zero byte. */
>
>  ENTRY (STRLEN)
>          PTR_ARG (0)
> @@ -68,18 +65,25 @@ ENTRY (STRLEN)
>
>          .p2align 5
>  L(loop):
> -        ldr data, [src, 16]!
> +        ldr data, [src, 16]
> +        cmeq vhas_nul.16b, vdata.16b, 0
> +        umaxp vend.16b, vhas_nul.16b, vhas_nul.16b
> +        fmov synd, dend
> +        cbnz synd, L(loop_end)
> +        ldr data, [src, 32]!
>          cmeq vhas_nul.16b, vdata.16b, 0
>          umaxp vend.16b, vhas_nul.16b, vhas_nul.16b
>          fmov synd, dend
>          cbz synd, L(loop)
> -
> +        sub src, src, 16
> +L(loop_end):
>          shrn vend.8b, vhas_nul.8h, 4 /* 128->64 */
>          sub result, src, srcin
>          fmov synd, dend
>  #ifndef __AARCH64EB__
>          rbit synd, synd
>  #endif
> +        add result, result, 16
>          clz tmp, synd
>          add result, result, tmp, lsr 2
>          ret
>
Hi Szabolcs,

I am attempting to reproduce the presented performance improvements on an
Ampere Altra processor.
Can you please give some more detail on what your setup was and how you
measured it?

BTW, I am not looking to discredit the work, but rather to be able
to replicate the results on our end to evaluate backporting your patches.

Best regards,
Cupertino

Szabolcs Nagy via Libc-alpha writes:

> The 01/12/2023 15:53, Wilco Dijkstra wrote:
>> Optimize strlen by unrolling the main loop. Large strings are 64% faster on
>> modern CPUs. Passes regress.
>
> please commit it, thanks.
>
> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
>
>
>>
>> ---
>>
>> diff --git a/sysdeps/aarch64/strlen.S b/sysdeps/aarch64/strlen.S
>> index b3c92d9dc9b3c52e29e05ebbb89b929f177dc2cf..133ef933425fa260e61642a7840d73391168507d 100644
>> --- a/sysdeps/aarch64/strlen.S
>> +++ b/sysdeps/aarch64/strlen.S
>> @@ -43,12 +43,9 @@
>>  #define dend d2
>>
>>  /* Core algorithm:
>> -
>> -   For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
>> -   per byte. We take 4 bits of every comparison byte with shift right and narrow
>> -   by 4 instruction. Since the bits in the nibble mask reflect the order in
>> -   which things occur in the original string, counting trailing zeros identifies
>> -   exactly which byte matched. */
>> +   Process the string in 16-byte aligned chunks. Compute a 64-bit mask with
>> +   four bits per byte using the shrn instruction. A count trailing zeros then
>> +   identifies the first zero byte. */
>>
>>  ENTRY (STRLEN)
>>          PTR_ARG (0)
>> @@ -68,18 +65,25 @@ ENTRY (STRLEN)
>>
>>          .p2align 5
>>  L(loop):
>> -        ldr data, [src, 16]!
>> +        ldr data, [src, 16]
>> +        cmeq vhas_nul.16b, vdata.16b, 0
>> +        umaxp vend.16b, vhas_nul.16b, vhas_nul.16b
>> +        fmov synd, dend
>> +        cbnz synd, L(loop_end)
>> +        ldr data, [src, 32]!
>>          cmeq vhas_nul.16b, vdata.16b, 0
>>          umaxp vend.16b, vhas_nul.16b, vhas_nul.16b
>>          fmov synd, dend
>>          cbz synd, L(loop)
>> -
>> +        sub src, src, 16
>> +L(loop_end):
>>          shrn vend.8b, vhas_nul.8h, 4 /* 128->64 */
>>          sub result, src, srcin
>>          fmov synd, dend
>>  #ifndef __AARCH64EB__
>>          rbit synd, synd
>>  #endif
>> +        add result, result, 16
>>          clz tmp, synd
>>          add result, result, tmp, lsr 2
>>          ret
>>
Hi Cupertino,

> I am attempting to reproduce the presented performance improvements on an
> Ampere Altra processor.
> Can you please give some more detail on what your setup was and how you
> measured it?

I measured it on large strings (e.g. 1024 bytes) on Neoverse V1. You can run
benchtests/bench-strlen.c and look at the results for larger strings. However,
Altra is half as wide as V1, so you may not see much (or any) benefit. Also
note that by default it uses the multiarch/strlen_asimd.S version, not strlen.S.

> BTW, I am not looking to discredit the work, but rather to be able
> to replicate the results on our end to evaluate backporting your patches.

Yes, it has been a while since we backported string improvements, so it's a good
idea. Are you talking about the official GLIBC release branches or a private branch?

Cheers,
Wilco
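For readers who want a quick sanity check before running the benchtests mentioned above, a minimal standalone timing sketch in C might look like the following. It is not the glibc benchtest and makes no claim to match its methodology: `time_strlen` is a hypothetical helper, the iteration count is arbitrary, and `clock` only gives coarse CPU time. It times whatever `strlen` the installed libc resolves to, which on AArch64 may be the multiarch variant rather than strlen.S.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: average nanoseconds per strlen call on a string of
   LEN bytes, or a negative value if allocation fails.  */
static double time_strlen (size_t len, long iters)
{
  char *buf = malloc (len + 1);
  if (buf == NULL)
    return -1.0;
  memset (buf, 'a', len);
  buf[len] = '\0';

  /* Calling through a volatile function pointer and accumulating into a
     volatile sink keeps the compiler from folding or hoisting the call.  */
  size_t (*volatile fn) (const char *) = strlen;
  volatile size_t sink = 0;

  clock_t start = clock ();
  for (long i = 0; i < iters; i++)
    sink += fn (buf);
  clock_t end = clock ();

  free (buf);
  (void) sink;
  return (double) (end - start) / CLOCKS_PER_SEC * 1e9 / (double) iters;
}
```

Calling `time_strlen (1024, 10000000)` against two libc builds gives a first-order comparison for the 1024-byte case; the numbers depend heavily on the CPU, frequency scaling, and which strlen variant is selected at run time.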
Wilco Dijkstra writes:

> Hi Cupertino,
>
>> I am attempting to reproduce the presented performance improvements on an
>> Ampere Altra processor.
>> Can you please give some more detail on what your setup was and how you
>> measured it?
>
> I measured it on large strings (e.g. 1024 bytes) on Neoverse V1. You can run
> benchtests/bench-strlen.c and look at the results for larger strings. However,
> Altra is half as wide as V1, so you may not see much (or any) benefit.

OK, will check.

> Also note that by default
> it uses the multiarch/strlen_asimd.S version, not strlen.S.

Oh, I did not know that. Thanks!

>
>> BTW, I am not looking to discredit the work, but rather to be able
>> to replicate the results on our end to evaluate backporting your patches.
>
> Yes, it has been a while since we backported string improvements, so it's a good
> idea. Are you talking about the official GLIBC release branches or a private branch?

I am referring to a private branch.

>
> Cheers,
> Wilco

Thanks,
Cupertino
diff --git a/sysdeps/aarch64/strlen.S b/sysdeps/aarch64/strlen.S
index b3c92d9dc9b3c52e29e05ebbb89b929f177dc2cf..133ef933425fa260e61642a7840d73391168507d 100644
--- a/sysdeps/aarch64/strlen.S
+++ b/sysdeps/aarch64/strlen.S
@@ -43,12 +43,9 @@
 #define dend d2
 
 /* Core algorithm:
-
-   For each 16-byte chunk we calculate a 64-bit nibble mask value with four bits
-   per byte. We take 4 bits of every comparison byte with shift right and narrow
-   by 4 instruction. Since the bits in the nibble mask reflect the order in
-   which things occur in the original string, counting trailing zeros identifies
-   exactly which byte matched. */
+   Process the string in 16-byte aligned chunks. Compute a 64-bit mask with
+   four bits per byte using the shrn instruction. A count trailing zeros then
+   identifies the first zero byte. */
 
 ENTRY (STRLEN)
         PTR_ARG (0)
@@ -68,18 +65,25 @@ ENTRY (STRLEN)
 
         .p2align 5
 L(loop):
-        ldr data, [src, 16]!
+        ldr data, [src, 16]
+        cmeq vhas_nul.16b, vdata.16b, 0
+        umaxp vend.16b, vhas_nul.16b, vhas_nul.16b
+        fmov synd, dend
+        cbnz synd, L(loop_end)
+        ldr data, [src, 32]!
         cmeq vhas_nul.16b, vdata.16b, 0
         umaxp vend.16b, vhas_nul.16b, vhas_nul.16b
         fmov synd, dend
         cbz synd, L(loop)
-
+        sub src, src, 16
+L(loop_end):
         shrn vend.8b, vhas_nul.8h, 4 /* 128->64 */
         sub result, src, srcin
         fmov synd, dend
 #ifndef __AARCH64EB__
         rbit synd, synd
 #endif
+        add result, result, 16
         clz tmp, synd
         add result, result, tmp, lsr 2
         ret
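The "Core algorithm" comment in the patch can be modelled in portable C as a sketch. The code below builds the 64-bit mask with four bits per byte that the cmeq/shrn pair is described as producing, then divides the trailing-zero count by four, as the `lsr 2` in the patch does. The names `nibble_mask` and `first_nul_index` are hypothetical, the byte loop stands in for the SIMD instructions, and little-endian byte order is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the syndrome: nibble i of the result is 0xF when
   chunk[i] is NUL and 0 otherwise (little-endian ordering assumed).  */
static uint64_t nibble_mask (const unsigned char chunk[16])
{
  uint64_t mask = 0;
  for (int i = 0; i < 16; i++)
    if (chunk[i] == 0)
      mask |= (uint64_t) 0xF << (4 * i);
  return mask;
}

/* Index of the first NUL byte: count trailing zero bits, then divide by
   four because each byte owns four mask bits.  MASK must be nonzero.  */
static unsigned first_nul_index (uint64_t mask)
{
  unsigned tz = 0;
  while ((mask & 1) == 0)
    {
      mask >>= 1;
      tz++;
    }
  return tz >> 2;
}
```

A portable bit loop is used for the trailing-zero count to keep the sketch compiler-independent; the assembly gets the same count with `rbit` followed by `clz`.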