From patchwork Thu Jul 16 13:00:33 2020
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 40117
From: Wilco Dijkstra
To: 'GNU C Library'
Subject: [PATCH] AArch64: Improve strlen_asimd performance (bug 25824)
Date: Thu, 16 Jul 2020 13:00:33 +0000

Optimize strlen using a mix of scalar and SIMD code. On modern
microarchitectures large strings are 2.6 times faster than the existing
strlen_asimd and 35% faster than the new MTE version of strlen.

On a random strlen benchmark using small sizes the speedup is 7% vs
strlen_asimd and 40% vs the MTE strlen. This fixes the main strlen
regressions on Cortex-A53 and other cores with a simple Neon unit (see
https://sourceware.org/pipermail/libc-alpha/2020-June/114641.html).

Rename __strlen_generic to __strlen_mte, and select strlen_asimd when
MTE is not enabled (this is waiting on support for a HWCAP_MTE bit,
which can hopefully be added soon).

This fixes big-endian bug 25824. Passes GLIBC regression tests.

I'd like this for 2.32 to fix the bug and avoid any regressions.

Reviewed-by: Szabolcs Nagy

diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index a65c554bf3a60ccbed6b519bbbc46aabdf5b6025..4377df0735287c210efd661188f9e6e3923c8003 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -4,5 +4,5 @@ sysdep_routines += memcpy_generic memcpy_thunderx memcpy_thunderx2 \
 		   memcpy_new \
 		   memset_generic memset_falkor memset_emag memset_kunpeng \
 		   memchr_generic memchr_nosimd \
-		   strlen_generic strlen_asimd
+		   strlen_mte strlen_asimd
 endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index c1df2a012ae17f45f26bd45fc98fe45b2f4d9eb1..1e22fdf8726bf4cd92aed09401b2772f514bf3dc 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -62,7 +62,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   IFUNC_IMPL (i, name, strlen,
 	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_asimd)
-	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_generic))
+	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_mte))
 
   return i;
 }
diff --git a/sysdeps/aarch64/multiarch/strlen.c b/sysdeps/aarch64/multiarch/strlen.c
index 99f2cf2cde54fd1158383d097ba51edc1377f55b..7c0352dd878086708ac785807bc4d210b85e528f 100644
--- a/sysdeps/aarch64/multiarch/strlen.c
+++ b/sysdeps/aarch64/multiarch/strlen.c
@@ -26,17 +26,15 @@
 # include
 # include
 
-#define USE_ASIMD_STRLEN() IS_FALKOR (midr)
+/* This should check HWCAP_MTE when it is available. */
+#define MTE_ENABLED() (false)
 
 extern __typeof (__redirect_strlen) __strlen;
 
-extern __typeof (__redirect_strlen) __strlen_generic attribute_hidden;
+extern __typeof (__redirect_strlen) __strlen_mte attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_asimd attribute_hidden;
 
-libc_ifunc (__strlen,
-	    (USE_ASIMD_STRLEN () || IS_KUNPENG920 (midr)
-	    ? __strlen_asimd
-	    :__strlen_generic));
+libc_ifunc (__strlen, (MTE_ENABLED () ? __strlen_mte : __strlen_asimd));
 
 # undef strlen
 strong_alias (__strlen, strlen);
diff --git a/sysdeps/aarch64/multiarch/strlen_asimd.S b/sysdeps/aarch64/multiarch/strlen_asimd.S
index 236a2c96a6eb5f02b0f0847d230857f0aee87fbe..076a905dceae501d85c1ab59a2250d8305c718f2 100644
--- a/sysdeps/aarch64/multiarch/strlen_asimd.S
+++ b/sysdeps/aarch64/multiarch/strlen_asimd.S
@@ -1,5 +1,4 @@
-/* Strlen implementation that uses ASIMD instructions for load and NULL checks.
-   Copyright (C) 2018-2020 Free Software Foundation, Inc.
+/* Copyright (C) 2020 Free Software Foundation, Inc.
 
    This file is part of the GNU C Library.
 
@@ -20,80 +19,90 @@
 #include
 
 /* Assumptions:
+ *
+ * ARMv8-a, AArch64, Advanced SIMD, unaligned accesses.
+ * Not MTE compatible.
+ */
+
+#define srcin		x0
+#define len		x0
+
+#define src		x1
+#define data1		x2
+#define data2		x3
+#define has_nul1	x4
+#define has_nul2	x5
+#define tmp1		x4
+#define tmp2		x5
+#define tmp3		x6
+#define tmp4		x7
+#define zeroones	x8
+
+#define maskv		v0
+#define maskd		d0
+#define dataq1		q1
+#define dataq2		q2
+#define datav1		v1
+#define datav2		v2
+#define tmp		x2
+#define tmpw		w2
+#define synd		x3
+#define shift		x4
+
+/* For the first 32 bytes, NUL detection works on the principle that
+   (X - 1) & (~X) & 0x80 (=> (X - 1) & ~(X | 0x7f)) is non-zero if a
+   byte is zero, and can be done in parallel across the entire word. */
 
-   ARMv8-a, AArch64, ASIMD, unaligned accesses, min page size 4k. */
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
 
 /* To test the page crossing code path more thoroughly, compile with
    -DTEST_PAGE_CROSS - this will force all calls through the slower
    entry path. This option is not intended for production use. */
 
-/* Arguments and results. */
-#define srcin		x0
-#define len		x0
-
-/* Locals and temporaries. */
-#define src		x1
-#define data1		x2
-#define data2		x3
-#define has_nul1	x4
-#define has_nul2	x5
-#define tmp1		x4
-#define tmp2		x5
-#define tmp3		x6
-#define tmp4		x7
-#define zeroones	x8
-#define dataq		q2
-#define datav		v2
-#define datab2		b3
-#define dataq2		q3
-#define datav2		v3
-
-#define REP8_01 0x0101010101010101
-#define REP8_7f 0x7f7f7f7f7f7f7f7f
-
 #ifdef TEST_PAGE_CROSS
-# define MIN_PAGE_SIZE 16
+# define MIN_PAGE_SIZE 32
 #else
 # define MIN_PAGE_SIZE 4096
 #endif
 
-	/* Since strings are short on average, we check the first 16 bytes
-	   of the string for a NUL character. In order to do an unaligned load
-	   safely we have to do a page cross check first. If there is a NUL
-	   byte we calculate the length from the 2 8-byte words using
-	   conditional select to reduce branch mispredictions (it is unlikely
-	   strlen_asimd will be repeatedly called on strings with the same
-	   length).
-
-	   If the string is longer than 16 bytes, we align src so don't need
-	   further page cross checks, and process 16 bytes per iteration.
-
-	   If the page cross check fails, we read 16 bytes from an aligned
-	   address, remove any characters before the string, and continue
-	   in the main loop using aligned loads. Since strings crossing a
-	   page in the first 16 bytes are rare (probability of
-	   16/MIN_PAGE_SIZE ~= 0.4%), this case does not need to be optimized.
-
-	   AArch64 systems have a minimum page size of 4k. We don't bother
-	   checking for larger page sizes - the cost of setting up the correct
-	   page size is just not worth the extra gain from a small reduction in
-	   the cases taking the slow path. Note that we only care about
-	   whether the first fetch, which may be misaligned, crosses a page
-	   boundary. */
-
-ENTRY_ALIGN (__strlen_asimd, 6)
-	DELOUSE (0)
-	DELOUSE (1)
+/* Core algorithm:
+
+   Since strings are short on average, we check the first 32 bytes of the
+   string for a NUL character without aligning the string. In order to use
+   unaligned loads safely we must do a page cross check first.
+
+   If there is a NUL byte we calculate the length from the 2 8-byte words
+   using conditional select to reduce branch mispredictions (it is unlikely
+   strlen will be repeatedly called on strings with the same length).
+
+   If the string is longer than 32 bytes, align src so we don't need further
+   page cross checks, and process 32 bytes per iteration using a fast SIMD
+   loop.
+
+   If the page cross check fails, we read 32 bytes from an aligned address,
+   and ignore any characters before the string. If it contains a NUL
+   character, return the length, if not, continue in the main loop. */
+
+ENTRY (__strlen_asimd)
+	DELOUSE (0)
+
 	and	tmp1, srcin, MIN_PAGE_SIZE - 1
-	mov	zeroones, REP8_01
-	cmp	tmp1, MIN_PAGE_SIZE - 16
-	b.gt	L(page_cross)
+	cmp	tmp1, MIN_PAGE_SIZE - 32
+	b.hi	L(page_cross)
+
+	/* Look for a NUL byte in the first 16 bytes. */
 	ldp	data1, data2, [srcin]
+	mov	zeroones, REP8_01
+
 #ifdef __AARCH64EB__
+	/* For big-endian, carry propagation (if the final byte in the
+	   string is 0x01) means we cannot use has_nul1/2 directly.
+	   Since we expect strings to be small and early-exit,
+	   byte-swap the data now so has_null1/2 will be correct. */
 	rev	data1, data1
 	rev	data2, data2
 #endif
-
 	sub	tmp1, data1, zeroones
 	orr	tmp2, data1, REP8_7f
 	sub	tmp3, data2, zeroones
@@ -101,78 +110,105 @@ ENTRY_ALIGN (__strlen_asimd, 6)
 	bics	has_nul1, tmp1, tmp2
 	bic	has_nul2, tmp3, tmp4
 	ccmp	has_nul2, 0, 0, eq
-	beq	L(main_loop_entry)
+	b.eq	L(bytes16_31)
+
+	/* Find the exact offset of the first NUL byte in the first 16 bytes
+	   from the string start. Enter with C = has_nul1 == 0. */
 	csel	has_nul1, has_nul1, has_nul2, cc
 	mov	len, 8
 	rev	has_nul1, has_nul1
-	clz	tmp1, has_nul1
 	csel	len, xzr, len, cc
+	clz	tmp1, has_nul1
 	add	len, len, tmp1, lsr 3
 	ret
 
-L(main_loop_entry):
-	bic	src, srcin, 15
-	sub	src, src, 16
-
-L(main_loop):
-	ldr	dataq, [src, 32]!
-L(page_cross_entry):
-	/* Get the minimum value and keep going if it is not zero. */
-	uminv	datab2, datav.16b
-	mov	tmp1, datav2.d[0]
-	cbz	tmp1, L(tail)
-	ldr	dataq, [src, 16]
-	uminv	datab2, datav.16b
-	mov	tmp1, datav2.d[0]
-	cbnz	tmp1, L(main_loop)
-	add	src, src, 16
-
-L(tail):
+	.p2align 3
+	/* Look for a NUL byte at offset 16..31 in the string. */
+L(bytes16_31):
+	ldp	data1, data2, [srcin, 16]
 #ifdef __AARCH64EB__
-	rev64	datav.16b, datav.16b
-#endif
-	/* Set te NULL byte as 0xff and the rest as 0x00, move the data into a
-	   pair of scalars and then compute the length from the earliest NULL
-	   byte. */
-	cmeq	datav.16b, datav.16b, #0
-	mov	data1, datav.d[0]
-	mov	data2, datav.d[1]
-	cmp	data1, 0
-	csel	data1, data1, data2, ne
-	sub	len, src, srcin
 	rev	data1, data1
-	add	tmp2, len, 8
-	clz	tmp1, data1
-	csel	len, len, tmp2, ne
+	rev	data2, data2
+#endif
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, REP8_7f
+	sub	tmp3, data2, zeroones
+	orr	tmp4, data2, REP8_7f
+	bics	has_nul1, tmp1, tmp2
+	bic	has_nul2, tmp3, tmp4
+	ccmp	has_nul2, 0, 0, eq
+	b.eq	L(loop_entry)
+
+	/* Find the exact offset of the first NUL byte at offset 16..31 from
+	   the string start. Enter with C = has_nul1 == 0. */
+	csel	has_nul1, has_nul1, has_nul2, cc
+	mov	len, 24
+	rev	has_nul1, has_nul1
+	mov	tmp3, 16
+	clz	tmp1, has_nul1
+	csel	len, tmp3, len, cc
 	add	len, len, tmp1, lsr 3
 	ret
 
-	/* Load 16 bytes from [srcin & ~15] and force the bytes that precede
-	   srcin to 0xff, so we ignore any NUL bytes before the string.
-	   Then continue in the aligned loop. */
-L(page_cross):
-	mov	tmp3, 63
-	bic	src, srcin, 15
-	and	tmp1, srcin, 7
-	ands	tmp2, srcin, 8
-	ldr	dataq, [src]
-	lsl	tmp1, tmp1, 3
-	csel	tmp2, tmp2, tmp1, eq
-	csel	tmp1, tmp1, tmp3, eq
-	mov	tmp4, -1
+L(loop_entry):
+	bic	src, srcin, 31
+
+	.p2align 5
+L(loop):
+	ldp	dataq1, dataq2, [src, 32]!
+	uminp	maskv.16b, datav1.16b, datav2.16b
+	uminp	maskv.16b, maskv.16b, maskv.16b
+	cmeq	maskv.8b, maskv.8b, 0
+	fmov	synd, maskd
+	cbz	synd, L(loop)
+
+	/* Low 32 bits of synd are non-zero if a NUL was found in datav1. */
+	cmeq	maskv.16b, datav1.16b, 0
+	sub	len, src, srcin
+	tst	synd, 0xffffffff
+	b.ne	1f
+	cmeq	maskv.16b, datav2.16b, 0
+	add	len, len, 16
+1:
+	/* Generate a bitmask and compute correct byte offset. */
 #ifdef __AARCH64EB__
-	/* Big-endian. Early bytes are at MSB. */
-	lsr	tmp1, tmp4, tmp1
-	lsr	tmp2, tmp4, tmp2
+	bic	maskv.8h, 0xf0
 #else
-	/* Little-endian. Early bytes are at LSB. */
-	lsl	tmp1, tmp4, tmp1
-	lsl	tmp2, tmp4, tmp2
+	bic	maskv.8h, 0x0f, lsl 8
+#endif
+	umaxp	maskv.16b, maskv.16b, maskv.16b
+	fmov	synd, maskd
+#ifndef __AARCH64EB__
+	rbit	synd, synd
 #endif
-	mov	datav2.d[0], tmp1
-	mov	datav2.d[1], tmp2
-	orn	datav.16b, datav.16b, datav2.16b
-	b	L(page_cross_entry)
+	clz	tmp, synd
+	add	len, len, tmp, lsr 2
+	ret
+
+	.p2align 4
+
+L(page_cross):
+	bic	src, srcin, 31
+	mov	tmpw, 0x0c03
+	movk	tmpw, 0xc030, lsl 16
+	ld1	{datav1.16b, datav2.16b}, [src]
+	dup	maskv.4s, tmpw
+	cmeq	datav1.16b, datav1.16b, 0
+	cmeq	datav2.16b, datav2.16b, 0
+	and	datav1.16b, datav1.16b, maskv.16b
+	and	datav2.16b, datav2.16b, maskv.16b
+	addp	maskv.16b, datav1.16b, datav2.16b
+	addp	maskv.16b, maskv.16b, maskv.16b
+	fmov	synd, maskd
+	lsl	shift, srcin, 1
+	lsr	synd, synd, shift
+	cbz	synd, L(loop)
+
+	rbit	synd, synd
+	clz	len, synd
+	lsr	len, len, 1
+	ret
+
 END (__strlen_asimd)
 weak_alias (__strlen_asimd, strlen_asimd)
 libc_hidden_builtin_def (strlen_asimd)
diff --git a/sysdeps/aarch64/multiarch/strlen_generic.S b/sysdeps/aarch64/multiarch/strlen_mte.S
similarity index 88%
rename from sysdeps/aarch64/multiarch/strlen_generic.S
rename to sysdeps/aarch64/multiarch/strlen_mte.S
index 61d3f72c9985bdd103d5e4c68337fed4a55511be..b8daa54dd89afbd99a6338cef45f49a25defaa26 100644
--- a/sysdeps/aarch64/multiarch/strlen_generic.S
+++ b/sysdeps/aarch64/multiarch/strlen_mte.S
@@ -17,14 +17,14 @@
    . */
 
 /* The actual strlen code is in ../strlen.S. If we are building libc this file
-   defines __strlen_generic. Otherwise the include of ../strlen.S will define
+   defines __strlen_mte. Otherwise the include of ../strlen.S will define
    the normal __strlen entry points. */
 
 #include
 
 #if IS_IN (libc)
 
-# define STRLEN __strlen_generic
+# define STRLEN __strlen_mte
 
 /* Do not hide the generic version of strlen, we use it internally. */
 # undef libc_hidden_builtin_def
@@ -32,7 +32,7 @@
 # ifdef SHARED
 /* It doesn't make sense to send libc-internal strlen calls through a PLT. */
-	.globl __GI_strlen; __GI_strlen = __strlen_generic
+	.globl __GI_strlen; __GI_strlen = __strlen_mte
 # endif
 
 #endif
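
Illustrative note, not part of the patch: the scalar NUL test and the
page-cross check used for the first 32 bytes can be sketched in C roughly
as below. This is a minimal sketch under stated assumptions, not the
glibc code. The helper names load_le64, has_nul, first_nul_index and
crosses_page are hypothetical; only REP8_01, REP8_7f and MIN_PAGE_SIZE
come from the patch. The word is assembled in little-endian order, which
corresponds to what the assembly sees after the big-endian rev.

```c
#include <assert.h>
#include <stdint.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7f 0x7f7f7f7f7f7f7f7fULL
#define MIN_PAGE_SIZE 4096u

/* Hypothetical helper: assemble 8 bytes into a word with the first
   byte in the least significant position (little-endian order).  */
static uint64_t
load_le64 (const unsigned char *p)
{
  uint64_t x = 0;
  for (int i = 0; i < 8; i++)
    x |= (uint64_t) p[i] << (8 * i);
  return x;
}

/* (X - 1) & ~(X | 0x7f), applied to all 8 bytes at once.  The result
   is non-zero if and only if some byte of X is zero.  Bytes above the
   first zero byte may be flagged spuriously because of borrow
   propagation, but the lowest flagged byte is always a real NUL.  */
static uint64_t
has_nul (uint64_t x)
{
  return (x - REP8_01) & ~(x | REP8_7f);
}

/* Index (0..7) of the first NUL byte.  Requires has_nul (x) != 0.
   The assembly gets the same result with rev + clz; a plain loop is
   used here to stay portable.  */
static unsigned
first_nul_index (uint64_t x)
{
  uint64_t m = has_nul (x);
  unsigned i = 0;
  while ((m & 0x80) == 0)
    {
      m >>= 8;
      i++;
    }
  return i;
}

/* Mirrors "and tmp1, srcin, MIN_PAGE_SIZE - 1; cmp tmp1,
   MIN_PAGE_SIZE - 32; b.hi": true if an unaligned 32-byte read
   starting at addr would run past the end of the page.  */
static int
crosses_page (uintptr_t addr)
{
  return (addr & (MIN_PAGE_SIZE - 1)) > MIN_PAGE_SIZE - 32;
}
```

For example, for the bytes "abc\0" followed by arbitrary data,
first_nul_index returns 3. An address whose page offset is exactly
MIN_PAGE_SIZE - 32 does not take the page-cross path, because the
32-byte read ends on the last byte of the page.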