From patchwork Tue Jul 14 16:33:37 2020
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 40096
From: Wilco Dijkstra
To: 'GNU C Library'
Subject: [PATCH] AArch64: Add optimized Q-register memcpy
Date: Tue, 14 Jul 2020 16:33:37 +0000

Add a new memcpy using 128-bit Q registers - this is faster on modern
cores and reduces codesize.  Similar to the generic memcpy, small cases
include copies up to 32 bytes.  64-128 byte copies are split into two
cases to improve performance of 64-96 byte copies.  Large copies align
the source rather than the destination.

bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
so make this memcpy the default on N1 (on Centriq it is 15% faster than
memcpy_falkor).

Passes GLIBC regression tests.  OK for commit?

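The small-copy strategy described above (one access anchored at the start
of the buffer and one at the end, so that the two overlap whenever the
length is not an exact multiple of the access size) can be sketched in
portable C.  This is only an illustration under stated assumptions:
copy_small is a hypothetical name, the fixed-size memcpy calls stand in
for the unaligned ldr/str pairs, and a compiler is assumed to lower them
to single loads and stores.  It is not the code the patch adds.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: copy 0..32 bytes with one
   access anchored at the start and one at the end of the buffer.  All
   loads are done before any store, which is the property that lets a
   memmove share the same path for small sizes.  */
static void
copy_small (unsigned char *dst, const unsigned char *src, size_t count)
{
  if (count >= 16)
    {
      /* 16..32 bytes: two 16-byte blocks that may overlap.  */
      unsigned char a[16], b[16];
      memcpy (a, src, 16);
      memcpy (b, src + count - 16, 16);
      memcpy (dst, a, 16);
      memcpy (dst + count - 16, b, 16);
    }
  else if (count & 8)
    {
      /* 8..15 bytes: two 8-byte words that may overlap.  */
      uint64_t a, b;
      memcpy (&a, src, 8);
      memcpy (&b, src + count - 8, 8);
      memcpy (dst, &a, 8);
      memcpy (dst + count - 8, &b, 8);
    }
  else if (count & 4)
    {
      /* 4..7 bytes: two 4-byte words that may overlap.  */
      uint32_t a, b;
      memcpy (&a, src, 4);
      memcpy (&b, src + count - 4, 4);
      memcpy (dst, &a, 4);
      memcpy (dst + count - 4, &b, 4);
    }
  else if (count != 0)
    {
      /* 1..3 bytes: first, middle and last byte.  For counts of 1 and 2
         some of these refer to the same byte, so no further branch on
         the exact length is needed.  */
      size_t mid = count >> 1;
      unsigned char a = src[0], b = src[mid], c = src[count - 1];
      dst[0] = a;
      dst[mid] = b;
      dst[count - 1] = c;
    }
}
```

Testing bit 3 and bit 2 of the count, rather than comparing against 8
and 4, works because the count is already known to be below 16 at that
point.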
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 4377df0735287c210efd661188f9e6e3923c8003..e93c21e764a8d02b9f07f5030c31836a3f03f3e1 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,5 +1,5 @@
 ifeq ($(subdir),string)
-sysdep_routines += memcpy_generic memcpy_thunderx memcpy_thunderx2 \
+sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
 		   memcpy_falkor \
 		   memcpy_new \
 		   memset_generic memset_falkor memset_emag memset_kunpeng \
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 0ccaf53e555e410569eb2be76ec7d5b4d7bc64a5..09feea97ea37ab923cf4a8557197d46adcd49204 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -42,11 +42,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx2)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
+	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
   IFUNC_IMPL (i, name, memmove,
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx2)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
+	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
   IFUNC_IMPL (i, name, memset,
 	      /* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 2fafefd5d23fc1528031b5fe52098218ed603b89..e6f3ae116701097d71a02e2a1f6bfdadc1eec34a 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -29,6 +29,7 @@
 extern __typeof (__redirect_memcpy) __libc_memcpy;
 
 extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
+extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
@@ -36,11 +37,11 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
 libc_ifunc (__libc_memcpy,
             (IS_THUNDERX (midr)
 	     ? __memcpy_thunderx
-	     : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_ARES (midr) || IS_KUNPENG920 (midr)
+	     : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_KUNPENG920 (midr)
 		? __memcpy_falkor
 		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
 		  ? __memcpy_thunderx2
-		  : __memcpy_generic))));
+		  : (IS_ARES (midr) ? __memcpy_simd : __memcpy_generic)))));
 
 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
diff --git a/sysdeps/aarch64/multiarch/memcpy_advsimd.S b/sysdeps/aarch64/multiarch/memcpy_advsimd.S
new file mode 100644
index 0000000000000000000000000000000000000000..d4ba74777744c8bb5a83e43ab2d63ad8dab35203
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_advsimd.S
@@ -0,0 +1,247 @@
+/* Generic optimized memcpy using SIMD.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, Advanced SIMD, unaligned accesses.
+ *
+ */
+
+#define dstin	x0
+#define src	x1
+#define count	x2
+#define dst	x3
+#define srcend	x4
+#define dstend	x5
+#define A_l	x6
+#define A_lw	w6
+#define A_h	x7
+#define B_l	x8
+#define B_lw	w8
+#define B_h	x9
+#define C_lw	w10
+#define tmp1	x14
+
+#define A_q	q0
+#define B_q	q1
+#define C_q	q2
+#define D_q	q3
+#define E_q	q4
+#define F_q	q5
+#define G_q	q6
+#define H_q	q7
+
+
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.
+
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.
+
+   Large copies use a software pipelined loop processing 64 bytes per
+   iteration.  The source pointer is 16-byte aligned to minimize
+   unaligned accesses.  The loop tail is handled by always copying 64 bytes
+   from the end.  */
+
+ENTRY (__memcpy_simd)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)
+
+	add	srcend, src, count
+	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(copy_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
+
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldr	A_q, [src]
+	ldr	B_q, [srcend, -16]
+	str	A_q, [dstin]
+	str	B_q, [dstend, -16]
+	ret
+
+	/* Copy 8-15 bytes.  */
+L(copy16):
+	tbz	count, 3, L(copy8)
+	ldr	A_l, [src]
+	ldr	A_h, [srcend, -8]
+	str	A_l, [dstin]
+	str	A_h, [dstend, -8]
+	ret
+
+	/* Copy 4-7 bytes.  */
+L(copy8):
+	tbz	count, 2, L(copy4)
+	ldr	A_lw, [src]
+	ldr	B_lw, [srcend, -4]
+	str	A_lw, [dstin]
+	str	B_lw, [dstend, -4]
+	ret
+
+	/* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+	cbz	count, L(copy0)
+	lsr	tmp1, count, 1
+	ldrb	A_lw, [src]
+	ldrb	C_lw, [srcend, -1]
+	ldrb	B_lw, [src, tmp1]
+	strb	A_lw, [dstin]
+	strb	B_lw, [dstin, tmp1]
+	strb	C_lw, [dstend, -1]
+L(copy0):
+	ret
+
+	.p2align 4
+	/* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+	ldp	A_q, B_q, [src]
+	ldp	C_q, D_q, [srcend, -32]
+	cmp	count, 64
+	b.hi	L(copy128)
+	stp	A_q, B_q, [dstin]
+	stp	C_q, D_q, [dstend, -32]
+	ret
+
+	.p2align 4
+	/* Copy 65..128 bytes.  */
+L(copy128):
+	ldp	E_q, F_q, [src, 32]
+	cmp	count, 96
+	b.ls	L(copy96)
+	ldp	G_q, H_q, [srcend, -64]
+	stp	G_q, H_q, [dstend, -64]
+L(copy96):
+	stp	A_q, B_q, [dstin]
+	stp	E_q, F_q, [dstin, 32]
+	stp	C_q, D_q, [dstend, -32]
+	ret
+
+	/* Align loop64 below to 16 bytes.  */
+	nop
+
+	/* Copy more than 128 bytes.  */
+L(copy_long):
+	/* Copy 16 bytes and then align src to 16-byte alignment.  */
+	ldr	D_q, [src]
+	and	tmp1, src, 15
+	bic	src, src, 15
+	sub	dst, dstin, tmp1
+	add	count, count, tmp1	/* Count is now 16 too large.  */
+	ldp	A_q, B_q, [src, 16]
+	str	D_q, [dstin]
+	ldp	C_q, D_q, [src, 48]
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+	b.ls	L(copy64_from_end)
+L(loop64):
+	stp	A_q, B_q, [dst, 16]
+	ldp	A_q, B_q, [src, 80]
+	stp	C_q, D_q, [dst, 48]
+	ldp	C_q, D_q, [src, 112]
+	add	src, src, 64
+	add	dst, dst, 64
+	subs	count, count, 64
+	b.hi	L(loop64)
+
+	/* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
+	ldp	E_q, F_q, [srcend, -64]
+	stp	A_q, B_q, [dst, 16]
+	ldp	A_q, B_q, [srcend, -32]
+	stp	C_q, D_q, [dst, 48]
+	stp	E_q, F_q, [dstend, -64]
+	stp	A_q, B_q, [dstend, -32]
+	ret
+
+END (__memcpy_simd)
+libc_hidden_builtin_def (__memcpy_simd)
+
+
+ENTRY (__memmove_simd)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)
+
+	add	srcend, src, count
+	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(move_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
+
+	/* Small moves: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldr	A_q, [src]
+	ldr	B_q, [srcend, -16]
+	str	A_q, [dstin]
+	str	B_q, [dstend, -16]
+	ret
+
+L(move_long):
+	/* Only use backward copy if there is an overlap.  */
+	sub	tmp1, dstin, src
+	cbz	tmp1, L(move0)
+	cmp	tmp1, count
+	b.hs	L(copy_long)
+
+	/* Large backwards copy for overlapping copies.
+	   Copy 16 bytes and then align srcend to 16-byte alignment.  */
+L(copy_long_backwards):
+	ldr	D_q, [srcend, -16]
+	and	tmp1, srcend, 15
+	bic	srcend, srcend, 15
+	sub	count, count, tmp1
+	ldp	A_q, B_q, [srcend, -32]
+	str	D_q, [dstend, -16]
+	ldp	C_q, D_q, [srcend, -64]
+	sub	dstend, dstend, tmp1
+	subs	count, count, 128
+	b.ls	L(copy64_from_start)
+
+L(loop64_backwards):
+	stp	A_q, B_q, [dstend, -32]
+	ldp	A_q, B_q, [srcend, -96]
+	stp	C_q, D_q, [dstend, -64]
+	ldp	C_q, D_q, [srcend, -128]
+	sub	srcend, srcend, 64
+	sub	dstend, dstend, 64
+	subs	count, count, 64
+	b.hi	L(loop64_backwards)
+
+	/* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
+	ldp	E_q, F_q, [src, 32]
+	stp	A_q, B_q, [dstend, -32]
+	ldp	A_q, B_q, [src]
+	stp	C_q, D_q, [dstend, -64]
+	stp	E_q, F_q, [dstin, 32]
+	stp	A_q, B_q, [dstin]
+L(move0):
+	ret
+
+END (__memmove_simd)
+libc_hidden_builtin_def (__memmove_simd)
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index ed5a47f6f83e7b0afcec60cb9fa0f09999eaacae..1229f8b89296eddd2e711490bb7fc0b35726b6f5 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -29,6 +29,7 @@
 extern __typeof (__redirect_memmove) __libc_memmove;
 
 extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
+extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
@@ -40,7 +41,7 @@ libc_ifunc (__libc_memmove,
 		? __memmove_falkor
 		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
 		  ? __memmove_thunderx2
-		  : __memmove_generic))));
+		  : (IS_ARES (midr) ? __memmove_simd : __memmove_generic)))));
 
 # undef memmove
 strong_alias (__libc_memmove, memmove);
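
The overlap test at L(move_long) relies on a single unsigned comparison:
dstin - src is computed modulo 2^64, so a destination below the source
wraps to a very large value and takes the forward path, and only a
destination that starts inside the source range goes backwards.  A rough
C sketch of that decision follows.  The names choose_direction and
move_bytes are hypothetical, and the byte loops stand in for the 64-byte
pipelined loops; it is an illustration of the check, not of the patch's
copy loops.

```c
#include <stddef.h>
#include <stdint.h>

enum move_dir { MOVE_NONE, MOVE_FORWARD, MOVE_BACKWARD };

/* Hypothetical helper mirroring the sub/cbz/cmp/b.hs sequence: one
   unsigned difference decides the copy direction.  */
static enum move_dir
choose_direction (const void *dst, const void *src, size_t count)
{
  uintptr_t diff = (uintptr_t) dst - (uintptr_t) src;

  if (diff == 0)
    return MOVE_NONE;		/* Same address: nothing to do.  */
  if (diff >= count)
    return MOVE_FORWARD;	/* dst below src, or no overlap at all.  */
  return MOVE_BACKWARD;		/* dst starts inside [src, src + count).  */
}

/* Byte-wise move that follows the chosen direction.  */
static void *
move_bytes (void *dst, const void *src, size_t count)
{
  unsigned char *d = dst;
  const unsigned char *s = src;

  switch (choose_direction (dst, src, count))
    {
    case MOVE_FORWARD:
      for (size_t i = 0; i < count; i++)
	d[i] = s[i];
      break;
    case MOVE_BACKWARD:
      for (size_t i = count; i > 0; i--)
	d[i - 1] = s[i - 1];
      break;
    case MOVE_NONE:
      break;
    }
  return dst;
}
```

In the assembly this check is reached only for counts above 128; the
smaller cases load every byte into registers before storing anything, so
they need no direction test.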