From patchwork Thu Oct 14 15:53:54 2021
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 46221
From: Wilco Dijkstra
Reply-To: Wilco Dijkstra
To: "naohirot@fujitsu.com"
Cc: 'GNU C Library'
Subject: [PATCH v2] AArch64: Improve A64FX memcpy
Date: Thu, 14 Oct 2021 15:53:54 +0000
List-Id: Libc-alpha mailing list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Hi Naohiro,

This is v2 of the A64FX memcpy - in the end I decided on a complete rewrite.
Performance is improved by streamlining the code, aligning to vector size in
large copies and using a single unrolled loop for all sizes. The codesize for
memcpy and memmove goes down from 1796 bytes to 868 bytes (only 70% larger
than memcpy_advsimd.S). Performance is better in all cases: bench-memcpy-random
is 2.3% faster overall, bench-memcpy-large is 33% faster for large sizes,
bench-memcpy-walk is 25% faster for small sizes and 20% for the largest sizes.
The geomean of all tests in bench-memcpy is 5.1% better, and total time is
reduced by 4%.

Passes GLIBC regress, OK for commit?

Cheers,
Wilco

diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 65528405bb12373731e895c7030ccef23b88c17f..99abca0a891aa3e79494c4f4a268992472b91c75 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -25,20 +25,15 @@
  *
  */
 
-#define L2_SIZE		(8*1024*1024)/2	// L2 8MB/2
-#define CACHE_LINE_SIZE	256
-#define ZF_DIST		(CACHE_LINE_SIZE * 21)	// Zerofill distance
-#define dest		x0
-#define src		x1
-#define n		x2	// size
-#define tmp1		x3
-#define tmp2		x4
-#define tmp3		x5
-#define rest		x6
-#define dest_ptr	x7
-#define src_ptr		x8
-#define vector_length	x9
-#define cl_remainder	x10	// CACHE_LINE_SIZE remainder
+#define dstin	x0
+#define src	x1
+#define n	x2
+#define dst	x3
+#define dstend	x4
+#define srcend	x5
+#define tmp	x6
+#define vlen	x7
+#define vlen8	x8
 
 #if HAVE_AARCH64_SVE_ASM
 # if IS_IN (libc)
@@ -47,45 +42,37 @@
 
 	.arch armv8.2-a+sve
 
-	.macro dc_zva times
-	dc	zva, tmp1
-	add	tmp1, tmp1, CACHE_LINE_SIZE
-	.if \times-1
-	dc_zva "(\times-1)"
-	.endif
-	.endm
-
 	.macro ld1b_unroll8
-	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
-	ld1b	z1.b, p0/z, [src_ptr, #1, mul vl]
-	ld1b	z2.b, p0/z, [src_ptr, #2, mul vl]
-	ld1b	z3.b, p0/z, [src_ptr, #3, mul vl]
-	ld1b	z4.b, p0/z, [src_ptr, #4, mul vl]
-	ld1b	z5.b, p0/z, [src_ptr, #5, mul vl]
-	ld1b	z6.b, p0/z, [src_ptr, #6, mul vl]
-	ld1b	z7.b, p0/z, [src_ptr, #7, mul vl]
+	ld1b	z0.b, p0/z, [src, 0, mul vl]
+	ld1b	z1.b, p0/z, [src, 1, mul vl]
+	ld1b	z2.b, p0/z, [src, 2, mul vl]
+	ld1b	z3.b, p0/z, [src, 3, mul vl]
+	ld1b	z4.b, p0/z, [src, 4, mul vl]
+	ld1b	z5.b, p0/z, [src, 5, mul vl]
+	ld1b	z6.b, p0/z, [src, 6, mul vl]
+	ld1b	z7.b, p0/z, [src, 7, mul vl]
 	.endm
 
 	.macro stld1b_unroll4a
-	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
-	st1b	z1.b, p0, [dest_ptr, #1, mul vl]
-	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
-	ld1b	z1.b, p0/z, [src_ptr, #1, mul vl]
-	st1b	z2.b, p0, [dest_ptr, #2, mul vl]
-	st1b	z3.b, p0, [dest_ptr, #3, mul vl]
-	ld1b	z2.b, p0/z, [src_ptr, #2, mul vl]
-	ld1b	z3.b, p0/z, [src_ptr, #3, mul vl]
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z1.b, p0, [dst, 1, mul vl]
+	ld1b	z0.b, p0/z, [src, 0, mul vl]
+	ld1b	z1.b, p0/z, [src, 1, mul vl]
+	st1b	z2.b, p0, [dst, 2, mul vl]
+	st1b	z3.b, p0, [dst, 3, mul vl]
+	ld1b	z2.b, p0/z, [src, 2, mul vl]
+	ld1b	z3.b, p0/z, [src, 3, mul vl]
 	.endm
 
 	.macro stld1b_unroll4b
-	st1b	z4.b, p0, [dest_ptr, #4, mul vl]
-	st1b	z5.b, p0, [dest_ptr, #5, mul vl]
-	ld1b	z4.b, p0/z, [src_ptr, #4, mul vl]
-	ld1b	z5.b, p0/z, [src_ptr, #5, mul vl]
-	st1b	z6.b, p0, [dest_ptr, #6, mul vl]
-	st1b	z7.b, p0, [dest_ptr, #7, mul vl]
-	ld1b	z6.b, p0/z, [src_ptr, #6, mul vl]
-	ld1b	z7.b, p0/z, [src_ptr, #7, mul vl]
+	st1b	z4.b, p0, [dst, 4, mul vl]
+	st1b	z5.b, p0, [dst, 5, mul vl]
+	ld1b	z4.b, p0/z, [src, 4, mul vl]
+	ld1b	z5.b, p0/z, [src, 5, mul vl]
+	st1b	z6.b, p0, [dst, 6, mul vl]
+	st1b	z7.b, p0, [dst, 7, mul vl]
+	ld1b	z6.b, p0/z, [src, 6, mul vl]
+	ld1b	z7.b, p0/z, [src, 7, mul vl]
 	.endm
 
 	.macro stld1b_unroll8
@@ -94,87 +81,18 @@
 	.endm
 
 	.macro st1b_unroll8
-	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
-	st1b	z1.b, p0, [dest_ptr, #1, mul vl]
-	st1b	z2.b, p0, [dest_ptr, #2, mul vl]
-	st1b	z3.b, p0, [dest_ptr, #3, mul vl]
-	st1b	z4.b, p0, [dest_ptr, #4, mul vl]
-	st1b	z5.b, p0, [dest_ptr, #5, mul vl]
-	st1b	z6.b, p0, [dest_ptr, #6, mul vl]
-	st1b	z7.b, p0, [dest_ptr, #7, mul vl]
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z1.b, p0, [dst, 1, mul vl]
+	st1b	z2.b, p0, [dst, 2, mul vl]
+	st1b	z3.b, p0, [dst, 3, mul vl]
+	st1b	z4.b, p0, [dst, 4, mul vl]
+	st1b	z5.b, p0, [dst, 5, mul vl]
+	st1b	z6.b, p0, [dst, 6, mul vl]
+	st1b	z7.b, p0, [dst, 7, mul vl]
 	.endm
 
-	.macro shortcut_for_small_size exit
-	// if rest <= vector_length * 2
-	whilelo	p0.b, xzr, n
-	whilelo	p1.b, vector_length, n
-	b.last	1f
-	ld1b	z0.b, p0/z, [src, #0, mul vl]
-	ld1b	z1.b, p1/z, [src, #1, mul vl]
-	st1b	z0.b, p0, [dest, #0, mul vl]
-	st1b	z1.b, p1, [dest, #1, mul vl]
-	ret
-1:	// if rest > vector_length * 8
-	cmp	n, vector_length, lsl 3	// vector_length * 8
-	b.hi	\exit
-	// if rest <= vector_length * 4
-	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, n
-	incb	tmp1
-	whilelo	p3.b, tmp1, n
-	b.last	1f
-	ld1b	z0.b, p0/z, [src, #0, mul vl]
-	ld1b	z1.b, p1/z, [src, #1, mul vl]
-	ld1b	z2.b, p2/z, [src, #2, mul vl]
-	ld1b	z3.b, p3/z, [src, #3, mul vl]
-	st1b	z0.b, p0, [dest, #0, mul vl]
-	st1b	z1.b, p1, [dest, #1, mul vl]
-	st1b	z2.b, p2, [dest, #2, mul vl]
-	st1b	z3.b, p3, [dest, #3, mul vl]
-	ret
-1:	// if rest <= vector_length * 8
-	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, n
-	incb	tmp1
-	whilelo	p5.b, tmp1, n
-	b.last	1f
-	ld1b	z0.b, p0/z, [src, #0, mul vl]
-	ld1b	z1.b, p1/z, [src, #1, mul vl]
-	ld1b	z2.b, p2/z, [src, #2, mul vl]
-	ld1b	z3.b, p3/z, [src, #3, mul vl]
-	ld1b	z4.b, p4/z, [src, #4, mul vl]
-	ld1b	z5.b, p5/z, [src, #5, mul vl]
-	st1b	z0.b, p0, [dest, #0, mul vl]
-	st1b	z1.b, p1, [dest, #1, mul vl]
-	st1b	z2.b, p2, [dest, #2, mul vl]
-	st1b	z3.b, p3, [dest, #3, mul vl]
-	st1b	z4.b, p4, [dest, #4, mul vl]
-	st1b	z5.b, p5, [dest, #5, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	incb	tmp1			// vector_length * 5
-	incb	tmp1			// vector_length * 6
-	whilelo	p6.b, tmp1, n
-	incb	tmp1
-	whilelo	p7.b, tmp1, n
-	ld1b	z0.b, p0/z, [src, #0, mul vl]
-	ld1b	z1.b, p1/z, [src, #1, mul vl]
-	ld1b	z2.b, p2/z, [src, #2, mul vl]
-	ld1b	z3.b, p3/z, [src, #3, mul vl]
-	ld1b	z4.b, p4/z, [src, #4, mul vl]
-	ld1b	z5.b, p5/z, [src, #5, mul vl]
-	ld1b	z6.b, p6/z, [src, #6, mul vl]
-	ld1b	z7.b, p7/z, [src, #7, mul vl]
-	st1b	z0.b, p0, [dest, #0, mul vl]
-	st1b	z1.b, p1, [dest, #1, mul vl]
-	st1b	z2.b, p2, [dest, #2, mul vl]
-	st1b	z3.b, p3, [dest, #3, mul vl]
-	st1b	z4.b, p4, [dest, #4, mul vl]
-	st1b	z5.b, p5, [dest, #5, mul vl]
-	st1b	z6.b, p6, [dest, #6, mul vl]
-	st1b	z7.b, p7, [dest, #7, mul vl]
-	ret
-	.endm
+#undef BTI_C
+#define BTI_C
 
 ENTRY (MEMCPY)
 
@@ -182,223 +100,209 @@
 	PTR_ARG (1)
 	SIZE_ARG (2)
 
-L(memcpy):
-	cntb	vector_length
-	// shortcut for less than vector_length * 8
-	// gives a free ptrue to p0.b for n >= vector_length
-	shortcut_for_small_size L(vl_agnostic)
-	// end of shortcut
-
-L(vl_agnostic): // VL Agnostic
-	mov	rest, n
-	mov	dest_ptr, dest
-	mov	src_ptr, src
-	// if rest >= L2_SIZE && vector_length == 64 then L(L2)
-	mov	tmp1, 64
-	cmp	rest, L2_SIZE
-	ccmp	vector_length, tmp1, 0, cs
-	b.eq	L(L2)
-
-L(unroll8): // unrolling and software pipeline
-	lsl	tmp1, vector_length, 3	// vector_length * 8
-	.p2align 3
-	cmp	rest, tmp1
-	b.cc	L(last)
+	cntb	vlen
+	cmp	n, vlen, lsl 1
+	b.hi	L(copy_small)
+	whilelo	p1.b, vlen, n
+	whilelo	p0.b, xzr, n
+	ld1b	z0.b, p0/z, [src, 0, mul vl]
+	ld1b	z1.b, p1/z, [src, 1, mul vl]
+	st1b	z0.b, p0, [dstin, 0, mul vl]
+	st1b	z1.b, p1, [dstin, 1, mul vl]
+	ret
+
+	.p2align 4
+
+L(copy_small):
+	cmp	n, vlen, lsl 3
+	b.hi	L(copy_large)
+	add	dstend, dstin, n
+	add	srcend, src, n
+	cmp	n, vlen, lsl 2
+	b.hi	1f
+
+	/* Copy 2-4 vectors.  */
+	ptrue	p0.b
+	ld1b	z0.b, p0/z, [src, 0, mul vl]
+	ld1b	z1.b, p0/z, [src, 1, mul vl]
+	ld1b	z2.b, p0/z, [srcend, -2, mul vl]
+	ld1b	z3.b, p0/z, [srcend, -1, mul vl]
+	st1b	z0.b, p0, [dstin, 0, mul vl]
+	st1b	z1.b, p0, [dstin, 1, mul vl]
+	st1b	z2.b, p0, [dstend, -2, mul vl]
+	st1b	z3.b, p0, [dstend, -1, mul vl]
+	ret
+
+	.p2align 4
+	/* Copy 4-8 vectors.  */
+1:	ptrue	p0.b
+	ld1b	z0.b, p0/z, [src, 0, mul vl]
+	ld1b	z1.b, p0/z, [src, 1, mul vl]
+	ld1b	z2.b, p0/z, [src, 2, mul vl]
+	ld1b	z3.b, p0/z, [src, 3, mul vl]
+	ld1b	z4.b, p0/z, [srcend, -4, mul vl]
+	ld1b	z5.b, p0/z, [srcend, -3, mul vl]
+	ld1b	z6.b, p0/z, [srcend, -2, mul vl]
+	ld1b	z7.b, p0/z, [srcend, -1, mul vl]
+	st1b	z0.b, p0, [dstin, 0, mul vl]
+	st1b	z1.b, p0, [dstin, 1, mul vl]
+	st1b	z2.b, p0, [dstin, 2, mul vl]
+	st1b	z3.b, p0, [dstin, 3, mul vl]
+	st1b	z4.b, p0, [dstend, -4, mul vl]
+	st1b	z5.b, p0, [dstend, -3, mul vl]
+	st1b	z6.b, p0, [dstend, -2, mul vl]
+	st1b	z7.b, p0, [dstend, -1, mul vl]
+	ret
+
+	.p2align 4
+	/* At least 8 vectors - always align to vector length for
+	   higher and consistent write performance.  */
+L(copy_large):
+	sub	tmp, vlen, 1
+	and	tmp, dstin, tmp
+	sub	tmp, vlen, tmp
+	whilelo	p1.b, xzr, tmp
+	ld1b	z1.b, p1/z, [src]
+	st1b	z1.b, p1, [dstin]
+	add	dst, dstin, tmp
+	add	src, src, tmp
+	sub	n, n, tmp
+	ptrue	p0.b
+
+	lsl	vlen8, vlen, 3
+	subs	n, n, vlen8
+	b.ls	3f
 	ld1b_unroll8
-	add	src_ptr, src_ptr, tmp1
-	sub	rest, rest, tmp1
-	cmp	rest, tmp1
-	b.cc	2f
-	.p2align 3
+	add	src, src, vlen8
+	subs	n, n, vlen8
+	b.ls	2f
+
+	.p2align 4
+	/* 8x unrolled and software pipelined loop.  */
1:	stld1b_unroll8
-	add	dest_ptr, dest_ptr, tmp1
-	add	src_ptr, src_ptr, tmp1
-	sub	rest, rest, tmp1
-	cmp	rest, tmp1
-	b.ge	1b
+	add	dst, dst, vlen8
+	add	src, src, vlen8
+	subs	n, n, vlen8
+	b.hi	1b
 2:	st1b_unroll8
-	add	dest_ptr, dest_ptr, tmp1
-
-	.p2align 3
-L(last):
-	whilelo	p0.b, xzr, rest
-	whilelo	p1.b, vector_length, rest
-	b.last	1f
-	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
-	ld1b	z1.b, p1/z, [src_ptr, #1, mul vl]
-	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
-	st1b	z1.b, p1, [dest_ptr, #1, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, rest
-	incb	tmp1
-	whilelo	p3.b, tmp1, rest
-	b.last	1f
-	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
-	ld1b	z1.b, p1/z, [src_ptr, #1, mul vl]
-	ld1b	z2.b, p2/z, [src_ptr, #2, mul vl]
-	ld1b	z3.b, p3/z, [src_ptr, #3, mul vl]
-	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
-	st1b	z1.b, p1, [dest_ptr, #1, mul vl]
-	st1b	z2.b, p2, [dest_ptr, #2, mul vl]
-	st1b	z3.b, p3, [dest_ptr, #3, mul vl]
+	add	dst, dst, vlen8
+3:	add	n, n, vlen8
+
+	/* Move last 0-8 vectors.  */
+L(last_bytes):
+	cmp	n, vlen, lsl 1
+	b.hi	1f
+	whilelo	p0.b, xzr, n
+	whilelo	p1.b, vlen, n
+	ld1b	z0.b, p0/z, [src, 0, mul vl]
+	ld1b	z1.b, p1/z, [src, 1, mul vl]
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z1.b, p1, [dst, 1, mul vl]
 	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, rest
-	incb	tmp1
-	whilelo	p5.b, tmp1, rest
-	incb	tmp1
-	whilelo	p6.b, tmp1, rest
-	incb	tmp1
-	whilelo	p7.b, tmp1, rest
-	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
-	ld1b	z1.b, p1/z, [src_ptr, #1, mul vl]
-	ld1b	z2.b, p2/z, [src_ptr, #2, mul vl]
-	ld1b	z3.b, p3/z, [src_ptr, #3, mul vl]
-	ld1b	z4.b, p4/z, [src_ptr, #4, mul vl]
-	ld1b	z5.b, p5/z, [src_ptr, #5, mul vl]
-	ld1b	z6.b, p6/z, [src_ptr, #6, mul vl]
-	ld1b	z7.b, p7/z, [src_ptr, #7, mul vl]
-	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
-	st1b	z1.b, p1, [dest_ptr, #1, mul vl]
-	st1b	z2.b, p2, [dest_ptr, #2, mul vl]
-	st1b	z3.b, p3, [dest_ptr, #3, mul vl]
-	st1b	z4.b, p4, [dest_ptr, #4, mul vl]
-	st1b	z5.b, p5, [dest_ptr, #5, mul vl]
-	st1b	z6.b, p6, [dest_ptr, #6, mul vl]
-	st1b	z7.b, p7, [dest_ptr, #7, mul vl]
+
+	.p2align 4
+
+1:	add	srcend, src, n
+	add	dstend, dst, n
+	ld1b	z0.b, p0/z, [src, 0, mul vl]
+	ld1b	z1.b, p0/z, [src, 1, mul vl]
+	ld1b	z2.b, p0/z, [srcend, -2, mul vl]
+	ld1b	z3.b, p0/z, [srcend, -1, mul vl]
+	cmp	n, vlen, lsl 2
+	b.hi	1f
+
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z1.b, p0, [dst, 1, mul vl]
+	st1b	z2.b, p0, [dstend, -2, mul vl]
+	st1b	z3.b, p0, [dstend, -1, mul vl]
 	ret
-L(L2):
-	// align dest address at CACHE_LINE_SIZE byte boundary
-	mov	tmp1, CACHE_LINE_SIZE
-	ands	tmp2, dest_ptr, CACHE_LINE_SIZE - 1
-	// if cl_remainder == 0
-	b.eq	L(L2_dc_zva)
-	sub	cl_remainder, tmp1, tmp2
-	// process remainder until the first CACHE_LINE_SIZE boundary
-	whilelo	p1.b, xzr, cl_remainder	// keep p0.b all true
-	whilelo	p2.b, vector_length, cl_remainder
-	b.last	1f
-	ld1b	z1.b, p1/z, [src_ptr, #0, mul vl]
-	ld1b	z2.b, p2/z, [src_ptr, #1, mul vl]
-	st1b	z1.b, p1, [dest_ptr, #0, mul vl]
-	st1b	z2.b, p2, [dest_ptr, #1, mul vl]
-	b	2f
-1:	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p3.b, tmp1, cl_remainder
-	incb	tmp1
-	whilelo	p4.b, tmp1, cl_remainder
-	ld1b	z1.b, p1/z, [src_ptr, #0, mul vl]
-	ld1b	z2.b, p2/z, [src_ptr, #1, mul vl]
-	ld1b	z3.b, p3/z, [src_ptr, #2, mul vl]
-	ld1b	z4.b, p4/z, [src_ptr, #3, mul vl]
-	st1b	z1.b, p1, [dest_ptr, #0, mul vl]
-	st1b	z2.b, p2, [dest_ptr, #1, mul vl]
-	st1b	z3.b, p3, [dest_ptr, #2, mul vl]
-	st1b	z4.b, p4, [dest_ptr, #3, mul vl]
-2:	add	dest_ptr, dest_ptr, cl_remainder
-	add	src_ptr, src_ptr, cl_remainder
-	sub	rest, rest, cl_remainder
-
-L(L2_dc_zva):
-	// zero fill
-	and	tmp1, dest, 0xffffffffffffff
-	and	tmp2, src, 0xffffffffffffff
-	subs	tmp1, tmp1, tmp2	// diff
-	b.ge	1f
-	neg	tmp1, tmp1
-1:	mov	tmp3, ZF_DIST + CACHE_LINE_SIZE * 2
-	cmp	tmp1, tmp3
-	b.lo	L(unroll8)
-	mov	tmp1, dest_ptr
-	dc_zva	(ZF_DIST / CACHE_LINE_SIZE) - 1
-	// unroll
-	ld1b_unroll8	// this line has to be after "b.lo L(unroll8)"
-	add	src_ptr, src_ptr, CACHE_LINE_SIZE * 2
-	sub	rest, rest, CACHE_LINE_SIZE * 2
-	mov	tmp1, ZF_DIST
-	.p2align 3
-1:	stld1b_unroll4a
-	add	tmp2, dest_ptr, tmp1	// dest_ptr + ZF_DIST
-	dc	zva, tmp2
-	stld1b_unroll4b
-	add	tmp2, tmp2, CACHE_LINE_SIZE
-	dc	zva, tmp2
-	add	dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
-	add	src_ptr, src_ptr, CACHE_LINE_SIZE * 2
-	sub	rest, rest, CACHE_LINE_SIZE * 2
-	cmp	rest, tmp3	// ZF_DIST + CACHE_LINE_SIZE * 2
-	b.ge	1b
-	st1b_unroll8
-	add	dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
-	b	L(unroll8)
+1:	ld1b	z4.b, p0/z, [src, 2, mul vl]
+	ld1b	z5.b, p0/z, [src, 3, mul vl]
+	ld1b	z6.b, p0/z, [srcend, -4, mul vl]
+	ld1b	z7.b, p0/z, [srcend, -3, mul vl]
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z1.b, p0, [dst, 1, mul vl]
+	st1b	z4.b, p0, [dst, 2, mul vl]
+	st1b	z5.b, p0, [dst, 3, mul vl]
+	st1b	z6.b, p0, [dstend, -4, mul vl]
+	st1b	z7.b, p0, [dstend, -3, mul vl]
+	st1b	z2.b, p0, [dstend, -2, mul vl]
+	st1b	z3.b, p0, [dstend, -1, mul vl]
+	ret
 
 END (MEMCPY)
 libc_hidden_builtin_def (MEMCPY)
 
-ENTRY (MEMMOVE)
+ENTRY_ALIGN (MEMMOVE, 4)
 
 	PTR_ARG (0)
 	PTR_ARG (1)
 	SIZE_ARG (2)
 
-	// remove tag address
-	// dest has to be immutable because it is the return value
-	// src has to be immutable because it is used in L(bwd_last)
-	and	tmp2, dest, 0xffffffffffffff	// save dest_notag into tmp2
-	and	tmp3, src, 0xffffffffffffff	// save src_notag intp tmp3
-	cmp	n, 0
-	ccmp	tmp2, tmp3, 4, ne
-	b.ne	1f
+	/* Fast case for up to 2 vectors.  */
+	cntb	vlen
+	cmp	n, vlen, lsl 1
+	b.hi	1f
+	whilelo	p0.b, xzr, n
+	whilelo	p1.b, vlen, n
+	ld1b	z0.b, p0/z, [src, 0, mul vl]
+	ld1b	z1.b, p1/z, [src, 1, mul vl]
+	st1b	z0.b, p0, [dstin, 0, mul vl]
+	st1b	z1.b, p1, [dstin, 1, mul vl]
+L(full_overlap):
 	ret
-1:	cntb	vector_length
-	// shortcut for less than vector_length * 8
-	// gives a free ptrue to p0.b for n >= vector_length
-	// tmp2 and tmp3 should not be used in this macro to keep
-	// notag addresses
-	shortcut_for_small_size L(dispatch)
-	// end of shortcut
-
-L(dispatch):
-	// tmp2 = dest_notag, tmp3 = src_notag
-	// diff = dest_notag - src_notag
-	sub	tmp1, tmp2, tmp3
-	// if diff <= 0 || diff >= n then memcpy
-	cmp	tmp1, 0
-	ccmp	tmp1, n, 2, gt
-	b.cs	L(vl_agnostic)
-
-L(bwd_start):
-	mov	rest, n
-	add	dest_ptr, dest, n	// dest_end
-	add	src_ptr, src, n	// src_end
-
-L(bwd_unroll8): // unrolling and software pipeline
-	lsl	tmp1, vector_length, 3	// vector_length * 8
-	.p2align 3
-	cmp	rest, tmp1
-	b.cc	L(bwd_last)
-	sub	src_ptr, src_ptr, tmp1
+
+	.p2align 4
+	/* Check for overlapping moves. Return if there is a full overlap.
+	   Small moves up to 8 vectors use the overlap-safe copy_small code.
+	   Non-overlapping or overlapping moves with dst < src use memcpy.
+	   Overlapping moves with dst > src use a backward copy loop.  */
+1:	sub	tmp, dstin, src
+	ands	tmp, tmp, 0xffffffffffffff	/* Clear special tag bits.  */
+	b.eq	L(full_overlap)
+	cmp	n, vlen, lsl 3
+	b.ls	L(copy_small)
+	cmp	tmp, n
+	b.hs	L(copy_large)
+
+	/* Align to vector length.  */
+	add	dst, dstin, n
+	sub	tmp, vlen, 1
+	ands	tmp, dst, tmp
+	csel	tmp, tmp, vlen, ne
+	whilelo	p1.b, xzr, tmp
+	sub	n, n, tmp
+	ld1b	z1.b, p1/z, [src, n]
+	st1b	z1.b, p1, [dstin, n]
+	add	src, src, n
+	add	dst, dstin, n
+
+	ptrue	p0.b
+	lsl	vlen8, vlen, 3
+	subs	n, n, vlen8
+	b.ls	3f
+	sub	src, src, vlen8
 	ld1b_unroll8
-	sub	rest, rest, tmp1
-	cmp	rest, tmp1
-	b.cc	2f
-	.p2align 3
-1:	sub	src_ptr, src_ptr, tmp1
-	sub	dest_ptr, dest_ptr, tmp1
+	subs	n, n, vlen8
+	b.ls	2f
+
+	.p2align 4
+	/* 8x unrolled and software pipelined backward copy loop.  */
+1:	sub	src, src, vlen8
+	sub	dst, dst, vlen8
 	stld1b_unroll8
-	sub	rest, rest, tmp1
-	cmp	rest, tmp1
-	b.ge	1b
-2:	sub	dest_ptr, dest_ptr, tmp1
+	subs	n, n, vlen8
+	b.hi	1b
+2:	sub	dst, dst, vlen8
 	st1b_unroll8
+3:	add	n, n, vlen8
 
-L(bwd_last):
-	mov	dest_ptr, dest
-	mov	src_ptr, src
-	b	L(last)
+	/* Adjust src/dst for last 0-8 vectors.  */
+	sub	src, src, n
+	mov	dst, dstin
+	b	L(last_bytes)
 
 END (MEMMOVE)
 libc_hidden_builtin_def (MEMMOVE)
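
For readers who do not want to decode the SVE assembly, the large-copy path
above (L(copy_large) followed by L(last_bytes)) corresponds roughly to the C
sketch below. This is an illustration only, not part of the patch: VLEN stands
in for the runtime vector length read with cntb (64 bytes on A64FX), and plain
memcpy() calls stand in for the predicated ld1b/st1b sequences and the
end-anchored tail copies used in the real code.

/* Illustrative sketch only - not part of the patch.  */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VLEN = 64 };	/* assumed A64FX SVE vector length in bytes */

void *
copy_large_sketch (void *dstin, const void *srcin, size_t n)
{
  char *dst = dstin;
  const char *src = srcin;

  /* Align the destination to the vector length with one partial copy
     (a whole vector if it is already aligned).  */
  size_t head = VLEN - ((uintptr_t) dst & (VLEN - 1));
  memcpy (dst, src, head);
  dst += head;
  src += head;
  n -= head;

  /* Single 8x unrolled main loop: copy 8 vectors per iteration.  */
  while (n > 8 * VLEN)
    {
      memcpy (dst, src, 8 * VLEN);
      dst += 8 * VLEN;
      src += 8 * VLEN;
      n -= 8 * VLEN;
    }

  /* Last 0-8 vectors; the real code copies these with loads and stores
     anchored at both ends of the remaining region.  */
  memcpy (dst, src, n);
  return dstin;
}

As the comment at L(copy_large) notes, aligning the destination (rather than
the source) is what gives the higher and more consistent write performance.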