From patchwork Wed Jun 30 15:38:26 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 44065
To: "naohirot@fujitsu.com"
Subject: [PATCH] AArch64: Improve A64FX memcpy
Date: Wed, 30 Jun 2021 15:38:26 +0000
List-Id: Libc-alpha mailing list
X-Patchwork-Original-From: Wilco Dijkstra via Libc-alpha
From: Wilco Dijkstra
Reply-To: Wilco Dijkstra
Cc: 'GNU C Library'

Hi Naohiro,

Since the A64FX memcpy is still quite large, I decided to do a quick cleanup
pass to simplify the code. I believe the code is better overall and a bit
faster, but please let me know what you think. I've left the structure as it
was, but there are likely more tweaks possible. Here it is:

Reduce the code size of the A64FX memcpy by avoiding duplication of code and
removing redundant instructions. The size of memcpy and memmove goes down
from 1796 bytes to 1080 bytes. Performance is mostly unchanged or slightly
better, as the critical loops are identical but fewer instructions are
executed before entering the loop.

Passes GLIBC regress, OK for commit?

diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 65528405bb12373731e895c7030ccef23b88c17f..425148300913aadd8b144d17e7ee2b496f65008e 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -38,7 +38,6 @@
 #define dest_ptr x7
 #define src_ptr x8
 #define vector_length x9
-#define cl_remainder x10 // CACHE_LINE_SIZE remainder
 
 #if HAVE_AARCH64_SVE_ASM
 # if IS_IN (libc)
@@ -47,14 +46,6 @@
 
         .arch armv8.2-a+sve
 
-        .macro dc_zva times
-        dc zva, tmp1
-        add tmp1, tmp1, CACHE_LINE_SIZE
-        .if \times-1
-        dc_zva "(\times-1)"
-        .endif
-        .endm
-
         .macro ld1b_unroll8
         ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
         ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
@@ -106,69 +97,49 @@
 
         .macro shortcut_for_small_size exit
         // if rest <= vector_length * 2
-        whilelo p0.b, xzr, n
+        whilelo p0.b, xzr, n
         whilelo p1.b, vector_length, n
-        b.last 1f
         ld1b z0.b, p0/z, [src, #0, mul vl]
         ld1b z1.b, p1/z, [src, #1, mul vl]
+        b.last 1f
         st1b z0.b, p0, [dest, #0, mul vl]
         st1b z1.b, p1, [dest, #1, mul vl]
         ret
+
 1:      // if rest > vector_length * 8
         cmp n, vector_length, lsl 3 // vector_length * 8
         b.hi \exit
+
         // if rest <= vector_length * 4
         lsl tmp1, vector_length, 1 // vector_length * 2
-        whilelo p2.b, tmp1, n
-        incb tmp1
-        whilelo p3.b, tmp1, n
-        b.last 1f
-        ld1b z0.b, p0/z, [src, #0, mul vl]
-        ld1b z1.b, p1/z, [src, #1, mul vl]
+        sub n, n, tmp1
+        whilelo p2.b, xzr, n
+        whilelo p3.b, vector_length, n
         ld1b z2.b, p2/z, [src, #2, mul vl]
         ld1b z3.b, p3/z, [src, #3, mul vl]
-        st1b z0.b, p0, [dest, #0, mul vl]
-        st1b z1.b, p1, [dest, #1, mul vl]
-        st1b z2.b, p2, [dest, #2, mul vl]
-        st1b z3.b, p3, [dest, #3, mul vl]
-        ret
-1:      // if rest <= vector_length * 8
-        lsl tmp1, vector_length, 2 // vector_length * 4
-        whilelo p4.b, tmp1, n
-        incb tmp1
-        whilelo p5.b, tmp1, n
         b.last 1f
-        ld1b z0.b, p0/z, [src, #0, mul vl]
-        ld1b z1.b, p1/z, [src, #1, mul vl]
-        ld1b z2.b, p2/z, [src, #2, mul vl]
-        ld1b z3.b, p3/z, [src, #3, mul vl]
-        ld1b z4.b, p4/z, [src, #4, mul vl]
-        ld1b z5.b, p5/z, [src, #5, mul vl]
         st1b z0.b, p0, [dest, #0, mul vl]
-        st1b z1.b, p1, [dest, #1, mul vl]
+        st1b z1.b, p0, [dest, #1, mul vl]
         st1b z2.b, p2, [dest, #2, mul vl]
         st1b z3.b, p3, [dest, #3, mul vl]
-        st1b z4.b, p4, [dest, #4, mul vl]
-        st1b z5.b, p5, [dest, #5, mul vl]
         ret
-1:      lsl tmp1, vector_length, 2 // vector_length * 4
-        incb tmp1 // vector_length * 5
-        incb tmp1 // vector_length * 6
-        whilelo p6.b, tmp1, n
-        incb tmp1
-        whilelo p7.b, tmp1, n
-        ld1b z0.b, p0/z, [src, #0, mul vl]
-        ld1b z1.b, p1/z, [src, #1, mul vl]
-        ld1b z2.b, p2/z, [src, #2, mul vl]
-        ld1b z3.b, p3/z, [src, #3, mul vl]
+
+1:      // if rest <= vector_length * 8
+        sub n, n, tmp1
+        add tmp2, tmp1, vector_length
+        whilelo p4.b, xzr, n
+        whilelo p5.b, vector_length, n
+        whilelo p6.b, tmp1, n
+        whilelo p7.b, tmp2, n
+
         ld1b z4.b, p4/z, [src, #4, mul vl]
         ld1b z5.b, p5/z, [src, #5, mul vl]
         ld1b z6.b, p6/z, [src, #6, mul vl]
         ld1b z7.b, p7/z, [src, #7, mul vl]
         st1b z0.b, p0, [dest, #0, mul vl]
-        st1b z1.b, p1, [dest, #1, mul vl]
-        st1b z2.b, p2, [dest, #2, mul vl]
-        st1b z3.b, p3, [dest, #3, mul vl]
+        st1b z1.b, p0, [dest, #1, mul vl]
+        st1b z2.b, p0, [dest, #2, mul vl]
+        st1b z3.b, p0, [dest, #3, mul vl]
         st1b z4.b, p4, [dest, #4, mul vl]
         st1b z5.b, p5, [dest, #5, mul vl]
         st1b z6.b, p6, [dest, #6, mul vl]
@@ -182,8 +153,8 @@ ENTRY (MEMCPY)
         PTR_ARG (1)
         SIZE_ARG (2)
 
-L(memcpy):
         cntb vector_length
+L(memmove_small):
         // shortcut for less than vector_length * 8
         // gives a free ptrue to p0.b for n >= vector_length
         shortcut_for_small_size L(vl_agnostic)
@@ -201,135 +172,107 @@ L(vl_agnostic): // VL Agnostic
 
 L(unroll8): // unrolling and software pipeline
         lsl tmp1, vector_length, 3 // vector_length * 8
-        .p2align 3
-        cmp rest, tmp1
-        b.cc L(last)
+        sub rest, rest, tmp1
         ld1b_unroll8
         add src_ptr, src_ptr, tmp1
-        sub rest, rest, tmp1
-        cmp rest, tmp1
+        subs rest, rest, tmp1
         b.cc 2f
-        .p2align 3
+        .p2align 4
 1:      stld1b_unroll8
         add dest_ptr, dest_ptr, tmp1
         add src_ptr, src_ptr, tmp1
-        sub rest, rest, tmp1
-        cmp rest, tmp1
-        b.ge 1b
+        subs rest, rest, tmp1
+        b.hs 1b
 2:      st1b_unroll8
         add dest_ptr, dest_ptr, tmp1
+        add rest, rest, tmp1
 
         .p2align 3
 L(last):
-        whilelo p0.b, xzr, rest
+        whilelo p0.b, xzr, rest
         whilelo p1.b, vector_length, rest
-        b.last 1f
-        ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
-        ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
-        st1b z0.b, p0, [dest_ptr, #0, mul vl]
-        st1b z1.b, p1, [dest_ptr, #1, mul vl]
-        ret
-1:      lsl tmp1, vector_length, 1 // vector_length * 2
-        whilelo p2.b, tmp1, rest
-        incb tmp1
-        whilelo p3.b, tmp1, rest
-        b.last 1f
-        ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
-        ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
-        ld1b z2.b, p2/z, [src_ptr, #2, mul vl]
-        ld1b z3.b, p3/z, [src_ptr, #3, mul vl]
-        st1b z0.b, p0, [dest_ptr, #0, mul vl]
-        st1b z1.b, p1, [dest_ptr, #1, mul vl]
-        st1b z2.b, p2, [dest_ptr, #2, mul vl]
-        st1b z3.b, p3, [dest_ptr, #3, mul vl]
-        ret
-1:      lsl tmp1, vector_length, 2 // vector_length * 4
-        whilelo p4.b, tmp1, rest
-        incb tmp1
-        whilelo p5.b, tmp1, rest
-        incb tmp1
-        whilelo p6.b, tmp1, rest
-        incb tmp1
-        whilelo p7.b, tmp1, rest
         ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
         ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
+        b.nlast 1f
+
+        lsl tmp1, vector_length, 1 // vector_length * 2
+        sub rest, rest, tmp1
+        whilelo p2.b, xzr, rest
+        whilelo p3.b, vector_length, rest
         ld1b z2.b, p2/z, [src_ptr, #2, mul vl]
         ld1b z3.b, p3/z, [src_ptr, #3, mul vl]
+        b.nlast 2f
+
+        sub rest, rest, tmp1
+        add tmp2, tmp1, vector_length // vector_length * 3
+        whilelo p4.b, xzr, rest
+        whilelo p5.b, vector_length, rest
+        whilelo p6.b, tmp1, rest
+        whilelo p7.b, tmp2, rest
+
         ld1b z4.b, p4/z, [src_ptr, #4, mul vl]
         ld1b z5.b, p5/z, [src_ptr, #5, mul vl]
         ld1b z6.b, p6/z, [src_ptr, #6, mul vl]
         ld1b z7.b, p7/z, [src_ptr, #7, mul vl]
-        st1b z0.b, p0, [dest_ptr, #0, mul vl]
-        st1b z1.b, p1, [dest_ptr, #1, mul vl]
-        st1b z2.b, p2, [dest_ptr, #2, mul vl]
-        st1b z3.b, p3, [dest_ptr, #3, mul vl]
         st1b z4.b, p4, [dest_ptr, #4, mul vl]
         st1b z5.b, p5, [dest_ptr, #5, mul vl]
         st1b z6.b, p6, [dest_ptr, #6, mul vl]
         st1b z7.b, p7, [dest_ptr, #7, mul vl]
+2:      st1b z2.b, p2, [dest_ptr, #2, mul vl]
+        st1b z3.b, p3, [dest_ptr, #3, mul vl]
+1:      st1b z0.b, p0, [dest_ptr, #0, mul vl]
+        st1b z1.b, p1, [dest_ptr, #1, mul vl]
         ret
 
 L(L2):
         // align dest address at CACHE_LINE_SIZE byte boundary
-        mov tmp1, CACHE_LINE_SIZE
-        ands tmp2, dest_ptr, CACHE_LINE_SIZE - 1
-        // if cl_remainder == 0
-        b.eq L(L2_dc_zva)
-        sub cl_remainder, tmp1, tmp2
-        // process remainder until the first CACHE_LINE_SIZE boundary
-        whilelo p1.b, xzr, cl_remainder // keep p0.b all true
-        whilelo p2.b, vector_length, cl_remainder
-        b.last 1f
-        ld1b z1.b, p1/z, [src_ptr, #0, mul vl]
-        ld1b z2.b, p2/z, [src_ptr, #1, mul vl]
-        st1b z1.b, p1, [dest_ptr, #0, mul vl]
-        st1b z2.b, p2, [dest_ptr, #1, mul vl]
-        b 2f
-1:      lsl tmp1, vector_length, 1 // vector_length * 2
-        whilelo p3.b, tmp1, cl_remainder
-        incb tmp1
-        whilelo p4.b, tmp1, cl_remainder
-        ld1b z1.b, p1/z, [src_ptr, #0, mul vl]
-        ld1b z2.b, p2/z, [src_ptr, #1, mul vl]
-        ld1b z3.b, p3/z, [src_ptr, #2, mul vl]
-        ld1b z4.b, p4/z, [src_ptr, #3, mul vl]
-        st1b z1.b, p1, [dest_ptr, #0, mul vl]
-        st1b z2.b, p2, [dest_ptr, #1, mul vl]
-        st1b z3.b, p3, [dest_ptr, #2, mul vl]
-        st1b z4.b, p4, [dest_ptr, #3, mul vl]
-2:      add dest_ptr, dest_ptr, cl_remainder
-        add src_ptr, src_ptr, cl_remainder
-        sub rest, rest, cl_remainder
+        and tmp1, dest_ptr, CACHE_LINE_SIZE - 1
+        sub tmp1, tmp1, CACHE_LINE_SIZE
+        ld1b z1.b, p0/z, [src_ptr, #0, mul vl]
+        ld1b z2.b, p0/z, [src_ptr, #1, mul vl]
+        ld1b z3.b, p0/z, [src_ptr, #2, mul vl]
+        ld1b z4.b, p0/z, [src_ptr, #3, mul vl]
+        st1b z1.b, p0, [dest_ptr, #0, mul vl]
+        st1b z2.b, p0, [dest_ptr, #1, mul vl]
+        st1b z3.b, p0, [dest_ptr, #2, mul vl]
+        st1b z4.b, p0, [dest_ptr, #3, mul vl]
+        sub dest_ptr, dest_ptr, tmp1
+        sub src_ptr, src_ptr, tmp1
+        add rest, rest, tmp1
 
 L(L2_dc_zva):
-        // zero fill
-        and tmp1, dest, 0xffffffffffffff
-        and tmp2, src, 0xffffffffffffff
-        subs tmp1, tmp1, tmp2 // diff
-        b.ge 1f
-        neg tmp1, tmp1
-1:      mov tmp3, ZF_DIST + CACHE_LINE_SIZE * 2
-        cmp tmp1, tmp3
+        // check for overlap
+        sub tmp1, src_ptr, dest_ptr
+        and tmp1, tmp1, 0xffffffffffffff // clear tag bits
+        mov tmp2, ZF_DIST
+        cmp tmp1, tmp2
         b.lo L(unroll8)
+
+        // zero fill loop
         mov tmp1, dest_ptr
-        dc_zva (ZF_DIST / CACHE_LINE_SIZE) - 1
+        mov tmp3, ZF_DIST / CACHE_LINE_SIZE
+1:      dc zva, tmp1
+        add tmp1, tmp1, CACHE_LINE_SIZE
+        subs tmp3, tmp3, 1
+        b.ne 1b
+
+        mov tmp3, ZF_DIST + CACHE_LINE_SIZE * 2
         // unroll
-        ld1b_unroll8 // this line has to be after "b.lo L(unroll8)"
-        add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
-        sub rest, rest, CACHE_LINE_SIZE * 2
-        mov tmp1, ZF_DIST
-        .p2align 3
-1:      stld1b_unroll4a
-        add tmp2, dest_ptr, tmp1 // dest_ptr + ZF_DIST
-        dc zva, tmp2
+        ld1b_unroll8
+        add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+        sub rest, rest, CACHE_LINE_SIZE * 2
+        .p2align 4
+2:      stld1b_unroll4a
+        add tmp1, dest_ptr, tmp2 // dest_ptr + ZF_DIST
+        dc zva, tmp1
         stld1b_unroll4b
-        add tmp2, tmp2, CACHE_LINE_SIZE
-        dc zva, tmp2
+        add tmp1, tmp1, CACHE_LINE_SIZE
+        dc zva, tmp1
         add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
         add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
         sub rest, rest, CACHE_LINE_SIZE * 2
-        cmp rest, tmp3 // ZF_DIST + CACHE_LINE_SIZE * 2
-        b.ge 1b
+        cmp rest, tmp3
+        b.hs 2b
         st1b_unroll8
         add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
         b L(unroll8)
@@ -338,68 +281,50 @@ END (MEMCPY)
 libc_hidden_builtin_def (MEMCPY)
 
 
-ENTRY (MEMMOVE)
+ENTRY_ALIGN (MEMMOVE, 4)
 
         PTR_ARG (0)
         PTR_ARG (1)
         SIZE_ARG (2)
 
-        // remove tag address
-        // dest has to be immutable because it is the return value
-        // src has to be immutable because it is used in L(bwd_last)
-        and tmp2, dest, 0xffffffffffffff // save dest_notag into tmp2
-        and tmp3, src, 0xffffffffffffff // save src_notag intp tmp3
-        cmp n, 0
-        ccmp tmp2, tmp3, 4, ne
-        b.ne 1f
-        ret
-1:      cntb vector_length
-        // shortcut for less than vector_length * 8
-        // gives a free ptrue to p0.b for n >= vector_length
-        // tmp2 and tmp3 should not be used in this macro to keep
-        // notag addresses
-        shortcut_for_small_size L(dispatch)
-        // end of shortcut
-
-L(dispatch):
-        // tmp2 = dest_notag, tmp3 = src_notag
-        // diff = dest_notag - src_notag
-        sub tmp1, tmp2, tmp3
-        // if diff <= 0 || diff >= n then memcpy
-        cmp tmp1, 0
-        ccmp tmp1, n, 2, gt
-        b.cs L(vl_agnostic)
-
-L(bwd_start):
-        mov rest, n
-        add dest_ptr, dest, n // dest_end
-        add src_ptr, src, n // src_end
+        cntb vector_length
+        // diff = dest - src
+        sub tmp1, dest, src
+        ands tmp1, tmp1, 0xffffffffffffff // clear tag bits
+        b.eq L(full_overlap)
 
-L(bwd_unroll8): // unrolling and software pipeline
-        lsl tmp1, vector_length, 3 // vector_length * 8
-        .p2align 3
-        cmp rest, tmp1
-        b.cc L(bwd_last)
-        sub src_ptr, src_ptr, tmp1
+        cmp n, vector_length, lsl 3 // vector_length * 8
+        b.ls L(memmove_small)
+
+        ptrue p0.b
+        // if diff < 0 || diff >= n then memcpy
+        cmp tmp1, n
+        b.hs L(vl_agnostic)
+
+        // unrolling and software pipeline
+        lsl tmp1, vector_length, 3 // vector_length * 8
+        add dest_ptr, dest, n // dest_end
+        sub rest, n, tmp1
+        add src_ptr, src, rest // src_end
         ld1b_unroll8
-        sub rest, rest, tmp1
-        cmp rest, tmp1
+        subs rest, rest, tmp1
         b.cc 2f
-        .p2align 3
+        .p2align 4
 1:      sub src_ptr, src_ptr, tmp1
         sub dest_ptr, dest_ptr, tmp1
         stld1b_unroll8
-        sub rest, rest, tmp1
-        cmp rest, tmp1
-        b.ge 1b
+        subs rest, rest, tmp1
+        b.hs 1b
 2:      sub dest_ptr, dest_ptr, tmp1
         st1b_unroll8
-
-L(bwd_last):
+        add rest, rest, tmp1
         mov dest_ptr, dest
         mov src_ptr, src
         b L(last)
 
+L(full_overlap):
+        ret
+
 END (MEMMOVE)
 libc_hidden_builtin_def (MEMMOVE)
 # endif /* IS_IN (libc) */
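
For readers following the new MEMMOVE entry sequence, here is a rough C model of how I read its dispatch. It is only a sketch and not part of the patch: the function name `choose_path`, the enum, and the `ADDR_MASK` macro are hypothetical names made up for illustration. It assumes the old signed test "diff <= 0 || diff >= n" can be folded into one unsigned compare because a negative difference, once masked to 56 bits, becomes a very large value that exceeds any realistic n.

```c
#include <stdint.h>

/* Keeps the low 56 address bits. The patch uses the same constant
   (0xffffffffffffff) to clear the tag byte. */
#define ADDR_MASK 0x00ffffffffffffffULL

/* Hypothetical names for the four outcomes of the MEMMOVE entry code. */
enum copy_path {
    PATH_NONE,     /* L(full_overlap): dest == src, nothing to copy */
    PATH_SMALL,    /* L(memmove_small): n <= 8 * VL */
    PATH_FORWARD,  /* L(vl_agnostic): the ordinary forward memcpy path */
    PATH_BACKWARD  /* backward unrolled loop */
};

/* Mirrors the order of the checks in the patch: full overlap first,
   then the small-size shortcut, then the unsigned overlap test. */
static enum copy_path choose_path(uint64_t dest, uint64_t src,
                                  uint64_t n, uint64_t vl)
{
    uint64_t diff = (dest - src) & ADDR_MASK;  /* sub + ands */

    if (diff == 0)
        return PATH_NONE;                      /* b.eq L(full_overlap) */
    if (n <= 8 * vl)
        return PATH_SMALL;                     /* b.ls L(memmove_small) */
    if (diff >= n)
        return PATH_FORWARD;                   /* b.hs L(vl_agnostic) */
    return PATH_BACKWARD;
}
```

With a 64-byte vector length, `choose_path(0x1000, 0x1800, 0x1000, 64)` takes the forward path even though dest is below src, because the masked difference wraps to a large unsigned value. Swapping the two addresses gives a difference of 0x800, which is below n, so the backward loop is selected.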