From patchwork Fri Jul  9 12:23:34 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 44294
From: Wilco Dijkstra
To: "naohirot@fujitsu.com"
Cc: 'GNU C Library'
Subject: [PATCH v2] AArch64: Improve A64FX memset
Date: Fri, 9 Jul 2021 12:23:34 +0000
List-Id: Libc-alpha mailing list

Hi Naohiro,

Here is version 2 which should improve things a lot:

v2: Improve handling of last 512 bytes which improves medium sized memsets.
    Further reduce codesize by removing unnecessary unrolling of dc zva.
    Speed up huge memsets of zero and non-zero.

Reduce the codesize of the A64FX memset by simplifying the small memset code,
better handling of alignment and last 8 vectors as well as removing redundant
instructions and branches.  The size for memset goes down from 1032 to 376
bytes.  For large zeroing memsets use DC ZVA, which almost doubles performance.
Large non-zero memsets use the unroll8 loop which is about 10% faster.

Passes GLIBC regress, OK for commit?

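For reviewers who prefer to reason about the simplified small-size path in C,
here is a rough scalar model of the new entry sequence. It is only a sketch:
VL is assumed to be 64 bytes, and store_pred, memset_small_model and
check_small_model are hypothetical names that do not exist in glibc. The point
it illustrates is that two predicated stores at the start, plus unpredicated
stores counted back from the end, cover every size up to 8 vectors without a
loop, at the cost of some overlapping writes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VL 64 /* assumed vector length in bytes (512-bit SVE) */

/* Hypothetical stand-in for a predicated st1b: writes only the bytes of the
   vector at base+off that lie below base+count, and nothing otherwise.  */
static void store_pred(unsigned char *base, size_t off, size_t count, int c)
{
    if (off < count) {
        size_t n = count - off;
        memset(base + off, c, n < VL ? n : VL);
    }
}

/* Scalar model of the entry sequence.  Returns 1 if the request was fully
   handled, 0 if it would branch on to the large-size code (count > 8 * VL).  */
int memset_small_model(unsigned char *dstin, int c, size_t count)
{
    unsigned char *dstend = dstin + count;

    store_pred(dstin, 0, count, c);   /* st1b with p0 */
    store_pred(dstin, VL, count, c);  /* st1b with p1 */
    if (count < 2 * VL)               /* b.last not taken */
        return 1;

    if (count <= 4 * VL) {
        /* Last two vectors, addressed backwards from the end.  */
        memset(dstend - 2 * VL, c, 2 * VL);
        return 1;
    }
    if (count > 8 * VL)
        return 0;

    /* Vectors 2 and 3 from the start, then the last four from the end.  */
    memset(dstin + 2 * VL, c, 2 * VL);
    memset(dstend - 4 * VL, c, 4 * VL);
    return 1;
}

/* Checks every size from 0 to 8 * VL with guard bytes on both sides and
   returns the number of sizes checked.  */
int check_small_model(void)
{
    unsigned char buf[8 * VL + 2];
    int checked = 0;

    for (size_t count = 0; count <= 8 * VL; count++) {
        memset(buf, 0xAA, sizeof buf);
        assert(memset_small_model(buf + 1, 0x55, count) == 1);
        assert(buf[0] == 0xAA && buf[count + 1] == 0xAA);
        for (size_t i = 0; i < count; i++)
            assert(buf[1 + i] == 0x55);
        checked++;
    }
    return checked;
}
```

Calling check_small_model() walks all sizes in that range. Sizes above 8 * VL
are reported as unhandled, which corresponds to the branch to L(vl_agnostic).
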
diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
index ce54e5418b08c8bc0ecc7affff68a59272ba6397..2737f0cba3e1a9ac887cd8072f6122f4852a9f94 100644
--- a/sysdeps/aarch64/multiarch/memset_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -30,11 +30,7 @@
 #define L2_SIZE         (8*1024*1024)	// L2 8MB - 1MB
 #define CACHE_LINE_SIZE	256
 #define PF_DIST_L1	(CACHE_LINE_SIZE * 16)	// Prefetch distance L1
-#define ZF_DIST		(CACHE_LINE_SIZE * 21)	// Zerofill distance
-#define rest		x8
 #define vector_length	x9
-#define vl_remainder	x10	// vector_length remainder
-#define cl_remainder	x11	// CACHE_LINE_SIZE remainder
 
 #if HAVE_AARCH64_SVE_ASM
 # if IS_IN (libc)
@@ -42,224 +38,126 @@
 
 	.arch armv8.2-a+sve
 
-	.macro dc_zva times
-	dc	zva, tmp1
-	add	tmp1, tmp1, CACHE_LINE_SIZE
-	.if \times-1
-	dc_zva "(\times-1)"
-	.endif
-	.endm
-
 	.macro st1b_unroll first=0, last=7
-	st1b	z0.b, p0, [dst, #\first, mul vl]
+	st1b	z0.b, p0, [dst, \first, mul vl]
 	.if \last-\first
 	st1b_unroll "(\first+1)", \last
 	.endif
 	.endm
 
-	.macro	shortcut_for_small_size exit
-	// if rest <= vector_length * 2
+
+#undef BTI_C
+#define BTI_C
+
+ENTRY (MEMSET)
+	PTR_ARG (0)
+	SIZE_ARG (2)
+
+	dup	z0.b, valw
 	whilelo	p0.b, xzr, count
+	cntb	vector_length
 	whilelo	p1.b, vector_length, count
+	st1b	z0.b, p0, [dstin, 0, mul vl]
+	st1b	z0.b, p1, [dstin, 1, mul vl]
 	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
 	ret
-1:	// if rest > vector_length * 8
-	cmp	count, vector_length, lsl 3	// vector_length * 8
-	b.hi	\exit
-	// if rest <= vector_length * 4
-	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, count
-	incb	tmp1
-	whilelo	p3.b, tmp1, count
-	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	ret
-1:	// if rest <= vector_length * 8
-	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, count
-	incb	tmp1
-	whilelo	p5.b, tmp1, count
-	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	st1b	z0.b, p4, [dstin, #4, mul vl]
-	st1b	z0.b, p5, [dstin, #5, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	incb	tmp1			// vector_length * 5
-	incb	tmp1			// vector_length * 6
-	whilelo	p6.b, tmp1, count
-	incb	tmp1
-	whilelo	p7.b, tmp1, count
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	st1b	z0.b, p4, [dstin, #4, mul vl]
-	st1b	z0.b, p5, [dstin, #5, mul vl]
-	st1b	z0.b, p6, [dstin, #6, mul vl]
-	st1b	z0.b, p7, [dstin, #7, mul vl]
-	ret
-	.endm
 
-ENTRY (MEMSET)
-
-	PTR_ARG (0)
-	SIZE_ARG (2)
+	// count >= vector_length * 2
+	.p2align 4
+1:	add	dst, dstin, count
+	cmp	count, vector_length, lsl 2
+	b.hi	1f
+	st1b	z0.b, p0, [dst, -2, mul vl]
+	st1b	z0.b, p0, [dst, -1, mul vl]
+	ret
 
-	cbnz	count, 1f
+	// count > vector_length * 4
+1:	cmp	count, vector_length, lsl 3
+	b.hi	L(vl_agnostic)
+	st1b	z0.b, p0, [dstin, 2, mul vl]
+	st1b	z0.b, p0, [dstin, 3, mul vl]
+	st1b	z0.b, p0, [dst, -4, mul vl]
+	st1b	z0.b, p0, [dst, -3, mul vl]
+	st1b	z0.b, p0, [dst, -2, mul vl]
+	st1b	z0.b, p0, [dst, -1, mul vl]
 	ret
-1:	dup	z0.b, valw
-	cntb	vector_length
-	// shortcut for less than vector_length * 8
-	// gives a free ptrue to p0.b for n >= vector_length
-	shortcut_for_small_size L(vl_agnostic)
-	// end of shortcut
 
-L(vl_agnostic): // VL Agnostic
-	mov	rest, count
+	// count >= vector_length * 8
+	.p2align 4
+L(vl_agnostic):
 	mov	dst, dstin
-	add	dstend, dstin, count
-	// if rest >= L2_SIZE && vector_length == 64 then L(L2)
 	mov	tmp1, 64
-	cmp	rest, L2_SIZE
-	ccmp	vector_length, tmp1, 0, cs
-	b.eq	L(L2)
-	// if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
-	cmp	rest, L1_SIZE
+	// if count >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
+	cmp	count, L1_SIZE
 	ccmp	vector_length, tmp1, 0, cs
 	b.eq	L(L1_prefetch)
 
-L(unroll32):
-	lsl	tmp1, vector_length, 3	// vector_length * 8
-	lsl	tmp2, vector_length, 5	// vector_length * 32
-	.p2align 3
-1:	cmp	rest, tmp2
-	b.cc	L(unroll8)
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	sub	rest, rest, tmp2
-	b	1b
-
+	// count >= 8 * vector_length
 L(unroll8):
 	lsl	tmp1, vector_length, 3
-	.p2align 3
-1:	cmp	rest, tmp1
-	b.cc	L(last)
-	st1b_unroll
+	sub	count, count, tmp1
+	lsl	tmp2, vector_length, 1
+	.p2align 4
+1:	subs	count, count, tmp1
+	st1b_unroll 0, 7
 	add	dst, dst, tmp1
-	sub	rest, rest, tmp1
-	b	1b
-
-L(last):
-	whilelo	p0.b, xzr, rest
-	whilelo	p1.b, vector_length, rest
-	b.last	1f
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, rest
-	incb	tmp1
-	whilelo	p3.b, tmp1, rest
-	b.last	1f
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	st1b	z0.b, p2, [dst, #2, mul vl]
-	st1b	z0.b, p3, [dst, #3, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, rest
-	incb	tmp1
-	whilelo	p5.b, tmp1, rest
-	incb	tmp1
-	whilelo	p6.b, tmp1, rest
-	incb	tmp1
-	whilelo	p7.b, tmp1, rest
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	st1b	z0.b, p2, [dst, #2, mul vl]
-	st1b	z0.b, p3, [dst, #3, mul vl]
-	st1b	z0.b, p4, [dst, #4, mul vl]
-	st1b	z0.b, p5, [dst, #5, mul vl]
-	st1b	z0.b, p6, [dst, #6, mul vl]
-	st1b	z0.b, p7, [dst, #7, mul vl]
+	b.hi	1b
+
+	add	dst, dst, count
+	add	count, count, tmp1
+	cmp	count, tmp2
+	b.ls	2f
+	add	tmp2, vector_length, vector_length, lsl 2
+	cmp	count, tmp2
+	b.ls	5f
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z0.b, p0, [dst, 1, mul vl]
+	st1b	z0.b, p0, [dst, 2, mul vl]
+5:	st1b	z0.b, p0, [dst, 3, mul vl]
+	st1b	z0.b, p0, [dst, 4, mul vl]
+	st1b	z0.b, p0, [dst, 5, mul vl]
+2:	st1b	z0.b, p0, [dst, 6, mul vl]
+	st1b	z0.b, p0, [dst, 7, mul vl]
 	ret
 
-L(L1_prefetch): // if rest >= L1_SIZE
+	// count >= L1_SIZE
 	.p2align 3
+L(L1_prefetch):
+	cmp	count, L2_SIZE
+	b.hs	L(L2)
 1:	st1b_unroll 0, 3
 	prfm	pstl1keep, [dst, PF_DIST_L1]
 	st1b_unroll 4, 7
 	prfm	pstl1keep, [dst, PF_DIST_L1 + CACHE_LINE_SIZE]
 	add	dst, dst, CACHE_LINE_SIZE * 2
-	sub	rest, rest, CACHE_LINE_SIZE * 2
-	cmp	rest, L1_SIZE
-	b.ge	1b
-	cbnz	rest, L(unroll32)
-	ret
+	sub	count, count, CACHE_LINE_SIZE * 2
+	cmp	count, PF_DIST_L1
+	b.hs	1b
+	b	L(unroll8)
 
+	// count >= L2_SIZE
 L(L2):
-	// align dst address at vector_length byte boundary
-	sub	tmp1, vector_length, 1
-	ands	tmp2, dst, tmp1
-	// if vl_remainder == 0
-	b.eq	1f
-	sub	vl_remainder, vector_length, tmp2
-	// process remainder until the first vector_length boundary
-	whilelt	p2.b, xzr, vl_remainder
-	st1b	z0.b, p2, [dst]
-	add	dst, dst, vl_remainder
-	sub	rest, rest, vl_remainder
-	// align dstin address at CACHE_LINE_SIZE byte boundary
-1:	mov	tmp1, CACHE_LINE_SIZE
-	ands	tmp2, dst, CACHE_LINE_SIZE - 1
-	// if cl_remainder == 0
-	b.eq	L(L2_dc_zva)
-	sub	cl_remainder, tmp1, tmp2
-	// process remainder until the first CACHE_LINE_SIZE boundary
-	mov	tmp1, xzr	// index
-2:	whilelt	p2.b, tmp1, cl_remainder
-	st1b	z0.b, p2, [dst, tmp1]
-	incb	tmp1
-	cmp	tmp1, cl_remainder
-	b.lo	2b
-	add	dst, dst, cl_remainder
-	sub	rest, rest, cl_remainder
-
-L(L2_dc_zva):
-	// zero fill
-	mov	tmp1, dst
-	dc_zva	(ZF_DIST / CACHE_LINE_SIZE) - 1
-	mov	zva_len, ZF_DIST
-	add	tmp1, zva_len, CACHE_LINE_SIZE * 2
-	// unroll
-	.p2align 3
-1:	st1b_unroll 0, 3
-	add	tmp2, dst, zva_len
-	dc	zva, tmp2
-	st1b_unroll 4, 7
-	add	tmp2, tmp2, CACHE_LINE_SIZE
-	dc	zva, tmp2
-	add	dst, dst, CACHE_LINE_SIZE * 2
-	sub	rest, rest, CACHE_LINE_SIZE * 2
-	cmp	rest, tmp1	// ZF_DIST + CACHE_LINE_SIZE * 2
-	b.ge	1b
-	cbnz	rest, L(unroll8)
-	ret
+	tst	valw, 255
+	b.ne	L(unroll8)
+	// align dst to CACHE_LINE_SIZE byte boundary
+	and	tmp1, dst, CACHE_LINE_SIZE - 1
+	sub	tmp1, tmp1, CACHE_LINE_SIZE
+	st1b	z0.b, p0, [dst, 0, mul vl]
+	st1b	z0.b, p0, [dst, 1, mul vl]
+	st1b	z0.b, p0, [dst, 2, mul vl]
+	st1b	z0.b, p0, [dst, 3, mul vl]
+	sub	dst, dst, tmp1
+	add	count, count, tmp1
+
+	// clear cachelines using DC ZVA
+	sub	count, count, CACHE_LINE_SIZE * 4
+	.p2align 4
+1:	dc	zva, dst
+	add	dst, dst, CACHE_LINE_SIZE
+	subs	count, count, CACHE_LINE_SIZE
+	b.hs	1b
+	add	count, count, CACHE_LINE_SIZE * 4
+	b	L(unroll8)
 
 END (MEMSET)
 libc_hidden_builtin_def (MEMSET)
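
The pointer and counter arithmetic in the new L(L2) path is compact, so a
small C model may help when checking it. This is a sketch under stated
assumptions, not code from the patch: addresses are plain integers, zva_plan,
plan_zva and check_zva_plan are hypothetical names, and the model only
requires count >= 5 * CACHE_LINE_SIZE, whereas the real path is entered with
count >= L2_SIZE and a zero fill value. It mirrors the and/sub alignment step
and the subs/b.hs loop, and checks that dst ends up cache-line aligned, that
DC ZVA never runs past the end, and that between 768 and 1023 bytes are left
over for L(unroll8).

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 256

/* Hypothetical summary of how one pass through L(L2) splits the buffer.  */
struct zva_plan {
    int64_t head;        /* bytes before the first aligned cache line */
    int64_t aligned_dst; /* dst after the alignment step */
    int64_t zva_lines;   /* number of DC ZVA instructions executed */
    int64_t tail;        /* bytes handed on to L(unroll8) */
};

struct zva_plan plan_zva(int64_t dst, int64_t count)
{
    struct zva_plan p;

    /* and tmp1, dst, 255; sub tmp1, tmp1, 256  ->  tmp1 in [-256, -1] */
    int64_t tmp1 = (dst & (CACHE_LINE_SIZE - 1)) - CACHE_LINE_SIZE;

    /* The four 64-byte vector stores cover [dst, dst + 256) here.  */
    p.head = -tmp1;
    dst -= tmp1;      /* sub dst, dst, tmp1: next cache-line boundary */
    count += tmp1;    /* add count, count, tmp1 */
    p.aligned_dst = dst;

    count -= CACHE_LINE_SIZE * 4;
    p.zva_lines = 0;
    do {
        p.zva_lines++;             /* dc zva, dst */
        dst += CACHE_LINE_SIZE;
        count -= CACHE_LINE_SIZE;  /* subs */
    } while (count >= 0);          /* b.hs: no borrow */

    p.tail = count + CACHE_LINE_SIZE * 4;
    return p;
}

/* Sweeps misalignments and sizes; returns the number of cases checked.  */
int check_zva_plan(void)
{
    int checked = 0;

    for (int64_t dst = 0; dst < 2 * CACHE_LINE_SIZE; dst++) {
        for (int64_t count = 1280; count <= 1280 + 1024; count++) {
            struct zva_plan p = plan_zva(dst, count);
            assert(p.head >= 1 && p.head <= CACHE_LINE_SIZE);
            assert(p.aligned_dst == dst + p.head);
            assert(p.aligned_dst % CACHE_LINE_SIZE == 0);
            assert(p.tail >= 768 && p.tail <= 1023);
            assert(p.head + p.zva_lines * CACHE_LINE_SIZE + p.tail == count);
            checked++;
        }
    }
    return checked;
}
```

Since the tail is always at least 768 bytes, the count >= 8 * vector_length
precondition of L(unroll8) holds for the 64-byte vector length this path
requires.
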