From patchwork Wed Jun 30 15:49:52 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 44066
From: Wilco Dijkstra
To: "naohirot@fujitsu.com"
Cc: 'GNU C Library'
Subject: [PATCH] AArch64: Improve A64FX memset
Date: Wed, 30 Jun 2021 15:49:52 +0000
List-Id: Libc-alpha mailing list

Hi Naohiro,

And here is the memset version. The code is smaller, easier to follow, and
a bit faster. One thing I noticed is that it does not optimize for the common
case of memset of zero (the generic memset is significantly faster for large
sizes). It is possible to just use DC ZVA for zeroing memsets and not do any
vector stores.

Reduce the codesize of the A64FX memset by simplifying the small memset code,
handling alignment and the last 8 vectors better, and removing redundant
instructions and branches. The size of memset goes down from 1032 to 604
bytes. Performance is noticeably better for small memsets.

Passes GLIBC regress, OK for commit?

diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
index ce54e5418b08c8bc0ecc7affff68a59272ba6397..da8930c2b0e5ab552943331e9a1aa355e917e775 100644
--- a/sysdeps/aarch64/multiarch/memset_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -57,149 +57,78 @@
 	.endif
 	.endm
 
-	.macro shortcut_for_small_size exit
-	// if rest <= vector_length * 2
+
+#undef BTI_C
+#define BTI_C
+
+ENTRY (MEMSET)
+
+	PTR_ARG (0)
+	SIZE_ARG (2)
+
+	dup	z0.b, valw
 	whilelo	p0.b, xzr, count
+	cntb	vector_length
 	whilelo	p1.b, vector_length, count
-	b.last	1f
 	st1b	z0.b, p0, [dstin, #0, mul vl]
 	st1b	z0.b, p1, [dstin, #1, mul vl]
-	ret
-1:	// if rest > vector_length * 8
-	cmp	count, vector_length, lsl 3	// vector_length * 8
-	b.hi	\exit
-	// if rest <= vector_length * 4
-	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, count
-	incb	tmp1
-	whilelo	p3.b, tmp1, count
 	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
 	ret
-1:	// if rest <= vector_length * 8
-	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, count
-	incb	tmp1
-	whilelo	p5.b, tmp1, count
-	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	st1b	z0.b, p4, [dstin, #4, mul vl]
-	st1b	z0.b, p5, [dstin, #5, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	incb	tmp1			// vector_length * 5
-	incb	tmp1			// vector_length * 6
-	whilelo	p6.b, tmp1, count
-	incb	tmp1
-	whilelo	p7.b, tmp1, count
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	st1b	z0.b, p4, [dstin, #4, mul vl]
-	st1b	z0.b, p5, [dstin, #5, mul vl]
-	st1b	z0.b, p6, [dstin, #6, mul vl]
-	st1b	z0.b, p7, [dstin, #7, mul vl]
-	ret
-	.endm
 
-ENTRY (MEMSET)
-
-	PTR_ARG (0)
-	SIZE_ARG (2)
-
-	cbnz	count, 1f
+	.p2align 4
+1:
+	add	dst, dstin, count
+	cmp	count, vector_length, lsl 2
+	b.hi	1f
+	st1b	z0.b, p0, [dst, #-2, mul vl]
+	st1b	z0.b, p0, [dst, #-1, mul vl]
+	ret
+1:
+	cmp	count, vector_length, lsl 3	// vector_length * 8
+	b.hi	L(vl_agnostic)
+
+	st1b	z0.b, p0, [dstin, #2, mul vl]
+	st1b	z0.b, p0, [dstin, #3, mul vl]
+	st1b	z0.b, p0, [dst, #-4, mul vl]
+	st1b	z0.b, p0, [dst, #-3, mul vl]
+	st1b	z0.b, p0, [dst, #-2, mul vl]
+	st1b	z0.b, p0, [dst, #-1, mul vl]
 	ret
-1:	dup	z0.b, valw
-	cntb	vector_length
-	// shortcut for less than vector_length * 8
-	// gives a free ptrue to p0.b for n >= vector_length
-	shortcut_for_small_size L(vl_agnostic)
-	// end of shortcut
 
 L(vl_agnostic): // VL Agnostic
 	mov	rest, count
 	mov	dst, dstin
-	add	dstend, dstin, count
-	// if rest >= L2_SIZE && vector_length == 64 then L(L2)
 	mov	tmp1, 64
-	cmp	rest, L2_SIZE
-	ccmp	vector_length, tmp1, 0, cs
-	b.eq	L(L2)
 	// if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
 	cmp	rest, L1_SIZE
 	ccmp	vector_length, tmp1, 0, cs
 	b.eq	L(L1_prefetch)
 
-L(unroll32):
-	lsl	tmp1, vector_length, 3	// vector_length * 8
-	lsl	tmp2, vector_length, 5	// vector_length * 32
-	.p2align 3
-1:	cmp	rest, tmp2
-	b.cc	L(unroll8)
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	sub	rest, rest, tmp2
-	b	1b
-
 L(unroll8):
 	lsl	tmp1, vector_length, 3
-	.p2align 3
+	.p2align 4
 1:	cmp	rest, tmp1
-	b.cc	L(last)
+	b.ls	L(last)
 	st1b_unroll
 	add	dst, dst, tmp1
 	sub	rest, rest, tmp1
 	b	1b
 
-L(last):
-	whilelo	p0.b, xzr, rest
-	whilelo	p1.b, vector_length, rest
-	b.last	1f
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, rest
-	incb	tmp1
-	whilelo	p3.b, tmp1, rest
-	b.last	1f
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	st1b	z0.b, p2, [dst, #2, mul vl]
-	st1b	z0.b, p3, [dst, #3, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, rest
-	incb	tmp1
-	whilelo	p5.b, tmp1, rest
-	incb	tmp1
-	whilelo	p6.b, tmp1, rest
-	incb	tmp1
-	whilelo	p7.b, tmp1, rest
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	st1b	z0.b, p2, [dst, #2, mul vl]
-	st1b	z0.b, p3, [dst, #3, mul vl]
-	st1b	z0.b, p4, [dst, #4, mul vl]
-	st1b	z0.b, p5, [dst, #5, mul vl]
-	st1b	z0.b, p6, [dst, #6, mul vl]
-	st1b	z0.b, p7, [dst, #7, mul vl]
+L(last): // store 8 vectors from the end
+	add	dst, dst, rest
+	st1b	z0.b, p0, [dst, #-8, mul vl]
+	st1b	z0.b, p0, [dst, #-7, mul vl]
+	st1b	z0.b, p0, [dst, #-6, mul vl]
+	st1b	z0.b, p0, [dst, #-5, mul vl]
+	st1b	z0.b, p0, [dst, #-4, mul vl]
+	st1b	z0.b, p0, [dst, #-3, mul vl]
+	st1b	z0.b, p0, [dst, #-2, mul vl]
+	st1b	z0.b, p0, [dst, #-1, mul vl]
 	ret
 
 L(L1_prefetch): // if rest >= L1_SIZE
+	cmp	rest, L2_SIZE
+	b.hs	L(L2)
 	.p2align 3
 1:	st1b_unroll 0, 3
 	prfm	pstl1keep, [dst, PF_DIST_L1]
@@ -208,37 +137,19 @@ L(L1_prefetch): // if rest >= L1_SIZE
 	add	dst, dst, CACHE_LINE_SIZE * 2
 	sub	rest, rest, CACHE_LINE_SIZE * 2
 	cmp	rest, L1_SIZE
-	b.ge	1b
-	cbnz	rest, L(unroll32)
-	ret
+	b.hs	1b
+	b	L(unroll8)
 
 L(L2):
-	// align dst address at vector_length byte boundary
-	sub	tmp1, vector_length, 1
-	ands	tmp2, dst, tmp1
-	// if vl_remainder == 0
-	b.eq	1f
-	sub	vl_remainder, vector_length, tmp2
-	// process remainder until the first vector_length boundary
-	whilelt	p2.b, xzr, vl_remainder
-	st1b	z0.b, p2, [dst]
-	add	dst, dst, vl_remainder
-	sub	rest, rest, vl_remainder
 	// align dstin address at CACHE_LINE_SIZE byte boundary
-1:	mov	tmp1, CACHE_LINE_SIZE
-	ands	tmp2, dst, CACHE_LINE_SIZE - 1
-	// if cl_remainder == 0
-	b.eq	L(L2_dc_zva)
-	sub	cl_remainder, tmp1, tmp2
-	// process remainder until the first CACHE_LINE_SIZE boundary
-	mov	tmp1, xzr	// index
-2:	whilelt	p2.b, tmp1, cl_remainder
-	st1b	z0.b, p2, [dst, tmp1]
-	incb	tmp1
-	cmp	tmp1, cl_remainder
-	b.lo	2b
-	add	dst, dst, cl_remainder
-	sub	rest, rest, cl_remainder
+	and	tmp1, dst, CACHE_LINE_SIZE - 1
+	sub	tmp1, tmp1, CACHE_LINE_SIZE
+	st1b	z0.b, p0, [dst, #0, mul vl]
+	st1b	z0.b, p0, [dst, #1, mul vl]
+	st1b	z0.b, p0, [dst, #2, mul vl]
+	st1b	z0.b, p0, [dst, #3, mul vl]
+	sub	dst, dst, tmp1
+	add	rest, rest, tmp1
 
 L(L2_dc_zva):
 	// zero fill
@@ -250,16 +161,15 @@ L(L2_dc_zva):
 	.p2align 3
 1:	st1b_unroll 0, 3
 	add	tmp2, dst, zva_len
-	dc	zva, tmp2
+	dc	zva, tmp2
 	st1b_unroll 4, 7
 	add	tmp2, tmp2, CACHE_LINE_SIZE
 	dc	zva, tmp2
 	add	dst, dst, CACHE_LINE_SIZE * 2
 	sub	rest, rest, CACHE_LINE_SIZE * 2
 	cmp	rest, tmp1	// ZF_DIST + CACHE_LINE_SIZE * 2
-	b.ge	1b
-	cbnz	rest, L(unroll8)
-	ret
+	b.hs	1b
+	b	L(unroll8)
 
 END (MEMSET)
 libc_hidden_builtin_def (MEMSET)
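
For anyone reviewing the tail and alignment handling, here is a rough C model of the two ideas. It is only a sketch, not code from the patch. The names VL, CACHE_LINE, store_vec, set_large and bytes_to_next_line are hypothetical. The values 64 and 256 are assumed A64FX figures (vector length and CACHE_LINE_SIZE), and memset stands in for one unpredicated st1b. The first function mirrors L(unroll8) followed by L(last): whole 8-vector blocks while more than 8 vectors remain, then 8 vectors that end exactly at the last byte and may overlap bytes already written. The second mirrors the L(L2) arithmetic, where 4 vector stores cover one cache line and dst then moves up to the next line boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed A64FX values; hypothetical names, not taken from the patch. */
#define VL ((size_t) 64)           /* bytes per SVE vector */
#define CACHE_LINE ((size_t) 256)  /* bytes per cache line */

/* Stand-in for one unpredicated st1b: write VL bytes of c at p. */
static void store_vec(unsigned char *p, int c)
{
    memset(p, c, VL);
}

/* Model of L(unroll8) + L(last). Requires n >= 8 * VL, so the
   overlapping tail stores never start before dst. */
static void set_large(unsigned char *dst, int c, size_t n)
{
    assert(n >= 8 * VL);
    size_t rest = n;

    /* Corresponds to "cmp rest, tmp1; b.ls L(last)": keep looping
       only while strictly more than 8 vectors remain. */
    while (rest > 8 * VL) {
        for (size_t i = 0; i < 8; i++)
            store_vec(dst + i * VL, c);
        dst += 8 * VL;
        rest -= 8 * VL;
    }

    /* Store 8 vectors from the end; they may overlap earlier stores. */
    unsigned char *end = dst + rest;
    for (size_t i = 8; i >= 1; i--)
        store_vec(end - i * VL, c);
}

/* Model of the L(L2) step: bytes from addr to the next cache-line
   boundary, in the range 1..CACHE_LINE. An already aligned address
   skips a whole line, which the 4 preceding stores have covered. */
static size_t bytes_to_next_line(uintptr_t addr)
{
    return CACHE_LINE - (addr & (CACHE_LINE - 1));
}
```

The point of the overlapping tail is that no predicate setup or size-dependent branching is needed once count is known to be at least 8 vectors; the cost is a few redundant byte writes.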