From patchwork Wed Feb 26 16:11:53 2020
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 38325
Delivered-To: mailing list libc-alpha@sourceware.org
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: 'GNU C Library'
Subject: [PATCH][AArch64] Cleanup memset
Date: Wed, 26 Feb 2020 16:11:53 +0000
MIME-Version: 1.0
Cleanup memset.  Remove unnecessary code for unused ZVA sizes.  For zero
memsets it is faster to use DC ZVA for sizes >= 160 bytes rather than >= 256.
Add a define which allows skipping the ZVA size test when the ZVA size is
known to be 64 - reading dczid_el0 may be expensive.  This simplifies the
Falkor memset implementation.

diff --git a/sysdeps/aarch64/memset-reg.h b/sysdeps/aarch64/memset-reg.h
index 872a6b511de2bb97939e644c78676a6212c65d6d..22da7381e68bdfaa6086e39cc980c77150242d21 100644
--- a/sysdeps/aarch64/memset-reg.h
+++ b/sysdeps/aarch64/memset-reg.h
@@ -22,9 +22,4 @@
 #define count	x2
 #define dst	x3
 #define dstend	x4
-#define tmp1	x5
-#define tmp1w	w5
-#define tmp2	x6
-#define tmp2w	w6
-#define zva_len x7
-#define zva_lenw w7
+#define zva_val	x5
diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index ac577f1660e2706e2024e42d0efc09120c041af2..4654732e1bdd4a7e8f673b79ed66be408bed726b 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -78,114 +78,49 @@ L(set96):
 	stp	q0, q0, [dstend, -32]
 	ret

-	.p2align 3
-	nop
+	.p2align 4
 L(set_long):
 	and	valw, valw, 255
 	bic	dst, dstin, 15
 	str	q0, [dstin]
-	cmp	count, 256
-	ccmp	valw, 0, 0, cs
-	b.eq	L(try_zva)
-L(no_zva):
-	sub	count, dstend, dst	/* Count is 16 too large.  */
-	sub	dst, dst, 16		/* Dst is biased by -32.  */
-	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
-1:	stp	q0, q0, [dst, 32]
-	stp	q0, q0, [dst, 64]!
-L(tail64):
-	subs	count, count, 64
-	b.hi	1b
-2:	stp	q0, q0, [dstend, -64]
-	stp	q0, q0, [dstend, -32]
-	ret
-
-L(try_zva):
-#ifdef ZVA_MACRO
-	zva_macro
-#else
-	.p2align 3
-	mrs	tmp1, dczid_el0
-	tbnz	tmp1w, 4, L(no_zva)
-	and	tmp1w, tmp1w, 15
-	cmp	tmp1w, 4	/* ZVA size is 64 bytes.  */
-	b.ne	L(zva_128)
-
-	/* Write the first and last 64 byte aligned block using stp rather
-	   than using DC ZVA.  This is faster on some cores.
-	 */
-L(zva_64):
+	cmp	count, 160
+	ccmp	valw, 0, 0, hs
+	b.ne	L(no_zva)
+
+#ifndef SKIP_ZVA_CHECK
+	mrs	zva_val, dczid_el0
+	and	zva_val, zva_val, 31
+	cmp	zva_val, 4		/* ZVA size is 64 bytes.  */
+	b.ne	L(no_zva)
+#endif
 	str	q0, [dst, 16]
 	stp	q0, q0, [dst, 32]
 	bic	dst, dst, 63
-	stp	q0, q0, [dst, 64]
-	stp	q0, q0, [dst, 96]
-	sub	count, dstend, dst	/* Count is now 128 too large.  */
-	sub	count, count, 128+64+64	/* Adjust count and bias for loop.  */
-	add	dst, dst, 128
-	nop
-1:	dc	zva, dst
+	sub	count, dstend, dst	/* Count is now 64 too large.  */
+	sub	count, count, 128	/* Adjust count and bias for loop.  */
+
+	.p2align 4
+L(zva_loop):
 	add	dst, dst, 64
+	dc	zva, dst
 	subs	count, count, 64
-	b.hi	1b
-	stp	q0, q0, [dst, 0]
-	stp	q0, q0, [dst, 32]
+	b.hi	L(zva_loop)
 	stp	q0, q0, [dstend, -64]
 	stp	q0, q0, [dstend, -32]
 	ret

-	.p2align 3
-L(zva_128):
-	cmp	tmp1w, 5	/* ZVA size is 128 bytes.  */
-	b.ne	L(zva_other)
-
-	str	q0, [dst, 16]
+L(no_zva):
+	sub	count, dstend, dst	/* Count is 16 too large.  */
+	sub	dst, dst, 16		/* Dst is biased by -32.  */
+	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
+L(no_zva_loop):
 	stp	q0, q0, [dst, 32]
-	stp	q0, q0, [dst, 64]
-	stp	q0, q0, [dst, 96]
-	bic	dst, dst, 127
-	sub	count, dstend, dst	/* Count is now 128 too large.  */
-	sub	count, count, 128+128	/* Adjust count and bias for loop.  */
-	add	dst, dst, 128
-1:	dc	zva, dst
-	add	dst, dst, 128
-	subs	count, count, 128
-	b.hi	1b
-	stp	q0, q0, [dstend, -128]
-	stp	q0, q0, [dstend, -96]
+	stp	q0, q0, [dst, 64]!
+	subs	count, count, 64
+	b.hi	L(no_zva_loop)
 	stp	q0, q0, [dstend, -64]
 	stp	q0, q0, [dstend, -32]
 	ret

-L(zva_other):
-	mov	tmp2w, 4
-	lsl	zva_lenw, tmp2w, tmp1w
-	add	tmp1, zva_len, 64	/* Max alignment bytes written.  */
-	cmp	count, tmp1
-	blo	L(no_zva)
-
-	sub	tmp2, zva_len, 1
-	add	tmp1, dst, zva_len
-	add	dst, dst, 16
-	subs	count, tmp1, dst	/* Actual alignment bytes to write.  */
-	bic	tmp1, tmp1, tmp2	/* Aligned dc zva start address.  */
-	beq	2f
-1:	stp	q0, q0, [dst], 64
-	stp	q0, q0, [dst, -32]
-	subs	count, count, 64
-	b.hi	1b
-2:	mov	dst, tmp1
-	sub	count, dstend, tmp1	/* Remaining bytes to write.  */
-	subs	count, count, zva_len
-	b.lo	4f
-3:	dc	zva, dst
-	add	dst, dst, zva_len
-	subs	count, count, zva_len
-	b.hs	3b
-4:	add	count, count, zva_len
-	sub	dst, dst, 32		/* Bias dst for tail loop.  */
-	b	L(tail64)
-#endif
-
 END (MEMSET)
 libc_hidden_builtin_def (MEMSET)
diff --git a/sysdeps/aarch64/multiarch/memset_falkor.S b/sysdeps/aarch64/multiarch/memset_falkor.S
index 54fd5abffb1b6638ef8a5fc29e58b2f67765b28a..bee4aed52aab69f5aaded367d40e8b64e406c545 100644
--- a/sysdeps/aarch64/multiarch/memset_falkor.S
+++ b/sysdeps/aarch64/multiarch/memset_falkor.S
@@ -24,30 +24,8 @@
    use this function only when ZVA is enabled.  */

 #if IS_IN (libc)
-.macro zva_macro
-	.p2align 4
-	/* Write the first and last 64 byte aligned block using stp rather
-	   than using DC ZVA.  This is faster on some cores.  */
-	str	q0, [dst, 16]
-	stp	q0, q0, [dst, 32]
-	bic	dst, dst, 63
-	stp	q0, q0, [dst, 64]
-	stp	q0, q0, [dst, 96]
-	sub	count, dstend, dst	/* Count is now 128 too large.  */
-	sub	count, count, 128+64+64	/* Adjust count and bias for loop.  */
-	add	dst, dst, 128
-1:	dc	zva, dst
-	add	dst, dst, 64
-	subs	count, count, 64
-	b.hi	1b
-	stp	q0, q0, [dst, 0]
-	stp	q0, q0, [dst, 32]
-	stp	q0, q0, [dstend, -64]
-	stp	q0, q0, [dstend, -32]
-	ret
-.endm
-
-# define ZVA_MACRO zva_macro
+
+# define SKIP_ZVA_CHECK
 # define MEMSET __memset_falkor
 # include
 #endif