From patchwork Mon Apr 12 12:52:05 2021
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 42947
To: "naohirot@fujitsu.com"
Subject: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 12 Apr 2021 12:52:05 +0000
List-Id: Libc-alpha mailing list
From: Wilco Dijkstra
Cc: Szabolcs Nagy, 'GNU C Library'

Hi,

I have a few comments about memcpy design (the principles apply equally
to memset):

1. Overall the code is too large due to enormous unroll factors

Our current memcpy is about 300 bytes (that includes memmove); this
memcpy is ~12 times larger! This hurts performance because the code does
not fit in the I-cache for common copies. On a modern OoO core you need
very little unrolling since ALU operations and branches become
essentially free while the CPU executes loads and stores. So rather than
unrolling by 32-64 times, try 4 times - you just need enough to hide the
taken branch latency.

2. I don't see any special handling for small copies

Even if you want to hyper-optimize gigabyte-sized copies, small copies
are still extremely common, so you always want to handle those as
quickly (and with as little code) as possible. Special-casing small
copies does not slow down the huge copies - the reverse is more likely
since the large-copy path no longer needs to handle small cases.

3. Check whether using SVE helps small/medium copies

Run the memcpy-random benchmark to see whether it is faster to use SVE
for small cases or just the SIMD copy on your uarch.

4. Avoid making the code too general or too specialized

I see both appearing in the code - trying to deal with different
cacheline sizes and different vector lengths, and also splitting these
out into separate cases. If you depend on a particular cacheline size,
specialize the code for that and check the size in the ifunc selector
(as various memsets do already). If you want to handle multiple vector
sizes, just use a register for the increment rather than repeating the
same code several times for each vector length.

5. Odd prefetches

I have a hard time believing that first prefetching the data to be
written, then clearing it using DC ZVA (???), then prefetching the same
data a 2nd time, before finally writing the loaded data helps
performance... Generally hardware prefetchers are able to do exactly the
right thing since memcpy is trivial to prefetch. So what is the
performance gain of each prefetch/clear step? What is the difference
between memcpy and memmove performance (given memmove doesn't do any of
this)?

Cheers,
Wilco
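
As a rough illustration of points 1 and 2 above, here is a minimal
sketch in portable C. It is not taken from the patch or from glibc. The
names copy_small and sketch_memcpy are hypothetical, the 32-byte
threshold is an arbitrary assumption, and the fixed-size memcpy calls
merely stand in for single 16-byte (or narrower) loads and stores. Small
sizes are handled first with a few branches and overlapping moves; the
bulk loop is unrolled only 4 times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the patch: copy 0..32 bytes.
   Every branch loads everything before storing, and its moves may
   overlap each other, so no byte-by-byte loop is needed. */
static void copy_small(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n <= 32);
    if (n >= 16) {
        unsigned char a[16], b[16];
        memcpy(a, src, 16);
        memcpy(b, src + n - 16, 16);
        memcpy(dst, a, 16);
        memcpy(dst + n - 16, b, 16);
    } else if (n >= 8) {
        uint64_t a, b;
        memcpy(&a, src, 8);
        memcpy(&b, src + n - 8, 8);
        memcpy(dst, &a, 8);
        memcpy(dst + n - 8, &b, 8);
    } else if (n >= 4) {
        uint32_t a, b;
        memcpy(&a, src, 4);
        memcpy(&b, src + n - 4, 4);
        memcpy(dst, &a, 4);
        memcpy(dst + n - 4, &b, 4);
    } else if (n > 0) {
        /* 1..3 bytes: first, middle and last byte (may coincide). */
        unsigned char a = src[0], b = src[n >> 1], c = src[n - 1];
        dst[0] = a;
        dst[n >> 1] = b;
        dst[n - 1] = c;
    }
}

/* Hypothetical memcpy-like routine for non-overlapping buffers. */
static void *sketch_memcpy(void *dstv, const void *srcv, size_t n)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;
    size_t i = 0;

    if (n <= 32) {               /* small copies leave early */
        copy_small(dst, src, n);
        return dstv;
    }
    /* Bulk loop, unrolled 4 times: 64 bytes per iteration. */
    for (; i + 64 <= n; i += 64) {
        memcpy(dst + i,      src + i,      16);
        memcpy(dst + i + 16, src + i + 16, 16);
        memcpy(dst + i + 32, src + i + 32, 16);
        memcpy(dst + i + 48, src + i + 48, 16);
    }
    /* Up to three more whole 16-byte chunks. */
    for (; i + 16 <= n; i += 16)
        memcpy(dst + i, src + i, 16);
    /* Final 16 bytes, overlapping what was already copied (n > 32 here). */
    memcpy(dst + n - 16, src + n - 16, 16);
    return dstv;
}
```

Calling sketch_memcpy(dst, src, n) behaves like memcpy for buffers that
do not overlap. The large-copy path never has to consider sizes of 32
bytes or less, which is the effect point 2 describes.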
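
The second half of point 4 (keep the increment in a register instead of
duplicating code per vector length) can be sketched in portable C as
follows. This is an assumption-laden stand-in, not SVE code: move_vector
and copy_by_vector are hypothetical names, and the memcpy inside
move_vector represents one vector load plus one vector store. In real
SVE assembly the vector length would be read once at run time and the
partial tail would be handled with a predicate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one vector load + store of `len` bytes. */
static void move_vector(unsigned char *dst, const unsigned char *src,
                        size_t len)
{
    memcpy(dst, src, len);
}

/* One loop body serves every vector length: `vl` is an ordinary
   run-time value (a register in assembly), not a constant baked into
   a separate code path per length. */
static void copy_by_vector(unsigned char *dst, const unsigned char *src,
                           size_t n, size_t vl)
{
    size_t i = 0;

    assert(vl > 0);
    for (; i + vl <= n; i += vl)
        move_vector(dst + i, src + i, vl);
    if (i < n)                       /* partial tail */
        move_vector(dst + i, src + i, n - i);
}
```

The same function works whether vl is 16, 32 or 64 bytes, so nothing
needs to be repeated per vector length.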
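
For the first half of point 4 (specialize for one block size and check
it in the selector), a hedged sketch of the idea in portable C follows.
Real glibc selectors use glibc's internal ifunc machinery and CPU
feature data, which are not reproduced here. All names below
(zva_size_from_dczid, memset_block64, memset_generic, select_memset) are
hypothetical, and the 64-byte specialization is only an example value.
The decoding assumes the architectural layout of DCZID_EL0: bit 4 (DZP)
set means DC ZVA is prohibited, and bits [3:0] (BS) hold log2 of the
block size in 4-byte words. The register value is passed in as a
parameter so the sketch runs on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*memset_fn)(void *, int, size_t);

/* Decode a DCZID_EL0 value into a DC ZVA block size in bytes,
   or 0 if DC ZVA is prohibited. */
static unsigned zva_size_from_dczid(uint64_t dczid)
{
    if (dczid & 0x10)                /* DZP bit set */
        return 0;
    return 4u << (dczid & 0xf);      /* BS = log2(words), 4 bytes each */
}

/* Placeholder for a fully general implementation. */
static void *memset_generic(void *p, int c, size_t n)
{
    return memset(p, c, n);
}

/* Placeholder for a variant written for exactly 64-byte blocks. */
static void *memset_block64(void *p, int c, size_t n)
{
    unsigned char *d = p;
    size_t i = 0;

    for (; i + 64 <= n; i += 64)
        memset(d + i, c, 64);        /* stands in for one block operation */
    memset(d + i, c, n - i);         /* remainder */
    return p;
}

/* The selector checks the size once; the chosen routine then never
   needs to deal with other block sizes. */
static memset_fn select_memset(unsigned zva_size)
{
    return zva_size == 64 ? memset_block64 : memset_generic;
}
```

The size check runs once at selection time, so the specialized routine
contains no code for block sizes it was not written for.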