From patchwork Wed Feb 26 16:18:32 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 38326
Delivered-To: mailing list libc-alpha@sourceware.org
From: Wilco Dijkstra
To: 'GNU C Library'
Subject: Re:
 [PATCH v2][AArch64] Improve integer memcpy
Date: Wed, 26 Feb 2020 16:18:32 +0000

Version 2 fixes white space and uses ENTRY_ALIGN rather than ENTRY:

Further optimize integer memcpy.  Small cases now include copies up to 32
bytes.  64-128 byte copies are split into two cases to improve performance
of 64-96 byte copies.  Comments have been rewritten.

The attached graph shows how the new memcpy (memcpy_new) performs against
the current generic memcpy and the previous version (memcpy.S before
commit b9f145df85).

Passes GLIBC tests.

Reviewed-by: Adhemerval Zanella
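For illustration, here is a rough C model of the copy strategy described
above.  This sketch is not part of the patch; memcpy_model and copy_block
are hypothetical names invented for it, and the assembly's instruction
scheduling and register use are ignored.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size unaligned block copy; with a constant size this compiles to
   plain unaligned loads and stores, like the ldp/stp pairs below.  */
static void copy_block (unsigned char *d, const unsigned char *s, size_t n)
{
  memcpy (d, s, n);
}

void *memcpy_model (void *restrict dstin, const void *restrict src,
                    size_t count)
{
  unsigned char *dst = dstin;
  const unsigned char *s = src;

  if (count <= 32)
    {
      /* Small copies: one block from the start and one from the end; the
         two destination blocks may overlap in the middle, which is
         harmless.  */
      if (count >= 16)
        {
          copy_block (dst, s, 16);
          copy_block (dst + count - 16, s + count - 16, 16);
        }
      else if (count >= 8)
        {
          copy_block (dst, s, 8);
          copy_block (dst + count - 8, s + count - 8, 8);
        }
      else if (count >= 4)
        {
          copy_block (dst, s, 4);
          copy_block (dst + count - 4, s + count - 4, 4);
        }
      else if (count > 0)
        {
          /* 1-3 bytes, branchless: first, middle and last byte (some of
             these are the same byte when count is 1 or 2).  */
          dst[0] = s[0];
          dst[count >> 1] = s[count >> 1];
          dst[count - 1] = s[count - 1];
        }
    }
  else if (count <= 128)
    {
      /* Medium copies: 32 bytes from the start and 32 from the end, plus a
         middle pair of 32-byte blocks only when needed.  Copies of 64-96
         bytes skip the second extra block, which is the new split the
         description above mentions.  */
      copy_block (dst, s, 32);
      copy_block (dst + count - 32, s + count - 32, 32);
      if (count > 64)
        {
          copy_block (dst + 32, s + 32, 32);
          if (count > 96)
            copy_block (dst + count - 64, s + count - 64, 32);
        }
    }
  else
    {
      /* Large copies: copy 16 bytes, round dst up to 16-byte alignment,
         loop over 64-byte chunks, then copy the last 64 bytes from the end
         so that no scalar tail loop is needed.  */
      copy_block (dst, s, 16);
      size_t skew = 16 - ((uintptr_t) dst & 15);
      dst += skew;
      s += skew;
      count -= skew;
      while (count > 64)
        {
          copy_block (dst, s, 64);
          dst += 64;
          s += 64;
          count -= 64;
        }
      copy_block (dst + count - 64, s + count - 64, 64);
    }
  return dstin;
}

The common trick in every size class is to anchor one block copy at the
start of the buffer and one at the end, so no byte-by-byte tail handling is
ever required.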
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index ff720c800ed0ca3afac03d19ba02f67817b3422e..d31f7bb38eaf91692fb90f1c313b5e276fdf975b 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -33,11 +33,11 @@
 #define A_l	x6
 #define A_lw	w6
 #define A_h	x7
-#define A_hw	w7
 #define B_l	x8
 #define B_lw	w8
 #define B_h	x9
 #define C_l	x10
+#define C_lw	w10
 #define C_h	x11
 #define D_l	x12
 #define D_h	x13
@@ -51,16 +51,6 @@
 #define H_h	srcend
 #define tmp1	x14
 
-/* Copies are split into 3 main cases: small copies of up to 32 bytes,
-   medium copies of 33..128 bytes which are fully unrolled.  Large copies
-   of more than 128 bytes align the destination and use an unrolled loop
-   processing 64 bytes per iteration.
-   In order to share code with memmove, small and medium copies read all
-   data before writing, allowing any kind of overlap.  So small, medium
-   and large backwards memmoves are handled by falling through into memcpy.
-   Overlapping large forward memmoves use a loop that copies backwards.
-*/
-
 #ifndef MEMMOVE
 # define MEMMOVE memmove
 #endif
@@ -68,128 +58,124 @@
 # define MEMCPY memcpy
 #endif
 
-ENTRY_ALIGN (MEMMOVE, 6)
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.
 
-	DELOUSE (0)
-	DELOUSE (1)
-	DELOUSE (2)
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.
 
-	sub	tmp1, dstin, src
-	cmp	count, 128
-	ccmp	tmp1, count, 2, hi
-	b.lo	L(move_long)
-
-	/* Common case falls through into memcpy.  */
-END (MEMMOVE)
-libc_hidden_builtin_def (MEMMOVE)
-ENTRY (MEMCPY)
+   Large copies use a software pipelined loop processing 64 bytes per iteration.
+   The destination pointer is 16-byte aligned to minimize unaligned accesses.
+   The loop tail is handled by always copying 64 bytes from the end.
+*/
+ENTRY_ALIGN (MEMCPY, 6)
 
 	DELOUSE (0)
 	DELOUSE (1)
 	DELOUSE (2)
 
-	prfm	PLDL1KEEP, [src]
 	add	srcend, src, count
 	add	dstend, dstin, count
-	cmp	count, 32
-	b.ls	L(copy32)
 	cmp	count, 128
 	b.hi	L(copy_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
 
-	/* Medium copies: 33..128 bytes.  */
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
 	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [src, 16]
-	ldp	C_l, C_h, [srcend, -32]
 	ldp	D_l, D_h, [srcend, -16]
-	cmp	count, 64
-	b.hi	L(copy128)
 	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstin, 16]
-	stp	C_l, C_h, [dstend, -32]
 	stp	D_l, D_h, [dstend, -16]
 	ret
 
-	.p2align 4
-	/* Small copies: 0..32 bytes.  */
-L(copy32):
-	/* 16-32 bytes.  */
-	cmp	count, 16
-	b.lo	1f
-	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [srcend, -16]
-	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstend, -16]
-	ret
-
-	.p2align 4
-1:
-	/* 8-15 bytes.  */
-	tbz	count, 3, 1f
+	/* Copy 8-15 bytes.  */
+L(copy16):
+	tbz	count, 3, L(copy8)
 	ldr	A_l, [src]
 	ldr	A_h, [srcend, -8]
 	str	A_l, [dstin]
 	str	A_h, [dstend, -8]
 	ret
-	.p2align 4
-1:
-	/* 4-7 bytes.  */
-	tbz	count, 2, 1f
+
+	.p2align 3
+	/* Copy 4-7 bytes.  */
+L(copy8):
+	tbz	count, 2, L(copy4)
 	ldr	A_lw, [src]
-	ldr	A_hw, [srcend, -4]
+	ldr	B_lw, [srcend, -4]
 	str	A_lw, [dstin]
-	str	A_hw, [dstend, -4]
+	str	B_lw, [dstend, -4]
 	ret
 
-	/* Copy 0..3 bytes.  Use a branchless sequence that copies the same
-	   byte 3 times if count==1, or the 2nd byte twice if count==2.  */
-1:
-	cbz	count, 2f
+	/* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+	cbz	count, L(copy0)
 	lsr	tmp1, count, 1
 	ldrb	A_lw, [src]
-	ldrb	A_hw, [srcend, -1]
+	ldrb	C_lw, [srcend, -1]
 	ldrb	B_lw, [src, tmp1]
 	strb	A_lw, [dstin]
 	strb	B_lw, [dstin, tmp1]
-	strb	A_hw, [dstend, -1]
-2:	ret
+	strb	C_lw, [dstend, -1]
+L(copy0):
+	ret
 
 	.p2align 4
-	/* Copy 65..128 bytes.  Copy 64 bytes from the start and
-	   64 bytes from the end.  */
+	/* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+	ldp	A_l, A_h, [src]
+	ldp	B_l, B_h, [src, 16]
+	ldp	C_l, C_h, [srcend, -32]
+	ldp	D_l, D_h, [srcend, -16]
+	cmp	count, 64
+	b.hi	L(copy128)
+	stp	A_l, A_h, [dstin]
+	stp	B_l, B_h, [dstin, 16]
+	stp	C_l, C_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
+	ret
+
+	.p2align 4
+	/* Copy 65..128 bytes.  */
 L(copy128):
 	ldp	E_l, E_h, [src, 32]
 	ldp	F_l, F_h, [src, 48]
+	cmp	count, 96
+	b.ls	L(copy96)
 	ldp	G_l, G_h, [srcend, -64]
 	ldp	H_l, H_h, [srcend, -48]
+	stp	G_l, G_h, [dstend, -64]
+	stp	H_l, H_h, [dstend, -48]
+L(copy96):
 	stp	A_l, A_h, [dstin]
 	stp	B_l, B_h, [dstin, 16]
 	stp	E_l, E_h, [dstin, 32]
 	stp	F_l, F_h, [dstin, 48]
-	stp	G_l, G_h, [dstend, -64]
-	stp	H_l, H_h, [dstend, -48]
 	stp	C_l, C_h, [dstend, -32]
 	stp	D_l, D_h, [dstend, -16]
 	ret
 
-	/* Align DST to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 128 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
-	.p2align 4
+	/* Copy more than 128 bytes.  */
 L(copy_long):
+	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
+	ldp	D_l, D_h, [src]
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
-	ldp	D_l, D_h, [src]
 	sub	src, src, tmp1
-	add	count, count, tmp1  /* Count is now 16 too large.  */
+	add	count, count, tmp1	/* Count is now 16 too large.  */
 	ldp	A_l, A_h, [src, 16]
 	stp	D_l, D_h, [dstin]
 	ldp	B_l, B_h, [src, 32]
 	ldp	C_l, C_h, [src, 48]
 	ldp	D_l, D_h, [src, 64]!
-	subs	count, count, 128 + 16  /* Test and readjust count.  */
-	b.ls	L(last64)
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+	b.ls	L(copy64_from_end)
+
 L(loop64):
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [src, 16]
@@ -202,10 +188,8 @@ L(loop64):
 	subs	count, count, 64
 	b.hi	L(loop64)
 
-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the end even if
-	   there is just 1 byte left.  */
-L(last64):
+	/* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
 	ldp	E_l, E_h, [srcend, -64]
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [srcend, -48]
@@ -220,20 +204,42 @@ L(last64):
 	stp	C_l, C_h, [dstend, -16]
 	ret
 
-	.p2align 4
-L(move_long):
-	cbz	tmp1, 3f
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 4)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)
 
 	add	srcend, src, count
 	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(move_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
 
-	/* Align dstend to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 128 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldp	A_l, A_h, [src]
+	ldp	D_l, D_h, [srcend, -16]
+	stp	A_l, A_h, [dstin]
+	stp	D_l, D_h, [dstend, -16]
+	ret
 
-	and	tmp1, dstend, 15
+	.p2align 4
+L(move_long):
+	/* Only use backward copy if there is an overlap.  */
+	sub	tmp1, dstin, src
+	cbz	tmp1, L(copy0)
+	cmp	tmp1, count
+	b.hs	L(copy_long)
+
+	/* Large backwards copy for overlapping copies.
+	   Copy 16 bytes and then align dst to 16-byte alignment.  */
 	ldp	D_l, D_h, [srcend, -16]
+	and	tmp1, dstend, 15
 	sub	srcend, srcend, tmp1
 	sub	count, count, tmp1
 	ldp	A_l, A_h, [srcend, -16]
@@ -243,10 +249,9 @@ L(move_long):
 	ldp	D_l, D_h, [srcend, -64]!
 	sub	dstend, dstend, tmp1
 	subs	count, count, 128
-	b.ls	2f
+	b.ls	L(copy64_from_start)
 
-	nop
-1:
+L(loop64_backwards):
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [srcend, -16]
 	stp	B_l, B_h, [dstend, -32]
@@ -256,12 +261,10 @@ L(move_long):
 	stp	D_l, D_h, [dstend, -64]!
 	ldp	D_l, D_h, [srcend, -64]!
 	subs	count, count, 64
-	b.hi	1b
+	b.hi	L(loop64_backwards)
 
-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the start even if
-	   there is just 1 byte left.  */
-2:
+	/* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
 	ldp	G_l, G_h, [src, 48]
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [src, 32]
@@ -274,7 +277,7 @@ L(move_long):
 	stp	A_l, A_h, [dstin, 32]
 	stp	B_l, B_h, [dstin, 16]
 	stp	C_l, C_h, [dstin]
-3:	ret
+	ret
 
-END (MEMCPY)
-libc_hidden_builtin_def (MEMCPY)
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
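For completeness, the overlap handling that the new MEMMOVE entry point adds
can be modelled in C roughly as follows.  Again this is illustrative only and
not part of the patch; memmove_model is a hypothetical name for the sketch.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *memmove_model (void *dstin, const void *src, size_t count)
{
  /* Copies of up to 128 bytes read all data before storing any of it in
     the assembly, so they need no overlap check at all; plain memmove
     stands in for that shared small/medium path here.  */
  if (count <= 128)
    return memmove (dstin, src, count);

  /* The unsigned difference dst - src is smaller than count only when dst
     lies inside [src, src + count), i.e. when a forward copy would
     overwrite source bytes before reading them.  */
  uintptr_t diff = (uintptr_t) dstin - (uintptr_t) src;
  if (diff == 0)
    return dstin;                       /* dst == src: nothing to do.  */
  if (diff >= count)
    return memmove (dstin, src, count); /* Forward copy is safe; the
                                           assembly falls into L(copy_long).  */

  /* Harmful overlap: copy 64-byte blocks backwards from the end, as
     L(loop64_backwards) does, and finish with the first bytes.  */
  unsigned char *d = dstin;
  const unsigned char *s = src;
  while (count > 64)
    {
      count -= 64;
      memmove (d + count, s + count, 64);
    }
  memmove (d, s, count);
  return dstin;
}

Because dstin - src is compared as an unsigned value, a destination below the
source wraps around to a large number and takes the forward path, which is
exactly what the sub/cbz/cmp/b.hs sequence in L(move_long) does.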