From patchwork Wed Feb 12 15:56:11 2020
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 38005
From: Wilco Dijkstra
To: 'GNU C Library'
Subject: [PATCH][AArch64] Improve integer memcpy
Date: Wed, 12 Feb 2020 15:56:11 +0000

Hi,

Further optimize integer memcpy.  Small copies now handle sizes of up to
32 bytes, and the 64-128 byte case is split into two paths to improve the
performance of 64-96 byte copies.  The comments have been rewritten.

The attached graph shows how the new memcpy (memcpy_new) performs against
the current generic memcpy and the previous version (memcpy.S before
commit b9f145df85).

Passes GLIBC tests.
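As a review aid, here is a rough C sketch of the size dispatch the new
assembly implements.  It is an illustration only: the names sketch_memcpy
and copy16 are made up, the byte loop stands in for the branch-light
small-size code, and the real implementation is the AArch64 assembly in
the patch below.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One unaligned 16-byte copy (an LDP/STP pair in the assembly).  */
static inline void
copy16 (unsigned char *dst, const unsigned char *src)
{
  uint64_t lo, hi;
  memcpy (&lo, src, 8);
  memcpy (&hi, src + 8, 8);
  memcpy (dst, &lo, 8);
  memcpy (dst + 8, &hi, 8);
}

void
sketch_memcpy (unsigned char *dst, const unsigned char *src, size_t count)
{
  if (count <= 32)
    {
      if (count >= 16)
        {
          /* 16..32 bytes: one copy from the start and one from the end;
             the two overlap in the middle when count < 32.  */
          copy16 (dst, src);
          copy16 (dst + count - 16, src + count - 16);
        }
      else
        {
          /* 0..15 bytes: the assembly uses the same start/end trick with
             8-, 4- and 1-byte accesses; a byte loop keeps this sketch
             short.  */
          for (size_t i = 0; i < count; i++)
            dst[i] = src[i];
        }
      return;
    }

  if (count <= 128)
    {
      /* 33..128 bytes: 32 bytes from the start, 32 bytes from the end,
         and for larger sizes more data in between.  The 64/96 split in
         this patch lets 65..96 byte copies skip the two extra pairs.  */
      copy16 (dst, src);
      copy16 (dst + 16, src + 16);
      if (count > 64)
        {
          copy16 (dst + 32, src + 32);
          copy16 (dst + 48, src + 48);
          if (count > 96)
            {
              copy16 (dst + count - 64, src + count - 64);
              copy16 (dst + count - 48, src + count - 48);
            }
        }
      copy16 (dst + count - 32, src + count - 32);
      copy16 (dst + count - 16, src + count - 16);
      return;
    }

  /* More than 128 bytes: copy the first 16 bytes, round the destination
     up to a 16-byte boundary, loop over aligned 64-byte blocks, then
     copy the last 64 bytes from the end to cover the tail.  */
  copy16 (dst, src);
  size_t skew = 16 - ((uintptr_t) dst & 15);
  unsigned char *d = dst + skew;
  const unsigned char *s = src + skew;
  size_t left = count - skew;
  while (left > 64)
    {
      copy16 (d, s);
      copy16 (d + 16, s + 16);
      copy16 (d + 32, s + 32);
      copy16 (d + 48, s + 48);
      d += 64;
      s += 64;
      left -= 64;
    }
  copy16 (dst + count - 64, src + count - 64);
  copy16 (dst + count - 48, src + count - 48);
  copy16 (dst + count - 32, src + count - 32);
  copy16 (dst + count - 16, src + count - 16);
}

The assembly achieves the same coverage with far fewer branches, and
loading all data before storing it in the small and medium cases is what
lets memcpy and memmove share those paths.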
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index ff720c800ed0ca3afac03d19ba02f67817b3422e..e0547259a8618292fe798e70fe5b44409acecc51 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -33,11 +33,11 @@
 #define A_l	x6
 #define A_lw	w6
 #define A_h	x7
-#define A_hw	w7
 #define B_l	x8
 #define B_lw	w8
 #define B_h	x9
 #define C_l	x10
+#define C_lw	w10
 #define C_h	x11
 #define D_l	x12
 #define D_h	x13
@@ -51,16 +51,6 @@
 #define H_h	srcend
 #define tmp1	x14

-/* Copies are split into 3 main cases: small copies of up to 32 bytes,
-   medium copies of 33..128 bytes which are fully unrolled.  Large copies
-   of more than 128 bytes align the destination and use an unrolled loop
-   processing 64 bytes per iteration.
-   In order to share code with memmove, small and medium copies read all
-   data before writing, allowing any kind of overlap.  So small, medium
-   and large backwards memmoves are handled by falling through into memcpy.
-   Overlapping large forward memmoves use a loop that copies backwards.
-*/
-
 #ifndef MEMMOVE
 # define MEMMOVE memmove
 #endif
@@ -68,128 +58,124 @@
 # define MEMCPY memcpy
 #endif

-ENTRY_ALIGN (MEMMOVE, 6)
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.

-	DELOUSE (0)
-	DELOUSE (1)
-	DELOUSE (2)
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.

-	sub	tmp1, dstin, src
-	cmp	count, 128
-	ccmp	tmp1, count, 2, hi
-	b.lo	L(move_long)
+   Large copies use a software pipelined loop processing 64 bytes per iteration.
+   The destination pointer is 16-byte aligned to minimize unaligned accesses.
+   The loop tail is handled by always copying 64 bytes from the end.
+*/

-	/* Common case falls through into memcpy.  */
-END (MEMMOVE)
-libc_hidden_builtin_def (MEMMOVE)
 ENTRY (MEMCPY)
-
 	DELOUSE (0)
 	DELOUSE (1)
 	DELOUSE (2)
-	prfm	PLDL1KEEP, [src]
 	add	srcend, src, count
 	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(copy_long)
 	cmp	count, 32
-	b.ls	L(copy32)
-	cmp	count, 128
-	b.hi	L(copy_long)
+	b.hi	L(copy32_128)

-	/* Medium copies: 33..128 bytes.  */
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
 	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [src, 16]
-	ldp	C_l, C_h, [srcend, -32]
 	ldp	D_l, D_h, [srcend, -16]
-	cmp	count, 64
-	b.hi	L(copy128)
 	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstin, 16]
-	stp	C_l, C_h, [dstend, -32]
 	stp	D_l, D_h, [dstend, -16]
 	ret

-	.p2align 4
-	/* Small copies: 0..32 bytes.  */
-L(copy32):
-	/* 16-32 bytes.  */
-	cmp	count, 16
-	b.lo	1f
-	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [srcend, -16]
-	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstend, -16]
-	ret
-	.p2align 4
-1:
-	/* 8-15 bytes.  */
-	tbz	count, 3, 1f
+	/* Copy 8-15 bytes.  */
+L(copy16):
+	tbz	count, 3, L(copy8)
 	ldr	A_l, [src]
 	ldr	A_h, [srcend, -8]
 	str	A_l, [dstin]
 	str	A_h, [dstend, -8]
 	ret
-	.p2align 4
-1:
-	/* 4-7 bytes.  */
-	tbz	count, 2, 1f
+
+	.p2align 3
+	/* Copy 4-7 bytes.  */
+L(copy8):
+	tbz	count, 2, L(copy4)
 	ldr	A_lw, [src]
-	ldr	A_hw, [srcend, -4]
+	ldr	B_lw, [srcend, -4]
 	str	A_lw, [dstin]
-	str	A_hw, [dstend, -4]
+	str	B_lw, [dstend, -4]
 	ret

-	/* Copy 0..3 bytes.  Use a branchless sequence that copies the same
-	   byte 3 times if count==1, or the 2nd byte twice if count==2.  */
-1:
-	cbz	count, 2f
+	/* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+	cbz	count, L(copy0)
 	lsr	tmp1, count, 1
 	ldrb	A_lw, [src]
-	ldrb	A_hw, [srcend, -1]
+	ldrb	C_lw, [srcend, -1]
 	ldrb	B_lw, [src, tmp1]
 	strb	A_lw, [dstin]
 	strb	B_lw, [dstin, tmp1]
-	strb	A_hw, [dstend, -1]
-2:	ret
+	strb	C_lw, [dstend, -1]
+L(copy0):
+	ret
+
+	.p2align 4
+	/* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+	ldp	A_l, A_h, [src]
+	ldp	B_l, B_h, [src, 16]
+	ldp	C_l, C_h, [srcend, -32]
+	ldp	D_l, D_h, [srcend, -16]
+	cmp	count, 64
+	b.hi	L(copy128)
+	stp	A_l, A_h, [dstin]
+	stp	B_l, B_h, [dstin, 16]
+	stp	C_l, C_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
+	ret

 	.p2align 4
-	/* Copy 65..128 bytes.  Copy 64 bytes from the start and
-	   64 bytes from the end.  */
+	/* Copy 65..128 bytes.  */
 L(copy128):
 	ldp	E_l, E_h, [src, 32]
 	ldp	F_l, F_h, [src, 48]
+	cmp	count, 96
+	b.ls	L(copy96)
 	ldp	G_l, G_h, [srcend, -64]
 	ldp	H_l, H_h, [srcend, -48]
+	stp	G_l, G_h, [dstend, -64]
+	stp	H_l, H_h, [dstend, -48]
+L(copy96):
 	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstin, 16]
-	stp	E_l, E_h, [dstin, 32]
-	stp	F_l, F_h, [dstin, 48]
-	stp	G_l, G_h, [dstend, -64]
-	stp	H_l, H_h, [dstend, -48]
-	stp	C_l, C_h, [dstend, -32]
+	stp	B_l, B_h, [dstin, 16]
+	stp	E_l, E_h, [dstin, 32]
+	stp	F_l, F_h, [dstin, 48]
+	stp	C_l, C_h, [dstend, -32]
 	stp	D_l, D_h, [dstend, -16]
 	ret

-	/* Align DST to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 128 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
-	.p2align 4
+	/* Copy more than 128 bytes.  */
 L(copy_long):
+	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
+	ldp	D_l, D_h, [src]
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
-	ldp	D_l, D_h, [src]
 	sub	src, src, tmp1
-	add	count, count, tmp1	/* Count is now 16 too large.  */
+	add	count, count, tmp1	/* Count is now 16 too large.  */
 	ldp	A_l, A_h, [src, 16]
 	stp	D_l, D_h, [dstin]
 	ldp	B_l, B_h, [src, 32]
 	ldp	C_l, C_h, [src, 48]
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 128 + 16	/* Test and readjust count.  */
-	b.ls	L(last64)
+	b.ls	L(copy64_from_end)
+
 L(loop64):
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [src, 16]
@@ -202,10 +188,8 @@ L(loop64):
 	subs	count, count, 64
 	b.hi	L(loop64)

-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the end even if
-	   there is just 1 byte left.  */
-L(last64):
+	/* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
 	ldp	E_l, E_h, [srcend, -64]
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [srcend, -48]
@@ -220,20 +204,42 @@ L(last64):
 	stp	C_l, C_h, [dstend, -16]
 	ret

-	.p2align 4
-L(move_long):
-	cbz	tmp1, 3f
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 4)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)

 	add	srcend, src, count
 	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(move_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
+
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldp	A_l, A_h, [src]
+	ldp	D_l, D_h, [srcend, -16]
+	stp	A_l, A_h, [dstin]
+	stp	D_l, D_h, [dstend, -16]
+	ret

-	/* Align dstend to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 128 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
+	.p2align 4
+L(move_long):
+	/* Only use backward copy if there is an overlap.  */
+	sub	tmp1, dstin, src
+	cbz	tmp1, L(copy0)
+	cmp	tmp1, count
+	b.hs	L(copy_long)

-	and	tmp1, dstend, 15
+	/* Large backwards copy for overlapping copies.
+	   Copy 16 bytes and then align dst to 16-byte alignment.  */
 	ldp	D_l, D_h, [srcend, -16]
+	and	tmp1, dstend, 15
 	sub	srcend, srcend, tmp1
 	sub	count, count, tmp1
 	ldp	A_l, A_h, [srcend, -16]
@@ -243,10 +249,9 @@ L(move_long):
 	ldp	D_l, D_h, [srcend, -64]!
 	sub	dstend, dstend, tmp1
 	subs	count, count, 128
-	b.ls	2f
+	b.ls	L(copy64_from_start)

-	nop
-1:
+L(loop64_backwards):
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [srcend, -16]
 	stp	B_l, B_h, [dstend, -32]
@@ -256,12 +261,10 @@ L(move_long):
 	ldp	B_l, B_h, [srcend, -32]
 	stp	C_l, C_h, [dstend, -48]
 	ldp	C_l, C_h, [srcend, -48]
 	stp	D_l, D_h, [dstend, -64]!
 	ldp	D_l, D_h, [srcend, -64]!
 	subs	count, count, 64
-	b.hi	1b
+	b.hi	L(loop64_backwards)

-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the start even if
-	   there is just 1 byte left.  */
-2:
+	/* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
 	ldp	G_l, G_h, [src, 48]
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [src, 32]
@@ -274,7 +277,7 @@ L(move_long):
 	stp	A_l, A_h, [dstin, 32]
 	stp	B_l, B_h, [dstin, 16]
 	stp	C_l, C_h, [dstin]
-3:	ret
+	ret

-END (MEMCPY)
-libc_hidden_builtin_def (MEMCPY)
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
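
For reference, C sketches of two details in the patch that may not be
obvious from the assembly alone.  These are illustrative only and assume
nothing beyond the code above; the sketch_* names are not part of the
patch.

#include <stddef.h>
#include <stdint.h>

/* L(copy4): copy 0..3 bytes without branching on the exact size.  For
   count == 1 the same byte is written three times, for count == 2 the
   second byte is written twice, and count == 3 writes each byte once.  */
void
sketch_copy4 (unsigned char *dst, const unsigned char *src, size_t count)
{
  if (count == 0)
    return;
  size_t mid = count >> 1;
  unsigned char a = src[0];
  unsigned char b = src[mid];
  unsigned char c = src[count - 1];
  dst[0] = a;
  dst[mid] = b;
  dst[count - 1] = c;
}

/* L(move_long): the backward loop is only needed when the destination
   starts inside the source buffer.  The unsigned subtraction wraps to a
   large value when dst is below src, so those calls take the forward
   memcpy path.  */
int
sketch_needs_backward_copy (const void *dst, const void *src, size_t count)
{
  uintptr_t diff = (uintptr_t) dst - (uintptr_t) src;
  return diff != 0 && diff < count;
}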