From patchwork Wed Feb 12 15:56:11 2020
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 38005
From: Wilco Dijkstra
To: 'GNU C Library'
Subject: [PATCH][AArch64] Improve integer memcpy
Date: Wed, 12 Feb 2020 15:56:11 +0000

Hi,

Further optimize integer memcpy.  Small copies now handle sizes of up to
32 bytes, and the 64-128 byte case is split into two paths to improve the
performance of 64-96 byte copies.  The comments have been rewritten.

The attached graph shows how the new memcpy (memcpy_new) performs against
the current generic memcpy and the previous version (memcpy.S before
commit b9f145df85).

Passes GLIBC tests.
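As a review aid, here is a rough C sketch of the size dispatch the new
assembly implements.  It is an illustration only: the names sketch_memcpy
and copy16 are made up, the byte loop stands in for the branch-light
small-size code, and the real implementation is the AArch64 assembly in
the patch below.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One unaligned 16-byte copy (an LDP/STP pair in the assembly).  */
static inline void
copy16 (unsigned char *dst, const unsigned char *src)
{
  uint64_t lo, hi;
  memcpy (&lo, src, 8);
  memcpy (&hi, src + 8, 8);
  memcpy (dst, &lo, 8);
  memcpy (dst + 8, &hi, 8);
}

void
sketch_memcpy (unsigned char *dst, const unsigned char *src, size_t count)
{
  if (count <= 32)
    {
      if (count >= 16)
        {
          /* 16..32 bytes: one copy from the start and one from the end;
             the two overlap in the middle when count < 32.  */
          copy16 (dst, src);
          copy16 (dst + count - 16, src + count - 16);
        }
      else
        {
          /* 0..15 bytes: the assembly uses the same start/end trick with
             8-, 4- and 1-byte accesses; a byte loop keeps this sketch
             short.  */
          for (size_t i = 0; i < count; i++)
            dst[i] = src[i];
        }
      return;
    }

  if (count <= 128)
    {
      /* 33..128 bytes: 32 bytes from the start, 32 bytes from the end,
         and for larger sizes more data in between.  The 64/96 split in
         this patch lets 65..96 byte copies skip the two extra pairs.  */
      copy16 (dst, src);
      copy16 (dst + 16, src + 16);
      if (count > 64)
        {
          copy16 (dst + 32, src + 32);
          copy16 (dst + 48, src + 48);
          if (count > 96)
            {
              copy16 (dst + count - 64, src + count - 64);
              copy16 (dst + count - 48, src + count - 48);
            }
        }
      copy16 (dst + count - 32, src + count - 32);
      copy16 (dst + count - 16, src + count - 16);
      return;
    }

  /* More than 128 bytes: copy the first 16 bytes, round the destination
     up to a 16-byte boundary, loop over aligned 64-byte blocks, then
     copy the last 64 bytes from the end to cover the tail.  */
  copy16 (dst, src);
  size_t skew = 16 - ((uintptr_t) dst & 15);
  unsigned char *d = dst + skew;
  const unsigned char *s = src + skew;
  size_t left = count - skew;
  while (left > 64)
    {
      copy16 (d, s);
      copy16 (d + 16, s + 16);
      copy16 (d + 32, s + 32);
      copy16 (d + 48, s + 48);
      d += 64;
      s += 64;
      left -= 64;
    }
  copy16 (dst + count - 64, src + count - 64);
  copy16 (dst + count - 48, src + count - 48);
  copy16 (dst + count - 32, src + count - 32);
  copy16 (dst + count - 16, src + count - 16);
}

The assembly achieves the same coverage with far fewer branches, and
loading all data before storing it in the small and medium cases is what
lets memcpy and memmove share those paths.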
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index ff720c800ed0ca3afac03d19ba02f67817b3422e..e0547259a8618292fe798e70fe5b44409acecc51 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -33,11 +33,11 @@
 #define A_l	x6
 #define A_lw	w6
 #define A_h	x7
-#define A_hw	w7
 #define B_l	x8
 #define B_lw	w8
 #define B_h	x9
 #define C_l	x10
+#define C_lw	w10
 #define C_h	x11
 #define D_l	x12
 #define D_h	x13
@@ -51,16 +51,6 @@
 #define H_h	srcend
 #define tmp1	x14

-/* Copies are split into 3 main cases: small copies of up to 32 bytes,
-   medium copies of 33..128 bytes which are fully unrolled.  Large copies
-   of more than 128 bytes align the destination and use an unrolled loop
-   processing 64 bytes per iteration.
-   In order to share code with memmove, small and medium copies read all
-   data before writing, allowing any kind of overlap.  So small, medium
-   and large backwards memmoves are handled by falling through into memcpy.
-   Overlapping large forward memmoves use a loop that copies backwards.
-*/
-
 #ifndef MEMMOVE
 # define MEMMOVE memmove
 #endif
@@ -68,128 +58,124 @@
 # define MEMCPY memcpy
 #endif

-ENTRY_ALIGN (MEMMOVE, 6)
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.

-	DELOUSE (0)
-	DELOUSE (1)
-	DELOUSE (2)
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.

-	sub	tmp1, dstin, src
-	cmp	count, 128
-	ccmp	tmp1, count, 2, hi
-	b.lo	L(move_long)
+   Large copies use a software pipelined loop processing 64 bytes per iteration.
+   The destination pointer is 16-byte aligned to minimize unaligned accesses.
+   The loop tail is handled by always copying 64 bytes from the end.
+*/

-	/* Common case falls through into memcpy.  */
-END (MEMMOVE)
-libc_hidden_builtin_def (MEMMOVE)
 ENTRY (MEMCPY)
-
 	DELOUSE (0)
 	DELOUSE (1)
 	DELOUSE (2)
-	prfm	PLDL1KEEP, [src]
 	add	srcend, src, count
 	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(copy_long)
 	cmp	count, 32
-	b.ls	L(copy32)
-	cmp	count, 128
-	b.hi	L(copy_long)
+	b.hi	L(copy32_128)

-	/* Medium copies: 33..128 bytes.  */
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
 	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [src, 16]
-	ldp	C_l, C_h, [srcend, -32]
 	ldp	D_l, D_h, [srcend, -16]
-	cmp	count, 64
-	b.hi	L(copy128)
 	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstin, 16]
-	stp	C_l, C_h, [dstend, -32]
 	stp	D_l, D_h, [dstend, -16]
 	ret

-	.p2align 4
-	/* Small copies: 0..32 bytes.  */
-L(copy32):
-	/* 16-32 bytes.  */
-	cmp	count, 16
-	b.lo	1f
-	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [srcend, -16]
-	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstend, -16]
-	ret
-	.p2align 4
-1:
-	/* 8-15 bytes.  */
-	tbz	count, 3, 1f
+	/* Copy 8-15 bytes.  */
+L(copy16):
+	tbz	count, 3, L(copy8)
 	ldr	A_l, [src]
 	ldr	A_h, [srcend, -8]
 	str	A_l, [dstin]
 	str	A_h, [dstend, -8]
 	ret
-	.p2align 4
-1:
-	/* 4-7 bytes.  */
-	tbz	count, 2, 1f
+
+	.p2align 3
+	/* Copy 4-7 bytes.  */
+L(copy8):
+	tbz	count, 2, L(copy4)
 	ldr	A_lw, [src]
-	ldr	A_hw, [srcend, -4]
+	ldr	B_lw, [srcend, -4]
 	str	A_lw, [dstin]
-	str	A_hw, [dstend, -4]
+	str	B_lw, [dstend, -4]
 	ret

-	/* Copy 0..3 bytes.  Use a branchless sequence that copies the same
-	   byte 3 times if count==1, or the 2nd byte twice if count==2.  */
-1:
-	cbz	count, 2f
+	/* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+	cbz	count, L(copy0)
 	lsr	tmp1, count, 1
 	ldrb	A_lw, [src]
-	ldrb	A_hw, [srcend, -1]
+	ldrb	C_lw, [srcend, -1]
 	ldrb	B_lw, [src, tmp1]
 	strb	A_lw, [dstin]
 	strb	B_lw, [dstin, tmp1]
-	strb	A_hw, [dstend, -1]
-2:	ret
+	strb	C_lw, [dstend, -1]
+L(copy0):
+	ret
+
+	.p2align 4
+	/* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+	ldp	A_l, A_h, [src]
+	ldp	B_l, B_h, [src, 16]
+	ldp	C_l, C_h, [srcend, -32]
+	ldp	D_l, D_h, [srcend, -16]
+	cmp	count, 64
+	b.hi	L(copy128)
+	stp	A_l, A_h, [dstin]
+	stp	B_l, B_h, [dstin, 16]
+	stp	C_l, C_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
+	ret

 	.p2align 4
-	/* Copy 65..128 bytes.  Copy 64 bytes from the start and
-	   64 bytes from the end.  */
+	/* Copy 65..128 bytes.  */
 L(copy128):
 	ldp	E_l, E_h, [src, 32]
 	ldp	F_l, F_h, [src, 48]
+	cmp	count, 96
+	b.ls	L(copy96)
 	ldp	G_l, G_h, [srcend, -64]
 	ldp	H_l, H_h, [srcend, -48]
+	stp	G_l, G_h, [dstend, -64]
+	stp	H_l, H_h, [dstend, -48]
+L(copy96):
 	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstin, 16]
-	stp	E_l, E_h, [dstin, 32]
-	stp	F_l, F_h, [dstin, 48]
-	stp	G_l, G_h, [dstend, -64]
-	stp	H_l, H_h, [dstend, -48]
-	stp	C_l, C_h, [dstend, -32]
+	stp	B_l, B_h, [dstin, 16]
+	stp	E_l, E_h, [dstin, 32]
+	stp	F_l, F_h, [dstin, 48]
+	stp	C_l, C_h, [dstend, -32]
 	stp	D_l, D_h, [dstend, -16]
 	ret

-	/* Align DST to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 128 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
-	.p2align 4
+	/* Copy more than 128 bytes.  */
 L(copy_long):
+	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
+	ldp	D_l, D_h, [src]
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
-	ldp	D_l, D_h, [src]
 	sub	src, src, tmp1
-	add	count, count, tmp1	/* Count is now 16 too large.  */
+	add	count, count, tmp1	/* Count is now 16 too large.  */
 	ldp	A_l, A_h, [src, 16]
 	stp	D_l, D_h, [dstin]
 	ldp	B_l, B_h, [src, 32]
 	ldp	C_l, C_h, [src, 48]
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 128 + 16	/* Test and readjust count.  */
-	b.ls	L(last64)
+	b.ls	L(copy64_from_end)
+
 L(loop64):
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [src, 16]
@@ -202,10 +188,8 @@ L(loop64):
 	subs	count, count, 64
 	b.hi	L(loop64)

-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the end even if
-	   there is just 1 byte left.  */
-L(last64):
+	/* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
 	ldp	E_l, E_h, [srcend, -64]
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [srcend, -48]
@@ -220,20 +204,42 @@ L(last64):
 	stp	C_l, C_h, [dstend, -16]
 	ret

-	.p2align 4
-L(move_long):
-	cbz	tmp1, 3f
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 4)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)

 	add	srcend, src, count
 	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(move_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
+
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldp	A_l, A_h, [src]
+	ldp	D_l, D_h, [srcend, -16]
+	stp	A_l, A_h, [dstin]
+	stp	D_l, D_h, [dstend, -16]
+	ret

-	/* Align dstend to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 128 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
+	.p2align 4
+L(move_long):
+	/* Only use backward copy if there is an overlap.  */
+	sub	tmp1, dstin, src
+	cbz	tmp1, L(copy0)
+	cmp	tmp1, count
+	b.hs	L(copy_long)

-	and	tmp1, dstend, 15
+	/* Large backwards copy for overlapping copies.
+	   Copy 16 bytes and then align dst to 16-byte alignment.  */
 	ldp	D_l, D_h, [srcend, -16]
+	and	tmp1, dstend, 15
 	sub	srcend, srcend, tmp1
 	sub	count, count, tmp1
 	ldp	A_l, A_h, [srcend, -16]
@@ -243,10 +249,9 @@ L(move_long):
 	ldp	D_l, D_h, [srcend, -64]!
 	sub	dstend, dstend, tmp1
 	subs	count, count, 128
-	b.ls	2f
+	b.ls	L(copy64_from_start)

-	nop
-1:
+L(loop64_backwards):
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [srcend, -16]
 	stp	B_l, B_h, [dstend, -32]
@@ -256,12 +261,10 @@ L(move_long):
 	ldp	B_l, B_h, [srcend, -32]
 	stp	C_l, C_h, [dstend, -48]
 	ldp	C_l, C_h, [srcend, -48]
 	stp	D_l, D_h, [dstend, -64]!
 	ldp	D_l, D_h, [srcend, -64]!
 	subs	count, count, 64
-	b.hi	1b
+	b.hi	L(loop64_backwards)

-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the start even if
-	   there is just 1 byte left.  */
-2:
+	/* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
 	ldp	G_l, G_h, [src, 48]
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [src, 32]
@@ -274,7 +277,7 @@ L(move_long):
 	stp	A_l, A_h, [dstin, 32]
 	stp	B_l, B_h, [dstin, 16]
 	stp	C_l, C_h, [dstin]
-3:	ret
+	ret

-END (MEMCPY)
-libc_hidden_builtin_def (MEMCPY)
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
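
For reference, C sketches of two details in the patch that may not be
obvious from the assembly alone.  These are illustrative only and assume
nothing beyond the code above; the sketch_* names are not part of the
patch.

#include <stddef.h>
#include <stdint.h>

/* L(copy4): copy 0..3 bytes without branching on the exact size.  For
   count == 1 the same byte is written three times, for count == 2 the
   second byte is written twice, and count == 3 writes each byte once.  */
void
sketch_copy4 (unsigned char *dst, const unsigned char *src, size_t count)
{
  if (count == 0)
    return;
  size_t mid = count >> 1;
  unsigned char a = src[0];
  unsigned char b = src[mid];
  unsigned char c = src[count - 1];
  dst[0] = a;
  dst[mid] = b;
  dst[count - 1] = c;
}

/* L(move_long): the backward loop is only needed when the destination
   starts inside the source buffer.  The unsigned subtraction wraps to a
   large value when dst is below src, so those calls take the forward
   memcpy path.  */
int
sketch_needs_backward_copy (const void *dst, const void *src, size_t count)
{
  uintptr_t diff = (uintptr_t) dst - (uintptr_t) src;
  return diff != 0 && diff < count;
}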