From patchwork Wed Feb 26 16:18:32 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 38326
Delivered-To: mailing list libc-alpha@sourceware.org
From: Wilco Dijkstra
To: 'GNU C Library'
Subject: Re:
 [PATCH v2][AArch64] Improve integer memcpy
Date: Wed, 26 Feb 2020 16:18:32 +0000

Version 2 fixes white space and uses ENTRY_ALIGN rather than ENTRY:

Further optimize integer memcpy.  Small cases now include copies up to 32
bytes.  64-128 byte copies are split into two cases to improve performance
of 64-96 byte copies.  Comments have been rewritten.

The attached graph shows how the new memcpy (memcpy_new) performs against
the current generic memcpy and the previous version (memcpy.S before
commit b9f145df85).

Passes GLIBC tests.

Reviewed-by: Adhemerval Zanella
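For illustration, here is a rough C model of the copy strategy described
above.  This sketch is not part of the patch; memcpy_model and copy_block
are hypothetical names invented for it, and the assembly's instruction
scheduling and register use are ignored.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size unaligned block copy; with a constant size this compiles to
   plain unaligned loads and stores, like the ldp/stp pairs below.  */
static void copy_block (unsigned char *d, const unsigned char *s, size_t n)
{
  memcpy (d, s, n);
}

void *memcpy_model (void *restrict dstin, const void *restrict src,
                    size_t count)
{
  unsigned char *dst = dstin;
  const unsigned char *s = src;

  if (count <= 32)
    {
      /* Small copies: one block from the start and one from the end; the
         two destination blocks may overlap in the middle, which is
         harmless.  */
      if (count >= 16)
        {
          copy_block (dst, s, 16);
          copy_block (dst + count - 16, s + count - 16, 16);
        }
      else if (count >= 8)
        {
          copy_block (dst, s, 8);
          copy_block (dst + count - 8, s + count - 8, 8);
        }
      else if (count >= 4)
        {
          copy_block (dst, s, 4);
          copy_block (dst + count - 4, s + count - 4, 4);
        }
      else if (count > 0)
        {
          /* 1-3 bytes, branchless: first, middle and last byte (some of
             these are the same byte when count is 1 or 2).  */
          dst[0] = s[0];
          dst[count >> 1] = s[count >> 1];
          dst[count - 1] = s[count - 1];
        }
    }
  else if (count <= 128)
    {
      /* Medium copies: 32 bytes from the start and 32 from the end, plus a
         middle pair of 32-byte blocks only when needed.  Copies of 64-96
         bytes skip the second extra block, which is the new split the
         description above mentions.  */
      copy_block (dst, s, 32);
      copy_block (dst + count - 32, s + count - 32, 32);
      if (count > 64)
        {
          copy_block (dst + 32, s + 32, 32);
          if (count > 96)
            copy_block (dst + count - 64, s + count - 64, 32);
        }
    }
  else
    {
      /* Large copies: copy 16 bytes, round dst up to 16-byte alignment,
         loop over 64-byte chunks, then copy the last 64 bytes from the end
         so that no scalar tail loop is needed.  */
      copy_block (dst, s, 16);
      size_t skew = 16 - ((uintptr_t) dst & 15);
      dst += skew;
      s += skew;
      count -= skew;
      while (count > 64)
        {
          copy_block (dst, s, 64);
          dst += 64;
          s += 64;
          count -= 64;
        }
      copy_block (dst + count - 64, s + count - 64, 64);
    }
  return dstin;
}

The common trick in every size class is to anchor one block copy at the
start of the buffer and one at the end, so no byte-by-byte tail handling is
ever required.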
diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index ff720c800ed0ca3afac03d19ba02f67817b3422e..d31f7bb38eaf91692fb90f1c313b5e276fdf975b 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -33,11 +33,11 @@
 #define A_l	x6
 #define A_lw	w6
 #define A_h	x7
-#define A_hw	w7
 #define B_l	x8
 #define B_lw	w8
 #define B_h	x9
 #define C_l	x10
+#define C_lw	w10
 #define C_h	x11
 #define D_l	x12
 #define D_h	x13
@@ -51,16 +51,6 @@
 #define H_h	srcend
 #define tmp1	x14
 
-/* Copies are split into 3 main cases: small copies of up to 32 bytes,
-   medium copies of 33..128 bytes which are fully unrolled.  Large copies
-   of more than 128 bytes align the destination and use an unrolled loop
-   processing 64 bytes per iteration.
-   In order to share code with memmove, small and medium copies read all
-   data before writing, allowing any kind of overlap.  So small, medium
-   and large backwards memmoves are handled by falling through into memcpy.
-   Overlapping large forward memmoves use a loop that copies backwards.
-*/
-
 #ifndef MEMMOVE
 # define MEMMOVE memmove
 #endif
@@ -68,128 +58,124 @@
 # define MEMCPY memcpy
 #endif
 
-ENTRY_ALIGN (MEMMOVE, 6)
+/* This implementation supports both memcpy and memmove and shares most code.
+   It uses unaligned accesses and branchless sequences to keep the code small,
+   simple and improve performance.
 
-	DELOUSE (0)
-	DELOUSE (1)
-	DELOUSE (2)
+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
+   check in memmove is negligible since it is only required for large copies.
 
-	sub	tmp1, dstin, src
-	cmp	count, 128
-	ccmp	tmp1, count, 2, hi
-	b.lo	L(move_long)
-
-	/* Common case falls through into memcpy.  */
-END (MEMMOVE)
-libc_hidden_builtin_def (MEMMOVE)
-ENTRY (MEMCPY)
+   Large copies use a software pipelined loop processing 64 bytes per iteration.
+   The destination pointer is 16-byte aligned to minimize unaligned accesses.
+   The loop tail is handled by always copying 64 bytes from the end.
+*/
+ENTRY_ALIGN (MEMCPY, 6)
 
 	DELOUSE (0)
 	DELOUSE (1)
 	DELOUSE (2)
 
-	prfm	PLDL1KEEP, [src]
 	add	srcend, src, count
 	add	dstend, dstin, count
-	cmp	count, 32
-	b.ls	L(copy32)
 	cmp	count, 128
 	b.hi	L(copy_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
 
-	/* Medium copies: 33..128 bytes.  */
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
 	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [src, 16]
-	ldp	C_l, C_h, [srcend, -32]
 	ldp	D_l, D_h, [srcend, -16]
-	cmp	count, 64
-	b.hi	L(copy128)
 	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstin, 16]
-	stp	C_l, C_h, [dstend, -32]
 	stp	D_l, D_h, [dstend, -16]
 	ret
 
-	.p2align 4
-	/* Small copies: 0..32 bytes.  */
-L(copy32):
-	/* 16-32 bytes.  */
-	cmp	count, 16
-	b.lo	1f
-	ldp	A_l, A_h, [src]
-	ldp	B_l, B_h, [srcend, -16]
-	stp	A_l, A_h, [dstin]
-	stp	B_l, B_h, [dstend, -16]
-	ret
-
-	.p2align 4
-1:
-	/* 8-15 bytes.  */
-	tbz	count, 3, 1f
+	/* Copy 8-15 bytes.  */
+L(copy16):
+	tbz	count, 3, L(copy8)
 	ldr	A_l, [src]
 	ldr	A_h, [srcend, -8]
 	str	A_l, [dstin]
 	str	A_h, [dstend, -8]
 	ret
-	.p2align 4
-1:
-	/* 4-7 bytes.  */
-	tbz	count, 2, 1f
+
+	.p2align 3
+	/* Copy 4-7 bytes.  */
+L(copy8):
+	tbz	count, 2, L(copy4)
 	ldr	A_lw, [src]
-	ldr	A_hw, [srcend, -4]
+	ldr	B_lw, [srcend, -4]
 	str	A_lw, [dstin]
-	str	A_hw, [dstend, -4]
+	str	B_lw, [dstend, -4]
 	ret
 
-	/* Copy 0..3 bytes.  Use a branchless sequence that copies the same
-	   byte 3 times if count==1, or the 2nd byte twice if count==2.  */
-1:
-	cbz	count, 2f
+	/* Copy 0..3 bytes using a branchless sequence.  */
+L(copy4):
+	cbz	count, L(copy0)
 	lsr	tmp1, count, 1
 	ldrb	A_lw, [src]
-	ldrb	A_hw, [srcend, -1]
+	ldrb	C_lw, [srcend, -1]
 	ldrb	B_lw, [src, tmp1]
 	strb	A_lw, [dstin]
 	strb	B_lw, [dstin, tmp1]
-	strb	A_hw, [dstend, -1]
-2:	ret
+	strb	C_lw, [dstend, -1]
+L(copy0):
+	ret
 
 	.p2align 4
-	/* Copy 65..128 bytes.  Copy 64 bytes from the start and
-	   64 bytes from the end.  */
+	/* Medium copies: 33..128 bytes.  */
+L(copy32_128):
+	ldp	A_l, A_h, [src]
+	ldp	B_l, B_h, [src, 16]
+	ldp	C_l, C_h, [srcend, -32]
+	ldp	D_l, D_h, [srcend, -16]
+	cmp	count, 64
+	b.hi	L(copy128)
+	stp	A_l, A_h, [dstin]
+	stp	B_l, B_h, [dstin, 16]
+	stp	C_l, C_h, [dstend, -32]
+	stp	D_l, D_h, [dstend, -16]
+	ret
+
+	.p2align 4
+	/* Copy 65..128 bytes.  */
 L(copy128):
 	ldp	E_l, E_h, [src, 32]
 	ldp	F_l, F_h, [src, 48]
+	cmp	count, 96
+	b.ls	L(copy96)
 	ldp	G_l, G_h, [srcend, -64]
 	ldp	H_l, H_h, [srcend, -48]
+	stp	G_l, G_h, [dstend, -64]
+	stp	H_l, H_h, [dstend, -48]
+L(copy96):
 	stp	A_l, A_h, [dstin]
 	stp	B_l, B_h, [dstin, 16]
 	stp	E_l, E_h, [dstin, 32]
 	stp	F_l, F_h, [dstin, 48]
-	stp	G_l, G_h, [dstend, -64]
-	stp	H_l, H_h, [dstend, -48]
 	stp	C_l, C_h, [dstend, -32]
 	stp	D_l, D_h, [dstend, -16]
 	ret
 
-	/* Align DST to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 128 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
-	.p2align 4
+	/* Copy more than 128 bytes.  */
 L(copy_long):
+	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
+	ldp	D_l, D_h, [src]
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
-	ldp	D_l, D_h, [src]
 	sub	src, src, tmp1
-	add	count, count, tmp1  /* Count is now 16 too large.  */
+	add	count, count, tmp1	/* Count is now 16 too large.  */
 	ldp	A_l, A_h, [src, 16]
 	stp	D_l, D_h, [dstin]
 	ldp	B_l, B_h, [src, 32]
 	ldp	C_l, C_h, [src, 48]
 	ldp	D_l, D_h, [src, 64]!
-	subs	count, count, 128 + 16  /* Test and readjust count.  */
-	b.ls	L(last64)
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+	b.ls	L(copy64_from_end)
+
 L(loop64):
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [src, 16]
@@ -202,10 +188,8 @@ L(loop64):
 	subs	count, count, 64
 	b.hi	L(loop64)
 
-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the end even if
-	   there is just 1 byte left.  */
-L(last64):
+	/* Write the last iteration and copy 64 bytes from the end.  */
+L(copy64_from_end):
 	ldp	E_l, E_h, [srcend, -64]
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [srcend, -48]
@@ -220,20 +204,42 @@ L(last64):
 	stp	C_l, C_h, [dstend, -16]
 	ret
 
-	.p2align 4
-L(move_long):
-	cbz	tmp1, 3f
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 4)
+	DELOUSE (0)
+	DELOUSE (1)
+	DELOUSE (2)
 
 	add	srcend, src, count
 	add	dstend, dstin, count
+	cmp	count, 128
+	b.hi	L(move_long)
+	cmp	count, 32
+	b.hi	L(copy32_128)
 
-	/* Align dstend to 16 byte alignment so that we don't cross cache line
-	   boundaries on both loads and stores.  There are at least 128 bytes
-	   to copy, so copy 16 bytes unaligned and then align.  The loop
-	   copies 64 bytes per iteration and prefetches one iteration ahead.  */
+	/* Small copies: 0..32 bytes.  */
+	cmp	count, 16
+	b.lo	L(copy16)
+	ldp	A_l, A_h, [src]
+	ldp	D_l, D_h, [srcend, -16]
+	stp	A_l, A_h, [dstin]
+	stp	D_l, D_h, [dstend, -16]
+	ret
 
-	and	tmp1, dstend, 15
+	.p2align 4
+L(move_long):
+	/* Only use backward copy if there is an overlap.  */
+	sub	tmp1, dstin, src
+	cbz	tmp1, L(copy0)
+	cmp	tmp1, count
+	b.hs	L(copy_long)
+
+	/* Large backwards copy for overlapping copies.
+	   Copy 16 bytes and then align dst to 16-byte alignment.  */
 	ldp	D_l, D_h, [srcend, -16]
+	and	tmp1, dstend, 15
 	sub	srcend, srcend, tmp1
 	sub	count, count, tmp1
 	ldp	A_l, A_h, [srcend, -16]
@@ -243,10 +249,9 @@ L(move_long):
 	ldp	D_l, D_h, [srcend, -64]!
 	sub	dstend, dstend, tmp1
 	subs	count, count, 128
-	b.ls	2f
+	b.ls	L(copy64_from_start)
 
-	nop
-1:
+L(loop64_backwards):
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [srcend, -16]
 	stp	B_l, B_h, [dstend, -32]
@@ -256,12 +261,10 @@ L(move_long):
 	stp	D_l, D_h, [dstend, -64]!
 	ldp	D_l, D_h, [srcend, -64]!
 	subs	count, count, 64
-	b.hi	1b
+	b.hi	L(loop64_backwards)
 
-	/* Write the last full set of 64 bytes.  The remainder is at most 64
-	   bytes, so it is safe to always copy 64 bytes from the start even if
-	   there is just 1 byte left.  */
-2:
+	/* Write the last iteration and copy 64 bytes from the start.  */
+L(copy64_from_start):
 	ldp	G_l, G_h, [src, 48]
 	stp	A_l, A_h, [dstend, -16]
 	ldp	A_l, A_h, [src, 32]
@@ -274,7 +277,7 @@ L(move_long):
 	stp	A_l, A_h, [dstin, 32]
 	stp	B_l, B_h, [dstin, 16]
 	stp	C_l, C_h, [dstin]
-3:	ret
+	ret
 
-END (MEMCPY)
-libc_hidden_builtin_def (MEMCPY)
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
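For completeness, the overlap handling that the new MEMMOVE entry point adds
can be modelled in C roughly as follows.  Again this is illustrative only and
not part of the patch; memmove_model is a hypothetical name for the sketch.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *memmove_model (void *dstin, const void *src, size_t count)
{
  /* Copies of up to 128 bytes read all data before storing any of it in
     the assembly, so they need no overlap check at all; plain memmove
     stands in for that shared small/medium path here.  */
  if (count <= 128)
    return memmove (dstin, src, count);

  /* The unsigned difference dst - src is smaller than count only when dst
     lies inside [src, src + count), i.e. when a forward copy would
     overwrite source bytes before reading them.  */
  uintptr_t diff = (uintptr_t) dstin - (uintptr_t) src;
  if (diff == 0)
    return dstin;                       /* dst == src: nothing to do.  */
  if (diff >= count)
    return memmove (dstin, src, count); /* Forward copy is safe; the
                                           assembly falls into L(copy_long).  */

  /* Harmful overlap: copy 64-byte blocks backwards from the end, as
     L(loop64_backwards) does, and finish with the first bytes.  */
  unsigned char *d = dstin;
  const unsigned char *s = src;
  while (count > 64)
    {
      count -= 64;
      memmove (d + count, s + count, 64);
    }
  memmove (d, s, count);
  return dstin;
}

Because dstin - src is compared as an unsigned value, a destination below the
source wraps around to a large number and takes the forward path, which is
exactly what the sub/cbz/cmp/b.hs sequence in L(move_long) does.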