[AArch64] Adjust writeback in non-zero memset

Submitter Wilco Dijkstra
Date Nov. 7, 2018, 6:09 p.m.
Message ID <DB5PR08MB103042C88641FD9D3A04C3F983C40@DB5PR08MB1030.eurprd08.prod.outlook.com>
Permalink /patch/30063/
State New

Comments

Wilco Dijkstra - Nov. 7, 2018, 6:09 p.m.
This fixes an inefficiency in the non-zero memset.  Delaying the writeback
until the end of the loop is slightly faster on some cores; it gives a
~5% performance gain on Cortex-A53 for large non-zero memsets.
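
For reference, here is the before/after shape of the store loop (the same
instructions as in the hunk below, with the L(tail64) label omitted).  In the
old loop the first store performs the writeback, so the second store's address
depends on the freshly incremented dst; in the new loop both stores use
positive offsets from the unmodified dst and the writeback is folded into the
last store:

	/* Before: writeback on the first store.  */
1:	stp	q0, q0, [dst], 64	/* Post-index: store at dst, then dst += 64.  */
	stp	q0, q0, [dst, -32]	/* Address uses the already-updated dst.  */
	subs	count, count, 64
	b.hi	1b

	/* After: writeback delayed to the last store; dst is biased by -32.  */
1:	stp	q0, q0, [dst, 32]	/* Plain offset: dst not modified yet.  */
	stp	q0, q0, [dst, 64]!	/* Pre-index: dst += 64, store at the new dst.  */
	subs	count, count, 64
	b.hi	1b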

Tested against the GLIBC testsuite, OK for commit?

ChangeLog:
2018-11-07  Wilco Dijkstra  <wdijkstr@arm.com>

	* sysdeps/aarch64/memset.S (MEMSET): Improve non-zero memset loop.

---

Patch

diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 4a454593618f78e22c55520d56737fab5d8f63a4..2eefc62fc1eeccf736f627a7adfe5485aff9bca9 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -89,10 +89,10 @@  L(set_long):
 	b.eq	L(try_zva)
 L(no_zva):
 	sub	count, dstend, dst	/* Count is 16 too large.  */
-	add	dst, dst, 16
+	sub	dst, dst, 16		/* Dst is biased by -32.  */
 	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
-1:	stp	q0, q0, [dst], 64
-	stp	q0, q0, [dst, -32]
+1:	stp	q0, q0, [dst, 32]
+	stp	q0, q0, [dst, 64]!
 L(tail64):
 	subs	count, count, 64
 	b.hi	1b