Message ID: DB5PR08MB103082464EB664B31AFC5C5783C30@DB5PR08MB1030.eurprd08.prod.outlook.com
State: New, archived
On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>
> This fixes an inefficiency in the non-zero memset. Delaying the writeback
> until the end of the loop is slightly faster on some cores - this shows
> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>
> Tested against the GLIBC testsuite, OK for commit?

Can you please also summarize the performance results for other processors?

Thanks,
Siddhesh
Hi Siddhesh,

> On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
>> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>>
>> This fixes an inefficiency in the non-zero memset. Delaying the writeback
>> until the end of the loop is slightly faster on some cores - this shows
>> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>>
>> Tested against the GLIBC testsuite, OK for commit?
>
> Can you please also summarize the performance results for other processors?

I ran it on Cortex-A72, and there isn't a performance difference on the memset
bench beyond measurement errors. On any out-of-order core, address increments
are executed in parallel with loads/stores, i.e. they have no measurable
latency, so their exact placement is irrelevant.

Wilco
On 16/11/18 10:34 PM, Wilco Dijkstra wrote:
> I ran it on Cortex-A72, and there isn't a performance difference on the memset
> bench beyond measurement errors. On any out-of-order core, address increments
> are executed in parallel with loads/stores, i.e. they have no measurable
> latency, so their exact placement is irrelevant.

Thanks for the confirmation. Looks OK to me.

Siddhesh
diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 4a454593618f78e22c55520d56737fab5d8f63a4..9738cf5fd55a1d937fb3392cec46f37b4d5fb51d 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -89,10 +89,10 @@ L(set_long):
 	b.eq	L(try_zva)
 L(no_zva):
 	sub	count, dstend, dst	/* Count is 16 too large.  */
-	add	dst, dst, 16
+	sub	dst, dst, 16	/* Dst is biased by -32.  */
 	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
-1:	stp	q0, q0, [dst], 64
-	stp	q0, q0, [dst, -32]
+1:	stp	q0, q0, [dst, 32]
+	stp	q0, q0, [dst, 64]!
 L(tail64):
 	subs	count, count, 64
 	b.hi	1b
@@ -183,6 +183,7 @@ L(zva_other):
 	subs	count, count, zva_len
 	b.hs	3b
 4:	add	count, count, zva_len
+	sub	dst, dst, 32	/* Bias dst for tail loop.  */
 	b	L(tail64)
 #endif