[AArch64] Adjust writeback in non-zero memset
Commit Message
v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
This fixes an inefficiency in the non-zero memset. Delaying the writeback
until the end of the loop is slightly faster on some cores - this gives a
~5% performance gain on Cortex-A53 when doing large non-zero memsets.
Tested against the GLIBC testsuite, OK for commit?
ChangeLog:
2018-11-14 Wilco Dijkstra <wdijkstr@arm.com>
* sysdeps/aarch64/memset.S (MEMSET): Improve non-zero memset loop.
---
Comments
On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>
> This fixes an inefficiency in the non-zero memset. Delaying the writeback
> until the end of the loop is slightly faster on some cores - this gives a
> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>
> Tested against the GLIBC testsuite, OK for commit?
Can you please also summarize the performance results for other processors?
Thanks,
Siddhesh
Hi Siddhesh,
>On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
>> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>>
>> This fixes an inefficiency in the non-zero memset. Delaying the writeback
>> until the end of the loop is slightly faster on some cores - this gives a
>> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>>
>> Tested against the GLIBC testsuite, OK for commit?
>
> Can you please also summarize the performance results for other processors?
I ran it on Cortex-A72, and there is no performance difference on the memset benchmark
beyond measurement error. On any out-of-order core, address increments are
executed in parallel with loads/stores, i.e. they have no measurable latency, so
their exact placement is irrelevant.
Wilco
On 16/11/18 10:34 PM, Wilco Dijkstra wrote:
> I ran it on Cortex-A72, and there is no performance difference on the memset benchmark
> beyond measurement error. On any out-of-order core, address increments are
> executed in parallel with loads/stores, i.e. they have no measurable latency, so
> their exact placement is irrelevant.
Thanks for the confirmation. Looks OK to me.
Siddhesh
@@ -89,10 +89,10 @@ L(set_long):
    b.eq   L(try_zva)
 L(no_zva):
    sub   count, dstend, dst   /* Count is 16 too large. */
-   add   dst, dst, 16
+   sub   dst, dst, 16      /* Dst is biased by -32. */
    sub   count, count, 64 + 16   /* Adjust count and bias for loop. */
-1:   stp   q0, q0, [dst], 64
-   stp   q0, q0, [dst, -32]
+1:   stp   q0, q0, [dst, 32]
+   stp   q0, q0, [dst, 64]!
 L(tail64):
    subs   count, count, 64
    b.hi   1b
@@ -183,6 +183,7 @@ L(zva_other):
    subs   count, count, zva_len
    b.hs   3b
 4:   add   count, count, zva_len
+   sub   dst, dst, 32      /* Bias dst for tail loop. */
    b   L(tail64)
 #endif