[2/2] aarch64: Add back non-temporal load/stores from oryon-1's memset

Message ID 20250301190504.1581631-2-quic_apinski@quicinc.com (mailing list archive)
State Accepted
Delegated to: Adhemerval Zanella Netto
Series [1/2] aarch64: Add back non-temporal load/stores from oryon-1's memcpy

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed

Commit Message

Andrew Pinski March 1, 2025, 7:05 p.m. UTC
  I misunderstood the recommendation from the hardware team about non-temporal
load/stores: they are still recommended for memset with large sizes. The
recommendation was only against using them with device memory, and memset is
already not valid for use with device memory.

This reverts commit e6590f0c86632c36c9a784cf96075f4be2e920d2.
---
 sysdeps/aarch64/multiarch/memset_oryon1.S | 26 +++++++++++++++++++++++
 1 file changed, 26 insertions(+)
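
For readers skimming the assembly: below is a minimal C sketch of the dispatch
this patch restores. It is illustrative only, assuming an AArch64 compiler;
the stnp16 wrapper, the set_long shape, and the elided helper paths are the
editor's, not glibc code.

    #include <stddef.h>
    #include <stdint.h>

    /* Non-temporal 16-byte store (STNP): hints to the core that the line
       need not be kept in cache, which is the point of using it for
       buffers much larger than the cache.  AArch64-only sketch.  */
    static inline void
    stnp16 (uint8_t *p, uint64_t v)
    {
      __asm__ volatile ("stnp %1, %1, [%0]" : : "r" (p), "r" (v) : "memory");
    }

    /* Sketch of the large-size dispatch in memset_oryon1.S: zeroing 256+
       bytes uses DC ZVA; above 32 KiB, non-temporal stores; otherwise a
       plain STP loop.  val holds the memset byte replicated to 64 bits.  */
    static void
    set_long (uint8_t *dst, uint8_t *dstend, uint64_t val)
    {
      size_t count = (size_t) (dstend - dst);

      if (count >= 256 && val == 0)
        return;                        /* L(try_zva): DC ZVA path (elided) */
      if (count > 32768)               /* cmp count, #32768; b.hi ...      */
        {
          /* Simplified L(set_long_with_nontemp): 64 bytes per iteration;
             the real code biases count so a single tbz decides the tail,
             instead of the comparisons used here.  */
          while ((size_t) (dstend - dst) > 64)
            {
              stnp16 (dst +  0, val);
              stnp16 (dst + 16, val);
              stnp16 (dst + 32, val);
              stnp16 (dst + 48, val);
              dst += 64;
            }
          if (dstend - dst > 32)       /* the two stores tbz may skip */
            {
              stnp16 (dst, val);
              stnp16 (dst + 16, val);
            }
          stnp16 (dstend - 32, val);   /* overlap-write the final 32 bytes */
          stnp16 (dstend - 16, val);
          return;
        }
      /* Moderate sizes: plain STP loop (elided).  */
    }

(The real loop stores at [dst, 16] through [dst, 64], which suggests the first
16 bytes of the buffer are written before the loop is entered, as is common in
these routines; the sketch glosses over that.  The non-temporal hint keeps a
very large memset from displacing the caller's working set in cache.)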
  

Comments

Adhemerval Zanella March 5, 2025, 2:02 p.m. UTC | #1
On 01/03/25 16:05, Andrew Pinski wrote:
> I misunderstood the recommendation from the hardware team about non-temporal
> load/stores: they are still recommended for memset with large sizes. The
> recommendation was only against using them with device memory, and memset is
> already not valid for use with device memory.
> 
> This reverts commit e6590f0c86632c36c9a784cf96075f4be2e920d2.

LGTM.  Besides the public documentation point from the memcpy patch, I recall
that IBM had a related issue where unaligned VSX load/stores triggered
traps on non-cached memory (for DMA-like accelerators), and while libc
memcpy was not supported in such scenarios, it was used anyway by some
userland programs to operate on device-driver-mapped memory (such as the
Mesa drivers).

Would this be a problem, especially now that this chip is getting more
support in desktop environments?

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

> ---
>  sysdeps/aarch64/multiarch/memset_oryon1.S | 26 +++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/sysdeps/aarch64/multiarch/memset_oryon1.S b/sysdeps/aarch64/multiarch/memset_oryon1.S
> index 0f9b718f25..88f4ef4b33 100644
> --- a/sysdeps/aarch64/multiarch/memset_oryon1.S
> +++ b/sysdeps/aarch64/multiarch/memset_oryon1.S
> @@ -90,6 +90,8 @@ L(set_long):
>  	cmp	count, 256
>  	ccmp	valw, 0, 0, cs
>  	b.eq	L(try_zva)
> +	cmp	count, #32768
> +	b.hi	L(set_long_with_nontemp)
>  	/* Small-size or non-zero memset does not use DC ZVA. */
>  	sub	count, dstend, dst
>  
> @@ -112,6 +114,30 @@ L(set_long):
>  	stp	val, val, [dstend, -16]
>  	ret
>  
> +L(set_long_with_nontemp):
> +	/* Small-size or non-zero memset does not use DC ZVA. */
> +	sub	count, dstend, dst
> +
> +	/* Adjust count and bias for loop. By subtracting extra 1 from count,
> +	   it is easy to use tbz instruction to check whether loop tailing
> +	   count is less than 33 bytes, so as to bypass 2 unnecessary stps. */
> +	sub	count, count, 64+16+1
> +
> +1:	stnp	val, val, [dst, 16]
> +	stnp	val, val, [dst, 32]
> +	stnp	val, val, [dst, 48]
> +	stnp	val, val, [dst, 64]
> +	add	dst, dst, #64
> +	subs	count, count, 64
> +	b.hs	1b
> +
> +	tbz	count, 5, 1f	/* Remaining count is less than 33 bytes? */
> +	stnp	val, val, [dst, 16]
> +	stnp	val, val, [dst, 32]
> +1:	stnp	val, val, [dstend, -32]
> +	stnp	val, val, [dstend, -16]
> +	ret
> +
>  L(try_zva):
>  	/* Write the first and last 64 byte aligned block using stp rather
>  	   than using DC ZVA as it is faster. */
  

Patch

diff --git a/sysdeps/aarch64/multiarch/memset_oryon1.S b/sysdeps/aarch64/multiarch/memset_oryon1.S
index 0f9b718f25..88f4ef4b33 100644
--- a/sysdeps/aarch64/multiarch/memset_oryon1.S
+++ b/sysdeps/aarch64/multiarch/memset_oryon1.S
@@ -90,6 +90,8 @@  L(set_long):
 	cmp	count, 256
 	ccmp	valw, 0, 0, cs
 	b.eq	L(try_zva)
+	cmp	count, #32768
+	b.hi	L(set_long_with_nontemp)
 	/* Small-size or non-zero memset does not use DC ZVA. */
 	sub	count, dstend, dst
 
@@ -112,6 +114,30 @@  L(set_long):
 	stp	val, val, [dstend, -16]
 	ret
 
+L(set_long_with_nontemp):
+	/* Small-size or non-zero memset does not use DC ZVA. */
+	sub	count, dstend, dst
+
+	/* Adjust count and bias for loop. By subtracting extra 1 from count,
+	   it is easy to use tbz instruction to check whether loop tailing
+	   count is less than 33 bytes, so as to bypass 2 unnecessary stps. */
+	sub	count, count, 64+16+1
+
+1:	stnp	val, val, [dst, 16]
+	stnp	val, val, [dst, 32]
+	stnp	val, val, [dst, 48]
+	stnp	val, val, [dst, 64]
+	add	dst, dst, #64
+	subs	count, count, 64
+	b.hs	1b
+
+	tbz	count, 5, 1f	/* Remaining count is less than 33 bytes? */
+	stnp	val, val, [dst, 16]
+	stnp	val, val, [dst, 32]
+1:	stnp	val, val, [dstend, -32]
+	stnp	val, val, [dstend, -16]
+	ret
+
 L(try_zva):
 	/* Write the first and last 64 byte aligned block using stp rather
 	   than using DC ZVA as it is faster. */
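
A note on the `sub count, count, 64+16+1` bias in the new code: once the
subs/b.hs loop falls through, count is negative and the bytes still unwritten
equal count + 65, so bit 5 of count is set exactly when more than 32 bytes
remain; that is all the single `tbz count, 5` needs to test.  A standalone
model (hypothetical, not part of the patch) that checks this arithmetic for a
range of sizes:

    #include <assert.h>
    #include <stdint.h>

    int
    main (void)
    {
      /* Model L(set_long_with_nontemp)'s biased counter for sizes that
         reach this path (count > 32768 in the real code).  */
      for (int64_t n = 32769; n <= 40000; n++)
        {
          int64_t count = n - (64 + 16 + 1); /* sub count, count, 64+16+1 */
          while (count >= 0)                 /* subs count, count, 64; b.hs */
            count -= 64;

          /* The loop stores at offsets 16..80 of each block, so the bytes
             not yet written are count + 65: the +16 of the bias absorbs
             the 16-byte store offset, the extra +1 shifts the boundary.  */
          int64_t remaining = count + 65;
          assert (remaining >= 1 && remaining <= 64);

          /* tbz count, 5: thanks to the extra -1, bit 5 of the negative
             count is set exactly when remaining >= 33, i.e. when the two
             extra stnp stores before the 1: label are needed.  */
          int bit5 = ((uint64_t) count >> 5) & 1;
          assert (bit5 == (remaining >= 33));
        }
      return 0;
    }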