[1/2] aarch64: Add back non-temporal load/stores from oryon-1's memcpy

Message ID 20250301190504.1581631-1-quic_apinski@quicinc.com (mailing list archive)
State Accepted
Delegated to: Adhemerval Zanella Netto
Headers
Series [1/2] aarch64: Add back non-temporal load/stores from oryon-1's memcpy |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed

Commit Message

Andrew Pinski March 1, 2025, 7:05 p.m. UTC
  I misunderstood the recommendation from the hardware team about non-temporal
load/stores. It is still recommended to use them in memcpy for large sizes. It
was not recommended for their use with device memory and memcpy is already
not valid to be use with device memory.

This reverts commit eb5eeb47403e0a91de834868e501b4d62b8d2cb9.
---
 sysdeps/aarch64/multiarch/memcpy_oryon1.S | 40 +++++++++++++++++++++++
 1 file changed, 40 insertions(+)
  

Comments

Adhemerval Zanella March 5, 2025, 1:56 p.m. UTC | #1
On 01/03/25 16:05, Andrew Pinski wrote:
> I misunderstood the recommendation from the hardware team about non-temporal
> load/stores. It is still recommended to use them in memcpy for large sizes. It
> was not recommended for their use with device memory and memcpy is already
> not valid to be use with device memory.
> 
> This reverts commit eb5eeb47403e0a91de834868e501b4d62b8d2cb9.

LGTM.  Do we have public information on how to properly tune for these
chips, like a reference manual? It would be good to have some reference
in the future.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

> ---
>  sysdeps/aarch64/multiarch/memcpy_oryon1.S | 40 +++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/sysdeps/aarch64/multiarch/memcpy_oryon1.S b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> index e86d8b04f5..cc267db53c 100644
> --- a/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> +++ b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> @@ -152,6 +152,46 @@ L(copy96):
>  	.p2align 6
>  L(copy_long):
>  
> +	/* On oryon1 cores, large memcpy's are helped by using ldnp/stnp.
> +	   This loop is identical to the one below it but using ldnp/stnp
> +	   instructions.  For loops that are less than 32768 bytes,
> +	   the ldnp/stnp instructions will not help and will cause a slow
> +	   down so only use the ldnp/stnp loop for the largest sizes.  */
> +
> +	cmp	count, #32768
> +	b.lo	L(copy_long_without_nontemp)
> +	and	tmp1, dstin, 15
> +	bic	dst, dstin, 15
> +	ldnp	D_l, D_h, [src]
> +	sub	src, src, tmp1
> +	add	count, count, tmp1	/* Count is now 16 too large.  */
> +	ldnp	A_l, A_h, [src, 16]
> +	stnp	D_l, D_h, [dstin]
> +	ldnp	B_l, B_h, [src, 32]
> +	ldnp	C_l, C_h, [src, 48]
> +	ldnp	D_l, D_h, [src, 64]
> +	add	src, src, #64
> +	subs	count, count, 128 + 16	/* Test and readjust count.  */
> +
> +L(nontemp_loop64):
> +	tbz	src, #6, 1f
> +1:
> +	stnp	A_l, A_h, [dst, 16]
> +	ldnp	A_l, A_h, [src, 16]
> +	stnp	B_l, B_h, [dst, 32]
> +	ldnp	B_l, B_h, [src, 32]
> +	stnp	C_l, C_h, [dst, 48]
> +	ldnp	C_l, C_h, [src, 48]
> +	stnp	D_l, D_h, [dst, 64]
> +	ldnp	D_l, D_h, [src, 64]
> +	add	src, src, #64
> +	add	dst, dst, #64
> +	subs	count, count, 64
> +	b.hi	L(nontemp_loop64)
> +	b	L(last64)
> +
> +L(copy_long_without_nontemp):
> +
>  	and	tmp1, dstin, 15
>  	bic	dst, dstin, 15
>  	ldp	D_l, D_h, [src]
  

Patch

diff --git a/sysdeps/aarch64/multiarch/memcpy_oryon1.S b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
index e86d8b04f5..cc267db53c 100644
--- a/sysdeps/aarch64/multiarch/memcpy_oryon1.S
+++ b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
@@ -152,6 +152,46 @@  L(copy96):
 	.p2align 6
 L(copy_long):
 
+	/* On oryon1 cores, large memcpy's are helped by using ldnp/stnp.
+	   This loop is identical to the one below it but using ldnp/stnp
+	   instructions.  For loops that are less than 32768 bytes,
+	   the ldnp/stnp instructions will not help and will cause a slow
+	   down so only use the ldnp/stnp loop for the largest sizes.  */
+
+	cmp	count, #32768
+	b.lo	L(copy_long_without_nontemp)
+	and	tmp1, dstin, 15
+	bic	dst, dstin, 15
+	ldnp	D_l, D_h, [src]
+	sub	src, src, tmp1
+	add	count, count, tmp1	/* Count is now 16 too large.  */
+	ldnp	A_l, A_h, [src, 16]
+	stnp	D_l, D_h, [dstin]
+	ldnp	B_l, B_h, [src, 32]
+	ldnp	C_l, C_h, [src, 48]
+	ldnp	D_l, D_h, [src, 64]
+	add	src, src, #64
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+
+L(nontemp_loop64):
+	tbz	src, #6, 1f
+1:
+	stnp	A_l, A_h, [dst, 16]
+	ldnp	A_l, A_h, [src, 16]
+	stnp	B_l, B_h, [dst, 32]
+	ldnp	B_l, B_h, [src, 32]
+	stnp	C_l, C_h, [dst, 48]
+	ldnp	C_l, C_h, [src, 48]
+	stnp	D_l, D_h, [dst, 64]
+	ldnp	D_l, D_h, [src, 64]
+	add	src, src, #64
+	add	dst, dst, #64
+	subs	count, count, 64
+	b.hi	L(nontemp_loop64)
+	b	L(last64)
+
+L(copy_long_without_nontemp):
+
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
 	ldp	D_l, D_h, [src]