[1/2] aarch64: Add back non-temporal load/stores from oryon-1's memcpy

Message ID 20250301190504.1581631-1-quic_apinski@quicinc.com (mailing list archive)
State Committed
Commit 0e1aa5db738ac7c73599a3e7f1a0b70b99f99e0a
Delegated to: Adhemerval Zanella Netto
Headers
Series [1/2] aarch64: Add back non-temporal load/stores from oryon-1's memcpy |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed

Commit Message

Andrew Pinski March 1, 2025, 7:05 p.m. UTC
  I misunderstood the recommendation from the hardware team about non-temporal
load/stores. It is still recommended to use them in memcpy for large sizes. It
was not recommended for their use with device memory and memcpy is already
not valid to be use with device memory.

This reverts commit eb5eeb47403e0a91de834868e501b4d62b8d2cb9.
---
 sysdeps/aarch64/multiarch/memcpy_oryon1.S | 40 +++++++++++++++++++++++
 1 file changed, 40 insertions(+)
  

Comments

Adhemerval Zanella Netto March 5, 2025, 1:56 p.m. UTC | #1
On 01/03/25 16:05, Andrew Pinski wrote:
> I misunderstood the recommendation from the hardware team about non-temporal
> load/stores. It is still recommended to use them in memcpy for large sizes. It
> was not recommended for their use with device memory and memcpy is already
> not valid to be use with device memory.
> 
> This reverts commit eb5eeb47403e0a91de834868e501b4d62b8d2cb9.

LGTM.  Do we have public information on how to properly tune for these
chips, like a reference manual? It would be good to have some reference
in the future.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>

> ---
>  sysdeps/aarch64/multiarch/memcpy_oryon1.S | 40 +++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/sysdeps/aarch64/multiarch/memcpy_oryon1.S b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> index e86d8b04f5..cc267db53c 100644
> --- a/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> +++ b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> @@ -152,6 +152,46 @@ L(copy96):
>  	.p2align 6
>  L(copy_long):
>  
> +	/* On oryon1 cores, large memcpy's are helped by using ldnp/stnp.
> +	   This loop is identical to the one below it but using ldnp/stnp
> +	   instructions.  For loops that are less than 32768 bytes,
> +	   the ldnp/stnp instructions will not help and will cause a slow
> +	   down so only use the ldnp/stnp loop for the largest sizes.  */
> +
> +	cmp	count, #32768
> +	b.lo	L(copy_long_without_nontemp)
> +	and	tmp1, dstin, 15
> +	bic	dst, dstin, 15
> +	ldnp	D_l, D_h, [src]
> +	sub	src, src, tmp1
> +	add	count, count, tmp1	/* Count is now 16 too large.  */
> +	ldnp	A_l, A_h, [src, 16]
> +	stnp	D_l, D_h, [dstin]
> +	ldnp	B_l, B_h, [src, 32]
> +	ldnp	C_l, C_h, [src, 48]
> +	ldnp	D_l, D_h, [src, 64]
> +	add	src, src, #64
> +	subs	count, count, 128 + 16	/* Test and readjust count.  */
> +
> +L(nontemp_loop64):
> +	tbz	src, #6, 1f
> +1:
> +	stnp	A_l, A_h, [dst, 16]
> +	ldnp	A_l, A_h, [src, 16]
> +	stnp	B_l, B_h, [dst, 32]
> +	ldnp	B_l, B_h, [src, 32]
> +	stnp	C_l, C_h, [dst, 48]
> +	ldnp	C_l, C_h, [src, 48]
> +	stnp	D_l, D_h, [dst, 64]
> +	ldnp	D_l, D_h, [src, 64]
> +	add	src, src, #64
> +	add	dst, dst, #64
> +	subs	count, count, 64
> +	b.hi	L(nontemp_loop64)
> +	b	L(last64)
> +
> +L(copy_long_without_nontemp):
> +
>  	and	tmp1, dstin, 15
>  	bic	dst, dstin, 15
>  	ldp	D_l, D_h, [src]
  
Andrew Pinski April 15, 2025, 7:05 p.m. UTC | #2
On Wed, Mar 5, 2025 at 5:58 AM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 01/03/25 16:05, Andrew Pinski wrote:
> > I misunderstood the recommendation from the hardware team about non-temporal
> > load/stores. It is still recommended to use them in memcpy for large sizes. It
> > was not recommended for their use with device memory and memcpy is already
> > not valid to be use with device memory.
> >
> > This reverts commit eb5eeb47403e0a91de834868e501b4d62b8d2cb9.
>
> LGTM.  Do we have public information on how to properly tune for these
> chips, like a reference manual? It would be good to have some reference
> in the future.

Let me get back to you on this.

Thanks,
Andrew Pinski


>
> Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
>
> > ---
> >  sysdeps/aarch64/multiarch/memcpy_oryon1.S | 40 +++++++++++++++++++++++
> >  1 file changed, 40 insertions(+)
> >
> > diff --git a/sysdeps/aarch64/multiarch/memcpy_oryon1.S b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> > index e86d8b04f5..cc267db53c 100644
> > --- a/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> > +++ b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
> > @@ -152,6 +152,46 @@ L(copy96):
> >       .p2align 6
> >  L(copy_long):
> >
> > +     /* On oryon1 cores, large memcpy's are helped by using ldnp/stnp.
> > +        This loop is identical to the one below it but using ldnp/stnp
> > +        instructions.  For loops that are less than 32768 bytes,
> > +        the ldnp/stnp instructions will not help and will cause a slow
> > +        down so only use the ldnp/stnp loop for the largest sizes.  */
> > +
> > +     cmp     count, #32768
> > +     b.lo    L(copy_long_without_nontemp)
> > +     and     tmp1, dstin, 15
> > +     bic     dst, dstin, 15
> > +     ldnp    D_l, D_h, [src]
> > +     sub     src, src, tmp1
> > +     add     count, count, tmp1      /* Count is now 16 too large.  */
> > +     ldnp    A_l, A_h, [src, 16]
> > +     stnp    D_l, D_h, [dstin]
> > +     ldnp    B_l, B_h, [src, 32]
> > +     ldnp    C_l, C_h, [src, 48]
> > +     ldnp    D_l, D_h, [src, 64]
> > +     add     src, src, #64
> > +     subs    count, count, 128 + 16  /* Test and readjust count.  */
> > +
> > +L(nontemp_loop64):
> > +     tbz     src, #6, 1f
> > +1:
> > +     stnp    A_l, A_h, [dst, 16]
> > +     ldnp    A_l, A_h, [src, 16]
> > +     stnp    B_l, B_h, [dst, 32]
> > +     ldnp    B_l, B_h, [src, 32]
> > +     stnp    C_l, C_h, [dst, 48]
> > +     ldnp    C_l, C_h, [src, 48]
> > +     stnp    D_l, D_h, [dst, 64]
> > +     ldnp    D_l, D_h, [src, 64]
> > +     add     src, src, #64
> > +     add     dst, dst, #64
> > +     subs    count, count, 64
> > +     b.hi    L(nontemp_loop64)
> > +     b       L(last64)
> > +
> > +L(copy_long_without_nontemp):
> > +
> >       and     tmp1, dstin, 15
> >       bic     dst, dstin, 15
> >       ldp     D_l, D_h, [src]
>
  

Patch

diff --git a/sysdeps/aarch64/multiarch/memcpy_oryon1.S b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
index e86d8b04f5..cc267db53c 100644
--- a/sysdeps/aarch64/multiarch/memcpy_oryon1.S
+++ b/sysdeps/aarch64/multiarch/memcpy_oryon1.S
@@ -152,6 +152,46 @@  L(copy96):
 	.p2align 6
 L(copy_long):
 
+	/* On oryon1 cores, large memcpy's are helped by using ldnp/stnp.
+	   This loop is identical to the one below it but using ldnp/stnp
+	   instructions.  For loops that are less than 32768 bytes,
+	   the ldnp/stnp instructions will not help and will cause a slow
+	   down so only use the ldnp/stnp loop for the largest sizes.  */
+
+	cmp	count, #32768
+	b.lo	L(copy_long_without_nontemp)
+	and	tmp1, dstin, 15
+	bic	dst, dstin, 15
+	ldnp	D_l, D_h, [src]
+	sub	src, src, tmp1
+	add	count, count, tmp1	/* Count is now 16 too large.  */
+	ldnp	A_l, A_h, [src, 16]
+	stnp	D_l, D_h, [dstin]
+	ldnp	B_l, B_h, [src, 32]
+	ldnp	C_l, C_h, [src, 48]
+	ldnp	D_l, D_h, [src, 64]
+	add	src, src, #64
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+
+L(nontemp_loop64):
+	tbz	src, #6, 1f
+1:
+	stnp	A_l, A_h, [dst, 16]
+	ldnp	A_l, A_h, [src, 16]
+	stnp	B_l, B_h, [dst, 32]
+	ldnp	B_l, B_h, [src, 32]
+	stnp	C_l, C_h, [dst, 48]
+	ldnp	C_l, C_h, [src, 48]
+	stnp	D_l, D_h, [dst, 64]
+	ldnp	D_l, D_h, [src, 64]
+	add	src, src, #64
+	add	dst, dst, #64
+	subs	count, count, 64
+	b.hi	L(nontemp_loop64)
+	b	L(last64)
+
+L(copy_long_without_nontemp):
+
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
 	ldp	D_l, D_h, [src]