[2/3] x86_64: Optimize large size copy in memmove-ssse3

Message ID 20240626024649.3689-2-MayShao-oc@zhaoxin.com (mailing list archive)
State Superseded
Series [1/3] x86: Set preferred CPU features on the KH-40000 and KX-7000 Zhaoxin processors

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed

Commit Message

Mayshao-oc June 26, 2024, 2:46 a.m. UTC
  From: MayShao <mayshao-oc@zhaoxin.com>

This patch optimizes large-size copies to use normal (temporal) stores
when src > dst and the buffers overlap, making the logic the same as in
memmove-vec-unaligned-erms.S.

memmove-ssse3 currently uses '__x86_shared_cache_size_half' as the
non-temporal threshold; this patch updates it to
'__x86_shared_non_temporal_threshold'.  The latter is CPU-specific, with
per-CPU values derived from the related nt-benchmark results, whereas
the fixed '__x86_shared_cache_size_half' is not a reasonable threshold.

Performance does not change drastically: the benchmarks show modest
overall improvements without any major regressions.
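
For illustration, here is a minimal C model of the dispatch decision
after the patch. It is not glibc source: the function name, the fixed
threshold value, and the flat control flow are hypothetical
simplifications of the assembly.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative stand-in for glibc's per-CPU tunable
       __x86_shared_non_temporal_threshold (computed at startup).  */
    static size_t non_temporal_threshold = 1 << 20;

    /* Model of the forward large-copy path: non-temporal stores are
       used only above the threshold, and only when the buffers do not
       overlap with src > dst; otherwise the normal forward loop
       (temporal stores) is used, matching the logic of
       memmove-vec-unaligned-erms.S.  */
    static int
    use_nt_stores (const char *dst, const char *src, size_t len)
    {
      if (len <= non_temporal_threshold)
        return 0;   /* Below threshold: plain SSSE3 loop.  */
      if ((uintptr_t) src - (uintptr_t) dst < len)
        return 0;   /* Overlap with src > dst: normal forward copy.  */
      return 1;     /* Large and disjoint: non-temporal stores.  */
    }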

Results on Zhaoxin KX-7000:
bench-memcpy geometric_mean(N=20) New / Original: 1.000

bench-memcpy-random geometric_mean(N=20) New / Original: 0.998

bench-memcpy-large geometric_mean(N=20) New / Original: 0.975

bench-memmove geometric_mean(N=20) New / Original: 1.001

bench-memmove-large geometric_mean(N=20) New / Original: 0.964

Results on Intel Core i5-6600K:
bench-memcpy geometric_mean(N=20) New / Original: 1.007

bench-memcpy-random geometric_mean(N=20) New / Original: 1.000

bench-memcpy-large geometric_mean(N=20) New / Original: 0.998

bench-memmove geometric_mean(N=20) New / Original: 0.996

bench-memmove-large geometric_mean(N=20) New / Original: 0.941
---
 sysdeps/x86_64/multiarch/memmove-ssse3.S | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
  

Comments

Noah Goldstein June 26, 2024, 3:50 a.m. UTC | #1
On Wed, Jun 26, 2024 at 10:47 AM MayShao <MayShao-oc@zhaoxin.com> wrote:
>
> From: MayShao <mayshao-oc@zhaoxin.com>
>
> This patch optimizes large-size copies to use normal (temporal) stores
> when src > dst and the buffers overlap, making the logic the same as in
> memmove-vec-unaligned-erms.S.
>
> memmove-ssse3 currently uses '__x86_shared_cache_size_half' as the
> non-temporal threshold; this patch updates it to
> '__x86_shared_non_temporal_threshold'.  The latter is CPU-specific, with
> per-CPU values derived from the related nt-benchmark results, whereas
> the fixed '__x86_shared_cache_size_half' is not a reasonable threshold.
>
> Performance does not change drastically: the benchmarks show modest
> overall improvements without any major regressions.
>
> Results on Zhaoxin KX-7000:
> bench-memcpy geometric_mean(N=20) New / Original: 1.000
>
> bench-memcpy-random geometric_mean(N=20) New / Original: 0.998
>
> bench-memcpy-large geometric_mean(N=20) New / Original: 0.975
>
> bench-memmove geometric_mean(N=20) New / Original: 1.001
>
> bench-memmove-large geometric_mean(N=20) New / Original: 0.964
>
> Results on Intel Core i5-6600K:
> bench-memcpy geometric_mean(N=20) New / Original: 1.007
>
> bench-memcpy-random geometric_mean(N=20) New / Original: 1.000
>
> bench-memcpy-large geometric_mean(N=20) New / Original: 0.998
>
> bench-memmove geometric_mean(N=20) New / Original: 0.996
>
> bench-memmove-large geometric_mean(N=20) New / Original: 0.941
> ---
>  sysdeps/x86_64/multiarch/memmove-ssse3.S | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memmove-ssse3.S b/sysdeps/x86_64/multiarch/memmove-ssse3.S
> index 048d015712..40bf90b2b7 100644
> --- a/sysdeps/x86_64/multiarch/memmove-ssse3.S
> +++ b/sysdeps/x86_64/multiarch/memmove-ssse3.S
> @@ -151,13 +151,11 @@ L(more_2x_vec):
>            loop.  */
>         movups  %xmm0, (%rdi)
>
> -# ifdef SHARED_CACHE_SIZE_HALF
> -       cmp     $SHARED_CACHE_SIZE_HALF, %RDX_LP
> -# else
> -       cmp     __x86_shared_cache_size_half(%rip), %rdx
> -# endif
> +       cmp     __x86_shared_non_temporal_threshold(%rip), %rdx
>         ja      L(large_memcpy)
>
> +    .p2align 4,, 8
Drop this. The padding here ends up messing up alignment of
further down targets. The overlap + super large copy case
isn't so common.
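(For context: `.p2align 4,, 8` asks the assembler to align to a 16-byte
boundary only when that costs at most 8 bytes of padding; any padding
emitted here shifts the offsets of all the code below it, which is how
it can knock later, carefully tuned jump targets off their boundaries.)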
> +L(loop_fwd):
>         leaq    -64(%rdi, %rdx), %r8
>         andq    $-16, %rdi
>         movl    $48, %edx
> @@ -199,6 +197,10 @@ L(large_memcpy):
>         movups  -64(%r9, %rdx), %xmm10
>         movups  -80(%r9, %rdx), %xmm11
>
> +       subq    %rdi, %r9
> +       cmpq    %r9, %rdx
> +       ja      L(loop_fwd)
nit: can you swap this to `cmp %rdx, %r9; jb L(loop_fwd)` so it's consistent w/
the backward overlap check. Also can you comment what this check is doing
and the rationale.
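A sketch of the swapped check with the requested comment, assuming (as
the surrounding loads suggest) that %r9 holds the source pointer and
%rdx the length at this point:

	/* If src - dst < len, the buffers overlap with src > dst, so
	   take the normal forward loop instead of the non-temporal
	   path.  */
	subq	%rdi, %r9
	cmp	%rdx, %r9
	jb	L(loop_fwd)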
> +
>         sall    $5, %ecx
>         leal    (%rcx, %rcx, 2), %r8d
>         leaq    -96(%rdi, %rdx), %rcx
> --
> 2.34.1
>
  
Mayshao-oc June 27, 2024, 9:16 a.m. UTC | #2
On Wed, Jun 26, 2024 at 11:50 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Wed, Jun 26, 2024 at 10:47 AM MayShao <MayShao-oc@zhaoxin.com> wrote:
> >
> > From: MayShao <mayshao-oc@zhaoxin.com>
> >
> > This patch optimizes large-size copies to use normal (temporal) stores
> > when src > dst and the buffers overlap, making the logic the same as in
> > memmove-vec-unaligned-erms.S.
> >
> > memmove-ssse3 currently uses '__x86_shared_cache_size_half' as the
> > non-temporal threshold; this patch updates it to
> > '__x86_shared_non_temporal_threshold'.  The latter is CPU-specific, with
> > per-CPU values derived from the related nt-benchmark results, whereas
> > the fixed '__x86_shared_cache_size_half' is not a reasonable threshold.
> >
> > Performance does not change drastically: the benchmarks show modest
> > overall improvements without any major regressions.
> >
> > Results on Zhaoxin KX-7000:
> > bench-memcpy geometric_mean(N=20) New / Original: 1.000
> >
> > bench-memcpy-random geometric_mean(N=20) New / Original: 0.998
> >
> > bench-memcpy-large geometric_mean(N=20) New / Original: 0.975
> >
> > bench-memmove geometric_mean(N=20) New / Original: 1.001
> >
> > bench-memmove-large geometric_mean(N=20) New / Original: 0.964
> >
> > Results on Intel Core i5-6600K:
> > bench-memcpy geometric_mean(N=20) New / Original: 1.007
> >
> > bench-memcpy-random geometric_mean(N=20) New / Original: 1.000
> >
> > bench-memcpy-large geometric_mean(N=20) New / Original: 0.998
> >
> > bench-memmove geometric_mean(N=20) New / Original: 0.996
> >
> > bench-memmove-large geometric_mean(N=20) New / Original: 0.941
> > ---
> >  sysdeps/x86_64/multiarch/memmove-ssse3.S | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memmove-ssse3.S b/sysdeps/x86_64/multiarch/memmove-ssse3.S
> > index 048d015712..40bf90b2b7 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-ssse3.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-ssse3.S
> > @@ -151,13 +151,11 @@ L(more_2x_vec):
> >            loop.  */
> >         movups  %xmm0, (%rdi)
> >
> > -# ifdef SHARED_CACHE_SIZE_HALF
> > -       cmp     $SHARED_CACHE_SIZE_HALF, %RDX_LP
> > -# else
> > -       cmp     __x86_shared_cache_size_half(%rip), %rdx
> > -# endif
> > +       cmp     __x86_shared_non_temporal_threshold(%rip), %rdx
> >         ja      L(large_memcpy)
> >
> > +    .p2align 4,, 8
> Drop this. The padding here ends up messing up alignment of
> further down targets. The overlap + super large copy case
> isn't so common.

Agree. I will fix it.

> > +L(loop_fwd):
> >         leaq    -64(%rdi, %rdx), %r8
> >         andq    $-16, %rdi
> >         movl    $48, %edx
> > @@ -199,6 +197,10 @@ L(large_memcpy):
> >         movups  -64(%r9, %rdx), %xmm10
> >         movups  -80(%r9, %rdx), %xmm11
> >
> > +       subq    %rdi, %r9
> > +       cmpq    %r9, %rdx
> > +       ja      L(loop_fwd)
> nit: can you swap this to `cmp %rdx, %r9; jb L(loop_fwd)` so it's consistent w/
> the backward overlap check. Also can you comment what this check is doing
> and the rationale.

Will fix this.

> > +
> >         sall    $5, %ecx
> >         leal    (%rcx, %rcx, 2), %r8d
> >         leaq    -96(%rdi, %rdx), %rcx
> > --
> > 2.34.1
> >
  

Patch

diff --git a/sysdeps/x86_64/multiarch/memmove-ssse3.S b/sysdeps/x86_64/multiarch/memmove-ssse3.S
index 048d015712..40bf90b2b7 100644
--- a/sysdeps/x86_64/multiarch/memmove-ssse3.S
+++ b/sysdeps/x86_64/multiarch/memmove-ssse3.S
@@ -151,13 +151,11 @@  L(more_2x_vec):
 	   loop.  */
 	movups	%xmm0, (%rdi)
 
-# ifdef SHARED_CACHE_SIZE_HALF
-	cmp	$SHARED_CACHE_SIZE_HALF, %RDX_LP
-# else
-	cmp	__x86_shared_cache_size_half(%rip), %rdx
-# endif
+	cmp	__x86_shared_non_temporal_threshold(%rip), %rdx
 	ja	L(large_memcpy)
 
+    .p2align 4,, 8
+L(loop_fwd):
 	leaq	-64(%rdi, %rdx), %r8
 	andq	$-16, %rdi
 	movl	$48, %edx
@@ -199,6 +197,10 @@  L(large_memcpy):
 	movups	-64(%r9, %rdx), %xmm10
 	movups	-80(%r9, %rdx), %xmm11
 
+	subq	%rdi, %r9
+	cmpq	%r9, %rdx
+	ja	L(loop_fwd)
+
 	sall	$5, %ecx
 	leal	(%rcx, %rcx, 2), %r8d
 	leaq	-96(%rdi, %rdx), %rcx