[v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]

Message ID 20240519004347.2759850-1-goldstein.w.n@gmail.com
State Committed
Commit 5bf0ab80573d66e4ae5d94b094659094336da90f
Headers
Series [v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Testing passed

Commit Message

Noah Goldstein May 19, 2024, 12:43 a.m. UTC
  Previously we use `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different stategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
size, non-temporal writes can avoid data in their RFO requests saving
memory bandwidth.

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged, this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
---
 .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
 1 file changed, 91 insertions(+), 58 deletions(-)
  

Comments

H.J. Lu May 20, 2024, 5:14 p.m. UTC | #1
On Sat, May 18, 2024 at 5:44 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2   : 0.986
> Geometric Mean across the suite New / Old SSE2   : 0.979
>
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2   : 0.928
> Geometric Mean across the suite New / Old SSE2   : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
>  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
>  1 file changed, 91 insertions(+), 58 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
>     2. If size is less than VEC, use integer register stores.
>     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
>     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> -   5. On machines ERMS feature, if size is greater or equal than
> -      __x86_rep_stosb_threshold then REP STOSB will be used.
> -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> -      4 VEC stores and store 4 * VEC at a time until done.  */
> +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> +      4 VEC stores and store 4 * VEC at a time until done.
> +   6. On machines ERMS feature, if size is range
> +         [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +         then REP STOSB will be used.
> +   7. If size >= __x86_shared_non_temporal_threshold, use a
> +         non-temporal stores.  */
>
>  #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
>         VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
>         VMOVU   %VMM(0), (%rdi)
>         VZEROUPPER_RETURN
> +
> +       /* If have AVX512 mask instructions put L(less_vec) close to
> +          entry as it doesn't take much space and is likely a hot target.  */
> +#ifdef USE_LESS_VEC_MASK_STORE
> +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> +       .p2align 6,, 47
> +       .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> +       /* Less than 1 VEC.  */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +#  error Unsupported VEC_SIZE!
> +# endif
> +       /* Clear high bits from edi. Only keeping bits relevant to page
> +          cross check. Note that we are using rax which is set in
> +          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> +       andl    $(PAGE_SIZE - 1), %edi
> +       /* Check if VEC_SIZE store cross page. Mask stores suffer
> +          serious performance degradation when it has to fault suppress.  */
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> +       /* This is generally considered a cold target.  */
> +       ja      L(cross_page)
> +# if VEC_SIZE > 32
> +       movq    $-1, %rcx
> +       bzhiq   %rdx, %rcx, %rcx
> +       kmovq   %rcx, %k1
> +# else
> +       movl    $-1, %ecx
> +       bzhil   %edx, %ecx, %ecx
> +       kmovd   %ecx, %k1
> +# endif
> +       vmovdqu8 %VMM(0), (%rax){%k1}
> +       VZEROUPPER_RETURN
> +#endif
> +
>  #if defined USE_MULTIARCH && IS_IN (libc)
>  END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
>  #endif
>         VZEROUPPER_RETURN
>
> -       /* If have AVX512 mask instructions put L(less_vec) close to
> -          entry as it doesn't take much space and is likely a hot target.
> -        */
> -#ifdef USE_LESS_VEC_MASK_STORE
> -       .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> -       /* Less than 1 VEC.  */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -#  error Unsupported VEC_SIZE!
> -# endif
> -       /* Clear high bits from edi. Only keeping bits relevant to page
> -          cross check. Note that we are using rax which is set in
> -          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> -       andl    $(PAGE_SIZE - 1), %edi
> -       /* Check if VEC_SIZE store cross page. Mask stores suffer
> -          serious performance degradation when it has to fault suppress.
> -        */
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> -       /* This is generally considered a cold target.  */
> -       ja      L(cross_page)
> -# if VEC_SIZE > 32
> -       movq    $-1, %rcx
> -       bzhiq   %rdx, %rcx, %rcx
> -       kmovq   %rcx, %k1
> -# else
> -       movl    $-1, %ecx
> -       bzhil   %edx, %ecx, %ecx
> -       kmovd   %ecx, %k1
> -# endif
> -       vmovdqu8 %VMM(0), (%rax){%k1}
> -       VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> -       /* Include L(stosb_local) here if including L(less_vec) between
> -          L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> -          L(stosb_more_2x_vec) target.  */
> -       .p2align 4,, 10
> -L(stosb_local):
> -       movzbl  %sil, %eax
> -       mov     %RDX_LP, %RCX_LP
> -       mov     %RDI_LP, %RDX_LP
> -       rep     stosb
> -       mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
> -# endif
> -#endif
> -
>  #if defined USE_MULTIARCH && IS_IN (libc)
>         .p2align 4
>  L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
>         ret
>  #endif
>
> -       .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> +       .p2align 4
> +#else
> +       .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
>         /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>            range for 2-byte jump encoding.  */
>  L(stosb_local):
> +       cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +       jae     L(nt_memset)
>         movzbl  %sil, %eax
>         mov     %RDX_LP, %RCX_LP
>         mov     %RDI_LP, %RDX_LP
>         rep     stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> +       /* Use xchg to save 1-byte (this helps align targets below).  */
> +       xchg    %RDX_LP, %RAX_LP
> +# else
>         mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
>  # endif
> +       VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
>         /* Define L(less_vec) only if not otherwise defined.  */
> -       .p2align 4
> +       .p2align 4,, 12
>  L(less_vec):
>         /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
>            xmm). This is only does anything for AVX2.  */
> @@ -423,4 +425,35 @@ L(between_2_3):
>         movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
>  #endif
>         ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> +       /* Force align so the loop doesn't cross a cache-line.  */
> +       .p2align 4
> +# endif
> +       .p2align 4,, 7
> +    /* Memset using non-temporal stores.  */
> +L(nt_memset):
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> +    /* Align DST.  */
> +       orq     $(VEC_SIZE * 1 - 1), %rdi
> +       incq    %rdi
> +       .p2align 4,, 7
> +L(nt_loop):
> +       VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> +       subq    $(VEC_SIZE * -4), %rdi
> +       cmpq    %rdx, %rdi
> +       jb      L(nt_loop)
> +       sfence
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> +       VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.
  
Adhemerval Zanella Netto May 20, 2024, 7:16 p.m. UTC | #2
On 18/05/24 21:43, Noah Goldstein wrote:
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.

Btw this datasheet is not accessible without extra steps.

> 
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.

Any chance on how this play in newer AMD chips? I am trying to avoid
another regression like BZ#30994 and BZ#30995 (this one I would like
to fix, however I don't have access to a Zen4 machine to check for
results).

> 
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
> 
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
> 
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2   : 0.986
> Geometric Mean across the suite New / Old SSE2   : 0.979
> 
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
> 
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
> 
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2   : 0.928
> Geometric Mean across the suite New / Old SSE2   : 0.924
> 
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
> 
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
>  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
>  1 file changed, 91 insertions(+), 58 deletions(-)
> 
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
>     2. If size is less than VEC, use integer register stores.
>     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
>     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> -   5. On machines ERMS feature, if size is greater or equal than
> -      __x86_rep_stosb_threshold then REP STOSB will be used.
> -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> -      4 VEC stores and store 4 * VEC at a time until done.  */
> +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> +      4 VEC stores and store 4 * VEC at a time until done.
> +   6. On machines ERMS feature, if size is range
> +	  [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +	  then REP STOSB will be used.
> +   7. If size >= __x86_shared_non_temporal_threshold, use a
> +	  non-temporal stores.  */
>  
>  #include <sysdep.h>
>  
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
>  	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
>  	VMOVU	%VMM(0), (%rdi)
>  	VZEROUPPER_RETURN
> +
> +	/* If have AVX512 mask instructions put L(less_vec) close to
> +	   entry as it doesn't take much space and is likely a hot target.  */
> +#ifdef USE_LESS_VEC_MASK_STORE
> +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> +	.p2align 6,, 47
> +	.p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> +	/* Less than 1 VEC.  */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +#  error Unsupported VEC_SIZE!
> +# endif
> +	/* Clear high bits from edi. Only keeping bits relevant to page
> +	   cross check. Note that we are using rax which is set in
> +	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> +	andl	$(PAGE_SIZE - 1), %edi
> +	/* Check if VEC_SIZE store cross page. Mask stores suffer
> +	   serious performance degradation when it has to fault suppress.  */
> +	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
> +	/* This is generally considered a cold target.  */
> +	ja	L(cross_page)
> +# if VEC_SIZE > 32
> +	movq	$-1, %rcx
> +	bzhiq	%rdx, %rcx, %rcx
> +	kmovq	%rcx, %k1
> +# else
> +	movl	$-1, %ecx
> +	bzhil	%edx, %ecx, %ecx
> +	kmovd	%ecx, %k1
> +# endif
> +	vmovdqu8 %VMM(0), (%rax){%k1}
> +	VZEROUPPER_RETURN
> +#endif
> +
>  #if defined USE_MULTIARCH && IS_IN (libc)
>  END (MEMSET_SYMBOL (__memset, unaligned))
>  
> @@ -185,54 +223,6 @@ L(last_2x_vec):
>  #endif
>  	VZEROUPPER_RETURN
>  
> -	/* If have AVX512 mask instructions put L(less_vec) close to
> -	   entry as it doesn't take much space and is likely a hot target.
> -	 */
> -#ifdef USE_LESS_VEC_MASK_STORE
> -	.p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> -	/* Less than 1 VEC.  */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -#  error Unsupported VEC_SIZE!
> -# endif
> -	/* Clear high bits from edi. Only keeping bits relevant to page
> -	   cross check. Note that we are using rax which is set in
> -	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> -	andl	$(PAGE_SIZE - 1), %edi
> -	/* Check if VEC_SIZE store cross page. Mask stores suffer
> -	   serious performance degradation when it has to fault suppress.
> -	 */
> -	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
> -	/* This is generally considered a cold target.  */
> -	ja	L(cross_page)
> -# if VEC_SIZE > 32
> -	movq	$-1, %rcx
> -	bzhiq	%rdx, %rcx, %rcx
> -	kmovq	%rcx, %k1
> -# else
> -	movl	$-1, %ecx
> -	bzhil	%edx, %ecx, %ecx
> -	kmovd	%ecx, %k1
> -# endif
> -	vmovdqu8 %VMM(0), (%rax){%k1}
> -	VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> -	/* Include L(stosb_local) here if including L(less_vec) between
> -	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> -	   L(stosb_more_2x_vec) target.  */
> -	.p2align 4,, 10
> -L(stosb_local):
> -	movzbl	%sil, %eax
> -	mov	%RDX_LP, %RCX_LP
> -	mov	%RDI_LP, %RDX_LP
> -	rep	stosb
> -	mov	%RDX_LP, %RAX_LP
> -	VZEROUPPER_RETURN
> -# endif
> -#endif
> -
>  #if defined USE_MULTIARCH && IS_IN (libc)
>  	.p2align 4
>  L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
>  	ret
>  #endif
>  
> -	.p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> +	.p2align 4
> +#else
> +	.p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
>  	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>  	   range for 2-byte jump encoding.  */
>  L(stosb_local):
> +	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +	jae	L(nt_memset)
>  	movzbl	%sil, %eax
>  	mov	%RDX_LP, %RCX_LP
>  	mov	%RDI_LP, %RDX_LP
>  	rep	stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> +	/* Use xchg to save 1-byte (this helps align targets below).  */
> +	xchg	%RDX_LP, %RAX_LP
> +# else
>  	mov	%RDX_LP, %RAX_LP
> -	VZEROUPPER_RETURN
>  # endif
> +	VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
>  	/* Define L(less_vec) only if not otherwise defined.  */
> -	.p2align 4
> +	.p2align 4,, 12
>  L(less_vec):
>  	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
>  	   xmm). This is only does anything for AVX2.  */
> @@ -423,4 +425,35 @@ L(between_2_3):
>  	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
>  #endif
>  	ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> +	/* Force align so the loop doesn't cross a cache-line.  */
> +	.p2align 4
> +# endif
> +	.p2align 4,, 7
> +    /* Memset using non-temporal stores.  */
> +L(nt_memset):
> +	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
> +	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
> +    /* Align DST.  */
> +	orq	$(VEC_SIZE * 1 - 1), %rdi
> +	incq	%rdi
> +	.p2align 4,, 7
> +L(nt_loop):
> +	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
> +	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
> +	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
> +	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
> +	subq	$(VEC_SIZE * -4), %rdi
> +	cmpq	%rdx, %rdi
> +	jb	L(nt_loop)
> +	sfence
> +	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
> +	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
> +	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
> +	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
> +	VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
  
Zack Weinberg May 20, 2024, 7:37 p.m. UTC | #3
On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote:
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a few MBs)
> memsets. See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.

Did your tests read the region that was memset afterward, or did they
just do the memset?  It *probably* doesn't matter at "above a few MBs"
but I'm worried that this new implementation might look good on
microbenchmarks but make real programs *slower* because it is no longer
priming the cache for the memset region.

zw
  
Noah Goldstein May 20, 2024, 10:47 p.m. UTC | #4
On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 18/05/24 21:43, Noah Goldstein wrote:
> > Previously we use `rep stosb` for all medium/large memsets. This is
> > notably worse than non-temporal stores for large (above a
> > few MBs) memsets.
> > See:
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > For data using different stategies for large memset on ICX and SKX.
>
> Btw this datasheet is not accessible without extra steps.
>

Bh sorry, put the benchmark data in a repo. See data here:
https://github.com/goldsteinn/memset-benchmarks/tree/master/results

The ICX.html and SKX.html have the data from the above link.
> >
> > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > on SKX. Historically, these numbers would not have been so good
> > because of the zero-over-zero writeback optimization that `rep stosb`
> > is able to do. But, the zero-over-zero writeback optimization has been
> > removed as a potential side-channel attack, so there is no longer any
> > good reason to only rely on `rep stosb` for large memsets. On the flip
> > size, non-temporal writes can avoid data in their RFO requests saving
> > memory bandwidth.
>
> Any chance on how this play in newer AMD chips? I am trying to avoid
> another regression like BZ#30994 and BZ#30995 (this one I would like
> to fix, however I don't have access to a Zen4 machine to check for
> results).

I didn't and don't have access to any of the newer AMD chips.

The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
has a README w/ steps if anyone wants to test it.
>
> >
> > All of the other changes to the file are to re-organize the
> > code-blocks to maintain "good" alignment given the new code added in
> > the `L(stosb_local)` case.
> >
> > The results from running the GLIBC memset benchmarks on TGL-client for
> > N=20 runs:
> >
> > Geometric Mean across the suite New / Old EXEX256: 0.979
> > Geometric Mean across the suite New / Old EXEX512: 0.979
> > Geometric Mean across the suite New / Old AVX2   : 0.986
> > Geometric Mean across the suite New / Old SSE2   : 0.979
> >
> > Most of the cases are essentially unchanged, this is mostly to show
> > that adding the non-temporal case didn't add any regressions to the
> > other cases.
> >
> > The results on the memset-large benchmark suite on TGL-client for N=20
> > runs:
> >
> > Geometric Mean across the suite New / Old EXEX256: 0.926
> > Geometric Mean across the suite New / Old EXEX512: 0.925
> > Geometric Mean across the suite New / Old AVX2   : 0.928
> > Geometric Mean across the suite New / Old SSE2   : 0.924
> >
> > So roughly a 7.5% speedup. This is lower than what we see on servers
> > (likely because clients typically have faster single-core bandwidth so
> > saving bandwidth on RFOs is less impactful), but still advantageous.
> >
> > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > ---
> >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> >  1 file changed, 91 insertions(+), 58 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > index 97839a2248..637caadb40 100644
> > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > @@ -21,10 +21,13 @@
> >     2. If size is less than VEC, use integer register stores.
> >     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> >     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > -   5. On machines ERMS feature, if size is greater or equal than
> > -      __x86_rep_stosb_threshold then REP STOSB will be used.
> > -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > -      4 VEC stores and store 4 * VEC at a time until done.  */
> > +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > +      4 VEC stores and store 4 * VEC at a time until done.
> > +   6. On machines ERMS feature, if size is range
> > +       [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > +       then REP STOSB will be used.
> > +   7. If size >= __x86_shared_non_temporal_threshold, use a
> > +       non-temporal stores.  */
> >
> >  #include <sysdep.h>
> >
> > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> >       VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
> >       VMOVU   %VMM(0), (%rdi)
> >       VZEROUPPER_RETURN
> > +
> > +     /* If have AVX512 mask instructions put L(less_vec) close to
> > +        entry as it doesn't take much space and is likely a hot target.  */
> > +#ifdef USE_LESS_VEC_MASK_STORE
> > +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> > +     .p2align 6,, 47
> > +     .p2align 4
> > +L(less_vec):
> > +L(less_vec_from_wmemset):
> > +     /* Less than 1 VEC.  */
> > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > +#  error Unsupported VEC_SIZE!
> > +# endif
> > +     /* Clear high bits from edi. Only keeping bits relevant to page
> > +        cross check. Note that we are using rax which is set in
> > +        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > +     andl    $(PAGE_SIZE - 1), %edi
> > +     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > +        serious performance degradation when it has to fault suppress.  */
> > +     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > +     /* This is generally considered a cold target.  */
> > +     ja      L(cross_page)
> > +# if VEC_SIZE > 32
> > +     movq    $-1, %rcx
> > +     bzhiq   %rdx, %rcx, %rcx
> > +     kmovq   %rcx, %k1
> > +# else
> > +     movl    $-1, %ecx
> > +     bzhil   %edx, %ecx, %ecx
> > +     kmovd   %ecx, %k1
> > +# endif
> > +     vmovdqu8 %VMM(0), (%rax){%k1}
> > +     VZEROUPPER_RETURN
> > +#endif
> > +
> >  #if defined USE_MULTIARCH && IS_IN (libc)
> >  END (MEMSET_SYMBOL (__memset, unaligned))
> >
> > @@ -185,54 +223,6 @@ L(last_2x_vec):
> >  #endif
> >       VZEROUPPER_RETURN
> >
> > -     /* If have AVX512 mask instructions put L(less_vec) close to
> > -        entry as it doesn't take much space and is likely a hot target.
> > -      */
> > -#ifdef USE_LESS_VEC_MASK_STORE
> > -     .p2align 4,, 10
> > -L(less_vec):
> > -L(less_vec_from_wmemset):
> > -     /* Less than 1 VEC.  */
> > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > -#  error Unsupported VEC_SIZE!
> > -# endif
> > -     /* Clear high bits from edi. Only keeping bits relevant to page
> > -        cross check. Note that we are using rax which is set in
> > -        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > -     andl    $(PAGE_SIZE - 1), %edi
> > -     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > -        serious performance degradation when it has to fault suppress.
> > -      */
> > -     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > -     /* This is generally considered a cold target.  */
> > -     ja      L(cross_page)
> > -# if VEC_SIZE > 32
> > -     movq    $-1, %rcx
> > -     bzhiq   %rdx, %rcx, %rcx
> > -     kmovq   %rcx, %k1
> > -# else
> > -     movl    $-1, %ecx
> > -     bzhil   %edx, %ecx, %ecx
> > -     kmovd   %ecx, %k1
> > -# endif
> > -     vmovdqu8 %VMM(0), (%rax){%k1}
> > -     VZEROUPPER_RETURN
> > -
> > -# if defined USE_MULTIARCH && IS_IN (libc)
> > -     /* Include L(stosb_local) here if including L(less_vec) between
> > -        L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > -        L(stosb_more_2x_vec) target.  */
> > -     .p2align 4,, 10
> > -L(stosb_local):
> > -     movzbl  %sil, %eax
> > -     mov     %RDX_LP, %RCX_LP
> > -     mov     %RDI_LP, %RDX_LP
> > -     rep     stosb
> > -     mov     %RDX_LP, %RAX_LP
> > -     VZEROUPPER_RETURN
> > -# endif
> > -#endif
> > -
> >  #if defined USE_MULTIARCH && IS_IN (libc)
> >       .p2align 4
> >  L(stosb_more_2x_vec):
> > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> >       ret
> >  #endif
> >
> > -     .p2align 4,, 10
> > -#ifndef USE_LESS_VEC_MASK_STORE
> > -# if defined USE_MULTIARCH && IS_IN (libc)
> > +#ifdef USE_WITH_AVX2
> > +     .p2align 4
> > +#else
> > +     .p2align 4,, 4
> > +#endif
> > +
> > +#if defined USE_MULTIARCH && IS_IN (libc)
> >       /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> >          range for 2-byte jump encoding.  */
> >  L(stosb_local):
> > +     cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > +     jae     L(nt_memset)
> >       movzbl  %sil, %eax
> >       mov     %RDX_LP, %RCX_LP
> >       mov     %RDI_LP, %RDX_LP
> >       rep     stosb
> > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > +     /* Use xchg to save 1-byte (this helps align targets below).  */
> > +     xchg    %RDX_LP, %RAX_LP
> > +# else
> >       mov     %RDX_LP, %RAX_LP
> > -     VZEROUPPER_RETURN
> >  # endif
> > +     VZEROUPPER_RETURN
> > +#endif
> > +#ifndef USE_LESS_VEC_MASK_STORE
> >       /* Define L(less_vec) only if not otherwise defined.  */
> > -     .p2align 4
> > +     .p2align 4,, 12
> >  L(less_vec):
> >       /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> >          xmm). This is only does anything for AVX2.  */
> > @@ -423,4 +425,35 @@ L(between_2_3):
> >       movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> >  #endif
> >       ret
> > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > +
> > +#if defined USE_MULTIARCH && IS_IN (libc)
> > +# ifdef USE_WITH_AVX512
> > +     /* Force align so the loop doesn't cross a cache-line.  */
> > +     .p2align 4
> > +# endif
> > +     .p2align 4,, 7
> > +    /* Memset using non-temporal stores.  */
> > +L(nt_memset):
> > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> > +     leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > +    /* Align DST.  */
> > +     orq     $(VEC_SIZE * 1 - 1), %rdi
> > +     incq    %rdi
> > +     .p2align 4,, 7
> > +L(nt_loop):
> > +     VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> > +     VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> > +     VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> > +     VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> > +     subq    $(VEC_SIZE * -4), %rdi
> > +     cmpq    %rdx, %rdi
> > +     jb      L(nt_loop)
> > +     sfence
> > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> > +     VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> > +     VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> > +     VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> > +     VZEROUPPER_RETURN
> > +#endif
> > +
> > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
  
Noah Goldstein May 20, 2024, 10:48 p.m. UTC | #5
On Mon, May 20, 2024 at 2:37 PM Zack Weinberg <zack@owlfolio.org> wrote:
>
> On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote:
> > Previously we use `rep stosb` for all medium/large memsets. This is
> > notably worse than non-temporal stores for large (above a few MBs)
> > memsets. See:
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > For data using different stategies for large memset on ICX and SKX.
>
> Did your tests read the region that was memset afterward, or did they
> just do the memset?  It *probably* doesn't matter at "above a few MBs"
> but I'm worried that this new implementation might look good on
> microbenchmarks but make real programs *slower* because it is no longer
> priming the cache for the memset region.

Yes, See all results tagged with `reused=1`. Results are still good.
Data available at:
https://github.com/goldsteinn/memset-benchmarks/blob/master/results/SKX.html
and/or
https://github.com/goldsteinn/memset-benchmarks/blob/master/results/ICX.html
>
> zw
  
Noah Goldstein May 20, 2024, 10:50 p.m. UTC | #6
On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> >
> >
> >
> > On 18/05/24 21:43, Noah Goldstein wrote:
> > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > notably worse than non-temporal stores for large (above a
> > > few MBs) memsets.
> > > See:
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > For data using different stategies for large memset on ICX and SKX.
> >
> > Btw this datasheet is not accessible without extra steps.
> >
>
> Bh sorry, put the benchmark data in a repo. See data here:
> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>
> The ICX.html and SKX.html have the data from the above link.
> > >
> > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > on SKX. Historically, these numbers would not have been so good
> > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > is able to do. But, the zero-over-zero writeback optimization has been
> > > removed as a potential side-channel attack, so there is no longer any
> > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > size, non-temporal writes can avoid data in their RFO requests saving
> > > memory bandwidth.
> >
> > Any chance on how this play in newer AMD chips? I am trying to avoid
> > another regression like BZ#30994 and BZ#30995 (this one I would like
> > to fix, however I don't have access to a Zen4 machine to check for
> > results).
>
> I didn't and don't have access to any of the newer AMD chips.
>
> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> has a README w/ steps if anyone wants to test it.

What we could do is use a seperate tunable for memset NT threshold and just
make it SIZE_MAX for AMD.
> >
> > >
> > > All of the other changes to the file are to re-organize the
> > > code-blocks to maintain "good" alignment given the new code added in
> > > the `L(stosb_local)` case.
> > >
> > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > N=20 runs:
> > >
> > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > Geometric Mean across the suite New / Old AVX2   : 0.986
> > > Geometric Mean across the suite New / Old SSE2   : 0.979
> > >
> > > Most of the cases are essentially unchanged, this is mostly to show
> > > that adding the non-temporal case didn't add any regressions to the
> > > other cases.
> > >
> > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > runs:
> > >
> > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > Geometric Mean across the suite New / Old AVX2   : 0.928
> > > Geometric Mean across the suite New / Old SSE2   : 0.924
> > >
> > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > (likely because clients typically have faster single-core bandwidth so
> > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > >
> > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > ---
> > >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> > >  1 file changed, 91 insertions(+), 58 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > index 97839a2248..637caadb40 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > @@ -21,10 +21,13 @@
> > >     2. If size is less than VEC, use integer register stores.
> > >     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > >     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > -   5. On machines ERMS feature, if size is greater or equal than
> > > -      __x86_rep_stosb_threshold then REP STOSB will be used.
> > > -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > -      4 VEC stores and store 4 * VEC at a time until done.  */
> > > +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > +      4 VEC stores and store 4 * VEC at a time until done.
> > > +   6. On machines ERMS feature, if size is range
> > > +       [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > +       then REP STOSB will be used.
> > > +   7. If size >= __x86_shared_non_temporal_threshold, use a
> > > +       non-temporal stores.  */
> > >
> > >  #include <sysdep.h>
> > >
> > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > >       VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > >       VMOVU   %VMM(0), (%rdi)
> > >       VZEROUPPER_RETURN
> > > +
> > > +     /* If have AVX512 mask instructions put L(less_vec) close to
> > > +        entry as it doesn't take much space and is likely a hot target.  */
> > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> > > +     .p2align 6,, 47
> > > +     .p2align 4
> > > +L(less_vec):
> > > +L(less_vec_from_wmemset):
> > > +     /* Less than 1 VEC.  */
> > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > +#  error Unsupported VEC_SIZE!
> > > +# endif
> > > +     /* Clear high bits from edi. Only keeping bits relevant to page
> > > +        cross check. Note that we are using rax which is set in
> > > +        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > +     andl    $(PAGE_SIZE - 1), %edi
> > > +     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > +        serious performance degradation when it has to fault suppress.  */
> > > +     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > +     /* This is generally considered a cold target.  */
> > > +     ja      L(cross_page)
> > > +# if VEC_SIZE > 32
> > > +     movq    $-1, %rcx
> > > +     bzhiq   %rdx, %rcx, %rcx
> > > +     kmovq   %rcx, %k1
> > > +# else
> > > +     movl    $-1, %ecx
> > > +     bzhil   %edx, %ecx, %ecx
> > > +     kmovd   %ecx, %k1
> > > +# endif
> > > +     vmovdqu8 %VMM(0), (%rax){%k1}
> > > +     VZEROUPPER_RETURN
> > > +#endif
> > > +
> > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > >  END (MEMSET_SYMBOL (__memset, unaligned))
> > >
> > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > >  #endif
> > >       VZEROUPPER_RETURN
> > >
> > > -     /* If have AVX512 mask instructions put L(less_vec) close to
> > > -        entry as it doesn't take much space and is likely a hot target.
> > > -      */
> > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > -     .p2align 4,, 10
> > > -L(less_vec):
> > > -L(less_vec_from_wmemset):
> > > -     /* Less than 1 VEC.  */
> > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > -#  error Unsupported VEC_SIZE!
> > > -# endif
> > > -     /* Clear high bits from edi. Only keeping bits relevant to page
> > > -        cross check. Note that we are using rax which is set in
> > > -        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > -     andl    $(PAGE_SIZE - 1), %edi
> > > -     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > -        serious performance degradation when it has to fault suppress.
> > > -      */
> > > -     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > -     /* This is generally considered a cold target.  */
> > > -     ja      L(cross_page)
> > > -# if VEC_SIZE > 32
> > > -     movq    $-1, %rcx
> > > -     bzhiq   %rdx, %rcx, %rcx
> > > -     kmovq   %rcx, %k1
> > > -# else
> > > -     movl    $-1, %ecx
> > > -     bzhil   %edx, %ecx, %ecx
> > > -     kmovd   %ecx, %k1
> > > -# endif
> > > -     vmovdqu8 %VMM(0), (%rax){%k1}
> > > -     VZEROUPPER_RETURN
> > > -
> > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > -     /* Include L(stosb_local) here if including L(less_vec) between
> > > -        L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > -        L(stosb_more_2x_vec) target.  */
> > > -     .p2align 4,, 10
> > > -L(stosb_local):
> > > -     movzbl  %sil, %eax
> > > -     mov     %RDX_LP, %RCX_LP
> > > -     mov     %RDI_LP, %RDX_LP
> > > -     rep     stosb
> > > -     mov     %RDX_LP, %RAX_LP
> > > -     VZEROUPPER_RETURN
> > > -# endif
> > > -#endif
> > > -
> > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > >       .p2align 4
> > >  L(stosb_more_2x_vec):
> > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > >       ret
> > >  #endif
> > >
> > > -     .p2align 4,, 10
> > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > +#ifdef USE_WITH_AVX2
> > > +     .p2align 4
> > > +#else
> > > +     .p2align 4,, 4
> > > +#endif
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > >       /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > >          range for 2-byte jump encoding.  */
> > >  L(stosb_local):
> > > +     cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > +     jae     L(nt_memset)
> > >       movzbl  %sil, %eax
> > >       mov     %RDX_LP, %RCX_LP
> > >       mov     %RDI_LP, %RDX_LP
> > >       rep     stosb
> > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > +     /* Use xchg to save 1-byte (this helps align targets below).  */
> > > +     xchg    %RDX_LP, %RAX_LP
> > > +# else
> > >       mov     %RDX_LP, %RAX_LP
> > > -     VZEROUPPER_RETURN
> > >  # endif
> > > +     VZEROUPPER_RETURN
> > > +#endif
> > > +#ifndef USE_LESS_VEC_MASK_STORE
> > >       /* Define L(less_vec) only if not otherwise defined.  */
> > > -     .p2align 4
> > > +     .p2align 4,, 12
> > >  L(less_vec):
> > >       /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > >          xmm). This is only does anything for AVX2.  */
> > > @@ -423,4 +425,35 @@ L(between_2_3):
> > >       movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > >  #endif
> > >       ret
> > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > +# ifdef USE_WITH_AVX512
> > > +     /* Force align so the loop doesn't cross a cache-line.  */
> > > +     .p2align 4
> > > +# endif
> > > +     .p2align 4,, 7
> > > +    /* Memset using non-temporal stores.  */
> > > +L(nt_memset):
> > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > +     leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > +    /* Align DST.  */
> > > +     orq     $(VEC_SIZE * 1 - 1), %rdi
> > > +     incq    %rdi
> > > +     .p2align 4,, 7
> > > +L(nt_loop):
> > > +     VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > +     VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > +     VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> > > +     VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> > > +     subq    $(VEC_SIZE * -4), %rdi
> > > +     cmpq    %rdx, %rdi
> > > +     jb      L(nt_loop)
> > > +     sfence
> > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> > > +     VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> > > +     VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> > > +     VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> > > +     VZEROUPPER_RETURN
> > > +#endif
> > > +
> > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
  
Noah Goldstein May 23, 2024, 4:46 p.m. UTC | #7
On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> > <adhemerval.zanella@linaro.org> wrote:
> > >
> > >
> > >
> > > On 18/05/24 21:43, Noah Goldstein wrote:
> > > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > > notably worse than non-temporal stores for large (above a
> > > > few MBs) memsets.
> > > > See:
> > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > > For data using different stategies for large memset on ICX and SKX.
> > >
> > > Btw this datasheet is not accessible without extra steps.
> > >
> >
> > Bh sorry, put the benchmark data in a repo. See data here:
> > https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >
> > The ICX.html and SKX.html have the data from the above link.
> > > >
> > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > on SKX. Historically, these numbers would not have been so good
> > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > is able to do. But, the zero-over-zero writeback optimization has been
> > > > removed as a potential side-channel attack, so there is no longer any
> > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > memory bandwidth.
> > >
> > > Any chance on how this play in newer AMD chips? I am trying to avoid
> > > another regression like BZ#30994 and BZ#30995 (this one I would like
> > > to fix, however I don't have access to a Zen4 machine to check for
> > > results).
> >
> > I didn't and don't have access to any of the newer AMD chips.
> >
> > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> > has a README w/ steps if anyone wants to test it.
>
> What we could do is use a seperate tunable for memset NT threshold and just
> make it SIZE_MAX for AMD.

Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
specific?
> > >
> > > >
> > > > All of the other changes to the file are to re-organize the
> > > > code-blocks to maintain "good" alignment given the new code added in
> > > > the `L(stosb_local)` case.
> > > >
> > > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > > N=20 runs:
> > > >
> > > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > > Geometric Mean across the suite New / Old AVX2   : 0.986
> > > > Geometric Mean across the suite New / Old SSE2   : 0.979
> > > >
> > > > Most of the cases are essentially unchanged, this is mostly to show
> > > > that adding the non-temporal case didn't add any regressions to the
> > > > other cases.
> > > >
> > > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > > runs:
> > > >
> > > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > > Geometric Mean across the suite New / Old AVX2   : 0.928
> > > > Geometric Mean across the suite New / Old SSE2   : 0.924
> > > >
> > > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > > (likely because clients typically have faster single-core bandwidth so
> > > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > > >
> > > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > > ---
> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> > > >  1 file changed, 91 insertions(+), 58 deletions(-)
> > > >
> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > index 97839a2248..637caadb40 100644
> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > @@ -21,10 +21,13 @@
> > > >     2. If size is less than VEC, use integer register stores.
> > > >     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > > >     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > > -   5. On machines ERMS feature, if size is greater or equal than
> > > > -      __x86_rep_stosb_threshold then REP STOSB will be used.
> > > > -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > > -      4 VEC stores and store 4 * VEC at a time until done.  */
> > > > +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > > +      4 VEC stores and store 4 * VEC at a time until done.
> > > > +   6. On machines ERMS feature, if size is range
> > > > +       [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > > +       then REP STOSB will be used.
> > > > +   7. If size >= __x86_shared_non_temporal_threshold, use a
> > > > +       non-temporal stores.  */
> > > >
> > > >  #include <sysdep.h>
> > > >
> > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > > >       VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > > >       VMOVU   %VMM(0), (%rdi)
> > > >       VZEROUPPER_RETURN
> > > > +
> > > > +     /* If have AVX512 mask instructions put L(less_vec) close to
> > > > +        entry as it doesn't take much space and is likely a hot target.  */
> > > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > > +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> > > > +     .p2align 6,, 47
> > > > +     .p2align 4
> > > > +L(less_vec):
> > > > +L(less_vec_from_wmemset):
> > > > +     /* Less than 1 VEC.  */
> > > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > +#  error Unsupported VEC_SIZE!
> > > > +# endif
> > > > +     /* Clear high bits from edi. Only keeping bits relevant to page
> > > > +        cross check. Note that we are using rax which is set in
> > > > +        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > > +     andl    $(PAGE_SIZE - 1), %edi
> > > > +     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > +        serious performance degradation when it has to fault suppress.  */
> > > > +     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > > +     /* This is generally considered a cold target.  */
> > > > +     ja      L(cross_page)
> > > > +# if VEC_SIZE > 32
> > > > +     movq    $-1, %rcx
> > > > +     bzhiq   %rdx, %rcx, %rcx
> > > > +     kmovq   %rcx, %k1
> > > > +# else
> > > > +     movl    $-1, %ecx
> > > > +     bzhil   %edx, %ecx, %ecx
> > > > +     kmovd   %ecx, %k1
> > > > +# endif
> > > > +     vmovdqu8 %VMM(0), (%rax){%k1}
> > > > +     VZEROUPPER_RETURN
> > > > +#endif
> > > > +
> > > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > > >  END (MEMSET_SYMBOL (__memset, unaligned))
> > > >
> > > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > > >  #endif
> > > >       VZEROUPPER_RETURN
> > > >
> > > > -     /* If have AVX512 mask instructions put L(less_vec) close to
> > > > -        entry as it doesn't take much space and is likely a hot target.
> > > > -      */
> > > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > > -     .p2align 4,, 10
> > > > -L(less_vec):
> > > > -L(less_vec_from_wmemset):
> > > > -     /* Less than 1 VEC.  */
> > > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > -#  error Unsupported VEC_SIZE!
> > > > -# endif
> > > > -     /* Clear high bits from edi. Only keeping bits relevant to page
> > > > -        cross check. Note that we are using rax which is set in
> > > > -        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > > -     andl    $(PAGE_SIZE - 1), %edi
> > > > -     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > -        serious performance degradation when it has to fault suppress.
> > > > -      */
> > > > -     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > > -     /* This is generally considered a cold target.  */
> > > > -     ja      L(cross_page)
> > > > -# if VEC_SIZE > 32
> > > > -     movq    $-1, %rcx
> > > > -     bzhiq   %rdx, %rcx, %rcx
> > > > -     kmovq   %rcx, %k1
> > > > -# else
> > > > -     movl    $-1, %ecx
> > > > -     bzhil   %edx, %ecx, %ecx
> > > > -     kmovd   %ecx, %k1
> > > > -# endif
> > > > -     vmovdqu8 %VMM(0), (%rax){%k1}
> > > > -     VZEROUPPER_RETURN
> > > > -
> > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > -     /* Include L(stosb_local) here if including L(less_vec) between
> > > > -        L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > > -        L(stosb_more_2x_vec) target.  */
> > > > -     .p2align 4,, 10
> > > > -L(stosb_local):
> > > > -     movzbl  %sil, %eax
> > > > -     mov     %RDX_LP, %RCX_LP
> > > > -     mov     %RDI_LP, %RDX_LP
> > > > -     rep     stosb
> > > > -     mov     %RDX_LP, %RAX_LP
> > > > -     VZEROUPPER_RETURN
> > > > -# endif
> > > > -#endif
> > > > -
> > > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > > >       .p2align 4
> > > >  L(stosb_more_2x_vec):
> > > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > > >       ret
> > > >  #endif
> > > >
> > > > -     .p2align 4,, 10
> > > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > +#ifdef USE_WITH_AVX2
> > > > +     .p2align 4
> > > > +#else
> > > > +     .p2align 4,, 4
> > > > +#endif
> > > > +
> > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > >       /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > > >          range for 2-byte jump encoding.  */
> > > >  L(stosb_local):
> > > > +     cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > +     jae     L(nt_memset)
> > > >       movzbl  %sil, %eax
> > > >       mov     %RDX_LP, %RCX_LP
> > > >       mov     %RDI_LP, %RDX_LP
> > > >       rep     stosb
> > > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > > +     /* Use xchg to save 1-byte (this helps align targets below).  */
> > > > +     xchg    %RDX_LP, %RAX_LP
> > > > +# else
> > > >       mov     %RDX_LP, %RAX_LP
> > > > -     VZEROUPPER_RETURN
> > > >  # endif
> > > > +     VZEROUPPER_RETURN
> > > > +#endif
> > > > +#ifndef USE_LESS_VEC_MASK_STORE
> > > >       /* Define L(less_vec) only if not otherwise defined.  */
> > > > -     .p2align 4
> > > > +     .p2align 4,, 12
> > > >  L(less_vec):
> > > >       /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > > >          xmm). This is only does anything for AVX2.  */
> > > > @@ -423,4 +425,35 @@ L(between_2_3):
> > > >       movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > > >  #endif
> > > >       ret
> > > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > > +
> > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > +# ifdef USE_WITH_AVX512
> > > > +     /* Force align so the loop doesn't cross a cache-line.  */
> > > > +     .p2align 4
> > > > +# endif
> > > > +     .p2align 4,, 7
> > > > +    /* Memset using non-temporal stores.  */
> > > > +L(nt_memset):
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > +     leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > > +    /* Align DST.  */
> > > > +     orq     $(VEC_SIZE * 1 - 1), %rdi
> > > > +     incq    %rdi
> > > > +     .p2align 4,, 7
> > > > +L(nt_loop):
> > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> > > > +     subq    $(VEC_SIZE * -4), %rdi
> > > > +     cmpq    %rdx, %rdi
> > > > +     jb      L(nt_loop)
> > > > +     sfence
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> > > > +     VZEROUPPER_RETURN
> > > > +#endif
> > > > +
> > > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
  
H.J. Lu May 23, 2024, 4:49 p.m. UTC | #8
On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> > > <adhemerval.zanella@linaro.org> wrote:
> > > >
> > > >
> > > >
> > > > On 18/05/24 21:43, Noah Goldstein wrote:
> > > > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > > > notably worse than non-temporal stores for large (above a
> > > > > few MBs) memsets.
> > > > > See:
> > > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > > > For data using different stategies for large memset on ICX and SKX.
> > > >
> > > > Btw this datasheet is not accessible without extra steps.
> > > >
> > >
> > > Bh sorry, put the benchmark data in a repo. See data here:
> > > https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> > >
> > > The ICX.html and SKX.html have the data from the above link.
> > > > >
> > > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > > on SKX. Historically, these numbers would not have been so good
> > > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > > is able to do. But, the zero-over-zero writeback optimization has been
> > > > > removed as a potential side-channel attack, so there is no longer any
> > > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > > memory bandwidth.
> > > >
> > > > Any chance on how this play in newer AMD chips? I am trying to avoid
> > > > another regression like BZ#30994 and BZ#30995 (this one I would like
> > > > to fix, however I don't have access to a Zen4 machine to check for
> > > > results).
> > >
> > > I didn't and don't have access to any of the newer AMD chips.
> > >
> > > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> > > has a README w/ steps if anyone wants to test it.
> >
> > What we could do is use a seperate tunable for memset NT threshold and just
> > make it SIZE_MAX for AMD.
>
> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> specific?

Please make it Intel specific.

Thanks.

> > > >
> > > > >
> > > > > All of the other changes to the file are to re-organize the
> > > > > code-blocks to maintain "good" alignment given the new code added in
> > > > > the `L(stosb_local)` case.
> > > > >
> > > > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > > > N=20 runs:
> > > > >
> > > > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > > > Geometric Mean across the suite New / Old AVX2   : 0.986
> > > > > Geometric Mean across the suite New / Old SSE2   : 0.979
> > > > >
> > > > > Most of the cases are essentially unchanged, this is mostly to show
> > > > > that adding the non-temporal case didn't add any regressions to the
> > > > > other cases.
> > > > >
> > > > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > > > runs:
> > > > >
> > > > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > > > Geometric Mean across the suite New / Old AVX2   : 0.928
> > > > > Geometric Mean across the suite New / Old SSE2   : 0.924
> > > > >
> > > > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > > > (likely because clients typically have faster single-core bandwidth so
> > > > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > > > >
> > > > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > > > ---
> > > > >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> > > > >  1 file changed, 91 insertions(+), 58 deletions(-)
> > > > >
> > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > index 97839a2248..637caadb40 100644
> > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > @@ -21,10 +21,13 @@
> > > > >     2. If size is less than VEC, use integer register stores.
> > > > >     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > > > >     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > > > -   5. On machines ERMS feature, if size is greater or equal than
> > > > > -      __x86_rep_stosb_threshold then REP STOSB will be used.
> > > > > -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > > > -      4 VEC stores and store 4 * VEC at a time until done.  */
> > > > > +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > > > +      4 VEC stores and store 4 * VEC at a time until done.
> > > > > +   6. On machines ERMS feature, if size is range
> > > > > +       [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > > > +       then REP STOSB will be used.
> > > > > +   7. If size >= __x86_shared_non_temporal_threshold, use a
> > > > > +       non-temporal stores.  */
> > > > >
> > > > >  #include <sysdep.h>
> > > > >
> > > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > > > >       VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > > > >       VMOVU   %VMM(0), (%rdi)
> > > > >       VZEROUPPER_RETURN
> > > > > +
> > > > > +     /* If have AVX512 mask instructions put L(less_vec) close to
> > > > > +        entry as it doesn't take much space and is likely a hot target.  */
> > > > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > > > +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> > > > > +     .p2align 6,, 47
> > > > > +     .p2align 4
> > > > > +L(less_vec):
> > > > > +L(less_vec_from_wmemset):
> > > > > +     /* Less than 1 VEC.  */
> > > > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > > +#  error Unsupported VEC_SIZE!
> > > > > +# endif
> > > > > +     /* Clear high bits from edi. Only keeping bits relevant to page
> > > > > +        cross check. Note that we are using rax which is set in
> > > > > +        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > > > +     andl    $(PAGE_SIZE - 1), %edi
> > > > > +     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > > +        serious performance degradation when it has to fault suppress.  */
> > > > > +     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > > > +     /* This is generally considered a cold target.  */
> > > > > +     ja      L(cross_page)
> > > > > +# if VEC_SIZE > 32
> > > > > +     movq    $-1, %rcx
> > > > > +     bzhiq   %rdx, %rcx, %rcx
> > > > > +     kmovq   %rcx, %k1
> > > > > +# else
> > > > > +     movl    $-1, %ecx
> > > > > +     bzhil   %edx, %ecx, %ecx
> > > > > +     kmovd   %ecx, %k1
> > > > > +# endif
> > > > > +     vmovdqu8 %VMM(0), (%rax){%k1}
> > > > > +     VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +
> > > > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > > > >  END (MEMSET_SYMBOL (__memset, unaligned))
> > > > >
> > > > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > > > >  #endif
> > > > >       VZEROUPPER_RETURN
> > > > >
> > > > > -     /* If have AVX512 mask instructions put L(less_vec) close to
> > > > > -        entry as it doesn't take much space and is likely a hot target.
> > > > > -      */
> > > > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > > > -     .p2align 4,, 10
> > > > > -L(less_vec):
> > > > > -L(less_vec_from_wmemset):
> > > > > -     /* Less than 1 VEC.  */
> > > > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > > -#  error Unsupported VEC_SIZE!
> > > > > -# endif
> > > > > -     /* Clear high bits from edi. Only keeping bits relevant to page
> > > > > -        cross check. Note that we are using rax which is set in
> > > > > -        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > > > -     andl    $(PAGE_SIZE - 1), %edi
> > > > > -     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > > -        serious performance degradation when it has to fault suppress.
> > > > > -      */
> > > > > -     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > > > -     /* This is generally considered a cold target.  */
> > > > > -     ja      L(cross_page)
> > > > > -# if VEC_SIZE > 32
> > > > > -     movq    $-1, %rcx
> > > > > -     bzhiq   %rdx, %rcx, %rcx
> > > > > -     kmovq   %rcx, %k1
> > > > > -# else
> > > > > -     movl    $-1, %ecx
> > > > > -     bzhil   %edx, %ecx, %ecx
> > > > > -     kmovd   %ecx, %k1
> > > > > -# endif
> > > > > -     vmovdqu8 %VMM(0), (%rax){%k1}
> > > > > -     VZEROUPPER_RETURN
> > > > > -
> > > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > > -     /* Include L(stosb_local) here if including L(less_vec) between
> > > > > -        L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > > > -        L(stosb_more_2x_vec) target.  */
> > > > > -     .p2align 4,, 10
> > > > > -L(stosb_local):
> > > > > -     movzbl  %sil, %eax
> > > > > -     mov     %RDX_LP, %RCX_LP
> > > > > -     mov     %RDI_LP, %RDX_LP
> > > > > -     rep     stosb
> > > > > -     mov     %RDX_LP, %RAX_LP
> > > > > -     VZEROUPPER_RETURN
> > > > > -# endif
> > > > > -#endif
> > > > > -
> > > > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > > > >       .p2align 4
> > > > >  L(stosb_more_2x_vec):
> > > > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > > > >       ret
> > > > >  #endif
> > > > >
> > > > > -     .p2align 4,, 10
> > > > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > > +#ifdef USE_WITH_AVX2
> > > > > +     .p2align 4
> > > > > +#else
> > > > > +     .p2align 4,, 4
> > > > > +#endif
> > > > > +
> > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > >       /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > > > >          range for 2-byte jump encoding.  */
> > > > >  L(stosb_local):
> > > > > +     cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > > +     jae     L(nt_memset)
> > > > >       movzbl  %sil, %eax
> > > > >       mov     %RDX_LP, %RCX_LP
> > > > >       mov     %RDI_LP, %RDX_LP
> > > > >       rep     stosb
> > > > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > > > +     /* Use xchg to save 1-byte (this helps align targets below).  */
> > > > > +     xchg    %RDX_LP, %RAX_LP
> > > > > +# else
> > > > >       mov     %RDX_LP, %RAX_LP
> > > > > -     VZEROUPPER_RETURN
> > > > >  # endif
> > > > > +     VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +#ifndef USE_LESS_VEC_MASK_STORE
> > > > >       /* Define L(less_vec) only if not otherwise defined.  */
> > > > > -     .p2align 4
> > > > > +     .p2align 4,, 12
> > > > >  L(less_vec):
> > > > >       /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > > > >          xmm). This is only does anything for AVX2.  */
> > > > > @@ -423,4 +425,35 @@ L(between_2_3):
> > > > >       movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > > > >  #endif
> > > > >       ret
> > > > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > > > +
> > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > +# ifdef USE_WITH_AVX512
> > > > > +     /* Force align so the loop doesn't cross a cache-line.  */
> > > > > +     .p2align 4
> > > > > +# endif
> > > > > +     .p2align 4,, 7
> > > > > +    /* Memset using non-temporal stores.  */
> > > > > +L(nt_memset):
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > > +     leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > > > +    /* Align DST.  */
> > > > > +     orq     $(VEC_SIZE * 1 - 1), %rdi
> > > > > +     incq    %rdi
> > > > > +     .p2align 4,, 7
> > > > > +L(nt_loop):
> > > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> > > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> > > > > +     subq    $(VEC_SIZE * -4), %rdi
> > > > > +     cmpq    %rdx, %rdi
> > > > > +     jb      L(nt_loop)
> > > > > +     sfence
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> > > > > +     VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +
> > > > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
  
Adhemerval Zanella Netto May 23, 2024, 5:18 p.m. UTC | #9
On 23/05/24 13:49, H.J. Lu wrote:
> On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>
>>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>>
>>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 18/05/24 21:43, Noah Goldstein wrote:
>>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
>>>>>> notably worse than non-temporal stores for large (above a
>>>>>> few MBs) memsets.
>>>>>> See:
>>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
>>>>>> For data using different stategies for large memset on ICX and SKX.
>>>>>
>>>>> Btw this datasheet is not accessible without extra steps.
>>>>>
>>>>
>>>> Bh sorry, put the benchmark data in a repo. See data here:
>>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>>>>
>>>> The ICX.html and SKX.html have the data from the above link.
>>>>>>
>>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
>>>>>> on SKX. Historically, these numbers would not have been so good
>>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
>>>>>> is able to do. But, the zero-over-zero writeback optimization has been
>>>>>> removed as a potential side-channel attack, so there is no longer any
>>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
>>>>>> size, non-temporal writes can avoid data in their RFO requests saving
>>>>>> memory bandwidth.
>>>>>
>>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
>>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
>>>>> to fix, however I don't have access to a Zen4 machine to check for
>>>>> results).
>>>>
>>>> I didn't and don't have access to any of the newer AMD chips.
>>>>
>>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
>>>> has a README w/ steps if anyone wants to test it.
>>>
>>> What we could do is use a seperate tunable for memset NT threshold and just
>>> make it SIZE_MAX for AMD.
>>
>> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
>> specific?
> 
> Please make it Intel specific.
> 
> Thanks.

Yes, please make it Intel specific for now. I will try to take a look at least
with a Zen3 core that I have access
  
Noah Goldstein May 24, 2024, 5:39 p.m. UTC | #10
On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 23/05/24 13:49, H.J. Lu wrote:
> > On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>
> >>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> >>>> <adhemerval.zanella@linaro.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 18/05/24 21:43, Noah Goldstein wrote:
> >>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
> >>>>>> notably worse than non-temporal stores for large (above a
> >>>>>> few MBs) memsets.
> >>>>>> See:
> >>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >>>>>> For data using different stategies for large memset on ICX and SKX.
> >>>>>
> >>>>> Btw this datasheet is not accessible without extra steps.
> >>>>>
> >>>>
> >>>> Bh sorry, put the benchmark data in a repo. See data here:
> >>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >>>>
> >>>> The ICX.html and SKX.html have the data from the above link.
> >>>>>>
> >>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >>>>>> on SKX. Historically, these numbers would not have been so good
> >>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
> >>>>>> is able to do. But, the zero-over-zero writeback optimization has been
> >>>>>> removed as a potential side-channel attack, so there is no longer any
> >>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
> >>>>>> size, non-temporal writes can avoid data in their RFO requests saving
> >>>>>> memory bandwidth.
> >>>>>
> >>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
> >>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
> >>>>> to fix, however I don't have access to a Zen4 machine to check for
> >>>>> results).
> >>>>
> >>>> I didn't and don't have access to any of the newer AMD chips.
> >>>>
> >>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> >>>> has a README w/ steps if anyone wants to test it.
> >>>
> >>> What we could do is use a seperate tunable for memset NT threshold and just
> >>> make it SIZE_MAX for AMD.
> >>
> >> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> >> specific?
> >
> > Please make it Intel specific.
> >
> > Thanks.
>
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access

Done, split into two patches, the tunable one is the second in the series.
  
Noah Goldstein May 30, 2024, 8:02 p.m. UTC | #11
On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 23/05/24 13:49, H.J. Lu wrote:
> > On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>
> >>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> >>>> <adhemerval.zanella@linaro.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 18/05/24 21:43, Noah Goldstein wrote:
> >>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
> >>>>>> notably worse than non-temporal stores for large (above a
> >>>>>> few MBs) memsets.
> >>>>>> See:
> >>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >>>>>> For data using different stategies for large memset on ICX and SKX.
> >>>>>
> >>>>> Btw this datasheet is not accessible without extra steps.
> >>>>>
> >>>>
> >>>> Bh sorry, put the benchmark data in a repo. See data here:
> >>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >>>>
> >>>> The ICX.html and SKX.html have the data from the above link.
> >>>>>>
> >>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >>>>>> on SKX. Historically, these numbers would not have been so good
> >>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
> >>>>>> is able to do. But, the zero-over-zero writeback optimization has been
> >>>>>> removed as a potential side-channel attack, so there is no longer any
> >>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
> >>>>>> size, non-temporal writes can avoid data in their RFO requests saving
> >>>>>> memory bandwidth.
> >>>>>
> >>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
> >>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
> >>>>> to fix, however I don't have access to a Zen4 machine to check for
> >>>>> results).
> >>>>
> >>>> I didn't and don't have access to any of the newer AMD chips.
> >>>>
> >>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> >>>> has a README w/ steps if anyone wants to test it.
> >>>
> >>> What we could do is use a seperate tunable for memset NT threshold and just
> >>> make it SIZE_MAX for AMD.
> >>
> >> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> >> specific?
> >
> > Please make it Intel specific.
> >
> > Thanks.
>
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access

Adding Joe into the thread. He has access to some AMD machines
and will run the benchmarks for us.
  

Patch

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@ 
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. On machines ERMS feature, if size is greater or equal than
-      __x86_rep_stosb_threshold then REP STOSB will be used.
-   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
-      4 VEC stores and store 4 * VEC at a time until done.  */
+   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+      4 VEC stores and store 4 * VEC at a time until done.
+   6. On machines ERMS feature, if size is range
+	  [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+	  then REP STOSB will be used.
+   7. If size >= __x86_shared_non_temporal_threshold, use a
+	  non-temporal stores.  */
 
 #include <sysdep.h>
 
@@ -147,6 +150,41 @@  L(entry_from_wmemset):
 	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VMM(0), (%rdi)
 	VZEROUPPER_RETURN
+
+	/* If have AVX512 mask instructions put L(less_vec) close to
+	   entry as it doesn't take much space and is likely a hot target.  */
+#ifdef USE_LESS_VEC_MASK_STORE
+    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
+	.p2align 6,, 47
+	.p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+	/* Less than 1 VEC.  */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+#  error Unsupported VEC_SIZE!
+# endif
+	/* Clear high bits from edi. Only keeping bits relevant to page
+	   cross check. Note that we are using rax which is set in
+	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
+	andl	$(PAGE_SIZE - 1), %edi
+	/* Check if VEC_SIZE store cross page. Mask stores suffer
+	   serious performance degradation when it has to fault suppress.  */
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
+	/* This is generally considered a cold target.  */
+	ja	L(cross_page)
+# if VEC_SIZE > 32
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+	kmovq	%rcx, %k1
+# else
+	movl	$-1, %ecx
+	bzhil	%edx, %ecx, %ecx
+	kmovd	%ecx, %k1
+# endif
+	vmovdqu8 %VMM(0), (%rax){%k1}
+	VZEROUPPER_RETURN
+#endif
+
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))
 
@@ -185,54 +223,6 @@  L(last_2x_vec):
 #endif
 	VZEROUPPER_RETURN
 
-	/* If have AVX512 mask instructions put L(less_vec) close to
-	   entry as it doesn't take much space and is likely a hot target.
-	 */
-#ifdef USE_LESS_VEC_MASK_STORE
-	.p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
-	/* Less than 1 VEC.  */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-#  error Unsupported VEC_SIZE!
-# endif
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Note that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
-	andl	$(PAGE_SIZE - 1), %edi
-	/* Check if VEC_SIZE store cross page. Mask stores suffer
-	   serious performance degradation when it has to fault suppress.
-	 */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
-	/* This is generally considered a cold target.  */
-	ja	L(cross_page)
-# if VEC_SIZE > 32
-	movq	$-1, %rcx
-	bzhiq	%rdx, %rcx, %rcx
-	kmovq	%rcx, %k1
-# else
-	movl	$-1, %ecx
-	bzhil	%edx, %ecx, %ecx
-	kmovd	%ecx, %k1
-# endif
-	vmovdqu8 %VMM(0), (%rax){%k1}
-	VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
-	/* Include L(stosb_local) here if including L(less_vec) between
-	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
-	   L(stosb_more_2x_vec) target.  */
-	.p2align 4,, 10
-L(stosb_local):
-	movzbl	%sil, %eax
-	mov	%RDX_LP, %RCX_LP
-	mov	%RDI_LP, %RDX_LP
-	rep	stosb
-	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
-# endif
-#endif
-
 #if defined USE_MULTIARCH && IS_IN (libc)
 	.p2align 4
 L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@  L(return_vzeroupper):
 	ret
 #endif
 
-	.p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+	.p2align 4
+#else
+	.p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
 	mov	%RDI_LP, %RDX_LP
 	rep	stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+	/* Use xchg to save 1-byte (this helps align targets below).  */
+	xchg	%RDX_LP, %RAX_LP
+# else
 	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
 # endif
+	VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
 	/* Define L(less_vec) only if not otherwise defined.  */
-	.p2align 4
+	.p2align 4,, 12
 L(less_vec):
 	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
 	   xmm). This is only does anything for AVX2.  */
@@ -423,4 +425,35 @@  L(between_2_3):
 	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
 #endif
 	ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+	/* Force align so the loop doesn't cross a cache-line.  */
+	.p2align 4
+# endif
+	.p2align 4,, 7
+    /* Memset using non-temporal stores.  */
+L(nt_memset):
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+    /* Align DST.  */
+	orq	$(VEC_SIZE * 1 - 1), %rdi
+	incq	%rdi
+	.p2align 4,, 7
+L(nt_loop):
+	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
+	subq	$(VEC_SIZE * -4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(nt_loop)
+	sfence
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))