[v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Checks
Context                                          | Check   | Description
-------------------------------------------------+---------+-------------------------------------------------
redhat-pt-bot/TryBot-apply_patch                 | success | Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit                       | success | Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm     | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm     | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 | success | Testing passed
Commit Message
Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data comparing different strategies for large memset on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to rely only on `rep stosb` for large memsets. On the flip
side, non-temporal writes can avoid the data in their RFO requests,
saving memory bandwidth.
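As a rough C-level sketch of the resulting size dispatch (illustrative only:
the real code is the assembly in the diff below, and only the two threshold
names are taken from it):

#include <stddef.h>

/* Illustrative sketch of the memset size dispatch after this change.
   Only the two thresholds correspond to real glibc internals
   (__x86_rep_stosb_threshold and __x86_shared_non_temporal_threshold);
   the names, values and enum here are made up for illustration.  */
enum memset_path { PATH_VEC_LOOP, PATH_REP_STOSB, PATH_NON_TEMPORAL };

static size_t rep_stosb_threshold = 2048;                       /* example */
static size_t shared_non_temporal_threshold = (size_t) 12 << 20; /* example */

static enum memset_path
pick_memset_path (size_t n)
{
  if (n < rep_stosb_threshold)
    return PATH_VEC_LOOP;        /* 4 * VEC_SIZE store loop.  */
  if (n < shared_non_temporal_threshold)
    return PATH_REP_STOSB;       /* ERMS `rep stosb`.  */
  return PATH_NON_TEMPORAL;      /* New non-temporal store path.  */
}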
All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.
The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:
Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2 : 0.986
Geometric Mean across the suite New / Old SSE2 : 0.979
Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.
The results on the memset-large benchmark suite on TGL-client for N=20
runs:
Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2 : 0.928
Geometric Mean across the suite New / Old SSE2 : 0.924
So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.
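To make the arithmetic behind these figures concrete, here is a small
self-contained example of how such a geometric mean is formed and how a ratio
near 0.925 maps to "roughly a 7.5% speedup" (the per-benchmark ratios below
are invented for the example, not taken from the runs above):

#include <math.h>
#include <stdio.h>

/* Illustrative only: geometric mean of New/Old time ratios.  The inputs
   are made up; link with -lm.  */
int
main (void)
{
  double ratios[] = { 0.91, 0.95, 0.93, 0.92, 0.915 };
  int n = sizeof ratios / sizeof ratios[0];
  double log_sum = 0.0;
  for (int i = 0; i < n; i++)
    log_sum += log (ratios[i]);
  double geomean = exp (log_sum / n);   /* ~0.925 for these inputs.  */
  printf ("geomean New/Old = %.3f (new code takes %.1f%% less time)\n",
          geomean, (1.0 - geomean) * 100.0);
  return 0;
}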
Full test-suite passes on x86_64 w/ and w/o multiarch.
---
.../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
1 file changed, 91 insertions(+), 58 deletions(-)
Comments
On Sat, May 18, 2024 at 5:44 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2 : 0.986
> Geometric Mean across the suite New / Old SSE2 : 0.979
>
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2 : 0.928
> Geometric Mean across the suite New / Old SSE2 : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
> .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> 1 file changed, 91 insertions(+), 58 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
> 2. If size is less than VEC, use integer register stores.
> 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> - 5. On machines ERMS feature, if size is greater or equal than
> - __x86_rep_stosb_threshold then REP STOSB will be used.
> - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> - 4 VEC stores and store 4 * VEC at a time until done. */
> + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> + 4 VEC stores and store 4 * VEC at a time until done.
> + 6. On machines ERMS feature, if size is range
> + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> + then REP STOSB will be used.
> + 7. If size >= __x86_shared_non_temporal_threshold, use a
> + non-temporal stores. */
>
> #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> VMOVU %VMM(0), (%rdi)
> VZEROUPPER_RETURN
> +
> + /* If have AVX512 mask instructions put L(less_vec) close to
> + entry as it doesn't take much space and is likely a hot target. */
> +#ifdef USE_LESS_VEC_MASK_STORE
> + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> + .p2align 6,, 47
> + .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> + /* Less than 1 VEC. */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +# error Unsupported VEC_SIZE!
> +# endif
> + /* Clear high bits from edi. Only keeping bits relevant to page
> + cross check. Note that we are using rax which is set in
> + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> + andl $(PAGE_SIZE - 1), %edi
> + /* Check if VEC_SIZE store cross page. Mask stores suffer
> + serious performance degradation when it has to fault suppress. */
> + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> + /* This is generally considered a cold target. */
> + ja L(cross_page)
> +# if VEC_SIZE > 32
> + movq $-1, %rcx
> + bzhiq %rdx, %rcx, %rcx
> + kmovq %rcx, %k1
> +# else
> + movl $-1, %ecx
> + bzhil %edx, %ecx, %ecx
> + kmovd %ecx, %k1
> +# endif
> + vmovdqu8 %VMM(0), (%rax){%k1}
> + VZEROUPPER_RETURN
> +#endif
> +
> #if defined USE_MULTIARCH && IS_IN (libc)
> END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
> #endif
> VZEROUPPER_RETURN
>
> - /* If have AVX512 mask instructions put L(less_vec) close to
> - entry as it doesn't take much space and is likely a hot target.
> - */
> -#ifdef USE_LESS_VEC_MASK_STORE
> - .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> - /* Less than 1 VEC. */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -# error Unsupported VEC_SIZE!
> -# endif
> - /* Clear high bits from edi. Only keeping bits relevant to page
> - cross check. Note that we are using rax which is set in
> - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> - andl $(PAGE_SIZE - 1), %edi
> - /* Check if VEC_SIZE store cross page. Mask stores suffer
> - serious performance degradation when it has to fault suppress.
> - */
> - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> - /* This is generally considered a cold target. */
> - ja L(cross_page)
> -# if VEC_SIZE > 32
> - movq $-1, %rcx
> - bzhiq %rdx, %rcx, %rcx
> - kmovq %rcx, %k1
> -# else
> - movl $-1, %ecx
> - bzhil %edx, %ecx, %ecx
> - kmovd %ecx, %k1
> -# endif
> - vmovdqu8 %VMM(0), (%rax){%k1}
> - VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> - /* Include L(stosb_local) here if including L(less_vec) between
> - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> - L(stosb_more_2x_vec) target. */
> - .p2align 4,, 10
> -L(stosb_local):
> - movzbl %sil, %eax
> - mov %RDX_LP, %RCX_LP
> - mov %RDI_LP, %RDX_LP
> - rep stosb
> - mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> -# endif
> -#endif
> -
> #if defined USE_MULTIARCH && IS_IN (libc)
> .p2align 4
> L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
> ret
> #endif
>
> - .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> + .p2align 4
> +#else
> + .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> range for 2-byte jump encoding. */
> L(stosb_local):
> + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + jae L(nt_memset)
> movzbl %sil, %eax
> mov %RDX_LP, %RCX_LP
> mov %RDI_LP, %RDX_LP
> rep stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> + /* Use xchg to save 1-byte (this helps align targets below). */
> + xchg %RDX_LP, %RAX_LP
> +# else
> mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> # endif
> + VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
> /* Define L(less_vec) only if not otherwise defined. */
> - .p2align 4
> + .p2align 4,, 12
> L(less_vec):
> /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> xmm). This is only does anything for AVX2. */
> @@ -423,4 +425,35 @@ L(between_2_3):
> movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> #endif
> ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> + /* Force align so the loop doesn't cross a cache-line. */
> + .p2align 4
> +# endif
> + .p2align 4,, 7
> + /* Memset using non-temporal stores. */
> +L(nt_memset):
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> + /* Align DST. */
> + orq $(VEC_SIZE * 1 - 1), %rdi
> + incq %rdi
> + .p2align 4,, 7
> +L(nt_loop):
> + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> + subq $(VEC_SIZE * -4), %rdi
> + cmpq %rdx, %rdi
> + jb L(nt_loop)
> + sfence
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> + VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>
LGTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Thanks.
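For readers who prefer C, here is a rough intrinsics sketch of what the new
L(nt_memset) path in the diff above does for the VEC_SIZE == 32 (AVX2) case;
it mirrors the structure of the assembly but is not the actual implementation,
and it assumes n is at least several vectors (compile with -mavx2):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Rough C equivalent of L(nt_memset) for VEC_SIZE == 32 (illustrative,
   not the glibc code).  Assumes n is well above 4 * 32 bytes, as it is
   on the non-temporal path.  */
static void *
nt_memset_sketch (void *dst, int c, size_t n)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  char *p = dst;
  char *last4 = p + n - 4 * 32;      /* like leaq -4*VEC(%rdi,%rdx), %rdx */

  /* Head: one unaligned store, then round p up to a 32-byte boundary
     (the `orq $(VEC_SIZE * 1 - 1)` / `incq` pair in the assembly).  */
  _mm256_storeu_si256 ((__m256i *) p, v);
  p = (char *) (((uintptr_t) p | 31) + 1);

  /* Body: four non-temporal stores per iteration, bypassing the cache.  */
  do
    {
      _mm256_stream_si256 ((__m256i *) (p + 0 * 32), v);
      _mm256_stream_si256 ((__m256i *) (p + 1 * 32), v);
      _mm256_stream_si256 ((__m256i *) (p + 2 * 32), v);
      _mm256_stream_si256 ((__m256i *) (p + 3 * 32), v);
      p += 4 * 32;
    }
  while (p < last4);

  /* Order the non-temporal stores before returning.  */
  _mm_sfence ();

  /* Tail: four regular unaligned stores covering the last 4 * 32 bytes.  */
  _mm256_storeu_si256 ((__m256i *) (last4 + 0 * 32), v);
  _mm256_storeu_si256 ((__m256i *) (last4 + 1 * 32), v);
  _mm256_storeu_si256 ((__m256i *) (last4 + 2 * 32), v);
  _mm256_storeu_si256 ((__m256i *) (last4 + 3 * 32), v);
  return dst;
}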
On 18/05/24 21:43, Noah Goldstein wrote:
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
Btw this datasheet is not accessible without extra steps.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.
Any idea how this plays on newer AMD chips? I am trying to avoid
another regression like BZ#30994 and BZ#30995 (the latter I would like
to fix, however I don't have access to a Zen4 machine to check the
results).
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2 : 0.986
> Geometric Mean across the suite New / Old SSE2 : 0.979
>
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2 : 0.928
> Geometric Mean across the suite New / Old SSE2 : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote:
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a few MBs)
> memsets. See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
Did your tests read the region that was memset afterward, or did they
just do the memset? It *probably* doesn't matter at "above a few MBs"
but I'm worried that this new implementation might look good on
microbenchmarks but make real programs *slower* because it is no longer
priming the cache for the memset region.
zw
On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 18/05/24 21:43, Noah Goldstein wrote:
> > Previously we use `rep stosb` for all medium/large memsets. This is
> > notably worse than non-temporal stores for large (above a
> > few MBs) memsets.
> > See:
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > For data using different stategies for large memset on ICX and SKX.
>
> Btw this datasheet is not accessible without extra steps.
>
Ah, sorry, I put the benchmark data in a repo. See the data here:
https://github.com/goldsteinn/memset-benchmarks/tree/master/results
The ICX.html and SKX.html files have the data from the above link.
> >
> > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > on SKX. Historically, these numbers would not have been so good
> > because of the zero-over-zero writeback optimization that `rep stosb`
> > is able to do. But, the zero-over-zero writeback optimization has been
> > removed as a potential side-channel attack, so there is no longer any
> > good reason to only rely on `rep stosb` for large memsets. On the flip
> > size, non-temporal writes can avoid data in their RFO requests saving
> > memory bandwidth.
>
> Any chance on how this play in newer AMD chips? I am trying to avoid
> another regression like BZ#30994 and BZ#30995 (this one I would like
> to fix, however I don't have access to a Zen4 machine to check for
> results).
I didn't and don't have access to any of the newer AMD chips.
The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
has a README w/ steps if anyone wants to test it.
On Mon, May 20, 2024 at 2:37 PM Zack Weinberg <zack@owlfolio.org> wrote:
>
> On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote:
> > Previously we use `rep stosb` for all medium/large memsets. This is
> > notably worse than non-temporal stores for large (above a few MBs)
> > memsets. See:
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > For data using different stategies for large memset on ICX and SKX.
>
> Did your tests read the region that was memset afterward, or did they
> just do the memset? It *probably* doesn't matter at "above a few MBs"
> but I'm worried that this new implementation might look good on
> microbenchmarks but make real programs *slower* because it is no longer
> priming the cache for the memset region.
Yes, see all results tagged with `reused=1`. Results are still good.
Data available at:
https://github.com/goldsteinn/memset-benchmarks/blob/master/results/SKX.html
and/or
https://github.com/goldsteinn/memset-benchmarks/blob/master/results/ICX.html
>
> zw
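A minimal sketch of the kind of "reused" measurement being discussed (this is
not the harness from the memset-benchmarks repo, just an illustration of
timing a memset together with a subsequent read pass so that cache effects
after the memset are included):

#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <string.h>
#include <time.h>

static volatile unsigned long sink;

/* Illustrative "reused" measurement: time a large memset plus a following
   read pass over the same buffer.  Not the actual benchmark code.  */
static double
time_memset_then_read (size_t size)
{
  unsigned char *buf = malloc (size);
  if (buf == NULL)
    return -1.0;
  struct timespec t0, t1;
  unsigned long sum = 0;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  memset (buf, 0x5a, size);
  for (size_t i = 0; i < size; i += 64)   /* Touch one byte per cache line.  */
    sum += buf[i];
  clock_gettime (CLOCK_MONOTONIC, &t1);
  sink = sum;                             /* Keep the read pass alive.  */
  free (buf);
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}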
On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> >
> >
> >
> > On 18/05/24 21:43, Noah Goldstein wrote:
> > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > notably worse than non-temporal stores for large (above a
> > > few MBs) memsets.
> > > See:
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > For data using different stategies for large memset on ICX and SKX.
> >
> > Btw this datasheet is not accessible without extra steps.
> >
>
> Bh sorry, put the benchmark data in a repo. See data here:
> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>
> The ICX.html and SKX.html have the data from the above link.
> > >
> > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > on SKX. Historically, these numbers would not have been so good
> > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > is able to do. But, the zero-over-zero writeback optimization has been
> > > removed as a potential side-channel attack, so there is no longer any
> > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > size, non-temporal writes can avoid data in their RFO requests saving
> > > memory bandwidth.
> >
> > Any chance on how this play in newer AMD chips? I am trying to avoid
> > another regression like BZ#30994 and BZ#30995 (this one I would like
> > to fix, however I don't have access to a Zen4 machine to check for
> > results).
>
> I didn't and don't have access to any of the newer AMD chips.
>
> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> has a README w/ steps if anyone wants to test it.
What we could do is use a separate tunable for the memset NT threshold and just
make it SIZE_MAX for AMD.
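A minimal sketch of that idea (all of the names below are hypothetical; this
is not glibc's actual tunables or cpu-features code):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of a separate memset non-temporal threshold
   that is effectively disabled on vendors where the rep stosb path should
   keep handling large sizes.  None of these names exist in glibc.  */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

static size_t memset_non_temporal_threshold;

static void
init_memset_nt_threshold (enum cpu_vendor vendor, size_t shared_cache_size)
{
  if (vendor == VENDOR_INTEL)
    /* Some cache-size-derived heuristic, as with the existing shared
       non-temporal threshold.  */
    memset_non_temporal_threshold = shared_cache_size * 3 / 4;
  else
    /* SIZE_MAX means a `size >= threshold` check never fires, so large
       memsets keep using rep stosb.  */
    memset_non_temporal_threshold = SIZE_MAX;
}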
On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> > <adhemerval.zanella@linaro.org> wrote:
> > >
> > >
> > >
> > > On 18/05/24 21:43, Noah Goldstein wrote:
> > > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > > notably worse than non-temporal stores for large (above a
> > > > few MBs) memsets.
> > > > See:
> > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > > For data using different stategies for large memset on ICX and SKX.
> > >
> > > Btw this datasheet is not accessible without extra steps.
> > >
> >
> > Bh sorry, put the benchmark data in a repo. See data here:
> > https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >
> > The ICX.html and SKX.html have the data from the above link.
> > > >
> > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > on SKX. Historically, these numbers would not have been so good
> > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > is able to do. But, the zero-over-zero writeback optimization has been
> > > > removed as a potential side-channel attack, so there is no longer any
> > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > memory bandwidth.
> > >
> > > Any chance on how this play in newer AMD chips? I am trying to avoid
> > > another regression like BZ#30994 and BZ#30995 (this one I would like
> > > to fix, however I don't have access to a Zen4 machine to check for
> > > results).
> >
> > I didn't and don't have access to any of the newer AMD chips.
> >
> > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> > has a README w/ steps if anyone wants to test it.
>
> What we could do is use a seperate tunable for memset NT threshold and just
> make it SIZE_MAX for AMD.
Adhemerval, is this patch okay to go in like that, or do you want to keep it
Intel specific?
On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> > > <adhemerval.zanella@linaro.org> wrote:
> > > >
> > > >
> > > >
> > > > On 18/05/24 21:43, Noah Goldstein wrote:
> > > > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > > > notably worse than non-temporal stores for large (above a
> > > > > few MBs) memsets.
> > > > > See:
> > > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > > > For data using different stategies for large memset on ICX and SKX.
> > > >
> > > > Btw this datasheet is not accessible without extra steps.
> > > >
> > >
> > > Bh sorry, put the benchmark data in a repo. See data here:
> > > https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> > >
> > > The ICX.html and SKX.html have the data from the above link.
> > > > >
> > > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > > on SKX. Historically, these numbers would not have been so good
> > > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > > is able to do. But, the zero-over-zero writeback optimization has been
> > > > > removed as a potential side-channel attack, so there is no longer any
> > > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > > memory bandwidth.
> > > >
> > > > Any chance on how this play in newer AMD chips? I am trying to avoid
> > > > another regression like BZ#30994 and BZ#30995 (this one I would like
> > > > to fix, however I don't have access to a Zen4 machine to check for
> > > > results).
> > >
> > > I didn't and don't have access to any of the newer AMD chips.
> > >
> > > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> > > has a README w/ steps if anyone wants to test it.
> >
> > What we could do is use a seperate tunable for memset NT threshold and just
> > make it SIZE_MAX for AMD.
>
> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> specific?
Please make it Intel specific.
Thanks.
On 23/05/24 13:49, H.J. Lu wrote:
> On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>> What we could do is use a separate tunable for the memset NT threshold and just
>>> make it SIZE_MAX for AMD.
>>
>> Adhemerval, is this patch okay to go in like that, or do you want to keep it Intel
>> specific?
>
> Please make it Intel specific.
>
> Thanks.
Yes, please make it Intel specific for now. I will try to take a look at least
with a Zen3 core that I have access to.
On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access to.
Done; split into two patches, the tunable one is the second in the series.
On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access to.
Adding Joe into the thread. He has access to some AMD machines
and will run the benchmarks for us.
@@ -21,10 +21,13 @@
2. If size is less than VEC, use integer register stores.
3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
- 5. On machines ERMS feature, if size is greater or equal than
- __x86_rep_stosb_threshold then REP STOSB will be used.
- 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
- 4 VEC stores and store 4 * VEC at a time until done. */
+ 5. If size is more than 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+ 4 VEC stores and store 4 * VEC at a time until done.
+ 6. On machines with the ERMS feature, if size is in the range
+ [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+ then REP STOSB will be used.
+ 7. If size >= __x86_shared_non_temporal_threshold, use
+ non-temporal stores. */
#include <sysdep.h>
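For readers less familiar with the file, the dispatch order described in the comment above corresponds roughly to the C sketch below. VEC_SIZE and both threshold values are stand-ins chosen for illustration; the real __x86_rep_stosb_threshold and __x86_shared_non_temporal_threshold are computed by glibc at startup.

#include <stdio.h>
#include <stddef.h>

#define VEC_SIZE 32                            /* e.g. AVX2 vectors */

/* Stand-in values, not the ones glibc derives from the CPU.  */
static const size_t rep_stosb_threshold = 2048;
static const size_t non_temporal_threshold = 4u << 20;

static const char *memset_path (size_t n)
{
  if (n < VEC_SIZE)
    return "integer or masked stores";         /* step 2 */
  if (n <= 2 * VEC_SIZE)
    return "2 VEC stores";                     /* step 3 */
  if (n <= 4 * VEC_SIZE)
    return "4 VEC stores";                     /* step 4 */
  if (n < rep_stosb_threshold)
    return "aligned 4x VEC store loop";        /* step 5 */
  if (n < non_temporal_threshold)
    return "rep stosb";                        /* step 6 */
  return "non-temporal stores";                /* step 7, new in this patch */
}

int main (void)
{
  size_t sizes[] = { 8, 100, 4096, (size_t) 16 << 20 };
  for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
    printf ("%zu bytes -> %s\n", sizes[i], memset_path (sizes[i]));
  return 0;
}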
@@ -147,6 +150,41 @@ L(entry_from_wmemset):
VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
VMOVU %VMM(0), (%rdi)
VZEROUPPER_RETURN
+
+ /* If we have AVX512 mask instructions, put L(less_vec) close to
+ entry as it doesn't take much space and is likely a hot target. */
+#ifdef USE_LESS_VEC_MASK_STORE
+ /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
+ .p2align 6,, 47
+ .p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+ /* Less than 1 VEC. */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+# error Unsupported VEC_SIZE!
+# endif
+ /* Clear high bits from edi. Only keeping bits relevant to page
+ cross check. Note that we are using rax which is set in
+ MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
+ andl $(PAGE_SIZE - 1), %edi
+ /* Check if VEC_SIZE store cross page. Mask stores suffer
+ serious performance degradation when it has to fault suppress. */
+ cmpl $(PAGE_SIZE - VEC_SIZE), %edi
+ /* This is generally considered a cold target. */
+ ja L(cross_page)
+# if VEC_SIZE > 32
+ movq $-1, %rcx
+ bzhiq %rdx, %rcx, %rcx
+ kmovq %rcx, %k1
+# else
+ movl $-1, %ecx
+ bzhil %edx, %ecx, %ecx
+ kmovd %ecx, %k1
+# endif
+ vmovdqu8 %VMM(0), (%rax){%k1}
+ VZEROUPPER_RETURN
+#endif
+
#if defined USE_MULTIARCH && IS_IN (libc)
END (MEMSET_SYMBOL (__memset, unaligned))
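The USE_LESS_VEC_MASK_STORE block above (bzhi to build a byte mask, then a masked vmovdqu8) can be read as the following intrinsics sketch for the VEC_SIZE == 64 case, assuming AVX512BW and BMI2. The page-cross check in the assembly is only a performance guard against fault-suppressed mask stores and is left out here.

#include <immintrin.h>
#include <stddef.h>

/* Store the low n bytes (n < 64) of a broadcast value, like the
   bzhi + kmov + vmovdqu8 {%k1} sequence.
   Build with: gcc -O2 -mavx512f -mavx512bw -mbmi2  */
static void memset_less_vec_sketch (void *dst, int c, size_t n)
{
  __m512i v = _mm512_set1_epi8 ((char) c);
  /* bzhi (-1, n) keeps the low n bits set: one mask bit per byte stored.  */
  __mmask64 k = (__mmask64) _bzhi_u64 (~0ULL, (unsigned int) n);
  _mm512_mask_storeu_epi8 (dst, k, v);
}

int main (void)
{
  char buf[64] = { 0 };
  memset_less_vec_sketch (buf, 0xab, 17);      /* touches buf[0..16] only */
  return (buf[16] == (char) 0xab && buf[17] == 0) ? 0 : 1;
}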
@@ -185,54 +223,6 @@ L(last_2x_vec):
#endif
VZEROUPPER_RETURN
- /* If have AVX512 mask instructions put L(less_vec) close to
- entry as it doesn't take much space and is likely a hot target.
- */
-#ifdef USE_LESS_VEC_MASK_STORE
- .p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
- /* Less than 1 VEC. */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-# error Unsupported VEC_SIZE!
-# endif
- /* Clear high bits from edi. Only keeping bits relevant to page
- cross check. Note that we are using rax which is set in
- MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
- andl $(PAGE_SIZE - 1), %edi
- /* Check if VEC_SIZE store cross page. Mask stores suffer
- serious performance degradation when it has to fault suppress.
- */
- cmpl $(PAGE_SIZE - VEC_SIZE), %edi
- /* This is generally considered a cold target. */
- ja L(cross_page)
-# if VEC_SIZE > 32
- movq $-1, %rcx
- bzhiq %rdx, %rcx, %rcx
- kmovq %rcx, %k1
-# else
- movl $-1, %ecx
- bzhil %edx, %ecx, %ecx
- kmovd %ecx, %k1
-# endif
- vmovdqu8 %VMM(0), (%rax){%k1}
- VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
- /* Include L(stosb_local) here if including L(less_vec) between
- L(stosb_more_2x_vec) and ENTRY. This is to cache align the
- L(stosb_more_2x_vec) target. */
- .p2align 4,, 10
-L(stosb_local):
- movzbl %sil, %eax
- mov %RDX_LP, %RCX_LP
- mov %RDI_LP, %RDX_LP
- rep stosb
- mov %RDX_LP, %RAX_LP
- VZEROUPPER_RETURN
-# endif
-#endif
-
#if defined USE_MULTIARCH && IS_IN (libc)
.p2align 4
L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
ret
#endif
- .p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+ .p2align 4
+#else
+ .p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
range for 2-byte jump encoding. */
L(stosb_local):
+ cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
+ jae L(nt_memset)
movzbl %sil, %eax
mov %RDX_LP, %RCX_LP
mov %RDI_LP, %RDX_LP
rep stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+ /* Use xchg to save 1-byte (this helps align targets below). */
+ xchg %RDX_LP, %RAX_LP
+# else
mov %RDX_LP, %RAX_LP
- VZEROUPPER_RETURN
# endif
+ VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
/* Define L(less_vec) only if not otherwise defined. */
- .p2align 4
+ .p2align 4,, 12
L(less_vec):
/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
xmm). This is only does anything for AVX2. */
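The added check at the top of L(stosb_local) is easiest to see in C: at or above the shared non-temporal threshold the code skips `rep stosb` entirely and takes the non-temporal path. The threshold constant and the NT placeholder below are illustrative only; the inline asm mirrors the movzbl/mov/rep stosb setup.

#include <stddef.h>
#include <string.h>

/* Stand-in for __x86_shared_non_temporal_threshold.  */
static const size_t non_temporal_threshold = 4u << 20;

/* Placeholder for the non-temporal loop (sketched after the
   L(nt_memset) hunk below).  */
static void *nt_memset_placeholder (void *dst, int c, size_t n)
{
  return memset (dst, c, n);
}

static void *memset_stosb_or_nt (void *dst, int c, size_t n)
{
  if (n >= non_temporal_threshold)
    return nt_memset_placeholder (dst, c, n);  /* jae L(nt_memset) */
  /* rep stosb: AL = fill byte, RCX = count, RDI = destination.
     DF is clear per the psABI, so the fill goes forward.  */
  void *d = dst;
  __asm__ volatile ("rep stosb"
                    : "+D" (d), "+c" (n)
                    : "a" (c)
                    : "memory");
  return dst;
}

int main (void)
{
  static char buf[1 << 16];
  memset_stosb_or_nt (buf, 0x5a, sizeof buf);
  return (buf[0] == 0x5a && buf[sizeof buf - 1] == 0x5a) ? 0 : 1;
}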
@@ -423,4 +425,35 @@ L(between_2_3):
movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
#endif
ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+ /* Force align so the loop doesn't cross a cache-line. */
+ .p2align 4
+# endif
+ .p2align 4,, 7
+ /* Memset using non-temporal stores. */
+L(nt_memset):
+ VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
+ leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
+ /* Align DST. */
+ orq $(VEC_SIZE * 1 - 1), %rdi
+ incq %rdi
+ .p2align 4,, 7
+L(nt_loop):
+ VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
+ subq $(VEC_SIZE * -4), %rdi
+ cmpq %rdx, %rdi
+ jb L(nt_loop)
+ sfence
+ VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
+ VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))
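The new L(nt_memset) block itself corresponds roughly to this AVX2 intrinsics sketch (VEC = 32): one unaligned head store, round the destination up to a vector boundary (the orq/incq pair), stream four vectors per iteration, sfence, then four unaligned stores anchored at the precomputed tail pointer. It assumes n is comfortably above 4 * VEC, which always holds above the non-temporal threshold; function and constant names are illustrative.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define VEC 32                                /* AVX2 vector size */

/* Build with: gcc -O2 -mavx2  */
static void nt_memset_sketch (void *dst, int c, size_t n)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  char *p = (char *) dst;
  char *tail = p + n - 4 * VEC;               /* leaq -4*VEC(%rdi, %rdx), %rdx */

  _mm256_storeu_si256 ((__m256i *) p, v);     /* cover the unaligned head */
  /* orq $(VEC - 1), %rdi; incq %rdi: round up to the next VEC boundary.  */
  p = (char *) ((((uintptr_t) p) | (VEC - 1)) + 1);

  while (p < tail)                            /* cmpq %rdx, %rdi; jb L(nt_loop) */
    {
      _mm256_stream_si256 ((__m256i *) (p + 0 * VEC), v);
      _mm256_stream_si256 ((__m256i *) (p + 1 * VEC), v);
      _mm256_stream_si256 ((__m256i *) (p + 2 * VEC), v);
      _mm256_stream_si256 ((__m256i *) (p + 3 * VEC), v);
      p += 4 * VEC;
    }
  _mm_sfence ();                              /* order NT stores before the tail */
  _mm256_storeu_si256 ((__m256i *) (tail + 0 * VEC), v);
  _mm256_storeu_si256 ((__m256i *) (tail + 1 * VEC), v);
  _mm256_storeu_si256 ((__m256i *) (tail + 2 * VEC), v);
  _mm256_storeu_si256 ((__m256i *) (tail + 3 * VEC), v);
}

int main (void)
{
  enum { N = 8 << 20 };
  static char buf[N];
  nt_memset_sketch (buf, 0x7f, N);
  return (buf[0] == 0x7f && buf[N - 1] == 0x7f) ? 0 : 1;
}

The sfence matters because the streaming stores are weakly ordered; fencing before the ordinary tail stores (and before returning) keeps the whole fill observable in program order, matching what the assembly does.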