[v2,1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]

Message ID 20240524173851.2483952-1-goldstein.w.n@gmail.com
State Committed
Commit 5bf0ab80573d66e4ae5d94b094659094336da90f
Delegated to: Arjun Shankar
Series [v2,1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Testing passed

Commit Message

Noah Goldstein May 24, 2024, 5:38 p.m. UTC
Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a few MBs)
memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data comparing different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to rely solely on `rep stosb` for large memsets. On the
flip side, non-temporal writes can avoid the data fetch in their RFO
requests, saving memory bandwidth.
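
As a rough illustration of the resulting dispatch (not the glibc
assembly itself; SSE2 intrinsics stand in for the hand-written code,
and the thresholds are plain parameters rather than the real
__x86_rep_stosb_threshold / __x86_shared_non_temporal_threshold
globals):

#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence, ...  */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void
memset_dispatch_sketch (void *dst, int c, size_t n,
                        size_t stosb_threshold, size_t nt_threshold)
{
  if (n < stosb_threshold)
    {
      memset (dst, c, n);       /* small/medium: normal vector-store paths.  */
      return;
    }
  if (n < nt_threshold)
    {
      memset (dst, c, n);       /* stands in for `rep stosb` here.  */
      return;
    }
  /* Large: streaming (non-temporal) stores bypass the cache and avoid
     reading the destination lines for ownership.  */
  unsigned char *p = dst, *end = p + n;
  __m128i v = _mm_set1_epi8 ((char) c);
  _mm_storeu_si128 ((__m128i *) p, v);  /* unaligned head.  */
  p = (unsigned char *) (((uintptr_t) p + 16) & ~(uintptr_t) 15);
  while (p + 16 <= end)
    {
      _mm_stream_si128 ((__m128i *) p, v);
      p += 16;
    }
  _mm_sfence ();        /* order the streaming stores.  */
  if (p < end)
    _mm_storeu_si128 ((__m128i *) (end - 16), v);       /* unaligned tail.  */
}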

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the glibc memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't introduce any regressions in
the other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth, so
saving bandwidth on RFOs is less impactful), but still advantageous.
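
For reference, the headline number is just the geometric mean of the
per-benchmark New/Old time ratios; a tiny stand-alone sketch (with
made-up ratios, not the real data):

#include <math.h>
#include <stdio.h>

int
main (void)
{
  /* Hypothetical New/Old time ratios for illustration only.  */
  double ratios[] = { 0.93, 0.91, 0.94, 0.92 };
  size_t n = sizeof ratios / sizeof ratios[0];
  double log_sum = 0.0;
  for (size_t i = 0; i < n; i++)
    log_sum += log (ratios[i]);
  double geomean = exp (log_sum / n);
  /* A geomean of ~0.925 means the new code takes ~92.5% of the old
     time, i.e. roughly a 7.5% improvement.  */
  printf ("geomean New/Old = %.3f (~%.1f%% less time)\n",
          geomean, (1.0 - geomean) * 100.0);
  return 0;
}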

Full test-suite passes on x86_64 w/ and w/o multiarch.
---
 .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
 1 file changed, 91 insertions(+), 58 deletions(-)
  

Comments

H.J. Lu May 29, 2024, 10:52 p.m. UTC | #1
On Fri, May 24, 2024 at 10:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2   : 0.986
> Geometric Mean across the suite New / Old SSE2   : 0.979
>
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2   : 0.928
> Geometric Mean across the suite New / Old SSE2   : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
>  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
>  1 file changed, 91 insertions(+), 58 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
>     2. If size is less than VEC, use integer register stores.
>     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
>     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> -   5. On machines ERMS feature, if size is greater or equal than
> -      __x86_rep_stosb_threshold then REP STOSB will be used.
> -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> -      4 VEC stores and store 4 * VEC at a time until done.  */
> +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> +      4 VEC stores and store 4 * VEC at a time until done.
> +   6. On machines ERMS feature, if size is range
> +         [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +         then REP STOSB will be used.
> +   7. If size >= __x86_shared_non_temporal_threshold, use a
> +         non-temporal stores.  */
>
>  #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
>         VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
>         VMOVU   %VMM(0), (%rdi)
>         VZEROUPPER_RETURN
> +
> +       /* If have AVX512 mask instructions put L(less_vec) close to
> +          entry as it doesn't take much space and is likely a hot target.  */
> +#ifdef USE_LESS_VEC_MASK_STORE
> +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> +       .p2align 6,, 47
> +       .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> +       /* Less than 1 VEC.  */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +#  error Unsupported VEC_SIZE!
> +# endif
> +       /* Clear high bits from edi. Only keeping bits relevant to page
> +          cross check. Note that we are using rax which is set in
> +          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> +       andl    $(PAGE_SIZE - 1), %edi
> +       /* Check if VEC_SIZE store cross page. Mask stores suffer
> +          serious performance degradation when it has to fault suppress.  */
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> +       /* This is generally considered a cold target.  */
> +       ja      L(cross_page)
> +# if VEC_SIZE > 32
> +       movq    $-1, %rcx
> +       bzhiq   %rdx, %rcx, %rcx
> +       kmovq   %rcx, %k1
> +# else
> +       movl    $-1, %ecx
> +       bzhil   %edx, %ecx, %ecx
> +       kmovd   %ecx, %k1
> +# endif
> +       vmovdqu8 %VMM(0), (%rax){%k1}
> +       VZEROUPPER_RETURN
> +#endif
> +
>  #if defined USE_MULTIARCH && IS_IN (libc)
>  END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
>  #endif
>         VZEROUPPER_RETURN
>
> -       /* If have AVX512 mask instructions put L(less_vec) close to
> -          entry as it doesn't take much space and is likely a hot target.
> -        */
> -#ifdef USE_LESS_VEC_MASK_STORE
> -       .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> -       /* Less than 1 VEC.  */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -#  error Unsupported VEC_SIZE!
> -# endif
> -       /* Clear high bits from edi. Only keeping bits relevant to page
> -          cross check. Note that we are using rax which is set in
> -          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> -       andl    $(PAGE_SIZE - 1), %edi
> -       /* Check if VEC_SIZE store cross page. Mask stores suffer
> -          serious performance degradation when it has to fault suppress.
> -        */
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> -       /* This is generally considered a cold target.  */
> -       ja      L(cross_page)
> -# if VEC_SIZE > 32
> -       movq    $-1, %rcx
> -       bzhiq   %rdx, %rcx, %rcx
> -       kmovq   %rcx, %k1
> -# else
> -       movl    $-1, %ecx
> -       bzhil   %edx, %ecx, %ecx
> -       kmovd   %ecx, %k1
> -# endif
> -       vmovdqu8 %VMM(0), (%rax){%k1}
> -       VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> -       /* Include L(stosb_local) here if including L(less_vec) between
> -          L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> -          L(stosb_more_2x_vec) target.  */
> -       .p2align 4,, 10
> -L(stosb_local):
> -       movzbl  %sil, %eax
> -       mov     %RDX_LP, %RCX_LP
> -       mov     %RDI_LP, %RDX_LP
> -       rep     stosb
> -       mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
> -# endif
> -#endif
> -
>  #if defined USE_MULTIARCH && IS_IN (libc)
>         .p2align 4
>  L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
>         ret
>  #endif
>
> -       .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> +       .p2align 4
> +#else
> +       .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
>         /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>            range for 2-byte jump encoding.  */
>  L(stosb_local):
> +       cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +       jae     L(nt_memset)
>         movzbl  %sil, %eax
>         mov     %RDX_LP, %RCX_LP
>         mov     %RDI_LP, %RDX_LP
>         rep     stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> +       /* Use xchg to save 1-byte (this helps align targets below).  */
> +       xchg    %RDX_LP, %RAX_LP
> +# else
>         mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
>  # endif
> +       VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
>         /* Define L(less_vec) only if not otherwise defined.  */
> -       .p2align 4
> +       .p2align 4,, 12
>  L(less_vec):
>         /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
>            xmm). This is only does anything for AVX2.  */
> @@ -423,4 +425,35 @@ L(between_2_3):
>         movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
>  #endif
>         ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> +       /* Force align so the loop doesn't cross a cache-line.  */
> +       .p2align 4
> +# endif
> +       .p2align 4,, 7
> +    /* Memset using non-temporal stores.  */
> +L(nt_memset):
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> +    /* Align DST.  */
> +       orq     $(VEC_SIZE * 1 - 1), %rdi
> +       incq    %rdi
> +       .p2align 4,, 7
> +L(nt_loop):
> +       VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> +       subq    $(VEC_SIZE * -4), %rdi
> +       cmpq    %rdx, %rdi
> +       jb      L(nt_loop)
> +       sfence
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> +       VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.
  
DJ Delorie July 11, 2024, 8:56 p.m. UTC | #2
Noah Goldstein <goldstein.w.n@gmail.com> writes:
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.

I'm actually working on RHEL-29312, which you reference in the subject
(thanks for the reference, but please don't reference downstream tickets
in upstream patches).  That ticket showed a 50% slowdown in memset on
the Xeon 4215.  I've reproduced that slowdown locally, but this patch
doesn't seem to have any effect on it (current git glibc still has the
slow code).  I can get the faster results with tunables[1].  Will there
be further tuning of the default thresholds?  Or do we expect users to
use tunables to get around these regressions on the specific chips that
are not optimized?

[1] specifically, libc.cpu.x86_rep_stosb_threshold=0x1000000000 - none
of the others have any effect.
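
For anyone wanting to reproduce this locally, a minimal timing harness
along these lines (an assumption of how one might measure it, not
glibc's own benchtests), run once with the defaults and once with the
rep_stosb threshold tunable raised via GLIBC_TUNABLES
(glibc.cpu.x86_rep_stosb_threshold), should show the difference:

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (int argc, char **argv)
{
  size_t size = argc > 1 ? strtoull (argv[1], NULL, 0) : (64UL << 20);
  int iters = 50;
  char *buf = malloc (size);
  if (buf == NULL)
    return 1;
  memset (buf, 1, size);        /* fault in the pages first.  */

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < iters; i++)
    memset (buf, i, size);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf ("%zu bytes: %.2f GB/s\n", size, (double) size * iters / sec / 1e9);
  free (buf);
  return 0;
}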
  
Noah Goldstein July 12, 2024, 7:04 a.m. UTC | #3
On Fri, Jul 12, 2024 at 4:56 AM DJ Delorie <dj@redhat.com> wrote:
>
>
> Noah Goldstein <goldstein.w.n@gmail.com> writes:
> > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > on SKX. Historically, these numbers would not have been so good
> > because of the zero-over-zero writeback optimization that `rep stosb`
> > is able to do. But, the zero-over-zero writeback optimization has been
> > removed as a potential side-channel attack, so there is no longer any
> > good reason to only rely on `rep stosb` for large memsets. On the flip
> > size, non-temporal writes can avoid data in their RFO requests saving
> > memory bandwidth.
>
> I'm actually working on RHEL-29312 that you reference in the subject
> (thanks for the reference, but please don't reference downstream tickets
> in upstream patches).  In that ticket was shown a 50% slowdown in memset
> for the Xeon 4215.  I've reproduced that slowdown locally, but this
> patch doesn't seem to have any affect on it (current git glibc still has
> the slow code).  I can get the faster results with tunables[1].  Will
> there be further tuning of the default thresholds?  Or do we expect
> users to use tunables to get around these regressions, for the specific
> chips that are not optimized?
>
> [1] specifically, libc.cpu.x86_rep_stosb_threshold=0x1000000000 - none
> of the others have any affect.

Okay, I messed up. The SKX machine I used for the SKX benchmarks was
mislabeled, and was in fact an ICX machine.

I have some older SKX/BWD data:
https://docs.google.com/spreadsheets/d/1rcRIWpU8Tcq0FMH9iGPBUViPvioUmHtA9qx3h-UzPa0/edit?usp=sharing
and agree we need to increase the stosb threshold on SKX. I didn't see
anything like the huge (100MB) threshold you saw there, but am re-running
benchmarks.

An unfortunate note is that BWD seems to prefer ERMS in the same range
where SKX doesn't, so we can't just make the decision based on the
feature bit.
On the bright side, it does look like the non-temporal decision (this patch)
still makes sense on SKX.

I am re-running benchmarks now and will see what we can do about raising
the stosb threshold on SKX.

I am deeply sorry for the confusion I have created here.
>
  
Sam James July 12, 2024, 7:12 a.m. UTC | #4
DJ Delorie <dj@redhat.com> writes:

> Noah Goldstein <goldstein.w.n@gmail.com> writes:
>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
>> on SKX. Historically, these numbers would not have been so good
>> because of the zero-over-zero writeback optimization that `rep stosb`
>> is able to do. But, the zero-over-zero writeback optimization has been
>> removed as a potential side-channel attack, so there is no longer any
>> good reason to only rely on `rep stosb` for large memsets. On the flip
>> size, non-temporal writes can avoid data in their RFO requests saving
>> memory bandwidth.
>
> I'm actually working on RHEL-29312 that you reference in the subject
> (thanks for the reference, but please don't reference downstream tickets
> in upstream patches).

Can I be annoying and dissent here? I don't like the idea of throwing
away provenance. If you're working on it because of something you can
reference, you can and you should.

What you should also do is file a glibc bug with the relevant
information extracted if someone is likely to want to read up on it, so
they don't have to make an account elsewhere or something.

But I don't think we should chuck away information when we know it is
good. It might well be important when bisecting or if a revert is needed
way down the line.

> [...]

thanks,
sam
  
Andreas K. Huettel July 12, 2024, 11:24 a.m. UTC | #5
> > I'm actually working on RHEL-29312 that you reference in the subject
> > (thanks for the reference, but please don't reference downstream tickets
> > in upstream patches).
> 
> Can I be annoying and dissent here? I don't like the idea of throwing
> away provenance. If you're working on it because of something you can
> reference, you can and you should.
> 

(It's even useful if the downstream bug tracker is non-public (the horror),
since then you at least know there's someone you might be able to ask 
about it. :o) )

> What you should also do is file a glibc bug with the relevant
> information extracted if someone is likely to want to read up on it, so
> they don't have to make an account elsehwere or something.
> 
> But I don't think chucking away information when we know it is good. It
> might well be important when bisecting or if a revert is needed way down
> the line.
> 
> > [...]
> 
> thanks,
> sam
>
  
Carlos O'Donell July 12, 2024, 11:34 a.m. UTC | #6
On 7/12/24 3:12 AM, Sam James wrote:
> DJ Delorie <dj@redhat.com> writes:
> 
>> Noah Goldstein <goldstein.w.n@gmail.com> writes:
>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
>>> on SKX. Historically, these numbers would not have been so good
>>> because of the zero-over-zero writeback optimization that `rep stosb`
>>> is able to do. But, the zero-over-zero writeback optimization has been
>>> removed as a potential side-channel attack, so there is no longer any
>>> good reason to only rely on `rep stosb` for large memsets. On the flip
>>> size, non-temporal writes can avoid data in their RFO requests saving
>>> memory bandwidth.
>>
>> I'm actually working on RHEL-29312 that you reference in the subject
>> (thanks for the reference, but please don't reference downstream tickets
>> in upstream patches).
> 
> Can I be annoying and dissent here? I don't like the idea of throwing
> away provenance. If you're working on it because of something you can
> reference, you can and you should.

Always :-)

I agree we should not lose the downstream ticket reference.
 
> What you should also do is file a glibc bug with the relevant
> information extracted if someone is likely to want to read up on it, so
> they don't have to make an account elsehwere or something.

Exactly.
 
> But I don't think chucking away information when we know it is good. It
> might well be important when bisecting or if a revert is needed way down
> the line.

Yes, we should in our own tracker provide links to all the downstream tickets.

I would love this in general because then you can search which tickets are impacting
more of the distributions.
  
DJ Delorie July 12, 2024, 4:54 p.m. UTC | #7
Noah Goldstein <goldstein.w.n@gmail.com> writes:
> I didn't see the huge (100MB threshold) you saw there, but am

That wasn't a calculated threshold, that was a "way more than the
benchmarks were using" threshold.  I.e. enough to just enable/disable
the various code paths and show a difference.

I didn't feel like calculating SIZE_MAX/16 ;-)
  
DJ Delorie July 12, 2024, 4:57 p.m. UTC | #8
Sam James <sam@gentoo.org> writes:
> What you should also do is file a glibc bug with the relevant
> information extracted if someone is likely to want to read up on it, so
> they don't have to make an account elsehwere or something.

Sorry, that was my intent.  I didn't mean to throw the information away,
I meant to store it in our own bug tracking system and have the commit
reference that.  Cross-referencing upstream and downstream bugs is OK
too.  I worry that a downstream tracker might make the bug inaccessible
to the public, and we'd lose the information anyway.
  
Noah Goldstein July 15, 2024, 7:36 a.m. UTC | #9
On Fri, Jul 12, 2024 at 3:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Jul 12, 2024 at 4:56 AM DJ Delorie <dj@redhat.com> wrote:
> >
> >
> > Noah Goldstein <goldstein.w.n@gmail.com> writes:
> > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > on SKX. Historically, these numbers would not have been so good
> > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > is able to do. But, the zero-over-zero writeback optimization has been
> > > removed as a potential side-channel attack, so there is no longer any
> > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > size, non-temporal writes can avoid data in their RFO requests saving
> > > memory bandwidth.
> >
> > I'm actually working on RHEL-29312 that you reference in the subject
> > (thanks for the reference, but please don't reference downstream tickets
> > in upstream patches).  In that ticket was shown a 50% slowdown in memset
> > for the Xeon 4215.  I've reproduced that slowdown locally, but this
> > patch doesn't seem to have any affect on it (current git glibc still has
> > the slow code).  I can get the faster results with tunables[1].  Will
> > there be further tuning of the default thresholds?  Or do we expect
> > users to use tunables to get around these regressions, for the specific
> > chips that are not optimized?
> >
> > [1] specifically, libc.cpu.x86_rep_stosb_threshold=0x1000000000 - none
> > of the others have any affect.
>
> Okay, I messed up. The SKX machine I used for the SKX benchmarks was
> mislabeled, and was in fact an ICX machine.
>
> I have some older SKX/BWD data:
> https://docs.google.com/spreadsheets/d/1rcRIWpU8Tcq0FMH9iGPBUViPvioUmHtA9qx3h-UzPa0/edit?usp=sharing
> and agree we need to increase the stosb threshold on SKX. I didn't see
> the huge (100MB threshold) you saw there, but am re-running benchmarks.
>
> An unfortunate note is BWD seems to prefer ERMS in the same range SKX
> doesn't, so we can't just make the decision based on the feature.
> On the bright side, it does look like the non-temporal decision (this patch)
> still makes sense on SKX.
>
> I am re-running benchmarks now and will see what we can do about raising
> the stosb threshold on SKX.
>
> I am deeply sorry for the confusion I have created here.

I've re-run some benchmarks and updated the Google sheet:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
See SKX/SKX-2.

Regarding ERMS:
What I am measuring so far indicates we want temporal writes
when the majority of the set fits in L1. Past that ERMS becomes better,
especially as thread count increases.

Regarding NT Stores:
Non-temporal stores don't look good in these benchmarks (particularly SKX-2).
I am running some additional tests to see exactly what is going on.
I am going to post a patch to disable non-temporal stores on Skylake.
If that is considered too risky given the 2.40 release, I think we should
revert this patch (and the follow-up AMD ones) to avoid a regression on SKX
and try to get them into 2.41 instead with proper tuning.
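
If it helps to see the shape of the heuristic being described, here is
a hypothetical selection function (purely illustrative, not glibc code
and not the values being proposed):

#include <stddef.h>

enum memset_strategy { VECTOR_STORES, REP_STOSB, NON_TEMPORAL };

/* "Temporal" vector stores while most of the set fits in L1, `rep stosb`
   past that, and non-temporal stores only where they have been shown to
   win for the given CPU.  All inputs are placeholders.  */
static enum memset_strategy
pick_strategy (size_t n, size_t l1_size, size_t nt_threshold,
               int nt_profitable_on_this_cpu)
{
  if (n <= l1_size)
    return VECTOR_STORES;
  if (!nt_profitable_on_this_cpu || n < nt_threshold)
    return REP_STOSB;
  return NON_TEMPORAL;
}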


> >
  
H.J. Lu July 15, 2024, 7:44 a.m. UTC | #10
On Mon, Jul 15, 2024, 3:36 PM Noah Goldstein <goldstein.w.n@gmail.com>
wrote:

> On Fri, Jul 12, 2024 at 3:04 PM Noah Goldstein <goldstein.w.n@gmail.com>
> wrote:
> >
> > On Fri, Jul 12, 2024 at 4:56 AM DJ Delorie <dj@redhat.com> wrote:
> > >
> > >
> > > Noah Goldstein <goldstein.w.n@gmail.com> writes:
> > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > on SKX. Historically, these numbers would not have been so good
> > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > is able to do. But, the zero-over-zero writeback optimization has
> been
> > > > removed as a potential side-channel attack, so there is no longer any
> > > > good reason to only rely on `rep stosb` for large memsets. On the
> flip
> > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > memory bandwidth.
> > >
> > > I'm actually working on RHEL-29312 that you reference in the subject
> > > (thanks for the reference, but please don't reference downstream
> tickets
> > > in upstream patches).  In that ticket was shown a 50% slowdown in
> memset
> > > for the Xeon 4215.  I've reproduced that slowdown locally, but this
> > > patch doesn't seem to have any affect on it (current git glibc still
> has
> > > the slow code).  I can get the faster results with tunables[1].  Will
> > > there be further tuning of the default thresholds?  Or do we expect
> > > users to use tunables to get around these regressions, for the specific
> > > chips that are not optimized?
> > >
> > > [1] specifically, libc.cpu.x86_rep_stosb_threshold=0x1000000000 - none
> > > of the others have any affect.
> >
> > Okay, I messed up. The SKX machine I used for the SKX benchmarks was
> > mislabeled, and was in fact an ICX machine.
> >
> > I have some older SKX/BWD data:
> >
> https://docs.google.com/spreadsheets/d/1rcRIWpU8Tcq0FMH9iGPBUViPvioUmHtA9qx3h-UzPa0/edit?usp=sharing
> > and agree we need to increase the stosb threshold on SKX. I didn't see
> > the huge (100MB threshold) you saw there, but am re-running benchmarks.
> >
> > An unfortunate note is BWD seems to prefer ERMS in the same range SKX
> > doesn't, so we can't just make the decision based on the feature.
> > On the bright side, it does look like the non-temporal decision (this
> patch)
> > still makes sense on SKX.
> >
> > I am re-running benchmarks now and will see what we can do about raising
> > the stosb threshold on SKX.
> >
> > I am deeply sorry for the confusion I have created here.
>
> I've re-ran some benchmarks and updated the google sheet:
>
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> See SKX/SKX-2.
>
> Regarding ERMS:
> What I am measuring so far indicates we want temporal writes
> when the majority of the set fits in L1. Past that ERMS becomes better,
> especially as thread count increases.
>
> Regarding NT Stores:
> non-temporal doesn't look good in these benchmarks (particularly SKX-2).
> I am running some additional tests to see exactly what is going on.
> I am going to post a patch to disable non-temporal stores on Skylake,
>

Skylake clients have very different NT behaviors.
Please make sure that NT isn't disabled
on clients.


> if that is considered too risky given the 2.40 release I think we should
> revert
> this patch (and the follow up AMD ones) to avoid a regression on SKX
> and try to get them into 2.41 instead with proper tuning.
>
>
> > >
>
> H.J.
  
Noah Goldstein July 15, 2024, 7:48 a.m. UTC | #11
On Mon, Jul 15, 2024 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
>
> On Mon, Jul 15, 2024, 3:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Fri, Jul 12, 2024 at 3:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>> >
>> > On Fri, Jul 12, 2024 at 4:56 AM DJ Delorie <dj@redhat.com> wrote:
>> > >
>> > >
>> > > Noah Goldstein <goldstein.w.n@gmail.com> writes:
>> > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
>> > > > on SKX. Historically, these numbers would not have been so good
>> > > > because of the zero-over-zero writeback optimization that `rep stosb`
>> > > > is able to do. But, the zero-over-zero writeback optimization has been
>> > > > removed as a potential side-channel attack, so there is no longer any
>> > > > good reason to only rely on `rep stosb` for large memsets. On the flip
>> > > > size, non-temporal writes can avoid data in their RFO requests saving
>> > > > memory bandwidth.
>> > >
>> > > I'm actually working on RHEL-29312 that you reference in the subject
>> > > (thanks for the reference, but please don't reference downstream tickets
>> > > in upstream patches).  In that ticket was shown a 50% slowdown in memset
>> > > for the Xeon 4215.  I've reproduced that slowdown locally, but this
>> > > patch doesn't seem to have any affect on it (current git glibc still has
>> > > the slow code).  I can get the faster results with tunables[1].  Will
>> > > there be further tuning of the default thresholds?  Or do we expect
>> > > users to use tunables to get around these regressions, for the specific
>> > > chips that are not optimized?
>> > >
>> > > [1] specifically, libc.cpu.x86_rep_stosb_threshold=0x1000000000 - none
>> > > of the others have any affect.
>> >
>> > Okay, I messed up. The SKX machine I used for the SKX benchmarks was
>> > mislabeled, and was in fact an ICX machine.
>> >
>> > I have some older SKX/BWD data:
>> > https://docs.google.com/spreadsheets/d/1rcRIWpU8Tcq0FMH9iGPBUViPvioUmHtA9qx3h-UzPa0/edit?usp=sharing
>> > and agree we need to increase the stosb threshold on SKX. I didn't see
>> > the huge (100MB threshold) you saw there, but am re-running benchmarks.
>> >
>> > An unfortunate note is BWD seems to prefer ERMS in the same range SKX
>> > doesn't, so we can't just make the decision based on the feature.
>> > On the bright side, it does look like the non-temporal decision (this patch)
>> > still makes sense on SKX.
>> >
>> > I am re-running benchmarks now and will see what we can do about raising
>> > the stosb threshold on SKX.
>> >
>> > I am deeply sorry for the confusion I have created here.
>>
>> I've re-ran some benchmarks and updated the google sheet:
>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
>> See SKX/SKX-2.
>>
>> Regarding ERMS:
>> What I am measuring so far indicates we want temporal writes
>> when the majority of the set fits in L1. Past that ERMS becomes better,
>> especially as thread count increases.
>>
>> Regarding NT Stores:
>> non-temporal doesn't look good in these benchmarks (particularly SKX-2).
>> I am running some additional tests to see exactly what is going on.
>> I am going to post a patch to disable non-temporal stores on Skylake,
>
>
> Skylake clients have very different NT behaviors.
> Please make sure that NT isn't disabled
> on clients.

I guess my thinking right now is that it's better to ensure we don't introduce a
new regression. The NT store code is new to 2.40, so disabling it won't
cause any regressions.
>
>>
>> if that is considered too risky given the 2.40 release I think we should revert
>> this patch (and the follow up AMD ones) to avoid a regression on SKX
>> and try to get them into 2.41 instead with proper tuning.
>>
>>
>> > >
>>
> H.J.
  
Noah Goldstein July 15, 2024, 8:32 a.m. UTC | #12
On Mon, Jul 15, 2024 at 3:48 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jul 15, 2024 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> >
> > On Mon, Jul 15, 2024, 3:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Fri, Jul 12, 2024 at 3:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >> >
> >> > On Fri, Jul 12, 2024 at 4:56 AM DJ Delorie <dj@redhat.com> wrote:
> >> > >
> >> > >
> >> > > Noah Goldstein <goldstein.w.n@gmail.com> writes:
> >> > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >> > > > on SKX. Historically, these numbers would not have been so good
> >> > > > because of the zero-over-zero writeback optimization that `rep stosb`
> >> > > > is able to do. But, the zero-over-zero writeback optimization has been
> >> > > > removed as a potential side-channel attack, so there is no longer any
> >> > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> >> > > > size, non-temporal writes can avoid data in their RFO requests saving
> >> > > > memory bandwidth.
> >> > >
> >> > > I'm actually working on RHEL-29312 that you reference in the subject
> >> > > (thanks for the reference, but please don't reference downstream tickets
> >> > > in upstream patches).  In that ticket was shown a 50% slowdown in memset
> >> > > for the Xeon 4215.  I've reproduced that slowdown locally, but this
> >> > > patch doesn't seem to have any affect on it (current git glibc still has
> >> > > the slow code).  I can get the faster results with tunables[1].  Will
> >> > > there be further tuning of the default thresholds?  Or do we expect
> >> > > users to use tunables to get around these regressions, for the specific
> >> > > chips that are not optimized?
> >> > >
> >> > > [1] specifically, libc.cpu.x86_rep_stosb_threshold=0x1000000000 - none
> >> > > of the others have any affect.
> >> >
> >> > Okay, I messed up. The SKX machine I used for the SKX benchmarks was
> >> > mislabeled, and was in fact an ICX machine.
> >> >
> >> > I have some older SKX/BWD data:
> >> > https://docs.google.com/spreadsheets/d/1rcRIWpU8Tcq0FMH9iGPBUViPvioUmHtA9qx3h-UzPa0/edit?usp=sharing
> >> > and agree we need to increase the stosb threshold on SKX. I didn't see
> >> > the huge (100MB threshold) you saw there, but am re-running benchmarks.
> >> >
> >> > An unfortunate note is BWD seems to prefer ERMS in the same range SKX
> >> > doesn't, so we can't just make the decision based on the feature.
> >> > On the bright side, it does look like the non-temporal decision (this patch)
> >> > still makes sense on SKX.
> >> >
> >> > I am re-running benchmarks now and will see what we can do about raising
> >> > the stosb threshold on SKX.
> >> >
> >> > I am deeply sorry for the confusion I have created here.
> >>
> >> I've re-ran some benchmarks and updated the google sheet:
> >> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >> See SKX/SKX-2.
> >>
> >> Regarding ERMS:
> >> What I am measuring so far indicates we want temporal writes
> >> when the majority of the set fits in L1. Past that ERMS becomes better,
> >> especially as thread count increases.
> >>
> >> Regarding NT Stores:
> >> non-temporal doesn't look good in these benchmarks (particularly SKX-2).
> >> I am running some additional tests to see exactly what is going on.
> >> I am going to post a patch to disable non-temporal stores on Skylake,
> >
> >
> > Skylake clients have very different NT behaviors.
> > Please make sure that NT isn't disabled
> > on clients.
>
> I guess my thinking right now is it's better to ensure we don't introduce a
> new regression. The NT store code is new to 2.40, so disabling it won't
> cause any regressions.
> >
> >>
> >> if that is considered too risky given the 2.40 release I think we should revert
> >> this patch (and the follow up AMD ones) to avoid a regression on SKX
> >> and try to get them into 2.41 instead with proper tuning.
Patch posted.
I disabled it for SKX only (I'm convinced enough that it's a function of SKX's
unique memory system).
> >>
> >>
> >> > >
> >>
> > H.J.
  

Patch

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@ 
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. On machines ERMS feature, if size is greater or equal than
-      __x86_rep_stosb_threshold then REP STOSB will be used.
-   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
-      4 VEC stores and store 4 * VEC at a time until done.  */
+   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+      4 VEC stores and store 4 * VEC at a time until done.
+   6. On machines ERMS feature, if size is range
+	  [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+	  then REP STOSB will be used.
+   7. If size >= __x86_shared_non_temporal_threshold, use a
+	  non-temporal stores.  */
 
 #include <sysdep.h>
 
@@ -147,6 +150,41 @@  L(entry_from_wmemset):
 	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VMM(0), (%rdi)
 	VZEROUPPER_RETURN
+
+	/* If have AVX512 mask instructions put L(less_vec) close to
+	   entry as it doesn't take much space and is likely a hot target.  */
+#ifdef USE_LESS_VEC_MASK_STORE
+    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
+	.p2align 6,, 47
+	.p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+	/* Less than 1 VEC.  */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+#  error Unsupported VEC_SIZE!
+# endif
+	/* Clear high bits from edi. Only keeping bits relevant to page
+	   cross check. Note that we are using rax which is set in
+	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
+	andl	$(PAGE_SIZE - 1), %edi
+	/* Check if VEC_SIZE store cross page. Mask stores suffer
+	   serious performance degradation when it has to fault suppress.  */
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
+	/* This is generally considered a cold target.  */
+	ja	L(cross_page)
+# if VEC_SIZE > 32
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+	kmovq	%rcx, %k1
+# else
+	movl	$-1, %ecx
+	bzhil	%edx, %ecx, %ecx
+	kmovd	%ecx, %k1
+# endif
+	vmovdqu8 %VMM(0), (%rax){%k1}
+	VZEROUPPER_RETURN
+#endif
+
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))
 
@@ -185,54 +223,6 @@  L(last_2x_vec):
 #endif
 	VZEROUPPER_RETURN
 
-	/* If have AVX512 mask instructions put L(less_vec) close to
-	   entry as it doesn't take much space and is likely a hot target.
-	 */
-#ifdef USE_LESS_VEC_MASK_STORE
-	.p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
-	/* Less than 1 VEC.  */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-#  error Unsupported VEC_SIZE!
-# endif
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Note that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
-	andl	$(PAGE_SIZE - 1), %edi
-	/* Check if VEC_SIZE store cross page. Mask stores suffer
-	   serious performance degradation when it has to fault suppress.
-	 */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
-	/* This is generally considered a cold target.  */
-	ja	L(cross_page)
-# if VEC_SIZE > 32
-	movq	$-1, %rcx
-	bzhiq	%rdx, %rcx, %rcx
-	kmovq	%rcx, %k1
-# else
-	movl	$-1, %ecx
-	bzhil	%edx, %ecx, %ecx
-	kmovd	%ecx, %k1
-# endif
-	vmovdqu8 %VMM(0), (%rax){%k1}
-	VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
-	/* Include L(stosb_local) here if including L(less_vec) between
-	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
-	   L(stosb_more_2x_vec) target.  */
-	.p2align 4,, 10
-L(stosb_local):
-	movzbl	%sil, %eax
-	mov	%RDX_LP, %RCX_LP
-	mov	%RDI_LP, %RDX_LP
-	rep	stosb
-	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
-# endif
-#endif
-
 #if defined USE_MULTIARCH && IS_IN (libc)
 	.p2align 4
 L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@  L(return_vzeroupper):
 	ret
 #endif
 
-	.p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+	.p2align 4
+#else
+	.p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
 	mov	%RDI_LP, %RDX_LP
 	rep	stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+	/* Use xchg to save 1-byte (this helps align targets below).  */
+	xchg	%RDX_LP, %RAX_LP
+# else
 	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
 # endif
+	VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
 	/* Define L(less_vec) only if not otherwise defined.  */
-	.p2align 4
+	.p2align 4,, 12
 L(less_vec):
 	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
 	   xmm). This is only does anything for AVX2.  */
@@ -423,4 +425,35 @@  L(between_2_3):
 	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
 #endif
 	ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+	/* Force align so the loop doesn't cross a cache-line.  */
+	.p2align 4
+# endif
+	.p2align 4,, 7
+    /* Memset using non-temporal stores.  */
+L(nt_memset):
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+    /* Align DST.  */
+	orq	$(VEC_SIZE * 1 - 1), %rdi
+	incq	%rdi
+	.p2align 4,, 7
+L(nt_loop):
+	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
+	subq	$(VEC_SIZE * -4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(nt_loop)
+	sfence
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))