[v2,1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Checks
Context | Check | Description
redhat-pt-bot/TryBot-apply_patch | success | Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 | success | Testing passed
Commit Message
Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets. See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data on different strategies for large memsets on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to rely only on `rep stosb` for large memsets. On the flip
side, non-temporal writes can avoid fetching data in their RFO
requests, saving memory bandwidth.
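
(As a rough illustration of the technique, here is a C sketch, using
SSE2 intrinsics, of the strategy the new assembly path uses: one
unaligned head store, align up, stream four vectors per iteration,
sfence, then four overlapping temporal stores for the tail. The
function name is made up and this is not the patch itself; the real
implementation is the hand-written assembly in the diff.

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only.  Assumes n is comfortably larger than
   4 * 16 bytes.  */
static void *
nt_memset_sketch (void *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);
  char *p = (char *) dst;
  char *tail = p + n - 4 * sizeof (__m128i);

  /* Unaligned temporal store for the head.  */
  _mm_storeu_si128 ((__m128i *) p, v);
  /* Align up to the vector size (the `orq`/`incq` trick in the asm).  */
  p = (char *) (((uintptr_t) p | (sizeof (__m128i) - 1)) + 1);

  while (p < tail)
    {
      /* Non-temporal stores: no RFO read of the old cache line.  */
      _mm_stream_si128 ((__m128i *) (p + 0), v);
      _mm_stream_si128 ((__m128i *) (p + 16), v);
      _mm_stream_si128 ((__m128i *) (p + 32), v);
      _mm_stream_si128 ((__m128i *) (p + 48), v);
      p += 64;
    }
  /* Make the weakly-ordered NT stores globally visible.  */
  _mm_sfence ();
  /* Last 4 vectors, possibly overlapping the loop's stores.  */
  _mm_storeu_si128 ((__m128i *) (tail + 0), v);
  _mm_storeu_si128 ((__m128i *) (tail + 16), v);
  _mm_storeu_si128 ((__m128i *) (tail + 32), v);
  _mm_storeu_si128 ((__m128i *) (tail + 48), v);
  return dst;
}
)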
All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.
The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:
Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979
Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.
The results on the memset-large benchmark suite on TGL-client for N=20
runs:
Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924
So roughly a 7.5% speedup (a New/Old geometric mean of ~0.925 means the
new version takes ~7.5% less time). This is lower than what we see on
servers (likely because clients typically have faster single-core
bandwidth, so saving bandwidth on RFOs is less impactful), but still
advantageous.
Full test-suite passes on x86_64 w/ and w/o multiarch.
---
.../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
1 file changed, 91 insertions(+), 58 deletions(-)
Comments
On Fri, May 24, 2024 at 10:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> [...]
> ---
> .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> 1 file changed, 91 insertions(+), 58 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
> 2. If size is less than VEC, use integer register stores.
> 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> - 5. On machines ERMS feature, if size is greater or equal than
> - __x86_rep_stosb_threshold then REP STOSB will be used.
> - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> - 4 VEC stores and store 4 * VEC at a time until done. */
> + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> + 4 VEC stores and store 4 * VEC at a time until done.
> + 6. On machines ERMS feature, if size is range
> + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> + then REP STOSB will be used.
> + 7. If size >= __x86_shared_non_temporal_threshold, use a
> + non-temporal stores. */
>
> #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> VMOVU %VMM(0), (%rdi)
> VZEROUPPER_RETURN
> +
> + /* If have AVX512 mask instructions put L(less_vec) close to
> + entry as it doesn't take much space and is likely a hot target. */
> +#ifdef USE_LESS_VEC_MASK_STORE
> + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> + .p2align 6,, 47
> + .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> + /* Less than 1 VEC. */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +# error Unsupported VEC_SIZE!
> +# endif
> + /* Clear high bits from edi. Only keeping bits relevant to page
> + cross check. Note that we are using rax which is set in
> + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> + andl $(PAGE_SIZE - 1), %edi
> + /* Check if VEC_SIZE store cross page. Mask stores suffer
> + serious performance degradation when it has to fault suppress. */
> + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> + /* This is generally considered a cold target. */
> + ja L(cross_page)
> +# if VEC_SIZE > 32
> + movq $-1, %rcx
> + bzhiq %rdx, %rcx, %rcx
> + kmovq %rcx, %k1
> +# else
> + movl $-1, %ecx
> + bzhil %edx, %ecx, %ecx
> + kmovd %ecx, %k1
> +# endif
> + vmovdqu8 %VMM(0), (%rax){%k1}
> + VZEROUPPER_RETURN
> +#endif
> +
> #if defined USE_MULTIARCH && IS_IN (libc)
> END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
> #endif
> VZEROUPPER_RETURN
>
> - /* If have AVX512 mask instructions put L(less_vec) close to
> - entry as it doesn't take much space and is likely a hot target.
> - */
> -#ifdef USE_LESS_VEC_MASK_STORE
> - .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> - /* Less than 1 VEC. */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -# error Unsupported VEC_SIZE!
> -# endif
> - /* Clear high bits from edi. Only keeping bits relevant to page
> - cross check. Note that we are using rax which is set in
> - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> - andl $(PAGE_SIZE - 1), %edi
> - /* Check if VEC_SIZE store cross page. Mask stores suffer
> - serious performance degradation when it has to fault suppress.
> - */
> - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> - /* This is generally considered a cold target. */
> - ja L(cross_page)
> -# if VEC_SIZE > 32
> - movq $-1, %rcx
> - bzhiq %rdx, %rcx, %rcx
> - kmovq %rcx, %k1
> -# else
> - movl $-1, %ecx
> - bzhil %edx, %ecx, %ecx
> - kmovd %ecx, %k1
> -# endif
> - vmovdqu8 %VMM(0), (%rax){%k1}
> - VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> - /* Include L(stosb_local) here if including L(less_vec) between
> - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> - L(stosb_more_2x_vec) target. */
> - .p2align 4,, 10
> -L(stosb_local):
> - movzbl %sil, %eax
> - mov %RDX_LP, %RCX_LP
> - mov %RDI_LP, %RDX_LP
> - rep stosb
> - mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> -# endif
> -#endif
> -
> #if defined USE_MULTIARCH && IS_IN (libc)
> .p2align 4
> L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
> ret
> #endif
>
> - .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> + .p2align 4
> +#else
> + .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> range for 2-byte jump encoding. */
> L(stosb_local):
> + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + jae L(nt_memset)
> movzbl %sil, %eax
> mov %RDX_LP, %RCX_LP
> mov %RDI_LP, %RDX_LP
> rep stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> + /* Use xchg to save 1-byte (this helps align targets below). */
> + xchg %RDX_LP, %RAX_LP
> +# else
> mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> # endif
> + VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
> /* Define L(less_vec) only if not otherwise defined. */
> - .p2align 4
> + .p2align 4,, 12
> L(less_vec):
> /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> xmm). This is only does anything for AVX2. */
> @@ -423,4 +425,35 @@ L(between_2_3):
> movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> #endif
> ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> + /* Force align so the loop doesn't cross a cache-line. */
> + .p2align 4
> +# endif
> + .p2align 4,, 7
> + /* Memset using non-temporal stores. */
> +L(nt_memset):
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> + /* Align DST. */
> + orq $(VEC_SIZE * 1 - 1), %rdi
> + incq %rdi
> + .p2align 4,, 7
> +L(nt_loop):
> + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> + subq $(VEC_SIZE * -4), %rdi
> + cmpq %rdx, %rdi
> + jb L(nt_loop)
> + sfence
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> + VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>
LGTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Thanks.
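
(As a reading aid, the size-based dispatch that the patch's updated
block comment describes can be paraphrased in C as below. The two
threshold parameters correspond to glibc's __x86_rep_stosb_threshold
and __x86_shared_non_temporal_threshold; the enum and function are
schematic stand-ins, not actual glibc code.

#include <stddef.h>

enum memset_strategy
{
  SMALL_STORES,   /* steps 1-2: mask or integer stores */
  VEC_2X_OR_4X,   /* steps 3-4: 2 or 4 vector stores */
  VEC_LOOP,       /* step 5: aligned 4x-vector store loop */
  REP_STOSB,      /* step 6: ERMS */
  NT_STORES       /* step 7: non-temporal store loop */
};

static enum memset_strategy
pick_memset_strategy (size_t n, size_t vec_size,
                      size_t rep_stosb_threshold,
                      size_t non_temporal_threshold)
{
  if (n < vec_size)
    return SMALL_STORES;
  if (n <= 4 * vec_size)
    return VEC_2X_OR_4X;
  if (n < rep_stosb_threshold)
    return VEC_LOOP;
  if (n < non_temporal_threshold)
    return REP_STOSB;
  return NT_STORES;
}
)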
Noah Goldstein <goldstein.w.n@gmail.com> writes:
> [...]
I'm actually working on RHEL-29312, which you reference in the subject
(thanks for the reference, but please don't reference downstream tickets
in upstream patches). That ticket showed a 50% slowdown in memset
on the Xeon 4215. I've reproduced that slowdown locally, but this
patch doesn't seem to have any effect on it (current git glibc still has
the slow code). I can get the faster results with tunables[1]. Will
there be further tuning of the default thresholds? Or do we expect
users to use tunables to get around these regressions, for the specific
chips that are not optimized?
[1] specifically, glibc.cpu.x86_rep_stosb_threshold=0x1000000000 - none
of the others have any effect.
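
(For anyone trying to reproduce this: tunables of this kind are set
through the GLIBC_TUNABLES environment variable, e.g.

    GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=0x1000000000 ./bench-memset-large

where the benchmark binary name is just a placeholder.)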
On Fri, Jul 12, 2024 at 4:56 AM DJ Delorie <dj@redhat.com> wrote:
> [...]
Okay, I messed up. The SKX machine I used for the SKX benchmarks was
mislabeled, and was in fact an ICX machine.
I have some older SKX/BWD data:
https://docs.google.com/spreadsheets/d/1rcRIWpU8Tcq0FMH9iGPBUViPvioUmHtA9qx3h-UzPa0/edit?usp=sharing
and agree we need to increase the stosb threshold on SKX. I didn't see
the huge (100MB) threshold you saw there, but am re-running benchmarks.
An unfortunate note: BWD seems to prefer ERMS in the same range where
SKX doesn't, so we can't just make the decision based on the feature.
On the bright side, it does look like the non-temporal decision (this patch)
still makes sense on SKX.
I am re-running benchmarks now and will see what we can do about raising
the stosb threshold on SKX.
I am deeply sorry for the confusion I have created here.
DJ Delorie <dj@redhat.com> writes:
> [...]
> I'm actually working on RHEL-29312 that you reference in the subject
> (thanks for the reference, but please don't reference downstream tickets
> in upstream patches).
Can I be annoying and dissent here? I don't like the idea of throwing
away provenance. If you're working on it because of something you can
reference, you can and you should.
What you should also do is file a glibc bug with the relevant
information extracted if someone is likely to want to read up on it, so
they don't have to make an account elsewhere or something.
But I don't think we should chuck away information when we know it is
good. It might well be important when bisecting or if a revert is
needed way down the line.
> [...]
thanks,
sam
> > I'm actually working on RHEL-29312 that you reference in the subject
> > (thanks for the reference, but please don't reference downstream tickets
> > in upstream patches).
>
> Can I be annoying and dissent here? I don't like the idea of throwing
> away provenance. If you're working on it because of something you can
> reference, you can and you should.
>
(It's even useful if the downstream bug tracker is non-public (the horror),
since then you at least know there's someone you might be able to ask
about it. :o) )
> [...]
>
> thanks,
> sam
On 7/12/24 3:12 AM, Sam James wrote:
> DJ Delorie <dj@redhat.com> writes:
>
>> [...]
>>
>> I'm actually working on RHEL-29312 that you reference in the subject
>> (thanks for the reference, but please don't reference downstream tickets
>> in upstream patches).
>
> Can I be annoying and dissent here? I don't like the idea of throwing
> away provenance. If you're working on it because of something you can
> reference, you can and you should.
Always :-)
I agree we should not lose the downstream ticket reference.
> What you should also do is file a glibc bug with the relevant
> information extracted if someone is likely to want to read up on it, so
> they don't have to make an account elsehwere or something.
Exactly.
> But I don't think chucking away information when we know it is good. It
> might well be important when bisecting or if a revert is needed way down
> the line.
Yes, in our own tracker we should provide links to all the downstream
tickets. I would love this in general, because then you can search for
which tickets are impacting more of the distributions.
Noah Goldstein <goldstein.w.n@gmail.com> writes:
> I didn't see the huge (100MB threshold) you saw there, but am
That wasn't a calculated threshold; it was a "way more than the
benchmarks were using" threshold, i.e. enough to just enable/disable
the various code paths and show a difference.
I didn't feel like calculating SIZE_MAX/16 ;-)
Sam James <sam@gentoo.org> writes:
> What you should also do is file a glibc bug with the relevant
> information extracted if someone is likely to want to read up on it, so
> they don't have to make an account elsehwere or something.
Sorry, that was my intent. I didn't mean to throw the information away,
I meant to store it in our own bug tracking system and have the commit
reference that. Cross-referencing upstream and downstream bugs is OK
too. I worry that a downstream tracker might make the bug inaccessible
to the public, and we'd lose the information anyway.
On Fri, Jul 12, 2024 at 3:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> [...]
I've re-run some benchmarks and updated the Google sheet:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
See SKX/SKX-2.
Regarding ERMS:
What I am measuring so far indicates we want temporal writes
when the majority of the set fits in L1. Past that ERMS becomes better,
especially as thread count increases.
Regarding NT Stores:
non-temporal stores don't look good in these benchmarks (particularly SKX-2).
I am running some additional tests to see exactly what is going on.
I am going to post a patch to disable non-temporal stores on Skylake.
If that is considered too risky given the 2.40 release, I think we
should revert this patch (and the follow-up AMD ones) to avoid a
regression on SKX and try to get them into 2.41 instead with proper
tuning.
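
(Schematically, the tuning direction described above amounts to making
the ERMS window a function of cache sizes rather than fixed constants.
A hypothetical C sketch, with placeholder names and inputs, not
glibc's actual tuning code:

#include <stddef.h>

struct memset_thresholds
{
  size_t rep_stosb;     /* start using ERMS at this size */
  size_t non_temporal;  /* switch to NT stores at this size */
};

/* Hypothetical: temporal vector stores while the set mostly fits in
   L1, rep stosb from L1 size up to a shared-cache-derived bound, and
   non-temporal stores past that.  */
static struct memset_thresholds
pick_thresholds (size_t l1_size, size_t shared_cache_size)
{
  struct memset_thresholds t;
  t.rep_stosb = l1_size;
  t.non_temporal = shared_cache_size;
  return t;
}
)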
On Mon, Jul 15, 2024, 3:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> [...]
> I am going to post a patch to disable non-temporal stores on Skylake.
>
Skylake clients have very different NT behaviors.
Please make sure that NT isn't disabled
on clients.
> [...]
H.J.
On Mon, Jul 15, 2024 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> [...]
>
> Skylake clients have very different NT behaviors.
> Please make sure that NT isn't disabled
> on clients.
I guess my thinking right now is it's better to ensure we don't introduce a
new regression. The NT store code is new to 2.40, so disabling it won't
cause any regressions.
On Mon, Jul 15, 2024 at 3:48 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> [...]
> > Skylake clients have very different NT behaviors.
> > Please make sure that NT isn't disabled
> > on clients.
>
> I guess my thinking right now is it's better to ensure we don't introduce a
> new regression. The NT store code is new to 2.40, so disabling it won't
> cause any regressions.
> [...]
Patch posted.
I disabled it for SKX only (I'm convinced enough that it's a function
of SKX's unique memory system).