[v1] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.
Checks

Context                                          | Check   | Description
-------------------------------------------------|---------|------------------------------------------------
redhat-pt-bot/TryBot-apply_patch                 | success | Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit                       | success | Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm     | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm     | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 | success | Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 | warning | Patch failed to apply
Commit Message
On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision.
---
sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
Comments
On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On some machines we end up with incomplete cache information. This can
> make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> lower than intended (and lower than the prior value). So reintroduce
> the old bound as a lower bound to avoid potentially regressing code
> where we don't have complete information to make the decision.
> ---
> sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index c98fa57a7b..0436ffb349 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> the maximum thrashing capped at 1/associativity. */
> unsigned long int non_temporal_threshold
> = shared / cachesize_non_temporal_divisor;
> +
> + /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> + likely have incorrect/incomplete cache info in which case, default to
> + 3/4 * per-thread L3 to avoid regressions. */
> + unsigned long int non_temporal_threshold_lowbound
> + = shared_per_thread * 3 / 4;
> + if (non_temporal_threshold < non_temporal_threshold_lowbound)
> + non_temporal_threshold = non_temporal_threshold_lowbound;
> +
> /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> a higher risk of actually thrashing the cache as they don't have a HW LRU
> hint. As well, their performance in highly parallel situations is
> noticeably worse. */
> if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> - non_temporal_threshold = shared_per_thread * 3 / 4;
> + non_temporal_threshold = non_temporal_threshold_lowbound;
> /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> --
> 2.34.1
>
Sajan, can you test this out?
On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > [quoted patch snipped]
>
> Sajan, can you test this out?
ping @Sajan Karumanchi
release is quickly approaching and I assume you want this in.
*Noah,
On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > [quoted patch snipped]
> ping @Sajan Karumanchi
> release is quickly approaching and I assume you want this in.
I tested the above patch, which tries to revert the 'nt_threshold'
value for AMD CPUs to the actual value '3/4 of the shared cache'.
However, your comments on threads' typical share of cache size and the
non_temporal default ranges do not apply to AMD CPUs.
The assignment of 'shared' to 'shared_per_thread' doesn't make sense
for AMD Zen CPUs.
Maybe the simplest way to make this work for AMD CPUs is to configure
the default non-temporal threshold value to 3/4 of the shared cache.
I think the patch below would simplify the threshold computation for
both architectures.
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -866,7 +866,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
- shared_per_thread = shared;
level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
level1_icache_linesize = handle_amd (_SC_LEVEL1_ICACHE_LINESIZE);
@@ -906,6 +905,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
cpu_features->level3_cache_linesize = level3_cache_linesize;
cpu_features->level4_cache_size = level4_cache_size;
+
+ unsigned long int non_temporal_threshold = shared * 3 / 4;
+
+ if (cpu_features->basic.kind != arch_kind_amd)
+ {
unsigned long int cachesize_non_temporal_divisor
= cpu_features->cachesize_non_temporal_divisor;
if (cachesize_non_temporal_divisor <= 0)
@@ -913,7 +917,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
/* The default setting for the non_temporal threshold is [1/8, 1/2] of size
of the chip's cache (depending on `cachesize_non_temporal_divisor` which
- is microarch specific. The default is 1/4). For most Intel and AMD
+ is microarch specific. The default is 1/4). For most Intel
processors with an initial release date between 2017 and 2023, a thread's
typical share of the cache is from 18-64MB. Using a reasonable size
fraction of L3 is meant to estimate the point where non-temporal stores
@@ -923,7 +927,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
L3 cache being evicted by the copy are mostly alleviated by the fact that
modern HW detects streaming patterns and provides proper LRU hints so that
the maximum thrashing capped at 1/associativity. */
- unsigned long int non_temporal_threshold
+ non_temporal_threshold
= shared / cachesize_non_temporal_divisor;
/* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
a higher risk of actually thrashing the cache as they don't have a HW LRU
@@ -931,6 +935,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
noticeably worse. */
if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
non_temporal_threshold = shared_per_thread * 3 / 4;
+ }
+
Thanks,
Sajan K.
On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi
<sajan.karumanchi@gmail.com> wrote:
>
> *Noah,
> On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > [quoted thread snipped]
>
> I tested the above patch, which tries to revert the 'nt_threshold'
> value for AMD CPUs to the actual value '3/4 of the shared cache'.
How about just updating the comment?
> However, your comments on threads' typical share of cache size and the
> non_temporal default ranges do not apply to AMD CPUs.
> The assignment of 'shared' to 'shared_per_thread' doesn't make sense
> for AMD Zen CPUs.
`shared_per_thread` is equal to what `shared` was prior to the patch.
> Maybe the simplest way to work for AMD CPUs is to configure the
> default non-temporal threshold value to 3/4th of the shared cache.
> I think the below patch would simplify the threshold computation for
> both architectures.
I somewhat prefer using 3/4*shared_per_thread as a lowbound as
a catchall for misconfigurations.
>
> > [quoted patch snipped]
sajan karumanchi writes:
> I tested the above patch, which tries to revert the 'nt_threshold'
> value for AMD CPUs to the actual value '3/4 of the shared cache'.
> However, your comments on threads' typical share of cache size and the
> non_temporal default ranges do not apply to AMD CPUs.
> The assignment of 'shared' to 'shared_per_thread' doesn't make sense
> for AMD Zen CPUs.
> Maybe the simplest way to work for AMD CPUs is to configure the
> default non-temporal threshold value to 3/4th of the shared cache.
> I think the below patch would simplify the threshold computation for
> both architectures.
Rather than put AMD-specific code in dl-cacheinfo.h, why not add another
field to cpu_features so that AMD can per-family-tune their cache
preferences in cpu-features.c, and set defaults there?
It seems to me that the threshold has three parts: a numerator and a
denominator (to avoid floating-point math), plus whether the cache size
depends on how many cores there are, which can be folded into the other two.
A field for cachesize_non_temporal_numerator allows for non-floating
point math, and cpu-features.c could adjust either the numerator or
denominator according to whether the cache is per-core or total. It
also means that all tuning happens in one place, instead of spreading it
around.
On Mon, Jul 17, 2023 at 4:25 PM DJ Delorie <dj@redhat.com> wrote:
>
>
> sajan karumanchi writes:
> > [quoted text snipped]
>
> Rather than put AMD-specific code in dl-cacheinfo.h, why not add another
> field to cpu_features so that AMD can per-family-tune their cache
> preferences in cpu-features.c, and set defaults there?
>
> It seems to me that the threshold has three parts: a numerator and a
> denominator (to avoid floating-point math), plus whether the cache size
> depends on how many cores there are, which can be folded into the other two.
>
> A field for cachesize_non_temporal_numerator allows for non-floating
> point math, and cpu-features.c could adjust either the numerator or
> denominator according to whether the cache is per-core or total. It
> also means that all tuning happens in one place, instead of spreading it
> around.
>
I think the issue is we don't have the same pipeline for computing
all the info for each family.
So for AMD the denominator would essentially need to be
4*threads-per-socket, but AFAIK there's nowhere earlier to set that.
It might be something we can revisit/clean up in the future, but I think
this is important for 2.38, so I would like to get a baseline fix in.
* Noah
On Tue, Jul 18, 2023 at 1:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> > [quoted thread snipped]
> > I tested the above patch, which tries to revert the 'nt_threshold'
> > value for AMD CPUs to the actual value '3/4 of the shared cache'.
> How about just updating the comment?
>
> > However, your comments on threads' typical share of cache size and the
> > non_temporal default ranges do not apply to AMD CPUs.
> > The assignment of 'shared' to 'shared_per_thread' doesn't make sense
> > for AMD Zen CPUs.
> `shared_per_thread` is equal to what `shared` was prior to the patch.
>
This 'shared_per_thread' was introduced in this series of patches.
I suggest removing 'shared_per_thread' to avoid misinterpreting the
shared cache per CCX as a per-thread share.
> > Maybe the simplest way to work for AMD CPUs is to configure the
> > default non-temporal threshold value to 3/4th of the shared cache.
> > I think the below patch would simplify the threshold computation for
> > both architectures.
>
> I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> a catchall for misconfiguration.
I think you can use the existing 'minimum_non_temporal_threshold' to
handle the misconfigurations.
> > [quoted patch snipped]
On Tue, Jul 18, 2023 at 8:35 AM sajan karumanchi
<sajan.karumanchi@gmail.com> wrote:
>
> * Noah
> On Tue, Jul 18, 2023 at 1:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > [quoted thread snipped]
> > > I tested the above patch, which tries to revert the 'nt_threshold'
> > > value for AMD CPUs to the actual value '3/4 of the shared cache'.
> > How about just updating the comment?
> >
> > > However, your comments on threads' typical share of cache size and the
> > > non_temporal default ranges do not apply to AMD CPUs.
> > > The assignment of 'shared' to 'shared_per_thread' doesn't make sense
> > > for AMD Zen CPUs.
> > `shared_per_thread` is equal to what `shared` was prior to the patch.
> >
> This 'shared_per_thread' was introduced in this series of patches.
> I suggest removing 'shared_per_thread' to avoid misinterpreting the
> shared cache per CCX as a per-thread share.
Agreed, I already split it, just hadn't updated this.
But it's fixed.
>
> > > Maybe the simplest way to work for AMD CPUs is to configure the
> > > default non-temporal threshold value to 3/4th of the shared cache.
> > > I think the below patch would simplify the threshold computation for
> > > both architectures.
> >
> > I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> > a catchall for misconfiguration.
> I think you can use the existing 'minimum_non_temporal_threshold' to
> handle the misconfigurations.
>
That's a correctness bound.
This is more of a catchall sanity check that "we are making the
decision with correct configuration knowledge."
> > > [quoted patch snipped]
On Tue, Jul 18, 2023 at 10:27 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Jul 18, 2023 at 8:35 AM sajan karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > * Noah
> > On Tue, Jul 18, 2023 at 1:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi
> > > <sajan.karumanchi@gmail.com> wrote:
> > > >
> > > > *Noah,
> > > > On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > > >
> > > > > > > On some machines we end up with incomplete cache information. This can
> > > > > > > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > > > > > > lower than intended (and lower than the prior value). So reintroduce
> > > > > > > the old bound as a lower bound to avoid potentially regressing code
> > > > > > > where we don't have complete information to make the decision.
> > > > > > > ---
> > > > > > > sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
> > > > > > > 1 file changed, 10 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > > > > > > index c98fa57a7b..0436ffb349 100644
> > > > > > > --- a/sysdeps/x86/dl-cacheinfo.h
> > > > > > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > > > > > @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > > > > > the maximum thrashing capped at 1/associativity. */
> > > > > > > unsigned long int non_temporal_threshold
> > > > > > > = shared / cachesize_non_temporal_divisor;
> > > > > > > +
> > > > > > > + /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> > > > > > > + likely have incorrect/incomplete cache info in which case, default to
> > > > > > > + 3/4 * per-thread L3 to avoid regressions. */
> > > > > > > + unsigned long int non_temporal_threshold_lowbound
> > > > > > > + = shared_per_thread * 3 / 4;
> > > > > > > + if (non_temporal_threshold < non_temporal_threshold_lowbound)
> > > > > > > + non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > > > > +
> > > > > > > /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > > > > > > a higher risk of actually thrashing the cache as they don't have a HW LRU
> > > > > > > hint. As well, their performance in highly parallel situations is
> > > > > > > noticeably worse. */
> > > > > > > if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > > > > > > - non_temporal_threshold = shared_per_thread * 3 / 4;
> > > > > > > + non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > > > > /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > > > > > > 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > > > > > > if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > > > > > > --
> > > > > > > 2.34.1
> > > > > > >
> > > > > >
> > > > > > Sajan, can you test this out?
> > > > > ping @Sajan Karumanchi
> > > > > release is quickly approaching and I assume you want this in.
> > > >
> > > > I tested the above patch, which tries to revert the 'nt_threshold'
> > > > value for AMD CPUs to the actual value '3/4 of the shared cache'.
> > > How about just updating the comment?
> > >
> > > > However, your comments on threads' typical share of cache size and the
> > non_temporal default ranges do not apply to AMD CPUs.
> > The assignment of 'shared' to 'shared_per_thread' doesn't make sense
> > > > for AMD Zen CPUs.
> > > `shared_per_thread` is equal to what `shared` was prior to the patch.
> > >
> > This 'shared_per_thread' was introduced under this series of patches.
> > I suggest removing 'shared_per_thread' to avoid misinterpreting the
> > cache shared per CCX as a per-thread share.
>
> Agreed, I already split it, just hadn't updated this yet.
> But fixed.
> >
> > > > Maybe the simplest way to work for AMD CPUs is to configure the
> > > > default non-temporal threshold value to 3/4th of the shared cache.
> > > > I think the below patch would simplify the threshold computation for
> > > > both architectures.
> > >
> > > I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> > > a catchall for misconfiguration.
> > I think you can use the existing 'minimum_non_temporal_threshold' to
> > handle the misconfigurations.
> >
> That's a correctness bound.
> This is more of a catchall sanity check that we are making the
> decision with correct configuration knowledge.
>
The main point being: Other x86 systems might be affected, not just
AMD.
> > > >
> > > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > > @@ -866,7 +866,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > > data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
> > > > core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
> > > > shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> > > > - shared_per_thread = shared;
> > > >
> > > > level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
> > > > level1_icache_linesize = handle_amd (_SC_LEVEL1_ICACHE_LINESIZE);
> > > > @@ -906,6 +905,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > > cpu_features->level3_cache_linesize = level3_cache_linesize;
> > > > cpu_features->level4_cache_size = level4_cache_size;
> > > >
> > > > +
> > > > + unsigned long int non_temporal_threshold = shared * 3 / 4;
> > > > +
> > > > + if (cpu_features->basic.kind != arch_kind_amd)
> > > > + {
> > > > unsigned long int cachesize_non_temporal_divisor
> > > > = cpu_features->cachesize_non_temporal_divisor;
> > > > if (cachesize_non_temporal_divisor <= 0)
> > > > @@ -913,7 +917,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > >
> > > > /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
> > > > of the chip's cache (depending on `cachesize_non_temporal_divisor` which
> > > > - is microarch specific. The default is 1/4). For most Intel and AMD
> > > > + is microarch specific. The default is 1/4). For most Intel
> > > > processors with an initial release date between 2017 and 2023, a thread's
> > > > typical share of the cache is from 18-64MB. Using a reasonable size
> > > > fraction of L3 is meant to estimate the point where non-temporal stores
> > > > @@ -923,7 +927,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > > L3 cache being evicted by the copy are mostly alleviated by the fact that
> > > > modern HW detects streaming patterns and provides proper LRU hints so that
> > > > the maximum thrashing capped at 1/associativity. */
> > > > - unsigned long int non_temporal_threshold
> > > > + non_temporal_threshold
> > > > = shared / cachesize_non_temporal_divisor;
> > > > /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > > > a higher risk of actually thrashing the cache as they don't have a HW LRU
> > > > @@ -931,6 +935,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > > noticeably worse. */
> > > > if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > > > non_temporal_threshold = shared_per_thread * 3 / 4;
> > > > + }
> > > > +
> > > >
> > > > Thanks,
> > > > Sajan K.
@@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
the maximum thrashing capped at 1/associativity. */
unsigned long int non_temporal_threshold
= shared / cachesize_non_temporal_divisor;
+
+ /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
+ likely have incorrect/incomplete cache info in which case, default to
+ 3/4 * per-thread L3 to avoid regressions. */
+ unsigned long int non_temporal_threshold_lowbound
+ = shared_per_thread * 3 / 4;
+ if (non_temporal_threshold < non_temporal_threshold_lowbound)
+ non_temporal_threshold = non_temporal_threshold_lowbound;
+
/* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
a higher risk of actually thrashing the cache as they don't have a HW LRU
hint. As well, their performance in highly parallel situations is
noticeably worse. */
if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
- non_temporal_threshold = shared_per_thread * 3 / 4;
+ non_temporal_threshold = non_temporal_threshold_lowbound;
/* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
if that operation cannot overflow. Minimum of 0x4040 (16448) because the