[v1] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.

Message ID 20230714151459.3357038-1-goldstein.w.n@gmail.com
State Superseded
Series [v1] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 warning Patch failed to apply

Commit Message

Noah Goldstein July 14, 2023, 3:14 p.m. UTC
  On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision.
---
 sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
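
As a rough standalone illustration of the failure mode and the new lower
bound (the cache sizes below are hypothetical, not values detected on any
particular machine):

/* Sketch of the lower-bound logic this patch adds; sizes are made up.  */
#include <stdio.h>

int
main (void)
{
  /* Suppose the detected L3 is 16 MiB and, as on the AMD path,
     shared_per_thread was set equal to shared.  */
  unsigned long int shared = 16UL * 1024 * 1024;
  unsigned long int shared_per_thread = 16UL * 1024 * 1024;
  unsigned long int cachesize_non_temporal_divisor = 4;

  /* Divisor-based computation: 4 MiB, well below the prior value.  */
  unsigned long int non_temporal_threshold
      = shared / cachesize_non_temporal_divisor;

  /* Reintroduced lower bound: 3/4 * per-thread L3, i.e. 12 MiB here.  */
  unsigned long int non_temporal_threshold_lowbound
      = shared_per_thread * 3 / 4;
  if (non_temporal_threshold < non_temporal_threshold_lowbound)
    non_temporal_threshold = non_temporal_threshold_lowbound;

  printf ("non_temporal_threshold = %lu bytes\n", non_temporal_threshold);
  return 0;
}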
  

Comments

Noah Goldstein July 14, 2023, 3:15 p.m. UTC | #1
On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On some machines we end up with incomplete cache information. This can
> make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> lower than intended (and lower than the prior value). So reintroduce
> the old bound as a lower bound to avoid potentially regressing code
> where we don't have complete information to make the decision.
> ---
>  sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index c98fa57a7b..0436ffb349 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>       the maximum thrashing capped at 1/associativity. */
>    unsigned long int non_temporal_threshold
>        = shared / cachesize_non_temporal_divisor;
> +
> +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> +     likely have incorrect/incomplete cache info in which case, default to
> +     3/4 * per-thread L3 to avoid regressions.  */
> +  unsigned long int non_temporal_threshold_lowbound
> +      = shared_per_thread * 3 / 4;
> +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> +    non_temporal_threshold = non_temporal_threshold_lowbound;
> +
>    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
>       a higher risk of actually thrashing the cache as they don't have a HW LRU
>       hint. As well, their performance in highly parallel situations is
>       noticeably worse.  */
>    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> -    non_temporal_threshold = shared_per_thread * 3 / 4;
> +    non_temporal_threshold = non_temporal_threshold_lowbound;
>    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
>       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
>       if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> --
> 2.34.1
>

Sajan, can you test this out?
  
Noah Goldstein July 17, 2023, 4:39 p.m. UTC | #2
On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On some machines we end up with incomplete cache information. This can
> > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > lower than intended (and lower than the prior value). So reintroduce
> > the old bound as a lower bound to avoid potentially regressing code
> > where we don't have complete information to make the decision.
> > ---
> >  sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index c98fa57a7b..0436ffb349 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >       the maximum thrashing capped at 1/associativity. */
> >    unsigned long int non_temporal_threshold
> >        = shared / cachesize_non_temporal_divisor;
> > +
> > +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> > +     likely have incorrect/incomplete cache info in which case, default to
> > +     3/4 * per-thread L3 to avoid regressions.  */
> > +  unsigned long int non_temporal_threshold_lowbound
> > +      = shared_per_thread * 3 / 4;
> > +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > +
> >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> >       hint. As well, their performance in highly parallel situations is
> >       noticeably worse.  */
> >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > -    non_temporal_threshold = shared_per_thread * 3 / 4;
> > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> >    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> >       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> >       if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > --
> > 2.34.1
> >
>
> Sajan, can you test this out?
ping @Sajan Karumanchi
release is quickly approaching and I assume you want this in.
  
Sajan Karumanchi July 17, 2023, 6:42 p.m. UTC | #3
*Noah,
On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On some machines we end up with incomplete cache information. This can
> > > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > > lower than intended (and lower than the prior value). So reintroduce
> > > the old bound as a lower bound to avoid potentially regressing code
> > > where we don't have complete information to make the decision.
> > > ---
> > >  sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
> > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > > index c98fa57a7b..0436ffb349 100644
> > > --- a/sysdeps/x86/dl-cacheinfo.h
> > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >       the maximum thrashing capped at 1/associativity. */
> > >    unsigned long int non_temporal_threshold
> > >        = shared / cachesize_non_temporal_divisor;
> > > +
> > > +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> > > +     likely have incorrect/incomplete cache info in which case, default to
> > > +     3/4 * per-thread L3 to avoid regressions.  */
> > > +  unsigned long int non_temporal_threshold_lowbound
> > > +      = shared_per_thread * 3 / 4;
> > > +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > +
> > >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> > >       hint. As well, their performance in highly parallel situations is
> > >       noticeably worse.  */
> > >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > > -    non_temporal_threshold = shared_per_thread * 3 / 4;
> > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > >    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > >       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > >       if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > > --
> > > 2.34.1
> > >
> >
> > Sajan, can you test this out?
> ping @Sajan Karumanchi
> release is quickly approaching and I assume you want this in.

I tested the above patch, which effectively reverts the 'nt_threshold'
value for AMD CPUs to the prior value of 3/4 of the shared cache.
However, your comments on a thread's typical share of the cache and the
non_temporal default ranges do not apply to AMD CPUs.
The assignment of 'shared' to 'shared_per_thread' doesn't make sense
for AMD Zen CPUs.
Maybe the simplest way to make this work for AMD CPUs is to set the
default non-temporal threshold to 3/4 of the shared cache.
I think the patch below would simplify the threshold computation for
both architectures.

+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -866,7 +866,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
       core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
-      shared_per_thread = shared;

       level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
       level1_icache_linesize = handle_amd (_SC_LEVEL1_ICACHE_LINESIZE);
@@ -906,6 +905,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->level3_cache_linesize = level3_cache_linesize;
   cpu_features->level4_cache_size = level4_cache_size;

+
+  unsigned long int non_temporal_threshold = shared * 3 / 4;
+
+  if (cpu_features->basic.kind != arch_kind_amd)
+  {
   unsigned long int cachesize_non_temporal_divisor
       = cpu_features->cachesize_non_temporal_divisor;
   if (cachesize_non_temporal_divisor <= 0)
@@ -913,7 +917,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)

   /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
      of the chip's cache (depending on `cachesize_non_temporal_divisor` which
-     is microarch specific. The default is 1/4). For most Intel and AMD
+     is microarch specific. The default is 1/4). For most Intel
      processors with an initial release date between 2017 and 2023, a thread's
      typical share of the cache is from 18-64MB. Using a reasonable size
      fraction of L3 is meant to estimate the point where non-temporal stores
@@ -923,7 +927,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
      L3 cache being evicted by the copy are mostly alleviated by the fact that
      modern HW detects streaming patterns and provides proper LRU hints so that
      the maximum thrashing capped at 1/associativity. */
-  unsigned long int non_temporal_threshold
+  non_temporal_threshold
       = shared / cachesize_non_temporal_divisor;
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
@@ -931,6 +935,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
      noticeably worse.  */
   if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
     non_temporal_threshold = shared_per_thread * 3 / 4;
+  }
+

Thanks,
Sajan K.
  
Noah Goldstein July 17, 2023, 8:07 p.m. UTC | #4
On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi
<sajan.karumanchi@gmail.com> wrote:
>
> *Noah,
> On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > On some machines we end up with incomplete cache information. This can
> > > > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > > > lower than intended (and lower than the prior value). So reintroduce
> > > > the old bound as a lower bound to avoid potentially regressing code
> > > > where we don't have complete information to make the decision.
> > > > ---
> > > >  sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
> > > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > > > index c98fa57a7b..0436ffb349 100644
> > > > --- a/sysdeps/x86/dl-cacheinfo.h
> > > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > > @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > >       the maximum thrashing capped at 1/associativity. */
> > > >    unsigned long int non_temporal_threshold
> > > >        = shared / cachesize_non_temporal_divisor;
> > > > +
> > > > +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> > > > +     likely have incorrect/incomplete cache info in which case, default to
> > > > +     3/4 * per-thread L3 to avoid regressions.  */
> > > > +  unsigned long int non_temporal_threshold_lowbound
> > > > +      = shared_per_thread * 3 / 4;
> > > > +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> > > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > +
> > > >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > > >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> > > >       hint. As well, their performance in highly parallel situations is
> > > >       noticeably worse.  */
> > > >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > > > -    non_temporal_threshold = shared_per_thread * 3 / 4;
> > > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > >    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > > >       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > > >       if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > > > --
> > > > 2.34.1
> > > >
> > >
> > > Sajan, can you test this out?
> > ping @Sajan Karumanchi
> > release is quickly approaching and I assume you want this in.
>
> I tested the above patch, which tries to revert the 'nt_threshold'
> value for AMD CPUs to the actual value '3/4 of the shared cache'.
How about just updating the comment?

> However, your comments on threads' typical share of cache size and the
> non_temproral default ranges do not apply to AMD CPUs.
> The assignment of 'shared' to 'shared_pre_thread' doesn't make sense
> for AMD Zen CPUs.
`shared_per_thread` is equal to what `shared` was prior to the patch.

> Maybe the simplest way to work for AMD CPUs is to configure the
> default non-temporal threshold value to 3/4th of the shared cache.
> I think the below patch would simplify the threshold computation for
> both architectures.

I somewhat prefer using 3/4*shared_per_thread as a lowbound as
a catchall for misconfigurations.
>
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -866,7 +866,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
>        core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
>        shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> -      shared_per_thread = shared;
>
>        level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
>        level1_icache_linesize = handle_amd (_SC_LEVEL1_ICACHE_LINESIZE);
> @@ -906,6 +905,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    cpu_features->level3_cache_linesize = level3_cache_linesize;
>    cpu_features->level4_cache_size = level4_cache_size;
>
> +
> +  unsigned long int non_temporal_threshold = shared * 3 / 4;
> +
> +  if (cpu_features->basic.kind != arch_kind_amd)
> +  {
>    unsigned long int cachesize_non_temporal_divisor
>        = cpu_features->cachesize_non_temporal_divisor;
>    if (cachesize_non_temporal_divisor <= 0)
> @@ -913,7 +917,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
>    /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
>       of the chip's cache (depending on `cachesize_non_temporal_divisor` which
> -     is microarch specific. The default is 1/4). For most Intel and AMD
> +     is microarch specific. The default is 1/4). For most Intel
>       processors with an initial release date between 2017 and 2023, a thread's
>       typical share of the cache is from 18-64MB. Using a reasonable size
>       fraction of L3 is meant to estimate the point where non-temporal stores
> @@ -923,7 +927,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>       L3 cache being evicted by the copy are mostly alleviated by the fact that
>       modern HW detects streaming patterns and provides proper LRU hints so that
>       the maximum thrashing capped at 1/associativity. */
> -  unsigned long int non_temporal_threshold
> +  non_temporal_threshold
>        = shared / cachesize_non_temporal_divisor;
>    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
>       a higher risk of actually thrashing the cache as they don't have a HW LRU
> @@ -931,6 +935,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>       noticeably worse.  */
>    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
>      non_temporal_threshold = shared_per_thread * 3 / 4;
> +  }
> +
>
> Thanks,
> Sajan K.
  
DJ Delorie July 17, 2023, 9:25 p.m. UTC | #5
sajan karumanchi writes:
> I tested the above patch, which tries to revert the 'nt_threshold'
> value for AMD CPUs to the actual value '3/4 of the shared cache'.
> However, your comments on threads' typical share of cache size and the
> non_temproral default ranges do not apply to AMD CPUs.
> The assignment of 'shared' to 'shared_pre_thread' doesn't make sense
> for AMD Zen CPUs.
> Maybe the simplest way to work for AMD CPUs is to configure the
> default non-temporal threshold value to 3/4th of the shared cache.
> I think the below patch would simplify the threshold computation for
> both architectures.

Rather than put AMD-specific code in dl-cacheinfo.h, why not add another
field to cpu_features so that AMD can per-family-tune their cache
preferences in cpu-features.c, and set defaults there?

It seems to me that the threshold has three parts: a numerator and a
denominator (to avoid floating-point math), and whether the cache size
depends on how many cores there are, which can be folded into the other two.

A field for cachesize_non_temporal_numerator allows for non-floating
point math, and cpu-features.c could adjust either the numerator or
denominator according to whether the cache is per-core or total.  It
also means that all tuning happens in one place, instead of spreading it
around.
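
A minimal sketch of that idea (the struct and helper names below are
hypothetical, not existing glibc code; only cachesize_non_temporal_divisor
exists today):

/* Hypothetical sketch: both factors live in cpu_features and are set per
   family in cpu-features.c, which can also account for whether the
   reported cache size is per-core or total.  Names are illustrative.  */
struct nt_tuning
{
  unsigned long int cachesize_non_temporal_numerator;   /* e.g. 3.  */
  unsigned long int cachesize_non_temporal_divisor;     /* e.g. 4.  */
};

static unsigned long int
nt_threshold_from_tuning (unsigned long int shared, const struct nt_tuning *t)
{
  /* Integer math only; divide first so the intermediate cannot overflow.  */
  return shared / t->cachesize_non_temporal_divisor
         * t->cachesize_non_temporal_numerator;
}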
  
Noah Goldstein July 17, 2023, 9:40 p.m. UTC | #6
On Mon, Jul 17, 2023 at 4:25 PM DJ Delorie <dj@redhat.com> wrote:
>
>
> sajan karumanchi writes:
> > I tested the above patch, which tries to revert the 'nt_threshold'
> > value for AMD CPUs to the actual value '3/4 of the shared cache'.
> > However, your comments on threads' typical share of cache size and the
> > non_temproral default ranges do not apply to AMD CPUs.
> > The assignment of 'shared' to 'shared_pre_thread' doesn't make sense
> > for AMD Zen CPUs.
> > Maybe the simplest way to work for AMD CPUs is to configure the
> > default non-temporal threshold value to 3/4th of the shared cache.
> > I think the below patch would simplify the threshold computation for
> > both architectures.
>
> Rather than put AMD-specific code in dl-cacheinfo.h, why not add another
> field to cpu_features so that AMD can per-family-tune their cache
> preferences in cpu-features.c, and set defaults there?
>
> It seems to me that the threshold has three parts: a numerator (to avoid
> floating point math) and denominator, and whether the memory depends on
> how many cores there are, which can be applied to the other two.
>
> A field for cachesize_non_temporal_numerator allows for non-floating
> point math, and cpu-features.c could adjust either the numerator or
> denominator according to whether the cache is per-core or total.  It
> also means that all tuning happens in one place, instead of spreading it
> around.
>
I think the issue is we don't have the same pipeline for computing
all the info for each family.

So for AMD we need the denominator to essentially be 4*threads-per-socket,
but AFAIK there's nowhere earlier to set that.

This might be something we can revisit/clean up in the future, but I think
this is important for 2.38, so I would like to get a baseline fix in.
  
Sajan Karumanchi July 18, 2023, 1:34 p.m. UTC | #7
* Noah
On Tue, Jul 18, 2023 at 1:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > *Noah,
> > On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > >
> > > > > On some machines we end up with incomplete cache information. This can
> > > > > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > > > > lower than intended (and lower than the prior value). So reintroduce
> > > > > the old bound as a lower bound to avoid potentially regressing code
> > > > > where we don't have complete information to make the decision.
> > > > > ---
> > > > >  sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
> > > > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > > > > index c98fa57a7b..0436ffb349 100644
> > > > > --- a/sysdeps/x86/dl-cacheinfo.h
> > > > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > > > @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > > >       the maximum thrashing capped at 1/associativity. */
> > > > >    unsigned long int non_temporal_threshold
> > > > >        = shared / cachesize_non_temporal_divisor;
> > > > > +
> > > > > +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> > > > > +     likely have incorrect/incomplete cache info in which case, default to
> > > > > +     3/4 * per-thread L3 to avoid regressions.  */
> > > > > +  unsigned long int non_temporal_threshold_lowbound
> > > > > +      = shared_per_thread * 3 / 4;
> > > > > +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> > > > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > > +
> > > > >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > > > >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> > > > >       hint. As well, their performance in highly parallel situations is
> > > > >       noticeably worse.  */
> > > > >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > > > > -    non_temporal_threshold = shared_per_thread * 3 / 4;
> > > > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > >    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > > > >       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > > > >       if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > > > > --
> > > > > 2.34.1
> > > > >
> > > >
> > > > Sajan, can you test this out?
> > > ping @Sajan Karumanchi
> > > release is quickly approaching and I assume you want this in.
> >
> > I tested the above patch, which tries to revert the 'nt_threshold'
> > value for AMD CPUs to the actual value '3/4 of the shared cache'.
> How about just updating the comment?
>
> > However, your comments on threads' typical share of cache size and the
> > non_temproral default ranges do not apply to AMD CPUs.
> > The assignment of 'shared' to 'shared_pre_thread' doesn't make sense
> > for AMD Zen CPUs.
> `shared_per_thread` is equal to what `shared` was prior to the patch.
>
This 'shared_per_thread' was introduced in this series of patches.
I suggest removing 'shared_per_thread' to avoid misinterpreting the
per-CCX shared cache as a per-thread share.

> > Maybe the simplest way to work for AMD CPUs is to configure the
> > default non-temporal threshold value to 3/4th of the shared cache.
> > I think the below patch would simplify the threshold computation for
> > both architectures.
>
> I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> a catchall for misconfiguration.
I think you can use the existing 'minimum_non_temporal_threshold' to
handle the misconfigurations.

> >
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -866,7 +866,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >        data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
> >        core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
> >        shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> > -      shared_per_thread = shared;
> >
> >        level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
> >        level1_icache_linesize = handle_amd (_SC_LEVEL1_ICACHE_LINESIZE);
> > @@ -906,6 +905,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >    cpu_features->level3_cache_linesize = level3_cache_linesize;
> >    cpu_features->level4_cache_size = level4_cache_size;
> >
> > +
> > +  unsigned long int non_temporal_threshold = shared * 3 / 4;
> > +
> > +  if (cpu_features->basic.kind != arch_kind_amd)
> > +  {
> >    unsigned long int cachesize_non_temporal_divisor
> >        = cpu_features->cachesize_non_temporal_divisor;
> >    if (cachesize_non_temporal_divisor <= 0)
> > @@ -913,7 +917,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> >    /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
> >       of the chip's cache (depending on `cachesize_non_temporal_divisor` which
> > -     is microarch specific. The default is 1/4). For most Intel and AMD
> > +     is microarch specific. The default is 1/4). For most Intel
> >       processors with an initial release date between 2017 and 2023, a thread's
> >       typical share of the cache is from 18-64MB. Using a reasonable size
> >       fraction of L3 is meant to estimate the point where non-temporal stores
> > @@ -923,7 +927,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >       L3 cache being evicted by the copy are mostly alleviated by the fact that
> >       modern HW detects streaming patterns and provides proper LRU hints so that
> >       the maximum thrashing capped at 1/associativity. */
> > -  unsigned long int non_temporal_threshold
> > +  non_temporal_threshold
> >        = shared / cachesize_non_temporal_divisor;
> >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> > @@ -931,6 +935,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >       noticeably worse.  */
> >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> >      non_temporal_threshold = shared_per_thread * 3 / 4;
> > +  }
> > +
> >
> > Thanks,
> > Sajan K.
  
Noah Goldstein July 18, 2023, 3:27 p.m. UTC | #8
On Tue, Jul 18, 2023 at 8:35 AM sajan karumanchi
<sajan.karumanchi@gmail.com> wrote:
>
> * Noah
> On Tue, Jul 18, 2023 at 1:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi
> > <sajan.karumanchi@gmail.com> wrote:
> > >
> > > *Noah,
> > > On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > >
> > > > > > On some machines we end up with incomplete cache information. This can
> > > > > > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > > > > > lower than intended (and lower than the prior value). So reintroduce
> > > > > > the old bound as a lower bound to avoid potentially regressing code
> > > > > > where we don't have complete information to make the decision.
> > > > > > ---
> > > > > >  sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
> > > > > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > > > > > index c98fa57a7b..0436ffb349 100644
> > > > > > --- a/sysdeps/x86/dl-cacheinfo.h
> > > > > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > > > > @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > > > >       the maximum thrashing capped at 1/associativity. */
> > > > > >    unsigned long int non_temporal_threshold
> > > > > >        = shared / cachesize_non_temporal_divisor;
> > > > > > +
> > > > > > +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> > > > > > +     likely have incorrect/incomplete cache info in which case, default to
> > > > > > +     3/4 * per-thread L3 to avoid regressions.  */
> > > > > > +  unsigned long int non_temporal_threshold_lowbound
> > > > > > +      = shared_per_thread * 3 / 4;
> > > > > > +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> > > > > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > > > +
> > > > > >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > > > > >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> > > > > >       hint. As well, their performance in highly parallel situations is
> > > > > >       noticeably worse.  */
> > > > > >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > > > > > -    non_temporal_threshold = shared_per_thread * 3 / 4;
> > > > > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > > >    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > > > > >       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > > > > >       if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > > > > > --
> > > > > > 2.34.1
> > > > > >
> > > > >
> > > > > Sajan, can you test this out?
> > > > ping @Sajan Karumanchi
> > > > release is quickly approaching and I assume you want this in.
> > >
> > > I tested the above patch, which tries to revert the 'nt_threshold'
> > > value for AMD CPUs to the actual value '3/4 of the shared cache'.
> > How about just updating the comment?
> >
> > > However, your comments on threads' typical share of cache size and the
> > > non_temproral default ranges do not apply to AMD CPUs.
> > > The assignment of 'shared' to 'shared_pre_thread' doesn't make sense
> > > for AMD Zen CPUs.
> > `shared_per_thread` is equal to what `shared` was prior to the patch.
> >
> This 'shared_pre_thread' was introduced under this series of patches.
> I suggest removing 'shared_pre_thread' to avoid the misinterpretation
> of the shared cache per CCX with a thread.

Agreed, I already split it out, just hadn't updated this patch yet.
But it's fixed now.
>
> > > Maybe the simplest way to work for AMD CPUs is to configure the
> > > default non-temporal threshold value to 3/4th of the shared cache.
> > > I think the below patch would simplify the threshold computation for
> > > both architectures.
> >
> > I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> > a catchall for misconfiguration.
> I think you can use the existing 'minimum_non_temporal_threshold' to
> handle the misconfigurations.
>
That's a correctness bound.
This is more of a catchall sanity check that we are making the decision
with correct configuration knowledge.
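
Roughly, the bounds are meant to compose like this (a simplified sketch;
the ordering and the clamp against the minimum/maximum are assumptions,
see dl-cacheinfo.h for the real code):

/* Simplified sketch; not the actual dl-cacheinfo.h code.  */
static unsigned long int
compute_nt_threshold (unsigned long int shared,
                      unsigned long int shared_per_thread,
                      unsigned long int divisor,
                      unsigned long int min_threshold,
                      unsigned long int max_threshold)
{
  unsigned long int nt = shared / divisor;

  /* Sanity bound (this patch): distrust suspiciously small results that
     may come from incomplete topology information.  */
  unsigned long int lowbound = shared_per_thread * 3 / 4;
  if (nt < lowbound)
    nt = lowbound;

  /* Correctness bounds: keep the value in a range where the later shift
     in memmove-vec-unaligned-erms cannot overflow.  */
  if (nt < min_threshold)
    nt = min_threshold;
  else if (nt > max_threshold)
    nt = max_threshold;
  return nt;
}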

> > >
> > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > @@ -866,7 +866,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >        data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
> > >        core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
> > >        shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> > > -      shared_per_thread = shared;
> > >
> > >        level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
> > >        level1_icache_linesize = handle_amd (_SC_LEVEL1_ICACHE_LINESIZE);
> > > @@ -906,6 +905,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >    cpu_features->level3_cache_linesize = level3_cache_linesize;
> > >    cpu_features->level4_cache_size = level4_cache_size;
> > >
> > > +
> > > +  unsigned long int non_temporal_threshold = shared * 3 / 4;
> > > +
> > > +  if (cpu_features->basic.kind != arch_kind_amd)
> > > +  {
> > >    unsigned long int cachesize_non_temporal_divisor
> > >        = cpu_features->cachesize_non_temporal_divisor;
> > >    if (cachesize_non_temporal_divisor <= 0)
> > > @@ -913,7 +917,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >
> > >    /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
> > >       of the chip's cache (depending on `cachesize_non_temporal_divisor` which
> > > -     is microarch specific. The default is 1/4). For most Intel and AMD
> > > +     is microarch specific. The default is 1/4). For most Intel
> > >       processors with an initial release date between 2017 and 2023, a thread's
> > >       typical share of the cache is from 18-64MB. Using a reasonable size
> > >       fraction of L3 is meant to estimate the point where non-temporal stores
> > > @@ -923,7 +927,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >       L3 cache being evicted by the copy are mostly alleviated by the fact that
> > >       modern HW detects streaming patterns and provides proper LRU hints so that
> > >       the maximum thrashing capped at 1/associativity. */
> > > -  unsigned long int non_temporal_threshold
> > > +  non_temporal_threshold
> > >        = shared / cachesize_non_temporal_divisor;
> > >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> > > @@ -931,6 +935,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > >       noticeably worse.  */
> > >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > >      non_temporal_threshold = shared_per_thread * 3 / 4;
> > > +  }
> > > +
> > >
> > > Thanks,
> > > Sajan K.
  
Noah Goldstein July 18, 2023, 3:28 p.m. UTC | #9
On Tue, Jul 18, 2023 at 10:27 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Jul 18, 2023 at 8:35 AM sajan karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > * Noah
> > On Tue, Jul 18, 2023 at 1:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi
> > > <sajan.karumanchi@gmail.com> wrote:
> > > >
> > > > *Noah,
> > > > On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > > > >
> > > > > > > On some machines we end up with incomplete cache information. This can
> > > > > > > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > > > > > > lower than intended (and lower than the prior value). So reintroduce
> > > > > > > the old bound as a lower bound to avoid potentially regressing code
> > > > > > > where we don't have complete information to make the decision.
> > > > > > > ---
> > > > > > >  sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
> > > > > > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > > > > > > index c98fa57a7b..0436ffb349 100644
> > > > > > > --- a/sysdeps/x86/dl-cacheinfo.h
> > > > > > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > > > > > @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > > > > >       the maximum thrashing capped at 1/associativity. */
> > > > > > >    unsigned long int non_temporal_threshold
> > > > > > >        = shared / cachesize_non_temporal_divisor;
> > > > > > > +
> > > > > > > +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> > > > > > > +     likely have incorrect/incomplete cache info in which case, default to
> > > > > > > +     3/4 * per-thread L3 to avoid regressions.  */
> > > > > > > +  unsigned long int non_temporal_threshold_lowbound
> > > > > > > +      = shared_per_thread * 3 / 4;
> > > > > > > +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> > > > > > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > > > > +
> > > > > > >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > > > > > >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> > > > > > >       hint. As well, their performance in highly parallel situations is
> > > > > > >       noticeably worse.  */
> > > > > > >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > > > > > > -    non_temporal_threshold = shared_per_thread * 3 / 4;
> > > > > > > +    non_temporal_threshold = non_temporal_threshold_lowbound;
> > > > > > >    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > > > > > >       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > > > > > >       if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > > > > > > --
> > > > > > > 2.34.1
> > > > > > >
> > > > > >
> > > > > > Sajan, can you test this out?
> > > > > ping @Sajan Karumanchi
> > > > > release is quickly approaching and I assume you want this in.
> > > >
> > > > I tested the above patch, which tries to revert the 'nt_threshold'
> > > > value for AMD CPUs to the actual value '3/4 of the shared cache'.
> > > How about just updating the comment?
> > >
> > > > However, your comments on threads' typical share of cache size and the
> > > > non_temproral default ranges do not apply to AMD CPUs.
> > > > The assignment of 'shared' to 'shared_pre_thread' doesn't make sense
> > > > for AMD Zen CPUs.
> > > `shared_per_thread` is equal to what `shared` was prior to the patch.
> > >
> > This 'shared_pre_thread' was introduced under this series of patches.
> > I suggest removing 'shared_pre_thread' to avoid the misinterpretation
> > of the shared cache per CCX with a thread.
>
> Agreed, already split it just hadnt updated this.
> But fixed.
> >
> > > > Maybe the simplest way to work for AMD CPUs is to configure the
> > > > default non-temporal threshold value to 3/4th of the shared cache.
> > > > I think the below patch would simplify the threshold computation for
> > > > both architectures.
> > >
> > > I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> > > a catchall for misconfiguration.
> > I think you can use the existing 'minimum_non_temporal_threshold' to
> > handle the misconfigurations.
> >
> Thats a correctness bound.
> This is more a sanity check for "we are making the decision with
> correct configuration knowledge" catchall.
>
The main point being: Other x86 systems might be affected, not just
AMD.
> > > >
> > > > +++ b/sysdeps/x86/dl-cacheinfo.h
> > > > @@ -866,7 +866,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > >        data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
> > > >        core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
> > > >        shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
> > > > -      shared_per_thread = shared;
> > > >
> > > >        level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
> > > >        level1_icache_linesize = handle_amd (_SC_LEVEL1_ICACHE_LINESIZE);
> > > > @@ -906,6 +905,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > >    cpu_features->level3_cache_linesize = level3_cache_linesize;
> > > >    cpu_features->level4_cache_size = level4_cache_size;
> > > >
> > > > +
> > > > +  unsigned long int non_temporal_threshold = shared * 3 / 4;
> > > > +
> > > > +  if (cpu_features->basic.kind != arch_kind_amd)
> > > > +  {
> > > >    unsigned long int cachesize_non_temporal_divisor
> > > >        = cpu_features->cachesize_non_temporal_divisor;
> > > >    if (cachesize_non_temporal_divisor <= 0)
> > > > @@ -913,7 +917,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > >
> > > >    /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
> > > >       of the chip's cache (depending on `cachesize_non_temporal_divisor` which
> > > > -     is microarch specific. The default is 1/4). For most Intel and AMD
> > > > +     is microarch specific. The default is 1/4). For most Intel
> > > >       processors with an initial release date between 2017 and 2023, a thread's
> > > >       typical share of the cache is from 18-64MB. Using a reasonable size
> > > >       fraction of L3 is meant to estimate the point where non-temporal stores
> > > > @@ -923,7 +927,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > >       L3 cache being evicted by the copy are mostly alleviated by the fact that
> > > >       modern HW detects streaming patterns and provides proper LRU hints so that
> > > >       the maximum thrashing capped at 1/associativity. */
> > > > -  unsigned long int non_temporal_threshold
> > > > +  non_temporal_threshold
> > > >        = shared / cachesize_non_temporal_divisor;
> > > >    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> > > >       a higher risk of actually thrashing the cache as they don't have a HW LRU
> > > > @@ -931,6 +935,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > > >       noticeably worse.  */
> > > >    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > > >      non_temporal_threshold = shared_per_thread * 3 / 4;
> > > > +  }
> > > > +
> > > >
> > > > Thanks,
> > > > Sajan K.
  

Patch

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index c98fa57a7b..0436ffb349 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -757,12 +757,21 @@  dl_init_cacheinfo (struct cpu_features *cpu_features)
      the maximum thrashing capped at 1/associativity. */
   unsigned long int non_temporal_threshold
       = shared / cachesize_non_temporal_divisor;
+
+  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
+     likely have incorrect/incomplete cache info in which case, default to
+     3/4 * per-thread L3 to avoid regressions.  */
+  unsigned long int non_temporal_threshold_lowbound
+      = shared_per_thread * 3 / 4;
+  if (non_temporal_threshold < non_temporal_threshold_lowbound)
+    non_temporal_threshold = non_temporal_threshold_lowbound;
+
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, their performance in highly parallel situations is
      noticeably worse.  */
   if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-    non_temporal_threshold = shared_per_thread * 3 / 4;
+    non_temporal_threshold = non_temporal_threshold_lowbound;
   /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
      'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
      if that operation cannot overflow. Minimum of 0x4040 (16448) because the