[v3] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.

Message ID 20230718152759.219727-1-goldstein.w.n@gmail.com
State Committed
Commit 8b9a0af8ca012217bf90d1dc0694f85b49ae09da
Series [v3] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm warning Patch failed to apply
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 warning Patch failed to apply
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 warning Patch failed to apply
linaro-tcwg-bot/tcwg_glibc_check--master-arm warning Patch failed to apply

Commit Message

Noah Goldstein July 18, 2023, 3:27 p.m. UTC
  On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision.
---
 sysdeps/x86/dl-cacheinfo.h | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)
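
To make the resulting selection concrete, here is a minimal standalone C sketch of the
threshold logic after this patch. It is not the actual dl-cacheinfo.h code: the cache
sizes, the hard-coded divisor, and the `have_erms` flag (standing in for
CPU_FEATURE_USABLE_P (cpu_features, ERMS)) are hypothetical stand-ins.

  #include <stdio.h>

  int
  main (void)
  {
    /* Hypothetical detected values; the real code derives them from CPUID.  */
    unsigned long int shared = 32UL * 1024 * 1024;            /* total L3 */
    unsigned long int shared_per_thread = 2UL * 1024 * 1024;  /* L3 / 16 threads */
    unsigned long int cachesize_non_temporal_divisor = 4;     /* default divisor */
    int have_erms = 1;

    unsigned long int non_temporal_threshold
        = shared / cachesize_non_temporal_divisor;             /* 8 MiB here */

    /* New lower bound: never go below the old `3/4 * per-thread L3` value,
       which guards against incomplete cache detection.  */
    unsigned long int non_temporal_threshold_lowbound
        = shared_per_thread * 3 / 4;                           /* 1.5 MiB here */
    if (non_temporal_threshold < non_temporal_threshold_lowbound)
      non_temporal_threshold = non_temporal_threshold_lowbound;

    /* Without ERMS, always fall back to per-thread L3 chunking.  */
    if (!have_erms)
      non_temporal_threshold = non_temporal_threshold_lowbound;

    printf ("non_temporal_threshold = %lu bytes\n", non_temporal_threshold);
    return 0;
  }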
  

Comments

DJ Delorie July 19, 2023, 1:51 a.m. UTC | #1
Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org> writes:

Way-back-when we stored the cache size in "shared" ("core" or
"core/threads_l2") the math was "shared * 3/4".  Now the "core or
core/l2" value is in "shared_per_thread"

>   unsigned long int non_temporal_threshold
>       = shared / cachesize_non_temporal_divisor;

This is the "new" Intel ERMS value.

> +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> +     likely have incorrect/incomplete cache info in which case, default to
> +     3/4 * per-thread L3 to avoid regressions.  */
> +  unsigned long int non_temporal_threshold_lowbound
> +      = shared_per_thread * 3 / 4;
> +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> +    non_temporal_threshold = non_temporal_threshold_lowbound;

So non_temporal_threshold might be increased to the 3/4 mark *even for
ERMS* when we know better.

>    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
>       a higher risk of actually thrashing the cache as they don't have a HW LRU
>       hint. As well, their performance in highly parallel situations is
>       noticeably worse.  */
>    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> -    non_temporal_threshold = shared_per_thread * 3 / 4;
> +    non_temporal_threshold = non_temporal_threshold_lowbound;

This only does something when the ERMS value was larger than the lower
bound.

We could end up with an ERMS-enabled core but with a default threshold,
but I think the only time that ever happens is with only one thread per
core.  I suspect the logic could be better, but at the moment, it could
have been worse too ;-)
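
To put rough numbers on that (hypothetical values, not from any specific machine): if
detection only reports one thread sharing an 8 MiB cache, then shared ==
shared_per_thread == 8 MiB, the divisor path gives 8 MiB / 4 = 2 MiB, and the new lower
bound of 8 MiB * 3/4 = 6 MiB wins, so even an ERMS machine ends up at the old 6 MiB
threshold. With a more typical 8 threads sharing that 8 MiB, the divisor path still
gives 2 MiB while the bound is only 1 MiB * 3/4 = 768 KiB, so the clamp is a no-op.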

I note Sajan's comment "I think you can use the existing
'minimum_non_temporal_threshold' to handle the misconfigurations." but
the source disagrees:

  /* If `non_temporal_threshold` less than `minimum_non_temporal_threshold`
     it most likely means we failed to detect the cache info. We don't want
     to default to `minimum_non_temporal_threshold` as such a small value,
     while correct, has bad performance. We default to 64MB as reasonable
     default bound. 64MB is likely conservative in that most/all systems would
     choose a lower value so it should never forcing non-temporal stores when
     they otherwise wouldn't be used.  */
  if (non_temporal_threshold < minimum_non_temporal_threshold)
    non_temporal_threshold = 64 * 1024 * 1024;
  else if (non_temporal_threshold > maximum_non_temporal_threshold)
    non_temporal_threshold = maximum_non_temporal_threshold;

The minimum can't be reset as it's used later for tunables.

LGTM
Reviewed-by: DJ Delorie <dj@redhat.com>
  
Noah Goldstein July 19, 2023, 2:03 a.m. UTC | #2
On Tue, Jul 18, 2023 at 8:51 PM DJ Delorie <dj@redhat.com> wrote:
>
> [...]
>
> The minimum can't be reset as it's used later for tunables.
>
Ah good point.

> LGTM
> Reviewed-by: DJ Delorie <dj@redhat.com>
>

Thank you for the review, DJ.
I'm going to push this shortly.

After that I think all related changes for the memcpy fixes will
be done for 2.38.
  

Patch

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index c98fa57a7b..2586ff0e31 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -745,8 +745,8 @@  dl_init_cacheinfo (struct cpu_features *cpu_features)
 
   /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
      of the chip's cache (depending on `cachesize_non_temporal_divisor` which
-     is microarch specific. The default is 1/4). For most Intel and AMD
-     processors with an initial release date between 2017 and 2023, a thread's
+     is microarch specific. The default is 1/4). For most Intel processors
+     with an initial release date between 2017 and 2023, a thread's
      typical share of the cache is from 18-64MB. Using a reasonable size
      fraction of L3 is meant to estimate the point where non-temporal stores
      begin out-competing REP MOVSB. As well the point where the fact that
@@ -757,12 +757,21 @@  dl_init_cacheinfo (struct cpu_features *cpu_features)
      the maximum thrashing capped at 1/associativity. */
   unsigned long int non_temporal_threshold
       = shared / cachesize_non_temporal_divisor;
+
+  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
+     likely have incorrect/incomplete cache info in which case, default to
+     3/4 * per-thread L3 to avoid regressions.  */
+  unsigned long int non_temporal_threshold_lowbound
+      = shared_per_thread * 3 / 4;
+  if (non_temporal_threshold < non_temporal_threshold_lowbound)
+    non_temporal_threshold = non_temporal_threshold_lowbound;
+
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, their performance in highly parallel situations is
      noticeably worse.  */
   if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-    non_temporal_threshold = shared_per_thread * 3 / 4;
+    non_temporal_threshold = non_temporal_threshold_lowbound;
   /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
      'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
      if that operation cannot overflow. Minimum of 0x4040 (16448) because the