[v3] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.
Checks

Context                                          | Check   | Description
-------------------------------------------------+---------+-------------------------------------------------
redhat-pt-bot/TryBot-apply_patch                 | success | Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit                       | success | Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm     | warning | Patch failed to apply
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 | warning | Patch failed to apply
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 | warning | Patch failed to apply
linaro-tcwg-bot/tcwg_glibc_check--master-arm     | warning | Patch failed to apply
Commit Message
On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision.
---
sysdeps/x86/dl-cacheinfo.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
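To make the resulting selection logic concrete, here is a minimal standalone
C sketch of the computation after this patch. It is a simplified model, not
the glibc code itself; the function and parameter names are hypothetical
stand-ins for the values dl_init_cacheinfo derives from the detected cache
topology:

#include <stdio.h>

/* Simplified model of the non-temporal threshold selection, with the
   new 3/4-of-per-thread-L3 lower bound applied before the ERMS check.  */
static unsigned long int
nt_threshold (unsigned long int shared,            /* detected total L3 size */
              unsigned long int shared_per_thread, /* per-thread share of L3 */
              unsigned long int divisor,           /* microarch divisor; default 4 */
              int has_erms)
{
  unsigned long int threshold = shared / divisor;
  /* Lower bound: never drop below the old "3/4 * per-thread L3" value,
     guarding against incomplete cache info shrinking `shared'.  */
  unsigned long int lowbound = shared_per_thread * 3 / 4;
  if (threshold < lowbound)
    threshold = lowbound;
  /* Without ERMS, use per-thread L3 chunking outright.  */
  if (!has_erms)
    threshold = lowbound;
  return threshold;
}

int
main (void)
{
  /* 32 MiB L3, 16 threads (2 MiB each), ERMS: the 8 MiB quotient wins.  */
  printf ("%lu\n", nt_threshold (32UL << 20, 2UL << 20, 4, 1));
  /* Mis-detected 4 MiB "shared" with a correct 2 MiB per-thread share:
     the 1.5 MiB lower bound replaces the too-small 1 MiB quotient.  */
  printf ("%lu\n", nt_threshold (4UL << 20, 2UL << 20, 4, 1));
  return 0;
}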
Comments
Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org> writes:
Way back when, we stored the cache size in "shared" ("core" or
"core/threads_l2") and the math was "shared * 3/4". Now the "core or
core/l2" value is in "shared_per_thread".
> unsigned long int non_temporal_threshold
> = shared / cachesize_non_temporal_divisor;
This is the "new" Intel ERMS value.
> + /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> + likely have incorrect/incomplete cache info in which case, default to
> + 3/4 * per-thread L3 to avoid regressions. */
> + unsigned long int non_temporal_threshold_lowbound
> + = shared_per_thread * 3 / 4;
> + if (non_temporal_threshold < non_temporal_threshold_lowbound)
> + non_temporal_threshold = non_temporal_threshold_lowbound;
So non_temporal_threshold might be increased to the 3/4 mark *even for
ERMS* when we know better.
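(To put hypothetical numbers on it: if the cache info is mis-detected so
that "shared" comes out as 4 MiB while the per-thread L3 share is correctly
2 MiB, the ERMS quotient is 1 MiB but the lower bound is 1.5 MiB, so the
bound wins. That is the intended repair; the catch is that it can also fire
when both values are correct, as below.)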
> /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
> a higher risk of actually thrashing the cache as they don't have a HW LRU
> hint. As well, their performance in highly parallel situations is
> noticeably worse. */
> if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> - non_temporal_threshold = shared_per_thread * 3 / 4;
> + non_temporal_threshold = non_temporal_threshold_lowbound;
This only does something when the ERMS value was larger than the lower
bound.
We could end up with an ERMS-enabled core but with a default threshold,
but I think the only time that ever happens is with only one thread per
core. I suspect the logic could be better, but at the moment, it could
have been worse too ;-)
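(Hypothetical numbers again: a 16 MiB L3 shared by two single-threaded
cores gives shared_per_thread = 8 MiB, so the lower bound is 6 MiB while
shared/4 is only 4 MiB. With that few threads per L3, an ERMS machine lands
on the non-ERMS default even with fully correct cache info.)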
I note Sajan's comment "I think you can use the existing
'minimum_non_temporal_threshold' to handle the misconfigurations." but
the source disagrees:
/* If `non_temporal_threshold` less than `minimum_non_temporal_threshold`
it most likely means we failed to detect the cache info. We don't want
to default to `minimum_non_temporal_threshold` as such a small value,
while correct, has bad performance. We default to 64MB as reasonable
default bound. 64MB is likely conservative in that most/all systems would
choose a lower value so it should never forcing non-temporal stores when
they otherwise wouldn't be used. */
if (non_temporal_threshold < minimum_non_temporal_threshold)
non_temporal_threshold = 64 * 1024 * 1024;
else if (non_temporal_threshold > maximum_non_temporal_threshold)
non_temporal_threshold = maximum_non_temporal_threshold;
The minimum can't be reset as it's used later for tunables.
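For context, a sketch of that later use (assuming the surrounding code
still matches current glibc): the minimum and maximum become the clamp
range for the tunable, so redefining the minimum here would also change
which values users may set:

  TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
                           minimum_non_temporal_threshold,
                           maximum_non_temporal_threshold);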
LGTM
Reviewed-by: DJ Delorie <dj@redhat.com>
On Tue, Jul 18, 2023 at 8:51 PM DJ Delorie <dj@redhat.com> wrote:
>
> [...]
>
> The minimum can't be reset as it's used later for tunables.
>
Ah good point.
> LGTM
> Reviewed-by: DJ Delorie <dj@redhat.com>
>
Thank you for the review DJ.
I'm going to push this shortly.
After that I think all related changes for the memcpy fixes will
be done for 2.38.
@@ -745,8 +745,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
/* The default setting for the non_temporal threshold is [1/8, 1/2] of size
of the chip's cache (depending on `cachesize_non_temporal_divisor` which
- is microarch specific. The default is 1/4). For most Intel and AMD
- processors with an initial release date between 2017 and 2023, a thread's
+ is microarch specific. The default is 1/4). For most Intel processors
+ with an initial release date between 2017 and 2023, a thread's
typical share of the cache is from 18-64MB. Using a reasonable size
fraction of L3 is meant to estimate the point where non-temporal stores
begin out-competing REP MOVSB. As well the point where the fact that
@@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
the maximum thrashing capped at 1/associativity. */
unsigned long int non_temporal_threshold
= shared / cachesize_non_temporal_divisor;
+
+ /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
+ likely have incorrect/incomplete cache info in which case, default to
+ 3/4 * per-thread L3 to avoid regressions. */
+ unsigned long int non_temporal_threshold_lowbound
+ = shared_per_thread * 3 / 4;
+ if (non_temporal_threshold < non_temporal_threshold_lowbound)
+ non_temporal_threshold = non_temporal_threshold_lowbound;
+
/* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
a higher risk of actually thrashing the cache as they don't have a HW LRU
hint. As well, their performance in highly parallel situations is
noticeably worse. */
if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
- non_temporal_threshold = shared_per_thread * 3 / 4;
+ non_temporal_threshold = non_temporal_threshold_lowbound;
/* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
if that operation cannot overflow. Minimum of 0x4040 (16448) because the