x86: Fix non-temporal memset unreachable on AMD Zen 3/4/5

Message ID 20260506053801.3433002-1-zombie12139@gmail.com (mailing list archive)
State New

Checks

Context                            Check     Description
redhat-pt-bot/TryBot-apply_patch   success   Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit         success   Build for i686

Commit Message

zombie12138 May 6, 2026, 5:38 a.m. UTC
On AMD Zen 3/4/5 with ERMS, the non-temporal memset path is unreachable.
rep_stosb_threshold is set to SIZE_MAX on these CPUs (the vectorized loop
is faster than ERMS), but the non-temporal code path is nested inside the
rep_stosb branch, so no size ever exceeds the threshold and the
non-temporal check is never evaluated.
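
The nesting can be illustrated with a small C model. This is a sketch only:
the real dispatch lives in the assembly memset implementations, and the
enum, the function names, the exact comparison operators, and the 48 MiB
threshold used in the self-check are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three large-size strategies.  */
enum memset_path
{
  PATH_VECTOR_LOOP,
  PATH_REP_STOSB,
  PATH_NON_TEMPORAL
};

/* Simplified model: the non-temporal check is only evaluated once the
   rep_stosb check has passed.  */
static enum memset_path
select_memset_path (size_t len, size_t rep_stosb_threshold,
                    size_t non_temporal_threshold)
{
  if (len > rep_stosb_threshold)
    {
      if (len >= non_temporal_threshold)
        return PATH_NON_TEMPORAL;
      return PATH_REP_STOSB;
    }
  return PATH_VECTOR_LOOP;
}

/* Self-check of the model with an assumed 48 MiB non-temporal threshold.  */
static void
check_dispatch_model (void)
{
  const size_t nt = (size_t) 48 << 20;
  const size_t len = (size_t) 64 << 20;

  /* Threshold at SIZE_MAX: the outer branch is never taken.  */
  assert (select_memset_path (len, SIZE_MAX, nt) == PATH_VECTOR_LOOP);
  /* Threshold lowered to the non-temporal threshold: NT is reachable.  */
  assert (select_memset_path (len, nt, nt) == PATH_NON_TEMPORAL);
}
```

In this model, no len can satisfy len > SIZE_MAX, so a 64 MiB memset always
falls through to the vectorized loop regardless of the non-temporal
threshold.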

The existing rescue logic at the Avoid_STOSB check only covers the case
where the CPU lacks ERMS hardware support.  It does not cover AMD Zen 3+
where ERMS is supported but deliberately unused for performance reasons.

Extend the condition to also lower rep_stosb_threshold to the non-temporal
threshold when both of the following hold:
- The user has not explicitly set x86_rep_stosb_threshold (so tunables are
  respected)
- rep_stosb_threshold is higher than memset_non_temporal_threshold (so the
  NT path would otherwise be gated off)
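
The extended condition can be modelled as a pure function. This is a
hedged sketch, not the glibc code: final_rep_stosb_threshold and its
boolean parameters are hypothetical stand-ins for the Avoid_STOSB feature
bit and the TUNABLE_IS_INITIALIZED check, and the threshold values in the
self-check are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return the threshold that would be stored after the extended check.
   avoid_stosb models the Avoid_STOSB bit; user_set_rep_stosb models
   whether the x86_rep_stosb_threshold tunable was set explicitly.  */
static size_t
final_rep_stosb_threshold (bool avoid_stosb, bool user_set_rep_stosb,
                           size_t rep_stosb_threshold,
                           size_t memset_non_temporal_threshold)
{
  if (avoid_stosb
      || (!user_set_rep_stosb
          && rep_stosb_threshold > memset_non_temporal_threshold))
    return memset_non_temporal_threshold;
  return rep_stosb_threshold;
}

/* Self-check with an arbitrary non-temporal threshold of 48 MiB.  */
static void
check_threshold_model (void)
{
  const size_t nt = (size_t) 48 << 20;

  /* Default of SIZE_MAX, tunable untouched: lowered to nt.  */
  assert (final_rep_stosb_threshold (false, false, SIZE_MAX, nt) == nt);
  /* User set the tunable explicitly: value is kept.  */
  assert (final_rep_stosb_threshold (false, true, SIZE_MAX, nt) == SIZE_MAX);
  /* Threshold already below nt: nothing changes.  */
  assert (final_rep_stosb_threshold (false, false, 2048, nt) == 2048);
  /* Avoid_STOSB keeps its existing behaviour.  */
  assert (final_rep_stosb_threshold (true, true, 2048, nt) == nt);
}
```

Under this model, CPUs whose default rep_stosb_threshold is already below
the non-temporal threshold are unaffected by the new clause.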

This makes the non-temporal path reachable for large memset operations,
providing a ~2x speedup on pre-faulted buffers larger than the L3 cache.

Tested on AMD Ryzen 7 8745HS (Zen 4):
- Pre-faulted 64MB memset: 2.02 ms -> 0.94 ms (2.15x faster)
- First-touch 64MB memset: 19.3 ms -> 21.3 ms (~10% regression, expected:
  NT stores bypass the cache, so they do not benefit from the cache
  warming done by the kernel's clear_page)
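
A minimal way to compare the two cases is sketched below. It is not the
benchmark used for the numbers above. It assumes a POSIX system with
clock_gettime and assumes that a 64 MiB malloc returns pages that have not
been touched yet (typical when the allocator serves large requests with
mmap). The helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Call memset through a volatile pointer so the compiler cannot inline
   or elide the call.  */
static void *(*volatile memset_fn) (void *, int, size_t) = memset;

static double
now_ms (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Time a single memset call over BUF.  */
static double
time_memset_ms (void *buf, size_t len)
{
  double start = now_ms ();
  memset_fn (buf, 0x5a, len);
  return now_ms () - start;
}

/* Measure a first-touch memset followed by a memset over the same,
   now resident, buffer.  Returns 0 on success, -1 if allocation fails.  */
static int
measure_64mb (double *first_touch_ms, double *prefaulted_ms)
{
  const size_t len = (size_t) 64 << 20;
  unsigned char *buf = malloc (len);
  if (buf == NULL)
    return -1;

  *first_touch_ms = time_memset_ms (buf, len); /* pages faulted in here */
  *prefaulted_ms = time_memset_ms (buf, len);  /* pages already resident */

  /* Sanity check that the fill actually happened.  */
  assert (buf[0] == 0x5a && buf[len - 1] == 0x5a);
  free (buf);
  return 0;
}
```

A single run like this is noisy. Repeating the measurement and pinning the
process to one core would give more stable numbers.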

	* sysdeps/x86/dl-cacheinfo.h (dl_init_cacheinfo): Extend
	rep_stosb_threshold lowering condition to cover AMD Zen 3/4/5
	where ERMS is supported but stosb is disabled via threshold.

Signed-off-by: zombie12138 <zombie12139@gmail.com>
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=34129
---
 sysdeps/x86/dl-cacheinfo.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
  

Patch

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index b6e17b0e32..78929d1f2d 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -1293,7 +1293,9 @@  dl_init_cacheinfo (struct cpu_features *cpu_features)
   /* Do `rep_stosb_thresh = non_temporal_thresh` after setting/getting the
      final value of `x86_memset_non_temporal_threshold`. In some cases this can
      be a matter of correctness.  */
-  if (CPU_FEATURES_ARCH_P (cpu_features, Avoid_STOSB))
+  if (CPU_FEATURES_ARCH_P (cpu_features, Avoid_STOSB)
+      || (!TUNABLE_IS_INITIALIZED (x86_rep_stosb_threshold)
+	  && rep_stosb_threshold > memset_non_temporal_threshold))
     rep_stosb_threshold
 	= TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,