[0/1] Optimizing memcpy for AMD Zen architecture.

Message ID 20201022045005.17371-1-sajan.karumanchi@amd.com


Karumanchi, Sajan Oct. 22, 2020, 4:50 a.m. UTC
  From: Sajan Karumanchi <sajan.karumanchi@amd.com>

Modifying the shareable cache '__x86_shared_cache_size', which is a
factor in computing the non-temporal threshold parameter
'__x86_shared_non_temporal_threshold', to optimize memcpy for AMD Zen
architectures.
In the existing implementation, the shareable cache is computed as 'L3
per thread, L2 per core'.
Recomputing this shareable cache as 'L3 per CCX' (Core-Complex) has
brought performance gains of ~44% for memory sizes greater than 16MB.

The patch I posted earlier, 'Tuning NT Threshold parameter for AMD',
and the recent patch committed by Patrick McGehearty, 'Reversing
calculation of __x86_shared_non_temporal_threshold', both show
regressions on AMD Zen machines for memory ranges of 1MB to 8MB
as per the large bench variant results.
This patch addresses that regression on AMD Zen machines.
The link below shows a performance-comparison chart of the 'Master'
branch and the 'AMD' patch against the 2.32 stable release.
Summary: on the master branch we see a regression for memory sizes
below 8MB with a performance drop of up to 99%, whereas the AMD patch
shows performance gains for 16MB and above with no regressions.

Note: The benchmarking is done by isolating all the CPU cores in a CCX,
configuring them to fixed-frequency mode and routing the IRQs to other
CPU cores.
The large bench tests were then run pinned to one of the isolated
cores for 1000 iterations, and the performance is computed by taking
the average over these iterations.

Sajan Karumanchi (1):
  x86: Optimizing memcpy for AMD Zen architecture.

 sysdeps/x86/cacheinfo.h | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)