[0/1] x86: Tuning NT Threshold parameter for AMD machines

Message ID 20200819104539.9854-1-sajan.karumanchi@amd.com

Sajan Karumanchi Aug. 19, 2020, 10:45 a.m. UTC
Tuning the NT threshold parameter '__x86_shared_non_temporal_threshold' to 2/3
of the shared cache size on AMD Zen[1|2] machines brings performance gains
for memcpy/memmove, as per the Large and Walk bench variant results.
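
As a rough illustration of the 2/3 rule above, here is a minimal sketch in C.
The helper names are hypothetical and this is not the code from the patch. It
also assumes that a copy takes the non-temporal path once its size reaches the
threshold, which ignores any other conditions the real memcpy checks.

```c
#include <stddef.h>

/* Hypothetical helper: 2/3 of the shared cache size, in bytes.
   Integer division truncates toward zero. */
size_t nt_threshold_two_thirds(size_t shared_cache_size)
{
    return shared_cache_size * 2 / 3;
}

/* Assumption for illustration only: copies of at least `threshold`
   bytes are treated as candidates for non-temporal stores. */
int would_use_nt_stores(size_t copy_size, size_t threshold)
{
    return copy_size >= threshold;
}
```

For example, with a 16 MiB (16777216 byte) shared cache this gives a threshold
of 11184810 bytes.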

As there are run-to-run variations in the bench results, I took the average of
100 runs for both vanilla and patched glibc.
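
The averaging step can be sketched as follows. The function names are
hypothetical, and the percentage formula is an assumption: the tables below do
not define how the (%) columns are computed, so this sketch takes a positive
value to mean the patched glibc is faster than vanilla.

```c
#include <stddef.h>

/* Arithmetic mean of n timing samples; returns 0.0 for an empty set. */
double mean_time(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return n ? sum / (double)n : 0.0;
}

/* Assumed definition: relative reduction in mean time, in percent.
   Positive means patched is faster; negative means it is slower. */
double percent_gain(double vanilla_mean, double patched_mean)
{
    return (vanilla_mean - patched_mean) / vanilla_mean * 100.0;
}
```

Under that assumed definition, a mean time dropping from 200 to 100 units
would be reported as a 50% gain.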

AMD Zen[1|2] architectures don't have the ERMS CPU feature.
So, on the Zen architecture memcpy takes the 'memcpy_avx_unaligned' entry point.
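
For reference, ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9. The
sketch below checks that bit with the compiler's <cpuid.h> helper. It is only
an illustration with hypothetical function names; glibc does its own feature
detection and does not use this code.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* ERMS (Enhanced REP MOVSB/STOSB): CPUID.(EAX=7,ECX=0):EBX bit 9. */
#define ERMS_EBX_MASK (1u << 9)

/* Returns 1 if the ERMS bit is set in the given EBX value, else 0. */
int erms_bit_set(unsigned int ebx)
{
    return (ebx & ERMS_EBX_MASK) != 0;
}

/* Returns 1 if the running CPU reports ERMS, 0 otherwise
   (including on non-x86 builds or when leaf 7 is unavailable). */
int cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return erms_bit_set(ebx);
#else
    return 0;
#endif
}
```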

Below is the Large bench test results comparison for the entry points
avx_unaligned and avx_unaligned_erms.
-------------------------------------------------------------------------
size     load_align store_align avx_unaligned(%) avx_unaligned_erms(%)
-------------------------------------------------------------------------
1048583         0       0       1.89                    68.28
1048591         0       3       1.19                    94.56
1048607         3       0       -0.25                   68.25
1048639         3       5       -90.7                   89.69
2097159         0       0       -75.11                  43.18
2097167         0       3       -74.08                  90.16
2097183         3       0       -78.12                  43.81
2097215         3       5       -73.75                  90.58
4194311         0       0       -88.5                   39.26
4194319         0       3       -72.13                  90.21
4194335         3       0       -78.31                  43.97
4194367         3       5       -72                     90.64
8388615         0       0       -12.22                  43.24
8388623         0       3       -15.76                  90.3
8388639         3       0       -22.31                  39.92
8388671         3       5       -15.34                  90.74
16777223        0       0       49.8                    46.89
16777231        0       3       52.5                    90.14
16777247        3       0       51.82                   46.68
16777279        3       5       52.35                   90.55
33554439        0       0       41.76                   52.72
33554447        0       3       44.17                   88.29
33554463        3       0       43.74                   53.62
33554495        3       5       44.09                   88.78
-------------------------------------------------------------------------

Below is the Walk bench test results comparison for the entry points
avx_unaligned and avx_unaligned_erms.
---------------------------------------------------
size            avx_unaligned(%) avx_unaligned_erms(%)
---------------------------------------------------
1048576                 -0.2            15.03
1048577                 0.92            15.52
2097152                 40.52           50.92
2097153                 40.76           50.84
4194304                 40.6            51.22
4194305                 40.57           51.25
8388608                 40.61           51.23
8388609                 40.82           51.32
16777216                40.56           51.11
16777217                40.35           51.29
33554432                40.15           37.41
33554433                20.75           41.22
---------------------------------------------------
Question:
Why do we see discrepancies in the Large bench results, though the code path
taken for NT stores in memcpy is the same for both entry points,
"memcpy_avx_unaligned" and "memcpy_avx_unaligned_erms"?


Sajan Karumanchi (1):
  x86: Tuning NT Threshold parameter for AMD machines.

 sysdeps/x86/cacheinfo.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)