[0/1] x86: Tuning NT Threshold parameter for AMD machines

Message ID	20200819104539.9854-1-sajan.karumanchi@amd.com
Headers	DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org D38963844020 To: libc-alpha@sourceware.org, carlos@redhat.com Subject: [PATCH 0/1] x86: Tuning NT Threshold parameter for AMD machines Date: Wed, 19 Aug 2020 16:15:38 +0530 Message-Id: <20200819104539.9854-1-sajan.karumanchi@amd.com> Precedence: list From: Sajan Karumanchi via Libc-alpha <libc-alpha@sourceware.org> Reply-To: Sajan Karumanchi <sajan.karumanchi@gmail.com> Cc: Sajan Karumanchi <sajan.karumanchi@amd.com>, premachandra.mallappa@amd.com Errors-To: libc-alpha-bounces@sourceware.org Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org>
Series	x86: Tuning NT Threshold parameter for AMD machines \| [0/1] x86: Tuning NT Threshold parameter for AMD machines [1/1] x86: Tuning NT Threshold parameter for AMD machines.

Message

Sajan Karumanchi Aug. 19, 2020, 10:45 a.m. UTC

  Tuning the NT threshold parameter '__x86_shared_non_temporal_threshold' to
2/3 of the shared cache size on AMD Zen[1|2] machines brings performance gains
for memcpy/memmove, as per the Large and Walk bench variant results.
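
To make the 2/3 rule concrete, here is a minimal sketch of the arithmetic. It is
not the patch itself: the helper names are hypothetical, and the ">=" comparison
against the threshold is an assumption about how the copy routine selects the NT
store path.

```c
#include <stddef.h>

/* Hypothetical helper (not a glibc function): non-temporal threshold
   taken as 2/3 of the shared cache size, in bytes.  The division is
   done first so that very large cache sizes cannot overflow.  */
static size_t
nt_threshold_two_thirds (size_t shared_cache_bytes)
{
  return shared_cache_bytes / 3 * 2;
}

/* Hypothetical helper: nonzero when a copy of COPY_BYTES would be
   routed to non-temporal stores.  The ">=" comparison is an assumption
   made for illustration.  */
static int
uses_nt_stores (size_t copy_bytes, size_t threshold)
{
  return copy_bytes >= threshold;
}
```

For example, with an assumed 3 MiB shared cache, nt_threshold_two_thirds
returns 2097152, so under this sketch a 2097159-byte copy would use NT stores
while a 1048583-byte copy would not.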

As there are run-to-run variations in the bench results, I took the average of
100 runs for both vanilla and patched glibc.
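
One way such averaged results could be reduced to a percentage is sketched
below. The helper names are hypothetical, and the sign convention (positive
means the patched build is faster) is an assumption; the exact formula behind
the tables is not stated here.

```c
#include <stddef.h>

/* Arithmetic mean of N timing samples; returns 0.0 when N is 0.  */
static double
mean_time (const double *samples, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += samples[i];
  return n ? sum / (double) n : 0.0;
}

/* Assumed convention: a positive result means the patched build took
   less time than vanilla.  Returns 0.0 if the vanilla mean is 0.  */
static double
percent_gain (double vanilla_mean, double patched_mean)
{
  if (vanilla_mean == 0.0)
    return 0.0;
  return (vanilla_mean - patched_mean) / vanilla_mean * 100.0;
}
```

In this sketch, mean_time would be called once per build on its 100 samples, and
the two means passed to percent_gain.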

AMD Zen[1|2] architectures do not have the ERMS CPU feature.
So, on the Zen architecture, memcpy takes the 'memcpy_avx_unaligned' entry point.
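
For anyone who wants to check the ERMS bit on their own machine, a small sketch
follows. ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9. The helper names
are hypothetical, and the code assumes GCC's or Clang's <cpuid.h>; it is not how
glibc itself performs the check.

```c
#if defined(__x86_64__) || defined(__i386__)
# include <cpuid.h>
#endif

/* ERMS (Enhanced REP MOVSB/STOSB): CPUID leaf 7, subleaf 0, EBX bit 9.  */
#define ERMS_EBX_BIT (1u << 9)

/* Hypothetical helper: test the ERMS bit in a leaf-7 EBX value.  */
static int
ebx_has_erms (unsigned int ebx)
{
  return (ebx & ERMS_EBX_BIT) != 0;
}

/* Hypothetical helper: 1 if the CPU reports ERMS, 0 if it does not,
   -1 if leaf 7 cannot be queried (or this is not an x86 build).  */
static int
cpu_has_erms (void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return -1;
  return ebx_has_erms (ebx);
#else
  return -1;
#endif
}
```

Going by the statement above, cpu_has_erms would be expected to return 0 on
Zen[1|2] machines.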

Below is the comparison of Large bench test results for the entry points
avx_unaligned and avx_unaligned_erms.
-------------------------------------------------------------------------
size     load_align store_align avx_unaligned(%) avx_unaligned_erms(%)
-------------------------------------------------------------------------
1048583         0       0       1.89                    68.28
1048591         0       3       1.19                    94.56
1048607         3       0       -0.25                   68.25
1048639         3       5       -90.7                   89.69
2097159         0       0       -75.11                  43.18
2097167         0       3       -74.08                  90.16
2097183         3       0       -78.12                  43.81
2097215         3       5       -73.75                  90.58
4194311         0       0       -88.5                   39.26
4194319         0       3       -72.13                  90.21
4194335         3       0       -78.31                  43.97
4194367         3       5       -72                     90.64
8388615         0       0       -12.22                  43.24
8388623         0       3       -15.76                  90.3
8388639         3       0       -22.31                  39.92
8388671         3       5       -15.34                  90.74
16777223        0       0       49.8                    46.89
16777231        0       3       52.5                    90.14
16777247        3       0       51.82                   46.68
16777279        3       5       52.35                   90.55
33554439        0       0       41.76                   52.72
33554447        0       3       44.17                   88.29
33554463        3       0       43.74                   53.62
33554495        3       5       44.09                   88.78
-------------------------------------------------------------------------

Below is the comparison of Walk bench test results for the entry points
avx_unaligned and avx_unaligned_erms.
---------------------------------------------------
size            avx_unaligned(%) avx_unaligned_erms(%)
---------------------------------------------------
1048576                 -0.2            15.03
1048577                 0.92            15.52
2097152                 40.52           50.92
2097153                 40.76           50.84
4194304                 40.6            51.22
4194305                 40.57           51.25
8388608                 40.61           51.23
8388609                 40.82           51.32
16777216                40.56           51.11
16777217                40.35           51.29
33554432                40.15           37.41
33554433                20.75           41.22
---------------------------------------------------
Question:
Why do we see discrepancies in the Large bench results, even though the code
path taken for NT stores in memcpy is the same for both entry points,
"memcpy_avx_unaligned" and "memcpy_avx_unaligned_erms"?


Sajan Karumanchi (1):
  x86: Tuning NT Threshold parameter for AMD machines.

 sysdeps/x86/cacheinfo.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)