[1/1] x86: Tuning NT Threshold parameter for AMD machines.
Commit Message
Tuning the NT threshold parameter to bring in performance gains for
memcpy/memmove on AMD CPUs.
Based on Large and Walk bench variant results,
setting __x86_shared_non_temporal_threshold to 2/3 of shared cache size
brings in performance gains for memcpy/memmove on AMD machines.
Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
Signed-off-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
Signed-off-by: Sajan Karumanchi <sajan.karumanchi@amd.com>
---
sysdeps/x86/cacheinfo.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
Comments
On Wed, Aug 19, 2020 at 3:58 AM Sajan Karumanchi via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Tuning NT threshold parameter to bring in performance gains of
> memcpy/memove on AMD cpu's.
>
> Based on Large and Walk bench variant results,
> setting __x86_shared_non_temporal_threshold to 2/3 of shared cache size
> brings in performance gains for memcpy/memmove on AMD machines.
>
The patch looks mostly OK, but I have quite a few x86 patches queued
that touch the same code. Please take a look at
https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/tunable/master
and put your patch on top of mine.
Thanks H.J.Lu for reviewing the patch.
Before pushing a rebased patch, I am looking for answers regarding
the performance drop observed only in large bench variant results for
size ranges of 1MB to 8MB.
For more details, please refer to the cover letter
https://sourceware.org/pipermail/libc-alpha/2020-August/117080.html
On Tue, Sep 8, 2020 at 4:39 AM Sajan Karumanchi
<sajan.karumanchi@gmail.com> wrote:
>
> Thanks H.J.Lu for reviewing the patch.
> Before pushing a rebased patch, I am looking for answers regarding
> the performance drop observed only in large bench variant results for
> size ranges of 1MB to 8MB.
> For more details, please refer to the cover letter
> https://sourceware.org/pipermail/libc-alpha/2020-August/117080.html
>
Please update your patch since the code has been changed by
commit d3c57027470b78dba79c6d931e4e409b1fecfc80
Author: Patrick McGehearty <patrick.mcgehearty@oracle.com>
Date: Mon Sep 28 20:11:28 2020 +0000
Reversing calculation of __x86_shared_non_temporal_threshold
@@ -829,7 +829,8 @@ init_cacheinfo (void)
     }

   if (cpu_features->data_cache_size != 0)
-    data = cpu_features->data_cache_size;
+    if (data == 0 || cpu_features->basic.kind != arch_kind_amd)
+      data = cpu_features->data_cache_size;

   if (data > 0)
     {
@@ -842,7 +843,8 @@ init_cacheinfo (void)
     }

   if (cpu_features->shared_cache_size != 0)
-    shared = cpu_features->shared_cache_size;
+    if (shared == 0 || cpu_features->basic.kind != arch_kind_amd)
+      shared = cpu_features->shared_cache_size;

   if (shared > 0)
     {
@@ -854,6 +856,17 @@ init_cacheinfo (void)
       __x86_shared_cache_size = shared;
     }

+  if (cpu_features->basic.kind == arch_kind_amd)
+    {
+      /* The Large and Walk benchmarks in glibc show that 2/3 of the shared
+         cache size is the threshold above which non-temporal stores perform better.  */
+      __x86_shared_non_temporal_threshold
+        = (cpu_features->non_temporal_threshold != 0
+           ? cpu_features->non_temporal_threshold
+           : __x86_shared_cache_size * 2 / 3);
+    }
+  else
+    {
   /* The large memcpy micro benchmark in glibc shows that 6 times of
      shared cache size is the approximate value above which non-temporal
      store becomes faster on a 8-core processor.  This is the 3/4 of the
@@ -862,6 +875,7 @@ init_cacheinfo (void)
        = (cpu_features->non_temporal_threshold != 0
           ? cpu_features->non_temporal_threshold
           : __x86_shared_cache_size * threads * 3 / 4);
+    }

   /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
   unsigned int minimum_rep_movsb_threshold;