From patchwork Tue Feb 6 17:43:20 2024
X-Patchwork-Submitter: Adhemerval Zanella Netto
X-Patchwork-Id: 85374
From: Adhemerval Zanella
To: libc-alpha@sourceware.org
Cc: "H . J . Lu", Noah Goldstein, Sajan Karumanchi, bmerry@sarao.ac.za, pmallapp@amd.com
Subject: [PATCH v2 1/3] x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
Date: Tue, 6 Feb 2024 14:43:20 -0300
Message-Id: <20240206174322.2317679-2-adhemerval.zanella@linaro.org>
In-Reply-To: <20240206174322.2317679-1-adhemerval.zanella@linaro.org>
References: <20240206174322.2317679-1-adhemerval.zanella@linaro.org>

Using REP MOVSB for memcpy/memmove does not show much performance
improvement over the vectorized loops on Zen3/Zen4 cores.  Moreover,
as reported in BZ 30994, if the source is aligned and the destination
is not, performance can be up to 20x slower.  The difference is most
noticeable with small buffer sizes, close to the lower bound at which
memcpy/memmove starts to use ERMS.  At the upper size limit (the L2
cache size) REP MOVSB performs similarly to the vectorized
instructions, and there is no drawback when multiple cores share the
cache.
A new tunable, glibc.cpu.x86_rep_movsb_stop_threshold, allows the user
to set the upper bound size for using 'rep movsb'.

Checked on x86_64-linux-gnu on Zen3.
---
 manual/tunables.texi         |  9 +++++++
 sysdeps/x86/dl-cacheinfo.h   | 50 +++++++++++++++++++++---------------
 sysdeps/x86/dl-tunables.list | 10 ++++++++
 3 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index be97190d67..ee5d90b91b 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -569,6 +569,15 @@ greater than zero, and currently defaults to 2048 bytes.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_stop_threshold
+The @code{glibc.cpu.x86_rep_movsb_stop_threshold} tunable allows the
+user to set the threshold in bytes to stop using "rep movsb".  The
+value must be greater than zero, and currently the default depends on
+the CPU and the cache size.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_rep_stosb_threshold
 The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user to
 set threshold in bytes to start using "rep stosb".
 The value must be
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index d5101615e3..74b804c5e6 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -791,7 +791,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   long int data = -1;
   long int shared = -1;
   long int shared_per_thread = -1;
-  long int core = -1;
   unsigned int threads = 0;
   unsigned long int level1_icache_size = -1;
   unsigned long int level1_icache_linesize = -1;
@@ -809,7 +808,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (cpu_features->basic.kind == arch_kind_intel)
     {
       data = handle_intel (_SC_LEVEL1_DCACHE_SIZE, cpu_features);
-      core = handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       shared = handle_intel (_SC_LEVEL3_CACHE_SIZE, cpu_features);
       shared_per_thread = shared;
 
@@ -822,7 +820,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 	= handle_intel (_SC_LEVEL1_DCACHE_ASSOC, cpu_features);
       level1_dcache_linesize
 	= handle_intel (_SC_LEVEL1_DCACHE_LINESIZE, cpu_features);
-      level2_cache_size = core;
+      level2_cache_size
+	= handle_intel (_SC_LEVEL2_CACHE_SIZE, cpu_features);
       level2_cache_assoc
 	= handle_intel (_SC_LEVEL2_CACHE_ASSOC, cpu_features);
       level2_cache_linesize
@@ -835,12 +834,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level4_cache_size
 	= handle_intel (_SC_LEVEL4_CACHE_SIZE, cpu_features);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+			     level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_zhaoxin)
     {
       data = handle_zhaoxin (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_zhaoxin (_SC_LEVEL3_CACHE_SIZE);
       shared_per_thread = shared;
 
@@ -849,19 +848,19 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_zhaoxin (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_zhaoxin (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_zhaoxin (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_zhaoxin (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_zhaoxin (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
       level3_cache_assoc = handle_zhaoxin (_SC_LEVEL3_CACHE_ASSOC);
       level3_cache_linesize = handle_zhaoxin (_SC_LEVEL3_CACHE_LINESIZE);
 
-      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
+      get_common_cache_info (&shared, &shared_per_thread, &threads,
+			     level2_cache_size);
     }
   else if (cpu_features->basic.kind == arch_kind_amd)
     {
       data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
-      core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
 
       level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
@@ -869,7 +868,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level1_dcache_size = data;
       level1_dcache_assoc = handle_amd (_SC_LEVEL1_DCACHE_ASSOC);
       level1_dcache_linesize = handle_amd (_SC_LEVEL1_DCACHE_LINESIZE);
-      level2_cache_size = core;
+      level2_cache_size = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       level2_cache_assoc = handle_amd (_SC_LEVEL2_CACHE_ASSOC);
       level2_cache_linesize = handle_amd (_SC_LEVEL2_CACHE_LINESIZE);
       level3_cache_size = shared;
@@ -880,12 +879,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       if (shared <= 0)
 	{
 	  /* No shared L3 cache.  All we have is the L2 cache.  */
-	  shared = core;
+	  shared = level2_cache_size;
 	}
       else if (cpu_features->basic.family < 0x17)
 	{
 	  /* Account for exclusive L2 and L3 caches.  */
-	  shared += core;
+	  shared += level2_cache_size;
 	}
 
       shared_per_thread = shared;
@@ -1028,16 +1027,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 			   SIZE_MAX);
 
   unsigned long int rep_movsb_stop_threshold;
-  /* ERMS feature is implemented from AMD Zen3 architecture and it is
-     performing poorly for data above L2 cache size.  Henceforth, adding
-     an upper bound threshold parameter to limit the usage of Enhanced
-     REP MOVSB operations and setting its value to L2 cache size.  */
-  if (cpu_features->basic.kind == arch_kind_amd)
-    rep_movsb_stop_threshold = core;
-  /* Setting the upper bound of ERMS to the computed value of
-     non-temporal threshold for architectures other than AMD.  */
-  else
-    rep_movsb_stop_threshold = non_temporal_threshold;
+  /* If the tunable is set to a valid value (larger than the minimal
+     threshold to use ERMS), use it instead of the default values.  */
+  rep_movsb_stop_threshold = TUNABLE_GET (x86_rep_movsb_stop_threshold,
+					  long int, NULL);
+  if (!TUNABLE_IS_INITIALIZED (x86_rep_movsb_stop_threshold)
+      || rep_movsb_stop_threshold <= rep_movsb_threshold)
+    {
+      /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
+	 cases slower than the vectorized path (and for some alignments,
+	 it is really slow, check BZ #30994).  */
+      if (cpu_features->basic.kind == arch_kind_amd)
+	rep_movsb_stop_threshold = 0;
+      else
+	/* Set the upper bound of ERMS to the computed value of the
+	   non-temporal threshold for architectures other than AMD.  */
+	rep_movsb_stop_threshold = non_temporal_threshold;
+    }
+  TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_stop_threshold,
+			   rep_movsb_stop_threshold, 1, SIZE_MAX);
 
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 7d82da0dec..80cf5563ab 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -49,6 +49,16 @@ glibc {
       # if the tunable value is set by user or not [BZ #27069].
       minval: 1
     }
+    x86_rep_movsb_stop_threshold {
+      # For AMD CPUs that support ERMS (Zen3+), REP MOVSB is not faster
+      # than the vectorized path (and for some destination alignments it
+      # is really slow, check BZ #30994).  On Intel CPUs, the size limit
+      # to use ERMS is [1/8, 1/2] of the size of the chip's cache; check
+      # dl-cacheinfo.h.
+      # This tunable allows the caller to set the limit where to use
+      # REP MOVSB on memcpy/memmove.
+      type: SIZE_T
+    }
     x86_rep_stosb_threshold {
       type: SIZE_T
       # Since there is overhead to set up REP STOSB operation, REP STOSB
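---
For testing, the new tunable can be exercised like any other cpu
tunable, via the GLIBC_TUNABLES environment variable.  A small usage
sketch (not part of the patch; the choice of the getconf-reported L2
size and the 512 KiB fallback are illustrative):

```shell
# Derive a stop threshold from the L2 cache size the kernel reports;
# fall back to 512 KiB if getconf cannot provide a positive value.
l2=$(getconf LEVEL2_CACHE_SIZE 2>/dev/null)
[ -n "$l2" ] && [ "$l2" -gt 0 ] 2>/dev/null || l2=524288
# Per-process: REP MOVSB is used only for sizes below this bound.
export GLIBC_TUNABLES="glibc.cpu.x86_rep_movsb_stop_threshold=${l2}"
echo "$GLIBC_TUNABLES"
```

Running a memcpy benchmark under this environment, with the threshold
swept across the suspect size range, is how the Zen3/Zen4 regression in
BZ 30994 can be reproduced and the new default validated.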