From patchwork Mon Jul 18 10:38:11 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 56115
To: libc-alpha@sourceware.org
Subject: [PATCH v1] x86: Continue building memmove-ssse3.S as ISA level V3
Date: Mon, 18 Jul 2022 03:38:11 -0700
Message-Id: <20220718103811.1842054-2-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220718103811.1842054-1-goldstein.w.n@gmail.com>
References: <20220718103811.1842054-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

Some V3 processors still strongly prefer memmove-ssse3.S because it is
heavily optimized to avoid unaligned memory accesses.

Tested builds for x86-64 v1, v2, v3, and v4 with and without multiarch.
---
 sysdeps/x86/isa-level.h                    | 15 +++++++++++++++
 sysdeps/x86_64/multiarch/ifunc-impl-list.c | 30 ++++++++++++++--------
 sysdeps/x86_64/multiarch/ifunc-memmove.h   | 14 ++++++------
 sysdeps/x86_64/multiarch/memmove-ssse3.S   |  4 +++-
 4 files changed, 44 insertions(+), 19 deletions(-)

diff --git a/sysdeps/x86/isa-level.h b/sysdeps/x86/isa-level.h
index fe56af7e2b..f49336acf3 100644
--- a/sysdeps/x86/isa-level.h
+++ b/sysdeps/x86/isa-level.h
@@ -90,6 +90,14 @@
 
 /* For X86_ISA_CPU_FEATURES_ARCH_P.  */
+
+/* NB: This is just an alias to `AVX_Fast_Unaligned_Load` that will
+   continue doing runtime check up to ISA level >= 4.  This is for
+   some Zhaoxin CPUs which build at ISA level V3 but still have a
+   strong preference for avoiding unaligned `ymm` loads.  */
+#define V4_AVX_Fast_Unaligned_Load_X86_ISA_LEVEL 4
+#define V4_AVX_Fast_Unaligned_Load AVX_Fast_Unaligned_Load
+
 /* NB: This feature is enabled when ISA level >= 3, which was disabled
    for the following CPUs:
       - AMD Excavator
@@ -106,6 +114,13 @@
    this feature don't run on glibc built with ISA level >= 3.  */
 #define Slow_SSE4_2_X86_ISA_LEVEL 3
 
+/* NB: This is just an alias to `Fast_Unaligned_Copy` that will
+   continue doing runtime check up to ISA level >= 3.  This is for
+   some Zhaoxin CPUs which build at ISA level V3 but still have a
+   strong preference for avoiding unaligned `ymm` loads.  */
+#define V3_Fast_Unaligned_Copy_X86_ISA_LEVEL 3
+#define V3_Fast_Unaligned_Copy Fast_Unaligned_Copy
+
 /* Feature(s) enabled when ISA level >= 2.  */
 #define Fast_Unaligned_Load_X86_ISA_LEVEL 2
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index a71444eccb..427f127427 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -143,8 +143,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                              (CPU_FEATURE_USABLE (AVX)
                               && CPU_FEATURE_USABLE (RTM)),
                              __memmove_chk_avx_unaligned_erms_rtm)
-              /* By V3 we assume fast aligned copy.  */
-              X86_IFUNC_IMPL_ADD_V2 (array, i, __memmove_chk,
+              /* Some V3 implementations still heavily prefer aligned
+                 loads so keep SSSE3 implementation around.  */
+              X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk,
                                      CPU_FEATURE_USABLE (SSSE3),
                                      __memmove_chk_ssse3)
               /* ISA V2 wrapper for SSE2 implementation because the SSE2
@@ -190,8 +191,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                              (CPU_FEATURE_USABLE (AVX)
                               && CPU_FEATURE_USABLE (RTM)),
                              __memmove_avx_unaligned_erms_rtm)
-              /* By V3 we assume fast aligned copy.  */
-              X86_IFUNC_IMPL_ADD_V2 (array, i, memmove,
+              /* Some V3 implementations still heavily prefer aligned
+                 loads so keep SSSE3 implementation around.  */
+              X86_IFUNC_IMPL_ADD_V3 (array, i, memmove,
                                      CPU_FEATURE_USABLE (SSSE3),
                                      __memmove_ssse3)
               /* ISA V2 wrapper for SSE2 implementation because the SSE2
@@ -1004,8 +1006,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                              (CPU_FEATURE_USABLE (AVX)
                               && CPU_FEATURE_USABLE (RTM)),
                              __memcpy_chk_avx_unaligned_erms_rtm)
-              /* By V3 we assume fast aligned copy.  */
-              X86_IFUNC_IMPL_ADD_V2 (array, i, __memcpy_chk,
+              /* Some V3 implementations still heavily prefer aligned
+                 loads so keep SSSE3 implementation around.  */
+              X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk,
                                      CPU_FEATURE_USABLE (SSSE3),
                                      __memcpy_chk_ssse3)
               /* ISA V2 wrapper for SSE2 implementation because the SSE2
@@ -1051,8 +1054,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                              (CPU_FEATURE_USABLE (AVX)
                               && CPU_FEATURE_USABLE (RTM)),
                              __memcpy_avx_unaligned_erms_rtm)
-              /* By V3 we assume fast aligned copy.  */
-              X86_IFUNC_IMPL_ADD_V2 (array, i, memcpy,
+              /* Some V3 implementations still heavily prefer aligned
+                 loads so keep SSSE3 implementation around.  */
+              X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy,
                                      CPU_FEATURE_USABLE (SSSE3),
                                      __memcpy_ssse3)
               /* ISA V2 wrapper for SSE2 implementation because the SSE2
@@ -1098,8 +1102,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                              (CPU_FEATURE_USABLE (AVX)
                               && CPU_FEATURE_USABLE (RTM)),
                              __mempcpy_chk_avx_unaligned_erms_rtm)
-              /* By V3 we assume fast aligned copy.  */
-              X86_IFUNC_IMPL_ADD_V2 (array, i, __mempcpy_chk,
+              /* Some V3 implementations still heavily prefer aligned
+                 loads so keep SSSE3 implementation around.  */
+              X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk,
                                      CPU_FEATURE_USABLE (SSSE3),
                                      __mempcpy_chk_ssse3)
               /* ISA V2 wrapper for SSE2 implementation because the SSE2
@@ -1145,8 +1150,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
                              (CPU_FEATURE_USABLE (AVX)
                               && CPU_FEATURE_USABLE (RTM)),
                              __mempcpy_avx_unaligned_erms_rtm)
-              /* By V3 we assume fast aligned copy.  */
-              X86_IFUNC_IMPL_ADD_V2 (array, i, mempcpy,
+              /* Some V3 implementations still heavily prefer aligned
+                 loads so keep SSSE3 implementation around.  */
+              X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy,
                                      CPU_FEATURE_USABLE (SSSE3),
                                      __mempcpy_ssse3)
               /* ISA V2 wrapper for SSE2 implementation because the SSE2
diff --git a/sysdeps/x86_64/multiarch/ifunc-memmove.h b/sysdeps/x86_64/multiarch/ifunc-memmove.h
index 1643d32887..be0c758783 100644
--- a/sysdeps/x86_64/multiarch/ifunc-memmove.h
+++ b/sysdeps/x86_64/multiarch/ifunc-memmove.h
@@ -72,7 +72,7 @@ IFUNC_SELECTOR (void)
     }
 
   if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
-                                   AVX_Fast_Unaligned_Load, ))
+                                   V4_AVX_Fast_Unaligned_Load, ))
     {
       if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL))
         {
@@ -101,11 +101,13 @@ IFUNC_SELECTOR (void)
     }
 
   if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, SSSE3)
-      /* Leave this as runtime check.  The SSSE3 is optimized almost
-         exclusively for avoiding unaligned memory access during the
-         copy and by and large is not better than the sse2
-         implementation as a general purpose memmove.  */
-      && !CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Copy))
+      /* Leave this as runtime check for V2.  By V3 assume it must be
+         set.  The SSSE3 is optimized almost exclusively for avoiding
+         unaligned memory access during the copy and by and large is
+         not better than the sse2 implementation as a general purpose
+         memmove.  */
+      && X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
+                                      V3_Fast_Unaligned_Copy, !))
     {
       return OPTIMIZE (ssse3);
     }
diff --git a/sysdeps/x86_64/multiarch/memmove-ssse3.S b/sysdeps/x86_64/multiarch/memmove-ssse3.S
index 57599752c7..15cafee766 100644
--- a/sysdeps/x86_64/multiarch/memmove-ssse3.S
+++ b/sysdeps/x86_64/multiarch/memmove-ssse3.S
@@ -20,7 +20,9 @@
 
 #include <isa-level.h>
 
-#if ISA_SHOULD_BUILD (2)
+/* Continue building up to ISA level V3 as some V3 CPUs strongly
+   prefer this implementation.  */
+#if ISA_SHOULD_BUILD (3)
 
 # include <sysdep.h>
 # ifndef MEMMOVE
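
For readers unfamiliar with the ISA-level gating used above, here is a
minimal, self-contained C sketch of the idea the new
V3_Fast_Unaligned_Copy alias relies on.  It is not part of the patch:
the wrapper macro and the feature variable below are illustrative
stand-ins, not the actual X86_ISA_CPU_FEATURES_ARCH_P and cpu_features
definitions from sysdeps/x86/isa-level.h.

#include <stdbool.h>
#include <stdio.h>

/* Assumed build-time ISA level; in a real glibc build this comes from
   the configured minimum ISA level, not a hard-coded define.  */
#define MINIMUM_X86_ISA_LEVEL 3

/* Same threshold value the patch introduces for the alias.  */
#define V3_Fast_Unaligned_Copy_X86_ISA_LEVEL 3

/* Hypothetical stand-in for the runtime bit that CPU_FEATURES_ARCH_P
   would read out of cpu_features.  */
static bool fast_unaligned_copy_bit;

/* Illustrative wrapper (not glibc's macro): once the build-time level
   exceeds the feature's threshold, the check folds to a compile-time
   constant and the other ifunc branch can be discarded; at or below
   the threshold the runtime bit still decides which memmove is used.  */
#define LEVEL_OR_RUNTIME_P(threshold, runtime_bit) \
  ((threshold) < MINIMUM_X86_ISA_LEVEL ? true : (runtime_bit))

int
main (void)
{
  /* E.g. a V3 CPU that prefers aligned (SSSE3-style) copies.  */
  fast_unaligned_copy_bit = false;
  printf ("fast-unaligned-copy check: %d\n",
          LEVEL_OR_RUNTIME_P (V3_Fast_Unaligned_Copy_X86_ISA_LEVEL,
                              fast_unaligned_copy_bit));
  return 0;
}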