From patchwork Fri Apr 15 05:51:31 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein <goldstein.w.n@gmail.com>
X-Patchwork-Id: 52972
Return-Path: <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id EAE29385B211
	for <patchwork@sourceware.org>; Fri, 15 Apr 2022 05:52:02 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org EAE29385B211
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1650001923;
	bh=/ztufWnAZesVFUO9VHhcz3tCRRpCSGYwzGfn74y1SCE=;
	h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
	 From;
	b=K/tvJtxzgAp4PullQUhHy1ZHZT1OTYPvc2NZLtHxB3TDQ5iGbprjMYv6inDOBh0oV
	 xzpis3QRjgMwAwwz53MDaJk5U1OfUvVOW4VbjOvIWf7XSK5OcRnsnnsdk/4nIAOcjC
	 wyQxW39ZbN+sxiehZaJIL1Nyc978VfXi9yRhxlo4=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from mail-io1-xd2d.google.com (mail-io1-xd2d.google.com
 [IPv6:2607:f8b0:4864:20::d2d])
 by sourceware.org (Postfix) with ESMTPS id 687673857376
 for <libc-alpha@sourceware.org>; Fri, 15 Apr 2022 05:51:41 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 687673857376
Received: by mail-io1-xd2d.google.com with SMTP id q22so1100682iod.2
 for <libc-alpha@sourceware.org>; Thu, 14 Apr 2022 22:51:41 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=/ztufWnAZesVFUO9VHhcz3tCRRpCSGYwzGfn74y1SCE=;
 b=FCs/rsPoW0veTbau4PqXEfVK8Wa8aljb5P4FJSGEfh9PrxkWS5JqNuSI1xhjihyzTp
 6slb4wgKcRiR3UgMX3zP+j7zb6+TepD2pYen53iiTCnVd6YXm9dDPv7+11P/2V56PO7t
 izpBeloCJpyUeuIAaXAviLI1Bw7zglfTgRgCgcu9oQ6MZOvVg/JNV66lMbH4aZixadha
 yeie6nw8mpXVvemZxZOxwCchwUR/IkCwbLviBWmvMtEgw3ye53wmgdLh65e3Q9lE04aF
 151uqDCdR5JUSxikHkM1SYBe6hv7/YGQ97sFXosEookI3aUnLeogLwYgx4wP5pAXSH8b
 9ioA==
X-Gm-Message-State: AOAM5320mh56aQHg9fZXmW0rO8gD/0kcjA10zJGyWuys/6Vc2IfaTUYm
 HVxq+51EZ9UYOJeESV6dg5EZ7kLeUv4=
X-Google-Smtp-Source: 
 ABdhPJx6bFPxExzlOsBKuPAmWTPG9VwEgRi78lXQi4vW+Yc9OaEbDSbt42eCzOvHCA6dCNpFRHj1gA==
X-Received: by 2002:a05:6638:164b:b0:323:ac42:8d4b with SMTP id
 a11-20020a056638164b00b00323ac428d4bmr2961060jat.75.1650001900622;
 Thu, 14 Apr 2022 22:51:40 -0700 (PDT)
Received: from localhost.localdomain (173-245-203-171.ipvanish.com.
 [173.245.203.171]) by smtp.googlemail.com with ESMTPSA id
 p10-20020a056e02104a00b002cbd0996ce8sm2218682ilj.16.2022.04.14.22.51.39
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Thu, 14 Apr 2022 22:51:40 -0700 (PDT)
To: libc-alpha@sourceware.org
Subject: [PATCH v1 2/3] x86: Remove memcmp-sse4.S
Date: Fri, 15 Apr 2022 00:51:31 -0500
Message-Id: <20220415055132.1257272-2-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220415055132.1257272-1-goldstein.w.n@gmail.com>
References: <20220415055132.1257272-1-goldstein.w.n@gmail.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-11.9 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: Noah Goldstein via Libc-alpha
 <libc-alpha@sourceware.org>
From: Noah Goldstein <goldstein.w.n@gmail.com>
Reply-To: Noah Goldstein <goldstein.w.n@gmail.com>
Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org
Sender: "Libc-alpha"
 <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>

Code didn't actually use any sse4 instructions. The new memcmp-sse2
implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions prefering SSE2 for Size = 1 and Size =
65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which is made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version has get better performance for returns which need both
comparisons.

size, align0, align1, ret, New Time/Old Time
 128,      4,      8,   0,             0.858
 128,      4,      8,   1,             0.879
 128,      4,      8,  -1,             0.888

As well, out of microbenchmark environments that are not full
predictable the branch will have a real-cost.
---
 sysdeps/x86_64/multiarch/Makefile          | 2 --
 sysdeps/x86_64/multiarch/ifunc-impl-list.c | 4 ----
 sysdeps/x86_64/multiarch/ifunc-memcmp.h    | 4 ----
 3 files changed, 10 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index b573966966..0400ea332b 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -11,7 +11,6 @@ sysdep_routines += \
   memcmp-avx2-movbe-rtm \
   memcmp-evex-movbe \
   memcmp-sse2 \
-  memcmp-sse4 \
   memcmpeq-avx2 \
   memcmpeq-avx2-rtm \
   memcmpeq-evex \
@@ -164,7 +163,6 @@ sysdep_routines += \
   wmemcmp-avx2-movbe-rtm \
   wmemcmp-evex-movbe \
   wmemcmp-sse2 \
-  wmemcmp-sse4 \
 # sysdep_routines
 endif
 
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index c6008a73ed..a8afcf81bb 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -96,8 +96,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			       && CPU_FEATURE_USABLE (BMI2)
 			       && CPU_FEATURE_USABLE (MOVBE)),
 			      __memcmp_evex_movbe)
-	      IFUNC_IMPL_ADD (array, i, memcmp, CPU_FEATURE_USABLE (SSE4_1),
-			      __memcmp_sse4_1)
 	      IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_sse2))
 
 #ifdef SHARED
@@ -809,8 +807,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			       && CPU_FEATURE_USABLE (BMI2)
 			       && CPU_FEATURE_USABLE (MOVBE)),
 			      __wmemcmp_evex_movbe)
-	      IFUNC_IMPL_ADD (array, i, wmemcmp, CPU_FEATURE_USABLE (SSE4_1),
-			      __wmemcmp_sse4_1)
 	      IFUNC_IMPL_ADD (array, i, wmemcmp, 1, __wmemcmp_sse2))
 
   /* Support sysdeps/x86_64/multiarch/wmemset.c.  */
diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmp.h b/sysdeps/x86_64/multiarch/ifunc-memcmp.h
index 44759a3ad5..c743970fe3 100644
--- a/sysdeps/x86_64/multiarch/ifunc-memcmp.h
+++ b/sysdeps/x86_64/multiarch/ifunc-memcmp.h
@@ -20,7 +20,6 @@
 # include <init-arch.h>
 
 extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (sse4_1) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_movbe) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_movbe_rtm) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_movbe) attribute_hidden;
@@ -46,8 +45,5 @@ IFUNC_SELECTOR (void)
 	return OPTIMIZE (avx2_movbe);
     }
 
-  if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_1))
-    return OPTIMIZE (sse4_1);
-
   return OPTIMIZE (sse2);
 }