From patchwork Fri Apr 15 05:51:31 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 52972
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v1 2/3] x86: Remove memcmp-sse4.S
Date: Fri, 15 Apr 2022 00:51:31 -0500
Message-Id: <20220415055132.1257272-2-goldstein.w.n@gmail.com>
In-Reply-To: <20220415055132.1257272-1-goldstein.w.n@gmail.com>
References: <20220415055132.1257272-1-goldstein.w.n@gmail.com>

Code didn't actually use any sse4 instructions. The new memcmp-sse2
implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note that preferring SSE2 causes two regressions, at Size = 1 and
Size = 65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Based on profiles of GCC11 and Python3, Size == 1
is significantly less hot than sizes [4, 8] (which are made hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) smaller values <= 16 are now given a hotter path. By contrast
the SSE4 version has a branch for Size = 80. The unrolled version
gets better performance for returns which need both comparisons.

size, align0, align1, ret, New Time/Old Time
 128,      4,      8,   0,             0.858
 128,      4,      8,   1,             0.879
 128,      4,      8,  -1,             0.888

As well, outside of microbenchmark environments, where branches are
not fully predictable, the branch will have a real cost.
---
 sysdeps/x86_64/multiarch/Makefile          | 2 --
 sysdeps/x86_64/multiarch/ifunc-impl-list.c | 4 ----
 sysdeps/x86_64/multiarch/ifunc-memcmp.h    | 4 ----
 3 files changed, 10 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index b573966966..0400ea332b 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -11,7 +11,6 @@ sysdep_routines += \
   memcmp-avx2-movbe-rtm \
   memcmp-evex-movbe \
   memcmp-sse2 \
-  memcmp-sse4 \
   memcmpeq-avx2 \
   memcmpeq-avx2-rtm \
   memcmpeq-evex \
@@ -164,7 +163,6 @@ sysdep_routines += \
   wmemcmp-avx2-movbe-rtm \
   wmemcmp-evex-movbe \
   wmemcmp-sse2 \
-  wmemcmp-sse4 \
 # sysdep_routines
 endif
 
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index c6008a73ed..a8afcf81bb 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -96,8 +96,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			       && CPU_FEATURE_USABLE (BMI2)
 			       && CPU_FEATURE_USABLE (MOVBE)),
 			      __memcmp_evex_movbe)
-	      IFUNC_IMPL_ADD (array, i, memcmp, CPU_FEATURE_USABLE (SSE4_1),
-			      __memcmp_sse4_1)
 	      IFUNC_IMPL_ADD (array, i, memcmp, 1, __memcmp_sse2))
 
 #ifdef SHARED
@@ -809,8 +807,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			       && CPU_FEATURE_USABLE (BMI2)
 			       && CPU_FEATURE_USABLE (MOVBE)),
 			      __wmemcmp_evex_movbe)
-	      IFUNC_IMPL_ADD (array, i, wmemcmp, CPU_FEATURE_USABLE (SSE4_1),
-			      __wmemcmp_sse4_1)
 	      IFUNC_IMPL_ADD (array, i, wmemcmp, 1, __wmemcmp_sse2))
 
   /* Support sysdeps/x86_64/multiarch/wmemset.c.  */
diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmp.h b/sysdeps/x86_64/multiarch/ifunc-memcmp.h
index 44759a3ad5..c743970fe3 100644
--- a/sysdeps/x86_64/multiarch/ifunc-memcmp.h
+++ b/sysdeps/x86_64/multiarch/ifunc-memcmp.h
@@ -20,7 +20,6 @@
 # include
 
 extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (sse4_1) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_movbe) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_movbe_rtm) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_movbe) attribute_hidden;
@@ -46,8 +45,5 @@ IFUNC_SELECTOR (void)
 	return OPTIMIZE (avx2_movbe);
     }
 
-  if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_1))
-    return OPTIMIZE (sse4_1);
-
   return OPTIMIZE (sse2);
 }