From patchwork Tue Sep 14 06:30:35 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 44971
To: libc-alpha@sourceware.org
Subject: [PATCH v2 1/5] x86_64: Add support for bcmp using sse2, sse4_1, avx2, and evex
Date: Tue, 14 Sep 2021 01:30:35 -0500
Message-Id: <20210914063039.1126196-1-goldstein.w.n@gmail.com>
In-Reply-To: <20210913230506.546749-1-goldstein.w.n@gmail.com>
References: <20210913230506.546749-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

No bug. This commit adds support for an optimized bcmp implementation.
Support is for sse2, sse4_1, avx2, and evex. All string tests pass and
the build succeeds.
---
 benchtests/Makefile                        |  2 +-
 benchtests/bench-bcmp.c                    | 20 ++++++++
 benchtests/bench-memcmp.c                  |  4 +-
 string/Makefile                            |  4 +-
 string/bcmp.c                              | 25 ++++++++++
 string/test-bcmp.c                         | 21 +++++++++
 string/test-memcmp.c                       | 27 +++++++----
 sysdeps/x86_64/memcmp.S                    |  2 -
 sysdeps/x86_64/multiarch/Makefile          |  3 ++
 sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S   | 12 +++++
 sysdeps/x86_64/multiarch/bcmp-avx2.S       | 23 ++++++++++
 sysdeps/x86_64/multiarch/bcmp-evex.S       | 23 ++++++++++
 sysdeps/x86_64/multiarch/bcmp-sse2.S       | 23 ++++++++++
 sysdeps/x86_64/multiarch/bcmp-sse4.S       | 23 ++++++++++
 sysdeps/x86_64/multiarch/bcmp.c            | 35 ++++++++++++++
 sysdeps/x86_64/multiarch/ifunc-bcmp.h      | 53 ++++++++++++++++++++++
 sysdeps/x86_64/multiarch/ifunc-impl-list.c | 23 ++++++++++
 sysdeps/x86_64/multiarch/memcmp-sse2.S     |  4 +-
 sysdeps/x86_64/multiarch/memcmp.c          |  2 -
 19 files changed, 311 insertions(+), 18 deletions(-)
 create mode 100644 benchtests/bench-bcmp.c
 create mode 100644 string/bcmp.c
 create mode 100644 string/test-bcmp.c
 create mode 100644 sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S
 create mode 100644 sysdeps/x86_64/multiarch/bcmp-avx2.S
 create mode 100644 sysdeps/x86_64/multiarch/bcmp-evex.S
 create mode 100644 sysdeps/x86_64/multiarch/bcmp-sse2.S
 create mode 100644 sysdeps/x86_64/multiarch/bcmp-sse4.S
 create mode 100644 sysdeps/x86_64/multiarch/bcmp.c
 create mode 100644 sysdeps/x86_64/multiarch/ifunc-bcmp.h

diff --git a/benchtests/Makefile b/benchtests/Makefile index 1530939a8c..5fc495eb57 100644 --- a/benchtests/Makefile +++ b/benchtests/Makefile @@ -47,7 +47,7 @@ bench := $(foreach B,$(filter bench-%,${BENCHSET}), ${${B}}) endif # String function benchmarks.
-string-benchset := memccpy memchr memcmp memcpy memmem memmove \ +string-benchset := bcmp memccpy memchr memcmp memcpy memmem memmove \ mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \ strcat strchr strchrnul strcmp strcpy strcspn strlen \ strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \ diff --git a/benchtests/bench-bcmp.c b/benchtests/bench-bcmp.c new file mode 100644 index 0000000000..1023639787 --- /dev/null +++ b/benchtests/bench-bcmp.c @@ -0,0 +1,20 @@ +/* Measure bcmp functions. + Copyright (C) 2015-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define TEST_BCMP 1 +#include "bench-memcmp.c" diff --git a/benchtests/bench-memcmp.c b/benchtests/bench-memcmp.c index 744c7ec5ba..4d5f8fb766 100644 --- a/benchtests/bench-memcmp.c +++ b/benchtests/bench-memcmp.c @@ -17,7 +17,9 @@ . */ #define TEST_MAIN -#ifdef WIDE +#ifdef TEST_BCMP +# define TEST_NAME "bcmp" +#elif defined WIDE # define TEST_NAME "wmemcmp" #else # define TEST_NAME "memcmp" diff --git a/string/Makefile b/string/Makefile index f0fce2a0b8..f1f67ee157 100644 --- a/string/Makefile +++ b/string/Makefile @@ -35,7 +35,7 @@ routines := strcat strchr strcmp strcoll strcpy strcspn \ strncat strncmp strncpy \ strrchr strpbrk strsignal strspn strstr strtok \ strtok_r strxfrm memchr memcmp memmove memset \ - mempcpy bcopy bzero ffs ffsll stpcpy stpncpy \ + mempcpy bcmp bcopy bzero ffs ffsll stpcpy stpncpy \ strcasecmp strncase strcasecmp_l strncase_l \ memccpy memcpy wordcopy strsep strcasestr \ swab strfry memfrob memmem rawmemchr strchrnul \ @@ -52,7 +52,7 @@ strop-tests := memchr memcmp memcpy memmove mempcpy memset memccpy \ stpcpy stpncpy strcat strchr strcmp strcpy strcspn \ strlen strncmp strncpy strpbrk strrchr strspn memmem \ strstr strcasestr strnlen strcasecmp strncasecmp \ - strncat rawmemchr strchrnul bcopy bzero memrchr \ + strncat rawmemchr strchrnul bcmp bcopy bzero memrchr \ explicit_bzero tests := tester inl-tester noinl-tester testcopy test-ffs \ tst-strlen stratcliff tst-svc tst-inlcall \ diff --git a/string/bcmp.c b/string/bcmp.c new file mode 100644 index 0000000000..2f5c446124 --- /dev/null +++ b/string/bcmp.c @@ -0,0 +1,25 @@ +/* Copyright (C) 1991-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + + +/* This file is intentionally left empty. It exists so that both + architectures which implement bcmp seperately from memcmp and + architectures which implement bcmp by having it alias memcmp will + build. + + The alias for bcmp to memcmp for the C implementation is in + memcmp.c. */ diff --git a/string/test-bcmp.c b/string/test-bcmp.c new file mode 100644 index 0000000000..6d19a4a87c --- /dev/null +++ b/string/test-bcmp.c @@ -0,0 +1,21 @@ +/* Test and measure bcmp functions. + Copyright (C) 2012-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define BAD_RESULT(result, expec) ((!(result)) != (!(expec))) +#define TEST_BCMP 1 +#include "test-memcmp.c" diff --git a/string/test-memcmp.c b/string/test-memcmp.c index 6ddbc05d2f..c630e6799d 100644 --- a/string/test-memcmp.c +++ b/string/test-memcmp.c @@ -17,11 +17,14 @@ . */ #define TEST_MAIN -#ifdef WIDE +#ifdef TEST_BCMP +# define TEST_NAME "bcmp" +#elif defined WIDE # define TEST_NAME "wmemcmp" #else # define TEST_NAME "memcmp" #endif + #include "test-string.h" #ifdef WIDE # include @@ -35,6 +38,7 @@ # define CHARBYTES 4 # define CHAR__MIN WCHAR_MIN # define CHAR__MAX WCHAR_MAX + int simple_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n) { @@ -48,8 +52,11 @@ simple_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n) } #else # include - -# define MEMCMP memcmp +# ifdef TEST_BCMP +# define MEMCMP bcmp +# else +# define MEMCMP memcmp +# endif # define MEMCPY memcpy # define SIMPLE_MEMCMP simple_memcmp # define CHAR char @@ -69,6 +76,12 @@ simple_memcmp (const char *s1, const char *s2, size_t n) } #endif +# ifndef BAD_RESULT +# define BAD_RESULT(result, expec) \ + (((result) == 0 && (expec)) || ((result) < 0 && (expec) >= 0) || \ + ((result) > 0 && (expec) <= 0)) +# endif + typedef int (*proto_t) (const CHAR *, const CHAR *, size_t); IMPL (SIMPLE_MEMCMP, 0) @@ -79,9 +92,7 @@ check_result (impl_t *impl, const CHAR *s1, const CHAR *s2, size_t len, int exp_result) { int result = CALL (impl, s1, s2, len); - if ((exp_result == 0 && result != 0) - || (exp_result < 0 && result >= 0) - || (exp_result > 0 && result <= 0)) + if (BAD_RESULT(result, exp_result)) { error (0, 0, "Wrong result in function %s %d %d", impl->name, result, exp_result); @@ -186,9 +197,7 @@ do_random_tests (void) { r = CALL (impl, (CHAR *) p1 + align1, (const CHAR *) p2 + align2, len); - if ((r == 0 && result) - || (r < 0 && result >= 0) - || (r > 0 && result <= 0)) + if (BAD_RESULT(r, result)) { error (0, 0, "Iteration %zd - wrong result in function %s (%zd, %zd, %zd, %zd) %ld != %d, p1 %p p2 %p", n, impl->name, align1 * CHARBYTES & 63, align2 * CHARBYTES & 63, len, pos, r, result, p1, p2); diff --git a/sysdeps/x86_64/memcmp.S b/sysdeps/x86_64/memcmp.S index 
870e15c5a0..dfd0269db2 100644 --- a/sysdeps/x86_64/memcmp.S +++ b/sysdeps/x86_64/memcmp.S @@ -356,6 +356,4 @@ L(ATR32res): .p2align 4,, 4 END(memcmp) -#undef bcmp -weak_alias (memcmp, bcmp) libc_hidden_builtin_def (memcmp) diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile index 26be40959c..9dd0d8c3ff 100644 --- a/sysdeps/x86_64/multiarch/Makefile +++ b/sysdeps/x86_64/multiarch/Makefile @@ -1,6 +1,7 @@ ifeq ($(subdir),string) sysdep_routines += strncat-c stpncpy-c strncpy-c \ + bcmp-sse2 bcmp-sse4 bcmp-avx2 \ strcmp-sse2 strcmp-sse2-unaligned strcmp-ssse3 \ strcmp-sse4_2 strcmp-avx2 \ strncmp-sse2 strncmp-ssse3 strncmp-sse4_2 strncmp-avx2 \ @@ -40,6 +41,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \ memset-sse2-unaligned-erms \ memset-avx2-unaligned-erms \ memset-avx512-unaligned-erms \ + bcmp-avx2-rtm \ memchr-avx2-rtm \ memcmp-avx2-movbe-rtm \ memmove-avx-unaligned-erms-rtm \ @@ -59,6 +61,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \ strncpy-avx2-rtm \ strnlen-avx2-rtm \ strrchr-avx2-rtm \ + bcmp-evex \ memchr-evex \ memcmp-evex-movbe \ memmove-evex-unaligned-erms \ diff --git a/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S b/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S new file mode 100644 index 0000000000..d742257e4e --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S @@ -0,0 +1,12 @@ +#ifndef MEMCMP +# define MEMCMP __bcmp_avx2_rtm +#endif + +#define ZERO_UPPER_VEC_REGISTERS_RETURN \ + ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST + +#define VZEROUPPER_RETURN jmp L(return_vzeroupper) + +#define SECTION(p) p##.avx.rtm + +#include "bcmp-avx2.S" diff --git a/sysdeps/x86_64/multiarch/bcmp-avx2.S b/sysdeps/x86_64/multiarch/bcmp-avx2.S new file mode 100644 index 0000000000..93a9a20b17 --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-avx2.S @@ -0,0 +1,23 @@ +/* bcmp optimized with AVX2. + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef MEMCMP +# define MEMCMP __bcmp_avx2 +#endif + +#include "bcmp-avx2.S" diff --git a/sysdeps/x86_64/multiarch/bcmp-evex.S b/sysdeps/x86_64/multiarch/bcmp-evex.S new file mode 100644 index 0000000000..ade52e8c68 --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-evex.S @@ -0,0 +1,23 @@ +/* bcmp optimized with EVEX. + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#ifndef MEMCMP +# define MEMCMP __bcmp_evex +#endif + +#include "memcmp-evex-movbe.S" diff --git a/sysdeps/x86_64/multiarch/bcmp-sse2.S b/sysdeps/x86_64/multiarch/bcmp-sse2.S new file mode 100644 index 0000000000..b18d570386 --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-sse2.S @@ -0,0 +1,23 @@ +/* bcmp optimized with SSE2 + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +# ifndef memcmp +# define memcmp __bcmp_sse2 +# endif +# define USE_AS_BCMP 1 +#include "memcmp-sse2.S" diff --git a/sysdeps/x86_64/multiarch/bcmp-sse4.S b/sysdeps/x86_64/multiarch/bcmp-sse4.S new file mode 100644 index 0000000000..ed9804053f --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp-sse4.S @@ -0,0 +1,23 @@ +/* bcmp optimized with SSE4.1 + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +# ifndef MEMCMP +# define MEMCMP __bcmp_sse4_1 +# endif +# define USE_AS_BCMP 1 +#include "memcmp-sse4.S" diff --git a/sysdeps/x86_64/multiarch/bcmp.c b/sysdeps/x86_64/multiarch/bcmp.c new file mode 100644 index 0000000000..6e26b73ecc --- /dev/null +++ b/sysdeps/x86_64/multiarch/bcmp.c @@ -0,0 +1,35 @@ +/* Multiple versions of bcmp. + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +/* Define multiple versions only for the definition in libc. 
*/ +#if IS_IN (libc) +# define bcmp __redirect_bcmp +# include +# undef bcmp + +# define SYMBOL_NAME bcmp +# include "ifunc-bcmp.h" + +libc_ifunc_redirected (__redirect_bcmp, bcmp, IFUNC_SELECTOR ()); + +# ifdef SHARED +__hidden_ver1 (bcmp, __GI_bcmp, __redirect_bcmp) + __attribute__ ((visibility ("hidden"))) __attribute_copy__ (bcmp); +# endif +#endif diff --git a/sysdeps/x86_64/multiarch/ifunc-bcmp.h b/sysdeps/x86_64/multiarch/ifunc-bcmp.h new file mode 100644 index 0000000000..b0dacd8526 --- /dev/null +++ b/sysdeps/x86_64/multiarch/ifunc-bcmp.h @@ -0,0 +1,53 @@ +/* Common definition for bcmp ifunc selections. + All versions must be listed in ifunc-impl-list.c. + Copyright (C) 2017-2021 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +# include + +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse4_1) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden; +extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden; + +static inline void * +IFUNC_SELECTOR (void) +{ + const struct cpu_features* cpu_features = __get_cpu_features (); + + if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) + && CPU_FEATURE_USABLE_P (cpu_features, BMI2) + && CPU_FEATURE_USABLE_P (cpu_features, MOVBE) + && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) + { + if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL) + && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)) + return OPTIMIZE (evex); + + if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) + return OPTIMIZE (avx2_rtm); + + if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)) + return OPTIMIZE (avx2); + } + + if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_1)) + return OPTIMIZE (sse4_1); + + return OPTIMIZE (sse2); +} diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index 39ab10613b..dd0c393c7d 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -38,6 +38,29 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, size_t i = 0; + /* Support sysdeps/x86_64/multiarch/bcmp.c. 
*/ + IFUNC_IMPL (i, name, bcmp, + IFUNC_IMPL_ADD (array, i, bcmp, + (CPU_FEATURE_USABLE (AVX2) + && CPU_FEATURE_USABLE (MOVBE) + && CPU_FEATURE_USABLE (BMI2)), + __bcmp_avx2) + IFUNC_IMPL_ADD (array, i, bcmp, + (CPU_FEATURE_USABLE (AVX2) + && CPU_FEATURE_USABLE (BMI2) + && CPU_FEATURE_USABLE (MOVBE) + && CPU_FEATURE_USABLE (RTM)), + __bcmp_avx2_rtm) + IFUNC_IMPL_ADD (array, i, bcmp, + (CPU_FEATURE_USABLE (AVX512VL) + && CPU_FEATURE_USABLE (AVX512BW) + && CPU_FEATURE_USABLE (MOVBE) + && CPU_FEATURE_USABLE (BMI2)), + __bcmp_evex) + IFUNC_IMPL_ADD (array, i, bcmp, CPU_FEATURE_USABLE (SSE4_1), + __bcmp_sse4_1) + IFUNC_IMPL_ADD (array, i, bcmp, 1, __bcmp_sse2)) + /* Support sysdeps/x86_64/multiarch/memchr.c. */ IFUNC_IMPL (i, name, memchr, IFUNC_IMPL_ADD (array, i, memchr, diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S index b135fa2d40..2a4867ad18 100644 --- a/sysdeps/x86_64/multiarch/memcmp-sse2.S +++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S @@ -17,7 +17,9 @@ . */ #if IS_IN (libc) -# define memcmp __memcmp_sse2 +# ifndef memcmp +# define memcmp __memcmp_sse2 +# endif # ifdef SHARED # undef libc_hidden_builtin_def diff --git a/sysdeps/x86_64/multiarch/memcmp.c b/sysdeps/x86_64/multiarch/memcmp.c index fe725f3563..1760e045df 100644 --- a/sysdeps/x86_64/multiarch/memcmp.c +++ b/sysdeps/x86_64/multiarch/memcmp.c @@ -27,8 +27,6 @@ # include "ifunc-memcmp.h" libc_ifunc_redirected (__redirect_memcmp, memcmp, IFUNC_SELECTOR ()); -# undef bcmp -weak_alias (memcmp, bcmp) # ifdef SHARED __hidden_ver1 (memcmp, __GI_memcmp, __redirect_memcmp)
From patchwork Tue Sep 14 06:30:36 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 44972
To: libc-alpha@sourceware.org
Subject: [PATCH v2 2/5] x86_64: Add sse2 optimized bcmp implementation in memcmp.S
Date: Tue, 14 Sep 2021 01:30:36 -0500
Message-Id: <20210914063039.1126196-2-goldstein.w.n@gmail.com>
In-Reply-To: <20210914063039.1126196-1-goldstein.w.n@gmail.com>
References: <20210913230506.546749-1-goldstein.w.n@gmail.com> <20210914063039.1126196-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

No bug. This commit does not modify any of the memcmp implementation. It
just adds bcmp ifdefs to skip obvious cases where computing the proper
1/-1 required by memcmp is not needed.

test-memcmp, test-bcmp, and test-wmemcmp are all passing.
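To make the reasoning above concrete: memcmp must report the sign of the
first mismatching byte, while bcmp only has to report equal versus not
equal, so any nonzero value may be returned as soon as any difference is
seen. The following C sketch is illustrative only (it is not part of the
patch, and the function name is made up); it shows the early-exit shape
that the USE_AS_BCMP paths take in assembly.

#include <stddef.h>
#include <string.h>

/* Illustrative only: a word-at-a-time comparison that exploits the bcmp
   contract.  On any mismatch it may return immediately with an arbitrary
   nonzero value; memcmp instead has to locate the first differing byte
   and return its signed difference.  */
static int
toy_bcmp (const void *s1, const void *s2, size_t n)
{
  const unsigned char *p1 = s1;
  const unsigned char *p2 = s2;

  while (n >= sizeof (unsigned long))
    {
      unsigned long w1, w2;
      memcpy (&w1, p1, sizeof w1);	/* Safe unaligned loads.  */
      memcpy (&w2, p2, sizeof w2);
      if (w1 != w2)
	return 1;			/* Any nonzero value is a valid bcmp result.  */
      p1 += sizeof (unsigned long);
      p2 += sizeof (unsigned long);
      n -= sizeof (unsigned long);
    }

  while (n-- != 0)			/* Byte tail.  */
    if (*p1++ != *p2++)
      return 1;

  return 0;
}

The "movl $1, %eax; ret" sequences added under USE_AS_BCMP below (for
example L(neq_early) and L(neq)) are the assembly counterpart of the early
return 1, and L(sub_return8) folds a 64-bit difference into a nonzero
32-bit result with sub/shr/or instead of locating the first differing byte.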
--- sysdeps/x86_64/memcmp.S | 55 ++++++++++++++++++++++++++++++++++++++--- 1 file changed, 51 insertions(+), 4 deletions(-) diff --git a/sysdeps/x86_64/memcmp.S b/sysdeps/x86_64/memcmp.S index dfd0269db2..21607e7c91 100644 --- a/sysdeps/x86_64/memcmp.S +++ b/sysdeps/x86_64/memcmp.S @@ -49,34 +49,63 @@ L(s2b): movzwl (%rdi), %eax movzwl (%rdi, %rsi), %edx subq $2, %r10 +#ifdef USE_AS_BCMP + je L(finz1) +#else je L(fin2_7) +#endif addq $2, %rdi cmpl %edx, %eax +#ifdef USE_AS_BCMP + jnz L(neq_early) +#else jnz L(fin2_7) +#endif L(s4b): testq $4, %r10 jz L(s8b) movl (%rdi), %eax movl (%rdi, %rsi), %edx subq $4, %r10 +#ifdef USE_AS_BCMP + je L(finz1) +#else je L(fin2_7) +#endif addq $4, %rdi cmpl %edx, %eax +#ifdef USE_AS_BCMP + jnz L(neq_early) +#else jnz L(fin2_7) +#endif L(s8b): testq $8, %r10 jz L(s16b) movq (%rdi), %rax movq (%rdi, %rsi), %rdx subq $8, %r10 +#ifdef USE_AS_BCMP + je L(sub_return8) +#else je L(fin2_7) +#endif addq $8, %rdi cmpq %rdx, %rax +#ifdef USE_AS_BCMP + jnz L(neq_early) +#else jnz L(fin2_7) +#endif L(s16b): movdqu (%rdi), %xmm1 movdqu (%rdi, %rsi), %xmm0 pcmpeqb %xmm0, %xmm1 +#ifdef USE_AS_BCMP + pmovmskb %xmm1, %eax + subl $0xffff, %eax + ret +#else pmovmskb %xmm1, %edx xorl %eax, %eax subl $0xffff, %edx @@ -86,7 +115,7 @@ L(s16b): movzbl (%rcx), %eax movzbl (%rsi, %rcx), %edx jmp L(finz1) - +#endif .p2align 4,, 4 L(finr1b): movzbl (%rdi), %eax @@ -95,7 +124,15 @@ L(finz1): subl %edx, %eax L(exit): ret - +#ifdef USE_AS_BCMP + .p2align 4,, 4 +L(sub_return8): + subq %rdx, %rax + movl %eax, %edx + shrq $32, %rax + orl %edx, %eax + ret +#else .p2align 4,, 4 L(fin2_7): cmpq %rdx, %rax @@ -111,12 +148,17 @@ L(fin2_7): movzbl %dl, %edx subl %edx, %eax ret - +#endif .p2align 4,, 4 L(finz): xorl %eax, %eax ret - +#ifdef USE_AS_BCMP + .p2align 4,, 4 +L(neq_early): + movl $1, %eax + ret +#endif /* For blocks bigger than 32 bytes 1. Advance one of the addr pointer to be 16B aligned. 2. 
Treat the case of both addr pointers aligned to 16B @@ -246,11 +288,16 @@ L(mt16): .p2align 4,, 4 L(neq): +#ifdef USE_AS_BCMP + movl $1, %eax + ret +#else bsfl %edx, %ecx movzbl (%rdi, %rcx), %eax addq %rdi, %rsi movzbl (%rsi,%rcx), %edx jmp L(finz1) +#endif .p2align 4,, 4 L(ATR):
From patchwork Tue Sep 14 06:30:37 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 44974
To: libc-alpha@sourceware.org
Subject: [PATCH v2 3/5] x86_64: Add sse4_1 optimized bcmp implementation in memcmp-sse4.S
Date: Tue, 14 Sep 2021 01:30:37 -0500
Message-Id: <20210914063039.1126196-3-goldstein.w.n@gmail.com>
In-Reply-To: <20210914063039.1126196-1-goldstein.w.n@gmail.com>
References: <20210913230506.546749-1-goldstein.w.n@gmail.com> <20210914063039.1126196-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

No bug. This commit does not modify any of the memcmp implementation. It
just adds bcmp ifdefs to skip obvious cases where computing the proper
1/-1 required by memcmp is not needed.

test-memcmp, test-bcmp, and test-wmemcmp are all passing.
---
 sysdeps/x86_64/multiarch/memcmp-sse4.S | 761 ++++++++++++++++++++++++-
 1 file changed, 746 insertions(+), 15 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-sse4.S b/sysdeps/x86_64/multiarch/memcmp-sse4.S index b82adcd5fa..b9528ed58e 100644 --- a/sysdeps/x86_64/multiarch/memcmp-sse4.S +++ b/sysdeps/x86_64/multiarch/memcmp-sse4.S @@ -72,7 +72,11 @@ L(79bytesormore): movdqu (%rdi), %xmm2 pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif mov %rsi, %rcx and $-16, %rsi add $16, %rsi @@ -91,34 +95,58 @@ L(less128bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqu 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqu 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif cmp $32, %rdx jb L(less32bytesin64) movdqu 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqu 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -140,42 +168,74 @@ L(less256bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +#
else jnc L(32bytesin256) +# endif movdqu 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqu 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif movdqu 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqu 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif movdqu 96(%rdi), %xmm2 pxor 96(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(112bytesin256) +# endif movdqu 112(%rdi), %xmm2 pxor 112(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(128bytesin256) +# endif add $128, %rsi add $128, %rdi @@ -189,12 +249,20 @@ L(less256bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -208,82 +276,146 @@ L(less512bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqu 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqu 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif movdqu 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqu 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif movdqu 96(%rdi), %xmm2 pxor 96(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(112bytesin256) +# endif movdqu 112(%rdi), %xmm2 pxor 112(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(128bytesin256) +# endif movdqu 128(%rdi), %xmm2 pxor 128(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(144bytesin256) +# endif movdqu 144(%rdi), %xmm2 pxor 144(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(160bytesin256) +# endif movdqu 160(%rdi), %xmm2 pxor 160(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(176bytesin256) +# endif movdqu 176(%rdi), %xmm2 pxor 176(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(192bytesin256) +# endif movdqu 192(%rdi), %xmm2 pxor 192(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(208bytesin256) +# endif movdqu 208(%rdi), %xmm2 pxor 208(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(224bytesin256) +# endif movdqu 224(%rdi), %xmm2 pxor 224(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc 
L(return_not_equals) +# else jnc L(240bytesin256) +# endif movdqu 240(%rdi), %xmm2 pxor 240(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(256bytesin256) +# endif add $256, %rsi add $256, %rdi @@ -300,12 +432,20 @@ L(less512bytes): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -346,7 +486,11 @@ L(64bytesormore_loop): por %xmm5, %xmm1 ptest %xmm1, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesormore_loop_end) +# endif add $64, %rsi add $64, %rdi sub $64, %rdx @@ -380,7 +524,11 @@ L(L2_L3_unaligned_128bytes_loop): por %xmm5, %xmm1 ptest %xmm1, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesormore_loop_end) +# endif add $64, %rsi add $64, %rdi sub $64, %rdx @@ -404,34 +552,58 @@ L(less128bytesin2aligned): movdqa (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqa 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqa 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqa 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif cmp $32, %rdx jb L(less32bytesin64in2alinged) movdqa 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqa 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -454,42 +626,74 @@ L(less256bytesin2alinged): movdqa (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqa 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqa 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqa 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif movdqa 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqa 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif movdqa 96(%rdi), %xmm2 pxor 96(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(112bytesin256) +# endif movdqa 112(%rdi), %xmm2 pxor 112(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(128bytesin256) +# endif add $128, %rsi add $128, %rdi @@ -503,12 +707,20 @@ L(less256bytesin2alinged): movdqu (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqu 16(%rdi), 
%xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -524,82 +736,146 @@ L(256bytesormorein2aligned): movdqa (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqa 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif movdqa 32(%rdi), %xmm2 pxor 32(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(48bytesin256) +# endif movdqa 48(%rdi), %xmm2 pxor 48(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesin256) +# endif movdqa 64(%rdi), %xmm2 pxor 64(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(80bytesin256) +# endif movdqa 80(%rdi), %xmm2 pxor 80(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(96bytesin256) +# endif movdqa 96(%rdi), %xmm2 pxor 96(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(112bytesin256) +# endif movdqa 112(%rdi), %xmm2 pxor 112(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(128bytesin256) +# endif movdqa 128(%rdi), %xmm2 pxor 128(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(144bytesin256) +# endif movdqa 144(%rdi), %xmm2 pxor 144(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(160bytesin256) +# endif movdqa 160(%rdi), %xmm2 pxor 160(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(176bytesin256) +# endif movdqa 176(%rdi), %xmm2 pxor 176(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(192bytesin256) +# endif movdqa 192(%rdi), %xmm2 pxor 192(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(208bytesin256) +# endif movdqa 208(%rdi), %xmm2 pxor 208(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(224bytesin256) +# endif movdqa 224(%rdi), %xmm2 pxor 224(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(240bytesin256) +# endif movdqa 240(%rdi), %xmm2 pxor 240(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(256bytesin256) +# endif add $256, %rsi add $256, %rdi @@ -616,12 +892,20 @@ L(256bytesormorein2aligned): movdqa (%rdi), %xmm2 pxor (%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(16bytesin256) +# endif movdqa 16(%rdi), %xmm2 pxor 16(%rsi), %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(32bytesin256) +# endif sub $32, %rdx add $32, %rdi add $32, %rsi @@ -663,7 +947,11 @@ L(64bytesormore_loopin2aligned): por %xmm5, %xmm1 ptest %xmm1, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesormore_loop_end) +# endif add $64, %rsi add $64, %rdi sub $64, %rdx @@ -697,7 +985,11 @@ L(L2_L3_aligned_128bytes_loop): por %xmm5, %xmm1 ptest %xmm1, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(64bytesormore_loop_end) +# endif add $64, %rsi add $64, %rdi sub $64, %rdx @@ -708,7 +1000,7 @@ L(L2_L3_aligned_128bytes_loop): add %rdx, %rdi 
BRANCH_TO_JMPTBL_ENTRY(L(table_64bytes), %rdx, 4) - +# ifndef USE_AS_BCMP .p2align 4 L(64bytesormore_loop_end): add $16, %rdi @@ -791,17 +1083,29 @@ L(32bytesin256): L(16bytesin256): add $16, %rdi add $16, %rsi +# endif L(16bytes): mov -16(%rdi), %rax mov -16(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(8bytes): mov -8(%rdi), %rax mov -8(%rsi), %rcx +# ifdef USE_AS_BCMP + sub %rcx, %rax + mov %rax, %rcx + shr $32, %rcx + or %ecx, %eax +# else cmp %rax, %rcx jne L(diffin8bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -809,16 +1113,26 @@ L(12bytes): mov -12(%rdi), %rax mov -12(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(4bytes): mov -4(%rsi), %ecx -# ifndef USE_AS_WMEMCMP +# ifdef USE_AS_BCMP mov -4(%rdi), %eax - cmp %eax, %ecx + sub %ecx, %eax + ret # else +# ifndef USE_AS_WMEMCMP + mov -4(%rdi), %eax + cmp %eax, %ecx +# else cmp -4(%rdi), %ecx -# endif +# endif jne L(diffin4bytes) +# endif L(0bytes): xor %eax, %eax ret @@ -832,31 +1146,51 @@ L(65bytes): mov $-65, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(49bytes): movdqu -49(%rdi), %xmm1 movdqu -49(%rsi), %xmm2 mov $-49, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(33bytes): movdqu -33(%rdi), %xmm1 movdqu -33(%rsi), %xmm2 mov $-33, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(17bytes): mov -17(%rdi), %rax mov -17(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(9bytes): mov -9(%rdi), %rax mov -9(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif movzbl -1(%rdi), %eax movzbl -1(%rsi), %edx sub %edx, %eax @@ -867,12 +1201,23 @@ L(13bytes): mov -13(%rdi), %rax mov -13(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx +# ifdef USE_AS_BCMP + sub %rcx, %rax + mov %rax, %rcx + shr $32, %rcx + or %ecx, %eax +# else cmp %rax, %rcx jne L(diffin8bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -880,7 +1225,11 @@ L(5bytes): mov -5(%rdi), %eax mov -5(%rsi), %ecx cmp %eax, %ecx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin4bytes) +# endif movzbl -1(%rdi), %eax movzbl -1(%rsi), %edx sub %edx, %eax @@ -893,37 +1242,59 @@ L(66bytes): mov $-66, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(50bytes): movdqu -50(%rdi), %xmm1 movdqu -50(%rsi), %xmm2 mov $-50, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(34bytes): movdqu -34(%rdi), %xmm1 movdqu -34(%rsi), %xmm2 mov $-34, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(18bytes): mov -18(%rdi), %rax mov -18(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(10bytes): mov -10(%rdi), %rax mov -10(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif movzwl -2(%rdi), %eax movzwl -2(%rsi), %ecx +# ifndef USE_AS_BCMP cmp %cl, %al jne 
L(end) and $0xffff, %eax and $0xffff, %ecx +# endif sub %ecx, %eax ret @@ -932,12 +1303,23 @@ L(14bytes): mov -14(%rdi), %rax mov -14(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx +# ifdef USE_AS_BCMP + sub %rcx, %rax + mov %rax, %rcx + shr $32, %rcx + or %ecx, %eax +# else cmp %rax, %rcx jne L(diffin8bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -945,14 +1327,20 @@ L(6bytes): mov -6(%rdi), %eax mov -6(%rsi), %ecx cmp %eax, %ecx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin4bytes) +# endif L(2bytes): movzwl -2(%rsi), %ecx movzwl -2(%rdi), %eax +# ifndef USE_AS_BCMP cmp %cl, %al jne L(end) and $0xffff, %eax and $0xffff, %ecx +# endif sub %ecx, %eax ret @@ -963,36 +1351,60 @@ L(67bytes): mov $-67, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(51bytes): movdqu -51(%rdi), %xmm2 movdqu -51(%rsi), %xmm1 mov $-51, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(35bytes): movdqu -35(%rsi), %xmm1 movdqu -35(%rdi), %xmm2 mov $-35, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(19bytes): mov -19(%rdi), %rax mov -19(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif L(11bytes): mov -11(%rdi), %rax mov -11(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -4(%rdi), %eax mov -4(%rsi), %ecx +# ifdef USE_AS_BCMP + sub %ecx, %eax +# else cmp %eax, %ecx jne L(diffin4bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -1000,12 +1412,23 @@ L(15bytes): mov -15(%rdi), %rax mov -15(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx +# ifdef USE_AS_BCMP + sub %rcx, %rax + mov %rax, %rcx + shr $32, %rcx + or %ecx, %eax +# else cmp %rax, %rcx jne L(diffin8bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -1013,12 +1436,20 @@ L(7bytes): mov -7(%rdi), %eax mov -7(%rsi), %ecx cmp %eax, %ecx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin4bytes) +# endif mov -4(%rdi), %eax mov -4(%rsi), %ecx +# ifdef USE_AS_BCMP + sub %ecx, %eax +# else cmp %eax, %ecx jne L(diffin4bytes) xor %eax, %eax +# endif ret .p2align 4 @@ -1026,7 +1457,11 @@ L(3bytes): movzwl -3(%rdi), %eax movzwl -3(%rsi), %ecx cmp %eax, %ecx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin2bytes) +# endif L(1bytes): movzbl -1(%rdi), %eax movzbl -1(%rsi), %ecx @@ -1041,38 +1476,58 @@ L(68bytes): mov $-68, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(52bytes): movdqu -52(%rdi), %xmm2 movdqu -52(%rsi), %xmm1 mov $-52, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(36bytes): movdqu -36(%rdi), %xmm2 movdqu -36(%rsi), %xmm1 mov $-36, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(20bytes): movdqu -20(%rdi), %xmm2 movdqu -20(%rsi), %xmm1 mov $-20, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -4(%rsi), %ecx - -# ifndef 
USE_AS_WMEMCMP +# ifdef USE_AS_BCMP mov -4(%rdi), %eax - cmp %eax, %ecx + sub %ecx, %eax # else +# ifndef USE_AS_WMEMCMP + mov -4(%rdi), %eax + cmp %eax, %ecx +# else cmp -4(%rdi), %ecx -# endif +# endif jne L(diffin4bytes) xor %eax, %eax +# endif ret # ifndef USE_AS_WMEMCMP @@ -1084,32 +1539,52 @@ L(69bytes): mov $-69, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(53bytes): movdqu -53(%rsi), %xmm1 movdqu -53(%rdi), %xmm2 mov $-53, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(37bytes): movdqu -37(%rsi), %xmm1 movdqu -37(%rdi), %xmm2 mov $-37, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(21bytes): movdqu -21(%rsi), %xmm1 movdqu -21(%rdi), %xmm2 mov $-21, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1120,32 +1595,52 @@ L(70bytes): mov $-70, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(54bytes): movdqu -54(%rsi), %xmm1 movdqu -54(%rdi), %xmm2 mov $-54, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(38bytes): movdqu -38(%rsi), %xmm1 movdqu -38(%rdi), %xmm2 mov $-38, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(22bytes): movdqu -22(%rsi), %xmm1 movdqu -22(%rdi), %xmm2 mov $-22, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1156,32 +1651,52 @@ L(71bytes): mov $-71, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(55bytes): movdqu -55(%rdi), %xmm2 movdqu -55(%rsi), %xmm1 mov $-55, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(39bytes): movdqu -39(%rdi), %xmm2 movdqu -39(%rsi), %xmm1 mov $-39, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(23bytes): movdqu -23(%rdi), %xmm2 movdqu -23(%rsi), %xmm1 mov $-23, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret # endif @@ -1193,33 +1708,53 @@ L(72bytes): mov $-72, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(56bytes): movdqu -56(%rdi), %xmm2 movdqu -56(%rsi), %xmm1 mov $-56, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(40bytes): movdqu -40(%rdi), %xmm2 movdqu -40(%rsi), %xmm1 mov $-40, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# 
else jnc L(less16bytes) +# endif L(24bytes): movdqu -24(%rdi), %xmm2 movdqu -24(%rsi), %xmm1 mov $-24, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -8(%rsi), %rcx mov -8(%rdi), %rax cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1232,32 +1767,52 @@ L(73bytes): mov $-73, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(57bytes): movdqu -57(%rdi), %xmm2 movdqu -57(%rsi), %xmm1 mov $-57, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(41bytes): movdqu -41(%rdi), %xmm2 movdqu -41(%rsi), %xmm1 mov $-41, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(25bytes): movdqu -25(%rdi), %xmm2 movdqu -25(%rsi), %xmm1 mov $-25, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -9(%rdi), %rax mov -9(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif movzbl -1(%rdi), %eax movzbl -1(%rsi), %ecx sub %ecx, %eax @@ -1270,35 +1825,60 @@ L(74bytes): mov $-74, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(58bytes): movdqu -58(%rdi), %xmm2 movdqu -58(%rsi), %xmm1 mov $-58, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(42bytes): movdqu -42(%rdi), %xmm2 movdqu -42(%rsi), %xmm1 mov $-42, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(26bytes): movdqu -26(%rdi), %xmm2 movdqu -26(%rsi), %xmm1 mov $-26, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -10(%rdi), %rax mov -10(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif movzwl -2(%rdi), %eax movzwl -2(%rsi), %ecx +# ifdef USE_AS_BCMP + sub %ecx, %eax + ret +# else jmp L(diffin2bytes) +# endif .p2align 4 L(75bytes): @@ -1307,37 +1887,61 @@ L(75bytes): mov $-75, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(59bytes): movdqu -59(%rdi), %xmm2 movdqu -59(%rsi), %xmm1 mov $-59, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(43bytes): movdqu -43(%rdi), %xmm2 movdqu -43(%rsi), %xmm1 mov $-43, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(27bytes): movdqu -27(%rdi), %xmm2 movdqu -27(%rsi), %xmm1 mov $-27, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -11(%rdi), %rax mov -11(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -4(%rdi), %eax mov -4(%rsi), %ecx +# ifdef USE_AS_BCMP + sub %ecx, %eax +# else cmp %eax, %ecx jne L(diffin4bytes) xor %eax, %eax +# endif ret # endif .p2align 4 @@ -1347,41 +1951,66 @@ L(76bytes): mov $-76, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + 
jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(60bytes): movdqu -60(%rdi), %xmm2 movdqu -60(%rsi), %xmm1 mov $-60, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(44bytes): movdqu -44(%rdi), %xmm2 movdqu -44(%rsi), %xmm1 mov $-44, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(28bytes): movdqu -28(%rdi), %xmm2 movdqu -28(%rsi), %xmm1 mov $-28, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -12(%rdi), %rax mov -12(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -4(%rsi), %ecx -# ifndef USE_AS_WMEMCMP +# ifdef USE_AS_BCMP mov -4(%rdi), %eax - cmp %eax, %ecx + sub %ecx, %eax # else +# ifndef USE_AS_WMEMCMP + mov -4(%rdi), %eax + cmp %eax, %ecx +# else cmp -4(%rdi), %ecx -# endif +# endif jne L(diffin4bytes) xor %eax, %eax +# endif ret # ifndef USE_AS_WMEMCMP @@ -1393,38 +2022,62 @@ L(77bytes): mov $-77, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(61bytes): movdqu -61(%rdi), %xmm2 movdqu -61(%rsi), %xmm1 mov $-61, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(45bytes): movdqu -45(%rdi), %xmm2 movdqu -45(%rsi), %xmm1 mov $-45, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(29bytes): movdqu -29(%rdi), %xmm2 movdqu -29(%rsi), %xmm1 mov $-29, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -13(%rdi), %rax mov -13(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1435,36 +2088,60 @@ L(78bytes): mov $-78, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(62bytes): movdqu -62(%rdi), %xmm2 movdqu -62(%rsi), %xmm1 mov $-62, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(46bytes): movdqu -46(%rdi), %xmm2 movdqu -46(%rsi), %xmm1 mov $-46, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(30bytes): movdqu -30(%rdi), %xmm2 movdqu -30(%rsi), %xmm1 mov $-30, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -14(%rdi), %rax mov -14(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret @@ -1475,36 +2152,60 @@ L(79bytes): mov $-79, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(63bytes): movdqu -63(%rdi), %xmm2 movdqu -63(%rsi), %xmm1 mov $-63, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) 
+# endif L(47bytes): movdqu -47(%rdi), %xmm2 movdqu -47(%rsi), %xmm1 mov $-47, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(31bytes): movdqu -31(%rdi), %xmm2 movdqu -31(%rsi), %xmm1 mov $-31, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -15(%rdi), %rax mov -15(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret # endif @@ -1515,37 +2216,58 @@ L(64bytes): mov $-64, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(48bytes): movdqu -48(%rdi), %xmm2 movdqu -48(%rsi), %xmm1 mov $-48, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif L(32bytes): movdqu -32(%rdi), %xmm2 movdqu -32(%rsi), %xmm1 mov $-32, %dl pxor %xmm1, %xmm2 ptest %xmm2, %xmm0 +# ifdef USE_AS_BCMP + jnc L(return_not_equals) +# else jnc L(less16bytes) +# endif mov -16(%rdi), %rax mov -16(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif mov -8(%rdi), %rax mov -8(%rsi), %rcx cmp %rax, %rcx +# ifdef USE_AS_BCMP + jne L(return_not_equals) +# else jne L(diffin8bytes) +# endif xor %eax, %eax ret /* * Aligned 8 bytes to avoid 2 branch "taken" in one 16 alinged code block. */ +# ifndef USE_AS_BCMP .p2align 3 L(less16bytes): movsbq %dl, %rdx @@ -1561,16 +2283,16 @@ L(diffin8bytes): shr $32, %rcx shr $32, %rax -# ifdef USE_AS_WMEMCMP +# ifdef USE_AS_WMEMCMP /* for wmemcmp */ cmp %eax, %ecx jne L(diffin4bytes) xor %eax, %eax ret -# endif +# endif L(diffin4bytes): -# ifndef USE_AS_WMEMCMP +# ifndef USE_AS_WMEMCMP cmp %cx, %ax jne L(diffin2bytes) shr $16, %ecx @@ -1589,7 +2311,7 @@ L(end): and $0xff, %ecx sub %ecx, %eax ret -# else +# else /* for wmemcmp */ mov $1, %eax @@ -1601,6 +2323,15 @@ L(end): L(nequal_bigger): ret +L(unreal_case): + xor %eax, %eax + ret +# endif +# else + .p2align 4 +L(return_not_equals): + mov $1, %eax + ret L(unreal_case): xor %eax, %eax ret From patchwork Tue Sep 14 06:30:38 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 44973 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 8F64C3858039 for ; Tue, 14 Sep 2021 06:32:43 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8F64C3858039 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1631601163; bh=kdvQOjsgXLyf0hgfOQw6tpKx/23I+8dDjwajFRnbWe4=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=vQOfGdV6GF20wvKkRLy1R4bUICtIJo86bhR8I5qlDv6IXEWrLdLqr7/isMv9uH0LW Dq+ljjxhKqU62gKoT2Ct0B0EahOXgLth6aGyIJHHN6LPwglWKpvpBVRNsAnEQP2qO0 t0twUD+nrccKGVr+HqHPY1eh6UPpnFl2gDVVw8Xk= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-io1-xd2f.google.com (mail-io1-xd2f.google.com [IPv6:2607:f8b0:4864:20::d2f]) by sourceware.org (Postfix) with ESMTPS id A634F385801D for ; 
Tue, 14 Sep 2021 06:30:55 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org A634F385801D Received: by mail-io1-xd2f.google.com with SMTP id j18so15514303ioj.8 for ; Mon, 13 Sep 2021 23:30:55 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=kdvQOjsgXLyf0hgfOQw6tpKx/23I+8dDjwajFRnbWe4=; b=A7rZWDmDlwqKciEZ1+nep7n36bdx/dBOKrqhMThUerURYF9iniQFFb5yfPQRjjmLiV goadYk/cReEi7E1fq1JMOLiwvfasWZO0WGRuEdN4LWRyrGlXZ3LqXGri8qJgZTtL06St w2gIESQylL4SRDOJnDBtA0RfH1Pk/ZcCpPDOWGLJa3sBqxt3NizFTPjclW2g/2ls/xBO xGzI0GrBT6Jfu1Yrb7URfWD599LXGuCNpeKtqi4hNYIbWIKspD3acn7xSPZDuCtHPBI5 HkzpOAWP0tqxf+ifhWOWF5oIM/Yt/8FEHbPLDlSMJQKT8dzGiLzlKdfNhPGnEVHTWLnU IUtA== X-Gm-Message-State: AOAM530T49T/IatnZZmJjkCogpIuZiW6hDb35/0oL5BaHy2S6Nll7Fn3 B0IqOlNh9B/drq8WJAh/SG1xfI4Ktu0ssA== X-Google-Smtp-Source: ABdhPJwqh4MGM+Y2/3NoR+k0dpeDtF9D9M/BbELcGN7ox0gpAxVX2JQmZ/o+04M5HiRUbYJAJWpMDw== X-Received: by 2002:a05:6602:2001:: with SMTP id y1mr12407620iod.97.1631601054742; Mon, 13 Sep 2021 23:30:54 -0700 (PDT) Received: from localhost.localdomain (node-17-161.flex.volo.net. [76.191.17.161]) by smtp.googlemail.com with ESMTPSA id b10sm6101328ils.13.2021.09.13.23.30.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 13 Sep 2021 23:30:54 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v2 4/5] x86_64: Add avx2 optimized bcmp implementation in bcmp-avx2.S Date: Tue, 14 Sep 2021 01:30:38 -0500 Message-Id: <20210914063039.1126196-4-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20210914063039.1126196-1-goldstein.w.n@gmail.com> References: <20210913230506.546749-1-goldstein.w.n@gmail.com> <20210914063039.1126196-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-11.4 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, KAM_STOCKGEN, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" No bug. This commit adds a new optimized bcmp implementation for avx2. The primary optimizations are 1) skipping the logic to find the difference of the first mismatched byte and 2) not updating the src/dst addresses, as the not-equal logic does not need to be reused by different code paths. The entry alignment has been fixed at 64 bytes. In throughput-sensitive functions, where bcmp can potentially be on the hot path, frontend and loop performance is important to optimize for. This is impossible or difficult to achieve and maintain with only a 16-byte fixed entry alignment. test-memcmp, test-bcmp, and test-wmemcmp are all passing. 
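For context, the simplification both the avx2 and evex versions exploit is bcmp's weaker contract: memcmp must return a value ordered by the first mismatching byte, so it has to locate that byte, while bcmp only has to report whether the buffers differ, so any nonzero value (for example the raw vector-compare result, or the constant 1 used by L(return_not_equals) in the sse2/sse4 version) is acceptable. A minimal C sketch of the two contracts, illustrative only and not glibc's implementation:

    #include <stdio.h>
    #include <string.h>   /* memcmp */
    #include <strings.h>  /* bcmp */

    int
    main (void)
    {
      const char a[] = "abcdef";
      const char b[] = "abcxef";

      /* memcmp must locate the first mismatched byte pair and order by
         it: 'd' < 'x', so the result here is negative.  */
      printf ("memcmp: %d\n", memcmp (a, b, sizeof a));

      /* bcmp only reports equality; the exact nonzero value is
         unspecified, which is what lets the assembly skip the
         mismatch-locating logic entirely.  */
      printf ("bcmp != 0: %d\n", bcmp (a, b, sizeof a) != 0);
      return 0;
    }

That weaker contract is also why the loop below can reduce four vector compares with vpand and return the combined mask without ever computing a byte difference.
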
--- sysdeps/x86/sysdep.h | 6 +- sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S | 4 +- sysdeps/x86_64/multiarch/bcmp-avx2.S | 304 ++++++++++++++++++++- sysdeps/x86_64/multiarch/ifunc-bcmp.h | 4 +- sysdeps/x86_64/multiarch/ifunc-impl-list.c | 2 - 5 files changed, 308 insertions(+), 12 deletions(-) diff --git a/sysdeps/x86/sysdep.h b/sysdeps/x86/sysdep.h index cac1d762fb..4895179c10 100644 --- a/sysdeps/x86/sysdep.h +++ b/sysdeps/x86/sysdep.h @@ -78,15 +78,17 @@ enum cf_protection_level #define ASM_SIZE_DIRECTIVE(name) .size name,.-name; /* Define an entry point visible from C. */ -#define ENTRY(name) \ +#define ENTRY_P2ALIGN(name, alignment) \ .globl C_SYMBOL_NAME(name); \ .type C_SYMBOL_NAME(name),@function; \ - .align ALIGNARG(4); \ + .align ALIGNARG(alignment); \ C_LABEL(name) \ cfi_startproc; \ _CET_ENDBR; \ CALL_MCOUNT +#define ENTRY(name) ENTRY_P2ALIGN (name, 4) + #undef END #define END(name) \ cfi_endproc; \ diff --git a/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S b/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S index d742257e4e..28976daff0 100644 --- a/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S +++ b/sysdeps/x86_64/multiarch/bcmp-avx2-rtm.S @@ -1,5 +1,5 @@ -#ifndef MEMCMP -# define MEMCMP __bcmp_avx2_rtm +#ifndef BCMP +# define BCMP __bcmp_avx2_rtm #endif #define ZERO_UPPER_VEC_REGISTERS_RETURN \ diff --git a/sysdeps/x86_64/multiarch/bcmp-avx2.S b/sysdeps/x86_64/multiarch/bcmp-avx2.S index 93a9a20b17..eb77ae5c4a 100644 --- a/sysdeps/x86_64/multiarch/bcmp-avx2.S +++ b/sysdeps/x86_64/multiarch/bcmp-avx2.S @@ -16,8 +16,304 @@ License along with the GNU C Library; if not, see . */ -#ifndef MEMCMP -# define MEMCMP __bcmp_avx2 -#endif +#if IS_IN (libc) + +/* bcmp is implemented as: + 1. Use ymm vector compares when possible. The only case where + vector compares is not possible for when size < VEC_SIZE + and loading from either s1 or s2 would cause a page cross. + 2. Use xmm vector compare when size >= 8 bytes. + 3. Optimistically compare up to first 4 * VEC_SIZE one at a + to check for early mismatches. Only do this if its guranteed the + work is not wasted. + 4. If size is 8 * VEC_SIZE or less, unroll the loop. + 5. Compare 4 * VEC_SIZE at a time with the aligned first memory + area. + 6. Use 2 vector compares when size is 2 * VEC_SIZE or less. + 7. Use 4 vector compares when size is 4 * VEC_SIZE or less. + 8. Use 8 vector compares when size is 8 * VEC_SIZE or less. */ + +# include + +# ifndef BCMP +# define BCMP __bcmp_avx2 +# endif + +# define VPCMPEQ vpcmpeqb + +# ifndef VZEROUPPER +# define VZEROUPPER vzeroupper +# endif + +# ifndef SECTION +# define SECTION(p) p##.avx +# endif + +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (BCMP, 6) +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %edx, %edx +# endif + cmp $VEC_SIZE, %RDX_LP + jb L(less_vec) + + /* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */ + vmovdqu (%rsi), %ymm1 + VPCMPEQ (%rdi), %ymm1, %ymm1 + vpmovmskb %ymm1, %eax + incl %eax + jnz L(return_neq0) + cmpq $(VEC_SIZE * 2), %rdx + jbe L(last_1x_vec) + + /* Check second VEC no matter what. */ + vmovdqu VEC_SIZE(%rsi), %ymm2 + VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2 + vpmovmskb %ymm2, %eax + /* If all 4 VEC where equal eax will be all 1s so incl will overflow + and set zero flag. */ + incl %eax + jnz L(return_neq0) + + /* Less than 4 * VEC. */ + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_2x_vec) + + /* Check third and fourth VEC no matter what. 
*/ + vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3 + VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3 + vpmovmskb %ymm3, %eax + incl %eax + jnz L(return_neq0) + + vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4 + VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4 + vpmovmskb %ymm4, %eax + incl %eax + jnz L(return_neq0) + + /* Go to 4x VEC loop. */ + cmpq $(VEC_SIZE * 8), %rdx + ja L(more_8x_vec) + + /* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any + branches. */ + + /* Adjust rsi and rdi to avoid indexed address mode. This end up + saving a 16 bytes of code, prevents unlamination, and bottlenecks in + the AGU. */ + addq %rdx, %rsi + vmovdqu -(VEC_SIZE * 4)(%rsi), %ymm1 + vmovdqu -(VEC_SIZE * 3)(%rsi), %ymm2 + addq %rdx, %rdi + + VPCMPEQ -(VEC_SIZE * 4)(%rdi), %ymm1, %ymm1 + VPCMPEQ -(VEC_SIZE * 3)(%rdi), %ymm2, %ymm2 + + vmovdqu -(VEC_SIZE * 2)(%rsi), %ymm3 + VPCMPEQ -(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3 + vmovdqu -VEC_SIZE(%rsi), %ymm4 + VPCMPEQ -VEC_SIZE(%rdi), %ymm4, %ymm4 -#include "bcmp-avx2.S" + /* Reduce VEC0 - VEC4. */ + vpand %ymm1, %ymm2, %ymm2 + vpand %ymm3, %ymm4, %ymm4 + vpand %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %eax + incl %eax +L(return_neq0): +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + + /* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32 byte + aligned. */ + .p2align 5 +L(less_vec): + /* Check if one or less char. This is necessary for size = 0 but is + also faster for size = 1. */ + cmpl $1, %edx + jbe L(one_or_less) + + /* Check if loading one VEC from either s1 or s2 could cause a page + cross. This can have false positives but is by far the fastest + method. */ + movl %edi, %eax + orl %esi, %eax + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + jg L(page_cross_less_vec) + + /* No page cross possible. */ + vmovdqu (%rsi), %ymm2 + VPCMPEQ (%rdi), %ymm2, %ymm2 + vpmovmskb %ymm2, %eax + incl %eax + /* Result will be zero if s1 and s2 match. Otherwise first set bit + will be first mismatch. */ + bzhil %edx, %eax, %eax + VZEROUPPER_RETURN + + /* Relatively cold but placing close to L(less_vec) for 2 byte jump + encoding. */ + .p2align 4 +L(one_or_less): + jb L(zero) + movzbl (%rsi), %ecx + movzbl (%rdi), %eax + subl %ecx, %eax + /* No ymm register was touched. */ + ret + /* Within the same 16 byte block is L(one_or_less). */ +L(zero): + xorl %eax, %eax + ret + + .p2align 4 +L(last_1x_vec): + vmovdqu -(VEC_SIZE * 1)(%rsi, %rdx), %ymm1 + VPCMPEQ -(VEC_SIZE * 1)(%rdi, %rdx), %ymm1, %ymm1 + vpmovmskb %ymm1, %eax + incl %eax + VZEROUPPER_RETURN + + .p2align 4 +L(last_2x_vec): + vmovdqu -(VEC_SIZE * 2)(%rsi, %rdx), %ymm1 + VPCMPEQ -(VEC_SIZE * 2)(%rdi, %rdx), %ymm1, %ymm1 + vmovdqu -(VEC_SIZE * 1)(%rsi, %rdx), %ymm2 + VPCMPEQ -(VEC_SIZE * 1)(%rdi, %rdx), %ymm2, %ymm2 + vpand %ymm1, %ymm2, %ymm2 + vpmovmskb %ymm2, %eax + incl %eax + VZEROUPPER_RETURN + + .p2align 4 +L(more_8x_vec): + /* Set end of s1 in rdx. */ + leaq -(VEC_SIZE * 4)(%rdi, %rdx), %rdx + /* rsi stores s2 - s1. This allows loop to only update one pointer. + */ + subq %rdi, %rsi + /* Align s1 pointer. */ + andq $-VEC_SIZE, %rdi + /* Adjust because first 4x vec where check already. */ + subq $-(VEC_SIZE * 4), %rdi + .p2align 4 +L(loop_4x_vec): + /* rsi has s2 - s1 so get correct address by adding s1 (in rdi). 
*/ + vmovdqu (%rsi, %rdi), %ymm1 + VPCMPEQ (%rdi), %ymm1, %ymm1 + + vmovdqu VEC_SIZE(%rsi, %rdi), %ymm2 + VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2 + + vmovdqu (VEC_SIZE * 2)(%rsi, %rdi), %ymm3 + VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3 + + vmovdqu (VEC_SIZE * 3)(%rsi, %rdi), %ymm4 + VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4 + + vpand %ymm1, %ymm2, %ymm2 + vpand %ymm3, %ymm4, %ymm4 + vpand %ymm2, %ymm4, %ymm4 + vpmovmskb %ymm4, %eax + incl %eax + jnz L(return_neq1) + subq $-(VEC_SIZE * 4), %rdi + /* Check if s1 pointer at end. */ + cmpq %rdx, %rdi + jb L(loop_4x_vec) + + vmovdqu (VEC_SIZE * 3)(%rsi, %rdx), %ymm4 + VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm4, %ymm4 + subq %rdx, %rdi + /* rdi has 4 * VEC_SIZE - remaining length. */ + cmpl $(VEC_SIZE * 3), %edi + jae L(8x_last_1x_vec) + /* Load regardless of branch. */ + vmovdqu (VEC_SIZE * 2)(%rsi, %rdx), %ymm3 + VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm3, %ymm3 + cmpl $(VEC_SIZE * 2), %edi + jae L(8x_last_2x_vec) + /* Check last 4 VEC. */ + vmovdqu VEC_SIZE(%rsi, %rdx), %ymm1 + VPCMPEQ VEC_SIZE(%rdx), %ymm1, %ymm1 + + vmovdqu (%rsi, %rdx), %ymm2 + VPCMPEQ (%rdx), %ymm2, %ymm2 + + vpand %ymm3, %ymm4, %ymm4 + vpand %ymm1, %ymm2, %ymm3 +L(8x_last_2x_vec): + vpand %ymm3, %ymm4, %ymm4 +L(8x_last_1x_vec): + vpmovmskb %ymm4, %eax + /* Restore s1 pointer to rdi. */ + incl %eax +L(return_neq1): + VZEROUPPER_RETURN + + /* Relatively cold case as page cross are unexpected. */ + .p2align 4 +L(page_cross_less_vec): + cmpl $16, %edx + jae L(between_16_31) + cmpl $8, %edx + ja L(between_9_15) + cmpl $4, %edx + jb L(between_2_3) + /* From 4 to 8 bytes. No branch when size == 4. */ + movl (%rdi), %eax + movl (%rsi), %ecx + subl %ecx, %eax + movl -4(%rdi, %rdx), %ecx + movl -4(%rsi, %rdx), %esi + subl %esi, %ecx + orl %ecx, %eax + ret + + .p2align 4,, 8 +L(between_9_15): + vmovq (%rdi), %xmm1 + vmovq (%rsi), %xmm2 + VPCMPEQ %xmm1, %xmm2, %xmm3 + vmovq -8(%rdi, %rdx), %xmm1 + vmovq -8(%rsi, %rdx), %xmm2 + VPCMPEQ %xmm1, %xmm2, %xmm2 + vpand %xmm2, %xmm3, %xmm3 + vpmovmskb %xmm3, %eax + subl $0xffff, %eax + /* No ymm register was touched. */ + ret + + .p2align 4,, 8 +L(between_16_31): + /* From 16 to 31 bytes. No branch when size == 16. */ + vmovdqu (%rsi), %xmm1 + VPCMPEQ (%rdi), %xmm1, %xmm1 + vmovdqu -16(%rsi, %rdx), %xmm2 + VPCMPEQ -16(%rdi, %rdx), %xmm2, %xmm2 + vpand %xmm1, %xmm2, %xmm2 + vpmovmskb %xmm2, %eax + subl $0xffff, %eax + /* No ymm register was touched. */ + ret + + .p2align 4,, 8 +L(between_2_3): + /* From 2 to 3 bytes. No branch when size == 2. */ + movzwl (%rdi), %eax + movzwl (%rsi), %ecx + subl %ecx, %eax + movzbl -1(%rdi, %rdx), %edi + movzbl -1(%rsi, %rdx), %esi + subl %edi, %esi + orl %esi, %eax + /* No ymm register was touched. 
*/ + ret +END (BCMP) +#endif diff --git a/sysdeps/x86_64/multiarch/ifunc-bcmp.h b/sysdeps/x86_64/multiarch/ifunc-bcmp.h index b0dacd8526..f94516e5ee 100644 --- a/sysdeps/x86_64/multiarch/ifunc-bcmp.h +++ b/sysdeps/x86_64/multiarch/ifunc-bcmp.h @@ -32,11 +32,11 @@ IFUNC_SELECTOR (void) if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && CPU_FEATURE_USABLE_P (cpu_features, BMI2) - && CPU_FEATURE_USABLE_P (cpu_features, MOVBE) && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) { if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL) - && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)) + && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW) + && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) return OPTIMIZE (evex); if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index dd0c393c7d..cda0316928 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -42,13 +42,11 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, IFUNC_IMPL (i, name, bcmp, IFUNC_IMPL_ADD (array, i, bcmp, (CPU_FEATURE_USABLE (AVX2) - && CPU_FEATURE_USABLE (MOVBE) && CPU_FEATURE_USABLE (BMI2)), __bcmp_avx2) IFUNC_IMPL_ADD (array, i, bcmp, (CPU_FEATURE_USABLE (AVX2) && CPU_FEATURE_USABLE (BMI2) - && CPU_FEATURE_USABLE (MOVBE) && CPU_FEATURE_USABLE (RTM)), __bcmp_avx2_rtm) IFUNC_IMPL_ADD (array, i, bcmp, From patchwork Tue Sep 14 06:30:39 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 44975 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id C02433858039 for ; Tue, 14 Sep 2021 06:34:13 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org C02433858039 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1631601253; bh=oC2IBbwzXpPhRKim4R2u3FAiw0euTHifnt6rWHEqIGY=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=j0eVQhzZYYGIMztpar0f4wa945tRPthYcQN3GgATw9KJNa9MU1sadfjC6V+kOA1WR CTJD2kpTKNfnNkjBn4gZGxMNNEGqR9kwCYgirr7ijQABMdZ3xzqwFfmHtjv5zidfvo C0BoHCO3GdkWVLJjaHO3AMaJarSQUKJ+G4qpW7VU= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-io1-xd36.google.com (mail-io1-xd36.google.com [IPv6:2607:f8b0:4864:20::d36]) by sourceware.org (Postfix) with ESMTPS id 3FE733857C4F for ; Tue, 14 Sep 2021 06:30:57 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3FE733857C4F Received: by mail-io1-xd36.google.com with SMTP id b10so15538510ioq.9 for ; Mon, 13 Sep 2021 23:30:57 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=oC2IBbwzXpPhRKim4R2u3FAiw0euTHifnt6rWHEqIGY=; b=z7I3OzT5/Us8D4XifQ+JQMkkvmpYZaKrlE1MFpAOb25tX8XLozHqi8ByQovcb23HOF tISHoVo4GjcXRMoJJY5BsppVTlZ66oBFXK7JHNMxwR5RMKvadLFzS+PB99zbg2aHjZRP mZZUIhYX6ZZDIr/Zkh1AWZsTy+9YZP/LSgyYzzcwm37q2fjOaJ92B1mjzVwLy5ookK43 iKsAc4ZEFvUMWA9h+L+MEbHqz+w46RQIK4mVWyax8mx8llHoBsMSf9qeLI7iU/Ao6fFY aYZ4/7HoECg0J7/REa6gR3PS68wQiYu29wjcF7kP3Yr2mBt3vhpZMYky58A0gxVvToCK AAJw== X-Gm-Message-State: 
AOAM530qoytWFH/jfbMGeNn77aqc2z4ZlCixp/qH1m5z70/+N9h42Dxs xUR2k5tW1CNQsPnZjiKpQGHlN2dEimUdtA== X-Google-Smtp-Source: ABdhPJwTAvUzhxBhQC3zRB56l9HLs+9kpT0bPM+ovhSEQIFsrMUauYso1DxAU2Oc3R2XJZWY9/XcQw== X-Received: by 2002:a02:b0d1:: with SMTP id w17mr13248052jah.46.1631601056227; Mon, 13 Sep 2021 23:30:56 -0700 (PDT) Received: from localhost.localdomain (node-17-161.flex.volo.net. [76.191.17.161]) by smtp.googlemail.com with ESMTPSA id b10sm6101328ils.13.2021.09.13.23.30.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 13 Sep 2021 23:30:55 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v2 5/5] x86_64: Add evex optimized bcmp implementation in bcmp-evex.S Date: Tue, 14 Sep 2021 01:30:39 -0500 Message-Id: <20210914063039.1126196-5-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20210914063039.1126196-1-goldstein.w.n@gmail.com> References: <20210913230506.546749-1-goldstein.w.n@gmail.com> <20210914063039.1126196-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" No bug. This commit adds a new optimized bcmp implementation for evex. The primary optimizations are 1) skipping the logic to find the difference of the first mismatched byte and 2) not updating the src/dst addresses, as the not-equal logic does not need to be reused by different code paths. The entry alignment has been fixed at 64 bytes. In throughput-sensitive functions, where bcmp can potentially be on the hot path, frontend and loop performance is important to optimize for. This is impossible or difficult to achieve and maintain with only a 16-byte fixed entry alignment. test-memcmp, test-bcmp, and test-wmemcmp are all passing. --- sysdeps/x86_64/multiarch/bcmp-evex.S | 305 ++++++++++++++++++++- sysdeps/x86_64/multiarch/ifunc-bcmp.h | 3 +- sysdeps/x86_64/multiarch/ifunc-impl-list.c | 1 - 3 files changed, 302 insertions(+), 7 deletions(-) diff --git a/sysdeps/x86_64/multiarch/bcmp-evex.S b/sysdeps/x86_64/multiarch/bcmp-evex.S index ade52e8c68..1bfe824eb4 100644 --- a/sysdeps/x86_64/multiarch/bcmp-evex.S +++ b/sysdeps/x86_64/multiarch/bcmp-evex.S @@ -16,8 +16,305 @@ License along with the GNU C Library; if not, see . */ -#ifndef MEMCMP -# define MEMCMP __bcmp_evex -#endif +#if IS_IN (libc) + +/* bcmp is implemented as: + 1. Use ymm vector compares when possible. The only case where + vector compares is not possible for when size < VEC_SIZE + and loading from either s1 or s2 would cause a page cross. + 2. Use xmm vector compare when size >= 8 bytes. + 3. Optimistically compare up to first 4 * VEC_SIZE one at a + to check for early mismatches. Only do this if its guranteed the + work is not wasted. + 4. If size is 8 * VEC_SIZE or less, unroll the loop. + 5. Compare 4 * VEC_SIZE at a time with the aligned first memory + area. + 6. Use 2 vector compares when size is 2 * VEC_SIZE or less. + 7. Use 4 vector compares when size is 4 * VEC_SIZE or less. 
+ 8. Use 8 vector compares when size is 8 * VEC_SIZE or less. */ + +# include + +# ifndef BCMP +# define BCMP __bcmp_evex +# endif + +# define VMOVU vmovdqu64 +# define VPCMP vpcmpub +# define VPTEST vptestmb + +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + +# define YMM0 ymm16 +# define YMM1 ymm17 +# define YMM2 ymm18 +# define YMM3 ymm19 +# define YMM4 ymm20 +# define YMM5 ymm21 +# define YMM6 ymm22 + + + .section .text.evex, "ax", @progbits +ENTRY_P2ALIGN (BCMP, 6) +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %edx, %edx +# endif + cmp $VEC_SIZE, %RDX_LP + jb L(less_vec) + + /* From VEC to 2 * VEC. No branch when size == VEC_SIZE. */ + VMOVU (%rsi), %YMM1 + /* Use compare not equals to directly check for mismatch. */ + VPCMP $4, (%rdi), %YMM1, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + cmpq $(VEC_SIZE * 2), %rdx + jbe L(last_1x_vec) + + /* Check second VEC no matter what. */ + VMOVU VEC_SIZE(%rsi), %YMM2 + VPCMP $4, VEC_SIZE(%rdi), %YMM2, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + /* Less than 4 * VEC. */ + cmpq $(VEC_SIZE * 4), %rdx + jbe L(last_2x_vec) + + /* Check third and fourth VEC no matter what. */ + VMOVU (VEC_SIZE * 2)(%rsi), %YMM3 + VPCMP $4, (VEC_SIZE * 2)(%rdi), %YMM3, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + VMOVU (VEC_SIZE * 3)(%rsi), %YMM4 + VPCMP $4, (VEC_SIZE * 3)(%rdi), %YMM4, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq0) + + /* Go to 4x VEC loop. */ + cmpq $(VEC_SIZE * 8), %rdx + ja L(more_8x_vec) + + /* Handle remainder of size = 4 * VEC + 1 to 8 * VEC without any + branches. */ + + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %YMM1 + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %YMM2 + addq %rdx, %rdi + + /* Wait to load from s1 until addressed adjust due to unlamination. + */ + + /* vpxor will be all 0s if s1 and s2 are equal. Otherwise it will + have some 1s. */ + vpxorq -(VEC_SIZE * 4)(%rdi), %YMM1, %YMM1 + vpxorq -(VEC_SIZE * 3)(%rdi), %YMM2, %YMM2 + + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %YMM3 + vpxorq -(VEC_SIZE * 2)(%rdi), %YMM3, %YMM3 + /* Or together YMM1, YMM2, and YMM3 into YMM3. */ + vpternlogd $0xfe, %YMM1, %YMM2, %YMM3 -#include "memcmp-evex-movbe.S" + VMOVU -(VEC_SIZE)(%rsi, %rdx), %YMM4 + /* Ternary logic to xor (VEC_SIZE * 3)(%rdi) with YMM4 while oring + with YMM3. Result is stored in YMM4. */ + vpternlogd $0xde, -(VEC_SIZE)(%rdi), %YMM3, %YMM4 + /* Compare YMM4 with 0. If any 1s s1 and s2 don't match. */ + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax +L(return_neq0): + ret + + /* Fits in padding needed to .p2align 5 L(less_vec). */ +L(last_1x_vec): + VMOVU -(VEC_SIZE * 1)(%rsi, %rdx), %YMM1 + VPCMP $4, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %k1 + kmovd %k1, %eax + ret + + /* NB: p2align 5 here will ensure the L(loop_4x_vec) is also 32 byte + aligned. */ + .p2align 5 +L(less_vec): + /* Check if one or less char. This is necessary for size = 0 but is + also faster for size = 1. */ + cmpl $1, %edx + jbe L(one_or_less) + + /* Check if loading one VEC from either s1 or s2 could cause a page + cross. This can have false positives but is by far the fastest + method. */ + movl %edi, %eax + orl %esi, %eax + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + jg L(page_cross_less_vec) + + /* No page cross possible. */ + VMOVU (%rsi), %YMM2 + VPCMP $4, (%rdi), %YMM2, %k1 + kmovd %k1, %eax + /* Result will be zero if s1 and s2 match. Otherwise first set bit + will be first mismatch. 
*/ + bzhil %edx, %eax, %eax + ret + + /* Relatively cold but placing close to L(less_vec) for 2 byte jump + encoding. */ + .p2align 4 +L(one_or_less): + jb L(zero) + movzbl (%rsi), %ecx + movzbl (%rdi), %eax + subl %ecx, %eax + /* No ymm register was touched. */ + ret + /* Within the same 16 byte block is L(one_or_less). */ +L(zero): + xorl %eax, %eax + ret + + .p2align 4 +L(last_2x_vec): + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %YMM1 + vpxorq -(VEC_SIZE * 2)(%rdi, %rdx), %YMM1, %YMM1 + VMOVU -(VEC_SIZE * 1)(%rsi, %rdx), %YMM2 + vpternlogd $0xde, -(VEC_SIZE * 1)(%rdi, %rdx), %YMM1, %YMM2 + VPTEST %YMM2, %YMM2, %k1 + kmovd %k1, %eax + ret + + .p2align 4 +L(more_8x_vec): + /* Set end of s1 in rdx. */ + leaq -(VEC_SIZE * 4)(%rdi, %rdx), %rdx + /* rsi stores s2 - s1. This allows loop to only update one pointer. + */ + subq %rdi, %rsi + /* Align s1 pointer. */ + andq $-VEC_SIZE, %rdi + /* Adjust because first 4x vec where check already. */ + subq $-(VEC_SIZE * 4), %rdi + .p2align 4 +L(loop_4x_vec): + VMOVU (%rsi, %rdi), %YMM1 + vpxorq (%rdi), %YMM1, %YMM1 + + VMOVU VEC_SIZE(%rsi, %rdi), %YMM2 + vpxorq VEC_SIZE(%rdi), %YMM2, %YMM2 + + VMOVU (VEC_SIZE * 2)(%rsi, %rdi), %YMM3 + vpxorq (VEC_SIZE * 2)(%rdi), %YMM3, %YMM3 + vpternlogd $0xfe, %YMM1, %YMM2, %YMM3 + + VMOVU (VEC_SIZE * 3)(%rsi, %rdi), %YMM4 + vpternlogd $0xde, (VEC_SIZE * 3)(%rdi), %YMM3, %YMM4 + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax + testl %eax, %eax + jnz L(return_neq2) + subq $-(VEC_SIZE * 4), %rdi + cmpq %rdx, %rdi + jb L(loop_4x_vec) + + subq %rdx, %rdi + VMOVU (VEC_SIZE * 3)(%rsi, %rdx), %YMM4 + vpxorq (VEC_SIZE * 3)(%rdx), %YMM4, %YMM4 + /* rdi has 4 * VEC_SIZE - remaining length. */ + cmpl $(VEC_SIZE * 3), %edi + jae L(8x_last_1x_vec) + /* Load regardless of branch. */ + VMOVU (VEC_SIZE * 2)(%rsi, %rdx), %YMM3 + /* Ternary logic to xor (VEC_SIZE * 2)(%rdx) with YMM3 while oring + with YMM4. Result is stored in YMM4. */ + vpternlogd $0xf6, (VEC_SIZE * 2)(%rdx), %YMM3, %YMM4 + cmpl $(VEC_SIZE * 2), %edi + jae L(8x_last_2x_vec) + + VMOVU VEC_SIZE(%rsi, %rdx), %YMM2 + vpxorq VEC_SIZE(%rdx), %YMM2, %YMM2 + + VMOVU (%rsi, %rdx), %YMM1 + vpxorq (%rdx), %YMM1, %YMM1 + + vpternlogd $0xfe, %YMM1, %YMM2, %YMM4 +L(8x_last_1x_vec): +L(8x_last_2x_vec): + VPTEST %YMM4, %YMM4, %k1 + kmovd %k1, %eax +L(return_neq2): + ret + + /* Relatively cold case as page cross are unexpected. */ + .p2align 4 +L(page_cross_less_vec): + cmpl $16, %edx + jae L(between_16_31) + cmpl $8, %edx + ja L(between_9_15) + cmpl $4, %edx + jb L(between_2_3) + /* From 4 to 8 bytes. No branch when size == 4. */ + movl (%rdi), %eax + movl (%rsi), %ecx + subl %ecx, %eax + movl -4(%rdi, %rdx), %ecx + movl -4(%rsi, %rdx), %esi + subl %esi, %ecx + orl %ecx, %eax + ret + + .p2align 4,, 8 +L(between_9_15): + /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe. + */ + vmovq (%rdi), %xmm1 + vmovq (%rsi), %xmm2 + vpcmpeqb %xmm1, %xmm2, %xmm3 + vmovq -8(%rdi, %rdx), %xmm1 + vmovq -8(%rsi, %rdx), %xmm2 + vpcmpeqb %xmm1, %xmm2, %xmm2 + vpand %xmm2, %xmm3, %xmm3 + vpmovmskb %xmm3, %eax + subl $0xffff, %eax + /* No ymm register was touched. */ + ret + + .p2align 4,, 8 +L(between_16_31): + /* From 16 to 31 bytes. No branch when size == 16. */ + + /* Safe to use xmm[0, 15] as no vzeroupper is needed so RTM safe. + */ + vmovdqu (%rsi), %xmm1 + vpcmpeqb (%rdi), %xmm1, %xmm1 + vmovdqu -16(%rsi, %rdx), %xmm2 + vpcmpeqb -16(%rdi, %rdx), %xmm2, %xmm2 + vpand %xmm1, %xmm2, %xmm2 + vpmovmskb %xmm2, %eax + subl $0xffff, %eax + /* No ymm register was touched. 
*/ + ret + + .p2align 4,, 8 +L(between_2_3): + /* From 2 to 3 bytes. No branch when size == 2. */ + movzwl (%rdi), %eax + movzwl (%rsi), %ecx + subl %ecx, %eax + movzbl -1(%rdi, %rdx), %edi + movzbl -1(%rsi, %rdx), %esi + subl %edi, %esi + orl %esi, %eax + /* No ymm register was touched. */ + ret +END (BCMP) +#endif diff --git a/sysdeps/x86_64/multiarch/ifunc-bcmp.h b/sysdeps/x86_64/multiarch/ifunc-bcmp.h index f94516e5ee..51f251d0c9 100644 --- a/sysdeps/x86_64/multiarch/ifunc-bcmp.h +++ b/sysdeps/x86_64/multiarch/ifunc-bcmp.h @@ -35,8 +35,7 @@ IFUNC_SELECTOR (void) && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) { if (CPU_FEATURE_USABLE_P (cpu_features, AVX512VL) - && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW) - && CPU_FEATURE_USABLE_P (cpu_features, MOVBE)) + && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW)) return OPTIMIZE (evex); if (CPU_FEATURE_USABLE_P (cpu_features, RTM)) diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index cda0316928..abbb4e407f 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -52,7 +52,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, IFUNC_IMPL_ADD (array, i, bcmp, (CPU_FEATURE_USABLE (AVX512VL) && CPU_FEATURE_USABLE (AVX512BW) - && CPU_FEATURE_USABLE (MOVBE) && CPU_FEATURE_USABLE (BMI2)), __bcmp_evex) IFUNC_IMPL_ADD (array, i, bcmp, CPU_FEATURE_USABLE (SSE4_1),