| Message ID | 20210203053900.4125403-1-goldstein.w.n@gmail.com |
|---|---|
| State | Committed |
| Series | [v4,1/2] x86: Refactor and improve performance of strchr-avx2.S |
Commit Message
Noah Goldstein via Libc-alpha
Feb. 3, 2021, 5:38 a.m. UTC
From: noah <goldstein.w.n@gmail.com>

No bug. Just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized body of main loop.
Updated page cross logic and optimized accordingly. Made a few minor
instruction selection modifications. No regressions in test suite.
Both test-strchrnul and test-strchr passed.

Signed-off-by: noah <goldstein.w.n@gmail.com>
---
 sysdeps/x86_64/multiarch/strchr-avx2.S | 235 ++++++++++++-------------
 sysdeps/x86_64/multiarch/strchr.c      |   1 +
 2 files changed, 118 insertions(+), 118 deletions(-)
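The "updated page cross logic" mentioned above replaces the old `2 * VEC_SIZE` alignment test with a check against the real 4 KiB page size: a 32-byte unaligned load is safe unless fewer than `VEC_SIZE` bytes remain before the page boundary. As an illustration (a scalar sketch, not glibc code; `crosses_page` is a hypothetical helper name), the patch's `andl $(PAGE_SIZE - 1); cmpl $(PAGE_SIZE - VEC_SIZE); ja` sequence corresponds to:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define VEC_SIZE  32

/* Nonzero when a VEC_SIZE-byte load at ADDR would spill into the next
   4 KiB page, i.e. fewer than VEC_SIZE bytes remain on this page.  */
static inline int
crosses_page (uintptr_t addr)
{
  return (addr & (PAGE_SIZE - 1)) > (PAGE_SIZE - VEC_SIZE);
}
```

With this check the common case (load fits entirely in the page) falls through to a single unaligned vector load, and only the rare cross-page case takes the cold path.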
Comments
On Tue, Feb 2, 2021 at 9:39 PM <goldstein.w.n@gmail.com> wrote:
>
> From: noah <goldstein.w.n@gmail.com>
>
> No bug. Just seemed the performance could be improved a bit. Observed
> and expected behavior are unchanged. Optimized body of main
> loop. Updated page cross logic and optimized accordingly. Made a few
> minor instruction selection modifications. No regressions in test
> suite. Both test-strchrnul and test-strchr passed.
>
> Signed-off-by: noah <goldstein.w.n@gmail.com>
> ---
>  sysdeps/x86_64/multiarch/strchr-avx2.S | 235 ++++++++++++-------------
>  sysdeps/x86_64/multiarch/strchr.c      |   1 +
>  2 files changed, 118 insertions(+), 118 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
> index d416558d04..8b9d78b55a 100644
> --- a/sysdeps/x86_64/multiarch/strchr-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
> @@ -27,10 +27,12 @@
>  # ifdef USE_AS_WCSCHR
>  # define VPBROADCAST vpbroadcastd
>  # define VPCMPEQ vpcmpeqd
> +# define VPMINU vpminud
>  # define CHAR_REG esi
>  # else
>  # define VPBROADCAST vpbroadcastb
>  # define VPCMPEQ vpcmpeqb
> +# define VPMINU vpminub
>  # define CHAR_REG sil
>  # endif
>
> @@ -39,20 +41,26 @@
>  # endif
>
>  # define VEC_SIZE 32
> +# define PAGE_SIZE 4096
>
>          .section .text.avx,"ax",@progbits
>  ENTRY (STRCHR)
>          movl    %edi, %ecx
> -        /* Broadcast CHAR to YMM0.  */
> +# ifndef USE_AS_STRCHRNUL
> +        xorl    %edx, %edx
> +# endif
> +
> +        /* Broadcast CHAR to YMM0.  */
>          vmovd   %esi, %xmm0
>          vpxor   %xmm9, %xmm9, %xmm9
>          VPBROADCAST %xmm0, %ymm0
> -        /* Check if we may cross page boundary with one vector load.  */
> -        andl    $(2 * VEC_SIZE - 1), %ecx
> -        cmpl    $VEC_SIZE, %ecx
> -        ja      L(cros_page_boundary)
> -
> -        /* Check the first VEC_SIZE bytes.  Search for both CHAR and the
> +
> +        /* Check if we cross page boundary with one vector load.  */
> +        andl    $(PAGE_SIZE - 1), %ecx
> +        cmpl    $(PAGE_SIZE - VEC_SIZE), %ecx
> +        ja      L(cross_page_boundary)
> +
> +        /* Check the first VEC_SIZE bytes.  Search for both CHAR and the
>             null byte.  */
>          vmovdqu (%rdi), %ymm8
>          VPCMPEQ %ymm8, %ymm0, %ymm1
> @@ -60,50 +68,27 @@ ENTRY (STRCHR)
>          vpor    %ymm1, %ymm2, %ymm1
>          vpmovmskb %ymm1, %eax
>          testl   %eax, %eax
> -        jnz     L(first_vec_x0)
> -
> -        /* Align data for aligned loads in the loop.  */
> -        addq    $VEC_SIZE, %rdi
> -        andl    $(VEC_SIZE - 1), %ecx
> -        andq    $-VEC_SIZE, %rdi
> -
> -        jmp     L(more_4x_vec)
> -
> -        .p2align 4
> -L(cros_page_boundary):
> -        andl    $(VEC_SIZE - 1), %ecx
> -        andq    $-VEC_SIZE, %rdi
> -        vmovdqu (%rdi), %ymm8
> -        VPCMPEQ %ymm8, %ymm0, %ymm1
> -        VPCMPEQ %ymm8, %ymm9, %ymm2
> -        vpor    %ymm1, %ymm2, %ymm1
> -        vpmovmskb %ymm1, %eax
> -        /* Remove the leading bytes.  */
> -        sarl    %cl, %eax
> -        testl   %eax, %eax
> -        jz      L(aligned_more)
> -        /* Found CHAR or the null byte.  */
> +        jz      L(more_vecs)
>          tzcntl  %eax, %eax
> -        addq    %rcx, %rax
> -# ifdef USE_AS_STRCHRNUL
> +        /* Found CHAR or the null byte.  */
>          addq    %rdi, %rax
> -# else
> -        xorl    %edx, %edx
> -        leaq    (%rdi, %rax), %rax
> -        cmp     (%rax), %CHAR_REG
> +# ifndef USE_AS_STRCHRNUL
> +        cmp     (%rax), %CHAR_REG
>          cmovne  %rdx, %rax
>  # endif
>          VZEROUPPER
>          ret
>
>          .p2align 4
> +L(more_vecs):
> +        /* Align data for aligned loads in the loop.  */
> +        andq    $-VEC_SIZE, %rdi
>  L(aligned_more):
> -        addq    $VEC_SIZE, %rdi
>
> -L(more_4x_vec):
> -        /* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
> -           since data is only aligned to VEC_SIZE.  */
> -        vmovdqa (%rdi), %ymm8
> +        /* Check the next 4 * VEC_SIZE.  Only one VEC_SIZE at a time
> +           since data is only aligned to VEC_SIZE.  */
> +        vmovdqa VEC_SIZE(%rdi), %ymm8
> +        addq    $VEC_SIZE, %rdi
>          VPCMPEQ %ymm8, %ymm0, %ymm1
>          VPCMPEQ %ymm8, %ymm9, %ymm2
>          vpor    %ymm1, %ymm2, %ymm1
> @@ -125,7 +110,7 @@ L(more_4x_vec):
>          vpor    %ymm1, %ymm2, %ymm1
>          vpmovmskb %ymm1, %eax
>          testl   %eax, %eax
> -        jnz     L(first_vec_x2)
> +        jnz     L(first_vec_x2)
>
>          vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
>          VPCMPEQ %ymm8, %ymm0, %ymm1
> @@ -133,122 +118,136 @@ L(more_4x_vec):
>          vpor    %ymm1, %ymm2, %ymm1
>          vpmovmskb %ymm1, %eax
>          testl   %eax, %eax
> -        jnz     L(first_vec_x3)
> -
> -        addq    $(VEC_SIZE * 4), %rdi
> -
> -        /* Align data to 4 * VEC_SIZE.  */
> -        movq    %rdi, %rcx
> -        andl    $(4 * VEC_SIZE - 1), %ecx
> -        andq    $-(4 * VEC_SIZE), %rdi
> -
> -        .p2align 4
> -L(loop_4x_vec):
> -        /* Compare 4 * VEC at a time forward.  */
> -        vmovdqa (%rdi), %ymm5
> -        vmovdqa VEC_SIZE(%rdi), %ymm6
> -        vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7
> -        vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
> -
> -        VPCMPEQ %ymm5, %ymm0, %ymm1
> -        VPCMPEQ %ymm6, %ymm0, %ymm2
> -        VPCMPEQ %ymm7, %ymm0, %ymm3
> -        VPCMPEQ %ymm8, %ymm0, %ymm4
> -
> -        VPCMPEQ %ymm5, %ymm9, %ymm5
> -        VPCMPEQ %ymm6, %ymm9, %ymm6
> -        VPCMPEQ %ymm7, %ymm9, %ymm7
> -        VPCMPEQ %ymm8, %ymm9, %ymm8
> -
> -        vpor    %ymm1, %ymm5, %ymm1
> -        vpor    %ymm2, %ymm6, %ymm2
> -        vpor    %ymm3, %ymm7, %ymm3
> -        vpor    %ymm4, %ymm8, %ymm4
> -
> -        vpor    %ymm1, %ymm2, %ymm5
> -        vpor    %ymm3, %ymm4, %ymm6
> -
> -        vpor    %ymm5, %ymm6, %ymm5
> -
> -        vpmovmskb %ymm5, %eax
> -        testl   %eax, %eax
> -        jnz     L(4x_vec_end)
> -
> -        addq    $(VEC_SIZE * 4), %rdi
> +        jz      L(prep_loop_4x)
>
> -        jmp     L(loop_4x_vec)
> +        tzcntl  %eax, %eax
> +        leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +# ifndef USE_AS_STRCHRNUL
> +        cmp     (%rax), %CHAR_REG
> +        cmovne  %rdx, %rax
> +# endif
> +        VZEROUPPER
> +        ret
>
>          .p2align 4
>  L(first_vec_x0):
> -        /* Found CHAR or the null byte.  */
>          tzcntl  %eax, %eax
> -# ifdef USE_AS_STRCHRNUL
> +        /* Found CHAR or the null byte.  */
>          addq    %rdi, %rax
> -# else
> -        xorl    %edx, %edx
> -        leaq    (%rdi, %rax), %rax
> -        cmp     (%rax), %CHAR_REG
> +# ifndef USE_AS_STRCHRNUL
> +        cmp     (%rax), %CHAR_REG
>          cmovne  %rdx, %rax
>  # endif
>          VZEROUPPER
>          ret
> -
> +
>          .p2align 4
>  L(first_vec_x1):
>          tzcntl  %eax, %eax
> -# ifdef USE_AS_STRCHRNUL
> -        addq    $VEC_SIZE, %rax
> -        addq    %rdi, %rax
> -# else
> -        xorl    %edx, %edx
>          leaq    VEC_SIZE(%rdi, %rax), %rax
> -        cmp     (%rax), %CHAR_REG
> +# ifndef USE_AS_STRCHRNUL
> +        cmp     (%rax), %CHAR_REG
>          cmovne  %rdx, %rax
>  # endif
>          VZEROUPPER
> -        ret
> -
> +        ret
> +
>          .p2align 4
>  L(first_vec_x2):
>          tzcntl  %eax, %eax
> -# ifdef USE_AS_STRCHRNUL
> -        addq    $(VEC_SIZE * 2), %rax
> -        addq    %rdi, %rax
> -# else
> -        xorl    %edx, %edx
> +        /* Found CHAR or the null byte.  */
>          leaq    (VEC_SIZE * 2)(%rdi, %rax), %rax
> -        cmp     (%rax), %CHAR_REG
> +# ifndef USE_AS_STRCHRNUL
> +        cmp     (%rax), %CHAR_REG
>          cmovne  %rdx, %rax
>  # endif
>          VZEROUPPER
>          ret
> +
> +L(prep_loop_4x):
> +        /* Align data to 4 * VEC_SIZE.  */
> +        andq    $-(VEC_SIZE * 4), %rdi
>
>          .p2align 4
> -L(4x_vec_end):
> +L(loop_4x_vec):
> +        /* Compare 4 * VEC at a time forward.  */
> +        vmovdqa (VEC_SIZE * 4)(%rdi), %ymm5
> +        vmovdqa (VEC_SIZE * 5)(%rdi), %ymm6
> +        vmovdqa (VEC_SIZE * 6)(%rdi), %ymm7
> +        vmovdqa (VEC_SIZE * 7)(%rdi), %ymm8
> +
> +        /* Leaves only CHARS matching esi as 0.  */
> +        vpxor   %ymm5, %ymm0, %ymm1
> +        vpxor   %ymm6, %ymm0, %ymm2
> +        vpxor   %ymm7, %ymm0, %ymm3
> +        vpxor   %ymm8, %ymm0, %ymm4
> +
> +        VPMINU  %ymm1, %ymm5, %ymm1
> +        VPMINU  %ymm2, %ymm6, %ymm2
> +        VPMINU  %ymm3, %ymm7, %ymm3
> +        VPMINU  %ymm4, %ymm8, %ymm4
> +
> +        VPMINU  %ymm1, %ymm2, %ymm5
> +        VPMINU  %ymm3, %ymm4, %ymm6
> +
> +        VPMINU  %ymm5, %ymm6, %ymm5
> +
> +        VPCMPEQ %ymm5, %ymm9, %ymm5
> +        vpmovmskb %ymm5, %eax
> +
> +        addq    $(VEC_SIZE * 4), %rdi
> +        testl   %eax, %eax
> +        jz      L(loop_4x_vec)
> +
> +        VPCMPEQ %ymm1, %ymm9, %ymm1
>          vpmovmskb %ymm1, %eax
>          testl   %eax, %eax
>          jnz     L(first_vec_x0)
> +
> +        VPCMPEQ %ymm2, %ymm9, %ymm2
>          vpmovmskb %ymm2, %eax
>          testl   %eax, %eax
>          jnz     L(first_vec_x1)
> -        vpmovmskb %ymm3, %eax
> -        testl   %eax, %eax
> -        jnz     L(first_vec_x2)
> +
> +        VPCMPEQ %ymm3, %ymm9, %ymm3
> +        VPCMPEQ %ymm4, %ymm9, %ymm4
> +        vpmovmskb %ymm3, %ecx
>          vpmovmskb %ymm4, %eax
> +        salq    $32, %rax
> +        orq     %rcx, %rax
> +        tzcntq  %rax, %rax
> +        leaq    (VEC_SIZE * 2)(%rdi, %rax), %rax
> +# ifndef USE_AS_STRCHRNUL
> +        cmp     (%rax), %CHAR_REG
> +        cmovne  %rdx, %rax
> +# endif
> +        VZEROUPPER
> +        ret
> +
> +        /* Cold case for crossing page with first load.  */
> +        .p2align 4
> +L(cross_page_boundary):
> +        andq    $-VEC_SIZE, %rdi
> +        andl    $(VEC_SIZE - 1), %ecx
> +
> +        vmovdqa (%rdi), %ymm8
> +        VPCMPEQ %ymm8, %ymm0, %ymm1
> +        VPCMPEQ %ymm8, %ymm9, %ymm2
> +        vpor    %ymm1, %ymm2, %ymm1
> +        vpmovmskb %ymm1, %eax
> +        /* Remove the leading bits.  */
> +        sarxl   %ecx, %eax, %eax
>          testl   %eax, %eax
> -L(first_vec_x3):
> +        jz      L(aligned_more)
>          tzcntl  %eax, %eax
> -# ifdef USE_AS_STRCHRNUL
> -        addq    $(VEC_SIZE * 3), %rax
> +        addq    %rcx, %rdi
>          addq    %rdi, %rax
> -# else
> -        xorl    %edx, %edx
> -        leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> -        cmp     (%rax), %CHAR_REG
> +# ifndef USE_AS_STRCHRNUL
> +        cmp     (%rax), %CHAR_REG
>          cmovne  %rdx, %rax
>  # endif
>          VZEROUPPER
>          ret
>
>  END (STRCHR)
> -#endif
> +# endif
> diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c
> index 583a152794..4dfbe3b58b 100644
> --- a/sysdeps/x86_64/multiarch/strchr.c
> +++ b/sysdeps/x86_64/multiarch/strchr.c
> @@ -37,6 +37,7 @@ IFUNC_SELECTOR (void)
>
>    if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
>        && CPU_FEATURE_USABLE_P (cpu_features, AVX2)
> +      && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
>        && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
>      return OPTIMIZE (avx2);
>
> --
> 2.29.2
>

LGTM.

Thanks.
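The rewritten main loop above replaces the old two-compare-plus-OR reduction with a `vpxor`/`VPMINU` combination: XOR with the broadcast CHAR turns matching bytes into 0, and an unsigned minimum with the original data keeps NUL bytes at 0 as well, so one compare against zero detects both terminators per vector. A scalar sketch of one byte lane (for illustration only; `byte_hits` is a hypothetical helper name, not glibc code):

```c
#include <assert.h>
#include <stdint.h>

/* Models what vpxor + VPMINU + VPCMPEQ-with-zero compute for a single
   byte lane: the result is 0 iff DATA equals C or DATA is NUL.  */
static inline int
byte_hits (uint8_t data, uint8_t c)
{
  uint8_t x = data ^ c;               /* 0 iff data == c */
  uint8_t m = x < data ? x : data;    /* unsigned min: 0 iff match or NUL */
  return m == 0;
}
```

This folds the CHAR test and the NUL test into two cheap ALU operations per vector instead of two compares and an OR, which is where most of the loop's savings come from.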
On Mon, Feb 8, 2021 at 6:08 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> [...]
>
> LGTM.
>
> Thanks.

This is the updated patch with extra white spaces fixed I am checking in.
On Mon, Feb 8, 2021 at 2:33 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> [...]
>
> This is the updated patch with extra white spaces fixed I am checking in.
>
> --
> H.J.

Awesome! Thanks! N.G.
On Mon, Feb 8, 2021 at 2:48 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Mon, Feb 8, 2021 at 2:33 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Mon, Feb 8, 2021 at 6:08 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > > > On Tue, Feb 2, 2021 at 9:39 PM <goldstein.w.n@gmail.com> wrote: > > > > > > > > From: noah <goldstein.w.n@gmail.com> > > > > > > > > No bug. Just seemed the performance could be improved a bit. Observed > > > > and expected behavior are unchanged. Optimized body of main > > > > loop. Updated page cross logic and optimized accordingly. Made a few > > > > minor instruction selection modifications. No regressions in test > > > > suite. Both test-strchrnul and test-strchr passed. > > > > > > > > Signed-off-by: noah <goldstein.w.n@gmail.com> > > > > --- > > > > sysdeps/x86_64/multiarch/strchr-avx2.S | 235 ++++++++++++------------- > > > > sysdeps/x86_64/multiarch/strchr.c | 1 + > > > > 2 files changed, 118 insertions(+), 118 deletions(-) > > > > > > > > diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S > > > > index d416558d04..8b9d78b55a 100644 > > > > --- a/sysdeps/x86_64/multiarch/strchr-avx2.S > > > > +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S > > > > @@ -27,10 +27,12 @@ > > > > # ifdef USE_AS_WCSCHR > > > > # define VPBROADCAST vpbroadcastd > > > > # define VPCMPEQ vpcmpeqd > > > > +# define VPMINU vpminud > > > > # define CHAR_REG esi > > > > # else > > > > # define VPBROADCAST vpbroadcastb > > > > # define VPCMPEQ vpcmpeqb > > > > +# define VPMINU vpminub > > > > # define CHAR_REG sil > > > > # endif > > > > > > > > @@ -39,20 +41,26 @@ > > > > # endif > > > > > > > > # define VEC_SIZE 32 > > > > +# define PAGE_SIZE 4096 > > > > > > > > .section .text.avx,"ax",@progbits > > > > ENTRY (STRCHR) > > > > movl %edi, %ecx > > > > - /* Broadcast CHAR to YMM0. */ > > > > +# ifndef USE_AS_STRCHRNUL > > > > + xorl %edx, %edx > > > > +# endif > > > > + > > > > + /* Broadcast CHAR to YMM0. 
*/ > > > > vmovd %esi, %xmm0 > > > > vpxor %xmm9, %xmm9, %xmm9 > > > > VPBROADCAST %xmm0, %ymm0 > > > > - /* Check if we may cross page boundary with one vector load. */ > > > > - andl $(2 * VEC_SIZE - 1), %ecx > > > > - cmpl $VEC_SIZE, %ecx > > > > - ja L(cros_page_boundary) > > > > - > > > > - /* Check the first VEC_SIZE bytes. Search for both CHAR and the > > > > + > > > > + /* Check if we cross page boundary with one vector load. */ > > > > + andl $(PAGE_SIZE - 1), %ecx > > > > + cmpl $(PAGE_SIZE - VEC_SIZE), %ecx > > > > + ja L(cross_page_boundary) > > > > + > > > > + /* Check the first VEC_SIZE bytes. Search for both CHAR and the > > > > null byte. */ > > > > vmovdqu (%rdi), %ymm8 > > > > VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > @@ -60,50 +68,27 @@ ENTRY (STRCHR) > > > > vpor %ymm1, %ymm2, %ymm1 > > > > vpmovmskb %ymm1, %eax > > > > testl %eax, %eax > > > > - jnz L(first_vec_x0) > > > > - > > > > - /* Align data for aligned loads in the loop. */ > > > > - addq $VEC_SIZE, %rdi > > > > - andl $(VEC_SIZE - 1), %ecx > > > > - andq $-VEC_SIZE, %rdi > > > > - > > > > - jmp L(more_4x_vec) > > > > - > > > > - .p2align 4 > > > > -L(cros_page_boundary): > > > > - andl $(VEC_SIZE - 1), %ecx > > > > - andq $-VEC_SIZE, %rdi > > > > - vmovdqu (%rdi), %ymm8 > > > > - VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > - VPCMPEQ %ymm8, %ymm9, %ymm2 > > > > - vpor %ymm1, %ymm2, %ymm1 > > > > - vpmovmskb %ymm1, %eax > > > > - /* Remove the leading bytes. */ > > > > - sarl %cl, %eax > > > > - testl %eax, %eax > > > > - jz L(aligned_more) > > > > - /* Found CHAR or the null byte. */ > > > > + jz L(more_vecs) > > > > tzcntl %eax, %eax > > > > - addq %rcx, %rax > > > > -# ifdef USE_AS_STRCHRNUL > > > > + /* Found CHAR or the null byte. 
*/ > > > > addq %rdi, %rax > > > > -# else > > > > - xorl %edx, %edx > > > > - leaq (%rdi, %rax), %rax > > > > - cmp (%rax), %CHAR_REG > > > > +# ifndef USE_AS_STRCHRNUL > > > > + cmp (%rax), %CHAR_REG > > > > cmovne %rdx, %rax > > > > # endif > > > > VZEROUPPER > > > > ret > > > > > > > > .p2align 4 > > > > +L(more_vecs): > > > > + /* Align data for aligned loads in the loop. */ > > > > + andq $-VEC_SIZE, %rdi > > > > L(aligned_more): > > > > - addq $VEC_SIZE, %rdi > > > > > > > > -L(more_4x_vec): > > > > - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time > > > > - since data is only aligned to VEC_SIZE. */ > > > > - vmovdqa (%rdi), %ymm8 > > > > + /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time > > > > + since data is only aligned to VEC_SIZE. */ > > > > + vmovdqa VEC_SIZE(%rdi), %ymm8 > > > > + addq $VEC_SIZE, %rdi > > > > VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > VPCMPEQ %ymm8, %ymm9, %ymm2 > > > > vpor %ymm1, %ymm2, %ymm1 > > > > @@ -125,7 +110,7 @@ L(more_4x_vec): > > > > vpor %ymm1, %ymm2, %ymm1 > > > > vpmovmskb %ymm1, %eax > > > > testl %eax, %eax > > > > - jnz L(first_vec_x2) > > > > + jnz L(first_vec_x2) > > > > > > > > vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 > > > > VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > @@ -133,122 +118,136 @@ L(more_4x_vec): > > > > vpor %ymm1, %ymm2, %ymm1 > > > > vpmovmskb %ymm1, %eax > > > > testl %eax, %eax > > > > - jnz L(first_vec_x3) > > > > - > > > > - addq $(VEC_SIZE * 4), %rdi > > > > - > > > > - /* Align data to 4 * VEC_SIZE. */ > > > > - movq %rdi, %rcx > > > > - andl $(4 * VEC_SIZE - 1), %ecx > > > > - andq $-(4 * VEC_SIZE), %rdi > > > > - > > > > - .p2align 4 > > > > -L(loop_4x_vec): > > > > - /* Compare 4 * VEC at a time forward. 
*/ > > > > - vmovdqa (%rdi), %ymm5 > > > > - vmovdqa VEC_SIZE(%rdi), %ymm6 > > > > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7 > > > > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 > > > > - > > > > - VPCMPEQ %ymm5, %ymm0, %ymm1 > > > > - VPCMPEQ %ymm6, %ymm0, %ymm2 > > > > - VPCMPEQ %ymm7, %ymm0, %ymm3 > > > > - VPCMPEQ %ymm8, %ymm0, %ymm4 > > > > - > > > > - VPCMPEQ %ymm5, %ymm9, %ymm5 > > > > - VPCMPEQ %ymm6, %ymm9, %ymm6 > > > > - VPCMPEQ %ymm7, %ymm9, %ymm7 > > > > - VPCMPEQ %ymm8, %ymm9, %ymm8 > > > > - > > > > - vpor %ymm1, %ymm5, %ymm1 > > > > - vpor %ymm2, %ymm6, %ymm2 > > > > - vpor %ymm3, %ymm7, %ymm3 > > > > - vpor %ymm4, %ymm8, %ymm4 > > > > - > > > > - vpor %ymm1, %ymm2, %ymm5 > > > > - vpor %ymm3, %ymm4, %ymm6 > > > > - > > > > - vpor %ymm5, %ymm6, %ymm5 > > > > - > > > > - vpmovmskb %ymm5, %eax > > > > - testl %eax, %eax > > > > - jnz L(4x_vec_end) > > > > - > > > > - addq $(VEC_SIZE * 4), %rdi > > > > + jz L(prep_loop_4x) > > > > > > > > - jmp L(loop_4x_vec) > > > > + tzcntl %eax, %eax > > > > + leaq (VEC_SIZE * 3)(%rdi, %rax), %rax > > > > +# ifndef USE_AS_STRCHRNUL > > > > + cmp (%rax), %CHAR_REG > > > > + cmovne %rdx, %rax > > > > +# endif > > > > + VZEROUPPER > > > > + ret > > > > > > > > .p2align 4 > > > > L(first_vec_x0): > > > > - /* Found CHAR or the null byte. */ > > > > tzcntl %eax, %eax > > > > -# ifdef USE_AS_STRCHRNUL > > > > + /* Found CHAR or the null byte. 
*/ > > > > addq %rdi, %rax > > > > -# else > > > > - xorl %edx, %edx > > > > - leaq (%rdi, %rax), %rax > > > > - cmp (%rax), %CHAR_REG > > > > +# ifndef USE_AS_STRCHRNUL > > > > + cmp (%rax), %CHAR_REG > > > > cmovne %rdx, %rax > > > > # endif > > > > VZEROUPPER > > > > ret > > > > - > > > > + > > > > .p2align 4 > > > > L(first_vec_x1): > > > > tzcntl %eax, %eax > > > > -# ifdef USE_AS_STRCHRNUL > > > > - addq $VEC_SIZE, %rax > > > > - addq %rdi, %rax > > > > -# else > > > > - xorl %edx, %edx > > > > leaq VEC_SIZE(%rdi, %rax), %rax > > > > - cmp (%rax), %CHAR_REG > > > > +# ifndef USE_AS_STRCHRNUL > > > > + cmp (%rax), %CHAR_REG > > > > cmovne %rdx, %rax > > > > # endif > > > > VZEROUPPER > > > > - ret > > > > - > > > > + ret > > > > + > > > > .p2align 4 > > > > L(first_vec_x2): > > > > tzcntl %eax, %eax > > > > -# ifdef USE_AS_STRCHRNUL > > > > - addq $(VEC_SIZE * 2), %rax > > > > - addq %rdi, %rax > > > > -# else > > > > - xorl %edx, %edx > > > > + /* Found CHAR or the null byte. */ > > > > leaq (VEC_SIZE * 2)(%rdi, %rax), %rax > > > > - cmp (%rax), %CHAR_REG > > > > +# ifndef USE_AS_STRCHRNUL > > > > + cmp (%rax), %CHAR_REG > > > > cmovne %rdx, %rax > > > > # endif > > > > VZEROUPPER > > > > ret > > > > + > > > > +L(prep_loop_4x): > > > > + /* Align data to 4 * VEC_SIZE. */ > > > > + andq $-(VEC_SIZE * 4), %rdi > > > > > > > > .p2align 4 > > > > -L(4x_vec_end): > > > > +L(loop_4x_vec): > > > > + /* Compare 4 * VEC at a time forward. */ > > > > + vmovdqa (VEC_SIZE * 4)(%rdi), %ymm5 > > > > + vmovdqa (VEC_SIZE * 5)(%rdi), %ymm6 > > > > + vmovdqa (VEC_SIZE * 6)(%rdi), %ymm7 > > > > + vmovdqa (VEC_SIZE * 7)(%rdi), %ymm8 > > > > + > > > > + /* Leaves only CHARS matching esi as 0. 
*/ > > > > + vpxor %ymm5, %ymm0, %ymm1 > > > > + vpxor %ymm6, %ymm0, %ymm2 > > > > + vpxor %ymm7, %ymm0, %ymm3 > > > > + vpxor %ymm8, %ymm0, %ymm4 > > > > + > > > > + VPMINU %ymm1, %ymm5, %ymm1 > > > > + VPMINU %ymm2, %ymm6, %ymm2 > > > > + VPMINU %ymm3, %ymm7, %ymm3 > > > > + VPMINU %ymm4, %ymm8, %ymm4 > > > > + > > > > + VPMINU %ymm1, %ymm2, %ymm5 > > > > + VPMINU %ymm3, %ymm4, %ymm6 > > > > + > > > > + VPMINU %ymm5, %ymm6, %ymm5 > > > > + > > > > + VPCMPEQ %ymm5, %ymm9, %ymm5 > > > > + vpmovmskb %ymm5, %eax > > > > + > > > > + addq $(VEC_SIZE * 4), %rdi > > > > + testl %eax, %eax > > > > + jz L(loop_4x_vec) > > > > + > > > > + VPCMPEQ %ymm1, %ymm9, %ymm1 > > > > vpmovmskb %ymm1, %eax > > > > testl %eax, %eax > > > > jnz L(first_vec_x0) > > > > + > > > > + VPCMPEQ %ymm2, %ymm9, %ymm2 > > > > vpmovmskb %ymm2, %eax > > > > testl %eax, %eax > > > > jnz L(first_vec_x1) > > > > - vpmovmskb %ymm3, %eax > > > > - testl %eax, %eax > > > > - jnz L(first_vec_x2) > > > > + > > > > + VPCMPEQ %ymm3, %ymm9, %ymm3 > > > > + VPCMPEQ %ymm4, %ymm9, %ymm4 > > > > + vpmovmskb %ymm3, %ecx > > > > vpmovmskb %ymm4, %eax > > > > + salq $32, %rax > > > > + orq %rcx, %rax > > > > + tzcntq %rax, %rax > > > > + leaq (VEC_SIZE * 2)(%rdi, %rax), %rax > > > > +# ifndef USE_AS_STRCHRNUL > > > > + cmp (%rax), %CHAR_REG > > > > + cmovne %rdx, %rax > > > > +# endif > > > > + VZEROUPPER > > > > + ret > > > > + > > > > + /* Cold case for crossing page with first load. */ > > > > + .p2align 4 > > > > +L(cross_page_boundary): > > > > + andq $-VEC_SIZE, %rdi > > > > + andl $(VEC_SIZE - 1), %ecx > > > > + > > > > + vmovdqa (%rdi), %ymm8 > > > > + VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > + VPCMPEQ %ymm8, %ymm9, %ymm2 > > > > + vpor %ymm1, %ymm2, %ymm1 > > > > + vpmovmskb %ymm1, %eax > > > > + /* Remove the leading bits. 
*/ > > > > + sarxl %ecx, %eax, %eax > > > > testl %eax, %eax > > > > -L(first_vec_x3): > > > > + jz L(aligned_more) > > > > tzcntl %eax, %eax > > > > -# ifdef USE_AS_STRCHRNUL > > > > - addq $(VEC_SIZE * 3), %rax > > > > + addq %rcx, %rdi > > > > addq %rdi, %rax > > > > -# else > > > > - xorl %edx, %edx > > > > - leaq (VEC_SIZE * 3)(%rdi, %rax), %rax > > > > - cmp (%rax), %CHAR_REG > > > > +# ifndef USE_AS_STRCHRNUL > > > > + cmp (%rax), %CHAR_REG > > > > cmovne %rdx, %rax > > > > # endif > > > > VZEROUPPER > > > > ret > > > > > > > > END (STRCHR) > > > > -#endif > > > > +# endif > > > > diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c > > > > index 583a152794..4dfbe3b58b 100644 > > > > --- a/sysdeps/x86_64/multiarch/strchr.c > > > > +++ b/sysdeps/x86_64/multiarch/strchr.c > > > > @@ -37,6 +37,7 @@ IFUNC_SELECTOR (void) > > > > > > > > if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER) > > > > && CPU_FEATURE_USABLE_P (cpu_features, AVX2) > > > > + && CPU_FEATURE_USABLE_P (cpu_features, BMI2) > > > > && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) > > > > return OPTIMIZE (avx2); > > > > > > > > -- > > > > 2.29.2 > > > > > > > > > > LGTM. > > > > > > Thanks. > > > > > > > This is the updated patch with extra white spaces fixed I am checking in. > > > > > > -- > > > H.J. > > > > Awesome! Thanks! > > > > N.G. > > Shoot, just realized this one has the old commit message that only > references test-strchr and test-strchrnul as passing (missing > reference to test-wcschr and test-wcschrnul). > > Do you want me to send another patch with a proper commit message, or can > you fix it on your end, or does it not really matter? > > N.G.
On Mon, Feb 8, 2021 at 1:46 PM Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Feb 8, 2021 at 2:48 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Mon, Feb 8, 2021 at 2:33 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > > > On Mon, Feb 8, 2021 at 6:08 AM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > > > > > On Tue, Feb 2, 2021 at 9:39 PM <goldstein.w.n@gmail.com> wrote: > > > > > > > > > > From: noah <goldstein.w.n@gmail.com> > > > > > > > > > > No bug. Just seemed the performance could be improved a bit. Observed > > > > > and expected behavior are unchanged. Optimized body of main > > > > > loop. Updated page cross logic and optimized accordingly. Made a few > > > > > minor instruction selection modifications. No regressions in test > > > > > suite. Both test-strchrnul and test-strchr passed. > > > > > > > > > > Signed-off-by: noah <goldstein.w.n@gmail.com> > > > > > --- > > > > > sysdeps/x86_64/multiarch/strchr-avx2.S | 235 ++++++++++++------------- > > > > > sysdeps/x86_64/multiarch/strchr.c | 1 + > > > > > 2 files changed, 118 insertions(+), 118 deletions(-) > > > > > > > > > > diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S > > > > > index d416558d04..8b9d78b55a 100644 > > > > > --- a/sysdeps/x86_64/multiarch/strchr-avx2.S > > > > > +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S > > > > > @@ -27,10 +27,12 @@ > > > > > # ifdef USE_AS_WCSCHR > > > > > # define VPBROADCAST vpbroadcastd > > > > > # define VPCMPEQ vpcmpeqd > > > > > +# define VPMINU vpminud > > > > > # define CHAR_REG esi > > > > > # else > > > > > # define VPBROADCAST vpbroadcastb > > > > > # define VPCMPEQ vpcmpeqb > > > > > +# define VPMINU vpminub > > > > > # define CHAR_REG sil > > > > > # endif > > > > > > > > > > @@ -39,20 +41,26 @@ > > > > > # endif > > > > > > > > > > # define VEC_SIZE 32 > > > > > +# define PAGE_SIZE 4096 > > > > > > > > > > .section .text.avx,"ax",@progbits > > > > > ENTRY (STRCHR) > > 
> > > movl %edi, %ecx > > > > > - /* Broadcast CHAR to YMM0. */ > > > > > +# ifndef USE_AS_STRCHRNUL > > > > > + xorl %edx, %edx > > > > > +# endif > > > > > + > > > > > + /* Broadcast CHAR to YMM0. */ > > > > > vmovd %esi, %xmm0 > > > > > vpxor %xmm9, %xmm9, %xmm9 > > > > > VPBROADCAST %xmm0, %ymm0 > > > > > - /* Check if we may cross page boundary with one vector load. */ > > > > > - andl $(2 * VEC_SIZE - 1), %ecx > > > > > - cmpl $VEC_SIZE, %ecx > > > > > - ja L(cros_page_boundary) > > > > > - > > > > > - /* Check the first VEC_SIZE bytes. Search for both CHAR and the > > > > > + > > > > > + /* Check if we cross page boundary with one vector load. */ > > > > > + andl $(PAGE_SIZE - 1), %ecx > > > > > + cmpl $(PAGE_SIZE - VEC_SIZE), %ecx > > > > > + ja L(cross_page_boundary) > > > > > + > > > > > + /* Check the first VEC_SIZE bytes. Search for both CHAR and the > > > > > null byte. */ > > > > > vmovdqu (%rdi), %ymm8 > > > > > VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > > @@ -60,50 +68,27 @@ ENTRY (STRCHR) > > > > > vpor %ymm1, %ymm2, %ymm1 > > > > > vpmovmskb %ymm1, %eax > > > > > testl %eax, %eax > > > > > - jnz L(first_vec_x0) > > > > > - > > > > > - /* Align data for aligned loads in the loop. */ > > > > > - addq $VEC_SIZE, %rdi > > > > > - andl $(VEC_SIZE - 1), %ecx > > > > > - andq $-VEC_SIZE, %rdi > > > > > - > > > > > - jmp L(more_4x_vec) > > > > > - > > > > > - .p2align 4 > > > > > -L(cros_page_boundary): > > > > > - andl $(VEC_SIZE - 1), %ecx > > > > > - andq $-VEC_SIZE, %rdi > > > > > - vmovdqu (%rdi), %ymm8 > > > > > - VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > > - VPCMPEQ %ymm8, %ymm9, %ymm2 > > > > > - vpor %ymm1, %ymm2, %ymm1 > > > > > - vpmovmskb %ymm1, %eax > > > > > - /* Remove the leading bytes. */ > > > > > - sarl %cl, %eax > > > > > - testl %eax, %eax > > > > > - jz L(aligned_more) > > > > > - /* Found CHAR or the null byte. 
*/ > > > > > + jz L(more_vecs) > > > > > tzcntl %eax, %eax > > > > > - addq %rcx, %rax > > > > > -# ifdef USE_AS_STRCHRNUL > > > > > + /* Found CHAR or the null byte. */ > > > > > addq %rdi, %rax > > > > > -# else > > > > > - xorl %edx, %edx > > > > > - leaq (%rdi, %rax), %rax > > > > > - cmp (%rax), %CHAR_REG > > > > > +# ifndef USE_AS_STRCHRNUL > > > > > + cmp (%rax), %CHAR_REG > > > > > cmovne %rdx, %rax > > > > > # endif > > > > > VZEROUPPER > > > > > ret > > > > > > > > > > .p2align 4 > > > > > +L(more_vecs): > > > > > + /* Align data for aligned loads in the loop. */ > > > > > + andq $-VEC_SIZE, %rdi > > > > > L(aligned_more): > > > > > - addq $VEC_SIZE, %rdi > > > > > > > > > > -L(more_4x_vec): > > > > > - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time > > > > > - since data is only aligned to VEC_SIZE. */ > > > > > - vmovdqa (%rdi), %ymm8 > > > > > + /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time > > > > > + since data is only aligned to VEC_SIZE. */ > > > > > + vmovdqa VEC_SIZE(%rdi), %ymm8 > > > > > + addq $VEC_SIZE, %rdi > > > > > VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > > VPCMPEQ %ymm8, %ymm9, %ymm2 > > > > > vpor %ymm1, %ymm2, %ymm1 > > > > > @@ -125,7 +110,7 @@ L(more_4x_vec): > > > > > vpor %ymm1, %ymm2, %ymm1 > > > > > vpmovmskb %ymm1, %eax > > > > > testl %eax, %eax > > > > > - jnz L(first_vec_x2) > > > > > + jnz L(first_vec_x2) > > > > > > > > > > vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 > > > > > VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > > @@ -133,122 +118,136 @@ L(more_4x_vec): > > > > > vpor %ymm1, %ymm2, %ymm1 > > > > > vpmovmskb %ymm1, %eax > > > > > testl %eax, %eax > > > > > - jnz L(first_vec_x3) > > > > > - > > > > > - addq $(VEC_SIZE * 4), %rdi > > > > > - > > > > > - /* Align data to 4 * VEC_SIZE. */ > > > > > - movq %rdi, %rcx > > > > > - andl $(4 * VEC_SIZE - 1), %ecx > > > > > - andq $-(4 * VEC_SIZE), %rdi > > > > > - > > > > > - .p2align 4 > > > > > -L(loop_4x_vec): > > > > > - /* Compare 4 * VEC at a time forward. 
*/ > > > > > - vmovdqa (%rdi), %ymm5 > > > > > - vmovdqa VEC_SIZE(%rdi), %ymm6 > > > > > - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7 > > > > > - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 > > > > > - > > > > > - VPCMPEQ %ymm5, %ymm0, %ymm1 > > > > > - VPCMPEQ %ymm6, %ymm0, %ymm2 > > > > > - VPCMPEQ %ymm7, %ymm0, %ymm3 > > > > > - VPCMPEQ %ymm8, %ymm0, %ymm4 > > > > > - > > > > > - VPCMPEQ %ymm5, %ymm9, %ymm5 > > > > > - VPCMPEQ %ymm6, %ymm9, %ymm6 > > > > > - VPCMPEQ %ymm7, %ymm9, %ymm7 > > > > > - VPCMPEQ %ymm8, %ymm9, %ymm8 > > > > > - > > > > > - vpor %ymm1, %ymm5, %ymm1 > > > > > - vpor %ymm2, %ymm6, %ymm2 > > > > > - vpor %ymm3, %ymm7, %ymm3 > > > > > - vpor %ymm4, %ymm8, %ymm4 > > > > > - > > > > > - vpor %ymm1, %ymm2, %ymm5 > > > > > - vpor %ymm3, %ymm4, %ymm6 > > > > > - > > > > > - vpor %ymm5, %ymm6, %ymm5 > > > > > - > > > > > - vpmovmskb %ymm5, %eax > > > > > - testl %eax, %eax > > > > > - jnz L(4x_vec_end) > > > > > - > > > > > - addq $(VEC_SIZE * 4), %rdi > > > > > + jz L(prep_loop_4x) > > > > > > > > > > - jmp L(loop_4x_vec) > > > > > + tzcntl %eax, %eax > > > > > + leaq (VEC_SIZE * 3)(%rdi, %rax), %rax > > > > > +# ifndef USE_AS_STRCHRNUL > > > > > + cmp (%rax), %CHAR_REG > > > > > + cmovne %rdx, %rax > > > > > +# endif > > > > > + VZEROUPPER > > > > > + ret > > > > > > > > > > .p2align 4 > > > > > L(first_vec_x0): > > > > > - /* Found CHAR or the null byte. */ > > > > > tzcntl %eax, %eax > > > > > -# ifdef USE_AS_STRCHRNUL > > > > > + /* Found CHAR or the null byte. 
*/ > > > > > addq %rdi, %rax > > > > > -# else > > > > > - xorl %edx, %edx > > > > > - leaq (%rdi, %rax), %rax > > > > > - cmp (%rax), %CHAR_REG > > > > > +# ifndef USE_AS_STRCHRNUL > > > > > + cmp (%rax), %CHAR_REG > > > > > cmovne %rdx, %rax > > > > > # endif > > > > > VZEROUPPER > > > > > ret > > > > > - > > > > > + > > > > > .p2align 4 > > > > > L(first_vec_x1): > > > > > tzcntl %eax, %eax > > > > > -# ifdef USE_AS_STRCHRNUL > > > > > - addq $VEC_SIZE, %rax > > > > > - addq %rdi, %rax > > > > > -# else > > > > > - xorl %edx, %edx > > > > > leaq VEC_SIZE(%rdi, %rax), %rax > > > > > - cmp (%rax), %CHAR_REG > > > > > +# ifndef USE_AS_STRCHRNUL > > > > > + cmp (%rax), %CHAR_REG > > > > > cmovne %rdx, %rax > > > > > # endif > > > > > VZEROUPPER > > > > > - ret > > > > > - > > > > > + ret > > > > > + > > > > > .p2align 4 > > > > > L(first_vec_x2): > > > > > tzcntl %eax, %eax > > > > > -# ifdef USE_AS_STRCHRNUL > > > > > - addq $(VEC_SIZE * 2), %rax > > > > > - addq %rdi, %rax > > > > > -# else > > > > > - xorl %edx, %edx > > > > > + /* Found CHAR or the null byte. */ > > > > > leaq (VEC_SIZE * 2)(%rdi, %rax), %rax > > > > > - cmp (%rax), %CHAR_REG > > > > > +# ifndef USE_AS_STRCHRNUL > > > > > + cmp (%rax), %CHAR_REG > > > > > cmovne %rdx, %rax > > > > > # endif > > > > > VZEROUPPER > > > > > ret > > > > > + > > > > > +L(prep_loop_4x): > > > > > + /* Align data to 4 * VEC_SIZE. */ > > > > > + andq $-(VEC_SIZE * 4), %rdi > > > > > > > > > > .p2align 4 > > > > > -L(4x_vec_end): > > > > > +L(loop_4x_vec): > > > > > + /* Compare 4 * VEC at a time forward. */ > > > > > + vmovdqa (VEC_SIZE * 4)(%rdi), %ymm5 > > > > > + vmovdqa (VEC_SIZE * 5)(%rdi), %ymm6 > > > > > + vmovdqa (VEC_SIZE * 6)(%rdi), %ymm7 > > > > > + vmovdqa (VEC_SIZE * 7)(%rdi), %ymm8 > > > > > + > > > > > + /* Leaves only CHARS matching esi as 0. 
*/ > > > > > + vpxor %ymm5, %ymm0, %ymm1 > > > > > + vpxor %ymm6, %ymm0, %ymm2 > > > > > + vpxor %ymm7, %ymm0, %ymm3 > > > > > + vpxor %ymm8, %ymm0, %ymm4 > > > > > + > > > > > + VPMINU %ymm1, %ymm5, %ymm1 > > > > > + VPMINU %ymm2, %ymm6, %ymm2 > > > > > + VPMINU %ymm3, %ymm7, %ymm3 > > > > > + VPMINU %ymm4, %ymm8, %ymm4 > > > > > + > > > > > + VPMINU %ymm1, %ymm2, %ymm5 > > > > > + VPMINU %ymm3, %ymm4, %ymm6 > > > > > + > > > > > + VPMINU %ymm5, %ymm6, %ymm5 > > > > > + > > > > > + VPCMPEQ %ymm5, %ymm9, %ymm5 > > > > > + vpmovmskb %ymm5, %eax > > > > > + > > > > > + addq $(VEC_SIZE * 4), %rdi > > > > > + testl %eax, %eax > > > > > + jz L(loop_4x_vec) > > > > > + > > > > > + VPCMPEQ %ymm1, %ymm9, %ymm1 > > > > > vpmovmskb %ymm1, %eax > > > > > testl %eax, %eax > > > > > jnz L(first_vec_x0) > > > > > + > > > > > + VPCMPEQ %ymm2, %ymm9, %ymm2 > > > > > vpmovmskb %ymm2, %eax > > > > > testl %eax, %eax > > > > > jnz L(first_vec_x1) > > > > > - vpmovmskb %ymm3, %eax > > > > > - testl %eax, %eax > > > > > - jnz L(first_vec_x2) > > > > > + > > > > > + VPCMPEQ %ymm3, %ymm9, %ymm3 > > > > > + VPCMPEQ %ymm4, %ymm9, %ymm4 > > > > > + vpmovmskb %ymm3, %ecx > > > > > vpmovmskb %ymm4, %eax > > > > > + salq $32, %rax > > > > > + orq %rcx, %rax > > > > > + tzcntq %rax, %rax > > > > > + leaq (VEC_SIZE * 2)(%rdi, %rax), %rax > > > > > +# ifndef USE_AS_STRCHRNUL > > > > > + cmp (%rax), %CHAR_REG > > > > > + cmovne %rdx, %rax > > > > > +# endif > > > > > + VZEROUPPER > > > > > + ret > > > > > + > > > > > + /* Cold case for crossing page with first load. */ > > > > > + .p2align 4 > > > > > +L(cross_page_boundary): > > > > > + andq $-VEC_SIZE, %rdi > > > > > + andl $(VEC_SIZE - 1), %ecx > > > > > + > > > > > + vmovdqa (%rdi), %ymm8 > > > > > + VPCMPEQ %ymm8, %ymm0, %ymm1 > > > > > + VPCMPEQ %ymm8, %ymm9, %ymm2 > > > > > + vpor %ymm1, %ymm2, %ymm1 > > > > > + vpmovmskb %ymm1, %eax > > > > > + /* Remove the leading bits. 
*/ > > > > > + sarxl %ecx, %eax, %eax > > > > > testl %eax, %eax > > > > > -L(first_vec_x3): > > > > > + jz L(aligned_more) > > > > > tzcntl %eax, %eax > > > > > -# ifdef USE_AS_STRCHRNUL > > > > > - addq $(VEC_SIZE * 3), %rax > > > > > + addq %rcx, %rdi > > > > > addq %rdi, %rax > > > > > -# else > > > > > - xorl %edx, %edx > > > > > - leaq (VEC_SIZE * 3)(%rdi, %rax), %rax > > > > > - cmp (%rax), %CHAR_REG > > > > > +# ifndef USE_AS_STRCHRNUL > > > > > + cmp (%rax), %CHAR_REG > > > > > cmovne %rdx, %rax > > > > > # endif > > > > > VZEROUPPER > > > > > ret > > > > > > > > > > END (STRCHR) > > > > > -#endif > > > > > +# endif > > > > > diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c > > > > > index 583a152794..4dfbe3b58b 100644 > > > > > --- a/sysdeps/x86_64/multiarch/strchr.c > > > > > +++ b/sysdeps/x86_64/multiarch/strchr.c > > > > > @@ -37,6 +37,7 @@ IFUNC_SELECTOR (void) > > > > > > > > > > if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER) > > > > > && CPU_FEATURE_USABLE_P (cpu_features, AVX2) > > > > > + && CPU_FEATURE_USABLE_P (cpu_features, BMI2) > > > > > && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) > > > > > return OPTIMIZE (avx2); > > > > > > > > > > -- > > > > > 2.29.2 > > > > > > > > > > > > > LGTM. > > > > > > > > Thanks. > > > > > > > > > > This is the updated patch with extra white spaces fixed I am checking in. > > > > > > > -- > > > > H.J. > > > > > > Awesome! Thanks! > > > > > > N.G. > > > > Shoot, just realized this one has the old commit message that only > > references test-strchr and test-strchrnul as passing (missing > > reference to test-wcschr and test-wcschrnul). > > > > Do you want me to send another patch with a proper commit message, or can > > you fix it on your end, or does it not really matter? > > > > N.G. I would like to backport this patch to release branches. Any comments or objections? --Sunil
diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S index d416558d04..8b9d78b55a 100644 --- a/sysdeps/x86_64/multiarch/strchr-avx2.S +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S @@ -27,10 +27,12 @@ # ifdef USE_AS_WCSCHR # define VPBROADCAST vpbroadcastd # define VPCMPEQ vpcmpeqd +# define VPMINU vpminud # define CHAR_REG esi # else # define VPBROADCAST vpbroadcastb # define VPCMPEQ vpcmpeqb +# define VPMINU vpminub # define CHAR_REG sil # endif @@ -39,20 +41,26 @@ # endif # define VEC_SIZE 32 +# define PAGE_SIZE 4096 .section .text.avx,"ax",@progbits ENTRY (STRCHR) movl %edi, %ecx - /* Broadcast CHAR to YMM0. */ +# ifndef USE_AS_STRCHRNUL + xorl %edx, %edx +# endif + + /* Broadcast CHAR to YMM0. */ vmovd %esi, %xmm0 vpxor %xmm9, %xmm9, %xmm9 VPBROADCAST %xmm0, %ymm0 - /* Check if we may cross page boundary with one vector load. */ - andl $(2 * VEC_SIZE - 1), %ecx - cmpl $VEC_SIZE, %ecx - ja L(cros_page_boundary) - - /* Check the first VEC_SIZE bytes. Search for both CHAR and the + + /* Check if we cross page boundary with one vector load. */ + andl $(PAGE_SIZE - 1), %ecx + cmpl $(PAGE_SIZE - VEC_SIZE), %ecx + ja L(cross_page_boundary) + + /* Check the first VEC_SIZE bytes. Search for both CHAR and the null byte. */ vmovdqu (%rdi), %ymm8 VPCMPEQ %ymm8, %ymm0, %ymm1 @@ -60,50 +68,27 @@ ENTRY (STRCHR) vpor %ymm1, %ymm2, %ymm1 vpmovmskb %ymm1, %eax testl %eax, %eax - jnz L(first_vec_x0) - - /* Align data for aligned loads in the loop. */ - addq $VEC_SIZE, %rdi - andl $(VEC_SIZE - 1), %ecx - andq $-VEC_SIZE, %rdi - - jmp L(more_4x_vec) - - .p2align 4 -L(cros_page_boundary): - andl $(VEC_SIZE - 1), %ecx - andq $-VEC_SIZE, %rdi - vmovdqu (%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - /* Remove the leading bytes. */ - sarl %cl, %eax - testl %eax, %eax - jz L(aligned_more) - /* Found CHAR or the null byte. 
*/ + jz L(more_vecs) tzcntl %eax, %eax - addq %rcx, %rax -# ifdef USE_AS_STRCHRNUL + /* Found CHAR or the null byte. */ addq %rdi, %rax -# else - xorl %edx, %edx - leaq (%rdi, %rax), %rax - cmp (%rax), %CHAR_REG +# ifndef USE_AS_STRCHRNUL + cmp (%rax), %CHAR_REG cmovne %rdx, %rax # endif VZEROUPPER ret .p2align 4 +L(more_vecs): + /* Align data for aligned loads in the loop. */ + andq $-VEC_SIZE, %rdi L(aligned_more): - addq $VEC_SIZE, %rdi -L(more_4x_vec): - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ - vmovdqa (%rdi), %ymm8 + /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time + since data is only aligned to VEC_SIZE. */ + vmovdqa VEC_SIZE(%rdi), %ymm8 + addq $VEC_SIZE, %rdi VPCMPEQ %ymm8, %ymm0, %ymm1 VPCMPEQ %ymm8, %ymm9, %ymm2 vpor %ymm1, %ymm2, %ymm1 @@ -125,7 +110,7 @@ L(more_4x_vec): vpor %ymm1, %ymm2, %ymm1 vpmovmskb %ymm1, %eax testl %eax, %eax - jnz L(first_vec_x2) + jnz L(first_vec_x2) vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 VPCMPEQ %ymm8, %ymm0, %ymm1 @@ -133,122 +118,136 @@ L(more_4x_vec): vpor %ymm1, %ymm2, %ymm1 vpmovmskb %ymm1, %eax testl %eax, %eax - jnz L(first_vec_x3) - - addq $(VEC_SIZE * 4), %rdi - - /* Align data to 4 * VEC_SIZE. */ - movq %rdi, %rcx - andl $(4 * VEC_SIZE - 1), %ecx - andq $-(4 * VEC_SIZE), %rdi - - .p2align 4 -L(loop_4x_vec): - /* Compare 4 * VEC at a time forward. 
*/ - vmovdqa (%rdi), %ymm5 - vmovdqa VEC_SIZE(%rdi), %ymm6 - vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7 - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 - - VPCMPEQ %ymm5, %ymm0, %ymm1 - VPCMPEQ %ymm6, %ymm0, %ymm2 - VPCMPEQ %ymm7, %ymm0, %ymm3 - VPCMPEQ %ymm8, %ymm0, %ymm4 - - VPCMPEQ %ymm5, %ymm9, %ymm5 - VPCMPEQ %ymm6, %ymm9, %ymm6 - VPCMPEQ %ymm7, %ymm9, %ymm7 - VPCMPEQ %ymm8, %ymm9, %ymm8 - - vpor %ymm1, %ymm5, %ymm1 - vpor %ymm2, %ymm6, %ymm2 - vpor %ymm3, %ymm7, %ymm3 - vpor %ymm4, %ymm8, %ymm4 - - vpor %ymm1, %ymm2, %ymm5 - vpor %ymm3, %ymm4, %ymm6 - - vpor %ymm5, %ymm6, %ymm5 - - vpmovmskb %ymm5, %eax - testl %eax, %eax - jnz L(4x_vec_end) - - addq $(VEC_SIZE * 4), %rdi + jz L(prep_loop_4x) - jmp L(loop_4x_vec) + tzcntl %eax, %eax + leaq (VEC_SIZE * 3)(%rdi, %rax), %rax +# ifndef USE_AS_STRCHRNUL + cmp (%rax), %CHAR_REG + cmovne %rdx, %rax +# endif + VZEROUPPER + ret .p2align 4 L(first_vec_x0): - /* Found CHAR or the null byte. */ tzcntl %eax, %eax -# ifdef USE_AS_STRCHRNUL + /* Found CHAR or the null byte. */ addq %rdi, %rax -# else - xorl %edx, %edx - leaq (%rdi, %rax), %rax - cmp (%rax), %CHAR_REG +# ifndef USE_AS_STRCHRNUL + cmp (%rax), %CHAR_REG cmovne %rdx, %rax # endif VZEROUPPER ret - + .p2align 4 L(first_vec_x1): tzcntl %eax, %eax -# ifdef USE_AS_STRCHRNUL - addq $VEC_SIZE, %rax - addq %rdi, %rax -# else - xorl %edx, %edx leaq VEC_SIZE(%rdi, %rax), %rax - cmp (%rax), %CHAR_REG +# ifndef USE_AS_STRCHRNUL + cmp (%rax), %CHAR_REG cmovne %rdx, %rax # endif VZEROUPPER - ret - + ret + .p2align 4 L(first_vec_x2): tzcntl %eax, %eax -# ifdef USE_AS_STRCHRNUL - addq $(VEC_SIZE * 2), %rax - addq %rdi, %rax -# else - xorl %edx, %edx + /* Found CHAR or the null byte. */ leaq (VEC_SIZE * 2)(%rdi, %rax), %rax - cmp (%rax), %CHAR_REG +# ifndef USE_AS_STRCHRNUL + cmp (%rax), %CHAR_REG cmovne %rdx, %rax # endif VZEROUPPER ret + +L(prep_loop_4x): + /* Align data to 4 * VEC_SIZE. 
*/ + andq $-(VEC_SIZE * 4), %rdi .p2align 4 -L(4x_vec_end): +L(loop_4x_vec): + /* Compare 4 * VEC at a time forward. */ + vmovdqa (VEC_SIZE * 4)(%rdi), %ymm5 + vmovdqa (VEC_SIZE * 5)(%rdi), %ymm6 + vmovdqa (VEC_SIZE * 6)(%rdi), %ymm7 + vmovdqa (VEC_SIZE * 7)(%rdi), %ymm8 + + /* Leaves only CHARS matching esi as 0. */ + vpxor %ymm5, %ymm0, %ymm1 + vpxor %ymm6, %ymm0, %ymm2 + vpxor %ymm7, %ymm0, %ymm3 + vpxor %ymm8, %ymm0, %ymm4 + + VPMINU %ymm1, %ymm5, %ymm1 + VPMINU %ymm2, %ymm6, %ymm2 + VPMINU %ymm3, %ymm7, %ymm3 + VPMINU %ymm4, %ymm8, %ymm4 + + VPMINU %ymm1, %ymm2, %ymm5 + VPMINU %ymm3, %ymm4, %ymm6 + + VPMINU %ymm5, %ymm6, %ymm5 + + VPCMPEQ %ymm5, %ymm9, %ymm5 + vpmovmskb %ymm5, %eax + + addq $(VEC_SIZE * 4), %rdi + testl %eax, %eax + jz L(loop_4x_vec) + + VPCMPEQ %ymm1, %ymm9, %ymm1 vpmovmskb %ymm1, %eax testl %eax, %eax jnz L(first_vec_x0) + + VPCMPEQ %ymm2, %ymm9, %ymm2 vpmovmskb %ymm2, %eax testl %eax, %eax jnz L(first_vec_x1) - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(first_vec_x2) + + VPCMPEQ %ymm3, %ymm9, %ymm3 + VPCMPEQ %ymm4, %ymm9, %ymm4 + vpmovmskb %ymm3, %ecx vpmovmskb %ymm4, %eax + salq $32, %rax + orq %rcx, %rax + tzcntq %rax, %rax + leaq (VEC_SIZE * 2)(%rdi, %rax), %rax +# ifndef USE_AS_STRCHRNUL + cmp (%rax), %CHAR_REG + cmovne %rdx, %rax +# endif + VZEROUPPER + ret + + /* Cold case for crossing page with first load. */ + .p2align 4 +L(cross_page_boundary): + andq $-VEC_SIZE, %rdi + andl $(VEC_SIZE - 1), %ecx + + vmovdqa (%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax + /* Remove the leading bits. 
*/ + sarxl %ecx, %eax, %eax testl %eax, %eax -L(first_vec_x3): + jz L(aligned_more) tzcntl %eax, %eax -# ifdef USE_AS_STRCHRNUL - addq $(VEC_SIZE * 3), %rax + addq %rcx, %rdi addq %rdi, %rax -# else - xorl %edx, %edx - leaq (VEC_SIZE * 3)(%rdi, %rax), %rax - cmp (%rax), %CHAR_REG +# ifndef USE_AS_STRCHRNUL + cmp (%rax), %CHAR_REG cmovne %rdx, %rax # endif VZEROUPPER ret END (STRCHR) -#endif +# endif diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c index 583a152794..4dfbe3b58b 100644 --- a/sysdeps/x86_64/multiarch/strchr.c +++ b/sysdeps/x86_64/multiarch/strchr.c @@ -37,6 +37,7 @@ IFUNC_SELECTOR (void) if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER) && CPU_FEATURE_USABLE_P (cpu_features, AVX2) + && CPU_FEATURE_USABLE_P (cpu_features, BMI2) && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load)) return OPTIMIZE (avx2);