From patchwork Thu Jan 21 00:12:07 2021
X-Patchwork-Id: 41781
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com
Subject: [PATCH v2] x86: Refactor and improve performance of strchr-avx2.S
Date: Wed, 20 Jan 2021 19:12:07 -0500
Message-Id: <20210121001206.471283-1-goldstein.w.n@gmail.com>

No bug. Just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized body of main loop.
Updated page cross logic and optimized accordingly. Made a few minor
instruction selection modifications. No regressions in test suite.
test-strchrnul, test-strchr, test-wcschrnul, and test-wcschr all passed.

Author: noah

---

>> No bug. Just seemed the performance could be improved a bit. Observed
>> and expected behavior are unchanged. Optimized body of main loop.
>> Updated page cross logic and optimized accordingly. Made a few minor
>> instruction selection modifications. No regressions in test suite.
>> Both test-strchrnul and test-strchr passed.
>
> Thank you very much.
>

:) This is my first patch (hopefully). It's exciting!

>> Author: noah
>> ---
>> Possible improvements fall into 5 categories, roughly ordered by
>> importance:
>>
>> 1 - The main loop that handles 4 vectors at a time has 4 uops removed
>>     from it. This is at the cost of additional port pressure on ports
>>     0/1, as vpminub and vpminud can only run on those ports, whereas
>>     vpor can run on ports 0/1/5. But the 4 saved uops should more than
>>     make up for that. Whether analyzed by latency, throughput, or
>>     benchmarks, it comes out as a performance improvement.
>>
>> 2 - As far as I can tell, the cros_page_boundary logic was broken (or
>>     the jump label was especially confusing). The original code
>>     tested for this by checking if the load would split cache lines,
>>     not pages. I don't think there are any x86_64 architectures that
>
> The original code has
>
>	/* Check if we may cross page boundary with one vector load.  */
>	andl	$(2 * VEC_SIZE - 1), %ecx
>	cmpl	$VEC_SIZE, %ecx
>	ja	L(cros_page_boundary)
>
> It is just very conservative. It never misses a page cross.

I see. Got it.

>>     support both AVX2 (Haswell and newer) and don't have a minimum
>>     page size of 4kb. Given that the cost of splitting a cache line
>>     appears to be low on CPUs that support AVX2
>>     [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
>>     is a one-off cost, I don't see it as being critical to avoid. If
>>     it is critical to avoid a cache-line split load, I think a
>>     branchless approach might be better. Thoughts?
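(For reference, the new entry check boils down to the condition below. This is only an illustrative C restatement of the assembly, not code from the patch; the helper name is made up and PAGE_SIZE is the assumed 4096-byte minimum page size.)

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096   /* assumed minimum page size, as in the patch  */
    #define VEC_SIZE  32     /* one AVX2 vector  */

    /* True if a single VEC_SIZE-byte load at P could touch the next page,
       i.e. if fewer than VEC_SIZE bytes remain in P's page.  Mirrors:
           andl  $(PAGE_SIZE - 1), %ecx
           cmpl  $(PAGE_SIZE - VEC_SIZE), %ecx
           ja    L(cross_page_boundary)  */
    static inline bool
    may_cross_page (const char *p)
    {
      uintptr_t offset_in_page = (uintptr_t) p & (PAGE_SIZE - 1);
      return offset_in_page > PAGE_SIZE - VEC_SIZE;
    }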
>>     Note: Given that the check was changed to only be for page
>>     crosses, I think it is significantly less likely to be taken than
>>     before, so I moved the branch target away from such prime real
>>     estate. I am also unsure if there is a PAGE_SIZE define somewhere
>>     in glibc that I can use instead of defining it here.
>
> Defining PAGE_SIZE to 4096 is fine.

Got it.

>> 3 - What was originally the more_4x_vec label was removed and the
>>     code only does 3 vector blocks now. The reasoning is as follows:
>>     there are two entry points to that code section, from a page
>>     cross or fall-through (no page cross). The fall-through case is
>>     more important to optimize for, assuming the point above. In this
>>     case the incoming alignments (i.e. the alignment of the pointer
>>     in rdi) mod 128 can be [0 - 31], [32 - 63], [64 - 95], or
>>     [96 - 127]. Doing 4 vector blocks optimizes for [96 - 127], so
>>     that when the main 4x loop is hit, a new 128-byte-aligned segment
>>     can be started. Doing 3 vector blocks optimizes for the [0 - 31]
>>     case. I generally think a string is more likely to be aligned to
>>     the cache line size (or to L2 prefetcher cache line pairs) than
>>     to [96 - 127] bytes. An alternative would be to make that code do
>>     5 vector blocks. This would mean that at most 2 vector blocks are
>>     wasted when realigning to 128 bytes (as opposed to 3x or 4x,
>>     which can allow 3 vector blocks to be wasted). Thoughts?
>
> I picked 4x because it was how I unrolled in other string functions.
> 3x is fine if it is faster than 4x.

I see. What are your thoughts on doing 5x? With 5x the worst case is
that 2 vector blocks get wasted on realignment, though it adds
overhead for A) getting to the 4x loop and B) the [0 - 31], [32 - 63],
and [96 - 127] initial alignment cases.

>> 4 - Replace the sarl that used the %cl partial register with sarx.
>>     This assumes BMI2, which is Haswell and newer, but AVX2 implies
>>     Haswell and newer, so I think it is safe.
>
> Need to add a BMI2 check in strchr.c:
>
> && CPU_FEATURE_USABLE_P (cpu_features, BMI2)

Done.

>> 5 - In the first_vec_xN return blocks, change the addq from using rax
>>     as a destination to using rdi as a destination. This just allows
>>     for one uop shorter latency.
>
> Sounds reasonable.

I forgot to add this in the original email: I also removed a few
unnecessary instructions operating on ecx (it was being realigned with
rdi but never used).

>> Benchmarks:
>> I can post my benchmarking code in the email thread if that is
>> appreciated. I benchmarked a variety of cases with different
>> alignments, sizes, and data hotness (in L1, L2, etc.), so I can't
>> just give a single x% improvement number. The benchmarks were run on
>> my personal computer (Icelake-client). Of my 2732 test cases, 1985
>> saw an improvement with these changes and 747 performed better with
>> the original code. I should note, however, that my test cases had a
>> disproportionate number of cases with 4kb page crosses, which as
>> discussed I moved to a colder path.
>
> Please submit a separate patch to add your workload to
> benchtests/bench-strstr.c.

Got it. A few questions. Should the benchmark have a single return
value to indicate pass/failure (for a performance regression), or is
the idea just to output times so it can be used as a tool for future
patches? Second, is the coding standard for benchmarks the same as for
production code, or are the rules more relaxed? Last, should there be
a separate benchmark for strchr, strchrnul, wcschr, and wcschrnul, or
is having them all in one with #defines the best approach?
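To make point 1 above concrete, here is a rough C intrinsics sketch of the xor + unsigned-min trick for a single 32-byte block (illustrative only; the helper name is mine, and the real loop folds four blocks together with vpminu before doing a single compare against zero):

    #include <immintrin.h>   /* AVX2; build with -mavx2  */

    /* Returns a byte mask that is nonzero iff DATA contains CHAR or a
       null byte.  XOR turns bytes equal to CHAR into 0; the unsigned
       min with the original data is then 0 exactly where DATA had CHAR
       or 0, so one compare against zero finds both.  */
    static inline unsigned int
    has_char_or_null (__m256i data, __m256i char_broadcast)
    {
      __m256i xored = _mm256_xor_si256 (data, char_broadcast);    /* vpxor   */
      __m256i folded = _mm256_min_epu8 (xored, data);             /* vpminub */
      __m256i hits = _mm256_cmpeq_epi8 (folded, _mm256_setzero_si256 ());
      return (unsigned int) _mm256_movemask_epi8 (hits);          /* vpmovmskb */
    }

Per block the operation count is unchanged; the saving in the 4x loop comes from combining the four per-vector results with vpminu and issuing only one vpcmpeq against zero per iteration.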
>> In general the effects of this change are:
>>
>> Large/medium sized strings (from any part of memory, really): 10-30%
>> performance boost.
>> Small strings that are not page crosses: 10-20% performance boost.
>> Small strings that are cache-line splits: 20-30% performance boost.
>> Small strings that cross a page boundary (a 4kb page, that is) see a
>> 10% performance regression.
>>
>> No regressions in test suite. Both test-strchrnul and test-strchr
>> passed.
>
> It is also used for wcschr and wcschrnul. Do they work correctly with
> your change?

Yes. I updated the commit message to indicate that those tests pass as
well, and updated my benchmark to test their performance too.

>> I would love to hear your feedback on all of these points (or any
>> that I missed).
>>
>> FSF documentation has been signed and returned (via pub key and
>> email, respectively).

 sysdeps/x86_64/multiarch/strchr-avx2.S | 177 +++++++++++++------------
 sysdeps/x86_64/multiarch/strchr.c      |   2 +
 2 files changed, 91 insertions(+), 88 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
index d416558d04..628242c483 100644
--- a/sysdeps/x86_64/multiarch/strchr-avx2.S
+++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
@@ -27,10 +27,12 @@
 # ifdef USE_AS_WCSCHR
 #  define VPBROADCAST	vpbroadcastd
 #  define VPCMPEQ	vpcmpeqd
+#  define VPMINU	vpminud
 #  define CHAR_REG	esi
 # else
 #  define VPBROADCAST	vpbroadcastb
 #  define VPCMPEQ	vpcmpeqb
+#  define VPMINU	vpminub
 #  define CHAR_REG	sil
 # endif
 
@@ -39,7 +41,8 @@
 # endif
 
 # define VEC_SIZE 32
-
+# define PAGE_SIZE 4096
+
 	.section .text.avx,"ax",@progbits
 ENTRY (STRCHR)
 	movl	%edi, %ecx
@@ -47,10 +50,10 @@ ENTRY (STRCHR)
 	vmovd	%esi, %xmm0
 	vpxor	%xmm9, %xmm9, %xmm9
 	VPBROADCAST %xmm0, %ymm0
-	/* Check if we may cross page boundary with one vector load.  */
-	andl	$(2 * VEC_SIZE - 1), %ecx
-	cmpl	$VEC_SIZE, %ecx
-	ja	L(cros_page_boundary)
+	/* Check if we cross page boundary with one vector load.  */
+	andl	$(PAGE_SIZE - 1), %ecx
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %ecx
+	ja	L(cross_page_boundary)
 
 	/* Check the first VEC_SIZE bytes.  Search for both CHAR and the
 	   null byte.  */
@@ -63,45 +66,11 @@ ENTRY (STRCHR)
 	jnz	L(first_vec_x0)
 
 	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-
-	jmp	L(more_4x_vec)
-
-	.p2align 4
-L(cros_page_boundary):
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-	vmovdqu	(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	/* Remove the leading bytes.  */
-	sarl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(aligned_more)
-	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-	addq	%rcx, %rax
-# ifdef USE_AS_STRCHRNUL
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
-	cmovne	%rdx, %rax
-# endif
-	VZEROUPPER
-	ret
-
-	.p2align 4
+	andq	$-VEC_SIZE, %rdi
 L(aligned_more):
 	addq	$VEC_SIZE, %rdi
 
-L(more_4x_vec):
-	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+	/* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
 	   since data is only aligned to VEC_SIZE.  */
 	vmovdqa	(%rdi), %ymm8
 	VPCMPEQ %ymm8, %ymm0, %ymm1
@@ -127,19 +96,9 @@ L(more_4x_vec):
 	testl	%eax, %eax
 	jnz	L(first_vec_x2)
 
-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x3)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
+
 	/* Align data to 4 * VEC_SIZE.  */
-	movq	%rdi, %rcx
-	andl	$(4 * VEC_SIZE - 1), %ecx
+	addq	$(VEC_SIZE * 3), %rdi
 	andq	$-(4 * VEC_SIZE), %rdi
 
 	.p2align 4
@@ -150,34 +109,61 @@ L(loop_4x_vec):
 	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm7
 	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
 
-	VPCMPEQ %ymm5, %ymm0, %ymm1
-	VPCMPEQ %ymm6, %ymm0, %ymm2
-	VPCMPEQ %ymm7, %ymm0, %ymm3
-	VPCMPEQ %ymm8, %ymm0, %ymm4
-
-	VPCMPEQ %ymm5, %ymm9, %ymm5
-	VPCMPEQ %ymm6, %ymm9, %ymm6
-	VPCMPEQ %ymm7, %ymm9, %ymm7
-	VPCMPEQ %ymm8, %ymm9, %ymm8
+	/* Leaves only CHARS matching esi as 0.  */
+	vpxor	%ymm5, %ymm0, %ymm1
+	vpxor	%ymm6, %ymm0, %ymm2
+	vpxor	%ymm7, %ymm0, %ymm3
+	vpxor	%ymm8, %ymm0, %ymm4
 
-	vpor	%ymm1, %ymm5, %ymm1
-	vpor	%ymm2, %ymm6, %ymm2
-	vpor	%ymm3, %ymm7, %ymm3
-	vpor	%ymm4, %ymm8, %ymm4
+	VPMINU	%ymm1, %ymm5, %ymm1
+	VPMINU	%ymm2, %ymm6, %ymm2
+	VPMINU	%ymm3, %ymm7, %ymm3
+	VPMINU	%ymm4, %ymm8, %ymm4
 
-	vpor	%ymm1, %ymm2, %ymm5
-	vpor	%ymm3, %ymm4, %ymm6
+	VPMINU	%ymm1, %ymm2, %ymm5
+	VPMINU	%ymm3, %ymm4, %ymm6
 
-	vpor	%ymm5, %ymm6, %ymm5
+	VPMINU	%ymm5, %ymm6, %ymm5
+	VPCMPEQ %ymm5, %ymm9, %ymm5
 
 	vpmovmskb %ymm5, %eax
+	addq	$(VEC_SIZE * 4), %rdi
+
 	testl	%eax, %eax
-	jnz	L(4x_vec_end)
+	jz	L(loop_4x_vec)
 
-	addq	$(VEC_SIZE * 4), %rdi
+	subq	$(VEC_SIZE * 4), %rdi
+
+L(4x_vec_end):
+	VPCMPEQ %ymm1, %ymm9, %ymm1
+	vpmovmskb %ymm1, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x0)
+	VPCMPEQ %ymm2, %ymm9, %ymm2
+	vpmovmskb %ymm2, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x1)
+	VPCMPEQ %ymm3, %ymm9, %ymm3
+	vpmovmskb %ymm3, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x2)
+	VPCMPEQ %ymm4, %ymm9, %ymm4
+	vpmovmskb %ymm4, %eax
 
-	jmp	L(loop_4x_vec)
+	tzcntl	%eax, %eax
+# ifdef USE_AS_STRCHRNUL
+	addq	$(VEC_SIZE * 3), %rdi
+	addq	%rdi, %rax
+# else
+	xorl	%edx, %edx
+	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret
+
 	.p2align 4
 L(first_vec_x0):
 	/* Found CHAR or the null byte.  */
@@ -197,7 +183,7 @@ L(first_vec_x0):
 L(first_vec_x1):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$VEC_SIZE, %rax
+	addq	$VEC_SIZE, %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -212,7 +198,7 @@ L(first_vec_x1):
 L(first_vec_x2):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 2), %rax
+	addq	$(VEC_SIZE * 2), %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -223,32 +209,47 @@ L(first_vec_x2):
 	VZEROUPPER
 	ret
 
+	/* Cold case for crossing page with first load.  */
 	.p2align 4
-L(4x_vec_end):
+L(cross_page_boundary):
+	andl	$(VEC_SIZE - 1), %ecx
+	andq	$-VEC_SIZE, %rdi
+	vmovdqu	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
+	/* Remove the leading bits.  */
+	sarxl	%ecx, %eax, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-	vpmovmskb %ymm2, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x1)
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x2)
-	vpmovmskb %ymm4, %eax
+	jnz	L(cross_page_return)
+
+	/* Second block so that the 3 other blocks from L(aligned_more)
+	   will get to next 4 * VEC_SIZE alignment.  */
+	andq	$-VEC_SIZE, %rdi
+	addq	$VEC_SIZE, %rdi
+	xorl	%ecx, %ecx
+	vmovdqa	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-L(first_vec_x3):
+	jz	L(aligned_more)
+
+L(cross_page_return):
 	tzcntl	%eax, %eax
+	addq	%rcx, %rax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 3), %rax
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
-	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	leaq	(%rdi, %rax), %rax
 	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
-
+
 END (STRCHR)
-#endif
+# endif
diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c
index 583a152794..77ff44b8f3 100644
--- a/sysdeps/x86_64/multiarch/strchr.c
+++ b/sysdeps/x86_64/multiarch/strchr.c
@@ -37,6 +37,8 @@ IFUNC_SELECTOR (void)
 
   if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
       && CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI1)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
      && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
     return OPTIMIZE (avx2);
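As a footnote to point 4 and the BMI2 check added above: the cross-page mask fixup is, in C terms, just the shift below. This is a sketch with an invented helper name, not code from the patch; the assembly does it with a single sarxl, which takes its shift count from %ecx directly rather than requiring it in %cl, and that is why the IFUNC selector now also requires BMI2.

    #include <stdint.h>

    #define VEC_SIZE 32

    /* The cross-page load is aligned down to VEC_SIZE, so the low bits of
       the vpmovmskb result describe bytes that sit before the real string
       start; shift them out before tzcnt picks the first match.  */
    static inline unsigned int
    discard_leading_bytes (unsigned int match_mask, const char *real_start)
    {
      unsigned int skip = (uintptr_t) real_start & (VEC_SIZE - 1);
      return match_mask >> skip;   /* the patch uses sarxl %ecx, %eax, %eax  */
    }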