From patchwork Wed Jan 20 09:29:15 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 41764
From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com
Subject: [PATCH] x86: Refactor and improve performance of strchr-avx2.S
Date: Wed, 20 Jan 2021 04:29:15 -0500
Message-Id: <20210120092914.256388-1-goldstein.w.n@gmail.com>

No bug. It just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized the body of the main loop.
Updated the page-cross logic and optimized accordingly. Made a few minor
instruction selection modifications. No regressions in the test suite.
Both test-strchrnul and test-strchr passed.

Author: noah
---
Possible improvements fall into 5 categories, roughly ordered by
importance:

1 - The main loop that handles 4 vectors at a time has 4 uops removed from
    it. This comes at the cost of additional port pressure on ports 0/1,
    as vpminub and vpminud can only run on those ports whereas vpor can
    run on ports 0/1/5. But the 4 saved uops should more than make up for
    that. Analysis by latency, by throughput, and by benchmarks all shows
    it is a performance improvement. (The first C sketch after this list
    illustrates the idea.)

2 - As far as I can tell the cros_page_boundary logic was broken (or the
    jump label was especially confusing). The original code tested for
    this case by checking whether the load would split a cache line, not a
    page. I don't think there are any x86_64 architectures that support
    AVX2 (Haswell and newer) but don't have a minimum page size of 4kb.
    Given that the cost of splitting a cache line appears to be low on
    CPUs that support AVX2
    [https://www.agner.org/optimize/blog/read.php?i=415#423] and this is a
    one-off, I don't see avoiding it as critical. If it is critical to
    avoid a cache-line split load, I think a branchless approach might be
    better. Thoughts? Note: given that the check now only covers page
    crosses, I think it is significantly less likely to be taken than
    before, so I moved the branch target away from such prime real estate.
    I am also unsure whether there is a PAGE_SIZE define somewhere in
    glibc that I can use instead of defining it here. (The second C sketch
    after this list shows the new check.)

3 - What was originally the more_4x_vec label was removed and the code
    only does 3 vector blocks now. The reasoning is as follows: there are
    two entry points to that code section, from a page cross or from fall
    through (no page cross). The fall-through case is the more important
    one to optimize for, given the point above. In that case the incoming
    alignment (i.e. the alignment of the pointer in rdi) mod 128 can be
    [0 - 31], [32 - 63], [64 - 95], or [96 - 127]. Doing 4 vector blocks
    optimizes for the [96 - 127] case, so that when the main 4x loop is
    hit, a new 128-byte-aligned segment can be started. Doing 3 vector
    blocks optimizes for the [0 - 31] case. I generally think a string is
    more likely to be aligned to the cache line size (or to L2 prefetcher
    cache-line pairs) than to land at [96 - 127] bytes. An alternative
    would be to make that code do 5 vector blocks. That would mean at most
    2 vector blocks are wasted when realigning to 128 bytes (as opposed to
    3x or 4x, which can allow 3 vector blocks to be wasted). Thoughts?

4 - Replace the sarl that shifts by the %cl partial register with sarx.
    This assumes BMI2, which is Haswell and newer, but AVX2 implies
    Haswell and newer, so I think it is safe.

5 - In the first_vec_xN return blocks, change the addq from using rax as
    its destination to using rdi. This just allows for one uop shorter
    latency.
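Here is a minimal scalar C sketch of point 1 (my own illustration, not
code from the patch; min_u8 and fold are made-up names). The byte-wise
identity the new loop body relies on is that min_unsigned (byte ^ CHAR,
byte) is zero exactly when byte == CHAR or byte == 0, so the per-vector
VPCMPEQ/VPCMPEQ/vpor triple can become vpxor/VPMINU and the single compare
against zero can be deferred until the four vectors have been folded
together with further VPMINU steps:

#include <assert.h>

/* Models one lane of VPMINU(B).  */
static unsigned char
min_u8 (unsigned char a, unsigned char b)
{
  return a < b ? a : b;
}

/* Models vpxor + VPMINU on one byte lane.  The result is 0 exactly when
   byte == c (the xor is 0) or byte == 0 (byte itself is 0), which is what
   the single VPCMPEQ against zero at the end of the loop detects.  */
static unsigned char
fold (unsigned char byte, unsigned char c)
{
  return min_u8 ((unsigned char) (byte ^ c), byte);
}

int
main (void)
{
  for (int c = 0; c < 256; c++)
    for (int b = 0; b < 256; b++)
      assert ((fold ((unsigned char) b, (unsigned char) c) == 0)
              == (b == c || b == 0));
  return 0;
}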
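And a small C sketch of the new entry check from point 2 (again only an
illustration; may_cross_page is a made-up helper, while PAGE_SIZE and
VEC_SIZE mirror the defines in the patch). A VEC_SIZE-byte load starting
at ptr can only cross into the next page when the offset of ptr within its
4kb page is greater than PAGE_SIZE - VEC_SIZE:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define VEC_SIZE 32

/* Mirrors:
     andl $(PAGE_SIZE - 1), %ecx
     cmpl $(PAGE_SIZE - VEC_SIZE), %ecx
     ja   L(cros_page_boundary)
   The old code used (2 * VEC_SIZE - 1) / VEC_SIZE here, i.e. it flagged
   any load that merely split a 64-byte cache line.  */
static bool
may_cross_page (const char *ptr)
{
  uintptr_t offset = (uintptr_t) ptr & (PAGE_SIZE - 1);
  return offset > PAGE_SIZE - VEC_SIZE;
}

int
main (void)
{
  /* Offset 0xfe0 is the last spot where a full 32-byte load still fits in
     the page; one byte later the load would cross.  Returns 0 on success.  */
  return !(!may_cross_page ((const char *) 0x1fe0)
           && may_cross_page ((const char *) 0x1fe1));
}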
Benchmarks:

I can post my benchmarking code in the email thread if that would be
appreciated. I benchmarked a variety of cases with different alignments,
sizes, and data hotness (in L1, L2, etc...), so I can't just give a single
x% improvement figure. The benchmarks were run on my personal computer
(icelake-client). Of my 2732 test cases, 1985 saw an improvement with
these changes and 747 performed better with the original code. I should
note, however, that my test cases had a disproportionate number of cases
with 4kb page crosses, which as discussed I moved to a colder path. In
general the effects of this change are: large/medium sized strings (from
any part of memory, really) see a 10-30% performance boost; small strings
that do not cross a page see a 10-20% performance boost; small strings
that are cache-line splits see a 20-30% performance boost; small strings
that cross a page boundary (a 4kb page, that is) see a 10% performance
regression.

No regressions in the test suite. Both test-strchrnul and test-strchr
passed.

I would love to hear your feedback on all of these points (or any that I
missed).

FSF Documentation has been signed and returned (via pub key and email
respectively).

 sysdeps/x86_64/multiarch/strchr-avx2.S | 173 +++++++++++++------------ 1 file changed, 87 insertions(+), 86 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S index d416558d04..09c2df86d1 100644 --- a/sysdeps/x86_64/multiarch/strchr-avx2.S +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S @@ -27,10 +27,12 @@ # ifdef USE_AS_WCSCHR # define VPBROADCAST vpbroadcastd # define VPCMPEQ vpcmpeqd +# define VPMINU vpminud # define CHAR_REG esi # else # define VPBROADCAST vpbroadcastb # define VPCMPEQ vpcmpeqb +# define VPMINU vpminub # define CHAR_REG sil # endif @@ -39,7 +41,8 @@ # endif # define VEC_SIZE 32 - +# define PAGE_SIZE 4096 + .section .text.avx,"ax",@progbits ENTRY (STRCHR) movl %edi, %ecx @@ -48,8 +51,8 @@ ENTRY (STRCHR) vpxor %xmm9, %xmm9, %xmm9 VPBROADCAST %xmm0, %ymm0 /* Check if we may cross page boundary with one vector load. */ - andl $(2 * VEC_SIZE - 1), %ecx - cmpl $VEC_SIZE, %ecx + andl $(PAGE_SIZE - 1), %ecx + cmpl $(PAGE_SIZE - VEC_SIZE), %ecx ja L(cros_page_boundary) /* Check the first VEC_SIZE bytes. Search for both CHAR and the @@ -63,45 +66,11 @@ ENTRY (STRCHR) jnz L(first_vec_x0) /* Align data for aligned loads in the loop. */ - addq $VEC_SIZE, %rdi - andl $(VEC_SIZE - 1), %ecx - andq $-VEC_SIZE, %rdi - - jmp L(more_4x_vec) - - .p2align 4 -L(cros_page_boundary): - andl $(VEC_SIZE - 1), %ecx - andq $-VEC_SIZE, %rdi - vmovdqu (%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - /* Remove the leading bytes. */ - sarl %cl, %eax - testl %eax, %eax - jz L(aligned_more) - /* Found CHAR or the null byte. 
*/ - tzcntl %eax, %eax - addq %rcx, %rax -# ifdef USE_AS_STRCHRNUL - addq %rdi, %rax -# else - xorl %edx, %edx - leaq (%rdi, %rax), %rax - cmp (%rax), %CHAR_REG - cmovne %rdx, %rax -# endif - VZEROUPPER - ret - - .p2align 4 + andq $-VEC_SIZE, %rdi L(aligned_more): addq $VEC_SIZE, %rdi -L(more_4x_vec): - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time + /* Check the next 3 * VEC_SIZE. Only one VEC_SIZE at a time since data is only aligned to VEC_SIZE. */ vmovdqa (%rdi), %ymm8 VPCMPEQ %ymm8, %ymm0, %ymm1 @@ -127,19 +96,9 @@ L(more_4x_vec): testl %eax, %eax jnz L(first_vec_x2) - vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 - VPCMPEQ %ymm8, %ymm0, %ymm1 - VPCMPEQ %ymm8, %ymm9, %ymm2 - vpor %ymm1, %ymm2, %ymm1 - vpmovmskb %ymm1, %eax - testl %eax, %eax - jnz L(first_vec_x3) - - addq $(VEC_SIZE * 4), %rdi - + /* Align data to 4 * VEC_SIZE. */ - movq %rdi, %rcx - andl $(4 * VEC_SIZE - 1), %ecx + addq $(VEC_SIZE * 3), %rdi andq $-(4 * VEC_SIZE), %rdi .p2align 4 @@ -150,34 +109,61 @@ L(loop_4x_vec): vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7 vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8 - VPCMPEQ %ymm5, %ymm0, %ymm1 - VPCMPEQ %ymm6, %ymm0, %ymm2 - VPCMPEQ %ymm7, %ymm0, %ymm3 - VPCMPEQ %ymm8, %ymm0, %ymm4 - - VPCMPEQ %ymm5, %ymm9, %ymm5 - VPCMPEQ %ymm6, %ymm9, %ymm6 - VPCMPEQ %ymm7, %ymm9, %ymm7 - VPCMPEQ %ymm8, %ymm9, %ymm8 + /* Leaves only CHARS matching esi as 0. */ + vpxor %ymm5, %ymm0, %ymm1 + vpxor %ymm6, %ymm0, %ymm2 + vpxor %ymm7, %ymm0, %ymm3 + vpxor %ymm8, %ymm0, %ymm4 - vpor %ymm1, %ymm5, %ymm1 - vpor %ymm2, %ymm6, %ymm2 - vpor %ymm3, %ymm7, %ymm3 - vpor %ymm4, %ymm8, %ymm4 + VPMINU %ymm1, %ymm5, %ymm1 + VPMINU %ymm2, %ymm6, %ymm2 + VPMINU %ymm3, %ymm7, %ymm3 + VPMINU %ymm4, %ymm8, %ymm4 - vpor %ymm1, %ymm2, %ymm5 - vpor %ymm3, %ymm4, %ymm6 + VPMINU %ymm1, %ymm2, %ymm5 + VPMINU %ymm3, %ymm4, %ymm6 - vpor %ymm5, %ymm6, %ymm5 + VPMINU %ymm5, %ymm6, %ymm5 + VPCMPEQ %ymm5, %ymm9, %ymm5 vpmovmskb %ymm5, %eax + addq $(VEC_SIZE * 4), %rdi + testl %eax, %eax - jnz L(4x_vec_end) + jz L(loop_4x_vec) - addq $(VEC_SIZE * 4), %rdi + subq $(VEC_SIZE * 4), %rdi + +L(4x_vec_end): + VPCMPEQ %ymm1, %ymm9, %ymm1 + vpmovmskb %ymm1, %eax + testl %eax, %eax + jnz L(first_vec_x0) + VPCMPEQ %ymm2, %ymm9, %ymm2 + vpmovmskb %ymm2, %eax + testl %eax, %eax + jnz L(first_vec_x1) + VPCMPEQ %ymm3, %ymm9, %ymm3 + vpmovmskb %ymm3, %eax + testl %eax, %eax + jnz L(first_vec_x2) + VPCMPEQ %ymm4, %ymm9, %ymm4 + vpmovmskb %ymm4, %eax - jmp L(loop_4x_vec) + tzcntl %eax, %eax +# ifdef USE_AS_STRCHRNUL + addq $(VEC_SIZE * 3), %rdi + addq %rdi, %rax +# else + xorl %edx, %edx + leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + cmp (%rax), %CHAR_REG + cmovne %rdx, %rax +# endif + VZEROUPPER + ret + .p2align 4 L(first_vec_x0): /* Found CHAR or the null byte. */ @@ -197,7 +183,7 @@ L(first_vec_x0): L(first_vec_x1): tzcntl %eax, %eax # ifdef USE_AS_STRCHRNUL - addq $VEC_SIZE, %rax + addq $VEC_SIZE, %rdi addq %rdi, %rax # else xorl %edx, %edx @@ -212,7 +198,7 @@ L(first_vec_x1): L(first_vec_x2): tzcntl %eax, %eax # ifdef USE_AS_STRCHRNUL - addq $(VEC_SIZE * 2), %rax + addq $(VEC_SIZE * 2), %rdi addq %rdi, %rax # else xorl %edx, %edx @@ -223,32 +209,47 @@ L(first_vec_x2): VZEROUPPER ret + /* Cold case for crossing page with first load. */ .p2align 4 -L(4x_vec_end): +L(cros_page_boundary): + andl $(VEC_SIZE - 1), %ecx + andq $-VEC_SIZE, %rdi + vmovdqu (%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 vpmovmskb %ymm1, %eax + /* Remove the leading bits. 
*/ + sarxl %ecx, %eax, %eax testl %eax, %eax - jnz L(first_vec_x0) - vpmovmskb %ymm2, %eax - testl %eax, %eax - jnz L(first_vec_x1) - vpmovmskb %ymm3, %eax - testl %eax, %eax - jnz L(first_vec_x2) - vpmovmskb %ymm4, %eax + jnz L(cros_page_return) + + /* Second block so that the 3 other blocks from L(aligned_more) + will get to next 4 * VEC_SIZE alignment. */ + andq $-VEC_SIZE, %rdi + addq $VEC_SIZE, %rdi + xorl %ecx, %ecx + vmovdqa (%rdi), %ymm8 + VPCMPEQ %ymm8, %ymm0, %ymm1 + VPCMPEQ %ymm8, %ymm9, %ymm2 + vpor %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %eax testl %eax, %eax -L(first_vec_x3): + jz L(aligned_more) + +L(cros_page_return): tzcntl %eax, %eax + addq %rcx, %rax # ifdef USE_AS_STRCHRNUL - addq $(VEC_SIZE * 3), %rax addq %rdi, %rax # else xorl %edx, %edx - leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + leaq (%rdi, %rax), %rax cmp (%rax), %CHAR_REG cmovne %rdx, %rax # endif VZEROUPPER ret - + END (STRCHR) -#endif +# endif