[v2] x86: Refactor and improve performance of strchr-avx2.S

  No bug. Just seemed the performance could be improved a bit.  Observed
and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. test-strchrnul, test-strchr, test-wcschrnul, and test-wcschr
all passed.

Author: noah <goldstein.w.n@gmail.com>
---
>>
>> No bug. Just seemed the performance could be improved a bit. Observed
>> and expected behavior are unchanged. Optimized body of main
>> loop. Updated page cross logic and optimized accordingly. Made a few
>> minor instruction selection modifications. No regressions in test
>> suite. Both test-strchrnul and test-strchr passed.
>
>Thank you very much.
>

:) This is my first patch (hopefully). It's exciting!

>> Author: noah <goldstein.w.n@gmail.com>
>> ---
>> Possible improvements fall into 5 categories roughly ordered by
>> importance:
>>
>> 1 - The main loop that handles 4 vectors at a time has 4 uops removed
>>     from it. This is at the cost of additional port pressure on ports
>>     0/1 as vpminub and vpminud can only run on those ports whereas
>>     vpor can run on ports 0/1/5. But the 4 saved uops should more than
>>     make up for that. Analysis by latency, throughput, or
>>     benchmarks shows it's a performance improvement.
>>
>> 2 - As far as I can tell the cros_page_boundary logic was broken (or
>>     the jump label was especially confusing). The original code
>>     tested for this by checking if the load would split cache lines
>>     not pages. I don't think there are any x86_64 architectures that
>
>The original code has
>
>        /* Check if we may cross page boundary with one vector load.  */
>        andl    $(2 * VEC_SIZE - 1), %ecx
>        cmpl    $VEC_SIZE, %ecx
>        ja      L(cros_page_boundary)
>
>It is just very conservative.   It never fails to cross page boundary.
>

I see. Got it.

>>     support AVX2 (Haswell and newer) but don't have a minimum
>>     page size of 4kb. Given that the cost of splitting a cacheline
>>     appears to be low on CPUs that support AVX2
>>     [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
>>     is a one-off, I don't see it as being critical to avoid. If it is
>>     critical to avoid a cacheline split load I think a branchless
>>     approach might be better. Thoughts? Note: Given that the check was
>>     changed to only be for page crosses, I think it is significantly
>>     less likely than before so I moved the branch target from such
>>     prime real estate. I am also unsure if there is a PAGE_SIZE define
>>     in glibc somewhere I can use instead of defining it here.
>
>Define PAGE_SIZE to 4096 is fine.
>

Got it.
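
For reference, here is a small C model of the two checks discussed under
point 2. It is only a sketch: old_check mirrors the quoted andl/cmpl/ja
sequence, while page_check is an assumed form of a page-only test with
VEC_SIZE = 32 and PAGE_SIZE = 4096. All function names are made up for
illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VEC_SIZE 32
#define PAGE_SIZE 4096

/* Quoted original check: taken whenever a VEC_SIZE load would straddle a
   2 * VEC_SIZE (64-byte) boundary, which includes every page boundary.  */
bool
old_check (uintptr_t addr)
{
  return (addr & (2 * VEC_SIZE - 1)) > VEC_SIZE;
}

/* Assumed page-only variant: taken only when the load would run past the
   end of the page.  */
bool
page_check (uintptr_t addr)
{
  return (addr & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;
}

/* Ground truth: does a VEC_SIZE-byte load at addr touch two pages?  */
bool
load_crosses_page (uintptr_t addr)
{
  return addr / PAGE_SIZE != (addr + VEC_SIZE - 1) / PAGE_SIZE;
}

/* Count the offsets within one page that a check sends to the slow path.  */
int
count_flagged (bool (*check) (uintptr_t))
{
  int n = 0;
  for (uintptr_t a = 0; a < PAGE_SIZE; a++)
    n += check (a);
  return n;
}

/* The page-only check is exact, and the old check never misses a page
   cross (it is conservative, as pointed out above).  */
void
self_check (void)
{
  for (uintptr_t a = 0; a < 2 * PAGE_SIZE; a++)
    {
      assert (page_check (a) == load_crosses_page (a));
      assert (!load_crosses_page (a) || old_check (a));
    }
}
```

Under this model the old check takes the branch for 1984 of the 4096
possible offsets in a page, the page-only check for 31, which is why the
branch target can reasonably move to a colder location.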

>> 3 - What was originally the more_4x_vec label was removed and the
>>     code only does 3 vector blocks now. The reasoning is as follows:
>>     there are two entry points to that code section, from a page cross
>>     or fall through (no page cross). The fall through case is more
>>     important to optimize for assuming the point above. In this case
>>     the incoming alignments (i.e. alignment of ptr in rdi) mod 128 can
>>     be [0 - 31], [32 - 63], [64 - 95], [96 - 127]. Doing 4 vector
>>     blocks optimizes for the [96 - 127] case so that when the main 4x
>>     loop is hit, a new 128 byte aligned segment can be started. Doing 3
>>     vector blocks optimizes for the [0 - 31] case. I generally think
>>     the string is more likely to be aligned to cache line size (or L2
>>     prefetcher cache line pairs) than at [96 - 127] bytes. An
>>     alternative would be to make that code do 5x vector blocks. This
>>     would mean that at most 2 vector blocks were wasted when
>>     realigning to 128 bytes (as opposed to 3x or 4x which can allow
>>     for 3 vector blocks to be wasted). Thoughts?
>
>I picked 4x because it was how I unrolled in other string functions.
>3x is fine if it faster than 4x.
>

I see. What are your thoughts on doing 5x? With 5x the worst case is 2
vector blocks wasted on realignment, though it will add overhead for
A) reaching the 4x loop and B) the [0 - 31], [32 - 63], and [96 - 127]
initial alignment cases.

>> 4 - Replace salq using the %cl partial register with sarx. This assumes
>>     BMI2 which is Haswell and newer but AVX2 implies Haswell and
>>     newer so I think it is safe.
>
>Need to add a BMI2 check in strchr.c:
>
>&& CPU_FEATURE_USABLE_P (cpu_features, BMI2)
>

Done.
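
To make the role of the shift in point 4 concrete, here is a scalar C
sketch. It assumes (this is not spelled out above) that the shift is used
on the page cross path to discard match bits for bytes that lie before the
start of the string after loading from an aligned address. match_mask is a
stand-in for the vector compare plus vpmovmskb, the names are hypothetical,
and __builtin_ctz is a GCC/Clang builtin.

```c
#include <stdint.h>
#include <string.h>

#define VEC_SIZE 32

/* Scalar stand-in for the vector compare + vpmovmskb: bit i is set when
   byte i of the aligned block equals c or is the null terminator.  */
uint32_t
match_mask (const unsigned char *block, unsigned char c)
{
  uint32_t mask = 0;
  for (int i = 0; i < VEC_SIZE; i++)
    if (block[i] == c || block[i] == 0)
      mask |= (uint32_t) 1 << i;
  return mask;
}

/* The string starts `offset` bytes into the aligned block (offset <
   VEC_SIZE).  Shifting the mask right by offset drops hits that precede
   the string.  Returns the index of the first hit relative to the string
   start, or -1 if the block has none.  A logical shift is used here; only
   the lowest set bit matters for the result.  */
int
first_hit_in_block (const unsigned char *block, unsigned offset,
                    unsigned char c)
{
  uint32_t mask = match_mask (block, c) >> offset;
  return mask ? __builtin_ctz (mask) : -1;
}

/* Example: a block of 'x' with an 'a' before and after the string start.  */
int
example (void)
{
  unsigned char block[VEC_SIZE];
  memset (block, 'x', sizeof block);
  block[1] = 'a';   /* Before the string start: must be ignored.  */
  block[6] = 'a';   /* First real match.  */
  return first_hit_in_block (block, 3, 'a');
}
```

The shift count is a run-time value, which is where a three-operand shift
such as sarx avoids going through %cl.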

>>
>> 5 - in the first_vec_xN return blocks change the addq from using rax
>>     as a destination to rdi as a destination. This just allows for 1
>>     uop shorter latency.
>
>Sounds reasonable.
>

I forgot to mention this in the original email: I also removed a few
unnecessary instructions on ecx (it was being realigned with rdi but
never used).
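
Going back to point 1, here is a scalar C sketch of the byte identity that
makes a vpminub-based loop body possible. The assumption, which is not
stated above, is that each data byte is first XORed with the search
character: the XOR is zero exactly on a match, so the unsigned minimum of
the XOR result and the original byte is zero exactly when the byte matches
or is the null terminator. Function names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned byte minimum, the per-lane operation of vpminub.  */
uint8_t
min_u8 (uint8_t a, uint8_t b)
{
  return a < b ? a : b;
}

/* One byte lane: the result is zero iff b == c or b == 0.  */
uint8_t
lane (uint8_t b, uint8_t c)
{
  return min_u8 (b ^ c, b);
}

/* Fold the lanes with min as well: the running minimum reaches zero iff
   some byte in p[0..n) equals c or is a null byte, so a single compare
   against zero at the end covers the whole range.  */
int
block_has_hit (const uint8_t *p, size_t n, uint8_t c)
{
  uint8_t m = 0xff;
  for (size_t i = 0; i < n; i++)
    m = min_u8 (m, lane (p[i], c));
  return m == 0;
}
```

Combining with min instead of OR-ing separate compare results is what lets
one zero test stand in for several.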

>> Benchmarks:
>> I can post my benchmarking code in the email thread if that is
>> appreciated. I benchmarked a variety of cases with different
>> alignments, sizes, and data hotness (in L1, L2, etc...) so I can't just
>> give a simple number of x percentage improvement. They were also run
>> on my personal computer (icelake-client). Of my 2732 test cases 1985
>> saw an improvement with these changes, 747 performed better with the
>> original code. I should note, however, that my test cases had a
>> disproportionate number of cases with 4kb page crosses, which as
>> discussed I moved to a colder path.
>
>Please submit a separate patch to add your workload to
>benchtests/bench-strstr.c.
>

Got it. A few questions. First, should the benchmark have a single return
value to indicate pass/failure (for a performance regression), or is
just outputting times the idea so it can be used as a tool for future
patches? Second, is the coding standard for benchmarks the same as for
production code or are the rules more relaxed? Last, should there be a
separate benchmark for strchr, strchrnul, wcschr, and wcschrnul or is
having them all in one with #defines the best approach?

>> In general the effects of this change are:
>>
>> Large/medium sized strings (from any part of memory really) see a
>> 10-30% performance boost.
>> Small strings that do not cross a page see a 10-20% performance boost.
>> Small strings that split a cache line see a 20-30% performance boost.
>> Small strings that cross a page boundary (4kb page that is) see a 10%
>> performance regression.
>>
>> No regressions in test suite. Both test-strchrnul and test-strchr
>> passed.
>
>It is also used for wcschr and wcschrnul.  Do they work correctly with
>your change?
>

Yes. Updated the commit message to indicate those tests pass
as well. Updated my benchmark as well to test those for performance.

>> I would love to hear your feedback on all of these points (or any that
>> I missed).
>>
>> FSF Documentation has been signed and returned (via pub key and email
>> respectively)
 sysdeps/x86_64/multiarch/strchr-avx2.S | 177 +++++++++++++------------
 sysdeps/x86_64/multiarch/strchr.c      |   2 +
 2 files changed, 91 insertions(+), 88 deletions(-)

Message ID	20210121001206.471283-1-goldstein.w.n@gmail.com
State	Superseded
Headers	To: libc-alpha@sourceware.org
	Subject: [PATCH v2] x86: Refactor and improve performance of strchr-avx2.S
	Date: Wed, 20 Jan 2021 19:12:07 -0500
	In-Reply-To: <CAMe9rOoBYBxqDj_6-kPAaWo9a6wY3Eykdw9LtEiNNobqbxtdUw@mail.gmail.com>
	From: noah via Libc-alpha <libc-alpha@sourceware.org>
	Reply-To: noah <goldstein.w.n@gmail.com>
Series	[v2] x86: Refactor and improve performance of strchr-avx2.S


Patch