x86: Refactor and improve performance of strchr-avx2.S

Message ID 20210120092914.256388-1-goldstein.w.n@gmail.com
State Superseded
Series x86: Refactor and improve performance of strchr-avx2.S

Commit Message

Noah Goldstein Jan. 20, 2021, 9:29 a.m. UTC
  No bug. Just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. Both test-strchrnul and test-strchr passed.

Author: noah <goldstein.w.n@gmail.com>
---
Possible improvements fall into 5 categories roughly ordered by
importance:

1 - The main loop that handles 4 vectors at a time has 4 uops removed
    from it. This comes at the cost of additional port pressure on
    ports 0/1, as vpminub and vpminud can only run on those ports
    whereas vpor can run on ports 0/1/5, but the 4 saved uops should
    more than make up for that. Analysis by latency, by throughput,
    and by benchmarks all shows it is a performance improvement. (A
    C-level sketch of the new combining scheme follows the list.)

2 - As far as I can tell the cros_page_boundary logic was broken (or
    the jump label was especially confusing). The original code
    tested for this by checking if the load would split cache lines,
    not pages. I don't think there are any x86_64 architectures that
    support AVX2 (Haswell and newer) and don't have a minimum page
    size of 4kb. Given that the cost of splitting a cache line
    appears to be low on CPUs that support AVX2
    [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
    is a one-off, I don't see avoiding it as critical. If it is
    critical to avoid a cache-line-split load, I think a branchless
    approach might be better. Thoughts? Note: given that the check was
    changed to only cover page crosses, I think the branch is taken
    significantly less often than before, so I moved the branch target
    away from such prime real estate. I am also unsure if there is a
    PAGE_SIZE define somewhere in glibc that I can use instead of
    defining it here. (See the sketch of the two predicates after this
    list.)

3 - What was originally the more_4x_vec label was removed and the
    code only does 3 vector blocks now. The reasoning is as follows:
    there are two entry points to that code section, from a page cross
    or fall through (no page cross). The fall-through case is the more
    important one to optimize for, assuming the point above. In this
    case the incoming alignments (i.e. alignment of the ptr in rdi)
    mod 128 can be [0 - 31], [32 - 63], [64 - 95], or [96 - 127].
    Doing 4 vector blocks optimizes for [96 - 127], so that when the
    main 4x loop is hit a new 128-byte-aligned segment can be
    started. Doing 3 vector blocks optimizes for the [0 - 31] case. I
    generally think a string is more likely to be aligned to the cache
    line size (or to L2-prefetcher cache line pairs) than to
    [96 - 127] bytes. An alternative would be to make that code do 5
    vector blocks. That would mean at most 2 vector blocks are wasted
    when realigning to 128 bytes (as opposed to 3x or 4x, which can
    both allow 3 vector blocks to be wasted). Thoughts? (The
    arithmetic is worked through in a sketch after this list.)

4 - Replace the sarl that uses the %cl partial register with sarx.
    This assumes BMI2, which is Haswell and newer, but AVX2 implies
    Haswell and newer so I think it is safe. (A sketch of a runtime
    feature check for this assumption follows the list.)

5 - In the first_vec_xN return blocks, change the addq destination
    from rax to rdi. This just shortens the dependency chain to the
    return value by one uop.
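
Sketch for point 1: a rough C rendering of the new per-vector combine
using AVX2 intrinsics, just to show why xor + unsigned-min needs one
fewer reduction step than cmpeq + or. This is only an illustration of
the idea, not the code in the patch; the helper names are made up and
it assumes the byte (strchr) variant with VEC_SIZE == 32.

    #include <immintrin.h>

    /* Old scheme: two compares per vector, then OR the two masks.  */
    static inline __m256i
    combine_or (__m256i chunk, __m256i vchar, __m256i vzero)
    {
      __m256i m1 = _mm256_cmpeq_epi8 (chunk, vchar);  /* 0xff where CHAR.  */
      __m256i m2 = _mm256_cmpeq_epi8 (chunk, vzero);  /* 0xff where NUL.  */
      return _mm256_or_si256 (m1, m2);
    }

    /* New scheme: chunk ^ vchar is 0 exactly where CHAR matches and
       chunk itself is 0 where NUL is, so their unsigned min is 0 where
       either is found.  The compare against zero is then done once,
       after the four per-vector results have been reduced with further
       mins.  */
    static inline __m256i
    combine_min (__m256i chunk, __m256i vchar)
    {
      return _mm256_min_epu8 (_mm256_xor_si256 (chunk, vchar), chunk);
    }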
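
Sketch for point 2: the old and new branch conditions written out in
C, assuming VEC_SIZE == 32 and 4kb pages. Only the shape of the tests
is taken from the assembly; the function names are invented for the
example.

    #include <stdbool.h>
    #include <stdint.h>

    #define VEC_SIZE  32
    #define PAGE_SIZE 4096

    /* Old check: take the slow path if a 32-byte load at p would cross
       a 64-byte boundary.  Conservative: it also fires for plain
       cache-line splits, which cannot fault.  */
    static bool
    old_check (const char *p)
    {
      return ((uintptr_t) p & (2 * VEC_SIZE - 1)) > VEC_SIZE;
    }

    /* New check: take the slow path only if the load would actually
       cross a page, i.e. fewer than VEC_SIZE bytes remain before the
       next page boundary.  */
    static bool
    new_check (const char *p)
    {
      return ((uintptr_t) p & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;
    }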
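
Sketch for point 3: a small standalone program that works through the
realignment arithmetic for the no-page-cross path. It assumes
VEC_SIZE == 32 and that one unaligned vector is handled before the
single-vector blocks; the helper name is made up. It prints how many
bytes the 4 * VEC_SIZE loop re-scans for each incoming alignment class
when 3 versus 4 blocks are done first.

    #include <stdio.h>

    #define VEC_SIZE 32

    /* Bytes re-scanned by the 4 * VEC_SIZE loop, given the incoming
       pointer alignment mod 128 and the number of single-vector blocks
       executed after the first (unaligned) vector.  */
    static unsigned
    rescanned (unsigned align_mod_128, unsigned nblocks)
    {
      /* Align down to VEC_SIZE, step past the first vector, then do
         `nblocks` aligned vectors.  */
      unsigned end = (align_mod_128 / VEC_SIZE) * VEC_SIZE + VEC_SIZE
                     + nblocks * VEC_SIZE;
      /* The main loop restarts at the enclosing 4 * VEC_SIZE boundary.  */
      return end % (4 * VEC_SIZE);
    }

    int
    main (void)
    {
      for (unsigned a = 0; a < 128; a += VEC_SIZE)
        printf ("align mod 128 = %3u: 3 blocks re-scan %3u bytes, "
                "4 blocks re-scan %3u\n",
                a, rescanned (a, 3), rescanned (a, 4));
      return 0;
    }

With 3 blocks the [0 - 31] class re-scans nothing while [96 - 127]
re-scans 96 bytes; with 4 blocks it is exactly the other way around,
which is the trade-off argued above.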
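
Sketch for point 4: outside of glibc's own dispatch machinery, the
BMI2 assumption can be checked at run time roughly like this.
__builtin_cpu_supports is a GCC/Clang builtin; the dispatch shape and
names here are purely illustrative. Within glibc the equivalent is an
extra CPU_FEATURE_USABLE_P (cpu_features, BMI2) condition in the
strchr.c ifunc selector, as noted in the review below.

    /* Illustrative only: use the AVX2 variant only when both AVX2 and
       BMI2 (needed for sarx) are available.  */
    const char *
    pick_strchr_variant (void)
    {
      if (__builtin_cpu_supports ("avx2") && __builtin_cpu_supports ("bmi2"))
        return "strchr-avx2";
      return "strchr-sse2";
    }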

Benchmarks:
I can post my benchmarking code in the email thread if that is
appreciated. I benchmarked a variety of cases with different
alignments, sizes, and data hotness (in L1, L2, etc.), so a single
overall percentage improvement would not be very meaningful. The
benchmarks were run on my personal computer (icelake-client). Of my
2732 test cases, 1985 saw an improvement with these changes and 747
performed better with the original code. I should note, however, that
my test cases had a disproportionate number of 4kb page crosses, which
as discussed I moved to a colder path.

In general the effects of this change are:

Large/medium sized strings (from any part of memory really) see a
10-30% performance boost.
Small strings that are not page crosses see a 10-20% performance boost.
Small strings that are cache line splits see a 20-30% performance boost.
Small strings that cross a page boundary (a 4kb page, that is) see a
10% performance regression.

No regressions in test suite. Both test-strchrnul and test-strchr
passed.

I would love to hear your feedback on all of these points (or any that
I missed).

FSF Documentation has been signed and returned (via pub key and email
respectively)
    
 sysdeps/x86_64/multiarch/strchr-avx2.S | 173 +++++++++++++------------
 1 file changed, 87 insertions(+), 86 deletions(-)
  

Comments

H.J. Lu Jan. 20, 2021, 12:47 p.m. UTC | #1
On Wed, Jan 20, 2021 at 1:34 AM noah via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> No bug. Just seemed the performance could be improved a bit. Observed
> and expected behavior are unchanged. Optimized body of main
> loop. Updated page cross logic and optimized accordingly. Made a few
> minor instruction selection modifications. No regressions in test
> suite. Both test-strchrnul and test-strchr passed.

Thank you very much.

> Author: noah <goldstein.w.n@gmail.com>
> ---
> Possible improvements fall into 5 categories roughly ordered by
> importance:
>
> 1 - The main loop that handles 4 vectors at a time has 4 uops removed
>     from it. This comes at the cost of additional port pressure on
>     ports 0/1, as vpminub and vpminud can only run on those ports
>     whereas vpor can run on ports 0/1/5, but the 4 saved uops should
>     more than make up for that. Analysis by latency, by throughput,
>     and by benchmarks all shows it is a performance improvement.
>
> 2 - As far as I can tell the cros_page_boundary logic was broken (or
>     the jump label was especially confusing). The original code
>     tested for this by checking if the load would split cache lines,
>     not pages. I don't think there are any x86_64 architectures that

The original code has

        /* Check if we may cross page boundary with one vector load.  */
        andl    $(2 * VEC_SIZE - 1), %ecx
        cmpl    $VEC_SIZE, %ecx
        ja      L(cros_page_boundary)

It is just very conservative.  It never fails to detect a page-boundary
cross.

>     support AVX2 (Haswell and newer) and don't have a minimum page
>     size of 4kb. Given that the cost of splitting a cache line
>     appears to be low on CPUs that support AVX2
>     [https://www.agner.org/optimize/blog/read.php?i=415#423] and this
>     is a one-off, I don't see avoiding it as critical. If it is
>     critical to avoid a cache-line-split load, I think a branchless
>     approach might be better. Thoughts? Note: given that the check was
>     changed to only cover page crosses, I think the branch is taken
>     significantly less often than before, so I moved the branch target
>     away from such prime real estate. I am also unsure if there is a
>     PAGE_SIZE define somewhere in glibc that I can use instead of
>     defining it here.

Defining PAGE_SIZE to 4096 is fine.

> 3 - What was originally the more_4x_vec label was removed and the
>     code only does 3 vector blocks now. The reasoning is as follows:
>     there are two entry points to that code section, from a page cross
>     or fall through (no page cross). The fall-through case is the more
>     important one to optimize for, assuming the point above. In this
>     case the incoming alignments (i.e. alignment of the ptr in rdi)
>     mod 128 can be [0 - 31], [32 - 63], [64 - 95], or [96 - 127].
>     Doing 4 vector blocks optimizes for [96 - 127], so that when the
>     main 4x loop is hit a new 128-byte-aligned segment can be
>     started. Doing 3 vector blocks optimizes for the [0 - 31] case. I
>     generally think a string is more likely to be aligned to the cache
>     line size (or to L2-prefetcher cache line pairs) than to
>     [96 - 127] bytes. An alternative would be to make that code do 5
>     vector blocks. That would mean at most 2 vector blocks are wasted
>     when realigning to 128 bytes (as opposed to 3x or 4x, which can
>     both allow 3 vector blocks to be wasted). Thoughts?

I picked 4x because it was how I unrolled in other string functions.
3x is fine if it is faster than 4x.

> 4 - Replace the sarl that uses the %cl partial register with sarx.
>     This assumes BMI2, which is Haswell and newer, but AVX2 implies
>     Haswell and newer so I think it is safe.

Need to add a BMI2 check in strchr.c:

&& CPU_FEATURE_USABLE_P (cpu_features, BMI2)

>
> 5 - In the first_vec_xN return blocks, change the addq destination
>     from rax to rdi. This just shortens the dependency chain to the
>     return value by one uop.

Sounds reasonable.

> Benchmarks:
> I can post my benchmarking code in the email thread if that is
> appreciated. I benchmarked a variety of cases with different
> alignments, sizes, and data hotness (in L1, L2, etc.), so a single
> overall percentage improvement would not be very meaningful. The
> benchmarks were run on my personal computer (icelake-client). Of my
> 2732 test cases, 1985 saw an improvement with these changes and 747
> performed better with the original code. I should note, however, that
> my test cases had a disproportionate number of 4kb page crosses, which
> as discussed I moved to a colder path.

Please submit a separate patch to add your workload to
benchtests/bench-strstr.c.

> In general the effects of this change are:
>
> Large/medium sized strings (from any part of memory really) see a
> 10-30% performance boost.
> Small strings that are not page crosses see a 10-20% performance boost.
> Small strings that are cache line splits see a 20-30% performance boost.
> Small strings that cross a page boundary (a 4kb page, that is) see a
> 10% performance regression.
>
> No regressions in test suite. Both test-strchrnul and test-strchr
> passed.

It is also used for wcschr and wcschrnul.  Do they work correctly with
your change?

> I would love to hear your feedback on all of these points (or any
> that I missed).
>
> FSF Documentation has been signed and returned (via pub key and email
> respectively)
>
>  sysdeps/x86_64/multiarch/strchr-avx2.S | 173 +++++++++++++------------
>  1 file changed, 87 insertions(+), 86 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
> index d416558d04..09c2df86d1 100644
> --- a/sysdeps/x86_64/multiarch/strchr-avx2.S
> +++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
> @@ -27,10 +27,12 @@
>  # ifdef USE_AS_WCSCHR
>  #  define VPBROADCAST  vpbroadcastd
>  #  define VPCMPEQ      vpcmpeqd
> +#  define VPMINU       vpminud
>  #  define CHAR_REG     esi
>  # else
>  #  define VPBROADCAST  vpbroadcastb
>  #  define VPCMPEQ      vpcmpeqb
> +#  define VPMINU       vpminub
>  #  define CHAR_REG     sil
>  # endif
>
> @@ -39,7 +41,8 @@
>  # endif
>
>  # define VEC_SIZE 32
> -
> +# define PAGE_SIZE 4096
> +
>         .section .text.avx,"ax",@progbits
>  ENTRY (STRCHR)
>         movl    %edi, %ecx
> @@ -48,8 +51,8 @@ ENTRY (STRCHR)
>         vpxor   %xmm9, %xmm9, %xmm9
>         VPBROADCAST %xmm0, %ymm0
>         /* Check if we may cross page boundary with one vector load.  */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  Please update comments.
> -       andl    $(2 * VEC_SIZE - 1), %ecx
> -       cmpl    $VEC_SIZE, %ecx
> +       andl    $(PAGE_SIZE - 1), %ecx
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %ecx
>         ja      L(cros_page_boundary)

Can you also replace cros_page_boundary with cross_page_boundary?

>         /* Check the first VEC_SIZE bytes.  Search for both CHAR and the
> @@ -63,45 +66,11 @@ ENTRY (STRCHR)
>         jnz     L(first_vec_x0)
>
>         /* Align data for aligned loads in the loop.  */
> -       addq    $VEC_SIZE, %rdi
> -       andl    $(VEC_SIZE - 1), %ecx
> -       andq    $-VEC_SIZE, %rdi
> -
> -       jmp     L(more_4x_vec)
> -
> -       .p2align 4
> -L(cros_page_boundary):
> -       andl    $(VEC_SIZE - 1), %ecx
> -       andq    $-VEC_SIZE, %rdi
> -       vmovdqu (%rdi), %ymm8
> -       VPCMPEQ %ymm8, %ymm0, %ymm1
> -       VPCMPEQ %ymm8, %ymm9, %ymm2
> -       vpor    %ymm1, %ymm2, %ymm1
> -       vpmovmskb %ymm1, %eax
> -       /* Remove the leading bytes.  */
> -       sarl    %cl, %eax
> -       testl   %eax, %eax
> -       jz      L(aligned_more)
> -       /* Found CHAR or the null byte.  */
> -       tzcntl  %eax, %eax
> -       addq    %rcx, %rax
> -# ifdef USE_AS_STRCHRNUL
> -       addq    %rdi, %rax
> -# else
> -       xorl    %edx, %edx
> -       leaq    (%rdi, %rax), %rax
> -       cmp     (%rax), %CHAR_REG
> -       cmovne  %rdx, %rax
> -# endif
> -       VZEROUPPER
> -       ret
> -
> -       .p2align 4
> +    andq       $-VEC_SIZE, %rdi
>  L(aligned_more):
>         addq    $VEC_SIZE, %rdi
>
> -L(more_4x_vec):
> -       /* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
> +       /* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
>            since data is only aligned to VEC_SIZE.  */
>         vmovdqa (%rdi), %ymm8
>         VPCMPEQ %ymm8, %ymm0, %ymm1
> @@ -127,19 +96,9 @@ L(more_4x_vec):
>         testl   %eax, %eax
>         jnz     L(first_vec_x2)
>
> -       vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
> -       VPCMPEQ %ymm8, %ymm0, %ymm1
> -       VPCMPEQ %ymm8, %ymm9, %ymm2
> -       vpor    %ymm1, %ymm2, %ymm1
> -       vpmovmskb %ymm1, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x3)
> -
> -       addq    $(VEC_SIZE * 4), %rdi
> -
> +
>         /* Align data to 4 * VEC_SIZE.  */
> -       movq    %rdi, %rcx
> -       andl    $(4 * VEC_SIZE - 1), %ecx
> +       addq    $(VEC_SIZE * 3), %rdi
>         andq    $-(4 * VEC_SIZE), %rdi
>
>         .p2align 4
> @@ -150,34 +109,61 @@ L(loop_4x_vec):
>         vmovdqa (VEC_SIZE * 2)(%rdi), %ymm7
>         vmovdqa (VEC_SIZE * 3)(%rdi), %ymm8
>
> -       VPCMPEQ %ymm5, %ymm0, %ymm1
> -       VPCMPEQ %ymm6, %ymm0, %ymm2
> -       VPCMPEQ %ymm7, %ymm0, %ymm3
> -       VPCMPEQ %ymm8, %ymm0, %ymm4
> -
> -       VPCMPEQ %ymm5, %ymm9, %ymm5
> -       VPCMPEQ %ymm6, %ymm9, %ymm6
> -       VPCMPEQ %ymm7, %ymm9, %ymm7
> -       VPCMPEQ %ymm8, %ymm9, %ymm8
> +    /* Leaves only CHARS matching esi as 0.  */
> +    vpxor %ymm5, %ymm0, %ymm1
> +    vpxor %ymm6, %ymm0, %ymm2
> +    vpxor %ymm7, %ymm0, %ymm3
> +    vpxor %ymm8, %ymm0, %ymm4
>
> -       vpor    %ymm1, %ymm5, %ymm1
> -       vpor    %ymm2, %ymm6, %ymm2
> -       vpor    %ymm3, %ymm7, %ymm3
> -       vpor    %ymm4, %ymm8, %ymm4
> +       VPMINU  %ymm1, %ymm5, %ymm1
> +       VPMINU  %ymm2, %ymm6, %ymm2
> +       VPMINU  %ymm3, %ymm7, %ymm3
> +       VPMINU  %ymm4, %ymm8, %ymm4
>
> -       vpor    %ymm1, %ymm2, %ymm5
> -       vpor    %ymm3, %ymm4, %ymm6
> +       VPMINU  %ymm1, %ymm2, %ymm5
> +       VPMINU  %ymm3, %ymm4, %ymm6
>
> -       vpor    %ymm5, %ymm6, %ymm5
> +       VPMINU  %ymm5, %ymm6, %ymm5
>
> +    VPCMPEQ %ymm5, %ymm9, %ymm5
>         vpmovmskb %ymm5, %eax
> +    addq       $(VEC_SIZE * 4), %rdi
> +
>         testl   %eax, %eax
> -       jnz     L(4x_vec_end)
> +       jz      L(loop_4x_vec)
>
> -       addq    $(VEC_SIZE * 4), %rdi
> +    subq       $(VEC_SIZE * 4), %rdi
> +
> +L(4x_vec_end):
> +    VPCMPEQ %ymm1, %ymm9, %ymm1
> +       vpmovmskb %ymm1, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x0)
> +    VPCMPEQ %ymm2, %ymm9, %ymm2
> +       vpmovmskb %ymm2, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x1)
> +    VPCMPEQ %ymm3, %ymm9, %ymm3
> +       vpmovmskb %ymm3, %eax
> +       testl   %eax, %eax
> +       jnz     L(first_vec_x2)
> +    VPCMPEQ %ymm4, %ymm9, %ymm4
> +       vpmovmskb %ymm4, %eax
>
> -       jmp     L(loop_4x_vec)
> +       tzcntl  %eax, %eax
> +# ifdef USE_AS_STRCHRNUL
> +       addq    $(VEC_SIZE * 3), %rdi
> +       addq    %rdi, %rax
> +# else
> +       xorl    %edx, %edx
> +       leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +       cmp     (%rax), %CHAR_REG
> +       cmovne  %rdx, %rax
> +# endif
> +       VZEROUPPER
> +       ret
>
> +
>         .p2align 4
>  L(first_vec_x0):
>         /* Found CHAR or the null byte.  */
> @@ -197,7 +183,7 @@ L(first_vec_x0):
>  L(first_vec_x1):
>         tzcntl  %eax, %eax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $VEC_SIZE, %rax
> +       addq    $VEC_SIZE, %rdi
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> @@ -212,7 +198,7 @@ L(first_vec_x1):
>  L(first_vec_x2):
>         tzcntl  %eax, %eax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $(VEC_SIZE * 2), %rax
> +       addq    $(VEC_SIZE * 2), %rdi
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> @@ -223,32 +209,47 @@ L(first_vec_x2):
>         VZEROUPPER
>         ret
>
> +    /* Cold case for crossing page with first load.  */
>         .p2align 4
> -L(4x_vec_end):
> +L(cros_page_boundary):
> +       andl    $(VEC_SIZE - 1), %ecx
> +       andq    $-VEC_SIZE, %rdi
> +       vmovdqu (%rdi), %ymm8
> +       VPCMPEQ %ymm8, %ymm0, %ymm1
> +       VPCMPEQ %ymm8, %ymm9, %ymm2
> +       vpor    %ymm1, %ymm2, %ymm1
>         vpmovmskb %ymm1, %eax
> +       /* Remove the leading bits.  */
> +       sarxl   %ecx, %eax, %eax
>         testl   %eax, %eax
> -       jnz     L(first_vec_x0)
> -       vpmovmskb %ymm2, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x1)
> -       vpmovmskb %ymm3, %eax
> -       testl   %eax, %eax
> -       jnz     L(first_vec_x2)
> -       vpmovmskb %ymm4, %eax
> +       jnz     L(cros_page_return)
> +
> +    /* Second block so that the 3 other blocks from L(aligned_more)
> +       will get to next 4 * VEC_SIZE alignment.  */
> +    andq       $-VEC_SIZE, %rdi
> +    addq       $VEC_SIZE, %rdi
> +    xorl    %ecx, %ecx
> +    vmovdqa    (%rdi), %ymm8
> +       VPCMPEQ %ymm8, %ymm0, %ymm1
> +       VPCMPEQ %ymm8, %ymm9, %ymm2
> +       vpor    %ymm1, %ymm2, %ymm1
> +       vpmovmskb %ymm1, %eax
>         testl   %eax, %eax
> -L(first_vec_x3):
> +       jz      L(aligned_more)
> +
> +L(cros_page_return):
>         tzcntl  %eax, %eax
> +       addq    %rcx, %rax
>  # ifdef USE_AS_STRCHRNUL
> -       addq    $(VEC_SIZE * 3), %rax
>         addq    %rdi, %rax
>  # else
>         xorl    %edx, %edx
> -       leaq    (VEC_SIZE * 3)(%rdi, %rax), %rax
> +       leaq    (%rdi, %rax), %rax
>         cmp     (%rax), %CHAR_REG
>         cmovne  %rdx, %rax
>  # endif
>         VZEROUPPER
>         ret
> -
> +
>  END (STRCHR)
> -#endif
> +# endif
> --
> 2.29.2
>
  

Patch

diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
index d416558d04..09c2df86d1 100644
--- a/sysdeps/x86_64/multiarch/strchr-avx2.S
+++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
@@ -27,10 +27,12 @@ 
 # ifdef USE_AS_WCSCHR
 #  define VPBROADCAST	vpbroadcastd
 #  define VPCMPEQ	vpcmpeqd
+#  define VPMINU	vpminud
 #  define CHAR_REG	esi
 # else
 #  define VPBROADCAST	vpbroadcastb
 #  define VPCMPEQ	vpcmpeqb
+#  define VPMINU	vpminub
 #  define CHAR_REG	sil
 # endif
 
@@ -39,7 +41,8 @@ 
 # endif
 
 # define VEC_SIZE 32
-
+# define PAGE_SIZE 4096
+    
 	.section .text.avx,"ax",@progbits
 ENTRY (STRCHR)
 	movl	%edi, %ecx
@@ -48,8 +51,8 @@  ENTRY (STRCHR)
 	vpxor	%xmm9, %xmm9, %xmm9
 	VPBROADCAST %xmm0, %ymm0
 	/* Check if we may cross page boundary with one vector load.  */
-	andl	$(2 * VEC_SIZE - 1), %ecx
-	cmpl	$VEC_SIZE, %ecx
+	andl	$(PAGE_SIZE - 1), %ecx
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %ecx
 	ja	L(cros_page_boundary)
 
 	/* Check the first VEC_SIZE bytes.  Search for both CHAR and the
@@ -63,45 +66,11 @@  ENTRY (STRCHR)
 	jnz	L(first_vec_x0)
 
 	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-
-	jmp	L(more_4x_vec)
-
-	.p2align 4
-L(cros_page_boundary):
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-	vmovdqu	(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	/* Remove the leading bytes.  */
-	sarl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(aligned_more)
-	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-	addq	%rcx, %rax
-# ifdef USE_AS_STRCHRNUL
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
-	cmovne	%rdx, %rax
-# endif
-	VZEROUPPER
-	ret
-
-	.p2align 4
+    andq	$-VEC_SIZE, %rdi
 L(aligned_more):
 	addq	$VEC_SIZE, %rdi
 
-L(more_4x_vec):
-	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+	/* Check the next 3 * VEC_SIZE.  Only one VEC_SIZE at a time
 	   since data is only aligned to VEC_SIZE.  */
 	vmovdqa	(%rdi), %ymm8
 	VPCMPEQ %ymm8, %ymm0, %ymm1
@@ -127,19 +96,9 @@  L(more_4x_vec):
 	testl	%eax, %eax
 	jnz	L(first_vec_x2)
 
-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x3)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
+    
 	/* Align data to 4 * VEC_SIZE.  */
-	movq	%rdi, %rcx
-	andl	$(4 * VEC_SIZE - 1), %ecx
+	addq	$(VEC_SIZE * 3), %rdi
 	andq	$-(4 * VEC_SIZE), %rdi
 
 	.p2align 4
@@ -150,34 +109,61 @@  L(loop_4x_vec):
 	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm7
 	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
 
-	VPCMPEQ %ymm5, %ymm0, %ymm1
-	VPCMPEQ %ymm6, %ymm0, %ymm2
-	VPCMPEQ %ymm7, %ymm0, %ymm3
-	VPCMPEQ %ymm8, %ymm0, %ymm4
-
-	VPCMPEQ %ymm5, %ymm9, %ymm5
-	VPCMPEQ %ymm6, %ymm9, %ymm6
-	VPCMPEQ %ymm7, %ymm9, %ymm7
-	VPCMPEQ %ymm8, %ymm9, %ymm8
+    /* Leaves only CHARS matching esi as 0.  */
+    vpxor %ymm5, %ymm0, %ymm1
+    vpxor %ymm6, %ymm0, %ymm2
+    vpxor %ymm7, %ymm0, %ymm3
+    vpxor %ymm8, %ymm0, %ymm4
 
-	vpor	%ymm1, %ymm5, %ymm1
-	vpor	%ymm2, %ymm6, %ymm2
-	vpor	%ymm3, %ymm7, %ymm3
-	vpor	%ymm4, %ymm8, %ymm4
+	VPMINU	%ymm1, %ymm5, %ymm1
+	VPMINU	%ymm2, %ymm6, %ymm2
+	VPMINU	%ymm3, %ymm7, %ymm3
+	VPMINU	%ymm4, %ymm8, %ymm4
 
-	vpor	%ymm1, %ymm2, %ymm5
-	vpor	%ymm3, %ymm4, %ymm6
+	VPMINU	%ymm1, %ymm2, %ymm5
+	VPMINU	%ymm3, %ymm4, %ymm6
 
-	vpor	%ymm5, %ymm6, %ymm5
+	VPMINU	%ymm5, %ymm6, %ymm5
 
+    VPCMPEQ %ymm5, %ymm9, %ymm5
 	vpmovmskb %ymm5, %eax
+    addq	$(VEC_SIZE * 4), %rdi
+    
 	testl	%eax, %eax
-	jnz	L(4x_vec_end)
+	jz	L(loop_4x_vec)
 
-	addq	$(VEC_SIZE * 4), %rdi
+    subq	$(VEC_SIZE * 4), %rdi
+
+L(4x_vec_end):
+    VPCMPEQ %ymm1, %ymm9, %ymm1
+	vpmovmskb %ymm1, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x0)
+    VPCMPEQ %ymm2, %ymm9, %ymm2
+	vpmovmskb %ymm2, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x1)
+    VPCMPEQ %ymm3, %ymm9, %ymm3
+	vpmovmskb %ymm3, %eax
+	testl	%eax, %eax
+	jnz	L(first_vec_x2)
+    VPCMPEQ %ymm4, %ymm9, %ymm4
+	vpmovmskb %ymm4, %eax
 
-	jmp	L(loop_4x_vec)
+	tzcntl	%eax, %eax
+# ifdef USE_AS_STRCHRNUL
+	addq	$(VEC_SIZE * 3), %rdi
+	addq	%rdi, %rax
+# else
+	xorl	%edx, %edx
+	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret
 
+    
 	.p2align 4
 L(first_vec_x0):
 	/* Found CHAR or the null byte.  */
@@ -197,7 +183,7 @@  L(first_vec_x0):
 L(first_vec_x1):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$VEC_SIZE, %rax
+	addq	$VEC_SIZE, %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -212,7 +198,7 @@  L(first_vec_x1):
 L(first_vec_x2):
 	tzcntl	%eax, %eax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 2), %rax
+	addq	$(VEC_SIZE * 2), %rdi
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
@@ -223,32 +209,47 @@  L(first_vec_x2):
 	VZEROUPPER
 	ret
 
+    /* Cold case for crossing page with first load.  */
 	.p2align 4
-L(4x_vec_end):
+L(cros_page_boundary):
+	andl	$(VEC_SIZE - 1), %ecx
+	andq	$-VEC_SIZE, %rdi
+	vmovdqu	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
+	/* Remove the leading bits.  */
+	sarxl	%ecx, %eax, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-	vpmovmskb %ymm2, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x1)
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x2)
-	vpmovmskb %ymm4, %eax
+	jnz	L(cros_page_return)
+	
+    /* Second block so that the 3 other blocks from L(aligned_more)
+       will get to next 4 * VEC_SIZE alignment.  */
+    andq	$-VEC_SIZE, %rdi
+    addq	$VEC_SIZE, %rdi
+    xorl    %ecx, %ecx
+    vmovdqa	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-L(first_vec_x3):
+	jz	L(aligned_more)
+
+L(cros_page_return):
 	tzcntl	%eax, %eax
+	addq	%rcx, %rax
 # ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 3), %rax
 	addq	%rdi, %rax
 # else
 	xorl	%edx, %edx
-	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+	leaq	(%rdi, %rax), %rax
 	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
-
+    
 END (STRCHR)
-#endif
+# endif