[v10,00/24] Improve generic string routines

Message ID	20230120211622.3445279-1-adhemerval.zanella@linaro.org
Headers	DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org DA4C438543B5 To: libc-alpha@sourceware.org, Richard Henderson <richard.henderson@linaro.org>, Noah Goldstein <goldstein.w.n@gmail.com> Subject: [PATCH v10 00/24] Improve generic string routines Date: Fri, 20 Jan 2023 18:15:58 -0300 Message-Id: <20230120211622.3445279-1-adhemerval.zanella@linaro.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list From: Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org> Reply-To: Adhemerval Zanella <adhemerval.zanella@linaro.org> Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
Series	Improve generic string routines \| [v10,00/24] Improve generic string routines [v10,01/24] Parameterize op_t from memcopy.h [v10,02/24] Parameterize OP_T_THRES from memcopy.h [v10,03/24] Add string vectorized find and detection functions [v10,04/24] string: Improve generic strlen [v10,05/24] string: Improve generic strnlen [v10,06/24] string: Improve generic strchr [v10,07/24] string: Improve generic strchrnul [v10,08/24] string: Improve generic strcmp [v10,09/24] string: Improve generic strncmp [v10,10/24] string: Improve generic memchr [v10,11/24] string: Improve generic memrchr [v10,12/24] hppa: Add memcopy.h [v10,13/24] hppa: Add string-fzb.h and string-fzi.h [v10,14/24] alpha: Add string-fzb.h and string-fzi.h [v10,15/24] arm: Add string-fza.h [v10,16/24] powerpc: Add string-fza.h [v10,17/24] sh: Add string-fzb.h [v10,18/24] string: Hook up the default implementation on test-strlen [v10,19/24] string: Hook up the default implementation on test-strnlen [v10,20/24] string: Hook up the default implementation on test-strchr [v10,21/24] string: Hook up the default implementation on test-strcmp [v10,22/24] string: Hook up the default implementation on test-strncmp [v10,23/24] string: Hook up the default implementation on test-memchr [v10,24/24] string: Hook up the default implementation on test-memrchr

Message ID

20230120211622.3445279-1-adhemerval.zanella@linaro.org

Headers

DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org DA4C438543B5
To: libc-alpha@sourceware.org,
 Richard Henderson <richard.henderson@linaro.org>,
 Noah Goldstein <goldstein.w.n@gmail.com>
Subject: [PATCH v10 00/24] Improve generic string routines
Date: Fri, 20 Jan 2023 18:15:58 -0300
Message-Id: <20230120211622.3445279-1-adhemerval.zanella@linaro.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
From: Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org>
Reply-To: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org
Sender: "Libc-alpha"
 <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>

Series

Improve generic string routines |

Message

Adhemerval Zanella Netto Jan. 20, 2023, 9:15 p.m. UTC

  It is done by:

  1. Parametrizing the internal routines (for instance the find zero
     in a word) so each architecture can reimplement without the need
     to reimplement the whole routine.

  2. Vectorizing more string implementations (for instance strcpy
     and strcmp).

  3. Change some implementations to use already possible optimized
     ones (strnlen and strchr).  It makes new ports to focus on
     only provide optimized implementation of a hardful symbols
     (for instance memchr) and make its improvement to be used in
     a larger set of routines.

I checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
and powerpc64-linux-gnu by removing the arch-specific assembly
implementation and disabling multiarch (it covers both LE and BE
for 64 and 32 bits). I also checked the string routines on alpha, hppa,
and sh.

Changes since v9:
  * Added strncmp optimization.
  * Fixed wcsmbs regressions.

Changes since v8:
  * Change memrchr to use vectorized load on final string, instead of
    byte per byte reads.
  * Remove string-maskoff.h header.
  * Add string-repeat_bytes.h and string-shift.h.
  * Hook up the generic implementation on string tests.

Changes since v7:
  * Split string-fzc.h out of string-fzi.h, with all of the
    routines that are combinations of fza and fzi routines.
  * Fix missing find_t and shift_find() from alpha, arm, powerpc.
  * Use compiler builtins for arm and powerpc.
  * Define sh4 has_zero() via has_eq(), rather than reverse.

Changes since v6:
  * Add find_t to handle alpha way of comapring bytes (which returns
    a bit-mask instead of byte-mask).
  * Fixed alpha string-fzi.h and added string-fza.h.
  * Renamed check_mask to shift_find.

Changes since v5:
  * Replace 'inline' with '__always_inline' macros.
  * Replace strchr implementation with a simpler one that call
    strchrnul.
  * Add strchrnul suggested changes.
  * Add memchr suggested changes.
  * Added check_mask on string-maskoff.h.
  * Rebase and update Copyright years.

Changes since v4:
  * Removed __clz and __ctz in favor of count_leading_zero and
    count_trailing_zeros from longlong.h.
  * Use repeat_bytes more often.
  * Added a comment on strcmp final_cmp on why index_first_zero_ne can
    not be used.

Changes since v3:
  * Rebased against master.
  * Dropped strcpy optimization.
  * Refactor strcmp implementation.
  * Some minor changes in comments.

Changes since v2:
  * Move string-fz{a,b,i} to its own patch.
  * Add a inline implementation for __builtin_c{l,t}z to avoid using
    compiler provided symbols.
  * Add a new header, string-maskoff.h, to handle unaligned accesses
    on some implementation.
  * Fixed strcmp on LE machines.
  * Added a unaligned strcpy variant for architecture that define
    _STRING_ARCH_unaligned.
  * Add SH string-fzb.h (which uses cmp/str instruction to find
    a zero in word).

Changes since v1:
  * Marked ChangeLog entries with [BZ #5806], as appropriate.
  * Reorganized the headers, so that armv6t2 and power6 need override
    as little as possible to use their (integer) zero detection insns.
  * Hopefully fixed all of the coding style issues.
  * Adjusted the memrchr algorithm as discussed.
  * Replaced the #ifdef STRRCHR etc that are used by the multiarch
  * files.
  * Tested on i386, i686, x86_64 (verified this is unused), ppc64,
    ppc64le --with-cpu=power8 (to use power6 in multiarch), armv7,
    aarch64, alpha (qemu) and hppa (qemu).

Adhemerval Zanella (18):
  Parameterize op_t from memcopy.h
  Add string vectorized find and detection functions
  string: Improve generic strlen
  string: Improve generic strnlen
  string: Improve generic strchr
  string: Improve generic strchrnul
  string: Improve generic strcmp
  string: Improve generic strncmp
  string: Improve generic memchr
  string: Improve generic memrchr
  sh: Add string-fzb.h
  string: Hook up the default implementation on test-strlen
  string: Hook up the default implementation on test-strnlen
  string: Hook up the default implementation on test-strchr
  string: Hook up the default implementation on test-strcmp
  string: Hook up the default implementation on test-strncmp
  string: Hook up the default implementation on test-memchr
  string: Hook up the default implementation on test-memrchr

Richard Henderson (6):
  Parameterize OP_T_THRES from memcopy.h
  hppa: Add memcopy.h
  hppa: Add string-fzb.h and string-fzi.h
  alpha: Add string-fzb.h and string-fzi.h
  arm: Add string-fza.h
  powerpc: Add string-fza.h

 string/memchr.c                               | 178 +++++-----------
 string/memcmp.c                               |   4 -
 string/memrchr.c                              | 196 ++++--------------
 string/strchr.c                               | 159 +-------------
 string/strchrnul.c                            | 155 ++------------
 string/strcmp-impl.h                          |  41 ++++
 string/strcmp.c                               | 103 +++++++--
 string/strlen.c                               |  92 ++------
 string/strncmp.c                              | 130 ++++++++----
 string/strnlen.c                              | 137 +-----------
 string/test-memchr.c                          |  31 ++-
 string/test-memrchr.c                         |   7 +
 string/test-strchr.c                          |  53 +++--
 string/test-strcmp.c                          |  22 ++
 string/test-strlen.c                          |  31 ++-
 string/test-strncmp.c                         |  16 ++
 string/test-strnlen.c                         |  35 +++-
 sysdeps/alpha/string-fza.h                    |  61 ++++++
 sysdeps/alpha/string-fzb.h                    |  52 +++++
 sysdeps/alpha/string-fzi.h                    |  62 ++++++
 sysdeps/alpha/string-shift.h                  |  44 ++++
 sysdeps/arm/armv6t2/string-fza.h              |  68 ++++++
 sysdeps/generic/memcopy.h                     |  10 +-
 sysdeps/generic/string-fza.h                  | 104 ++++++++++
 sysdeps/generic/string-fzb.h                  |  49 +++++
 sysdeps/generic/string-fzc.h                  |  91 ++++++++
 sysdeps/generic/string-fzi.h                  |  71 +++++++
 sysdeps/generic/string-misc.h                 |  45 ++++
 sysdeps/generic/string-opthr.h                |  25 +++
 sysdeps/generic/string-optype.h               |  24 +++
 sysdeps/generic/string-shift.h                |  52 +++++
 sysdeps/hppa/memcopy.h                        |  42 ++++
 sysdeps/hppa/string-fzb.h                     |  70 +++++++
 sysdeps/hppa/string-fzc.h                     | 124 +++++++++++
 sysdeps/hppa/string-fzi.h                     |  63 ++++++
 sysdeps/i386/i686/multiarch/strnlen-c.c       |  14 +-
 sysdeps/i386/memcopy.h                        |   3 -
 sysdeps/i386/string-opthr.h                   |  25 +++
 sysdeps/m68k/memcopy.h                        |   3 -
 sysdeps/powerpc/powerpc32/power4/memcopy.h    |   5 -
 .../powerpc32/power4/multiarch/memchr-ppc32.c |  14 +-
 .../power4/multiarch/strchrnul-ppc32.c        |   4 -
 .../power4/multiarch/strnlen-ppc32.c          |  14 +-
 .../powerpc64/multiarch/memchr-ppc64.c        |   9 +-
 sysdeps/powerpc/string-fza.h                  |  72 +++++++
 sysdeps/s390/strchr-c.c                       |  11 +-
 sysdeps/s390/strchrnul-c.c                    |   2 -
 sysdeps/s390/strlen-c.c                       |  10 +-
 sysdeps/s390/strnlen-c.c                      |  14 +-
 sysdeps/sh/string-fzb.h                       |  55 +++++
 sysdeps/x86_64/x32/string-optype.h            |  24 +++
 51 files changed, 1776 insertions(+), 950 deletions(-)
 create mode 100644 string/strcmp-impl.h
 create mode 100644 sysdeps/alpha/string-fza.h
 create mode 100644 sysdeps/alpha/string-fzb.h
 create mode 100644 sysdeps/alpha/string-fzi.h
 create mode 100644 sysdeps/alpha/string-shift.h
 create mode 100644 sysdeps/arm/armv6t2/string-fza.h
 create mode 100644 sysdeps/generic/string-fza.h
 create mode 100644 sysdeps/generic/string-fzb.h
 create mode 100644 sysdeps/generic/string-fzc.h
 create mode 100644 sysdeps/generic/string-fzi.h
 create mode 100644 sysdeps/generic/string-misc.h
 create mode 100644 sysdeps/generic/string-opthr.h
 create mode 100644 sysdeps/generic/string-optype.h
 create mode 100644 sysdeps/generic/string-shift.h
 create mode 100644 sysdeps/hppa/memcopy.h
 create mode 100644 sysdeps/hppa/string-fzb.h
 create mode 100644 sysdeps/hppa/string-fzc.h
 create mode 100644 sysdeps/hppa/string-fzi.h
 create mode 100644 sysdeps/i386/string-opthr.h
 create mode 100644 sysdeps/powerpc/string-fza.h
 create mode 100644 sysdeps/sh/string-fzb.h
 create mode 100644 sysdeps/x86_64/x32/string-optype.h

Comments

Jeff Law Jan. 20, 2023, 11 p.m. UTC | #1

On 1/20/23 14:15, Adhemerval Zanella via Libc-alpha wrote:
> It is done by:
> 
>    1. Parametrizing the internal routines (for instance the find zero
>       in a word) so each architecture can reimplement without the need
>       to reimplement the whole routine.
> 
>    2. Vectorizing more string implementations (for instance strcpy
>       and strcmp).
> 
>    3. Change some implementations to use already possible optimized
>       ones (strnlen and strchr).  It makes new ports to focus on
>       only provide optimized implementation of a hardful symbols
>       (for instance memchr) and make its improvement to be used in
>       a larger set of routines.
> 
> I checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> and powerpc64-linux-gnu by removing the arch-specific assembly
> implementation and disabling multiarch (it covers both LE and BE
> for 64 and 32 bits). I also checked the string routines on alpha, hppa,
> and sh.
[ ... ]
You guys are making work for me! :)

We've got hand-written RISC-V implementations of various str* and mem* 
routines that I nearly asked to get merged for glibc-2.37.  I held off 
thinking that there wasn't much benefit to including it in 2.37 vs 2.38. 
  Bad call on my part!

jeff

Adhemerval Zanella Netto Jan. 27, 2023, 4:52 p.m. UTC | #2

On 20/01/23 20:00, Jeff Law via Libc-alpha wrote:
> 
> 
> On 1/20/23 14:15, Adhemerval Zanella via Libc-alpha wrote:
>> It is done by:
>>
>>    1. Parametrizing the internal routines (for instance the find zero
>>       in a word) so each architecture can reimplement without the need
>>       to reimplement the whole routine.
>>
>>    2. Vectorizing more string implementations (for instance strcpy
>>       and strcmp).
>>
>>    3. Change some implementations to use already possible optimized
>>       ones (strnlen and strchr).  It makes new ports to focus on
>>       only provide optimized implementation of a hardful symbols
>>       (for instance memchr) and make its improvement to be used in
>>       a larger set of routines.
>>
>> I checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
>> and powerpc64-linux-gnu by removing the arch-specific assembly
>> implementation and disabling multiarch (it covers both LE and BE
>> for 64 and 32 bits). I also checked the string routines on alpha, hppa,
>> and sh.
> [ ... ]
> You guys are making work for me! :)
> 
> We've got hand-written RISC-V implementations of various str* and mem* routines that I nearly asked to get merged for glibc-2.37.  I held off thinking that there wasn't much benefit to including it in 2.37 vs 2.38.  Bad call on my part!
>

Does this patchset work for the RISCV routine are aiming to optimize?
Do you have any strategy you think it would be profitable to add in
the generic framework?

Hand-optimize routines might squeeze some more cycles on some scenarios,
but at least the default framework should give a better performance.

I see that most of the arch-specific optimization adds alignment
consideration (specially if architecture provides fast unaligned access)
and loop unrolling for main loop.  For former it would require a 
different implementation, but for latter I think we can add per-arch
compiler flags to force unrolling (as we do for powerpc memmove/worcopy).

Richard Henderson Jan. 27, 2023, 7:36 p.m. UTC | #3

On 1/20/23 13:00, Jeff Law via Libc-alpha wrote:
> You guys are making work for me! :)
> 
> We've got hand-written RISC-V implementations of various str* and mem* routines that I 
> nearly asked to get merged for glibc-2.37.  I held off thinking that there wasn't much 
> benefit to including it in 2.37 vs 2.38.  Bad call on my part!

In that you'll want to benchmark your hand-written code vs the new generic code with a 
string-fza.h file for riscv-zbb (orc.b) or xtheadbb (th.tstnbz)?

After the benchmarking, I'm hoping you'll find that some of the hand-written code is no 
longer needed, and therefore won't accrue a maintenance burden going forward.

r~

Xi Ruoyao Jan. 28, 2023, 2:30 p.m. UTC | #4

On Fri, 2023-01-20 at 18:15 -0300, Adhemerval Zanella via Libc-alpha wrote:
> It is done by:
> 
>   1. Parametrizing the internal routines (for instance the find zero
>      in a word) so each architecture can reimplement without the need
>      to reimplement the whole routine.
> 
>   2. Vectorizing more string implementations (for instance strcpy
>      and strcmp).

Hmm, I think strcpy is not touched by any of the patches (if I'm not
blind :).

>   3. Change some implementations to use already possible optimized
>      ones (strnlen and strchr).  It makes new ports to focus on
>      only provide optimized implementation of a hardful symbols
>      (for instance memchr) and make its improvement to be used in
>      a larger set of routines.
> 
> I checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
> and powerpc64-linux-gnu by removing the arch-specific assembly
> implementation and disabling multiarch (it covers both LE and BE
> for 64 and 32 bits). I also checked the string routines on alpha, hppa,
> and sh.

Tested on loongarch64-linux-gnu where is no assembly implementation for
string functions.

I hoped this could be merged for 2.37 but it's not possible now.

Adhemerval Zanella Netto Jan. 30, 2023, 1:10 p.m. UTC | #5

On 28/01/23 11:30, Xi Ruoyao wrote:
> On Fri, 2023-01-20 at 18:15 -0300, Adhemerval Zanella via Libc-alpha wrote:
>> It is done by:
>>
>>   1. Parametrizing the internal routines (for instance the find zero
>>      in a word) so each architecture can reimplement without the need
>>      to reimplement the whole routine.
>>
>>   2. Vectorizing more string implementations (for instance strcpy
>>      and strcmp).
> 
> Hmm, I think strcpy is not touched by any of the patches (if I'm not
> blind :).

Indeed, I have not added a strcpy one because current generic implementation
issues strlen+memcpy. But I think for generic implementation it also make
sense, since it will essentially read source/destiny twice (and issue two
function call).

I will work on a strcpy implementation.

> 
>>   3. Change some implementations to use already possible optimized
>>      ones (strnlen and strchr).  It makes new ports to focus on
>>      only provide optimized implementation of a hardful symbols
>>      (for instance memchr) and make its improvement to be used in
>>      a larger set of routines.
>>
>> I checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
>> and powerpc64-linux-gnu by removing the arch-specific assembly
>> implementation and disabling multiarch (it covers both LE and BE
>> for 64 and 32 bits). I also checked the string routines on alpha, hppa,
>> and sh.
> 
> Tested on loongarch64-linux-gnu where is no assembly implementation for
> string functions.
> 
> I hoped this could be merged for 2.37 but it's not possible now.

Unfortunately I don't we can make to 2.37, although since it does not
change ABI backport are possible.

I think most of the implementation should be ok for 2.38 once it opens
for development. Noah/Richard already gave me ok for most of them (I
am still missing some Reviewed-by ;) btw).

Adhemerval Zanella Netto Jan. 30, 2023, 1:26 p.m. UTC | #6

On 27/01/23 16:36, Richard Henderson via Libc-alpha wrote:
> On 1/20/23 13:00, Jeff Law via Libc-alpha wrote:
>> You guys are making work for me! :)
>>
>> We've got hand-written RISC-V implementations of various str* and mem* routines that I nearly asked to get merged for glibc-2.37.  I held off thinking that there wasn't much benefit to including it in 2.37 vs 2.38.  Bad call on my part!
> 
> In that you'll want to benchmark your hand-written code vs the new generic code with a string-fza.h file for riscv-zbb (orc.b) or xtheadbb (th.tstnbz)?
> 
> After the benchmarking, I'm hoping you'll find that some of the hand-written code is no longer needed, and therefore won't accrue a maintenance burden going forward.
> 
> 
> r~

I think for riscv it might be advantageous to reimplement string-fzi.h
as well to check bits directly instead of using __builtin_c{t,l}z 
(similar to hppa).

Jeff Law Jan. 30, 2023, 6:37 p.m. UTC | #7

On 1/30/23 06:26, Adhemerval Zanella Netto wrote:
> 
> 
> On 27/01/23 16:36, Richard Henderson via Libc-alpha wrote:
>> On 1/20/23 13:00, Jeff Law via Libc-alpha wrote:
>>> You guys are making work for me! :)
>>>
>>> We've got hand-written RISC-V implementations of various str* and mem* routines that I nearly asked to get merged for glibc-2.37.  I held off thinking that there wasn't much benefit to including it in 2.37 vs 2.38.  Bad call on my part!
>>
>> In that you'll want to benchmark your hand-written code vs the new generic code with a string-fza.h file for riscv-zbb (orc.b) or xtheadbb (th.tstnbz)?
>>
>> After the benchmarking, I'm hoping you'll find that some of the hand-written code is no longer needed, and therefore won't accrue a maintenance burden going forward.
>>
>>
>> r~
> 
> I think for riscv it might be advantageous to reimplement string-fzi.h
> as well to check bits directly instead of using __builtin_c{t,l}z
> (similar to hppa).
Almost certainly.  I'll probably look to see if Raphael can cover this 
in the relatively near future.  We'll obviously compare the result 
against those hand-written implementations from VRULL.  Using orc.c is 
probably the one thing that may be nontrivial, but we'll cross that 
bridge when we get to it.

What we have here is set up for multiarch, so we can have distinct 
variants for basic, zbb or thead.

jeff