Optimized strlen() for x64 (SSE2 based strlen)

Message ID CAMe9rOrgm8uTY2OBqmk8crOvSd6kZ9QPxE_-HyMPK8HnL6oQpQ@mail.gmail.com
State New, archived

Commit Message

H.J. Lu May 12, 2018, 12:54 p.m. UTC
  On Sat, May 12, 2018 at 4:40 AM, Trevor Herselman <therselman@gmail.com> wrote:
> #include <emmintrin.h>
> #include <stdint.h>    /* uintptr_t */
> #include <stddef.h>    /* size_t */
>
> size_t strlen_sse2( const char *str )
> {
>     /* Round str down to its aligned 16-byte block.  */
>     const __m128i *ptr = (__m128i*)( (uintptr_t) str & ~0xf );
>     const __m128i zero = _mm_setzero_si128();
>
>     /* First block: compare to zero, then shift off the mask bits that
>        belong to the bytes before str.  */
>     int mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
>     mask >>= ( (uintptr_t) str & 0xf );
>     if ( mask ) return __builtin_ctz( (unsigned int) mask );
>     ptr++;
>
>     /* Remaining blocks: plain aligned 16-byte scans.  */
>     for ( ; ; )
>     {
>         mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
>         if ( mask ) return (char *)ptr - str + __builtin_ctz( (unsigned int) mask );
>         ptr++;
>     }
>     __builtin_unreachable();
> }
>
>
> Written by Trevor Herselman
>
>
> Features:
>
>
> *) NO penalty for misaligned addresses (the `mask >>= unaligned bit
> shift` takes care of it).
> *) NO alignment checks / jumps - ALL reads are already pre-aligned to
> 16 bytes.
> *) The cost of the misalignment handling is amortized inside the first
> 16-byte read, for aligned and unaligned addresses alike.
> *) The function uses ONLY aligned reads (_mm_load_si128); it NEVER uses
> a single unaligned read, not even for 1 byte.
> *) Plain C, so the compiler is free to optimize around the function
> when it is inlined.
> *) Small, tight code: less than 1 CPU cache line (about 40 bytes) more
> machine code than the default bit-twiddling implementation, for a
> significant speed boost.
> *) Compatible with ALL x64 CPUs, because they are ALL required to
> support at least SSE2.
> *) NO misaligned-read penalties on older hardware.
> *) Faster than ALL other SSE2 ~ SSE4.2 implementations I benchmarked,
> especially for smaller strings.
> *) The size/cost/performance benefit starts at a string length of 2
> bytes; meaning, if the average string length is 2 bytes or more, this
> implementation is faster than any other method I tested.
> *) Benchmarked against GLIBC (bit-twiddle), Agner Fog's A_strlen() and
> all other methods I could find.
> *) Doesn't need any (4K) page-boundary checks and will never read
> beyond a page boundary (as long as the NULL terminator is found),
> because an aligned 16-byte load can never cross a 4 KiB page.
>
> The main benefit is that there are NO alignment checks! Any unaligned
> bits are simply shifted off the first mask, so there is just a `shr`
> instead of an `if (str & 0xF)` test with a `cmp` and `jmp`. If the
> address is already aligned, the shift does nothing, because `shr` by 0
> is a no-op; either way the alignment check is eliminated!
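
To make the shift concrete, here is a small self-contained illustration with made-up values (an editorial sketch, not part of the original mail): the string starts 3 bytes into its aligned 16-byte block, its terminator sits at block offset 5 (so its length is 2), and a stray zero byte happens to sit at block offset 1, before the string.

#include <stdio.h>

int main(void)
{
    /* Hypothetical movemask result of the first aligned compare-to-zero:
       bit 1 is set by a garbage zero byte before the string, bit 5 by the
       string's own terminator.  */
    unsigned int mask = (1u << 1) | (1u << 5);
    unsigned int misalignment = 3;      /* (uintptr_t) str & 0xf */

    mask >>= misalignment;   /* bit 1 (before str) is discarded; bit 5
                                (the terminator) becomes bit 2 */
    printf("strlen = %d\n", __builtin_ctz(mask));   /* prints: strlen = 2 */
    return 0;
}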
>
> Note: I wasn't able to benchmark it against the current GLIBC AVX2 or
> SSE2 implementations, because I can't compile or test them. However,
> having looked at the code, I believe they will be slower in general
> unless you are working with very large strings, because they read 4
> XMM registers at a time, which is a big performance cost when you
> don't need the extra reads/data, and I believe shorter strings are
> more common! Their code size is also significantly larger!

It is trivial to add a new implementation to the glibc framework.  This patch
adds __strlen_sse2_2 to glibc:

   /* Support sysdeps/x86_64/multiarch/strnlen.c.  */

Here __strlen_sse2_2 is your implementation.  The glibc microbenchmark
shows:

                     simple_STRLEN builtin_strlen __strlen_avx2 __strlen_sse2_2 __strlen_sse2
Length    1, alignment  1: 32.875 48.1562 42.8438 40.5 39.9062
Length    1, alignment  0: 30.7812 47.5938 31.75 35.0625 34.9375
Length    2, alignment  2: 40.4062 44.0938 29.9688 29.9062 28.9375
Length    2, alignment  0: 36.6875 42.2812 29.25 32.5938 29.75
Length    3, alignment  3: 50.25 40.375 29.0625 29.0625 29.5312
Length    3, alignment  0: 46.8438 40.4688 29.1875 28.375 31.5625
Length    4, alignment  4: 56.2812 40.3438 28.0625 27.9375 31.1875
Length    4, alignment  0: 48.6875 40.3438 27.9062 27.75 31.625
Length    5, alignment  5: 69.1562 40.2812 33 31.2812 31.875
Length    5, alignment  0: 64.4688 40.1562 31.625 31.0312 28.0312
Length    6, alignment  6: 64.9062 40.6562 27.9062 27.875 31.3125
Length    6, alignment  0: 66.7188 40.4375 27.6875 28.0625 31.5312
Length    7, alignment  7: 66.8438 40.3438 28.0625 27.9688 28.1875
Length    7, alignment  0: 71.5312 40.8438 28.0625 28.25 30.5312
Length    4, alignment  0: 49 40.8438 28.0625 28 27.8125
Length    4, alignment  7: 49.625 40.5938 27.75 27.9062 27.9375
Length    4, alignment  2: 47.375 41 27.8438 30.9688 27.8438
Length    2, alignment  2: 38.3438 40.25 27.9062 27.6875 28.0625
Length    8, alignment  0: 77.2812 40.9688 27.8438 30.25 31.0938
Length    8, alignment  7: 76.1875 40.9375 28.0312 31.5625 29.25
Length    8, alignment  3: 63.2812 40.5 28.0625 28.3125 27.9062
Length    5, alignment  3: 68.7188 40.3438 27.9375 27.9688 30.3125
Length   16, alignment  0: 106.906 40.8125 27.75 51.6875 53.5312
Length   16, alignment  7: 98.3125 43.1562 28.0625 42.8438 53.5
Length   16, alignment  4: 98.3438 40.375 27.9062 44.625 50.9062
Length   10, alignment  4: 76.375 40.3125 28.0938 35 34.25
Length   32, alignment  0: 179.719 51.7812 43.5625 52.875 60.8125
Length   32, alignment  7: 171.094 48.375 40.1562 45.0625 47.0312
Length   32, alignment  5: 167.062 46.5938 38.5625 45.2812 48.0938
Length   21, alignment  5: 123.344 44.7812 29.5625 47.0938 47.5938
Length   64, alignment  0: 346.125 64.75 46.3125 79.5 79.625
Length   64, alignment  7: 338.562 51.2812 43.8438 60.2812 75.1562
Length   64, alignment  6: 338.969 52.5 42.5312 63.0625 74.4062
Length   42, alignment  6: 431 49.8438 36.6875 56.7188 53.5625
Length  128, alignment  0: 845.344 69.75 50.875 79.375 93.4688
Length  128, alignment  7: 853.75 58.5 49.25 75.7812 84.6562
Length  128, alignment  7: 851.344 56.3438 47.1875 75.8125 83.2812
Length   85, alignment  7: 633.25 61.1562 42.2812 67.0625 76.75
Length  256, alignment  0: 1563.38 102.844 91.7188 149.031 106.75
Length  256, alignment  7: 1578.09 88.75 76.625 124.625 101.656
Length  256, alignment  8: 1556.91 88.125 76.9688 133.125 101.531
Length  170, alignment  8: 1075.59 87.6875 71.5312 99.3125 90.5625
Length  512, alignment  0: 2971.5 127.875 120.031 233.094 145
Length  512, alignment  7: 2969.06 112.469 102.062 210.812 137.062
Length  512, alignment  9: 2964.56 109.562 103.594 207.062 132.875
Length  341, alignment  9: 2017 106.125 98.3125 158.969 119.938
Length 1024, alignment  0: 5782.5 179.844 170.375 383.219 237.562
Length 1024, alignment  7: 5785.72 163.406 152.594 382.219 220.688
Length 1024, alignment 10: 5772.62 162.875 152.062 376.812 218.969
Length  682, alignment 10: 3898.88 145.625 125.938 275.5 160.812
Length 2048, alignment  0: 11879.2 247.469 243.094 769.469 374.844
Length 2048, alignment  7: 10140.6 237.531 229.812 768.188 350.094
Length 2048, alignment 11: 10526.2 235.781 227.125 775.5 351.844
Length 1365, alignment 11: 6820.31 190.219 183.156 448.406 266.625
Length 4096, alignment  0: 20157.3 433.469 431.031 1399.66 679.875
Length 4096, alignment  7: 20502.1 428.844 420.438 1379.12 661.688
Length 4096, alignment 12: 12538.4 123.25 120.5 398.719 192.531
Length 2730, alignment 12: 3913.09 90.8125 87.1562 283.5 132.344

Your implementation is good, especially at shorter strings.   For strings
longer than 128 bytes, yours is slower.

>
> My name is Trevor Herselman, and I give the developers of GCC, GLIBC
> and anyone else permission to use this algorithm under the MIT
> license. Please credit me where possible!
>
>
>
> Benchmarks:  10,000,000,000 strings, warm cache, Core i5 3rd gen,
> lower is better
>
> The `GCC/GLIBC bit-twiddling` technique listed below is the current
> GCC/GLIBC implementation: a 1-byte pre-alignment test followed by
> 64-bit (8-byte long long) reads.
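
For reference, here is a minimal sketch of this kind of word-at-a-time "bit-twiddling" loop (an editorial reconstruction of the general technique, not the exact GCC/GLIBC source): align to 8 bytes with byte tests, then detect a zero byte in each 64-bit word with the classic (w - 0x01...01) & ~w & 0x80...80 test.

#include <stddef.h>
#include <stdint.h>

size_t strlen_bittwiddle( const char *s )
{
    const char *p = s;

    /* Byte-at-a-time until p is 8-byte aligned.  */
    while ( (uintptr_t) p & 7 )
    {
        if ( *p == '\0' ) return p - s;
        p++;
    }

    /* Word-at-a-time: a 64-bit word v contains a zero byte if and only if
       (v - 0x0101..01) & ~v & 0x8080..80 is nonzero.  */
    const uint64_t *w = (const uint64_t *) p;
    for ( ; ; )
    {
        uint64_t v = *w;
        if ( ( v - 0x0101010101010101ULL ) & ~v & 0x8080808080808080ULL )
        {
            /* The word holds the terminator; locate it byte by byte.  */
            const char *c = (const char *) w;
            while ( *c ) c++;
            return c - s;
        }
        w++;
    }
}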
>
> Aligned memory
>
> 0 byte length (empty string, only NULL terminator)
>
> Traditional (naive) 1-byte test: 5.35s
> GCC/GLIBC bit-twiddling: 14s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 1 byte length
>
> Traditional (naive) 1-byte test: 9s
> GCC/GLIBC bit-twiddling: 17s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 2 byte length
>
> Traditional (naive) 1-byte test: 15s
> bit-twiddling: 20s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 3 byte length
>
> Traditional (naive) 1-byte test: 19s
> GCC bit-twiddle: 23s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 7 byte length (+1 byte NULL = 8 bytes total)
>
> Traditional (naive) 1-byte test: 44s
> GCC bit-twiddle: 33.4s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 8 byte length (+1 byte NULL = 9 bytes total)
>
> Traditional (naive) 1-byte test: 46s
> GCC bit-twiddle: 19s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 15 byte length (+1 byte NULL = 16 bytes total)
>
> Traditional (naive) 1-byte test: 85s
> GCC bit-twiddle: 34s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 16 byte length (+1 byte NULL = 17 bytes total)
>
> Traditional (naive) 1-byte test: 85s
> GCC bit-twiddle: 27s
> A_strlen: 24s
> My SSE2 strlen: 13.3s
>
>
> 128 byte length (+1 byte NULL = 129 bytes total)
>
> Traditional (naive) 1-byte test: 718.1s
> GCC bit-twiddle: 171.3s
> A_strlen: 63s
> My SSE2 strlen: 45.5s
>
>
>
> unaligned (1 byte offset)
>
> 0 bytes:
>
> Traditional (naive) 1-byte test: 8s
> GCC bit-twiddle: 9s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> NOTE: The GCC/GLIBC implementation is faster here because of its
> pre-alignment test: it reads the first byte and returns immediately.
>
>
> 1 byte:
>
> Traditional (naive) 1-byte test: 13.2s
> GCC bit-twiddle: 15.9s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 2 bytes:
>
> Traditional (naive) 1-byte test: 17.4
> GCC bit-twiddle: 19.3s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 3 bytes:
>
> Traditional (naive) 1-byte test: 25.6s
> GCC bit-twiddle: 24.7s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 127 bytes:
>
> Traditional (naive) 1-byte test: 715s
> GCC bit-twiddle: 180s
> A_strlen: 65s
> My SSE2 strlen: 47s
  

Comments

Ondrej Bilka May 12, 2018, 9:41 p.m. UTC | #1
On Sat, May 12, 2018 at 05:54:47AM -0700, H.J. Lu wrote:
 
> Your implementation is good, especially at shorter strings.   For strings
> longer than 128 bytes, yours is slower.
> 
No, that implementation isn't good. It is pretty much the same as the
original implementation from ages ago, which we have improved several
times since. So I will briefly recapitulate the state of affairs.

These benchmarks are worthless because they ignore branch misprediction,
which is what matters most.

With my profiler, the new implementation is 10% slower than the current
one for gcc. This is measured by interposing the strlen calls made during
a gcc compilation and timing each variant. Results when one randomizes
size/alignment are similar; the results and the profiler are here.

https://iuuk.mff.cuni.cz/~ondra/strlen_profile/results_gcc/result.html
https://iuuk.mff.cuni.cz/~ondra/strlen_profile/results_rand/result.html
https://iuuk.mff.cuni.cz/~ondra/strlen_profile180512.tar.bz2

Checks with aligned loads like these are a terrible idea because they
inherently introduce misprediction, mainly when only a few bytes are
loaded. The unaligned loads in the current implementation are there for
a reason: the biggest performance improvement comes from always checking
16 bytes (except in the rare page-boundary case), which is a much more
predictable check.
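
Here is a minimal C sketch of the unaligned-first-load idea described above (an editorial illustration of the approach, not the actual glibc assembly): one unaligned 16-byte compare handles the whole first chunk with a single, well-predicted branch, and only the rare case where that load would cross a 4 KiB page falls back to the aligned-load-plus-shift trick.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_unaligned_first( const char *str )
{
    const __m128i zero = _mm_setzero_si128();
    int mask;

    if ( ( (uintptr_t) str & 0xfff ) <= 0xff0 )
    {
        /* Common case: the 16 bytes at str stay within one 4 KiB page,
           so an unaligned load is safe.  */
        mask = _mm_movemask_epi8( _mm_cmpeq_epi8(
                   _mm_loadu_si128( (const __m128i *) str ), zero ) );
    }
    else
    {
        /* Rare case: an unaligned load would cross the page boundary.
           Use an aligned load and shift off the bytes before str.  */
        const __m128i *p = (const __m128i *) ( (uintptr_t) str & ~0xf );
        mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( p ), zero ) );
        mask >>= ( (uintptr_t) str & 0xf );
    }
    if ( mask ) return __builtin_ctz( (unsigned int) mask );

    /* Terminator not in the first chunk: continue with aligned loads from
       the next 16-byte boundary (re-checking a few bytes is harmless).  */
    const __m128i *ptr = (const __m128i *) ( ( (uintptr_t) str & ~0xf ) + 16 );
    for ( ; ; )
    {
        mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
        if ( mask ) return (const char *) ptr - str + __builtin_ctz( (unsigned int) mask );
        ptr++;
    }
}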

The second improvement is that we unroll the loop 4 times to gain better
performance on larger inputs, but that matters less because most inputs
are short in practice.
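
And a sketch of that second point, the 4x unrolled main loop (again only an illustration of the idea, not the glibc assembly): once the pointer is 64-byte aligned, each iteration loads four 16-byte vectors, ORs the four compare results together and branches once per 64 bytes, locating the exact zero byte only after that combined test fires.

#include <emmintrin.h>

/* Assumes p is 64-byte aligned (so all four loads stay within one page)
   and that the terminator has not been found yet; returns a pointer to
   the first zero byte at or after p.  */
static const char *find_zero_unrolled( const __m128i *p )
{
    const __m128i zero = _mm_setzero_si128();

    for ( ; ; p += 4 )
    {
        __m128i c0 = _mm_cmpeq_epi8( _mm_load_si128( p     ), zero );
        __m128i c1 = _mm_cmpeq_epi8( _mm_load_si128( p + 1 ), zero );
        __m128i c2 = _mm_cmpeq_epi8( _mm_load_si128( p + 2 ), zero );
        __m128i c3 = _mm_cmpeq_epi8( _mm_load_si128( p + 3 ), zero );

        /* One branch per 64 bytes instead of one per 16-byte load.  */
        __m128i any = _mm_or_si128( _mm_or_si128( c0, c1 ),
                                    _mm_or_si128( c2, c3 ) );
        if ( _mm_movemask_epi8( any ) )
        {
            /* The zero byte is in one of the four vectors; rescan them.  */
            for ( int i = 0; i < 4; i++ )
            {
                int mask = _mm_movemask_epi8( _mm_cmpeq_epi8(
                               _mm_load_si128( p + i ), zero ) );
                if ( mask )
                    return (const char *) ( p + i ) + __builtin_ctz( (unsigned int) mask );
            }
        }
    }
}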
  

Patch

diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index 68257c4017..3bb4fd5ec9 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -24,6 +24,7 @@  sysdep_routines += strncat-c stpncpy-c strncpy-c \
     strchr-sse2 strchrnul-sse2 strchr-avx2 strchrnul-avx2 \
     strrchr-sse2 strrchr-avx2 \
     strlen-sse2 strnlen-sse2 strlen-avx2 strnlen-avx2 \
+    strlen-sse2-2 \
     strcat-ssse3 strncat-ssse3\
     strcpy-sse2 stpcpy-sse2 \
     strcpy-ssse3 strncpy-ssse3 stpcpy-ssse3 stpncpy-ssse3 \
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index 7afd674b81..e5b0dfd510 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -186,6 +186,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
        IFUNC_IMPL_ADD (array, i, strlen,
        HAS_ARCH_FEATURE (AVX2_Usable),
        __strlen_avx2)
+       IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2_2)
        IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2))
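
The quoted hunks cover only the Makefile and ifunc-impl-list.c; the new source file itself is not shown above. A hypothetical sysdeps/x86_64/multiarch/strlen-sse2-2.c providing the __strlen_sse2_2 symbol might look roughly like this (an assumption about the unshown part of the patch, not its actual contents):

/* Hypothetical strlen-sse2-2.c: exposes the proposed C routine under the
   __strlen_sse2_2 name referenced by the ifunc-impl-list.c hunk above.  */
#include <string.h>
#include <stdint.h>
#include <emmintrin.h>

size_t
__strlen_sse2_2 (const char *str)
{
  /* Round str down to its aligned 16-byte block.  */
  const __m128i *ptr = (const __m128i *) ((uintptr_t) str & ~(uintptr_t) 0xf);
  const __m128i zero = _mm_setzero_si128 ();

  /* First block: shift off mask bits for bytes before str.  */
  int mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (ptr), zero));
  mask >>= (uintptr_t) str & 0xf;
  if (mask)
    return __builtin_ctz ((unsigned int) mask);

  /* Remaining blocks: plain aligned 16-byte scans.  */
  for (ptr++; ; ptr++)
    {
      mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (ptr), zero));
      if (mask)
        return (size_t) ((const char *) ptr - str)
               + __builtin_ctz ((unsigned int) mask);
    }
}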