Optimized strlen() for x64 (SSE2 based strlen)

Message ID CAMe9rOrgm8uTY2OBqmk8crOvSd6kZ9QPxE_-HyMPK8HnL6oQpQ@mail.gmail.com
State New, archived

Commit Message

H.J. Lu May 12, 2018, 12:54 p.m. UTC
  On Sat, May 12, 2018 at 4:40 AM, Trevor Herselman <therselman@gmail.com> wrote:
> #include <emmintrin.h>
> #include <stdint.h>    /* uintptr_t */
> #include <stddef.h>    /* size_t */
>
> size_t strlen_sse2( const char *str )
> {
>     /* Round str down to its aligned 16-byte block.  */
>     const __m128i *ptr = (__m128i*)( (uintptr_t) str & ~0xf );
>     const __m128i zero = _mm_setzero_si128();
>
>     /* First block: compare to zero, then shift off the mask bits that
>        belong to the bytes before str.  */
>     int mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
>     mask >>= ( (uintptr_t) str & 0xf );
>     if ( mask ) return __builtin_ctz( (unsigned int) mask );
>     ptr++;
>
>     /* Remaining blocks: plain aligned 16-byte scans.  */
>     for ( ; ; )
>     {
>         mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
>         if ( mask ) return (char *)ptr - str + __builtin_ctz( (unsigned int) mask );
>         ptr++;
>     }
>     __builtin_unreachable();
> }
>
>
> Written by Trevor Herselman
>
>
> Features:
>
>
> *) NO penalty for misaligned addresses (the `mask >>= unaligned bit
> shift` takes care of it).
> *) NO alignment checks / jumps - ALL reads are already pre-aligned to
> 16 bytes.
> *) The cost of the misalignment handling is amortized inside the first
> 16-byte read, for aligned and unaligned addresses alike.
> *) The function uses ONLY aligned reads (_mm_load_si128); it NEVER uses
> a single unaligned read, not even for 1 byte.
> *) Plain C, so the compiler is free to optimize around the function
> when it is inlined.
> *) Small, tight code: less than 1 CPU cache line (about 40 bytes) more
> machine code than the default bit-twiddling implementation, for a
> significant speed boost.
> *) Compatible with ALL x64 CPUs, because they are ALL required to
> support at least SSE2.
> *) NO misaligned-read penalties on older hardware.
> *) Faster than ALL other SSE2 ~ SSE4.2 implementations I benchmarked,
> especially for smaller strings.
> *) The size/cost/performance benefit starts at a string length of 2
> bytes; meaning, if the average string length is 2 bytes or more, this
> implementation is faster than any other method I tested.
> *) Benchmarked against GLIBC (bit-twiddle), Agner Fog's A_strlen() and
> all other methods I could find.
> *) Doesn't need any (4K) page-boundary checks and will never read
> beyond a page boundary (as long as the NULL terminator is found),
> because an aligned 16-byte load can never cross a 4 KiB page.
>
> The main benefit is that there are NO alignment checks! Any unaligned
> bits are simply shifted off the first mask, so there is just a `shr`
> instead of an `if (str & 0xF)` test with a `cmp` and `jmp`. If the
> address is already aligned, the shift does nothing, because `shr` by 0
> is a no-op; either way the alignment check is eliminated!
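
To make the shift concrete, here is a small self-contained illustration with made-up values (an editorial sketch, not part of the original mail): the string starts 3 bytes into its aligned 16-byte block, its terminator sits at block offset 5 (so its length is 2), and a stray zero byte happens to sit at block offset 1, before the string.

#include <stdio.h>

int main(void)
{
    /* Hypothetical movemask result of the first aligned compare-to-zero:
       bit 1 is set by a garbage zero byte before the string, bit 5 by the
       string's own terminator.  */
    unsigned int mask = (1u << 1) | (1u << 5);
    unsigned int misalignment = 3;      /* (uintptr_t) str & 0xf */

    mask >>= misalignment;   /* bit 1 (before str) is discarded; bit 5
                                (the terminator) becomes bit 2 */
    printf("strlen = %d\n", __builtin_ctz(mask));   /* prints: strlen = 2 */
    return 0;
}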
>
> Note: I wasn't able to benchmark it against the current GLIBC AVX2 or
> SSE2 implementations, because I can't compile or test them. However,
> having looked at the code, I believe they will be slower in general
> unless you are working with very large strings, because they read 4
> XMM registers at a time, which is a big performance cost when you
> don't need the extra reads/data, and I believe shorter strings are
> more common! Their code size is also significantly larger!

It is trivial to add a new implementation to the glibc framework.  This patch
adds __strlen_sse2_2 to glibc:

   /* Support sysdeps/x86_64/multiarch/strnlen.c.  */

Here __strlen_sse2_2 is your implementation.  The glibc microbenchmark
shows:

                     simple_STRLEN builtin_strlen __strlen_avx2 __strlen_sse2_2 __strlen_sse2
Length    1, alignment  1: 32.875 48.1562 42.8438 40.5 39.9062
Length    1, alignment  0: 30.7812 47.5938 31.75 35.0625 34.9375
Length    2, alignment  2: 40.4062 44.0938 29.9688 29.9062 28.9375
Length    2, alignment  0: 36.6875 42.2812 29.25 32.5938 29.75
Length    3, alignment  3: 50.25 40.375 29.0625 29.0625 29.5312
Length    3, alignment  0: 46.8438 40.4688 29.1875 28.375 31.5625
Length    4, alignment  4: 56.2812 40.3438 28.0625 27.9375 31.1875
Length    4, alignment  0: 48.6875 40.3438 27.9062 27.75 31.625
Length    5, alignment  5: 69.1562 40.2812 33 31.2812 31.875
Length    5, alignment  0: 64.4688 40.1562 31.625 31.0312 28.0312
Length    6, alignment  6: 64.9062 40.6562 27.9062 27.875 31.3125
Length    6, alignment  0: 66.7188 40.4375 27.6875 28.0625 31.5312
Length    7, alignment  7: 66.8438 40.3438 28.0625 27.9688 28.1875
Length    7, alignment  0: 71.5312 40.8438 28.0625 28.25 30.5312
Length    4, alignment  0: 49 40.8438 28.0625 28 27.8125
Length    4, alignment  7: 49.625 40.5938 27.75 27.9062 27.9375
Length    4, alignment  2: 47.375 41 27.8438 30.9688 27.8438
Length    2, alignment  2: 38.3438 40.25 27.9062 27.6875 28.0625
Length    8, alignment  0: 77.2812 40.9688 27.8438 30.25 31.0938
Length    8, alignment  7: 76.1875 40.9375 28.0312 31.5625 29.25
Length    8, alignment  3: 63.2812 40.5 28.0625 28.3125 27.9062
Length    5, alignment  3: 68.7188 40.3438 27.9375 27.9688 30.3125
Length   16, alignment  0: 106.906 40.8125 27.75 51.6875 53.5312
Length   16, alignment  7: 98.3125 43.1562 28.0625 42.8438 53.5
Length   16, alignment  4: 98.3438 40.375 27.9062 44.625 50.9062
Length   10, alignment  4: 76.375 40.3125 28.0938 35 34.25
Length   32, alignment  0: 179.719 51.7812 43.5625 52.875 60.8125
Length   32, alignment  7: 171.094 48.375 40.1562 45.0625 47.0312
Length   32, alignment  5: 167.062 46.5938 38.5625 45.2812 48.0938
Length   21, alignment  5: 123.344 44.7812 29.5625 47.0938 47.5938
Length   64, alignment  0: 346.125 64.75 46.3125 79.5 79.625
Length   64, alignment  7: 338.562 51.2812 43.8438 60.2812 75.1562
Length   64, alignment  6: 338.969 52.5 42.5312 63.0625 74.4062
Length   42, alignment  6: 431 49.8438 36.6875 56.7188 53.5625
Length  128, alignment  0: 845.344 69.75 50.875 79.375 93.4688
Length  128, alignment  7: 853.75 58.5 49.25 75.7812 84.6562
Length  128, alignment  7: 851.344 56.3438 47.1875 75.8125 83.2812
Length   85, alignment  7: 633.25 61.1562 42.2812 67.0625 76.75
Length  256, alignment  0: 1563.38 102.844 91.7188 149.031 106.75
Length  256, alignment  7: 1578.09 88.75 76.625 124.625 101.656
Length  256, alignment  8: 1556.91 88.125 76.9688 133.125 101.531
Length  170, alignment  8: 1075.59 87.6875 71.5312 99.3125 90.5625
Length  512, alignment  0: 2971.5 127.875 120.031 233.094 145
Length  512, alignment  7: 2969.06 112.469 102.062 210.812 137.062
Length  512, alignment  9: 2964.56 109.562 103.594 207.062 132.875
Length  341, alignment  9: 2017 106.125 98.3125 158.969 119.938
Length 1024, alignment  0: 5782.5 179.844 170.375 383.219 237.562
Length 1024, alignment  7: 5785.72 163.406 152.594 382.219 220.688
Length 1024, alignment 10: 5772.62 162.875 152.062 376.812 218.969
Length  682, alignment 10: 3898.88 145.625 125.938 275.5 160.812
Length 2048, alignment  0: 11879.2 247.469 243.094 769.469 374.844
Length 2048, alignment  7: 10140.6 237.531 229.812 768.188 350.094
Length 2048, alignment 11: 10526.2 235.781 227.125 775.5 351.844
Length 1365, alignment 11: 6820.31 190.219 183.156 448.406 266.625
Length 4096, alignment  0: 20157.3 433.469 431.031 1399.66 679.875
Length 4096, alignment  7: 20502.1 428.844 420.438 1379.12 661.688
Length 4096, alignment 12: 12538.4 123.25 120.5 398.719 192.531
Length 2730, alignment 12: 3913.09 90.8125 87.1562 283.5 132.344

Your implementation is good, especially at shorter strings.   For strings
longer than 128 bytes, yours is slower.

>
> My name is Trevor Herselman, and I give the developers of GCC, GLIBC
> and anyone else permission to use this algorithm under the MIT
> license. Please credit me where possible!
>
>
>
> Benchmarks:  10,000,000,000 strings, warm cache, Core i5 3rd gen,
> lower is better
>
> The `GCC/GLIBC bit-twiddling` technique listed below is the current
> GCC/GLIBC implementation: a 1-byte pre-alignment test followed by
> 64-bit (8-byte long long) reads.
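
For reference, here is a minimal sketch of this kind of word-at-a-time "bit-twiddling" loop (an editorial reconstruction of the general technique, not the exact GCC/GLIBC source): align to 8 bytes with byte tests, then detect a zero byte in each 64-bit word with the classic (w - 0x01...01) & ~w & 0x80...80 test.

#include <stddef.h>
#include <stdint.h>

size_t strlen_bittwiddle( const char *s )
{
    const char *p = s;

    /* Byte-at-a-time until p is 8-byte aligned.  */
    while ( (uintptr_t) p & 7 )
    {
        if ( *p == '\0' ) return p - s;
        p++;
    }

    /* Word-at-a-time: a 64-bit word v contains a zero byte if and only if
       (v - 0x0101..01) & ~v & 0x8080..80 is nonzero.  */
    const uint64_t *w = (const uint64_t *) p;
    for ( ; ; )
    {
        uint64_t v = *w;
        if ( ( v - 0x0101010101010101ULL ) & ~v & 0x8080808080808080ULL )
        {
            /* The word holds the terminator; locate it byte by byte.  */
            const char *c = (const char *) w;
            while ( *c ) c++;
            return c - s;
        }
        w++;
    }
}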
>
> Aligned memory
>
> 0 byte length (empty string, only NULL terminator)
>
> Traditional (naive) 1-byte test: 5.35s
> GCC/GLIBC bit-twiddling: 14s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 1 byte length
>
> Traditional (naive) 1-byte test: 9s
> GCC/GLIBC bit-twiddling: 17s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 2 byte length
>
> Traditional (naive) 1-byte test: 15s
> bit-twiddling: 20s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 3 byte length
>
> Traditional (naive) 1-byte test: 19s
> GCC bit-twiddle: 23s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 7 byte length (+1 byte NULL = 8 bytes total)
>
> Traditional (naive) 1-byte test: 44s
> GCC bit-twiddle: 33.4s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 8 byte length (+1 byte NULL = 9 bytes total)
>
> Traditional (naive) 1-byte test: 46s
> GCC bit-twiddle: 19s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 15 byte length (+1 byte NULL = 16 bytes total)
>
> Traditional (naive) 1-byte test: 85s
> GCC bit-twiddle: 34s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 16 byte length (+1 byte NULL = 17 bytes total)
>
> Traditional (naive) 1-byte test: 85s
> GCC bit-twiddle: 27s
> A_strlen: 24s
> My SSE2 strlen: 13.3s
>
>
> 128 byte length (+1 byte NULL = 129 bytes total)
>
> Traditional (naive) 1-byte test: 718.1s
> GCC bit-twiddle: 171.3s
> A_strlen: 63s
> My SSE2 strlen: 45.5s
>
>
>
> unaligned (1 byte offset)
>
> 0 bytes:
>
> Traditional (naive) 1-byte test: 8s
> GCC bit-twiddle: 9s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> NOTE: The GCC/GLIBC implementation is faster here because of its
> pre-alignment test: it reads the first byte and returns immediately.
>
>
> 1 byte:
>
> Traditional (naive) 1-byte test: 13.2s
> GCC bit-twiddle: 15.9s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 2 bytes:
>
> Traditional (naive) 1-byte test: 17.4
> GCC bit-twiddle: 19.3s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 3 bytes:
>
> Traditional (naive) 1-byte test: 25.6s
> GCC bit-twiddle: 24.7s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
>
> 127 bytes:
>
> Traditional (naive) 1-byte test: 715s
> GCC bit-twiddle: 180s
> A_strlen: 65s
> My SSE2 strlen: 47s
  

Comments

Ondrej Bilka May 12, 2018, 9:41 p.m. UTC | #1
On Sat, May 12, 2018 at 05:54:47AM -0700, H.J. Lu wrote:
 
> Your implementation is good, especially at shorter strings.   For strings
> longer than 128 bytes, yours is slower.
> 
No, that implementation isn't good. It is pretty much the same as the
original implementation from ages ago, which we have improved several
times since. So I will briefly recapitulate the state of affairs.

These benchmarks are worthless because they ignore branch misprediction,
which is what matters most.

With my profiler, the new implementation is 10% slower than the current
one for gcc. This is measured by interposing the strlen calls made during
a gcc compilation and timing each variant. Results when one randomizes
size/alignment are similar; the results and the profiler are here.

https://iuuk.mff.cuni.cz/~ondra/strlen_profile/results_gcc/result.html
https://iuuk.mff.cuni.cz/~ondra/strlen_profile/results_rand/result.html
https://iuuk.mff.cuni.cz/~ondra/strlen_profile180512.tar.bz2

Checks with aligned loads like these are a terrible idea because they
inherently introduce misprediction, mainly when only a few bytes are
loaded. The unaligned loads in the current implementation are there for
a reason: the biggest performance improvement comes from always checking
16 bytes (except in the rare page-boundary case), which is a much more
predictable check.
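
Here is a minimal C sketch of the unaligned-first-load idea described above (an editorial illustration of the approach, not the actual glibc assembly): one unaligned 16-byte compare handles the whole first chunk with a single, well-predicted branch, and only the rare case where that load would cross a 4 KiB page falls back to the aligned-load-plus-shift trick.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_unaligned_first( const char *str )
{
    const __m128i zero = _mm_setzero_si128();
    int mask;

    if ( ( (uintptr_t) str & 0xfff ) <= 0xff0 )
    {
        /* Common case: the 16 bytes at str stay within one 4 KiB page,
           so an unaligned load is safe.  */
        mask = _mm_movemask_epi8( _mm_cmpeq_epi8(
                   _mm_loadu_si128( (const __m128i *) str ), zero ) );
    }
    else
    {
        /* Rare case: an unaligned load would cross the page boundary.
           Use an aligned load and shift off the bytes before str.  */
        const __m128i *p = (const __m128i *) ( (uintptr_t) str & ~0xf );
        mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( p ), zero ) );
        mask >>= ( (uintptr_t) str & 0xf );
    }
    if ( mask ) return __builtin_ctz( (unsigned int) mask );

    /* Terminator not in the first chunk: continue with aligned loads from
       the next 16-byte boundary (re-checking a few bytes is harmless).  */
    const __m128i *ptr = (const __m128i *) ( ( (uintptr_t) str & ~0xf ) + 16 );
    for ( ; ; )
    {
        mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
        if ( mask ) return (const char *) ptr - str + __builtin_ctz( (unsigned int) mask );
        ptr++;
    }
}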

The second improvement is that we unroll the loop 4 times to gain better
performance on larger inputs, but that matters less because most inputs
are short in practice.
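
And a sketch of that second point, the 4x unrolled main loop (again only an illustration of the idea, not the glibc assembly): once the pointer is 64-byte aligned, each iteration loads four 16-byte vectors, ORs the four compare results together and branches once per 64 bytes, locating the exact zero byte only after that combined test fires.

#include <emmintrin.h>

/* Assumes p is 64-byte aligned (so all four loads stay within one page)
   and that the terminator has not been found yet; returns a pointer to
   the first zero byte at or after p.  */
static const char *find_zero_unrolled( const __m128i *p )
{
    const __m128i zero = _mm_setzero_si128();

    for ( ; ; p += 4 )
    {
        __m128i c0 = _mm_cmpeq_epi8( _mm_load_si128( p     ), zero );
        __m128i c1 = _mm_cmpeq_epi8( _mm_load_si128( p + 1 ), zero );
        __m128i c2 = _mm_cmpeq_epi8( _mm_load_si128( p + 2 ), zero );
        __m128i c3 = _mm_cmpeq_epi8( _mm_load_si128( p + 3 ), zero );

        /* One branch per 64 bytes instead of one per 16-byte load.  */
        __m128i any = _mm_or_si128( _mm_or_si128( c0, c1 ),
                                    _mm_or_si128( c2, c3 ) );
        if ( _mm_movemask_epi8( any ) )
        {
            /* The zero byte is in one of the four vectors; rescan them.  */
            for ( int i = 0; i < 4; i++ )
            {
                int mask = _mm_movemask_epi8( _mm_cmpeq_epi8(
                               _mm_load_si128( p + i ), zero ) );
                if ( mask )
                    return (const char *) ( p + i ) + __builtin_ctz( (unsigned int) mask );
            }
        }
    }
}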
  

Patch

diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index 68257c4017..3bb4fd5ec9 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -24,6 +24,7 @@  sysdep_routines += strncat-c stpncpy-c strncpy-c \
     strchr-sse2 strchrnul-sse2 strchr-avx2 strchrnul-avx2 \
     strrchr-sse2 strrchr-avx2 \
     strlen-sse2 strnlen-sse2 strlen-avx2 strnlen-avx2 \
+    strlen-sse2-2 \
     strcat-ssse3 strncat-ssse3\
     strcpy-sse2 stpcpy-sse2 \
     strcpy-ssse3 strncpy-ssse3 stpcpy-ssse3 stpncpy-ssse3 \
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index 7afd674b81..e5b0dfd510 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -186,6 +186,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
        IFUNC_IMPL_ADD (array, i, strlen,
        HAS_ARCH_FEATURE (AVX2_Usable),
        __strlen_avx2)
+       IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2_2)
        IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2))
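
The quoted hunks cover only the Makefile and ifunc-impl-list.c; the new source file itself is not shown above. A hypothetical sysdeps/x86_64/multiarch/strlen-sse2-2.c providing the __strlen_sse2_2 symbol might look roughly like this (an assumption about the unshown part of the patch, not its actual contents):

/* Hypothetical strlen-sse2-2.c: exposes the proposed C routine under the
   __strlen_sse2_2 name referenced by the ifunc-impl-list.c hunk above.  */
#include <string.h>
#include <stdint.h>
#include <emmintrin.h>

size_t
__strlen_sse2_2 (const char *str)
{
  /* Round str down to its aligned 16-byte block.  */
  const __m128i *ptr = (const __m128i *) ((uintptr_t) str & ~(uintptr_t) 0xf);
  const __m128i zero = _mm_setzero_si128 ();

  /* First block: shift off mask bits for bytes before str.  */
  int mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (ptr), zero));
  mask >>= (uintptr_t) str & 0xf;
  if (mask)
    return __builtin_ctz ((unsigned int) mask);

  /* Remaining blocks: plain aligned 16-byte scans.  */
  for (ptr++; ; ptr++)
    {
      mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (ptr), zero));
      if (mask)
        return (size_t) ((const char *) ptr - str)
               + __builtin_ctz ((unsigned int) mask);
    }
}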