From patchwork Sat May 12 12:54:47 2018
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 27253
From: "H.J. Lu"
Date: Sat, 12 May 2018 05:54:47 -0700
Subject: Re: Optimized strlen() for x64 (SSE2 based strlen)
To: Trevor Herselman
Cc: GNU C Library

On Sat, May 12, 2018 at 4:40 AM, Trevor Herselman wrote:
> #include <emmintrin.h>
> #include <stddef.h>
> #include <stdint.h>
>
> size_t strlen_sse2( const char *str )
> {
>     const __m128i *ptr = (__m128i*)( (uintptr_t) str & ~0xf );
>     const __m128i zero = _mm_setzero_si128();
>
>     int mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
>     mask >>= ( (uintptr_t) str & 0xf );
>     if ( mask ) return __builtin_ctz( (unsigned int) mask );
>     ptr++;
>
>     for ( ; ; )
>     {
>         mask = _mm_movemask_epi8( _mm_cmpeq_epi8( _mm_load_si128( ptr ), zero ) );
>         if ( mask ) return (char *)ptr - str + __builtin_ctz( (unsigned int) mask );
>         ptr++;
>     }
>     __builtin_unreachable();
> }
>
>
> Written by Trevor Herselman
>
>
> Features:
>
> *) NO penalty for misaligned addresses (the `mask >>= unaligned bit
>    shift` takes care of it)
> *) NO alignment checks / jumps - ALL reads are already pre-aligned to
>    16 bytes
> *) The cost of the misalignment handling is amortized inside the first
>    16-byte read for all aligned/unaligned reads
> *) The function uses ONLY aligned reads (_mm_load_si128), NEVER a
>    single unaligned read, not even for 1 byte
> *) Plain C, so the compiler is free to make optimizations around the
>    function when inlined
> *) Small, tight code size: less than 1 CPU cache line (about 40 bytes)
>    more machine code than the default bit-twiddling implementation,
>    for a significant speed boost!
> *) Compatible with ALL x64 CPUs, because they are ALL required to
>    support at least SSE2
> *) NO misaligned read penalties for older hardware
> *) Faster than ALL other SSE2 ~ SSE4.2 implementations, especially for
>    smaller strings
> *) The size/cost/performance benefit starts at a string length of 2
>    bytes; meaning, if the average string length is 2 bytes or more,
>    then this implementation is faster than any other method!
> *) Benchmarked against GLIBC (bit-twiddle), Agner Fog's A_strlen() &
>    all other methods I could find
> *) Doesn't need to do any (4K) page boundary checks and will never
>    read beyond a page boundary, as long as the NULL terminator is
>    found
>
> The main benefit is that there are NO alignment checks! Any unaligned
> bits are bit-shifted off the first mask! So there is just a `shr`
> instead of `if (str & 0xF)`, `cmp`, `jmp`. If the address is already
> aligned, the bit-shift does nothing, because `shr 0` does nothing. But
> at least we've eliminated the alignment check!
>
> Note: I wasn't able to benchmark it against the current GLIBC AVX2 or
> SSE2 implementations, because I can't compile or test them. Besides, I
> looked at the code, and I believe they will be slower in general
> unless you are doing very large strings, because they read 4 XMM
> registers at a time, which is a huge performance cost if you don't
> need the extra reads/data! I believe shorter strings are more common!
> Also, those implementations have significantly more code bloat!

It is trivial to add a new implementation to the glibc framework.  This
patch adds __strlen_sse2_2 to glibc:

/* Support sysdeps/x86_64/multiarch/strnlen.c.  */

Here __strlen_sse2_2 is your implementation.  The glibc microbenchmark
shows:

                            simple_STRLEN   builtin_strlen  __strlen_avx2   __strlen_sse2_2 __strlen_sse2
Length    1, alignment  1:  32.875          48.1562         42.8438         40.5            39.9062
Length    1, alignment  0:  30.7812         47.5938         31.75           35.0625         34.9375
Length    2, alignment  2:  40.4062         44.0938         29.9688         29.9062         28.9375
Length    2, alignment  0:  36.6875         42.2812         29.25           32.5938         29.75
Length    3, alignment  3:  50.25           40.375          29.0625         29.0625         29.5312
Length    3, alignment  0:  46.8438         40.4688         29.1875         28.375          31.5625
Length    4, alignment  4:  56.2812         40.3438         28.0625         27.9375         31.1875
Length    4, alignment  0:  48.6875         40.3438         27.9062         27.75           31.625
Length    5, alignment  5:  69.1562         40.2812         33              31.2812         31.875
Length    5, alignment  0:  64.4688         40.1562         31.625          31.0312         28.0312
Length    6, alignment  6:  64.9062         40.6562         27.9062         27.875          31.3125
Length    6, alignment  0:  66.7188         40.4375         27.6875         28.0625         31.5312
Length    7, alignment  7:  66.8438         40.3438         28.0625         27.9688         28.1875
Length    7, alignment  0:  71.5312         40.8438         28.0625         28.25           30.5312
Length    4, alignment  0:  49              40.8438         28.0625         28              27.8125
Length    4, alignment  7:  49.625          40.5938         27.75           27.9062         27.9375
Length    4, alignment  2:  47.375          41              27.8438         30.9688         27.8438
Length    2, alignment  2:  38.3438         40.25           27.9062         27.6875         28.0625
Length    8, alignment  0:  77.2812         40.9688         27.8438         30.25           31.0938
Length    8, alignment  7:  76.1875         40.9375         28.0312         31.5625         29.25
Length    8, alignment  3:  63.2812         40.5            28.0625         28.3125         27.9062
Length    5, alignment  3:  68.7188         40.3438         27.9375         27.9688         30.3125
Length   16, alignment  0:  106.906         40.8125         27.75           51.6875         53.5312
Length   16, alignment  7:  98.3125         43.1562         28.0625         42.8438         53.5
Length   16, alignment  4:  98.3438         40.375          27.9062         44.625          50.9062
Length   10, alignment  4:  76.375          40.3125         28.0938         35              34.25
Length   32, alignment  0:  179.719         51.7812         43.5625         52.875          60.8125
Length   32, alignment  7:  171.094         48.375          40.1562         45.0625         47.0312
Length   32, alignment  5:  167.062         46.5938         38.5625         45.2812         48.0938
Length   21, alignment  5:  123.344         44.7812         29.5625         47.0938         47.5938
Length   64, alignment  0:  346.125         64.75           46.3125         79.5            79.625
Length   64, alignment  7:  338.562         51.2812         43.8438         60.2812         75.1562
Length   64, alignment  6:  338.969         52.5            42.5312         63.0625         74.4062
Length   42, alignment  6:  431             49.8438         36.6875         56.7188         53.5625
Length  128, alignment  0:  845.344         69.75           50.875          79.375          93.4688
Length  128, alignment  7:  853.75          58.5            49.25           75.7812         84.6562
Length  128, alignment  7:  851.344         56.3438         47.1875         75.8125         83.2812
Length   85, alignment  7:  633.25          61.1562         42.2812         67.0625         76.75
Length  256, alignment  0:  1563.38         102.844         91.7188         149.031         106.75
Length  256, alignment  7:  1578.09         88.75           76.625          124.625         101.656
Length  256, alignment  8:  1556.91         88.125          76.9688         133.125         101.531
Length  170, alignment  8:  1075.59         87.6875         71.5312         99.3125         90.5625
Length  512, alignment  0:  2971.5          127.875         120.031         233.094         145
Length  512, alignment  7:  2969.06         112.469         102.062         210.812         137.062
Length  512, alignment  9:  2964.56         109.562         103.594         207.062         132.875
Length  341, alignment  9:  2017            106.125         98.3125         158.969         119.938
Length 1024, alignment  0:  5782.5          179.844         170.375         383.219         237.562
Length 1024, alignment  7:  5785.72         163.406         152.594         382.219         220.688
Length 1024, alignment 10:  5772.62         162.875         152.062         376.812         218.969
Length  682, alignment 10:  3898.88         145.625         125.938         275.5           160.812
Length 2048, alignment  0:  11879.2         247.469         243.094         769.469         374.844
Length 2048, alignment  7:  10140.6         237.531         229.812         768.188         350.094
Length 2048, alignment 11:  10526.2         235.781         227.125         775.5           351.844
Length 1365, alignment 11:  6820.31         190.219         183.156         448.406         266.625
Length 4096, alignment  0:  20157.3         433.469         431.031         1399.66         679.875
Length 4096, alignment  7:  20502.1         428.844         420.438         1379.12         661.688
Length 4096, alignment 12:  12538.4         123.25          120.5           398.719         192.531
Length 2730, alignment 12:  3913.09         90.8125         87.1562         283.5           132.344

Your implementation is good, especially at shorter strings.  For
strings longer than 128 bytes, yours is slower.

> My name is Trevor Herselman, and I give the developers of GCC, GLIBC
> and anyone else permission to use this algorithm, MIT license. Please
> credit me where possible!
>
>
> Benchmarks: 10,000,000,000 strings, warm cache, Core i5 3rd gen,
> lower is better
>
> The `GCC/GLIBC bit-twiddling` technique listed below is the current
> GCC/GLIBC implementation with a 1-byte pre-alignment test, with 64-bit
> reads (8-byte long long).
>
> Aligned memory
>
> 0 byte length (empty string, only NULL terminator)
>
> Traditional (naive) 1-byte test: 5.35s
> GCC/GLIBC bit-twiddling: 14s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 1 byte length
>
> Traditional (naive) 1-byte test: 9s
> GCC/GLIBC bit-twiddling: 17s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 2 byte length
>
> Traditional (naive) 1-byte test: 15s
> bit-twiddling: 20s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 3 byte length
>
> Traditional (naive) 1-byte test: 19s
> GCC bit-twiddle: 23s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 7 byte length (+1 byte NULL = 8 bytes total)
>
> Traditional (naive) 1-byte test: 44s
> GCC bit-twiddle: 33.4s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 8 byte length (+1 byte NULL = 9 bytes total)
>
> Traditional (naive) 1-byte test: 46s
> GCC bit-twiddle: 19s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 15 byte length (+1 byte NULL = 16 bytes total)
>
> Traditional (naive) 1-byte test: 85s
> GCC bit-twiddle: 34s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 16 byte length (+1 byte NULL = 17 bytes total)
>
> Traditional (naive) 1-byte test: 85s
> GCC bit-twiddle: 27s
> A_strlen: 24s
> My SSE2 strlen: 13.3s
>
>
> 128 byte length (+1 byte NULL = 129 bytes total)
>
> Traditional (naive) 1-byte test: 718.1s
> GCC bit-twiddle: 171.3s
> A_strlen: 63s
> My SSE2 strlen: 45.5s
>
>
> Unaligned (1 byte offset)
>
> 0 bytes:
>
> Traditional (naive) 1-byte test: 8s
> GCC bit-twiddle: 9s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> NOTE: The GCC/GLIBC implementation is faster here because of the
> unaligned pre-test: it reads the first byte and ends immediately.
>
> 1 byte:
>
> Traditional (naive) 1-byte test: 13.2s
> GCC bit-twiddle: 15.9s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 2 bytes:
>
> Traditional (naive) 1-byte test: 17.4s
> GCC bit-twiddle: 19.3s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 3 bytes:
>
> Traditional (naive) 1-byte test: 25.6s
> GCC bit-twiddle: 24.7s
> A_strlen: 27.5s
> My SSE2 strlen: 10.5s
>
> 127 bytes:
>
> Traditional (naive) 1-byte test: 715s
> GCC bit-twiddle: 180s
> A_strlen: 65s
> My SSE2 strlen: 47s

diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index 68257c4017..3bb4fd5ec9 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -24,6 +24,7 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
 		   strchr-sse2 strchrnul-sse2 strchr-avx2 strchrnul-avx2 \
 		   strrchr-sse2 strrchr-avx2 \
 		   strlen-sse2 strnlen-sse2 strlen-avx2 strnlen-avx2 \
+		   strlen-sse2-2 \
 		   strcat-ssse3 strncat-ssse3\
 		   strcpy-sse2 stpcpy-sse2 \
 		   strcpy-ssse3 strncpy-ssse3 stpcpy-ssse3 stpncpy-ssse3 \
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index 7afd674b81..e5b0dfd510 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -186,6 +186,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, strlen,
 			      HAS_ARCH_FEATURE (AVX2_Usable),
 			      __strlen_avx2)
+	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2_2)
 	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2))
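
The new sysdeps/x86_64/multiarch/strlen-sse2-2.c file that the two hunks
above refer to is not shown in this message.  As a rough sketch only
(not the file from the actual patch), such a file could simply compile
the quoted C implementation under the internal __strlen_sse2_2 name that
ifunc-impl-list.c registers:

/* Hypothetical sysdeps/x86_64/multiarch/strlen-sse2-2.c -- a sketch, not
   the file from the actual patch: it builds the quoted C implementation
   under the internal __strlen_sse2_2 name registered above.  */

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t
__strlen_sse2_2 (const char *str)
{
  /* Round STR down to a 16-byte boundary so that every load is aligned.  */
  const __m128i *ptr = (const __m128i *) ((uintptr_t) str & ~(uintptr_t) 0xf);
  const __m128i zero = _mm_setzero_si128 ();

  /* Compare the first aligned 16 bytes against zero, then shift the bytes
     that precede STR out of the mask instead of branching on alignment.  */
  int mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (ptr), zero));
  mask >>= (uintptr_t) str & 0xf;
  if (mask != 0)
    return __builtin_ctz ((unsigned int) mask);

  /* Scan the remaining aligned 16-byte chunks until a zero byte appears.  */
  for (;;)
    {
      ++ptr;
      mask = _mm_movemask_epi8 (_mm_cmpeq_epi8 (_mm_load_si128 (ptr), zero));
      if (mask != 0)
        return (size_t) ((const char *) ptr - str)
               + __builtin_ctz ((unsigned int) mask);
    }
}

Registering the symbol in ifunc-impl-list.c is what lets the string
microbenchmark iterate over it; actually selecting it at run time would
also need a change to the strlen IFUNC selector, which is not part of
the quoted hunks.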
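
Since the discussion above centers on alignment handling, a small
standalone check (again only an illustration, not part of the patch;
strlen_sse2 is the function from the quoted message, assumed to be
compiled into the same program) can confirm that it agrees with the C
library strlen for every small length and starting alignment:

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The quoted implementation, assumed to be linked into this program.  */
size_t strlen_sse2 (const char *str);

int
main (void)
{
  /* 16-byte aligned buffer; offsets 0..63 exercise every alignment.  */
  static char buf[256] __attribute__ ((aligned (16)));

  for (size_t align = 0; align < 64; align++)
    for (size_t len = 0; len < 128; len++)
      {
        char *s = buf + align;
        memset (s, 'x', len);
        s[len] = '\0';
        assert (strlen_sse2 (s) == len);
        assert (strlen_sse2 (s) == strlen (s));
      }

  puts ("strlen_sse2 matches strlen for all tested lengths and alignments");
  return 0;
}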