From patchwork Mon May 25 01:58:26 2015
X-Patchwork-Submitter: Ondrej Bilka
X-Patchwork-Id: 6911
Date: Mon, 25 May 2015 03:58:26 +0200
From: Ondřej Bílka
To: libc-alpha@sourceware.org
Subject: [RFC PATCH] Add strcmp, strncmp, memcmp inline implementation.
Message-ID: <20150525015826.GA29445@domone>

Hi,

I just found that on x64 gcc's __builtin_strcmp sucks a lot. By a lot I
mean it is around three times slower than a libcall, because it uses
rep cmpsb, which even the Intel manual says shouldn't be used.

So I decided to write a strcmp inline that works. As reencoding it as a
gcc pass would be extra effort without any benefit, I skipped it.

This adds an x64-specific implementation for constant arguments of less
than 16 bytes. These are quite common, as programmers often write checks
like

if (strcmp (x, "foo"))

Extending this to 32 bytes and more would be straightforward but
wouldn't help much, as there aren't a lot of words that large.

The same trick can be used for strncmp and memcmp with size <= 16, with
a few extra checks, because the strcmp version exploits the alignment of
string literals.

It could be optimized more with cooperation from gcc.
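For readers who want to try the trick described above outside of glibc, here is a minimal standalone sketch. It is not the code from the patch: the helper name cmp_short, its signature, and the requirement that the literal sits in a readable 16-byte buffer are assumptions made for illustration. It assumes an x86-64 target with SSE2 enabled, 4096-byte pages, and GCC or a compatible compiler for __builtin_ctz.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch.  Three-way compare of S
   against LIT, where LIT points to a readable 16-byte buffer holding a
   NUL-terminated string of fewer than 16 characters and LEN is
   strlen (LIT).  */
static int
cmp_short (const char *s, const char *lit, size_t len)
{
  /* A 16-byte load from S must not run into the next page, which may be
     unmapped.  Fall back to the library call in that rare case.  */
  if ((uintptr_t) s % 4096 > 4096 - 16)
    return strcmp (s, lit);

  /* Note: this deliberately reads past the end of the string in S; it
     stays inside the page, but checkers may still report it.  */
  __m128i a = _mm_loadu_si128 ((const __m128i *) s);
  __m128i b = _mm_loadu_si128 ((const __m128i *) lit);

  /* Bit i of DIFF is set where byte i of the two vectors differs.  */
  unsigned int eq = (unsigned int) _mm_movemask_epi8 (_mm_cmpeq_epi8 (a, b));
  unsigned int diff = ~eq & 0xffffu;

  /* Force a stop at the terminating NUL of LIT, so bytes after it are
     never used for the result.  */
  diff |= 1u << len;

  int i = __builtin_ctz (diff);
  return (unsigned char) s[i] - (unsigned char) lit[i];
}
```

The forced bit at position len plays the role of the `| 1UL << n` term in the patch: without it, equal strings would leave the mask empty and the bit scan would be undefined.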
A page-cross check could be omitted in most cases using the dataflow
analysis that gcc already does for fortify_source. A CROSS_PAGE macro
could first check __builtin_valid_range_p (x, x + 16), which would
evaluate to true if gcc can prove that x is more than 16 bytes large.

A possible issue would be introducing sse with string.h. How do we
detect gcc's -mno-sse flag?

	* sysdeps/x86_64/bits/string.h: New file.

diff --git a/sysdeps/x86_64/bits/string.h b/sysdeps/x86_64/bits/string.h
index 5893676..c8c5d2d 100644
--- a/sysdeps/x86_64/bits/string.h
+++ b/sysdeps/x86_64/bits/string.h
@@ -14,6 +14,9 @@
 #ifdef _USE_GNU
 # if __GNUC_PREREQ (3, 2)
 # define _HAVE_STRING_ARCH_strcmp
+# define _HAVE_STRING_ARCH_strncmp
+# define _HAVE_STRING_ARCH_memcmp
+
 # include
 # include
 # define __LOAD(x) _mm_load_si128 ((__tp_vector *) (x))
@@ -23,17 +26,30 @@
 typedef __m128i __tp_vector;
 typedef uint64_t __tp_mask;
 
-static inline __attribute__ ((always_inline)) int
-__strcmp_c (char *s, char *c, int n)
+#define CROSS_PAGE(p) __builtin_expect (((uintptr_t) (p)) % 4096 \
+				 > 4096 - sizeof (__tp_vector) , 0)
+
+static inline __attribute__ ((always_inline))
+int
+__memcmp_small_a (char *s, char *c, int n)
 {
-  if (__builtin_expect (((uintptr_t) s) % 4096 > 4096 - sizeof (__tp_vector) , 0))
-    return strcmp (s, c);
-  __tp_mask m = get_mask (__EQ (__LOADU (s), __LOAD(c))) | 1UL << n;
+  if (CROSS_PAGE (s))
+    return memcmp (s, c, n);
+  __tp_mask m = get_mask (__EQ (__LOADU (s), __LOAD (c))) | 1UL << n;
   int found = __builtin_ctzl (m);
   return s[found] - c[found];
 }
-
-#define __strcmp_cs(s1, s2) -strcmp_c (s2, s1)
+static inline __attribute__ ((always_inline))
+int
+__memcmp_small (char *s, char *c, int n)
+{
+  if (CROSS_PAGE (s) || CROSS_PAGE (c))
+    return memcmp (s, c, n);
+  __tp_mask m = get_mask (__EQ (__LOADU (s), __LOADU (c))) | 1UL << n;
+  int found = __builtin_ctzl (m);
+  return s[found] - c[found];
+}
+#define __min(x,y) ((x) < (y) ? (x) : (y))
 
 /* Dereferencing a pointer arg to run sizeof on it fails for the void
    pointer case, so we use this instead.
@@ -43,12 +59,27 @@ __strcmp_c (char *s, char *c, int n)
 
 # define strcmp(s1, s2) \
-  (__extension__ \
-   (__builtin_constant_p (s1) && sizeof (s1) <= 16 \
-    ? __strcmp_c (s1, s2, sizeof (s1)) \
-    : (__builtin_constant_p (s2) && sizeof (s2) <= 16 \
-       ? __strcmp_cs (s1, s2, sizeof (s2)) \
+  (__extension__ \
+   (__builtin_constant_p (s1) && sizeof (s1) <= 16 \
+    ? __memcmp_small_a (s1, s2, sizeof (s1)) \
+    : (__builtin_constant_p (s2) && sizeof (s2) <= 16 \
+       ? - __memcmp_small_a (s2, s1, sizeof (s2)) \
        : strcmp (s1, s2))))
+
+# define strncmp(s1, s2, n) \
+  (__extension__ \
+   (__builtin_constant_p (s1) && sizeof (s1) <= 16 \
+    ? __memcmp_small_a (s1, s2, __min (n, sizeof (s1))) \
+    : (__builtin_constant_p (s2) && sizeof (s2) <= 16 \
+       ? - __memcmp_small_a (s2, s1, __min (n, sizeof (s2))) \
+       : strncmp (s1, s2, n))))
+
+# define memcmp(s1, s2, n) \
+  (__extension__ \
+   (__builtin_constant_p (n <= 16) && n <= 16 \
+    ? (n == 0 ? 0 : __memcmp_small (s1, s2, n - 1)) \
+    : memcmp (s1, s2, n)))
+
 
 # undef __string2_1bptr_p
 # endif
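
On the SSE detection question raised in the message: one possible approach, sketched here as an assumption rather than as what glibc ended up doing, is to test the predefined macro __SSE2__. GCC defines it when SSE2 code generation is enabled and leaves it undefined under -mno-sse or -mno-sse2, so a header can keep a scalar fallback for that case. The helper name first_mismatch16 is hypothetical and only serves to show the guard.

```c
#include <stddef.h>
#ifdef __SSE2__
# include <emmintrin.h>  /* only pulled in when SSE2 is enabled */
#endif

/* Hypothetical helper: index of the first differing byte among the 16
   bytes at A and B, or 16 if all of them match.  */
static int
first_mismatch16 (const unsigned char *a, const unsigned char *b)
{
#ifdef __SSE2__
  /* Vector path: one compare, one mask extraction, one bit scan.  */
  __m128i va = _mm_loadu_si128 ((const __m128i *) a);
  __m128i vb = _mm_loadu_si128 ((const __m128i *) b);
  unsigned int eq = (unsigned int) _mm_movemask_epi8 (_mm_cmpeq_epi8 (va, vb));
  unsigned int diff = ~eq & 0xffffu;
  return diff ? __builtin_ctz (diff) : 16;
#else
  /* Scalar path, used when the compiler was told not to emit SSE2.  */
  for (int i = 0; i < 16; i++)
    if (a[i] != b[i])
      return i;
  return 16;
#endif
}
```

Both paths return the same value, so callers do not need to know which one was compiled in.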