[RFC] Add strcmp, strncmp, memcmp inline implementation.

Message ID 20150525015826.GA29445@domone
State New, archived
Headers

Commit Message

Ondrej Bilka May 25, 2015, 1:58 a.m. UTC
  Hi,

I just found that on x64 gcc __builtin_strcmp suck a lot. And by lot I
mean its around three times slower than libcall by using rep cmpsb which
even intel manual says that shouldn't be used.

So I decided to write a strcmp inline that works. As reencoding it as
gcc pass would be extra effort without any benefit I skipped it.

This adds x64 specific implementation for constant arguments less than
16 bytes. These are quite common as programmer often does checks like 

if (strcmp (x, "foo"))

Extending this for 32 bytes and more would be straightforward but
wouldn't help much as there aren't lot of that large words.

As same trick could be used for strncmp and memcmp with size <= 16 
with few extra checks as we exploited alignment of string literals.

It could be optimized more with cooperation from gcc. A page-cross check
could be omitted in most cases using dataflow that gcc already does in
fortify_source. A CROSS_PAGE macro could first check for
__builtin_valid_range_p(x, x+16) which evaluates to true if gcc can
prove that x is more than 16 bytes large.


A possible issue would be introducing sse with string.h. How detect gcc
-no-sse flag?

	* sysdeps/x86_64/bits/string.h: New file.
  

Comments

Joseph Myers May 28, 2015, 4:23 p.m. UTC | #1
On Mon, 25 May 2015, Ondřej Bílka wrote:

> Hi,
> 
> I just found that on x64 gcc __builtin_strcmp suck a lot. And by lot I
> mean its around three times slower than libcall by using rep cmpsb which
> even intel manual says that shouldn't be used.

GCC bug report number?  It's not helpful to say "suck a lot" without 
reporting the issues to the other project (identifying the compiler 
options etc. in use).  We need to cooperate appropriately with other free 
software projects in developing glibc.

> So I decided to write a strcmp inline that works. As reencoding it as
> gcc pass would be extra effort without any benefit I skipped it.

There is obvious benefit to compiler implementations e.g. for calls to 
strcmp in kernel space.

> 	* sysdeps/x86_64/bits/string.h: New file.

No installed headers should ever be in x86_64 or i386 sysdeps directories.  
The installed headers should always come from sysdeps/x86 directories and 
be usable for both 32-bit and 64-bit compilations by containing 
appropriate conditionals on e.g. __x86_64__.  Don't regress on HJ's 
implementation of this for 2.16.
  
Alexander Monakov May 28, 2015, 4:41 p.m. UTC | #2
On Thu, 28 May 2015, Joseph Myers wrote:
> On Mon, 25 May 2015, Ondřej Bílka wrote:
> 
> > Hi,
> > 
> > I just found that on x64 gcc __builtin_strcmp suck a lot. And by lot I
> > mean its around three times slower than libcall by using rep cmpsb which
> > even intel manual says that shouldn't be used.
> 
> GCC bug report number?  It's not helpful to say "suck a lot" without 
> reporting the issues to the other project (identifying the compiler 
> options etc. in use).  We need to cooperate appropriately with other free 
> software projects in developing glibc.

I think this report should serve well:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Alexander
  
Joseph Myers May 28, 2015, 9:06 p.m. UTC | #3
On Thu, 28 May 2015, Alexander Monakov wrote:

> On Thu, 28 May 2015, Joseph Myers wrote:
> > On Mon, 25 May 2015, Ondřej Bílka wrote:
> > 
> > > Hi,
> > > 
> > > I just found that on x64 gcc __builtin_strcmp suck a lot. And by lot I
> > > mean its around three times slower than libcall by using rep cmpsb which
> > > even intel manual says that shouldn't be used.
> > 
> > GCC bug report number?  It's not helpful to say "suck a lot" without 
> > reporting the issues to the other project (identifying the compiler 
> > options etc. in use).  We need to cooperate appropriately with other free 
> > software projects in developing glibc.
> 
> I think this report should serve well:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Thanks, that's a report for one specific issue with memcmp.  Each other 
issue with each function should also have bugs filed (with a meta-bug 
depending on them all).
  
Ondrej Bilka May 28, 2015, 10:19 p.m. UTC | #4
On Thu, May 28, 2015 at 09:06:19PM +0000, Joseph Myers wrote:
> On Thu, 28 May 2015, Alexander Monakov wrote:
> 
> > On Thu, 28 May 2015, Joseph Myers wrote:
> > > On Mon, 25 May 2015, Ondřej Bílka wrote:
> > > 
> > > > Hi,
> > > > 
> > > > I just found that on x64 gcc __builtin_strcmp suck a lot. And by lot I
> > > > mean its around three times slower than libcall by using rep cmpsb which
> > > > even intel manual says that shouldn't be used.
> > > 
> > > GCC bug report number?  It's not helpful to say "suck a lot" without 
> > > reporting the issues to the other project (identifying the compiler 
> > > options etc. in use).  We need to cooperate appropriately with other free 
> > > software projects in developing glibc.
> > 
> > I think this report should serve well:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
> 
> Thanks, that's a report for one specific issue with memcmp.  Each other 
> issue with each function should also have bugs filed (with a meta-bug 
> depending on them all).
> 
strcmp/strncmp would be duplicates of that bug. Its result of correct
transformantion strcmp(s, c) -> memcmp (s, c, strlen(c) + 1) when c is
string literal.
  

Patch

diff --git a/sysdeps/x86_64/bits/string.h b/sysdeps/x86_64/bits/string.h
index 5893676..c8c5d2d 100644
--- a/sysdeps/x86_64/bits/string.h
+++ b/sysdeps/x86_64/bits/string.h
@@ -14,6 +14,9 @@ 
 #ifdef _USE_GNU
 # if __GNUC_PREREQ (3, 2)
 #  define _HAVE_STRING_ARCH_strcmp
+#  define _HAVE_STRING_ARCH_strncmp
+#  define _HAVE_STRING_ARCH_memcmp
+
 #  include <stdint.h>
 #  include <emmintrin.h>
 #  define __LOAD(x) _mm_load_si128 ((__tp_vector *) (x))
@@ -23,17 +26,30 @@ 
 typedef __m128i __tp_vector;
 typedef uint64_t __tp_mask;
 
-static inline __attribute__ ((always_inline)) int 
-__strcmp_c (char *s, char *c, int n)
+#define CROSS_PAGE(p) __builtin_expect (((uintptr_t) s) % 4096		      \
+					> 4096 - sizeof (__tp_vector) , 0)
+
+static inline __attribute__ ((always_inline)) 
+int
+__memcmp_small_a (char *s, char *c, int n)
 {
-  if (__builtin_expect (((uintptr_t) s) % 4096 > 4096 - sizeof (__tp_vector) , 0))
-    return strcmp (s, c);
-  __tp_mask m = get_mask (__EQ (__LOADU (s), __LOAD(c))) | 1UL << n;
+  if (CROSS_PAGE (s))
+    return memcmp (s, c, n);
+  __tp_mask m = get_mask (__EQ (__LOADU (s), __LOAD (c))) | 1UL << n;
   int found = __builtin_ctzl (m);
   return s[found] - c[found];
 }
-
-#define __strcmp_cs(s1, s2) -strcmp_c (s2, s1)
+static inline __attribute__ ((always_inline)) 
+int
+__memcmp_small (char *s, char *c, int n)
+{
+  if (CROSS_PAGE (s) || CROSS_PAGE (c))
+    return memcmp (s, c, n);
+  __tp_mask m = get_mask (__EQ (__LOADU (s), __LOADU (c))) | 1UL << n;
+  int found = __builtin_ctzl (m);
+  return s[found] - c[found];
+}
+#define __min(x,y) (x < y ? x : y)
 
 /* Dereferencing a pointer arg to run sizeof on it fails for the void
    pointer case, so we use this instead.
@@ -43,12 +59,27 @@  __strcmp_c (char *s, char *c, int n)
 
 
 #  define strcmp(s1, s2) \
-   (__extension__					\
-   (__builtin_constant_p (s1) && sizeof (s1) <= 16	\
-    ?  __strcmp_c (s1, s2, sizeof (s1))			\
-    : (__builtin_constant_p (s2) && sizeof (s2) <= 16   \
-       ? __strcmp_cs (s1, s2, sizeof (s2))		\
+   (__extension__						\
+   (__builtin_constant_p (s1) && sizeof (s1) <= 16		\
+    ?  __memcmp_small_a (s1, s2, sizeof (s1))			\
+    : (__builtin_constant_p (s2) && sizeof (s2) <= 16   	\
+       ? - __memcmp_small_a (s2, s1, sizeof (s2))		\
        : strcmp (s1, s2))))
+
+#  define strncmp(s1, s2, n) \
+   (__extension__						\
+   (__builtin_constant_p (s1) && sizeof (s1) <= 16		\
+    ?  __memcmp_small_a (s1, s2, min (n, sizeof (s1)))		\
+    : (__builtin_constant_p (s2) && sizeof (s2) <= 16   	\
+       ? - __memcmp_small_a (s2, s1, min (n, sizeof (s2)))	\
+       : strncmp (s1, s2, n))))
+
+#  define memcmp(s1, s2, n) \
+   (__extension__						\
+   (__builtin_constant_p (n <= 16) && n <= 16			\
+    ?  n == 0 ? 0 : __memcmp_small (s1, s2, n - 1))		\
+       : memcmp (s1, s2, n))
+
   
 # undef __string2_1bptr_p
 # endif