Improve generic strstr performance.

Message ID 20150512205311.GA24057@domone
State New, archived
Headers

Commit Message

Ondrej Bilka May 12, 2015, 8:53 p.m. UTC
  Hi,

As I mentioned before that generic strstr is not well optimized
this improves hot path. This exploits fact that naive algorithm with
vectorized strchr is hard to beat.

Main weakness of current algorithm is that it spends lot of time in
critical factorization which is useless as haystack is too short.

We look for leading triple which rarely produces mismatch 
mismatch switch to slow two-way algorithm. 

Performance improvement is function of entropy of string and how good is
underlying strchr. On x64 this is around five times faster when each
character has probability 1/128.

A simple benchmark that I used is send separately.

OK to commit?

	* string/strstr.c (STRSTR): Optimize hot path.
  

Comments

Ondrej Bilka May 12, 2015, 9:30 p.m. UTC | #1
On Tue, May 12, 2015 at 02:07:10PM -0700, Roland McGrath wrote:
> > +  if (!needle[1])
> 
> No implicit Boolean coercion:
> 	if (needle[1] == '\0')
> 
> > +          if (needle[2] == '\000')
> 
> Just write '\0' when you are talking about the terminator.  Use '\000' only
> when it's in a context where literal octal values make real sense.
> 
> (I'm not reviewing the substance of your change.)

Ok, will change that before I would commit it.
  

Patch

diff --git a/string/strstr.c b/string/strstr.c
index 045e878..45fb423 100644
--- a/string/strstr.c
+++ b/string/strstr.c
@@ -57,23 +57,62 @@  STRSTR (const char *haystack_start, const char *needle_start)
   size_t haystack_len; /* Known minimum length of HAYSTACK.  */
   bool ok = true; /* True if NEEDLE is prefix of HAYSTACK.  */
 
-  /* Determine length of NEEDLE, and in the process, make sure
-     HAYSTACK is at least as long (no point processing all of a long
-     NEEDLE if HAYSTACK is too short).  */
-  while (*haystack && *needle)
-    ok &= *haystack++ == *needle++;
-  if (*needle)
-    return NULL;
-  if (ok)
-    return (char *) haystack_start;
-
-  /* Reduce the size of haystack using strchr, since it has a smaller
-     linear coefficient than the Two-Way algorithm.  */
-  needle_len = needle - needle_start;
-  haystack = strchr (haystack_start + 1, *needle_start);
-  if (!haystack || __builtin_expect (needle_len == 1, 0))
+  if (!needle[0])
     return (char *) haystack;
-  needle -= needle_len;
+
+  if (!needle[1])
+    return strchr (haystack, needle[0]);
+
+  /*
+     This optimizes strstr hot path.
+
+     Main performance problem is that haystacks and needle are short.
+     You spend 99% of time on sub64byte haystacks.
+     That makes preprocessing very expensive. 
+     Without this around 90% of time would be spend at calculating critical
+     factorization that would never be used.
+
+     This loop should be fast due to optimized strchr implementation. 
+     It would likely traverse haystack as character triplet occurences are
+     rare for practical strings.
+   */
+
+  while ((haystack = strchr (haystack, needle[0])))
+    {
+      if (haystack[1] == needle[1] && (haystack[2] == needle[2] || needle[2] == '\000'))
+        {
+
+          if (needle[2] == '\000')
+            return (char *) haystack;
+
+	  /* Determine length of NEEDLE, and in the process, make sure
+	     HAYSTACK is at least as long (no point processing all of a long
+	     NEEDLE if HAYSTACK is too short).  */
+
+          needle += 2;
+          haystack += 2;
+
+	  while (*haystack && *needle)
+	    ok &= *haystack++ == *needle++;
+	  if (*needle)
+	    return NULL;
+
+	  needle_len = needle - needle_start;
+
+	  if (ok)
+	    return (char *) haystack - needle_len;
+
+          goto cold_path;
+        }
+      else
+        haystack++;
+    }
+
+  return NULL;
+
+  cold_path:;
+
+  needle = needle_start;
   haystack_len = (haystack > haystack_start + needle_len ? 1
 		  : needle_len + haystack_start - haystack);