From patchwork Tue May 12 20:53:11 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ondrej Bilka <neleai@seznam.cz>
X-Patchwork-Id: 6680
Received: (qmail 127048 invoked by alias); 12 May 2015 20:53:28 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 127039 invoked by uid 89); 12 May 2015 20:53:27 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL, BAYES_00,
	FREEMAIL_FROM, SPF_NEUTRAL autolearn=no version=3.3.2
X-HELO: popelka.ms.mff.cuni.cz
Date: Tue, 12 May 2015 22:53:11 +0200
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: libc-alpha@sourceware.org
Subject: [PATCH] Improve generic strstr performance.
Message-ID: <20150512205311.GA24057@domone>
MIME-Version: 1.0
Content-Disposition: inline
User-Agent: Mutt/1.5.20 (2009-06-14)

Hi,

As I mentioned before that generic strstr is not well optimized
this improves hot path. This exploits fact that naive algorithm with
vectorized strchr is hard to beat.

Main weakness of current algorithm is that it spends lot of time in
critical factorization which is useless as haystack is too short.

We look for leading triple which rarely produces mismatch 
mismatch switch to slow two-way algorithm. 

Performance improvement is function of entropy of string and how good is
underlying strchr. On x64 this is around five times faster when each
character has probability 1/128.

A simple benchmark that I used is send separately.

OK to commit?

	* string/strstr.c (STRSTR): Optimize hot path.

diff --git a/string/strstr.c b/string/strstr.c
index 045e878..45fb423 100644
--- a/string/strstr.c
+++ b/string/strstr.c
@@ -57,23 +57,62 @@ STRSTR (const char *haystack_start, const char *needle_start)
   size_t haystack_len; /* Known minimum length of HAYSTACK.  */
   bool ok = true; /* True if NEEDLE is prefix of HAYSTACK.  */
 
-  /* Determine length of NEEDLE, and in the process, make sure
-     HAYSTACK is at least as long (no point processing all of a long
-     NEEDLE if HAYSTACK is too short).  */
-  while (*haystack && *needle)
-    ok &= *haystack++ == *needle++;
-  if (*needle)
-    return NULL;
-  if (ok)
-    return (char *) haystack_start;
-
-  /* Reduce the size of haystack using strchr, since it has a smaller
-     linear coefficient than the Two-Way algorithm.  */
-  needle_len = needle - needle_start;
-  haystack = strchr (haystack_start + 1, *needle_start);
-  if (!haystack || __builtin_expect (needle_len == 1, 0))
+  if (!needle[0])
     return (char *) haystack;
-  needle -= needle_len;
+
+  if (!needle[1])
+    return strchr (haystack, needle[0]);
+
+  /*
+     This optimizes strstr hot path.
+
+     Main performance problem is that haystacks and needle are short.
+     You spend 99% of time on sub64byte haystacks.
+     That makes preprocessing very expensive. 
+     Without this around 90% of time would be spend at calculating critical
+     factorization that would never be used.
+
+     This loop should be fast due to optimized strchr implementation. 
+     It would likely traverse haystack as character triplet occurences are
+     rare for practical strings.
+   */
+
+  while ((haystack = strchr (haystack, needle[0])))
+    {
+      if (haystack[1] == needle[1] && (haystack[2] == needle[2] || needle[2] == '\000'))
+        {
+
+          if (needle[2] == '\000')
+            return (char *) haystack;
+
+	  /* Determine length of NEEDLE, and in the process, make sure
+	     HAYSTACK is at least as long (no point processing all of a long
+	     NEEDLE if HAYSTACK is too short).  */
+
+          needle += 2;
+          haystack += 2;
+
+	  while (*haystack && *needle)
+	    ok &= *haystack++ == *needle++;
+	  if (*needle)
+	    return NULL;
+
+	  needle_len = needle - needle_start;
+
+	  if (ok)
+	    return (char *) haystack - needle_len;
+
+          goto cold_path;
+        }
+      else
+        haystack++;
+    }
+
+  return NULL;
+
+  cold_path:;
+
+  needle = needle_start;
   haystack_len = (haystack > haystack_start + needle_len ? 1
 		  : needle_len + haystack_start - haystack);