Improve strstr performance of short needles

Message ID DB5PR08MB1030D79438AF0E17282F3E1783020@DB5PR08MB1030.eurprd08.prod.outlook.com
State New, archived
Headers

Commit Message

Wilco Dijkstra Sept. 5, 2018, 1:38 p.m. UTC
  Improve strstr performance for the common case of short needles.  For 2-4 
characters a small loop is typically fastest.  On large strings the speedup
with a needle size of 4 is about 65%.

Passes GLIBC regression tests. OK for commit?

ChangeLog:
2018-09-05  Wilco Dijkstra  <wdijkstr@arm.com>

	* string/strstr.c (strstr2): Add new function.
	(strstr3): Likewise.
	(strstr3): Likewise.
	(STRSTR): Add special cases for short needles.

--
  

Comments

Gabriel F. T. Gomes Sept. 28, 2018, 9:30 p.m. UTC | #1
On Wed, 05 Sep 2018, Wilco Dijkstra wrote:

>Improve strstr performance for the common case of short needles.  For 2-4 
>characters a small loop is typically fastest.  On large strings the speedup
>with a needle size of 4 is about 65%.

I ran some tests on a powerpc64le machine and I noticed that, as the
haystack becomes larger, the search for a needle that is *not* in the
haystack, became slower with this change.

For instance, when the haystack is 32 chars long, searching for a 4 chars
long needle that is not in the haystack [1] took ~11% to ~22% longer to
finish, when compared to the pristine code.  When the needle *is* in the
haystack, that did not happen and the performance was usually 30% better.

In my tests, this is also true for needles 2 and 3 chars long.


Have you observed similar results in your tests?


[1] Tests with the summary similar to this:
      Length   32/4, alignment 0/ 0, fail:
  
Wilco Dijkstra Sept. 29, 2018, 12:21 a.m. UTC | #2
Gabriel F. T. Gomes wrote:
> On Wed, 05 Sep 2018, Wilco Dijkstra wrote:
>
>>Improve strstr performance for the common case of short needles.  For 2-4 
>>characters a small loop is typically fastest.  On large strings the speedup
>>with a needle size of 4 is about 65%.
>
> I ran some tests on a powerpc64le machine and I noticed that, as the
> haystack becomes larger, the search for a needle that is *not* in the
> haystack, became slower with this change.

That seems odd, particularly since the Two-Way algorithm is pretty slow
for small needles (large preprocessing overhead and small skips).

> For instance, when the haystack is 32 chars long, searching for a 4 chars
> long needle that is not in the haystack [1] took ~11% to ~22% longer to
> finish, when compared to the pristine code.  When the needle *is* in the
> haystack, that did not happen and the performance was usually 30% better.
>
> Have you observed similar results in your tests?

I haven't looked in detail at the GLIBC strstr benchmark results. I'm running
tests with natural language texts from 64 bytes to 256KB. Generally speaking
the short needle Two-Way algorithm is terribly slow on both short and long
haystacks. Short needles make it worse since it can't skip much, so there isn't
any point in keeping it (I posted a much faster replacement in newlib).

Wilco
  
Siddhesh Poyarekar Sept. 29, 2018, 1 a.m. UTC | #3
On 29/09/18 5:51 AM, Wilco Dijkstra wrote:
> I haven't looked in detail at the GLIBC strstr benchmark results. I'm running
> tests with natural language texts from 64 bytes to 256KB. Generally speaking

Would you be able to add a test that emulates that workload to glibc 
benchtests?

Thanks,
Siddhesh
  
Gabriel F. T. Gomes Oct. 1, 2018, 8:46 p.m. UTC | #4
On Mon, 01 Oct 2018, Wilco Dijkstra wrote:

>>> I ran some tests on a powerpc64le machine and I noticed that, as the
>>> haystack becomes larger, the search for a needle that is *not* in the
>>> haystack, became slower with this change.  
>>
>> That seems odd, particularly since the Two-Way algorithm is pretty slow
>> for small needles (large preprocessing overhead and small skips).  
>
>I looked at bench-strstr again. There are cases where Two-Way wins by
>10-15% when the string is found but for short strings the linear loop wins.
>On average I can't see any performance difference.

I ran bench tests with your patch on an x86 machine, and I noticed the
same behaviour that I described previously, however on fewer occasions
(never when needle is 2-chars long, and less frequently for 3- and 4-chars
long needles).

Both on powerpc64le and x86, the overall performance gain for bench tests
looks positive, so I don't actually see a compelling reason to not commit
this patch.  The intention of my question was to know if you have noticed
something similar in your tests, and I'm glad to hear that you have used a
different benchmark.

>However when iterating through 16K needles randomly chosen from natural
>text I get the following speedups for different haystack and needle sizes:
>
>    H64  H256 H4096
>N2: 64%  73%  46%
>N3: 74%  66%  37%
>N4: 94%  92%  66%

Could you add this information (data and explanation) to the commit
message?  It would be good to be able to read it some years in the future
and know that a different benchmark (not bench tests) was used to produce
these numbers. 

>As you can see the speedup is larger for small needles since the preprocessing
>overhead of the Two-Way algorithm is significant. This is completely hidden by
>the GLIBC test due to using just 1 needle.

Agreed.

Reviewed-by: Gabriel F. T. Gomes <gabriel@inconstante.eti.br>
  
Siddhesh Poyarekar Oct. 2, 2018, 3:34 a.m. UTC | #5
On 01/10/18 10:35 PM, Wilco Dijkstra wrote:
> However when iterating through 16K needles randomly chosen from natural
> text I get the following speedups for different haystack and needle sizes:
> 
>      H64  H256 H4096
> N2: 64%  73%  46%
> N3: 74%  66%  37%
> N4: 94%  92%  66%
> 
> As you can see the speedup is larger for small needles since the preprocessing
> overhead of the Two-Way algorithm is significant. This is completely hidden by
> the GLIBC test due to using just 1 needle.

Can you please add this test to benchtests?

Thanks,
Siddhesh
  

Patch

diff --git a/string/strstr.c b/string/strstr.c
index 33acdc5442d3e9031298a56e4f52a19e2ffa7d84..dd8663494ef391f34f7ed0a961d6c0e399bfdda9 100644
--- a/string/strstr.c
+++ b/string/strstr.c
@@ -46,6 +46,50 @@ 
 #define STRSTR strstr
 #endif
 
+
+static inline char *
+strstr2 (const char *hs, const char *ne)
+{
+  uint32_t h1 = (ne[0] << 16) | ne[1];
+  uint32_t h2 = 0;
+  int c = hs[0];
+  while (h1 != h2 && c != 0)
+    {
+      h2 = (h2 << 16) | c;
+      c = *++hs;
+    }
+  return h1 == h2 ? (char *)hs - 2 : NULL;
+}
+
+static inline char *
+strstr3 (const char *hs, const char *ne)
+{
+  uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8);
+  uint32_t h2 = 0;
+  int c = hs[0];
+  while (h1 != h2 && c != 0)
+    {
+      h2 = (h2 | c) << 8;
+      c = *++hs;
+    }
+  return h1 == h2 ? (char *)hs - 3 : NULL;
+}
+
+static inline char *
+strstr4 (const char *hs, const char *ne)
+{
+  uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8) | ne[3];
+  uint32_t h2 = 0;
+  int c = hs[0];
+  while (h1 != h2 && c != 0)
+    {
+      h2 = (h2 << 8) | c;
+      c = *++hs;
+    }
+  return h1 == h2 ? (char *)hs - 4 : NULL;
+}
+
+
 /* Return the first occurrence of NEEDLE in HAYSTACK.  Return HAYSTACK
    if NEEDLE is empty, otherwise NULL if NEEDLE is not found in
    HAYSTACK.  */
@@ -64,6 +108,13 @@  STRSTR (const char *haystack, const char *needle)
   if (haystack == NULL || needle[1] == '\0')
     return (char *) haystack;
 
+  if (needle[2] == '\0')
+    return strstr2 (haystack, needle);
+  if (needle[3] == '\0')
+    return strstr3 (haystack, needle);
+  if (needle[4] == '\0')
+    return strstr4 (haystack, needle);
+
   /* Ensure HAYSTACK length is at least as long as NEEDLE length.
      Since a match may occur early on in a huge HAYSTACK, use strnlen
      and read ahead a few cachelines for improved performance.  */