Improve strstr performance of short needles
Commit Message
Improve strstr performance for the common case of short needles. For 2-4
characters a small loop is typically fastest. On large strings the speedup
with a needle size of 4 is about 65%.
Passes GLIBC regression tests. OK for commit?
ChangeLog:
2018-09-05 Wilco Dijkstra <wdijkstr@arm.com>
* string/strstr.c (strstr2): Add new function.
(strstr3): Likewise.
(strstr3): Likewise.
(STRSTR): Add special cases for short needles.
--
Comments
On Wed, 05 Sep 2018, Wilco Dijkstra wrote:
>Improve strstr performance for the common case of short needles. For 2-4
>characters a small loop is typically fastest. On large strings the speedup
>with a needle size of 4 is about 65%.
I ran some tests on a powerpc64le machine and I noticed that, as the
haystack becomes larger, the search for a needle that is *not* in the
haystack, became slower with this change.
For instance, when the haystack is 32 chars long, searching for a 4 chars
long needle that is not in the haystack [1] took ~11% to ~22% longer to
finish, when compared to the pristine code. When the needle *is* in the
haystack, that did not happen and the performance was usually 30% better.
In my tests, this is also true for needles 2 and 3 chars long.
Have you observed similar results in your tests?
[1] Tests with the summary similar to this:
Length 32/4, alignment 0/ 0, fail:
Gabriel F. T. Gomes wrote:
> On Wed, 05 Sep 2018, Wilco Dijkstra wrote:
>
>>Improve strstr performance for the common case of short needles. For 2-4
>>characters a small loop is typically fastest. On large strings the speedup
>>with a needle size of 4 is about 65%.
>
> I ran some tests on a powerpc64le machine and I noticed that, as the
> haystack becomes larger, the search for a needle that is *not* in the
> haystack, became slower with this change.
That seems odd, particularly since the Two-Way algorithm is pretty slow
for small needles (large preprocessing overhead and small skips).
> For instance, when the haystack is 32 chars long, searching for a 4 chars
> long needle that is not in the haystack [1] took ~11% to ~22% longer to
> finish, when compared to the pristine code. When the needle *is* in the
> haystack, that did not happen and the performance was usually 30% better.
>
> Have you observed similar results in your tests?
I haven't looked in detail at the GLIBC strstr benchmark results. I'm running
tests with natural language texts from 64 bytes to 256KB. Generally speaking
the short needle Two-Way algorithm is terribly slow on both short and long
haystacks. Short needles make it worse since it can't skip much, so there isn't
any point in keeping it (I posted a much faster replacement in newlib).
Wilco
On 29/09/18 5:51 AM, Wilco Dijkstra wrote:
> I haven't looked in detail at the GLIBC strstr benchmark results. I'm running
> tests with natural language texts from 64 bytes to 256KB. Generally speaking
Would you be able to add a test that emulates that workload to glibc
benchtests?
Thanks,
Siddhesh
On Mon, 01 Oct 2018, Wilco Dijkstra wrote:
>>> I ran some tests on a powerpc64le machine and I noticed that, as the
>>> haystack becomes larger, the search for a needle that is *not* in the
>>> haystack, became slower with this change.
>>
>> That seems odd, particularly since the Two-Way algorithm is pretty slow
>> for small needles (large preprocessing overhead and small skips).
>
>I looked at bench-strstr again. There are cases where Two-Way wins by
>10-15% when the string is found but for short strings the linear loop wins.
>On average I can't see any performance difference.
I ran bench tests with your patch on an x86 machine, and I noticed the
same behaviour that I described previously, however on fewer occasions
(never when needle is 2-chars long, and less frequently for 3- and 4-chars
long needles).
Both on powerpc64le and x86, the overall performance gain for bench tests
looks positive, so I don't actually see a compelling reason to not commit
this patch. The intention of my question was to know if you have noticed
something similar in your tests, and I'm glad to hear that you have used a
different benchmark.
>However when iterating through 16K needles randomly chosen from natural
>text I get the following speedups for different haystack and needle sizes:
>
> H64 H256 H4096
>N2: 64% 73% 46%
>N3: 74% 66% 37%
>N4: 94% 92% 66%
Could you add this information (data and explanation) to the commit
message? It would be good to be able to read it some years in the future
and know that a different benchmark (not bench tests) was used to produce
these numbers.
>As you can see the speedup is larger for small needles since the preprocessing
>overhead of the Two-Way algorithm is significant. This is completely hidden by
>the GLIBC test due to using just 1 needle.
Agreed.
Reviewed-by: Gabriel F. T. Gomes <gabriel@inconstante.eti.br>
On 01/10/18 10:35 PM, Wilco Dijkstra wrote:
> However when iterating through 16K needles randomly chosen from natural
> text I get the following speedups for different haystack and needle sizes:
>
> H64 H256 H4096
> N2: 64% 73% 46%
> N3: 74% 66% 37%
> N4: 94% 92% 66%
>
> As you can see the speedup is larger for small needles since the preprocessing
> overhead of the Two-Way algorithm is significant. This is completely hidden by
> the GLIBC test due to using just 1 needle.
Can you please add this test to benchtests?
Thanks,
Siddhesh
@@ -46,6 +46,50 @@
#define STRSTR strstr
#endif
+
+static inline char *
+strstr2 (const char *hs, const char *ne)
+{
+ uint32_t h1 = (ne[0] << 16) | ne[1];
+ uint32_t h2 = 0;
+ int c = hs[0];
+ while (h1 != h2 && c != 0)
+ {
+ h2 = (h2 << 16) | c;
+ c = *++hs;
+ }
+ return h1 == h2 ? (char *)hs - 2 : NULL;
+}
+
+static inline char *
+strstr3 (const char *hs, const char *ne)
+{
+ uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8);
+ uint32_t h2 = 0;
+ int c = hs[0];
+ while (h1 != h2 && c != 0)
+ {
+ h2 = (h2 | c) << 8;
+ c = *++hs;
+ }
+ return h1 == h2 ? (char *)hs - 3 : NULL;
+}
+
+static inline char *
+strstr4 (const char *hs, const char *ne)
+{
+ uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8) | ne[3];
+ uint32_t h2 = 0;
+ int c = hs[0];
+ while (h1 != h2 && c != 0)
+ {
+ h2 = (h2 << 8) | c;
+ c = *++hs;
+ }
+ return h1 == h2 ? (char *)hs - 4 : NULL;
+}
+
+
/* Return the first occurrence of NEEDLE in HAYSTACK. Return HAYSTACK
if NEEDLE is empty, otherwise NULL if NEEDLE is not found in
HAYSTACK. */
@@ -64,6 +108,13 @@ STRSTR (const char *haystack, const char *needle)
if (haystack == NULL || needle[1] == '\0')
return (char *) haystack;
+ if (needle[2] == '\0')
+ return strstr2 (haystack, needle);
+ if (needle[3] == '\0')
+ return strstr3 (haystack, needle);
+ if (needle[4] == '\0')
+ return strstr4 (haystack, needle);
+
/* Ensure HAYSTACK length is at least as long as NEEDLE length.
Since a match may occur early on in a huge HAYSTACK, use strnlen
and read ahead a few cachelines for improved performance. */