Use Quicksearch in strstr

Message ID DB5PR08MB1030E559B3818F52CA48A63583F30@DB5PR08MB1030.eurprd08.prod.outlook.com
State New, archived

Commit Message

Wilco Dijkstra Oct. 29, 2018, 3:55 p.m. UTC
  This patch significantly improves performance of strstr by using Sunday's
Quick-Search algorithm.  Due to its simplicity it has the best average
performance of string matching algorithms on almost all inputs.  It uses a
bad-character shift table to skip past mismatches.

The needle length is limited to 254 - combined with a hashed shift table
this reduces the shift table memory 16 to 32 times, lowering preprocessing
overhead and minimizing cache effects.  The needle limit also implies
worst-case performance remains linear.

Small 1-4 byte needles use special case code which is typically faster.
Very long needles continue to use the linear-time Two-Way algorithm.
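
As an illustration (separate from the actual patch below), a minimal self-contained
sketch of Sunday's Quick-Search with a hashed bad-character table looks like this:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch of Quick-Search with a 64-entry hashed bad-character table.
   Two characters may share a slot; sharing can only make shifts smaller,
   never larger, so no match is ever skipped.  The needle length must stay
   below 255 so that ne_len + 1 fits in a uint8_t.  */
static const char *
quicksearch (const char *haystack, const char *needle)
{
  const unsigned char *hs = (const unsigned char *) haystack;
  const unsigned char *ne = (const unsigned char *) needle;
  size_t ne_len = strlen (needle);
  size_t hs_len = strlen (haystack);

  if (ne_len == 0)
    return haystack;
  if (ne_len > hs_len || ne_len >= 255)
    return NULL;   /* Sketch only - the real code falls back to Two-Way.  */

  /* shift[c] = how far to advance when the byte just past the current
     window hashes to c; ne_len + 1 if no needle byte hashes to c.  */
  uint8_t shift[64];
  memset (shift, ne_len + 1, sizeof (shift));
  for (size_t i = 0; i < ne_len; i++)
    shift[ne[i] % sizeof (shift)] = ne_len - i;

  for (size_t i = 0; i <= hs_len - ne_len; )
    {
      if (memcmp (hs + i, ne, ne_len) == 0)
        return haystack + i;
      /* hs[i + ne_len] is at worst the terminating NUL, so always readable.  */
      i += shift[hs[i + ne_len] % sizeof (shift)];
    }
  return NULL;
}

int
main (void)
{
  printf ("%s\n", quicksearch ("needle in a haystack", "hay"));
  return 0;
}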

The performance gain using the improved bench-strstr on Cortex-A72 is 2.1x.
Due to the low overhead the gain is larger on small haystacks: haystacks of 64
chars are ~3 times faster for needles of 5-16 chars.

Tested against GLIBC testsuite, randomized tests and the GNULIB strstr test
(https://git.savannah.gnu.org/cgit/gnulib.git/tree/tests/test-strstr.c).

ChangeLog:
2018-10-29  Wilco Dijkstra  <wdijkstr@arm.com>

	* string/str-two-way.h (two_way_short_needle): Add inline to avoid warning.
	* string/strstr.c (strstr2): Add new function.
	(strstr3): Likewise.
	(strstr4): Likewise.
	(STRSTR): Completely rewrite strstr to improve performance.

---
  

Comments

Alexander Monakov Oct. 29, 2018, 5:04 p.m. UTC | #1
On Mon, 29 Oct 2018, Wilco Dijkstra wrote:

> The needle length is limited to 254 - combined with a hashed shift table
> this reduces the shift table memory 16 to 32 times, lowering preprocessing
> overhead and minimizing cache effects.

This is a bit unclear: given needle length limit, unhashed table would require
256 bytes, hashing as implemented in the patch reduces it to 64 bytes. I think
unhashed variant should be even faster, is saving 192 bytes on stack worth it?

> The performance gain using the improved bench-strstr on Cortex-A72 is 2.1x.
> Due to the low overhead the gain is larger on small haystacks: haystacks of 64
> chars are ~3 times faster for needles of 5-16 chars.

What does the worst-case change look like (I think it's searching for "a{251}ba"
in "aaa...aaa" if memcmp works left-to-right)?  Does it make sense to move
needle length cutoff further below 254?

> 	* string/str-two-way.h (two_way_short_needle): Add inline to avoid warning.
> 	* string/strstr.c (strstr2): Add new function.
> 	(strstr3): Likewise.
> 	(strstr4): Likewise.
> 	(STRSTR): Completely rewrite strstr to improve performance.

Generating the patch with 'git diff --patience' may produce a more readable diff.

> +  /* Use the Quick-Search algorithm for needle lengths less than 255.  */
> +  if (__glibc_likely (ne_len < 255))
> +    {
> +      uint8_t shift[1 << SHIFT_TABLE_BITS];
> +      const unsigned char *end = hs + hs_len - ne_len;
> +
> +      /* Initialize bad character shift hash table.  */
> +      memset (shift, ne_len + 1, sizeof (shift));
> +      for (int i = 0; i < ne_len; i++)
> +	shift[ne[i] % sizeof (shift)] = ne_len - i;
> +
> +      do
> +	{
> +	  hs--;

Doesn't this move hs out of bounds and thus invoke undefined behavior?

Thanks.
Alexander
  
Ondrej Bilka Oct. 29, 2018, 10:54 p.m. UTC | #2
On Mon, Oct 29, 2018 at 03:55:36PM +0000, Wilco Dijkstra wrote:
> This patch significantly improves performance of strstr by using Sunday's
> Quick-Search algorithm.  Due to its simplicity it has the best average
> performance of string matching algorithms on almost all inputs.  It uses a
> bad-character shift table to skip past mismatches.
> 
> The needle length is limited to 254 - combined with a hashed shift table
> this reduces the shift table memory 16 to 32 times, lowering preprocessing
> overhead and minimizing cache effects.  The needle limit also implies
> worst-case performance remains linear.
> 
> Small 1-4 byte needles use special case code which is typically faster.
> Very long needles continue to use the linear-time Two-Way algorithm.
> 
> The performance gain using the improved bench-strstr on Cortex-A72 is 2.1x.
> Due to the low overhead the gain is larger on small haystacks: haystacks of 64
> chars are ~3 times faster for needles of 5-16 chars.
>
The real benefit is fixing the idiocy of calculating the needle size before
calling strchr; the code after that only tries to make the cold path faster.

It would be better to call strchr again, and if it still doesn't match the third time,
the code after that shouldn't matter as long as it's linear.
If you capture inputs from actual applications you will find out what the bottlenecks are.

That "benchmark" is complete garbage. In practice running time is
dominated by branch misprediction etc., but the "benchmark" calls the function
many times on the same input, leading to no branch mispredictions.

You couldn't do much with the needle, as the inputs are really small and
precomputation is just too slow; even the strlen of the needle would be a
bottleneck.
Also strcmp, once you take branch misprediction into account, is too slow;
I tried a lot of things for that comparison and nothing could beat doing
it byte-by-byte, unrolled 4 times or so.

As for real inputs, record something with this:
http://kam.mff.cuni.cz/~ondra/dryrun.tar.bz2
make
LD_PRELOAD=./dryrun_strstr.so bash
do something, exit bash
./summary_strstr

which prints statistics like this so you can see real data.

average size 21.28 needle size  6.67 comparisons  23.00 digraphs  0.22 trigraphs  0.00
calls     1907 succeed  61.4% latencies   0.0   0.0
s1    aligned to 4 bytes  81.9% aligned to 8 bytes  81.6% aligned to 16 bytes  31.7%
s2    aligned to 4 bytes  67.0% aligned to 8 bytes   8.0% aligned to 16 bytes   7.9%
needle found in n bytes: n <= 0:   7.9% n <= 1:   7.9% n <= 2:  11.9% n <= 3:  19.7%  n <= 4:  19.7% n <= 8:  19.7% n <= 16:  65.3% n <= 32: 81.1% n <= 64:  96.8%
needle size: n <= 0:   7.9% n <= 1:   7.9% n <= 2:   7.9% n <= 3:   7.9% n <= 4:  37.8% n <= 8:  96.9% n <= 16:  96.9% n <= 32: 100.0% n <= 64: 100.0%

For arm and anything with an assembly strchr you would gain more by a similar
trick to the one I did for x64. For the hot path it's an easy modification of
strchr where, instead of one character, you check the first two characters where
convenient (function leaky_digraph); sometimes you check only the first one
rather than trying to include a character with unsuitable alignment. Note that
unaligned loads extended by one character could be faster than trying to shift
vectors.

The logic is the following:
strstr(s,n) {
  if (!n[0]) return s;
  if (!n[1]) return strchr (s, n[0]); // corner case, one-char needles are rare
  s = leaky_digraph(s, n[0], n[1]);
  if (!s)
    return NULL;
  for (i = 1; n[i] && s[i] == n[i]; i++);
  if (!n[i]) return s;
  return strstr_two_way(s, n);
}

Improvements after that are mostly QoI.
For bigger haystacks the trick is to do the code above in a loop and find
constants a, b, c, d such that
time(two_way) > a*(haystack-haystack_start) + b and
c*characters_compared + d*leaky_digraph_calls > time(leaky_digraph_loop),
and add a check to the loop to switch to two_way when
a*(haystack-haystack_start) + b < c*characters_compared + d*leaky_digraph_calls.

One could optimize the code more by iterating the mask instead of calling
leaky_digraph again, but that's only to decrease the constants in the case
where digraphs are quite frequent.
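
A standalone sketch of that cutoff idea, with the constants collapsed into a
single ratio N and a plain strchr-based scan standing in for leaky_digraph
(both the constant and the fallback are placeholders, not measured values):

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Run a cheap scan loop, but count comparisons and bail out to a
   guaranteed-linear algorithm once the work exceeds a fixed multiple of the
   progress through the haystack.  linear_fallback stands in for Two-Way
   (here it just defers to the libc strstr); N stands in for the constants
   a, b, c, d above.  */
static const char *
linear_fallback (const char *h, const char *n)
{
  return strstr (h, n);
}

static const char *
search_with_cutoff (const char *haystack, const char *needle)
{
  const size_t N = 4;            /* tuning constant */
  size_t compared = 0;           /* characters compared so far */
  size_t ne_len = strlen (needle);

  if (needle[0] == '\0')
    return haystack;

  for (const char *h = haystack; (h = strchr (h, needle[0])) != NULL; h++)
    {
      size_t i = 1;
      while (needle[i] != '\0' && h[i] == needle[i])
        i++;
      if (needle[i] == '\0')
        return h;
      compared += i;
      /* Too many comparisons for too little progress: switch algorithms.  */
      if (compared > N * (size_t) (h - haystack) + ne_len)
        return linear_fallback (h, needle);
    }
  return NULL;
}

int
main (void)
{
  printf ("%s\n", search_with_cutoff ("abababac", "abac"));
  return 0;
}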
  
Zack Weinberg Oct. 29, 2018, 11:24 p.m. UTC | #3
On Mon, Oct 29, 2018 at 6:54 PM Ondřej Bílka <neleai@seznam.cz> wrote:
> The real benefit is fixing the idiocy
...
> That "benchmark" is complete garbage
...

When you dismiss other people's testing this aggressively you make
them not want to cooperate with you.  If I had written this patch, my
reaction to this feedback would be: Fine, fix it yourself if you think
my approach is idiotic.

It is perfectly possible to express concerns in a less hostile manner.
For instance, you could have said "I think this benchmark doesn't
account for the cost of branch misprediction, because it calls strstr
many times with the same inputs.  Can you please test it on a wider
variety of inputs in a randomized order?"

zw
  
Wilco Dijkstra Oct. 30, 2018, 5:32 p.m. UTC | #4
Hi,

Alexander Monakov wrote:
> On Mon, 29 Oct 2018, Wilco Dijkstra wrote:
>
>> The needle length is limited to 254 - combined with a hashed shift table
>> this reduces the shift table memory 16 to 32 times, lowering preprocessing
>> overhead and minimizing cache effects.
>
> This is a bit unclear: given needle length limit, unhashed table would require
> 256 bytes, hashing as implemented in the patch reduces it to 64 bytes. I think
> unhashed variant should be even faster, is saving 192 bytes on stack worth it?

The startup overhead is significant for small haystacks, so reducing that is
actually faster. The hashing uses a simple AND instruction which is hidden
across the memcmp call, so doesn't cost much on a modern core. 

> What does the worst-case change look like (I think it's searching for "a{251}ba"
> in "aaa...aaa" if memcmp works left-to-right)?  Does it make sense to move
> needle length cutoff further below 254?

To be fair you have to compare the worst case of one algorithm vs the worst
case of another since they are different... Two-way is slow on searching for
ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
faster on this case despite having a theoretical factor of 256/2 advantage.
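
For reference, inputs of this worst-case shape can be constructed and timed
with a standalone test like the one below (it builds the all-'a' case Alexander
describes; a{252}ba is an assumed stand-in needle that stays under the 254-byte
quick-search limit, and absolute timings naturally depend on the machine and on
which strstr the C library provides):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (void)
{
  enum { K = 252, HS_LEN = 1 << 20 };

  /* Needle a{252}ba: every alignment matches ~253 bytes before the
     mismatch at 'b', and the bad-character shift on 'a' is 1.  */
  char needle[K + 3];
  memset (needle, 'a', K);
  needle[K] = 'b';
  needle[K + 1] = 'a';
  needle[K + 2] = '\0';

  /* Haystack of 1 MiB of 'a's - no match exists.  */
  char *haystack = malloc (HS_LEN + 1);
  memset (haystack, 'a', HS_LEN);
  haystack[HS_LEN] = '\0';

  clock_t t0 = clock ();
  const char *res = strstr (haystack, needle);
  clock_t t1 = clock ();

  printf ("match: %s, time: %.1f ms\n", res ? "yes" : "no",
          (double) (t1 - t0) * 1000.0 / CLOCKS_PER_SEC);
  free (haystack);
  return 0;
}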

I don't think there is any gain slowing down common cases in order to
reduce the worst-case difference (assuming that is feasible).

> +      do
> +     {
> +       hs--;

> Doesn't this move hs out of bounds and thus invoke undefined behavior?

That's a theoretical issue given *(a+(i-1)), *((a-1)+i) and *((a+i)-1) behave
identically.

Wilco
  
Wilco Dijkstra Oct. 31, 2018, 1:56 p.m. UTC | #5
Hi,

Ondřej Bílka wrote:
> The real benefit is fixing the idiocy of calculating the needle size before
> calling strchr; the code after that only tries to make the cold path faster.

Doing a single strchr early helps indeed if the first char of needle occurs
infrequently or it finds the match. But I don't see it improving performance much.

> It would be better to call strchr again, and if it still doesn't match the third time,
> the code after that shouldn't matter as long as it's linear.

Repeated strchr/memchr seems to be very bad for performance on characters 
that occur frequently.

> If you capture inputs from actual applications you will find out what the bottlenecks are.

Yes but we'd need applications which use strstr a lot.

> As for real inputs, record something with this:
> http://kam.mff.cuni.cz/~ondra/dryrun.tar.bz2

Thanks, I'll have a look.

> For arm and anything with an assembly strchr you would gain more by a similar
> trick to the one I did for x64. For the hot path it's an easy modification of
> strchr where, instead of one character, you check the first two characters where
> convenient (function leaky_digraph); sometimes you check only the first one
> rather than trying to include a character with unsuitable alignment. Note that
> unaligned loads extended by one character could be faster than trying to shift
> vectors.

Sure an assembly implementation could easily search for 2 characters at a time.
But right now I'm improving the generic C version first. It's not even using unaligned
accesses yet - and those can give significant gains both in brute force search
and smarter algorithms like Quicksearch.

> The logic is the following:
> strstr(s,n) {
>   if (!n[0]) return s;
>   if (!n[1]) return strchr (s, n[0]); // corner case, one-char needles are rare
>   s = leaky_digraph(s, n[0], n[1]);
>   if (!s)
>     return NULL;
>   for (i = 1; n[i] && s[i] == n[i]; i++);
>   if (!n[i]) return s;
>   return strstr_two_way(s, n);
> }

Yes it would be feasible to add a digraph test like that to replace the initial strchr check.
Do you have an idea what percentage of calls it filters compared to plain strchr?

Wilco
  
Rich Felker Nov. 1, 2018, 4:42 p.m. UTC | #6
On Tue, Oct 30, 2018 at 05:32:12PM +0000, Wilco Dijkstra wrote:
> Alexander Monakov wrote:
> > What does the worst-case change look like (I think it's searching for "a{251}ba"
> > in "aaa...aaa" if memcmp works left-to-right)?  Does it make sense to move
> > needle length cutoff further below 254?

This "quicksearch" is fundamentally a bad algorithm. If it looks fast
in some cases, it's just a matter of getting lucky, and most likely
due to bad tuning or missed optimizations in the implementation of the
algorithm you're comparing it against. It's fundamentally a naive
O(nm) search that will be this bad in all cases except where the
bad-character table happens to let you advance a lot.

> To be fair you have to compare the worst case of one algorithm vs the worst
> case of another since they are different... Two-way is slow on searching for
> ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
> faster on this case despite having a theoretical factor of 256/2 advantage.

Do you have a basis for calling this "slow"? It should be roughly one
comparison per byte. "Only 3x faster" doesn't sound slow to me.

> I don't think there is any gain slowing down common cases in order to
> > reduce the worst-case difference (assuming that is feasible).

This argument gets made about strstr every couple years, and every
time it's wrong. It's not that "common cases" (as a general class) are
faster with any naive algorithm. Only very specific cases will ever be
faster. Most common cases are roughly the same speed as with a good
algorithm, and some common cases, plus a lot more uncommon ones, are
catastrophically slower with a bad algorithm.

To my knowledge there is no good test for a case that won't become
catastrophically slower with a bad algorithm. If there is, I would be
very surprised if the same test doesn't naturally translate into an
optimized setup for two-way.

> > +      do
> > +     {
> > +       hs--;
> 
> > Doesn't this move hs out of bounds and thus invoke undefined behavior?
> 
> That's a theoretical issue given *(a+(i-1)), *((a-1)+i) and *((a+i)-1) behave
> identically.

Calling blatant undefined behavior "a theoretical issue" is not
something we should be doing in 2018.

Rich
  
Rich Felker Nov. 1, 2018, 7:27 p.m. UTC | #7
On Thu, Nov 01, 2018 at 12:42:35PM -0400, Rich Felker wrote:
> On Tue, Oct 30, 2018 at 05:32:12PM +0000, Wilco Dijkstra wrote:
> > Alexander Monakov wrote:
> > > What does the worst-case change look like (I think it's searching for "a{251}ba"
> > > in "aaa...aaa" if memcmp works left-to-right)?  Does it make sense to move
> > > needle length cutoff further below 254?
> 
> This "quicksearch" is fundamentally a bad algorithm. If it looks fast
> in some cases, it's just a matter of getting lucky, and most likely
> due to bad tuning or missed optimizations in the implementation of the
> algorithm you're comparing it against. It's fundamentally a naive
> O(nm) search that will be this bad in all cases except where the
> bad-character table happens to let you advance a lot.
> 
> > To be fair you have to compare the worst case of one algorithm vs the worst
> > case of another since they are different... Two-way is slow on searching for
> > ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
> > faster on this case despite having a theoretical factor of 256/2 advantage.
> 
> Do you have a basis for calling this "slow"? It should be roughly one
> comparison per byte. "Only 3x faster" doesn't sound slow to me.

I think there may actually be a bug (just suboptimality) in the
two-way implementation related to this case. It's a periodic needle
that uses 'memory', and glibc has (abbreviated):

	if (memory && shift < period)
		shift = needle_len - period;

From what I recall (having implemented the same), this is to ensure
that it doesn't advance too little, requiring re-comparing a segment
already compared as the right-hand factor. However, it's possible that
shift is already larger than needle_len - period, and in that case I
see no reason to use the memory rather than advancing by the longer
amount.

It's certainly not true in general that the shift of a given byte is
bounded by needle_len - period. For example, zabc...xyz factors as
[z][abc...xyz] where 'a' has a shift of 25, and needle_len - period is
1. Of course in this case, after the truncated shift by 1, the memory
will be gone, and the next shift won't be truncated, so it might not
matter in the total run time.

Anyone else interested in looking at this to confirm if my findings
make sense?

Rich
  
Rich Felker Nov. 1, 2018, 9:04 p.m. UTC | #8
On Thu, Nov 01, 2018 at 03:27:38PM -0400, Rich Felker wrote:
> On Thu, Nov 01, 2018 at 12:42:35PM -0400, Rich Felker wrote:
> > On Tue, Oct 30, 2018 at 05:32:12PM +0000, Wilco Dijkstra wrote:
> > > Alexander Monakov wrote:
> > > > What does the worst-case change look like (I think it's searching for "a{251}ba"
> > > > in "aaa...aaa" if memcmp works left-to-right)?  Does it make sense to move
> > > > needle length cutoff further below 254?
> > 
> > This "quicksearch" is fundamentally a bad algorithm. If it looks fast
> > in some cases, it's just a matter of getting lucky, and most likely
> > due to bad tuning or missed optimizations in the implementation of the
> > algorithm you're comparing it against. It's fundamentally a naive
> > O(nm) search that will be this bad in all cases except where the
> > bad-character table happens to let you advance a lot.
> > 
> > > To be fair you have to compare the worst case of one algorithm vs the worst
> > > case of another since they are different... Two-way is slow on searching for
> > > ba{254}b in a long sequence of ba{253}ba{253}. Two-way is only about 3 times
> > > faster on this case despite having a theoretical factor of 256/2 advantage.
> > 
> > Do you have a basis for calling this "slow"? It should be roughly one
> > comparison per byte. "Only 3x faster" doesn't sound slow to me.
> 
> I think there may actually be a bug (just suboptimality) in the
> two-way implementation related to this case. It's a periodic needle
> that uses 'memory', and glibc has (abbreviated):
> 
> 	if (memory && shift < period)
> 		shift = needle_len - period;
> 
> From what I recall (having implemented the same), this is to ensure
> that it doesn't advance too little, requiring re-comparing a segment
> already compared as the right-hand factor. However, it's possible that
> shift is already larger than needle_len - period, and in that case I
> see no reason to use the memory rather than advancing by the longer
> amount.
> 
> It's certainly not true in general that the shift of a given byte is
> bounded by needle_len - period. For example, zabc...xyz factors as
> [z][abc...xyz] where 'a' has a shift of 25, and needle_len - period is
> 1. Of course in this case, after the truncated shift by 1, the memory
> will be gone, and the next shift won't be truncated, so it might not
> matter in the total run time.
> 
> Anyone else interested in looking at this to confirm if my findings
> make sense?

I'm pretty sure this is a logic bug. Except for bytes that don't
appear in the needle (for which shift==needle_len), shift < period is
automatically true. The needle being periodic means that each byte
which appears in it appears within the last 'period' positions of it.

I suspect what was actually intended was:

	if (memory && shift < needle_len - period)
		shift = needle_len - period;

which can be simplified to:

	if (shift < memory)
		shift = memory;

Is this okay?

Rich
  
Rich Felker Nov. 6, 2018, 3:34 p.m. UTC | #9
On Tue, Nov 06, 2018 at 01:02:27PM +0000, Wilco Dijkstra wrote:
> Rich Felker wrote:
> 
> > This "quicksearch" is fundamentally a bad algorithm. If it looks fast
> > in some cases, it's just a matter of getting lucky, and most likely
> > due to bad tuning or missed optimizations in the implementation of the
> > algorithm you're comparing it against. It's fundamentally a naive
> > O(nm) search that will be this bad in all cases except where the
> > bad-character table happens to let you advance a lot.
> 
> I disagree. What's important is real world performance. Practically all

Real world performance includes performance under the control of a
hostile input source, under which naive algorithms will perform
hundreds of times slower. This "quicksearch" is incredibly slow on
a{253}ba, and it's as slow as the most naive possible strstr whenever
the last character is abundant in the haystack.

> good search algorithms are O(nm) and they beat any O(n) algorithm by
> huge factors on actual inputs. That should teach you that a concept like
> the O notation does not guarantee good performance.

By a small factor at best, not a huge factor.

> > Do you have a basis for calling this "slow"? It should be roughly one
> > comparison per byte. "Only 3x faster" doesn't sound slow to me.
> 
> One comparison per byte is slow indeed. Modern search algorithms
> are sublinear and achieve O(n/m) comparisons. You can see this when
> benchmarking - performance actually increases with longer needles.

Two-way of course achieves this sublinear performance for common
cases, but it's only possible for a very limited set, not in general,
and does not scale. In particular, once the length of the needle
exceeds the size of the alphabet, it can't continue to scale this way.

> >> I don't think there is any gain slowing down common cases in order to
> >> reduce the worst-case difference (assuming that is feasible).
> >
> > This argument gets made about strstr every couple years, and every
> > time it's wrong. It's not that "common cases" (as a general class) are
> > faster with any naive algorithm. Only very specific cases will ever be
> > faster. Most common cases are roughly the same speed as with a good
> > algorithm, and some common cases, plus a lot more uncommon ones, are
> > catastrophically slower with a bad algorithm.
> 
> Actually it's the correct argument. Common cases are indeed faster with
> modern algorithms - slow algorithms like Two-way can't speed up those

Two-way is essentially the most modern algorithm there is. Have you
read any of the literature on this topic?

> common cases. There is only very specific small set of inputs which can
> trigger worst-case behaviour, and those are never typical inputs. And it's
> easy to fall back to a different algorithm in such a case.

Citation needed. If it's easy to detect them, prove it. You'll find
it's not. You can of course do an introsearch that counts the number
of times it backtracks and switches to two-way, but by that time you
will have spent many times the setup cost of two-way and it will be a
lot slower.

> > To my knowledge there is no good test for a case that won't become
> > catastrophically slower with a bad algorithm. If there is, I would be
> > very surprised if the same test doesn't naturally translate into an
> > optimized setup for two-way.
> 
> Actually it's trivial to find the worst case for each algorithm, you just ensure
> you hit the slowest path all the time.

This is not a question of finding the worst case. It's a question of
efficiently evaluating if a given case is one of the bad ones.

> > Calling blatant undefined behavior "a theoretical issue" is not
> > something we should be doing in 2018.
> 
> It's 100% theoretical - or you can show an example that could fail?

With UBSan and LTO it should trap and abort.

> It's crazy to even bring this stuff up in the 21st century. Software that does
> not rely on undefined behaviour one way or another simply does not exist.
> So the question becomes, is that a fault with all our software or are the language
> standards simply decades behind common practice?

This kind of claim is not making your case.

Rich
  
Joseph Myers Nov. 6, 2018, 5:50 p.m. UTC | #10
On Tue, 6 Nov 2018, Rich Felker wrote:

> Real world performance includes performance under the control of a
> hostile input source, under which naive algorithms will perform
> hundreds of times slower. This "quicksearch" is incredibly slow on

Indeed.  I think strstr failing to be O(m+n) worst case is clearly a 
security bug; when you have an upper bound on m for an O(mn) algorithm to 
avoid a quadratic worst case, the only question is at what point that 
bound gets too high so the worst case performance is excessively bad.

I've just filed bug 23865 for wcsstr using a naive quadratic-time 
implementation.
  
Rich Felker Nov. 6, 2018, 7:02 p.m. UTC | #11
On Tue, Nov 06, 2018 at 06:26:57PM +0000, Wilco Dijkstra wrote:
> Rich Felker wrote:
> 
> > Real world performance includes performance under the control of a
> > hostile input source, under which naive algorithms will perform
> > hundreds of times slower. This "quicksearch" is incredibly slow on
> > a{253}ba, and it's as slow as the most naive possible strstr whenever
> > the last character is abundant in the haystack.
> 
> That's not true at all - in reality there is hardly any difference in worst case 
> performance. Comparing a{253}ba on Quicksearch vs the ba{252}b on 
> Two-way using GLIBC 2.27 (original version of Two-way) shows Two-way
> is only about 16% faster. If the worst-case performance of Two-way was
> deemed acceptable, then the worst-case of Quicksearch must be too.

I don't know what figures you're talking about, but the worst-case for
"quicksearch" with a bound of 256 bytes is easily over 100x slower
than the worst case of two-way.

> >> good search algorithms are O(nm) and they beat any O(n) algorithm by
> >> huge factors on actual inputs. That should teach you that a concept like
> >> the O notation does not guarantee good performance.
> >
> > By a small factor at best, not a huge factor.
> 
> Getting another 2-3x speedup after doubling performance is a huge factor.
> The previous implementation was doing effectively nothing useful 75-85%
> of the time...

Then fix implementation bugs. Don't replace good algorithms with awful
ones. For any given needle, it's definitely possible to make it such
that,...

> >> One comparison per byte is slow indeed. Modern search algorithms
> >> are sublinear and achieve O(n/m) comparisons. You can see this when
> >> benchmarking - performance actually increases with longer needles.
> >
> > Two-way of course achieves this sublinear performance for common
> > cases, but it's only possible for a very limited set, not in general,
> > and does not scale. In particular, once the length of the needle
> > exceeds the size of the alphabet, it can't continue to scale this way.
> 
> Two-way is not sublinear as often as Quicksearch indeed, and that's an
> inherent drawback of the algorithm.

It's sublinear in a superset of the cases where "quicksearch" is
sublinear: whenever you get to use the bad character table (which works
the same for both algorithms), *plus* when a mismatch in the left
factor lets you advance by a whole period. This latter class of cases
mostly helps when the period is large and the right factor is short,
but that's an important class of real-world cases (most needles are
non-periodic, i.e. period = strlen(needle), and the right factor is
very short). In some cases, the maximal-suffix algorithm selects a
gratuitously long right factor; it may be possible to optimize this
selection further.

> The alphabet size is not relevant, it's

I wasn't precise in stating what I meant here but it's not important.

> about the occurence frequencies of characters in the input - a bad character
> table happily jumps ahead by a million if a character does not appear in a
> huge needle. It's just unlikely to happen frequently on typical inputs. So for
> large inputs you don't see performance increasing further but level off.

Worst-case is never about what's likely to happen in the haystack but
what can happen.

> > Two-way is essentially the most modern algorithm there is. Have you
> > read any of the literature on this topic?
> 
> Two-way is almost 3 decades old now. It's not modern in any sense of the
> word. I've looked at lots of papers of modern search algorithms, practically
> all are about real-world performance (on English, DNA and gene sequences)
> and are O(nm) - few papers consider real-time or constant-space searching.

For DNA, quadratic-time is even worse.

> >> common cases. There is only very specific small set of inputs which can
> >> trigger worst-case behaviour, and those are never typical inputs. And it's
> >> easy to fall back to a different algorithm in such a case.
> >
> > Citation needed. If it's easy to detect them, prove it. You'll find
> > it's not. You can of course do an introsearch that counts the number
> > of times it backtracks and switches to two-way, but by that time you
> > will have spent many times the setup cost of two-way and it will be a
> > lot slower.
> 
> It's trivial by keeping track of how many comparisons you do vs how much
> progress you make, eg. comparisons_done > N * (cur_pos - start_pos) for
> a fixed N. But you can make it even simpler and do what I did - needles
> larger than 256 are rare and so their performance is less important.

Quicksearch performs *utterly abysmally* on plenty of needles shorter
than 256 or even shorter than 50. You cannot use it based on the
length of the needle except possibly for very short needles (maybe
<16), and even then you will do worse on a lot of real-world cases.

> >> Actually it's trivial to find the worst case for each algorithm, you just ensure
> >> you hit the slowest path all the time.
> >
> > This is not a question of finding the worst case. It's a question of
> > efficiently evaluating if a given case is one of the bad ones.
> 
> Why on earth would you want to do that dynamically? We can compare worst
> cases offline and choose what algorithm to run for particular needle/haystack
> sizes. And that is sufficient to ensure linear time in the worst case.

You can do this if you're implementing a custom search for your
application. You cannot do this for the libc strstr because it does
not know what the application is. It needs to be able to perform
non-pathologically for any inputs.

> >> > Calling blatant undefined behavior "a theoretical issue" is not
> >> > something we should be doing in 2018.
> >> 
> >> It's 100% theoretical - or you can show an example that could fail?
> >
> > With UBSan and LTO it should trap and abort.
> 
> I mean a real-world failure on an actual supported target, not something
> that's not actually possible.

This is a real-world near-future target, and introducing gratuitous
breakage for it now is utterly backwards, especially when it's so easy
to avoid (skipping the -- ahead of time and just adding a -1 to each
[]).

Rich
  
Alexander Monakov Nov. 6, 2018, 7:42 p.m. UTC | #12
On Tue, 6 Nov 2018, Rich Felker wrote:
> 
> > >> > Calling blatant undefined behavior "a theoretical issue" is not
> > >> > something we should be doing in 2018.
> > >> 
> > >> It's 100% theoretical - or you can show an example that could fail?
> > >
> > > With UBSan and LTO it should trap and abort.
> > 
> > I mean a real-world failure on an actual supported target, not something
> > that's not actually possible.
> 
> This is a real-world near-future target, and introducing gratuitous
> breakage for it now is utterly backwards, especially when it's so easy
> to avoid (skipping the -- ahead of time and just adding a -1 to each
> []).

A simpler fix is available: 'hs++, hs_len--' after the initial optimistic
memcmp has failed.
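
Concretely, against a standalone copy of the patch's quick-search branch, that
adjustment would look roughly like this (a sketch, not the patch itself):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SHIFT_TABLE_BITS 6

/* Standalone copy of the patch's quick-search branch with the suggested
   change: after the optimistic memcmp at the start has failed, step past
   that position, so the 'hs--' at the top of the loop can never move
   before the start of the haystack.  */
static char *
quicksearch_fixed (const char *haystack, const char *needle)
{
  const unsigned char *hs = (const unsigned char *) haystack;
  const unsigned char *ne = (const unsigned char *) needle;
  size_t ne_len = strlen (needle);
  size_t hs_len = strnlen (haystack, ne_len | 512);

  if (ne_len == 0)
    return (char *) hs;
  if (hs_len < ne_len || ne_len >= 255)
    return NULL;   /* Sketch only - the real code handles these cases.  */

  if (memcmp (hs, ne, ne_len) == 0)
    return (char *) hs;
  hs++, hs_len--;  /* The suggested adjustment.  */

  uint8_t shift[1 << SHIFT_TABLE_BITS];
  const unsigned char *end = hs + hs_len - ne_len;

  memset (shift, ne_len + 1, sizeof (shift));
  for (size_t i = 0; i < ne_len; i++)
    shift[ne[i] % sizeof (shift)] = ne_len - i;

  do
    {
      hs--;  /* Never moves below the original haystack now.  */

      size_t tmp = shift[hs[ne_len] % sizeof (shift)];
      for (hs += tmp; hs <= end; hs += tmp)
        {
          tmp = shift[hs[ne_len] % sizeof (shift)];
          if (memcmp (hs, ne, ne_len) == 0)
            return (char *) hs;
        }
      if (end[ne_len] == '\0')
        return NULL;
      end += strnlen ((const char *) end + ne_len, 2048);
    }
  while (hs <= end);

  return NULL;
}

int
main (void)
{
  printf ("%s\n", quicksearch_fixed ("a haystack with a needle in it", "needle"));
  return 0;
}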

Alexander
  
Rich Felker Nov. 7, 2018, 7:28 p.m. UTC | #13
On Wed, Nov 07, 2018 at 04:01:51PM +0000, Wilco Dijkstra wrote:
> Rich Felker wrote:
> > On Tue, Nov 06, 2018 at 06:26:57PM +0000, Wilco Dijkstra wrote:
> 
> >> That's not true at all - in reality there is hardly any difference in worst case 
> >> performance. Comparing a{253}ba on Quicksearch vs the ba{252}b on 
> >> Two-way using GLIBC 2.27 (original version of Two-way) shows Two-way
> >> is only about 16% faster. If the worst-case performance of Two-way was
> >> deemed acceptable, then the worst-case of Quicksearch must be too.
> >
> > I don't know what figures you're talking about, but the worst-case for
> > "quicksearch" with a bound of 256 bytes is easily over 100x slower
> > than the worst case of two-way.
> 
> Not a chance. You do realize I am measuring this stuff for real rather than
> just making up random numbers? Worst-case performance of both algorithms
> is very similar for needles up to 256. It's not possible to get 10x difference, let
> alone 100x.

I'm not either, but we are measuring different implementations,
particularly of the underlying memcmp. But a better memcmp is not
going to make it 100x faster. Maybe 10x at best, leaving a 10x
difference still. And not all archs have a ridiculously fast memcmp.

Indeed, just re-testing on a glibc box, "quicksearch" takes 33 whole
microseconds to search (and not find) a{90}ba in (ba{89}){9}. Two-way
can do the same in 3 microseconds on this box, although glibc's (2.24
from Debian) included two-way on this box is much slower and takes 12.
So that's a 10x difference at 92 bytes (limitation of the measurement
code from a third party I was using). At 256 bytes I'd expect more
like a 25x difference.

> >> Getting another 2-3x speedup after doubling performance is a huge factor..
> >> The previous implementation was doing effectively nothing useful 75-85%
> >> of the time...
> >
> > Then fix implementation bugs. Don't replace good algorithms with awful
> > ones. For any given needle, it's definitely possible to make it such
> > that,...
> 
> I already did that, remember? If you believe you can do better then post further
> performance improvements.

Nobody has replied to my branch of this thread about limiting shift in
the case of memory. I'm pretty confident it's right now so maybe I
should just submit a patch. But it's frustrating that there is more
interest in discussing bad algorithms than in improving the
implementation of a good one.

> > Quicksearch performs *utterly abysmally* on plenty of needles shorter
> > than 256 or even shorter than 50. You cannot use it based on the
> > length of the needle except possibly for very short needles (maybe
> > <16), and even then you will do worse on a lot of real-world cases.
> 
> Citation needed. You clearly haven't ever tried Quicksearch. It beats Two-way 
> in every single case on real inputs. The length of the needle matters only for the
> worst case, and by limiting the needle size, you avoid quadratic behaviour.

See above. I have tried it and it performs abysmally.

> > You can do this if you're implementing a custom search for your
> > application. You cannot do this for the libc strstr because it does
> > not know what the application is. It needs to be able to perform
> > non-pathologically for any inputs.
> 
> Limiting the needle size to avoid quadratic behaviour is perfectly fine for GLIBC
> since the goal is to avoid denial of service via specially crafted inputs.
> 
> It's impossible for strstr to always have identical performance on all possible
> inputs - practically every algorithm has fast and slow cases. Rabin-Karp is
> perhaps an exception given a good hash function. It's also 10x faster than
> Two-way slow cases, so it's a candidate to replace Two-way.

It's also quadratic, so it's not a candidate, unless you cap the
length for which it's used very low.

> >> >> > Calling blatant undefined behavior "a theoretical issue" is not
> >> >> > something we should be doing in 2018.
> >> >> 
> >> >> It's 100% theoretical - or you can show an example that could fail?
> >> >
> >> > With UBSan and LTO it should trap and abort.
> >> 
> >> I mean a real-world failure on an actual supported target, not something
> >> that's not actually possible.
> >
> > This is a real-world near-future target, and introducing gratuitous
> > breakage for it now is utterly backwards, especially when it's so easy
> > to avoid (skipping the -- ahead of time and just adding a -1 to each
> > []).
> 
> It seems unlikely it would ever work across dynamic library boundaries, but 

Static linking is valid and supported usage.

> even if you replace the haystack input in strstr with a hardcoded array, UBSan
> does not report a problem. The only way I could make it report anything at all
> was to use hs = &s[-1].

If it fails to detect it, that's a bug in UBSan that will hopefully be
fixed at some point. I've actually had serious memory corruption bugs
that would have been caught by correctly trapping invalid pointer
arithmetic outside the bounds of the object, but weren't, so I'm aware
that it's not perfect and have a significant interest in its being
improved in the future.

Rich
  
Rich Felker Nov. 8, 2018, 7:07 p.m. UTC | #14
On Wed, Nov 07, 2018 at 02:28:45PM -0500, Rich Felker wrote:
> > >> Getting another 2-3x speedup after doubling performance is a huge factor..
> > >> The previous implementation was doing effectively nothing useful 75-85%
> > >> of the time...
> > >
> > > Then fix implementation bugs. Don't replace good algorithms with awful
> > > ones. For any given needle, it's definitely possible to make it such
> > > that,...
> > 
> > I already did that, remember? If you believe you can do better then post further
> > performance improvements.
> 
> Nobody has replied to my branch of this thread about limiting shift in
> the case of memory. I'm pretty confident it's right now so maybe I
> should just submit a patch. But it's frustrating that there is more
> interest in discussing bad algorithms than in improving the
> implementation of a good one.

On further reading, it looks like the whole "performance problem" is
probably lack of a machine-optimized comparison loop for the two
(forward and backwards) comparisons in two-way. If we had "rep cmp"
functions for each direction (optimized to use specialized insns or
vector ops, possibly misaligned if allowed by the ISA, taking into
account that we know the needle length and a lower bound on the
haystack length so that there's no overread issue) returning the first
mismatching offset, I see no reason why there should be any cases that
perform worse than "quicksearch", and pretty much all cases should
speed up a lot vs the status quo.

Ideally the "rep cmp" functions should be inline containing inline asm
fragments so they can be inlined right into the loop without any call
overhead, but this probably only matters for short needles.
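
Portable stand-ins for those helpers could look like this (just pinning down an
interface; the real versions would be the arch-specific inline asm or vector
fragments described above):

#include <stddef.h>
#include <stdio.h>

/* Forward compare: number of leading bytes of A and B that match
   (== LEN if all LEN bytes are equal).  */
static inline size_t
cmp_fwd (const unsigned char *a, const unsigned char *b, size_t len)
{
  size_t i = 0;
  while (i < len && a[i] == b[i])
    i++;
  return i;
}

/* Backward compare: number of trailing bytes of A and B that match
   (== LEN if all LEN bytes are equal).  */
static inline size_t
cmp_bwd (const unsigned char *a, const unsigned char *b, size_t len)
{
  size_t i = 0;
  while (i < len && a[len - 1 - i] == b[len - 1 - i])
    i++;
  return i;
}

int
main (void)
{
  const unsigned char x[] = "abcdef", y[] = "abcxef";
  printf ("%zu %zu\n", cmp_fwd (x, y, 6), cmp_bwd (x, y, 6));  /* 3 2 */
  return 0;
}

In the two-way code the forward helper would presumably cover the right-factor
scan and the backward one the left-factor scan.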

Rich
  
Rich Felker Nov. 9, 2018, 5:59 p.m. UTC | #15
On Fri, Nov 09, 2018 at 12:46:54PM +0000, Wilco Dijkstra wrote:
> Hi,
> 
> Rich Felker wrote:
> > On further reading, it looks like the whole "performance problem" is
> > probably lack of a machine-optimized comparison loop for the two
> > (forward and backwards) comparisons in two-way. If we had "rep cmp"
> > functions for each direction (optimized to use specialized insns or
> > vector ops, possibly misaligned if allowed by the ISA, taking into
> > account that we know the needle length and a lower bound on the
> > haystack length so that there's no overread issue) returning the first
> > mismatching offset, I see no reason why there should be any cases that
> > perform worse than "quicksearch", and pretty much all cases should
> > speed up a lot vs the status quo.
> >
> > Ideally the "rep cmp" functions should be inline containing inline asm
> > fragments so they can be inlined right into the loop without any call
> > overhead, but this probably only matters for short needles.
> 
> This won't help at all. The character matching loops don't show up in profiles.
Almost all time is spent in finding the first character, i.e. strchr (short needles)
and the Horspool loop (long needles). Using Quicksearch to find a prefix of
> the suffix string should be faster.

Which profiles? The common cases (where significant-length prefix
matches are usually rare) or the worst cases? For needles that factor
poorly (long right factor), the latter breaks down to essentially
repeated disjoint memcmp's of the right factor against the whole
haystack, in successive order, starting over at the beginning of the
needle's right factor whenever there's a mismatch. That sounds highly
"memcmp-bound" to me. If worst cases are something you care about
improving vs the status quo, this is worth pursuing, but if not, okay.

I agree that for common cases there are potentially much better
improvements to be made elsewhere. strchr is of course a big help when
the character to be searched is rare but useless (and pessimizes from
call/setup overhead unless you can inline it) if it's frequent. It's
probably better to search for the first up-to-N characters of the
right factor, where N is a word/vector size, rotating bytes of the
haystack in/out of a running word/vector for comparison (musl's strstr
does this for needle sizes up to 4 already, and some data has shown
it's usually better than two-way a bit further on 64-bit systems).

Performing the critical factorization via the maximal suffix process
described in the two-way paper is also suboptimal, and dependent on
the actual order of the alphabet. For example hello factors as
[hell][o] but hellk factors as [he][llk] despite them being
structurally isomorphic under a renaming of the alphabet. In some
cases, optimal factorizations can be determined by short-circuiting
past the MS process. For example if the loop setting up the bad char
table determines that the final character of the needle is the only
occurrence of that character, [l-1][1] is a critical factorization and
p=l. There are larger classes of this type of shortcut but I'm not
sure what's efficient/reasonable-tradeoff to do. Searching for a short
right factor is appealing though because it means you get to skip by
the whole period as soon as there's any mismatch on the left factor.
It might be possible to choose (at least heuristically) some
permutation of the alphabet (probably anything more than a rotation of
it is not practical, though) that optimizes/improves the MS
factorization results though. For example, in the "hellk" example,
rotating the alphabet so that 'k' is the first or last character would
make it factor as [hell][k] rather than [he][llk].

Rich
  

Patch

diff --git a/string/str-two-way.h b/string/str-two-way.h
index 523d946c59412e1f1f65b8ec3778553b83191952..31c3f18fb057cdd999c3ac9e9d894a8b62a98a70 100644
--- a/string/str-two-way.h
+++ b/string/str-two-way.h
@@ -221,7 +221,7 @@  critical_factorization (const unsigned char *needle, size_t needle_len,
    most 2 * HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching.
    If AVAILABLE modifies HAYSTACK_LEN (as in strstr), then at most 3 *
    HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching.  */
-static RETURN_TYPE
+static inline RETURN_TYPE
 two_way_short_needle (const unsigned char *haystack, size_t haystack_len,
 		      const unsigned char *needle, size_t needle_len)
 {
diff --git a/string/strstr.c b/string/strstr.c
index f74d7189ed1319f6225525cc2d32380745de1523..206f20ddfcf10f61ba68725b516ef00356e9028c 100644
--- a/string/strstr.c
+++ b/string/strstr.c
@@ -22,11 +22,8 @@ 
 # include <config.h>
 #endif
 
-/* Specification of strstr.  */
 #include <string.h>
 
-#include <stdbool.h>
-
 #ifndef _LIBC
 # define __builtin_expect(expr, val)   (expr)
 #endif
@@ -47,46 +44,115 @@ 
 #define STRSTR strstr
 #endif
 
-/* Return the first occurrence of NEEDLE in HAYSTACK.  Return HAYSTACK
-   if NEEDLE is empty, otherwise NULL if NEEDLE is not found in
-   HAYSTACK.  */
-char *
-STRSTR (const char *haystack, const char *needle)
+static inline char *
+strstr2 (const unsigned char *hs, const unsigned char *ne)
 {
-  size_t needle_len; /* Length of NEEDLE.  */
-  size_t haystack_len; /* Known minimum length of HAYSTACK.  */
+  uint32_t h1 = (ne[0] << 16) | ne[1];
+  uint32_t h2 = 0;
+  for (int c = hs[0]; h1 != h2 && c != 0; c = *++hs)
+      h2 = (h2 << 16) | c;
+  return h1 == h2 ? (char *)hs - 2 : NULL;
+}
 
-  /* Handle empty NEEDLE special case.  */
-  if (needle[0] == '\0')
-    return (char *) haystack;
+static inline char *
+strstr3 (const unsigned char *hs, const unsigned char *ne)
+{
+  uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8);
+  uint32_t h2 = 0;
+  for (int c = hs[0]; h1 != h2 && c != 0; c = *++hs)
+      h2 = (h2 | c) << 8;
+  return h1 == h2 ? (char *)hs - 3 : NULL;
+}
 
-  /* Skip until we find the first matching char from NEEDLE.  */
-  haystack = strchr (haystack, needle[0]);
-  if (haystack == NULL || needle[1] == '\0')
-    return (char *) haystack;
+static inline char *
+strstr4 (const unsigned char *hs, const unsigned char *ne)
+{
+  uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8) | ne[3];
+  uint32_t h2 = 0;
+  for (int c = hs[0]; c != 0 && h1 != h2; c = *++hs)
+    h2 = (h2 << 8) | c;
+  return h1 == h2 ? (char *)hs - 4 : NULL;
+}
 
-  /* Ensure HAYSTACK length is at least as long as NEEDLE length.
-     Since a match may occur early on in a huge HAYSTACK, use strnlen
+/* Number of bits used to index shift table.  */
+#define SHIFT_TABLE_BITS 6
+
+/* Extremely fast strstr algorithm with guaranteed linear-time performance.
+   Small needles up to size 4 use a dedicated linear search.  Longer needles
+   up to size 254 use Sunday's Quick-Search algorithm.  Due to its simplicity
+   it has the best average performance of string matching algorithms on almost
+   all inputs.  It uses a bad-character shift table to skip past mismatches.
+   By limiting the needle length to 254, the shift table can be reduced to 8
+   bits per entry, lowering preprocessing overhead and minimizing cache effects.
+   The limit also implies the worst-case performance is linear.
+   Even larger needles are processed by the linear-time Two-Way algorithm.
+*/
+char *
+STRSTR (const char *haystack, const char *needle)
+{
+  const unsigned char *hs = (const unsigned char *) haystack;
+  const unsigned char *ne = (const unsigned char *) needle;
+
+  /* Handle short needle special cases first.  */
+  if (ne[0] == '\0')
+    return (char *)hs;
+  hs = (const unsigned char *)strchr ((const char*)hs, ne[0]);
+  if (hs == NULL || ne[1] == '\0')
+    return (char*)hs;
+  if (ne[2] == '\0')
+    return strstr2 (hs, ne);
+  if (ne[3] == '\0')
+    return strstr3 (hs, ne);
+  if (ne[4] == '\0')
+    return strstr4 (hs, ne);
+
+  /* Ensure haystack length is at least as long as needle length.
+     Since a match may occur early on in a huge haystack, use strnlen
      and read ahead a few cachelines for improved performance.  */
-  needle_len = strlen (needle);
-  haystack_len = __strnlen (haystack, needle_len + 256);
-  if (haystack_len < needle_len)
+  size_t ne_len = strlen ((const char*)ne);
+  size_t hs_len = strnlen ((const char*)hs, ne_len | 512);
+  if (hs_len < ne_len)
     return NULL;
 
-  /* Check whether we have a match.  This improves performance since we avoid
-     the initialization overhead of the two-way algorithm.  */
-  if (memcmp (haystack, needle, needle_len) == 0)
-    return (char *) haystack;
-
-  /* Perform the search.  Abstract memory is considered to be an array
-     of 'unsigned char' values, not an array of 'char' values.  See
-     ISO C 99 section 6.2.6.1.  */
-  if (needle_len < LONG_NEEDLE_THRESHOLD)
-    return two_way_short_needle ((const unsigned char *) haystack,
-				 haystack_len,
-				 (const unsigned char *) needle, needle_len);
-  return two_way_long_needle ((const unsigned char *) haystack, haystack_len,
-			      (const unsigned char *) needle, needle_len);
+  /* Check whether we have a match.  This improves performance since we
+     avoid initialization overheads.  */
+  if (memcmp (hs, ne, ne_len) == 0)
+    return (char *) hs;
+
+  /* Use the Quick-Search algorithm for needle lengths less than 255.  */
+  if (__glibc_likely (ne_len < 255))
+    {
+      uint8_t shift[1 << SHIFT_TABLE_BITS];
+      const unsigned char *end = hs + hs_len - ne_len;
+
+      /* Initialize bad character shift hash table.  */
+      memset (shift, ne_len + 1, sizeof (shift));
+      for (int i = 0; i < ne_len; i++)
+	shift[ne[i] % sizeof (shift)] = ne_len - i;
+
+      do
+	{
+	  hs--;
+
+	  /* Search by skipping past bad characters.  */
+	  size_t tmp = shift[hs[ne_len] % sizeof (shift)];
+	  for (hs += tmp; hs <= end; hs += tmp)
+	    {
+	      tmp = shift[hs[ne_len] % sizeof (shift)];
+	      if (memcmp (hs, ne, ne_len) == 0)
+		return (char*)hs;
+	    }
+	  if (end[ne_len] == '\0')
+	    return NULL;
+	  end += strnlen ((const char*)end + ne_len, 2048);
+	}
+      while (hs <= end);
+
+      return NULL;
+    }
+
+  /* Use Two-Way algorithm for very long needles.  */
+  return two_way_long_needle (hs, hs_len, ne, ne_len);
 }
 libc_hidden_builtin_def (strstr)