[v3] powerpc: strstr optimization

Message ID 20150716195538.GA5140@domone
State Not applicable
Headers

Commit Message

Ondrej Bilka July 16, 2015, 7:55 p.m. UTC
  On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> >> May I proceed with this commit?
> > 
> > Yes, please commit this for 2.22.
> 
> For the record I trust IBM to make sure these patches make incremental
> improvements in performance even if they are not the best possible
> performance as pointed out by Ondrej Bilka.
> 
Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
all. I did that as a test of how much we could trust IBM to verify patches.

I pointed out that it could have possibly quadratic behaviour, which it
still does. So please don't accept unreviewed patches next time.

As review, I asked multiple times what the performance is for
strstr("aaa...aaa","aaa...aaab"), but I never got an answer.

And the answer is that there is indeed a performance regression that
causes strstr to be 100 times slower. So if this was applied, it should
be reverted.

I get the following:

                                  simple_strstr  stupid_strstr  __strstr_power7  __strstr_ppc
strstr("aaa...aaa","aaa...aaab")  2.8543e+06     9.24296e+08    3.47783e+08      3.11846e+06

With the following benchmark:
  

Comments

Carlos O'Donell July 16, 2015, 8:16 p.m. UTC | #1
On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
>>>> May I proceed with this commit?
>>>
>>> Yes, please commit this for 2.22.
>>
>> For the record I trust IBM to make sure these patches make incremental
>> improvements in performance even if they are not the best possible
>> performance as pointed out by Ondrej Bilka.
>>
> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> all. I did that as test how much we could test IBM to verify patches. 
> 
> I pointed out that it could have possibly quadratic behaviour which
> still does. So please don't accept unreviewed patches next time.

They showed cases for which the code does go faster and objectively
so using the microbenchmark, and that's a win for now. Please continue 
to work with IBM to remove the quadratic worst case.

Tulio, You will need to work out why you have quadratic worst case.
It's certainly something we try to avoid. Did you make a particular
decision not to avoid it?

Cheers,
Carlos.
  
Tulio Magno Quites Machado Filho July 16, 2015, 9:05 p.m. UTC | #2
"Carlos O'Donell" <carlos@redhat.com> writes:

> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
>> all. I did that as test how much we could test IBM to verify patches. 

No.  That isn't true.
I reviewed this patch myself.  All four versions.
So did Steven.

>> I pointed out that it could have possibly quadratic behaviour which
>> still does. So please don't accept unreviewed patches next time.

Yes, this patch does have a quadratic worst case, as does quicksort.
But both algorithms are clearly an improvement over what was
available before them.
Is this algorithm perfect?  No.  I don't believe such a thing exists.  I
believe in iterative improvements.

> They showed cases for which the code does go faster and objectively
> so using the microbenchmark, and that's a win for now. Please continue 
> to work with IBM to remove the quadratic worst case.

That's our intention.
An improvement to this implementation is already in our backlog (i.e. we
haven't started yet), and I'd be glad if someone is interested in helping
us with this.

> Tulio, You will need to work out why you have quadratic worst case.
> It's certainly something we try to avoid. Did you make a particular
> decision not to avoid it?

IMHO, the overall improvement brought by this patch is more important than
the worst-case scenario Ondřej is showing.
That doesn't mean we're neglecting the quadratic worst case.
  
Joseph Myers July 22, 2015, 4:12 p.m. UTC | #3
On Thu, 16 Jul 2015, Carlos O'Donell wrote:

> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> >>>> May I proceed with this commit?
> >>>
> >>> Yes, please commit this for 2.22.
> >>
> >> For the record I trust IBM to make sure these patches make incremental
> >> improvements in performance even if they are not the best possible
> >> performance as pointed out by Ondrej Bilka.
> >>
> > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > all. I did that as test how much we could test IBM to verify patches. 
> > 
> > I pointed out that it could have possibly quadratic behaviour which
> > still does. So please don't accept unreviewed patches next time.
> 
> They showed cases for which the code does go faster and objectively
> so using the microbenchmark, and that's a win for now. Please continue 
> to work with IBM to remove the quadratic worst case.
> 
> Tulio, You will need to work out why you have quadratic worst case.
> It's certainly something we try to avoid. Did you make a particular
> decision not to avoid it?

If there's a quadratic worst case newly introduced for 2.22, I'd consider 
that a security hole (denial of service) that needs to block the release 
of 2.22 until it's fixed (possibly by removing the implementation in 
question).
  
Carlos O'Donell July 22, 2015, 5:55 p.m. UTC | #4
On 07/22/2015 12:12 PM, Joseph Myers wrote:
> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> 
>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
>>> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
>>>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
>>>>>> May I proceed with this commit?
>>>>>
>>>>> Yes, please commit this for 2.22.
>>>>
>>>> For the record I trust IBM to make sure these patches make incremental
>>>> improvements in performance even if they are not the best possible
>>>> performance as pointed out by Ondrej Bilka.
>>>>
>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
>>> all. I did that as test how much we could test IBM to verify patches. 
>>>
>>> I pointed out that it could have possibly quadratic behaviour which
>>> still does. So please don't accept unreviewed patches next time.
>>
>> They showed cases for which the code does go faster and objectively
>> so using the microbenchmark, and that's a win for now. Please continue 
>> to work with IBM to remove the quadratic worst case.
>>
>> Tulio, You will need to work out why you have quadratic worst case.
>> It's certainly something we try to avoid. Did you make a particular
>> decision not to avoid it?
> 
> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> that a security hole (denial of service) that needs to block the release 
> of 2.22 until it's fixed (possibly by removing the implementation in 
> question).

Joseph,

We have had quadratic worst cases in our string routines for years without
them blocking a release. I agree that it is not ideal for a release
to have such behaviour, but should it block this release?

Florian,

I would like to hear your opinion on whether you might consider this a
security issue, or simply a QoI issue.

Tulio,

Could you please review the possibility of quadratic behaviour and respond
promptly? I don't want this to hold up the release.

Cheers,
Carlos.
  
Steven Munroe July 22, 2015, 5:59 p.m. UTC | #5
On Wed, 2015-07-22 at 16:12 +0000, Joseph Myers wrote:
> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> 
> > On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> > >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> > >>>> May I proceed with this commit?
> > >>>
> > >>> Yes, please commit this for 2.22.
> > >>
> > >> For the record I trust IBM to make sure these patches make incremental
> > >> improvements in performance even if they are not the best possible
> > >> performance as pointed out by Ondrej Bilka.
> > >>
> > > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > > all. I did that as test how much we could test IBM to verify patches. 
> > > 
> > > I pointed out that it could have possibly quadratic behaviour which
> > > still does. So please don't accept unreviewed patches next time.
> > 
> > They showed cases for which the code does go faster and objectively
> > so using the microbenchmark, and that's a win for now. Please continue 
> > to work with IBM to remove the quadratic worst case.
> > 
> > Tulio, You will need to work out why you have quadratic worst case.
> > It's certainly something we try to avoid. Did you make a particular
> > decision not to avoid it?
> 
> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> that a security hole (denial of service) that needs to block the release 
> of 2.22 until it's fixed (possibly by removing the implementation in 
> question).
> 

There is a denial of service attack here, but not the one you would think.

Effectively this (noise and FUD) is blocking my team from submitting
patches and preventing my team from moving on to additional optimization.

The core problem is that there is no objective definition of what
quadratic behavior might be, nor has a certain troll who constantly
shouts (quadratic! quadratic! quadratic!) about this provided objective
proof, in the form of a benchmark.

So if you really believe that this is a problem, you should insist on an
objective benchmark and apply that objective standard to every platform.

This would allow my team to work on the problem and improve our code even
more.

Otherwise you are just wasting everyone's time.
  
Florian Weimer July 22, 2015, 6:10 p.m. UTC | #6
On 07/22/2015 07:55 PM, Carlos O'Donell wrote:

> I would like to hear your opinion on whether you might consider this a
> security issue, or simply a QoI issue.

Not a security issue as far as I can see.
  
Tulio Magno Quites Machado Filho July 22, 2015, 6:29 p.m. UTC | #7
"Carlos O'Donell" <carlos@redhat.com> writes:

> On 07/22/2015 12:12 PM, Joseph Myers wrote:
>> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
>> 
>>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
>>>> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
>>>>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
>>>>>>> May I proceed with this commit?
>>>>>>
>>>>>> Yes, please commit this for 2.22.
>>>>>
>>>>> For the record I trust IBM to make sure these patches make incremental
>>>>> improvements in performance even if they are not the best possible
>>>>> performance as pointed out by Ondrej Bilka.
>>>>>
>>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
>>>> all. I did that as test how much we could test IBM to verify patches. 
>>>>
>>>> I pointed out that it could have possibly quadratic behaviour which
>>>> still does. So please don't accept unreviewed patches next time.
>>>
>>> They showed cases for which the code does go faster and objectively
>>> so using the microbenchmark, and that's a win for now. Please continue 
>>> to work with IBM to remove the quadratic worst case.
>>>
>>> Tulio, You will need to work out why you have quadratic worst case.
>>> It's certainly something we try to avoid. Did you make a particular
>>> decision not to avoid it?
>> 
>> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
>> that a security hole (denial of service) that needs to block the release 
>> of 2.22 until it's fixed (possibly by removing the implementation in 
>> question).

Could you elaborate why a quadratic worst case in a string function can be
considered a denial of service, please?

> Tulio,
>
> Could you please review the possibility of quadratic behaviour and respond
> prompty? I don't want this to hold up the release.

I confirm this algorithm does have a quadratic worst case, which appears
when the needle length is <= 2048.
If the needle is longer than 2048, it falls back to the previous
implementation.
  
Carlos O'Donell July 22, 2015, 6:57 p.m. UTC | #8
On 07/22/2015 02:29 PM, Tulio Magno Quites Machado Filho wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> On 07/22/2015 12:12 PM, Joseph Myers wrote:
>>> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
>>>
>>>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
>>>>> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
>>>>>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
>>>>>>>> May I proceed with this commit?
>>>>>>>
>>>>>>> Yes, please commit this for 2.22.
>>>>>>
>>>>>> For the record I trust IBM to make sure these patches make incremental
>>>>>> improvements in performance even if they are not the best possible
>>>>>> performance as pointed out by Ondrej Bilka.
>>>>>>
>>>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
>>>>> all. I did that as test how much we could test IBM to verify patches. 
>>>>>
>>>>> I pointed out that it could have possibly quadratic behaviour which
>>>>> still does. So please don't accept unreviewed patches next time.
>>>>
>>>> They showed cases for which the code does go faster and objectively
>>>> so using the microbenchmark, and that's a win for now. Please continue 
>>>> to work with IBM to remove the quadratic worst case.
>>>>
>>>> Tulio, You will need to work out why you have quadratic worst case.
>>>> It's certainly something we try to avoid. Did you make a particular
>>>> decision not to avoid it?
>>>
>>> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
>>> that a security hole (denial of service) that needs to block the release 
>>> of 2.22 until it's fixed (possibly by removing the implementation in 
>>> question).
> 
> Could you elaborate why a quadratic worst case in a string function can be
> considered a denial of service, please?
> 
>> Tulio,
>>
>> Could you please review the possibility of quadratic behaviour and respond
>> prompty? I don't want this to hold up the release.
> 
> I confirm this algorithm does have a quadratic worst case, which appears if
> needle <= 2048.
> If the needle > 2048, it falls back to the previous implementation.
 
While we have striven to remove quadratic worst cases from the string ops,
there has been no policy against them when the machine maintainer, knowing
their machine best, thinks a particular algorithm with such behaviour works
best.

It seems like we need some consensus on the requirements of string
operations, but such consensus will take sufficient time that 2.22 will be
released by then.

I am OK with leaving the patches in place and resolving the discussion of
quadratic behaviour and algorithm acceptability for 2.23.

Cheers,
Carlos.
  
Steven Munroe July 22, 2015, 6:58 p.m. UTC | #9
On Wed, 2015-07-22 at 15:29 -0300, Tulio Magno Quites Machado Filho
wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
> > On 07/22/2015 12:12 PM, Joseph Myers wrote:
> >> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> >> 
> >>> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> >>>> On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> >>>>> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> >>>>>>> May I proceed with this commit?
> >>>>>>
> >>>>>> Yes, please commit this for 2.22.
> >>>>>
> >>>>> For the record I trust IBM to make sure these patches make incremental
> >>>>> improvements in performance even if they are not the best possible
> >>>>> performance as pointed out by Ondrej Bilka.
> >>>>>
> >>>> Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> >>>> all. I did that as test how much we could test IBM to verify patches. 
> >>>>
> >>>> I pointed out that it could have possibly quadratic behaviour which
> >>>> still does. So please don't accept unreviewed patches next time.
> >>>
> >>> They showed cases for which the code does go faster and objectively
> >>> so using the microbenchmark, and that's a win for now. Please continue 
> >>> to work with IBM to remove the quadratic worst case.
> >>>
> >>> Tulio, You will need to work out why you have quadratic worst case.
> >>> It's certainly something we try to avoid. Did you make a particular
> >>> decision not to avoid it?
> >> 
> >> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> >> that a security hole (denial of service) that needs to block the release 
> >> of 2.22 until it's fixed (possibly by removing the implementation in 
> >> question).
> 
> Could you elaborate why a quadratic worst case in a string function can be
> considered a denial of service, please?
> 
> > Tulio,
> >
> > Could you please review the possibility of quadratic behaviour and respond
> > prompty? I don't want this to hold up the release.
> 
> I confirm this algorithm does have a quadratic worst case, which appears if
> needle <= 2048.
> If the needle > 2048, it falls back to the previous implementation.
> 
Well, it is somewhat worse, but whether this meets the definition of denial
of service is the debate.

Also, we can reduce this by changing the crossover point for the needle
size, but at some loss of overall performance for legitimate cases.

This gets back to my request for an objective standard. No one knows.
  
Joseph Myers July 22, 2015, 7:33 p.m. UTC | #10
On Wed, 22 Jul 2015, Carlos O'Donell wrote:

> > If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> > that a security hole (denial of service) that needs to block the release 
> > of 2.22 until it's fixed (possibly by removing the implementation in 
> > question).
> 
> Joseph,
> 
> We have had quadratic worse case in our string routines for years without
> it blocking a release. I agree that it is not the best case for a release

And I believe we established a consensus, when removing the SSE4 
implementation (bug 12100), that such implementations are not suitable for 
inclusion in glibc.

> to have such behaviour, but should it block this release?

As a denial-of-service regression (for any code that may use strstr on 
untrusted inputs), yes.
  
Joseph Myers July 22, 2015, 7:41 p.m. UTC | #11
On Wed, 22 Jul 2015, Tulio Magno Quites Machado Filho wrote:

> >> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> >> that a security hole (denial of service) that needs to block the release 
> >> of 2.22 until it's fixed (possibly by removing the implementation in 
> >> question).
> 
> Could you elaborate why a quadratic worst case in a string function can be
> considered a denial of service, please?

Because code may use strstr on untrusted inputs, meaning small inputs 
(e.g. MB in size) cause large (effectively infinite) CPU usage - any case 
where an attacker can cause your resource usage to grow more than linearly 
with their usage is a denial-of-service problem.  Thus gnulib considers 
such strstr implementations buggy and disables them (so preventing any of 
your optimizations from being effective for any program using gnulib), but 
not everything uses gnulib.

(There are cases, e.g. untrusted regular expressions, where glibc can't 
reasonably avoid such resource usage, and so anyone using untrusted 
regular expressions must apply resource limits externally.  But strstr 
isn't such a case.)

> > Tulio,
> >
> > Could you please review the possibility of quadratic behaviour and respond
> > prompty? I don't want this to hold up the release.
> 
> I confirm this algorithm does have a quadratic worst case, which appears if
> needle <= 2048.
> If the needle > 2048, it falls back to the previous implementation.

strstr should be O(m+n) (for needle length m, haystack length n); a naive 
implementation could be O(mn).  Are you saying that, for constant m <= 
2048, it's O(n^2)?  Because O(mn) for bounded m is fine (the threshold 
2048 would preferably have been determined by some sort of benchmark, but 
the precise choice of threshold isn't something to block a release).
  
Steven Munroe July 22, 2015, 8:10 p.m. UTC | #12
On Wed, 2015-07-22 at 19:41 +0000, Joseph Myers wrote:
> On Wed, 22 Jul 2015, Tulio Magno Quites Machado Filho wrote:
> 
> > >> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> > >> that a security hole (denial of service) that needs to block the release 
> > >> of 2.22 until it's fixed (possibly by removing the implementation in 
> > >> question).
> > 
> > Could you elaborate why a quadratic worst case in a string function can be
> > considered a denial of service, please?
> 
> Because code may use strstr on untrusted inputs, meaning small inputs 
> (e.g. MB in size) cause large (effectively infinite) CPU usage - any case 
> where an attacker can cause your resource usage to grow more than linearly 
> with their usage is a denial-of-service problem.  Thus gnulib considers 
> such strstr implementations buggy and disables them (so preventing any of 
> your optimizations from being effective for any program using gnulib), but 
> not everything uses gnulib.
> 
> (There are cases, e.g. untrusted regular expressions, where glibc can't 
> reasonably avoid such resource usage, and so anyone using untrusted 
> regular expressions must apply resource limits externally.  But strstr 
> isn't such a case.)
> 
> > > Tulio,
> > >
> > > Could you please review the possibility of quadratic behaviour and respond
> > > prompty? I don't want this to hold up the release.
> > 
> > I confirm this algorithm does have a quadratic worst case, which appears if
> > needle <= 2048.
> > If the needle > 2048, it falls back to the previous implementation.
> 
> strstr should be O(m+n) (for needle length m, haystack length n); a naive 
> implementation could be O(mn).  Are you saying that, for constant m <= 
> 2048, it's O(n^2)?  Because O(mn) for bounded m is fine (the threshold 
> 2048 would preferably have been determined by some sort of benchmark, but 
> the precise choice of threshold isn't something to block a release).
> 

This is not an objective standard, because real code in the real world
does not behave in a mathematically precise way.

What is good enough and what is not good enough has to be measured and
measurable.

So if the big needle is 4x slower than the small needle, is that a
denial of service? I don't think so!

What about 8x, or 10x? Maybe. But I still don't think so, because the
examples given are extremely unlikely in the real world, and in a
deliberately bad case you may just see a slightly higher peak load.

On a server with 100s-1000s of processors, who would notice?

100x, well, that might be a problem.

So what is your number? Because quadratic and O(this) and O(that) are
meaningless in this context.
  
Ondrej Bilka July 25, 2015, 5:28 a.m. UTC | #13
On Thu, Jul 16, 2015 at 04:16:12PM -0400, Carlos O'Donell wrote:
> On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> >>>> May I proceed with this commit?
> >>>
> >>> Yes, please commit this for 2.22.
> >>
> >> For the record I trust IBM to make sure these patches make incremental
> >> improvements in performance even if they are not the best possible
> >> performance as pointed out by Ondrej Bilka.
> >>
> > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > all. I did that as test how much we could test IBM to verify patches. 
> > 
> > I pointed out that it could have possibly quadratic behaviour which
> > still does. So please don't accept unreviewed patches next time.
> 
> They showed cases for which the code does go faster and objectively
> so using the microbenchmark, and that's a win for now. Please continue 
> to work with IBM to remove the quadratic worst case.
>
Carlos, that benchmark lacks consensus. When I wrote it in that thread, it
was explicitly mentioned as a counterexample of why it shouldn't be used.
Florian raised the objection that strstr isn't used on random inputs.

Also, I don't think it's an 'objective' benchmark at all; for example, it
never measures the case when we find the string, which could happen
relatively often when the user knows the string is there.
  
Ondrej Bilka July 25, 2015, 8:20 a.m. UTC | #14
On Wed, Jul 22, 2015 at 04:12:47PM +0000, Joseph Myers wrote:
> On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> 
> > On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> > >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> > >>>> May I proceed with this commit?
> > >>>
> > >>> Yes, please commit this for 2.22.
> > >>
> > >> For the record I trust IBM to make sure these patches make incremental
> > >> improvements in performance even if they are not the best possible
> > >> performance as pointed out by Ondrej Bilka.
> > >>
> > > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > > all. I did that as test how much we could test IBM to verify patches. 
> > > 
> > > I pointed out that it could have possibly quadratic behaviour which
> > > still does. So please don't accept unreviewed patches next time.
> > 
> > They showed cases for which the code does go faster and objectively
> > so using the microbenchmark, and that's a win for now. Please continue 
> > to work with IBM to remove the quadratic worst case.
> > 
> > Tulio, You will need to work out why you have quadratic worst case.
> > It's certainly something we try to avoid. Did you make a particular
> > decision not to avoid it?
> 
> If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> that a security hole (denial of service) that needs to block the release 
> of 2.22 until it's fixed (possibly by removing the implementation in 
> question).
> 
No, my objection was that this should be reverted as a patch that wasn't
reviewed. My main reason for believing that is the number of issues that
every competent reviewer should notice. The first red flag was Tulio
saying it was OK by him, then saying that he had received a message that
there is a bug in strstr.

That doesn't inspire much confidence.

First, the quadratic behaviour. I knew that it is bounded by 2048, but I
stayed silent. When Tulio said there was no work on that, I realized that
he either didn't read the patch or forgot that he reviewed the part which
calls strlen and strnlen and jumps to the generic strstr.

You and I read the code and found the 2048 constant weird. As reviewer, I
asked to clarify it, but the author just sent unrelated benchmarks, never
one that would convince us. On the benchmark that I previously sent in
this thread, I showed that it is 100 times slower for that needle, as I
claimed. If Tulio were familiar with the code, he would have suggested
decreasing that constant to 32.

As for the quadratic behaviour, that is mostly a shortcut, as I don't want
to repeat it. It is beyond my understanding how one can believe that in
practice an O(mn) algorithm with m=2048 could beat an O(n) one. I raised
that for all versions of strstr/strcasestr.

That is just a simple issue; I raised it months ago, so they had enough
time.

Then, as a matter of performance in the general case, I found several
obvious problems, so why didn't Tulio? The first is that strlen/strnlen is
called unnecessarily when the first strchr call returns null. That is a
hot path. The second is that the crosspage check is unnecessary, as you
could search for the ending character of the needle.

These are the arguments for why I don't think it was reviewed.

Joseph, what is the policy about assembly implementations when there is a
better generic implementation? This assembly is slower than a simple
optimization of the C strstr, with the patch. I repeatedly asked to check
it but never got a reply.

[PATCH] Improve generic strstr performance.

I submitted, earlier in this thread, a C implementation that beats the
assembly strstr.

To answer the objection that the benchmarks are inaccurate because strings
in practice are not random, I used dryrun, as I said earlier. I recorded
strstr calls, mainly in the gdb program.

I saved the profile trace here for replicability:
http://kam.mff.cuni.cz/~ondra/dryrunstrstrpower.tar.bz2

With the assembly I get the following:

[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 26567826
 took 26699020
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 26608298
 took 26698638
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 26752268
 took 26876982


replaying bash took 25841770

But when I replace it with the generic improvement, performance improves
in practice.

[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 25096738
 took 25187742
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 25182840
 took 25255354
[neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u


replaying gdb took 25060592
 took 25165898




The main reason, which I explained before, is that the patch goes to
considerable lengths to optimize the cold path while not optimizing the
hot one. In a random benchmark, the probability that you match k
characters of the needle decreases exponentially (similar behaviour is
observed in practice).

This patch, for needles of less than 8 bytes, does roughly the following;
larger needles have more complicated code that checks 8 bytes of the
needle at a time:

nsize = strlen (needle);
if (strnlen (haystack, nsize) < nsize)
  return NULL;

if (nsize <= 9)
  {
    /* Needle bytes after the first one; mask keeps nsize - 1 bytes.  */
    needle_start = *(uint64_t *) (needle + 1);
    mask = ~0ULL >> (8 * (9 - nsize));
    while (1)
      {
        haystack = strchr (haystack, needle[0]);
        if (haystack == NULL)
          return NULL;
        if ((*(uint64_t *) (haystack + 1) & mask) == needle_start)
          return haystack;
        haystack++;
      }
  }

The problem is that optimizations like

   if ((*(uint64_t *) (haystack + 1) & mask) == needle_start)
     return haystack;

do more harm than good. Using a simple byte check is faster in practice,
as the mismatch probably occurs at the second character, and the overhead
of testing more characters is bigger than the branch misprediction
penalty. When I ran the profile trace, the average numbers of digraphs and
trigraphs were the following:

digraph count:  0.0126 trigraph count:  0.0024

On the other hand, it does nothing about the hot strchr calls. If
performance is a concern, these should at least be inlined, or should
iterate over a mask of first-character matches.

I am only saying, from my experience, which optimizations will help. The
optimizations in this patch will not.
  
Steven Munroe July 27, 2015, 1:26 a.m. UTC | #15
On Sat, 2015-07-25 at 10:20 +0200, Ondřej Bílka wrote:
> On Wed, Jul 22, 2015 at 04:12:47PM +0000, Joseph Myers wrote:
> > On Thu, 16 Jul 2015, Carlos O'Donell wrote:
> > 
> > > On 07/16/2015 03:55 PM, Ondřej Bílka wrote:
> > > > On Wed, Jul 15, 2015 at 09:40:00PM -0400, Carlos O'Donell wrote:
> > > >> On 07/15/2015 08:43 PM, Carlos O'Donell wrote:
> > > >>>> May I proceed with this commit?
> > > >>>
> > > >>> Yes, please commit this for 2.22.
> > > >>
> > > >> For the record I trust IBM to make sure these patches make incremental
> > > >> improvements in performance even if they are not the best possible
> > > >> performance as pointed out by Ondrej Bilka.
> > > >>
> > > > Sorry Carlos, your trust is misplaced. This patch wasn't reviewed at
> > > > all. I did that as test how much we could test IBM to verify patches. 
> > > > 
> > > > I pointed out that it could have possibly quadratic behaviour which
> > > > still does. So please don't accept unreviewed patches next time.
> > > 
> > > They showed cases for which the code does go faster and objectively
> > > so using the microbenchmark, and that's a win for now. Please continue 
> > > to work with IBM to remove the quadratic worst case.
> > > 
> > > Tulio, You will need to work out why you have quadratic worst case.
> > > It's certainly something we try to avoid. Did you make a particular
> > > decision not to avoid it?
> > 
> > If there's a quadratic worst case newly introduced for 2.22, I'd consider 
> > that a security hole (denial of service) that needs to block the release 
> > of 2.22 until it's fixed (possibly by removing the implementation in 
> > question).
> > 
> No, my objection was that this should be reverted as a patch that
> wasn't reviewed. My main reason for believing that is the numerous
> issues that any competent reviewer should notice. The first red flag
> was Tulio saying it was OK by him, then saying that he had received a
> message that there is a bug in strstr.
> 

This is ridiculous. Most of this personal and unprofessional attack
below is based on incorrect assumptions (about things you do not
actually know) or on statements taken out of context and twisted to fit
your narrative.

You really need to take a deep breath and think before you say anything
else.


> That doesn't inspire much confidence.
> 
> First, the quadratic behaviour. I knew that it is bounded by 2048,
> but since Tulio said that no work had been done on it I stayed
> silent. When he said there was no work on it, I realized that he
> either didn't read the patch or forgot that he reviewed the part
> which calls strlen, strnlen and jumps to the generic strstr.
> 
> You and I read the code and find the 2048 constant weird. As a
> reviewer I asked for it to be clarified, but the author just sent
> unrelated benchmarks, never one that would convince us. On the
> benchmark that I previously sent in this thread I showed that it is
> 100 times slower for that needle, as I claimed. If Tulio were
> familiar with the code he would have suggested decreasing that
> constant to 32.
> 
> As for "quadratic", it is mostly shorthand, as I don't want to repeat
> myself. It is beyond my understanding how one can believe that in
> practice an O(mn) algorithm with m=2048 could beat an O(n) one. I
> raised that for all versions of strstr/strcasestr.
> 
> That is a simple issue, and I raised it months ago, so they had
> enough time.
> 
> Then, as a matter of performance in the general case, I found several
> obvious problems, so why didn't Tulio? The first is that
> strlen/strnlen is called unnecessarily when the first strchr call
> returns NULL. That is the hot path. The second is that the crosspage
> check is unnecessary, as you could search for the ending character of
> the needle instead.
> 
> These are the reasons why I don't think it was reviewed.
> 
> Joseph, what is the policy on assembly implementations when there is
> a better generic implementation? This assembly is slower than a
> simple optimization of the C strstr, given in the patch below. I
> repeatedly asked for it to be checked but never got a reply.
> 
> [PATCH] Improve generic strstr performance.
> 
> I submitted C implementation that beats assembly strstr previously in
> thread.
> 
> To answer objections that the benchmarks are inaccurate because
> strings in practice are not random, I used dryrun, as I said earlier.
> I recorded strstr calls, mainly from the gdb program.
> 
> I saved the profile trace here for reproducibility.
> http://kam.mff.cuni.cz/~ondra/dryrunstrstrpower.tar.bz2
> 
> With assembly I get following:
> 
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 26567826
>  took 26699020
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 26608298
>  took 26698638
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 26752268
>  took 26876982
> 
> 
> replaying bash took 25841770
> 
> But when I replace it with the generic improvement, it improves
> performance in practice.
> 
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 25096738
>  took 25187742
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 25182840
>  took 25255354
> [neleai@gcc1-power7 glibctest]$ ./testrun.sh ../dryrunstrstrpower/bin/bench_strstr -u
> 
> 
> replaying gdb took 25060592
>  took 25165898
> 
> 
> 
> 
> The main reason, as I explained before, is that it goes to
> considerable lengths to optimize the cold path while not optimizing
> the hot one. In a random benchmark the probability of matching k
> characters from the needle decreases exponentially (similar behaviour
> is observed in practice).
> 
> For needles shorter than 8 bytes this patch does roughly the
> following; larger needles use more complicated code that checks 8
> bytes of the needle at a time.
> 
> nsize = strlen (needle);
> if (strnlen (haystack, nsize) < nsize)
>   return NULL;
> 
> if (nsize <= 9)
>   {
>     needle_start = *(uint64_t *) (needle + 1);
>     mask = (~0UL) >> (8 * (9 - nsize));
>     while (1)
>       {
>         haystack = strchr (haystack, needle[0]);
>         if (haystack == NULL)
>           return NULL;
>         if ((*(uint64_t *) (haystack + 1) & mask) == needle_start)
>           return haystack;
>         haystack++;
>       }
>   }
> 
> The problem is that optimizations like
> 
>    if ((*(uint64_t *) (haystack + 1) & mask) == needle_start)
>           return haystack;
> 
> do more harm than good. Using a simple byte check is faster in
> practice, as the mismatch probably occurs at the second character and
> the overhead of testing more characters is bigger than the branch
> misprediction penalty. When I ran the profile trace, the average
> numbers of digraphs and trigraphs were as follows:
> 
> digraph count:  0.0126 trigraph count:  0.0024
> 
> On the other hand it does nothing about the hot strchr calls. If
> performance is a concern, these should at least be inlined, or the
> loop should iterate over a mask of first-character matches.
> 
> I am only saying, from experience, which optimizations will help. The
> optimizations in this patch will not.
>
  
Ondrej Bilka July 27, 2015, 9:19 a.m. UTC | #16
On Sun, Jul 26, 2015 at 08:26:06PM -0500, Steven Munroe wrote:
> On Sat, 2015-07-25 at 10:20 +0200, Ondřej Bílka wrote:
> > No, my objection was that this should be reverted as a patch that
> > wasn't reviewed. My main reason for believing that is the numerous
> > issues that any competent reviewer should notice. The first red
> > flag was Tulio saying it was OK by him, then saying that he had
> > received a message that there is a bug in strstr.
> > 
> 
> This is ridiculous. Most of this personal and unprofessional attack
> below is based on incorrect assumptions (about things you do not
> actually know) or on statements taken out of context and twisted to
> fit your narrative.
> 
> You really need to take a deep breath and think before you say
> anything else.
>

Could you please stop making ad hominem attacks? I have noticed that
you resort to these when I show that the facts say otherwise.

As for this being out of context: I expected that you would of course
deny everything and say something like that. So could you answer the
following three questions?

1. Reviews: it would be better to always post reviews publicly. But as
you still have them, could you send the reviews here to clarify?

2. Quadratic behaviour: it should be clear to any expert that with a
2047-byte needle you need to check all of those bytes for each byte of
the haystack. Even if you optimistically check 8 bytes per cycle, that
is roughly 256 cycles for every haystack byte. In practice it is
slower, as it is equivalent to a strcmp, which on power7 has the
following timing:
Length 1024, alignment 14/ 7:   593.5   651.016 318.844 447.203

So why didn't you insist on fixing that by changing the constant to
something reasonable?

3. Why, despite your frequent reviews, is the implementation in the
average case still slower than a simple bugfix of the generic strstr?

As for my making incorrect assumptions and taking things out of
context: it is quite easy to make vague accusations like this. So be
more specific: which assumptions, and where is your evidence that an
assumption is incorrect?

 
> 
> > [...]
  
Steven Munroe July 27, 2015, 12:48 p.m. UTC | #17
On Mon, 2015-07-27 at 11:19 +0200, Ondřej Bílka wrote:
> On Sun, Jul 26, 2015 at 08:26:06PM -0500, Steven Munroe wrote:
> > On Sat, 2015-07-25 at 10:20 +0200, Ondřej Bílka wrote:
> > > No, my objection was that this should be reverted as a patch
> > > that wasn't reviewed. My main reason for believing that is the
> > > numerous issues that any competent reviewer should notice. The
> > > first red flag was Tulio saying it was OK by him, then saying
> > > that he had received a message that there is a bug in strstr.
> > > 
> > 
> > This is ridiculous. Most of this personal and unprofessional
> > attack below is based on incorrect assumptions (about things you
> > do not actually know) or on statements taken out of context and
> > twisted to fit your narrative.
> > 
> > You really need to take a deep breath and think before you say
> > anything else.
> >
> 
> Could you please stop making ad hominem attacks? I have noticed that
> you resort to these when I show that the facts say otherwise.
> 
> As for this being out of context: I expected that you would of course
> deny everything and say something like that. So could you answer the
> following three questions?
> 
> 1. Reviews: it would be better to always post reviews publicly. But
> as you still have them, could you send the reviews here to clarify?
> 
> 2. Quadratic behaviour: it should be clear to any expert that with a
> 2047-byte needle you need to check all of those bytes for each byte
> of the haystack. Even if you optimistically check 8 bytes per cycle,
> that is roughly 256 cycles for every haystack byte. In practice it is
> slower, as it is equivalent to a strcmp, which on power7 has the
> following timing:
> Length 1024, alignment 14/ 7:   593.5   651.016 318.844 447.203
> 
> So why didn't you insist on fixing that by changing the constant to
> something reasonable?
> 
> 3. Why, despite your frequent reviews, is the implementation in the
> average case still slower than a simple bugfix of the generic strstr?
> 
> As for my making incorrect assumptions and taking things out of
> context: it is quite easy to make vague accusations like this. So be
> more specific: which assumptions, and where is your evidence that an
> assumption is incorrect?
> 

What we have here is a complete failure to communicate.

On June 9th I suggested that if you felt so strongly about this, for
reasons that I did not understand, then you should provide an objective
benchmark that proved your point.

https://sourceware.org/ml/libc-alpha/2015-06/msg00365.html

On June 10th Roland agreed that that was a reasonable way to proceed.

https://sourceware.org/ml/libc-alpha/2015-06/msg00367.html

You rejected this. So Raji continued to use the one strstr benchmark
we have. We tried to compromise by adjusting the crossover point
between the new and old algorithms to avoid the specific case you gave.

And you rejected that.

On July 16th Raji submitted the V3 of the patch

https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html

On July 16th you basically told Carlos that my team was incompetent
and could not be trusted; hmm, does this qualify as an ad hominem
attack?

https://sourceware.org/ml/libc-alpha/2015-07/msg00501.html

At that point I stopped listening. There does not seem to be anything
that I can say or do to satisfy you, so I stopped trying.

On July 23rd Carlos asked for a second opinion on whether what we had
was in fact quadratic.

https://sourceware.org/ml/libc-alpha/2015-07/msg00799.html

The consensus was no; it is at worst O(nm).

This has wasted huge amounts of time, time we could have used to
improve the code for the next release.

I will make you a promise: if you provide an objective benchmark that
illustrates this issue, and get the community to accept it as an
objective measure of a realistic scenario, I will ask the team to work
on appropriate improvements.
  
Joseph Myers July 27, 2015, 2:51 p.m. UTC | #18
On Sat, 25 Jul 2015, Ondřej Bílka wrote:

> Joseph, what is policy about assembly implementations when there is
> better generic implementation? This assembly is slower than when do

Removal of such an assembly implementation is a simple matter of getting 
consensus, typically based on benchmark results from the checked-in 
benchtests (so if necessary, you may need to get consensus on an addition 
to such benchtests first).  The most likely form of such consensus is 
through agreement from the architecture maintainer, but I don't expect 
such changes, properly supported by benchmark results from the checked-in 
benchtests, to be controversial.

Of course the comparison is with the *checked-in* generic implementation, 
not something posted but not checked in.

> simple optimization of C strstr, with patch. I repeatedly asked to
> check it but never got reply. 
> 
> [PATCH] Improve generic strstr performance.

Well, if a patch takes a while to get reviewed, keep pinging, while making 
sure that the patch submission includes all necessary information, that 
the required benchmarks have been reviewed and checked in first, etc.

You have about 80 unreviewed patches showing in the patchwork list.  If 
any of those are superseded by later versions of the same patch, I 
strongly advise marking the earlier versions as superseded in patchwork so 
that anyone using patchwork to find patches to review finds those patches 
of yours that actually need review and not a load of noise from older 
superseded patch versions.  (The same applies to anyone submitting patches 
- keep the list of your patches pending review clean - but your list is 
particularly long.)
  
Carlos O'Donell July 27, 2015, 6:25 p.m. UTC | #19
On 07/27/2015 05:19 AM, Ondřej Bílka wrote:
> 1. Reviews, it would be better to always write reviews publictly. But as
> you still have them could you send reviews here to clarify.

As machine maintainers they can commit their code as if they had consensus.

Your after-the-fact reviews are good, and should be considered follow-on
work. You should work closely with IBM in a professional and technical
manner, present your own patches to the existing IBM code and discuss
the testing and changes that you made. All of this *on top* of whatever
is in the existing master branch. You must present this work in a clear
and concise manner, providing IBM the tools with which to evaluate and test
your implementation.

The IBM maintainers do not need to convince you. They have consensus as
maintainers. You need to convince *them* that your solution is better rather
than attempting to block their patches, which is not your responsibility.

Lastly, beware that your single dissenting opinion may not constitute
an important part of the concerned interests of the glibc community[1].
Therefore, even if you are correct, the community may tell you that your
comments will not be considered until you find a way to work with the
machine maintainer.

I for one would like to see you working *with* IBM instead of what appears
to be an antagonistic relationship surrounding these performance-related
changes. I think both sides should look to the positive aspects of having
more people reviewing existing implementations for performance benefits.

Cheers,
Carlos.

[1] https://sourceware.org/glibc/wiki/Consensus
  

Patch

diff --git a/benchtests/bench-strstr.c b/benchtests/bench-strstr.c
index 74f3ee8..b758969 100644
--- a/benchtests/bench-strstr.c
+++ b/benchtests/bench-strstr.c
@@ -128,6 +128,23 @@  test_main (void)
     printf ("\t%s", impl->name);
   putchar ('\n');
 
+
+  char s1[1000000],s2[2000];
+
+  memset (s1, 'a', 1000000);
+  memset (s2, 'a', 2000);
+  s1[999999] = '\0';
+  s2[1998] = 'b';
+  s2[1999] = '\0';
+
+  {
+    printf ("strstr(\"aaa...aaa\",\"aaa...aaab\")\n");
+    FOR_EACH_IMPL (impl, 0)
+      do_one_test (impl, s1, s2, NULL);
+    putchar('\n');
+  }
+
+
   for (size_t klen = 2; klen < 32; ++klen)
     for (size_t hlen = 2 * klen; hlen < 16 * klen; hlen += klen)
       {