diff mbox

powerpc: strcasestr optimization

Message ID	556C36D8.2070208@linux.vnet.ibm.com
State	Superseded
Delegated to:	Tulio Magno Quites Machado Filho
Headers	Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org Message-ID: <556C36D8.2070208@linux.vnet.ibm.com> Date: Mon, 01 Jun 2015 16:11:28 +0530 From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: GNU C Library <libc-alpha@sourceware.org>, Steve Munroe <sjmunroe@us.ibm.com> Subject: [PATCH] powerpc: strcasestr optimization References: <55687597.1060101@linux.vnet.ibm.com> In-Reply-To: <55687597.1060101@linux.vnet.ibm.com> Content-Type: multipart/mixed; boundary="------------040902070907040603020707"

Commit Message

Rajalakshmi S June 1, 2015, 10:41 a.m. UTC

  This patch optimizes strcasestr function for power >= 7 systems.
This patch uses optimized strlen and strnlen for calculating
string length and the average improvement of this optimization is ~40%.
This patch is tested on powerpc64 and powerpc64le.
Attached the benchresults with this new patch.

2015-05-29  Rajalakshmi Srinivasaraghavan  <raji@linux.vnet.ibm.com>

	* sysdeps/powerpc/powerpc64/multiarch/Makefile: Add strcasestr().
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Likewise.
	* sysdeps/powerpc/powerpc64/power7/strcasestr.S: New File.
	* sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S: New File.
	* sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c: New File.
	* sysdeps/powerpc/powerpc64/multiarch/strcasestr.c: New File.
---
   sysdeps/powerpc/powerpc64/multiarch/Makefile       |   2 +-
   .../powerpc/powerpc64/multiarch/ifunc-impl-list.c  |   7 +
   .../powerpc64/multiarch/strcasestr-power7.S        |  43 +++++
   .../powerpc/powerpc64/multiarch/strcasestr-ppc64.c |  34 ++++
   sysdeps/powerpc/powerpc64/multiarch/strcasestr.c   |  37 +++++
   sysdeps/powerpc/powerpc64/power7/strcasestr.S      | 177
+++++++++++++++++++++
   6 files changed, 299 insertions(+), 1 deletion(-)
   create mode 100644 
sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S
   create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c
   create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strcasestr.c
   create mode 100644 sysdeps/powerpc/powerpc64/power7/strcasestr.S

Comments

Ondrej Bilka June 1, 2015, 12:28 p.m. UTC | #1

On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> 
> This patch optimizes strcasestr function for power >= 7 systems.
> This patch uses optimized strlen and strnlen for calculating
> string length and the average improvement of this optimization is ~40%.
> This patch is tested on powerpc64 and powerpc64le.
> Attached the benchresults with this new patch.
> 
Thats not enough. As strcasestr that I submited is around three times
slower your implementation would likely be regression over generic one.

A problem here is that you use moronic algorithm. Fix algorithm first
before trying to optimize it.

Steven Munroe June 1, 2015, 2:36 p.m. UTC | #2

On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote:
> On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> > 
> > This patch optimizes strcasestr function for power >= 7 systems.
> > This patch uses optimized strlen and strnlen for calculating
> > string length and the average improvement of this optimization is ~40%.
> > This patch is tested on powerpc64 and powerpc64le.
> > Attached the benchresults with this new patch.
> > 
> Thats not enough. As strcasestr that I submited is around three times
> slower your implementation would likely be regression over generic one.
> 
> A problem here is that you use moronic algorithm. Fix algorithm first
> before trying to optimize it.
> 

This is not very helpful. You are demanding changes without clear
explanation and justification.

What is wrong with Raja's algorithm? What is insufficient in the
benchmark data she has provided? And why do you think your specific
design applies to PowerISA and POWER7/POWER8 micro-architecture.

What data do you have that justified this objection?

Ondrej Bilka June 1, 2015, 4:22 p.m. UTC | #3

On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote:
> On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote:
> > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > 
> > > This patch optimizes strcasestr function for power >= 7 systems.
> > > This patch uses optimized strlen and strnlen for calculating
> > > string length and the average improvement of this optimization is ~40%.
> > > This patch is tested on powerpc64 and powerpc64le.
> > > Attached the benchresults with this new patch.
> > > 
> > Thats not enough. As strcasestr that I submited is around three times
> > slower your implementation would likely be regression over generic one.
> > 
> > A problem here is that you use moronic algorithm. Fix algorithm first
> > before trying to optimize it.
> > 
> 
> This is not very helpful. You are demanding changes without clear
> explanation and justification.
> 
> What is wrong with Raja's algorithm? What is insufficient in the
> benchmark data she has provided? And why do you think your specific
> design applies to PowerISA and POWER7/POWER8 micro-architecture.
> 
> What data do you have that justified this objection?

I replied on strstr patch thread on why what she submitted is
performance regression. So I will repeat arguments from other thread
which still apply.

First was problem with quadratic behaviour. She tried to fix it but it
isn't a fix at all. Just benchmark 

strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab")

That call would take around hundred times than before which is
unacceptable.

If we ignore that red flag second problem was that benchmark she used is
bogus. It test with periodic haystacks, needle is copy of first bytes of
haystack with last byte set to something else.

1234567890abcxyz1234567890abcxyz1234567890abcxyz...

These are far from real inputs. Periodicity could introduce artefacts in
algorithms (like that its worst case for my digraph search which relies
on fact that in most strings characters are relatively independent.)
My worst fear that some implementation would be ten times faster than in
practice. There could be branch that is normally unpredictable. But as
input is regular it could be alternatively taken and not taken which cpu
recognizes and will predict it perfectly.

This isn't main problem which is that result of benchmark depends on
parameter and each submitter could make his benchmark best by selecting
appropriate value of that parameter. This could apply by selecting
period to string n or if we fix previous paragraph by using random
strings by selecting probability of character.

Performance would differ by order of magnitude depending on
choice of that n. A simple comparison for strstr would be compare
two-way algorithm with naive algorithm that looks like

while (s = strchr(s, n[0]))
  check_rest

If you set n=2 or random strings with only a or b then strchr would
cause big overhead per call and it could be five times slower than
two-way algorithm whose performance is mostly constant.

On other hand with n=255 or using all values a strchr would quickly skip
256 bytes on average and be say 5 times faster than two-way algorithm.

If you draw performance characteristic they meet at same point where
they run at same speed. If we assume that random strings model real ones
well how can you be sure that you selected n constiderably less than
encountered in real programs and it shows wrong implementation as best?

Just use same patch like I send with ((unsigned) rand())%16 + 1 and you
will see completely different numbers in benchmark.

I didn't touched that benchmarks couldn't produce a correct result due
additional factors, like that in practice most inputs are small so a
branch prediction in implementation header matters a lot. But this
benchmark doesn't measure it as it calls functions with same arguments
over and over again which makes these branches.

That briefly covers what was wrong with benchmark.

As microarchitecture one doesn't need to know details to see that
something is fundamentally wrong. A algorithm here is essentially
byte-by-byte check which essentially does following

uint64_t haystack_window, needle_window;
while (!end)
  {
    haystack_window = haystack_window << 8 | tolower (haystack[i])
    if (haystack_window == needle_window)
      match_rest
    i++;
  }

A problem of that is that operations there are useless most of time. For
original benchmark 15 times out of 16 you don't have to use shifts and
other instructions to compare needle with haystack. At that position
they differ on first byte. So her algorithm is worse than naive doing
following with same optimizations as in previous case.

while (!end)              
  {                       
    if (tolower(haystack[i]) ==  tolower(needle[0]))
      match_rest
    i++;
  }

As that was covered my final comment was that it isn't substantial
speedup. My implemantation is generic and gives large boost 
so why should be powerpc different.

As I don't have powerpc access now apply my patches

[PATCH v5] Generic string skeleton
[PATCH v5 4*] Generic string search functions (strstr, strcasestr, memmem)

and run (preferably fixed) benchmark with these. As gains that I see on
x64 are bigger than ones gained by this assembly you will likely see
that generic implementation is indeed better and it would be pointless
to try review that only to remove it shortly after adding to improve
performance.

Steven Munroe June 2, 2015, 6:47 p.m. UTC | #4

On Mon, 2015-06-01 at 18:22 +0200, Ondřej Bílka wrote:
> On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote:
> > On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote:
> > > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > > 
> > > > This patch optimizes strcasestr function for power >= 7 systems.
> > > > This patch uses optimized strlen and strnlen for calculating
> > > > string length and the average improvement of this optimization is ~40%.
> > > > This patch is tested on powerpc64 and powerpc64le.
> > > > Attached the benchresults with this new patch.
> > > > 
> > > Thats not enough. As strcasestr that I submited is around three times
> > > slower your implementation would likely be regression over generic one.
> > > 
> > > A problem here is that you use moronic algorithm. Fix algorithm first
> > > before trying to optimize it.
> > > 
> > 
> > This is not very helpful. You are demanding changes without clear
> > explanation and justification.
> > 
> > What is wrong with Raja's algorithm? What is insufficient in the
> > benchmark data she has provided? And why do you think your specific
> > design applies to PowerISA and POWER7/POWER8 micro-architecture.
> > 
> > What data do you have that justified this objection?
> 
> I replied on strstr patch thread on why what she submitted is
> performance regression. So I will repeat arguments from other thread
> which still apply.
> 
> First was problem with quadratic behaviour. She tried to fix it but it
> isn't a fix at all. Just benchmark 

Which is fixed in the latest patch
> 
> strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab")
> 
> That call would take around hundred times than before which is
> unacceptable.
> 
> If we ignore that red flag second problem was that benchmark she used is
> bogus. It test with periodic haystacks, needle is copy of first bytes of
> haystack with last byte set to something else.
> 
Have you looked at he latest patch? If the benchmark provided does not
cover your assertion, then it is your responsibility to provide and
upstream the appropriate benchmark. And convince the community that
benchmark covers a valid issue.

> snip
> 
> As microarchitecture one doesn't need to know details to see that
> something is fundamentally wrong. A algorithm here is essentially
> byte-by-byte check which essentially does following
> 

That assertion is at odds with my experience. Every day I see code that
somebody thinks is optimized but in reality is not. Current processors
are vastly more complicated then most realize and RISC processors have
stronger interactions between compiler and micro-architecture.

So it is on you have to prove that your generic optimization actually
works for more then one platform.

> snip

> 
> As that was covered my final comment was that it isn't substantial
> speedup. My implemantation is generic and gives large boost 
> so why should be powerpc different.
> 
> As I don't have powerpc access now apply my patches
> 
> [PATCH v5] Generic string skeleton
> [PATCH v5 4*] Generic string search functions (strstr, strcasestr, memmem)
> 
Every one has access to the the GCC Compile farm 

https://gcc.gnu.org/wiki/CompileFarm

So there is no excuse to not have data to back you assertion.

Ondrej Bilka June 2, 2015, 9 p.m. UTC | #5

On Tue, Jun 02, 2015 at 01:47:03PM -0500, Steven Munroe wrote:
> On Mon, 2015-06-01 at 18:22 +0200, Ondřej Bílka wrote:
> > On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote:
> > > On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote:
> > > > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > > > 
> > > > > This patch optimizes strcasestr function for power >= 7 systems.
> > > > > This patch uses optimized strlen and strnlen for calculating
> > > > > string length and the average improvement of this optimization is ~40%.
> > > > > This patch is tested on powerpc64 and powerpc64le.
> > > > > Attached the benchresults with this new patch.
> > > > > 
> > > > Thats not enough. As strcasestr that I submited is around three times
> > > > slower your implementation would likely be regression over generic one.
> > > > 
> > > > A problem here is that you use moronic algorithm. Fix algorithm first
> > > > before trying to optimize it.
> > > > 
> > > 
> > > This is not very helpful. You are demanding changes without clear
> > > explanation and justification.
> > > 
> > > What is wrong with Raja's algorithm? What is insufficient in the
> > > benchmark data she has provided? And why do you think your specific
> > > design applies to PowerISA and POWER7/POWER8 micro-architecture.
> > > 
> > > What data do you have that justified this objection?
> > 
> > I replied on strstr patch thread on why what she submitted is
> > performance regression. So I will repeat arguments from other thread
> > which still apply.
> > 
> > First was problem with quadratic behaviour. She tried to fix it but it
> > isn't a fix at all. Just benchmark 
> 
> Which is fixed in the latest patch
> > 
> > strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab")
> > 
Did you read also following line? It isn't solved at all.




> > That call would take around hundred times than before which is
> > unacceptable.
> > 
> > If we ignore that red flag second problem was that benchmark she used is
> > bogus. It test with periodic haystacks, needle is copy of first bytes of
> > haystack with last byte set to something else.
> > 
> Have you looked at he latest patch? If the benchmark provided does not
> cover your assertion, then it is your responsibility to provide and
> upstream the appropriate benchmark. And convince the community that
> benchmark covers a valid issue.
> 
That was my comment why benchmark isn't feasible to prove performance as
I could for almost any patch write benchmark that shows its regression
by selecting suitable data. Question is what happens in reality and for
that you shouldn't use artifical benchmarks.
> > snip
> > 
> > As microarchitecture one doesn't need to know details to see that
> > something is fundamentally wrong. A algorithm here is essentially
> > byte-by-byte check which essentially does following
> > 
> 
> That assertion is at odds with my experience. Every day I see code that
> somebody thinks is optimized but in reality is not. Current processors
> are vastly more complicated then most realize and RISC processors have
> stronger interactions between compiler and micro-architecture.
> 
> So it is on you have to prove that your generic optimization actually
> works for more then one platform.
> 
I have same especially when author submits an assembly improvement where
algorithm is completely wrong and doesn't cover known performance
issues.

> > snip
> 
> > 
> > As that was covered my final comment was that it isn't substantial
> > speedup. My implemantation is generic and gives large boost 
> > so why should be powerpc different.
> > 
> > As I don't have powerpc access now apply my patches
> > 
> > [PATCH v5] Generic string skeleton
> > [PATCH v5 4*] Generic string search functions (strstr, strcasestr, memmem)
> > 
> Every one has access to the the GCC Compile farm 
> 
> https://gcc.gnu.org/wiki/CompileFarm
> 
> So there is no excuse to not have data to back you assertion.

Except that all power machines except osuosl are offline. There may be
osuosl ones but gcc11[012].osuosl.org which don't resolve

Steven Munroe June 9, 2015, 8:40 p.m. UTC | #6

On Tue, 2015-06-02 at 23:00 +0200, Ondřej Bílka wrote:
> On Tue, Jun 02, 2015 at 01:47:03PM -0500, Steven Munroe wrote:
> > On Mon, 2015-06-01 at 18:22 +0200, Ondřej Bílka wrote:
> > > On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote:
> > > > On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote:
> > > > > On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> > > > > > 
> > > > > > This patch optimizes strcasestr function for power >= 7 systems.
> > > > > > This patch uses optimized strlen and strnlen for calculating
> > > > > > string length and the average improvement of this optimization is ~40%.
> > > > > > This patch is tested on powerpc64 and powerpc64le.
> > > > > > Attached the benchresults with this new patch.
> > > > > > 
> > > > > Thats not enough. As strcasestr that I submited is around three times
> > > > > slower your implementation would likely be regression over generic one.
> > > > > 
> > > > > A problem here is that you use moronic algorithm. Fix algorithm first
> > > > > before trying to optimize it.
> > > > > 
> > > > 
> > > > This is not very helpful. You are demanding changes without clear
> > > > explanation and justification.
> > > > 
> > > > What is wrong with Raja's algorithm? What is insufficient in the
> > > > benchmark data she has provided? And why do you think your specific
> > > > design applies to PowerISA and POWER7/POWER8 micro-architecture.
> > > > 
> > > > What data do you have that justified this objection?
> > > 
> > > I replied on strstr patch thread on why what she submitted is
> > > performance regression. So I will repeat arguments from other thread
> > > which still apply.
> > > 
> > > First was problem with quadratic behaviour. She tried to fix it but it
> > > isn't a fix at all. Just benchmark 
> > 
> > Which is fixed in the latest patch
> > > 
> > > strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab")
> > > 
> Did you read also following line? It isn't solved at all.
> > > That call would take around hundred times than before which is
> > > unacceptable.
> > > 

Ok we need to make progress, either you need to provide data from a
POWER machine that proves your point. Or you need to proved a benchmark
that can be include in GLIBC and we can run, that proves your point.

So far this only your opinion from a abstract perspective vs my opinion
based on my experience as a platform maintainer. It is past time to
break this logjam!

Anyone disagree with this approach?

> > So there is no excuse to not have data to back you assertion.
> 
> Except that all power machines except osuosl are offline. There may be
> osuosl ones but gcc11[012].osuosl.org which don't resolve
> 

My team assures me that both gcc110 gcc112 are up and available.

-bash-4.3$ uptime
01:30PM   up 127 days,   3:38,  0 users,  load average: 2.36, 2.11, 1.94

you might try via the fsffrance.org domain.

Roland McGrath June 9, 2015, 9:05 p.m. UTC | #7

Our policy is that all such questions should be settled objectively in a
reproducible manner.  That means a new benchmark should be added that can
demonstrate the effect of the library change in question, and consensus
reached on the utility of that particular benchmark (which is always a
somewhat subjective issue).  Once the benchmark is in, then objective
reports of benchmark results before and after a change on various hardware
must be shown to justify the library change.  The onus for making all this
happen is on the person who wants to change the code from its status quo.

Steven Munroe June 9, 2015, 9:26 p.m. UTC | #8

On Tue, 2015-06-09 at 14:05 -0700, Roland McGrath wrote:
> Our policy is that all such questions should be settled objectively in a
> reproducible manner.  That means a new benchmark should be added that can
> demonstrate the effect of the library change in question, and consensus
> reached on the utility of that particular benchmark (which is always a
> somewhat subjective issue).  Once the benchmark is in, then objective
> reports of benchmark results before and after a change on various hardware
> must be shown to justify the library change.  The onus for making all this
> happen is on the person who wants to change the code from its status quo.
> 

In https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html Raja
provided the results for the existing benchmark. 

So we have meet your and the community requirement.

Ondřej Bílka is asserting that not good enough based based on his
opinion that the algorithm is flawed. But he has not provide a benchmark
that we can use to verify this.

Since we do not understand what the problem is, we can not provide a new
benchmark.

If Ondřej Bílka believes there is a problem that only he understands
then the onus is on him to provide the appropriate benchmark.

Rajalakshmi S June 10, 2015, 6:34 p.m. UTC | #9

On 06/10/2015 04:18 PM, Ondřej Bílka wrote:
>
>   The onus for making all this happen is on the person who wants to change the code from its status quo.
>
> That would mean that its Rajalakshmi and responsibility to
> write benchmark that shows its performance improvement. I stated my
> objections explicitly described which inputs to use.
>
> Problem is that Steven failed to do his duty as a powerpc maintainer
> which is to read and review change, check benchmark inputs and that
> they are adequate. I could supply really bad implementation of strstr
> which first checks special case that needle and haystack are periodic
> abcdef sequences and returns result expected from benchmark.
>
> I did just simple request that every freshmen should know and Steven
> should as powerpc maintainer check how its handled. Its that naive
> algorithm and proposed algorithm are slow on inputs of form
>
> strstr("aaaaa....aaaa","aaaa...aaab");
>
> As from your previous mail its their responsibility to add that case for
> benchmark. A 'solution' on v2 would be to switch into two-way algorithm
> when size is bigger than 2048. That isn't solution as you still have
> problem with size 2047. I seen somewhere comment that its security
> problem when you have website with search function on user comments and
> implementation internally uses strstr then attacker could just post
> comment with aaaaa...a and search few times for aa...ab.
>
2048 number is chosen by checking various length inputs and comparing 
the performance. So for 2047 case, there wont be worst performance.

> Roland as its really basic issue I think that its proposer
> responsibility to check. Also for strstr thread Carlos said about strstr
> quadratic case:
>
>> We should add gnulib's test to the microbenchmarks.
>
>
> Then as I commented that algorithm is flawed its from my experience,
> Raji's description was:
>
> For case 1, I always compare 8 characters at a time using cmpb instruction.
> instruction.
> -> perform basic strlen, strnlen, checks. Move haystack based on strchr result.
> strchr result.
> -> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them.
> needle and compare them.
> If they are same proceed to next 8 bytes of haystack and needle.
> Else load next doubleword from haystack.Form a doubleword with last
> seven characters from first load and first character from
> second load and compare again with needle. Proceed in a similar way.
> This gives good improvement.
>
Lets not mix the review comments of strstr and strcasestr.(as the 
algorithm is different)
> My comment was that it wastes time most of time as in common case these
> already differ in first byte so all these shifts were completely
> unnecessary, a simple optimization of naive algorithm using
>
> while(s=strchr(s,n[0]))
If I use while(s=strchr(s,n[0]) as mentioned in
https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html
the quadratic testcase gives worst peformance. Attached the testcase.
>
> would be faster. Steven instead of doing his duty as powerpc maintainer
> and checking result of benchtest that I send instead tried to bullshit
> himself by telling that powerpc architecture is too special. Then he
> asked me do do benchmark instead as stalling tactics.
>
> If he did his duty of reading benchmark results there is already
> evidence. First to notice is there are regressions in original
> benchmark, like following:
>
> Length   30/2, alignment  0/ 9, found:  20.7812 32.2656 13.125  19.7188
> Length   30/2, alignment  0/ 9, fail:   32.4688 33.9219 36.75   31.4688
>
I agree there are few regression cases(79 out of 13442 cases) in 
'proposed_benchresults.txt' attached in
https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html.
This is minimal when compared to the overall performance gain.

> That he should comment, as strstr on average end on 10.6 byte after
> haystack start its quite relevant.
>
> But if he read proposed benchmark where needle and haystack are randomly
> generated from list of 128 characters then he would easily see it.
I have used your proposed benchtest, and it always gives improvement for 
strcasestr.Please check the file 'newresults' in 
https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html
>
> Here is trimmed version to see pattern. Note that first there are lot of
> times where performance is less than 13. Thats case where strchr didn't
> find first character and we just return(generic case now does
> unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.)
>
> Its natural to ask what would happen if you called strchr again. In 120
> bytes it would probably didn't found second character and we would
> finish in 26 cycles at most not 111.5
>
> I expect that average loop performance would be 13.4375/135 = 0.1 cycles
> per byte not cycle per byte that this algorithm exhibits after he loses
> performance gains from first correct strchr call.
>
>                          simple_strstr   stupid_strstr   __strstr_power7
> __strstr_ppc
>
> Length    4/2, alignment  0/ 3, found:  9.35938 8.09375 7.90625 8.5
> Length   12/2, alignment  3/ 9, fail:   9.23438 16.5625 7.90625 9.20312
> Length   24/2, alignment  3/ 9, fail:   9.54688 26.2812 8.32812 10.1875
> Length   44/4, alignment  9/ 3, found:  13.4688 40.9219 9.59375 13.9688
> Length   44/4, alignment  9/ 3, fail:   13.4688 40.9531 50.6875 13.9844
> Length   52/4, alignment  3/ 9, found:  13.5312 47.2812 9.5625  14.7344
> Length   52/4, alignment  3/ 9, fail:   45.9688 53.625  35.9688 47.5938
> Length   75/5, alignment 15/ 9, fail:   73.75   67.9375 64.2188 75.9062
> Length   75/5, alignment 15/15, found:  15.2188 65.9062 10.7344 17.8438
> Length   75/5, alignment 15/15, fail:   61.5469 68.6562 50.375  63.0312
> Length  120/8, alignment 15/ 3, fail:   113.281 111.5   103.5   114.406
> Length  120/8, alignment 15/ 9, found:  21.25   99.1406 12.8438 28.1406
> Length  135/9, alignment 15/15, found:  23.125  110.734 13.4375 30.7969
> Length  135/9, alignment 15/15, fail:   147.484 113.781 110.516 165.984
> Length  168/12, alignment  3/15, found: 31.2344 134.172 13.6719 39.9375
> Length  168/12, alignment  3/15, fail:  170.766 137.609 131.859 179.859
> Length  168/12, alignment  9/ 0, found: 100.562 137.922 76.6562 112.578
> Length  168/12, alignment  9/ 0, fail:  168.703 144.078 144.828 181.203
> Length  168/12, alignment  9/ 3, found: 129.938 137.703 109.828 136.266
> Length  168/12, alignment  9/ 3, fail:  178.734 137.875 165.469 183.609
> Length  465/31, alignment 15/ 0, fail:  73.0469 363.375 32.2969 87.0625
> Length  465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625
> Length  465/31, alignment 15/ 3, fail:  512.922 373.188 437.047 556.844
> Length  465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75
> Length  465/31, alignment 15/ 9, fail:  484.312 376.406 423.703 513.672
> Length  465/31, alignment 15/15, found: 326.859 376     256.625 350.422
> Length  465/31, alignment 15/15, fail:  512.641 376.203 439.234 545.906
> Length 131071/16, alignment  0/ 0, found:       116583  108378  122125
> 127809
> Length 131071/16, alignment  0/ 0, fail:        118298  108639  122208
> 128700
>
>

Steven Munroe June 10, 2015, 7:06 p.m. UTC | #10

On Thu, 2015-06-11 at 00:04 +0530, Rajalakshmi Srinivasaraghavan wrote:
> 
> On 06/10/2015 04:18 PM, Ondřej Bílka wrote:
> >
> >   The onus for making all this happen is on the person who wants to change the code from its status quo.
> >
> > That would mean that its Rajalakshmi and responsibility to
> > write benchmark that shows its performance improvement. I stated my
> > objections explicitly described which inputs to use.
> >
> > Problem is that Steven failed to do his duty as a powerpc maintainer
> > which is to read and review change, check benchmark inputs and that
> > they are adequate. I could supply really bad implementation of strstr
> > which first checks special case that needle and haystack are periodic
> > abcdef sequences and returns result expected from benchmark.
> >
> > I did just simple request that every freshmen should know and Steven
> > should as powerpc maintainer check how its handled. Its that naive
> > algorithm and proposed algorithm are slow on inputs of form
> >
> > strstr("aaaaa....aaaa","aaaa...aaab");
> >
> > As from your previous mail its their responsibility to add that case for
> > benchmark. A 'solution' on v2 would be to switch into two-way algorithm
> > when size is bigger than 2048. That isn't solution as you still have
> > problem with size 2047. I seen somewhere comment that its security
> > problem when you have website with search function on user comments and
> > implementation internally uses strstr then attacker could just post
> > comment with aaaaa...a and search few times for aa...ab.
> >

Thanks Raja for correcting my misconception. So Raja did responded (in
https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html) with a new
benchmark that we believe should address Ondřej Bílka concerns.

I apologize if this error on my part cause any undo increases in blood
pressure or heart burn.

> 2048 number is chosen by checking various length inputs and comparing 
> the performance. So for 2047 case, there wont be worst performance.
> 
> > Roland as its really basic issue I think that its proposer
> > responsibility to check. Also for strstr thread Carlos said about strstr
> > quadratic case:
> >
> >> We should add gnulib's test to the microbenchmarks.
> >
> >
> > Then as I commented that algorithm is flawed its from my experience,
> > Raji's description was:
> >
> > For case 1, I always compare 8 characters at a time using cmpb instruction.
> > instruction.
> > -> perform basic strlen, strnlen, checks. Move haystack based on strchr result.
> > strchr result.
> > -> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them.
> > needle and compare them.
> > If they are same proceed to next 8 bytes of haystack and needle.
> > Else load next doubleword from haystack.Form a doubleword with last
> > seven characters from first load and first character from
> > second load and compare again with needle. Proceed in a similar way.
> > This gives good improvement.
> >
> Lets not mix the review comments of strstr and strcasestr.(as the 
> algorithm is different)
> > My comment was that it wastes time most of time as in common case these
> > already differ in first byte so all these shifts were completely
> > unnecessary, a simple optimization of naive algorithm using
> >
> > while(s=strchr(s,n[0]))
> If I use while(s=strchr(s,n[0]) as mentioned in
> https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html
> the quadratic testcase gives worst peformance. Attached the testcase.
> >
> > would be faster. Steven instead of doing his duty as powerpc maintainer
> > and checking result of benchtest that I send instead tried to bullshit
> > himself by telling that powerpc architecture is too special. Then he
> > asked me do do benchmark instead as stalling tactics.
> >
> > If he did his duty of reading benchmark results there is already
> > evidence. First to notice is there are regressions in original
> > benchmark, like following:
> >
> > Length   30/2, alignment  0/ 9, found:  20.7812 32.2656 13.125  19.7188
> > Length   30/2, alignment  0/ 9, fail:   32.4688 33.9219 36.75   31.4688
> >
> I agree there are few regression cases(79 out of 13442 cases) in 
> 'proposed_benchresults.txt' attached in
> https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html.
> This is minimal when compared to the overall performance gain.
> 
> > That he should comment, as strstr on average end on 10.6 byte after
> > haystack start its quite relevant.
> >
> > But if he read proposed benchmark where needle and haystack are randomly
> > generated from list of 128 characters then he would easily see it.
> I have used your proposed benchtest, and it always gives improvement for 
> strcasestr.Please check the file 'newresults' in 
> https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html
> >
> > Here is trimmed version to see pattern. Note that first there are lot of
> > times where performance is less than 13. Thats case where strchr didn't
> > find first character and we just return(generic case now does
> > unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.)
> >
> > Its natural to ask what would happen if you called strchr again. In 120
> > bytes it would probably didn't found second character and we would
> > finish in 26 cycles at most not 111.5
> >
> > I expect that average loop performance would be 13.4375/135 = 0.1 cycles
> > per byte not cycle per byte that this algorithm exhibits after he loses
> > performance gains from first correct strchr call.
> >
> >                          simple_strstr   stupid_strstr   __strstr_power7
> > __strstr_ppc
> >
> > Length    4/2, alignment  0/ 3, found:  9.35938 8.09375 7.90625 8.5
> > Length   12/2, alignment  3/ 9, fail:   9.23438 16.5625 7.90625 9.20312
> > Length   24/2, alignment  3/ 9, fail:   9.54688 26.2812 8.32812 10.1875
> > Length   44/4, alignment  9/ 3, found:  13.4688 40.9219 9.59375 13.9688
> > Length   44/4, alignment  9/ 3, fail:   13.4688 40.9531 50.6875 13.9844
> > Length   52/4, alignment  3/ 9, found:  13.5312 47.2812 9.5625  14.7344
> > Length   52/4, alignment  3/ 9, fail:   45.9688 53.625  35.9688 47.5938
> > Length   75/5, alignment 15/ 9, fail:   73.75   67.9375 64.2188 75.9062
> > Length   75/5, alignment 15/15, found:  15.2188 65.9062 10.7344 17.8438
> > Length   75/5, alignment 15/15, fail:   61.5469 68.6562 50.375  63.0312
> > Length  120/8, alignment 15/ 3, fail:   113.281 111.5   103.5   114.406
> > Length  120/8, alignment 15/ 9, found:  21.25   99.1406 12.8438 28.1406
> > Length  135/9, alignment 15/15, found:  23.125  110.734 13.4375 30.7969
> > Length  135/9, alignment 15/15, fail:   147.484 113.781 110.516 165.984
> > Length  168/12, alignment  3/15, found: 31.2344 134.172 13.6719 39.9375
> > Length  168/12, alignment  3/15, fail:  170.766 137.609 131.859 179.859
> > Length  168/12, alignment  9/ 0, found: 100.562 137.922 76.6562 112.578
> > Length  168/12, alignment  9/ 0, fail:  168.703 144.078 144.828 181.203
> > Length  168/12, alignment  9/ 3, found: 129.938 137.703 109.828 136.266
> > Length  168/12, alignment  9/ 3, fail:  178.734 137.875 165.469 183.609
> > Length  465/31, alignment 15/ 0, fail:  73.0469 363.375 32.2969 87.0625
> > Length  465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625
> > Length  465/31, alignment 15/ 3, fail:  512.922 373.188 437.047 556.844
> > Length  465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75
> > Length  465/31, alignment 15/ 9, fail:  484.312 376.406 423.703 513.672
> > Length  465/31, alignment 15/15, found: 326.859 376     256.625 350.422
> > Length  465/31, alignment 15/15, fail:  512.641 376.203 439.234 545.906
> > Length 131071/16, alignment  0/ 0, found:       116583  108378  122125
> > 127809
> > Length 131071/16, alignment  0/ 0, fail:        118298  108639  122208
> > 128700
> >
> >
>

Ondrej Bilka June 16, 2015, 7:21 a.m. UTC | #11

On Thu, Jun 11, 2015 at 12:04:08AM +0530, Rajalakshmi Srinivasaraghavan wrote:
> 
> 
> On 06/10/2015 04:18 PM, Ondřej Bílka wrote:
> >
> >  The onus for making all this happen is on the person who wants to change the code from its status quo.
> >
> >That would mean that its Rajalakshmi and responsibility to
> >write benchmark that shows its performance improvement. I stated my
> >objections explicitly described which inputs to use.
> >
> >Problem is that Steven failed to do his duty as a powerpc maintainer
> >which is to read and review change, check benchmark inputs and that
> >they are adequate. I could supply really bad implementation of strstr
> >which first checks special case that needle and haystack are periodic
> >abcdef sequences and returns result expected from benchmark.
> >
> >I did just simple request that every freshmen should know and Steven
> >should as powerpc maintainer check how its handled. Its that naive
> >algorithm and proposed algorithm are slow on inputs of form
> >
> >strstr("aaaaa....aaaa","aaaa...aaab");
> >
> >As from your previous mail its their responsibility to add that case for
> >benchmark. A 'solution' on v2 would be to switch into two-way algorithm
> >when size is bigger than 2048. That isn't solution as you still have
> >problem with size 2047. I seen somewhere comment that its security
> >problem when you have website with search function on user comments and
> >implementation internally uses strstr then attacker could just post
> >comment with aaaaa...a and search few times for aa...ab.
> >
> 2048 number is chosen by checking various length inputs and
> comparing the performance. So for 2047 case, there wont be worst
> performance.
>
What are you talking about? I am saying that aaaa haystack with aaa...ab
 needle of size 2047 is worst case for your algorithm performance. 

 
> >Roland as its really basic issue I think that its proposer
> >responsibility to check. Also for strstr thread Carlos said about strstr
> >quadratic case:
> >
> >>We should add gnulib's test to the microbenchmarks.
> >
> >
> >Then as I commented that algorithm is flawed its from my experience,
> >Raji's description was:
> >
> >For case 1, I always compare 8 characters at a time using cmpb instruction.
> >instruction.
> >-> perform basic strlen, strnlen, checks. Move haystack based on strchr result.
> >strchr result.
> >-> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them.
> >needle and compare them.
> >If they are same proceed to next 8 bytes of haystack and needle.
> >Else load next doubleword from haystack.Form a doubleword with last
> >seven characters from first load and first character from
> >second load and compare again with needle. Proceed in a similar way.
> >This gives good improvement.
> >
> Lets not mix the review comments of strstr and strcasestr.(as the
> algorithm is different)

From webster dictionary

"Full Definition of ALGORITHM :  
a procedure for solving a mathematical problem (as of finding the
greatest common divisor) in a finite number of steps that frequently
involves repetition of an operation; broadly :  a step-by-step procedure
for solving a problem or accomplishing some end especially by a computer"

Changing techical few details like applying parallel tolower does not
make new algorithm.

Your strstr and strcasestr are really same algorithm. 

And my point still applies, as you could change generic strcasestr from

while(s=strchr(s,n[0])) 

to

if (bijective_locale)
  {
    char fn[3];
    fn[0]=tolower(n[0]);
    fn[1]=toupper(n[0]);
    fn[2]=0;
    while(s=strpbrk(s,fn)) 

and add optimized strpbrk(x,"xy") path. That as strpbrk would be at most
twice slower than strchr and match is twice likely would be faster than
what you gain and easier to implement.

> >My comment was that it wastes time most of time as in common case these
> >already differ in first byte so all these shifts were completely
> >unnecessary, a simple optimization of naive algorithm using
> >
> >while(s=strchr(s,n[0]))
> If I use while(s=strchr(s,n[0]) as mentioned in
> https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html
> the quadratic testcase gives worst peformance. Attached the testcase.

Thats irrelevant testcase. I asked two things. First is show how big
regression you get on pathological inputs. So submit data when you use
your testcase with m=2047 and strstr executed 10000 times in loop with
current algorithm and with your proposed one. That is what I ask.

Second is about average case perforance. There worst case is irrelevant
as you use buy-or-rent techniques to handle it. I sumbitted following
patch with easy way to handle that.

[PATCH] Improve generic strstr performance.

its currently superseeded by generic vector bigraph search but it should
beat your strstr.

> >
> >would be faster. Steven instead of doing his duty as powerpc maintainer
> >and checking result of benchtest that I send instead tried to bullshit
> >himself by telling that powerpc architecture is too special. Then he
> >asked me do do benchmark instead as stalling tactics.
> >
> >If he did his duty of reading benchmark results there is already
> >evidence. First to notice is there are regressions in original
> >benchmark, like following:
> >
> >Length   30/2, alignment  0/ 9, found:  20.7812 32.2656 13.125  19.7188
> >Length   30/2, alignment  0/ 9, fail:   32.4688 33.9219 36.75   31.4688
> >
> I agree there are few regression cases(79 out of 13442 cases) in
> 'proposed_benchresults.txt' attached in
> https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html.
> This is minimal when compared to the overall performance gain.
> 
What overall performance gain? Did you try to find application and
measure performance difference before and after? Big sizes are not very
relevant in practice. You look behind 32'th byte of haystack only in
1.5% of calls so 10% regression at these sizes is much worse that 30%
gain on 256 byte range. In my profiler a simple collection shows
following


calls 13561
success probability   3.0%
average needle size:    8.9    ns <= 0:   0.4% ns <= 4:   0.8% ns <= 8: 5.9% ns <= 16:  99.9% ns <= 24: 100.0% ns <= 32: 100.0% ns <= 48: 100.0%
ns <= 64: 100.0% 
average n:   10.6    n <= 0:   0.4% n <= 4:   1.3% n <= 8:   1.3% n <= 16:  97.0% n <= 24:  97.8% n <= 32:  98.5% n <= 48: 99.9% n <= 64:  99.9% 
s aligned to 4 bytes:  98.6%  8 bytes:  98.6% 16 bytes:  96.0% 


> >That he should comment, as strstr on average end on 10.6 byte after
> >haystack start its quite relevant.
> >
> >But if he read proposed benchmark where needle and haystack are randomly
> >generated from list of 128 characters then he would easily see it.
> I have used your proposed benchtest, and it always gives improvement
> for strcasestr.Please check the file 'newresults' in
> https://sourceware.org/ml/libc-alpha/2015-06/msg00032.html
> >
> >Here is trimmed version to see pattern. Note that first there are lot of
> >times where performance is less than 13. Thats case where strchr didn't
> >find first character and we just return(generic case now does
> >unnecessary work leading to worse numbers but thats addressed on patch that I send with strchr loop.)
> >
> >Its natural to ask what would happen if you called strchr again. In 120
> >bytes it would probably didn't found second character and we would
> >finish in 26 cycles at most not 111.5
> >
> >I expect that average loop performance would be 13.4375/135 = 0.1 cycles
> >per byte not cycle per byte that this algorithm exhibits after he loses
> >performance gains from first correct strchr call.
> >
> >                         simple_strstr   stupid_strstr   __strstr_power7
> >__strstr_ppc
> >
> >Length    4/2, alignment  0/ 3, found:  9.35938 8.09375 7.90625 8.5
> >Length   12/2, alignment  3/ 9, fail:   9.23438 16.5625 7.90625 9.20312
> >Length   24/2, alignment  3/ 9, fail:   9.54688 26.2812 8.32812 10.1875
> >Length   44/4, alignment  9/ 3, found:  13.4688 40.9219 9.59375 13.9688
> >Length   44/4, alignment  9/ 3, fail:   13.4688 40.9531 50.6875 13.9844
> >Length   52/4, alignment  3/ 9, found:  13.5312 47.2812 9.5625  14.7344
> >Length   52/4, alignment  3/ 9, fail:   45.9688 53.625  35.9688 47.5938
> >Length   75/5, alignment 15/ 9, fail:   73.75   67.9375 64.2188 75.9062
> >Length   75/5, alignment 15/15, found:  15.2188 65.9062 10.7344 17.8438
> >Length   75/5, alignment 15/15, fail:   61.5469 68.6562 50.375  63.0312
> >Length  120/8, alignment 15/ 3, fail:   113.281 111.5   103.5   114.406
> >Length  120/8, alignment 15/ 9, found:  21.25   99.1406 12.8438 28.1406
> >Length  135/9, alignment 15/15, found:  23.125  110.734 13.4375 30.7969
> >Length  135/9, alignment 15/15, fail:   147.484 113.781 110.516 165.984
> >Length  168/12, alignment  3/15, found: 31.2344 134.172 13.6719 39.9375
> >Length  168/12, alignment  3/15, fail:  170.766 137.609 131.859 179.859
> >Length  168/12, alignment  9/ 0, found: 100.562 137.922 76.6562 112.578
> >Length  168/12, alignment  9/ 0, fail:  168.703 144.078 144.828 181.203
> >Length  168/12, alignment  9/ 3, found: 129.938 137.703 109.828 136.266
> >Length  168/12, alignment  9/ 3, fail:  178.734 137.875 165.469 183.609
> >Length  465/31, alignment 15/ 0, fail:  73.0469 363.375 32.2969 87.0625
> >Length  465/31, alignment 15/ 3, found: 507.656 373.703 429.797 543.625
> >Length  465/31, alignment 15/ 3, fail:  512.922 373.188 437.047 556.844
> >Length  465/31, alignment 15/ 9, found: 443.969 379.359 392.828 464.75
> >Length  465/31, alignment 15/ 9, fail:  484.312 376.406 423.703 513.672
> >Length  465/31, alignment 15/15, found: 326.859 376     256.625 350.422
> >Length  465/31, alignment 15/15, fail:  512.641 376.203 439.234 545.906
> >Length 131071/16, alignment  0/ 0, found:       116583  108378  122125
> >127809
> >Length 131071/16, alignment  0/ 0, fail:        118298  108639  122208
> >128700
> >
> >

Rajalakshmi S June 18, 2015, 10:29 a.m. UTC | #12

On 06/16/2015 12:51 PM, Ondřej Bílka wrote:
> On Thu, Jun 11, 2015 at 12:04:08AM +0530, Rajalakshmi Srinivasaraghavan wrote:
>>
>>
>> On 06/10/2015 04:18 PM, Ondřej Bílka wrote:
>>>
>>>   The onus for making all this happen is on the person who wants to change the code from its status quo.
>>>
>>> That would mean that its Rajalakshmi and responsibility to
>>> write benchmark that shows its performance improvement. I stated my
>>> objections explicitly described which inputs to use.
>>>
>>> Problem is that Steven failed to do his duty as a powerpc maintainer
>>> which is to read and review change, check benchmark inputs and that
>>> they are adequate. I could supply really bad implementation of strstr
>>> which first checks special case that needle and haystack are periodic
>>> abcdef sequences and returns result expected from benchmark.
>>>
>>> I did just simple request that every freshmen should know and Steven
>>> should as powerpc maintainer check how its handled. Its that naive
>>> algorithm and proposed algorithm are slow on inputs of form
>>>
>>> strstr("aaaaa....aaaa","aaaa...aaab");
>>>
>>> As from your previous mail its their responsibility to add that case for
>>> benchmark. A 'solution' on v2 would be to switch into two-way algorithm
>>> when size is bigger than 2048. That isn't solution as you still have
>>> problem with size 2047. I seen somewhere comment that its security
>>> problem when you have website with search function on user comments and
>>> implementation internally uses strstr then attacker could just post
>>> comment with aaaaa...a and search few times for aa...ab.
>>>
>> 2048 number is chosen by checking various length inputs and
>> comparing the performance. So for 2047 case, there wont be worst
>> performance.
>>
> What are you talking about? I am saying that aaaa haystack with aaa...ab
>   needle of size 2047 is worst case for your algorithm performance.
>
I ran the attached test.c in a loop 10000 times.
proposed patch took 19s and existing code took 20s on power machine.

>
>>> Roland as its really basic issue I think that its proposer
>>> responsibility to check. Also for strstr thread Carlos said about strstr
>>> quadratic case:
>>>
>>>> We should add gnulib's test to the microbenchmarks.
>>>
>>>
>>> Then as I commented that algorithm is flawed its from my experience,
>>> Raji's description was:
>>>
>>> For case 1, I always compare 8 characters at a time using cmpb instruction.
>>> instruction.
>>> -> perform basic strlen, strnlen, checks. Move haystack based on strchr result.
>>> strchr result.
>>> -> Load first 8 bytes(taking care of aligned load)from haystack and needle and compare them.
>>> needle and compare them.
>>> If they are same proceed to next 8 bytes of haystack and needle.
>>> Else load next doubleword from haystack.Form a doubleword with last
>>> seven characters from first load and first character from
>>> second load and compare again with needle. Proceed in a similar way.
>>> This gives good improvement.
>>>
>> Lets not mix the review comments of strstr and strcasestr.(as the
>> algorithm is different)
>
>>From webster dictionary
>
> "Full Definition of ALGORITHM :
> a procedure for solving a mathematical problem (as of finding the
> greatest common divisor) in a finite number of steps that frequently
> involves repetition of an operation; broadly :  a step-by-step procedure
> for solving a problem or accomplishing some end especially by a computer"
>
> Changing techical few details like applying parallel tolower does not
> make new algorithm.
>
> Your strstr and strcasestr are really same algorithm.
>
The reason I mentioned them as different is, in strstr() patch the
input is loaded as double word ie. 8 bytes at a time by byte and in 
strcasestr its byte by byte load.The input is handled in different way.

> And my point still applies, as you could change generic strcasestr from
>
> while(s=strchr(s,n[0]))
>
> to
>
> if (bijective_locale)
>    {
>      char fn[3];
>      fn[0]=tolower(n[0]);
>      fn[1]=toupper(n[0]);
>      fn[2]=0;
>      while(s=strpbrk(s,fn))
>
> and add optimized strpbrk(x,"xy") path. That as strpbrk would be at most
> twice slower than strchr and match is twice likely would be faster than
> what you gain and easier to implement.
>
>>> My comment was that it wastes time most of time as in common case these
>>> already differ in first byte so all these shifts were completely
>>> unnecessary, a simple optimization of naive algorithm using
>>>
>>> while(s=strchr(s,n[0]))
>> If I use while(s=strchr(s,n[0]) as mentioned in
>> https://sourceware.org/ml/libc-alpha/2015-05/msg00060.html
>> the quadratic testcase gives worst peformance. Attached the testcase.
>
> Thats irrelevant testcase. I asked two things. First is show how big
> regression you get on pathological inputs. So submit data when you use
> your testcase with m=2047 and strstr executed 10000 times in loop with
> current algorithm and with your proposed one. That is what I ask.
>
> Second is about average case perforance. There worst case is irrelevant
> as you use buy-or-rent techniques to handle it. I sumbitted following
> patch with easy way to handle that.
>
> [PATCH] Improve generic strstr performance.
>
I applied your proposed patches and the benchtests and I could see 
proposed patch(__strstr_power7) is better than existing code (__strstr_ppc).
13363 out of 13442 combinations shows improvement.
Attached benchtest results.

git status shows:

On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
   (use "git add <file>..." to update what will be committed)
   (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   benchtests/bench-strcasestr.c
	modified:   benchtests/bench-strstr.c
	modified:   locale/C-ctype.c
	modified:   locale/categories.def
	modified:   locale/langinfo.h
	modified:   locale/programs/ld-ctype.c
	modified:   string/memmem.c
	modified:   string/strcasestr.c
	modified:   string/strstr.c
	modified:   string/test-strcasestr.c
	modified:   sysdeps/powerpc/powerpc64/multiarch/Makefile
	modified:   sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c

Untracked files:
   (use "git add <file>..." to include in what will be committed)

	sysdeps/generic/string_vector.h
	sysdeps/generic/string_vector_search.h
	sysdeps/generic/string_vector_skeleton.h
	sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S
	sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c
	sysdeps/powerpc/powerpc64/multiarch/strcasestr.c
	sysdeps/powerpc/powerpc64/multiarch/strstr-power7.S
	sysdeps/powerpc/powerpc64/multiarch/strstr-ppc64.c
	sysdeps/powerpc/powerpc64/multiarch/strstr.c
	sysdeps/powerpc/powerpc64/power7/strcasestr.S
	sysdeps/powerpc/powerpc64/power7/strstr.S

> its currently superseeded by generic vector bigraph search but it should
> beat your strstr.
>
>>>
>>> would be faster. Steven instead of doing his duty as powerpc maintainer
>>> and checking result of benchtest that I send instead tried to bullshit
>>> himself by telling that powerpc architecture is too special. Then he
>>> asked me do do benchmark instead as stalling tactics.
>>>
>>> If he did his duty of reading benchmark results there is already
>>> evidence. First to notice is there are regressions in original
>>> benchmark, like following:
>>>
>>> Length   30/2, alignment  0/ 9, found:  20.7812 32.2656 13.125  19.7188
>>> Length   30/2, alignment  0/ 9, fail:   32.4688 33.9219 36.75   31.4688
>>>
>> I agree there are few regression cases(79 out of 13442 cases) in
>> 'proposed_benchresults.txt' attached in
>> https://sourceware.org/ml/libc-alpha/2015-05/msg00276.html.
>> This is minimal when compared to the overall performance gain.
>>
> What overall performance gain? Did you try to find application and
> measure performance difference before and after? Big sizes are not very
> relevant in practice. You look behind 32'th byte of haystack only in
> 1.5% of calls so 10% regression at these sizes is much worse that 30%
> gain on 256 byte range. In my profiler a simple collection shows
> following
No I don't check on any specific application and I use the glibc 
benchtests for performance measurement which is considered to be the 
standard way of performance measurement before proposing a patch.
>
>
> calls 13561
> success probability   3.0%
> average needle size:    8.9    ns <= 0:   0.4% ns <= 4:   0.8% ns <= 8: 5.9% ns <= 16:  99.9% ns <= 24: 100.0% ns <= 32: 100.0% ns <= 48: 100.0%
> ns <= 64: 100.0%
> average n:   10.6    n <= 0:   0.4% n <= 4:   1.3% n <= 8:   1.3% n <= 16:  97.0% n <= 24:  97.8% n <= 32:  98.5% n <= 48: 99.9% n <= 64:  99.9%
> s aligned to 4 bytes:  98.6%  8 bytes:  98.6% 16 bytes:  96.0%
>
>

Ondrej Bilka June 18, 2015, 12:58 p.m. UTC | #13

On Thu, Jun 18, 2015 at 03:59:55PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> >>2048 number is chosen by checking various length inputs and
> >>comparing the performance. So for 2047 case, there wont be worst
> >>performance.
> >>
> >What are you talking about? I am saying that aaaa haystack with aaa...ab
> >  needle of size 2047 is worst case for your algorithm performance.
> >
> I ran the attached test.c in a loop 10000 times.
> proposed patch took 19s and existing code took 20s on power machine.
>
How could you ran it? You have two problems with it
1. It doesn't compile, its missing final } to end main
2. It doesn't measure strstr at all. Its pure function so gcc optimized
that away.

As when I fixed these two issues just 100 iterations take 1.002s with my
x64 vector implementation and 44.538s on generic implemenation your test
couldn't take just 20s for existing code.


> The reason I mentioned them as different is, in strstr() patch the
> input is loaded as double word ie. 8 bytes at a time by byte and in
> strcasestr its byte by byte load.The input is handled in different
> way.
>

I see now that you wrote basically

long i=0;
n0 = tolower(n[0])
while (s[i])
  {
    int j=1;
    if (tolower(s[i])==n0)
      while (n[j] && tolower(n[j]) == tolower(s[i+j]))
        j++;
    if (n[j]=='\0')
      return s+i;
    i++;
  }
return NULL;
 
For some reason I skimmed that patch and though that you did obvious
optimization once you have vector strstr. You use vectorized tolower to
convert 8 bytes at once. Example of conversion is in
 sysdeps/x86_64/multiarch/strcmp-sse42.S

> >
> >Second is about average case perforance. There worst case is irrelevant
> >as you use buy-or-rent techniques to handle it. I sumbitted following
> >patch with easy way to handle that.
> >
> >[PATCH] Improve generic strstr performance.
> >
> I applied your proposed patches and the benchtests and I could see
> proposed patch(__strstr_power7) is better than existing code
> (__strstr_ppc).
> 13363 out of 13442 combinations shows improvement.
> Attached benchtest results.
>
I said use that patch. Next line of my previous mail explains that.

> >its currently superseeded by generic vector bigraph search but it should
> >beat your strstr.

So you should do also comparison with that.


Your benchmark is invalid for simple reason, I guard vector patch with:

if (__BYTE_ORDER == __LITTLE_ENDIAN)
  {
 
So as powerpc is big endian my patch does nothing at all and you are
just measuring generic strstr again. If you want test bigraph search to
that on power64le.

> >What overall performance gain? Did you try to find application and
> >measure performance difference before and after? Big sizes are not very
> >relevant in practice. You look behind 32'th byte of haystack only in
> >1.5% of calls so 10% regression at these sizes is much worse that 30%
> >gain on 256 byte range. In my profiler a simple collection shows
> >following
> No I don't check on any specific application and I use the glibc
> benchtests for performance measurement which is considered to be the
> standard way of performance measurement before proposing a patch.

Then you have problem that you don't know which data are important and
which are not. 

Proof by numbers is quite suspect. Input ranges are completely
arbitrary. So I could decide that I want to check for sizes 1..64 all
possible 64byte needle and haystack alignments. That generates 262144
test cases which should be enough.

Rajalakshmi S June 24, 2015, 7:17 a.m. UTC | #14

On 06/18/2015 06:28 PM, Ondřej Bílka wrote:
> On Thu, Jun 18, 2015 at 03:59:55PM +0530, Rajalakshmi Srinivasaraghavan wrote:
>>>> 2048 number is chosen by checking various length inputs and
>>>> comparing the performance. So for 2047 case, there wont be worst
>>>> performance.
>>>>
>>> What are you talking about? I am saying that aaaa haystack with aaa...ab
>>>   needle of size 2047 is worst case for your algorithm performance.
>>>
>> I ran the attached test.c in a loop 10000 times.
>> proposed patch took 19s and existing code took 20s on power machine.
>>
> How could you ran it? You have two problems with it
> 1. It doesn't compile, its missing final } to end main
> 2. It doesn't measure strstr at all. Its pure function so gcc optimized
> that away.
>
> As when I fixed these two issues just 100 iterations take 1.002s with my
> x64 vector implementation and 44.538s on generic implemenation your test
> couldn't take just 20s for existing code.
>
>
>> The reason I mentioned them as different is, in strstr() patch the
>> input is loaded as double word ie. 8 bytes at a time by byte and in
>> strcasestr its byte by byte load.The input is handled in different
>> way.
>>
>
> I see now that you wrote basically
>
> long i=0;
> n0 = tolower(n[0])
> while (s[i])
>    {
>      int j=1;
>      if (tolower(s[i])==n0)
>        while (n[j] && tolower(n[j]) == tolower(s[i+j]))
>          j++;
>      if (n[j]=='\0')
>        return s+i;
>      i++;
>    }
> return NULL;
>
> For some reason I skimmed that patch and though that you did obvious
> optimization once you have vector strstr. You use vectorized tolower to
> convert 8 bytes at once. Example of conversion is in
>   sysdeps/x86_64/multiarch/strcmp-sse42.S
>
For strcasestr() optimization I am trying to use tolower to convert 8 
bytes and will send the modified patch.
For strstr() I have sent the new version
https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html
>>>
>>> Second is about average case perforance. There worst case is irrelevant
>>> as you use buy-or-rent techniques to handle it. I sumbitted following
>>> patch with easy way to handle that.
>>>
>>> [PATCH] Improve generic strstr performance.
>>>
>> I applied your proposed patches and the benchtests and I could see
>> proposed patch(__strstr_power7) is better than existing code
>> (__strstr_ppc).
>> 13363 out of 13442 combinations shows improvement.
>> Attached benchtest results.
>>
> I said use that patch. Next line of my previous mail explains that.
>
>>> its currently superseeded by generic vector bigraph search but it should
>>> beat your strstr.
>
> So you should do also comparison with that.
>
>
> Your benchmark is invalid for simple reason, I guard vector patch with:
>
> if (__BYTE_ORDER == __LITTLE_ENDIAN)
>    {
>
Modified new version works better for both LE and BE. Results attached 
in https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html

> So as powerpc is big endian my patch does nothing at all and you are
> just measuring generic strstr again. If you want test bigraph search to
> that on power64le.
>
Used your proposed code on powerpc64le and observed that quadratic worst 
case still takes more time with your proposed code.
It took more than 2 minutes when the needle size is 100000 and haystack 
is of size 1000000.

>>> What overall performance gain? Did you try to find application and
>>> measure performance difference before and after? Big sizes are not very
>>> relevant in practice. You look behind 32'th byte of haystack only in
>>> 1.5% of calls so 10% regression at these sizes is much worse that 30%
>>> gain on 256 byte range. In my profiler a simple collection shows
>>> following
>> No I don't check on any specific application and I use the glibc
>> benchtests for performance measurement which is considered to be the
>> standard way of performance measurement before proposing a patch.
>
> Then you have problem that you don't know which data are important and
> which are not.
>
> Proof by numbers is quite suspect. Input ranges are completely
> arbitrary. So I could decide that I want to check for sizes 1..64 all
> possible 64byte needle and haystack alignments. That generates 262144
> test cases which should be enough.
>
>

Ondrej Bilka June 24, 2015, 9:48 a.m. UTC | #15

On Wed, Jun 24, 2015 at 12:47:23PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> 
> 
> On 06/18/2015 06:28 PM, Ondřej Bílka wrote:
> >On Thu, Jun 18, 2015 at 03:59:55PM +0530, Rajalakshmi Srinivasaraghavan wrote:
> >>>>2048 number is chosen by checking various length inputs and
> >>>>comparing the performance. So for 2047 case, there wont be worst
> >>>>performance.
> >>>>
> >>>What are you talking about? I am saying that aaaa haystack with aaa...ab
> >>>  needle of size 2047 is worst case for your algorithm performance.
> >>>
> >>I ran the attached test.c in a loop 10000 times.
> >>proposed patch took 19s and existing code took 20s on power machine.
> >>
> >How could you ran it? You have two problems with it
> >1. It doesn't compile, its missing final } to end main
> >2. It doesn't measure strstr at all. Its pure function so gcc optimized
> >that away.
> >
> >As when I fixed these two issues just 100 iterations take 1.002s with my
> >x64 vector implementation and 44.538s on generic implemenation your test
> >couldn't take just 20s for existing code.
> >
> >
> >>The reason I mentioned them as different is, in strstr() patch the
> >>input is loaded as double word ie. 8 bytes at a time by byte and in
> >>strcasestr its byte by byte load.The input is handled in different
> >>way.
> >>
> >
> >I see now that you wrote basically
> >
> >long i=0;
> >n0 = tolower(n[0])
> >while (s[i])
> >   {
> >     int j=1;
> >     if (tolower(s[i])==n0)
> >       while (n[j] && tolower(n[j]) == tolower(s[i+j]))
> >         j++;
> >     if (n[j]=='\0')
> >       return s+i;
> >     i++;
> >   }
> >return NULL;
> >
> >For some reason I skimmed that patch and though that you did obvious
> >optimization once you have vector strstr. You use vectorized tolower to
> >convert 8 bytes at once. Example of conversion is in
> >  sysdeps/x86_64/multiarch/strcmp-sse42.S
> >
> For strcasestr() optimization I am trying to use tolower to convert
> 8 bytes and will send the modified patch.
> For strstr() I have sent the new version
> https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html

Doing tolower is still bit slow. You should improve strspn. Then you only need to convert first byte
of needle and call strspn.

> >>>
> >>>Second is about average case perforance. There worst case is irrelevant
> >>>as you use buy-or-rent techniques to handle it. I sumbitted following
> >>>patch with easy way to handle that.
> >>>
> >>>[PATCH] Improve generic strstr performance.
> >>>
> >>I applied your proposed patches and the benchtests and I could see
> >>proposed patch(__strstr_power7) is better than existing code
> >>(__strstr_ppc).
> >>13363 out of 13442 combinations shows improvement.
> >>Attached benchtest results.
> >>
> >I said use that patch. Next line of my previous mail explains that.
> >
> >>>its currently superseeded by generic vector bigraph search but it should
> >>>beat your strstr.
> >
> >So you should do also comparison with that.
> >
> >
> >Your benchmark is invalid for simple reason, I guard vector patch with:
> >
> >if (__BYTE_ORDER == __LITTLE_ENDIAN)
> >   {
> >
> Modified new version works better for both LE and BE. Results
> attached in
> https://sourceware.org/ml/libc-alpha/2015-06/msg00788.html
> 
Yes I was saying all time that you should use that.

> >So as powerpc is big endian my patch does nothing at all and you are
> >just measuring generic strstr again. If you want test bigraph search to
> >that on power64le.
> >
> Used your proposed code on powerpc64le and observed that quadratic
> worst case still takes more time with your proposed code.
> It took more than 2 minutes when the needle size is 100000 and
> haystack is of size 1000000.
>
You still didn't answer how big degradation has your implementation. For
mine its a typo disabled quadratic detection at all. All that needs to
change would be
rent += i + 10;
into
*rent += i + 10;
 
> >>>What overall performance gain? Did you try to find application and
> >>>measure performance difference before and after? Big sizes are not very
> >>>relevant in practice. You look behind 32'th byte of haystack only in
> >>>1.5% of calls so 10% regression at these sizes is much worse that 30%
> >>>gain on 256 byte range. In my profiler a simple collection shows
> >>>following
> >>No I don't check on any specific application and I use the glibc
> >>benchtests for performance measurement which is considered to be the
> >>standard way of performance measurement before proposing a patch.
> >
> >Then you have problem that you don't know which data are important and
> >which are not.
> >
> >Proof by numbers is quite suspect. Input ranges are completely
> >arbitrary. So I could decide that I want to check for sizes 1..64 all
> >possible 64byte needle and haystack alignments. That generates 262144
> >test cases which should be enough.
> >
> >
> 
> -- 
> Thanks
> Rajalakshmi S

diff mbox

Patch

diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile
b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index 17265bd..06b2c67 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -19,7 +19,7 @@  sysdep_routines += memcpy-power7 memcpy-a2
memcpy-power6 memcpy-cell \
   		   strcmp-power8 strcmp-power7 strcmp-ppc64 \
   		   strcat-power8 strcat-power7 strcat-ppc64 \
   		   memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
-		   strncpy-power8
+		   strncpy-power8 strcasestr-power7 strcasestr-ppc64 \

   CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
   CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index f5fdea5..0fd2bd2 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -322,5 +322,12 @@  __libc_ifunc_impl_list (const char *name, struct
libc_ifunc_impl *array,
   	      IFUNC_IMPL_ADD (array, i, strcat, 1,
   			     __strcat_ppc))

+  /* Support sysdeps/powerpc/powerpc64/multiarch/strcasestr.c.  */
+  IFUNC_IMPL (i, name, strcasestr,
+	      IFUNC_IMPL_ADD (array, i, strcasestr,
+			      hwcap & PPC_FEATURE_HAS_VSX,
+			      __strcasestr_power7)
+	      IFUNC_IMPL_ADD (array, i, strcasestr, 1,
+			      __strcasestr_ppc))
     return i;
   }
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S
b/sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S
new file mode 100644
index 0000000..e13f575
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strcasestr-power7.S
@@ -0,0 +1,43 @@ 
+/* Optimized strcasestr implementation for POWER7.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+#undef EALIGN
+#define EALIGN(name, alignt, words)				\
+  .section ".text";						\
+  ENTRY_2(__strcasestr_power7)					\
+  .align ALIGNARG(alignt);					\
+  EALIGN_W_##words;						\
+  BODY_LABEL(__strcasestr_power7):				\
+  cfi_startproc;						\
+  LOCALENTRY(__strcasestr_power7)
+
+#undef END
+#define END(name)						\
+  cfi_endproc;							\
+  TRACEBACK(__strcasestr_power7)				\
+  END_2(__strcasestr_power7)
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#define STRLEN __strlen_power7
+#define STRNLEN __strnlen_power7
+
+#include <sysdeps/powerpc/powerpc64/power7/strcasestr.S>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c
b/sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c
new file mode 100644
index 0000000..614c7bf
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strcasestr-ppc64.c
@@ -0,0 +1,34 @@ 
+/* PowerPC64 default implementation of strcasestr.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+#define STRCASESTR  __strcasestr_ppc
+#if IS_IN (libc) && defined(SHARED)
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1(__strstr_ppc, __GI_strstr, __strstr_ppc);
+#endif
+
+
+#undef weak_alias
+#define weak_alias(a,b )
+
+extern __typeof (strcasestr) __strcasestr_ppc attribute_hidden;
+
+#include <string/strcasestr.c>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strcasestr.c
b/sysdeps/powerpc/powerpc64/multiarch/strcasestr.c
new file mode 100644
index 0000000..6564314
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strcasestr.c
@@ -0,0 +1,37 @@ 
+/* Multiple versions of strcasestr.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#if IS_IN (libc)
+# include <string.h>
+# include <shlib-compat.h>
+# include "init-arch.h"
+
+extern __typeof (__strcasestr) __strcasestr_ppc attribute_hidden;
+extern __typeof (__strcasestr) __strcasestr_power7 attribute_hidden;
+
+/* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
+   ifunc symbol properly.  */
+libc_ifunc (__strcasestr,
+	    (hwcap & PPC_FEATURE_HAS_VSX)
+            ? __strcasestr_power7
+            : __strcasestr_ppc);
+
+weak_alias (__strcasestr, strcasestr)
+#else
+#include <string/strcasestr.c>
+#endif
diff --git a/sysdeps/powerpc/powerpc64/power7/strcasestr.S
b/sysdeps/powerpc/powerpc64/power7/strcasestr.S
new file mode 100644
index 0000000..521eadb
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/power7/strcasestr.S
@@ -0,0 +1,177 @@ 
+/* Optimized strcasestr implementation for PowerPC64.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <locale-defines.h>
+
+#ifndef STRLEN
+/* For builds with no IFUNC support, local calls should be made to internal
+   GLIBC symbol (created by libc_hidden_builtin_def).  */
+# ifdef SHARED
+#  define STRLEN   __GI_strlen
+# else
+#  define STRLEN   strlen
+# endif
+#endif
+
+#ifndef STRNLEN
+/* For builds with no IFUNC support, local calls should be made to internal
+   GLIBC symbol (created by libc_hidden_builtin_def).  */
+# ifdef SHARED
+#  define STRNLEN   __GI_strnlen
+# else
+#  define STRNLEN   strnlen
+# endif
+#endif
+
+#undef strcasestr
+#undef __strcasestr
+
+#ifndef STRCASESTR
+#define STRCASESTR __strcasestr
+#endif
+
+/* char * [r3] strcasestr (char *s [r3], char * pat[r4])  */
+
+/*
+* Load byte from input string and search substring and convert
+* each character to lower case character and then compare both.
+* If they are same, load byte from both r3 and r4 and proceed,
+* Else, load next byte from r3 and compare with current r4.
+*/
+
+#define	FRAMESIZE	(FRAME_MIN_SIZE+32)
+	.machine	power7
+EALIGN (STRCASESTR, 4, 0)
+	CALL_MCOUNT 2
+	mflr	r0			/* Load link register LR to r0.  */
+	std	r31, -8(r1)		/* Save callers register r31.  */
+	cfi_offset(r31, -8)
+	std	r30, -16(r1)		/* Save callers register r30.  */
+	cfi_offset(r30, -16)
+	std	r29, -24(r1)		/* Save callers register r29.  */
+	cfi_offset(r29, -24)
+	std	r0, 16(r1)		/* Store the link register.  */
+	cfi_offset(lr, 16)
+	stdu	r1, -FRAMESIZE(r1)	/* Create the stack frame.  */
+	cfi_adjust_cfa_offset(FRAMESIZE)
+
+	dcbt	0, r3
+	dcbt	0, r4
+
+	and	r0, r3, r4
+	cmpdi	cr7, r0, 0
+	beq	cr7, L(retnull)
+
+	mr	r29, r3
+	mr	r30, r4
+	mr	r3, r4
+	bl	STRLEN
+	nop
+
+	/* Call __strcasestr_ppc if needle len > 2048.  */
+	cmpdi	cr7, r3, 2048
+	bgt	cr7, L(default)
+
+	cmpdi	cr7, r3, 0	/* If search str is null.  */
+	beq	cr7, L(ret_r3)
+	mr	r31, r3
+	mr	r4, r3
+	mr	r3, r29
+	bl	STRNLEN
+	nop
+
+	cmpd	cr7, r3, r31 	/* If len(r3) < len(r4).  */
+	blt	cr7, L(retnull)
+
+	mr	r3, r29
+	mr	r8, r3			/* Save  r3. */
+	addi	r8, r8, -1
+	ld	r10, __libc_tsd_LOCALE@got@tprel(r2)
+	add	r11, r10, __libc_tsd_LOCALE@tls
+	ld	r11, 0(r11)
+	ld	r11, LOCALE_CTYPE_TOLOWER(r11)
+
+	mr	r4, r30
+	lbz	r6, 0(r4)		/* Load next byte from r4.  */
+	cmpdi	cr7, r6, 0		/* Is it null?  */
+	beq	cr7, L(updater3)
+	sldi	r7, r6, 2		/* Convert to lower case.  */
+	lwzx	r7, r11, r7
+	mr	r12, r7			/* Save it for next loop.  */
+L(loop1):
+	addi	r8, r8, 1
+	mr	r3, r8			/* Restore r3.  */
+	mr	r4, r30			/* Restore r4.  */
+	mr	r7, r12
+L(loop):
+	lbz	r5, 0(r3)		/* Load byte from r3.  */
+	cmpdi	cr7, r5, 0		/* Is it null?  */
+	beq	cr7, L(retnull)		/* If yes, return.  */
+	sldi	r10, r5, 2		/* Convert to lower case.  */
+	lwzx	r10, r11, r10
+	cmpw	cr7, r7, r10		/* Compare with byte from r4.  */
+	bne	cr7, L(loop1)
+	addi	r3, r3, 1		/* Increment r3.  */
+	addi	r4, r4, 1		/* Increment r4.  */
+	lbz	r6, 0(r4)		/* Load next byte from r4.  */
+	cmpdi	cr7, r6, 0		/* Is it null?  */
+	beq	cr7, L(updater3)
+	sldi	r7, r6, 2		/* Convert to lower case.  */
+	lwzx	r7, r11, r7
+	b	L(loop)
+
+	/* Handling return values.  */
+	.align	4
+L(updater3):
+	subf	r3, r31, r3	/* Reduce len of r4 from r3.  */
+	b	L(end)
+
+	.align	4
+L(ret_r3):
+	mr	r3, r29		/* Return r3.  */
+	b	L(end)
+
+	.align	4
+L(retnull):
+	li	r3, 0		/* Return NULL.  */
+	b	L(end)
+
+	.align	4
+L(default):
+	mr	r3, r29
+	mr	r4, r30
+	bl	__strcasestr_ppc
+	nop
+
+	.align	4
+L(end):
+	addi	r1, r1, FRAMESIZE	/* Restore stack pointer.  */
+	cfi_adjust_cfa_offset(-FRAMESIZE)
+	ld	r0, 16(r1)	/* Restore the saved link register.  */
+	ld	r29, -24(r1)	/* Restore callers save register r29.  */
+	ld	r30, -16(r1)	/* Restore callers save register r30.  */
+	ld	r31, -8(r1)	/* Restore callers save register r31.  */
+	mtlr	r0		/* Branch to link register.  */
+	blr
+END (STRCASESTR)
+#ifndef NO_ALIAS
+weak_alias (__strcasestr, strcasestr)
+#endif
+
+libc_hidden_builtin_def (strcasestr)