powerpc: strcasestr optimization

Message ID 556D705D.20201@linux.vnet.ibm.com
State Dropped
Delegated to: Tulio Magno Quites Machado Filho
Headers

Commit Message

Rajalakshmi S June 2, 2015, 8:59 a.m. UTC
  On 06/01/2015 09:52 PM, Ondřej Bílka wrote:
> On Mon, Jun 01, 2015 at 09:36:47AM -0500, Steven Munroe wrote:
>> On Mon, 2015-06-01 at 14:28 +0200, Ondřej Bílka wrote:
>>> On Mon, Jun 01, 2015 at 04:11:28PM +0530, Rajalakshmi Srinivasaraghavan wrote:
>>>>
>>>> This patch optimizes strcasestr function for power >= 7 systems.
>>>> This patch uses optimized strlen and strnlen for calculating
>>>> string length and the average improvement of this optimization is ~40%.
>>>> This patch is tested on powerpc64 and powerpc64le.
>>>> Attached the benchresults with this new patch.
>>>>
>>> Thats not enough. As strcasestr that I submited is around three times
>>> slower your implementation would likely be regression over generic one.
>>>
>>> A problem here is that you use moronic algorithm. Fix algorithm first
>>> before trying to optimize it.
>>>
>>
>> This is not very helpful. You are demanding changes without clear
>> explanation and justification.
>>
>> What is wrong with Raja's algorithm? What is insufficient in the
>> benchmark data she has provided? And why do you think your specific
>> design applies to PowerISA and POWER7/POWER8 micro-architecture.
>>
>> What data do you have that justified this objection?
>
> I replied on strstr patch thread on why what she submitted is
> performance regression. So I will repeat arguments from other thread
> which still apply.
>
> First was problem with quadratic behaviour. She tried to fix it but it
> isn't a fix at all. Just benchmark
>
> strcasestr ("aaa...(4000 times)...aaa", "aaa...(2000 times)...aab")
>
> That call would take around hundred times than before which is
> unacceptable.

This is already handled in the patch.If the needle len is more than
2048, it calls default string/strcasestr.c
>
> If we ignore that red flag second problem was that benchmark she used is
> bogus. It test with periodic haystacks, needle is copy of first bytes of
> haystack with last byte set to something else.
Which benchmark are you referring as bogus? The benchtest result 
attached in the previous thread was created using 
benchtests/bench-strcasestr.c . Since your proposed benchtest changes 
were not yet committed, I have used default ones.
.
.
.

>
> Just use same patch like I send with ((unsigned) rand())%16 + 1 and you
> will see completely different numbers in benchmark.
>
Benchtest results attached with these changes.
.
.

>
> As I don't have powerpc access now apply my patches
>
> [PATCH v5] Generic string skeleton
> [PATCH v5 4*] Generic string search functions (strstr, strcasestr, memmem)
>
I have attached the benchtest result with the above patches applied
along with benchtests/bench-strcasestr.c changes.(similiar changed as 
proposed by you for benchtests/bench-strstr.c).
The result attached clearly shows improvement.
> and run (preferably fixed) benchmark with these. As gains that I see on
> x64 are bigger than ones gained by this assembly you will likely see
> that generic implementation is indeed better and it would be pointless
> to try review that only to remove it shortly after adding to improve
> performance.
>
>
  

Patch

diff --git a/benchtests/bench-strcasestr.c b/benchtests/bench-strcasestr.c
index 33531a4..f579491 100644
--- a/benchtests/bench-strcasestr.c
+++ b/benchtests/bench-strcasestr.c
@@ -82,33 +82,14 @@  do_test (size_t align1, size_t align2, size_t len1, size_t len2,
 {
   char *s1 = (char *) (buf1 + align1);
   char *s2 = (char *) (buf2 + align2);
-
-  static const char d[] = "1234567890abcxyz";
-#define dl (sizeof (d) - 1)
-  char *ss2 = s2;
-  for (size_t l = len2; l > 0; l = l > dl ? l - dl : 0)
+  for (size_t l = 0; l < len2; l++)
     {
-      size_t t = l > dl ? dl : l;
-      ss2 = mempcpy (ss2, d, t);
+      s2[l] = (rand () % 128) + 1;
     }
   s2[len2] = '\0';
-
-  if (fail)
-    {
-      char *ss1 = s1;
-      for (size_t l = len1; l > 0; l = l > dl ? l - dl : 0)
-       {
-         size_t t = l > dl ? dl : l;
-         memcpy (ss1, d, t);
-         ++ss1[len2 > 7 ? 7 : len2 - 1];
-         ss1 += t;
-       }
-    }
-  else
+  for (size_t l = 0; l < len1; l++)
     {
-      memset (s1, '0', len1);
-      for (size_t i = 0; i < len2; ++i)
-       s1[len1 - len2 + i] = toupper (s2[i]);
+      s1[l] = (rand () % 128) + 1;
     }
   s1[len1] = '\0';