[v2] Improve memmem.

Message ID 20150513185107.GA4100@domone
State New, archived
Headers

Commit Message

Ondrej Bilka May 13, 2015, 6:51 p.m. UTC
  On Wed, May 13, 2015 at 09:31:06AM -0700, Paul Eggert wrote:
> Ondřej Bílka wrote:
> >+  if (needle_len == 1)
> >+    return memchr (haystack, needle[0], haystack_end - haystack);
> >+
> >+  while ((haystack = memchr (haystack, needle[0], haystack_end - haystack)))
> >+    {
> >+      if (haystack[1] == needle[1]
> >+          && (needle_len == 2 || haystack[2] == needle[2]))
> >+        {
> >+          if (needle_len == 2)
> >+            return haystack;
> >+
> >+          if (!memcmp (haystack + 2, needle + 2, needle_len - 2))
> >+            return haystack;
> >+          else
> >+            goto slow_path;
> >+        }
> >+      haystack++;
> >+    }
> >+
> >+  return NULL;
> >+
> >+  slow_path:;
> 
> First, that "haystack[1] == needle[1]" could access past the end of
> the haystack.  Second, can't this be rewritten to avoid that ugly
> goto?  Something like the attached, perhaps.

You wont get illegal acces. Unless I messed up haystack_end calculation.

Here is equivalent without goto. It uses implementation behaviour that
memcmp(x,y,0) doesn't touch anything.

Is this better?

	* string/memmem.c (__memmem): Optimize hot path.
  

Comments

Paul Eggert May 14, 2015, 1:06 a.m. UTC | #1
Ondřej Bílka wrote:

> You wont get illegal acces.

I don't see why not.  If memchr returns haystack_end - 1, haystack[1] will be 
equivalent to *haystack_end, i.e., one past the end of the haystack.

> Here is equivalent without goto. It uses implementation behaviour that
> memcmp(x,y,0) doesn't touch anything.

Better, but I think it still has the problem.  Also, come to think of it, it's 
not clear that we should bother with the 'needle_len == 2' test (are needles of 
length 2 really that common?), and things are simpler and smaller without it, so 
how about the attached instead?
  
Ondrej Bilka May 14, 2015, 9:29 a.m. UTC | #2
On Wed, May 13, 2015 at 06:06:58PM -0700, Paul Eggert wrote:
> Ondřej Bílka wrote:
> 
> >You wont get illegal acces.
> 
> I don't see why not.  If memchr returns haystack_end - 1,
> haystack[1] will be equivalent to *haystack_end, i.e., one past the
> end of the haystack.
>
I am using different end here

+  const unsigned char *haystack_end = (const unsigned char *)
+                                      haystack_start + haystack_len
+                                      - needle_len + 1;

 
> >Here is equivalent without goto. It uses implementation behaviour that
> >memcmp(x,y,0) doesn't touch anything.
> 
> Better, but I think it still has the problem.  Also, come to think
> of it, it's not clear that we should bother with the 'needle_len ==
> 2' test (are needles of length 2 really that common?), and things
> are simpler and smaller without it, so how about the attached
> instead?
>
Reason is performance. Idea is to use this algorithm and switch to slow
two-way algorithm only for pathological inputs.

This uses simple heuristic to look only for first k characters and once
it matches character beyond k'th switch to two-way algorithm.

With single memchr as now its heuristic with k=1. What are you
suggesting is k=2. My algorithm uses k=3.

Main motivations is that pairs are still too common and in longer switch
you will quickly switch to slow two-way algorithm.

I could use more sophisticated accounting of runtime cost but it would
likely cost more than benefit as most inputs are quite small.

Finally there is buy-or-rent problem. Computing critical factorization
is very costy, often several times more expensive than using naive
algorithm. A solution to limit loss is buy-or-rent approach. You run a
naive algorithm even if its slower until you reach time diffierence
equal of doing factorization, then you switch to two-way algorithm and
you are at most twice slower than if you guessed correctly what to use.
  
Paul Eggert May 15, 2015, 6:08 a.m. UTC | #3
Ondřej Bílka wrote:
> I am using different end here
>
> +  const unsigned char *haystack_end = (const unsigned char *)
> +                                      haystack_start + haystack_len
> +                                      - needle_len + 1;

Ah, sorry, didn't see that.  But in that case the name 'haystack_end' is 
misleading -- that's not the haystack's end, but is something else.  So a 
renaming would appear to be in order.

> Main motivations is that pairs are still too common

Too common where?  Do we have traces of actual programs?
  
Ondrej Bilka May 24, 2015, 2:03 p.m. UTC | #4
On Thu, May 14, 2015 at 11:08:42PM -0700, Paul Eggert wrote:
> Ondřej Bílka wrote:
> >I am using different end here
> >
> >+  const unsigned char *haystack_end = (const unsigned char *)
> >+                                      haystack_start + haystack_len
> >+                                      - needle_len + 1;
> 
> Ah, sorry, didn't see that.  But in that case the name
> 'haystack_end' is misleading -- that's not the haystack's end, but
> is something else.  So a renaming would appear to be in order.
> 
Do you have better suggestion?

> >Main motivations is that pairs are still too common
> 
> Too common where?  Do we have traces of actual programs?

I actually have applications that I use have most haystacks less than 64
bytes so it doesn't make difference.

However its better to be prepared in case programmer uses kb length
haystacks where it would happen. An english digraph th frequency is
around 1% so you will likely switch in first 1/10 of input. For triplets
there could be same problem but I decided to keep it simple,
alternatively could add quadruple check I am open what to use.
  

Patch

diff --git a/string/memmem.c b/string/memmem.c
index 8a81f65..088a2f1 100644
--- a/string/memmem.c
+++ b/string/memmem.c
@@ -46,6 +46,10 @@  __memmem (const void *haystack_start, size_t haystack_len,
      not an array of 'char' values.  See ISO C 99 section 6.2.6.1.  */
   const unsigned char *haystack = (const unsigned char *) haystack_start;
   const unsigned char *needle = (const unsigned char *) needle_start;
+  const unsigned char *haystack_end = (const unsigned char *)
+                                      haystack_start + haystack_len
+                                      - needle_len + 1;
+
 
   if (needle_len == 0)
     /* The first occurrence of the empty string is deemed to occur at
@@ -57,6 +61,25 @@  __memmem (const void *haystack_start, size_t haystack_len,
   if (__glibc_unlikely (haystack_len < needle_len))
     return NULL;
 
+   if (needle_len == 1)
+     return memchr (haystack, needle[0], haystack_end - haystack);
+
+   for (; ;haystack++) 
+     {
+       haystack = memchr (haystack, needle[0], haystack_end - haystack);
+
+       if (!haystack)
+         return NULL;
+
+       if (haystack[1] == needle[1]
+           && (needle_len == 2 || haystack[2] == needle[2]))
+         {
+           if (!memcmp (haystack + 2, needle + 2, needle_len - 2))
+             return (void *) haystack;
+           break;
+         }
+     }
+ 
   /* Use optimizations in memchr when possible, to reduce the search
      size of haystack using a linear algorithm with a smaller
      coefficient.  However, avoid memchr for long needles, since we