Simplify and speedup strstr/strcasestr first match
Commit Message
Adhemerval Zanella wrote:
> +
> + /* Determine length of NEEDLE and handle empty NEEDLE special case. */
> + if (needle[0] == '\0')
> + return (char *) haystack;
> Maybe worth a __glibc_unlikely here?
I don't see any differences with __glibc_unlikely. GCC already does the right thing,
so I'd rather not clutter the code without a benefit.
> + needle_len = strlen (needle);
> +
> + /* Ensure HAYSTACK length is at least as long as NEEDLE length. */
> + haystack_len = __strnlen (haystack, needle_len + 256);
> + if (haystack_len < needle_len)
> I would add a comment why 256 was picked, maybe adding to either avoid
> excessive strnlen calls or add extra memory access to check end of
> string on AVAILABLE.
Fair point - I improved the comment and the description. The chosen value gives a
good speedup on bench-strstr, so the readahead is useful but still small enough it
won't cause performance degradation in the case where there is an early match.
Here is the improved version:
Looking at the benchtests, both strstr and strcasestr spend a lot of time
in a slow initialization loop handling one character per iteration.
This can be simplified and use the much faster strlen/strnlen/strchr/memcmp.
Read ahead a few cachelines to reduce the number of strnlen calls, which
improves performance by ~3-4%. This patch improves the time taken for the
full strstr benchtest by >40%.
Passes GLIBC regression tests.
ChangeLog:
2018-08-01 Wilco Dijkstra <wdijkstr@arm.com>
* string/strcasestr.c (STRCASESTR): Simplify and speedup first match.
* string/strstr.c (AVAILABLE): Likewise.
--
Comments
On 01/08/2018 12:58, Wilco Dijkstra wrote:
> Adhemerval Zanella wrote:
>
>> +
>> + /* Determine length of NEEDLE and handle empty NEEDLE special case. */
>> + if (needle[0] == '\0')
>> + return (char *) haystack;
>
>> Maybe worth a __glibc_unlikely here?
>
> I don't see any differences with __glibc_unlikely. GCC already does the right thing,
> so I'd rather not clutter the code without a benefit.
>
>> + needle_len = strlen (needle);
>> +
>> + /* Ensure HAYSTACK length is at least as long as NEEDLE length. */
>> + haystack_len = __strnlen (haystack, needle_len + 256);
>> + if (haystack_len < needle_len)
>
>> I would add a comment why 256 was picked, maybe adding to either avoid
>> excessive strnlen calls or add extra memory access to check end of
>> string on AVAILABLE.
>
> Fair point - I improved the comment and the description. The chosen value gives a
> good speedup on bench-strstr, so the readahead is useful but still small enough it
> won't cause performance degradation in the case where there is an early match.
> Here is the improved version:
>
>
>
> Looking at the benchtests, both strstr and strcasestr spend a lot of time
> in a slow initialization loop handling one character per iteration.
> This can be simplified and use the much faster strlen/strnlen/strchr/memcmp.
> Read ahead a few cachelines to reduce the number of strnlen calls, which
> improves performance by ~3-4%. This patch improves the time taken for the
> full strstr benchtest by >40%.
>
> Passes GLIBC regression tests.
>
> ChangeLog:
> 2018-08-01 Wilco Dijkstra <wdijkstr@arm.com>
>
> * string/strcasestr.c (STRCASESTR): Simplify and speedup first match.
> * string/strstr.c (AVAILABLE): Likewise.
LGTM, thanks.
> --
>
> diff --git a/string/strcasestr.c b/string/strcasestr.c
> index 5909fe3cdba88e476c7a989a020f3611bbfeb1de..1f6b7b846f81d0d8fe77b390062d2d86be11b056 100644
> --- a/string/strcasestr.c
> +++ b/string/strcasestr.c
> @@ -58,31 +58,22 @@
> case-insensitive comparison. This function gives unspecified
> results in multibyte locales. */
> char *
> -STRCASESTR (const char *haystack_start, const char *needle_start)
> +STRCASESTR (const char *haystack, const char *needle)
> {
> - const char *haystack = haystack_start;
> - const char *needle = needle_start;
> size_t needle_len; /* Length of NEEDLE. */
> size_t haystack_len; /* Known minimum length of HAYSTACK. */
> - bool ok = true; /* True if NEEDLE is prefix of HAYSTACK. */
> -
> - /* Determine length of NEEDLE, and in the process, make sure
> - HAYSTACK is at least as long (no point processing all of a long
> - NEEDLE if HAYSTACK is too short). */
> - while (*haystack && *needle)
> - {
> - ok &= (TOLOWER ((unsigned char) *haystack)
> - == TOLOWER ((unsigned char) *needle));
> - haystack++;
> - needle++;
> - }
> - if (*needle)
> +
> + /* Handle empty NEEDLE special case. */
> + if (needle[0] == '\0')
> + return (char *) haystack;
> +
> + /* Ensure HAYSTACK length is at least as long as NEEDLE length.
> + Since a match may occur early on in a huge HAYSTACK, use strnlen
> + and read ahead a few cachelines for improved performance. */
> + needle_len = strlen (needle);
> + haystack_len = __strnlen (haystack, needle_len + 256);
> + if (haystack_len < needle_len)
> return NULL;
> - if (ok)
> - return (char *) haystack_start;
> - needle_len = needle - needle_start;
> - haystack = haystack_start + 1;
> - haystack_len = needle_len - 1;
>
> /* Perform the search. Abstract memory is considered to be an array
> of 'unsigned char' values, not an array of 'char' values. See
> @@ -90,10 +81,10 @@ STRCASESTR (const char *haystack_start, const char *needle_start)
> if (needle_len < LONG_NEEDLE_THRESHOLD)
> return two_way_short_needle ((const unsigned char *) haystack,
> haystack_len,
> - (const unsigned char *) needle_start,
> + (const unsigned char *) needle,
> needle_len);
> return two_way_long_needle ((const unsigned char *) haystack, haystack_len,
> - (const unsigned char *) needle_start,
> + (const unsigned char *) needle,
> needle_len);
> }
>
> diff --git a/string/strstr.c b/string/strstr.c
> index 265e9f310ce507ce63740cc42d8ceea1d28dab01..33acdc5442d3e9031298a56e4f52a19e2ffa7d84 100644
> --- a/string/strstr.c
> +++ b/string/strstr.c
> @@ -50,33 +50,32 @@
> if NEEDLE is empty, otherwise NULL if NEEDLE is not found in
> HAYSTACK. */
> char *
> -STRSTR (const char *haystack_start, const char *needle_start)
> +STRSTR (const char *haystack, const char *needle)
> {
> - const char *haystack = haystack_start;
> - const char *needle = needle_start;
> size_t needle_len; /* Length of NEEDLE. */
> size_t haystack_len; /* Known minimum length of HAYSTACK. */
> - bool ok = true; /* True if NEEDLE is prefix of HAYSTACK. */
> -
> - /* Determine length of NEEDLE, and in the process, make sure
> - HAYSTACK is at least as long (no point processing all of a long
> - NEEDLE if HAYSTACK is too short). */
> - while (*haystack && *needle)
> - ok &= *haystack++ == *needle++;
> - if (*needle)
> +
> + /* Handle empty NEEDLE special case. */
> + if (needle[0] == '\0')
> + return (char *) haystack;
> +
> + /* Skip until we find the first matching char from NEEDLE. */
> + haystack = strchr (haystack, needle[0]);
> + if (haystack == NULL || needle[1] == '\0')
> + return (char *) haystack;
> +
> + /* Ensure HAYSTACK length is at least as long as NEEDLE length.
> + Since a match may occur early on in a huge HAYSTACK, use strnlen
> + and read ahead a few cachelines for improved performance. */
> + needle_len = strlen (needle);
> + haystack_len = __strnlen (haystack, needle_len + 256);
> + if (haystack_len < needle_len)
> return NULL;
> - if (ok)
> - return (char *) haystack_start;
> -
> - /* Reduce the size of haystack using strchr, since it has a smaller
> - linear coefficient than the Two-Way algorithm. */
> - needle_len = needle - needle_start;
> - haystack = strchr (haystack_start + 1, *needle_start);
> - if (!haystack || __builtin_expect (needle_len == 1, 0))
> +
> + /* Check whether we have a match. This improves performance since we avoid
> + the initialization overhead of the two-way algorithm. */
> + if (memcmp (haystack, needle, needle_len) == 0)
> return (char *) haystack;
> - needle -= needle_len;
> - haystack_len = (haystack > haystack_start + needle_len ? 1
> - : needle_len + haystack_start - haystack);
>
> /* Perform the search. Abstract memory is considered to be an array
> of 'unsigned char' values, not an array of 'char' values. See
>
>
>
>
@@ -58,31 +58,22 @@
   case-insensitive comparison. This function gives unspecified
   results in multibyte locales. */
 char *
-STRCASESTR (const char *haystack_start, const char *needle_start)
+STRCASESTR (const char *haystack, const char *needle)
 {
-Â const char *haystack = haystack_start;
-Â const char *needle = needle_start;
  size_t needle_len; /* Length of NEEDLE. */
  size_t haystack_len; /* Known minimum length of HAYSTACK. */
- bool ok = true; /* True if NEEDLE is prefix of HAYSTACK. */
-
-Â /* Determine length of NEEDLE, and in the process, make sure
-Â Â Â Â HAYSTACK is at least as long (no point processing all of a long
-    NEEDLE if HAYSTACK is too short). */
-Â while (*haystack && *needle)
-Â Â Â {
-Â Â Â Â Â ok &= (TOLOWER ((unsigned char) *haystack)
-Â Â Â Â Â Â Â Â Â Â Â == TOLOWER ((unsigned char) *needle));
-Â Â Â Â Â haystack++;
-Â Â Â Â Â needle++;
-Â Â Â }
-Â if (*needle)
+
+ /* Handle empty NEEDLE special case. */
+Â if (needle[0] == '\0')
+Â Â Â return (char *) haystack;
+
+Â /* Ensure HAYSTACK length is at least as long as NEEDLE length.
+Â Â Â Â Since a match may occur early on in a huge HAYSTACK, use strnlen
+    and read ahead a few cachelines for improved performance. */
+Â needle_len = strlen (needle);
+Â haystack_len = __strnlen (haystack, needle_len + 256);
+Â if (haystack_len < needle_len)
    return NULL;
-Â if (ok)
-Â Â Â return (char *) haystack_start;
-Â needle_len = needle - needle_start;
-Â haystack = haystack_start + 1;
-Â haystack_len = needle_len - 1;
Â
  /* Perform the search. Abstract memory is considered to be an array
     of 'unsigned char' values, not an array of 'char' values. See
@@ -90,10 +81,10 @@ STRCASESTR (const char *haystack_start, const char *needle_start)
  if (needle_len < LONG_NEEDLE_THRESHOLD)
    return two_way_short_needle ((const unsigned char *) haystack,
                                 haystack_len,
-Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â (const unsigned char *) needle_start,
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â (const unsigned char *) needle,
                                 needle_len);
  return two_way_long_needle ((const unsigned char *) haystack, haystack_len,
-Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â (const unsigned char *) needle_start,
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â (const unsigned char *) needle,
                              needle_len);
 }
Â
@@ -50,33 +50,32 @@
   if NEEDLE is empty, otherwise NULL if NEEDLE is not found in
   HAYSTACK. */
 char *
-STRSTR (const char *haystack_start, const char *needle_start)
+STRSTR (const char *haystack, const char *needle)
 {
-Â const char *haystack = haystack_start;
-Â const char *needle = needle_start;
  size_t needle_len; /* Length of NEEDLE. */
  size_t haystack_len; /* Known minimum length of HAYSTACK. */
- bool ok = true; /* True if NEEDLE is prefix of HAYSTACK. */
-
-Â /* Determine length of NEEDLE, and in the process, make sure
-Â Â Â Â HAYSTACK is at least as long (no point processing all of a long
-    NEEDLE if HAYSTACK is too short). */
-Â while (*haystack && *needle)
-Â Â Â ok &= *haystack++ == *needle++;
-Â if (*needle)
+
+ /* Handle empty NEEDLE special case. */
+Â if (needle[0] == '\0')
+Â Â Â return (char *) haystack;
+
+ /* Skip until we find the first matching char from NEEDLE. */
+Â haystack = strchr (haystack, needle[0]);
+Â if (haystack == NULL || needle[1] == '\0')
+Â Â Â return (char *) haystack;
+
+Â /* Ensure HAYSTACK length is at least as long as NEEDLE length.
+Â Â Â Â Since a match may occur early on in a huge HAYSTACK, use strnlen
+    and read ahead a few cachelines for improved performance. */
+Â needle_len = strlen (needle);
+Â haystack_len = __strnlen (haystack, needle_len + 256);
+Â if (haystack_len < needle_len)
    return NULL;
-Â if (ok)
-Â Â Â return (char *) haystack_start;
-
-Â /* Reduce the size of haystack using strchr, since it has a smaller
-    linear coefficient than the Two-Way algorithm. */
-Â needle_len = needle - needle_start;
-Â haystack = strchr (haystack_start + 1, *needle_start);
-Â if (!haystack || __builtin_expect (needle_len == 1, 0))
+
+ /* Check whether we have a match. This improves performance since we avoid
+    the initialization overhead of the two-way algorithm. */
+Â if (memcmp (haystack, needle, needle_len) == 0)
    return (char *) haystack;
-Â needle -= needle_len;
-Â haystack_len = (haystack > haystack_start + needle_len ? 1
-Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â : needle_len + haystack_start - haystack);
Â
  /* Perform the search. Abstract memory is considered to be an array
     of 'unsigned char' values, not an array of 'char' values. See