From patchwork Fri May 24 13:53:50 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Wilco Dijkstra X-Patchwork-Id: 32848 Received: (qmail 8728 invoked by alias); 24 May 2019 13:54:00 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 8230 invoked by uid 89); 24 May 2019 13:53:59 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-17.7 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_PASS, SPF_PASS autolearn=ham version=3.3.1 spammy=improves, repeatedly, gain, needle X-HELO: EUR01-HE1-obe.outbound.protection.outlook.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=armh.onmicrosoft.com; s=selector2-armh-onmicrosoft-com; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=HN73whM55Bqk68uSxb51SSWdliJfhKTkkc7H6SV5Udw=; b=uMAY3udEFgA69abHPdNr25v6mLycPbyaGn7qOyuDh5Ne4sNWQ0CdQWKeAy0D/fYLum/UrulKrJjKE0K74fnbTDpuysB54iN8r0D7Oa5w1AWzrhM6r/kkTxMNkg/wMgDuuT1eQpRmKVACHBanGnME52gYvb4FMagBOVy6npn7o68= From: Wilco Dijkstra To: "libc-alpha@sourceware.org" , Zack Weinberg , Adhemerval Zanella CC: nd Subject: Re: [PATCH v3] Improve performance of strstr Date: Fri, 24 May 2019 13:53:50 +0000 Message-ID: References: , In-Reply-To: authentication-results: spf=none (sender IP is ) smtp.mailfrom=Wilco.Dijkstra@arm.com; x-ms-exchange-purlcount: 2 x-ms-oob-tlc-oobclassifiers: OLM:1850; received-spf: None (protection.outlook.com: arm.com does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 MIME-Version: 1.0 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-CrossTenant-userprincipalname: Wilco.Dijkstra@arm.com ping Since we've discussed worst-case performance in great detail without any issues found, I'd like to close this out. Are there any other review comments? Ideally I'd like to backport this too so distros see the performance gains sooner. v3: Remove some unused defines. Increase filter size to 8 bytes. This patch significantly improves performance of strstr using a novel modified Horspool algorithm.  Needles up to size 256 use a bad-character table indexed by hashed pairs of characters to quickly skip past mismatches. Long needles use a self-adapting filtering step to avoid comparing the whole needle repeatedly. By limiting the needle length to 256, the shift table only requires 8 bits per entry, lowering preprocessing overhead and minimizing cache effects. This limit also implies worst-case performance is linear. Small needles up to size 3 use a dedicated linear search.  Very long needles use the Two-Way algorithm. The performance gain using the improved bench-strstr on Cortex-A72 is 5.8 times basic_strstr and 3.7 times twoway_strstr. Tested against GLIBC testsuite, randomized tests and the GNULIB strstr test (https://git.savannah.gnu.org/cgit/gnulib.git/tree/tests/test-strstr.c). OK for commit? ChangeLog: 2019-04-26  Wilco Dijkstra          * string/str-two-way.h (two_way_short_needle): Add inline to avoid         warning.         (two_way_long_needle): Block inlining.         * string/strstr.c (strstr2): Add new function.         (strstr3): Likewise.         (STRSTR): Completely rewrite strstr to improve performance. diff --git a/string/str-two-way.h b/string/str-two-way.h index b5011baafa77a2d211598be246657b9a33fd8a2e..50fe6b46848bb1e758a667957ab06cfa207bcacb 100644 --- a/string/str-two-way.h +++ b/string/str-two-way.h @@ -221,7 +221,7 @@ critical_factorization (const unsigned char *needle, size_t needle_len,     most 2 * HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching.     If AVAILABLE modifies HAYSTACK_LEN (as in strstr), then at most 3 *     HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching.  */ -static RETURN_TYPE +static inline RETURN_TYPE  two_way_short_needle (const unsigned char *haystack, size_t haystack_len,                        const unsigned char *needle, size_t needle_len)  { @@ -383,7 +383,7 @@ two_way_short_needle (const unsigned char *haystack, size_t haystack_len,     If AVAILABLE modifies HAYSTACK_LEN (as in strstr), then at most 3 *     HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching, and     sublinear performance is not possible.  */ -static RETURN_TYPE +__attribute__((noinline)) static RETURN_TYPE  two_way_long_needle (const unsigned char *haystack, size_t haystack_len,                       const unsigned char *needle, size_t needle_len)  { diff --git a/string/strstr.c b/string/strstr.c index 64e478b9e765a903228dca0bd1504096b374f12c..e2333025fea01d4dea7b586ab3f2dba02c8a224e 100644 --- a/string/strstr.c +++ b/string/strstr.c @@ -16,29 +16,17 @@     License along with the GNU C Library; if not, see     .  */   -/* This particular implementation was written by Eric Blake, 2008.  */ -  #ifndef _LIBC  # include  #endif   -/* Specification of strstr.  */  #include   -#include - -#ifndef _LIBC -# define __builtin_expect(expr, val)   (expr) -#endif -  #define RETURN_TYPE char *  #define AVAILABLE(h, h_l, j, n_l)                       \    (((j) + (n_l) <= (h_l)) \     || ((h_l) += __strnlen ((void*)((h) + (h_l)), (n_l) + 512), \         (j) + (n_l) <= (h_l))) -#define CHECK_EOL (1) -#define RET0_IF_0(a) if (!a) goto ret0 -#define FASTSEARCH(S,C,N) (void*) strchr ((void*)(S), (C))  #include "str-two-way.h"    #undef strstr @@ -47,47 +35,122 @@  #define STRSTR strstr  #endif   -/* Return the first occurrence of NEEDLE in HAYSTACK.  Return HAYSTACK -   if NEEDLE is empty, otherwise NULL if NEEDLE is not found in -   HAYSTACK.  */ -char * -STRSTR (const char *haystack, const char *needle) +static inline char * +strstr2 (const unsigned char *hs, const unsigned char *ne)  { -  size_t needle_len; /* Length of NEEDLE.  */ -  size_t haystack_len; /* Known minimum length of HAYSTACK.  */ - -  /* Handle empty NEEDLE special case.  */ -  if (needle[0] == '\0') -    return (char *) haystack; +  uint32_t h1 = (ne[0] << 16) | ne[1]; +  uint32_t h2 = 0; +  for (int c = hs[0]; h1 != h2 && c != 0; c = *++hs) +      h2 = (h2 << 16) | c; +  return h1 == h2 ? (char *)hs - 2 : NULL; +}   -  /* Skip until we find the first matching char from NEEDLE.  */ -  haystack = strchr (haystack, needle[0]); -  if (haystack == NULL || needle[1] == '\0') -    return (char *) haystack; +static inline char * +strstr3 (const unsigned char *hs, const unsigned char *ne) +{ +  uint32_t h1 = (ne[0] << 24) | (ne[1] << 16) | (ne[2] << 8); +  uint32_t h2 = 0; +  for (int c = hs[0]; h1 != h2 && c != 0; c = *++hs) +      h2 = (h2 | c) << 8; +  return h1 == h2 ? (char *)hs - 3 : NULL; +}   -  /* Ensure HAYSTACK length is at least as long as NEEDLE length. -     Since a match may occur early on in a huge HAYSTACK, use strnlen +#define hash2(p) (((size_t)(p)[0] - ((size_t)(p)[-1] << 3)) % sizeof (shift)) + +/* Fast strstr algorithm with guaranteed linear-time performance. +   Small needles up to size 2 use a dedicated linear search.  Longer needles +   up to size 256 use a novel modified Horspool algorithm.  It hashes pairs +   of characters to quickly skip past mismatches.  The main search loop only +   exits if the last 2 characters match, avoiding unnecessary calls to memcmp +   and allowing for a larger skip if there is no match.  A self-adapting +   filtering check is used to quickly detect mismatches in long needles. +   By limiting the needle length to 256, the shift table can be reduced to 8 +   bits per entry, lowering preprocessing overhead and minimizing cache effects. +   The limit also implies worst-case performance is linear. +   Needles larger than 256 characters use the linear-time Two-Way algorithm.  */ +char * +STRSTR (const char *haystack, const char *needle) +{ +  const unsigned char *hs = (const unsigned char *) haystack; +  const unsigned char *ne = (const unsigned char *) needle; + +  /* Handle short needle special cases first.  */ +  if (ne[0] == '\0') +    return (char *)hs; +  hs = (const unsigned char *)strchr ((const char*)hs, ne[0]); +  if (hs == NULL || ne[1] == '\0') +    return (char*)hs; +  if (ne[2] == '\0') +    return strstr2 (hs, ne); +  if (ne[3] == '\0') +    return strstr3 (hs, ne); + +  /* Ensure haystack length is at least as long as needle length. +     Since a match may occur early on in a huge haystack, use strnlen       and read ahead a few cachelines for improved performance.  */ -  needle_len = strlen (needle); -  haystack_len = __strnlen (haystack, needle_len + 256); -  if (haystack_len < needle_len) +  size_t ne_len = strlen ((const char*)ne); +  size_t hs_len = __strnlen ((const char*)hs, ne_len | 512); +  if (hs_len < ne_len)      return NULL;   -  /* Check whether we have a match.  This improves performance since we avoid -     the initialization overhead of the two-way algorithm.  */ -  if (memcmp (haystack, needle, needle_len) == 0) -    return (char *) haystack; - -  /* Perform the search.  Abstract memory is considered to be an array -     of 'unsigned char' values, not an array of 'char' values.  See -     ISO C 99 section 6.2.6.1.  */ -  if (needle_len < LONG_NEEDLE_THRESHOLD) -    return two_way_short_needle ((const unsigned char *) haystack, -                                haystack_len, -                                (const unsigned char *) needle, needle_len); -  return two_way_long_needle ((const unsigned char *) haystack, haystack_len, -                             (const unsigned char *) needle, needle_len); +  /* Check whether we have a match.  This improves performance since we +     avoid initialization overheads.  */ +  if (memcmp (hs, ne, ne_len) == 0) +    return (char *) hs; + +  /* Use Two-Way algorithm for very long needles.  */ +  if (__glibc_unlikely (ne_len > 256)) +    return two_way_long_needle (hs, hs_len, ne, ne_len); + +  const unsigned char *end = hs + hs_len - ne_len; +  uint8_t shift[256]; +  size_t tmp, shift1; +  size_t m1 = ne_len - 1; +  size_t offset = 0; + +  /* Initialize bad character shift hash table.  */ +  memset (shift, 0, sizeof (shift)); +  for (int i = 1; i < m1; i++) +    shift[hash2 (ne + i)] = i; +  shift1 = m1 - shift[hash2 (ne + m1)]; +  shift[hash2 (ne + m1)] = m1; + +  while (1) +    { +      if (__glibc_unlikely (hs > end)) +       { +         end += __strnlen ((const char*)end + m1 + 1, 2048); +         if (hs > end) +           return NULL; +       } + +      /* Skip past character pairs not in the needle.  */ +      do +       { +         hs += m1; +         tmp = shift[hash2 (hs)]; +       } +      while (hs <= end && tmp == 0); + +      /* If the match is not at the end of the needle, shift to the end +        and continue until we match the last 2 characters.  */ +      hs -= tmp; +      if (tmp < m1) +       continue; + +      /* The last 2 characters match.  If the needle is long, check a +        fixed number of characters first to quickly filter out mismatches.  */ +      if (m1 < 15 || memcmp (hs + offset, ne + offset, 8) == 0) +       { +         if (memcmp (hs, ne, m1) == 0) +           return (void *) hs; + +         /* Adjust filter offset when it doesn't find the mismatch.  */ +         offset = (offset >= 8 ? offset : m1) - 8; +       } + +      /* Skip based on matching the last 2 characters.  */ +      hs += shift1; +    }  }  libc_hidden_builtin_def (strstr) - -#undef LONG_NEEDLE_THRESHOLD