From patchwork Tue May 12 20:53:11 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ondrej Bilka X-Patchwork-Id: 6680 Received: (qmail 127048 invoked by alias); 12 May 2015 20:53:28 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 127039 invoked by uid 89); 12 May 2015 20:53:27 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL, BAYES_00, FREEMAIL_FROM, SPF_NEUTRAL autolearn=no version=3.3.2 X-HELO: popelka.ms.mff.cuni.cz Date: Tue, 12 May 2015 22:53:11 +0200 From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= To: libc-alpha@sourceware.org Subject: [PATCH] Improve generic strstr performance. Message-ID: <20150512205311.GA24057@domone> MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.5.20 (2009-06-14) Hi, As I mentioned before that generic strstr is not well optimized this improves hot path. This exploits fact that naive algorithm with vectorized strchr is hard to beat. Main weakness of current algorithm is that it spends lot of time in critical factorization which is useless as haystack is too short. We look for leading triple which rarely produces mismatch mismatch switch to slow two-way algorithm. Performance improvement is function of entropy of string and how good is underlying strchr. On x64 this is around five times faster when each character has probability 1/128. A simple benchmark that I used is send separately. OK to commit? * string/strstr.c (STRSTR): Optimize hot path. diff --git a/string/strstr.c b/string/strstr.c index 045e878..45fb423 100644 --- a/string/strstr.c +++ b/string/strstr.c @@ -57,23 +57,62 @@ STRSTR (const char *haystack_start, const char *needle_start) size_t haystack_len; /* Known minimum length of HAYSTACK. */ bool ok = true; /* True if NEEDLE is prefix of HAYSTACK. */ - /* Determine length of NEEDLE, and in the process, make sure - HAYSTACK is at least as long (no point processing all of a long - NEEDLE if HAYSTACK is too short). */ - while (*haystack && *needle) - ok &= *haystack++ == *needle++; - if (*needle) - return NULL; - if (ok) - return (char *) haystack_start; - - /* Reduce the size of haystack using strchr, since it has a smaller - linear coefficient than the Two-Way algorithm. */ - needle_len = needle - needle_start; - haystack = strchr (haystack_start + 1, *needle_start); - if (!haystack || __builtin_expect (needle_len == 1, 0)) + if (!needle[0]) return (char *) haystack; - needle -= needle_len; + + if (!needle[1]) + return strchr (haystack, needle[0]); + + /* + This optimizes strstr hot path. + + Main performance problem is that haystacks and needle are short. + You spend 99% of time on sub64byte haystacks. + That makes preprocessing very expensive. + Without this around 90% of time would be spend at calculating critical + factorization that would never be used. + + This loop should be fast due to optimized strchr implementation. + It would likely traverse haystack as character triplet occurences are + rare for practical strings. + */ + + while ((haystack = strchr (haystack, needle[0]))) + { + if (haystack[1] == needle[1] && (haystack[2] == needle[2] || needle[2] == '\000')) + { + + if (needle[2] == '\000') + return (char *) haystack; + + /* Determine length of NEEDLE, and in the process, make sure + HAYSTACK is at least as long (no point processing all of a long + NEEDLE if HAYSTACK is too short). */ + + needle += 2; + haystack += 2; + + while (*haystack && *needle) + ok &= *haystack++ == *needle++; + if (*needle) + return NULL; + + needle_len = needle - needle_start; + + if (ok) + return (char *) haystack - needle_len; + + goto cold_path; + } + else + haystack++; + } + + return NULL; + + cold_path:; + + needle = needle_start; haystack_len = (haystack > haystack_start + needle_len ? 1 : needle_len + haystack_start - haystack);