From patchwork Tue Jun  9 14:27:48 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Ondrej Bilka <neleai@seznam.cz>
X-Patchwork-Id: 7086
Received: (qmail 85016 invoked by alias); 9 Jun 2015 14:28:04 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 85007 invoked by uid 89); 9 Jun 2015 14:28:03 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL, BAYES_00,
	FREEMAIL_FROM, SPF_NEUTRAL autolearn=no version=3.3.2
X-HELO: popelka.ms.mff.cuni.cz
Date: Tue, 9 Jun 2015 16:27:48 +0200
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: Paul Eggert <eggert@cs.ucla.edu>
Cc: libc-alpha@sourceware.org
Subject: [PATCH v3] Improve fnmatch performance.
Message-ID: <20150609142748.GA3982@domone>
References: <20150512235339.GA27716@domone>
 <5553805B.3070304@cs.ucla.edu>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <5553805B.3070304@cs.ucla.edu>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Wed, May 13, 2015 at 09:48:27AM -0700, Paul Eggert wrote:
> Ondřej Bílka wrote:
> >How to synchronize this with gnulib? Only implementation specific detail
> >is utf8 detection.
> 
> It could be something like this:
> 
>  #if _LIBC
>   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
>   uint_fast32_t encoding =
>     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
>   bool is_utf8 = encoding == !__cet_other;
>  #else
>   bool is_utf8 = STRCASEEQ (locale_charset (),
>                             "UTF-8", 'U','T','F','-','8',0,0,0,0)
>  #endif
> 
> We should package this sort of thing up and make it easier to use,
> but that could be another day.

Yes, I will use that pattern. Some details need to change so that it
also works for single-byte encodings.
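
As a rough illustration of the non-glibc side of that check, here is a
minimal sketch assuming only POSIX nl_langinfo (CODESET) rather than
gnulib's locale_charset. The helper names locale_is_utf8 and
encoding_is_fast are hypothetical and are not part of the patch. The idea
is that a byte-wise search for ASCII characters is safe in single-byte
encodings and in UTF-8, because an ASCII byte never occurs inside a UTF-8
multibyte sequence.

```c
#include <langinfo.h>
#include <stdbool.h>
#include <stdlib.h>
#include <strings.h>

/* Hypothetical helper: true if the current locale's codeset is UTF-8.
   Both common spellings of the name are accepted.  */
static bool
locale_is_utf8 (void)
{
  const char *codeset = nl_langinfo (CODESET);
  return strcasecmp (codeset, "UTF-8") == 0
         || strcasecmp (codeset, "UTF8") == 0;
}

/* Hypothetical helper: true if a byte-wise search for ASCII characters
   cannot land in the middle of a multibyte character.  */
static bool
encoding_is_fast (void)
{
  return MB_CUR_MAX == 1 || locale_is_utf8 ();
}
```

A caller would evaluate encoding_is_fast () once per fnmatch call and skip
the byte-wise shortcut when it returns false.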

I also removed the __builtin_expect on MB_CUR_MAX, since Unicode locales
are widespread.

I also now return a match directly when the entire pattern is normal and
neither FNM_PERIOD nor FNM_FILE_NAME is set, which could also help
performance a bit.
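
To make the shortcut concrete, here is a minimal standalone sketch of a
literal-prefix prefilter. It is not the patch's code: the names is_plain,
literal_prefilter and the PF_* values are hypothetical, it handles only
the case-sensitive path, and it assumes a single-byte or UTF-8 locale.
It anchors the comparison at the start of the string and, when the whole
pattern is literal, also requires the string to end at the same place,
since a prefix or substring hit alone only shows that a match is possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PREFIX_MAX 7   /* Same prefix length limit as the patch.  */

enum prefilter { PF_NOMATCH, PF_MATCH, PF_UNKNOWN };

/* Hypothetical helper: true for ASCII bytes that are never special in a
   pattern.  NUL, non-ASCII bytes and \+/.*?[{(@! are excluded.  */
static bool
is_plain (unsigned char c)
{
  return c != '\0' && c < 0x80 && strchr ("\\+/.*?[{(@!", c) == NULL;
}

/* Decide cheaply where possible; PF_UNKNOWN means the full matcher
   still has to run.  */
static enum prefilter
literal_prefilter (const char *pattern, const char *string)
{
  size_t i = 0;

  assert (pattern != NULL && string != NULL);

  /* Measure the leading run of plain characters, up to PREFIX_MAX.  */
  while (i < PREFIX_MAX && is_plain ((unsigned char) pattern[i]))
    i++;

  /* The literal prefix has to appear at the very start of the string.  */
  if (strncmp (string, pattern, i) != 0)
    return PF_NOMATCH;

  /* The whole pattern was literal: the string must end here too.  */
  if (pattern[i] == '\0')
    return string[i] == '\0' ? PF_MATCH : PF_NOMATCH;

  return PF_UNKNOWN;
}
```

Because '.' and '/' are not plain, a fully literal pattern contains
neither, so the exact comparison in this sketch does not depend on
FNM_PERIOD or FNM_FILE_NAME.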

Then I could allow non-ASCII characters to start a pattern unless the
locale is UTF-8 and FNM_CASEFOLD is set. Would it be better to add two
tables or to add a check for these conditions?

	* posix/fnmatch.c (fnmatch): Improve performance.

diff --git a/posix/fnmatch.c b/posix/fnmatch.c
index fd85efa..4c32992 100644
--- a/posix/fnmatch.c
+++ b/posix/fnmatch.c
@@ -131,6 +131,13 @@ extern int fnmatch (const char *pattern, const char *string, int flags);
 #   define ISWCTYPE(WC, WT)	iswctype (WC, WT)
 #  endif
 
+#  ifdef _LIBC
+#   define STRCASESTR		__strcasestr
+#  else
+#   define STRCASESTR		strcasestr
+#  endif
+
+
 #  if (HAVE_MBSTATE_T && HAVE_MBSRTOWCS) || _LIBC
 /* In this case we are implementing the multibyte character handling.  */
 #   define HANDLE_MULTIBYTE	1
@@ -332,8 +339,62 @@ fnmatch (pattern, string, flags)
      const char *string;
      int flags;
 {
+
+  /* ASCII with \+/.*?[{(@! excluded.  */
+  static unsigned char normal[256] = {
+ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+ 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
+ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
+ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+ };
+#  if _LIBC
+   struct __locale_data *current = _NL_CURRENT_LOCALE->__locales[LC_COLLATE];
+   uint_fast32_t encoding =
+     current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
+   bool fast_encoding = (encoding != __cet_other);
+#  else
+#   if HANDLE_MULTIBYTE
+   bool is_utf8 = STRCASEEQ (locale_charset (),
+                             "UTF-8", 'U','T','F','-','8',0,0,0,0);
+   bool fast_encoding = (MB_CUR_MAX == 1) || is_utf8;
+#   else
+   bool fast_encoding = true;
+#   endif
+#  endif
+
+  if (fast_encoding)
+    {
+      char start[8];
+      char *string2;
+      size_t i;
+      for (i = 0; i < 7 && normal[(unsigned char) pattern[i]]; i++)
+        start[i] = pattern[i];
+      start[i] = 0;
+      if (flags & FNM_CASEFOLD)
+        string2 = STRCASESTR (string, start);
+      else  
+        string2 = strstr (string, start);
+      if (!string2)
+        return FNM_NOMATCH;
+ 
+      if (pattern[i] == '\0' && (flags & (FNM_FILE_NAME | FNM_PERIOD)) == 0)
+        return 0; 
+    }
+
 # if HANDLE_MULTIBYTE
-  if (__builtin_expect (MB_CUR_MAX, 1) != 1)
+  if (MB_CUR_MAX != 1)
     {
       mbstate_t ps;
       size_t n;