[1/2] posix: User scratch_buffer on fnmatch

Message ID 20210104202528.1228255-1-adhemerval.zanella@linaro.org
State Superseded
Headers
Series [1/2] posix: User scratch_buffer on fnmatch |

Commit Message

Adhemerval Zanella Jan. 4, 2021, 8:25 p.m. UTC
  It removes the alloca usage on the string convertion to wide characters
before calling the internal function.

Checked on x86_64-linux-gnu.
---
 posix/fnmatch.c | 152 +++++++++++++++++-------------------------------
 1 file changed, 53 insertions(+), 99 deletions(-)
  

Comments

Florian Weimer Jan. 4, 2021, 8:35 p.m. UTC | #1
* Adhemerval Zanella via Libc-alpha:

> It removes the alloca usage on the string convertion to wide characters
> before calling the internal function.

We have a downstream-only patch to fall back the single byte handling in
case of multibyte decoding failure.  Basically it's a quick hack to fix
this bug:

  <https://sourceware.org/bugzilla/show_bug.cgi?id=14185>

Is this something we should upstream?  Or rework fnmatch so that * is
matched properly against arbitrary bytes?

Thanks,
Florian
  
Adhemerval Zanella Jan. 5, 2021, 1:07 p.m. UTC | #2
On 04/01/2021 17:35, Florian Weimer wrote:
> * Adhemerval Zanella via Libc-alpha:
> 
>> It removes the alloca usage on the string convertion to wide characters
>> before calling the internal function.
> 
> We have a downstream-only patch to fall back the single byte handling in
> case of multibyte decoding failure.  Basically it's a quick hack to fix
> this bug:
> 
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=14185>
> 
> Is this something we should upstream?  Or rework fnmatch so that * is
> matched properly against arbitrary bytes?

It seems that gnulib has added the proposed fix with 
aed23714d60d91b2abc74be33635c32ddc5132b5 (done in 2005) and just recently
with a glibc merge with 67306f600fe6a3bcf3fbb6d8bf4b8953b74a8fb7 (done in
2020 to sync back glibc changes) it has fallback to old semantic to return
-1 on in case of failure.  

I am not sure if gnulib was intentional or an overlook. But I am slight
worried about the issues raised by Rich in comment #4, where fnmatch would
match wrong patterns that happen to have invalid multibyte sequence.  Maybe
gnulib guys can gives us some insight here about the realword and if the
67306f600fe6a3 was intentional or not.

I have started to check the feasibility or making the Rich suggestions at
comment #7, to make fnmatch not go in the temporary buffer and call
mbrtowc.  It should be feasible, however it would require more extensive
change in the algorithm and some work to optimize for MB_CUR_MAX == 1
(where mbrtowc should not be necessary).
  
Paul Eggert Jan. 13, 2021, 7:25 p.m. UTC | #3
On 1/5/21 5:07 AM, Adhemerval Zanella wrote:
> It seems that gnulib has added the proposed fix with
> aed23714d60d91b2abc74be33635c32ddc5132b5 (done in 2005) and just recently
> with a glibc merge with 67306f600fe6a3bcf3fbb6d8bf4b8953b74a8fb7 (done in
> 2020 to sync back glibc changes) it has fallback to old semantic to return
> -1 on in case of failure.
> 
> I am not sure if gnulib was intentional or an overlook.

It was an oversight. I simply missed the issue when I did the merge.

> I have started to check the feasibility or making the Rich suggestions at
> comment #7,

That's the right way to go. I would not spend much time trying to fix 
the bugs in the existing code. We should rip out all the wide-char 
stuff, treat encoding errors as if they were just another "character" 
(that's what Emacs does), and stay in the multibyte world. We could 
steal some ideas from Emacs here.

By the way, how important is it to support awful encodings like 
shift-JIS that contain bytes that look like '\'? If we don't have to 
support these encodings any more, things get a bit easier. (Asking for a 
friend. :-)
  
Florian Weimer Jan. 13, 2021, 7:39 p.m. UTC | #4
* Paul Eggert:

> By the way, how important is it to support awful encodings like
> shift-JIS that contain bytes that look like '\'? If we don't have to 
> support these encodings any more, things get a bit easier. (Asking for
> a friend. :-)

There is a Shift-JIS variant which is ASCII-transparent (Windows-31J,
it's also specified by WhatWG/HTML5), so from a glibc point of view, it
would be just an ordinary charset like any other.

But feedback we have received is that the users who want Shift-JIS
really want the original thing.

We do not presently support either variant downstream, but one potential
way forward would be to turn Windows-31J into a fully supported glibc
charset with a corresponding ja_JP locale (which would imply downstream
support as well), and just hope that it displaces the original Shift-JIS
in the future.

Thanks,
Florian
  
Bruno Haible Jan. 13, 2021, 11:36 p.m. UTC | #5
Paul Eggert asked:
> > By the way, how important is it to support awful encodings like
> > shift-JIS that contain bytes that look like '\'? If we don't have to 
> > support these encodings any more, things get a bit easier.

Here we are talking about locale encodings, and Shift_JIS (as well as
SHIFT_JISX0213) are not usable as a locale encoding in glibc. See e.g.
[1], [2].

That's the reason why no Shift_JIS locale is listed in
glibc/localedata/SUPPORTED. [3]

Florian Weimer wrote:
> There is a Shift-JIS variant which is ASCII-transparent (Windows-31J,
> it's also specified by WhatWG/HTML5), so from a glibc point of view, it
> would be just an ordinary charset like any other.
> 
> But feedback we have received is that the users who want Shift-JIS
> really want the original thing.
> 
> We do not presently support either variant downstream, but one potential
> way forward would be to turn Windows-31J into a fully supported glibc
> charset with a corresponding ja_JP locale (which would imply downstream
> support as well), and just hope that it displaces the original Shift-JIS
> in the future.

I don't think there's a real need for that. In the years 1995 ... 2005 there
was a lot of resistence against Unicode in Japan, because Unicode maps
several slightly differently looking glyph images to the same glyph/character
(even for Western encodings, for example the Polish accents look a bit different
than the French ones), and - at the time - Unicode did not have means to
disambiguate these, thus people complained about "characters are rendered
incorrectly if you use Unicode". This has been resolved for more than 10 years
already.

Bruno

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=3140
[2] https://sourceware.org/legacy-ml/libc-alpha/2000-10/msg00311.html
[3] https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED
  
Florian Weimer Jan. 14, 2021, 10 a.m. UTC | #6
* Bruno Haible:

> Paul Eggert asked:
>> > By the way, how important is it to support awful encodings like
>> > shift-JIS that contain bytes that look like '\'? If we don't have to 
>> > support these encodings any more, things get a bit easier.
>
> Here we are talking about locale encodings, and Shift_JIS (as well as
> SHIFT_JISX0213) are not usable as a locale encoding in glibc. See e.g.
> [1], [2].
>
> That's the reason why no Shift_JIS locale is listed in
> glibc/localedata/SUPPORTED. [3]

> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=3140
> [2] https://sourceware.org/legacy-ml/libc-alpha/2000-10/msg00311.html
> [3] https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED

We used to have a fully supported product based on the original
Shift-JIS.  It did not require glibc changes (we package both localedef
and the locale sources, so it's easy to build custom locales), but other
GNU components had to be patched.

> Florian Weimer wrote:
>> There is a Shift-JIS variant which is ASCII-transparent (Windows-31J,
>> it's also specified by WhatWG/HTML5), so from a glibc point of view, it
>> would be just an ordinary charset like any other.
>> 
>> But feedback we have received is that the users who want Shift-JIS
>> really want the original thing.
>> 
>> We do not presently support either variant downstream, but one potential
>> way forward would be to turn Windows-31J into a fully supported glibc
>> charset with a corresponding ja_JP locale (which would imply downstream
>> support as well), and just hope that it displaces the original Shift-JIS
>> in the future.
>
> I don't think there's a real need for that. In the years 1995 ... 2005
> there was a lot of resistence against Unicode in Japan, because
> Unicode maps several slightly differently looking glyph images to the
> same glyph/character (even for Western encodings, for example the
> Polish accents look a bit different than the French ones), and - at
> the time - Unicode did not have means to disambiguate these, thus
> people complained about "characters are rendered incorrectly if you
> use Unicode". This has been resolved for more than 10 years already.

We saw commercial demand for Shift-JIS much later than that.  I think an
official Windows-31J-based ja_JP would still be welcomed at this point.

A Windows-31J locale could be added to localedata/SUPPORTED.  We have
not done that yet because someone wanted to look into alignment between
Windows, HTML/WhatWG and what we currently have in the source tree, but
that hasn't happened yet, unfortunately.

Thanks,
Florian
  
Adhemerval Zanella Jan. 14, 2021, 11:44 a.m. UTC | #7
On 13/01/2021 16:25, Paul Eggert wrote:
> On 1/5/21 5:07 AM, Adhemerval Zanella wrote:
>> It seems that gnulib has added the proposed fix with
>> aed23714d60d91b2abc74be33635c32ddc5132b5 (done in 2005) and just recently
>> with a glibc merge with 67306f600fe6a3bcf3fbb6d8bf4b8953b74a8fb7 (done in
>> 2020 to sync back glibc changes) it has fallback to old semantic to return
>> -1 on in case of failure.
>>
>> I am not sure if gnulib was intentional or an overlook.
> 
> It was an oversight. I simply missed the issue when I did the merge.
> 
>> I have started to check the feasibility or making the Rich suggestions at
>> comment #7,
> 
> That's the right way to go. I would not spend much time trying to fix the bugs in the existing code. We should rip out all the wide-char stuff, treat encoding errors as if they were just another "character" (that's what Emacs does), and stay in the multibyte world. We could steal some ideas from Emacs here.

I think we still should fix BZ#14185 with the suggested with the strategy of
falling back to non wide mode in case of encoding error.  

The full fix will take some time, since it is pretty much a rewrite of the
fnmatch_loop.c.

> 
> By the way, how important is it to support awful encodings like shift-JIS that contain bytes that look like '\'? If we don't have to support these encodings any more, things get a bit easier. (Asking for a friend. :-)
  
Paul Eggert Jan. 15, 2021, 6:56 a.m. UTC | #8
On 1/14/21 3:44 AM, Adhemerval Zanella wrote:
> I think we still should fix BZ#14185 with the suggested with the strategy of
> falling back to non wide mode in case of encoding error.

For glibc 2.33 that's likely the best we can do.

> The full fix will take some time, since it is pretty much a rewrite of the
> fnmatch_loop.c.

Yes. Should be doable for 2.34 though, I'd hope.
  
Paul Eggert March 6, 2021, 5:18 p.m. UTC | #9
On 1/14/21 2:00 AM, Florian Weimer wrote:
> We saw commercial demand for Shift-JIS much later than that.

Yes, I see Red Hat has a product enhancement for that, for coreutils, 
dated 2019.[1]

Shift-JIS is still very much alive and kicking in Japan in its own 
circles. Even DBCS EBCDIC is still used among old mainframe 
applications.[2] I very much doubt whether we need to worry about DBCS 
EBCDIC in GNU software, though.

Shift-JIS is a closer call as it's used a lot more than DBCS EBCDIC. 
However, my worry is that good support for non-ASCII-safe encodings like 
Shift-JIS is hard to do, and that any such support we'd add to 
Gnulib/coreutils/etc. would not only increase maintenance costs and 
reduce runtime performance, but also would be buggy in many oddball 
corner cases (and this would have security implications).

[1] https://access.redhat.com/errata/RHEA-2019:0008

[2] 
https://www.ibm.com/support/knowledgecenter/en/SS4PJT_6.0.0/cd_zos/cdzos_admin/ZOS_Overview_of_DBCS.html
  
Bruno Haible March 6, 2021, 8:17 p.m. UTC | #10
Paul Eggert wrote:
> However, my worry is that good support for non-ASCII-safe encodings like 
> Shift-JIS is hard to do, and that any such support we'd add to 
> Gnulib/coreutils/etc. would not only increase maintenance costs and 
> reduce runtime performance

Shift_JIS is not the only non-ASCII-safe encoding; GB18030, BIG5, BIG5-HKSCS,
and GBK are as well, and among these GB18030 is used as locale encoding
in China. Therefore it is important for programs to support these locale
encodings.

Gnulib has the support for it:

  - It has replacement functions that operate correctly with these locale
    encodings:
      strstr, c_strstr -> mbsstr
      strchr -> mbschr
      strrchr -> mbsrchr
      strspn -> mbsspn
      strcspn -> mbscspn
      strpbrk -> mbspbrk
      strsep -> mbssep
      strtok_r -> mbstok_r

  - It has warnings (through _GL_WARN_ON_USE) for uses of the functions
    that are not OK for non-ASCII-safe encodings.

  - It has modules mbchar, mbiter, mbfile for iterating through the
    multibyte characters of a string or file, that work for all locale
    encodings.

Yes, it does reduce the performance to use these safer functions.
I have shown in the past, through coreutils patches, how to accommodate
both a "fast path" and a "safe path" in the same binary.

Bruno
  

Patch

diff --git a/posix/fnmatch.c b/posix/fnmatch.c
index 5896812c96..ac254fc9ac 100644
--- a/posix/fnmatch.c
+++ b/posix/fnmatch.c
@@ -75,6 +75,7 @@  extern int fnmatch (const char *pattern, const char *string, int flags);
 
 #include <intprops.h>
 #include <flexmember.h>
+#include <scratch_buffer.h>
 
 #ifdef _LIBC
 typedef ptrdiff_t idx_t;
@@ -231,119 +232,72 @@  is_char_class (const wchar_t *wcs)
 
 #include "fnmatch_loop.c"
 
+static int
+fnmatch_convert_to_wide (const char *str, struct scratch_buffer *buf,
+                         size_t *n)
+{
+  mbstate_t ps;
+  memset (&ps, '\0', sizeof (ps));
+
+  size_t nw = buf->length / sizeof (wchar_t);
+  *n = strnlen (str, nw - 1);
+  if (__glibc_likely (*n < nw))
+    {
+      const char *p = str;
+      *n = mbsrtowcs (buf->data, &p, *n + 1, &ps);
+      if (__glibc_unlikely (*n == (size_t) -1))
+        /* Something wrong.
+           XXX Do we have to set 'errno' to something which mbsrtows hasn't
+           already done?  */
+        return -1;
+      if (p == NULL)
+        return 0;
+      memset (&ps, '\0', sizeof (ps));
+    }
+
+  *n = mbsrtowcs (NULL, &str, 0, &ps);
+  if (__glibc_unlikely (*n == (size_t) -1))
+    return -1;
+  if (!scratch_buffer_set_array_size (buf, *n + 1, sizeof (wchar_t)))
+    {
+      __set_errno (ENOMEM);
+      return -2;
+    }
+  assert (mbsinit (&ps));
+  mbsrtowcs (buf->data, &str, *n + 1, &ps);
+  return 0;
+}
 
 int
 fnmatch (const char *pattern, const char *string, int flags)
 {
   if (__glibc_unlikely (MB_CUR_MAX != 1))
     {
-      mbstate_t ps;
       size_t n;
-      const char *p;
-      wchar_t *wpattern_malloc = NULL;
-      wchar_t *wpattern;
-      wchar_t *wstring_malloc = NULL;
-      wchar_t *wstring;
-      size_t alloca_used = 0;
+      struct scratch_buffer wpattern;
+      scratch_buffer_init (&wpattern);
+      struct scratch_buffer wstring;
+      scratch_buffer_init (&wstring);
+      int r;
 
       /* Convert the strings into wide characters.  */
-      memset (&ps, '\0', sizeof (ps));
-      p = pattern;
-      n = strnlen (pattern, 1024);
-      if (__glibc_likely (n < 1024))
-        {
-          wpattern = (wchar_t *) alloca_account ((n + 1) * sizeof (wchar_t),
-                                                 alloca_used);
-          n = mbsrtowcs (wpattern, &p, n + 1, &ps);
-          if (__glibc_unlikely (n == (size_t) -1))
-            /* Something wrong.
-               XXX Do we have to set 'errno' to something which mbsrtows hasn't
-               already done?  */
-            return -1;
-          if (p)
-            {
-              memset (&ps, '\0', sizeof (ps));
-              goto prepare_wpattern;
-            }
-        }
-      else
-        {
-        prepare_wpattern:
-          n = mbsrtowcs (NULL, &pattern, 0, &ps);
-          if (__glibc_unlikely (n == (size_t) -1))
-            /* Something wrong.
-               XXX Do we have to set 'errno' to something which mbsrtows hasn't
-               already done?  */
-            return -1;
-          if (__glibc_unlikely (n >= (size_t) -1 / sizeof (wchar_t)))
-            {
-              __set_errno (ENOMEM);
-              return -2;
-            }
-          wpattern_malloc = wpattern
-            = (wchar_t *) malloc ((n + 1) * sizeof (wchar_t));
-          assert (mbsinit (&ps));
-          if (wpattern == NULL)
-            return -2;
-          (void) mbsrtowcs (wpattern, &pattern, n + 1, &ps);
-        }
-
-      assert (mbsinit (&ps));
-      n = strnlen (string, 1024);
-      p = string;
-      if (__glibc_likely (n < 1024))
-        {
-          wstring = (wchar_t *) alloca_account ((n + 1) * sizeof (wchar_t),
-                                                alloca_used);
-          n = mbsrtowcs (wstring, &p, n + 1, &ps);
-          if (__glibc_unlikely (n == (size_t) -1))
-            {
-              /* Something wrong.
-                 XXX Do we have to set 'errno' to something which
-                 mbsrtows hasn't already done?  */
-            free_return:
-              free (wpattern_malloc);
-              return -1;
-            }
-          if (p)
-            {
-              memset (&ps, '\0', sizeof (ps));
-              goto prepare_wstring;
-            }
-        }
-      else
+      r = fnmatch_convert_to_wide (pattern, &wpattern, &n);
+      if (r != 0)
+        return r;
+      r = fnmatch_convert_to_wide (string, &wstring, &n);
+      if (r != 0)
         {
-        prepare_wstring:
-          n = mbsrtowcs (NULL, &string, 0, &ps);
-          if (__glibc_unlikely (n == (size_t) -1))
-            /* Something wrong.
-               XXX Do we have to set 'errno' to something which mbsrtows hasn't
-               already done?  */
-            goto free_return;
-          if (__glibc_unlikely (n >= (size_t) -1 / sizeof (wchar_t)))
-            {
-              free (wpattern_malloc);
-              __set_errno (ENOMEM);
-              return -2;
-            }
-
-          wstring_malloc = wstring
-            = (wchar_t *) malloc ((n + 1) * sizeof (wchar_t));
-          if (wstring == NULL)
-            {
-              free (wpattern_malloc);
-              return -2;
-            }
-          assert (mbsinit (&ps));
-          (void) mbsrtowcs (wstring, &string, n + 1, &ps);
+          scratch_buffer_free (&wpattern);
+          return n;
         }
 
-      int res = internal_fnwmatch (wpattern, wstring, wstring + n,
+      int res = internal_fnwmatch (wpattern.data, wstring.data,
+                                   (wchar_t *) wstring.data + n,
                                    flags & FNM_PERIOD, flags, NULL,
-                                   alloca_used);
+                                   false);
 
-      free (wstring_malloc);
-      free (wpattern_malloc);
+      scratch_buffer_free (&wstring);
+      scratch_buffer_free (&wpattern);
 
       return res;
     }