diff mbox series

[1/2] posix: User scratch_buffer on fnmatch

Message ID 20210104202528.1228255-1-adhemerval.zanella@linaro.org
State New
Headers show
Series [1/2] posix: User scratch_buffer on fnmatch | expand

Commit Message

Adhemerval Zanella Jan. 4, 2021, 8:25 p.m. UTC
It removes the alloca usage on the string convertion to wide characters
before calling the internal function.

Checked on x86_64-linux-gnu.
---
 posix/fnmatch.c | 152 +++++++++++++++++-------------------------------
 1 file changed, 53 insertions(+), 99 deletions(-)

Comments

Florian Weimer Jan. 4, 2021, 8:35 p.m. UTC | #1
* Adhemerval Zanella via Libc-alpha:

> It removes the alloca usage on the string convertion to wide characters
> before calling the internal function.

We have a downstream-only patch to fall back the single byte handling in
case of multibyte decoding failure.  Basically it's a quick hack to fix
this bug:

  <https://sourceware.org/bugzilla/show_bug.cgi?id=14185>

Is this something we should upstream?  Or rework fnmatch so that * is
matched properly against arbitrary bytes?

Thanks,
Florian
Adhemerval Zanella Jan. 5, 2021, 1:07 p.m. UTC | #2
On 04/01/2021 17:35, Florian Weimer wrote:
> * Adhemerval Zanella via Libc-alpha:
> 
>> It removes the alloca usage on the string convertion to wide characters
>> before calling the internal function.
> 
> We have a downstream-only patch to fall back the single byte handling in
> case of multibyte decoding failure.  Basically it's a quick hack to fix
> this bug:
> 
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=14185>
> 
> Is this something we should upstream?  Or rework fnmatch so that * is
> matched properly against arbitrary bytes?

It seems that gnulib has added the proposed fix with 
aed23714d60d91b2abc74be33635c32ddc5132b5 (done in 2005) and just recently
with a glibc merge with 67306f600fe6a3bcf3fbb6d8bf4b8953b74a8fb7 (done in
2020 to sync back glibc changes) it has fallback to old semantic to return
-1 on in case of failure.  

I am not sure if gnulib was intentional or an overlook. But I am slight
worried about the issues raised by Rich in comment #4, where fnmatch would
match wrong patterns that happen to have invalid multibyte sequence.  Maybe
gnulib guys can gives us some insight here about the realword and if the
67306f600fe6a3 was intentional or not.

I have started to check the feasibility or making the Rich suggestions at
comment #7, to make fnmatch not go in the temporary buffer and call
mbrtowc.  It should be feasible, however it would require more extensive
change in the algorithm and some work to optimize for MB_CUR_MAX == 1
(where mbrtowc should not be necessary).
Paul Eggert Jan. 13, 2021, 7:25 p.m. UTC | #3
On 1/5/21 5:07 AM, Adhemerval Zanella wrote:
> It seems that gnulib has added the proposed fix with
> aed23714d60d91b2abc74be33635c32ddc5132b5 (done in 2005) and just recently
> with a glibc merge with 67306f600fe6a3bcf3fbb6d8bf4b8953b74a8fb7 (done in
> 2020 to sync back glibc changes) it has fallback to old semantic to return
> -1 on in case of failure.
> 
> I am not sure if gnulib was intentional or an overlook.

It was an oversight. I simply missed the issue when I did the merge.

> I have started to check the feasibility or making the Rich suggestions at
> comment #7,

That's the right way to go. I would not spend much time trying to fix 
the bugs in the existing code. We should rip out all the wide-char 
stuff, treat encoding errors as if they were just another "character" 
(that's what Emacs does), and stay in the multibyte world. We could 
steal some ideas from Emacs here.

By the way, how important is it to support awful encodings like 
shift-JIS that contain bytes that look like '\'? If we don't have to 
support these encodings any more, things get a bit easier. (Asking for a 
friend. :-)
Florian Weimer Jan. 13, 2021, 7:39 p.m. UTC | #4
* Paul Eggert:

> By the way, how important is it to support awful encodings like
> shift-JIS that contain bytes that look like '\'? If we don't have to 
> support these encodings any more, things get a bit easier. (Asking for
> a friend. :-)

There is a Shift-JIS variant which is ASCII-transparent (Windows-31J,
it's also specified by WhatWG/HTML5), so from a glibc point of view, it
would be just an ordinary charset like any other.

But feedback we have received is that the users who want Shift-JIS
really want the original thing.

We do not presently support either variant downstream, but one potential
way forward would be to turn Windows-31J into a fully supported glibc
charset with a corresponding ja_JP locale (which would imply downstream
support as well), and just hope that it displaces the original Shift-JIS
in the future.

Thanks,
Florian
Bruno Haible Jan. 13, 2021, 11:36 p.m. UTC | #5
Paul Eggert asked:
> > By the way, how important is it to support awful encodings like
> > shift-JIS that contain bytes that look like '\'? If we don't have to 
> > support these encodings any more, things get a bit easier.

Here we are talking about locale encodings, and Shift_JIS (as well as
SHIFT_JISX0213) are not usable as a locale encoding in glibc. See e.g.
[1], [2].

That's the reason why no Shift_JIS locale is listed in
glibc/localedata/SUPPORTED. [3]

Florian Weimer wrote:
> There is a Shift-JIS variant which is ASCII-transparent (Windows-31J,
> it's also specified by WhatWG/HTML5), so from a glibc point of view, it
> would be just an ordinary charset like any other.
> 
> But feedback we have received is that the users who want Shift-JIS
> really want the original thing.
> 
> We do not presently support either variant downstream, but one potential
> way forward would be to turn Windows-31J into a fully supported glibc
> charset with a corresponding ja_JP locale (which would imply downstream
> support as well), and just hope that it displaces the original Shift-JIS
> in the future.

I don't think there's a real need for that. In the years 1995 ... 2005 there
was a lot of resistence against Unicode in Japan, because Unicode maps
several slightly differently looking glyph images to the same glyph/character
(even for Western encodings, for example the Polish accents look a bit different
than the French ones), and - at the time - Unicode did not have means to
disambiguate these, thus people complained about "characters are rendered
incorrectly if you use Unicode". This has been resolved for more than 10 years
already.

Bruno

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=3140
[2] https://sourceware.org/legacy-ml/libc-alpha/2000-10/msg00311.html
[3] https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED
Florian Weimer Jan. 14, 2021, 10 a.m. UTC | #6
* Bruno Haible:

> Paul Eggert asked:
>> > By the way, how important is it to support awful encodings like
>> > shift-JIS that contain bytes that look like '\'? If we don't have to 
>> > support these encodings any more, things get a bit easier.
>
> Here we are talking about locale encodings, and Shift_JIS (as well as
> SHIFT_JISX0213) are not usable as a locale encoding in glibc. See e.g.
> [1], [2].
>
> That's the reason why no Shift_JIS locale is listed in
> glibc/localedata/SUPPORTED. [3]

> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=3140
> [2] https://sourceware.org/legacy-ml/libc-alpha/2000-10/msg00311.html
> [3] https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED

We used to have a fully supported product based on the original
Shift-JIS.  It did not require glibc changes (we package both localedef
and the locale sources, so it's easy to build custom locales), but other
GNU components had to be patched.

> Florian Weimer wrote:
>> There is a Shift-JIS variant which is ASCII-transparent (Windows-31J,
>> it's also specified by WhatWG/HTML5), so from a glibc point of view, it
>> would be just an ordinary charset like any other.
>> 
>> But feedback we have received is that the users who want Shift-JIS
>> really want the original thing.
>> 
>> We do not presently support either variant downstream, but one potential
>> way forward would be to turn Windows-31J into a fully supported glibc
>> charset with a corresponding ja_JP locale (which would imply downstream
>> support as well), and just hope that it displaces the original Shift-JIS
>> in the future.
>
> I don't think there's a real need for that. In the years 1995 ... 2005
> there was a lot of resistence against Unicode in Japan, because
> Unicode maps several slightly differently looking glyph images to the
> same glyph/character (even for Western encodings, for example the
> Polish accents look a bit different than the French ones), and - at
> the time - Unicode did not have means to disambiguate these, thus
> people complained about "characters are rendered incorrectly if you
> use Unicode". This has been resolved for more than 10 years already.

We saw commercial demand for Shift-JIS much later than that.  I think an
official Windows-31J-based ja_JP would still be welcomed at this point.

A Windows-31J locale could be added to localedata/SUPPORTED.  We have
not done that yet because someone wanted to look into alignment between
Windows, HTML/WhatWG and what we currently have in the source tree, but
that hasn't happened yet, unfortunately.

Thanks,
Florian
Adhemerval Zanella Jan. 14, 2021, 11:44 a.m. UTC | #7
On 13/01/2021 16:25, Paul Eggert wrote:
> On 1/5/21 5:07 AM, Adhemerval Zanella wrote:
>> It seems that gnulib has added the proposed fix with
>> aed23714d60d91b2abc74be33635c32ddc5132b5 (done in 2005) and just recently
>> with a glibc merge with 67306f600fe6a3bcf3fbb6d8bf4b8953b74a8fb7 (done in
>> 2020 to sync back glibc changes) it has fallback to old semantic to return
>> -1 on in case of failure.
>>
>> I am not sure if gnulib was intentional or an overlook.
> 
> It was an oversight. I simply missed the issue when I did the merge.
> 
>> I have started to check the feasibility or making the Rich suggestions at
>> comment #7,
> 
> That's the right way to go. I would not spend much time trying to fix the bugs in the existing code. We should rip out all the wide-char stuff, treat encoding errors as if they were just another "character" (that's what Emacs does), and stay in the multibyte world. We could steal some ideas from Emacs here.

I think we still should fix BZ#14185 with the suggested with the strategy of
falling back to non wide mode in case of encoding error.  

The full fix will take some time, since it is pretty much a rewrite of the
fnmatch_loop.c.

> 
> By the way, how important is it to support awful encodings like shift-JIS that contain bytes that look like '\'? If we don't have to support these encodings any more, things get a bit easier. (Asking for a friend. :-)
Paul Eggert Jan. 15, 2021, 6:56 a.m. UTC | #8
On 1/14/21 3:44 AM, Adhemerval Zanella wrote:
> I think we still should fix BZ#14185 with the suggested with the strategy of
> falling back to non wide mode in case of encoding error.

For glibc 2.33 that's likely the best we can do.

> The full fix will take some time, since it is pretty much a rewrite of the
> fnmatch_loop.c.

Yes. Should be doable for 2.34 though, I'd hope.
diff mbox series

Patch

diff --git a/posix/fnmatch.c b/posix/fnmatch.c
index 5896812c96..ac254fc9ac 100644
--- a/posix/fnmatch.c
+++ b/posix/fnmatch.c
@@ -75,6 +75,7 @@  extern int fnmatch (const char *pattern, const char *string, int flags);
 
 #include <intprops.h>
 #include <flexmember.h>
+#include <scratch_buffer.h>
 
 #ifdef _LIBC
 typedef ptrdiff_t idx_t;
@@ -231,119 +232,72 @@  is_char_class (const wchar_t *wcs)
 
 #include "fnmatch_loop.c"
 
+static int
+fnmatch_convert_to_wide (const char *str, struct scratch_buffer *buf,
+                         size_t *n)
+{
+  mbstate_t ps;
+  memset (&ps, '\0', sizeof (ps));
+
+  size_t nw = buf->length / sizeof (wchar_t);
+  *n = strnlen (str, nw - 1);
+  if (__glibc_likely (*n < nw))
+    {
+      const char *p = str;
+      *n = mbsrtowcs (buf->data, &p, *n + 1, &ps);
+      if (__glibc_unlikely (*n == (size_t) -1))
+        /* Something wrong.
+           XXX Do we have to set 'errno' to something which mbsrtows hasn't
+           already done?  */
+        return -1;
+      if (p == NULL)
+        return 0;
+      memset (&ps, '\0', sizeof (ps));
+    }
+
+  *n = mbsrtowcs (NULL, &str, 0, &ps);
+  if (__glibc_unlikely (*n == (size_t) -1))
+    return -1;
+  if (!scratch_buffer_set_array_size (buf, *n + 1, sizeof (wchar_t)))
+    {
+      __set_errno (ENOMEM);
+      return -2;
+    }
+  assert (mbsinit (&ps));
+  mbsrtowcs (buf->data, &str, *n + 1, &ps);
+  return 0;
+}
 
 int
 fnmatch (const char *pattern, const char *string, int flags)
 {
   if (__glibc_unlikely (MB_CUR_MAX != 1))
     {
-      mbstate_t ps;
       size_t n;
-      const char *p;
-      wchar_t *wpattern_malloc = NULL;
-      wchar_t *wpattern;
-      wchar_t *wstring_malloc = NULL;
-      wchar_t *wstring;
-      size_t alloca_used = 0;
+      struct scratch_buffer wpattern;
+      scratch_buffer_init (&wpattern);
+      struct scratch_buffer wstring;
+      scratch_buffer_init (&wstring);
+      int r;
 
       /* Convert the strings into wide characters.  */
-      memset (&ps, '\0', sizeof (ps));
-      p = pattern;
-      n = strnlen (pattern, 1024);
-      if (__glibc_likely (n < 1024))
-        {
-          wpattern = (wchar_t *) alloca_account ((n + 1) * sizeof (wchar_t),
-                                                 alloca_used);
-          n = mbsrtowcs (wpattern, &p, n + 1, &ps);
-          if (__glibc_unlikely (n == (size_t) -1))
-            /* Something wrong.
-               XXX Do we have to set 'errno' to something which mbsrtows hasn't
-               already done?  */
-            return -1;
-          if (p)
-            {
-              memset (&ps, '\0', sizeof (ps));
-              goto prepare_wpattern;
-            }
-        }
-      else
-        {
-        prepare_wpattern:
-          n = mbsrtowcs (NULL, &pattern, 0, &ps);
-          if (__glibc_unlikely (n == (size_t) -1))
-            /* Something wrong.
-               XXX Do we have to set 'errno' to something which mbsrtows hasn't
-               already done?  */
-            return -1;
-          if (__glibc_unlikely (n >= (size_t) -1 / sizeof (wchar_t)))
-            {
-              __set_errno (ENOMEM);
-              return -2;
-            }
-          wpattern_malloc = wpattern
-            = (wchar_t *) malloc ((n + 1) * sizeof (wchar_t));
-          assert (mbsinit (&ps));
-          if (wpattern == NULL)
-            return -2;
-          (void) mbsrtowcs (wpattern, &pattern, n + 1, &ps);
-        }
-
-      assert (mbsinit (&ps));
-      n = strnlen (string, 1024);
-      p = string;
-      if (__glibc_likely (n < 1024))
-        {
-          wstring = (wchar_t *) alloca_account ((n + 1) * sizeof (wchar_t),
-                                                alloca_used);
-          n = mbsrtowcs (wstring, &p, n + 1, &ps);
-          if (__glibc_unlikely (n == (size_t) -1))
-            {
-              /* Something wrong.
-                 XXX Do we have to set 'errno' to something which
-                 mbsrtows hasn't already done?  */
-            free_return:
-              free (wpattern_malloc);
-              return -1;
-            }
-          if (p)
-            {
-              memset (&ps, '\0', sizeof (ps));
-              goto prepare_wstring;
-            }
-        }
-      else
+      r = fnmatch_convert_to_wide (pattern, &wpattern, &n);
+      if (r != 0)
+        return r;
+      r = fnmatch_convert_to_wide (string, &wstring, &n);
+      if (r != 0)
         {
-        prepare_wstring:
-          n = mbsrtowcs (NULL, &string, 0, &ps);
-          if (__glibc_unlikely (n == (size_t) -1))
-            /* Something wrong.
-               XXX Do we have to set 'errno' to something which mbsrtows hasn't
-               already done?  */
-            goto free_return;
-          if (__glibc_unlikely (n >= (size_t) -1 / sizeof (wchar_t)))
-            {
-              free (wpattern_malloc);
-              __set_errno (ENOMEM);
-              return -2;
-            }
-
-          wstring_malloc = wstring
-            = (wchar_t *) malloc ((n + 1) * sizeof (wchar_t));
-          if (wstring == NULL)
-            {
-              free (wpattern_malloc);
-              return -2;
-            }
-          assert (mbsinit (&ps));
-          (void) mbsrtowcs (wstring, &string, n + 1, &ps);
+          scratch_buffer_free (&wpattern);
+          return n;
         }
 
-      int res = internal_fnwmatch (wpattern, wstring, wstring + n,
+      int res = internal_fnwmatch (wpattern.data, wstring.data,
+                                   (wchar_t *) wstring.data + n,
                                    flags & FNM_PERIOD, flags, NULL,
-                                   alloca_used);
+                                   false);
 
-      free (wstring_malloc);
-      free (wpattern_malloc);
+      scratch_buffer_free (&wstring);
+      scratch_buffer_free (&wpattern);
 
       return res;
     }