[v3] wcrtomb: Make behavior POSIX compliant

Message ID 20220512131503.764504-1-siddhesh@sourceware.org
State Superseded
Headers
Series [v3] wcrtomb: Make behavior POSIX compliant |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Siddhesh Poyarekar May 12, 2022, 1:15 p.m. UTC
  The GNU implementation of wcrtomb assumes that there are at least
MB_CUR_MAX bytes available in the destination buffer passed to wcrtomb
as the first argument.  This is not compatible with the POSIX
definition, which only requires enough space for the input wide
character.

This does not break much in practice because when users supply buffers
smaller than MB_CUR_MAX (e.g. in ncurses), they compute and dynamically
allocate the buffer, which results in enough spare space (thanks to
usable_size in malloc and padding in alloca) that no actual buffer
overflow occurs.  However when the code is built with _FORTIFY_SOURCE,
it runs into the hard check against MB_CUR_MAX in __wcrtomb_chk and
hence fails.  It wasn't evident until now since dynamic allocations
would result in wcrtomb not being fortified but since _FORTIFY_SOURCE=3,
that limitation is gone, resulting in such code failing.

To fix this problem, introduce an internal buffer that is MB_LEN_MAX
long and use that to perform the conversion and then copy the resultant
bytes into the destination buffer.  Also move the fortification check
into the main implementation, which checks the result after conversion
and aborts if the resultant byte count is greater than the destination
buffer size.

One complication is that applications that assume the MB_CUR_MAX
limitation to be gone may not be able to run safely on older glibcs if
they use static destination buffers smaller than MB_CUR_MAX; dynamic
allocations will always have enough spare space that no actual overruns
will occur.  One alternative to fixing this is to bump symbol version to
prevent them from running on older glibcs but that seems too strict a
constraint.  Instead, since these users will only have made this
decision on reading the manual, I have put a note in the manual warning
them about the pitfalls of having static buffers smaller than
MB_CUR_MAX and running them on older glibc.

Benchmarking:

The wcrtomb microbenchmark shows significant increases in maximum
execution time for all locales, ranging from 10x for ar_SA.UTF-8 to
1.5x-2x for nearly everything else.  The mean execution time however saw
practically no impact, with some results even being quicker, indicating
that cache locality has a much bigger role in the overhead.

Given that the additional copy uses a temporary buffer inside wcrtomb,
it's likely that a hot path will end up putting that buffer (which is
responsible for the additional overhead) in a similar place on stack,
giving the necessary cache locality to negate the overhead.  However in
situations where wcrtomb ends up getting called at wildly different
spots on the call stack (or is on different call stacks, e.g. with
threads or different execution contexts) and is still a hotspot, the
performance lag will be visible.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
---
Changes from v2:
- Updated manual blurb to refer to "this version" so that it is
  backport-clean.

- Changed copy logic to inline copies of up to 4 bytes.  It is 1-7%
  faster that the previous inlining of only 1 and 2 bytes and a good
  10%+ faster than using plain memcpy.  4 bytes is the maximum length
  for byte encodings in charmaps shipped in glibc.

Changes from v1:
- Fixed nits suggested by Paul Eggert.

 debug/tst-fortify.c |  7 ++++++-
 debug/wcrtomb_chk.c |  8 ++-----
 include/wchar.h     |  4 ++++
 manual/charset.texi | 11 +++++-----
 wcsmbs/wcrtomb.c    | 51 ++++++++++++++++++++++++++++++++++++++-------
 5 files changed, 61 insertions(+), 20 deletions(-)
  

Comments

Paul Eggert May 13, 2022, 4:56 a.m. UTC | #1
On 5/12/22 06:15, Siddhesh Poyarekar wrote:
> +	  switch (result)
> +	    {
> +	    case 4:
> +	      s[3] = buf[3];
> +	      /* Fall through.  */
> +	    case 3:
> +	      s[2] = buf[2];
> +	      /* Fall through.  */
> +	    case 2:
> +	      s[1] = buf[1];
> +	      /* Fall through.  */
> +	    case 1:
> +	      s[0] = buf[0];
> +	      break;
> +	    default:
> +	      memcpy (s, buf, result);
> +	    }

For me with GCC 12.1 -O2 on x86-64, the above code generates 2 
comparisons and 3 conditional branches in the common case where RESULT 
is 1. Plus, GCC generates a jmp from the end of case 3 to the start of 
case 2 (which precedes case 3 in the machine code), which is a bit odd.

How about something like the following instead? This generates machine 
code with only one conditional branch - the one that decides whether to 
call memcpy - and this branch is never taken with glibc-supplied 
charmaps. (I'm assuming RESULT is at least 1.)

   s[0] = buf[0];
   if (2 <= result && result <= 4)
     {
       s[1] = buf[1];
       memcpy (&s[result != 2], &buf[result != 2], 2);
       s[result - 1] = buf[result - 1];
     }
   else
     memcpy (s, buf, result);

Hope you don't mind my bikeshedding. (PATCH v3 looks correct as-is, for 
what it's worth.)
  
Paul Eggert May 13, 2022, 5:28 a.m. UTC | #2
On 5/12/22 21:56, Paul Eggert wrote:
> Hope you don't mind my bikeshedding.

Better yet, this:

   s[0] = buf[0];
   if (2 <= result && result <= 4)
     {
       s[1] = buf[1];
       memcpy (&s[result - 2], &buf[result - 2], 2);
     }
   else
     memcpy (s, buf, result);

On x86-64 with GCC 12.1 -O2 and a glibc-supplied charmap, this is only 9 
straight-line instructions, counting the compare insn and the 
conditional-branch insn that is never taken.
  
Siddhesh Poyarekar May 13, 2022, 8:18 a.m. UTC | #3
On 13/05/2022 10:26, Paul Eggert wrote:
> On 5/12/22 06:15, Siddhesh Poyarekar wrote:
>> +      switch (result)
>> +        {
>> +        case 4:
>> +          s[3] = buf[3];
>> +          /* Fall through.  */
>> +        case 3:
>> +          s[2] = buf[2];
>> +          /* Fall through.  */
>> +        case 2:
>> +          s[1] = buf[1];
>> +          /* Fall through.  */
>> +        case 1:
>> +          s[0] = buf[0];
>> +          break;
>> +        default:
>> +          memcpy (s, buf, result);
>> +        }
> 
> For me with GCC 12.1 -O2 on x86-64, the above code generates 2 
> comparisons and 3 conditional branches in the common case where RESULT 
> is 1. Plus, GCC generates a jmp from the end of case 3 to the start of 
> case 2 (which precedes case 3 in the machine code), which is a bit odd.

It seems to hit actual performance less, presumably because those 
branches get predicted correctly most times.  However we could simply 
elide the check for the first byte like you've done.

> How about something like the following instead? This generates machine 
> code with only one conditional branch - the one that decides whether to 
> call memcpy - and this branch is never taken with glibc-supplied 
> charmaps. (I'm assuming RESULT is at least 1.)
> 
>    s[0] = buf[0];
>    if (2 <= result && result <= 4)
>      {
>        s[1] = buf[1];
>        memcpy (&s[result != 2], &buf[result != 2], 2);
>        s[result - 1] = buf[result - 1];
>      }
>    else
>      memcpy (s, buf, result);

This produces redundant loads and stores, which hits 2, 3, 4 byte copies 
much worse; I reckon the condition in the subscript ends up preventing 
the compiler from eliminating them.

If result == 1 is a safe assumption (as it should be I think) then we 
could hoist it out of the switch like so:

           s[0] = buf[0];
           switch (result)
             {
             case 4:
               s[3] = buf[3];
               /* Fall through.  */
             case 3:
               s[2] = buf[2];
               /* Fall through.  */
             case 2:
               s[1] = buf[1];
               /* Fall through.  */
             case 1:
               break;
             default:
               memcpy (s, buf, result);
             }

This improves things ever so slightly for the 1 and 2 byte locale tests 
in the microbenchmark, while keeping things the same for the rest. 
There's still a redundant byte copy in the > 4 bytes case, but that 
should be nothing compared to the PLT indirection.

What do you think?

Thanks,
Siddhesh
  
Siddhesh Poyarekar May 13, 2022, 11:31 a.m. UTC | #4
On 13/05/2022 10:58, Paul Eggert wrote:
> On 5/12/22 21:56, Paul Eggert wrote:
>> Hope you don't mind my bikeshedding.
> 
> Better yet, this:
> 
>    s[0] = buf[0];
>    if (2 <= result && result <= 4)
>      {
>        s[1] = buf[1];
>        memcpy (&s[result - 2], &buf[result - 2], 2);
>      }
>    else
>      memcpy (s, buf, result);
> 
> On x86-64 with GCC 12.1 -O2 and a glibc-supplied charmap, this is only 9 
> straight-line instructions, counting the compare insn and the 
> conditional-branch insn that is never taken.
> 

Sorry I missed this one.  I tried it and with gcc 11 it seems to produce 
worse code, merging the two memcpys instead of inlining the first one.

Siddhesh
  
Florian Weimer May 13, 2022, 11:38 a.m. UTC | #5
* Siddhesh Poyarekar:

> On 13/05/2022 10:58, Paul Eggert wrote:
>> On 5/12/22 21:56, Paul Eggert wrote:
>>> Hope you don't mind my bikeshedding.
>> Better yet, this:
>>    s[0] = buf[0];
>>    if (2 <= result && result <= 4)
>>      {
>>        s[1] = buf[1];
>>        memcpy (&s[result - 2], &buf[result - 2], 2);
>>      }
>>    else
>>      memcpy (s, buf, result);
>> On x86-64 with GCC 12.1 -O2 and a glibc-supplied charmap, this is
>> only 9 straight-line instructions, counting the compare insn and the 
>> conditional-branch insn that is never taken.
>> 
>
> Sorry I missed this one.  I tried it and with gcc 11 it seems to
> produce worse code, merging the two memcpys instead of inlining the
> first one.

Can we please use the simplified code?

“char buf[MB_LEN_MAX];” already gives GCC a strong hint this is a short
memcpy.  If it is beneficial on some architectures to inline such short,
but still variable memcpy calls, we should teach the compiler how to do
it, and not do it manually.

And wcrtomb would for sure benefit more from a fast path for ASCII and
UTF-8. 8-p

(Not that I think there are many users out there, given glibc's
historically poor performance in the multibyte/wide string conversion
space.)

Thanks,
Florian
  
Siddhesh Poyarekar May 13, 2022, 11:51 a.m. UTC | #6
On 13/05/2022 17:08, Florian Weimer via Libc-alpha wrote:
> * Siddhesh Poyarekar:
> 
>> On 13/05/2022 10:58, Paul Eggert wrote:
>>> On 5/12/22 21:56, Paul Eggert wrote:
>>>> Hope you don't mind my bikeshedding.
>>> Better yet, this:
>>>     s[0] = buf[0];
>>>     if (2 <= result && result <= 4)
>>>       {
>>>         s[1] = buf[1];
>>>         memcpy (&s[result - 2], &buf[result - 2], 2);
>>>       }
>>>     else
>>>       memcpy (s, buf, result);
>>> On x86-64 with GCC 12.1 -O2 and a glibc-supplied charmap, this is
>>> only 9 straight-line instructions, counting the compare insn and the
>>> conditional-branch insn that is never taken.
>>>
>>
>> Sorry I missed this one.  I tried it and with gcc 11 it seems to
>> produce worse code, merging the two memcpys instead of inlining the
>> first one.
> 
> Can we please use the simplified code?

So just a memcpy call?  I could push that and attempt to micro-optimize 
if there are reports of slowdown due to this.

Siddhesh
  
Adhemerval Zanella May 13, 2022, 12:30 p.m. UTC | #7
On 13/05/2022 08:38, Florian Weimer via Libc-alpha wrote:
> * Siddhesh Poyarekar:
> 
>> On 13/05/2022 10:58, Paul Eggert wrote:
>>> On 5/12/22 21:56, Paul Eggert wrote:
>>>> Hope you don't mind my bikeshedding.
>>> Better yet, this:
>>>    s[0] = buf[0];
>>>    if (2 <= result && result <= 4)
>>>      {
>>>        s[1] = buf[1];
>>>        memcpy (&s[result - 2], &buf[result - 2], 2);
>>>      }
>>>    else
>>>      memcpy (s, buf, result);
>>> On x86-64 with GCC 12.1 -O2 and a glibc-supplied charmap, this is
>>> only 9 straight-line instructions, counting the compare insn and the 
>>> conditional-branch insn that is never taken.
>>>
>>
>> Sorry I missed this one.  I tried it and with gcc 11 it seems to
>> produce worse code, merging the two memcpys instead of inlining the
>> first one.
> 
> Can we please use the simplified code?
> 
> “char buf[MB_LEN_MAX];” already gives GCC a strong hint this is a short
> memcpy.  If it is beneficial on some architectures to inline such short,
> but still variable memcpy calls, we should teach the compiler how to do
> it, and not do it manually.

Totally agreed.

> 
> And wcrtomb would for sure benefit more from a fast path for ASCII and
> UTF-8. 8-p

And it does make sene.

> 
> (Not that I think there are many users out there, given glibc's
> historically poor performance in the multibyte/wide string conversion
> space.)
  
Florian Weimer May 13, 2022, 12:55 p.m. UTC | #8
* Siddhesh Poyarekar:

> On 13/05/2022 17:08, Florian Weimer via Libc-alpha wrote:
>> * Siddhesh Poyarekar:
>> 
>>> On 13/05/2022 10:58, Paul Eggert wrote:
>>>> On 5/12/22 21:56, Paul Eggert wrote:
>>>>> Hope you don't mind my bikeshedding.
>>>> Better yet, this:
>>>>     s[0] = buf[0];
>>>>     if (2 <= result && result <= 4)
>>>>       {
>>>>         s[1] = buf[1];
>>>>         memcpy (&s[result - 2], &buf[result - 2], 2);
>>>>       }
>>>>     else
>>>>       memcpy (s, buf, result);
>>>> On x86-64 with GCC 12.1 -O2 and a glibc-supplied charmap, this is
>>>> only 9 straight-line instructions, counting the compare insn and the
>>>> conditional-branch insn that is never taken.
>>>>
>>>
>>> Sorry I missed this one.  I tried it and with gcc 11 it seems to
>>> produce worse code, merging the two memcpys instead of inlining the
>>> first one.
>> Can we please use the simplified code?
>
> So just a memcpy call?  I could push that and attempt to
> micro-optimize if there are reports of slowdown due to this.

Yes, I would prefer that.

Thanks,
Florian
  
Siddhesh Poyarekar May 13, 2022, 1:42 p.m. UTC | #9
On 13/05/2022 18:00, Adhemerval Zanella via Libc-alpha wrote:
>> Can we please use the simplified code?
>>
>> “char buf[MB_LEN_MAX];” already gives GCC a strong hint this is a short
>> memcpy.  If it is beneficial on some architectures to inline such short,
>> but still variable memcpy calls, we should teach the compiler how to do
>> it, and not do it manually.
> 
> Totally agreed.
> 

Ugh, both of you just kill all the fun.  I'll push a fix with just the 
memcpy ;)

Thanks,
Siddhesh
  
Paul Eggert May 13, 2022, 5:58 p.m. UTC | #10
On 5/13/22 06:42, Siddhesh Poyarekar wrote:
> Ugh, both of you just kill all the fun.  I'll push a fix with just the 
> memcpy ;)

Just as well, as the code I sent before is slow for ASCII anyway. I 
shouldn't let that stand as my suggestion, even if just for fun.


> with gcc 11 it seems to produce worse code, merging the two memcpys instead of inlining the first one.

Here's a better version if someone ever wants to tune this stuff. In 
this version there's only one memcpy so GCC won't merge it. On x86-64 
with GCC 12.1 -O2, if RESULT<=2 this executes 6 instructions (including 
1 comparison and 1 conditional branch); if 3<=RESULT<=4 this executes 8 
instructions total (including 2 comparisons and 2 conditional branches).

   if (result <= 2)
     {
       s[0] = buf[0];
       s[result - 1] = buf[result - 1];
     }
   else if (result <= 4)
     {
       s[0] = buf[0];
       s[1] = buf[1];
       s[result - 2] = buf[result - 2];
       s[result - 1] = buf[result - 1];
     }
   else
     memcpy (s, buf, result);

There are of course other variants....
  

Patch

diff --git a/debug/tst-fortify.c b/debug/tst-fortify.c
index 03c9867714..8e94643bf2 100644
--- a/debug/tst-fortify.c
+++ b/debug/tst-fortify.c
@@ -1478,10 +1478,15 @@  do_test (void)
 	 character which has a multibyte representation which does not
 	 fit.  */
       CHK_FAIL_START
-      char smallbuf[2];
+      char smallbuf[1];
       if (wcrtomb (smallbuf, L'\x100', &s) != 2)
 	FAIL ();
       CHK_FAIL_END
+
+      /* Same input with a large enough buffer and we're good.  */
+      char bigenoughbuf[2];
+      if (wcrtomb (bigenoughbuf, L'\x100', &s) != 2)
+	FAIL ();
 #endif
 
       wchar_t wenough[10];
diff --git a/debug/wcrtomb_chk.c b/debug/wcrtomb_chk.c
index 8b6d026560..28c3ea0d2d 100644
--- a/debug/wcrtomb_chk.c
+++ b/debug/wcrtomb_chk.c
@@ -1,4 +1,5 @@ 
 /* Copyright (C) 2005-2022 Free Software Foundation, Inc.
+   Copyright The GNU Toolchain Authors.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -25,10 +26,5 @@ 
 size_t
 __wcrtomb_chk (char *s, wchar_t wchar, mbstate_t *ps, size_t buflen)
 {
-  /* We do not have to implement the full wctomb semantics since we
-     know that S cannot be NULL when we come here.  */
-  if (buflen < MB_CUR_MAX)
-    __chk_fail ();
-
-  return __wcrtomb (s, wchar, ps);
+  return __wcrtomb_internal (s, wchar, ps, buflen);
 }
diff --git a/include/wchar.h b/include/wchar.h
index 4267985625..db83297bca 100644
--- a/include/wchar.h
+++ b/include/wchar.h
@@ -172,6 +172,10 @@  libc_hidden_proto (__mbrtowc)
 libc_hidden_proto (__mbrlen)
 extern size_t __wcrtomb (char *__restrict __s, wchar_t __wc,
 			 __mbstate_t *__restrict __ps) attribute_hidden;
+extern size_t __wcrtomb_internal (char *__restrict __s, wchar_t __wc,
+				  __mbstate_t *__restrict __ps,
+				  size_t __s_size)
+     attribute_hidden;
 extern size_t __mbsrtowcs (wchar_t *__restrict __dst,
 			   const char **__restrict __src,
 			   size_t __len, __mbstate_t *__restrict __ps)
diff --git a/manual/charset.texi b/manual/charset.texi
index a9b5cb4a37..427db3bc80 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -883,11 +883,12 @@  the string @var{s}.  This includes all bytes representing shift
 sequences.
 
 One word about the interface of the function: there is no parameter
-specifying the length of the array @var{s}.  Instead the function
-assumes that there are at least @code{MB_CUR_MAX} bytes available since
-this is the maximum length of any byte sequence representing a single
-character.  So the caller has to make sure that there is enough space
-available, otherwise buffer overruns can occur.
+specifying the length of the array @var{s}, so the caller has to make sure
+that there is enough space available, otherwise buffer overruns can occur.
+This version of @theglibc{} does not assume that @var{s} is at least
+@var{MB_CUR_MAX} bytes long, but programs that need to run on @glibcadj{}
+versions that have this assumption documented in the manual must comply
+with this limit.
 
 @pindex wchar.h
 @code{wcrtomb} was introduced in @w{Amendment 1} to @w{ISO C90} and is
diff --git a/wcsmbs/wcrtomb.c b/wcsmbs/wcrtomb.c
index e17438989f..d68dcdac82 100644
--- a/wcsmbs/wcrtomb.c
+++ b/wcsmbs/wcrtomb.c
@@ -1,4 +1,5 @@ 
 /* Copyright (C) 1996-2022 Free Software Foundation, Inc.
+   Copyright The GNU Toolchain Authors.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -20,6 +21,7 @@ 
 #include <errno.h>
 #include <gconv.h>
 #include <stdlib.h>
+#include <string.h>
 #include <wchar.h>
 #include <wcsmbsload.h>
 
@@ -34,7 +36,7 @@ 
 static mbstate_t state;
 
 size_t
-__wcrtomb (char *s, wchar_t wc, mbstate_t *ps)
+__wcrtomb_internal (char *s, wchar_t wc, mbstate_t *ps, size_t s_size)
 {
   char buf[MB_LEN_MAX];
   struct __gconv_step_data data;
@@ -52,14 +54,11 @@  __wcrtomb (char *s, wchar_t wc, mbstate_t *ps)
   /* A first special case is if S is NULL.  This means put PS in the
      initial state.  */
   if (s == NULL)
-    {
-      s = buf;
-      wc = L'\0';
-    }
+    wc = L'\0';
 
   /* Tell where we want to have the result.  */
-  data.__outbuf = (unsigned char *) s;
-  data.__outbufend = (unsigned char *) s + MB_CUR_MAX;
+  data.__outbuf = (unsigned char *) buf;
+  data.__outbufend = (unsigned char *) buf + sizeof buf;
 
   /* Get the conversion functions.  */
   fcts = get_gconv_fcts (_NL_CURRENT_DATA (LC_CTYPE));
@@ -101,7 +100,37 @@  __wcrtomb (char *s, wchar_t wc, mbstate_t *ps)
 
   if (status == __GCONV_OK || status == __GCONV_EMPTY_INPUT
       || status == __GCONV_FULL_OUTPUT)
-    result = data.__outbuf - (unsigned char *) s;
+    {
+      result = data.__outbuf - (unsigned char *) buf;
+
+      if (s != NULL)
+	{
+	  if (result > s_size)
+	    __chk_fail ();
+
+	  /* The maximum byte size in all charmaps is 4, so inline copies up to
+	     4 bytes.  One must benchmark before bumping this in future because
+	     the returns for inlining will diminish as the number of branches
+	     increases.  */
+	  switch (result)
+	    {
+	    case 4:
+	      s[3] = buf[3];
+	      /* Fall through.  */
+	    case 3:
+	      s[2] = buf[2];
+	      /* Fall through.  */
+	    case 2:
+	      s[1] = buf[1];
+	      /* Fall through.  */
+	    case 1:
+	      s[0] = buf[0];
+	      break;
+	    default:
+	      memcpy (s, buf, result);
+	    }
+	}
+    }
   else
     {
       result = (size_t) -1;
@@ -110,5 +139,11 @@  __wcrtomb (char *s, wchar_t wc, mbstate_t *ps)
 
   return result;
 }
+
+size_t
+__wcrtomb (char *s, wchar_t wc, mbstate_t *ps)
+{
+  return __wcrtomb_internal (s, wc, ps, (size_t) -1);
+}
 weak_alias (__wcrtomb, wcrtomb)
 libc_hidden_weak (wcrtomb)