mbstowcs: Document, test, and fix null pointer dst semantics.

Message ID 63ea09de-e943-c26e-c821-093ba303d76b@redhat.com
State Superseded
Headers
Series mbstowcs: Document, test, and fix null pointer dst semantics. |

Commit Message

Carlos O'Donell May 22, 2020, 2:15 a.m. UTC
  Martin,

The following patch does 3 things:

* Documents the mbstowcs behaviour of accepting a first argument as NULL.
* Relaxes the stdlib.h markup to allow this documented behaviour.
* Adds a regression test to the testsuite for mbstowcs specifically to
  cover the first argument NULL test case.

The regression test fails without the stdlib.h fix, and passes after.

No other regressions on x86_64.

I'm looking for a "Reviewed-by: Martin Sebor <msebor@redhat.com>" from you 
to indicate that what I've done makes sense and that you've reviewed
this patch and that I've answered all of your technical questions.

Thanks for the engaging discussion on documentation.

8< --- 8< --- 8<
The function mbstowcs, by an XSI extension to POSIX, accepts a null
pointer for the destination wchar_t array.  This API behaviour allows
you to use the function to compute the length of the required wchar_t
array i.e. does the conversion without storing it and returns the
number of wide characters required.

We remove the __write_only__ markup for the first argument because it
is not true since the destination may be a null pointer, and so the
length argument may not apply.  We remove the markup otherwise the new
test case cannot be compiled with -Werror=nonnull.

We add a new test case for mbstowcs which exercises the destination is
a null pointer behaviour which we have now explicitly documented.

The mbsrtowcs and mbsnrtowcs behave similarly, and mbsrtowcs is
documented as doing this in C11, even if the standard doesn't come out
and call out this specific use case.  We add one note to each of
mbsrtowcs and mbsnrtowcs to call out that they support a null pointer
for the destination.

The wcsrtombs function behaves similarly but in the other way around
and allows you to use a null destination pointer to compute how many
bytes you would need to convert the wide character input.  We document
this particular case also, but leave wcsnrtombs as a references to
wcsrtombs, so the reader must still read the details of the semantics
for wcsrtombs.
---
 manual/charset.texi   | 24 +++++++++++++++++----
 stdlib/stdlib.h       |  2 +-
 wcsmbs/Makefile       |  2 +-
 wcsmbs/tst-mbstowcs.c | 50 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 72 insertions(+), 6 deletions(-)
 create mode 100644 wcsmbs/tst-mbstowcs.c
  

Comments

Florian Weimer May 22, 2020, 11:25 a.m. UTC | #1
* Carlos O'Donell via Libc-alpha:

> +If @var{wstring} is a null pointer then no output is written and the
> +conversion proceeds as above, and the result is returned. In practice
> +such behaviour is useful for calculating the exact number of wide
> +characters required to convert @var{string}.  This behaviour of accepting
> +a null pointer for @var{wstring} is an @w{XSI} extension to the requirements
> +in @w{ISO C} and @w{POSIX}.

If you refer standards here, these references should be framed in terms
of XPG etc., as in other places of the manual.

An easy way out (that doesn't need researching history here) would be to
declare this as a GNU extension.

Thanks,
Florian
  
Martin Sebor May 22, 2020, 3:56 p.m. UTC | #2
On 5/21/20 8:15 PM, Carlos O'Donell via Libc-alpha wrote:
> Martin,
> 
> The following patch does 3 things:
> 
> * Documents the mbstowcs behaviour of accepting a first argument as NULL.
> * Relaxes the stdlib.h markup to allow this documented behaviour.
> * Adds a regression test to the testsuite for mbstowcs specifically to
>    cover the first argument NULL test case.
> 
> The regression test fails without the stdlib.h fix, and passes after.
> 
> No other regressions on x86_64.
> 
> I'm looking for a "Reviewed-by: Martin Sebor <msebor@redhat.com>" from you
> to indicate that what I've done makes sense and that you've reviewed
> this patch and that I've answered all of your technical questions.
> 
> Thanks for the engaging discussion on documentation.

Looks good.

Thanks for fixing up the mbstowcs attribute.  We might put it back
if/when the attribute implementation changes to avoid assuming that
nonzero size necessarily implies nonnull pointer.

Martin


> 
> 8< --- 8< --- 8<
> The function mbstowcs, by an XSI extension to POSIX, accepts a null
> pointer for the destination wchar_t array.  This API behaviour allows
> you to use the function to compute the length of the required wchar_t
> array i.e. does the conversion without storing it and returns the
> number of wide characters required.
> 
> We remove the __write_only__ markup for the first argument because it
> is not true since the destination may be a null pointer, and so the
> length argument may not apply.  We remove the markup otherwise the new
> test case cannot be compiled with -Werror=nonnull.
> 
> We add a new test case for mbstowcs which exercises the destination is
> a null pointer behaviour which we have now explicitly documented.
> 
> The mbsrtowcs and mbsnrtowcs behave similarly, and mbsrtowcs is
> documented as doing this in C11, even if the standard doesn't come out
> and call out this specific use case.  We add one note to each of
> mbsrtowcs and mbsnrtowcs to call out that they support a null pointer
> for the destination.
> 
> The wcsrtombs function behaves similarly but in the other way around
> and allows you to use a null destination pointer to compute how many
> bytes you would need to convert the wide character input.  We document
> this particular case also, but leave wcsnrtombs as a references to
> wcsrtombs, so the reader must still read the details of the semantics
> for wcsrtombs.
> ---
>   manual/charset.texi   | 24 +++++++++++++++++----
>   stdlib/stdlib.h       |  2 +-
>   wcsmbs/Makefile       |  2 +-
>   wcsmbs/tst-mbstowcs.c | 50 +++++++++++++++++++++++++++++++++++++++++++
>   4 files changed, 72 insertions(+), 6 deletions(-)
>   create mode 100644 wcsmbs/tst-mbstowcs.c
> 
> diff --git a/manual/charset.texi b/manual/charset.texi
> index 9fd0166115..c909cb88b7 100644
> --- a/manual/charset.texi
> +++ b/manual/charset.texi
> @@ -1026,6 +1026,10 @@ stores in the pointer pointed to by @var{src} either a null pointer (if
>   the NUL byte in the input string was reached) or the address of the byte
>   following the last converted multibyte character.
>   
> +Like @code{mbstowcs} the @var{dst} parameter may be a null pointer and
> +the function can be used to count the number of wide characters that
> +would be required.
> +
>   @pindex wchar.h
>   @code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is
>   declared in @file{wchar.h}.
> @@ -1101,10 +1105,11 @@ successfully converted.
>   
>   Except in the case of an encoding error the return value of the
>   @code{wcsrtombs} function is the number of bytes in all the multibyte
> -character sequences stored in @var{dst}.  Before returning, the state in
> -the object pointed to by @var{ps} (or the internal object in case
> -@var{ps} is a null pointer) is updated to reflect the state after the
> -last conversion.  The state is the initial shift state in case the
> +character sequences which were or would have been (if @var{dst} was
> +not a null) stored in @var{dst}.  Before returning, the state in the
> +object pointed to by @var{ps} (or the internal object in case @var{ps}
> +is a null pointer) is updated to reflect the state after the last
> +conversion.  The state is the initial shift state in case the
>   terminating NUL wide character was converted.
>   
>   @pindex wchar.h
> @@ -1131,6 +1136,10 @@ string @code{*@var{src}} need not be NUL-terminated.  But if a NUL byte
>   is found within the @var{nmc} first bytes of the string, the conversion
>   stops there.
>   
> +Like @code{mbstowcs} the @var{dst} parameter may be a null pointer and
> +the function can be used to count the number of wide characters that
> +would be required.
> +
>   This function is a GNU extension.  It is meant to work around the
>   problems mentioned above.  Now it is possible to convert a buffer with
>   multibyte character text piece by piece without having to care about
> @@ -1465,6 +1474,13 @@ mbstowcs_alloc (const char *string)
>   @}
>   @end smallexample
>   
> +If @var{wstring} is a null pointer then no output is written and the
> +conversion proceeds as above, and the result is returned. In practice
> +such behaviour is useful for calculating the exact number of wide
> +characters required to convert @var{string}.  This behaviour of accepting
> +a null pointer for @var{wstring} is an @w{XSI} extension to the requirements
> +in @w{ISO C} and @w{POSIX}.
> +
>   @end deftypefun
>   
>   @deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
> diff --git a/stdlib/stdlib.h b/stdlib/stdlib.h
> index dd779bd740..f971df4247 100644
> --- a/stdlib/stdlib.h
> +++ b/stdlib/stdlib.h
> @@ -932,7 +932,7 @@ extern int wctomb (char *__s, wchar_t __wchar) __THROW;
>   /* Convert a multibyte string to a wide char string.  */
>   extern size_t mbstowcs (wchar_t *__restrict  __pwcs,
>   			const char *__restrict __s, size_t __n) __THROW
> -    __attr_access ((__write_only__, 1, 3)) __attr_access ((__read_only__, 2));
> +    __attr_access ((__read_only__, 2));
>   /* Convert a wide char string to multibyte string.  */
>   extern size_t wcstombs (char *__restrict __s,
>   			const wchar_t *__restrict __pwcs, size_t __n)
> diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile
> index f02167fa58..e638e45522 100644
> --- a/wcsmbs/Makefile
> +++ b/wcsmbs/Makefile
> @@ -52,7 +52,7 @@ tests := tst-wcstof wcsmbs-tst1 tst-wcsnlen tst-btowc tst-mbrtowc \
>   	 tst-c16c32-1 wcsatcliff tst-wcstol-locale tst-wcstod-nan-locale \
>   	 tst-wcstod-round test-char-types tst-fgetwc-after-eof \
>   	 tst-wcstod-nan-sign tst-c16-surrogate tst-c32-state \
> -	 $(addprefix test-,$(strop-tests))
> +	 $(addprefix test-,$(strop-tests)) tst-mbstowcs
>   
>   include ../Rules
>   
> diff --git a/wcsmbs/tst-mbstowcs.c b/wcsmbs/tst-mbstowcs.c
> new file mode 100644
> index 0000000000..a803d826a2
> --- /dev/null
> +++ b/wcsmbs/tst-mbstowcs.c
> @@ -0,0 +1,50 @@
> +/* Test basic mbstowcs including wstring == NULL.
> +   Copyright (C) 2020 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <stdlib.h>
> +#include <string.h>
> +#include <support/check.h>
> +
> +static int
> +do_test (void)
> +{
> +  char string[] = { '1', '2', '3' , '4', '5', '\0' };
> +  size_t len = strlen (string);
> +  wchar_t wstring[] = { L'1', L'2', L'3', L'4', L'5', L'\0' };
> +#define NUM_WCHAR 6
> +  wchar_t wout[NUM_WCHAR];
> +  size_t result;
> +
> +  /* The input ASCII string in the C/POSIX locale must convert
> +     to the matching WSTRING.  */
> +  result = mbstowcs (wout, string, NUM_WCHAR);
> +  TEST_VERIFY (result == (NUM_WCHAR - 1));
> +  TEST_COMPARE_BLOB (wstring, sizeof(wchar_t) * (NUM_WCHAR - 1),
> +		     wout, sizeof(wchar_t) * result);
> +
> +  /* The input ASCII string in the C/POSIX locale must be the
> +     same length when using mbstowcs to compute the length of
> +     the string required in the conversion.  Using mbstowcs
> +     in this way is an XSI extension to POSIX.  */
> +  result = mbstowcs (NULL, string, len);
> +  TEST_VERIFY (result == len);
> +
> +  return 0;
> +}
> +
> +#include <support/test-driver.c>
>
  
Matheus Castanho May 22, 2020, 4:15 p.m. UTC | #3
Just tested the patch (on ppc64le) and the problem I reported on the
Bugzilla issue has been solved.

Thanks for working on the fix.

--
Matheus Castanho
  
Carlos O'Donell May 22, 2020, 4:35 p.m. UTC | #4
On 5/22/20 12:15 PM, Matheus Castanho wrote:
> Just tested the patch (on ppc64le) and the problem I reported on the
> Bugzilla issue has been solved.
> 
> Thanks for working on the fix.

Matheus thanks for testing!

Can you please provide the URL to the bug?
  
Florian Weimer May 22, 2020, 5:06 p.m. UTC | #5
* Carlos O'Donell via Libc-alpha:

> On 5/22/20 12:15 PM, Matheus Castanho wrote:
>> Just tested the patch (on ppc64le) and the problem I reported on the
>> Bugzilla issue has been solved.
>> 
>> Thanks for working on the fix.
>
> Matheus thanks for testing!
>
> Can you please provide the URL to the bug?

It's a comment on bug 25219:

  <https://sourceware.org/bugzilla/show_bug.cgi?id=25219#c3>

Thanks,
Flroian
  
Matheus Castanho May 25, 2020, 2:10 p.m. UTC | #6
On 5/22/20 2:06 PM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> On 5/22/20 12:15 PM, Matheus Castanho wrote:
>>> Just tested the patch (on ppc64le) and the problem I reported on the
>>> Bugzilla issue has been solved.
>>>
>>> Thanks for working on the fix.
>>
>> Matheus thanks for testing!
>>
>> Can you please provide the URL to the bug?
> 
> It's a comment on bug 25219:
> 
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=25219#c3>

That's the one. Thanks, Florian.

I just commented on the issue since it was related to that topic.

--
Matheus Castanho
  
Matheus Castanho June 1, 2020, 1:40 p.m. UTC | #7
Hi Carlos,

Is a v2 still pending with Florian's comments or is this ready to go?

Thanks,
Matheus Castanho
  
Carlos O'Donell June 1, 2020, 3:11 p.m. UTC | #8
On 6/1/20 9:40 AM, Matheus Castanho wrote:
> Is a v2 still pending with Florian's comments or is this ready to go?

I'll post a v2 shortly. Thanks for the ping.
  

Patch

diff --git a/manual/charset.texi b/manual/charset.texi
index 9fd0166115..c909cb88b7 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -1026,6 +1026,10 @@  stores in the pointer pointed to by @var{src} either a null pointer (if
 the NUL byte in the input string was reached) or the address of the byte
 following the last converted multibyte character.
 
+Like @code{mbstowcs} the @var{dst} parameter may be a null pointer and
+the function can be used to count the number of wide characters that
+would be required.
+
 @pindex wchar.h
 @code{mbsrtowcs} was introduced in @w{Amendment 1} to @w{ISO C90} and is
 declared in @file{wchar.h}.
@@ -1101,10 +1105,11 @@  successfully converted.
 
 Except in the case of an encoding error the return value of the
 @code{wcsrtombs} function is the number of bytes in all the multibyte
-character sequences stored in @var{dst}.  Before returning, the state in
-the object pointed to by @var{ps} (or the internal object in case
-@var{ps} is a null pointer) is updated to reflect the state after the
-last conversion.  The state is the initial shift state in case the
+character sequences which were or would have been (if @var{dst} was
+not a null) stored in @var{dst}.  Before returning, the state in the
+object pointed to by @var{ps} (or the internal object in case @var{ps}
+is a null pointer) is updated to reflect the state after the last
+conversion.  The state is the initial shift state in case the
 terminating NUL wide character was converted.
 
 @pindex wchar.h
@@ -1131,6 +1136,10 @@  string @code{*@var{src}} need not be NUL-terminated.  But if a NUL byte
 is found within the @var{nmc} first bytes of the string, the conversion
 stops there.
 
+Like @code{mbstowcs} the @var{dst} parameter may be a null pointer and
+the function can be used to count the number of wide characters that
+would be required.
+
 This function is a GNU extension.  It is meant to work around the
 problems mentioned above.  Now it is possible to convert a buffer with
 multibyte character text piece by piece without having to care about
@@ -1465,6 +1474,13 @@  mbstowcs_alloc (const char *string)
 @}
 @end smallexample
 
+If @var{wstring} is a null pointer then no output is written and the
+conversion proceeds as above, and the result is returned. In practice
+such behaviour is useful for calculating the exact number of wide
+characters required to convert @var{string}.  This behaviour of accepting
+a null pointer for @var{wstring} is an @w{XSI} extension to the requirements
+in @w{ISO C} and @w{POSIX}.
+
 @end deftypefun
 
 @deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
diff --git a/stdlib/stdlib.h b/stdlib/stdlib.h
index dd779bd740..f971df4247 100644
--- a/stdlib/stdlib.h
+++ b/stdlib/stdlib.h
@@ -932,7 +932,7 @@  extern int wctomb (char *__s, wchar_t __wchar) __THROW;
 /* Convert a multibyte string to a wide char string.  */
 extern size_t mbstowcs (wchar_t *__restrict  __pwcs,
 			const char *__restrict __s, size_t __n) __THROW
-    __attr_access ((__write_only__, 1, 3)) __attr_access ((__read_only__, 2));
+    __attr_access ((__read_only__, 2));
 /* Convert a wide char string to multibyte string.  */
 extern size_t wcstombs (char *__restrict __s,
 			const wchar_t *__restrict __pwcs, size_t __n)
diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile
index f02167fa58..e638e45522 100644
--- a/wcsmbs/Makefile
+++ b/wcsmbs/Makefile
@@ -52,7 +52,7 @@  tests := tst-wcstof wcsmbs-tst1 tst-wcsnlen tst-btowc tst-mbrtowc \
 	 tst-c16c32-1 wcsatcliff tst-wcstol-locale tst-wcstod-nan-locale \
 	 tst-wcstod-round test-char-types tst-fgetwc-after-eof \
 	 tst-wcstod-nan-sign tst-c16-surrogate tst-c32-state \
-	 $(addprefix test-,$(strop-tests))
+	 $(addprefix test-,$(strop-tests)) tst-mbstowcs
 
 include ../Rules
 
diff --git a/wcsmbs/tst-mbstowcs.c b/wcsmbs/tst-mbstowcs.c
new file mode 100644
index 0000000000..a803d826a2
--- /dev/null
+++ b/wcsmbs/tst-mbstowcs.c
@@ -0,0 +1,50 @@ 
+/* Test basic mbstowcs including wstring == NULL.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <stdlib.h>
+#include <string.h>
+#include <support/check.h>
+
+static int
+do_test (void)
+{
+  char string[] = { '1', '2', '3' , '4', '5', '\0' };
+  size_t len = strlen (string);
+  wchar_t wstring[] = { L'1', L'2', L'3', L'4', L'5', L'\0' };
+#define NUM_WCHAR 6
+  wchar_t wout[NUM_WCHAR];
+  size_t result;
+
+  /* The input ASCII string in the C/POSIX locale must convert
+     to the matching WSTRING.  */
+  result = mbstowcs (wout, string, NUM_WCHAR);
+  TEST_VERIFY (result == (NUM_WCHAR - 1));
+  TEST_COMPARE_BLOB (wstring, sizeof(wchar_t) * (NUM_WCHAR - 1),
+		     wout, sizeof(wchar_t) * result);
+
+  /* The input ASCII string in the C/POSIX locale must be the
+     same length when using mbstowcs to compute the length of
+     the string required in the conversion.  Using mbstowcs
+     in this way is an XSI extension to POSIX.  */
+  result = mbstowcs (NULL, string, len);
+  TEST_VERIFY (result == len);
+
+  return 0;
+}
+
+#include <support/test-driver.c>