[v4,0/3] C++20 P0482R6 and C2X N2653: char8_t, mbrtoc8(), and c8rtomb()

Message ID 20220630125215.6052-1-tom@honermann.net
Headers
Series C++20 P0482R6 and C2X N2653: char8_t, mbrtoc8(), and c8rtomb() |

Message

Tom Honermann June 30, 2022, 12:52 p.m. UTC
  This series of patches provides the following:
- A fix for BZ #25744 [1].
- Implementations of the mbrtoc8 and c8rtomb functions adopted for
  C++20 via WG21 P0482R6 [2] and for C2X via WG14 N2653 [3].
- A char8_t typedef as adopted for C2X via WG14 N2653 [3].

The fix for BZ #25744 [1] is included in this patch series because the tests
for mbrtoc8 and c8rtomb depend on it for exercising the special case where a
pair of Unicode code points is converted to/from a single double byte sequence.
Such conversion cases exist for Big5-HKSCS. 

N2653 was adopted by WG14 for C2X and wording is present in the N2912 revision
of the C2X working draft [4].  This patch series enables the new declarations
in C2X mode and when _GNU_SOURCE is defined. 

This patch series addresses feedback provided in response to the prior
submission series starting at [5].  All feedback was addressed except as noted
below:
- I removed the previously proposed wcsmbs/test-char8-type.c test as redundant
  per feedback. I prototyped extending the c++-types.data test as suggested,
  but that would have introduced a C++20 dependency.  The test demonstrated that
  the char16_t and char32_t types were mapped to 'Ds' and 'Di' as expected, but
  char8_t got mapped to 'h' rather than 'Du'.  It might be worth extending this
  test in the future if/when C++20 support becomes a minimal requirement.
- I did not switch the iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c test to use
  TEST_COMPARE in favor of the existing custom error handling; the current
  error messages are customized to provide a more detailed explanation of the
  error, so removing that seemed like an arguable regression.  My changes retain
  consistency with the existing code.  I think it is reasonable to switch to
  TEST_COMPARE, but I think that should be done separately from this change and
  separately motivated.
- I did not change the c8rtomb() and mbrtoc8() implementations to use thread
  local storage when PS is null; the implementations continue to use static
  storage consistent with c16rtomb(), mbrtoc16(), c32rtomb(), and mbrtoc32().
  I think it is reasonable to switch all of these functions to use thread
  local storage, but I think such a change should be done consistently across
  all of them and should be separately motivated.

Thank you to Joseph Myers, Carlos O'Donell, Adhemerval Zanella, and Florian
Weimer for prior reviews of this patch series.

Tested on Linux x86_64.

[1]: Bug 25744
     "mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the
     second byte of certain double byte characters"
     https://sourceware.org/bugzilla/show_bug.cgi?id=25744

[2]: WG21 P0482R6
     "char8_t: A type for UTF-8 characters and strings (Revision 6)"
     https://wg21.link/p0482r6

[3]: WG14 N2653
     "char8_t: A type for UTF-8 characters and strings (Revision 1)"
     http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm

[4]: WG14 N2912
     "Programming languages — C working draft — June 8, 2022"
     https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2912.pdf

[5]: libc-alpha mailing list
     "[PATCH 0/3]: C++20 P0482R6 and C2X N2653: support for char8_t, mbrtoc8(), and c8rtomb()."
     https://sourceware.org/pipermail/libc-alpha/2022-February/136723.html
  

Comments

Adhemerval Zanella July 4, 2022, 7:08 p.m. UTC | #1
> On 30 Jun 2022, at 09:52, Tom Honermann via Libc-alpha <libc-alpha@sourceware.org> wrote:
> 
> This series of patches provides the following:
> - A fix for BZ #25744 [1].
> - Implementations of the mbrtoc8 and c8rtomb functions adopted for
>  C++20 via WG21 P0482R6 [2] and for C2X via WG14 N2653 [3].
> - A char8_t typedef as adopted for C2X via WG14 N2653 [3].
> 
> The fix for BZ #25744 [1] is included in this patch series because the tests
> for mbrtoc8 and c8rtomb depend on it for exercising the special case where a
> pair of Unicode code points is converted to/from a single double byte sequence.
> Such conversion cases exist for Big5-HKSCS. 
> 
> N2653 was adopted by WG14 for C2X and wording is present in the N2912 revision
> of the C2X working draft [4].  This patch series enables the new declarations
> in C2X mode and when _GNU_SOURCE is defined. 
> 
> This patch series addresses feedback provided in response to the prior
> submission series starting at [5].  All feedback was addressed except as noted
> below:
> - I removed the previously proposed wcsmbs/test-char8-type.c test as redundant
>  per feedback. I prototyped extending the c++-types.data test as suggested,
>  but that would have introduced a C++20 dependency.  The test demonstrated that
>  the char16_t and char32_t types were mapped to 'Ds' and 'Di' as expected, but
>  char8_t got mapped to 'h' rather than 'Du'.  It might be worth extending this
>  test in the future if/when C++20 support becomes a minimal requirement.
> - I did not switch the iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c test to use
>  TEST_COMPARE in favor of the existing custom error handling; the current
>  error messages are customized to provide a more detailed explanation of the
>  error, so removing that seemed like an arguable regression.  My changes retain
>  consistency with the existing code.  I think it is reasonable to switch to
>  TEST_COMPARE, but I think that should be done separately from this change and
>  separately motivated.
> - I did not change the c8rtomb() and mbrtoc8() implementations to use thread
>  local storage when PS is null; the implementations continue to use static
>  storage consistent with c16rtomb(), mbrtoc16(), c32rtomb(), and mbrtoc32().
>  I think it is reasonable to switch all of these functions to use thread
>  local storage, but I think such a change should be done consistently across
>  all of them and should be separately motivated.
> 
> Thank you to Joseph Myers, Carlos O'Donell, Adhemerval Zanella, and Florian
> Weimer for prior reviews of this patch series.
> 
> Tested on Linux x86_64.
> 
> [1]: Bug 25744
>     "mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the
>     second byte of certain double byte characters"
>     https://sourceware.org/bugzilla/show_bug.cgi?id=25744
> 
> [2]: WG21 P0482R6
>     "char8_t: A type for UTF-8 characters and strings (Revision 6)"
>     https://wg21.link/p0482r6
> 
> [3]: WG14 N2653
>     "char8_t: A type for UTF-8 characters and strings (Revision 1)"
>     http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
> 
> [4]: WG14 N2912
>     "Programming languages — C working draft — June 8, 2022"
>     https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2912.pdf
> 
> [5]: libc-alpha mailing list
>     "[PATCH 0/3]: C++20 P0482R6 and C2X N2653: support for char8_t, mbrtoc8(), and c8rtomb()."
>     https://sourceware.org/pipermail/libc-alpha/2022-February/136723.html

Hi Tom,

Patches look good in general, just some really minor style issues I
found in last review.  I have fixed them and if you are ok I can
install them, I have create a branch so you can check [1].

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/mbrtoc8-c8rtomb
  
Tom Honermann July 6, 2022, 3:27 a.m. UTC | #2
On 7/4/22 3:08 PM, Adhemerval Zanella wrote:
>
>> On 30 Jun 2022, at 09:52, Tom Honermann via Libc-alpha <libc-alpha@sourceware.org> wrote:
>>
>> This series of patches provides the following:
>> - A fix for BZ #25744 [1].
>> - Implementations of the mbrtoc8 and c8rtomb functions adopted for
>>   C++20 via WG21 P0482R6 [2] and for C2X via WG14 N2653 [3].
>> - A char8_t typedef as adopted for C2X via WG14 N2653 [3].
>>
>> The fix for BZ #25744 [1] is included in this patch series because the tests
>> for mbrtoc8 and c8rtomb depend on it for exercising the special case where a
>> pair of Unicode code points is converted to/from a single double byte sequence.
>> Such conversion cases exist for Big5-HKSCS.
>>
>> N2653 was adopted by WG14 for C2X and wording is present in the N2912 revision
>> of the C2X working draft [4].  This patch series enables the new declarations
>> in C2X mode and when _GNU_SOURCE is defined.
>>
>> This patch series addresses feedback provided in response to the prior
>> submission series starting at [5].  All feedback was addressed except as noted
>> below:
>> - I removed the previously proposed wcsmbs/test-char8-type.c test as redundant
>>   per feedback. I prototyped extending the c++-types.data test as suggested,
>>   but that would have introduced a C++20 dependency.  The test demonstrated that
>>   the char16_t and char32_t types were mapped to 'Ds' and 'Di' as expected, but
>>   char8_t got mapped to 'h' rather than 'Du'.  It might be worth extending this
>>   test in the future if/when C++20 support becomes a minimal requirement.
>> - I did not switch the iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c test to use
>>   TEST_COMPARE in favor of the existing custom error handling; the current
>>   error messages are customized to provide a more detailed explanation of the
>>   error, so removing that seemed like an arguable regression.  My changes retain
>>   consistency with the existing code.  I think it is reasonable to switch to
>>   TEST_COMPARE, but I think that should be done separately from this change and
>>   separately motivated.
>> - I did not change the c8rtomb() and mbrtoc8() implementations to use thread
>>   local storage when PS is null; the implementations continue to use static
>>   storage consistent with c16rtomb(), mbrtoc16(), c32rtomb(), and mbrtoc32().
>>   I think it is reasonable to switch all of these functions to use thread
>>   local storage, but I think such a change should be done consistently across
>>   all of them and should be separately motivated.
>>
>> Thank you to Joseph Myers, Carlos O'Donell, Adhemerval Zanella, and Florian
>> Weimer for prior reviews of this patch series.
>>
>> Tested on Linux x86_64.
>>
>> [1]: Bug 25744
>>      "mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the
>>      second byte of certain double byte characters"
>>      https://sourceware.org/bugzilla/show_bug.cgi?id=25744
>>
>> [2]: WG21 P0482R6
>>      "char8_t: A type for UTF-8 characters and strings (Revision 6)"
>>      https://wg21.link/p0482r6
>>
>> [3]: WG14 N2653
>>      "char8_t: A type for UTF-8 characters and strings (Revision 1)"
>>      http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
>>
>> [4]: WG14 N2912
>>      "Programming languages — C working draft — June 8, 2022"
>>      https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2912.pdf
>>
>> [5]: libc-alpha mailing list
>>      "[PATCH 0/3]: C++20 P0482R6 and C2X N2653: support for char8_t, mbrtoc8(), and c8rtomb()."
>>      https://sourceware.org/pipermail/libc-alpha/2022-February/136723.html
> Hi Tom,
>
> Patches look good in general, just some really minor style issues I
> found in last review.  I have fixed them and if you are ok I can
> install them, I have create a branch so you can check [1].

Excellent, thank you! I verified that your branch differs from the 
patches I posted just by the style issues mentioned in your review and 
the additional removal of the (empty) 
sysdeps/mach/hurd/libhurduser.abilist and 
sysdeps/mach/libmachuser.abilist files (which I presume to be 
intentional). Looks good to me! I very much appreciate you making the 
additional style corrections!

Tom.

>
> [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/mbrtoc8-c8rtomb
>
  
Adhemerval Zanella July 6, 2022, 12:23 p.m. UTC | #3
> On 6 Jul 2022, at 00:27, Tom Honermann <tom@honermann.net> wrote:
> 
> On 7/4/22 3:08 PM, Adhemerval Zanella wrote:
>> 
>>> On 30 Jun 2022, at 09:52, Tom Honermann via Libc-alpha <libc-alpha@sourceware.org> wrote:
>>> 
>>> This series of patches provides the following:
>>> - A fix for BZ #25744 [1].
>>> - Implementations of the mbrtoc8 and c8rtomb functions adopted for
>>>  C++20 via WG21 P0482R6 [2] and for C2X via WG14 N2653 [3].
>>> - A char8_t typedef as adopted for C2X via WG14 N2653 [3].
>>> 
>>> The fix for BZ #25744 [1] is included in this patch series because the tests
>>> for mbrtoc8 and c8rtomb depend on it for exercising the special case where a
>>> pair of Unicode code points is converted to/from a single double byte sequence.
>>> Such conversion cases exist for Big5-HKSCS.
>>> 
>>> N2653 was adopted by WG14 for C2X and wording is present in the N2912 revision
>>> of the C2X working draft [4].  This patch series enables the new declarations
>>> in C2X mode and when _GNU_SOURCE is defined.
>>> 
>>> This patch series addresses feedback provided in response to the prior
>>> submission series starting at [5].  All feedback was addressed except as noted
>>> below:
>>> - I removed the previously proposed wcsmbs/test-char8-type.c test as redundant
>>>  per feedback. I prototyped extending the c++-types.data test as suggested,
>>>  but that would have introduced a C++20 dependency.  The test demonstrated that
>>>  the char16_t and char32_t types were mapped to 'Ds' and 'Di' as expected, but
>>>  char8_t got mapped to 'h' rather than 'Du'.  It might be worth extending this
>>>  test in the future if/when C++20 support becomes a minimal requirement.
>>> - I did not switch the iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c test to use
>>>  TEST_COMPARE in favor of the existing custom error handling; the current
>>>  error messages are customized to provide a more detailed explanation of the
>>>  error, so removing that seemed like an arguable regression.  My changes retain
>>>  consistency with the existing code.  I think it is reasonable to switch to
>>>  TEST_COMPARE, but I think that should be done separately from this change and
>>>  separately motivated.
>>> - I did not change the c8rtomb() and mbrtoc8() implementations to use thread
>>>  local storage when PS is null; the implementations continue to use static
>>>  storage consistent with c16rtomb(), mbrtoc16(), c32rtomb(), and mbrtoc32().
>>>  I think it is reasonable to switch all of these functions to use thread
>>>  local storage, but I think such a change should be done consistently across
>>>  all of them and should be separately motivated.
>>> 
>>> Thank you to Joseph Myers, Carlos O'Donell, Adhemerval Zanella, and Florian
>>> Weimer for prior reviews of this patch series.
>>> 
>>> Tested on Linux x86_64.
>>> 
>>> [1]: Bug 25744
>>>     "mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the
>>>     second byte of certain double byte characters"
>>>     https://sourceware.org/bugzilla/show_bug.cgi?id=25744
>>> 
>>> [2]: WG21 P0482R6
>>>     "char8_t: A type for UTF-8 characters and strings (Revision 6)"
>>>     https://wg21.link/p0482r6
>>> 
>>> [3]: WG14 N2653
>>>     "char8_t: A type for UTF-8 characters and strings (Revision 1)"
>>>     http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
>>> 
>>> [4]: WG14 N2912
>>>     "Programming languages — C working draft — June 8, 2022"
>>>     https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2912.pdf
>>> 
>>> [5]: libc-alpha mailing list
>>>     "[PATCH 0/3]: C++20 P0482R6 and C2X N2653: support for char8_t, mbrtoc8(), and c8rtomb()."
>>>     https://sourceware.org/pipermail/libc-alpha/2022-February/136723.html
>> Hi Tom,
>> 
>> Patches look good in general, just some really minor style issues I
>> found in last review.  I have fixed them and if you are ok I can
>> install them, I have create a branch so you can check [1].
> 
> Excellent, thank you! I verified that your branch differs from the patches I posted just by the style issues mentioned in your review and the additional removal of the (empty) sysdeps/mach/hurd/libhurduser.abilist and sysdeps/mach/libmachuser.abilist files (which I presume to be intentional). Looks good to me! I very much appreciate you making the additional style corrections!
> 
> Tom.

The hurd files removal is a mistake from my part (we should really fix
make update-abi/check-abi on hurd...).  I will fix it and install the
patches.

> 
>> 
>> [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/mbrtoc8-c8rtomb
>>