[2/3] : C++20 P0482R6 and C2X N2653: Implement mbrtoc8, c8rtomb, char8_t
Checks
Context | Check | Description
dj/TryBot-apply_patch | success | Patch applied to master at the time it was sent
Commit Message
This patch provides implementations for the mbrtoc8 and c8rtomb
functions adopted for C++20 via WG21 P0482R6 [1] and proposed for C2X
via WG14 N2653 [2]. It also provides the char8_t typedef from WG14 N2653
[2].
The mbrtoc8 and c8rtomb functions are declared in uchar.h if either the
C++20 __cpp_char8_t feature test macro or _GNU_SOURCE is defined.
The char8_t typedef is declared in uchar.h if _GNU_SOURCE is defined and
__cpp_char8_t is not defined (if __cpp_char8_t is defined, then char8_t
is a builtin type).
Tested on Linux x86_64.
Tom.
[1]: WG21 P0482R6
"char8_t: A type for UTF-8 characters and strings (Revision 6)"
https://wg21.link/p0482r6
[2]: WG14 N2653
"char8_t: A type for UTF-8 characters and strings (Revision 1)"
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
Comments
This patch is adding a NEWS entry for 2.34. Since the next release is
2.35, it should add an entry there instead (except we've started release
slush for 2.35, so it might be better to aim this at 2.36).
On 1/10/22 7:53 PM, Joseph Myers wrote:
> This patch is adding a NEWS entry for 2.34. Since the next release is
> 2.35, it should add an entry there instead (except we've started release
> slush for 2.35, so it might be better to aim this at 2.36).
>
Thank you for spotting that; this was due to a bad merge conflict
resolution on my part. Fixed patch attached.
If this needs to wait for 2.36, I understand, but I would love to have
this in 2.35 to close out this gap in C++20 support.
Tom.
On 1/11/22 2:23 PM, Tom Honermann via Libc-alpha wrote:
> Thank you for spotting that; this was due to a bad merge conflict
> resolution on my part. Fixed patch attached.
>
> If this needs to wait for 2.36, I understand, but I would love to have
> this in 2.35 to close out this gap in C++20 support.
Any further comments on this patch series?
Tom.
On 20/01/2022 20:17, Tom Honermann via Libc-alpha wrote:
> Any further comments on this patch series?
>
> Tom.
>
I think it is too late for 2.35, but I will take a look when 2.36 opens.
On 1/21/22 3:01 PM, Adhemerval Zanella wrote:
> I think it is too late for 2.35, but I will take a look when 2.36 opens.
I understand. Thank you! I'll repost the patch series with the changes
needed for 2.36 once 2.35 is released.
Tom.
On Sat, 22 Jan 2022, Tom Honermann via Libc-alpha wrote:
> I understand. Thank you! I'll repost the patch series with the changes needed
> for 2.36 once 2.35 is released.
Now that this feature has been accepted for C2X at today's WG14 meeting, a
repost should condition the header declarations on __GLIBC_USE (ISOC2X)
(plus whatever is appropriate for C++), as well as using 2.36 symbol
versions, adding a NEWS entry in the 2.36 section and making that NEWS
entry reflect that the feature is in C2X.
On Wednesday, February 16, 2022 1:29pm, "Joseph Myers" <joseph@codesourcery.com> said:
> Now that this feature has been accepted for C2X at today's WG14 meeting, a
> repost should condition the header declarations on __GLIBC_USE (ISOC2X)
> (plus whatever is appropriate for C++), as well as using 2.36 symbol
> versions, adding a NEWS entry in the 2.36 section and making that NEWS
> entry reflect that the feature is in C2X.
Sounds good, thank you! I'll get a new set of patches posted soon.
Tom.
commit 0710b1004c1eb151d739c73090c4eab81e454eb1
Author: Tom Honermann <tom@honermann.net>
Date: Wed Jan 5 18:42:03 2022 -0500
Implement mbrtoc8(), c8rtomb(), and the char8_t typedef.
This change provides implementations for the mbrtoc8 and c8rtomb
functions adopted for C++20 via WG21 P0482R6 and proposed for C2X
via WG14 N2653. It also provides the char8_t typedef from N2653.
The mbrtoc8 and c8rtomb functions are declared in uchar.h if
either the C++20 __cpp_char8_t feature test macro or the
_GNU_SOURCE macro is defined.
The char8_t typedef is declared in uchar.h if _GNU_SOURCE is
defined and __cpp_char8_t is not defined (if __cpp_char8_t is
defined, then char8_t is a builtin type).
@@ -237,6 +237,14 @@ Major new features:
* The audit libraries will avoid unnecessary slowdown if it is not required
PLT tracking (by not implementing the la_pltenter or la_pltexit callbacks).
+* The mbrtoc8 and c8rtomb functions are added for implementation of the
+ C++20 P0482R6 and C2X N2653 proposals. These functions perform conversions
+ between multibyte sequences and the UTF-8 character encoding. A char8_t
+ typedef is added for the C2X N2653 proposal. The functions are declared
+ in uchar.h if the C++20 __cpp_char8_t feature test macro or _GNU_SOURCE
+ macro is defined. The char8_t typedef is declared in uchar.h if _GNU_SOURCE
+ is defined and __cpp_char8_t is not defined.
+
Deprecated and removed features, and other changes affecting compatibility:
* The function pthread_mutex_consistent_np has been deprecated; programs
@@ -2287,7 +2287,9 @@ GLIBC_2.34 shm_unlink F
GLIBC_2.34 timespec_getres F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
GLIBC_2.35 close_range F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2613,4 +2613,6 @@ GLIBC_2.34 tss_delete F
GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
+GLIBC_2.35 c8rtomb F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 mbrtoc8 F
@@ -2711,6 +2711,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _IO_fprintf F
GLIBC_2.4 _IO_printf F
GLIBC_2.4 _IO_sprintf F
@@ -2375,3 +2375,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -493,6 +493,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _Exit F
GLIBC_2.4 _IO_2_1_stderr_ D 0xa0
GLIBC_2.4 _IO_2_1_stdin_ D 0xa0
@@ -490,6 +490,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _Exit F
GLIBC_2.4 _IO_2_1_stderr_ D 0xa0
GLIBC_2.4 _IO_2_1_stdin_ D 0xa0
@@ -2649,3 +2649,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -2598,6 +2598,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2782,6 +2782,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2549,6 +2549,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -494,6 +494,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _Exit F
GLIBC_2.4 _IO_2_1_stderr_ D 0x98
GLIBC_2.4 _IO_2_1_stdin_ D 0x98
@@ -2725,6 +2725,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2698,3 +2698,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -2695,3 +2695,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -2690,6 +2690,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2688,6 +2688,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2696,6 +2696,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2600,6 +2600,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2737,3 +2737,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -2752,6 +2752,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _IO_fprintf F
GLIBC_2.4 _IO_printf F
GLIBC_2.4 _IO_sprintf F
@@ -2785,6 +2785,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _IO_fprintf F
GLIBC_2.4 _IO_printf F
GLIBC_2.4 _IO_sprintf F
@@ -2508,6 +2508,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _IO_fprintf F
GLIBC_2.4 _IO_printf F
GLIBC_2.4 _IO_sprintf F
@@ -2810,3 +2810,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -2377,3 +2377,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -2577,3 +2577,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -2750,6 +2750,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _IO_fprintf F
GLIBC_2.4 _IO_printf F
GLIBC_2.4 _IO_sprintf F
@@ -2545,6 +2545,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _IO_fprintf F
GLIBC_2.4 _IO_printf F
GLIBC_2.4 _IO_sprintf F
@@ -2605,6 +2605,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2602,6 +2602,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2745,6 +2745,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 _IO_fprintf F
GLIBC_2.4 _IO_printf F
GLIBC_2.4 _IO_sprintf F
@@ -2572,6 +2572,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2523,6 +2523,8 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
GLIBC_2.4 __confstr_chk F
GLIBC_2.4 __fgets_chk F
GLIBC_2.4 __fgets_unlocked_chk F
@@ -2629,3 +2629,5 @@ GLIBC_2.34 tss_get F
GLIBC_2.34 tss_set F
GLIBC_2.35 __memcmpeq F
GLIBC_2.35 _dl_find_object F
+GLIBC_2.35 c8rtomb F
+GLIBC_2.35 mbrtoc8 F
@@ -42,7 +42,7 @@ routines := wcscat wcschr wcscmp wcscpy wcscspn wcsdup wcslen wcsncat \
wcsmbsload mbsrtowcs_l \
isoc99_wscanf isoc99_vwscanf isoc99_fwscanf isoc99_vfwscanf \
isoc99_swscanf isoc99_vswscanf \
- mbrtoc16 c16rtomb mbrtoc32 c32rtomb
+ mbrtoc8 c8rtomb mbrtoc16 c16rtomb mbrtoc32 c32rtomb
strop-tests := wcscmp wcsncmp wmemcmp wcslen wcschr wcsrchr wcscpy wcsnlen \
wcpcpy wcsncpy wcpncpy wcscat wcsncat wcschrnul wcsspn wcspbrk \
@@ -49,4 +49,7 @@ libc {
wcstof32; wcstof64; wcstof32x;
wcstof32_l; wcstof64_l; wcstof32x_l;
}
+ GLIBC_2.35 {
+ c8rtomb; mbrtoc8;
+ }
}
new file mode 100644
@@ -0,0 +1,132 @@
+/* UTF-8 to multibyte conversion.
+ Copyright (C) 2022 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <errno.h>
+#include <uchar.h>
+#include <wchar.h>
+
+
+/* This is the private state used if PS is NULL. */
+static mbstate_t state;
+
+size_t
+c8rtomb (char *s, char8_t c8, mbstate_t *ps)
+{
+ /* This implementation depends on the converter invoked by wcrtomb not
+ needing to retain state in either the top most bit of ps->__count or
+ in ps->__value between invocations. This implementation uses the
+ top most bit of ps->__count to indicate that trailing code units are
+ expected and uses ps->__value to store previously seen code units. */
+
+ wchar_t wc;
+
+ if (ps == NULL)
+ ps = &state;
+
+ if (s == NULL)
+ {
+ /* If 's' is a null pointer, behave as if u8'\0' was passed as 'c8'. If
+ this occurs for an incomplete code unit sequence, then an error will
+ be reported below. */
+ c8 = u8""[0];
+ }
+
+ if (! (ps->__count & 0x80000000))
+ {
+ /* Initial state. */
+ if ((c8 >= 0x80 && c8 <= 0xC1) || c8 >= 0xF5)
+ {
+ /* An invalid lead code unit. */
+ __set_errno (EILSEQ);
+ return -1;
+ }
+ if (c8 >= 0xC2)
+ {
+ /* A valid lead code unit. */
+ ps->__count |= 0x80000000;
+ ps->__value.__wchb[0] = c8;
+ ps->__value.__wchb[3] = 1;
+ return 0;
+ }
+ /* A single byte (ASCII) code unit. */
+ wc = c8;
+ }
+ else
+ {
+ char8_t cu1 = ps->__value.__wchb[0];
+ if (ps->__value.__wchb[3] == 1)
+ {
+ /* A single lead code unit was previously seen. */
+ if ((c8 < 0x80 || c8 > 0xBF) ||
+ (cu1 == 0xE0 && c8 < 0xA0) ||
+ (cu1 == 0xED && c8 > 0x9F) ||
+ (cu1 == 0xF0 && c8 < 0x90) ||
+ (cu1 == 0xF4 && c8 > 0x8F))
+ {
+ /* An invalid second code unit. */
+ __set_errno (EILSEQ);
+ return -1;
+ }
+ if (cu1 >= 0xE0)
+ {
+ /* A three or four code unit sequence. */
+ ps->__value.__wchb[1] = c8;
+ ++ps->__value.__wchb[3];
+ return 0;
+ }
+ wc = ((cu1 & 0x1F) << 6) +
+ (c8 & 0x3F);
+ }
+ else
+ {
+ char8_t cu2 = ps->__value.__wchb[1];
+ /* A three or four byte code unit sequence. */
+ if (c8 < 0x80 || c8 > 0xBF)
+ {
+ /* An invalid third or fourth code unit. */
+ __set_errno (EILSEQ);
+ return -1;
+ }
+ if (ps->__value.__wchb[3] == 2 && cu1 >= 0xF0)
+ {
+ /* A four code unit sequence. */
+ ps->__value.__wchb[2] = c8;
+ ++ps->__value.__wchb[3];
+ return 0;
+ }
+ if (cu1 < 0xF0)
+ {
+ wc = ((cu1 & 0x0F) << 12) +
+ ((cu2 & 0x3F) << 6) +
+ (c8 & 0x3F);
+ }
+ else
+ {
+ char8_t cu3 = ps->__value.__wchb[2];
+ wc = ((cu1 & 0x07) << 18) +
+ ((cu2 & 0x3F) << 12) +
+ ((cu3 & 0x3F) << 6) +
+ (c8 & 0x3F);
+ }
+ }
+ ps->__count &= 0x7fffffff;
+ ps->__value.__wch = 0;
+ }
+
+ return wcrtomb (s, wc, ps);
+}
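The validation logic in c8rtomb above can be easier to follow outside of the mbstate_t bit tricks. What follows is a minimal standalone sketch of the same code-unit checks, not the glibc implementation: struct u8_decoder, u8_feed, U8_NEED_MORE and U8_INVALID are hypothetical names invented for illustration, and the sketch returns the decoded code point directly instead of handing it to wcrtomb. It also resets its accumulator on an invalid code unit, which is a choice made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator standing in for the mbstate_t fields that the
   patch reuses.  These names are illustrative only and are not glibc's.  */
struct u8_decoder
{
  unsigned char cu[3];  /* Code units seen so far in the current sequence.  */
  unsigned char seen;   /* Count of stored code units; 0 is the initial state.  */
};

#define U8_NEED_MORE (-1L)  /* Trailing code units are still expected.  */
#define U8_INVALID   (-2L)  /* The code unit sequence is ill-formed.  */

/* Feed one UTF-8 code unit.  Returns the code point when a sequence
   completes, U8_NEED_MORE in the middle of a sequence, or U8_INVALID.  */
long
u8_feed (struct u8_decoder *d, unsigned char c8)
{
  assert (d != NULL);

  if (d->seen == 0)
    {
      if (c8 < 0x80)
        return c8;                 /* Single (ASCII) code unit.  */
      if (c8 <= 0xC1 || c8 >= 0xF5)
        return U8_INVALID;         /* Stray trail unit or invalid lead unit.  */
      d->cu[0] = c8;
      d->seen = 1;
      return U8_NEED_MORE;
    }

  unsigned char cu1 = d->cu[0];
  int ok = c8 >= 0x80 && c8 <= 0xBF;
  /* The second code unit has a narrower range after some lead units, to
     reject overlong forms, surrogates, and values above U+10FFFF.  */
  if (ok && d->seen == 1)
    ok = !((cu1 == 0xE0 && c8 < 0xA0) || (cu1 == 0xED && c8 > 0x9F)
           || (cu1 == 0xF0 && c8 < 0x90) || (cu1 == 0xF4 && c8 > 0x8F));
  if (!ok)
    {
      d->seen = 0;                 /* Sketch-only choice: reset on error.  */
      return U8_INVALID;
    }

  /* Total sequence length implied by the lead code unit.  */
  unsigned int need = cu1 >= 0xF0 ? 4 : cu1 >= 0xE0 ? 3 : 2;
  if (d->seen + 1u < need)
    {
      d->cu[d->seen++] = c8;
      return U8_NEED_MORE;
    }

  long cp;
  if (need == 2)
    cp = ((long) (cu1 & 0x1F) << 6) | (c8 & 0x3F);
  else if (need == 3)
    cp = ((long) (cu1 & 0x0F) << 12) | ((long) (d->cu[1] & 0x3F) << 6)
         | (c8 & 0x3F);
  else
    cp = ((long) (cu1 & 0x07) << 18) | ((long) (d->cu[1] & 0x3F) << 12)
         | ((long) (d->cu[2] & 0x3F) << 6) | (c8 & 0x3F);
  d->seen = 0;
  return cp;
}
```

Feeding the code units 0xE2, 0x82, 0xAC one at a time yields U8_NEED_MORE twice and then the code point 0x20AC, which parallels c8rtomb returning 0 until the final code unit of a sequence arrives.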
new file mode 100644
@@ -0,0 +1,126 @@
+/* Multibyte to UTF-8 conversion.
+ Copyright (C) 2022 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <assert.h>
+#include <dlfcn.h>
+#include <errno.h>
+#include <gconv.h>
+#include <uchar.h>
+#include <wcsmbsload.h>
+
+#include <sysdep.h>
+
+#ifndef EILSEQ
+# define EILSEQ EINVAL
+#endif
+
+
+/* This is the private state used if PS is NULL. */
+static mbstate_t state;
+
+size_t
+mbrtoc8 (char8_t *pc8, const char *s, size_t n, mbstate_t *ps)
+{
+ /* This implementation depends on the converter invoked by mbrtowc() not
+ needing to retain state in either the top most bit of ps->__count or
+ in ps->__value between invocations. This implementation uses the
+ top most bit of ps->__count to indicate that trailing code units are
+ yet to be written and uses ps->__value to store those code units. */
+
+ if (ps == NULL)
+ ps = &state;
+
+ /* If state indicates that trailing code units are yet to be written, write
+ those first regardless of whether 's' is a null pointer. */
+ if (ps->__count & 0x80000000)
+ {
+ /* ps->__value.__wchb[3] stores the index of the next code unit to
+ write. Code units are stored in reverse order. */
+ size_t i = ps->__value.__wchb[3];
+ if (pc8 != NULL)
+ {
+ *pc8 = ps->__value.__wchb[i];
+ }
+ if (i == 0)
+ {
+ ps->__count &= 0x7fffffff;
+ ps->__value.__wch = 0;
+ }
+ else
+ --ps->__value.__wchb[3];
+ return -3;
+ }
+
+ if (s == NULL)
+ {
+ /* If 's' is a null pointer, behave as if a null pointer was passed for
+ 'pc8', an empty string was passed for 's', and 1 passed for 'n'. */
+ pc8 = NULL;
+ s = "";
+ n = 1;
+ }
+
+ wchar_t wc;
+ size_t result;
+
+ result = mbrtowc (&wc, s, n, ps);
+ if (result <= n)
+ {
+ if (wc <= 0x7F)
+ {
+ if (pc8 != NULL)
+ *pc8 = wc;
+ }
+ else if (wc <= 0x7FF)
+ {
+ if (pc8 != NULL)
+ *pc8 = 0xC0 + ((wc >> 6) & 0x1F);
+ ps->__value.__wchb[0] = 0x80 + (wc & 0x3F);
+ ps->__value.__wchb[3] = 0;
+ ps->__count |= 0x80000000;
+ }
+ else if (wc <= 0xFFFF)
+ {
+ if (pc8 != NULL)
+ *pc8 = 0xE0 + ((wc >> 12) & 0x0F);
+ ps->__value.__wchb[1] = 0x80 + ((wc >> 6) & 0x3F);
+ ps->__value.__wchb[0] = 0x80 + (wc & 0x3F);
+ ps->__value.__wchb[3] = 1;
+ ps->__count |= 0x80000000;
+ }
+ else if (wc <= 0x10FFFF)
+ {
+ if (pc8 != NULL)
+ *pc8 = 0xF0 + ((wc >> 18) & 0x07);
+ ps->__value.__wchb[2] = 0x80 + ((wc >> 12) & 0x3F);
+ ps->__value.__wchb[1] = 0x80 + ((wc >> 6) & 0x3F);
+ ps->__value.__wchb[0] = 0x80 + (wc & 0x3F);
+ ps->__value.__wchb[3] = 2;
+ ps->__count |= 0x80000000;
+ }
+ }
+ if (result == 0 && wc != 0)
+ {
+ /* mbrtowc() never returns -3. When a MB sequence converts to multiple
+ WCs, no input is consumed when writing the subsequent WCs resulting
+ in a result of 0 even if a null character wasn't written. */
+ result = -3;
+ }
+
+ return result;
+}
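The mbrtoc8 code above writes the lead code unit immediately and queues the trailing code units in reverse order, so that later calls can hand them out one at a time (the -3 return path). The following is a minimal standalone sketch of that queueing scheme under simplifying assumptions: struct u8_emitter, u8_begin and u8_next are hypothetical names, the input is a code point instead of a multibyte sequence decoded by mbrtowc, and the pending units live in an ordinary struct instead of in mbstate_t. As in the patch, surrogate code points are not rejected here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical emitter state.  The patch stores the equivalent data in
   ps->__value.__wchb and a flag bit in ps->__count.  */
struct u8_emitter
{
  unsigned char pending[3];  /* Trailing code units, in reverse order.  */
  int next;                  /* Index of the next unit to emit, -1 if none.  */
};

/* Start encoding CP: write the lead code unit to *OUT and queue the
   trailing code units.  Returns 0 on success, or -1 if CP is above
   U+10FFFF (nothing is written in that case).  */
int
u8_begin (struct u8_emitter *e, unsigned long cp, unsigned char *out)
{
  assert (e != NULL && out != NULL);

  e->next = -1;
  if (cp <= 0x7F)
    *out = (unsigned char) cp;
  else if (cp <= 0x7FF)
    {
      *out = (unsigned char) (0xC0 + ((cp >> 6) & 0x1F));
      e->pending[0] = (unsigned char) (0x80 + (cp & 0x3F));
      e->next = 0;
    }
  else if (cp <= 0xFFFF)
    {
      *out = (unsigned char) (0xE0 + ((cp >> 12) & 0x0F));
      e->pending[1] = (unsigned char) (0x80 + ((cp >> 6) & 0x3F));
      e->pending[0] = (unsigned char) (0x80 + (cp & 0x3F));
      e->next = 1;
    }
  else if (cp <= 0x10FFFF)
    {
      *out = (unsigned char) (0xF0 + ((cp >> 18) & 0x07));
      e->pending[2] = (unsigned char) (0x80 + ((cp >> 12) & 0x3F));
      e->pending[1] = (unsigned char) (0x80 + ((cp >> 6) & 0x3F));
      e->pending[0] = (unsigned char) (0x80 + (cp & 0x3F));
      e->next = 2;
    }
  else
    return -1;
  return 0;
}

/* Emit the next queued trailing code unit, if any.  Returns 1 when a unit
   was written to *OUT and 0 when the queue is empty.  */
int
u8_next (struct u8_emitter *e, unsigned char *out)
{
  assert (e != NULL && out != NULL);

  if (e->next < 0)
    return 0;
  *out = e->pending[e->next--];
  return 1;
}
```

Storing the units in reverse order means the index counts down to zero, so an index of zero marks the last pending unit, which is the same test the patch uses to decide when to clear its state.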
@@ -31,6 +31,13 @@
#include <bits/types.h>
#include <bits/types/mbstate_t.h>
+/* Declare the char8_t typedef in _GNU_SOURCE mode, but only if the C++
+ __cpp_char8_t feature test macro is not defined. */
+#if defined __USE_GNU && !defined __cpp_char8_t
+/* Define the 8-bit character type. */
+typedef unsigned char char8_t;
+#endif
+
#ifndef __USE_ISOCXX11
/* Define the 16-bit and 32-bit character types. */
typedef __uint_least16_t char16_t;
@@ -40,6 +47,20 @@ typedef __uint_least32_t char32_t;
__BEGIN_DECLS
+/* Declare mbrtoc8() and c8rtomb() in _GNU_SOURCE mode or if the C++
+ __cpp_char8_t feature test macro is defined. */
+#if defined __USE_GNU || defined __cpp_char8_t
+/* Write char8_t representation of multibyte character pointed
+ to by S to PC8. */
+extern size_t mbrtoc8 (char8_t *__restrict __pc8,
+ const char *__restrict __s, size_t __n,
+ mbstate_t *__restrict __p) __THROW;
+
+/* Write multibyte representation of char8_t C8 to S. */
+extern size_t c8rtomb (char *__restrict __s, char8_t __c8,
+ mbstate_t *__restrict __ps) __THROW;
+#endif
+
/* Write char16_t representation of multibyte character pointed
to by S to PC16. */
extern size_t mbrtoc16 (char16_t *__restrict __pc16,