[v9] POSIX locale covers every byte [BZ# 29511]
Checks
Context | Check | Description
dj/TryBot-apply_patch | success | Patch applied to master at the time it was sent
dj/TryBot-32bit | success | Build for i686
Commit Message
This largely duplicates the ASCII code with the error path changed
There are two user-facing changes:
* nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
* mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
(a) is 1-byte, stateless, and contains 256 characters
(b) they collate in byte order
(c) the first 128 characters are equivalent to ASCII (like previous)
cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
changes to the standard;
in short, this means that mbrtowc() must never fail and must return
b if b <= 0x7F else ab+c for all bytes b
where c is some constant >=0x80
and a is a positive integer constant
By strategically picking c=<UDF00> we land at the tail-end of the
Unicode Low Surrogate Area at DC00-DFFF, described as
> Isolated surrogate code points have no interpretation;
> consequently, no character code charts or names lists
> are provided for this range.
and match musl
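
As a rough illustration of the mapping described above (not the glibc code, which is C), the byte-to-wide-character rule and its inverse can be sketched in a few lines of Python. The function names posix_btowc and posix_wctob are hypothetical, and the ValueError stands in for an EILSEQ-style failure.

```python
# Hypothetical sketch of the mapping from the commit message:
# b if b <= 0x7F else 0xDF00 + b, plus the reverse direction.

def posix_btowc(b: int) -> int:
    """Map one byte to a wide-character value; defined for every byte."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte")
    return b if b <= 0x7F else 0xDF00 + b


def posix_wctob(wc: int) -> int:
    """Map a wide-character value back to its byte.

    Only [0, 0x7F] and [0xDF80, 0xDFFF] are representable; anything
    else raises ValueError (standing in for an EILSEQ-style error).
    """
    if 0 <= wc <= 0x7F:
        return wc
    if 0xDF80 <= wc <= 0xDFFF:
        return wc - 0xDF00
    raise ValueError("not representable in the POSIX charset")


# Wide-character values for all 256 bytes, in byte order.
codes = [posix_btowc(b) for b in range(256)]
```

Because the mapping is strictly increasing, comparing the wide-character values gives the same order as comparing the bytes, which is what property (b) asks for.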
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
---
Clean rebase, reposting after a month.
NEWS | 7 ++
iconv/Makefile | 2 +-
iconv/gconv_builtin.h | 8 ++
iconv/gconv_int.h | 8 ++
iconv/gconv_posix.c | 96 ++++++++++++++++++++
iconv/tst-iconv_prog.sh | 43 +++++++++
iconvdata/tst-tables.sh | 1 +
inet/tst-idna_name_classify.c | 6 +-
locale/tst-C-locale.c | 44 +++++++++
localedata/charmaps/POSIX | 136 ++++++++++++++++++++++++++++
localedata/locales/POSIX | 143 +++++++++++++++++++++++++++++-
stdio-common/Makefile | 1 +
stdio-common/tst-printf-bz25691.c | 2 +
wcsmbs/wcsmbsload.c | 14 +--
14 files changed, 500 insertions(+), 11 deletions(-)
create mode 100644 iconv/gconv_posix.c
create mode 100644 localedata/charmaps/POSIX
Comments
* наб:
> This largely duplicates the ASCII code with the error path changed
>
> There are two user-facing changes:
> * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
>
> Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
> (a) is 1-byte, stateless, and contains 256 characters
> (b) they collate in byte order
> (c) the first 128 characters are equivalent to ASCII (like previous)
> cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> changes to the standard;
> in short, this means that mbrtowc() must never fail and must return
> b if b <= 0x7F else ab+c for all bytes b
> where c is some constant >=0x80
> and a is a positive integer constant
>
> By strategically picking c=<UDF00> we land at the tail-end of the
> Unicode Low Surrogate Area at DC00-DFFF, described as
> > Isolated surrogate code points have no interpretation;
> > consequently, no character code charts or names lists
> > are provided for this range.
> and match musl
I've thought about this some more, and I don't think this is the
direction we should be going in.
* Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
the Python style). It should have the property that it can encode
every byte string as a string of wchar_t characters, and convert the
result back. It's not entirely trivial because we need to handle
partial UTF-8 sequences at the end of the buffer carefully. There
might be some warts regarding EILSEQ handling lurking there. Like the
Python approach, it is somewhat imperfect because it's not preserving
identity under string concatenation, i.e. f(x) || f(y) is not always
equal to f(x || y), but that's just unavoidable.
* Switch the charset for the default C locale to UTF-8SE. This matches
the POSIX requirement that every byte can be encoded.
* Work with POSIX to drop the requirement that the C locale needs to be
a single-byte locale.
* (Optional, somewhat unrelated.) Add a generic mechanism so that UTF-8
locales can be used as UTF-8SE without recompilation.
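
The concatenation caveat in the first bullet can be seen with Python's existing surrogateescape error handler, used here only as a stand-in for the proposed UTF-8SE charset (which does not exist in glibc). A minimal sketch, with f as a hypothetical name for the decoding function:

```python
# Python's surrogateescape handler as a stand-in for the proposed UTF-8SE.

def f(data: bytes) -> str:
    """Decode UTF-8, mapping undecodable bytes to lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


x, y = b"\xc4", b"\xbf"   # together they form U+013F; apart, neither is valid

whole = f(x + y)          # decoded as one two-byte character
pieces = f(x) + f(y)      # each half decoded on its own, then concatenated

# Both results still convert back to the same byte string.
back_whole = whole.encode("utf-8", errors="surrogateescape")
back_pieces = pieces.encode("utf-8", errors="surrogateescape")
```

Here whole is a single character while pieces is two lone surrogates, so f(x) || f(y) differs from f(x || y) even though both round-trip to the original bytes.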
Thanks,
Florian
Hi! Long time, apologies.
On Mon, Feb 13, 2023 at 03:52:06PM +0100, Florian Weimer wrote:
> I've thought about this some more, and I don't think this is the
> direction we should be going in.
>
> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
> the Python style). It should have the property that it can encode
> every byte string as a string of wchar_t characters, and convert the
> result back. It's not entirely trivial because we need to handle
> partial UTF-8 sequences at the end of the buffer carefully. There
> might be some warts regarding EILSEQ handling lurking there. Like the
> Python approach, it is somewhat imperfect because it's not preserving
> identity under string concatenation, i.e. f(x) || f(y) is not always
> equal to f(x || y), but that's just unavoidable.
>
> * Switch the charset for the default C locale to UTF-8SE. This matches
> the POSIX requirement that every byte can be encoded.
The main point of LC_CTYPE=POSIX as specified is that it allows you to
process paths (which are sequences of bytes, not characters) in a sane
way ‒ part of that is that collation needs to be correct, so maybe, as a
smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".
>>> b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape')
'Ŀ'
>>> b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape')
'\udcc4\udcc0'
>>>
>>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape'))]
[319]
>>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape'))]
[56516, 56512]
which, I mean, sure, maybe that's sensible (I wouldn't say so), but
>>> b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape')
'\uffff'
>>> b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape')
'\udcef\udcbf\udcc0'
>>>
>>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape'))]
[65535]
>>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape'))]
[56559, 56511, 56512]
Which means you can't process arbitrary data (pathnames) in a way that
makes sense. In my opinion this would be /worse/ than the current
behaviour, behaving erratically in the presence of Some Data instead of
simply not supporting it.
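
To make the ordering complaint concrete, here is a small sketch comparing the two decodings on the second pair of byte strings from the session above. The helper names decode_se and decode_bytewise are hypothetical; decode_bytewise applies the mapping from the patch (b if b <= 0x7F else 0xDF00 + b).

```python
# Compare code-point order after decoding with byte order before decoding.

def decode_se(data: bytes) -> list:
    """Code points after UTF-8 decoding with surrogateescape."""
    return [ord(c) for c in data.decode("utf-8", errors="surrogateescape")]


def decode_bytewise(data: bytes) -> list:
    """Code points under the patch's mapping: one value per byte."""
    return [b if b <= 0x7F else 0xDF00 + b for b in data]


lo, hi = b"\xef\xbf\xbf", b"\xef\xbf\xc0"   # lo < hi as byte strings

bytes_in_order = lo < hi
se_in_order = decode_se(lo) < decode_se(hi)
bytewise_in_order = decode_bytewise(lo) < decode_bytewise(hi)
```

With surrogateescape the first string decodes to [65535] and the second to [56559, 56511, 56512], so the decoded order is the reverse of the byte order; the bytewise mapping keeps the two orders in agreement.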
> * Work with POSIX to drop the requirement that the C locale needs to be
> a single-byte locale.
That's not going to happen because it's the /only/ way to process paths.
Indeed, XBD 8.2 puts it nicely:
Users may use the following environment variables to announce specific
localization requirements to applications.
As a user, I want to be able to announce "each byte is a character,
in natural ordering". This is what LC_CTYPE=C lets me do. I hope
you'll agree this is a good feature to support.
POSIX, also, explicitly says that (XBD 8.2):
  1. If the LC_ALL environment variable is defined and is not null, the value of LC_ALL shall
     be used.
  2. If the LC_* environment variable (LC_COLLATE, LC_CTYPE, LC_MESSAGES,
     LC_MONETARY, LC_NUMERIC, LC_TIME) is defined and is not null, the value of the
     environment variable shall be used to initialize the category that corresponds to the
     environment variable.
  3. If the LANG environment variable is defined and is not null, the value of the LANG
     environment variable shall be used.
  4. If the LANG environment variable is not set or is set to the empty string, the
     implementation-defined default locale shall be used.
and XBD 7.2:
  All implementations shall define a locale as the default locale, to be invoked when no
  environment variables are set, or set to the empty string. This default locale can be the POSIX
  locale or any other implementation-defined locale. Some implementations may provide facilities
  for local installation administrators to set the default locale, customizing it for each location.
  POSIX.1-202x does not require such a facility.
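
The precedence quoted from XBD 8.2 can be sketched as a small lookup. This is only an illustration: effective_locale is a hypothetical helper, the environment is passed in as a plain dict, and the default argument stands in for the implementation-defined default locale, whose value the standard leaves open.

```python
# Hypothetical sketch of the XBD 8.2 precedence for one locale category.

def effective_locale(category: str, env: dict, default: str = "POSIX") -> str:
    """Return the locale name chosen for `category` (e.g. "LC_CTYPE").

    Order: LC_ALL, then the category's own variable, then LANG, then
    the implementation-defined default (assumed "POSIX" here).
    Variables that are unset or empty are skipped.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:                      # defined and not the empty string
            return value
    return default


# LC_ALL wins over the category variable and over LANG.
a = effective_locale("LC_CTYPE", {"LC_ALL": "C", "LC_CTYPE": "en_US.UTF-8"})
# An empty LC_ALL counts as unset, so LANG is used.
b = effective_locale("LC_CTYPE", {"LC_ALL": "", "LANG": "pl_PL.UTF-8"})
# Nothing set at all: fall back to the default locale.
c = effective_locale("LC_CTYPE", {})
```

The last case is the one the proposal below is about: what the fallback should be when the user has configured nothing.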
To that end, how's about:
* invent UTF-8SE encoding as you say
* invent POSIX encoding like in this patch
(but move the area to match UTF-8SE probably, it's a good precedent)
* hook up POSIX to POSIX as in here
* change the implementation-defined default locale to POSIX-but-UTF-8SE
* (maybe) change the default locale on entry to main() to POSIX-but-UTF-8SE
POSIX requires that LC_ALL=POSIX is the default on entry to main().
That said, I wouldn't mind violating /that/, since anything we do with it
is backwards-compatible. Maybe it makes sense to do that for programs that
don't call setlocale() at all, and they'll behave better when used
internationally. Or not.
Logically, this translates to:
* if the user has their native locale selected, use that
* if the user has explicitly selected the bytewise locale, use that
* if the user hasn't configured their locales at all,
assume they want UTF-8 but degrade sensibly
* (maybe) if the program hasn't been written with locales in mind,
assume the user will be using it with UTF-8 input but
degrade sensibly
I think this leaves the wolf full and the sheep alive ‒ the default
behaviour is UTF-8(ish), and can be overridden to full UTF-8 or bytes,
per the user's requirements.
Existing users will thus gain the ability to:
* process data that's UTF-8 but skip over/retain
illegal/otherwise-encoded bytes losslessly
(this makes the sample above a killer feature instead of non-sensible,
so long as it's an encoding in its own right)
* correctly process arbitrarily-encoded data as bytes
Thoughts?
наб
* наб:
>> I've thought about this some more, and I don't think this is the
>> direction we should be going in.
>>
>> * Add a UTF-8SE charset to glibc: it's UTF-8 with surrogate encoding (in
>> the Python style). It should have the property that it can encode
>> every byte string as a string of wchar_t characters, and convert the
>> result back. It's not entirely trivial because we need to handle
>> partial UTF-8 sequences at the end of the buffer carefully. There
>> might be some warts regarding EILSEQ handling lurking there. Like the
>> Python approach, it is somewhat imperfect because it's not preserving
>> identity under string concatenation, i.e. f(x) || f(y) is not always
>> equal to f(x || y), but that's just unavoidable.
>>
>> * Switch the charset for the default C locale to UTF-8SE. This matches
>> the POSIX requirement that every byte can be encoded.
> The main point of LC_CTYPE=POSIX as specified is that it allows you to
> process paths (which are sequences of bytes, not characters) in a sane
> way ‒ part of that is that collation needs to be correct, so maybe, as a
> smoke test, "[a, b, c] < [a, b, c+1] for all a,b,c".
>
> >>> b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape')
> 'Ŀ'
> >>> b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape')
> '\udcc4\udcc0'
> >>>
> >>> [*map(ord, b'\xc4\xbf'.decode('UTF-8', errors='surrogateescape'))]
> [319]
> >>> [*map(ord, b'\xc4\xc0'.decode('UTF-8', errors='surrogateescape'))]
> [56516, 56512]
> which, I mean, sure, maybe that's sensible (I wouldn't say so), but
> >>> b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape')
> '\uffff'
> >>> b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape')
> '\udcef\udcbf\udcc0'
> >>>
> >>> [*map(ord, b'\xef\xbf\xbf'.decode('UTF-8', errors='surrogateescape'))]
> [65535]
> >>> [*map(ord, b'\xef\xbf\xc0'.decode('UTF-8', errors='surrogateescape'))]
> [56559, 56511, 56512]
>
> Which means you can't process arbitrary data (pathnames) in a way that
> makes sense. In my opinion this would be /worse/ than the current
> behaviour, behaving erratically in the presence of Some Data instead of
> simply not supporting it.
Sorry for letting this linger for so long from my side, too.
Regarding the above, I'm not sure I find this convincing. That's just
business as usual with collation?
However, after thinking about this some more, my idea (just use a
liberal UTF-8 variant) does not work given the APIs we have, in the
sense that code that works in C.UTF-8 today will stop working under this
hypothetical new locale.
For example, for mbrlen (S, N, PS), we have this requirement:
If the first N bytes possibly form a valid multibyte character but
the character is incomplete, the return value is ‘(size_t) -2’.
Otherwise the multibyte character sequence is invalid and the
return value is ‘(size_t) -1’.
If every byte sequence is valid, then mbrlen can never return
(size_t) -2. It would have to produce surrogate encoding instead.
But this means that detection of valid but incomplete UTF-8 sequences
(say at buffer boundaries) is no longer possible. And that can't be
good because we would produce unexpected wide characters around
buffer boundaries.
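
A toy model may help show the three outcomes in question. The sketch below is not glibc code: mbrlen_model is a hypothetical Python function that imitates mbrlen for strict UTF-8 from the initial state, using Python's incremental decoder to tell "incomplete" apart from "invalid". It simplifies one detail: the real mbrlen returns 0 for the null character.

```python
import codecs


def mbrlen_model(buf: bytes) -> int:
    """Toy model of mbrlen() for strict UTF-8, starting in the initial state.

    Returns the byte length of the first character, -2 if buf is a valid
    but incomplete prefix, and -1 if buf cannot begin a valid character.
    """
    for n in range(1, min(len(buf), 4) + 1):
        decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
        try:
            # final=False: a truncated sequence is buffered, not an error.
            out = decoder.decode(buf[:n], final=False)
        except UnicodeDecodeError:
            return -1
        if out:
            return n
    return -2


# A decoder that accepts every byte sequence has no way to say "incomplete":
# the lone lead byte turns into a surrogate immediately.
eager = b"\xc4".decode("utf-8", errors="surrogateescape")
```

In this model b"\xc4" is reported as incomplete (-2) and b"\xc4\xbf" as a two-byte character, whereas the always-valid decoder has already emitted a lone surrogate for b"\xc4" before the next buffer arrives.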
I think this leaves us with a straight byte encoding, so either
ISO-8859-1 for simplicity (and with the cultural bias it brings), or the
musl-style shifted upper half encoding that your patch implements.
In the end, enabling UTF-8 (or some variant) by default is probably not
that important because it directly impacts mostly the wide character
interfaces. Those are not widely used for a variety of reasons (one
probably being that our implementation is so incredibly slow).
Thanks,
Florian
@@ -36,6 +36,13 @@ Major new features:
getent with --no-addrconfig may contain addresses of families not
configured on the current host i.e. as-if you had not passed
AI_ADDRCONFIG to getaddrinfo calls.
+* The default/"POSIX"/"C" locale's character set is now "POSIX",
+ instead of "ANSI_X3.4-1968" ‒ this is a new fully-reversible
+ 8-bit transparent encoding for compatibility with POSIX Issue 7 TC 2,
+ identity-mapping bytes in the ASCII [0, 0x7F] range,
+ and mapping [0x80, 0xFF] bytes to [<U+DF80>, <U+DFFF>].
+ The standard now requires the "POSIX"/"C" locale to have an encoding
+ with these features ‒ 8-bit transparency and a continuous collation sequence.
Deprecated and removed features, and other changes affecting compatibility:
@@ -25,7 +25,7 @@ include ../Makeconfig
headers = iconv.h gconv.h
routines = iconv_open iconv iconv_close \
gconv_open gconv gconv_close gconv_db gconv_conf \
- gconv_builtin gconv_simple gconv_trans gconv_cache
+ gconv_builtin gconv_simple gconv_posix gconv_trans gconv_cache
routines += gconv_dl gconv_charset
vpath %.c ../locale/programs ../intl
@@ -89,6 +89,14 @@ BUILTIN_TRANSFORMATION ("INTERNAL", "ANSI_X3.4-1968//", 1, "=INTERNAL->ascii",
__gconv_transform_internal_ascii, NULL, 4, 4, 1, 1)
+BUILTIN_TRANSFORMATION ("POSIX//", "INTERNAL", 1, "=posix->INTERNAL",
+ __gconv_transform_posix_internal, __gconv_btwoc_posix,
+ 1, 1, 4, 4)
+
+BUILTIN_TRANSFORMATION ("INTERNAL", "POSIX//", 1, "=INTERNAL->posix",
+ __gconv_transform_internal_posix, NULL, 4, 4, 1, 1)
+
+
#if BYTE_ORDER == BIG_ENDIAN
BUILTIN_ALIAS ("UNICODEBIG//", "ISO-10646/UCS2/")
BUILTIN_ALIAS ("UCS-2BE//", "ISO-10646/UCS2/")
@@ -281,6 +281,8 @@ extern int __gconv_compare_alias (const char *name1, const char *name2)
__BUILTIN_TRANSFORM (__gconv_transform_ascii_internal);
__BUILTIN_TRANSFORM (__gconv_transform_internal_ascii);
+__BUILTIN_TRANSFORM (__gconv_transform_posix_internal);
+__BUILTIN_TRANSFORM (__gconv_transform_internal_posix);
__BUILTIN_TRANSFORM (__gconv_transform_utf8_internal);
__BUILTIN_TRANSFORM (__gconv_transform_internal_utf8);
__BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal);
@@ -299,6 +301,12 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_internal);
only ASCII characters. */
extern wint_t __gconv_btwoc_ascii (struct __gconv_step *step, unsigned char c);
+/* Specialized conversion function for a single byte to INTERNAL,
+ identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end
+ of the Low Surrogate Area at [U+DF80, U+DFFF]. */
+extern wint_t __gconv_btwoc_posix (struct __gconv_step *step, unsigned char c)
+ attribute_hidden;
+
#endif
__END_DECLS
new file mode 100644
@@ -0,0 +1,96 @@
+/* "POSIX" locale transformation functions.
+ Copyright (C) 2022 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <https://www.gnu.org/licenses/>. */
+
+
+#include <gconv_int.h>
+
+
+/* Specialized conversion function for a single byte to INTERNAL,
+ identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end
+ of the Low Surrogate Area at [U+DF80, U+DFFF]. */
+wint_t
+__gconv_btwoc_posix (struct __gconv_step *step, unsigned char c)
+{
+ if (c < 0x80)
+ return c;
+ else
+ return 0xdf00 + c;
+}
+
+
+/* Convert from {[0, 0x7F] => ISO 646-IRV; [0x80, 0xFF] => [U+DF80, U+DFFF]}
+ to the internal (UCS4-like) format. */
+#define DEFINE_INIT 0
+#define DEFINE_FINI 0
+#define MIN_NEEDED_FROM 1
+#define MIN_NEEDED_TO 4
+#define FROM_DIRECTION 1
+#define FROM_LOOP posix_internal_loop
+#define TO_LOOP posix_internal_loop /* This is not used. */
+#define FUNCTION_NAME __gconv_transform_posix_internal
+#define ONE_DIRECTION 1
+
+#define MIN_NEEDED_INPUT MIN_NEEDED_FROM
+#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO
+#define LOOPFCT FROM_LOOP
+#define BODY \
+ { \
+ if (__glibc_unlikely (*inptr > '\x7f')) \
+ *((uint32_t *) outptr) = 0xdf00 + *inptr++; \
+ else \
+ *((uint32_t *) outptr) = *inptr++; \
+ outptr += sizeof (uint32_t); \
+ }
+#include <iconv/loop.c>
+#include <iconv/skeleton.c>
+
+
+/* Convert from the internal (UCS4-like) format to
+ {ISO 646-IRV => [0, 0x7F]; [U+DF80, U+DFFF] => [0x80, 0xFF]}. */
+#define DEFINE_INIT 0
+#define DEFINE_FINI 0
+#define MIN_NEEDED_FROM 4
+#define MIN_NEEDED_TO 1
+#define FROM_DIRECTION 1
+#define FROM_LOOP internal_posix_loop
+#define TO_LOOP internal_posix_loop /* This is not used. */
+#define FUNCTION_NAME __gconv_transform_internal_posix
+#define ONE_DIRECTION 1
+
+#define MIN_NEEDED_INPUT MIN_NEEDED_FROM
+#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO
+#define LOOPFCT FROM_LOOP
+#define BODY \
+ { \
+ uint32_t val = *((const uint32_t *) inptr); \
+ if (__glibc_unlikely ((val > 0x7f && val < 0xdf80) || val > 0xdfff)) \
+ { \
+ UNICODE_TAG_HANDLER (val, 4); \
+ STANDARD_TO_LOOP_ERR_HANDLER (4); \
+ } \
+ else \
+ { \
+ if (val > 0x7f) \
+ val -= 0xdf00; \
+ *outptr++ = val; \
+ inptr += sizeof (uint32_t); \
+ } \
+ }
+#define LOOP_NEED_FLAGS
+#include <iconv/loop.c>
+#include <iconv/skeleton.c>
@@ -285,3 +285,46 @@ for errorcommand in "${errorarray[@]}"; do
execute_test
check_errtest_result
done
+
+allbytes ()
+{
+ for (( i = 0; i <= 255; i++ )); do
+ printf '\'"$(printf "%o" "$i")"
+ done
+}
+
+allucs4be ()
+{
+ for (( i = 0; i <= 127; i++ )); do
+ printf '\0\0\0\'"$(printf "%o" "$i")"
+ done
+ for (( i = 128; i <= 255; i++ )); do
+ printf '\0\0\xdf\'"$(printf "%o" "$i")"
+ done
+}
+
+check_posix_result ()
+{
+ if [ $? -eq 0 ]; then
+ result=PASS
+ else
+ result=FAIL
+ fi
+
+ echo "$result: from \"$1\", to: \"$2\""
+
+ if [ "$result" != "PASS" ]; then
+ exit 1
+ fi
+}
+
+check_posix_encoding ()
+{
+ eval PROG=\"$ICONV\"
+ allbytes | $PROG -f POSIX -t UCS-4BE | cmp -s - <(allucs4be)
+ check_posix_result POSIX UCS-4BE
+ allucs4be | $PROG -f UCS-4BE -t POSIX | cmp -s - <(allbytes)
+ check_posix_result UCS-4BE POSIX
+}
+
+check_posix_encoding
@@ -31,6 +31,7 @@ cat <<EOF |
# Keep this list in the same order as gconv-modules.
#
# charset name table name comment
+ POSIX
ASCII ANSI_X3.4-1968
ISO646-GB BS_4730
ISO646-CA CSA_Z243.4-1985-1
@@ -37,11 +37,11 @@ do_test (void)
puts ("info: C locale tests");
locale_insensitive_tests ();
TEST_COMPARE (__idna_name_classify ("abc\200def"),
- idna_name_encoding_error);
+ idna_name_nonascii);
TEST_COMPARE (__idna_name_classify ("abc\200\\def"),
- idna_name_encoding_error);
+ idna_name_nonascii_backslash);
TEST_COMPARE (__idna_name_classify ("abc\377def"),
- idna_name_encoding_error);
+ idna_name_nonascii);
puts ("info: en_US.ISO-8859-1 locale tests");
if (setlocale (LC_CTYPE, "en_US.ISO-8859-1") == 0)
@@ -20,6 +20,7 @@
#include <langinfo.h>
#include <limits.h>
#include <locale.h>
+#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
@@ -229,6 +230,49 @@ run_test (const char *locname)
STRTEST (YESSTR, "");
STRTEST (NOSTR, "");
+ for (int i = 0; i <= 0xff; ++i)
+ {
+ unsigned char bs[] = {i, 0};
+ mbstate_t ctx = {};
+ wchar_t wc = -1, exp = i <= 0x7f ? i : (0xdf00 + i);
+ size_t sz = mbrtowc (&wc, (char *) bs, 1, &ctx);
+ if (sz != !!i)
+ {
+ printf ("mbrtowc(%02hhx) width in locale %s wrong "
+ "(is %zd, should be %d)\n", *bs, locname, sz, !!i);
+ result = 1;
+ }
+ if (wc != exp)
+ {
+ printf ("mbrtowc(%02hhx) value in locale %s wrong "
+ "(is %x, should be %x)\n", *bs, locname, wc, exp);
+ result = 1;
+ }
+ }
+
+ for (int i = 0; i <= 0xffff; ++i)
+ {
+ bool expok = (i <= 0x7f) || (i >= 0xdf80 && i <= 0xdfff);
+ size_t expsz = expok ? 1 : (size_t) -1;
+ unsigned char expob = expok ? (i & 0xff) : (unsigned char) -1;
+
+ unsigned char ob = -1;
+ mbstate_t ctx = {};
+ size_t sz = wcrtomb ((char *) &ob, i, &ctx);
+ if (sz != expsz)
+ {
+ printf ("wcrtomb(%x) width in locale %s wrong "
+ "(is %zd, should be %zd)\n", i, locname, sz, expsz);
+ result = 1;
+ }
+ if (ob != expob)
+ {
+ printf ("wcrtomb(%x) value in locale %s wrong "
+ "(is %hhx, should be %hhx)\n", i, locname, ob, expob);
+ result = 1;
+ }
+ }
+
/* Test the new locale mechanisms. */
loc = newlocale (LC_ALL_MASK, locname, NULL);
if (loc == NULL)
new file mode 100644
@@ -0,0 +1,136 @@
+<code_set_name> POSIX
+<comment_char> %
+<escape_char> /
+% source: cf. localedata/locales/POSIX, LC_COLLATE
+
+CHARMAP
+<U0000> /x00 NULL (NUL)
+<U0001> /x01 START OF HEADING (SOH)
+<U0002> /x02 START OF TEXT (STX)
+<U0003> /x03 END OF TEXT (ETX)
+<U0004> /x04 END OF TRANSMISSION (EOT)
+<U0005> /x05 ENQUIRY (ENQ)
+<U0006> /x06 ACKNOWLEDGE (ACK)
+<U0007> /x07 BELL (BEL)
+<U0008> /x08 BACKSPACE (BS)
+<U0009> /x09 CHARACTER TABULATION (HT)
+<U000A> /x0a LINE FEED (LF)
+<U000B> /x0b LINE TABULATION (VT)
+<U000C> /x0c FORM FEED (FF)
+<U000D> /x0d CARRIAGE RETURN (CR)
+<U000E> /x0e SHIFT OUT (SO)
+<U000F> /x0f SHIFT IN (SI)
+<U0010> /x10 DATALINK ESCAPE (DLE)
+<U0011> /x11 DEVICE CONTROL ONE (DC1)
+<U0012> /x12 DEVICE CONTROL TWO (DC2)
+<U0013> /x13 DEVICE CONTROL THREE (DC3)
+<U0014> /x14 DEVICE CONTROL FOUR (DC4)
+<U0015> /x15 NEGATIVE ACKNOWLEDGE (NAK)
+<U0016> /x16 SYNCHRONOUS IDLE (SYN)
+<U0017> /x17 END OF TRANSMISSION BLOCK (ETB)
+<U0018> /x18 CANCEL (CAN)
+<U0019> /x19 END OF MEDIUM (EM)
+<U001A> /x1a SUBSTITUTE (SUB)
+<U001B> /x1b ESCAPE (ESC)
+<U001C> /x1c FILE SEPARATOR (IS4)
+<U001D> /x1d GROUP SEPARATOR (IS3)
+<U001E> /x1e RECORD SEPARATOR (IS2)
+<U001F> /x1f UNIT SEPARATOR (IS1)
+<U0020> /x20 SPACE
+<U0021> /x21 EXCLAMATION MARK
+<U0022> /x22 QUOTATION MARK
+<U0023> /x23 NUMBER SIGN
+<U0024> /x24 DOLLAR SIGN
+<U0025> /x25 PERCENT SIGN
+<U0026> /x26 AMPERSAND
+<U0027> /x27 APOSTROPHE
+<U0028> /x28 LEFT PARENTHESIS
+<U0029> /x29 RIGHT PARENTHESIS
+<U002A> /x2a ASTERISK
+<U002B> /x2b PLUS SIGN
+<U002C> /x2c COMMA
+<U002D> /x2d HYPHEN-MINUS
+<U002E> /x2e FULL STOP
+<U002F> /x2f SOLIDUS
+<U0030> /x30 DIGIT ZERO
+<U0031> /x31 DIGIT ONE
+<U0032> /x32 DIGIT TWO
+<U0033> /x33 DIGIT THREE
+<U0034> /x34 DIGIT FOUR
+<U0035> /x35 DIGIT FIVE
+<U0036> /x36 DIGIT SIX
+<U0037> /x37 DIGIT SEVEN
+<U0038> /x38 DIGIT EIGHT
+<U0039> /x39 DIGIT NINE
+<U003A> /x3a COLON
+<U003B> /x3b SEMICOLON
+<U003C> /x3c LESS-THAN SIGN
+<U003D> /x3d EQUALS SIGN
+<U003E> /x3e GREATER-THAN SIGN
+<U003F> /x3f QUESTION MARK
+<U0040> /x40 COMMERCIAL AT
+<U0041> /x41 LATIN CAPITAL LETTER A
+<U0042> /x42 LATIN CAPITAL LETTER B
+<U0043> /x43 LATIN CAPITAL LETTER C
+<U0044> /x44 LATIN CAPITAL LETTER D
+<U0045> /x45 LATIN CAPITAL LETTER E
+<U0046> /x46 LATIN CAPITAL LETTER F
+<U0047> /x47 LATIN CAPITAL LETTER G
+<U0048> /x48 LATIN CAPITAL LETTER H
+<U0049> /x49 LATIN CAPITAL LETTER I
+<U004A> /x4a LATIN CAPITAL LETTER J
+<U004B> /x4b LATIN CAPITAL LETTER K
+<U004C> /x4c LATIN CAPITAL LETTER L
+<U004D> /x4d LATIN CAPITAL LETTER M
+<U004E> /x4e LATIN CAPITAL LETTER N
+<U004F> /x4f LATIN CAPITAL LETTER O
+<U0050> /x50 LATIN CAPITAL LETTER P
+<U0051> /x51 LATIN CAPITAL LETTER Q
+<U0052> /x52 LATIN CAPITAL LETTER R
+<U0053> /x53 LATIN CAPITAL LETTER S
+<U0054> /x54 LATIN CAPITAL LETTER T
+<U0055> /x55 LATIN CAPITAL LETTER U
+<U0056> /x56 LATIN CAPITAL LETTER V
+<U0057> /x57 LATIN CAPITAL LETTER W
+<U0058> /x58 LATIN CAPITAL LETTER X
+<U0059> /x59 LATIN CAPITAL LETTER Y
+<U005A> /x5a LATIN CAPITAL LETTER Z
+<U005B> /x5b LEFT SQUARE BRACKET
+<U005C> /x5c REVERSE SOLIDUS
+<U005D> /x5d RIGHT SQUARE BRACKET
+<U005E> /x5e CIRCUMFLEX ACCENT
+<U005F> /x5f LOW LINE
+<U0060> /x60 GRAVE ACCENT
+<U0061> /x61 LATIN SMALL LETTER A
+<U0062> /x62 LATIN SMALL LETTER B
+<U0063> /x63 LATIN SMALL LETTER C
+<U0064> /x64 LATIN SMALL LETTER D
+<U0065> /x65 LATIN SMALL LETTER E
+<U0066> /x66 LATIN SMALL LETTER F
+<U0067> /x67 LATIN SMALL LETTER G
+<U0068> /x68 LATIN SMALL LETTER H
+<U0069> /x69 LATIN SMALL LETTER I
+<U006A> /x6a LATIN SMALL LETTER J
+<U006B> /x6b LATIN SMALL LETTER K
+<U006C> /x6c LATIN SMALL LETTER L
+<U006D> /x6d LATIN SMALL LETTER M
+<U006E> /x6e LATIN SMALL LETTER N
+<U006F> /x6f LATIN SMALL LETTER O
+<U0070> /x70 LATIN SMALL LETTER P
+<U0071> /x71 LATIN SMALL LETTER Q
+<U0072> /x72 LATIN SMALL LETTER R
+<U0073> /x73 LATIN SMALL LETTER S
+<U0074> /x74 LATIN SMALL LETTER T
+<U0075> /x75 LATIN SMALL LETTER U
+<U0076> /x76 LATIN SMALL LETTER V
+<U0077> /x77 LATIN SMALL LETTER W
+<U0078> /x78 LATIN SMALL LETTER X
+<U0079> /x79 LATIN SMALL LETTER Y
+<U007A> /x7a LATIN SMALL LETTER Z
+<U007B> /x7b LEFT CURLY BRACKET
+<U007C> /x7c VERTICAL LINE
+<U007D> /x7d RIGHT CURLY BRACKET
+<U007E> /x7e TILDE
+<U007F> /x7f DELETE (DEL)
+<UDF80>..<UDFFF> /x80
+END CHARMAP
@@ -97,6 +97,20 @@ END LC_CTYPE
LC_COLLATE
% This is the POSIX Locale definition for the LC_COLLATE category.
% The order is the same as in the ASCII code set.
+% Values above <DEL> (<U007F>) inserted in order, per Issue 7 TC2,
+% XBD, 7.3.2, LC_COLLATE Category in the POSIX Locale:
+% > All characters not explicitly listed here shall be inserted
+% > in the character collation order after the listed characters
+% > and shall be assigned unique primary weights. If the listed
+% > characters have ASCII encoding, the other characters shall
+% > be in ascending order according to their coded character set values
+% Since Issue 7 TC2 (XBD, 6.2 Character Encoding):
+% > The POSIX locale shall contain 256 single-byte characters [...]
+% (cf. bug 663, 674).
+% this is in contrast to previous issues, which limited the POSIX
+% locale to the Portable Character Set (7-bit ASCII).
+% We use the end of the Low Surrogate Area to contain these,
+% yielding [<UDF80>, <UDFFF>]
order_start forward
<U0000>
<U0001>
@@ -226,7 +240,134 @@ order_start forward
<U007D>
<U007E>
<U007F>
-UNDEFINED
+<UDF80>
+<UDF81>
+<UDF82>
+<UDF83>
+<UDF84>
+<UDF85>
+<UDF86>
+<UDF87>
+<UDF88>
+<UDF89>
+<UDF8A>
+<UDF8B>
+<UDF8C>
+<UDF8D>
+<UDF8E>
+<UDF8F>
+<UDF90>
+<UDF91>
+<UDF92>
+<UDF93>
+<UDF94>
+<UDF95>
+<UDF96>
+<UDF97>
+<UDF98>
+<UDF99>
+<UDF9A>
+<UDF9B>
+<UDF9C>
+<UDF9D>
+<UDF9E>
+<UDF9F>
+<UDFA0>
+<UDFA1>
+<UDFA2>
+<UDFA3>
+<UDFA4>
+<UDFA5>
+<UDFA6>
+<UDFA7>
+<UDFA8>
+<UDFA9>
+<UDFAA>
+<UDFAB>
+<UDFAC>
+<UDFAD>
+<UDFAE>
+<UDFAF>
+<UDFB0>
+<UDFB1>
+<UDFB2>
+<UDFB3>
+<UDFB4>
+<UDFB5>
+<UDFB6>
+<UDFB7>
+<UDFB8>
+<UDFB9>
+<UDFBA>
+<UDFBB>
+<UDFBC>
+<UDFBD>
+<UDFBE>
+<UDFBF>
+<UDFC0>
+<UDFC1>
+<UDFC2>
+<UDFC3>
+<UDFC4>
+<UDFC5>
+<UDFC6>
+<UDFC7>
+<UDFC8>
+<UDFC9>
+<UDFCA>
+<UDFCB>
+<UDFCC>
+<UDFCD>
+<UDFCE>
+<UDFCF>
+<UDFD0>
+<UDFD1>
+<UDFD2>
+<UDFD3>
+<UDFD4>
+<UDFD5>
+<UDFD6>
+<UDFD7>
+<UDFD8>
+<UDFD9>
+<UDFDA>
+<UDFDB>
+<UDFDC>
+<UDFDD>
+<UDFDE>
+<UDFDF>
+<UDFE0>
+<UDFE1>
+<UDFE2>
+<UDFE3>
+<UDFE4>
+<UDFE5>
+<UDFE6>
+<UDFE7>
+<UDFE8>
+<UDFE9>
+<UDFEA>
+<UDFEB>
+<UDFEC>
+<UDFED>
+<UDFEE>
+<UDFEF>
+<UDFF0>
+<UDFF1>
+<UDFF2>
+<UDFF3>
+<UDFF4>
+<UDFF5>
+<UDFF6>
+<UDFF7>
+<UDFF8>
+<UDFF9>
+<UDFFA>
+<UDFFB>
+<UDFFC>
+<UDFFD>
+<UDFFE>
+<UDFFF>
order_end
%
END LC_COLLATE
@@ -336,6 +336,7 @@ $(objpfx)test-vfprintf.out: $(gen-locales)
$(objpfx)tst-grouping.out: $(gen-locales)
$(objpfx)tst-grouping2.out: $(gen-locales)
$(objpfx)tst-grouping_iterator.out: $(gen-locales)
+$(objpfx)tst-printf-bz25691-mem.out: $(gen-locales)
$(objpfx)tst-sprintf.out: $(gen-locales)
$(objpfx)tst-sscanf.out: $(gen-locales)
$(objpfx)tst-swprintf.out: $(gen-locales)
@@ -30,6 +30,8 @@
static int
do_test (void)
{
+ setlocale (LC_CTYPE, "C.UTF-8");
+
mtrace ();
/* For 's' conversion specifier with 'l' modifier the array must be
@@ -33,10 +33,10 @@ static const struct __gconv_step to_wc =
.__shlib_handle = NULL,
.__modname = NULL,
.__counter = INT_MAX,
- .__from_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
+ .__from_name = (char *) "POSIX",
.__to_name = (char *) "INTERNAL",
- .__fct = __gconv_transform_ascii_internal,
- .__btowc_fct = __gconv_btwoc_ascii,
+ .__fct = __gconv_transform_posix_internal,
+ .__btowc_fct = __gconv_btwoc_posix,
.__init_fct = NULL,
.__end_fct = NULL,
.__min_needed_from = 1,
@@ -53,8 +53,8 @@ static const struct __gconv_step to_mb =
.__modname = NULL,
.__counter = INT_MAX,
.__from_name = (char *) "INTERNAL",
- .__to_name = (char *) "ANSI_X3.4-1968//TRANSLIT",
- .__fct = __gconv_transform_internal_ascii,
+ .__to_name = (char *) "POSIX",
+ .__fct = __gconv_transform_internal_posix,
.__btowc_fct = NULL,
.__init_fct = NULL,
.__end_fct = NULL,
@@ -67,7 +67,9 @@ static const struct __gconv_step to_mb =
};
-/* For the default locale we only have to handle ANSI_X3.4-1968. */
+/* The default/"POSIX"/"C" locale is an 8-bit-clean mapping
+ with ANSI_X3.4-1968 in the first 128 characters;
+ we lift the remaining bytes by <UDF00>. */
const struct gconv_fcts __wcsmbs_gconv_fcts_c =
{
.towc = (struct __gconv_step *) &to_wc,