From patchwork Wed Nov  2 17:17:17 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Ahelenia Ziemiańska
X-Patchwork-Id: 59800
Date: Wed, 2 Nov 2022 18:17:17 +0100
From: Ahelenia Ziemiańska
Reply-To: наб
To: libc-alpha@sourceware.org
Cc: Florian Weimer
Subject: [PATCH v6 2/2] POSIX locale covers every byte [BZ# 29511]
Message-ID: <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz>

This is a logistically trivial patch, largely duplicating the extant
ASCII code with the error path changed.

There are two user-facing changes:
  * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
  * mbrtowc() and friends return b if b <= 0x7F else 0xDF00+b

Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
  (a) is 1-byte, stateless, and contains 256 characters
  (b) they collate in byte order
  (c) the first 128 characters are equivalent to ASCII (like previous)
cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
changes to the standard;
in short, this means that mbrtowc() must never fail and must return
  b if b <= 0x7F else ab+c for all bytes b
  where c is some constant >=0x80
    and a is a positive integer constant

By strategically picking c=0xDF00 we land at the tail-end of the Unicode
Low Surrogate Area at DC00-DFFF, described as
  > Isolated surrogate code points have no interpretation;
  > consequently, no character code charts or names lists
  > are provided for this range.
and match musl

Signed-off-by: Ahelenia Ziemiańska
---
new in v2: nothing
new in v3: POSIX charset, __wcsmbs_gconv_fcts_c comment fixed
new in v4: nothing
new in v5: nothing
new in v6: clean rebase, rephrase message

 iconv/Makefile                    |   2 +-
 iconv/gconv_builtin.h             |   8 ++
 iconv/gconv_int.h                 |   7 ++
 iconv/gconv_posix.c               |  96 ++++++++++++++++++++
 iconv/tst-iconv_prog.sh           |  43 +++++++++
 iconvdata/tst-tables.sh           |   1 +
 inet/tst-idna_name_classify.c     |   6 +-
 locale/tst-C-locale.c             |  69 ++++++++++++++
 localedata/charmaps/POSIX         | 136 ++++++++++++++++++++++++++++
 localedata/locales/POSIX          | 143 +++++++++++++++++++++++++++++-
 stdio-common/tst-printf-bz25691.c |   2 +
 wcsmbs/wcsmbsload.c               |  14 +--
 12 files changed, 516 insertions(+), 11 deletions(-)
 create mode 100644 iconv/gconv_posix.c
 create mode 100644 localedata/charmaps/POSIX

diff --git a/iconv/Makefile b/iconv/Makefile
index a0d90cfeac..6e926f53e3 100644
--- a/iconv/Makefile
+++ b/iconv/Makefile
@@ -25,7 +25,7 @@ include ../Makeconfig
 headers	= iconv.h gconv.h
 routines	= iconv_open iconv iconv_close \
 	  gconv_open gconv gconv_close gconv_db gconv_conf \
-	  gconv_builtin gconv_simple gconv_trans gconv_cache
+	  gconv_builtin gconv_simple gconv_posix gconv_trans gconv_cache
 routines	+= gconv_dl gconv_charset
 
 vpath %.c ../locale/programs ../intl
diff --git a/iconv/gconv_builtin.h b/iconv/gconv_builtin.h
index 68c2369b1f..cd1805b3ce 100644
--- a/iconv/gconv_builtin.h
+++ b/iconv/gconv_builtin.h
@@ -89,6 +89,14 @@ BUILTIN_TRANSFORMATION ("INTERNAL", "ANSI_X3.4-1968//", 1, "=INTERNAL->ascii",
 			__gconv_transform_internal_ascii, NULL, 4, 4, 1, 1)
 
 
+BUILTIN_TRANSFORMATION ("POSIX//", "INTERNAL", 1, "=posix->INTERNAL",
+			__gconv_transform_posix_internal, __gconv_btwoc_posix,
+			1, 1, 4, 4)
+
+BUILTIN_TRANSFORMATION ("INTERNAL", "POSIX//", 1, "=INTERNAL->posix",
+			__gconv_transform_internal_posix, NULL, 4, 4, 1, 1)
+
+
 #if BYTE_ORDER == BIG_ENDIAN
 BUILTIN_ALIAS ("UNICODEBIG//", "ISO-10646/UCS2/")
 BUILTIN_ALIAS ("UCS-2BE//", "ISO-10646/UCS2/")
diff --git
a/iconv/gconv_int.h b/iconv/gconv_int.h index 1c6745043e..45ab1edfad 100644 --- a/iconv/gconv_int.h +++ b/iconv/gconv_int.h @@ -281,6 +281,8 @@ extern int __gconv_compare_alias (const char *name1, const char *name2) __BUILTIN_TRANSFORM (__gconv_transform_ascii_internal); __BUILTIN_TRANSFORM (__gconv_transform_internal_ascii); +__BUILTIN_TRANSFORM (__gconv_transform_posix_internal); +__BUILTIN_TRANSFORM (__gconv_transform_internal_posix); __BUILTIN_TRANSFORM (__gconv_transform_utf8_internal); __BUILTIN_TRANSFORM (__gconv_transform_internal_utf8); __BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal); @@ -299,6 +301,11 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_internal); only ASCII characters. */ extern wint_t __gconv_btwoc_ascii (struct __gconv_step *step, unsigned char c); +/* Specialized conversion function for a single byte to INTERNAL, + identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end + of the Low Surrogate Area at [U+DF80, U+DFFF]. */ +extern wint_t __gconv_btwoc_posix (struct __gconv_step *step, unsigned char c); + #endif __END_DECLS diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c new file mode 100644 index 0000000000..dcb13fbb43 --- /dev/null +++ b/iconv/gconv_posix.c @@ -0,0 +1,96 @@ +/* Simple transformations functions. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + + +#include + + +/* Specialized conversion function for a single byte to INTERNAL, + identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end + of the Low Surrogate Area at [U+DF80, U+DFFF]. */ +wint_t +__gconv_btwoc_posix (struct __gconv_step *step, unsigned char c) +{ + if (c < 0x80) + return c; + else + return 0xdf00 + c; +} + + +/* Convert from {[0, 0x7F] => ISO 646-IRV; [0x80, 0xFF] => [U+DF80, U+DFFF]} + to the internal (UCS4-like) format. */ +#define DEFINE_INIT 0 +#define DEFINE_FINI 0 +#define MIN_NEEDED_FROM 1 +#define MIN_NEEDED_TO 4 +#define FROM_DIRECTION 1 +#define FROM_LOOP posix_internal_loop +#define TO_LOOP posix_internal_loop /* This is not used. */ +#define FUNCTION_NAME __gconv_transform_posix_internal +#define ONE_DIRECTION 1 + +#define MIN_NEEDED_INPUT MIN_NEEDED_FROM +#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO +#define LOOPFCT FROM_LOOP +#define BODY \ + { \ + if (__glibc_unlikely (*inptr > '\x7f')) \ + *((uint32_t *) outptr) = 0xdf00 + *inptr++; \ + else \ + *((uint32_t *) outptr) = *inptr++; \ + outptr += sizeof (uint32_t); \ + } +#include +#include + + +/* Convert from the internal (UCS4-like) format to + {ISO 646-IRV => [0, 0x7F]; [U+DF80, U+DFFF] => [0x80, 0xFF]}. */ +#define DEFINE_INIT 0 +#define DEFINE_FINI 0 +#define MIN_NEEDED_FROM 4 +#define MIN_NEEDED_TO 1 +#define FROM_DIRECTION 1 +#define FROM_LOOP internal_posix_loop +#define TO_LOOP internal_posix_loop /* This is not used. 
*/ +#define FUNCTION_NAME __gconv_transform_internal_posix +#define ONE_DIRECTION 1 + +#define MIN_NEEDED_INPUT MIN_NEEDED_FROM +#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO +#define LOOPFCT FROM_LOOP +#define BODY \ + { \ + uint32_t val = *((const uint32_t *) inptr); \ + if (__glibc_unlikely ((val > 0x7f && val < 0xdf80) || val > 0xdfff)) \ + { \ + UNICODE_TAG_HANDLER (val, 4); \ + STANDARD_TO_LOOP_ERR_HANDLER (4); \ + } \ + else \ + { \ + if (__glibc_unlikely (val > 0x7f)) \ + val -= 0xdf00; \ + *outptr++ = val; \ + inptr += sizeof (uint32_t); \ + } \ + } +#define LOOP_NEED_FLAGS +#include +#include diff --git a/iconv/tst-iconv_prog.sh b/iconv/tst-iconv_prog.sh index b3d8bf5110..a24d8d2207 100644 --- a/iconv/tst-iconv_prog.sh +++ b/iconv/tst-iconv_prog.sh @@ -285,3 +285,46 @@ for errorcommand in "${errorarray[@]}"; do execute_test check_errtest_result done + +allbytes () +{ + for (( i = 0; i <= 255; i++ )); do + printf '\'"$(printf "%o" "$i")" + done +} + +allucs4be () +{ + for (( i = 0; i <= 127; i++ )); do + printf '\0\0\0\'"$(printf "%o" "$i")" + done + for (( i = 128; i <= 255; i++ )); do + printf '\0\0\xdf\'"$(printf "%o" "$i")" + done +} + +check_posix_result () +{ + if [ $? -eq 0 ]; then + result=PASS + else + result=FAIL + fi + + echo "$result: from \"$1\", to: \"$2\"" + + if [ "$result" != "PASS" ]; then + exit 1 + fi +} + +check_posix_encoding () +{ + eval PROG=\"$ICONV\" + allbytes | $PROG -f POSIX -t UCS-4BE | cmp -s - <(allucs4be) + check_posix_result POSIX UCS-4BE + allucs4be | $PROG -f UCS-4BE -t POSIX | cmp -s - <(allbytes) + check_posix_result UCS-4BE POSIX +} + +check_posix_encoding diff --git a/iconvdata/tst-tables.sh b/iconvdata/tst-tables.sh index 4207b44175..33a02158ac 100755 --- a/iconvdata/tst-tables.sh +++ b/iconvdata/tst-tables.sh @@ -31,6 +31,7 @@ cat < POSIX + % + / +% source: cf. 
localedata/locales/POSIX, LC_COLLATE + +CHARMAP + /x00 NULL (NUL) + /x01 START OF HEADING (SOH) + /x02 START OF TEXT (STX) + /x03 END OF TEXT (ETX) + /x04 END OF TRANSMISSION (EOT) + /x05 ENQUIRY (ENQ) + /x06 ACKNOWLEDGE (ACK) + /x07 BELL (BEL) + /x08 BACKSPACE (BS) + /x09 CHARACTER TABULATION (HT) + /x0a LINE FEED (LF) + /x0b LINE TABULATION (VT) + /x0c FORM FEED (FF) + /x0d CARRIAGE RETURN (CR) + /x0e SHIFT OUT (SO) + /x0f SHIFT IN (SI) + /x10 DATALINK ESCAPE (DLE) + /x11 DEVICE CONTROL ONE (DC1) + /x12 DEVICE CONTROL TWO (DC2) + /x13 DEVICE CONTROL THREE (DC3) + /x14 DEVICE CONTROL FOUR (DC4) + /x15 NEGATIVE ACKNOWLEDGE (NAK) + /x16 SYNCHRONOUS IDLE (SYN) + /x17 END OF TRANSMISSION BLOCK (ETB) + /x18 CANCEL (CAN) + /x19 END OF MEDIUM (EM) + /x1a SUBSTITUTE (SUB) + /x1b ESCAPE (ESC) + /x1c FILE SEPARATOR (IS4) + /x1d GROUP SEPARATOR (IS3) + /x1e RECORD SEPARATOR (IS2) + /x1f UNIT SEPARATOR (IS1) + /x20 SPACE + /x21 EXCLAMATION MARK + /x22 QUOTATION MARK + /x23 NUMBER SIGN + /x24 DOLLAR SIGN + /x25 PERCENT SIGN + /x26 AMPERSAND + /x27 APOSTROPHE + /x28 LEFT PARENTHESIS + /x29 RIGHT PARENTHESIS + /x2a ASTERISK + /x2b PLUS SIGN + /x2c COMMA + /x2d HYPHEN-MINUS + /x2e FULL STOP + /x2f SOLIDUS + /x30 DIGIT ZERO + /x31 DIGIT ONE + /x32 DIGIT TWO + /x33 DIGIT THREE + /x34 DIGIT FOUR + /x35 DIGIT FIVE + /x36 DIGIT SIX + /x37 DIGIT SEVEN + /x38 DIGIT EIGHT + /x39 DIGIT NINE + /x3a COLON + /x3b SEMICOLON + /x3c LESS-THAN SIGN + /x3d EQUALS SIGN + /x3e GREATER-THAN SIGN + /x3f QUESTION MARK + /x40 COMMERCIAL AT + /x41 LATIN CAPITAL LETTER A + /x42 LATIN CAPITAL LETTER B + /x43 LATIN CAPITAL LETTER C + /x44 LATIN CAPITAL LETTER D + /x45 LATIN CAPITAL LETTER E + /x46 LATIN CAPITAL LETTER F + /x47 LATIN CAPITAL LETTER G + /x48 LATIN CAPITAL LETTER H + /x49 LATIN CAPITAL LETTER I + /x4a LATIN CAPITAL LETTER J + /x4b LATIN CAPITAL LETTER K + /x4c LATIN CAPITAL LETTER L + /x4d LATIN CAPITAL LETTER M + /x4e LATIN CAPITAL LETTER N + /x4f LATIN CAPITAL LETTER O + /x50 LATIN CAPITAL 
LETTER P + /x51 LATIN CAPITAL LETTER Q + /x52 LATIN CAPITAL LETTER R + /x53 LATIN CAPITAL LETTER S + /x54 LATIN CAPITAL LETTER T + /x55 LATIN CAPITAL LETTER U + /x56 LATIN CAPITAL LETTER V + /x57 LATIN CAPITAL LETTER W + /x58 LATIN CAPITAL LETTER X + /x59 LATIN CAPITAL LETTER Y + /x5a LATIN CAPITAL LETTER Z + /x5b LEFT SQUARE BRACKET + /x5c REVERSE SOLIDUS + /x5d RIGHT SQUARE BRACKET + /x5e CIRCUMFLEX ACCENT + /x5f LOW LINE + /x60 GRAVE ACCENT + /x61 LATIN SMALL LETTER A + /x62 LATIN SMALL LETTER B + /x63 LATIN SMALL LETTER C + /x64 LATIN SMALL LETTER D + /x65 LATIN SMALL LETTER E + /x66 LATIN SMALL LETTER F + /x67 LATIN SMALL LETTER G + /x68 LATIN SMALL LETTER H + /x69 LATIN SMALL LETTER I + /x6a LATIN SMALL LETTER J + /x6b LATIN SMALL LETTER K + /x6c LATIN SMALL LETTER L + /x6d LATIN SMALL LETTER M + /x6e LATIN SMALL LETTER N + /x6f LATIN SMALL LETTER O + /x70 LATIN SMALL LETTER P + /x71 LATIN SMALL LETTER Q + /x72 LATIN SMALL LETTER R + /x73 LATIN SMALL LETTER S + /x74 LATIN SMALL LETTER T + /x75 LATIN SMALL LETTER U + /x76 LATIN SMALL LETTER V + /x77 LATIN SMALL LETTER W + /x78 LATIN SMALL LETTER X + /x79 LATIN SMALL LETTER Y + /x7a LATIN SMALL LETTER Z + /x7b LEFT CURLY BRACKET + /x7c VERTICAL LINE + /x7d RIGHT CURLY BRACKET + /x7e TILDE + /x7f DELETE (DEL) +.. /x80 +END CHARMAP diff --git a/localedata/locales/POSIX b/localedata/locales/POSIX index 7ec7f1c577..fc34a6abc1 100644 --- a/localedata/locales/POSIX +++ b/localedata/locales/POSIX @@ -97,6 +97,20 @@ END LC_CTYPE LC_COLLATE % This is the POSIX Locale definition for the LC_COLLATE category. % The order is the same as in the ASCII code set. +% Values above () inserted in order, per Issue 7 TC2, +% XBD, 7.3.2, LC_COLLATE Category in the POSIX Locale: +% > All characters not explicitly listed here shall be inserted +% > in the character collation order after the listed characters +% > and shall be assigned unique primary weights. 
If the listed +% > characters have ASCII encoding, the other characters shall +% > be in ascending order according to their coded character set values +% Since Issue 7 TC2 (XBD, 6.2 Character Encoding): +% > The POSIX locale shall contain 256 single-byte characters [...] +% (cf. bug 663, 674). +% this is in contrast to previous issues, which limited the POSIX +% locale to the Portable Character Set (7-bit ASCII). +% We use the end of the Low Surrogate Area to contain these, +% yielding [, ] order_start forward @@ -226,7 +240,134 @@ order_start forward -UNDEFINED + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + order_end % END LC_COLLATE diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-bz25691.c index 44844e71c3..e66242b58f 100644 --- a/stdio-common/tst-printf-bz25691.c +++ b/stdio-common/tst-printf-bz25691.c @@ -30,6 +30,8 @@ static int do_test (void) { + setlocale(LC_CTYPE, "C.UTF-8"); + mtrace (); /* For 's' conversion specifier with 'l' modifier the array must be diff --git a/wcsmbs/wcsmbsload.c b/wcsmbs/wcsmbsload.c index 0f0f55f9ed..97de9afd25 100644 --- a/wcsmbs/wcsmbsload.c +++ b/wcsmbs/wcsmbsload.c @@ -33,10 +33,10 @@ static const struct __gconv_step to_wc = .__shlib_handle = NULL, .__modname = NULL, .__counter = INT_MAX, - .__from_name = (char *) "ANSI_X3.4-1968//TRANSLIT", + .__from_name = (char *) "POSIX", .__to_name = (char *) "INTERNAL", - .__fct = __gconv_transform_ascii_internal, - .__btowc_fct = __gconv_btwoc_ascii, + .__fct = __gconv_transform_posix_internal, + .__btowc_fct = __gconv_btwoc_posix, .__init_fct = NULL, .__end_fct = NULL, .__min_needed_from = 1, @@ -53,8 +53,8 @@ static const struct __gconv_step to_mb = .__modname = NULL, .__counter = INT_MAX, .__from_name = (char *) "INTERNAL", - .__to_name = (char *) 
"ANSI_X3.4-1968//TRANSLIT", - .__fct = __gconv_transform_internal_ascii, + .__to_name = (char *) "POSIX", + .__fct = __gconv_transform_internal_posix, .__btowc_fct = NULL, .__init_fct = NULL, .__end_fct = NULL, @@ -67,7 +67,9 @@ static const struct __gconv_step to_mb = }; -/* For the default locale we only have to handle ANSI_X3.4-1968. */ +/* The default/"POSIX"/"C" locale is an 8-bit-clean mapping + with ANSI_X3.4-1968 in the first 128 characters; + we lift the remaining bytes by . */ const struct gconv_fcts __wcsmbs_gconv_fcts_c = { .towc = (struct __gconv_step *) &to_wc,