[v18,3/3] POSIX locale covers every byte [BZ# 29511]

This largely duplicates the ASCII code, with the error path changed.

There is one user-facing change: under "ANSI_X3.4-1968"
(and /only/ that; its former aliases are unaffected),
mbrtowc() and friends return b if b <= 0x7F else <UDC00>+b.

Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
  (a) is 1-byte, stateless, and contains 256 characters
  (b) collates them in ASCII-byte order
  (c) maps the first 128 characters to the ASCII characters (as before)
cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
changes to the standard.
In short, this means that under an ASCII encoding,
mbrtowc() must never fail and must return
  b if b <= 0x7F else ab+c for all bytes b
  where c is some constant >=0x80
    and a is a positive integer constant

By strategically picking c=<UDC00> we land at the same point of the
Unicode Low Surrogate Area (bytes 0x80-0xFF become DC80-DCFF) as the
Python UTF-8 errors=surrogateescape encoding. Unicode describes this
area as
  > Isolated surrogate code points have no interpretation;
  > consequently, no character code charts or names lists
  > are provided for this range.
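
The mapping described above can be sketched as a pure function. This is
an illustration only, assuming a=1 and c=0xDC00; posix_byte_to_wc is a
hypothetical name and is not the code added in iconv/gconv_posix.c.

```c
#include <stdint.h>

/* Hypothetical helper (not a glibc function): models the byte -> wide
   character mapping described above, assuming a = 1 and c = 0xDC00.  */
uint32_t
posix_byte_to_wc (unsigned char b)
{
  /* Bytes 0x00-0x7F are ASCII and map to themselves.  */
  if (b <= 0x7F)
    return b;
  /* Bytes 0x80-0xFF land in the low-surrogate range at DC80-DCFF.  */
  return UINT32_C (0xDC00) + b;
}
```

Every byte has a result, so there is no error path: this is the "must
never fail" property from the requirement above.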

As @mirabilos points out in
  https://www.mail-archive.com/austin-group-l@opengroup.org/msg11591.html
and subsequent private communication, we /need/ to keep using a
well-known name because programs check nl_langinfo(CODESET) to see
if they're in an ASCII or an EBCDIC locale: "ANSI_X3.4-1968", being
glibc's default, is checked universally.
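
A rough sketch of the kind of check such programs perform follows. The
function name and the list of names are hypothetical and not taken from
any particular program; real lists differ.

```c
#include <string.h>

/* Hypothetical check: returns 1 if the codeset name is on a fixed list
   of names the program treats as ASCII-compatible, 0 otherwise.  The
   list below is illustrative only.  */
int
codeset_is_ascii_compatible (const char *codeset)
{
  static const char *const known[] = {
    "ANSI_X3.4-1968", "ASCII", "US-ASCII", "UTF-8", "ISO-8859-1",
  };
  for (size_t i = 0; i < sizeof known / sizeof known[0]; i++)
    if (strcmp (codeset, known[i]) == 0)
      return 1;
  return 0;
}
```

A program would typically call this as
codeset_is_ascii_compatible (nl_langinfo (CODESET)) after setlocale ().
A new name such as "POSIX" is not on the list, so it would be treated as
not ASCII-compatible.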

glibc has many aliases for ASCII,
but the "ANSI_X3.4-1968" name is /so supremely annoying/ that
no-one uses it when they want a conversion:
  https://codesearch.debian.net/search?q=iconv.*ANSI_X3.4-1968&literal=0&perpkg=1
This contrasts with most other aliases, which are generally used in the
wild for "please give me just 7-bit ASCII and reject everything else".
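
For illustration, the "reject everything else" usage looks roughly like
the sketch below. try_convert_to_utf8 is a hypothetical helper; it
assumes the iconv implementation knows the "ASCII" name and reports
non-7-bit input with EILSEQ, as glibc does.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns 0 if buf converts from charset `from`
   to UTF-8, otherwise the errno value of the failure.  Intended for
   short inputs only: the output buffer is 64 bytes.  */
int
try_convert_to_utf8 (const char *from, const char *buf, size_t len)
{
  iconv_t cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return errno;

  char out[64];
  char *in = (char *) buf;  /* iconv() takes char **; input is not modified.  */
  char *outp = out;
  size_t inleft = len, outleft = sizeof out;
  int err = 0;
  if (iconv (cd, &in, &inleft, &outp, &outleft) == (size_t) -1)
    err = errno;
  iconv_close (cd);
  return err;
}
```

A caller passing "ASCII" (or another of the 7-bit aliases) relies on the
non-zero result for bytes above 0x7F, which is why those names must keep
their current behaviour.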

Thus, by reparenting the ASCII alias tree at "ASCII", "ANSI_X3.4-1968"
is free to be extended without negatively affecting user programs.

Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
---
Clean rebase. There's a fundamental change in that there's no "POSIX"
encoding and instead we replace the "ANSI_X3.4-1968" one.

As pointed out by @mirabilos in
  https://www.mail-archive.com/austin-group-l@opengroup.org/msg11591.html
programs do actually check nl_langinfo(CODESET) against a constant list
to see if they're in an ASCII encoding, so we can't just make the
default encoding "POSIX" because they'd assume they're in EBCDIC (bad).

Thankfully a user program survey
  https://codesearch.debian.net/search?q=iconv.*ANSI_X3.4-1968&literal=0&perpkg=1
(results archived as:
 $ base64 -di <<EOF | zstd -d
 KLUv/QSATRUAdqFoIfC2HhTZnYCbS8jO3hOVi6tYH6Tj+02DhX+NhcVhQGApAV0AXgBfADt23q3S
 5Rq0uGjIJNb5f4qi5iP4BUPZukPNP9PrsQ9Q3dTyB0O9JVTv4Rd6rCf6gseLmYq+UD0Xf4+wHw1V
 Cr7J89t2fCMURQl4xyG3+7fAexYE7f6eW3w/f2UGKvQS0Wyk1DqZBslWHtFxhxk1mXE6k0yFUWXW
 3WIb4u9l+lWCWuO3PHzd3/a7SRbziCo5jdP8AiTvmd5NRz+ZODKM3/rzvHPb3/MYj9BkIBqLduqN
 sq73Bz0E85Ul+9keAZvchV2dYWIaxh04Od9iWguSculg0LLspiBubR5a91ZONfFbxO9iD/WMh63r
 oiUzecX/56O35F++nq7ZWhYWMTnn54sxfZIGcWxu29VhmOYb+UJm6DaZPSKUKQbUA9ehywyHLtu6
 yU1t00Jag2LdMpNRWp6DzDZum9o2tcxwJWZ8QQ9UlYAzyVgufzr99QrFCypHlSkOgKIa+M99y7Gl
 EoL5hO+W2y9UVSBbUFvsZQrqdJ7v7rxj1c2tGT/+CW+osZ0yxMyJwrQkpWEMIIQYxY5uAzJKNmcs
 Mg4aChJGCsopo1g6iAFtArSGohdjtKHU47ut04aWBiv+bSd5wFVhpzDPGm/5DSxnQhn0lZVQnJVr
 2SbDmrw+jYxJ7gdjQszWN3H8VegwGnfiI8ru5Nqv0OXJBIoOHmHfMOsyCJAGQTTlAi0WmH6fu51Q
 AN6aI+NaVZ7TabdBpZliVXk68wIDp0Vbpcg4MOTCAo6UabwqgeUCOcDSNsA2JWNmcHeRk9Hn4sbz
 KayIfTIiSMvcerFQiXBcxhjtI6REJky3xuNwrL7NuiYsxTT5XooHtNTfdVjB6LXqJ4QQGQ5woRHJ
 ghyEj1YBN9ksOQ==
 EOF
 )
shows that people use ASCII/US-ASCII/even ISO_646.irv:1991 and whatever
for iconv as "7-bit ASCII". But no-one uses ANSI_X3.4-1968 because it's
such a deeply annoying name. Thus, we simply make ANSI_X3.4-1968
/and only ANSI_X3.4-1968/ the 256-byte encoding. User programs see no
difference in nl_langinfo(CODESET), but function correctly.

This unfortunately makes the patch bigger than it otherwise would've
been since ANSI_X3.4-1968 was the leader of the ASCII alias group.
I made it, uh, ASCII, simply because it's the least annoying one.
Which encodings are aliases or not isn't visible to users,
so it doesn't matter if it's ASCII or irv6861992v7.

Thus, by reparenting the ASCII alias tree at "ASCII", "ANSI_X3.4-1968"
is free to be extended without impacting user programs' behaviour.

Similarly, re-word the news and commit message to avoid saying POSIX
specifies an encoding (it doesn't, it specifies just some stiff
requirements on the encoding, and the wording heavily favours ASCII).

 NEWS                               |  10 ++
 iconv/Makefile                     |   2 +-
 iconv/gconv_builtin.h              |  37 +++++---
 iconv/gconv_int.h                  |   8 ++
 iconv/gconv_posix.c                |  94 +++++++++++++++++++
 iconv/tst-iconv_prog.sh            |  44 +++++++++
 iconvdata/TESTS                    |   1 +
 iconvdata/testdata/ANSI_X3.4-1968  |   7 +-
 iconvdata/testdata/ASCII           |   6 ++
 iconvdata/tst-tables.sh            |   3 +-
 inet/tst-idna_name_classify.c      |   6 +-
 intl/Makefile                      |   2 +-
 intl/tst-translit.c                |   2 +-
 locale/programs/config.h           |   2 +-
 locale/tst-C-locale.c              |  44 +++++++++
 localedata/Makefile                |   2 +-
 localedata/bug-iconv-trans.c       |   2 +-
 localedata/charmaps/ANSI_X3.4-1968 |  13 +--
 localedata/charmaps/ASCII          | 144 +++++++++++++++++++++++++++++
 localedata/locales/POSIX           | 143 +++++++++++++++++++++++++++-
 localedata/tests-mbwc/tgn_locdef.h |   4 +-
 localedata/tst-ctype.sh            |   2 +-
 localedata/tst-langinfo.sh         |  68 +++++++-------
 localedata/tst-mbswcs6.c           |   2 +-
 stdio-common/Makefile              |   1 +
 stdio-common/tst-printf-bz25691.c  |   2 +
 wcsmbs/Makefile                    |   2 +-
 wcsmbs/tst-btowc.c                 |   4 +-
 wcsmbs/wcsmbsload.c                |  14 +--
 29 files changed, 581 insertions(+), 90 deletions(-)
 create mode 100644 iconv/gconv_posix.c
 mode change 100644 => 120000 iconvdata/testdata/ANSI_X3.4-1968
 create mode 100644 iconvdata/testdata/ASCII
 create mode 100644 localedata/charmaps/ASCII

Message ID	81bebf97b6547133593d2089125aae672997a93f.1690133538.git.nabijaczleweli@nabijaczleweli.xyz
State	New
From	наб via Libc-alpha <libc-alpha@sourceware.org>
Reply-To	наб <nabijaczleweli@nabijaczleweli.xyz>
Date	Sun, 23 Jul 2023 19:54:19 +0200
To	Florian Weimer <fweimer@redhat.com>
Cc	libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>, Bruno Haible <bruno@clisp.org>
In-Reply-To	<e4748b3c5f06edfa928a649c99b6ea8dddb799f9.1690133538.git.nabijaczleweli@nabijaczleweli.xyz>
Series	[v18,1/3] iconv: __gconv_btwoc_ascii -> __gconv_btowc_ascii | [v18,2/3] locale: charmap: fix off-by-one with ranges | [v18,3/3] POSIX locale covers every byte [BZ# 29511]

Context	Check	Description
linaro-tcwg-bot/tcwg_glibc_build--master-arm	success	Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm	success	Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64	success	Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64	fail	Testing failed
redhat-pt-bot/TryBot-still_applies	warning	Patch no longer applies to master
