[v7] POSIX locale covers every byte [BZ# 29511]

  Hi!

On Wed, Nov 09, 2022 at 03:20:26PM +0100, Florian Weimer wrote:
> * наб:
> > This is a logistically trivial patch,
> > largely duplicating the extant ASCII code with the error path changed
> I wouldn't say it's trivial in the commit message. 8-)
Trimmed to ~second line

> > There are two user-facing changes:
> >   * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
> >   * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b
> >
> > Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
> >   (a) is 1-byte, stateless, and contains 256 characters
> >   (b) they collate in byte order
> >   (c) the first 128 characters are equivalent to ASCII (like previous)
> > cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
> > changes to the standard;
> > in short, this means that mbrtowc() must never fail and must return
> >   b if b <= 0x7F else ab+c for all bytes b
> >   where c is some constant >=0x80
> >     and a is a positive integer constant
> >
> > By strategically picking c=<UDF00> we land at the tail-end of the
> > Unicode Low Surrogate Area at DC00-DFFF, described as
> >   > Isolated surrogate code points have no interpretation;
> >   > consequently, no character code charts or names lists
> >   > are provided for this range.
> > and match musl
> 
> Sadly this doesn't match Python and PEP 540:
> 
> >>> b'\x80'.decode('UTF-8', errors='surrogateescape')
> '\udc80'
> 
> I believe the implementation translates this to 0xDF80 instead.
Yes.
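
For reference, the two schemes can be put side by side in a few lines of C.
This is only a sketch: posix_btowc_sketch() and surrogateescape_sketch() are
names made up for this mail, not functions from the patch or from Python.
It assumes the mapping as stated in the commit message (0xDF00 + b) and the
'\udc80' result quoted above (0xDC00 + b).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not glibc code: the byte-to-wide mapping the
   commit message describes (identity for ASCII, 0xDF00 + b above it).  */
static uint32_t
posix_btowc_sketch (unsigned char b)
{
  return b <= 0x7F ? b : UINT32_C (0xDF00) + b;
}

/* Hypothetical helper: the code point surrogateescape produces for an
   undecodable byte (0xDC00 + b), matching the '\udc80' example.  */
static uint32_t
surrogateescape_sketch (unsigned char b)
{
  return b <= 0x7F ? b : UINT32_C (0xDC00) + b;
}

/* Both schemes agree on ASCII and differ by a constant 0x300 above it.  */
static void
check_offsets (void)
{
  for (unsigned int b = 0; b <= 0xFF; b++)
    {
      uint32_t ours = posix_btowc_sketch ((unsigned char) b);
      uint32_t py = surrogateescape_sketch ((unsigned char) b);
      assert (ours - py == (b <= 0x7F ? 0u : 0x300u));
    }
}
```

So the two only differ in which 128-code-point slice of the Low Surrogate
Area they use: [U+DC80, U+DCFF] for Python, [U+DF80, U+DFFF] here.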

> Not sure what is more important here, musl compatibility or Python
> compatibility.  Cc:ing Victor in case he has comments.  I should probably
> ask on the musl list as well as how this divergence came to pass.
I went for musl because (a) it's a libc not some random programming
language, (b) putting the end of our domain at the end of the
surrogates is more aesthetically and ideologically pleasing, and (c)
there's marginal value of having both musl and glibc produce the same
characters if you like save them as integers for some reason.
But the choice of any range therein is pretty much editorial, I think.

> This change definitely needs a NEWS entry.
Something like this?
  Deprecated and removed features, and other changes affecting compatibility:
  * The default/"POSIX"/"C" locale's character set is now "POSIX",
    instead of "ANSI_X3.4-1968"; this is a new fully-reversible
    8-bit transparent encoding for compatibility with Issue 7 TC 2,
    identity-mapping bytes in the ASCII [0, 0x7F] range,
    and mapping [0x80, 0xFF] bytes to [<U+DF80>, <U+DFFF>].
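
A user could observe the first half of that entry with something like the
following. It is a sketch only: c_locale_codeset() and codeset_is_posix()
are hypothetical helper names, and which string comes back depends on the
libc and version in use ("ANSI_X3.4-1968" on glibc without this patch).

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: select the "C" locale and return the codeset
   name it reports.  Changes the process-wide locale.  */
static const char *
c_locale_codeset (void)
{
  const char *loc = setlocale (LC_ALL, "C");
  assert (loc != NULL);	/* The "C" locale must always be available.  */
  (void) loc;
  return nl_langinfo (CODESET);
}

/* Hypothetical helper: nonzero if the codeset is reported as "POSIX",
   which is what the NEWS entry describes for a patched glibc.  */
static int
codeset_is_posix (void)
{
  return strcmp (c_locale_codeset (), "POSIX") == 0;
}
```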

> > diff --git a/iconv/gconv_int.h b/iconv/gconv_int.h
> > index 1c6745043e..45ab1edfad 100644
> > --- a/iconv/gconv_int.h
> > +++ b/iconv/gconv_int.h
> > @@ -299,6 +301,11 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_internal);
> >     only ASCII characters.  */
> >  extern wint_t __gconv_btwoc_ascii (struct __gconv_step *step, unsigned char c);
> >  
> > +/* Specialized conversion function for a single byte to INTERNAL,
> > +   identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end
> > +   of the Low Surrogate Area at [U+DF80, U+DFFF].  */
> > +extern wint_t __gconv_btwoc_posix (struct __gconv_step *step, unsigned char c);
> > +
> >  #endif
> 
> Missing attribute_hidden.  Yes, it's also missing from
> __gconv_btwoc_ascii.  The linker probably papers over it.
Added.

> > diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c
> > new file mode 100644
> > index 0000000000..dcb13fbb43
> > --- /dev/null
> > +++ b/iconv/gconv_posix.c
> > @@ -0,0 +1,96 @@
> > +/* Simple transformations functions.
> 
> I think this line should say something about surrogate-escape encoding
> for the POSIX locale.
I completely missed that this line isn't part of the licence. Used
> "POSIX" locale transformation functions.
as the shorthand; this is expounded in the comment for __gconv_btwoc_posix ()
below.

> > +    else								      \
> > +      {									      \
> > +	if (__glibc_unlikely (val > 0x7f))				      \
> > +	  val -= 0xdf00;						      \
> > +	*outptr++ = val;						      \
> > +	inptr += sizeof (uint32_t);					      \
> > +      }									      \
> > +  }
> 
> I suggest to drop the last __glibc_unlikely here because it's
> input-dependent.
Applied.

> > diff --git a/locale/tst-C-locale.c b/locale/tst-C-locale.c
> > index 6bd0367069..f30396ae12 100644
> > --- a/locale/tst-C-locale.c
> > +++ b/locale/tst-C-locale.c
> > @@ -229,6 +229,75 @@ run_test (const char *locname)
> >    STRTEST (YESSTR, "");
> >    STRTEST (NOSTR, "");
> >  
> > +#define CONVTEST(b, v) \
> > +  {									      \
> > +    unsigned char bs[] = {b, 0};					      \
> > +    mbstate_t ctx = {};							      \
> > +    wchar_t wc = -1;							      \
> > +    size_t sz = mbrtowc(&wc, (char *) bs, 1, &ctx);			      \
> 
> Missing space before '(' (also in other cases below).
> 
> Not sure if the macros are needed, maybe write one loop for each
> direction with a condition in it?
Fixed, unrolled.
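
The byte-to-wide loop has roughly this shape. It is a sketch and not the
code from tst-C-locale.c: count_mbrtowc_mismatches() is a hypothetical
name, it counts mismatches instead of using the test-suite macros, and it
skips the NUL byte, for which mbrtowc() returns 0 rather than 1.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical check: for every byte in [first, last], convert it with
   mbrtowc() in the current locale and compare against the mapping from
   the commit message.  Returns the number of bytes that differ.  */
static int
count_mbrtowc_mismatches (unsigned int first, unsigned int last)
{
  int mismatches = 0;
  assert (first >= 1 && last <= 0xFF);
  for (unsigned int b = first; b <= last; b++)
    {
      unsigned char in = (unsigned char) b;
      mbstate_t ctx;
      wchar_t wc = 0;
      memset (&ctx, 0, sizeof ctx);
      size_t sz = mbrtowc (&wc, (const char *) &in, 1, &ctx);
      /* One condition selects the expected value for both halves.  */
      wchar_t want = b <= 0x7F ? (wchar_t) b : (wchar_t) (0xDF00 + b);
      if (sz != 1 || wc != want)
        mismatches++;
    }
  return mismatches;
}
```

In the "C" locale the [1, 0x7F] range should report no mismatches on any
libc; the [0x80, 0xFF] range only does so where the locale maps those
bytes as described here.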

> > diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-bz25691.c
> > index 44844e71c3..e66242b58f 100644
> > --- a/stdio-common/tst-printf-bz25691.c
> > +++ b/stdio-common/tst-printf-bz25691.c
> > @@ -30,6 +30,8 @@
> >  static int
> >  do_test (void)
> >  {
> > +  setlocale(LC_CTYPE, "C.UTF-8");
> > +
> >    mtrace ();
> >  
> >    /* For 's' conversion specifier with 'l' modifier the array must be
> 
> What's the rationale for this change?  If it is really required, you
> must also update stdio-common/Makefile with a new dependency on
> $(gen-locales).
The test depends on the locale having a hole at 0xFF, cf. ll. 93-100:
    /* Same test, but with an invalid multibyte sequence.  */
    mbs[mbssize - 2] = 0xff;

    ret = swprintf (result, resultsize, L"%.65537s", mbs);
    TEST_COMPARE (ret, -1);

    ret = swprintf (result, resultsize, L"%1$.65537s", mbs);
    TEST_COMPARE (ret, -1);
And this is the simplest way to ensure that, I think.

Dependency added.
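
To illustrate what "a hole at 0xFF" means: in a UTF-8 locale the byte 0xFF
can never start a valid sequence, so mbrtowc() fails on it, whereas a
locale that covers every byte converts it. A sketch, with
ff_is_invalid_in() being a hypothetical helper and not part of the test:

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the single byte 0xFF is rejected by
   mbrtowc() under the named LC_CTYPE locale, 0 if it converts, and -1
   if the locale cannot be selected.  Changes the process-wide locale.  */
static int
ff_is_invalid_in (const char *locname)
{
  assert (locname != NULL);
  if (setlocale (LC_CTYPE, locname) == NULL)
    return -1;
  const char in = '\xff';
  mbstate_t ctx;
  wchar_t wc = 0;
  memset (&ctx, 0, sizeof ctx);
  size_t sz = mbrtowc (&wc, &in, 1, &ctx);
  return sz == (size_t) -1;
}
```

With "C.UTF-8" this should return 1 wherever that locale is available,
which is the property the swprintf checks above rely on.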

> Thanks,
> Florian
Best,
наб

Scissor-patch follows.
-- >8 --
This largely duplicates the ASCII code with the error path changed.

There are two user-facing changes:
  * nl_langinfo(CODESET) is "POSIX" instead of "ANSI_X3.4-1968"
  * mbrtowc() and friends return b if b <= 0x7F else <UDF00>+b

Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
  (a) is 1-byte, stateless, and contains 256 characters
  (b) they collate in byte order
  (c) the first 128 characters are equivalent to ASCII (like previous)
cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
changes to the standard;
in short, this means that mbrtowc() must never fail and must return
  b if b <= 0x7F else ab+c for all bytes b
  where c is some constant >=0x80
    and a is a positive integer constant

By strategically picking c=<UDF00> we land at the tail-end of the
Unicode Low Surrogate Area at DC00-DFFF, described as
  > Isolated surrogate code points have no interpretation;
  > consequently, no character code charts or names lists
  > are provided for this range.
and match musl

Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
---
 iconv/Makefile                    |   2 +-
 iconv/gconv_builtin.h             |   8 ++
 iconv/gconv_int.h                 |   8 ++
 iconv/gconv_posix.c               |  96 ++++++++++++++++++++
 iconv/tst-iconv_prog.sh           |  43 +++++++++
 iconvdata/tst-tables.sh           |   1 +
 inet/tst-idna_name_classify.c     |   6 +-
 locale/tst-C-locale.c             |  44 +++++++++
 localedata/charmaps/POSIX         | 136 ++++++++++++++++++++++++++++
 localedata/locales/POSIX          | 143 +++++++++++++++++++++++++++++-
 stdio-common/Makefile             |   1 +
 stdio-common/tst-printf-bz25691.c |   2 +
 wcsmbs/wcsmbsload.c               |  14 +--
 13 files changed, 493 insertions(+), 11 deletions(-)
 create mode 100644 iconv/gconv_posix.c
 create mode 100644 localedata/charmaps/POSIX

Message ID	20221109161415.eyqgyrp2jlwzfdmb@tarta.nabijaczleweli.xyz
State	Superseded
Delegated to:	Florian Weimer
Headers	DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 95A833858D1E
	Date: Wed, 9 Nov 2022 17:14:15 +0100
	To: Florian Weimer <fweimer@redhat.com>
	Cc: libc-alpha@sourceware.org, Victor Stinner <vstinner@redhat.com>
	Subject: [PATCH v7] POSIX locale covers every byte [BZ# 29511]
	Message-ID: <20221109161415.eyqgyrp2jlwzfdmb@tarta.nabijaczleweli.xyz>
	References: <ad6720f44981d53ce50804d4ea3696ca1b7cd0b7.1663768863.git.nabijaczleweli@nabijaczleweli.xyz> <969aa82c8d5904c1d2040bba87abe2f17a0dc647.1667409408.git.nabijaczleweli@nabijaczleweli.xyz> <874jv8dxat.fsf@oldenburg.str.redhat.com>
	MIME-Version: 1.0
	Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="ybw6ajnm6245g2f5"
	Content-Disposition: inline
	In-Reply-To: <874jv8dxat.fsf@oldenburg.str.redhat.com>
	User-Agent: NeoMutt/20220429
	Precedence: list
	From: =?utf-8?b?0L3QsNCxIHZpYSBMaWJjLWFscGhh?= <libc-alpha@sourceware.org>
	Reply-To: =?utf-8?b?0L3QsNCx?= <nabijaczleweli@nabijaczleweli.xyz>
	Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org
	Sender: "Libc-alpha" <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
Series	[v7] POSIX locale covers every byte [BZ# 29511]

Context	Check	Description
dj/TryBot-apply_patch	success	Patch applied to master at the time it was sent
dj/TryBot-32bit	success	Build for i686
