[v4,0/3] C.UTF-8

Message ID 20210729063515.1541388-1-carlos@redhat.com

Message

Carlos O'Donell July 29, 2021, 6:35 a.m. UTC
  The following changes implement a minimally sized C.UTF-8.
First we implement the 'strcmp_collation' directive.
Then we implement C.UTF-8 with an LC_COLLATE that uses the
'strcmp_collation' directive to support using strcmp for
collation, i.e. code point sorting. The final C.UTF-8 is
only ~396KiB, with the largest part (~346KiB) in LC_CTYPE for all
of Unicode.
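
As an illustration of why strcmp is enough for code point sorting (a
minimal sketch, not code from this patch series): UTF-8 is designed so
that comparing encoded strings byte by byte, as unsigned bytes, gives
the same order as comparing their code points. The helper names
cmp_strcmp and sort_by_code_point below are hypothetical and exist only
for this example.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain strcmp on the pointed-to strings.
   strcmp compares bytes as unsigned char, which for valid UTF-8
   is the same as comparing code points.  */
static int
cmp_strcmp (const void *a, const void *b)
{
  const char *const *sa = a;
  const char *const *sb = b;
  return strcmp (*sa, *sb);
}

/* Hypothetical helper: sort an array of UTF-8 strings into
   code point order without consulting any collation table.  */
static void
sort_by_code_point (const char **strings, size_t count)
{
  qsort (strings, count, sizeof strings[0], cmp_strcmp);
}
```

Calling sort_by_code_point on the UTF-8 encodings of U+1F600, "z",
U+20AC and U+00E9 should leave them ordered as "z" (U+007A), U+00E9,
U+20AC, U+1F600.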

This v4 fixes the regressions detected in Fedora Rawhide
here: https://bugzilla.redhat.com/show_bug.cgi?id=1986421
Additional testing coverage is provided for fnmatch, regcomp,
and regexec (which would have caught the regression).

Carlos O'Donell (3):
  Add support for locales with zero collation rules.
  Add 'strcmp_collation' support for LC_COLLATE.
  Add generic C.UTF-8 locale (Bug 17318)

 iconv/Makefile                   |  22 +-
 iconv/tst-iconv9.c               |  87 +++++
 locale/programs/ld-collate.c     |  24 +-
 locale/programs/locfile-kw.gperf |   1 +
 locale/programs/locfile-kw.h     | 306 ++++++++---------
 locale/programs/locfile-token.h  |   1 +
 localedata/C.UTF-8.in            | 157 +++++++++
 localedata/Makefile              |   2 +
 localedata/SUPPORTED             |   1 +
 localedata/locales/C             | 194 +++++++++++
 posix/bug-regex1.c               |  20 ++
 posix/bug-regex19.c              |  22 +-
 posix/bug-regex4.c               |  25 ++
 posix/bug-regex6.c               |   2 +-
 posix/fnmatch_loop.c             |  95 ++++--
 posix/regcomp.c                  |  12 +-
 posix/regexec.c                  |  85 +++--
 posix/transbug.c                 |  22 +-
 posix/tst-fnmatch.input          | 549 ++++++++++++++++++++++++++++++-
 posix/tst-regcomp-truncated.c    |   1 +
 posix/tst-regex.c                |  25 +-
 21 files changed, 1385 insertions(+), 268 deletions(-)
 create mode 100644 iconv/tst-iconv9.c
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C
  

Comments

Florian Weimer July 29, 2021, 7:53 a.m. UTC | #1
* Carlos O'Donell via Libc-alpha:

> The following changes implement a minimally sized C.UTF-8.
> First we implement the 'strcmp_collation' directive.
> Then we implement C.UTF-8 with an LC_COLLATE that uses the
> 'strcmp_collation' directive to support using strcmp for
> collation i.e. code point sorting. The final C.UTF-8 is
> only ~396KiB with the largest ~346KiB in LC_CTYPE for all
> of Unicode.
>
> This v4 fixes the regressions detected in Fedora Rawhide
> here: https://bugzilla.redhat.com/show_bug.cgi?id=1986421
> Additional testing coverage is provided for fnmatch, regcomp,
> and regexec (which would have caught the regression).

From a high-level point of view I wonder if the more conservative choice
would be to fix the localedef generation for LC_COLLATE, at least for
this release.  It would also mean that we do not break statically linked
executables.

Thanks,
Florian
  
Carlos O'Donell July 30, 2021, 3:12 a.m. UTC | #2
On 7/29/21 3:53 AM, Florian Weimer wrote:
> * Carlos O'Donell via Libc-alpha:
> 
>> The following changes implement a minimally sized C.UTF-8.
>> First we implement the 'strcmp_collation' directive.
>> Then we implement C.UTF-8 with an LC_COLLATE that uses the
>> 'strcmp_collation' directive to support using strcmp for
>> collation i.e. code point sorting. The final C.UTF-8 is
>> only ~396KiB with the largest ~346KiB in LC_CTYPE for all
>> of Unicode.
>>
>> This v4 fixes the regressions detected in Fedora Rawhide
>> here: https://bugzilla.redhat.com/show_bug.cgi?id=1986421
>> Additional testing coverage is provided for fnmatch, regcomp,
>> and regexec (which would have caught the regression).
> 
> From a high-level point of view I wonder if the more conservative choice
> would be to fix the localedef generation for LC_COLLATE, at least for
> this release.  It would also mean that we do not break statically linked
> executables.

That is a great idea. In fact it's fairly easy to split and
reuse the tables from C-collate.c in ld-collate.c and emit them when
you have an nrules == 0 scenario, and thus provide support in fnmatch,
regexec, and regcomp for ASCII ranges.
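
For reference, the kind of ASCII range matching meant here can be
exercised with a small check like the one below. This is only a sketch:
it runs in the default "C" locale that every program starts in, it does
not load C.UTF-8 (which may not be installed), and the helper names
glob_matches and regex_matches are hypothetical. Only the standard
fnmatch, regcomp, regexec and regfree interfaces are used.

```c
#include <fnmatch.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if STRING matches the shell PATTERN.  */
static bool
glob_matches (const char *pattern, const char *string)
{
  return fnmatch (pattern, string, 0) == 0;
}

/* Return true if STRING matches the POSIX extended regex PATTERN.
   A pattern that fails to compile is treated as "no match".  */
static bool
regex_matches (const char *pattern, const char *string)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;
  bool matched = regexec (&re, string, 0, NULL, 0) == 0;
  regfree (&re);
  return matched;
}
```

With these helpers, "[a-f]*" and "^[a-f]+$" should both accept "cafe",
while "zebra" and "cafe9" respectively should be rejected. The same
checks could be repeated after setlocale (LC_ALL, "C.UTF-8") on a system
where that locale is available.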

I've finished a new v5 along these lines and I'm testing it right now.
  
Mike Frysinger Aug. 18, 2021, 8:12 a.m. UTC | #3
On 29 Jul 2021 09:53, Florian Weimer via Libc-alpha wrote:
> * Carlos O'Donell via Libc-alpha:
> > The following changes implement a minimally sized C.UTF-8.
> > First we implement the 'strcmp_collation' directive.
> > Then we implement C.UTF-8 with an LC_COLLATE that uses the
> > 'strcmp_collation' directive to support using strcmp for
> > collation i.e. code point sorting. The final C.UTF-8 is
> > only ~396KiB with the largest ~346KiB in LC_CTYPE for all
> > of Unicode.
> >
> > This v4 fixes the regressions detected in Fedora Rawhide
> > here: https://bugzilla.redhat.com/show_bug.cgi?id=1986421
> > Additional testing coverage is provided for fnmatch, regcomp,
> > and regexec (which would have caught the regression).
> 
> From a high-level point of view I wonder if the more conservative choice
> would be to fix the localedef generation for LC_COLLATE, at least for
> this release.  It would also mean that we do not break statically linked
> executables.

glibc already (somewhat regularly) breaks statically linked programs due to
nss incompatibilities.  unless/until we take that seriously, i'm not sure we
should bother expending effort on these trade-offs.  just go with whatever
makes sense long term.
-mike