[v4,4/4] Add generic C.UTF-8 locale (Bug 17318)

Message ID 20210428130033.3196848-5-carlos@redhat.com
State Superseded
Series Add new C.UTF-8 locale (Bug 17318)

Commit Message

Carlos O'Donell April 28, 2021, 1 p.m. UTC
  We add a new C.UTF-8 locale.  This locale is not builtin to glibc, but
is provided as a distinct locale.  The locale provides full support
for UTF-8 and this includes full code point sorting via collation
(excludes surrogates).  Unfortunately, given the present implementation
in glibc this results in 28MiB of LC_COLLATE data for all possible
Unicode code points.  Future improvements may reduce this size. Such
improvements likely require a shortcut for the collation data that
relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
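
As a minimal sketch of the equivalence that shortcut would rely on (not
part of the patch): comparing UTF-8 byte by byte, which is what strcmp
does on NUL-free strings, gives the same order as comparing by code
point.  The sample strings are arbitrary, and Python's str comparison
stands in for code point order here.

```python
# Samples straddle the 1-, 2-, 3- and 4-byte UTF-8 boundaries.
# Surrogates (U+D800..U+DFFF) are left out: they are not Unicode scalar
# values and cannot be encoded as UTF-8.
SAMPLES = [
    "A", "a", "\x7f", "\x80", "\u07ff", "\u0800", "\ud7ff", "\ue000",
    "\uffff", "\U00010000", "\U0010ffff", "ab\u00e9", "ab\U0001f600", "abz",
]


def code_point_order(strings):
    # Python compares str objects code point by code point.
    return sorted(strings)


def byte_order(strings):
    # bytes objects compare as unsigned bytes, like strcmp on NUL-free data.
    return sorted(strings, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    print(code_point_order(SAMPLES) == byte_order(SAMPLES))
```

This only illustrates the ordering property; it says nothing about how
glibc would store or skip the collation tables.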

The new locale is NOT added to SUPPORTED.  Minimal test data for
specific code points (minus those not supported by collate-test) is
provided in C.UTF-8.in, and this verifies code point sorting is
working reasonably across the range.

The next step is to reduce LC_COLLATE to a manageable size before we
enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
twice) so we don't enable full testing of all code points until we can
parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
test data passes cleanly.

Tested on x86_64 and i686 without regression.
---
 localedata/C.UTF-8.in | 156 +++++++++++++++++++++++++++++++++++
 localedata/Makefile   |   2 +
 localedata/locales/C  | 188 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 346 insertions(+)
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C
  

Comments

Florian Weimer April 29, 2021, 2:13 p.m. UTC | #1
* Carlos O'Donell:

> We add a new C.UTF-8 locale.  This locale is not builtin to glibc, but
> is provided as a distinct locale.  The locale provides full support
> for UTF-8 and this includes full code point sorting via collation
> (excludes surrogates).  Unfortunately, given the present implementation
> in glibc this results in 28MiB of LC_COLLATE data for all possible
> Unicode code points.  Future improvements may reduce this size. Such
> improvements likely require a shortcut for the collation data that
> relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
>
> The new locale is NOT added to SUPPORTED.  Minimal test data for
> specific code points (minus those not supported by collate-test) is
> provided in C.UTF-8.in, and this verifies code point sorting is
> working reasonably across the range.
>
> The next step is to reduce LC_COLLATE to a manageable size before we
> enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
> add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
> twice) so we don't enable full testing of all code points until we can
> parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
> test data passes cleanly.

Can you compare this locale with what is in Fedora and Debian, for the
non-collation/CTYPE aspects?

Are there other distributions which ship a downstream C.UTF-8 locale?

Thanks,
Florian
  
Carlos O'Donell April 29, 2021, 8:05 p.m. UTC | #2
On 4/29/21 10:13 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> We add a new C.UTF-8 locale.  This locale is not builtin to glibc, but
>> is provided as a distinct locale.  The locale provides full support
>> for UTF-8 and this includes full code point sorting via collation
>> (excludes surrogates).  Unfortunately, given the present implementation
>> in glibc this results in 28MiB of LC_COLLATE data for all possible
>> Unicode code points.  Future improvements may reduce this size. Such
>> improvements likely require a shortcut for the collation data that
>> relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
>>
>> The new locale is NOT added to SUPPORTED.  Minimal test data for
>> specific code points (minus those not supported by collate-test) is
>> provided in C.UTF-8.in, and this verifies code point sorting is
>> working reasonably across the range.
>>
>> The next step is to reduce LC_COLLATE to a manageable size before we
>> enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
>> add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
>> twice) so we don't enable full testing of all code points until we can
>> parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
>> test data passes cleanly.
> 
> Can you compare this locale with what is in Fedora and Debian, for the
> non-collation/CTYPE aspects?

Oh, doing this review in more detail for you found a potential defect.
Thank you for encouraging a more detailed review.

I see that C has the first work day as Monday, but in C.UTF-8 we have
switched to Sunday, possibly by accident, and my initial review didn't
catch this. I'll spin a v5 which is also going to be smaller after this
patch: https://sourceware.org/pipermail/libc-alpha/2021-April/125595.html.
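
As a quick sanity check on the reference dates named in the LC_TIME
comments of the patch (19971130 for Sunday, 19971201 for Monday), the
sketch below asks Python's datetime for the weekday of each.  It only
checks the calendar; it makes no claim about how glibc interprets
first_weekday or first_workday.  The helper name is made up for
illustration.

```python
from datetime import datetime

# datetime.weekday() numbers Monday as 0 and Sunday as 6.
NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]


def weekday_name(yyyymmdd):
    """Return the English weekday name for a date written as YYYYMMDD."""
    return NAMES[datetime.strptime(yyyymmdd, "%Y%m%d").weekday()]


if __name__ == "__main__":
    for stamp in ("19971130", "19971201"):
        print(stamp, weekday_name(stamp))
```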

Debian (sid, 2.31-11) vs Upstream:
- LC_IDENTIFICATION, contains old date, maintainer @debian address etc.
  - No substantive differences.
- LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
  - Upstream C.UTF-8 includes no transliteration, all characters pass
    through because UTF-8 supports all such characters.
- LC_COLLATE, split ranges differently, *includes* surrogates, uses UNDEFINED correctly.
  - Upstream C.UTF-8 *excludes* surrogates, but otherwise covers same set.
- LC_MONETARY, copy "POSIX"
  - Upstream C.UTF-8 explicitly defines fields, difference in 'negative_sign'
    where upstream will use "-" and POSIX uses "". This aligns with existing builtin
    C locale.
- LC_NUMERIC, copy "POSIX"
  - Upstream C.UTF-8 explicitly defines fields, no difference from POSIX.
- LC_TIME, first_workday is 2 (otherwise the same)
  - Upstream C.UTF-8 sets first_workday to 1, this is a bug in my patch.
- LC_MESSAGES, only defines yesexpr and noexpr.
  - Upstream C.UTF-8 defines yesexpr, noexpr, yesstr and nostr. Superset of data.
- LC_PAPER, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.
- LC_NAME, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.
- LC_ADDRESS, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.
- LC_TELEPHONE, defines tel_int_fmt.
  - Upstream C.UTF-8 explicitly defines tel_int_fmt, no differences.
- LC_MEASUREMENT, copy "i18n"
  - Upstream C.UTF-8 explicitly defines measurement, no differences.
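
To make the LC_COLLATE item concrete, here is a rough sketch (not taken
from either locale source) that counts the code points covered by the
two upstream ranges, U+0000..U+D7FF and U+E000..U+10FFFF, and shows that
a surrogate has no UTF-8 encoding.  The function names are made up for
illustration.

```python
# The two inclusive ranges listed in the upstream LC_COLLATE section.
RANGES = [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)]


def covered(ranges):
    """Number of code points in a list of inclusive (first, last) ranges."""
    return sum(last - first + 1 for first, last in ranges)


def encodes_as_utf8(code_point):
    """True if Python's strict UTF-8 codec can encode the code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(covered(RANGES))          # 1112064, every Unicode scalar value
    print(encodes_as_utf8(0xD800))  # False, a surrogate
    print(encodes_as_utf8(0xE000))  # True
```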

> Are there other distributions which ship a downstream C.UTF-8 locale?

Yes, Gentoo. I spoke to Andreas Huettel from the gentoo-toolchain team
and they are using Mike Fabian's original C.UTF-8 which is harmonized
and identical (including the first_workday bug) to what I'm proposing.

I think it would be safe for Debian, Ubuntu, Gentoo, Fedora, CentOS Stream,
and RHEL to switch to the new C.UTF-8 locale from upstream.
  
Carlos O'Donell April 30, 2021, 5:59 p.m. UTC | #3
On 4/29/21 4:05 PM, Carlos O'Donell wrote:
> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
>   - Upstream C.UTF-8 includes no transliteration, all characters pass
>     through because UTF-8 supports all such characters.

It turns out that this is related to bug 26984.

I was wrong too, the C locale has a builtin set of ~1600 transliterations that
it uses internally (I even reviewed a patch for that you committed).
I had completely forgotten about this internal detail.

This transliteration affects converters' ability to use //TRANSLIT, and so I
think we should include all the neutral transliterations e.g.

translit_start
include "translit_neutral";""
translit_end

This makes things *better* with respect to harmonization with Debian/Ubuntu.

Thoughts?

In summary:
- POSIX says nothing about transliteration.
- C/POSIX already includes a partial set of ~1600 translit entries, and they
  are largely incomplete. It would be nice to harmonize them with the proper
  translit_neutral set.
- C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
  for conversions from UTF-8 to other charmaps. This would be a superset of
  those offered by C/POSIX.
- Fixing C/POSIX is another issue.
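
For readers unfamiliar with what a transliteration table buys a
converter, here is a toy sketch of the idea behind //TRANSLIT.  The
table is hypothetical and chosen for illustration; it is not copied
from translit_neutral, and glibc's iconv is not implemented this way.

```python
# Hypothetical replacement table.  Real rules live in a locale's
# translit_start/translit_end section.
TOY_TRANSLIT = {
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}


def to_ascii(text, table=TOY_TRANSLIT, fallback="?"):
    """Convert text to ASCII, substituting table entries where possible."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # already ASCII, pass through unchanged
        else:
            out.append(table.get(ch, fallback))  # no rule: use the fallback
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii("\u201cwait\u2026\u201d"))
    print(to_ascii("caf\u00e9"))
```

A larger table means fewer characters fall through to the fallback,
which is the practical effect of moving from ~1600 to ~25,000 rules.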
  
Florian Weimer April 30, 2021, 6:20 p.m. UTC | #4
* Carlos O'Donell:

> On 4/29/21 4:05 PM, Carlos O'Donell wrote:
>> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
>>   - Upstream C.UTF-8 includes no transliteration, all characters pass
>>     through because UTF-8 supports all such characters.
>
> It turns out that this is related to bug 26984.
>
> I was wrong too, the C locale has a builtin set of ~1600 transliterations that
> it uses internally (I even reviewed a patch for that you committed).
> I had completely forgotten about this internal detail.
>
> This transliteration affects converters' ability to use //TRANSLIT, and so I
> think we should include all the neutral transliterations e.g.
>
> translit_start
> include "translit_neutral";""
> translit_end
>
> This makes things *better* with respect to harmonization with Debian/Ubuntu.
>
> Thoughts?
>
> In summary:
> - POSIX says nothing about transliteration.
> - C/POSIX already includes a partial set of ~1600 translit entries, and they
>   are largely incomplete. It would be nice to harmonize them with the proper
>   translit_neutral set.
> - C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
>   for conversions from UTF-8 to other charmaps. This would be a superset of
>   those offered by C/POSIX.
> - Fixing C/POSIX is another issue.

I'm in favor of including those transliterations.

Thanks,
Florian
  
Carlos O'Donell May 2, 2021, 7:18 p.m. UTC | #5
On 4/30/21 2:20 PM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> On 4/29/21 4:05 PM, Carlos O'Donell wrote:
>>> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
>>>   - Upstream C.UTF-8 includes no transliteration, all characters pass
>>>     through because UTF-8 supports all such characters.
>>
>> It turns out that this is related to bug 26984.
>>
>> I was wrong too, the C locale has a builtin set of ~1600 transliterations that
>> it uses internally (I even reviewed a patch for that you committed).
>> I had completely forgotten about this internal detail.
>>
>> This transliteration affects converters' ability to use //TRANSLIT, and so I
>> think we should include all the neutral transliterations e.g.
>>
>> translit_start
>> include "translit_neutral";""
>> translit_end
>>
>> This makes things *better* with respect to harmonization with Debian/Ubuntu.
>>
>> Thoughts?
>>
>> In summary:
>> - POSIX says nothing about transliteration.
>> - C/POSIX already includes a partial set of ~1600 translit entries, and they
>>   are largely incomplete. It would be nice to harmonize them with the proper
>>   translit_neutral set.
>> - C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
>>   for conversions from UTF-8 to other charmaps. This would be a superset of
>>   those offered by C/POSIX.
>> - Fixing C/POSIX is another issue.
> 
> I'm in favor of including those transliterations.

Thanks. I'll spin a v5, test and repost.

I need to look at the size impact of the additional transliterations.
  

Patch

diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
new file mode 100644
index 0000000000..b8764a4e04
--- /dev/null
+++ b/localedata/C.UTF-8.in
@@ -0,0 +1,156 @@ 
+ ; <U1>
+ ; <U2>
+ ; <U3>
+ ; <U4>
+ ; <U5>
+ ; <U6>
+ ; <U7>
+ ; <U8>
+ ; <UE>
+ ; <UF>
+ ; <U10>
+ ; <U11>
+ ; <U12>
+ ; <U13>
+ ; <U14>
+ ; <U15>
+ ; <U16>
+ ; <U17>
+ ; <U18>
+ ; <U19>
+ ; <U1A>
+ ; <U1B>
+ ; <U1C>
+ ; <U1D>
+ ; <U1E>
+ ; <U1F>
+! ; <U21>
+" ; <U22>
+# ; <U23>
+$ ; <U24>
+% ; <U25>
+& ; <U26>
+' ; <U27>
+) ; <U29>
+* ; <U2A>
++ ; <U2B>
+, ; <U2C>
+- ; <U2D>
+. ; <U2E>
+/ ; <U2F>
+0 ; <U30>
+1 ; <U31>
+2 ; <U32>
+3 ; <U33>
+4 ; <U34>
+5 ; <U35>
+6 ; <U36>
+7 ; <U37>
+8 ; <U38>
+9 ; <U39>
+< ; <U3C>
+= ; <U3D>
+> ; <U3E>
+? ; <U3F>
+@ ; <U40>
+A ; <U41>
+B ; <U42>
+C ; <U43>
+D ; <U44>
+E ; <U45>
+F ; <U46>
+G ; <U47>
+H ; <U48>
+I ; <U49>
+J ; <U4A>
+K ; <U4B>
+L ; <U4C>
+M ; <U4D>
+N ; <U4E>
+O ; <U4F>
+P ; <U50>
+Q ; <U51>
+R ; <U52>
+S ; <U53>
+T ; <U54>
+U ; <U55>
+V ; <U56>
+W ; <U57>
+X ; <U58>
+Y ; <U59>
+Z ; <U5A>
+[ ; <U5B>
+\ ; <U5C>
+] ; <U5D>
+^ ; <U5E>
+_ ; <U5F>
+` ; <U60>
+a ; <U61>
+b ; <U62>
+c ; <U63>
+d ; <U64>
+e ; <U65>
+f ; <U66>
+g ; <U67>
+h ; <U68>
+i ; <U69>
+j ; <U6A>
+k ; <U6B>
+l ; <U6C>
+m ; <U6D>
+n ; <U6E>
+o ; <U6F>
+p ; <U70>
+q ; <U71>
+r ; <U72>
+s ; <U73>
+t ; <U74>
+u ; <U75>
+v ; <U76>
+w ; <U77>
+x ; <U78>
+y ; <U79>
+z ; <U7A>
+{ ; <U7B>
+| ; <U7C>
+} ; <U7D>
+~ ; <U7E>
+ ; <U7F>
+€ ; <U80>
+ÿ ; <UFF>
+Ā ; <U100>
+࿿ ; <UFFF>
+က ; <U1000>
+￿ ; <UFFFF>
+? ; <U10000>
+? ; <U1FFFF>
+? ; <U20000>
+? ; <U2FFFF>
+? ; <U30000>
+? ; <U3FFFE>
+? ; <U40000>
+? ; <U4FFFF>
+? ; <U50000>
+? ; <U5FFFF>
+? ; <U60000>
+? ; <U6FFFF>
+? ; <U70000>
+? ; <U7FFFF>
+? ; <U80000>
+? ; <U8FFFF>
+? ; <U90000>
+? ; <U9FFFF>
+? ; <UA0000>
+? ; <UAFFFF>
+? ; <UB0000>
+? ; <UBFFFF>
+? ; <UC0001>
+? ; <UCFFCC>
+? ; <UD000E>
+? ; <UDFFFF>
+? ; <UE0001>
+? ; <UEFFFF>
+? ; <UF0001>
+? ; <UFFFFF>
+? ; <U100001>
+? ; <U10FFFF>
diff --git a/localedata/Makefile b/localedata/Makefile
index 14e04cd3c5..38017f2c4c 100644
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -47,6 +47,7 @@  test-input := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
@@ -206,6 +207,7 @@  LOCALES := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
diff --git a/localedata/locales/C b/localedata/locales/C
new file mode 100644
index 0000000000..67e5bd913b
--- /dev/null
+++ b/localedata/locales/C
@@ -0,0 +1,188 @@ 
+escape_char /
+comment_char %
+% Locale for C locale in UTF-8
+
+LC_IDENTIFICATION
+title      "C locale"
+source     ""
+address    ""
+contact    ""
+email      "bug-glibc-locales@gnu.org"
+tel        ""
+fax        ""
+language   ""
+territory  ""
+revision   "2.0"
+date       "2020-06-28"
+category  "i18n:2012";LC_IDENTIFICATION
+category  "i18n:2012";LC_CTYPE
+category  "i18n:2012";LC_COLLATE
+category  "i18n:2012";LC_TIME
+category  "i18n:2012";LC_NUMERIC
+category  "i18n:2012";LC_MONETARY
+category  "i18n:2012";LC_MESSAGES
+category  "i18n:2012";LC_PAPER
+category  "i18n:2012";LC_NAME
+category  "i18n:2012";LC_ADDRESS
+category  "i18n:2012";LC_TELEPHONE
+category  "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_CTYPE
+
+% Include only the i18n character type classes without any of the
+% transliteration that i18n uses by default.  The C locale has no
+% transliteration and passes all characters through unchanged.
+copy "i18n_ctype"
+
+END LC_CTYPE
+
+% One rule, sort forward, for all Unicode scalar values to give
+% code point order sorting for Unicode (excludes surrogates
+% which are not in the UTF-8 character map).
+LC_COLLATE
+order_start forward
+<U00000000>
+..
+<U0000D7FF>
+% Exclude surrogates <UD800> to <UDFFF> from collation.
+<U0000E000>
+..
+<U0010FFFF>
+UNDEFINED
+order_end
+END LC_COLLATE
+
+LC_MONETARY
+
+% This is the 14652 i18n fdcc-set definition for the LC_MONETARY
+% category (except for the int_curr_symbol and currency_symbol, they are
+% empty in the 14652 i18n fdcc-set definition and also empty in
+% glibc/locale/C-monetary.c.).
+int_curr_symbol     ""
+currency_symbol     ""
+mon_decimal_point   "."
+mon_thousands_sep   ""
+mon_grouping        -1
+positive_sign       ""
+negative_sign       "-"
+int_frac_digits     -1
+frac_digits         -1
+p_cs_precedes       -1
+int_p_sep_by_space  -1
+p_sep_by_space      -1
+n_cs_precedes       -1
+int_n_sep_by_space  -1
+n_sep_by_space      -1
+p_sign_posn         -1
+n_sign_posn         -1
+%
+END LC_MONETARY
+
+LC_NUMERIC
+% This is the POSIX Locale definition for
+% the LC_NUMERIC category.
+%
+decimal_point   "."
+thousands_sep   ""
+grouping        -1
+END LC_NUMERIC
+
+LC_TIME
+% This is the POSIX Locale definition for the LC_TIME category with the
+% exception that time is per ISO 8601 and 24-hour.
+%
+% Abbreviated weekday names (%a)
+abday       "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
+
+% Full weekday names (%A)
+day         "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
+            "Friday";"Saturday"
+
+% Abbreviated month names (%b)
+abmon       "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
+            "Oct";"Nov";"Dec"
+
+% Full month names (%B)
+mon         "January";"February";"March";"April";"May";"June";"July";/
+            "August";"September";"October";"November";"December"
+
+% Week description, consists of three fields:
+% 1. Number of days in a week.
+% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
+% 3. The weekday number to be contained in the first week of the year.
+%
+% ISO 8601 conforming applications should use the values 7, 19971201 (a
+% Monday), and 4 (Thursday), respectively.
+week    7;19971201;4
+first_weekday	1
+first_workday	1
+
+% Appropriate date and time representation (%c)
+d_t_fmt "%a %b %e %H:%M:%S %Y"
+
+% Appropriate date representation (%x)
+d_fmt   "%m/%d/%y"
+
+% Appropriate time representation (%X)
+t_fmt   "%H:%M:%S"
+
+% Appropriate AM/PM time representation (%r)
+t_fmt_ampm "%I:%M:%S %p"
+
+% Equivalent of AM/PM (%p)
+am_pm	"AM";"PM"
+
+% Appropriate date representation (date(1))   "%a %b %e %H:%M:%S %Z %Y"
+date_fmt	"%a %b %e %H:%M:%S %Z %Y"
+END LC_TIME
+
+LC_MESSAGES
+% This is the POSIX Locale definition for
+% the LC_MESSAGES category.
+%
+yesexpr "^[yY]"
+noexpr  "^[nN]"
+yesstr  "Yes"
+nostr   "No"
+END LC_MESSAGES
+
+LC_PAPER
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_PAPER category.
+% (A4 paper, this is also used in the built in C/POSIX
+% locale in glibc/locale/C-paper.c)
+height   297
+width    210
+END LC_PAPER
+
+LC_NAME
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_NAME category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-name.c)
+name_fmt    "%p%t%g%t%m%t%f"
+END LC_NAME
+
+LC_ADDRESS
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_ADDRESS category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-address.c)
+postal_fmt    "%a%N%f%N%d%N%b%N%s %h %e %r%N%C-%z %T%N%c%N"
+END LC_ADDRESS
+
+LC_TELEPHONE
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_TELEPHONE category.
+% "+%c %a %l"
+tel_int_fmt    "+%c %a %l"
+% (also used in the built in C/POSIX locale in glibc/locale/C-telephone.c)
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_MEASUREMENT category.
+% (same as in the built in C/POSIX locale in glibc/locale/C-measurement.c)
+%metric
+measurement    1
+END LC_MEASUREMENT
+