[v4,4/4] Add generic C.UTF-8 locale (Bug 17318)
Commit Message
We add a new C.UTF-8 locale. This locale is not built into glibc, but
is provided as a distinct locale. The locale provides full support
for UTF-8 and this includes full code point sorting via collation
(excludes surrogates). Unfortunately, given the present implementation
in glibc, this results in 28MiB of LC_COLLATE data for all possible
Unicode code points. Future improvements may reduce this size. Such
improvements likely require a shortcut for the collation data that
relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
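As a rough illustration of why a strcmp shortcut is plausible (this is not
glibc code, and the sample strings and helper names are made up for the
example), the Python sketch below checks that ordering by UTF-8 bytes matches
ordering by code point, that surrogates have no UTF-8 encoding, and how many
scalar values the two collation ranges in the patch cover:

```python
# Illustrative sketch only; this is not glibc code.
# The sample strings are arbitrary and span 1- to 4-byte UTF-8 sequences.
samples = ["z", "a", "\u00ff", "\u0100", "\u0fff", "\u1000", "\uffff",
           "\U00010000", "\U0010ffff", "ab", "a\u00e9"]


def by_code_point(strings):
    # Python compares str objects code point by code point.
    return sorted(strings)


def by_utf8_bytes(strings):
    # Comparing the encoded bytes models what strcmp does on UTF-8 data
    # (strcmp compares bytes as unsigned char).
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def encodes_as_utf8(code_point):
    # Surrogates (U+D800..U+DFFF) are not Unicode scalar values, so the
    # strict UTF-8 encoder rejects them.
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


same_order = by_code_point(samples) == by_utf8_bytes(samples)

# Scalar values covered by <U0000>..<UD7FF> and <UE000>..<U10FFFF>.
scalar_values = 0xD800 + (0x110000 - 0xE000)
```

Here `same_order` should be true for any input free of NUL characters, since
UTF-8 was designed so that byte order preserves code point order.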
The new locale is NOT added to SUPPORTED. Minimal test data for
specific code points (minus those not supported by collate-test) is
provided in C.UTF-8.in, and this verifies code point sorting is
working reasonably across the range.
The next step is to reduce LC_COLLATE to a manageable size before we
enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
twice) so we don't enable full testing of all code points until we can
parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
test data passes cleanly.
Tested on x86_64 and i686 without regression.
---
localedata/C.UTF-8.in | 156 +++++++++++++++++++++++++++++++++++
localedata/Makefile | 2 +
localedata/locales/C | 188 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 346 insertions(+)
create mode 100644 localedata/C.UTF-8.in
create mode 100644 localedata/locales/C
Comments
* Carlos O'Donell:
> We add a new C.UTF-8 locale. This locale is not builtin to glibc, but
> is provided as a distinct locale. The locale provides full support
> for UTF-8 and this includes full code point sorting via collation
> (excludes surrogates). Unfortunately given the present implementation
> in glibc this results in 28MiB of LC_COLLATE data for all possible
> Unicode code points. Future improvements may reduce this size. Such
> improvements likely require a shortcut for the collation data that
> relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
>
> The new locale is NOT added to SUPPORTED. Minimal test data for
> specific code points (minus those not supported by collate-test) is
> provided in C.UTF-8.in, and this verifies code point sorting is
> working reasonably across the range.
>
> The next step is to reduce LC_COLLATE to a manageable size before we
> enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
> add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
> twice) so we don't enable full testing of all code points until we can
> parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
> test data passes cleanly.
Can you compare this locale with what is in Fedora and Debian, for the
non-collation/CTYPE aspects?
Are there other distributions which ship a downstream C.UTF-8 locale?
Thanks,
Florian
On 4/29/21 10:13 AM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>> We add a new C.UTF-8 locale. This locale is not builtin to glibc, but
>> is provided as a distinct locale. The locale provides full support
>> for UTF-8 and this includes full code point sorting via collation
>> (excludes surrogates). Unfortunately given the present implementation
>> in glibc this results in 28MiB of LC_COLLATE data for all possible
>> Unicode code points. Future improvements may reduce this size. Such
>> improvements likely require a shortcut for the collation data that
>> relies on C.UTF-8 single-byte sorting being equivalent to strcmp.
>>
>> The new locale is NOT added to SUPPORTED. Minimal test data for
>> specific code points (minus those not supported by collate-test) is
>> provided in C.UTF-8.in, and this verifies code point sorting is
>> working reasonably across the range.
>>
>> The next step is to reduce LC_COLLATE to a manageable size before we
>> enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
>> add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
>> twice) so we don't enable full testing of all code points until we can
>> parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
>> test data passes cleanly.
>
> Can you compare this locale with what is in Fedora and Debian, for the
> non-collation/CTYPE aspects?
Oh, doing this review in more detail for you found a potential defect.
Thank you for encouraging a more detailed review.
I see that C has the first work day as Monday, but in C.UTF-8 we have
switched to Sunday, possibly by accident, and my initial review didn't
catch this. I'll spin a v5 which is also going to be smaller after this
patch: https://sourceware.org/pipermail/libc-alpha/2021-April/125595.html.
Debian (sid, 2.31-11) vs Upstream:
- LC_IDENTIFICATION, contains old date, maintainer @debian address etc.
  - No substantive differences.
- LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
  - Upstream C.UTF-8 includes no transliteration, all characters pass
    through because UTF-8 supports all such characters.
- LC_COLLATE, split ranges differently, *includes* surrogates, uses UNDEFINED correctly.
  - Upstream C.UTF-8 *excludes* surrogates, but otherwise covers same set.
- LC_MONETARY, copy "POSIX"
  - Upstream C.UTF-8 explicitly defines fields, difference in 'negative_sign'
    where upstream will use "-" and POSIX uses "". This aligns with existing builtin
    C locale.
- LC_NUMERIC, copy "POSIX"
  - Upstream C.UTF-8 explicitly defines fields, no difference from POSIX.
- LC_TIME, first_workday is 2 (otherwise the same)
  - Upstream C.UTF-8 set first_workday to 1, this is a bug in my patch.
- LC_MESSAGES, only defines yesexpr and noexpr.
  - Upstream C.UTF-8 defines yesexpr, noexpr, yesstr and nostr. Superset of data.
- LC_PAPER, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.
- LC_NAME, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.
- LC_ADDRESS, copy "i18n"
  - Upstream C.UTF-8 explicitly defines fields, no differences.
- LC_TELEPHONE, defines tel_int_fmt.
  - Upstream C.UTF-8 explicitly defines tel_int_fmt, no differences.
- LC_MEASUREMENT, copy "i18n"
  - Upstream C.UTF-8 explicitly defines measurement, no differences.
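For the LC_TIME item, it helps to have the reference dates from the locale
source in hand. The Python check below is illustrative only and not part of
the patch, and its helper names are made up. It decodes the YYYYMMDD value
used as the second field of the "week" keyword and reports which weekday it
falls on. How first_weekday and first_workday are then counted relative to
that date is left to the glibc locale documentation.

```python
from datetime import date

# Hypothetical helper names, used only for this illustration.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def parse_week_date(yyyymmdd):
    # The second field of "week" is a Gregorian date written as YYYYMMDD.
    year, rest = divmod(yyyymmdd, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


def weekday_name(yyyymmdd):
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return WEEKDAY_NAMES[parse_week_date(yyyymmdd).weekday()]


sunday_reference = weekday_name(19971130)
monday_reference = weekday_name(19971201)
```

This agrees with the comment in the proposed LC_TIME section, which gives
19971130 as a Sunday and 19971201 as a Monday.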
> Are there other distributions which ship a downstream C.UTF-8 locale?
Yes, Gentoo. I spoke to Andreas Huettel from the gentoo-toolchain team
and they are using Mike Fabian's original C.UTF-8 which is harmonized
and identical (including the first_workday bug) to what I'm proposing.
I think it would be safe for Debian, Ubuntu, Gentoo, Fedora, CentOS Stream,
and RHEL to switch to the new C.UTF-8 locale from upstream.
On 4/29/21 4:05 PM, Carlos O'Donell wrote:
> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
> - Upstream C.UTF-8 includes no transliteration, all characters pass
> through because UTF-8 supports all such characters.
It turns out that this is related to bug 26984.
I was wrong too: the C locale has a builtin set of ~1600 transliterations that
it uses internally (I even reviewed a patch for that, which you committed).
I had completely forgotten about this internal detail.
This transliteration affects converters' ability to use //TRANSLIT, and so I
think we should include all the neutral transliterations, e.g.:
translit_start
include "translit_neutral";""
translit_end
This makes things *better* with respect to harmonization with Debian/Ubuntu.
Thoughts?
In summary:
- POSIX says nothing about transliteration.
- C/POSIX already includes a partial set of ~1600 translit entries, and they
  are largely incomplete. It would be nice to harmonize them with the proper
  translit_neutral set.
- C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
  for conversions from UTF-8 to other charmaps. This would be a superset of
  those offered by C/POSIX.
- Fixing C/POSIX is another issue.
* Carlos O'Donell:
> On 4/29/21 4:05 PM, Carlos O'Donell wrote:
>> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
>> - Upstream C.UTF-8 includes no transliteration, all characters pass
>> through because UTF-8 supports all such characters.
>
> It turns out that this is related to bug 26984.
>
> I was wrong too, the C locale has a builtin set of ~1600 transliterations that
> it uses internally (I even reviewed a patch for that you committed).
> I had completely forgotten about this internal detail.
>
> This transliteration affects converters' ability to use //TRANSLIT, and so I
> think we should include all the neutral transliterations e.g.
>
> translit_start
> include "translit_neutral";""
> translit_end
>
> This makes things *better* with respect to harmonization with Debian/Ubuntu.
>
> Thoughts?
>
> In summary:
> - POSIX says nothing about transliteration.
> - C/POSIX already includes a partial set of ~1600 translit entries, and they
> are largely incomplete. It would be nice to harmonize them with the proper
> translit_neutral set.
> - C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
> for conversions from UTF-8 to other charmaps. This would be a superset of
> those offered by C/POSIX.
> - Fixing C/POSIX is another issue.
I'm in favor of including those transliterations.
Thanks,
Florian
On 4/30/21 2:20 PM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>> On 4/29/21 4:05 PM, Carlos O'Donell wrote:
>>> - LC_CTYPE, includes "translit_combining" which is wrong for a C locale IMO.
>>> - Upstream C.UTF-8 includes no transliteration, all characters pass
>>> through because UTF-8 supports all such characters.
>>
>> It turns out that this is related to bug 26984.
>>
>> I was wrong too, the C locale has a builtin set of ~1600 transliterations that
>> it uses internally (I even reviewed a patch for that you committed).
>> I had completely forgotten about this internal detail.
>>
>> This transliteration affects converters' ability to use //TRANSLIT, and so I
>> think we should include all the neutral transliterations e.g.
>>
>> translit_start
>> include "translit_neutral";""
>> translit_end
>>
>> This makes things *better* with respect to harmonization with Debian/Ubuntu.
>>
>> Thoughts?
>>
>> In summary:
>> - POSIX says nothing about transliteration.
>> - C/POSIX already includes a partial set of ~1600 translit entries, and they
>> are largely incomplete. It would be nice to harmonize them with the proper
>> translit_neutral set.
>> - C.UTF-8 including translit_neutral would bring in ~25,000 translit rules
>> for conversions from UTF-8 to other charmaps. This would be a superset of
>> those offered by C/POSIX.
>> - Fixing C/POSIX is another issue.
>
> I'm in favor of including those transliterations.
Thanks. I'll spin a v5, test and repost.
I need to look at the size impact of the additional transliterations.
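diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
new file mode 100644
--- /dev/null
+++ b/localedata/C.UTF-8.in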
@@ -0,0 +1,156 @@
+ ; <U1>
+ ; <U2>
+ ; <U3>
+ ; <U4>
+ ; <U5>
+ ; <U6>
+ ; <U7>
+ ; <U8>
+ ; <UE>
+ ; <UF>
+ ; <U10>
+ ; <U11>
+ ; <U12>
+ ; <U13>
+ ; <U14>
+ ; <U15>
+ ; <U16>
+ ; <U17>
+ ; <U18>
+ ; <U19>
+ ; <U1A>
+ ; <U1B>
+ ; <U1C>
+ ; <U1D>
+ ; <U1E>
+ ; <U1F>
+! ; <U21>
+" ; <U22>
+# ; <U23>
+$ ; <U24>
+% ; <U25>
+& ; <U26>
+' ; <U27>
+) ; <U29>
+* ; <U2A>
++ ; <U2B>
+, ; <U2C>
+- ; <U2D>
+. ; <U2E>
+/ ; <U2F>
+0 ; <U30>
+1 ; <U31>
+2 ; <U32>
+3 ; <U33>
+4 ; <U34>
+5 ; <U35>
+6 ; <U36>
+7 ; <U37>
+8 ; <U38>
+9 ; <U39>
+< ; <U3C>
+= ; <U3D>
+> ; <U3E>
+? ; <U3F>
+@ ; <U40>
+A ; <U41>
+B ; <U42>
+C ; <U43>
+D ; <U44>
+E ; <U45>
+F ; <U46>
+G ; <U47>
+H ; <U48>
+I ; <U49>
+J ; <U4A>
+K ; <U4B>
+L ; <U4C>
+M ; <U4D>
+N ; <U4E>
+O ; <U4F>
+P ; <U50>
+Q ; <U51>
+R ; <U52>
+S ; <U53>
+T ; <U54>
+U ; <U55>
+V ; <U56>
+W ; <U57>
+X ; <U58>
+Y ; <U59>
+Z ; <U5A>
+[ ; <U5B>
+\ ; <U5C>
+] ; <U5D>
+^ ; <U5E>
+_ ; <U5F>
+` ; <U60>
+a ; <U61>
+b ; <U62>
+c ; <U63>
+d ; <U64>
+e ; <U65>
+f ; <U66>
+g ; <U67>
+h ; <U68>
+i ; <U69>
+j ; <U6A>
+k ; <U6B>
+l ; <U6C>
+m ; <U6D>
+n ; <U6E>
+o ; <U6F>
+p ; <U70>
+q ; <U71>
+r ; <U72>
+s ; <U73>
+t ; <U74>
+u ; <U75>
+v ; <U76>
+w ; <U77>
+x ; <U78>
+y ; <U79>
+z ; <U7A>
+{ ; <U7B>
+| ; <U7C>
+} ; <U7D>
+~ ; <U7E>
+ ; <U7F>
+ ; <U80>
+ÿ ; <UFF>
+Ā ; <U100>
+ ; <UFFF>
+က ; <U1000>
+ ; <UFFFF>
+? ; <U10000>
+? ; <U1FFFF>
+? ; <U20000>
+? ; <U2FFFF>
+? ; <U30000>
+? ; <U3FFFE>
+? ; <U40000>
+? ; <U4FFFF>
+? ; <U50000>
+? ; <U5FFFF>
+? ; <U60000>
+? ; <U6FFFF>
+? ; <U70000>
+? ; <U7FFFF>
+? ; <U80000>
+? ; <U8FFFF>
+? ; <U90000>
+? ; <U9FFFF>
+? ; <UA0000>
+? ; <UAFFFF>
+? ; <UB0000>
+? ; <UBFFFF>
+? ; <UC0001>
+? ; <UCFFCC>
+? ; <UD000E>
+? ; <UDFFFF>
+? ; <UE0001>
+? ; <UEFFFF>
+? ; <UF0001>
+? ; <UFFFFF>
+? ; <U100001>
+? ; <U10FFFF>
diff --git a/localedata/Makefile b/localedata/Makefile
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -47,6 +47,7 @@ test-input := \
bg_BG.UTF-8 \
br_FR.UTF-8 \
bs_BA.UTF-8 \
+ C.UTF-8 \
ckb_IQ.UTF-8 \
cmn_TW.UTF-8 \
crh_UA.UTF-8 \
@@ -206,6 +207,7 @@ LOCALES := \
bg_BG.UTF-8 \
br_FR.UTF-8 \
bs_BA.UTF-8 \
+ C.UTF-8 \
ckb_IQ.UTF-8 \
cmn_TW.UTF-8 \
crh_UA.UTF-8 \
diff --git a/localedata/locales/C b/localedata/locales/C
new file mode 100644
--- /dev/null
+++ b/localedata/locales/C
@@ -0,0 +1,188 @@
+escape_char /
+comment_char %
+% Locale for C locale in UTF-8
+
+LC_IDENTIFICATION
+title "C locale"
+source ""
+address ""
+contact ""
+email "bug-glibc-locales@gnu.org"
+tel ""
+fax ""
+language ""
+territory ""
+revision "2.0"
+date "2020-06-28"
+category "i18n:2012";LC_IDENTIFICATION
+category "i18n:2012";LC_CTYPE
+category "i18n:2012";LC_COLLATE
+category "i18n:2012";LC_TIME
+category "i18n:2012";LC_NUMERIC
+category "i18n:2012";LC_MONETARY
+category "i18n:2012";LC_MESSAGES
+category "i18n:2012";LC_PAPER
+category "i18n:2012";LC_NAME
+category "i18n:2012";LC_ADDRESS
+category "i18n:2012";LC_TELEPHONE
+category "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_CTYPE
+
+% Include only the i18n character type classes without any of the
+% transliteration that i18n uses by default. The C locale has no
+% transliteration and passes all characters through unchanged.
+copy "i18n_ctype"
+
+END LC_CTYPE
+
+% One rule, sort forward, for all Unicode scalar values to give
+% code point order sorting for Unicode (excludes surrogates
+% which are not in the UTF-8 character map).
+LC_COLLATE
+order_start forward
+<U00000000>
+..
+<U0000D7FF>
+% Exclude surrogates <UD800> to <UDFFF> from collation.
+<U0000E000>
+..
+<U0010FFFF>
+UNDEFINED
+order_end
+END LC_COLLATE
+
+LC_MONETARY
+
+% This is the 14652 i18n fdcc-set definition for the LC_MONETARY
+% category (except for the int_curr_symbol and currency_symbol, they are
+% empty in the 14652 i18n fdcc-set definition and also empty in
+% glibc/locale/C-monetary.c.).
+int_curr_symbol ""
+currency_symbol ""
+mon_decimal_point "."
+mon_thousands_sep ""
+mon_grouping -1
+positive_sign ""
+negative_sign "-"
+int_frac_digits -1
+frac_digits -1
+p_cs_precedes -1
+int_p_sep_by_space -1
+p_sep_by_space -1
+n_cs_precedes -1
+int_n_sep_by_space -1
+n_sep_by_space -1
+p_sign_posn -1
+n_sign_posn -1
+%
+END LC_MONETARY
+
+LC_NUMERIC
+% This is the POSIX Locale definition for
+% the LC_NUMERIC category.
+%
+decimal_point "."
+thousands_sep ""
+grouping -1
+END LC_NUMERIC
+
+LC_TIME
+% This is the POSIX Locale definition for the LC_TIME category with the
+% exception that time is per ISO 8601 and 24-hour.
+%
+% Abbreviated weekday names (%a)
+abday "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
+
+% Full weekday names (%A)
+day "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
+ "Friday";"Saturday"
+
+% Abbreviated month names (%b)
+abmon "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
+ "Oct";"Nov";"Dec"
+
+% Full month names (%B)
+mon "January";"February";"March";"April";"May";"June";"July";/
+ "August";"September";"October";"November";"December"
+
+% Week description, consists of three fields:
+% 1. Number of days in a week.
+% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
+% 3. The weekday number to be contained in the first week of the year.
+%
+% ISO 8601 conforming applications should use the values 7, 19971201 (a
+% Monday), and 4 (Thursday), respectively.
+week 7;19971201;4
+first_weekday 1
+first_workday 1
+
+% Appropriate date and time representation (%c)
+d_t_fmt "%a %b %e %H:%M:%S %Y"
+
+% Appropriate date representation (%x)
+d_fmt "%m/%d/%y"
+
+% Appropriate time representation (%X)
+t_fmt "%H:%M:%S"
+
+% Appropriate AM/PM time representation (%r)
+t_fmt_ampm "%I:%M:%S %p"
+
+% Equivalent of AM/PM (%p)
+am_pm "AM";"PM"
+
+% Appropriate date representation (date(1)) "%a %b %e %H:%M:%S %Z %Y"
+date_fmt "%a %b %e %H:%M:%S %Z %Y"
+END LC_TIME
+
+LC_MESSAGES
+% This is the POSIX Locale definition for
+% the LC_MESSAGES category.
+%
+yesexpr "^[yY]"
+noexpr "^[nN]"
+yesstr "Yes"
+nostr "No"
+END LC_MESSAGES
+
+LC_PAPER
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_PAPER category.
+% (A4 paper, this is also used in the built in C/POSIX
+% locale in glibc/locale/C-paper.c)
+height 297
+width 210
+END LC_PAPER
+
+LC_NAME
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_NAME category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-name.c)
+name_fmt "%p%t%g%t%m%t%f"
+END LC_NAME
+
+LC_ADDRESS
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_ADDRESS category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-address.c)
+postal_fmt "%a%N%f%N%d%N%b%N%s %h %e %r%N%C-%z %T%N%c%N"
+END LC_ADDRESS
+
+LC_TELEPHONE
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_TELEPHONE category.
+% "+%c %a %l"
+tel_int_fmt "+%c %a %l"
+% (also used in the built in C/POSIX locale in glibc/locale/C-telephone.c)
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_MEASUREMENT category.
+% (same as in the built in C/POSIX locale in glibc/locale/C-measurement.c)
+%metric
+measurement 1
+END LC_MEASUREMENT
+