[BZ,14094] Update LC_CTYPE character class data to Unicode 7.0.0
Commit Message
1) 0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch
Patch to update the character class data in
glibc/localedata/locales/i18n. The patch includes the 2 scripts
gen-unicode-ctype.py and ctype-compatibility.py.
2) 0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch
After applying 1), building glibc, and running “make check”,
the test localedata/tst-ctype fails. See:
https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c34
I believe the test is wrong. Therefore, this patch fixes the test.
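To make the changed flag line in patch 2) easier to review, here is a minimal sketch that maps each "1" flag back to its character. It assumes that the first test line covers the 48 ISO-8859-1 bytes starting at 0xA0, which is what the diff suggests; the helper name flagged_chars is made up for this illustration and is not part of the test suite.

```python
def flagged_chars(flags, first_byte, encoding='iso-8859-1'):
    """Return the characters whose flag is '1'.

    `first_byte` is the byte value of the character in column 0
    (assumed here to be 0xA0 for the first line of the test file).
    """
    return [bytes([first_byte + offset]).decode(encoding)
            for offset, flag in enumerate(flags) if flag == '1']


# Flag lines copied from the diff in patch 2).
OLD_FLAGS = '000000000000000000000100000000000000000000000000'
NEW_FLAGS = '000000000010000000000100001000000000000000000000'

print(flagged_chars(OLD_FLAGS, 0xA0))  # ['µ']
print(flagged_chars(NEW_FLAGS, 0xA0))  # ['ª', 'µ', 'º']
```

Under that assumption, the only change is that U+00AA and U+00BA are now expected to be "lower", in addition to U+00B5.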
3) gen-unicode-ctype.py
(Included in the above patch, attached separately here as well for easier
review).
Script to generate the new character class data for LC_CTYPE from
the Unicode data.
Usage of the script:
python3 ./gen-unicode-ctype.py -u UnicodeData.txt -d DerivedCoreProperties.txt -i locales/i18n -o locales/i18n-new --unicode_version 7.0.0
Everything in the original glibc/localedata/locales/i18n file (given
with the -i option) except the "date" stamp and the LC_CTYPE
character class data is preserved and copied unchanged into the new
file (given with the -o option). The character class data is replaced
with the data from UnicodeData.txt and DerivedCoreProperties.txt from
Unicode 7.0.0.
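The "preserve everything except the date stamp and the class data" step could look roughly like the sketch below. It is not taken from the script; it assumes that the class data sits between a line starting with "LC_CTYPE" and a line starting with "translit_start", and the function name split_head_tail is hypothetical.

```python
import re

# A date stamp line such as:  date       "2000-01-01"
DATE_LINE = re.compile(r'^(?P<key>date\s+)"[0-9]{4}-[0-9]{2}-[0-9]{2}"')


def split_head_tail(text, new_date):
    """Return (head, tail) of a locale source file.

    head: everything up to and including the LC_CTYPE line, with the
          date stamp replaced by `new_date`.
    tail: everything from the translit_start line to the end.
    The old class data between the two is dropped.
    """
    lines = text.splitlines(keepends=True)
    head = []
    index = 0
    while index < len(lines):
        # Replace only the quoted date, keep the rest of the line.
        line = DATE_LINE.sub(
            lambda m: '{}"{}"'.format(m.group('key'), new_date),
            lines[index])
        head.append(line)
        index += 1
        if line.startswith('LC_CTYPE'):
            break
    # Skip the old class data (assumed to end at translit_start).
    while index < len(lines) and not lines[index].startswith('translit_start'):
        index += 1
    return ''.join(head), ''.join(lines[index:])
```

The new file would then be head, followed by the freshly generated class data, followed by tail.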
The script is based on Bruno Haible’s gen-unicode-ctype.c program,
rewritten in Python 3 and extended to use DerivedCoreProperties.txt as
well for the character classes “alpha”, “lower”, and “upper”.
It also considers all non-ASCII digits as alphabetic, just like
Bruno’s original gen-unicode-ctype.c, because ISO C 99 forbids us to
have them in the category “digit” but we want “isalnum” to return
true on them.
It treats title case characters as both “upper” and
“lower” (also the same as Bruno’s gen-unicode-ctype.c).
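The two rules above (non-ASCII digits count as “alpha”, title case counts as both “upper” and “lower”) can be sketched as follows. This is a simplification, not the script itself: it looks only at the general category from Python's unicodedata module, whose Unicode version depends on the Python release and is probably not 7.0.0, and it ignores DerivedCoreProperties.txt and all special cases.

```python
import unicodedata


def is_digit(code_point):
    """Only ASCII 0-9, since ISO C 99 allows nothing else in "digit"."""
    return 0x30 <= code_point <= 0x39


def is_alpha(code_point):
    """Letters, plus non-ASCII decimal digits (simplified rule)."""
    category = unicodedata.category(chr(code_point))
    return category.startswith('L') or (
        category == 'Nd' and not is_digit(code_point))


def is_upper(code_point):
    """Upper case and title case letters (simplified rule)."""
    return unicodedata.category(chr(code_point)) in ('Lu', 'Lt')


def is_lower(code_point):
    """Lower case and title case letters (simplified rule)."""
    return unicodedata.category(chr(code_point)) in ('Ll', 'Lt')


def is_alnum(code_point):
    return is_alpha(code_point) or is_digit(code_point)


# U+0661 ARABIC-INDIC DIGIT ONE: not "digit", but still alphanumeric.
print(is_digit(0x0661), is_alnum(0x0661))
# U+01C5 is a title case letter (category Lt): both "upper" and "lower".
print(is_upper(0x01C5), is_lower(0x01C5))
```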
4) ctype-compatibility.py
(Included in the above patch, attached separately here as well for easier
review).
A Python script to compare the old and the new i18n file and check
for errors. A sort of test suite for gen-unicode-ctype.py.
Currently this test reports 11 “errors” in the new file, see:
https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c29
All these 11 “errors” are because of a disagreement between this
part of Bruno’s gen-unicode-ctype.c:
is_alpha (unsigned int ch)
{
  return (unicode_attributes[ch].name != NULL
          && ((unicode_attributes[ch].category[0] == 'L'
               /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
                  <U0E2F>, <U0E46> should belong to is_punct.  */
               && (ch != 0x0E2F) && (ch != 0x0E46))
              /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
                 <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are is_alpha.  */
              || (ch == 0x0E31)
              || (ch >= 0x0E34 && ch <= 0x0E3A)
              || (ch >= 0x0E47 && ch <= 0x0E4E)
and Unicode’s DerivedCoreProperties.txt.
According to DerivedCoreProperties.txt, <U0E2F>, <U0E46> are
“Alphabetic”. And <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are
*not* “Alphabetic” according to DerivedCoreProperties.txt.
I wrote to Bruno Haible and Theppitak Karoonboonyanan by e-mail
but got no response.
I assume DerivedCoreProperties.txt is more trustworthy.
If we trust DerivedCoreProperties.txt, these 11 reports are not real
errors, and ctype-compatibility.py finds no other errors.
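For anyone who wants to check such claims against the Unicode data directly, here is a minimal sketch of a parser for lines in the DerivedCoreProperties.txt format. The function name code_points_with_property is hypothetical and this is not the code used in gen-unicode-ctype.py. The two sample lines are the ones quoted in the commit message of patch 2/2.

```python
import re

# Matches "XXXX ; Property" and "XXXX..YYYY ; Property" lines.
# Comment lines starting with '#' and blank lines do not match.
PROPERTY_LINE = re.compile(
    r'^(?P<first>[0-9A-F]{4,6})'
    r'(?:\.\.(?P<last>[0-9A-F]{4,6}))?'
    r'\s*;\s*(?P<prop>\w+)')


def code_points_with_property(lines, prop):
    """Collect every code point that `lines` assign to property `prop`."""
    result = set()
    for line in lines:
        match = PROPERTY_LINE.match(line)
        if not match or match.group('prop') != prop:
            continue
        first = int(match.group('first'), 16)
        last = int(match.group('last') or match.group('first'), 16)
        result.update(range(first, last + 1))
    return result


sample = [
    '00AA ; Lowercase # Lo FEMININE ORDINAL INDICATOR',
    '00BA ; Lowercase # Lo MASCULINE ORDINAL INDICATOR',
]
lowercase = code_points_with_property(sample, 'Lowercase')
print(sorted(hex(cp) for cp in lowercase))  # ['0xaa', '0xba']

# With the real file, the Thai code points discussed above could be
# checked like this (path is an assumption, no output claimed here):
#   with open('DerivedCoreProperties.txt', encoding='utf-8') as f:
#       alphabetic = code_points_with_property(f, 'Alphabetic')
#   print(0x0E2F in alphabetic, 0x0E47 in alphabetic)
```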
From 25c913674386011a44b6270579a894b2e8200d25 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Wed, 3 Dec 2014 10:05:42 +0100
Subject: [PATCH 2/2] Fix test case localedata/tst-ctype-de_DE.ISO-8859-1.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
DerivedCoreProperties.txt from Unicode 7.0.0 lists
the characters U+00AA (ª) and U+00BA (º) as lower case:
00AA ; Lowercase # Lo FEMININE ORDINAL INDICATOR
00BA ; Lowercase # Lo MASCULINE ORDINAL INDICATOR
---
localedata/tst-ctype-de_DE.ISO-8859-1.in | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
@@ -1,5 +1,5 @@
lower  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
- 000000000000000000000100000000000000000000000000
+ 000000000010000000000100001000000000000000000000
lower ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
000000000000000111111111111111111111111011111111
upper  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
--
1.9.3