From: Mike FABIAN
To: libc-alpha@sourceware.org
Cc: Pravin Satpute
Subject: [PATCH] [BZ 14094] Update LC_CTYPE character class data to Unicode 7.0.0
Date: Wed, 03 Dec 2014 16:02:24 +0100
X-Patchwork-Id: 4053

1) 0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch

   Patch to update the character class data in
   glibc/localedata/locales/i18n. The patch includes the two scripts
   gen-unicode-ctype.py and ctype-compatibility.py.

2) 0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch

   After applying 3), building glibc and running “make check”, the
   test localedata/tst-ctype fails. See:

   https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c34

   I believe the test is wrong. Therefore, this patch fixes the test.

3) gen-unicode-ctype.py

   (Included in the above patch, attached separately here as well
   for easier review).
   Script to generate the new character class data for LC_CTYPE from
   the Unicode data.

   Usage of the script:

       python3 ./gen-unicode-ctype.py -u UnicodeData.txt \
           -d DerivedCoreProperties.txt -i locales/i18n \
           -o locales/i18n-new --unicode_version 7.0.0

   Everything in the original glibc/localedata/locales/i18n file
   (given with the -i option) except the "date" stamp and the
   LC_CTYPE character class data is preserved and copied unchanged
   into the new file (given with the -o option). The character class
   data is replaced with the data from UnicodeData.txt and
   DerivedCoreProperties.txt from Unicode 7.0.0.

   The script is based on Bruno Haible’s gen-unicode-ctype.c program,
   rewritten in Python 3 and extended to use DerivedCoreProperties.txt
   as well for the character classes “alpha”, “lower”, and “upper”.

   It also considers all non-ASCII digits as alphabetic, just like
   Bruno’s original gen-unicode-ctype.c, because ISO C 99 forbids us
   from having them in the category “digit” but we want “isalnum” to
   return true on them.

   It treats title case characters as both “upper” and “lower” (also
   the same as Bruno’s gen-unicode-ctype.c).

4) ctype-compatibility.py

   (Included in the above patch, attached separately here as well
   for easier review).

   A Python script to compare the old and the new i18n file and check
   for errors. A sort of test suite for gen-unicode-ctype.py.

   Currently this test reports 11 “errors” in the new file, see:

   https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c29

   All these 11 “errors” are because of a disagreement between this
   part of Bruno’s gen-unicode-ctype.c:

       is_alpha (unsigned int ch)
       {
         return (unicode_attributes[ch].name != NULL
                 && ((unicode_attributes[ch].category[0] == 'L'
                      /* Theppitak Karoonboonyanan says
                         <U0E2F>, <U0E46> should belong to is_punct.  */
                      && (ch != 0x0E2F) && (ch != 0x0E46))
                     /* Theppitak Karoonboonyanan says
                        <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are is_alpha.  */
                     || (ch == 0x0E31)
                     || (ch >= 0x0E34 && ch <= 0x0E3A)
                     || (ch >= 0x0E47 && ch <= 0x0E4E)

   and Unicode’s DerivedCoreProperties.txt.
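   To make the quoted condition easier to compare against
   DerivedCoreProperties.txt, here is a minimal Python sketch of only
   the part of is_alpha quoted above. The names old_is_alpha_excerpt,
   FORCED_NON_ALPHA and FORCED_ALPHA are hypothetical and appear in
   neither script; the sketch assumes the character is assigned (the
   name != NULL check) and ignores any clauses of the C function that
   are not quoted.

```python
def old_is_alpha_excerpt(ch, category):
    """Transliteration of the quoted part of is_alpha only.

    `ch` is a code point, `category` its general category string from
    UnicodeData.txt (for example "Lo" or "Mn"). The character is
    assumed to be assigned; unquoted clauses of the C function are
    not modelled here.
    """
    return (
        # Letters count as alpha, except the two hard-coded Thai exclusions.
        (category.startswith("L") and ch not in (0x0E2F, 0x0E46))
        # Hard-coded Thai inclusions, regardless of category.
        or ch == 0x0E31
        or 0x0E34 <= ch <= 0x0E3A
        or 0x0E47 <= ch <= 0x0E4E
    )


# Code points whose classification is hard-coded in the quoted excerpt.
FORCED_NON_ALPHA = [0x0E2F, 0x0E46]
FORCED_ALPHA = [0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E4F)]

if __name__ == "__main__":
    for cp in FORCED_NON_ALPHA + FORCED_ALPHA:
        print("U+%04X is hard-coded in the excerpt" % cp)
```

   Feeding these code points to both the old rule and the
   “Alphabetic” property read from DerivedCoreProperties.txt is one
   way to list exactly where the two sources disagree.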
   According to DerivedCoreProperties.txt, <U0E2F>, <U0E46> are
   “Alphabetic”. And <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are
   *not* “Alphabetic” according to DerivedCoreProperties.txt.

   I tried to e-mail Bruno Haible and Theppitak Karoonboonyanan but
   got no response.

   I assume DerivedCoreProperties.txt is more trustworthy. If we can
   trust DerivedCoreProperties.txt, ctype-compatibility.py finds no
   remaining errors.

From 25c913674386011a44b6270579a894b2e8200d25 Mon Sep 17 00:00:00 2001
From: Mike FABIAN
Date: Wed, 3 Dec 2014 10:05:42 +0100
Subject: [PATCH 2/2] Fix test case localedata/tst-ctype-de_DE.ISO-8859-1.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

DerivedCoreProperties.txt from Unicode 7.0.0 lists the characters
U+00AA (ª) and U+00BA (º) as lower case:

00AA ; Lowercase # Lo FEMININE ORDINAL INDICATOR
00BA ; Lowercase # Lo MASCULINE ORDINAL INDICATOR
---
 localedata/tst-ctype-de_DE.ISO-8859-1.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/localedata/tst-ctype-de_DE.ISO-8859-1.in b/localedata/tst-ctype-de_DE.ISO-8859-1.in
index f71d76c..e124a52 100644
--- a/localedata/tst-ctype-de_DE.ISO-8859-1.in
+++ b/localedata/tst-ctype-de_DE.ISO-8859-1.in
@@ -1,5 +1,5 @@
 lower  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-      000000000000000000000100000000000000000000000000
+      000000000010000000000100001000000000000000000000
 lower ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
       000000000000000111111111111111111111111011111111
 upper  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-- 
1.9.3
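
For review, the changed expectation line in the test-case hunk can be
recomputed by hand. The following is a minimal sketch, assuming that the
first "lower" row covers the 48 ISO-8859-1 code points U+00A0..U+00CF
and that each digit flags one code point in order. The helper name
class_bits is hypothetical and is not part of the test suite.

```python
def class_bits(first, last, members):
    """Return one '0'/'1' flag per code point in first..last (inclusive)."""
    return "".join("1" if cp in members else "0"
                   for cp in range(first, last + 1))


# U+00B5 is the only position flagged in the old expectation line;
# U+00AA and U+00BA are the two code points the commit message cites.
old_lower = {0x00B5}
new_lower = old_lower | {0x00AA, 0x00BA}

print(class_bits(0x00A0, 0x00CF, old_lower))
# 000000000000000000000100000000000000000000000000
print(class_bits(0x00A0, 0x00CF, new_lower))
# 000000000010000000000100001000000000000000000000
```

The two printed lines match the removed and added lines of the hunk, so
the only difference is the two newly set flags at U+00AA and U+00BA.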