From patchwork Wed Dec  3 15:02:24 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Mike FABIAN <mfabian@redhat.com>
X-Patchwork-Id: 4053
Received: (qmail 15858 invoked by alias); 3 Dec 2014 15:03:01 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 15819 invoked by uid 89); 3 Dec 2014 15:02:59 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=3.1 required=5.0 tests=AWL, BAYES_50,
	BODY_8BITS, GARBLED_BODY, SPF_HELO_PASS, SPF_PASS,
	T_RP_MATCHES_RCVD autolearn=no version=3.3.2
X-HELO: mx1.redhat.com
From: Mike FABIAN <mfabian@redhat.com>
To: libc-alpha@sourceware.org
Cc: Pravin Satpute <psatpute@redhat.com>
Subject: [PATCH] [BZ 14094] Update LC_CTYPE character class data to Unicode
	7.0.0
Date: Wed, 03 Dec 2014 16:02:24 +0100
Message-ID: <s9dh9xcu67z.fsf@ari.site>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.93 (gnu/linux)
MIME-Version: 1.0

1) 0001-Update-LC_CTYPE-character-class-data-to-Unicode-7.0..patch

   Patch to update the character class data in
   glibc/localedata/locales/i18n. The patch includes the 2 scripts
   gen-unicode-ctype.py and ctype-compatibility.py.

2) 0002-Fix-test-case-localedata-tst-ctype-de_DE.ISO-8859-1..patch

   After applying the patch from 1), building glibc and running “make check”,
   the test localedata/tst-ctype fails. See:
   
   https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c34

   I believe the test is wrong. Therefore, this patch fixes the test.

3) gen-unicode-ctype.py
   (Included in the above patch; attached separately here as well for easier
   review.)
   
   Script to generate the new character class data for LC_CTYPE from
   the Unicode data.

   Usage of the script:
   
   python3 ./gen-unicode-ctype.py -u UnicodeData.txt -d DerivedCoreProperties.txt -i locales/i18n -o locales/i18n-new --unicode_version 7.0.0

   Everything in the original glibc/localedata/locales/i18n file (given
   with the -i option) except the "date" stamp and the LC_CTYPE
   character class data is preserved and copied unchanged into the new
   file (given with the -o option). The character class data is replaced
   with the data from UnicodeData.txt and DerivedCoreProperties.txt from
   Unicode 7.0.0.

   The script is based on Bruno Haible’s gen-unicode-ctype.c program,
   rewritten in Python 3 and extended to also use DerivedCoreProperties.txt
   for the character classes “alpha”, “lower”, and “upper”.

   It also treats all non-ASCII digits as alphabetic, just like
   Bruno’s original gen-unicode-ctype.c, because ISO C99 forbids
   having them in the class “digit”, but we want “isalnum” to return
   true for them.

   It treats title case characters as both “upper” and
   “lower” (also the same as Bruno’s gen-unicode-ctype.c).
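
   The two rules above can be sketched roughly as follows. This is only
   an illustration, not the code in gen-unicode-ctype.py: it uses Python's
   bundled unicodedata module (whose Unicode version depends on the Python
   release, not necessarily 7.0.0), it assumes "digit" means general
   category Nd, and all function names are hypothetical.

```python
import unicodedata


def is_ascii_digit(ch: str) -> bool:
    """Only '0'..'9' may be in class "digit" (ISO C99 restriction)."""
    return "0" <= ch <= "9"


def is_alpha_sketch(ch: str) -> bool:
    """Letters, plus every decimal digit (assumed: category Nd) outside ASCII."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return True
    return category == "Nd" and not is_ascii_digit(ch)


def is_alnum_sketch(ch: str) -> bool:
    """alnum = alpha or digit, so non-ASCII digits are covered via alpha."""
    return is_alpha_sketch(ch) or is_ascii_digit(ch)


def is_upper_sketch(ch: str) -> bool:
    """Simplification: upper case (Lu) or title case (Lt) by general category."""
    return unicodedata.category(ch) in ("Lu", "Lt")


def is_lower_sketch(ch: str) -> bool:
    """Simplification: lower case (Ll) or title case (Lt) by general category."""
    return unicodedata.category(ch) in ("Ll", "Lt")
```

   With these definitions, a non-ASCII digit such as U+0663 is "alnum"
   without being "digit", and a title case character such as U+01C5 is
   both "upper" and "lower".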

4) ctype-compatibility.py
   (Included in the above patch; attached separately here as well for easier
   review.)

   A Python script to compare the old and the new i18n file and check
   for errors. A sort of test suite for gen-unicode-ctype.py.

   Currently this test reports 11 “errors” in the new file, see:
   
   https://sourceware.org/bugzilla/show_bug.cgi?id=14094#c29

   All these 11 “errors” are because of a disagreement between this
   part of Bruno’s gen-unicode-ctype.c:
   
        is_alpha (unsigned int ch)
        {
          return (unicode_attributes[ch].name != NULL
                  && ((unicode_attributes[ch].category[0] == 'L'
                       /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
                          <U0E2F>, <U0E46> should belong to is_punct.  */
                       && (ch != 0x0E2F) && (ch != 0x0E46))
                      /* Theppitak Karoonboonyanan <thep@links.nectec.or.th> says
                         <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are is_alpha.  */
                      || (ch == 0x0E31)
                      || (ch >= 0x0E34 && ch <= 0x0E3A)
                      || (ch >= 0x0E47 && ch <= 0x0E4E)

   and Unicode’s DerivedCoreProperties.txt.
   According to DerivedCoreProperties.txt, <U0E2F>, <U0E46> are
   “Alphabetic”. And <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are
   *not* “Alphabetic” according to DerivedCoreProperties.txt.

   I wrote to Bruno Haible and Theppitak Karoonboonyanan about this
   but got no response.

   I assume DerivedCoreProperties.txt is more trustworthy. If we
   trust it, ctype-compatibility.py finds no remaining errors.
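
   For reviewers who want to check individual code points against
   DerivedCoreProperties.txt themselves, a minimal sketch of a parser
   follows. It is not taken from either attached script; the function
   names are hypothetical, and the sample input is just the two entries
   quoted in the commit message of patch 2/2.

```python
import re
from collections import defaultdict

# Matches "00AA ; Lowercase # ..." and the range form "XXXX..YYYY ; Property # ...".
_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)


def parse_derived_properties(lines):
    """Map each property name to the set of code points that carry it."""
    props = defaultdict(set)
    for line in lines:
        match = _LINE.match(line)
        if match is None:
            continue  # comment or blank line
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        props[match.group(3)].update(range(first, last + 1))
    return props


def has_property(props, name, code_point):
    """True if code_point is listed under the property called name."""
    return code_point in props.get(name, set())


SAMPLE = [
    "00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR",
    "00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR",
]

if __name__ == "__main__":
    sample_props = parse_derived_properties(SAMPLE)
    for cp in (0x00AA, 0x00BA):
        print("U+%04X Lowercase: %s" % (cp, has_property(sample_props, "Lowercase", cp)))
```

   To check the Thai code points discussed above, one could pass the
   opened DerivedCoreProperties.txt file to parse_derived_properties()
   and query has_property(props, "Alphabetic", 0x0E2F) and so on.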

From 25c913674386011a44b6270579a894b2e8200d25 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Wed, 3 Dec 2014 10:05:42 +0100
Subject: [PATCH 2/2] Fix test case localedata/tst-ctype-de_DE.ISO-8859-1.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

DerivedCoreProperties.txt from Unicode 7.0.0 lists
the characters U+00AA (ª) and U+00BA (º) as lower case:

00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR
00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR
---
 localedata/tst-ctype-de_DE.ISO-8859-1.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/localedata/tst-ctype-de_DE.ISO-8859-1.in b/localedata/tst-ctype-de_DE.ISO-8859-1.in
index f71d76c..e124a52 100644
--- a/localedata/tst-ctype-de_DE.ISO-8859-1.in
+++ b/localedata/tst-ctype-de_DE.ISO-8859-1.in
@@ -1,5 +1,5 @@
 lower    ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-        000000000000000000000100000000000000000000000000
+        000000000010000000000100001000000000000000000000
 lower   ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
         000000000000000111111111111111111111111011111111
 upper    ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
-- 
1.9.3