[BZ,17588,13064] Update UTF-8 charmap and width to Unicode 7.0.0
Commit Message
On Feb 16, 2015, "Carlos O'Donell" <carlos@redhat.com> wrote:
> On 02/12/2015 05:18 AM, Alexandre Oliva wrote:
>>> Regression tested on x86_64-linux-gnu. Ok to install?
> Yes, this version is OK to install if you fix all the nits.
Thanks.
> Despite complaints that a change in the generator would create
> a smaller diff, that doesn't matter to me.
The script changes were small and I figured it wouldn't hurt to merge
them and reduce the diff, so I did. So I'll wait for another ACK before
checking this in.
I also added the downloaded files to the tree, so that binary
distributors don't risk running afoul of the LGPL for lack of the .txt
files. It's not clear that they would be required, but it doesn't hurt
to put them in. I also added unicode-license.txt, copied from other
packages that ship it. I couldn't find the text file for download from
unicode.org, though I admittedly didn't search very thoroughly.
> Nit: ChangeLog needs [BZ #xxx] etc.
*check*. Heh, I didn't realize there were open bugs about this, in
spite of the mention in the Subject. Doh!
> Nit: This covers bugs 17588, 13064, *AND* 14094.
*check*
> Nit: Needs a NEWS entry describing this in full glory :-)
* Character encoding and ctype tables were updated to Unicode 7.0.0, using
new generator scripts contributed by Pravin Satpute and Mike FABIAN (Red
Hat). These updates cause user visible changes, such as the fix for bug
17998.
> Some might argue it fits better under "scripts" e.g. scripts/unicode-gen,
> but I don't care. We can move it later if we think it should move at all.
*nod*
>>> * unicode-gen/gen_unicode_ctype.py: New generator.
> Nit: Wrong copyright year e.g. 2014 -> 2015.
*check*. I added ", 2015" after 2014 in the scripts.
> Nit: We don't use "Contributed by" statements, they are instead part of what
> git records as Author or in the git commit message.
*check*. I removed them from the scripts, and added them as "from" in
the ChangeLog and in NEWS.
I also removed the links that pointed to github as upstream, since I
understand the GNU libc repository is going to hold the master copy, and
the repository that was linked to is thus obsolescent.
>>> * tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
>>> true for ordinal indicators.
> Nit: This needs a specific new BZ for the fix to user-visible behaviour.
*check*: [BZ #17998]
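For readers wondering where the islower change for the ordinal indicators comes from: U+00AA and U+00BA carry the derived Lowercase property in Unicode 7.0.0, even though their general category is Lo. Below is a minimal sketch of reading that property from lines in the style of DerivedCoreProperties.txt. The sample lines, the regular expression and the function name are illustrative assumptions and are not taken from gen_unicode_ctype.py.

```python
import re

# Sample lines written in the style of DerivedCoreProperties.txt.
# They are illustrative only; the real file is far longer.
SAMPLE_LINES = (
    "0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z",
    "00AA          ; Lowercase # Lo       FEMININE ORDINAL INDICATOR",
    "00BA          ; Lowercase # Lo       MASCULINE ORDINAL INDICATOR",
    "0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
)

# Hypothetical pattern: "START[..END] ; Property".
LINE_RE = re.compile(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)')


def code_points_with_property(lines, prop):
    '''Return the set of code points that the given lines assign to prop.'''
    found = set()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group(3) != prop:
            continue
        start = int(match.group(1), 16)
        # A single code point has no "..END" part.
        end = int(match.group(2) or match.group(1), 16)
        found.update(range(start, end + 1))
    return found


lower = code_points_with_property(SAMPLE_LINES, 'Lowercase')
print(0xAA in lower, 0xBA in lower)
```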
Here's the header of the patch and the incremental changes to the
scripts, from the previously posted version.
The entire patch can be found in the lzip-compressed attachment.
for localedata/ChangeLog
[BZ #17588]
[BZ #13064]
[BZ #14094]
[BZ #17998]
* unicode-gen/Makefile: New.
* unicode-gen/unicode-license.txt: New, from Unicode.
* unicode-gen/UnicodeData.txt: New, from Unicode.
* unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
* unicode-gen/EastAsianWidth.txt: New, from Unicode.
* unicode-gen/gen_unicode_ctype.py: New generator, from Mike
FABIAN <mfabian@redhat.com>.
* unicode-gen/ctype_compatibility.py: New verifier, from
Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
* unicode-gen/ctype_compatibility_test_cases.py: New verifier
module, from Mike FABIAN.
* unicode-gen/utf8_gen.py: New generator, from Pravin Satpute
and Mike FABIAN.
* unicode-gen/utf8_compatibility.py: New verifier, from Pravin
Satpute and Mike FABIAN.
* charmaps/UTF-8: Update.
* locales/i18n: Update.
* gen-unicode-ctype.c: Remove.
* tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
true for ordinal indicators.
---
NEWS | 11
localedata/charmaps/UTF-8 |11946 ++++++---
localedata/gen-unicode-ctype.c | 784 -
localedata/locales/i18n | 2652 +-
localedata/tst-ctype-de_DE.ISO-8859-1.in | 2
localedata/unicode-gen/DerivedCoreProperties.txt |10794 ++++++++
localedata/unicode-gen/EastAsianWidth.txt | 2121 ++
localedata/unicode-gen/Makefile | 99
localedata/unicode-gen/UnicodeData.txt |27268 ++++++++++++++++++++
localedata/unicode-gen/ctype_compatibility.py | 546
.../unicode-gen/ctype_compatibility_test_cases.py | 951 +
localedata/unicode-gen/gen_unicode_ctype.py | 751 +
localedata/unicode-gen/unicode-license.txt | 50
localedata/unicode-gen/utf8_compatibility.py | 399
localedata/unicode-gen/utf8_gen.py | 286
15 files changed, 53278 insertions(+), 5382 deletions(-)
delete mode 100644 localedata/gen-unicode-ctype.c
create mode 100644 localedata/unicode-gen/DerivedCoreProperties.txt
create mode 100644 localedata/unicode-gen/EastAsianWidth.txt
create mode 100644 localedata/unicode-gen/Makefile
create mode 100644 localedata/unicode-gen/UnicodeData.txt
create mode 100755 localedata/unicode-gen/ctype_compatibility.py
create mode 100644 localedata/unicode-gen/ctype_compatibility_test_cases.py
create mode 100755 localedata/unicode-gen/gen_unicode_ctype.py
create mode 100644 localedata/unicode-gen/unicode-license.txt
create mode 100755 localedata/unicode-gen/utf8_compatibility.py
create mode 100755 localedata/unicode-gen/utf8_gen.py
Comments
On 18 Feb 2015 21:23, Alexandre Oliva wrote:
> --- a/localedata/unicode-gen/ctype_compatibility.py
> +++ b/localedata/unicode-gen/ctype_compatibility.py
>
> -# Copyright (C) 2014 Free Software Foundation, Inc.
> +# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
should be a date range (2014-2015)
> +# Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
> +# sections 3.11 and 4.4.
> +
> +jamo_initial_short_name = [
> + 'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
> + 'C', 'K', 'T', 'P', 'H'
> +]
module level constants should really be in CAPS. and use a tuple to make it
const.
-mike
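A minimal sketch of what the two suggestions above would look like when applied to the first table. The upper-case name is an assumption about the eventual spelling, not something the posted patch contains.

```python
# Module-level constant: upper-case name, and a tuple so it cannot be
# modified by accident.
JAMO_INITIAL_SHORT_NAME = (
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
)

# Item assignment on a tuple raises TypeError, unlike on a list.
try:
    JAMO_INITIAL_SHORT_NAME[0] = 'X'
except TypeError as err:
    print('tuples are immutable:', err)
```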
On Feb 18, 2015, Mike Frysinger <vapier@gentoo.org> wrote:
> should be a date range (2014-2015)
> module level constants should really be in CAPS. and use a tuple to make it
> const.
Thanks. Mind if we save these cosmetic changes to the scripts for a
follow up patch?
There's also the matter of updating __STDC_ISO_10646__ in stdc-predef.h.
Unicode 7.0 claims to correspond to ISO/IEC 10646:2012 plus amendments 1
and 2 (and one extra character). Unfortunately I can find no sign of
amendment 2 ever having been published; it looks rather like it was
subsumed into ISO/IEC 10646:2014. Wikipedia claims that corresponds to
Unicode 7.0 (which would imply 201409L as version), but I can't find any
authoritative information, either on the Unicode website or after looking
through lots of SC2 documents, to confirm if there are indeed no
characters in 10646:2014 that aren't in Unicode 7.0.
Mike Frysinger <vapier@gentoo.org> wrote:
>> +# Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
>> +# sections 3.11 and 4.4.
>> +
>> +jamo_initial_short_name = [
>> + 'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
>> + 'C', 'K', 'T', 'P', 'H'
>> +]
>
> module level constants should really be in CAPS. and use a tuple to make it
> const.
> -mike
https://github.com/pravins/glibc-i18n/commit/53b81c58d220bfbb0e8faf8d4313c705826f4543
On 02/18/2015 06:23 PM, Alexandre Oliva wrote:
> [BZ #17588]
> [BZ #13064]
> [BZ #14094]
> [BZ #17998]
> * unicode-gen/Makefile: New.
> * unicode-gen/unicode-license.txt: New, from Unicode.
> * unicode-gen/UnicodeData.txt: New, from Unicode.
> * unicode-gen/DerivedCoreProperties.txt: New, from Unicode.
> * unicode-gen/EastAsianWidth.txt: New, from Unicode.
> * unicode-gen/gen_unicode_ctype.py: New generator, from Mike
> FABIAN <mfabian@redhat.com>.
> * unicode-gen/ctype_compatibility.py: New verifier, from
> Pravin Satpute <psatpute@redhat.com> and Mike FABIAN.
> * unicode-gen/ctype_compatibility_test_cases.py: New verifier
> module, from Mike FABIAN.
> * unicode-gen/utf8_gen.py: New generator, from Pravin Satpute
> and Mike FABIAN.
> * unicode-gen/utf8_compatibility.py: New verifier, from Pravin
> Satpute and Mike FABIAN.
> * charmaps/UTF-8: Update.
> * locales/i18n: Update.
> * gen-unicode-ctype.c: Remove.
> * tst-ctype-de_DE.ISO-8859-1.in: Adjust, islower now returns
> true for ordinal indicators.
Looks good to me. Please feel free to commit.
One nit:
-% Character width according to Unicode 5.0.0.
+% Character width according to Unicode 7.0.0.
% - Default width is 1.
% - Double-width characters have width 2; generated from
% "grep '^[^;]*;[WF]' EastAsianWidth.txt"
-% and "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
% "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
% "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
-% - Zero width characters have width 0; generated from
-% "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
Why even mention the `grep` to be used to generate this data?
It should just say to use the scripts. Nobody should be confused
that this data was actually generated by this method. Nor do I want
anyone doing it this way ever again.
Thus shouldn't `write_header_width` simply not output any of this
stuff? I understand we're trying to minimize the initial diff, but
in cleanup, we should remove all of this and just say:
"% Character width according to Unicode 7.0.0."
Thoughts?
Cheers,
Carlos.
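For reference, the two grep patterns that survive in the header are plain field filters on semicolon-separated lines. The sketch below shows what they select, using a few sample lines in the style of EastAsianWidth.txt and UnicodeData.txt. The sample data and variable names are illustrative assumptions; this is not how the new scripts compute the WIDTH section.

```python
import re

# The same patterns as in the header comment, written as Python regexes.
DOUBLE_WIDTH_RE = re.compile(r'^[^;]*;[WF]')
FORMAT_CONTROL_RE = re.compile(r'^[^;]*;[^;]*;Cf;')

# Trimmed sample lines (illustrative): "code point or range;width class".
EAST_ASIAN_WIDTH_SAMPLE = (
    '0041..005A;Na',
    '1100..115F;W',
    'FF01..FF03;F',
)

# Sample lines in UnicodeData.txt layout: "code;name;general category;...".
UNICODE_DATA_SAMPLE = (
    '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;',
    '00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;',
    '200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;',
)

# Entries whose East Asian width class is W (wide) or F (fullwidth).
double_width = [line.split(';')[0] for line in EAST_ASIAN_WIDTH_SAMPLE
                if DOUBLE_WIDTH_RE.match(line)]

# Entries whose general category is Cf (format control).
format_controls = [line.split(';')[0] for line in UNICODE_DATA_SAMPLE
                   if FORMAT_CONTROL_RE.match(line)]

print(double_width)
print(format_controls)
```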
On 02/18/2015 08:19 PM, Joseph Myers wrote:
> There's also the matter of updating __STDC_ISO_10646__ in stdc-predef.h.
>
> Unicode 7.0 claims to correspond to ISO/IEC 10646:2012 plus amendments 1
> and 2 (and one extra character). Unfortunately I can find no sign of
> amendment 2 ever having been published; it looks rather like it was
> subsumed into ISO/IEC 10646:2014. Wikipedia claims that corresponds to
> Unicode 7.0 (which would imply 201409L as version), but I can't find any
> authoritative information, either on the Unicode website or after looking
> through lots of SC2 documents, to confirm if there are indeed no
> characters in 10646:2014 that aren't in Unicode 7.0.
I have submitted a question to the Unicode Consortium to answer this.
Proving there are no characters in 10646:2014 that aren't in Unicode 7.0
is going to be a difficult slog. Someone from the relevant groups has
to answer the question for us.
I went through SC2 documents from the Canadian side and found that
10646:2012 amendment 2 did go to ITTF for FDAM and a summary of votes
shows it passed. However, it seems the secretariat changed at that point
and perhaps everything was delayed until the 2014 standard.
Cheers,
Carlos.
On 02/18/2015 08:19 PM, Joseph Myers wrote:
> There's also the matter of updating __STDC_ISO_10646__ in stdc-predef.h.
>
> Unicode 7.0 claims to correspond to ISO/IEC 10646:2012 plus amendments 1
> and 2 (and one extra character). Unfortunately I can find no sign of
> amendment 2 ever having been published; it looks rather like it was
> subsumed into ISO/IEC 10646:2014. Wikipedia claims that corresponds to
> Unicode 7.0 (which would imply 201409L as version), but I can't find any
> authoritative information, either on the Unicode website or after looking
> through lots of SC2 documents, to confirm if there are indeed no
> characters in 10646:2014 that aren't in Unicode 7.0.
>
The ISO never published amendment 2 for ISO/IEC 10646:2012.
The answer from the Unicode Consortium was (with some copy editing):
~~~
Version 7.0 of the Unicode Standard is synchronized with ISO/IEC 10646:2012,
plus Amendments 1 and 2. Additionally, it includes the accelerated
publication of U+20BD RUBLE SIGN.
Unicode 8.0, due for publication in the summer of 2015, is in early draft
stage now, with a page here:
http://www.unicode.org/versions/Unicode8.0.0/#Summary
It is intended to synchronize with ISO 10646:2014, plus Amendment 1.
~~~
Therefore Unicode 7.0.0 is between 10646:2012 and 10646:2014.
The Wikipedia page is wrong and I have corrected it.
Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
Amd.1 was published).
Thoughts?
Cheers,
Carlos.
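As a reminder of how the value is built: the C standard defines __STDC_ISO_10646__ as an integer constant of the form yyyymmL, naming the year and month of the ISO/IEC 10646 edition or amendment that wchar_t values conform to. A small sketch of that encoding follows; the helper name is hypothetical.

```python
def stdc_iso_10646(year, month):
    '''Format a publication year and month as a yyyymmL constant.'''
    if not 1 <= month <= 12:
        raise ValueError('month must be in the range 1..12')
    return '{:04d}{:02d}L'.format(year, month)


# April 2013, the publication date given above for 10646:2012 Amd.1.
print(stdc_iso_10646(2013, 4))
# September 2014, the value that 10646:2014 would have implied.
print(stdc_iso_10646(2014, 9))
```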
On Fri, 20 Feb 2015, Carlos O'Donell wrote:
> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
> Amd.1 was published).
>
> Thoughts?
That accords with what I suggested as a safe value in
<https://sourceware.org/ml/libc-alpha/2014-06/msg00588.html> when the 2014
edition hadn't been published either.
On 02/20/2015 04:28 PM, Joseph Myers wrote:
> On Fri, 20 Feb 2015, Carlos O'Donell wrote:
>
>> Thus __STDC_ISO_10646__ should be 201304L (the date that ISO/IEC 10646:2012
>> Amd.1 was published).
>>
>> Thoughts?
>
> That accords with what I suggested as a safe value in
> <https://sourceware.org/ml/libc-alpha/2014-06/msg00588.html> when the 2014
> edition hadn't been published either.
>
Sounds like consensus.
Alex, could you please make sure __STDC_ISO_10646__ ends up as 201304L?
Cheers,
Carlos.
@@ -9,8 +9,15 @@ Version 2.22
* The following bugs are resolved with this release:
- 4719, 15319, 15467, 15790, 16560, 17569, 17792, 17912, 17932, 17944,
- 17949, 17964, 17965, 17967, 17969, 17978, 17987, 17991, 17996.
+ 4719, 13064, 14094, 15319, 15467, 15790, 16560, 17569, 17588, 17792,
+ 17912, 17932, 17944, 17949, 17964, 17965, 17967, 17969, 17978, 17987,
+ 17991, 17996, 17998.
+
+* Character encoding and ctype tables were updated to Unicode 7.0.0, using
+ new generator scripts contributed by Pravin Satpute and Mike FABIAN (Red
+ Hat). These updates cause user visible changes, such as the fix for bug
+ 17998.
+
Version 2.21
Incremental changes to the scripts:
@@ -1,10 +1,7 @@
#!/usr/bin/python3
# -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
# This file is part of the GNU C Library.
-# Contributed by
-# Pravin Satpute <psatpute@redhat.com>, 2014.
-# Mike FABIAN <mfabian@redhat.com>, 2014.
#
# The GNU C Library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
@@ -1,8 +1,6 @@
# -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
# This file is part of the GNU C Library.
-# Contributed by
-# Mike FABIAN <mfabian@redhat.com>, 2014.
#
# The GNU C Library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
@@ -1,9 +1,8 @@
#!/usr/bin/python3
#
# Generate a Unicode conforming LC_CTYPE category from a UnicodeData file.
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
# This file is part of the GNU C Library.
-# Contributed by Mike FABIAN <maiku.fabian@gmail.com>, 2014.
# Based on gen-unicode-ctype.c by Bruno Haible <haible@clisp.cons.org>, 2000.
#
# The GNU C Library is free software; you can redistribute it and/or
@@ -1,9 +1,7 @@
#!/usr/bin/python3
# -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
# This file is part of the GNU C Library.
-# Contributed by Pravin Satpute <psatpute@redhat.com>, 2014.
-# Mike FABIAN <mfabian@redhat.com>, 2014
#
# The GNU C Library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
@@ -27,8 +25,6 @@ To see how this script is used, call it with the “-h” option:
$ ./utf8_compatibility.py -h
… prints usage message …
-
-For issues upstream https://github.com/pravins/glibc-i18n
'''
import sys
@@ -1,10 +1,7 @@
#!/usr/bin/python3
# -*- coding: utf-8 -*-
-# Copyright (C) 2014 Free Software Foundation, Inc.
+# Copyright (C) 2014, 2015 Free Software Foundation, Inc.
# This file is part of the GNU C Library.
-# Contributed by
-# Pravin Satpute <psatpute AT redhat DOT com> and
-# Mike Fabian <mfabian At redhat DOT com> - 2014
#
# The GNU C Library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
@@ -28,13 +25,30 @@ from Unicode data.
Usage: python3 utf8_gen.py UnicodeData.txt EastAsianWidth.txt
It will output UTF-8 file
-
-For issues upstream https://github.com/pravins/glibc-i18n
'''
import sys
import re
+# Auxiliary tables for Hangul syllable names, see the Unicode 3.0 book,
+# sections 3.11 and 4.4.
+
+jamo_initial_short_name = [
+ 'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
+ 'C', 'K', 'T', 'P', 'H'
+]
+
+jamo_medial_short_name = [
+ 'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
+ 'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
+]
+
+jamo_final_short_name = [
+ '', 'G', 'GG', 'GS', 'N', 'NJ', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
+ 'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
+ 'P', 'H'
+]
+
def ucs_symbol(code_point):
'''Return the UCS symbol string for a Unicode character.'''
if code_point < 0x10000:
@@ -57,8 +71,15 @@ def process_range(start, end, outfile, name):
#
# So we expand the Hangul Syllables here:
for i in range(int(start, 16), int(end, 16)+1 ):
- outfile.write('{:s} {:s} {:s}\n'.format(
- ucs_symbol(i), convert_to_hex(i), name))
+ index2, index3 = divmod(i - 0xaC00, 28)
+ index1, index2 = divmod(index2, 21)
+ hangul_syllable_name = 'HANGUL SYLLABLE ' \
+ + jamo_initial_short_name[index1] \
+ + jamo_medial_short_name[index2] \
+ + jamo_final_short_name[index3]
+ outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+ ucs_symbol(i), convert_to_hex(i),
+ hangul_syllable_name))
return
# UnicodeData.txt file has contains code point ranges like this:
#
@@ -73,13 +94,13 @@ def process_range(start, end, outfile, name):
# <U4D80>..<U4DB5> /xe4/xb6/x80 <CJK Ideograph Extension A>
for i in range(int(start, 16), int(end, 16), 64 ):
if i > (int(end, 16)-64):
- outfile.write('{:s}..{:s} {:s} {:s}\n'.format(
+ outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
ucs_symbol(i),
ucs_symbol(int(end,16)),
convert_to_hex(i),
name))
break
- outfile.write('{:s}..{:s} {:s} {:s}\n'.format(
+ outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
ucs_symbol(i),
ucs_symbol(i+63),
convert_to_hex(i),
@@ -146,7 +167,7 @@ def process_charmap(flines, outfile):
# the original UTF-8 file in glibc had them as
# comments, so we keep these comment lines.
outfile.write('%')
- outfile.write('{:s} {:s} {:s}\n'.format(
+ outfile.write('{:<11s} {:<12s} {:s}\n'.format(
ucs_symbol(int(fields[0], 16)),
convert_to_hex(int(fields[0], 16)),
fields[1]))
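The Hangul hunk above boils down to two divmod steps: 28 final jamo per initial/medial pair, and 21 medial jamo per initial. Here is a self-contained sketch of that decomposition together with the padded charmap line format. The upper-case table names, the range check, and the bodies of ucs_symbol and convert_to_hex are assumptions made so the example runs on its own; the real helpers live in utf8_gen.py. The final-jamo table here uses the Unicode short names, where the sixth entry is NJ.

```python
JAMO_INITIAL = (
    'G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S', 'SS', '', 'J', 'JJ',
    'C', 'K', 'T', 'P', 'H'
)
JAMO_MEDIAL = (
    'A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE', 'OE',
    'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I'
)
JAMO_FINAL = (
    '', 'G', 'GG', 'GS', 'N', 'NJ', 'NH', 'D', 'L', 'LG', 'LM', 'LB', 'LS',
    'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C', 'K', 'T',
    'P', 'H'
)


def ucs_symbol(code_point):
    '''Assumed stand-in: <UXXXX> for the BMP, <UXXXXXXXX> above it.'''
    if code_point < 0x10000:
        return '<U{:04X}>'.format(code_point)
    return '<U{:08X}>'.format(code_point)


def convert_to_hex(code_point):
    '''Assumed stand-in: the UTF-8 bytes written as /xNN sequences.'''
    return ''.join('/x{:02x}'.format(byte)
                   for byte in chr(code_point).encode('utf-8'))


def hangul_syllable_name(code_point):
    '''Compose the name of a precomposed Hangul syllable.'''
    if not 0xAC00 <= code_point <= 0xD7A3:
        raise ValueError('not a precomposed Hangul syllable')
    # 28 finals per (initial, medial) pair, 21 medials per initial.
    index2, index3 = divmod(code_point - 0xAC00, 28)
    index1, index2 = divmod(index2, 21)
    return ('HANGUL SYLLABLE ' + JAMO_INITIAL[index1]
            + JAMO_MEDIAL[index2] + JAMO_FINAL[index3])


def charmap_line(code_point):
    '''Format one charmap line with the column padding used in the patch.'''
    return '{:<11s} {:<12s} {:s}'.format(
        ucs_symbol(code_point), convert_to_hex(code_point),
        hangul_syllable_name(code_point))


print(charmap_line(0xAC00))
print(charmap_line(0xD55C))
```

The left-aligned field widths (11 and 12) keep the name column aligned for BMP symbols and three-byte UTF-8 sequences, which is what the format string changes in the hunks above are for.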