Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]

Message ID s9d8sgn5sf8.fsf@taka.site
State Committed
Headers
Series Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120] |

Commit Message

Mike FABIAN June 16, 2020, 8:24 a.m. UTC
  
  

Comments

Mike FABIAN June 23, 2020, 9:30 a.m. UTC | #1
I skipped unassigned characters and ended the range at U+D7FF even
though U+D7FC .. U+D7FF are currently unassigned. But because
the script now skips the unassigned characters it is OK to end the range
for the Hangul Jamo at U+D7FF, if these characters ever happen to get
assigned in future, they will probably be Hangul Jamo because of
Block.txt.

After each Unicode update, manual checking is good anyway, but ending
the range in the script at U+D7FF seems more likely to do the right
thing already if these characters ever get assigned.
  
Carlos O'Donell June 23, 2020, 3:17 p.m. UTC | #2
On 6/23/20 5:30 AM, Mike FABIAN via Libc-alpha wrote:
> I skipped unassigned characters and ended the range at U+D7FF even
> though U+D7FC .. U+D7FF are currently unassigned. But because
> the script now skips the unassigned characters it is OK to end the range
> for the Hangul Jamo at U+D7FF, if these characters ever happen to get
> assigned in future, they will probably be Hangul Jamo because of
> Block.txt.
> 
> After each Unicode update, manual checking is good anyway, but ending
> the range in the script at U+D7FF seems more likely to do the right
> thing already if these characters ever get assigned.
> 

You change the generator but all the files that are generated by the
generator do not appear regenerated in your patch.

Can you please post exactly what you plan to commit, that way we can
review the results?

I'm expecting:
- generator change.
- all files updated with date changes.
- some files have more than date changes.

This way we keep the generated files consistent.
  
Mike FABIAN June 25, 2020, 8:32 a.m. UTC | #3
Carlos O'Donell <carlos@redhat.com> さんはかきました:

> On 6/23/20 5:30 AM, Mike FABIAN via Libc-alpha wrote:
>> I skipped unassigned characters and ended the range at U+D7FF even
>> though U+D7FC .. U+D7FF are currently unassigned. But because
>> the script now skips the unassigned characters it is OK to end the range
>> for the Hangul Jamo at U+D7FF, if these characters ever happen to get
>> assigned in future, they will probably be Hangul Jamo because of
>> Block.txt.
>> 
>> After each Unicode update, manual checking is good anyway, but ending
>> the range in the script at U+D7FF seems more likely to do the right
>> thing already if these characters ever get assigned.
>> 
>
> You change the generator but all the files that are generated by the
> generator do not appear regenerated in your patch.

> Can you please post exactly what you plan to commit, that way we can
> review the results?

The patch did contain everything.

> I'm expecting:
> - generator change.

This part is the generator change:

diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 17b99ee88d..11c906b92f 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -258,7 +258,13 @@ def process_width(outfile, ulines, elines, plines):
         if key in width_dict:
             del width_dict[key] # default width is 1
     for key in list(range(0x1160, 0x1200)):
-        width_dict[key] = 0
+        # Hangul jungseong and jongseong:
+        if key in unicode_utils.UNICODE_ATTRIBUTES:
+            width_dict[key] = 0
+    for key in list(range(0xD7B0, 0xD800)):
+        # Hangul jungseong and jongseong:
+        if key in unicode_utils.UNICODE_ATTRIBUTES:
+            width_dict[key] = 0
     for key in list(range(0x3248, 0x3250)):
         # These are “A” which means we can decide whether to treat them
         # as “W” or “N” based on context:
@@ -327,6 +333,7 @@ if __name__ == "__main__":
         help='The Unicode version of the input files used.')
     ARGS = PARSER.parse_args()
 
+    unicode_utils.fill_attributes(ARGS.unicode_data_file)
     with open(ARGS.unicode_data_file, mode='r') as UNIDATA_FILE:
         UNICODE_DATA_LINES = UNIDATA_FILE.readlines()
     with open(ARGS.east_asian_with_file, mode='r') as EAST_ASIAN_WIDTH_FILE:

> - all files updated with date changes.

And the UTF-8 file in charmaps is the only file which changed, only in
the WIDTH section:

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0


> - some files have more than date changes.

No other files are changed and the UTF-8 file in charmaps does not
contain a generation date.

> This way we keep the generated files consistent.
  
Carlos O'Donell June 25, 2020, 12:35 p.m. UTC | #4
On 6/25/20 4:32 AM, Mike FABIAN wrote:
> Carlos O'Donell <carlos@redhat.com> さんはかきました:
> 
>> On 6/23/20 5:30 AM, Mike FABIAN via Libc-alpha wrote:
>>> I skipped unassigned characters and ended the range at U+D7FF even
>>> though U+D7FC .. U+D7FF are currently unassigned. But because
>>> the script now skips the unassigned characters it is OK to end the range
>>> for the Hangul Jamo at U+D7FF, if these characters ever happen to get
>>> assigned in future, they will probably be Hangul Jamo because of
>>> Block.txt.
>>>
>>> After each Unicode update, manual checking is good anyway, but ending
>>> the range in the script at U+D7FF seems more likely to do the right
>>> thing already if these characters ever get assigned.
>>>
>>
>> You change the generator but all the files that are generated by the
>> generator do not appear regenerated in your patch.
> 
>> Can you please post exactly what you plan to commit, that way we can
>> review the results?
> 
> The patch did contain everything.

My expectation was that we do this:

(a) Make all your source changes.
(b) Regenerate.
cd ~/src/glibc/localedata/unicode-gen
make
make-intstall

When you do it this way *all* files get updates.

You're cheating here because you know apriori which files need updating
and you appear to have carried out that updated by hand. I dislike doing
this kind of "by hand" process and was expecting the more general update
as I describe above. This ensures that all files are at the same "regenerated"
state and the date for all of them in their datestamps matches the date
you regenerated everything that is generated by unicode-gen. That means a
reviewer doesn't have to verify that no other files might have been changed
by your code change.

I'm requesting that we follow a simpler procedure that regenerates the files
using the expected process, and post patches based on that.

Does that clarify my position?

Please feel free to justify a different position.
 
>> I'm expecting:
>> - generator change.
> 
> This part is the generator change:
> 
> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 17b99ee88d..11c906b92f 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
> @@ -258,7 +258,13 @@ def process_width(outfile, ulines, elines, plines):
>          if key in width_dict:
>              del width_dict[key] # default width is 1
>      for key in list(range(0x1160, 0x1200)):
> -        width_dict[key] = 0
> +        # Hangul jungseong and jongseong:
> +        if key in unicode_utils.UNICODE_ATTRIBUTES:
> +            width_dict[key] = 0
> +    for key in list(range(0xD7B0, 0xD800)):
> +        # Hangul jungseong and jongseong:
> +        if key in unicode_utils.UNICODE_ATTRIBUTES:
> +            width_dict[key] = 0

OK.

>      for key in list(range(0x3248, 0x3250)):
>          # These are “A” which means we can decide whether to treat them
>          # as “W” or “N” based on context:
> @@ -327,6 +333,7 @@ if __name__ == "__main__":
>          help='The Unicode version of the input files used.')
>      ARGS = PARSER.parse_args()
>  
> +    unicode_utils.fill_attributes(ARGS.unicode_data_file)

OK.

>      with open(ARGS.unicode_data_file, mode='r') as UNIDATA_FILE:
>          UNICODE_DATA_LINES = UNIDATA_FILE.readlines()
>      with open(ARGS.east_asian_with_file, mode='r') as EAST_ASIAN_WIDTH_FILE:
> 
>> - all files updated with date changes.
> 
> And the UTF-8 file in charmaps is the only file which changed, only in
> the WIDTH section:

If you ran make; make install; from unicode-gen all the other files would have
changed their timestamps to show that they were updated and had no functional
change. Those patches are missing. Without them I don't know that the generator
actually generated "zero functional changes" for those files.

> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
> index 14c5d4fa33..8cce47cd97 100644
> --- a/localedata/charmaps/UTF-8
> +++ b/localedata/charmaps/UTF-8
> @@ -48920,6 +48920,8 @@ WIDTH
>  <UABE8>        0
>  <UABED>        0
>  <UAC00>...<UD7A3>      2
> +<UD7B0>...<UD7C6>      0
> +<UD7CB>...<UD7FB>      0

OK.

>  <UF900>...<UFA6D>      2
>  <UFA70>...<UFAD9>      2
>  <UFB1E>        0
> 
> 
>> - some files have more than date changes.
> 
> No other files are changed and the UTF-8 file in charmaps does not
> contain a generation date.
> 
>> This way we keep the generated files consistent.
>
  
Mike FABIAN June 25, 2020, 1 p.m. UTC | #5
Carlos O'Donell <carlos@redhat.com> さんはかきました:

> My expectation was that we do this:
>
> (a) Make all your source changes.
> (b) Regenerate.
> cd ~/src/glibc/localedata/unicode-gen
> make
> make-intstall
>
> When you do it this way *all* files get updates.
>
> You're cheating here because you know apriori which files need updating
> and you appear to have carried out that updated by hand. I dislike doing
> this kind of "by hand" process and was expecting the more general update
> as I describe above. This ensures that all files are at the same "regenerated"
> state and the date for all of them in their datestamps matches the date
> you regenerated everything that is generated by unicode-gen. That means a
> reviewer doesn't have to verify that no other files might have been changed
> by your code change.
>
> I'm requesting that we follow a simpler procedure that regenerates the files
> using the expected process, and post patches based on that.
>
> Does that clarify my position?

Yes, now I understand what you mean.

[...]

> If you ran make; make install; from unicode-gen all the other files would have
> changed their timestamps to show that they were updated and had no functional
> change. Those patches are missing. Without them I don't know that the generator
> actually generated "zero functional changes" for those files.

I knew because I only changed utf8_gen.py and this generator script
changes only the UTF-8 file.

I'll do it the "make install" way ...
  
Mike FABIAN June 25, 2020, 1:07 p.m. UTC | #6
Mike FABIAN <mfabian@redhat.com> さんはかきました:

> I'll do it the "make install" way ...
  
Carlos O'Donell June 25, 2020, 6:57 p.m. UTC | #7
On 6/25/20 9:07 AM, Mike FABIAN wrote:
> From 32ee1eb4037ce5c777322a5043a48b8627610d8d Mon Sep 17 00:00:00 2001
> From: Mike FABIAN <mfabian@redhat.com>
> Date: Tue, 16 Jun 2020 08:29:40 +0200
> Subject: [PATCH] Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to
>  UD7FB to 0 [BZ #26120]

OK for master.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
 
> ---
>  localedata/charmaps/UTF-8              | 2 ++
>  localedata/locales/i18n_ctype          | 2 +-
>  localedata/locales/tr_TR               | 2 +-
>  localedata/locales/translit_circle     | 2 +-
>  localedata/locales/translit_cjk_compat | 2 +-
>  localedata/locales/translit_combining  | 2 +-
>  localedata/locales/translit_compat     | 2 +-
>  localedata/locales/translit_font       | 2 +-
>  localedata/locales/translit_fraction   | 2 +-
>  localedata/unicode-gen/utf8_gen.py     | 9 ++++++++-
>  10 files changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
> index 14c5d4fa33..8cce47cd97 100644
> --- a/localedata/charmaps/UTF-8
> +++ b/localedata/charmaps/UTF-8
> @@ -48920,6 +48920,8 @@ WIDTH
>  <UABE8>	0
>  <UABED>	0
>  <UAC00>...<UD7A3>	2
> +<UD7B0>...<UD7C6>	0
> +<UD7CB>...<UD7FB>	0

OK. Expected output by generator.

>  <UF900>...<UFA6D>	2
>  <UFA70>...<UFAD9>	2
>  <UFB1E>	0
> diff --git a/localedata/locales/i18n_ctype b/localedata/locales/i18n_ctype
> index 6f078a101d..c63e0790fc 100644
> --- a/localedata/locales/i18n_ctype
> +++ b/localedata/locales/i18n_ctype
> @@ -26,7 +26,7 @@ fax       ""
>  language  ""
>  territory "Earth"
>  revision  "13.0.0"
> -date      "2020-04-14"
> +date      "2020-06-25"

OK. No change. Expected.

>  category  "i18n:2012";LC_CTYPE
>  END LC_IDENTIFICATION
>  
> diff --git a/localedata/locales/tr_TR b/localedata/locales/tr_TR
> index d5785ceca1..7dbb923228 100644
> --- a/localedata/locales/tr_TR
> +++ b/localedata/locales/tr_TR
> @@ -43,7 +43,7 @@ fax        ""
>  language   "Turkish"
>  territory  "Turkey"
>  revision   "1.0"
> -date       "2020-04-14"
> +date       "2020-06-25"

OK. No change. Expected.

>  
>  category "i18n:2012";LC_IDENTIFICATION
>  category "i18n:2012";LC_CTYPE
> diff --git a/localedata/locales/translit_circle b/localedata/locales/translit_circle
> index 0f1e81541c..5c07b44532 100644
> --- a/localedata/locales/translit_circle
> +++ b/localedata/locales/translit_circle
> @@ -9,7 +9,7 @@ comment_char %
>  % otherwise be governed by that license.
>  
>  % Transliterations of encircled characters.
> -% Generated automatically from UnicodeData.txt by gen_translit_circle.py on 2020-04-14 for Unicode 13.0.0.
> +% Generated automatically from UnicodeData.txt by gen_translit_circle.py on 2020-06-25 for Unicode 13.0.0.

OK. No change. Expected.

>  
>  LC_CTYPE
>  
> diff --git a/localedata/locales/translit_cjk_compat b/localedata/locales/translit_cjk_compat
> index 17b74134fc..ee0d7f83c6 100644
> --- a/localedata/locales/translit_cjk_compat
> +++ b/localedata/locales/translit_cjk_compat
> @@ -9,7 +9,7 @@ comment_char %
>  % otherwise be governed by that license.
>  
>  % Transliterations of CJK compatibility characters.
> -% Generated automatically from UnicodeData.txt by gen_translit_cjk_compat.py on 2020-04-14 for Unicode 13.0.0.
> +% Generated automatically from UnicodeData.txt by gen_translit_cjk_compat.py on 2020-06-25 for Unicode 13.0.0.

OK. No change. Expected.

>  
>  LC_CTYPE
>  
> diff --git a/localedata/locales/translit_combining b/localedata/locales/translit_combining
> index d5c8bbfe8f..36128f097a 100644
> --- a/localedata/locales/translit_combining
> +++ b/localedata/locales/translit_combining
> @@ -10,7 +10,7 @@ comment_char %
>  
>  % Transliterations that remove all combining characters (accents,
>  % pronounciation marks, etc.).
> -% Generated automatically from UnicodeData.txt by gen_translit_combining.py on 2020-04-14 for Unicode 13.0.0.
> +% Generated automatically from UnicodeData.txt by gen_translit_combining.py on 2020-06-25 for Unicode 13.0.0.

OK. No change. Expected.

>  
>  LC_CTYPE
>  
> diff --git a/localedata/locales/translit_compat b/localedata/locales/translit_compat
> index ff18b02ea3..ac24c4e938 100644
> --- a/localedata/locales/translit_compat
> +++ b/localedata/locales/translit_compat
> @@ -9,7 +9,7 @@ comment_char %
>  % otherwise be governed by that license.
>  
>  % Transliterations of compatibility characters and ligatures.
> -% Generated automatically from UnicodeData.txt by gen_translit_compat.py on 2020-04-14 for Unicode 13.0.0.
> +% Generated automatically from UnicodeData.txt by gen_translit_compat.py on 2020-06-25 for Unicode 13.0.0.

OK. No change. Expected.

>  
>  LC_CTYPE
>  
> diff --git a/localedata/locales/translit_font b/localedata/locales/translit_font
> index e79b0d83f5..680c4ed426 100644
> --- a/localedata/locales/translit_font
> +++ b/localedata/locales/translit_font
> @@ -9,7 +9,7 @@ comment_char %
>  % otherwise be governed by that license.
>  
>  % Transliterations of font equivalents.
> -% Generated automatically from UnicodeData.txt by gen_translit_font.py on 2020-04-14 for Unicode 13.0.0.
> +% Generated automatically from UnicodeData.txt by gen_translit_font.py on 2020-06-25 for Unicode 13.0.0.

OK. No change. Expected.

>  
>  LC_CTYPE
>  
> diff --git a/localedata/locales/translit_fraction b/localedata/locales/translit_fraction
> index 197d57a644..b52244969e 100644
> --- a/localedata/locales/translit_fraction
> +++ b/localedata/locales/translit_fraction
> @@ -9,7 +9,7 @@ comment_char %
>  % otherwise be governed by that license.
>  
>  % Transliterations of fractions.
> -% Generated automatically from UnicodeData.txt by gen_translit_fraction.py on 2020-04-14 for Unicode 13.0.0.> +% Generated automatically from UnicodeData.txt by gen_translit_fraction.py on 2020-06-25 for Unicode 13.0.0.

OK. No change. Expected.

>  % The replacements have been surrounded with spaces, because fractions are
>  % often preceded by a decimal number and followed by a unit or a math symbol.
>  
> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index 17b99ee88d..11c906b92f 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
> @@ -258,7 +258,13 @@ def process_width(outfile, ulines, elines, plines):
>          if key in width_dict:
>              del width_dict[key] # default width is 1
>      for key in list(range(0x1160, 0x1200)):
> -        width_dict[key] = 0
> +        # Hangul jungseong and jongseong:
> +        if key in unicode_utils.UNICODE_ATTRIBUTES:
> +            width_dict[key] = 0
> +    for key in list(range(0xD7B0, 0xD800)):
> +        # Hangul jungseong and jongseong:
> +        if key in unicode_utils.UNICODE_ATTRIBUTES:
> +            width_dict[key] = 0

OK. Expected per bug.

>      for key in list(range(0x3248, 0x3250)):
>          # These are “A” which means we can decide whether to treat them
>          # as “W” or “N” based on context:
> @@ -327,6 +333,7 @@ if __name__ == "__main__":
>          help='The Unicode version of the input files used.')
>      ARGS = PARSER.parse_args()
>  
> +    unicode_utils.fill_attributes(ARGS.unicode_data_file)

OK.

>      with open(ARGS.unicode_data_file, mode='r') as UNIDATA_FILE:
>          UNICODE_DATA_LINES = UNIDATA_FILE.readlines()
>      with open(ARGS.east_asian_with_file, mode='r') as EAST_ASIAN_WIDTH_FILE:
> -- 2.26.2
  

Patch

From 94aa93f64d88dd52353480f3d425c6cc064b0d81 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Tue, 16 Jun 2020 08:29:40 +0200
Subject: [PATCH] Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to
 UD7FB to 0 [BZ #26120]

---
 localedata/charmaps/UTF-8          | 1 +
 localedata/unicode-gen/utf8_gen.py | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..56327e7830 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,7 @@  WIDTH
 <UABE8>	0
 <UABED>	0
 <UAC00>...<UD7A3>	2
+<UD7B0>...<UD7FB>	0
 <UF900>...<UFA6D>	2
 <UFA70>...<UFAD9>	2
 <UFB1E>	0
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 17b99ee88d..f83cb3c5e5 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -258,6 +258,10 @@  def process_width(outfile, ulines, elines, plines):
         if key in width_dict:
             del width_dict[key] # default width is 1
     for key in list(range(0x1160, 0x1200)):
+        # Hangul jungseong and jongseong:
+        width_dict[key] = 0
+    for key in list(range(0xD7B0, 0xD7FC)):
+        # Hangul jungseong and jongseong:
         width_dict[key] = 0
     for key in list(range(0x3248, 0x3250)):
         # These are “A” which means we can decide whether to treat them
-- 
2.26.2