[v2] localedata: Fix several issues with the set of characters considered 0-width [BZ #31370]

Message ID 20240218185326.16663-1-julesbertholet@quoi.xyz
State Superseded
Delegated to: Arjun Shankar
Series [v2] localedata: Fix several issues with the set of characters considered 0-width [BZ #31370]

Checks

Context                           Check  Description
redhat-pt-bot/TryBot-apply_patch  fail   Patch failed to apply to master at the time it was sent
redhat-pt-bot/TryBot-32bit        fail   Patch series failed to apply

Commit Message

Jules Bertholet Feb. 18, 2024, 6:54 p.m. UTC
  This new version of the patch has a more detailed commit message,
and includes one more related fix.

---

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of the conjoining Korean jamo characters.
  One composed Hangul "syllable block" like 퓛 is made up of two to three individual component characters, or "jamo".
  These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would normally mean they would all be assigned width 2 by glibc;
  a combination of (leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6.
  However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong, assigning them all width 0,
  to ensure that the complete block has width 2 + 0 + 0 = 2 as it should.
  U+115F is meant for use in syllable blocks that are intentionally missing a leading jamo;
  it must be assigned a width of 2 even though it has no visible display to ensure that the complete block has width 2.
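
As a rough illustration of the arithmetic above, the following Python sketch sums per-character widths for a decomposed syllable block. The `toy_width` function and its hard-coded ranges are assumptions made for this example (they cover only the basic conjoining jamo block U+1100..U+11FF) and are not glibc's actual tables.

```python
# Toy model of the jamo width rule described above.
# The ranges are hard-coded for illustration only.

def toy_width(cp):
    """Return an illustrative column width for one code point."""
    if 0x1100 <= cp <= 0x115F:   # leading consonants, including the U+115F filler
        return 2
    if 0x1160 <= cp <= 0x11FF:   # medial vowels and trailing consonants
        return 0
    return 1                     # default width

def block_width(text):
    """Sum the widths of all code points in a string."""
    return sum(toy_width(ord(ch)) for ch in text)

# One choseong, one jungseong, one jongseong.
full_block = "\u1111\u1171\u11b6"
# U+115F stands in for the intentionally missing leading consonant.
filler_block = "\u115f\u1171\u11b6"

print(block_width(full_block))
print(block_width(filler_block))
```

Both strings come out at the same total because the filler carries the width that the leading consonant would otherwise contribute.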

However, `wcwidth()` currently (before this patch) incorrectly assigns non-zero width to U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER;
this commit fixes that.

You can read more about Unicode jamo in the Unicode spec, sections 3.12 <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646> and 18.6 <https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028>,
and about `Default_Ignorable_Code_Point` in §5.21 <https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095>.

---

The Unicode Standard, §5.21 - Characters Ignored for Display <https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095> says the following:

> A small number of format characters (General_Category = Cf ) are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls, U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering, because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.

glibc currently correctly assigns non-zero width to the prepended concatenation marks,
but it incorrectly gives zero width to the interlinear annotation characters (which a generic terminal cannot interpret)
and the Egyptian hieroglyph format controls (which are not widely supported in rendering implementations at present).
This commit fixes both these issues as well.
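
The difference between keying on `General_Category = Cf` and keying on `Default_Ignorable_Code_Point` can be sketched in a few lines of Python. The `DEFAULT_IGNORABLE_SAMPLE` set below is a hand-picked assumption for illustration, not data read from DerivedCoreProperties.txt, and both helper functions are hypothetical.

```python
import unicodedata

# Hand-picked sample of code points that carry Default_Ignorable_Code_Point.
# A real generator would read this from DerivedCoreProperties.txt.
DEFAULT_IGNORABLE_SAMPLE = {0x200B, 0xFEFF}

def width_by_category(cp):
    """Width 0 for every format character (Cf), 1 otherwise."""
    return 0 if unicodedata.category(chr(cp)) == "Cf" else 1

def width_by_property(cp):
    """Width 0 only for code points in the default-ignorable sample."""
    return 0 if cp in DEFAULT_IGNORABLE_SAMPLE else 1

# U+200B ZERO WIDTH SPACE, U+FEFF ZERO WIDTH NO-BREAK SPACE,
# U+0600 ARABIC NUMBER SIGN, U+FFF9 INTERLINEAR ANNOTATION ANCHOR
for cp in (0x200B, 0xFEFF, 0x0600, 0xFFF9):
    print(f"U+{cp:04X}", width_by_category(cp), width_by_property(cp))
```

All four sample code points are format characters, so the category-only rule cannot tell the ignorable ones apart from those that need a visible fallback glyph.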

Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>
---
 localedata/charmaps/UTF-8          | 21 ++++++----
 localedata/unicode-gen/Makefile    |  2 +
 localedata/unicode-gen/utf8_gen.py | 67 +++++++++++++++++-------------
 3 files changed, 53 insertions(+), 37 deletions(-)
  

Comments

Arjun Shankar Feb. 20, 2024, 12:57 p.m. UTC | #1
Hi Jules,

> This new version of the patch has a more detailed commit message,
> and includes one more related fix.

Thanks for working on this!

Looks like due to a couple of intervening changes to utf8_gen.py (and
the generated UTF-8 file) in master along with some copyright line
changes, your patch doesn't currently apply to master and will need an
update. Anyway, I resolved the conflict by hand and continued
reviewing so I could offer some feedback in this iteration itself.
I've found some issues that I've mentioned below, inline with the
patch content.

>
> ---

The first occurrence of "---" makes git-am drop the rest of the body
from the commit message. I usually put all my
non-commit-message-relevant notes after the first "---" printed by
git's email tools when composing a patch post.
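
To make the truncation concrete, here is a simplified Python model of it. It assumes that the commit message ends at the first line consisting solely of "---"; the real git rules recognise more cases, and `split_commit_message` is a hypothetical helper, not part of git.

```python
def split_commit_message(body):
    """Split an email body at the first line that is exactly '---'.

    Returns (commit_message, remainder). Simplified model only.
    """
    lines = body.splitlines()
    kept = []
    for index, line in enumerate(lines):
        if line.rstrip() == "---":
            return "\n".join(kept), "\n".join(lines[index + 1:])
        kept.append(line)
    return "\n".join(kept), ""

body = """Intro note for reviewers.

---

Unicode specifies that ...

Signed-off-by: A U Thor <author@example.com>
---
 file | 2 +-
"""

message, remainder = split_commit_message(body)
print(repr(message))
```

Under this model only the introductory note survives as the commit message; the detailed explanation and the sign-off land in the discarded remainder.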

> Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters with the `Default_Ignorable_Code_Point` property
>
> > should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.
>
> Hence, `wcwidth()` should give them all a width of 0, with two exceptions:
>
> - the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
> - U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of the conjoining Korean jamo characters.
>   One composed Hangul "syllable block" like 퓛 is made up of two to three individual component characters, or "jamo".
>   These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would normally mean they would all be assigned width 2 by glibc;
>   a combination of (leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6.
>   However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong, assigning them all width 0,
>   to ensure that the complete block has width 2 + 0 + 0 = 2 as it should.
>   U+115F is meant for use in syllable blocks that are intentionally missing a leading jamo;
>   it must be assigned a width of 2 even though it has no visible display to ensure that the complete block has width 2.

OK. I assume this simply explains current and correct behaviour. I'm
wondering if some of this can instead be used to expand the existing
comments in the `write_header_width' function of utf8_gen.py instead.

> However, `wcwidth()` currently (before this patch) incorrectly assigns non-zero width to U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER;
> this commit fixes that.

OK. I'll look for this change below.

> You can read more about Unicode jamo in the Unicode spec, sections 3.12 <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646> and 18.6 <https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028>,
> and about `Default_Ignorable_Code_Point` in §5.21 <https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095>.

I suggest replacing the "You can read more" with a bulleted list of references.

>
> ---

> The Unicode Standard, §5.21 - Characters Ignored for Display <https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095> says the following:
>
> > A small number of format characters (General_Category = Cf ) are also not given the Default_Ignorable_Code_Point property.
> > This may surprise implementers, who often assume that all format characters are generally ignored in fallback display.
> > The exact list of these exceptional format characters can be found in the Unicode Character Database.
> > There are, however, three important sets of such format characters to note:
> >
> > - prepended concatenation marks
> > - interlinear annotation characters
> > - Egyptian hieroglyph format controls
> >
> > The prepended concatenation marks always have a visible display.
> > See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> > for more discussion of the use and display of these signs.
> >
> > The other two notable sets of format characters that exceptionally are not ignored in fallback display consist of the interlinear annotation characters,
> > U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> > and the Egyptian hieroglyph format controls, U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> > These characters should have a visible glyph display for fallback rendering, because if they are not displayed,
> > it is too easy to misread the resulting displayed text.
> > See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> > as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> > for more discussion of the use and display of these characters.

OK. A direct quote from the chapter.

> glibc currently correctly assigns non-zero width to the prepended concatenation marks,
> but it incorrectly gives zero width to the interlinear annotation characters (which a generic terminal cannot interpret)
> and the Egyptian hieroglyph format controls (which are not widely supported in rendering implementations at present).
> This commit fixes both these issues as well.

OK. I'll look for this change below.

> Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>

A minor nit: would be great if some of these long lines get split
across multiple lines. Of course, it's only a nit and `git log' shows
that many commit messages do have long lines in them.

> ---
>  localedata/charmaps/UTF-8          | 21 ++++++----
>  localedata/unicode-gen/Makefile    |  2 +
>  localedata/unicode-gen/utf8_gen.py | 67 +++++++++++++++++-------------
>  3 files changed, 53 insertions(+), 37 deletions(-)
>
> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
> index bd8075f20d..f3fcd64fce 100644
> --- a/localedata/charmaps/UTF-8
> +++ b/localedata/charmaps/UTF-8
> @@ -49842,12 +49842,17 @@ END CHARMAP
>
>  % Character width according to Unicode 15.0.0.
>  % - Default width is 1.
> +% - U+115F HANGUL CHOSEONG FILLER has width 2.
> +% - Combining jungseong and jongseong Hangul jamo have width 0.
> +% - U+00AD SOFT HYPHEN has width 1.
>  % - Double-width characters have width 2; generated from
>  %        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
> -% - Non-spacing characters have width 0; generated from PropList.txt or
> -%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
> -% - Format control characters have width 0; generated from
> -%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
> +% - Non-spacing marks have width 0; generated from
> +%   "grep '^[^;]*;[^;]*;Mn;' UnicodeData.txt"
> +% - Enclosing marks have width 0; generated from
> +%   "grep '^[^;]*;[^;]*;Me;' UnicodeData.txt"
> +% - "Default_Ignorable_Code_Point"s have width 0; generated from
> +%   "grep '^[^;]*;\s*Default_Ignorable_Code_Point' DerivedCoreProperties.txt"

This bit doesn't apply due to the conflict I mentioned earlier.

>  WIDTH
>  <U0300>...<U036F>      0
>  <U0483>...<U0489>      0
> @@ -50069,7 +50074,9 @@ WIDTH
>  <U3099>...<U309A>      0
>  <U309B>...<U30FF>      2
>  <U3105>...<U312F>      2

> -<U3131>...<U318E>      2
> +<U3131>...<U3163>      2
> +<U3164>        0
> +<U3165>...<U318E>      2

OK. HANGUL FILLER.

>  <U3190>...<U31E3>      2
>  <U31F0>...<U321E>      2
>  <U3220>...<UA48C>      2
> @@ -50124,8 +50131,8 @@ WIDTH
>  <UFE68>...<UFE6B>      2
>  <UFEFF>        0
>  <UFF01>...<UFF60>      2

> +<UFFA0>        0

OK. HALFWIDTH HANGUL FILLER.

>  <UFFE0>...<UFFE6>      2

> -<UFFF9>...<UFFFB>      0

OK. "U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR
ANNOTATION TERMINATOR" should not be ignored. You quoted this in the
commit message.

>  <U000101FD>    0
>  <U000102E0>    0
>  <U00010376>...<U0001037A>      0
> @@ -50226,7 +50233,7 @@ WIDTH
>  <U00011F36>...<U00011F3A>      0
>  <U00011F40>    0
>  <U00011F42>    0

> -<U00013430>...<U00013440>      0
> +<U00013440>    0

OK. "U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F
EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE" should not be ignored.

>  <U00013447>...<U00013455>      0
>  <U00016AF0>...<U00016AF4>      0
>  <U00016B30>...<U00016B36>      0

> diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile
> index fd0c732ac4..1975065679 100644
> --- a/localedata/unicode-gen/Makefile
> +++ b/localedata/unicode-gen/Makefile
> @@ -1,4 +1,5 @@
>  # Copyright (C) 2015-2023 Free Software Foundation, Inc.
> +# Copyright (C) 2024 The GNU Toolchain Authors.

This bit doesn't apply due to a recent copyright line change in master.

>  # This file is part of the GNU C Library.
>
>  # The GNU C Library is free software; you can redistribute it and/or
> @@ -94,6 +95,7 @@ UTF-8: UnicodeData.txt EastAsianWidth.txt
>  UTF-8: utf8_gen.py
>         $(PYTHON3) utf8_gen.py -u UnicodeData.txt \
>         -e EastAsianWidth.txt -p PropList.txt \
> +       -d DerivedCoreProperties.txt \

OK. Adds a new parameter.

>         --unicode_version $(UNICODE_VERSION)
>
>  UTF-8-report: UTF-8 ../charmaps/UTF-8
> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index b48dc2aaa4..eedf6eadb0 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
> @@ -1,6 +1,7 @@
>  #!/usr/bin/python3
>  # -*- coding: utf-8 -*-
>  # Copyright (C) 2014-2023 Free Software Foundation, Inc.
> +# Copyright (C) 2024 The GNU Toolchain Authors.

Again, needs to be rebased due to a copyright line change.

>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> @@ -28,7 +29,6 @@ It will output UTF-8 file
>  '''
>
>  import argparse
> -import sys

OK. As long as the script continues to run.

>  import re
>  import unicode_utils
>
> @@ -203,25 +203,24 @@ def write_header_width(outfile, unicode_version):
>      outfile.write('% Character width according to Unicode '
>                    + '{:s}.\n'.format(unicode_version))
>      outfile.write('% - Default width is 1.\n')
> +    outfile.write('% - U+115F HANGUL CHOSEONG FILLER has width 2.\n')
> +    outfile.write('% - Combining jungseong and jongseong Hangul jamo have width 0.\n')
> +    outfile.write('% - U+00AD SOFT HYPHEN has width 1.\n')

OK. You did add comments about the change.

>      outfile.write('% - Double-width characters have width 2; generated from\n')
>      outfile.write('%        "grep \'^[^;]*;[WF]\' EastAsianWidth.txt"\n')
> -    outfile.write('% - Non-spacing characters have width 0; '
> -                  + 'generated from PropList.txt or\n')
> -    outfile.write('%   "grep \'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;\' '
> -                  + 'UnicodeData.txt"\n')
> -    outfile.write('% - Format control characters have width 0; '
> -                  + 'generated from\n')
> -    outfile.write("%   \"grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt\"\n")
> -#   Not needed covered by Cf
> -#    outfile.write("% - Zero width characters have width 0; generated from\n")
> -#    outfile.write("%   \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n")
> +    outfile.write('% - Non-spacing marks have width 0; generated from\n')
> +    outfile.write('%   "grep \'^[^;]*;[^;]*;Mn;\' UnicodeData.txt"\n')
> +    outfile.write('% - Enclosing marks have width 0; generated from\n')
> +    outfile.write('%   "grep \'^[^;]*;[^;]*;Me;\' UnicodeData.txt"\n')
> +    outfile.write('% - "Default_Ignorable_Code_Point"s have width 0; generated from\n')
> +    outfile.write("%   \"grep '^[^;]*;\\s*Default_Ignorable_Code_Point' DerivedCoreProperties.txt\"\n")

This doesn't apply due to conflicts.

>      outfile.write("WIDTH\n")
>
> -def process_width(outfile, ulines, elines, plines):
> +def process_width(outfile, ulines, elines, dlines):
>      '''ulines are lines from UnicodeData.txt, elines are lines from
> -    EastAsianWidth.txt containing characters with width “W” or “F”,
> -    plines are lines from PropList.txt which contain characters
> -    with the property “Prepended_Concatenation_Mark”.
> +    EastAsianWidth.txt containing characters with width “W” or “F”.
> +    dlines are lines from DerivedCoreProperties.txt which contain
> +    characters with the property “Default_Ignorable_Code_Point”.
>
>      '''
>      width_dict = {}
> @@ -237,12 +236,12 @@ def process_width(outfile, ulines, elines, plines):
>
>      for line in ulines:
>          fields = line.split(";")
> -        if fields[4] == "NSM" or fields[2] in ("Cf", "Me", "Mn"):
> +        if fields[4] == "NSM" or fields[2] in ("Me", "Mn"):
>              width_dict[int(fields[0], 16)] = 0
>
> -    for line in plines:
> -        # Characters with the property “Prepended_Concatenation_Mark”
> -        # should have the width 1:
> +    for line in dlines:
> +        # Characters with the property “Default_Ignorable_Code_Point”
> +        # should have the width 0:
>          fields = line.split(";")
>          if not '..' in fields[0]:
>              code_points = (fields[0], fields[0])
> @@ -250,7 +249,13 @@ def process_width(outfile, ulines, elines, plines):
>              code_points = fields[0].split("..")
>          for key in range(int(code_points[0], 16),
>                           int(code_points[1], 16)+1):
> -            del width_dict[key] # default width is 1
> +            width_dict[key] = 0 # default ignorable: width 0
> +
> +    # special case: U+115F HANGUL CHOSEONG FILLER
> +    # combines with other Hangul jamo to form a width-2
> +    # syllable block, so treat it as width 2
> +    # despite it being a `Default_Ignorable_Code_Point`
> +    width_dict[0x115F] = 2
>
>      # handle special cases for compatibility
>      for key in list((0x00AD,)):
> @@ -302,7 +307,7 @@ def process_width(outfile, ulines, elines, plines):
>  if __name__ == "__main__":
>      PARSER = argparse.ArgumentParser(
>          description='''
> -        Generate a UTF-8 file from UnicodeData.txt, EastAsianWidth.txt, and PropList.txt.
> +        Generate a UTF-8 file from UnicodeData.txt, DerivedCoreProperties.txt, and EastAsianWidth.txt
>          ''')
>      PARSER.add_argument(
>          '-u', '--unicode_data_file',
> @@ -319,11 +324,11 @@ if __name__ == "__main__":
>          help=('The EastAsianWidth.txt file to read, '
>                + 'default: %(default)s'))
>      PARSER.add_argument(
> -        '-p', '--prop_list_file',
> +        '-d', '--derived_core_properties_file',

This seems problematic. Running `make UTF-8' in localedata/unicode-gen
errors out:
"utf8_gen.py: error: unrecognized arguments: -p PropList.txt"

I didn't get around to reviewing the changes to the script but I'll
look out for a v3.
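
The "unrecognized arguments" failure above can be reproduced in isolation. The sketch below builds a small argparse parser that, like the patched script, defines `-d` but no longer defines `-p`, and then feeds it the argument list the Makefile still passes. The `dest` name for `-e` and the `try_parse` helper are assumptions for this example.

```python
import argparse

def build_parser():
    """Parser with -u, -e and -d, but without the removed -p option."""
    parser = argparse.ArgumentParser(prog="utf8_gen.py")
    parser.add_argument("-u", "--unicode_data_file",
                        default="UnicodeData.txt")
    parser.add_argument("-e", dest="east_asian_width_file",
                        default="EastAsianWidth.txt")
    parser.add_argument("-d", "--derived_core_properties_file",
                        default="DerivedCoreProperties.txt")
    return parser

def try_parse(argv):
    """Return the parsed namespace, or the exit code if argparse bails out."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code

# Arguments as the Makefile rule passes them after the patch.
makefile_args = ["-u", "UnicodeData.txt",
                 "-e", "EastAsianWidth.txt", "-p", "PropList.txt",
                 "-d", "DerivedCoreProperties.txt"]

print(try_parse(makefile_args))
```

argparse reports the leftover `-p PropList.txt` on stderr and exits, so either the Makefile rule or the option list in the script has to change for the two to agree again.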

Cheers!

--
Arjun Shankar
he/him/his
  

Patch

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index bd8075f20d..f3fcd64fce 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -49842,12 +49842,17 @@  END CHARMAP
 
 % Character width according to Unicode 15.0.0.
 % - Default width is 1.
+% - U+115F HANGUL CHOSEONG FILLER has width 2.
+% - Combining jungseong and jongseong Hangul jamo have width 0.
+% - U+00AD SOFT HYPHEN has width 1.
 % - Double-width characters have width 2; generated from
 %        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
-% - Non-spacing characters have width 0; generated from PropList.txt or
-%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
-% - Format control characters have width 0; generated from
-%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
+% - Non-spacing marks have width 0; generated from
+%   "grep '^[^;]*;[^;]*;Mn;' UnicodeData.txt"
+% - Enclosing marks have width 0; generated from
+%   "grep '^[^;]*;[^;]*;Me;' UnicodeData.txt"
+% - "Default_Ignorable_Code_Point"s have width 0; generated from
+%   "grep '^[^;]*;\s*Default_Ignorable_Code_Point' DerivedCoreProperties.txt"
 WIDTH
 <U0300>...<U036F>	0
 <U0483>...<U0489>	0
@@ -50069,7 +50074,9 @@  WIDTH
 <U3099>...<U309A>	0
 <U309B>...<U30FF>	2
 <U3105>...<U312F>	2
-<U3131>...<U318E>	2
+<U3131>...<U3163>	2
+<U3164>	0
+<U3165>...<U318E>	2
 <U3190>...<U31E3>	2
 <U31F0>...<U321E>	2
 <U3220>...<UA48C>	2
@@ -50124,8 +50131,8 @@  WIDTH
 <UFE68>...<UFE6B>	2
 <UFEFF>	0
 <UFF01>...<UFF60>	2
+<UFFA0>	0
 <UFFE0>...<UFFE6>	2
-<UFFF9>...<UFFFB>	0
 <U000101FD>	0
 <U000102E0>	0
 <U00010376>...<U0001037A>	0
@@ -50226,7 +50233,7 @@  WIDTH
 <U00011F36>...<U00011F3A>	0
 <U00011F40>	0
 <U00011F42>	0
-<U00013430>...<U00013440>	0
+<U00013440>	0
 <U00013447>...<U00013455>	0
 <U00016AF0>...<U00016AF4>	0
 <U00016B30>...<U00016B36>	0
diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile
index fd0c732ac4..1975065679 100644
--- a/localedata/unicode-gen/Makefile
+++ b/localedata/unicode-gen/Makefile
@@ -1,4 +1,5 @@ 
 # Copyright (C) 2015-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
 # This file is part of the GNU C Library.
 
 # The GNU C Library is free software; you can redistribute it and/or
@@ -94,6 +95,7 @@  UTF-8: UnicodeData.txt EastAsianWidth.txt
 UTF-8: utf8_gen.py
 	$(PYTHON3) utf8_gen.py -u UnicodeData.txt \
 	-e EastAsianWidth.txt -p PropList.txt \
+	-d DerivedCoreProperties.txt \
 	--unicode_version $(UNICODE_VERSION)
 
 UTF-8-report: UTF-8 ../charmaps/UTF-8
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index b48dc2aaa4..eedf6eadb0 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,6 +1,7 @@ 
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
 # Copyright (C) 2014-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
@@ -28,7 +29,6 @@  It will output UTF-8 file
 '''
 
 import argparse
-import sys
 import re
 import unicode_utils
 
@@ -203,25 +203,24 @@  def write_header_width(outfile, unicode_version):
     outfile.write('% Character width according to Unicode '
                   + '{:s}.\n'.format(unicode_version))
     outfile.write('% - Default width is 1.\n')
+    outfile.write('% - U+115F HANGUL CHOSEONG FILLER has width 2.\n')
+    outfile.write('% - Combining jungseong and jongseong Hangul jamo have width 0.\n')
+    outfile.write('% - U+00AD SOFT HYPHEN has width 1.\n')
     outfile.write('% - Double-width characters have width 2; generated from\n')
     outfile.write('%        "grep \'^[^;]*;[WF]\' EastAsianWidth.txt"\n')
-    outfile.write('% - Non-spacing characters have width 0; '
-                  + 'generated from PropList.txt or\n')
-    outfile.write('%   "grep \'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;\' '
-                  + 'UnicodeData.txt"\n')
-    outfile.write('% - Format control characters have width 0; '
-                  + 'generated from\n')
-    outfile.write("%   \"grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt\"\n")
-#   Not needed covered by Cf
-#    outfile.write("% - Zero width characters have width 0; generated from\n")
-#    outfile.write("%   \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n")
+    outfile.write('% - Non-spacing marks have width 0; generated from\n')
+    outfile.write('%   "grep \'^[^;]*;[^;]*;Mn;\' UnicodeData.txt"\n')
+    outfile.write('% - Enclosing marks have width 0; generated from\n')
+    outfile.write('%   "grep \'^[^;]*;[^;]*;Me;\' UnicodeData.txt"\n')
+    outfile.write('% - "Default_Ignorable_Code_Point"s have width 0; generated from\n')
+    outfile.write("%   \"grep '^[^;]*;\\s*Default_Ignorable_Code_Point' DerivedCoreProperties.txt\"\n")
     outfile.write("WIDTH\n")
 
-def process_width(outfile, ulines, elines, plines):
+def process_width(outfile, ulines, elines, dlines):
     '''ulines are lines from UnicodeData.txt, elines are lines from
-    EastAsianWidth.txt containing characters with width “W” or “F”,
-    plines are lines from PropList.txt which contain characters
-    with the property “Prepended_Concatenation_Mark”.
+    EastAsianWidth.txt containing characters with width “W” or “F”.
+    dlines are lines from DerivedCoreProperties.txt which contain
+    characters with the property “Default_Ignorable_Code_Point”.
 
     '''
     width_dict = {}
@@ -237,12 +236,12 @@  def process_width(outfile, ulines, elines, plines):
 
     for line in ulines:
         fields = line.split(";")
-        if fields[4] == "NSM" or fields[2] in ("Cf", "Me", "Mn"):
+        if fields[4] == "NSM" or fields[2] in ("Me", "Mn"):
             width_dict[int(fields[0], 16)] = 0
 
-    for line in plines:
-        # Characters with the property “Prepended_Concatenation_Mark”
-        # should have the width 1:
+    for line in dlines:
+        # Characters with the property “Default_Ignorable_Code_Point”
+        # should have the width 0:
         fields = line.split(";")
         if not '..' in fields[0]:
             code_points = (fields[0], fields[0])
@@ -250,7 +249,13 @@  def process_width(outfile, ulines, elines, plines):
             code_points = fields[0].split("..")
         for key in range(int(code_points[0], 16),
                          int(code_points[1], 16)+1):
-            del width_dict[key] # default width is 1
+            width_dict[key] = 0 # default ignorable: width 0
+
+    # special case: U+115F HANGUL CHOSEONG FILLER
+    # combines with other Hangul jamo to form a width-2
+    # syllable block, so treat it as width 2
+    # despite it being a `Default_Ignorable_Code_Point`
+    width_dict[0x115F] = 2
 
     # handle special cases for compatibility
     for key in list((0x00AD,)):
@@ -302,7 +307,7 @@  def process_width(outfile, ulines, elines, plines):
 if __name__ == "__main__":
     PARSER = argparse.ArgumentParser(
         description='''
-        Generate a UTF-8 file from UnicodeData.txt, EastAsianWidth.txt, and PropList.txt.
+        Generate a UTF-8 file from UnicodeData.txt, DerivedCoreProperties.txt, and EastAsianWidth.txt
         ''')
     PARSER.add_argument(
         '-u', '--unicode_data_file',
@@ -319,11 +324,11 @@  if __name__ == "__main__":
         help=('The EastAsianWidth.txt file to read, '
               + 'default: %(default)s'))
     PARSER.add_argument(
-        '-p', '--prop_list_file',
+        '-d', '--derived_core_properties_file',
         nargs='?',
         type=str,
-        default='PropList.txt',
-        help=('The PropList.txt file to read, '
+        default='DerivedCoreProperties.txt',
+        help=('The DerivedCoreProperties.txt file to read, '
               + 'default: %(default)s'))
     PARSER.add_argument(
         '--unicode_version',
@@ -352,11 +357,13 @@  if __name__ == "__main__":
                 continue
             if re.match(r'^[^;]*;[WF]', LINE):
                 EAST_ASIAN_WIDTH_LINES.append(LINE.strip())
-    with open(ARGS.prop_list_file, mode='r') as PROP_LIST_FILE:
-        PROP_LIST_LINES = []
-        for LINE in PROP_LIST_FILE:
-            if re.match(r'^[^;]*;[\s]*Prepended_Concatenation_Mark', LINE):
-                PROP_LIST_LINES.append(LINE.strip())
+    with open(ARGS.derived_core_properties_file, mode='r') as DERIVED_CORE_PROPERTIES_FILE:
+        DERIVED_CORE_PROPERTIES_LINES = []
+        for LINE in DERIVED_CORE_PROPERTIES_FILE:
+            if re.match(r'.*<reserved-.+>', LINE):
+                continue
+            if re.match(r'^[^;]*;\s*Default_Ignorable_Code_Point', LINE):
+                DERIVED_CORE_PROPERTIES_LINES.append(LINE.strip())
     with open('UTF-8', mode='w') as OUTFILE:
         # Processing UnicodeData.txt and write CHARMAP to UTF-8 file
         write_header_charmap(OUTFILE)
@@ -367,5 +374,5 @@  if __name__ == "__main__":
         process_width(OUTFILE,
                       UNICODE_DATA_LINES,
                       EAST_ASIAN_WIDTH_LINES,
-                      PROP_LIST_LINES)
+                      DERIVED_CORE_PROPERTIES_LINES)
         OUTFILE.write("END WIDTH\n")
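
The range handling that the `dlines` loop in the patch performs can be tried out on its own. This is a minimal sketch under stated assumptions: the sample lines are modelled on the DerivedCoreProperties.txt layout, `default_ignorable_widths` is a hypothetical stand-alone function, and the soft hyphen is set to 1 explicitly here for clarity rather than by the mechanism the script uses.

```python
def default_ignorable_widths(dlines):
    """Map each default-ignorable code point to width 0, then apply
    the two exceptions named in the commit message."""
    width = {}
    for line in dlines:
        code_field = line.split(";")[0].strip()
        first, _, last = code_field.partition("..")
        last = last or first          # single code point: range of one
        for cp in range(int(first, 16), int(last, 16) + 1):
            width[cp] = 0
    width[0x115F] = 2                 # HANGUL CHOSEONG FILLER
    width[0x00AD] = 1                 # SOFT HYPHEN, by precedent
    return width

sample_lines = [
    "00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN",
    "115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER",
    "3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER",
    "FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER",
]

widths = default_ignorable_widths(sample_lines)
for cp in sorted(widths):
    print(f"U+{cp:04X} -> {widths[cp]}")
```

Applying the exceptions after the loop means that the order of lines in the input file cannot undo them.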