From patchwork Sun Feb 11 18:00:11 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Jules Bertholet <julesbertholet@quoi.xyz>
X-Patchwork-Id: 85578
X-Patchwork-Delegate: arjun.is@lostca.se
From: Jules Bertholet <julesbertholet@quoi.xyz>
Subject: [PATCH] localedata: Set width of DEFAULT_IGNORABLE_CODE_POINTs to 0
 [BZ #31370]
Date: Sun, 11 Feb 2024 18:00:11 +0000 (UTC)
Message-ID: <20240211175840.228824-2-julesbertholet@quoi.xyz>
MIME-Version: 1.0
To: libc-alpha@sourceware.org
Cc: Jules Bertholet <julesbertholet@quoi.xyz>

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER combines with jungseong and jongseong jamo to form a width-2 syllable block,
  and should therefore have width 2

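Before this patch, `wcwidth()` incorrectly assigns a non-zero width to U+3164 HANGUL FILLER (width 2) and U+FFA0 HALFWIDTH HANGUL FILLER (width 1, the default).
This commit fixes that: `utf8_gen.py` now reads `DerivedCoreProperties.txt` and gives every `Default_Ignorable_Code_Point` a width of 0, apart from the two exceptions above, and `localedata/charmaps/UTF-8` is updated to match.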

Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>
---
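For reviewers, the width rule described above can be sketched in isolation. This is a minimal illustration, not the code in `utf8_gen.py`: the helper name `ignorable_widths` and the sample lines (written in the `DerivedCoreProperties.txt` layout) are assumptions made for the example, and the soft hyphen is set to 1 explicitly here rather than by falling back to the default width.

```python
# Sample lines in the DerivedCoreProperties.txt layout. They are an
# assumption for illustration; the real file is much longer.
SAMPLE_LINES = [
    "00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN",
    "115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER",
    "2065          ; Default_Ignorable_Code_Point # Cn       <reserved-2065>",
    "3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER",
    "FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER",
]


def ignorable_widths(lines):
    """Return {code_point: width} for Default_Ignorable_Code_Point lines.

    Hypothetical helper: it mirrors the rule in the commit message,
    not the exact control flow of utf8_gen.py.
    """
    widths = {}
    for line in lines:
        # Skip unassigned (reserved) code points.
        if "<reserved-" in line:
            continue
        fields = line.split(";")
        if len(fields) < 2:
            continue
        if not fields[1].strip().startswith("Default_Ignorable_Code_Point"):
            continue
        # The first field is either "XXXX" or "XXXX..YYYY".
        first, _, last = fields[0].strip().partition("..")
        last = last or first
        for code_point in range(int(first, 16), int(last, 16) + 1):
            widths[code_point] = 0
    # Exception 1: HANGUL CHOSEONG FILLER leads a width-2 syllable block.
    widths[0x115F] = 2
    # Exception 2: SOFT HYPHEN keeps width 1 by longstanding precedent.
    widths[0x00AD] = 1
    return widths


if __name__ == "__main__":
    for code_point, width in sorted(ignorable_widths(SAMPLE_LINES).items()):
        print("U+{:04X} -> {}".format(code_point, width))
```

With these sample lines, U+3164 and U+FFA0 come out as width 0, while U+115F and U+00AD keep widths 2 and 1.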
 localedata/charmaps/UTF-8          |  5 +++-
 localedata/unicode-gen/Makefile    |  2 ++
 localedata/unicode-gen/utf8_gen.py | 40 ++++++++++++++++++++++++++++--
 3 files changed, 44 insertions(+), 3 deletions(-)
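
The charmap hunk below splits `<U3131>...<U318E>` into three entries because a single code point inside the range changes width. A rough sketch of that kind of range collapsing follows; the function names are hypothetical and this is not claimed to be how `utf8_gen.py` implements it.

```python
def ucs_symbol(code_point):
    """Format a code point the way the WIDTH section spells it."""
    if code_point < 0x10000:
        return "<U{:04X}>".format(code_point)
    return "<U{:08X}>".format(code_point)


def format_entry(start, end, width):
    """Format one WIDTH entry for the inclusive range start..end."""
    if start == end:
        return "{}\t{}".format(ucs_symbol(start), width)
    return "{}...{}\t{}".format(ucs_symbol(start), ucs_symbol(end), width)


def width_lines(widths):
    """Collapse {code_point: width} into WIDTH entries.

    Consecutive code points that share a width are merged into one range.
    """
    lines = []
    start = prev = None
    for code_point in sorted(widths):
        same_run = (start is not None
                    and code_point == prev + 1
                    and widths[code_point] == widths[start])
        if same_run:
            prev = code_point
            continue
        if start is not None:
            lines.append(format_entry(start, prev, widths[start]))
        start = prev = code_point
    if start is not None:
        lines.append(format_entry(start, prev, widths[start]))
    return lines


if __name__ == "__main__":
    # Old state: U+3131..U+318E all width 2; new state: U+3164 is width 0.
    widths = {cp: 2 for cp in range(0x3131, 0x318F)}
    widths[0x3164] = 0
    print("\n".join(width_lines(widths)))
```

Setting U+3164 to 0 inside an otherwise uniform width-2 run yields the same three entries that appear in the hunk.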

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index bd8075f20d..d5f1456cc7 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -50069,7 +50069,9 @@ WIDTH
 <U3099>...<U309A>	0
 <U309B>...<U30FF>	2
 <U3105>...<U312F>	2
-<U3131>...<U318E>	2
+<U3131>...<U3163>	2
+<U3164>	0
+<U3165>...<U318E>	2
 <U3190>...<U31E3>	2
 <U31F0>...<U321E>	2
 <U3220>...<UA48C>	2
@@ -50124,6 +50126,7 @@ WIDTH
 <UFE68>...<UFE6B>	2
 <UFEFF>	0
 <UFF01>...<UFF60>	2
+<UFFA0>	0
 <UFFE0>...<UFFE6>	2
 <UFFF9>...<UFFFB>	0
 <U000101FD>	0
diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile
index fd0c732ac4..1975065679 100644
--- a/localedata/unicode-gen/Makefile
+++ b/localedata/unicode-gen/Makefile
@@ -1,4 +1,5 @@
 # Copyright (C) 2015-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
 # This file is part of the GNU C Library.
 
 # The GNU C Library is free software; you can redistribute it and/or
@@ -94,6 +95,7 @@ UTF-8: UnicodeData.txt EastAsianWidth.txt
 UTF-8: utf8_gen.py
 	$(PYTHON3) utf8_gen.py -u UnicodeData.txt \
 	-e EastAsianWidth.txt -p PropList.txt \
+	-d DerivedCoreProperties.txt \
 	--unicode_version $(UNICODE_VERSION)
 
 UTF-8-report: UTF-8 ../charmaps/UTF-8
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index b48dc2aaa4..c27b3c0088 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,6 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
 # Copyright (C) 2014-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
@@ -217,11 +218,13 @@ def write_header_width(outfile, unicode_version):
 #    outfile.write("%   \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n")
     outfile.write("WIDTH\n")
 
-def process_width(outfile, ulines, elines, plines):
+def process_width(outfile, ulines, elines, plines, dlines):
     '''ulines are lines from UnicodeData.txt, elines are lines from
     EastAsianWidth.txt containing characters with width “W” or “F”,
     plines are lines from PropList.txt which contain characters
     with the property “Prepended_Concatenation_Mark”.
+    dlines are lines from DerivedCoreProperties.txt which contain
+    characters with the property “Default_Ignorable_Code_Point”.
 
     '''
     width_dict = {}
@@ -252,6 +255,24 @@ def process_width(outfile, ulines, elines, plines):
                          int(code_points[1], 16)+1):
             del width_dict[key] # default width is 1
 
+    for line in dlines:
+        # Characters with the property “Default_Ignorable_Code_Point”
+        # should have the width 0:
+        fields = line.split(";")
+        if '..' not in fields[0]:
+            code_points = (fields[0], fields[0])
+        else:
+            code_points = fields[0].split("..")
+        for key in range(int(code_points[0], 16),
+                         int(code_points[1], 16)+1):
+            width_dict[key] = 0 # ignorable, so zero width
+
+    # special case: U+115F HANGUL CHOSEONG FILLER
+    # combines with other Hangul jamo to form a width-2
+    # syllable block, so treat it as width 2
+    # despite it being a `Default_Ignorable_Code_Point`
+    width_dict[0x115F] = 2
+
     # handle special cases for compatibility
     for key in list((0x00AD,)):
         # https://www.cs.tut.fi/~jkorpela/shy.html
@@ -325,6 +346,13 @@ if __name__ == "__main__":
         default='PropList.txt',
         help=('The PropList.txt file to read, '
               + 'default: %(default)s'))
+    PARSER.add_argument(
+        '-d', '--derived_core_properties_file',
+        nargs='?',
+        type=str,
+        default='DerivedCoreProperties.txt',
+        help=('The DerivedCoreProperties.txt file to read, '
+              + 'default: %(default)s'))
     PARSER.add_argument(
         '--unicode_version',
         nargs='?',
@@ -357,6 +385,13 @@ if __name__ == "__main__":
         for LINE in PROP_LIST_FILE:
             if re.match(r'^[^;]*;[\s]*Prepended_Concatenation_Mark', LINE):
                 PROP_LIST_LINES.append(LINE.strip())
+    with open(ARGS.derived_core_properties_file, mode='r') as DERIVED_CORE_PROPERTIES_FILE:
+        DERIVED_CORE_PROPERTIES_LINES = []
+        for LINE in DERIVED_CORE_PROPERTIES_FILE:
+            if re.match(r'.*<reserved-.+>', LINE):
+                continue
+            if re.match(r'^[^;]*;[\s]*Default_Ignorable_Code_Point', LINE):
+                DERIVED_CORE_PROPERTIES_LINES.append(LINE.strip())
     with open('UTF-8', mode='w') as OUTFILE:
         # Processing UnicodeData.txt and write CHARMAP to UTF-8 file
         write_header_charmap(OUTFILE)
@@ -367,5 +402,6 @@ if __name__ == "__main__":
         process_width(OUTFILE,
                       UNICODE_DATA_LINES,
                       EAST_ASIAN_WIDTH_LINES,
-                      PROP_LIST_LINES)
+                      PROP_LIST_LINES,
+                      DERIVED_CORE_PROPERTIES_LINES)
         OUTFILE.write("END WIDTH\n")