From patchwork Sun Feb 11 18:00:11 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Jules Bertholet
X-Patchwork-Id: 85578
X-Patchwork-Delegate: arjun.is@lostca.se
From: Jules Bertholet
Subject: [PATCH] localedata: Set width of DEFAULT_IGNORABLE_CODE_POINTs to 0 [BZ #31370]
Date: Sun, 11 Feb 2024 18:00:11 +0000 (UTC)
Message-ID: <20240211175840.228824-2-julesbertholet@quoi.xyz>
To: libc-alpha@sourceware.org
Cc: Jules Bertholet
List-Id: Libc-alpha mailing list

Unicode specifies
(https://www.unicode.org/faq/unsup_char.html#3) that characters
with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing,
> i.e. “zero width”), if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two
exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by
  longstanding precedent
- U+115F HANGUL CHOSEONG FILLER combines with jungseong and jongseong
  jamo to form a width-2 syllable block, and should therefore have
  width 2

However, `wcwidth()` currently (before this patch) incorrectly assigns
non-zero width to U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL
FILLER. This commit fixes that.

Signed-off-by: Jules Bertholet
---
 localedata/charmaps/UTF-8          |  5 +++-
 localedata/unicode-gen/Makefile    |  2 ++
 localedata/unicode-gen/utf8_gen.py | 40 ++++++++++++++++++++++++++++--
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index bd8075f20d..d5f1456cc7 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -50069,7 +50069,9 @@ WIDTH
 ... 0
 ... 2
 ... 2
-... 2
+... 2
+ 0
+... 2
 ... 2
 ... 2
 ... 2
@@ -50124,6 +50126,7 @@ WIDTH
 ... 2
  0
 ... 2
+ 0
 ... 2
 ... 0
  0
diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile
index fd0c732ac4..1975065679 100644
--- a/localedata/unicode-gen/Makefile
+++ b/localedata/unicode-gen/Makefile
@@ -1,4 +1,5 @@
 # Copyright (C) 2015-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
 # This file is part of the GNU C Library.

 # The GNU C Library is free software; you can redistribute it and/or
@@ -94,6 +95,7 @@ UTF-8: UnicodeData.txt EastAsianWidth.txt
 UTF-8: utf8_gen.py
 	$(PYTHON3) utf8_gen.py -u UnicodeData.txt \
 	-e EastAsianWidth.txt -p PropList.txt \
+	-d DerivedCoreProperties.txt \
 	--unicode_version $(UNICODE_VERSION)
 
 UTF-8-report: UTF-8 ../charmaps/UTF-8
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index b48dc2aaa4..c27b3c0088 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,6 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
 # Copyright (C) 2014-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
@@ -217,11 +218,13 @@ def write_header_width(outfile, unicode_version):
     # outfile.write("% \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n")
     outfile.write("WIDTH\n")
 
-def process_width(outfile, ulines, elines, plines):
+def process_width(outfile, ulines, elines, plines, dlines):
     '''ulines are lines from UnicodeData.txt, elines are lines from
     EastAsianWidth.txt containing characters with width “W” or “F”,
     plines are lines from PropList.txt which contain characters
     with the property “Prepended_Concatenation_Mark”.
+    dlines are lines from DerivedCoreProperties.txt which contain
+    characters with the property “Default_Ignorable_Code_Point”.
 
     '''
     width_dict = {}
@@ -252,6 +255,24 @@ def process_width(outfile, ulines, elines, plines):
                          int(code_points[1], 16)+1):
             del width_dict[key] # default width is 1
 
+    for line in dlines:
+        # Characters with the property “Default_Ignorable_Code_Point”
+        # should have the width 0:
+        fields = line.split(";")
+        if not '..' in fields[0]:
+            code_points = (fields[0], fields[0])
+        else:
+            code_points = fields[0].split("..")
+        for key in range(int(code_points[0], 16),
+                         int(code_points[1], 16)+1):
+            width_dict[key] = 0 # default width is 1
+
+    # special case: U+115F HANGUL CHOSEONG FILLER
+    # combines with other Hangul jamo to form a width-2
+    # syllable block, so treat it as width 2
+    # despite it being a `Default_Ignorable_Code_Point`
+    width_dict[0x115F] = 2
+
     # handle special cases for compatibility
     for key in list((0x00AD,)):
         # https://www.cs.tut.fi/~jkorpela/shy.html
@@ -325,6 +346,13 @@ if __name__ == "__main__":
         default='PropList.txt',
         help=('The PropList.txt file to read, '
               + 'default: %(default)s'))
+    PARSER.add_argument(
+        '-d', '--derived_core_properties_file',
+        nargs='?',
+        type=str,
+        default='DerivedCoreProperties.txt',
+        help=('The DerivedCoreProperties.txt file to read, '
+              + 'default: %(default)s'))
     PARSER.add_argument(
         '--unicode_version',
         nargs='?',
@@ -357,6 +385,13 @@ if __name__ == "__main__":
         for LINE in PROP_LIST_FILE:
             if re.match(r'^[^;]*;[\s]*Prepended_Concatenation_Mark', LINE):
                 PROP_LIST_LINES.append(LINE.strip())
+    with open(ARGS.derived_core_properties_file, mode='r') as DERIVED_CORE_PROPERTIES_FILE:
+        DERIVED_CORE_PROPERTIES_LINES = []
+        for LINE in DERIVED_CORE_PROPERTIES_FILE:
+            if re.match(r'.*', LINE):
+                continue
+            if re.match(r'^[^;]*;[\s]*Default_Ignorable_Code_Point', LINE):
+                DERIVED_CORE_PROPERTIES_LINES.append(LINE.strip())
     with open('UTF-8', mode='w') as OUTFILE:
         # Processing UnicodeData.txt and write CHARMAP to UTF-8 file
         write_header_charmap(OUTFILE)
@@ -367,5 +402,6 @@ if __name__ == "__main__":
         process_width(OUTFILE,
                       UNICODE_DATA_LINES,
                       EAST_ASIAN_WIDTH_LINES,
-                      PROP_LIST_LINES)
+                      PROP_LIST_LINES,
+                      DERIVED_CORE_PROPERTIES_LINES)
         OUTFILE.write("END WIDTH\n")
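
For readers who want to try the width rule outside the glibc tree, here is a minimal standalone sketch of the logic the commit message describes: every code point listed as `Default_Ignorable_Code_Point` gets width 0, and the two exceptions are then re-applied. The names `SAMPLE_LINES`, `parse_code_point_range` and `apply_default_ignorable` are hypothetical and not part of `utf8_gen.py`. The sample lines are an illustrative subset in the `DerivedCoreProperties.txt` layout, and setting U+00AD to 1 explicitly is a simplification made for this sketch, not necessarily how the script handles the soft hyphen.

```python
# Illustrative subset in the "code point(s) ; property # comment" layout
# of DerivedCoreProperties.txt. This is NOT the complete property list.
SAMPLE_LINES = [
    "00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN",
    "115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER",
    "3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER",
    "FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER",
]


def parse_code_point_range(field):
    """Return (first, last) for 'XXXX' or 'XXXX..YYYY' (hex, inclusive)."""
    first, _, last = field.strip().partition("..")
    # A single code point has no '..' part, so it is its own range end.
    return int(first, 16), int(last or first, 16)


def apply_default_ignorable(width_dict, dlines):
    """Set width 0 for every code point in dlines, then re-apply the
    two exceptions named in the commit message. Hypothetical helper."""
    for line in dlines:
        first, last = parse_code_point_range(line.split(";")[0])
        for code_point in range(first, last + 1):
            width_dict[code_point] = 0
    # U+115F HANGUL CHOSEONG FILLER leads a width-2 syllable block.
    width_dict[0x115F] = 2
    # U+00AD SOFT HYPHEN keeps width 1 by longstanding precedent
    # (assigned explicitly here as a simplification for the sketch).
    width_dict[0x00AD] = 1
    return width_dict
```

Calling `apply_default_ignorable({0x3164: 2}, SAMPLE_LINES)` starts from an illustrative pre-patch state in which U+3164 has width 2 and yields a dictionary in which U+3164 and U+FFA0 both map to 0.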