From patchwork Wed Apr 28 13:00:31 2021
X-Patchwork-Submitter: Carlos O'Donell
X-Patchwork-Id: 43187
From: Carlos O'Donell
To: libc-alpha@sourceware.org, fweimer@redhat.com
Subject: [PATCH v4 2/4] Update UTF-8 charmap processing.
Date: Wed, 28 Apr 2021 09:00:31 -0400
Message-Id: <20210428130033.3196848-3-carlos@redhat.com>
In-Reply-To: <20210428130033.3196848-1-carlos@redhat.com>
References: <20210428130033.3196848-1-carlos@redhat.com>

The UTF-8 character map processing is updated to use the new wider
ellipsis support. On top of this the Unicode Noncharacters compliance
is improved by adding Noncharacters to the UTF-8 character map to
allow them to be processed and transformed correctly when considering
the character map only. All gaps, excluding surrogates, for the UTF-8
character map are filled with unassigned blocks of characters. The
UTF-8 character map now includes all Unicode Scalar values.

Tested by regenerating the locale data from the Unicode data and
running the testsuite.

Tested on x86_64 and i686 without regression.
---
 localedata/unicode-gen/utf8_gen.py | 133 +++++++++++++++++++----------
 1 file changed, 86 insertions(+), 47 deletions(-)

diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 899840923a..56a680bc06 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -81,25 +81,46 @@ def process_range(start, end, outfile, name):
     # 3400;;Lo;0;L;;;;;N;;;;;
     # 4DB5;;Lo;0;L;;;;;N;;;;;
     #
-    # The glibc UTF-8 file splits ranges like these into shorter
+    # The old glibc UTF-8 file splits ranges like these into shorter
     # ranges of 64 code points each:
     #
     # .. /xe3/x90/x80
     # …
     # .. /xe4/xb6/x80
-    for i in range(int(start, 16), int(end, 16), 64 ):
-        if i > (int(end, 16)-64):
-            outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                    unicode_utils.ucs_symbol(i),
-                    unicode_utils.ucs_symbol(int(end,16)),
-                    convert_to_hex(i),
-                    name))
-            break
-        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                unicode_utils.ucs_symbol(i),
-                unicode_utils.ucs_symbol(i+63),
-                convert_to_hex(i),
-                name))
+    #
+    # We do not split the ranges like this. It is not required. The
+    # ellipsis processing in ld-collate.c can handle any sized ranges.
+    outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
+        unicode_utils.ucs_symbol(int (start, 16)),
+        unicode_utils.ucs_symbol(int (end, 16)),
+        convert_to_hex (int (start, 16)),
+        name))
+
+def process_gap (start, end, outfile):
+    '''This function processes a gap and fills it if needed. The value
+    of start is the last value output, and the value of end is the
+    next value which may be output. Therefore if there is a gap
+    between the two then it is filled with an ellipsis or a single
+    symbol.
+
+    '''
+    # If start and end are more than 1 away then we have a gap, and
+    # that needs filling to provide proper code-point collation
+    # support.
+    cp_prev = int(start, 16)
+    cp_next = int(end, 16)
+
+    # Special case of just one symbol missing?
+    if cp_next - 1 == cp_prev + 1:
+        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+            unicode_utils.ucs_symbol(cp_prev + 1),
+            convert_to_hex(cp_prev + 1),
+            ''))
+    elif cp_next > cp_prev + 1:
+        # More than one symbol, so use an ellipsis.
+        process_range ('{:x}'.format(cp_prev + 1),
+                       '{:x}'.format(cp_next - 1),
+                       outfile, '')
 
 def process_charmap(flines, outfile):
     '''This function takes an array which contains *all* lines of
@@ -129,63 +150,81 @@ def process_charmap(flines, outfile):
     % /xed/xad/xbf
     .. /xf4/x8f/xbf/x80
 
+    The old glibc UTF-8 charmap left the surrogates commented out.
+    Surrogates are not Unicode scalar values, and are ill-formed code
+    sequences. We continue to comment them out in the character map to
+    ensure no locale accidentally uses these values. The use of
+    surrogate symbols will be treated as if they were UNDEFINED. The
+    converters will handle them as ill-formed code sequences and either
+    raise an error or transform them to REPLACEMENT CHARACTER.
     '''
     fields_start = []
+    fields_end = []
     for line in flines:
         fields = line.split(";")
-        # Some characters have “” as their name. We try to
-        # use the “Unicode 1.0 Name” (10th field in
-        # UnicodeData.txt) for them.
-        #
-        # The Characters U+0080, U+0081, U+0084 and U+0099 have
-        # “” as their name but do not even have aa
-        # ”Unicode 1.0 Name”. We could write code to take their
-        # alternate names from NameAliases.txt.
+        # Some characters have "" as their name. We try to
+        # use the "Unicode 1.0 Name" (10th field in
+        # UnicodeData.txt) for them.
+        #
+        # The Characters U+0080, U+0081, U+0084 and U+0099 have
+        # "" as their name but do not even have a
+        # "Unicode 1.0 Name". We could write code to take their
+        # alternate names from NameAliases.txt.
         if fields[1] == "" and fields[10]:
             fields[1] = fields[10]
         # Handling code point ranges like:
         #
         # 3400;;Lo;0;L;;;;;N;;;;;
         # 4DB5;;Lo;0;L;;;;;N;;;;;
-        if fields[1].endswith(', First>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', First>'):
             fields_start = fields
             continue
-        if fields[1].endswith(', Last>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', Last>'):
+            # 1. Process the gap.
+            # First process the gap between the last entry and the
+            # newly started range.
+            process_gap (fields_end[0], fields_start[0], outfile)
+            # 2. Exclude surrogate ranges.
+            # Comment out the surrogates in the UTF-8 file.
+            # One could of course skip them completely but
+            # the original UTF-8 file in glibc had them as
+            # comments, so we keep these comment lines.
+            if 'Surrogate,' in fields[1]:
+                outfile.write('%')
+            # 3. Process the range.
             process_range(fields_start[0], fields[0],
                           outfile, fields[1][:-7]+'>')
             fields_start = []
+            fields_end = fields
             continue
         fields_start = []
-        if 'Surrogate,' in fields[1]:
-            # Comment out the surrogates in the UTF-8 file.
-            # One could of course skip them completely but
-            # the original UTF-8 file in glibc had them as
-            # comments, so we keep these comment lines.
-            outfile.write('%')
+
+        if len (fields_end) > 0:
+            process_gap (fields_end[0], fields[0], outfile)
+
         outfile.write('{:<11s} {:<12s} {:s}\n'.format(
                 unicode_utils.ucs_symbol(int(fields[0], 16)),
                 convert_to_hex(int(fields[0], 16)),
                 fields[1]))
+        fields_end = fields
 
+    # We may need to output a final set of symbols if we are not yet at
+    # U+10FFFF, so check that last gap. We use U+110000 as the
+    # hypothetical next entry. In practice UTF-8 ends at U+10FFFD and
+    # so indeed we have 2 missing symbols at the end.
+    process_gap (fields_end[0], '110000', outfile)
+
 def convert_to_hex(code_point):
     '''Converts a code point to a hexadecimal UTF-8 representation
-    like /x**/x**/x**.'''
-    # Getting UTF8 of Unicode characters.
-    # In Python3, .encode('UTF-8') does not work for
-    # surrogates. Therefore, we use this conversion table
-    surrogates = {
-        0xD800: '/xed/xa0/x80',
-        0xDB7F: '/xed/xad/xbf',
-        0xDB80: '/xed/xae/x80',
-        0xDBFF: '/xed/xaf/xbf',
-        0xDC00: '/xed/xb0/x80',
-        0xDFFF: '/xed/xbf/xbf',
-    }
-    if code_point in surrogates:
-        return surrogates[code_point]
-    return ''.join([
-        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
-    ])
+    ready for use in a locale character map specification e.g.
+    /xc2/xaf for MACRON.
+
+    '''
+    cp_locale = ''
+    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
+    for byte in cp_bytes:
+        cp_locale += ''.join('/x{:02x}'.format(byte))
+    return cp_locale
 
 def write_header_charmap(outfile):
     '''Write the header on top of the CHARMAP section to the output file'''
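
For reviewers who want to try the two core ideas of the patch outside
the glibc tree, here is a minimal standalone sketch. It is not the code
from utf8_gen.py. The names ucs_symbol and fill_gap and the
'<Unassigned>' label are assumptions made for illustration; ucs_symbol
is only a stand-in for unicode_utils.ucs_symbol, and fill_gap takes
integers where process_gap takes hex strings. The sketch shows that the
'surrogatepass' error handler lets Python encode surrogates directly,
and how a gap between two emitted code points becomes either a single
entry or one ellipsis range.

```python
import io

def ucs_symbol(code_point):
    # Hypothetical stand-in for unicode_utils.ucs_symbol (assumption):
    # 4 hex digits inside the BMP, 8 hex digits above it.
    if code_point < 0x10000:
        return '<U{:04X}>'.format(code_point)
    return '<U{:08X}>'.format(code_point)

def convert_to_hex(code_point):
    # 'surrogatepass' allows U+D800..U+DFFF to be encoded, so no
    # hand-written surrogate lookup table is needed.
    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
    return ''.join('/x{:02x}'.format(byte) for byte in cp_bytes)

def fill_gap(cp_prev, cp_next, outfile, name='<Unassigned>'):
    # cp_prev is the last code point already written and cp_next is
    # the next one that will be written. The name is a placeholder.
    first, last = cp_prev + 1, cp_next - 1
    if first == last:
        # Exactly one code point is missing: write a single entry.
        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
            ucs_symbol(first), convert_to_hex(first), name))
    elif first < last:
        # Several code points are missing: write one ellipsis range.
        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
            ucs_symbol(first), ucs_symbol(last),
            convert_to_hex(first), name))
    # Otherwise the two code points are adjacent and nothing is written.

out = io.StringIO()
fill_gap(0x0377, 0x037A, out)      # U+0378..U+0379 are missing
fill_gap(0x10FFFD, 0x110000, out)  # trailing gap up to U+10FFFF
print(out.getvalue(), end='')
print(convert_to_hex(0xD800))      # a surrogate, encoded via surrogatepass
```

Using U+110000 as the hypothetical next entry, as the patch does, means
the trailing gap needs no special case.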