From patchwork Wed Apr 28 13:00:31 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Carlos O'Donell <carlos@redhat.com>
X-Patchwork-Id: 43187
Return-Path: <libc-alpha-bounces@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 60EAD3944810;
	Wed, 28 Apr 2021 13:01:21 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 60EAD3944810
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1619614881;
	bh=PoNFRtKSXMTfHwT5iHCR+ONEYnY17U0/gcetPBLgLbM=;
	h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
	 From;
	b=NCrC8iwkcZzZcJSlkxEOVfdtFYd6FLgN6205pp82yy+c+ZGfhxAr3xIOiQmVqXI8e
	 TCtp2QQCUYha1swCvga/W4HaLw97JyudXYBjSlczpzvA3P+46/Ht3FonXKxop4i7F/
	 KYCf5eAup7gTm1F7Lw/2kmEmdRITXSjITYVmMvrQ=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTP id A16AF3943406
 for <libc-alpha@sourceware.org>; Wed, 28 Apr 2021 13:01:14 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org A16AF3943406
Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com
 [209.85.222.198]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-282--KoWzsYGMqOzCJIcutc8bQ-1; Wed, 28 Apr 2021 09:00:51 -0400
X-MC-Unique: -KoWzsYGMqOzCJIcutc8bQ-1
Received: by mail-qk1-f198.google.com with SMTP id
 h15-20020a37de0f0000b029029a8ada2e18so25275777qkj.11
 for <libc-alpha@sourceware.org>; Wed, 28 Apr 2021 06:00:50 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=PoNFRtKSXMTfHwT5iHCR+ONEYnY17U0/gcetPBLgLbM=;
 b=CA9yII6gaqAOdbMeBsNkvEmPQcVHVea37+opTBSh+3wMcM0ygckp4icfSJtDJpd3/Y
 A0LVYd/s0zLL9T0r8j3UfgPII1lRpSTRIRliTAXofEvWUplrw93UVWbxu3BHpujB+fEq
 groSiDeBnHuNkBQcB+NH828ESJgveHFgGQLvxtVSP/nKNkOsMkASoG3sI/H6CHh4e9qJ
 5H/TTvL26ZvyP70OfWaUNixmjHJ+pE3K2iAurLg7O403NyJfN/GDLCPbAbZGFp/CTrgU
 mujNnBjTgDgTjeJ/wZWwEV3Ib2q2X2GUlenb3OMvQUsR+DQRYbreUMRQULzEE/o/qSjw
 aLTQ==
X-Gm-Message-State: AOAM530QPRpNAHlX+xjXrmmM6kEwveVU0+Flytot3AFdYxQSeFztzKsW
 z9zKagJkKOdrHDhllH4Yh2rIlRpJXLafyeOd/dL37oZT74ZfCZpPHZ0ksRH/efAYAYgcVZcm7Bl
 brceZGcL9rEBp8IsMODCGAgtXUSNXupKo+i8ZZ/Yfa2+cB1MV4D07mXmpbKvv+pLKgGFJjw==
X-Received: by 2002:a05:6214:2467:: with SMTP id
 im7mr13984838qvb.59.1619614848946;
 Wed, 28 Apr 2021 06:00:48 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJz6dhnapV5sogJ2psSQyCIvpD6hQu9zG8+V4UAxKLb/AxGQBtbdJB4wF4HX1zlBh442CXs9qg==
X-Received: by 2002:a05:6214:2467:: with SMTP id
 im7mr13984799qvb.59.1619614848565;
 Wed, 28 Apr 2021 06:00:48 -0700 (PDT)
Received: from athas.redhat.com (198-84-214-74.cpe.teksavvy.com.
 [198.84.214.74])
 by smtp.gmail.com with ESMTPSA id g18sm4903451qke.21.2021.04.28.06.00.47
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Wed, 28 Apr 2021 06:00:48 -0700 (PDT)
To: libc-alpha@sourceware.org,
	fweimer@redhat.com
Subject: [PATCH v4 2/4] Update UTF-8 charmap processing.
Date: Wed, 28 Apr 2021 09:00:31 -0400
Message-Id: <20210428130033.3196848-3-carlos@redhat.com>
X-Mailer: git-send-email 2.26.3
In-Reply-To: <20210428130033.3196848-1-carlos@redhat.com>
References: <20210428130033.3196848-1-carlos@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
X-Spam-Status: No, score=-11.7 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: Carlos O'Donell via Libc-alpha
 <libc-alpha@sourceware.org>
From: Carlos O'Donell <carlos@redhat.com>
Reply-To: Carlos O'Donell <carlos@redhat.com>
Errors-To: libc-alpha-bounces@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org>

Update the UTF-8 character map processing to use the new wider
ellipsis support. In addition, improve compliance with the Unicode
treatment of Noncharacters by adding them to the UTF-8 character map,
so that they are processed and transformed correctly when only the
character map is considered. All gaps in the UTF-8 character map,
excluding surrogates, are filled with blocks of unassigned
characters. The UTF-8 character map now includes all Unicode scalar
values.

Tested by regenerating the locale data from the Unicode data and
running the testsuite.

Tested on x86_64 and i686 without regression.
---
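Note for reviewers (not part of the commit message): the gap filling
and the surrogate-tolerant encoding can be tried in isolation with the
minimal sketch below. It is an illustration under stated assumptions,
not the code from the patch: ucs_symbol here is a local stand-in for
unicode_utils.ucs_symbol, and gap_line is a hypothetical helper that
returns the line instead of writing it to outfile.

```python
def ucs_symbol(code_point):
    """Stand-in (assumption) for unicode_utils.ucs_symbol."""
    if code_point < 0x10000:
        return '<U{:04X}>'.format(code_point)
    return '<U{:08X}>'.format(code_point)


def convert_to_hex(code_point):
    """Encode a code point as /x** bytes.  The 'surrogatepass' error
    handler lets U+D800..U+DFFF encode instead of raising."""
    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
    return ''.join('/x{:02x}'.format(byte) for byte in cp_bytes)


def gap_line(cp_prev, cp_next):
    """Hypothetical helper: return the charmap line that fills the gap
    strictly between cp_prev and cp_next, or None if they are adjacent."""
    first, last = cp_prev + 1, cp_next - 1
    if first > last:
        return None
    if first == last:
        # Exactly one symbol is missing, so no ellipsis is needed.
        return '{:<11s} {:<12s} {:s}'.format(
            ucs_symbol(first), convert_to_hex(first), '<Unassigned>')
    # More than one symbol is missing, so emit an ellipsis range.
    return '{:s}..{:s} {:<12s} {:s}'.format(
        ucs_symbol(first), ucs_symbol(last),
        convert_to_hex(first), '<Unassigned>')


# Final gap after U+10FFFD, using U+110000 as the hypothetical next entry.
print(gap_line(0x10FFFD, 0x110000))
# A surrogate encodes to the same bytes the old lookup table held.
print(convert_to_hex(0xD800))
```

The first call prints the ellipsis line covering U+10FFFE and U+10FFFF,
and the second prints /xed/xa0/x80, which matches the entry for U+D800
in the lookup table that this patch removes.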
 localedata/unicode-gen/utf8_gen.py | 133 +++++++++++++++++++----------
 1 file changed, 86 insertions(+), 47 deletions(-)

diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 899840923a..56a680bc06 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -81,25 +81,46 @@ def process_range(start, end, outfile, name):
     # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
     # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
     #
-    # The glibc UTF-8 file splits ranges like these into shorter
+    # The old glibc UTF-8 file splits ranges like these into shorter
     # ranges of 64 code points each:
     #
     # <U3400>..<U343F>     /xe3/x90/x80         <CJK Ideograph Extension A>
     # …
     # <U4D80>..<U4DB5>     /xe4/xb6/x80         <CJK Ideograph Extension A>
-    for i in range(int(start, 16), int(end, 16), 64 ):
-        if i > (int(end, 16)-64):
-            outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                    unicode_utils.ucs_symbol(i),
-                    unicode_utils.ucs_symbol(int(end,16)),
-                    convert_to_hex(i),
-                    name))
-            break
-        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                unicode_utils.ucs_symbol(i),
-                unicode_utils.ucs_symbol(i+63),
-                convert_to_hex(i),
-                name))
+    #
+    # We no longer split ranges this way because it is not required: the
+    # ellipsis processing in ld-collate.c can handle ranges of any size.
+    outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
+                  unicode_utils.ucs_symbol(int(start, 16)),
+                  unicode_utils.ucs_symbol(int(end, 16)),
+                  convert_to_hex(int(start, 16)),
+                  name))
+
+def process_gap(start, end, outfile):
+    '''This function processes a gap and fills it if needed.  The value
+       of start is the last value output, and the value of end is the
+       next value which may be output.  Therefore if there is a gap
+       between the two then it is filled with an ellipsis or a single
+       symbol.
+
+    '''
+    # If start and end are more than 1 away then we have a gap, and
+    # that needs filling to provide proper code-point collation
+    # support.
+    cp_prev = int(start, 16)
+    cp_next = int(end, 16)
+
+    # Special case of just one symbol missing?
+    if cp_next - 1 == cp_prev + 1:
+        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+                      unicode_utils.ucs_symbol(cp_prev + 1),
+                      convert_to_hex(cp_prev + 1),
+                      '<Unassigned>'))
+    elif cp_next > cp_prev + 1:
+        # More than one symbol, so use an ellipsis.
+        process_range('{:x}'.format(cp_prev + 1),
+                      '{:x}'.format(cp_next - 1),
+                      outfile, '<Unassigned>')
 
 def process_charmap(flines, outfile):
     '''This function takes an array which contains *all* lines of
@@ -129,63 +150,81 @@ def process_charmap(flines, outfile):
     %<UDB7F>     /xed/xad/xbf <Non Private Use High Surrogate, Last>
     <U0010FFC0>..<U0010FFFD>     /xf4/x8f/xbf/x80 <Plane 16 Private Use>
 
+    The old glibc UTF-8 charmap left the surrogates commented out.
+    Surrogates are not Unicode scalar values, and are ill-formed code
+    sequences. We continue to comment them out in the character map to
+    ensure no locale accidentally uses these values. The use of
+    surrogate symbols will be treated as if they were UNDEFINED. The
+    converters will handle them as ill-formed code sequences and either
+    raise an error or transform them to REPLACEMENT CHARACTER.
     '''
     fields_start = []
+    fields_end = []
     for line in flines:
         fields = line.split(";")
-         # Some characters have “<control>” as their name. We try to
-         # use the “Unicode 1.0 Name” (10th field in
-         # UnicodeData.txt) for them.
-         #
-         # The Characters U+0080, U+0081, U+0084 and U+0099 have
-         # “<control>” as their name but do not even have aa
-         # ”Unicode 1.0 Name”. We could write code to take their
-         # alternate names from NameAliases.txt.
+        # Some characters have "<control>" as their name. We try to
+        # use the "Unicode 1.0 Name" (10th field in
+        # UnicodeData.txt) for them.
+        #
+        # The Characters U+0080, U+0081, U+0084 and U+0099 have
+        # "<control>" as their name but do not even have a
+        # "Unicode 1.0 Name". We could write code to take their
+        # alternate names from NameAliases.txt.
         if fields[1] == "<control>" and fields[10]:
             fields[1] = fields[10]
         # Handling code point ranges like:
         #
         # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
         # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
-        if fields[1].endswith(', First>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', First>'):
             fields_start = fields
             continue
-        if fields[1].endswith(', Last>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', Last>'):
+            # 1. Process the gap.
+            # First process the gap between the last entry and the
+            # newly started range.
+            process_gap(fields_end[0], fields_start[0], outfile)
+            # 2. Exclude surrogate ranges.
+            # Comment out the surrogates in the UTF-8 file.
+            # One could of course skip them completely but
+            # the original UTF-8 file in glibc had them as
+            # comments, so we keep these comment lines.
+            if 'Surrogate,' in fields[1]:
+                outfile.write('%')
+            # 3. Process the range.
             process_range(fields_start[0], fields[0],
                           outfile, fields[1][:-7]+'>')
             fields_start = []
+            fields_end = fields
             continue
         fields_start = []
-        if 'Surrogate,' in fields[1]:
-            # Comment out the surrogates in the UTF-8 file.
-            # One could of course skip them completely but
-            # the original UTF-8 file in glibc had them as
-            # comments, so we keep these comment lines.
-            outfile.write('%')
+
+        if fields_end:
+            process_gap(fields_end[0], fields[0], outfile)
+
         outfile.write('{:<11s} {:<12s} {:s}\n'.format(
                 unicode_utils.ucs_symbol(int(fields[0], 16)),
                 convert_to_hex(int(fields[0], 16)),
                 fields[1]))
 
+        fields_end = fields
+    # We may need to output a final set of symbols if we are not yet at
+    # U+10FFFF, so check that last gap.  We use U+110000 as the
+    # hypothetical next entry.  In practice UnicodeData.txt ends at
+    # U+10FFFD, so the two symbols U+10FFFE and U+10FFFF are missing.
+    process_gap(fields_end[0], '110000', outfile)
+
 def convert_to_hex(code_point):
     '''Converts a code point to a hexadecimal UTF-8 representation
-    like /x**/x**/x**.'''
-    # Getting UTF8 of Unicode characters.
-    # In Python3, .encode('UTF-8') does not work for
-    # surrogates. Therefore, we use this conversion table
-    surrogates = {
-        0xD800: '/xed/xa0/x80',
-        0xDB7F: '/xed/xad/xbf',
-        0xDB80: '/xed/xae/x80',
-        0xDBFF: '/xed/xaf/xbf',
-        0xDC00: '/xed/xb0/x80',
-        0xDFFF: '/xed/xbf/xbf',
-    }
-    if code_point in surrogates:
-        return surrogates[code_point]
-    return ''.join([
-        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
-    ])
+    ready for use in a locale character map specification e.g.
+    /xc2/xaf for MACRON.
+
+    '''
+    cp_locale = ''
+    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
+    for byte in cp_bytes:
+        cp_locale += '/x{:02x}'.format(byte)
+    return cp_locale
 
 def write_header_charmap(outfile):
     '''Write the header on top of the CHARMAP section to the output file'''