[v4,2/4] Update UTF-8 charmap processing.

Message ID 20210428130033.3196848-3-carlos@redhat.com
State Superseded
Series Add new C.UTF-8 locale (Bug 17318)

Commit Message

Carlos O'Donell April 28, 2021, 1 p.m. UTC
The UTF-8 character map processing is updated to use the new wider
ellipsis support. In addition, compliance with the Unicode treatment
of Noncharacters is improved by adding them to the UTF-8 character
map, so that they can be processed and transformed correctly when
only the character map is considered. All gaps in the UTF-8 character
map, excluding surrogates, are filled with unassigned blocks of
characters. The UTF-8 character map now includes all Unicode scalar
values.

Tested by regenerating the locale data from the Unicode data and
running the testsuite.

Tested on x86_64 and i686 without regression.
---
 localedata/unicode-gen/utf8_gen.py | 133 +++++++++++++++++++----------
 1 file changed, 86 insertions(+), 47 deletions(-)
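
As a rough illustration of what "all Unicode scalar values" and
"Noncharacters" mean in the commit message above, the following sketch
counts both. The helper names are hypothetical and are not part of
utf8_gen.py; the ranges are the ones defined by the Unicode Standard.

```python
# Illustrative sketch only; is_surrogate and is_noncharacter are
# hypothetical helpers, not functions from utf8_gen.py.

def is_surrogate(cp):
    # Surrogates occupy U+D800..U+DFFF and are not scalar values.
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus the last two code points of each of the
    # 17 planes (U+xxFFFE and U+xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

scalar_count = sum(1 for cp in range(0x110000) if not is_surrogate(cp))
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]

print(scalar_count)        # 1112064
print(len(noncharacters))  # 66

# Noncharacters are scalar values, so the ordinary UTF-8 encoder
# accepts them without any special error handler.
print(chr(0xFFFE).encode('UTF-8'))  # b'\xef\xbf\xbe'
```

Because Noncharacters encode like any other scalar value, only the
surrogates need separate handling in the generator.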
  

Comments

Florian Weimer April 29, 2021, 2:07 p.m. UTC | #1
* Carlos O'Donell:

>  def convert_to_hex(code_point):
>      '''Converts a code point to a hexadecimal UTF-8 representation
> -    like /x**/x**/x**.'''
> -    # Getting UTF8 of Unicode characters.
> -    # In Python3, .encode('UTF-8') does not work for
> -    # surrogates. Therefore, we use this conversion table
> -    surrogates = {
> -        0xD800: '/xed/xa0/x80',
> -        0xDB7F: '/xed/xad/xbf',
> -        0xDB80: '/xed/xae/x80',
> -        0xDBFF: '/xed/xaf/xbf',
> -        0xDC00: '/xed/xb0/x80',
> -        0xDFFF: '/xed/xbf/xbf',
> -    }
> -    if code_point in surrogates:
> -        return surrogates[code_point]
> -    return ''.join([
> -        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
> -    ])
> +    ready for use in a locale character map specification e.g.
> +    /xc2/xaf for MACRON.
> +
> +    '''
> +    cp_locale = ''
> +    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
> +    for byte in cp_bytes:
> +       cp_locale += ''.join('/x{:02x}'.format(byte))
> +    return cp_locale

I think you should keep the list comprehension.  That ''.join() is
unnecessary.

Thanks,
Florian
  
Carlos O'Donell April 29, 2021, 9:02 p.m. UTC | #2
On 4/29/21 10:07 AM, Florian Weimer wrote:
> I think you should keep the list comprehension.  That ''.join() is
> unnecessary.

Like this?

    return ''.join(['/x{:02x}'.format(c) \
        for c in chr(code_point).encode('UTF-8', 'surrogatepass')])

(tested works fine and produces the same results)
  
Florian Weimer April 30, 2021, 4:18 a.m. UTC | #3
* Carlos O'Donell:

> On 4/29/21 10:07 AM, Florian Weimer wrote:
>> I think you should keep the list comprehension.  That ''.join() is
>> unnecessary.
>
> Like this?
>
>     return ''.join(['/x{:02x}'.format(c) \
>         for c in chr(code_point).encode('UTF-8', 'surrogatepass')])
>
> (tested works fine and produces the same results)

Yes, exactly.  Thanks.  The patch should be fine with this.

Florian
  
Carlos O'Donell May 2, 2021, 7:18 p.m. UTC | #4
On 4/30/21 12:18 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> On 4/29/21 10:07 AM, Florian Weimer wrote:
>>> I think you should keep the list comprehension.  That ''.join() is
>>> unnecessary.
>>
>> Like this?
>>
>>     return ''.join(['/x{:02x}'.format(c) \
>>         for c in chr(code_point).encode('UTF-8', 'surrogatepass')])
>>
>> (tested works fine and produces the same results)
> 
> Yes, exactly.  Thanks.  The patch should be fine with this.

Fixed. This will be part of the v5 repost.
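
For readers following the thread, here is a minimal sketch comparing the
form agreed on in review with the loop from the v4 patch and with the
removed lookup table. The name convert_to_hex_loop and the OLD_SURROGATES
constant are introduced here for illustration only; the table values are
copied from the lines the patch deletes.

```python
# The six entries of the lookup table removed by the patch.
OLD_SURROGATES = {
    0xD800: '/xed/xa0/x80',
    0xDB7F: '/xed/xad/xbf',
    0xDB80: '/xed/xae/x80',
    0xDBFF: '/xed/xaf/xbf',
    0xDC00: '/xed/xb0/x80',
    0xDFFF: '/xed/xbf/xbf',
}

def convert_to_hex(code_point):
    '''Comprehension form, as agreed in the review.'''
    return ''.join('/x{:02x}'.format(c)
                   for c in chr(code_point).encode('UTF-8', 'surrogatepass'))

def convert_to_hex_loop(code_point):
    '''Loop form from the v4 patch, without the redundant join.
    Joining a single string with '' returns that string unchanged,
    so dropping the inner join does not alter the result.'''
    cp_locale = ''
    for byte in chr(code_point).encode('UTF-8', 'surrogatepass'):
        cp_locale += '/x{:02x}'.format(byte)
    return cp_locale

# 'surrogatepass' reproduces every value of the old table.
for cp, expected in OLD_SURROGATES.items():
    assert convert_to_hex(cp) == expected
    assert convert_to_hex_loop(cp) == expected

print(convert_to_hex(0xAF))  # /xc2/xaf (MACRON, as in the docstring)
```

The 'surrogatepass' error handler is what lets the standard encoder
handle U+D800..U+DFFF, which is why the table is no longer needed.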
  

Patch

diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index 899840923a..56a680bc06 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -81,25 +81,46 @@  def process_range(start, end, outfile, name):
     # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
     # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
     #
-    # The glibc UTF-8 file splits ranges like these into shorter
+    # The old glibc UTF-8 file splits ranges like these into shorter
     # ranges of 64 code points each:
     #
     # <U3400>..<U343F>     /xe3/x90/x80         <CJK Ideograph Extension A>
     # …
     # <U4D80>..<U4DB5>     /xe4/xb6/x80         <CJK Ideograph Extension A>
-    for i in range(int(start, 16), int(end, 16), 64 ):
-        if i > (int(end, 16)-64):
-            outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                    unicode_utils.ucs_symbol(i),
-                    unicode_utils.ucs_symbol(int(end,16)),
-                    convert_to_hex(i),
-                    name))
-            break
-        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                unicode_utils.ucs_symbol(i),
-                unicode_utils.ucs_symbol(i+63),
-                convert_to_hex(i),
-                name))
+    #
+    # We do not split the ranges like this. It is not required. The
+    # ellipsis processing in ld-collate.c can handle any sized ranges.
+    outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
+                  unicode_utils.ucs_symbol(int (start, 16)),
+                  unicode_utils.ucs_symbol(int (end, 16)),
+                  convert_to_hex (int (start, 16)),
+                  name))
+
+def process_gap (start, end, outfile):
+    '''This function processes a gap and fills it if needed.  The value
+       of start is the last value output, and the value of end is the
+       next value which may be output.  Therefore if there is a gap
+       between the two then it is filled with an ellipsis or a single
+       symbol.
+
+    '''
+    # If start and end are more than 1 away then we have a gap, and
+    # that needs filling to provide proper code-point collation
+    # support.
+    cp_prev = int(start, 16)
+    cp_next = int(end, 16)
+
+    # Special case of just one symbol missing?
+    if cp_next - 1 == cp_prev + 1:
+        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+                      unicode_utils.ucs_symbol(cp_prev + 1),
+                      convert_to_hex(cp_prev + 1),
+                      '<Unassigned>'))
+    elif cp_next > cp_prev + 1:
+        # More than one symbol, so use an ellipsis.
+        process_range ('{:x}'.format(cp_prev + 1),
+                       '{:x}'.format(cp_next - 1),
+                       outfile, '<Unassigned>')
 
 def process_charmap(flines, outfile):
     '''This function takes an array which contains *all* lines of
@@ -129,63 +150,81 @@  def process_charmap(flines, outfile):
     %<UDB7F>     /xed/xad/xbf <Non Private Use High Surrogate, Last>
     <U0010FFC0>..<U0010FFFD>     /xf4/x8f/xbf/x80 <Plane 16 Private Use>
 
+    The old glibc UTF-8 charmap left the surrogates commented out.
+    Surrogates are not Unicode scalar values, and are ill-formed code
+    sequences. We continue to comment them out in the character map to
+    ensure no locale accidentally uses these values. The use of
+    surrogate symbols will be treated as if they were UNDEFINED. The
+    converters will handle them as ill-formed code sequences and either
+    raise an error or transform them to REPLACEMENT CHARACTER.
     '''
     fields_start = []
+    fields_end = []
     for line in flines:
         fields = line.split(";")
-         # Some characters have “<control>” as their name. We try to
-         # use the “Unicode 1.0 Name” (10th field in
-         # UnicodeData.txt) for them.
-         #
-         # The Characters U+0080, U+0081, U+0084 and U+0099 have
-         # “<control>” as their name but do not even have aa
-         # ”Unicode 1.0 Name”. We could write code to take their
-         # alternate names from NameAliases.txt.
+        # Some characters have "<control>" as their name. We try to
+        # use the "Unicode 1.0 Name" (10th field in
+        # UnicodeData.txt) for them.
+        #
+        # The Characters U+0080, U+0081, U+0084 and U+0099 have
+        # "<control>" as their name but do not even have a
+        # "Unicode 1.0 Name". We could write code to take their
+        # alternate names from NameAliases.txt.
         if fields[1] == "<control>" and fields[10]:
             fields[1] = fields[10]
         # Handling code point ranges like:
         #
         # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
         # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
-        if fields[1].endswith(', First>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', First>'):
             fields_start = fields
             continue
-        if fields[1].endswith(', Last>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', Last>'):
+            # 1. Process the gap.
+            # First process the gap between the last entry and the
+            # newly started range.
+            process_gap (fields_end[0], fields_start[0], outfile)
+            # 2. Exclude surrogate ranges.
+            # Comment out the surrogates in the UTF-8 file.
+            # One could of course skip them completely but
+            # the original UTF-8 file in glibc had them as
+            # comments, so we keep these comment lines.
+            if 'Surrogate,' in fields[1]:
+                outfile.write('%')
+            # 3. Process the range.
             process_range(fields_start[0], fields[0],
                           outfile, fields[1][:-7]+'>')
             fields_start = []
+            fields_end = fields
             continue
         fields_start = []
-        if 'Surrogate,' in fields[1]:
-            # Comment out the surrogates in the UTF-8 file.
-            # One could of course skip them completely but
-            # the original UTF-8 file in glibc had them as
-            # comments, so we keep these comment lines.
-            outfile.write('%')
+
+        if len (fields_end) > 0:
+            process_gap (fields_end[0], fields[0], outfile)
+
         outfile.write('{:<11s} {:<12s} {:s}\n'.format(
                 unicode_utils.ucs_symbol(int(fields[0], 16)),
                 convert_to_hex(int(fields[0], 16)),
                 fields[1]))
 
+        fields_end = fields
+    # We may need to output a final set of symbols if we are not yet at
+    # U+10FFFF, so check that last gap.  We use U+110000 as the
+    # hypothetical next entry.  In practice UTF-8 ends at U+10FFFD and
+    # so indeed we have 2 missing symbols at the end.
+    process_gap (fields_end[0], '110000', outfile)
+
 def convert_to_hex(code_point):
     '''Converts a code point to a hexadecimal UTF-8 representation
-    like /x**/x**/x**.'''
-    # Getting UTF8 of Unicode characters.
-    # In Python3, .encode('UTF-8') does not work for
-    # surrogates. Therefore, we use this conversion table
-    surrogates = {
-        0xD800: '/xed/xa0/x80',
-        0xDB7F: '/xed/xad/xbf',
-        0xDB80: '/xed/xae/x80',
-        0xDBFF: '/xed/xaf/xbf',
-        0xDC00: '/xed/xb0/x80',
-        0xDFFF: '/xed/xbf/xbf',
-    }
-    if code_point in surrogates:
-        return surrogates[code_point]
-    return ''.join([
-        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
-    ])
+    ready for use in a locale character map specification e.g.
+    /xc2/xaf for MACRON.
+
+    '''
+    cp_locale = ''
+    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
+    for byte in cp_bytes:
+       cp_locale += ''.join('/x{:02x}'.format(byte))
+    return cp_locale
 
 def write_header_charmap(outfile):
     '''Write the header on top of the CHARMAP section to the output file'''
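
The gap-filling rule added by process_gap in the hunks above can be
sketched in isolation as follows. This is not the patch's code: gap_lines
is a hypothetical helper that returns lines instead of writing to a file,
and ucs_symbol is a stand-in for unicode_utils.ucs_symbol whose output
format is assumed from the <U3400> and <U0010FFC0> examples in the
patch's comments.

```python
def ucs_symbol(cp):
    # Stand-in for unicode_utils.ucs_symbol (assumed format):
    # four hex digits in the BMP, eight above it.
    return '<U{:04X}>'.format(cp) if cp < 0x10000 else '<U{:08X}>'.format(cp)

def convert_to_hex(cp):
    return ''.join('/x{:02x}'.format(b)
                   for b in chr(cp).encode('UTF-8', 'surrogatepass'))

def gap_lines(cp_prev, cp_next):
    '''Return the charmap lines needed to fill the gap between the
    last code point already output and the next one to be output.'''
    first, last = cp_prev + 1, cp_next - 1
    if first > last:
        # Adjacent code points: nothing to fill.
        return []
    if first == last:
        # Exactly one missing symbol: emit a single entry.
        return ['{:<11s} {:<12s} {:s}'.format(
            ucs_symbol(first), convert_to_hex(first), '<Unassigned>')]
    # More than one missing symbol: emit one ellipsis range.
    return ['{:s}..{:s} {:<12s} {:s}'.format(
        ucs_symbol(first), ucs_symbol(last),
        convert_to_hex(first), '<Unassigned>')]

# The final gap described in the patch: the last entry is U+10FFFD
# and U+110000 is used as the hypothetical next entry.
print(gap_lines(0x10FFFD, 0x110000))
# ['<U0010FFFE>..<U0010FFFF> /xf4/x8f/xbf/xbe <Unassigned>']
```

Returning a list keeps the three cases (no gap, one symbol, a range)
easy to check without an output file.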