[v4,2/4] Update UTF-8 charmap processing.
Commit Message
The UTF-8 character map processing is updated to use the new wider
ellipsis support. In addition, Unicode noncharacter compliance is
improved by adding noncharacters to the UTF-8 character map so that
they are processed and transformed correctly when only the character
map is considered. All gaps in the UTF-8 character map, excluding the
surrogates, are filled with blocks of unassigned characters. The
UTF-8 character map now includes all Unicode scalar values.
Tested by regenerating the locale data from the Unicode data and
running the testsuite.
Tested on x86_64 and i686 without regression.
---
localedata/unicode-gen/utf8_gen.py | 133 +++++++++++++++++++----------
1 file changed, 86 insertions(+), 47 deletions(-)
Comments
* Carlos O'Donell:
>  def convert_to_hex(code_point):
>      '''Converts a code point to a hexadecimal UTF-8 representation
> -    like /x**/x**/x**.'''
> -    # Getting UTF8 of Unicode characters.
> -    # In Python3, .encode('UTF-8') does not work for
> -    # surrogates. Therefore, we use this conversion table
> -    surrogates = {
> -        0xD800: '/xed/xa0/x80',
> -        0xDB7F: '/xed/xad/xbf',
> -        0xDB80: '/xed/xae/x80',
> -        0xDBFF: '/xed/xaf/xbf',
> -        0xDC00: '/xed/xb0/x80',
> -        0xDFFF: '/xed/xbf/xbf',
> -    }
> -    if code_point in surrogates:
> -        return surrogates[code_point]
> -    return ''.join([
> -        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
> -    ])
> +    ready for use in a locale character map specification e.g.
> +    /xc2/xaf for MACRON.
> +
> +    '''
> +    cp_locale = ''
> +    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
> +    for byte in cp_bytes:
> +        cp_locale += ''.join('/x{:02x}'.format(byte))
> +    return cp_locale
I think you should keep the list comprehension. That ''.join() is
unnecessary.
Thanks,
Florian
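[Editorial note: Florian's point can be checked in isolation. Wrapping a single already-formatted string in ''.join() just re-joins its characters with an empty separator, which is a no-op, so the list comprehension alone suffices. A minimal sketch (not part of the patch) of the simplified helper:]

```python
# ''.join() over a single string re-joins its characters: a no-op.
s = '/x{:02x}'.format(0xED)
assert ''.join(s) == s

def convert_to_hex(code_point):
    # Sketch of the reviewed helper using the list comprehension form,
    # as suggested in the thread; 'surrogatepass' lets lone surrogates
    # encode instead of raising UnicodeEncodeError.
    return ''.join(['/x{:02x}'.format(c)
                    for c in chr(code_point).encode('UTF-8', 'surrogatepass')])

print(convert_to_hex(0xAF))  # /xc2/xaf (MACRON)
```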
On 4/29/21 10:07 AM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>>  def convert_to_hex(code_point):
>>      '''Converts a code point to a hexadecimal UTF-8 representation
>> -    like /x**/x**/x**.'''
>> -    # Getting UTF8 of Unicode characters.
>> -    # In Python3, .encode('UTF-8') does not work for
>> -    # surrogates. Therefore, we use this conversion table
>> -    surrogates = {
>> -        0xD800: '/xed/xa0/x80',
>> -        0xDB7F: '/xed/xad/xbf',
>> -        0xDB80: '/xed/xae/x80',
>> -        0xDBFF: '/xed/xaf/xbf',
>> -        0xDC00: '/xed/xb0/x80',
>> -        0xDFFF: '/xed/xbf/xbf',
>> -    }
>> -    if code_point in surrogates:
>> -        return surrogates[code_point]
>> -    return ''.join([
>> -        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
>> -    ])
>> +    ready for use in a locale character map specification e.g.
>> +    /xc2/xaf for MACRON.
>> +
>> +    '''
>> +    cp_locale = ''
>> +    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
>> +    for byte in cp_bytes:
>> +        cp_locale += ''.join('/x{:02x}'.format(byte))
>> +    return cp_locale
>
> I think you should keep the list comprehension. That ''.join() is
> unnecessary.
Like this?
    return ''.join(['/x{:02x}'.format(c) \
        for c in chr(code_point).encode('UTF-8', 'surrogatepass')])
(tested works fine and produces the same results)
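[Editorial note: "produces the same results" can be spot-checked: the 'surrogatepass' error handler reproduces exactly the byte sequences the old hard-coded surrogate table listed. A standalone sketch, not part of the patch:]

```python
# Spot-check: encoding lone surrogates with 'surrogatepass' yields the
# same bytes the removed conversion table hard-coded.
old_table = {
    0xD800: '/xed/xa0/x80',
    0xDB7F: '/xed/xad/xbf',
    0xDB80: '/xed/xae/x80',
    0xDBFF: '/xed/xaf/xbf',
    0xDC00: '/xed/xb0/x80',
    0xDFFF: '/xed/xbf/xbf',
}
for cp, expected in old_table.items():
    got = ''.join('/x{:02x}'.format(b)
                  for b in chr(cp).encode('UTF-8', 'surrogatepass'))
    assert got == expected, (hex(cp), got)
print('all surrogate entries match')
```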
* Carlos O'Donell:
> On 4/29/21 10:07 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>>
>>> [convert_to_hex diff trimmed; quoted in full earlier in the thread]
>>
>> I think you should keep the list comprehension. That ''.join() is
>> unnecessary.
>
> Like this?
>
>     return ''.join(['/x{:02x}'.format(c) \
>         for c in chr(code_point).encode('UTF-8', 'surrogatepass')])
>
> (tested works fine and produces the same results)
Yes, exactly. Thanks. The patch should be fine with this.
Florian
On 4/30/21 12:18 AM, Florian Weimer wrote:
> * Carlos O'Donell:
>
>> On 4/29/21 10:07 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>> [convert_to_hex diff trimmed; quoted in full earlier in the thread]
>>>
>>> I think you should keep the list comprehension. That ''.join() is
>>> unnecessary.
>>
>> Like this?
>>
>>     return ''.join(['/x{:02x}'.format(c) \
>>         for c in chr(code_point).encode('UTF-8', 'surrogatepass')])
>>
>> (tested works fine and produces the same results)
>
> Yes, exactly. Thanks. The patch should be fine with this.
Fixed. This will be part of the v5 repost.
@@ -81,25 +81,46 @@ def process_range(start, end, outfile, name):
     # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
     # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
     #
-    # The glibc UTF-8 file splits ranges like these into shorter
+    # The old glibc UTF-8 file splits ranges like these into shorter
     # ranges of 64 code points each:
     #
     # <U3400>..<U343F> /xe3/x90/x80 <CJK Ideograph Extension A>
     # …
     # <U4D80>..<U4DB5> /xe4/xb6/x80 <CJK Ideograph Extension A>
-    for i in range(int(start, 16), int(end, 16), 64 ):
-        if i > (int(end, 16)-64):
-            outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-                unicode_utils.ucs_symbol(i),
-                unicode_utils.ucs_symbol(int(end,16)),
-                convert_to_hex(i),
-                name))
-            break
-        outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
-            unicode_utils.ucs_symbol(i),
-            unicode_utils.ucs_symbol(i+63),
-            convert_to_hex(i),
-            name))
+    #
+    # We do not split the ranges like this. It is not required. The
+    # ellipsis processing in ld-collate.c can handle any sized ranges.
+    outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
+        unicode_utils.ucs_symbol(int (start, 16)),
+        unicode_utils.ucs_symbol(int (end, 16)),
+        convert_to_hex (int (start, 16)),
+        name))
+
+def process_gap (start, end, outfile):
+    '''This function processes a gap and fills it if needed. The value
+    of start is the last value output, and the value of end is the
+    next value which may be output. Therefore if there is a gap
+    between the two then it is filled with an ellipsis or a single
+    symbol.
+
+    '''
+    # If start and end are more than 1 away then we have a gap, and
+    # that needs filling to provide proper code-point collation
+    # support.
+    cp_prev = int(start, 16)
+    cp_next = int(end, 16)
+
+    # Special case of just one symbol missing?
+    if cp_next - 1 == cp_prev + 1:
+        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
+            unicode_utils.ucs_symbol(cp_prev + 1),
+            convert_to_hex(cp_prev + 1),
+            '<Unassigned>'))
+    elif cp_next > cp_prev + 1:
+        # More than one symbol, so use an ellipsis.
+        process_range ('{:x}'.format(cp_prev + 1),
+                       '{:x}'.format(cp_next - 1),
+                       outfile, '<Unassigned>')
 
 def process_charmap(flines, outfile):
     '''This function takes an array which contains *all* lines of
@@ -129,63 +150,81 @@ def process_charmap(flines, outfile):
     %<UDB7F> /xed/xad/xbf <Non Private Use High Surrogate, Last>
     <U0010FFC0>..<U0010FFFD> /xf4/x8f/xbf/x80 <Plane 16 Private Use>
 
+    The old glibc UTF-8 charmap left the surrogates commented out.
+    Surrogates are not Unicode scalar values, and are ill-formed code
+    sequences. We continue to comment them out in the character map to
+    ensure no locale accidentally uses these values. The use of
+    surrogate symbols will be treated as if they were UNDEFINED. The
+    converters will handle them as ill-formed code sequences and either
+    raise an error or transform them to REPLACEMENT CHARACTER.
+
     '''
     fields_start = []
+    fields_end = []
     for line in flines:
         fields = line.split(";")
-        # Some characters have “<control>” as their name. We try to
-        # use the “Unicode 1.0 Name” (10th field in
-        # UnicodeData.txt) for them.
-        #
-        # The Characters U+0080, U+0081, U+0084 and U+0099 have
-        # “<control>” as their name but do not even have aa
-        # ”Unicode 1.0 Name”. We could write code to take their
-        # alternate names from NameAliases.txt.
+        # Some characters have "<control>" as their name. We try to
+        # use the "Unicode 1.0 Name" (10th field in
+        # UnicodeData.txt) for them.
+        #
+        # The Characters U+0080, U+0081, U+0084 and U+0099 have
+        # "<control>" as their name but do not even have a
+        # "Unicode 1.0 Name". We could write code to take their
+        # alternate names from NameAliases.txt.
         if fields[1] == "<control>" and fields[10]:
             fields[1] = fields[10]
         # Handling code point ranges like:
         #
         # 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
         # 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
-        if fields[1].endswith(', First>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', First>'):
             fields_start = fields
             continue
-        if fields[1].endswith(', Last>') and not 'Surrogate,' in fields[1]:
+        if fields[1].endswith(', Last>'):
+            # 1. Process the gap.
+            # First process the gap between the last entry and the
+            # newly started range.
+            process_gap (fields_end[0], fields_start[0], outfile)
+            # 2. Exclude surrogate ranges.
+            # Comment out the surrogates in the UTF-8 file.
+            # One could of course skip them completely but
+            # the original UTF-8 file in glibc had them as
+            # comments, so we keep these comment lines.
+            if 'Surrogate,' in fields[1]:
+                outfile.write('%')
+            # 3. Process the range.
             process_range(fields_start[0], fields[0],
                           outfile, fields[1][:-7]+'>')
             fields_start = []
+            fields_end = fields
             continue
         fields_start = []
-        if 'Surrogate,' in fields[1]:
-            # Comment out the surrogates in the UTF-8 file.
-            # One could of course skip them completely but
-            # the original UTF-8 file in glibc had them as
-            # comments, so we keep these comment lines.
-            outfile.write('%')
+
+        if len (fields_end) > 0:
+            process_gap (fields_end[0], fields[0], outfile)
+
         outfile.write('{:<11s} {:<12s} {:s}\n'.format(
             unicode_utils.ucs_symbol(int(fields[0], 16)),
             convert_to_hex(int(fields[0], 16)),
             fields[1]))
+        fields_end = fields
+    # We may need to output a final set of symbols if we are not yet at
+    # U+10FFFF, so check that last gap. We use U+110000 as the
+    # hypothetical next entry. In practice UTF-8 ends at U+10FFFD and
+    # so indeed we have 2 missing symbols at the end.
+    process_gap (fields_end[0], '110000', outfile)
 
 def convert_to_hex(code_point):
     '''Converts a code point to a hexadecimal UTF-8 representation
-    like /x**/x**/x**.'''
-    # Getting UTF8 of Unicode characters.
-    # In Python3, .encode('UTF-8') does not work for
-    # surrogates. Therefore, we use this conversion table
-    surrogates = {
-        0xD800: '/xed/xa0/x80',
-        0xDB7F: '/xed/xad/xbf',
-        0xDB80: '/xed/xae/x80',
-        0xDBFF: '/xed/xaf/xbf',
-        0xDC00: '/xed/xb0/x80',
-        0xDFFF: '/xed/xbf/xbf',
-    }
-    if code_point in surrogates:
-        return surrogates[code_point]
-    return ''.join([
-        '/x{:02x}'.format(c) for c in chr(code_point).encode('UTF-8')
-    ])
+    ready for use in a locale character map specification e.g.
+    /xc2/xaf for MACRON.
+
+    '''
+    cp_locale = ''
+    cp_bytes = chr(code_point).encode('UTF-8', 'surrogatepass')
+    for byte in cp_bytes:
+        cp_locale += ''.join('/x{:02x}'.format(byte))
+    return cp_locale
 
 def write_header_charmap(outfile):
     '''Write the header on top of the CHARMAP section to the output file'''
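[Editorial note: the gap-filling logic in the patch can be exercised in isolation. The sketch below is a simplified standalone rendition, not the patch itself: ucs_symbol is a minimal stand-in for unicode_utils.ucs_symbol, and the code points passed in are arbitrary illustrations, not real UnicodeData gaps.]

```python
import io

def ucs_symbol(cp):
    # Simplified stand-in for unicode_utils.ucs_symbol() (assumption,
    # not the real glibc helper, which also handles 8-digit forms).
    return '<U{:04X}>'.format(cp)

def convert_to_hex(code_point):
    # Same conversion idea as the patch: UTF-8 bytes rendered as /xNN,
    # with lone surrogates passed through via 'surrogatepass'.
    return ''.join('/x{:02x}'.format(b)
                   for b in chr(code_point).encode('UTF-8', 'surrogatepass'))

def process_range(start, end, outfile, name):
    # One ellipsis line covers the whole range; per the patch comment,
    # the ellipsis processing in ld-collate.c handles any range size.
    outfile.write('{:s}..{:s} {:<12s} {:s}\n'.format(
        ucs_symbol(int(start, 16)), ucs_symbol(int(end, 16)),
        convert_to_hex(int(start, 16)), name))

def process_gap(start, end, outfile):
    # Fill the gap between the last value output and the next value.
    cp_prev = int(start, 16)
    cp_next = int(end, 16)
    if cp_next - 1 == cp_prev + 1:
        # Exactly one symbol missing: emit a single <Unassigned> entry.
        outfile.write('{:<11s} {:<12s} {:s}\n'.format(
            ucs_symbol(cp_prev + 1), convert_to_hex(cp_prev + 1),
            '<Unassigned>'))
    elif cp_next > cp_prev + 1:
        # More than one symbol missing: emit an ellipsis range.
        process_range('{:x}'.format(cp_prev + 1),
                      '{:x}'.format(cp_next - 1), outfile, '<Unassigned>')

out = io.StringIO()
process_gap('0010', '0013', out)  # two-symbol gap -> ellipsis range
process_gap('0020', '0022', out)  # one-symbol gap -> single entry
print(out.getvalue(), end='')
```

This mirrors the design choice discussed in the commit message: single missing code points become plain entries, while larger gaps become one ellipsis line, relying on the wider ellipsis support in ld-collate.c rather than 64-code-point chunking.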