BZ #19575: Clarify status of entries in GB 18030-2005.

Message ID 56B9B942.2030203@redhat.com
State Committed
Commit Message

Carlos O'Donell Feb. 9, 2016, 10:02 a.m. UTC
On 02/09/2016 03:55 AM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> On 02/08/2016 05:19 PM, Andreas Schwab wrote:
>>> "Carlos O'Donell" <carlos@redhat.com> writes:
>>>
>>>> This patch is only to clarify why these entries are being mapped
>>>> differently than in the original GB 18030-2005 standard.
>>>
>>> They aren't.
>>
>> Do you have a copy of the standard to verify that?
> 
> See charset/data/ucm/gb-18030-2005.ucm in ICU.

That's not a copy of the standard.

"CJKV Information Processing" by Dr. Ken Lunde on page 108
explicitly states that GB 18030-2005 has 24 PUA mappings
that, with Unicode 4.1 or newer, can be mapped to non-PUA
equivalents. He describes the 24 characters, and the ICU
ucm data does exactly that.

This does not match the published standard, but that is OK:
it is best practice not to use PUA mappings when later
Unicode versions include non-PUA equivalents (as we also
do in glibc).

All I want to clarify in the glibc version of these files
is that the data is not identical to the standard as published.

v2 of the patch follows.

OK to check in?

Cheers,
Carlos.

2016-02-09  Carlos O'Donell  <carlos@redhat.com>

	[BZ #19575]
	* charmaps/GB18030: Document PUA to non-PUA equivalents.

---
  

Comments

Andreas Schwab Feb. 9, 2016, 10:16 a.m. UTC | #1
"Carlos O'Donell" <carlos@redhat.com> writes:

> 	* charmaps/GB18030: Document PUA to non-PUA equivalents.

This is in no way restricted to GB18030.  Several other legacy encodings
also have compatibility mappings that were remapped to official Unicode
points after they became available.  The compatibility mappings are
usually included as comments in the code tables, as glibc can only use
the roundtrip code points.
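
A minimal sketch of the roundtrip restriction, assuming a toy converter
with a single entry. The byte sequence /xfe/x59 and the code points
<UE81E> and <U9FB4> come from the patch; the names ROUNDTRIP, FALLBACK,
decode and encode are hypothetical and are not glibc or ICU interfaces.

```python
from typing import Dict, Optional

# Roundtrip entry: the same pair is used for decoding and for encoding.
ROUNDTRIP: Dict[bytes, int] = {b"\xfe\x59": 0x9FB4}

# Hypothetical one-way fallback: the old PUA code point is accepted when
# encoding, but decoding never produces it.
FALLBACK: Dict[int, bytes] = {0xE81E: b"\xfe\x59"}


def decode(seq: bytes) -> Optional[int]:
    """Bytes to code point, using roundtrip entries only."""
    return ROUNDTRIP.get(seq)


def encode(cp: int, allow_fallback: bool = False) -> Optional[bytes]:
    """Code point to bytes; fallback entries are consulted only on request."""
    for seq, mapped in ROUNDTRIP.items():
        if mapped == cp:
            return seq
    if allow_fallback:
        return FALLBACK.get(cp)
    return None
```

With roundtrip entries only, the old PUA code point has no encoding at
all; a one-way table is what would be needed to keep accepting it.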

If you want to add some verbiage, put it in a README.

Andreas.
  
Carlos O'Donell Feb. 9, 2016, 10:20 a.m. UTC | #2
On 02/09/2016 05:16 AM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> 	* charmaps/GB18030: Document PUA to non-PUA equivalents.
> 
> This is in no way restricted to GB18030.  Several other legacy encodings
> also have compatibility mappings that were remapped to official Unicode
> points after they became available.  The compatibility mappings are
> usually included as comments in the code tables, as glibc can only use
> the roundtrip code points.

At this moment I'm only talking about GB 18030-2005.

> If you want to add some verbiage, put it in a README.

Why not in the file itself, as the patch does? That keeps the notes right
beside the commented-out mappings for developers to read if they have
any questions.

c.
  
Andreas Schwab Feb. 9, 2016, 10:29 a.m. UTC | #3
"Carlos O'Donell" <carlos@redhat.com> writes:

> Why not in the file itself as the patch does?

Because it singles out GB18030 which only creates confusion.

Andreas.
  
Carlos O'Donell Feb. 9, 2016, 2:02 p.m. UTC | #4
On 02/09/2016 05:29 AM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> Why not in the file itself as the patch does?
> 
> Because it singles out GB18030 which only creates confusion.

How does it create confusion?

I find it adds clarity to the character map by explaining why
the comments are there, e.g. non-normative changes which are
made because a newer Unicode version provides non-PUA code
points that can be used.

Cheers,
Carlos.
  
Andreas Schwab Feb. 9, 2016, 2:26 p.m. UTC | #5
"Carlos O'Donell" <carlos@redhat.com> writes:

> the comments are there e.g. non-normative changes which are

Non-normative?

Andreas.
  
Carlos O'Donell Feb. 9, 2016, 4:06 p.m. UTC | #6
On 02/09/2016 09:26 AM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> the comments are there e.g. non-normative changes which are
> 
> Non-normative?

The glibc and ICU implementations of GB 18030-2005 have
non-normative mappings (not following the standard) for
the 24 PUA code point characters (as seen in my patch)
to their non-PUA equivalents.

Those mappings don't follow the printed normative standard
of GB 18030-2005, but that's OK, this is expected best practice.

Here "normative" means "part of the standard", and "non-normative"
means "not part of the standard."

Again, how does adding comments to one charmap create confusion?

Cheers,
Carlos.
  
Andreas Schwab Feb. 9, 2016, 4:50 p.m. UTC | #7
"Carlos O'Donell" <carlos@redhat.com> writes:

> Those mappings don't follow the printed normative standard
> of GB 18030-2005, but that's OK, this is expected best practice.

# The 2005 version of gb18030 includes updates to previous mappings that use to map to PUA
# but are now mapped to actual Unicode codepoints.
# (CJKV 2nd edition)

Andreas.
  
Carlos O'Donell Feb. 9, 2016, 5:45 p.m. UTC | #8
On 02/09/2016 11:50 AM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> Those mappings don't follow the printed normative standard
>> of GB 18030-2005, but that's OK, this is expected best practice.
> 
> # The 2005 version of gb18030 includes updates to previous mappings that use to map to PUA
> # but are now mapped to actual Unicode codepoints.
> # (CJKV 2nd edition)

This statement is only partly correct. Some of the mappings were updated
but 24 mappings for PUA code points still remained.

CJKV Information Processing 2nd edition by Dr. Ken Lunde page 108:

"... Although PUA mappings for 24 characters are still printed in the GB
18030-2005 standard proper, it is important to note that as long as
Unicode Version 4.1 or greater is used, all 24 of these characters can be
safely mapped or otherwise handled without resorting to PUA code points ..."

Therefore the 24 mappings I mention in my comments are non-normative: they
are not part of the standard, but are a best-practice mapping for newer
Unicode environments.

My comments to the file clarify this situation.

Does that clarify my position?

Cheers,
Carlos.
  
Andreas Schwab Feb. 10, 2016, 9:15 a.m. UTC | #9
"Carlos O'Donell" <carlos@redhat.com> writes:

> This statement is only partly correct. Some of the mappings were updated
> but 24 mappings for PUA code points still remained.

What are the updated mappings apart from the 24 being left?

Andreas.
  
Carlos O'Donell Feb. 10, 2016, 2:02 p.m. UTC | #10
On 02/10/2016 04:15 AM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> This statement is only partly correct. Some of the mappings were updated
>> but 24 mappings for PUA code points still remained.
> 
> What are the updated mappings apart from the 24 being left?

Sorry, I don't quite understand the question.

Could you please clarify exactly what you would like to know?

Cheers,
Carlos.
  
Andreas Schwab Feb. 10, 2016, 2:14 p.m. UTC | #11
"Carlos O'Donell" <carlos@redhat.com> writes:

> On 02/10/2016 04:15 AM, Andreas Schwab wrote:
>> "Carlos O'Donell" <carlos@redhat.com> writes:
>> 
>>> This statement is only partly correct. Some of the mappings were updated
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> but 24 mappings for PUA code points still remained.
>> 
>> What are the updated mappings apart from the 24 being left?
>
> Sorry, I don't quite understand the question.
>
> Could you please clarify exactly what you would like to know?

Which are those updated mappings?

Andreas.
  
Carlos O'Donell Feb. 10, 2016, 2:50 p.m. UTC | #12
On 02/10/2016 09:14 AM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> On 02/10/2016 04:15 AM, Andreas Schwab wrote:
>>> "Carlos O'Donell" <carlos@redhat.com> writes:
>>>
>>>> This statement is only partly correct. Some of the mappings were updated
>                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> but 24 mappings for PUA code points still remained.
>>>
>>> What are the updated mappings apart from the 24 being left?
>>
>> Sorry, I don't quite understand the question.
>>
>> Could you please clarify exactly what you would like to know?
> 
> Which are those updated mappings?

So you would like to know which mappings changed between GB 18030-2000
and GB 18030-2005? I don't have such a list. In "CJKV Information Processing"
it is noted that there are 2 major areas of revision for 2000 -> 2005:

* Acknowledgment of CJK Unified Ideographs Extension B --- 42,711 hanzi

* Acknowledgment of the six regional scripts: Korean, Mongolian, Tai Le, Tibetan, Uyghur, and Yi.

So -2005 supports all 42,711 hanzi and the six scripts (all in 4-byte
regions). There are also 4 pictorial glyph corrections.

May I ask why such a list of updated mappings is relevant here?

The only important thing here is that with those 24 PUA mappings made
into non-PUA equivalents the *entire* GB 18030-2005 can be represented
in Unicode without the use of PUA code points. Which is great because
it means normal unmodified programs can process and represent those
characters correctly.

In summary:
- glibc supports GB 18030-2005.
- glibc modifies GB 18030-2005 to use 24 non-PUA code points, so that
  the implementation uses only standard Unicode code points.
- My comments are there to indicate the modifications for non-PUA code
  points (which deviate from the standard).
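
The substitution summarized above can be sketched as a lookup table. The
pairs are read off the comments in the patch, and pairing them in order
within each group is an assumption of this sketch. The 10 entries mapped
into <UFE10>..<UFE19> are left out because the patch comment does not
spell out their one-to-one pairing. PUA_TO_NON_PUA, is_bmp_pua and remap
are hypothetical names, not glibc code.

```python
# PUA code point -> non-PUA equivalent, per the comments in the patch.
# In-order pairing within each group is assumed.
PUA_TO_NON_PUA = {
    0xE816: 0x20087,
    0xE817: 0x20089,
    0xE818: 0x200CC,
    0xE81E: 0x9FB4,
    0xE826: 0x9FB5,
    0xE82B: 0x9FB6,
    0xE82C: 0x9FB7,
    0xE831: 0x215D7,
    0xE832: 0x9FB8,
    0xE83B: 0x2298F,
    0xE843: 0x9FB9,
    0xE854: 0x9FBA,
    0xE855: 0x241FE,
    0xE864: 0x9FBB,
}


def is_bmp_pua(cp: int) -> bool:
    """True for the Basic Multilingual Plane Private Use Area."""
    return 0xE000 <= cp <= 0xF8FF


def remap(text: str) -> str:
    """Replace the listed PUA characters with their non-PUA equivalents."""
    return "".join(chr(PUA_TO_NON_PUA.get(ord(ch), ord(ch))) for ch in text)
```

Text that still carries one of the old PUA code points can be passed
through remap() before further processing; all other characters are
returned unchanged.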

Cheers,
Carlos.
  
Andreas Schwab Feb. 10, 2016, 3:57 p.m. UTC | #13
"Carlos O'Donell" <carlos@redhat.com> writes:

> May I ask why such a list of updated mappings is relevant here?

There are exactly 24 code point differences between -2000 and -2005.  Go
figure.

Andreas.
  
Carlos O'Donell Feb. 10, 2016, 5:51 p.m. UTC | #14
On 02/10/2016 10:57 AM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> May I ask why such a list of updated mappings is relevant here?
> 
> There are exactly 24 code point differences between -2000 and -2005.  Go
> figure.

I did not say that. There are only 24 PUA code points in -2005 which as of
Unicode 4.1 can be represented as non-PUA code points.

The differences between -2000 and -2005 are very large, as I outlined in
the rest of my response.

Do you have a sustained opposition to my additional documentation of the
non-normative mappings listed in the GB 18030-2005 charmap?

Cheers,
Carlos.
  

Patch

diff --git a/localedata/charmaps/GB18030 b/localedata/charmaps/GB18030
index 863a123..85a15fe 100644
--- a/localedata/charmaps/GB18030
+++ b/localedata/charmaps/GB18030
@@ -57234,6 +57234,22 @@  CHARMAP
 <UE78A>     /xa6/xbe         <Private Use>
 <UE78B>     /xa6/xbf         <Private Use>
 <UE78C>     /xa6/xc0         <Private Use>
+% The newest GB 18030-2005 standard still uses some private use area
+% code points.  Any implementation which has Unicode 4.1 or newer
+% support should not use these PUA code points, and instead should
+% map these entries to their equivalent non-PUA code points. There
+% are 24 ideograms in GB 18030-2005 which have non-PUA equivalents.
+% In glibc we only support roundtrip code points, and so must choose
+% between supporting the old PUA code points, or using the newer
+% non-PUA code points. We choose to use the non-PUA code points to
+% be compatible with ICU's similar choice. In choosing the non-PUA
+% code points we can no longer convert the old PUA code points back
+% to GB 18030-2005 (technically only fixable if we added support
+% for non-roundtrip code points e.g. ICU's "fallback mapping").
+% The recommendation to use the non-PUA code points, where available,
+% is based on "CJKV Information Processing" 2nd Ed. by Dr. Ken Lunde.
+%
+% These 10 PUA mappings use equivalents from <UFE10> to <UFE19>.
 % <UE78D>     /xa6/xd9         <Private Use>
 % <UE78E>     /xa6/xda         <Private Use>
 % <UE78F>     /xa6/xdb         <Private Use>
@@ -57371,6 +57387,7 @@  CHARMAP
 <UE813>     /xd7/xfd         <Private Use>
 <UE814>     /xd7/xfe         <Private Use>
 <UE815>     /x83/x36/xc9/x34 <Private Use>
+% These 3 PUA mappings use equivalents <U20087>, <U20089> and <U200CC>.
 % <UE816>     /xfe/x51         <Private Use>
 % <UE817>     /xfe/x52         <Private Use>
 % <UE818>     /xfe/x53         <Private Use>
@@ -57379,6 +57396,7 @@  CHARMAP
 <UE81B>     /x83/x36/xc9/x37 <Private Use>
 <UE81C>     /x83/x36/xc9/x38 <Private Use>
 <UE81D>     /x83/x36/xc9/x39 <Private Use>
+% This 1 PUA mapping uses the equivalent <U9FB4>.
 % <UE81E>     /xfe/x59         <Private Use>
 <UE81F>     /x83/x36/xca/x30 <Private Use>
 <UE820>     /x83/x36/xca/x31 <Private Use>
@@ -57387,17 +57405,20 @@  CHARMAP
 <UE823>     /x83/x36/xca/x34 <Private Use>
 <UE824>     /x83/x36/xca/x35 <Private Use>
 <UE825>     /x83/x36/xca/x36 <Private Use>
+% This 1 PUA mapping uses the equivalent <U9FB5>.
 % <UE826>     /xfe/x61         <Private Use>
 <UE827>     /x83/x36/xca/x37 <Private Use>
 <UE828>     /x83/x36/xca/x38 <Private Use>
 <UE829>     /x83/x36/xca/x39 <Private Use>
 <UE82A>     /x83/x36/xcb/x30 <Private Use>
+% These 2 PUA mappings use the equivalents <U9FB6> and <U9FB7>.
 % <UE82B>     /xfe/x66         <Private Use>
 % <UE82C>     /xfe/x67         <Private Use>
 <UE82D>     /x83/x36/xcb/x31 <Private Use>
 <UE82E>     /x83/x36/xcb/x32 <Private Use>
 <UE82F>     /x83/x36/xcb/x33 <Private Use>
 <UE830>     /x83/x36/xcb/x34 <Private Use>
+% These 2 PUA mappings use the equivalents <U215D7> and <U9FB8>.
 % <UE831>     /xfe/x6c         <Private Use>
 % <UE832>     /xfe/x6d         <Private Use>
 <UE833>     /x83/x36/xcb/x35 <Private Use>
@@ -57408,6 +57429,7 @@  CHARMAP
 <UE838>     /x83/x36/xcc/x30 <Private Use>
 <UE839>     /x83/x36/xcc/x31 <Private Use>
 <UE83A>     /x83/x36/xcc/x32 <Private Use>
+% This 1 PUA mapping uses the equivalent <U2298F>.
 % <UE83B>     /xfe/x76         <Private Use>
 <UE83C>     /x83/x36/xcc/x33 <Private Use>
 <UE83D>     /x83/x36/xcc/x34 <Private Use>
@@ -57416,6 +57438,7 @@  CHARMAP
 <UE840>     /x83/x36/xcc/x37 <Private Use>
 <UE841>     /x83/x36/xcc/x38 <Private Use>
 <UE842>     /x83/x36/xcc/x39 <Private Use>
+% This 1 PUA mapping uses the equivalent <U9FB9>.
 % <UE843>     /xfe/x7e         <Private Use>
 <UE844>     /x83/x36/xcd/x30 <Private Use>
 <UE845>     /x83/x36/xcd/x31 <Private Use>
@@ -57433,6 +57456,7 @@  CHARMAP
 <UE851>     /x83/x36/xce/x33 <Private Use>
 <UE852>     /x83/x36/xce/x34 <Private Use>
 <UE853>     /x83/x36/xce/x35 <Private Use>
+% These 2 PUA mappings use the equivalents <U9FBA> and <U241FE>.
 % <UE854>     /xfe/x90         <Private Use>
 % <UE855>     /xfe/x91         <Private Use>
 <UE856>     /x83/x36/xce/x36 <Private Use>
@@ -57449,6 +57473,7 @@  CHARMAP
 <UE861>     /x83/x36/xcf/x37 <Private Use>
 <UE862>     /x83/x36/xcf/x38 <Private Use>
 <UE863>     /x83/x36/xcf/x39 <Private Use>
+% This 1 PUA mapping uses the equivalent <U9FBB>.
 % <UE864>     /xfe/xa0         <Private Use>
 <UE865>     /x83/x36/xd0/x30 <Private Use>
 <UE866>     /x83/x36/xd0/x31 <Private Use>