BZ #19575: Clarify status of entries in GB 18030-2005.

Message ID 56B8FA69.8030508@redhat.com
State Dropped

Commit Message

Carlos O'Donell Feb. 8, 2016, 8:28 p.m. UTC
In bug 19575 Florian Weimer asks about the status of the glibc
support for GB 18030-2005, since ICU and Emacs produce slightly
different results than glibc.

The following patch adds clarifying comments to the GB 18030-2005
character map to explain why glibc uses its current mapping and
why that mapping is best practice.

localedata/

2016-02-08  Carlos O'Donell  <carlos@redhat.com>

	* charmaps/GB18030

---

Cheers,
Carlos.
  

Comments

Florian Weimer Feb. 8, 2016, 8:33 p.m. UTC | #1
On 02/08/2016 09:28 PM, Carlos O'Donell wrote:
> In bug 19575 Florian Weimer asks about the status of the glibc
> support for GB 18030-2005, since ICU and Emacs produce slightly
> different results than glibc.
> 
> The following patch adds clarifying comments to the GB 18030-2005
> character map to explain why glibc uses its current mapping and
> why that mapping is best practice.

The comments would probably have helped me to understand the situation.

Thanks,
Florian
  
Andreas Schwab Feb. 8, 2016, 9:45 p.m. UTC | #2
"Carlos O'Donell" <carlos@redhat.com> writes:

> In bug 19575 Florian Weimer asks about the status of the glibc
> support for GB 18030-2005, since ICU and Emacs produce slightly
> different results than glibc.

Emacs uses the same table as glibc.

> +% The code points from <UFE10> to <UFE19> are an adjustment
> +% of the GB 18030-2005 standard to account for the fact that
> +% with Unicode 4.1 support we can now correctly represent those
> +% entries, which, in the standard, used PUA code points.

There are more differences between GB18030-2000 and GB18030-2005.

Andreas.
  
Carlos O'Donell Feb. 8, 2016, 9:47 p.m. UTC | #3
On 02/08/2016 04:45 PM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> In bug 19575 Florian Weimer asks about the status of the glibc
>> support for GB 18030-2005, since ICU and Emacs produce slightly
>> different results than glibc.
> 
> Emacs uses the same table as glibc.

Good.

>> +% The code points from <UFE10> to <UFE19> are an adjustment
>> +% of the GB 18030-2005 standard to account for the fact that
>> +% with Unicode 4.1 support we can now correctly represent those
>> +% entries, which, in the standard, used PUA code points.
> 
> There are more differences between GB18030-2000 and GB18030-2005.

Agreed.

This patch is only to clarify why these entries are being mapped
differently than in the original GB 18030-2005 standard.

Does the patch seem suitable to you?

Cheers,
Carlos.
  
Andreas Schwab Feb. 8, 2016, 10:19 p.m. UTC | #4
"Carlos O'Donell" <carlos@redhat.com> writes:

> This patch is only to clarify why these entries are being mapped
> differently than in the original GB 18030-2005 standard.

They aren't.

Andreas.
  
Carlos O'Donell Feb. 8, 2016, 11:59 p.m. UTC | #5
On 02/08/2016 05:19 PM, Andreas Schwab wrote:
> "Carlos O'Donell" <carlos@redhat.com> writes:
> 
>> This patch is only to clarify why these entries are being mapped
>> differently than in the original GB 18030-2005 standard.
> 
> They aren't.

Do you have a copy of the standard to verify that?

Cheers,
Carlos.
  
Andreas Schwab Feb. 9, 2016, 8:55 a.m. UTC | #6
"Carlos O'Donell" <carlos@redhat.com> writes:

> On 02/08/2016 05:19 PM, Andreas Schwab wrote:
>> "Carlos O'Donell" <carlos@redhat.com> writes:
>> 
>>> This patch is only to clarify why these entries are being mapped
>>> differently than in the original GB 18030-2005 standard.
>> 
>> They aren't.
>
> Do you have a copy of the standard to verify that?

See charset/data/ucm/gb-18030-2005.ucm in ICU.

Andreas.
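
For readers who want to see which mapping a particular implementation uses for one of the disputed entries, a minimal sketch follows. It assumes Python 3 and its built-in `gb18030` codec; `describe_mapping` is a hypothetical helper, not part of glibc or ICU, and the result depends entirely on the conversion table shipped with the runtime being tested.

```python
# Report which code point a codec assigns to a GB 18030 byte sequence.
# Assumption: the sequence decodes to exactly one character.
def describe_mapping(raw: bytes, codec: str = "gb18030") -> str:
    ch = raw.decode(codec)
    cp = ord(ch)
    # U+E000..U+F8FF is the Basic Multilingual Plane private use area.
    kind = "PUA" if 0xE000 <= cp <= 0xF8FF else "non-PUA"
    return f"{raw.hex()} -> U+{cp:04X} ({kind})"


if __name__ == "__main__":
    # /xa6/xd9 is the first of the entries discussed in this thread.
    print(describe_mapping(b"\xa6\xd9"))
```

A "PUA" result suggests the runtime follows the older private-use assignment for that entry, while "non-PUA" suggests a table like the one in glibc's charmap.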
  

Patch

diff --git a/localedata/charmaps/GB18030 b/localedata/charmaps/GB18030
index 863a123..c48276e 100644
--- a/localedata/charmaps/GB18030
+++ b/localedata/charmaps/GB18030
@@ -57234,6 +57234,12 @@  CHARMAP
 <UE78A>     /xa6/xbe         <Private Use>
 <UE78B>     /xa6/xbf         <Private Use>
 <UE78C>     /xa6/xc0         <Private Use>
+% The newest GB 18030-2005 standard still uses some private use area
+% code points.  Any implementation which has Unicode 4.1 or newer
+% support should not use these PUA code points, and should instead
+% map these entries to their equivalent non-PUA code points, which
+% in this case are <UFE10> through <UFE19>.  This recommendation is
+% based on "CJKV Information Processing" by Dr. Ken Lunde.
 % <UE78D>     /xa6/xd9         <Private Use>
 % <UE78E>     /xa6/xda         <Private Use>
 % <UE78F>     /xa6/xdb         <Private Use>
@@ -62997,6 +63003,10 @@  CHARMAP
 <UFE0D>     /x84/x31/x82/x33 VARIATION SELECTOR-14
 <UFE0E>     /x84/x31/x82/x34 VARIATION SELECTOR-15
 <UFE0F>     /x84/x31/x82/x35 VARIATION SELECTOR-16
+% The code points from <UFE10> to <UFE19> are an adjustment
+% of the GB 18030-2005 standard to account for the fact that
+% with Unicode 4.1 support we can now correctly represent those
+% entries, which, in the standard, used PUA code points.
 <UFE10>     /xa6/xd9         PRESENTATION FORM FOR VERTICAL COMMA
 <UFE11>     /xa6/xdb         PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
 <UFE12>     /xa6/xda         PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
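
The charmap entries in the hunks above follow a simple three-column layout: code point, byte sequence, name. As an illustration only, the sketch below parses such lines in Python so that two tables could be compared entry by entry. The regular expression and the `parse_charmap_line` helper are assumptions made for this example and cover just the line shapes visible in this patch, not the full charmap syntax.

```python
import re

# Matches lines such as:
#   <UFE10>     /xa6/xd9         PRESENTATION FORM FOR VERTICAL COMMA
CHARMAP_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,8})>\s+((?:/x[0-9A-Fa-f]{2})+)\s+(.*)$"
)


def parse_charmap_line(line):
    """Return (code_point, encoded_bytes, name), or None for other lines."""
    m = CHARMAP_LINE.match(line)
    if m is None:
        return None  # comment lines start with '%' and do not match
    code_point = int(m.group(1), 16)
    # "/xa6/xd9" splits into ["", "a6", "d9"]; skip the empty first item.
    encoded = bytes(int(h, 16) for h in m.group(2).split("/x")[1:])
    return code_point, encoded, m.group(3).strip()


if __name__ == "__main__":
    sample = [
        "% <UE78D>     /xa6/xd9         <Private Use>",
        "<UFE10>     /xa6/xd9         PRESENTATION FORM FOR VERTICAL COMMA",
        "<UFE0F>     /x84/x31/x82/x35 VARIATION SELECTOR-16",
    ]
    for line in sample:
        print(parse_charmap_line(line))
```

Commented-out entries such as the `% <UE78D>` line are skipped, so only the active mapping for `/xa6/xd9` is reported.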