Use Unicode code points for country_isbn

Message ID 5571B8C2.8000108@redhat.com
State New, archived

Commit Message

Marko Myllynen June 5, 2015, 2:57 p.m. UTC
  Hi,

make country_isbn definitions consistent across locales by using
Unicode code points not numerals everywhere. The code in
locale/categories.def and locale/programs/ld-address.c already
handles strings.

Please apply.

2015-06-05  Marko Myllynen  <myllynen@redhat.com>

	* li_BE: Add country_isbn.
	* li_NL: Likewise.

	* af_ZA: Use Unicode code point string for country_isbn.
	* ak_GH: Likewise.
	* bg_BG: Likewise.
	* cmn_TW: Likewise.
	* cy_GB: Likewise.
	* de_DE: Likewise.
	* en_NG: Likewise.
	* en_US: Likewise.
	* en_ZA: Likewise.
	* es_CR: Likewise.
	* es_US: Likewise.
	* fi_FI: Likewise.
	* fy_DE: Likewise.
	* gd_GB: Likewise.
	* ha_NG: Likewise.
	* hak_TW: Likewise.
	* hsb_DE: Likewise.
	* ht_HT: Likewise.
	* ia_FR: Likewise.
	* ig_NG: Likewise.
	* ka_GE: Likewise.
	* ku_TR: Likewise.
	* lb_LU: Likewise.
	* lzh_TW: Likewise.
	* mk_MK: Likewise.
	* mn_MN: Likewise.
	* nan_TW: Likewise.
	* nds_DE: Likewise.
	* nds_NL: Likewise.
	* oc_FR: Likewise.
	* pap_AN: Likewise.
	* ro_RO: Likewise.
	* sq_MK: Likewise.
	* sv_FI: Likewise.
	* tr_CY: Likewise.
	* tr_TR: Likewise.
	* uk_UA: Likewise.
	* unm_US: Likewise.
	* wa_BE: Likewise.
	* wae_CH: Likewise.
	* yi_US: Likewise.
	* yo_NG: Likewise.



Thanks,
  

Comments

Ondrej Bilka June 9, 2015, 7:11 a.m. UTC | #1
On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
> Hi,
> 
> make country_isbn definitions consistent across locales by using
> Unicode code points not numerals everywhere. The code in
> locale/categories.def and locale/programs/ld-address.c already
> handles strings.
> 
> Please apply.
> 
Possible but why, when these are numbers which are easier to read than
strings?
  
Marko Myllynen June 9, 2015, 10:12 a.m. UTC | #2
Hi,

On 2015-06-09 10:11, Ondřej Bílka wrote:
> On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
>>
>> make country_isbn definitions consistent across locales by using
>> Unicode code points not numerals everywhere. The code in
>> locale/categories.def and locale/programs/ld-address.c already
>> handles strings.
>>
>> Please apply.
>>
> Possible but why, when these are numbers which are easier to read than
> strings?

that's true, and I don't feel too strongly about this, but currently
some locales are using numbers and some are using Unicode code points so
there's a bit of inconsistency, also it's not that hard to read these
once one sees that e.g. 12 becomes "<U0031><U0032>" i.e. only the last
digit matters.
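
To make that mapping concrete, here is a minimal sketch in C. It is not taken
from glibc: digits_to_ucs_notation is a hypothetical helper, and its error
convention (0 on success, -1 on failure) is an assumption for illustration.
Each ASCII digit d is spelled as <U003d>, so only the final hex digit changes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Spell an ASCII digit string in <Uxxxx> notation, e.g. "12" becomes
   "<U0031><U0032>".  Returns 0 on success, -1 if DIGITS contains a
   non-digit or OUT is too small.  Hypothetical helper, not glibc code.  */
int
digits_to_ucs_notation (const char *digits, char *out, size_t outsize)
{
  size_t used = 0;

  assert (digits != NULL && out != NULL);
  if (outsize == 0)
    return -1;
  for (const char *p = digits; *p != '\0'; ++p)
    {
      if (*p < '0' || *p > '9')
        return -1;
      /* Need 7 bytes for "<U003d>" plus 1 for the terminating NUL.  */
      if (outsize - used < 8)
        return -1;
      /* The digits 0-9 are U+0030..U+0039, so the digit itself is the
         last hex digit of the code point.  */
      memcpy (out + used, "<U003", 5);
      out[used + 5] = *p;
      out[used + 6] = '>';
      used += 7;
    }
  out[used] = '\0';
  return 0;
}
```

Called with "12" and a large enough buffer, the helper fills the buffer with
<U0031><U0032>.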

Thanks,
  
Mike Frysinger July 21, 2015, 8:18 a.m. UTC | #3
On 09 Jun 2015 13:12, Marko Myllynen wrote:
> On 2015-06-09 10:11, Ondřej Bílka wrote:
> > On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
> >> make country_isbn definitions consistent across locales by using
> >> Unicode code points not numerals everywhere. The code in
> >> locale/categories.def and locale/programs/ld-address.c already
> >> handles strings.
> >>
> >> Please apply.
> >
> > Possible but why, when these are numbers which are easier to read than
> > strings?
> 
> that's true, and I don't feel too strongly about this, but currently
> some locales are using numbers and some are using Unicode code points so
> there's a bit of inconsistency, also it's not that hard to read these
> once one sees that e.g. 12 becomes "<U0031><U0032>" i.e. only the last
> digit matters.

i find many of the U markers pointlessly obscure, especially when they're used
for characters that are in the ASCII standard.  if we're standardizing on UTF8
encodings in general, why can't we convert these files as well ?  keep in mind
that i'm ignorant of the tooling around these files ;).
-mike
  
keld@keldix.com July 21, 2015, 8:40 a.m. UTC | #4
On Tue, Jul 21, 2015 at 04:18:40AM -0400, Mike Frysinger wrote:
> On 09 Jun 2015 13:12, Marko Myllynen wrote:
> > On 2015-06-09 10:11, Ondřej Bílka wrote:
> > > On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
> > >> make country_isbn definitions consistent across locales by using
> > >> Unicode code points not numerals everywhere. The code in
> > >> locale/categories.def and locale/programs/ld-address.c already
> > >> handles strings.
> > >>
> > >> Please apply.
> > >
> > > Possible but why, when these are numbers which are easier to read than
> > > strings?
> > 
> > that's true, and I don't feel too strongly about this, but currently
> > some locales are using numbers and some are using Unicode code points so
> > there's a bit of inconsistency, also it's not that hard to read these
> > once one sees that e.g. 12 becomes "<U0031><U0032>" i.e. only the last
> > digit matters.
> 
> i find many of the U markers pointlessly obscure, especially when they're used
> for characters that are in the ASCII standard.  if we're standardizing on UTF8
> encodings in general, why can't we convert these files as well ?  keep in mind
> that i'm ignorant of the tooling around these files ;).

The use of Unicode code points helps make the locales portable, e.g.
when cross-compiling for different architectures, including embedded systems, EBCDIC
systems, UTF-16 systems and UTF-8 systems, when you are on a different host platform.

For the ASCII characters one could use the symbolic character name from the
POSIX locale. They are much more readable than the Unicode code points, IMHO.

Best regards
Keld
  
Florian Weimer July 21, 2015, 8:54 a.m. UTC | #5
On 07/21/2015 10:40 AM, keld@keldix.com wrote:

> The use of Unicode points helps making the locales portable, eg.
> when crosscompiling for different architectures, including embedded systems, ebcdic
> systems, utf-16 systems and utf8 systems, when you are on a different host platform.

Is this really a relevant use case?  Cross-compiling glibc to an EBCDIC
system?
  
keld@keldix.com July 21, 2015, 9:02 a.m. UTC | #6
On Tue, Jul 21, 2015 at 10:54:21AM +0200, Florian Weimer wrote:
> On 07/21/2015 10:40 AM, keld@keldix.com wrote:
> 
> > The use of Unicode points helps making the locales portable, eg.
> > when crosscompiling for different architectures, including embedded systems, ebcdic
> > systems, utf-16 systems and utf8 systems, when you are on a different host platform.
> 
> Is this really a relevant use case?  Cross-compiling glibc to an EBCDIC
> system?

I also mentioned other cases, which may be more relevant.. 

Best regards
keld
  
Florian Weimer July 21, 2015, 9:05 a.m. UTC | #7
On 07/21/2015 11:02 AM, keld@keldix.com wrote:
> On Tue, Jul 21, 2015 at 10:54:21AM +0200, Florian Weimer wrote:
>> On 07/21/2015 10:40 AM, keld@keldix.com wrote:
>>
>>> The use of Unicode points helps making the locales portable, eg.
>>> when crosscompiling for different architectures, including embedded systems, ebcdic
>>> systems, utf-16 systems and utf8 systems, when you are on a different host platform.
>>
>> Is this really a relevant use case?  Cross-compiling glibc to an EBCDIC
>> system?
> 
> I also mentioned other cases, which may be more relevant.. 

I can't see how.  Unless someone maintains this code and processing
pipeline, it's not going to work with the current code.  Is anyone doing
it?  I doubt it.

We don't use trigraphs in C sources, so I really don't get why we have
to use an equivalent construct in the locale definitions.  Unless the
goal is to raise the bar for new contributors for some reason, but I
think the project has long walked away from that approach.
  
Mike Frysinger July 21, 2015, 9:22 a.m. UTC | #8
On 21 Jul 2015 10:40, keld@keldix.com wrote:
> On Tue, Jul 21, 2015 at 04:18:40AM -0400, Mike Frysinger wrote:
> > On 09 Jun 2015 13:12, Marko Myllynen wrote:
> > > On 2015-06-09 10:11, Ondřej Bílka wrote:
> > > > On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
> > > >> make country_isbn definitions consistent across locales by using
> > > >> Unicode code points not numerals everywhere. The code in
> > > >> locale/categories.def and locale/programs/ld-address.c already
> > > >> handles strings.
> > > >>
> > > >> Please apply.
> > > >
> > > > Possible but why, when these are numbers which are easier to read than
> > > > strings?
> > > 
> > > that's true, and I don't feel too strongly about this, but currently
> > > some locales are using numbers and some are using Unicode code points so
> > > there's a bit of inconsistency, also it's not that hard to read these
> > > once one sees that e.g. 12 becomes "<U0031><U0032>" i.e. only the last
> > > digit matters.
> > 
> > i find many of the U markers pointlessly obscure, especially when they're used
> > for characters that are in the ASCII standard.  if we're standardizing on UTF8
> > encodings in general, why can't we convert these files as well ?  keep in mind
> > that i'm ignorant of the tooling around these files ;).
> 
> The use of Unicode points helps making the locales portable, eg.
> when crosscompiling for different architectures, including embedded systems, ebcdic
> systems, utf-16 systems and utf8 systems, when you are on a different host platform.

i'm referring to the tools we use -- either inside of the source repo
(i.e. ones we wrote/maintain), or external ones that operate on our
files directly (i.e. gcc).  what actual problems do you see here ?
vague references like "cross-compiling is magic" aren't really that
interesting.

keep in mind we already use (and agreed to standardize on) UTF8 in
things like *.c and *.h and ChangeLog and READMEs and info pages.
-mike
  
keld@keldix.com July 21, 2015, 11:58 a.m. UTC | #9
On Tue, Jul 21, 2015 at 05:22:17AM -0400, Mike Frysinger wrote:
> On 21 Jul 2015 10:40, keld@keldix.com wrote:
> > On Tue, Jul 21, 2015 at 04:18:40AM -0400, Mike Frysinger wrote:
> > > On 09 Jun 2015 13:12, Marko Myllynen wrote:
> > > > On 2015-06-09 10:11, Ondřej Bílka wrote:
> > > > > On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
> > > > >> make country_isbn definitions consistent across locales by using
> > > > >> Unicode code points not numerals everywhere. The code in
> > > > >> locale/categories.def and locale/programs/ld-address.c already
> > > > >> handles strings.
> > > > >>
> > > > >> Please apply.
> > > > >
> > > > > Possible but why, when these are numbers which are easier to read than
> > > > > strings?
> > > > 
> > > > that's true, and I don't feel too strongly about this, but currently
> > > > some locales are using numbers and some are using Unicode code points so
> > > > there's a bit of inconsistency, also it's not that hard to read these
> > > > once one sees that e.g. 12 becomes "<U0031><U0032>" i.e. only the last
> > > > digit matters.
> > > 
> > > i find many of the U markers pointlessly obscure, especially when they're used
> > > for characters that are in the ASCII standard.  if we're standardizing on UTF8
> > > encodings in general, why can't we convert these files as well ?  keep in mind
> > > that i'm ignorant of the tooling around these files ;).
> > 
> > The use of Unicode points helps making the locales portable, eg.
> > when crosscompiling for different architectures, including embedded systems, ebcdic
> > systems, utf-16 systems and utf8 systems, when you are on a different host platform.
> 
> i'm referring to the tools we use -- either inside of the source repo
> (i.e. ones we wrote/maintain), or external ones that operate on our
> files directly (i.e. gcc).  what actual problems do you see here ?
> vague references like "cross-compiling is magic" aren't really that
> interesting.

It would mean that you cannot use the locale sources for cross-compiling when
the host and the target machines use different character sets,
e.g. if you are building embedded systems on iOS or Windows or other UTF-16 machines
for a UTF-8 target, or building for Android. Or the other way round, if you are
on a UTF-8 host and generate locales for a UTF-16 target such as a UTF-16 embedded
system or an iPhone or iPad system.

I suggest you use the POSIX character names instead, eg 12 becomes "<1><2>"

> keep in mind we already use (and agreed to standardize on) UTF8 in
> things like *.c and *.h and ChangeLog and READMEs and info pages.

That is not related. Of course we have our sources in a specific encoding,
and when sources are moved between platforms (aka portability) the
source text may be converted from one representation to another,
which happens e.g. when you move our sources to an iOS or Windows platform.

Best regards
Keld
  
keld@keldix.com July 21, 2015, 12:12 p.m. UTC | #10
On Tue, Jul 21, 2015 at 01:58:52PM +0200, Keld Simonsen wrote:
> I suggest you use the POSIX character names instead, eg 12 becomes "<1><2>"

I am sorry, in POSIX this would be "<one><two>". I would then suggest that you
use the character naming of ISO TR 14652 or ISO TR 30112, which would be the
above-mentioned "<1><2>", or the POSIX names, but the 14652/30112 names are more readable, IMHO.
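
For comparison, the three spellings discussed in this thread can be put side
by side in a small C sketch. This is an illustration only: format_digit and
the digit_style enum are hypothetical names, and the TR 14652/30112 style
simply follows the "<1><2>" form described in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* The three ways of naming a digit mentioned in the thread.  */
enum digit_style { STYLE_UCS, STYLE_POSIX, STYLE_TR14652 };

/* Write the symbolic name of DIGIT ('0'..'9') in the given style into OUT.
   Returns 0 on success, -1 on bad input or if OUT is too small.
   Hypothetical helper, not glibc code.  */
int
format_digit (char digit, enum digit_style style, char *out, size_t outsize)
{
  /* Digit names from the POSIX portable character set.  */
  static const char *const posix_names[10] = {
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine"
  };
  char body[8];
  const char *name;

  assert (out != NULL);
  if (digit < '0' || digit > '9')
    return -1;
  switch (style)
    {
    case STYLE_UCS:       /* e.g. <U0031> */
      snprintf (body, sizeof body, "U003%c", digit);
      name = body;
      break;
    case STYLE_POSIX:     /* e.g. <one> */
      name = posix_names[digit - '0'];
      break;
    case STYLE_TR14652:   /* e.g. <1> */
      body[0] = digit;
      body[1] = '\0';
      name = body;
      break;
    default:
      return -1;
    }
  /* Room for '<', the name, '>' and the terminating NUL.  */
  if (strlen (name) + 3 > outsize)
    return -1;
  snprintf (out, outsize, "<%s>", name);
  return 0;
}
```

For the digit 1 this yields <U0031>, <one> and <1> respectively, which shows
why the shorter names are easier to read.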

Best regards
keld
  
Joseph Myers July 22, 2015, 5:25 p.m. UTC | #11
On Tue, 21 Jul 2015, Keld Simonsen wrote:

> It would mean that you cannot use the locale sources for crosscompiling 
> when using some different character sets on the hosting and the target 
> machines. Eg if you are making embedded systems on IOS or Windows or 
> other utf16 machines for an utf8 target, or making stuff for android. Or 
> the other way round if you are on an utf8 host and generate locales for 
> a utf16 target such as a utf16 embedded system or an iphone or ipad 
> system.

On the build system on which glibc is built, we can always assume that the 
glibc sources are the exact sequences of octets provided by the glibc 
project, not converted into another character set and without any 
conversions of line endings.  Furthermore, on any system using glibc and 
executing tools such as localedef with the installed locale source files, 
it can be assumed that those source files are the files shipped with 
glibc, not those files after conversion into another character set.  Use 
of glibc source files after conversion into another character set is 
outside the scope of the glibc project - glibc is not expected to build 
with such converted source files.

Now, it's true that the installed localedef utility should be usable in 
locale A to generate locale B, for any pair (A, B) of installed locales - 
rather than only being able to generate locales as part of the glibc build 
/ install process.  If localedef interprets locale sources in the 
character set of the locale in which it runs, that may mean the installed 
locale sources do need to be in ASCII.  How does localedef determine the 
character set in which to interpret the textual locale source files?
  
keld@keldix.com July 22, 2015, 7:02 p.m. UTC | #12
On Wed, Jul 22, 2015 at 05:25:04PM +0000, Joseph Myers wrote:
> On Tue, 21 Jul 2015, Keld Simonsen wrote:
> 
> > It would mean that you cannot use the locale sources for crosscompiling 
> > when using some different character sets on the hosting and the target 
> > machines. Eg if you are making embedded systems on IOS or Windows or 
> > other utf16 machines for an utf8 target, or making stuff for android. Or 
> > the other way round if you are on an utf8 host and generate locales for 
> > a utf16 target such as a utf16 embedded system or an iphone or ipad 
> > system.
> 
> On the build system on which glibc is built, we can always assume that the 
> glibc sources are the exact sequences of octets provided by the glibc 
> project, not converted into another character set and without any 
> conversions of line endings.  Furthermore, on any system using glibc and 
> executing tools such as localedef with the installed locale source files, 
> it can be assumed that those source files are the files shipped with 
> glibc, not those files after conversion into another character set.  Use 
> of glibc source files after conversion into another character set is 
> outside the scope of the glibc project - glibc is not expected to build 
> with such converted source files.

Sounds strange. glibc is the library for the GNU C language. Standard ISO C
is coded character set independent, as is also POSIX. Why would the glibc project 
not follow ISO C and POSIX design goals? Why would glibc exclude itself
from Apple and Microsoft (utf16) and non-utf8 Linux and UNIX systems? 

Maybe we should clone glibc to make it available on other platforms
than those using utf8. Or maybe you are not correct. I have not been watching
the glibc project closely enough to tell.

> Now, it's true that the installed localedef utility should be usable in 
> locale A to generate locale B, for any pair (A, B) of installed locales - 
> rather than only being able to generate locales as part of the glibc build 
> / install process.  If localedef interprets locale sources in the 
> character set of the locale in which it runs, that may mean the installed 
> locale sources do need to be in ASCII.  How does localedef determine the 
> character set in which to interpret the textual locale source files?

Yes, that is why we use UCS symbolic code points. I would then rather be
fully consistent and use UCS symbolic code points all the way through a locale source;
it is a bit more cumbersome, but I would rather be consistent. And it would facilitate
the cross-compiling I wrote about. I don't think there is a mix of locales where it
matters on Linux boxes. Oh well, some thinkable scenarios:
Apple or Windows users on a Linux box, Linux users on Apple or Windows boxes,
or some mix with EBCDIC. More unlikely, but still thinkable, is a big
mainframe and number cruncher environment, the mainframe being an IBM mainframe
running VM/CMS and the number cruncher being a Linux supercomputer, e.g. in
a financial institution.

Keld
  
Joseph Myers July 22, 2015, 8:02 p.m. UTC | #13
On Wed, 22 Jul 2015, Keld Simonsen wrote:

> > On the build system on which glibc is built, we can always assume that the 
> > glibc sources are the exact sequences of octets provided by the glibc 
> > project, not converted into another character set and without any 
> > conversions of line endings.  Furthermore, on any system using glibc and 
> > executing tools such as localedef with the installed locale source files, 
> > it can be assumed that those source files are the files shipped with 
> > glibc, not those files after conversion into another character set.  Use 
> > of glibc source files after conversion into another character set is 
> > outside the scope of the glibc project - glibc is not expected to build 
> > with such converted source files.
> 
> Sounds strange. glibc is the library for the GNU C language. Standard 

No it's not.  It's the C library for the GNU system.  glibc has a range of 
requirements, including ELF, TLS, an MMU, two's complement integers, 
32-bit int, 32-bit or 64-bit long, 32-bit UTF-32 wchar_t, IEEE binary32 
float, IEEE binary64 double, various GNU tools present on the build system 
as documented in install.texi, ....

> ISO C is coded character set independent, as is also POSIX. Why would 
> the glibc project not follow ISO C and POSIX design goals? Why would 

Because glibc makes particular implementation choices in areas that are 
implementation-defined.  It's an implementation, not a meta-implementation 
that tries to cover the range of permitted implementation choices.  
Meta-implementations (at least of the language part of ISO C) exist, but 
they exist in the field of formal systems used to reason about C programs.

> glibc exclude itself from Apple and Microsoft (utf16) and non-utf8 Linux 
> and UNIX systems?

It's about 15-20 years since glibc was usable as a replacement C library 
for systems with an existing native non-free C library.  Those systems are 
not relevant to glibc nowadays (Apple and Microsoft systems fail the basic 
requirement of using ELF, which is assumed all over glibc).  UTF-16 is 
supported in iconv (only), just like EBCDIC.  Non-UTF-8 locales are 
supported, but deprecated (new non-UTF-8 locales should not be added, and 
any existing non-UTF-8 locales should have a UTF-8 counterpart), and to be 
usable in a POSIX-compliant way must have a character set that includes 
ASCII.

Given sufficiently many GNU tools built on a non-GNU build system, it 
should be possible to cross-compile glibc there - but localedef itself is 
only ever linked against glibc and run on a system using glibc (the 
cross-localedef functionality checked in to glibc is limited to allowing 
one glibc system to generate locales for another system with the same 
glibc version but a different endianness).

> > Now, it's true that the installed localedef utility should be usable in 
> > locale A to generate locale B, for any pair (A, B) of installed locales - 
> > rather than only being able to generate locales as part of the glibc build 
> > / install process.  If localedef interprets locale sources in the 
> > character set of the locale in which it runs, that may mean the installed 
> > locale sources do need to be in ASCII.  How does localedef determine the 
> > character set in which to interpret the textual locale source files?
> 
> Yes, that is why we use UCS symbolic code points. I would then rather to be

"Yes" does not answer my question about how localedef determines the 
character set of its input.

> fully consistent use UCS symbolic code points all the way thru a locale 
> source, it is a bit more cumbersome, but I would rather be consistent. 

I'd rather have some extension to allow a locale source file to declare 
that it is in UTF-8, and then use UTF-8 throughout except for control 
characters or combining characters used in isolation.
  
Ondrej Bilka July 23, 2015, 10:27 p.m. UTC | #14
On Wed, Jul 22, 2015 at 08:02:23PM +0000, Joseph Myers wrote:
> > > Now, it's true that the installed localedef utility should be usable in 
> > > locale A to generate locale B, for any pair (A, B) of installed locales - 
> > > rather than only being able to generate locales as part of the glibc build 
> > > / install process.  If localedef interprets locale sources in the 
> > > character set of the locale in which it runs, that may mean the installed 
> > > locale sources do need to be in ASCII.  How does localedef determine the 
> > > character set in which to interpret the textual locale source files?
> > 
> > Yes, that is why we use UCS symbolic code points. I would then rather to be
> 
> "Yes" does not answer my question about how localedef determines the 
> character set of its input.
> 
> > fully consistent use UCS symbolic code points all the way thru a locale 
> > source, it is a bit more cumbersome, but I would rather be consistent. 
> 
> I'd rather have some extension to allow a locale source file to declare 
> that it is in UTF-8, and then use UTF-8 throughout except for control 
> characters or combining characters used in isolation.
>
I second that. It would be technically easy to do, so it's mostly a matter
of selecting a proper interface. We would require some UTF-8 locale (if we
decide on C.UTF8 then use it, otherwise for example en_US).

Then it would be a matter of selecting a different locale for files marked,
say, by having UTF8 in the first line. A sample implementation would be:

fgets (first_line, 5, locale);
if (!memcmp (first_line, "UTF8", 4))
  setlocale(LC_ALL,"en_US.UTF8");
else
/* unget first line.  */
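
The idea can be fleshed out as a self-contained sketch. It is only an
illustration under assumptions: the "UTF8" first-line marker is the
hypothetical convention proposed in this message, not something localedef
recognizes, and locale_source_is_utf8 is a made-up name. Compared with the
fragment above, it checks the fgets result and rewinds the stream when no
marker is found.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the stream starts with the hypothetical "UTF8" marker,
   leaving the stream positioned just after those four bytes (the rest
   of the first line is still unread).  Return 0 if there is no marker,
   with the stream rewound to the start so the caller can parse the
   file normally.  Return -1 if the stream cannot be rewound, e.g. a pipe.  */
int
locale_source_is_utf8 (FILE *fp)
{
  char first[5] = "";

  assert (fp != NULL);
  /* fgets reads at most 4 bytes here, stopping early at a newline.  */
  if (fgets (first, sizeof first, fp) != NULL
      && strcmp (first, "UTF8") == 0)
    return 1;
  /* No marker: "unget" whatever was read by seeking back to the start.  */
  if (fseek (fp, 0L, SEEK_SET) != 0)
    return -1;
  return 0;
}
```

A caller would switch to a UTF-8 locale only when the function returns 1,
and should check the return value of setlocale, since the requested locale
may not be installed.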
  
Carlos O'Donell July 24, 2015, 12:20 a.m. UTC | #15
On 07/23/2015 06:27 PM, Ondřej Bílka wrote:
>> I'd rather have some extension to allow a locale source file to declare 
>> that it is in UTF-8, and then use UTF-8 throughout except for control 
>> characters or combining characters used in isolation.
>>
> I second that. It would be technically easy to do, so its mostly matter
> of selecting proper interface. If we require some utf8 locale (if we
> decide for C.UTF8 then use it otherwise for example en_US.
> 
> Then it would be matter of selecting different locale on files marked
> say by having UTF8 in first line. Sample implementation would be:
> 
> fgets (first_line, 5, locale);
> if (!memcmp (first_line, "UTF8", 4))
>   setlocale(LC_ALL,"en_US.UTF8");
> else
> /* unget first line.  */
> 

I agree with Joseph's position here.

Further to that, my primary goal is to make contribution for these
files easier.

I have no interest in the abstract cases that are not being supported
by anyone at the present moment.

Cheers,
Carlos.
  
Carlos O'Donell July 24, 2015, 12:23 a.m. UTC | #16
On 06/09/2015 03:11 AM, Ondřej Bílka wrote:
> On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
>> Hi,
>>
>> make country_isbn definitions consistent across locales by using
>> Unicode code points not numerals everywhere. The code in
>> locale/categories.def and locale/programs/ld-address.c already
>> handles strings.
>>
>> Please apply.
>>
> Possible but why, when these are numbers which are easier to read than
> strings?
> 

I agree with Ondrej. Why?

The question we should all be asking ourselves here is:

	"What can we do to make it *easier* to maintain these files?"

Making everyone write in Unicode code points is not easier.

Joseph, Ondrej, and myself agree that we should find a way to just make
these files UTF-8. I expect that a precondition is going to be to add
an unremovable C.UTF-8 locale, which I think is important.

Cheers,
Carlos.
  
Paul Eggert July 24, 2015, 2:16 a.m. UTC | #17
Carlos O'Donell wrote:
> Joseph, Ondrej, and myself agree that we should find a way to just make
> these files UTF-8. I expect that a precondition is going to be to add
> an unremovable C.UTF-8 locale, which I think is important.

I also like the idea of having these files be UTF-8.

Why is an unremovable C.UTF-8 locale a precondition, though?  We should be able 
to assume a properly-installed localedef and a minimum set of locales for 
running development tools like localedef.  The minimum set could include 
en_US.UTF-8.
  
Carlos O'Donell July 24, 2015, 2:45 a.m. UTC | #18
On 07/23/2015 10:16 PM, Paul Eggert wrote:
> Carlos O'Donell wrote:
>> Joseph, Ondrej, and myself agree that we should find a way to just
>> make these files UTF-8. I expect that a precondition is going to be
>> to add an unremovable C.UTF-8 locale, which I think is important.
> 
> I also like the idea of having these files be UTF-8.
> 
> Why is an unremovable C.UTF-8 locale a precondition, though?  We
> should be able to assume a properly-installed localedef and a minimum
> set of locales for running development tools like localedef.  The
> minimum set could include en_US.UTF-8.

Agreed. I should not have said "precondition" when I really meant
"nice to have" since it simplifies some of the error handling
if you know you have a fallback UTF-8 locale you can use.

c.
  
keld@keldix.com July 24, 2015, 10:24 a.m. UTC | #19
On Thu, Jul 23, 2015 at 07:16:45PM -0700, Paul Eggert wrote:
> Carlos O'Donell wrote:
> >Joseph, Ondrej, and myself agree that we should find a way to just make
> >these files UTF-8. I expect that a precondition is going to be to add
> >an unremovable C.UTF-8 locale, which I think is important.
> 
> I also like the idea of having these files be UTF-8.
> 
> Why is an unremovable C.UTF-8 locale a precondition, though?  We
> should be able to assume a properly-installed localedef and a
> minimum set of locales for running development tools like localedef.
> The minimum set could include en_US.UTF-8.

You are then going to deviate from POSIX and ISO TR 30112 practice.

As you may know, I am involved in POSIX and 30112 standardization,
and I have tried to align 30112 with glibc practice. If you are deviating
from POSIX guidelines, I have trouble unifying the two goals.

I also would like to have the 30112 standard implemented more broadly than just
glibc. Rumour has it that this is done somewhere else.
I would like to then use the glibc locales for the ISO 15897 registry,
and give maximum usability to those - that is, the locales should be
fully charset independent.

Also SC35 is looking at revising 30112 and in that revision we would
like to also update the character info to a new revision of 10646,
aligned to what is used in the sorting standard ISO 14651,
and I would like to use the glibc LC_CTYPE for that. 

I agree that it should be easier to create locales.
One could do that with a GUI that helps create, proofread and test locales.

Best regards
Keld
  
keld@keldix.com July 24, 2015, 10:26 a.m. UTC | #20
On Thu, Jul 23, 2015 at 07:16:45PM -0700, Paul Eggert wrote:
> Carlos O'Donell wrote:
> >Joseph, Ondrej, and myself agree that we should find a way to just make
> >these files UTF-8. I expect that a precondition is going to be to add
> >an unremovable C.UTF-8 locale, which I think is important.
> 
> I also like the idea of having these files be UTF-8.
> 
> Why is an unremovable C.UTF-8 locale a precondition, though?  We
> should be able to assume a properly-installed localedef and a
> minimum set of locales for running development tools like localedef.
> The minimum set could include en_US.UTF-8.

I would recommend using the i18n locale - the purpose of the i18n
locale is to be the locale to build all the other locales from.

Best regards
Keld
  
keld@keldix.com July 24, 2015, 10:43 a.m. UTC | #21
On Wed, Jul 22, 2015 at 08:02:23PM +0000, Joseph Myers wrote:
> On Wed, 22 Jul 2015, Keld Simonsen wrote:
> 
> > > On the build system on which glibc is built, we can always assume that the 
> > > glibc sources are the exact sequences of octets provided by the glibc 
> > > project, not converted into another character set and without any 
> > > conversions of line endings.  Furthermore, on any system using glibc and 
> > > executing tools such as localedef with the installed locale source files, 
> > > it can be assumed that those source files are the files shipped with 
> > > glibc, not those files after conversion into another character set.  Use 
> > > of glibc source files after conversion into another character set is 
> > > outside the scope of the glibc project - glibc is not expected to build 
> > > with such converted source files.
> > 
> > Sounds strange. glibc is the library for the GNU C language. Standard 
> 
> No it's not.  It's the C library for the GNU system.  glibc has a range of 
> requirements, including ELF, TLS, an MMU, two's complement integers, 
> 32-bit int, 32-bit or 64-bit long, 32-bit UTF-32 wchar_t, IEEE binary32 
> float, IEEE binary64 double, various GNU tools present on the build system 
> as documented in install.texi, ....

Yes, understood, but I don't think any of these requirements influences the
locales part.

> > ISO C is coded character set independent, as is also POSIX. Why would 
> > the glibc project not follow ISO C and POSIX design goals? Why would 
> 
> Because glibc makes particular implementation choices in areas that are 
> implementation-defined.  It's an implementation, not a meta-implementation 
> that tries to cover the range of permitted implementation choices.  
> Meta-implementations (at least of the language part of ISO C) exist, but 
> they exist in the field of formal systems used to reason about C programs.

I am also active in C standardization. I think it is a good goal to not
deviate and restrict an implementation more than necessary. And at least
not restrict it further than already implemented. That would lead to a loss
of functionality.


> > glibc exclude itself from Apple and Microsoft (utf16) and non-utf8 Linux 
> > and UNIX systems?
> 
> It's about 15-20 years since glibc was usable as a replacement C library 
> for systems with an existing native non-free C library.  Those systems are 
> not relevant to glibc nowadays (Apple and Microsoft systems fail the basic 
> requirement of using ELF, which is assumed all over glibc).  UTF-16 is 
> supported in iconv (only), just like EBCDIC.  Non-UTF-8 locales are 
> supported, but deprecated (new non-UTF-8 locales should not be added, and 
> any existing non-UTF-8 locales should have a UTF-8 counterpart), and to be 
> usable in a POSIX-compliant way must have a character set that includes 
> ASCII.

I thought cygwin was a GNU implementation for windows, and that it also
implemented glibc. I now understand that the cygwin libc is different from
glibc. But how different? Do they use glibc locales, or are they able to?

I would like the glibc locales to also be usable in other libc environments.
Most of all because they IMHO are the most comprehensive set of locales available.
So that would benefit users also outside glibc. Why not have this in mind
also for our project?

> Given sufficiently many GNU tools built on a non-GNU build system, it 
> should be possible to cross-compile glibc there - but localedef itself is 
> only ever linked against glibc and run on a system using glibc (the 
> cross-localedef functionality checked in to glibc is limited to allowing 
> one glibc system to generate locales for another system with the same 
> glibc version but a different endianness).
> 
> > > Now, it's true that the installed localedef utility should be usable in 
> > > locale A to generate locale B, for any pair (A, B) of installed locales - 
> > > rather than only being able to generate locales as part of the glibc build 
> > > / install process.  If localedef interprets locale sources in the 
> > > character set of the locale in which it runs, that may mean the installed 
> > > locale sources do need to be in ASCII.  How does localedef determine the 
> > > character set in which to interpret the textual locale source files?
> > 
> > Yes, that is why we use UCS symbolic code points. I would then rather to be
> 
> "Yes" does not answer my question about how localedef determines the 
> character set of its input.

My understanding is that the charset of the source is the charset of the locale
of the environment that localedef is running in. If the locale then is ASCII only
then there is no need for conversion of it - except for conversion 
into UTF16. Restricting the source further to invariant-ASCII also makes
the source portable to EBCDIC systems. Unicode restricts its sources to ASCII,
possibly also for this reason. Unicode do not publish their data in Unicode.

> > fully consistent use UCS symbolic code points all the way thru a locale 
> > source, it is a bit more cumbersome, but I would rather be consistent. 
> 
> I'd rather have some extension to allow a locale source file to declare 
> that it is in UTF-8, and then use UTF-8 throughout except for control 
> characters or combining characters used in isolation.

That would make it difficult to maintain in environments that are not using utf8.
Using ASCII only would make the locales maintainable on all systems.

Best regards
Keld
  
Joseph Myers July 24, 2015, 3:11 p.m. UTC | #22
On Fri, 24 Jul 2015, Keld Simonsen wrote:

> > Because glibc makes particular implementation choices in areas that are 
> > implementation-defined.  It's an implementation, not a meta-implementation 
> > that tries to cover the range of permitted implementation choices.  
> > Meta-implementations (at least of the language part of ISO C) exist, but 
> > they exist in the field of formal systems used to reason about C programs.
> 
> I am also active in C standardization. I think it is a good goal to not
> deviate and restrict an implementation more than necessary. And at least 
> not restrict it further than already implemented. That would lead to a loss
> of functionality.

The point of things being implementation-defined is to allow 
implementations flexibility in what is convenient for those 
implementations.  glibc duly uses that flexibility to adopt particular 
choices for implementation-defined behavior (some depending on the 
architecture, but most being globally fixed for all glibc configurations, 
so that all glibc code is free to rely on those choices).

> I thought cygwin was a GNU implementation for windows, and that it also
> implemented glibc. I now understand that the cygwin libc is different from
> glibc. But how different? Do they use glibc locales, or are they able to?

I don't think there's any use of glibc locales by newlib as Cygwin's libc.

> I would like the glibc locales to also be usable in other libc environments.
> Most of all because they IMHO are the most comprehensive set of locales available.
> So that would benefit users also outside glibc. Why not have this in mind
> also for our project?

I think CLDR is more likely to be the most comprehensive set of locales 
(it certainly claims to be "the largest and most extensive standard 
repository of locale data available"), and unlike glibc's locales is 
intended for wider use.  Even if we did want wider use for glibc's locales 
(beyond use by glibc's locale-dependent functions after having been 
compiled into binary form by glibc's localedef program from the same 
version of glibc) I think we should still say: UTF-8 is the way of the 
present and future, other multibyte character sets are legacy.  And, just 
as we require a range of GNU tools to build glibc, so we can rely on 
features of one part of the GNU system when working on another part, so we 
should require GNU localedef to build glibc's locales.

> > I'd rather have some extension to allow a locale source file to declare 
> > that it is in UTF-8, and then use UTF-8 throughout except for control 
> > characters or combining characters used in isolation.
> 
> That would make it difficult to maintain in environments that are not using utf8.

It would make the locales easier to maintain for people using UTF-8, the 
number of which (among people concerned with i18n) can be presumed to be 
much greater than the number using legacy character sets.
  
Paul Eggert July 24, 2015, 4:50 p.m. UTC | #23
Keld Simonsen wrote:
> it should be easier to create locales.
> One could do that with a GUI that helped create and proofread and test locales.

I'm not aware of any such GUI, and even if one existed people would have to be 
trained to use it.  In contrast, we already have GUIs (e.g., Emacs) that people 
already know how to use and that work reasonably well with UTF-8 localedef sources.

Although the other goals you mention are laudable ones, surely they could be 
achieved by an automatic transformation of UTF-8 localedef sources into a 
less-readable equivalent with angle brackets, an equivalent that could be 
processed even by hypothetical tools operating in legacy multibyte locales. 
This shouldn't require a fancy GUI; it should be a relatively simple batch 
program.  Any engineering effort in this area would likely need this kind of 
transformation anyway, and any software developers in this specialized area 
should be able to take on this relatively minor extra task.
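
The batch transformation described above could look roughly like the following. This is only a minimal sketch: the function name `to_ucs_notation` is made up for illustration, and it assumes the symbolic form is `<UXXXX>` with four hex digits for code points up to U+FFFF and `<UXXXXXXXX>` with eight digits beyond that. It converts a single string value, not a whole locale source file.

```python
def to_ucs_notation(text: str) -> str:
    """Rewrite each character of `text` as a <UXXXX> symbolic code point.

    Assumption: code points above U+FFFF use the eight-digit form
    <UXXXXXXXX>; everything else uses the four-digit form.
    """
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            parts.append(f"<U{cp:04X}>")
        else:
            parts.append(f"<U{cp:08X}>")
    return "".join(parts)


if __name__ == "__main__":
    # Values taken from the patch in this thread (ak_GH and es_US).
    print(to_ucs_notation("9964"))
    print(to_ucs_notation("Español"))
```

A real converter would also have to parse the locale syntax so that keywords, comments and escape sequences are left alone; that part is not shown here.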
  
keld@keldix.com July 24, 2015, 5:15 p.m. UTC | #24
On Fri, Jul 24, 2015 at 09:50:46AM -0700, Paul Eggert wrote:
> Keld Simonsen wrote:
> >it should be easier to create locales.
> >One could do that with a GUI that helped create and proofread and test 
> >locales.
> 
> I'm not aware of any such GUI, and even if one existed people would have to 
> be trained to use it.  In contrast, we already have GUIs (e.g., Emacs) that 
> people already know how to use and that work reasonably well with UTF-8 
> localedef sources.
> 
> Although the other goals you mention are laudable ones, surely they could 
> be achieved by an automatic transformation of UTF-8 localedef sources into 
> a less-readable equivalent with angle brackets, an equivalent that could be 
> processed even by hypothetical tools operating in legacy multibyte locales. 
> This shouldn't require a fancy GUI; it should be a relatively simple batch 
> program.  Any engineering effort in this area would likely need this kind 
> of transformation anyway, and any software developers in this specialized 
> area should be able to take on this relatively minor extra task.

We could have a utility to do that, and probably there was one developed 
when Ulrich converted the mnemonic style to UCS codepoints. But maybe
that is lost. Or it could be part of the localedef utility,
given that localedef understands the full syntax of locales, then a conversion
option to and from different charsets and symbolic representations could be
done, with some better chances of being maintained and updated for
new features, and not lost.

I was also thinking of testing, how would a date be output with this date format?
I think there may be something like that lying around, eg for KDE localization,
which I think is based on some other data and formats than glibc locales,
but it is a much bigger work than just doing some conversion of characters.

Keld
  
keld@keldix.com July 25, 2015, 1:18 p.m. UTC | #25
On Fri, Jul 24, 2015 at 03:11:15PM +0000, Joseph Myers wrote:
> On Fri, 24 Jul 2015, Keld Simonsen wrote:
> 
> > > Because glibc makes particular implementation choices in areas that are 
> > > implementation-defined.  It's an implementation, not a meta-implementation 
> > > that tries to cover the range of permitted implementation choices.  
> > > Meta-implementations (at least of the language part of ISO C) exist, but 
> > > they exist in the field of formal systems used to reason about C programs.
> > 
> > I am also active in C standardization. I think it is a good goal to not
> > deviate and restrict an implementation more than necessary. And at least 
> > not restrict it further than already implemented. That would lead to a loss
> > of functionality.
> 
> The point of things being implementation-defined is to allow 
> implementations flexibility in what is convenient for those 
> implementations.  glibc duly uses that flexibility to adopt particular 
> choices for implementation-defined behavior (some depending on the 
> architecture, but most being globally fixed for all glibc configurations, 
> so that all glibc code is free to rely on those choices).

Yes, of course implementation defined allowance is to be used.

I then have another hat on, as I am involved in writing the standards.
I have to have a generic point of view, and also from the users point of view
implementation defined items are no good for portability, so you
cannot be sure of your independence. You are bound to the implementation
of which you used the implementation defined specs.

I don't know about the goals of the glibc project, but there are a number
of possibilities to get out to a bigger audience. Actually the locales are mostly used
for end user apps, and glibc has an end user audience, that could be made bigger.
Eg both the Apple end user community and the Android user community
are way bigger than the glibc end user community. And they could be a target for
at least glibc locales. I believe both Apple and Google use POSIX derived localization,
including the locale model. I, at least as the editor of ISO TR 30112, need to have those
communities in sight. I have been cooperating with the glibc community, especially
Ulrich, but also with FSF as I have donated many locale and charmap specs to them.
And I am using glibc i18n locale as the locale source in the standard. 

So I would welcome if glibc adhered to the design goals of character set independence,
that both POSIX and 30112 have, a design goal also shared by Unicode Inc.

> > I thought cygwin was a GNU implementation for windows, and that it also
> > implemented glibc. I now understand that the cygwin libc is different from
> > glibc. But how different? Do they use glibc locales, or are they able to?
> 
> I don't think there's any use of glibc locales by newlib as Cygwin's libc.

I believe if that is true, then they use something based on my earlier locales,
that I released to X/Open many years ago. Those were widely used in the industry,
as they were the only and most comprehensive locales around, freely available.
They also were the basis for many of the glibc locales. I think there is a potential
for glibc locales to take that position today.

> > I would like the glibc locales to also be usable in other libc environments.
> > Most of all because they IMHO are the most comprehensive set of locales available.
> > So that would benefit users also outside glibc. Why not have this in mind
> > also for our project?
> 
> I think CLDR is more likely to be the most comprehensive set of locales 
> (it certainly claims to be "the largest and most extensive standard 
> repository of locale data available"), and unlike glibc's locales is 
> intended for wider use.  Even if we did want wider use for glibc's locales 
> (beyond use by glibc's locale-dependent functions after having been 
> compiled into binary form by glibc's localedef program from the same 
> version of glibc) I think we should still say: UTF-8 is the way of the 
> present and future, other multibyte character sets are legacy.  And, just 
> as we require a range of GNU tools to build glibc, so we can rely on 
> features of one part of the GNU system when working on another part, so we 
> should require GNU localedef to build glibc's locales.

CLDR is not POSIX like locales, they are in XML. Also I believe they
are not of the same quality as the glibc locales. I for one had an experience with Unicode that
they would not take my specs, even if I represented Danish Standards. The result
was that their Danish spec did not adhere to Danish Standards and to Danish
official orthography rules.  I then gave up contact with them.

> > > I'd rather have some extension to allow a locale source file to declare 
> > > that it is in UTF-8, and then use UTF-8 throughout except for control 
> > > characters or combining characters used in isolation.
> > 
> > That would make it difficult to maintain in environments that are not using utf8.
> 
> It would make the locales easier to maintain for people using UTF-8, the 
> number of which (among people concerned with i18n) can be presumed to be 
> much greater than the number using legacy character sets.

Yes, but you are excluding some communities. So: easier for the majority,
impossible for a number of diverse minorities, which actually have the potential
to be much larger than the current user base.

Best regards
Keld
  
Joseph Myers July 27, 2015, 2:54 p.m. UTC | #26
On Sat, 25 Jul 2015, Keld Simonsen wrote:

> > It would make the locales easier to maintain for people using UTF-8, the 
> > number of which (among people concerned with i18n) can be presumed to be 
> > much greater than the number using legacy character sets.
> 
> Yes, but you are excluding some communities. So: easier for the majority,
> impossible for a number of diverse minorities, which actually have the potential
> to be much larger than the current user base.

I think it's appropriate to say: if you want to use the glibc locales 
outside of glibc, you are responsible for maintaining the tools required 
to do so (e.g. for converting the encoding of locale source files).  I 
don't think such tools for conversion of encodings would be hard to write 
or need much maintenance when written (and in one direction - converting 
the ASCII files to UTF-8 - they might even be written by the glibc project 
as part of the initial conversion work).
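
For the ASCII-to-UTF-8 direction mentioned above, a conversion tool might be sketched as follows. This is an illustration under stated assumptions, not an existing glibc tool: the name `from_ucs_notation` is hypothetical, and as a simplification it keeps every control character and every combining mark in symbolic form, rather than only combining characters used in isolation.

```python
import re
import unicodedata

# Matches <UXXXX> through <UXXXXXXXX> (4 to 8 hex digits).
_UCS = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def _decode(match: re.Match) -> str:
    """Return the literal character, or the original token if it should stay symbolic."""
    code = int(match.group(1), 16)
    if code > 0x10FFFF:
        return match.group(0)  # not a valid code point, leave untouched
    ch = chr(code)
    category = unicodedata.category(ch)
    # Simplification (assumption): controls, surrogates and all combining
    # marks stay symbolic so the result remains readable and encodable.
    if category in ("Cc", "Cs") or category.startswith("M"):
        return match.group(0)
    return ch


def from_ucs_notation(line: str) -> str:
    """Replace symbolic code points in one line with literal characters."""
    return _UCS.sub(_decode, line)


if __name__ == "__main__":
    print(from_ucs_notation('country_isbn  "<U0039><U0035><U0034>"'))
    print(from_ucs_notation('lang_name     "<U0044><U0065><U0075><U0074><U0073><U0063><U0068>"'))
```

Characters that have a syntactic meaning in a locale source file (the string delimiter, the comment character, the escape character) would presumably also need to stay symbolic inside strings; the sketch does not handle that.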
  
Marko Myllynen Aug. 10, 2015, 10:31 a.m. UTC | #27
Hi,

On 2015-07-24 03:23, Carlos O'Donell wrote:
> On 06/09/2015 03:11 AM, Ondřej Bílka wrote:
>> On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
>>>
>>> make country_isbn definitions consistent across locales by using
>>> Unicode code points not numerals everywhere. The code in
>>> locale/categories.def and locale/programs/ld-address.c already
>>> handles strings.
>>>
>> Possible but why, when these are numbers which are easier to read than
>> strings?
> 
> I agree with Ondrej. Why?

see above, for consistency.

> The question we should all be asking ourselves here is:
> 
> 	"What can we do to make it *easier* to maintain these files?"

Currently the definitions of this particular key across locales are
inconsistent and it doesn't make things easier as one can get confused
which form should be used for country_isbn.

> Making everyone write in Unicode code points is not easier.

The patch was only about making one individual key consistent, it's not
like this patch would add any additional generic burden.

Thanks,
  
keld@keldix.com Aug. 10, 2015, 11:05 a.m. UTC | #28
On Mon, Aug 10, 2015 at 01:31:30PM +0300, Marko Myllynen wrote:
> Hi,
> 
> On 2015-07-24 03:23, Carlos O'Donell wrote:
> > On 06/09/2015 03:11 AM, Ondřej Bílka wrote:
> >> On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
> >>>
> >>> make country_isbn definitions consistent across locales by using
> >>> Unicode code points not numerals everywhere. The code in
> >>> locale/categories.def and locale/programs/ld-address.c already
> >>> handles strings.
> >>>
> >> Possible but why, when these are numbers which are easier to read than
> >> strings?
> > 
> > I agree with Ondrej. Why?
> 
> see above, for consistency.
> 
> > The question we should all be asking ourselves here is:
> > 
> > 	"What can we do to make it *easier* to maintain these files?"
> 
> Currently the definitions of this particular key across locales are
> inconsistent and it doesn't make things easier as one can get confused
> which form should be used for country_isbn.
> 
> > Making everyone write in Unicode code points is not easier.
> 
> The patch was only about making one individual key consistent, it's not
> like this patch would add any additional generic burden.

Why not continue to use the UCS codepoints, as we do for all other strings in locales?
That would also add to consistency, and for portability (which I
understand is not a goal amongst glibc developers - but anyway...)

Best regards
keld
  
Marko Myllynen Aug. 10, 2015, 11:14 a.m. UTC | #29
Hi,

On 2015-08-10 14:05, Keld Simonsen wrote:
> On Mon, Aug 10, 2015 at 01:31:30PM +0300, Marko Myllynen wrote:
>> On 2015-07-24 03:23, Carlos O'Donell wrote:
>>> On 06/09/2015 03:11 AM, Ondřej Bílka wrote:
>>>> On Fri, Jun 05, 2015 at 05:57:06PM +0300, Marko Myllynen wrote:
>>>>>
>>>>> make country_isbn definitions consistent across locales by using
>>>>> Unicode code points not numerals everywhere. The code in
>>>>> locale/categories.def and locale/programs/ld-address.c already
>>>>> handles strings.
>>>>>
>>>> Possible but why, when these are numbers which are easier to read than
>>>> strings?
>>>
>>> I agree with Ondrej. Why?
>>
>> see above, for consistency.
>>
>>> The question we should all be asking ourselves here is:
>>>
>>> 	"What can we do to make it *easier* to maintain these files?"
>>
>> Currently the definitions of this particular key across locales are
>> inconsistent and it doesn't make things easier as one can get confused
>> which form should be used for country_isbn.
>>
>>> Making everyone write in Unicode code points is not easier.
>>
>> The patch was only about making one individual key consistent, it's not
>> like this patch would add any additional generic burden.
> 
> Why not continue to use the UCS codepoints, as we do for all other strings in locales.
> That would also add to consistency, and for portability (which I
> understand is not a goal amongst glibc developers - but anyway...)

that is exactly what my patch proposal was doing, nothing more, nothing
less: switch those locales using plain numbers for country_isbn to use
Unicode code points for country_isbn to make things consistent across
all locales.
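
To make the change concrete, here is a rough sketch of the per-line rewrite the patch applies. It is not the script used to produce the patch (the thread does not say how it was generated), and `normalize_country_isbn` is a hypothetical name. It covers the two inconsistent forms seen in the diff, a bare number and a quoted ASCII string, and leaves lines that already use code points unchanged.

```python
import re

# Group 1: keyword plus its whitespace; group 2: quoted plain value;
# group 3: bare number. Values already containing <...> do not match.
_ISBN_LINE = re.compile(r'^(country_isbn\s+)(?:"([^"<>]*)"|(\d+))\s*$')


def normalize_country_isbn(line: str) -> str:
    """Rewrite one country_isbn line (without trailing newline) to <UXXXX> form.

    Lines that are not country_isbn definitions, or that already use
    symbolic code points, are returned unchanged.
    """
    m = _ISBN_LINE.match(line)
    if m is None:
        return line
    value = m.group(2) if m.group(2) is not None else m.group(3)
    # Assumption: the value is plain ASCII (digits and commas in the patch),
    # so the four-digit form is always sufficient.
    encoded = "".join(f"<U{ord(ch):04X}>" for ch in value)
    return f'{m.group(1)}"{encoded}"'


if __name__ == "__main__":
    for sample in ('country_isbn 9964',
                   'country_isbn  "9930,9977,9968"',
                   'country_isbn  "<U0030>"'):
        print(normalize_country_isbn(sample))
```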

In the long term we could look for alternatives for creating and
maintaining locales easier in general but in the short term I think the
best solution is to keep things consistent.

Thanks,
  

Patch

diff --git a/localedata/locales/af_ZA b/localedata/locales/af_ZA
index 143ad75..29223d5 100644
--- a/localedata/locales/af_ZA
+++ b/localedata/locales/af_ZA
@@ -275,7 +275,7 @@  country_car   "<U005A><U0041>"
 
 % ISO 2108
 % http://www.isbn-international.org/html/prefix/prefa.htm
-country_isbn  0
+country_isbn  "<U0030>"
 
 % ISO 639 language abbreviations:
 % 639-1 2 letter, 639-2 3 letter terminology
diff --git a/localedata/locales/ak_GH b/localedata/locales/ak_GH
index 159acc8..0f5a667 100644
--- a/localedata/locales/ak_GH
+++ b/localedata/locales/ak_GH
@@ -195,7 +195,7 @@  country_ab2  "<U0047><U0048>"
 % GHA
 country_ab3  "<U0047><U0048><U0041>"
 country_num  288
-country_isbn 9964
+country_isbn "<U0039><U0039><U0036><U0034>"
 % Akan
 lang_name    "<U0041><U006B><U0061><U006E>"
 % ak
diff --git a/localedata/locales/bg_BG b/localedata/locales/bg_BG
index 74e5ad4..4a62159 100644
--- a/localedata/locales/bg_BG
+++ b/localedata/locales/bg_BG
@@ -266,7 +266,7 @@  country_ab2  "<U0042><U0047>"
 country_ab3  "<U0042><U0047><U0052>"
 country_num   100
 country_car  "<U0042><U0047>"
-country_isbn  954
+country_isbn  "<U0039><U0035><U0034>"
 % български език
 lang_name    "<U0431><U044A><U043B><U0433><U0430><U0440><U0441><U043A><U0438><U0020><U0435><U0437><U0438><U043A>"
 lang_ab      "<U0062><U0067>"
diff --git a/localedata/locales/cmn_TW b/localedata/locales/cmn_TW
index a332659..01838ed 100644
--- a/localedata/locales/cmn_TW
+++ b/localedata/locales/cmn_TW
@@ -200,7 +200,7 @@  country_ab2  "<U0054><U0057>"
 % TWN
 country_ab3  "<U0054><U0057><U004E>"
 country_num  158
-country_isbn 957
+country_isbn "<U0039><U0035><U0037>"
 % 漢語官話
 lang_name    "<U6F22><U8A9E><U5B98><U8A71>"
 % cmn
diff --git a/localedata/locales/cy_GB b/localedata/locales/cy_GB
index 66298e0..31e1e89 100644
--- a/localedata/locales/cy_GB
+++ b/localedata/locales/cy_GB
@@ -40,7 +40,7 @@  country_name "<U0043><U0079><U006D><U0072><U0075>"
 country_ab2 "<U0047><U0042>"
 country_ab3 "<U0047><U0042><U0052>"
 country_num 826
-country_isbn 0
+country_isbn "<U0030>"
 country_car "<U0047><U0042>"
 lang_name "<U0043><U0079><U006D><U0072><U0061><U0065><U0067>"
 lang_ab "<U0063><U0079>"
diff --git a/localedata/locales/de_DE b/localedata/locales/de_DE
index e2704a7..26d83c8 100644
--- a/localedata/locales/de_DE
+++ b/localedata/locales/de_DE
@@ -193,7 +193,7 @@  country_ab2   "<U0044><U0045>"
 country_ab3   "<U0044><U0045><U0055>"
 country_num   276
 country_car   "<U0044>"
-country_isbn  3
+country_isbn  "<U0033>"
 % Deutsch
 lang_name     "<U0044><U0065><U0075><U0074><U0073><U0063><U0068>"
 % de
diff --git a/localedata/locales/en_NG b/localedata/locales/en_NG
index 364b549..f6d4005 100644
--- a/localedata/locales/en_NG
+++ b/localedata/locales/en_NG
@@ -270,7 +270,7 @@  country_car   "<U0057><U0041><U004E>"
 
 % ISO 2108
 % http://www.isbn-international.org/
-country_isbn  978
+country_isbn  "<U0039><U0037><U0038>"
 
 % ISO 639 language abbreviations:
 % 639-1 2 letter, 639-2 3 letter terminology
diff --git a/localedata/locales/en_US b/localedata/locales/en_US
index d79c228..08154bc 100644
--- a/localedata/locales/en_US
+++ b/localedata/locales/en_US
@@ -164,7 +164,7 @@  country_ab3   "<U0055><U0053><U0041>"
 country_num   840
 % USA
 country_car   "<U0055><U0053><U0041>"
-country_isbn  0
+country_isbn  "<U0030>"
 % English
 lang_name     "<U0045><U006E><U0067><U006C><U0069><U0073><U0068>"
 % en
diff --git a/localedata/locales/en_ZA b/localedata/locales/en_ZA
index 294b0a3..263c718 100644
--- a/localedata/locales/en_ZA
+++ b/localedata/locales/en_ZA
@@ -338,7 +338,7 @@  country_car   "<U005A><U0041>"
 
 % ISO 2108
 % http://www.isbn-international.org/html/prefix/prefa.htm
-country_isbn  0
+country_isbn  "<U0030>"
 
 % ISO 639 language abbreviations:
 % 639-1 2 letter, 639-2 3 letter terminology
diff --git a/localedata/locales/es_CR b/localedata/locales/es_CR
index b5dec84..18b10f9 100644
--- a/localedata/locales/es_CR
+++ b/localedata/locales/es_CR
@@ -155,7 +155,7 @@  postal_fmt    "<U0025><U0066><U0025><U004E><U0025><U0061><U0025><U004E>/
 country_name  "<U0043><U006F><U0073><U0074><U0061><U0020><U0052><U0069><U0063><U0061>"
 country_post  "<U0043><U0052>"
 country_car   "<U0043><U0052>"
-country_isbn  "9930,9977,9968"
+country_isbn  "<U0039><U0039><U0033><U0030><U002C><U0039><U0039><U0037><U0037><U002C><U0039><U0039><U0036><U0038>"
 country_ab2   "<U0043><U0052>"
 country_ab3   "<U0043><U0052><U0049>"
 country_num   188
diff --git a/localedata/locales/es_US b/localedata/locales/es_US
index 6b808d5..357102e 100644
--- a/localedata/locales/es_US
+++ b/localedata/locales/es_US
@@ -208,7 +208,7 @@  country_ab2   "<U0055><U0053>"
 country_ab3   "<U0055><U0053><U0041>"
 country_num   840
 country_car   "<U0055><U0053><U0041>"
-country_isbn  0
+country_isbn  "<U0030>"
 % Español
 lang_name     "<U0045><U0073><U0070><U0061><U00F1><U006F><U006C>"
 % es
diff --git a/localedata/locales/fi_FI b/localedata/locales/fi_FI
index e87878c..6ba91ba 100644
--- a/localedata/locales/fi_FI
+++ b/localedata/locales/fi_FI
@@ -253,7 +253,7 @@  country_num 246
 country_name "<U0053><U0075><U006F><U006D><U0069>"
 country_post "<U0046><U0049>"
 country_car  "<U0046><U0049><U004E>"
-country_isbn 952
+country_isbn "<U0039><U0035><U0032>"
 % suomi
 lang_name    "<U0073><U0075><U006F><U006D><U0069>"
 lang_ab      "<U0066><U0069>"
diff --git a/localedata/locales/fy_DE b/localedata/locales/fy_DE
index 046d775..e68ed7d 100644
--- a/localedata/locales/fy_DE
+++ b/localedata/locales/fy_DE
@@ -48,7 +48,7 @@  country_ab3   "<U0044><U0045><U0055>"
 % D
 country_car   "<U0044>"
 country_num 276
-country_isbn "3"
+country_isbn "<U0033>"
 % FIXME country_name in Low Saxon ?
 % Frysk
 lang_name    "<U0046><U0072><U0079><U0073><U006B>"
diff --git a/localedata/locales/gd_GB b/localedata/locales/gd_GB
index 41943f5..765f9df 100644
--- a/localedata/locales/gd_GB
+++ b/localedata/locales/gd_GB
@@ -148,7 +148,7 @@  country_ab3  "<U0047><U0042><U0052>"
 country_num  826
 % GB
 country_car  "<U0047><U0042>"
-country_isbn 0
+country_isbn "<U0030>"
 % Gàidhlig
 lang_name    "<U0047><U00E0><U0069><U0064><U0068><U006C><U0069><U0067>"
 % gd
diff --git a/localedata/locales/ha_NG b/localedata/locales/ha_NG
index 6ea1a88..c5d1f77 100644
--- a/localedata/locales/ha_NG
+++ b/localedata/locales/ha_NG
@@ -287,7 +287,7 @@  country_car   "<U0057><U0041><U004E>"
 
 % ISO 2108
 % http://www.isbn-international.org/
-country_isbn  978
+country_isbn  "<U0039><U0037><U0038>"
 
 % ISO 639 language abbreviations:
 % 639-1 2 letter, 639-2 3 letter terminology
diff --git a/localedata/locales/hak_TW b/localedata/locales/hak_TW
index 454ebad..543206a 100644
--- a/localedata/locales/hak_TW
+++ b/localedata/locales/hak_TW
@@ -199,7 +199,7 @@  country_ab2  "<U0054><U0057>"
 % TWN
 country_ab3  "<U0054><U0057><U004E>"
 country_num  158
-country_isbn 957
+country_isbn "<U0039><U0035><U0037>"
 % 漢語客家語
 lang_name    "<U6F22><U8A9E><U5BA2><U5BB6><U8A9E>"
 % hak
diff --git a/localedata/locales/hsb_DE b/localedata/locales/hsb_DE
index db130fd..b177663 100644
--- a/localedata/locales/hsb_DE
+++ b/localedata/locales/hsb_DE
@@ -2212,7 +2212,7 @@  country_ab2   "<U0044><U0045>"
 country_ab3   "<U0044><U0045><U0055>"
 country_num   276
 country_car   "<U0044>"
-country_isbn  3
+country_isbn  "<U0033>"
 lang_name     "<U0048><U006F><U0072><U006E><U006A><U006F><U0073><U0065>/
 <U0072><U0062><U0161><U0107><U0069><U006E><U0061>"
 lang_ab      ""
diff --git a/localedata/locales/ht_HT b/localedata/locales/ht_HT
index 66ae10b..8f12153 100644
--- a/localedata/locales/ht_HT
+++ b/localedata/locales/ht_HT
@@ -193,7 +193,7 @@  country_ab2  "<U0048><U0054>"
 % HTI
 country_ab3  "<U0048><U0054><U0049>"
 country_num  332
-country_isbn 99935
+country_isbn "<U0039><U0039><U0039><U0033><U0035>"
 % RH
 country_car  "<U0052><U0048>"
 %
diff --git a/localedata/locales/ia_FR b/localedata/locales/ia_FR
index 722cc6e..64248c8 100644
--- a/localedata/locales/ia_FR
+++ b/localedata/locales/ia_FR
@@ -128,7 +128,7 @@  country_post "<U0046>"
 country_ab2 "<U0046><U0052>"
 country_ab3 "<U0046><U0052><U0041>"
 country_num 250
-country_isbn 2
+country_isbn "<U0032>"
 country_car "<U0046>"
 lang_name "<U0049><U006E><U0074><U0065><U0072><U006C><U0069><U006E><U0067><U0075><U0061>"
 
diff --git a/localedata/locales/ig_NG b/localedata/locales/ig_NG
index 8b1a48b..32f0f08 100644
--- a/localedata/locales/ig_NG
+++ b/localedata/locales/ig_NG
@@ -484,7 +484,7 @@  country_car   "<U0057><U0041><U004E>"
 
 % ISO 2108
 % http://www.isbn-international.org/
-country_isbn  978
+country_isbn  "<U0039><U0037><U0038>"
 
 % ISO 639 language abbreviations:
 % 639-1 2 letter, 639-2 3 letter terminology
diff --git a/localedata/locales/ka_GE b/localedata/locales/ka_GE
index 459c467..ad47bae 100644
--- a/localedata/locales/ka_GE
+++ b/localedata/locales/ka_GE
@@ -45,7 +45,7 @@  country_ab3 "GEO"
 country_num 268
 % GE
 country_car    "<U0047><U0045>"
-country_isbn "99928"
+country_isbn "<U0039><U0039><U0039><U0032><U0038>"
 % ქართული
 lang_name    "<U10E5><U10D0><U10E0><U10D7><U10E3><U10DA><U10D8>"
 % ka
diff --git a/localedata/locales/ku_TR b/localedata/locales/ku_TR
index d974bfb..44e2528 100644
--- a/localedata/locales/ku_TR
+++ b/localedata/locales/ku_TR
@@ -206,7 +206,7 @@  country_post "TR"
 country_ab2  "TR"
 country_ab3  "TUR"
 country_num  792
-country_isbn 975
+country_isbn "<U0039><U0037><U0035>"
 % TR
 country_car    "<U0054><U0052>"
 % "kurdi"
diff --git a/localedata/locales/lb_LU b/localedata/locales/lb_LU
index a74e162..36cb98e 100644
--- a/localedata/locales/lb_LU
+++ b/localedata/locales/lb_LU
@@ -175,7 +175,7 @@  country_ab2   "<U004C><U0055>"
 country_ab3   "<U004C><U0055><U0058>"
 country_num   442
 country_car   "<U004C>"
-country_isbn  2
+country_isbn  "<U0032>"
 lang_name     "<U004C><U00EB><U0074><U007A><U0065><U0062><U0075><U0065>/
 <U0072><U0067><U0065><U0073><U0063><U0068>"
 lang_ab       "<U006C><U0062>"
diff --git a/localedata/locales/li_BE b/localedata/locales/li_BE
index 5a89754..e917802 100644
--- a/localedata/locales/li_BE
+++ b/localedata/locales/li_BE
@@ -47,7 +47,7 @@  country_ab2   "<U0042><U0045>"
 country_ab3   "<U0042><U0045><U004C>"
 country_car   "<U0042>"
 country_num 56
-%FIXME country_isbn "2"
+country_isbn "<U0032>"
 % Lèmbörgs
 lang_name    "<U004C><U00E8><U006D><U0062><U00F6><U0072><U0067><U0073>"
 lang_ab "<U006C><U0069>"
diff --git a/localedata/locales/li_NL b/localedata/locales/li_NL
index b07c4a4..b92acbf 100644
--- a/localedata/locales/li_NL
+++ b/localedata/locales/li_NL
@@ -47,7 +47,7 @@  country_ab2   "<U004E><U004C>"
 country_ab3   "<U004E><U004C><U0044>"
 country_car   "<U004E><U004C>"
 country_num 528
-%FIXME country_isbn "2"
+country_isbn "<U0033>"
 % Lèmbörgs
 lang_name    "<U004C><U00E8><U006D><U0062><U00F6><U0072><U0067><U0073>"
 lang_ab "<U006C><U0069>"
diff --git a/localedata/locales/lzh_TW b/localedata/locales/lzh_TW
index 73b4897..0f26ecf 100644
--- a/localedata/locales/lzh_TW
+++ b/localedata/locales/lzh_TW
@@ -234,7 +234,7 @@  country_ab2  "<U0054><U0057>"
 % TWN
 country_ab3  "<U0054><U0057><U004E>"
 country_num  158
-country_isbn 957
+country_isbn "<U0039><U0035><U0037>"
 % 漢語文言
 lang_name    "<U6F22><U8A9E><U6587><U8A00>"
 % lzh
diff --git a/localedata/locales/mk_MK b/localedata/locales/mk_MK
index b751679..31653e7 100644
--- a/localedata/locales/mk_MK
+++ b/localedata/locales/mk_MK
@@ -152,7 +152,7 @@  country_ab2 "<U004d><U004b>"
 country_ab3 "<U004d><U004b><U0044>"
 country_car "<U004d><U004b>"
 country_num 807
-country_isbn "9989"
+country_isbn "<U0039><U0039><U0038><U0039>"
 % македонски јазик
 lang_name    "<U043C><U0430><U043A><U0435><U0434><U043E><U043D><U0441><U043A>/<U0438><U0020><U0458><U0430><U0437><U0438><U043A>"
 lang_ab "<U006d><U006b>"
diff --git a/localedata/locales/mn_MN b/localedata/locales/mn_MN
index 6649537..acb32da 100644
--- a/localedata/locales/mn_MN
+++ b/localedata/locales/mn_MN
@@ -254,7 +254,7 @@  country_ab2   "<U004D><U004E>"
 country_ab3   "<U004D><U004E><U0047>"
 country_num   496
 country_car   "<U004D><U0047><U004C>"
-country_isbn  99929
+country_isbn  "<U0039><U0039><U0039><U0032><U0039>"
 % Монгол хэл
 lang_name    "<U041C><U043E><U043D><U0433><U043E><U043B><U0020><U0445><U044D><U043B>"
 lang_ab       "<U006D><U006E>"
diff --git a/localedata/locales/nan_TW b/localedata/locales/nan_TW
index 0c11174..08bbb2d 100644
--- a/localedata/locales/nan_TW
+++ b/localedata/locales/nan_TW
@@ -200,7 +200,7 @@  country_ab2  "<U0054><U0057>"
 % TWN
 country_ab3  "<U0054><U0057><U004E>"
 country_num  158
-country_isbn 957
+country_isbn "<U0039><U0035><U0037>"
 % 漢語閩南語
 lang_name    "<U6F22><U8A9E><U95A9><U5357><U8A9E>"
 % nan
diff --git a/localedata/locales/nds_DE b/localedata/locales/nds_DE
index e1ab6e0..81d0ad4 100644
--- a/localedata/locales/nds_DE
+++ b/localedata/locales/nds_DE
@@ -46,7 +46,7 @@  country_ab2   "<U0044><U0045>"
 country_ab3   "<U0044><U0045><U0055>"
 country_car   "<U0044>"
 country_num 276
-country_isbn "3"
+country_isbn "<U0033>"
 lang_name "<U004E><U0065><U0064><U0064><U0065><U0072><U0073><U0061><U0073><U0073><U0069><U0073><U0063><U0068>"
 %lang_ab
 lang_term "<U006E><U0064><U0073>"
diff --git a/localedata/locales/nds_NL b/localedata/locales/nds_NL
index 14051f6..c59d3e6 100644
--- a/localedata/locales/nds_NL
+++ b/localedata/locales/nds_NL
@@ -45,7 +45,7 @@  country_ab2 "<U004E><U004C>"
 country_ab3 "<U004E><U004C><U0044>"
 country_car "<U004E><U004C>"
 country_num 528
-country_isbn "3"
+country_isbn "<U0033>"
 lang_name "<U004E><U0065><U0064><U0064><U0065><U0072><U0073><U0061><U0073><U0073><U0069><U0073><U0063><U0068>"
 %lang_ab
 lang_term "<U006E><U0064><U0073>"
diff --git a/localedata/locales/oc_FR b/localedata/locales/oc_FR
index 10e3a03..5a9fca6 100644
--- a/localedata/locales/oc_FR
+++ b/localedata/locales/oc_FR
@@ -44,7 +44,7 @@  country_post "F"
 country_ab2 "FR"
 country_ab3 "FRA"
 country_num 250
-country_isbn "2"
+country_isbn "<U0032>"
 country_car "F"
 % Occitan
 lang_name    "<U004F><U0063><U0063><U0069><U0074><U0061><U006E>"
diff --git a/localedata/locales/pap_AN b/localedata/locales/pap_AN
index 63262a5..f3c5a96 100644
--- a/localedata/locales/pap_AN
+++ b/localedata/locales/pap_AN
@@ -49,7 +49,7 @@  postal_fmt "<U0025><U0064><U0025><U004E><U0025><U0066><U0025><U004E><U0025><U006
 country_ab2 "<U0041><U004E>"
 country_ab3 "<U0041><U004E><U0054>"
 country_num 530
-country_isbn "99904"
+country_isbn "<U0039><U0039><U0039><U0030><U0034>"
 country_car "<U004E><U0041>"
 % lang_ab
 lang_term "<U0070><U0061><U0070>"
diff --git a/localedata/locales/ro_RO b/localedata/locales/ro_RO
index 610f071..ab41ab7 100644
--- a/localedata/locales/ro_RO
+++ b/localedata/locales/ro_RO
@@ -377,7 +377,7 @@  country_car "<U0052><U004F>"
 % ISBN code is 973
 % see: http://homepages.cwi.nl/~dik/english/codes/isbn.html
 % and other sources
-country_isbn 973
+country_isbn "<U0039><U0037><U0033>"
 % FIXME: is it really RO?
 country_post "<U0052><U004F>"
 % language names are not capitalized in Romanian ( roma>na( )
diff --git a/localedata/locales/sq_MK b/localedata/locales/sq_MK
index 9d6aef7..9d3957e 100644
--- a/localedata/locales/sq_MK
+++ b/localedata/locales/sq_MK
@@ -100,7 +100,7 @@  country_ab2 "<U004d><U004b>"
 country_ab3 "<U004d><U004b><U0044>"
 country_car "<U004d><U004b>"
 country_num 807
-country_isbn "9989"
+country_isbn "<U0039><U0039><U0038><U0039>"
 % shqip
 lang_name    "<U0073><U0068><U0071><U0069><U0070>"
 % sq
diff --git a/localedata/locales/sv_FI b/localedata/locales/sv_FI
index fca2935..007828e 100644
--- a/localedata/locales/sv_FI
+++ b/localedata/locales/sv_FI
@@ -143,7 +143,7 @@  country_num 246
 country_name "<U0046><U0069><U006E><U006C><U0061><U006E><U0064>"
 country_post "<U0046><U0049>"
 country_car  "<U0046><U0049><U004E>"
-country_isbn 952
+country_isbn "<U0039><U0035><U0032>"
 % svenska
 lang_name    "<U0073><U0076><U0065><U006E><U0073><U006B><U0061>"
 lang_ab      "<U0073><U0076>"
diff --git a/localedata/locales/tr_CY b/localedata/locales/tr_CY
index e2e6936..8665dfa 100644
--- a/localedata/locales/tr_CY
+++ b/localedata/locales/tr_CY
@@ -98,7 +98,7 @@  country_name	"<U004E><U006F><U0072><U0074><U0068><U0065><U0072><U006E>/
 country_post	"<U0054><U0052>"
 % TR
 country_car	"<U0054><U0052>"
-country_isbn	975
+country_isbn	"<U0039><U0037><U0035>"
 country_num	792
 % TR
 country_ab2	"<U0054><U0052>"
diff --git a/localedata/locales/tr_TR b/localedata/locales/tr_TR
index f54be2c..82c8699 100644
--- a/localedata/locales/tr_TR
+++ b/localedata/locales/tr_TR
@@ -3587,7 +3587,7 @@  country_name	"<U0054><U0075><U0072><U006B><U0065><U0079>"
 country_post	"<U0054><U0052>"
 % TR
 country_car	"<U0054><U0052>"
-country_isbn	975
+country_isbn	"<U0039><U0037><U0035>"
 country_num	792
 % TR
 country_ab2	"<U0054><U0052>"
diff --git a/localedata/locales/uk_UA b/localedata/locales/uk_UA
index 511f004..a910ec6 100644
--- a/localedata/locales/uk_UA
+++ b/localedata/locales/uk_UA
@@ -1246,7 +1246,7 @@  country_num   804
 country_car   "<U0055><U0041>"
 
 % ISBN code, for books.
-country_isbn  966
+country_isbn  "<U0039><U0036><U0036>"
 
 % Two-letter abbreviation of the language, see ISO 639.
 lang_ab       "<U0075><U006B>"
diff --git a/localedata/locales/unm_US b/localedata/locales/unm_US
index 482a7da..3467b8c 100644
--- a/localedata/locales/unm_US
+++ b/localedata/locales/unm_US
@@ -150,7 +150,7 @@  country_ab3   "<U0055><U0053><U0041>"
 country_num   840
 % USA
 country_car   "<U0055><U0053><U0041>"
-country_isbn  0
+country_isbn  "<U0030>"
 % lang_name     ""
 % lang_ab       ""
 % unm
diff --git a/localedata/locales/wa_BE b/localedata/locales/wa_BE
index a2fb3be..21979c5 100644
--- a/localedata/locales/wa_BE
+++ b/localedata/locales/wa_BE
@@ -42,7 +42,7 @@  country_post "B"
 country_ab2 "BE"
 country_ab3 "BEL"
 country_num 56
-country_isbn "2"
+country_isbn "<U0032>"
 % B
 country_car  "<U0042>"
 lang_name "<U0057><U0061><U006C><U006F><U006E>"
diff --git a/localedata/locales/wae_CH b/localedata/locales/wae_CH
index 5f11613..264aa63 100644
--- a/localedata/locales/wae_CH
+++ b/localedata/locales/wae_CH
@@ -236,6 +236,6 @@  postal_fmt    "<U0025><U0066><U0025><U004E><U0025><U0061><U0025><U004E>/
 country_ab2   "<U0043><U0048>"
 country_ab3   "<U0043><U0048><U0045>"
 country_num   756
-country_isbn  3
+country_isbn  "<U0033>"
 
 END LC_ADDRESS
diff --git a/localedata/locales/yi_US b/localedata/locales/yi_US
index 97ed218..7c2259b 100644
--- a/localedata/locales/yi_US
+++ b/localedata/locales/yi_US
@@ -50,7 +50,7 @@  country_num 840
 % USA
 country_car   "<U0055><U0053><U0041>"
 % FIXME Check which isbn for Yiddish in USA
-country_isbn "0"
+country_isbn "<U0030>"
 lang_name "<U05D9><U05D9><U05B4><U05D3><U05D9><U05E9>"
 % yi
 lang_ab      "<U0079><U0069>"
diff --git a/localedata/locales/yo_NG b/localedata/locales/yo_NG
index c88ca6e..37a948e 100644
--- a/localedata/locales/yo_NG
+++ b/localedata/locales/yo_NG
@@ -491,7 +491,7 @@  country_car   "<U0057><U0041><U004E>"
 
 % ISO 2108
 % http://www.isbn-international.org/
-country_isbn  978
+country_isbn  "<U0039><U0037><U0038>"
 
 % ISO 639 language abbreviations:
 % 639-1 2 letter, 639-2 3 letter terminology