[PATCHv2] Update the localedata/locales/translit_* files to Unicode 7.0.0
Commit Message
This is an update to my earlier patches:
https://sourceware.org/ml/libc-alpha/2015-04/msg00361.html
Updates:
- Added transliteration rules to the da, nb, nn, and sv locales so
that, for example, "ö" is transliterated to "oe" in these locales.
The "neutral" transliteration should remain "ö" to "o". (For
example, in English, "coöperation" as used in
http://www.newyorker.com/humor/borowitz-report/obama-putin-agree-never-to-speak-to-each-other-again
should be transliterated to "cooperation", not "cooeperation".)
This should fix [BZ #89].
- Many additions to translit_neutral.
- Some more tweaks to the scripts that generate the translit files
from Unicode data.
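To illustrate the first point, here is a minimal Python sketch of the intended precedence: a locale-specific rule wins over the default rule for the same character. This is a toy model, not how glibc implements transliteration; the two small tables and the function name are made up for the example.

```python
# Toy model of rule precedence; this is NOT glibc's implementation.
# Both tables are tiny hypothetical excerpts chosen for the example.
DEFAULT_RULES = {"ö": "o", "Ö": "O", "ä": "a", "Ä": "A"}
SV_RULES = {"ö": "oe", "Ö": "OE", "ä": "ae", "Ä": "AE", "å": "aa", "Å": "AA"}


def transliterate(text, locale_rules=None):
    """Replace each non-ASCII character, preferring locale-specific rules."""
    locale_rules = locale_rules or {}
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in locale_rules:
            out.append(locale_rules[ch])
        else:
            # Fall back to the default table; "?" marks an unmapped character.
            out.append(DEFAULT_RULES.get(ch, "?"))
    return "".join(out)


print(transliterate("coöperation"))      # cooperation
print(transliterate("Malmö", SV_RULES))  # Malmoe
```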
I tested the patches on Fedora 22.
Can somebody review this please?
----------------------------------------------------------------------
The attached file updates these translit files to Unicode 7.0.0:
locales/translit_circle
locales/translit_cjk_compat
locales/translit_combining
locales/translit_compat
locales/translit_font
locales/translit_fraction
It now also contains many manual updates to
locales/translit_neutral,
many of them taken from
http://unicode.org/cldr/trac/browser/trunk/common/transforms/Latin-ASCII.xml
It does *not* update these translit files:
locales/translit_cjk_variants
locales/translit_hangul
locales/translit_narrow
locales/translit_small
locales/translit_wide
translit_cjk_variants is skipped because it is apparently not generated
from Unicode data.
The other files, translit_hangul, translit_narrow, translit_small, and
translit_wide, are generated, but they would not change with Unicode
7.0.0 data: nothing seems to have changed in Unicode that affects these
files. I could add scripts to generate these as well, but they would
just reproduce the current files. Maybe I should do that nevertheless,
just to be able to see whether something changes in the future (quite
unlikely, I think).
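Such a "does regenerating change anything?" check could look roughly like the sketch below. It only compares two strings with difflib; the function name and the sample rule lines are assumptions for illustration, not part of the patch.

```python
import difflib


def regeneration_diff(current_text, regenerated_text, name="translit_file"):
    """Return unified-diff lines; an empty list means nothing changed."""
    return list(
        difflib.unified_diff(
            current_text.splitlines(),
            regenerated_text.splitlines(),
            fromfile=name + " (current)",
            tofile=name + " (regenerated)",
            lineterm="",
        )
    )


# Illustrative content only, not copied from the real translit_wide.
current = "<UFF21> <U0041>\n"
regenerated = "<UFF21> <U0041>\n<UFF22> <U0042>\n"

for line in regeneration_diff(current, regenerated, "translit_wide"):
    print(line)
```

With identical inputs the function returns an empty list, which is the expected result for these files with Unicode 7.0.0 data.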
Some code was duplicated in utf8_gen.py and utf8_compatibility.py;
Alexandre Oliva had already suggested splitting it out into a separate
file. As the new generator scripts added by this patch needed this code
again, I saw that Alexandre was right and put the reusable code
into a separate file, unicode_utils.py.
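As a rough idea of what such shared helpers might contain, here is a small sketch. The function names and the returned dictionary layout are assumptions for this example and may not match the real unicode_utils.py; the field positions follow the UnicodeData.txt format.

```python
# Hypothetical stand-in for shared helpers; names are illustrative only.
def ucs_symbol(code_point):
    """Format a code point the way locale sources write it, e.g. <U00C5>."""
    if code_point < 0x10000:
        return "<U{:04X}>".format(code_point)
    return "<U{:08X}>".format(code_point)


def parse_unicode_data_line(line):
    """Split one UnicodeData.txt record into the fields used here."""
    fields = line.strip().split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
        "decomposition": fields[5],
    }


record = parse_unicode_data_line(
    "00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;"
    "LATIN CAPITAL LETTER A RING;;;00E5;"
)
print(ucs_symbol(record["code_point"]), record["decomposition"])
```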
Not everything in the generated translit_* files could be reproduced
exactly from Unicode data; there were some manual additions in the
files. These were not mentioned in the comments at the top of the
files: the "grep" and "sed" expressions given there reproduce most of
the contents of these files but not everything.
Where the manual additions seemed to make sense, I added manual
hacks to the new generator scripts gen_translit_*.py to reproduce
these manual additions as well.
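The general shape of "generate from Unicode data, then layer manual additions on top" can be sketched as follows. This is not the code of the gen_translit_*.py scripts; the function names, the sample records, and the single manual entry are hypothetical.

```python
def ucs_symbol(code_point):
    """Format a BMP code point as a locale-source symbol such as <U00BD>."""
    return "<U{:04X}>".format(code_point)


def rules_from_unicode_data(lines, tag):
    """Collect rules from records whose decomposition starts with `tag`."""
    rules = {}
    for line in lines:
        fields = line.split(";")
        decomposition = fields[5].split()
        if not decomposition or decomposition[0] != tag:
            continue
        source = ucs_symbol(int(fields[0], 16))
        rules[source] = [ucs_symbol(int(cp, 16)) for cp in decomposition[1:]]
    return rules


def format_rule(source, targets):
    """Quote the replacement only when it has more than one symbol."""
    body = "".join(targets)
    if len(targets) > 1:
        body = '"' + body + '"'
    return source + " " + body


# Hypothetical hand-maintained entry that is not derived from the data.
MANUAL_ADDITIONS = {"<U2044>": ["<U002F>"]}

SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;"
    "FRACTION ONE HALF;;;;",
]

generated = rules_from_unicode_data(SAMPLE, "<fraction>")
merged = {**generated, **MANUAL_ADDITIONS}  # manual entries win on conflict
for source in sorted(merged):
    print(format_rule(source, merged[source]))
```

Keeping the manual entries in one clearly named table makes it visible which rules do not come from Unicode data.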
Comments
Hi Mike,
I reviewed the resulting transliteration and special decompose rules and
in general everything looks very good; a few minor comments below.
On 2015-06-15 19:04, Mike FABIAN wrote:
>
> Subject: [PATCH 1/4] Remove duplicate transliterations for U+0152 and U+0153
> from C-translit.h.in
this looks like an obvious fix.
> Subject: [PATCH 2/4] Addition and fixes for translit_neutral
>
> +% LATIN CAPITAL LETTER ENG
> +<U014A> <U004E>
> +% LATIN SMALL LETTER ENG
> +<U014B> <U006E>
Hmm, I presume NG/ng would be more expected than N/n here, but reading
https://en.wikipedia.org/wiki/Eng_%28letter%29 doesn't give a clear
answer either way, what do you think?
> +% EURO-CURRENCY SIGN
> +% CRUZEIRO SIGN
> +% FRENCH FRANC SIGN
> +% LIRA SIGN
> +% PESETA SIGN
> % DONG SIGN
> +% INDIAN RUPEE SIGN
> +% TURKISH LIRA SIGN
While at it, should we perhaps also add pound, ruble, drachma, won, and
hryvnia signs here?
> Subject: [PATCH 3/4] Update the translit files to Unicode 7.0.0
The generated files included in this patch look good.
> Subject: [PATCH 4/4] Add transliteration rules for da, nb, nn, and sv locales.
AFAICS these also look good.
Thanks,
Hi,
actually, one more note: after these patches some rules are
now duplicated; see below for a few examples. Is there some particular
reason for this, or could those duplicates be avoided?
localhost:~> grep '^<U00C6>' translit*
translit_combining:<U00C6> "<U0041><U0045>"
translit_neutral:<U00C6> "<U0041><U0045>"
localhost:~> grep '^<U00D8>' translit*
translit_combining:<U00D8> <U004F>
translit_neutral:<U00D8> <U004F>
localhost:~>
Thanks,
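A check for such duplicates across files could be scripted along these lines. This is only a sketch: the function names are made up, the parser handles just simple one-line rules that start with a <U....> symbol, and the sample contents are reduced to the lines from the grep output above plus one unique line per file.

```python
from collections import defaultdict


def parse_rules(text):
    """Map each source symbol at the start of a line to its replacement."""
    rules = {}
    for line in text.splitlines():
        if line.startswith("<U"):
            source, _, replacement = line.partition(" ")
            rules[source] = replacement.strip()
    return rules


def find_duplicates(files):
    """Return {source: [file names]} for sources defined in several files."""
    seen = defaultdict(list)
    for name, text in files.items():
        for source in parse_rules(text):
            seen[source].append(name)
    return {src: names for src, names in seen.items() if len(names) > 1}


files = {
    "translit_combining": '<U00C6> "<U0041><U0045>"\n<U00D8> <U004F>\n<U00C0> <U0041>\n',
    "translit_neutral": '<U00C6> "<U0041><U0045>"\n<U00D8> <U004F>\n<U014A> <U004E>\n',
}
for source, names in sorted(find_duplicates(files).items()):
    print(source, ", ".join(names))
```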
Hi,
On 2015-06-16 17:24, Mike FABIAN wrote:
> Marko Myllynen <myllynen@redhat.com> wrote:
>
>>> Subject: [PATCH 2/4] Addition and fixes for translit_neutral
>>>
>>> +% LATIN CAPITAL LETTER ENG
>>> +<U014A> <U004E>
>>> +% LATIN SMALL LETTER ENG
>>> +<U014B> <U006E>
>>
>> Hmm, I presume NG/ng would be more expected than N/n here, but reading
>> https://en.wikipedia.org/wiki/Eng_%28letter%29 doesn't give a clear
>> answer either way, what do you think?
>
> http://unicode.org/cldr/trac/browser/trunk/common/transforms/Latin-ASCII.xml#L54
>
> has:
>
> 54 <tRule>Ŋ → N ; # 014A;LATIN CAPITAL LETTER ENG</tRule>
> 55 <tRule>ŋ → n ; # 014B;LATIN SMALL LETTER ENG</tRule>
>
> "ng" might be phonetically closer but the main spirit of the "neutral"
> transliteration to ASCII seems to be something like "drop the accents",
> not "approximate the pronunciation using ASCII".
I see, looks ok then.
Thanks,
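The "drop the accents" idea can be approximated in a few lines of Python, which also shows why ENG needs an explicit rule: it has no canonical decomposition, so removing combining marks alone leaves it unchanged. This is a sketch for illustration, not how translit_neutral is produced; the EXPLICIT table only mirrors the two CLDR lines quoted above.

```python
import unicodedata

# Letters without a canonical decomposition need explicit rules; these two
# follow the CLDR Latin-ASCII lines quoted in the thread.
EXPLICIT = {"\u014A": "N", "\u014B": "n"}


def drop_accents(text):
    """Strip combining marks after NFD; use explicit rules where needed."""
    out = []
    for ch in text:
        if ch in EXPLICIT:
            out.append(EXPLICIT[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(drop_accents("coöperation"))   # cooperation
print(drop_accents("\u014A\u014B"))  # Nn
```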
From ef2a1022224d32989891f7a12f2170a1b3a7e7f9 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Wed, 20 May 2015 11:16:30 +0200
Subject: [PATCH 4/4] Add transliteration rules for da, nb, nn, and sv locales.
for localedata/ChangeLog
[BZ #89]
* locales/da_DK: Add more transliteration rules.
* locales/nb_NO: Add transliteration rules.
* locales/sv_SE: Add transliteration rules.
---
localedata/locales/da_DK | 21 ++++++++++++++++++---
localedata/locales/nb_NO | 22 ++++++++++++++++++++++
localedata/locales/sv_SE | 22 ++++++++++++++++++++++
3 files changed, 62 insertions(+), 3 deletions(-)
--- a/localedata/locales/da_DK
+++ b/localedata/locales/da_DK
@@ -137,11 +137,26 @@ translit_start
include "translit_combining";""
-% Danish.
-% LATIN CAPITAL LETTER A WITH RING ABOVE.
+% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
+<U00C4> "<U0041><U0308>";"<U0041><U0045>"
+% LATIN CAPITAL LETTER A WITH RING ABOVE -> "AA"
<U00C5> "<U0041><U030A>";"<U0041><U0041>"
-% LATIN SMALL LETTER A WITH RING ABOVE.
+% LATIN CAPITAL LETTER AE -> "AE"
+<U00C6> "<U0041><U0045>"
+% LATIN CAPITAL LETTER O WITH DIAERESIS -> "OE"
+<U00D6> "<U004F><U0308>";"<U004F><U0045>"
+% LATIN CAPITAL LETTER O WITH STROKE -> "OE"
+<U00D8> "<U004F><U0338>";"<U004F><U0045>"
+% LATIN SMALL LETTER A WITH DIAERESIS -> "ae"
+<U00E4> "<U0061><U0308>";"<U0061><U0065>"
+% LATIN SMALL LETTER A WITH RING ABOVE -> "aa"
<U00E5> "<U0061><U030A>";"<U0061><U0061>"
+% LATIN SMALL LETTER AE -> "ae"
+<U00E6> "<U0061><U0065>"
+% LATIN SMALL LETTER O WITH DIAERESIS -> "oe"
+<U00F6> "<U006F><U0308>";"<U006F><U0065>"
+% LATIN SMALL LETTER O WITH STROKE -> "oe"
+<U00F8> "<U006F><U0338>";"<U006F><U0065>"
translit_end
--- a/localedata/locales/nb_NO
+++ b/localedata/locales/nb_NO
@@ -127,6 +127,28 @@ copy "i18n"
translit_start
include "translit_combining";""
+
+% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
+<U00C4> "<U0041><U0308>";"<U0041><U0045>"
+% LATIN CAPITAL LETTER A WITH RING ABOVE -> "AA"
+<U00C5> "<U0041><U030A>";"<U0041><U0041>"
+% LATIN CAPITAL LETTER AE -> "AE"
+<U00C6> "<U0041><U0045>"
+% LATIN CAPITAL LETTER O WITH DIAERESIS -> "OE"
+<U00D6> "<U004F><U0308>";"<U004F><U0045>"
+% LATIN CAPITAL LETTER O WITH STROKE -> "OE"
+<U00D8> "<U004F><U0338>";"<U004F><U0045>"
+% LATIN SMALL LETTER A WITH DIAERESIS -> "ae"
+<U00E4> "<U0061><U0308>";"<U0061><U0065>"
+% LATIN SMALL LETTER A WITH RING ABOVE -> "aa"
+<U00E5> "<U0061><U030A>";"<U0061><U0061>"
+% LATIN SMALL LETTER AE -> "ae"
+<U00E6> "<U0061><U0065>"
+% LATIN SMALL LETTER O WITH DIAERESIS -> "oe"
+<U00F6> "<U006F><U0308>";"<U006F><U0065>"
+% LATIN SMALL LETTER O WITH STROKE -> "oe"
+<U00F8> "<U006F><U0338>";"<U006F><U0065>"
+
translit_end
END LC_CTYPE
--- a/localedata/locales/sv_SE
+++ b/localedata/locales/sv_SE
@@ -112,6 +112,28 @@ copy "i18n"
translit_start
include "translit_combining";""
+
+% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
+<U00C4> "<U0041><U0308>";"<U0041><U0045>"
+% LATIN CAPITAL LETTER A WITH RING ABOVE -> "AA"
+<U00C5> "<U0041><U030A>";"<U0041><U0041>"
+% LATIN CAPITAL LETTER AE -> "AE"
+<U00C6> "<U0041><U0045>"
+% LATIN CAPITAL LETTER O WITH DIAERESIS -> "OE"
+<U00D6> "<U004F><U0308>";"<U004F><U0045>"
+% LATIN CAPITAL LETTER O WITH STROKE -> "OE"
+<U00D8> "<U004F><U0338>";"<U004F><U0045>"
+% LATIN SMALL LETTER A WITH DIAERESIS -> "ae"
+<U00E4> "<U0061><U0308>";"<U0061><U0065>"
+% LATIN SMALL LETTER A WITH RING ABOVE -> "aa"
+<U00E5> "<U0061><U030A>";"<U0061><U0061>"
+% LATIN SMALL LETTER AE -> "ae"
+<U00E6> "<U0061><U0065>"
+% LATIN SMALL LETTER O WITH DIAERESIS -> "oe"
+<U00F6> "<U006F><U0308>";"<U006F><U0065>"
+% LATIN SMALL LETTER O WITH STROKE -> "oe"
+<U00F8> "<U006F><U0338>";"<U006F><U0065>"
+
translit_end
END LC_CTYPE
--
2.4.2
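For readers unfamiliar with the rule syntax in the patch above: each rule lists one or more replacements separated by ";", in order of preference. As far as I understand it, the first replacement that can be represented in the target character set is used. The Python sketch below models that reading; the function names are made up, and Python codecs stand in for the target charset, so treat it as an approximation rather than a description of the glibc code.

```python
import re

SYMBOL = re.compile(r"<U([0-9A-F]{4,8})>")


def decode(symbols):
    """Turn a symbol sequence such as '<U0041><U0045>' into 'AE'."""
    return "".join(chr(int(h, 16)) for h in SYMBOL.findall(symbols))


def parse_rule(line):
    """Split a rule line into its source character and ordered alternatives."""
    source, _, rest = line.partition(" ")
    alternatives = [decode(part) for part in rest.strip().split(";")]
    return decode(source), alternatives


def pick(alternatives, charset):
    """Return the first alternative that the target charset can encode."""
    for candidate in alternatives:
        try:
            candidate.encode(charset)
            return candidate
        except UnicodeEncodeError:
            continue
    return None


source, alternatives = parse_rule('<U00C4> "<U0041><U0308>";"<U0041><U0045>"')
print(pick(alternatives, "ascii"))  # AE
```

With a target that can represent the combining diaeresis (for example "utf-8"), the first alternative, "A" followed by U+0308, is chosen instead.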