[v9] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
Commit Message
Changelog v9:
* Fixed formatting (trailing spaces etc.)
* Put commit summary in the patch file, now it is generated completely
by git format-patch
Changelog v8:
* Re-added missing translit_cyrillic in patch v7 (due to missing "git
add" in the script).
Changelog v7:
* Generated against git://sourceware.org/git/glibc.git master with git
format-patch.
* The 'include "translit_cyrillic";""' now immediately follows last
'include "translit_XXX";""' string (was inserted just before
translit_end previously.)
* Only the locales already having 'include .*translit.*;""' are patched
(see the list for manual exclusions below, full list of included locales
at the end of the email in the commit section.)
* Excluded az_AZ completely to avoid circular reference from tr_TR via
“copy "tr_TR"”.
Changelog v6:
* Locales removed from the patch: C and sd_PK.
* Added locales: az_AZ and ky_KG.
* Consistently transliterate single uppercase Cyrillic letters
to sequences of all uppercase Latin letters in all languages (whenever
a Cyrillic letter is transliterated to more than one Latin letter),
for example "Ї" is now transliterated as "YI" rather than "Yi".
Dear locale maintainers,
to fix glibc bug 2872 "Transliteration Cyrillic -> ASCII fails"
https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]
please add the Cyrillic transliteration table file translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?id=11340 [7]
to localedata/locales/ and include it in all your locales going forward.
The patch is included inline below.
From this patch I have excluded locales that already mention cyrillic or
have a transliteration table for it:
mn_MN
sr_RS
tg_TJ
tk_TM
tt_RU
uk_UA
uz_UZ
uz_UZ@cyrillic
Their maintainers are requested to make an explicit decision on how and
whether at all to include this patch.
Current bug effect:
The glibc wiki explicitly lists this use case as the test example
https://sourceware.org/glibc/wiki/Locales#Testing_Locales :
LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt
Currently it fails on Cyrillic texts in most locales including ru_RU [1]
[8] [9]:
LC_ALL=ru_RU.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt | grep CYRILLIC
CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.
It produces a string of question marks and spaces.
This is what it should produce, and does produce once the patch is applied:
CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu.
The root problem and the fix:
The root problem is the missing transliteration table that I am
supplying here. Furthermore, the table has to be included into the
active locale when the locale is compiled so that iconv can use it.
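A minimal sketch (not part of the patch) of the effect described above, written in Python. It only mimics the observable behaviour of ASCII//TRANSLIT, not the glibc implementation; the function name and the three-entry sample table are assumptions made for illustration.

```python
def to_ascii_translit(text, table):
    """Mimic the visible effect of ASCII//TRANSLIT with a lookup table."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)         # already representable in ASCII
        elif ch in table:
            out.append(table[ch])  # a transliteration rule exists
        else:
            out.append("?")        # no rule: the character is lost
    return "".join(out)


# Hypothetical excerpt; the real table covers the whole alphabet.
SAMPLE_TABLE = {"ч": "ch", "а": "a", "ю": "yu"}

print(to_ascii_translit("чаю.", {}))            # no table available
print(to_ascii_translit("чаю.", SAMPLE_TABLE))  # table available
```

Without a table every Cyrillic letter degrades to a question mark; with the table the last word of the test sentence comes out as in the expected output above.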
COMMIT MESSAGE:
This translit_cyrillic table enables conversion (e.g. with iconv) of
UTF-8 encoded text written in a Cyrillic alphabet to ASCII//TRANSLIT text.
Examples: iconv -f UTF-8 -t ASCII//TRANSLIT will produce an ASCII
compatible transcription and iconv -f UTF-8 -t ISO-8859-15//TRANSLIT |
iconv -f ISO-8859-15 -t UTF-8 will produce a Latin transliteration as per
ISO 9.1995.
While UTF-8 encoded Cyrillic text requires Cyrillic fonts, the result of
a transliteration/transcription has only Latin/ASCII codes but can still
be read by a native speaker. Among other things it is useful for
processing Cyrillic texts and filenames by programs or on systems
that are not specifically prepared to work with Cyrillic, don't have
the corresponding fonts installed or can't handle UTF-8.
The transliteration table itself is attached as a file translit_cyrillic
[7]. Its content (mapping) is based on the ISO 9.1995 standard [10] and its
derivative GOST 7.79-2000, official source (Federal Agency on Technical
Regulating and Metrology of the Russian Federation [2]). Technically, an
independent but mostly identical source [3] was used, and the table was
prepared in a spreadsheet [6].
According to the documentation [5], a transliteration table is included
by adding the *include "translit_cyrillic";""* line to the translit_start
section of LC_CTYPE
http://man7.org/linux/man-pages/man5/locale.5.html [5]
In practice, I searched for all locales that already have an
'include .*translit.*;""' line and generated a patch for them.
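The selection and insertion rule described above (patch only locales that already have a translit include, and put the new line right after the last one) can be sketched in Python as follows. This is an illustration, not the script that generated the patch; the function name and the sample LC_CTYPE fragment are hypothetical.

```python
import re

# Pattern quoted in the message; NEW_INCLUDE is the line the patch adds.
TRANSLIT_INCLUDE = re.compile(r'include .*translit.*;""')
NEW_INCLUDE = 'include "translit_cyrillic";""'


def add_cyrillic_include(locale_text):
    """Insert NEW_INCLUDE right after the last existing translit include."""
    lines = locale_text.splitlines()
    hits = [i for i, line in enumerate(lines) if TRANSLIT_INCLUDE.search(line)]
    if not hits or any("translit_cyrillic" in lines[i] for i in hits):
        return locale_text  # no translit include at all, or already patched
    lines.insert(hits[-1] + 1, NEW_INCLUDE)
    trailing = "\n" if locale_text.endswith("\n") else ""
    return "\n".join(lines) + trailing


# Hypothetical LC_CTYPE fragment, for illustration only.
SAMPLE = (
    'LC_CTYPE\n'
    'copy "i18n"\n'
    'translit_start\n'
    'include "translit_combining";""\n'
    'translit_end\n'
    'END LC_CTYPE\n'
)

print(add_cyrillic_include(SAMPLE))
```

Locales without any translit include are returned unchanged, which matches the rule that only locales already using includes are patched.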
The Cyrillic transliteration of e.g. Russian text may have already
worked to some extent for mn_MN, sr_RS, tk_TM, uz_UZ, uk_UA locales that
have their transliteration tables included inline.
I am excluding these locales from this proposed patch. I have written
directly to locale maintainer emails listed in the files. Volodymyr
Lisivka <vlisivka@gmail.com>, Max Kutny <mkutny@gmail.com> (uk_UA),
Данило Шеган <danilo@gnome.org> (sr_RS) have confirmed the
exclusion.
Links:
[1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source
http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
available in low quality gif format)
[3] http://transliteration.ru/gost-7-79-2000/ and
http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with Latin alphabet
https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?id=11301
[7] translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?id=11340
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt
https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A
Best regards,
Egor Kobylkin
Comments
Thank you for working on this, Egor.
Before I start reviewing I would like to summarize the things which
I think are blocking for this patch.
1. I think we need tests for transliteration. Currently there is only
one test program which is similar to what we need,
localedata/bug-iconv-trans.c. It is old and it is not quite clear
what bug it is trying to test. Therefore I think we need a new
framework to test transliteration. Is it a good idea to base the
test on the iconv(1) command line utility which is part of glibc?
2. I ran a few tests on the command line and it seems to me that the
transliteration from "З" to "Z" (and the lowercase as well) in uk_UA does
not work, and has not been working for some time, because
I've checked some older systems as well and the result is always
the same. I think the reason is that uk_UA defines multiple
transliteration rules for "З" depending on the letter following
it. That does not seem to work. AFAIK the reason is that the syntax of
transliteration rules says that a single non-Latin character may map to
one or more Latin strings, each consisting of one or more characters.
There cannot be a rule transliterating multiple source characters into
one or multiple destination characters. Is it a bug in the transliteration
implementation? Or maybe in the specification, including the POSIX standard?
The definition of transliteration says that it is a one-to-one mapping
of graphemes, while a grapheme may be one or multiple characters.
It does not always have to be a one-to-one mapping of characters. Should we
fix this bug first, make uk_UA transliteration work, and only then
add a generic Cyrillic transliteration? Egor's patch already contains
a transliteration of "У" + combining acute accent to "Ú" which most
probably will not work.
I still think that in the longer term all existing custom transliterations
of Cyrillic alphabets should be ported to a modification of your patch.
Egor, while at it I was thinking about your idea to transliterate letters
like "Ш" (uppercase) to "SH" (always uppercase) in order to distinguish
between "Шема" (-> "SHema") and "Схема" (-> "Shema" or "Sxema"). Also
you include a rule to transliterate "Х" to "H" or "X" depending on which
destination characters are available, which, as I already told you, will
not work because both "H" and "X" are always available and therefore only
the first rule will always be used. I still don't like the idea of
putting two uppercase letters at the beginning of a titlecase word only to
indicate that there was originally a single letter. What if we:
* drop the rule of transliterating "Х" to "H" and transliterate always to
"X",
* transliterate uppercase "Ш" to "Sh" (so it will work fine for titlecase
words)?
As a result the Latin letter "h" will only appear as part of a digraph and
never as a transliteration of "Х" and therefore will never cause a conflict.
Examples:
* "Шема" -> "Shema",
* "Схема" -> "Sxema".
Will this solve the problem?
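For illustration, the three variants discussed above can be compared with a small Python sketch. The rule sets contain only the letters needed for the two example words; the helper name and the dictionary names are assumptions made for this example.

```python
def transliterate(text, table):
    """Replace each character by its table entry, if there is one."""
    return "".join(table.get(ch, ch) for ch in text)


# Only the letters needed for the two example words.
COMMON = {"С": "S", "е": "e", "м": "m", "а": "a"}

titlecase_h = {**COMMON, "Ш": "Sh", "х": "h"}  # "Sh" together with х -> "h"
uppercase_h = {**COMMON, "Ш": "SH", "х": "h"}  # always-uppercase digraph
titlecase_x = {**COMMON, "Ш": "Sh", "х": "x"}  # the proposal: "Sh" and х -> "x"

for name, table in [("Sh + h", titlecase_h),
                    ("SH + h", uppercase_h),
                    ("Sh + x", titlecase_x)]:
    print(name, transliterate("Шема", table), transliterate("Схема", table))
```

Only the first variant maps both words to the same string; the other two keep them distinct, either through the all-uppercase digraph or by never producing a bare "h".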
Regards,
Rafal
Hi Rafal,
thanks for putting the SH/Sh problem into a clear issue statement. I'm
totally with you on this being a good thing to discuss. It is orthogonal
to the tests so let me focus on the SH/Sh and System A/B problems here.
Looks like we have three issues:
1. lack of explicit control which transformation to use (System A or
System B) via //TRANSLIT
2. possibility of collision for System B if CAP/low transcription is used
for capital letters
3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
System B because its equivalent 'X'/'x' from System A is always present
and takes precedence.
As a solution, shouldn't we keep only System B in a new file
transcribe_cyrillic and put it in place as the explicit ASCII
transcription for targeted locales (as opposed to transliteration)?
We would keep System A as translit_cyrillic but won't include it in
this patch. Once you have resolved the issue of having two conflicting
rule sets but only one key, //TRANSLIT, you could add System A back.
The SH/Sh question can be decided either way; it seems like an easy change
anyway.
Please see more discussion on your excellent points below:
On 16.11.18 23:17, Rafal Luzynski wrote:
> Egor, while at this I was thinking about your idea to transliterate
> letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
> distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
> "Sxema").
To clarify, this SH/Sh collision issue relates only to iconv -f UTF-8 -t
ASCII//TRANSLIT (i.e. System B transcription).
But it's not only SH/Sh; the following combinations are used to
transcribe capital letters:
YO, DJ, YE, TSH, DH, ZH, CZ, CH, SH, SHH, YU, YA, FH, YH, GH, NG, TCZ
Arguably any of them (if not in that CAP/CAP form) could collide with
their CAP/low equivalent from a different word. (There may be language
grammar rules that in fact prevent some collisions, but we don't know for
sure.)
With transcription we are basically stripping information from the data,
mapping it into a smaller character set. The idea of keeping them in
CAP/CAP is to try to preserve as much information as possible.
> Also you include a rule to transliterate "Х" to "H" or "X" depending
> on which destination characters are available, which I told you
> already that will not work because both "H" and "X" are always
> available and therefore only the first rule will always be used.
Just to have this here for reference, the idea was to have both rules in
one file so
iconv -f UTF-8 -t ASCII//TRANSLIT
will produce ASCII compatible _transcription_ (System B)
iconv -f UTF-8 -t ISO-8859-15//TRANSLIT |
iconv -f ISO-8859-15 -t UTF-8
will produce Latin _transliteration_ as per ISO 9.1995. (System A)
So in fact we have two rules for each letter in the same file (System A
and System B), where System A takes precedence.
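As a sketch of the selection behaviour described in this thread (the first candidate that the target character set can represent is used), here is a small Python model. It is not the glibc implementation; the function name, the rule dictionary and the order of candidates are assumptions chosen to mirror the examples discussed above.

```python
def pick(candidates, charset):
    """Return the first candidate that can be encoded in the target charset."""
    for cand in candidates:
        try:
            cand.encode(charset)
            return cand
        except UnicodeEncodeError:
            continue
    return "?"  # no candidate fits the target charset


# Illustrative candidate lists: a non-ASCII form first, an ASCII form second.
RULES = {
    "Ш": ["Š", "SH"],  # first candidate is not ASCII
    "Х": ["H", "X"],   # both candidates are ASCII
}

for target in ("iso8859_15", "ascii"):
    print(target, pick(RULES["Ш"], target), pick(RULES["Х"], target))
```

For "Ш" the target charset decides which form is used, but for "Х" the second candidate can never be reached because the first one fits every charset. This is the reason a single file cannot express both systems for letters whose candidates are all plain ASCII.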
I have a question then: isn't this more like a hack than a right thing
to do?
Shouldn't we have two explicit rules for transcription and
transliteration not dependent on a destination character set?
> I still don't like the idea to
> put two uppercase letters in a beginning of a word in titlecase only
> to indicate that there was originally a single letter. What if we:
>
> * drop the rule of transliterating "Х" to "H" and transliterate
> always to "X",
This would contradict ISO 9.1995. (System A).
System A was added on Marko's request (so setting him on TO:) I am
neutral on keeping it or dropping it, just to be clear.
> * transliterate uppercase "Ш" to "Sh" (so it will work fine for
> titlecase words)?
>
> As a result the Latin letter "h" will only appear as part of a
> digraph and never as a transliteration of "Х" and therefore will
> never cause a conflict. Examples:
>
> * "Шема" -> "Shema", * "Схема" -> "Sxema".
>
> Will this solve the problem?
This particular rule with h/x would make sense on its own.
But again, it would contradict the standards.
On the other hand, for my personal needs I care less about standards than
about current functionality and the data loss caused by transcription
missing altogether due to BZ #2872.
Bests,
Egor
Hi,
On 17/11/2018 20.34, Egor Kobylkin wrote:
>
> Looks like we have three issues:
> 1. lack of explicit control which transformation to use (System A or
> System B) via //TRANSLIT
> 2. possibility of collision for System B if used CAP/low transcription
> for capital letters
> 3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
> System B because it's equivalent 'X'/'x' from System A is always present
> and takes precedence.
>
> As a solution shouldn't we only keep System B in a new file
> transcribe_cyrillic and put it in place as the explicit ASCII
> transcription for targeted locales (as opposed to transliteration)?
>
> We would keep System A as translit_cyrillic but won't include it into
> this patch. Once you have resolved an issue of having two conflicting
> rule-sets but only one key //TRANSLIT you could add the System A back.
>
> The SH/Sh can be decided on either way - seems like an easy change any way.
>
> I have a question then: isn't this more like a hack than a right thing
> to do?
>
> Shouldn't we have two explicit rules for transcription and
> transliteration not dependent on a destination character set?
>
> This would contradict ISO 9.1995. (System A).
> System A was added on Marko's request (so setting him on TO:) I am
> neutral on keeping it or dropping it, just to be clear.
>
> This particular rule with h/x would make sense it's own.
> But again - it would contradict the standards.
> On the other hand, for my personal needs I care less about standards but
> about current functionality and data loss because of missing
> transcription altogether due to the BZ #2872.
Given the number of questions above, I think the way forward is to try to
follow the relevant standards as closely as possible and also check what
other implementations (e.g., uconv(1)) do. For example, checking the
case mentioned earlier may or may not give some hints:
$ echo Шема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Šema
$ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Shema
$ uconv -V
uconv v2.1 ICU 50.1.2
Thanks,
On 19.11.18 08:13, Marko Myllynen wrote:
> Hi,
>
> On 17/11/2018 20.34, Egor Kobylkin wrote:
>>
>> Shouldn't we have two explicit rules for transcription and
>> transliteration not dependent on a destination character set?
>>
>> This would contradict ISO 9.1995. (System A).
>> System A was added on Marko's request (so setting him on TO:) I am
>> neutral on keeping it or dropping it, just to be clear.
>>
>> This particular rule with h/x would make sense it's own.
>> But again - it would contradict the standards.
>> On the other hand, for my personal needs I care less about standards but
>> about current functionality and data loss because of missing
>> transcription altogether due to the BZ #2872.
>
> Given the amount of questions above I think the way forward is to try
> follow the relevant standards as closely as possible and also check what
> the other implementations (i.e., uconv(1)) do. For example, checking the
> case earlier mentioned case may or may not give some hints:
>
> $ echo Шема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Šema
> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Shema
> $ uconv -V
> uconv v2.1 ICU 50.1.2
Marko,
Your example only covers _transliteration_ to Latin with diacritics
iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
| iconv -f ISO-8859-15 -t UTF-8
while BZ #2872 is about _transcription_ to ASCII
iconv -f UTF-8 -t ASCII//TRANSLIT
The glibc wiki explicitly lists this use case (ASCII) as the test
example https://sourceware.org/glibc/wiki/Locales#Testing_Locales
So again, you are asking for ISO 9.1995 System A, but the bug is
about ISO 9.1995 System B (GOST 7.79-2000).
Bests,
Egor
Hi,
On 19/11/2018 11.21, Egor Kobylkin wrote:
> On 19.11.18 08:13, Marko Myllynen wrote:
>> On 17/11/2018 20.34, Egor Kobylkin wrote:
>
> Your example only covers _tansliteration_ to Latin Diacritics
> iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
> | iconv -f ISO-8859-15 -t UTF-8
>
> while BZ #2872 is about _transcription_ to ASCII
> iconv -f UTF-8 -t ASCII//TRANSLIT
AFAICS v9 (unlike v10) supported both of the above cases.
> The glibc wiki explicitly lists this use case (ASCII) as the test
> example https://sourceware.org/glibc/wiki/Locales#Testing_Locales
I wrote that section and I certainly wasn't considering Cyrillic aspects
at that time (IIRC it was written even before Mike did the major update
for transliteration rules at the end of 2015). The context back then was
mostly about handling Latin letters like Å, Ä, Ö, Ø, etc.
> So again, you are asking to have ISO 9.1995. System A but the bug is
> about ISO 9.1995. System B (GOST 7.79-2000)
We certainly can decide here what's the best course of action, we do not
have to slavishly follow some old bug report when deciding the direction
for the implementation. But I think I've made my position clear by now
so I'm not going to repeat it anymore.
In any case once your patch lands I'm going to submit a follow-up patch
for fi_FI to make it compliant with the applicable national standard
(SFS 4900) which defines how to do Cyrillic transliteration /
transcription in the context of Finnish.
Thanks,
19.11.2018 08:13 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> Given the amount of questions above I think the way forward is to try
> follow the relevant standards as closely as possible and also check what
> the other implementations (i.e., uconv(1)) do. For example, checking the
> case earlier mentioned case may or may not give some hints:
>
> $ echo Шема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Šema
> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Shema
> $ uconv -V
> uconv v2.1 ICU 50.1.2
I've played a little with uconv and unfortunately it does not look good
to me.
It does not have any fallback transliteration to plain ASCII. When it says
that 'Ш' is transliterated to 'Š', it always uses 'Š', and if the target
charset does not have this character the conversion fails:
$ echo Шема | uconv -f UTF-8 -t ASCII -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo Шема | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic-latin
Conversion from Unicode to codepage failed at output byte position 0.
Unicode: 0160 Error: Invalid character found
$ echo Шема | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin
�ema
$ echo Шема | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin | uconv -f ISO-8859-2 -t UTF-8
Šema
It seems to follow ISO 9 (GOST 7.79) System A. However, the transliteration
of the hard sign is rather strange:
$ echo нъе | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
nʺe
The above was correct but:
$ echo НЪЕ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
Nʺ̱E
$ echo Ъ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
ʺ̱
$ echo Ъ | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
0000000 feff 02ba 0331 000a
0000008
So this generates:
02BA MODIFIER LETTER DOUBLE PRIME
0331 COMBINING MACRON BELOW
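The names above can be double-checked with Python's standard unicodedata module. A minimal sketch follows; the helper name is made up for this example and it only handles 16-bit words from the Basic Multilingual Plane, as printed by hexdump -x.

```python
import unicodedata


def describe(words):
    """Name each 16-bit word taken from a `hexdump -x` line (BMP only)."""
    return [(w.upper(), unicodedata.name(chr(int(w, 16)), "<no name>"))
            for w in words]


# The two words between the byte order mark and the trailing newline.
for word, name in describe(["02ba", "0331"]):
    print(word, name)
```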
There are more transliteration methods, for example Russian-Latin/BGN:
$ echo Шема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Shema
$ echo Схема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Skhema
Converting 'х' to 'kh' seems to be common in English transliteration but
it does not follow any ISO standard.
$ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
KHA kha
This means that the choice whether a digraph in the output should be
all uppercase or maybe upper+lower is context based, something which we
probably cannot implement. But definitely a good thing.
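To make "context based" concrete, here is a small Python sketch of one possible heuristic: a digraph for an uppercase letter is written all uppercase when the next character is uppercase, and in titlecase otherwise. This is an assumption for illustration only; it is not how uconv or iconv is implemented, and the rule tables contain just the letters used in the example.

```python
def apply_case(digraph, next_char):
    """Return e.g. 'KH' before an uppercase letter and 'Kh' otherwise."""
    if next_char and next_char.isupper():
        return digraph.upper()
    return digraph.capitalize()


def transliterate(text, upper_rules, other_rules):
    """Transliterate text, choosing the digraph case from the next character."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in upper_rules:
            out.append(apply_case(upper_rules[ch], nxt))
        else:
            out.append(other_rules.get(ch, ch))
    return "".join(out)


# Hypothetical rule excerpts covering only the example letters.
UPPER = {"Х": "kh"}
OTHER = {"х": "kh", "А": "A", "а": "a"}

print(transliterate("ХА ха", UPPER, OTHER))
print(transliterate("Ха", UPPER, OTHER))
```

The point of the sketch is that the result for "Х" depends on its neighbour, which a table of fixed per-character replacements cannot express.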
Two more tests:
$ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
Yeshchë
$ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
Conversion from Unicode to codepage failed at output byte position 6.
Unicode: 00eb Error: Invalid character found
So the output is not plain ASCII.
$ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
ye zhe le ne
Again this means that transliteration of 'е' is context based:
it is 'ye' in the beginning of a word and 'e' otherwise.
The version which I've tested:
$ uconv -V
uconv v2.1 ICU 60.2
It seems that uconv will not be a good hint about transliterating
to plain ASCII.
Also, the difference between uconv and iconv is that we can provide
multiple transliterations for any source character but we can't group
them into standards so we can't tell iconv to use this or another
system. It will just choose the best fitting the current output
character set and the only thing we can choose is the locale.
This makes me think: should we add a locale like ru_RU@SystemA or
ru_RU@SystemB?
Regards,
Rafal
On 01.12.18 23:07, Rafal Luzynski wrote:
>
> Also, the difference between uconv and iconv is that we can provide
> multiple transliterations for any source character but we can't group
> them into standards so we can't tell iconv to use this or another
> system. It will just choose the best fitting the current output
> character set and the only thing we can choose is the locale.
>
> This makes me think: should we add a locale like ru_RU@SystemA or
> ru_RU@SystemB?
Wouldn't that then require creating three versions of every locale that
includes the translit_cyrillic file? I.e. en_US, en_US@SystemA,
en_US@SystemB etc.?
This in turn will make two of them optional (as cyrillic fonts are at
the moment). The highest value is in having the default locale being
able to transliterate, isn't it? So putting the transliteration to
optional locales kind of defeats the purpose.
An example from my experience as a user - a networked device or host
would often have the en_US as the default (only?) locale with no viable
way to change it or install cyrillic fonts. Anyway, this is the most
dire situation where the ASCII transliteration certainly helps most.
Having en_US@SystemA or en_US@SystemB theoretically available but not
compiled by the distributor wouldn't help here, would it?
So the only useful scenario here would be to ship your locales with the
transliteration already included by default in en_US. This way the
distributor won't have to get active to include transliteration as
en_US@SystemA or en_US@SystemB.
From my (however limited) point of view it is better to get System
B in first, then see if some code needs to be changed to accommodate
the System A/System B problem. Again, System B is _transcription_ to
ASCII and System A is _transliteration_ to Latin, with different use cases.
It's insightful to see your comparison of the uconv vs. iconv!
Similar to your checks this is what I was using to see whether any
locale fails the transliteration for any cyrillic letter:
echo
"ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ
ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"|
LOCPATH=$workdir/compiled_locales/"$locale"/ LC_ALL="$locale".UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT
should give (can be asserted with bash string comparison):
AaOoUussYODJG`YeZ`IYiJL`N`TSHK`U`DhABVGDEZHZIJKLMNOPRSTUUFHCCHSHSHHA`Y`E`YUYAabvgdezhzijklmnoprstuufhcchshshh``y`e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FhfhYhyhE`e`
G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`SH`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'
And I am attaching another file that has the Unicode code points next to
the letters for easier identification of failures (like "U0401-Ё
U0402-Ђ U0403-Ѓ" etc.). I hope it will be helpful in creating the tests.
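A rough Python sketch of how such a labelled test line could be generated and how failures could be picked out of the converted output. The function names are hypothetical, and the sketch labels every code point separately, so a letter followed by a combining accent is listed as two entries rather than one.

```python
def label(text):
    """Prefix every character with its code point, e.g. 'U0401-Ё'."""
    return " ".join(f"U{ord(ch):04X}-{ch}" for ch in text)


def failed_labels(converted):
    """Return the labels whose transliteration came back as '?'."""
    failures = []
    for token in converted.split():
        code, sep, result = token.partition("-")
        if sep and "?" in result:
            failures.append(code)
    return failures


print(label("ЁЂЃ"))
# Made-up converted output with one failure, for illustration only.
print(failed_labels("U0401-YO U0402-? U0403-G`"))
```

The labelled line would be fed through iconv, and failed_labels applied to the result, so that a failing letter is reported by code point instead of having to be located inside a long string.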
Best regards,
Egor Kobylkin
CYRILLIC RUSSIAN Съешь ещё этих мягких французских булок, да выпей же чаю. СЪЕШЬ ЕЩЁ ЭТИХ МЯГКИХ ФРАНЦУЗСКИХ БУЛОК? ДА ВЫПЕЙ ЖЕ ЧАЮ!
CYRILLIC COMPLETE U0401-Ё U0402-Ђ U0403-Ѓ U0404-Є U0405-Ѕ U0406-І U0407-Ї U0408-Ј U0409-Љ U040A-Њ U040B-Ћ U040C-Ќ U040E-Ў U040F-Џ U0410-А U0411-Б U0412-В U0413-Г U0414-Д U0415-Е U0416-Ж U0417-З U0418-И U0419-Й U041A-К U041B-Л U041C-М U041D-Н U041E-О U041F-П U0420-Р U0421-С U0422-Т U0423-У U0423 0301-У́ U0424-Ф U0425-Х U0426-Ц U0427-Ч U0428-Ш U0429-Щ U042A-ъ U042B-Ы U042C-ь U042D-Э U042E-Ю U042F-Я U0430-а U0431-б U0432-в U0433-г U0434-д U0435-е U0436-ж U0437-з U0438-и U0439-й U043A-к U043B-л U043C-м U043D-н U043E-о U043F-п U0440-р U0441-с U0442-т U0443-у U0443 0301-у́ U0444-ф U0445-х U0446-ц U0447-ч U0448-ш U0449-щ U044A-Ъ U044B-ы U044C-Ь U044D-э U044E-ю U044F-я U0451-ё U0452-ђ U0453-ѓ U0454-є U0455-ѕ U0456-і U0457-ї U0458-ј U0459-љ U045A-њ U045B-ћ U045C-ќ U045E-ў U045F-џ U046A-Ѫ U046B-ѫ U0472-Ѳ U0473-ѳ U0474-Ѵ U0475-ѵ U048C-Ҍ U048D-ҍ U0490-Ґ U0491-ґ U0492-Ғ U0493-ғ U0494-Ҕ U0495-ҕ U0496-Җ U0497-җ U049A-Қ U049B-қ U049E-Ҟ U049F-ҟ U04A2-Ң U04A3-ң U04A4-Ҥ U04A5-ҥ U04A6-Ҧ U04A7-ҧ U04A8-Ҩ U04A9-ҩ U04AA-Ҫ U04AB-ҫ U04AC-Ҭ U04AD-ҭ U04AE-Ү U04AF-ү U04B2-Ҳ U04B3-ҳ U04B4-Ҵ U04B5-ҵ U04BA-Һ U04BB-һ U04BC-Ҽ U04BD-ҽ U04BE-Ҿ U04BF-ҿ U04C0-Ӏ U04C1-Ӂ U04C2-ӂ U04CB-Ӌ U04CC-ӌ U04D0-Ӑ U04D1-ӑ U04D2-Ӓ U04D3-ӓ U04D6-Ӗ U04D7-ӗ U04D8-Ә U04D9-ә U04DC-Ӝ U04DD-ӝ U04DE-Ӟ U04DF-ӟ U04E0-Ӡ U04E1-ӡ U04E4-Ӥ U04E5-ӥ U04E6-Ӧ U04E7-ӧ U04E8-Ө U04E9-ө U04F0-Ӱ U04F1-ӱ U04F2-Ӳ U04F3-ӳ U04F4-Ӵ U04F5-ӵ U04F8-Ӹ U04F9-ӹ U2019-’
GREEK Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής.
GERMAN Zwölf Boxkämpfer jagen Victor quer über den großen Sylter Deich.
FRENCH Dès Noël où un zéphyr haï me vêt de glaçons würmiens je dîne d’exquis rôtis de bœuf au kir à l’aÿ d’âge mûr \& cætera.
SPANISH El veloz murciélago hindú comía feliz cardillo y kiwi, la cigüeña tocaba el saxofón detrás del palenque de paja.
END
Rafal,
Just to touch base on this, what is the best way forward? Did you get
any input/feedback on your questions below? Are you expecting input from
anyone but myself?
On the blocking issue #2: I really don’t see the connection to the uk_UA
locale that has its transliteration table inline and is explicitly
excluded from my patch. It may be revealing another issue you have with
glibc but wouldn’t that be better addressed in a new bug?
Again, in the v10 of my patch I have removed multicharacter source
graphemes, so that issue is moot there.
If you’d like to overhaul the glibc translit system, wouldn’t it be
better to commit the simple text file with the Cyrillic
translit (transcription) table first, fix the bug from the year 2006 and
then proceed from there with all due diligence?
The same with having both System A and System B. Initially I went along
with the suggestion to include the system A but it is clear now that it
doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose
to set it aside for the moment and use the v10 without the system A.
That is the whole reason I have submitted it, to be superclear on that.
Now you saw that uconv transcribes «ХА» as KHA (cap/cap/cap); that
should mitigate your concern about that issue too (somewhat, anyway).
Making it context based would also be about adding new code, see above.
Let me know if there’s anything I can help with to make more progress
on the decision.
Bests,
Egor
On 16.11.18 23:17, Rafal Luzynski wrote:
> 2. I made few tests in the command line and it seems to me that the
> transliteration from "З" to "Z" (+ lowercase as well) in uk_UA does
> not work and has not been working for some time already because I've
> checked some older systems as well and the result is always the same.
> I think that the reason is that uk_UA defines multiple
> transliteration rules for "З" depending on what is the letter
> following it. It does not seem to work. AFAIK the reason is that
> the syntax of transliteration rules says that a single non-Latin
> character may map one or more Latin strings, each consisting of one
> or more characters. There cannot be a rule transliterating multiple
> source characters into one or multiple destination characters. Is it
> a bug in transliteration implementation? Or maybe in the
> specification, including POSIX standard?
> The definition of transliteration says that it is one-to-one mapping
> of graphemes while a grapheme may be one or multiple characters. It
> does not have to be always mapping one-to-one character. Should we
> fix this bug first, make uk_UA transliteration work, and only then
> add a generic Cyrillic transliteration? Egor's patch already
> contains transliteration of "У" + combining acute accent to "Ú" which
> most probably will not work.
>
> I still think that in the longer term all existing custom
> transliterations of Cyrillic alphabets should be ported to a
> modification of your patch.
On 01.12.18 23:07, Rafal Luzynski wrote:
> 19.11.2018 08:13 Marko Myllynen <myllynen@redhat.com> wrote:
>> [...]
>> Given the amount of questions above I think the way forward is to try
>> follow the relevant standards as closely as possible and also check what
>> the other implementations (i.e., uconv(1)) do. For example, checking the
>> case earlier mentioned case may or may not give some hints:
>>
>> $ echo Шема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
>> Šema
>> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
>> Shema
>> $ uconv -V
>> uconv v2.1 ICU 50.1.2
>
> I've played a little with uconv and unfortunately it does not look good
> to me.
>
> It does not have any fallback transliteration to plain ASCII. When it says
> that 'Ш' is transliterated to 'Š' then it always uses 'Š' and if the target
> charset does not have this character then crashes:
>
> $ echo Шема | uconv -f UTF-8 -t ASCII -x cyrillic-latin
> Conversion from Unicode to codepage failed at output byte position 0.
> Unicode: 0160 Error: Invalid character found
> $ echo Шема | uconv -f UTF-8 -t ISO-8859-1 -x cyrillic-latin
> Conversion from Unicode to codepage failed at output byte position 0.
> Unicode: 0160 Error: Invalid character found
> $ echo Шема | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin
> �ema
> $ echo Шема | uconv -f UTF-8 -t ISO-8859-2 -x cyrillic-latin | uconv -f
> ISO-8859-2 -t UTF-8
> Šema
>
> It seems to follow ISO 9 (GOST 7.79) System A. However, the transliteration
> of the hard sign is rather strange:
>
> $ echo нъе | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> nʺe
>
> The above was correct but:
>
> $ echo НЪЕ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> Nʺ̱E
> $ echo Ъ | uconv -f UTF-8 -t UTF-8 -x cyrillic-latin
> ʺ̱
> $ echo Ъ | uconv -f UTF-8 -t UTF-16 -x cyrillic-latin| hexdump -x
> 0000000 feff 02ba 0331 000a
> 0000008
>
> So this generates:
> 02BA MODIFIER LETTER DOUBLE PRIME
> 0331 COMBINING MACRON BELOW
>
> There is are more transliteration methods, for example Russian-Latin/BGN:
>
> $ echo Шема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Shema
> $ echo Схема | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Skhema
>
> Converting 'х' to 'kh' seems to be common in English transliteration but
> it does not follow any ISO standard.
>
> $ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> KHA kha
>
> This means that the choice whether a digraph in the output should be
> all uppercase or maybe upper+lower is context based, something which we
> probably cannot implement. But definitely a good thing.
>
> Two more tests:
>
> $ echo Ещё | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> Yeshchë
> $ echo Ещё | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> Conversion from Unicode to codepage failed at output byte position 6.
> Unicode: 00eb Error: Invalid character found
>
> So the output is not plain ASCII.
>
> $ echo е же ле не | uconv -f UTF-8 -t ASCII -x Russian-Latin/BGN
> ye zhe le ne
>
> Again this means that transliteration of 'е' is context based:
> it is 'ye' in the beginning of a word and 'e' otherwise.
>
> The version which I've tested:
>
> $ uconv -V
> uconv v2.1 ICU 60.2
>
> It seems that uconv will not be a good hint about transliterating
> to plain ASCII.
>
> Also, the difference between uconv and iconv is that we can provide
> multiple transliterations for any source character but we can't group
> them into standards so we can't tell iconv to use this or another
> system. It will just choose the best fitting the current output
> character set and the only thing we can choose is the locale.
>
> This makes me think: should we add a locale like ru_RU@SystemA or
> ru_RU@SystemB?
>
> Regards,
>
> Rafal
>
17.11.2018 19:34 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> Looks like we have three issues:
> 1. lack of explicit control which transformation to use (System A or
> System B) via //TRANSLIT
> 2. possibility of collision for System B if used CAP/low transcription
> for capital letters
> 3. Cyrillic 'Х'/'х' (ha) never transcribes to 'H'/'h' as it should per
> System B because its equivalent 'X'/'x' from System A is always present
> and takes precedence.
True.
> As a solution shouldn't we only keep System B in a new file
> transcribe_cyrillic and put it in place as the explicit ASCII
> transcription for targeted locales (as opposed to transliteration)?
>
> We would keep System A as translit_cyrillic but won't include it into
> this patch. Once you have resolved an issue of having two conflicting
> rule-sets but only one key //TRANSLIT you could add the System A back.
Sounds like a good idea to provide those two files:
* translit_cyrillic_system_a,
* translit_cyrillic_system_b,
(or any other pair of names) and let the individual locales choose whether
they want to include System A or System B. For optimization, system_b
file could include system_a and modify it.
> The SH/Sh can be decided on either way - seems like an easy change any
> way.
I'm in favor of "Sh" because it will work fine for titlecased words
(where only the first letter is uppercase) but I'm aware it would be
a problem for uppercased words. Unfortunately, I think we are unable
to satisfy both cases.
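
To make the trade-off concrete, here is a minimal Python sketch. The tables are a hypothetical five-letter subset made up for illustration, not the table from the patch, and the lookup only imitates a fixed per-character rule set; it is not how iconv is implemented.

```python
# Hypothetical mini-tables for illustration only; not the patch's table.
BASE = {"е": "e", "м": "m", "а": "a", "Е": "E", "М": "M", "А": "A", "ш": "sh"}
TITLE = {**BASE, "Ш": "Sh"}  # capital letter maps to an upper+lower digraph
UPPER = {**BASE, "Ш": "SH"}  # capital letter maps to an all-uppercase digraph


def translit(text, table):
    """Replace each character independently, without looking at neighbours."""
    return "".join(table.get(ch, ch) for ch in text)


for word in ("Шема", "ШЕМА"):
    print(word, translit(word, TITLE), translit(word, UPPER))
```

With the "Sh" table the titlecased word comes out as "Shema" but the uppercased one as "ShEMA"; with the "SH" table it is the other way round ("SHema" and "SHEMA"). A fixed table can get only one of the two right.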
> On 16.11.18 23:17, Rafal Luzynski wrote:
>
> > Egor, while at this I was thinking about your idea to transliterate
> > letters like "Ш" (uppercase) to "SH" (always uppercase) in order to
> > distinguish between "Шема" (-> "SHema") and "Схема" (-> "Shema" or
> > "Sxema").
>
> to clarify, this SH/Sh collision issue relates only to iconv -f UTF-8 -t
> ASCII//TRANSLIT (i.e. System B transcription).
True.
> But it's not only SH/Sh, there are following combinations used to
> transcribe capital letters:
>
> YO, DJ, YE, TSH, DH, ZH, CZ, CH, SH, SHH, YU, YA, FH, YH, GH, NG, TCZ
Absolutely true. I skip the whole list only for the brevity: if we
find a solution for one letter the same solution will work fine for
all others.
> [...]
> With transcription we are basically stripping information from the data,
> mapping it into a smaller character set. The idea to keep them in
> CAP/CAP is to try to preserve as much information as possible.
I'm only afraid that things like "TWo CApitals" or "CamelCase" are
common among us computer geeks while they do not look great when
working with natural language and when displaying them to regular users
and even non-computer people.
> [...]
> So in fact we have two rules for each letter in the same file (System A
> and System B), where System A takes precedence.
>
> I have a question then: isn't this more like a hack than a right thing
> to do?
>
> Shouldn't we have two explicit rules for transcription and
> transliteration not dependent on a destination character set?
It's impossible with the current API of iconv. It may become possible
in the future, but that is a greater amount of work than what we are
doing here now. Again, for now a different set of rules means a
different locale.
I have another question: is it really a job of transliteration to preserve
all original information, to ensure no collisions and have the ability to
restore the original text? I'm afraid that as long as plain ASCII is the
destination charset whatever system we provide it will always be possible
to provide a malicious combination of the Cyrillic characters proving that
the system generates collisions.
> > I still don't like the idea to
> > put two uppercase letters in a beginning of a word in titlecase only
> > to indicate that there was originally a single letter. What if we:
> >
> > * drop the rule of transliterating "Х" to "H" and transliterate
> > always to "X",
> This would contradict ISO 9.1995. (System A).
Yes, it would. I'm trying to find solution here since I think we have
proved that we can't implement a system which will handle System A,
System B, and ensure no collisions at the same time. At least one
requirement must be dropped (at least partially).
> System A was added on Marko's request (so setting him on TO:) I am
> neutral on keeping it or dropping it, just to be clear.
I think I didn't see this Marko's request but I'm in favor of keeping
System A, too.
Marko, it would be good to hear your opinion about System A vs. System B
again.
> [...]
> On the other hand, for my personal needs I care less about standards but
> about current functionality and data loss because of missing
> transcription altogether due to the BZ #2872.
I read this that you are open to a solution which is inspired by some
standards but does not implement them fully due to our technical
limitations.
19.11.2018 10:21 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> Marko,
>
> Your example only covers _transliteration_ to Latin Diacritics
> [...]
> while BZ #2872 is about _transcription_ to ASCII
> [...]
>
> So again, you are asking to have ISO 9.1995. System A but the bug is
> about ISO 9.1995. System B (GOST 7.79-2000)
It's hard to say what the original bug reporter meant but I think that the
problem is that there is no transliteration from Cyrillic to any variant of
Latin, except in few locales. If System A was implemented but System B was
not then at least some characters would be handled correctly. Currently no
Cyrillic characters are handled.
19.11.2018 20:35 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> In any case once your patch lands I'm going to submit a follow-up patch
> for fi_FI to make it compliant with the applicable national standard
> (SFS 4900) which defines how to do Cyrillic transliteration /
> transcription in the context Finnish.
I totally agree. As far as I can see, SFS 4900 is more similar to
System A (ISO 9) rather than System B, that is, it transliterates to Latin
characters with diacritics rather than plain ASCII. Marko, what is your
opinion about possible implementation of SFS 4900 in these cases:
* When the destination charset does not contain required Latin diacritic
characters (e.g., it is plain ASCII)?
* When the output is ambiguous, that means, when two different Cyrillic
strings produce the same Latin (or ASCII) output?
At the moment I am not curious about SFS 4900 but we are facing the same
problems now with ISO 9 and GOST 7.79.
1.12.2018 23:07 Rafal Luzynski <digitalfreak@lingonborough.com> wrote:
> [...]
> $ echo ХА ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
> KHA kha
>
> This means that the choice whether a digraph in the output should be
> all uppercase or maybe upper+lower is context based, something which we
> probably cannot implement. But definitely a good thing.
I forgot to include this test which is really interesting:
$ echo ХА Ха ха | uconv -f UTF-8 -t UTF-8 -x Russian-Latin/BGN
KHA Kha kha
which again confirms that the choice of all uppercase or just the first
letter uppercased is context based, a thing which we can't implement now.
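
For illustration only, a context-based rule of this kind can be sketched in a few lines of Python. The table and the "look at the next character" heuristic are assumptions made up for this sketch; they reproduce the three uconv outputs quoted above but are not claimed to be the rule ICU actually uses, and nothing like this can be written in the current locale source syntax.

```python
# Hypothetical sketch: pick the digraph case from the following character.
DIGRAPHS = {"Х": "KH", "х": "kh", "А": "A", "а": "a"}  # illustrative subset


def translit_contextual(text):
    out = []
    for i, ch in enumerate(text):
        latin = DIGRAPHS.get(ch, ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # An uppercase source letter followed by a lowercase letter is treated
        # as title case, so only the first Latin letter stays uppercase.
        if ch.isupper() and len(latin) > 1 and nxt.islower():
            latin = latin[0] + latin[1:].lower()
        out.append(latin)
    return "".join(out)


print(translit_contextual("ХА Ха ха"))
```

The point is that the result for 'Х' depends on its neighbour, which a plain one-character-in, fixed-string-out table cannot express.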
1.12.2018 23:53 Egor Kobylkin <egor@kobylkin.com> wrote:
>
> On 01.12.18 23:07, Rafal Luzynski wrote:
> >
> > [...]
> > This makes me think: should we add a locale like ru_RU@SystemA or
> > ru_RU@SystemB?
>
> Wouldn't it require to create 3 versions of every locale that would
> include the translit_cyrillic file then? I.e. en_US + en_US@SystemA,
> en_US@SystemB etc.?
OK, please read this as another brainstorming idea and let's just
forget it.
> [...]
> An example from my experience as a user - a networked device or host
> would often have the en_US as the default (only?) locale with no viable
> way to change it or install cyrillic fonts. Anyway, this is the most
> dire situation where the ASCII transliteration certainly helps most.
> Having en_US@SystemA or en_US@SystemB theoretically available but not
> compiled by the distributor wouldn't help here, would it?
>
> So the only useful scenario here would be to ship your locales with the
> transliteration already included by default in en_US. This way the
> distributor won't have to get active to include transliteration as
> en_US@SystemA or en_US@SystemB.
Having the idea of "@SystemA" and "@SystemB" dropped I don't think
implementing any solution in glibc would be helpful for your use case.
Three reasons:
1. I believe that sooner or later someone will develop a transliteration
system for en_US which will follow English transliteration of Russian
instead of any standard we are discussing here. That means, it would
transliterate 'Х' as 'Kh' rather than 'H' or 'X'.
2. Currently there is a trend not to install even en_US locales and leave
only C which is hardcoded into glibc binaries. OTOH, I wouldn't mind
if ISO 9 was hardcoded into C as well.
3. That's beyond Russian language but transliteration according to Serbian
or Bulgarian or Ukrainian or Kazakh rules still requires installing their
proper locales. I think that requiring ru_RU to be installed could be
reasonable especially if we end up with ru_RU somehow differing from
the default "translit_cyrillic".
BTW you don't need Cyrillic fonts to be installed on your server in order
to process the Cyrillic text correctly unless your server renders the text.
3.12.2018 23:19 Egor Kobylkin <egor@kobylkin.com> wrote:
>
> Rafal,
>
> Just to touch base on this, what is the best way forward? Did you get
> any input/feedback on your questions below? Are you expecting input from
> anyone but myself?
Yes, I expected some input from more experienced maintainers about whether
and how to write the tests but I'd rather start another thread about it
because this one is too long already.
> On the blocking issue #2: I really don’t see the connection to the uk_UA
> locale that has its transliteration table inline and is explicitly
> excluded from my patch. It may be revealing another issue you have with
> glibc but wouldn’t that be better addressed in a new bug?
OK, I was not precise enough (I'm sorry about it) so I'd like to explain
here:
1. In the long term goal I would like to convert those excluded locales
to use your translit_cyrillic as well.
2. In order to ensure that change is not destructive for them I will need
automatic tests to prove that their transliteration rules work
equally well before and after the change.
3. It does not matter that converting those other locales is in a distant
future because we need the same tests for Russian language now.
4. Even though I have not started writing any tests I can see they
will be failing for uk_UA. The reason is that glibc transliteration
rules can handle transliterating single characters into single
characters, single characters into multiple characters but not
multiple characters into multiple (or even single) characters.
5. We can ignore uk_UA but we will face the same case in ru_RU where
you had a case of 'У́ ' ('У' + 'COMBINING ACUTE ACCENT').
6. So the question was: how (and whether) to write the tests if we
already know they would be failing? Skip them? Resolve the other
issue first? Mark them as XFAIL?
In the meantime, you have removed the controversial conversion rule
of 'У' with the acute accent:
> Again, in the v10 of my patch I have removed multicharacter source
> graphemes, so that issue is moot there.
so we can move to the next step.
> If you’d like to overhaul the glibc translit system wouldn’t it be
> better to commit the simple text file with the Cyrillic
> translit(transcription) table first, fix the bug from the year 2006 and
> then proceed from there with all due diligence?
I agree and we are now one step forward.
> The same with having both System A and System B. Initially I went along
> with the suggestion to include the system A but it is clear now that it
> doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose
> to set it aside for the moment and use the v10 without the system A.
> That is the whole reason I have submitted it, to be superclear on that.
OK, I think that now I understand your reason to drop System A better.
But still I'd like to rethink implementing System A somehow and drop
(or rather: implement only partially) System B.
> Now you saw that uconv is transcribing «ХА» as KHA (cap/cap/cap) that
> should mitigate your concern about that issue too (somewhat, anyway).
> Making it context based would also be about adding new code, see above.
It would also require the changes in the syntax of the source code
of locale data and possibly breaking the POSIX compatibility which
I think would be unacceptable.
> Let me know if there’s anything I can help with getting more progress
> with the decision
I'm afraid you can't help more. I'd like to hear some feedback from other
people. Due to some minor obstacles the two of us can't resolve this issue
on our own.
Regards,
Rafal
Hi,
On 08/12/2018 03.15, Rafal Luzynski wrote:
> 17.11.2018 19:34 Egor Kobylkin <egor@kobylkin.com> wrote:
>>
>> The SH/Sh can be decided on either way - seems like an easy change any
>> way.
>
> I'm in favor of "Sh" because it will work fine for titlecased words
> (where only the first letter is uppercase) but I'm aware it would be
> a problem for uppercased words. Unfortunately, I think we are unable
> to satisfy both cases.
I think I'm in favor of "Sh" as well, although not perfect I'd assume
it's probably going to be correct in more cases than SH.
>> System A was added on Marko's request (so setting him on TO:) I am
>> neutral on keeping it or dropping it, just to be clear.
>
> I think I didn't see this Marko's request but I'm in favor of keeping
> System A, too.
>
> Marko, it would be good to hear your opinion about System A vs. System B
> again.
I think System A is a better option as it should be the same as ISO 9
and perhaps also produces results in some cases which are more expected
than with System B (if the Wikipedia ISO 9 article is to be believed).
Wrt BZ #2872 I think it's good to keep it in mind but IMHO we can also
deviate from it if needed, however with System A + ASCII fallback
definitions the RFE should be satisfied as well?
> 19.11.2018 20:35 Marko Myllynen <myllynen@redhat.com> wrote:
>> [...]
>> In any case once your patch lands I'm going to submit a follow-up patch
>> for fi_FI to make it compliant with the applicable national standard
>> (SFS 4900) which defines how to do Cyrillic transliteration /
>> transcription in the context Finnish.
>
> I totally agree. As far as I can see, SFS 4900 is more similar to
> System A (ISO 9) rather than System B, that is, it transliterates to Latin
> characters with diacritics rather than plain ASCII. Marko, what is your
> opinion about possible implementation of SFS 4900 in these cases:
>
> * When the destination charset does not contain required Latin diacritic
> characters (e.g., it is plain ASCII)?
This would be according to http://jkorpela.fi/iso9.html [8], so for example
zh instead of ž and shtsh instead of štš.
> * When the output is ambiguous, that means, when two different Cyrillic
> strings produce the same Latin (or ASCII) output?
This is a good point and one I haven't considered but I'm not sure is
there anything we can do about this (at least without major locale
system internals work)? Do you have any rough idea how frequently this
could happen or is this more a theoretical issue? (Sorry if I've missed
earlier comments about this, it's been a long thread.)
>> The same with having both System A and System B. Initially I went along
>> with the suggestion to include the system A but it is clear now that it
>> doesn’t make fixing [BZ #2872] more straightforward. So I’d also propose
>> to set it aside for the moment and use the v10 without the system A.
>> That is the whole reason I have submitted it, to be superclear on that.
>
> OK, I think that now I understand your reason to drop System A better.
> But still I'd like to rethink implementing System A somehow and drop
> (or rather: implement only partially) System B.
Yes, I also think System A AKA ISO 9 would be a better choice but I'll
leave the final decision for you two (and others who might weigh in).
Thanks,
10.12.2018 22:20 Marko Myllynen <myllynen@redhat.com> wrote:
>
> Hi,
>
> On 08/12/2018 03.15, Rafal Luzynski wrote:
> > [...]
> > Marko, it would be good to hear your opinion about System A vs. System B
> > again.
>
> I think System A is a better option as it should be the same as ISO 9
> and perhaps also produces results in some cases which are more expected
> than with System B (if the Wikipedia ISO 9 article is to be believed).
>
> Wrt BZ #2872 I think it's good to keep it in mind but IMHO we can also
> deviate from it if needed, however with System A + ASCII fallback
> definitions the RFE should be satisfied as well?
That's exactly what I meant (sorry if it was not clear before).
> > [...] Marko, what is your
> > opinion about possible implementation of SFS 4900 in these cases:
> >
> > * When the destination charset does not contain required Latin diacritic
> > characters (e.g., it is plain ASCII)?
>
> This would be according to http://jkorpela.fi/iso9.html [8], so for example
> zh instead of ž and shtsh instead of štš.
Agree.
> > * When the output is ambiguous, that means, when two different Cyrillic
> > strings produce the same Latin (or ASCII) output?
>
> This is a good point and one I haven't considered but I'm not sure is
> there anything we can do about this (at least without major locale
> system internals work)?
I agree with the suggestion that we can't do much about it. I mean,
there are possibly solutions (like using more punctuation characters)
but they don't look natural to me.
> Do you have any rough idea how frequently this
> could happen or is this more a theoretical issue? (Sorry if I've missed
> earlier comments about this, it's been a long thread.)
Yes, Egor provided this example many times:
"схема" -> "shema" (if "с" -> "s" and "х" -> "h")
"шема" -> "shema" (if "ш" -> "sh")
I don't think that it matters how frequent these cases are. I think that
the question is whether ambiguity is a bug, because if it is then even one
corner case proves that the solution is wrong.
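
The collision is easy to reproduce with a tiny Python sketch. The table below is a hypothetical subset chosen only to show the effect (it assumes "х" -> "h" and "ш" -> "sh", as in the example above) and is not the table from the patch.

```python
# Hypothetical subset of an ASCII table, chosen to reproduce the collision.
TABLE = {"с": "s", "х": "h", "ш": "sh", "е": "e", "м": "m", "а": "a"}


def translit(text):
    return "".join(TABLE.get(ch, ch) for ch in text)


words = ["схема", "шема"]
results = {w: translit(w) for w in words}
print(results)
# Distinct inputs, identical output: the mapping cannot be reversed.
print(len(set(results.values())) < len(words))
```

Both words come out as "shema", so the original spelling cannot be recovered from the output.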
> [...]
> Yes, I also think System A AKA ISO 9 would be a better choice but I'll
> leave the final decision for you two (and others who might weigh in).
Egor is a native speaker so I respect his opinion even if I'm not fully
convinced for technical reasons. Sadly, nobody else provides any opinion
which could weigh. I am going to write a separate email about it.
Regards,
Rafal
On 19.12.18 23:25, Rafal Luzynski wrote:
> 10.12.2018 22:20 Marko Myllynen <myllynen@redhat.com> wrote:
>
>> [...]
>> Yes, I also think System A AKA ISO 9 would be a better choice but I'll
>> leave the final decision for you two (and others who might weigh in).
>
> Egor is a native speaker so I respect his opinion even if I'm not fully
> convinced for technical reasons. Sadly, nobody else provides any opinion
> which could weigh. I am going to write a separate email about it.
>
> Regards,
>
> Rafal
>
It's not about which letter should be used for a particular
transliteration. I couldn't care less about that just to be clear.
Maybe I am missing something; could you tell me how you want to fit
System A to ASCII exactly?
Let's take the very first example from the table:
CyrillicUnicode CyrillicLetter CyrillicUnicodeName LatinUnicode System A
Latin Letter System B ASCII Letter
0401 Ё CYRILLIC CAPITAL LETTER IO 00CB Ë YO
so:
Cyrillic Ё U0401
System A - Ë U00CB - _not_ ASCII
System B - YO (or Yo) "<U0059><U004F>" - ASCII
Could you explain how we can make System A "Ë" be displayed or
processed somehow in a C locale? Or in a locale or program that doesn't
have "Ë" U00CB?
Bests,
Egor
19.12.2018 23:48 Egor Kobylkin <egor@kobylkin.com> wrote:
> [...]
> Maybe I am missing something; could you tell me how you want to fit
> System A to ASCII exactly?
>
> Let's take the very first example from the table:
> CyrillicUnicode CyrillicLetter CyrillicUnicodeName LatinUnicode System A
> Latin Letter System B ASCII Letter
> 0401 Ё CYRILLIC CAPITAL LETTER IO 00CB Ë YO
>
> so:
> Cyrillic Ё U0401
> System A - Ë U00CB - _not_ ASCII
> System B - YO (or Yo) "<U0059><U004F>" - ASCII
>
> Could you explain how we can make System A "Ë" be displayed or
> processed somehow in a C locale? Or in a locale or program that doesn't
> have "Ë" U00CB?
It should be "YO" (or "Yo"). Exactly as you provided in your previous
patches.
I am afraid that my description "Cyrillic -> Latin -> ASCII" was too
ambiguous, I am sorry about it. Actually it is a list which says:
Convert Cyrillic "Ё" into Latin "Ë" if possible, otherwise to "YO" ("Yo").
We may stop using "Cyrillic -> Latin -> ASCII" picture as too ambiguous
and invent a better one.
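
As a rough illustration of that "Latin if possible, otherwise ASCII" reading, here is a small Python sketch. The candidate lists and the `translit` helper are hypothetical and only imitate the idea of an ordered list of replacements tried against the target charset; this is not glibc code.

```python
# Hypothetical candidate lists: preferred Latin form first, ASCII fallback last.
CANDIDATES = {"Ё": ["Ë", "YO"], "ё": ["ë", "yo"]}


def translit(text, charset):
    out = []
    for ch in text:
        # Try the original character first, then each replacement in order.
        for cand in [ch] + CANDIDATES.get(ch, []):
            try:
                cand.encode(charset)
            except UnicodeEncodeError:
                continue  # not representable, try the next candidate
            out.append(cand)
            break
        else:
            out.append("?")  # nothing fits the target charset
    return "".join(out)


print(translit("Ё", "iso-8859-1"))
print(translit("Ё", "ascii"))
```

With ISO-8859-1 as the target the first candidate "Ë" fits and is used; with ASCII it does not fit, so the sketch falls through to "YO".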
Regards,
Rafal
From a8ae30e0bf7484f4c0f034480110c81dd059b69e Mon Sep 17 00:00:00 2001
From: Egor Kobylkin <egor@kobylkin.com>
Date: Wed, 14 Nov 2018 22:10:37 +0100
Subject: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]
[BZ #2872]
* localedata/locales/translit_cyrillic: New file. Supports
ISO 9.1995, GOST 7.79 System A transliteration and System B
transcription tables from Cyrillic to Latin/ASCII.
* localedata/locales/aa_DJ: Add 'include "translit_cyrillic";""'
to LC_CTYPE translit section.
* localedata/locales/af_ZA: Likewise.
* localedata/locales/ak_GH: Likewise.
* localedata/locales/am_ET: Likewise.
* localedata/locales/ar_EG: Likewise.
* localedata/locales/be_BY: Likewise.
* localedata/locales/bem_ZM: Likewise.
* localedata/locales/ber_DZ: Likewise.
* localedata/locales/ber_MA: Likewise.
* localedata/locales/bg_BG: Likewise.
* localedata/locales/bi_VU: Likewise.
* localedata/locales/bn_BD: Likewise.
* localedata/locales/bo_CN: Likewise.
* localedata/locales/ca_ES: Likewise.
* localedata/locales/ce_RU: Likewise.
* localedata/locales/cmn_TW: Likewise.
* localedata/locales/cs_CZ: Likewise.
* localedata/locales/cv_RU: Likewise.
* localedata/locales/cy_GB: Likewise.
* localedata/locales/da_DK: Likewise.
* localedata/locales/de_DE: Likewise.
* localedata/locales/dv_MV: Likewise.
* localedata/locales/dz_BT: Likewise.
* localedata/locales/el_GR: Likewise.
* localedata/locales/en_GB: Likewise.
* localedata/locales/en_NG: Likewise.
* localedata/locales/en_ZM: Likewise.
* localedata/locales/es_CU: Likewise.
* localedata/locales/es_ES: Likewise.
* localedata/locales/et_EE: Likewise.
* localedata/locales/fa_IR: Likewise.
* localedata/locales/ff_SN: Likewise.
* localedata/locales/fi_FI: Likewise.
* localedata/locales/fr_FR: Likewise.
* localedata/locales/ga_IE: Likewise.
* localedata/locales/gd_GB: Likewise.
* localedata/locales/gu_IN: Likewise.
* localedata/locales/gv_GB: Likewise.
* localedata/locales/he_IL: Likewise.
* localedata/locales/hi_IN: Likewise.
* localedata/locales/hif_FJ: Likewise.
* localedata/locales/hr_HR: Likewise.
* localedata/locales/ht_HT: Likewise.
* localedata/locales/hu_HU: Likewise.
* localedata/locales/hy_AM: Likewise.
* localedata/locales/id_ID: Likewise.
* localedata/locales/is_IS: Likewise.
* localedata/locales/it_IT: Likewise.
* localedata/locales/ja_JP: Likewise.
* localedata/locales/kab_DZ: Likewise.
* localedata/locales/kk_KZ: Likewise.
* localedata/locales/km_KH: Likewise.
* localedata/locales/kn_IN: Likewise.
* localedata/locales/ko_KR: Likewise.
* localedata/locales/ks_IN: Likewise.
* localedata/locales/kw_GB: Likewise.
* localedata/locales/ky_KG: Likewise.
* localedata/locales/lb_LU: Likewise.
* localedata/locales/lg_UG: Likewise.
* localedata/locales/lij_IT: Likewise.
* localedata/locales/ln_CD: Likewise.
* localedata/locales/lo_LA: Likewise.
* localedata/locales/lt_LT: Likewise.
* localedata/locales/lv_LV: Likewise.
* localedata/locales/mg_MG: Likewise.
* localedata/locales/mhr_RU: Likewise.
* localedata/locales/mk_MK: Likewise.
* localedata/locales/ml_IN: Likewise.
* localedata/locales/ms_MY: Likewise.
* localedata/locales/mt_MT: Likewise.
* localedata/locales/nan_TW@latin: Likewise.
* localedata/locales/nb_NO: Likewise.
* localedata/locales/ne_NP: Likewise.
* localedata/locales/nhn_MX: Likewise.
* localedata/locales/niu_NU: Likewise.
* localedata/locales/niu_NZ: Likewise.
* localedata/locales/nl_NL: Likewise.
* localedata/locales/nr_ZA: Likewise.
* localedata/locales/oc_FR: Likewise.
* localedata/locales/om_KE: Likewise.
* localedata/locales/or_IN: Likewise.
* localedata/locales/os_RU: Likewise.
* localedata/locales/pa_IN: Likewise.
* localedata/locales/pa_PK: Likewise.
* localedata/locales/pl_PL: Likewise.
* localedata/locales/pt_PT: Likewise.
* localedata/locales/quz_PE: Likewise.
* localedata/locales/ro_RO: Likewise.
* localedata/locales/ru_RU: Likewise.
* localedata/locales/rw_RW: Likewise.
* localedata/locales/sa_IN: Likewise.
* localedata/locales/sd_IN: Likewise.
* localedata/locales/sd_IN@devanagari: Likewise.
* localedata/locales/se_NO: Likewise.
* localedata/locales/sgs_LT: Likewise.
* localedata/locales/shn_MM: Likewise.
* localedata/locales/si_LK: Likewise.
* localedata/locales/sk_SK: Likewise.
* localedata/locales/sl_SI: Likewise.
* localedata/locales/sm_WS: Likewise.
* localedata/locales/so_SO: Likewise.
* localedata/locales/sq_AL: Likewise.
* localedata/locales/ss_ZA: Likewise.
* localedata/locales/st_ZA: Likewise.
* localedata/locales/sv_SE: Likewise.
* localedata/locales/sw_KE: Likewise.
* localedata/locales/ta_IN: Likewise.
* localedata/locales/te_IN: Likewise.
* localedata/locales/th_TH: Likewise.
* localedata/locales/ti_ET: Likewise.
* localedata/locales/tn_ZA: Likewise.
* localedata/locales/to_TO: Likewise.
* localedata/locales/tpi_PG: Likewise.
* localedata/locales/tr_TR: Likewise.
* localedata/locales/ts_ZA: Likewise.
* localedata/locales/unm_US: Likewise.
* localedata/locales/ur_IN: Likewise.
* localedata/locales/ur_PK: Likewise.
* localedata/locales/ve_ZA: Likewise.
* localedata/locales/vi_VN: Likewise.
* localedata/locales/wa_BE: Likewise.
* localedata/locales/wo_SN: Likewise.
* localedata/locales/xh_ZA: Likewise.
* localedata/locales/yi_US: Likewise.
* localedata/locales/yuw_PG: Likewise.
* localedata/locales/zh_CN: Likewise.
* localedata/locales/zu_ZA: Likewise.
---
localedata/locales/aa_DJ | 1 +
localedata/locales/af_ZA | 1 +
localedata/locales/ak_GH | 1 +
localedata/locales/am_ET | 1 +
localedata/locales/ar_EG | 1 +
localedata/locales/be_BY | 1 +
localedata/locales/bem_ZM | 1 +
localedata/locales/ber_DZ | 1 +
localedata/locales/ber_MA | 1 +
localedata/locales/bg_BG | 1 +
localedata/locales/bi_VU | 1 +
localedata/locales/bn_BD | 1 +
localedata/locales/bo_CN | 1 +
localedata/locales/ca_ES | 1 +
localedata/locales/ce_RU | 1 +
localedata/locales/cs_CZ | 1 +
localedata/locales/cv_RU | 1 +
localedata/locales/cy_GB | 1 +
localedata/locales/da_DK | 1 +
localedata/locales/de_DE | 1 +
localedata/locales/dv_MV | 1 +
localedata/locales/dz_BT | 1 +
localedata/locales/el_GR | 1 +
localedata/locales/en_GB | 1 +
localedata/locales/en_NG | 1 +
localedata/locales/en_ZM | 1 +
localedata/locales/es_CU | 1 +
localedata/locales/es_ES | 1 +
localedata/locales/et_EE | 1 +
localedata/locales/fa_IR | 1 +
localedata/locales/ff_SN | 1 +
localedata/locales/fi_FI | 1 +
localedata/locales/fr_FR | 1 +
localedata/locales/ga_IE | 1 +
localedata/locales/gd_GB | 1 +
localedata/locales/gu_IN | 1 +
localedata/locales/gv_GB | 1 +
localedata/locales/he_IL | 1 +
localedata/locales/hi_IN | 1 +
localedata/locales/hif_FJ | 1 +
localedata/locales/hr_HR | 1 +
localedata/locales/ht_HT | 1 +
localedata/locales/hu_HU | 1 +
localedata/locales/hy_AM | 1 +
localedata/locales/id_ID | 1 +
localedata/locales/is_IS | 1 +
localedata/locales/it_IT | 1 +
localedata/locales/ja_JP | 1 +
localedata/locales/kab_DZ | 1 +
localedata/locales/kk_KZ | 1 +
localedata/locales/km_KH | 1 +
localedata/locales/kn_IN | 1 +
localedata/locales/ko_KR | 1 +
localedata/locales/ks_IN | 1 +
localedata/locales/kw_GB | 1 +
localedata/locales/ky_KG | 1 +
localedata/locales/lb_LU | 1 +
localedata/locales/lg_UG | 1 +
localedata/locales/lij_IT | 1 +
localedata/locales/ln_CD | 1 +
localedata/locales/lo_LA | 1 +
localedata/locales/lt_LT | 1 +
localedata/locales/lv_LV | 1 +
localedata/locales/mg_MG | 1 +
localedata/locales/mhr_RU | 1 +
localedata/locales/mk_MK | 1 +
localedata/locales/ml_IN | 1 +
localedata/locales/ms_MY | 1 +
localedata/locales/mt_MT | 1 +
localedata/locales/nan_TW@latin | 1 +
localedata/locales/nb_NO | 1 +
localedata/locales/ne_NP | 1 +
localedata/locales/nhn_MX | 1 +
localedata/locales/niu_NU | 1 +
localedata/locales/niu_NZ | 1 +
localedata/locales/nl_NL | 1 +
localedata/locales/nr_ZA | 1 +
localedata/locales/oc_FR | 1 +
localedata/locales/om_KE | 1 +
localedata/locales/or_IN | 1 +
localedata/locales/os_RU | 1 +
localedata/locales/pa_IN | 1 +
localedata/locales/pa_PK | 1 +
localedata/locales/pl_PL | 1 +
localedata/locales/pt_PT | 1 +
localedata/locales/quz_PE | 1 +
localedata/locales/ro_RO | 1 +
localedata/locales/ru_RU | 1 +
localedata/locales/rw_RW | 1 +
localedata/locales/sa_IN | 1 +
localedata/locales/sd_IN | 1 +
localedata/locales/sd_IN@devanagari | 1 +
localedata/locales/se_NO | 1 +
localedata/locales/sgs_LT | 1 +
localedata/locales/shn_MM | 1 +
localedata/locales/si_LK | 1 +
localedata/locales/sk_SK | 1 +
localedata/locales/sl_SI | 1 +
localedata/locales/sm_WS | 1 +
localedata/locales/so_SO | 1 +
localedata/locales/sq_AL | 1 +
localedata/locales/ss_ZA | 1 +
localedata/locales/st_ZA | 1 +
localedata/locales/sv_SE | 1 +
localedata/locales/sw_KE | 1 +
localedata/locales/ta_IN | 1 +
localedata/locales/te_IN | 1 +
localedata/locales/th_TH | 1 +
localedata/locales/ti_ET | 1 +
localedata/locales/tn_ZA | 1 +
localedata/locales/to_TO | 1 +
localedata/locales/tpi_PG | 1 +
localedata/locales/tr_TR | 1 +
localedata/locales/translit_cyrillic | 383 +++++++++++++++++++++++++++
localedata/locales/ts_ZA | 1 +
localedata/locales/unm_US | 1 +
localedata/locales/ur_IN | 1 +
localedata/locales/ur_PK | 1 +
localedata/locales/ve_ZA | 1 +
localedata/locales/vi_VN | 1 +
localedata/locales/wa_BE | 1 +
localedata/locales/wo_SN | 1 +
localedata/locales/xh_ZA | 1 +
localedata/locales/yi_US | 1 +
localedata/locales/yuw_PG | 1 +
localedata/locales/zh_CN | 1 +
localedata/locales/zu_ZA | 1 +
127 files changed, 509 insertions(+)
create mode 100644 localedata/locales/translit_cyrillic
@@ -68,6 +68,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -70,6 +70,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -54,6 +54,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -96,6 +96,7 @@ copy "i18n"
space <U1361>
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% hoy-sadis followed by a vowel
<U1205><U12A0> <U0068><U0027><U0065>
@@ -44,6 +44,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -91,6 +91,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -41,6 +41,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -136,6 +136,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -83,6 +83,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -49,6 +49,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -39,6 +39,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -61,6 +61,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -43,6 +43,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -57,6 +57,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -38,6 +38,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -215,6 +215,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -103,6 +103,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -65,6 +65,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -147,6 +147,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
<U00C4> "<U0041><U0308>";"<U0041><U0045>"
@@ -44,6 +44,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% German umlauts.
% LATIN CAPITAL LETTER A WITH DIAERESIS.
@@ -49,6 +49,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
@@ -59,6 +59,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -54,6 +54,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -49,6 +49,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -41,6 +41,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -107,6 +107,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -113,6 +113,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -78,6 +78,7 @@ map to_outpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -41,6 +41,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -177,6 +177,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -57,6 +57,7 @@ translit_start
% In France, accents are simply omitted if they cannot be represented.
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
@@ -53,6 +53,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -45,6 +45,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -62,6 +62,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -56,6 +56,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -61,6 +61,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -37,6 +37,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -46,6 +46,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% Historicaly we used ISO-8869-2 and wrote digraphs
% <U01C6> {dž}, <U01C9> {lj} and <U01CC> {nj}
@@ -57,6 +57,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -455,6 +455,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
<U00C1> "<U0041><U0301>";"<U0041><U00B4>";"<U0041><U0027>"
<U00C9> "<U0045><U0301>";"<U0045><U00B4>";"<U0045><U0027>"
@@ -75,6 +75,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -54,6 +54,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -149,6 +149,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -1680,6 +1680,7 @@ translit_start
include "translit_combining";""
include "translit_cjk_variants";""
+include "translit_cyrillic";""
translit_end
@@ -41,6 +41,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -99,6 +99,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -42,6 +42,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -63,6 +63,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -6098,6 +6098,7 @@ translit_start
include "translit_combining";""
include "translit_hangul";""
+include "translit_cyrillic";""
translit_end
@@ -46,6 +46,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -57,6 +57,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -82,6 +82,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -44,6 +44,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% German umlauts
% LATIN CAPITAL LETTER A WITH DIAERESIS
@@ -56,6 +56,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -47,6 +47,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -39,6 +39,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -50,6 +50,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -163,6 +163,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -125,6 +125,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -53,6 +53,7 @@ translit_start
% Accents are simply omitted if they cannot be represented.
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -48,6 +48,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -60,6 +60,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
%
@@ -45,6 +45,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -47,6 +47,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -51,6 +51,7 @@ translit_start
% accents are simply omitted if they cannot be represented.
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
@@ -144,6 +144,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
<U00C4> "<U0041><U0308>";"<U0041><U0045>"
@@ -43,6 +43,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -59,6 +59,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -56,6 +56,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -64,6 +64,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -54,6 +54,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -156,6 +156,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -62,6 +62,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -71,6 +71,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -60,6 +60,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -49,6 +49,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% those two lettes are not in cp1256...
@@ -130,6 +130,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -55,6 +55,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -129,6 +129,7 @@ copy "i18n"
%
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% if t/scomma is not available, try first t/scedilla
<U0218> "<U015E>";"<U0053>"
@@ -69,6 +69,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -45,6 +45,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -44,6 +44,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -46,6 +46,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -44,6 +44,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -221,6 +221,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -44,6 +44,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -67,6 +67,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -2120,6 +2120,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -37,6 +37,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -68,6 +68,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -45,6 +45,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -66,6 +66,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -62,6 +62,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -151,6 +151,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% LATIN CAPITAL LETTER A WITH DIAERESIS -> "AE"
<U00C4> "<U0041><U0308>";"<U0041><U0045>"
@@ -43,6 +43,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -63,6 +63,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -63,6 +63,7 @@ map to_inpunct; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -57,6 +57,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -864,6 +864,7 @@ translit_start
<U137C> <U0060><U0031><U0030><U0030><U0030><U0030>
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
%
END LC_CTYPE
@@ -67,6 +67,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -36,6 +36,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -44,6 +44,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -2535,6 +2535,7 @@ class "combining_level3"; /
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% TURKISH LIRA SIGN
<U20BA> "<U0054><U004C>"
new file mode 100644
@@ -0,0 +1,383 @@
+escape_char /
+comment_char %
+
+% This file is part of the GNU C Library and contains locale data.
+% The Free Software Foundation does not claim any copyright interest
+% in the locale data contained in this file. The foregoing does not
+% affect the license of the GNU C Library as a whole. It does not
+% exempt you from the conditions of the license if your use would
+% otherwise be governed by that license.
+
+% Transliterations of Cyrillic letters to Latin and/or ASCII symbols.
+% Inspired by ISO 9:1995 / GOST 7.79-2000.
+% Covers Unicode range https://www.unicode.org/charts/PDF/U0400.pdf
+% i.e. [U0401-U04F9, U2019], but only the letters covered by ISO 9:1995.
+% It implements GOST 7.79 System A (Latin script with diacritics) as
+% the first option and System B (ASCII) as the second option. See
+% https://en.wikipedia.org/wiki/ISO_9 for reference.
+% System B is extended from GOST 7.79 for Russian using open sources
+% of the transliteration mappings and the "h/`" diacritics logic.
+
+% Usage examples:
+% iconv -f UTF-8 -t ISO-8859-15//TRANSLIT \
+% | iconv -f ISO-8859-15 -t UTF-8 # System A
+% iconv -f UTF-8 -t ASCII//TRANSLIT # System B.
+
+% Contributions welcome for the rest of Cyrillic script in Unicode
+% https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode.
+% Bugfix for https://sourceware.org/bugzilla/show_bug.cgi?id=2872.
+% Generated from UnicodeData.txt with a spreadsheet referenced
+% in that bug's doclet
+
+LC_CTYPE
+
+translit_start
+
+% CYRILLIC CAPITAL LETTER IO
+<U0401> <U00CB>;"<U0059><U004F>"
+% CYRILLIC CAPITAL LETTER DJE
+<U0402> <U0110>;"<U0044><U004A>"
+% CYRILLIC CAPITAL LETTER GJE
+<U0403> <U01F4>;"<U0047><U0060>"
+% CYRILLIC CAPITAL LETTER UKRAINIAN IE
+<U0404> <U00CA>;"<U0059><U0045>"
+% CYRILLIC CAPITAL LETTER DZE
+<U0405> <U1E90>;"<U005A><U0060>"
+% CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
+<U0406> <U00CC>;<U0049>
+% CYRILLIC CAPITAL LETTER YI
+<U0407> <U00CF>;"<U0059><U0049>"
+% CYRILLIC CAPITAL LETTER JE
+<U0408> "<U004A><U030C>";<U004A>
+% CYRILLIC CAPITAL LETTER LJE
+<U0409> "<U004C><U0302>";"<U004C><U0060>"
+% CYRILLIC CAPITAL LETTER NJE
+<U040A> "<U004E><U0302>";"<U004E><U0060>"
+% CYRILLIC CAPITAL LETTER TSHE
+<U040B> <U0106>;"<U0054><U0053><U0048>"
+% CYRILLIC CAPITAL LETTER KJE
+<U040C> <U1E30>;"<U004B><U0060>"
+% CYRILLIC CAPITAL LETTER SHORT U
+<U040E> <U016C>;"<U0055><U0060>"
+% CYRILLIC CAPITAL LETTER DZHE
+<U040F> "<U0044><U0302>";"<U0044><U0048>"
+% CYRILLIC CAPITAL LETTER A
+<U0410> <U0041>
+% CYRILLIC CAPITAL LETTER BE
+<U0411> <U0042>
+% CYRILLIC CAPITAL LETTER VE
+<U0412> <U0056>
+% CYRILLIC CAPITAL LETTER GHE
+<U0413> <U0047>
+% CYRILLIC CAPITAL LETTER DE
+<U0414> <U0044>
+% CYRILLIC CAPITAL LETTER IE
+<U0415> <U0045>
+% CYRILLIC CAPITAL LETTER ZHE
+<U0416> <U017D>;"<U005A><U0048>"
+% CYRILLIC CAPITAL LETTER ZE
+<U0417> <U005A>
+% CYRILLIC CAPITAL LETTER I
+<U0418> <U0049>
+% CYRILLIC CAPITAL LETTER SHORT I
+<U0419> <U004A>
+% CYRILLIC CAPITAL LETTER KA
+<U041A> <U004B>
+% CYRILLIC CAPITAL LETTER EL
+<U041B> <U004C>
+% CYRILLIC CAPITAL LETTER EM
+<U041C> <U004D>
+% CYRILLIC CAPITAL LETTER EN
+<U041D> <U004E>
+% CYRILLIC CAPITAL LETTER O
+<U041E> <U004F>
+% CYRILLIC CAPITAL LETTER PE
+<U041F> <U0050>
+% CYRILLIC CAPITAL LETTER ER
+<U0420> <U0052>
+% CYRILLIC CAPITAL LETTER ES
+<U0421> <U0053>
+% CYRILLIC CAPITAL LETTER TE
+<U0422> <U0054>
+% CYRILLIC CAPITAL LETTER U
+<U0423> <U0055>
+% CYRILLIC CAPITAL LETTER U followed by COMBINING ACUTE ACCENT
+<U0423><U0301> <U00DA>;"<U0055><U0060>"
+% CYRILLIC CAPITAL LETTER EF
+<U0424> <U0046>
+% CYRILLIC CAPITAL LETTER HA
+<U0425> <U0048>;<U0058>
+% CYRILLIC CAPITAL LETTER TSE
+<U0426> <U0043>;"<U0043><U005A>"
+% CYRILLIC CAPITAL LETTER CHE
+<U0427> <U010C>;"<U0043><U0048>"
+% CYRILLIC CAPITAL LETTER SHA
+<U0428> <U0160>;"<U0053><U0048>"
+% CYRILLIC CAPITAL LETTER SHCHA
+<U0429> <U015C>;"<U0053><U0048><U0048>"
+% CYRILLIC CAPITAL LETTER HARD SIGN
+<U042A> <U02BA>;"<U0060><U0060>"
+% CYRILLIC CAPITAL LETTER YERU
+<U042B> <U0059>;"<U0059><U0060>"
+% CYRILLIC CAPITAL LETTER SOFT SIGN
+<U042C> <U02B9>;<U0060>
+% CYRILLIC CAPITAL LETTER E
+<U042D> <U00C8>;"<U0045><U0060>"
+% CYRILLIC CAPITAL LETTER YU
+<U042E> <U00DB>;"<U0059><U0055>"
+% CYRILLIC CAPITAL LETTER YA
+<U042F> <U00C2>;"<U0059><U0041>"
+% CYRILLIC SMALL LETTER A
+<U0430> <U0061>
+% CYRILLIC SMALL LETTER BE
+<U0431> <U0062>
+% CYRILLIC SMALL LETTER VE
+<U0432> <U0076>
+% CYRILLIC SMALL LETTER GHE
+<U0433> <U0067>
+% CYRILLIC SMALL LETTER DE
+<U0434> <U0064>
+% CYRILLIC SMALL LETTER IE
+<U0435> <U0065>
+% CYRILLIC SMALL LETTER ZHE
+<U0436> <U017E>;"<U007A><U0068>"
+% CYRILLIC SMALL LETTER ZE
+<U0437> <U007A>
+% CYRILLIC SMALL LETTER I
+<U0438> <U0069>
+% CYRILLIC SMALL LETTER SHORT I
+<U0439> <U006A>
+% CYRILLIC SMALL LETTER KA
+<U043A> <U006B>
+% CYRILLIC SMALL LETTER EL
+<U043B> <U006C>
+% CYRILLIC SMALL LETTER EM
+<U043C> <U006D>
+% CYRILLIC SMALL LETTER EN
+<U043D> <U006E>
+% CYRILLIC SMALL LETTER O
+<U043E> <U006F>
+% CYRILLIC SMALL LETTER PE
+<U043F> <U0070>
+% CYRILLIC SMALL LETTER ER
+<U0440> <U0072>
+% CYRILLIC SMALL LETTER ES
+<U0441> <U0073>
+% CYRILLIC SMALL LETTER TE
+<U0442> <U0074>
+% CYRILLIC SMALL LETTER U
+<U0443> <U0075>
+% CYRILLIC SMALL LETTER U followed by COMBINING ACUTE ACCENT
+<U0443><U0301> <U00FA>;"<U0075><U0060>"
+% CYRILLIC SMALL LETTER EF
+<U0444> <U0066>
+% CYRILLIC SMALL LETTER HA
+<U0445> <U0068>;<U0078>
+% CYRILLIC SMALL LETTER TSE
+<U0446> <U0063>;"<U0063><U007A>"
+% CYRILLIC SMALL LETTER CHE
+<U0447> <U010D>;"<U0063><U0068>"
+% CYRILLIC SMALL LETTER SHA
+<U0448> <U0161>;"<U0073><U0068>"
+% CYRILLIC SMALL LETTER SHCHA
+<U0449> <U015D>;"<U0073><U0068><U0068>"
+% CYRILLIC SMALL LETTER HARD SIGN
+<U044A> <U02BA>;"<U0060><U0060>"
+% CYRILLIC SMALL LETTER YERU
+<U044B> <U0079>;"<U0079><U0060>"
+% CYRILLIC SMALL LETTER SOFT SIGN
+<U044C> <U02B9>;<U0060>
+% CYRILLIC SMALL LETTER E
+<U044D> <U00E8>;"<U0065><U0060>"
+% CYRILLIC SMALL LETTER YU
+<U044E> <U00FB>;"<U0079><U0075>"
+% CYRILLIC SMALL LETTER YA
+<U044F> <U00E2>;"<U0079><U0061>"
+% CYRILLIC SMALL LETTER IO
+<U0451> <U00EB>;"<U0079><U006F>"
+% CYRILLIC SMALL LETTER DJE
+<U0452> <U0111>;"<U0064><U006A>"
+% CYRILLIC SMALL LETTER GJE
+<U0453> <U01F5>;"<U0067><U0060>"
+% CYRILLIC SMALL LETTER UKRAINIAN IE
+<U0454> <U00EA>;"<U0079><U0065>"
+% CYRILLIC SMALL LETTER DZE
+<U0455> <U1E91>;"<U007A><U0060>"
+% CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
+<U0456> <U00EC>;<U0069>
+% CYRILLIC SMALL LETTER YI
+<U0457> <U00EF>;"<U0079><U0069>"
+% CYRILLIC SMALL LETTER JE
+<U0458> <U01F0>;<U006A>
+% CYRILLIC SMALL LETTER LJE
+<U0459> "<U006C><U0302>";"<U006C><U0060>"
+% CYRILLIC SMALL LETTER NJE
+<U045A> "<U006E><U0302>";"<U006E><U0060>"
+% CYRILLIC SMALL LETTER TSHE
+<U045B> <U0107>;"<U0074><U0073><U0068>"
+% CYRILLIC SMALL LETTER KJE
+<U045C> <U1E31>;"<U006B><U0060>"
+% CYRILLIC SMALL LETTER SHORT U
+<U045E> <U016D>;"<U0075><U0060>"
+% CYRILLIC SMALL LETTER DZHE
+<U045F> "<U0064><U0302>";"<U0064><U0068>"
+% CYRILLIC CAPITAL LETTER BIG YUS
+<U046A> <U01CD>;"<U004F><U0060>"
+% CYRILLIC SMALL LETTER BIG YUS
+<U046B> <U01CE>;"<U006F><U0060>"
+% CYRILLIC CAPITAL LETTER FITA
+<U0472> "<U0046><U0300>";"<U0046><U0048>"
+% CYRILLIC SMALL LETTER FITA
+<U0473> "<U0066><U0300>";"<U0066><U0068>"
+% CYRILLIC CAPITAL LETTER IZHITSA
+<U0474> <U1EF2>;"<U0059><U0048>"
+% CYRILLIC SMALL LETTER IZHITSA
+<U0475> <U1EF3>;"<U0079><U0068>"
+% CYRILLIC CAPITAL LETTER SEMISOFT SIGN
+<U048C> <U011A>;"<U0045><U0060>"
+% CYRILLIC SMALL LETTER SEMISOFT SIGN
+<U048D> <U011B>;"<U0065><U0060>"
+% CYRILLIC CAPITAL LETTER GHE WITH UPTURN
+<U0490> "<U0047><U0300>";"<U0047><U0060>"
+% CYRILLIC SMALL LETTER GHE WITH UPTURN
+<U0491> "<U0067><U0300>";"<U0067><U0060>"
+% CYRILLIC CAPITAL LETTER GHE WITH STROKE
+<U0492> <U0120>;"<U0047><U0048>"
+% CYRILLIC SMALL LETTER GHE WITH STROKE
+<U0493> <U0121>;"<U0067><U0068>"
+% CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
+<U0494> <U011E>;"<U0047><U0048>"
+% CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
+<U0495> <U011F>;"<U0067><U0068>"
+% CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
+<U0496> "<U017D><U0327>";"<U005A><U0048><U0060>"
+% CYRILLIC SMALL LETTER ZHE WITH DESCENDER
+<U0497> "<U017E><U0327>";"<U007A><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER KA WITH DESCENDER
+<U049A> <U0136>;"<U004B><U0060>"
+% CYRILLIC SMALL LETTER KA WITH DESCENDER
+<U049B> <U0137>;"<U006B><U0060>"
+% CYRILLIC CAPITAL LETTER KA WITH STROKE
+<U049E> "<U004B><U0304>";"<U004B><U0060>"
+% CYRILLIC SMALL LETTER KA WITH STROKE
+<U049F> "<U006B><U0304>";"<U006B><U0060>"
+% CYRILLIC CAPITAL LETTER EN WITH DESCENDER
+<U04A2> <U1E46>;"<U004E><U0060>"
+% CYRILLIC SMALL LETTER EN WITH DESCENDER
+<U04A3> <U1E47>;"<U006E><U0060>"
+% CYRILLIC CAPITAL LIGATURE EN GHE
+<U04A4> <U1E44>;"<U004E><U0047>"
+% CYRILLIC SMALL LIGATURE EN GHE
+<U04A5> <U1E45>;"<U006E><U0067>"
+% CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
+<U04A6> <U1E54>;"<U0050><U0060>"
+% CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
+<U04A7> <U1E55>;"<U0070><U0060>"
+% CYRILLIC CAPITAL LETTER ABKHASIAN HA
+<U04A8> <U00D2>;"<U004F><U0060>"
+% CYRILLIC SMALL LETTER ABKHASIAN HA
+<U04A9> <U00F2>;"<U006F><U0060>"
+% CYRILLIC CAPITAL LETTER ES WITH DESCENDER
+<U04AA> <U00C7>;"<U0043><U0060>"
+% CYRILLIC SMALL LETTER ES WITH DESCENDER
+<U04AB> <U00E7>;"<U0063><U0060>"
+% CYRILLIC CAPITAL LETTER TE WITH DESCENDER
+<U04AC> <U0162>;"<U0054><U0060>"
+% CYRILLIC SMALL LETTER TE WITH DESCENDER
+<U04AD> <U0163>;"<U0074><U0060>"
+% CYRILLIC CAPITAL LETTER STRAIGHT U
+<U04AE> <U00D9>;<U0055>
+% CYRILLIC SMALL LETTER STRAIGHT U
+<U04AF> <U00F9>;<U0075>
+% CYRILLIC CAPITAL LETTER HA WITH DESCENDER
+<U04B2> <U1E28>;"<U0048><U0060>"
+% CYRILLIC SMALL LETTER HA WITH DESCENDER
+<U04B3> <U1E29>;"<U0068><U0060>"
+% CYRILLIC CAPITAL LIGATURE TE TSE
+<U04B4> "<U0043><U0304>";"<U0054><U0043><U005A>"
+% CYRILLIC SMALL LIGATURE TE TSE
+<U04B5> "<U0063><U0304>";"<U0074><U0063><U007A>"
+% CYRILLIC CAPITAL LETTER SHHA
+<U04BA> <U1E24>;"<U0053><U0048><U0060>"
+% CYRILLIC SMALL LETTER SHHA
+<U04BB> <U1E25>;"<U0073><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER ABKHASIAN CHE
+<U04BC> "<U0043><U0306>";"<U0043><U0048><U0060>"
+% CYRILLIC SMALL LETTER ABKHASIAN CHE
+<U04BD> "<U0063><U0306>";"<U0063><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
+<U04BE> "<U00C7><U0306>";"<U0043><U0048><U0060>"
+% CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
+<U04BF> "<U00E7><U0306>";"<U0063><U0068><U0060>"
+% CYRILLIC LETTER PALOCHKA
+<U04C0> <U2021>;<U0069>
+% CYRILLIC CAPITAL LETTER ZHE WITH BREVE
+<U04C1> "<U005A><U0306>";"<U005A><U0048><U0060>"
+% CYRILLIC SMALL LETTER ZHE WITH BREVE
+<U04C2> "<U007A><U0306>";"<U007A><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
+<U04CB> <U00C7>;"<U0043><U0048><U0060>"
+% CYRILLIC SMALL LETTER KHAKASSIAN CHE
+<U04CC> <U00E7>;"<U0063><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER A WITH BREVE
+<U04D0> <U0102>;"<U0041><U0060>"
+% CYRILLIC SMALL LETTER A WITH BREVE
+<U04D1> <U0103>;"<U0061><U0060>"
+% CYRILLIC CAPITAL LETTER A WITH DIAERESIS
+<U04D2> <U00C4>;"<U0041><U0060>"
+% CYRILLIC SMALL LETTER A WITH DIAERESIS
+<U04D3> <U00E4>;"<U0061><U0060>"
+% CYRILLIC CAPITAL LETTER IE WITH BREVE
+<U04D6> <U0114>;"<U0045><U0060>"
+% CYRILLIC SMALL LETTER IE WITH BREVE
+<U04D7> <U0115>;"<U0065><U0060>"
+% CYRILLIC CAPITAL LETTER SCHWA
+<U04D8> "<U0041><U030B>";"<U0041><U0060>"
+% CYRILLIC SMALL LETTER SCHWA
+<U04D9> "<U0061><U030B>";"<U0061><U0060>"
+% CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
+<U04DC> "<U005A><U0304>";"<U005A><U0048><U0060>"
+% CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
+<U04DD> "<U007A><U0304>";"<U007A><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
+<U04DE> "<U005A><U0308>";"<U005A><U0060>"
+% CYRILLIC SMALL LETTER ZE WITH DIAERESIS
+<U04DF> "<U007A><U0308>";"<U007A><U0060>"
+% CYRILLIC CAPITAL LETTER ABKHASIAN DZE
+<U04E0> <U0179>;"<U005A><U0060>"
+% CYRILLIC SMALL LETTER ABKHASIAN DZE
+<U04E1> <U017A>;"<U007A><U0060>"
+% CYRILLIC CAPITAL LETTER I WITH DIAERESIS
+<U04E4> <U00CE>;"<U0049><U0060>"
+% CYRILLIC SMALL LETTER I WITH DIAERESIS
+<U04E5> <U00EE>;"<U0069><U0060>"
+% CYRILLIC CAPITAL LETTER O WITH DIAERESIS
+<U04E6> <U00D6>;"<U004F><U0060>"
+% CYRILLIC SMALL LETTER O WITH DIAERESIS
+<U04E7> <U00F6>;"<U006F><U0060>"
+% CYRILLIC CAPITAL LETTER BARRED O
+<U04E8> <U00D4>;"<U004F><U0060>"
+% CYRILLIC SMALL LETTER BARRED O
+<U04E9> <U00F4>;"<U006F><U0060>"
+% CYRILLIC CAPITAL LETTER U WITH DIAERESIS
+<U04F0> <U00DC>;"<U0055><U0060>"
+% CYRILLIC SMALL LETTER U WITH DIAERESIS
+<U04F1> <U00FC>;"<U0075><U0060>"
+% CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
+<U04F2> <U0170>;"<U0055><U0060>"
+% CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
+<U04F3> <U0171>;"<U0075><U0060>"
+% CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
+<U04F4> "<U0043><U0308>";"<U0043><U0048><U0060>"
+% CYRILLIC SMALL LETTER CHE WITH DIAERESIS
+<U04F5> "<U0063><U0308>";"<U0063><U0068><U0060>"
+% CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
+<U04F8> <U0178>;"<U0059><U0060>"
+% CYRILLIC SMALL LETTER YERU WITH DIAERESIS
+<U04F9> <U00FF>;"<U0079><U0060>"
+% RIGHT SINGLE QUOTATION MARK
+<U2019> <U2035>;<U0027>
+
+translit_end
+
+END LC_CTYPE
@@ -62,6 +62,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -48,6 +48,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -46,6 +46,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -49,6 +49,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% those two lettes are not in cp1256...
@@ -65,6 +65,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -53,6 +53,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% dong sign -> d// -> dd
<U20AB> "<U0111>";"<U0064><U0064>"
@@ -54,6 +54,7 @@ LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% A-bole -> A-circonflecse -> AU
<U00C5> "A<U030A>";"A";"AU"
@@ -53,6 +53,7 @@ translit_start
% Accents are simply omitted if they cannot be represented.
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
@@ -64,6 +64,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -60,6 +60,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
% if digraphs are not available (this is the case with iso-8859-8)
% then use the single letters
@@ -40,6 +40,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
@@ -58,6 +58,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
class "hanzi"; /
@@ -68,6 +68,7 @@ copy "i18n"
translit_start
include "translit_combining";""
+include "translit_cyrillic";""
translit_end
END LC_CTYPE
--
2.17.1
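
Note on how the two-option entries in translit_cyrillic are meant to be
read: each line lists candidates in order (System A first, System B
second), and the first candidate representable in the target charset is
used. The following is only a rough Python sketch of that fallback idea,
not glibc's implementation. The TABLE excerpt, the function name
translit and the use of Python codecs to stand in for the iconv target
charset are all assumptions made for illustration.

```python
# Illustrative excerpt of the table (hypothetical Python representation):
# each Cyrillic character maps to an ordered list of candidates,
# System A (Latin with diacritics) first, System B (ASCII) second.
TABLE = {
    "\u0416": ["\u017d", "ZH"],   # CYRILLIC CAPITAL LETTER ZHE
    "\u0436": ["\u017e", "zh"],   # CYRILLIC SMALL LETTER ZHE
    "\u0429": ["\u015c", "SHH"],  # CYRILLIC CAPITAL LETTER SHCHA
    "\u0443": ["u"],              # CYRILLIC SMALL LETTER U
    "\u043a": ["k"],              # CYRILLIC SMALL LETTER KA
}


def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit(text, charset):
    """Replace characters missing from charset by the first candidate
    from TABLE that the charset can represent; fall back to '?'."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)
            continue
        for candidate in TABLE.get(ch, []):
            if encodable(candidate, charset):
                out.append(candidate)
                break
        else:
            # No usable candidate for this character.
            out.append("?")
    return "".join(out)
```

With this sketch, a charset that contains the System A letter keeps it,
while ASCII falls through to the System B spelling. That is the
behaviour the usage examples in the file header are intended to show.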