[v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Message ID a1db6ae3-2847-1482-b849-dd383e8c85aa@kobylkin.com
State Superseded
Commit Message

Egor Kobylkin Jan. 2, 2019, 6:38 p.m. UTC
  Changelog v12:
* Adjusted to the new comment style in the target file
locale/C-translit.h.in (the file changed on the master branch from
/* style to # style since v11).
* Fixed a typo: <U04BB> CYRILLIC SMALL LETTER SHHA is now mapped to
"sh`" instead of the erroneous "SH`" in v11.

Changelog v11:
* Re-targeted the patch against locale/C-translit.h.in as the proper
file for the ASCII translit table.
* Correspondingly the patch now only contains the additional
Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
The 'include "translit_cyrillic";""' directives are not necessary in the
locale files and they are now all left intact.
* Also the file translit_cyrillic is no longer needed and is omitted.
* Edited below email, commit message.

Changelog v10:
* Removed ISO 9:1995 / GOST 7.79-2000 System A (transliteration to Latin
with diacritics) as it conflicts with System B within glibc mechanics and
does not solve BZ #2872
* Edited below email, commit message, comment in translit_cyrillic to
reflect System A removal
* Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
using composition) as composing is not covered by current glibc
conversion mechanics

Changelog v9:
* Fixed formatting (trailing spaces etc.)
* Put commit summary in the patch file, now it is generated completely
by git format-patch

Changelog v8:
* Re-added translit_cyrillic, which was missing from patch v7 (due to a
missing "git add" in the script).

Changelog v7:
* Generated against git://sourceware.org/git/glibc.git master with git
format-patch.
* The 'include "translit_cyrillic";""' now immediately follows last
'include "translit_XXX";""' string (was inserted just before
translit_end previously.)
* Only the locales already having 'include .*translit.*;""' are patched
(see the list for manual exclusions below, full list of included locales
at the end of the email in the commit section.)
* Excluded az_AZ completely to avoid circular reference from tr_TR via
“copy "tr_TR"”.

Changelog v6:
* Locales removed from the patch: C and sd_PK.
* Added locales: az_AZ and ky_KG.
* Consistently transliterate single uppercase Cyrillic letters
   to sequences of all uppercase Latin letters in all languages (whenever
   a Cyrillic letter is transliterated to more than one Latin letter),
   for example "Ї" is now transliterated as "YI" rather than "Yi".

Dear locale maintainers,

To fix glibc bug 2872, "Transliteration Cyrillic -> ASCII fails"

https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]

this patch adds the Cyrillic transliteration rows to
locale/C-translit.h.in.

The patch is attached.


Current bug effect:

The glibc wiki explicitly lists this use case as the test example and
currently it fails on Cyrillic texts [1] [8] [9]:

iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC

CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.

- it produces a string of question marks and spaces.

This is what it should produce, and does once the patch is applied:

CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
chayu.
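
For illustration, the System B mapping behind the expected output above
can be sketched in a few lines of Python. This is only a sketch, not the
patch itself: the table covers just the 33 Russian letters, the name
translit_b is hypothetical, and the letter "ц" is always rendered as "cz"
(a simplifying assumption; GOST 7.79-2000 also permits "c" in some
positions). Upper-case letters become all-upper-case Latin sequences, as
described in the v6 changelog.

```python
# Sketch of GOST 7.79-2000 System B for the Russian alphabet only.
# Assumption: "ц" always maps to "cz"; the real standard is context-aware.
SYSTEM_B = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "yo", "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "x", "ц": "cz",
    "ч": "ch", "ш": "sh", "щ": "shh", "ъ": "``", "ы": "y'", "ь": "`",
    "э": "e`", "ю": "yu", "я": "ya",
}


def translit_b(text):
    """Transliterate Russian text to ASCII (hypothetical helper)."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = SYSTEM_B.get(lower)
        if latin is None:
            # Not a Russian letter: pass it through unchanged.
            out.append(ch)
        elif ch != lower:
            # Upper-case input gives an all-upper-case Latin sequence.
            out.append(latin.upper())
        else:
            out.append(latin)
    return "".join(out)


print(translit_b("Съешь ещё этих мягких французских булок, да выпей же чаю."))
```

Unlike iconv, this sketch passes unknown characters through unchanged
instead of replacing them with "?".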


The root problem and the fix:

The root problem is the missing transliteration table that I am
supplying here.


COMMIT MESSAGE:
This Cyrillic transliteration table enables conversion (e.g. with iconv)
of UTF-8 encoded text written in a Cyrillic alphabet to ASCII//TRANSLIT
text.

Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce an
ASCII-compatible transcription.

While UTF-8 encoded Cyrillic text requires Cyrillic fonts, the result of
a transliteration/transcription contains only Latin/ASCII codes but can
still be read by a native speaker. Among other things, it is useful for
processing Cyrillic texts and filenames by programs or on systems
that are not specifically prepared to work with Cyrillic, don't have
the corresponding fonts installed, or can't handle UTF-8.

The patch content (mapping) is based on the ISO 9:1995 standard [10] and
its derivative GOST 7.79-2000 System B, whose official source is the
Federal Agency on Technical Regulating and Metrology of the Russian
Federation [2]. Technically, an independent but mostly identical source
[3] was used and prepared in a spreadsheet [6].

The transliteration of Cyrillic to ASCII according to GOST 7.79-2000
System B is what is properly called transcription (preserving
phonemes), while System A is transliteration (preserving graphemes).
There is no meaningful way to preserve graphemes when converting Cyrillic
to ASCII, and thus System B is chosen [11]. To be clear, System A has
nothing to do with this bug, even though it is a transliteration.

Those interested in implementing System A for transliteration of
Cyrillic to Latin with diacritics as a new feature are welcome to use the
spreadsheet in [6] as a starting point.

Links:

[1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source
http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
available in low quality gif format)
[3] http://transliteration.ru/gost-7-79-2000/ and
http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with Latin alphabet
https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt
https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
[11]
https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3

Best regards,
Egor Kobylkin
  

Comments

Rafal Luzynski Jan. 5, 2019, 2:35 p.m. UTC | #1
2.01.2019 19:38 Egor Kobylkin <egor@kobylkin.com> wrote:
> 
> Changelog v12:
> [...]
> 
> Changelog v11:
> * Re-targeted the patch against locale/C-translit.h.in as the proper
> file for the ASCII translit table.
> * Correspondingly the patch now only contains the additional
> Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
> The 'include "translit_cyrillic";""' directives are not necessary in the
> locale files and they are now all left intact.
> * Also the file translit_cyrillic is no longer needed and is omitted.
> * Edited below email, commit message.
> [...]

I have tested this and, unfortunately, now this transliteration
works *only* in C locale, that is, only when no locale is set or when
it is explicitly set to C (C.UTF8, POSIX).  It does not work when locale
is set to anything different, including en_US, ru_RU, etc.

I'm sorry for confusing you.  I think that we should either revert to
the older versions of your patch to make all locales supported or
merge those two versions to make the transliteration work both in
C and in all (almost all) other locales.  Unfortunately, the C locale is
not a base for all other locales and is not included in them; it is only a fallback
when a locale does not provide its own data (that is, when it does not
provide any transliteration table at all).
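
The behaviour reported above can be pictured with a toy model. This is
not glibc code; the table contents and the function name are made up for
illustration, and the model only encodes the rule as stated here: the
built-in C table is consulted only when the active locale brings no
transliteration table of its own.

```python
# Toy model of the lookup rule described above (not glibc's implementation).
# Hypothetical stand-in for the built-in table from locale/C-translit.h.in.
C_TABLE = {"ж": "zh", "я": "ya"}


def transliterate(text, locale_table=None):
    """Transliterate to ASCII using exactly one table.

    locale_table is None    -> no locale data, fall back to the C table.
    locale_table is a dict  -> only that table is used; the C table is
                               NOT merged in, so its entries are invisible.
    """
    table = C_TABLE if locale_table is None else locale_table
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                  # already ASCII
        else:
            out.append(table.get(ch, "?"))  # unknown characters become "?"
    return "".join(out)


# C locale (no table of its own): the Cyrillic rows are found.
print(transliterate("жя"))
# A locale with its own table that lacks Cyrillic rows: question marks.
print(transliterate("жя", locale_table={"é": "e"}))
```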

Regards,

Rafal
  
Egor Kobylkin Jan. 5, 2019, 9:12 p.m. UTC | #2
On 05.01.19 15:35, Rafal Luzynski wrote:
> 2.01.2019 19:38 Egor Kobylkin <egor@kobylkin.com> wrote:
>>
>> Changelog v12:
>> [...]
>>
>> Changelog v11:
>> * Re-targeted the patch against locale/C-translit.h.in as the proper
>> file for the ASCII translit table.
>> * Correspondingly the patch now only contains the additional
>> Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
>> The 'include "translit_cyrillic";""' directives are not necessary in the
>> locale files and they are now all left intact.
>> * Also the file translit_cyrillic is no longer needed and is omitted.
>> * Edited below email, commit message.
>> [...]
> 
> I have tested this and, unfortunately, now this transliteration
> works *only* in C locale, that is, only when no locale is set or when
> it is explicitly set to C (C.UTF8, POSIX).  It does not work when locale
> is set to anything different, including en_US, ru_RU, etc.
> 
> I'm sorry for confusing you.  I think that either we should revert back
> to the older versions of your patch to make all locales supported or
> merge those two versions to make the transliteration work both in
> C and in all (almost all) other locales.  Unfortunately, C locale is
> not a base for all other locales and is not included, it is only a fallback
> when a locale does not provide its own data (that is, when it does not
> provide any transliteration table at all).

Good catch! Should we maybe split this into two patches, one for C and 
the other for "country" locales? They have different codes and 
functionality so it looks like it would be easier to keep focus.

My understanding is that locale/C-translit.h.in is still the proper
place for the sole ASCII translit table. It is also the only solution
for many use cases where there is no locale available (not compiled or
not set).
"Country" locales in localedata/locales/ can then have the exact same 
translit table included or they can have any other flavor - I don't see 
a problem here.

Best regards,
Egor
  
Marko Myllynen Jan. 7, 2019, 8:37 p.m. UTC | #3
Hi,

On 05/01/2019 23.12, Egor Kobylkin wrote:
> On 05.01.19 15:35, Rafal Luzynski wrote:
>> 2.01.2019 19:38 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>
>>> Changelog v12:
>>> [...]
>>>
>>> Changelog v11:
>>> * Re-targeted the patch against locale/C-translit.h.in as the proper
>>> file for the ASCII translit table.
>>> * Correspondingly the patch now only contains the additional
>>> Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
>>> The 'include "translit_cyrillic";""' directives are not necessary in the
>>> locale files and they are now all left intact.
>>> * Also the file translit_cyrillic is no longer needed and is omitted.
>>> * Edited below email, commit message.
>>> [...]
>>
>> I have tested this and, unfortunately, now this transliteration
>> works *only* in C locale, that is, only when no locale is set or when
>> it is explicitly set to C (C.UTF8, POSIX).  It does not work when locale
>> is set to anything different, including en_US, ru_RU, etc.
> 
> Good catch! Should we maybe split this into two patches, one for C and
> the other for "country" locales? They have different codes and
> functionality so it looks like it would be easier to keep focus.

That would probably make sense, the standard C/POSIX locale won't
support System A so it also narrows down solution alternatives with it.

(If the C.UTF-8 locale (see
https://sourceware.org/bugzilla/show_bug.cgi?id=17318) materializes one
day, I'm not sure whether transliteration would be applicable in that
context.)

> My understanding is that locale/C-translit.h.in is still the proper
> locale for the sole ASCII translit table. It is also the only solution
> for many use cases where there is no locale available (not compiled or
> not set).

Correct, as Siddhesh mentioned, those rules will end up in the built-in
C/POSIX locale, which is ASCII and will be used if no other locales are
available or set properly. The translit_* files won't affect it.

> "Country" locales in localedata/locales/ can then have the exact same
> translit table included or they can have any other flavor - I don't see
> a problem here.

Indeed, and since those files are not limited to ASCII, perhaps we could
now reconsider the v9 approach for them, i.e., prefer System A if
possible, otherwise use System B / ASCII (just need to make sure that
the ASCII fall-back for them will match the built-in C ASCII rule)?

Thanks,
  
Egor Kobylkin Jan. 9, 2019, 12:46 a.m. UTC | #4
On 07.01.19 21:37, Marko Myllynen wrote:
> Hi,
> 
> On 05/01/2019 23.12, Egor Kobylkin wrote:
>> On 05.01.19 15:35, Rafal Luzynski wrote:
>>> 2.01.2019 19:38 Egor Kobylkin <egor@kobylkin.com> wrote:
>>>>
>>>> Changelog v12:
>>>> [...]
>>>>
>>>> Changelog v11:
>>>> * Re-targeted the patch against locale/C-translit.h.in as the proper
>>>> file for the ASCII translit table.
>>>> * Correspondingly the patch now only contains the additional
>>>> Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
>>>> The 'include "translit_cyrillic";""' directives are not necessary in the
>>>> locale files and they are now all left intact.
>>>> * Also the file translit_cyrillic is no longer needed and is omitted.
>>>> * Edited below email, commit message.
>>>> [...]
>>>
>>> I have tested this and, unfortunately, now this transliteration
>>> works *only* in C locale, that is, only when no locale is set or when
>>> it is explicitly set to C (C.UTF8, POSIX).  It does not work when locale
>>> is set to anything different, including en_US, ru_RU, etc.
>>
>> Good catch! Should we maybe split this into two patches, one for C and
>> the other for "country" locales? They have different codes and
>> functionality so it looks like it would be easier to keep focus.
> 
> That would probably make sense, the standard C/POSIX locale won't
> support System A so it also narrows down solution alternatives with it.
> 

[SNIP]

>> "Country" locales in localedata/locales/ can then have the exact same
>> translit table included or they can have any other flavor - I don't see
>> a problem here.
> 
> Indeed, and since those files are not limited to ASCII, perhaps we could
> now reconsider the v9 approach for them, i.e., prefer System A if
> possible, otherwise use System B / ASCII (just need to make sure that
> the ASCII fall-back for them will match the built-in C ASCII rule)?
> 

Happy to hear the split seems to be a clear-cut one.
How about I rename the "[PATCH v12]...[BZ #2872]" to "[PATCH v1]...
C/POSIX [BZ #2872]" and the "[PATCH v9]" gets its own bug report
(number) and title for clarity in communication?

The bug report for [PATCH v9] ("Countries" locales) should then ideally
state your (and others') explicit requirements regarding the GOST System
A/B fall-back, which countries to include, etc. Again, I have no other
requirement here than to have _any_ translit in place.

This way it would probably be easier to conclude the decision-making
process for each patch separately. We may want to get the v12 POSIX patch
out of the door in 2.30 and then take all the time we need to set up
the rules for "Countries" locales as you need them to be.

Bests,
Egor
  
Marko Myllynen Jan. 9, 2019, 8:03 p.m. UTC | #5
Hi,

On 09/01/2019 02.46, Egor Kobylkin wrote:
> On 07.01.19 21:37, Marko Myllynen wrote:
>> On 05/01/2019 23.12, Egor Kobylkin wrote:
>>>
>>> Good catch! Should we maybe split this into two patches, one for C and
>>> the other for "country" locales? They have different codes and
>>> functionality so it looks like it would be easier to keep focus.
>>
>> That would probably make sense, the standard C/POSIX locale won't
>> support System A so it also narrows down solution alternatives with it.
>>
>>> "Country" locales in localedata/locales/ can then have the exact same
>>> translit table included or they can have any other flavor - I don't see
>>> a problem here.
>>
>> Indeed, and since those files are not limited to ASCII, perhaps we could
>> now reconsider the v9 approach for them, i.e., prefer System A if
>> possible, otherwise use System B / ASCII (just need to make sure that
>> the ASCII fall-back for them will match the built-in C ASCII rule)?
> 
> Happy to hear the split seems to be a clear cut one.
> How about I rename the "[PATCH v12]...[BZ #2872]" to "[PATCH v1]...
> C/POSIX [BZ #2872]" and the "[PATCH v9]" gets its own bug-report
> (number) and title for clarity in communication?

I'm not sure a new BZ is really needed for such an addition; perhaps a
NEWS entry might be more appropriate (with the full details explained in
the commit messages, of course), but I'll leave this to others to decide.

> This way it would probably be easier to have the decision making process
> tied up for both patches (separately). We may want to get the v12 POSIX
> out of the door in 2.30 then and can take all the time we need to set up
> the rules for "Countries" locales as you need them to be.

Perhaps Rafal or Carlos have better suggestions, but I would think we
could have a patch series where:

* patch 1/3 adds the C/POSIX locale part (what you posted as v12);
* patch 2/3 adds translit_cyrillic (based on your v9, so it supports
  ISO 9:1995 / GOST 7.79 System A, with GOST 7.79 System B as a fall-back
  that matches the C/POSIX rules);
* patch 3/3 updates locales to use translit_cyrillic as appropriate.

But as said, Rafal or Carlos may have alternative suggestions, so it
might be best to wait for their feedback before doing anything yet. It's
unfortunate you've had to do so many iterations around this already, but
I think we've all learned something during the process and the end result
will be more correct than any of the earlier versions.

Thanks,
  
Egor Kobylkin Feb. 4, 2019, 7:14 a.m. UTC | #6
Carlos,
would you be comfortable picking this up again this month?

I would really love to have a reliable action plan to get this committed
for 2.30. Maybe cut out a subset that is undisputed and commit only that
first. It looks kind of like an eternal moving target otherwise.

For your reference:
https://sourceware.org/ml/libc-alpha/2019-01/msg00036.html
https://sourceware.org/ml/libc-alpha/2019-01/msg00040.html

Bests,
Egor Kobylkin

  
Marko Myllynen Feb. 14, 2019, 4:48 p.m. UTC | #7
Hi Carlos, Mike, Rafal,

It seems clear that you all are currently too busy to have a look at
this but would you have any estimate when you might be able to review
this so that we could consider merging?

FWIW, I chatted with Egor off-list and we're on the same page wrt the
following; hopefully this gives you a bit of a jump start on this
subject when you have time to dig deeper:

1) The built-in C locale doesn't read/use any translit_* files, can't
have any fallback mechanisms, and only supports ASCII, so using GOST
7.79 System B in locale/C-translit.h.in (as per patch v12) would seem to
be the appropriate way to implement Cyrillic transliteration for the
built-in C locale (it adds some 8 KB to the binary).

2) Other locales read/use translit_* files, and with them fallbacks and
non-ASCII output are possible, so it would seem preferable to first try
ISO 9 / GOST 7.79 System A and, only if that fails, use GOST 7.79 System B
(in which case the end result should match that of the built-in C locale).
For this, the translit_cyrillic file should be added (as per patch v9 plus
the changes mentioned in patches v10 and v12).

3) Individual locale files can then be updated to use translit_cyrillic
as appropriate (see patch v9) and language/national specific conventions
(e.g., SFS 4900 for fi_FI) can be applied on per-locale basis.
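
The preference order in point 2 can be sketched as a list of candidates
per letter, tried in order until one fits the target charset. This is
only an illustration of the idea, not the translit_cyrillic file or
glibc's lookup code: the four-letter table, the helper names, and the use
of Python codecs to test representability are all assumptions.

```python
# Candidates per Cyrillic letter, in order of preference:
# System A (Latin with diacritics) first, System B (pure ASCII) last.
# Only four letters are listed; a real table would cover the whole script.
CANDIDATES = {
    "ж": ["ž", "zh"],
    "ч": ["č", "ch"],
    "ш": ["š", "sh"],
    "щ": ["ŝ", "shh"],
}


def encodable(s, charset):
    """Return True if every character of s exists in the target charset."""
    try:
        s.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def transliterate(text, charset):
    """Pick, per character, the first candidate the charset can represent."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)      # no transliteration needed
            continue
        for cand in CANDIDATES.get(ch, []):
            if encodable(cand, charset):
                out.append(cand)
                break
        else:
            out.append("?")     # no candidate fits
    return "".join(out)


print(transliterate("жщ", "iso8859_2"))  # System A where possible
print(transliterate("жщ", "ascii"))      # falls back to System B
```

With an ASCII target every letter ends up in its System B form, which is
why the result would match the built-in C locale rules from point 1.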

Thanks,

  
Egor Kobylkin March 4, 2019, 10:11 p.m. UTC | #8
ping

  
Egor Kobylkin March 11, 2019, 1:59 p.m. UTC | #9
On 04.03.19 23:11, Egor Kobylkin wrote:
> ping
> 
> On 14.02.19 17:48, Marko Myllynen wrote:
>> Hi Carlos, Mike, Rafal,
>>
>> It seems clear that you all are currently too busy to have a look at
>> this but would you have any estimate when you might be able to review
>> this so that we could consider merging?
>>
>> FWIW, I chatted with Egor off-list and we're on the same page wrt the
>> following, hopefully this gives you a bit of a jump start for this
>> subject when you have time to dig deeper:
>>
>> 1) Built-in C locale doesn't read/use any translit_* files and it can't
>> have any fallback mechanisms and it only supports ASCII so using GOST
>> 7.79 System B in locale/C-translit.h.in (as per patch v12) would seem to
>> be the appropriate way to implement Cyrillic transliteration for the
>> built-in C locale (it adds some 8KB to the binary).
>>
>> 2) Other locales read/use translit_* files and with them fallbacks and
>> non-ASCII are possible so it would seem preferable to first try ISO 9 /
>> GOST 7.79 System A and only if that fails then use GOST 7.79 System B
>> (in which case the end result should match with the built-in C locale).
>> For this the translit_cyrillic file should be added (as per patch v9 +
>> changes mentioned in patches v10 and v12).
>>
>> 3) Individual locale files can then be updated to use translit_cyrillic
>> as appropriate (see patch v9) and language/national specific conventions
>> (e.g., SFS 4900 for fi_FI) can be applied on per-locale basis.
>>
>> Thanks,
>>
>> On 04/02/2019 09.14, Egor Kobylkin wrote:
>>> Carlos,
>>> are you comfortable to pick this up again this month?
>>>
>>> I would really love to have a reliable action plan to get this committed
>>> for 2.30. Maybe cut out a subset that is undisputed and commit only that
>>> first. It looks kinda like an eternal moving target otherwise.
>>>
>>> for your reference:
>>> https://sourceware.org/ml/libc-alpha/2019-01/msg00036.html
>>> https://sourceware.org/ml/libc-alpha/2019-01/msg00040.html
>>>
>>> Bests,
>>> Egor Kobylkin
>>>
>>> On 09.01.19 21:03, Marko Myllynen wrote:
>>>> Hi,
>>>>
>>>> On 09/01/2019 02.46, Egor Kobylkin wrote:
>>>>> On 07.01.19 21:37, Marko Myllynen wrote:
>>>>>> On 05/01/2019 23.12, Egor Kobylkin wrote:
>>>>>>>
>>>>>>> Good catch! Should we maybe split this into two patches, one for 
>>>>>>> C and
>>>>>>> the other for "country" locales? They have different codes and
>>>>>>> functionality so it looks like it would be easier to keep focus.
>>>>>>
>>>>>> That would probably make sense, the standard C/POSIX locale won't
>>>>>> support System A so it also narrows down solution alternatives 
>>>>>> with it.
>>>>>>
>>>>>>> "Country" locales in localedata/locales/ can then have the exact 
>>>>>>> same
>>>>>>> translit table included or they can have any other flavor - I don't
>>>>>>> see
>>>>>>> a problem here.
>>>>>>
>>>>>> Indeed, and since those files are not limited to ASCII, perhaps we
>>>>>> could
>>>>>> now reconsider the v9 approach for them, i.e., prefer System A if
>>>>>> possible, otherwise use System B / ASCII (just need to make sure that
>>>>>> the ASCII fall-back for them will match the built-in C ASCII rule)?
>>>>>
>>>>> Happy to hear the split seems to be a clear cut one.
>>>>> How about I rename the "[PATCH v12]...[BZ #2872]" to "[PATCH v1]...
>>>>> C/POSIX [BZ #2872]" and the "[PATCH v9]" gets its own bug-report
>>>>> (number) and title for clarity in communication?
>>>>
>>>> I'm not sure a new BZ is really needed for such an addition, perhaps a
>>>> NEWS entry might be more appropriate (with the full details 
>>>> explained in
>>>> the commit messages of course) but I'll leave this to others to decide.
>>>>
>>>>> This way it would probably be easier to have the decision making 
>>>>> process
>>>>> tied up for both patches (separately). We may want to get the v12 
>>>>> POSIX
>>>>> out of the door in 2.30 then and can take all the time we need to 
>>>>> set up
>>>>> the rules for "Countries" locales as you need them to be.
>>>>
>>>> Perhaps Rafal or Carlos have better suggestions but I would think we
>>>> could have a patch series where the patch 1/3 adds the C/POSIX locale
>>>> part (that would be what you posted as v12), then patch 2/3 adds
>>>> translit_cyrillic (based on your v9 so supports ISO 9.1995 / GOST 7.79
>>>> System A and GOST 7.79 System B as a fall-back (which would match the
>>>> C/POSIX rules)), and finally the patch 3/3 updates locales to use
>>>> translit_cyrillic as appropriate. But as said, Rafal or Carlos may have
>>>> alternative suggestions so it might be best to wait for their feedback
>>>> before doing anything yet (it's unfortunate you've had to do so many
>>>> iterations around this already but I think we've all learned something
>>>> during the process and the end result will be more correct than any of
>>>> the earlier versions).
>>>>
>>>> Thanks,
>>>>
>>
>>
  
Egor Kobylkin March 14, 2019, 7:48 p.m. UTC | #10
On 11.03.19 14:59, Egor Kobylkin wrote:
> 
> 
> On 04.03.19 23:11, Egor Kobylkin wrote:
>> ping
>>
>> On 14.02.19 17:48, Marko Myllynen wrote:
>>> Hi Carlos, Mike, Rafal,
>>>
>>> It seems clear that you all are currently too busy to have a look at
>>> this but would you have any estimate when you might be able to review
>>> this so that we could consider merging?
>>>
>>> FWIW, I chatted with Egor off-list and we're on the same page wrt the
>>> following, hopefully this gives you a bit of a jump start for this
>>> subject when you have time to dig deeper:
>>>
>>> 1) Built-in C locale doesn't read/use any translit_* files and it can't
>>> have any fallback mechanisms and it only supports ASCII so using GOST
>>> 7.79 System B in locale/C-translit.h.in (as per patch v12) would seem to
>>> be the appropriate way to implement Cyrillic transliteration for the
>>> built-in C locale (it adds some 8KB to the binary).
>>>
>>> 2) Other locales read/use translit_* files and with them fallbacks and
>>> non-ASCII are possible so it would seem preferable to first try ISO 9 /
>>> GOST 7.79 System A and only if that fails then use GOST 7.79 System B
>>> (in which case the end result should match with the built-in C locale).
>>> For this the translit_cyrillic file should be added (as per patch v9 +
>>> changes mentioned in patches v10 and v12).
>>>
>>> 3) Individual locale files can then be updated to use translit_cyrillic
>>> as appropriate (see patch v9) and language/national specific conventions
>>> (e.g., SFS 4900 for fi_FI) can be applied on per-locale basis.
>>>
>>> Thanks,
>>>
>>> On 04/02/2019 09.14, Egor Kobylkin wrote:
>>>> Carlos,
>>>> are you comfortable to pick this up again this month?
>>>>
>>>> I would really love to have a reliable action plan to get this 
>>>> committed
>>>> for 2.30. Maybe cut out a subset that is undisputed and commit only 
>>>> that
>>>> first. It looks kinda like an eternal moving target otherwise.
>>>>
>>>> for your reference:
>>>> https://sourceware.org/ml/libc-alpha/2019-01/msg00036.html
>>>> https://sourceware.org/ml/libc-alpha/2019-01/msg00040.html
>>>>
>>>> Bests,
>>>> Egor Kobylkin
>>>>
>>>> On 09.01.19 21:03, Marko Myllynen wrote:
>>>>> Hi,
>>>>>
>>>>> On 09/01/2019 02.46, Egor Kobylkin wrote:
>>>>>> On 07.01.19 21:37, Marko Myllynen wrote:
>>>>>>> On 05/01/2019 23.12, Egor Kobylkin wrote:
>>>>>>>>
>>>>>>>> Good catch! Should we maybe split this into two patches, one for 
>>>>>>>> C and
>>>>>>>> the other for "country" locales? They have different codes and
>>>>>>>> functionality so it looks like it would be easier to keep focus.
>>>>>>>
>>>>>>> That would probably make sense, the standard C/POSIX locale won't
>>>>>>> support System A so it also narrows down solution alternatives 
>>>>>>> with it.
>>>>>>>
>>>>>>>> "Country" locales in localedata/locales/ can then have the exact 
>>>>>>>> same
>>>>>>>> translit table included or they can have any other flavor - I don't
>>>>>>>> see
>>>>>>>> a problem here.
>>>>>>>
>>>>>>> Indeed, and since those files are not limited to ASCII, perhaps we
>>>>>>> could
>>>>>>> now reconsider the v9 approach for them, i.e., prefer System A if
>>>>>>> possible, otherwise use System B / ASCII (just need to make sure 
>>>>>>> that
>>>>>>> the ASCII fall-back for them will match the built-in C ASCII rule)?
>>>>>>
>>>>>> Happy to hear the split seems to be a clear cut one.
>>>>>> How about I rename the "[PATCH v12]...[BZ #2872]" to "[PATCH v1]...
>>>>>> C/POSIX [BZ #2872]" and the "[PATCH v9]" gets its own bug-report
>>>>>> (number) and title for clarity in communication?
>>>>>
>>>>> I'm not sure a new BZ is really needed for such an addition, perhaps a
>>>>> NEWS entry might be more appropriate (with the full details 
>>>>> explained in
>>>>> the commit messages of course) but I'll leave this to others to 
>>>>> decide.
>>>>>
>>>>>> This way it would probably be easier to have the decision making 
>>>>>> process
>>>>>> tied up for both patches (separately). We may want to get the v12 
>>>>>> POSIX
>>>>>> out of the door in 2.30 then and can take all the time we need to 
>>>>>> set up
>>>>>> the rules for "Countries" locales as you need them to be.
>>>>>
>>>>> Perhaps Rafal or Carlos have better suggestions but I would think we
>>>>> could have a patch series where the patch 1/3 adds the C/POSIX locale
>>>>> part (that would be what you posted as v12), then patch 2/3 adds
>>>>> translit_cyrillic (based on your v9 so supports ISO 9.1995 / GOST 7.79
>>>>> System A and GOST 7.79 System B as a fall-back (which would match the
>>>>> C/POSIX rules)), and finally the patch 3/3 updates locales to use
>>>>> translit_cyrillic as appropriate. But as said, Rafal or Carlos may 
>>>>> have
>>>>> alternative suggestions so it might be best to wait for their feedback
>>>>> before doing anything yet (it's unfortunate you've had to do so many
>>>>> iterations around this already but I think we've all learned something
>>>>> during the process and the end result will be more correct than any of
>>>>> the earlier versions).
>>>>>
>>>>> Thanks,
>>>>>
>>>
>>>
  
Carlos O'Donell April 9, 2019, 1:04 a.m. UTC | #11
On 1/2/19 1:38 PM, Egor Kobylkin wrote:
> Changelog v12:
> * Adjusted to the new comment style suddenly appearing in the target file locale/C-translit.h.in (the original file changed on the master branch from /* style to # style since v11)
> * Fixed a typo for <U04BB> CYRILLIC SMALL LETTER SHHA to be mapped to "sh`" instead of erroneous "SH`" in v11

I have installed this patch and I'm testing some transliterations.

Cheers,
Carlos.
  
Rafal Luzynski April 19, 2019, 10:24 p.m. UTC | #12
Thank you Siddhesh and Carlos for your involvement in testing this
patch, and I apologize to Egor, Marko, and everyone else who needs this
patch pushed for my poor involvement.  I'd like to reply to
this email from Marko because it summarizes all the issues.  I also hope
to explain the problems that got me stuck.

14.02.2019 17:48 Marko Myllynen <myllynen@redhat.com> wrote:
> [...]
> 1) Built-in C locale doesn't read/use any translit_* files and it can't
> have any fallback mechanisms and it only supports ASCII so using GOST
> 7.79 System B in locale/C-translit.h.in (as per patch v12) would seem to
> be the appropriate way to implement Cyrillic transliteration for the
> built-in C locale (it adds some 8KB to the binary).

This sounds like a good idea.

Also, C locale is probably a good way to enforce the plain ASCII
transliteration without any fallback.

> 2) Other locales read/use translit_* files and with them fallbacks and
> non-ASCII are possible so it would seem preferable to first try ISO 9 /
> GOST 7.79 System A

OK, we agree here.

> and only if that fails then use GOST 7.79 System B
> (in which case the end result should match with the built-in C locale).

This is impossible because of the following case.  System A transliterates
the Cyrillic "Х" to Latin "H", System B transliterates it to Latin "X".
Transliteration as implemented in glibc supports a simple fallback
algorithm: transliterate the letter "X" to "YY", but if that is not
available then to "ZZ".  It can't support the complex algorithm which we
need here: transliterate "X" to "YY", but if "Q" cannot be transliterated
to "RR" then transliterate "X" to "ZZ".  In our case we would like to
transliterate "Х" to "X" if "Ш" cannot be transliterated to "Š".  The only
thing we can implement is a fallback transliteration which is similar to
System B but not 100% compatible.

This problem does not arise if we implement only System B in the C locale,
because we already know that "Š" is unavailable, so we always have to
transliterate "Х" to "X".
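
The limitation described above can be illustrated with a small Python
model.  This is only a sketch under stated assumptions: the CANDIDATES
table, encodable() and translit() are hypothetical names invented for
illustration and are not glibc code.  Each character carries its own ordered
candidate list and is resolved without looking at any other character, so
the ASCII result mixes System A and System B instead of matching System B.

```python
# Hypothetical model of per-character fallback; not glibc's implementation.
# Every source character has an ordered list of candidates, and the first
# candidate that the target charset can encode is used.
CANDIDATES = {
    "\u0425": ["H", "X"],        # Cyrillic HA: System A first, then System B
    "\u0428": ["\u0160", "SH"],  # Cyrillic SHA: S with caron, then System B
}


def encodable(text, charset):
    """Return True if text can be encoded in the given Python codec."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def translit(text, charset):
    """Transliterate character by character, each one independently."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)
            continue
        for cand in CANDIDATES.get(ch, []):
            if encodable(cand, charset):
                out.append(cand)
                break
        else:
            out.append("?")
    return "".join(out)


if __name__ == "__main__":
    word = "\u0425\u0428"  # HA followed by SHA
    print(translit(word, "iso8859_2"))  # System A fits in Latin-2
    print(translit(word, "ascii"))      # mixed result, not pure System B
```

In this model the ASCII output for HA followed by SHA is "HSH", while pure
System B would give "XSH".  Getting "X" would require the choice for HA to
depend on whether SHA could be encoded, which a per-character candidate list
cannot express.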

> For this the translit_cyrillic file should be added (as per patch v9 +
> changes mentioned in patches v10 and v12).
> 
> 3) Individual locale files can then be updated to use translit_cyrillic
> as appropriate (see patch v9) and language/national specific conventions
> (e.g., SFS 4900 for fi_FI) can be applied on per-locale basis.

Sometimes I wonder whether any locale other than one for a language
written in the Cyrillic script would really want a Cyrillic
transliteration, but on the other hand, why not.

Also I'd like to reiterate other disagreements which we have here:

1. How to handle upper/lower case in System B?  Should we transliterate
   "Ш" to "SH" or "Sh"?  Should we maybe implement a smart context based
   casing algorithm first?  I mean the algorithm which would detect if
   an uppercase letter appears as the first letter of otherwise lowercase
   word so should be transliterated as "Sh", or maybe it's in a context
   of a fully uppercase word so should be transliterated as "SH".
   I think that uconv implements this algorithm.
2. How to handle ambiguous transliterations like "Схема" -> "Shema"
   vs. "Шема" -> "Shema"? "SHema"?
3. How to handle the characters which are proper letters in Cyrillic
   and have an upper and lower case like a hard and soft sign but are
   transliterated to punctuation characters (grave accent "`")?
   Should we transliterate upper and lower case to the same character
   or should we mark them somehow?  uconv adds Unicode combining low
   line to the grave accent (so the output is "`̲") if the original
   Cyrillic character was uppercase.  But this is unavailable if
   our target charset is ASCII.
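
The context-based casing idea from point 1 can be sketched in Python as
follows.  This is a hypothetical illustration only: the UPPER and LOWER
tables and translit_cased() are invented names, the rule (look only at the
next character) is an assumption chosen for brevity, and it is not a
description of what uconv or glibc actually do.

```python
# Hypothetical sketch of context-sensitive casing for multi-letter outputs.
# Assumed rule: a capital letter followed by a lowercase letter is rendered
# in title case ("Sh"); otherwise it stays fully uppercase ("SH").
UPPER = {"\u0428": "SH", "\u0423": "U", "\u041c": "M"}  # SHA, U, EM
LOWER = {"\u0448": "sh", "\u0443": "u", "\u043c": "m"}  # sha, u, em


def translit_cased(text):
    out = []
    for i, ch in enumerate(text):
        if ch in LOWER:
            out.append(LOWER[ch])
        elif ch in UPPER:
            latin = UPPER[ch]
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # "".islower() is False, so a trailing capital stays uppercase.
            out.append(latin.capitalize() if nxt.islower() else latin)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(translit_cased("\u0428\u0443\u043c"))  # capitalized word
    print(translit_cased("\u0428\u0423\u041c"))  # fully uppercase word
```

Note that this needs one character of lookahead, which a plain
character-to-string table such as locale/C-translit.h.in does not provide.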

Regarding the test cases which I mentioned the other day: I discussed
this with Dmitry and he convinced me that requiring the test cases sets
the bar too high, so I agree we don't need to require them yet.

Regards,

Rafal
  
Siddhesh Poyarekar April 27, 2019, 2:51 a.m. UTC | #13
On 27/04/19 4:19 AM, Diego (Egor) Kobylkin wrote:
> Dear all, 
> I think Rafal is making good points again. And  the best thing is that
> we actually seem to have full consensus from everyone involved about
> current limited ASCII patch V12 (GOST 7.79 System B in
> locale/C-translit.h.in).  
> So let’s just for the time being concentrate on getting this committed? 
> 
> We can get to further issues in the next release and having a base to
> start with will make them much clearer by the contrast of what’s already
> in. 
> 
> Please let me know if you see any entanglement between the V12 patch
> content and other issues listed below. I believe Carlos can test the
> patch in isolation and hopefully have it approved for the next release. 

Please put it as a release blocker:

https://sourceware.org/glibc/wiki/Release/2.30

Siddhesh
  
Egor Kobylkin April 27, 2019, 7:34 a.m. UTC | #14
Thanks, Siddhesh, it's in.

Bests,
Egor Kobylkin

P.S. just for the historians: I have noticed that my quoted message below didn't go to the lists because it was in html format. But I believe all involved have received it directly.


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Saturday, April 27, 2019 4:51 AM, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:

> On 27/04/19 4:19 AM, Diego (Egor) Kobylkin wrote:

> > current limited ASCII patch V12 (GOST 7.79 System B in
> > locale/C-translit.h.in).  
> > So let’s just for the time being concentrate on getting this committed?

>
> Please put it as a release blocker:
>
> https://sourceware.org/glibc/wiki/Release/2.30
>
> Siddhesh
  

Patch

From 46e0d0e3d07805ec853fdd72dc3793995cb5593c Mon Sep 17 00:00:00 2001
From: Egor Kobylkin <egor@kobylkin.com>
Date: Wed, 2 Jan 2019 05:50:13 +0100
Subject: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

	[BZ #2872]
	* locale/C-translit.h.in: Add Cyrillic transliteration.
---
 locale/C-translit.h.in | 169 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 169 insertions(+)

diff --git a/locale/C-translit.h.in b/locale/C-translit.h.in
index d5f00df0f3..758171c394 100644
--- a/locale/C-translit.h.in
+++ b/locale/C-translit.h.in
@@ -56,6 +56,175 @@ 
 "\x02cd"	"_"	# <U02CD> MODIFIER LETTER LOW MACRON
 "\x02d0"	":"	# <U02D0> MODIFIER LETTER TRIANGULAR COLON
 "\x02dc"	"~"	# <U02DC> SMALL TILDE
+"\x0401"	"YO"	# <U0401> CYRILLIC CAPITAL LETTER IO
+"\x0402"	"DJ"	# <U0402> CYRILLIC CAPITAL LETTER DJE
+"\x0403"	"G`"	# <U0403> CYRILLIC CAPITAL LETTER GJE
+"\x0404"	"YE"	# <U0404> CYRILLIC CAPITAL LETTER UKRAINIAN IE
+"\x0405"	"Z`"	# <U0405> CYRILLIC CAPITAL LETTER DZE
+"\x0406"	"I"	# <U0406> CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
+"\x0407"	"YI"	# <U0407> CYRILLIC CAPITAL LETTER YI
+"\x0408"	"J"	# <U0408> CYRILLIC CAPITAL LETTER JE
+"\x0409"	"L`"	# <U0409> CYRILLIC CAPITAL LETTER LJE
+"\x040a"	"N`"	# <U040A> CYRILLIC CAPITAL LETTER NJE
+"\x040b"	"TSH"	# <U040B> CYRILLIC CAPITAL LETTER TSHE
+"\x040c"	"K`"	# <U040C> CYRILLIC CAPITAL LETTER KJE
+"\x040e"	"U`"	# <U040E> CYRILLIC CAPITAL LETTER SHORT U
+"\x040f"	"DH"	# <U040F> CYRILLIC CAPITAL LETTER DZHE
+"\x0410"	"A"	# <U0410> CYRILLIC CAPITAL LETTER A
+"\x0411"	"B"	# <U0411> CYRILLIC CAPITAL LETTER BE
+"\x0412"	"V"	# <U0412> CYRILLIC CAPITAL LETTER VE
+"\x0413"	"G"	# <U0413> CYRILLIC CAPITAL LETTER GHE
+"\x0414"	"D"	# <U0414> CYRILLIC CAPITAL LETTER DE
+"\x0415"	"E"	# <U0415> CYRILLIC CAPITAL LETTER IE
+"\x0416"	"ZH"	# <U0416> CYRILLIC CAPITAL LETTER ZHE
+"\x0417"	"Z"	# <U0417> CYRILLIC CAPITAL LETTER ZE
+"\x0418"	"I"	# <U0418> CYRILLIC CAPITAL LETTER I
+"\x0419"	"J"	# <U0419> CYRILLIC CAPITAL LETTER SHORT I
+"\x041a"	"K"	# <U041A> CYRILLIC CAPITAL LETTER KA
+"\x041b"	"L"	# <U041B> CYRILLIC CAPITAL LETTER EL
+"\x041c"	"M"	# <U041C> CYRILLIC CAPITAL LETTER EM
+"\x041d"	"N"	# <U041D> CYRILLIC CAPITAL LETTER EN
+"\x041e"	"O"	# <U041E> CYRILLIC CAPITAL LETTER O
+"\x041f"	"P"	# <U041F> CYRILLIC CAPITAL LETTER PE
+"\x0420"	"R"	# <U0420> CYRILLIC CAPITAL LETTER ER
+"\x0421"	"S"	# <U0421> CYRILLIC CAPITAL LETTER ES
+"\x0422"	"T"	# <U0422> CYRILLIC CAPITAL LETTER TE
+"\x0423"	"U"	# <U0423> CYRILLIC CAPITAL LETTER U
+"\x0424"	"F"	# <U0424> CYRILLIC CAPITAL LETTER EF
+"\x0425"	"X"	# <U0425> CYRILLIC CAPITAL LETTER HA
+"\x0426"	"CZ"	# <U0426> CYRILLIC CAPITAL LETTER TSE
+"\x0427"	"CH"	# <U0427> CYRILLIC CAPITAL LETTER CHE
+"\x0428"	"SH"	# <U0428> CYRILLIC CAPITAL LETTER SHA
+"\x0429"	"SHH"	# <U0429> CYRILLIC CAPITAL LETTER SHCHA
+"\x042a"	"``"	# <U042A> CYRILLIC CAPITAL LETTER HARD SIGN
+"\x042b"	"Y`"	# <U042B> CYRILLIC CAPITAL LETTER YERU
+"\x042c"	"`"	# <U042C> CYRILLIC CAPITAL LETTER SOFT SIGN
+"\x042d"	"E`"	# <U042D> CYRILLIC CAPITAL LETTER E
+"\x042e"	"YU"	# <U042E> CYRILLIC CAPITAL LETTER YU
+"\x042f"	"YA"	# <U042F> CYRILLIC CAPITAL LETTER YA
+"\x0430"	"a"	# <U0430> CYRILLIC SMALL LETTER A
+"\x0431"	"b"	# <U0431> CYRILLIC SMALL LETTER BE
+"\x0432"	"v"	# <U0432> CYRILLIC SMALL LETTER VE
+"\x0433"	"g"	# <U0433> CYRILLIC SMALL LETTER GHE
+"\x0434"	"d"	# <U0434> CYRILLIC SMALL LETTER DE
+"\x0435"	"e"	# <U0435> CYRILLIC SMALL LETTER IE
+"\x0436"	"zh"	# <U0436> CYRILLIC SMALL LETTER ZHE
+"\x0437"	"z"	# <U0437> CYRILLIC SMALL LETTER ZE
+"\x0438"	"i"	# <U0438> CYRILLIC SMALL LETTER I
+"\x0439"	"j"	# <U0439> CYRILLIC SMALL LETTER SHORT I
+"\x043a"	"k"	# <U043A> CYRILLIC SMALL LETTER KA
+"\x043b"	"l"	# <U043B> CYRILLIC SMALL LETTER EL
+"\x043c"	"m"	# <U043C> CYRILLIC SMALL LETTER EM
+"\x043d"	"n"	# <U043D> CYRILLIC SMALL LETTER EN
+"\x043e"	"o"	# <U043E> CYRILLIC SMALL LETTER O
+"\x043f"	"p"	# <U043F> CYRILLIC SMALL LETTER PE
+"\x0440"	"r"	# <U0440> CYRILLIC SMALL LETTER ER
+"\x0441"	"s"	# <U0441> CYRILLIC SMALL LETTER ES
+"\x0442"	"t"	# <U0442> CYRILLIC SMALL LETTER TE
+"\x0443"	"u"	# <U0443> CYRILLIC SMALL LETTER U
+"\x0444"	"f"	# <U0444> CYRILLIC SMALL LETTER EF
+"\x0445"	"x"	# <U0445> CYRILLIC SMALL LETTER HA
+"\x0446"	"cz"	# <U0446> CYRILLIC SMALL LETTER TSE
+"\x0447"	"ch"	# <U0447> CYRILLIC SMALL LETTER CHE
+"\x0448"	"sh"	# <U0448> CYRILLIC SMALL LETTER SHA
+"\x0449"	"shh"	# <U0449> CYRILLIC SMALL LETTER SHCHA
+"\x044a"	"``"	# <U044A> CYRILLIC SMALL LETTER HARD SIGN
+"\x044b"	"y`"	# <U044B> CYRILLIC SMALL LETTER YERU
+"\x044c"	"`"	# <U044C> CYRILLIC SMALL LETTER SOFT SIGN
+"\x044d"	"e`"	# <U044D> CYRILLIC SMALL LETTER E
+"\x044e"	"yu"	# <U044E> CYRILLIC SMALL LETTER YU
+"\x044f"	"ya"	# <U044F> CYRILLIC SMALL LETTER YA
+"\x0451"	"yo"	# <U0451> CYRILLIC SMALL LETTER IO
+"\x0452"	"dj"	# <U0452> CYRILLIC SMALL LETTER DJE
+"\x0453"	"g`"	# <U0453> CYRILLIC SMALL LETTER GJE
+"\x0454"	"ye"	# <U0454> CYRILLIC SMALL LETTER UKRAINIAN IE
+"\x0455"	"z`"	# <U0455> CYRILLIC SMALL LETTER DZE
+"\x0456"	"i"	# <U0456> CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
+"\x0457"	"yi"	# <U0457> CYRILLIC SMALL LETTER YI
+"\x0458"	"j"	# <U0458> CYRILLIC SMALL LETTER JE
+"\x0459"	"l`"	# <U0459> CYRILLIC SMALL LETTER LJE
+"\x045a"	"n`"	# <U045A> CYRILLIC SMALL LETTER NJE
+"\x045b"	"tsh"	# <U045B> CYRILLIC SMALL LETTER TSHE
+"\x045c"	"k`"	# <U045C> CYRILLIC SMALL LETTER KJE
+"\x045e"	"u`"	# <U045E> CYRILLIC SMALL LETTER SHORT U
+"\x045f"	"dh"	# <U045F> CYRILLIC SMALL LETTER DZHE
+"\x046a"	"O`"	# <U046A> CYRILLIC CAPITAL LETTER BIG YUS
+"\x046b"	"o`"	# <U046B> CYRILLIC SMALL LETTER BIG YUS
+"\x0472"	"FH"	# <U0472> CYRILLIC CAPITAL LETTER FITA
+"\x0473"	"fh"	# <U0473> CYRILLIC SMALL LETTER FITA
+"\x0474"	"YH"	# <U0474> CYRILLIC CAPITAL LETTER IZHITSA
+"\x0475"	"yh"	# <U0475> CYRILLIC SMALL LETTER IZHITSA
+"\x048c"	"E`"	# <U048C> CYRILLIC CAPITAL LETTER SEMISOFT SIGN
+"\x048d"	"e`"	# <U048D> CYRILLIC SMALL LETTER SEMISOFT SIGN
+"\x0490"	"G`"	# <U0490> CYRILLIC CAPITAL LETTER GHE WITH UPTURN
+"\x0491"	"g`"	# <U0491> CYRILLIC SMALL LETTER GHE WITH UPTURN
+"\x0492"	"GH"	# <U0492> CYRILLIC CAPITAL LETTER GHE WITH STROKE
+"\x0493"	"gh"	# <U0493> CYRILLIC SMALL LETTER GHE WITH STROKE
+"\x0494"	"GH"	# <U0494> CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
+"\x0495"	"gh"	# <U0495> CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
+"\x0496"	"ZH`"	# <U0496> CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
+"\x0497"	"zh`"	# <U0497> CYRILLIC SMALL LETTER ZHE WITH DESCENDER
+"\x049a"	"K`"	# <U049A> CYRILLIC CAPITAL LETTER KA WITH DESCENDER
+"\x049b"	"k`"	# <U049B> CYRILLIC SMALL LETTER KA WITH DESCENDER
+"\x049e"	"K`"	# <U049E> CYRILLIC CAPITAL LETTER KA WITH STROKE
+"\x049f"	"k`"	# <U049F> CYRILLIC SMALL LETTER KA WITH STROKE
+"\x04a2"	"N`"	# <U04A2> CYRILLIC CAPITAL LETTER EN WITH DESCENDER
+"\x04a3"	"n`"	# <U04A3> CYRILLIC SMALL LETTER EN WITH DESCENDER
+"\x04a4"	"NG"	# <U04A4> CYRILLIC CAPITAL LIGATURE EN GHE
+"\x04a5"	"ng"	# <U04A5> CYRILLIC SMALL LIGATURE EN GHE
+"\x04a6"	"P`"	# <U04A6> CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
+"\x04a7"	"p`"	# <U04A7> CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
+"\x04a8"	"O`"	# <U04A8> CYRILLIC CAPITAL LETTER ABKHASIAN HA
+"\x04a9"	"o`"	# <U04A9> CYRILLIC SMALL LETTER ABKHASIAN HA
+"\x04aa"	"C`"	# <U04AA> CYRILLIC CAPITAL LETTER ES WITH DESCENDER
+"\x04ab"	"c`"	# <U04AB> CYRILLIC SMALL LETTER ES WITH DESCENDER
+"\x04ac"	"T`"	# <U04AC> CYRILLIC CAPITAL LETTER TE WITH DESCENDER
+"\x04ad"	"t`"	# <U04AD> CYRILLIC SMALL LETTER TE WITH DESCENDER
+"\x04ae"	"U"	# <U04AE> CYRILLIC CAPITAL LETTER STRAIGHT U
+"\x04af"	"u"	# <U04AF> CYRILLIC SMALL LETTER STRAIGHT U
+"\x04b2"	"H`"	# <U04B2> CYRILLIC CAPITAL LETTER HA WITH DESCENDER
+"\x04b3"	"h`"	# <U04B3> CYRILLIC SMALL LETTER HA WITH DESCENDER
+"\x04b4"	"TCZ"	# <U04B4> CYRILLIC CAPITAL LIGATURE TE TSE
+"\x04b5"	"tcz"	# <U04B5> CYRILLIC SMALL LIGATURE TE TSE
+"\x04ba"	"SH`"	# <U04BA> CYRILLIC CAPITAL LETTER SHHA
+"\x04bb"	"sh`"	# <U04BB> CYRILLIC SMALL LETTER SHHA
+"\x04bc"	"CH`"	# <U04BC> CYRILLIC CAPITAL LETTER ABKHASIAN CHE
+"\x04bd"	"ch`"	# <U04BD> CYRILLIC SMALL LETTER ABKHASIAN CHE
+"\x04be"	"CH`"	# <U04BE> CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
+"\x04bf"	"ch`"	# <U04BF> CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
+"\x04c0"	"i"	# <U04C0> CYRILLIC LETTER PALOCHKA
+"\x04c1"	"ZH`"	# <U04C1> CYRILLIC CAPITAL LETTER ZHE WITH BREVE
+"\x04c2"	"zh`"	# <U04C2> CYRILLIC SMALL LETTER ZHE WITH BREVE
+"\x04cb"	"CH`"	# <U04CB> CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
+"\x04cc"	"ch`"	# <U04CC> CYRILLIC SMALL LETTER KHAKASSIAN CHE
+"\x04d0"	"A`"	# <U04D0> CYRILLIC CAPITAL LETTER A WITH BREVE
+"\x04d1"	"a`"	# <U04D1> CYRILLIC SMALL LETTER A WITH BREVE
+"\x04d2"	"A`"	# <U04D2> CYRILLIC CAPITAL LETTER A WITH DIAERESIS
+"\x04d3"	"a`"	# <U04D3> CYRILLIC SMALL LETTER A WITH DIAERESIS
+"\x04d6"	"E`"	# <U04D6> CYRILLIC CAPITAL LETTER IE WITH BREVE
+"\x04d7"	"e`"	# <U04D7> CYRILLIC SMALL LETTER IE WITH BREVE
+"\x04d8"	"A`"	# <U04D8> CYRILLIC CAPITAL LETTER SCHWA
+"\x04d9"	"a`"	# <U04D9> CYRILLIC SMALL LETTER SCHWA
+"\x04dc"	"ZH`"	# <U04DC> CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
+"\x04dd"	"zh`"	# <U04DD> CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
+"\x04de"	"Z`"	# <U04DE> CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
+"\x04df"	"z`"	# <U04DF> CYRILLIC SMALL LETTER ZE WITH DIAERESIS
+"\x04e0"	"Z`"	# <U04E0> CYRILLIC CAPITAL LETTER ABKHASIAN DZE
+"\x04e1"	"z`"	# <U04E1> CYRILLIC SMALL LETTER ABKHASIAN DZE
+"\x04e4"	"I`"	# <U04E4> CYRILLIC CAPITAL LETTER I WITH DIAERESIS
+"\x04e5"	"i`"	# <U04E5> CYRILLIC SMALL LETTER I WITH DIAERESIS
+"\x04e6"	"O`"	# <U04E6> CYRILLIC CAPITAL LETTER O WITH DIAERESIS
+"\x04e7"	"o`"	# <U04E7> CYRILLIC SMALL LETTER O WITH DIAERESIS
+"\x04e8"	"O`"	# <U04E8> CYRILLIC CAPITAL LETTER BARRED O
+"\x04e9"	"o`"	# <U04E9> CYRILLIC SMALL LETTER BARRED O
+"\x04f0"	"U`"	# <U04F0> CYRILLIC CAPITAL LETTER U WITH DIAERESIS
+"\x04f1"	"u`"	# <U04F1> CYRILLIC SMALL LETTER U WITH DIAERESIS
+"\x04f2"	"U`"	# <U04F2> CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
+"\x04f3"	"u`"	# <U04F3> CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
+"\x04f4"	"CH`"	# <U04F4> CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
+"\x04f5"	"ch`"	# <U04F5> CYRILLIC SMALL LETTER CHE WITH DIAERESIS
+"\x04f8"	"Y`"	# <U04F8> CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
+"\x04f9"	"y`"	# <U04F9> CYRILLIC SMALL LETTER YERU WITH DIAERESIS
 "\x2002"	" "	# <U2002> EN SPACE
 "\x2003"	" "	# <U2003> EM SPACE
 "\x2004"	" "	# <U2004> THREE-PER-EM SPACE
-- 
2.17.1
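
The table format used in the patch can be exercised outside glibc with a
small Python sketch.  Everything here is an assumption made for
illustration: SAMPLE copies a handful of entries from the patch, and
load_table() and translit() are hypothetical helpers, not part of glibc or
its build.  The sketch only shows how entries of the form
"\xNNNN" "replacement" map code points to ASCII strings; it says nothing
about how iconv behaves at runtime.

```python
import re

# A few entries in the same textual format as locale/C-translit.h.in.
SAMPLE = r'''
"\x0425"  "X"   # <U0425> CYRILLIC CAPITAL LETTER HA
"\x0428"  "SH"  # <U0428> CYRILLIC CAPITAL LETTER SHA
"\x0431"  "b"   # <U0431> CYRILLIC SMALL LETTER BE
"\x0435"  "e"   # <U0435> CYRILLIC SMALL LETTER IE
"\x043b"  "l"   # <U043B> CYRILLIC SMALL LETTER EL
'''

# One entry: quoted \xNNNN code point, whitespace, quoted replacement.
ENTRY = re.compile(r'"\\x([0-9a-fA-F]{4})"\s+"([^"]*)"')


def load_table(text):
    """Parse table lines (with or without a leading diff '+') into a dict."""
    table = {}
    for line in text.splitlines():
        m = ENTRY.match(line.lstrip("+"))
        if m:
            table[chr(int(m.group(1), 16))] = m.group(2)
    return table


def translit(text, table):
    """Keep ASCII as is; map other characters via the table, else '?'."""
    return "".join(
        ch if ord(ch) < 128 else table.get(ch, "?") for ch in text
    )


if __name__ == "__main__":
    table = load_table(SAMPLE)
    print(translit("\u0425\u043b\u0435\u0431", table))
```

A check like this can catch case mismatches between the capital and small
entries of a pair before the table is built into the library.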