ping [PATCH v12] Locales: Cyrillic -> ASCII transliteration [BZ #2872]

Message ID 7cdd817a-4a47-201a-8eeb-87db324104b3@kobylkin.com
State Committed
Commit Message

Egor Kobylkin March 19, 2019, 10:39 a.m. UTC
  Changelog v12:
* Adjusted to the new comment style suddenly appearing in the target 
file locale/C-translit.h.in (the original file changed on the master 
branch from /* style to # style since v11)
* Fixed a typo for <U04BB> CYRILLIC SMALL LETTER SHHA to be mapped to 
"sh`" instead of erroneous "SH`" in v11

Changelog v11:
* Re-targeted the patch against locale/C-translit.h.in as the proper
file for the ASCII translit table.
* Correspondingly the patch now only contains the additional
Cyrillic-ASCII strings in the format of locale/C-translit.h.in table.
The 'include "translit_cyrillic";""' directives are not necessary in the
locale files and they are now all left intact.
* Also the file translit_cyrillic is no longer needed and is omitted.
* Edited below email, commit message.

Changelog v10:
* Removed ISO 9:1995 GOST 7.79-2000 System A (transliteration to Latin
with diacritics) as conflicting with System B within glibc mechanics and
not solving BZ #2872
* Edited below email, commit message, comment in translit_cyrillic to
reflect System A removal
* Removed <U0423><U0301> and <U0443><U0301> (Cyrillic U with acute,
using composition) as composing is not covered by current glibc
conversion mechanics

Changelog v9:
* Fixed formatting (trailing spaces etc.)
* Put commit summary in the patch file, now it is generated completely
by git format-patch

Changelog v8:
* Re-added missing translit_cyrillic in patch v7 (due to missing "git
add" in the script).

Changelog v7:
* Generated against git://sourceware.org/git/glibc.git master with git
format-patch.
* The 'include "translit_cyrillic";""' now immediately follows last
'include "translit_XXX";""' string (was inserted just before
translit_end previously.)
* Only the locales already having 'include .*translit.*;""' are patched
(see the list for manual exclusions below, full list of included locales
at the end of the email in the commit section.)
* Excluded az_AZ completely to avoid circular reference from tr_TR via
“copy "tr_TR"”.

Changelog v6:
* Locales removed from the patch: C and sd_PK.
* Added locales: az_AZ and ky_KG.
* Consistently transliterate single uppercase Cyrillic letters
    to sequences of all uppercase Latin letters in all languages (whenever
    a Cyrillic letter is transliterated to more than one Latin letter),
    for example "Ї" is now transliterated as "YI" rather than "Yi".
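
The case rule from v6, together with the v12 fix for <U04BB>, can be
checked mechanically. A minimal sketch in Python, assuming the table is
available as a dictionary; the function name case_mismatches and the
sample entries are hypothetical and not part of the patch:

```python
def case_mismatches(table):
    """Return (letter, latin) pairs whose Latin case differs from the
    case of the Cyrillic letter they transliterate."""
    bad = []
    for letter, latin in table.items():
        if letter.isupper() and latin != latin.upper():
            bad.append((letter, latin))
        elif letter.islower() and latin != latin.lower():
            bad.append((letter, latin))
    return bad


# Entries as they looked before the v6 and v12 fixes (illustrative only).
print(case_mismatches({"Ї": "Yi", "һ": "SH`"}))
# The same entries after the fixes.
print(case_mismatches({"Ї": "YI", "һ": "sh`"}))
```

The first call reports both entries, the second reports none. This only
checks case consistency and says nothing about whether a mapping follows
the standard.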

Dear locale maintainers,

To fix glibc bug 2872, "Transliteration Cyrillic -> ASCII fails"

https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]

this patch adds the Cyrillic transliteration rows to locale/C-translit.h.in.

The patch is attached.


Current bug effect:

The glibc wiki explicitly lists this use case as the test example and
currently it fails on Cyrillic texts [1] [8] [9]:

iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt |grep CYRILLIC

CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.

That is, it produces a string of question marks and spaces.

This is what it should produce, and what it does produce once the patch
is applied:

CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe
chayu.


The root problem and the fix:

The root problem is the missing transliteration table that I am
supplying here.
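
To illustrate why a missing table yields question marks, here is a
minimal sketch in Python of a per-character lookup with a '?' fallback.
It is not how glibc implements //TRANSLIT internally; the names TRANSLIT
and to_ascii are hypothetical, and the five rows are copied from the
first Cyrillic lines of the patch:

```python
# Hypothetical miniature of a per-character transliteration table.
TRANSLIT = {
    "Ё": "YO",  # <U0401> CYRILLIC CAPITAL LETTER IO
    "Ђ": "DJ",  # <U0402> CYRILLIC CAPITAL LETTER DJE
    "Ѓ": "G`",  # <U0403> CYRILLIC CAPITAL LETTER GJE
    "Є": "YE",  # <U0404> CYRILLIC CAPITAL LETTER UKRAINIAN IE
    "Ѕ": "Z`",  # <U0405> CYRILLIC CAPITAL LETTER DZE
}


def to_ascii(text, table=TRANSLIT):
    """Keep ASCII as is, replace other characters via the table,
    and fall back to '?' when a character has no table entry."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append(table.get(ch, "?"))
    return "".join(out)


print(to_ascii("ЁЂЃЄЅ"))                     # table entries present
print(to_ascii("CYRILLIC Съешь", table={}))  # empty table, as before the patch
```

With an empty table, every Cyrillic letter falls through to '?', which
matches the shape of the failing output shown above.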


COMMIT MESSAGE:
This Cyrillic transliteration table enables conversion (e.g. with iconv)
of UTF-8 encoded text in the Cyrillic alphabet to ASCII//TRANSLIT text.

Example: iconv -f UTF-8 -t ASCII//TRANSLIT will produce an
ASCII-compatible transcription.

While UTF-encoded Cyrillic text requires Cyrillic fonts, the result of
a transliteration/transcription contains only Latin/ASCII codes but can
still be read by a native speaker. Among other things, it is useful for
processing Cyrillic texts and filenames in programs or on systems
that are not specifically prepared to work with Cyrillic, lack the
corresponding fonts, or cannot handle UTF-8.

The patch content (mapping) is based on the ISO 9:1995 standard [10] and
its derivative GOST 7.79-2000 System B, whose official source is the
Federal Agency on Technical Regulating and Metrology of the Russian
Federation [2]. In practice, an independent but mostly identical source
[3] was used and prepared in a spreadsheet [6].
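
For readers without the spreadsheet, rows in the format used by
locale/C-translit.h.in could also be produced with a few lines of
Python. This is only a sketch and not the method used for the patch; the
function name translit_row is hypothetical, and the tab separators are
an assumption based on how the rows appear in the diff:

```python
import unicodedata


def translit_row(char, latin):
    """Format one mapping in the style of the rows added by the patch:
    "\\xXXXX" <tab> "LATIN" <tab> # <UXXXX> UNICODE NAME"""
    code = ord(char)
    name = unicodedata.name(char)
    return f'"\\x{code:04x}"\t"{latin}"\t# <U{code:04X}> {name}'


for char, latin in [("Ё", "YO"), ("Ђ", "DJ")]:
    print(translit_row(char, latin))
```

Taking the character names from unicodedata avoids typing them by hand,
which is where mismatches between code point and comment tend to creep
in.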

What GOST 7.79-2000 System B defines for Cyrillic to ASCII is, strictly
speaking, a transcription (preserving phonemes), while System A is a
transliteration (preserving graphemes). There is no meaningful way to
preserve graphemes when converting Cyrillic to ASCII, and thus System B
is chosen [11]. To be clear, System A has nothing to do with this bug,
even though it is a transliteration.

Those interested in implementing System A, the transliteration of
Cyrillic to Latin with diacritics, as a new feature are welcome to use
the spreadsheet in [6] as a starting point.

Links:

[1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source
http://protect.gost.ru/document.aspx?control=7&id=130715 (is only
available in low quality gif format)
[3] http://transliteration.ru/gost-7-79-2000/ and
http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with Latin alphabet
https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?bugid=2872&action=viewall&hide_obsolete=1
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt
https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#GOST_7.79_System_B
[11]
https://scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=gslmka8xq3

Best regards,
Egor Kobylkin
  

Comments

Marko Myllynen March 28, 2019, 4:20 p.m. UTC | #1
Ping?

  
Egor Kobylkin April 4, 2019, 7:44 p.m. UTC | #2
Ping?

  
Siddhesh Poyarekar April 6, 2019, 1:36 a.m. UTC | #3
On 05/04/19 1:14 AM, Egor Kobylkin wrote:
> Ping?
> 

I'm committing to looking at this on Monday if nobody gets to it over
the weekend.

Siddhesh
  
Marko Myllynen April 16, 2019, 7:15 a.m. UTC | #4
Ping?

  
Carlos O'Donell April 16, 2019, 1:17 p.m. UTC | #5
On 4/16/19 3:15 AM, Marko Myllynen wrote:
> Ping?

I have this patch applied locally and I'm working through some
comparisons for the transliteration.
  
Egor Kobylkin April 16, 2019, 5:06 p.m. UTC | #6
Just FYI, this is what I was testing: ./testrun.sh /usr/bin/iconv -f UTF-8 
-t ASCII//TRANSLIT <<< 
"ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ 
ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"

And this is the expected result ("" added by myself):
"YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUU?FXCZCHSHSHHA`Y``E`YUYAabvgdezhzijklmnoprstuu?fxczchshshh``y``e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e` 
G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`sh`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'"
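
A note on the "U?" and "u?" pairs in the expected result: they appear to
come from the letters written as U plus a combining acute accent
(<U0423><U0301> and <U0443><U0301>), whose composed mappings were
removed in v10. A small Python sketch, using only the standard
unicodedata module (the helper name code_points is hypothetical), shows
that Unicode normalization does not turn such a pair into a single code
point that a per-character table could match:

```python
import unicodedata


def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# CYRILLIC CAPITAL LETTER U followed by COMBINING ACUTE ACCENT
pair = "\u0423\u0301"
composed = unicodedata.normalize("NFC", pair)

print(code_points(pair))
print(code_points(composed))
```

Both lines print the same two code points, so the base letter is
transliterated on its own and the accent has no table entry.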

Bests,
Egor Kobylkin


On 16.04.19 15:17, Carlos O'Donell wrote:
> On 4/16/19 3:15 AM, Marko Myllynen wrote:
>> Ping?
> 
> I have this patch applied locally and I'm working through some
> comparisons for the transliteration.
> 
>
  
Carlos O'Donell April 16, 2019, 5:58 p.m. UTC | #7
On 4/16/19 1:06 PM, Egor Kobylkin wrote:
> Just FYI, this what I was testing: ./testrun.sh /usr/bin/iconv -f UTF-8 -t ASCII//TRANSLIT <<< "ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"
> 
> And this is the expected result ("" added by myself):
> "YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUU?FXCZCHSHSHHA`Y``E`YUYAabvgdezhzijklmnoprstuu?fxczchshshh``y``e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e` G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`sh`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'"

Thanks.

I was using CyrTranslit (a Python transliterator) to review other work done in this area,
but it wasn't very fruitful.

$ python3
Python 3.7.3 (default, Mar 27 2019, 13:36:35)
[GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cyrtranslit
>>> cyrtranslit.supported()
dict_keys(['sr', 'me', 'mk', 'ru'])
>>> cyrtranslit.to_latin("ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’")
'ЁĐЃЄЅІЇJLjNjĆЌЎDžABVGDEŽZIЙKLMNOPRSTUÚFHCČŠЩЪЫЬЭЮЯabvgdežziйklmnoprstuúfhcčšщъыьэюяёđѓєѕіїjljnjćќўdžѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’'
>>> 

Which doesn't give a good transliteration.

But the table is better:
https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py#L138-L155

Ё -> YO.

Which is a good cross-check for me.
  
Egor Kobylkin April 16, 2019, 6:41 p.m. UTC | #8
On 16.04.19 19:58, Carlos O'Donell wrote:
> On 4/16/19 1:06 PM, Egor Kobylkin wrote:
>> Just FYI, this what I was testing: ./testrun.sh /usr/bin/iconv -f 
>> UTF-8 -t ASCII//TRANSLIT <<< 
>> "ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍ 
>> ҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’"
>>
>> And this is the expected result ("" added by myself):
>> "YODJG`YEZ`IYIJL`N`TSHK`U`DHABVGDEZHZIJKLMNOPRSTUU?FXCZCHSHSHHA`Y``E`YUYAabvgdezhzijklmnoprstuu?fxczchshshh``y``e`yuyayodjg`yez`iyijl`n`tshk`u`dhO`o`FHfhYHyhE`e` 
>> G`g`GHghGHghZH`zh`K`k`K`k`N`n`NGngP`p`O`o`C`C`T`t`UuH`h`TCZtczSH`sh`CH`ch`CH`ch`iZH`zh`CH`ch`A`a`A`a`E`e`A`a`ZH`zh`Z`z`Z`z`I`i`O`o`O`o`U`u`U`u`CH`ch`Y`y`'" 
>>
> 
> Thanks.
> 
> I was using CyrTranslit (python translater) to review other work done in 
> this area,
> but it wasn't very fruitful.
> 
> $ python3
> Python 3.7.3 (default, Mar 27 2019, 13:36:35)
> [GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import cyrtranslit
>>>> cyrtranslit.supported()
> dict_keys(['sr', 'me', 'mk', 'ru'])
>>>> cyrtranslit.to_latin("ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’") 
>>>>
> 'ЁĐЃЄЅІЇJLjNjĆЌЎDžABVGDEŽZIЙKLMNOPRSTUÚFHCČŠЩЪЫЬЭЮЯabvgdežziйklmnoprstuúfhcčšщъыьэюяёđѓєѕіїjljnjćќўdžѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’' 
> 
>>>>
> 
> "ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУУ́ФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуу́фхцчшщъыьэюяёђѓєѕіїјљњћќўџѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’" 
> 
> 'ЁĐЃЄЅІЇJLjNjĆЌЎDžABVGDEŽZIЙKLMNOPRSTUÚFHCČŠЩЪЫЬЭЮЯabvgdežziйklmnoprstuúfhcčšщъыьэюяёđѓєѕіїjljnjćќўdžѪѫѲѳѴѵҌҍҐґҒғҔҕҖҗҚқҞҟҢңҤҥҦҧҨҩҪҫҬҭҮүҲҳҴҵҺһҼҽҾҿӀӁӂӋӌӐӑӒӓӖӗӘәӜӝӞӟӠӡӤӥӦӧӨөӰӱӲӳӴӵӸӹ’' 
> 
> 
> Which doesn't give a good transliteration.

I guess the reason is that it uses the first key in your list, 'sr', 
which stands for Serbian. Serbian doesn't have the characters that 
were left untransliterated ("Щ", for example).

> But the table is better:
> https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py#L138-L155 
> 
> 
> Ё -> YO.
> 
> Which is a good cross-check for me.

Yet the closest one from that codebase should be this 
https://github.com/opendatakosovo/cyrillic-transliteration/blob/master/cyrtranslit/mapping.py#L88

This is exactly why we had 12 iterations of this patch: we wanted 
to cover the most complete yet workable standard for the table. What we 
reference in the bug memo is the actual accepted standard. It is 
combined with the extended standard for further, outdated Cyrillic letters.

Bests,
Egor Kobylkin
  
Carlos O'Donell April 16, 2019, 7:06 p.m. UTC | #9
On 4/16/19 2:41 PM, Egor Kobylkin wrote:
> It is exactly the reason we had 12 iterations on this patch - we
> wanted to cover the most complete yet workable standard for the
> table. What we reference in the bug memo is the actual accepted
> standard. It is coalesced with the extended standard for further
> outdated Cyrillic letters.

I agree, and this is what makes review complicated and time
consuming. I'm relying on you as the expert, and my goal is only
to spot check for any inconsistencies.
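
One kind of inconsistency that lends itself to an automated spot check is
a SMALL LETTER row whose ASCII string is not lowercase, as with the v11
typo for <U04BB> mentioned in the v12 changelog. The following is a
minimal sketch, assuming rows in the locale/C-translit.h.in format; the
names SAMPLE_ROWS, ROW and case_mismatches are illustrative and not part
of glibc.

```python
import re

# Rows in the locale/C-translit.h.in table format. The last row reproduces
# the v11 typo for <U04BB> that the v12 changelog says was fixed.
SAMPLE_ROWS = r'''
"\x0428"  "SH"   # <U0428> CYRILLIC CAPITAL LETTER SHA
"\x0448"  "sh"   # <U0448> CYRILLIC SMALL LETTER SHA
"\x04ba"  "SH`"  # <U04BA> CYRILLIC CAPITAL LETTER SHHA
"\x04bb"  "SH`"  # <U04BB> CYRILLIC SMALL LETTER SHHA
'''

# "\xNNNN" <whitespace> "ascii" <whitespace> # <UNNNN> CHARACTER NAME
ROW = re.compile(r'"\\x([0-9a-f]{4})"\s+"([^"]*)"\s+#\s+<U[0-9A-F]{4}>\s+(.*)')


def case_mismatches(table_text):
    """Return code points whose ASCII output disagrees with the letter's case."""
    mismatches = []
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything that is not a table row
        code, ascii_out, name = match.groups()
        if " SMALL " in name and ascii_out != ascii_out.lower():
            mismatches.append(code)
        elif " CAPITAL " in name and ascii_out != ascii_out.upper():
            mismatches.append(code)
    return mismatches


print(case_mismatches(SAMPLE_ROWS))  # -> ['04bb']
```

A check like this only covers letter case. Whether a given ASCII string
is the right one under the referenced standard still needs a human
reviewer.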
  
Marko Myllynen May 10, 2019, 12:19 p.m. UTC | #10
Hi Carlos,

On 16/04/2019 22.06, Carlos O'Donell wrote:
> On 4/16/19 2:41 PM, Egor Kobylkin wrote:
>> It is exactly the reason we had 12 iterations on this patch - we
>> wanted to cover the most complete yet workable standard for the
>> table. What we reference in the bug memo is the actual accepted
>> standard. It is coalesced with the extended standard for further
>> outdated Cyrillic letters.
> 
> I agree, and this is what makes review complicated and time
> consuming. I'm relying on you as the expert, and my goal is only
> to spot check for any inconsistencies.

I know you've been very busy with everything else, but did you happen to
have a chance to check this further? Shall we still wait for your
results, or how would you suggest we proceed?

Thanks,
  

Patch

From 46e0d0e3d07805ec853fdd72dc3793995cb5593c Mon Sep 17 00:00:00 2001
From: Egor Kobylkin <egor@kobylkin.com>
Date: Wed, 2 Jan 2019 05:50:13 +0100
Subject: [PATCH] Locales: Cyrillic -> ASCII transliteration table [BZ #2872]

	[BZ #2872]
	* locale/C-translit.h.in: Add Cyrillic transliteration.
---
 locale/C-translit.h.in | 169 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 169 insertions(+)

diff --git a/locale/C-translit.h.in b/locale/C-translit.h.in
index d5f00df0f3..758171c394 100644
--- a/locale/C-translit.h.in
+++ b/locale/C-translit.h.in
@@ -56,6 +56,175 @@ 
 "\x02cd"	"_"	# <U02CD> MODIFIER LETTER LOW MACRON
 "\x02d0"	":"	# <U02D0> MODIFIER LETTER TRIANGULAR COLON
 "\x02dc"	"~"	# <U02DC> SMALL TILDE
+"\x0401"	"YO"	# <U0401> CYRILLIC CAPITAL LETTER IO
+"\x0402"	"DJ"	# <U0402> CYRILLIC CAPITAL LETTER DJE
+"\x0403"	"G`"	# <U0403> CYRILLIC CAPITAL LETTER GJE
+"\x0404"	"YE"	# <U0404> CYRILLIC CAPITAL LETTER UKRAINIAN IE
+"\x0405"	"Z`"	# <U0405> CYRILLIC CAPITAL LETTER DZE
+"\x0406"	"I"	# <U0406> CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
+"\x0407"	"YI"	# <U0407> CYRILLIC CAPITAL LETTER YI
+"\x0408"	"J"	# <U0408> CYRILLIC CAPITAL LETTER JE
+"\x0409"	"L`"	# <U0409> CYRILLIC CAPITAL LETTER LJE
+"\x040a"	"N`"	# <U040A> CYRILLIC CAPITAL LETTER NJE
+"\x040b"	"TSH"	# <U040B> CYRILLIC CAPITAL LETTER TSHE
+"\x040c"	"K`"	# <U040C> CYRILLIC CAPITAL LETTER KJE
+"\x040e"	"U`"	# <U040E> CYRILLIC CAPITAL LETTER SHORT U
+"\x040f"	"DH"	# <U040F> CYRILLIC CAPITAL LETTER DZHE
+"\x0410"	"A"	# <U0410> CYRILLIC CAPITAL LETTER A
+"\x0411"	"B"	# <U0411> CYRILLIC CAPITAL LETTER BE
+"\x0412"	"V"	# <U0412> CYRILLIC CAPITAL LETTER VE
+"\x0413"	"G"	# <U0413> CYRILLIC CAPITAL LETTER GHE
+"\x0414"	"D"	# <U0414> CYRILLIC CAPITAL LETTER DE
+"\x0415"	"E"	# <U0415> CYRILLIC CAPITAL LETTER IE
+"\x0416"	"ZH"	# <U0416> CYRILLIC CAPITAL LETTER ZHE
+"\x0417"	"Z"	# <U0417> CYRILLIC CAPITAL LETTER ZE
+"\x0418"	"I"	# <U0418> CYRILLIC CAPITAL LETTER I
+"\x0419"	"J"	# <U0419> CYRILLIC CAPITAL LETTER SHORT I
+"\x041a"	"K"	# <U041A> CYRILLIC CAPITAL LETTER KA
+"\x041b"	"L"	# <U041B> CYRILLIC CAPITAL LETTER EL
+"\x041c"	"M"	# <U041C> CYRILLIC CAPITAL LETTER EM
+"\x041d"	"N"	# <U041D> CYRILLIC CAPITAL LETTER EN
+"\x041e"	"O"	# <U041E> CYRILLIC CAPITAL LETTER O
+"\x041f"	"P"	# <U041F> CYRILLIC CAPITAL LETTER PE
+"\x0420"	"R"	# <U0420> CYRILLIC CAPITAL LETTER ER
+"\x0421"	"S"	# <U0421> CYRILLIC CAPITAL LETTER ES
+"\x0422"	"T"	# <U0422> CYRILLIC CAPITAL LETTER TE
+"\x0423"	"U"	# <U0423> CYRILLIC CAPITAL LETTER U
+"\x0424"	"F"	# <U0424> CYRILLIC CAPITAL LETTER EF
+"\x0425"	"X"	# <U0425> CYRILLIC CAPITAL LETTER HA
+"\x0426"	"CZ"	# <U0426> CYRILLIC CAPITAL LETTER TSE
+"\x0427"	"CH"	# <U0427> CYRILLIC CAPITAL LETTER CHE
+"\x0428"	"SH"	# <U0428> CYRILLIC CAPITAL LETTER SHA
+"\x0429"	"SHH"	# <U0429> CYRILLIC CAPITAL LETTER SHCHA
+"\x042a"	"``"	# <U042A> CYRILLIC CAPITAL LETTER HARD SIGN
+"\x042b"	"Y`"	# <U042B> CYRILLIC CAPITAL LETTER YERU
+"\x042c"	"`"	# <U042C> CYRILLIC CAPITAL LETTER SOFT SIGN
+"\x042d"	"E`"	# <U042D> CYRILLIC CAPITAL LETTER E
+"\x042e"	"YU"	# <U042E> CYRILLIC CAPITAL LETTER YU
+"\x042f"	"YA"	# <U042F> CYRILLIC CAPITAL LETTER YA
+"\x0430"	"a"	# <U0430> CYRILLIC SMALL LETTER A
+"\x0431"	"b"	# <U0431> CYRILLIC SMALL LETTER BE
+"\x0432"	"v"	# <U0432> CYRILLIC SMALL LETTER VE
+"\x0433"	"g"	# <U0433> CYRILLIC SMALL LETTER GHE
+"\x0434"	"d"	# <U0434> CYRILLIC SMALL LETTER DE
+"\x0435"	"e"	# <U0435> CYRILLIC SMALL LETTER IE
+"\x0436"	"zh"	# <U0436> CYRILLIC SMALL LETTER ZHE
+"\x0437"	"z"	# <U0437> CYRILLIC SMALL LETTER ZE
+"\x0438"	"i"	# <U0438> CYRILLIC SMALL LETTER I
+"\x0439"	"j"	# <U0439> CYRILLIC SMALL LETTER SHORT I
+"\x043a"	"k"	# <U043A> CYRILLIC SMALL LETTER KA
+"\x043b"	"l"	# <U043B> CYRILLIC SMALL LETTER EL
+"\x043c"	"m"	# <U043C> CYRILLIC SMALL LETTER EM
+"\x043d"	"n"	# <U043D> CYRILLIC SMALL LETTER EN
+"\x043e"	"o"	# <U043E> CYRILLIC SMALL LETTER O
+"\x043f"	"p"	# <U043F> CYRILLIC SMALL LETTER PE
+"\x0440"	"r"	# <U0440> CYRILLIC SMALL LETTER ER
+"\x0441"	"s"	# <U0441> CYRILLIC SMALL LETTER ES
+"\x0442"	"t"	# <U0442> CYRILLIC SMALL LETTER TE
+"\x0443"	"u"	# <U0443> CYRILLIC SMALL LETTER U
+"\x0444"	"f"	# <U0444> CYRILLIC SMALL LETTER EF
+"\x0445"	"x"	# <U0445> CYRILLIC SMALL LETTER HA
+"\x0446"	"cz"	# <U0446> CYRILLIC SMALL LETTER TSE
+"\x0447"	"ch"	# <U0447> CYRILLIC SMALL LETTER CHE
+"\x0448"	"sh"	# <U0448> CYRILLIC SMALL LETTER SHA
+"\x0449"	"shh"	# <U0449> CYRILLIC SMALL LETTER SHCHA
+"\x044a"	"``"	# <U044A> CYRILLIC SMALL LETTER HARD SIGN
+"\x044b"	"y`"	# <U044B> CYRILLIC SMALL LETTER YERU
+"\x044c"	"`"	# <U044C> CYRILLIC SMALL LETTER SOFT SIGN
+"\x044d"	"e`"	# <U044D> CYRILLIC SMALL LETTER E
+"\x044e"	"yu"	# <U044E> CYRILLIC SMALL LETTER YU
+"\x044f"	"ya"	# <U044F> CYRILLIC SMALL LETTER YA
+"\x0451"	"yo"	# <U0451> CYRILLIC SMALL LETTER IO
+"\x0452"	"dj"	# <U0452> CYRILLIC SMALL LETTER DJE
+"\x0453"	"g`"	# <U0453> CYRILLIC SMALL LETTER GJE
+"\x0454"	"ye"	# <U0454> CYRILLIC SMALL LETTER UKRAINIAN IE
+"\x0455"	"z`"	# <U0455> CYRILLIC SMALL LETTER DZE
+"\x0456"	"i"	# <U0456> CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
+"\x0457"	"yi"	# <U0457> CYRILLIC SMALL LETTER YI
+"\x0458"	"j"	# <U0458> CYRILLIC SMALL LETTER JE
+"\x0459"	"l`"	# <U0459> CYRILLIC SMALL LETTER LJE
+"\x045a"	"n`"	# <U045A> CYRILLIC SMALL LETTER NJE
+"\x045b"	"tsh"	# <U045B> CYRILLIC SMALL LETTER TSHE
+"\x045c"	"k`"	# <U045C> CYRILLIC SMALL LETTER KJE
+"\x045e"	"u`"	# <U045E> CYRILLIC SMALL LETTER SHORT U
+"\x045f"	"dh"	# <U045F> CYRILLIC SMALL LETTER DZHE
+"\x046a"	"O`"	# <U046A> CYRILLIC CAPITAL LETTER BIG YUS
+"\x046b"	"o`"	# <U046B> CYRILLIC SMALL LETTER BIG YUS
+"\x0472"	"FH"	# <U0472> CYRILLIC CAPITAL LETTER FITA
+"\x0473"	"fh"	# <U0473> CYRILLIC SMALL LETTER FITA
+"\x0474"	"YH"	# <U0474> CYRILLIC CAPITAL LETTER IZHITSA
+"\x0475"	"yh"	# <U0475> CYRILLIC SMALL LETTER IZHITSA
+"\x048c"	"E`"	# <U048C> CYRILLIC CAPITAL LETTER SEMISOFT SIGN
+"\x048d"	"e`"	# <U048D> CYRILLIC SMALL LETTER SEMISOFT SIGN
+"\x0490"	"G`"	# <U0490> CYRILLIC CAPITAL LETTER GHE WITH UPTURN
+"\x0491"	"g`"	# <U0491> CYRILLIC SMALL LETTER GHE WITH UPTURN
+"\x0492"	"GH"	# <U0492> CYRILLIC CAPITAL LETTER GHE WITH STROKE
+"\x0493"	"gh"	# <U0493> CYRILLIC SMALL LETTER GHE WITH STROKE
+"\x0494"	"GH"	# <U0494> CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
+"\x0495"	"gh"	# <U0495> CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
+"\x0496"	"ZH`"	# <U0496> CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
+"\x0497"	"zh`"	# <U0497> CYRILLIC SMALL LETTER ZHE WITH DESCENDER
+"\x049a"	"K`"	# <U049A> CYRILLIC CAPITAL LETTER KA WITH DESCENDER
+"\x049b"	"k`"	# <U049B> CYRILLIC SMALL LETTER KA WITH DESCENDER
+"\x049e"	"K`"	# <U049E> CYRILLIC CAPITAL LETTER KA WITH STROKE
+"\x049f"	"k`"	# <U049F> CYRILLIC SMALL LETTER KA WITH STROKE
+"\x04a2"	"N`"	# <U04A2> CYRILLIC CAPITAL LETTER EN WITH DESCENDER
+"\x04a3"	"n`"	# <U04A3> CYRILLIC SMALL LETTER EN WITH DESCENDER
+"\x04a4"	"NG"	# <U04A4> CYRILLIC CAPITAL LIGATURE EN GHE
+"\x04a5"	"ng"	# <U04A5> CYRILLIC SMALL LIGATURE EN GHE
+"\x04a6"	"P`"	# <U04A6> CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
+"\x04a7"	"p`"	# <U04A7> CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
+"\x04a8"	"O`"	# <U04A8> CYRILLIC CAPITAL LETTER ABKHASIAN HA
+"\x04a9"	"o`"	# <U04A9> CYRILLIC SMALL LETTER ABKHASIAN HA
+"\x04aa"	"C`"	# <U04AA> CYRILLIC CAPITAL LETTER ES WITH DESCENDER
+"\x04ab"	"c`"	# <U04AB> CYRILLIC SMALL LETTER ES WITH DESCENDER
+"\x04ac"	"T`"	# <U04AC> CYRILLIC CAPITAL LETTER TE WITH DESCENDER
+"\x04ad"	"t`"	# <U04AD> CYRILLIC SMALL LETTER TE WITH DESCENDER
+"\x04ae"	"U"	# <U04AE> CYRILLIC CAPITAL LETTER STRAIGHT U
+"\x04af"	"u"	# <U04AF> CYRILLIC SMALL LETTER STRAIGHT U
+"\x04b2"	"H`"	# <U04B2> CYRILLIC CAPITAL LETTER HA WITH DESCENDER
+"\x04b3"	"h`"	# <U04B3> CYRILLIC SMALL LETTER HA WITH DESCENDER
+"\x04b4"	"TCZ"	# <U04B4> CYRILLIC CAPITAL LIGATURE TE TSE
+"\x04b5"	"tcz"	# <U04B5> CYRILLIC SMALL LIGATURE TE TSE
+"\x04ba"	"SH`"	# <U04BA> CYRILLIC CAPITAL LETTER SHHA
+"\x04bb"	"sh`"	# <U04BB> CYRILLIC SMALL LETTER SHHA
+"\x04bc"	"CH`"	# <U04BC> CYRILLIC CAPITAL LETTER ABKHASIAN CHE
+"\x04bd"	"ch`"	# <U04BD> CYRILLIC SMALL LETTER ABKHASIAN CHE
+"\x04be"	"CH`"	# <U04BE> CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
+"\x04bf"	"ch`"	# <U04BF> CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
+"\x04c0"	"i"	# <U04C0> CYRILLIC LETTER PALOCHKA
+"\x04c1"	"ZH`"	# <U04C1> CYRILLIC CAPITAL LETTER ZHE WITH BREVE
+"\x04c2"	"zh`"	# <U04C2> CYRILLIC SMALL LETTER ZHE WITH BREVE
+"\x04cb"	"CH`"	# <U04CB> CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
+"\x04cc"	"ch`"	# <U04CC> CYRILLIC SMALL LETTER KHAKASSIAN CHE
+"\x04d0"	"A`"	# <U04D0> CYRILLIC CAPITAL LETTER A WITH BREVE
+"\x04d1"	"a`"	# <U04D1> CYRILLIC SMALL LETTER A WITH BREVE
+"\x04d2"	"A`"	# <U04D2> CYRILLIC CAPITAL LETTER A WITH DIAERESIS
+"\x04d3"	"a`"	# <U04D3> CYRILLIC SMALL LETTER A WITH DIAERESIS
+"\x04d6"	"E`"	# <U04D6> CYRILLIC CAPITAL LETTER IE WITH BREVE
+"\x04d7"	"e`"	# <U04D7> CYRILLIC SMALL LETTER IE WITH BREVE
+"\x04d8"	"A`"	# <U04D8> CYRILLIC CAPITAL LETTER SCHWA
+"\x04d9"	"a`"	# <U04D9> CYRILLIC SMALL LETTER SCHWA
+"\x04dc"	"ZH`"	# <U04DC> CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
+"\x04dd"	"zh`"	# <U04DD> CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
+"\x04de"	"Z`"	# <U04DE> CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
+"\x04df"	"z`"	# <U04DF> CYRILLIC SMALL LETTER ZE WITH DIAERESIS
+"\x04e0"	"Z`"	# <U04E0> CYRILLIC CAPITAL LETTER ABKHASIAN DZE
+"\x04e1"	"z`"	# <U04E1> CYRILLIC SMALL LETTER ABKHASIAN DZE
+"\x04e4"	"I`"	# <U04E4> CYRILLIC CAPITAL LETTER I WITH DIAERESIS
+"\x04e5"	"i`"	# <U04E5> CYRILLIC SMALL LETTER I WITH DIAERESIS
+"\x04e6"	"O`"	# <U04E6> CYRILLIC CAPITAL LETTER O WITH DIAERESIS
+"\x04e7"	"o`"	# <U04E7> CYRILLIC SMALL LETTER O WITH DIAERESIS
+"\x04e8"	"O`"	# <U04E8> CYRILLIC CAPITAL LETTER BARRED O
+"\x04e9"	"o`"	# <U04E9> CYRILLIC SMALL LETTER BARRED O
+"\x04f0"	"U`"	# <U04F0> CYRILLIC CAPITAL LETTER U WITH DIAERESIS
+"\x04f1"	"u`"	# <U04F1> CYRILLIC SMALL LETTER U WITH DIAERESIS
+"\x04f2"	"U`"	# <U04F2> CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
+"\x04f3"	"u`"	# <U04F3> CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
+"\x04f4"	"CH`"	# <U04F4> CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
+"\x04f5"	"ch`"	# <U04F5> CYRILLIC SMALL LETTER CHE WITH DIAERESIS
+"\x04f8"	"Y`"	# <U04F8> CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
+"\x04f9"	"y`"	# <U04F9> CYRILLIC SMALL LETTER YERU WITH DIAERESIS
 "\x2002"	" "	# <U2002> EN SPACE
 "\x2003"	" "	# <U2003> EM SPACE
 "\x2004"	" "	# <U2004> THREE-PER-EM SPACE
-- 
2.17.1
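
To see what the table does to running text, here is a minimal sketch
that applies a handful of the rows above with Python's str.translate.
This is only an illustration of the mapping: glibc consumes the table
through its own build step and conversion code, which the sketch does
not model, and CYRILLIC_TO_ASCII and translit_sketch are illustrative
names.

```python
# A few entries copied from the table in the patch, keyed by code point.
CYRILLIC_TO_ASCII = {
    0x0429: "SHH",  # CYRILLIC CAPITAL LETTER SHCHA
    0x0430: "a",    # CYRILLIC SMALL LETTER A
    0x0436: "zh",   # CYRILLIC SMALL LETTER ZHE
    0x0438: "i",    # CYRILLIC SMALL LETTER I
    0x043A: "k",    # CYRILLIC SMALL LETTER KA
    0x0443: "u",    # CYRILLIC SMALL LETTER U
    0x0451: "yo",   # CYRILLIC SMALL LETTER IO
}


def translit_sketch(text):
    """Map each listed code point to its ASCII string; leave the rest as is."""
    return text.translate(CYRILLIC_TO_ASCII)


print(translit_sketch("Щука"))  # -> SHHuka
print(translit_sketch("ёжик"))  # -> yozhik
```

Because each row is keyed on a single code point, a capital letter that
maps to several ASCII letters comes out fully uppercase ("SHH") even when
the rest of the word is lowercase.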