Locales: Cyrillic -> ASCII transliteration table [BZ #2872] v2

  Dear locale maintainers,

To fix glibc bug 2872, "Transliteration Cyrillic -> ASCII fails",

https://sourceware.org/bugzilla/show_bug.cgi?id=2872 [1]

please add the Cyrillic transliteration table file translit_cyrillic

https://sourceware.org/bugzilla/attachment.cgi?id=8591 [7]

to localedata/locales/ and include it in all your locales going forward.

Patch included inline below.

I have excluded from this patch the locales that already mention
Cyrillic or have a transliteration table for it:
az_AZ
iso14651_t1_common
ky_KG
mn_MN
sr_RS
tg_TJ
tk_TM
tt_RU
uk_UA
uz_UZ
uz_UZ@cyrillic

Their maintainers are requested to decide explicitly whether and how to
include this patch.
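
The exclusion list above can be approximated with a short shell sketch.
The function name list_cyrillic_aware is made up for illustration, and
treating "mention" as a case-insensitive match on the word "cyrillic" is
an assumption, not necessarily how the list was produced.

```shell
# Hypothetical helper: print the names of the files in directory $1 whose
# text mentions "cyrillic" (case-insensitive), one name per line, sorted.
list_cyrillic_aware() {
  ( cd "$1" && grep -l -i cyrillic -- * 2>/dev/null | sort )
}
```

In a glibc source tree it would be called as
list_cyrillic_aware localedata/locales.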

Current bug effect:

The glibc wiki explicitly lists this use case as its test example
(https://sourceware.org/glibc/wiki/Locales#Testing_Locales):

LC_ALL=$LOCALE.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt

Currently it fails on Cyrillic text in most locales, including ru_RU
[1] [8] [9]:

LC_ALL=ru_RU.UTF-8 iconv -f UTF-8 -t ASCII//TRANSLIT < translit-test-input.txt | grep CYRILLIC

CYRILLIC ????? ??? ???? ?????? ??????????? ?????, ?? ????? ?? ???.

That is, it produces only question marks and spaces.

This is what it should produce, and does produce once the patch is
applied:

CYRILLIC S``esh` eshhyo e`tix myagkix franczuzskix bulok, da vy'pej zhe chayu.
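
To tell the two outcomes apart in a script, one option is to count the
'?' placeholders in the iconv output. The sketch below assumes a POSIX
shell; count_placeholders is a hypothetical name, and a literal '?' in
the input text would be counted as well.

```shell
# Hypothetical helper: count the '?' characters on standard input.
# tr keeps only '?', wc counts them, and the last tr strips the padding
# that some wc implementations add.
count_placeholders() {
  tr -cd '?' | wc -c | tr -d ' '
}
```

For example, the failing command above can be piped into
count_placeholders; a result of 0 on input without literal question
marks means every character was transliterated.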

Root problem and the fix:

The root problem is a missing transliteration table, which I am
supplying here. In addition, the table has to be included in the active
locale when that locale is compiled, so that iconv can use it.

COMMIT MESSAGE:
This translit_cyrillic table enables conversion (e.g. with iconv) of
UTF-8 encoded text in a Cyrillic alphabet to an ASCII//TRANSLIT text.

Examples: iconv -f UTF-8 -t ASCII//TRANSLIT produces an ASCII-compatible
transcription, and iconv -f UTF-8 -t ISO-8859-15//TRANSLIT |
iconv -f ISO-8859-15 -t UTF-8 produces a Latin transliteration as per
ISO 9:1995.

While UTF-8 encoded Cyrillic text requires Cyrillic fonts, the result of
a transliteration/transcription contains only Latin/ASCII characters but
can still be read by a native speaker. Among other things, this is
useful for processing Cyrillic texts and filenames with programs or on
systems that are not specifically prepared to work with Cyrillic, do not
have the corresponding fonts installed, or cannot handle UTF-8.

The transliteration table itself is attached as the file
translit_cyrillic [7]. Its mapping is based on the ISO 9:1995 standard
[10] and its derivative GOST 7.79-2000, whose official source is the
Federal Agency on Technical Regulating and Metrology of the Russian
Federation [2]. Technically, an independent but mostly identical source
[3] was used, and the table was prepared in a spreadsheet [6].

The documentation suggests that a transliteration table is included by
adding the line include "translit_cyrillic";"" to the translit_start
section of LC_CTYPE
http://man7.org/linux/man-pages/man5/locale.5.html [5]
In practice, I searched for all locales that have a
translit_start/translit_end stanza and generated a patch for them.
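
As an illustration of the change made to each locale file, the
following shell sketch adds the include line to a translit_start stanza.
It is not the script used to generate the patch: the function name
add_cyrillic_include is hypothetical, and placing the new line directly
after translit_start (ahead of any existing includes) is an assumption
of this sketch.

```shell
# Hypothetical helper: print locale source file $1 with the include line
# added directly after every line that starts with "translit_start".
#
# For a file containing
#   translit_start
#   include "translit_combining";""
#   translit_end
# the output is
#   translit_start
#   include "translit_cyrillic";""
#   include "translit_combining";""
#   translit_end
add_cyrillic_include() {
  awk '
    { print }
    /^translit_start/ { print "include \"translit_cyrillic\";\"\"" }
  ' "$1"
}
```

The function writes to standard output and leaves the input file
untouched, so the result can be reviewed before it replaces the
original.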

The Cyrillic transliteration of e.g. Russian text may have already
worked to some extent for mn_MN, sr_RS, tk_TM, uz_UZ, uk_UA locales that
have their transliteration tables included inline.

I am excluding these locales from this proposed patch. I have written
directly to the maintainer e-mail addresses listed in the files.
Volodymyr Lisivka <vlisivka@gmail.com>, Max Kutny <mkutny@gmail.com>
(uk_UA), Данило Шеган <danilo@gnome.org> (sr_YU, sr_CS) have confirmed
the exclusion.

Links:

[1] This bug entry https://sourceware.org/bugzilla/show_bug.cgi?id=2872
[2] GOST 7.79-2000 official source
http://protect.gost.ru/document.aspx?control=7&id=130715 (only
available as a low-quality GIF)
[3] http://transliteration.ru/gost-7-79-2000/ and
http://www.yfermer.ru/specifications/285821.html
[4] Wikipedia article on Cyrillic transliteration with Latin alphabet
https://ru.wikipedia.org/wiki/%D0%A2%D1%80%D0%B0%D0%BD%D1%81%D0%BB%D0%B8%D1%82%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D1%8F_%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D0%B0%D0%BB%D1%84%D0%B0%D0%B2%D0%B8%D1%82%D0%B0_%D0%BB%D0%B0%D1%82%D0%B8%D0%BD%D0%B8%D1%86%D0%B5%D0%B9
[5] http://man7.org/linux/man-pages/man5/locale.5.html
[6] Spreadsheet for generating translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?id=11301
[7] translit_cyrillic
https://sourceware.org/bugzilla/attachment.cgi?id=11302
[8] https://sourceware.org/glibc/wiki/Locales#Testing_Locales
[9] translit-test-input.txt
https://sourceware.org/bugzilla/attachment.cgi?id=11304
[10] https://en.wikipedia.org/wiki/ISO_9#ISO_9:1995,_or_GOST_7.79_System_A

Best regards,
Egor Kobylkin

---
2018-10-11  Egor Kobylkin  <egor@kobylkin.com>

	[BZ #2872]
	* localedata/locales/translit_cyrillic: Add ISO 9:1995, GOST 7.79
	System A transliteration and System B transcription table from
	Cyrillic to Latin/ASCII.
	* localedata/locales/C: Add include "translit_cyrillic";"" to LC_CTYPE
	translit section.
	* localedata/locales/aa_DJ: Likewise.
	* localedata/locales/af_ZA: Likewise.
	* localedata/locales/ak_GH: Likewise.
	* localedata/locales/am_ET: Likewise.
	* localedata/locales/ar_EG: Likewise.
	* localedata/locales/be_BY: Likewise.
	* localedata/locales/bem_ZM: Likewise.
	* localedata/locales/ber_DZ: Likewise.
	* localedata/locales/ber_MA: Likewise.
	* localedata/locales/bg_BG: Likewise.
	* localedata/locales/bi_VU: Likewise.
	* localedata/locales/bn_BD: Likewise.
	* localedata/locales/bo_CN: Likewise.
	* localedata/locales/ca_ES: Likewise.
	* localedata/locales/ce_RU: Likewise.
	* localedata/locales/cmn_TW: Likewise.
	* localedata/locales/cs_CZ: Likewise.
	* localedata/locales/cv_RU: Likewise.
	* localedata/locales/cy_GB: Likewise.
	* localedata/locales/da_DK: Likewise.
	* localedata/locales/de_DE: Likewise.
	* localedata/locales/dv_MV: Likewise.
	* localedata/locales/dz_BT: Likewise.
	* localedata/locales/el_GR: Likewise.
	* localedata/locales/en_GB: Likewise.
	* localedata/locales/en_NG: Likewise.
	* localedata/locales/en_ZM: Likewise.
	* localedata/locales/es_CU: Likewise.
	* localedata/locales/es_ES: Likewise.
	* localedata/locales/et_EE: Likewise.
	* localedata/locales/fa_IR: Likewise.
	* localedata/locales/ff_SN: Likewise.
	* localedata/locales/fi_FI: Likewise.
	* localedata/locales/fr_FR: Likewise.
	* localedata/locales/ga_IE: Likewise.
	* localedata/locales/gd_GB: Likewise.
	* localedata/locales/gu_IN: Likewise.
	* localedata/locales/gv_GB: Likewise.
	* localedata/locales/he_IL: Likewise.
	* localedata/locales/hi_IN: Likewise.
	* localedata/locales/hif_FJ: Likewise.
	* localedata/locales/hr_HR: Likewise.
	* localedata/locales/ht_HT: Likewise.
	* localedata/locales/hu_HU: Likewise.
	* localedata/locales/hy_AM: Likewise.
	* localedata/locales/id_ID: Likewise.
	* localedata/locales/is_IS: Likewise.
	* localedata/locales/it_IT: Likewise.
	* localedata/locales/ja_JP: Likewise.
	* localedata/locales/kab_DZ: Likewise.
	* localedata/locales/kk_KZ: Likewise.
	* localedata/locales/km_KH: Likewise.
	* localedata/locales/kn_IN: Likewise.
	* localedata/locales/ko_KR: Likewise.
	* localedata/locales/ks_IN: Likewise.
	* localedata/locales/kw_GB: Likewise.
	* localedata/locales/lb_LU: Likewise.
	* localedata/locales/lg_UG: Likewise.
	* localedata/locales/lij_IT: Likewise.
	* localedata/locales/ln_CD: Likewise.
	* localedata/locales/lo_LA: Likewise.
	* localedata/locales/lt_LT: Likewise.
	* localedata/locales/lv_LV: Likewise.
	* localedata/locales/mg_MG: Likewise.
	* localedata/locales/mhr_RU: Likewise.
	* localedata/locales/mk_MK: Likewise.
	* localedata/locales/ml_IN: Likewise.
	* localedata/locales/ms_MY: Likewise.
	* localedata/locales/mt_MT: Likewise.
	* localedata/locales/nan_TW@latin: Likewise.
	* localedata/locales/nb_NO: Likewise.
	* localedata/locales/ne_NP: Likewise.
	* localedata/locales/nhn_MX: Likewise.
	* localedata/locales/niu_NU: Likewise.
	* localedata/locales/niu_NZ: Likewise.
	* localedata/locales/nl_NL: Likewise.
	* localedata/locales/nr_ZA: Likewise.
	* localedata/locales/oc_FR: Likewise.
	* localedata/locales/om_KE: Likewise.
	* localedata/locales/or_IN: Likewise.
	* localedata/locales/os_RU: Likewise.
	* localedata/locales/pa_IN: Likewise.
	* localedata/locales/pa_PK: Likewise.
	* localedata/locales/pl_PL: Likewise.
	* localedata/locales/pt_PT: Likewise.
	* localedata/locales/quz_PE: Likewise.
	* localedata/locales/ro_RO: Likewise.
	* localedata/locales/ru_RU: Likewise.
	* localedata/locales/rw_RW: Likewise.
	* localedata/locales/sa_IN: Likewise.
	* localedata/locales/sd_IN: Likewise.
	* localedata/locales/sd_IN@devanagari: Likewise.
	* localedata/locales/sd_PK: Likewise.
	* localedata/locales/se_NO: Likewise.
	* localedata/locales/sgs_LT: Likewise.
	* localedata/locales/shn_MM: Likewise.
	* localedata/locales/si_LK: Likewise.
	* localedata/locales/sk_SK: Likewise.
	* localedata/locales/sl_SI: Likewise.
	* localedata/locales/sm_WS: Likewise.
	* localedata/locales/so_SO: Likewise.
	* localedata/locales/sq_AL: Likewise.
	* localedata/locales/ss_ZA: Likewise.
	* localedata/locales/st_ZA: Likewise.
	* localedata/locales/sv_SE: Likewise.
	* localedata/locales/sw_KE: Likewise.
	* localedata/locales/ta_IN: Likewise.
	* localedata/locales/te_IN: Likewise.
	* localedata/locales/th_TH: Likewise.
	* localedata/locales/ti_ET: Likewise.
	* localedata/locales/tn_ZA: Likewise.
	* localedata/locales/to_TO: Likewise.
	* localedata/locales/tpi_PG: Likewise.
	* localedata/locales/tr_TR: Likewise.
	* localedata/locales/ts_ZA: Likewise.
	* localedata/locales/unm_US: Likewise.
	* localedata/locales/ur_IN: Likewise.
	* localedata/locales/ur_PK: Likewise.
	* localedata/locales/ve_ZA: Likewise.
	* localedata/locales/vi_VN: Likewise.
	* localedata/locales/wa_BE: Likewise.
	* localedata/locales/wo_SN: Likewise.
	* localedata/locales/xh_ZA: Likewise.
	* localedata/locales/yi_US: Likewise.
	* localedata/locales/yuw_PG: Likewise.
	* localedata/locales/zh_CN: Likewise.
	* localedata/locales/zu_ZA: Likewise.
