sort diacritics left-to-right except in fr_CA locale
Commit Message
This fixes a long-standing collation bug in glibc, affecting all locales
but de_DE, lb_LU and fr_CA. This led me to write a separate NEWS entry
for this bug; do we want a bug report in the database regardless?
Tested on x86_64-linux-gnu. Ok to install?
for ChangeLog
* localedata/Makefile (test-input): Add fr_CA.UTF-8.
(LOCALES): Likewise.
* localedata/fr_CA.in: Copied and adjusted from...
* localedata/fr_FR.in: ... this. Adjusted too.
* localedata/locales/de_DE (DIACRIT_FORWARD): Do not define.
* localedata/locales/lb_LU (DIACRIT_FORWARD): Likewise.
* localedata/locales/fr_CA (DIACRIT_BACKWARD): Define.
* localedata/locales/iso14651_t1_common (DIACRIT_FORWARD):
Make it the new default, overridable with DIACRIT_BACKWARD.
* NEWS: Note behavior change.
---
NEWS | 9 +++
localedata/Makefile | 4 +
localedata/fr_CA.in | 96 +++++++++++++++++++++++++++++++++
localedata/fr_FR.in | 22 ++++----
localedata/locales/de_DE | 2 -
localedata/locales/fr_CA | 2 +
localedata/locales/iso14651_t1_common | 6 +-
localedata/locales/lb_LU | 2 -
8 files changed, 123 insertions(+), 20 deletions(-)
create mode 100644 localedata/fr_CA.in
Comments
Policy is that user-visible changes require a bug report.
On Dec 16, 2014, Roland McGrath <roland@hack.frob.com> wrote:
> Policy is that user-visible changes require a bug report.
Noted, thanks. Any other comments on the patch, before I post a revised
version mentioning the yet-to-be-filed bug report?
Thanks,
> Noted, thanks. Any other comments on the patch, before I post a revised
> version mentioning the yet-to-be-filed bug report?
I am pretty useless in that area of the code, sorry.
* Alexandre Oliva:
> This fixes a long-standing collation bug in glibc, affecting all locales
> but de_DE, lb_LU and fr_CA. This led me to write a separate NEWS entry
> for this bug; do we want a bug report in the database regardless?
I wonder if this libc change means that database indexes may have to
be rebuilt. This:
<http://www.postgresql.org/docs/current/static/sql-createdatabase.html>
suggests “yes”. Which means that adopting this change as part of a
system glibc update is quite risky.
I have no idea how to communicate this appropriately. Maybe we should
ask the PostgreSQL folks about opinions on this matter.
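One way to gauge the concern above is to check whether data that was sorted under the old collation is still in order under the new one. The following is a minimal sketch, assuming a Python interpreter linked against the glibc under test; `find_order_breaks` and `check_under_locale` are hypothetical helper names, not part of glibc or PostgreSQL, and this says nothing about how PostgreSQL itself validates indexes.

```python
import locale


def find_order_breaks(previously_sorted, key=locale.strxfrm):
    """Return adjacent pairs that are out of order under `key`.

    `previously_sorted` is a list of strings in the order some older
    collation produced. An empty result means that order is still valid.
    """
    breaks = []
    for left, right in zip(previously_sorted, previously_sorted[1:]):
        if key(left) > key(right):
            breaks.append((left, right))
    return breaks


def check_under_locale(previously_sorted, locale_name):
    """Hypothetical convenience wrapper: switch LC_COLLATE, then check.

    Assumes `locale_name` is installed; setlocale raises locale.Error
    otherwise.
    """
    locale.setlocale(locale.LC_COLLATE, locale_name)
    return find_order_breaks(previously_sorted)
```

For instance, `check_under_locale(["cote", "côte", "coté", "côté"], "fr_FR.UTF-8")` feeds in the pre-patch fr_FR order described in the NEWS entry; any pair it returns would be a place where an index built under the old rules disagrees with the new ones.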
@@ -41,6 +41,15 @@ Version 2.21
* Merged gettext 0.19.3 into the intl subdirectory. This fixes building
with newer versions of bison.
+
+* Collation (sorting) general rules regarding diacritics have been fixed to
+ match those in Unicode CLDR, namely, whether diacritic tie-breaking takes
+ place in a forward or backward pass over the strings or wstrings. The
+ only locale that sorts diacritics with a backward pass is now fr_CA; it
+ already sorted «cote < côte < coté < côté» before. All other locales now
+ use a forward pass, so that they sort «cote < coté < côte < côté», which
+ only de_DE and lb_LU did before.
+
Version 2.20
@@ -37,7 +37,7 @@ test-srcs := collate-test xfrm-test tst-fmon tst-rpmatch tst-trans \
tst-ctype tst-langinfo tst-langinfo-static tst-numeric
test-input := de_DE.ISO-8859-1 en_US.ISO-8859-1 da_DK.ISO-8859-1 \
hr_HR.ISO-8859-2 sv_SE.ISO-8859-1 tr_TR.UTF-8 fr_FR.UTF-8 \
- si_LK.UTF-8
+ si_LK.UTF-8 fr_CA.UTF-8
test-input-data = $(addsuffix .in, $(basename $(test-input)))
test-output := $(foreach s, .out .xout, \
$(addsuffix $s, $(basename $(test-input))))
@@ -106,7 +106,7 @@ LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 \
hr_HR.ISO-8859-2 sv_SE.ISO-8859-1 ja_JP.SJIS fr_FR.ISO-8859-1 \
nb_NO.ISO-8859-1 nn_NO.ISO-8859-1 tr_TR.UTF-8 cs_CZ.UTF-8 \
zh_TW.EUC-TW fa_IR.UTF-8 fr_FR.UTF-8 ja_JP.UTF-8 si_LK.UTF-8 \
- tr_TR.ISO-8859-9 en_GB.UTF-8
+ tr_TR.ISO-8859-9 en_GB.UTF-8 fr_CA.UTF-8
LOCALE_SRCS := $(shell echo "$(LOCALES)"|sed 's/\([^ .]*\)[^ ]*/\1/g')
CHARMAPS := $(shell echo "$(LOCALES)" | \
sed -e 's/[^ .]*[.]\([^ ]*\)/\1/g' -e s/SJIS/SHIFT_JIS/g)
new file mode 100644
@@ -0,0 +1,96 @@
+@@@@@
+0000
+9999
+Aalborg
+aide
+aïeul
+air
+@@@air
+air@@@
+Ålborg
+août
+bohème
+Bohême
+Bohémien
+caennais
+cæsium
+çà et là
+C.A.F.
+Canon
+cañon
+casanier
+cølibat
+colon
+côlon
+COOP
+CO-OP
+coop
+co-op
+Copenhagen
+COTE
+cote
+CÔTE
+côte
+COTÉ
+coté
+CÔTÉ
+côté
+du
+dû
+élève
+élevé
+gène
+gêne
+gêné
+Größe
+Grossist
+haie
+haïe
+île
+Île d'Orléans
+lame
+l'âme
+lamé
+les
+LÈS
+lèse
+lésé
+L'Haÿ-les-Roses
+MÂCON
+maçon
+McArthur
+Mc Arthur
+Mc Mahon
+MODÈLE
+modelé
+NOËL
+Noël
+notre
+nôtre
+ode
+œil
+ou
+OÙ
+ovoïde
+pèche
+pêche
+PÉCHÉ
+péché
+pêché
+pécher
+pêcher
+pechère
+péchère
+relève
+relevé
+resume
+resumé
+résumé
+révèle
+révélé
+vice-president
+vice-président
+vice-president's offices
+vice-presidents' offices
+VICE-VERSA
+vice versa
@@ -29,16 +29,16 @@ CO-OP
Copenhagen
cote
COTE
-côte
-CÔTE
coté
COTÉ
+côte
+CÔTE
côté
CÔTÉ
du
dû
-élève
élevé
+élève
gène
gêne
gêné
@@ -49,20 +49,20 @@ haïe
île
Île d'Orléans
lame
-l'âme
lamé
+l'âme
les
LÈS
-lèse
lésé
+lèse
L'Haÿ-les-Roses
-MÂCON
maçon
+MÂCON
McArthur
Mc Arthur
Mc Mahon
-MODÈLE
modelé
+MODÈLE
Noël
NOËL
notre
@@ -72,22 +72,22 @@ ode
ou
OÙ
ovoïde
-pèche
-pêche
péché
PÉCHÉ
+pèche
+pêche
pêché
pécher
pêcher
pechère
péchère
-relève
relevé
+relève
resume
resumé
résumé
-révèle
révélé
+révèle
vice-president
vice-président
vice-president's offices
@@ -76,8 +76,6 @@ END LC_CTYPE
LC_COLLATE
-define DIACRIT_FORWARD
-
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
@@ -51,6 +51,8 @@ copy "fr_FR"
END LC_CTYPE
LC_COLLATE
+define DIACRIT_BACKWARD
+
copy "en_CA"
END LC_COLLATE
@@ -5060,10 +5060,10 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U009E> IGNORE;IGNORE;IGNORE;<U009E>
<U009F> IGNORE;IGNORE;IGNORE;<U009F>
-ifdef DIACRIT_FORWARD
-order_start <LATIN>;forward;forward;forward;forward,position
-else
+ifdef DIACRIT_BACKWARD
order_start <LATIN>;forward;backward;forward;forward,position
+else
+order_start <LATIN>;forward;forward;forward;forward,position
endif
#
<U00A0> <U0020>;<BAS>;<MIN>;IGNORE # 170<NBSP>
@@ -77,8 +77,6 @@ END LC_CTYPE
LC_COLLATE
-define DIACRIT_FORWARD
-
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
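
The difference between the two `order_start <LATIN>` lines in the iso14651_t1_common hunk can be illustrated with a toy model. This is a sketch only, assuming a two-level comparison (base letters first, then accents) and ignoring case, punctuation, and the position level; it is not glibc's collation algorithm, and `collation_key` is a hypothetical name.

```python
import unicodedata


def collation_key(word, backward_accents=False):
    """Build a two-level sort key: base letters, then accent marks.

    With backward_accents=True the accent level is compared from the end
    of the word, mimicking `forward;backward;...` (DIACRIT_BACKWARD).
    Otherwise it is compared from the start, mimicking
    `forward;forward;...`, the new default.
    """
    bases, accents = [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        bases.append(decomposed[0].lower())
        accents.append(decomposed[1:])  # empty string means "no accent"
    if backward_accents:
        accents.reverse()
    return (bases, accents)


words = ["côté", "cote", "coté", "côte"]

# Forward pass: the first accent difference from the left decides.
print(sorted(words, key=collation_key))
# ['cote', 'coté', 'côte', 'côté']

# Backward pass: the first accent difference from the right decides.
print(sorted(words, key=lambda w: collation_key(w, backward_accents=True)))
# ['cote', 'côte', 'coté', 'côté']
```

The two outputs correspond to the orderings quoted in the NEWS entry: the forward result is what all locales except fr_CA produce after the patch, and the backward result is what fr_CA keeps.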