sort diacritics left-to-right except in fr_CA locale

Message ID oroar4bmxl.fsf@free.home
State Superseded

Commit Message

Alexandre Oliva Dec. 15, 2014, 9:52 p.m. UTC
  This fixes a long-standing collation bug in glibc, affecting all locales
but de_DE, lb_LU and fr_CA.  This led me to write a separate NEWS entry
for this bug; do we want a bug report in the database regardless?

Tested on x86_64-linux-gnu.  Ok to install?
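
The behavior change at issue is the direction in which diacritics break ties between words whose base letters are equal. The following is a minimal sketch of that idea in Python, not glibc's implementation: it models only two comparison levels (base letters, then accents), ignores case, and uses code points of combining marks as stand-in weights, whereas the real weights and four levels come from iso14651_t1_common. The names `collation_key` and `WORDS` are hypothetical.

```python
import unicodedata

WORDS = ["cote", "côte", "coté", "côté"]

def collation_key(word, backward_accents=False):
    """Toy two-level key: base letters first, diacritics as a tie-breaker."""
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch) and marks:
            # Attach the combining mark to the preceding base letter.
            marks[-1] += (ord(ch),)
        else:
            base.append(ch)
            marks.append(())
    if backward_accents:
        # Backward pass: the accent nearest the end of the word decides first.
        marks.reverse()
    return (base, marks)

# Forward pass (the new default): cote, coté, côte, côté
forward_order = sorted(WORDS, key=collation_key)

# Backward pass (fr_CA only): cote, côte, coté, côté
backward_order = sorted(
    WORDS, key=lambda w: collation_key(w, backward_accents=True)
)
```

Under these assumptions the two orders differ only in the middle pair: a forward pass lets the circumflex on the second letter decide, while a backward pass lets the acute on the last letter decide.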


for  ChangeLog

	* localedata/Makefile (test-input): Add fr_CA.UTF-8.
	(LOCALES): Likewise.
	* localedata/fr_CA.in: Copied and adjusted from...
	* localedata/fr_FR.in: ... this.  Adjusted too.
	* localedata/locales/de_DE (DIACRIT_FORWARD): Do not define.
	* localedata/locales/lb_LU (DIACRIT_FORWARD): Likewise.
	* localedata/locales/fr_CA (DIACRIT_BACKWARD): Define.
	* localedata/locales/iso14651_t1_common (DIACRIT_FORWARD):
	Make it the new default, overridable with DIACRIT_BACKWARD.
	* NEWS: Note behavior change.
---
 NEWS                                  |    9 +++
 localedata/Makefile                   |    4 +
 localedata/fr_CA.in                   |   96 +++++++++++++++++++++++++++++++++
 localedata/fr_FR.in                   |   22 ++++----
 localedata/locales/de_DE              |    2 -
 localedata/locales/fr_CA              |    2 +
 localedata/locales/iso14651_t1_common |    6 +-
 localedata/locales/lb_LU              |    2 -
 8 files changed, 123 insertions(+), 20 deletions(-)
 create mode 100644 localedata/fr_CA.in
  

Comments

Roland McGrath Dec. 16, 2014, 5:57 p.m. UTC | #1
Policy is that user-visible changes require a bug report.
  
Alexandre Oliva Dec. 17, 2014, 6:43 p.m. UTC | #2
On Dec 16, 2014, Roland McGrath <roland@hack.frob.com> wrote:

> Policy is that user-visible changes require a bug report.

Noted, thanks.  Any other comments on the patch, before I post a revised
version mentioning the yet-to-be-filed bug report?

Thanks,
  
Roland McGrath Dec. 17, 2014, 6:44 p.m. UTC | #3
> Noted, thanks.  Any other comments on the patch, before I post a revised
> version mentioning the yet-to-be-filed bug report?

I am pretty useless in that area of the code, sorry.
  
Florian Weimer Dec. 23, 2014, 9:47 p.m. UTC | #4
* Alexandre Oliva:

> This fixes a long-standing collation bug in glibc, affecting all locales
> but de_DE, lb_LU and fr_CA.  This led me to write a separate NEWS entry
> for this bug; do we want a bug report in the database regardless?

I wonder if this libc change means that database indexes may have to
be rebuilt.  This:

  <http://www.postgresql.org/docs/current/static/sql-createdatabase.html>

suggests “yes”.  Which means that adopting this change as part of a
system glibc update is quite risky.

I have no idea how to communicate this appropriately.  Maybe we should
ask the PostgreSQL folks for their opinion on this matter.
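
The concern about indexes can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not PostgreSQL's B-tree code or glibc's collation: it keeps a sorted Python list as the "index", reuses a simplified two-level accent key, and looks words up with binary search. The names `accent_key`, `lookup`, `WORDS` and `TARGET` are hypothetical.

```python
import bisect
import unicodedata

def accent_key(word, backward):
    """Toy key: base letters, then accents read forward or backward."""
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch) and marks:
            marks[-1] += (ord(ch),)
        else:
            base.append(ch)
            marks.append(())
    return (base, marks[::-1] if backward else marks)

def lookup(index, word, backward):
    """Binary search that trusts `index` to be sorted under the given rule."""
    keys = [accent_key(w, backward) for w in index]
    pos = bisect.bisect_left(keys, accent_key(word, backward))
    return pos < len(index) and index[pos] == word

WORDS = ["cote", "côte", "coté", "côté"]
TARGET = WORDS[2]  # "coté"

# Index built while the library still used the backward accent rule.
old_index = sorted(WORDS, key=lambda w: accent_key(w, backward=True))
found_before = lookup(old_index, TARGET, backward=True)

# Same stored order, but comparisons now follow the forward rule.
found_stale = lookup(old_index, TARGET, backward=False)

# Rebuilding the index under the new rule restores consistency.
new_index = sorted(WORDS, key=lambda w: accent_key(w, backward=False))
found_rebuilt = lookup(new_index, TARGET, backward=False)
```

In this model `found_before` and `found_rebuilt` are true while `found_stale` is false: the entry is still stored, but a search that assumes the new ordering looks in the wrong place, which is why a rebuild would be needed.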
  

Patch

diff --git a/NEWS b/NEWS
index a324c10..1a78cda 100644
--- a/NEWS
+++ b/NEWS
@@ -41,6 +41,15 @@  Version 2.21
 
 * Merged gettext 0.19.3 into the intl subdirectory.  This fixes building
   with newer versions of bison.
+
+* Collation (sorting) general rules regarding diacritics have been fixed to
+  match those in Unicode CLDR, namely, whether diacritic tie-breaking takes
+  place in a forward or backward pass over the strings or wstrings.  The
+  only locale that sorts diacritics with a backward pass is now fr_CA; it
+  already sorted «cote < côte < coté < côté» before.  All other locales now
+  use a forward pass, so that they sort «cote < coté < côte < côté», which
+  only de_DE and lb_LU did before.
+
 
 Version 2.20
 
diff --git a/localedata/Makefile b/localedata/Makefile
index 0826b36..4fc523e 100644
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -37,7 +37,7 @@  test-srcs := collate-test xfrm-test tst-fmon tst-rpmatch tst-trans \
 	     tst-ctype tst-langinfo tst-langinfo-static tst-numeric
 test-input := de_DE.ISO-8859-1 en_US.ISO-8859-1 da_DK.ISO-8859-1 \
 	      hr_HR.ISO-8859-2 sv_SE.ISO-8859-1 tr_TR.UTF-8 fr_FR.UTF-8 \
-	      si_LK.UTF-8
+	      si_LK.UTF-8 fr_CA.UTF-8
 test-input-data = $(addsuffix .in, $(basename $(test-input)))
 test-output := $(foreach s, .out .xout, \
 			 $(addsuffix $s, $(basename $(test-input))))
@@ -106,7 +106,7 @@  LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 \
 	   hr_HR.ISO-8859-2 sv_SE.ISO-8859-1 ja_JP.SJIS fr_FR.ISO-8859-1 \
 	   nb_NO.ISO-8859-1 nn_NO.ISO-8859-1 tr_TR.UTF-8 cs_CZ.UTF-8 \
 	   zh_TW.EUC-TW fa_IR.UTF-8 fr_FR.UTF-8 ja_JP.UTF-8 si_LK.UTF-8 \
-	   tr_TR.ISO-8859-9 en_GB.UTF-8
+	   tr_TR.ISO-8859-9 en_GB.UTF-8 fr_CA.UTF-8
 LOCALE_SRCS := $(shell echo "$(LOCALES)"|sed 's/\([^ .]*\)[^ ]*/\1/g')
 CHARMAPS := $(shell echo "$(LOCALES)" | \
 		    sed -e 's/[^ .]*[.]\([^ ]*\)/\1/g' -e s/SJIS/SHIFT_JIS/g)
diff --git a/localedata/fr_CA.in b/localedata/fr_CA.in
new file mode 100644
index 0000000..1c05d69
--- /dev/null
+++ b/localedata/fr_CA.in
@@ -0,0 +1,96 @@ 
+@@@@@
+0000
+9999
+Aalborg
+aide
+aïeul
+air
+@@@air
+air@@@
+Ålborg
+août
+bohème
+Bohême
+Bohémien
+caennais
+cæsium
+çà et là
+C.A.F.
+Canon
+cañon
+casanier
+cølibat
+colon
+côlon
+COOP
+CO-OP
+coop
+co-op
+Copenhagen
+COTE
+cote
+CÔTE
+côte
+COTÉ
+coté
+CÔTÉ
+côté
+du
+dû
+élève
+élevé
+gène
+gêne
+gêné
+Größe
+Grossist
+haie
+haïe
+île
+Île d'Orléans
+lame
+l'âme
+lamé
+les
+LÈS
+lèse
+lésé
+L'Haÿ-les-Roses
+MÂCON
+maçon
+McArthur
+Mc Arthur
+Mc Mahon
+MODÈLE
+modelé
+NOËL
+Noël
+notre
+nôtre
+ode
+œil
+ou
+OÙ
+ovoïde
+pèche
+pêche
+PÉCHÉ
+péché
+pêché
+pécher
+pêcher
+pechère
+péchère
+relève
+relevé
+resume
+resumé
+résumé
+révèle
+révélé
+vice-president
+vice-président
+vice-president's offices
+vice-presidents' offices
+VICE-VERSA
+vice versa
diff --git a/localedata/fr_FR.in b/localedata/fr_FR.in
index dd5c533..070eb4dc 100644
--- a/localedata/fr_FR.in
+++ b/localedata/fr_FR.in
@@ -29,16 +29,16 @@  CO-OP
 Copenhagen
 cote
 COTE
-côte
-CÔTE
 coté
 COTÉ
+côte
+CÔTE
 côté
 CÔTÉ
 du
 dû
-élève
 élevé
+élève
 gène
 gêne
 gêné
@@ -49,20 +49,20 @@  haïe
 île
 Île d'Orléans
 lame
-l'âme
 lamé
+l'âme
 les
 LÈS
-lèse
 lésé
+lèse
 L'Haÿ-les-Roses
-MÂCON
 maçon
+MÂCON
 McArthur
 Mc Arthur
 Mc Mahon
-MODÈLE
 modelé
+MODÈLE
 Noël
 NOËL
 notre
@@ -72,22 +72,22 @@  ode
 ou
 OÙ
 ovoïde
-pèche
-pêche
 péché
 PÉCHÉ
+pèche
+pêche
 pêché
 pécher
 pêcher
 pechère
 péchère
-relève
 relevé
+relève
 resume
 resumé
 résumé
-révèle
 révélé
+révèle
 vice-president
 vice-président
 vice-president's offices
diff --git a/localedata/locales/de_DE b/localedata/locales/de_DE
index e2704a7..2c3510a 100644
--- a/localedata/locales/de_DE
+++ b/localedata/locales/de_DE
@@ -76,8 +76,6 @@  END LC_CTYPE
 
 LC_COLLATE
 
-define DIACRIT_FORWARD
-
 % Copy the template from ISO/IEC 14651
 copy "iso14651_t1"
 
diff --git a/localedata/locales/fr_CA b/localedata/locales/fr_CA
index 5e2c5a1..878539b 100644
--- a/localedata/locales/fr_CA
+++ b/localedata/locales/fr_CA
@@ -51,6 +51,8 @@  copy "fr_FR"
 END LC_CTYPE
 
 LC_COLLATE
+define DIACRIT_BACKWARD
+
 copy "en_CA"
 END LC_COLLATE
 
diff --git a/localedata/locales/iso14651_t1_common b/localedata/locales/iso14651_t1_common
index e0c3eaa..1fc214f 100644
--- a/localedata/locales/iso14651_t1_common
+++ b/localedata/locales/iso14651_t1_common
@@ -5060,10 +5060,10 @@  order_start <SPECIAL>;forward;backward;forward;forward,position
 <U009E> IGNORE;IGNORE;IGNORE;<U009E>
 <U009F> IGNORE;IGNORE;IGNORE;<U009F>
 
-ifdef DIACRIT_FORWARD
-order_start <LATIN>;forward;forward;forward;forward,position
-else
+ifdef DIACRIT_BACKWARD
 order_start <LATIN>;forward;backward;forward;forward,position
+else
+order_start <LATIN>;forward;forward;forward;forward,position
 endif
 #
 <U00A0> <U0020>;<BAS>;<MIN>;IGNORE # 170<NBSP>
diff --git a/localedata/locales/lb_LU b/localedata/locales/lb_LU
index a74e162..c8616fd 100644
--- a/localedata/locales/lb_LU
+++ b/localedata/locales/lb_LU
@@ -77,8 +77,6 @@  END LC_CTYPE
 
 LC_COLLATE
 
-define DIACRIT_FORWARD
-
 % Copy the template from ISO/IEC 14651
 copy "iso14651_t1"