[v2,11/14,BZ,#14095] update collation data from Unicode / ISO 14651

Message ID	s9dd11j9zzr.fsf@taka.site
State	Superseded
Headers	Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
	Precedence: bulk
	Sender: libc-alpha-owner@sourceware.org
	From: Mike FABIAN <mfabian@redhat.com>
	To: libc-alpha@sourceware.org
	Cc: "Dmitry V. Levin" <ldv@altlinux.org>
	Subject: [Patch v2 11/14] [BZ #14095] update collation data from Unicode / ISO 14651
	Date: Mon, 05 Feb 2018 17:12:56 +0100
	Message-ID: <s9dd11j9zzr.fsf@taka.site>
	User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)
	MIME-Version: 1.0
	Content-Type: text/x-patch
	Content-Disposition: inline; filename=0011-Fix-test-cases-tst-fnmatch-and-tst-regexloc-for-the-.patch
	Content-Transfer-Encoding: 8bit

From 7bd32b54d54e4cc924309373f4c4ad59de6ef1d8 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Tue, 23 Jan 2018 17:29:36 +0100
Subject: [PATCH 11/14] Fix test cases tst-fnmatch and tst-regexloc for the new
 iso14651_t1_common file.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

See:

http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html

> A range expression represents the set of collating elements that fall
> between two elements in the current collation sequence,
> inclusively. It is expressed as the starting point and the ending
> point separated by a hyphen (-).
>
> Range expressions must not be used in portable applications because
> their behaviour is dependent on the collating sequence. Ranges will be
> treated according to the current collating sequence, and include such
> characters that fall within the range based on that collating
> sequence, regardless of character values. This, however, means that
> the interpretation will differ depending on collating sequence. If,
> for instance, one collating sequence defines ä as a variant of a,
> while another defines it as a letter following z, then the expression
> [ä-z] is valid in the first language and invalid in the second.

Therefore, using [a-z] does not make much sense except in the C/POSIX
locale. The new iso14651_t1_common lists upper case and lower case
Latin characters in a different order than the old one which causes
surprising results for example in the de_DE locale: [a-z] now includes
A because A comes after a in iso14651_t1_common but does not include Z
because that comes after z in iso14651_t1_common.

	* posix/tst-fnmatch.input: Use range expressions only in C locale.
	* posix/tst-regexloc.c: Do not use a range expression for
	de_DE.ISO-8859-1 locale.
---
 posix/tst-fnmatch.input | 40 ----------------------------------------
 posix/tst-regexloc.c    |  4 ++--
 2 files changed, 2 insertions(+), 42 deletions(-)

diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index 88b3f739a5..1e2f62c0ed 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -418,26 +418,6 @@ C "-" "[Z-\\]]" NOMATCH
 # Following are tests outside the scope of IEEE 2003.2 since they are using
 # locales other than the C locale. The main focus of the tests is on the
 # handling of ranges and the recognition of character (vs bytes).
-de_DE.ISO-8859-1 "a" "[a-z]" 0
-de_DE.ISO-8859-1 "z" "[a-z]" 0
-de_DE.ISO-8859-1 "ä" "[a-z]" 0
-de_DE.ISO-8859-1 "ö" "[a-z]" 0
-de_DE.ISO-8859-1 "ü" "[a-z]" 0
-de_DE.ISO-8859-1 "A" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "Z" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "Ä" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "Ö" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "Ü" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "a" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "z" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "ä" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "ö" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "ü" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "A" "[A-Z]" 0
-de_DE.ISO-8859-1 "Z" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ä" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ö" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ü" "[A-Z]" 0
 de_DE.ISO-8859-1 "a" "[[:lower:]]" 0
 de_DE.ISO-8859-1 "z" "[[:lower:]]" 0
 de_DE.ISO-8859-1 "ä" "[[:lower:]]" 0
@@ -510,26 +490,6 @@ de_DE.ISO-8859-1 "ba" "[[.a.]]a" NOMATCH
 # And with a multibyte character set.
-de_DE.UTF-8 "a" "[a-z]" 0
-de_DE.UTF-8 "z" "[a-z]" 0
-de_DE.UTF-8 "ä" "[a-z]" 0
-de_DE.UTF-8 "ö" "[a-z]" 0
-de_DE.UTF-8 "ü" "[a-z]" 0
-de_DE.UTF-8 "A" "[a-z]" NOMATCH
-de_DE.UTF-8 "Z" "[a-z]" NOMATCH
-de_DE.UTF-8 "Ä" "[a-z]" NOMATCH
-de_DE.UTF-8 "Ö" "[a-z]" NOMATCH
-de_DE.UTF-8 "Ü" "[a-z]" NOMATCH
-de_DE.UTF-8 "a" "[A-Z]" NOMATCH
-de_DE.UTF-8 "z" "[A-Z]" NOMATCH
-de_DE.UTF-8 "ä" "[A-Z]" NOMATCH
-de_DE.UTF-8 "ö" "[A-Z]" NOMATCH
-de_DE.UTF-8 "ü" "[A-Z]" NOMATCH
-de_DE.UTF-8 "A" "[A-Z]" 0
-de_DE.UTF-8 "Z" "[A-Z]" 0
-de_DE.UTF-8 "Ä" "[A-Z]" 0
-de_DE.UTF-8 "Ö" "[A-Z]" 0
-de_DE.UTF-8 "Ü" "[A-Z]" 0
 de_DE.UTF-8 "a" "[[:lower:]]" 0
 de_DE.UTF-8 "z" "[[:lower:]]" 0
 de_DE.UTF-8 "ä" "[[:lower:]]" 0
diff --git a/posix/tst-regexloc.c b/posix/tst-regexloc.c
index 60235b4d3b..7fbc496d0c 100644
--- a/posix/tst-regexloc.c
+++ b/posix/tst-regexloc.c
@@ -29,8 +29,8 @@ do_test (void)
   if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
     puts ("cannot set locale");
-  else if (regcomp (&re, "[a-f]*", 0) != REG_NOERROR)
-    puts ("cannot compile expression \"[a-f]*\"");
+  else if (regcomp (&re, "[abcdef]*", 0) != REG_NOERROR)
+    puts ("cannot compile expression \"[abcdef]*\"");
   else if (regexec (&re, "abcdefCDEF", 1, mat, 0) == REG_NOMATCH)
     puts ("no match");
   else
-- 
2.14.3
