[v3,11/14,BZ,#14095] update collation data from Unicode / ISO 14651
On 02/23/2018 02:24 AM, Mike FABIAN wrote:
> From 5c65168e569ba0c59ad43bbd88f37cdb356c16b6 Mon Sep 17 00:00:00 2001
> From: Mike FABIAN <mfabian@redhat.com>
> Date: Tue, 23 Jan 2018 17:29:36 +0100
> Subject: [PATCH 11/14] Fix test cases tst-fnmatch and tst-regexloc for the new
> iso14651_t1_common file.
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
OK with the following changes:
- Add a comment to tst-fnmatch.input about range usage like this.
- Rework the test input to keep testing the ranges instead of deleting them.
See comments below.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
> See:
>
> http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html
>
>> A range expression represents the set of collating elements that fall
>> between two elements in the current collation sequence,
>> inclusively. It is expressed as the starting point and the ending
>> point separated by a hyphen (-).
>>
>> Range expressions must not be used in portable applications because
>> their behaviour is dependent on the collating sequence. Ranges will be
>> treated according to the current collating sequence, and include such
>> characters that fall within the range based on that collating
>> sequence, regardless of character values. This, however, means that
>> the interpretation will differ depending on collating sequence. If,
>> for instance, one collating sequence defines ä as a variant of a,
>> while another defines it as a letter following z, then the expression
>> [ä-z] is valid in the first language and invalid in the second.
> Therefore, using [a-z] does not make much sense except in the C/POSIX locale.
> The new iso14651_t1_common lists upper case and lower case Latin characters
> in a different order than the old one, which causes surprising results,
> for example in the de_DE locale: [a-z] now includes A because A comes
> after a in iso14651_t1_common, but does not include Z because that comes
> after z in iso14651_t1_common.
Why delete the tests, though? Why not adjust them to cover the new results?
The old tests were similarly adjusted to the collation order, since they
expect 'ä' to be within the range [a-z]. Couldn't we adjust these tests
in the same way?
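As a rough illustration of adjusting rather than deleting, the sketch below
probes a bracket range under a given locale. It is only a sketch: the helper
name range_matches is hypothetical and is not part of the patch or of the
tst-fnmatch driver. It just calls setlocale and fnmatch, and it assumes the
named locale is installed (it returns -1 otherwise).

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch.  Returns 1 if STRING
   matches PATTERN under LOCALE_NAME, 0 if it does not match, and -1 if
   the locale cannot be set or fnmatch reports an error.  */
int
range_matches (const char *locale_name, const char *pattern,
               const char *string)
{
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  int r = fnmatch (pattern, string, 0);
  if (r == 0)
    return 1;
  return r == FNM_NOMATCH ? 0 : -1;
}
```

Calling range_matches ("de_DE.UTF-8", "[a-z]", "A") before and after the
collation update would show whether an expectation has to flip. In the C
locale the same call returns 0, because ranges there follow byte values.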
>
> * posix/tst-fnmatch.input: Use range expressions only in C locale.
> * posix/tst-regexloc.c: Do not use a range expression for
> de_DE.ISO-8859-1 locale.
> ---
> posix/tst-fnmatch.input | 40 ----------------------------------------
> posix/tst-regexloc.c | 4 ++--
> 2 files changed, 2 insertions(+), 42 deletions(-)
>
> diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
> index 88b3f739a5..1e2f62c0ed 100644
> --- a/posix/tst-fnmatch.input
> +++ b/posix/tst-fnmatch.input
> @@ -418,26 +418,6 @@ C "-" "[Z-\\]]" NOMATCH
> # Following are tests outside the scope of IEEE 2003.2 since they are using
> # locales other than the C locale. The main focus of the tests is on the
> # handling of ranges and the recognition of character (vs bytes).
Here we need a comment explaining exactly why [a-z] is tricky. Basically include
the text you wrote for the commit message here :-)
> -de_DE.ISO-8859-1 "a" "[a-z]" 0
> -de_DE.ISO-8859-1 "z" "[a-z]" 0
> -de_DE.ISO-8859-1 "ä" "[a-z]" 0
> -de_DE.ISO-8859-1 "ö" "[a-z]" 0
> -de_DE.ISO-8859-1 "ü" "[a-z]" 0
> -de_DE.ISO-8859-1 "A" "[a-z]" NOMATCH
This becomes 0.
> -de_DE.ISO-8859-1 "Z" "[a-z]" NOMATCH
Stays the same.
> -de_DE.ISO-8859-1 "Ä" "[a-z]" NOMATCH
> -de_DE.ISO-8859-1 "Ö" "[a-z]" NOMATCH
> -de_DE.ISO-8859-1 "Ü" "[a-z]" NOMATCH
All become 0.
etc.
> -de_DE.ISO-8859-1 "a" "[A-Z]" NOMATCH
> -de_DE.ISO-8859-1 "z" "[A-Z]" NOMATCH
> -de_DE.ISO-8859-1 "ä" "[A-Z]" NOMATCH
> -de_DE.ISO-8859-1 "ö" "[A-Z]" NOMATCH
> -de_DE.ISO-8859-1 "ü" "[A-Z]" NOMATCH
> -de_DE.ISO-8859-1 "A" "[A-Z]" 0
> -de_DE.ISO-8859-1 "Z" "[A-Z]" 0
> -de_DE.ISO-8859-1 "Ä" "[A-Z]" 0
> -de_DE.ISO-8859-1 "Ö" "[A-Z]" 0
> -de_DE.ISO-8859-1 "Ü" "[A-Z]" 0
> de_DE.ISO-8859-1 "a" "[[:lower:]]" 0
> de_DE.ISO-8859-1 "z" "[[:lower:]]" 0
> de_DE.ISO-8859-1 "ä" "[[:lower:]]" 0
> @@ -510,26 +490,6 @@ de_DE.ISO-8859-1 "ba" "[[.a.]]a" NOMATCH
>
>
> # And with a multibyte character set.
> -de_DE.UTF-8 "a" "[a-z]" 0
> -de_DE.UTF-8 "z" "[a-z]" 0
> -de_DE.UTF-8 "ä" "[a-z]" 0
> -de_DE.UTF-8 "ö" "[a-z]" 0
> -de_DE.UTF-8 "ü" "[a-z]" 0
> -de_DE.UTF-8 "A" "[a-z]" NOMATCH
> -de_DE.UTF-8 "Z" "[a-z]" NOMATCH
> -de_DE.UTF-8 "Ä" "[a-z]" NOMATCH
> -de_DE.UTF-8 "Ö" "[a-z]" NOMATCH
> -de_DE.UTF-8 "Ü" "[a-z]" NOMATCH
> -de_DE.UTF-8 "a" "[A-Z]" NOMATCH
> -de_DE.UTF-8 "z" "[A-Z]" NOMATCH
> -de_DE.UTF-8 "ä" "[A-Z]" NOMATCH
> -de_DE.UTF-8 "ö" "[A-Z]" NOMATCH
> -de_DE.UTF-8 "ü" "[A-Z]" NOMATCH
> -de_DE.UTF-8 "A" "[A-Z]" 0
> -de_DE.UTF-8 "Z" "[A-Z]" 0
> -de_DE.UTF-8 "Ä" "[A-Z]" 0
> -de_DE.UTF-8 "Ö" "[A-Z]" 0
> -de_DE.UTF-8 "Ü" "[A-Z]" 0
> de_DE.UTF-8 "a" "[[:lower:]]" 0
> de_DE.UTF-8 "z" "[[:lower:]]" 0
> de_DE.UTF-8 "ä" "[[:lower:]]" 0
> diff --git a/posix/tst-regexloc.c b/posix/tst-regexloc.c
> index 60235b4d3b..7fbc496d0c 100644
> --- a/posix/tst-regexloc.c
> +++ b/posix/tst-regexloc.c
> @@ -29,8 +29,8 @@ do_test (void)
>
>    if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
>      puts ("cannot set locale");
> -  else if (regcomp (&re, "[a-f]*", 0) != REG_NOERROR)
> -    puts ("cannot compile expression \"[a-f]*\"");
> +  else if (regcomp (&re, "[abcdef]*", 0) != REG_NOERROR)
> +    puts ("cannot compile expression \"[abcdef]*\"");
OK.
>    else if (regexec (&re, "abcdefCDEF", 1, mat, 0) == REG_NOMATCH)
>      puts ("no match");
>    else
> --
> 2.14.3
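For reference, the effect of the tst-regexloc.c change can be sketched as
below. The helper leading_match_length is hypothetical (it is not in the
patch). It compiles a basic regular expression under a named locale and
reports how many bytes match at the start of the string, or -1 if the locale
is missing, compilation fails, or no match starts at offset 0. With the
enumerated list [abcdef], the result should not depend on where C, D, E and F
collate relative to a-f.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch.  Returns the length of
   the match of PATTERN at the start of STRING under LOCALE_NAME, or -1
   on any failure.  */
int
leading_match_length (const char *locale_name, const char *pattern,
                      const char *string)
{
  regex_t re;
  regmatch_t mat[1];
  int result = -1;

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;
  if (regcomp (&re, pattern, 0) != 0)
    return -1;

  /* Only report matches anchored at the first byte.  */
  if (regexec (&re, string, 1, mat, 0) == 0 && mat[0].rm_so == 0)
    result = (int) (mat[0].rm_eo - mat[0].rm_so);

  regfree (&re);
  return result;
}
```

In the C locale both "[a-f]*" and "[abcdef]*" give 6 for "abcdefCDEF". Under
de_DE.ISO-8859-1 the range form is the one whose result can change with the
collation data.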
From 5c65168e569ba0c59ad43bbd88f37cdb356c16b6 Mon Sep 17 00:00:00 2001
From: Mike FABIAN <mfabian@redhat.com>
Date: Tue, 23 Jan 2018 17:29:36 +0100
Subject: [PATCH 11/14] Fix test cases tst-fnmatch and tst-regexloc for the new
iso14651_t1_common file.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
See:
http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html
> A range expression represents the set of collating elements that fall
> between two elements in the current collation sequence,
> inclusively. It is expressed as the starting point and the ending
> point separated by a hyphen (-).
>
> Range expressions must not be used in portable applications because
> their behaviour is dependent on the collating sequence. Ranges will be
> treated according to the current collating sequence, and include such
> characters that fall within the range based on that collating
> sequence, regardless of character values. This, however, means that
> the interpretation will differ depending on collating sequence. If,
> for instance, one collating sequence defines ä as a variant of a,
> while another defines it as a letter following z, then the expression
> [ä-z] is valid in the first language and invalid in the second.
Therefore, using [a-z] does not make much sense except in the C/POSIX locale.
The new iso14651_t1_common lists upper case and lower case Latin characters
in a different order than the old one, which causes surprising results,
for example in the de_DE locale: [a-z] now includes A because A comes
after a in iso14651_t1_common, but does not include Z because that comes
after z in iso14651_t1_common.
* posix/tst-fnmatch.input: Use range expressions only in C locale.
* posix/tst-regexloc.c: Do not use a range expression for
de_DE.ISO-8859-1 locale.
---
posix/tst-fnmatch.input | 40 ----------------------------------------
posix/tst-regexloc.c | 4 ++--
2 files changed, 2 insertions(+), 42 deletions(-)
diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index 88b3f739a5..1e2f62c0ed 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -418,26 +418,6 @@ C "-" "[Z-\\]]" NOMATCH
# Following are tests outside the scope of IEEE 2003.2 since they are using
# locales other than the C locale. The main focus of the tests is on the
# handling of ranges and the recognition of character (vs bytes).
-de_DE.ISO-8859-1 "a" "[a-z]" 0
-de_DE.ISO-8859-1 "z" "[a-z]" 0
-de_DE.ISO-8859-1 "ä" "[a-z]" 0
-de_DE.ISO-8859-1 "ö" "[a-z]" 0
-de_DE.ISO-8859-1 "ü" "[a-z]" 0
-de_DE.ISO-8859-1 "A" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "Z" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "Ä" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "Ö" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "Ü" "[a-z]" NOMATCH
-de_DE.ISO-8859-1 "a" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "z" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "ä" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "ö" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "ü" "[A-Z]" NOMATCH
-de_DE.ISO-8859-1 "A" "[A-Z]" 0
-de_DE.ISO-8859-1 "Z" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ä" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ö" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ü" "[A-Z]" 0
de_DE.ISO-8859-1 "a" "[[:lower:]]" 0
de_DE.ISO-8859-1 "z" "[[:lower:]]" 0
de_DE.ISO-8859-1 "ä" "[[:lower:]]" 0
@@ -510,26 +490,6 @@ de_DE.ISO-8859-1 "ba" "[[.a.]]a" NOMATCH
# And with a multibyte character set.
-de_DE.UTF-8 "a" "[a-z]" 0
-de_DE.UTF-8 "z" "[a-z]" 0
-de_DE.UTF-8 "ä" "[a-z]" 0
-de_DE.UTF-8 "ö" "[a-z]" 0
-de_DE.UTF-8 "ü" "[a-z]" 0
-de_DE.UTF-8 "A" "[a-z]" NOMATCH
-de_DE.UTF-8 "Z" "[a-z]" NOMATCH
-de_DE.UTF-8 "Ä" "[a-z]" NOMATCH
-de_DE.UTF-8 "Ö" "[a-z]" NOMATCH
-de_DE.UTF-8 "Ü" "[a-z]" NOMATCH
-de_DE.UTF-8 "a" "[A-Z]" NOMATCH
-de_DE.UTF-8 "z" "[A-Z]" NOMATCH
-de_DE.UTF-8 "ä" "[A-Z]" NOMATCH
-de_DE.UTF-8 "ö" "[A-Z]" NOMATCH
-de_DE.UTF-8 "ü" "[A-Z]" NOMATCH
-de_DE.UTF-8 "A" "[A-Z]" 0
-de_DE.UTF-8 "Z" "[A-Z]" 0
-de_DE.UTF-8 "Ä" "[A-Z]" 0
-de_DE.UTF-8 "Ö" "[A-Z]" 0
-de_DE.UTF-8 "Ü" "[A-Z]" 0
de_DE.UTF-8 "a" "[[:lower:]]" 0
de_DE.UTF-8 "z" "[[:lower:]]" 0
de_DE.UTF-8 "ä" "[[:lower:]]" 0
diff --git a/posix/tst-regexloc.c b/posix/tst-regexloc.c
index 60235b4d3b..7fbc496d0c 100644
--- a/posix/tst-regexloc.c
+++ b/posix/tst-regexloc.c
@@ -29,8 +29,8 @@ do_test (void)
   if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
     puts ("cannot set locale");
-  else if (regcomp (&re, "[a-f]*", 0) != REG_NOERROR)
-    puts ("cannot compile expression \"[a-f]*\"");
+  else if (regcomp (&re, "[abcdef]*", 0) != REG_NOERROR)
+    puts ("cannot compile expression \"[abcdef]*\"");
   else if (regexec (&re, "abcdefCDEF", 1, mat, 0) == REG_NOMATCH)
     puts ("no match");
   else
--
2.14.3