Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
Commit Message
On 07/20/2018 03:19 PM, Florian Weimer wrote:
> On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
>> On 07/19/2018 04:39 PM, Florian Weimer wrote:
>>> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>>>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>>>> exercise that [a-z] does not match A or Z.
>>>
>>> [a-z] still matches ñ,
Comments
On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
> v2
> - Fixed tr_TR by duplicating A-Z rational range.
> - Fixed tst-rxspencer.
> - Fixed bug-regex17.
>
> Tell me how the new version does.
My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
support, too, and initial results look good as well.
Thanks,
Florian
On 07/23/2018 11:10 AM, Florian Weimer wrote:
> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>> v2
>> - Fixed tr_TR by duplicating A-Z rational range.
>> - Fixed tst-rxspencer.
>> - Fixed bug-regex17.
>>
>> Tell me how the new version does.
>
> My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
> support, too, and initial results look good as well.
OK, so we have the capability to deploy rational ranges.
Florian,
Should we do so in 2.28? Avoiding all possible problems in the future
and making the ranges portable, rational, and safe from a security
perspective?
Rafal,
As localedata maintainer what is your opinion of changing the meaning
of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
which mean exactly the latin character sequences you would expect
e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
[A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
Mike,
Same question to you.
For historical context in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
For context from POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
(see the section on "RE Bracket Expressions").
Support for rational ranges would make [a-z], [A-Z], [0-9] and other subranges
rational for all locales, and would no longer include mixed case, or accents.
I'd like to hear affirmatives from the localedata maintainers on this issue.
Cheers,
Carlos.
23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
> [...]
> Rafal,
>
> As localedata maintainer what is your opinion of changing the meaning
> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
> which mean exactly the latin character sequences you would expect
> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
Having discussed this off-list my answer is: I'm in favor of implementing
rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
code-point ranges. But I understand that this is possible only in 2.29.
Therefore for 2.28 I support this data-based solution.
Regards,
Rafal
On 07/24/2018 04:45 PM, Rafal Luzynski wrote:
> 23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
>> [...]
>> Rafal,
>>
>> As localedata maintainer what is your opinion of changing the meaning
>> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
>> which mean exactly the latin character sequences you would expect
>> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
>> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
>
> Having discussed this off-list my answer is: I'm in favor of implementing
> rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
> code-point ranges. But I understand that this is possible only in 2.29.
> Therefore for 2.28 I support this data-based solution.
From the perspective of the user of the library and the locales the
rational ranges we implement will look as-if they were code point ranges
for the ranges in question e.g. a-z, A-Z, 0-9 and their subranges.
For 2.28 we will implement rational ranges for [a-z], [A-Z], and [0-9],
and all of their subsets via a data-only solution. Just wanted to make
it clear that all subsets will be treated as rational ranges.
It is only for other subsets like [!-~] (ASCII range) where we will not
have a rational range until we switch to making ranges operate on code
points. That will be a 2.29 optimization.
OK, I will prepare a patch to fix this.
Cheers,
Carlos.
On 07/24/2018 04:45 PM, Rafal Luzynski wrote:
> 23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
>> [...]
>> Rafal,
>>
>> As localedata maintainer what is your opinion of changing the meaning
>> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
>> which mean exactly the latin character sequences you would expect
>> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
>> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
>
> Having discussed this off-list my answer is: I'm in favor of implementing
> rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
> code-point ranges. But I understand that this is possible only in 2.29.
> Therefore for 2.28 I support this data-based solution.
I'll put together a final patch ASAP that provides:
* Deinterlace upper/lower
* Group a-z, A-Z, 0-9,
* NEWS entry for rational ranges.
Note: manual/stdio.texi also makes the mistake of saying [a-z] is lowercase
characters, so this will fix the manual bug with no change :-)
Cheers,
Carlos.
Carlos O'Donell <carlos@redhat.com> wrote:
> On 07/23/2018 11:10 AM, Florian Weimer wrote:
>> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>>> v2
>>> - Fixed tr_TR by duplicating A-Z rational range.
>>> - Fixed tst-rxspencer.
>>> - Fixed bug-regex17.
>>>
>>> Tell me how the new version does.
>>
>> My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
>> support, too, and initial results look good as well.
>
> OK, so we have the capability to deploy rational ranges.
>
> Florian,
>
> Should we do so in 2.28? Avoiding all possible problems in the future
> and making the ranges portable, rational, and safe from a security
> perspective?
>
> Rafal,
>
> As localedata maintainer what is your opinion of changing the meaning
> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
> which mean exactly the latin character sequences you would expect
> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
>
> Mike,
>
> Same question to you.
I agree that rational ranges are much more useful.
I cannot imagine any use case for [a-z] matching aAbB...z and not Z.
One never knows what [a-z] would match if it uses the locale sort order,
it is just too confusing.
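To make the confusion concrete, here is a minimal C sketch that contrasts
a range evaluated over an interleaved collation sequence with the same
range evaluated by code point. The sequence string and both function
names are illustrative assumptions, not glibc internals, and real
collation data is far larger than this toy table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interleaved collation sequence: each lowercase letter
   is immediately followed by its uppercase form.  */
static const char interleaved_order[] =
  "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";

/* Return 1 if C sorts between LO and HI (inclusive) in ORDER, else 0.
   Characters that do not appear in ORDER never match.  */
int
in_collation_range (const char *order, char lo, char hi, char c)
{
  assert (order != NULL);
  if (lo == '\0' || hi == '\0' || c == '\0')
    return 0;

  const char *plo = strchr (order, lo);
  const char *phi = strchr (order, hi);
  const char *pc = strchr (order, c);
  if (plo == NULL || phi == NULL || pc == NULL)
    return 0;
  return plo <= pc && pc <= phi;
}

/* Return 1 if C lies between LO and HI (inclusive) by code point,
   which is the reading a rational range is meant to give.  */
int
in_codepoint_range (char lo, char hi, char c)
{
  return (unsigned char) lo <= (unsigned char) c
         && (unsigned char) c <= (unsigned char) hi;
}
```

With the interleaved table, in_collation_range (interleaved_order, 'a',
'z', 'A') is true while the same call for 'Z' is false, because 'Z' is
the only letter that sorts after 'z'. in_codepoint_range ('a', 'z', c)
rejects every uppercase letter.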
In the long run, I think implementing ranges by code points would be
the best solution and make updates of the iso14651_t1_common file easier,
because we would then need to make fewer changes to the upstream version
of that file.
But for 2.28 this cannot be done. Therefore, I think the solution
by Carlos is very good.
> For historical context in gawk:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
>
> For context from POSIX:
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
> (see the section on "RE Bracket Expressions").
>
> Support for rational ranges would make [a-z], [A-Z], [0-9] and other subranges
> rational for all locales, and would no longer include mixed case, or accents.
>
> I'd like to hear affirmatives from the localedata maintainers on this issue.
>
> Cheers,
> Carlos.
On 07/23/2018 11:10 AM, Florian Weimer wrote:
> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>> v2
>> - Fixed tr_TR by duplicating A-Z rational range.
>> - Fixed tst-rxspencer.
>> - Fixed bug-regex17.
>>
>> Tell me how the new version does.
>
> My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
> support, too, and initial results look good as well.
OK, here is v3.
~~~ NEWS ~~
* The GNU C Library now uses rational ranges for regular expression
matching of ranges that are within a-z, A-Z, and 0-9 for all
locales. This means that the range [a-c] will no longer match
accented letter a's and will only match exactly a, b, and c. Likewise
[0-9] will only include the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and
no other characters. Rational ranges have been implemented by
several other GNU projects to provide straightforward rules for
regular expression ranges and to make them portable across locales.
The current rational ranges are implemented using collation element
ordering, which may yield unexpected results if the range includes
accented characters e.g. [a-ñ]: such a range will include all of a-z,
because ñ comes after the rational range in collation element order.
In the future the library may implement full rational ranges covering
all characters by using Unicode code point ordering which will make
the sequences faster to match and more portable.
~~~
We have approval from Mike and Rafal, the two localedata subsystem
maintainers.
This solution matches what you and Rich Felker both think is the
correct solution.
So for 2.28 we would use rational ranges for a-z, A-Z, and 0-9, until
we can implement code point ranges.
v3
- Merged lowercase/uppercase deinterlacing.
- Added NEWS entry.
Please run this through your checker, and ACK this for 2.28 and I'll
commit.
Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
Cheers,
Carlos.
On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
Quick comment. The middle line here adds trailing whitespace:
- { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+ The U+02DA RING ABOVE is chosen because it's not in [s-㏜]. */
Florian
On 07/25/2018 04:18 PM, Florian Weimer wrote:
> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
>
> Quick comment. The middle line here adds trailing whitespace:
>
> - { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
> +
> + The U+02DA RING ABOVE is chosen because it's not in [s-㏜]. */
Thanks. I'll fix this with v4.
I had to fix the following locales:
modified: localedata/locales/ar_SA
modified: localedata/locales/km_KH
modified: localedata/locales/lo_LA
modified: localedata/locales/or_IN
modified: localedata/locales/sl_SI
modified: localedata/locales/th_TH
They all re-arranged ASCII character collation element ordering like tr_TR,
and so they needed manual fixing.
Could you please add these locales to your tester?
c.
On 07/25/2018 10:25 PM, Carlos O'Donell wrote:
> On 07/25/2018 04:18 PM, Florian Weimer wrote:
>> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
>>
>> Quick comment. The middle line here adds trailing whitespace:
>>
>> - { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
>> +
>> + The U+02DA RING ABOVE is chosen because it's not in [s-㏜]. */
>
> Thanks. I'll fix this with v4.
I have verified that localedata/locales/iso14651_t1_common is just a
reordering (except for the new comments).
localedata/locales/tr_TR is more complicated, but looks like an
order-only change to me too.
> I had to fix the following locales:
>
> modified: localedata/locales/ar_SA
> modified: localedata/locales/km_KH
> modified: localedata/locales/lo_LA
> modified: localedata/locales/or_IN
> modified: localedata/locales/sl_SI
> modified: localedata/locales/th_TH
Do you have the actual locale names handy? localedata/SUPPORTED
contains charsets, but I'm not sure if the translation to locale names
is completely regular.
> They all re-arranged ASCII character collation element ordering like tr_TR,
> and so they needed manual fixing.
>
> Could you please add these locales to your tester?
I will try. I already have an xtests part, and these probably need to
go there as well.
Thanks,
Florian
On 07/25/2018 04:31 PM, Florian Weimer wrote:
> On 07/25/2018 10:25 PM, Carlos O'Donell wrote:
>> On 07/25/2018 04:18 PM, Florian Weimer wrote:
>>> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>>>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
>>>
>>> Quick comment. The middle line here adds trailing whitespace:
>>>
>>> - { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
>>> +
>>> + The U+02DA RING ABOVE is chosen because it's not in [s-㏜]. */
>>
>> Thanks. I'll fix this with v4.
>
> I have verified that localedata/locales/iso14651_t1_common is just a reordering (except for the new comments).
>
> localedata/locales/tr_TR is more complicated, but looks like an order-only change for me too.
>
>> I had to fix the following locales:
>>
>> modified: localedata/locales/ar_SA
>> modified: localedata/locales/km_KH
>> modified: localedata/locales/lo_LA
>> modified: localedata/locales/or_IN
>> modified: localedata/locales/sl_SI
>> modified: localedata/locales/th_TH
>
> Do you have the actual locale names handy? localedata/SUPPORTED contains charsets, but I'm not sure if the translation to locale names is completely regular.
It is completely regular, in that ar_SA => ar_SA.UTF-8, and so forth.
>> They all re-arranged ASCII character collation element ordering like tr_TR,
>> and so they needed manual fixing.
>>
>> Could you please add these locales to your tester?
>
> I will try. I already have an xtests part, and these probably need to go there as well.
v4
- Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
- Added range checking for a-z, A-Z for all supported UTF-8 locales.
All of my testers are clean.
So the question is now:
Do we commit to rational ranges for a-z, A-Z, 0-9 ... for 2.28?
or
Do we just do the deinterlacing of iso14651_t1_common to fix en_US.UTF-8?
Cheers,
Carlos.
25.07.2018 22:25 Carlos O'Donell <carlos@redhat.com> wrote:
> [...]
> I had to fix the following locales:
>
> modified: localedata/locales/ar_SA
> modified: localedata/locales/km_KH
> modified: localedata/locales/lo_LA
> modified: localedata/locales/or_IN
> modified: localedata/locales/sl_SI
> modified: localedata/locales/th_TH
>
> They all re-arranged ASCII character collation element ordering like tr_TR,
> and so they needed manual fixing.
Please check bg_BG. It also has a large reorder: puts all Cyrillic characters
before Latin. (However, this may not be relevant at all.)
Regards,
Rafal
On 07/25/2018 05:06 PM, Rafal Luzynski wrote:
> 25.07.2018 22:25 Carlos O'Donell <carlos@redhat.com> wrote:
>> [...]
>> I had to fix the following locales:
>>
>> modified: localedata/locales/ar_SA
>> modified: localedata/locales/km_KH
>> modified: localedata/locales/lo_LA
>> modified: localedata/locales/or_IN
>> modified: localedata/locales/sl_SI
>> modified: localedata/locales/th_TH
>>
>> They all re-arranged ASCII character collation element ordering like tr_TR,
>> and so they needed manual fixing.
>
> Please check bg_BG. It also has a large reorder: puts all Cyrillic characters
> before Latin. (However, this may not be relevant at all.)
Right, that won't affect the rational range for ASCII.
The new tst-fnmatch.input has this:
bg_BG.UTF-8 "a" "[a-z]" 0
bg_BG.UTF-8 "z" "[a-z]" 0
bg_BG.UTF-8 "A" "[a-z]" NOMATCH
bg_BG.UTF-8 "Z" "[a-z]" NOMATCH
bg_BG.UTF-8 "A" "[A-Z]" 0
bg_BG.UTF-8 "Z" "[A-Z]" 0
bg_BG.UTF-8 "a" "[A-Z]" NOMATCH
bg_BG.UTF-8 "z" "[A-Z]" NOMATCH
Which tests the range extremes, and it passes.
It doesn't reorder any actual LATIN characters and so it's safe.
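For anyone who wants to repeat the range-extreme check outside the test
suite, the same eight cases can be sketched directly against fnmatch().
This is only an illustration: the function name check_range_extremes and
the case table are made up for this sketch and are not part of
tst-fnmatch.c. It assumes the named locale is installed and returns -1
when setlocale() rejects it.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* One check: STRING matched against PATTERN should return EXPECTED.  */
struct range_case
{
  const char *string;
  const char *pattern;
  int expected;
};

/* The range extremes for [a-z] and [A-Z], mirroring the entries above.  */
static const struct range_case range_cases[] =
  {
    { "a", "[a-z]", 0 },
    { "z", "[a-z]", 0 },
    { "A", "[a-z]", FNM_NOMATCH },
    { "Z", "[a-z]", FNM_NOMATCH },
    { "A", "[A-Z]", 0 },
    { "Z", "[A-Z]", 0 },
    { "a", "[A-Z]", FNM_NOMATCH },
    { "z", "[A-Z]", FNM_NOMATCH },
  };

/* Switch to LOCALE and run every case.  Return the number of failing
   cases, or -1 if the locale could not be selected.  */
int
check_range_extremes (const char *locale)
{
  assert (locale != NULL);
  if (setlocale (LC_ALL, locale) == NULL)
    return -1;

  int failures = 0;
  for (size_t i = 0; i < sizeof range_cases / sizeof range_cases[0]; ++i)
    if (fnmatch (range_cases[i].pattern, range_cases[i].string, 0)
        != range_cases[i].expected)
      ++failures;
  return failures;
}
```

Calling check_range_extremes ("bg_BG.UTF-8") on a system with that locale
built should return 0 if the ranges behave as the test input expects.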
Cheers,
Carlos.
On 07/25/2018 04:57 PM, Carlos O'Donell wrote:
> v4
> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
> - Added range checking for a-z, A-Z for all supported UTF-8 locales.
>
> All of my testers are clean.
Attaching v4 on top of the current master.
This fixes all the locales.
All locales, even those with tailoring, have rational range support now.
If this passes your tests tomorrow I'm OK to put this into 2.28.
Cheers,
Carlos.
On 07/26/2018 04:34 AM, Carlos O'Donell wrote:
> On 07/25/2018 04:57 PM, Carlos O'Donell wrote:
>> v4
>> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
>> - Added range checking for a-z, A-Z for all supported UTF-8 locales.
>>
>> All of my testers are clean.
>
> Attaching v4 on top of the current master.
>
> This fixes all the locales.
I wrote another enumeration tester, this time covering all locales. It
found these issues:
az_AZ: U+000069 fails to match /[a-z]/
az_AZ: U+000049 fails to match /[A-Z]/
az_AZ.utf8: U+000069 fails to match /[a-z]/
az_AZ.utf8: U+000049 fails to match /[A-Z]/
crh_UA: U+000069 fails to match /[a-z]/
crh_UA: U+000049 fails to match /[A-Z]/
crh_UA.utf8: U+000069 fails to match /[a-z]/
crh_UA.utf8: U+000049 fails to match /[A-Z]/
ku_TR: U+000069 fails to match /[a-z]/
ku_TR: U+000049 fails to match /[A-Z]/
ku_TR.iso88599: U+000069 fails to match /[a-z]/
ku_TR.iso88599: U+000049 fails to match /[A-Z]/
ku_TR.utf8: U+000069 fails to match /[a-z]/
ku_TR.utf8: U+000049 fails to match /[A-Z]/
lv_LV: U+000079 fails to match /[a-z]/
lv_LV: U+000059 fails to match /[A-Z]/
lv_LV.iso885913: U+000079 fails to match /[a-z]/
lv_LV.iso885913: U+000059 fails to match /[A-Z]/
lv_LV.utf8: U+000079 fails to match /[a-z]/
lv_LV.utf8: U+000059 fails to match /[A-Z]/
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
slovene: U+00006A fails to match /[a-z]/
slovene: U+00006B fails to match /[a-z]/
slovene: U+00006C fails to match /[a-z]/
slovene: U+00006D fails to match /[a-z]/
slovene: U+00006E fails to match /[a-z]/
slovene: U+00006F fails to match /[a-z]/
slovenian: U+00006A fails to match /[a-z]/
slovenian: U+00006B fails to match /[a-z]/
slovenian: U+00006C fails to match /[a-z]/
slovenian: U+00006D fails to match /[a-z]/
slovenian: U+00006E fails to match /[a-z]/
slovenian: U+00006F fails to match /[a-z]/
sl_SI: U+00006A fails to match /[a-z]/
sl_SI: U+00006B fails to match /[a-z]/
sl_SI: U+00006C fails to match /[a-z]/
sl_SI: U+00006D fails to match /[a-z]/
sl_SI: U+00006E fails to match /[a-z]/
sl_SI: U+00006F fails to match /[a-z]/
sl_SI.iso88592: U+00006A fails to match /[a-z]/
sl_SI.iso88592: U+00006B fails to match /[a-z]/
sl_SI.iso88592: U+00006C fails to match /[a-z]/
sl_SI.iso88592: U+00006D fails to match /[a-z]/
sl_SI.iso88592: U+00006E fails to match /[a-z]/
sl_SI.iso88592: U+00006F fails to match /[a-z]/
sl_SI.utf8: U+00006A fails to match /[a-z]/
sl_SI.utf8: U+00006B fails to match /[a-z]/
sl_SI.utf8: U+00006C fails to match /[a-z]/
sl_SI.utf8: U+00006D fails to match /[a-z]/
sl_SI.utf8: U+00006E fails to match /[a-z]/
sl_SI.utf8: U+00006F fails to match /[a-z]/
sv_FI: U+000077 fails to match /[a-z]/
sv_FI: U+000057 fails to match /[A-Z]/
sv_FI@euro: U+000077 fails to match /[a-z]/
sv_FI@euro: U+000057 fails to match /[A-Z]/
sv_FI.iso88591: U+000077 fails to match /[a-z]/
sv_FI.iso88591: U+000057 fails to match /[A-Z]/
sv_FI.iso885915@euro: U+000077 fails to match /[a-z]/
sv_FI.iso885915@euro: U+000057 fails to match /[A-Z]/
sv_FI.utf8: U+000077 fails to match /[a-z]/
sv_FI.utf8: U+000057 fails to match /[A-Z]/
sv_SE: U+000077 fails to match /[a-z]/
sv_SE: U+000057 fails to match /[A-Z]/
sv_SE.iso88591: U+000077 fails to match /[a-z]/
sv_SE.iso88591: U+000057 fails to match /[A-Z]/
sv_SE.utf8: U+000077 fails to match /[a-z]/
sv_SE.utf8: U+000057 fails to match /[A-Z]/
swedish: U+000077 fails to match /[a-z]/
swedish: U+000057 fails to match /[A-Z]/
tt_RU: U+000069 fails to match /[a-z]/
tt_RU: U+000049 fails to match /[A-Z]/
tt_RU@iqtelif: U+000069 fails to match /[a-z]/
tt_RU@iqtelif: U+000049 fails to match /[A-Z]/
tt_RU.utf8: U+000069 fails to match /[a-z]/
tt_RU.utf8: U+000049 fails to match /[A-Z]/
tt_RU.utf8@iqtelif: U+000069 fails to match /[a-z]/
tt_RU.utf8@iqtelif: U+000049 fails to match /[A-Z]/
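The actual tester is not part of this message. As a rough idea of what
such an enumeration can look like, here is a minimal sketch; it is not
the tool that produced the list above. The function name
count_range_mismatches and its parameters are hypothetical. It assumes
that wchar_t values are Unicode code points (true for glibc) and simply
skips code points the locale's charset cannot encode.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Match every code point in 1..MAX_CP against the anchored PATTERN in
   LOCALE.  A code point is expected to match exactly when it lies in
   LO..HI.  Print each disagreement and return the number found, or -1
   if the locale or the pattern cannot be set up.  */
int
count_range_mismatches (const char *locale, const char *pattern,
                        unsigned int lo, unsigned int hi,
                        unsigned int max_cp)
{
  regex_t re;
  int mismatches = 0;

  assert (locale != NULL && pattern != NULL);
  if (setlocale (LC_ALL, locale) == NULL)
    return -1;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  for (unsigned int cp = 1; cp <= max_cp; ++cp)
    {
      char buf[MB_LEN_MAX + 1];
      mbstate_t state;
      memset (&state, 0, sizeof state);

      /* Encode the code point in the locale's charset.  */
      size_t len = wcrtomb (buf, (wchar_t) cp, &state);
      if (len == (size_t) -1)
        continue;               /* Not representable, nothing to test.  */
      buf[len] = '\0';

      int matched = regexec (&re, buf, 0, NULL, 0) == 0;
      int expected = lo <= cp && cp <= hi;
      if (matched != expected)
        {
          if (expected)
            printf ("%s: U+%06X fails to match /%s/\n", locale, cp, pattern);
          else
            printf ("%s: U+%06X matches /%s/ unexpectedly\n",
                    locale, cp, pattern);
          ++mismatches;
        }
    }

  regfree (&re);
  return mismatches;
}
```

A call such as count_range_mismatches ("sv_SE.UTF-8", "^[a-z]$", 'a',
'z', 0x10FFFF) would then walk the whole code space for one locale. The
pattern is anchored so that only single-character strings can match.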
Thanks,
Florian
On 07/26/2018 10:50 AM, Florian Weimer wrote:
> On 07/26/2018 04:34 AM, Carlos O'Donell wrote:
>> On 07/25/2018 04:57 PM, Carlos O'Donell wrote:
>>> v4
>>> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
>>> - Added range checking for a-z, A-Z for all supported UTF-8 locales.
>>>
>>> All of my testers are clean.
>>
>> Attaching v4 on top of the current master.
>>
>> This fixes all the locales.
>
> I wrote another enumeration tester, this time covering all locales. It found these issues:
>
> az_AZ: U+000069 fails to match /[a-z]/
> az_AZ: U+000049 fails to match /[A-Z]/
> az_AZ.utf8: U+000069 fails to match /[a-z]/
> az_AZ.utf8: U+000049 fails to match /[A-Z]/
See it.
> crh_UA: U+000069 fails to match /[a-z]/
> crh_UA: U+000049 fails to match /[A-Z]/
> crh_UA.utf8: U+000069 fails to match /[a-z]/
> crh_UA.utf8: U+000049 fails to match /[A-Z]/
See it.
> ku_TR: U+000069 fails to match /[a-z]/
> ku_TR: U+000049 fails to match /[A-Z]/
> ku_TR.iso88599: U+000069 fails to match /[a-z]/
> ku_TR.iso88599: U+000049 fails to match /[A-Z]/
> ku_TR.utf8: U+000069 fails to match /[a-z]/
> ku_TR.utf8: U+000049 fails to match /[A-Z]/
See it.
> lv_LV: U+000079 fails to match /[a-z]/
> lv_LV: U+000059 fails to match /[A-Z]/
> lv_LV.iso885913: U+000079 fails to match /[a-z]/
> lv_LV.iso885913: U+000059 fails to match /[A-Z]/
> lv_LV.utf8: U+000079 fails to match /[a-z]/
> lv_LV.utf8: U+000059 fails to match /[A-Z]/
See it.
> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
Good catch. These were the ones I was hoping your finder would catch.
> slovene: U+00006A fails to match /[a-z]/
> slovene: U+00006B fails to match /[a-z]/
> slovene: U+00006C fails to match /[a-z]/
> slovene: U+00006D fails to match /[a-z]/
> slovene: U+00006E fails to match /[a-z]/
> slovene: U+00006F fails to match /[a-z]/
This is an alias for sl_SI.ISO-8859-2 and we see it below.
> slovenian: U+00006A fails to match /[a-z]/
> slovenian: U+00006B fails to match /[a-z]/
> slovenian: U+00006C fails to match /[a-z]/
> slovenian: U+00006D fails to match /[a-z]/
> slovenian: U+00006E fails to match /[a-z]/
> slovenian: U+00006F fails to match /[a-z]/
This is an alias for sl_SI.ISO-8859-2 and we see it below.
> sl_SI: U+00006A fails to match /[a-z]/
> sl_SI: U+00006B fails to match /[a-z]/
> sl_SI: U+00006C fails to match /[a-z]/
> sl_SI: U+00006D fails to match /[a-z]/
> sl_SI: U+00006E fails to match /[a-z]/
> sl_SI: U+00006F fails to match /[a-z]/
See it.
> sl_SI.iso88592: U+00006A fails to match /[a-z]/
> sl_SI.iso88592: U+00006B fails to match /[a-z]/
> sl_SI.iso88592: U+00006C fails to match /[a-z]/
> sl_SI.iso88592: U+00006D fails to match /[a-z]/
> sl_SI.iso88592: U+00006E fails to match /[a-z]/
> sl_SI.iso88592: U+00006F fails to match /[a-z]/
See it (aliased above twice).
> sl_SI.utf8: U+00006A fails to match /[a-z]/
> sl_SI.utf8: U+00006B fails to match /[a-z]/
> sl_SI.utf8: U+00006C fails to match /[a-z]/
> sl_SI.utf8: U+00006D fails to match /[a-z]/
> sl_SI.utf8: U+00006E fails to match /[a-z]/
> sl_SI.utf8: U+00006F fails to match /[a-z]/
See it.
> sv_FI: U+000077 fails to match /[a-z]/
> sv_FI: U+000057 fails to match /[A-Z]/
See it.
> sv_FI@euro: U+000077 fails to match /[a-z]/
> sv_FI@euro: U+000057 fails to match /[A-Z]/
Same as sv_FI.
> sv_FI.iso88591: U+000077 fails to match /[a-z]/
> sv_FI.iso88591: U+000057 fails to match /[A-Z]/
Likewise.
> sv_FI.iso885915@euro: U+000077 fails to match /[a-z]/
> sv_FI.iso885915@euro: U+000057 fails to match /[A-Z]/
Likewise.
> sv_FI.utf8: U+000077 fails to match /[a-z]/
> sv_FI.utf8: U+000057 fails to match /[A-Z]/
Likewise.
> sv_SE: U+000077 fails to match /[a-z]/
> sv_SE: U+000057 fails to match /[A-Z]/
See it.
> sv_SE.iso88591: U+000077 fails to match /[a-z]/
> sv_SE.iso88591: U+000057 fails to match /[A-Z]/
Same as above.
> sv_SE.utf8: U+000077 fails to match /[a-z]/
> sv_SE.utf8: U+000057 fails to match /[A-Z]/
Likewise.
> swedish: U+000077 fails to match /[a-z]/
> swedish: U+000057 fails to match /[A-Z]/
Alias for sv_SE.
> tt_RU: U+000069 fails to match /[a-z]/
> tt_RU: U+000049 fails to match /[A-Z]/
See it.
> tt_RU@iqtelif: U+000069 fails to match /[a-z]/
> tt_RU@iqtelif: U+000049 fails to match /[A-Z]/
See it.
> tt_RU.utf8: U+000069 fails to match /[a-z]/
> tt_RU.utf8: U+000049 fails to match /[A-Z]/
See it.
> tt_RU.utf8@iqtelif: U+000069 fails to match /[a-z]/
> tt_RU.utf8@iqtelif: U+000049 fails to match /[A-Z]/
See it.
Thank you!
I increased tst-fnmatch.input coverage and I get this:
Line #3699: Test #3548 (az_AZ.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #3751: Test #3600 (az_AZ.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #6819: Test #6668 (crh_UA.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #6871: Test #6720 (crh_UA.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #18675: Test #18524 (ku_TR.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #18727: Test #18576 (ku_TR.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #19835: Test #19684 (lv_LV.UTF-8): fnmatch ("[a-z]", "y", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #19887: Test #19736 (lv_LV.UTF-8): fnmatch ("[A-Z]", "Y", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26684: Test #26533 (sl_SI.UTF-8): fnmatch ("[a-z]", "j", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26685: Test #26534 (sl_SI.UTF-8): fnmatch ("[a-z]", "k", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26686: Test #26535 (sl_SI.UTF-8): fnmatch ("[a-z]", "l", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26687: Test #26536 (sl_SI.UTF-8): fnmatch ("[a-z]", "m", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26688: Test #26537 (sl_SI.UTF-8): fnmatch ("[a-z]", "n", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26689: Test #26538 (sl_SI.UTF-8): fnmatch ("[a-z]", "o", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28049: Test #27898 (sv_FI.UTF-8): fnmatch ("[a-z]", "w", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28101: Test #27950 (sv_FI.UTF-8): fnmatch ("[A-Z]", "W", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28153: Test #28002 (sv_SE.UTF-8): fnmatch ("[a-z]", "w", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28205: Test #28054 (sv_SE.UTF-8): fnmatch ("[A-Z]", "W", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30427: Test #30276 (tt_RU.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30479: Test #30328 (tt_RU.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30531: Test #30380 (tt_RU.UTF-8@iqtelif): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30583: Test #30432 (tt_RU.UTF-8@iqtelif): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Which matches all the locales you saw failures in except for shs_CA, which is a real bug.
I'll fix these up quickly.
Cheers,
Carlos.
On 07/26/2018 10:50 AM, Florian Weimer wrote:
> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
This is a WIP, because the number of tests now is too big
to simply add them to tst-fnmatch.input, and so I'm writing
a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
expecting all of the locales to be built for testing, and
then running through all the rational ranges to test
inclusion of the required datums.
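The SUPPORTED parsing step could look roughly like the sketch below. It
is not the code in the WIP tst-rational-ranges.c. It assumes each entry
in SUPPORTED has the form "name/charset" followed by an optional
trailing backslash, and the function name parse_supported_line is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split one SUPPORTED entry such as "ar_SA.UTF-8/UTF-8 \" into its
   locale name and charset.  Return 1 on success and 0 if LINE is a
   comment, has no '/' separator, or does not fit the buffers.  */
int
parse_supported_line (const char *line, char *locale, size_t locale_size,
                      char *charset, size_t charset_size)
{
  assert (line != NULL && locale != NULL && charset != NULL);
  if (line[0] == '#')
    return 0;

  const char *slash = strchr (line, '/');
  if (slash == NULL)
    return 0;

  /* The locale name is everything before the slash; the charset runs
     up to the first blank, backslash, or newline after it.  */
  size_t name_len = (size_t) (slash - line);
  size_t charset_len = strcspn (slash + 1, " \t\\\n");
  if (name_len == 0 || charset_len == 0
      || name_len >= locale_size || charset_len >= charset_size)
    return 0;

  memcpy (locale, line, name_len);
  locale[name_len] = '\0';
  memcpy (charset, slash + 1, charset_len);
  charset[charset_len] = '\0';
  return 1;
}
```

Each parsed locale name can then be handed to setlocale() before running
the range checks. Lines without a slash, such as the variable assignment
at the top of the file, are skipped.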
How slow is your tester? Should I do what you do to test
for the inclusion of characters that shouldn't be in the
range? Or will that take too long?
v5
- Add ~30k+ tests to tst-fnmatch.input.
- Fix broken locales:
- Fix shs_CA to not reorder-after for no reason.
Could you run this through the tester please?
Cheers,
Carlos.
On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
> This is a WIP, because the number of tests now is too big
> to simply add them to tst-fnmatch.input, and so I'm writing
> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
> expecting all of the locales to be built for testing, and
> then running through all the rational ranges to test
> inclusion of the required datums.
Let me repeat my suggestion that we should initially fix the locales
with the common collation order, where glibc 2.28 regresses.
> How slow is your tester? Should I do what you do to test
> for the inclusion of characters that shouldn't be in the
> range? Or will that take too long?
>
> v5
> - Add ~30k+ tests to tst-fnmatch.input.
> - Fix broken locales:
> - Fix shs_CA to not reorder-after for no reason.
>
> Could you run this through the tester please?
It fails installation for me:
$ make localedata/install-locales DESTDIR=/tmp/locales
sl_SI.UTF-8...locales/sl_SI:1230: order for `U00000061' already defined
at locales/sl_SI:998
locales/sl_SI:1231: [error] symbol `S0062' not defined
locales/sl_SI:1231: [error] symbol `BASE' not defined
/bin/sh: line 17: 4148 Segmentation fault (core dumped) I18NPATH=.
GCONV_PATH=/home/fweimer/src/gnu/glibc/build/iconvdata LC_ALL=C
/home/fweimer/src/gnu/glibc/build/elf/ld-linux-x86-64.so.2
--library-path
/home/fweimer/src/gnu/glibc/build:/home/fweimer/src/gnu/glibc/build/math:/home/fweimer/src/gnu/glibc/build/elf:/home/fweimer/src/gnu/glibc/build/dlfcn:/home/fweimer/src/gnu/glibc/build/nss:/home/fweimer/src/gnu/glibc/build/nis:/home/fweimer/src/gnu/glibc/build/rt:/home/fweimer/src/gnu/glibc/build/resolv:/home/fweimer/src/gnu/glibc/build/mathvec:/home/fweimer/src/gnu/glibc/build/support:/home/fweimer/src/gnu/glibc/build/crypt:/home/fweimer/src/gnu/glibc/build/nptl
/home/fweimer/src/gnu/glibc/build/locale/localedef $flags
--alias-file=../intl/locale.alias -i locales/$input -f charmaps/$charset
--prefix=/tmp/locales $locale
GDB says this:
Core was generated by
`/home/fweimer/src/gnu/glibc/build/elf/ld-linux-x86-64.so.2
--library-path /home'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000419234 in output_weight (pool=pool@entry=0x7ffdf1550ce0,
collate=collate@entry=0x7fd5a8a03240,
elem=elem@entry=0x7fd5a8a9b300) at programs/ld-collate.c:1912
1912 len += utf8_encode (&buf[len],
(gdb) bt
#0 0x0000000000419234 in output_weight (pool=pool@entry=0x7ffdf1550ce0,
collate=collate@entry=0x7fd5a8a03240,
elem=elem@entry=0x7fd5a8a9b300) at programs/ld-collate.c:1912
#1 0x000000000041dc4a in collate_output () at programs/ld-collate.c:2180
#2 0x000000000042709f in write_all_categories
(definitions=0x7ffdf15513c0, charmap=charmap@entry=0x7fd5a71786a0,
locname=0x7ffdf1552e33 "sl_SI.UTF-8",
output_path=output_path@entry=0x7fd5a7178310
"/tmp/locales/usr/lib64/locale/sl_SI.utf8/")
at programs/locfile.c:337
#3 0x0000000000402f69 in main (argc=<optimized out>,
argv=0x7ffdf1551630) at programs/localedef.c:300
(gdb) l
1907 int i;
1908
1909 for (i = 0; i < elem->weights[cnt].cnt; ++i)
1910 /* Encode the weight value. We do nothing for IGNORE
entries. */
1911 if (elem->weights[cnt].w[i] != NULL)
1912 len += utf8_encode (&buf[len],
1913
elem->weights[cnt].w[i]->mborder[cnt]);
1914
1915 /* And add the buffer content. */
1916 obstack_1grow (pool, len);
(gdb) print elem->weights[cnt].w[i]->mborder[cnt]
Cannot access memory at address 0x0
(gdb) print elem->weights[cnt].w[i]->mborder
$3 = (int *) 0x0
(gdb)
Any idea what is going on?
Thanks,
Florian
On 07/30/2018 01:39 PM, Florian Weimer wrote:
> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
>
>> This is a WIP, because the number of tests now is too big
>> to simply add them to tst-fnmatch.input, and so I'm writing
>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>> expecting all of the locales to be built for testing, and
>> then running through all the rational ranges to test
>> inclusion of the required datums.
>
> Let me repeat my suggestion that we should initially fix the locales
> with the common collation order, where glibc 2.28 regresses.
I do not think it is appropriate to release rational range support on
only a subset of the SUPPORTED set of locales. Either we support it on
all SUPPORTED locales or we work until we are ready.
At present glibc 2.28 does not regress because of commit
7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
uppercase.
In glibc 2.28 we simply have ~2500 characters in the range of a-z,
and in 2.27 we had ~250, it's still a large set of non-ASCII characters
accepted by the range, all because we caught up to Unicode 9.0.0 with
the ISO 14651 collation update (and will soon be updated to Unicode 10.0.0
with the next release, and probably always lagging a bit).
I don't see an urgent need to get rational range support into 2.28.
I was happy to get it in earlier, but now with deeper testing showing
that not all locales are working correctly, I'm not happy to see this
go out the door. I think it will be ready very shortly, and we can check
it in immediately into 2.29, and then continue our work on code point
ranges as the next step, which will require even more testing, and
internal API cleanup.
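
The kind of inclusion check discussed in this thread can be sketched for a single locale as follows. This is only a minimal illustration, not the tst-rational-ranges.c tester mentioned above: the helper name count_ascii_matches is hypothetical, it looks only at single-byte characters 1..127, and it uses whatever locale is currently active (the "C" locale unless setlocale has been called).

```c
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical helper: count the single-byte characters in [1, 127]
   that PATTERN matches under the current locale.  If OUT is not NULL,
   the matching characters are stored there as a NUL-terminated string
   (OUT must have room for 128 bytes).  */
static int
count_ascii_matches (const char *pattern, char *out)
{
  int count = 0;

  for (int c = 1; c < 128; ++c)
    {
      char text[2] = { (char) c, '\0' };

      /* fnmatch returns 0 when TEXT matches PATTERN.  */
      if (fnmatch (pattern, text, 0) == 0)
        {
          if (out != NULL)
            out[count] = (char) c;
          ++count;
        }
    }

  if (out != NULL)
    out[count] = '\0';
  return count;
}
```

A full tester would additionally have to walk the non-ASCII characters of every locale in SUPPORTED, which this sketch does not attempt.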
On 07/30/2018 07:45 PM, Carlos O'Donell wrote:
> On 07/30/2018 01:39 PM, Florian Weimer wrote:
>> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
>>
>>> This is a WIP, because the number of tests now is too big
>>> to simply add them to tst-fnmatch.input, and so I'm writing
>>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>>> expecting all of the locales to be built for testing, and
>>> then running through all the rational ranges to test
>>> inclusion of the required datums.
>>
>> Let me repeat my suggestion that we should initially fix the locales
>> with the common collation order, where glibc 2.28 regresses.
>
> I do not think it is appropriate to release rational range support on
> only a subset of the SUPPORTED set of locales. Either we support it on
> all SUPPORTED locales or we work until we are ready.
>
> At present glibc 2.28 does not regress because of commit
> 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
> uppercase.
>
> In glibc 2.28 we simply have ~2500 characters in the range of a-z,
> and in 2.27 we had ~250, it's still a large set of non-ASCII characters
> accepted by the range, all because we caught up to Unicode 9.0.0 with
> the ISO 14651 collation update (and will soon be updated to Unicode 10.0.0
> with the next release, and probably always lagging a bit).
Ahh. So it's more complex and a regression longer in the making.
> I don't see an urgent need to get rational range support into 2.28.
> I was happy to get it in earlier, but now with deeper testing showing
> that not all locales are working correctly, I'm not happy to see this
> go out the door. I think it will be ready very shortly, and we can check
> it in immediately into 2.29, and then continue our work on code point
> ranges as the next step, which will require even more testing, and
> internal API cleanup.
Sounds reasonable.
Thanks,
Florian
On 07/30/2018 01:54 PM, Florian Weimer wrote:
> On 07/30/2018 07:45 PM, Carlos O'Donell wrote:
>> On 07/30/2018 01:39 PM, Florian Weimer wrote:
>>> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>>>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
>>>
>>>> This is a WIP, because the number of tests now is too big
>>>> to simply add them to tst-fnmatch.input, and so I'm writing
>>>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>>>> expecting all of the locales to be built for testing, and
>>>> then running through all the rational ranges to test
>>>> inclusion of the required datums.
>>>
>>> Let me repeat my suggestion that we should initially fix the locales
>>> with the common collation order, where glibc 2.28 regresses.
>>
>> I do not think it is appropriate to release rational range support on
>> only a subset of the SUPPORTED set of locales. Either we support it on
>> all SUPPORTED locales or we work until we are ready.
>>
>> At present glibc 2.28 does not regress because of commit
>> 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
>> uppercase.
>>
>> In glibc 2.28 we simply have ~2500 characters in the range of a-z,
>> and in 2.27 we had ~250, it's still a large set of non-ASCII characters
>> accepted by the range, all because we caught up to Unicode 9.0.0 with
>> the ISO 14651 collation update (and will soon be updated to Unicode 10.0.0
>> with the next release, and probably always lagging a bit).
>
> Ahh. So it's more complex and a regression longer in the making.
I'm worried I don't quite follow your statement of "longer in the making,"
but let me summarize what I think you wrote, and tell me if I have
it right.
The regression, from the perspective of en_US, is that [a-z] in master
accepts uppercase ASCII characters, and this breaks user expectations.
This is the only regression I'm considering serious enough to block the
release for and we've fixed it for now.
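
The user expectation described above can be probed with a small sketch along these lines. It assumes nothing beyond the POSIX regcomp/regexec interface; the helper name regex_matches is hypothetical, and the result for any locale other than "C" depends on the installed locale data and glibc version.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if TEXT matches the POSIX extended
   regular expression PATTERN under the current locale, 0 if it does
   not, and -1 if PATTERN fails to compile.  */
static int
regex_matches (const char *pattern, const char *text)
{
  regex_t re;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  /* With REG_NOSUB no match offsets are reported, so nmatch is 0.  */
  int rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0 ? 1 : 0;
}
```

Calling setlocale (LC_ALL, "en_US.UTF-8") before regex_matches ("^[a-z]$", "A") would exercise the en_US case under discussion, provided that locale is installed.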
The regression which you say is "longer in the making" is that at some
point in the past the collation data for en_US contained only ASCII
ranges for a-z, A-Z, and 0-9. Then at some point in the past the ranges,
particularly those from a-z, and A-Z began accepting non-ASCII characters.
Thus the regression, from your perspective, happened far in the past.
As far as I can tell the regression has existed since the first import
for en_US which copied LC_COLLATE from en_DK (showing en_DK):
~~~
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 967) <A> <A>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 968) <a> <A>;<NONE>;<SMALL>;IGNORE
...
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'> <Z>;<ACUTE>;<CAPITAL>;IGNORE
~~~
Is this what you mean by "longer in the making?"
I expect that en_US at some point along the way was switched to use the
iso14651_t1 data, and so gained a non-interleaved a-z/A-Z collation element
order (CEO), but it's hard to tell exactly whether the CEO was fully
functional, whether fnmatch worked as expected, etc.
Either way this is all a poorly understood and poorly structured solution at
this point, and I hope that in 1 or 2 releases we go from "unusable interface"
to "rational ranges (data)" to "full rational ranges (code point ranges)" and
end up with a sensible portable solution.
>> I don't see an urgent need to get rational range support into 2.28.
>> I was happy to get it in earlier, but now with deeper testing showing
>> that not all locales are working correctly, I'm not happy to see this
>> go out the door. I think it will be ready very shortly, and we can check
>> it in immediately into 2.29, and then continue our work on code point
>> ranges as the next step, which will require even more testing, and
>> internal API cleanup.
>
> Sounds reasonable.
That sounds great. I will continue to update this patch set and get some
independent checking from your scripts, and my own testing. I also need
to add collation tests for all the locales I touch to ensure that the
reordering is just that, and that it doesn't materially change the collation
sequence (if it does it's a bug). This all adds more coverage to the
SUPPORTED set of languages which is a positive thing.
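
To illustrate why reordering entries can change range membership without changing the collation sequence, here is a toy model. It is a sketch under loud assumptions: struct entry, the two tables, and both helpers are hypothetical and are not glibc's data structures. The weights only mimic the shape of the locale source lines (for example a as 0x61;BASE;MIN and fullwidth a as 0x61;BASE;WIDE).

```c
#include <stddef.h>

/* Toy model, not glibc's representation: a code point plus its
   four weight levels.  */
struct entry
{
  unsigned int cp;
  int weight[4];
};

/* Definition order before the change: variants follow each letter.  */
static const struct entry interleaved[] = {
  { 0x0061, { 0x61, 0, 0, 0x0061 } },  /* a */
  { 0xFF41, { 0x61, 0, 1, 0xFF41 } },  /* fullwidth a */
  { 0x0062, { 0x62, 0, 0, 0x0062 } },  /* b */
  { 0xFF42, { 0x62, 0, 1, 0xFF42 } },  /* fullwidth b */
  { 0x0063, { 0x63, 0, 0, 0x0063 } },  /* c */
};

/* Definition order after the change: a, b, c are adjacent.  The
   weights of every entry are exactly the same as above.  */
static const struct entry grouped[] = {
  { 0x0061, { 0x61, 0, 0, 0x0061 } },  /* a */
  { 0x0062, { 0x62, 0, 0, 0x0062 } },  /* b */
  { 0x0063, { 0x63, 0, 0, 0x0063 } },  /* c */
  { 0xFF41, { 0x61, 0, 1, 0xFF41 } },  /* fullwidth a */
  { 0xFF42, { 0x62, 0, 1, 0xFF42 } },  /* fullwidth b */
};

/* Collation in this model: compare weights level by level.  The
   position of an entry in its table plays no role.  */
static int
compare_weights (const struct entry *a, const struct entry *b)
{
  for (int level = 0; level < 4; ++level)
    if (a->weight[level] != b->weight[level])
      return a->weight[level] < b->weight[level] ? -1 : 1;
  return 0;
}

/* Return the index of CP in TABLE, or -1 if it is absent.  */
static int
position_of (const struct entry *table, size_t n, unsigned int cp)
{
  for (size_t i = 0; i < n; ++i)
    if (table[i].cp == cp)
      return (int) i;
  return -1;
}

/* Range membership in this model: CP is in [FIRST-LAST] if its
   position lies between theirs in definition order.  */
static int
in_range (const struct entry *table, size_t n,
          unsigned int first, unsigned int last, unsigned int cp)
{
  int lo = position_of (table, n, first);
  int hi = position_of (table, n, last);
  int pos = position_of (table, n, cp);

  if (lo < 0 || hi < 0 || pos < 0)
    return 0;
  return lo <= pos && pos <= hi;
}
```

In this model the fullwidth a falls inside the a-c range for the interleaved table but outside it for the grouped one, while compare_weights gives the same answer for any pair of entries in either table. That is the property the collation tests are meant to confirm for the real locale data.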
On 07/30/2018 08:25 PM, Carlos O'Donell wrote:
> As far as I can tell the regression has existed since the first import
> for en_US which copied LC_COLLATE from en_DK (showing en_DK):
> ~~~
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 967) <A> <A>;<NONE>;<CAPITAL>;IGNORE
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 968) <a> <A>;<NONE>;<SMALL>;IGNORE
> ...
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'> <Z>;<ACUTE>;<CAPITAL>;IGNORE
> ~~~
> Is this what you mean by "longer in the making?"
Yes, that's what I meant. I didn't check whether it went back to 2.17,
2.12, or even earlier.
Thanks,
Florian
On 07/30/2018 01:39 PM, Florian Weimer wrote:
> It fails installation for me:
I'm so sorry to waste your time like this.
I apparently failed to test sl_SI.
> $ make localedata/install-locales DESTDIR=/tmp/locales
> sl_SI.UTF-8...locales/sl_SI:1230: order for `U00000061' already defined at locales/sl_SI:998
> locales/sl_SI:1231: [error] symbol `S0062' not defined
> locales/sl_SI:1231: [error] symbol `BASE' not defined
... this is a cascading set of errors.
> (gdb) print elem->weights[cnt].w[i]->mborder[cnt]
> Cannot access memory at address 0x0
> (gdb) print elem->weights[cnt].w[i]->mborder
> $3 = (int *) 0x0
> (gdb)
>
> Any idea what is going on?
The parser should have stopped at the first error IMO; going any further
just results in problems. It's very hard to roll back the state of the parser
and data structures if there is an error in the source files. It should just
have stopped at the duplicate U0061 definition.
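
The stop-at-first-error behaviour argued for here can be sketched in isolation as below. This is not localedef code: order_table, definition, define_order and parse_fail_fast are hypothetical names, and the table is limited to code points below 128 to keep the example small.

```c
#include <stddef.h>

#define MAX_CP 128  /* Toy limit, an assumption of this sketch.  */

/* For each code point, the source line that defined its order,
   or 0 if it has not been defined yet.  */
struct order_table
{
  int defined_at[MAX_CP];
};

struct definition
{
  unsigned int cp;
  int line;
};

/* Record that CP gets its order at LINE.  Return 0 on success, the
   line of the earlier definition if CP is a duplicate, or -1 if CP
   is outside the toy table.  */
static int
define_order (struct order_table *t, unsigned int cp, int line)
{
  if (cp >= MAX_CP)
    return -1;
  if (t->defined_at[cp] != 0)
    return t->defined_at[cp];
  t->defined_at[cp] = line;
  return 0;
}

/* Process DEFS in order and stop at the first error, so no later
   entry is added on top of inconsistent state.  Returns the number
   of entries accepted; *FIRST_ERROR_LINE is the line of the failing
   entry, or 0 if every entry was accepted.  */
static size_t
parse_fail_fast (struct order_table *t, const struct definition *defs,
                 size_t n, int *first_error_line)
{
  for (size_t i = 0; i < n; ++i)
    if (define_order (t, defs[i].cp, defs[i].line) != 0)
      {
        *first_error_line = defs[i].line;
        return i;
      }

  *first_error_line = 0;
  return n;
}
```

Fed a sequence that defines U+0061 at line 998 and again at line 1230 (the two line numbers from the sl_SI error above), parse_fail_fast stops at line 1230 and never looks at the entries that follow.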
I'm testing a v6 with the sl_SI fixes, and a new test case.
@@ -63177,7 +63177,19 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U20BC> <S20BC>;<BASE>;<MIN>;<U20BC> % MANAT SIGN
<U20BD> <S20BD>;<BASE>;<MIN>;<U20BD> % RUBLE SIGN
<U20BE> <S20BE>;<BASE>;<MIN>;<U20BE> % LARI SIGN
+% Implement rational range for [0-9] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
+<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
+<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
+<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
+<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR
+<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE
+<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX
+<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN
+<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT
+<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE
<U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO
<U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO
<U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO
@@ -63250,7 +63262,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U2080> <S0030>;<BASE>;<MNS>;<U2080> % SUBSCRIPT ZERO
<U2189> "<S0030><S0033>";"<BASE><BASE>";"<FRACTION><FRACTION>";<U2189> % VULGAR FRACTION ZERO THIRDS
<U3358> "<S0030><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3358> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ZERO
-<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
<U0661> <S0031>;<BASE>;<MIN>;<U0661> % ARABIC-INDIC DIGIT ONE
<U06F1> <S0031>;<BASE>;<MIN>;<U06F1> % EXTENDED ARABIC-INDIC DIGIT ONE
<U07C1> <S0031>;<BASE>;<MIN>;<U07C1> % NKO DIGIT ONE
@@ -63440,7 +63451,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U33E0> "<S0031><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE
<U32C0> "<S0031><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JANUARY
<U3359> "<S0031><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3359> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ONE
-<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
<U0662> <S0032>;<BASE>;<MIN>;<U0662> % ARABIC-INDIC DIGIT TWO
<U06F2> <S0032>;<BASE>;<MIN>;<U06F2> % EXTENDED ARABIC-INDIC DIGIT TWO
<U07C2> <S0032>;<BASE>;<MIN>;<U07C2> % NKO DIGIT TWO
@@ -63583,7 +63593,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U33E1> "<S0032><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY TWO
<U32C1> "<S0032><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR FEBRUARY
<U335A> "<S0032><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335A> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR TWO
-<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
<U0663> <S0033>;<BASE>;<MIN>;<U0663> % ARABIC-INDIC DIGIT THREE
<U06F3> <S0033>;<BASE>;<MIN>;<U06F3> % EXTENDED ARABIC-INDIC DIGIT THREE
<U07C3> <S0033>;<BASE>;<MIN>;<U07C3> % NKO DIGIT THREE
@@ -63709,7 +63718,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U33E2> "<S0033><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THREE
<U32C2> "<S0033><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MARCH
<U335B> "<S0033><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335B> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR THREE
-<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR
<U0664> <S0034>;<BASE>;<MIN>;<U0664> % ARABIC-INDIC DIGIT FOUR
<U06F4> <S0034>;<BASE>;<MIN>;<U06F4> % EXTENDED ARABIC-INDIC DIGIT FOUR
<U07C4> <S0034>;<BASE>;<MIN>;<U07C4> % NKO DIGIT FOUR
@@ -63829,7 +63837,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U33E3> "<S0034><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FOUR
<U32C3> "<S0034><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR APRIL
<U335C> "<S0034><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335C> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FOUR
-<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE
<U0665> <S0035>;<BASE>;<MIN>;<U0665> % ARABIC-INDIC DIGIT FIVE
<U06F5> <S0035>;<BASE>;<MIN>;<U06F5> % EXTENDED ARABIC-INDIC DIGIT FIVE
<U07C5> <S0035>;<BASE>;<MIN>;<U07C5> % NKO DIGIT FIVE
@@ -63941,7 +63948,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U33E4> "<S0035><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FIVE
<U32C4> "<S0035><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MAY
<U335D> "<S0035><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335D> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FIVE
-<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX
<U0666> <S0036>;<BASE>;<MIN>;<U0666> % ARABIC-INDIC DIGIT SIX
<U06F6> <S0036>;<BASE>;<MIN>;<U06F6> % EXTENDED ARABIC-INDIC DIGIT SIX
<U07C6> <S0036>;<BASE>;<MIN>;<U07C6> % NKO DIGIT SIX
@@ -64036,7 +64042,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U33E5> "<S0036><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SIX
<U32C5> "<S0036><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JUNE
<U335E> "<S0036><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335E> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SIX
-<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN
<U0667> <S0037>;<BASE>;<MIN>;<U0667> % ARABIC-INDIC DIGIT SEVEN
<U06F7> <S0037>;<BASE>;<MIN>;<U06F7> % EXTENDED ARABIC-INDIC DIGIT SEVEN
<U07C7> <S0037>;<BASE>;<MIN>;<U07C7> % NKO DIGIT SEVEN
@@ -64132,7 +64137,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U33E6> "<S0037><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SEVEN
<U32C6> "<S0037><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JULY
<U335F> "<S0037><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335F> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SEVEN
-<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT
<U0668> <S0038>;<BASE>;<MIN>;<U0668> % ARABIC-INDIC DIGIT EIGHT
<U06F8> <S0038>;<BASE>;<MIN>;<U06F8> % EXTENDED ARABIC-INDIC DIGIT EIGHT
<U07C8> <S0038>;<BASE>;<MIN>;<U07C8> % NKO DIGIT EIGHT
@@ -64226,7 +64230,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
<U33E7> "<S0038><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY EIGHT
<U32C7> "<S0038><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR AUGUST
<U3360> "<S0038><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3360> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR EIGHT
-<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE
<U0669> <S0039>;<BASE>;<MIN>;<U0669> % ARABIC-INDIC DIGIT NINE
<U06F9> <S0039>;<BASE>;<MIN>;<U06F9> % EXTENDED ARABIC-INDIC DIGIT NINE
<U07C9> <S0039>;<BASE>;<MIN>;<U07C9> % NKO DIGIT NINE
@@ -64326,7 +64329,35 @@ order_start <LATIN>;forward;backward;forward;forward,position
else
order_start <LATIN>;forward;forward;forward;forward,position
endif
+% Implement rational range for [a-z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
<U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A
+<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B
+<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C
+<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D
+<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E
+<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F
+<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
+<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H
+<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I
+<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J
+<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K
+<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L
+<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M
+<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N
+<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O
+<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P
+<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q
+<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R
+<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S
+<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T
+<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U
+<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V
+<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W
+<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X
+<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y
+<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z
<UFF41> <S0061>;<BASE>;<WIDE>;<UFF41> % FULLWIDTH LATIN SMALL LETTER A
<U0363> <S0061>;<BASE>;<COMPAT>;<U0363> % COMBINING LATIN SMALL LETTER A
<U249C> <S0061>;<BASE>;<COMPAT>;<U249C> % PARENTHESIZED LATIN SMALL LETTER A
@@ -64418,7 +64449,6 @@ endif
<U0252> <S0252>;<BASE>;<MIN>;<U0252> % LATIN SMALL LETTER TURNED ALPHA
<U1D9B> <S0252>;<BASE>;<MNN>;<U1D9B> % MODIFIER LETTER SMALL TURNED ALPHA
<UAB64> <SAB64>;<BASE>;<MIN>;<UAB64> % LATIN SMALL LETTER INVERTED ALPHA
-<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B
<UFF42> <S0062>;<BASE>;<WIDE>;<UFF42> % FULLWIDTH LATIN SMALL LETTER B
<U1DE8> <S0062>;<BASE>;<COMPAT>;<U1DE8> % COMBINING LATIN SMALL LETTER B
<U249D> <S0062>;<BASE>;<COMPAT>;<U249D> % PARENTHESIZED LATIN SMALL LETTER B
@@ -64454,7 +64484,6 @@ endif
<U0183> <S0183>;<BASE>;<MIN>;<U0183> % LATIN SMALL LETTER B WITH TOPBAR
<UA7B5> <SA7B5>;<BASE>;<MIN>;<UA7B5> % LATIN SMALL LETTER BETA
<U1DE9> <SA7B5>;<BASE>;<COMPAT>;<U1DE9> % COMBINING LATIN SMALL LETTER BETA
-<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C
<UFF43> <S0063>;<BASE>;<WIDE>;<UFF43> % FULLWIDTH LATIN SMALL LETTER C
<U0368> <S0063>;<BASE>;<COMPAT>;<U0368> % COMBINING LATIN SMALL LETTER C
<U217D> <S0063>;<BASE>;<COMPAT>;<U217D> % SMALL ROMAN NUMERAL ONE HUNDRED
@@ -64504,7 +64533,6 @@ endif
<U1D9D> <S0255>;<BASE>;<MNN>;<U1D9D> % MODIFIER LETTER SMALL C WITH CURL
<U2184> <S2184>;<BASE>;<MIN>;<U2184> % LATIN SMALL LETTER REVERSED C
<UA73F> <SA73F>;<BASE>;<MIN>;<UA73F> % LATIN SMALL LETTER REVERSED C WITH DOT
-<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D
<UFF44> <S0064>;<BASE>;<WIDE>;<UFF44> % FULLWIDTH LATIN SMALL LETTER D
<U0369> <S0064>;<BASE>;<COMPAT>;<U0369> % COMBINING LATIN SMALL LETTER D
<U217E> <S0064>;<BASE>;<COMPAT>;<U217E> % SMALL ROMAN NUMERAL FIVE HUNDRED
@@ -64563,7 +64591,6 @@ endif
<U0221> <S0221>;<BASE>;<MIN>;<U0221> % LATIN SMALL LETTER D WITH CURL
<UA771> <SA771>;<BASE>;<MIN>;<UA771> % LATIN SMALL LETTER DUM
<U1E9F> <S1E9F>;<BASE>;<MIN>;<U1E9F> % LATIN SMALL LETTER DELTA
-<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E
<UFF45> <S0065>;<BASE>;<WIDE>;<UFF45> % FULLWIDTH LATIN SMALL LETTER E
<U0364> <S0065>;<BASE>;<COMPAT>;<U0364> % COMBINING LATIN SMALL LETTER E
<U24A0> <S0065>;<BASE>;<COMPAT>;<U24A0> % PARENTHESIZED LATIN SMALL LETTER E
@@ -64641,7 +64668,6 @@ endif
<U025E> <S025E>;<BASE>;<MIN>;<U025E> % LATIN SMALL LETTER CLOSED REVERSED OPEN E
<U029A> <S029A>;<BASE>;<MIN>;<U029A> % LATIN SMALL LETTER CLOSED OPEN E
<U0264> <S0264>;<BASE>;<MIN>;<U0264> % LATIN SMALL LETTER RAMS HORN
-<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F
<UFF46> <S0066>;<BASE>;<WIDE>;<UFF46> % FULLWIDTH LATIN SMALL LETTER F
<U1DEB> <S0066>;<BASE>;<COMPAT>;<U1DEB> % COMBINING LATIN SMALL LETTER F
<U24A1> <S0066>;<BASE>;<COMPAT>;<U24A1> % PARENTHESIZED LATIN SMALL LETTER F
@@ -64680,7 +64706,6 @@ endif
<U0192> <S0192>;<BASE>;<MIN>;<U0192> % LATIN SMALL LETTER F WITH HOOK
<U214E> <S214E>;<BASE>;<MIN>;<U214E> % TURNED SMALL F
<UA7FB> <SA7FB>;<BASE>;<MIN>;<UA7FB> % LATIN EPIGRAPHIC LETTER REVERSED F
-<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
<UFF47> <S0067>;<BASE>;<WIDE>;<UFF47> % FULLWIDTH LATIN SMALL LETTER G
<U1DDA> <S0067>;<BASE>;<COMPAT>;<U1DDA> % COMBINING LATIN SMALL LETTER G
<U24A2> <S0067>;<BASE>;<COMPAT>;<U24A2> % PARENTHESIZED LATIN SMALL LETTER G
@@ -64727,7 +64752,6 @@ endif
<U0263> <S0263>;<BASE>;<MIN>;<U0263> % LATIN SMALL LETTER GAMMA
<U02E0> <S0263>;<BASE>;<MNN>;<U02E0> % MODIFIER LETTER SMALL GAMMA
<U01A3> <S01A3>;<BASE>;<MIN>;<U01A3> % LATIN SMALL LETTER OI
-<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H
<UFF48> <S0068>;<BASE>;<WIDE>;<UFF48> % FULLWIDTH LATIN SMALL LETTER H
<U036A> <S0068>;<BASE>;<COMPAT>;<U036A> % COMBINING LATIN SMALL LETTER H
<U24A3> <S0068>;<BASE>;<COMPAT>;<U24A3> % PARENTHESIZED LATIN SMALL LETTER H
@@ -64780,7 +64804,6 @@ endif
<U0267> <S0267>;<BASE>;<MIN>;<U0267> % LATIN SMALL LETTER HENG WITH HOOK
<U02BB> <S02BB>;<BASE>;<MIN>;<U02BB> % MODIFIER LETTER TURNED COMMA
<U02BD> <S02BD>;<BASE>;<MIN>;<U02BD> % MODIFIER LETTER REVERSED COMMA
-<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I
<UFF49> <S0069>;<BASE>;<WIDE>;<UFF49> % FULLWIDTH LATIN SMALL LETTER I
<U0365> <S0069>;<BASE>;<COMPAT>;<U0365> % COMBINING LATIN SMALL LETTER I
<U2170> <S0069>;<BASE>;<COMPAT>;<U2170> % SMALL ROMAN NUMERAL ONE
@@ -64844,7 +64867,6 @@ endif
<U0269> <S0269>;<BASE>;<MIN>;<U0269> % LATIN SMALL LETTER IOTA
<U1DA5> <S0269>;<BASE>;<MNN>;<U1DA5> % MODIFIER LETTER SMALL IOTA
<U1D7C> <S1D7C>;<BASE>;<MIN>;<U1D7C> % LATIN SMALL LETTER IOTA WITH STROKE
-<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J
<UFF4A> <S006A>;<BASE>;<WIDE>;<UFF4A> % FULLWIDTH LATIN SMALL LETTER J
<U24A5> <S006A>;<BASE>;<COMPAT>;<U24A5> % PARENTHESIZED LATIN SMALL LETTER J
<U2149> <S006A>;<BASE>;<FONT>;<U2149> % DOUBLE-STRUCK ITALIC SMALL J
@@ -64876,7 +64898,6 @@ endif
<U025F> <S025F>;<BASE>;<MIN>;<U025F> % LATIN SMALL LETTER DOTLESS J WITH STROKE
<U1DA1> <S025F>;<BASE>;<MNN>;<U1DA1> % MODIFIER LETTER SMALL DOTLESS J WITH STROKE
<U0284> <S0284>;<BASE>;<MIN>;<U0284> % LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK
-<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K
<UFF4B> <S006B>;<BASE>;<WIDE>;<UFF4B> % FULLWIDTH LATIN SMALL LETTER K
<U1DDC> <S006B>;<BASE>;<COMPAT>;<U1DDC> % COMBINING LATIN SMALL LETTER K
<U24A6> <S006B>;<BASE>;<COMPAT>;<U24A6> % PARENTHESIZED LATIN SMALL LETTER K
@@ -64926,7 +64947,6 @@ endif
<UA743> <SA743>;<BASE>;<MIN>;<UA743> % LATIN SMALL LETTER K WITH DIAGONAL STROKE
<UA745> <SA745>;<BASE>;<MIN>;<UA745> % LATIN SMALL LETTER K WITH STROKE AND DIAGONAL STROKE
<U029E> <S029E>;<BASE>;<MIN>;<U029E> % LATIN SMALL LETTER TURNED K
-<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L
<UFF4C> <S006C>;<BASE>;<WIDE>;<UFF4C> % FULLWIDTH LATIN SMALL LETTER L
<U1DDD> <S006C>;<BASE>;<COMPAT>;<U1DDD> % COMBINING LATIN SMALL LETTER L
<U217C> <S006C>;<BASE>;<COMPAT>;<U217C> % SMALL ROMAN NUMERAL FIFTY
@@ -64996,7 +65016,6 @@ endif
<UA781> <SA781>;<BASE>;<MIN>;<UA781> % LATIN SMALL LETTER TURNED L
<U019B> <S019B>;<BASE>;<MIN>;<U019B> % LATIN SMALL LETTER LAMBDA WITH STROKE
<U028E> <S028E>;<BASE>;<MIN>;<U028E> % LATIN SMALL LETTER TURNED Y
-<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M
<UFF4D> <S006D>;<BASE>;<WIDE>;<UFF4D> % FULLWIDTH LATIN SMALL LETTER M
<U036B> <S006D>;<BASE>;<COMPAT>;<U036B> % COMBINING LATIN SMALL LETTER M
<U217F> <S006D>;<BASE>;<COMPAT>;<U217F> % SMALL ROMAN NUMERAL ONE THOUSAND
@@ -65055,7 +65074,6 @@ endif
<UA7FD> <SA7FD>;<BASE>;<MIN>;<UA7FD> % LATIN EPIGRAPHIC LETTER INVERTED M
<UA7FF> <SA7FF>;<BASE>;<MIN>;<UA7FF> % LATIN EPIGRAPHIC LETTER ARCHAIC M
<UA773> <SA773>;<BASE>;<MIN>;<UA773> % LATIN SMALL LETTER MUM
-<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N
<UFF4E> <S006E>;<BASE>;<WIDE>;<UFF4E> % FULLWIDTH LATIN SMALL LETTER N
<U1DE0> <S006E>;<BASE>;<COMPAT>;<U1DE0> % COMBINING LATIN SMALL LETTER N
<U24A9> <S006E>;<BASE>;<COMPAT>;<U24A9> % PARENTHESIZED LATIN SMALL LETTER N
@@ -65114,7 +65132,6 @@ endif
<U014B> <S014B>;<BASE>;<MIN>;<U014B> % LATIN SMALL LETTER ENG
<U1D51> <S014B>;<BASE>;<MNN>;<U1D51> % MODIFIER LETTER SMALL ENG
<UAB3C> <SAB3C>;<BASE>;<MIN>;<UAB3C> % LATIN SMALL LETTER ENG WITH CROSSED-TAIL
-<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O
<UFF4F> <S006F>;<BASE>;<WIDE>;<UFF4F> % FULLWIDTH LATIN SMALL LETTER O
<U0366> <S006F>;<BASE>;<COMPAT>;<U0366> % COMBINING LATIN SMALL LETTER O
<U24AA> <S006F>;<BASE>;<COMPAT>;<U24AA> % PARENTHESIZED LATIN SMALL LETTER O
@@ -65213,7 +65230,6 @@ endif
<U0223> <S0223>;<BASE>;<MIN>;<U0223> % LATIN SMALL LETTER OU
<U1D3D> <S0223>;<BASE>;<MISCCAP>;<U1D3D> % MODIFIER LETTER CAPITAL OU
<U1D15> <S1D15>;<BASE>;<MIN>;<U1D15> % LATIN LETTER SMALL CAPITAL OU
-<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P
<UFF50> <S0070>;<BASE>;<WIDE>;<UFF50> % FULLWIDTH LATIN SMALL LETTER P
<U1DEE> <S0070>;<BASE>;<COMPAT>;<U1DEE> % COMBINING LATIN SMALL LETTER P
<U24AB> <S0070>;<BASE>;<COMPAT>;<U24AB> % PARENTHESIZED LATIN SMALL LETTER P
@@ -65262,7 +65278,6 @@ endif
<U0278> <S0278>;<BASE>;<MIN>;<U0278> % LATIN SMALL LETTER PHI
<U1DB2> <S0278>;<BASE>;<MNN>;<U1DB2> % MODIFIER LETTER SMALL PHI
<U2C77> <S2C77>;<BASE>;<MIN>;<U2C77> % LATIN SMALL LETTER TAILLESS PHI
-<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q
<UFF51> <S0071>;<BASE>;<WIDE>;<UFF51> % FULLWIDTH LATIN SMALL LETTER Q
<U24AC> <S0071>;<BASE>;<COMPAT>;<U24AC> % PARENTHESIZED LATIN SMALL LETTER Q
<U0001D42A> <S0071>;<BASE>;<FONT>;<U0001D42A> % MATHEMATICAL BOLD SMALL Q
@@ -65285,7 +65300,6 @@ endif
<U02A0> <S02A0>;<BASE>;<MIN>;<U02A0> % LATIN SMALL LETTER Q WITH HOOK
<U024B> <S024B>;<BASE>;<MIN>;<U024B> % LATIN SMALL LETTER Q WITH HOOK TAIL
<U0138> <S0138>;<BASE>;<MIN>;<U0138> % LATIN SMALL LETTER KRA
-<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R
<UFF52> <S0072>;<BASE>;<WIDE>;<UFF52> % FULLWIDTH LATIN SMALL LETTER R
<U036C> <S0072>;<BASE>;<COMPAT>;<U036C> % COMBINING LATIN SMALL LETTER R
<U1DCA> <S0072>;<BASE>;<COMPAT>;<U1DCA> % COMBINING LATIN SMALL LETTER R BELOW
@@ -65354,7 +65368,6 @@ endif
<UA775> <SA775>;<BASE>;<MIN>;<UA775> % LATIN SMALL LETTER RUM
<UA776> <SA776>;<BASE>;<MIN>;<UA776> % LATIN LETTER SMALL CAPITAL RUM
<UA75D> <SA75D>;<BASE>;<MIN>;<UA75D> % LATIN SMALL LETTER RUM ROTUNDA
-<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S
<UFF53> <S0073>;<BASE>;<WIDE>;<UFF53> % FULLWIDTH LATIN SMALL LETTER S
<U1DE4> <S0073>;<BASE>;<COMPAT>;<U1DE4> % COMBINING LATIN SMALL LETTER S
<U24AE> <S0073>;<BASE>;<COMPAT>;<U24AE> % PARENTHESIZED LATIN SMALL LETTER S
@@ -65417,7 +65430,6 @@ endif
<U0285> <S0285>;<BASE>;<MIN>;<U0285> % LATIN SMALL LETTER SQUAT REVERSED ESH
<U1D98> <S1D98>;<BASE>;<MIN>;<U1D98> % LATIN SMALL LETTER ESH WITH RETROFLEX HOOK
<U0286> <S0286>;<BASE>;<MIN>;<U0286> % LATIN SMALL LETTER ESH WITH CURL
-<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T
<UFF54> <S0074>;<BASE>;<WIDE>;<UFF54> % FULLWIDTH LATIN SMALL LETTER T
<U036D> <S0074>;<BASE>;<COMPAT>;<U036D> % COMBINING LATIN SMALL LETTER T
<U24AF> <S0074>;<BASE>;<COMPAT>;<U24AF> % PARENTHESIZED LATIN SMALL LETTER T
@@ -65467,7 +65479,6 @@ endif
<U0236> <S0236>;<BASE>;<MIN>;<U0236> % LATIN SMALL LETTER T WITH CURL
<UA777> <SA777>;<BASE>;<MIN>;<UA777> % LATIN SMALL LETTER TUM
<U0287> <S0287>;<BASE>;<MIN>;<U0287> % LATIN SMALL LETTER TURNED T
-<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U
<UFF55> <S0075>;<BASE>;<WIDE>;<UFF55> % FULLWIDTH LATIN SMALL LETTER U
<U0367> <S0075>;<BASE>;<COMPAT>;<U0367> % COMBINING LATIN SMALL LETTER U
<U24B0> <S0075>;<BASE>;<COMPAT>;<U24B0> % PARENTHESIZED LATIN SMALL LETTER U
@@ -65552,7 +65563,6 @@ endif
<U028A> <S028A>;<BASE>;<MIN>;<U028A> % LATIN SMALL LETTER UPSILON
<U1DB7> <S028A>;<BASE>;<MNN>;<U1DB7> % MODIFIER LETTER SMALL UPSILON
<U1D7F> <S1D7F>;<BASE>;<MIN>;<U1D7F> % LATIN SMALL LETTER UPSILON WITH STROKE
-<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V
<UFF56> <S0076>;<BASE>;<WIDE>;<UFF56> % FULLWIDTH LATIN SMALL LETTER V
<U036E> <S0076>;<BASE>;<COMPAT>;<U036E> % COMBINING LATIN SMALL LETTER V
<U2174> <S0076>;<BASE>;<COMPAT>;<U2174> % SMALL ROMAN NUMERAL FIVE
@@ -65593,7 +65603,6 @@ endif
<U1EFD> <S1EFD>;<BASE>;<MIN>;<U1EFD> % LATIN SMALL LETTER MIDDLE-WELSH V
<U028C> <S028C>;<BASE>;<MIN>;<U028C> % LATIN SMALL LETTER TURNED V
<U1DBA> <S028C>;<BASE>;<MNN>;<U1DBA> % MODIFIER LETTER SMALL TURNED V
-<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W
<UFF57> <S0077>;<BASE>;<WIDE>;<UFF57> % FULLWIDTH LATIN SMALL LETTER W
<U1DF1> <S0077>;<BASE>;<COMPAT>;<U1DF1> % COMBINING LATIN SMALL LETTER W
<U24B2> <S0077>;<BASE>;<COMPAT>;<U24B2> % PARENTHESIZED LATIN SMALL LETTER W
@@ -65627,7 +65636,6 @@ endif
<U1D21> <S1D21>;<BASE>;<MIN>;<U1D21> % LATIN LETTER SMALL CAPITAL W
<U2C73> <S2C73>;<BASE>;<MIN>;<U2C73> % LATIN SMALL LETTER W WITH HOOK
<U028D> <S028D>;<BASE>;<MIN>;<U028D> % LATIN SMALL LETTER TURNED W
-<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X
<UFF58> <S0078>;<BASE>;<WIDE>;<UFF58> % FULLWIDTH LATIN SMALL LETTER X
<U036F> <S0078>;<BASE>;<COMPAT>;<U036F> % COMBINING LATIN SMALL LETTER X
<U2179> <S0078>;<BASE>;<COMPAT>;<U2179> % SMALL ROMAN NUMERAL TEN
@@ -65660,7 +65668,6 @@ endif
<UAB53> <SAB53>;<BASE>;<MIN>;<UAB53> % LATIN SMALL LETTER CHI
<UAB54> <SAB54>;<BASE>;<MIN>;<UAB54> % LATIN SMALL LETTER CHI WITH LOW RIGHT RING
<UAB55> <SAB55>;<BASE>;<MIN>;<UAB55> % LATIN SMALL LETTER CHI WITH LOW LEFT SERIF
-<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y
<UFF59> <S0079>;<BASE>;<WIDE>;<UFF59> % FULLWIDTH LATIN SMALL LETTER Y
<U24B4> <S0079>;<BASE>;<COMPAT>;<U24B4> % PARENTHESIZED LATIN SMALL LETTER Y
<U0001D432> <S0079>;<BASE>;<FONT>;<U0001D432> % MATHEMATICAL BOLD SMALL Y
@@ -65694,7 +65701,6 @@ endif
<U1EFF> <S1EFF>;<BASE>;<MIN>;<U1EFF> % LATIN SMALL LETTER Y WITH LOOP
<UAB5A> <SAB5A>;<BASE>;<MIN>;<UAB5A> % LATIN SMALL LETTER Y WITH SHORT RIGHT LEG
<U021D> <S021D>;<BASE>;<MIN>;<U021D> % LATIN SMALL LETTER YOGH
-<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z
<UFF5A> <S007A>;<BASE>;<WIDE>;<UFF5A> % FULLWIDTH LATIN SMALL LETTER Z
<U1DE6> <S007A>;<BASE>;<COMPAT>;<U1DE6> % COMBINING LATIN SMALL LETTER Z
<U24B5> <S007A>;<BASE>;<COMPAT>;<U24B5> % PARENTHESIZED LATIN SMALL LETTER Z
@@ -65796,7 +65802,35 @@ endif
<U0001D736> <S03B1>;<BASE>;<FONT>;<U0001D736> % MATHEMATICAL BOLD ITALIC SMALL ALPHA
<U0001D770> <S03B1>;<BASE>;<FONT>;<U0001D770> % MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA
<U0001D7AA> <S03B1>;<BASE>;<FONT>;<U0001D7AA> % MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA
+% Implement rational range for [A-Z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
<U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A
+<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
+<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
+<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
+<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
+<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
+<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
+<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
+<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I
+<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
+<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
+<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
+<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
+<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
+<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
+<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
+<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
+<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
+<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
+<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
+<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
+<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
+<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
+<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
+<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
+<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
<UFF21> <S0061>;<BASE>;<WIDECAP>;<UFF21> % FULLWIDTH LATIN CAPITAL LETTER A
<U0001F110> <S0061>;<BASE>;<COMPATCAP>;<U0001F110> % PARENTHESIZED LATIN CAPITAL LETTER A
<U0001D400> <S0061>;<BASE>;<FONTCAP>;<U0001D400> % MATHEMATICAL BOLD CAPITAL A
@@ -65860,7 +65894,6 @@ endif
<U2C6F> <S0250>;<BASE>;<CAP>;<U2C6F> % LATIN CAPITAL LETTER TURNED A
<U2C6D> <S0251>;<BASE>;<CAP>;<U2C6D> % LATIN CAPITAL LETTER ALPHA
<U2C70> <S0252>;<BASE>;<CAP>;<U2C70> % LATIN CAPITAL LETTER TURNED ALPHA
-<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
<UFF22> <S0062>;<BASE>;<WIDECAP>;<UFF22> % FULLWIDTH LATIN CAPITAL LETTER B
<U0001F111> <S0062>;<BASE>;<COMPATCAP>;<U0001F111> % PARENTHESIZED LATIN CAPITAL LETTER B
<U212C> <S0062>;<BASE>;<FONTCAP>;<U212C> % SCRIPT CAPITAL B
@@ -65888,7 +65921,6 @@ endif
<U0181> <S0253>;<BASE>;<CAP>;<U0181> % LATIN CAPITAL LETTER B WITH HOOK
<U0182> <S0183>;<BASE>;<CAP>;<U0182> % LATIN CAPITAL LETTER B WITH TOPBAR
<UA7B4> <SA7B5>;<BASE>;<CAP>;<UA7B4> % LATIN CAPITAL LETTER BETA
-<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
<UFF23> <S0063>;<BASE>;<WIDECAP>;<UFF23> % FULLWIDTH LATIN CAPITAL LETTER C
<U216D> <S0063>;<BASE>;<COMPATCAP>;<U216D> % ROMAN NUMERAL ONE HUNDRED
<U0001F112> <S0063>;<BASE>;<COMPATCAP>;<U0001F112> % PARENTHESIZED LATIN CAPITAL LETTER C
@@ -65921,7 +65953,6 @@ endif
<U0187> <S0188>;<BASE>;<CAP>;<U0187> % LATIN CAPITAL LETTER C WITH HOOK
<U2183> <S2184>;<BASE>;<CAP>;<U2183> % ROMAN NUMERAL REVERSED ONE HUNDRED
<UA73E> <SA73F>;<BASE>;<CAP>;<UA73E> % LATIN CAPITAL LETTER REVERSED C WITH DOT
-<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
<UFF24> <S0064>;<BASE>;<WIDECAP>;<UFF24> % FULLWIDTH LATIN CAPITAL LETTER D
<U216E> <S0064>;<BASE>;<COMPATCAP>;<U216E> % ROMAN NUMERAL FIVE HUNDRED
<U0001F113> <S0064>;<BASE>;<COMPATCAP>;<U0001F113> % PARENTHESIZED LATIN CAPITAL LETTER D
@@ -65959,7 +65990,6 @@ endif
<U0189> <S0256>;<BASE>;<CAP>;<U0189> % LATIN CAPITAL LETTER AFRICAN D
<U018A> <S0257>;<BASE>;<CAP>;<U018A> % LATIN CAPITAL LETTER D WITH HOOK
<U018B> <S018C>;<BASE>;<CAP>;<U018B> % LATIN CAPITAL LETTER D WITH TOPBAR
-<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
<UFF25> <S0065>;<BASE>;<WIDECAP>;<UFF25> % FULLWIDTH LATIN CAPITAL LETTER E
<U0001F114> <S0065>;<BASE>;<COMPATCAP>;<U0001F114> % PARENTHESIZED LATIN CAPITAL LETTER E
<U2130> <S0065>;<BASE>;<FONTCAP>;<U2130> % SCRIPT CAPITAL E
@@ -66010,7 +66040,6 @@ endif
<U0190> <S025B>;<BASE>;<CAP>;<U0190> % LATIN CAPITAL LETTER OPEN E
<U2107> <S025B>;<BASE>;<COMPATCAP>;<U2107> % EULER CONSTANT
<UA7AB> <S025C>;<BASE>;<CAP>;<UA7AB> % LATIN CAPITAL LETTER REVERSED OPEN E
-<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
<UFF26> <S0066>;<BASE>;<WIDECAP>;<UFF26> % FULLWIDTH LATIN CAPITAL LETTER F
<U0001F115> <S0066>;<BASE>;<COMPATCAP>;<U0001F115> % PARENTHESIZED LATIN CAPITAL LETTER F
<U2131> <S0066>;<BASE>;<FONTCAP>;<U2131> % SCRIPT CAPITAL F
@@ -66035,7 +66064,6 @@ endif
<UA798> <SA799>;<BASE>;<CAP>;<UA798> % LATIN CAPITAL LETTER F WITH STROKE
<U0191> <S0192>;<BASE>;<CAP>;<U0191> % LATIN CAPITAL LETTER F WITH HOOK
<U2132> <S214E>;<BASE>;<CAP>;<U2132> % TURNED CAPITAL F
-<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
<UFF27> <S0067>;<BASE>;<WIDECAP>;<UFF27> % FULLWIDTH LATIN CAPITAL LETTER G
<U0001F116> <S0067>;<BASE>;<COMPATCAP>;<U0001F116> % PARENTHESIZED LATIN CAPITAL LETTER G
<U0001D406> <S0067>;<BASE>;<FONTCAP>;<U0001D406> % MATHEMATICAL BOLD CAPITAL G
@@ -66071,7 +66099,6 @@ endif
<UA77E> <SA77F>;<BASE>;<CAP>;<UA77E> % LATIN CAPITAL LETTER TURNED INSULAR G
<U0194> <S0263>;<BASE>;<CAP>;<U0194> % LATIN CAPITAL LETTER GAMMA
<U01A2> <S01A3>;<BASE>;<CAP>;<U01A2> % LATIN CAPITAL LETTER OI
-<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
<UFF28> <S0068>;<BASE>;<WIDECAP>;<UFF28> % FULLWIDTH LATIN CAPITAL LETTER H
<U0001F117> <S0068>;<BASE>;<COMPATCAP>;<U0001F117> % PARENTHESIZED LATIN CAPITAL LETTER H
<U210B> <S0068>;<BASE>;<FONTCAP>;<U210B> % SCRIPT CAPITAL H
@@ -66104,7 +66131,6 @@ endif
<U2C67> <S2C68>;<BASE>;<CAP>;<U2C67> % LATIN CAPITAL LETTER H WITH DESCENDER
<U2C75> <S2C76>;<BASE>;<CAP>;<U2C75> % LATIN CAPITAL LETTER HALF H
<UA726> <SA727>;<BASE>;<CAP>;<UA726> % LATIN CAPITAL LETTER HENG
-<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I
<UFF29> <S0069>;<BASE>;<WIDECAP>;<UFF29> % FULLWIDTH LATIN CAPITAL LETTER I
<U2160> <S0069>;<BASE>;<COMPATCAP>;<U2160> % ROMAN NUMERAL ONE
<U0001F118> <S0069>;<BASE>;<COMPATCAP>;<U0001F118> % PARENTHESIZED LATIN CAPITAL LETTER I
@@ -66149,7 +66175,6 @@ endif
<UA7AE> <S026A>;<BASE>;<CAP>;<UA7AE> % LATIN CAPITAL LETTER SMALL CAPITAL I
<U0197> <S0268>;<BASE>;<CAP>;<U0197> % LATIN CAPITAL LETTER I WITH STROKE
<U0196> <S0269>;<BASE>;<CAP>;<U0196> % LATIN CAPITAL LETTER IOTA
-<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
<UFF2A> <S006A>;<BASE>;<WIDECAP>;<UFF2A> % FULLWIDTH LATIN CAPITAL LETTER J
<U0001F119> <S006A>;<BASE>;<COMPATCAP>;<U0001F119> % PARENTHESIZED LATIN CAPITAL LETTER J
<U0001D409> <S006A>;<BASE>;<FONTCAP>;<U0001D409> % MATHEMATICAL BOLD CAPITAL J
@@ -66172,7 +66197,6 @@ endif
<U0134> <S006A>;"<BASE><CIRCF>";"<CAP><MIN>";<U0134> % LATIN CAPITAL LETTER J WITH CIRCUMFLEX
<U0248> <S0249>;<BASE>;<CAP>;<U0248> % LATIN CAPITAL LETTER J WITH STROKE
<UA7B2> <S029D>;<BASE>;<CAP>;<UA7B2> % LATIN CAPITAL LETTER J WITH CROSSED-TAIL
-<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
<U212A> <S006B>;<BASE>;<CAP>;<U212A> % KELVIN SIGN
<UFF2B> <S006B>;<BASE>;<WIDECAP>;<UFF2B> % FULLWIDTH LATIN CAPITAL LETTER K
<U0001F11A> <S006B>;<BASE>;<COMPATCAP>;<U0001F11A> % PARENTHESIZED LATIN CAPITAL LETTER K
@@ -66206,7 +66230,6 @@ endif
<UA742> <SA743>;<BASE>;<CAP>;<UA742> % LATIN CAPITAL LETTER K WITH DIAGONAL STROKE
<UA744> <SA745>;<BASE>;<CAP>;<UA744> % LATIN CAPITAL LETTER K WITH STROKE AND DIAGONAL STROKE
<UA7B0> <S029E>;<BASE>;<CAP>;<UA7B0> % LATIN CAPITAL LETTER TURNED K
-<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
<UFF2C> <S006C>;<BASE>;<WIDECAP>;<UFF2C> % FULLWIDTH LATIN CAPITAL LETTER L
<U216C> <S006C>;<BASE>;<COMPATCAP>;<U216C> % ROMAN NUMERAL FIFTY
<U0001F11B> <S006C>;<BASE>;<COMPATCAP>;<U0001F11B> % PARENTHESIZED LATIN CAPITAL LETTER L
@@ -66249,7 +66272,6 @@ endif
<U2C62> <S026B>;<BASE>;<CAP>;<U2C62> % LATIN CAPITAL LETTER L WITH MIDDLE TILDE
<UA7AD> <S026C>;<BASE>;<CAP>;<UA7AD> % LATIN CAPITAL LETTER L WITH BELT
<UA780> <SA781>;<BASE>;<CAP>;<UA780> % LATIN CAPITAL LETTER TURNED L
-<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
<UFF2D> <S006D>;<BASE>;<WIDECAP>;<UFF2D> % FULLWIDTH LATIN CAPITAL LETTER M
<U216F> <S006D>;<BASE>;<COMPATCAP>;<U216F> % ROMAN NUMERAL ONE THOUSAND
<U0001F11C> <S006D>;<BASE>;<COMPATCAP>;<U0001F11C> % PARENTHESIZED LATIN CAPITAL LETTER M
@@ -66275,7 +66297,6 @@ endif
<U1E42> <S006D>;"<BASE><POINS>";"<CAP><MIN>";<U1E42> % LATIN CAPITAL LETTER M WITH DOT BELOW
<U1DDF> <S1D0D>;<BASE>;<COMPAT>;<U1DDF> % COMBINING LATIN LETTER SMALL CAPITAL M
<U2C6E> <S0271>;<BASE>;<CAP>;<U2C6E> % LATIN CAPITAL LETTER M WITH HOOK
-<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
<UFF2E> <S006E>;<BASE>;<WIDECAP>;<UFF2E> % FULLWIDTH LATIN CAPITAL LETTER N
<U0001F11D> <S006E>;<BASE>;<COMPATCAP>;<U0001F11D> % PARENTHESIZED LATIN CAPITAL LETTER N
<U2115> <S006E>;<BASE>;<FONTCAP>;<U2115> % DOUBLE-STRUCK CAPITAL N
@@ -66312,7 +66333,6 @@ endif
<U0220> <S019E>;<BASE>;<CAP>;<U0220> % LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
<UA790> <SA791>;<BASE>;<CAP>;<UA790> % LATIN CAPITAL LETTER N WITH DESCENDER
<U014A> <S014B>;<BASE>;<CAP>;<U014A> % LATIN CAPITAL LETTER ENG
-<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
<UFF2F> <S006F>;<BASE>;<WIDECAP>;<UFF2F> % FULLWIDTH LATIN CAPITAL LETTER O
<U0001F11E> <S006F>;<BASE>;<COMPATCAP>;<U0001F11E> % PARENTHESIZED LATIN CAPITAL LETTER O
<U0001D40E> <S006F>;<BASE>;<FONTCAP>;<U0001D40E> % MATHEMATICAL BOLD CAPITAL O
@@ -66377,7 +66397,6 @@ endif
<UA74A> <SA74B>;<BASE>;<CAP>;<UA74A> % LATIN CAPITAL LETTER O WITH LONG STROKE OVERLAY
<UA7B6> <SA7B7>;<BASE>;<CAP>;<UA7B6> % LATIN CAPITAL LETTER OMEGA
<U0222> <S0223>;<BASE>;<CAP>;<U0222> % LATIN CAPITAL LETTER OU
-<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
<UFF30> <S0070>;<BASE>;<WIDECAP>;<UFF30> % FULLWIDTH LATIN CAPITAL LETTER P
<U0001F11F> <S0070>;<BASE>;<COMPATCAP>;<U0001F11F> % PARENTHESIZED LATIN CAPITAL LETTER P
<U2119> <S0070>;<BASE>;<FONTCAP>;<U2119> % DOUBLE-STRUCK CAPITAL P
@@ -66405,7 +66424,6 @@ endif
<U01A4> <S01A5>;<BASE>;<CAP>;<U01A4> % LATIN CAPITAL LETTER P WITH HOOK
<UA752> <SA753>;<BASE>;<CAP>;<UA752> % LATIN CAPITAL LETTER P WITH FLOURISH
<UA754> <SA755>;<BASE>;<CAP>;<UA754> % LATIN CAPITAL LETTER P WITH SQUIRREL TAIL
-<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
<UFF31> <S0071>;<BASE>;<WIDECAP>;<UFF31> % FULLWIDTH LATIN CAPITAL LETTER Q
<U0001F120> <S0071>;<BASE>;<COMPATCAP>;<U0001F120> % PARENTHESIZED LATIN CAPITAL LETTER Q
<U211A> <S0071>;<BASE>;<FONTCAP>;<U211A> % DOUBLE-STRUCK CAPITAL Q
@@ -66428,7 +66446,6 @@ endif
<UA756> <SA757>;<BASE>;<CAP>;<UA756> % LATIN CAPITAL LETTER Q WITH STROKE THROUGH DESCENDER
<UA758> <SA759>;<BASE>;<CAP>;<UA758> % LATIN CAPITAL LETTER Q WITH DIAGONAL STROKE
<U024A> <S024B>;<BASE>;<CAP>;<U024A> % LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL
-<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
<UFF32> <S0072>;<BASE>;<WIDECAP>;<UFF32> % FULLWIDTH LATIN CAPITAL LETTER R
<U0001F121> <S0072>;<BASE>;<COMPATCAP>;<U0001F121> % PARENTHESIZED LATIN CAPITAL LETTER R
<U211B> <S0072>;<BASE>;<FONTCAP>;<U211B> % SCRIPT CAPITAL R
@@ -66466,7 +66483,6 @@ endif
<U024C> <S024D>;<BASE>;<CAP>;<U024C> % LATIN CAPITAL LETTER R WITH STROKE
<U2C64> <S027D>;<BASE>;<CAP>;<U2C64> % LATIN CAPITAL LETTER R WITH TAIL
<UA75C> <SA75D>;<BASE>;<CAP>;<UA75C> % LATIN CAPITAL LETTER RUM ROTUNDA
-<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
<UFF33> <S0073>;<BASE>;<WIDECAP>;<UFF33> % FULLWIDTH LATIN CAPITAL LETTER S
<U0001F122> <S0073>;<BASE>;<COMPATCAP>;<U0001F122> % PARENTHESIZED LATIN CAPITAL LETTER S
<U0001F12A> <S0073>;<BASE>;<COMPATCAP>;<U0001F12A> % TORTOISE SHELL BRACKETED LATIN CAPITAL LETTER S
@@ -66502,7 +66518,6 @@ endif
<U1E9E> "<S0073><S0073>";"<BASE><VRNT1><BASE>";"<COMPATCAP><COMPAT><COMPATCAP>";<U1E9E> % LATIN CAPITAL LETTER SHARP S
<U2C7E> <S023F>;<BASE>;<CAP>;<U2C7E> % LATIN CAPITAL LETTER S WITH SWASH TAIL
<U01A9> <S0283>;<BASE>;<CAP>;<U01A9> % LATIN CAPITAL LETTER ESH
-<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
<UFF34> <S0074>;<BASE>;<WIDECAP>;<UFF34> % FULLWIDTH LATIN CAPITAL LETTER T
<U0001F123> <S0074>;<BASE>;<COMPATCAP>;<U0001F123> % PARENTHESIZED LATIN CAPITAL LETTER T
<U0001D413> <S0074>;<BASE>;<FONTCAP>;<U0001D413> % MATHEMATICAL BOLD CAPITAL T
@@ -66536,7 +66551,6 @@ endif
<U01AC> <S01AD>;<BASE>;<CAP>;<U01AC> % LATIN CAPITAL LETTER T WITH HOOK
<U01AE> <S0288>;<BASE>;<CAP>;<U01AE> % LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
<UA7B1> <S0287>;<BASE>;<CAP>;<UA7B1> % LATIN CAPITAL LETTER TURNED T
-<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
<UFF35> <S0075>;<BASE>;<WIDECAP>;<UFF35> % FULLWIDTH LATIN CAPITAL LETTER U
<U0001F124> <S0075>;<BASE>;<COMPATCAP>;<U0001F124> % PARENTHESIZED LATIN CAPITAL LETTER U
<U0001D414> <S0075>;<BASE>;<FONTCAP>;<U0001D414> % MATHEMATICAL BOLD CAPITAL U
@@ -66591,7 +66605,6 @@ endif
<UA78D> <S0265>;<BASE>;<CAP>;<UA78D> % LATIN CAPITAL LETTER TURNED H
<U019C> <S026F>;<BASE>;<CAP>;<U019C> % LATIN CAPITAL LETTER TURNED M
<U01B1> <S028A>;<BASE>;<CAP>;<U01B1> % LATIN CAPITAL LETTER UPSILON
-<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
<UFF36> <S0076>;<BASE>;<WIDECAP>;<UFF36> % FULLWIDTH LATIN CAPITAL LETTER V
<U2164> <S0076>;<BASE>;<COMPATCAP>;<U2164> % ROMAN NUMERAL FIVE
<U0001F125> <S0076>;<BASE>;<COMPATCAP>;<U0001F125> % PARENTHESIZED LATIN CAPITAL LETTER V
@@ -66622,7 +66635,6 @@ endif
<U01B2> <S028B>;<BASE>;<CAP>;<U01B2> % LATIN CAPITAL LETTER V WITH HOOK
<U1EFC> <S1EFD>;<BASE>;<CAP>;<U1EFC> % LATIN CAPITAL LETTER MIDDLE-WELSH V
<U0245> <S028C>;<BASE>;<CAP>;<U0245> % LATIN CAPITAL LETTER TURNED V
-<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
<UFF37> <S0077>;<BASE>;<WIDECAP>;<UFF37> % FULLWIDTH LATIN CAPITAL LETTER W
<U0001F126> <S0077>;<BASE>;<COMPATCAP>;<U0001F126> % PARENTHESIZED LATIN CAPITAL LETTER W
<U0001D416> <S0077>;<BASE>;<FONTCAP>;<U0001D416> % MATHEMATICAL BOLD CAPITAL W
@@ -66649,7 +66661,6 @@ endif
<U1E86> <S0077>;"<BASE><POINT>";"<CAP><MIN>";<U1E86> % LATIN CAPITAL LETTER W WITH DOT ABOVE
<U1E88> <S0077>;"<BASE><POINS>";"<CAP><MIN>";<U1E88> % LATIN CAPITAL LETTER W WITH DOT BELOW
<U2C72> <S2C73>;<BASE>;<CAP>;<U2C72> % LATIN CAPITAL LETTER W WITH HOOK
-<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
<UFF38> <S0078>;<BASE>;<WIDECAP>;<UFF38> % FULLWIDTH LATIN CAPITAL LETTER X
<U2169> <S0078>;<BASE>;<COMPATCAP>;<U2169> % ROMAN NUMERAL TEN
<U0001F127> <S0078>;<BASE>;<COMPATCAP>;<U0001F127> % PARENTHESIZED LATIN CAPITAL LETTER X
@@ -66675,7 +66686,6 @@ endif
<U216A> "<S0078><S0069>";"<BASE><BASE>";"<COMPATCAP><COMPATCAP>";<U216A> % ROMAN NUMERAL ELEVEN
<U216B> "<S0078><S0069><S0069>";"<BASE><BASE><BASE>";"<COMPATCAP><COMPATCAP><COMPATCAP>";<U216B> % ROMAN NUMERAL TWELVE
<UA7B3> <SAB53>;<BASE>;<CAP>;<UA7B3> % LATIN CAPITAL LETTER CHI
-<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
<UFF39> <S0079>;<BASE>;<WIDECAP>;<UFF39> % FULLWIDTH LATIN CAPITAL LETTER Y
<U0001F128> <S0079>;<BASE>;<COMPATCAP>;<U0001F128> % PARENTHESIZED LATIN CAPITAL LETTER Y
<U0001D418> <S0079>;<BASE>;<FONTCAP>;<U0001D418> % MATHEMATICAL BOLD CAPITAL Y
@@ -66708,7 +66718,6 @@ endif
<U01B3> <S01B4>;<BASE>;<CAP>;<U01B3> % LATIN CAPITAL LETTER Y WITH HOOK
<U1EFE> <S1EFF>;<BASE>;<CAP>;<U1EFE> % LATIN CAPITAL LETTER Y WITH LOOP
<U021C> <S021D>;<BASE>;<CAP>;<U021C> % LATIN CAPITAL LETTER YOGH
-<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
<UFF3A> <S007A>;<BASE>;<WIDECAP>;<UFF3A> % FULLWIDTH LATIN CAPITAL LETTER Z
<U0001F129> <S007A>;<BASE>;<COMPATCAP>;<U0001F129> % PARENTHESIZED LATIN CAPITAL LETTER Z
<U2124> <S007A>;<BASE>;<FONTCAP>;<U2124> % DOUBLE-STRUCK CAPITAL Z
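The comment added with the [A-Z] block says collation is unaffected because the 4-level weights stay the same while the collation element order (CEO) changes. A toy model of that separation, not glibc's actual data structures, might look like the sketch below. The struct, the function names, and the three-entry tables are all assumptions made for illustration: the array index stands in for CEO and `primary` stands in for the first weight level.

```c
#include <stddef.h>

/* Toy model only: one entry per character in a hypothetical locale
   source.  The array index plays the role of collation element order;
   `primary` stands in for the first of the four weight levels.  */
struct toy_entry { unsigned int cp; unsigned int primary; };

/* Position of CP in the table, or -1 if it is not listed.  */
static int toy_ceo(const struct toy_entry *t, size_t n, unsigned int cp)
{
  for (size_t i = 0; i < n; i++)
    if (t[i].cp == cp)
      return (int) i;
  return -1;
}

/* Range membership looks only at positions in the table.  */
static int toy_in_range(const struct toy_entry *t, size_t n,
                        unsigned int lo, unsigned int hi, unsigned int cp)
{
  int a = toy_ceo(t, n, lo), b = toy_ceo(t, n, hi), c = toy_ceo(t, n, cp);
  return a >= 0 && b >= 0 && c >= 0 && a <= c && c <= b;
}

/* Primary-level comparison looks only at weights: -1, 0, or 1.  */
static int toy_primary_cmp(const struct toy_entry *t, size_t n,
                           unsigned int x, unsigned int y)
{
  int i = toy_ceo(t, n, x), j = toy_ceo(t, n, y);
  if (i < 0 || j < 0)
    return 0;
  return (t[i].primary > t[j].primary) - (t[i].primary < t[j].primary);
}

/* Interleaved listing: A, A WITH GRAVE, B.  */
static const struct toy_entry toy_before[] = {
  { 0x41, 1 }, { 0xC0, 1 }, { 0x42, 2 },
};

/* Rational-range listing: A and B are contiguous, weights untouched.  */
static const struct toy_entry toy_after[] = {
  { 0x41, 1 }, { 0x42, 2 }, { 0xC0, 1 },
};
```

In this model U+00C0 falls inside the A-B range for `toy_before` but outside it for `toy_after`, while `toy_primary_cmp` gives the same answer for both tables.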
@@ -81,6 +81,8 @@ copy "iso14651_t1"
%
% The following rules implement the same order for glibc.
+% All of these collating symbols are used as primary weights
+% and cause equivalence class problems, see Bug 23437.
collating-symbol <c-cedilla>
collating-symbol <g-breve>
collating-symbol <i-dotless>
@@ -111,8 +113,40 @@ reorder-after <AFTER-U>
<U011F> <g-breve>;<BASE>;<MIN>;IGNORE % ğ
<U011E> <g-breve>;<BASE>;<CAP>;IGNORE % Ğ
<U0131> <i-dotless>;<BASE>;<MIN>;IGNORE % ı
+
+% tr_TR must copy the rational range definition here for CEO:
+% Implement rational range for [A-Z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
+<U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A
+<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
+<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
+<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
+<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
+<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
+<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
+<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
+% Turkish sorting of I, but within rational range.
+% FIXME: 'I' is no longer in the equivalence class of i's.
<U0049> <i-dotless>;<BASE>;<CAP>;IGNORE % I
-<U0069> <S0069>;<BASE>;<MIN>;IGNORE % i
+<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
+<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
+<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
+<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
+<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
+<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
+<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
+<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
+<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
+<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
+<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
+<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
+<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
+<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
+<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
+<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
+<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
+
<U0130> <S0069>;<BASE>;<CAP>;IGNORE % İ
<U00F6> <o-diaresis>;<BASE>;<MIN>;IGNORE % ö
<U00D6> <o-diaresis>;<BASE>;<CAP>;IGNORE % Ö
@@ -46,14 +46,25 @@ struct
{ { 2, 10 }, { -1, -1 } } },
/* Tests for bug 9697:
+ Look for a multibyte sequence in a range. We pick the range based
+ on collation element order, because [a-z] is now a rational range
+ and no longer matches these characters.
+
+ We use U+FF53 FULLWIDTH LATIN SMALL LETTER S as the start of the
+ range, and U+33DC SQUARE SV as the end of the range. These were
+ chosen by looking at collation element ordering and picking a range
+ in which the matching character was listed.
+
+ U+02E2 \xcb\xa2 MODIFIER LETTER SMALL S
U+00DF \xc3\x9f LATIN SMALL LETTER SHARP S
U+02DA \xcb\x9a RING ABOVE
- U+02E2 \xcb\xa2 MODIFIER LETTER SMALL S */
- { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+ The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜]. */
+ { "[ｓ-㏜]|[^ｓ-㏜]", "\xcb\xa2", REG_EXTENDED, 2,
{ { 0, 2 }, { -1, -1 } } },
- { "[a-z]", "\xc3\x9f", REG_EXTENDED, 2,
+ { "[ｓ-㏜]", "\xc3\x9f", REG_EXTENDED, 2,
{ { 0, 2 }, { -1, -1 } } },
- { "[^a-z]", "\xcb\x9a", REG_EXTENDED, 2,
+ { "[^ｓ-㏜]", "\xcb\x9a", REG_EXTENDED, 2,
{ { 0, 2 }, { -1, -1 } } },
};
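Each entry in the table above pairs a pattern and a subject string with the expected `rm_so`/`rm_eo` offsets, and `{ 0, 2 }` means the two-byte UTF-8 character was matched as a whole. A minimal sketch of how such an offset pair can be obtained with the POSIX regex API follows. `match_span` is a hypothetical helper, not a function from the test, and it does not set a locale, so it runs in whatever locale the caller has selected.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as an extended regex, run it on
   SUBJECT, and store the byte offsets of the overall match.  Returns 0
   on a match, REG_NOMATCH when nothing matches, or the regcomp error
   code if the pattern does not compile.  */
static int match_span(const char *pattern, const char *subject,
                      regoff_t *start, regoff_t *end)
{
  regex_t re;
  regmatch_t m[1];
  int rc = regcomp(&re, pattern, REG_EXTENDED);
  if (rc != 0)
    return rc;
  rc = regexec(&re, subject, 1, m, 0);
  if (rc == 0)
    {
      *start = m[0].rm_so;
      *end = m[0].rm_eo;
    }
  regfree(&re);
  return rc;
}
```

To reproduce the entries in the table, the caller would first have to select a UTF-8 locale with `setlocale`. Which characters fall inside a range then depends on the installed locale data.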
@@ -67,9 +67,11 @@
# https://sourceware.org/bugzilla/show_bug.cgi?id=23393
# https://sourceware.org/bugzilla/show_bug.cgi?id=23420
#
-# No consensus exists on how best to handle the changes so the
-# iso14651_t1_common collation element order (CEO) has been changed to
-# deinterlace the a-z and A-Z regions.
+# The solution was to implement rational ranges for [a-z], [A-Z], and
+# [0-9] by rearranging the collation element order. Likewise the
+# upper and lower case letters are deinterlaced to allow for accented
+# ranges that don't include uppercase, e.g. [a-ñ] should not include
+# any uppercase letters but may include a-z and more.
#
# With the deinterlacing commit ac3a3b4b0d561d776b60317d6a926050c8541655
# could be reverted to re-test the correct non-interleaved expectations.
@@ -77,9 +79,7 @@
# Please note that despite the region being deinterlaced, the ordering
# of collation remains the same. In glibc we implement CEO and because of
# that we can reorder the elements to reorder ranges without impacting
-# collation which depends on weights. The collation element ordering
-# could have been changed to include just a-z, A-Z, and 0-9 in three
-# distinct blocks, but this needs more discussion by the community.
+# collation which depends on weights.
# B.6 004(C)
C "!#%+,-./01234567889" "!#%+,-./01234567889" 0
@@ -477,9 +477,9 @@ C "-" "[Z-\\]]" NOMATCH
# handling of ranges and the recognition of character (vs bytes).
de_DE.ISO-8859-1 "a" "[a-z]" 0
de_DE.ISO-8859-1 "z" "[a-z]" 0
-de_DE.ISO-8859-1 "ä" "[a-z]" 0
-de_DE.ISO-8859-1 "ö" "[a-z]" 0
-de_DE.ISO-8859-1 "ü" "[a-z]" 0
+de_DE.ISO-8859-1 "ä" "[a-z]" NOMATCH
+de_DE.ISO-8859-1 "ö" "[a-z]" NOMATCH
+de_DE.ISO-8859-1 "ü" "[a-z]" NOMATCH
de_DE.ISO-8859-1 "A" "[a-z]" NOMATCH
de_DE.ISO-8859-1 "Z" "[a-z]" NOMATCH
de_DE.ISO-8859-1 "Ä" "[a-z]" NOMATCH
@@ -492,9 +492,9 @@ de_DE.ISO-8859-1 "
de_DE.ISO-8859-1 "ü" "[A-Z]" NOMATCH
de_DE.ISO-8859-1 "A" "[A-Z]" 0
de_DE.ISO-8859-1 "Z" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ä" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ö" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ü" "[A-Z]" 0
+de_DE.ISO-8859-1 "Ä" "[A-Z]" NOMATCH
+de_DE.ISO-8859-1 "Ö" "[A-Z]" NOMATCH
+de_DE.ISO-8859-1 "Ü" "[A-Z]" NOMATCH
de_DE.ISO-8859-1 "a" "[[:lower:]]" 0
de_DE.ISO-8859-1 "z" "[[:lower:]]" 0
de_DE.ISO-8859-1 "ä" "[[:lower:]]" 0
@@ -566,22 +566,46 @@ de_DE.ISO-8859-1 "aa" "[[.a.]]a" 0
de_DE.ISO-8859-1 "ba" "[[.a.]]a" NOMATCH
-# And with a multibyte character set.
+# And with a multibyte character set:
+# Ensure that Turkish reordering rules don't move 'i' out of a-z set,
+# or 'I' out of A-Z set.
+tr_TR.UTF-8 "i" "[a-z]" 0
+tr_TR.UTF-8 "ı" "[a-z]" NOMATCH
+tr_TR.UTF-8 "I" "[A-Z]" 0
+tr_TR.UTF-8 "İ" "[A-Z]" NOMATCH
+tr_TR.ISO-8859-9 "i" "[a-z]" 0
+tr_TR.ISO-8859-9 "I" "[A-Z]" 0
+# See bug 23437 for I not being in [=i=].
+tr_TR.UTF-8 "I" "[=i=]" NOMATCH
en_US.UTF-8 "a" "[a-z]" 0
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8 "ñ" "[a-z]" NOMATCH
en_US.UTF-8 "z" "[a-z]" 0
en_US.UTF-8 "A" "[a-z]" NOMATCH
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8 "Ñ" "[a-z]" NOMATCH
en_US.UTF-8 "Z" "[a-z]" NOMATCH
en_US.UTF-8 "a" "[A-Z]" NOMATCH
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8 "ñ" "[A-Z]" NOMATCH
en_US.UTF-8 "z" "[A-Z]" NOMATCH
en_US.UTF-8 "A" "[A-Z]" 0
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8 "Ñ" "[A-Z]" NOMATCH
en_US.UTF-8 "Z" "[A-Z]" 0
en_US.UTF-8 "0" "[0-9]" 0
+# Test that <UFF10> FULLWIDTH DIGIT ZERO is not in [0-9].
+en_US.UTF-8 "０" "[0-9]" NOMATCH
+# Test that <U00BD> VULGAR FRACTION ONE HALF is not in [0-9].
+en_US.UTF-8 "½" "[0-9]" NOMATCH
en_US.UTF-8 "9" "[0-9]" 0
+# Test that <UFF19> FULLWIDTH DIGIT NINE is not in [0-9].
+en_US.UTF-8 "９" "[0-9]" NOMATCH
de_DE.UTF-8 "a" "[a-z]" 0
de_DE.UTF-8 "z" "[a-z]" 0
-de_DE.UTF-8 "ä" "[a-z]" 0
-de_DE.UTF-8 "ö" "[a-z]" 0
-de_DE.UTF-8 "ü" "[a-z]" 0
+de_DE.UTF-8 "ä" "[a-z]" NOMATCH
+de_DE.UTF-8 "ö" "[a-z]" NOMATCH
+de_DE.UTF-8 "ü" "[a-z]" NOMATCH
de_DE.UTF-8 "A" "[a-z]" NOMATCH
de_DE.UTF-8 "Z" "[a-z]" NOMATCH
de_DE.UTF-8 "Ä" "[a-z]" NOMATCH
@@ -594,9 +618,9 @@ de_DE.UTF-8 "ö" "[A-Z]" NOMATCH
de_DE.UTF-8 "ü" "[A-Z]" NOMATCH
de_DE.UTF-8 "A" "[A-Z]" 0
de_DE.UTF-8 "Z" "[A-Z]" 0
-de_DE.UTF-8 "Ä" "[A-Z]" 0
-de_DE.UTF-8 "Ö" "[A-Z]" 0
-de_DE.UTF-8 "Ü" "[A-Z]" 0
+de_DE.UTF-8 "Ä" "[A-Z]" NOMATCH
+de_DE.UTF-8 "Ö" "[A-Z]" NOMATCH
+de_DE.UTF-8 "Ü" "[A-Z]" NOMATCH
de_DE.UTF-8 "a" "[[:lower:]]" 0
de_DE.UTF-8 "z" "[[:lower:]]" 0
de_DE.UTF-8 "ä" "[[:lower:]]" 0
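Each line of the test input above appears to give a locale, a string, a pattern, and the expected `fnmatch` result (`0` or `NOMATCH`). One such line can be checked by hand with something like the sketch below. `check_fnmatch` is a hypothetical helper, not part of the glibc test harness, and the `-1` return for a missing locale is a convention invented here.

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: match STRING against PATTERN under LOCALE.
   Returns 0 on a match and FNM_NOMATCH on a mismatch, as fnmatch does,
   or -1 if the requested locale is not installed.  The process locale
   is reset to "C" afterwards.  */
static int check_fnmatch(const char *locale, const char *string,
                         const char *pattern)
{
  if (setlocale(LC_ALL, locale) == NULL)
    return -1;
  int result = fnmatch(pattern, string, 0);
  setlocale(LC_ALL, "C");
  return result;
}
```

A call such as `check_fnmatch("en_US.UTF-8", "ñ", "[a-z]")` would be expected to return `FNM_NOMATCH` only on a system whose locale data includes the rational-range change. On older locale data it may well return 0.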
@@ -155,7 +155,12 @@ mb_frob_pattern (const char *str, const char *letters)
*dst++ = *src;
continue;
}
- else if (!in_class && strchr (letters, *src))
+ /* We do a replacement, but not for the start of ranges, because
+ mb_replace will create invalid rational ranges. For example
+ [á-z] is an invalid range because á comes after z, but [a-á]
+ is a valid range. So we avoid replacing the start of ranges
+ to avoid this problem. */
+ else if (!in_class && src[1] != '-' && strchr (letters, *src))
dst = mb_replace (dst, *src);
else
{
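The effect of the added `src[1] != '-'` guard can be seen in isolation with a simplified stand-in for the pattern rewriting. This is only a sketch: `frob_skip_range_start` and its `replacement` parameter are hypothetical, and unlike the real `mb_frob_pattern` it does not track bracket classes or choose a per-letter replacement.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in for the rewriting done in the
   patched test.  Every byte of SRC that appears in LETTERS is replaced
   by REPLACEMENT, except when the next byte is '-', i.e. when the
   letter opens a range.  DST must be large enough for the result.  */
static void frob_skip_range_start(char *dst, const char *src,
                                  const char *letters,
                                  const char *replacement)
{
  size_t rlen = strlen(replacement);
  for (; *src != '\0'; src++)
    {
      if (src[1] != '-' && strchr(letters, *src) != NULL)
        {
          /* Not the start of a range: safe to substitute.  */
          memcpy(dst, replacement, rlen);
          dst += rlen;
        }
      else
        *dst++ = *src;
    }
  *dst = '\0';
}
```

With `letters` set to `"a"` and `replacement` set to `"\xc3\xa1"` (á in UTF-8), the pattern `[a-a]` becomes `[a-á]` and never `[á-a]`, which is the case the comment in the patch wants to avoid.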