
[V3,BZ,#18441] fix sorting multibyte charsets with an improper locale

Message ID 559AF57C.8010608@web.de
State Dropped

Commit Message

Leonhard Holz July 6, 2015, 9:39 p.m. UTC
Patch v3: Replace _NL_CURRENT with _NL_CURRENT_WORD for reading the encoding.
Patch v2: Use the UTF-8 to codepoint conversion proposed by Ondřej.

In BZ #18441, sorting Thai text with the en_US.UTF-8 locale causes a performance
regression. The cause of the problem is that

a) en_US.UTF-8 has no information for Thai characters and so always reports a zero
sort weight, which causes the comparison to check the whole string instead of
breaking off early, and

b) the sequence-to-weight list is partitioned by the first byte of the first
character (TABLEMB); this generates long lists for multibyte UTF-8 characters, as
they tend to share a starting byte (e.g. all Thai characters start with E0).

The approach of the patch is to interpret TABLEMB as a hash table and find a
better hash key. My first try was to somehow "fold" a multibyte character into one
byte, but that worsened the overall performance a lot. Extending the table to 2-byte
keys works much better while needing a reasonable amount of extra memory.

The patch vastly improves the performance of languages with multibyte characters (see
zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte
characters get a bit slower because of the extra check for the first byte while finding
the right sequence in the sequence list. This cannot be avoided since the hash key is no
longer equal to the first byte of the sequence. Tests are ok.
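
To illustrate point b) and the 2-byte key, here is a minimal standalone sketch. It is
not the patch code: the function names are hypothetical, and it only handles well-formed
3-byte UTF-8 sequences, using the decoded code point (which fits in 16 bits) as the key.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Old partitioning: the bucket is the first byte of the sequence.  */
static unsigned int
first_byte_key (const unsigned char *s)
{
  return s[0];
}

/* Sketch of a 2-byte key for a well-formed 3-byte UTF-8 sequence:
   the decoded code point, which fits in 16 bits.  */
static unsigned int
two_byte_key (const unsigned char *s)
{
  assert ((s[0] & 0xF0) == 0xE0);
  return ((unsigned int) (s[0] & 0x0F) << 12)
	 | ((unsigned int) (s[1] & 0x3F) << 6)
	 | (unsigned int) (s[2] & 0x3F);
}

/* Encode a code point in U+0800..U+FFFF as three UTF-8 bytes.  */
static void
encode3 (unsigned int cp, unsigned char out[3])
{
  assert (cp >= 0x800 && cp <= 0xFFFF);
  out[0] = (unsigned char) (0xE0 | (cp >> 12));
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));
}

/* Count the distinct buckets hit by the code points FIRST..LAST.  */
static size_t
buckets_used (unsigned int first, unsigned int last, int use_two_byte_key)
{
  static unsigned char seen[65536];
  size_t used = 0;

  memset (seen, 0, sizeof seen);
  for (unsigned int cp = first; cp <= last; ++cp)
    {
      unsigned char s[3];
      encode3 (cp, s);
      unsigned int key = use_two_byte_key ? two_byte_key (s)
					  : first_byte_key (s);
      if (!seen[key])
	{
	  seen[key] = 1;
	  ++used;
	}
    }
  return used;
}
```

For the Thai consonants U+0E01..U+0E2E, buckets_used (0x0E01, 0x0E2E, 0) gives 1 (every
sequence starts with E0), while buckets_used (0x0E01, 0x0E2E, 1) gives 46, one bucket per
character.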

filelist#C			  1.73%
filelist#en_US.UTF-8		  0.54%
lorem_ipsum#vi_VN.UTF-8		  1.90%
lorem_ipsum#ar_SA.UTF-8		-12.06%
lorem_ipsum#en_US.UTF-8		  1.15%
lorem_ipsum#zh_CN.UTF-8		-86.32%
lorem_ipsum#cs_CZ.UTF-8		-11.42%
lorem_ipsum#en_GB.UTF-8		 -3.09%
lorem_ipsum#da_DK.UTF-8		  6.70%
lorem_ipsum#pl_PL.UTF-8		 -1.04%
lorem_ipsum#fr_FR.UTF-8		 -1.22%
lorem_ipsum#pt_PT.UTF-8		  0.47%
lorem_ipsum#el_GR.UTF-8		-29.40%
lorem_ipsum#ru_RU.UTF-8		-11.79%
lorem_ipsum#iw_IL.UTF-8		 -1.39%
lorem_ipsum#es_ES.UTF-8		  3.91%
lorem_ipsum#hi_IN.UTF-8		-98.26%
lorem_ipsum#sv_SE.UTF-8		  5.61%
lorem_ipsum#hu_HU.UTF-8		 15.32%
lorem_ipsum#tr_TR.UTF-8		 -3.51%
lorem_ipsum#is_IS.UTF-8		  5.62%
lorem_ipsum#it_IT.UTF-8		 -5.97%
lorem_ipsum#sr_RS.UTF-8		 -1.19%
lorem_ipsum#ja_JP.UTF-8		-98.11%
wikipedia-th#en_US.UTF-8	-99.63%


	* locale/programs/ld-collate.c (struct locale_collate_t):
	Expand mbheads array from 256 to 16384 entries.
	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
	(collate_output): Output larger table and sequences including first byte.
	* locale/weight.h (findidx): Use 2-byte key for table if UTF-8 locale.
	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
	* posix/regcomp.c (build_equiv_class): Likewise.
	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
	* posix/regexec.c (check_node_accept_bytes): Likewise.
	* string/strcoll_l.c (get_next_seq): Likewise.
	(STRCOLL): Call get_next_seq with encoding parameter.
	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
	(STRXFRM): Call find_idx with encoding parameter.

Comments

Leonhard Holz July 13, 2015, 8:25 a.m. UTC | #1
Ping!

Leonhard Holz July 20, 2015, 9:11 p.m. UTC | #2
Ping!

Leonhard Holz July 30, 2015, 5:55 a.m. UTC | #3
Ping!

Carlos O'Donell July 31, 2015, 3:58 a.m. UTC | #4
On 07/30/2015 01:55 AM, Leonhard Holz wrote:
> Ping!
> 
> On 20.07.2015 at 23:11, Leonhard Holz wrote:
>> Ping!
>>
>> On 13.07.2015 at 10:25, Leonhard Holz wrote:
>>> Ping!
>>>
>>> On 06.07.2015 at 23:39, Leonhard Holz wrote:
>>>> Patch v3: Replace _NL_CURRENT with _NL_CURRENT_WORD for reading the encoding.
>>>> Patch v2: Use the UTF-8 to codepoint conversion proposed by Ondřej.

v3 looks almost good to me. I have some comments to make on it, but I don't think
we're ready for 2.22.

Ping me again when 2.23 opens.

c.
Carlos O'Donell Oct. 7, 2015, 4:12 p.m. UTC | #5
On 07/06/2015 05:39 PM, Leonhard Holz wrote:
> Patch v3: Replace _NL_CURRENT with _NL_CURRENT_WORD for reading the encoding.
> Patch v2: Use the UTF-8 to codepoint conversion proposed by Ondřej.

Leonhard,

Thank you very much for the excellent work on this patch.

In general it looks great. At a high level, this is the right solution.

There are two things I noticed:
* Code duplication. Please see if you can avoid this by refactoring.
* Change to zero-indexing needs a little more review in two spots.

Please review and repost v4.

I agree that a follow-up patch for cleanup and optimization is the correct
approach. You should not be rolling those two things together. This patch is
sufficient and does a good job.

It is my opinion that this needs a NEWS entry to call out the change, something
like:

* Performance degradation in strcoll_l caused by sorting UTF-8 characters for
  languages that are not part of the current locale has been fixed. The most
  common case of this included sorting, for example, Thai text using the
  en_US.UTF-8 locale. In this situation the en_US.UTF-8 locale has no sorting
  information for Thai, and therefore resulted in lower than expected performance
  from strcoll_l. The strcoll_l algorithm has been improved to handle this case,
  but at a slight performance loss on average for locales with one-byte characters
  like en_US.UTF-8 (0.54% loss). Conformance is maintained and the locale can be
  used to sort any UTF-8 characters efficiently. (Bug 18441)

Thoughts?

> In BZ #18441, sorting Thai text with the en_US.UTF-8 locale causes a performance
> regression. The cause of the problem is that
> 
> a) en_US.UTF-8 has no information for Thai characters and so always reports a zero
> sort weight, which causes the comparison to check the whole string instead of
> breaking off early, and
> 
> b) the sequence-to-weight list is partitioned by the first byte of the first
> character (TABLEMB); this generates long lists for multibyte UTF-8 characters, as
> they tend to share a starting byte (e.g. all Thai characters start with E0).
> 
> The approach of the patch is to interpret TABLEMB as a hash table and find a
> better hash key. My first try was to somehow "fold" a multibyte character into one
> byte, but that worsened the overall performance a lot. Extending the table to 2-byte
> keys works much better while needing a reasonable amount of extra memory.
> 
> The patch vastly improves the performance of languages with multibyte characters (see
> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte
> characters get a bit slower because of the extra check for the first byte while finding
> the right sequence in the sequence list. This cannot be avoided since the hash key is no
> longer equal to the first byte of the sequence. Tests are ok.
> 
> filelist#C			  1.73%
> filelist#en_US.UTF-8		  0.54%
> lorem_ipsum#vi_VN.UTF-8		  1.90%
> lorem_ipsum#ar_SA.UTF-8		-12.06%
> lorem_ipsum#en_US.UTF-8		  1.15%
> lorem_ipsum#zh_CN.UTF-8		-86.32%
> lorem_ipsum#cs_CZ.UTF-8		-11.42%
> lorem_ipsum#en_GB.UTF-8		 -3.09%
> lorem_ipsum#da_DK.UTF-8		  6.70%
> lorem_ipsum#pl_PL.UTF-8		 -1.04%
> lorem_ipsum#fr_FR.UTF-8		 -1.22%
> lorem_ipsum#pt_PT.UTF-8		  0.47%
> lorem_ipsum#el_GR.UTF-8		-29.40%
> lorem_ipsum#ru_RU.UTF-8		-11.79%
> lorem_ipsum#iw_IL.UTF-8		 -1.39%
> lorem_ipsum#es_ES.UTF-8		  3.91%
> lorem_ipsum#hi_IN.UTF-8		-98.26%
> lorem_ipsum#sv_SE.UTF-8		  5.61%
> lorem_ipsum#hu_HU.UTF-8		 15.32%
> lorem_ipsum#tr_TR.UTF-8		 -3.51%
> lorem_ipsum#is_IS.UTF-8		  5.62%
> lorem_ipsum#it_IT.UTF-8		 -5.97%
> lorem_ipsum#sr_RS.UTF-8		 -1.19%
> lorem_ipsum#ja_JP.UTF-8		-98.11%
> wikipedia-th#en_US.UTF-8	-99.63%

While there is some performance loss for key locales, overall the entire change
to the new implementation is a win IMO. Therefore even though we lose in some
key languages like da_DK, that's OK. People should review their applications,
profile them, and then we can work out making strcoll_l faster based on that
profile driven feedback (or new benchtests).
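
As a starting point for that kind of application-side profiling, a rough sketch of a
timing helper follows. It is not the glibc benchtests harness; the function names are
hypothetical, and the requested locale is assumed to be installed (if it is not,
setlocale fails and the current locale stays in effect).

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* qsort comparator that defers to the current LC_COLLATE rules.  */
static int
cmp_coll (const void *a, const void *b)
{
  return strcoll (*(const char *const *) a, *(const char *const *) b);
}

/* Sort WORDS in place with strcoll under LOCALE_NAME and return the
   CPU time spent in seconds.  If LOCALE_NAME is not available,
   setlocale returns NULL and the previous locale is used instead.  */
static double
time_strcoll_sort (const char *locale_name, const char **words, size_t n)
{
  assert (locale_name != NULL && words != NULL);
  (void) setlocale (LC_COLLATE, locale_name);

  clock_t start = clock ();
  qsort (words, n, sizeof *words, cmp_coll);
  return (double) (clock () - start) / CLOCKS_PER_SEC;
}
```

Running this over the same word list before and after a glibc change, with a list large
enough to dominate timer resolution, gives a first impression of the per-locale cost.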

> 
> 
> 	* locale/programs/ld-collate.c (struct locale_collate_t):
> 	Expand mbheads array from 256 to 16384 entries.
> 	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
> 	(collate_output): Output larger table and sequences including first byte.
> 	* locale/weight.h (findidx): Use 2-byte key for table if UTF-8 locale.
> 	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
> 	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
> 	* posix/regcomp.c (build_equiv_class): Likewise.
> 	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
> 	* posix/regexec.c (check_node_accept_bytes): Likewise.
> 	* string/strcoll_l.c (get_next_seq): Likewise.
> 	(STRCOLL): Call get_next_seq with encoding parameter.
> 	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
> 	(STRXFRM): Call find_idx with encoding parameter.
> 
> 
> diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
> index a39a94f..8f4bec8 100644
> --- a/locale/programs/ld-collate.c
> +++ b/locale/programs/ld-collate.c
> @@ -244,9 +244,9 @@ struct locale_collate_t
>       Therefore we keep all relevant input in a list.  */
>    struct locale_collate_t *next;
> 
> -  /* Arrays with heads of the list for each of the leading bytes in
> +  /* Arrays with heads of the list for the leading bytes in
>       the multibyte sequences.  */
> -  struct element_t *mbheads[256];
> +  struct element_t *mbheads[256 * 256];

Use #define MBHEADS_SZ or something similar.

OK with that change.
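
A minimal sketch of what I mean, assuming the name MBHEADS_SZ and using a stand-in
struct rather than the real struct locale_collate_t:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro name; one definition shared by the array
   declaration and every loop bound.  */
#define MBHEADS_SZ (256 * 256)

struct element_t;		/* Opaque stand-in for the real type.  */

/* Stand-in for the relevant part of struct locale_collate_t.  */
struct collate_sketch
{
  struct element_t *mbheads[MBHEADS_SZ];
};

/* Mirrors the "is any mbheads entry unset" scan in collate_finish:
   returns nonzero if any slot other than index 0 is NULL.  */
static int
any_head_unset (const struct collate_sketch *collate)
{
  assert (collate != NULL);
  for (size_t i = 1; i < MBHEADS_SZ; ++i)
    if (collate->mbheads[i] == NULL)
      return 1;
  return 0;
}
```

With that, changing the table size later touches a single line.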

> 
>    /* Arrays with heads of the list for each of the leading bytes in
>       the multibyte sequences.  */
> @@ -1558,6 +1558,7 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
>    struct section_list *sect;
>    int ruleidx;
>    int nr_wide_elems = 0;
> +  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;

OK.

Will this always work? I'm wondering about a user-generated charmap named 'utf8',
the other common alias, used for instance where the dash is not valid syntax. That is
probably not a concern, since the official name is UTF-8 and that's what you should use.

> 
>    if (collate == NULL)
>      {
> @@ -1664,7 +1665,49 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
>  	  struct element_t *lastp = NULL;
> 
>  	  /* Find the point where to insert in the list.  */
> -	  eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]];
> +	  uint16_t index = ((unsigned char *) runp->mbs)[0];
> +
> +	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.
> +	     Also check the UTF-8 encoding.  Keep locale/weight.h in sync.  */

Not OK. Can we refactor to avoid having to keep the two in sync?
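
One possible shape for such a refactoring is a single inline helper that both
ld-collate.c and weight.h include. This is only a sketch: the name utf8_table_index and
its return convention are hypothetical, and unlike the weight.h hunk it also rejects a
stray continuation byte in the lead position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared helper.  Computes the 16-bit table index for
   the UTF-8 sequence at S (LEN bytes available), using the same
   arithmetic as the patch.  Stores the index in *INDEX and returns 0,
   or returns -1 if the sequence is malformed or truncated.
   Continuation bytes are not validated, as in the patch.  */
static inline int
utf8_table_index (const unsigned char *s, size_t len, uint16_t *index)
{
  assert (s != NULL && index != NULL);
  if (len < 1)
    return -1;

  uint32_t b1 = s[0];
  if (b1 < 0x80)
    *index = (uint16_t) b1;
  else if ((b1 & 0xC0) == 0x80)
    return -1;			/* Stray continuation byte.  */
  else if (b1 < 0xE0)
    {
      if (len < 2)
	return -1;
      *index = (uint16_t) ((b1 << 6) + s[1] - 0x3080);
    }
  else if (b1 < 0xF0)
    {
      if (len < 3)
	return -1;
      *index = (uint16_t) ((b1 << 12) + ((uint32_t) s[1] << 6)
			   + s[2] - 0xE2080);
    }
  else if (b1 < 0xF8)
    {
      if (len < 4)
	return -1;
      /* Only the low 16 bits of the code point survive.  */
      *index = (uint16_t) (((uint32_t) s[1] << 12)
			   + ((uint32_t) s[2] << 6) + s[3] - 0x82080);
    }
  else
    return -1;

  return 0;
}
```

Each caller would then keep only its own error handling: error_at_line in localedef,
and "advance by one byte and return 0" in findidx.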

> +	  if (is_utf8 && index >= 0x80)
> +	    {
> +	      if ((index & 0xC0) == 0x80)
> +		{
> +		utf8_error:
> +		  WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line,
> +						  _("\
> +malformed UTF-8 character in `%s'"), runp->name););
> +		  goto dont_insert;
> +		}
> +	      else if (index < 0xE0)
> +		{
> +		  if (runp->nmbs < 2)
> +		    goto utf8_error;
> +		  uint16_t byte2 = ((unsigned char *) runp->mbs)[1];
> +		  index = (index << 6) + byte2 - 0x3080;
> +		}
> +	      else if (index < 0xF0)
> +		{
> +		  if (runp->nmbs < 3)
> +		    goto utf8_error;
> +		  uint16_t byte2 = ((unsigned char *) runp->mbs)[1];
> +		  uint16_t byte3 = ((unsigned char *) runp->mbs)[2];
> +		  index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
> +		}
> +	      else if (index < 0xF8)
> +		{
> +		  if (runp->nmbs < 4)
> +		    goto utf8_error;
> +		  uint16_t byte2 = ((unsigned char *) runp->mbs)[1];
> +		  uint16_t byte3 = ((unsigned char *) runp->mbs)[2];
> +		  uint16_t byte4 = ((unsigned char *) runp->mbs)[3];
> +		  index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;
> +		}
> +	      else
> +		goto utf8_error;
> +	    }
> +
> +	  eptr = &collate->mbheads[index];

OK.

>  	  while (*eptr != NULL)
>  	    {
>  	      if ((*eptr)->nmbs < runp->nmbs)
> @@ -1735,7 +1778,7 @@ symbol `%s' has the same encoding as"), (*eptr)->name);
> 
>    /* Find out whether any of the `mbheads' entries is unset.  In this
>       case we use the UNDEFINED entry.  */
> -  for (i = 1; i < 256; ++i)
> +  for (i = 1; i < 256 * 256; ++i)

Use MBHEADS_SZ. OK with that change.

>      if (collate->mbheads[i] == NULL)
>        {
>  	need_undefined = 1;
> @@ -2108,7 +2151,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>    const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE);
>    struct locale_file file;
>    size_t ch;
> -  int32_t tablemb[256];
> +  int32_t tablemb[256 * 256];

Again, use MBHEADS_SZ; it avoids duplication.

>    struct obstack weightpool;
>    struct obstack extrapool;
>    struct obstack indirectpool;
> @@ -2186,7 +2229,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>    if (collate->undefined.used_in_level != 0)
>      output_weight (&weightpool, collate, &collate->undefined);
> 
> -  for (ch = 1; ch < 256; ++ch)
> +  for (ch = 1; ch < 256 * 256; ++ch)

Likewise.

>      if (collate->mbheads[ch]->mbnext == NULL
>  	&& collate->mbheads[ch]->nmbs <= 1)
>        {
> @@ -2211,7 +2254,6 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  	   and add only one index into the weight table.  We can find the
>  	   consecutive entries since they are also consecutive in the list.  */
>  	struct element_t *runp = collate->mbheads[ch];
> -	struct element_t *lastp;

OK.

> 
>  	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
> 
> @@ -2239,7 +2281,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> 
>  		/* Compute how much space we will need.  */
>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
> -					  + 2 * (runp->nmbs - 1));
> +					  + 2 * runp->nmbs);

Doesn't the change to zero indexing make the conditional in the code above this wrong?

e.g.
            if (runp->mbnext != NULL
                && runp->nmbs == runp->mbnext->nmbs
                && memcmp (runp->mbs, runp->mbnext->mbs, runp->nmbs - 1) == 0
                && (runp->mbs[runp->nmbs - 1]
                    == runp->mbnext->mbs[runp->nmbs - 1] + 1))
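
To make that conditional easier to reason about in isolation, here it is pulled out
into a standalone predicate. This is a sketch only: the function name is hypothetical,
the bytes are taken as unsigned char, and it does not by itself settle whether the
condition needs to change.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero if CUR and NEXT have the same length, agree on all
   bytes but the last, and CUR's last byte is exactly one greater than
   NEXT's last byte.  Same comparison as the quoted conditional.  */
static int
is_consecutive (const unsigned char *cur, size_t cur_n,
		const unsigned char *next, size_t next_n)
{
  assert (cur != NULL && next != NULL && cur_n > 0);
  return (cur_n == next_n
	  && memcmp (cur, next, cur_n - 1) == 0
	  && cur[cur_n - 1] == next[cur_n - 1] + 1);
}
```

Note that the predicate compares the full sequences from mbs[0] onward, so the question
is whether it still matches what is now written to extrapool.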


>  		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
>  		obstack_make_room (&extrapool, added);
> 
> @@ -2262,9 +2304,9 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  		/* Now walk backward from here to the beginning.  */
>  		curp = runp;
> 
> -		assert (runp->nmbs <= 256);
> -		obstack_1grow_fast (&extrapool, curp->nmbs - 1);
> -		for (i = 1; i < curp->nmbs; ++i)
> +		assert (runp->nmbs <= 255);
> +		obstack_1grow_fast (&extrapool, curp->nmbs);
> +		for (i = 0; i < curp->nmbs; ++i)

OK.

>  		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
> 
>  		/* Now find the end of the consecutive sequence and
> @@ -2284,7 +2326,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> 
>  		/* And add the end byte sequence.  Without length this
>  		   time.  */
> -		for (i = 1; i < curp->nmbs; ++i)
> +		for (i = 0; i < curp->nmbs; ++i)

OK.

>  		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
>  	      }
>  	    else
> @@ -2298,15 +2340,15 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  		weightidx = output_weight (&weightpool, collate, runp);
> 
>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
> -					  + runp->nmbs - 1);
> +					  + runp->nmbs);

OK.

>  		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
>  		obstack_make_room (&extrapool, added);
> 
>  		obstack_int32_grow_fast (&extrapool, weightidx);
> -		assert (runp->nmbs <= 256);
> -		obstack_1grow_fast (&extrapool, runp->nmbs - 1);
> +		assert (runp->nmbs <= 255);
> +		obstack_1grow_fast (&extrapool, runp->nmbs);
> 
> -		for (i = 1; i < runp->nmbs; ++i)
> +		for (i = 0; i < runp->nmbs; ++i)

OK.

>  		  obstack_1grow_fast (&extrapool, runp->mbs[i]);
>  	      }
> 
> @@ -2315,30 +2357,25 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  	      obstack_1grow_fast (&extrapool, '\0');
> 
>  	    /* Next entry.  */
> -	    lastp = runp;

OK.

>  	    runp = runp->mbnext;
>  	  }
>  	while (runp != NULL);
> 
>  	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
> 
> -	/* If the final entry in the list is not a single character we
> -	   add an UNDEFINED entry here.  */
> -	if (lastp->nmbs != 1)
> -	  {
> -	    int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
> -	    obstack_make_room (&extrapool, added);
> +	/* Add an UNDEFINED entry at the end of the list.  */
> +	int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
> +	obstack_make_room (&extrapool, added);

OK.

> 
> -	    obstack_int32_grow_fast (&extrapool, 0);
> -	    /* XXX What rule? We just pick the first.  */
> -	    obstack_1grow_fast (&extrapool, 0);
> -	    /* Length is zero.  */
> -	    obstack_1grow_fast (&extrapool, 0);
> +	obstack_int32_grow_fast (&extrapool, 0);
> +	/* XXX What rule? We just pick the first.  */
> +	obstack_1grow_fast (&extrapool, 0);
> +	/* Length is zero.  */
> +	obstack_1grow_fast (&extrapool, 0);

OK.

> 
> -	    /* Add alignment bytes if necessary.  */
> -	    while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
> -	      obstack_1grow_fast (&extrapool, '\0');
> -	  }
> +	/* Add alignment bytes if necessary.  */
> +	while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
> +	  obstack_1grow_fast (&extrapool, '\0');

OK.

>        }
> 
>    /* Add padding to the tables if necessary.  */
> @@ -2346,7 +2383,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>      obstack_1grow (&weightpool, 0);
> 
>    /* Now add the four tables.  */
> -  add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256);
> +  add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256 * 256);

Use macro as described above.

>    add_locale_raw_obstack (&file, &weightpool);
>    add_locale_raw_obstack (&file, &extrapool);
>    add_locale_raw_obstack (&file, &indirectpool);
> diff --git a/locale/weight.h b/locale/weight.h
> index 721bf7d..d9e63ac 100644
> --- a/locale/weight.h
> +++ b/locale/weight.h
> @@ -21,24 +21,65 @@
> 
>  /* Find index of weight.  */
>  static inline int32_t __attribute__ ((always_inline))
> -findidx (const int32_t *table,
> +findidx (uint_fast32_t locale_encoding,
> +	 const int32_t *table,
>  	 const int32_t *indirect,
>  	 const unsigned char *extra,
>  	 const unsigned char **cpp, size_t len)
>  {
> -  int_fast32_t i = table[*(*cpp)++];
>    const unsigned char *cp;
>    const unsigned char *usrc;
> +  uint16_t index = (*cpp)[0];

OK.

> 
> +  /* Special handling of UTF-8: Generate a 2-byte index for table.
> +     This has to be equal to the folding in locale/programs/ld-collate.c:
> +     collate_finish().  */
> +  if (locale_encoding == __cet_utf8 && index >= 0x80)
> +    {
> +      if (index < 0xE0)
> +	{
> +	  if (len < 2)
> +	    goto utf8_error;
> +	  uint16_t byte2 = (*cpp)[1];
> +	  index = (index << 6) + byte2 - 0x3080;
> +	}
> +      else if (index < 0xF0)
> +	{
> +	  if (len < 3)
> +	    goto utf8_error;
> +	  uint16_t byte2 = (*cpp)[1];
> +	  uint16_t byte3 = (*cpp)[2];
> +	  index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
> +	}
> +      else if (index < 0xF8)
> +	{
> +	  if (len < 4)
> +	    goto utf8_error;
> +	  uint16_t byte2 = (*cpp)[1];
> +	  uint16_t byte3 = (*cpp)[2];
> +	  uint16_t byte4 = (*cpp)[3];
> +	  index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;
> +	}
> +      else
> +	{
> +	utf8_error:
> +	  *cpp += 1;
> +	  return 0;
> +	}
> +    }

See notes above about avoiding duplication.

> +
> +  int_fast32_t i = table[index];
>    if (i >= 0)
> -    /* This is an index into the weight table.  Cool.  */
> -    return i;
> +    {
> +      /* This is an index into the weight table.  Cool.  */
> +      *cpp += 1;
> +      return i;
> +    }

OK.

> 
>    /* Oh well, more than one sequence starting with this byte.
>       Search for the correct one.  */
>    cp = &extra[-i];
>    usrc = *cpp;
> -  --len;
>    while (1)
>      {
>        size_t nhere;
> @@ -57,8 +98,7 @@ findidx (const int32_t *table,
>  	  /* It is a single character.  If it matches we found our
>  	     index.  Note that at the end of each list there is an
>  	     entry of length zero which represents the single byte
> -	     sequence.  The first (and here only) byte was tested
> -	     already.  */
> +	     sequence.  */
>  	  size_t cnt;
> 
>  	  for (cnt = 0; cnt < nhere && cnt < len; ++cnt)
> @@ -68,7 +108,7 @@ findidx (const int32_t *table,
>  	  if (cnt == nhere)
>  	    {
>  	      /* Found it.  */
> -	      *cpp += nhere;
> +	      *cpp += nhere > 0 ? nhere : 1;
>  	      return i;

OK.

>  	    }
> 
> @@ -127,7 +167,7 @@ findidx (const int32_t *table,
>  	      while (++cnt < nhere);
>  	    }
> 
> -	  *cpp += nhere;
> +	  *cpp += nhere > 0 ? nhere : 1;
>  	  return indirect[-i + offset];

OK.

>  	}
>      }
> diff --git a/locale/weightwc.h b/locale/weightwc.h
> index 3cd7a69..3781d0d 100644
> --- a/locale/weightwc.h
> +++ b/locale/weightwc.h
> @@ -21,7 +21,8 @@
> 
>  /* Find index of weight.  */
>  static inline int32_t __attribute__ ((always_inline))
> -findidx (const int32_t *table,
> +findidx (uint_fast32_t encoding,
> +	 const int32_t *table,
>  	 const int32_t *indirect,
>  	 const wint_t *extra,
>  	 const wint_t **cpp, size_t len)
> diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
> index f46c9df..db576fe 100644
> --- a/posix/fnmatch_loop.c
> +++ b/posix/fnmatch_loop.c
> @@ -389,6 +389,8 @@ FCT (pattern, string, string_end, no_leading_period, flags, ends, alloca_used)
>  			const int32_t *indirect;
>  			int32_t idx;
>  			const UCHAR *cp = (const UCHAR *) &str;
> +			uint_fast32_t encoding =
> +			  _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);

OK.

> 
>  # if WIDE_CHAR_VERSION
>  			table = (const int32_t *)
> @@ -410,7 +412,7 @@ FCT (pattern, string, string_end, no_leading_period, flags, ends, alloca_used)
>  			  _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
>  # endif
> 
> -			idx = FINDIDX (table, indirect, extra, &cp, 1);
> +			idx = FINDIDX (encoding, table, indirect, extra, &cp, 1);

OK.

>  			if (idx != 0)
>  			  {
>  			    /* We found a table entry.  Now see whether the
> @@ -420,7 +422,7 @@ FCT (pattern, string, string_end, no_leading_period, flags, ends, alloca_used)
>  			    int32_t idx2;
>  			    const UCHAR *np = (const UCHAR *) n;
> 
> -			    idx2 = FINDIDX (table, indirect, extra,
> +			    idx2 = FINDIDX (encoding, table, indirect, extra,

OK.

>  					    &np, string_end - n);
>  			    if (idx2 != 0
>  				&& (idx >> 24) == (idx2 >> 24)
> diff --git a/posix/regcomp.c b/posix/regcomp.c
> index bf8aa16..afd9c4c 100644
> --- a/posix/regcomp.c
> +++ b/posix/regcomp.c
> @@ -3426,6 +3426,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>    uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
>    if (nrules != 0)
>      {
> +      uint_fast32_t encoding;
>        const int32_t *table, *indirect;
>        const unsigned char *weights, *extra, *cp;
>        unsigned char char_buf[2];
> @@ -3434,6 +3435,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>        size_t len;
>        /* Calculate the index for equivalence class.  */
>        cp = name;
> +      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);

OK.

>        table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>        weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE,
>  					       _NL_COLLATE_WEIGHTMB);
> @@ -3441,7 +3443,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>  						   _NL_COLLATE_EXTRAMB);
>        indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
>  						_NL_COLLATE_INDIRECTMB);
> -      idx1 = findidx (table, indirect, extra, &cp, -1);
> +      idx1 = findidx (encoding, table, indirect, extra, &cp, -1);
>        if (BE (idx1 == 0 || *cp != '\0', 0))
>  	/* This isn't a valid character.  */
>  	return REG_ECOLLATE;
> @@ -3452,7 +3454,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>  	{
>  	  char_buf[0] = ch;
>  	  cp = char_buf;
> -	  idx2 = findidx (table, indirect, extra, &cp, 1);
> +	  idx2 = findidx (encoding, table, indirect, extra, &cp, 1);
>  /*
>  	  idx2 = table[ch];
>  */
> diff --git a/posix/regex_internal.h b/posix/regex_internal.h
> index 154e969..42e43fa 100644
> --- a/posix/regex_internal.h
> +++ b/posix/regex_internal.h
> @@ -743,17 +743,19 @@ re_string_elem_size_at (const re_string_t *pstr, int idx)
>  #  ifdef _LIBC
>    const unsigned char *p, *extra;
>    const int32_t *table, *indirect;
> +  uint_fast32_t encoding;
>    uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
> 
>    if (nrules != 0)
>      {
> +      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>        table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>        extra = (const unsigned char *)
>  	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
>        indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
>  						_NL_COLLATE_INDIRECTMB);
>        p = pstr->mbs + idx;
> -      findidx (table, indirect, extra, &p, pstr->len - idx);
> +      findidx (encoding, table, indirect, extra, &p, pstr->len - idx);

OK.

>        return p - pstr->mbs - idx;
>      }
>    else
> diff --git a/posix/regexec.c b/posix/regexec.c
> index 70cd606..798eb51 100644
> --- a/posix/regexec.c
> +++ b/posix/regexec.c
> @@ -3869,6 +3869,7 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>        if (nrules != 0)
>  	{
>  	  unsigned int in_collseq = 0;
> +	  uint_fast32_t encoding;
>  	  const int32_t *table, *indirect;
>  	  const unsigned char *weights, *extra;
>  	  const char *collseqwc;
> @@ -3919,6 +3920,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>  	  if (cset->nequiv_classes)
>  	    {
>  	      const unsigned char *cp = pin;
> +	      encoding =
> +		_NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>  	      table = (const int32_t *)
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>  	      weights = (const unsigned char *)
> @@ -3927,7 +3930,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
>  	      indirect = (const int32_t *)
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
> -	      int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
> +	      int32_t idx = findidx (encoding, table, indirect, extra, &cp,
> +				     elem_len);

OK.

>  	      if (idx > 0)
>  		for (i = 0; i < cset->nequiv_classes; ++i)
>  		  {
> diff --git a/string/strcoll_l.c b/string/strcoll_l.c
> index 8f1225f..668ea9d 100644
> --- a/string/strcoll_l.c
> +++ b/string/strcoll_l.c
> @@ -78,9 +78,9 @@ typedef struct
>  /* Get next sequence.  Traverse the string as required.  */
>  static __always_inline void
>  get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
> -	      const USTRING_TYPE *weights, const int32_t *table,
> -	      const USTRING_TYPE *extra, const int32_t *indirect,
> -	      int pass)
> +	      const USTRING_TYPE *weights, uint_fast32_t encoding,
> +	      const int32_t *table, const USTRING_TYPE *extra,
> +	      const int32_t *indirect, int pass)

OK.

>  {
>    size_t val = seq->val = 0;
>    int len = seq->len;
> @@ -124,7 +124,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
>  	      us = seq->back_us;
>  	      while (i < backw)
>  		{
> -		  int32_t tmp = findidx (table, indirect, extra, &us, -1);
> +		  int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
>  		  idx = tmp & 0xffffff;
>  		  i++;
>  		}
> @@ -139,7 +139,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
> 
>  	  while (*us != L('\0'))
>  	    {
> -	      int32_t tmp = findidx (table, indirect, extra, &us, -1);
> +	      int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
>  	      unsigned char rule = tmp >> 24;
>  	      prev_idx = idx;
>  	      idx = tmp & 0xffffff;
> @@ -345,9 +345,9 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
> 
>        while (1)
>  	{
> -	  get_next_seq (&seq1, nrules, rulesets, weights, table,
> +	  get_next_seq (&seq1, nrules, rulesets, weights, encoding, table,
>  				    extra, indirect, pass);
> -	  get_next_seq (&seq2, nrules, rulesets, weights, table,
> +	  get_next_seq (&seq2, nrules, rulesets, weights, encoding, table,
>  				    extra, indirect, pass);

OK.

>  	  /* See whether any or both strings are empty.  */
>  	  if (seq1.len == 0 || seq2.len == 0)
> diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c
> index 8b61ea2..95abc4e 100644
> --- a/string/strxfrm_l.c
> +++ b/string/strxfrm_l.c
> @@ -53,6 +53,7 @@ typedef struct
>    uint_fast32_t nrules;
>    unsigned char *rulesets;
>    USTRING_TYPE *weights;
> +  uint_fast32_t encoding;
>    int32_t *table;
>    USTRING_TYPE *extra;
>    int32_t *indirect;
> @@ -100,8 +101,8 @@ static __always_inline size_t
>  find_idx (const USTRING_TYPE **us, int32_t *weight_idx,
>  	  unsigned char *rule_idx, const locale_data_t *l_data, const int pass)
>  {
> -  int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us,
> -			 -1);
> +  int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect,
> +			 l_data->extra, us, -1);
>    *rule_idx = tmp >> 24;
>    int32_t idx = tmp & 0xffffff;
>    size_t len = l_data->weights[idx++];
> @@ -693,6 +694,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
>    /* Get the locale data.  */
>    l_data.rulesets = (unsigned char *)
>      current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string;
> +  l_data.encoding =
> +    current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
>    l_data.table = (int32_t *)
>      current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string;
>    l_data.weights = (USTRING_TYPE *)
> @@ -721,8 +724,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
> 
>    do
>      {
> -      int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur,
> -			     -1);
> +      int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect,
> +			     l_data.extra, &cur, -1);
>        rulearr[idxmax] = tmp >> 24;
>        idxarr[idxmax] = tmp & 0xffffff;
> 

OK.

Cheers,
Carlos.
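
For readers following the review, the UTF-8 folding that findidx and collate_finish have to keep in sync can be sketched in isolation. The helper name fold_utf8_key is hypothetical and not part of glibc; only the branch thresholds and the constants 0x3080, 0xE2080 and 0x82080 are taken from the patch. Unlike the runtime findidx in the patch, this sketch also rejects a stray continuation byte in the lead position, as the ld-collate.c side does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a glibc function): fold the UTF-8 sequence
   starting at S, of which at most LEN bytes are readable, into a 16-bit
   table key.  The arithmetic mirrors the patch; the result is the code
   point truncated to its low 16 bits.  Returns -1 for a truncated or
   malformed sequence.  */
static int
fold_utf8_key (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  if (len == 0)
    return -1;

  uint32_t b1 = s[0];
  if (b1 < 0x80)
    return (int) b1;            /* ASCII: the key is the byte itself.  */
  if ((b1 & 0xC0) == 0x80)
    return -1;                  /* Continuation byte in lead position.  */

  if (b1 < 0xE0)
    {
      /* Two-byte sequence: subtract the 0xC0 and 0x80 marker bits.  */
      if (len < 2)
        return -1;
      return (int) (uint16_t) ((b1 << 6) + s[1] - 0x3080);
    }
  if (b1 < 0xF0)
    {
      /* Three-byte sequence: the key equals the code point.  */
      if (len < 3)
        return -1;
      return (int) (uint16_t) ((b1 << 12) + ((uint32_t) s[1] << 6)
                               + s[2] - 0xE2080);
    }
  if (b1 < 0xF8)
    {
      /* Four-byte sequence: the lead byte is ignored, so only the low
         bits of the code point survive the truncation to 16 bits.  */
      if (len < 4)
        return -1;
      return (int) (uint16_t) (((uint32_t) s[1] << 12)
                               + ((uint32_t) s[2] << 6) + s[3] - 0x82080);
    }
  return -1;
}
```

Under these assumptions the Thai character U+0E01 (bytes E0 B8 81) folds to key 0x0E01, while the four-byte U+10041 (bytes F0 90 81 81) folds to 0x0041, the same key as ASCII 'A'. Keys can therefore collide, which appears to be why the sequence list now stores and compares the first byte as well.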

Patch

diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
index a39a94f..8f4bec8 100644
--- a/locale/programs/ld-collate.c
+++ b/locale/programs/ld-collate.c
@@ -244,9 +244,9 @@  struct locale_collate_t
      Therefore we keep all relevant input in a list.  */
   struct locale_collate_t *next;

-  /* Arrays with heads of the list for each of the leading bytes in
+  /* Arrays with heads of the list for the leading bytes in
      the multibyte sequences.  */
-  struct element_t *mbheads[256];
+  struct element_t *mbheads[256 * 256];

   /* Arrays with heads of the list for each of the leading bytes in
      the multibyte sequences.  */
@@ -1558,6 +1558,7 @@  collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
   struct section_list *sect;
   int ruleidx;
   int nr_wide_elems = 0;
+  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;

   if (collate == NULL)
     {
@@ -1664,7 +1665,49 @@  collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
 	  struct element_t *lastp = NULL;

 	  /* Find the point where to insert in the list.  */
-	  eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]];
+	  uint16_t index = ((unsigned char *) runp->mbs)[0];
+
+	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.
+	     Also check the UTF-8 encoding.  Keep locale/weight.h in sync.  */
+	  if (is_utf8 && index >= 0x80)
+	    {
+	      if ((index & 0xC0) == 0x80)
+		{
+		utf8_error:
+		  WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line,
+						  _("\
+malformed UTF-8 character in `%s'"), runp->name););
+		  goto dont_insert;
+		}
+	      else if (index < 0xE0)
+		{
+		  if (runp->nmbs < 2)
+		    goto utf8_error;
+		  uint16_t byte2 = ((unsigned char *) runp->mbs)[1];
+		  index = (index << 6) + byte2 - 0x3080;
+		}
+	      else if (index < 0xF0)
+		{
+		  if (runp->nmbs < 3)
+		    goto utf8_error;
+		  uint16_t byte2 = ((unsigned char *) runp->mbs)[1];
+		  uint16_t byte3 = ((unsigned char *) runp->mbs)[2];
+		  index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
+		}
+	      else if (index < 0xF8)
+		{
+		  if (runp->nmbs < 4)
+		    goto utf8_error;
+		  uint16_t byte2 = ((unsigned char *) runp->mbs)[1];
+		  uint16_t byte3 = ((unsigned char *) runp->mbs)[2];
+		  uint16_t byte4 = ((unsigned char *) runp->mbs)[3];
+		  index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;
+		}
+	      else
+		goto utf8_error;
+	    }
+
+	  eptr = &collate->mbheads[index];
 	  while (*eptr != NULL)
 	    {
 	      if ((*eptr)->nmbs < runp->nmbs)
@@ -1735,7 +1778,7 @@  symbol `%s' has the same encoding as"), (*eptr)->name);

   /* Find out whether any of the `mbheads' entries is unset.  In this
      case we use the UNDEFINED entry.  */
-  for (i = 1; i < 256; ++i)
+  for (i = 1; i < 256 * 256; ++i)
     if (collate->mbheads[i] == NULL)
       {
 	need_undefined = 1;
@@ -2108,7 +2151,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE);
   struct locale_file file;
   size_t ch;
-  int32_t tablemb[256];
+  int32_t tablemb[256 * 256];
   struct obstack weightpool;
   struct obstack extrapool;
   struct obstack indirectpool;
@@ -2186,7 +2229,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   if (collate->undefined.used_in_level != 0)
     output_weight (&weightpool, collate, &collate->undefined);

-  for (ch = 1; ch < 256; ++ch)
+  for (ch = 1; ch < 256 * 256; ++ch)
     if (collate->mbheads[ch]->mbnext == NULL
 	&& collate->mbheads[ch]->nmbs <= 1)
       {
@@ -2211,7 +2254,6 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	   and add only one index into the weight table.  We can find the
 	   consecutive entries since they are also consecutive in the list.  */
 	struct element_t *runp = collate->mbheads[ch];
-	struct element_t *lastp;

 	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));

@@ -2239,7 +2281,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,

 		/* Compute how much space we will need.  */
 		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
-					  + 2 * (runp->nmbs - 1));
+					  + 2 * runp->nmbs);
 		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
 		obstack_make_room (&extrapool, added);

@@ -2262,9 +2304,9 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 		/* Now walk backward from here to the beginning.  */
 		curp = runp;

-		assert (runp->nmbs <= 256);
-		obstack_1grow_fast (&extrapool, curp->nmbs - 1);
-		for (i = 1; i < curp->nmbs; ++i)
+		assert (runp->nmbs <= 255);
+		obstack_1grow_fast (&extrapool, curp->nmbs);
+		for (i = 0; i < curp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, curp->mbs[i]);

 		/* Now find the end of the consecutive sequence and
@@ -2284,7 +2326,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,

 		/* And add the end byte sequence.  Without length this
 		   time.  */
-		for (i = 1; i < curp->nmbs; ++i)
+		for (i = 0; i < curp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
 	      }
 	    else
@@ -2298,15 +2340,15 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 		weightidx = output_weight (&weightpool, collate, runp);

 		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
-					  + runp->nmbs - 1);
+					  + runp->nmbs);
 		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
 		obstack_make_room (&extrapool, added);

 		obstack_int32_grow_fast (&extrapool, weightidx);
-		assert (runp->nmbs <= 256);
-		obstack_1grow_fast (&extrapool, runp->nmbs - 1);
+		assert (runp->nmbs <= 255);
+		obstack_1grow_fast (&extrapool, runp->nmbs);

-		for (i = 1; i < runp->nmbs; ++i)
+		for (i = 0; i < runp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, runp->mbs[i]);
 	      }

@@ -2315,30 +2357,25 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	      obstack_1grow_fast (&extrapool, '\0');

 	    /* Next entry.  */
-	    lastp = runp;
 	    runp = runp->mbnext;
 	  }
 	while (runp != NULL);

 	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));

-	/* If the final entry in the list is not a single character we
-	   add an UNDEFINED entry here.  */
-	if (lastp->nmbs != 1)
-	  {
-	    int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
-	    obstack_make_room (&extrapool, added);
+	/* Add an UNDEFINED entry at the end of the list.  */
+	int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
+	obstack_make_room (&extrapool, added);

-	    obstack_int32_grow_fast (&extrapool, 0);
-	    /* XXX What rule? We just pick the first.  */
-	    obstack_1grow_fast (&extrapool, 0);
-	    /* Length is zero.  */
-	    obstack_1grow_fast (&extrapool, 0);
+	obstack_int32_grow_fast (&extrapool, 0);
+	/* XXX What rule? We just pick the first.  */
+	obstack_1grow_fast (&extrapool, 0);
+	/* Length is zero.  */
+	obstack_1grow_fast (&extrapool, 0);

-	    /* Add alignment bytes if necessary.  */
-	    while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
-	      obstack_1grow_fast (&extrapool, '\0');
-	  }
+	/* Add alignment bytes if necessary.  */
+	while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
+	  obstack_1grow_fast (&extrapool, '\0');
       }

   /* Add padding to the tables if necessary.  */
@@ -2346,7 +2383,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
     obstack_1grow (&weightpool, 0);

   /* Now add the four tables.  */
-  add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256);
+  add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256 * 256);
   add_locale_raw_obstack (&file, &weightpool);
   add_locale_raw_obstack (&file, &extrapool);
   add_locale_raw_obstack (&file, &indirectpool);
diff --git a/locale/weight.h b/locale/weight.h
index 721bf7d..d9e63ac 100644
--- a/locale/weight.h
+++ b/locale/weight.h
@@ -21,24 +21,65 @@ 

 /* Find index of weight.  */
 static inline int32_t __attribute__ ((always_inline))
-findidx (const int32_t *table,
+findidx (uint_fast32_t locale_encoding,
+	 const int32_t *table,
 	 const int32_t *indirect,
 	 const unsigned char *extra,
 	 const unsigned char **cpp, size_t len)
 {
-  int_fast32_t i = table[*(*cpp)++];
   const unsigned char *cp;
   const unsigned char *usrc;
+  uint16_t index = (*cpp)[0];

+  /* Special handling of UTF-8: Generate a 2-byte index for table.
+     This has to be equal to the folding in locale/programs/ld-collate.c:
+     collate_finish().  */
+  if (locale_encoding == __cet_utf8 && index >= 0x80)
+    {
+      if (index < 0xE0)
+	{
+	  if (len < 2)
+	    goto utf8_error;
+	  uint16_t byte2 = (*cpp)[1];
+	  index = (index << 6) + byte2 - 0x3080;
+	}
+      else if (index < 0xF0)
+	{
+	  if (len < 3)
+	    goto utf8_error;
+	  uint16_t byte2 = (*cpp)[1];
+	  uint16_t byte3 = (*cpp)[2];
+	  index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
+	}
+      else if (index < 0xF8)
+	{
+	  if (len < 4)
+	    goto utf8_error;
+	  uint16_t byte2 = (*cpp)[1];
+	  uint16_t byte3 = (*cpp)[2];
+	  uint16_t byte4 = (*cpp)[3];
+	  index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;
+	}
+      else
+	{
+	utf8_error:
+	  *cpp += 1;
+	  return 0;
+	}
+    }
+
+  int_fast32_t i = table[index];
   if (i >= 0)
-    /* This is an index into the weight table.  Cool.  */
-    return i;
+    {
+      /* This is an index into the weight table.  Cool.  */
+      *cpp += 1;
+      return i;
+    }

   /* Oh well, more than one sequence starting with this byte.
      Search for the correct one.  */
   cp = &extra[-i];
   usrc = *cpp;
-  --len;
   while (1)
     {
       size_t nhere;
@@ -57,8 +98,7 @@  findidx (const int32_t *table,
 	  /* It is a single character.  If it matches we found our
 	     index.  Note that at the end of each list there is an
 	     entry of length zero which represents the single byte
-	     sequence.  The first (and here only) byte was tested
-	     already.  */
+	     sequence.  */
 	  size_t cnt;

 	  for (cnt = 0; cnt < nhere && cnt < len; ++cnt)
@@ -68,7 +108,7 @@  findidx (const int32_t *table,
 	  if (cnt == nhere)
 	    {
 	      /* Found it.  */
-	      *cpp += nhere;
+	      *cpp += nhere > 0 ? nhere : 1;
 	      return i;
 	    }

@@ -127,7 +167,7 @@  findidx (const int32_t *table,
 	      while (++cnt < nhere);
 	    }

-	  *cpp += nhere;
+	  *cpp += nhere > 0 ? nhere : 1;
 	  return indirect[-i + offset];
 	}
     }
diff --git a/locale/weightwc.h b/locale/weightwc.h
index 3cd7a69..3781d0d 100644
--- a/locale/weightwc.h
+++ b/locale/weightwc.h
@@ -21,7 +21,8 @@ 

 /* Find index of weight.  */
 static inline int32_t __attribute__ ((always_inline))
-findidx (const int32_t *table,
+findidx (uint_fast32_t encoding,
+	 const int32_t *table,
 	 const int32_t *indirect,
 	 const wint_t *extra,
 	 const wint_t **cpp, size_t len)
diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
index f46c9df..db576fe 100644
--- a/posix/fnmatch_loop.c
+++ b/posix/fnmatch_loop.c
@@ -389,6 +389,8 @@  FCT (pattern, string, string_end, no_leading_period, flags, ends, alloca_used)
 			const int32_t *indirect;
 			int32_t idx;
 			const UCHAR *cp = (const UCHAR *) &str;
+			uint_fast32_t encoding =
+			  _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);

 # if WIDE_CHAR_VERSION
 			table = (const int32_t *)
@@ -410,7 +412,7 @@  FCT (pattern, string, string_end, no_leading_period, flags, ends, alloca_used)
 			  _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
 # endif

-			idx = FINDIDX (table, indirect, extra, &cp, 1);
+			idx = FINDIDX (encoding, table, indirect, extra, &cp, 1);
 			if (idx != 0)
 			  {
 			    /* We found a table entry.  Now see whether the
@@ -420,7 +422,7 @@  FCT (pattern, string, string_end, no_leading_period, flags, ends, alloca_used)
 			    int32_t idx2;
 			    const UCHAR *np = (const UCHAR *) n;

-			    idx2 = FINDIDX (table, indirect, extra,
+			    idx2 = FINDIDX (encoding, table, indirect, extra,
 					    &np, string_end - n);
 			    if (idx2 != 0
 				&& (idx >> 24) == (idx2 >> 24)
diff --git a/posix/regcomp.c b/posix/regcomp.c
index bf8aa16..afd9c4c 100644
--- a/posix/regcomp.c
+++ b/posix/regcomp.c
@@ -3426,6 +3426,7 @@  build_equiv_class (bitset_t sbcset, const unsigned char *name)
   uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
   if (nrules != 0)
     {
+      uint_fast32_t encoding;
       const int32_t *table, *indirect;
       const unsigned char *weights, *extra, *cp;
       unsigned char char_buf[2];
@@ -3434,6 +3435,7 @@  build_equiv_class (bitset_t sbcset, const unsigned char *name)
       size_t len;
       /* Calculate the index for equivalence class.  */
       cp = name;
+      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
       table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
       weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE,
 					       _NL_COLLATE_WEIGHTMB);
@@ -3441,7 +3443,7 @@  build_equiv_class (bitset_t sbcset, const unsigned char *name)
 						   _NL_COLLATE_EXTRAMB);
       indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
 						_NL_COLLATE_INDIRECTMB);
-      idx1 = findidx (table, indirect, extra, &cp, -1);
+      idx1 = findidx (encoding, table, indirect, extra, &cp, -1);
       if (BE (idx1 == 0 || *cp != '\0', 0))
 	/* This isn't a valid character.  */
 	return REG_ECOLLATE;
@@ -3452,7 +3454,7 @@  build_equiv_class (bitset_t sbcset, const unsigned char *name)
 	{
 	  char_buf[0] = ch;
 	  cp = char_buf;
-	  idx2 = findidx (table, indirect, extra, &cp, 1);
+	  idx2 = findidx (encoding, table, indirect, extra, &cp, 1);
 /*
 	  idx2 = table[ch];
 */
diff --git a/posix/regex_internal.h b/posix/regex_internal.h
index 154e969..42e43fa 100644
--- a/posix/regex_internal.h
+++ b/posix/regex_internal.h
@@ -743,17 +743,19 @@  re_string_elem_size_at (const re_string_t *pstr, int idx)
 #  ifdef _LIBC
   const unsigned char *p, *extra;
   const int32_t *table, *indirect;
+  uint_fast32_t encoding;
   uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);

   if (nrules != 0)
     {
+      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
       table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
       extra = (const unsigned char *)
 	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
       indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
 						_NL_COLLATE_INDIRECTMB);
       p = pstr->mbs + idx;
-      findidx (table, indirect, extra, &p, pstr->len - idx);
+      findidx (encoding, table, indirect, extra, &p, pstr->len - idx);
       return p - pstr->mbs - idx;
     }
   else
diff --git a/posix/regexec.c b/posix/regexec.c
index 70cd606..798eb51 100644
--- a/posix/regexec.c
+++ b/posix/regexec.c
@@ -3869,6 +3869,7 @@  check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
       if (nrules != 0)
 	{
 	  unsigned int in_collseq = 0;
+	  uint_fast32_t encoding;
 	  const int32_t *table, *indirect;
 	  const unsigned char *weights, *extra;
 	  const char *collseqwc;
@@ -3919,6 +3920,8 @@  check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
 	  if (cset->nequiv_classes)
 	    {
 	      const unsigned char *cp = pin;
+	      encoding =
+		_NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
 	      table = (const int32_t *)
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
 	      weights = (const unsigned char *)
@@ -3927,7 +3930,8 @@  check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
 	      indirect = (const int32_t *)
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
-	      int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
+	      int32_t idx = findidx (encoding, table, indirect, extra, &cp,
+				     elem_len);
 	      if (idx > 0)
 		for (i = 0; i < cset->nequiv_classes; ++i)
 		  {
diff --git a/string/strcoll_l.c b/string/strcoll_l.c
index 8f1225f..668ea9d 100644
--- a/string/strcoll_l.c
+++ b/string/strcoll_l.c
@@ -78,9 +78,9 @@  typedef struct
 /* Get next sequence.  Traverse the string as required.  */
 static __always_inline void
 get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
-	      const USTRING_TYPE *weights, const int32_t *table,
-	      const USTRING_TYPE *extra, const int32_t *indirect,
-	      int pass)
+	      const USTRING_TYPE *weights, uint_fast32_t encoding,
+	      const int32_t *table, const USTRING_TYPE *extra,
+	      const int32_t *indirect, int pass)
 {
   size_t val = seq->val = 0;
   int len = seq->len;
@@ -124,7 +124,7 @@  get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
 	      us = seq->back_us;
 	      while (i < backw)
 		{
-		  int32_t tmp = findidx (table, indirect, extra, &us, -1);
+		  int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
 		  idx = tmp & 0xffffff;
 		  i++;
 		}
@@ -139,7 +139,7 @@  get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,

 	  while (*us != L('\0'))
 	    {
-	      int32_t tmp = findidx (table, indirect, extra, &us, -1);
+	      int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
 	      unsigned char rule = tmp >> 24;
 	      prev_idx = idx;
 	      idx = tmp & 0xffffff;
@@ -345,9 +345,9 @@  STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)

       while (1)
 	{
-	  get_next_seq (&seq1, nrules, rulesets, weights, table,
+	  get_next_seq (&seq1, nrules, rulesets, weights, encoding, table,
 				    extra, indirect, pass);
-	  get_next_seq (&seq2, nrules, rulesets, weights, table,
+	  get_next_seq (&seq2, nrules, rulesets, weights, encoding, table,
 				    extra, indirect, pass);
 	  /* See whether any or both strings are empty.  */
 	  if (seq1.len == 0 || seq2.len == 0)
diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c
index 8b61ea2..95abc4e 100644
--- a/string/strxfrm_l.c
+++ b/string/strxfrm_l.c
@@ -53,6 +53,7 @@  typedef struct
   uint_fast32_t nrules;
   unsigned char *rulesets;
   USTRING_TYPE *weights;
+  uint_fast32_t encoding;
   int32_t *table;
   USTRING_TYPE *extra;
   int32_t *indirect;
@@ -100,8 +101,8 @@  static __always_inline size_t
 find_idx (const USTRING_TYPE **us, int32_t *weight_idx,
 	  unsigned char *rule_idx, const locale_data_t *l_data, const int pass)
 {
-  int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us,
-			 -1);
+  int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect,
+			 l_data->extra, us, -1);
   *rule_idx = tmp >> 24;
   int32_t idx = tmp & 0xffffff;
   size_t len = l_data->weights[idx++];
@@ -693,6 +694,8 @@  STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
   /* Get the locale data.  */
   l_data.rulesets = (unsigned char *)
     current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string;
+  l_data.encoding =
+    current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
   l_data.table = (int32_t *)
     current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string;
   l_data.weights = (USTRING_TYPE *)
@@ -721,8 +724,8 @@  STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)

   do
     {
-      int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur,
-			     -1);
+      int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect,
+			     l_data.extra, &cur, -1);
       rulearr[idxmax] = tmp >> 24;
       idxarr[idxmax] = tmp & 0xffffff;
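
The list-scanning side of the change (full sequences stored in the extra pool, a zero-length terminator at the end of every list, and the "nhere > 0 ? nhere : 1" advance) can be illustrated with a deliberately simplified model. Everything named here is hypothetical: struct seq_entry and lookup_bucket do not exist in glibc, and the real EXTRAMB pool is a packed byte stream with alignment padding and range entries rather than an array of structs. The sketch only shows the matching rule, assuming one bucket shared by sequences whose keys collide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified bucket entry.  NBYTES is the full sequence
   length including the first byte; an entry with NBYTES == 0 terminates
   the list and stands for the UNDEFINED fallback.  */
struct seq_entry
{
  int weight_idx;
  size_t nbytes;
  unsigned char bytes[4];
};

/* Scan BUCKET for the entry whose bytes match the input at *CPP, of
   which LEN bytes are readable.  Because the bucket key is no longer the
   first byte, the comparison starts at byte 0.  On a match *CPP moves
   past the sequence; on the terminator it moves one byte forward.  */
static int
lookup_bucket (const struct seq_entry *bucket,
               const unsigned char **cpp, size_t len)
{
  assert (bucket != NULL && cpp != NULL && *cpp != NULL);
  for (;; ++bucket)
    {
      size_t nhere = bucket->nbytes;
      if (nhere <= len && memcmp (bucket->bytes, *cpp, nhere) == 0)
        {
          /* A zero-length entry always matches and ends the scan.  */
          *cpp += nhere > 0 ? nhere : 1;
          return bucket->weight_idx;
        }
    }
}
```

With a bucket holding 'A' and the colliding U+10041, a lookup of either sequence finds its own entry, while another colliding input that the locale does not define (for example U+20041, bytes F0 A0 81 81) falls through to the terminator and yields weight index 0.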