[V4,BZ,#18441] fix sorting multibyte charsets with an improper locale

Message ID 56D3F8F0.8070401@web.de
State New, archived
Commit Message

Leonhard Holz Feb. 29, 2016, 7:53 a.m. UTC
  In BZ #18441 sorting a Thai text with the en_US.UTF-8 locale causes a performance
regression. The cause of the problem is that

a) en_US.UTF-8 has no information for Thai chars and so always reports a zero
sort weight, which causes the comparison to check the whole string instead of
breaking off early, and

b) the sequence-to-weight list is partitioned by the first byte of the first
character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
they tend to have an equal starting byte (e.g. all Thai chars start with E0).

The approach of the patch is to interpret TABLEMB as a hashtable and find a
better hash key. My first try was to somehow "fold" a multibyte character into one
byte but that worsened the overall performance a lot. Extending the table to 2-byte
keys works much better while needing a reasonable amount of extra memory.

The patch vastly improves the performance of languages with multibyte chars (see
zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
get a bit slower because of the extra check for the first byte while finding the right
sequence in the sequence list. This cannot be avoided since the hash key is no
longer equal to the first byte of the sequence. Tests are ok.
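
To illustrate why the first byte is a poor key for multibyte text, here is a minimal sketch (not code from the patch). It counts the distinct bucket keys for three Thai letters, once keyed by the first byte and once by a 16-bit value derived from the whole 3-byte sequence. The helper names `key_first_byte`, `key_two_byte` and `count_distinct` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Three Thai letters (U+0E01, U+0E02, U+0E04) encoded as UTF-8.  */
static const unsigned char thai[3][3] = {
  { 0xE0, 0xB8, 0x81 },
  { 0xE0, 0xB8, 0x82 },
  { 0xE0, 0xB8, 0x84 },
};

/* Hypothetical helper: key is only the first byte of the sequence.  */
static unsigned int
key_first_byte (const unsigned char *mbs)
{
  assert (mbs != NULL);
  return mbs[0];
}

/* Hypothetical helper: 16-bit key decoded from a 3-byte UTF-8 sequence.  */
static unsigned int
key_two_byte (const unsigned char *mbs)
{
  assert (mbs != NULL);
  return ((mbs[0] & 0x0Fu) << 12) | ((mbs[1] & 0x3Fu) << 6) | (mbs[2] & 0x3Fu);
}

/* Count how many distinct keys the N sequences in SEQS produce.  */
static size_t
count_distinct (const unsigned char (*seqs)[3], size_t n,
		unsigned int (*key) (const unsigned char *))
{
  size_t distinct = 0;
  for (size_t i = 0; i < n; ++i)
    {
      size_t j;
      for (j = 0; j < i; ++j)
	if (key (seqs[j]) == key (seqs[i]))
	  break;
      if (j == i)
	++distinct;
    }
  return distinct;
}
```

With the first-byte key all three letters share one bucket (one long list); with the 16-bit key each letter gets its own bucket.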

filelist#C			  1.75%		23,396,200	23,805,700
filelist#en_US.UTF-8		  1.42%		77,186,200	78,285,200
lorem_ipsum#vi_VN.UTF-8		 -1.70%		1,680,740	1,652,110
lorem_ipsum#ar_SA.UTF-8		 -7.71%		2,134,780	1,970,170
lorem_ipsum#en_US.UTF-8	 	  2.61%		1,685,120	1,729,160
lorem_ipsum#zh_CN.UTF-8		-88.66%		806,176		91,423
lorem_ipsum#cs_CZ.UTF-8		 -4.89%		2,150,120	2,045,030
lorem_ipsum#en_GB.UTF-8		 -1.47%		2,061,960	2,031,620
lorem_ipsum#da_DK.UTF-8		  3.15%		1,703,710	1,757,390
lorem_ipsum#pl_PL.UTF-8		  0.86%		1,634,890	1,648,870
lorem_ipsum#fr_FR.UTF-8		 -2.06%		2,232,030	2,186,030
lorem_ipsum#pt_PT.UTF-8		 -2.60%		2,238,410	2,180,210
lorem_ipsum#el_GR.UTF-8		-34.52%		3,413,330	2,235,010
lorem_ipsum#ru_RU.UTF-8		 -9.88%		2,403,370	2,165,950
lorem_ipsum#iw_IL.UTF-8		 -9.56%		2,209,740	1,998,500
lorem_ipsum#es_ES.UTF-8	 	  4.92%		1,983,470	2,081,050
lorem_ipsum#hi_IN.UTF-8		-98.88%		220,453,000	2,458,620
lorem_ipsum#sv_SE.UTF-8		  1.79%		1,645,370	1,674,760
lorem_ipsum#hu_HU.UTF-8		  4.86%		3,179,620	3,334,290
lorem_ipsum#tr_TR.UTF-8		-23.59%		2,473,330	1,889,870
lorem_ipsum#is_IS.UTF-8		  2.49%		1,620,370	1,660,680
lorem_ipsum#it_IT.UTF-8		 -2.67%		2,186,160	2,127,710
lorem_ipsum#sr_RS.UTF-8		  2.70%		1,930,520	1,982,720
lorem_ipsum#ja_JP.UTF-8		-97.43%		958,411		24,664
wikipedia-th#en_US.UTF-8	-99.61%		10,511,700,000	40,577,100

The performance numbers and the size of the patch changed due to the removal of the strdiff optimization (#18589) and
the included Thai test. Performance degradation for locales in the ASCII plane is still minor. The patch greatly increases the speed
of strcoll for all languages that mostly use multibyte UTF-8 encoding. Note that it should affect the regex
performance of these languages too, though there is no benchmark for that.

Regarding Carlos' comments:

>> +  struct element_t *mbheads[256 * 256];
>
> Use #define MBHEADS_SZ or something similar.

Ok.

>> +  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;
>
> OK.
>
> Will this always work? I'm just wondering about a user generated charmap that they
> call 'utf8', which is the other common alias for instance where the dash is not valid
> syntax. Probably not since the official name is UTF-8, and that's what you should use.

Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".

>> +	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.
>> +	     Also check the UTF-8 encoding.  Keep locale/weight.h in sync.  */
>
> Not OK. Can we refactor to avoid keeping the two in sync?

Ok, there is a new function utf8index in locale/weight.h that does the job.

>> @@ -2239,7 +2281,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>>
>>  		/* Compute how much space we will need.  */
>>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
>> -					  + 2 * (runp->nmbs - 1));
>> +					  + 2 * runp->nmbs);
>
> Doesn't the change to zero indexing make the conditional in the code above this wrong?
>
> e.g.
> 2230             if (runp->mbnext != NULL
> 2231                 && runp->nmbs == runp->mbnext->nmbs
> 2232                 && memcmp (runp->mbs, runp->mbnext->mbs, runp->nmbs - 1) == 0
> 2233                 && (runp->mbs[runp->nmbs - 1]
> 2234                     == runp->mbnext->mbs[runp->nmbs - 1] + 1))

No. runp traverses through the input / locale definition file and this is not affected by the change. What happens here
is a check if the next unicode literal has the same byte sequence as the current except for the last byte, which should
be 1 higher than the last byte of the current literal -> beginning of a sequence.
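
The check described in words above can be sketched as a standalone predicate. This is only an illustration of the prose description; `is_successor` is a hypothetical name, and which list neighbour plays the role of `lo` or `hi` in ld-collate.c is not asserted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if HI has the same length and bytes as LO, except that
   its last byte is exactly one higher than the last byte of LO.  */
static bool
is_successor (const unsigned char *lo, size_t lo_len,
	      const unsigned char *hi, size_t hi_len)
{
  assert (lo != NULL && hi != NULL);
  if (lo_len == 0 || lo_len != hi_len)
    return false;
  return (memcmp (lo, hi, lo_len - 1) == 0
	  && hi[hi_len - 1] == lo[lo_len - 1] + 1);
}
```

For example, E0 B8 81 followed by E0 B8 82 continues a run, while E0 B8 81 followed by E0 B8 84 does not.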


	* benchtests/bench-strcoll.c: Add thai text with en_US.UTF-8 locale.
	* benchtests/strcoll-inputs/wikipedia-th#en_US.UTF-8: New file.
	* locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE.
	* locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list.
	* locale/localeinfo.h: Add enum collation_encoding_type.
	* locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit.
	* locale/programs/ld-collate.c (struct locale_collate_t):
	Expand mbheads array from 256 to 65536 entries.
	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
	(collate_output): Output larger table and sequences including first byte.
	(collate_output): Add encoding type info.
	* locale/weight.h (utf8index): New function to calculate 2 byte index.
	(findidx): Use 2-byte index for table if UTF-8 locale.
	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
	* posix/regcomp.c (build_equiv_class): Likewise.
	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
	* posix/regexec.c (check_node_accept_bytes): Likewise.
	* string/strcoll_l.c (get_next_seq): Likewise.
	(STRCOLL): Call get_next_seq with encoding parameter.
	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
	(STRXFRM): Call find_idx with encoding parameter.
เนบิวลาปู เป็นซากซูเปอร์โนวาและเนบิวลาลมพัลซาร์ในกลุ่มดาววัว
เนบิวลานี้ได้รับการสังเกตโดยจอห์น เบวิส ในปี พ.ศ. 2274
ซึ่งสอดคล้องกับการบันทึกเหตุการณ์ซูเปอร์โนวาสว่างโดยนักดาราศาสตร์ชาวจีนและชาวอาหรับใน
พ.ศ. 1597 ที่ระดับรังสีเอกซ์และรังสีแกมมาสูงกว่า 30 กิโลอิเล็กตรอนโวลต์
เนบิวลาปูเป็นแหล่งพลังงานที่เข้มที่สุดบนท้องฟ้ามาอย่างต่อเนื่อง โดยสามารถวัดฟลักซ์ได้ถึงสูงกว่า
1012 อิเล็กตรอนโวลต์ เนบิวลาปูตั้งอยู่ห่างจากโลก 6,500 ปีแสง (2 กิโลพาร์เซก)
มีเส้นผ่านศูนย์กลาง 11 ปีแสง (3.4 พาร์เซก) และขยายตัวในอัตรา 1,500 กิโลเมตรต่อวินาที
ณ ใจกลางเนบิวลาปูเป็นที่อยู่ของพัลซาร์ปู ดาวนิวตรอนขนาดเส้นผ่านศูนย์กลาง 28-30 กิโลเมตร
ซึ่งปลดปล่อยรังสีตั้งแต่รังสีแกมมาไปจนถึงคลื่นวิทยุด้วยอัตราการหมุน 30.2 รอบต่อวินาที
เนบิวลาปูเป็นวัตถุทางดาราศาสตร์วัตถุแรกที่สามารถระบุได้จากการระเบิดซูเปอร์โนวาในประวัติศาสตร์
เนบิวลานี้ทำตัวเสมือนหนึ่งแหล่งกำเนิดรังสีสำหรับการศึกษาเทห์ฟากฟ้าที่เคลื่อนผ่านตัวมัน
ในช่วงปีพ.ศ. 2493 และ 2512
มีการทำแผนภูมิโคโรนาของดวงอาทิตย์ขึ้นจากการเฝ้าสังเกตคลื่นวิทยุจากเนบิวลาปูที่ผ่านชั้นโคโรนาไป
และในปี พ.ศ. 2546 เราสามารถวัดความหนาของชั้นบรรยากาศของดวงจันทร์ไททัน
ดาวบริวารของดาวเสาร์ได้จากการที่ชั้นบรรยากาศนี้กีดขวางรังสีเอกซ์จากเนบิวลา (อ่านต่อ...)
ฌอร์ฌ เลอแม็ทร์ นักวิทยาศาสตร์และพระโรมันคาทอลิก เป็นผู้เสนอแนวคิดการกำเนิดของเอกภพ
ซึ่งต่อมารู้จักกันในชื่อ ทฤษฎีบิกแบง ในเบื้องแรกเขาเรียกทฤษฎีนี้ว่า
สมมติฐานเกี่ยวกับอะตอมแรกเริ่ม (hypothesis of the primeval atom) อเล็กซานเดอร์
ฟรีดแมน
ทำการคำนวณแบบจำลองโดยมีกรอบการพิจารณาอยู่บนพื้นฐานของทฤษฎีสัมพัทธภาพทั่วไปของอัลเบิร์ต
ไอน์สไตน์ ต่อมาในปี ค.ศ. 1929 เอ็ดวิน ฮับเบิลค้นพบว่า
ระยะห่างของดาราจักรมีสัดส่วนที่เปลี่ยนแปลงสัมพันธ์กับการเคลื่อนไปทางแดง
การสังเกตการณ์นี้บ่งชี้ว่า ดาราจักรและกระจุกดาวอันห่างไกลกำลังเคลื่อนที่ออกจากจุดสังเกต
ซึ่งหมายความว่าเอกภพกำลังขยายตัว ยิ่งตำแหน่งดาราจักรไกลยิ่งขึ้น
ความเร็วปรากฏก็ยิ่งเพิ่มมากขึ้น หากเอกภพในปัจจุบันกำลังขยายตัว แสดงว่าก่อนหน้านี้
เอกภพย่อมมีขนาดเล็กกว่า หนาแน่นกว่า และร้อนกว่าที่เป็นอยู่
แนวคิดนี้มีการพิจารณาอย่างละเอียดย้อนไปจนถึงระดับความหนาแน่นและอุณหภูมิที่จุดสูงสุด
และผลสรุปที่ได้ก็สอดคล้องอย่างยิ่งกับผลจากการสังเกตการณ์
ทว่าการเพิ่มของอัตราเร่งมีข้อจำกัดในการตรวจสอบสภาวะพลังงานที่สูงขนาดนั้น
หากไม่มีข้อมูลอื่นที่ช่วยยืนยันสภาวะเริ่มต้นชั่วขณะก่อนการระเบิด
ลำพังทฤษฎีบิกแบงก็ยังไม่สามารถใช้อธิบายสภาวะเริ่มต้นได้
มันเพียงอธิบายกระบวนการเปลี่ยนแปลงของเอกภพที่เกิดขึ้นหลังจากสภาวะเริ่มต้นเท่านั้น
(อ่านต่อ...)
  

Comments

Carlos O'Donell March 29, 2016, 4:12 p.m. UTC | #1
On 02/29/2016 02:53 AM, Leonhard Holz wrote:
> In BZ #18441 sorting a Thai text with the en_US.UTF-8 locale causes a performance
> regression. The cause of the problem is that

I think we need a V5 with a few more changes as noted below:
- More comments.
- Change signature of utf8index.
- Probably one more define for magic 255 constant.

> a) en_US.UTF-8 has no information for Thai chars and so always reports a zero
> sort weight, which causes the comparison to check the whole string instead of
> breaking off early, and

OK.
 
> b) the sequence-to-weight list is partitioned by the first byte of the first
> character (TABLEMB); this generates long lists for multibyte UTF-8 characters as
> they tend to have an equal starting byte (e.g. all Thai chars start with E0).

Agreed.

> The approach of the patch is to interpret TABLEMB as a hashtable and find a
> better hash key. My first try was to somehow "fold" a multibyte character into one
> byte but that worsened the overall performance a lot. Extending the table to 2-byte
> keys works much better while needing a reasonable amount of extra memory.

How much memory on average?

Worst case is all possible 2-byte hash values?

e.g. 256*256*(sizeof(struct element_t)) == ~10MB?

Best case is the cost of the NULL pointer, so 256*256*8 = 0.5MB?
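
The best-case figure can be checked with a few lines of arithmetic. This is a sketch under assumptions: the helper names are made up, the pointer size is passed in (8 on a typical 64-bit host), and the worst case is left out because sizeof (struct element_t) is not shown in this thread. The second helper covers the on-disk table, which the diff declares as int32_t tablemb[MBHEADS_SZ].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MBHEADS_SZ (256 * 256)

/* Bytes taken by the array of list heads in localedef, given the
   size of one pointer on the host.  */
static size_t
mbheads_bytes (size_t pointer_size)
{
  assert (pointer_size > 0);
  return (size_t) MBHEADS_SZ * pointer_size;
}

/* Bytes taken by the table of int32_t entries written to the
   locale file.  */
static size_t
tablemb_bytes (void)
{
  return (size_t) MBHEADS_SZ * sizeof (int32_t);
}
```

With 8-byte pointers the head array is 524288 bytes (0.5 MiB), and the int32_t table is 262144 bytes (256 KiB).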

> The patch vastly improves the performance of languages with multibyte chars (see
> zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars
> get a bit slower because of the extra check for the first byte while finding the right
> sequence in the sequence list. This cannot be avoided since the hash key is no
> longer equal to the first byte of the sequence. Tests are ok.

Can we use UTF-8-specific knowledge to accelerate the lookup?

For example, you know that E0 is always the start of a 3-byte UTF-8 sequence.

Could you do two checks?

(a) Are we in a 1, 2, 3, 4, 5, or 6 byte sequence?
(b) If in a 1 byte sequence use a one-byte table.
(c) If in a 2-6 byte sequence use the hash-table?
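
A rough sketch of what checks (a) to (c) could look like, not a proposal for the final code. It classifies the lead byte using the original 1 to 6 byte UTF-8 scheme mentioned above (current UTF-8 stops at 4 bytes). Both function names are hypothetical.

```c
#include <stdbool.h>

/* Return the sequence length implied by a UTF-8 lead byte, or 0 if
   the byte cannot start a sequence (continuation byte, 0xFE, 0xFF).  */
static int
utf8_seq_len (unsigned char lead)
{
  if (lead < 0x80)
    return 1;
  if (lead < 0xC0)
    return 0;
  if (lead < 0xE0)
    return 2;
  if (lead < 0xF0)
    return 3;
  if (lead < 0xF8)
    return 4;
  if (lead < 0xFC)
    return 5;
  if (lead < 0xFE)
    return 6;
  return 0;
}

/* Hypothetical dispatch: 1-byte sequences would use the one-byte
   table, everything else the hash table.  */
static bool
use_one_byte_table (unsigned char lead)
{
  return utf8_seq_len (lead) == 1;
}
```

Here 0xE0 is classified as the start of a 3-byte sequence, and only bytes below 0x80 would take the one-byte table path.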


> filelist#C			  1.75%		23,396,200	23,805,700
> filelist#en_US.UTF-8		  1.42%		77,186,200	78,285,200
> lorem_ipsum#vi_VN.UTF-8		 -1.70%		1,680,740	1,652,110
> lorem_ipsum#ar_SA.UTF-8		 -7.71%		2,134,780	1,970,170
> lorem_ipsum#en_US.UTF-8	 	  2.61%		1,685,120	1,729,160
> lorem_ipsum#zh_CN.UTF-8		-88.66%		806,176		91,423
> lorem_ipsum#cs_CZ.UTF-8		 -4.89%		2,150,120	2,045,030
> lorem_ipsum#en_GB.UTF-8		 -1.47%		2,061,960	2,031,620
> lorem_ipsum#da_DK.UTF-8		  3.15%		1,703,710	1,757,390
> lorem_ipsum#pl_PL.UTF-8		  0.86%		1,634,890	1,648,870
> lorem_ipsum#fr_FR.UTF-8		 -2.06%		2,232,030	2,186,030
> lorem_ipsum#pt_PT.UTF-8		 -2.60%		2,238,410	2,180,210
> lorem_ipsum#el_GR.UTF-8		-34.52%		3,413,330	2,235,010
> lorem_ipsum#ru_RU.UTF-8		 -9.88%		2,403,370	2,165,950
> lorem_ipsum#iw_IL.UTF-8		 -9.56%		2,209,740	1,998,500
> lorem_ipsum#es_ES.UTF-8	 	  4.92%		1,983,470	2,081,050
> lorem_ipsum#hi_IN.UTF-8		-98.88%		220,453,000	2,458,620
> lorem_ipsum#sv_SE.UTF-8		  1.79%		1,645,370	1,674,760
> lorem_ipsum#hu_HU.UTF-8		  4.86%		3,179,620	3,334,290
> lorem_ipsum#tr_TR.UTF-8		-23.59%		2,473,330	1,889,870
> lorem_ipsum#is_IS.UTF-8		  2.49%		1,620,370	1,660,680
> lorem_ipsum#it_IT.UTF-8		 -2.67%		2,186,160	2,127,710
> lorem_ipsum#sr_RS.UTF-8		  2.70%		1,930,520	1,982,720
> lorem_ipsum#ja_JP.UTF-8		-97.43%		958,411		24,664
> wikipedia-th#en_US.UTF-8	-99.61%		10,511,700,000	40,577,100
> 
> The performance numbers and the size of the patch changed due to the removal of the strdiff optimization (#18589) and
> the included Thai test. Performance degradation for locales in the ASCII plane is still minor. The patch greatly increases the speed
> of strcoll for all languages that mostly use multibyte UTF-8 encoding. Note that it should affect the regex
> performance of these languages too, though there is no benchmark for that.


>> Will this always work? I'm just wondering about a user generated charmap that they
>> call 'utf8', which is the other common alias for instance where the dash is not valid
>> syntax. Probably not since the official name is UTF-8, and that's what you should use.
> 
> Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".

Could you check to see what value 'code_set_name' uses internally?

Is it always 'UTF-8'? If it is, then the check you have is just fine.

Otherwise we should check for utf8 and UTF-8.

We don't know until you verify the values code_set_name can have.
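
If it turns out that both spellings can occur, the check could be wrapped in a small helper along these lines. This is a sketch only; `codeset_is_utf8` is a hypothetical name and is not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if CODE_SET_NAME is one of the two spellings discussed
   here, "UTF-8" or "utf8".  Any other name, or NULL, is not UTF-8.  */
static bool
codeset_is_utf8 (const char *code_set_name)
{
  if (code_set_name == NULL)
    return false;
  return (strcmp (code_set_name, "UTF-8") == 0
	  || strcmp (code_set_name, "utf8") == 0);
}
```

A miss here would only cost speed, as noted in the quoted reply, since the locale would then fall back to the non-UTF-8 path.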

> 
> 	* benchtests/bench-strcoll.c: Add thai text with en_US.UTF-8 locale.
> 	* benchtests/strcoll-inputs/wikipedia-th#en_US.UTF-8: New file.
> 	* locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE.
> 	* locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list.
> 	* locale/localeinfo.h: Add enum collation_encoding_type.
> 	* locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit.
> 	* locale/programs/ld-collate.c (struct locale_collate_t):
> 	Expand mbheads array from 256 to 65536 entries.
> 	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
> 	(collate_output): Output larger table and sequences including first byte.
> 	(collate_output): Add encoding type info.
> 	* locale/weight.h (utf8index): New function to calculate 2 byte index.
> 	(findidx): Use 2-byte index for table if UTF-8 locale.
> 	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
>  	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
> 	* posix/regcomp.c (build_equiv_class): Likewise.
> 	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
> 	* posix/regexec.c (check_node_accept_bytes): Likewise.
> 	* string/strcoll_l.c (get_next_seq): Likewise.
> 	(STRCOLL): Call get_next_seq with encoding parameter.
> 	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
> 	(STRXFRM): Call find_idx with encoding parameter.
> 
> 
> diff --git a/benchtests/bench-strcoll.c b/benchtests/bench-strcoll.c
> index 22ae87c..6ce5b2a 100644
> --- a/benchtests/bench-strcoll.c
> +++ b/benchtests/bench-strcoll.c
> @@ -53,7 +53,8 @@ static const char *const input_files[] = {
>    "lorem_ipsum#is_IS.UTF-8",
>    "lorem_ipsum#it_IT.UTF-8",
>    "lorem_ipsum#sr_RS.UTF-8",
> -  "lorem_ipsum#ja_JP.UTF-8"
> +  "lorem_ipsum#ja_JP.UTF-8",
> +  "wikipedia-th#en_US.UTF-8"

OK.

>  };
> 
>  #define TEXTFILE_DELIMITER " \n\r\t.,?!"
> diff --git a/locale/C-collate.c b/locale/C-collate.c
> index 8214ff5..5a9ed6a 100644
> --- a/locale/C-collate.c
> +++ b/locale/C-collate.c
> @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
>      /* _NL_COLLATE_COLLSEQWC */
>      { .string = (const char *) collseqwc },
>      /* _NL_COLLATE_CODESET */
> -    { .string = _nl_C_codeset }
> +    { .string = _nl_C_codeset },
> +    /* _NL_COLLATE_ENCODING_TYPE */
> +    { .word = __cet_8bit }

This makes locale-archive incompatible again right?

Users have to regenerate the locale-archive after the upgrade?

We need a release note for that under "Packaging Changes"
https://sourceware.org/glibc/wiki/Release/2.24#Packaging_Changes

The release note should mention the binary locale-archive format
has changed and that locale-archive must be removed before upgrading
and then recompiled after upgrade.

>    }
>  };
> diff --git a/locale/categories.def b/locale/categories.def
> index d8a3ab8..cb57eae 100644
> --- a/locale/categories.def
> +++ b/locale/categories.def
> @@ -58,6 +58,7 @@ DEFINE_CATEGORY
>    DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB,        "collate-collseqmb",        std, wstring)
>    DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC,        "collate-collseqwc",        std, wstring)
>    DEFINE_ELEMENT (_NL_COLLATE_CODESET,		"collate-codeset",	    std, string)
> +  DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE,	"collate-encoding-type",    std, word)

OK.

>    ), NO_POSTLOAD)
> 
> 
> diff --git a/locale/langinfo.h b/locale/langinfo.h
> index 481e226..0906a6a 100644
> --- a/locale/langinfo.h
> +++ b/locale/langinfo.h
> @@ -255,6 +255,7 @@ enum
>    _NL_COLLATE_COLLSEQMB,
>    _NL_COLLATE_COLLSEQWC,
>    _NL_COLLATE_CODESET,
> +  _NL_COLLATE_ENCODING_TYPE,

OK.

>    _NL_NUM_LC_COLLATE,
> 
>    /* LC_CTYPE category: character classification.
> diff --git a/locale/localeinfo.h b/locale/localeinfo.h
> index 5c4e6ef..bd284df 100644
> --- a/locale/localeinfo.h
> +++ b/locale/localeinfo.h
> @@ -110,6 +110,14 @@ enum coll_sort_rule
>    sort_mask
>  };
> 
> +/* Collation encoding type.  */
> +enum collation_encoding_type
> +{
> +  __cet_other,
> +  __cet_8bit,
> +  __cet_utf8
> +};

OK.

> +
>  /* We can map the types of the entries into a few categories.  */
>  enum value_type
>  {
> diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
> index 1e125f6..efaacf6 100644
> --- a/locale/programs/ld-collate.c
> +++ b/locale/programs/ld-collate.c
> @@ -32,6 +32,8 @@
>  #include "linereader.h"
>  #include "locfile.h"
>  #include "elem-hash.h"
> +#include "../localeinfo.h"
> +#include "../locale/weight.h"

OK.

> 
>  /* Uncomment the following line in the production version.  */
>  /* #define NDEBUG 1 */
> @@ -243,9 +245,10 @@ struct locale_collate_t
>       Therefore we keep all relevant input in a list.  */
>    struct locale_collate_t *next;
> 
> -  /* Arrays with heads of the list for each of the leading bytes in
> +  /* Arrays with heads of the list for the leading bytes in
>       the multibyte sequences.  */
> -  struct element_t *mbheads[256];

Needs a comment explaining why this is '256*256'.

> +  #define MBHEADS_SZ (256 * 256)
> +  struct element_t *mbheads[MBHEADS_SZ];

OK.

> 
>    /* Arrays with heads of the list for each of the leading bytes in
>       the multibyte sequences.  */
> @@ -1557,6 +1560,7 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
>    struct section_list *sect;
>    int ruleidx;
>    int nr_wide_elems = 0;
> +  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;

As discussed above you need to verify that code_set_name is always "UTF-8" if it's
a UTF-8 charmap. If it is always "UTF-8" then this check is just fine. I will note
that `locale -a -v` always prints 'UTF-8' for codeset and never 'utf8', but the
printed name is still 'C.utf8' or 'es_EC.utf8' etc because glibc normalizes the name
internally (downcase and strip dashes e.g. pt_PT.iso88591).
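
To make the normalization concrete, here is a simplified sketch of "downcase and strip dashes". It is not glibc's actual routine: the function name is made up, and it drops every non-alphanumeric character rather than only dashes.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy NAME into OUT, lowercased and with non-alphanumeric
   characters removed.  Returns 0 on success, -1 if an argument is
   invalid or OUT (of size OUTSZ) is too small.  */
static int
normalize_codeset_sketch (const char *name, char *out, size_t outsz)
{
  size_t n = 0;

  if (name == NULL || out == NULL || outsz == 0)
    return -1;

  for (; *name != '\0'; ++name)
    {
      unsigned char c = (unsigned char) *name;
      if (!isalnum (c))
	continue;
      if (n + 1 >= outsz)
	return -1;
      out[n++] = (char) tolower (c);
    }
  out[n] = '\0';
  return 0;
}
```

Under this sketch "UTF-8" becomes "utf8" and "ISO-8859-1" becomes "iso88591", matching the printed locale names above.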

I suggest just putting a breakpoint there and verify that it's UTF-8 for a few
locale names.

You could then avoid strcmp:

bool is_utf8 = strlen (charmap->code_set_name) == 5
		? (charmap->code_set_name[0] == 'U'
		   && charmap->code_set_name[1] == 'T'
		   && charmap->code_set_name[2] == 'F'
		   && charmap->code_set_name[3] == '-'
		   && charmap->code_set_name[4] == '8')
		: 0;

A name with more than 5 chars cannot be 'UTF-8', e.g. BIG5-HKSCS or ISO-8859-1.
The only other ones with 5 or fewer are GBK and PT154. So you could skip the strcmp
in all cases, but still need a comparison to disambiguate PT154 vs. UTF-8.

> 
>    if (collate == NULL)
>      {
> @@ -1663,7 +1667,22 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
>  	  struct element_t *lastp = NULL;
> 
>  	  /* Find the point where to insert in the list.  */
> -	  eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]];
> +	  uint16_t index = ((unsigned char *) runp->mbs)[0];
> +
> +	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.  */
> +	  if (is_utf8 && index > 0)
> +	    {
> +	      index = utf8index((unsigned char *) runp->mbs, runp->nmbs);
> +	      if (index == 0)
> +		{
> +		  WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line,
> +						  _("\
> +malformed UTF-8 character in `%s'"), runp->name););
> +		  goto dont_insert;
> +		}
> +	    }
> +
> +	  eptr = &collate->mbheads[index];

OK.

>  	  while (*eptr != NULL)
>  	    {
>  	      if ((*eptr)->nmbs < runp->nmbs)
> @@ -1734,7 +1753,7 @@ symbol `%s' has the same encoding as"), (*eptr)->name);
> 
>    /* Find out whether any of the `mbheads' entries is unset.  In this
>       case we use the UNDEFINED entry.  */
> -  for (i = 1; i < 256; ++i)
> +  for (i = 1; i < MBHEADS_SZ; ++i)

OK.

>      if (collate->mbheads[i] == NULL)
>        {
>  	need_undefined = 1;
> @@ -2107,7 +2126,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>    const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE);
>    struct locale_file file;
>    size_t ch;
> -  int32_t tablemb[256];
> +  int32_t tablemb[MBHEADS_SZ];

OK.

>    struct obstack weightpool;
>    struct obstack extrapool;
>    struct obstack indirectpool;
> @@ -2130,6 +2149,8 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  	  /* The words have to be handled specially.  */
>  	  if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB))
>  	    add_locale_uint32 (&file, 0);
> +	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE))
> +	    add_locale_uint32 (&file, __cet_other);

OK.

>  	  else
>  	    add_locale_empty (&file);
>  	}
> @@ -2183,7 +2204,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>    if (collate->undefined.used_in_level != 0)
>      output_weight (&weightpool, collate, &collate->undefined);
> 
> -  for (ch = 1; ch < 256; ++ch)
> +  for (ch = 1; ch < MBHEADS_SZ; ++ch)

OK.

>      if (collate->mbheads[ch]->mbnext == NULL
>  	&& collate->mbheads[ch]->nmbs <= 1)
>        {
> @@ -2208,7 +2229,6 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  	   and add only one index into the weight table.  We can find the
>  	   consecutive entries since they are also consecutive in the list.  */
>  	struct element_t *runp = collate->mbheads[ch];
> -	struct element_t *lastp;
> 
>  	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
> 
> @@ -2236,7 +2256,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> 
>  		/* Compute how much space we will need.  */
>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
> -					  + 2 * (runp->nmbs - 1));
> +					  + 2 * runp->nmbs);
>  		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
>  		obstack_make_room (&extrapool, added);
> 
> @@ -2259,9 +2279,9 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  		/* Now walk backward from here to the beginning.  */
>  		curp = runp;
> 
> -		assert (runp->nmbs <= 256);
> -		obstack_1grow_fast (&extrapool, curp->nmbs - 1);
> -		for (i = 1; i < curp->nmbs; ++i)
> +		assert (runp->nmbs <= 255);

Another magic constant. Where does 256 or 255 come from?

These are struct element_t structures with entries chained from there.

Why do we limit them to 255? When might they be larger?

We should use a define here and document why the limit exists.
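
One possible reading, going only by the diff: the length is written with obstack_1grow_fast, i.e. as a single byte, and the patch now stores nmbs itself instead of nmbs - 1, so the largest storable value drops from 256 to 255. A define could then look like this sketch, where `MBSEQ_LEN_MAX` and `store_seq_len` are made-up names.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical name: the sequence length is stored in one byte, so
   it cannot exceed UCHAR_MAX (255 with 8-bit bytes).  */
#define MBSEQ_LEN_MAX UCHAR_MAX

/* Store NMBS as a one-byte length in *OUT.  Returns 0 on success,
   -1 if the length does not fit or OUT is NULL.  */
static int
store_seq_len (size_t nmbs, unsigned char *out)
{
  if (out == NULL || nmbs > MBSEQ_LEN_MAX)
    return -1;
  *out = (unsigned char) nmbs;
  return 0;
}
```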

> +		obstack_1grow_fast (&extrapool, curp->nmbs);
> +		for (i = 0; i < curp->nmbs; ++i)

OK.

>  		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
> 
>  		/* Now find the end of the consecutive sequence and
> @@ -2281,7 +2301,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
> 
>  		/* And add the end byte sequence.  Without length this
>  		   time.  */
> -		for (i = 1; i < curp->nmbs; ++i)
> +		for (i = 0; i < curp->nmbs; ++i)
>  		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
>  	      }
>  	    else
> @@ -2295,15 +2315,15 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  		weightidx = output_weight (&weightpool, collate, runp);
> 
>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
> -					  + runp->nmbs - 1);
> +					  + runp->nmbs);
>  		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
>  		obstack_make_room (&extrapool, added);
> 
>  		obstack_int32_grow_fast (&extrapool, weightidx);
> -		assert (runp->nmbs <= 256);
> -		obstack_1grow_fast (&extrapool, runp->nmbs - 1);
> +		assert (runp->nmbs <= 255);
> +		obstack_1grow_fast (&extrapool, runp->nmbs);
> 
> -		for (i = 1; i < runp->nmbs; ++i)
> +		for (i = 0; i < runp->nmbs; ++i)
>  		  obstack_1grow_fast (&extrapool, runp->mbs[i]);
>  	      }
> 
> @@ -2312,30 +2332,25 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>  	      obstack_1grow_fast (&extrapool, '\0');
> 
>  	    /* Next entry.  */
> -	    lastp = runp;
>  	    runp = runp->mbnext;
>  	  }
>  	while (runp != NULL);
> 
>  	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
> 
> -	/* If the final entry in the list is not a single character we
> -	   add an UNDEFINED entry here.  */
> -	if (lastp->nmbs != 1)
> -	  {
> -	    int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
> -	    obstack_make_room (&extrapool, added);
> +	/* Add an UNDEFINED entry at the end of the list.  */
> +	int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
> +	obstack_make_room (&extrapool, added);
> 
> -	    obstack_int32_grow_fast (&extrapool, 0);
> -	    /* XXX What rule? We just pick the first.  */
> -	    obstack_1grow_fast (&extrapool, 0);
> -	    /* Length is zero.  */
> -	    obstack_1grow_fast (&extrapool, 0);
> +	obstack_int32_grow_fast (&extrapool, 0);
> +	/* XXX What rule? We just pick the first.  */
> +	obstack_1grow_fast (&extrapool, 0);
> +	/* Length is zero.  */
> +	obstack_1grow_fast (&extrapool, 0);
> 
> -	    /* Add alignment bytes if necessary.  */
> -	    while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
> -	      obstack_1grow_fast (&extrapool, '\0');
> -	  }
> +	/* Add alignment bytes if necessary.  */
> +	while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
> +	  obstack_1grow_fast (&extrapool, '\0');
>        }
> 
>    /* Add padding to the tables if necessary.  */
> @@ -2343,7 +2358,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>      obstack_1grow (&weightpool, 0);
> 
>    /* Now add the four tables.  */
> -  add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256);
> +  add_locale_uint32_array (&file, (const uint32_t *) tablemb, MBHEADS_SZ);
>    add_locale_raw_obstack (&file, &weightpool);
>    add_locale_raw_obstack (&file, &extrapool);
>    add_locale_raw_obstack (&file, &indirectpool);
> @@ -2493,6 +2508,12 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>    add_locale_raw_data (&file, collate->mbseqorder, 256);
>    add_locale_collseq_table (&file, &collate->wcseqorder);
>    add_locale_string (&file, charmap->code_set_name);
> +  if (strcmp (charmap->code_set_name, "UTF-8") == 0)

Similar discussion as above regarding UTF-8 and optimization.

> +    add_locale_uint32 (&file, __cet_utf8);
> +  else if (charmap->mb_cur_max == 1)
> +    add_locale_uint32 (&file, __cet_8bit);
> +  else
> +    add_locale_uint32 (&file, __cet_other);
>    write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file);
> 
>    obstack_free (&weightpool, NULL);
> diff --git a/locale/weight.h b/locale/weight.h
> index c99730c..5b4103b 100644
> --- a/locale/weight.h
> +++ b/locale/weight.h
> @@ -19,26 +19,81 @@
>  #ifndef _WEIGHT_H_
>  #define _WEIGHT_H_	1
> 
> +/* Generate 2 byte code for the next UTF-8 encoded char.
> +   Returns zero on UTF-8 encoding errors.  */

Please change the prototype to be:

int utf8index (const unsigned char *cp, size_t len, uint16_t *index);

Return 0 on success.
Return !0 on error.
Store index into *index.

This way we avoid conflating index value 0 and error.
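
A sketch of how the requested signature might look when applied to the arithmetic in the hunk below. It is named `utf8index_sketch` to make clear it is an illustration, not the V5 code. It keeps the same constants, adds NULL and zero-length checks that are not in the patch, and returns -1 as the nonzero error value.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute a 16-bit table index for the UTF-8 sequence at CP with LEN
   bytes available.  Returns 0 and stores the index in *INDEX on
   success; returns nonzero on a truncated or invalid sequence.  */
static int
utf8index_sketch (const unsigned char *cp, size_t len, uint16_t *index)
{
  uint32_t value;

  if (cp == NULL || index == NULL || len == 0)
    return -1;

  value = cp[0];
  if (value >= 0x80)
    {
      if (value < 0xE0)
	{
	  if (len < 2)
	    return -1;
	  value = (value << 6) + cp[1] - 0x3080;
	}
      else if (value < 0xF0)
	{
	  if (len < 3)
	    return -1;
	  value = (value << 12) + ((uint32_t) cp[1] << 6) + cp[2] - 0xE2080;
	}
      else if (value < 0xF8)
	{
	  if (len < 4)
	    return -1;
	  /* As in the patch, the lead byte is ignored here and only
	     the low 16 bits of the result are kept.  */
	  value = ((uint32_t) cp[1] << 12) + ((uint32_t) cp[2] << 6)
		  + cp[3] - 0x82080;
	}
      else
	return -1;
    }

  *index = (uint16_t) value;
  return 0;
}
```

With this shape a NUL byte yields success with index 0, while a truncated sequence yields a nonzero return, so the two cases no longer collide.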

> +static __always_inline uint16_t
> +utf8index (const unsigned char *cp, size_t len)
> +{
> +  uint16_t index = cp[0];
> +
> +  if (index >= 0x80)
> +    {
> +      if (index < 0xE0)
> +	{
> +	  if (len < 2)
> +	    return 0;
> +	  uint16_t byte2 = cp[1];
> +	  index = (index << 6) + byte2 - 0x3080;
> +	}
> +      else if (index < 0xF0)
> +	{
> +	  if (len < 3)
> +	    return 0;
> +	  uint16_t byte2 = cp[1];
> +	  uint16_t byte3 = cp[2];
> +	  index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
> +	}
> +      else if (index < 0xF8)
> +	{
> +	  if (len < 4)
> +	    return 0;
> +	  uint16_t byte2 = cp[1];
> +	  uint16_t byte3 = cp[2];
> +	  uint16_t byte4 = cp[3];
> +	  index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;

I believe this is technically inaccurate since it allows all 4-byte
sequences, when in reality the limit is at U+10FFFF?

You need not fix it, but we should add a comment saying that for the
sake of simpler code we're allowing those 4-byte sequences which are
not normally accepted.

> +	}
> +      else
> +	return 0;

OK.

> +    }
> +
> +  return index;
> +}
> +
>  /* Find index of weight.  */
>  static inline int32_t __attribute__ ((always_inline))
> -findidx (const int32_t *table,
> +findidx (uint_fast32_t locale_encoding,
> +	 const int32_t *table,
>  	 const int32_t *indirect,
>  	 const unsigned char *extra,
>  	 const unsigned char **cpp, size_t len)
>  {
> -  int_fast32_t i = table[*(*cpp)++];
>    const unsigned char *cp;
>    const unsigned char *usrc;
> +  uint16_t index = (*cpp)[0];
> +
> +  /* Special handling of UTF-8: Generate a 2-byte index for table.  */
> +  if (index >= 0x80 && locale_encoding == __cet_utf8)
> +    {
> +      index = utf8index(*cpp, len);
> +      if (index == 0)
> +	{
> +	  *cpp += 1;
> +	  return 0;
> +	}
> +    }

OK.

> 
> +  int_fast32_t i = table[index];
>    if (i >= 0)
> -    /* This is an index into the weight table.  Cool.  */
> -    return i;
> +    {
> +      /* This is an index into the weight table.  Cool.  */
> +      *cpp += 1;
> +      return i;
> +    }
> 
>    /* Oh well, more than one sequence starting with this byte.
>       Search for the correct one.  */
>    cp = &extra[-i];
>    usrc = *cpp;
> -  --len;
>    while (1)
>      {
>        size_t nhere;
> @@ -57,8 +112,7 @@ findidx (const int32_t *table,
>  	  /* It is a single character.  If it matches we found our
>  	     index.  Note that at the end of each list there is an
>  	     entry of length zero which represents the single byte
> -	     sequence.  The first (and here only) byte was tested
> -	     already.  */
> +	     sequence.  */
>  	  size_t cnt;
> 
>  	  for (cnt = 0; cnt < nhere && cnt < len; ++cnt)
> @@ -68,7 +122,7 @@ findidx (const int32_t *table,
>  	  if (cnt == nhere)
>  	    {
>  	      /* Found it.  */
> -	      *cpp += nhere;
> +	      *cpp += nhere > 0 ? nhere : 1;
>  	      return i;
>  	    }
> 
> @@ -127,7 +181,7 @@ findidx (const int32_t *table,
>  	      while (++cnt < nhere);
>  	    }
> 
> -	  *cpp += nhere;
> +	  *cpp += nhere > 0 ? nhere : 1;
>  	  return indirect[-i + offset];
>  	}
>      }

OK.

> diff --git a/locale/weightwc.h b/locale/weightwc.h
> index ab26482..4101dc8 100644
> --- a/locale/weightwc.h
> +++ b/locale/weightwc.h
> @@ -21,7 +21,8 @@
> 
>  /* Find index of weight.  */
>  static inline int32_t __attribute__ ((always_inline))
> -findidx (const int32_t *table,
> +findidx (uint_fast32_t encoding,
> +	 const int32_t *table,
>  	 const int32_t *indirect,
>  	 const wint_t *extra,
>  	 const wint_t **cpp, size_t len)
> diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
> index 229904e..07b60fb 100644
> --- a/posix/fnmatch_loop.c
> +++ b/posix/fnmatch_loop.c
> @@ -383,6 +383,8 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>  			const int32_t *indirect;
>  			int32_t idx;
>  			const UCHAR *cp = (const UCHAR *) &str;
> +			uint_fast32_t encoding =
> +			  _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
> 
>  # if WIDE_CHAR_VERSION
>  			table = (const int32_t *)
> @@ -404,7 +406,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>  			  _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
>  # endif
> 
> -			idx = FINDIDX (table, indirect, extra, &cp, 1);
> +			idx = FINDIDX (encoding, table, indirect, extra, &cp, 1);
>  			if (idx != 0)
>  			  {
>  			    /* We found a table entry.  Now see whether the
> @@ -414,7 +416,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
>  			    int32_t idx2;
>  			    const UCHAR *np = (const UCHAR *) n;
> 
> -			    idx2 = FINDIDX (table, indirect, extra,
> +			    idx2 = FINDIDX (encoding, table, indirect, extra,
>  					    &np, string_end - n);
>  			    if (idx2 != 0
>  				&& (idx >> 24) == (idx2 >> 24)

OK.

> diff --git a/posix/regcomp.c b/posix/regcomp.c
> index b6126b7..011ef92 100644
> --- a/posix/regcomp.c
> +++ b/posix/regcomp.c
> @@ -3414,6 +3414,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>    uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
>    if (nrules != 0)
>      {
> +      uint_fast32_t encoding;
>        const int32_t *table, *indirect;
>        const unsigned char *weights, *extra, *cp;
>        unsigned char char_buf[2];
> @@ -3422,6 +3423,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>        size_t len;
>        /* Calculate the index for equivalence class.  */
>        cp = name;
> +      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>        table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>        weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE,
>  					       _NL_COLLATE_WEIGHTMB);
> @@ -3429,7 +3431,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>  						   _NL_COLLATE_EXTRAMB);
>        indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
>  						_NL_COLLATE_INDIRECTMB);
> -      idx1 = findidx (table, indirect, extra, &cp, -1);
> +      idx1 = findidx (encoding, table, indirect, extra, &cp, -1);
>        if (BE (idx1 == 0 || *cp != '\0', 0))
>  	/* This isn't a valid character.  */
>  	return REG_ECOLLATE;
> @@ -3440,7 +3442,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name)
>  	{
>  	  char_buf[0] = ch;
>  	  cp = char_buf;
> -	  idx2 = findidx (table, indirect, extra, &cp, 1);
> +	  idx2 = findidx (encoding, table, indirect, extra, &cp, 1);
>  /*
>  	  idx2 = table[ch];
>  */

OK.

> diff --git a/posix/regex_internal.h b/posix/regex_internal.h
> index 02e040b..993c7c3 100644
> --- a/posix/regex_internal.h
> +++ b/posix/regex_internal.h
> @@ -743,17 +743,19 @@ re_string_elem_size_at (const re_string_t *pstr, int idx)
>  #  ifdef _LIBC
>    const unsigned char *p, *extra;
>    const int32_t *table, *indirect;
> +  uint_fast32_t encoding;
>    uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
> 
>    if (nrules != 0)
>      {
> +      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>        table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>        extra = (const unsigned char *)
>  	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
>        indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
>  						_NL_COLLATE_INDIRECTMB);
>        p = pstr->mbs + idx;
> -      findidx (table, indirect, extra, &p, pstr->len - idx);
> +      findidx (encoding, table, indirect, extra, &p, pstr->len - idx);
>        return p - pstr->mbs - idx;
>      }
>    else

OK.

> diff --git a/posix/regexec.c b/posix/regexec.c
> index ec46c3a..3d3ad9a 100644
> --- a/posix/regexec.c
> +++ b/posix/regexec.c
> @@ -3843,6 +3843,7 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>        if (nrules != 0)
>  	{
>  	  unsigned int in_collseq = 0;
> +	  uint_fast32_t encoding;
>  	  const int32_t *table, *indirect;
>  	  const unsigned char *weights, *extra;
>  	  const char *collseqwc;
> @@ -3893,6 +3894,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>  	  if (cset->nequiv_classes)
>  	    {
>  	      const unsigned char *cp = pin;
> +	      encoding =
> +		_NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
>  	      table = (const int32_t *)
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
>  	      weights = (const unsigned char *)
> @@ -3901,7 +3904,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
>  	      indirect = (const int32_t *)
>  		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
> -	      int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
> +	      int32_t idx = findidx (encoding, table, indirect, extra, &cp,
> +				     elem_len);
>  	      if (idx > 0)
>  		for (i = 0; i < cset->nequiv_classes; ++i)
>  		  {

OK.

> diff --git a/string/strcoll_l.c b/string/strcoll_l.c
> index 4d1e3ab..2c2cab0 100644
> --- a/string/strcoll_l.c
> +++ b/string/strcoll_l.c
> @@ -63,9 +63,9 @@ typedef struct
>  /* Get next sequence.  Traverse the string as required.  */
>  static __always_inline void
>  get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
> -	      const USTRING_TYPE *weights, const int32_t *table,
> -	      const USTRING_TYPE *extra, const int32_t *indirect,
> -	      int pass)
> +	      const USTRING_TYPE *weights, uint_fast32_t encoding,
> +	      const int32_t *table, const USTRING_TYPE *extra,
> +	      const int32_t *indirect, int pass)
>  {
>    size_t val = seq->val = 0;
>    int len = seq->len;
> @@ -109,7 +109,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
>  	      us = seq->back_us;
>  	      while (i < backw)
>  		{
> -		  int32_t tmp = findidx (table, indirect, extra, &us, -1);
> +		  int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
>  		  idx = tmp & 0xffffff;
>  		  i++;
>  		}
> @@ -124,7 +124,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
> 
>  	  while (*us != L('\0'))
>  	    {
> -	      int32_t tmp = findidx (table, indirect, extra, &us, -1);
> +	      int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1);
>  	      unsigned char rule = tmp >> 24;
>  	      prev_idx = idx;
>  	      idx = tmp & 0xffffff;
> @@ -253,6 +253,7 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
>    const USTRING_TYPE *weights;
>    const USTRING_TYPE *extra;
>    const int32_t *indirect;
> +  uint_fast32_t encoding;
> 
>    if (nrules == 0)
>      return STRCMP (s1, s2);
> @@ -271,6 +272,8 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
>      current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_EXTRA,SUFFIX))].string;
>    indirect = (const int32_t *)
>      current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_INDIRECT,SUFFIX))].string;
> +  encoding = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
> +
> 
>    assert (((uintptr_t) table) % __alignof__ (table[0]) == 0);
>    assert (((uintptr_t) weights) % __alignof__ (weights[0]) == 0);
> @@ -310,9 +313,9 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
> 
>        while (1)
>  	{
> -	  get_next_seq (&seq1, nrules, rulesets, weights, table,
> +	  get_next_seq (&seq1, nrules, rulesets, weights, encoding, table,
>  				    extra, indirect, pass);
> -	  get_next_seq (&seq2, nrules, rulesets, weights, table,
> +	  get_next_seq (&seq2, nrules, rulesets, weights, encoding, table,
>  				    extra, indirect, pass);
>  	  /* See whether any or both strings are empty.  */
>  	  if (seq1.len == 0 || seq2.len == 0)

OK.

> diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c
> index 22e24d3..5c89b15 100644
> --- a/string/strxfrm_l.c
> +++ b/string/strxfrm_l.c
> @@ -53,6 +53,7 @@ typedef struct
>    uint_fast32_t nrules;
>    unsigned char *rulesets;
>    USTRING_TYPE *weights;
> +  uint_fast32_t encoding;
>    int32_t *table;
>    USTRING_TYPE *extra;
>    int32_t *indirect;
> @@ -100,8 +101,8 @@ static __always_inline size_t
>  find_idx (const USTRING_TYPE **us, int32_t *weight_idx,
>  	  unsigned char *rule_idx, const locale_data_t *l_data, const int pass)
>  {
> -  int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us,
> -			 -1);
> +  int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect,
> +			 l_data->extra, us, -1);
>    *rule_idx = tmp >> 24;
>    int32_t idx = tmp & 0xffffff;
>    size_t len = l_data->weights[idx++];
> @@ -693,6 +694,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
>    /* Get the locale data.  */
>    l_data.rulesets = (unsigned char *)
>      current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string;
> +  l_data.encoding =
> +    current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
>    l_data.table = (int32_t *)
>      current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string;
>    l_data.weights = (USTRING_TYPE *)
> @@ -721,8 +724,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
> 
>    do
>      {
> -      int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur,
> -			     -1);
> +      int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect,
> +			     l_data.extra, &cur, -1);
>        rulearr[idxmax] = tmp >> 24;
>        idxarr[idxmax] = tmp & 0xffffff;
> 

OK.
  
Joseph Myers March 29, 2016, 10:05 p.m. UTC | #2
On Tue, 29 Mar 2016, Carlos O'Donell wrote:

> I believe this is technically inaccurate since it allows all 4-byte
> sequences, when in reality the limit is at U+10FFFF?

That glibc accepts UTF-8 according to the definition in the 2003 edition 
of ISO 10646 rather than the definition in the 2011 and later editions is 
a known issue.  I've filed bug 19883 for it since I couldn't find an 
existing bug report in Bugzilla.  I don't think it's particularly relevant 
to any patch not aiming to fix that bug, but:

> You need not fix it, but we should add a comment saying that for the
> sake of simpler code we're allowing those 4-byte sequences which are
> not normally accepted.

I'd think a reference to this code in bug 19883 might be more useful - or 
something in that bug giving a standard (greppable) wording for a comment 
identifying places needing updating for the current UTF-8 (or in some 
cases UCS-4) definition, with such a comment added in this code.
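
To make the point under discussion concrete: the 4-byte UTF-8 form carries 21 payload bits, so it can encode values up to 0x1FFFFF, while the limit mentioned above is U+10FFFF. The following is only an illustrative sketch, not glibc code. The names `decode4` and `within_current_limit` are hypothetical, and the sketch assumes the input is already a structurally well-formed 4-byte sequence (it does not check for overlong encodings).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: decode a structurally well-formed 4-byte
   UTF-8 sequence (lead byte 0xF0..0xF7 followed by three
   continuation bytes) into its 21-bit value.  */
static uint32_t
decode4 (const unsigned char *s)
{
  return ((uint32_t) (s[0] & 0x07) << 18)
         | ((uint32_t) (s[1] & 0x3F) << 12)
         | ((uint32_t) (s[2] & 0x3F) << 6)
         | (uint32_t) (s[3] & 0x3F);
}

/* Hypothetical check: true if the decoded value does not exceed
   U+10FFFF.  */
static bool
within_current_limit (const unsigned char *s)
{
  return decode4 (s) <= 0x10FFFF;
}
```

Under these assumptions, F4 8F BF BF (U+10FFFF) passes the check, while F4 90 80 80 (0x110000) and F7 BF BF BF (0x1FFFFF) do not, even though all three are 4-byte sequences.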
  
Zack Weinberg March 29, 2016, 10:17 p.m. UTC | #3
On Tue, Mar 29, 2016 at 6:05 PM, Joseph Myers <joseph@codesourcery.com> wrote:
> On Tue, 29 Mar 2016, Carlos O'Donell wrote:
>
>> I believe this is technically inaccurate since it allows all 4-byte
>> sequences, when in reality the limit is at U+10FFFF?
>
> That glibc accepts UTF-8 according to the definition in the 2003 edition
> of ISO 10646 rather than the definition in the 2011 and later editions is
> a known issue.  I've filed bug 19883 for it since I couldn't find an
> existing bug report in Bugzilla.

Note that, at the present (post-2000) rate of codepoint assignment, the
U+10FFFF limit equates to a Y2541 bug.  See
https://gist.github.com/zackw/f2e74a8d7b31baa88002 for the calculations
and a pretty graph.

zw
  

Patch

diff --git a/benchtests/bench-strcoll.c b/benchtests/bench-strcoll.c
index 22ae87c..6ce5b2a 100644
--- a/benchtests/bench-strcoll.c
+++ b/benchtests/bench-strcoll.c
@@ -53,7 +53,8 @@  static const char *const input_files[] = {
   "lorem_ipsum#is_IS.UTF-8",
   "lorem_ipsum#it_IT.UTF-8",
   "lorem_ipsum#sr_RS.UTF-8",
-  "lorem_ipsum#ja_JP.UTF-8"
+  "lorem_ipsum#ja_JP.UTF-8",
+  "wikipedia-th#en_US.UTF-8"
 };

 #define TEXTFILE_DELIMITER " \n\r\t.,?!"
diff --git a/locale/C-collate.c b/locale/C-collate.c
index 8214ff5..5a9ed6a 100644
--- a/locale/C-collate.c
+++ b/locale/C-collate.c
@@ -144,6 +144,8 @@  const struct __locale_data _nl_C_LC_COLLATE attribute_hidden =
     /* _NL_COLLATE_COLLSEQWC */
     { .string = (const char *) collseqwc },
     /* _NL_COLLATE_CODESET */
-    { .string = _nl_C_codeset }
+    { .string = _nl_C_codeset },
+    /* _NL_COLLATE_ENCODING_TYPE */
+    { .word = __cet_8bit }
   }
 };
diff --git a/locale/categories.def b/locale/categories.def
index d8a3ab8..cb57eae 100644
--- a/locale/categories.def
+++ b/locale/categories.def
@@ -58,6 +58,7 @@  DEFINE_CATEGORY
   DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB,        "collate-collseqmb",        std, wstring)
   DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC,        "collate-collseqwc",        std, wstring)
   DEFINE_ELEMENT (_NL_COLLATE_CODESET,		"collate-codeset",	    std, string)
+  DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE,	"collate-encoding-type",    std, word)
   ), NO_POSTLOAD)


diff --git a/locale/langinfo.h b/locale/langinfo.h
index 481e226..0906a6a 100644
--- a/locale/langinfo.h
+++ b/locale/langinfo.h
@@ -255,6 +255,7 @@  enum
   _NL_COLLATE_COLLSEQMB,
   _NL_COLLATE_COLLSEQWC,
   _NL_COLLATE_CODESET,
+  _NL_COLLATE_ENCODING_TYPE,
   _NL_NUM_LC_COLLATE,

   /* LC_CTYPE category: character classification.
diff --git a/locale/localeinfo.h b/locale/localeinfo.h
index 5c4e6ef..bd284df 100644
--- a/locale/localeinfo.h
+++ b/locale/localeinfo.h
@@ -110,6 +110,14 @@  enum coll_sort_rule
   sort_mask
 };

+/* Collation encoding type.  */
+enum collation_encoding_type
+{
+  __cet_other,
+  __cet_8bit,
+  __cet_utf8
+};
+
 /* We can map the types of the entries into a few categories.  */
 enum value_type
 {
diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
index 1e125f6..efaacf6 100644
--- a/locale/programs/ld-collate.c
+++ b/locale/programs/ld-collate.c
@@ -32,6 +32,8 @@ 
 #include "linereader.h"
 #include "locfile.h"
 #include "elem-hash.h"
+#include "../localeinfo.h"
+#include "../locale/weight.h"

 /* Uncomment the following line in the production version.  */
 /* #define NDEBUG 1 */
@@ -243,9 +245,10 @@  struct locale_collate_t
      Therefore we keep all relevant input in a list.  */
   struct locale_collate_t *next;

-  /* Arrays with heads of the list for each of the leading bytes in
+  /* Arrays with heads of the list for the leading bytes in
      the multibyte sequences.  */
-  struct element_t *mbheads[256];
+  #define MBHEADS_SZ (256 * 256)
+  struct element_t *mbheads[MBHEADS_SZ];

   /* Arrays with heads of the list for each of the leading bytes in
      the multibyte sequences.  */
@@ -1557,6 +1560,7 @@  collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
   struct section_list *sect;
   int ruleidx;
   int nr_wide_elems = 0;
+  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;

   if (collate == NULL)
     {
@@ -1663,7 +1667,22 @@  collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
 	  struct element_t *lastp = NULL;

 	  /* Find the point where to insert in the list.  */
-	  eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]];
+	  uint16_t index = ((unsigned char *) runp->mbs)[0];
+
+	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.  */
+	  if (is_utf8 && index > 0)
+	    {
+	      index = utf8index((unsigned char *) runp->mbs, runp->nmbs);
+	      if (index == 0)
+		{
+		  WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line,
+						  _("\
+malformed UTF-8 character in `%s'"), runp->name););
+		  goto dont_insert;
+		}
+	    }
+
+	  eptr = &collate->mbheads[index];
 	  while (*eptr != NULL)
 	    {
 	      if ((*eptr)->nmbs < runp->nmbs)
@@ -1734,7 +1753,7 @@  symbol `%s' has the same encoding as"), (*eptr)->name);

   /* Find out whether any of the `mbheads' entries is unset.  In this
      case we use the UNDEFINED entry.  */
-  for (i = 1; i < 256; ++i)
+  for (i = 1; i < MBHEADS_SZ; ++i)
     if (collate->mbheads[i] == NULL)
       {
 	need_undefined = 1;
@@ -2107,7 +2126,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE);
   struct locale_file file;
   size_t ch;
-  int32_t tablemb[256];
+  int32_t tablemb[MBHEADS_SZ];
   struct obstack weightpool;
   struct obstack extrapool;
   struct obstack indirectpool;
@@ -2130,6 +2149,8 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	  /* The words have to be handled specially.  */
 	  if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB))
 	    add_locale_uint32 (&file, 0);
+	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE))
+	    add_locale_uint32 (&file, __cet_other);
 	  else
 	    add_locale_empty (&file);
 	}
@@ -2183,7 +2204,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   if (collate->undefined.used_in_level != 0)
     output_weight (&weightpool, collate, &collate->undefined);

-  for (ch = 1; ch < 256; ++ch)
+  for (ch = 1; ch < MBHEADS_SZ; ++ch)
     if (collate->mbheads[ch]->mbnext == NULL
 	&& collate->mbheads[ch]->nmbs <= 1)
       {
@@ -2208,7 +2229,6 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	   and add only one index into the weight table.  We can find the
 	   consecutive entries since they are also consecutive in the list.  */
 	struct element_t *runp = collate->mbheads[ch];
-	struct element_t *lastp;

 	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));

@@ -2236,7 +2256,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,

 		/* Compute how much space we will need.  */
 		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
-					  + 2 * (runp->nmbs - 1));
+					  + 2 * runp->nmbs);
 		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
 		obstack_make_room (&extrapool, added);

@@ -2259,9 +2279,9 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 		/* Now walk backward from here to the beginning.  */
 		curp = runp;

-		assert (runp->nmbs <= 256);
-		obstack_1grow_fast (&extrapool, curp->nmbs - 1);
-		for (i = 1; i < curp->nmbs; ++i)
+		assert (runp->nmbs <= 255);
+		obstack_1grow_fast (&extrapool, curp->nmbs);
+		for (i = 0; i < curp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, curp->mbs[i]);

 		/* Now find the end of the consecutive sequence and
@@ -2281,7 +2301,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,

 		/* And add the end byte sequence.  Without length this
 		   time.  */
-		for (i = 1; i < curp->nmbs; ++i)
+		for (i = 0; i < curp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, curp->mbs[i]);
 	      }
 	    else
@@ -2295,15 +2315,15 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 		weightidx = output_weight (&weightpool, collate, runp);

 		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
-					  + runp->nmbs - 1);
+					  + runp->nmbs);
 		assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));
 		obstack_make_room (&extrapool, added);

 		obstack_int32_grow_fast (&extrapool, weightidx);
-		assert (runp->nmbs <= 256);
-		obstack_1grow_fast (&extrapool, runp->nmbs - 1);
+		assert (runp->nmbs <= 255);
+		obstack_1grow_fast (&extrapool, runp->nmbs);

-		for (i = 1; i < runp->nmbs; ++i)
+		for (i = 0; i < runp->nmbs; ++i)
 		  obstack_1grow_fast (&extrapool, runp->mbs[i]);
 	      }

@@ -2312,30 +2332,25 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	      obstack_1grow_fast (&extrapool, '\0');

 	    /* Next entry.  */
-	    lastp = runp;
 	    runp = runp->mbnext;
 	  }
 	while (runp != NULL);

 	assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)));

-	/* If the final entry in the list is not a single character we
-	   add an UNDEFINED entry here.  */
-	if (lastp->nmbs != 1)
-	  {
-	    int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
-	    obstack_make_room (&extrapool, added);
+	/* Add an UNDEFINED entry at the end of the list.  */
+	int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1);
+	obstack_make_room (&extrapool, added);

-	    obstack_int32_grow_fast (&extrapool, 0);
-	    /* XXX What rule? We just pick the first.  */
-	    obstack_1grow_fast (&extrapool, 0);
-	    /* Length is zero.  */
-	    obstack_1grow_fast (&extrapool, 0);
+	obstack_int32_grow_fast (&extrapool, 0);
+	/* XXX What rule? We just pick the first.  */
+	obstack_1grow_fast (&extrapool, 0);
+	/* Length is zero.  */
+	obstack_1grow_fast (&extrapool, 0);

-	    /* Add alignment bytes if necessary.  */
-	    while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
-	      obstack_1grow_fast (&extrapool, '\0');
-	  }
+	/* Add alignment bytes if necessary.  */
+	while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool)))
+	  obstack_1grow_fast (&extrapool, '\0');
       }

   /* Add padding to the tables if necessary.  */
@@ -2343,7 +2358,7 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
     obstack_1grow (&weightpool, 0);

   /* Now add the four tables.  */
-  add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256);
+  add_locale_uint32_array (&file, (const uint32_t *) tablemb, MBHEADS_SZ);
   add_locale_raw_obstack (&file, &weightpool);
   add_locale_raw_obstack (&file, &extrapool);
   add_locale_raw_obstack (&file, &indirectpool);
@@ -2493,6 +2508,12 @@  collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   add_locale_raw_data (&file, collate->mbseqorder, 256);
   add_locale_collseq_table (&file, &collate->wcseqorder);
   add_locale_string (&file, charmap->code_set_name);
+  if (strcmp (charmap->code_set_name, "UTF-8") == 0)
+    add_locale_uint32 (&file, __cet_utf8);
+  else if (charmap->mb_cur_max == 1)
+    add_locale_uint32 (&file, __cet_8bit);
+  else
+    add_locale_uint32 (&file, __cet_other);
   write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file);

   obstack_free (&weightpool, NULL);
diff --git a/locale/weight.h b/locale/weight.h
index c99730c..5b4103b 100644
--- a/locale/weight.h
+++ b/locale/weight.h
@@ -19,26 +19,81 @@ 
 #ifndef _WEIGHT_H_
 #define _WEIGHT_H_	1

+/* Generate 2 byte code for the next UTF-8 encoded char.
+   Returns zero on UTF-8 encoding errors.  */
+static __always_inline uint16_t
+utf8index (const unsigned char *cp, size_t len)
+{
+  uint16_t index = cp[0];
+
+  if (index >= 0x80)
+    {
+      if (index < 0xE0)
+	{
+	  if (len < 2)
+	    return 0;
+	  uint16_t byte2 = cp[1];
+	  index = (index << 6) + byte2 - 0x3080;
+	}
+      else if (index < 0xF0)
+	{
+	  if (len < 3)
+	    return 0;
+	  uint16_t byte2 = cp[1];
+	  uint16_t byte3 = cp[2];
+	  index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080;
+	}
+      else if (index < 0xF8)
+	{
+	  if (len < 4)
+	    return 0;
+	  uint16_t byte2 = cp[1];
+	  uint16_t byte3 = cp[2];
+	  uint16_t byte4 = cp[3];
+	  index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080;
+	}
+      else
+	return 0;
+    }
+
+  return index;
+}
+
 /* Find index of weight.  */
 static inline int32_t __attribute__ ((always_inline))
-findidx (const int32_t *table,
+findidx (uint_fast32_t locale_encoding,
+	 const int32_t *table,
 	 const int32_t *indirect,
 	 const unsigned char *extra,
 	 const unsigned char **cpp, size_t len)
 {
-  int_fast32_t i = table[*(*cpp)++];
   const unsigned char *cp;
   const unsigned char *usrc;
+  uint16_t index = (*cpp)[0];
+
+  /* Special handling of UTF-8: Generate a 2-byte index for table.  */
+  if (index >= 0x80 && locale_encoding == __cet_utf8)
+    {
+      index = utf8index(*cpp, len);
+      if (index == 0)
+	{
+	  *cpp += 1;
+	  return 0;
+	}
+    }

+  int_fast32_t i = table[index];
   if (i >= 0)
-    /* This is an index into the weight table.  Cool.  */
-    return i;
+    {
+      /* This is an index into the weight table.  Cool.  */
+      *cpp += 1;
+      return i;
+    }

   /* Oh well, more than one sequence starting with this byte.
      Search for the correct one.  */
   cp = &extra[-i];
   usrc = *cpp;
-  --len;
   while (1)
     {
       size_t nhere;
@@ -57,8 +112,7 @@  findidx (const int32_t *table,
 	  /* It is a single character.  If it matches we found our
 	     index.  Note that at the end of each list there is an
 	     entry of length zero which represents the single byte
-	     sequence.  The first (and here only) byte was tested
-	     already.  */
+	     sequence.  */
 	  size_t cnt;

 	  for (cnt = 0; cnt < nhere && cnt < len; ++cnt)
@@ -68,7 +122,7 @@  findidx (const int32_t *table,
 	  if (cnt == nhere)
 	    {
 	      /* Found it.  */
-	      *cpp += nhere;
+	      *cpp += nhere > 0 ? nhere : 1;
 	      return i;
 	    }

@@ -127,7 +181,7 @@  findidx (const int32_t *table,
 	      while (++cnt < nhere);
 	    }

-	  *cpp += nhere;
+	  *cpp += nhere > 0 ? nhere : 1;
 	  return indirect[-i + offset];
 	}
     }
diff --git a/locale/weightwc.h b/locale/weightwc.h
index ab26482..4101dc8 100644
--- a/locale/weightwc.h
+++ b/locale/weightwc.h
@@ -21,7 +21,8 @@ 

 /* Find index of weight.  */
 static inline int32_t __attribute__ ((always_inline))
-findidx (const int32_t *table,
+findidx (uint_fast32_t encoding,
+	 const int32_t *table,
 	 const int32_t *indirect,
 	 const wint_t *extra,
 	 const wint_t **cpp, size_t len)
diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
index 229904e..07b60fb 100644
--- a/posix/fnmatch_loop.c
+++ b/posix/fnmatch_loop.c
@@ -383,6 +383,8 @@  FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
 			const int32_t *indirect;
 			int32_t idx;
 			const UCHAR *cp = (const UCHAR *) &str;
+			uint_fast32_t encoding =
+			  _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);

 # if WIDE_CHAR_VERSION
 			table = (const int32_t *)
@@ -404,7 +406,7 @@  FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
 			  _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
 # endif

-			idx = FINDIDX (table, indirect, extra, &cp, 1);
+			idx = FINDIDX (encoding, table, indirect, extra, &cp, 1);
 			if (idx != 0)
 			  {
 			    /* We found a table entry.  Now see whether the
@@ -414,7 +416,7 @@  FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
 			    int32_t idx2;
 			    const UCHAR *np = (const UCHAR *) n;

-			    idx2 = FINDIDX (table, indirect, extra,
+			    idx2 = FINDIDX (encoding, table, indirect, extra,
 					    &np, string_end - n);
 			    if (idx2 != 0
 				&& (idx >> 24) == (idx2 >> 24)
diff --git a/posix/regcomp.c b/posix/regcomp.c
index b6126b7..011ef92 100644
--- a/posix/regcomp.c
+++ b/posix/regcomp.c
@@ -3414,6 +3414,7 @@  build_equiv_class (bitset_t sbcset, const unsigned char *name)
   uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
   if (nrules != 0)
     {
+      uint_fast32_t encoding;
       const int32_t *table, *indirect;
       const unsigned char *weights, *extra, *cp;
       unsigned char char_buf[2];
@@ -3422,6 +3423,7 @@  build_equiv_class (bitset_t sbcset, const unsigned char *name)
       size_t len;
       /* Calculate the index for equivalence class.  */
       cp = name;
+      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
       table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
       weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE,
 					       _NL_COLLATE_WEIGHTMB);
@@ -3429,7 +3431,7 @@  build_equiv_class (bitset_t sbcset, const unsigned char *name)
 						   _NL_COLLATE_EXTRAMB);
       indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
 						_NL_COLLATE_INDIRECTMB);
-      idx1 = findidx (table, indirect, extra, &cp, -1);
+      idx1 = findidx (encoding, table, indirect, extra, &cp, -1);
       if (BE (idx1 == 0 || *cp != '\0', 0))
 	/* This isn't a valid character.  */
 	return REG_ECOLLATE;
@@ -3440,7 +3442,7 @@  build_equiv_class (bitset_t sbcset, const unsigned char *name)
 	{
 	  char_buf[0] = ch;
 	  cp = char_buf;
-	  idx2 = findidx (table, indirect, extra, &cp, 1);
+	  idx2 = findidx (encoding, table, indirect, extra, &cp, 1);
 /*
 	  idx2 = table[ch];
 */
diff --git a/posix/regex_internal.h b/posix/regex_internal.h
index 02e040b..993c7c3 100644
--- a/posix/regex_internal.h
+++ b/posix/regex_internal.h
@@ -743,17 +743,19 @@  re_string_elem_size_at (const re_string_t *pstr, int idx)
 #  ifdef _LIBC
   const unsigned char *p, *extra;
   const int32_t *table, *indirect;
+  uint_fast32_t encoding;
   uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);

   if (nrules != 0)
     {
+      encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
       table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
       extra = (const unsigned char *)
 	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
       indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE,
 						_NL_COLLATE_INDIRECTMB);
       p = pstr->mbs + idx;
-      findidx (table, indirect, extra, &p, pstr->len - idx);
+      findidx (encoding, table, indirect, extra, &p, pstr->len - idx);
       return p - pstr->mbs - idx;
     }
   else
diff --git a/posix/regexec.c b/posix/regexec.c
index ec46c3a..3d3ad9a 100644
--- a/posix/regexec.c
+++ b/posix/regexec.c
@@ -3843,6 +3843,7 @@  check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
       if (nrules != 0)
 	{
 	  unsigned int in_collseq = 0;
+	  uint_fast32_t encoding;
 	  const int32_t *table, *indirect;
 	  const unsigned char *weights, *extra;
 	  const char *collseqwc;
@@ -3893,6 +3894,8 @@  check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
 	  if (cset->nequiv_classes)
 	    {
 	      const unsigned char *cp = pin;
+	      encoding =
+		_NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE);
 	      table = (const int32_t *)
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB);
 	      weights = (const unsigned char *)
@@ -3901,7 +3904,8 @@  check_node_accept_bytes (const re_dfa_t *dfa, int node_idx,
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB);
 	      indirect = (const int32_t *)
 		_NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB);
-	      int32_t idx = findidx (table, indirect, extra, &cp, elem_len);
+	      int32_t idx = findidx (encoding, table, indirect, extra, &cp,
+				     elem_len);
 	      if (idx > 0)
 		for (i = 0; i < cset->nequiv_classes; ++i)
 		  {
diff --git a/string/strcoll_l.c b/string/strcoll_l.c
index 4d1e3ab..2c2cab0 100644
--- a/string/strcoll_l.c
+++ b/string/strcoll_l.c
@@ -63,9 +63,9 @@  typedef struct
 /* Get next sequence.  Traverse the string as required.  */
 static __always_inline void
 get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
-	      const USTRING_TYPE *weights, const int32_t *table,
-	      const USTRING_TYPE *extra, const int32_t *indirect,
-	      int pass)
+	      const USTRING_TYPE *weights, uint_fast32_t encoding,
+	      const int32_t *table, const USTRING_TYPE *extra,
+	      const int32_t *indirect, int pass)
 {
   size_t val = seq->val = 0;
   int len = seq->len;
@@ -109,7 +109,8 @@  get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
 	      us = seq->back_us;
 	      while (i < backw)
 		{
-		  int32_t tmp = findidx (table, indirect, extra, &us, -1);
+		  int32_t tmp = findidx (encoding, table, indirect, extra,
+					 &us, -1);
 		  idx = tmp & 0xffffff;
 		  i++;
 		}
@@ -124,7 +125,8 @@  get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,

 	  while (*us != L('\0'))
 	    {
-	      int32_t tmp = findidx (table, indirect, extra, &us, -1);
+	      int32_t tmp = findidx (encoding, table, indirect, extra,
+				     &us, -1);
 	      unsigned char rule = tmp >> 24;
 	      prev_idx = idx;
 	      idx = tmp & 0xffffff;
@@ -253,6 +255,7 @@  STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
   const USTRING_TYPE *weights;
   const USTRING_TYPE *extra;
   const int32_t *indirect;
+  uint_fast32_t encoding;

   if (nrules == 0)
     return STRCMP (s1, s2);
@@ -271,6 +274,7 @@  STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
     current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_EXTRA,SUFFIX))].string;
   indirect = (const int32_t *)
     current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_INDIRECT,SUFFIX))].string;
+  encoding = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;

   assert (((uintptr_t) table) % __alignof__ (table[0]) == 0);
   assert (((uintptr_t) weights) % __alignof__ (weights[0]) == 0);
@@ -310,9 +314,9 @@  STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)

       while (1)
 	{
-	  get_next_seq (&seq1, nrules, rulesets, weights, table,
+	  get_next_seq (&seq1, nrules, rulesets, weights, encoding, table,
 				    extra, indirect, pass);
-	  get_next_seq (&seq2, nrules, rulesets, weights, table,
+	  get_next_seq (&seq2, nrules, rulesets, weights, encoding, table,
 				    extra, indirect, pass);
 	  /* See whether any or both strings are empty.  */
 	  if (seq1.len == 0 || seq2.len == 0)
diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c
index 22e24d3..5c89b15 100644
--- a/string/strxfrm_l.c
+++ b/string/strxfrm_l.c
@@ -53,6 +53,7 @@  typedef struct
   uint_fast32_t nrules;
   unsigned char *rulesets;
   USTRING_TYPE *weights;
+  uint_fast32_t encoding;
   int32_t *table;
   USTRING_TYPE *extra;
   int32_t *indirect;
@@ -100,8 +101,8 @@  static __always_inline size_t
 find_idx (const USTRING_TYPE **us, int32_t *weight_idx,
 	  unsigned char *rule_idx, const locale_data_t *l_data, const int pass)
 {
-  int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us,
-			 -1);
+  int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect,
+			 l_data->extra, us, -1);
   *rule_idx = tmp >> 24;
   int32_t idx = tmp & 0xffffff;
   size_t len = l_data->weights[idx++];
@@ -693,6 +694,8 @@  STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)
   /* Get the locale data.  */
   l_data.rulesets = (unsigned char *)
     current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string;
+  l_data.encoding
+    = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word;
   l_data.table = (int32_t *)
     current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string;
   l_data.weights = (USTRING_TYPE *)
@@ -721,8 +724,8 @@  STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l)

   do
     {
-      int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur,
-			     -1);
+      int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect,
+			     l_data.extra, &cur, -1);
       rulearr[idxmax] = tmp >> 24;
       idxarr[idxmax] = tmp & 0xffffff;