[V4,BZ,#18441] fix sorting multibyte charsets with an improper locale

  In BZ #18441, sorting Thai text with the en_US.UTF-8 locale causes a performance
regression. The cause of the problem is that

a) en_US.UTF-8 has no information for Thai characters and so always reports a
zero sort weight, which causes the comparison to check the whole string instead
of stopping early, and

b) the sequence-to-weight list is partitioned by the first byte of the first
character (TABLEMB); this generates long lists for multibyte UTF-8 characters, as
they tend to share the same starting byte (e.g. all Thai characters start with E0).

The approach of the patch is to interpret TABLEMB as a hash table and find a
better hash key. My first attempt was to somehow "fold" a multibyte character into
one byte, but that worsened the overall performance a lot. Enlarging the table to
2-byte keys works much better while needing a reasonable amount of extra memory.
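
To illustrate the difference between the two kinds of key, here is a minimal
sketch. It is not the code from the patch: first_byte_key and two_byte_key are
hypothetical names, and taking the low 16 bits of the decoded code point is only
one assumed way to build a 2-byte key from a well-formed UTF-8 sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old scheme: the bucket is chosen by the first byte only.  */
static unsigned int
first_byte_key (const unsigned char *s)
{
  assert (s != NULL);
  return s[0];
}

/* Hypothetical 2-byte key: the low 16 bits of the decoded code point.
   Assumes S points at one well-formed UTF-8 sequence of 1 to 4 bytes;
   no validation is done here.  */
static uint16_t
two_byte_key (const unsigned char *s)
{
  uint32_t cp;

  assert (s != NULL);
  if (s[0] < 0x80)			/* 1 byte (ASCII).  */
    cp = s[0];
  else if ((s[0] & 0xe0) == 0xc0)	/* 2 bytes.  */
    cp = ((uint32_t) (s[0] & 0x1f) << 6) | (uint32_t) (s[1] & 0x3f);
  else if ((s[0] & 0xf0) == 0xe0)	/* 3 bytes.  */
    cp = ((uint32_t) (s[0] & 0x0f) << 12)
	 | ((uint32_t) (s[1] & 0x3f) << 6)
	 | (uint32_t) (s[2] & 0x3f);
  else					/* 4 bytes.  */
    cp = ((uint32_t) (s[0] & 0x07) << 18)
	 | ((uint32_t) (s[1] & 0x3f) << 12)
	 | ((uint32_t) (s[2] & 0x3f) << 6)
	 | (uint32_t) (s[3] & 0x3f);

  return (uint16_t) (cp & 0xffff);
}
```

For the Thai letters ko kai (E0 B8 81, U+0E01) and kho khai (E0 B8 82, U+0E02),
first_byte_key returns E0 for both, so they share one list, while two_byte_key
returns 0E01 and 0E02 and puts them into separate buckets.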

The patch vastly improves the performance of languages with multibyte chars (see
zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte
chars get a bit slower because of the extra check for the first byte while finding
the right sequence in the sequence list. It cannot be avoided since the hash key is
no longer equal to the first byte of the sequence. Tests are OK.
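
The extra check can be pictured with a small sketch. The struct seq_entry and
lookup_weight below are hypothetical and much simpler than the real findidx
(they return the first exact match, nothing more). The point is only that once
the key is not the first byte, a bucket may hold sequences with different first
bytes, so the comparison has to cover the complete sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bucket entry: the complete byte sequence is stored,
   first byte included.  */
struct seq_entry
{
  const unsigned char *bytes;
  size_t len;
  int weight_index;
  const struct seq_entry *next;
};

/* Walk one bucket and return the weight index of the first entry whose
   complete sequence is a prefix of S (AVAIL bytes available), or -1 if
   there is none.  The first byte is part of the comparison because
   entries in the same bucket may differ in it.  */
static int
lookup_weight (const struct seq_entry *bucket, const unsigned char *s,
	       size_t avail)
{
  assert (s != NULL);
  for (const struct seq_entry *e = bucket; e != NULL; e = e->next)
    if (e->len <= avail && memcmp (e->bytes, s, e->len) == 0)
      return e->weight_index;
  return -1;
}
```

For one-byte characters this compare is pure overhead compared to the old
layout, where the bucket index already implied the first byte.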

benchmark			 change		before		after
filelist#C			  1.75%		23,396,200	23,805,700
filelist#en_US.UTF-8		  1.42%		77,186,200	78,285,200
lorem_ipsum#vi_VN.UTF-8		 -1.70%		1,680,740	1,652,110
lorem_ipsum#ar_SA.UTF-8		 -7.71%		2,134,780	1,970,170
lorem_ipsum#en_US.UTF-8	 	  2.61%		1,685,120	1,729,160
lorem_ipsum#zh_CN.UTF-8		-88.66%		806,176		91,423
lorem_ipsum#cs_CZ.UTF-8		 -4.89%		2,150,120	2,045,030
lorem_ipsum#en_GB.UTF-8		 -1.47%		2,061,960	2,031,620
lorem_ipsum#da_DK.UTF-8		  3.15%		1,703,710	1,757,390
lorem_ipsum#pl_PL.UTF-8		  0.86%		1,634,890	1,648,870
lorem_ipsum#fr_FR.UTF-8		 -2.06%		2,232,030	2,186,030
lorem_ipsum#pt_PT.UTF-8		 -2.60%		2,238,410	2,180,210
lorem_ipsum#el_GR.UTF-8		-34.52%		3,413,330	2,235,010
lorem_ipsum#ru_RU.UTF-8		 -9.88%		2,403,370	2,165,950
lorem_ipsum#iw_IL.UTF-8		 -9.56%		2,209,740	1,998,500
lorem_ipsum#es_ES.UTF-8	 	  4.92%		1,983,470	2,081,050
lorem_ipsum#hi_IN.UTF-8		-98.88%		220,453,000	2,458,620
lorem_ipsum#sv_SE.UTF-8		  1.79%		1,645,370	1,674,760
lorem_ipsum#hu_HU.UTF-8		  4.86%		3,179,620	3,334,290
lorem_ipsum#tr_TR.UTF-8		-23.59%		2,473,330	1,889,870
lorem_ipsum#is_IS.UTF-8		  2.49%		1,620,370	1,660,680
lorem_ipsum#it_IT.UTF-8		 -2.67%		2,186,160	2,127,710
lorem_ipsum#sr_RS.UTF-8		  2.70%		1,930,520	1,982,720
lorem_ipsum#ja_JP.UTF-8		-97.43%		958,411		24,664
wikipedia-th#en_US.UTF-8	-99.61%		10,511,700,000	40,577,100
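
The percentage column appears to be the relative change of the last column
against the one before it. This is inferred from the numbers themselves; the
helper name percent_change is only for illustration.

```c
#include <assert.h>

/* Relative change of AFTER against BEFORE, in percent.  Negative values
   mean the second measurement is smaller.  */
static double
percent_change (double before, double after)
{
  assert (before != 0.0);
  return (after - before) / before * 100.0;
}
```

For example, percent_change (806176, 91423) is about -88.66, matching the
zh_CN row.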

The performance numbers and the size of the patch changed due to the removal of
the strdiff optimization (#18589) and the included Thai test. Performance
degradation for locales in the ASCII plane is still minor. The patch considerably
speeds up strcoll for all languages that mostly use multibyte UTF-8 encoding.
Note that it should affect the regex performance of these languages too, though
there is no benchmark for that.

Regarding Carlos' comments:

>> +  struct element_t *mbheads[256 * 256];
>
> Use #define MBHEADS_SZ or something similar.

Ok.

>> +  bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0;
>
> OK.
>
> Will this always work? I'm just wondering about a user generated charmap that they
> call 'utf8', which is the other common alias for instance where the dash is not valid
> syntax. Probably not since the official name is UTF-8, and that's what you should use.

Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8".
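
A minimal sketch of such a check, assuming only the two spellings mentioned
above should be accepted. The helper name codeset_is_utf8 is hypothetical; in
the patch the test is the inline strcmp on charmap->code_set_name quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if CODE_SET_NAME is "UTF-8" or the alias "utf8".
   Any other spelling is treated as not UTF-8, which only costs speed.  */
static bool
codeset_is_utf8 (const char *code_set_name)
{
  assert (code_set_name != NULL);
  return (strcmp (code_set_name, "UTF-8") == 0
	  || strcmp (code_set_name, "utf8") == 0);
}
```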

>> +	  /* Special handling of UTF-8: Generate a 2-byte index to mbheads.
>> +	     Also check the UTF-8 encoding.  Keep locale/weight.h in sync.  */
>
> Not OK. Can we refactor to avoid keeping the two in sync?

Ok, there is a new function utf8index in locale/weight.h that does the job.

>> @@ -2239,7 +2281,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
>>
>>  		/* Compute how much space we will need.  */
>>  		added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1
>> -					  + 2 * (runp->nmbs - 1));
>> +					  + 2 * runp->nmbs);
>
> Doesn't the change to zero indexing make the conditional in the code above this wrong?
>
> e.g.
> 2230             if (runp->mbnext != NULL
> 2231                 && runp->nmbs == runp->mbnext->nmbs
> 2232                 && memcmp (runp->mbs, runp->mbnext->mbs, runp->nmbs - 1) == 0
> 2233                 && (runp->mbs[runp->nmbs - 1]
> 2234                     == runp->mbnext->mbs[runp->nmbs - 1] + 1))

No. runp traverses the input / locale definition file, and this is not affected
by the change. What happens here is a check whether the next Unicode literal has
the same byte sequence as the current one except for the last byte, which should
be 1 higher than the last byte of the current literal -> beginning of a sequence.

	* benchtests/bench-strcoll.c: Add thai text with en_US.UTF-8 locale.
	* benchtests/strcoll-inputs/wikipedia-th#en_US.UTF-8: New file.
	* locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE.
	* locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list.
	* locale/localeinfo.h: Add enum collation_encoding_type.
	* locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit.
	* locale/programs/ld-collate.c (struct locale_collate_t):
	Expand mbheads array from 256 to 16384 entries.
	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
	(collate_output): Output larger table and sequences including first byte.
	(collate_output): Add encoding type info.
	* locale/weight.h (utf8index): New function to calculate 2 byte index.
	(findidx): Use 2-byte index for table if UTF-8 locale.
	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
	* posix/regcomp.c (build_equiv_class): Likewise.
	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
	* posix/regexec.c (check_node_accept_bytes): Likewise.
	* string/strcoll_l.c (get_next_seq): Likewise.
	(STRCOLL): Call get_next_seq with encoding parameter.
	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
	(STRXFRM): Call find_idx with encoding parameter.
เนบิวลาปู เป็นซากซูเปอร์โนวาและเนบิวลาลมพัลซาร์ในกลุ่มดาววัว
เนบิวลานี้ได้รับการสังเกตโดยจอห์น เบวิส ในปี พ.ศ. 2274
ซึ่งสอดคล้องกับการบันทึกเหตุการณ์ซูเปอร์โนวาสว่างโดยนักดาราศาสตร์ชาวจีนและชาวอาหรับใน
พ.ศ. 1597 ที่ระดับรังสีเอกซ์และรังสีแกมมาสูงกว่า 30 กิโลอิเล็กตรอนโวลต์
เนบิวลาปูเป็นแหล่งพลังงานที่เข้มที่สุดบนท้องฟ้ามาอย่างต่อเนื่อง โดยสามารถวัดฟลักซ์ได้ถึงสูงกว่า
1012 อิเล็กตรอนโวลต์ เนบิวลาปูตั้งอยู่ห่างจากโลก 6,500 ปีแสง (2 กิโลพาร์เซก)
มีเส้นผ่านศูนย์กลาง 11 ปีแสง (3.4 พาร์เซก) และขยายตัวในอัตรา 1,500 กิโลเมตรต่อวินาที
ณ ใจกลางเนบิวลาปูเป็นที่อยู่ของพัลซาร์ปู ดาวนิวตรอนขนาดเส้นผ่านศูนย์กลาง 28-30 กิโลเมตร
ซึ่งปลดปล่อยรังสีตั้งแต่รังสีแกมมาไปจนถึงคลื่นวิทยุด้วยอัตราการหมุน 30.2 รอบต่อวินาที
เนบิวลาปูเป็นวัตถุทางดาราศาสตร์วัตถุแรกที่สามารถระบุได้จากการระเบิดซูเปอร์โนวาในประวัติศาสตร์
เนบิวลานี้ทำตัวเสมือนหนึ่งแหล่งกำเนิดรังสีสำหรับการศึกษาเทห์ฟากฟ้าที่เคลื่อนผ่านตัวมัน
ในช่วงปีพ.ศ. 2493 และ 2512
มีการทำแผนภูมิโคโรนาของดวงอาทิตย์ขึ้นจากการเฝ้าสังเกตคลื่นวิทยุจากเนบิวลาปูที่ผ่านชั้นโคโรนาไป
และในปี พ.ศ. 2546 เราสามารถวัดความหนาของชั้นบรรยากาศของดวงจันทร์ไททัน
ดาวบริวารของดาวเสาร์ได้จากการที่ชั้นบรรยากาศนี้กีดขวางรังสีเอกซ์จากเนบิวลา (อ่านต่อ...)
ฌอร์ฌ เลอแม็ทร์ นักวิทยาศาสตร์และพระโรมันคาทอลิก เป็นผู้เสนอแนวคิดการกำเนิดของเอกภพ
ซึ่งต่อมารู้จักกันในชื่อ ทฤษฎีบิกแบง ในเบื้องแรกเขาเรียกทฤษฎีนี้ว่า
สมมติฐานเกี่ยวกับอะตอมแรกเริ่ม (hypothesis of the primeval atom) อเล็กซานเดอร์
ฟรีดแมน
ทำการคำนวณแบบจำลองโดยมีกรอบการพิจารณาอยู่บนพื้นฐานของทฤษฎีสัมพัทธภาพทั่วไปของอัลเบิร์ต
ไอน์สไตน์ ต่อมาในปี ค.ศ. 1929 เอ็ดวิน ฮับเบิลค้นพบว่า
ระยะห่างของดาราจักรมีสัดส่วนที่เปลี่ยนแปลงสัมพันธ์กับการเคลื่อนไปทางแดง
การสังเกตการณ์นี้บ่งชี้ว่า ดาราจักรและกระจุกดาวอันห่างไกลกำลังเคลื่อนที่ออกจากจุดสังเกต
ซึ่งหมายความว่าเอกภพกำลังขยายตัว ยิ่งตำแหน่งดาราจักรไกลยิ่งขึ้น
ความเร็วปรากฏก็ยิ่งเพิ่มมากขึ้น หากเอกภพในปัจจุบันกำลังขยายตัว แสดงว่าก่อนหน้านี้
เอกภพย่อมมีขนาดเล็กกว่า หนาแน่นกว่า และร้อนกว่าที่เป็นอยู่
แนวคิดนี้มีการพิจารณาอย่างละเอียดย้อนไปจนถึงระดับความหนาแน่นและอุณหภูมิที่จุดสูงสุด
และผลสรุปที่ได้ก็สอดคล้องอย่างยิ่งกับผลจากการสังเกตการณ์
ทว่าการเพิ่มของอัตราเร่งมีข้อจำกัดในการตรวจสอบสภาวะพลังงานที่สูงขนาดนั้น
หากไม่มีข้อมูลอื่นที่ช่วยยืนยันสภาวะเริ่มต้นชั่วขณะก่อนการระเบิด
ลำพังทฤษฎีบิกแบงก็ยังไม่สามารถใช้อธิบายสภาวะเริ่มต้นได้
มันเพียงอธิบายกระบวนการเปลี่ยนแปลงของเอกภพที่เกิดขึ้นหลังจากสภาวะเริ่มต้นเท่านั้น
(อ่านต่อ...)
