[V3,BZ,#18441] fix sorting multibyte charsets with an improper locale

Patch v3: Replace _NL_CURRENT with _NL_CURRENT_WORD for reading the encoding.
Patch v2: Use the UTF-8 to codepoint conversion proposed by Ondřej.

In BZ #18441, sorting a Thai text with the en_US.UTF-8 locale causes a
performance regression. The problem has two causes:

a) en_US.UTF-8 has no information for Thai characters and so always reports a
zero sort weight, which forces the comparison to check the whole string instead
of stopping early, and

b) the sequence-to-weight list is partitioned by the first byte of the first
character (TABLEMB); this generates long lists for multibyte UTF-8 characters,
as they tend to share the same starting byte (e.g. all Thai characters start
with E0).
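
To illustrate point b), the following sketch encodes code points as UTF-8 and
counts how many of them share a given lead byte. The helper names utf8_encode
and count_with_lead_byte are hypothetical and not part of glibc; the sketch only
models the bucketing effect, not the real TABLEMB layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one BMP code point as UTF-8 (hypothetical helper, no surrogate
   check).  Returns the number of bytes written to OUT.  */
static size_t
utf8_encode (uint32_t cp, unsigned char out[3])
{
  assert (cp < 0x10000);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (unsigned char) (0xE0 | (cp >> 12));
  out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (cp & 0x3F));
  return 3;
}

/* Count the code points in [FIRST, LAST] whose UTF-8 form starts with
   LEAD, i.e. that would land in the same first-byte bucket.  */
static unsigned int
count_with_lead_byte (uint32_t first, uint32_t last, unsigned char lead)
{
  unsigned int n = 0;
  for (uint32_t cp = first; cp <= last; ++cp)
    {
      unsigned char buf[3];
      utf8_encode (cp, buf);
      if (buf[0] == lead)
        ++n;
    }
  return n;
}
```

Under these assumptions, count_with_lead_byte (0x0E00, 0x0E7F, 0xE0) covers the
whole Thai block, and every code point from U+0800 to U+0FFF falls into the same
E0 bucket.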

The approach of the patch is to interpret TABLEMB as a hash table and find a
better hash key. My first try was to somehow "fold" a multibyte character into
one byte, but that worsened the overall performance a lot. Enlarging the table
to 2-byte keys works much better while needing a reasonable amount of extra
memory.
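
As a rough picture of a 2-byte key that fits the 16384-entry table named in the
ChangeLog, consider the sketch below. The function key2 and its bit layout are
assumptions made for illustration only; the patch notes mention a UTF-8 to
code point conversion, and the exact key used by the patch is not reproduced
here.

```c
#include <assert.h>
#include <stddef.h>

/* 2^14 entries, the mbheads size named in the ChangeLog.  */
enum { TABLE_SIZE = 16384 };

/* Hypothetical 14-bit key: single bytes map to themselves; multibyte
   sequences combine the lead byte with the 6 payload bits of the
   second byte.  The largest value is (0xFF << 6) | 0x3F == 0x3FFF.  */
static unsigned int
key2 (const unsigned char *s, size_t len)
{
  assert (len > 0);
  if (s[0] < 0x80 || len < 2)
    return s[0];
  return ((unsigned int) s[0] << 6) | (s[1] & 0x3F);
}
```

With this layout, two characters that share the lead byte E0 but differ in the
second byte (for example a Thai and a Devanagari letter) get different keys, so
they no longer end up in the same list.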

The patch vastly improves the performance of languages with multibyte
characters (see zh_CN, hi_IN and ja_JP below; negative values are
improvements). A side effect is that some languages with one-byte characters
get a bit slower because of the extra check for the first byte while finding
the right sequence in the sequence list. This cannot be avoided since the hash
key is no longer equal to the first byte of the sequence. The tests pass.
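
The extra first-byte check can be pictured with the following sketch. The
struct seq_entry and the function find_in_bucket are hypothetical; glibc stores
the sequences in a packed table, not in an array of structs. The point is only
that, once the key is not the first byte itself, each candidate must be compared
starting at byte 0 instead of byte 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bucket entry: a full byte sequence (first byte
   included) and the index of its weights.  */
struct seq_entry
{
  const unsigned char *bytes;
  size_t len;
  int weight_idx;
};

/* Scan one bucket for an entry that is a prefix of S.  The memcmp
   starts at byte 0 because entries in the same bucket are not
   assumed to share their first byte.  Returns -1 if nothing matches.  */
static int
find_in_bucket (const struct seq_entry *bucket, size_t n,
                const unsigned char *s, size_t len)
{
  assert (s != NULL);
  for (size_t i = 0; i < n; ++i)
    if (bucket[i].len <= len
        && memcmp (bucket[i].bytes, s, bucket[i].len) == 0)
      return bucket[i].weight_idx;
  return -1;
}
```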

filelist#C			  1.73%
filelist#en_US.UTF-8		  0.54%
lorem_ipsum#vi_VN.UTF-8		  1.90%
lorem_ipsum#ar_SA.UTF-8		-12.06%
lorem_ipsum#en_US.UTF-8		  1.15%
lorem_ipsum#zh_CN.UTF-8		-86.32%
lorem_ipsum#cs_CZ.UTF-8		-11.42%
lorem_ipsum#en_GB.UTF-8		 -3.09%
lorem_ipsum#da_DK.UTF-8		  6.70%
lorem_ipsum#pl_PL.UTF-8		 -1.04%
lorem_ipsum#fr_FR.UTF-8		 -1.22%
lorem_ipsum#pt_PT.UTF-8		  0.47%
lorem_ipsum#el_GR.UTF-8		-29.40%
lorem_ipsum#ru_RU.UTF-8		-11.79%
lorem_ipsum#iw_IL.UTF-8		 -1.39%
lorem_ipsum#es_ES.UTF-8		  3.91%
lorem_ipsum#hi_IN.UTF-8		-98.26%
lorem_ipsum#sv_SE.UTF-8		  5.61%
lorem_ipsum#hu_HU.UTF-8		 15.32%
lorem_ipsum#tr_TR.UTF-8		 -3.51%
lorem_ipsum#is_IS.UTF-8		  5.62%
lorem_ipsum#it_IT.UTF-8		 -5.97%
lorem_ipsum#sr_RS.UTF-8		 -1.19%
lorem_ipsum#ja_JP.UTF-8		-98.11%
wikipedia-th#en_US.UTF-8	-99.63%

	* locale/programs/ld-collate.c (struct locale_collate_t):
	Expand mbheads array from 256 to 16384 entries.
	(collate_finish): Generate 2-byte key for mbheads if UTF-8 locale.
	(collate_output): Output larger table and sequences including first byte.
	* locale/weight.h (findidx): Use 2-byte key for table if UTF-8 locale.
	* locale/weightwc.h (findidx): Accept encoding parameter, not used.
	* posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter.
	* posix/regcomp.c (build_equiv_class): Likewise.
	* posix/regex_internal.h (re_string_elem_size_at): Likewise.
	* posix/regexec.c (check_node_accept_bytes): Likewise.
	* string/strcoll_l.c (get_next_seq): Likewise.
	(STRCOLL): Call get_next_seq with encoding parameter.
	* string/strxfrm_l.c (find_idx): Call findidx with encoding parameter.
	(STRXFRM): Call find_idx with encoding parameter.
