From patchwork Mon Feb 29 07:53:20 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Leonhard Holz X-Patchwork-Id: 11138 Received: (qmail 21411 invoked by alias); 29 Feb 2016 07:53:37 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 21367 invoked by uid 89); 29 Feb 2016 07:53:36 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: =?ISO-8859-1?Q?Yes, score=5.6 required=5.0 tests=BAYES_50, BODY_8BITS, FREEMAIL_FROM, GARBLED_BODY, KAM_STOCKGEN, RCVD_IN_DNSWL_LOW, RP_MATCHES_RCVD, SPF_PASS autolearn=no version=3.3.2 spammy=8:=b8=b4, 8:=b8=a3, 8:=b1=e0, 8:=b9?= X-HELO: mout.web.de From: Leonhard Holz Subject: [PATCH V4][BZ #18441] fix sorting multibyte charsets with an improper locale To: GNU C Library Cc: Carlos O'Donell Message-ID: <56D3F8F0.8070401@web.de> Date: Mon, 29 Feb 2016 08:53:20 +0100 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 X-UI-Out-Filterresults: notjunk:1; V01:K0:ErXvnSjD38g=:ldO1ZAQVBzhnR/ufp2PnJV Fxetd9kwldcYcWxZNlaR8Uw27XRsNm3kDNU639ZIUMN388nCVlQ0rSyR2e9N/CAqi5UN0hGwI gTl2QoCDFeKQHI3MwUUnAgz/KqAydMBkFPAc7x6BvVjOTI8xlKdRdlkV/dJ8v2l/enD7J3MsB qG0TOJ+Sd6bI871iNA8eoRS6d9rh4cIPNxecYFLPl1/rfxum50wi7amLZBZuNi5yxKcYCuw57 A5yj7+A42GMfbzYAfdKggItYI4NnrBcL+mD02Vm7wBeTfdxVS8hsqFaauk9l63ayIdExfCnvT +/OGU37QS/L7Gydrn1rNsZ68lt9LEFpWVipKKYfruJEQJbW60fAmmZhEjcs6FFvvQ7YfKMM/z u9c2li7r+irRnNyoWIT84XzXyx4wz8LJDEcbgSTWWVbYo8VHlECJBuJtXhUvXOYNCeWJzYU/v GOqG/piYK12lrHIsICZs5jRRqHOzyekK3RHbnKZIvl71chMOaf6QNx0MSLRIy+InVg2OO8rad xhHSsLtHRw7OzATbuj8MAnLEl0XBk/6+OdxuqU10WZbAQcrDP/AvBkhEgaXpnp/3cN05yMbL6 dPe1+VjL+wO2XFv3OXx8TqDzkb4lctuujCrH+f9K+nqhf9Brgd0Z//BVDyDnqm6KM/z8cPrXo K1NwkNP8IvE3DQxa8rkSYyKIItGTbFF+PMMkuyMR+/gOeyO9pqCOlCkicUngEDIwtOR8Uekpz DZXR4t+8H5IX8FgMhCSFlvkWfTDh6FUW+GqnGAW6Ol4WFiKjCQyzCOF2CJw= In BZ #18441 sorting a thai text with the en_US.UTF-8 locale causes a performance regression. The cause of the problem is that a) en_US.UTF-8 has no informations for thai chars and so always reports a zero sort weight which causes the comparison to check the whole string instead of breaking up early and b) the sequence-to-weight list is partitioned by the first byte of the first character (TABLEMB); this generates long lists for multibyte UTF-8 characters as they tend to have an equal starting byte (e.g. all thai chars start with E0). The approach of the patch is to interprete TABLEMB as a hashtable and find a better hash key. My first try was to somehow "fold" a multibyte character into one byte but that worsened the overall performance a lot. Enhancing the table to 2 byte keys works much better while needing a reasonable amount of extra memory. The patch vastly improves the performance of languages with multibyte chars (see zh_CN, hi_IN and ja_JP below). A side effect is that some languages with one-byte chars get a bit slower because of the extra check for the first byte while finding the right sequence in the sequence list . It cannot be avoided since the hash key is not longer equal to the first byte of the sequence. Tests are ok. filelist#C 1.75% 23,396,200 23,805,700 filelist#en_US.UTF-8 1.42% 77,186,200 78,285,200 lorem_ipsum#vi_VN.UTF-8 -1.70% 1,680,740 1,652,110 lorem_ipsum#ar_SA.UTF-8 -7.71% 2,134,780 1,970,170 lorem_ipsum#en_US.UTF-8 2.61% 1,685,120 1,729,160 lorem_ipsum#zh_CN.UTF-8 -88.66% 806,176 91,423 lorem_ipsum#cs_CZ.UTF-8 -4.89% 2,150,120 2,045,030 lorem_ipsum#en_GB.UTF-8 -1.47% 2,061,960 2,031,620 lorem_ipsum#da_DK.UTF-8 3.15% 1,703,710 1,757,390 lorem_ipsum#pl_PL.UTF-8 0.86% 1,634,890 1,648,870 lorem_ipsum#fr_FR.UTF-8 -2.06% 2,232,030 2,186,030 lorem_ipsum#pt_PT.UTF-8 -2.60% 2,238,410 2,180,210 lorem_ipsum#el_GR.UTF-8 -34.52% 3,413,330 2,235,010 lorem_ipsum#ru_RU.UTF-8 -9.88% 2,403,370 2,165,950 lorem_ipsum#iw_IL.UTF-8 -9.56% 2,209,740 1,998,500 lorem_ipsum#es_ES.UTF-8 4.92% 1,983,470 2,081,050 lorem_ipsum#hi_IN.UTF-8 -98.88% 220,453,000 2,458,620 lorem_ipsum#sv_SE.UTF-8 1.79% 1,645,370 1,674,760 lorem_ipsum#hu_HU.UTF-8 4.86% 3,179,620 3,334,290 lorem_ipsum#tr_TR.UTF-8 -23.59% 2,473,330 1,889,870 lorem_ipsum#is_IS.UTF-8 2.49% 1,620,370 1,660,680 lorem_ipsum#it_IT.UTF-8 -2.67% 2,186,160 2,127,710 lorem_ipsum#sr_RS.UTF-8 2.70% 1,930,520 1,982,720 lorem_ipsum#ja_JP.UTF-8 -97.43% 958,411 24,664 wikipedia-th#en_US.UTF-8 -99.61% 10,511,700,000 40,577,100 The performance numbers and the size of the patch changed due to the removal of the strdiff optimization (#18589) and the included thai test. Performance degration for locales in the ASCII plane is still minor. It does increase the speed of strcoll for all languages that mostly use multiple byte UTF-8 encoding a lot. Note that it should affect the regex performance of these languages too, though there is no benchmark for that. Regarding Carlos comments: >> + struct element_t *mbheads[256 * 256]; > > Use #define MBHEADS_SZ or something similar. Ok. >> + bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0; > > OK. > > Will this always work? I'm just wondering about a user generated charmap that they > call 'utf8', which is the other common alias for instance where the dash is not valid > syntax. Probably not since the official name is UTF-8, and that's what you should use. Well, if it does not work it's just a speed penalty. But there is no problem in adding a check for "utf8". >> + /* Special handling of UTF-8: Generate a 2-byte index to mbheads. >> + Also check the UTF-8 encoding. Keep locale/weight.h in sync. */ > > Not OK. Can we refactor to avoid keeing the two in sync? Ok, there is a new function utf8index in locale/weight.h that does the job. >> @@ -2239,7 +2281,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, >> >> /* Compute how much space we will need. */ >> added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 >> - + 2 * (runp->nmbs - 1)); >> + + 2 * runp->nmbs); > > Doesn't the change to zero indexing make the conditional in the code above this wrong? > > e.g. > 2230 if (runp->mbnext != NULL > 2231 && runp->nmbs == runp->mbnext->nmbs > 2232 && memcmp (runp->mbs, runp->mbnext->mbs, runp->nmbs - 1) == 0 > 2233 && (runp->mbs[runp->nmbs - 1] > 2234 == runp->mbnext->mbs[runp->nmbs - 1] + 1)) No. runp traverses through the input / locale definition file and this is not affected by the change. What happens here is a check if the next unicode literal has the same byte sequence as the current except for the last byte, which should be 1 higher than the last byte of the current literal -> beginning of a sequence. * benchtests/bench-strcoll.c: Add thai text with en_US.UTF-8 locale. * benchtests/strcoll-inputs/wikipedia-th#en_US.UTF-8: New file. * locale/categories.def: Define _NL_COLLATE_ENCODING_TYPE. * locale/langinfo.h: Add _NL_COLLATE_ENCODING_TYPE to attribute list. * locale/localeinfo.h: Add enum collation_encoding_type. * locale/C-collate.c: Set _NL_COLLATE_ENCODING_TYPE to 8bit. * locale/programs/ld-collate.c (struct locale_collate_t): Expand mbheads array from 256 to 16384 entries. (collate_finish): Generate 2-byte key for mbheads if UTF-8 locale. (collate_output): Output larger table and sequences including first byte. (collate_output): Add encoding type info. * locale/weight.h (utf8index): New function to calculate 2 byte index. (findidx): Use 2-byte index for table if UTF-8 locale. * locale/weightwc.h (findidx): Accept encoding parameter, not used. * posix/fnmatch_loop.c (FCT): Call findidx with encoding parameter. * posix/regcomp.c (build_equiv_class): Likewise. * posix/regex_internal.h (re_string_elem_size_at): Likewise. * posix/regexec.c (check_node_accept_bytes): Likewise. * string/strcoll_l.c (get_next_seq): Likewise. (STRCOLL): Call get_next_seq with encoding parameter. * string/strxfrm_l.c (find_idx): Call findidx with encoding parameter. (STRXFRM): Call find_idx with encoding parameter. เนบิวลาปู เป็นซากซูเปอร์โนวาและเนบิวลาลมพัลซาร์ในกลุ่มดาววัว เนบิวลานี้ได้รับการสังเกตโดยจอห์น เบวิส ในปี พ.ศ. 2274 ซึ่งสอดคล้องกับการบันทึกเหตุการณ์ซูเปอร์โนวาสว่างโดยนักดาราศาสตร์ชาวจีนและชาวอาหรับใน พ.ศ. 1597 ที่ระดับรังสีเอกซ์และรังสีแกมมาสูงกว่า 30 กิโลอิเล็กตรอนโวลต์ เนบิวลาปูเป็นแหล่งพลังงานที่เข้มที่สุดบนท้องฟ้ามาอย่างต่อเนื่อง โดยสามารถวัดฟลักซ์ได้ถึงสูงกว่า 1012 อิเล็กตรอนโวลต์ เนบิวลาปูตั้งอยู่ห่างจากโลก 6,500 ปีแสง (2 กิโลพาร์เซก) มีเส้นผ่านศูนย์กลาง 11 ปีแสง (3.4 พาร์เซก) และขยายตัวในอัตรา 1,500 กิโลเมตรต่อวินาที ณ ใจกลางเนบิวลาปูเป็นที่อยู่ของพัลซาร์ปู ดาวนิวตรอนขนาดเส้นผ่านศูนย์กลาง 28-30 กิโลเมตร ซึ่งปลดปล่อยรังสีตั้งแต่รังสีแกมมาไปจนถึงคลื่นวิทยุด้วยอัตราการหมุน 30.2 รอบต่อวินาที เนบิวลาปูเป็นวัตถุทางดาราศาสตร์วัตถุแรกที่สามารถระบุได้จากการระเบิดซูเปอร์โนวาในประวัติศาสตร์ เนบิวลานี้ทำตัวเสมือนหนึ่งแหล่งกำเนิดรังสีสำหรับการศึกษาเทห์ฟากฟ้าที่เคลื่อนผ่านตัวมัน ในช่วงปีพ.ศ. 2493 และ 2512 มีการทำแผนภูมิโคโรนาของดวงอาทิตย์ขึ้นจากการเฝ้าสังเกตคลื่นวิทยุจากเนบิวลาปูที่ผ่านชั้นโคโรนาไป และในปี พ.ศ. 2546 เราสามารถวัดความหนาของชั้นบรรยากาศของดวงจันทร์ไททัน ดาวบริวารของดาวเสาร์ได้จากการที่ชั้นบรรยากาศนี้กีดขวางรังสีเอกซ์จากเนบิวลา (อ่านต่อ...) ฌอร์ฌ เลอแม็ทร์ นักวิทยาศาสตร์และพระโรมันคาทอลิก เป็นผู้เสนอแนวคิดการกำเนิดของเอกภพ ซึ่งต่อมารู้จักกันในชื่อ ทฤษฎีบิกแบง ในเบื้องแรกเขาเรียกทฤษฎีนี้ว่า สมมติฐานเกี่ยวกับอะตอมแรกเริ่ม (hypothesis of the primeval atom) อเล็กซานเดอร์ ฟรีดแมน ทำการคำนวณแบบจำลองโดยมีกรอบการพิจารณาอยู่บนพื้นฐานของทฤษฎีสัมพัทธภาพทั่วไปของอัลเบิร์ต ไอน์สไตน์ ต่อมาในปี ค.ศ. 1929 เอ็ดวิน ฮับเบิลค้นพบว่า ระยะห่างของดาราจักรมีสัดส่วนที่เปลี่ยนแปลงสัมพันธ์กับการเคลื่อนไปทางแดง การสังเกตการณ์นี้บ่งชี้ว่า ดาราจักรและกระจุกดาวอันห่างไกลกำลังเคลื่อนที่ออกจากจุดสังเกต ซึ่งหมายความว่าเอกภพกำลังขยายตัว ยิ่งตำแหน่งดาราจักรไกลยิ่งขึ้น ความเร็วปรากฏก็ยิ่งเพิ่มมากขึ้น หากเอกภพในปัจจุบันกำลังขยายตัว แสดงว่าก่อนหน้านี้ เอกภพย่อมมีขนาดเล็กกว่า หนาแน่นกว่า และร้อนกว่าที่เป็นอยู่ แนวคิดนี้มีการพิจารณาอย่างละเอียดย้อนไปจนถึงระดับความหนาแน่นและอุณหภูมิที่จุดสูงสุด และผลสรุปที่ได้ก็สอดคล้องอย่างยิ่งกับผลจากการสังเกตการณ์ ทว่าการเพิ่มของอัตราเร่งมีข้อจำกัดในการตรวจสอบสภาวะพลังงานที่สูงขนาดนั้น หากไม่มีข้อมูลอื่นที่ช่วยยืนยันสภาวะเริ่มต้นชั่วขณะก่อนการระเบิด ลำพังทฤษฎีบิกแบงก็ยังไม่สามารถใช้อธิบายสภาวะเริ่มต้นได้ มันเพียงอธิบายกระบวนการเปลี่ยนแปลงของเอกภพที่เกิดขึ้นหลังจากสภาวะเริ่มต้นเท่านั้น (อ่านต่อ...) diff --git a/benchtests/bench-strcoll.c b/benchtests/bench-strcoll.c index 22ae87c..6ce5b2a 100644 --- a/benchtests/bench-strcoll.c +++ b/benchtests/bench-strcoll.c @@ -53,7 +53,8 @@ static const char *const input_files[] = { "lorem_ipsum#is_IS.UTF-8", "lorem_ipsum#it_IT.UTF-8", "lorem_ipsum#sr_RS.UTF-8", - "lorem_ipsum#ja_JP.UTF-8" + "lorem_ipsum#ja_JP.UTF-8", + "wikipedia-th#en_US.UTF-8" }; #define TEXTFILE_DELIMITER " \n\r\t.,?!" diff --git a/locale/C-collate.c b/locale/C-collate.c index 8214ff5..5a9ed6a 100644 --- a/locale/C-collate.c +++ b/locale/C-collate.c @@ -144,6 +144,8 @@ const struct __locale_data _nl_C_LC_COLLATE attribute_hidden = /* _NL_COLLATE_COLLSEQWC */ { .string = (const char *) collseqwc }, /* _NL_COLLATE_CODESET */ - { .string = _nl_C_codeset } + { .string = _nl_C_codeset }, + /* _NL_COLLATE_ENCODING_TYPE */ + { .word = __cet_8bit } } }; diff --git a/locale/categories.def b/locale/categories.def index d8a3ab8..cb57eae 100644 --- a/locale/categories.def +++ b/locale/categories.def @@ -58,6 +58,7 @@ DEFINE_CATEGORY DEFINE_ELEMENT (_NL_COLLATE_COLLSEQMB, "collate-collseqmb", std, wstring) DEFINE_ELEMENT (_NL_COLLATE_COLLSEQWC, "collate-collseqwc", std, wstring) DEFINE_ELEMENT (_NL_COLLATE_CODESET, "collate-codeset", std, string) + DEFINE_ELEMENT (_NL_COLLATE_ENCODING_TYPE, "collate-encoding-type", std, word) ), NO_POSTLOAD) diff --git a/locale/langinfo.h b/locale/langinfo.h index 481e226..0906a6a 100644 --- a/locale/langinfo.h +++ b/locale/langinfo.h @@ -255,6 +255,7 @@ enum _NL_COLLATE_COLLSEQMB, _NL_COLLATE_COLLSEQWC, _NL_COLLATE_CODESET, + _NL_COLLATE_ENCODING_TYPE, _NL_NUM_LC_COLLATE, /* LC_CTYPE category: character classification. diff --git a/locale/localeinfo.h b/locale/localeinfo.h index 5c4e6ef..bd284df 100644 --- a/locale/localeinfo.h +++ b/locale/localeinfo.h @@ -110,6 +110,14 @@ enum coll_sort_rule sort_mask }; +/* Collation encoding type. */ +enum collation_encoding_type +{ + __cet_other, + __cet_8bit, + __cet_utf8 +}; + /* We can map the types of the entries into a few categories. */ enum value_type { diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c index 1e125f6..efaacf6 100644 --- a/locale/programs/ld-collate.c +++ b/locale/programs/ld-collate.c @@ -32,6 +32,8 @@ #include "linereader.h" #include "locfile.h" #include "elem-hash.h" +#include "../localeinfo.h" +#include "../locale/weight.h" /* Uncomment the following line in the production version. */ /* #define NDEBUG 1 */ @@ -243,9 +245,10 @@ struct locale_collate_t Therefore we keep all relevant input in a list. */ struct locale_collate_t *next; - /* Arrays with heads of the list for each of the leading bytes in + /* Arrays with heads of the list for the leading bytes in the multibyte sequences. */ - struct element_t *mbheads[256]; + #define MBHEADS_SZ (256 * 256) + struct element_t *mbheads[MBHEADS_SZ]; /* Arrays with heads of the list for each of the leading bytes in the multibyte sequences. */ @@ -1557,6 +1560,7 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap) struct section_list *sect; int ruleidx; int nr_wide_elems = 0; + bool is_utf8 = strcmp (charmap->code_set_name, "UTF-8") == 0; if (collate == NULL) { @@ -1663,7 +1667,22 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap) struct element_t *lastp = NULL; /* Find the point where to insert in the list. */ - eptr = &collate->mbheads[((unsigned char *) runp->mbs)[0]]; + uint16_t index = ((unsigned char *) runp->mbs)[0]; + + /* Special handling of UTF-8: Generate a 2-byte index to mbheads. */ + if (is_utf8 && index > 0) + { + index = utf8index((unsigned char *) runp->mbs, runp->nmbs); + if (index == 0) + { + WITH_CUR_LOCALE (error_at_line (0, 0, runp->file, runp->line, + _("\ +malformed UTF-8 character in `%s'"), runp->name);); + goto dont_insert; + } + } + + eptr = &collate->mbheads[index]; while (*eptr != NULL) { if ((*eptr)->nmbs < runp->nmbs) @@ -1734,7 +1753,7 @@ symbol `%s' has the same encoding as"), (*eptr)->name); /* Find out whether any of the `mbheads' entries is unset. In this case we use the UNDEFINED entry. */ - for (i = 1; i < 256; ++i) + for (i = 1; i < MBHEADS_SZ; ++i) if (collate->mbheads[i] == NULL) { need_undefined = 1; @@ -2107,7 +2126,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, const size_t nelems = _NL_ITEM_INDEX (_NL_NUM_LC_COLLATE); struct locale_file file; size_t ch; - int32_t tablemb[256]; + int32_t tablemb[MBHEADS_SZ]; struct obstack weightpool; struct obstack extrapool; struct obstack indirectpool; @@ -2130,6 +2149,8 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, /* The words have to be handled specially. */ if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB)) add_locale_uint32 (&file, 0); + else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)) + add_locale_uint32 (&file, __cet_other); else add_locale_empty (&file); } @@ -2183,7 +2204,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, if (collate->undefined.used_in_level != 0) output_weight (&weightpool, collate, &collate->undefined); - for (ch = 1; ch < 256; ++ch) + for (ch = 1; ch < MBHEADS_SZ; ++ch) if (collate->mbheads[ch]->mbnext == NULL && collate->mbheads[ch]->nmbs <= 1) { @@ -2208,7 +2229,6 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, and add only one index into the weight table. We can find the consecutive entries since they are also consecutive in the list. */ struct element_t *runp = collate->mbheads[ch]; - struct element_t *lastp; assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); @@ -2236,7 +2256,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, /* Compute how much space we will need. */ added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 - + 2 * (runp->nmbs - 1)); + + 2 * runp->nmbs); assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); obstack_make_room (&extrapool, added); @@ -2259,9 +2279,9 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, /* Now walk backward from here to the beginning. */ curp = runp; - assert (runp->nmbs <= 256); - obstack_1grow_fast (&extrapool, curp->nmbs - 1); - for (i = 1; i < curp->nmbs; ++i) + assert (runp->nmbs <= 255); + obstack_1grow_fast (&extrapool, curp->nmbs); + for (i = 0; i < curp->nmbs; ++i) obstack_1grow_fast (&extrapool, curp->mbs[i]); /* Now find the end of the consecutive sequence and @@ -2281,7 +2301,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, /* And add the end byte sequence. Without length this time. */ - for (i = 1; i < curp->nmbs; ++i) + for (i = 0; i < curp->nmbs; ++i) obstack_1grow_fast (&extrapool, curp->mbs[i]); } else @@ -2295,15 +2315,15 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, weightidx = output_weight (&weightpool, collate, runp); added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 - + runp->nmbs - 1); + + runp->nmbs); assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); obstack_make_room (&extrapool, added); obstack_int32_grow_fast (&extrapool, weightidx); - assert (runp->nmbs <= 256); - obstack_1grow_fast (&extrapool, runp->nmbs - 1); + assert (runp->nmbs <= 255); + obstack_1grow_fast (&extrapool, runp->nmbs); - for (i = 1; i < runp->nmbs; ++i) + for (i = 0; i < runp->nmbs; ++i) obstack_1grow_fast (&extrapool, runp->mbs[i]); } @@ -2312,30 +2332,25 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, obstack_1grow_fast (&extrapool, '\0'); /* Next entry. */ - lastp = runp; runp = runp->mbnext; } while (runp != NULL); assert (LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))); - /* If the final entry in the list is not a single character we - add an UNDEFINED entry here. */ - if (lastp->nmbs != 1) - { - int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1); - obstack_make_room (&extrapool, added); + /* Add an UNDEFINED entry at the end of the list. */ + int added = LOCFILE_ALIGN_UP (sizeof (int32_t) + 1 + 1); + obstack_make_room (&extrapool, added); - obstack_int32_grow_fast (&extrapool, 0); - /* XXX What rule? We just pick the first. */ - obstack_1grow_fast (&extrapool, 0); - /* Length is zero. */ - obstack_1grow_fast (&extrapool, 0); + obstack_int32_grow_fast (&extrapool, 0); + /* XXX What rule? We just pick the first. */ + obstack_1grow_fast (&extrapool, 0); + /* Length is zero. */ + obstack_1grow_fast (&extrapool, 0); - /* Add alignment bytes if necessary. */ - while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))) - obstack_1grow_fast (&extrapool, '\0'); - } + /* Add alignment bytes if necessary. */ + while (!LOCFILE_ALIGNED_P (obstack_object_size (&extrapool))) + obstack_1grow_fast (&extrapool, '\0'); } /* Add padding to the tables if necessary. */ @@ -2343,7 +2358,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, obstack_1grow (&weightpool, 0); /* Now add the four tables. */ - add_locale_uint32_array (&file, (const uint32_t *) tablemb, 256); + add_locale_uint32_array (&file, (const uint32_t *) tablemb, MBHEADS_SZ); add_locale_raw_obstack (&file, &weightpool); add_locale_raw_obstack (&file, &extrapool); add_locale_raw_obstack (&file, &indirectpool); @@ -2493,6 +2508,12 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap, add_locale_raw_data (&file, collate->mbseqorder, 256); add_locale_collseq_table (&file, &collate->wcseqorder); add_locale_string (&file, charmap->code_set_name); + if (strcmp (charmap->code_set_name, "UTF-8") == 0) + add_locale_uint32 (&file, __cet_utf8); + else if (charmap->mb_cur_max == 1) + add_locale_uint32 (&file, __cet_8bit); + else + add_locale_uint32 (&file, __cet_other); write_locale_data (output_path, LC_COLLATE, "LC_COLLATE", &file); obstack_free (&weightpool, NULL); diff --git a/locale/weight.h b/locale/weight.h index c99730c..5b4103b 100644 --- a/locale/weight.h +++ b/locale/weight.h @@ -19,26 +19,81 @@ #ifndef _WEIGHT_H_ #define _WEIGHT_H_ 1 +/* Generate 2 byte code for the next UTF-8 encoded char. + Returns zero on UTF-8 encoding errors. */ +static __always_inline uint16_t +utf8index (const unsigned char *cp, size_t len) +{ + uint16_t index = cp[0]; + + if (index >= 0x80) + { + if (index < 0xE0) + { + if (len < 2) + return 0; + uint16_t byte2 = cp[1]; + index = (index << 6) + byte2 - 0x3080; + } + else if (index < 0xF0) + { + if (len < 3) + return 0; + uint16_t byte2 = cp[1]; + uint16_t byte3 = cp[2]; + index = (index << 12) + (byte2 << 6) + byte3 - 0xE2080; + } + else if (index < 0xF8) + { + if (len < 4) + return 0; + uint16_t byte2 = cp[1]; + uint16_t byte3 = cp[2]; + uint16_t byte4 = cp[3]; + index = (byte2 << 12) + (byte3 << 6) + byte4 - 0x82080; + } + else + return 0; + } + + return index; +} + /* Find index of weight. */ static inline int32_t __attribute__ ((always_inline)) -findidx (const int32_t *table, +findidx (uint_fast32_t locale_encoding, + const int32_t *table, const int32_t *indirect, const unsigned char *extra, const unsigned char **cpp, size_t len) { - int_fast32_t i = table[*(*cpp)++]; const unsigned char *cp; const unsigned char *usrc; + uint16_t index = (*cpp)[0]; + + /* Special handling of UTF-8: Generate a 2-byte index for table. */ + if (index >= 0x80 && locale_encoding == __cet_utf8) + { + index = utf8index(*cpp, len); + if (index == 0) + { + *cpp += 1; + return 0; + } + } + int_fast32_t i = table[index]; if (i >= 0) - /* This is an index into the weight table. Cool. */ - return i; + { + /* This is an index into the weight table. Cool. */ + *cpp += 1; + return i; + } /* Oh well, more than one sequence starting with this byte. Search for the correct one. */ cp = &extra[-i]; usrc = *cpp; - --len; while (1) { size_t nhere; @@ -57,8 +112,7 @@ findidx (const int32_t *table, /* It is a single character. If it matches we found our index. Note that at the end of each list there is an entry of length zero which represents the single byte - sequence. The first (and here only) byte was tested - already. */ + sequence. */ size_t cnt; for (cnt = 0; cnt < nhere && cnt < len; ++cnt) @@ -68,7 +122,7 @@ findidx (const int32_t *table, if (cnt == nhere) { /* Found it. */ - *cpp += nhere; + *cpp += nhere > 0 ? nhere : 1; return i; } @@ -127,7 +181,7 @@ findidx (const int32_t *table, while (++cnt < nhere); } - *cpp += nhere; + *cpp += nhere > 0 ? nhere : 1; return indirect[-i + offset]; } } diff --git a/locale/weightwc.h b/locale/weightwc.h index ab26482..4101dc8 100644 --- a/locale/weightwc.h +++ b/locale/weightwc.h @@ -21,7 +21,8 @@ /* Find index of weight. */ static inline int32_t __attribute__ ((always_inline)) -findidx (const int32_t *table, +findidx (uint_fast32_t encoding, + const int32_t *table, const int32_t *indirect, const wint_t *extra, const wint_t **cpp, size_t len) diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c index 229904e..07b60fb 100644 --- a/posix/fnmatch_loop.c +++ b/posix/fnmatch_loop.c @@ -383,6 +383,8 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, const int32_t *indirect; int32_t idx; const UCHAR *cp = (const UCHAR *) &str; + uint_fast32_t encoding = + _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); # if WIDE_CHAR_VERSION table = (const int32_t *) @@ -404,7 +406,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); # endif - idx = FINDIDX (table, indirect, extra, &cp, 1); + idx = FINDIDX (encoding, table, indirect, extra, &cp, 1); if (idx != 0) { /* We found a table entry. Now see whether the @@ -414,7 +416,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end, int32_t idx2; const UCHAR *np = (const UCHAR *) n; - idx2 = FINDIDX (table, indirect, extra, + idx2 = FINDIDX (encoding, table, indirect, extra, &np, string_end - n); if (idx2 != 0 && (idx >> 24) == (idx2 >> 24) diff --git a/posix/regcomp.c b/posix/regcomp.c index b6126b7..011ef92 100644 --- a/posix/regcomp.c +++ b/posix/regcomp.c @@ -3414,6 +3414,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES); if (nrules != 0) { + uint_fast32_t encoding; const int32_t *table, *indirect; const unsigned char *weights, *extra, *cp; unsigned char char_buf[2]; @@ -3422,6 +3423,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) size_t len; /* Calculate the index for equivalence class. */ cp = name; + encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); weights = (const unsigned char *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_WEIGHTMB); @@ -3429,7 +3431,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) _NL_COLLATE_EXTRAMB); indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); - idx1 = findidx (table, indirect, extra, &cp, -1); + idx1 = findidx (encoding, table, indirect, extra, &cp, -1); if (BE (idx1 == 0 || *cp != '\0', 0)) /* This isn't a valid character. */ return REG_ECOLLATE; @@ -3440,7 +3442,7 @@ build_equiv_class (bitset_t sbcset, const unsigned char *name) { char_buf[0] = ch; cp = char_buf; - idx2 = findidx (table, indirect, extra, &cp, 1); + idx2 = findidx (encoding, table, indirect, extra, &cp, 1); /* idx2 = table[ch]; */ diff --git a/posix/regex_internal.h b/posix/regex_internal.h index 02e040b..993c7c3 100644 --- a/posix/regex_internal.h +++ b/posix/regex_internal.h @@ -743,17 +743,19 @@ re_string_elem_size_at (const re_string_t *pstr, int idx) # ifdef _LIBC const unsigned char *p, *extra; const int32_t *table, *indirect; + uint_fast32_t encoding; uint_fast32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES); if (nrules != 0) { + encoding = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); extra = (const unsigned char *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB); indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); p = pstr->mbs + idx; - findidx (table, indirect, extra, &p, pstr->len - idx); + findidx (encoding, table, indirect, extra, &p, pstr->len - idx); return p - pstr->mbs - idx; } else diff --git a/posix/regexec.c b/posix/regexec.c index ec46c3a..3d3ad9a 100644 --- a/posix/regexec.c +++ b/posix/regexec.c @@ -3843,6 +3843,7 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, if (nrules != 0) { unsigned int in_collseq = 0; + uint_fast32_t encoding; const int32_t *table, *indirect; const unsigned char *weights, *extra; const char *collseqwc; @@ -3893,6 +3894,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, if (cset->nequiv_classes) { const unsigned char *cp = pin; + encoding = + _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_ENCODING_TYPE); table = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_TABLEMB); weights = (const unsigned char *) @@ -3901,7 +3904,8 @@ check_node_accept_bytes (const re_dfa_t *dfa, int node_idx, _NL_CURRENT (LC_COLLATE, _NL_COLLATE_EXTRAMB); indirect = (const int32_t *) _NL_CURRENT (LC_COLLATE, _NL_COLLATE_INDIRECTMB); - int32_t idx = findidx (table, indirect, extra, &cp, elem_len); + int32_t idx = findidx (encoding, table, indirect, extra, &cp, + elem_len); if (idx > 0) for (i = 0; i < cset->nequiv_classes; ++i) { diff --git a/string/strcoll_l.c b/string/strcoll_l.c index 4d1e3ab..2c2cab0 100644 --- a/string/strcoll_l.c +++ b/string/strcoll_l.c @@ -63,9 +63,9 @@ typedef struct /* Get next sequence. Traverse the string as required. */ static __always_inline void get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, - const USTRING_TYPE *weights, const int32_t *table, - const USTRING_TYPE *extra, const int32_t *indirect, - int pass) + const USTRING_TYPE *weights, uint_fast32_t encoding, + const int32_t *table, const USTRING_TYPE *extra, + const int32_t *indirect, int pass) { size_t val = seq->val = 0; int len = seq->len; @@ -109,7 +109,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, us = seq->back_us; while (i < backw) { - int32_t tmp = findidx (table, indirect, extra, &us, -1); + int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1); idx = tmp & 0xffffff; i++; } @@ -124,7 +124,7 @@ get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets, while (*us != L('\0')) { - int32_t tmp = findidx (table, indirect, extra, &us, -1); + int32_t tmp = findidx (encoding, table, indirect, extra, &us, -1); unsigned char rule = tmp >> 24; prev_idx = idx; idx = tmp & 0xffffff; @@ -253,6 +253,7 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) const USTRING_TYPE *weights; const USTRING_TYPE *extra; const int32_t *indirect; + uint_fast32_t encoding; if (nrules == 0) return STRCMP (s1, s2); @@ -271,6 +272,8 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_EXTRA,SUFFIX))].string; indirect = (const int32_t *) current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_INDIRECT,SUFFIX))].string; + encoding = current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word; + assert (((uintptr_t) table) % __alignof__ (table[0]) == 0); assert (((uintptr_t) weights) % __alignof__ (weights[0]) == 0); @@ -310,9 +313,9 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l) while (1) { - get_next_seq (&seq1, nrules, rulesets, weights, table, + get_next_seq (&seq1, nrules, rulesets, weights, encoding, table, extra, indirect, pass); - get_next_seq (&seq2, nrules, rulesets, weights, table, + get_next_seq (&seq2, nrules, rulesets, weights, encoding, table, extra, indirect, pass); /* See whether any or both strings are empty. */ if (seq1.len == 0 || seq2.len == 0) diff --git a/string/strxfrm_l.c b/string/strxfrm_l.c index 22e24d3..5c89b15 100644 --- a/string/strxfrm_l.c +++ b/string/strxfrm_l.c @@ -53,6 +53,7 @@ typedef struct uint_fast32_t nrules; unsigned char *rulesets; USTRING_TYPE *weights; + uint_fast32_t encoding; int32_t *table; USTRING_TYPE *extra; int32_t *indirect; @@ -100,8 +101,8 @@ static __always_inline size_t find_idx (const USTRING_TYPE **us, int32_t *weight_idx, unsigned char *rule_idx, const locale_data_t *l_data, const int pass) { - int32_t tmp = findidx (l_data->table, l_data->indirect, l_data->extra, us, - -1); + int32_t tmp = findidx (l_data->encoding, l_data->table, l_data->indirect, + l_data->extra, us, -1); *rule_idx = tmp >> 24; int32_t idx = tmp & 0xffffff; size_t len = l_data->weights[idx++]; @@ -693,6 +694,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l) /* Get the locale data. */ l_data.rulesets = (unsigned char *) current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string; + l_data.encoding = + current->values[_NL_ITEM_INDEX (_NL_COLLATE_ENCODING_TYPE)].word; l_data.table = (int32_t *) current->values[_NL_ITEM_INDEX (CONCAT(_NL_COLLATE_TABLE,SUFFIX))].string; l_data.weights = (USTRING_TYPE *) @@ -721,8 +724,8 @@ STRXFRM (STRING_TYPE *dest, const STRING_TYPE *src, size_t n, __locale_t l) do { - int32_t tmp = findidx (l_data.table, l_data.indirect, l_data.extra, &cur, - -1); + int32_t tmp = findidx (l_data.encoding, l_data.table, l_data.indirect, + l_data.extra, &cur, -1); rulearr[idxmax] = tmp >> 24; idxarr[idxmax] = tmp & 0xffffff;