From patchwork Sun Feb 27 16:53:19 2022
X-Patchwork-Submitter: Tom Honermann
X-Patchwork-Id: 51410
Message-ID: <29a48f8e-0c31-072c-ec36-8b62a0e1b430@honermann.net>
Date: Sun, 27 Feb 2022 11:53:19 -0500
From: Tom Honermann
To: libc-alpha
List-Id: Libc-alpha mailing list
Subject: [PATCH 1/3]: C++20 P0482R6 and C2X N2653: Fix for bug 25744,
 mbrtowc with Big5-HKSCS

This patch for bug 25744 [1] updates the Big5-HKSCS converter to properly
maintain the lowest 3 bits of the mbstate_t __count data member. This
change is necessary to ensure that state is correctly preserved when the
converter encounters an incomplete multibyte character. More details are
available in bug 25744 [1].

The code changes are styled to match how these bits are maintained by
converters such as iso-2022-jp.c, ibm930.c, and others.

Running 'grep __count' in the 'iconvdata' directory suggests that a number
of other converters, euc-jisx0213.c for example, also fail to preserve
these bits in some cases, though it may be that negative effects are not
observed for those converters. This patch does not attempt to address such
issues with other converters.

Tested on Linux x86_64.

Tom.

[1]: Bug 25744
     "mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the
     second byte of certain double byte characters"
     https://sourceware.org/bugzilla/show_bug.cgi?id=25744

commit 9580fc4e7fa0ce33c049b0c2d61b98405fdd2ae3
Author: Tom Honermann
Date:   Wed Jan 5 18:02:24 2022 -0500

    Correct the Big5-HKSCS converter to preserve low order state bits.

    BZ: https://sourceware.org/bugzilla/show_bug.cgi?id=25744

diff --git a/iconvdata/big5hkscs.c b/iconvdata/big5hkscs.c
index a28b18a5ec..d12389b2e3 100644
--- a/iconvdata/big5hkscs.c
+++ b/iconvdata/big5hkscs.c
@@ -17769,7 +17769,7 @@ static struct
    the output state to the initial state. This has to be done during the
    flushing. */
 #define EMIT_SHIFT_TO_INIT \
-  if (data->__statep->__count != 0) \
+  if ((data->__statep->__count >> 3) != 0) \
     { \
       if (FROM_DIRECTION) \
         { \
@@ -17778,7 +17778,7 @@ static struct
               /* Write out the last character. */ \
               *((uint32_t *) outbuf) = data->__statep->__count >> 3; \
               outbuf += sizeof (uint32_t); \
-              data->__statep->__count = 0; \
+              data->__statep->__count &= 7; \
             } \
           else \
             /* We don't have enough room in the output buffer. */ \
@@ -17792,7 +17792,7 @@ static struct
               uint32_t lasttwo = data->__statep->__count >> 3; \
               *outbuf++ = (lasttwo >> 8) & 0xff; \
               *outbuf++ = lasttwo & 0xff; \
-              data->__statep->__count = 0; \
+              data->__statep->__count &= 7; \
             } \
           else \
             /* We don't have enough room in the output buffer. */ \
@@ -17878,7 +17878,7 @@ static struct
 \
             /* Otherwise store only the first character now, and \
                put the second one into the queue. */ \
-            *statep = ch2 << 3; \
+            *statep = (ch2 << 3) | (*statep & 7); \
             /* Tell the caller why we terminate the loop. */ \
             result = __GCONV_FULL_OUTPUT; \
             break; \
@@ -17895,7 +17895,7 @@ static struct
           } \
         else \
           /* Clear the queue and proceed to output the saved character. */ \
-          *statep = 0; \
+          *statep &= 7; \
 \
         put32 (outptr, ch); \
         outptr += 4; \
@@ -17946,7 +17946,7 @@ static struct
               } \
             *outptr++ = (ch >> 8) & 0xff; \
             *outptr++ = ch & 0xff; \
-            *statep = 0; \
+            *statep &= 7; \
             inptr += 4; \
             continue; \
 \
@@ -17959,7 +17959,7 @@ static struct
           } \
         *outptr++ = (lasttwo >> 8) & 0xff; \
         *outptr++ = lasttwo & 0xff; \
-        *statep = 0; \
+        *statep &= 7; \
         continue; \
       } \
 \
@@ -17996,7 +17996,7 @@ static struct
         /* Check for possible combining character. */ \
         if (__glibc_unlikely (ch == 0xca || ch == 0xea)) \
           { \
-            *statep = ((cp[0] << 8) | cp[1]) << 3; \
+            *statep = (((cp[0] << 8) | cp[1]) << 3) | (*statep & 7); \
             inptr += 4; \
             continue; \
           } \
diff --git a/iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c b/iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c
index 9601b6c1d9..e1472dc2e2 100644
--- a/iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c
+++ b/iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c
@@ -128,6 +128,71 @@ check_conversion (struct testdata test)
       printf ("error: Result of third conversion was wrong.\n");
       err++;
     }
+
+  /* Now perform the same test as above consuming one byte at a time. */
+  mbs = test.input;
+  memset (&st, 0, sizeof (st));
+
+  /* Consume the first byte; expect an incomplete multibyte character. */
+  ret = mbrtowc (&wc, mbs, 1, &st);
+  if (ret != -2)
+    {
+      printf ("error: First byte conversion returned %zd.\n", ret);
+      err++;
+    }
+  /* Advance past the first consumed byte. */
+  mbs += 1;
+  /* Consume the second byte; expect the first wchar_t. */
+  ret = mbrtowc (&wc, mbs, 1, &st);
+  if (ret != 1)
+    {
+      printf ("error: Second byte conversion returned %zd.\n", ret);
+      err++;
+    }
+  /* Advance past the second consumed byte. */
+  mbs += 1;
+  if (wc != test.expected[0])
+    {
+      printf ("error: Result of first wchar_t conversion was wrong.\n");
+      err++;
+    }
+  /* Consume no bytes; expect the second wchar_t. */
+  ret = mbrtowc (&wc, mbs, 1, &st);
+  if (ret != 0)
+    {
+      printf ("error: First attempt of third byte conversion returned %zd.\n", ret);
+      err++;
+    }
+  /* Do not advance past the third byte. */
+  mbs += 0;
+  if (wc != test.expected[1])
+    {
+      printf ("error: Result of second wchar_t conversion was wrong.\n");
+      err++;
+    }
+  /* After the second wchar_t conversion, the converter should be in
+     the initial state since the two input BIG5-HKSCS bytes have been
+     consumed and the two wchar_t's have been output. */
+  if (mbsinit (&st) == 0)
+    {
+      printf ("error: Converter not in initial state.\n");
+      err++;
+    }
+  /* Consume the third byte; expect the third wchar_t. */
+  ret = mbrtowc (&wc, mbs, 1, &st);
+  if (ret != 1)
+    {
+      printf ("error: Third byte conversion returned %zd.\n", ret);
+      err++;
+    }
+  /* Advance past the third consumed byte. */
+  mbs += 1;
+  if (wc != test.expected[2])
+    {
+      printf ("error: Result of third wchar_t conversion was wrong.\n");
+      err++;
+    }
+
   /* Return 0 if we saw no errors. */
   return err;
 }
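
Illustration only, not part of the patch: the masking idiom used in the
hunks above can be sketched in isolation as follows. The helper names
(queue_char, queued_char, clear_queue), the macro names, and the use of
uint32_t in place of the real __count member are all hypothetical and
chosen for this sketch. It assumes only what the patch description states:
the low 3 bits of the state word belong to other code and must survive,
while the remaining bits hold one queued character.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for this sketch: the low 3 bits are reserved
   and must be preserved; the bits above them hold a queued character. */
#define STATE_LOW_MASK 7u
#define STATE_SHIFT 3

/* Store CH in the upper bits while keeping the low 3 bits intact,
   mirroring "*statep = (ch2 << 3) | (*statep & 7)" in the patch. */
static inline void
queue_char (uint32_t *state, uint32_t ch)
{
  /* CH must fit in the bits left over after the shift. */
  assert (ch <= (UINT32_MAX >> STATE_SHIFT));
  *state = (ch << STATE_SHIFT) | (*state & STATE_LOW_MASK);
}

/* Return the queued character, or 0 if none is queued, mirroring the
   "(__count >> 3) != 0" test in EMIT_SHIFT_TO_INIT. */
static inline uint32_t
queued_char (uint32_t state)
{
  return state >> STATE_SHIFT;
}

/* Drop the queued character but keep the low 3 bits, mirroring
   "*statep &= 7" where the old code wrote "*statep = 0". */
static inline void
clear_queue (uint32_t *state)
{
  *state &= STATE_LOW_MASK;
}
```

With the pre-patch assignments (plain "= ch2 << 3" and "= 0"), the low
bits would be overwritten each time a character is queued or cleared,
which is the loss of state the patch description refers to.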