From patchwork Mon Aug 27 21:11:47 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Paul Eggert <eggert@cs.ucla.edu>
X-Patchwork-Id: 29084
Received: (qmail 101231 invoked by alias); 27 Aug 2018 21:25:15 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 101111 invoked by uid 89); 27 Aug 2018 21:25:14 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-25.2 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, SPF_PASS,
	URIBL_RED autolearn=ham version=3.3.2 spammy=rational,
	H*Ad:D*edu
X-HELO: zimbra.cs.ucla.edu
From: Paul Eggert <eggert@cs.ucla.edu>
To: libc-alpha@sourceware.org
Cc: Paul Eggert <eggert@cs.ucla.edu>
Subject: [committed] regex: Gnulib unibyte RRI uses bytes not chars
Date: Mon, 27 Aug 2018 14:11:47 -0700
Message-Id: <20180827211149.10421-2-eggert@cs.ucla.edu>
In-Reply-To: <20180827211149.10421-1-eggert@cs.ucla.edu>
References: <3db72f1d-1547-c5eb-cf78-d6198be62c55@redhat.com>
	<20180827211149.10421-1-eggert@cs.ucla.edu>

Adjust the non-glibc code to agree with what Gawk needs for
rational range interpretation (RRI) for regular expression ranges.
In unibyte locales, Gawk wants ranges to use the underlying byte
rather than the character code point.  This change does not affect
glibc proper.
* posix/regcomp.c (parse_byte) [!_LIBC && RE_ENABLE_I18N]:
In unibyte locales, use the byte value rather than
running it through btowc.
---
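
The behavior change can be sketched outside of regcomp.c as follows.
This is only an illustration under stated assumptions, not the code in
the patch: it calls the standard btowc rather than glibc's internal
__btowc, and it replaces the re_charset_t *mbcset argument with a
hypothetical int flag "multibyte" (nonzero where the real code would
have a non-null mbcset).  The helper check_unibyte_identity is likewise
made up for this note.

```c
#include <assert.h>
#include <wchar.h>

/* Sketch of the behavior before the patch.  MULTIBYTE is a hypothetical
   stand-in for "mbcset != NULL".  btowc is tried first, and the raw
   byte is used only when btowc fails in a unibyte locale.  */
static wint_t
parse_byte_old (unsigned char b, int multibyte)
{
  wint_t wc = btowc (b);
  return wc == WEOF && !multibyte ? b : wc;
}

/* Sketch of the behavior after the patch.  In a unibyte locale the byte
   value is returned unchanged and btowc is never consulted; in a
   multibyte locale the result is whatever btowc returns, including
   WEOF for an encoding error.  */
static wint_t
parse_byte_new (unsigned char b, int multibyte)
{
  return !multibyte ? b : btowc (b);
}

/* Check the property the patch relies on: with the new code, every
   byte maps to itself when the locale is treated as unibyte.  */
static void
check_unibyte_identity (void)
{
  for (int b = 0; b < 256; b++)
    assert (parse_byte_new ((unsigned char) b, 0) == (wint_t) b);
}
```

The two versions can differ only in a unibyte locale whose charset maps
some byte to a code point other than the byte value.  There the old
code would build range endpoints from code points, while the new code
builds them from the bytes themselves, which is what the description
above says Gawk wants.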
 ChangeLog       | 12 ++++++++++++
 posix/regcomp.c |  9 ++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 2ef08f0ed1..2ee6a12704 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,15 @@
+2018-08-10  Paul Eggert  <eggert@cs.ucla.edu>
+
+	regex: Gnulib unibyte RRI uses bytes not chars
+	Adjust the non-glibc code to agree with what Gawk needs for
+	rational range interpretation (RRI) for regular expression ranges.
+	In unibyte locales, Gawk wants ranges to use the underlying byte
+	rather than the character code point.  This change does not affect
+	glibc proper.
+	* posix/regcomp.c (parse_byte) [!_LIBC && RE_ENABLE_I18N]:
+	In unibyte locales, use the byte value rather than
+	running it through btowc.
+
 2018-08-10  Joseph Myers  <joseph@codesourcery.com>
 
 	* sysdeps/generic/math-tests-snan.h: New file.
diff --git a/posix/regcomp.c b/posix/regcomp.c
index 3b0a3c6b6a..e81652f229 100644
--- a/posix/regcomp.c
+++ b/posix/regcomp.c
@@ -2684,15 +2684,14 @@ parse_dup_op (bin_tree_t *elem, re_string_t *regexp, re_dfa_t *dfa,
 
 # ifdef RE_ENABLE_I18N
 /* Convert the byte B to the corresponding wide character.  In a
-   unibyte locale, treat B as itself if it is an encoding error.
-   In a multibyte locale, return WEOF if B is an encoding error.  */
+   unibyte locale, treat B as itself.  In a multibyte locale, return
+   WEOF if B is an encoding error.  */
 static wint_t
 parse_byte (unsigned char b, re_charset_t *mbcset)
 {
-  wint_t wc = __btowc (b);
-  return wc == WEOF && !mbcset ? b : wc;
+  return mbcset == NULL ? b : __btowc (b);
 }
-#endif
+# endif
 
   /* Local function for parse_bracket_exp only used in case of NOT _LIBC.
      Build the range expression which starts from START_ELEM, and ends