From patchwork Thu Jul 29 06:35:13 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Carlos O'Donell <carlos@redhat.com>
X-Patchwork-Id: 44509
Return-Path: <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 92DBE389801C
	for <patchwork@sourceware.org>; Thu, 29 Jul 2021 06:38:36 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 92DBE389801C
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1627540716;
	bh=jNosdQNgNhhXBvsMGdWHnt0g6Vg4lgagCuaqHey+Ad8=;
	h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
	 From;
	b=uW/vZqy5Lr5e1An+vOdTZZkIy6KG3y8skWIXucaoQ4wXgmF/deRe0vPU+XGlUAeSK
	 wbOaVyfP4LQirA6kB5dQagAI5ObRn66FEurJNrGzVCBoykfloBY4N2DEJvXQIfXK2/
	 kw6PKTZMEFR5rgOPbwtSiImYyDPz7SturJUuKEPo=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTP id 0C046389800F
 for <libc-alpha@sourceware.org>; Thu, 29 Jul 2021 06:35:38 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0C046389800F
Received: from mail-qv1-f72.google.com (mail-qv1-f72.google.com
 [209.85.219.72]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-328-ekVD0Wr-Md-Y6fLyTmvyJw-1; Thu, 29 Jul 2021 02:35:21 -0400
X-MC-Unique: ekVD0Wr-Md-Y6fLyTmvyJw-1
Received: by mail-qv1-f72.google.com with SMTP id
 t4-20020a05621421a4b02902e2f9404330so2581165qvc.9
 for <libc-alpha@sourceware.org>; Wed, 28 Jul 2021 23:35:21 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=jNosdQNgNhhXBvsMGdWHnt0g6Vg4lgagCuaqHey+Ad8=;
 b=SPT9uSLuwzfMiSvbN8GgQwQObnGdr0WHIUXfhcsLastzBNvh2Rrn5BzxB+yJKuyC5O
 w0bLs6pus7Phr3FKCUR7AUFuJWtSaShvJ0lwyFDtzhDAPRrtPU4RCk+e61mb+Nt0x5Nr
 pbMyxIFglT9xJM4rz51aDWzeQLdMBf8fZ78L8X/ZWl6HVZkCF2T/NuElVaQ/ue/Zx6GR
 TPwMP0fJHqf8aUBkHKkj6Qbgn1DXLE7eBRnUuxG9s0ZVSWOyzim9QKmp0oSa8ZecpwYO
 WKUt99cYpRbKXLUmq8+mYOif6dWTJ/rgjl2cqJtlN1Vd0XSEzX8F1WelRHfXSvuNhXYp
 HBaA==
X-Gm-Message-State: AOAM533m9Rc+z2MJ1r0ngXEgekxEhx5of5QPcpPqD2WXgUAjM+BrMsrk
 U+S/eFBv7iIZH7jsZuksTtOQkTjqEO5amvzTMQXZ4Rzb5LyHJwUPPtlBsy5lg1eF1ipTZz+HErc
 qRnx9UWrpykQEd3lqAN9ZnAdllOu2V79iN4upYTtSIyDkbNY/X9DjfEd3JBu8FW+NaAqnng==
X-Received: by 2002:a05:620a:e14:: with SMTP id
 y20mr3717550qkm.335.1627540521078;
 Wed, 28 Jul 2021 23:35:21 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJwEzzmDj3Vg2+7qt3x6Ul2IF1SopTDEzRhb8Uv3dHUIXBZK0wMxuLp6zu+KPELmHWlgohVywA==
X-Received: by 2002:a05:620a:e14:: with SMTP id
 y20mr3717529qkm.335.1627540520779;
 Wed, 28 Jul 2021 23:35:20 -0700 (PDT)
Received: from athas.redhat.com (198-84-214-74.cpe.teksavvy.com.
 [198.84.214.74])
 by smtp.gmail.com with ESMTPSA id y2sm1311857qkd.38.2021.07.28.23.35.19
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Wed, 28 Jul 2021 23:35:20 -0700 (PDT)
To: libc-alpha@sourceware.org
Subject: [PATCH v4 1/3] Add support for locales with zero collation rules.
Date: Thu, 29 Jul 2021 02:35:13 -0400
Message-Id: <20210729063515.1541388-2-carlos@redhat.com>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20210729063515.1541388-1-carlos@redhat.com>
References: <20210729063515.1541388-1-carlos@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
X-Spam-Status: No, score=-12.3 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: Carlos O'Donell via Libc-alpha
 <libc-alpha@sourceware.org>
From: Carlos O'Donell <carlos@redhat.com>
Reply-To: Carlos O'Donell <carlos@redhat.com>
Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org
Sender: "Libc-alpha"
 <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>

While there is code to handle 'nrules == 0' in various locations
within posix/fnmatch_loop.c, posix/regcomp.c and posix/regexec.c,
these conditionals do not work.  The only collation with zero
rules in effect today is the builtin C/POSIX locale, which is
built by hand and, despite having zero rules, has collseqmb
and collseqwc tables stored in the locale data.  These tables are
simple identity tables which are not actually required and could
be removed at a later date after this change.  The changes
prepare for C.UTF-8, which has zero rules and no collation
sequence tables (multibyte or widechar).

No regressions on x86_64 or i686.
---
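Note for reviewers: the 'nrules == 0' fallback described above can be
sketched in isolation as follows.  This is only an illustration under
stated assumptions, not code from the patch: the helper name
range_match_codepoint is hypothetical, and it models only the final
comparison, where a range expression is matched on raw code point
values because no collation sequence tables exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch.  Models the branch taken
   when the locale has zero collation rules: a bracket range [LO-HI]
   matches C when C lies between the two bounds by plain code point
   value.  No collseq table is consulted.  */
bool
range_match_codepoint (uint32_t lo, uint32_t hi, uint32_t c)
{
  return lo <= c && c <= hi;
}

/* Small self-check showing the intended behaviour.  */
void
range_match_selfcheck (void)
{
  assert (range_match_codepoint ('a', 'z', 'm'));   /* Inside the range.  */
  assert (!range_match_codepoint ('a', 'z', 'A'));  /* 'A' sorts before 'a'.  */
  assert (!range_match_codepoint ('z', 'a', 'm'));  /* Reversed range is empty.  */
}
```

With collation rules present the same range would instead be decided
by collation sequence values, so the two approaches can disagree for
locales whose collation order differs from code point order.
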
 posix/fnmatch_loop.c | 95 +++++++++++++++++++++++++++-----------------
 posix/regcomp.c      | 12 +++---
 posix/regexec.c      | 85 ++++++++++++++++++---------------------
 3 files changed, 104 insertions(+), 88 deletions(-)

diff --git a/posix/fnmatch_loop.c b/posix/fnmatch_loop.c
index 7f938af590..547952f0a9 100644
--- a/posix/fnmatch_loop.c
+++ b/posix/fnmatch_loop.c
@@ -51,6 +51,7 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
     _NL_CURRENT(LC_COLLATE, _NL_COLLATE_COLLSEQMB);
 # endif
 #endif
+  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
 
   while ((c = *p++) != L_('\0'))
     {
@@ -324,8 +325,6 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
                        diagnose a "used initialized" in a dead branch in the
                        findidx function.  */
                     UCHAR str;
-                    uint32_t nrules =
-                      _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
                     const CHAR *startp = p;
 
                     c = *++p;
@@ -437,8 +436,6 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
 
                     if (c == L_('[') && *p == L_('.'))
                       {
-                        uint32_t nrules =
-                          _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
                         const CHAR *startp = p;
                         size_t c1 = 0;
 
@@ -600,42 +597,51 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
                     if (c == L_('-') && *p != L_(']'))
                       {
 #if _LIBC
-                        /* We have to find the collation sequence
-                           value for C.  Collation sequence is nothing
-                           we can regularly access.  The sequence
-                           value is defined by the order in which the
-                           definitions of the collation values for the
-                           various characters appear in the source
-                           file.  A strange concept, nowhere
-                           documented.  */
-                        uint32_t fcollseq;
-                        uint32_t lcollseq;
+			/* We must find the collation sequence values for
+			   the low part of the range, the high part of the
+			   range and the searched value FN.  We do this by
+			   using the POSIX concept of Collation Element
+			   Ordering, which is the defined order of elements
+			   in the source locale.  FCOLLSEQ is the searched
+			   element in the range, while LCOLLSEQ is the low
+			   element in the range.  If we have no collation
+			   rules (nrules == 0) then we must fall back to a
+			   basic code point value for the collation
+			   sequence value (which is correct for ASCII and
+			   UTF-8).  We must never use collseq if nrules ==
+			   0 since none of the tables we need will be
+			   present in the compiled binary locale.  We start
+			   with fcollseq and lcollseq at unknown collation
+			   sequences.  We only compute hcollseq, the high
+			   part of the range if required.  */
+                        uint32_t fcollseq = ~((uint32_t) 0);
+                        uint32_t lcollseq = ~((uint32_t) 0);
                         UCHAR cend = *p++;
 
+			if (nrules != 0)
+			  {
 # if WIDE_CHAR_VERSION
-                        /* Search in the 'names' array for the characters.  */
-                        fcollseq = __collseq_table_lookup (collseq, fn);
-                        if (fcollseq == ~((uint32_t) 0))
-                          /* XXX We don't know anything about the character
-                             we are supposed to match.  This means we are
-                             failing.  */
-                          goto range_not_matched;
-
-                        if (is_seqval)
-                          lcollseq = cold;
-                        else
-                          lcollseq = __collseq_table_lookup (collseq, cold);
+			    /* Search the collation data for the character.  */
+			    fcollseq = __collseq_table_lookup (collseq, fn);
+			    if (fcollseq == ~((uint32_t) 0))
+			      /* We don't know anything about the character
+				 we are supposed to match.  This means we are
+				 failing.  */
+			      goto range_not_matched;
+
+			    if (is_seqval)
+			      lcollseq = cold;
+			    else
+			      lcollseq = __collseq_table_lookup (collseq, cold);
 # else
-                        fcollseq = collseq[fn];
-                        lcollseq = is_seqval ? cold : collseq[(UCHAR) cold];
+			    fcollseq = collseq[fn];
+			    lcollseq = is_seqval ? cold : collseq[(UCHAR) cold];
 # endif
+			  }
 
                         is_seqval = false;
                         if (cend == L_('[') && *p == L_('.'))
                           {
-                            uint32_t nrules =
-                              _NL_CURRENT_WORD (LC_COLLATE,
-                                                _NL_COLLATE_NRULES);
                             const CHAR *startp = p;
                             size_t c1 = 0;
 
@@ -752,14 +758,20 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
                             cend = FOLD (cend);
                           }
 
-                        /* XXX It is not entirely clear to me how to handle
-                           characters which are not mentioned in the
-                           collation specification.  */
-                        if (
+			/* If we have rules, and the low sequence is lower than
+			   the value of the searched sequence then we must
+			   look up the high collation sequence value and
+			   determine if the fcollseq falls within the range.
+			   If hcollseq is unknown then we could still match
+			   fcollseq on the low end of the range.  If lcollseq
+			   is unknown (0xffffffff) we will still fail to
+			   match, but in the future we might consider matching
+			   the high end of the range on an exact match.  */
+                        if (nrules != 0 && (
 # if WIDE_CHAR_VERSION
                             lcollseq == 0xffffffff ||
 # endif
-                            lcollseq <= fcollseq)
+                            lcollseq <= fcollseq))
                           {
                             /* We have to look at the upper bound.  */
                             uint32_t hcollseq;
@@ -789,6 +801,17 @@ FCT (const CHAR *pattern, const CHAR *string, const CHAR *string_end,
                             if (lcollseq <= hcollseq && fcollseq <= hcollseq)
                               goto matched;
                           }
+
+			/* No rules, but we have a range.  */
+			if (nrules == 0)
+			  {
+			    if (cend == L_('\0'))
+			      return FNM_NOMATCH;
+
+			    /* Compare that fn is within the range.  */
+			    if ((UCHAR) cold <= fn && fn <= cend)
+			      goto matched;
+			  }
 # if WIDE_CHAR_VERSION
                       range_not_matched:
 # endif
diff --git a/posix/regcomp.c b/posix/regcomp.c
index d93698ae78..f55d20cbfd 100644
--- a/posix/regcomp.c
+++ b/posix/regcomp.c
@@ -2889,7 +2889,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
 	  if (MB_CUR_MAX == 1)
 	  */
 	  if (nrules == 0)
-	    return collseqmb[br_elem->opr.ch];
+	    return br_elem->opr.ch;
 	  else
 	    {
 	      wint_t wc = __btowc (br_elem->opr.ch);
@@ -2900,6 +2900,8 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
 	{
 	  if (nrules != 0)
 	    return __collseq_table_lookup (collseqwc, br_elem->opr.wch);
+	  else
+	    return br_elem->opr.wch;
 	}
       else if (br_elem->type == COLL_SYM)
 	{
@@ -2935,7 +2937,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
 		}
 	    }
 	  else if (sym_name_len == 1)
-	    return collseqmb[br_elem->opr.name[0]];
+	    return br_elem->opr.name[0];
 	}
       return UINT_MAX;
     }
@@ -3017,7 +3019,7 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
 	  if (MB_CUR_MAX == 1)
 	  */
 	  if (nrules == 0)
-	    ch_collseq = collseqmb[ch];
+	    ch_collseq = ch;
 	  else
 	    ch_collseq = __collseq_table_lookup (collseqwc, __btowc (ch));
 	  if (start_collseq <= ch_collseq && ch_collseq <= end_collseq)
@@ -3103,11 +3105,11 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token,
   int token_len;
   bool first_round = true;
 #ifdef _LIBC
-  collseqmb = (const unsigned char *)
-    _NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
   nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
   if (nrules)
     {
+      collseqmb = (const unsigned char *)
+	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
       /*
       if (MB_CUR_MAX > 1)
       */
diff --git a/posix/regexec.c b/posix/regexec.c
index f7b4f9cfc3..6cc23831aa 100644
--- a/posix/regexec.c
+++ b/posix/regexec.c
@@ -3858,62 +3858,53 @@ check_node_accept_bytes (const re_dfa_t *dfa, Idx node_idx,
 }
 
 # ifdef _LIBC
+#include <assert.h>
+
 static unsigned int
 find_collation_sequence_value (const unsigned char *mbs, size_t mbs_len)
 {
-  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
-  if (nrules == 0)
-    {
-      if (mbs_len == 1)
-	{
-	  /* No valid character.  Match it as a single byte character.  */
-	  const unsigned char *collseq = (const unsigned char *)
-	    _NL_CURRENT (LC_COLLATE, _NL_COLLATE_COLLSEQMB);
-	  return collseq[mbs[0]];
-	}
-      return UINT_MAX;
-    }
-  else
-    {
-      int32_t idx;
-      const unsigned char *extra = (const unsigned char *)
+  int32_t idx;
+  const unsigned char *extra = (const unsigned char *)
 	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB);
-      int32_t extrasize = (const unsigned char *)
+  int32_t extrasize = (const unsigned char *)
 	_NL_CURRENT (LC_COLLATE, _NL_COLLATE_SYMB_EXTRAMB + 1) - extra;
+  uint32_t nrules = _NL_CURRENT_WORD (LC_COLLATE, _NL_COLLATE_NRULES);
+
+  /* Only called from within 'if (nrules != 0)'.  */
+  assert (nrules != 0);
 
-      for (idx = 0; idx < extrasize;)
+  for (idx = 0; idx < extrasize;)
+    {
+      int mbs_cnt;
+      bool found = false;
+      int32_t elem_mbs_len;
+      /* Skip the name of collating element name.  */
+      idx = idx + extra[idx] + 1;
+      elem_mbs_len = extra[idx++];
+      if (mbs_len == elem_mbs_len)
 	{
-	  int mbs_cnt;
-	  bool found = false;
-	  int32_t elem_mbs_len;
-	  /* Skip the name of collating element name.  */
-	  idx = idx + extra[idx] + 1;
-	  elem_mbs_len = extra[idx++];
-	  if (mbs_len == elem_mbs_len)
-	    {
-	      for (mbs_cnt = 0; mbs_cnt < elem_mbs_len; ++mbs_cnt)
-		if (extra[idx + mbs_cnt] != mbs[mbs_cnt])
-		  break;
-	      if (mbs_cnt == elem_mbs_len)
-		/* Found the entry.  */
-		found = true;
-	    }
-	  /* Skip the byte sequence of the collating element.  */
-	  idx += elem_mbs_len;
-	  /* Adjust for the alignment.  */
-	  idx = (idx + 3) & ~3;
-	  /* Skip the collation sequence value.  */
-	  idx += sizeof (uint32_t);
-	  /* Skip the wide char sequence of the collating element.  */
-	  idx = idx + sizeof (uint32_t) * (*(int32_t *) (extra + idx) + 1);
-	  /* If we found the entry, return the sequence value.  */
-	  if (found)
-	    return *(uint32_t *) (extra + idx);
-	  /* Skip the collation sequence value.  */
-	  idx += sizeof (uint32_t);
+	  for (mbs_cnt = 0; mbs_cnt < elem_mbs_len; ++mbs_cnt)
+	    if (extra[idx + mbs_cnt] != mbs[mbs_cnt])
+	      break;
+	  if (mbs_cnt == elem_mbs_len)
+	    /* Found the entry.  */
+	    found = true;
 	}
-      return UINT_MAX;
+      /* Skip the byte sequence of the collating element.  */
+      idx += elem_mbs_len;
+      /* Adjust for the alignment.  */
+      idx = (idx + 3) & ~3;
+      /* Skip the collation sequence value.  */
+      idx += sizeof (uint32_t);
+      /* Skip the wide char sequence of the collating element.  */
+      idx = idx + sizeof (uint32_t) * (*(int32_t *) (extra + idx) + 1);
+      /* If we found the entry, return the sequence value.  */
+      if (found)
+        return *(uint32_t *) (extra + idx);
+      /* Skip the collation sequence value.  */
+      idx += sizeof (uint32_t);
     }
+  return UINT_MAX;
 }
 # endif /* _LIBC */
 #endif /* RE_ENABLE_I18N */

From patchwork Thu Jul 29 06:35:14 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Carlos O'Donell <carlos@redhat.com>
X-Patchwork-Id: 44507
Return-Path: <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 7FC9F389802A
	for <patchwork@sourceware.org>; Thu, 29 Jul 2021 06:36:51 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7FC9F389802A
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1627540611;
	bh=Y+4ZT1CFex1h4Z5lkrPY+lboRsBejeoNPUR4ybqGwVY=;
	h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
	 From;
	b=pstC40i66jWsf82ba3xTXuAGxZ+yzgREV8L8TbGtXCCcxZp4raXrivXScKJdA42JA
	 LURhUghbMG8pWhpDW1z66NC/56VxIKzmia3epLmV8TooN1rtMs2+astCS+KoE4PVIo
	 K3qrni1sU5T8i9+EfnGgWPRq1h4raTd2PHtABAhs=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [216.205.24.124])
 by sourceware.org (Postfix) with ESMTP id D8F4D3898507
 for <libc-alpha@sourceware.org>; Thu, 29 Jul 2021 06:35:25 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org D8F4D3898507
Received: from mail-qt1-f197.google.com (mail-qt1-f197.google.com
 [209.85.160.197]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-143-wjo43tuROxuUvazBdRuRVw-1; Thu, 29 Jul 2021 02:35:24 -0400
X-MC-Unique: wjo43tuROxuUvazBdRuRVw-1
Received: by mail-qt1-f197.google.com with SMTP id
 i8-20020ac85c080000b029026ae3f4adc9so2304801qti.13
 for <libc-alpha@sourceware.org>; Wed, 28 Jul 2021 23:35:24 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=Y+4ZT1CFex1h4Z5lkrPY+lboRsBejeoNPUR4ybqGwVY=;
 b=T/HajIj5Xj8LnrlYfvoAh6z/5CnezCtJUUZ5J0yMuDnK/X3ks2jIdlv7Sk8f2hR19z
 57rfctJWHfD5AWcBFQxurImesXRrNBiEs9ntg3sMTu+M9WgNCNasqnDMasegcIdQ7N+b
 TXc1VRBvJyJGdmGMqbJ0yKKi8qgyy9MOe4co9Z5+W7h3q9/Iy8Re7ejcQ3Qt7tahStoe
 9qffCM8t4otW/LrzavZnq7xSbfrvQZeHVTZLXuh0VK1zQKBD1HCThLOFi4rT/i0yj8hV
 5ozAVi7JVNdnIyvag7kIObuYuaZv20dD4uztqW574xpSw6COTZUCPyAoDvCsSqBBf/vu
 ZVQQ==
X-Gm-Message-State: AOAM530QxvExfRXR/X8ng4NI6esdC6TFVZ1U7g2Rf4+SqFXPBM42xt9D
 JOB1joQoIZkFRLVxqPIsV992OU25rmi8H9PFCQdiL0JkiJN0++EoqIqb8VR5GTuFjwC2Pz65iuv
 lNC2Dphq47LICvlYrLmQEctgOb99rE0FBOzHqWkuajv2EX4U/kgiWz9BVVpRburGbibEG3A==
X-Received: by 2002:a05:6214:1933:: with SMTP id
 es19mr3816304qvb.43.1627540523026;
 Wed, 28 Jul 2021 23:35:23 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJxaLd7ccO1mMCYhXyw4n+bqEanG6HDYNF9EkYJONlqsByJVwgqUA25HlPed+nD2p7ORTxWvbA==
X-Received: by 2002:a05:6214:1933:: with SMTP id
 es19mr3816281qvb.43.1627540522544;
 Wed, 28 Jul 2021 23:35:22 -0700 (PDT)
Received: from athas.redhat.com (198-84-214-74.cpe.teksavvy.com.
 [198.84.214.74])
 by smtp.gmail.com with ESMTPSA id y2sm1311857qkd.38.2021.07.28.23.35.21
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Wed, 28 Jul 2021 23:35:21 -0700 (PDT)
To: libc-alpha@sourceware.org
Subject: [PATCH v4 2/3] Add 'strcmp_collation' support for LC_COLLATE.
Date: Thu, 29 Jul 2021 02:35:14 -0400
Message-Id: <20210729063515.1541388-3-carlos@redhat.com>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20210729063515.1541388-1-carlos@redhat.com>
References: <20210729063515.1541388-1-carlos@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
X-Spam-Status: No, score=-11.1 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 KAM_STOCKGEN, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL,
 SCC_5_SHORT_WORD_LINES, SPF_HELO_NONE, SPF_NONE,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: Carlos O'Donell via Libc-alpha
 <libc-alpha@sourceware.org>
From: Carlos O'Donell <carlos@redhat.com>
Reply-To: Carlos O'Donell <carlos@redhat.com>
Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org
Sender: "Libc-alpha"
 <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>

Support a new directive 'strcmp_collation' in the LC_COLLATE
section of a locale source file.  When this directive is present,
all collation rules are dropped and 'strcmp' is used to collate
the input character set.  This is required to allow for a C.UTF-8
that contains zero collation rules (minimal size) and sorts by
code point.

Tested on x86_64 and i686 without regression.
---
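Note for reviewers: the intended runtime effect of a zero-rule
LC_COLLATE can be observed today with the builtin "C" locale, which
is used here as a stand-in because the 'strcmp_collation' directive
is only being introduced by this series.  The sketch below is an
illustration, not part of the patch; the function name
collation_matches_strcmp is hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Return -1, 0 or 1 according to the sign of X.  */
static int
sign (int x)
{
  return (x > 0) - (x < 0);
}

/* Hypothetical helper, not part of the patch.  Returns 1 if strcoll
   and strcmp order A and B the same way in the "C" locale, 0 if they
   disagree, and -1 if the locale could not be selected.  */
int
collation_matches_strcmp (const char *a, const char *b)
{
  if (setlocale (LC_COLLATE, "C") == NULL)
    return -1;
  return sign (strcoll (a, b)) == sign (strcmp (a, b));
}
```

A locale built with 'strcmp_collation' is expected to give the same
agreement between strcoll and strcmp, which is what allows the
collation tables to be omitted from the compiled locale.
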
 locale/programs/ld-collate.c     |  24 ++-
 locale/programs/locfile-kw.gperf |   1 +
 locale/programs/locfile-kw.h     | 306 ++++++++++++++++---------------
 locale/programs/locfile-token.h  |   1 +
 4 files changed, 177 insertions(+), 155 deletions(-)

diff --git a/locale/programs/ld-collate.c b/locale/programs/ld-collate.c
index b6406b775d..ec778e23d3 100644
--- a/locale/programs/ld-collate.c
+++ b/locale/programs/ld-collate.c
@@ -195,6 +195,9 @@ struct name_list
 /* The real definition of the struct for the LC_COLLATE locale.  */
 struct locale_collate_t
 {
+  /* Does the locale use strcmp to compare the encoding?  */
+  bool strcmp_collation;
+
   int col_weight_max;
   int cur_weight_max;
 
@@ -1510,6 +1513,7 @@ collate_startup (struct linereader *ldfile, struct localedef_t *locale,
 	  obstack_init (&collate->mempool);
 
 	  collate->col_weight_max = -1;
+	  collate->strcmp_collation = false;
 	}
       else
 	/* Reuse the copy_locale's data structures.  */
@@ -1568,6 +1572,10 @@ collate_finish (struct localedef_t *locale, const struct charmap_t *charmap)
       return;
     }
 
+  /* No data required.  */
+  if (collate->strcmp_collation)
+    return;
+
   /* If this assertion is hit change the type in `element_t'.  */
   assert (nrules <= sizeof (runp->used_in_level) * 8);
 
@@ -2115,7 +2123,7 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
   add_locale_uint32 (&file, nrules);
 
   /* If we have no LC_COLLATE data emit only the number of rules as zero.  */
-  if (collate == NULL)
+  if (collate == NULL || collate->strcmp_collation)
     {
       size_t idx;
       for (idx = 1; idx < nelems; idx++)
@@ -2123,6 +2131,10 @@ collate_output (struct localedef_t *locale, const struct charmap_t *charmap,
 	  /* The words have to be handled specially.  */
 	  if (idx == _NL_ITEM_INDEX (_NL_COLLATE_SYMB_HASH_SIZEMB))
 	    add_locale_uint32 (&file, 0);
+	  else if (idx == _NL_ITEM_INDEX (_NL_COLLATE_CODESET)
+		   && collate != NULL)
+	    /* A valid LC_COLLATE must have a code set name.  */
+	    add_locale_string (&file, charmap->code_set_name);
 	  else
 	    add_locale_empty (&file);
 	}
@@ -2672,6 +2684,10 @@ collate_read (struct linereader *ldfile, struct localedef_t *result,
 
       switch (nowtok)
 	{
+	case tok_strcmp_collation:
+	  collate->strcmp_collation = true;
+	  break;
+
 	case tok_copy:
 	  /* Allow copying other locales.  */
 	  now = lr_token (ldfile, charmap, result, NULL, verbose);
@@ -3742,9 +3758,11 @@ error while adding equivalent collating symbol"));
 	  /* Next we assume `LC_COLLATE'.  */
 	  if (!ignore_content)
 	    {
-	      if (state == 0 && copy_locale == NULL)
+	      if (state == 0
+		  && copy_locale == NULL
+		  && !collate->strcmp_collation)
 		/* We must either see a copy statement or have
-		   ordering values.  */
+		   ordering values, or strcmp_collation.  */
 		lr_error (ldfile,
 			  _("%s: empty category description not allowed"),
 			  "LC_COLLATE");
diff --git a/locale/programs/locfile-kw.gperf b/locale/programs/locfile-kw.gperf
index bcded15ddb..0ae7267340 100644
--- a/locale/programs/locfile-kw.gperf
+++ b/locale/programs/locfile-kw.gperf
@@ -54,6 +54,7 @@ translit_end,           tok_translit_end,           0
 translit_ignore,        tok_translit_ignore,        0
 default_missing,        tok_default_missing,        0
 LC_COLLATE,             tok_lc_collate,             0
+strcmp_collation,       tok_strcmp_collation,       0
 coll_weight_max,        tok_coll_weight_max,        0
 section-symbol,         tok_section_symbol,         0
 collating-element,      tok_collating_element,      0
diff --git a/locale/programs/locfile-kw.h b/locale/programs/locfile-kw.h
index bc1cb8f084..f7af5c8a0a 100644
--- a/locale/programs/locfile-kw.h
+++ b/locale/programs/locfile-kw.h
@@ -54,7 +54,7 @@
 #line 24 "locfile-kw.gperf"
 struct keyword_t ;
 
-#define TOTAL_KEYWORDS 178
+#define TOTAL_KEYWORDS 179
 #define MIN_WORD_LENGTH 3
 #define MAX_WORD_LENGTH 22
 #define MIN_HASH_VALUE 3
@@ -78,7 +78,7 @@ hash (register const char *str, register size_t len)
       631, 631, 631, 631, 631, 631, 631, 631, 631, 631,
       631, 631, 631, 631, 631, 631, 631, 631, 631, 631,
       631, 631, 631, 631, 631, 631, 631, 631, 631, 631,
-        5,   0, 631, 631, 631, 631, 631, 631, 631, 631,
+       10,   5, 631, 631, 631, 631, 631, 631, 631, 631,
       631, 631, 631, 631, 631,   5, 631,   0,   0,   0,
         0,   0,  10,   0, 631, 631,   0, 631,   0,   5,
       631, 631,   0,   0,   0,  10, 631, 631, 631,   0,
@@ -134,92 +134,92 @@ locfile_hash (register const char *str, register size_t len)
 #line 31 "locfile-kw.gperf"
       {"END",                    tok_end,                    0},
       {""}, {""},
-#line 70 "locfile-kw.gperf"
+#line 71 "locfile-kw.gperf"
       {"IGNORE",                 tok_ignore,                 0},
-#line 129 "locfile-kw.gperf"
+#line 130 "locfile-kw.gperf"
       {"LC_TIME",                tok_lc_time,                0},
 #line 30 "locfile-kw.gperf"
       {"LC_CTYPE",               tok_lc_ctype,               0},
       {""},
-#line 168 "locfile-kw.gperf"
+#line 169 "locfile-kw.gperf"
       {"LC_ADDRESS",             tok_lc_address,             0},
-#line 153 "locfile-kw.gperf"
+#line 154 "locfile-kw.gperf"
       {"LC_MESSAGES",            tok_lc_messages,            0},
-#line 161 "locfile-kw.gperf"
+#line 162 "locfile-kw.gperf"
       {"LC_NAME",                tok_lc_name,                0},
-#line 158 "locfile-kw.gperf"
+#line 159 "locfile-kw.gperf"
       {"LC_PAPER",               tok_lc_paper,               0},
-#line 186 "locfile-kw.gperf"
+#line 187 "locfile-kw.gperf"
       {"LC_MEASUREMENT",         tok_lc_measurement,         0},
 #line 56 "locfile-kw.gperf"
       {"LC_COLLATE",             tok_lc_collate,             0},
       {""},
-#line 188 "locfile-kw.gperf"
+#line 189 "locfile-kw.gperf"
       {"LC_IDENTIFICATION",      tok_lc_identification,      0},
-#line 201 "locfile-kw.gperf"
+#line 202 "locfile-kw.gperf"
       {"revision",               tok_revision,               0},
-#line 69 "locfile-kw.gperf"
+#line 70 "locfile-kw.gperf"
       {"UNDEFINED",              tok_undefined,              0},
-#line 125 "locfile-kw.gperf"
+#line 126 "locfile-kw.gperf"
       {"LC_NUMERIC",             tok_lc_numeric,             0},
-#line 82 "locfile-kw.gperf"
+#line 83 "locfile-kw.gperf"
       {"LC_MONETARY",            tok_lc_monetary,            0},
-#line 181 "locfile-kw.gperf"
+#line 182 "locfile-kw.gperf"
       {"LC_TELEPHONE",           tok_lc_telephone,           0},
       {""}, {""}, {""},
-#line 75 "locfile-kw.gperf"
+#line 76 "locfile-kw.gperf"
       {"define",                 tok_define,                 0},
-#line 154 "locfile-kw.gperf"
+#line 155 "locfile-kw.gperf"
       {"yesexpr",                tok_yesexpr,                0},
-#line 141 "locfile-kw.gperf"
+#line 142 "locfile-kw.gperf"
       {"era_year",               tok_era_year,               0},
       {""},
 #line 54 "locfile-kw.gperf"
       {"translit_ignore",        tok_translit_ignore,        0},
-#line 156 "locfile-kw.gperf"
+#line 157 "locfile-kw.gperf"
       {"yesstr",                 tok_yesstr,                 0},
       {""},
-#line 89 "locfile-kw.gperf"
+#line 90 "locfile-kw.gperf"
       {"negative_sign",          tok_negative_sign,          0},
       {""},
-#line 137 "locfile-kw.gperf"
+#line 138 "locfile-kw.gperf"
       {"t_fmt",                  tok_t_fmt,                  0},
-#line 159 "locfile-kw.gperf"
+#line 160 "locfile-kw.gperf"
       {"height",                 tok_height,                 0},
       {""}, {""},
 #line 52 "locfile-kw.gperf"
       {"translit_start",         tok_translit_start,         0},
-#line 136 "locfile-kw.gperf"
+#line 137 "locfile-kw.gperf"
       {"d_fmt",                  tok_d_fmt,                  0},
       {""},
 #line 53 "locfile-kw.gperf"
       {"translit_end",           tok_translit_end,           0},
-#line 94 "locfile-kw.gperf"
+#line 95 "locfile-kw.gperf"
       {"n_cs_precedes",          tok_n_cs_precedes,          0},
-#line 144 "locfile-kw.gperf"
+#line 145 "locfile-kw.gperf"
       {"era_t_fmt",              tok_era_t_fmt,              0},
 #line 39 "locfile-kw.gperf"
       {"space",                  tok_space,                  0},
-#line 72 "locfile-kw.gperf"
-      {"reorder-end",            tok_reorder_end,            0},
 #line 73 "locfile-kw.gperf"
+      {"reorder-end",            tok_reorder_end,            0},
+#line 74 "locfile-kw.gperf"
       {"reorder-sections-after", tok_reorder_sections_after, 0},
       {""},
-#line 142 "locfile-kw.gperf"
+#line 143 "locfile-kw.gperf"
       {"era_d_fmt",              tok_era_d_fmt,              0},
-#line 189 "locfile-kw.gperf"
+#line 190 "locfile-kw.gperf"
       {"title",                  tok_title,                  0},
       {""}, {""},
-#line 149 "locfile-kw.gperf"
+#line 150 "locfile-kw.gperf"
       {"timezone",               tok_timezone,               0},
       {""},
-#line 74 "locfile-kw.gperf"
+#line 75 "locfile-kw.gperf"
       {"reorder-sections-end",   tok_reorder_sections_end,   0},
       {""}, {""}, {""},
-#line 95 "locfile-kw.gperf"
+#line 96 "locfile-kw.gperf"
       {"n_sep_by_space",         tok_n_sep_by_space,         0},
       {""}, {""},
-#line 100 "locfile-kw.gperf"
+#line 101 "locfile-kw.gperf"
       {"int_n_cs_precedes",      tok_int_n_cs_precedes,      0},
       {""}, {""}, {""},
 #line 26 "locfile-kw.gperf"
@@ -233,147 +233,147 @@ locfile_hash (register const char *str, register size_t len)
       {"print",                  tok_print,                  0},
 #line 44 "locfile-kw.gperf"
       {"xdigit",                 tok_xdigit,                 0},
-#line 110 "locfile-kw.gperf"
+#line 111 "locfile-kw.gperf"
       {"duo_n_cs_precedes",      tok_duo_n_cs_precedes,      0},
-#line 127 "locfile-kw.gperf"
+#line 128 "locfile-kw.gperf"
       {"thousands_sep",          tok_thousands_sep,          0},
-#line 197 "locfile-kw.gperf"
+#line 198 "locfile-kw.gperf"
       {"territory",              tok_territory,              0},
 #line 36 "locfile-kw.gperf"
       {"digit",                  tok_digit,                  0},
       {""}, {""},
-#line 92 "locfile-kw.gperf"
+#line 93 "locfile-kw.gperf"
       {"p_cs_precedes",          tok_p_cs_precedes,          0},
       {""}, {""},
-#line 62 "locfile-kw.gperf"
+#line 63 "locfile-kw.gperf"
       {"script",                 tok_script,                 0},
 #line 29 "locfile-kw.gperf"
       {"include",                tok_include,                0},
       {""},
-#line 78 "locfile-kw.gperf"
+#line 79 "locfile-kw.gperf"
       {"else",                   tok_else,                   0},
-#line 184 "locfile-kw.gperf"
+#line 185 "locfile-kw.gperf"
       {"int_select",             tok_int_select,             0},
       {""}, {""}, {""},
-#line 132 "locfile-kw.gperf"
+#line 133 "locfile-kw.gperf"
       {"week",                   tok_week,                   0},
 #line 33 "locfile-kw.gperf"
       {"upper",                  tok_upper,                  0},
       {""}, {""},
-#line 194 "locfile-kw.gperf"
+#line 195 "locfile-kw.gperf"
       {"tel",                    tok_tel,                    0},
-#line 93 "locfile-kw.gperf"
+#line 94 "locfile-kw.gperf"
       {"p_sep_by_space",         tok_p_sep_by_space,         0},
-#line 160 "locfile-kw.gperf"
+#line 161 "locfile-kw.gperf"
       {"width",                  tok_width,                  0},
       {""},
-#line 98 "locfile-kw.gperf"
+#line 99 "locfile-kw.gperf"
       {"int_p_cs_precedes",      tok_int_p_cs_precedes,      0},
       {""}, {""},
 #line 41 "locfile-kw.gperf"
       {"punct",                  tok_punct,                  0},
       {""}, {""},
-#line 101 "locfile-kw.gperf"
+#line 102 "locfile-kw.gperf"
       {"int_n_sep_by_space",     tok_int_n_sep_by_space,     0},
       {""}, {""}, {""},
-#line 108 "locfile-kw.gperf"
+#line 109 "locfile-kw.gperf"
       {"duo_p_cs_precedes",      tok_duo_p_cs_precedes,      0},
 #line 48 "locfile-kw.gperf"
       {"charconv",               tok_charconv,               0},
       {""},
 #line 47 "locfile-kw.gperf"
       {"class",                  tok_class,                  0},
-#line 114 "locfile-kw.gperf"
-      {"duo_int_n_cs_precedes",  tok_duo_int_n_cs_precedes,  0},
 #line 115 "locfile-kw.gperf"
+      {"duo_int_n_cs_precedes",  tok_duo_int_n_cs_precedes,  0},
+#line 116 "locfile-kw.gperf"
       {"duo_int_n_sep_by_space", tok_duo_int_n_sep_by_space, 0},
-#line 111 "locfile-kw.gperf"
+#line 112 "locfile-kw.gperf"
       {"duo_n_sep_by_space",     tok_duo_n_sep_by_space,     0},
-#line 119 "locfile-kw.gperf"
+#line 120 "locfile-kw.gperf"
       {"duo_int_n_sign_posn",    tok_duo_int_n_sign_posn,    0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""},
-#line 58 "locfile-kw.gperf"
+#line 59 "locfile-kw.gperf"
       {"section-symbol",         tok_section_symbol,         0},
-#line 185 "locfile-kw.gperf"
+#line 186 "locfile-kw.gperf"
       {"int_prefix",             tok_int_prefix,             0},
       {""}, {""}, {""}, {""},
 #line 42 "locfile-kw.gperf"
       {"graph",                  tok_graph,                  0},
       {""}, {""},
-#line 99 "locfile-kw.gperf"
+#line 100 "locfile-kw.gperf"
       {"int_p_sep_by_space",     tok_int_p_sep_by_space,     0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 112 "locfile-kw.gperf"
-      {"duo_int_p_cs_precedes",  tok_duo_int_p_cs_precedes,  0},
 #line 113 "locfile-kw.gperf"
+      {"duo_int_p_cs_precedes",  tok_duo_int_p_cs_precedes,  0},
+#line 114 "locfile-kw.gperf"
       {"duo_int_p_sep_by_space", tok_duo_int_p_sep_by_space, 0},
-#line 109 "locfile-kw.gperf"
+#line 110 "locfile-kw.gperf"
       {"duo_p_sep_by_space",     tok_duo_p_sep_by_space,     0},
-#line 118 "locfile-kw.gperf"
+#line 119 "locfile-kw.gperf"
       {"duo_int_p_sign_posn",    tok_duo_int_p_sign_posn,    0},
-#line 157 "locfile-kw.gperf"
+#line 158 "locfile-kw.gperf"
       {"nostr",                  tok_nostr,                  0},
       {""}, {""},
-#line 140 "locfile-kw.gperf"
+#line 141 "locfile-kw.gperf"
       {"era",                    tok_era,                    0},
       {""},
-#line 84 "locfile-kw.gperf"
+#line 85 "locfile-kw.gperf"
       {"currency_symbol",        tok_currency_symbol,        0},
       {""},
-#line 167 "locfile-kw.gperf"
+#line 168 "locfile-kw.gperf"
       {"name_ms",                tok_name_ms,                0},
-#line 165 "locfile-kw.gperf"
-      {"name_mrs",               tok_name_mrs,               0},
 #line 166 "locfile-kw.gperf"
+      {"name_mrs",               tok_name_mrs,               0},
+#line 167 "locfile-kw.gperf"
       {"name_miss",              tok_name_miss,              0},
-#line 83 "locfile-kw.gperf"
+#line 84 "locfile-kw.gperf"
       {"int_curr_symbol",        tok_int_curr_symbol,        0},
-#line 190 "locfile-kw.gperf"
+#line 191 "locfile-kw.gperf"
       {"source",                 tok_source,                 0},
-#line 164 "locfile-kw.gperf"
+#line 165 "locfile-kw.gperf"
       {"name_mr",                tok_name_mr,                0},
-#line 163 "locfile-kw.gperf"
+#line 164 "locfile-kw.gperf"
       {"name_gen",               tok_name_gen,               0},
-#line 202 "locfile-kw.gperf"
+#line 203 "locfile-kw.gperf"
       {"date",                   tok_date,                   0},
       {""}, {""},
-#line 191 "locfile-kw.gperf"
+#line 192 "locfile-kw.gperf"
       {"address",                tok_address,                0},
-#line 162 "locfile-kw.gperf"
+#line 163 "locfile-kw.gperf"
       {"name_fmt",               tok_name_fmt,               0},
 #line 32 "locfile-kw.gperf"
       {"copy",                   tok_copy,                   0},
-#line 103 "locfile-kw.gperf"
+#line 104 "locfile-kw.gperf"
       {"int_n_sign_posn",        tok_int_n_sign_posn,        0},
       {""}, {""},
-#line 131 "locfile-kw.gperf"
+#line 132 "locfile-kw.gperf"
       {"day",                    tok_day,                    0},
-#line 105 "locfile-kw.gperf"
+#line 106 "locfile-kw.gperf"
       {"duo_currency_symbol",    tok_duo_currency_symbol,    0},
       {""}, {""}, {""},
-#line 150 "locfile-kw.gperf"
+#line 151 "locfile-kw.gperf"
       {"date_fmt",               tok_date_fmt,               0},
-#line 64 "locfile-kw.gperf"
+#line 65 "locfile-kw.gperf"
       {"order_end",              tok_order_end,              0},
-#line 117 "locfile-kw.gperf"
+#line 118 "locfile-kw.gperf"
       {"duo_n_sign_posn",        tok_duo_n_sign_posn,        0},
       {""},
-#line 170 "locfile-kw.gperf"
+#line 171 "locfile-kw.gperf"
       {"country_name",           tok_country_name,           0},
-#line 71 "locfile-kw.gperf"
+#line 72 "locfile-kw.gperf"
       {"reorder-after",          tok_reorder_after,          0},
       {""}, {""},
-#line 155 "locfile-kw.gperf"
+#line 156 "locfile-kw.gperf"
       {"noexpr",                 tok_noexpr,                 0},
 #line 50 "locfile-kw.gperf"
       {"tolower",                tok_tolower,                0},
-#line 198 "locfile-kw.gperf"
+#line 199 "locfile-kw.gperf"
       {"audience",               tok_audience,               0},
       {""}, {""}, {""},
 #line 49 "locfile-kw.gperf"
       {"toupper",                tok_toupper,                0},
-#line 68 "locfile-kw.gperf"
+#line 69 "locfile-kw.gperf"
       {"position",               tok_position,               0},
       {""},
 #line 40 "locfile-kw.gperf"
@@ -381,196 +381,198 @@ locfile_hash (register const char *str, register size_t len)
       {""},
 #line 27 "locfile-kw.gperf"
       {"comment_char",           tok_comment_char,           0},
-#line 88 "locfile-kw.gperf"
+#line 89 "locfile-kw.gperf"
       {"positive_sign",          tok_positive_sign,          0},
       {""}, {""}, {""}, {""},
-#line 61 "locfile-kw.gperf"
+#line 62 "locfile-kw.gperf"
       {"symbol-equivalence",     tok_symbol_equivalence,     0},
       {""},
-#line 102 "locfile-kw.gperf"
+#line 103 "locfile-kw.gperf"
       {"int_p_sign_posn",        tok_int_p_sign_posn,        0},
-#line 175 "locfile-kw.gperf"
+#line 176 "locfile-kw.gperf"
       {"country_car",            tok_country_car,            0},
       {""}, {""},
-#line 104 "locfile-kw.gperf"
+#line 105 "locfile-kw.gperf"
       {"duo_int_curr_symbol",    tok_duo_int_curr_symbol,    0},
       {""}, {""},
-#line 135 "locfile-kw.gperf"
+#line 136 "locfile-kw.gperf"
       {"d_t_fmt",                tok_d_t_fmt,                0},
       {""}, {""},
-#line 116 "locfile-kw.gperf"
+#line 117 "locfile-kw.gperf"
       {"duo_p_sign_posn",        tok_duo_p_sign_posn,        0},
-#line 187 "locfile-kw.gperf"
+#line 188 "locfile-kw.gperf"
       {"measurement",            tok_measurement,            0},
-#line 176 "locfile-kw.gperf"
+#line 177 "locfile-kw.gperf"
       {"country_isbn",           tok_country_isbn,           0},
 #line 37 "locfile-kw.gperf"
       {"outdigit",               tok_outdigit,               0},
       {""}, {""},
-#line 143 "locfile-kw.gperf"
+#line 144 "locfile-kw.gperf"
       {"era_d_t_fmt",            tok_era_d_t_fmt,            0},
       {""}, {""}, {""},
 #line 34 "locfile-kw.gperf"
       {"lower",                  tok_lower,                  0},
-#line 183 "locfile-kw.gperf"
+#line 184 "locfile-kw.gperf"
       {"tel_dom_fmt",            tok_tel_dom_fmt,            0},
-#line 171 "locfile-kw.gperf"
+#line 172 "locfile-kw.gperf"
       {"country_post",           tok_country_post,           0},
-#line 148 "locfile-kw.gperf"
+#line 149 "locfile-kw.gperf"
       {"cal_direction",          tok_cal_direction,          0},
       {""},
-#line 139 "locfile-kw.gperf"
+#line 140 "locfile-kw.gperf"
       {"t_fmt_ampm",             tok_t_fmt_ampm,             0},
-#line 91 "locfile-kw.gperf"
+#line 92 "locfile-kw.gperf"
       {"frac_digits",            tok_frac_digits,            0},
       {""}, {""},
-#line 177 "locfile-kw.gperf"
+#line 178 "locfile-kw.gperf"
       {"lang_name",              tok_lang_name,              0},
-#line 90 "locfile-kw.gperf"
+#line 91 "locfile-kw.gperf"
       {"int_frac_digits",        tok_int_frac_digits,        0},
       {""},
-#line 121 "locfile-kw.gperf"
+#line 122 "locfile-kw.gperf"
       {"uno_valid_to",           tok_uno_valid_to,           0},
-#line 126 "locfile-kw.gperf"
+#line 127 "locfile-kw.gperf"
       {"decimal_point",          tok_decimal_point,          0},
       {""},
-#line 133 "locfile-kw.gperf"
+#line 134 "locfile-kw.gperf"
       {"abmon",                  tok_abmon,                  0},
       {""}, {""}, {""}, {""},
-#line 107 "locfile-kw.gperf"
+#line 108 "locfile-kw.gperf"
       {"duo_frac_digits",        tok_duo_frac_digits,        0},
-#line 182 "locfile-kw.gperf"
+#line 183 "locfile-kw.gperf"
       {"tel_int_fmt",            tok_tel_int_fmt,            0},
-#line 123 "locfile-kw.gperf"
+#line 124 "locfile-kw.gperf"
       {"duo_valid_to",           tok_duo_valid_to,           0},
-#line 146 "locfile-kw.gperf"
+#line 147 "locfile-kw.gperf"
       {"first_weekday",          tok_first_weekday,          0},
       {""},
-#line 130 "locfile-kw.gperf"
+#line 131 "locfile-kw.gperf"
       {"abday",                  tok_abday,                  0},
       {""},
-#line 200 "locfile-kw.gperf"
+#line 201 "locfile-kw.gperf"
       {"abbreviation",           tok_abbreviation,           0},
-#line 147 "locfile-kw.gperf"
+#line 148 "locfile-kw.gperf"
       {"first_workday",          tok_first_workday,          0},
       {""}, {""},
-#line 97 "locfile-kw.gperf"
+#line 98 "locfile-kw.gperf"
       {"n_sign_posn",            tok_n_sign_posn,            0},
       {""}, {""}, {""},
-#line 145 "locfile-kw.gperf"
+#line 146 "locfile-kw.gperf"
       {"alt_digits",             tok_alt_digits,             0},
       {""}, {""},
-#line 128 "locfile-kw.gperf"
+#line 129 "locfile-kw.gperf"
       {"grouping",               tok_grouping,               0},
       {""},
 #line 45 "locfile-kw.gperf"
       {"blank",                  tok_blank,                  0},
       {""}, {""},
-#line 196 "locfile-kw.gperf"
+#line 197 "locfile-kw.gperf"
       {"language",               tok_language,               0},
-#line 120 "locfile-kw.gperf"
+#line 121 "locfile-kw.gperf"
       {"uno_valid_from",         tok_uno_valid_from,         0},
       {""},
-#line 199 "locfile-kw.gperf"
+#line 200 "locfile-kw.gperf"
       {"application",            tok_application,            0},
       {""},
-#line 80 "locfile-kw.gperf"
+#line 81 "locfile-kw.gperf"
       {"elifndef",               tok_elifndef,               0},
       {""}, {""}, {""}, {""}, {""},
-#line 122 "locfile-kw.gperf"
+#line 123 "locfile-kw.gperf"
       {"duo_valid_from",         tok_duo_valid_from,         0},
-#line 57 "locfile-kw.gperf"
+#line 58 "locfile-kw.gperf"
       {"coll_weight_max",        tok_coll_weight_max,        0},
       {""},
-#line 79 "locfile-kw.gperf"
+#line 80 "locfile-kw.gperf"
       {"elifdef",                tok_elifdef,                0},
-#line 67 "locfile-kw.gperf"
+#line 68 "locfile-kw.gperf"
       {"backward",               tok_backward,               0},
-#line 106 "locfile-kw.gperf"
+#line 107 "locfile-kw.gperf"
       {"duo_int_frac_digits",    tok_duo_int_frac_digits,    0},
       {""}, {""}, {""}, {""}, {""}, {""},
-#line 96 "locfile-kw.gperf"
+#line 97 "locfile-kw.gperf"
       {"p_sign_posn",            tok_p_sign_posn,            0},
       {""},
-#line 203 "locfile-kw.gperf"
+#line 204 "locfile-kw.gperf"
       {"category",               tok_category,               0},
       {""}, {""}, {""}, {""},
-#line 134 "locfile-kw.gperf"
+#line 135 "locfile-kw.gperf"
       {"mon",                    tok_mon,                    0},
       {""},
-#line 124 "locfile-kw.gperf"
+#line 125 "locfile-kw.gperf"
       {"conversion_rate",        tok_conversion_rate,        0},
       {""}, {""}, {""}, {""}, {""},
-#line 63 "locfile-kw.gperf"
+#line 64 "locfile-kw.gperf"
       {"order_start",            tok_order_start,            0},
       {""}, {""}, {""}, {""}, {""},
-#line 178 "locfile-kw.gperf"
+#line 179 "locfile-kw.gperf"
       {"lang_ab",                tok_lang_ab,                0},
-#line 180 "locfile-kw.gperf"
+#line 181 "locfile-kw.gperf"
       {"lang_lib",               tok_lang_lib,               0},
       {""}, {""}, {""},
-#line 192 "locfile-kw.gperf"
+#line 193 "locfile-kw.gperf"
       {"contact",                tok_contact,                0},
       {""}, {""}, {""},
-#line 173 "locfile-kw.gperf"
-      {"country_ab3",            tok_country_ab3,            0},
+#line 57 "locfile-kw.gperf"
+      {"strcmp_collation",       tok_strcmp_collation,       0},
       {""}, {""}, {""},
-#line 193 "locfile-kw.gperf"
+#line 194 "locfile-kw.gperf"
       {"email",                  tok_email,                  0},
-#line 172 "locfile-kw.gperf"
-      {"country_ab2",            tok_country_ab2,            0},
+#line 174 "locfile-kw.gperf"
+      {"country_ab3",            tok_country_ab3,            0},
       {""}, {""}, {""},
 #line 55 "locfile-kw.gperf"
       {"default_missing",        tok_default_missing,        0},
-      {""}, {""},
-#line 195 "locfile-kw.gperf"
+#line 173 "locfile-kw.gperf"
+      {"country_ab2",            tok_country_ab2,            0},
+      {""},
+#line 196 "locfile-kw.gperf"
       {"fax",                    tok_fax,                    0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 174 "locfile-kw.gperf"
+#line 175 "locfile-kw.gperf"
       {"country_num",            tok_country_num,            0},
       {""}, {""}, {""}, {""}, {""}, {""},
 #line 51 "locfile-kw.gperf"
       {"map",                    tok_map,                    0},
-#line 65 "locfile-kw.gperf"
+#line 66 "locfile-kw.gperf"
       {"from",                   tok_from,                   0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 86 "locfile-kw.gperf"
+#line 87 "locfile-kw.gperf"
       {"mon_thousands_sep",      tok_mon_thousands_sep,      0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""},
-#line 81 "locfile-kw.gperf"
+#line 82 "locfile-kw.gperf"
       {"endif",                  tok_endif,                  0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 151 "locfile-kw.gperf"
+#line 152 "locfile-kw.gperf"
       {"alt_mon",                tok_alt_mon,                0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 76 "locfile-kw.gperf"
+#line 77 "locfile-kw.gperf"
       {"undef",                  tok_undef,                  0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 59 "locfile-kw.gperf"
+#line 60 "locfile-kw.gperf"
       {"collating-element",      tok_collating_element,      0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 152 "locfile-kw.gperf"
+#line 153 "locfile-kw.gperf"
       {"ab_alt_mon",             tok_ab_alt_mon,             0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 66 "locfile-kw.gperf"
+#line 67 "locfile-kw.gperf"
       {"forward",                tok_forward,                0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""},
-#line 85 "locfile-kw.gperf"
+#line 86 "locfile-kw.gperf"
       {"mon_decimal_point",      tok_mon_decimal_point,      0},
       {""}, {""},
-#line 169 "locfile-kw.gperf"
+#line 170 "locfile-kw.gperf"
       {"postal_fmt",             tok_postal_fmt,             0},
       {""}, {""}, {""}, {""}, {""},
-#line 60 "locfile-kw.gperf"
+#line 61 "locfile-kw.gperf"
       {"collating-symbol",       tok_collating_symbol,       0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
@@ -583,15 +585,15 @@ locfile_hash (register const char *str, register size_t len)
 #line 38 "locfile-kw.gperf"
       {"alnum",                  tok_alnum,                  0},
       {""},
-#line 87 "locfile-kw.gperf"
+#line 88 "locfile-kw.gperf"
       {"mon_grouping",           tok_mon_grouping,           0},
       {""},
-#line 179 "locfile-kw.gperf"
+#line 180 "locfile-kw.gperf"
       {"lang_term",              tok_lang_term,              0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""},
-#line 77 "locfile-kw.gperf"
+#line 78 "locfile-kw.gperf"
       {"ifdef",                  tok_ifdef,                  0},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
@@ -599,7 +601,7 @@ locfile_hash (register const char *str, register size_t len)
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""}, {""},
       {""}, {""}, {""}, {""},
-#line 138 "locfile-kw.gperf"
+#line 139 "locfile-kw.gperf"
       {"am_pm",                  tok_am_pm,                  0}
     };
 
diff --git a/locale/programs/locfile-token.h b/locale/programs/locfile-token.h
index 414ad30762..0ea73c51f1 100644
--- a/locale/programs/locfile-token.h
+++ b/locale/programs/locfile-token.h
@@ -91,6 +91,7 @@ enum token_t
   tok_translit_ignore,
   tok_default_missing,
   tok_lc_collate,
+  tok_strcmp_collation,
   tok_coll_weight_max,
   tok_section_symbol,
   tok_collating_element,

From patchwork Thu Jul 29 06:35:15 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Carlos O'Donell <carlos@redhat.com>
X-Patchwork-Id: 44508
Return-Path: <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id A08973898014
	for <patchwork@sourceware.org>; Thu, 29 Jul 2021 06:37:44 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org A08973898014
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1627540664;
	bh=Hit99DeWNOHoOtPcXQPy2etbZj1vZTkqouR/O4YkuxU=;
	h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
	 From;
	b=iZ5EddR8TDbh8Gg76yc8S/wEVuSPTD07TkmtjdFVHe3snVUgZUq21ZC5a1ZMavt2W
	 yoUshRtyuxGc+OlmM+kIIMehlyNtamRJfPLZeTJLK6Pl7Lwhl2PNAaTez95t3thDLV
	 LiJ9kCxvF4WrLLaqgcvPgMDXaduQqYEvo9EH8f2c=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTP id 69BB63896C25
 for <libc-alpha@sourceware.org>; Thu, 29 Jul 2021 06:35:31 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 69BB63896C25
Received: from mail-qk1-f199.google.com (mail-qk1-f199.google.com
 [209.85.222.199]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-548-Z6aWLnBmM5qhzPzLVZdMWQ-1; Thu, 29 Jul 2021 02:35:26 -0400
X-MC-Unique: Z6aWLnBmM5qhzPzLVZdMWQ-1
Received: by mail-qk1-f199.google.com with SMTP id
 i15-20020a05620a150fb02903b960837cbfso3217063qkk.10
 for <libc-alpha@sourceware.org>; Wed, 28 Jul 2021 23:35:26 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=oQu7zagBBJZmHg2YP1uG2W2aM+JcMKh6gmKAx2tIRCU=;
 b=YuCG+G5IHuTbNS69xM6AUfY4xX+GNFa9aaB3RLfBJW7PyMcPtLx8qC5YFntmNKHjhA
 MBwlCGFIylzFac9qFSnwB371vlPVd4j5o6W1aQKGoSPkb+/41O0Sh2Ou8/aLUVEq1y75
 /gFMs5UGscVqsAuRMtHltUWZz/8recqe4uriRu1K3qI3KMxMGag5dASwYvK4YaEfMiHQ
 zim5siT/CtMC/UnKjXLwKX1eiO6R96PBmVtnBc/cf9nfv2xfhbQLCKuF+lnChK5RpyuL
 XyaZLhe7F42Nil3T4xE5Xq3RVLhY+jXw5SGK+8mhPuC6TdvrVePuAUZPIz4vRAV42K+l
 jMPA==
X-Gm-Message-State: AOAM532XlZoOGbUo8Z5jWy1v7wF91Co2RTCd/PVvut5jgZfzZg/5OwE2
 eD/Ns7Y/x0yyy1VoEeEZTaW1lSXTK0hHU3bRiREGMMeQaFO69xYdypfVzs8zrKZmp67UG7jt3t+
 g6J6hYMAlegm4/+oabSRrCZpa0LzUTj3oSChpZBrUlKiX5qWWN1ScT1yTpy1KDcT7fo5teg==
X-Received: by 2002:ae9:c316:: with SMTP id n22mr3503891qkg.481.1627540525155;
 Wed, 28 Jul 2021 23:35:25 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJzEiEhBiA0SBaPIeJJtEbUlLHfaFy+i/VvR1q30EIFebqcOE56NzGVJIpowaOC898x1eWRsdA==
X-Received: by 2002:ae9:c316:: with SMTP id n22mr3503847qkg.481.1627540524263;
 Wed, 28 Jul 2021 23:35:24 -0700 (PDT)
Received: from athas.redhat.com (198-84-214-74.cpe.teksavvy.com.
 [198.84.214.74])
 by smtp.gmail.com with ESMTPSA id y2sm1311857qkd.38.2021.07.28.23.35.23
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Wed, 28 Jul 2021 23:35:23 -0700 (PDT)
To: libc-alpha@sourceware.org
Subject: [PATCH v4 3/3] Add generic C.UTF-8 locale (Bug 17318)
Date: Thu, 29 Jul 2021 02:35:15 -0400
Message-Id: <20210729063515.1541388-4-carlos@redhat.com>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20210729063515.1541388-1-carlos@redhat.com>
References: <20210729063515.1541388-1-carlos@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
X-Spam-Status: No, score=-11.8 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 KAM_SHORT,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL,
 SCC_5_SHORT_WORD_LINES, SPF_HELO_NONE, SPF_NONE,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: Carlos O'Donell via Libc-alpha
 <libc-alpha@sourceware.org>
From: Carlos O'Donell <carlos@redhat.com>
Reply-To: Carlos O'Donell <carlos@redhat.com>
Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org
Sender: "Libc-alpha"
 <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>

We add a new C.UTF-8 locale.  This locale is not built into glibc, but
is provided as a distinct locale.  The locale provides full support
for UTF-8, including full code point sorting via strcmp-based
collation.

The collation uses a new keyword 'strcmp_collation' which drops all
collation rules and generates an empty collation with zero rules, so
that strcmp is used for collation.  This ensures that we get full code
point sorting for C.UTF-8 with only 92 bytes of overhead (LC_COLLATE
structure information).
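
As an illustration only (not part of this patch), the sketch below
shows why byte-wise strcmp yields code point order for well-formed
UTF-8.  The helper names decode_utf8, codepoint_cmp and orders_agree
are hypothetical and do not exist in glibc; the decoder assumes
well-formed input and does no validation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode one UTF-8 sequence at *s and advance *s past it.
   Hypothetical helper: assumes well-formed input, no validation.  */
uint32_t
decode_utf8 (const unsigned char **s)
{
  const unsigned char *p = *s;
  uint32_t cp;
  int extra;

  /* A sequence must not start with a continuation byte.  */
  assert ((p[0] & 0xc0) != 0x80);

  if (p[0] < 0x80)
    { cp = p[0]; extra = 0; }
  else if (p[0] < 0xe0)
    { cp = p[0] & 0x1f; extra = 1; }
  else if (p[0] < 0xf0)
    { cp = p[0] & 0x0f; extra = 2; }
  else
    { cp = p[0] & 0x07; extra = 3; }

  /* Each continuation byte contributes its low six bits.  */
  for (int i = 1; i <= extra; i++)
    cp = (cp << 6) | (p[i] & 0x3f);

  *s = p + extra + 1;
  return cp;
}

/* Compare two UTF-8 strings code point by code point.
   Returns -1, 0 or 1.  */
int
codepoint_cmp (const char *a, const char *b)
{
  const unsigned char *pa = (const unsigned char *) a;
  const unsigned char *pb = (const unsigned char *) b;

  while (*pa != '\0' && *pb != '\0')
    {
      uint32_t ca = decode_utf8 (&pa);
      uint32_t cb = decode_utf8 (&pb);
      if (ca != cb)
        return ca < cb ? -1 : 1;
    }

  /* Equal so far: a string that is a prefix of the other sorts first.  */
  if (*pa == *pb)
    return 0;
  return *pa == '\0' ? -1 : 1;
}

/* Nonzero if strcmp and codepoint_cmp order A and B the same way.  */
int
orders_agree (const char *a, const char *b)
{
  int r = strcmp (a, b);
  return ((r > 0) - (r < 0)) == codepoint_cmp (a, b);
}
```

For example, orders_agree ("\xdf\xbf", "\xe0\xa0\x80") compares U+07FF
with U+0800 across the 2-byte/3-byte boundary and is expected to
return nonzero, since strcmp compares bytes as unsigned char.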

The new locale is added to SUPPORTED.  Minimal test data for specific
code points (minus those not supported by collate-test) is provided
in C.UTF-8.in, and this verifies code point sorting is working
reasonably across the range.  The locale was tested manually with the
full set of code points without failure.

The locale is harmonized with locales already shipping in Gentoo,
Debian, Ubuntu, Fedora, CentOS Stream, and RHEL. A new tst-iconv9 test
is added which verifies the C.UTF-8 locale is generally usable.

Testing for fnmatch, regexec, and regcomp is provided by extending
bug-regex1, bug-regex19, bug-regex4, bug-regex6, transbug, tst-fnmatch,
tst-regcomp-truncated, and tst-regex to use C.UTF-8.

Tested on x86_64 and i686 without regression.
---
 iconv/Makefile                |  22 +-
 iconv/tst-iconv9.c            |  87 ++++++
 localedata/C.UTF-8.in         | 157 ++++++++++
 localedata/Makefile           |   2 +
 localedata/SUPPORTED          |   1 +
 localedata/locales/C          | 194 ++++++++++++
 posix/bug-regex1.c            |  20 ++
 posix/bug-regex19.c           |  22 +-
 posix/bug-regex4.c            |  25 ++
 posix/bug-regex6.c            |   2 +-
 posix/transbug.c              |  22 +-
 posix/tst-fnmatch.input       | 549 +++++++++++++++++++++++++++++++++-
 posix/tst-regcomp-truncated.c |   1 +
 posix/tst-regex.c             |  25 +-
 14 files changed, 1104 insertions(+), 25 deletions(-)
 create mode 100644 iconv/tst-iconv9.c
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C

diff --git a/iconv/Makefile b/iconv/Makefile
index 07d77c9eca..9993f2d3f3 100644
--- a/iconv/Makefile
+++ b/iconv/Makefile
@@ -43,8 +43,19 @@ CFLAGS-charmap.c += -DCHARMAP_PATH='"$(i18ndir)/charmaps"' \
 CFLAGS-linereader.c += -DNO_TRANSLITERATION
 CFLAGS-simple-hash.c += -I../locale
 
-tests	= tst-iconv1 tst-iconv2 tst-iconv3 tst-iconv4 tst-iconv5 tst-iconv6 \
-	  tst-iconv7 tst-iconv8 tst-iconv-mt tst-iconv-opt
+tests = \
+	tst-iconv1 \
+	tst-iconv2 \
+	tst-iconv3 \
+	tst-iconv4 \
+	tst-iconv5 \
+	tst-iconv6 \
+	tst-iconv7 \
+	tst-iconv8 \
+	tst-iconv9 \
+	tst-iconv-mt \
+	tst-iconv-opt \
+	# tests
 
 others		= iconv_prog iconvconfig
 install-others-programs	= $(inst_bindir)/iconv
@@ -83,10 +94,15 @@ endif
 include ../Rules
 
 ifeq ($(run-built-tests),yes)
-LOCALES := en_US.UTF-8
+# We have to generate locales (list sorted alphabetically)
+LOCALES := \
+	C.UTF-8 \
+	en_US.UTF-8 \
+	# LOCALES
 include ../gen-locales.mk
 
 $(objpfx)tst-iconv-opt.out: $(gen-locales)
+$(objpfx)tst-iconv9.out: $(gen-locales)
 endif
 
 $(inst_bindir)/iconv: $(objpfx)iconv_prog $(+force)
diff --git a/iconv/tst-iconv9.c b/iconv/tst-iconv9.c
new file mode 100644
index 0000000000..78a5324279
--- /dev/null
+++ b/iconv/tst-iconv9.c
@@ -0,0 +1,87 @@
+/* Verify that using C.UTF-8 works.
+
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <iconv.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <string.h>
+#include <support/support.h>
+#include <support/check.h>
+
+/* This test does two things:
+   (1) Verify that we have likely included translit_combining in C.UTF-8.
+   (2) Verify default_missing is '?' as expected.  */
+
+/* ISO-8859-1 encoding of "für".  */
+char iso88591_in[] = { 0x66, 0xfc, 0x72, 0x0 };
+/* ASCII transliteration is "fur" with C.UTF-8 translit_combining.  */
+char ascii_exp[] = { 0x66, 0x75, 0x72, 0x0 };
+
+/* First 3-byte UTF-8 code point.  */
+char utf8_in[] = { 0xe0, 0xa0, 0x80, 0x0 };
+/* There is no ASCII transliteration for SAMARITAN LETTER ALAF
+   so we get default_missing used which is '?'.  */
+char default_missing_exp[] = { 0x3f, 0x0 };
+
+static int
+do_test (void)
+{
+  char ascii_out[5];
+  iconv_t cd;
+  char *inbuf;
+  char *outbuf;
+  size_t inbytes;
+  size_t outbytes;
+  size_t n;
+
+  /* The C.UTF-8 locale should include translit_combining, which provides
+     the transliteration for "LATIN SMALL LETTER U WITH DIAERESIS" which
+     is not provided by locale/C-translit.h.in.  */
+  xsetlocale (LC_ALL, "C.UTF-8");
+
+  /* From ISO-8859-1 to ASCII. */
+  cd = iconv_open ("ASCII//TRANSLIT,IGNORE", "ISO-8859-1");
+  TEST_VERIFY (cd != (iconv_t) -1);
+  inbuf = iso88591_in;
+  inbytes = 3;
+  outbuf = ascii_out;
+  outbytes = 3;
+  n = iconv (cd, &inbuf, &inbytes, &outbuf, &outbytes);
+  TEST_VERIFY (n != (size_t) -1);
+  *outbuf = '\0';
+  TEST_COMPARE_BLOB (ascii_out, 3, ascii_exp, 3);
+  TEST_VERIFY (iconv_close (cd) == 0);
+
+  /* From UTF-8 to ASCII.  */
+  cd = iconv_open ("ASCII//TRANSLIT,IGNORE", "UTF-8");
+  TEST_VERIFY (cd != (iconv_t) -1);
+  inbuf = utf8_in;
+  inbytes = 3;
+  outbuf = ascii_out;
+  outbytes = 3;
+  n = iconv (cd, &inbuf, &inbytes, &outbuf, &outbytes);
+  TEST_VERIFY (n != (size_t) -1);
+  *outbuf = '\0';
+  TEST_COMPARE_BLOB (ascii_out, 1, default_missing_exp, 1);
+  TEST_VERIFY (iconv_close (cd) == 0);
+
+  return 0;
+}
+
+#include <support/test-driver.c>
diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
new file mode 100644
index 0000000000..c31dcc2aa0
--- /dev/null
+++ b/localedata/C.UTF-8.in
@@ -0,0 +1,157 @@
+ ; <U1>
+ ; <U2>
+ ; <U3>
+ ; <U4>
+ ; <U5>
+ ; <U6>
+ ; <U7>
+ ; <U8>
+ ; <UE>
+ ; <UF>
+ ; <U10>
+ ; <U11>
+ ; <U12>
+ ; <U13>
+ ; <U14>
+ ; <U15>
+ ; <U16>
+ ; <U17>
+ ; <U18>
+ ; <U19>
+ ; <U1A>
+ ; <U1B>
+ ; <U1C>
+ ; <U1D>
+ ; <U1E>
+ ; <U1F>
+! ; <U21>
+" ; <U22>
+# ; <U23>
+$ ; <U24>
+% ; <U25>
+& ; <U26>
+' ; <U27>
+) ; <U29>
+* ; <U2A>
++ ; <U2B>
+, ; <U2C>
+- ; <U2D>
+. ; <U2E>
+/ ; <U2F>
+0 ; <U30>
+1 ; <U31>
+2 ; <U32>
+3 ; <U33>
+4 ; <U34>
+5 ; <U35>
+6 ; <U36>
+7 ; <U37>
+8 ; <U38>
+9 ; <U39>
+< ; <U3C>
+= ; <U3D>
+> ; <U3E>
+? ; <U3F>
+@ ; <U40>
+A ; <U41>
+B ; <U42>
+C ; <U43>
+D ; <U44>
+E ; <U45>
+F ; <U46>
+G ; <U47>
+H ; <U48>
+I ; <U49>
+J ; <U4A>
+K ; <U4B>
+L ; <U4C>
+M ; <U4D>
+N ; <U4E>
+O ; <U4F>
+P ; <U50>
+Q ; <U51>
+R ; <U52>
+S ; <U53>
+T ; <U54>
+U ; <U55>
+V ; <U56>
+W ; <U57>
+X ; <U58>
+Y ; <U59>
+Z ; <U5A>
+[ ; <U5B>
+\ ; <U5C>
+] ; <U5D>
+^ ; <U5E>
+_ ; <U5F>
+` ; <U60>
+a ; <U61>
+b ; <U62>
+c ; <U63>
+d ; <U64>
+e ; <U65>
+f ; <U66>
+g ; <U67>
+h ; <U68>
+i ; <U69>
+j ; <U6A>
+k ; <U6B>
+l ; <U6C>
+m ; <U6D>
+n ; <U6E>
+o ; <U6F>
+p ; <U70>
+q ; <U71>
+r ; <U72>
+s ; <U73>
+t ; <U74>
+u ; <U75>
+v ; <U76>
+w ; <U77>
+x ; <U78>
+y ; <U79>
+z ; <U7A>
+{ ; <U7B>
+| ; <U7C>
+} ; <U7D>
+~ ; <U7E>
+ ; <U7F>
+ ; <U80>
+ÿ ; <UFF>
+Ā ; <U100>
+࿿ ; <UFFF>
+က ; <U1000>
+� ; <UFFFD>
+￿ ; <UFFFF>
+? ; <U10000>
+? ; <U1FFFF>
+? ; <U20000>
+? ; <U2FFFF>
+? ; <U30000>
+? ; <U3FFFE>
+? ; <U40000>
+? ; <U4FFFF>
+? ; <U50000>
+? ; <U5FFFF>
+? ; <U60000>
+? ; <U6FFFF>
+? ; <U70000>
+? ; <U7FFFF>
+? ; <U80000>
+? ; <U8FFFF>
+? ; <U90000>
+? ; <U9FFFF>
+? ; <UA0000>
+? ; <UAFFFF>
+? ; <UB0000>
+? ; <UBFFFF>
+? ; <UC0001>
+? ; <UCFFCC>
+? ; <UD000E>
+? ; <UDFFFF>
+? ; <UE0001>
+? ; <UEFFFF>
+? ; <UF0001>
+? ; <UFFFFF>
+? ; <U100001>
+? ; <U10FFFF>
diff --git a/localedata/Makefile b/localedata/Makefile
index f585e0dd41..66a269641b 100644
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -47,6 +47,7 @@ test-input := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
@@ -206,6 +207,7 @@ LOCALES := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
diff --git a/localedata/SUPPORTED b/localedata/SUPPORTED
index 1ee5b5e8c8..d768aa4795 100644
--- a/localedata/SUPPORTED
+++ b/localedata/SUPPORTED
@@ -79,6 +79,7 @@ brx_IN/UTF-8 \
 bs_BA.UTF-8/UTF-8 \
 bs_BA/ISO-8859-2 \
 byn_ER/UTF-8 \
+C.UTF-8/UTF-8 \
 ca_AD.UTF-8/UTF-8 \
 ca_AD/ISO-8859-15 \
 ca_ES.UTF-8/UTF-8 \
diff --git a/localedata/locales/C b/localedata/locales/C
new file mode 100644
index 0000000000..651691c724
--- /dev/null
+++ b/localedata/locales/C
@@ -0,0 +1,194 @@
+escape_char /
+comment_char %
+% Locale for C locale in UTF-8
+
+LC_IDENTIFICATION
+title      "C locale"
+source     ""
+address    ""
+contact    ""
+email      "bug-glibc-locales@gnu.org"
+tel        ""
+fax        ""
+language   ""
+territory  ""
+revision   "2.0"
+date       "2020-06-28"
+category  "i18n:2012";LC_IDENTIFICATION
+category  "i18n:2012";LC_CTYPE
+category  "i18n:2012";LC_COLLATE
+category  "i18n:2012";LC_TIME
+category  "i18n:2012";LC_NUMERIC
+category  "i18n:2012";LC_MONETARY
+category  "i18n:2012";LC_MESSAGES
+category  "i18n:2012";LC_PAPER
+category  "i18n:2012";LC_NAME
+category  "i18n:2012";LC_ADDRESS
+category  "i18n:2012";LC_TELEPHONE
+category  "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_CTYPE
+% Include only the i18n character type classes without any of the
+% transliteration that i18n uses by default.
+copy "i18n_ctype"
+
+% Include the neutral transliterations.  The builtin C and
+% POSIX locales have more than 1600 transliterations built into
+% the locales, and these are a superset of those.
+translit_start
+include "translit_neutral";""
+% We must use '?' for default_missing because the transliteration
+% framework includes it directly into the output and so it must
+% be compatible with ASCII if that is the target character set.
+default_missing <U003F>
+translit_end
+
+% Include the transliterations that can convert combined characters.
+% These are generally expected by users.
+translit_start
+include "translit_combining";""
+translit_end
+
+END LC_CTYPE
+
+LC_COLLATE
+% The keyword 'strcmp_collation' in any part of any LC_COLLATE
+% immediately discards all collation information and causes the
+% locale to use strcmp for collation comparison.  This is exactly
+% what is needed for C (ASCII) or C.UTF-8.
+strcmp_collation
+END LC_COLLATE
+
+LC_MONETARY
+
+% This is the 14652 i18n fdcc-set definition for the LC_MONETARY
+% category (except for the int_curr_symbol and currency_symbol, they are
+% empty in the 14652 i18n fdcc-set definition and also empty in
+% glibc/locale/C-monetary.c.).
+int_curr_symbol     ""
+currency_symbol     ""
+mon_decimal_point   "."
+mon_thousands_sep   ""
+mon_grouping        -1
+positive_sign       ""
+negative_sign       "-"
+int_frac_digits     -1
+frac_digits         -1
+p_cs_precedes       -1
+int_p_sep_by_space  -1
+p_sep_by_space      -1
+n_cs_precedes       -1
+int_n_sep_by_space  -1
+n_sep_by_space      -1
+p_sign_posn         -1
+n_sign_posn         -1
+%
+END LC_MONETARY
+
+LC_NUMERIC
+% This is the POSIX Locale definition for
+% the LC_NUMERIC category.
+%
+decimal_point   "."
+thousands_sep   ""
+grouping        -1
+END LC_NUMERIC
+
+LC_TIME
+% This is the POSIX Locale definition for the LC_TIME category with the
+% exception that time is per ISO 8601 and 24-hour.
+%
+% Abbreviated weekday names (%a)
+abday       "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
+
+% Full weekday names (%A)
+day         "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
+            "Friday";"Saturday"
+
+% Abbreviated month names (%b)
+abmon       "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
+            "Oct";"Nov";"Dec"
+
+% Full month names (%B)
+mon         "January";"February";"March";"April";"May";"June";"July";/
+            "August";"September";"October";"November";"December"
+
+% Week description, consists of three fields:
+% 1. Number of days in a week.
+% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
+% 3. The weekday number to be contained in the first week of the year.
+%
+% ISO 8601 conforming applications should use the values 7, 19971201 (a
+% Monday), and 4 (Thursday), respectively.
+week    7;19971201;4
+first_weekday	1
+first_workday	2
+
+% Appropriate date and time representation (%c)
+d_t_fmt "%a %b %e %H:%M:%S %Y"
+
+% Appropriate date representation (%x)
+d_fmt   "%m/%d/%y"
+
+% Appropriate time representation (%X)
+t_fmt   "%H:%M:%S"
+
+% Appropriate AM/PM time representation (%r)
+t_fmt_ampm "%I:%M:%S %p"
+
+% Equivalent of AM/PM (%p)
+am_pm	"AM";"PM"
+
+% Appropriate date representation (date(1))   "%a %b %e %H:%M:%S %Z %Y"
+date_fmt	"%a %b %e %H:%M:%S %Z %Y"
+END LC_TIME
+
+LC_MESSAGES
+% This is the POSIX Locale definition for
+% the LC_MESSAGES category.
+%
+yesexpr "^[yY]"
+noexpr  "^[nN]"
+yesstr  "Yes"
+nostr   "No"
+END LC_MESSAGES
+
+LC_PAPER
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_PAPER category.
+% (A4 paper, this is also used in the built in C/POSIX
+% locale in glibc/locale/C-paper.c)
+height   297
+width    210
+END LC_PAPER
+
+LC_NAME
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_NAME category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-name.c)
+name_fmt    "%p%t%g%t%m%t%f"
+END LC_NAME
+
+LC_ADDRESS
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_ADDRESS category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-address.c)
+postal_fmt    "%a%N%f%N%d%N%b%N%s %h %e %r%N%C-%z %T%N%c%N"
+END LC_ADDRESS
+
+LC_TELEPHONE
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_TELEPHONE category.
+% "+%c %a %l"
+tel_int_fmt    "+%c %a %l"
+% (also used in the built in C/POSIX locale in glibc/locale/C-telephone.c)
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_MEASUREMENT category.
+% (same as in the built in C/POSIX locale in glibc/locale/C-measurement.c)
+%metric
+measurement    1
+END LC_MEASUREMENT
diff --git a/posix/bug-regex1.c b/posix/bug-regex1.c
index 38eb543951..85da8cc7ca 100644
--- a/posix/bug-regex1.c
+++ b/posix/bug-regex1.c
@@ -41,6 +41,26 @@ main (void)
 	puts (" -> OK");
     }
 
+  puts ("in C.UTF-8 locale");
+  setlocale (LC_ALL, "C.UTF-8");
+  s = re_compile_pattern ("[an�]*n", 7, &regex);
+  if (s != NULL)
+    {
+      puts ("re_compile_pattern returned non-NULL value");
+      result = 1;
+    }
+  else
+    {
+      match = re_match (&regex, "an", 2, 0, &regs);
+      if (match != 2)
+	{
+	  printf ("re_match returned %d, expected 2\n", match);
+	  result = 1;
+	}
+      else
+	puts (" -> OK");
+    }
+
   puts ("in de_DE.ISO-8859-1 locale");
   setlocale (LC_ALL, "de_DE.ISO-8859-1");
   s = re_compile_pattern ("[an�]*n", 7, &regex);
diff --git a/posix/bug-regex19.c b/posix/bug-regex19.c
index b3fee0a730..e00ff60a14 100644
--- a/posix/bug-regex19.c
+++ b/posix/bug-regex19.c
@@ -25,6 +25,7 @@
 #include <string.h>
 #include <locale.h>
 #include <libc-diag.h>
+#include <support/support.h>
 
 #define BRE RE_SYNTAX_POSIX_BASIC
 #define ERE RE_SYNTAX_POSIX_EXTENDED
@@ -407,8 +408,8 @@ do_mb_tests (const struct test_s *test)
   return 0;
 }
 
-int
-main (void)
+static int
+do_test (void)
 {
   size_t i;
   int ret = 0;
@@ -417,20 +418,17 @@ main (void)
 
   for (i = 0; i < sizeof (tests) / sizeof (tests[0]); ++i)
     {
-      if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
-	{
-	  puts ("setlocale de_DE.ISO-8859-1 failed");
-	  ret = 1;
-	}
+      xsetlocale (LC_ALL, "de_DE.ISO-8859-1");
       ret |= do_one_test (&tests[i], "");
-      if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
-	{
-	  puts ("setlocale de_DE.UTF-8 failed");
-	  ret = 1;
-	}
+      xsetlocale (LC_ALL, "de_DE.UTF-8");
+      ret |= do_one_test (&tests[i], "UTF-8 ");
+      ret |= do_mb_tests (&tests[i]);
+      xsetlocale (LC_ALL, "C.UTF-8");
       ret |= do_one_test (&tests[i], "UTF-8 ");
       ret |= do_mb_tests (&tests[i]);
     }
 
   return ret;
 }
+
+#include <support/test-driver.c>
diff --git a/posix/bug-regex4.c b/posix/bug-regex4.c
index 8d5ae11567..6475833c52 100644
--- a/posix/bug-regex4.c
+++ b/posix/bug-regex4.c
@@ -32,8 +32,33 @@ main (void)
 
   memset (&regex, '\0', sizeof (regex));
 
+  printf ("INFO: Checking C.\n");
   setlocale (LC_ALL, "C");
 
+  s = re_compile_pattern ("ab[cde]", 7, &regex);
+  if (s != NULL)
+    {
+      puts ("re_compile_pattern returned non-NULL value");
+      result = 1;
+    }
+  else
+    {
+      match[0] = re_search_2 (&regex, "xyabez", 6, "", 0, 1, 5, NULL, 6);
+      match[1] = re_search_2 (&regex, NULL, 0, "abc", 3, 0, 3, NULL, 3);
+      match[2] = re_search_2 (&regex, "xya", 3, "bd", 2, 2, 3, NULL, 5);
+      if (match[0] != 2 || match[1] != 0 || match[2] != 2)
+	{
+	  printf ("re_search_2 returned %d,%d,%d, expected 2,0,2\n",
+		  match[0], match[1], match[2]);
+	  result = 1;
+	}
+      else
+	puts (" -> OK");
+    }
+
+  printf ("INFO: Checking C.UTF-8.\n");
+  setlocale (LC_ALL, "C.UTF-8");
+
   s = re_compile_pattern ("ab[cde]", 7, &regex);
   if (s != NULL)
     {
diff --git a/posix/bug-regex6.c b/posix/bug-regex6.c
index 2bdf2126a4..0929b69b83 100644
--- a/posix/bug-regex6.c
+++ b/posix/bug-regex6.c
@@ -30,7 +30,7 @@ main (int argc, char *argv[])
   regex_t re;
   regmatch_t mat[10];
   int i, j, ret = 0;
-  const char *locales[] = { "C", "de_DE.UTF-8" };
+  const char *locales[] = { "C", "C.UTF-8", "de_DE.UTF-8" };
   const char *string = "http://www.regex.com/pattern/matching.html#intro";
   regmatch_t expect[10] = {
     { 0, 48 }, { 0, 5 }, { 0, 4 }, { 5, 20 }, { 7, 20 }, { 20, 42 },
diff --git a/posix/transbug.c b/posix/transbug.c
index d0983b4d44..71632b7976 100644
--- a/posix/transbug.c
+++ b/posix/transbug.c
@@ -116,14 +116,30 @@ do_test (void)
   static const char lower[] = "[[:lower:]]+";
   static const char upper[] = "[[:upper:]]+";
   struct re_registers regs[4];
+  int result;
 
+#define CHECK(exp) \
+  if (exp) { puts (#exp); result = 1; }
+
+  printf ("INFO: Checking C.\n");
   setlocale (LC_ALL, "C");
 
   (void) re_set_syntax (RE_SYNTAX_GNU_AWK);
 
-  int result;
-#define CHECK(exp) \
-  if (exp) { puts (#exp); result = 1; }
+  result = run_test (lower, regs);
+  result |= run_test (upper, &regs[2]);
+  if (! result)
+    {
+      CHECK (regs[0].start[0] != regs[2].start[0]);
+      CHECK (regs[0].end[0] != regs[2].end[0]);
+      CHECK (regs[1].start[0] != regs[3].start[0]);
+      CHECK (regs[1].end[0] != regs[3].end[0]);
+    }
+
+  printf ("INFO: Checking C.UTF-8.\n");
+  setlocale (LC_ALL, "C.UTF-8");
+
+  (void) re_set_syntax (RE_SYNTAX_GNU_AWK);
 
   result = run_test (lower, regs);
   result |= run_test (upper, &regs[2]);
diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index 67aac5aada..6ff5318032 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -472,6 +472,397 @@ C		"\\"			"[Z-\\]]"	       0
 C		"]"			"[Z-\\]]"	       0
 C		"-"			"[Z-\\]]"	       NOMATCH
 
+# B.6 004(C)
+C.UTF-8		 "!#%+,-./01234567889"	"!#%+,-./01234567889"  0
+C.UTF-8		 ":;=@ABCDEFGHIJKLMNO"	":;=@ABCDEFGHIJKLMNO"  0
+C.UTF-8		 "PQRSTUVWXYZ]abcdefg"	"PQRSTUVWXYZ]abcdefg"  0
+C.UTF-8		 "hijklmnopqrstuvwxyz"	"hijklmnopqrstuvwxyz"  0
+C.UTF-8		 "^_{}~"		"^_{}~"		       0
+
+# B.6 005(C)
+C.UTF-8		 "\"$&'()"		"\\\"\\$\\&\\'\\(\\)"  0
+C.UTF-8		 "*?[\\`|"		"\\*\\?\\[\\\\\\`\\|"  0
+C.UTF-8		 "<>"			"\\<\\>"	       0
+
+# B.6 006(C)
+C.UTF-8		 "?*["			"[?*[][?*[][?*[]"      0
+C.UTF-8		 "a/b"			"?/b"		       0
+
+# B.6 007(C)
+C.UTF-8		 "a/b"			"a?b"		       0
+C.UTF-8		 "a/b"			"a/?"		       0
+C.UTF-8		 "aa/b"			"?/b"		       NOMATCH
+C.UTF-8		 "aa/b"			"a?b"		       NOMATCH
+C.UTF-8		 "a/bb"			"a/?"		       NOMATCH
+
+# B.6 009(C)
+C.UTF-8		 "abc"			"[abc]"		       NOMATCH
+C.UTF-8		 "x"			"[abc]"		       NOMATCH
+C.UTF-8		 "a"			"[abc]"		       0
+C.UTF-8		 "["			"[[abc]"	       0
+C.UTF-8		 "a"			"[][abc]"	       0
+C.UTF-8		 "a]"			"[]a]]"		       0
+
+# B.6 010(C)
+C.UTF-8		 "xyz"			"[!abc]"	       NOMATCH
+C.UTF-8		 "x"			"[!abc]"	       0
+C.UTF-8		 "a"			"[!abc]"	       NOMATCH
+
+# B.6 011(C)
+C.UTF-8		 "]"			"[][abc]"	       0
+C.UTF-8		 "abc]"			"[][abc]"	       NOMATCH
+C.UTF-8		 "[]abc"		"[][]abc"	       NOMATCH
+C.UTF-8		 "]"			"[!]]"		       NOMATCH
+C.UTF-8		 "aa]"			"[!]a]"		       NOMATCH
+C.UTF-8		 "]"			"[!a]"		       0
+C.UTF-8		 "]]"			"[!a]]"		       0
+
+# B.6 012(C)
+C.UTF-8		 "a"			"[[.a.]]"	       0
+C.UTF-8		 "-"			"[[.-.]]"	       0
+C.UTF-8		 "-"			"[[.-.][.].]]"	       0
+C.UTF-8		 "-"			"[[.].][.-.]]"	       0
+C.UTF-8		 "-"			"[[.-.][=u=]]"	       0
+C.UTF-8		 "-"			"[[.-.][:alpha:]]"     0
+C.UTF-8		 "a"			"[![.a.]]"	       NOMATCH
+
+# B.6 013(C)
+C.UTF-8		 "a"			"[[.b.]]"	       NOMATCH
+C.UTF-8		 "a"			"[[.b.][.c.]]"	       NOMATCH
+C.UTF-8		 "a"			"[[.b.][=b=]]"	       NOMATCH
+
+
+# B.6 015(C)
+C.UTF-8		 "a"			"[[=a=]]"	       0
+C.UTF-8		 "b"			"[[=a=]b]"	       0
+C.UTF-8		 "b"			"[[=a=][=b=]]"	       0
+C.UTF-8		 "a"			"[[=a=][=b=]]"	       0
+C.UTF-8		 "a"			"[[=a=][.b.]]"	       0
+C.UTF-8		 "a"			"[[=a=][:digit:]]"     0
+
+# B.6 016(C)
+C.UTF-8		 "="			"[[=a=]b]"	       NOMATCH
+C.UTF-8		 "]"			"[[=a=]b]"	       NOMATCH
+C.UTF-8		 "a"			"[[=b=][=c=]]"	       NOMATCH
+C.UTF-8		 "a"			"[[=b=][.].]]"	       NOMATCH
+C.UTF-8		 "a"			"[[=b=][:digit:]]"     NOMATCH
+
+# B.6 017(C)
+C.UTF-8		 "a"			"[[:alnum:]]"	       0
+C.UTF-8		 "a"			"[![:alnum:]]"	       NOMATCH
+C.UTF-8		 "-"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "a]a"			"[[:alnum:]]a"	       NOMATCH
+C.UTF-8		 "-"			"[[:alnum:]-]"	       0
+C.UTF-8		 "aa"			"[[:alnum:]]a"	       0
+C.UTF-8		 "-"			"[![:alnum:]]"	       0
+C.UTF-8		 "]"			"[!][:alnum:]]"	       NOMATCH
+C.UTF-8		 "["			"[![:alnum:][]"	       NOMATCH
+C.UTF-8		 "a"			"[[:alnum:]]"	       0
+C.UTF-8		 "b"			"[[:alnum:]]"	       0
+C.UTF-8		 "c"			"[[:alnum:]]"	       0
+C.UTF-8		 "d"			"[[:alnum:]]"	       0
+C.UTF-8		 "e"			"[[:alnum:]]"	       0
+C.UTF-8		 "f"			"[[:alnum:]]"	       0
+C.UTF-8		 "g"			"[[:alnum:]]"	       0
+C.UTF-8		 "h"			"[[:alnum:]]"	       0
+C.UTF-8		 "i"			"[[:alnum:]]"	       0
+C.UTF-8		 "j"			"[[:alnum:]]"	       0
+C.UTF-8		 "k"			"[[:alnum:]]"	       0
+C.UTF-8		 "l"			"[[:alnum:]]"	       0
+C.UTF-8		 "m"			"[[:alnum:]]"	       0
+C.UTF-8		 "n"			"[[:alnum:]]"	       0
+C.UTF-8		 "o"			"[[:alnum:]]"	       0
+C.UTF-8		 "p"			"[[:alnum:]]"	       0
+C.UTF-8		 "q"			"[[:alnum:]]"	       0
+C.UTF-8		 "r"			"[[:alnum:]]"	       0
+C.UTF-8		 "s"			"[[:alnum:]]"	       0
+C.UTF-8		 "t"			"[[:alnum:]]"	       0
+C.UTF-8		 "u"			"[[:alnum:]]"	       0
+C.UTF-8		 "v"			"[[:alnum:]]"	       0
+C.UTF-8		 "w"			"[[:alnum:]]"	       0
+C.UTF-8		 "x"			"[[:alnum:]]"	       0
+C.UTF-8		 "y"			"[[:alnum:]]"	       0
+C.UTF-8		 "z"			"[[:alnum:]]"	       0
+C.UTF-8		 "A"			"[[:alnum:]]"	       0
+C.UTF-8		 "B"			"[[:alnum:]]"	       0
+C.UTF-8		 "C"			"[[:alnum:]]"	       0
+C.UTF-8		 "D"			"[[:alnum:]]"	       0
+C.UTF-8		 "E"			"[[:alnum:]]"	       0
+C.UTF-8		 "F"			"[[:alnum:]]"	       0
+C.UTF-8		 "G"			"[[:alnum:]]"	       0
+C.UTF-8		 "H"			"[[:alnum:]]"	       0
+C.UTF-8		 "I"			"[[:alnum:]]"	       0
+C.UTF-8		 "J"			"[[:alnum:]]"	       0
+C.UTF-8		 "K"			"[[:alnum:]]"	       0
+C.UTF-8		 "L"			"[[:alnum:]]"	       0
+C.UTF-8		 "M"			"[[:alnum:]]"	       0
+C.UTF-8		 "N"			"[[:alnum:]]"	       0
+C.UTF-8		 "O"			"[[:alnum:]]"	       0
+C.UTF-8		 "P"			"[[:alnum:]]"	       0
+C.UTF-8		 "Q"			"[[:alnum:]]"	       0
+C.UTF-8		 "R"			"[[:alnum:]]"	       0
+C.UTF-8		 "S"			"[[:alnum:]]"	       0
+C.UTF-8		 "T"			"[[:alnum:]]"	       0
+C.UTF-8		 "U"			"[[:alnum:]]"	       0
+C.UTF-8		 "V"			"[[:alnum:]]"	       0
+C.UTF-8		 "W"			"[[:alnum:]]"	       0
+C.UTF-8		 "X"			"[[:alnum:]]"	       0
+C.UTF-8		 "Y"			"[[:alnum:]]"	       0
+C.UTF-8		 "Z"			"[[:alnum:]]"	       0
+C.UTF-8		 "0"			"[[:alnum:]]"	       0
+C.UTF-8		 "1"			"[[:alnum:]]"	       0
+C.UTF-8		 "2"			"[[:alnum:]]"	       0
+C.UTF-8		 "3"			"[[:alnum:]]"	       0
+C.UTF-8		 "4"			"[[:alnum:]]"	       0
+C.UTF-8		 "5"			"[[:alnum:]]"	       0
+C.UTF-8		 "6"			"[[:alnum:]]"	       0
+C.UTF-8		 "7"			"[[:alnum:]]"	       0
+C.UTF-8		 "8"			"[[:alnum:]]"	       0
+C.UTF-8		 "9"			"[[:alnum:]]"	       0
+C.UTF-8		 "!"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "#"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "%"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "+"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ","			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "-"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "."			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "/"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ":"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ";"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "="			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "@"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "["			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "\\"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "]"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "^"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "_"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "{"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "}"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "~"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "\""			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "$"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "&"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "'"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "("			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ")"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "*"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "?"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "`"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "|"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "<"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 ">"			"[[:alnum:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:cntrl:]]"	       0
+C.UTF-8		 "t"			"[[:cntrl:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:lower:]]"	       0
+C.UTF-8		 "\t"			"[[:lower:]]"	       NOMATCH
+C.UTF-8		 "T"			"[[:lower:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:space:]]"	       0
+C.UTF-8		 "t"			"[[:space:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:alpha:]]"	       0
+C.UTF-8		 "\t"			"[[:alpha:]]"	       NOMATCH
+C.UTF-8		 "0"			"[[:digit:]]"	       0
+C.UTF-8		 "\t"			"[[:digit:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:digit:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:print:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:print:]]"	       0
+C.UTF-8		 "T"			"[[:upper:]]"	       0
+C.UTF-8		 "\t"			"[[:upper:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:upper:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:blank:]]"	       0
+C.UTF-8		 "t"			"[[:blank:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:graph:]]"	       NOMATCH
+C.UTF-8		 "t"			"[[:graph:]]"	       0
+C.UTF-8		 "."			"[[:punct:]]"	       0
+C.UTF-8		 "t"			"[[:punct:]]"	       NOMATCH
+C.UTF-8		 "\t"			"[[:punct:]]"	       NOMATCH
+C.UTF-8		 "0"			"[[:xdigit:]]"	       0
+C.UTF-8		 "\t"			"[[:xdigit:]]"	       NOMATCH
+C.UTF-8		 "a"			"[[:xdigit:]]"	       0
+C.UTF-8		 "A"			"[[:xdigit:]]"	       0
+C.UTF-8		 "t"			"[[:xdigit:]]"	       NOMATCH
+C.UTF-8		 "a"			"[[alpha]]"	       NOMATCH
+C.UTF-8		 "a"			"[[alpha:]]"	       NOMATCH
+C.UTF-8		 "a]"			"[[alpha]]"	       0
+C.UTF-8		 "a]"			"[[alpha:]]"	       0
+C.UTF-8		 "a"			"[[:alpha:][.b.]]"     0
+C.UTF-8		 "a"			"[[:alpha:][=b=]]"     0
+C.UTF-8		 "a"			"[[:alpha:][:digit:]]" 0
+C.UTF-8		 "a"			"[[:digit:][:alpha:]]" 0
+
+# B.6 018(C)
+C.UTF-8		 "a"			"[a-c]"		       0
+C.UTF-8		 "b"			"[a-c]"		       0
+C.UTF-8		 "c"			"[a-c]"		       0
+C.UTF-8		 "a"			"[b-c]"		       NOMATCH
+C.UTF-8		 "d"			"[b-c]"		       NOMATCH
+C.UTF-8		 "B"			"[a-c]"		       NOMATCH
+C.UTF-8		 "b"			"[A-C]"		       NOMATCH
+C.UTF-8		 ""			"[a-c]"		       NOMATCH
+C.UTF-8		 "as"			"[a-ca-z]"	       NOMATCH
+C.UTF-8		 "a"			"[[.a.]-c]"	       0
+C.UTF-8		 "a"			"[a-[.c.]]"	       0
+C.UTF-8		 "a"			"[[.a.]-[.c.]]"	       0
+C.UTF-8		 "b"			"[[.a.]-c]"	       0
+C.UTF-8		 "b"			"[a-[.c.]]"	       0
+C.UTF-8		 "b"			"[[.a.]-[.c.]]"	       0
+C.UTF-8		 "c"			"[[.a.]-c]"	       0
+C.UTF-8		 "c"			"[a-[.c.]]"	       0
+C.UTF-8		 "c"			"[[.a.]-[.c.]]"	       0
+C.UTF-8		 "d"			"[[.a.]-c]"	       NOMATCH
+C.UTF-8		 "d"			"[a-[.c.]]"	       NOMATCH
+C.UTF-8		 "d"			"[[.a.]-[.c.]]"	       NOMATCH
+
+# B.6 019(C)
+C.UTF-8		 "a"			"[c-a]"		       NOMATCH
+C.UTF-8		 "a"			"[[.c.]-a]"	       NOMATCH
+C.UTF-8		 "a"			"[c-[.a.]]"	       NOMATCH
+C.UTF-8		 "a"			"[[.c.]-[.a.]]"	       NOMATCH
+C.UTF-8		 "c"			"[c-a]"		       NOMATCH
+C.UTF-8		 "c"			"[[.c.]-a]"	       NOMATCH
+C.UTF-8		 "c"			"[c-[.a.]]"	       NOMATCH
+C.UTF-8		 "c"			"[[.c.]-[.a.]]"	       NOMATCH
+
+# B.6 020(C)
+C.UTF-8		 "a"			"[a-c0-9]"	       0
+C.UTF-8		 "d"			"[a-c0-9]"	       NOMATCH
+C.UTF-8		 "B"			"[a-c0-9]"	       NOMATCH
+
+# B.6 021(C)
+C.UTF-8		 "-"			"[-a]"		       0
+C.UTF-8		 "a"			"[-b]"		       NOMATCH
+C.UTF-8		 "-"			"[!-a]"		       NOMATCH
+C.UTF-8		 "a"			"[!-b]"		       0
+C.UTF-8		 "-"			"[a-c-0-9]"	       0
+C.UTF-8		 "b"			"[a-c-0-9]"	       0
+C.UTF-8		 "a:"			"a[0-9-a]"	       NOMATCH
+C.UTF-8		 "a:"			"a[09-a]"	       0
+
+# B.6 024(C)
+C.UTF-8		 ""			"*"		       0
+C.UTF-8		 "asd/sdf"		"*"		       0
+
+# B.6 025(C)
+C.UTF-8		 "as"			"[a-c][a-z]"	       0
+C.UTF-8		 "as"			"??"		       0
+
+# B.6 026(C)
+C.UTF-8		 "asd/sdf"		"as*df"		       0
+C.UTF-8		 "asd/sdf"		"as*"		       0
+C.UTF-8		 "asd/sdf"		"*df"		       0
+C.UTF-8		 "asd/sdf"		"as*dg"		       NOMATCH
+C.UTF-8		 "asdf"			"as*df"		       0
+C.UTF-8		 "asdf"			"as*df?"	       NOMATCH
+C.UTF-8		 "asdf"			"as*??"		       0
+C.UTF-8		 "asdf"			"a*???"		       0
+C.UTF-8		 "asdf"			"*????"		       0
+C.UTF-8		 "asdf"			"????*"		       0
+C.UTF-8		 "asdf"			"??*?"		       0
+
+# B.6 027(C)
+C.UTF-8		 "/"			"/"		       0
+C.UTF-8		 "/"			"/*"		       0
+C.UTF-8		 "/"			"*/"		       0
+C.UTF-8		 "/"			"/?"		       NOMATCH
+C.UTF-8		 "/"			"?/"		       NOMATCH
+C.UTF-8		 "/"			"?"		       0
+C.UTF-8		 "."			"?"		       0
+C.UTF-8		 "/."			"??"		       0
+C.UTF-8		 "/"			"[!a-c]"	       0
+C.UTF-8		 "."			"[!a-c]"	       0
+
+# B.6 029(C)
+C.UTF-8		 "/"			"/"		       0       PATHNAME
+C.UTF-8		 "//"			"//"		       0       PATHNAME
+C.UTF-8		 "/.a"			"/*"		       0       PATHNAME
+C.UTF-8		 "/.a"			"/?a"		       0       PATHNAME
+C.UTF-8		 "/.a"			"/[!a-z]a"	       0       PATHNAME
+C.UTF-8		 "/.a/.b"		"/*/?b"		       0       PATHNAME
+
+# B.6 030(C)
+C.UTF-8		 "/"			"?"		       NOMATCH PATHNAME
+C.UTF-8		 "/"			"*"		       NOMATCH PATHNAME
+C.UTF-8		 "a/b"			"a?b"		       NOMATCH PATHNAME
+C.UTF-8		 "/.a/.b"		"/*b"		       NOMATCH PATHNAME
+
+# B.6 031(C)
+C.UTF-8		 "/$"			"\\/\\$"	       0
+C.UTF-8		 "/["			"\\/\\["	       0
+C.UTF-8		 "/["			"\\/["		       0
+C.UTF-8		 "/[]"			"\\/\\[]"	       0
+
+# B.6 032(C)
+C.UTF-8		 "/$"			"\\/\\$"	       NOMATCH NOESCAPE
+C.UTF-8		 "/\\$"			"\\/\\$"	       NOMATCH NOESCAPE
+C.UTF-8		 "\\/\\$"		"\\/\\$"	       0       NOESCAPE
+
+# B.6 033(C)
+C.UTF-8		 ".asd"			".*"		       0       PERIOD
+C.UTF-8		 "/.asd"		"*"		       0       PERIOD
+C.UTF-8		 "/as/.df"		"*/?*f"		       0       PERIOD
+C.UTF-8		 "..asd"		".[!a-z]*"	       0       PERIOD
+
+# B.6 034(C)
+C.UTF-8		 ".asd"			"*"		       NOMATCH PERIOD
+C.UTF-8		 ".asd"			"?asd"		       NOMATCH PERIOD
+C.UTF-8		 ".asd"			"[!a-z]*"	       NOMATCH PERIOD
+
+# B.6 035(C)
+C.UTF-8		 "/."			"/."		       0       PATHNAME|PERIOD
+C.UTF-8		 "/.a./.b."		"/.*/.*"	       0       PATHNAME|PERIOD
+C.UTF-8		 "/.a./.b."		"/.??/.??"	       0       PATHNAME|PERIOD
+
+# B.6 036(C)
+C.UTF-8		 "/."			"*"		       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/."			"/*"		       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/."			"/?"		       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/."			"/[!a-z]"	       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/a./.b."		"/*/*"		       NOMATCH PATHNAME|PERIOD
+C.UTF-8		 "/a./.b."		"/??/???"	       NOMATCH PATHNAME|PERIOD
+
+# Some home-grown tests.
+C.UTF-8		"foobar"		"foo*[abc]z"	       NOMATCH
+C.UTF-8		"foobaz"		"foo*[abc][xyz]"       0
+C.UTF-8		"foobaz"		"foo?*[abc][xyz]"      0
+C.UTF-8		"foobaz"		"foo?*[abc][x/yz]"     0
+C.UTF-8		"foobaz"		"foo?*[abc]/[xyz]"     NOMATCH PATHNAME
+C.UTF-8		"a"			"a/"                   NOMATCH PATHNAME
+C.UTF-8		"a/"			"a"		       NOMATCH PATHNAME
+C.UTF-8		"//a"			"/a"		       NOMATCH PATHNAME
+C.UTF-8		"/a"			"//a"		       NOMATCH PATHNAME
+C.UTF-8		"az"			"[a-]z"		       0
+C.UTF-8		"bz"			"[ab-]z"	       0
+C.UTF-8		"cz"			"[ab-]z"	       NOMATCH
+C.UTF-8		"-z"			"[ab-]z"	       0
+C.UTF-8		"az"			"[-a]z"		       0
+C.UTF-8		"bz"			"[-ab]z"	       0
+C.UTF-8		"cz"			"[-ab]z"	       NOMATCH
+C.UTF-8		"-z"			"[-ab]z"	       0
+C.UTF-8		"\\"			"[\\\\-a]"	       0
+C.UTF-8		"_"			"[\\\\-a]"	       0
+C.UTF-8		"a"			"[\\\\-a]"	       0
+C.UTF-8		"-"			"[\\\\-a]"	       NOMATCH
+C.UTF-8		"\\"			"[\\]-a]"	       NOMATCH
+C.UTF-8		"_"			"[\\]-a]"	       0
+C.UTF-8		"a"			"[\\]-a]"	       0
+C.UTF-8		"]"			"[\\]-a]"	       0
+C.UTF-8		"-"			"[\\]-a]"	       NOMATCH
+C.UTF-8		"\\"			"[!\\\\-a]"	       NOMATCH
+C.UTF-8		"_"			"[!\\\\-a]"	       NOMATCH
+C.UTF-8		"a"			"[!\\\\-a]"	       NOMATCH
+C.UTF-8		"-"			"[!\\\\-a]"	       0
+C.UTF-8		"!"			"[\\!-]"	       0
+C.UTF-8		"-"			"[\\!-]"	       0
+C.UTF-8		"\\"			"[\\!-]"	       NOMATCH
+C.UTF-8		"Z"			"[Z-\\\\]"	       0
+C.UTF-8		"["			"[Z-\\\\]"	       0
+C.UTF-8		"\\"			"[Z-\\\\]"	       0
+C.UTF-8		"-"			"[Z-\\\\]"	       NOMATCH
+C.UTF-8		"Z"			"[Z-\\]]"	       0
+C.UTF-8		"["			"[Z-\\]]"	       0
+C.UTF-8		"\\"			"[Z-\\]]"	       0
+C.UTF-8		"]"			"[Z-\\]]"	       0
+C.UTF-8		"-"			"[Z-\\]]"	       NOMATCH
+
 # Following are tests outside the scope of IEEE 2003.2 since they are using
 # locales other than the C locale.  The main focus of the tests is on the
 # handling of ranges and the recognition of character (vs bytes).
@@ -677,7 +1068,6 @@ C		 "x/y"			"*"		       0       PATHNAME|LEADING_DIR
 C		 "x/y/z"		"*"		       0       PATHNAME|LEADING_DIR
 C		 "x"			"*x"		       0       PATHNAME|LEADING_DIR
 
-en_US.UTF-8	 "\366.csv"		"*.csv"                0
 C		 "x/y"			"*x"		       0       PATHNAME|LEADING_DIR
 C		 "x/y/z"		"*x"		       0       PATHNAME|LEADING_DIR
 C		 "x"			"x*"		       0       PATHNAME|LEADING_DIR
@@ -693,6 +1083,33 @@ C		 "x"			"x?y"		       NOMATCH PATHNAME|LEADING_DIR
 C		 "x/y"			"x?y"		       NOMATCH PATHNAME|LEADING_DIR
 C		 "x/y/z"		"x?y"		       NOMATCH PATHNAME|LEADING_DIR
 
+# Duplicate the "Test of GNU extensions." tests but for C.UTF-8.
+C.UTF-8		 "x"			"x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"*x"		       0       PATHNAME|LEADING_DIR
+
+C.UTF-8		 "x/y"			"*x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"*x"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"x*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"x*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"x*"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"a"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"a"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"a"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"x/y"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"x/y"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"x/y"		       0       PATHNAME|LEADING_DIR
+C.UTF-8		 "x"			"x?y"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y"			"x?y"		       NOMATCH PATHNAME|LEADING_DIR
+C.UTF-8		 "x/y/z"		"x?y"		       NOMATCH PATHNAME|LEADING_DIR
+
+# Bug 14185
+en_US.UTF-8	 "\366.csv"		"*.csv"                0
+
 # ksh style matching.
 C		"abcd"			"?@(a|b)*@(c)d"	       0       EXTMATCH
 C		"/dev/udp/129.22.8.102/45" "/dev/@(tcp|udp)/*/*" 0     PATHNAME|EXTMATCH
@@ -822,3 +1239,133 @@ C		""			""		       0
 C		""			""		       0       EXTMATCH
 C		""			"*([abc])"	       0       EXTMATCH
 C		""			"?([abc])"	       0       EXTMATCH
+
+# Duplicate the "ksh style matching." for C.UTF-8.
+C.UTF-8		"abcd"			"?@(a|b)*@(c)d"	       0       EXTMATCH
+C.UTF-8		"/dev/udp/129.22.8.102/45" "/dev/@(tcp|udp)/*/*" 0     PATHNAME|EXTMATCH
+C.UTF-8		"12"			"[1-9]*([0-9])"        0       EXTMATCH
+C.UTF-8		"12abc"			"[1-9]*([0-9])"        NOMATCH EXTMATCH
+C.UTF-8		"1"			"[1-9]*([0-9])"	       0       EXTMATCH
+C.UTF-8		"07"			"+([0-7])"	       0       EXTMATCH
+C.UTF-8		"0377"			"+([0-7])"	       0       EXTMATCH
+C.UTF-8		"09"			"+([0-7])"	       NOMATCH EXTMATCH
+C.UTF-8		"paragraph"		"para@(chute|graph)"   0       EXTMATCH
+C.UTF-8		"paramour"		"para@(chute|graph)"   NOMATCH EXTMATCH
+C.UTF-8		"para991"		"para?([345]|99)1"     0       EXTMATCH
+C.UTF-8		"para381"		"para?([345]|99)1"     NOMATCH EXTMATCH
+C.UTF-8		"paragraph"		"para*([0-9])"	       NOMATCH EXTMATCH
+C.UTF-8		"para"			"para*([0-9])"	       0       EXTMATCH
+C.UTF-8		"para13829383746592"	"para*([0-9])"	       0       EXTMATCH
+C.UTF-8		"paragraph"		"para+([0-9])"	       NOMATCH EXTMATCH
+C.UTF-8		"para"			"para+([0-9])"	       NOMATCH EXTMATCH
+C.UTF-8		"para987346523"		"para+([0-9])"	       0       EXTMATCH
+C.UTF-8		"paragraph"		"para!(*.[0-9])"       0       EXTMATCH
+C.UTF-8		"para.38"		"para!(*.[0-9])"       0       EXTMATCH
+C.UTF-8		"para.graph"		"para!(*.[0-9])"       0       EXTMATCH
+C.UTF-8		"para39"		"para!(*.[0-9])"       0       EXTMATCH
+C.UTF-8		""			"*(0|1|3|5|7|9)"       0       EXTMATCH
+C.UTF-8		"137577991"		"*(0|1|3|5|7|9)"       0       EXTMATCH
+C.UTF-8		"2468"			"*(0|1|3|5|7|9)"       NOMATCH EXTMATCH
+C.UTF-8		"1358"			"*(0|1|3|5|7|9)"       NOMATCH EXTMATCH
+C.UTF-8		"file.c"		"*.c?(c)"	       0       EXTMATCH
+C.UTF-8		"file.C"		"*.c?(c)"	       NOMATCH EXTMATCH
+C.UTF-8		"file.cc"		"*.c?(c)"	       0       EXTMATCH
+C.UTF-8		"file.ccc"		"*.c?(c)"	       NOMATCH EXTMATCH
+C.UTF-8		"parse.y"		"!(*.c|*.h|Makefile.in|config*|README)" 0 EXTMATCH
+C.UTF-8		"shell.c"		"!(*.c|*.h|Makefile.in|config*|README)" NOMATCH EXTMATCH
+C.UTF-8		"Makefile"		"!(*.c|*.h|Makefile.in|config*|README)" 0 EXTMATCH
+C.UTF-8		"VMS.FILE;1"		"*\;[1-9]*([0-9])"     0       EXTMATCH
+C.UTF-8		"VMS.FILE;0"		"*\;[1-9]*([0-9])"     NOMATCH EXTMATCH
+C.UTF-8		"VMS.FILE;"		"*\;[1-9]*([0-9])"     NOMATCH EXTMATCH
+C.UTF-8		"VMS.FILE;139"		"*\;[1-9]*([0-9])"     0       EXTMATCH
+C.UTF-8		"VMS.FILE;1N"		"*\;[1-9]*([0-9])"     NOMATCH EXTMATCH
+C.UTF-8		"abcfefg"		"ab**(e|f)"	       0       EXTMATCH
+C.UTF-8		"abcfefg"		"ab**(e|f)g"	       0       EXTMATCH
+C.UTF-8		"ab"			"ab*+(e|f)"	       NOMATCH EXTMATCH
+C.UTF-8		"abef"			"ab***ef"	       0       EXTMATCH
+C.UTF-8		"abef"			"ab**"		       0       EXTMATCH
+C.UTF-8		"fofo"			"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"ffo"			"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"foooofo"		"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"foooofof"		"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"fooofoofofooo"		"*(f*(o))"	       0       EXTMATCH
+C.UTF-8		"foooofof"		"*(f+(o))"	       NOMATCH EXTMATCH
+C.UTF-8		"xfoooofof"		"*(f*(o))"	       NOMATCH EXTMATCH
+C.UTF-8		"foooofofx"		"*(f*(o))"	       NOMATCH EXTMATCH
+C.UTF-8		"ofxoofxo"		"*(*(of*(o)x)o)"       0       EXTMATCH
+C.UTF-8		"ofooofoofofooo"	"*(f*(o))"	       NOMATCH EXTMATCH
+C.UTF-8		"foooxfooxfoxfooox"	"*(f*(o)x)"	       0       EXTMATCH
+C.UTF-8		"foooxfooxofoxfooox"	"*(f*(o)x)"	       NOMATCH EXTMATCH
+C.UTF-8		"foooxfooxfxfooox"	"*(f*(o)x)"	       0       EXTMATCH
+C.UTF-8		"ofxoofxo"		"*(*(of*(o)x)o)"       0       EXTMATCH
+C.UTF-8		"ofoooxoofxo"		"*(*(of*(o)x)o)"       0       EXTMATCH
+C.UTF-8		"ofoooxoofxoofoooxoofxo" "*(*(of*(o)x)o)"      0       EXTMATCH
+C.UTF-8		"ofoooxoofxoofoooxoofxoo" "*(*(of*(o)x)o)"     0       EXTMATCH
+C.UTF-8		"ofoooxoofxoofoooxoofxofo" "*(*(of*(o)x)o)"    NOMATCH EXTMATCH
+C.UTF-8		"ofoooxoofxoofoooxoofxooofxofxo" "*(*(of*(o)x)o)" 0    EXTMATCH
+C.UTF-8		"aac"			"*(@(a))a@(c)"	       0       EXTMATCH
+C.UTF-8		"ac"			"*(@(a))a@(c)"	       0       EXTMATCH
+C.UTF-8		"c"			"*(@(a))a@(c)"	       NOMATCH EXTMATCH
+C.UTF-8		"aaac"			"*(@(a))a@(c)"	       0       EXTMATCH
+C.UTF-8		"baaac"			"*(@(a))a@(c)"	       NOMATCH EXTMATCH
+C.UTF-8		"abcd"			"?@(a|b)*@(c)d"	       0       EXTMATCH
+C.UTF-8		"abcd"			"@(ab|a*@(b))*(c)d"    0       EXTMATCH
+C.UTF-8		"acd"			"@(ab|a*(b))*(c)d"     0       EXTMATCH
+C.UTF-8		"abbcd"			"@(ab|a*(b))*(c)d"     0       EXTMATCH
+C.UTF-8		"effgz"			"@(b+(c)d|e*(f)g?|?(h)i@(j|k))" 0 EXTMATCH
+C.UTF-8		"efgz"			"@(b+(c)d|e*(f)g?|?(h)i@(j|k))" 0 EXTMATCH
+C.UTF-8		"egz"			"@(b+(c)d|e*(f)g?|?(h)i@(j|k))" 0 EXTMATCH
+C.UTF-8		"egzefffgzbcdij"	"*(b+(c)d|e*(f)g?|?(h)i@(j|k))" 0 EXTMATCH
+C.UTF-8		"egz"			"@(b+(c)d|e+(f)g?|?(h)i@(j|k))" NOMATCH EXTMATCH
+C.UTF-8		"ofoofo"		"*(of+(o))"	       0       EXTMATCH
+C.UTF-8		"oxfoxoxfox"		"*(oxf+(ox))"	       0       EXTMATCH
+C.UTF-8		"oxfoxfox"		"*(oxf+(ox))"	       NOMATCH EXTMATCH
+C.UTF-8		"ofoofo"		"*(of+(o)|f)"	       0       EXTMATCH
+C.UTF-8		"foofoofo"		"@(foo|f|fo)*(f|of+(o))" 0     EXTMATCH
+C.UTF-8		"oofooofo"		"*(of|oof+(o))"	       0       EXTMATCH
+C.UTF-8		"fffooofoooooffoofffooofff" "*(*(f)*(o))"      0       EXTMATCH
+C.UTF-8		"fofoofoofofoo"		"*(fo|foo)"	       0       EXTMATCH
+C.UTF-8		"foo"			"!(x)"		       0       EXTMATCH
+C.UTF-8		"foo"			"!(x)*"		       0       EXTMATCH
+C.UTF-8		"foo"			"!(foo)"	       NOMATCH EXTMATCH
+C.UTF-8		"foo"			"!(foo)*"	       0       EXTMATCH
+C.UTF-8		"foobar"		"!(foo)"	       0       EXTMATCH
+C.UTF-8		"foobar"		"!(foo)*"	       0       EXTMATCH
+C.UTF-8		"moo.cow"		"!(*.*).!(*.*)"	       0       EXTMATCH
+C.UTF-8		"mad.moo.cow"		"!(*.*).!(*.*)"	       NOMATCH EXTMATCH
+C.UTF-8		"mucca.pazza"		"mu!(*(c))?.pa!(*(z))?" NOMATCH EXTMATCH
+C.UTF-8		"fff"			"!(f)"		       0       EXTMATCH
+C.UTF-8		"fff"			"*(!(f))"	       0       EXTMATCH
+C.UTF-8		"fff"			"+(!(f))"	       0       EXTMATCH
+C.UTF-8		"ooo"			"!(f)"		       0       EXTMATCH
+C.UTF-8		"ooo"			"*(!(f))"	       0       EXTMATCH
+C.UTF-8		"ooo"			"+(!(f))"	       0       EXTMATCH
+C.UTF-8		"foo"			"!(f)"		       0       EXTMATCH
+C.UTF-8		"foo"			"*(!(f))"	       0       EXTMATCH
+C.UTF-8		"foo"			"+(!(f))"	       0       EXTMATCH
+C.UTF-8		"f"			"!(f)"		       NOMATCH EXTMATCH
+C.UTF-8		"f"			"*(!(f))"	       NOMATCH EXTMATCH
+C.UTF-8		"f"			"+(!(f))"	       NOMATCH EXTMATCH
+C.UTF-8		"foot"			"@(!(z*)|*x)"	       0       EXTMATCH
+C.UTF-8		"zoot"			"@(!(z*)|*x)"	       NOMATCH EXTMATCH
+C.UTF-8		"foox"			"@(!(z*)|*x)"	       0       EXTMATCH
+C.UTF-8		"zoox"			"@(!(z*)|*x)"	       0       EXTMATCH
+C.UTF-8		"foo"			"*(!(foo))"	       0       EXTMATCH
+C.UTF-8		"foob"			"!(foo)b*"	       NOMATCH EXTMATCH
+C.UTF-8		"foobb"			"!(foo)b*"	       0       EXTMATCH
+C.UTF-8		"["			"*([a[])"	       0       EXTMATCH
+C.UTF-8		"]"			"*([]a[])"	       0       EXTMATCH
+C.UTF-8		"a"			"*([]a[])"	       0       EXTMATCH
+C.UTF-8		"b"			"*([!]a[])"	       0       EXTMATCH
+C.UTF-8		"["			"*([!]a[]|[[])"	       0       EXTMATCH
+C.UTF-8		"]"			"*([!]a[]|[]])"	       0       EXTMATCH
+C.UTF-8		"["			"!([!]a[])"	       0       EXTMATCH
+C.UTF-8		"]"			"!([!]a[])"	       0       EXTMATCH
+C.UTF-8		")"			"*([)])"	       0       EXTMATCH
+C.UTF-8		"*"			"*([*(])"	       0       EXTMATCH
+C.UTF-8		"abcd"			"*!(|a)cd"	       0       EXTMATCH
+C.UTF-8		"ab/.a"			"+([abc])/*"	       NOMATCH EXTMATCH|PATHNAME|PERIOD
+C.UTF-8		""			""		       0
+C.UTF-8		""			""		       0       EXTMATCH
+C.UTF-8		""			"*([abc])"	       0       EXTMATCH
+C.UTF-8		""			"?([abc])"	       0       EXTMATCH
diff --git a/posix/tst-regcomp-truncated.c b/posix/tst-regcomp-truncated.c
index 84195fcd2e..da3f97799e 100644
--- a/posix/tst-regcomp-truncated.c
+++ b/posix/tst-regcomp-truncated.c
@@ -37,6 +37,7 @@
 static const char locales[][17] =
   {
     "C",
+    "C.UTF-8",
     "en_US.UTF-8",
     "de_DE.ISO-8859-1",
   };
diff --git a/posix/tst-regex.c b/posix/tst-regex.c
index e7c2b05e86..4be5d173eb 100644
--- a/posix/tst-regex.c
+++ b/posix/tst-regex.c
@@ -32,6 +32,7 @@
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <regex.h>
+#include <support/support.h>
 
 
 #if defined _POSIX_CPUTIME && _POSIX_CPUTIME >= 0
@@ -150,9 +151,23 @@ test_expr (const char *expr, int expected, int expectedicase)
   size_t outlen;
   char *uexpr;
 
-  /* First test: search with an UTF-8 locale.  */
-  if (setlocale (LC_ALL, "de_DE.UTF-8") == NULL)
-    error (EXIT_FAILURE, 0, "cannot set locale de_DE.UTF-8");
+  /* First test: search with the basic C.UTF-8 locale.  */
+  printf ("INFO: Testing C.UTF-8.\n");
+  xsetlocale (LC_ALL, "C.UTF-8");
+
+  printf ("\nTest \"%s\" with multi-byte locale\n", expr);
+  result = run_test (expr, mem, memlen, 0, expected);
+  printf ("\nTest \"%s\" with multi-byte locale, case insensitive\n", expr);
+  result |= run_test (expr, mem, memlen, 1, expectedicase);
+  printf ("\nTest \"%s\" backwards with multi-byte locale\n", expr);
+  result |= run_test_backwards (expr, mem, memlen, 0, expected);
+  printf ("\nTest \"%s\" backwards with multi-byte locale, case insensitive\n",
+	  expr);
+  result |= run_test_backwards (expr, mem, memlen, 1, expectedicase);
+
+  /* Second test: search with a UTF-8 locale.  */
+  printf ("INFO: Testing de_DE.UTF-8.\n");
+  xsetlocale (LC_ALL, "de_DE.UTF-8");
 
   printf ("\nTest \"%s\" with multi-byte locale\n", expr);
-  result = run_test (expr, mem, memlen, 0, expected);
+  result |= run_test (expr, mem, memlen, 0, expected);
@@ -165,8 +180,8 @@ test_expr (const char *expr, int expected, int expectedicase)
   result |= run_test_backwards (expr, mem, memlen, 1, expectedicase);
 
-  /* Second test: search with an ISO-8859-1 locale.  */
+  /* Third test: search with an ISO-8859-1 locale.  */
-  if (setlocale (LC_ALL, "de_DE.ISO-8859-1") == NULL)
-    error (EXIT_FAILURE, 0, "cannot set locale de_DE.ISO-8859-1");
+  printf ("INFO: Testing de_DE.ISO-8859-1.\n");
+  xsetlocale (LC_ALL, "de_DE.ISO-8859-1");
 
   inmem = (char *) expr;
   inlen = strlen (expr);