From patchwork Wed Apr 28 13:00:30 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Carlos O'Donell <carlos@redhat.com>
X-Patchwork-Id: 43185
Return-Path: <libc-alpha-bounces@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id B51AD3944820;
	Wed, 28 Apr 2021 13:00:53 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org B51AD3944820
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1619614853;
	bh=gXehIdVasmD0fWJAPFskKlhac6xi3d51GJKCk9YMzDE=;
	h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
	 From;
	b=EWKAUkfJoVwRt8xKEf8DJz14CroFmfUzn/NqLQf7dpmImO063LEVCaBis9QoVZRYZ
	 PIXDxebx63uy9iIfYiJR+BCyg6lcYeIY88154fVHx4olunYvmO6CBn+GuYFXC0I3+e
	 0I2z5Si7a9bGaOi7k6nVg0BoyZUb5SFjAg8U4Fac=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTP id 3462B3938C24
 for <libc-alpha@sourceware.org>; Wed, 28 Apr 2021 13:00:50 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 3462B3938C24
Received: from mail-qt1-f198.google.com (mail-qt1-f198.google.com
 [209.85.160.198]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-78-YlN3ujCANqCNnQjpXhx8UA-1; Wed, 28 Apr 2021 09:00:48 -0400
X-MC-Unique: YlN3ujCANqCNnQjpXhx8UA-1
Received: by mail-qt1-f198.google.com with SMTP id
 u16-20020ac86f700000b02901baa6e2dbfcso6411704qtv.20
 for <libc-alpha@sourceware.org>; Wed, 28 Apr 2021 06:00:48 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=gXehIdVasmD0fWJAPFskKlhac6xi3d51GJKCk9YMzDE=;
 b=TJxkUsvyVtNTXq0fZ1ehQof+lHhLx5VZmjrp6ECUpC9pgxLloN4EizjEr/mU3ZXZgH
 iF8Xfg2vGwJapoz+IJf6itPTuRnLLSIy9aweRvaG/T4tNge+Vr8d24gLLUv2uqp5byR1
 U+k5zEtcoHb5D+a+t3hBD/iB4CWrcBY+HcMh+Zh+OmvLluxvrwwGskYUURLRaq8Z/HFa
 W7rYdbrFIHT9/Wcfx9HzvO2xVgtXzdRwRgMDIe3xIvn6FllY5VaFypWk0YMw/Ux3fMY6
 flHLe5AHTvzMKI966AL5mMygqU3SSABrWOewKRemMzNJO+v6gXTiHtUYNjTnyWjW3qZt
 mtPQ==
X-Gm-Message-State: AOAM533kazv5NufxOKUgqiXhtt5lNKbhx8NAzueStXkyaEfsb3wKRLAY
 tUWJpdOi0gmFCPCR0U6AUK7hSiNmSTYvFIoIPbKgs4L3k0v4+UdfnNLeWlXVZ/ep/6+dbFpWkYV
 YEH5Lx1jzPjFqD2Ug8u09qTGoa8/dil5K1ylcjWbIvenzeiLvVi3lRsTvdxeH+bpayXZpWQ==
X-Received: by 2002:a37:7745:: with SMTP id s66mr27662028qkc.18.1619614846968;
 Wed, 28 Apr 2021 06:00:46 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJzWYjjGizxDz8LaqShH8Mz3w71SVJpHwb6tWoKzwkXnJaVUD8rW+KyfvDXMKTbl9qRS9FZp/g==
X-Received: by 2002:a37:7745:: with SMTP id s66mr27661978qkc.18.1619614846547;
 Wed, 28 Apr 2021 06:00:46 -0700 (PDT)
Received: from athas.redhat.com (198-84-214-74.cpe.teksavvy.com.
 [198.84.214.74])
 by smtp.gmail.com with ESMTPSA id g18sm4903451qke.21.2021.04.28.06.00.45
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Wed, 28 Apr 2021 06:00:45 -0700 (PDT)
To: libc-alpha@sourceware.org,
	fweimer@redhat.com
Subject: [PATCH v4 1/4] Add support for processing wide ellipsis ranges in
 UTF-8.
Date: Wed, 28 Apr 2021 09:00:30 -0400
Message-Id: <20210428130033.3196848-2-carlos@redhat.com>
X-Mailer: git-send-email 2.26.3
In-Reply-To: <20210428130033.3196848-1-carlos@redhat.com>
References: <20210428130033.3196848-1-carlos@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
X-Spam-Status: No, score=-11.7 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: Carlos O'Donell via Libc-alpha
 <libc-alpha@sourceware.org>
From: Carlos O'Donell <carlos@redhat.com>
Reply-To: Carlos O'Donell <carlos@redhat.com>
Errors-To: libc-alpha-bounces@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org>

If the input character map is UTF-8 then the ellipsis handling is
relaxed with regard to the POSIX requirement on null byte
output, and a custom encoding function is used instead to
generate the valid UTF-8 byte sequence for each code point in
the ellipsis range.

Developers of locales want to be able to write large ellipsis
sequences without needing a priori knowledge of the encoding,
which would otherwise require them to split the ellipsis to
avoid null byte output.

Tested on x86_64 and i686 without regression.
---
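Illustration only, not part of the change: a minimal sketch of why
stepping the last byte of a UTF-8 sequence breaks down across an
ellipsis.  The names sketch_encode_utf8 and sketch_is_continuation
are hypothetical helpers written for this note and are not the
functions added by the patch.

```c
#include <stddef.h>

/* Hypothetical helper (not the patch's output_utf8_bytes): encode CP
   as UTF-8 into OUT, which must hold at least 4 bytes.  Returns the
   number of bytes written, or 0 if CP is above U+10FFFF.  Surrogates
   are encoded as-is, matching the choice made in the patch.  */
size_t
sketch_encode_utf8 (unsigned int cp, unsigned char *out)
{
  if (cp <= 0x7f)
    {
      out[0] = cp;
      return 1;
    }
  if (cp <= 0x7ff)
    {
      out[0] = 0xc0 | (cp >> 6);
      out[1] = 0x80 | (cp & 0x3f);
      return 2;
    }
  if (cp <= 0xffff)
    {
      out[0] = 0xe0 | (cp >> 12);
      out[1] = 0x80 | ((cp >> 6) & 0x3f);
      out[2] = 0x80 | (cp & 0x3f);
      return 3;
    }
  if (cp <= 0x10ffff)
    {
      out[0] = 0xf0 | (cp >> 18);
      out[1] = 0x80 | ((cp >> 12) & 0x3f);
      out[2] = 0x80 | ((cp >> 6) & 0x3f);
      out[3] = 0x80 | (cp & 0x3f);
      return 4;
    }
  return 0;
}

/* Hypothetical helper: a UTF-8 trailing byte must look like 10xxxxxx.  */
int
sketch_is_continuation (unsigned char b)
{
  return (b & 0xc0) == 0x80;
}
```

With these helpers, U+00BF encodes as /xc2/xbf.  Incrementing the
last byte yields /xc2/xc0, whose second byte is no longer a valid
continuation byte, whereas encoding U+00C0 directly yields /xc3/x80.
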
 locale/programs/charmap.c | 174 ++++++++++++++++++++++++++++++++++----
 1 file changed, 156 insertions(+), 18 deletions(-)

diff --git a/locale/programs/charmap.c b/locale/programs/charmap.c
index 3d51e702dc..cb134e3b8a 100644
--- a/locale/programs/charmap.c
+++ b/locale/programs/charmap.c
@@ -49,7 +49,7 @@ static void new_width (struct linereader *cmfile, struct charmap_t *result,
 static void charmap_new_char (struct linereader *lr, struct charmap_t *cm,
 			      size_t nbytes, unsigned char *bytes,
 			      const char *from, const char *to,
-			      int decimal_ellipsis, int step);
+			      int decimal_ellipsis, int step, bool is_utf8);
 
 
 bool enc_not_ascii_compatible;
@@ -285,6 +285,27 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
   enum token_t ellipsis = 0;
   int step = 1;
 
+  /* POSIX explicitly requires that ellipsis processing do the
+     following: "Bytes shall be treated as unsigned octets, and carry
+     shall be propagated between the bytes as necessary to represent the
+     range."  It then goes on to say that such a declaration should
+     never be specified because it creates null bytes.  Therefore we
+     error on this condition (see charmap_new_char).  However this still
+     leaves a problem for encodings which use fewer than the full 8 bits,
+     like UTF-8, and in such encodings you can use an ellipsis to
+     silently and accidentally create invalid ranges.  In UTF-8 you have
+     only N bits of the first byte and if your ellipsis covers a code
+     point range larger than this code point block the output is going
+     to be an invalid non-UTF-8 multi-byte sequence.  Thus for
+     UTF-8 we add a special ellipsis handling loop that can increment
+     UTF-8 multi-byte output effectively and for UTF-8 we allow larger
+     ellipsis ranges without error.  There may still be other encodings
+     for which the ellipsis will still generate invalid multi-byte
+     output, but not for UTF-8.  The only alternative would be to call
+     gconv for each Unicode code point in the loop to convert it to the
+     appropriate multi-byte output, but that would be slow.  */
+  bool is_utf8 = false;
+
   /* We don't want symbolic names in string to be translated.  */
   cmfile->translate_strings = 0;
 
@@ -385,9 +406,14 @@ parse_charmap (struct linereader *cmfile, int verbose, int be_quiet)
 		}
 
 	      if (nowtok == tok_code_set_name)
-		result->code_set_name = obstack_copy0 (&result->mem_pool,
-						       arg->val.str.startmb,
-						       arg->val.str.lenmb);
+		{
+		  result->code_set_name = obstack_copy0 (&result->mem_pool,
+							 arg->val.str.startmb,
+							 arg->val.str.lenmb);
+
+		  if (strcmp (result->code_set_name, "UTF-8") == 0)
+		    is_utf8 = true;
+		}
 	      else
 		result->repertoiremap = obstack_copy0 (&result->mem_pool,
 						       arg->val.str.startmb,
@@ -570,7 +596,7 @@ character sets with locking states are not supported"));
 	  else
 	    charmap_new_char (cmfile, result, now->val.charcode.nbytes,
 			      now->val.charcode.bytes, from_name, to_name,
-			      ellipsis != tok_ellipsis2, step);
+			      ellipsis != tok_ellipsis2, step, is_utf8);
 
 	  /* Ignore trailing comment silently.  */
 	  lr_ignore_rest (cmfile, 0);
@@ -929,12 +955,81 @@ charmap_find_value (const struct charmap_t *cm, const char *name, size_t len)
 	  < 0 ? NULL : (struct charseq *) result);
 }
 
+/* This function takes the Unicode code point CP and encodes it into
+   a UTF-8 byte stream that must be NBYTES long and is stored into
+   the unsigned character array at BYTES.
+
+   If CP requires more than NBYTES to be encoded then we return an
+   error of -1.
+
+   If CP is not within any of the valid Unicode code point ranges
+   then we return an error of -2.
+
+   Otherwise we return the number of bytes encoded.  */
+static int
+output_utf8_bytes (unsigned int cp, size_t nbytes, unsigned char *bytes)
+{
+  /* We need at least 1 byte.  */
+  if (nbytes < 1)
+    return -1;
+
+  /* One byte range.  */
+  if (cp >= 0x0 && cp <= 0x7f)
+    {
+      bytes[0] = cp;
+      return 1;
+    }
+
+  /* We need at least 2 bytes.  */
+  if (nbytes < 2)
+    return -1;
+
+  /* Two byte range.  */
+  if (cp >= 0x80 && cp <= 0x7ff)
+    {
+      bytes[0] = 0xc0 | ((cp & 0x07c0) >> 6);
+      bytes[1] = 0x80 | (cp & 0x003f);
+      return 2;
+    }
+
+  /* We need at least 3 bytes.  */
+  if (nbytes < 3)
+    return -1;
+
+  /* Three byte range.  Explicitly allow the surrogate range from
+     0xd800 to 0xdfff since we want consistent sorting of the invalid
+     values that might appear in UTF-8 data.  */
+  if (cp >= 0x800 && cp <= 0xffff)
+    {
+      bytes[0] = 0xe0 | ((cp & 0xf000) >> 12);
+      bytes[1] = 0x80 | ((cp & 0x0fc0) >> 6);
+      bytes[2] = 0x80 | (cp & 0x003f);
+      return 3;
+    }
+
+  /* We need at least 4 bytes.  */
+  if (nbytes < 4)
+    return -1;
+
+  /* Four byte range.  */
+  if (cp >= 0x10000 && cp <= 0x10ffff)
+    {
+      bytes[0] = 0xf0 | ((cp & 0x1c0000) >> 18);
+      bytes[1] = 0x80 | ((cp & 0x03f000) >> 12);
+      bytes[2] = 0x80 | ((cp & 0x000fc0) >> 6);
+      bytes[3] = 0x80 | (cp & 0x00003f);
+      return 4;
+    }
+
+  /* Invalid code point.  */
+  return -2;
+}
 
 static void
 charmap_new_char (struct linereader *lr, struct charmap_t *cm,
 		  size_t nbytes, unsigned char *bytes,
 		  const char *from, const char *to,
-		  int decimal_ellipsis, int step)
+		  int decimal_ellipsis, int step, bool is_utf8)
 {
   hash_table *ht = &cm->char_table;
   hash_table *bt = &cm->byte_table;
@@ -1039,11 +1134,56 @@ hexadecimal range format should use only capital characters"));
   for (cnt = from_nr; cnt <= to_nr; cnt += step)
     {
       char *name_end;
+      unsigned char ubytes[4] = { '\0', '\0', '\0', '\0' };
       obstack_printf (ob, decimal_ellipsis ? "%.*s%0*d" : "%.*s%0*X",
 		      prefix_len, from, len1 - prefix_len, cnt);
       obstack_1grow (ob, '\0');
       name_end = obstack_finish (ob);
 
+      /* Either we have a UTF-8 charmap, and we compute the bytes (see
+	 comment above), or we have a non-UTF-8 charmap and we follow
+	 POSIX rules as further below for incrementing the bytes in an
+	 ellipsis.  */
+      if (is_utf8)
+	{
+	  int nubytes;
+
+	  /* Directly convert the code point to the UTF-8 encoded bytes.  */
+	  nubytes = output_utf8_bytes (cnt, 4, ubytes);
+
+	  /* This should not happen, but we check for it just in case.  */
+	  if (nubytes == -1)
+	    lr_error (lr,
+		      _("not enough space to output UTF-8 encoding."));
+
+	  /* The other defect here could be that we have a mismatch
+	     between the code point and the encoded value or number of
+	     output bytes.  For example you specify U0000 but assign it
+	     an encoded value that is 3-bytes long (an error), or U0000
+	     is assigned a value of /x01.  */
+	  if (cnt == from_nr)
+	    {
+	      if (nubytes != nbytes)
+		lr_error (lr,
+			  _("encoding length does not match "
+			    "Unicode code point."));
+	      else
+		if (memcmp (bytes, ubytes, nbytes) != 0)
+		  lr_error (lr,
+			    _("encoded value does not match "
+			      "Unicode code point."));
+	    }
+
+	  /* The range does not cover one of the 4 UTF-8 code point ranges.  */
+	  if (nubytes == -2)
+	    lr_error (lr,
+		      _("invalid code point in the range."));
+
+	  /* Use the generated UTF-8 bytes.  */
+	  bytes = ubytes;
+	  nbytes = nubytes;
+	}
+
       newp = (struct charseq *) obstack_alloc (ob, sizeof (*newp) + nbytes);
       newp->nbytes = nbytes;
       memcpy (newp->bytes, bytes, nbytes);
@@ -1081,19 +1221,17 @@ hexadecimal range format should use only capital characters"));
       /* Please note we don't examine the return value since it is no error
 	 if we have two definitions for a symbol.  */
 
-      /* Increment the value in the byte sequence.  */
-      if (++bytes[nbytes - 1] == '\0')
-	{
-	  int b = nbytes - 2;
+      /* Increment the byte stream following POSIX rules.  */
+      if (!is_utf8)
+	bytes[nbytes - 1]++;
 
-	  do
-	    if (b < 0)
-	      {
-		lr_error (lr,
-			  _("resulting bytes for range not representable."));
-		return;
-	      }
-	  while (++bytes[b--] == 0);
+      /* If we overflowed then that generates a null byte which is an invalid
+	 specification according to POSIX and we issue a parser error.  */
+      if (bytes[nbytes - 1] == '\0')
+	{
+	  lr_error (lr,
+		    _("resulting bytes for range would contain null byte."));
+	  return;
 	}
     }
 }