[v5,1/5] libcpp: reject codepoints above 0x10FFFF

Message ID	20230125210636.2960049-2-ben.boeckel@kitware.com
State	Superseded
Headers	DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 569B23858C53
	To: gcc-patches@gcc.gnu.org
	Cc: Ben Boeckel <ben.boeckel@kitware.com>, jason@redhat.com, nathan@acm.org, fortran@gcc.gnu.org, gcc@gcc.gnu.org, brad.king@kitware.com
	Subject: [PATCH v5 1/5] libcpp: reject codepoints above 0x10FFFF
	Date: Wed, 25 Jan 2023 16:06:32 -0500
	Message-Id: <20230125210636.2960049-2-ben.boeckel@kitware.com>
	In-Reply-To: <20230125210636.2960049-1-ben.boeckel@kitware.com>
	References: <20230125210636.2960049-1-ben.boeckel@kitware.com>
	MIME-Version: 1.0
	Content-Transfer-Encoding: 8bit
	Precedence: list
	From: Ben Boeckel via Gcc-patches <gcc-patches@gcc.gnu.org>
	Reply-To: Ben Boeckel <ben.boeckel@kitware.com>
	Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org
	Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>
Series	P1689R5 support
	[v5,0/5] P1689R5 support
	[v5,1/5] libcpp: reject codepoints above 0x10FFFF
	[v5,2/5] libcpp: add a function to determine UTF-8 validity of a C string
	[v5,3/5] p1689r5: initial support
	[v5,4/5] c++modules: report imported CMI files as dependencies
	[v5,5/5] c++modules: report module mapper files as a dependency

Commit Message

Ben Boeckel Jan. 25, 2023, 9:06 p.m. UTC

Unicode does not support such values because they are unrepresentable in
UTF-16.

libcpp/

	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
	UTF-16 does not support such codepoints and therefore all
	Unicode rejects such values.

Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>
---
 libcpp/charset.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
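
To illustrate the UTF-16 argument in the commit message: a surrogate pair
carries 20 payload bits on top of an offset of 0x10000, so the largest
codepoint it can express is 0x10FFFF. The following is a minimal C++ sketch
and is not part of the patch; `utf16_pair_to_codepoint` and
`is_unicode_scalar_value` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: combine a high surrogate (0xD800..0xDBFF) and a
// low surrogate (0xDC00..0xDFFF) into the codepoint they encode.
// Each surrogate contributes 10 bits; the result is offset by 0x10000.
constexpr std::uint32_t
utf16_pair_to_codepoint (std::uint16_t hi, std::uint16_t lo)
{
  return 0x10000u
	 + ((static_cast<std::uint32_t> (hi) - 0xD800u) << 10)
	 + (static_cast<std::uint32_t> (lo) - 0xDC00u);
}

// Hypothetical helper mirroring the condition the patch tests for:
// at most 0x10FFFF and not inside the surrogate range.
constexpr bool
is_unicode_scalar_value (std::uint32_t c)
{
  return c <= 0x10FFFFu && !(c >= 0xD800u && c <= 0xDFFFu);
}

// The largest surrogate pair maps to exactly 0x10FFFF.
static_assert (utf16_pair_to_codepoint (0xDBFF, 0xDFFF) == 0x10FFFF,
	       "UTF-16 cannot reach beyond 0x10FFFF");
```

Anything above 0x10FFFF therefore has no UTF-16 form, which is the limit the
new comparisons in charset.cc use.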

Comments

Jason Merrill Feb. 13, 2023, 3:53 p.m. UTC | #1

On 1/25/23 13:06, Ben Boeckel wrote:
> Unicode does not support such values because they are unrepresentable in
> UTF-16.
> 
> libcpp/
> 
> 	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
> 	UTF-16 does not support such codepoints and therefore all
> 	Unicode rejects such values.

It seems that this causes a bunch of testsuite failures from tests that 
expect this limit to be checked elsewhere with a different diagnostic, 
so I think the easiest thing is to fold this into _cpp_valid_utf8_str 
instead, i.e.:

Make sense?

Jason

Ben Boeckel May 12, 2023, 2:26 p.m. UTC | #2

On Mon, Feb 13, 2023 at 10:53:17 -0500, Jason Merrill wrote:
> On 1/25/23 13:06, Ben Boeckel wrote:
> > Unicode does not support such values because they are unrepresentable in
> > UTF-16.
> > 
> > libcpp/
> > 
> > 	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
> > 	UTF-16 does not support such codepoints and therefore all
> > 	Unicode rejects such values.
> 
> It seems that this causes a bunch of testsuite failures from tests that 
> expect this limit to be checked elsewhere with a different diagnostic, 
> so I think the easiest thing is to fold this into _cpp_valid_utf8_str 
> instead, i.e.:

Since then, `cpp_valid_utf8_p` has been added and takes care of the
over-long encodings. The new patchset only checks for codepoints beyond
0x10FFFF and rejects them in this function; with that change, the test
suite results match `master` for me.

--Ben

Patch

diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 3c47d4f868b..f7ae12ea5a2 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -158,6 +158,10 @@  struct _cpp_strbuf
    encoded as any of DF 80, E0 9F 80, F0 80 9F 80, F8 80 80 9F 80, or
    FC 80 80 80 9F 80.  Only the first is valid.
 
+   Additionally, Unicode declares that all codepoints above 0010FFFF are
+   invalid because they cannot be represented in UTF-16. As such, all 5- and
+   6-byte encodings are invalid.
+
    An implementation note: the transformation from UTF-16 to UTF-8, or
    vice versa, is easiest done by using UTF-32 as an intermediary.  */
 
@@ -216,7 +220,7 @@  one_utf8_to_cppchar (const uchar **inbufp, size_t *inbytesleftp,
   if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
 
   /* Make sure the character is valid.  */
-  if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
+  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
 
   *cp = c;
   *inbufp = inbuf;
@@ -320,7 +324,7 @@  one_utf32_to_utf8 (iconv_t bigend, const uchar **inbufp, size_t *inbytesleftp,
   s += inbuf[bigend ? 2 : 1] << 8;
   s += inbuf[bigend ? 3 : 0];
 
-  if (s >= 0x7FFFFFFF || (s >= 0xD800 && s <= 0xDFFF))
+  if (s > 0x10FFFF || (s >= 0xD800 && s <= 0xDFFF))
     return EILSEQ;
 
   rval = one_cppchar_to_utf8 (s, outbufp, outbytesleftp);
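
For readers who want to see the combined effect of these checks outside of
libcpp, here is a minimal standalone C++ sketch. It is an assumption-laden
illustration, not the libcpp implementation: `decode_utf8` is a hypothetical
function, and it reports errors by returning 0 rather than EILSEQ. It rejects
5- and 6-byte lead bytes, over-long forms, surrogates, and values above
0x10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical standalone decoder; NOT libcpp's one_utf8_to_cppchar.
// Decodes one UTF-8 sequence from S (LEN bytes available) into *OUT.
// Returns the number of bytes consumed, or 0 if the sequence is invalid.
std::size_t
decode_utf8 (const unsigned char *s, std::size_t len, std::uint32_t *out)
{
  if (len == 0)
    return 0;

  unsigned char b = s[0];
  std::size_t n;	/* Sequence length implied by the lead byte.  */
  std::uint32_t c;	/* Codepoint being accumulated.  */
  std::uint32_t min;	/* Smallest value that needs N bytes.  */

  if (b < 0x80)
    { n = 1; c = b; min = 0; }
  else if ((b & 0xE0) == 0xC0)
    { n = 2; c = b & 0x1F; min = 0x80; }
  else if ((b & 0xF0) == 0xE0)
    { n = 3; c = b & 0x0F; min = 0x800; }
  else if ((b & 0xF8) == 0xF0)
    { n = 4; c = b & 0x07; min = 0x10000; }
  else
    return 0;	/* Stray continuation byte, 5-/6-byte lead, or FE/FF.  */

  if (len < n)
    return 0;

  for (std::size_t i = 1; i < n; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
	return 0;	/* Not a continuation byte.  */
      c = (c << 6) | (s[i] & 0x3F);
    }

  if (c < min)
    return 0;	/* Over-long encoding.  */
  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;	/* Beyond the Unicode range, or a surrogate.  */

  *out = c;
  return n;
}
```

With this sketch, F4 8F BF BF decodes to 0x10FFFF, while F4 90 80 80 (which
would be 0x110000) and any sequence starting with F8 are rejected. Of the
U+07C0 encodings listed in the charset.cc comment, only DF 80 is accepted.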