[v6,1/4] libcpp: reject codepoints above 0x10FFFF

Message ID 20230606205025.3164738-2-ben.boeckel@kitware.com
State Committed
Series P1689R5 support

Commit Message

Ben Boeckel June 6, 2023, 8:50 p.m. UTC
  Unicode does not support such values because they are unrepresentable in
UTF-16.

libcpp/

	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
	UTF-16 cannot represent such codepoints, so Unicode as a whole
	excludes them.

Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>
---
 libcpp/charset.cc | 7 +++++++
 1 file changed, 7 insertions(+)
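
The 0x10FFFF ceiling follows from how UTF-16 surrogate pairs are laid out. The arithmetic can be sketched as below; `utf16_decode_pair` and `utf16_max` are hypothetical names used only for illustration and are not part of libcpp.

```cpp
#include <cstdint>

// A UTF-16 surrogate pair carries 20 payload bits: 10 from the high
// surrogate (0xD800..0xDBFF) and 10 from the low surrogate (0xDC00..0xDFFF).
// The decoded value is offset by 0x10000, the first codepoint beyond the BMP.
constexpr std::uint32_t
utf16_decode_pair (std::uint16_t hi, std::uint16_t lo)
{
  return 0x10000u
         + ((static_cast<std::uint32_t> (hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t> (lo) - 0xDC00u);
}

// The largest pair, DBFF DFFF, decodes to 0x10000 + 0xFFFFF.
constexpr std::uint32_t utf16_max = utf16_decode_pair (0xDBFF, 0xDFFF);
static_assert (utf16_max == 0x10FFFF, "UTF-16 tops out at U+10FFFF");
```

Anything larger has no UTF-16 encoding, which is why a UTF-8 sequence that decodes to such a value is treated as invalid.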
  

Comments

Jason Merrill June 19, 2023, 9:34 p.m. UTC | #1
On 6/6/23 16:50, Ben Boeckel wrote:
> Unicode does not support such values because they are unrepresentable in
> UTF-16.

Pushed.

> libcpp/
> 
> 	* charset.cc: Reject encodings of codepoints above 0x10FFFF.
> 	UTF-16 cannot represent such codepoints, so Unicode as a whole
> 	excludes them.
> 
> Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>
> ---
>   libcpp/charset.cc | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/libcpp/charset.cc b/libcpp/charset.cc
> index d7f323b2cd5..3b34d804cf1 100644
> --- a/libcpp/charset.cc
> +++ b/libcpp/charset.cc
> @@ -1886,6 +1886,13 @@ cpp_valid_utf8_p (const char *buffer, size_t num_bytes)
>         int err = one_utf8_to_cppchar (&iter, &bytesleft, &cp);
>         if (err)
>   	return false;
> +
> +      /* Additionally, Unicode declares that all codepoints above 0x10FFFF are
> +	 invalid because they cannot be represented in UTF-16.
> +
> +	 Reject such values.  */
> +      if (cp > 0x10FFFF)
> +	return false;
>       }
>     /* No problems encountered.  */
>     return true;
  

Patch

diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index d7f323b2cd5..3b34d804cf1 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -1886,6 +1886,13 @@  cpp_valid_utf8_p (const char *buffer, size_t num_bytes)
       int err = one_utf8_to_cppchar (&iter, &bytesleft, &cp);
       if (err)
 	return false;
+
+      /* Additionally, Unicode declares that all codepoints above 0x10FFFF are
+	 invalid because they cannot be represented in UTF-16.
+
+	 Reject such values.  */
+      if (cp > 0x10FFFF)
+	return false;
     }
   /* No problems encountered.  */
   return true;
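
To illustrate the boundary this check is meant to enforce, here is a minimal standalone sketch. It does not use libcpp: `decode_one`, `valid_utf8_sketch`, and `unicode_max` are hypothetical stand-ins, and the decoder deliberately skips overlong-form and surrogate checks. The comparison is strict so that U+10FFFF itself, the last valid codepoint, is still accepted.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t unicode_max = 0x10FFFF;

// Hypothetical helper (not libcpp's one_utf8_to_cppchar): decode one UTF-8
// sequence of 1 to 4 bytes from P, with N bytes available.  Returns the
// number of bytes consumed, or 0 if the sequence is truncated or malformed.
// Overlong forms and surrogates are intentionally not checked here.
std::size_t
decode_one (const unsigned char *p, std::size_t n, std::uint32_t *cp)
{
  if (n == 0)
    return 0;
  std::size_t len;
  std::uint32_t c;
  if (p[0] < 0x80)                { len = 1; c = p[0]; }
  else if ((p[0] & 0xE0) == 0xC0) { len = 2; c = p[0] & 0x1F; }
  else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; }
  else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; }
  else
    return 0;
  if (len > n)
    return 0;
  for (std::size_t i = 1; i < len; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (p[i] & 0x3F);
    }
  *cp = c;
  return len;
}

// Validation loop with the range check applied after each decode.
bool
valid_utf8_sketch (const unsigned char *buf, std::size_t n)
{
  while (n > 0)
    {
      std::uint32_t cp;
      std::size_t used = decode_one (buf, n, &cp);
      if (used == 0)
        return false;
      if (cp > unicode_max)  // strictly greater: U+10FFFF is valid
        return false;
      buf += used;
      n -= used;
    }
  return true;
}
```

Under these assumptions, the bytes F4 8F BF BF (U+10FFFF) pass, while F4 90 80 80 (which decodes to 0x110000) are rejected.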