[v5,2/5] libcpp: add a function to determine UTF-8 validity of a C string

Message ID 20230125210636.2960049-3-ben.boeckel@kitware.com
State Superseded
Series P1689R5 support

Commit Message

Ben Boeckel Jan. 25, 2023, 9:06 p.m. UTC
This simplifies the interface for other UTF-8 validity detections when a
simple "yes" or "no" answer is sufficient.

libcpp/

	* charset.cc: Add `_cpp_valid_utf8_str` which determines whether
	a C string is valid UTF-8 or not.
	* internal.h: Add prototype for `_cpp_valid_utf8_str`.

Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>
---
 libcpp/charset.cc | 20 ++++++++++++++++++++
 libcpp/internal.h |  2 ++
 2 files changed, 22 insertions(+)
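
For readers without the libcpp sources at hand, the following is a minimal
standalone sketch of the same idea: walk a NUL-terminated string one UTF-8
sequence at a time and give up at the first decoding error. It is not the code
from this patch. `valid_utf8_str` and `decode_one` are hypothetical names, and
`decode_one` is only an assumed stand-in for libcpp's internal
`one_utf8_to_cppchar`, mirroring the way the patch calls it (pointer and
remaining length are advanced, nonzero return means error). The strictness
rules shown here (rejecting overlong forms, surrogates, and values above
U+10FFFF) are the usual UTF-8 well-formedness rules and may not match libcpp's
behavior in every detail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for libcpp's internal decoder.  Decode one UTF-8
   sequence from *INP (at most *LENP bytes), advance *INP and *LENP past
   it, and store the code point in *CP.  Returns 0 on success and nonzero
   on any malformed, truncated, or out-of-range sequence.  */
static int
decode_one (const unsigned char **inp, size_t *lenp, uint32_t *cp)
{
  /* Smallest code point that legitimately needs N bytes; index 0 is
     unused.  Anything smaller is an overlong encoding.  */
  static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
  const unsigned char *in = *inp;
  size_t n, i;
  uint32_t c;

  if (*lenp == 0)
    return 1;

  /* Classify the lead byte to find the sequence length.  */
  if (in[0] < 0x80)
    {
      n = 1;
      c = in[0];
    }
  else if ((in[0] & 0xE0) == 0xC0)
    {
      n = 2;
      c = in[0] & 0x1F;
    }
  else if ((in[0] & 0xF0) == 0xE0)
    {
      n = 3;
      c = in[0] & 0x0F;
    }
  else if ((in[0] & 0xF8) == 0xF0)
    {
      n = 4;
      c = in[0] & 0x07;
    }
  else
    return 1;	/* Stray continuation byte or invalid lead byte.  */

  if (n > *lenp)
    return 1;	/* Sequence is cut off by the end of the string.  */

  /* Every following byte must look like 10xxxxxx.  */
  for (i = 1; i < n; i++)
    {
      if ((in[i] & 0xC0) != 0x80)
	return 1;
      c = (c << 6) | (in[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 1;

  *inp = in + n;
  *lenp -= n;
  *cp = c;
  return 0;
}

/* Hypothetical yes/no wrapper: true if STR is entirely valid UTF-8.  */
bool
valid_utf8_str (const char *str)
{
  const unsigned char *in = (const unsigned char *) str;
  size_t len = strlen (str);
  uint32_t cp;

  while (len > 0)
    if (decode_one (&in, &len, &cp) != 0)
      return false;

  return true;
}
```

A caller that only needs the boolean answer can then write
`if (!valid_utf8_str (name)) ...` without tracking the decoder state itself,
which is the simplification the commit message describes.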
  

Comments

David Malcolm Oct. 23, 2023, 3:16 p.m. UTC | #1
On Wed, Jan 25, 2023 at 4:09 PM Ben Boeckel via Gcc <gcc@gcc.gnu.org> wrote:
>
> This simplifies the interface for other UTF-8 validity detections when a
> simple "yes" or "no" answer is sufficient.
>
> libcpp/
>
>         * charset.cc: Add `_cpp_valid_utf8_str` which determines whether
>         a C string is valid UTF-8 or not.
>         * internal.h: Add prototype for `_cpp_valid_utf8_str`.
>
> Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>

[going through patches in patchwork]

What's the status of this patch; did this ever get committed?

I see that Jason preapproved this via his review of "[PATCH v3 2/3]
libcpp: add a function to determine UTF-8 validity of a C string"

Thanks
Dave
  
Jason Merrill Oct. 23, 2023, 3:24 p.m. UTC | #2
On 10/23/23 11:16, David Malcolm wrote:
> On Wed, Jan 25, 2023 at 4:09 PM Ben Boeckel via Gcc <gcc@gcc.gnu.org> wrote:
>>
>> This simplifies the interface for other UTF-8 validity detections when a
>> simple "yes" or "no" answer is sufficient.
>>
>> libcpp/
>>
>>          * charset.cc: Add `_cpp_valid_utf8_str` which determines whether
>>          a C string is valid UTF-8 or not.
>>          * internal.h: Add prototype for `_cpp_valid_utf8_str`.
>>
>> Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>
> 
> [going through patches in patchwork]
> 
> What's the status of this patch; did this ever get committed?

It was superseded.

Jason
  
David Malcolm Oct. 23, 2023, 3:28 p.m. UTC | #3
On Mon, 2023-10-23 at 11:24 -0400, Jason Merrill wrote:
> On 10/23/23 11:16, David Malcolm wrote:
> > On Wed, Jan 25, 2023 at 4:09 PM Ben Boeckel via Gcc
> > <gcc@gcc.gnu.org> wrote:
> > > 
> > > This simplifies the interface for other UTF-8 validity detections
> > > when a
> > > simple "yes" or "no" answer is sufficient.
> > > 
> > > libcpp/
> > > 
> > >          * charset.cc: Add `_cpp_valid_utf8_str` which determines
> > > whether
> > >          a C string is valid UTF-8 or not.
> > >          * internal.h: Add prototype for `_cpp_valid_utf8_str`.
> > > 
> > > Signed-off-by: Ben Boeckel <ben.boeckel@kitware.com>
> > 
> > [going through patches in patchwork]
> > 
> > What's the status of this patch; did this ever get committed?
> 
> It was superseded.

Thanks; closed out in patchwork.

Dave
  

Patch

diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index f7ae12ea5a2..616be9d02ee 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -1868,6 +1868,26 @@  _cpp_valid_utf8 (cpp_reader *pfile,
   return true;
 }
 
+/* Detect whether a C string is a valid UTF-8-encoded sequence of bytes.
+   Returns `false` if any contained byte sequence encodes an invalid Unicode
+   codepoint or is not a valid UTF-8 sequence.  Returns `true` otherwise.  */
+
+extern bool
+_cpp_valid_utf8_str (const char *str)
+{
+  const uchar *in = (const uchar *) str;
+  size_t len = strlen (str);
+  cppchar_t cp;
+
+  while (*in)
+    {
+      if (one_utf8_to_cppchar (&in, &len, &cp))
+	return false;
+    }
+
+  return true;
+}
+
 /* Subroutine of convert_hex and convert_oct.  N is the representation
    in the execution character set of a numeric escape; write it into the
    string buffer TBUF and update the end-of-string pointer therein.  WIDE
diff --git a/libcpp/internal.h b/libcpp/internal.h
index 9724676a8cd..48520901b2d 100644
--- a/libcpp/internal.h
+++ b/libcpp/internal.h
@@ -834,6 +834,8 @@  extern bool _cpp_valid_utf8 (cpp_reader *pfile,
 			     struct normalize_state *nst,
 			     cppchar_t *cp);
 
+extern bool _cpp_valid_utf8_str (const char *str);
+
 extern void _cpp_destroy_iconv (cpp_reader *);
 extern unsigned char *_cpp_convert_input (cpp_reader *, const char *,
 					  unsigned char *, size_t, size_t,