'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes)

Message ID 87h6qjvfp1.fsf@euler.schwinge.homeip.net
State RFC, archived
Headers
Series 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes) |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 warning Patch failed to apply
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 warning Patch failed to apply
linaro-tcwg-bot/tcwg_gcc_build--master-arm warning Patch failed to apply
linaro-tcwg-bot/tcwg_gcc_check--master-arm warning Patch failed to apply

Commit Message

Thomas Schwinge July 4, 2023, 3:50 p.m. UTC
  Hi!

I came across this one here on my way working through another (somewhat
related) GTY issue.  I generally do understand the issue here, but do
have a question about 'unsigned int len' field in
'libcpp/include/symtab.h:struct ht_identifier':

On 2022-10-18T18:14:54-0400, Lewis Hyatt via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> When a GTY'ed struct is streamed to PCH, any plain char* pointers it contains
> (whether they live in GC-controlled memory or not) will be marked for PCH
> output by the routine gt_pch_note_object in ggc-common.cc. This routine
> special-cases plain char* strings, and in particular it uses strlen() to get
> their length.

Oh, wow, this special casing for strings...  8-|

> Thus it does not handle strings with embedded null bytes, but it
> is possible for something PCH cares about (such as a string literal token in a
> macro definition) to contain such embedded nulls. To fix that up, add a new
> GTY option "string_length" so that gt_pch_note_object can be informed the
> actual length it ought to use, and use it in the relevant libcpp structs
> (cpp_string and ht_identifier) accordingly.

For your test case:

> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.C
> @@ -0,0 +1,3 @@
> +// { dg-do compile { target c++11 } }
> +#include "pch-string-nulls.H"
> +static_assert (X[4] == '[' && X[5] == '!' && X[6] == ']', "error");

> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.Hs
> @@ -0,0 +1,2 @@
> +/* Note that there is a null byte following "ABC".  */
> +#define X R"(ABC^@[!])"

..., I understand how the following is necessary:

> --- a/libcpp/include/cpplib.h
> +++ b/libcpp/include/cpplib.h
> @@ -179,7 +179,11 @@ enum c_lang {CLK_GNUC89 = 0, CLK_GNUC99, CLK_GNUC11, CLK_GNUC17, CLK_GNUC2X,
>  /* Payload of a NUMBER, STRING, CHAR or COMMENT token.  */
>  struct GTY(()) cpp_string {
>    unsigned int len;
> -  const unsigned char *text;
> +
> +  /* TEXT is always null terminated (terminator not included in len); but this
> +     GTY markup arranges that PCH streaming works properly even if there is a
> +     null byte in the middle of the string.  */
> +  const unsigned char * GTY((string_length ("1 + %h.len"))) text;
>  };

(That is, the test case FAILs with that one reverted.)

However, this one did confuse me:

> --- a/libcpp/include/symtab.h
> +++ b/libcpp/include/symtab.h
> @@ -29,7 +29,10 @@ along with this program; see the file COPYING3.  If not see
>  typedef struct ht_identifier ht_identifier;
>  typedef struct ht_identifier *ht_identifier_ptr;
>  struct GTY(()) ht_identifier {
> -  const unsigned char *str;
> +  /* This GTY markup arranges that the null-terminated identifier would still
> +     stream to PCH correctly, if a null byte were to make its way into an
> +     identifier somehow.  */
> +  const unsigned char * GTY((string_length ("1 + %h.len"))) str;
>    unsigned int len;
>    unsigned int hash_value;
>  };

I did wonder whether that's an actual or just a theoretical concern, to
have 'ht_identifier's with embedded NULs?  If an actual concern, can we
get a test case constructed?  Otherwise, should we revert this hunk,
given that we have this auto-'strlen' handling, ignorant of embedded
NULs, in a lot of other places?

But then I realized that possibly we do maintain 'len' here not for
correctness but as an optimization (trading an 'unsigned int' for
repeated 'strlen' calls)?  My quick testing with the attached
"[RFC] Verify no embedded NULs in 'struct ht_identifier'" might confirm
this theory: no regressions (..., but I didn't bootstrap, and ran only
parts of the testsuite).  (Not proposing that RFC for 'git push', of
course.)

If that's indeed the intention here, I shall change/add source code
commentary to describe this rationale for this dedicated 'len' field
(plus, that handling of embedded NULs falls out of that, automatically).

For reference, this 'len' field has existed "forever".  Before
'struct ht_identifier' was added in
commit 2a967f3d3a45294640e155381ef549e0b8090ad4 (Subversion r42334), we
had in 'gcc/cpplib.h:struct cpp_hashnode': 'unsigned short len', or
earlier 'length', earlier in 'gcc/cpphash.h:struct hashnode':
'unsigned short length', earlier 'size_t length' with comment:
"length of token, for quick comparison", erlier 'int length', ever since
the 'gcc/cpp*' files were added in
commit 7f2935c734c36f84ab62b20a04de465e19061333 (Subversion r9191).


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  

Comments

Lewis Hyatt July 4, 2023, 7:56 p.m. UTC | #1
On Tue, Jul 4, 2023 at 11:50 AM Thomas Schwinge <thomas@codesourcery.com> wrote:
>
> Hi!
>
> I came across this one here on my way working through another (somewhat
> related) GTY issue.  I generally do understand the issue here, but do
> have a question about 'unsigned int len' field in
> 'libcpp/include/symtab.h:struct ht_identifier':
>
> On 2022-10-18T18:14:54-0400, Lewis Hyatt via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > When a GTY'ed struct is streamed to PCH, any plain char* pointers it contains
> > (whether they live in GC-controlled memory or not) will be marked for PCH
> > output by the routine gt_pch_note_object in ggc-common.cc. This routine
> > special-cases plain char* strings, and in particular it uses strlen() to get
> > their length.
>
> Oh, wow, this special casing for strings...  8-|
>
> > Thus it does not handle strings with embedded null bytes, but it
> > is possible for something PCH cares about (such as a string literal token in a
> > macro definition) to contain such embedded nulls. To fix that up, add a new
> > GTY option "string_length" so that gt_pch_note_object can be informed the
> > actual length it ought to use, and use it in the relevant libcpp structs
> > (cpp_string and ht_identifier) accordingly.
>
> For your test case:
>
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.C
> > @@ -0,0 +1,3 @@
> > +// { dg-do compile { target c++11 } }
> > +#include "pch-string-nulls.H"
> > +static_assert (X[4] == '[' && X[5] == '!' && X[6] == ']', "error");
>
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.Hs
> > @@ -0,0 +1,2 @@
> > +/* Note that there is a null byte following "ABC".  */
> > +#define X R"(ABC^@[!])"
>
> ..., I understand how the following is necessary:
>
> > --- a/libcpp/include/cpplib.h
> > +++ b/libcpp/include/cpplib.h
> > @@ -179,7 +179,11 @@ enum c_lang {CLK_GNUC89 = 0, CLK_GNUC99, CLK_GNUC11, CLK_GNUC17, CLK_GNUC2X,
> >  /* Payload of a NUMBER, STRING, CHAR or COMMENT token.  */
> >  struct GTY(()) cpp_string {
> >    unsigned int len;
> > -  const unsigned char *text;
> > +
> > +  /* TEXT is always null terminated (terminator not included in len); but this
> > +     GTY markup arranges that PCH streaming works properly even if there is a
> > +     null byte in the middle of the string.  */
> > +  const unsigned char * GTY((string_length ("1 + %h.len"))) text;
> >  };
>
> (That is, the test case FAILs with that one reverted.)
>
> However, this one did confuse me:
>
> > --- a/libcpp/include/symtab.h
> > +++ b/libcpp/include/symtab.h
> > @@ -29,7 +29,10 @@ along with this program; see the file COPYING3.  If not see
> >  typedef struct ht_identifier ht_identifier;
> >  typedef struct ht_identifier *ht_identifier_ptr;
> >  struct GTY(()) ht_identifier {
> > -  const unsigned char *str;
> > +  /* This GTY markup arranges that the null-terminated identifier would still
> > +     stream to PCH correctly, if a null byte were to make its way into an
> > +     identifier somehow.  */
> > +  const unsigned char * GTY((string_length ("1 + %h.len"))) str;
> >    unsigned int len;
> >    unsigned int hash_value;
> >  };
>
> I did wonder whether that's an actual or just a theoretical concern, to
> have 'ht_identifier's with embedded NULs?  If an actual concern, can we
> get a test case constructed?  Otherwise, should we revert this hunk,
> given that we have this auto-'strlen' handling, ignorant of embedded
> NULs, in a lot of other places?
>
> But then I realized that possibly we do maintain 'len' here not for
> correctness but as an optimization (trading an 'unsigned int' for
> repeated 'strlen' calls)?  My quick testing with the attached
> "[RFC] Verify no embedded NULs in 'struct ht_identifier'" might confirm
> this theory: no regressions (..., but I didn't bootstrap, and ran only
> parts of the testsuite).  (Not proposing that RFC for 'git push', of
> course.)
>
> If that's indeed the intention here, I shall change/add source code
> commentary to describe this rationale for this dedicated 'len' field
> (plus, that handling of embedded NULs falls out of that, automatically).
>
> For reference, this 'len' field has existed "forever".  Before
> 'struct ht_identifier' was added in
> commit 2a967f3d3a45294640e155381ef549e0b8090ad4 (Subversion r42334), we
> had in 'gcc/cpplib.h:struct cpp_hashnode': 'unsigned short len', or
> earlier 'length', earlier in 'gcc/cpphash.h:struct hashnode':
> 'unsigned short length', earlier 'size_t length' with comment:
> "length of token, for quick comparison", erlier 'int length', ever since
> the 'gcc/cpp*' files were added in
> commit 7f2935c734c36f84ab62b20a04de465e19061333 (Subversion r9191).
>

I don't think there is currently any possibility for a null byte to
end up in an ht_identifier's string. I assumed that ht_identifier
stores the length as an optimization (especially since it doesn't take
up any extra space on 64-bit platforms, given the 32-bit hash code is
stored as well there.) I created the string_length GTY markup mainly
to support another patch that I have still pending review, which I
thought would increase the likelihood of PCH needing to handle null
bytes in general. When I did that, I added the markup to ht_identifier
simply because the length was already there, so there was no reason
not to add it. It does save a few cycles when streaming out the PCH,
but I doubt it is meaningful.

-Lewis
  

Patch

From 75b54da2f30b7ffa0c69c1659e3f3538bfdbd339 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 4 Jul 2023 14:07:47 +0200
Subject: [PATCH] [RFC] Verify no embedded NULs in 'struct ht_identifier'

---
 gcc/analyzer/pending-diagnostic.cc | 2 +-
 gcc/c-family/c-spellcheck.h        | 2 +-
 gcc/c/c-decl.cc                    | 2 +-
 gcc/tree.h                         | 2 +-
 libcpp/include/symtab.h            | 4 ++--
 libcpp/symtab.cc                   | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/gcc/analyzer/pending-diagnostic.cc b/gcc/analyzer/pending-diagnostic.cc
index e36ed4fd9c1..9452719a908 100644
--- a/gcc/analyzer/pending-diagnostic.cc
+++ b/gcc/analyzer/pending-diagnostic.cc
@@ -128,7 +128,7 @@  pending_diagnostic::same_tree_p (tree t1, tree t2)
 static bool
 ht_ident_eq (ht_identifier ident, const char *str)
 {
-  return (strlen (str) == ident.len
+  return (strlen (str) == HT_LEN (&ident)
 	  && 0 == strcmp (str, (const char *)ident.str));
 }
 
diff --git a/gcc/c-family/c-spellcheck.h b/gcc/c-family/c-spellcheck.h
index e3748430b18..0f0c264ef01 100644
--- a/gcc/c-family/c-spellcheck.h
+++ b/gcc/c-family/c-spellcheck.h
@@ -31,7 +31,7 @@  struct edit_distance_traits<cpp_hashnode *>
 {
   static size_t get_length (cpp_hashnode *hashnode)
   {
-    return hashnode->ident.len;
+    return HT_LEN (&hashnode->ident);
   }
 
   static const char *get_string (cpp_hashnode *hashnode)
diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
index 1af51c4acfc..c67ae5be514 100644
--- a/gcc/c/c-decl.cc
+++ b/gcc/c/c-decl.cc
@@ -4499,7 +4499,7 @@  lookup_name_fuzzy (tree name, enum lookup_name_fuzzy_kind kind, location_t loc)
     {
       const char *id = (const char *)best_macro->ident.str;
       tree macro_as_identifier
-	= get_identifier_with_length (id, best_macro->ident.len);
+	= get_identifier_with_length (id, HT_LEN (&best_macro->ident));
       bm.set_best_so_far (macro_as_identifier,
 			  bmm.get_best_distance (),
 			  bmm.get_best_candidate_length ());
diff --git a/gcc/tree.h b/gcc/tree.h
index 1854fe4a7d4..20169e7a467 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1174,7 +1174,7 @@  extern void omp_clause_range_check_failed (const_tree, const char *, int,
 
 /* Unlike STRING_CST, in C terms this is strlen, not sizeof.  */
 #define IDENTIFIER_LENGTH(NODE) \
-  (IDENTIFIER_NODE_CHECK (NODE)->identifier.id.len)
+  (HT_LEN (&IDENTIFIER_NODE_CHECK (NODE)->identifier.id))
 #define IDENTIFIER_POINTER(NODE) \
   ((const char *) IDENTIFIER_NODE_CHECK (NODE)->identifier.id.str)
 #define IDENTIFIER_HASH_VALUE(NODE) \
diff --git a/libcpp/include/symtab.h b/libcpp/include/symtab.h
index c7ccc6db9f0..ea23e3dbedb 100644
--- a/libcpp/include/symtab.h
+++ b/libcpp/include/symtab.h
@@ -33,11 +33,11 @@  struct GTY(()) ht_identifier {
      stream to PCH correctly, if a null byte were to make its way into an
      identifier somehow.  */
   const unsigned char * GTY((string_length ("1 + %h.len"))) str;
-  unsigned int len;
+  unsigned int len_;
   unsigned int hash_value;
 };
 
-#define HT_LEN(NODE) ((NODE)->len)
+#define HT_LEN(NODE) ({ gcc_assert ((NODE)->len_ == strlen ((const char *) HT_STR (NODE))); (NODE)->len_; })
 #define HT_STR(NODE) ((NODE)->str)
 
 typedef struct ht cpp_hash_table;
diff --git a/libcpp/symtab.cc b/libcpp/symtab.cc
index ea9f26905a9..be5c1da9d67 100644
--- a/libcpp/symtab.cc
+++ b/libcpp/symtab.cc
@@ -155,7 +155,7 @@  ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str,
   node = (*table->alloc_node) (table);
   table->entries[index] = node;
 
-  HT_LEN (node) = (unsigned int) len;
+  node->len_ = (unsigned int) len;
   node->hash_value = hash;
 
   if (table->alloc_subobject)
-- 
2.34.1