Add string cache and use it in cooked index

Message ID 20250309173146.1675304-1-tom@tromey.com
State New
Headers
Series Add string cache and use it in cooked index |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 fail Patch failed to apply
linaro-tcwg-bot/tcwg_gdb_build--master-arm fail Patch failed to apply

Commit Message

Tom Tromey March 9, 2025, 5:31 p.m. UTC
  The cooked index needs to allocate names in some cases -- when
canonicalizing or when synthesizing Ada package names.  This process
currently uses a vector of unique_ptrs to manage the memory.

Another series I'm writing adds another spot where this allocation
must be done, and examining the result showed that certain names were
allocated multiple times.

To clean this up, this patch introduces a string cache object and
changes the cooked indexer to use it.  I considered using bcache here,
but bcache doesn't work as nicely with string_view -- because bcache
is fundamentally memory-based, a temporary copy of the contents must
be made to ensure that bcache can see the trailing \0.  Furthermore,
writing a custom class lets us avoid another copy when canonicalizing
C++ names.
---
 gdb/dwarf2/cooked-index.c |  16 ++---
 gdb/dwarf2/cooked-index.h |   3 +-
 gdbsupport/string-set.h   | 138 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 144 insertions(+), 13 deletions(-)
 create mode 100644 gdbsupport/string-set.h
  

Comments

Simon Marchi March 10, 2025, 4:33 a.m. UTC | #1
On 2025-03-09 13:31, Tom Tromey wrote:
> diff --git a/gdbsupport/string-set.h b/gdbsupport/string-set.h
> new file mode 100644
> index 00000000000..836dd6a7afc
> --- /dev/null
> +++ b/gdbsupport/string-set.h
> @@ -0,0 +1,138 @@
> +/* String-interning set
> +
> +   Copyright (C) 2025 Free Software Foundation, Inc.
> +
> +   This file is part of GDB.
> +
> +   This program is free software; you can redistribute it and/or modify
> +   it under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3 of the License, or
> +   (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GDBSUPPORT_STRING_SET_H
> +#define GDBSUPPORT_STRING_SET_H
> +
> +#include "gdbsupport/common-utils.h"
> +#include "gdbsupport/unordered_set.h"
> +#include <string_view>
> +
> +namespace gdb
> +{
> +
> +/* This is a string-interning set.  It manages storage for strings,
> +   ensuring that just a single copy of a given string is kept.  The
> +   underlying C string will remain valid for the lifetime of this
> +   object.  */
> +
> +class string_set
> +{
> +public:
> +
> +  string_set () = default;

The explicit default is not needed.

> +
> +  /* Insert STR into this set.  Returns a pointer to the interned
> +     string.  */
> +  const char *insert (const char *str)
> +  {
> +    /* We need to take the length to hash the string anyway, so it's
> +       convenient to just wrap it here.  */
> +    return insert (std::string_view (str));
> +  }
> +
> +  /* An overload accepting a string.  */
> +  const char *insert (const std::string &str)
> +  {
> +    return m_set.insert (str).first->get ();
> +  }

Is this overload necessary if we have the string_view one?  I think that
std::string is implicitly convertible to std::string_view.  Well, it
must be, since this ends up calling the string_view overload of
the local_string constructor.

> +
> +  /* An overload accepting a string view.  */
> +  const char *insert (std::string_view str)
> +  {
> +    return m_set.insert (str).first->get ();
> +  }
> +
> +  /* An overload that takes ownership of the string.  */
> +  const char *insert (gdb::unique_xmalloc_ptr<char> str)
> +  {
> +    local_string ls (std::move (str));
> +    return m_set.insert (std::move (ls)).first->get ();

Up to you, but I think it's clear without the local variable:

    return m_set.insert (local_string (std::move (str))).first->get ();

> +  }
> +
> +private:
> +
> +  /* The type of string we store.  Note that we do not store
> +     std::string here to avoid the small-string optimization
> +     invalidating a pointer on rehash.  */
> +  struct local_string
> +  {
> +    explicit local_string (std::string_view str)
> +      : contents (xstrndup (str.data (), str.size ())),
> +	len (str.size ())
> +    { }
> +
> +    explicit local_string (gdb::unique_xmalloc_ptr<char> str)
> +      : contents (std::move (str)),
> +	len (strlen (contents.get ()))
> +    { }
> +
> +    const char *get () const
> +    { return contents.get (); }
> +
> +    std::string_view as_view () const
> +    { return std::string_view (contents.get (), len); }
> +
> +    /* \0-terminated string contents.  */
> +    gdb::unique_xmalloc_ptr<char> contents;
> +    /* Length of the string.  */
> +    size_t len;
> +  };
> +
> +  /* Equality object for the set.  */
> +  struct str_eq
> +  {
> +    using is_transparent = void;
> +
> +    bool operator() (std::string_view lhs, const local_string &rhs)
> +      const noexcept
> +    {
> +      return lhs == rhs.as_view ();
> +    }
> +
> +    bool operator() (const local_string &lhs, const local_string &rhs)
> +      const noexcept
> +    {
> +      return strcmp (lhs.get (), rhs.get ()) == 0;
> +    }
> +  };
> +
> +  /* Hash object for the set.  */
> +  struct str_hash
> +  {
> +    using is_transparent = void;
> +
> +    bool operator() (const local_string &rhs) const noexcept
> +    {
> +      return fast_hash (rhs.get (), rhs.len);
> +    }
> +
> +    bool operator() (std::string_view rhs) const noexcept
> +    {
> +      return fast_hash (rhs.data (), rhs.size ());
> +    }
> +  };

As you probably saw in my patches, I try to make it show that the
various operator() overloads end up calling a single one.  I think this
helps reduce the risk of behavior difference between them, which would
be a nasty and hard to debug bug.  I do that with the hope that the
compiler will all optimize it away, although I haven't checked.

If you want to follow the same style here, it would look like this:

  /* Equality object for the set.  */
  struct str_eq
  {
    using is_transparent = void;

    bool operator() (std::string_view lhs, const local_string &rhs)
      const noexcept
    { return lhs == rhs.as_view (); }

    bool operator() (const local_string &lhs, const local_string &rhs)
      const noexcept
    { return (*this) (lhs.as_view (), rhs); }
  };

  /* Hash object for the set.  */
  struct str_hash
  {
    using is_transparent = void;

    bool operator() (std::string_view rhs) const noexcept
    { return fast_hash (rhs.data (), rhs.size ()); }

    bool operator() (const local_string &rhs) const noexcept
    { return (*this) (rhs.as_view ()); }
  };

I didn't use it in my previous gdb::unordered_set because I don't really
know much about hash functions, but it is possible to define the
is_avalanching member to tell unordered_dense that our hash function is
good (I assume that whatever xxhash provides has good avalanching
effect):

    https://github.com/martinus/unordered_dense?tab=readme-ov-file#32-hash

If you think it's appropriate to use it here, it would probably save a
few cycles.

Simon
  
Tom Tromey March 10, 2025, 7:36 p.m. UTC | #2
>>>>> "Simon" == Simon Marchi <simark@simark.ca> writes:

>> +  /* An overload accepting a string.  */
>> +  const char *insert (const std::string &str)
>> +  {
>> +    return m_set.insert (str).first->get ();
>> +  }

Simon> Is this overload necessary if we have the string_view one?

No, I removed it.

Simon> I didn't use it in my previous gdb::unordered_set because I don't really
Simon> know much about hash functions, but it is possible to define the
Simon> is_avalanching member to tell unordered_dense that our hash function is
Simon> good (I assume that whatever xxhash provides has good avalanching
Simon> effect):

In v2 I changed this to use the ankerl hasher, since it is already
specialized for string_view.  I also added the is_avalanching tag.

Tom
  

Patch

diff --git a/gdb/dwarf2/cooked-index.c b/gdb/dwarf2/cooked-index.c
index 6612585649f..9d3a4b03489 100644
--- a/gdb/dwarf2/cooked-index.c
+++ b/gdb/dwarf2/cooked-index.c
@@ -368,13 +368,11 @@  cooked_index_shard::handle_gnat_encoded_entry
       cooked_index_entry *last = (cooked_index_entry *) *slot;
       if (last == nullptr || last->per_cu != entry->per_cu)
 	{
-	  gdb::unique_xmalloc_ptr<char> new_name
-	    = make_unique_xstrndup (name.data (), name.length ());
+	  const char *new_name = m_names.insert (name);
 	  last = create (entry->die_offset, DW_TAG_module,
-			 IS_SYNTHESIZED, language_ada, new_name.get (), parent,
+			 IS_SYNTHESIZED, language_ada, new_name, parent,
 			 entry->per_cu);
 	  last->canonical = last->name;
-	  m_names.push_back (std::move (new_name));
 	  new_entries.push_back (last);
 	  *slot = last;
 	}
@@ -383,9 +381,7 @@  cooked_index_shard::handle_gnat_encoded_entry
     }
 
   entry->set_parent (parent);
-  auto new_canon = make_unique_xstrndup (tail.data (), tail.length ());
-  entry->canonical = new_canon.get ();
-  m_names.push_back (std::move (new_canon));
+  entry->canonical = m_names.insert (tail);
 }
 
 /* See cooked-index.h.  */
@@ -503,10 +499,7 @@  cooked_index_shard::finalize (const parent_map_map *parent_maps)
 	      if (canon_name == nullptr)
 		entry->canonical = entry->name;
 	      else
-		{
-		  entry->canonical = canon_name.get ();
-		  m_names.push_back (std::move (canon_name));
-		}
+		entry->canonical = m_names.insert (std::move (canon_name));
 	      *slot = entry;
 	    }
 	  else
@@ -526,7 +519,6 @@  cooked_index_shard::finalize (const parent_map_map *parent_maps)
   m_entries.insert (m_entries.end (), new_gnat_entries.begin (),
 		    new_gnat_entries.end ());
 
-  m_names.shrink_to_fit ();
   m_entries.shrink_to_fit ();
   std::sort (m_entries.begin (), m_entries.end (),
 	     [] (const cooked_index_entry *a, const cooked_index_entry *b)
diff --git a/gdb/dwarf2/cooked-index.h b/gdb/dwarf2/cooked-index.h
index f6586359770..c6911c23781 100644
--- a/gdb/dwarf2/cooked-index.h
+++ b/gdb/dwarf2/cooked-index.h
@@ -32,6 +32,7 @@ 
 #include "dwarf2/read.h"
 #include "dwarf2/parent-map.h"
 #include "gdbsupport/range-chain.h"
+#include "gdbsupport/string-set.h"
 #include "complaints.h"
 
 #if CXX_STD_THREAD
@@ -367,7 +368,7 @@  class cooked_index_shard
   /* The addrmap.  This maps address ranges to dwarf2_per_cu objects.  */
   addrmap_fixed *m_addrmap = nullptr;
   /* Storage for canonical names.  */
-  std::vector<gdb::unique_xmalloc_ptr<char>> m_names;
+  gdb::string_set m_names;
 };
 
 using cooked_index_shard_up = std::unique_ptr<cooked_index_shard>;
diff --git a/gdbsupport/string-set.h b/gdbsupport/string-set.h
new file mode 100644
index 00000000000..836dd6a7afc
--- /dev/null
+++ b/gdbsupport/string-set.h
@@ -0,0 +1,138 @@ 
+/* String-interning set
+
+   Copyright (C) 2025 Free Software Foundation, Inc.
+
+   This file is part of GDB.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#ifndef GDBSUPPORT_STRING_SET_H
+#define GDBSUPPORT_STRING_SET_H
+
+#include "gdbsupport/common-utils.h"
+#include "gdbsupport/unordered_set.h"
+#include <string_view>
+
+namespace gdb
+{
+
+/* This is a string-interning set.  It manages storage for strings,
+   ensuring that just a single copy of a given string is kept.  The
+   underlying C string will remain valid for the lifetime of this
+   object.  */
+
+class string_set
+{
+public:
+
+  string_set () = default;
+
+  /* Insert STR into this set.  Returns a pointer to the interned
+     string.  */
+  const char *insert (const char *str)
+  {
+    /* We need to take the length to hash the string anyway, so it's
+       convenient to just wrap it here.  */
+    return insert (std::string_view (str));
+  }
+
+  /* An overload accepting a string.  */
+  const char *insert (const std::string &str)
+  {
+    return m_set.insert (str).first->get ();
+  }
+
+  /* An overload accepting a string view.  */
+  const char *insert (std::string_view str)
+  {
+    return m_set.insert (str).first->get ();
+  }
+
+  /* An overload that takes ownership of the string.  */
+  const char *insert (gdb::unique_xmalloc_ptr<char> str)
+  {
+    local_string ls (std::move (str));
+    return m_set.insert (std::move (ls)).first->get ();
+  }
+
+private:
+
+  /* The type of string we store.  Note that we do not store
+     std::string here to avoid the small-string optimization
+     invalidating a pointer on rehash.  */
+  struct local_string
+  {
+    explicit local_string (std::string_view str)
+      : contents (xstrndup (str.data (), str.size ())),
+	len (str.size ())
+    { }
+
+    explicit local_string (gdb::unique_xmalloc_ptr<char> str)
+      : contents (std::move (str)),
+	len (strlen (contents.get ()))
+    { }
+
+    const char *get () const
+    { return contents.get (); }
+
+    std::string_view as_view () const
+    { return std::string_view (contents.get (), len); }
+
+    /* \0-terminated string contents.  */
+    gdb::unique_xmalloc_ptr<char> contents;
+    /* Length of the string.  */
+    size_t len;
+  };
+
+  /* Equality object for the set.  */
+  struct str_eq
+  {
+    using is_transparent = void;
+
+    bool operator() (std::string_view lhs, const local_string &rhs)
+      const noexcept
+    {
+      return lhs == rhs.as_view ();
+    }
+
+    bool operator() (const local_string &lhs, const local_string &rhs)
+      const noexcept
+    {
+      return strcmp (lhs.get (), rhs.get ()) == 0;
+    }
+  };
+
+  /* Hash object for the set.  */
+  struct str_hash
+  {
+    using is_transparent = void;
+
+    bool operator() (const local_string &rhs) const noexcept
+    {
+      return fast_hash (rhs.get (), rhs.len);
+    }
+
+    bool operator() (std::string_view rhs) const noexcept
+    {
+      return fast_hash (rhs.data (), rhs.size ());
+    }
+  };
+
+  /* The strings.  */
+  gdb::unordered_set<local_string, str_hash, str_eq> m_set;
+};
+
+}
+
+#endif /* GDBSUPPORT_STRING_SET_H */