[00/19] Add hash table to gdbsupport

Message ID 20230407-t-robin-hood-hash-v1-0-900d93ef1510@tromey.com
Headers
Series Add hash table to gdbsupport |

Message

Tom Tromey April 7, 2023, 3:25 p.m. UTC
  I recently read an article about hash tables and was inspired to write
a new one for gdb.  I haven't converted all the libiberty htab uses in
gdb, but this series does change enough of them to, I think, show that
the new implementation is workable.

The benefits of this approach are explained in the first patch.

Regression tested on x86-64 Fedora 36.  This found a latent bug in one
use of htab_t, see the typedefs patch.

Let me know what you think.

---
Tom Tromey (19):
      Add a hash table to gdbsupport
      Convert compile-c-symbols.c to new hash table
      Convert filename-seen-cache.h to new hash table
      Convert linespec.c to new hash table
      Convert target-descriptions.c to new hash table
      Convert dwarf2/macro.c to new hash table
      Convert breakpoint.c to new hash table
      Convert py-framefilter.c to new hash table
      Convert disasm.c to new hash table
      Convert compile/compile.c to new hash table
      Convert type copying to new hash table
      Convert static links to new hash table
      Convert gnu-v3-abi.c to new hash table
      Convert abbrev cache to new hash table
      Convert abbrevs to new hash table
      Convert typedef hash to new hash table
      Convert all_bfds to new hash table
      Convert more DWARF code to new hash table
      Convert gdb_bfd.c to new hash table

 gdb/Makefile.in                         |   2 +-
 gdb/breakpoint.c                        |  16 +-
 gdb/compile/compile-c-symbols.c         |  54 ++--
 gdb/compile/compile-object-run.c        |   4 +-
 gdb/compile/compile.c                   | 151 +--------
 gdb/compile/compile.h                   |  10 +-
 gdb/disasm.c                            |  72 ++---
 gdb/dwarf2/abbrev-cache.c               |  35 +-
 gdb/dwarf2/abbrev-cache.h               |  37 ++-
 gdb/dwarf2/abbrev.c                     |  46 ---
 gdb/dwarf2/abbrev.h                     |  54 +++-
 gdb/dwarf2/cu.c                         |  50 +--
 gdb/dwarf2/cu.h                         |  10 +-
 gdb/dwarf2/macro.c                      |  21 +-
 gdb/dwarf2/read.c                       |  55 ++--
 gdb/extension-priv.h                    |   3 +-
 gdb/extension.c                         |   3 +-
 gdb/extension.h                         |   4 +-
 gdb/filename-seen-cache.c               |  59 ----
 gdb/filename-seen-cache.h               |  35 +-
 gdb/gdb_bfd.c                           | 137 +++-----
 gdb/gdbtypes.c                          |  59 +---
 gdb/gdbtypes.h                          |   7 +-
 gdb/gnu-v3-abi.c                        | 101 +++---
 gdb/guile/guile-internal.h              |   2 +-
 gdb/guile/scm-type.c                    |   9 +-
 gdb/guile/scm-value.c                   |   3 +-
 gdb/linespec.c                          |  53 +--
 gdb/objfiles.c                          |  74 +----
 gdb/objfiles.h                          |   4 +-
 gdb/python/py-framefilter.c             |  25 +-
 gdb/python/py-type.c                    |   6 +-
 gdb/python/py-value.c                   |   3 +-
 gdb/python/python-internal.h            |   2 +-
 gdb/symfile.c                           |   2 +-
 gdb/target-descriptions.c               |  17 +-
 gdb/testsuite/gdb.cp/ptype-flags.exp    |  25 +-
 gdb/testsuite/gdb.gdb/python-helper.exp |   3 +-
 gdb/typeprint.c                         |  89 +-----
 gdb/typeprint.h                         |  29 +-
 gdb/unittests/hash-table-selftests.c    | 128 ++++++++
 gdb/value.c                             |  17 +-
 gdb/value.h                             |   4 +-
 gdbsupport/Makefile.am                  |   1 +
 gdbsupport/Makefile.in                  |  19 +-
 gdbsupport/hash-table.cc                |  75 +++++
 gdbsupport/hash-table.h                 | 551 ++++++++++++++++++++++++++++++++
 47 files changed, 1199 insertions(+), 967 deletions(-)
---
base-commit: 929a05081ec2ca6448927b96f673b0cd9633a342
change-id: 20230407-t-robin-hood-hash-e01cb3a6d3d6

Best regards,
  

Comments

John Baldwin April 10, 2023, 7:45 p.m. UTC | #1
On 4/7/23 8:25 AM, Tom Tromey wrote:
> I recently read an article about hash tables and was inspired to write
> a new one for gdb.  I haven't converted all the libiberty htab uses in
> gdb, but this series does change enough of them to, I think, show that
> the new implementation is workable.
> 
> The benefits of this approach are explained in the first patch.
> 
> Regression tested on x86-64 Fedora 36.  This found a latent bug in one
> use of htab_t, see the typedefs patch.
> 
> Let me know what you think.

I just have one possibly naive question which is why not use
std::unordered_map<> directly in the various callers?  At least, I think
it might be useful to explain why GDB includes a bespoke hash table now
that we require C++11, it was more obvious I think when GDB was still
written in C.
  
Tom Tromey Nov. 3, 2023, 6:54 p.m. UTC | #2
>>>>> "John" == John Baldwin <jhb@FreeBSD.org> writes:

John> On 4/7/23 8:25 AM, Tom Tromey wrote:
>> I recently read an article about hash tables and was inspired to write
>> a new one for gdb.  I haven't converted all the libiberty htab uses in
>> gdb, but this series does change enough of them to, I think, show that
>> the new implementation is workable.
>> The benefits of this approach are explained in the first patch.
>> Regression tested on x86-64 Fedora 36.  This found a latent bug in
>> one
>> use of htab_t, see the typedefs patch.
>> Let me know what you think.

John> I just have one possibly naive question which is why not use
John> std::unordered_map<> directly in the various callers?  At least, I think
John> it might be useful to explain why GDB includes a bespoke hash table now
John> that we require C++11, it was more obvious I think when GDB was still
John> written in C.

std::unordered_map has some issues.  They may not affect every spot
using hash tables in gdb, but they probably affect some of them.

unordered_map is necessarily a chained hash table, not an open addressed
hash table.  And, it has to allocate nodes for entries, because it
promises pointer stability.  So, it has higher overhead than the hash
table in this series.

Also, I don't think there's a way to use one type as the key in
unordered_map but do lookups using a different type.  gdb uses this in
some places.

I'll add some text to this effect to patch #1.

Tom
  
Tom Tromey Dec. 8, 2023, 6:28 p.m. UTC | #3
>>>>> "Tom" == Tom Tromey <tom@tromey.com> writes:

Tom> I recently read an article about hash tables and was inspired to write
Tom> a new one for gdb.  I haven't converted all the libiberty htab uses in
Tom> gdb, but this series does change enough of them to, I think, show that
Tom> the new implementation is workable.

Tom> The benefits of this approach are explained in the first patch.

Tom> Regression tested on x86-64 Fedora 36.  This found a latent bug in one
Tom> use of htab_t, see the typedefs patch.

Tom> Let me know what you think.

I'd like to move forward with this.

I think most of the patches here are straightforward.  The real question
is whether we want it.  I've laid out my rationale in patch #1, with
some additions (explaining why this is better than unordered_map /
unordered_set) in a follow-up email.

Tom
  
Tom Tromey Jan. 11, 2024, 6:07 p.m. UTC | #4
>>>>> "Tom" == Tom Tromey <tom@tromey.com> writes:

[ new hash table ]

Tom> I'd like to move forward with this.

Tom> I think most of the patches here are straightforward.  The real question
Tom> is whether we want it.  I've laid out my rationale in patch #1, with
Tom> some additions (explaining why this is better than unordered_map /
Tom> unordered_set) in a follow-up email.

I tried converting a few more libiberty hash tables to this one.

For a lot of things this is fine, big improvement, etc.

However, this hash table doesn't really admit the possibility that an
element might not have a natural sentinel value.  This particularly
comes up with hash maps, where if you want a std::string->whatever map,
you have to introduce some kind of explicit sentinel (in my case I chose
empty string) and also then have to write a trait class.

So, this is one area where unordered_map has the advantage.  Now, it
would be possible to handle this in some way.  The hash table could
store a sentinel flag, for instance.  Or, we could just accept this
drawback and use unordered_map when convenience is important.

Anyway, I tend to think it's not an enormous issue, but I figured I'd
point it out.

Tom
  
John Baldwin Jan. 11, 2024, 7:35 p.m. UTC | #5
On 1/11/24 10:07 AM, Tom Tromey wrote:
>>>>>> "Tom" == Tom Tromey <tom@tromey.com> writes:
> 
> [ new hash table ]
> 
> Tom> I'd like to move forward with this.
> 
> Tom> I think most of the patches here are straightforward.  The real question
> Tom> is whether we want it.  I've laid out my rationale in patch #1, with
> Tom> some additions (explaining why this is better than unordered_map /
> Tom> unordered_set) in a follow-up email.
> 
> I tried converting a few more libiberty hash tables to this one.
> 
> For a lot of things this is fine, big improvement, etc.
> 
> However, this hash table doesn't really admit the possibility that an
> element might not have a natural sentinel value.  This particularly
> comes up with hash maps, where if you want a std::string->whatever map,
> you have to introduce some kind of explicit sentinel (in my case I chose
> empty string) and also then have to write a trait class.
> 
> So, this is one area where unordered_map has the advantage.  Now, it
> would be possible to handle this in some way.  The hash table could
> store a sentinel flag, for instance.  Or, we could just accept this
> drawback and use unordered_map when convenience is important.> 
> Anyway, I tend to think it's not an enormous issue, but I figured I'd
> point it out.

This seems fine to me.
  
Simon Marchi Jan. 12, 2024, 2:57 a.m. UTC | #6
On 2024-01-11 13:07, Tom Tromey wrote:
>>>>>> "Tom" == Tom Tromey <tom@tromey.com> writes:
> 
> [ new hash table ]
> 
> Tom> I'd like to move forward with this.
> 
> Tom> I think most of the patches here are straightforward.  The real question
> Tom> is whether we want it.  I've laid out my rationale in patch #1, with
> Tom> some additions (explaining why this is better than unordered_map /
> Tom> unordered_set) in a follow-up email.
> 
> I tried converting a few more libiberty hash tables to this one.
> 
> For a lot of things this is fine, big improvement, etc.
> 
> However, this hash table doesn't really admit the possibility that an
> element might not have a natural sentinel value.  This particularly
> comes up with hash maps, where if you want a std::string->whatever map,
> you have to introduce some kind of explicit sentinel (in my case I chose
> empty string) and also then have to write a trait class.
> 
> So, this is one area where unordered_map has the advantage.  Now, it
> would be possible to handle this in some way.  The hash table could
> store a sentinel flag, for instance.  Or, we could just accept this
> drawback and use unordered_map when convenience is important.
> 
> Anyway, I tend to think it's not an enormous issue, but I figured I'd
> point it out.
> 
> Tom

I spent a bit of time reading the interface of your hash table, and that
was a point I found unfortunate.  The type V must have a default
constructor and an "empty" state (I guess that's what you mean by
sentinel), even if it doesn't make sense for the rest of the program.

How do other implementations (of open addessing hash tables) typically
deal with this?  From what I recally, other implementations I have used
in the past didn't have this requirement.

And that makes me think that a question I did not see answered is: I
don't want to undermine your work, but what is the rationale for
implementing it ourselves?  Efficient C++ hash maps is a topic that
comes up all the time on various programming news sites (I found this
recent blog post after a quick google search [1]), surely that's a
solved problem.  Is there not a popular hash map library we can use
as-is?

Simon

[1] https://martin.ankerl.com/2022/08/27/hashmap-bench-01/
  
Tom Tromey Jan. 12, 2024, 6:22 p.m. UTC | #7
Simon> I spent a bit of time reading the interface of your hash table, and that
Simon> was a point I found unfortunate.  The type V must have a default
Simon> constructor and an "empty" state (I guess that's what you mean by
Simon> sentinel), even if it doesn't make sense for the rest of the program.

It's maybe worth noting that this is not a step backward from htab_t,
which only stores pointers.

Simon> How do other implementations (of open addessing hash tables) typically
Simon> deal with this?  From what I recally, other implementations I have used
Simon> in the past didn't have this requirement.

I don't know in general.  I looked at one and it uses a separate vector
of flag bytes to indicate which values are valid.

Simon> And that makes me think that a question I did not see answered is: I
Simon> don't want to undermine your work, but what is the rationale for
Simon> implementing it ourselves?

In my view, integrating an external C++ package like this is pain both
legally (at least, you have to find one with the appropriate license)
and technically.  Even just trying to integrate the GCC hash table was a
pain.  The typical problem, IMO, is that it's unusual to write a
standalone class like this, instead one introduces dependencies on a
variety of other things.

Tom
  
Simon Marchi Jan. 12, 2024, 7:12 p.m. UTC | #8
On 1/12/24 13:22, Tom Tromey wrote:
> Simon> I spent a bit of time reading the interface of your hash table, and that
> Simon> was a point I found unfortunate.  The type V must have a default
> Simon> constructor and an "empty" state (I guess that's what you mean by
> Simon> sentinel), even if it doesn't make sense for the rest of the program.
> 
> It's maybe worth noting that this is not a step backward from htab_t,
> which only stores pointers.

That's true, in the sense that you can still decide to store pointers in
the gdb::traited_hash_table.  But if you want to change it so that it
stores objects directly (which I guess is an advantage because it's one
less indirection), then you have to introduce this new "invalid" or
"empty" state for objects for that class.  It therefore removes the
guarantee you previously had for objects of that class to always be
valid, and it becomes possible for objects used elsewhere to be (by
mistake or not) in that new "empty" state.  It's one more thing to
reason about, and some more potential pitfalls.

> Simon> How do other implementations (of open addessing hash tables) typically
> Simon> deal with this?  From what I recally, other implementations I have used
> Simon> in the past didn't have this requirement.
> 
> I don't know in general.  I looked at one and it uses a separate vector
> of flag bytes to indicate which values are valid.

I looked at this one:

https://github.com/martinus/unordered_dense/tree/main

From what I understand, the underlying storage is a vector, and they
somehow make it so all vector elements are always occupied, there are
no empty slots.

> Simon> And that makes me think that a question I did not see answered is: I
> Simon> don't want to undermine your work, but what is the rationale for
> Simon> implementing it ourselves?
> 
> In my view, integrating an external C++ package like this is pain both
> legally (at least, you have to find one with the appropriate license)
> and technically.  Even just trying to integrate the GCC hash table was a
> pain.  The typical problem, IMO, is that it's unusual to write a
> standalone class like this, instead one introduces dependencies on a
> variety of other things.

I did have some experiences (with folly notably) where trying to use one
data structure pulled in a bunch of files and it quickly became
unmanageable, but that was not the norm.  But in my experience, most of
the time it's just one header file to copy over, with no dependency.
The one I linked above seems to be just that (I gave it a quick try in
GDB).

And the license being MIT, I don't think that's a problem.  My
understanding is that it's permitted in this direction, and we use other
MIT-licensed stuff.

Simon