Don't use the mutex for each symbol_set_names call

Message ID 20190930165543.68106-1-cbiesinger@google.com
State New, archived
Headers

Commit Message

Terekhov, Mikhail via Gdb-patches Sept. 30, 2019, 4:55 p.m. UTC
  [Thanks Tom -- here's a rebased version, with also a small improvement
to parallel-for.h]

This avoids the lock contention that was seen with tromey's patch (it
moves it to a once- per-thread lock call from a once-per-symbol call).

It speeds up "attach to Chrome's content_shell binary" from 44 sec -> 30
sec (32%) (compared to GDB trunk), or from 37 sec compared to this
patchset before this patch.

gdb/ChangeLog:

2019-09-28  Christian Biesinger  <cbiesinger@google.com>

	* gdbsupport/parallel-for.h (parallel_for_each): Add a way to
	run a function on each thread after the elements are processed,
	to "finish up" computations.
	* minsyms.c (minimal_symbol_reader::install): Only do
	demangling on the background thread; still call
	symbol_set_names on the main thread.
	* symtab.c (symbol_find_demangled_name): Make public,
	so that minsyms.c can call it.
	(symbol_set_names): Remove mutex. Avoid demangle call
	if the demangled name is already set.
	* symtab.h (symbol_find_demangled_name): Make public.
---
 gdb/gdbsupport/parallel-for.h |  12 ++-
 gdb/minsyms.c                 |  48 +++++++---
 gdb/symtab.c                  | 162 +++++++++++++++-------------------
 gdb/symtab.h                  |  10 +++
 4 files changed, 130 insertions(+), 102 deletions(-)
  

Comments

Tom Tromey Oct. 2, 2019, 5:18 p.m. UTC | #1
>>>>> "Christian" == Christian Biesinger via gdb-patches <gdb-patches@sourceware.org> writes:

Christian> It speeds up "attach to Chrome's content_shell binary" from 44 sec -> 30
Christian> sec (32%) (compared to GDB trunk), or from 37 sec compared to this
Christian> patchset before this patch.

Nice.

Christian> +	  [&] (minimal_symbol *first, minimal_symbol* last) {
Christian> +	    std::lock_guard<std::mutex> guard (demangled_mutex);
Christian> +	    for (; first < last; ++first) {
Christian> +	      symbol_set_names (first, first->name,
Christian> +		  strlen (first->name), 0,
Christian> +		  m_objfile->per_bfd);
Christian> +	    }
Christian> +	  });

IIUC the idea is to separate the demangling from updating the
demangled_names_hash.

A couple of thoughts on that...

Christian> +          *slot
Christian> +            = ((struct demangled_name_entry *)
Christian> +      	 obstack_alloc (&per_bfd->storage_obstack,
Christian> +      			offsetof (struct demangled_name_entry, demangled)
Christian> +      			+ len + demangled_len + 2));
Christian> +          mangled_ptr = &((*slot)->demangled[demangled_len + 1]);
Christian> +          strcpy (mangled_ptr, linkage_name_copy);
Christian> +          (*slot)->mangled = mangled_ptr;

There's no deep reason that these things have to be stored on the
per-BFD obstack -- and requiring this means an extra copy.  Instead we
could change the hash table to use ordinary heap allocation, and I think
this would be more efficient when demangling in worker threads, because
we could just reuse the existing allocation.

Also, it seems fine to serialize the calls to symbol_set_names.  There's
no need for a lock at all, then.

One idea I had was to parallelize build_minimal_symbol_hash_tables as
well: when possible, have it run two threads, and create the hash tables
in parallel.  Adding a third thread here to update the
demangled_names_hash might help too?  Maybe this approach would
eliminate the need for your "Compute msymbol hash codes in parallel"
patch ... the issue there being that it makes the minsyms larger.
(Another way to handle that would be to keep the hashes in a local array
of some kind that is discarded once the hash tables are updated.)

Tom
  
Terekhov, Mikhail via Gdb-patches Oct. 2, 2019, 6:20 p.m. UTC | #2
On Wed, Oct 2, 2019 at 12:18 PM Tom Tromey <tom@tromey.com> wrote:
>
> >>>>> "Christian" == Christian Biesinger via gdb-patches <gdb-patches@sourceware.org> writes:
>
> Christian> It speeds up "attach to Chrome's content_shell binary" from 44 sec -> 30
> Christian> sec (32%) (compared to GDB trunk), or from 37 sec compared to this
> Christian> patchset before this patch.
>
> Nice.

I do need to redo these measurements with the latest version of the
branch and patch...

> Christian> +      [&] (minimal_symbol *first, minimal_symbol* last) {
> Christian> +        std::lock_guard<std::mutex> guard (demangled_mutex);
> Christian> +        for (; first < last; ++first) {
> Christian> +          symbol_set_names (first, first->name,
> Christian> +              strlen (first->name), 0,
> Christian> +              m_objfile->per_bfd);
> Christian> +        }
> Christian> +      });
>
> IIUC the idea is to separate the demangling from updating the
> demangled_names_hash.

That's correct.

> A couple of thoughts on that...
>
> Christian> +          *slot
> Christian> +            = ((struct demangled_name_entry *)
> Christian> +             obstack_alloc (&per_bfd->storage_obstack,
> Christian> +                            offsetof (struct demangled_name_entry, demangled)
> Christian> +                            + len + demangled_len + 2));
> Christian> +          mangled_ptr = &((*slot)->demangled[demangled_len + 1]);
> Christian> +          strcpy (mangled_ptr, linkage_name_copy);
> Christian> +          (*slot)->mangled = mangled_ptr;
>
> There's no deep reason that these things have to be stored on the
> per-BFD obstack -- and requiring this means an extra copy.  Instead we
> could change the hash table to use ordinary heap allocation, and I think
> this would be more efficient when demangling in worker threads, because
> we could just reuse the existing allocation.

Yes indeed. I was actually thinking of that last night -- we could
change to a hash_set<demangled_name_entry> + reuse the alloc from
gdb_demangle and avoid any allocations here.

> Also, it seems fine to serialize the calls to symbol_set_names.  There's
> no need for a lock at all, then.

True, though this way, if some threads finish faster than others it's
possible to parallelize the work a bit more.

> One idea I had was to parallelize build_minimal_symbol_hash_tables as
> well: when possible, have it run two threads, and create the hash tables
> in parallel.

Hmm, yeah, that's a good idea. I hadn't thought of doing it quite that way.

>  Adding a third thread here to update the
> demangled_names_hash might help too?  Maybe this approach would
> eliminate the need for your "Compute msymbol hash codes in parallel"
> patch ... the issue there being that it makes the minsyms larger.
> (Another way to handle that would be to keep the hashes in a local array
> of some kind that is discarded once the hash tables are updated.)

The local array is a bit tricky... it needs an entry for each msymbol,
which is only known at runtime. So it can't be stack-allocated with a
fixed size, and I'm hesitant to use alloca for this unbounded size. So
it would require a heap allocation for that vector. Maybe it's still
worth it...

OK, let me play with some ideas.


Christian
  
Terekhov, Mikhail via Gdb-patches Oct. 2, 2019, 10:01 p.m. UTC | #3
On Wed, Oct 2, 2019 at 1:20 PM Christian Biesinger
<cbiesinger@google.com> wrote:
>
> On Wed, Oct 2, 2019 at 12:18 PM Tom Tromey <tom@tromey.com> wrote:
> >
> > >>>>> "Christian" == Christian Biesinger via gdb-patches <gdb-patches@sourceware.org> writes:
> >
> > Christian> It speeds up "attach to Chrome's content_shell binary" from 44 sec -> 30
> > Christian> sec (32%) (compared to GDB trunk), or from 37 sec compared to this
> > Christian> patchset before this patch.
> >
> > Nice.
>
> I do need to redo these measurements with the latest version of the
> branch and patch...

Here are some new measurements on trunk (this is also with a
recompiled Chrome); either trunk gdb is slower or trunk Chrome is
bigger. I'll call tromey/t/parallel-minsyms-mutex "tromey" below.

GDB Trunk: 49.8s
tromey: 53-54s (!)
tromey + my patch here: ~30.3s
tromey + my patch here +
https://sourceware.org/ml/gdb-patches/2019-09/msg00633.html: 24.8s
(-18% compared to previous)
tromey + my patch here +
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=shortlog;h=refs/heads/users/cbiesinger/minsym-hash-one-thread:
28.8s (only -5%)

I repeatedly measured "tromey" because it is so slow and got
consistent results. It must be due to the lock contention.

> > Christian> +      [&] (minimal_symbol *first, minimal_symbol* last) {
> > Christian> +        std::lock_guard<std::mutex> guard (demangled_mutex);
> > Christian> +        for (; first < last; ++first) {
> > Christian> +          symbol_set_names (first, first->name,
> > Christian> +              strlen (first->name), 0,
> > Christian> +              m_objfile->per_bfd);
> > Christian> +        }
> > Christian> +      });
> >
> > IIUC the idea is to separate the demangling from updating the
> > demangled_names_hash.
>
> That's correct.
>
> > A couple of thoughts on that...
> >
> > Christian> +          *slot
> > Christian> +            = ((struct demangled_name_entry *)
> > Christian> +             obstack_alloc (&per_bfd->storage_obstack,
> > Christian> +                            offsetof (struct demangled_name_entry, demangled)
> > Christian> +                            + len + demangled_len + 2));
> > Christian> +          mangled_ptr = &((*slot)->demangled[demangled_len + 1]);
> > Christian> +          strcpy (mangled_ptr, linkage_name_copy);
> > Christian> +          (*slot)->mangled = mangled_ptr;
> >
> > There's no deep reason that these things have to be stored on the
> > per-BFD obstack -- and requiring this means an extra copy.  Instead we
> > could change the hash table to use ordinary heap allocation, and I think
> > this would be more efficient when demangling in worker threads, because
> > we could just reuse the existing allocation.
>
> Yes indeed. I was actually thinking of that last night -- we could
> change to a hash_set<demangled_name_entry> + reuse the alloc from
> gdb_demangle and avoid any allocations here.

https://sourceware.org/ml/gdb-patches/2019-10/msg00085.html

Although now that I re-read this, I'm not sure if I understood you
correctly, did you want to allocate more things with regular
malloc/new?

> > Also, it seems fine to serialize the calls to symbol_set_names.  There's
> > no need for a lock at all, then.
>
> True, though this way, if some threads finish faster than others it's
> possible to parallelize the work a bit more.

Trying this out, it seems to be about 1-1.5s slower (3-5%). However,
this did make me realize that there's no real reason why the mutex
should be global, so I'm going to move it inside this function.

> > One idea I had was to parallelize build_minimal_symbol_hash_tables as
> > well: when possible, have it run two threads, and create the hash tables
> > in parallel.
>
> Hmm, yeah, that's a good idea. I hadn't thought of doing it quite that way.

See measurements above for users/cbiesinger/minsym-hash-one-thread;
it's not nearly as fast as my "Compute msymbol hash codes in parallel"
patch. However I couldn't do it quite as you suggested (as mentioned
in IRC, but writing it down here as well for anyone watching):
- Building the demangled minsym hash table requires having the
demangled names available, so it needs to happen at the very least
after find_demangled_name
- But it can't happen in parallel with symbol_set_names either,
because that may change the pointer (to one from the hashtable)
- So in practice, I can only build the msymbol_hash table on a thread,
and that's the faster one (search_name_hash is slowish)

> >  Adding a third thread here to update the
> > demangled_names_hash might help too?  Maybe this approach would
> > eliminate the need for your "Compute msymbol hash codes in parallel"
> > patch ... the issue there being that it makes the minsyms larger.
> > (Another way to handle that would be to keep the hashes in a local array
> > of some kind that is discarded once the hash tables are updated.)
>
> The local array is a bit tricky... it needs an entry for each msymbol,
> which is only known at runtime. So it can't be stack-allocated with a
> fixed size, and I'm hesitant to use alloca for this unbounded size. So
> it would require a heap allocation for that vector. Maybe it's still
> worth it...

Putting this in a vector works out! It might possibly be a couple of
tenths of a second faster even. Pushed to:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=85f7818c32fdf5b9fbd24f08320c54e9f9d50b4c

Actually makes me wonder if I could precompute the hash code for
demangled_names_hash in a similar way...

Will send an updated version of this patch in a bit.

Christian
  

Patch

diff --git a/gdb/gdbsupport/parallel-for.h b/gdb/gdbsupport/parallel-for.h
index e8bfced475..e5b8d33989 100644
--- a/gdb/gdbsupport/parallel-for.h
+++ b/gdb/gdbsupport/parallel-for.h
@@ -36,13 +36,20 @@  namespace gdb
 
 extern int max_threads;
 
+template <class It>
+void DoNothing(It a, It b)
+{
+}
+
 /* A very simple "parallel for".  This iterates over the elements
    given by the range of iterators, which must be random access
    iterators.  For each element, it calls the callback function.  The
    work may or may not be done by separate threads.  */
 
-template<class RandomIt, class UnaryFunction>
-void parallel_for_each (RandomIt first, RandomIt last, UnaryFunction f)
+template<class RandomIt, class UnaryFunction, class PerThreadFinishFunction>
+void parallel_for_each (RandomIt first, RandomIt last, UnaryFunction f,
+			PerThreadFinishFunction per_thread_finish
+			  = DoNothing<RandomIt>)
 {
   auto body = [&] (RandomIt start, RandomIt end)
     {
@@ -60,6 +67,7 @@  void parallel_for_each (RandomIt first, RandomIt last, UnaryFunction f)
 	    /* Just ignore exceptions.  Each item here must be
 	       independent.  */
 	  }
+      per_thread_finish (start, end);
     };
 
 #if CXX_STD_THREAD
diff --git a/gdb/minsyms.c b/gdb/minsyms.c
index 10e3c8a548..4802b0bea0 100644
--- a/gdb/minsyms.c
+++ b/gdb/minsyms.c
@@ -55,6 +55,20 @@ 
 #include "safe-ctype.h"
 #include "gdbsupport/parallel-for.h"
 
+#if CXX_STD_THREAD
+#include <mutex>
+#endif
+
+#if CXX_STD_THREAD
+/* Mutex that is used when modifying or accessing the demangled hash
+   table.  This is a global mutex simply because the only current
+   multi-threaded user of the hash table does not process multiple
+   objfiles in parallel.  The mutex could easily live on the per-BFD
+   object, but this approach avoids using extra memory when it is not
+   needed.  */
+static std::mutex demangled_mutex;
+#endif
+
 /* See minsyms.h.  */
 
 bool
@@ -1340,17 +1354,29 @@  minimal_symbol_reader::install ()
 
       msymbols = m_objfile->per_bfd->msymbols.get ();
       gdb::parallel_for_each
-	(&msymbols[0], &msymbols[mcount],
-	 [&] (minimal_symbol &msym)
-	 {
-	   if (!msym.name_set)
-	     {
-	       symbol_set_names (&msym, msym.name,
-				 strlen (msym.name), 0,
-				 m_objfile->per_bfd);
-	       msym.name_set = 1;
-	     }
-	 });
+	(&msymbols[0], &msymbols[mcount], [&] (minimal_symbol &msym)
+	  {
+	    if (!msym.name_set)
+	      {
+	        if (msym.language != language_ada)
+		  {
+		    /* This will be freed later, by symbol_set_names.  */
+		    char* demangled_name = symbol_find_demangled_name (&msym,
+								       msym.name);
+		    symbol_set_demangled_name (&msym, demangled_name,
+					       &m_objfile->per_bfd->storage_obstack);
+		    msym.name_set = 1;
+		  }
+	      }
+	  },
+	  [&] (minimal_symbol *first, minimal_symbol* last) {
+	    std::lock_guard<std::mutex> guard (demangled_mutex);
+	    for (; first < last; ++first) {
+	      symbol_set_names (first, first->name,
+		  strlen (first->name), 0,
+		  m_objfile->per_bfd);
+	    }
+	  });
 
       build_minimal_symbol_hash_tables (m_objfile);
     }
diff --git a/gdb/symtab.c b/gdb/symtab.c
index 8adcff7cf2..3600e0b0ff 100644
--- a/gdb/symtab.c
+++ b/gdb/symtab.c
@@ -69,9 +69,6 @@ 
 #include "arch-utils.h"
 #include <algorithm>
 #include "gdbsupport/pathstuff.h"
-#if CXX_STD_THREAD
-#include <mutex>
-#endif
 
 /* Forward declarations for local functions.  */
 
@@ -713,16 +710,6 @@  symbol_set_language (struct general_symbol_info *gsymbol,
 
 /* Functions to initialize a symbol's mangled name.  */
 
-#if CXX_STD_THREAD
-/* Mutex that is used when modifying or accessing the demangled hash
-   table.  This is a global mutex simply because the only current
-   multi-threaded user of the hash table does not process multiple
-   objfiles in parallel.  The mutex could easily live on the per-BFD
-   object, but this approach avoids using extra memory when it is not
-   needed.  */
-static std::mutex demangled_mutex;
-#endif
-
 /* Objects of this type are stored in the demangled name hash table.  */
 struct demangled_name_entry
 {
@@ -772,13 +759,9 @@  create_demangled_names_hash (struct objfile_per_bfd_storage *per_bfd)
      NULL, xcalloc, xfree));
 }
 
-/* Try to determine the demangled name for a symbol, based on the
-   language of that symbol.  If the language is set to language_auto,
-   it will attempt to find any demangling algorithm that works and
-   then set the language appropriately.  The returned name is allocated
-   by the demangler and should be xfree'd.  */
+/* See symtab.h  */
 
-static char *
+char *
 symbol_find_demangled_name (struct general_symbol_info *gsymbol,
 			    const char *mangled)
 {
@@ -867,78 +850,79 @@  symbol_set_names (struct general_symbol_info *gsymbol,
 
   struct demangled_name_entry *found_entry;
 
-  {
-#if CXX_STD_THREAD
-    std::lock_guard<std::mutex> guard (demangled_mutex);
-#endif
-
-    if (per_bfd->demangled_names_hash == NULL)
-      create_demangled_names_hash (per_bfd);
-
-    entry.mangled = linkage_name_copy;
-    slot = ((struct demangled_name_entry **)
-	    htab_find_slot (per_bfd->demangled_names_hash.get (),
-			    &entry, INSERT));
-
-    /* If this name is not in the hash table, add it.  */
-    if (*slot == NULL
-	/* A C version of the symbol may have already snuck into the table.
-	   This happens to, e.g., main.init (__go_init_main).  Cope.  */
-	|| (gsymbol->language == language_go
-	    && (*slot)->demangled[0] == '\0'))
-      {
-	char *demangled_name_ptr
-	  = symbol_find_demangled_name (gsymbol, linkage_name_copy);
-	gdb::unique_xmalloc_ptr<char> demangled_name (demangled_name_ptr);
-	int demangled_len = demangled_name ? strlen (demangled_name.get ()) : 0;
-
-	/* Suppose we have demangled_name==NULL, copy_name==0, and
-	   linkage_name_copy==linkage_name.  In this case, we already have the
-	   mangled name saved, and we don't have a demangled name.  So,
-	   you might think we could save a little space by not recording
-	   this in the hash table at all.
-	 
-	   It turns out that it is actually important to still save such
-	   an entry in the hash table, because storing this name gives
-	   us better bcache hit rates for partial symbols.  */
-	if (!copy_name && linkage_name_copy == linkage_name)
-	  {
-	    *slot
-	      = ((struct demangled_name_entry *)
-		 obstack_alloc (&per_bfd->storage_obstack,
-				offsetof (struct demangled_name_entry, demangled)
-				+ demangled_len + 1));
-	    (*slot)->mangled = linkage_name;
-	  }
-	else
-	  {
-	    char *mangled_ptr;
-
-	    /* If we must copy the mangled name, put it directly after
-	       the demangled name so we can have a single
-	       allocation.  */
-	    *slot
-	      = ((struct demangled_name_entry *)
-		 obstack_alloc (&per_bfd->storage_obstack,
-				offsetof (struct demangled_name_entry, demangled)
-				+ len + demangled_len + 2));
-	    mangled_ptr = &((*slot)->demangled[demangled_len + 1]);
-	    strcpy (mangled_ptr, linkage_name_copy);
-	    (*slot)->mangled = mangled_ptr;
-	  }
-	(*slot)->language = gsymbol->language;
+  if (per_bfd->demangled_names_hash == NULL)
+    create_demangled_names_hash (per_bfd);
+
+  entry.mangled = linkage_name_copy;
+  slot = ((struct demangled_name_entry **)
+          htab_find_slot (per_bfd->demangled_names_hash.get (),
+      		    &entry, INSERT));
+
+  /* If this name is not in the hash table, add it.  */
+  if (*slot == NULL
+      /* A C version of the symbol may have already snuck into the table.
+         This happens to, e.g., main.init (__go_init_main).  Cope.  */
+      || (gsymbol->language == language_go
+          && (*slot)->demangled[0] == '\0'))
+    {
+      /* The const_cast is safe because the only reason it is already initialized
+         is if we purposefully set it from a background thread to avoid doing the
+         work here. However, it is still allocated from the heap and needs to
+         be freed by us, just like if we called symbol_find_demangled_name
+         here.  */
+      char *demangled_name_ptr
+        = gsymbol->language_specific.demangled_name
+        ? const_cast<char*>(gsymbol->language_specific.demangled_name)
+        : symbol_find_demangled_name (gsymbol, linkage_name_copy);
+      gdb::unique_xmalloc_ptr<char> demangled_name (demangled_name_ptr);
+      int demangled_len = demangled_name ? strlen (demangled_name.get ()) : 0;
+
+      /* Suppose we have demangled_name==NULL, copy_name==0, and
+         linkage_name_copy==linkage_name.  In this case, we already have the
+         mangled name saved, and we don't have a demangled name.  So,
+         you might think we could save a little space by not recording
+         this in the hash table at all.
+         
+         It turns out that it is actually important to still save such
+         an entry in the hash table, because storing this name gives
+         us better bcache hit rates for partial symbols.  */
+      if (!copy_name && linkage_name_copy == linkage_name)
+        {
+          *slot
+            = ((struct demangled_name_entry *)
+      	 obstack_alloc (&per_bfd->storage_obstack,
+      			offsetof (struct demangled_name_entry, demangled)
+      			+ demangled_len + 1));
+          (*slot)->mangled = linkage_name;
+        }
+      else
+        {
+          char *mangled_ptr;
+
+          /* If we must copy the mangled name, put it directly after
+             the demangled name so we can have a single
+             allocation.  */
+          *slot
+            = ((struct demangled_name_entry *)
+      	 obstack_alloc (&per_bfd->storage_obstack,
+      			offsetof (struct demangled_name_entry, demangled)
+      			+ len + demangled_len + 2));
+          mangled_ptr = &((*slot)->demangled[demangled_len + 1]);
+          strcpy (mangled_ptr, linkage_name_copy);
+          (*slot)->mangled = mangled_ptr;
+        }
+      (*slot)->language = gsymbol->language;
 
-	if (demangled_name != NULL)
-	  strcpy ((*slot)->demangled, demangled_name.get ());
-	else
-	  (*slot)->demangled[0] = '\0';
-      }
-    else if (gsymbol->language == language_unknown
-	     || gsymbol->language == language_auto)
-      gsymbol->language = (*slot)->language;
+      if (demangled_name != NULL)
+        strcpy ((*slot)->demangled, demangled_name.get ());
+      else
+        (*slot)->demangled[0] = '\0';
+    }
+  else if (gsymbol->language == language_unknown
+           || gsymbol->language == language_auto)
+    gsymbol->language = (*slot)->language;
 
-    found_entry = *slot;
-  }
+  found_entry = *slot;
 
   gsymbol->name = found_entry->mangled;
   if (found_entry->demangled[0] != '\0')
diff --git a/gdb/symtab.h b/gdb/symtab.h
index c3918a85af..17903df92d 100644
--- a/gdb/symtab.h
+++ b/gdb/symtab.h
@@ -483,6 +483,16 @@  extern void symbol_set_language (struct general_symbol_info *symbol,
                                  enum language language,
 				 struct obstack *obstack);
 
+
+/* Try to determine the demangled name for a symbol, based on the
+   language of that symbol.  If the language is set to language_auto,
+   it will attempt to find any demangling algorithm that works and
+   then set the language appropriately.  The returned name is allocated
+   by the demangler and should be xfree'd.  */
+
+extern char *symbol_find_demangled_name (struct general_symbol_info *gsymbol,
+					 const char *mangled);
+
 /* Set just the linkage name of a symbol; do not try to demangle
    it.  Used for constructs which do not have a mangled name,
    e.g. struct tags.  Unlike SYMBOL_SET_NAMES, linkage_name must