Optimizing hash table lookup in symbol binding

Message ID: 87lfsd477i.fsf@oldenburg2.str.redhat.com
State: Not applicable

Commit Message

Florian Weimer Nov. 18, 2019, 1:58 p.m. UTC
  I investigated optimizing the hash table lookup used for symbol
binding.  This is about this code in elf/dl-lookup.c:

      /* The tables for this map.  */
      const ElfW(Sym) *symtab = (const void *) D_PTR (map, l_info[DT_SYMTAB]);
      const char *strtab = (const void *) D_PTR (map, l_info[DT_STRTAB]);

      const ElfW(Sym) *sym;
      const ElfW(Addr) *bitmask = map->l_gnu_bitmask;
      if (__glibc_likely (bitmask != NULL))
	{
	  ElfW(Addr) bitmask_word
	    = bitmask[(new_hash / __ELF_NATIVE_CLASS)
		      & map->l_gnu_bitmask_idxbits];

	  unsigned int hashbit1 = new_hash & (__ELF_NATIVE_CLASS - 1);
	  unsigned int hashbit2 = ((new_hash >> map->l_gnu_shift)
				   & (__ELF_NATIVE_CLASS - 1));

	  if (__glibc_unlikely ((bitmask_word >> hashbit1)
				& (bitmask_word >> hashbit2) & 1))
	    {
	      Elf32_Word bucket = map->l_gnu_buckets[new_hash
						     % map->l_nbuckets];
	      if (bucket != 0)
		{
		  const Elf32_Word *hasharr = &map->l_gnu_chain_zero[bucket];

		  do
		    if (((*hasharr ^ new_hash) >> 1) == 0)
		      {
			symidx = ELF_MACHINE_HASH_SYMIDX (map, hasharr);
			sym = check_match (undef_name, ref, version, flags,
					   type_class, &symtab[symidx], symidx,
					   strtab, map, &versioned_sym,
					   &num_versions);
			if (sym != NULL)
			  goto found_it;
		      }
		  while ((*hasharr++ & 1u) == 0);
		}
	    }
	  /* No symbol found.  */
	  symidx = SHN_UNDEF;
	}
      else

My primary interest was the % operator because it turns out that it
actually shows up in some profiles stressing symbol binding during
program lookup.  In most cases, however, the search for the right mapping
dominates and the preceding bitmask check fails most of the time.  But
with shallow library dependencies, we can end up in a situation where
the division actually matters.

Strategies for optimizing integer division are discussed in Hacker's
Delight and here:

<http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html>
<http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html>

(I have written to the author to get some of the math fixed in minor
ways, but I think the general direction is solid.)

The algorithm from the first episode looks like this:
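
The patch itself, which adds include/divopt.h with precompute_divopt_32
and divopt_32, appears in the "Patch" section further down.  As an
illustrative sketch that is not part of the patch: the core trick is to
replace the division by a multiplication with a precomputed "magic"
constant followed by a shift.  For a fixed divisor such as 10, using the
constant ceil (2**35 / 10) = 0xCCCCCCCD, it boils down to this:

#include <assert.h>
#include <stdint.h>

/* floor (n / 10) without a division instruction: multiply by
   ceil (2**35 / 10) and keep the high-order bits.  */
static inline uint32_t
div10 (uint32_t n)
{
  return (uint32_t) (((uint64_t) n * 0xCCCCCCCDu) >> 35);
}

int
main (void)
{
  /* Spot-check against the plain division operator.  */
  for (uint64_t n = 0; n <= UINT32_MAX; n += 9973)
    assert (div10 ((uint32_t) n) == (uint32_t) n / 10);
  return 0;
}

Compilers already do this for divisions by a constant; the hard part,
which divopt.h has to handle, is computing such a multiplier and shift
at run time for an arbitrary l_nbuckets.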



Benchmarking results are mixed.  As I said, for deep DSO dependency
chains, the bitmask check clearly dominates.  I tried to benchmark this
directly using dlsym, with the dladdr avoidance patch here:

  dlsym: Do not determine caller link map if not needed
  <https://gnutoolchain-gerrit.osci.io/r/c/glibc/+/528>

On a second-generation AMD EPYC, I didn't see a difference at all for
some reason.  On Cascade Lake, I see a moderate improvement for the
dlsym test, but I don't know how realistic this microbenchmark is.  Both
patches had performance that was on par.

I also tried to remove the bitmask check altogether, but it was not an
improvement.  I suspect the bitmasks are much smaller, so consulting
them avoids the cache misses in the full table lookup.

If any of the architecture maintainers think this is worth doing, we can
still incorporate it.

Thanks,
Florian
  

Comments

Szabolcs Nagy Nov. 18, 2019, 5:09 p.m. UTC | #1
On 18/11/2019 13:58, Florian Weimer wrote:
> My primary interest was the % operator because it turns out that it
> actually shows up in some profiles stressing symbol binding during
> program lookup.  In most cases, however, the search for the right mapping
> dominates and the preceding bitmask check fails most of the time.  But
> with shallow library dependencies, we can end up in a situation where
> the division actually matters.
> 
> Strategies for optimizing integer division are discussed in Hacker's
> Delight and here:
> 
> <http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html>
> <http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html>
> 
> (I have written to the author to get some of the math fixed in minor
> ways, but I think the general direction is solid.)
> 
> The algorithm from the first episode looks like this:

note that _itoa.c already uses something like this.

(i think the itoa code is unnecessary: there is no reason
for optimizing anything but base 10 and 16, and the compiler
can do a better job at those than trying something at runtime,
but you may look at that implementation if it's any better).
  
Florian Weimer Nov. 18, 2019, 5:34 p.m. UTC | #2
* Szabolcs Nagy:

> On 18/11/2019 13:58, Florian Weimer wrote:
>> My primary interest was the % operator because it turns out that it
>> actually shows up in some profiles stressing symbol binding during
>> program lookup.  In most cases, however, the search for the right mapping
>> dominates and the preceding bitmask check fails most of the time.  But
>> with shallow library dependencies, we can end up in a situation where
>> the division actually matters.
>> 
>> Strategies for optimizing integer division are discussed in Hacker's
>> Delight and here:
>> 
>> <http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html>
>> <http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html>
>> 
>> (I have written to the author to get some of the math fixed in minor
>> ways, but I think the general direction is solid.)
>> 
>> The algorithm from the first episode looks like this:
>
> note that _itoa.c already uses something like this.
>
> (i think the itoa code is unnecessary: there is no reason
> for optimizing anything but base 10 and 16, and the compiler
> can do a better job at those than trying something at runtime,
> but you may look at that implementation if it's any better).

I don't think so.  It doesn't show how to generate magic values for
arbitrary divisors, which is the tricky part anyway.  It uses both the
Episode 1 and Episode 3 algorithms.  The uint64_t case for 32-bit
architectures is very complicated because it avoids the __umoddi3 call.
This is *not* something that current GCC would do on its own.  GCC does
not even fold the division and modulo into one libcall.

For base 10, it could still result in better code to take a few
__udivdi3 calls to reduce the value into 9-digit pieces, and then use
the 32-bit code to process that efficiently.  I'm not sure if we need
anything else for glibc (only base 8, 10, 16, maybe base 2 somewhere).
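
A rough sketch of that reduction (an illustration, not the existing
_itoa.c code) could look like this: split the 64-bit value into
base-10**9 pieces with a few 64-bit divisions, compute each remainder by
multiply-and-subtract so that only __udivdi3 is needed, and then format
each 32-bit piece using 32-bit arithmetic only:

#include <stdint.h>
#include <stdio.h>

/* Format VALUE in base 10 into the buffer ending at BUFEND and return
   a pointer to the first digit.  */
static char *
format_uint64 (char *bufend, uint64_t value)
{
  uint32_t pieces[3];
  int npieces = 0;
  do
    {
      uint64_t q = value / 1000000000u;   /* __udivdi3 on 32-bit targets.  */
      pieces[npieces++] = (uint32_t) (value - q * 1000000000u);
      value = q;
    }
  while (value != 0);

  char *p = bufend;
  for (int i = 0; i < npieces; ++i)
    {
      uint32_t piece = pieces[i];
      /* All pieces except the most significant one are zero-padded to
         nine digits.  These divisions are 32-bit only, so the compiler
         turns them into multiply-and-shift sequences.  */
      int mindigits = i == npieces - 1 ? 1 : 9;
      for (int j = 0; j < mindigits || piece != 0; ++j)
        {
          *--p = '0' + piece % 10;
          piece /= 10;
        }
    }
  return p;
}

int
main (void)
{
  char buf[21];
  buf[20] = '\0';
  puts (format_uint64 (buf + 20, UINT64_MAX));
  return 0;
}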

People have also posted vectorized implementations of this operation,
including base 10.

But that's not really related to the hash table optimization. 8-)

Thanks,
Florian
  
Carlos O'Donell Nov. 21, 2019, 1:55 a.m. UTC | #3
On 11/18/19 8:58 AM, Florian Weimer wrote:
> On a second-generation AMD EPYC, I didn't see a difference at all for
> some reason.  On Cascade Lake, I see a moderate improvement for the
> dlsym test, but I don't know how realistic this microbenchmark is.  Both
> patches had performance that was on par.
> 
> I also tried to remove the bitmask check altogether, but it was not an
> improvement.  I suspect the bitmasks are much smaller, so consulting
> them avoids the cache misses in the full table lookup.
> 
> If any of the architecture maintainers think this is worth doing, we can
> still incorporate it.

At this point you are probably straying to the realm of special purpose
proprietary analysis programs provided by hardware vendors to get the
most performance out of a given program (stalls, cache misses, interlocks
etc.).

Have you used perf at all to look into aspects beyond just performance?
That is to say, confirming the cache misses saved vs. the full table lookup?

What about PGO for this case? In cases like this I often wonder whether
feeding "normative" workloads into a PGO build could get us better
results from generated code like this, but it's an entire project
just to do this.

To accept this code I'd want a microbenchmark added, and then we'd
commit the code, and ask the machine maintainers to review that nothing
got terribly worse or was out of kilter with the microbenchmark numbers.

Thoughts?
  
Adhemerval Zanella Nov. 21, 2019, 1:24 p.m. UTC | #4
On 20/11/2019 22:55, Carlos O'Donell wrote:
> On 11/18/19 8:58 AM, Florian Weimer wrote:
>> On a second-generation AMD EPYC, I didn't see a difference at all for
>> some reason.  On Cascade Lake, I see a moderate improvement for the
>> dlsym test, but I don't know how realistic this microbenchmark is.  Both
>> patches had performance that was on par.
>>
>> I also tried to remove the bitmask check altogether, but it was not an
>> improvement.  I suspect the bitmasks are much smaller, so consulting
>> them avoids the cache misses in the full table lookup.
>>
>> If any of the architecture maintainers think this is worth doing, we can
>> still incorporate it.
> 
> At this point you are probably straying to the realm of special purpose
> proprietary analysis programs provided by hardware vendors to get the
> most performance out of a given program (stalls, cache misses, interlocks
> etc.).
> 
> Have you used perf at all to look into aspects beyond just performance?
> That is to say, confirming the cache misses saved vs. the full table lookup?
> 
> What about PGO for this case? In cases like this I often wonder whether
> feeding "normative" workloads into a PGO build could get us better
> results from generated code like this, but it's an entire project
> just to do this.
> 
> To accept this code I'd want a microbenchmark added, and then we'd
> commit the code, and ask the machine maintainers to review that nothing
> got terribly worse or was out of kilter with the microbenchmark numbers.
> 
> Thoughts?
> 

I see that such changes are hard to measure across different architectures,
and it gets even trickier with different implementations within each
architecture.

This change seems a straightforward optimization for architectures that
issue a libcall for the modulo operation, such as arm, alpha, and hppa.

However, it seems to be a performance regression on other architectures
where the 32x32->64 multiplication *also* requires a libcall, such as csky,
microblaze, nios2, and sparcv8.

And it becomes even less obvious for architectures that, depending on the
compiler flags, might generate better code, for instance armv7-a, where the
modulus is done with udiv instead of a libcall.  And there is even riscv64,
which provides a specific instruction for that, remuw (although I couldn't
find the expected latency/throughput for it).

Also, some architectures might show different latency across chip versions.
For instance, the AArch64 Cortex-A57 has a udiv latency of 4-20 and an
execution throughput of 1/20 - 1/4, whereas the Cortex-A55 has a latency of
3-12 and a throughput of 1/12 - 1/3.  It might be the case that this trick
shows better performance on some chips while on newer ones it would be
worse than a simple modulo operation.

So it would require a lot of testing and deliberation for each arch
maintainer to check whether this optimization is worthwhile, it would add
another build permutation we need to maintain, and it would need to be
rechecked on every new release to see whether implementations handle the
modulo operation better.

Instead I would try to focus on using a better hash table scheme, as Wilco
has suggested.  To give a crude estimate of the possible gain, on the simple
benchmark below the power-of-two rehash policy with mask range hashing
showed better performance on multiple architectures, ranging from 17% to
50%.  Even for more embedded-oriented architectures, such as sh4, I see
that avoiding a modulo-based hashing scheme gave about 30% better results
on insertion.

And I fully agree that we should at least have some synthetic benchmark to
evaluate it.

--

#include <iostream>
#include <unordered_map>
#include <chrono>
#include <cstdint>
#include <utility>

// Power-of-two bucket counts with mask range hashing:
// bucket = hash & (nbuckets - 1).
using hash_power2_t = std::_Hashtable<
  uint32_t, std::pair<const uint32_t, uint32_t>,
  std::allocator<std::pair<const uint32_t, uint32_t>>,
  std::__detail::_Select1st,
  std::equal_to<uint32_t>,
  std::hash<uint32_t>,
  std::__detail::_Mask_range_hashing,
  std::__detail::_Default_ranged_hash,
  std::__detail::_Power2_rehash_policy,
  std::__umap_traits<true>>;

// Prime bucket counts with modulo range hashing (the std::unordered_map
// default): bucket = hash % nbuckets.
using hash_mod_t = std::_Hashtable<
  uint32_t, std::pair<const uint32_t, uint32_t>,
  std::allocator<std::pair<const uint32_t, uint32_t>>,
  std::__detail::_Select1st,
  std::equal_to<uint32_t>,
  std::hash<uint32_t>,
  std::__detail::_Mod_range_hashing,
  std::__detail::_Default_ranged_hash,
  std::__detail::_Prime_rehash_policy,
  std::__umap_traits<true>>;

int main ()
{
  const int sz = 30000000;

  hash_mod_t hash_mod;
  hash_mod.reserve (sz);

  hash_power2_t hash_power2;
  hash_power2.reserve (sz);

  double mod_time, power2_time;
  {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < sz; i++)
    hash_mod.emplace (std::make_pair (i, i));
  auto end = std::chrono::steady_clock::now();
  auto diff = end - start;
  mod_time = std::chrono::duration <double, std::nano> (diff).count();
  std::cout << "mod hash   : " << mod_time << " ns" << std::endl;
  }

  {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < sz; i++)
    hash_power2.emplace (std::make_pair (i, i));
  auto end = std::chrono::steady_clock::now();
  auto diff = end - start;
  power2_time = std::chrono::duration <double, std::nano> (diff).count();
  std::cout << "power2 hash: " << power2_time << " ns" << std::endl;
  }

  std::cout << "speedup    : " << 1 / (power2_time / mod_time) << std::endl;
}
  

Patch

diff --git a/elf/dl-lookup.c b/elf/dl-lookup.c
index fd44cd4101..205d043717 100644
--- a/elf/dl-lookup.c
+++ b/elf/dl-lookup.c
@@ -28,6 +28,7 @@ 
 #include <libc-lock.h>
 #include <tls.h>
 #include <atomic.h>
+#include <divopt.h>
 
 #include <assert.h>
 
@@ -394,8 +395,18 @@  do_lookup_x (const char *undef_name, uint_fast32_t new_hash,
 	  if (__glibc_unlikely ((bitmask_word >> hashbit1)
 				& (bitmask_word >> hashbit2) & 1))
 	    {
-	      Elf32_Word bucket = map->l_gnu_buckets[new_hash
-						     % map->l_nbuckets];
+	      Elf32_Word bucket;
+	      if (map->l_nbuckets > 1)
+		{
+		  uint32_t quotient
+		    = divopt_32 (new_hash, map->l_nbuckets_multiplier,
+				 map->l_nbuckets_multiplier_shift);
+		  uint32_t remainder = new_hash - map->l_nbuckets * quotient;
+		  bucket = map->l_gnu_buckets[remainder];
+		}
+	      else
+		bucket = map->l_gnu_buckets[0];
+
 	      if (bucket != 0)
 		{
 		  const Elf32_Word *hasharr = &map->l_gnu_chain_zero[bucket];
@@ -931,6 +942,11 @@  _dl_setup_hash (struct link_map *map)
       /* Initialize MIPS xhash translation table.  */
       ELF_MACHINE_XHASH_SETUP (hash32, symbias, map);
 
+      if (map->l_nbuckets >= 2)
+	map->l_nbuckets_multiplier_shift
+	  = precompute_divopt_32 (map->l_nbuckets,
+				  &map->l_nbuckets_multiplier);
+
       return;
     }
 
diff --git a/include/divopt.h b/include/divopt.h
new file mode 100644
index 0000000000..85eece8fd9
--- /dev/null
+++ b/include/divopt.h
@@ -0,0 +1,78 @@ 
+/* Optimization of repeated integer division.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <stdint.h>
+#include <sys/param.h>
+
+/* Precompute *MULTIPLIER for dividing by DIVISOR, which must be two
+   or larger, and return the shift count (non-negative and less than
+   32), for use with divopt_32 below.  */
+static int __attribute__ ((used))
+precompute_divopt_32 (uint32_t divisor, uint32_t *multiplier)
+{
+  if (divisor == 1)
+    {
+      *multiplier = 1;
+      return 0;
+    }
+
+  int log2 = 32 - __builtin_clz (divisor);
+
+  /* Handle powers-of-two first, so that we do not need to deal with
+     the clz corner cases below.  */
+  if (powerof2 (divisor))
+    {
+      *multiplier = 1;
+      return log2 - 2;
+    }
+
+  if (log2 != 32)
+    {
+      /* Compute ceil (2**(32 + log2) / divisor).  The
+         most-significant bit is always set and is discarded.  */
+      *multiplier = (((uint64_t) 1 << (32 + log2)) + divisor) / divisor;
+      return log2 - 1;
+    }
+  else
+    {
+      /* Perform a long division of 2**64 + (divisor - 1) by the
+         divisor, encoded in base-2**32, using a 64-by-32 division.
+         Start out with the first two digits, which are (1, 0).  2**32
+         divided by the divisor is 1 because the divisor is larger
+         than 2**31.  This set bit is discarded.  */
+      uint64_t remainder = -divisor;
+
+      /* Combine the remainder of the first division with the third
+         and final base 2**32 digit.  */
+      *multiplier = ((remainder << 32) | (divisor - 1)) / divisor;
+      return 31;
+    }
+}
+
+/* Return the quotient of DIVIDEND divided by the divisor that was
+   used to compute MULTIPLIER and SHIFT via precompute_divopt_32.  */
+static inline uint32_t
+divopt_32 (uint32_t dividend, uint32_t multiplier, int shift)
+{
+  /* Approximation to the quotient.  */
+  uint32_t quotient = ((uint64_t) dividend * multiplier) >> 32;
+  /* Compute (dividend + quotient) / 2 without overflow.  */
+  uint32_t temp = ((dividend - quotient) >> 1) + quotient;
+  /* The result is in the higher-order bits.  */
+  return temp >> shift;
+}
diff --git a/include/link.h b/include/link.h
index 1184201f91..b09aa81bb4 100644
--- a/include/link.h
+++ b/include/link.h
@@ -153,6 +153,8 @@  struct link_map
 
     /* Symbol hash table.  */
     Elf_Symndx l_nbuckets;
+    uint32_t l_nbuckets_multiplier;
+    int l_nbuckets_multiplier_shift;
     Elf32_Word l_gnu_bitmask_idxbits;
     Elf32_Word l_gnu_shift;
     const ElfW(Addr) *l_gnu_bitmask;

It only requires a 32x32→64 multiplier, which is an advantage.  However,
the dependency chain in divopt_32 is fairly long, so performance is less
than optimal.
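
For reference, a standalone check of the two helpers against the plain
% operator might look like this (an illustration only, assuming the new
divopt.h is on the include path; it is not part of the patch):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#include <divopt.h>  /* precompute_divopt_32, divopt_32 from the diff above.  */

int
main (void)
{
  /* Spot-check a few divisors, including powers of two and values
     close to 2**31 and 2**32.  */
  static const uint32_t divisors[] =
    { 2, 3, 7, 10, 1023, 1024, 1025, 0x7fffffffu, 0x80000000u, 0xfffffffeu };
  for (unsigned int i = 0; i < sizeof divisors / sizeof divisors[0]; ++i)
    {
      uint32_t multiplier;
      int shift = precompute_divopt_32 (divisors[i], &multiplier);
      for (uint64_t n = 0; n <= UINT32_MAX; n += 65537)
        assert (divopt_32 ((uint32_t) n, multiplier, shift)
                == (uint32_t) n / divisors[i]);
    }
  puts ("ok");
  return 0;
}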

The Episode 3 approach as described is not general, so it would work
well only if we told binutils ld to avoid bucket counts that are
problematic for it.  (In both cases, we need run-time checks to avoid
the unsupported cases, but if we could work together with binutils, this
data-dependent branch should be easy to predict in practice.)

There is a variant of the Episode 3 approach which uses a 64x64→128
multiplication, for a much shorter dependency chain.  For convenience, I
used the __int128 support from libgcc below, but that could probably be
avoided on 64-bit architectures with suitable multipliers:

diff --git a/elf/dl-lookup.c b/elf/dl-lookup.c
index fd44cd4101..4d2f3b91f0 100644
--- a/elf/dl-lookup.c
+++ b/elf/dl-lookup.c
@@ -394,8 +395,17 @@  do_lookup_x (const char *undef_name, uint_fast32_t new_hash,
 	  if (__glibc_unlikely ((bitmask_word >> hashbit1)
 				& (bitmask_word >> hashbit2) & 1))
 	    {
-	      Elf32_Word bucket = map->l_gnu_buckets[new_hash
-						     % map->l_nbuckets];
+	      Elf32_Word bucket;
+	      if (powerof2 (map->l_nbuckets))
+		bucket = map->l_gnu_buckets[new_hash & (map->l_nbuckets - 1)];
+	      else
+		{
+		  uint32_t quotient
+		    = ((unsigned __int128) map->l_nbuckets_multiplier * ((uint64_t) new_hash + 1)) >> 64;
+		  uint32_t remainder = new_hash - map->l_nbuckets * quotient;
+		  bucket = map->l_gnu_buckets[remainder];
+		}
+
 	      if (bucket != 0)
 		{
 		  const Elf32_Word *hasharr = &map->l_gnu_chain_zero[bucket];
@@ -931,6 +941,10 @@  _dl_setup_hash (struct link_map *map)
       /* Initialize MIPS xhash translation table.  */
       ELF_MACHINE_XHASH_SETUP (hash32, symbias, map);
 
+      if (powerof2 (map->l_nbuckets))
+	map->l_nbuckets_multiplier = __builtin_ctz (map->l_nbuckets);
+      else
+	map->l_nbuckets_multiplier = ((unsigned __int128) 1 << 64) / map->l_nbuckets;
       return;
     }
 
diff --git a/include/link.h b/include/link.h
index 1184201f91..eec4c9ef6e 100644
--- a/include/link.h
+++ b/include/link.h
@@ -153,6 +153,7 @@  struct link_map
 
     /* Symbol hash table.  */
     Elf_Symndx l_nbuckets;
+    uint64_t l_nbuckets_multiplier;
     Elf32_Word l_gnu_bitmask_idxbits;
     Elf32_Word l_gnu_shift;
     const ElfW(Addr) *l_gnu_bitmask;