Optimizing hash table lookup in symbol binding

  I investigated optimizing the hash table lookup used for symbol
binding.  This is about this code in elf/dl-lookup.c:

      /* The tables for this map.  */
      const ElfW(Sym) *symtab = (const void *) D_PTR (map, l_info[DT_SYMTAB]);
      const char *strtab = (const void *) D_PTR (map, l_info[DT_STRTAB]);

      const ElfW(Sym) *sym;
      const ElfW(Addr) *bitmask = map->l_gnu_bitmask;
      if (__glibc_likely (bitmask != NULL))
	{
	  ElfW(Addr) bitmask_word
	    = bitmask[(new_hash / __ELF_NATIVE_CLASS)
		      & map->l_gnu_bitmask_idxbits];

	  unsigned int hashbit1 = new_hash & (__ELF_NATIVE_CLASS - 1);
	  unsigned int hashbit2 = ((new_hash >> map->l_gnu_shift)
				   & (__ELF_NATIVE_CLASS - 1));

	  if (__glibc_unlikely ((bitmask_word >> hashbit1)
				& (bitmask_word >> hashbit2) & 1))
	    {
	      Elf32_Word bucket = map->l_gnu_buckets[new_hash
						     % map->l_nbuckets];
	      if (bucket != 0)
		{
		  const Elf32_Word *hasharr = &map->l_gnu_chain_zero[bucket];

		  do
		    if (((*hasharr ^ new_hash) >> 1) == 0)
		      {
			symidx = ELF_MACHINE_HASH_SYMIDX (map, hasharr);
			sym = check_match (undef_name, ref, version, flags,
					   type_class, &symtab[symidx], symidx,
					   strtab, map, &versioned_sym,
					   &num_versions);
			if (sym != NULL)
			  goto found_it;
		      }
		  while ((*hasharr++ & 1u) == 0);
		}
	    }
	  /* No symbol found.  */
	  symidx = SHN_UNDEF;
	}
      else

My primary interest was the % operator because it turns out that it
actually shows up in some profiles stressing symbol binding during
program lookup.  In most cases, however the search for the right mapping
dominates and the preceding bitmask check fails most of the time.  But
with shallow library dependencies, we can end up in a situation where
the division actually matters.

Strategies for optimizing integer division are discussed in Hacker's
Delight and here:

<http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html>
<http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html>

(I have written to the author to get some of the math fixed in minor
ways, but I think the general direction is solid.)

The algorithm from the first episode looks like this:

Benchmarking results are mixed.  As I said, for deep DSO dependency
chains, the bitmask check clearly dominates.  I tried to benchmark this
directly using dlsym, with the dladdr avoidance patch here:

  dlsym: Do not determine caller link map if not needed
  <https://gnutoolchain-gerrit.osci.io/r/c/glibc/+/528>

On a second-generation AMD EPYC, I didn't see a difference at all for
some reason.  On Cascade Lake, I see a moderate improvement for the
dlsym test, but I don't know how realistic this microbenchmark is.  Both
patches had performance that was on par.

I also tried to remove the bitmask check altogether, but it was not an
improvement.  I suspect the bitmasks are much smaller, so consulting
them avoids the cache misses in the full table lookup.

If any of the architecture maintainers think this is worth doing, we can
still incorporate it.

Thanks,
Florian

Optimizing hash table lookup in symbol binding

Commit Message

Comments

Patch