[v3,6/6] elf: Optimize _dl_new_hash in dl-new-hash.h

Message ID 20220425163601.3670626-6-goldstein.w.n@gmail.com
State Superseded
Series [v3,1/6] elf: Refactor dl_new_hash so it can be tested / benchmarked

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Noah Goldstein April 25, 2022, 4:36 p.m. UTC
  Unroll slightly so some of the multiplies can be pipelined on
out-of-order machines. Unrolling further started to induce slowdowns
for sizes [0, 4] but can help the loop, so if larger sizes are the
target, further unrolling can be beneficial.
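
For reference, the change condensed from the diff at the end of this
page (original loop body vs. the two-at-a-time body):

  /* Before: one character per iteration; each step is a dependent
     multiply-add.  */
  h = h * 33 + c;

  /* After: two characters per iteration; h * (33 * 33) and 33 * c0 are
     independent, so out-of-order cores can overlap the multiplies.  */
  h = 33 * 33 * h + 33 * c0 + c1;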

Results for _dl_new_hash
Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Time as Geometric Mean of N=25 runs
Geometric of all benchmark New / Old: 0.791
  type, length, New Time, Old Time, New Time / Old Time
 fixed,      0,    0.641,    0.658,               0.974
 fixed,      1,    1.888,    1.883,               1.003
 fixed,      2,    2.712,    2.833,               0.957
 fixed,      3,    3.314,    3.739,               0.886
 fixed,      4,    4.316,    4.866,               0.887
 fixed,      5,     5.16,    5.966,               0.865
 fixed,      6,    5.986,    7.241,               0.827
 fixed,      7,    7.264,    8.435,               0.861
 fixed,      8,    8.052,    9.846,               0.818
 fixed,      9,    9.369,   11.316,               0.828
 fixed,     10,   10.256,   12.925,               0.794
 fixed,     11,   12.191,   14.546,               0.838
 fixed,     12,   12.667,    15.92,               0.796
 fixed,     13,   14.442,   17.465,               0.827
 fixed,     14,   14.808,   18.981,                0.78
 fixed,     15,   16.244,   20.565,                0.79
 fixed,     16,   17.166,   22.044,               0.779
 fixed,     32,   35.447,   50.558,               0.701
 fixed,     64,   86.479,  134.529,               0.643
 fixed,    128,  155.453,  287.527,               0.541
 fixed,    256,   302.57,   593.64,                0.51
random,      2,   11.168,    10.61,               1.053
random,      4,   13.308,    13.53,               0.984
random,      8,   16.579,   19.437,               0.853
random,     16,   21.292,   24.776,               0.859
random,     32,    30.56,   35.906,               0.851
random,     64,   49.249,   68.577,               0.718
random,    128,   81.845,  140.664,               0.582
random,    256,  152.517,  292.204,               0.522
---
 elf/dl-new-hash.h | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)
  

Comments

Florian Weimer April 27, 2022, 10:43 a.m. UTC | #1
* Noah Goldstein via Libc-alpha:

> diff --git a/elf/dl-new-hash.h b/elf/dl-new-hash.h
> index 52eef4e417..b0026706bd 100644
> --- a/elf/dl-new-hash.h
> +++ b/elf/dl-new-hash.h
> @@ -20,14 +20,33 @@
>  #define _DL_NEW_HASH_H 1
>  
>  #include <stdint.h>
>  /* For __glibc_unlikely.  */
>  #include <sys/cdefs.h>
>  
>  static uint32_t
> +__attribute__ ((unused))
>  _dl_new_hash (const char *s)
>  {

Does this change belong here, in this commit?

Thanks,
Florian
  
Alexander Monakov April 27, 2022, 3:02 p.m. UTC | #2
On Mon, 25 Apr 2022, Noah Goldstein via Libc-alpha wrote:

> Unroll slightly so some of the multiplies can be pipelined on
> out-of-order machines. Unrolling further started to induce slowdowns
> for sizes [0, 4] but can help the loop, so if larger sizes are the
> target, further unrolling can be beneficial.

Note, the original algorithm does not need a literal multiplication (with
3cyc latency), as h*33 == (h << 5) + h:

> -  for (unsigned char c = *s; c != '\0'; c = *++s)
> -    h = h * 33 + c;

in musl we spell this out as 'h += h*32 + c' to avoid GCC emitting a
multiplication at -Os.

In 'h = (h + c) + (h << 5)' the critical path has a latency of only
2 cycles, and (h + c) is computed independently of (h << 5).

(if the original loop is implemented with a multiplication, its
critical path has 4-cycle latency, 3cyc from the multiplication plus 1
from the addition)
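
Spelled out as a loop, the 2-cycle form is (a sketch):

  uint32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    /* (h + c) and (h << 5) both depend only on the previous h and can
       issue in parallel; the final add closes a 2-cycle carried chain.  */
    h = (h + c) + (h << 5);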

However, when you unroll the loop and overlap two iterations,
multiplication by 33*33 no longer has this nice property and runs with
two 4cyc paths overlapped (so the effective critical path is the same
as in the original).
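
(Concretely, 33 * 33 = 1089 = 2^10 + 2^6 + 1, which is not of the form
(1 << k) + 1, so the folded multiply no longer reduces to a single
shift-and-add.)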

Alexander
  
Noah Goldstein April 27, 2022, 4:23 p.m. UTC | #3
On Wed, Apr 27, 2022 at 11:17 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Wed, 27 Apr 2022, Noah Goldstein wrote:
>
> > > However, when you unroll the loop and overlap two iterations,
> > > multiplication by 33*33 no longer has this nice property and runs
> > > with two 4cyc paths overlapped (so the effective critical path is
> > > the same as in the original).
> >
> > the 33 * c0 can still use `addl; sall; addl` so I'm not sure what
> > you mean by two 4cyc paths overlapped. It's one 4c path.
> >
> > `imul; addl` and `addl; sall; addl`.
> >
> > But it's fair that either way it's 4c of computation for 2
> > iterations. The difference is the 5c load latency being amortized
> > over 2 iterations or 1 iteration.
>
> Right, it's one 4c path, I was thinking about something else for a moment.
> I'm not sure it's correct to amortize load latency like that, I'd say the
> difference is just that the original loop cannot issue two loads at once
> because of the dependency in its address computation.
>
> I see you dropped libc-alpha@ from Cc:, was that intentional?
No, misclick, sorry. Adding it back.

I think it is the way you're doing your analysis, as a loop-carried
dependency. I.e. really 7c per iteration with no unroll (although it's
fair the loads on the address can speculate ahead, so it will indeed
be faster) vs 9c per 2x iterations.
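
Rough accounting under that view (a sketch, assuming the ~5c
load-to-use latency mentioned above):

  original:  5c load + 2c shift/add chain = 7c per 1 character
  unrolled:  5c load + 4c mul/add chain   = 9c per 2 characters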

>
> Alexander
  
Noah Goldstein April 27, 2022, 4:25 p.m. UTC | #4
On Wed, Apr 27, 2022 at 5:43 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein via Libc-alpha:
>
> > diff --git a/elf/dl-new-hash.h b/elf/dl-new-hash.h
> > index 52eef4e417..b0026706bd 100644
> > --- a/elf/dl-new-hash.h
> > +++ b/elf/dl-new-hash.h
> > @@ -20,14 +20,33 @@
> >  #define _DL_NEW_HASH_H 1
> >
> >  #include <stdint.h>
> >  /* For __glibc_unlikely.  */
> >  #include <sys/cdefs.h>
> >
> >  static uint32_t
> > +__attribute__ ((unused))
> >  _dl_new_hash (const char *s)
> >  {
>
> Does this change belong here, in this commit?

Moved it to the refactor commit in V4.
>
> Thanks,
> Florian
>
  
Alexander Monakov April 28, 2022, 6:03 p.m. UTC | #5
On Wed, 27 Apr 2022, Noah Goldstein via Libc-alpha wrote:

> I think it is the way you're doing your analysis as a loop-carried
> dependency. I.e really 7c per iteration with no unroll (although
> its fair the loads on address can speculate ahead so it will
> indeed be faster) vs 9c per 2x iterations.

Hm? Right, the CPU will issue loads speculatively, so you shouldn't
count load latency as part of the critical path.

I don't understand how you get a 2x improvement on long strings; did
you run the benchmark with rdtscp timing, i.e. with

    make USE_RDTSCP=1 bench

?

Alexander
  
Alexander Monakov May 4, 2022, 6:04 p.m. UTC | #6
Hi,

I managed to run the new benchtest; I understand how you get a 2x speedup
on long inputs now.

Tigerlake belongs to Intel CPU generations that don't perform move
elimination for general registers (along with Icelake and Sandybridge).
So register copy in preparation to shift 'h' by 5 costs an extra cycle
on the critical path.

Furthermore, h*33 + *c gets evaluated as '((h + h*32) + *c)' instead of
'((h + *c) + h*32)', which prevents interleaving the additions and thus
puts one extra add on the critical path.

(gcc-11 gets this right if you assist it by writing 'h = h + *c + h*32')
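
As a loop, that assisted spelling is (a sketch):

  for (unsigned char c = *s; c != '\0'; c = *++s)
    /* Associates as ((h + c) + h * 32), so (h + c) is independent of
       h * 32 and the additions can interleave.  */
    h = h + c + h * 32;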

So due to the above issues, on Tigerlake you get 4 cycles for the original
loop, and also 4 cycles for the modified two-at-a-time loop.

(on top of that, gcc-10 fails to eliminate extra zero-extends and ends up
with two zero-extends for *c in the loop)

Alexander
  
Alexander Monakov May 5, 2022, 11:07 a.m. UTC | #7
On Wed, 4 May 2022, Alexander Monakov wrote:
> Tigerlake belongs to Intel CPU generations that don't perform move
> elimination for general registers (along with Icelake and Sandybridge).
> So register copy in preparation to shift 'h' by 5 costs an extra cycle
> on the critical path.
> 
> Furthermore, h*33 + *c gets evaluated as '((h + h*32) + *c)' instead of
> '((h + *c) + h*32)', which prevents interleaving the additions and thus
> puts one extra add on the critical path.
> 
> (gcc-11 gets this right if you assist it by writing 'h = h + *c + h*32')
> 
> So due to the above issues, on Tigerlake you get 4 cycles for the original
> loop, and also 4 cycles for the modified two-at-a-time loop.

The following variant of the original loop avoids those issues and
should run close to 2 cycles per iteration on most CPUs:

static uint32_t
_dl_new_hash (const char *s)
{
  uint32_t h = 5381, c;
  const unsigned char *us = (const void *)s;
  while ((c = *us++))
    {
      c += h;
      asm("" : "+r"(h) : "r"(c));
      h = h * 32 + c;
    }
  return h;
}

Alexander
  
Noah Goldstein May 5, 2022, 3:10 p.m. UTC | #8
On Thu, May 5, 2022 at 6:07 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Wed, 4 May 2022, Alexander Monakov wrote:
> > Tigerlake belongs to Intel CPU generations that don't perform move
> > elimination for general registers (along with Icelake and Sandybridge).
> > So register copy in preparation to shift 'h' by 5 costs an extra cycle
> > on the critical path.
> >
> > Furthermore, h*33 + *c gets evaluated as '((h + h*32) + *c)' instead of
> > '((h + *c) + h*32)', which prevents interleaving the additions and thus
> > puts one extra add on the critical path.
> >
> > (gcc-11 gets this right if you assist it by writing 'h = h + *c + h*32')
> >
> > So due to the above issues, on Tigerlake you get 4 cycles for the original
> > loop, and also 4 cycles for the modified two-at-a-time loop.
>
> The following variant of the original loop avoids those issues and
> should run close to 2 cycles per iteration on most CPUs:
>
> static uint32_t
> _dl_new_hash (const char *s)
> {
>   uint32_t h = 5381, c;
>   const unsigned char *us = (const void *)s;
>   while ((c = *us++))
>     {
>       c += h;
>       asm("" : "+r"(h) : "r"(c));
>       h = h * 32 + c;
>     }
>   return h;
> }
>

I'm not sure what you are getting at with the asm(). It seems they
produce the exact same assembly:
https://godbolt.org/z/93qGMaTTE
> Alexander
  
Alexander Monakov May 5, 2022, 3:26 p.m. UTC | #9
On Thu, 5 May 2022, Noah Goldstein wrote:

> > The following variant of the original loop avoids those issues and
> > should run close to 2 cycles per iteration on most CPUs:
> >
> > static uint32_t
> > _dl_new_hash (const char *s)
> > {
> >   uint32_t h = 5381, c;
> >   const unsigned char *us = (const void *)s;
> >   while ((c = *us++))
> >     {
> >       c += h;
> >       asm("" : "+r"(h) : "r"(c));
> >       h = h * 32 + c;
> >     }
> >   return h;
> > }
> >
> 
> I'm not sure what you are getting at with the asm(). It seems they
> produce the exact same assembly:
> https://godbolt.org/z/93qGMaTTE

They are definitely not the same even via your link. Loop body of dl_new_hash0:

.L3:
(a)	addl    %eax, %edx
	addq    $1, %rcx
(b)	sall    $5, %eax
(c)	addl    %edx, %eax
	movzbl  -1(%rcx), %edx
	testl   %edx, %edx
	jne     .L3

and of dl_new_hash1:

.L9:
(A)	movl    %r8d, %eax
	addq    $1, %rcx
(B)	sall    $5, %eax
(C)	addl    %edx, %eax
	movzbl  -1(%rcx), %edx
(D)	addl    %eax, %r8d
	testl   %edx, %edx
	jne     .L9

(in fact even the instruction count is not the same, 7 vs 8)

In the first loop, (a) and (b) are independent, (c) depends on them
both, and on the next iteration (a) and (b) take the result of (c) from
the previous one. Thus the loop-carried dependency chain is 2 cycles
per iteration.

In the second loop, (B) depends on (A), (C) depends on (B), (D) depends
on (C), and on the next iteration (A) depends on (D) from the previous
one. Thus four instructions form a dependency chain of 3 or 4 cycles,
depending on whether move elimination happens or not.

In any case, if you benchmark them both you should see the difference.

Alexander
  
Noah Goldstein May 5, 2022, 6:03 p.m. UTC | #10
On Thu, May 5, 2022 at 10:26 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Thu, 5 May 2022, Noah Goldstein wrote:
>
> > > The following variant of the original loop avoids those issues and
> > > should run close to 2 cycles per iteration on most CPUs:
> > >
> > > static uint32_t
> > > _dl_new_hash (const char *s)
> > > {
> > >   uint32_t h = 5381, c;
> > >   const unsigned char *us = (const void *)s;
> > >   while ((c = *us++))
> > >     {
> > >       c += h;
> > >       asm("" : "+r"(h) : "r"(c));
> > >       h = h * 32 + c;
> > >     }
> > >   return h;
> > > }
> > >
> >
> > I'm not sure what you are getting at with the asm(). It seems they
> > produce the exact same assembly:
> > https://godbolt.org/z/93qGMaTTE
>
> They are definitely not the same even via your link. Loop body of dl_new_hash0:

Oh wow. Must have 'looked at it' before it reloaded.
I'm sorry!
>
> .L3:
> (a)     addl    %eax, %edx
>         addq    $1, %rcx
> (b)     sall    $5, %eax
> (c)     addl    %edx, %eax
>         movzbl  -1(%rcx), %edx
>         testl   %edx, %edx
>         jne     .L3
>
> and of dl_new_hash1:
>
> .L9:
> (A)     movl    %r8d, %eax
>         addq    $1, %rcx
> (B)     sall    $5, %eax
> (C)     addl    %edx, %eax
>         movzbl  -1(%rcx), %edx
> (D)     addl    %eax, %r8d
>         testl   %edx, %edx
>         jne     .L9
>
> (in fact even the instruction count is not the same, 7 vs 8)
>
> In the first loop, (a) and (b) are independent, (c) depends on them
> both, and on the next iteration (a) and (b) take the result of (c)
> from the previous one. Thus the loop-carried dependency chain is
> 2 cycles per iteration.
>
> In the second loop, (B) depends on (A), (C) depends on (B), (D)
> depends on (C), and on the next iteration (A) depends on (D) from the
> previous one. Thus four instructions form a dependency chain of 3 or
> 4 cycles, depending on whether move elimination happens or not.
>
> In any case, if you benchmark them both you should see the difference.
>
Okay that makes sense and indeed results in a substantial improvement.

Totally happy with going with your version.

Think there is still some benefit to the unrolled version because

1) It's less eager about hitting the LSD on newer processors
(but that's really only an issue for strings > ~24 characters).

2) It bottlenecks less hard on `p6` because the `imul` goes `p0`
and the branches are distributed between `p0` and `p6` instead of
always on `p6`.

3) It still saves a few uops (although imul vs `add + shl` isn't really
a meaningful save).

Either way it will be an improvement.

Little benchmark: https://godbolt.org/z/G6PvW4eTr

Generally see hash2 winning the most.

> Alexander
  
Alexander Monakov May 5, 2022, 7:37 p.m. UTC | #11
On Thu, 5 May 2022, Noah Goldstein wrote:
> Okay that makes sense and indeed results in a substantial improvement.
> 
> Totally happy with going with your version.
> 
> Think there is still some benefit to the unrolled version because
> 
> 1) It's less eager about hitting the LSD on newer processors
> (but that's really only an issue for strings > ~24 characters).
> 
> 2) It bottlenecks less hard on `p6` because the `imul` goes `p0`
> and the branches are distributed between `p0` and `p6` instead of
> always on `p6`.
> 
> 3) It still saves a few uops (although imul vs `add + shl` isn't really
> a meaningful save).

Agreed; let me point out though that your variant is preferable provided that
multiplication is at most 3 cycles (or integer multiply-add is at most 4).
That's a given on not-too-old x86, but I'm not sure how things stand elsewhere.

On the other hand, if a 3-cycle multiply-add is available, your variant
is strongly preferable (this is generic code, so we should try to think
a bit about other architectures).

I'd recommend rewording the commit message of your patch 6/6 so it
properly explains that the apparent 2x speedup is due to the baseline
case hitting two issues that slow it down by, well, almost 2x.

> Either way it will be an improvement.
> 
> Little benchmark: https://godbolt.org/z/G6PvW4eTr
> 
> Generally see hash2 winning the most.

I noticed that hash2 needs more magic empty asms, because the compiler may
reassociate 'h = h + (32 * c0 + c1);' to 'h = (h + c1) + 32 * c0' which
increases the dependency chain. I adjusted it like this:

        c1 += c0;
        asm("" : "+r"(c1), "+r"(c0));
        h *= 33 * 33;
        c1 += c0 * 32;
        asm("" : "+r"(c1));
        h += c1;
        s += 2;

Alexander
  
Noah Goldstein May 5, 2022, 10:51 p.m. UTC | #12
On Thu, May 5, 2022 at 2:37 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Thu, 5 May 2022, Noah Goldstein wrote:
> > Okay that makes sense and indeed results in a substantial improvement.
> >
> > Totally happy with going with your version.
> >
> > Think there is still some benefit to the unrolled version because
> >
> > 1) It's less eager about hitting the LSD on newer processors
> > (but that's really only an issue for strings > ~24 characters).
> >
> > 2) It bottlenecks less hard on `p6` because the `imul` goes `p0`
> > and the branches are distributed between `p0` and `p6` instead of
> > always on `p6`.
> >
> > 3) It still saves a few uops (although imul vs `add + shl` isn't really
> > a meaningful save).
>
> Agreed; let me point out though that your variant is preferable provided that
> multiplication is at most 3 cycles (or integer multiply-add is at most 4).
> That's a given on not-too-old x86, but I'm not sure how things stand elsewhere.
>
> On the other hand, if a 3-cycle multiply-add is available, your
> variant is strongly preferable (this is generic code, so we should try
> to think a bit about other architectures).

Does anyone have any thoughts on this? I generally think a 32-bit
multiply is safe (as opposed to 64-bit). But if anyone has any
feedback, these are the three versions in contention:

/* Pros: No multiply. Good scheduling.
   Cons: Slightly slower than the alternatives (tested on modern x86).
   Ugly asm statement. */
uint32_t dl_new_hash0(const char * s) {
    uint32_t              h  = 5381, c;
    const unsigned char * us = (const void *)s;
    while ((c = *us++)) {
        c += h;
        asm("" : "+r"(h) : "r"(c));
        h = h * 32 + c;
    }
    return h;
}

/* Pros: Fastest version (tested on modern x86). Good scheduling.
   Cons: 32-bit multiply. Ugly asm statements. */
uint32_t dl_new_hash2(const unsigned char * s) {
    unsigned int h = 5381;
    unsigned int c0, c1;
    for (;;) {
        c0 = (unsigned int)(*s);
        /* Unlikely length zero string so evens will be slightly less
           common.  */
        if (__glibc_unlikely(c0 == 0)) {
            return h;
        }

        c1 = (unsigned int)(*(s + 1));
        if (c1 == 0) {
            c0 += h;
            asm("" : "+r"(h) : "r"(c0));
            h = h * 32 + c0;
            return h;
        }
        c1 += c0;
        asm("" : "+r"(c1), "+r"(c0));
        h *= 33 * 33;
        c1 += c0 * 32;
        asm("" : "+r"(c1));
        h += c1;
        s += 2;
    }
}

/* Pros: Faster than hash0, slower than hash2. No asm statement.
   Cons: 32-bit multiply; the compiler can slightly de-optimize
   scheduling. */
uint32_t dl_new_hash3(const char * s) {
    unsigned int  h = 5381;
    unsigned char c0, c1;
    for (;;) {
        c0 = *s;
        /* Unlikely length zero string so evens will be slightly less
           common.  */
        if (__glibc_unlikely(c0 == 0)) {
            return h;
        }

        c1 = *(s + 1);
        if (c1 == 0) {
            return h * 33 + c0;
        }
        h = 33 * 33 * h + 33 * c0 + c1;
        s += 2;
    }
}
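
A minimal harness to check the three variants agree (hypothetical, not
part of the patch; assumes the definitions above, plus <stdint.h> and
<sys/cdefs.h> for __glibc_unlikely, are in the same file):

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical smoke test: all three versions must agree on any
       input string.  */
    const char * s = "_dl_new_hash";
    printf("%" PRIu32 " %" PRIu32 " %" PRIu32 "\n",
           dl_new_hash0(s),
           dl_new_hash2((const unsigned char *)s),
           dl_new_hash3(s));
    return 0;
}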



>
> I'd recommend rewording the commit message of your patch 6/6 so it
> properly explains that the apparent 2x speedup is due to the baseline
> case hitting two issues that slow it down by, well, almost 2x.

Makes sense. I'll change it in the next version (going to leave a bit
of time for feedback on the 32-bit multiply).
>
> > Either way it will be an improvement.
> >
> > Little benchmark: https://godbolt.org/z/G6PvW4eTr
> >
> > Generally see hash2 winning the most.
>
> I noticed that hash2 needs more magic empty asms, because the compiler may
> reassociate 'h = h + (32 * c0 + c1);' to 'h = (h + c1) + 32 * c0' which
> increases the dependency chain. I adjusted it like this:
>
>         c1 += c0;
>         asm("" : "+r"(c1), "+r"(c0));
>         h *= 33 * 33;
>         c1 += c0 * 32;
>         asm("" : "+r"(c1));
>         h += c1;
>         s += 2;

Right. Thanks!
>
> Alexander
  

Patch

diff --git a/elf/dl-new-hash.h b/elf/dl-new-hash.h
index 52eef4e417..b0026706bd 100644
--- a/elf/dl-new-hash.h
+++ b/elf/dl-new-hash.h
@@ -20,14 +20,33 @@ 
 #define _DL_NEW_HASH_H 1
 
 #include <stdint.h>
+/* For __glibc_unlikely.  */
+#include <sys/cdefs.h>
 
 static uint32_t
+__attribute__ ((unused))
 _dl_new_hash (const char *s)
 {
-  uint32_t h = 5381;
-  for (unsigned char c = *s; c != '\0'; c = *++s)
-    h = h * 33 + c;
-  return h;
+  unsigned int h = 5381;
+  unsigned char c0, c1;
+  for (;;)
+    {
+      c0 = *s;
+      /* Unlikely length zero string so evens will be slightly less
+         common.  */
+      if (__glibc_unlikely (c0 == 0))
+	{
+	  return h;
+	}
+
+      c1 = *(s + 1);
+      if (c1 == 0)
+	{
+	  return h * 33 + c0;
+	}
+      h = 33 * 33 * h + 33 * c0 + c1;
+      s += 2;
+    }
 }