[v3,6/6] elf: Optimize _dl_new_hash in dl-new-hash.h
Checks
dj/TryBot-apply_patch: success - Patch applied to master at the time it was sent
dj/TryBot-32bit: success - Build for i686
Commit Message
Unroll slightly so some of the multiplies can be pipelined on out-of-order
machines. Unrolling further started to induce slowdowns for sizes
[0, 4] but can help the loop, so if larger sizes are the target,
further unrolling can be beneficial.
Results for _dl_new_hash
Benchmarked on Tigerlake: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Time as Geometric Mean of N=25 runs
Geometric of all benchmark New / Old: 0.791
type, length, New Time, Old Time, New Time / Old Time
fixed, 0, 0.641, 0.658, 0.974
fixed, 1, 1.888, 1.883, 1.003
fixed, 2, 2.712, 2.833, 0.957
fixed, 3, 3.314, 3.739, 0.886
fixed, 4, 4.316, 4.866, 0.887
fixed, 5, 5.16, 5.966, 0.865
fixed, 6, 5.986, 7.241, 0.827
fixed, 7, 7.264, 8.435, 0.861
fixed, 8, 8.052, 9.846, 0.818
fixed, 9, 9.369, 11.316, 0.828
fixed, 10, 10.256, 12.925, 0.794
fixed, 11, 12.191, 14.546, 0.838
fixed, 12, 12.667, 15.92, 0.796
fixed, 13, 14.442, 17.465, 0.827
fixed, 14, 14.808, 18.981, 0.78
fixed, 15, 16.244, 20.565, 0.79
fixed, 16, 17.166, 22.044, 0.779
fixed, 32, 35.447, 50.558, 0.701
fixed, 64, 86.479, 134.529, 0.643
fixed, 128, 155.453, 287.527, 0.541
fixed, 256, 302.57, 593.64, 0.51
random, 2, 11.168, 10.61, 1.053
random, 4, 13.308, 13.53, 0.984
random, 8, 16.579, 19.437, 0.853
random, 16, 21.292, 24.776, 0.859
random, 32, 30.56, 35.906, 0.851
random, 64, 49.249, 68.577, 0.718
random, 128, 81.845, 140.664, 0.582
random, 256, 152.517, 292.204, 0.522
---
elf/dl-new-hash.h | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)
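As a rough illustration of the unrolling described above (a simplified sketch, not the exact committed code; the function names here are made up for this example):

```c
#include <stdint.h>

/* Baseline: the classic one-character-per-iteration loop.  */
static uint32_t
hash_one_at_a_time (const char *s)
{
  uint32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = h * 33 + c;
  return h;
}

/* Unrolled by two: h advances by 33 * 33 per step, so the 33 * 33 * h
   and 33 * c0 terms are independent and can be pipelined on an
   out-of-order core.  */
static uint32_t
hash_two_at_a_time (const char *s)
{
  uint32_t h = 5381;
  for (;;)
    {
      unsigned char c0 = s[0];
      if (c0 == '\0')
        return h;
      unsigned char c1 = s[1];
      if (c1 == '\0')
        return h * 33 + c0;
      h = 33 * 33 * h + 33 * c0 + c1;
      s += 2;
    }
}
```

Both functions compute the same djb2-style hash; only the iteration granularity differs.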
Comments
* Noah Goldstein via Libc-alpha:
> diff --git a/elf/dl-new-hash.h b/elf/dl-new-hash.h
> index 52eef4e417..b0026706bd 100644
> --- a/elf/dl-new-hash.h
> +++ b/elf/dl-new-hash.h
> @@ -20,14 +20,33 @@
> #define _DL_NEW_HASH_H 1
>
> #include <stdint.h>
> /* For __glibc_unlikely. */
> #include <sys/cdefs.h>
>
> static uint32_t
> +__attribute__ ((unused))
> _dl_new_hash (const char *s)
> {
Does this change belong here, into this commit?
Thanks,
Florian
On Mon, 25 Apr 2022, Noah Goldstein via Libc-alpha wrote:
> Unroll slightly so some of the multiplies can be pipelined on out-of-order
> machines. Unrolling further started to induce slowdowns for sizes
> [0, 4] but can help the loop, so if larger sizes are the target,
> further unrolling can be beneficial.
Note, the original algorithm does not need a literal multiplication (with
3cyc latency), as h*33 == (h << 5) + h:
> - for (unsigned char c = *s; c != '\0'; c = *++s)
> - h = h * 33 + c;
in musl we spell this out as 'h += h*32 + c' to avoid GCC emitting a
multiplication at -Os.
In 'h = (h + c) + (h << 5)' critical path has latency of only 2 cycles,
and (h + c) goes independent of (h << 5).
(if the original loop is implemented with a multiplication, its critical
path has 4-cycle latency, 3cyc from the multiplication plus 1 from the
addition)
However, when you unroll the loop and overlap two iterations, multiplication
by 33*33 no longer has this nice property and runs with two 4cyc paths
overlapped (so effective critical path is the same as original).
Alexander
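Alexander's reassociation point can be sketched concretely (a hedged illustration; both functions below compute the same hash, only the association of the adds differs):

```c
#include <stdint.h>

/* Literal form: h * 33 + c keeps everything on the loop-carried
   dependency chain (multiply, or shift+add, then add).  */
static uint32_t
djb2_mul (const char *s)
{
  uint32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = h * 33 + c;
  return h;
}

/* Reassociated form: (h + c) and (h << 5) do not depend on each other,
   so the carried path is just two dependent adds, about 2 cycles.  */
static uint32_t
djb2_shift (const char *s)
{
  uint32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = (h + c) + (h << 5);
  return h;
}
```

Since h * 33 == (h << 5) + h in uint32_t arithmetic, the two are bit-for-bit equivalent.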
On Wed, Apr 27, 2022 at 11:17 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Wed, 27 Apr 2022, Noah Goldstein wrote:
>
> > > However, when you unroll the loop and overlap two iterations, multiplication
> > > by 33*33 no longer has this nice property and runs with two 4cyc paths
> > > overlapped (so effective critical path is the same as original).
> >
> > the 33 * c0 can still use `addl; sall; addl` so not sure what you mean by
> > two 4cyc paths overlapped. It's one 4c path.
> >
> > `imul; addl` and `addl; sall; addl`.
> >
> > But it's fair that either way it's 4c of computation for 2 iterations. The
> > difference is the 5c load latency being amortized over 2 iterations
> > or 1 iteration.
>
> Right, it's one 4c path, I was thinking about something else for a moment.
> I'm not sure it's correct to amortize load latency like that, I'd say the
> difference is just that the original loop cannot issue two loads at once
> because of the dependency in its address computation.
>
> I see you dropped libc-alpha@ from Cc:, was that intentional?
No, misclick, sorry. Adding it back.
I think it is the way you're doing your analysis, as a loop-carried
dependency. I.e. really 7c per iteration with no unroll (although
it's fair the loads on address can speculate ahead, so it will
indeed be faster) vs 9c per 2x iterations.
>
> Alexander
On Wed, Apr 27, 2022 at 5:43 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein via Libc-alpha:
>
> > diff --git a/elf/dl-new-hash.h b/elf/dl-new-hash.h
> > index 52eef4e417..b0026706bd 100644
> > --- a/elf/dl-new-hash.h
> > +++ b/elf/dl-new-hash.h
> > @@ -20,14 +20,33 @@
> > #define _DL_NEW_HASH_H 1
> >
> > #include <stdint.h>
> > /* For __glibc_unlikely. */
> > #include <sys/cdefs.h>
> >
> > static uint32_t
> > +__attribute__ ((unused))
> > _dl_new_hash (const char *s)
> > {
>
> Does this change belong here, into this commit?
Moved it to the refactor commit in V4.
>
> Thanks,
> Florian
>
On Wed, 27 Apr 2022, Noah Goldstein via Libc-alpha wrote:
> I think it is the way you're doing your analysis, as a loop-carried
> dependency. I.e. really 7c per iteration with no unroll (although
> it's fair the loads on address can speculate ahead, so it will
> indeed be faster) vs 9c per 2x iterations.
Hm? Right, the CPU will issue loads speculatively, so you shouldn't count
load latency as part of critical path.
I don't understand how you get a 2x improvement on long strings, did you
run the benchmark with rdtscp timing, i.e. with
make USE_RDTSCP=1 bench
?
Alexander
Hi,
I managed to run the new benchtest; I understand how you get a 2x speedup
on long inputs now.
Tigerlake belongs to Intel CPU generations that don't perform move
elimination for general registers (along with Icelake and Sandybridge).
So register copy in preparation to shift 'h' by 5 costs an extra cycle
on the critical path.
Furthermore, h*33 + *c gets evaluated as '(h + h*32) + *c' instead of
'(h + *c) + h*32', which prevents interleaving the additions and thus
puts one extra add on the critical path.
(gcc-11 gets this right if you assist it by writing 'h = h + *c + h*32')
So due to the above issues, on Tigerlake you get 4 cycles for the original
loop, and also 4 cycles for the modified two-at-a-time loop.
(on top of that, gcc-10 fails to eliminate extra zero-extends and ends up
with two zero-extends for *c in the loop)
Alexander
On Wed, 4 May 2022, Alexander Monakov wrote:
> Tigerlake belongs to Intel CPU generations that don't perform move
> elimination for general registers (along with Icelake and Sandybridge).
> So register copy in preparation to shift 'h' by 5 costs an extra cycle
> on the critical path.
>
> Furthermore, h*33 + *c gets evaluated as '(h + h*32) + *c' instead of
> '(h + *c) + h*32', which prevents interleaving the additions and thus
> puts one extra add on the critical path.
>
> (gcc-11 gets this right if you assist it by writing 'h = h + *c + h*32')
>
> So due to the above issues, on Tigerlake you get 4 cycles for the original
> loop, and also 4 cycles for the modified two-at-a-time loop.
The following variant of the original loop avoids those issues and
should run close to 2 cycles per iteration on most CPUs:
static uint32_t
_dl_new_hash (const char *s)
{
uint32_t h = 5381, c;
const unsigned char *us = (const void *)s;
while ((c = *us++))
{
c += h;
asm("" : "+r"(h) : "r"(c));
h = h * 32 + c;
}
return h;
}
Alexander
On Thu, May 5, 2022 at 6:07 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Wed, 4 May 2022, Alexander Monakov wrote:
> > Tigerlake belongs to Intel CPU generations that don't perform move
> > elimination for general registers (along with Icelake and Sandybridge).
> > So register copy in preparation to shift 'h' by 5 costs an extra cycle
> > on the critical path.
> >
> > Furthermore, h*33 + *c gets evaluated as '(h + h*32) + *c' instead of
> > '(h + *c) + h*32', which prevents interleaving the additions and thus
> > puts one extra add on the critical path.
> >
> > (gcc-11 gets this right if you assist it by writing 'h = h + *c + h*32')
> >
> > So due to the above issues, on Tigerlake you get 4 cycles for the original
> > loop, and also 4 cycles for the modified two-at-a-time loop.
>
> The following variant of the original loop avoids those issues and
> should run close to 2 cycles per iteration on most CPUs:
>
> static uint32_t
> _dl_new_hash (const char *s)
> {
> uint32_t h = 5381, c;
> const unsigned char *us = (const void *)s;
> while ((c = *us++))
> {
> c += h;
> asm("" : "+r"(h) : "r"(c));
> h = h * 32 + c;
> }
> return h;
> }
>
I'm not sure what you are getting at with the asm(). It seems they
produce the exact same assembly:
https://godbolt.org/z/93qGMaTTE
> Alexander
On Thu, 5 May 2022, Noah Goldstein wrote:
> > The following variant of the original loop avoids those issues and
> > should run close to 2 cycles per iteration on most CPUs:
> >
> > static uint32_t
> > _dl_new_hash (const char *s)
> > {
> > uint32_t h = 5381, c;
> > const unsigned char *us = (const void *)s;
> > while ((c = *us++))
> > {
> > c += h;
> > asm("" : "+r"(h) : "r"(c));
> > h = h * 32 + c;
> > }
> > return h;
> > }
> >
>
> I'm not sure what you are getting at with the asm(). It seems they
> produce the exact same assembly:
> https://godbolt.org/z/93qGMaTTE
They are definitely not the same even via your link. Loop body of dl_new_hash0:
.L3:
(a) addl %eax, %edx
addq $1, %rcx
(b) sall $5, %eax
(c) addl %edx, %eax
movzbl -1(%rcx), %edx
testl %edx, %edx
jne .L3
and of dl_new_hash1:
.L9:
(A) movl %r8d, %eax
addq $1, %rcx
(B) sall $5, %eax
(C) addl %edx, %eax
movzbl -1(%rcx), %edx
(D) addl %eax, %r8d
testl %edx, %edx
jne .L9
(in fact even the instruction count is not the same, 7 vs 8)
In the first loop, (a) and (b) are independent, (c) depends on them both, and
on the next iteration (a) and (b) take the result of (c) from the previous. Thus
the dependencies are 2 cycles for one iteration.
In the second loop, (B) depends on (A), (C) depends on (B), (D) depends on (C),
and on the next iteration (A) depends on (D) from the previous. Thus four
instructions form a dependency chain of 3 or 4 cycles, depending on whether
move elimination happens or not.
In any case, if you benchmark them both you should see the difference.
Alexander
On Thu, May 5, 2022 at 10:26 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Thu, 5 May 2022, Noah Goldstein wrote:
>
> > > The following variant of the original loop avoids those issues and
> > > should run close to 2 cycles per iteration on most CPUs:
> > >
> > > static uint32_t
> > > _dl_new_hash (const char *s)
> > > {
> > > uint32_t h = 5381, c;
> > > const unsigned char *us = (const void *)s;
> > > while ((c = *us++))
> > > {
> > > c += h;
> > > asm("" : "+r"(h) : "r"(c));
> > > h = h * 32 + c;
> > > }
> > > return h;
> > > }
> > >
> >
> > I'm not sure what you are getting at with the asm(). It seems they
> > produce the exact same assembly:
> > https://godbolt.org/z/93qGMaTTE
>
> They are definitely not the same even via your link. Loop body of dl_new_hash0:
Oh wow. Must have 'looked at it' before it reloaded.
I'm sorry!
>
> .L3:
> (a) addl %eax, %edx
> addq $1, %rcx
> (b) sall $5, %eax
> (c) addl %edx, %eax
> movzbl -1(%rcx), %edx
> testl %edx, %edx
> jne .L3
>
> and of dl_new_hash1:
>
> .L9:
> (A) movl %r8d, %eax
> addq $1, %rcx
> (B) sall $5, %eax
> (C) addl %edx, %eax
> movzbl -1(%rcx), %edx
> (D) addl %eax, %r8d
> testl %edx, %edx
> jne .L9
>
> (in fact even the instruction count is not the same, 7 vs 8)
>
> In the first loop, (a) and (b) are independent, (c) depends on them both, and
> on the next iteration (a) and (b) take the result of (c) from the previous. Thus
> the dependencies are 2 cycles for one iteration.
>
> In the second loop, (B) depends on (A), (C) depends on (B), (D) depends on (C),
> and on the next iteration (A) depends on (D) from the previous. Thus four
> instructions form a dependency chain of 3 or 4 cycles, depending on whether
> move elimination happens or not.
>
> In any case, if you benchmark them both you should see the difference.
>
Okay that makes sense and indeed results in a substantial improvement.
Totally happy with going with your version.
Think there is still some benefit to the unrolled version because
1) It's less eager about hitting the LSD on newer processors
(but that's really only an issue for strings > ~24 characters).
2) It bottlenecks less hard on `p6` because the `imul` goes `p0`
and the branches are distributed between `p0` and `p6` instead of
always on `p6`.
3) It still saves a few uops (although imul vs `add + shl` isn't really
a meaningful save).
Either way it will be an improvement.
Little benchmark: https://godbolt.org/z/G6PvW4eTr
Generally see hash2 winning the most.
> Alexander
On Thu, 5 May 2022, Noah Goldstein wrote:
> Okay that makes sense and indeed results in a substantial improvement.
>
> Totally happy with going with your version.
>
> Think there is still some benefit to the unrolled version because
>
> 1) It's less eager about hitting the LSD on newer processors
> (but that's really only an issue for strings > ~24 characters).
>
> 2) It bottlenecks less hard on `p6` because the `imul` goes `p0`
> and the branches are distributed between `p0` and `p6` instead of
> always on `p6`.
>
> 3) It still saves a few uops (although imul vs `add + shl` isn't really
> a meaningful save).
Agreed; let me point out though that your variant is preferable provided that
multiplication is at most 3 cycles (or integer multiply-add is at most 4).
That's a given on not-too-old x86, but I'm not sure how things stand elsewhere.
On the other hand, if a 3-cycle multiply-add is available, your variant is
strongly preferable (this is generic code so we should try to think a bit
about other architectures).
I'd recommend rewording the commit message of your patch 6/6 so it properly
explains that the apparent 2x speedup is due to the baseline case hitting
two issues that slow it down by, well, almost 2x.
> Either way it will be an improvement.
>
> Little benchmark: https://godbolt.org/z/G6PvW4eTr
>
> Generally see hash2 winning the most.
I noticed that hash2 needs more magic empty asms, because the compiler may
reassociate 'h = h + (32 * c0 + c1);' to 'h = (h + c1) + 32 * c0' which
increases the dependency chain. I adjusted it like this:
c1 += c0;
asm("" : "+r"(c1), "+r"(c0));
h *= 33 * 33;
c1 += c0 * 32;
asm("" : "+r"(c1));
h += c1;
s += 2;
Alexander
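Putting Alexander's adjusted fragment into a complete function might look like the sketch below (the surrounding function is scaffolding for illustration, not the committed patch; the empty __asm__ statements are GCC extended-asm barriers that stop the compiler from reassociating the adds back onto the loop-carried chain, and the classic loop is included only as a reference for checking):

```c
#include <stdint.h>

/* Classic one-at-a-time loop, used purely as a reference.  */
static uint32_t
djb2_ref (const char *s)
{
  uint32_t h = 5381;
  for (unsigned char c = *s; c != '\0'; c = *++s)
    h = h * 33 + c;
  return h;
}

/* hash2 with the extra empty asms: barriers pin the association so
   c1 += c0 and c1 += c0 * 32 stay off the h dependency chain.  */
static uint32_t
dl_new_hash2_adjusted (const char *s)
{
  const unsigned char *us = (const unsigned char *) s;
  uint32_t h = 5381, c0, c1;
  for (;;)
    {
      c0 = us[0];
      if (c0 == 0)
        return h;
      c1 = us[1];
      if (c1 == 0)
        {
          /* Odd-length tail: h * 32 + (h + c0) == h * 33 + c0.  */
          c0 += h;
          __asm__ ("" : "+r" (h) : "r" (c0));
          return h * 32 + c0;
        }
      c1 += c0;
      __asm__ ("" : "+r" (c1), "+r" (c0));
      h *= 33 * 33;
      c1 += c0 * 32;
      __asm__ ("" : "+r" (c1));
      h += c1;
      us += 2;
    }
}
```

Per two characters this computes h * 1089 + 33 * c0 + c1, which matches two steps of the reference loop.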
On Thu, May 5, 2022 at 2:37 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
> On Thu, 5 May 2022, Noah Goldstein wrote:
> > Okay that makes sense and indeed results in a substantial improvement.
> >
> > Totally happy with going with your version.
> >
> > Think there is still some benefit to the unrolled version because
> >
> > 1) It's less eager about hitting the LSD on newer processors
> > (but that's really only an issue for strings > ~24 characters).
> >
> > 2) It bottlenecks less hard on `p6` because the `imul` goes `p0`
> > and the branches are distributed between `p0` and `p6` instead of
> > always on `p6`.
> >
> > 3) It still saves a few uops (although imul vs `add + shl` isn't really
> > a meaningful save).
>
> Agreed; let me point out though that your variant is preferable provided that
> multiplication is at most 3 cycles (or integer multiply-add is at most 4).
> That's a given on not-too-old x86, but I'm not sure how things stand elsewhere.
>
> On the other hand, if a 3-cycle multiply-add is available, your variant is
> strongly preferable (this is generic code so we should try to think a bit
> about other architectures).
Does anyone have any thoughts on this? I generally think 32-bit multiply
is safe (as opposed to 64-bit). But if anyone has any feedback, these
are the three versions in contention:
/* Pros: No multiply. Good scheduling.
Cons: slightly slower than alternative (tested on modern x86). Ugly
asm statement. */
static uint32_t dl_new_hash0(const char * s) {
uint32_t h = 5381, c;
const unsigned char * us = (const void *)s;
while ((c = *us++)) {
c += h;
asm("" : "+r"(h) : "r"(c));
h = h * 32 + c;
}
return h;
}
/* Pros: Fastest version (tested on modern x86). Good scheduling.
Cons: 32 bit multiply. Ugly asm statement. */
static uint32_t dl_new_hash2(const unsigned char * s) {
unsigned int h = 5381;
unsigned int c0, c1;
for (;;) {
c0 = (unsigned int)(*s);
/* Unlikely length zero string so evens will be slightly less
common. */
if (__glibc_unlikely(c0 == 0)) {
return h;
}
c1 = (unsigned int)(*(s + 1));
if (c1 == 0) {
c0 += h;
asm("" : "+r"(h) : "r"(c0));
h = h * 32 + c0;
return h;
}
c1 += c0;
asm("" : "+r"(c1), "+r"(c0));
h *= 33 * 33;
c1 += c0 * 32;
asm("" : "+r"(c1));
h += c1;
s += 2;
}
}
/* Pros: Faster than hash0, slower than hash2. No asm statement.
Cons: 32 bit multiply, compiler can slightly de-optimize scheduling. */
static unsigned int dl_new_hash3(const char * s) {
unsigned int h = 5381;
unsigned char c0, c1;
for (;;) {
c0 = *s;
/* Unlikely length zero string so evens will be slightly less
common. */
if (__glibc_unlikely(c0 == 0)) {
return h;
}
c1 = *(s + 1);
if (c1 == 0) {
return h * 33 + c0;
}
h = 33 * 33 * h + 33 * c0 + c1;
s += 2;
}
}
>
> I'd recommend rewording the commit message of your patch 6/6 so it properly
> explains that the apparent 2x speedup is due to the baseline case hitting
> two issues that slow it down by, well, almost 2x.
Makes sense. I'll change in next version (going to leave a bit of time for
feedback on the 32-bit multiply).
>
> > Either way it will be an improvement.
> >
> > Little benchmark: https://godbolt.org/z/G6PvW4eTr
> >
> > Generally see hash2 winning the most.
>
> I noticed that hash2 needs more magic empty asms, because the compiler may
> reassociate 'h = h + (32 * c0 + c1);' to 'h = (h + c1) + 32 * c0' which
> increases the dependency chain. I adjusted it like this:
>
> c1 += c0;
> asm("" : "+r"(c1), "+r"(c0));
> h *= 33 * 33;
> c1 += c0 * 32;
> asm("" : "+r"(c1));
> h += c1;
> s += 2;
Bright. Thanks!
>
> Alexander
@@ -20,14 +20,33 @@
#define _DL_NEW_HASH_H 1
#include <stdint.h>
+/* For __glibc_unlikely. */
+#include <sys/cdefs.h>
static uint32_t
+__attribute__ ((unused))
_dl_new_hash (const char *s)
{
- uint32_t h = 5381;
- for (unsigned char c = *s; c != '\0'; c = *++s)
- h = h * 33 + c;
- return h;
+ unsigned int h = 5381;
+ unsigned char c0, c1;
+ for (;;)
+ {
+ c0 = *s;
+ /* Unlikely length zero string so evens will be slightly less
+ common. */
+ if (__glibc_unlikely (c0 == 0))
+ {
+ return h;
+ }
+
+ c1 = *(s + 1);
+ if (c1 == 0)
+ {
+ return h * 33 + c0;
+ }
+ h = 33 * 33 * h + 33 * c0 + c1;
+ s += 2;
+ }
}