[rfc] : use xxhash() in ld build-id computation, incl. benchmark
Checks
Context                                               Check   Description
linaro-tcwg-bot/tcwg_binutils_build--master-arm       fail    Build failed
linaro-tcwg-bot/tcwg_binutils_build--master-aarch64   fail    Build failed
Commit Message
Hi -
xxhash [1] is a popular, very high speed hash library (BSD 2-Clause)
which is getting increasing attention in data-intensive tools like
rsync.  binutils ld is a popular tool :-) that includes a hash
operation for computing build-ids, and there are recurrent complaints
that this operation is still too slow.
[1] https://www.xxhash.com/
The following patch draft adds a new ld build-id computation mode,
"xx", using xxhash in its 128-bit mode.  The patch requires the
xxhash-devel headers to be installed, and uses the "all-inlined"
model, so there is no run-time or link-time library dependence.
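For reference, here is a minimal sketch of the all-inlined XXH3 128-bit
streaming pattern involved (the XXH3_*/XXH128_hash_t names are the
public xxhash.h API; hash_chunks is a made-up illustration, and error
checking is elided):

#define XXH_INLINE_ALL          /* everything inlined; nothing to link */
#include "xxhash.h"

/* Sketch: feed a series of byte ranges through XXH3 and return the
   128-bit digest.  (NULL/error checks elided for brevity.) */
static XXH128_hash_t
hash_chunks (const void *const *chunks, const size_t *sizes, size_t n)
{
  XXH3_state_t *state = XXH3_createState ();
  XXH3_128bits_reset (state);
  for (size_t i = 0; i < n; i++)
    XXH3_128bits_update (state, chunks[i], sizes[i]);
  XXH128_hash_t result = XXH3_128bits_digest (state);
  XXH3_freeState (state);
  return result;
}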
Yes, yes, yes, so how fast is it?  Pretty good!  Tested on a Fedora
x86-64 box, linking the largest shared library in the Fedora
"suitesparse" package (libgraphblas.so, 600MB).  (The test involved
running the final cmake/gcc link stage by hand in an rpmbuild
--no-clean build tree, in -fno-lto mode so that mainly linking is
exercised, even though the incoming objects were built with
-ffat-lto-objects.)
RPM_PACKAGE_NAME=1 RPM_PACKAGE_VERSION=1 RPM_PACKAGE_RELEASE=1 RPM_ARCH=1 gcc -fPIC -O2 -fexceptions -g -grecord-gcc-switches -pipe -fno-lto -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -Wundef -std=c11 -lm -Wno-pragmas -fexcess-precision=fast -fcx-limited-range -fno-math-errno -fwrapv -DNDEBUG -DNDEBUG -Wl,-z,relro -Wl,--as-needed -Wl,-z,pack-relative-relocs -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -specs=/usr/lib/rpm/redhat/redhat-package-notes -shared -Wl,-soname,libgraphblas.so.9 -o /tmp/libgraphblas.so.9.1.0 @CMakeFiles/GraphBLAS.dir/objects1.rsp @CMakeFiles/GraphBLAS.dir/objects2.rsp -Wl,-rpath,::::::: -Wl,--build-id=$format -ldl /usr/lib/gcc/x86_64-redhat-linux/14/libgomp.so /usr/lib64/libpthread.a
Here are the time results for $format set to md5, sha1 (the Fedora
default), and xx, over several iterations:
md5 1.68s user 0.86s system 80% cpu 3.137 total
md5 1.73s user 0.78s system 79% cpu 3.148 total
md5 1.73s user 0.80s system 79% cpu 3.200 total
md5 1.75s user 0.85s system 79% cpu 3.275 total
md5 1.77s user 0.80s system 78% cpu 3.273 total
md5 1.77s user 0.80s system 81% cpu 3.160 total
md5 1.77s user 0.85s system 81% cpu 3.229 total
md5 1.79s user 0.88s system 80% cpu 3.326 total
md5 1.80s user 0.78s system 81% cpu 3.165 total
md5 1.81s user 0.83s system 78% cpu 3.349 total
md5 1.83s user 0.84s system 82% cpu 3.256 total
sha1 1.28s user 0.78s system 76% cpu 2.712 total
sha1 1.28s user 0.83s system 77% cpu 2.717 total
sha1 1.31s user 0.84s system 78% cpu 2.751 total
sha1 1.32s user 0.80s system 78% cpu 2.707 total
sha1 1.32s user 0.81s system 78% cpu 2.737 total
sha1 1.34s user 0.77s system 76% cpu 2.752 total
sha1 1.34s user 0.81s system 76% cpu 2.812 total
sha1 1.34s user 0.86s system 77% cpu 2.824 total
sha1 1.36s user 0.83s system 76% cpu 2.880 total
sha1 1.37s user 0.82s system 76% cpu 2.878 total
sha1 1.37s user 0.84s system 78% cpu 2.831 total
xx 1.07s user 0.79s system 73% cpu 2.524 total
xx 1.08s user 0.83s system 73% cpu 2.605 total
xx 1.08s user 0.84s system 74% cpu 2.577 total
xx 1.09s user 0.81s system 75% cpu 2.533 total
xx 1.10s user 0.81s system 74% cpu 2.576 total
xx 1.11s user 0.77s system 76% cpu 2.454 total
xx 1.11s user 0.82s system 73% cpu 2.617 total
xx 1.12s user 0.80s system 76% cpu 2.513 total
xx 1.14s user 0.86s system 74% cpu 2.680 total
xx 1.15s user 0.86s system 75% cpu 2.640 total
xx 1.16s user 0.81s system 76% cpu 2.567 total
i.e., this gcc/link step is overall about 20% faster with xx than
with sha1.  Not too bad.
Here's the patch. Is there interest in me adding xxhash configury and
formally submitting it? Or interest in benchmarking with something
else?
Comments
On Tue, Sep 17, 2024 at 1:15 PM Frank Ch. Eigler <fche@redhat.com> wrote:
>
> [...]
>
> Here's the patch. Is there interest in me adding xxhash configury and
> formally submitting it? Or interest in benchmarking with something
> else?
>
>
> diff --git a/ld/ldbuildid.c b/ld/ldbuildid.c
> index 5ba9e503c7c3..070bb7e0ad22 100644
> --- a/ld/ldbuildid.c
> +++ b/ld/ldbuildid.c
> @@ -23,6 +23,8 @@
> #include "safe-ctype.h"
> #include "md5.h"
> #include "sha1.h"
> +#define XXH_INLINE_ALL
> +#include "xxhash.h"
> #include "ldbuildid.h"
> #ifdef __MINGW32__
> #include <windows.h>
> @@ -34,7 +36,7 @@
> bool
> validate_build_id_style (const char *style)
> {
> - if ((streq (style, "md5")) || (streq (style, "sha1"))
> + if ((streq (style, "xx")) || (streq (style, "md5")) || (streq (style, "sha1"))
> || (streq (style, "uuid")) || (startswith (style, "0x")))
> return true;
>
> @@ -44,7 +46,7 @@ validate_build_id_style (const char *style)
> bfd_size_type
> compute_build_id_size (const char *style)
> {
> - if (streq (style, "md5") || streq (style, "uuid"))
> + if (streq (style, "md5") || streq (style, "uuid") || streq (style, "xx"))
> return 128 / 8;
>
> if (streq (style, "sha1"))
> @@ -93,6 +95,15 @@ read_hex (const char xdigit)
> return 0;
> }
>
> +
> +
> +static void
> +xx_process_bytes(const void* buffer, size_t size, void* state)
> +{
> + XXH3_128bits_update ((XXH3_state_t*) state, buffer, size);
> +}
> +
> +
> bool
> generate_build_id (bfd *abfd,
> const char *style,
> @@ -100,7 +111,25 @@ generate_build_id (bfd *abfd,
> unsigned char *id_bits,
> int size ATTRIBUTE_UNUSED)
> {
> - if (streq (style, "md5"))
> + if (streq (style, "xx"))
> + {
> + XXH3_state_t* state = XXH3_createState();
> + if (!state)
> + {
> + return false;
> + }
> + XXH3_128bits_reset (state);
> + if (!(*checksum_contents) (abfd, &xx_process_bytes, state))
> + {
> + XXH3_freeState (state);
> + return false;
> + }
> + XXH128_hash_t result = XXH3_128bits_digest (state);
> + XXH3_freeState (state);
> + memcpy (id_bits, &result,
> + (size_t) size < sizeof (result) ? (size_t) size : sizeof (result));
> + }
> + else if (streq (style, "md5"))
> {
> struct md5_ctx ctx;
>
>
Introducing a new --build-id= value requires the ecosystem to adapt.
Therefore, lld/mold's --build-id=fast has never gained wide adoption.
sha1 is 160 bits, wider than the 128 bits that many fast hash
implementations produce.
lld and mold truncate BLAKE3's 256-bit digest to 128/160 bits for
md5/sha1 :)
Hi -
Thanks for your response!
> Introducing a new --build-id= value requires the ecosystem to adapt.
I'm not sure I see why. Build-ids for any given binary may be
computed or set any way at all, independent of other binaries or the
rest of the ecosystem.
> Therefore, lld/mold's --build-id=fast has never gained wide adoption.
I'm not sure why that would be, or why it should be a consideration
for an improvement in binutils.
> sha1 is 160 bits, wider than the 128 bits that many fast hash
> implementations produce.  lld and mold truncate BLAKE3's 256-bit
> digest to 128/160 bits for md5/sha1 :)
Digest width is an interesting question, yes.  I don't recall which
newfangled linker it was, but some years ago, someone proposed that it
should use an 8-byte (64-bit) buildid, because of speed.  We were not
big fans.
We did some calculations and some checking of the debuginfod buildid
corpus (which as of today includes ~100 million distinct buildids in
the public federated servers).  That count starts to come foreseeably
close to 2^32, and considering the birthday paradox, a digest needs at
least 64 bits to reliably tell that many apart.  But 128 bits ought to
be enough for anybody.
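As a back-of-envelope sketch of that birthday-paradox arithmetic (an
illustration, not taken from the original mail): for n uniformly random
b-bit digests, the collision probability is roughly n(n-1)/2^(b+1):

/* Rough birthday-bound collision probability for n build-ids with a
   b-bit digest: p ~= n*(n-1)/2^(b+1).  Values above 1 just mean a
   collision is effectively certain.  Compile with -lm. */
#include <math.h>
#include <stdio.h>

int
main (void)
{
  const double n = 1e8;         /* ~100 million distinct buildids */
  const int widths[] = { 32, 64, 128, 160 };
  for (size_t i = 0; i < sizeof widths / sizeof widths[0]; i++)
    printf ("%3d-bit digest: p ~= %.3g\n",
            widths[i], n * (n - 1) * pow (2.0, -(widths[i] + 1)));
  return 0;
}

That works out to p above 1 (i.e. a collision is certain) at 32 bits,
about 3e-4 at 64 bits, and about 1.5e-23 at 128 bits.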
- FChE
Hi,
On Tue, 2024-09-17 at 17:27 -0400, Frank Ch. Eigler wrote:
> > Introducing a new --build-id= value requires the ecosystem to adapt.
>
> I'm not sure I see why. Build-ids for any given binary may be
> computed or set any way at all, independent of other binaries or the
> rest of the ecosystem.
I don't think that is really true.  The way and the "strength" of the
build-id calculation matter.  This is only true if the way the build-id
is calculated is really reproducible, has strong collision resistance
properties, and nothing depends on the particular way the build-id is
calculated.
E.g. using --build-id=uuid (not reproducible) or --build-id=0xhexstring
(strong possibility it will collide) is fundamentally different from
using --build-id=md5 or --build-id=sha1.
(I am probably using the term strong collision resistance wrongly here;
it is one property of a cryptographic hash function, but we don't need
one of those here.)
I do think it is true for what you are proposing here though.  I would
say that build-ids calculated through md5, sha1 or your proposed xx are
interchangeable even though they use different lengths (because they
are at least 128 bits and strongly "random").
> > Therefore, lld/mold's --build-id=fast has never gained wide adoption.
>
> I'm not sure why that would be, or why it should be a consideration
> for an improvement in binutils.
lld --build-id=fast generates only an 8-byte/64-bit build-id, which is
why nobody is using that.
> > sha1 is 160 bits, wider than the 128 bits that many fast hash
> > implementations produce.  lld and mold truncate BLAKE3's 256-bit
> > digest to 128/160 bits for md5/sha1 :)
>
> Digest width is an interesting question, yes.  I don't recall which
> newfangled linker it was, but some years ago, someone proposed that it
> should use an 8-byte (64-bit) buildid, because of speed.  We were not
> big fans.
>
> We did some calculations and some checking of the debuginfod buildid
> corpus (which as of today includes ~100 million distinct buildids in
> the public federated servers).  That count starts to come foreseeably
> close to 2^32, and considering the birthday paradox, a digest needs at
> least 64 bits to reliably tell that many apart.  But 128 bits ought to
> be enough for anybody.
I would say, if the speedup is really as much as you say, then just
make this the default, and maybe use it for md5 or sha1 (with a little
padding) too.  That assumes people don't depend on the particular
algorithm used, but just want a strong 128-bit+ build-id.  (One
possible shape of that padding is sketched below.)
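For instance, a minimal sketch (purely illustrative, not proposed code)
of filling a wider sha1-style 20-byte id field from a 16-byte XXH128
digest, zero-padding the tail; fill_id_padded is a made-up helper name:

/* Illustrative only: copy a digest into the build-id field and
   zero-pad whatever is left over (e.g. a 16-byte XXH128 digest into
   a 20-byte sha1-style field). */
#include <string.h>

static void
fill_id_padded (unsigned char *id_bits, size_t id_size,
                const unsigned char *digest, size_t digest_size)
{
  size_t n = id_size < digest_size ? id_size : digest_size;
  memcpy (id_bits, digest, n);
  memset (id_bits + n, 0, id_size - n);
}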
But it would be good to document these calculations somewhere so we are
really sure 128 bits are enough for any binary.
Cheers,
Mark
Hi -
> > I'm not sure I see why. Build-ids for any given binary may be
> > computed or set any way at all, independent of other binaries or the
> > rest of the ecosystem.
> I don't think that is really true.  The way and the "strength" of the
> build-id calculation matter.  This is only true if the way the
> build-id is calculated is really reproducible, has strong collision
> resistance properties, and nothing depends on the particular way the
> build-id is calculated.
Sure, depending on which "ecosystem". Some don't care about
reproducible buildids, some have few enough builds that they don't
care about strong collision resistance. There is no "the" ecosystem.
The buildid-related tools will accept anything and work within each
ecosystem.
> [...] I would say, if the speedup is really as much as you say, then
> just make this the default [...]
Maybe, though defaulting (within binutils or within distros) is a
separate followup question.
I'd want to do one more test before merging the initial enabling work.
Namely, I'd try to recompute xx-style buildids for a large number of
binaries, just to confirm collision resistance. This assumes that
there's interest in the binutils community for xx.
> But it would be good to document these calculations somewhere so we
> are really sure 128 bits are enough for any binary.
Given the current depth of the investigation, this would take only a
paragraph. :)
- FChE
>>>>> "Frank" == Frank Ch Eigler <fche@redhat.com> writes:
Frank> Here's the patch. Is there interest in me adding xxhash configury and
Frank> formally submitting it? Or interest in benchmarking with something
Frank> else?
Note there's already some code in gdbsupport/common.m4.
Tom