[0/3] Speedup abidiff for large symbol diffs

Message ID 20260224132624.3469488-1-christopher.burr@cern.ch
Headers
Series Speedup abidiff for large symbol diffs |

Message

Chris Burr Feb. 24, 2026, 1:26 p.m. UTC
  When comparing corpora with large numbers of differing symbols (in my
case a library which was linking to LLVM and accidentally exposing all
LLVM symbols), abidiff was taking ~9 minutes.  Two algorithmic
bottlenecks were responsible:

1. compute_aliases_for_elf_symbol performed a full O(N) symbol table
   scan per call, yielding O(N^2) behavior.  Fixed by using the
   existing hash table for O(1) lookup.

2. The corpus diff computation used Myers diff O(N*D) to compare
   symbol lists, but the downstream consumer only builds unordered
   maps.  Replaced with direct hash-based set difference in O(N+M).

Together, these reduce abidiff runtime on corpora with ~57K symbol
changes from ~9 minutes to ~1 second.  A regression test with 20K
synthetic symbols is included.

Chris Burr (3):
  ir: Speedup compute_aliases_for_elf_symbol
  comparison: Speedup corpus diff computation
  tests: Add perf regression test for symbol diff

 include/abg-comparison.h                      |     6 -
 src/abg-comparison-priv.h                     |     4 -
 src/abg-comparison.cc                         |   430 +-
 src/abg-ir.cc                                 |    13 +-
 tests/data/Makefile.am                        |     3 +
 .../test-many-unreferenced-syms-report.txt    | 19998 +++++++++++++++
 .../test-many-unreferenced-syms-v0.abi        |    14 +
 .../test-many-unreferenced-syms-v1.abi        | 20004 ++++++++++++++++
 tests/generate-perf-test-data.py              |    44 +
 tests/test-abidiff-exit.cc                    |    15 +
 10 files changed, 40251 insertions(+), 280 deletions(-)
 create mode 100644 tests/data/test-abidiff-exit/test-many-unreferenced-syms-report.txt
 create mode 100644 tests/data/test-abidiff-exit/test-many-unreferenced-syms-v0.abi
 create mode 100644 tests/data/test-abidiff-exit/test-many-unreferenced-syms-v1.abi
 create mode 100644 tests/generate-perf-test-data.py
  

Comments

Dodji Seketeli Feb. 27, 2026, 1:39 p.m. UTC | #1
Hello Chris,

OK, let's get this straight, I am impressed.  Congratulations and many
thanks for diving into this topic.

I have been willing to try to compare functions & variable "directly" as
you did precisely to do away with the potentially quadratic behaviour of the
otherwise great Myers diff algorithm but never got time to get to the
bottom of it.  And here you are, extremely welcome ;-)

Chris Burr <christopher.burr@cern.ch> a écrit:

> When comparing corpora with large numbers of differing symbols (in my
> case a library which was linking to LLVM and accidentally exposing all
> LLVM symbols), abidiff was taking ~9 minutes.

Whoah. Out of curiosity, have you noticed speed gains when comparing,
say, that binary against itself?

Also, if the library is Free Software, maybe we can add it to the "big
tests" suite of libabigail at
https://sourceware.org/cgit/libabigail-tests/tree ?

> Two algorithmic bottlenecks were responsible:
>
> 1. compute_aliases_for_elf_symbol performed a full O(N) symbol table
>    scan per call, yielding O(N^2) behavior.  Fixed by using the
>    existing hash table for O(1) lookup.
>
> 2. The corpus diff computation used Myers diff O(N*D) to compare
>    symbol lists, but the downstream consumer only builds unordered
>    maps.  Replaced with direct hash-based set difference in O(N+M).
>
> Together, these reduce abidiff runtime on corpora with ~57K symbol
> changes from ~9 minutes to ~1 second.  A regression test with 20K
> synthetic symbols is included.

Many thanks.  I am looking at the patches and will be replying to them
specifically.

[...]

Thanks again.

Cheers,