[RFC] Add an experimental ABI pruning utility.

Message ID 20210311115340.3033687-1-gprocida@google.com
State Rejected, archived
Headers
Series [RFC] Add an experimental ABI pruning utility. |

Commit Message

Giuliano Procida March 11, 2021, 11:53 a.m. UTC
  In response to internal reports about very large ABIs for libraires
implemented in C++ but exposing only a C API, I wrote a script as a
means of investigating the issue.

The script aims to remove parts of the XML that are irrelevant to
abidiff. It has been run on the current libabigail tests as well as
some locally-generated ABI files.

On most XML files, this results in a modest saving (5-10%). However,
for the library that sparked the invesigaton, the resulting XML was
more than 2300 times smaller.

The script may have some merit on its own but perhaps should be
implemented differently.

Question: How hard would it be to implement this functionality (under
flag control) within libabigail itself?

A couple of tangential things worth mentioning (see the comments at
the very beginning and very end of the script):

- abidiff rejects XML files with an XML declaration at the top
- abidiff rejects completely empty files

Resolving the first would make the second moot.

	* scripts/abiprune.pl: Add script to prune ABI XML.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 323 insertions(+)
 create mode 100755 scripts/abiprune.pl
  

Comments

Ben Woodard March 11, 2021, 8:39 p.m. UTC | #1
I think that you have pointed out a problem that I’ve been discussing with some users that I’m working with. The sheer volume of the abixml output is pretty extreme. I don’t think that they were really expecting it. However, they are not running into the particular situation that you are seeing since they don’t have the coding model of the problem library. They are more likely to fall in the 5-10% range that you reported above. 

We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”. The ELF reachability aspect that you point out seems to be part of the difference between those two concepts.

> On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <libabigail@sourceware.org> wrote:
> 
> In response to internal reports about very large ABIs for libraires
> implemented in C++ but exposing only a C API, I wrote a script as a
> means of investigating the issue.
> 
> The script aims to remove parts of the XML that are irrelevant to
> abidiff. It has been run on the current libabigail tests as well as
> some locally-generated ABI files.
> 
> On most XML files, this results in a modest saving (5-10%). However,
> for the library that sparked the invesigaton, the resulting XML was
> more than 2300 times smaller.
> 
> The script may have some merit on its own but perhaps should be
> implemented differently.
> 
> Question: How hard would it be to implement this functionality (under
> flag control) within libabigail itself?

I can’t directly answer “how hard” that is one of the things which we would discover by actually doing it. As for should it be done. I think it should be carefully looked into. However, as for priority since it is an optimization rather than a functional problem, if it were me, I would file this as a bug and put examples of the test data and results you are seeing as well as the script below in that bz and then start working on the process of refining libabigail’s output so that it matches the results of your script.

-ben

> 
> A couple of tangential things worth mentioning (see the comments at
> the very beginning and very end of the script):
> 
> - abidiff rejects XML files with an XML declaration at the top
> - abidiff rejects completely empty files
> 
> Resolving the first would make the second moot.
> 
> 	* scripts/abiprune.pl: Add script to prune ABI XML.
> 
> Signed-off-by: Giuliano Procida <gprocida@google.com>
> ---
> scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 323 insertions(+)
> create mode 100755 scripts/abiprune.pl
> 
> diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
> new file mode 100755
> index 00000000..8e04a273
> --- /dev/null
> +++ b/scripts/abiprune.pl
> @@ -0,0 +1,323 @@
> +#!/usr/bin/perl
> +
> +# This script is intended to consume libabigail ABI XML as generated
> +# by abidw and produce a possibly smaller representation that captures
> +# the same ABI. In particular, the output should be such that abidiff
> +# --harmless reports no differences (or is empty).
> +#
> +# It does this by modelling symbols and types dependencies as a
> +# directed graph, marking nodes reachable from the ELF symbols and
> +# dropping unreachable nodes.
> +
> +use strict;
> +use warnings;
> +no warnings 'recursion';
> +
> +use autodie;
> +
> +use XML::LibXML;
> +
> +# Making a graph from ABI XML.
> +#
> +# The following are the types of "node" we care about. The "edges" are
> +# obtained from a few XML attributes as well as via XML element
> +# containment.
> +#
> +#  ELF (exported) symbols
> +#
> +#   elf-symbol (has a name; the code here currently knows nothing
> +#     about aliases)
> +#
> +#  Declarations (that mention a symbol)
> +#
> +#   These live in a scope. In C++ scopes can be nested and include
> +#   namespaces and class types.
> +#
> +#  var-decl (also used for member variables)
> +#    elf-symbol linked to symbol via mangled-name
> +#    type-id links to a type
> +#  function-decl (also used for member functions)
> +#    elf-symbol linked to symbol via mangled-name
> +#    parameter and return type-ids link to types
> +#      (alas, not just a link to a function type)
> +#
> +#  Types
> +#
> +#   These occupy pretty much all the other elements, besides those
> +#   that act as simple containers.
> +#
> +# XML elements and their roles in more detail
> +#
> +# ELF
> +#
> +# elf-needed - container
> +#  dependency - names a library
> +# elf-function-symbols - contains a list of symbols
> +# elf-variable-symbols - contains a list of symbols
> +#  elf-symbol - describes an ELF variable or function
> +#
> +# Grouping and scoping
> +#
> +# abi-corpus-group
> +#  abi-corpus
> +#   abi-instr - compilation unit containers
> +#    namespace-decl - pure container, possibly named
> +#
> +# Types (some introduce scopes, only in C++)
> +#
> +# type-decl - defines a primitive type
> +# typedef-decl - defines a type, links to a type
> +# qualified-type-def - defines a type, links to a type
> +# pointer-type-def - defines a type, links to a type
> +# reference-type-def - defines a type, links to a type
> +# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
> +#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
> +# function-type - defines a type
> +#  parameter - belongs to function-type and -decl, links to a type
> +#  return - belongs to function-type and -decl, links to a type
> +# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
> +#  underlying-type - belongs to enum-decl
> +#  enumerator - belongs to enum-decl
> +# union-decl - defines and names a type, contains member elements linked to other things
> +# class-decl - defines and names a type, contains base type, member elements linking to other things
> +#  base-class - belongs to class-decl
> +#  data-member - container for a member; holds access level
> +#  member-function - container for a member; holds access level
> +#  member-type - container for a type declaration; holds access level
> +#  member-template - container for a (function?) template declaration; holds access level
> +#
> +# Higher order Things
> +#
> +# class-template-decl - defines a "type", but without instantiation this isn't usable
> +# function-template-decl - defines a "type", but without instantiation this isn't usable
> +#  template-type-parameter - defines a type (formal variable), perhaps one which should be excluded from the real type graph
> +#  template-non-type-parameter - names a template parameter, links to a type
> +#  template-parameter-type-composition - container?
> +#
> +# Values
> +#
> +# function-decl - names a function, can link to a symbol
> +#     has same children as function-type, rather than linking to a type
> +# var-decl - names a variable, can link to a symbol, links to a type
> +
> +# Graph nodes.
> +my %symbols;
> +my %exported_symbols;
> +my %types;
> +# Graph edges.
> +my %type_type_edges;
> +my %symbol_type_edges;
> +my %type_symbol_edges;
> +
> +# Keep track of the types being defined.
> +my @id_stack;
> +# Keep track of the current (value) declaration.
> +my @symbol_stack;
> +sub process($);
> +sub process($) {
> +  my ($node) = @_;
> +
> +  # The XML attributes we care about.
> +  my $name;
> +  my $id;
> +  my $type_id;
> +  my $symbol;
> +  my $naming_typedef_id;
> +
> +  # Not every node we encounter is an XML element.
> +  if ($node->nodeType == XML_ELEMENT_NODE) {
> +    $name = $node->getAttribute('name');
> +    $id = $node->getAttribute('id');
> +    $type_id = $node->getAttribute('type-id');
> +    $symbol = $node->getAttribute('mangled-name');
> +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
> +    die if defined $id && defined $symbol;
> +  }
> +
> +  if (defined $name && $node->getName() eq 'elf-symbol') {
> +    $exported_symbols{$name} = undef;
> +    # Early return is safe, but not necessary.
> +    return;
> +  }
> +
> +  if (defined $id) {
> +    # This element defines a type (but there may be more than one
> +    # defining the same type - we cannot rely on uniqueness).
> +    $types{$id} = undef;
> +    if (defined $naming_typedef_id) {
> +      # This is an odd one, there can be a backwards link from an
> +      # anonymous type to the typedef that refers to it, so we need to
> +      # pull in the typedef, even if nothing else refers to it.
> +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
> +    }
> +    if (@id_stack) {
> +      # Type parent<->child dependencies; record dependencies both
> +      # ways to avoid making holes in the XML types.
> +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
> +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
> +    }
> +    push @id_stack, $id;
> +  }
> +
> +  if (defined $symbol) {
> +    # This element is a declaration linked to a symbol (whether or not
> +    # exported).
> +    $symbols{$symbol} = undef;
> +    if (@id_stack) {
> +      # The enclosing scope may include a C++ class, in which case we
> +      # need to record the dependency.
> +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
> +    }
> +    # The symbol depends on the types mentioned in this element, so
> +    # record it.
> +    push @symbol_stack, $symbol;
> +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
> +  }
> +
> +  if (defined $type_id) {
> +    if (@id_stack) {
> +      # The enclosing type refers to this type, record the dependency.
> +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
> +    }
> +    if (@symbol_stack) {
> +      # The enclosing declaration (linked to a symbol) refers to this
> +      # type, record the dependency.
> +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
> +    }
> +  }
> +
> +  for my $child ($node->nonBlankChildNodes()) {
> +    process($child);
> +  }
> +
> +  if (defined $symbol) {
> +    pop @symbol_stack;
> +  }
> +  if (defined $id) {
> +    pop @id_stack;
> +  }
> +}
> +
> +# Load the XML.
> +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
> +# Build a graph.
> +process($dom);
> +die if @id_stack or @symbol_stack;
> +
> +# DFS visited state.
> +my %seen_types;
> +my %seen_symbols;
> +sub dfs_type($);
> +sub dfs_symbol($);
> +
> +sub dfs_type($) {
> +  my ($type) = @_;
> +  return if exists $seen_types{$type};
> +  $seen_types{$type} = undef;
> +
> +  my $types = $type_type_edges{$type};
> +  if (defined $types) {
> +    for my $type (keys %$types) {
> +      for my $link (keys %{$types->{$type}}) {
> +        if (exists $types->{$type}{$link}) {
> +          dfs_type($type);
> +        }
> +      }
> +    }
> +  }
> +  my $symbols = $type_symbol_edges{$type};
> +  if (defined $symbols) {
> +    for my $symbol (keys %$symbols) {
> +      dfs_symbol($symbol);
> +    }
> +  }
> +}
> +
> +sub dfs_symbol($) {
> +  my ($symbol) = @_;
> +  return if exists $seen_symbols{$symbol};
> +  $seen_symbols{$symbol} = undef;
> +
> +  my $types = $symbol_type_edges{$symbol};
> +  if (defined $types) {
> +    for my $type (keys %$types) {
> +      dfs_type($type);
> +    }
> +  } else {
> +    warn "no declaration found for exported symbol $symbol\n";
> +  }
> +}
> +
> +# Traverse the graph, using the exported symbols as the roots.
> +for my $symbol (keys %exported_symbols) {
> +  dfs_symbol($symbol);
> +}
> +
> +# Useful counts.
> +sub print_report() {
> +  my $count_symbols = scalar keys %symbols;
> +  my $count_exported_symbols = scalar keys %exported_symbols;
> +  my $count_types = scalar keys %types;
> +  my $count_reachable_types = scalar keys %seen_types;
> +
> +  print STDERR qq{symbols = $count_symbols
> +exported symbols = $count_exported_symbols
> +types = $count_types
> +reachable types = $count_reachable_types
> +};
> +}
> +
> +my %drop_if_empty = map { $_ => undef } qw(
> +  namespace-decl
> +  abi-instr
> +  abi-corpus
> +  abi-corpus-group
> +);
> +
> +# This is another DFS traversal of the XML.
> +sub prune($);
> +sub prune($) {
> +  my ($node) = @_;
> +
> +  my $name = $node->getName();
> +  my $id;
> +  my $symbol;
> +
> +  if ($node->nodeType == XML_ELEMENT_NODE) {
> +    $id = $node->getAttribute('id');
> +    $symbol = $node->getAttribute('mangled-name');
> +    die if defined $id && defined $symbol;
> +  }
> +
> +  # Return if we know that this is a type or declaration to keep or
> +  # drop in its entirety.
> +  if (defined $id) {
> +    return !exists $seen_types{$id};
> +  }
> +  if ($name eq 'var-decl' || $name eq 'function-decl') {
> +    return !defined $symbol || !exists $exported_symbols{$symbol};
> +  }
> +
> +  # Otherwise, this is not a type, declaration or part thereof, so
> +  # process child elements.
> +  my $empty = 1;
> +  for my $child ($node->nonBlankChildNodes()) {
> +    if (prune($child)) {
> +      $node->removeChild($child);
> +    } else {
> +      $empty = 0;
> +    }
> +  }
> +
> +  # Empty container elements can be dropped.
> +  return $empty && exists $drop_if_empty{$name};
> +}
> +
> +# Drop unreachable nodes and any container nodes that end up empty.
> +prune($dom);
> +my $out = $dom->toString();
> +# Remove blank line debris.
> +$out =~ s;^ *\n;;mg;
> +# Remove the XML declaration as abidiff doesn't like it.
> +$out =~ s;^<\?xml .*\?>\n;;m;
> +print $out;
> -- 
> 2.31.0.rc2.261.g7f71774620-goog
>
  
Giuliano Procida March 12, 2021, 4:51 p.m. UTC | #2
Hi.

On Thu, 11 Mar 2021 at 20:39, Ben Woodard <woodard@redhat.com> wrote:

> I think that you have pointed out a problem that I’ve been discussing with
> some users that I’m working with. The sheer volume of the abixml output is
> pretty extreme. I don’t think that they were really expecting it. However,
> they are not running into the particular situation that you are seeing
> since they don’t have the coding model of the problem library. They are
> more likely to fall in the 5-10% range that you reported above.
>
> We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”.
> The ELF reachability aspect that you point out seems to be part of the
> difference between those two concepts.
>
>
We want a distilled description of the ABI (the surface), not what may lie
on the other side. More precisely, we want a typed interface where "type"
encompasses extra things like architecture, calling convention, ISA, ELF
symbol properties etc.

One worry is that just chasing references from the exported ELF symbols may
somehow miss something important, say for libraries that ship with bits of
implementation in header files.


> > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <
> libabigail@sourceware.org> wrote:
> >
> > In response to internal reports about very large ABIs for libraires
> > implemented in C++ but exposing only a C API, I wrote a script as a
> > means of investigating the issue.
> >
> > The script aims to remove parts of the XML that are irrelevant to
> > abidiff. It has been run on the current libabigail tests as well as
> > some locally-generated ABI files.
> >
> > On most XML files, this results in a modest saving (5-10%). However,
> > for the library that sparked the invesigaton, the resulting XML was
> > more than 2300 times smaller.
> >
> > The script may have some merit on its own but perhaps should be
> > implemented differently.
> >
> > Question: How hard would it be to implement this functionality (under
> > flag control) within libabigail itself?
>
> I can’t directly answer “how hard” that is one of the things which we
> would discover by actually doing it. As for should it be done. I think it
> should be carefully looked into. However, as for priority since it is an
> optimization rather than a functional problem, if it were me, I would file
> this as a bug and put examples of the test data and results you are seeing
> as well as the script below in that bz and then start working on the
> process of refining libabigail’s output so that it matches the results of
> your script.
>
>
It's an optimisation analogous to DCE in a compiler. I'll certainly follow
up with a bug. It's one library in particular that is ~50M before pruning
and ~22k after.

I've already made changes to the script so that it can be given a list of
symbols and produce a reduced ABI. At the moment, we achieve this by
passing a list to abidw (with -w). In fact, we extract full and reduced
ABIs concurrently in the Android kernel build as each run takes a while. It
should be straightforward to reimplement the script in C++, though it's not
too slow already.

In terms of implementation, it would need to work with libabigail IR and
ir_node_visitor might be the right place to start.

Giuliano.


> -ben
>
> >
> > A couple of tangential things worth mentioning (see the comments at
> > the very beginning and very end of the script):
> >
> > - abidiff rejects XML files with an XML declaration at the top
> > - abidiff rejects completely empty files
> >
> > Resolving the first would make the second moot.
> >
> >       * scripts/abiprune.pl: Add script to prune ABI XML.
> >
> > Signed-off-by: Giuliano Procida <gprocida@google.com>
> > ---
> > scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 323 insertions(+)
> > create mode 100755 scripts/abiprune.pl
> >
> > diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
> > new file mode 100755
> > index 00000000..8e04a273
> > --- /dev/null
> > +++ b/scripts/abiprune.pl
> > @@ -0,0 +1,323 @@
> > +#!/usr/bin/perl
> > +
> > +# This script is intended to consume libabigail ABI XML as generated
> > +# by abidw and produce a possibly smaller representation that captures
> > +# the same ABI. In particular, the output should be such that abidiff
> > +# --harmless reports no differences (or is empty).
> > +#
> > +# It does this by modelling symbols and types dependencies as a
> > +# directed graph, marking nodes reachable from the ELF symbols and
> > +# dropping unreachable nodes.
> > +
> > +use strict;
> > +use warnings;
> > +no warnings 'recursion';
> > +
> > +use autodie;
> > +
> > +use XML::LibXML;
> > +
> > +# Making a graph from ABI XML.
> > +#
> > +# The following are the types of "node" we care about. The "edges" are
> > +# obtained from a few XML attributes as well as via XML element
> > +# containment.
> > +#
> > +#  ELF (exported) symbols
> > +#
> > +#   elf-symbol (has a name; the code here currently knows nothing
> > +#     about aliases)
> > +#
> > +#  Declarations (that mention a symbol)
> > +#
> > +#   These live in a scope. In C++ scopes can be nested and include
> > +#   namespaces and class types.
> > +#
> > +#  var-decl (also used for member variables)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    type-id links to a type
> > +#  function-decl (also used for member functions)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    parameter and return type-ids link to types
> > +#      (alas, not just a link to a function type)
> > +#
> > +#  Types
> > +#
> > +#   These occupy pretty much all the other elements, besides those
> > +#   that act as simple containers.
> > +#
> > +# XML elements and their roles in more detail
> > +#
> > +# ELF
> > +#
> > +# elf-needed - container
> > +#  dependency - names a library
> > +# elf-function-symbols - contains a list of symbols
> > +# elf-variable-symbols - contains a list of symbols
> > +#  elf-symbol - describes an ELF variable or function
> > +#
> > +# Grouping and scoping
> > +#
> > +# abi-corpus-group
> > +#  abi-corpus
> > +#   abi-instr - compilation unit containers
> > +#    namespace-decl - pure container, possibly named
> > +#
> > +# Types (some introduce scopes, only in C++)
> > +#
> > +# type-decl - defines a primitive type
> > +# typedef-decl - defines a type, links to a type
> > +# qualified-type-def - defines a type, links to a type
> > +# pointer-type-def - defines a type, links to a type
> > +# reference-type-def - defines a type, links to a type
> > +# array-type-def - defines a (multidimensional array) type, refers to
> element type, contains subranges
> > +#  subrange - contains array length, refers to element type; defines
> types (never referred to; duplicated)
> > +# function-type - defines a type
> > +#  parameter - belongs to function-type and -decl, links to a type
> > +#  return - belongs to function-type and -decl, links to a type
> > +# enum-decl - defines a type, names it, contains a list of enumerators
> and an underlying-type
> > +#  underlying-type - belongs to enum-decl
> > +#  enumerator - belongs to enum-decl
> > +# union-decl - defines and names a type, contains member elements
> linked to other things
> > +# class-decl - defines and names a type, contains base type, member
> elements linking to other things
> > +#  base-class - belongs to class-decl
> > +#  data-member - container for a member; holds access level
> > +#  member-function - container for a member; holds access level
> > +#  member-type - container for a type declaration; holds access level
> > +#  member-template - container for a (function?) template declaration;
> holds access level
> > +#
> > +# Higher order Things
> > +#
> > +# class-template-decl - defines a "type", but without instantiation
> this isn't usable
> > +# function-template-decl - defines a "type", but without instantiation
> this isn't usable
> > +#  template-type-parameter - defines a type (formal variable), perhaps
> one which should be excluded from the real type graph
> > +#  template-non-type-parameter - names a template parameter, links to a
> type
> > +#  template-parameter-type-composition - container?
> > +#
> > +# Values
> > +#
> > +# function-decl - names a function, can link to a symbol
> > +#     has same children as function-type, rather than linking to a type
> > +# var-decl - names a variable, can link to a symbol, links to a type
> > +
> > +# Graph nodes.
> > +my %symbols;
> > +my %exported_symbols;
> > +my %types;
> > +# Graph edges.
> > +my %type_type_edges;
> > +my %symbol_type_edges;
> > +my %type_symbol_edges;
> > +
> > +# Keep track of the types being defined.
> > +my @id_stack;
> > +# Keep track of the current (value) declaration.
> > +my @symbol_stack;
> > +sub process($);
> > +sub process($) {
> > +  my ($node) = @_;
> > +
> > +  # The XML attributes we care about.
> > +  my $name;
> > +  my $id;
> > +  my $type_id;
> > +  my $symbol;
> > +  my $naming_typedef_id;
> > +
> > +  # Not every node we encounter is an XML element.
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $name = $node->getAttribute('name');
> > +    $id = $node->getAttribute('id');
> > +    $type_id = $node->getAttribute('type-id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  if (defined $name && $node->getName() eq 'elf-symbol') {
> > +    $exported_symbols{$name} = undef;
> > +    # Early return is safe, but not necessary.
> > +    return;
> > +  }
> > +
> > +  if (defined $id) {
> > +    # This element defines a type (but there may be more than one
> > +    # defining the same type - we cannot rely on uniqueness).
> > +    $types{$id} = undef;
> > +    if (defined $naming_typedef_id) {
> > +      # This is an odd one, there can be a backwards link from an
> > +      # anonymous type to the typedef that refers to it, so we need to
> > +      # pull in the typedef, even if nothing else refers to it.
> > +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
> > +    }
> > +    if (@id_stack) {
> > +      # Type parent<->child dependencies; record dependencies both
> > +      # ways to avoid making holes in the XML types.
> > +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
> > +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
> > +    }
> > +    push @id_stack, $id;
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    # This element is a declaration linked to a symbol (whether or not
> > +    # exported).
> > +    $symbols{$symbol} = undef;
> > +    if (@id_stack) {
> > +      # The enclosing scope may include a C++ class, in which case we
> > +      # need to record the dependency.
> > +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
> > +    }
> > +    # The symbol depends on the types mentioned in this element, so
> > +    # record it.
> > +    push @symbol_stack, $symbol;
> > +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
> > +  }
> > +
> > +  if (defined $type_id) {
> > +    if (@id_stack) {
> > +      # The enclosing type refers to this type, record the dependency.
> > +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
> > +    }
> > +    if (@symbol_stack) {
> > +      # The enclosing declaration (linked to a symbol) refers to this
> > +      # type, record the dependency.
> > +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
> > +    }
> > +  }
> > +
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    process($child);
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    pop @symbol_stack;
> > +  }
> > +  if (defined $id) {
> > +    pop @id_stack;
> > +  }
> > +}
> > +
> > +# Load the XML.
> > +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
> > +# Build a graph.
> > +process($dom);
> > +die if @id_stack or @symbol_stack;
> > +
> > +# DFS visited state.
> > +my %seen_types;
> > +my %seen_symbols;
> > +sub dfs_type($);
> > +sub dfs_symbol($);
> > +
> > +sub dfs_type($) {
> > +  my ($type) = @_;
> > +  return if exists $seen_types{$type};
> > +  $seen_types{$type} = undef;
> > +
> > +  my $types = $type_type_edges{$type};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      for my $link (keys %{$types->{$type}}) {
> > +        if (exists $types->{$type}{$link}) {
> > +          dfs_type($type);
> > +        }
> > +      }
> > +    }
> > +  }
> > +  my $symbols = $type_symbol_edges{$type};
> > +  if (defined $symbols) {
> > +    for my $symbol (keys %$symbols) {
> > +      dfs_symbol($symbol);
> > +    }
> > +  }
> > +}
> > +
> > +sub dfs_symbol($) {
> > +  my ($symbol) = @_;
> > +  return if exists $seen_symbols{$symbol};
> > +  $seen_symbols{$symbol} = undef;
> > +
> > +  my $types = $symbol_type_edges{$symbol};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      dfs_type($type);
> > +    }
> > +  } else {
> > +    warn "no declaration found for exported symbol $symbol\n";
> > +  }
> > +}
> > +
> > +# Traverse the graph, using the exported symbols as the roots.
> > +for my $symbol (keys %exported_symbols) {
> > +  dfs_symbol($symbol);
> > +}
> > +
> > +# Useful counts.
> > +sub print_report() {
> > +  my $count_symbols = scalar keys %symbols;
> > +  my $count_exported_symbols = scalar keys %exported_symbols;
> > +  my $count_types = scalar keys %types;
> > +  my $count_reachable_types = scalar keys %seen_types;
> > +
> > +  print STDERR qq{symbols = $count_symbols
> > +exported symbols = $count_exported_symbols
> > +types = $count_types
> > +reachable types = $count_reachable_types
> > +};
> > +}
> > +
> > +my %drop_if_empty = map { $_ => undef } qw(
> > +  namespace-decl
> > +  abi-instr
> > +  abi-corpus
> > +  abi-corpus-group
> > +);
> > +
> > +# This is another DFS traversal of the XML.
> > +sub prune($);
> > +sub prune($) {
> > +  my ($node) = @_;
> > +
> > +  my $name = $node->getName();
> > +  my $id;
> > +  my $symbol;
> > +
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $id = $node->getAttribute('id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  # Return if we know that this is a type or declaration to keep or
> > +  # drop in its entirety.
> > +  if (defined $id) {
> > +    return !exists $seen_types{$id};
> > +  }
> > +  if ($name eq 'var-decl' || $name eq 'function-decl') {
> > +    return !defined $symbol || !exists $exported_symbols{$symbol};
> > +  }
> > +
> > +  # Otherwise, this is not a type, declaration or part thereof, so
> > +  # process child elements.
> > +  my $empty = 1;
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    if (prune($child)) {
> > +      $node->removeChild($child);
> > +    } else {
> > +      $empty = 0;
> > +    }
> > +  }
> > +
> > +  # Empty container elements can be dropped.
> > +  return $empty && exists $drop_if_empty{$name};
> > +}
> > +
> > +# Drop unreachable nodes and any container nodes that end up empty.
> > +prune($dom);
> > +my $out = $dom->toString();
> > +# Remove blank line debris.
> > +$out =~ s;^ *\n;;mg;
> > +# Remove the XML declaration as abidiff doesn't like it.
> > +$out =~ s;^<\?xml .*\?>\n;;m;
> > +print $out;
> > --
> > 2.31.0.rc2.261.g7f71774620-goog
> >
>
>
  
Ben Woodard March 12, 2021, 6:41 p.m. UTC | #3
if you look at abicompat there are functions like: 
perform_compat_check_in_*_mode(options& opts,
                                    diff_context_sptr& ctxt,
                                    corpus_sptr app_corpus,
                                    corpus_sptr lib1_corpus,
                                    corpus_sptr lib2_corpus)
which ultimately does something like your perl script’s ELF reachability check when computing compatibility between libraries. It iterates through the undefined symbols and then calls get_sym_ids_of_*_to_keep(). 

I would think that you would be able to essentially do the same thing inside of abidw to have it prune you abixml file in a similar way to how your perl script does. 

(Paranthetical Note: one thing that really bothers me about those functions which I hope to work on as soon as I get a chance is I believe that they need to take const corpus_sptr’s for the app and lib1 lib2 corpuses. Of course this means that the data structures would have to change and the sym_ids_of_*_to_keep() would need to be some sort of external data structure rather than something inside the corpus itself, I’ve been calling this a subcorpus as I’ve been thinking about how to make this change happen. 

The problem with the way that it is now is when using libabigail as a library, rather than as it is used in a tool like abicompat you want to be able to load a collection of corpora which takes non-trivial time in many cases and use them over and over. For example say you want to load the corpora of many variations of libraries and find the ones that are compatible with each other. You are going to want to use the app corpus over and over without having to reload it from the ELF or abixml each and every time. Yeah as a first pass you could clear the vectors with the *_to_keep but then you would only be able to use the corpus in one thread at a time. I think it would make better library design to move those things to the side.)

> On Mar 12, 2021, at 8:51 AM, Giuliano Procida <gprocida@google.com> wrote:
> 
> Hi.
> 
> On Thu, 11 Mar 2021 at 20:39, Ben Woodard <woodard@redhat.com <mailto:woodard@redhat.com>> wrote:
> I think that you have pointed out a problem that I’ve been discussing with some users that I’m working with. The sheer volume of the abixml output is pretty extreme. I don’t think that they were really expecting it. However, they are not running into the particular situation that you are seeing since they don’t have the coding model of the problem library. They are more likely to fall in the 5-10% range that you reported above. 
> 
> We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”. The ELF reachability aspect that you point out seems to be part of the difference between those two concepts.
> 
> 
> We want a distilled description of the ABI (the surface), not what may lie on the other side. More precisely, we want a typed interface where "type" encompasses extra things like architecture, calling convention, ISA, ELF symbol properties etc.
> 
> One worry is that just chasing references from the exported ELF symbols may somehow miss something important, say for libraries that ship with bits of implementation in header files.

Right now given the implementation above, it looks like ELF reachability is the only thing that abicompat takes into consideration. What you are saying potentially has important consequences because it could suggest that abicompat may be missing things.

Do you know of or could you construct an example that demonstrates a case that would be missed by this ELF reachability? It may mean that we need to look at ways to enhance the checks in abicompat and in doing so they could also be part of the code that you use to prune the output of abidw.

The thing that pops into my mind would be inline functions defined in the header files. We may be able to find those by searching through the TUs but then there are things like cpp macros even with -g3 I’m not sure we could find those but I’m not entirely sure we would need to.

I really do believe that this topic needs additional investigation.

-ben

>  
> > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <libabigail@sourceware.org <mailto:libabigail@sourceware.org>> wrote:
> > 
> > In response to internal reports about very large ABIs for libraires
> > implemented in C++ but exposing only a C API, I wrote a script as a
> > means of investigating the issue.
> > 
> > The script aims to remove parts of the XML that are irrelevant to
> > abidiff. It has been run on the current libabigail tests as well as
> > some locally-generated ABI files.
> > 
> > On most XML files, this results in a modest saving (5-10%). However,
> > for the library that sparked the invesigaton, the resulting XML was
> > more than 2300 times smaller.
> > 
> > The script may have some merit on its own but perhaps should be
> > implemented differently.
> > 
> > Question: How hard would it be to implement this functionality (under
> > flag control) within libabigail itself?
> 
> I can’t directly answer “how hard” that is one of the things which we would discover by actually doing it. As for should it be done. I think it should be carefully looked into. However, as for priority since it is an optimization rather than a functional problem, if it were me, I would file this as a bug and put examples of the test data and results you are seeing as well as the script below in that bz and then start working on the process of refining libabigail’s output so that it matches the results of your script.
> 
> 
> It's an optimisation analogous to DCE in a compiler. I'll certainly follow up with a bug. It's one library in particular that is ~50M before pruning and ~22k after.
> 
> I've already made changes to the script so that it can be given a list of symbols and produce a reduced ABI. At the moment, we achieve this by passing a list to abidw (with -w). In fact, we extract full and reduced ABIs concurrently in the Android kernel build as each run takes a while. It should be straightforward to reimplement the script in C++, though it's not too slow already.
> 
> In terms of implementation, it would need to work with libabigail IR and ir_node_visitor might be the right place to start.
> 
> Giuliano.
>  
> -ben
> 
> > 
> > A couple of tangential things worth mentioning (see the comments at
> > the very beginning and very end of the script):
> > 
> > - abidiff rejects XML files with an XML declaration at the top
> > - abidiff rejects completely empty files
> > 
> > Resolving the first would make the second moot.
> > 
> >       * scripts/abiprune.pl <http://abiprune.pl/>: Add script to prune ABI XML.
> > 
> > Signed-off-by: Giuliano Procida <gprocida@google.com <mailto:gprocida@google.com>>
> > ---
> > scripts/abiprune.pl <http://abiprune.pl/> | 323 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 323 insertions(+)
> > create mode 100755 scripts/abiprune.pl <http://abiprune.pl/>
> > 
> > diff --git a/scripts/abiprune.pl <http://abiprune.pl/> b/scripts/abiprune.pl <http://abiprune.pl/>
> > new file mode 100755
> > index 00000000..8e04a273
> > --- /dev/null
> > +++ b/scripts/abiprune.pl <http://abiprune.pl/>
> > @@ -0,0 +1,323 @@
> > +#!/usr/bin/perl
> > +
> > +# This script is intended to consume libabigail ABI XML as generated
> > +# by abidw and produce a possibly smaller representation that captures
> > +# the same ABI. In particular, the output should be such that abidiff
> > +# --harmless reports no differences (or is empty).
> > +#
> > +# It does this by modelling symbols and types dependencies as a
> > +# directed graph, marking nodes reachable from the ELF symbols and
> > +# dropping unreachable nodes.
> > +
> > +use strict;
> > +use warnings;
> > +no warnings 'recursion';
> > +
> > +use autodie;
> > +
> > +use XML::LibXML;
> > +
> > +# Making a graph from ABI XML.
> > +#
> > +# The following are the types of "node" we care about. The "edges" are
> > +# obtained from a few XML attributes as well as via XML element
> > +# containment.
> > +#
> > +#  ELF (exported) symbols
> > +#
> > +#   elf-symbol (has a name; the code here currently knows nothing
> > +#     about aliases)
> > +#
> > +#  Declarations (that mention a symbol)
> > +#
> > +#   These live in a scope. In C++ scopes can be nested and include
> > +#   namespaces and class types.
> > +#
> > +#  var-decl (also used for member variables)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    type-id links to a type
> > +#  function-decl (also used for member functions)
> > +#    elf-symbol linked to symbol via mangled-name
> > +#    parameter and return type-ids link to types
> > +#      (alas, not just a link to a function type)
> > +#
> > +#  Types
> > +#
> > +#   These occupy pretty much all the other elements, besides those
> > +#   that act as simple containers.
> > +#
> > +# XML elements and their roles in more detail
> > +#
> > +# ELF
> > +#
> > +# elf-needed - container
> > +#  dependency - names a library
> > +# elf-function-symbols - contains a list of symbols
> > +# elf-variable-symbols - contains a list of symbols
> > +#  elf-symbol - describes an ELF variable or function
> > +#
> > +# Grouping and scoping
> > +#
> > +# abi-corpus-group
> > +#  abi-corpus
> > +#   abi-instr - compilation unit containers
> > +#    namespace-decl - pure container, possibly named
> > +#
> > +# Types (some introduce scopes, only in C++)
> > +#
> > +# type-decl - defines a primitive type
> > +# typedef-decl - defines a type, links to a type
> > +# qualified-type-def - defines a type, links to a type
> > +# pointer-type-def - defines a type, links to a type
> > +# reference-type-def - defines a type, links to a type
> > +# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
> > +#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
> > +# function-type - defines a type
> > +#  parameter - belongs to function-type and -decl, links to a type
> > +#  return - belongs to function-type and -decl, links to a type
> > +# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
> > +#  underlying-type - belongs to enum-decl
> > +#  enumerator - belongs to enum-decl
> > +# union-decl - defines and names a type, contains member elements linked to other things
> > +# class-decl - defines and names a type, contains base type, member elements linking to other things
> > +#  base-class - belongs to class-decl
> > +#  data-member - container for a member; holds access level
> > +#  member-function - container for a member; holds access level
> > +#  member-type - container for a type declaration; holds access level
> > +#  member-template - container for a (function?) template declaration; holds access level
> > +#
> > +# Higher order Things
> > +#
> > +# class-template-decl - defines a "type", but without instantiation this isn't usable
> > +# function-template-decl - defines a "type", but without instantiation this isn't usable
> > +#  template-type-parameter - defines a type (formal variable), perhaps one which should be excluded from the real type graph
> > +#  template-non-type-parameter - names a template parameter, links to a type
> > +#  template-parameter-type-composition - container?
> > +#
> > +# Values
> > +#
> > +# function-decl - names a function, can link to a symbol
> > +#     has same children as function-type, rather than linking to a type
> > +# var-decl - names a variable, can link to a symbol, links to a type
> > +
> > +# Graph nodes.
> > +my %symbols;
> > +my %exported_symbols;
> > +my %types;
> > +# Graph edges.
> > +my %type_type_edges;
> > +my %symbol_type_edges;
> > +my %type_symbol_edges;
> > +
> > +# Keep track of the types being defined.
> > +my @id_stack;
> > +# Keep track of the current (value) declaration.
> > +my @symbol_stack;
> > +sub process($);
> > +sub process($) {
> > +  my ($node) = @_;
> > +
> > +  # The XML attributes we care about.
> > +  my $name;
> > +  my $id;
> > +  my $type_id;
> > +  my $symbol;
> > +  my $naming_typedef_id;
> > +
> > +  # Not every node we encounter is an XML element.
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $name = $node->getAttribute('name');
> > +    $id = $node->getAttribute('id');
> > +    $type_id = $node->getAttribute('type-id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  if (defined $name && $node->getName() eq 'elf-symbol') {
> > +    $exported_symbols{$name} = undef;
> > +    # Early return is safe, but not necessary.
> > +    return;
> > +  }
> > +
> > +  if (defined $id) {
> > +    # This element defines a type (but there may be more than one
> > +    # defining the same type - we cannot rely on uniqueness).
> > +    $types{$id} = undef;
> > +    if (defined $naming_typedef_id) {
> > +      # This is an odd one, there can be a backwards link from an
> > +      # anonymous type to the typedef that refers to it, so we need to
> > +      # pull in the typedef, even if nothing else refers to it.
> > +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
> > +    }
> > +    if (@id_stack) {
> > +      # Type parent<->child dependencies; record dependencies both
> > +      # ways to avoid making holes in the XML types.
> > +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
> > +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
> > +    }
> > +    push @id_stack, $id;
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    # This element is a declaration linked to a symbol (whether or not
> > +    # exported).
> > +    $symbols{$symbol} = undef;
> > +    if (@id_stack) {
> > +      # The enclosing scope may include a C++ class, in which case we
> > +      # need to record the dependency.
> > +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
> > +    }
> > +    # The symbol depends on the types mentioned in this element, so
> > +    # record it.
> > +    push @symbol_stack, $symbol;
> > +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
> > +  }
> > +
> > +  if (defined $type_id) {
> > +    if (@id_stack) {
> > +      # The enclosing type refers to this type, record the dependency.
> > +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
> > +    }
> > +    if (@symbol_stack) {
> > +      # The enclosing declaration (linked to a symbol) refers to this
> > +      # type, record the dependency.
> > +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
> > +    }
> > +  }
> > +
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    process($child);
> > +  }
> > +
> > +  if (defined $symbol) {
> > +    pop @symbol_stack;
> > +  }
> > +  if (defined $id) {
> > +    pop @id_stack;
> > +  }
> > +}
> > +
> > +# Load the XML.
> > +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
> > +# Build a graph.
> > +process($dom);
> > +die if @id_stack or @symbol_stack;
> > +
> > +# DFS visited state.
> > +my %seen_types;
> > +my %seen_symbols;
> > +sub dfs_type($);
> > +sub dfs_symbol($);
> > +
> > +sub dfs_type($) {
> > +  my ($type) = @_;
> > +  return if exists $seen_types{$type};
> > +  $seen_types{$type} = undef;
> > +
> > +  my $types = $type_type_edges{$type};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      for my $link (keys %{$types->{$type}}) {
> > +        if (exists $types->{$type}{$link}) {
> > +          dfs_type($type);
> > +        }
> > +      }
> > +    }
> > +  }
> > +  my $symbols = $type_symbol_edges{$type};
> > +  if (defined $symbols) {
> > +    for my $symbol (keys %$symbols) {
> > +      dfs_symbol($symbol);
> > +    }
> > +  }
> > +}
> > +
> > +sub dfs_symbol($) {
> > +  my ($symbol) = @_;
> > +  return if exists $seen_symbols{$symbol};
> > +  $seen_symbols{$symbol} = undef;
> > +
> > +  my $types = $symbol_type_edges{$symbol};
> > +  if (defined $types) {
> > +    for my $type (keys %$types) {
> > +      dfs_type($type);
> > +    }
> > +  } else {
> > +    warn "no declaration found for exported symbol $symbol\n";
> > +  }
> > +}
> > +
> > +# Traverse the graph, using the exported symbols as the roots.
> > +for my $symbol (keys %exported_symbols) {
> > +  dfs_symbol($symbol);
> > +}
> > +
> > +# Useful counts.
> > +sub print_report() {
> > +  my $count_symbols = scalar keys %symbols;
> > +  my $count_exported_symbols = scalar keys %exported_symbols;
> > +  my $count_types = scalar keys %types;
> > +  my $count_reachable_types = scalar keys %seen_types;
> > +
> > +  print STDERR qq{symbols = $count_symbols
> > +exported symbols = $count_exported_symbols
> > +types = $count_types
> > +reachable types = $count_reachable_types
> > +};
> > +}
> > +
> > +my %drop_if_empty = map { $_ => undef } qw(
> > +  namespace-decl
> > +  abi-instr
> > +  abi-corpus
> > +  abi-corpus-group
> > +);
> > +
> > +# This is another DFS traversal of the XML.
> > +sub prune($);
> > +sub prune($) {
> > +  my ($node) = @_;
> > +
> > +  my $name = $node->getName();
> > +  my $id;
> > +  my $symbol;
> > +
> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
> > +    $id = $node->getAttribute('id');
> > +    $symbol = $node->getAttribute('mangled-name');
> > +    die if defined $id && defined $symbol;
> > +  }
> > +
> > +  # Return if we know that this is a type or declaration to keep or
> > +  # drop in its entirety.
> > +  if (defined $id) {
> > +    return !exists $seen_types{$id};
> > +  }
> > +  if ($name eq 'var-decl' || $name eq 'function-decl') {
> > +    return !defined $symbol || !exists $exported_symbols{$symbol};
> > +  }
> > +
> > +  # Otherwise, this is not a type, declaration or part thereof, so
> > +  # process child elements.
> > +  my $empty = 1;
> > +  for my $child ($node->nonBlankChildNodes()) {
> > +    if (prune($child)) {
> > +      $node->removeChild($child);
> > +    } else {
> > +      $empty = 0;
> > +    }
> > +  }
> > +
> > +  # Empty container elements can be dropped.
> > +  return $empty && exists $drop_if_empty{$name};
> > +}
> > +
> > +# Drop unreachable nodes and any container nodes that end up empty.
> > +prune($dom);
> > +my $out = $dom->toString();
> > +# Remove blank line debris.
> > +$out =~ s;^ *\n;;mg;
> > +# Remove the XML declaration as abidiff doesn't like it.
> > +$out =~ s;^<\?xml .*\?>\n;;m;
> > +print $out;
> > -- 
> > 2.31.0.rc2.261.g7f71774620-goog
> >
  
Giuliano Procida March 15, 2021, 11:08 a.m. UTC | #4
Hi.

On Fri, 12 Mar 2021 at 18:41, Ben Woodard <woodard@redhat.com> wrote:

> if you look at abicompat there are functions like:
> perform_compat_check_in_*_mode(options& opts,
>                                     diff_context_sptr& ctxt,
>                                     corpus_sptr app_corpus,
>                                     corpus_sptr lib1_corpus,
>                                     corpus_sptr lib2_corpus)
> which ultimately does something like your perl script’s ELF reachability
> check when computing compatibility between libraries. It iterates through
> the undefined symbols and then calls get_sym_ids_of_*_to_keep().
>
>
So abicompat is a bit like abidiff -w symbol_list where the symbol_list is
just taken from an application.


> I would think that you would be able to essentially do the same thing
> inside of abidw to have it prune you abixml file in a similar way to how
> your perl script does.
>
>
Here the symbol list would be taken to be the ELF symbols. The harder part
is implementing the reachability traversal.


> (Paranthetical Note: one thing that really bothers me about those
> functions which I hope to work on as soon as I get a chance is I believe
> that they need to take const corpus_sptr’s for the app and lib1 lib2
> corpuses. Of course this means that the data structures would have to
> change and the sym_ids_of_*_to_keep() would need to be some sort of
> external data structure rather than something inside the corpus itself,
> I’ve been calling this a subcorpus as I’ve been thinking about how to make
> this change happen.
>
> The problem with the way that it is now is when using libabigail as a
> library, rather than as it is used in a tool like abicompat you want to be
> able to load a collection of corpora which takes non-trivial time in many
> cases and use them over and over. For example say you want to load the
> corpora of many variations of libraries and find the ones that are
> compatible with each other. You are going to want to use the app corpus
> over and over without having to reload it from the ELF or abixml each and
> every time. Yeah as a first pass you could clear the vectors with the
> *_to_keep but then you would only be able to use the corpus in one thread
> at a time. I think it would make better library design to move those things
> to the side.)
>
>
Some of the existing traversals in libabigail also have the notion of
"visited" as part of another object (like diff_context) and so mutate this
object, rather than being instantiated just for the given traversal. I
agree that it would be good to separate out this kind of interpretive state
from underlying objects.

In the case of optionally pruning unreachable symbols and types, having
extra state in read_context would make what's a very large object even
larger.

Logically, what we'd want to happen is something like:

set<node> reachable(const set<node> & roots, const corpus & corpus) {
  set<node> visited;
  for (const auto & root : roots)
    visit(visited, root, corpus);
  return visited;
}

corpus = read_corpus();
if (prune) {
  auto roots = exported(corpus);
  auto reachable = reachable(roots, corpus);
  corpus = filter(reachable, corpus);
}
write_corpus(corpus);

On Mar 12, 2021, at 8:51 AM, Giuliano Procida <gprocida@google.com> wrote:
>
> Hi.
>
> On Thu, 11 Mar 2021 at 20:39, Ben Woodard <woodard@redhat.com> wrote:
>
>> I think that you have pointed out a problem that I’ve been discussing
>> with some users that I’m working with. The sheer volume of the abixml
>> output is pretty extreme. I don’t think that they were really expecting it.
>> However, they are not running into the particular situation that you are
>> seeing since they don’t have the coding model of the problem library. They
>> are more likely to fall in the 5-10% range that you reported above.
>>
>> We have been talking about it in terms of “ABI Surface” vs. “ABI Corpus”.
>> The ELF reachability aspect that you point out seems to be part of the
>> difference between those two concepts.
>>
>>
> We want a distilled description of the ABI (the surface), not what may lie
> on the other side. More precisely, we want a typed interface where "type"
> encompasses extra things like architecture, calling convention, ISA, ELF
> symbol properties etc.
>
> One worry is that just chasing references from the exported ELF symbols
> may somehow miss something important, say for libraries that ship with bits
> of implementation in header files.
>
>
> Right now given the implementation above, it looks like ELF reachability
> is the only thing that abicompat takes into consideration. What you are
> saying potentially has important consequences because it could suggest that
> abicompat may be missing things.
>
> Do you know of or could you construct an example that demonstrates a case
> that would be missed by this ELF reachability? It may mean that we need to
> look at ways to enhance the checks in abicompat and in doing so they could
> also be part of the code that you use to prune the output of abidw.
>
>
Not really. My main thoughts were constructors and (as you mention) inline
functions in general.

If we think abouts what's being pruned:

1. declarations of functions and variables, absent from the ELF symbol table

These are either unusable (link failures) or have an implementation
entirely within header files.

In the simplest cases, a constexpr value or the behaviour of an inline
function could change between library versions, or their types could
change. We cannot possibly detect all of these kinds of things. The changes
may well be incompatible.

Note that abidiff doesn't report any such changes; I checked both with
abidiff -w and using objcopy to strip unwanted symbols.

2. types that are not reachable from any of the ELF symbols

If the types are not POD and caused the emission of constructors etc., then
the functions are covered by 1.

For POD types, the user can have code that builds, consumes and mutates
objects of these types. But it has no way to pass such objects through the
library interface (symbols) unless the types are hidden behind things like
void *.

If in some future revision of libabigail, we had a mechanism to say "this
void could actually be one of these other types" in the ABI, then the types
would become reachable, diffable and also not pruned by abiprune.

The thing that pops into my mind would be inline functions defined in the
> header files. We may be able to find those by searching through the TUs but
> then there are things like cpp macros even with -g3 I’m not sure we could
> find those but I’m not entirely sure we would need to.
>
> I really do believe that this topic needs additional investigation.
>
>
I think the question to answer is roughly: what can we get from the library
headers that is not reachable from the export ELF symbols? And the answer
consists of the values of all defines, macros, constants, inline functions.

However, it may be necessary to subtract some of the OS "background" noise.
Just because a random dependency of the library introduces a new constant,
it doesn't mean we care. We'd also need to decide where to draw a line
(which header files are excluded).

At one end of the spectrum, we'd need a sophisticated parser and deep
knowledge of the language semantics and at the other we'd just offer up
header diffs. How much of this were you hoping would be in the TUs (bearing
in mind that the library may not even use everything it makes available to
client code)?


> -ben
>
>
Giuliano.


>
>> > On Mar 11, 2021, at 3:53 AM, Giuliano Procida via Libabigail <
>> libabigail@sourceware.org> wrote:
>> >
>> > In response to internal reports about very large ABIs for libraires
>> > implemented in C++ but exposing only a C API, I wrote a script as a
>> > means of investigating the issue.
>> >
>> > The script aims to remove parts of the XML that are irrelevant to
>> > abidiff. It has been run on the current libabigail tests as well as
>> > some locally-generated ABI files.
>> >
>> > On most XML files, this results in a modest saving (5-10%). However,
>> > for the library that sparked the invesigaton, the resulting XML was
>> > more than 2300 times smaller.
>> >
>> > The script may have some merit on its own but perhaps should be
>> > implemented differently.
>> >
>> > Question: How hard would it be to implement this functionality (under
>> > flag control) within libabigail itself?
>>
>> I can’t directly answer “how hard” that is one of the things which we
>> would discover by actually doing it. As for should it be done. I think it
>> should be carefully looked into. However, as for priority since it is an
>> optimization rather than a functional problem, if it were me, I would file
>> this as a bug and put examples of the test data and results you are seeing
>> as well as the script below in that bz and then start working on the
>> process of refining libabigail’s output so that it matches the results of
>> your script.
>>
>>
> It's an optimisation analogous to DCE in a compiler. I'll certainly follow
> up with a bug. It's one library in particular that is ~50M before pruning
> and ~22k after.
>
> I've already made changes to the script so that it can be given a list of
> symbols and produce a reduced ABI. At the moment, we achieve this by
> passing a list to abidw (with -w). In fact, we extract full and reduced
> ABIs concurrently in the Android kernel build as each run takes a while. It
> should be straightforward to reimplement the script in C++, though it's not
> too slow already.
>
> In terms of implementation, it would need to work with libabigail IR and
> ir_node_visitor might be the right place to start.
>
> Giuliano.
>
>
>> -ben
>>
>> >
>> > A couple of tangential things worth mentioning (see the comments at
>> > the very beginning and very end of the script):
>> >
>> > - abidiff rejects XML files with an XML declaration at the top
>> > - abidiff rejects completely empty files
>> >
>> > Resolving the first would make the second moot.
>> >
>> >       * scripts/abiprune.pl: Add script to prune ABI XML.
>> >
>> > Signed-off-by: Giuliano Procida <gprocida@google.com>
>> > ---
>> > scripts/abiprune.pl | 323 ++++++++++++++++++++++++++++++++++++++++++++
>> > 1 file changed, 323 insertions(+)
>> > create mode 100755 scripts/abiprune.pl
>> >
>> > diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
>> > new file mode 100755
>> > index 00000000..8e04a273
>> > --- /dev/null
>> > +++ b/scripts/abiprune.pl
>> > @@ -0,0 +1,323 @@
>> > +#!/usr/bin/perl
>> > +
>> > +# This script is intended to consume libabigail ABI XML as generated
>> > +# by abidw and produce a possibly smaller representation that captures
>> > +# the same ABI. In particular, the output should be such that abidiff
>> > +# --harmless reports no differences (or is empty).
>> > +#
>> > +# It does this by modelling symbols and types dependencies as a
>> > +# directed graph, marking nodes reachable from the ELF symbols and
>> > +# dropping unreachable nodes.
>> > +
>> > +use strict;
>> > +use warnings;
>> > +no warnings 'recursion';
>> > +
>> > +use autodie;
>> > +
>> > +use XML::LibXML;
>> > +
>> > +# Making a graph from ABI XML.
>> > +#
>> > +# The following are the types of "node" we care about. The "edges" are
>> > +# obtained from a few XML attributes as well as via XML element
>> > +# containment.
>> > +#
>> > +#  ELF (exported) symbols
>> > +#
>> > +#   elf-symbol (has a name; the code here currently knows nothing
>> > +#     about aliases)
>> > +#
>> > +#  Declarations (that mention a symbol)
>> > +#
>> > +#   These live in a scope. In C++ scopes can be nested and include
>> > +#   namespaces and class types.
>> > +#
>> > +#  var-decl (also used for member variables)
>> > +#    elf-symbol linked to symbol via mangled-name
>> > +#    type-id links to a type
>> > +#  function-decl (also used for member functions)
>> > +#    elf-symbol linked to symbol via mangled-name
>> > +#    parameter and return type-ids link to types
>> > +#      (alas, not just a link to a function type)
>> > +#
>> > +#  Types
>> > +#
>> > +#   These occupy pretty much all the other elements, besides those
>> > +#   that act as simple containers.
>> > +#
>> > +# XML elements and their roles in more detail
>> > +#
>> > +# ELF
>> > +#
>> > +# elf-needed - container
>> > +#  dependency - names a library
>> > +# elf-function-symbols - contains a list of symbols
>> > +# elf-variable-symbols - contains a list of symbols
>> > +#  elf-symbol - describes an ELF variable or function
>> > +#
>> > +# Grouping and scoping
>> > +#
>> > +# abi-corpus-group
>> > +#  abi-corpus
>> > +#   abi-instr - compilation unit containers
>> > +#    namespace-decl - pure container, possibly named
>> > +#
>> > +# Types (some introduce scopes, only in C++)
>> > +#
>> > +# type-decl - defines a primitive type
>> > +# typedef-decl - defines a type, links to a type
>> > +# qualified-type-def - defines a type, links to a type
>> > +# pointer-type-def - defines a type, links to a type
>> > +# reference-type-def - defines a type, links to a type
>> > +# array-type-def - defines a (multidimensional array) type, refers to
>> element type, contains subranges
>> > +#  subrange - contains array length, refers to element type; defines
>> types (never referred to; duplicated)
>> > +# function-type - defines a type
>> > +#  parameter - belongs to function-type and -decl, links to a type
>> > +#  return - belongs to function-type and -decl, links to a type
>> > +# enum-decl - defines a type, names it, contains a list of enumerators
>> and an underlying-type
>> > +#  underlying-type - belongs to enum-decl
>> > +#  enumerator - belongs to enum-decl
>> > +# union-decl - defines and names a type, contains member elements
>> linked to other things
>> > +# class-decl - defines and names a type, contains base type, member
>> elements linking to other things
>> > +#  base-class - belongs to class-decl
>> > +#  data-member - container for a member; holds access level
>> > +#  member-function - container for a member; holds access level
>> > +#  member-type - container for a type declaration; holds access level
>> > +#  member-template - container for a (function?) template declaration;
>> holds access level
>> > +#
>> > +# Higher order Things
>> > +#
>> > +# class-template-decl - defines a "type", but without instantiation
>> this isn't usable
>> > +# function-template-decl - defines a "type", but without instantiation
>> this isn't usable
>> > +#  template-type-parameter - defines a type (formal variable), perhaps
>> one which should be excluded from the real type graph
>> > +#  template-non-type-parameter - names a template parameter, links to
>> a type
>> > +#  template-parameter-type-composition - container?
>> > +#
>> > +# Values
>> > +#
>> > +# function-decl - names a function, can link to a symbol
>> > +#     has same children as function-type, rather than linking to a type
>> > +# var-decl - names a variable, can link to a symbol, links to a type
>> > +
>> > +# Graph nodes.
>> > +my %symbols;
>> > +my %exported_symbols;
>> > +my %types;
>> > +# Graph edges.
>> > +my %type_type_edges;
>> > +my %symbol_type_edges;
>> > +my %type_symbol_edges;
>> > +
>> > +# Keep track of the types being defined.
>> > +my @id_stack;
>> > +# Keep track of the current (value) declaration.
>> > +my @symbol_stack;
>> > +sub process($);
>> > +sub process($) {
>> > +  my ($node) = @_;
>> > +
>> > +  # The XML attributes we care about.
>> > +  my $name;
>> > +  my $id;
>> > +  my $type_id;
>> > +  my $symbol;
>> > +  my $naming_typedef_id;
>> > +
>> > +  # Not every node we encounter is an XML element.
>> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
>> > +    $name = $node->getAttribute('name');
>> > +    $id = $node->getAttribute('id');
>> > +    $type_id = $node->getAttribute('type-id');
>> > +    $symbol = $node->getAttribute('mangled-name');
>> > +    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
>> > +    die if defined $id && defined $symbol;
>> > +  }
>> > +
>> > +  if (defined $name && $node->getName() eq 'elf-symbol') {
>> > +    $exported_symbols{$name} = undef;
>> > +    # Early return is safe, but not necessary.
>> > +    return;
>> > +  }
>> > +
>> > +  if (defined $id) {
>> > +    # This element defines a type (but there may be more than one
>> > +    # defining the same type - we cannot rely on uniqueness).
>> > +    $types{$id} = undef;
>> > +    if (defined $naming_typedef_id) {
>> > +      # This is an odd one, there can be a backwards link from an
>> > +      # anonymous type to the typedef that refers to it, so we need to
>> > +      # pull in the typedef, even if nothing else refers to it.
>> > +      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
>> > +    }
>> > +    if (@id_stack) {
>> > +      # Type parent<->child dependencies; record dependencies both
>> > +      # ways to avoid making holes in the XML types.
>> > +      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
>> > +      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
>> > +    }
>> > +    push @id_stack, $id;
>> > +  }
>> > +
>> > +  if (defined $symbol) {
>> > +    # This element is a declaration linked to a symbol (whether or not
>> > +    # exported).
>> > +    $symbols{$symbol} = undef;
>> > +    if (@id_stack) {
>> > +      # The enclosing scope may include a C++ class, in which case we
>> > +      # need to record the dependency.
>> > +      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
>> > +    }
>> > +    # The symbol depends on the types mentioned in this element, so
>> > +    # record it.
>> > +    push @symbol_stack, $symbol;
>> > +    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
>> > +  }
>> > +
>> > +  if (defined $type_id) {
>> > +    if (@id_stack) {
>> > +      # The enclosing type refers to this type, record the dependency.
>> > +      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
>> > +    }
>> > +    if (@symbol_stack) {
>> > +      # The enclosing declaration (linked to a symbol) refers to this
>> > +      # type, record the dependency.
>> > +      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
>> > +    }
>> > +  }
>> > +
>> > +  for my $child ($node->nonBlankChildNodes()) {
>> > +    process($child);
>> > +  }
>> > +
>> > +  if (defined $symbol) {
>> > +    pop @symbol_stack;
>> > +  }
>> > +  if (defined $id) {
>> > +    pop @id_stack;
>> > +  }
>> > +}
>> > +
>> > +# Load the XML.
>> > +my $dom = XML::LibXML->load_xml(IO => \*STDIN);
>> > +# Build a graph.
>> > +process($dom);
>> > +die if @id_stack or @symbol_stack;
>> > +
>> > +# DFS visited state.
>> > +my %seen_types;
>> > +my %seen_symbols;
>> > +sub dfs_type($);
>> > +sub dfs_symbol($);
>> > +
>> > +sub dfs_type($) {
>> > +  my ($type) = @_;
>> > +  return if exists $seen_types{$type};
>> > +  $seen_types{$type} = undef;
>> > +
>> > +  my $types = $type_type_edges{$type};
>> > +  if (defined $types) {
>> > +    for my $type (keys %$types) {
>> > +      for my $link (keys %{$types->{$type}}) {
>> > +        if (exists $types->{$type}{$link}) {
>> > +          dfs_type($type);
>> > +        }
>> > +      }
>> > +    }
>> > +  }
>> > +  my $symbols = $type_symbol_edges{$type};
>> > +  if (defined $symbols) {
>> > +    for my $symbol (keys %$symbols) {
>> > +      dfs_symbol($symbol);
>> > +    }
>> > +  }
>> > +}
>> > +
>> > +sub dfs_symbol($) {
>> > +  my ($symbol) = @_;
>> > +  return if exists $seen_symbols{$symbol};
>> > +  $seen_symbols{$symbol} = undef;
>> > +
>> > +  my $types = $symbol_type_edges{$symbol};
>> > +  if (defined $types) {
>> > +    for my $type (keys %$types) {
>> > +      dfs_type($type);
>> > +    }
>> > +  } else {
>> > +    warn "no declaration found for exported symbol $symbol\n";
>> > +  }
>> > +}
>> > +
>> > +# Traverse the graph, using the exported symbols as the roots.
>> > +for my $symbol (keys %exported_symbols) {
>> > +  dfs_symbol($symbol);
>> > +}
>> > +
>> > +# Useful counts.
>> > +sub print_report() {
>> > +  my $count_symbols = scalar keys %symbols;
>> > +  my $count_exported_symbols = scalar keys %exported_symbols;
>> > +  my $count_types = scalar keys %types;
>> > +  my $count_reachable_types = scalar keys %seen_types;
>> > +
>> > +  print STDERR qq{symbols = $count_symbols
>> > +exported symbols = $count_exported_symbols
>> > +types = $count_types
>> > +reachable types = $count_reachable_types
>> > +};
>> > +}
>> > +
>> > +my %drop_if_empty = map { $_ => undef } qw(
>> > +  namespace-decl
>> > +  abi-instr
>> > +  abi-corpus
>> > +  abi-corpus-group
>> > +);
>> > +
>> > +# This is another DFS traversal of the XML.
>> > +sub prune($);
>> > +sub prune($) {
>> > +  my ($node) = @_;
>> > +
>> > +  my $name = $node->getName();
>> > +  my $id;
>> > +  my $symbol;
>> > +
>> > +  if ($node->nodeType == XML_ELEMENT_NODE) {
>> > +    $id = $node->getAttribute('id');
>> > +    $symbol = $node->getAttribute('mangled-name');
>> > +    die if defined $id && defined $symbol;
>> > +  }
>> > +
>> > +  # Return if we know that this is a type or declaration to keep or
>> > +  # drop in its entirety.
>> > +  if (defined $id) {
>> > +    return !exists $seen_types{$id};
>> > +  }
>> > +  if ($name eq 'var-decl' || $name eq 'function-decl') {
>> > +    return !defined $symbol || !exists $exported_symbols{$symbol};
>> > +  }
>> > +
>> > +  # Otherwise, this is not a type, declaration or part thereof, so
>> > +  # process child elements.
>> > +  my $empty = 1;
>> > +  for my $child ($node->nonBlankChildNodes()) {
>> > +    if (prune($child)) {
>> > +      $node->removeChild($child);
>> > +    } else {
>> > +      $empty = 0;
>> > +    }
>> > +  }
>> > +
>> > +  # Empty container elements can be dropped.
>> > +  return $empty && exists $drop_if_empty{$name};
>> > +}
>> > +
>> > +# Drop unreachable nodes and any container nodes that end up empty.
>> > +prune($dom);
>> > +my $out = $dom->toString();
>> > +# Remove blank line debris.
>> > +$out =~ s;^ *\n;;mg;
>> > +# Remove the XML declaration as abidiff doesn't like it.
>> > +$out =~ s;^<\?xml .*\?>\n;;m;
>> > +print $out;
>> > --
>> > 2.31.0.rc2.261.g7f71774620-goog
>> >
>
>
>
  

Patch

diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
new file mode 100755
index 00000000..8e04a273
--- /dev/null
+++ b/scripts/abiprune.pl
@@ -0,0 +1,323 @@ 
+#!/usr/bin/perl
+
+# This script is intended to consume libabigail ABI XML as generated
+# by abidw and produce a possibly smaller representation that captures
+# the same ABI. In particular, the output should be such that abidiff
+# --harmless reports no differences (or is empty).
+#
+# It does this by modelling symbols and types dependencies as a
+# directed graph, marking nodes reachable from the ELF symbols and
+# dropping unreachable nodes.
+
+use strict;
+use warnings;
+no warnings 'recursion';
+
+use autodie;
+
+use XML::LibXML;
+
+# Making a graph from ABI XML.
+#
+# The following are the types of "node" we care about. The "edges" are
+# obtained from a few XML attributes as well as via XML element
+# containment.
+#
+#  ELF (exported) symbols
+#
+#   elf-symbol (has a name; the code here currently knows nothing
+#     about aliases)
+#
+#  Declarations (that mention a symbol)
+#
+#   These live in a scope. In C++ scopes can be nested and include
+#   namespaces and class types.
+#
+#  var-decl (also used for member variables)
+#    elf-symbol linked to symbol via mangled-name
+#    type-id links to a type
+#  function-decl (also used for member functions)
+#    elf-symbol linked to symbol via mangled-name
+#    parameter and return type-ids link to types
+#      (alas, not just a link to a function type)
+#
+#  Types
+#
+#   These occupy pretty much all the other elements, besides those
+#   that act as simple containers.
+#
+# XML elements and their roles in more detail
+#
+# ELF
+#
+# elf-needed - container
+#  dependency - names a library
+# elf-function-symbols - contains a list of symbols
+# elf-variable-symbols - contains a list of symbols
+#  elf-symbol - describes an ELF variable or function
+#
+# Grouping and scoping
+#
+# abi-corpus-group
+#  abi-corpus
+#   abi-instr - compilation unit containers
+#    namespace-decl - pure container, possibly named
+#
+# Types (some introduce scopes, only in C++)
+#
+# type-decl - defines a primitive type
+# typedef-decl - defines a type, links to a type
+# qualified-type-def - defines a type, links to a type
+# pointer-type-def - defines a type, links to a type
+# reference-type-def - defines a type, links to a type
+# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
+#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
+# function-type - defines a type
+#  parameter - belongs to function-type and -decl, links to a type
+#  return - belongs to function-type and -decl, links to a type
+# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
+#  underlying-type - belongs to enum-decl
+#  enumerator - belongs to enum-decl
+# union-decl - defines and names a type, contains member elements linked to other things
+# class-decl - defines and names a type, contains base type, member elements linking to other things
+#  base-class - belongs to class-decl
+#  data-member - container for a member; holds access level
+#  member-function - container for a member; holds access level
+#  member-type - container for a type declaration; holds access level
+#  member-template - container for a (function?) template declaration; holds access level
+#
+# Higher order Things
+#
+# class-template-decl - defines a "type", but without instantiation this isn't usable
+# function-template-decl - defines a "type", but without instantiation this isn't usable
+#  template-type-parameter - defines a type (formal variable), perhaps one which should be excluded from the real type graph
+#  template-non-type-parameter - names a template parameter, links to a type
+#  template-parameter-type-composition - container?
+#
+# Values
+#
+# function-decl - names a function, can link to a symbol
+#     has same children as function-type, rather than linking to a type
+# var-decl - names a variable, can link to a symbol, links to a type
+
+# Graph nodes.
+my %symbols;
+my %exported_symbols;
+my %types;
+# Graph edges.
+my %type_type_edges;
+my %symbol_type_edges;
+my %type_symbol_edges;
+
+# Keep track of the types being defined.
+my @id_stack;
+# Keep track of the current (value) declaration.
+my @symbol_stack;
+sub process($);
+sub process($) {
+  my ($node) = @_;
+
+  # The XML attributes we care about.
+  my $name;
+  my $id;
+  my $type_id;
+  my $symbol;
+  my $naming_typedef_id;
+
+  # Not every node we encounter is an XML element.
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $name = $node->getAttribute('name');
+    $id = $node->getAttribute('id');
+    $type_id = $node->getAttribute('type-id');
+    $symbol = $node->getAttribute('mangled-name');
+    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
+    die if defined $id && defined $symbol;
+  }
+
+  if (defined $name && $node->getName() eq 'elf-symbol') {
+    $exported_symbols{$name} = undef;
+    # Early return is safe, but not necessary.
+    return;
+  }
+
+  if (defined $id) {
+    # This element defines a type (but there may be more than one
+    # defining the same type - we cannot rely on uniqueness).
+    $types{$id} = undef;
+    if (defined $naming_typedef_id) {
+      # This is an odd one, there can be a backwards link from an
+      # anonymous type to the typedef that refers to it, so we need to
+      # pull in the typedef, even if nothing else refers to it.
+      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
+    }
+    if (@id_stack) {
+      # Type parent<->child dependencies; record dependencies both
+      # ways to avoid making holes in the XML types.
+      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
+      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
+    }
+    push @id_stack, $id;
+  }
+
+  if (defined $symbol) {
+    # This element is a declaration linked to a symbol (whether or not
+    # exported).
+    $symbols{$symbol} = undef;
+    if (@id_stack) {
+      # The enclosing scope may include a C++ class, in which case we
+      # need to record the dependency.
+      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
+    }
+    # The symbol depends on the types mentioned in this element, so
+    # record it.
+    push @symbol_stack, $symbol;
+    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
+  }
+
+  if (defined $type_id) {
+    if (@id_stack) {
+      # The enclosing type refers to this type, record the dependency.
+      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
+    }
+    if (@symbol_stack) {
+      # The enclosing declaration (linked to a symbol) refers to this
+      # type, record the dependency.
+      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
+    }
+  }
+
+  for my $child ($node->nonBlankChildNodes()) {
+    process($child);
+  }
+
+  if (defined $symbol) {
+    pop @symbol_stack;
+  }
+  if (defined $id) {
+    pop @id_stack;
+  }
+}
+
+# Load the XML.
+my $dom = XML::LibXML->load_xml(IO => \*STDIN);
+# Build a graph.
+process($dom);
+die if @id_stack or @symbol_stack;
+
+# DFS visited state.
+my %seen_types;
+my %seen_symbols;
+sub dfs_type($);
+sub dfs_symbol($);
+
+sub dfs_type($) {
+  my ($type) = @_;
+  return if exists $seen_types{$type};
+  $seen_types{$type} = undef;
+
+  my $types = $type_type_edges{$type};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      for my $link (keys %{$types->{$type}}) {
+        if (exists $types->{$type}{$link}) {
+          dfs_type($type);
+        }
+      }
+    }
+  }
+  my $symbols = $type_symbol_edges{$type};
+  if (defined $symbols) {
+    for my $symbol (keys %$symbols) {
+      dfs_symbol($symbol);
+    }
+  }
+}
+
+sub dfs_symbol($) {
+  my ($symbol) = @_;
+  return if exists $seen_symbols{$symbol};
+  $seen_symbols{$symbol} = undef;
+
+  my $types = $symbol_type_edges{$symbol};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      dfs_type($type);
+    }
+  } else {
+    warn "no declaration found for exported symbol $symbol\n";
+  }
+}
+
+# Traverse the graph, using the exported symbols as the roots.
+for my $symbol (keys %exported_symbols) {
+  dfs_symbol($symbol);
+}
+
+# Useful counts.
+sub print_report() {
+  my $count_symbols = scalar keys %symbols;
+  my $count_exported_symbols = scalar keys %exported_symbols;
+  my $count_types = scalar keys %types;
+  my $count_reachable_types = scalar keys %seen_types;
+
+  print STDERR qq{symbols = $count_symbols
+exported symbols = $count_exported_symbols
+types = $count_types
+reachable types = $count_reachable_types
+};
+}
+
+my %drop_if_empty = map { $_ => undef } qw(
+  namespace-decl
+  abi-instr
+  abi-corpus
+  abi-corpus-group
+);
+
+# This is another DFS traversal of the XML.
+sub prune($);
+sub prune($) {
+  my ($node) = @_;
+
+  my $name = $node->getName();
+  my $id;
+  my $symbol;
+
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $id = $node->getAttribute('id');
+    $symbol = $node->getAttribute('mangled-name');
+    die if defined $id && defined $symbol;
+  }
+
+  # Return if we know that this is a type or declaration to keep or
+  # drop in its entirety.
+  if (defined $id) {
+    return !exists $seen_types{$id};
+  }
+  if ($name eq 'var-decl' || $name eq 'function-decl') {
+    return !defined $symbol || !exists $exported_symbols{$symbol};
+  }
+
+  # Otherwise, this is not a type, declaration or part thereof, so
+  # process child elements.
+  my $empty = 1;
+  for my $child ($node->nonBlankChildNodes()) {
+    if (prune($child)) {
+      $node->removeChild($child);
+    } else {
+      $empty = 0;
+    }
+  }
+
+  # Empty container elements can be dropped.
+  return $empty && exists $drop_if_empty{$name};
+}
+
+# Drop unreachable nodes and any container nodes that end up empty.
+prune($dom);
+my $out = $dom->toString();
+# Remove blank line debris.
+$out =~ s;^ *\n;;mg;
+# Remove the XML declaration as abidiff doesn't like it.
+$out =~ s;^<\?xml .*\?>\n;;m;
+print $out;