[RFC,v2] Add an experimental ABI pruning utility.

Message ID 20210312165958.3575755-1-gprocida@google.com
State Rejected, archived
Headers
Series [RFC,v2] Add an experimental ABI pruning utility. |

Commit Message

Giuliano Procida March 12, 2021, 4:59 p.m. UTC
  In response to internal reports about very large ABIs for libraires
implemented in C++ but exposing only a C API, I wrote a script as a
means of investigating the issue.

The script aims to remove parts of the XML that are irrelevant to
abidiff. It has been run on current libabigail test data ABI files as
well as some locally-generated ones.

The script by default considers all exported ELF symbols as the roots
of interest (but this list can be overridden).

On most XML files, this results in a modest saving (5-10%). However,
for the library that sparked the invesigaton, the resulting XML was
more than 2300 times smaller.

The script may have some merit on its own but perhaps should be
implemented differently (say in C++).

Question: How hard would it be to implement this functionality (under
flag control) within libabigail itself?

A couple of tangential things worth mentioning (see the comments at
the very beginning and very end of the script):

- abidiff rejects XML files with an XML declaration at the top
- abidiff rejects completely empty files

Resolving the first would make the second moot.

	* scripts/abiprune.pl: Add script to prune ABI XML or restrict
	it to a subset of symbols.

Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 scripts/abiprune.pl | 356 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 356 insertions(+)
 create mode 100755 scripts/abiprune.pl
  

Patch

diff --git a/scripts/abiprune.pl b/scripts/abiprune.pl
new file mode 100755
index 00000000..45592d29
--- /dev/null
+++ b/scripts/abiprune.pl
@@ -0,0 +1,356 @@ 
+#!/usr/bin/perl
+
+# This script is intended to consume libabigail ABI XML as generated
+# by abidw and produce a possibly smaller representation that captures
+# the same ABI. In particular, the output should be such that abidiff
+# --harmless reports no differences (or is empty).
+#
+# It does this by modelling symbols and types dependencies as a
+# directed graph, marking nodes reachable from the ELF symbols and
+# dropping unreachable nodes.
+
+use strict;
+use warnings;
+no warnings 'recursion';
+
+use autodie;
+
+use Getopt::Long;
+use IO::File;
+use XML::LibXML;
+
+# Making a graph from ABI XML.
+#
+# The following are the types of "node" we care about. The "edges" are
+# obtained from a few XML attributes as well as via XML element
+# containment.
+#
+#  ELF (exported) symbols
+#
+#   elf-symbol (has a name; the code here currently knows nothing
+#     about aliases)
+#
+#  Declarations (that mention a symbol)
+#
+#   These live in a scope. In C++ scopes can be nested and include
+#   namespaces and class types.
+#
+#  var-decl (also used for member variables)
+#    elf-symbol linked to symbol via mangled-name
+#    type-id links to a type
+#  function-decl (also used for member functions)
+#    elf-symbol linked to symbol via mangled-name
+#    parameter and return type-ids link to types
+#      (alas, not just a link to a function type)
+#
+#  Types
+#
+#   These occupy pretty much all the other elements, besides those
+#   that act as simple containers.
+#
+# XML elements and their roles in more detail
+#
+# ELF
+#
+# elf-needed - container
+#  dependency - names a library
+# elf-variable-symbols - contains a list of symbols
+# elf-function-symbols - contains a list of symbols
+#  elf-symbol - describes an ELF variable or function
+#
+# Grouping and scoping
+#
+# abi-corpus-group
+#  abi-corpus
+#   abi-instr - compilation unit containers
+#    namespace-decl - pure container, possibly named
+#
+# Types (some introduce scopes, only in C++)
+#
+# type-decl - defines a primitive type
+# typedef-decl - defines a type, links to a type
+# qualified-type-def - defines a type, links to a type
+# pointer-type-def - defines a type, links to a type
+# reference-type-def - defines a type, links to a type
+# array-type-def - defines a (multidimensional array) type, refers to element type, contains subranges
+#  subrange - contains array length, refers to element type; defines types (never referred to; duplicated)
+# function-type - defines a type
+#  parameter - belongs to function-type and -decl, links to a type
+#  return - belongs to function-type and -decl, links to a type
+# enum-decl - defines a type, names it, contains a list of enumerators and an underlying-type
+#  underlying-type - belongs to enum-decl
+#  enumerator - belongs to enum-decl
+# union-decl - defines and names a type, contains member elements linked to other things
+# class-decl - defines and names a type, contains base type, member elements linking to other things
+#  base-class - belongs to class-decl
+#  data-member - container for a member; holds access level
+#  member-function - container for a member; holds access level
+#  member-type - container for a type declaration; holds access level
+#  member-template - container for a (function?) template declaration; holds access level
+#
+# Higher order Things
+#
+# class-template-decl - defines a type (function), but without instantiation this isn't usable
+# function-template-decl - defines a type (function), but without instantiation this isn't usable
+#  template-type-parameter - defines a type (variable), perhaps one which should be excluded from the real type graph
+#  template-non-type-parameter - names a template parameter, links to a type
+#  template-parameter-type-composition - container?
+#
+# Values
+#
+# function-decl - names a function, can link to a symbol
+#     has same children as function-type, rather than linking to a type
+# var-decl - names a variable, can link to a symbol, links to a type
+
+my $symbols_file;
+GetOptions ("symbols=s" => \$symbols_file)
+  && scalar @ARGV == 0
+  or die "usage: $0 [-s symbols_file]\n";
+
+my %wanted_symbols;
+if ($symbols_file) {
+  my $fh = new IO::File $symbols_file, '<';
+  while (<$fh>) {
+    chomp;
+    $wanted_symbols{$_} = undef;
+  }
+  close $fh;
+}
+
+# Graph nodes.
+my %symbols;
+my %exported_symbols;
+my %types;
+# Graph edges.
+my %type_type_edges;
+my %symbol_type_edges;
+my %type_symbol_edges;
+
+# Keep track of the types being defined.
+my @id_stack;
+# Keep track of the current (value) declaration.
+my @symbol_stack;
+# Traverse the whole XML DOM.
+sub process($);
+sub process($) {
+  my ($node) = @_;
+
+  # The XML attributes we care about.
+  my $name;
+  my $id;
+  my $type_id;
+  my $symbol;
+  my $naming_typedef_id;
+
+  # Not every node we encounter is an XML element.
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $name = $node->getAttribute('name');
+    $id = $node->getAttribute('id');
+    $type_id = $node->getAttribute('type-id');
+    $symbol = $node->getAttribute('mangled-name');
+    $naming_typedef_id = $node->getAttribute('naming-typedef-id');
+    die if defined $id && defined $symbol;
+  }
+
+  if (defined $name && $node->getName() eq 'elf-symbol') {
+    $exported_symbols{$name} = undef;
+    # Early return is safe, but not necessary.
+    return;
+  }
+
+  if (defined $id) {
+    # This element defines a type (but there may be more than one
+    # defining the same type - we cannot rely on uniqueness).
+    $types{$id} = undef;
+    if (defined $naming_typedef_id) {
+      # This is an odd one, there can be a backwards link from an
+      # anonymous type to the typedef that refers to it, so we need to
+      # pull in the typedef, even if nothing else refers to it.
+      $type_type_edges{$id}{$naming_typedef_id}{naming} = undef;
+    }
+    if (@id_stack) {
+      # Type parent<->child dependencies; record dependencies both
+      # ways to avoid making holes in the XML types.
+      $type_type_edges{$id_stack[-1]}{$id}{contains} = undef;
+      $type_type_edges{$id}{$id_stack[-1]}{in} = undef;
+    }
+    push @id_stack, $id;
+  }
+
+  if (defined $symbol) {
+    # This element is a declaration linked to a symbol (whether or not
+    # exported).
+    $symbols{$symbol} = undef;
+    if (@id_stack) {
+      # The enclosing scope may include a C++ class, in which case we
+      # need to record the dependency.
+      $type_symbol_edges{$id_stack[-1]}{$symbol} = undef;
+    }
+    # The symbol depends on the types mentioned in this element, so
+    # record it.
+    push @symbol_stack, $symbol;
+    die "nested symbols: @symbol_stack\n" if scalar @symbol_stack > 1;
+  }
+
+  if (defined $type_id) {
+    if (@id_stack) {
+      # The enclosing type refers to this type, record the dependency.
+      $type_type_edges{$id_stack[-1]}{$type_id}{depends} = undef;
+    }
+    if (@symbol_stack) {
+      # The enclosing declaration (linked to a symbol) refers to this
+      # type, record the dependency.
+      $symbol_type_edges{$symbol_stack[-1]}{$type_id} = undef;
+    }
+  }
+
+  for my $child ($node->nonBlankChildNodes()) {
+    process($child);
+  }
+
+  if (defined $symbol) {
+    pop @symbol_stack;
+  }
+  if (defined $id) {
+    pop @id_stack;
+  }
+}
+
+# Load the XML.
+my $dom = XML::LibXML->load_xml(IO => \*STDIN);
+# Build a graph.
+process($dom);
+die if @id_stack or @symbol_stack;
+
+%wanted_symbols = %exported_symbols unless defined $symbols_file;
+
+# DFS visited state.
+my %seen_types;
+my %seen_symbols;
+sub dfs_type($);
+sub dfs_symbol($);
+
+sub dfs_type($) {
+  my ($type) = @_;
+  if (not exists $types{$type}) {
+    warn "edge to unknown type $type\n";
+    return;
+  }
+  return if exists $seen_types{$type};
+  $seen_types{$type} = undef;
+
+  my $types = $type_type_edges{$type};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      for my $link (keys %{$types->{$type}}) {
+        if (exists $types->{$type}{$link}) {
+          dfs_type($type);
+        }
+      }
+    }
+  }
+  my $symbols = $type_symbol_edges{$type};
+  if (defined $symbols) {
+    for my $symbol (keys %$symbols) {
+      dfs_symbol($symbol);
+    }
+  }
+}
+
+sub dfs_symbol($) {
+  my ($symbol) = @_;
+  return if exists $seen_symbols{$symbol};
+  $seen_symbols{$symbol} = undef;
+
+  my $types = $symbol_type_edges{$symbol};
+  if (defined $types) {
+    for my $type (keys %$types) {
+      dfs_type($type);
+    }
+  } else {
+    warn "no declaration found for exported symbol $symbol\n";
+  }
+}
+
+# Traverse the graph, starting from the wanted symbols.
+for my $symbol (keys %wanted_symbols) {
+  dfs_symbol($symbol);
+}
+
+# Useful counts.
+sub print_report() {
+  my $count_symbols = scalar keys %symbols;
+  my $count_exported_symbols = scalar keys %exported_symbols;
+  my $count_types = scalar keys %types;
+  my $count_reachable_types = scalar keys %seen_types;
+
+  print STDERR qq{symbols = $count_symbols
+exported symbols = $count_exported_symbols
+types = $count_types
+reachable types = $count_reachable_types
+};
+}
+
+my %drop_if_empty = map { $_ => undef } qw(
+  elf-variable-symbols
+  elf-function-symbols
+  namespace-decl
+  abi-instr
+  abi-corpus
+  abi-corpus-group
+);
+
+# This is another XML DOM traversal.
+sub prune($);
+sub prune($) {
+  my ($node) = @_;
+
+  my $node_name = $node->getName();
+  my $name;
+  my $id;
+  my $symbol;
+
+  if ($node->nodeType == XML_ELEMENT_NODE) {
+    $name = $node->getAttribute('name');
+    $id = $node->getAttribute('id');
+    $symbol = $node->getAttribute('mangled-name');
+    die if defined $id && defined $symbol;
+  }
+
+  # Keep / prune wanted / unwanted symbols
+  if (defined $name && $node_name eq 'elf-symbol') {
+    return !exists $exported_symbols{$name};
+  }
+
+  # Return if we know that this is a type or declaration to keep or
+  # drop in its entirety.
+  if (defined $id) {
+    return !exists $seen_types{$id};
+  }
+  if ($node_name eq 'var-decl' || $node_name eq 'function-decl') {
+    return !defined $symbol || !exists $seen_symbols{$symbol};
+  }
+
+  # Otherwise, this is not a type, declaration or part thereof, so
+  # process child elements.
+  my $empty = 1;
+  for my $child ($node->nonBlankChildNodes()) {
+    if (prune($child)) {
+      $node->removeChild($child);
+    } else {
+      $empty = 0;
+    }
+  }
+
+  # Empty container elements can be dropped.
+  return $empty && exists $drop_if_empty{$node_name};
+}
+
+# Drop unreachable nodes and any container nodes that end up empty.
+prune($dom);
+my $out = $dom->toString();
+# Remove blank line debris.
+$out =~ s;^ *\n;;mg;
+# Remove the XML declaration as abidiff doesn't like it.
+$out =~ s;^<\?xml .*\?>\n;;m;
+print $out;