[4/4] Add --merge-translation-units option.

Message ID 20200421122821.13769-5-mark@klomp.org
State Changes Requested
Headers
Series [1/4] Add named-types-ids to use name ids after the type name instead of numbers. |

Commit Message

Mark Wielaard April 21, 2020, 12:28 p.m. UTC
  From: Mark Wielaard <mark@klomp.org>

It is not always necessary to know in which translation unit a type,
variable or function was defined first. Provide an option to simply
merge all translation units for the same language (and address size).
This also reduces the size of the XML representation of the corpus by
~10% on a simple C only library.

Currently only implemented for the DWARF reader and abidw tool.

	* doc/manuals/abidw.rst: Document --merge-translation-units.
	* include/abg-dwarf-reader.h (set_merge_translation_units):
	New function.
	* src/abg-dwarf-reader.cc (read_context): Add
	merge_translation_units_ bool, merge_translation_units and
	merge_translation_units methods.
	(set_merge_translation_units): New function.
	(read_context::build_translation_unit_and_add_to_ir): Check
	merge_translation_units to see how to find an existing
	(compatible) translation_unit to use. Drop path and
	compilation_dir when merging translation_units.
	* src/abg-reader.cc (read_context): Add m_merge_translation_units
	bool, merge_translation_units and merge_translation_units methods.
	* tools/abidw.cc (options): Add merge_translation_units bool.
	(display_usage): Describe --merge-translation-units.
	(parse_command_line): Parse --merge-translation-units.
	(main): Call set_merge_translation_units.

Signed-off-by: Mark Wielaard <mark@klomp.org>
---
 doc/manuals/abidw.rst      |  7 ++++
 include/abg-dwarf-reader.h |  4 ++
 src/abg-dwarf-reader.cc    | 85 +++++++++++++++++++++++++++++++-------
 src/abg-reader.cc          | 20 ++++++++-
 tools/abidw.cc             |  6 +++
 5 files changed, 105 insertions(+), 17 deletions(-)
  

Comments

Matthias Männich May 7, 2020, 12:27 p.m. UTC | #1
Hi Mark!

On Tue, Apr 21, 2020 at 02:28:21PM +0200, Mark J. Wielaard wrote:
>From: Mark Wielaard <mark@klomp.org>
>
>It is not always necessary to know in which translation unit a type,
>variable or function was defined first. Provide an option to simply
>merge all translation units for the same language (and address size).
>This also reduces the size of the XML representation of the corpus by
>~10% on a simple C only library.
>
>Currently only implemented for the DWARF reader and abidw tool.
>
>	* doc/manuals/abidw.rst: Document --merge-translation-units.
>	* include/abg-dwarf-reader.h (set_merge_translation_units):
>	New function.
>	* src/abg-dwarf-reader.cc (read_context): Add
>	merge_translation_units_ bool, merge_translation_units and
>	merge_translation_units methods.
>	(set_merge_translation_units): New function.
>	(read_context::build_translation_unit_and_add_to_ir): Check
>	merge_translation_units to see how to find an existing
>	(compatible) translation_unit to use. Drop path and
>	compilation_dir when merging translation_units.
>	* src/abg-reader.cc (read_context): Add m_merge_translation_units
>	bool, merge_translation_units and merge_translation_units methods.
>	* tools/abidw.cc (options): Add merge_translation_units bool.
>	(display_usage): Describe --merge-translation-units.
>	(parse_command_line): Parse --merge-translation-units.
>	(main): Call set_merge_translation_units.
>

I tested that change with Linux Kernel binaries and I could not reach
the same gains (as in size reductions). 100K chopped off of 16M. It
simply does not have the same characteristics as elfutils(?) I suppose.

Nevertheless, the patch makes sense and is a good addition. Please note
that due to commit 246ca2004930 ("corpus/writer: sort emitted
translation units by path name"), it needs to be reworked as the empty
path name is now invalid.  In my tests I used an artificial filename
created out of language and address size (e.g. 'LANG_C89_64bit'). That
also simplifies the code much as you can do the lookup by name in any
case.

Cheers,
Matthias

>Signed-off-by: Mark Wielaard <mark@klomp.org>
>---
> doc/manuals/abidw.rst      |  7 ++++
> include/abg-dwarf-reader.h |  4 ++
> src/abg-dwarf-reader.cc    | 85 +++++++++++++++++++++++++++++++-------
> src/abg-reader.cc          | 20 ++++++++-
> tools/abidw.cc             |  6 +++
> 5 files changed, 105 insertions(+), 17 deletions(-)
>
>diff --git a/doc/manuals/abidw.rst b/doc/manuals/abidw.rst
>index 7ae44737..256c4ad6 100644
>--- a/doc/manuals/abidw.rst
>+++ b/doc/manuals/abidw.rst
>@@ -158,6 +158,13 @@ Options
>     representation build by Libabigail to represent the ABI and will
>     not end up in the abi XML file.
>
>+  * ``--merge-translation-units``
>+
>+    With this option translation units for the same language (and
>+    address size) will be merged together as if the functions,
>+    variables and types were all defined together.  Note that this
>+    also drops the compilation paths used.
>+
>   * ``--no-linux-kernel-mode``
>
>     Without this option, if abipkgiff detects that the binaries it is
>diff --git a/include/abg-dwarf-reader.h b/include/abg-dwarf-reader.h
>index d0329aed..430cc222 100644
>--- a/include/abg-dwarf-reader.h
>+++ b/include/abg-dwarf-reader.h
>@@ -192,6 +192,10 @@ void
> set_drop_undefined_syms(read_context& ctxt,
> 			bool f);
>
>+void
>+set_merge_translation_units(read_context& ctxt,
>+			    bool f);
>+
> void
> set_do_log(read_context& ctxt, bool f);
>
>diff --git a/src/abg-dwarf-reader.cc b/src/abg-dwarf-reader.cc
>index 1c0d6ea0..f7cbe6a5 100644
>--- a/src/abg-dwarf-reader.cc
>+++ b/src/abg-dwarf-reader.cc
>@@ -3260,6 +3260,7 @@ public:
>   corpus::exported_decls_builder* exported_decls_builder_;
>   options_type			options_;
>   bool				drop_undefined_syms_;
>+  bool				merge_translation_units_;
>   read_context();
>
> public:
>@@ -3429,6 +3430,7 @@ public:
>     options_.load_in_linux_kernel_mode = linux_kernel_mode;
>     options_.load_all_types = load_all_types;
>     drop_undefined_syms_ = false;
>+    merge_translation_units_ = false;
>     load_in_linux_kernel_mode(linux_kernel_mode);
>   }
>
>@@ -3537,6 +3539,22 @@ public:
>   drop_undefined_syms(bool f)
>   {drop_undefined_syms_ = f;}
>
>+  /// Setter for the flag that tells us if we are merging translation
>+  /// units.
>+  ///
>+  /// @param f the new value of the flag.
>+  void
>+  merge_translation_units(bool f)
>+  {merge_translation_units_ = f;}
>+
>+  /// Getter for the flag that tells us if we are merging translation
>+  /// units.
>+  ///
>+  /// @return true iff we are merging translation units.
>+  bool
>+  merge_translation_units() const
>+  {return merge_translation_units_;}
>+
>   /// Getter of the suppression specifications to be used during
>   /// ELF/DWARF parsing.
>   ///
>@@ -9487,6 +9505,17 @@ void
> set_drop_undefined_syms(read_context& ctxt, bool f)
> {ctxt.drop_undefined_syms(f);}
>
>+/// Setter of the "merge_translation_units" flag.
>+///
>+/// This flag tells if we should merge translation units.
>+///
>+/// @param ctxt the read context to consider for this flag.
>+///
>+/// @param f the value of the flag.
>+void
>+set_merge_translation_units(read_context& ctxt, bool f)
>+{ctxt.merge_translation_units(f);}
>+
> /// Setter of the "do_log" flag.
> ///
> /// This flag tells if we should emit verbose logs for various
>@@ -14267,28 +14296,52 @@ build_translation_unit_and_add_to_ir(read_context&	ctxt,
>   string path = die_string_attribute(die, DW_AT_name);
>   string compilation_dir = die_string_attribute(die, DW_AT_comp_dir);
>
>-  // See if the same translation unit exits already in the current
>-  // corpus.  Sometimes, the same translation unit can be present
>-  // several times in the same debug info.  The content of the
>-  // different instances of the translation unit are different.  So to
>-  // represent that, we are going to re-use the same translation
>-  // unit.  That is, it's going to be the union of all the translation
>-  // units of the same path.
>-  {
>-    string abs_path = compilation_dir + "/" + path;
>-    result = ctxt.current_corpus()->find_translation_unit(abs_path);
>-  }
>+  uint64_t lang = 0;
>+  die_unsigned_constant_attribute(die, DW_AT_language, lang);
>+  translation_unit::language language = dwarf_language_to_tu_language(lang);
>+
>+  corpus_sptr corp = ctxt.current_corpus();
>+
>+  if (ctxt.merge_translation_units())
>+    {
>+      // See if there is already a translation for the address_size
>+      // and language. If so, just reuse that one.
>+      for (translation_units::const_iterator tu =
>+	     corp->get_translation_units().begin();
>+	   tu != corp->get_translation_units().end();
>+	   ++tu)
>+	{
>+	  if ((*tu)->get_address_size() == address_size
>+	      && (*tu)->get_language() == language)
>+	    {
>+	      result = (*tu);
>+	      break;
>+	    }
>+	}
>+    }
>+  else
>+    {
>+      // See if the same translation unit exits already in the current
>+      // corpus.  Sometimes, the same translation unit can be present
>+      // several times in the same debug info.  The content of the
>+      // different instances of the translation unit are different.  So to
>+      // represent that, we are going to re-use the same translation
>+      // unit.  That is, it's going to be the union of all the translation
>+      // units of the same path.
>+      string abs_path = compilation_dir + "/" + path;
>+      result = corp->find_translation_unit(abs_path);
>+    }
>
>   if (!result)
>     {
>       result.reset(new translation_unit(ctxt.env(),
>-					path,
>+					(ctxt.merge_translation_units()
>+					 ? "" : path),
> 					address_size));
>-      result->set_compilation_dir_path(compilation_dir);
>+      if (!ctxt.merge_translation_units())
>+	result->set_compilation_dir_path(compilation_dir);
>       ctxt.current_corpus()->add(result);
>-      uint64_t l = 0;
>-      die_unsigned_constant_attribute(die, DW_AT_language, l);
>-      result->set_language(dwarf_language_to_tu_language(l));
>+      result->set_language(language);
>     }
>
>   ctxt.cur_transl_unit(result);
>diff --git a/src/abg-reader.cc b/src/abg-reader.cc
>index dcaa27e1..7f9374c5 100644
>--- a/src/abg-reader.cc
>+++ b/src/abg-reader.cc
>@@ -132,6 +132,7 @@ private:
>   suppr::suppressions_type				m_supprs;
>   bool							m_tracking_non_reachable_types;
>   bool							m_drop_undefined_syms;
>+  bool							m_merge_translation_units;
>
>   read_context();
>
>@@ -143,7 +144,8 @@ public:
>       m_corp_node(),
>       m_exported_decls_builder(),
>       m_tracking_non_reachable_types(),
>-      m_drop_undefined_syms()
>+      m_drop_undefined_syms(),
>+      m_merge_translation_units()
>   {}
>
>   /// Getter for the flag that tells us if we are tracking types that
>@@ -181,6 +183,22 @@ public:
>   drop_undefined_syms(bool f)
>   {m_drop_undefined_syms = f;}
>
>+  /// Getter for the flag that tells us if we are merging translation
>+  /// units.
>+  ///
>+  /// @return true iff we are merging translation units.
>+  bool
>+  merge_translation_units() const
>+  {return m_merge_translation_units;}
>+
>+  /// Setter for the flag that tells us if we are merging translation
>+  /// units.
>+  ///
>+  /// @param f the new value of the flag.
>+  void
>+  merge_translation_units(bool f)
>+  {m_merge_translation_units = f;}
>+
>   /// Getter of the path to the ABI file.
>   ///
>   /// @return the path to the native xml abi file.
>diff --git a/tools/abidw.cc b/tools/abidw.cc
>index 3f4b3f42..87248a2c 100644
>--- a/tools/abidw.cc
>+++ b/tools/abidw.cc
>@@ -113,6 +113,7 @@ struct options
>   bool			do_log;
>   bool			drop_private_types;
>   bool			drop_undefined_syms;
>+  bool			merge_translation_units;
>   bool			named_type_ids;
>
>   options()
>@@ -136,6 +137,7 @@ struct options
>       do_log(),
>       drop_private_types(false),
>       drop_undefined_syms(false),
>+      merge_translation_units(false),
>       named_type_ids(false)
>   {}
>
>@@ -170,6 +172,7 @@ display_usage(const string& prog_name, ostream& out)
>     << "  --short-locs  only print filenames rather than paths\n"
>     << "  --drop-private-types  drop private types from representation\n"
>     << "  --drop-undefined-syms  drop undefined symbols from representation\n"
>+    << "  --merge-translation-units  merge translation units for same language\n"
>     << "  --named_type-ids  use id attributes based on type names in XML file\n"
>     << "  --no-comp-dir-path  do not show compilation path information\n"
>     << "  --no-elf-needed  do not show the DT_NEEDED information\n"
>@@ -317,6 +320,8 @@ parse_command_line(int argc, char* argv[], options& opts)
> 	opts.drop_private_types = true;
>       else if (!strcmp(argv[i], "--drop-undefined-syms"))
> 	opts.drop_undefined_syms = true;
>+      else if (!strcmp(argv[i], "--merge-translation-units"))
>+	opts.merge_translation_units = true;
>       else if (!strcmp(argv[i], "--named-type-ids"))
> 	opts.named_type_ids = true;
>       else if (!strcmp(argv[i], "--no-linux-kernel-mode"))
>@@ -806,6 +811,7 @@ main(int argc, char* argv[])
> 						opts.linux_kernel_mode);
>       read_context& ctxt = *c;
>       set_drop_undefined_syms(ctxt, opts.drop_undefined_syms);
>+      set_merge_translation_units(ctxt, opts.merge_translation_units);
>       set_show_stats(ctxt, opts.show_stats);
>       set_suppressions(ctxt, opts);
>       abigail::dwarf_reader::set_do_log(ctxt, opts.do_log);
>-- 
>2.18.2
>
  

Patch

diff --git a/doc/manuals/abidw.rst b/doc/manuals/abidw.rst
index 7ae44737..256c4ad6 100644
--- a/doc/manuals/abidw.rst
+++ b/doc/manuals/abidw.rst
@@ -158,6 +158,13 @@  Options
     representation build by Libabigail to represent the ABI and will
     not end up in the abi XML file.
 
+  * ``--merge-translation-units``
+
+    With this option translation units for the same language (and
+    address size) will be merged together as if the functions,
+    variables and types were all defined together.  Note that this
+    also drops the compilation paths used.
+
   * ``--no-linux-kernel-mode``
 
     Without this option, if abipkgiff detects that the binaries it is
diff --git a/include/abg-dwarf-reader.h b/include/abg-dwarf-reader.h
index d0329aed..430cc222 100644
--- a/include/abg-dwarf-reader.h
+++ b/include/abg-dwarf-reader.h
@@ -192,6 +192,10 @@  void
 set_drop_undefined_syms(read_context& ctxt,
 			bool f);
 
+void
+set_merge_translation_units(read_context& ctxt,
+			    bool f);
+
 void
 set_do_log(read_context& ctxt, bool f);
 
diff --git a/src/abg-dwarf-reader.cc b/src/abg-dwarf-reader.cc
index 1c0d6ea0..f7cbe6a5 100644
--- a/src/abg-dwarf-reader.cc
+++ b/src/abg-dwarf-reader.cc
@@ -3260,6 +3260,7 @@  public:
   corpus::exported_decls_builder* exported_decls_builder_;
   options_type			options_;
   bool				drop_undefined_syms_;
+  bool				merge_translation_units_;
   read_context();
 
 public:
@@ -3429,6 +3430,7 @@  public:
     options_.load_in_linux_kernel_mode = linux_kernel_mode;
     options_.load_all_types = load_all_types;
     drop_undefined_syms_ = false;
+    merge_translation_units_ = false;
     load_in_linux_kernel_mode(linux_kernel_mode);
   }
 
@@ -3537,6 +3539,22 @@  public:
   drop_undefined_syms(bool f)
   {drop_undefined_syms_ = f;}
 
+  /// Setter for the flag that tells us if we are merging translation
+  /// units.
+  ///
+  /// @param f the new value of the flag.
+  void
+  merge_translation_units(bool f)
+  {merge_translation_units_ = f;}
+
+  /// Getter for the flag that tells us if we are merging translation
+  /// units.
+  ///
+  /// @return true iff we are merging translation units.
+  bool
+  merge_translation_units() const
+  {return merge_translation_units_;}
+
   /// Getter of the suppression specifications to be used during
   /// ELF/DWARF parsing.
   ///
@@ -9487,6 +9505,17 @@  void
 set_drop_undefined_syms(read_context& ctxt, bool f)
 {ctxt.drop_undefined_syms(f);}
 
+/// Setter of the "merge_translation_units" flag.
+///
+/// This flag tells if we should merge translation units.
+///
+/// @param ctxt the read context to consider for this flag.
+///
+/// @param f the value of the flag.
+void
+set_merge_translation_units(read_context& ctxt, bool f)
+{ctxt.merge_translation_units(f);}
+
 /// Setter of the "do_log" flag.
 ///
 /// This flag tells if we should emit verbose logs for various
@@ -14267,28 +14296,52 @@  build_translation_unit_and_add_to_ir(read_context&	ctxt,
   string path = die_string_attribute(die, DW_AT_name);
   string compilation_dir = die_string_attribute(die, DW_AT_comp_dir);
 
-  // See if the same translation unit exits already in the current
-  // corpus.  Sometimes, the same translation unit can be present
-  // several times in the same debug info.  The content of the
-  // different instances of the translation unit are different.  So to
-  // represent that, we are going to re-use the same translation
-  // unit.  That is, it's going to be the union of all the translation
-  // units of the same path.
-  {
-    string abs_path = compilation_dir + "/" + path;
-    result = ctxt.current_corpus()->find_translation_unit(abs_path);
-  }
+  uint64_t lang = 0;
+  die_unsigned_constant_attribute(die, DW_AT_language, lang);
+  translation_unit::language language = dwarf_language_to_tu_language(lang);
+
+  corpus_sptr corp = ctxt.current_corpus();
+
+  if (ctxt.merge_translation_units())
+    {
+      // See if there is already a translation for the address_size
+      // and language. If so, just reuse that one.
+      for (translation_units::const_iterator tu =
+	     corp->get_translation_units().begin();
+	   tu != corp->get_translation_units().end();
+	   ++tu)
+	{
+	  if ((*tu)->get_address_size() == address_size
+	      && (*tu)->get_language() == language)
+	    {
+	      result = (*tu);
+	      break;
+	    }
+	}
+    }
+  else
+    {
+      // See if the same translation unit exits already in the current
+      // corpus.  Sometimes, the same translation unit can be present
+      // several times in the same debug info.  The content of the
+      // different instances of the translation unit are different.  So to
+      // represent that, we are going to re-use the same translation
+      // unit.  That is, it's going to be the union of all the translation
+      // units of the same path.
+      string abs_path = compilation_dir + "/" + path;
+      result = corp->find_translation_unit(abs_path);
+    }
 
   if (!result)
     {
       result.reset(new translation_unit(ctxt.env(),
-					path,
+					(ctxt.merge_translation_units()
+					 ? "" : path),
 					address_size));
-      result->set_compilation_dir_path(compilation_dir);
+      if (!ctxt.merge_translation_units())
+	result->set_compilation_dir_path(compilation_dir);
       ctxt.current_corpus()->add(result);
-      uint64_t l = 0;
-      die_unsigned_constant_attribute(die, DW_AT_language, l);
-      result->set_language(dwarf_language_to_tu_language(l));
+      result->set_language(language);
     }
 
   ctxt.cur_transl_unit(result);
diff --git a/src/abg-reader.cc b/src/abg-reader.cc
index dcaa27e1..7f9374c5 100644
--- a/src/abg-reader.cc
+++ b/src/abg-reader.cc
@@ -132,6 +132,7 @@  private:
   suppr::suppressions_type				m_supprs;
   bool							m_tracking_non_reachable_types;
   bool							m_drop_undefined_syms;
+  bool							m_merge_translation_units;
 
   read_context();
 
@@ -143,7 +144,8 @@  public:
       m_corp_node(),
       m_exported_decls_builder(),
       m_tracking_non_reachable_types(),
-      m_drop_undefined_syms()
+      m_drop_undefined_syms(),
+      m_merge_translation_units()
   {}
 
   /// Getter for the flag that tells us if we are tracking types that
@@ -181,6 +183,22 @@  public:
   drop_undefined_syms(bool f)
   {m_drop_undefined_syms = f;}
 
+  /// Getter for the flag that tells us if we are merging translation
+  /// units.
+  ///
+  /// @return true iff we are merging translation units.
+  bool
+  merge_translation_units() const
+  {return m_merge_translation_units;}
+
+  /// Setter for the flag that tells us if we are merging translation
+  /// units.
+  ///
+  /// @param f the new value of the flag.
+  void
+  merge_translation_units(bool f)
+  {m_merge_translation_units = f;}
+
   /// Getter of the path to the ABI file.
   ///
   /// @return the path to the native xml abi file.
diff --git a/tools/abidw.cc b/tools/abidw.cc
index 3f4b3f42..87248a2c 100644
--- a/tools/abidw.cc
+++ b/tools/abidw.cc
@@ -113,6 +113,7 @@  struct options
   bool			do_log;
   bool			drop_private_types;
   bool			drop_undefined_syms;
+  bool			merge_translation_units;
   bool			named_type_ids;
 
   options()
@@ -136,6 +137,7 @@  struct options
       do_log(),
       drop_private_types(false),
       drop_undefined_syms(false),
+      merge_translation_units(false),
       named_type_ids(false)
   {}
 
@@ -170,6 +172,7 @@  display_usage(const string& prog_name, ostream& out)
     << "  --short-locs  only print filenames rather than paths\n"
     << "  --drop-private-types  drop private types from representation\n"
     << "  --drop-undefined-syms  drop undefined symbols from representation\n"
+    << "  --merge-translation-units  merge translation units for same language\n"
     << "  --named_type-ids  use id attributes based on type names in XML file\n"
     << "  --no-comp-dir-path  do not show compilation path information\n"
     << "  --no-elf-needed  do not show the DT_NEEDED information\n"
@@ -317,6 +320,8 @@  parse_command_line(int argc, char* argv[], options& opts)
 	opts.drop_private_types = true;
       else if (!strcmp(argv[i], "--drop-undefined-syms"))
 	opts.drop_undefined_syms = true;
+      else if (!strcmp(argv[i], "--merge-translation-units"))
+	opts.merge_translation_units = true;
       else if (!strcmp(argv[i], "--named-type-ids"))
 	opts.named_type_ids = true;
       else if (!strcmp(argv[i], "--no-linux-kernel-mode"))
@@ -806,6 +811,7 @@  main(int argc, char* argv[])
 						opts.linux_kernel_mode);
       read_context& ctxt = *c;
       set_drop_undefined_syms(ctxt, opts.drop_undefined_syms);
+      set_merge_translation_units(ctxt, opts.merge_translation_units);
       set_show_stats(ctxt, opts.show_stats);
       set_suppressions(ctxt, opts);
       abigail::dwarf_reader::set_do_log(ctxt, opts.do_log);