ctf-reader: Lookup debug info for symbols in a non default archive member

Message ID 20220831151603.915945-1-guillermo.e.martinez@oracle.com
State New
Series ctf-reader: Lookup debug info for symbols in a non default archive member

Commit Message

Guillermo E. Martinez Aug. 31, 2022, 3:16 p.m. UTC
  Hello,

This patch improves the ABI XML file generated by the ctf reader: some
Linux symbols (EXPORT_SYMBOL*) were missing from it.

Comments are welcome and appreciated!

Thanks in advance,
guillermo
--

To find the debug information for a given Linux symbol, the ctf reader
currently opens only the default dictionary, i.e. the one whose name
matches the binary being processed in the current corpus, e.g.
`vmlinux' or `module-name.ko'.  However, the information for some
symbols is not located in the default dictionary.  This becomes
evident when comparing the symbols in the `Module.symvers' file with
the ABI XML file.  For example, the ctf reader expects to find the
information for the `LZ4_decompress_fast' symbol in the CTF `vmlinux'
archive member, because this symbol is defined in the `vmlinux' binary:

   0x4c416eb9	LZ4_decompress_fast	vmlinux	EXPORT_SYMBOL

But the symbol turns out to be missing from that member.  Its
information is actually located in the `vmlinux#0' dictionary.

  CTF archive member: vmlinux:
    ...
    Function objects:
    ...

  CTF archive member: vmlinux#0:
    Function objects:
    ...
    LZ4_decompress_fast -> 0x80037400: (kind 5) int (*) (const char *, char *, int) (aligned at 0x8)
    ...

Therefore, the ctf reader now looks for debug information in the whole
archive.  Fortunately, `libctf' provides a fast lookup mechanism using
a cache, dictionary references, etc., so the performance penalty is ~10%.

	* src/abg-ctf-reader.cc (lookup_symbol_in_ctf_archive): New function.
	(process_ctf_archive): Use `lookup_symbol_in_ctf_archive'.

Signed-off-by: Guillermo E. Martinez <guillermo.e.martinez@oracle.com>
---
 src/abg-ctf-reader.cc | 72 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 64 insertions(+), 8 deletions(-)
  

Comments

Dodji Seketeli Sept. 6, 2022, 12:49 p.m. UTC | #1
Hello Guillermo,

Thanks for the patch.  I have tested and it seems to pass regression
testing on my system.  However, there are some things that I don't
understand so I have some questions below.  The questions are just for
my own understanding.  I don't have anything major against the patch,
obviously.

[...]

"Guillermo E. Martinez via Libabigail" <libabigail@sourceware.org> a
écrit:


[...]

> +/// Given a symbol name, lookup the corresponding CTF information in
> +/// the default dictionary (CTF archive member provided by the caller)
> +/// If the search is not success, the  looks for the symbol name
> +/// in _all_ archive members.
> +///
> +/// @param ctfa the CTF archive.
> +/// @param dict the default dictionary to looks for.
> +/// @param sym_name the symbol name.
> +/// @param corp the IR corpus.
> +///
> +/// Note that if @ref sym_name is found in other than default dictionary
> +/// @ref ctf_dict will be updated and it must be explicate closed by its
> +/// caller.
> +///
> +/// @return a valid CTF type id, if @ref sym_name was found, -1 otherwise.
> +
> +static ctf_id_t
> +lookup_symbol_in_ctf_archive(ctf_archive_t *ctfa, ctf_dict_t **ctf_dict,
> +                             const char *sym_name, corpus_sptr corp)
> +{
> +  int ctf_err;
> +  ctf_dict_t *dict = *ctf_dict;
> +  ctf_id_t ctf_type = ctf_lookup_variable(dict, sym_name);

So, here, we begin by looking for a variable (using ctf_lookup_variable)
whose ELF symbol is sym_name, is that correct?

> +
> +  /* lookup CTF type for a given symbol in its default
> +     dictionary */
> +  if (ctf_type == (ctf_id_t) -1

So, I guess the variable lookup failed, right?

> +      && !(corp->get_origin() & corpus::LINUX_KERNEL_BINARY_ORIGIN))

Why this condition?  Why consider only the cases where we are not
looking at a Linux Kernel binary?  I would think that we would want to
consider the case where the variable lookup failed, even in the case
of a Linux Kernel binary, wouldn't we?  If not, why?  Maybe we should
add a comment to explain this.

> +    ctf_type = ctf_lookup_by_symbol_name(dict, sym_name);

So I am guessing that ctf_lookup_by_symbol_name looks up both variable
and function symbols from the same dictionary, is that correct?
Also, I don't understand why we don't just use ctf_lookup_by_symbol_name
rather than starting with ctf_lookup_variable first.  Is it a
performance thing?


Incidentally, I haven't found documentation for the lookup functions
other than by looking at the code, in say:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob_plain;f=libctf/ctf-lookup.c;hb=refs/heads/master.

If there is documentation for it somewhere else, maybe we can link that
place in the code here in a comment somewhere, or we can just point to
that link above.  Both would be fine by me.

> +
> +  /* Not lucky, then, search in whole archive */
> +  if (ctf_type == (ctf_id_t) -1)
> +    {
> +      ctf_dict_t *fp;
> +      ctf_next_t *i = NULL;
> +      const char *arcname;
> +
> +      while ((fp = ctf_archive_next(ctfa, &i, &arcname, 1, &ctf_err)) != NULL)
> +        {
> +          ctf_type = ctf_lookup_variable (fp, sym_name);
> +          if (ctf_type == (ctf_id_t) -1
> +              && !(corp->get_origin() & corpus::LINUX_KERNEL_BINARY_ORIGIN))

The same questions as above.

> +            ctf_type = ctf_lookup_by_symbol_name(fp, sym_name);
> +
> +          if (ctf_type != (ctf_id_t) -1)
> +            {
> +              *ctf_dict = fp;
> +              break;
> +            }
> +          ctf_dict_close(fp);
> +        }
> +    }
> +
> +  return ctf_type;
> +}
> +

Cheers,

[...]
  
Guillermo E. Martinez Sept. 7, 2022, 6:40 p.m. UTC | #2
On 9/6/22 07:49, Dodji Seketeli wrote:
> Hello Guillermo,

Hello Dodji,

> Thanks for the patch.  I have tested and it seems to pass regression
> testing on my system.  However, there are some things that I don't
> understand so I have some questions below.  The questions are just for
> my own understanding.  I don't have anything major against the patch,
> obviously.
> 
> [...]
> 
> "Guillermo E. Martinez via Libabigail" <libabigail@sourceware.org> a
> écrit:
> 
> 
> [...]
> 
>> +/// Given a symbol name, lookup the corresponding CTF information in
>> +/// the default dictionary (CTF archive member provided by the caller)
>> +/// If the search is not success, the  looks for the symbol name
>> +/// in _all_ archive members.
>> +///
>> +/// @param ctfa the CTF archive.
>> +/// @param dict the default dictionary to looks for.
>> +/// @param sym_name the symbol name.
>> +/// @param corp the IR corpus.
>> +///
>> +/// Note that if @ref sym_name is found in other than default dictionary
>> +/// @ref ctf_dict will be updated and it must be explicate closed by its
>> +/// caller.
>> +///
>> +/// @return a valid CTF type id, if @ref sym_name was found, -1 otherwise.
>> +
>> +static ctf_id_t
>> +lookup_symbol_in_ctf_archive(ctf_archive_t *ctfa, ctf_dict_t **ctf_dict,
>> +                             const char *sym_name, corpus_sptr corp)
>> +{
>> +  int ctf_err;
>> +  ctf_dict_t *dict = *ctf_dict;
>> +  ctf_id_t ctf_type = ctf_lookup_variable(dict, sym_name);
> 
> So, here, we begin by looking for a variable (using ctf_lookup_variable)
> whose ELF symbol is sym_name, is that correct?

That's correct, `sym_name' is the symbol name.

>> +
>> +  /* lookup CTF type for a given symbol in its default
>> +     dictionary */
>> +  if (ctf_type == (ctf_id_t) -1
> 
> So, I guess the variable lookup failed, right?

Correct, the libctf `ctf_lookup_*' functions return CTF_ERR when they
fail, so I'm going to change the comparison to use it for clarity.

>> +      && !(corp->get_origin() & corpus::LINUX_KERNEL_BINARY_ORIGIN))
> 
> Why this condition?  Why consider only the cases where we are not
> looking at a Linux Kernel binary?  I would think that we would want to
> consider the case where the variable lookup failed, even in the case
> of a Linux Kernel binary, wouldn't we?  If not, why?  Maybe we should
> add a comment to explain this.

OK. The linker (ld) in the kernel build mechanism uses `--ctf-variables',
so it emits the symbols' type definitions using just the CTF Variable
section:

$ objdump --ctf foo

   ...

   Labels:

   Data objects:

   Function objects:

   Variables:
     main -> 0x2: (kind 5) int (*) () (aligned at 0x8)
     main_func -> 0x4: (kind 5) void (*) () (aligned at 0x8)
     okkk -> 0x1: (kind 1) int (format 0x1) (size 0x4) (aligned at 0x4)


Otherwise, they are split across the CTF Data, Function and Variable
sections:

$ objdump --ctf foo.o

  Data objects:
     okkk -> 0x1: (kind 1) int (format 0x1) (size 0x4) (aligned at 0x4)

   Function objects:
     main -> 0x2: (kind 5) int (*) () (aligned at 0x8)
     main_func -> 0x4: (kind 5) void (*) () (aligned at 0x8)

   Variables:
     okkk -> 0x1: (kind 1) int (format 0x1) (size 0x4) (aligned at 0x4)


Since vmlinux + *.ko are *big* binaries, I arranged the order of the
CTF lookup functions for performance reasons: first
`ctf_lookup_variable' and then, if it fails,
`ctf_lookup_by_symbol_name'.  But I agree to remove
`!(corp->get_origin() & corpus::LINUX_KERNEL_BINARY_ORIGIN)' and to
change the invocation order of those functions.  The performance
penalty was less than 10s when building the ABI representation for the
kernel, which I consider acceptable.


>> +    ctf_type = ctf_lookup_by_symbol_name(dict, sym_name);
> 
> So I am guessing that ctf_lookup_by_symbol_name looks up both variable
> and function symbols from the same dictionary, is that correct?

True.

> Also, I don't understand why we don't just use ctf_lookup_by_symbol_name
> rather than starting with ctf_lookup_variable first.  Is it a
> performance thing?

Exactly. Performance when we are processing a Linux tree directory.

> Incidentally, I haven't found documentation for the lookup functions
> other than by looking at the code, in say:
> https://sourceware.org/git/?p=binutils-gdb.git;a=blob_plain;f=libctf/ctf-lookup.c;hb=refs/heads/master.

I'm afraid that the documentation is just in the source code.

> If there is documentation for it somewhere else, maybe we can link that
> place in the code here in a comment somewhere, or we can just point to
> that link above.  Both would be fine by me.
> 
>> +
>> +  /* Not lucky, then, search in whole archive */
>> +  if (ctf_type == (ctf_id_t) -1)
>> +    {
>> +      ctf_dict_t *fp;
>> +      ctf_next_t *i = NULL;
>> +      const char *arcname;
>> +
>> +      while ((fp = ctf_archive_next(ctfa, &i, &arcname, 1, &ctf_err)) != NULL)
>> +        {
>> +          ctf_type = ctf_lookup_variable (fp, sym_name);
>> +          if (ctf_type == (ctf_id_t) -1
>> +              && !(corp->get_origin() & corpus::LINUX_KERNEL_BINARY_ORIGIN))
> 
> The same questions as above.
> 
>> +            ctf_type = ctf_lookup_by_symbol_name(fp, sym_name);
>> +
>> +          if (ctf_type != (ctf_id_t) -1)
>> +            {
>> +              *ctf_dict = fp;
>> +              break;
>> +            }
>> +          ctf_dict_close(fp);
>> +        }
>> +    }
>> +
>> +  return ctf_type;
>> +}
>> +
> 
> Cheers,
> 
> [...]
> 
> 

Many thanks for your comments!
I will prepare the v2.

Kind regards,
guillermo
  

Patch

diff --git a/src/abg-ctf-reader.cc b/src/abg-ctf-reader.cc
index 71808f9a..8fa98a94 100644
--- a/src/abg-ctf-reader.cc
+++ b/src/abg-ctf-reader.cc
@@ -1204,6 +1204,62 @@  lookup_type(read_context *ctxt, corpus_sptr corp,
   return result;
 }
 
+/// Given a symbol name, look up the corresponding CTF information in
+/// the default dictionary (the CTF archive member provided by the
+/// caller).  If that search fails, look for the symbol name in _all_
+/// archive members.
+///
+/// @param ctfa the CTF archive.
+/// @param ctf_dict the default dictionary to look in.
+/// @param sym_name the symbol name.
+/// @param corp the IR corpus.
+///
+/// Note that if @ref sym_name is found in a dictionary other than the
+/// default one, @ref ctf_dict is updated and must be explicitly closed
+/// by the caller.
+///
+/// @return a valid CTF type id if @ref sym_name was found, -1 otherwise.
+
+static ctf_id_t
+lookup_symbol_in_ctf_archive(ctf_archive_t *ctfa, ctf_dict_t **ctf_dict,
+                             const char *sym_name, corpus_sptr corp)
+{
+  int ctf_err;
+  ctf_dict_t *dict = *ctf_dict;
+  ctf_id_t ctf_type = ctf_lookup_variable(dict, sym_name);
+
+  /* lookup CTF type for a given symbol in its default
+     dictionary */
+  if (ctf_type == (ctf_id_t) -1
+      && !(corp->get_origin() & corpus::LINUX_KERNEL_BINARY_ORIGIN))
+    ctf_type = ctf_lookup_by_symbol_name(dict, sym_name);
+
+  /* Not lucky, then, search in whole archive */
+  if (ctf_type == (ctf_id_t) -1)
+    {
+      ctf_dict_t *fp;
+      ctf_next_t *i = NULL;
+      const char *arcname;
+
+      while ((fp = ctf_archive_next(ctfa, &i, &arcname, 1, &ctf_err)) != NULL)
+        {
+          ctf_type = ctf_lookup_variable (fp, sym_name);
+          if (ctf_type == (ctf_id_t) -1
+              && !(corp->get_origin() & corpus::LINUX_KERNEL_BINARY_ORIGIN))
+            ctf_type = ctf_lookup_by_symbol_name(fp, sym_name);
+
+          if (ctf_type != (ctf_id_t) -1)
+            {
+              *ctf_dict = fp;
+              break;
+            }
+          ctf_dict_close(fp);
+        }
+    }
+
+  return ctf_type;
+}
+
 /// Process a CTF archive and create libabigail IR for the types,
 /// variables and function declarations found in the archive, iterating
 /// over public symbols.  The IR is added to the given corpus.
@@ -1222,7 +1278,7 @@  process_ctf_archive(read_context *ctxt, corpus_sptr corp)
   corp->add(ir_translation_unit);
 
   int ctf_err;
-  ctf_dict_t *ctf_dict;
+  ctf_dict_t *ctf_dict, *dict_tmp;
   const auto symtab = ctxt->symtab;
   symtab_reader::symtab_filter filter = symtab->make_filter();
   filter.set_public_symbols();
@@ -1248,19 +1304,17 @@  process_ctf_archive(read_context *ctxt, corpus_sptr corp)
       abort();
     }
 
+  dict_tmp = ctf_dict;
+
   for (const auto& symbol : symtab_reader::filtered_symtab(*symtab, filter))
     {
       std::string sym_name = symbol->get_name();
       ctf_id_t ctf_sym_type;
 
-      ctf_sym_type = ctf_lookup_variable(ctf_dict, sym_name.c_str());
-      if (ctf_sym_type == (ctf_id_t) -1
-          && !(corp->get_origin() & corpus::LINUX_KERNEL_BINARY_ORIGIN))
-        // lookup in function objects
-        ctf_sym_type = ctf_lookup_by_symbol_name(ctf_dict, sym_name.c_str());
-
+      ctf_sym_type = lookup_symbol_in_ctf_archive(ctxt->ctfa, &ctf_dict,
+                                                  sym_name.c_str(), corp);
       if (ctf_sym_type == (ctf_id_t) -1)
-        continue;
+          continue;
 
       if (ctf_type_kind(ctf_dict, ctf_sym_type) != CTF_K_FUNCTION)
         {
@@ -1305,6 +1359,8 @@  process_ctf_archive(read_context *ctxt, corpus_sptr corp)
           func_declaration->set_is_in_public_symbol_table(true);
           ctxt->maybe_add_fn_to_exported_decls(func_declaration.get());
         }
+
+      ctf_dict = dict_tmp;
     }
 
   ctf_dict_close(ctf_dict);