[RFC] partial symbol name matching vs regexp

  Hello,

First of all, apologies for the long-ish email below; I was just
trying to be thorough and go slow in my explanation. I hope nothing's
too convoluted, and that the email can be read without too much
brain intensity.... (the fixing, on the other hand... ;-))

One of our users reported that info types was sometimes not working
for Ada types. Consider for instance the following code:

    package Alpha is
       type Beta is record
          B : Integer;
       end record;
    end Alpha;

This package defines a structure type called "Beta" inside package
Alpha. The fully qualified (natural) name of the type is "Alpha.Beta"
and underneath, its linkage name is alpha__beta.
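
As a rough illustration of that mapping, here is a simplified sketch that assumes only the two rules visible in this example: the name is lower-cased and each "." becomes "__". The helper name simple_ada_encode is hypothetical; GDB's real encoding logic covers many more cases and is not reproduced here.

```cpp
#include <cassert>
#include <cctype>
#include <string>

/* Hypothetical helper (not a GDB function): simplified Ada name
   encoding.  Lower-cases the qualified name and replaces each '.'
   with "__".  The real encoding rules cover many more cases.  */
std::string
simple_ada_encode (const std::string &natural)
{
  std::string encoded;
  for (unsigned char c : natural)
    {
      if (c == '.')
        encoded += "__";
      else
        encoded += static_cast<char> (std::tolower (c));
    }
  return encoded;
}

/* Usage example for the type discussed above.  */
void
check_simple_ada_encode ()
{
  assert (simple_ada_encode ("Alpha.Beta") == "alpha__beta");
}
```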

What the user is trying to do is use the "info types" command with
the fully qualified name, something like:

    (gdb) info types alpha.beta

Unfortunately, as of today, this does not always work:

    $ gdb -q foo
    Reading symbols from foo...
    (gdb) info types alpha.beta
    All types matching regular expression "alpha.beta":
    (gdb)

The same query but with just the type's name, not qualified,
does work; and once that's done, package Alpha's partial symtab
is expanded, so the initial query above also starts working:

    $ gdb -q foo
    Reading symbols from foo...
    (gdb) info types beta
    All types matching regular expression "beta":

    File alpha.ads:
    2:	alpha.beta;
    (gdb) info types alpha.beta
    All types matching regular expression "alpha.beta":

    File alpha.ads:
    2:	alpha.beta;

So, the problem is in the matching of the partial symtabs we want to expand.

The function that searches for matching symbols is in symtab.c...

    | std::vector<symbol_search>
    | search_symbols (const char *regexp, enum search_domain kind,
    |                 int nfiles, const char *files[])

... and the expansion is done via a call to expand_symtabs_matching:

    | /* Search through the partial symtabs *first* for all symbols
    |    matching the regexp.  That way we don't have to reproduce all of
    |    the machinery below.  */
    | expand_symtabs_matching ([file_matcher snip'ed],
    |                          lookup_name_info::match_any (),
    |                       |  [&] (const char *symname)
    |                       |  {
    |    [symbol_matcher]---+    return (!preg || preg->exec (symname,
    |                       |                                 0, NULL, 0) == 0);
    |                       |  },
    |                          NULL,
    |                          kind);

The part of interest is the symbol_matcher argument, passed via
a lambda function which evaluates the regexp against the given
symname; I've kind of highlighted that argument in the copy/paste
above, but here it is again, for the avoidance of doubt:

    |  [&] (const char *symname)
    |  {
    |    return (!preg || preg->exec (symname,
    |                                 0, NULL, 0) == 0);
    |  },

As you can see, it is a fairly straightforward implementation
that runs the given symbol name against the regexp.

In our case, that symbol_matcher doesn't match alpha.beta,
because SYMNAME is the symbol's search name, and in Ada,
we search by encoded name. Thus, for the symbol we were
hoping to match, it gets called with "alpha__beta".

We can verify that the search_name gets used by looking at
the caller of that lambda function (recursively_search_psymtabs),
which iterates over all the symbols of a partial symtab with:

    | && (sym_matcher == NULL || sym_matcher (symbol_search_name (*psym))))

That's why it doesn't work.
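
To make the mismatch concrete, here is a small standalone sketch. It uses std::regex purely for illustration (GDB goes through its own regexp wrapper, the preg object above), and the function names are hypothetical. The "." in the regexp matches exactly one character, so it cannot span the two underscores of the encoded name.

```cpp
#include <cassert>
#include <regex>
#include <string>

/* Hypothetical stand-in for the symbol_matcher lambda: true if REGEXP
   matches anywhere in SYMNAME.  */
bool
name_matches (const std::string &regexp, const std::string &symname)
{
  return std::regex_search (symname, std::regex (regexp));
}

void
check_search_name_mismatch ()
{
  /* What the psymtab matcher sees for Ada: the encoded name.  */
  assert (!name_matches ("alpha.beta", "alpha__beta"));
  /* What the user has in mind: the natural name.  */
  assert (name_matches ("alpha.beta", "alpha.beta"));
  /* The unqualified query matches the encoded name as well, which is
     why "info types beta" triggers the expansion.  */
  assert (name_matches ("beta", "alpha__beta"));
}
```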

After pondering it a bit, my overall thinking is that regular
expressions used in the context of matching symbol names have to be
coming from a user, who normally reasons in terms of natural names;
and therefore these regexps should be evaluated against the symbols'
natural name, rather than their search name.

With that assumption in mind, I think the way to make it work
is to change the symbol_matcher's signature to receive more
information - so that each caller of expand_symtabs_matching can
then decide which symbol name it needs to look at; in most cases,
it will be the symbol_search_name, but in the particular case of
search_symbols where we're comparing against a regexp that we assume
comes from the user, we can decide to use the symbol_natural_name
instead.

This would actually be consistent with the rest of search_symbols's
implementation; as you can see, once the partial symbol expansion
is performed, it iterates over minimal symbols and full symbols,
and selects them based on the symbol's natural name as well:

    | ALL_MSYMBOLS (objfile, msymbol)
    | [...]
    |       if (!preg
    |           || preg->exec (MSYMBOL_NATURAL_NAME (msymbol), 0,

... and ...

    | ALL_COMPUNITS (objfile, cust)
    | [...]
    |      && ((!preg
    |           || preg->exec (SYMBOL_NATURAL_NAME (sym), 0,
    |                          NULL, 0) == 0)

Following this, I thought the simplest would probably be to have
the symbol_matcher receive a general_symbol_info instead of
just the name.
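
The idea can be sketched in isolation as follows. This is not the attached patch: fake_symbol_info is a hypothetical stand-in for general_symbol_info that simply stores both spellings of the name, and symbol_matcher_fn, would_expand and the two make_* helpers are likewise invented for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <string>
#include <vector>

/* Hypothetical stand-in for general_symbol_info: just enough to carry
   both spellings of a symbol's name.  */
struct fake_symbol_info
{
  std::string search_name;   /* E.g. "alpha__beta" (Ada encoded name).  */
  std::string natural_name;  /* E.g. "alpha.beta".  */
};

/* Matcher type that receives the symbol rather than just a name.  */
using symbol_matcher_fn = std::function<bool (const fake_symbol_info &)>;

/* Stand-in for the psymtab walk: true if MATCHER accepts at least one
   symbol, i.e. if the symtab would get expanded.  */
bool
would_expand (const std::vector<fake_symbol_info> &psyms,
              const symbol_matcher_fn &matcher)
{
  for (const fake_symbol_info &sym : psyms)
    if (matcher (sym))
      return true;
  return false;
}

/* Matcher for a user-provided regexp: looks at the natural name.  */
symbol_matcher_fn
make_natural_name_matcher (const std::string &regexp)
{
  std::regex re (regexp);
  return [re] (const fake_symbol_info &sym)
    {
      return std::regex_search (sym.natural_name, re);
    };
}

/* Matcher reproducing the current behavior: looks at the search name.  */
symbol_matcher_fn
make_search_name_matcher (const std::string &regexp)
{
  std::regex re (regexp);
  return [re] (const fake_symbol_info &sym)
    {
      return std::regex_search (sym.search_name, re);
    };
}
```

With a single symbol { "alpha__beta", "alpha.beta" }, would_expand returns true for make_natural_name_matcher ("alpha.beta") and false for make_search_name_matcher ("alpha.beta"). Each caller picks the name it needs, without the walk itself having to know.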

I tried writing a prototype that does just this, and it seems to be
giving the results I hoped, without any regression on x86_64-linux,
using both the official testsuite as well as AdaCore's testsuite.
See the attached patch that modifies expand_symtabs_symbol_matcher_ftype
to take a general_symbol_info instead of a name, and then adjusts
all users.

Unfortunately, while working on that prototype, I realized we have
.gdb_index support in dwarf2read.c that uses that infrastructure too.
Given that gdb_index is just an elaborate index (and therefore just
a very partial view of the debugging information), I don't think
the entries in the gdb_index symbol table have enough information
to allow us to construct a complete general_symbol_info object.
The prototype I attached side-steps the question by just creating
partial general_symbol_info objects where only the name is set.
It allows me to get identical results for any language but Ada,
knowing that it doesn't interfere with my Ada testing, since we
do not support gdb_index with Ada yet.

Trying to resolve that hitch with the gdb_index handling, and
after having read the description of what's in the gdb_index
section, I think there should be a way to link a symbol in
the gdb_index back to the DWARF CU, and therefore get the symbol's
language. Once we have the language in addition to the symbol name,
we should have all we need for our immediate purposes.

Thus, a second option would be to keep the name parameter
in expand_symtabs_symbol_matcher_ftype, but then add the symbol
language as a second parameter; we would then modify the symbol_matcher
in search_symbols to check the language, and if the language is Ada,
then decode the name before evaluating the regexp. Outside of adjusting
the prototype of a couple of lambda functions, everything else would
remain the same.
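
A minimal sketch of what such a two-parameter matcher could look like, under loud assumptions: sym_language is a reduced, hypothetical enumeration, simple_ada_decode only undoes the "__" separator (real Ada decoding handles much more), and std::regex stands in for GDB's own regexp wrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

/* Hypothetical, reduced language enumeration.  */
enum class sym_language { c, cplus, ada };

/* Hypothetical, simplified decoder: turns each "__" into ".".  */
std::string
simple_ada_decode (const std::string &encoded)
{
  std::string decoded;
  for (std::size_t i = 0; i < encoded.size (); ++i)
    {
      if (encoded.compare (i, 2, "__") == 0)
        {
          decoded += '.';
          ++i;  /* Skip the second underscore.  */
        }
      else
        decoded += encoded[i];
    }
  return decoded;
}

/* Two-parameter matcher: decode Ada names before applying the regexp,
   and use the name as-is for every other language.  */
bool
regexp_matches_symbol (const std::regex &re, const std::string &symname,
                       sym_language lang)
{
  if (lang == sym_language::ada)
    return std::regex_search (simple_ada_decode (symname), re);
  return std::regex_search (symname, re);
}

void
check_language_aware_matcher ()
{
  std::regex re ("alpha.beta");
  assert (regexp_matches_symbol (re, "alpha__beta", sym_language::ada));
  assert (!regexp_matches_symbol (re, "alpha__beta", sym_language::c));
}
```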

But I'm not really satisfied with the second option (two parameters).
What if a search routine wanted to expand based on the linkage name
one day, even for C++ symbols? I don't know if mangling a symbol back
would be possible or not, but regardless of that, it would feel wrong
to compute a mangled name when the caller already had it.

For that reason, I tend to still think that changing
expand_symtabs_symbol_matcher_ftype to receive a general_symbol_info
is better. It's not perfect in the case of gdb_index handling,
but I think that the consequences of that peculiarity would be
contained within the gdb_index handling in dwarf2read.c. So,
at least, it wouldn't be a caller in dwarf2read.c passing an
unsuspecting symbol_matcher function defined elsewhere an incomplete
general_symbol_info.

We could possibly mitigate the problem by documenting that only
the name and language of the general_symbol_info parameter of
expand_symtabs_symbol_matcher_ftype should be accessed,
but I think it's too easy for people to miss that and access
the other fields anyway. Better to put the documentation in dwarf2read's
particular implementation, IMO, and make sure that it remains
contained and consistent within that unit.

Attached is my prototype patch, with a small test that has one
failure unless the patch is applied. The test is very simple,
and I intend to make a more complex one, but it should make it
easier for you to try this example should you like to.

Please let me know what you think!

Thank you,
