Add wildcard matching to substitute-path rules

Message ID 5ff720ec-4453-441e-ba82-84d60a2e8d50@campus.tu-berlin.de
State New
Headers
Series Add wildcard matching to substitute-path rules |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64 success Testing passed

Commit Message

Max Yvon Zimmermann April 1, 2024, 9:26 p.m. UTC
  Hi everyone!

I have developed a patch that allows for glob pattern matching within substitute-path rules.

Why is this useful?
Some build systems and/or package managers may copy the sources of a program before build. The path of this copy may not always be the same. In the case of conan (https://conan.io/) the path will consist of some constant part and some directory named with some hash value (i.e., /path/to/<unique hash>/builddir/). This hash will differ with each build, and therefore any source redirection within GDB must be reconfigured. This is, of course, really annoying when debugging a library. You might write a custom startup script for GDB to set the correct path automatically. But I thought it would be much more elegant if GDB would just support glob patterns (so that the rule /path/to/*/builddir/ would match any relevant build directory).

How does the patch work?
I wrote a custom glob matching function gdb_fileprefix_fnmatch() that is called from substitute_path_rule_matches() (maybe this function should be renamed because its meaning is slightly different now). It is located in utils.c. If you think this function should be defined at another location, let me know.
I considered just using fnmatch(), but I found that this is not viable for the problem at hand. fnmatch() matches a complete string, but I need to find the longest leading substring (and its length) that matches a given pattern. So calling fnmatch() multiple times would have quadratic complexity.
I also didn't want to add any additional dependencies.
Instead, I was inspired by git's implementation (https://github.com/git/git/blob/master/wildmatch.c) as well as fnmatch() from libiberty. It is recursive, but given that usually only a few (often just one) wildcards are used, this should not be an issue.

What must be considered?
This patch will break some (rather obscure) substitute-path rules. Any rule containing the characters '\', '?' and '*' must be rewritten to escape the aforementioned characters. This could get messy because backslashes must already be escaped in the GDB command line. Maybe another command (substitute-path-glob) should be introduced instead. Personally, I am ok with breaking stuff because it's only a small change to perform for the user, but that's not really my decision to make.
I have not implemented character sets yet. I don't really see use cases for them in this setting. But if you want me to add them for completeness, I would be willing to do so.
I also actively decided against implementing the '**' wildcard (match any string, also match directory separators), as this could lead to ambiguous matches.

Testing
I have created the 'gdb.base/subst-glob.exp' testsuite to test out the new feature. I also tested against 'gdb.base/subst.exp'. The tests were performed on GNU/Linux and Windows (MinGW). I did not find any regressions in the testsuite (on Linux).

Copyright
I have not yet signed any papers regarding FSF copyright assignment.

Please let me know what you think of this patch!
Thank you in advance,
Max

---
 gdb/NEWS                              |   5 +
 gdb/doc/gdb.texinfo                   |  29 +++++
 gdb/source.c                          |  77 ++++++++----
 gdb/testsuite/gdb.base/subst-glob.exp | 166 ++++++++++++++++++++++++++
 gdb/utils.c                           | 100 ++++++++++++++++
 gdb/utils.h                           |   2 +
 6 files changed, 356 insertions(+), 23 deletions(-)
 create mode 100644 gdb/testsuite/gdb.base/subst-glob.exp
  

Comments

Eli Zaretskii April 2, 2024, 11:46 a.m. UTC | #1
> Date: Mon, 1 Apr 2024 23:26:36 +0200
> From: Max Yvon Zimmermann <max.yvon.zimmermann@campus.tu-berlin.de>
> 
> I have developed a patch that allows for glob pattern matching within substitute-path rules.

Thanks.

> diff --git a/gdb/NEWS b/gdb/NEWS
> index feb3a37393a..5a041175507 100644
> --- a/gdb/NEWS
> +++ b/gdb/NEWS
> @@ -90,6 +90,11 @@ show unwind-on-signal
>    These new commands replaces the existing set/show unwindonsignal.  The
>    old command is maintained as an alias.
>  
> +set substitute-path
> +  This command now supports glob pattern matching for substitution
> +  rules.  Wildcards '?' and '*' are supported.  Use '\' to escape
> +  '?', '*' and '\' characters.

This part is approved, documentation-wise.

But I wonder what does the above mean for Windows file names that use
backslashes as directory separators?  Does each backslash need to be
escaped by another one, when creating the rewrite rules??

> --- a/gdb/doc/gdb.texinfo
> +++ b/gdb/doc/gdb.texinfo
> @@ -9954,6 +9954,35 @@ For instance, if we had entered the following commands:
>  use the second rule to rewrite @file{/usr/src/lib/foo.c} into
>  @file{/mnt/src/lib/foo.c}.
>  
> +Rules can contain wildcards to match multiple paths.  The supported
> +wildcards are @samp{?} (to match any single character) and @samp{*}
> +(to match any string).  Wildcards will never match the path separator of
> +the system.

File names of their parts should have the @file markup, not @samp.
(Wildcards are file names of sorts.)

> +For instance, if we had entered the following command:
> +
> +@smallexample
> +(@value{GDBP}) set substitute-path /usr/*/include /mnt/include
> +@end smallexample

Does this mean that multiple /usr/*/include directories could be
redirected to the same /mnt/include directory?  Won't that cause
trouble, if there are several distinct directories recorded in the
debug info whose names match the wildcard, and which are supposed to
resolve to different directories?

> +Use @samp{@backslashchar{}} to escape the characters @samp{?}@comma{}
> +@samp{*} and @samp{@backslashchar{}}.  Note that you need to escape any
> +@samp{@backslashchar{}} characters twice in the @value{GDBN} command line.

I'd prefer not to use @comma{} and @backslashchar{} except where the
literal characters will confuse TeX.  AFAIU, they are not needed here,
and literal characters will do the job.

> +  if (IS_DIR_SEPARATOR (path[last]))
> +    {
> +      path[last] = '\0';
> +
> +      if (path[last] != '\\')
> +        return;

Something's wrong here: you assign path[last] and then test its value?

Also, how does this handle Windows-style file names with backslashes?

> +#ifdef HAVE_DOS_BASED_FILE_SYSTEM
> +          /* On DOS-based file systems, the '/' and the '\' are equivalent.  */
> +          if (pattern_c == '/')
> +            pattern_c = '\\';
> +          if (string_c == '/')
> +            string_c = '\\';
> +#endif

Does this mean "//" (two slashes in a row) will be converted to "\\",
and will then be interpreted as an escaped single backslash?

Reviewed-By: Eli Zaretskii <eliz@gnu.org>
  
Max Yvon Zimmermann April 4, 2024, 12:05 a.m. UTC | #2
Thank you for the quick reply!

>> diff --git a/gdb/NEWS b/gdb/NEWS
>> index feb3a37393a..5a041175507 100644
>> --- a/gdb/NEWS
>> +++ b/gdb/NEWS
>> @@ -90,6 +90,11 @@ show unwind-on-signal
>>    These new commands replaces the existing set/show unwindonsignal.  The
>>    old command is maintained as an alias.
>>  
>> +set substitute-path
>> +  This command now supports glob pattern matching for substitution
>> +  rules.  Wildcards '?' and '*' are supported.  Use '\' to escape
>> +  '?', '*' and '\' characters.
> 
> This part is approved, documentation-wise.
> 
> But I wonder what does the above mean for Windows file names that use
> backslashes as directory separators?  Does each backslash need to be
> escaped by another one, when creating the rewrite rules??

On DOS-based systems both backslashes and forward slashes are valid path separators. So while Windows users COULD write their paths as \\path\\to\\, /path/to/ would work as well. This behavior is not new, the previous version behaved similarly. New is that to match a backslash from the GDB command line you would have to escape a backslash twice (because the CLI already unescapes backslashes). But because everyone can just continue to use forwards slashes, I think this should be fine.

>> --- a/gdb/doc/gdb.texinfo
>> +++ b/gdb/doc/gdb.texinfo
>> @@ -9954,6 +9954,35 @@ For instance, if we had entered the following commands:
>>  use the second rule to rewrite @file{/usr/src/lib/foo.c} into
>>  @file{/mnt/src/lib/foo.c}.
>>  
>> +Rules can contain wildcards to match multiple paths.  The supported
>> +wildcards are @samp{?} (to match any single character) and @samp{*}
>> +(to match any string).  Wildcards will never match the path separator of
>> +the system.
> 
> File names of their parts should have the @file markup, not @samp.
> (Wildcards are file names of sorts.)

Ok, I will change that.

>> +For instance, if we had entered the following command:
>> +
>> +@smallexample
>> +(@value{GDBP}) set substitute-path /usr/*/include /mnt/include
>> +@end smallexample
> 
> Does this mean that multiple /usr/*/include directories could be
> redirected to the same /mnt/include directory?  Won't that cause
> trouble, if there are several distinct directories recorded in the
> debug info whose names match the wildcard, and which are supposed to
> resolve to different directories?

I realize now that /usr/*/include is actually a terrible example for this feature. I will change the example to avoid confusion. However it is correct that two files from distinct directories COULD accidentally get matched. But you could argue that this is already the case. Let's say we set the following two rules:

set substitute-path /foo/include /mnt/include
set substitute-path /bar/include /mnt/include

Then we could run into exactly the same issue.

I would also like to point out that the rule selection is unaffected by this patch. To quote from the documentation: "In the case when more than one substitution rule have been defined, the rules are evaluated one by one in the order where they have been defined. The first one matching, if any, is selected to perform the substitution." I think this should prevent any accidental matches if the feature is used correctly.

>> +Use @samp{@backslashchar{}} to escape the characters @samp{?}@comma{}
>> +@samp{*} and @samp{@backslashchar{}}.  Note that you need to escape any
>> +@samp{@backslashchar{}} characters twice in the @value{GDBN} command line.
> 
> I'd prefer not to use @comma{} and @backslashchar{} except where the
> literal characters will confuse TeX.  AFAIU, they are not needed here,
> and literal characters will do the job.
> 
>> +  if (IS_DIR_SEPARATOR (path[last]))
>> +    {
>> +      path[last] = '\0';
>> +
>> +      if (path[last] != '\\')
>> +        return;
> 
> Something's wrong here: you assign path[last] and then test its value?
> 
> Also, how does this handle Windows-style file names with backslashes?

Yes, this is a bug. I actually wrote this function specifically for DOS-style file names. The idea is that any trailing backslash (that should be deleted) will be preceded by another (escape) backslash that must also be deleted. I will post a V2 soon where this function is getting a major overhaul and the relevant edge cases are included in the testsuite.

>> +#ifdef HAVE_DOS_BASED_FILE_SYSTEM
>> +          /* On DOS-based file systems, the '/' and the '\' are equivalent.  */
>> +          if (pattern_c == '/')
>> +            pattern_c = '\\';
>> +          if (string_c == '/')
>> +            string_c = '\\';
>> +#endif
> 
> Does this mean "//" (two slashes in a row) will be converted to "\\",
> and will then be interpreted as an escaped single backslash?

No, this is only done for the comparison directly below. The pattern is not altered in any way. "//" will not result in an escaped single backslash.
  
Eli Zaretskii April 4, 2024, 5:53 a.m. UTC | #3
> Date: Thu, 4 Apr 2024 02:05:11 +0200
> CC: <gdb-patches@sourceware.org>
> From: Max Yvon Zimmermann <max.yvon.zimmermann@campus.tu-berlin.de>
> 
> Thank you for the quick reply!

Thank you for working on this feature in the first place.

> > But I wonder what does the above mean for Windows file names that use
> > backslashes as directory separators?  Does each backslash need to be
> > escaped by another one, when creating the rewrite rules??
> 
> On DOS-based systems both backslashes and forward slashes are valid path separators. So while Windows users COULD write their paths as \\path\\to\\, /path/to/ would work as well. This behavior is not new, the previous version behaved similarly. New is that to match a backslash from the GDB command line you would have to escape a backslash twice (because the CLI already unescapes backslashes). But because everyone can just continue to use forwards slashes, I think this should be fine.

So extra \-escaping will still be needed for DOS-based filesystems.  I
think this should be mentioned in the documentation.

> > Does this mean that multiple /usr/*/include directories could be
> > redirected to the same /mnt/include directory?  Won't that cause
> > trouble, if there are several distinct directories recorded in the
> > debug info whose names match the wildcard, and which are supposed to
> > resolve to different directories?
> 
> I realize now that /usr/*/include is actually a terrible example for this feature. I will change the example to avoid confusion. However it is correct that two files from distinct directories COULD accidentally get matched. But you could argue that this is already the case. Let's say we set the following two rules:
> 
> set substitute-path /foo/include /mnt/include
> set substitute-path /bar/include /mnt/include
> 
> Then we could run into exactly the same issue.

The difference here is that the two lines above must be explicitly
written by the user, whereas the wildcard rule could catch another
directory unintentionally.

> I would also like to point out that the rule selection is unaffected by this patch. To quote from the documentation: "In the case when more than one substitution rule have been defined, the rules are evaluated one by one in the order where they have been defined. The first one matching, if any, is selected to perform the substitution."

To me, this means that the two-line example you've shown is processed
in a predictable way, by only using the first one if both match,
whereas a wildcard rule will cause unpredictable results, because the
order in which files are compared against a wildcard is undefined.
Right?

Thanks.
  
Max Yvon Zimmermann April 4, 2024, 8:52 a.m. UTC | #4
Thank you.

>> On DOS-based systems both backslashes and forward slashes are valid path separators. So while Windows users COULD write their paths as \\path\\to\\, /path/to/ would work as well. This behavior is not new, the previous version behaved similarly. New is that to match a backslash from the GDB command line you would have to escape a backslash twice (because the CLI already unescapes backslashes). But because everyone can just continue to use forwards slashes, I think this should be fine.
> 
> So extra \-escaping will still be needed for DOS-based filesystems.  I
> think this should be mentioned in the documentation.

Actually, this is already mentioned in the documentation. But maybe it can be made a bit clearer. On the other hand, I really do not want to encourage the extra \-escaping. I am thinking of writing something like this instead:

"On DOS-based file system, you are encouraged to use '/' as the path separator within your patterns. This avoids escaping the '\' characters and would still match against file names that use '\' as the path separator. For instance, the pattern 'C:/foo/*' would still match 'C:\foo\bar'."

>> I realize now that /usr/*/include is actually a terrible example for this feature. I will change the example to avoid confusion. However it is correct that two files from distinct directories COULD accidentally get matched. But you could argue that this is already the case. Let's say we set the following two rules:
>>
>> set substitute-path /foo/include /mnt/include
>> set substitute-path /bar/include /mnt/include
>>
>> Then we could run into exactly the same issue.
> 
> The difference here is that the two lines above must be explicitly
> written by the user, whereas the wildcard rule could catch another
> directory unintentionally.
> 
>> I would also like to point out that the rule selection is unaffected by this patch. To quote from the documentation: "In the case when more than one substitution rule have been defined, the rules are evaluated one by one in the order where they have been defined. The first one matching, if any, is selected to perform the substitution."
> 
> To me, this means that the two-line example you've shown is processed
> in a predictable way, by only using the first one if both match,
> whereas a wildcard rule will cause unpredictable results, because the
> order in which files are compared against a wildcard is undefined.
> Right?

No, rules that use wildcards are still processed in a predictable way. If we set the following two rules (and in that order!):

set substitute-path /foo/include /path/to/foo
set substitute-path /*/include /path/to/somewhere

Then any file names matching '/foo/include' would be rewritten to '/path/to/foo' and NOT '/path/to/somewhere'. Even though any file name matching '/foo/include' would of course also match '/*/include'. The rule selection is entirely based on the order of their definition.

Also, in the example I stated in my last reply, no file name that matches '/foo/include' would ever match '/bar/include'. The point I was trying to get across was that two unrelated files can already be mapped onto the same file (accidentally).
  
Eli Zaretskii April 4, 2024, 11:52 a.m. UTC | #5
> Date: Thu, 4 Apr 2024 10:52:31 +0200
> CC: <gdb-patches@sourceware.org>
> From: Max Yvon Zimmermann <max.yvon.zimmermann@campus.tu-berlin.de>
> 
> Actually, this is already mentioned in the documentation. But maybe it can be made a bit clearer. On the other hand, I really do not want to encourage the extra \-escaping. I am thinking of writing something like this instead:
> 
> "On DOS-based file system, you are encouraged to use '/' as the path separator within your patterns. This avoids escaping the '\' characters and would still match against file names that use '\' as the path separator. For instance, the pattern 'C:/foo/*' would still match 'C:\foo\bar'."

I'm not sure this should be in the manual, since the way file names
are written in the debug info is not under user's control.  If the
file names in the debug info use backslashes, will GDB recognize
rewrite rules that use forward slashes?  Even if GDB does treat both
characters as equal on Windows, I'm not sure users will trust that up
front, so they might use backslashes in the rewrite rules anyway.

> > To me, this means that the two-line example you've shown is processed
> > in a predictable way, by only using the first one if both match,
> > whereas a wildcard rule will cause unpredictable results, because the
> > order in which files are compared against a wildcard is undefined.
> > Right?
> 
> No, rules that use wildcards are still processed in a predictable way. If we set the following two rules (and in that order!):
> 
> set substitute-path /foo/include /path/to/foo
> set substitute-path /*/include /path/to/somewhere
> 
> Then any file names matching '/foo/include' would be rewritten to '/path/to/foo' and NOT '/path/to/somewhere'. Even though any file name matching '/foo/include' would of course also match '/*/include'. The rule selection is entirely based on the order of their definition.

That's not what I meant.  I meant that a rule like

  set substitute-path /*/include /path/to/somewhere

can catch any parent directory of 'include', and which one it will
catch is not known in advance.

> Also, in the example I stated in my last reply, no file name that matches '/foo/include' would ever match '/bar/include'. The point I was trying to get across was that two unrelated files can already be mapped onto the same file (accidentally).

Well, two wrongs don't make a right.

I'd be interested to hear opinions of others about this.
  
Guinevere Larsen April 4, 2024, 12:42 p.m. UTC | #6
On 4/4/24 08:52, Eli Zaretskii wrote:
>>> To me, this means that the two-line example you've shown is processed
>>> in a predictable way, by only using the first one if both match,
>>> whereas a wildcard rule will cause unpredictable results, because the
>>> order in which files are compared against a wildcard is undefined.
>>> Right?
>> No, rules that use wildcards are still processed in a predictable way. If we set the following two rules (and in that order!):
>>
>> set substitute-path /foo/include /path/to/foo
>> set substitute-path /*/include /path/to/somewhere
>>
>> Then any file names matching '/foo/include' would be rewritten to '/path/to/foo' and NOT '/path/to/somewhere'. Even though any file name matching '/foo/include' would of course also match '/*/include'. The rule selection is entirely based on the order of their definition.
> That's not what I meant.  I meant that a rule like
>
>    set substitute-path /*/include /path/to/somewhere
>
> can catch any parent directory of 'include', and which one it will
> catch is not known in advance.

This was something I was also wondering about. taking from the original 
example:

Max> In the case of conan (https://conan.io/) the path will consist of 
some constant part and some directory named with some hash value (i.e., 
/path/to/<unique hash>/builddir/)

Suppose there are 2 builds for this package, so we have the following 
valid directories: /path/to/deadbeef/builddir /path/to/00000000/builddir

which of those will be chosen? Is it consistent across gdb runs? OSs? 
How can a user know in advance which will be picked?

>
>> Also, in the example I stated in my last reply, no file name that matches '/foo/include' would ever match '/bar/include'. The point I was trying to get across was that two unrelated files can already be mapped onto the same file (accidentally).

The difference I see is that in the previous examples, the problem is 
apparent within what the user would expect (some GDB script, debug 
information, or other things related to their project), whereas with 
glob matching, the problem may be unrelated (directory structure of 
unrelated folders).

If I wasn't aware of this patch, I would never think to check if the 
folder structure of my system could be the problem for getting debug 
information.

> Well, two wrongs don't make a right.
>
> I'd be interested to hear opinions of others about this.
  
Max Yvon Zimmermann April 4, 2024, 10:52 p.m. UTC | #7
Thank you once again for your feedback. You made me take a deep dive into how source redirection works in detail. It was quite fun!
I will try to answer the concerns from both of you in this reply.

>> Actually, this is already mentioned in the documentation. But maybe it can be made a bit clearer. On the other hand, I really do not want to encourage the extra \-escaping. I am thinking of writing something like this instead:
>>
>> "On DOS-based file system, you are encouraged to use '/' as the path separator within your patterns. This avoids escaping the '\' characters and would still match against file names that use '\' as the path separator. For instance, the pattern 'C:/foo/*' would still match 'C:\foo\bar'."
> 
> I'm not sure this should be in the manual, since the way file names
> are written in the debug info is not under user's control.  If the
> file names in the debug info use backslashes, will GDB recognize
> rewrite rules that use forward slashes?  Even if GDB does treat both
> characters as equal on Windows, I'm not sure users will trust that up
> front, so they might use backslashes in the rewrite rules anyway.

Yes, GDB will recognize rewrite rules that use forward slashes even if file names in the debug info use backslashes.
I can of course still add an example for double \-escaped DOS paths if you want me to.

>> That's not what I meant.  I meant that a rule like
>>
>>   set substitute-path /*/include /path/to/somewhere
>>
>> can catch any parent directory of 'include', and which one it will
>> catch is not known in advance.
> 
> This was something I was also wondering about. taking from the original example:
> 
> Max> In the case of conan (https://conan.io/) the path will consist of some constant part and some directory named with some hash value (i.e., /path/to/<unique hash>/builddir/)
> 
> Suppose there are 2 builds for this package, so we have the following valid directories: /path/to/deadbeef/builddir /path/to/00000000/builddir
> 
> which of those will be chosen? Is it consistent across gdb runs? OSs? How can a user know in advance which will be picked?

I think I need to clear up some confusion about the direction of this mapping. We map from the file names provided in the debug info to our local source files. NOT the other way around!

In my original example we would set the following rule:

set substitute-path /path/to/*/builddir /src

Let's say we are debugging an executable which contains the file name '/path/to/deadbeef/builddir/main.c' in its debug info.
When GDB wants to open the file it will look up any substitution rules that match this file name.
In this case the file would match against our specified rule and would therefore be rewritten as '/src/main.c'. This is unambiguous.
Now '/src/main.c' will be opened by GDB.
If we were to debug another executable which contains the file name '/path/to/00000000/builddir/main.c' in its debug info, the same thing would happen.

I want to emphasize that we are NOT looking for any paths that would "match" '/src/main.c'.
So your question about which path will be chosen does not really apply.
There is no path to be chosen.

But I think I know what you mean with your question.
In a nutshell, the "chosen" path depends on which build of the package is currently being debugged.

Is this consistent across GDB runs?
Yes.

Is this consistent across OSs?
Yes.

How can a user know in advance which will be picked?
There is nothing to be picked.

>>> Also, in the example I stated in my last reply, no file name that matches '/foo/include' would ever match '/bar/include'. The point I was trying to get across was that two unrelated files can already be mapped onto the same file (accidentally).
> 
> The difference I see is that in the previous examples, the problem is apparent within what the user would expect (some GDB script, debug information, or other things related to their project), whereas with glob matching, the problem may be unrelated (directory structure of unrelated folders).
> 
> If I wasn't aware of this patch, I would never think to check if the folder structure of my system could be the problem for getting debug information.

Once again, this is not really an issue due to the direction of the mapping.
The folder structure of a system does not affect which rule will be matched. I know this might sound like a bold statement but it's true.
Only the file names in the debug info are relevant.

>> Also, in the example I stated in my last reply, no file name that matches '/foo/include' would ever match '/bar/include'. The point I was trying to get across was that two unrelated files can already be mapped onto the same file (accidentally).
> 
> Well, two wrongs don't make a right.

This is a very fair point. Having different files map to the same file in the local source directory could be annoying.
Because this isn't a new issue per se, I thought about implementing a separate patch that would warn the user if two different files get mapped onto the same source file.
But after some thought I realized that there are actually valid use-cases for this multi-mapping.

Let's say we have the following directory structure:

/inc.h
/foo.c
/dir/bar.c

'/foo.c' includes '/inc.h' via #include "inc.h".
'/dir/bar.c' includes '/inc.h' via #include "../inc.h"

With GCC this will result in two entries in the debug info: '/inc.h' and '/dir/../inc.h'.
For GDB these entries are two separate files. (Maybe this could changed with another patch though)
But with the help of the following two rules we can tell GDB to treat them the same:

set substitute-path /from /to
set substitute-path /from/dir/../ /to

I therefore don't know if multi-mapping should be considered wrong here.
Also I don't think that glob matching would make this more common due to the aforementioned direction of the mapping.
For me, this is a completely separate issue.
  
Max Yvon Zimmermann April 5, 2024, 5:58 a.m. UTC | #8
Sorry, I need to make a correction. This topic can be a bit more confusing at times than I initially thought.

>>> Also, in the example I stated in my last reply, no file name that matches '/foo/include' would ever match '/bar/include'. The point I was trying to get across was that two unrelated files can already be mapped onto the same file (accidentally).
>>
>> Well, two wrongs don't make a right.
> 
> This is a very fair point. Having different files map to the same file in the local source directory could be annoying.
> Because this isn't a new issue per se, I thought about implementing a separate patch that would warn the user if two different files get mapped onto the same source file.
> But after some thought I realized that there are actually valid use-cases for this multi-mapping.
> 
> Let's say we have the following directory structure:
> 
> /inc.h
> /foo.c
> /dir/bar.c
> 
> '/foo.c' includes '/inc.h' via #include "inc.h".
> '/dir/bar.c' includes '/inc.h' via #include "../inc.h"
> 
> With GCC this will result in two entries in the debug info: '/inc.h' and '/dir/../inc.h'.
> For GDB these entries are two separate files. (Maybe this could changed with another patch though)
> But with the help of the following two rules we can tell GDB to treat them the same:
> 
> set substitute-path /from /to
> set substitute-path /from/dir/../ /to
> 
> I therefore don't know if multi-mapping should be considered wrong here.
> Also I don't think that glob matching would make this more common due to the aforementioned direction of the mapping.
> For me, this is a completely separate issue.

The rule

set substitute-path /from/dir/../ /to

would actually never match because '/dir/../inc.h' (inside '/from') would also match '/from' which is defined first.

Also you could make the case that this is not really a multi-mapping, because '/inc.h' and '/dir/../inc.h' are, of course, the same file.
But it makes implementing a usable warning more tedious.
Maybe a warning to the user is actually the right solution here but I'm not entirely sure.
What are your thoughts on this?
  

Patch

diff --git a/gdb/NEWS b/gdb/NEWS
index feb3a37393a..5a041175507 100644
--- a/gdb/NEWS
+++ b/gdb/NEWS
@@ -90,6 +90,11 @@  show unwind-on-signal
   These new commands replaces the existing set/show unwindonsignal.  The
   old command is maintained as an alias.
 
+set substitute-path
+  This command now supports glob pattern matching for substitution
+  rules.  Wildcards '?' and '*' are supported.  Use '\' to escape
+  '?', '*' and '\' characters.
+
 * New features in the GDB remote stub, GDBserver
 
   ** The --remote-debug and --event-loop-debug command line options
diff --git a/gdb/doc/gdb.texinfo b/gdb/doc/gdb.texinfo
index 727f9275bfb..bfd96afb5b4 100644
--- a/gdb/doc/gdb.texinfo
+++ b/gdb/doc/gdb.texinfo
@@ -9954,6 +9954,35 @@  For instance, if we had entered the following commands:
 use the second rule to rewrite @file{/usr/src/lib/foo.c} into
 @file{/mnt/src/lib/foo.c}.
 
+Rules can contain wildcards to match multiple paths.  The supported
+wildcards are @samp{?} (to match any single character) and @samp{*}
+(to match any string).  Wildcards will never match the path separator of
+the system.
+
+For instance, if we had entered the following command:
+
+@smallexample
+(@value{GDBP}) set substitute-path /usr/*/include /mnt/include
+@end smallexample
+
+@noindent
+@value{GDBN} would then rewrite @file{/usr/src/include/defs.h} into
+@file{/mnt/include/defs.h}.  Another file @file{/usr/local/include/inc.h}
+would be rewritten as @file{/mnt/include/inc.h} using the same rule.
+
+Use @samp{@backslashchar{}} to escape the characters @samp{?}@comma{}
+@samp{*} and @samp{@backslashchar{}}.  Note that you need to escape any
+@samp{@backslashchar{}} characters twice in the @value{GDBN} command line.
+
+So if we want to match a literal @samp{*} character in a rule, we would enter:
+
+@smallexample
+(@value{GDBP}) set substitute-path /foo\\*/bar /mnt/cross
+@end smallexample
+
+@noindent
+Now only the directory @file{/foo*/bar/} would match against the rule.
+
 
 @item unset substitute-path [path]
 @kindex unset substitute-path
diff --git a/gdb/source.c b/gdb/source.c
index 432301e2a71..2406e75a8b9 100644
--- a/gdb/source.c
+++ b/gdb/source.c
@@ -26,6 +26,7 @@ 
 #include "frame.h"
 #include "value.h"
 #include "gdbsupport/filestuff.h"
+#include "utils.h"
 
 #include <sys/types.h>
 #include <fcntl.h>
@@ -959,44 +960,48 @@  source_full_path_of (const char *filename,
   return 1;
 }
 
-/* Return non-zero if RULE matches PATH, that is if the rule can be
-   applied to PATH.  */
+/* Return the position in PATH up until RULE matches PATH, that is if the rule
+   can be applied to PATH.
+   Return -1 if there is no match.  */
 
 static int
 substitute_path_rule_matches (const struct substitute_path_rule *rule,
 			      const char *path)
 {
-  const int from_len = rule->from.length ();
-  const int path_len = strlen (path);
-
-  if (path_len < from_len)
-    return 0;
-
   /* The substitution rules are anchored at the start of the path,
      so the path should start with rule->from.  */
 
-  if (filename_ncmp (path, rule->from.c_str (), from_len) != 0)
-    return 0;
+  const int result = gdb_fileprefix_fnmatch (rule->from.c_str (), path);
 
-  /* Make sure that the region in the path that matches the substitution
-     rule is immediately followed by a directory separator (or the end of
-     string character).  */
+  if (result != -1)
+    {
+      /* Make sure that the region in the path that matches the substitution
+        rule is immediately followed by a directory separator (or the end of
+        string character).  */
 
-  if (path[from_len] != '\0' && !IS_DIR_SEPARATOR (path[from_len]))
-    return 0;
+      if (path[result] != '\0' && !IS_DIR_SEPARATOR (path[result]))
+        return -1;
+    }
 
-  return 1;
+  return result;
 }
 
 /* Find the substitute-path rule that applies to PATH and return it.
+   Also set SUB_POS to the position in PATH up until the rule matches PATH.
    Return NULL if no rule applies.  */
 
 static struct substitute_path_rule *
-get_substitute_path_rule (const char *path)
+get_substitute_path_rule (const char *path, int &sub_pos)
 {
   for (substitute_path_rule &rule : substitute_path_rules)
-    if (substitute_path_rule_matches (&rule, path))
-      return &rule;
+    {
+      const int result = substitute_path_rule_matches (&rule, path);
+      if (result != -1)
+        {
+          sub_pos = result;
+          return &rule;
+        }
+    }
 
   return nullptr;
 }
@@ -1010,7 +1015,8 @@  get_substitute_path_rule (const char *path)
 gdb::unique_xmalloc_ptr<char>
 rewrite_source_path (const char *path)
 {
-  const struct substitute_path_rule *rule = get_substitute_path_rule (path);
+  int sub_pos;
+  const struct substitute_path_rule *rule = get_substitute_path_rule (path, sub_pos);
 
   if (rule == nullptr)
     return nullptr;
@@ -1018,7 +1024,7 @@  rewrite_source_path (const char *path)
   /* Compute the rewritten path and return it.  */
 
   return (gdb::unique_xmalloc_ptr<char>
-	  (concat (rule->to.c_str (), path + rule->from.length (), nullptr)));
+	  (concat (rule->to.c_str (), path + sub_pos, nullptr)));
 }
 
 /* See source.h.  */
@@ -1718,6 +1724,31 @@  strip_trailing_directory_separator (char *path)
     path[last] = '\0';
 }
 
+/* If the last character of PATH is a directory separator, then strip it.
+   Also remove any related escape character.  */
+
+static void
+strip_trailing_directory_separator_and_escape (char *path)
+{
+  int last = strlen (path) - 1;
+
+  if (last < 0)
+    return;  /* No stripping is needed if PATH is the empty string.  */
+
+  if (IS_DIR_SEPARATOR (path[last]))
+    {
+      path[last] = '\0';
+
+      if (path[last] != '\\')
+        return;
+
+      --last;
+      /* Remove any related escape character.  */
+      if (last >= 0 && path[last] == '\\')
+        path[last] = '\0';
+    }
+}
+
 /* Add a new substitute-path rule at the end of the current list of rules.
    The new rule will replace FROM into TO.  */
 
@@ -1754,7 +1785,7 @@  show_substitute_path_command (const char *args, int from_tty)
 
   for (substitute_path_rule &rule : substitute_path_rules)
     {
-      if (from == NULL || substitute_path_rule_matches (&rule, from) != 0)
+      if (from == NULL || substitute_path_rule_matches (&rule, from) != -1)
 	gdb_printf ("  `%s' -> `%s'.\n", rule.from.c_str (),
 		    rule.to.c_str ());
     }
@@ -1830,7 +1861,7 @@  set_substitute_path_command (const char *args, int from_tty)
 
   /* Strip any trailing directory separator character in either FROM
      or TO.  The substitution rule already implicitly contains them.  */
-  strip_trailing_directory_separator (argv[0]);
+  strip_trailing_directory_separator_and_escape (argv[0]);
   strip_trailing_directory_separator (argv[1]);
 
   /* If a rule with the same "from" was previously defined, then
diff --git a/gdb/testsuite/gdb.base/subst-glob.exp b/gdb/testsuite/gdb.base/subst-glob.exp
new file mode 100644
index 00000000000..74dec472b7a
--- /dev/null
+++ b/gdb/testsuite/gdb.base/subst-glob.exp
@@ -0,0 +1,166 @@ 
+# Copyright 2006-2024 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+clean_restart
+
+# Do a bunch of testing of the substitute-path glob pattern matching.
+
+gdb_test_no_output "set confirm off" \
+        "deactivate GDB's confirmation interface"
+
+proc test_pattern { pattern path mode } {
+    # Escape backslashes so the GDB console can unescape them again.
+    set terminal_pattern [ \
+        string map { \
+            "\\" "\\\\" \
+        } $pattern \
+    ]
+
+    # Escape the pattern for regex matching.
+    set match_pattern [ \
+        string map { \
+            "*"  "\\\\*" \
+            "?"  "\\\\?" \
+            "\\" "\\\\\\\\" \
+        } $pattern \
+    ]
+
+    # Handle stripping of path separators.
+    if {[ishost "*-mingw*"]} {
+        regsub {\\\\\\\\\\\\|/$} $match_pattern {} match_pattern
+        set match_pattern [subst $match_pattern]
+    } else {
+        regsub {/$} $match_pattern {} match_pattern
+        set match_pattern [subst $match_pattern]
+    }
+
+    # Escape backslashes so the GDB console can unescape them again.
+    set terminal_path [ \
+        string map { \
+            "\\" "\\\\" \
+        } $path \
+    ]
+    
+    # Escape the path for regex matching.
+    set match_path [ \
+        string map { \
+            "*"  "\\*" \
+            "?"  "\\?" \
+            "\\" "\\\\" \
+        } $path \
+    ]
+    
+    gdb_test_no_output "unset substitute-path" \
+        "unset substitute-path before testing '$terminal_pattern' matches '$terminal_path'"
+
+    gdb_test_no_output "set substitute-path \"$terminal_pattern\" \"to\"" \
+        "set substitute-path $terminal_pattern before testing '$terminal_pattern' matches '$terminal_path'"
+    
+    if {$mode == 1} {
+        gdb_test "show substitute-path \"$terminal_path\"" \
+            "Source path substitution rule matching `$match_path':\r\n +`$match_pattern' -> `to'." \
+            "testing '$terminal_pattern' matches '$terminal_path'"
+    } else {
+        gdb_test "show substitute-path \"$terminal_path\"" \
+            "Source path substitution rule matching `$match_path':" \
+            "testing '$terminal_pattern' does not match '$terminal_path'"
+    }
+}
+
+proc test_pattern_pos { pattern path } {
+    test_pattern $pattern $path 1
+}
+
+proc test_pattern_neg { pattern path } {
+    test_pattern $pattern $path 0
+}
+
+# Sanity checks.
+test_pattern_pos "path" "path"
+test_pattern_pos "path" "path/to"
+test_pattern_pos "/" "/test"
+test_pattern_pos "/testing" "/testing/test"
+test_pattern_pos "/testing/" "/testing/test"
+test_pattern_neg "path" "test"
+
+# '?' wildcard.
+test_pattern_pos "?atchone" "matchone"
+test_pattern_pos "pat?/to" "path/to"
+test_pattern_pos "path??" "pathto"
+test_pattern_pos "test?ng" "testing"
+test_pattern_pos "?" "?/test"
+test_pattern_neg "test?" "test/"
+test_pattern_neg "test?" "testing/"
+test_pattern_neg "?" ""
+
+# '*' wildcard.
+test_pattern_pos "*" "matchall"
+test_pattern_pos "path_*" "path_pattern"
+test_pattern_pos "test*/test" "testing/test"
+test_pattern_pos "test*" "testing/test"
+test_pattern_pos "test*" "test/test"
+test_pattern_pos "*" "testing/test"
+test_pattern_pos "*/*" "testing/test"
+test_pattern_pos "*/" "test/"
+test_pattern_pos "/*" "/test"
+test_pattern_pos "test*" "test/"
+test_pattern_pos "test*" "test"
+test_pattern_pos "test*test" "testtest"
+test_pattern_pos "test*test" "testingtest"
+test_pattern_pos "test*test" "testingtest/test"
+test_pattern_pos "*" "*test"
+test_pattern_pos "**" "t"
+test_pattern_pos "*" ""
+test_pattern_pos "*t*st" "foobartest"
+test_pattern_pos "*t*st" "foobartest/ing"
+test_pattern_pos "*t*st" "tetest"
+test_pattern_pos "*t*st" "tetest/ing"
+test_pattern_pos "*t*st" "testtest"
+test_pattern_pos "*t*st" "testtest/ing"
+test_pattern_neg "*test" "foobar"
+test_pattern_neg "*/test" "foo/bar"
+
+# Escapes.
+test_pattern_pos "\\\\" "\\"
+test_pattern_pos "\\\\*" "\\test"
+test_pattern_pos "*\\\\" "test\\"
+test_pattern_pos "\\\\/" "\\/"
+test_pattern_pos "\\*" "*"
+test_pattern_pos "\\?" "?"
+test_pattern_pos "\\*" "*/test"
+test_pattern_pos "\\?" "?/test"
+test_pattern_neg "\\*" "*test"
+test_pattern_neg "\\?" "?test"
+test_pattern_neg "\\*" "t"
+test_pattern_neg "\\?" "t"
+test_pattern_neg "\\" "test"
+test_pattern_neg "\\/" ""
+test_pattern_neg "\\/" "/test"
+
+if {[ishost "*-mingw*"]} {
+    # DOS tests.
+    test_pattern_pos "test" "TEST"
+    test_pattern_pos "/" "\\test"
+    test_pattern_pos "\\\\" "/test"
+    test_pattern_pos "*\\\\" "test/"
+}
+
+if {[ishost "*-linux*"]} {
+    # Unix tests.
+    test_pattern_neg "test" "TEST"
+    test_pattern_neg "/" "\\test"
+    test_pattern_neg "\\\\" "/test"
+    test_pattern_neg "*\\\\" "test/"
+}
diff --git a/gdb/utils.c b/gdb/utils.c
index ded03c74099..00597543051 100644
--- a/gdb/utils.c
+++ b/gdb/utils.c
@@ -3532,6 +3532,106 @@  gdb_filename_fnmatch (const char *pattern, const char *string, int flags)
   return fnmatch (pattern, string, flags);
 }
 
+/* Return the position in STRING up until a PATTERN expression is matched.
+   Return -1 if there is no match.
+   
+   Only the wildcards ? and * are supported. */
+
+int
+gdb_fileprefix_fnmatch (const char *pattern, const char *string)
+{
+  int string_pos = 0;
+  char pattern_c;
+  char string_c;
+
+  while (*pattern != '\0' && *string != '\0')
+    {
+      switch (*pattern)
+        {
+        /* Unescape and match the next character.  */
+        case '\\':
+          ++pattern;
+          if (*pattern == '\0')
+            return -1;
+          [[fallthrough]];
+
+        default:
+          pattern_c = *pattern;
+          string_c = *string;
+
+#ifdef HAVE_CASE_INSENSITIVE_FILE_SYSTEM
+          pattern_c = TOLOWER (pattern_c);
+          string_c = TOLOWER (string_c);
+#endif
+
+#ifdef HAVE_DOS_BASED_FILE_SYSTEM
+          /* On DOS-based file systems, the '/' and the '\' are equivalent.  */
+          if (pattern_c == '/')
+            pattern_c = '\\';
+          if (string_c == '/')
+            string_c = '\\';
+#endif
+
+          /* Compare the current character of the pattern with the path.  */
+          if (pattern_c != string_c)
+            return -1;
+          break;
+
+        /* Match any character.  */
+        case '?':
+          /* Directory separators are not matched by '?'.  */
+          if (IS_DIR_SEPARATOR (*string))
+            return -1;
+          break;
+
+        /* Match any string.  */
+        case '*':
+          int best_result = -1;
+
+          /* Try to match any folling substring.  */
+          while (true)
+            {
+              /* Most of these attempts will fail at the first character. */
+              int result = gdb_fileprefix_fnmatch (pattern+1, string);
+
+              if (result != -1)
+                {
+                  /* If there is a substring match, compare its result to the best
+                    candidate so far.  */
+                  result += string_pos;
+                  if (result > best_result)
+                    best_result = result;
+                }
+
+              /* Exit on a null byte or a directory separator.  */
+              if (*string == '\0' || IS_DIR_SEPARATOR (*string))
+                return best_result;
+
+              ++string;
+              ++string_pos;
+            }
+        }
+    
+        ++pattern;
+        ++string;
+        ++string_pos;
+      }
+
+  /* If the macthing is complete but there is still some of the pattern left,
+     we must ensure that the remaining pattern matches the empty string.  */
+  if (*pattern != '\0')
+    {
+      /* Only '*' can match the empty string.  */
+      while (*pattern == '*')
+        ++pattern;
+
+      if (*pattern != '\0')
+        return -1;
+    }
+
+  return string_pos;
+}
+
 /* Return the number of path elements in PATH.
    / = 1
    /foo = 2
diff --git a/gdb/utils.h b/gdb/utils.h
index 875a2583179..eaf3fe8a8c3 100644
--- a/gdb/utils.h
+++ b/gdb/utils.h
@@ -137,6 +137,8 @@  struct set_batch_flag_and_restore_page_info
 extern int gdb_filename_fnmatch (const char *pattern, const char *string,
 				 int flags);
 
+extern int gdb_fileprefix_fnmatch (const char *pattern, const char *string);
+
 extern void substitute_path_component (char **stringp, const char *from,
 				       const char *to);