libiberty rust-demangle, ignore .suffix

Message ID 20211202171713.15454-1-mark@klomp.org
State New
Headers
Series libiberty rust-demangle, ignore .suffix |

Commit Message

Mark Wielaard Dec. 2, 2021, 5:17 p.m. UTC
  Rust v0 symbols can have a .suffix because if compiler transformations.
These can be ignored it the demangled name. Which is what this patch
implements). But an alternative implementation could be to follow what
C++ does and represent these as [clone .suffix] tagged onto the
demangled name. But this seems somewhat confusing since it results in
a demangled name that cannot be mangled again. And it would mean
trying to decode compiler internal naming.

https://bugs.kde.org/show_bug.cgi?id=445916
https://github.com/rust-lang/rust/issues/60705

libiberty/Changelog

	* rust-demangle.c (rust_demangle_callback): Ignore everything
	after '.' char in sym for v0.
	* testsuite/rust-demangle-expected: Add new .suffix testcase.
---
 libiberty/rust-demangle.c                  | 4 ++++
 libiberty/testsuite/rust-demangle-expected | 6 ++++++
 2 files changed, 10 insertions(+)
  

Comments

Eduard-Mihai Burtescu Dec. 2, 2021, 5:35 p.m. UTC | #1
On Thu, Dec 2, 2021, at 19:17, Mark Wielaard wrote:
> Rust v0 symbols can have a .suffix because if compiler transformations.

For some context, the suffix comes from LLVM (I believe as part of its LTO).
If this were a semantic part of a Rust symbol, it would have an encoding within v0 (as we already do for e.g. shims).

That also means that for consistency, suffixes like these should be handled uniformly for both v0 and legacy (as rustc-demangle does), since LLVM doesn't distinguish.

You may even be able to get Clang to generate C++ mangled symbols with ".llvm." suffixes, with enough application of LTO.
This is not unlike GCC ".clone" suffixes, AFAIK.
Sadly I don't think there's a way to handle both as "outside the symbol", without hardcoding ".llvm." in the implementation.

I don't recall the libiberty demangling API having any provisions for the demangler deciding that a mangled symbol "stops early", which would maybe allow for a more general solution.

Despite all that, if it helps in practice, I would still not mind this patch landing in its current form, I just wanted to share my perspective on the larger issue.

Thanks,
- Eddy B.
  
Nicholas Nethercote Dec. 2, 2021, 7:58 p.m. UTC | #2
On Fri, 3 Dec 2021 at 04:17, Mark Wielaard <mark@klomp.org> wrote:

>
>         * rust-demangle.c (rust_demangle_callback): Ignore everything
>         after '.' char in sym for v0.
>

I just applied this change to Valgrind's copy of rust-demangle.c and I can
confirm that it works -- the symbols that were failing to be demangled are
now demangled.


> +      /* Rust v0 symbols can have '.' suffixes, ignore those.  */
> +      if (rdm.version == 0 && *p == '.')
> +       break;
>

Note that the indent on the `break` is 1 char, it should be 2.

Thanks for doing this, Mark!

Nick
  
Mark Wielaard Dec. 2, 2021, 10:07 p.m. UTC | #3
Hi Eddy,

On Thu, Dec 02, 2021 at 07:35:17PM +0200, Eduard-Mihai Burtescu wrote:
> On Thu, Dec 2, 2021, at 19:17, Mark Wielaard wrote:
> > Rust v0 symbols can have a .suffix because if compiler transformations.
> 
> For some context, the suffix comes from LLVM (I believe as part of
> its LTO).  If this were a semantic part of a Rust symbol, it would
> have an encoding within v0 (as we already do for e.g. shims).

The same is true for gccrs or the libgccjit backend, gcc might clone
the symbol name and make it unique by appending some .suffix name.

> That also means that for consistency, suffixes like these should be
> handled uniformly for both v0 and legacy (as rustc-demangle does),
> since LLVM doesn't distinguish.

The problem with the legacy mangling is that dot '.' is a valid
character. That is why the patch only handles the v0 mangling case
(where dot '.' isn't valid).

> You may even be able to get Clang to generate C++ mangled symbols
> with ".llvm." suffixes, with enough application of LTO.  This is not
> unlike GCC ".clone" suffixes, AFAIK.  Sadly I don't think there's a
> way to handle both as "outside the symbol", without hardcoding
> ".llvm." in the implementation.

We could use the scheme used by c++ where the .suffix is added as "
[clone .suffix]", it even handles multiple dots, where something like
_Z3fooi.part.9.165493.constprop.775.31805
demangles to
foo(int) [clone .part.9.165493] [clone .constprop.775.31805]

I just don't think that is very useful and a little confusing.

> I don't recall the libiberty demangling API having any provisions
> for the demangler deciding that a mangled symbol "stops early",
> which would maybe allow for a more general solution.

No, there indeed is no interface. We might introduce a new option flag
for treating '.' as end of symbol. But do we really need that
flexibility?

> Despite all that, if it helps in practice, I would still not mind
> this patch landing in its current form, I just wanted to share my
> perspective on the larger issue.

Thanks for that. Do you happen to know what other rust demanglers do?

Cheers,

Mark
  
Mark Wielaard Dec. 2, 2021, 10:54 p.m. UTC | #4
Hi,

On Fri, Dec 03, 2021 at 06:58:36AM +1100, Nicholas Nethercote wrote:
> On Fri, 3 Dec 2021 at 04:17, Mark Wielaard <mark@klomp.org> wrote:
> >
> >         * rust-demangle.c (rust_demangle_callback): Ignore everything
> >         after '.' char in sym for v0.
> >
> 
> I just applied this change to Valgrind's copy of rust-demangle.c and I can
> confirm that it works -- the symbols that were failing to be demangled are
> now demangled.

Thanks for testing.

> > +      /* Rust v0 symbols can have '.' suffixes, ignore those.  */
> > +      if (rdm.version == 0 && *p == '.')
> > +       break;
> >
> 
> Note that the indent on the `break` is 1 char, it should be 2.

Oops. That is actually because I used a tab for indentation, but this
file doesn't use tabs. Fixed in my local copy
https://code.wildebeest.org/git/user/mjw/gcc/commit/?h=rust-demangle-suffix
Will push with that fix if approved.

Thanks,

Mark
  
Eduard-Mihai Burtescu Dec. 2, 2021, 11:14 p.m. UTC | #5
On Fri, Dec 3, 2021, at 00:07, Mark Wielaard wrote:
> Hi Eddy,
>
> On Thu, Dec 02, 2021 at 07:35:17PM +0200, Eduard-Mihai Burtescu wrote:
>> On Thu, Dec 2, 2021, at 19:17, Mark Wielaard wrote:
>> > Rust v0 symbols can have a .suffix because if compiler transformations.
>> 
>> For some context, the suffix comes from LLVM (I believe as part of
>> its LTO).  If this were a semantic part of a Rust symbol, it would
>> have an encoding within v0 (as we already do for e.g. shims).
>
> The same is true for gccrs or the libgccjit backend, gcc might clone
> the symbol name and make it unique by appending some .suffix name.
>
>> That also means that for consistency, suffixes like these should be
>> handled uniformly for both v0 and legacy (as rustc-demangle does),
>> since LLVM doesn't distinguish.
>
> The problem with the legacy mangling is that dot '.' is a valid
> character. That is why the patch only handles the v0 mangling case
> (where dot '.' isn't valid).

Thought so, that's an annoying complication - however, see later down
why that's still not a blocker to the way rustc-demangle handles it.

>> You may even be able to get Clang to generate C++ mangled symbols
>> with ".llvm." suffixes, with enough application of LTO.  This is not
>> unlike GCC ".clone" suffixes, AFAIK.  Sadly I don't think there's a
>> way to handle both as "outside the symbol", without hardcoding
>> ".llvm." in the implementation.
>
> We could use the scheme used by c++ where the .suffix is added as "
> [clone .suffix]", it even handles multiple dots, where something like
> _Z3fooi.part.9.165493.constprop.775.31805
> demangles to
> foo(int) [clone .part.9.165493] [clone .constprop.775.31805]
>
> I just don't think that is very useful and a little confusing.

Calling it "clone" is a bit weird, but I just checked what rustc-demangle
does for printing suffixes back out and it's not great either:
- ".llvm." (and everything after it) is completely removed
- any left-over suffixes (after demangling), if they start with ".", are
  not considered errors, but printed out verbatim after the demangling

>> I don't recall the libiberty demangling API having any provisions
>> for the demangler deciding that a mangled symbol "stops early",
>> which would maybe allow for a more general solution.
>
> No, there indeed is no interface. We might introduce a new option flag
> for treating '.' as end of symbol. But do we really need that
> flexibility?

That's not what I meant - a v0 or legacy symbol is self-terminating in
its parsing (or at the very least there are not dots allowed outside
of a length-prefixed identifier), so that when you see the start of
a valid mangled symbol, you can always find its end in the string,
even when that end is half-way through (and is followed by suffixes
or any other unrelated noise).

What I was imagining is a way to return to the caller the number of
chars from the start of the original string, that were demangled,
letting the caller do something else with the rest of that string.
(see below for how rustc-demangle already does something similar)

>> Despite all that, if it helps in practice, I would still not mind
>> this patch landing in its current form, I just wanted to share my
>> perspective on the larger issue.
>
> Thanks for that. Do you happen to know what other rust demanglers do?

rustc-demangle's internal API returns a pair of the demangler and the
"leftover" parts of the original string, after the end of the symbol.
You can see here how that suffix is further checked, and kept:
https://github.com/rust-lang/rustc-demangle/blob/2811a1ad6f7c8bead2ef3671e4fdc10de1553e96/src/lib.rs#L108-L138

As mentioned above, ".llvm." is handled differently, just above the snippet
linked - perhaps it was deemed too common to let it pollute the output.

> Cheers,
>
> Mark

Thanks,
- Eddy B.
  
Mark Wielaard Dec. 7, 2021, 7:16 p.m. UTC | #6
Hi Eddy,

On Fri, 2021-12-03 at 01:14 +0200, Eduard-Mihai Burtescu wrote:
> On Fri, Dec 3, 2021, at 00:07, Mark Wielaard wrote:
> > On Thu, Dec 02, 2021 at 07:35:17PM +0200, Eduard-Mihai Burtescu
> > wrote:
> > > That also means that for consistency, suffixes like these should
> > > be
> > > handled uniformly for both v0 and legacy (as rustc-demangle
> > > does),
> > > since LLVM doesn't distinguish.
> > 
> > The problem with the legacy mangling is that dot '.' is a valid
> > character. That is why the patch only handles the v0 mangling case
> > (where dot '.' isn't valid).
> 
> Thought so, that's an annoying complication - however, see later down
> why that's still not a blocker to the way rustc-demangle handles it.
> 
> > > You may even be able to get Clang to generate C++ mangled symbols
> > > with ".llvm." suffixes, with enough application of LTO.  This is
> > > not
> > > unlike GCC ".clone" suffixes, AFAIK.  Sadly I don't think there's
> > > a
> > > way to handle both as "outside the symbol", without hardcoding
> > > ".llvm." in the implementation.
> > 
> > We could use the scheme used by c++ where the .suffix is added as "
> > [clone .suffix]", it even handles multiple dots, where something
> > like
> > _Z3fooi.part.9.165493.constprop.775.31805
> > demangles to
> > foo(int) [clone .part.9.165493] [clone .constprop.775.31805]
> > 
> > I just don't think that is very useful and a little confusing.
> 
> Calling it "clone" is a bit weird, but I just checked what rustc-
> demangle
> does for printing suffixes back out and it's not great either:
> - ".llvm." (and everything after it) is completely removed
> - any left-over suffixes (after demangling), if they start with ".",
> are
>   not considered errors, but printed out verbatim after the
> demangling
> 
> > > I don't recall the libiberty demangling API having any provisions
> > > for the demangler deciding that a mangled symbol "stops early",
> > > which would maybe allow for a more general solution.
> > 
> > No, there indeed is no interface. We might introduce a new option
> > flag
> > for treating '.' as end of symbol. But do we really need that
> > flexibility?
> 
> That's not what I meant - a v0 or legacy symbol is self-terminating
> in
> its parsing (or at the very least there are not dots allowed outside
> of a length-prefixed identifier), so that when you see the start of
> a valid mangled symbol, you can always find its end in the string,
> even when that end is half-way through (and is followed by suffixes
> or any other unrelated noise).
> 
> What I was imagining is a way to return to the caller the number of
> chars from the start of the original string, that were demangled,
> letting the caller do something else with the rest of that string.
> (see below for how rustc-demangle already does something similar)
>
> > > Despite all that, if it helps in practice, I would still not mind
> > > this patch landing in its current form, I just wanted to share my
> > > perspective on the larger issue.
> > 
> > Thanks for that. Do you happen to know what other rust demanglers
> > do?
> 
> rustc-demangle's internal API returns a pair of the demangler and the
> "leftover" parts of the original string, after the end of the symbol.
> You can see here how that suffix is further checked, and kept:
> https://github.com/rust-lang/rustc-demangle/blob/2811a1ad6f7c8bead2ef3671e4fdc10de1553e96/src/lib.rs#L108-L138

Yes, returning a struct that returns the style detected, the demangled
string and the left over chars makes sense. But we don't have an
interface like that at the moment, and I am not sure we (currently)
have users who want this.

> As mentioned above, ".llvm." is handled differently, just above the snippet
> linked - perhaps it was deemed too common to let it pollute the output.

But that also makes it a slightly odd interface. I would imagine that
people would be interested in the .llvm. part. Now that just gets
dropped.

Since we don't have an interface to return the suffix (and I find the
choice of dropping .llvm. but not other suffixes odd), I think we
should just simply always drop the .suffix. I understand now how to do
that for legacy symbols too, thanks for the hints.

See the attached update to the patch. What do you think?

Thanks,

Mark
  
Eduard-Mihai Burtescu Dec. 20, 2021, 11:50 a.m. UTC | #7
Apologies for the delay, the email fell through the cracks somehow.

The updated patch looks like it would work alright, only needs a couple tests, e.g.:
https://github.com/rust-lang/rustc-demangle/blob/2811a1ad6f7c8bead2ef3671e4fdc10de1553e96/src/lib.rs#L413-L422
https://github.com/rust-lang/rustc-demangle/blob/2811a1ad6f7c8bead2ef3671e4fdc10de1553e96/src/v0.rs#L1442-L1444

Thanks,
- Eddy B.

On Tue, Dec 7, 2021, at 21:16, Mark Wielaard wrote:
> Hi Eddy,
>
> On Fri, 2021-12-03 at 01:14 +0200, Eduard-Mihai Burtescu wrote:
>> On Fri, Dec 3, 2021, at 00:07, Mark Wielaard wrote:
>> > On Thu, Dec 02, 2021 at 07:35:17PM +0200, Eduard-Mihai Burtescu
>> > wrote:
>> > > That also means that for consistency, suffixes like these should
>> > > be
>> > > handled uniformly for both v0 and legacy (as rustc-demangle
>> > > does),
>> > > since LLVM doesn't distinguish.
>> > 
>> > The problem with the legacy mangling is that dot '.' is a valid
>> > character. That is why the patch only handles the v0 mangling case
>> > (where dot '.' isn't valid).
>> 
>> Thought so, that's an annoying complication - however, see later down
>> why that's still not a blocker to the way rustc-demangle handles it.
>> 
>> > > You may even be able to get Clang to generate C++ mangled symbols
>> > > with ".llvm." suffixes, with enough application of LTO.  This is
>> > > not
>> > > unlike GCC ".clone" suffixes, AFAIK.  Sadly I don't think there's
>> > > a
>> > > way to handle both as "outside the symbol", without hardcoding
>> > > ".llvm." in the implementation.
>> > 
>> > We could use the scheme used by c++ where the .suffix is added as "
>> > [clone .suffix]", it even handles multiple dots, where something
>> > like
>> > _Z3fooi.part.9.165493.constprop.775.31805
>> > demangles to
>> > foo(int) [clone .part.9.165493] [clone .constprop.775.31805]
>> > 
>> > I just don't think that is very useful and a little confusing.
>> 
>> Calling it "clone" is a bit weird, but I just checked what rustc-
>> demangle
>> does for printing suffixes back out and it's not great either:
>> - ".llvm." (and everything after it) is completely removed
>> - any left-over suffixes (after demangling), if they start with ".",
>> are
>>   not considered errors, but printed out verbatim after the
>> demangling
>> 
>> > > I don't recall the libiberty demangling API having any provisions
>> > > for the demangler deciding that a mangled symbol "stops early",
>> > > which would maybe allow for a more general solution.
>> > 
>> > No, there indeed is no interface. We might introduce a new option
>> > flag
>> > for treating '.' as end of symbol. But do we really need that
>> > flexibility?
>> 
>> That's not what I meant - a v0 or legacy symbol is self-terminating
>> in
>> its parsing (or at the very least there are not dots allowed outside
>> of a length-prefixed identifier), so that when you see the start of
>> a valid mangled symbol, you can always find its end in the string,
>> even when that end is half-way through (and is followed by suffixes
>> or any other unrelated noise).
>> 
>> What I was imagining is a way to return to the caller the number of
>> chars from the start of the original string, that were demangled,
>> letting the caller do something else with the rest of that string.
>> (see below for how rustc-demangle already does something similar)
>>
>> > > Despite all that, if it helps in practice, I would still not mind
>> > > this patch landing in its current form, I just wanted to share my
>> > > perspective on the larger issue.
>> > 
>> > Thanks for that. Do you happen to know what other rust demanglers
>> > do?
>> 
>> rustc-demangle's internal API returns a pair of the demangler and the
>> "leftover" parts of the original string, after the end of the symbol.
>> You can see here how that suffix is further checked, and kept:
>> https://github.com/rust-lang/rustc-demangle/blob/2811a1ad6f7c8bead2ef3671e4fdc10de1553e96/src/lib.rs#L108-L138
>
> Yes, returning a struct that returns the style detected, the demangled
> string and the left over chars makes sense. But we don't have an
> interface like that at the moment, and I am not sure we (currently)
> have users who want this.
>
>> As mentioned above, ".llvm." is handled differently, just above the snippet
>> linked - perhaps it was deemed too common to let it pollute the output.
>
> But that also makes it a slightly odd interface. I would imagine that
> people would be interested in the .llvm. part. Now that just gets
> dropped.
>
> Since we don't have an interface to return the suffix (and I find the
> choice of dropping .llvm. but not other suffixes odd), I think we
> should just simply always drop the .suffix. I understand now how to do
> that for legacy symbols too, thanks for the hints.
>
> See the attached update to the patch. What do you think?
>
> Thanks,
>
> Mark
>
> Attachments:
> * 0001-libiberty-rust-demangle-ignore-.suffix.patch
  
Mark Wielaard Jan. 16, 2022, 12:27 a.m. UTC | #8
Hi Eddy,

On Mon, Dec 20, 2021 at 01:50:52PM +0200, Eduard-Mihai Burtescu wrote:
> Apologies for the delay, the email fell through the cracks somehow.

And then I went on vacation... Sorry this fairly simple patch takes so
long.

> The updated patch looks like it would work alright, only needs a couple tests, e.g.:
> https://github.com/rust-lang/rustc-demangle/blob/2811a1ad6f7c8bead2ef3671e4fdc10de1553e96/src/lib.rs#L413-L422
> https://github.com/rust-lang/rustc-demangle/blob/2811a1ad6f7c8bead2ef3671e4fdc10de1553e96/src/v0.rs#L1442-L1444

Thanks. I have added those testcases and noticed we need a tweak for
having an @ in the suffix of a legacy symbol. I'll sent the updated
patch. Hope it can still be pushed even though stage 3 is closing
soon.

Cheers,

Mark
  

Patch

diff --git a/libiberty/rust-demangle.c b/libiberty/rust-demangle.c
index 449941b56dc1..dee9eb64df79 100644
--- a/libiberty/rust-demangle.c
+++ b/libiberty/rust-demangle.c
@@ -1340,6 +1340,10 @@  rust_demangle_callback (const char *mangled, int options,
   /* Rust symbols (v0) use only [_0-9a-zA-Z] characters. */
   for (p = rdm.sym; *p; p++)
     {
+      /* Rust v0 symbols can have '.' suffixes, ignore those.  */
+      if (rdm.version == 0 && *p == '.')
+	break;
+
       rdm.sym_len++;
 
       if (*p == '_' || ISALNUM (*p))
diff --git a/libiberty/testsuite/rust-demangle-expected b/libiberty/testsuite/rust-demangle-expected
index 7dca315d0054..bc32ed85f882 100644
--- a/libiberty/testsuite/rust-demangle-expected
+++ b/libiberty/testsuite/rust-demangle-expected
@@ -295,3 +295,9 @@  _RMCs4fqI2P2rA04_13const_genericINtB0_4CharKc2202_E
 --format=auto
 _RNvNvMCs4fqI2P2rA04_13const_genericINtB4_3FooKpE3foo3FOO
 <const_generic::Foo<_>>::foo::FOO
+#
+# Suffixes
+#
+--format=rust
+_RNvMs0_NtCs5l0EXMQXRMU_21rustc_data_structures17obligation_forestINtB5_16ObligationForestNtNtNtCsdozMG8X9FIu_21rustc_trait_selection6traits7fulfill26PendingPredicateObligationE22register_obligation_atB1v_.llvm.8517020237817239694
+<rustc_data_structures::obligation_forest::ObligationForest<rustc_trait_selection::traits::fulfill::PendingPredicateObligation>>::register_obligation_at