Remove gdb_assert in dw2_find_pc_sect_compunit_symtab.

Message ID 20191217234203.22520-1-tamur@google.com
State New, archived
Headers

Commit Message

Terekhov, Mikhail via Gdb-patches Dec. 17, 2019, 11:42 p.m. UTC
  All callers (such as find_pc_sect_compunit_symtab in symtab.c) check
the return value and take meaningful action if it is null.

The problem with the gdb_assert is that, although the function should never
return null with well formed debug info, it does when for example .gdb_index
section of the object file is faulty. I am debugging a rare and not easy to
reproduce crash in our code base, and this assertion prevents the user seeing
the full stack trace. When this assertion is removed, gdb falls back to not
using gdb_index information and prints a useful stack trace. Apart from the
separate issue of finding out why gdb_index is corrupted, I am guessing that
gdb should be resilient when presented with bad debug info.

gdb/ChangeLog

        * dwarf2read.c (dw2_find_pc_sect_compunit_symtab): Remove gdb_assert.
---
 gdb/dwarf2read.c | 1 -
 1 file changed, 1 deletion(-)
  

Comments

Tom Tromey Dec. 19, 2019, 6:10 p.m. UTC | #1
>>>>> "Ali" == Ali Tamur via gdb-patches <gdb-patches@sourceware.org> writes:

Ali> All callers (such as find_pc_sect_compunit_symtab in symtab.c) check
Ali> the return value and take meaningful action if it is null.

Ali> The problem with the gdb_assert is that, although the function should never
Ali> return null with well formed debug info, it does when for example .gdb_index
Ali> section of the object file is faulty. I am debugging a rare and not easy to
Ali> reproduce crash in our code base, and this assertion prevents the user seeing
Ali> the full stack trace. When this assertion is removed, gdb falls back to not
Ali> using gdb_index information and prints a useful stack trace. Apart from the
Ali> separate issue of finding out why gdb_index is corrupted, I am guessing that
Ali> gdb should be resilient when presented with bad debug info.

I think the idea of this assert is that recursively_find_pc_sect_compunit_symtab
"shouldn't" be able to fail.

dw2_find_pc_sect_compunit_symtab first checks the addrmap, and if the address
isn't found there, it returns early.

But, if the address is in the addrmpa, then it should be covered by some symtab.
If this doesn't happen, it means there is a disagreement between the psymtab
reader and the symtab reader -- in other words, a bug elsewhere in
dwarf2read.c.

So, I think it would be better to track down this other bug.

What do you think of that?

thanks,
Tom
  
Terekhov, Mikhail via Gdb-patches Dec. 20, 2019, 4:21 a.m. UTC | #2
Hi Tom,

Thank you for your reply. Sure, I don't want to cover gdb bugs and I want
to get to the bottom of this.
In dw2_find_pc_sect_compunit_symtab, the line:

  data = (struct dwarf2_per_cu_data *) addrmap_find
(objfile->partial_symtabs->psymtabs_addrmap, pc - baseaddr);assigns a
wrong cu, so recursively_find_pc_sect_compunit_symtab cannot find the
address and fails.

The root cause is that either create_addrmap_from_index does a poor
job constructing the index,or the .gdb_index section of the binary is
hopelessly corrupted, and I don't know which. So, maybeyou can help.
The Address area offset part of the index looks suspicious to me:1.
There are pairs like [0x100, 0x101) which are empty.My teammates say
this might be caused by unused & removed functions, but we are not
sure.
2. There are address pairs like [0x100, 0x200) that are assigned to
multiple cu's.Teammates say this might be caused by inlined functions,
or icf=safe etc. build options.
3. There are address pairs like [0x100, 0x200) ==> cu_A, where I
cannot find such an addressrange in the related  DW_TAG_compile_unit.
But I need to verify this more carefully.
3. There are overlapping address pairs like [0x100, 0x400) ==> cu_A,
[0x200, 0x300) ==>cu_B.Teammates have a harder time explaining this.
I look at the way how addrmap is constructed; there is this splay tree
underneath, and it looks likegdb does not assume overlapping address
pairs. My current hypothesis is that, a sequence like:[0x300, 0x400)
==> cu_A
[0x100, 0x500)  ==> cu_A
[0x200, 0x600)  ==> cu_B
results in something like:0x100(A) -- 0x200(B) -- 0x600(end) so 0x301
is assigned to cu_B instead of cu_A.
I am sorry that I cannot reduce the problem case and cannot share the
binary. (I cannot even compileand reproduce the binary, so I am
somewhat in the dark too). So, what do you think? Does the
gdb_indexlook well formed to you? If not, is it reasonable for gdb to
still try its best under the circumstances? If gdb isto handle this,
it seems like addrmap implementation needs a non-trivial upgrade so
that every interval canmap to multiple cu's. (Also, splay tree seems
like not the most efficient data structure for the problem, sinceso
much code looks for predecessor and successor nodes, but that may be
my ignorance). I can invest sometime to write a new implementation,
but I don't even know whether it is a good idea to support
multiplemappings in the first place?
Thank you.Ali


On Thu, Dec 19, 2019 at 10:10 AM Tom Tromey <tom@tromey.com> wrote:

> >>>>> "Ali" == Ali Tamur via gdb-patches <gdb-patches@sourceware.org>
> writes:
>
> Ali> All callers (such as find_pc_sect_compunit_symtab in symtab.c) check
> Ali> the return value and take meaningful action if it is null.
>
> Ali> The problem with the gdb_assert is that, although the function should
> never
> Ali> return null with well formed debug info, it does when for example
> .gdb_index
> Ali> section of the object file is faulty. I am debugging a rare and not
> easy to
> Ali> reproduce crash in our code base, and this assertion prevents the
> user seeing
> Ali> the full stack trace. When this assertion is removed, gdb falls back
> to not
> Ali> using gdb_index information and prints a useful stack trace. Apart
> from the
> Ali> separate issue of finding out why gdb_index is corrupted, I am
> guessing that
> Ali> gdb should be resilient when presented with bad debug info.
>
> I think the idea of this assert is that
> recursively_find_pc_sect_compunit_symtab
> "shouldn't" be able to fail.
>
> dw2_find_pc_sect_compunit_symtab first checks the addrmap, and if the
> address
> isn't found there, it returns early.
>
> But, if the address is in the addrmpa, then it should be covered by some
> symtab.
> If this doesn't happen, it means there is a disagreement between the
> psymtab
> reader and the symtab reader -- in other words, a bug elsewhere in
> dwarf2read.c.
>
> So, I think it would be better to track down this other bug.
>
> What do you think of that?
>
> thanks,
> Tom
>
  
Terekhov, Mikhail via Gdb-patches Jan. 9, 2020, 1:01 a.m. UTC | #3
I've looked into the problem more deeply and it appears that the linker
(lld in our case) produces a corrupt (at least inconsistent) gdb_index. I
think gdb should not be expected to do a good job under the circumstances,
(also removing the assert does not produce a flawless backtrace anyway), so
I will not pursue this patch. Thanks for the review.

On Thu, Dec 19, 2019 at 8:21 PM Ali Tamur <tamur@google.com> wrote:

> Hi Tom,
>
> Thank you for your reply. Sure, I don't want to cover gdb bugs and I want
> to get to the bottom of this.
> In dw2_find_pc_sect_compunit_symtab, the line:
>
>   data = (struct dwarf2_per_cu_data *) addrmap_find    (objfile->partial_symtabs->psymtabs_addrmap, pc - baseaddr);assigns a wrong cu, so recursively_find_pc_sect_compunit_symtab cannot find the address and fails.
>
> The root cause is that either create_addrmap_from_index does a poor job constructing the index,or the .gdb_index section of the binary is hopelessly corrupted, and I don't know which. So, maybeyou can help.
> The Address area offset part of the index looks suspicious to me:1. There are pairs like [0x100, 0x101) which are empty.My teammates say this might be caused by unused & removed functions, but we are not sure.
> 2. There are address pairs like [0x100, 0x200) that are assigned to multiple cu's.Teammates say this might be caused by inlined functions, or icf=safe etc. build options.
> 3. There are address pairs like [0x100, 0x200) ==> cu_A, where I cannot find such an addressrange in the related  DW_TAG_compile_unit. But I need to verify this more carefully.
> 3. There are overlapping address pairs like [0x100, 0x400) ==> cu_A, [0x200, 0x300) ==>cu_B.Teammates have a harder time explaining this.
> I look at the way how addrmap is constructed; there is this splay tree underneath, and it looks likegdb does not assume overlapping address pairs. My current hypothesis is that, a sequence like:[0x300, 0x400)  ==> cu_A
> [0x100, 0x500)  ==> cu_A
> [0x200, 0x600)  ==> cu_B
> results in something like:0x100(A) -- 0x200(B) -- 0x600(end) so 0x301 is assigned to cu_B instead of cu_A.
> I am sorry that I cannot reduce the problem case and cannot share the binary. (I cannot even compileand reproduce the binary, so I am somewhat in the dark too). So, what do you think? Does the gdb_indexlook well formed to you? If not, is it reasonable for gdb to still try its best under the circumstances? If gdb isto handle this, it seems like addrmap implementation needs a non-trivial upgrade so that every interval canmap to multiple cu's. (Also, splay tree seems like not the most efficient data structure for the problem, sinceso much code looks for predecessor and successor nodes, but that may be my ignorance). I can invest sometime to write a new implementation, but I don't even know whether it is a good idea to support multiplemappings in the first place?
> Thank you.Ali
>
>
> On Thu, Dec 19, 2019 at 10:10 AM Tom Tromey <tom@tromey.com> wrote:
>
>> >>>>> "Ali" == Ali Tamur via gdb-patches <gdb-patches@sourceware.org>
>> writes:
>>
>> Ali> All callers (such as find_pc_sect_compunit_symtab in symtab.c) check
>> Ali> the return value and take meaningful action if it is null.
>>
>> Ali> The problem with the gdb_assert is that, although the function
>> should never
>> Ali> return null with well formed debug info, it does when for example
>> .gdb_index
>> Ali> section of the object file is faulty. I am debugging a rare and not
>> easy to
>> Ali> reproduce crash in our code base, and this assertion prevents the
>> user seeing
>> Ali> the full stack trace. When this assertion is removed, gdb falls back
>> to not
>> Ali> using gdb_index information and prints a useful stack trace. Apart
>> from the
>> Ali> separate issue of finding out why gdb_index is corrupted, I am
>> guessing that
>> Ali> gdb should be resilient when presented with bad debug info.
>>
>> I think the idea of this assert is that
>> recursively_find_pc_sect_compunit_symtab
>> "shouldn't" be able to fail.
>>
>> dw2_find_pc_sect_compunit_symtab first checks the addrmap, and if the
>> address
>> isn't found there, it returns early.
>>
>> But, if the address is in the addrmpa, then it should be covered by some
>> symtab.
>> If this doesn't happen, it means there is a disagreement between the
>> psymtab
>> reader and the symtab reader -- in other words, a bug elsewhere in
>> dwarf2read.c.
>>
>> So, I think it would be better to track down this other bug.
>>
>> What do you think of that?
>>
>> thanks,
>> Tom
>>
>
  
Tom Tromey Jan. 17, 2020, 6:13 p.m. UTC | #4
>>>>> "Ali" == Ali Tamur via gdb-patches <gdb-patches@sourceware.org> writes:

Ali> I've looked into the problem more deeply and it appears that the linker
Ali> (lld in our case) produces a corrupt (at least inconsistent) gdb_index. I
Ali> think gdb should not be expected to do a good job under the circumstances,
Ali> (also removing the assert does not produce a flawless backtrace anyway), so
Ali> I will not pursue this patch. Thanks for the review.

Would it make sense to simply drop the index in this case, rather than
assert?

If there are executables out there with this issue, I guess we don't
want to assert, but instead show some kind of error, or fall back to a
more exhaustive search somehow.

Tom
  

Patch

diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c
index 9178e0eb94..5f1a58c728 100644
--- a/gdb/dwarf2read.c
+++ b/gdb/dwarf2read.c
@@ -5258,7 +5258,6 @@  dw2_find_pc_sect_compunit_symtab (struct objfile *objfile,
     = recursively_find_pc_sect_compunit_symtab (dw2_instantiate_symtab (data,
 									false),
 						pc);
-  gdb_assert (result != NULL);
   return result;
 }