sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux

Message ID 20150805010413.GL4777@adacore.com
State New, archived

Commit Message

Joel Brobecker Aug. 5, 2015, 1:04 a.m. UTC
  > If the "next" is for thread 1,
> 
> > That's when we get an event from a different thread (thread 3):
> > 
> >     infrun: target_wait (-1.0.0, status) =
> >     infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
> >     infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
> >     infrun: TARGET_WAITKIND_STOPPED
> >     infrun: stop_pc = 0x80782d0
> >     infrun: context switch
> >     infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)
> > 
> > ... which we find to be at the address where we set a breakpoint
> > on "the unwinder debug hook" (namely "_Unwind_DebugHook"). That's
> > why GDB reports for this event that this is...
> > 
> >     infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME
> 
> Why are we getting this?  longjmp/exception/step-resume breakpoints
> are thread-specific.
> 
> I'd guess that the bug is in bpstat_what:
> 
> struct bpstat_what
> bpstat_what (bpstat bs_head)
> {
> ...
> 	case bp_longjmp:
> 	case bp_longjmp_call_dummy:
> 	case bp_exception:
> 	  this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
> 	  retval.is_longjmp = bptype != bp_exception;
> 	  break;
> ...
> 
> This bit is not considering "if (bs->stop)" like e.g.,
> the bp_step_resume case.
> 
> I've seen something like this trigger before, and have a patch
> somewhere to rewrite bpstat_what differently which fixes that.
> I never managed to write a testcase for it so never submitted
> it.  But, could you try the simpler approach?  Try making that:
> 
> 	  if (bs->stop)
> 	    {
> 	       this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
> 	       retval.is_longjmp = bptype != bp_exception;
> 	    }
> 	  else
> 	    this_action = BPSTAT_WHAT_SINGLE;
> 	  break;

Ah ha, I missed the fact that the exception breakpoint is
thread-specific. Your fix seems to be working very well; thanks for the
suggestion, Pedro! Attached is a patch with a slightly altered analysis
as the revision log. Our SuSE 10 machine is very slow, so I tested it on
a more modern machine with a slightly different distro.

I'm wondering if we shouldn't be doing the same for:

        case bp_longjmp_resume:
        case bp_exception_resume:
          this_action = BPSTAT_WHAT_CLEAR_LONGJMP_RESUME;
          retval.is_longjmp = bptype == bp_longjmp_resume;
          break;

gdb/ChangeLog:

        Pedro Alves  <palves@redhat.com>
        * breakpoint.c (bpstat_what) <bp_longjmp, bp_longjmp_call_dummy>
        <bp_exception>: Correctly handle the case where BS->STOP is not set.

Thanks!
  

Patch

From 8ff769070f12eafd1b858a63a184a4be9f9a6500 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 4 Aug 2015 23:40:08 +0200
Subject: [PATCH] sig != GDB_SIGNAL_0 failed assertion stepping program on
 GNU/Linux

Trying to next/step a program on GNU/Linux sometimes results in
the following failed assertion:

    % gdb -q .obj/gprof/main
    (gdb) start
    (gdb) n
    (gdb) step
    [...]/infrun.c:2391: internal-error:
    resume: Assertion `sig != GDB_SIGNAL_0' failed.

What happens is that, during the "next" operation, GDB hits
a longjmp/exception/step-resume breakpoint but fails to see that
this breakpoint was set for a different thread than the one being
stepped.

More precisely, at the end of the "start" command, we are stopped
at the start of function Main in main.adb; there are 4 threads in
total, and we are in the main thread (which is thread 1):

    (gdb) info thread
      Id   Target Id         Frame
      4    Thread 0xb7a56ba0 (LWP 28379) 0xffffe410 in __kernel_vsyscall ()
      3    Thread 0xb7c5aba0 (LWP 28378) 0xffffe410 in __kernel_vsyscall ()
      2    Thread 0xb7e5eba0 (LWP 28377) 0xffffe410 in __kernel_vsyscall ()
    * 1    Thread 0xb7ea18c0 (LWP 28370) main () at /[...]/main.adb:57

All the logs below reference Thread ID/LWP, but I think it'll be easier
to talk about the threads by thread number. For instance, thread 1
is LWP 28370 while thread 3 is LWP 28378. So, in my explanations, I will
translate the LWPs into thread numbers.

Back to what happens while we are trying to "next" our program:

    (gdb) n
    infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379))
    infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378))
    infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377))
    infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370))
    infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT)
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x805451e
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28370.0 [Thread 0xb7ea18c0 (LWP 28370)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8054523

We've resumed thread 1 (LWP 28370), and received in return a
notification that the same thread stopped slightly further along. It's
still in the range of instructions for the line of source we started
the "next" from, as evidenced by the following trace...

    infrun: stepping inside range [0x805451e-0x8054531]

... and thus, we decide to continue stepping the same thread:

    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523
    infrun: prepare_to_wait
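
To make the "stepping inside range" decision concrete, here is a minimal
standalone sketch of the check. It is not GDB code: the struct and
function names are hypothetical, and treating the end address as
exclusive is an assumption on my part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the range recorded when "next" starts.  */
struct step_range
{
  uint32_t start;  /* first address of the source line being stepped */
  uint32_t end;    /* assumed exclusive: first address past that line */
};

/* Return true if PC is still inside RANGE, meaning the thread should
   be single-stepped again rather than reported as stopped.  */
static bool
pc_in_step_range (const struct step_range *range, uint32_t pc)
{
  return pc >= range->start && pc < range->end;
}
```

With the range from the trace above (0x805451e to 0x8054531), a stop at
0x8054523 is inside the range, so stepping continues.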

That's when we get an event from a different thread (thread 3)...

    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80782d0
    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)

... which we find to be at the address where we set a breakpoint
on "the unwinder debug hook" (namely "_Unwind_DebugHook"). But GDB
fails to notice that the breakpoint was inserted for thread 1 only,
and so decides to handle it as...

    infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME

... and inserts a breakpoint at the corresponding resume address,
as evidenced by the next log:

    infrun: exception resume at 80542a2

That breakpoint seems innocent right now, but will play a role fairly
quickly. But for now, GDB has inserted the exception-resume breakpoint,
and needs to single-step thread 3 past the breakpoint it just hit. Thus,
it temporarily disables the exception breakpoint, and requests a step of
that thread:

    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: skipping breakpoint: stepping past insn at: 0x80782d0
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80782d0
    infrun: prepare_to_wait

We then get a notification, still from thread 3, that it's now past
that breakpoint...

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x8078424

... so we can resume what we were doing before, which is single-stepping
thread 1 until we get to a new line of code:

    infrun: switching back to stepped thread
    infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7ea18c0 (LWP 28370)
    infrun: expected thread still hasn't advanced
    infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523

The "resume" log above shows that we're resuming thread 1 from
where we left off (0x8054523). We get one more stop at 0x8054529,
which is still inside our stepping range so we go again. That's
when we get the following event, from thread 3:

    infrun: prepare_to_wait
    infrun: target_wait (-1.0.0, status) =
    infrun:   28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)],
    infrun:   status->kind = stopped, signal = GDB_SIGNAL_TRAP
    infrun: TARGET_WAITKIND_STOPPED
    infrun: stop_pc = 0x80542a2

Now the stop_pc address is interesting, because it's the address
of the "exception resume" breakpoint. When GDB sees this, it switches
to thread 3 and treats the event as a hit on that breakpoint...

    infrun: context switch
    infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378)
    infrun: BPSTAT_WHAT_CLEAR_LONGJMP_RESUME

... and since the location is at a different line of code,
this is where it decides the "next" operation should stop:

    infrun: stop_waiting
    [Switching to Thread 0xb7c5aba0 (LWP 28378)]
    0x080542a2 in inte_tache_rt.ttache_rt (
        <_task>=0x80968ec <inte_tache_rt_inst.tache2>)
        at /[...]/inte_tache_rt.adb:54
    54            end loop;

Instead, what GDB should be doing is noticing that the exception
breakpoint we hit was set for a different thread, single-stepping
that thread past the breakpoint _without_ inserting the exception-resume
breakpoint, and then resuming the single-stepping of the initial thread
(thread 1) until it leaves the stepping range.

This is what this patch does, and after applying it, GDB now correctly
stops on the next line of code.
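
The rule can be modelled in isolation as follows. This is a simplified
standalone sketch, not GDB code: every name in it (model_action,
model_breakpoint, model_stop, model_what) is made up for illustration,
and it assumes that a thread-specific breakpoint hit by another thread
is recorded with its "stop" flag cleared, as the analysis above implies.

```c
#include <stdbool.h>

/* Simplified model; none of these names are GDB's.  */
enum model_action
{
  MODEL_SINGLE,              /* step past the breakpoint, nothing else */
  MODEL_SET_LONGJMP_RESUME   /* also insert the resume breakpoint */
};

struct model_breakpoint
{
  int thread;  /* thread the breakpoint was set for, -1 for any thread */
};

/* Assumption: the hit only counts as a stop when the breakpoint
   applies to the thread that reported it.  */
static bool
model_stop (const struct model_breakpoint *b, int hit_thread)
{
  return b->thread == -1 || b->thread == hit_thread;
}

/* The rule applied to the longjmp/exception cases: only set up the
   resume breakpoint when the hit is a real stop.  */
static enum model_action
model_what (bool stop)
{
  return stop ? MODEL_SET_LONGJMP_RESUME : MODEL_SINGLE;
}
```

In the scenario above, the breakpoint belongs to thread 1 and is hit by
thread 3, so the model picks the plain single-step action and no resume
breakpoint is inserted.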

gdb/ChangeLog:

        Pedro Alves  <palves@redhat.com>
        * breakpoint.c (bpstat_what) <bp_longjmp, bp_longjmp_call_dummy>
        <bp_exception>: Correctly handle the case where BS->STOP is not set.

Tested on x86_64-linux, no regressions.
---
 gdb/breakpoint.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/gdb/breakpoint.c b/gdb/breakpoint.c
index 74c7a7b..da4ee82 100644
--- a/gdb/breakpoint.c
+++ b/gdb/breakpoint.c
@@ -5778,8 +5778,13 @@  bpstat_what (bpstat bs_head)
 	case bp_longjmp:
 	case bp_longjmp_call_dummy:
 	case bp_exception:
-	  this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
-	  retval.is_longjmp = bptype != bp_exception;
+	  if (bs->stop)
+	    {
+	      this_action = BPSTAT_WHAT_SET_LONGJMP_RESUME;
+	      retval.is_longjmp = bptype != bp_exception;
+	    }
+	  else
+	    this_action = BPSTAT_WHAT_SINGLE;
 	  break;
 	case bp_longjmp_resume:
 	case bp_exception_resume:
-- 
1.7.10.4