RFA/gdbserver: GDB internal-error debugging threaded program with breakpoint and forks (was: "Re: RFH: failed assert debugging threaded+fork program over gdbserver")

Message ID 20160624181152.GD3295@adacore.com
State New, archived
Headers

Commit Message

Joel Brobecker June 24, 2016, 6:11 p.m. UTC
  > I'm thinking the right thing to do, here is to enhance select_event_lwp
> to look for threads that received a fork/vfork event first, and report
> that event to gdb ahead of all the other kinds of events.

That seems to work well. Attached is the patch doing this.

The commit's revision log gives I think a reasonable overview of
what the issue is and how the patch fixes it. But more information
that I found while I was investigating this which is only tangential
to the issue can be found in the following message, exchanged on
this list:

    https://www.sourceware.org/ml/gdb-patches/2016-05/msg00207.html
    https://www.sourceware.org/ml/gdb-patches/2016-05/msg00209.html
    https://www.sourceware.org/ml/gdb-patches/2016-06/msg00391.html

gdb/gdbserver/ChangeLog:

        * linux-low.c (select_fork_vfork_lwp_callback): New function.
        (select_event_lwp): Give priority fork/vfork events over all
        other types of events.

gdb/testsuite/ChangeLog:

        * gdb.ada/bp_and_fork: New test.

Tested on x86_64-linux.
OK to apply?

Thanks!
  

Comments

Pedro Alves June 24, 2016, 9:57 p.m. UTC | #1
I haven't gone through this with fine-tooth comb yet, but,
will we still have the same problem if _two_ threads (or inferiors...)
fork at the "same" time, and we end up reporting one fork, while
leaving the another one pending?

Thanks,
Pedro Alves
  
Joel Brobecker June 24, 2016, 10:36 p.m. UTC | #2
> I haven't gone through this with fine-tooth comb yet, but,
> will we still have the same problem if _two_ threads (or inferiors...)
> fork at the "same" time, and we end up reporting one fork, while
> leaving the another one pending?

At the moment, I do not think so, because we seem to just process
the fork even without requesting an update of the thread list.
This is an extract of the remote protocol transmissions that
show us resume the execution of our program (vCont;c), receive
the fork event (T05fork), do a bunch of stuff to handle the
fork event (and in particular "D"etaching from the child process),
followed by the resumption of our program's execution (vCont;c):

    Sending packet: $vCont;c:p2992.-1#b4...
    Packet received: T05fork:p299c.299c;01:bffff5e0;40:0fe87158;thread:p2992.2992;core:1;
    Sending packet: $Hgp299c.299c#5b...Packet received: OK
    Sending packet: $z0,1000224c,4#f2...Packet received: OK
    Sending packet: $z0,10003218,4#c5...Packet received: OK
    Sending packet: $z0,1000336c,4#f6...Packet received: OK
    Sending packet: $D;299c#86...Packet received: OK
    Sending packet: $vCont;c:p2992.-1#b4...Packet received: T05swbreak:;01:482009e0;40:10003218;thread:p2992.299a;core:0;

I wish I could give you a stronger justification, but at least
we seem to be OK, so the patch could give us a fix while we think
a more solid approach through.
  
Pedro Alves June 24, 2016, 10:37 p.m. UTC | #3
On 06/24/2016 11:36 PM, Joel Brobecker wrote:
>> I haven't gone through this with fine-tooth comb yet, but,
>> will we still have the same problem if _two_ threads (or inferiors...)
>> fork at the "same" time, and we end up reporting one fork, while
>> leaving the another one pending?
> 
> At the moment, I do not think so, because we seem to just process
> the fork even without requesting an update of the thread list.

"catch fork" would make us stop though.

Thanks,
Pedro Alves
  
Joel Brobecker June 27, 2016, 10:32 p.m. UTC | #4
> >> I haven't gone through this with fine-tooth comb yet, but,
> >> will we still have the same problem if _two_ threads (or inferiors...)
> >> fork at the "same" time, and we end up reporting one fork, while
> >> leaving the another one pending?
> > 
> > At the moment, I do not think so, because we seem to just process
> > the fork even without requesting an update of the thread list.
> 
> "catch fork" would make us stop though.

:-(. Most likely. I had the weekend to mull this over. The only
possible solutions I can see are:

  a. Make gdbserver "hide" the threads that are children of forks
     until we've reported the corresponding fork event to GDB.

     But then, I think it's unclear what to do if the user does
     a "step" or "continue" while you have multiple pending
     fork events. That's probably a question that's likely not
     specific to forks, as you might have the same issue when
     requesting an action after seeing the first of multiple
     events received at the same time. Perhaps simply just return
     the next event without resuming anything? Is that what we do?

  b. Somehow enhance GDB to handle the extra unknown threads
     more gracefully.

I don't really see how (b) could work. It seems that (a) would
be more promising. That said, I would still consider my current
patch, as reporting the forks early allow us to either detach
from them earlier.
  
Pedro Alves June 28, 2016, 7:40 p.m. UTC | #5
On 06/27/2016 11:32 PM, Joel Brobecker wrote:
>>>> I haven't gone through this with fine-tooth comb yet, but,
>>>> will we still have the same problem if _two_ threads (or inferiors...)
>>>> fork at the "same" time, and we end up reporting one fork, while
>>>> leaving the another one pending?
>>>
>>> At the moment, I do not think so, because we seem to just process
>>> the fork even without requesting an update of the thread list.
>>
>> "catch fork" would make us stop though.
> 
> :-(. Most likely. I had the weekend to mull this over. The only
> possible solutions I can see are:
> 
>   a. Make gdbserver "hide" the threads that are children of forks
>      until we've reported the corresponding fork event to GDB.
> 

Agreed, I think we need to do this.  It's somewhat what
linux-nat.c does, except linux-nat.c hides the fork child
until target_follow_fork time.  

I say similar, instead of "just like", because the RSP doesn't
have a "follow fork / detach fork" concept.  That's because the model
of instead making the child visible as a regular process as soon as we
see the parent fork allows using regular detach, remove breakpoints, etc.
packets against the fork child.

We could make the server always hide the child until it gets some
new "unhide child now" packet, but I think all it would avoid is gdb
looking over the already-reported events and check whether a fork is
pending, so it seems unnecessary.

>      But then, I think it's unclear what to do if the user does
>      a "step" or "continue" while you have multiple pending
>      fork events. That's probably a question that's likely not
>      specific to forks, as you might have the same issue when
>      requesting an action after seeing the first of multiple
>      events received at the same time. Perhaps simply just return
>      the next event without resuming anything? Is that what we do?

Yes, that's what we do.  For gdb there's no difference between
the other pending event having happened to trigger at the same
time as the last reported event, vs it triggering immediately
on next resume.

> 
>   b. Somehow enhance GDB to handle the extra unknown threads
>      more gracefully.
> 
> I don't really see how (b) could work. It seems that (a) would
> be more promising. That said, I would still consider my current
> patch, as reporting the forks early allow us to either detach
> from them earlier.

My usual thought process is this: imagine we had (a) already.  Would we
have a particularly strong reason to complicate the code and do (b) on
its own?  Seems like not.  We could apply the same rationale for preferring
to report any other thread stopped at a breakpoint before the fork
events (so that we could move them past their breakpoints earlier).  Or
always prefer the stepping thread, as that's the thread the user is most
interested in (*).  Etc.

(*) - IIRC, the reason we prefer a stepped thread first is for correctness,
not because that's what the user is focused in.  It used to be that if a step
event got pending, and we reported some other event first, later when the
pending step event is finally reported as a plain SIGTRAP, if the thread that had
a pending step was now continued instead of stepped, infrun wouldn't
understanding what this SIGTRAP was about, since the thread was no longer
supposed to be single-stepping, and would thus report the SIGTRAP to the
user as a spurious signal.  With "maint set target-non-stop on", which is still
not the default with target remote, infrun.c:clear_proceed_status_thread
handles this scenario on the gdb side and discards the single-step, but with
plain all-stop, the spurious SIGTRAP probably would still happen.

Thanks,
Pedro Alves
  
Joel Brobecker July 5, 2016, 4:49 p.m. UTC | #6
> >   a. Make gdbserver "hide" the threads that are children of forks
> >      until we've reported the corresponding fork event to GDB.
> > 
> 
> Agreed, I think we need to do this.  It's somewhat what
> linux-nat.c does, except linux-nat.c hides the fork child
> until target_follow_fork time.  
[...]
> That said, I would still consider my current
> > patch, as reporting the forks early allow us to either detach
> > from them earlier.
> 
> My usual thought process is this: imagine we had (a) already.  Would we
> have a particularly strong reason to complicate the code and do (b) on
> its own?  Seems like not.  We could apply the same rationale for preferring
> to report any other thread stopped at a breakpoint before the fork
> events (so that we could move them past their breakpoints earlier).  Or
> always prefer the stepping thread, as that's the thread the user is most
> interested in (*).  Etc.
> 
> (*) - IIRC, the reason we prefer a stepped thread first is for
> correctness, not because that's what the user is focused in.  It used
> to be that if a step event got pending, and we reported some other
> event first, later when the pending step event is finally reported as
> a plain SIGTRAP, if the thread that had a pending step was now
> continued instead of stepped, infrun wouldn't understanding what this
> SIGTRAP was about, since the thread was no longer supposed to be
> single-stepping, and would thus report the SIGTRAP to the user as a
> spurious signal.  With "maint set target-non-stop on", which is still
> not the default with target remote,
> infrun.c:clear_proceed_status_thread handles this scenario on the gdb
> side and discards the single-step, but with plain all-stop, the
> spurious SIGTRAP probably would still happen.

I would have thought that we'd want GDB and GDBserver to be in sync
as quickly as possible, so as to release the new inferiors, but I guess
that doesn't really make any difference in practice. The issue with
reporting SIGTRAP from an old single-step is even more convincing that
my patch is actually wrong.  Sigh...
  

Patch

From cccc5072db8eeefbdfe11840c4e6e2330bac46a8 Mon Sep 17 00:00:00 2001
From: Joel Brobecker <brobecker@adacore.com>
Date: Fri, 24 Jun 2016 11:56:32 -0400
Subject: [PATCH] GDB internal-error debugging threaded program with breakpoint
 and forks

Debugging through gdbserver a program uses threads hitting breakpoints
as well as doing a fork/exec, can lead to an internal error. This
happens randomly, since it depends on timing, but here is what we can
see from the user's perpective:

    % gdb a_test
    (gdb) break a_test.adb:30
    (gdb) break a_test.adb:39
    (gdb) target remote my_board:4444
    (gdb) continue
    Continuing.
    [...]
    [New Thread 866.868]
    [New Thread 866.869]
    [New Thread 870.870]
    /[...]/gdb/thread.c:89: internal-error: thread_info* inferior_thread(): Assertion `tp' failed.
    A problem internal to GDB has been detected,
    further debugging may prove unreliable.
    Quit this debugging session? (y or n)

What happens is that the inferior forks around the same time its
threads hit one of the breakpoints we inserted. As a result,
one single round of linux_wait_for_event_filtered, is able to
fetch breakpoints events from our inferior's two threads, as well
as the fork event reported by the inferior's main thread, all
in one go.

What GDBserver then has to do next, is decide which of these events
to report to GDB first. Currently, it gives priority to threads that
we are single-stepping, and then just grabs one of the threads
that have a pending event, selected at random (see select_event_lwp).
At the point where see the problem, the single-step-out-of-breakpoint
has already been completed, so we do not have any thread doing
a single-step, and so GDBserver selects one of the threads that
received a SIGTRAP. This can been seen by looking at the debug
traces, which contain the following hint:

        SEL: Found 3 SIGTRAP events, selecting #1

This leads GDBserver to report a SIGTRAP event on that thread.
Processing that event, GDB requests the updated list of threads,
which now contains the forked process' thread. Seeing that thread
with a different PID which confuses GDB, eventually leading to
the internal error.

This patch fixes the issue by make sure GDBserver always reports
fork/vfork events first. That way, GDB is aware that the new thread
comes from the fork/vfork event. In particular, in the mode where
we are in (follow the parent only), it knows to ignore the new
thread, thus avoiding the issue entirely.

gdb/gdbserver/ChangeLog:

        * linux-low.c (select_fork_vfork_lwp_callback): New function.
        (select_event_lwp): Give priority fork/vfork events over all
        other types of events.

gdb/testsuite/ChangeLog:

        * gdb.ada/bp_and_fork: New test.

Tested on x86_64-linux.
---
 gdb/gdbserver/linux-low.c                    | 44 +++++++++++++++++-
 gdb/testsuite/gdb.ada/bp_and_fork.exp        | 41 +++++++++++++++++
 gdb/testsuite/gdb.ada/bp_and_fork/a_test.adb | 68 ++++++++++++++++++++++++++++
 3 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 gdb/testsuite/gdb.ada/bp_and_fork.exp
 create mode 100644 gdb/testsuite/gdb.ada/bp_and_fork/a_test.adb

diff --git a/gdb/gdbserver/linux-low.c b/gdb/gdbserver/linux-low.c
index dd92e78..d71595b 100644
--- a/gdb/gdbserver/linux-low.c
+++ b/gdb/gdbserver/linux-low.c
@@ -2826,6 +2826,23 @@  count_events_callback (struct inferior_list_entry *entry, void *data)
   return 0;
 }
 
+/* Select the first LWP (if any) for which a fork or vfork event
+   was reported.  */
+
+static int
+select_fork_vfork_lwp_callback (struct inferior_list_entry *entry, void *data)
+{
+  struct thread_info *thread = (struct thread_info *) entry;
+  struct lwp_info *lp = get_thread_lwp (thread);
+
+  if ((lp->waitstatus.kind == TARGET_WAITKIND_FORKED
+       || thread->last_status.kind == TARGET_WAITKIND_VFORKED)
+      && lp->status_pending_p)
+    return 1;
+  else
+    return 0;
+}
+
 /* Select the LWP (if any) that is currently being single-stepped.  */
 
 static int
@@ -2871,6 +2888,31 @@  select_event_lwp (struct lwp_info **orig_lp)
   int random_selector;
   struct thread_info *event_thread = NULL;
 
+  /* If an LWP forked/vforked, select that LWP over all the others,
+     so as to tell GDB right away about the the new child process
+     being the result of a fork/vfork.
+
+     It is important to do so first, because GDB needs to be told
+     about fork/vfork threads to allow GDB to separate the new
+     process' thread from the threads of the current (parent)
+     inferior.  In particular, we ran into an issue where GDB
+     requested the list of threads which included both parent
+     and the thread of the fork/vfork, and because GDB didn't know
+     about the fork/vfork, did not ignore the fork/vfork. This
+     eventually caused an internal error because that thread had
+     a PID for which there was not corresponding thread (see
+     inferior_thread).  */
+  event_thread
+    = (struct thread_info *) find_inferior (&all_threads,
+					    select_fork_vfork_lwp_callback,
+					    NULL);
+  if (event_thread != NULL)
+    {
+      if (debug_threads)
+	debug_printf ("SEL: Select fork/vfork %s\n",
+		      target_pid_to_str (ptid_of (event_thread)));
+    }
+
   /* In all-stop, give preference to the LWP that is being
      single-stepped.  There will be at most one, and it's the LWP that
      the core is most interested in.  If we didn't do this, then we'd
@@ -2879,7 +2921,7 @@  select_event_lwp (struct lwp_info **orig_lp)
      report the pending SIGTRAP, and the core, not having stepped the
      thread, wouldn't understand what the trap was for, and therefore
      would report it to the user as a random signal.  */
-  if (!non_stop)
+  if (event_thread == NULL && !non_stop)
     {
       event_thread
 	= (struct thread_info *) find_inferior (&all_threads,
diff --git a/gdb/testsuite/gdb.ada/bp_and_fork.exp b/gdb/testsuite/gdb.ada/bp_and_fork.exp
new file mode 100644
index 0000000..e9ec08c
--- /dev/null
+++ b/gdb/testsuite/gdb.ada/bp_and_fork.exp
@@ -0,0 +1,41 @@ 
+# Copyright 2016 Free Software Foundation, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+load_lib "ada.exp"
+
+standard_ada_testfile a_test
+
+if {[gdb_compile_ada "${srcfile}" "${binfile}" executable [list debug]] != "" } {
+  return -1
+}
+
+clean_restart ${testfile}
+
+set bp_loc_1 [gdb_get_line_number "Task 1" ${testdir}/a_test.adb]
+gdb_test "break a_test.adb:$bp_loc_1" \
+         "Breakpoint $decimal at $hex: file .*a_test.adb, line $decimal\\."
+
+set bp_loc_2 [gdb_get_line_number "Task 2" ${testdir}/a_test.adb]
+gdb_test "break a_test.adb:$bp_loc_2" \
+         "Breakpoint $decimal at $hex: file .*a_test.adb, line $decimal\\."
+
+gdb_run_cmd
+gdb_test "" \
+         "Breakpoint $decimal, a_test.task$decimal \\(.*\\).*" \
+         "run to first breakpoint"
+
+gdb_test "continue" \
+         "Breakpoint $decimal, a_test.task$decimal \\(.*\\).*" \
+         "continue to second breakpoint"
diff --git a/gdb/testsuite/gdb.ada/bp_and_fork/a_test.adb b/gdb/testsuite/gdb.ada/bp_and_fork/a_test.adb
new file mode 100644
index 0000000..095c2eb
--- /dev/null
+++ b/gdb/testsuite/gdb.ada/bp_and_fork/a_test.adb
@@ -0,0 +1,68 @@ 
+--  Copyright 2003-2016 Free Software Foundation, Inc.
+--
+--  This program is free software; you can redistribute it and/or modify
+--  it under the terms of the GNU General Public License as published by
+--  the Free Software Foundation; either version 3 of the License, or
+--  (at your option) any later version.
+--
+--  This program is distributed in the hope that it will be useful,
+--  but WITHOUT ANY WARRANTY; without even the implied warranty of
+--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+--  GNU General Public License for more details.
+--
+--  You should have received a copy of the GNU General Public License
+--  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+with System;
+with Text_Io;
+
+procedure A_Test is
+
+  Command : constant String := "echo Hello World &" & Ascii.Nul;
+  Status : Integer;
+
+  ---------------------------------
+  -- System shell command interface
+  ---------------------------------
+  function Shell_Cmd (Command : in System.Address) return Integer;
+  pragma Import (C, Shell_Cmd, "system");
+
+  ------------------
+  -- Start 2 threads
+  ------------------
+
+  task Task1 is
+    entry Start_Task;
+  end Task1;
+
+  task Task2 is
+    entry Start_Task;
+  end Task2;
+
+  task body Task1 is
+  begin
+    select  -- Task 1
+      accept Start_Task;
+    or
+      terminate;
+    end select;
+  end Task1;
+
+  task body Task2 is
+  begin
+    select  -- Task 2
+      accept Start_Task;
+    or
+      terminate;
+    end select;
+  end Task2;
+
+  procedure My_Routine is
+  begin
+    Text_Io.Put_Line ("Program successful");
+  end My_Routine;
+
+begin
+  Status := Shell_Cmd (Command'Address);
+  My_Routine;
+end A_Test;
-- 
2.1.4