Fix threads left stopped after fork+thread spawn (PR threads/18600)

Message ID 1437492947-2533-1-git-send-email-simon.marchi@ericsson.com
State New, archived

Commit Message

Simon Marchi July 21, 2015, 3:35 p.m. UTC
  I am posting the fix suggested by Pedro for bug 18600 [1, 2].  I have to
admit that I don't fully understand the fix, but it fixes the bug and
causes no regression according to the testsuite.

The first issue is that threads were left stopped while gdb was thinking
they were running.  The second issue, uncovered after fixing the first
one, was that exited inferiors were left when they should have been
removed.

I tried to make a test for this situation, but it's been a bit more
difficult than I expected.  The idea is that the inferior forks a
certain number of times and waits for all children to exit.  Each fork
child spawns a number of threads that do nothing and joins them
immediately.  Normally, the program should run unimpeded (from the point
of view of the user) and exit very quickly.  Without this fix, it
doesn't because of some threads left stopped by gdb.  My only problem is
that the prompt comes back as soon as any inferior (even a child) exits,
and that moment is too early to say if the test passed or not.  I had to
resort to making it sleep for a second and then check that no thread is
left.  If you have suggestions on how to make the test more robust, they
are very welcome.

If this is acceptable, I would suggest backporting the fix to the 7.10
branch (it's currently in the TODO on the 7.10 wiki page).

gdb/ChangeLog:

	* linux-nat.c (linux_handle_extended_wait): On CLONE event,
	always mark the new thread as resumed.  Remove STOPPING
	parameter.
	(wait_lwp): Report to the core when thread group leader exits.
	Adjust call to linux_handle_extended_wait.
	(linux_nat_filter_event): Adjust call to
	linux_handle_extended_wait.
	(resume_stopped_resumed_lwps): Add debug output.

gdb/testsuite/ChangeLog:

	* gdb.threads/fork-plus-threads.c: New file.
	* gdb.threads/fork-plus-threads.exp: New file.
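
For the wait_lwp entry above, the check amounts to "the LWP that exited is
the thread group leader", i.e. its LWP id equals the process id.  The patch
comment notes that W_EXITCODE(0,0) == 0, which is presumably why the exit
is stored as a waitstatus rather than as a raw status.  A minimal sketch
with hypothetical helper names (not the actual linux-nat.c code):

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>

/* glibc provides W_EXITCODE; this fallback uses the same encoding in
   case the libc in use does not define it.  */
#ifndef W_EXITCODE
#define W_EXITCODE(ret, sig) ((ret) << 8 | (sig))
#endif

/* Hypothetical helper: true if LWP is the thread group leader of
   process PID.  On Linux the leader's LWP id equals the process id.  */
static bool
is_leader (pid_t pid, pid_t lwp)
{
  assert (pid > 0 && lwp > 0);
  return pid == lwp;
}

/* Hypothetical helper: true if STATUS says the LWP is gone, either
   because it exited or because a signal killed it.  */
static bool
lwp_is_gone (int status)
{
  return WIFEXITED (status) || WIFSIGNALED (status);
}

/* Hypothetical helper: true if a wait status for (PID, LWP) means
   the whole process is gone and should be reported as such.  */
static bool
process_exited (pid_t pid, pid_t lwp, int status)
{
  return is_leader (pid, lwp) && lwp_is_gone (status);
}
```

Note that a normal exit with code 0 encodes to a raw status of 0, so a
plain integer status cannot tell "exited normally" apart from "nothing
recorded".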

[1] https://sourceware.org/ml/gdb-patches/2015-07/msg00186.html
[2] https://sourceware.org/ml/gdb-patches/2015-07/msg00190.html
---
 gdb/linux-nat.c                                 | 114 ++++++++++++------------
 gdb/testsuite/gdb.threads/fork-plus-threads.c   |  94 +++++++++++++++++++
 gdb/testsuite/gdb.threads/fork-plus-threads.exp |  53 +++++++++++
 3 files changed, 205 insertions(+), 56 deletions(-)
 create mode 100644 gdb/testsuite/gdb.threads/fork-plus-threads.c
 create mode 100644 gdb/testsuite/gdb.threads/fork-plus-threads.exp
  

Comments

Pedro Alves July 23, 2015, 11:04 a.m. UTC | #1
On 07/21/2015 04:35 PM, Simon Marchi wrote:
> I am posting the fix suggested by Pedro for bug 18600 [1, 2].  I have to
> admit that I don't fully understand the fix, but it fixes the bug and
> causes no regression according to the testsuite.

Thanks Simon.  I'm playing with this.  I made the test run against
extended-remote gdbserver, and that caught several issues with
non-stop + the new remote follow-fork support.

Thanks,
Pedro Alves
  
Pedro Alves July 23, 2015, 5:10 p.m. UTC | #2
On 07/23/2015 12:04 PM, Pedro Alves wrote:

> Thanks Simon.  I'm playing with this.  I made the test run against
> extended-remote gdbserver, and that caught several issues with
> non-stop + the new remote follow-fork support.

So one of the changes I had done was to run to a breakpoint at the end
of main instead of letting inferior 1 exit.  That exposes gdbserver
crashes.  I have fixes for that, but I'm a bit reluctant to put them
in 7.10, so I'll post them for master only.  Reverting that change to
use a breakpoint still shows other bogus things against gdbserver, but the
test still passes.  E.g., all the "[Thread FOO] #NN stopped." below are bogus
(it's a gdbserver bug), and note the "Cannot remove breakpoints because
program is no longer writable." too:

(gdb) PASS: gdb.threads/fork-plus-threads.exp: set detach-on-fork off
continue &
Continuing.
(gdb) PASS: gdb.threads/fork-plus-threads.exp: continue &
[New Thread 28092.28092]

[Thread 28092.28092] #2 stopped.
[New Thread 28094.28094]
[Inferior 2 (process 28092) exited normally]
[New Thread 28094.28105]
[New Thread 28094.28109]

[Thread 28094.28094] #3 stopped.
[New Thread 28106.28106]
[Inferior 3 (process 28094) exited normally]
[New Thread 28106.28117]

[Thread 28106.28106] #6 stopped.
[New Thread 28118.28118]
[Inferior 4 (process 28106) exited normally]
[New Thread 28118.28132]

[Thread 28118.28118] #8 stopped.
[New Thread 28128.28128]
[Inferior 5 (process 28118) exited normally]
[New Thread 28128.28145]

[Thread 28128.28128] #10 stopped.
[New Thread 28141.28141]
[Inferior 6 (process 28128) exited normally]
[New Thread 28141.28159]

[Thread 28141.28141] #12 stopped.
[New Thread 28151.28151]
[Inferior 7 (process 28141) exited normally]
[New Thread 28151.28165]

[Thread 28151.28151] #14 stopped.
[New Thread 28162.28162]
[Inferior 8 (process 28151) exited normally]
[New Thread 28162.28162]

[Thread 28162.28162] #17 stopped.
[New Thread 28174.28174]
[Inferior 9 (process 28162) exited normally]
[New Thread 28174.28191]

[Thread 28174.28174] #18 stopped.
[New Thread 28185.28185]
[Inferior 10 (process 28174) exited normally]
[New Thread 28185.28196]

[Thread 28185.28185] #20 stopped.
Cannot remove breakpoints because program is no longer writable.
Further execution is probably impossible.
[Inferior 11 (process 28185) exited normally]
[Inferior 1 (process 28091) exited normally]
PASS: gdb.threads/fork-plus-threads.exp: reached breakpoint
info threads
No threads.
(gdb) PASS: gdb.threads/fork-plus-threads.exp: no threads left
info inferiors
  Num  Description       Executable
* 1    <null>            /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/fork-plus-threads
(gdb) PASS: gdb.threads/fork-plus-threads.exp: only inferior 1 left


So that we're on the same page, I'm replying with a few comments to the
original patch below, and then I'll post updated patches with the
points I raise addressed, but split into separate patches, one for each
of the problems identified.  That will include a test for the second
issue as well.

Many thanks for the test, and writing ChangeLogs, etc.!


On 07/21/2015 04:35 PM, Simon Marchi wrote:
> I tried to make a test for this situation, but it's been a bit more
> difficult than I expected.  The idea is that the inferior forks a
> certain number of times and waits for all children to exit.  Each fork
> child spawns a number of threads that do nothing and joins them
> immediately.  Normally, the program should run unimpeded (from the point
> of view of the user) and exit very quickly.  Without this fix, it
> doesn't because of some threads left stopped by gdb.  My only problem is
> that the prompt comes back as soon as any inferior (even a child) exits,
> and that moment is too early to say if the test passed or not.  I had to
> resort to making it sleep for a second and then check that no thread is
> left.  If you have suggestions on how to make the test more robust, they
> are very welcome.

We can instead wait for the "inferior 1 exited normally" output.
On a buggy gdb, inferior 1 won't ever exit, stuck waiting for the
children to exit.

> +++ b/gdb/testsuite/gdb.threads/fork-plus-threads.c
> @@ -0,0 +1,94 @@
> +#include <assert.h>
> +#include <pthread.h>

Missing copyright header.

> +#include <stdio.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +
> +


> +int
> +main (void)
> +{
> +  pid_t childs[NFORKS];
> +  int i;
> +  int status;
> +  int num_exited = 0;


We should have an "alarm()" call here, so that if something goes
wrong, the process eventually kills itself.
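
A minimal sketch of such a watchdog, assuming a 60-second timeout; the
value and the helper names are illustrative and not taken from the patch:

```c
#include <assert.h>
#include <unistd.h>

/* Illustrative timeout; the review does not specify a value.  */
#define WATCHDOG_SECONDS 60

/* Hypothetical helper: schedule SIGALRM to be delivered after SECONDS.
   With the default disposition, SIGALRM terminates the process, so a
   hung test program goes away on its own.  Returns the number of
   seconds that were left on a previously scheduled alarm, or 0 if
   there was none.  */
static unsigned int
arm_watchdog (unsigned int seconds)
{
  assert (seconds > 0);
  return alarm (seconds);
}

/* Hypothetical helper: cancel any pending alarm and return the number
   of seconds that were still remaining on it.  */
static unsigned int
disarm_watchdog (void)
{
  return alarm (0);
}
```

In the test program this would be a single arm_watchdog (WATCHDOG_SECONDS)
call at the top of main.  Pending alarms are not inherited across fork, so
each fork child would need its own call if it should be covered too.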


> +# The problem was originally seen on Linux, but the test could be
> +# generalized to all targets that support forks and threads.
> +if ![istarget *-*-linux*] then {
> +    return
> +}

The problem with these checks is that most probably nobody will
ever relax them.  Targets that don't support fork or threads at all
will fail to compile the test, which results in the test being
skipped already.  So I think we should just start by running the
test everywhere.

> +
> +# When using gdbserver, even on Linux, we don't get notifications
> +# about new threads.  This is expected, so don't test for that.
> +if [is_remote target] then {
> +    return
> +}

I'm not seeing why thread notifications would be required,
but in any case, extended-remote supports follow fork
nowadays, so I think we should remove this check too.

> +
> +standard_testfile
> +
> +if {[gdb_compile_pthreads "${srcdir}/${subdir}/${srcfile}" "${binfile}" executable debug] != "" } {
> +    return -1
> +}
> +
> +clean_restart ${binfile}
> +
> +gdb_test_no_output "set non-stop on"

The native-extended-gdbserver board starts gdbserver
and connects to it from within clean_restart, so this
"set non-stop on" here is too late.  The easy fix
is to push '-ex "set non-stop on"' in GDBFLAGS instead.

Thanks,
Pedro Alves
  

Patch

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index be429f8..4a0391e 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2000,8 +2000,7 @@  linux_handle_syscall_trap (struct lwp_info *lp, int stopping)
    true, the new LWP remains stopped, otherwise it is continued.  */
 
 static int
-linux_handle_extended_wait (struct lwp_info *lp, int status,
-			    int stopping)
+linux_handle_extended_wait (struct lwp_info *lp, int status)
 {
   int pid = ptid_get_lwp (lp->ptid);
   struct target_waitstatus *ourstatus = &lp->waitstatus;
@@ -2071,7 +2070,7 @@  linux_handle_extended_wait (struct lwp_info *lp, int status,
 	ourstatus->kind = TARGET_WAITKIND_FORKED;
       else if (event == PTRACE_EVENT_VFORK)
 	ourstatus->kind = TARGET_WAITKIND_VFORKED;
-      else
+      else if (event == PTRACE_EVENT_CLONE)
 	{
 	  struct lwp_info *new_lp;
 
@@ -2086,43 +2085,7 @@  linux_handle_extended_wait (struct lwp_info *lp, int status,
 	  new_lp = add_lwp (ptid_build (ptid_get_pid (lp->ptid), new_pid, 0));
 	  new_lp->cloned = 1;
 	  new_lp->stopped = 1;
-
-	  if (WSTOPSIG (status) != SIGSTOP)
-	    {
-	      /* This can happen if someone starts sending signals to
-		 the new thread before it gets a chance to run, which
-		 have a lower number than SIGSTOP (e.g. SIGUSR1).
-		 This is an unlikely case, and harder to handle for
-		 fork / vfork than for clone, so we do not try - but
-		 we handle it for clone events here.  We'll send
-		 the other signal on to the thread below.  */
-
-	      new_lp->signalled = 1;
-	    }
-	  else
-	    {
-	      struct thread_info *tp;
-
-	      /* When we stop for an event in some other thread, and
-		 pull the thread list just as this thread has cloned,
-		 we'll have seen the new thread in the thread_db list
-		 before handling the CLONE event (glibc's
-		 pthread_create adds the new thread to the thread list
-		 before clone'ing, and has the kernel fill in the
-		 thread's tid on the clone call with
-		 CLONE_PARENT_SETTID).  If that happened, and the core
-		 had requested the new thread to stop, we'll have
-		 killed it with SIGSTOP.  But since SIGSTOP is not an
-		 RT signal, it can only be queued once.  We need to be
-		 careful to not resume the LWP if we wanted it to
-		 stop.  In that case, we'll leave the SIGSTOP pending.
-		 It will later be reported as GDB_SIGNAL_0.  */
-	      tp = find_thread_ptid (new_lp->ptid);
-	      if (tp != NULL && tp->stop_requested)
-		new_lp->last_resume_kind = resume_stop;
-	      else
-		status = 0;
-	    }
+	  new_lp->resumed = 1;
 
 	  /* If the thread_db layer is active, let it record the user
 	     level thread id and status, and add the thread to GDB's
@@ -2136,19 +2099,23 @@  linux_handle_extended_wait (struct lwp_info *lp, int status,
 	    }
 
 	  /* Even if we're stopping the thread for some reason
-	     internal to this module, from the user/frontend's
-	     perspective, this new thread is running.  */
+	     internal to this module, from the perspective of infrun
+	     and the user/frontend, this new thread is running until
+	     it next reports a stop.  */
 	  set_running (new_lp->ptid, 1);
-	  if (!stopping)
-	    {
-	      set_executing (new_lp->ptid, 1);
-	      /* thread_db_attach_lwp -> lin_lwp_attach_lwp forced
-		 resume_stop.  */
-	      new_lp->last_resume_kind = resume_continue;
-	    }
+	  set_executing (new_lp->ptid, 1);
 
-	  if (status != 0)
+	  if (WSTOPSIG (status) != SIGSTOP)
 	    {
+	      /* This can happen if someone starts sending signals to
+		 the new thread before it gets a chance to run, which
+		 have a lower number than SIGSTOP (e.g. SIGUSR1).
+		 This is an unlikely case, and harder to handle for
+		 fork / vfork than for clone, so we do not try - but
+		 we handle it for clone events here.  */
+
+	      new_lp->signalled = 1;
+
 	      /* We created NEW_LP so it cannot yet contain STATUS.  */
 	      gdb_assert (new_lp->status == 0);
 
@@ -2162,7 +2129,6 @@  linux_handle_extended_wait (struct lwp_info *lp, int status,
 	      new_lp->status = status;
 	    }
 
-	  new_lp->resumed = !stopping;
 	  return 1;
 	}
 
@@ -2308,6 +2274,20 @@  wait_lwp (struct lwp_info *lp)
       /* Check if the thread has exited.  */
       if (WIFEXITED (status) || WIFSIGNALED (status))
 	{
+	  if (ptid_get_pid (lp->ptid) == ptid_get_lwp (lp->ptid))
+	    {
+	      if (debug_linux_nat)
+		fprintf_unfiltered (gdb_stdlog, "WL: Process %d exited.\n",
+				    ptid_get_pid (lp->ptid));
+
+	      /* This is the leader exiting, it means the whole
+		 process is gone.  Store the status to report to the
+		 core.  Store it in the lp->waitstatus, because
+		 W_EXITCODE(0,0) == 0.  */
+	      store_waitstatus (&lp->waitstatus, status);
+	      return 0;
+	    }
+
 	  thread_dead = 1;
 	  if (debug_linux_nat)
 	    fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
@@ -2353,7 +2333,7 @@  wait_lwp (struct lwp_info *lp)
 	fprintf_unfiltered (gdb_stdlog,
 			    "WL: Handling extended status 0x%06x\n",
 			    status);
-      linux_handle_extended_wait (lp, status, 1);
+      linux_handle_extended_wait (lp, status);
       return 0;
     }
 
@@ -3155,7 +3135,7 @@  linux_nat_filter_event (int lwpid, int status)
 	fprintf_unfiltered (gdb_stdlog,
 			    "LLW: Handling extended status 0x%06x\n",
 			    status);
-      if (linux_handle_extended_wait (lp, status, 0))
+      if (linux_handle_extended_wait (lp, status))
 	return NULL;
     }
 
@@ -3673,9 +3653,31 @@  resume_stopped_resumed_lwps (struct lwp_info *lp, void *data)
 {
   ptid_t *wait_ptid_p = data;
 
-  if (lp->stopped
-      && lp->resumed
-      && !lwp_status_pending_p (lp))
+  if (!lp->stopped)
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "not stopped\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else if (!lp->resumed)
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "not resumed\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else if (lwp_status_pending_p (lp))
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "has pending status\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else
     {
       struct regcache *regcache = get_thread_regcache (lp->ptid);
       struct gdbarch *gdbarch = get_regcache_arch (regcache);
diff --git a/gdb/testsuite/gdb.threads/fork-plus-threads.c b/gdb/testsuite/gdb.threads/fork-plus-threads.c
new file mode 100644
index 0000000..576e79d
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/fork-plus-threads.c
@@ -0,0 +1,94 @@ 
+#include <assert.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+
+/* Number of times the main process forks.  */
+#define NFORKS 10
+
+/* Number of threads by each fork child.  */
+#define NTHREADS 10
+
+static void *
+thread_func (void *arg)
+{
+  return NULL;
+}
+
+static void
+fork_child (void)
+{
+  pthread_t threads[NTHREADS];
+  int i;
+  int ret;
+
+  for (i = 0; i < NTHREADS; i++)
+    {
+      ret = pthread_create (&threads[i], NULL, thread_func, NULL);
+      assert (ret == 0);
+    }
+
+  for (i = 0; i < NTHREADS; i++)
+    {
+      ret = pthread_join (threads[i], NULL);
+      assert (ret == 0);
+    }
+}
+
+int
+main (void)
+{
+  pid_t childs[NFORKS];
+  int i;
+  int status;
+  int num_exited = 0;
+
+  for (i = 0; i < NFORKS; i++)
+  {
+    pid_t pid;
+
+    pid = fork ();
+
+    if (pid > 0)
+      {
+	/* Parent.  */
+	childs[i] = pid;
+      }
+    else if (pid == 0)
+      {
+	/* Child.  */
+	fork_child ();
+	return 0;
+      }
+    else
+      {
+	perror ("fork");
+	return 1;
+      }
+  }
+
+  while (num_exited != NFORKS)
+    {
+      pid_t pid = wait (&status);
+
+      if (pid == -1)
+	{
+	  perror ("wait");
+	  return 1;
+	}
+
+      if (WIFEXITED (status))
+        {
+	  num_exited++;
+	}
+      else
+	{
+	  printf ("Hmm, unexpected wait status 0x%x from child %d\n", status,
+	         pid);
+	}
+    }
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.threads/fork-plus-threads.exp b/gdb/testsuite/gdb.threads/fork-plus-threads.exp
new file mode 100644
index 0000000..cbb6a90
--- /dev/null
+++ b/gdb/testsuite/gdb.threads/fork-plus-threads.exp
@@ -0,0 +1,53 @@ 
+# Copyright (C) 2015 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# This test verifies that threads created by child fork are properly
+# handled.
+#
+# See https://sourceware.org/bugzilla/show_bug.cgi?id=18600
+
+# The problem was originally seen on Linux, but the test could be
+# generalized to all targets that support forks and threads.
+if ![istarget *-*-linux*] then {
+    return
+}
+
+# When using gdbserver, even on Linux, we don't get notifications
+# about new threads.  This is expected, so don't test for that.
+if [is_remote target] then {
+    return
+}
+
+standard_testfile
+
+if {[gdb_compile_pthreads "${srcdir}/${subdir}/${srcfile}" "${binfile}" executable debug] != "" } {
+    return -1
+}
+
+clean_restart ${binfile}
+
+gdb_test_no_output "set non-stop on"
+
+if ![runto_main] then {
+   fail "Can't run to main"
+   return 0
+}
+
+gdb_test_no_output "set detach-on-fork off"
+send_gdb "continue &\n"
+
+sleep 2
+
+gdb_test "info threads" "No threads.*"