Should this be on the blocker list for the 7.10 release?

Message ID	559C14B6.5020800@redhat.com
State	New, archived
Headers	Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk Sender: gdb-patches-owner@sourceware.org Message-ID: <559C14B6.5020800@redhat.com> Date: Tue, 07 Jul 2015 19:04:38 +0100 From: Pedro Alves <palves@redhat.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: Joel Brobecker <brobecker@adacore.com>, Simon Marchi <simon.marchi@ericsson.com> CC: gdb-patches <gdb-patches@sourceware.org> Subject: Re: Should this be on the blocker list for the 7.10 release? References: <559AE482.1010109@ericsson.com> <20150707132459.GA16734@adacore.com> <559BFBBD.4000303@redhat.com> In-Reply-To: <559BFBBD.4000303@redhat.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit

Message ID

559C14B6.5020800@redhat.com

State

New, archived

Headers

Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: gdb-patches-owner@sourceware.org
Message-ID: <559C14B6.5020800@redhat.com>
Date: Tue, 07 Jul 2015 19:04:38 +0100
From: Pedro Alves <palves@redhat.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Joel Brobecker <brobecker@adacore.com>,
	Simon Marchi <simon.marchi@ericsson.com>
CC: gdb-patches <gdb-patches@sourceware.org>
Subject: Re: Should this be on the blocker list for the 7.10 release?
References: <559AE482.1010109@ericsson.com>
	<20150707132459.GA16734@adacore.com>
	<559BFBBD.4000303@redhat.com>
In-Reply-To: <559BFBBD.4000303@redhat.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit

Commit Message

Pedro Alves July 7, 2015, 6:04 p.m. UTC

  On 07/07/2015 05:18 PM, Pedro Alves wrote:
> On 07/07/2015 02:24 PM, Joel Brobecker wrote:
> 
>> Not sure. I think Pedro would be in a better position to answer.
>> For now, I've put this issue as a "maybe" for 7.10; so we will not
>> release until this is fixed, or we explicitly decide it's OK for 7.10.
>>
>> Pedro?
> 
> Let me take a look and understand this better.

OK, the issue is that the new clone thread is found while inside
the linux_stop_and_wait_all_lwps call in this new bit of
code in linux-thread-db.c:

      linux_stop_and_wait_all_lwps ();

      ALL_LWPS (lp)
	if (ptid_get_pid (lp->ptid) == pid)
	  thread_from_lwp (lp->ptid);

      linux_unstop_all_lwps ();

We reach linux_handle_extended_wait with the "stopping"
parameter set to 1, and because of that we don't mark the
new lwp as resumed.  As consequence, the subsequent
resume_stopped_resumed_lwps (called first from that
linux_unstop_all_lwps) never resumes the new LWP...

There's lots of cruft in linux_handle_extended_wait that no
longer makes sense.  This seems to fix your github test
for me, and causes no testsuite regressions.

Did you try converting your test case to a proper
GDB test?  That'd be much appreciated.

---
From a4f205a18dffaff3344b31e9b8009b1c0de8ba80 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 7 Jul 2015 17:42:52 +0100
Subject: [PATCH] fix

---
 gdb/linux-nat.c | 91 +++++++++++++++++++++++++--------------------------------
 1 file changed, 40 insertions(+), 51 deletions(-)

Comments

Simon Marchi July 7, 2015, 6:35 p.m. UTC | #1

> OK, the issue is that the new clone thread is found while inside
> the linux_stop_and_wait_all_lwps call in this new bit of
> code in linux-thread-db.c:
> 
>       linux_stop_and_wait_all_lwps ();
> 
>       ALL_LWPS (lp)
> 	if (ptid_get_pid (lp->ptid) == pid)
> 	  thread_from_lwp (lp->ptid);
> 
>       linux_unstop_all_lwps ();
> 
> We reach linux_handle_extended_wait with the "stopping"
> parameter set to 1, and because of that we don't mark the
> new lwp as resumed.  As consequence, the subsequent
> resume_stopped_resumed_lwps (called first from that
> linux_unstop_all_lwps) never resumes the new LWP...
> 
> There's lots of cruft in linux_handle_extended_wait that no
> longer makes sense.  This seems to fix your github test
> for me, and causes no testsuite regressions.

It seems to fix most of it.  The only odd thing left that I
noticed is that it leaves some of the inferiors there.  When I
type "info inferiors" after running the program, I see one or
two of them left.  I believe there should only be inferior #1
left.

> Did you try converting your test case to a proper
> GDB test?  That'd be much appreciated.

I haven't, but I will.

Thanks!

Pedro Alves July 7, 2015, 6:47 p.m. UTC | #2

On 07/07/2015 07:35 PM, Simon Marchi wrote:
>> OK, the issue is that the new clone thread is found while inside
>> the linux_stop_and_wait_all_lwps call in this new bit of
>> code in linux-thread-db.c:
>>
>>       linux_stop_and_wait_all_lwps ();
>>
>>       ALL_LWPS (lp)
>> 	if (ptid_get_pid (lp->ptid) == pid)
>> 	  thread_from_lwp (lp->ptid);
>>
>>       linux_unstop_all_lwps ();
>>
>> We reach linux_handle_extended_wait with the "stopping"
>> parameter set to 1, and because of that we don't mark the
>> new lwp as resumed.  As consequence, the subsequent
>> resume_stopped_resumed_lwps (called first from that
>> linux_unstop_all_lwps) never resumes the new LWP...
>>
>> There's lots of cruft in linux_handle_extended_wait that no
>> longer makes sense.  This seems to fix your github test
>> for me, and causes no testsuite regressions.
> 
> It seems to fix most of it.  The only odd thing left that I
> noticed is that it leaves some of the inferiors there.  When I
> type "info inferiors" after running the program, I see one or
> two of them left.  I believe there should only be inferior #1
> left.

Hmm, indeed:

(gdb) info inferiors
  Num  Description       Executable
  4    process 8393      /home/pedro/bugs/src/test
  2    process 8388      /home/pedro/bugs/src/test
* 1    <null>            /home/pedro/bugs/src/test
(gdb) info threads

Calling prune_inferiors() at this point (from a top gdb)
does not remove them, because they still have inf->pid != 0.
Sounds like we miss mourning inferiors 2 and 4 somehow?

Thanks,
Pedro Alves

Pedro Alves July 7, 2015, 7:04 p.m. UTC | #3

On 07/07/2015 07:47 PM, Pedro Alves wrote:

> Hmm, indeed:
> 
> (gdb) info inferiors
>   Num  Description       Executable
>   4    process 8393      /home/pedro/bugs/src/test
>   2    process 8388      /home/pedro/bugs/src/test
> * 1    <null>            /home/pedro/bugs/src/test
> (gdb) info threads
> 
> Calling prune_inferiors() at this point (from a top gdb)
> does not remove them, because they still have inf->pid != 0.
> Sounds like we miss mourning inferiors 2 and 4 somehow?

I think I got it.

Enabling logs (master + previous patch) I see:

...
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9513) received Trace/breakpoint trap (stopped)
WL: Handling extended status 0x03057f
LHEW: Got clone event from LWP 9513, new child is LWP 9579
[New Thread 0x7ffff37b8700 (LWP 9579)]
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9508) received 0 (exited)
WL: Thread 0x7ffff7fc2740 (LWP 9508) exited.
                           ^^^^^^^^
[Thread 0x7ffff7fc2740 (LWP 9508) exited]
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9499) received 0 (exited)
WL: Thread 0x7ffff7fc2740 (LWP 9499) exited.
[Thread 0x7ffff7fc2740 (LWP 9499) exited]
RSRL: resuming stopped-resumed LWP Thread 0x7ffff37b8700 (LWP 9579) at 0x3615ef4ce1: step=0
...
(gdb) info inferiors
  Num  Description       Executable
  5    process 9508      /home/pedro/bugs/src/test
               ^^^^
  4    process 9503      /home/pedro/bugs/src/test
  3    process 9500      /home/pedro/bugs/src/test
  2    process 9499      /home/pedro/bugs/src/test
* 1    <null>            /home/pedro/bugs/src/test
(gdb)
...

Note the "Thread 0x7ffff7fc2740 (LWP 9508) exited." line.
That's this in wait_lwp:

      /* Check if the thread has exited.  */
      if (WIFEXITED (status) || WIFSIGNALED (status))
	{
	  thread_dead = 1;
	  if (debug_linux_nat)
	    fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
				target_pid_to_str (lp->ptid));
	}
    }

This code doesn't understand that an WIFEXITED status
of the last lwp of the process should be reported as
process exit.  Haven't tried to fix it yet.

I haven't tried 7.9, but I'd think it possible to trigger
this issue in all-stop there, because all it takes
is a whole process exiting while gdb is iterating
over lwps, stopping them.

Thanks,
Pedro Alves

diff mbox

Patch

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index be429f8..ea38ebb 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -2086,43 +2086,7 @@  linux_handle_extended_wait (struct lwp_info *lp, int status,
 	  new_lp = add_lwp (ptid_build (ptid_get_pid (lp->ptid), new_pid, 0));
 	  new_lp->cloned = 1;
 	  new_lp->stopped = 1;
-
-	  if (WSTOPSIG (status) != SIGSTOP)
-	    {
-	      /* This can happen if someone starts sending signals to
-		 the new thread before it gets a chance to run, which
-		 have a lower number than SIGSTOP (e.g. SIGUSR1).
-		 This is an unlikely case, and harder to handle for
-		 fork / vfork than for clone, so we do not try - but
-		 we handle it for clone events here.  We'll send
-		 the other signal on to the thread below.  */
-
-	      new_lp->signalled = 1;
-	    }
-	  else
-	    {
-	      struct thread_info *tp;
-
-	      /* When we stop for an event in some other thread, and
-		 pull the thread list just as this thread has cloned,
-		 we'll have seen the new thread in the thread_db list
-		 before handling the CLONE event (glibc's
-		 pthread_create adds the new thread to the thread list
-		 before clone'ing, and has the kernel fill in the
-		 thread's tid on the clone call with
-		 CLONE_PARENT_SETTID).  If that happened, and the core
-		 had requested the new thread to stop, we'll have
-		 killed it with SIGSTOP.  But since SIGSTOP is not an
-		 RT signal, it can only be queued once.  We need to be
-		 careful to not resume the LWP if we wanted it to
-		 stop.  In that case, we'll leave the SIGSTOP pending.
-		 It will later be reported as GDB_SIGNAL_0.  */
-	      tp = find_thread_ptid (new_lp->ptid);
-	      if (tp != NULL && tp->stop_requested)
-		new_lp->last_resume_kind = resume_stop;
-	      else
-		status = 0;
-	    }
+	  new_lp->resumed = 1;
 
 	  /* If the thread_db layer is active, let it record the user
 	     level thread id and status, and add the thread to GDB's
@@ -2136,19 +2100,23 @@  linux_handle_extended_wait (struct lwp_info *lp, int status,
 	    }
 
 	  /* Even if we're stopping the thread for some reason
-	     internal to this module, from the user/frontend's
-	     perspective, this new thread is running.  */
+	     internal to this module, from the perspective of infrun
+	     and the user/frontend, this new thread is running until
+	     it next reports a stop.  */
 	  set_running (new_lp->ptid, 1);
-	  if (!stopping)
-	    {
-	      set_executing (new_lp->ptid, 1);
-	      /* thread_db_attach_lwp -> lin_lwp_attach_lwp forced
-		 resume_stop.  */
-	      new_lp->last_resume_kind = resume_continue;
-	    }
+	  set_executing (new_lp->ptid, 1);
 
-	  if (status != 0)
+	  if (WSTOPSIG (status) != SIGSTOP)
 	    {
+	      /* This can happen if someone starts sending signals to
+		 the new thread before it gets a chance to run, which
+		 have a lower number than SIGSTOP (e.g. SIGUSR1).
+		 This is an unlikely case, and harder to handle for
+		 fork / vfork than for clone, so we do not try - but
+		 we handle it for clone events here.  */
+
+	      new_lp->signalled = 1;
+
 	      /* We created NEW_LP so it cannot yet contain STATUS.  */
 	      gdb_assert (new_lp->status == 0);
 
@@ -2162,7 +2130,6 @@  linux_handle_extended_wait (struct lwp_info *lp, int status,
 	      new_lp->status = status;
 	    }
 
-	  new_lp->resumed = !stopping;
 	  return 1;
 	}
 
@@ -3673,9 +3640,31 @@  resume_stopped_resumed_lwps (struct lwp_info *lp, void *data)
 {
   ptid_t *wait_ptid_p = data;
 
-  if (lp->stopped
-      && lp->resumed
-      && !lwp_status_pending_p (lp))
+  if (!lp->stopped)
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "not stopped\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else if (!lp->resumed)
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "not resumed\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else if (lwp_status_pending_p (lp))
+    {
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: NOT resuming stopped-resumed LWP %s, "
+			    "has pending status\n",
+			    target_pid_to_str (lp->ptid));
+    }
+  else
     {
       struct regcache *regcache = get_thread_regcache (lp->ptid);
       struct gdbarch *gdbarch = get_regcache_arch (regcache);