Should this be on the blocker list for the 7.10 release?
Commit Message
On 07/07/2015 05:18 PM, Pedro Alves wrote:
> On 07/07/2015 02:24 PM, Joel Brobecker wrote:
>
>> Not sure. I think Pedro would be in a better position to answer.
>> For now, I've put this issue as a "maybe" for 7.10; so we will not
>> release until this is fixed, or we explicitly decide it's OK for 7.10.
>>
>> Pedro?
>
> Let me take a look and understand this better.
OK, the issue is that the new clone thread is found while inside
the linux_stop_and_wait_all_lwps call in this new bit of
code in linux-thread-db.c:
linux_stop_and_wait_all_lwps ();
ALL_LWPS (lp)
if (ptid_get_pid (lp->ptid) == pid)
thread_from_lwp (lp->ptid);
linux_unstop_all_lwps ();
We reach linux_handle_extended_wait with the "stopping"
parameter set to 1, and because of that we don't mark the
new lwp as resumed. As consequence, the subsequent
resume_stopped_resumed_lwps (called first from that
linux_unstop_all_lwps) never resumes the new LWP...
There's lots of cruft in linux_handle_extended_wait that no
longer makes sense. This seems to fix your github test
for me, and causes no testsuite regressions.
Did you try converting your test case to a proper
GDB test? That'd be much appreciated.
---
From a4f205a18dffaff3344b31e9b8009b1c0de8ba80 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 7 Jul 2015 17:42:52 +0100
Subject: [PATCH] fix
---
gdb/linux-nat.c | 91 +++++++++++++++++++++++++--------------------------------
1 file changed, 40 insertions(+), 51 deletions(-)
Comments
> OK, the issue is that the new clone thread is found while inside
> the linux_stop_and_wait_all_lwps call in this new bit of
> code in linux-thread-db.c:
>
> linux_stop_and_wait_all_lwps ();
>
> ALL_LWPS (lp)
> if (ptid_get_pid (lp->ptid) == pid)
> thread_from_lwp (lp->ptid);
>
> linux_unstop_all_lwps ();
>
> We reach linux_handle_extended_wait with the "stopping"
> parameter set to 1, and because of that we don't mark the
> new lwp as resumed. As consequence, the subsequent
> resume_stopped_resumed_lwps (called first from that
> linux_unstop_all_lwps) never resumes the new LWP...
>
> There's lots of cruft in linux_handle_extended_wait that no
> longer makes sense. This seems to fix your github test
> for me, and causes no testsuite regressions.
It seems to fix most of it. The only odd thing left that I
noticed is that it leaves some of the inferiors there. When I
type "info inferiors" after running the program, I see one or
two of them left. I believe there should only be inferior #1
left.
> Did you try converting your test case to a proper
> GDB test? That'd be much appreciated.
I haven't, but I will.
Thanks!
On 07/07/2015 07:35 PM, Simon Marchi wrote:
>> OK, the issue is that the new clone thread is found while inside
>> the linux_stop_and_wait_all_lwps call in this new bit of
>> code in linux-thread-db.c:
>>
>> linux_stop_and_wait_all_lwps ();
>>
>> ALL_LWPS (lp)
>> if (ptid_get_pid (lp->ptid) == pid)
>> thread_from_lwp (lp->ptid);
>>
>> linux_unstop_all_lwps ();
>>
>> We reach linux_handle_extended_wait with the "stopping"
>> parameter set to 1, and because of that we don't mark the
>> new lwp as resumed. As consequence, the subsequent
>> resume_stopped_resumed_lwps (called first from that
>> linux_unstop_all_lwps) never resumes the new LWP...
>>
>> There's lots of cruft in linux_handle_extended_wait that no
>> longer makes sense. This seems to fix your github test
>> for me, and causes no testsuite regressions.
>
> It seems to fix most of it. The only odd thing left that I
> noticed is that it leaves some of the inferiors there. When I
> type "info inferiors" after running the program, I see one or
> two of them left. I believe there should only be inferior #1
> left.
Hmm, indeed:
(gdb) info inferiors
Num Description Executable
4 process 8393 /home/pedro/bugs/src/test
2 process 8388 /home/pedro/bugs/src/test
* 1 <null> /home/pedro/bugs/src/test
(gdb) info threads
Calling prune_inferiors() at this point (from a top gdb)
does not remove them, because they still have inf->pid != 0.
Sounds like we miss mourning inferiors 2 and 4 somehow?
Thanks,
Pedro Alves
On 07/07/2015 07:47 PM, Pedro Alves wrote:
> Hmm, indeed:
>
> (gdb) info inferiors
> Num Description Executable
> 4 process 8393 /home/pedro/bugs/src/test
> 2 process 8388 /home/pedro/bugs/src/test
> * 1 <null> /home/pedro/bugs/src/test
> (gdb) info threads
>
> Calling prune_inferiors() at this point (from a top gdb)
> does not remove them, because they still have inf->pid != 0.
> Sounds like we miss mourning inferiors 2 and 4 somehow?
I think I got it.
Enabling logs (master + previous patch) I see:
...
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9513) received Trace/breakpoint trap (stopped)
WL: Handling extended status 0x03057f
LHEW: Got clone event from LWP 9513, new child is LWP 9579
[New Thread 0x7ffff37b8700 (LWP 9579)]
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9508) received 0 (exited)
WL: Thread 0x7ffff7fc2740 (LWP 9508) exited.
^^^^^^^^
[Thread 0x7ffff7fc2740 (LWP 9508) exited]
WL: waitpid Thread 0x7ffff7fc2740 (LWP 9499) received 0 (exited)
WL: Thread 0x7ffff7fc2740 (LWP 9499) exited.
[Thread 0x7ffff7fc2740 (LWP 9499) exited]
RSRL: resuming stopped-resumed LWP Thread 0x7ffff37b8700 (LWP 9579) at 0x3615ef4ce1: step=0
...
(gdb) info inferiors
Num Description Executable
5 process 9508 /home/pedro/bugs/src/test
^^^^
4 process 9503 /home/pedro/bugs/src/test
3 process 9500 /home/pedro/bugs/src/test
2 process 9499 /home/pedro/bugs/src/test
* 1 <null> /home/pedro/bugs/src/test
(gdb)
...
Note the "Thread 0x7ffff7fc2740 (LWP 9508) exited." line.
That's this in wait_lwp:
/* Check if the thread has exited. */
if (WIFEXITED (status) || WIFSIGNALED (status))
{
thread_dead = 1;
if (debug_linux_nat)
fprintf_unfiltered (gdb_stdlog, "WL: %s exited.\n",
target_pid_to_str (lp->ptid));
}
}
This code doesn't understand that an WIFEXITED status
of the last lwp of the process should be reported as
process exit. Haven't tried to fix it yet.
I haven't tried 7.9, but I'd think it possible to trigger
this issue in all-stop there, because all it takes
is a whole process exiting while gdb is iterating
over lwps, stopping them.
Thanks,
Pedro Alves
@@ -2086,43 +2086,7 @@ linux_handle_extended_wait (struct lwp_info *lp, int status,
new_lp = add_lwp (ptid_build (ptid_get_pid (lp->ptid), new_pid, 0));
new_lp->cloned = 1;
new_lp->stopped = 1;
-
- if (WSTOPSIG (status) != SIGSTOP)
- {
- /* This can happen if someone starts sending signals to
- the new thread before it gets a chance to run, which
- have a lower number than SIGSTOP (e.g. SIGUSR1).
- This is an unlikely case, and harder to handle for
- fork / vfork than for clone, so we do not try - but
- we handle it for clone events here. We'll send
- the other signal on to the thread below. */
-
- new_lp->signalled = 1;
- }
- else
- {
- struct thread_info *tp;
-
- /* When we stop for an event in some other thread, and
- pull the thread list just as this thread has cloned,
- we'll have seen the new thread in the thread_db list
- before handling the CLONE event (glibc's
- pthread_create adds the new thread to the thread list
- before clone'ing, and has the kernel fill in the
- thread's tid on the clone call with
- CLONE_PARENT_SETTID). If that happened, and the core
- had requested the new thread to stop, we'll have
- killed it with SIGSTOP. But since SIGSTOP is not an
- RT signal, it can only be queued once. We need to be
- careful to not resume the LWP if we wanted it to
- stop. In that case, we'll leave the SIGSTOP pending.
- It will later be reported as GDB_SIGNAL_0. */
- tp = find_thread_ptid (new_lp->ptid);
- if (tp != NULL && tp->stop_requested)
- new_lp->last_resume_kind = resume_stop;
- else
- status = 0;
- }
+ new_lp->resumed = 1;
/* If the thread_db layer is active, let it record the user
level thread id and status, and add the thread to GDB's
@@ -2136,19 +2100,23 @@ linux_handle_extended_wait (struct lwp_info *lp, int status,
}
/* Even if we're stopping the thread for some reason
- internal to this module, from the user/frontend's
- perspective, this new thread is running. */
+ internal to this module, from the perspective of infrun
+ and the user/frontend, this new thread is running until
+ it next reports a stop. */
set_running (new_lp->ptid, 1);
- if (!stopping)
- {
- set_executing (new_lp->ptid, 1);
- /* thread_db_attach_lwp -> lin_lwp_attach_lwp forced
- resume_stop. */
- new_lp->last_resume_kind = resume_continue;
- }
+ set_executing (new_lp->ptid, 1);
- if (status != 0)
+ if (WSTOPSIG (status) != SIGSTOP)
{
+ /* This can happen if someone starts sending signals to
+ the new thread before it gets a chance to run, which
+ have a lower number than SIGSTOP (e.g. SIGUSR1).
+ This is an unlikely case, and harder to handle for
+ fork / vfork than for clone, so we do not try - but
+ we handle it for clone events here. */
+
+ new_lp->signalled = 1;
+
/* We created NEW_LP so it cannot yet contain STATUS. */
gdb_assert (new_lp->status == 0);
@@ -2162,7 +2130,6 @@ linux_handle_extended_wait (struct lwp_info *lp, int status,
new_lp->status = status;
}
- new_lp->resumed = !stopping;
return 1;
}
@@ -3673,9 +3640,31 @@ resume_stopped_resumed_lwps (struct lwp_info *lp, void *data)
{
ptid_t *wait_ptid_p = data;
- if (lp->stopped
- && lp->resumed
- && !lwp_status_pending_p (lp))
+ if (!lp->stopped)
+ {
+ if (debug_linux_nat)
+ fprintf_unfiltered (gdb_stdlog,
+ "RSRL: NOT resuming stopped-resumed LWP %s, "
+ "not stopped\n",
+ target_pid_to_str (lp->ptid));
+ }
+ else if (!lp->resumed)
+ {
+ if (debug_linux_nat)
+ fprintf_unfiltered (gdb_stdlog,
+ "RSRL: NOT resuming stopped-resumed LWP %s, "
+ "not resumed\n",
+ target_pid_to_str (lp->ptid));
+ }
+ else if (lwp_status_pending_p (lp))
+ {
+ if (debug_linux_nat)
+ fprintf_unfiltered (gdb_stdlog,
+ "RSRL: NOT resuming stopped-resumed LWP %s, "
+ "has pending status\n",
+ target_pid_to_str (lp->ptid));
+ }
+ else
{
struct regcache *regcache = get_thread_regcache (lp->ptid);
struct gdbarch *gdbarch = get_regcache_arch (regcache);