[11/16,v3] Extended-remote Linux exit events
Commit Message
On 10/31/2014 11:28 PM, Don Breazeal wrote:
>
> Now for the long answer: in my testing I ran into a race condition in
> check_zombie_leaders, which detects when a thread group leader has exited
> and other threads still exist. On the Linux kernel, ptrace/waitpid don't
> allow reaping the leader thread until all other threads in the group are
> reaped. When the leader exits, it goes zombie, but waitpid will not return
> an exit status until the other threads are gone. When a non-leader thread
> calls exec, all other non-leader threads are destroyed, the leader becomes
> a zombie, and once the "other" threads have been reaped, the execing thread
> takes over the leader's pid (tgid) and appears to vanish. In order to
> handle this situation in the current implementation, check_zombie_leaders
> polls the process state in /proc and deletes thread group leaders that are
> in a zombie state. The replacement is added to the lwp list when the exec
> event is reported.
>
> See https://sourceware.org/ml/gdb-patches/2011-10/msg00704.html for a more
> detailed explanation of how this works.
>
> Here is the relevant part of check_zombie_leaders:
>
> if (leader_lp != NULL
> /* Check if there are other threads in the group, as we may
> have raced with the inferior simply exiting. */
> && !last_thread_of_process_p (leader_pid)
> && linux_proc_pid_is_zombie (leader_pid))
> {
> /* ...large informative comment block... */
> delete_lwp (leader_lp);
>
> The race occurred when there were two threads in the program, and the
> non-leader thread called exec. In this case the leader thread passed
> through a very brief zombie state before being replaced by the exec'ing
> thread as the thread group leader. This state transition was asynchronous,
> with no dependency on anything gdbserver did. Because there were no other
> threads, there were no thread exit events, and thus there was no
> synchronization with the leader passing through the zombie state and the
> exec completing. If there had been more threads, the leader would remain
> in the zombie state until they were waited for. In the two-thread case,
> sometimes the leader exit was detected and sometimes it wasn't. (Recall
> that check_zombie_leaders is polling the state, via
> linux_proc_pid_is_zombie. The race is between the leader thread passing
> through the zombie state and check_zombie_leaders testing for zombie
> state.) If leader exit wasn't detected, gdbserver would end up with a
> dangling lwp entry that didn't correspond to any real lwp, and would hang
> waiting for that lwp to stop. Using PTRACE_EVENT_EXIT guarantees that the
> leader exit will be detected.
>
> Note that check_zombie_leaders works just fine for the scenarios where the
> leader thread exits and the other threads continue to run, with no exec
> calls. It is required for systems that don't support the extended ptrace
> events.
>
> The sequence of events resulting in the race condition was this:
>
> 1) In the program, a CLONE event for a new thread occurs.
>
> 2) In the program, both threads are resumed once gdbserver has
> completed the new thread processing.
>
> 3) In gdbserver, the function linux_wait_for_event_filtered loops until
> waitpid returns "no more events" for the SIGCHLD generated by the
> CLONE event. Then linux_wait_for_event_filtered calls
> check_zombie_leaders.
>
> 4) In the program, the new thread is doing the exec. During the exec
> the leader thread will pass through a transitory zombie state. If
> there were more than two threads, the leader thread would remain a
> zombie until all the non-leader, non-exec'ing threads were reaped by
> gdbserver. Since there are no such threads to reap, the leader just
> becomes a zombie and is replaced by the exec'ing thread on-the-fly.
> (Note that it appears that the leader thread is a zombie just for a
> very brief instant.)
>
> 5) In gdbserver, check_zombie_leaders checks whether an lwp entry
> corresponds to a zombie leader thread, and if so, deletes it. Here
> is the race: in (4) above, the leader may or may not be in the
> transitory zombie state. In the case where a zombie isn't detected,
> delete_lwp is not called.
>
> 6) In gdbserver, an EXEC event is detected and processed. When it gets
> ready to report the event to GDB, it calls stop_all_lwps, which sends
> a SIGSTOP to each lwp in the list and the waits until all the lwps in
> the list have reported a stop event. If the zombie leader wasn't
> detected and processed in step (5), gdbserver blocks forever in
> linux_wait_for_event_filtered, waiting for the undeleted lwp to be
> stopped, which will never happen.
This is easy to see with thread-execl.exp, if we revert the PTRACE_EVENT_EXIT
parts:
linux_low_filter_event: pc is 0x3b37209237
pc is 0x3b37209237
stop pc is 0x3b37209237
HEW: Got exec event from LWP 2721
Hit a non-gdbserver trap event.
>>>> entering stop_all_lwps
stop_all_lwps (stop, except=none)
Sending sigstop to lwp 23192
wait_for_sigstop: pulling events
my_waitpid (-1, 0x40000001)
my_waitpid (-1, 0x80000001): status(0), -1
LWFE: waitpid(-1, ...) returned -1, No child processes
leader_pid=2721, leader_lp!=NULL=1, num_lwps=2, zombie=0
sigsuspend'ing
*** hang ***
But I'm having trouble convincing myself that we do need
the PTRACE_EVENT_EXIT handling.
This seems to work for me just as well. The fix is in
making check_zombie_leaders ignore stopped threads.
@@ -1629,7 +1597,7 @@ check_zombie_leaders (void)
leader_pid, leader_lp!= NULL, num_lwps (leader_pid),
linux_proc_pid_is_zombie (leader_pid));
- if (leader_lp != NULL
+ if (leader_lp != NULL && !leader_lp->stopped
But I'm in a hurry to leave now. Maybe I'm missing something else.
From ab07ee8afc3c2358cd6b23474655fde087e2b36e Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Thu, 13 Nov 2014 17:53:43 +0000
Subject: [PATCH] revert PTRACE_EVENT_EXIT bits
---
gdb/gdbserver/linux-low.c | 56 +++--------------------------------------------
1 file changed, 3 insertions(+), 53 deletions(-)
@@ -538,38 +538,6 @@ handle_extended_wait (struct lwp_info *event_child, int *wstatp)
event_child->waitstatus.kind = TARGET_WAITKIND_VFORK_DONE;
return 0;
}
- else if (event == PTRACE_EVENT_EXIT)
- {
- unsigned long exit_status;
- unsigned long lwpid = lwpid_of (event_thr);
- int ret;
-
- if (debug_threads)
- debug_printf ("HEW: Got exit event from LWP %ld\n", lwpid );
-
- ptrace (PTRACE_GETEVENTMSG, lwpid, (PTRACE_TYPE_ARG3) 0, &exit_status);
-
- if (num_lwps (pid_of (event_thr)) > 1)
- {
- /* If there is at least one more LWP, then the program still
- exists and the exit should not be reported to GDB. */
- delete_lwp (event_child);
- ret = 1;
- }
- else
- {
- /* Set the exit status to the actual exit status, so normal
- WIFEXITED/WIFSIGNALLED processing and reporting for the
- last lwp in the process can proceed from here. */
- *wstatp = exit_status;
- ret = 0;
- }
-
- /* Resume the thread so that it actually exits. Subsequent exit
- events for LWPs that were deleted above will be ignored. */
- ptrace (PTRACE_CONT, lwpid, (PTRACE_TYPE_ARG3) 0, (PTRACE_TYPE_ARG4) 0);
- return ret;
- }
else if (event == PTRACE_EVENT_EXEC)
{
if (debug_threads)
@@ -1629,7 +1597,7 @@ check_zombie_leaders (void)
leader_pid, leader_lp!= NULL, num_lwps (leader_pid),
linux_proc_pid_is_zombie (leader_pid));
- if (leader_lp != NULL
+ if (leader_lp != NULL && !leader_lp->stopped
/* Check if there are other threads in the group, as we may
have raced with the inferior simply exiting. */
&& !last_thread_of_process_p (leader_pid)
@@ -2107,17 +2075,6 @@ linux_low_filter_event (ptid_t filter_ptid, int lwpid, int *wstatp)
child = add_lwp (child_ptid);
child->stopped = 1;
current_thread = child->thread;
-
- if (non_stop && stopping_threads == NOT_STOPPING_THREADS)
- {
- /* Make sure we delete the lwp entry for the exec'ing thread,
- which will have vanished. We do this by sending a signal
- to all the other threads in the lwp list, deleting any
- that are not found. Note that in all-stop mode this will
- happen before reporting the event. */
- stop_all_lwps (0, child);
- unstop_all_lwps (0, child);
- }
}
/* If we didn't find a process, one of two things presumably happened:
@@ -2166,8 +2123,7 @@ linux_low_filter_event (ptid_t filter_ptid, int lwpid, int *wstatp)
architecture being defined already (so that CHILD has a valid
regcache), and on LAST_STATUS being set (to check for SIGTRAP or
not). */
- if (WIFSTOPPED (wstat)
- && (linux_ptrace_get_extended_event (wstat) != PTRACE_EVENT_EXIT))
+ if (WIFSTOPPED (wstat))
{
if (debug_threads
&& the_low_target.get_pc != NULL)
@@ -3486,8 +3442,6 @@ send_sigstop (struct lwp_info *lwp)
calling exec. See comments in linux_low_filter_event regarding
PTRACE_EVENT_EXEC. */
delete_lwp (lwp);
- set_desired_thread (0);
-
if (debug_threads)
debug_printf ("send_sigstop: lwp %d has vanished\n", pid);
}
@@ -5565,10 +5519,7 @@ linux_supports_vfork_events (void)
static int
linux_supports_exec_events (void)
{
- /* Check for PTRACE_O_TRACEEXIT, since our implementation of follow
- exec depends on this option, which was implemented in a later
- kernel version than PTRACE_O_TRACEFORK et al. */
- return linux_supports_traceexit ();
+ return linux_supports_tracefork ();
}
static int
@@ -6588,7 +6539,6 @@ initialize_low (void)
linux_ptrace_set_requested_options (PTRACE_O_TRACEFORK
| PTRACE_O_TRACEVFORK
| PTRACE_O_TRACEVFORKDONE
- | PTRACE_O_TRACEEXIT
| PTRACE_O_TRACEEXEC);
linux_ptrace_check_options ();
}