[v2] GDBserver crashes when killing a multi-thread process

Message ID 55A3E23C.8020101@gmail.com
State New, archived
Headers

Commit Message

Yao Qi July 13, 2015, 4:07 p.m. UTC
  On 10/07/14 16:16, Pedro Alves wrote:
> +static void
> +kill_wait_lwp (struct lwp_info *lwp)
> +{
> +  struct thread_info *thr = get_lwp_thread (lwp);
> +  int pid = ptid_get_pid (ptid_of (thr));
> +  int lwpid = ptid_get_lwp (ptid_of (thr));
> +  int wstat;
> +  int res;
> +
> +  if (debug_threads)
> +    debug_printf ("kwl: killing lwp %d, for pid: %d\n", lwpid, pid);
> +
> +  do
> +    {
> +      linux_kill_one_lwp (lwp);
> +
> +      /* Make sure it died.  Notes:
> +
> +	 - The loop is most likely unnecessary.
> +
> +         - We don't use linux_wait_for_event as that could delete lwps
> +           while we're iterating over them.  We're not interested in
> +           any pending status at this point, only in making sure all
> +           wait status on the kernel side are collected until the
> +           process is reaped.
> +
> +	 - We don't use __WALL here as the __WALL emulation relies on
> +	   SIGCHLD, and killing a stopped process doesn't generate
> +	   one, nor an exit status.
> +      */
> +      res = my_waitpid (lwpid, &wstat, 0);
> +      if (res == -1 && errno == ECHILD)
> +	res = my_waitpid (lwpid, &wstat, __WCLONE);
> +    } while (res > 0 && WIFSTOPPED (wstat));
> +
> +  gdb_assert (res > 0);
> +}

Hi Pedro,
do you still remember why did you add this assert?  It wasn't
mentioned in the mail 
https://sourceware.org/ml/gdb-patches/2014-07/msg00206.html

I am looking at a GDBserver internal error on x86_64 when I run
gdb.threads/thread-unwindonsignal.exp with GDBserver,

continue^M
Continuing.^M
warning: Remote failure reply: E.No unwaited-for children left.^M
PC register is not available^M
(gdb) FAIL: gdb.threads/thread-unwindonsignal.exp: continue until exit
Remote debugging from host 127.0.0.1^M
ptrace(regsets_fetch_inferior_registers) PID=30700: No such process^M
ptrace(regsets_fetch_inferior_registers) PID=30700: No such process^M
ptrace(regsets_fetch_inferior_registers) PID=30700: No such process^M
ptrace(regsets_fetch_inferior_registers) PID=30700: No such process^M
monitor exit^M
Killing process(es): 30694^M
(gdb) /home/yao/SourceCode/gnu/gdb/git/gdb/gdbserver/linux-low.c:1106: A 
problem internal to GDBserver has been detected.^M
kill_wait_lwp: Assertion `res > 0' failed.

After your patch https://sourceware.org/ml/gdb-patches/2015-03/msg00597.html
GDBserver starts to swallows errors if the LWP is gone.  Then, when
GDBservers kills non-exist LWP, the assert will be triggered.

Why don't we implement kill_wait_lwp like its counterpart in GDB
linux-nat.c:kill_wait_callback? we can loop and assert like this
patch below, (note that this patch fixes the internal error, and
the FAIL is still there).
  

Comments

Pedro Alves July 13, 2015, 5:32 p.m. UTC | #1
On 07/13/2015 05:07 PM, Yao Qi wrote:

> Hi Pedro,
> do you still remember why did you add this assert?  It wasn't
> mentioned in the mail 
> https://sourceware.org/ml/gdb-patches/2014-07/msg00206.html
> 

Simply because getting here was supposed to indicate
something went wrong elsewhere, but at the time I didn't consider
that the child could die while ptrace-stopped.

> I am looking at a GDBserver internal error on x86_64 when I run
> gdb.threads/thread-unwindonsignal.exp with GDBserver,
> 
> continue^M
> Continuing.^M
> warning: Remote failure reply: E.No unwaited-for children left.^M
> PC register is not available^M
> (gdb) FAIL: gdb.threads/thread-unwindonsignal.exp: continue until exit
> Remote debugging from host 127.0.0.1^M
> ptrace(regsets_fetch_inferior_registers) PID=30700: No such process^M
> ptrace(regsets_fetch_inferior_registers) PID=30700: No such process^M
> ptrace(regsets_fetch_inferior_registers) PID=30700: No such process^M
> ptrace(regsets_fetch_inferior_registers) PID=30700: No such process^M
> monitor exit^M
> Killing process(es): 30694^M
> (gdb) /home/yao/SourceCode/gnu/gdb/git/gdb/gdbserver/linux-low.c:1106: A 
> problem internal to GDBserver has been detected.^M
> kill_wait_lwp: Assertion `res > 0' failed.
> 
> After your patch https://sourceware.org/ml/gdb-patches/2015-03/msg00597.html

> GDBserver starts to swallows errors if the LWP is gone.  Then, when
> GDBservers kills non-exist LWP, the assert will be triggered.
> 

Looks like I forgot to push the rest of that series:

 https://sourceware.org/ml/gdb-patches/2015-03/msg00182.html

What do you think of that one?

> Why don't we implement kill_wait_lwp like its counterpart in GDB
> linux-nat.c:kill_wait_callback? we can loop and assert like this
> patch below, (note that this patch fixes the internal error, and
> the FAIL is still there).
> 

Seems to me it's not 100% correct to waitpid the pid one more time
after we've already reaped it, because there's a minuscule chance
another process that we're debugging could clone a new lwp that reuses
the PID of the one we've just killed/reaped, and then another iteration
could collect the initial SIGSTOP of the wrong LWP and we'd kill it:

-> kill (pid1, SIGKILL);
<- waitpid (pid1) returns pid1/WSIGNALLED
-> on iteration1: new pid1 clone lwp is spawned
-> ret==pid1, continue iterating
-> kill (pid1, SIGKILL); // killing wrong process
<- waitpid (pid1) returns either SIGSTOP or WSIGNALLED
...

Thanks,
Pedro Alves
  
Yao Qi July 14, 2015, 8 a.m. UTC | #2
Pedro Alves <palves@redhat.com> writes:

> Looks like I forgot to push the rest of that series:
>
>  https://sourceware.org/ml/gdb-patches/2015-03/msg00182.html
>
> What do you think of that one?

Yes, it looks good to me.  We also need it on 7.10 branch.

>
>> Why don't we implement kill_wait_lwp like its counterpart in GDB
>> linux-nat.c:kill_wait_callback? we can loop and assert like this
>> patch below, (note that this patch fixes the internal error, and
>> the FAIL is still there).
>> 
>
> Seems to me it's not 100% correct to waitpid the pid one more time
> after we've already reaped it, because there's a minuscule chance
> another process that we're debugging could clone a new lwp that reuses
> the PID of the one we've just killed/reaped, and then another iteration
> could collect the initial SIGSTOP of the wrong LWP and we'd kill it:
>
> -> kill (pid1, SIGKILL);
> <- waitpid (pid1) returns pid1/WSIGNALLED
> -> on iteration1: new pid1 clone lwp is spawned
> -> ret==pid1, continue iterating
> -> kill (pid1, SIGKILL); // killing wrong process
> <- waitpid (pid1) returns either SIGSTOP or WSIGNALLED
> ...

Yes, that is possible.
  
Pedro Alves July 14, 2015, 9:13 a.m. UTC | #3
On 07/14/2015 09:00 AM, Yao Qi wrote:
> Pedro Alves <palves@redhat.com> writes:
> 
>> Looks like I forgot to push the rest of that series:
>>
>>  https://sourceware.org/ml/gdb-patches/2015-03/msg00182.html
>>
>> What do you think of that one?
> 
> Yes, it looks good to me.  We also need it on 7.10 branch.

OK, I'll push it.

Thanks,
Pedro Alves
  

Patch

diff --git a/gdb/gdbserver/linux-low.c b/gdb/gdbserver/linux-low.c
index 7bb9f7f..07d051a 100644
--- a/gdb/gdbserver/linux-low.c
+++ b/gdb/gdbserver/linux-low.c
@@ -1101,9 +1101,9 @@  kill_wait_lwp (struct lwp_info *lwp)
        res = my_waitpid (lwpid, &wstat, 0);
        if (res == -1 && errno == ECHILD)
         res = my_waitpid (lwpid, &wstat, __WCLONE);
-    } while (res > 0 && WIFSTOPPED (wstat));
+    } while (res == lwpid);

-  gdb_assert (res > 0);
+  gdb_assert (res == -1 && errno == ECHILD);
  }

  /* Callback for `find_inferior'.  Kills an lwp of a given process,