[review,gdb/threads] Fix hang in stop_all_threads after killing inferior

Message ID gerrit.1580307037000.Ibe1f29251fe2ff1c1991f041babbe18373c113b1@gnutoolchain-gerrit.osci.io
State New, archived
Headers

Commit Message

Simon Marchi (Code Review) Jan. 29, 2020, 2:10 p.m. UTC
  Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................

[gdb/threads] Fix hang in stop_all_threads after killing inferior

Consider a two-threaded testcase a.out, sleeping in both its threads:
...
$ gdb -ex r --args a.out
Reading symbols from a.out...
Starting program: /data/gdb_versions/devel/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff77fe700 (LWP 31268)]
...

Typing ^C causing stop_all_threads to be executed, and if an external SIGKILL
(such as caused by killall -9 a.out) arrives at the start of stop_all_threads,
gdb hangs in stop_all_threads after giving this warning:
...
warning: unable to open /proc file '/proc/24938/status'
...

Using "set debug infrun 1" we can see in more detail where we hang:
...
infrun: stop_all_threads
infrun: stop_all_threads, pass=0, iterations=0
infrun:   Thread 0x7ffff7fa6740 (LWP 10264) not executing
infrun:   Thread 0x7ffff77fe700 (LWP 10268) executing, need stop
infrun: target_wait (-1.0.0, status) =
infrun:   10264.10268.0 [Thread 0x7ffff77fe700 (LWP 10268)],
infrun:   status->kind = signalled, signal = GDB_SIGNAL_KILL
infrun: stop_all_threads status->kind = signalled, signal = GDB_SIGNAL_KILL \
  Thread 0x7ffff77fe700 (LWP 10268)
infrun:   Thread 0x7ffff7fa6740 (LWP 10264) not executing
infrun:   Thread 0x7ffff77fe700 (LWP 10268) executing, already stopping
warning: unable to open /proc file '/proc/10264/status'
infrun: target_wait (-1.0.0, status) =
infrun:   -1.0.0 [process -1],
infrun:   status->kind = no-resumed
infrun: infrun_async(0)
infrun: stop_all_threads status->kind = no-resumed process -1
infrun:   Thread 0x7ffff7fa6740 (LWP 10264) not executing
infrun:   Thread 0x7ffff77fe700 (LWP 10268) executing, already stopping
infrun: stop_all_threads status->kind = no-resumed process -1
infrun:   Thread 0x7ffff7fa6740 (LWP 10264) not executing
infrun:   Thread 0x7ffff77fe700 (LWP 10268) executing, already stopping
infrun: stop_all_threads status->kind = no-resumed process -1
infrun:   Thread 0x7ffff7fa6740 (LWP 10264) not executing
infrun:   Thread 0x7ffff77fe700 (LWP 10268) executing, already stopping
<repeat>
......

So, we're hanging in the 'while (1)' loop in stop_all_threads as follows:
- thread t is tested, and both t->executing and t->stop_requested are found
  to be 1
- consequently need_wait is set 1
- consequently wait_one is executed
- wait_one returns a TARGET_WAITKIND_NO_RESUMED event, which is handled by
  continuing at the start of the loop

The loop actually starts with update_thread_list (), but that doesn't seem
to change the state of the threads.

Fix the hang by detecting the first sign of trouble: the
TARGET_WAITKIND_SIGNALLED event with signal GDB_SIGNAL_KILL, and breaking out
of the loop.

Build and reg-tested on x86_64-linux.

gdb/ChangeLog:

2020-01-29  Tom de Vries  <tdevries@suse.de>

	PR threads/25478
	* infrun.c (stop_all_threads): Return when detecting event
	TARGET_WAITKIND_SIGNALLED with signal GDB_SIGNAL_KILL.

Change-Id: Ibe1f29251fe2ff1c1991f041babbe18373c113b1
---
M gdb/infrun.c
1 file changed, 5 insertions(+), 1 deletion(-)
  

Comments

Simon Marchi (Code Review) Jan. 29, 2020, 2:39 p.m. UTC | #1
Mihails Strasuns has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 1:

This seems very similar to https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/133 - I don't think there should be any logic specific to SIGKILL here. It seems like a general race condition for early termination while `stop_all_threads` is being executed.
  
Simon Marchi (Code Review) Jan. 29, 2020, 4:13 p.m. UTC | #2
Tankut Baris Aktemur has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 1:

> FWIW, I looked over the review comments there, and noticed the suggestion to "leave the TARGET_WAITKIND_EXITED/TARGET_WAITKIND_SIGNALLED event pending"

For the continuation of the discussion, please see
https://sourceware.org/ml/gdb-patches/2020-01/msg00212.html
  
Simon Marchi (Code Review) Jan. 30, 2020, 2:52 p.m. UTC | #3
Tom de Vries has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 1:

> FWIW, I looked over the review comments there, and noticed the suggestion to "leave the TARGET_WAITKIND_EXITED/TARGET_WAITKIND_SIGNALLED event pending", which I tried using this additional patch:
> ...
> diff --git a/gdb/infrun.c b/gdb/infrun.c
> index e34ddc83b45..c1035c25d7f 100644
> --- a/gdb/infrun.c
> +++ b/gdb/infrun.c
> @@ -4774,7 +4774,11 @@ stop_all_threads (void)
>  
>           if (event.ws.kind == TARGET_WAITKIND_SIGNALLED
>               && event.ws.value.sig == GDB_SIGNAL_KILL)
> -           goto done;
> +           {
> +             thread_info *t = find_thread_ptid (event.target, event.ptid);
> +             save_waitstatus (t, &event.ws);
> +             goto done;
> +           }
>           else if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
>               || event.ws.kind == TARGET_WAITKIND_THREAD_EXITED
>               || event.ws.kind == TARGET_WAITKIND_EXITED
> ...
> 
> But I didn't notice any difference in behaviour.

I've realized from another use of save_waitstatus that I have to set t->resumed in order to have the saved wait status noticed.

With this additional patch, I did manage to observe a difference in behaviour:
...
@@ -4777,6 +4777,7 @@ stop_all_threads (void)
            {
              thread_info *t = find_thread_ptid (event.target, event.ptid);
              save_waitstatus (t, &event.ws);
+             t->resumed = 1;
              goto done;
            }
          else if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
...

The difference is in what happens when I continue after the ^C is handled:
- without the two patches gdb hangs, until I press enter, after which I get the
  prompt
- with the two patches, gdb just presents the prompt
  
Simon Marchi (Code Review) Jan. 30, 2020, 4:41 p.m. UTC | #4
Tankut Baris Aktemur has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 2:

(1 comment)

| --- gdb/infrun.c
| +++ gdb/infrun.c
| @@ -4767,11 +4767,19 @@ stop_all_threads (void)
|  	  if (debug_infrun)
|  	    {
|  	      fprintf_unfiltered (gdb_stdlog,
|  				  "infrun: stop_all_threads %s %s\n",
|  				  target_waitstatus_to_string (&event.ws).c_str (),
|  				  target_pid_to_str (event.ptid).c_str ());
|  	    }
|  
| -	  if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
| +	  if (event.ws.kind == TARGET_WAITKIND_SIGNALLED
| +	      && event.ws.value.sig == GDB_SIGNAL_KILL)

PS2, Line 4776:

Why only this case?  Wouldn't the same problem be seen if, say, a
different signal or an exited status was received?

| +	    {
| +	      thread_info *t = find_thread_ptid (event.target, event.ptid);
| +	      save_waitstatus (t, &event.ws);
| +	      t->resumed = 1;
| +	      goto done;
| +	    }
| +	  else if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
|  	      || event.ws.kind == TARGET_WAITKIND_THREAD_EXITED
|  	      || event.ws.kind == TARGET_WAITKIND_EXITED
  
Simon Marchi (Code Review) Jan. 30, 2020, 5:35 p.m. UTC | #5
Tom de Vries has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 2:

(1 comment)

| --- gdb/infrun.c
| +++ gdb/infrun.c
| @@ -4767,11 +4767,19 @@ stop_all_threads (void)
|  	  if (debug_infrun)
|  	    {
|  	      fprintf_unfiltered (gdb_stdlog,
|  				  "infrun: stop_all_threads %s %s\n",
|  				  target_waitstatus_to_string (&event.ws).c_str (),
|  				  target_pid_to_str (event.ptid).c_str ());
|  	    }
|  
| -	  if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
| +	  if (event.ws.kind == TARGET_WAITKIND_SIGNALLED
| +	      && event.ws.value.sig == GDB_SIGNAL_KILL)

PS2, Line 4776:

For now I'm just trying to solve the PR in the absolute minimal way.

| +	    {
| +	      thread_info *t = find_thread_ptid (event.target, event.ptid);
| +	      save_waitstatus (t, &event.ws);
| +	      t->resumed = 1;
| +	      goto done;
| +	    }
| +	  else if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
|  	      || event.ws.kind == TARGET_WAITKIND_THREAD_EXITED
|  	      || event.ws.kind == TARGET_WAITKIND_EXITED
  
Simon Marchi (Code Review) Jan. 30, 2020, 5:42 p.m. UTC | #6
Tom de Vries has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 2:

(1 comment)

| --- gdb/infrun.c
| +++ gdb/infrun.c
| @@ -4773,10 +4773,18 @@ stop_all_threads (void)
|  	    }
|  
| -	  if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
| +	  if (event.ws.kind == TARGET_WAITKIND_SIGNALLED
| +	      && event.ws.value.sig == GDB_SIGNAL_KILL)
| +	    {
| +	      thread_info *t = find_thread_ptid (event.target, event.ptid);
| +	      save_waitstatus (t, &event.ws);
| +	      t->resumed = 1;
| +	      goto done;

PS2, Line 4781:

I've realized that the 'goto done' is not a good idea, since
all_non_exited_threads can iterate over more than one inferior.

I'll upload a new version that removes the 'goto done.

| +	    }
| +	  else if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
|  	      || event.ws.kind == TARGET_WAITKIND_THREAD_EXITED
|  	      || event.ws.kind == TARGET_WAITKIND_EXITED
|  	      || event.ws.kind == TARGET_WAITKIND_SIGNALLED)
|  	    {
|  	      /* All resumed threads exited
|  		 or one thread/process exited/signalled.  */
|  	    }
  
Simon Marchi (Code Review) Feb. 3, 2020, 2:30 p.m. UTC | #7
Tankut Baris Aktemur has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 3:

I've tried this patch in two scenarios.  One is when multi-threading is
enabled (executable is multi.out), and the other is when the program is
single-threaded (executable is named single1.out and single2.out -- exactly
the same copies just to distinguish the processes).  The test program is
sleep.c from the PR.

I start the multi-threaded case as follows:

  $ gdb -ex "set prompt (master-gdb) " -ex "b stop_all_threads" -ex r -ex \
  "shell killall -9 multi.out" -ex c --args gdb -ex "set debug infrun 1" \
  -ex r  --args multi.out

When gdb stops, I try "info inferiors" and "info threads" commands:

  Thread 1 received signal SIGINT, Interrupt.
  0x00007ffff78a99d0 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
  infrun: infrun_async(0)
  (gdb) i inferiors
    Num  Description       Connection           Executable
  * 1    process 12979     1 (native)           /path/to/multi.out
  (gdb) i threads
    Id   Target Id                         Frame
  * 1    Thread 0x7ffff7fe3740 (LWP 12979) 0x00007ffff78a99d0 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
    2    Thread 0x7ffff77c4700 (LWP 12996) Couldn't get registers: No such process.
  (gdb) thread 2
  [Switching to thread 2 (Thread 0x7ffff77c4700 (LWP 12996))]
  Couldn't get registers: No such process.
  (gdb) i threads
  Couldn't get registers: No such process.
  (gdb) thread 1
  Couldn't get registers: No such process.

The second scenario is similar, except that instead of a multi-threaded
single process, we have two single-threaded processes.  Started as

  $ gdb -ex "set prompt (master-gdb) " -ex "b stop_all_threads" -ex r -ex c \
  -ex c -ex "shell killall -9 single1.out" -ex c --args gdb -ex "set debug infrun 1" \
  -ex start -ex "add-inferior -exec ./single2.out" -ex "inferior 2" -ex "start" \
  -ex "set schedule-multiple on" -ex c --args ./single1.out

The behavior is this:

  (gdb) info inferiors
    Num  Description       Connection           Executable
    1    process 2266      1 (native)           /path/to/single1.out
  * 2    process 2282      1 (native)           /path/to/single2.out
  (gdb) info threads
    Id   Target Id                  Frame
    1.1  process 2266               Couldn't get registers: No such process.
  (gdb) inferior 1
  [Switching to inferior 1 [process 2266] (/path/to/single1.out)]
  [Switching to thread 1.1 (process 2266)]
  Couldn't get registers: No such process.
  (gdb) info inferiors
    Num  Description       Connection           Executable
  Couldn't get registers: No such process.
  (gdb) info threads
  Couldn't get registers: No such process.
  (gdb)

So, I think at this point it boils down to the discussion covered in
https://sourceware.org/ml/gdb-patches/2020-01/msg00212.html
and
https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/133
GDB is not able to handle disappeared processes gracefully.  The 
maintainers may consider the fix for the hanging behavior (i.e.  the
infinite loop) OK and the weird post-behavior above as a separate problem
to be addressed later, but my understanding from Pedro's comment was that
the deeper problem of GDB not handling already-gone processes well shall be
addressed.

Thanks
-Baris
  
Simon Marchi (Code Review) Feb. 3, 2020, 3:20 p.m. UTC | #8
Tom de Vries has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 3:

> Patch Set 3:
> GDB is not able to handle disappeared processes gracefully.

Ack.

> The 
> maintainers may consider the fix for the hanging behavior (i.e.  the
> infinite loop) OK and the weird post-behavior above as a separate problem

That's my understanding from Pedro's comment: the weird post-behaviour is a pre-existing issue, independent of the hang that this patch fixes.
  
Simon Marchi (Code Review) Feb. 5, 2020, 7:53 p.m. UTC | #9
Tom de Vries has posted comments on this change.

Change URL: https://gnutoolchain-gerrit.osci.io/r/c/binutils-gdb/+/759
......................................................................


Patch Set 4:

Added test-case.
  

Patch

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 22de42c..e34ddc8 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -4772,7 +4772,10 @@ 
 				  target_pid_to_str (event.ptid).c_str ());
 	    }
 
-	  if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
+	  if (event.ws.kind == TARGET_WAITKIND_SIGNALLED
+	      && event.ws.value.sig == GDB_SIGNAL_KILL)
+	    goto done;
+	  else if (event.ws.kind == TARGET_WAITKIND_NO_RESUMED
 	      || event.ws.kind == TARGET_WAITKIND_THREAD_EXITED
 	      || event.ws.kind == TARGET_WAITKIND_EXITED
 	      || event.ws.kind == TARGET_WAITKIND_SIGNALLED)
@@ -4872,6 +4875,7 @@ 
 	}
     }
 
+ done:
   if (debug_infrun)
     fprintf_unfiltered (gdb_stdlog, "infrun: stop_all_threads done\n");
 }