PR threads/20743: Don't attempt to suspend or resume exited threads.

Message ID 20161223212842.42715-1-jhb@FreeBSD.org
State New, archived
Headers

Commit Message

John Baldwin Dec. 23, 2016, 9:28 p.m. UTC
  When resuming a native FreeBSD process, ignore exited threads when
suspending/resuming individual threads prior to continuing the process.

gdb/ChangeLog:

	PR threads/20743
	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
	(resume_all_threads_cb): Likewise.
	(fbsd_resume): Assert resuming thread has not exited.
---
 gdb/ChangeLog  | 7 +++++++
 gdb/fbsd-nat.c | 7 +++++++
 2 files changed, 14 insertions(+)
  

Comments

Luis Machado Dec. 23, 2016, 9:43 p.m. UTC | #1
On 12/23/2016 03:28 PM, John Baldwin wrote:
> When resuming a native FreeBSD process, ignore exited threads when
> suspending/resuming individual threads prior to continuing the process.
>
> gdb/ChangeLog:
>
> 	PR threads/20743
> 	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
> 	(resume_all_threads_cb): Likewise.
> 	(fbsd_resume): Assert resuming thread has not exited.
> ---
>  gdb/ChangeLog  | 7 +++++++
>  gdb/fbsd-nat.c | 7 +++++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/gdb/ChangeLog b/gdb/ChangeLog
> index db6e913..4fb3732 100644
> --- a/gdb/ChangeLog
> +++ b/gdb/ChangeLog
> @@ -1,3 +1,10 @@
> +2016-12-23  John Baldwin  <jhb@FreeBSD.org>
> +
> +	PR threads/20743
> +	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
> +	(resume_all_threads_cb): Likewise.
> +	(fbsd_resume): Assert resuming thread has not exited.
> +
>  2016-12-22  Doug Evans  <xdje42@gmail.com>
>
>  	* infrun.c (set_step_over_info): Add comment.
> diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
> index ade62f1..7cd08c6 100644
> --- a/gdb/fbsd-nat.c
> +++ b/gdb/fbsd-nat.c
> @@ -662,6 +662,9 @@ resume_one_thread_cb (struct thread_info *tp, void *data)
>    if (ptid_get_pid (tp->ptid) != ptid_get_pid (*ptid))
>      return 0;
>
> +  if (is_exited (tp->ptid))
> +    return 0;
> +		
>    if (ptid_get_lwp (tp->ptid) == ptid_get_lwp (*ptid))
>      request = PT_RESUME;
>    else
> @@ -680,6 +683,9 @@ resume_all_threads_cb (struct thread_info *tp, void *data)
>    if (!ptid_match (tp->ptid, *filter))
>      return 0;
>
> +  if (is_exited (tp->ptid))
> +    return 0;
> +		
>    if (ptrace (PT_RESUME, ptid_get_lwp (tp->ptid), NULL, 0) == -1)
>      perror_with_name (("ptrace"));
>    return 0;
> @@ -711,6 +717,7 @@ fbsd_resume (struct target_ops *ops,
>    if (ptid_lwp_p (ptid))
>      {
>        /* If ptid is a specific LWP, suspend all other LWPs in the process.  */
> +      gdb_assert (!is_exited (ptid));

If we're asserting on this (since supposedly it shouldn't happen), do we 
need to check for is_exited on the two functions above?

Also, is there a reason why we're not detecting a thread that has 
exited? Aren't all threads stopped at this point (for all-stop mode at 
least)?

>        iterate_over_threads (resume_one_thread_cb, &ptid);
>      }
>    else
>
  
Vasil Dimov Dec. 27, 2016, 4:43 p.m. UTC | #2
On Fri, Dec 23, 2016 at 15:43:19 -0600, Luis Machado wrote:
> On 12/23/2016 03:28 PM, John Baldwin wrote:
> > When resuming a native FreeBSD process, ignore exited threads when
> > suspending/resuming individual threads prior to continuing the process.
> >
> > gdb/ChangeLog:
> >
> > 	PR threads/20743
> > 	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
> > 	(resume_all_threads_cb): Likewise.
> > 	(fbsd_resume): Assert resuming thread has not exited.
> > ---
> >  gdb/ChangeLog  | 7 +++++++
> >  gdb/fbsd-nat.c | 7 +++++++
> >  2 files changed, 14 insertions(+)
> >
> > diff --git a/gdb/ChangeLog b/gdb/ChangeLog
> > index db6e913..4fb3732 100644
> > --- a/gdb/ChangeLog
> > +++ b/gdb/ChangeLog
> > @@ -1,3 +1,10 @@
> > +2016-12-23  John Baldwin  <jhb@FreeBSD.org>
> > +
> > +	PR threads/20743
> > +	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
> > +	(resume_all_threads_cb): Likewise.
> > +	(fbsd_resume): Assert resuming thread has not exited.
> > +
> >  2016-12-22  Doug Evans  <xdje42@gmail.com>
> >
> >  	* infrun.c (set_step_over_info): Add comment.
> > diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
> > index ade62f1..7cd08c6 100644
> > --- a/gdb/fbsd-nat.c
> > +++ b/gdb/fbsd-nat.c
> > @@ -662,6 +662,9 @@ resume_one_thread_cb (struct thread_info *tp, void *data)
> >    if (ptid_get_pid (tp->ptid) != ptid_get_pid (*ptid))
> >      return 0;
> >
> > +  if (is_exited (tp->ptid))
> > +    return 0;
> > +		
> >    if (ptid_get_lwp (tp->ptid) == ptid_get_lwp (*ptid))
> >      request = PT_RESUME;
> >    else
> > @@ -680,6 +683,9 @@ resume_all_threads_cb (struct thread_info *tp, void *data)
> >    if (!ptid_match (tp->ptid, *filter))
> >      return 0;
> >
> > +  if (is_exited (tp->ptid))
> > +    return 0;
> > +		
> >    if (ptrace (PT_RESUME, ptid_get_lwp (tp->ptid), NULL, 0) == -1)
> >      perror_with_name (("ptrace"));
> >    return 0;
> > @@ -711,6 +717,7 @@ fbsd_resume (struct target_ops *ops,
> >    if (ptid_lwp_p (ptid))
> >      {
> >        /* If ptid is a specific LWP, suspend all other LWPs in the process.  */
> > +      gdb_assert (!is_exited (ptid));
> 
> If we're asserting on this (since supposedly it shouldn't happen), do we 
> need to check for is_exited on the two functions above?
> 
> Also, is there a reason why we're not detecting a thread that has 
> exited? Aren't all threads stopped at this point (for all-stop mode at 
> least)?
[...]

Hello,

I just nailed this down after it has been annoying me for some time,
fixed it with a similar patch as the one submitted by John, and came
here to report it.

The reason that we are "not detecting" an exited thread (at least in the
scenario I got is), gdb/thread.c:

  --- cut ---
  static void
  delete_thread_1 (ptid_t ptid, int silent)
  {
  ...
    /* If this is the current thread, or there's code out there that
       relies on it existing (refcount > 0) we can't delete yet.  Mark
       it as exited, and notify it.  */
    if (tp->refcount > 0
        || ptid_equal (tp->ptid, inferior_ptid))
      {
  ...
         /* Will be really deleted some other time.  */
         printf_unfiltered ("========== Will be really deleted some other time %u\n", ptid);
         return;
       }
  ...
  if (tpprev)
    tpprev->next = tp->next;
  else
    thread_list = tp->next;
  --- cut ---

In my scenario tp->refcount is 0, but
"ptid_equal (tp->ptid, inferior_ptid)" is true, so the thread's entry is
not removed from the global "threads_list".

The gdb output (with "set debug fbsd-lwp" enabled):

  --- cut ---
  FLWP: adding thread for LWP 102009
  [New LWP 102009 of process 40304]
  FLWP: fbsd_resume for ptid (-1, 0, 0)
  FLWP: fbsd_resume for ptid (40304, 102009, 0)
  FLWP: fbsd_resume for ptid (-1, 0, 0)
  FLWP: fbsd_resume for ptid (40304, 102009, 0)
  FLWP: fbsd_resume for ptid (-1, 0, 0)
  FLWP: deleting thread for LWP 102009
  [LWP 102009 of process 40304 exited]
  ...
  ptrace: No such process.
  --- cut ---

Hope this helps.
  
John Baldwin Dec. 27, 2016, 9:03 p.m. UTC | #3
On Tuesday, December 27, 2016 05:43:29 PM Vasil Dimov wrote:
> On Fri, Dec 23, 2016 at 15:43:19 -0600, Luis Machado wrote:
> > On 12/23/2016 03:28 PM, John Baldwin wrote:
> > > When resuming a native FreeBSD process, ignore exited threads when
> > > suspending/resuming individual threads prior to continuing the process.
> > >
> > > gdb/ChangeLog:
> > >
> > > 	PR threads/20743
> > > 	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
> > > 	(resume_all_threads_cb): Likewise.
> > > 	(fbsd_resume): Assert resuming thread has not exited.
> > > ---
> > >  gdb/ChangeLog  | 7 +++++++
> > >  gdb/fbsd-nat.c | 7 +++++++
> > >  2 files changed, 14 insertions(+)
> > >
> > > diff --git a/gdb/ChangeLog b/gdb/ChangeLog
> > > index db6e913..4fb3732 100644
> > > --- a/gdb/ChangeLog
> > > +++ b/gdb/ChangeLog
> > > @@ -1,3 +1,10 @@
> > > +2016-12-23  John Baldwin  <jhb@FreeBSD.org>
> > > +
> > > +	PR threads/20743
> > > +	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
> > > +	(resume_all_threads_cb): Likewise.
> > > +	(fbsd_resume): Assert resuming thread has not exited.
> > > +
> > >  2016-12-22  Doug Evans  <xdje42@gmail.com>
> > >
> > >  	* infrun.c (set_step_over_info): Add comment.
> > > diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
> > > index ade62f1..7cd08c6 100644
> > > --- a/gdb/fbsd-nat.c
> > > +++ b/gdb/fbsd-nat.c
> > > @@ -662,6 +662,9 @@ resume_one_thread_cb (struct thread_info *tp, void *data)
> > >    if (ptid_get_pid (tp->ptid) != ptid_get_pid (*ptid))
> > >      return 0;
> > >
> > > +  if (is_exited (tp->ptid))
> > > +    return 0;
> > > +		
> > >    if (ptid_get_lwp (tp->ptid) == ptid_get_lwp (*ptid))
> > >      request = PT_RESUME;
> > >    else
> > > @@ -680,6 +683,9 @@ resume_all_threads_cb (struct thread_info *tp, void *data)
> > >    if (!ptid_match (tp->ptid, *filter))
> > >      return 0;
> > >
> > > +  if (is_exited (tp->ptid))
> > > +    return 0;
> > > +		
> > >    if (ptrace (PT_RESUME, ptid_get_lwp (tp->ptid), NULL, 0) == -1)
> > >      perror_with_name (("ptrace"));
> > >    return 0;
> > > @@ -711,6 +717,7 @@ fbsd_resume (struct target_ops *ops,
> > >    if (ptid_lwp_p (ptid))
> > >      {
> > >        /* If ptid is a specific LWP, suspend all other LWPs in the process.  */
> > > +      gdb_assert (!is_exited (ptid));
> > 
> > If we're asserting on this (since supposedly it shouldn't happen), do we 
> > need to check for is_exited on the two functions above?
> > 
> > Also, is there a reason why we're not detecting a thread that has 
> > exited? Aren't all threads stopped at this point (for all-stop mode at 
> > least)?
> [...]
> 
> Hello,
> 
> I just nailed this down after it has been annoying me for some time,
> fixed it with a similar patch as the one submitted by John, and came
> here to report it.
> 
> The reason that we are "not detecting" an exited thread (at least in the
> scenario I got is), gdb/thread.c:
> 
>   --- cut ---
>   static void
>   delete_thread_1 (ptid_t ptid, int silent)
>   {
>   ...
>     /* If this is the current thread, or there's code out there that
>        relies on it existing (refcount > 0) we can't delete yet.  Mark
>        it as exited, and notify it.  */
>     if (tp->refcount > 0
>         || ptid_equal (tp->ptid, inferior_ptid))
>       {
>   ...
>          /* Will be really deleted some other time.  */
>          printf_unfiltered ("========== Will be really deleted some other time %u\n", ptid);
>          return;
>        }
>   ...
>   if (tpprev)
>     tpprev->next = tp->next;
>   else
>     thread_list = tp->next;
>   --- cut ---
> 
> In my scenario tp->refcount is 0, but
> "ptid_equal (tp->ptid, inferior_ptid)" is true, so the thread's entry is
> not removed from the global "threads_list".
> 
> The gdb output (with "set debug fbsd-lwp" enabled):
> 
>   --- cut ---
>   FLWP: adding thread for LWP 102009
>   [New LWP 102009 of process 40304]
>   FLWP: fbsd_resume for ptid (-1, 0, 0)
>   FLWP: fbsd_resume for ptid (40304, 102009, 0)
>   FLWP: fbsd_resume for ptid (-1, 0, 0)
>   FLWP: fbsd_resume for ptid (40304, 102009, 0)
>   FLWP: fbsd_resume for ptid (-1, 0, 0)
>   FLWP: deleting thread for LWP 102009
>   [LWP 102009 of process 40304 exited]
>   ...
>   ptrace: No such process.
>   --- cut ---
> 
> Hope this helps.

In particular, the sequence of events is this:

- an LWP (T1) reports a "normal" event (in the test case it is hitting a
  breakpoint).  This is reported to the core and sets the current thread
  (and thus inferior_ptid) to T1.
- the same LWP (T1) then exits and a thread exit event is reported via
  ptrace() to the native target.  The native target calls delete_thread,
  but the thread is not removed, just marked EXITING since it ==
  inferior_ptid as Vasil noted.  The native target just
  continues the process explicitly via ptrace() without reporting any
  event to the core aside from the call to delete_thread().
- some other LWP (T2) reports an event (in the test case it is a
  breakpoint).
- the user continues which invokes fbsd_resume() which wants to resume
  all threads.  Here iterate_over_threads() in fbsd_resume() will
  encounters the exited thread for T1 since nothing has called
  thread_update_list() (which would invoke delete_exited_threads() from
  fbsd_update_thread_list()).  Since the thread is exited, trying to
  manipulate it via ptrace() results in an error.

I have tried changing fbsd_wait() to return a TARGET_WAITKIND_SPURIOUS
instead of explicitly continuing the process, but that doesn't help, and it
means that the ptid being returned is still T1 in that case.

I'm not sure if I should explicitly be calling delete_exited_threads() in
fbsd_resume() before calling iterate_threads()?  Alternatively, fbsd_resume()
could use ALL_NONEXITED_THREADS() instead of iterate_threads() (it isn't
clear to me which of these is preferred since both are in use).

I added the assertion for my own sanity.  I suspect gdb should never try to
invoke target_resume() with a ptid of an exited thread, but if for some
reason it did the effect on FreeBSD would be a hang since we would suspend
all the other threads and when the process was continued via PT_CONTINUE it
would have nothing to do and would never return from wait().  I'd rather have
gdb fail an assertion in that case rather than hang.
  
Vasil Dimov Dec. 28, 2016, 8:07 a.m. UTC | #4
On Tue, Dec 27, 2016 at 13:03:27 -0800, John Baldwin wrote:
[...]
> I have tried changing fbsd_wait() to return a TARGET_WAITKIND_SPURIOUS
> instead of explicitly continuing the process, but that doesn't help, and it
> means that the ptid being returned is still T1 in that case.
> 
> I'm not sure if I should explicitly be calling delete_exited_threads() in
> fbsd_resume() before calling iterate_threads()?  Alternatively, fbsd_resume()
> could use ALL_NONEXITED_THREADS() instead of iterate_threads() (it isn't
> clear to me which of these is preferred since both are in use).
> 
> I added the assertion for my own sanity.  I suspect gdb should never try to
> invoke target_resume() with a ptid of an exited thread, but if for some
> reason it did the effect on FreeBSD would be a hang since we would suspend
> all the other threads and when the process was continued via PT_CONTINUE it
> would have nothing to do and would never return from wait().  I'd rather have
> gdb fail an assertion in that case rather than hang.
[...]

Hi,

I am not sure if this is related, but since I get a hang I would rather
mention it: with the John's patch (including the assert) gdb does not
emit the "ptrace: No such process" error, but when I attempt to quit,
it hangs:

  --- cut ---
  ^C
  Thread 1 received signal SIGINT, Interrupt.
  [Switching to LWP 100746 of process 3453]
  0x0000000804e59c7a in _poll () from /lib/libc.so.7
  (gdb) q
  A debugging session is active.

  	Inferior 1 [process 3453] will be killed.

  Quit anyway? (y or n) y

  (hangs here)
  --- cut ---

It has hung here:

  --- cut ---
  (top-gdb) bt
  #0  0x0000000803526df8 in _wait4 () from /lib/libc.so.7
  During symbol reading, cannot get low and high bounds for subprogram DIE at 76931.
  During symbol reading, incomplete CFI data; unspecified registers (e.g., rax) at 0x803a53a8e.
  #1  0x0000000803a53abc in __thr_wait4 (pid=3453, pid@entry=<optimized out>, 
      status=0x7fffffffe304, status@entry=<optimized out>, options=0, 
      options@entry=<optimized out>, rusage=0x0, rusage@entry=<optimized out>)
      at /usr/src/lib/libthr/thread/thr_syscalls.c:563
  #2  0x0000000000540d85 in inf_ptrace_mourn_inferior(target_ops*) ()
  #3  0x000000000067c28f in target_mourn_inferior() ()
  #4  0x0000000000540c6d in inf_ptrace_kill(target_ops*) ()
  #5  0x000000000072769a in kill_or_detach(inferior*, void*) ()
  #6  0x000000000074bdc9 in iterate_over_inferiors(int (*)(inferior*, void*), void*) ()
  #7  0x00000000007272b2 in quit_force(char*, int) ()
  #8  0x0000000000576f74 in cmd_func(cmd_list_element*, char*, int) ()
  #9  0x000000000072698d in execute_command(char*, int) ()
  #10 0x000000000065b2d0 in command_line_handler(char*) ()
  #11 0x000000000065aaf2 in gdb_rl_callback_handler(char*) ()
  #12 0x0000000801c41d7a in rl_callback_read_char () from /usr/local/lib/libreadline.so.6
  #13 0x000000000065a880 in gdb_rl_callback_read_char_wrapper(void*) ()
  #14 0x000000000065ae80 in stdin_event_handler(int, void*) ()
  #15 0x0000000000659cf3 in gdb_wait_for_event(int) ()
  #16 0x00000000006598a3 in gdb_do_one_event() ()
  #17 0x0000000000659df0 in start_event_loop() ()
  #18 0x0000000000654c5c in captured_command_loop(void*) ()
  #19 0x0000000000650037 in catch_errors(int (*)(void*), void*, char*, return_mask) ()
  #20 0x0000000000654696 in gdb_main(captured_main_args*) ()
  #21 0x0000000000408643 in main ()
  (top-gdb)
  --- cut ---

Calling delete_exited_threads before iterate_over_threads in fbsd_resume
does not fix the hang.

inf_ptrace_mourn_inferior calls waitpid and it hangs on the pid of the
program being debugged, which is in TX state:
T       Marks a stopped process.
X       The process is being traced or debugged.
  
John Baldwin Dec. 28, 2016, 5:37 p.m. UTC | #5
On Wednesday, December 28, 2016 09:07:07 AM Vasil Dimov wrote:
> On Tue, Dec 27, 2016 at 13:03:27 -0800, John Baldwin wrote:
> [...]
> > I have tried changing fbsd_wait() to return a TARGET_WAITKIND_SPURIOUS
> > instead of explicitly continuing the process, but that doesn't help, and it
> > means that the ptid being returned is still T1 in that case.
> > 
> > I'm not sure if I should explicitly be calling delete_exited_threads() in
> > fbsd_resume() before calling iterate_threads()?  Alternatively, fbsd_resume()
> > could use ALL_NONEXITED_THREADS() instead of iterate_threads() (it isn't
> > clear to me which of these is preferred since both are in use).
> > 
> > I added the assertion for my own sanity.  I suspect gdb should never try to
> > invoke target_resume() with a ptid of an exited thread, but if for some
> > reason it did the effect on FreeBSD would be a hang since we would suspend
> > all the other threads and when the process was continued via PT_CONTINUE it
> > would have nothing to do and would never return from wait().  I'd rather have
> > gdb fail an assertion in that case rather than hang.
> [...]
> 
> Hi,
> 
> I am not sure if this is related, but since I get a hang I would rather
> mention it: with the John's patch (including the assert) gdb does not
> emit the "ptrace: No such process" error, but when I attempt to quit,
> it hangs:

No, this is a separate bug in the kernel whereby a process doesn't
treat PT_KILL as a detach-like event but incorrectly expects to keep
getting PT_CONTINUE events for a while until it finally exits.  I'm
working on writing up regression/unit tests for PT_KILL and then
fixing the bug.
  
John Baldwin Jan. 6, 2017, 7:26 p.m. UTC | #6
On Tuesday, December 27, 2016 01:03:27 PM John Baldwin wrote:
> On Tuesday, December 27, 2016 05:43:29 PM Vasil Dimov wrote:
> > On Fri, Dec 23, 2016 at 15:43:19 -0600, Luis Machado wrote:
> > > On 12/23/2016 03:28 PM, John Baldwin wrote:
> > > > When resuming a native FreeBSD process, ignore exited threads when
> > > > suspending/resuming individual threads prior to continuing the process.
> > > >
> > > > gdb/ChangeLog:
> > > >
> > > > 	PR threads/20743
> > > > 	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
> > > > 	(resume_all_threads_cb): Likewise.
> > > > 	(fbsd_resume): Assert resuming thread has not exited.
> > > > ---
> > > >  gdb/ChangeLog  | 7 +++++++
> > > >  gdb/fbsd-nat.c | 7 +++++++
> > > >  2 files changed, 14 insertions(+)
> > > >
> > > > diff --git a/gdb/ChangeLog b/gdb/ChangeLog
> > > > index db6e913..4fb3732 100644
> > > > --- a/gdb/ChangeLog
> > > > +++ b/gdb/ChangeLog
> > > > @@ -1,3 +1,10 @@
> > > > +2016-12-23  John Baldwin  <jhb@FreeBSD.org>
> > > > +
> > > > +	PR threads/20743
> > > > +	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
> > > > +	(resume_all_threads_cb): Likewise.
> > > > +	(fbsd_resume): Assert resuming thread has not exited.
> > > > +
> > > >  2016-12-22  Doug Evans  <xdje42@gmail.com>
> > > >
> > > >  	* infrun.c (set_step_over_info): Add comment.
> > > > diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
> > > > index ade62f1..7cd08c6 100644
> > > > --- a/gdb/fbsd-nat.c
> > > > +++ b/gdb/fbsd-nat.c
> > > > @@ -662,6 +662,9 @@ resume_one_thread_cb (struct thread_info *tp, void *data)
> > > >    if (ptid_get_pid (tp->ptid) != ptid_get_pid (*ptid))
> > > >      return 0;
> > > >
> > > > +  if (is_exited (tp->ptid))
> > > > +    return 0;
> > > > +		
> > > >    if (ptid_get_lwp (tp->ptid) == ptid_get_lwp (*ptid))
> > > >      request = PT_RESUME;
> > > >    else
> > > > @@ -680,6 +683,9 @@ resume_all_threads_cb (struct thread_info *tp, void *data)
> > > >    if (!ptid_match (tp->ptid, *filter))
> > > >      return 0;
> > > >
> > > > +  if (is_exited (tp->ptid))
> > > > +    return 0;
> > > > +		
> > > >    if (ptrace (PT_RESUME, ptid_get_lwp (tp->ptid), NULL, 0) == -1)
> > > >      perror_with_name (("ptrace"));
> > > >    return 0;
> > > > @@ -711,6 +717,7 @@ fbsd_resume (struct target_ops *ops,
> > > >    if (ptid_lwp_p (ptid))
> > > >      {
> > > >        /* If ptid is a specific LWP, suspend all other LWPs in the process.  */
> > > > +      gdb_assert (!is_exited (ptid));
> > > 
> > > If we're asserting on this (since supposedly it shouldn't happen), do we 
> > > need to check for is_exited on the two functions above?
> > > 
> > > Also, is there a reason why we're not detecting a thread that has 
> > > exited? Aren't all threads stopped at this point (for all-stop mode at 
> > > least)?
> > [...]
> > 
> > Hello,
> > 
> > I just nailed this down after it has been annoying me for some time,
> > fixed it with a similar patch as the one submitted by John, and came
> > here to report it.
> > 
> > The reason that we are "not detecting" an exited thread (at least in the
> > scenario I got is), gdb/thread.c:
> > 
> >   --- cut ---
> >   static void
> >   delete_thread_1 (ptid_t ptid, int silent)
> >   {
> >   ...
> >     /* If this is the current thread, or there's code out there that
> >        relies on it existing (refcount > 0) we can't delete yet.  Mark
> >        it as exited, and notify it.  */
> >     if (tp->refcount > 0
> >         || ptid_equal (tp->ptid, inferior_ptid))
> >       {
> >   ...
> >          /* Will be really deleted some other time.  */
> >          printf_unfiltered ("========== Will be really deleted some other time %u\n", ptid);
> >          return;
> >        }
> >   ...
> >   if (tpprev)
> >     tpprev->next = tp->next;
> >   else
> >     thread_list = tp->next;
> >   --- cut ---
> > 
> > In my scenario tp->refcount is 0, but
> > "ptid_equal (tp->ptid, inferior_ptid)" is true, so the thread's entry is
> > not removed from the global "threads_list".
> > 
> > The gdb output (with "set debug fbsd-lwp" enabled):
> > 
> >   --- cut ---
> >   FLWP: adding thread for LWP 102009
> >   [New LWP 102009 of process 40304]
> >   FLWP: fbsd_resume for ptid (-1, 0, 0)
> >   FLWP: fbsd_resume for ptid (40304, 102009, 0)
> >   FLWP: fbsd_resume for ptid (-1, 0, 0)
> >   FLWP: fbsd_resume for ptid (40304, 102009, 0)
> >   FLWP: fbsd_resume for ptid (-1, 0, 0)
> >   FLWP: deleting thread for LWP 102009
> >   [LWP 102009 of process 40304 exited]
> >   ...
> >   ptrace: No such process.
> >   --- cut ---
> > 
> > Hope this helps.
> 
> In particular, the sequence of events is this:
> 
> - an LWP (T1) reports a "normal" event (in the test case it is hitting a
>   breakpoint).  This is reported to the core and sets the current thread
>   (and thus inferior_ptid) to T1.
> - the same LWP (T1) then exits and a thread exit event is reported via
>   ptrace() to the native target.  The native target calls delete_thread,
>   but the thread is not removed, just marked EXITING since it ==
>   inferior_ptid as Vasil noted.  The native target just
>   continues the process explicitly via ptrace() without reporting any
>   event to the core aside from the call to delete_thread().
> - some other LWP (T2) reports an event (in the test case it is a
>   breakpoint).
> - the user continues which invokes fbsd_resume() which wants to resume
>   all threads.  Here iterate_over_threads() in fbsd_resume() will
>   encounters the exited thread for T1 since nothing has called
>   thread_update_list() (which would invoke delete_exited_threads() from
>   fbsd_update_thread_list()).  Since the thread is exited, trying to
>   manipulate it via ptrace() results in an error.
> 
> I have tried changing fbsd_wait() to return a TARGET_WAITKIND_SPURIOUS
> instead of explicitly continuing the process, but that doesn't help, and it
> means that the ptid being returned is still T1 in that case.
> 
> I'm not sure if I should explicitly be calling delete_exited_threads() in
> fbsd_resume() before calling iterate_threads()?  Alternatively, fbsd_resume()
> could use ALL_NONEXITED_THREADS() instead of iterate_threads() (it isn't
> clear to me which of these is preferred since both are in use).
> 
> I added the assertion for my own sanity.  I suspect gdb should never try to
> invoke target_resume() with a ptid of an exited thread, but if for some
> reason it did the effect on FreeBSD would be a hang since we would suspend
> all the other threads and when the process was continued via PT_CONTINUE it
> would have nothing to do and would never return from wait().  I'd rather have
> gdb fail an assertion in that case rather than hang.

Luis (or anyone else), any thoughts on the above or if there is a better way
to solve this (e.g. calling delete_exited_threads() explicitly in fbsd_resume()
before iterate_threads() and/or using ALL_NONEXITED_THREADS instead of
iterate_threads)?
  
Luis Machado Jan. 12, 2017, 4:29 p.m. UTC | #7
On 12/28/2016 11:37 AM, John Baldwin wrote:
> On Wednesday, December 28, 2016 09:07:07 AM Vasil Dimov wrote:
>> On Tue, Dec 27, 2016 at 13:03:27 -0800, John Baldwin wrote:
>> [...]
>>> I have tried changing fbsd_wait() to return a TARGET_WAITKIND_SPURIOUS
>>> instead of explicitly continuing the process, but that doesn't help, and it
>>> means that the ptid being returned is still T1 in that case.
>>>
>>> I'm not sure if I should explicitly be calling delete_exited_threads() in
>>> fbsd_resume() before calling iterate_threads()?  Alternatively, fbsd_resume()
>>> could use ALL_NONEXITED_THREADS() instead of iterate_threads() (it isn't
>>> clear to me which of these is preferred since both are in use).
>>>
>>> I added the assertion for my own sanity.  I suspect gdb should never try to
>>> invoke target_resume() with a ptid of an exited thread, but if for some
>>> reason it did the effect on FreeBSD would be a hang since we would suspend
>>> all the other threads and when the process was continued via PT_CONTINUE it
>>> would have nothing to do and would never return from wait().  I'd rather have
>>> gdb fail an assertion in that case rather than hang.
>> [...]
>>
>> Hi,
>>
>> I am not sure if this is related, but since I get a hang I would rather
>> mention it: with the John's patch (including the assert) gdb does not
>> emit the "ptrace: No such process" error, but when I attempt to quit,
>> it hangs:
>
> No, this is a separate bug in the kernel whereby a process doesn't
> treat PT_KILL as a detach-like event but incorrectly expects to keep
> getting PT_CONTINUE events for a while until it finally exits.  I'm
> working on writing up regression/unit tests for PT_KILL and then
> fixing the bug.
>

I think the patch is mainly papering over a bigger problem. My guess is 
that the native fbsd backend is not doing something it should.

I'd check how linux-nat.c is doing things and then try to confirm the 
fbsd behavior is sane.

For example, i noticed linux-nat.c has exit_lwp (...) that handles 
deletion of both thread information and the thread itself (lwp). Even if 
it is the currently-selected thread, we *will* get the lwp removed from 
the list of existing lwp's.

It doesn't make sense to keep a thread that has already exitted in the 
list of threads we are manipulating.
  
John Baldwin Jan. 12, 2017, 7:16 p.m. UTC | #8
On Thursday, January 12, 2017 10:29:00 AM Luis Machado wrote:
> On 12/28/2016 11:37 AM, John Baldwin wrote:
> > On Wednesday, December 28, 2016 09:07:07 AM Vasil Dimov wrote:
> >> On Tue, Dec 27, 2016 at 13:03:27 -0800, John Baldwin wrote:
> >> [...]
> >>> I have tried changing fbsd_wait() to return a TARGET_WAITKIND_SPURIOUS
> >>> instead of explicitly continuing the process, but that doesn't help, and it
> >>> means that the ptid being returned is still T1 in that case.
> >>>
> >>> I'm not sure if I should explicitly be calling delete_exited_threads() in
> >>> fbsd_resume() before calling iterate_threads()?  Alternatively, fbsd_resume()
> >>> could use ALL_NONEXITED_THREADS() instead of iterate_threads() (it isn't
> >>> clear to me which of these is preferred since both are in use).
> >>>
> >>> I added the assertion for my own sanity.  I suspect gdb should never try to
> >>> invoke target_resume() with a ptid of an exited thread, but if for some
> >>> reason it did the effect on FreeBSD would be a hang since we would suspend
> >>> all the other threads and when the process was continued via PT_CONTINUE it
> >>> would have nothing to do and would never return from wait().  I'd rather have
> >>> gdb fail an assertion in that case rather than hang.
> >> [...]
> >>
> >> Hi,
> >>
> >> I am not sure if this is related, but since I get a hang I would rather
> >> mention it: with the John's patch (including the assert) gdb does not
> >> emit the "ptrace: No such process" error, but when I attempt to quit,
> >> it hangs:
> >
> > No, this is a separate bug in the kernel whereby a process doesn't
> > treat PT_KILL as a detach-like event but incorrectly expects to keep
> > getting PT_CONTINUE events for a while until it finally exits.  I'm
> > working on writing up regression/unit tests for PT_KILL and then
> > fixing the bug.
> >
> 
> I think the patch is mainly papering over a bigger problem. My guess is 
> that the native fbsd backend is not doing something it should.
> 
> I'd check how linux-nat.c is doing things and then try to confirm the 
> fbsd behavior is sane.
> 
> For example, i noticed linux-nat.c has exit_lwp (...) that handles 
> deletion of both thread information and the thread itself (lwp). Even if 
> it is the currently-selected thread, we *will* get the lwp removed from 
> the list of existing lwp's.

FreeBSD's backend doesn't maintain a separate lwp list, it just uses
the existing GDB thread list.  For FreeBSD's backend the two lists would
simply mirror each other so it seems a bit of a waste to maintain a
duplicate list.  exit_lwp() calls delete_thread() which is the same thing
the FreeBSD backend is doing, so if that is the current thread in
inferior_ptid, the Linux backend will also being leaving the exited
thread around in GDB's list until some future call to delete_exited_threads().

I think the thing that makes Linux work is that it doesn't use GDB's
thread list.  Meaning, it doesn't walk over GDB's thread list, but instead
iterates over its private LWP list via iterate_over_lwps().  It would seem
that GDB's thread list is designed so that backends shouldn't need their
own thread list (you can add target-specific data with a custom destructor
that gets invoked when freeing a thread for example), but the Linux backend
doesn't choose to use it that way?

Looking at some other threaded backends:

- aix-thread.c relies on custom ptrace ops that resume a single thread
- darwin-nat.c uses its own thread list (stored in the inferior's
  private data) instead of GDB's thread list.
- gnu-nat.c uses its own thread list instead of GDB's thread list.
- obsd-nat.c uses GDB's thread list but doesn't seem to support resuming
  individual threads (only entire processes).
- procfs.c maintains its own thread list, but it doesn't seem to use it
  for resume but relies on the associated kernel resuming either an
  entire process or a single thread in a process via different ioctls.
- remote.c:remote_resume() uses ALL_NON_EXITED_THREADS
- windows_nat.cwindows_resume() calls windows_continue() which uses a
  target-internal thread list rather than GDB's thread list.

> It doesn't make sense to keep a thread that has already exitted in the 
> list of threads we are manipulating.

FreeBSD's backend isn't making that choice.  delete_thread() in threads.c
is the one making that choice.  If FreeBSD's backend were to define its
own thread list, the contents would be identical except it would not
include any exited threads, so skipping exited threads gives the same
result as walking a hypothetical private list.
  
Luis Machado Jan. 13, 2017, 1:27 a.m. UTC | #9
On 01/12/2017 01:16 PM, John Baldwin wrote:
> On Thursday, January 12, 2017 10:29:00 AM Luis Machado wrote:
>> On 12/28/2016 11:37 AM, John Baldwin wrote:
>>> On Wednesday, December 28, 2016 09:07:07 AM Vasil Dimov wrote:
>>>> On Tue, Dec 27, 2016 at 13:03:27 -0800, John Baldwin wrote:
>>>> [...]
>>>>> I have tried changing fbsd_wait() to return a TARGET_WAITKIND_SPURIOUS
>>>>> instead of explicitly continuing the process, but that doesn't help, and it
>>>>> means that the ptid being returned is still T1 in that case.
>>>>>
>>>>> I'm not sure if I should explicitly be calling delete_exited_threads() in
>>>>> fbsd_resume() before calling iterate_threads()?  Alternatively, fbsd_resume()
>>>>> could use ALL_NONEXITED_THREADS() instead of iterate_threads() (it isn't
>>>>> clear to me which of these is preferred since both are in use).
>>>>>
>>>>> I added the assertion for my own sanity.  I suspect gdb should never try to
>>>>> invoke target_resume() with a ptid of an exited thread, but if for some
>>>>> reason it did the effect on FreeBSD would be a hang since we would suspend
>>>>> all the other threads and when the process was continued via PT_CONTINUE it
>>>>> would have nothing to do and would never return from wait().  I'd rather have
>>>>> gdb fail an assertion in that case rather than hang.
>>>> [...]
>>>>
>>>> Hi,
>>>>
>>>> I am not sure if this is related, but since I get a hang I would rather
>>>> mention it: with the John's patch (including the assert) gdb does not
>>>> emit the "ptrace: No such process" error, but when I attempt to quit,
>>>> it hangs:
>>>
>>> No, this is a separate bug in the kernel whereby a process doesn't
>>> treat PT_KILL as a detach-like event but incorrectly expects to keep
>>> getting PT_CONTINUE events for a while until it finally exits.  I'm
>>> working on writing up regression/unit tests for PT_KILL and then
>>> fixing the bug.
>>>
>>
>> I think the patch is mainly papering over a bigger problem. My guess is
>> that the native fbsd backend is not doing something it should.
>>
>> I'd check how linux-nat.c is doing things and then try to confirm the
>> fbsd behavior is sane.
>>
>> For example, i noticed linux-nat.c has exit_lwp (...) that handles
>> deletion of both thread information and the thread itself (lwp). Even if
>> it is the currently-selected thread, we *will* get the lwp removed from
>> the list of existing lwp's.
>
> FreeBSD's backend doesn't maintain a separate lwp list, it just uses
> the existing GDB thread list.  For FreeBSD's backend the two lists would
> simply mirror each other so it seems a bit of a waste to maintain a
> duplicate list.  exit_lwp() calls delete_thread() which is the same thing
> the FreeBSD backend is doing, so if that is the current thread in
> inferior_ptid, the Linux backend will also being leaving the exited
> thread around in GDB's list until some future call to delete_exited_threads().
>
> I think the thing that makes Linux work is that it doesn't use GDB's
> thread list.  Meaning, it doesn't walk over GDB's thread list, but instead
> iterates over its private LWP list via iterate_over_lwps().  It would seem
> that GDB's thread list is designed so that backends shouldn't need their
> own thread list (you can add target-specific data with a custom destructor
> that gets invoked when freeing a thread for example), but the Linux backend
> doesn't choose to use it that way?
>
> Looking at some other threaded backends:
>
> - aix-thread.c relies on custom ptrace ops that resume a single thread
> - darwin-nat.c uses its own thread list (stored in the inferior's
>   private data) instead of GDB's thread list.
> - gnu-nat.c uses its own thread list instead of GDB's thread list.
> - obsd-nat.c uses GDB's thread list but doesn't seem to support resuming
>   individual threads (only entire processes).
> - procfs.c maintains its own thread list, but it doesn't seem to use it
>   for resume but relies on the associated kernel resuming either an
>   entire process or a single thread in a process via different ioctls.
> - remote.c:remote_resume() uses ALL_NON_EXITED_THREADS
> - windows_nat.cwindows_resume() calls windows_continue() which uses a
>   target-internal thread list rather than GDB's thread list.
>
>> It doesn't make sense to keep a thread that has already exitted in the
>> list of threads we are manipulating.
>
> FreeBSD's backend isn't making that choice.  delete_thread() in threads.c
> is the one making that choice.  If FreeBSD's backend were to define its
> own thread list, the contents would be identical except it would not
> include any exited threads, so skipping exited threads gives the same
> result as walking a hypothetical private list.
>

So i take it using ALL_NON_EXITED_THREADS is something that would seem 
reasonable to use in this case and not iterate through all threads (even 
ones marked exitting)?
  
John Baldwin Jan. 13, 2017, 1:44 a.m. UTC | #10
On Thursday, January 12, 2017 07:27:43 PM Luis Machado wrote:
> On 01/12/2017 01:16 PM, John Baldwin wrote:
> > On Thursday, January 12, 2017 10:29:00 AM Luis Machado wrote:
> >> On 12/28/2016 11:37 AM, John Baldwin wrote:
> >>> On Wednesday, December 28, 2016 09:07:07 AM Vasil Dimov wrote:
> >>>> On Tue, Dec 27, 2016 at 13:03:27 -0800, John Baldwin wrote:
> >>>> [...]
> >>>>> I have tried changing fbsd_wait() to return a TARGET_WAITKIND_SPURIOUS
> >>>>> instead of explicitly continuing the process, but that doesn't help, and it
> >>>>> means that the ptid being returned is still T1 in that case.
> >>>>>
> >>>>> I'm not sure if I should explicitly be calling delete_exited_threads() in
> >>>>> fbsd_resume() before calling iterate_threads()?  Alternatively, fbsd_resume()
> >>>>> could use ALL_NONEXITED_THREADS() instead of iterate_threads() (it isn't
> >>>>> clear to me which of these is preferred since both are in use).
> >>>>>
> >>>>> I added the assertion for my own sanity.  I suspect gdb should never try to
> >>>>> invoke target_resume() with a ptid of an exited thread, but if for some
> >>>>> reason it did the effect on FreeBSD would be a hang since we would suspend
> >>>>> all the other threads and when the process was continued via PT_CONTINUE it
> >>>>> would have nothing to do and would never return from wait().  I'd rather have
> >>>>> gdb fail an assertion in that case rather than hang.
> >>>> [...]
> >>>>
> >>>> Hi,
> >>>>
> >>>> I am not sure if this is related, but since I get a hang I would rather
> >>>> mention it: with the John's patch (including the assert) gdb does not
> >>>> emit the "ptrace: No such process" error, but when I attempt to quit,
> >>>> it hangs:
> >>>
> >>> No, this is a separate bug in the kernel whereby a process doesn't
> >>> treat PT_KILL as a detach-like event but incorrectly expects to keep
> >>> getting PT_CONTINUE events for a while until it finally exits.  I'm
> >>> working on writing up regression/unit tests for PT_KILL and then
> >>> fixing the bug.
> >>>
> >>
> >> I think the patch is mainly papering over a bigger problem. My guess is
> >> that the native fbsd backend is not doing something it should.
> >>
> >> I'd check how linux-nat.c is doing things and then try to confirm the
> >> fbsd behavior is sane.
> >>
> >> For example, i noticed linux-nat.c has exit_lwp (...) that handles
> >> deletion of both thread information and the thread itself (lwp). Even if
> >> it is the currently-selected thread, we *will* get the lwp removed from
> >> the list of existing lwp's.
> >
> > FreeBSD's backend doesn't maintain a separate lwp list, it just uses
> > the existing GDB thread list.  For FreeBSD's backend the two lists would
> > simply mirror each other so it seems a bit of a waste to maintain a
> > duplicate list.  exit_lwp() calls delete_thread() which is the same thing
> > the FreeBSD backend is doing, so if that is the current thread in
> > inferior_ptid, the Linux backend will also being leaving the exited
> > thread around in GDB's list until some future call to delete_exited_threads().
> >
> > I think the thing that makes Linux work is that it doesn't use GDB's
> > thread list.  Meaning, it doesn't walk over GDB's thread list, but instead
> > iterates over its private LWP list via iterate_over_lwps().  It would seem
> > that GDB's thread list is designed so that backends shouldn't need their
> > own thread list (you can add target-specific data with a custom destructor
> > that gets invoked when freeing a thread for example), but the Linux backend
> > doesn't choose to use it that way?
> >
> > Looking at some other threaded backends:
> >
> > - aix-thread.c relies on custom ptrace ops that resume a single thread
> > - darwin-nat.c uses its own thread list (stored in the inferior's
> >   private data) instead of GDB's thread list.
> > - gnu-nat.c uses its own thread list instead of GDB's thread list.
> > - obsd-nat.c uses GDB's thread list but doesn't seem to support resuming
> >   individual threads (only entire processes).
> > - procfs.c maintains its own thread list, but it doesn't seem to use it
> >   for resume but relies on the associated kernel resuming either an
> >   entire process or a single thread in a process via different ioctls.
> > - remote.c:remote_resume() uses ALL_NON_EXITED_THREADS
> > - windows_nat.cwindows_resume() calls windows_continue() which uses a
> >   target-internal thread list rather than GDB's thread list.
> >
> >> It doesn't make sense to keep a thread that has already exitted in the
> >> list of threads we are manipulating.
> >
> > FreeBSD's backend isn't making that choice.  delete_thread() in threads.c
> > is the one making that choice.  If FreeBSD's backend were to define its
> > own thread list, the contents would be identical except it would not
> > include any exited threads, so skipping exited threads gives the same
> > result as walking a hypothetical private list.
> >
> 
> So i take it using ALL_NON_EXITED_THREADS is something that would seem 
> reasonable to use in this case and not iterate through all threads (even 
> ones marked exitting)?

Yes, that would work for me.
  
Pedro Alves Jan. 19, 2017, 11:53 a.m. UTC | #11
On 01/13/2017 01:44 AM, John Baldwin wrote:

>> So i take it using ALL_NON_EXITED_THREADS is something that would seem 
>> reasonable to use in this case and not iterate through all threads (even 
>> ones marked exitting)?
> 
> Yes, that would work for me.
> 

Agree, that should be fine.

Thanks,
Pedro Alves
  
Pedro Alves Jan. 19, 2017, 11:56 a.m. UTC | #12
On 12/27/2016 09:03 PM, John Baldwin wrote:
> I added the assertion for my own sanity.  I suspect gdb should never try to
> invoke target_resume() with a ptid of an exited thread, but if for some
> reason it did the effect on FreeBSD would be a hang since we would suspend
> all the other threads and when the process was continued via PT_CONTINUE it
> would have nothing to do and would never return from wait().  I'd rather have
> gdb fail an assertion in that case rather than hang.

OOC, what happens on FreeBSD on the gdb.threads/no-unwaited-for-left.exp test?

That test:

 - spawns a thread
 - gdb switches to that thread, and,
 - enables "set scheduler-locking on", and "continue"s.
 - the child thread exits
 - gdb should not hang.

It used to hang on Linux, until TARGET_WAITKIND_NO_RESUMED was added.

Thanks,
Pedro Alves
  

Patch

diff --git a/gdb/ChangeLog b/gdb/ChangeLog
index db6e913..4fb3732 100644
--- a/gdb/ChangeLog
+++ b/gdb/ChangeLog
@@ -1,3 +1,10 @@ 
+2016-12-23  John Baldwin  <jhb@FreeBSD.org>
+
+	PR threads/20743
+	* fbsd-nat.c (resume_one_thread_cb): Ignore exited threads.
+	(resume_all_threads_cb): Likewise.
+	(fbsd_resume): Assert resuming thread has not exited.
+
 2016-12-22  Doug Evans  <xdje42@gmail.com>
 
 	* infrun.c (set_step_over_info): Add comment.
diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c
index ade62f1..7cd08c6 100644
--- a/gdb/fbsd-nat.c
+++ b/gdb/fbsd-nat.c
@@ -662,6 +662,9 @@  resume_one_thread_cb (struct thread_info *tp, void *data)
   if (ptid_get_pid (tp->ptid) != ptid_get_pid (*ptid))
     return 0;
 
+  if (is_exited (tp->ptid))
+    return 0;
+		  
   if (ptid_get_lwp (tp->ptid) == ptid_get_lwp (*ptid))
     request = PT_RESUME;
   else
@@ -680,6 +683,9 @@  resume_all_threads_cb (struct thread_info *tp, void *data)
   if (!ptid_match (tp->ptid, *filter))
     return 0;
 
+  if (is_exited (tp->ptid))
+    return 0;
+		  
   if (ptrace (PT_RESUME, ptid_get_lwp (tp->ptid), NULL, 0) == -1)
     perror_with_name (("ptrace"));
   return 0;
@@ -711,6 +717,7 @@  fbsd_resume (struct target_ops *ops,
   if (ptid_lwp_p (ptid))
     {
       /* If ptid is a specific LWP, suspend all other LWPs in the process.  */
+      gdb_assert (!is_exited (ptid));
       iterate_over_threads (resume_one_thread_cb, &ptid);
     }
   else