[PATCHv2,5/8] gdb: don't resume vfork parent while child is still running

Message ID 9b3303bed5b6afcbe2e11e65d9696e3b59a61826.1688484032.git.aburgess@redhat.com
State New
Series Some vfork related fixes

Commit Message

Andrew Burgess July 4, 2023, 3:22 p.m. UTC
  Like the last few commits, this fixes yet another vfork-related issue.
Like the commit titled:

  gdb: don't restart vfork parent while waiting for child to finish

which addressed a case in linux-nat where we would try to resume a
vfork parent, this commit addresses a very similar case, but this time
occurring in infrun.c.  Just like with that previous commit, this bug
results in the assert:

  x86-linux-dregs.c:146: internal-error: x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.

In this case the issue occurs when target-non-stop is on, but non-stop
is off, and again, schedule-multiple is on.  As with the previous
commit, GDB is in follow-fork-mode child.  If you have not done so, it
is worth reading the earlier commit as many of the problems leading to
the failure are the same, they just appear in a different part of GDB.

Here are the steps leading to the assertion failure:

  1. The user performs a 'next' over a vfork, GDB stops in the vfork
  child,

  2. As we are planning to follow the child GDB sets the vfork_parent
  and vfork_child member variables in the two inferiors, the
  thread_waiting_for_vfork_done member is left as nullptr, that member
  is only used when GDB is planning to follow the parent inferior,

  3. The user does 'continue', our expectation is that the vfork child
  should resume, and once that process has exited or execd, GDB should
  detach from the vfork parent.  As a result of the 'continue' GDB
  eventually enters the proceed function,

  4. In proceed we select a ptid_t to resume, because
  schedule-multiple is on we select minus_one_ptid (see
  user_visible_resume_ptid),

  5. As GDB is running in all-stop on top of non-stop mode, in the
  proceed function we iterate over all threads that match the resume
  ptid, which turns out to be all threads, and call
  proceed_resume_thread_checked.  One of the threads we iterate over
  is the vfork parent thread,

  6. As the thread passed to proceed_resume_thread_checked doesn't
  match any of the early return conditions, GDB will set the thread
  resumed,

  7. As we are resuming one thread at a time, this thread is seen by
  the lower layers (e.g. linux-nat) as the "event thread", which means
  we don't apply any of the checks, e.g. is this thread a
  vfork parent, instead we assume that GDB core knows what it's doing,
  and linux-nat will resume the thread, we have now incorrectly set
  running the vfork parent thread when this thread should be waiting
  for the vfork child to complete,

  8. Back in the proceed function GDB continues to iterate over all
  threads, and now (correctly) resumes the vfork child thread,

  9. As the vfork child is still alive the kernel holds the vfork
  parent stopped,

  10. Eventually the child performs its exec and GDB is sent an EXECD
  event.  However, because the parent is resumed, as soon as the child
  performs its exec the vfork parent also sends a VFORK_DONE event to
  GDB,

  11. Depending on timing both of these events might seem to arrive in
  GDB at the same time.  Normally GDB expects to see the EXECD or
  EXITED/SIGNALED event from the vfork child before getting the
  VFORK_DONE in the parent.  We know this because it is as a result of
  the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
  handle_vfork_child_exec_or_exit for details).  Further the comment
  in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
  when we remain attached to the child (not the parent) we should not
  expect to see a VFORK_DONE,

  12. If both events arrive at the same time then GDB will randomly
  choose one event to handle first, in some cases this will be the
  VFORK_DONE.  As described above, upon seeing a VFORK_DONE GDB
  expects that (a) the vfork child has finished, however, in this case
  this is not completely true, the child has finished, but GDB has not
  processed the event associated with the completion yet, and (b) upon
  seeing a VFORK_DONE GDB assumes we are remaining attached to the
  parent, and so resumes the parent process,

  13. GDB now handles the EXECD event.  In our case we are detaching
  from the parent, so GDB calls target_detach (see
  handle_vfork_child_exec_or_exit),

  14. While this has been going on the vfork parent is executing, and
  might even exit,

  15. In linux_nat_target::detach the first thing we do is stop all
  threads in the process we're detaching from, the result of the stop
  request will be cached on the lwp_info object,

  16. In our case the vfork parent has exited though, so when GDB
  waits for the thread, instead of a stop due to signal, we instead
  get a thread exited status,

  17. Later in the detach process we try to resume the threads just
  prior to making the ptrace call to actually detach (see
  detach_one_lwp), as part of the process to resume a thread we try to
  touch some registers within the thread, and before doing this GDB
  asserts that the thread is stopped,

  18. An exited thread is not classified as stopped, and so the assert
  triggers!
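
The ordering problem in the steps above can be modelled in isolation.
What follows is a minimal C++ sketch, not GDB code: event_kind, event
and parent_resumed_before_detach are hypothetical names made up for
illustration.  Given the order in which pending events happen to be
handled while following the vfork child, it reports whether the
parent's VFORK_DONE is handled before the child's EXECD or EXITED,
which is the ordering that gets the parent resumed just ahead of the
detach.

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for GDB's wait kinds; these are
// assumptions for illustration, not the real definitions.
enum class event_kind { execd, exited, vfork_done };

struct event
{
  event_kind kind;
  bool from_parent;  // true if the event was reported by the vfork parent
};

// Return true if, in the given handling order, the parent's VFORK_DONE
// is handled before the child's EXECD/EXITED event.
bool
parent_resumed_before_detach (const std::vector<event> &handled_in_order)
{
  for (const event &ev : handled_in_order)
    {
      // Child event handled first: GDB detaches from the parent as
      // expected, the parent is never resumed by GDB.
      if (!ev.from_parent
          && (ev.kind == event_kind::execd || ev.kind == event_kind::exited))
        return false;

      // VFORK_DONE handled first: the parent is resumed before the
      // detach, which is the problematic ordering.
      if (ev.from_parent && ev.kind == event_kind::vfork_done)
        return true;
    }
  return false;
}
```

Passing the events as { VFORK_DONE from parent, EXECD from child }
yields true, the reverse order yields false.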

Just like with the earlier commit, the fix is to spot the vfork parent
status of the thread, and not resume such threads.  Where the earlier
commit fixed this in linux-nat, in this case I think the fix should
live in infrun.c, in proceed_resume_thread_checked.  This function
already has a similar check to not resume the vfork parent in the case
where we are planning to follow the vfork parent, I propose adding a
similar case that checks for the vfork parent when we plan to follow
the vfork child.

This new check will mean that at step #6 above GDB doesn't try to
resume the vfork parent thread, which prevents the VFORK_DONE from
ever arriving.  Once GDB sees the EXECD/EXITED/SIGNALLED event from
the vfork child GDB will detach from the parent.
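
To make the shape of the proposed check concrete, here is a minimal,
self-contained C++ sketch.  It is a model and not the patch itself:
sim_inferior and sim_thread are stripped-down stand-ins, and
should_resume and resume_all are hypothetical helpers standing in for
the early-return conditions of proceed_resume_thread_checked and the
per-thread loop in proceed.  Only the member name vfork_child is taken
from the description above, and the existing follow-parent check is
left out for brevity.

```cpp
#include <vector>

// Stripped-down stand-ins for GDB's types; illustration only.
struct sim_inferior
{
  int pid = 0;
  // Set on the vfork parent while GDB plans to follow the vfork child.
  sim_inferior *vfork_child = nullptr;
};

struct sim_thread
{
  sim_inferior *inf = nullptr;
  bool resumed = false;
};

// Hypothetical helper modelling the proposed early return: a thread
// that belongs to a vfork parent whose child GDB is following must not
// be resumed until the child has exited or execd.
bool
should_resume (const sim_thread &tp)
{
  return tp.inf->vfork_child == nullptr;
}

// Model of the per-thread loop: visit every thread and mark as resumed
// only those that pass the check.
void
resume_all (std::vector<sim_thread> &threads)
{
  for (sim_thread &tp : threads)
    if (should_resume (tp))
      tp.resumed = true;
}
```

With one thread in the vfork parent and one in the vfork child,
resume_all leaves the parent thread stopped and resumes only the child
thread, so in this model the parent is never running when the child's
event arrives.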

There's no test included in this commit.  In a subsequent commit I
will expand gdb.base/foll-vfork.exp, at which point this bug would be
exposed.

If you do want to reproduce this failure then you will certainly
need to run the gdb.base/foll-vfork.exp test in a loop as the failures
are all very timing sensitive.  I've found that running multiple
copies in parallel makes the failure more likely to appear, I usually
run ~6 copies in parallel and expect to see a failure within
10 minutes.
---
 gdb/infrun.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)
  

Comments

Simon Marchi July 18, 2023, 8:42 p.m. UTC | #1
Hi Andrew,

Since this commit, I see this on native-gdbserver and
native-extended-gdbserver:

FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to end of inferior 2 (timeout)
FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: inferior 1 (timeout)
FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: print unblock_parent = 1 (timeout)
FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to break_parent (timeout)

I haven't had the time to read this vfork series, but I look forward to
doing so, since I also did some vfork fixes not too long ago.

Simon
  
Andrew Burgess July 21, 2023, 9:47 a.m. UTC | #2
Simon Marchi <simark@simark.ca> writes:

> Hi Andrew,
>
> Since this commit, I see this on native-gdbserver and
> native-extended-gdbserver:
>
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to end of inferior 2 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: inferior 1 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: print unblock_parent = 1 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to break_parent (timeout)
>
> I haven't had the time to read this vfork series, but I look forward to
> doing so, since I also did some vfork fixes not too long ago.

Thanks, I'll take a look.

Andrew
  
Andrew Burgess July 23, 2023, 8:53 a.m. UTC | #3
Simon Marchi <simark@simark.ca> writes:

> Hi Andrew,
>
> Since this commit, I see this on native-gdbserver and
> native-extended-gdbserver:
>
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to end of inferior 2 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: inferior 1 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: print unblock_parent = 1 (timeout)
> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to break_parent (timeout)
>
> I haven't had the time to read this vfork series, but I look forward to
> doing so, since I also did some vfork fixes not too long ago.

If I remember correctly your fixes focused on the follow-parent side of
vfork, while the fixes I looked at focused on the follow-child side.

I have some more vfork fixes that I'm working on, which I'm hoping to
get posted soon, but I have a couple of other tasks I need to get done
first.

Anyway, below is my proposed fix for the above regressions.  The GDB
part of the fix is trivial, then there's a bunch of changes to the above
test script so that we check more cases.  Let me know what you think.

Thanks,
Andrew

---

commit a6609bd4fcf8d6d5718a7fb093dbaa34286938b6
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Sat Jul 22 15:32:29 2023 +0100

    gdb: fix vfork regressions when target-non-stop is off
    
    It was pointed out on the mailing list[1] that after this commit:
    
      commit b1e0126ec56e099d753c20e91a9f8623aabd6b46
      Date:   Wed Jun 21 14:18:54 2023 +0100
    
          gdb: don't resume vfork parent while child is still running
    
    the test gdb.base/vfork-follow-parent.exp now has some failures when
    run with the native-gdbserver or native-extended-gdbserver boards:
    
      FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to end of inferior 2 (timeout)
      FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: inferior 1 (timeout)
      FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: print unblock_parent = 1 (timeout)
      FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to break_parent (timeout)
    
    The reason that these failures don't show up when run on the standard
    unix board is that the test is only run in the default operating mode,
    so for Linux this will be all-stop on top of non-stop.
    
    If we adjust the test script so that it runs in the default mode and
    with target-non-stop turned off, then we see the same failures on the
    unix board.  This commit includes this change.
    
    The way that the test is written means that it is not (currently)
    possible to turn on non-stop mode and have the test still work, so
    this commit does not do that.
    
    I have also updated the test script so that the vfork child performs
    an exec as well as the current exit.  Exec and exit are the two ways
    in which a vfork child can release the vfork parent, so testing both
    of these cases is useful I think.
    
    In this test the inferior performs a vfork and the vfork-child
    immediately exits.  The vfork-parent will wait for the vfork-child and
    then block waiting for GDB.  Once GDB has released the vfork-parent,
    the vfork-parent also exits.
    
    In the test that fails, GDB sets 'detach-on-fork off' and then runs to
    the vfork.  At this point the test tries to just "continue", but this
    fails as the vfork-parent is still selected, and the parent can't
    continue until the vfork-child completes.  As the vfork-child is
    stopped by GDB the parent will never stop once resumed, so GDB refuses
    to resume it.
    
    The test script then sets 'schedule-multiple on' and once again
    continues.  This time GDB, in theory, resumes both the parent and the
    child.  The parent will be held blocked by the kernel, but the child
    will run until it exits, at which point GDB stops again, this time
    with inferior 2, the newly exited vfork-child, selected.
    
    What happens after this in the test script is irrelevant as far as
    this failure is concerned.
    
    To understand why the test started failing we should consider the
    behaviour of four different cases:
    
      1. All-stop-on-non-stop before commit b1e0126ec56e,
    
      2. All-stop-on-non-stop after commit b1e0126ec56e,
    
      3. All-stop-on-all-stop before commit b1e0126ec56e, and
    
      4. All-stop-on-all-stop after commit b1e0126ec56e.
    
    Only case #4 is failing after commit b1e0126ec56e, but I think the
    other cases are interesting because, (a) they inform how we might fix
    the regression, and (b) it turns out the behaviour of #2 changed too
    with the commit, but the change was harmless.
    
    For #1 All-stop-on-non-stop before commit b1e0126ec56e, what happens
    is:
    
      1. GDB calls proceed with the vfork-parent selected, as schedule
         multiple is on user_visible_resume_ptid returns -1 (everything)
         as the resume_ptid (see proceed function),
    
      2. As this is all-stop-on-non-stop, every thread is resumed
         individually, so GDB tries to resume both the vfork-parent and the
         vfork-child, both of which succeed,
    
      3. The vfork-parent is held stopped by the kernel,
    
      4. The vfork-child completes (exits) at which point GDB sees the
         EXITED event for the vfork-child and the VFORK_DONE event for the
         vfork-parent,
    
      5. At this point we might take two paths depending on which event
         GDB handles first, if GDB handles the VFORK_DONE first then:
    
         (a) As GDB is controlling both parent and child the VFORK_DONE is
             ignored (see handle_vfork_done), the vfork-parent will be
             resumed,
    
         (b) GDB processes the EXITED event, selects the (now defunct)
             vfork-child, and stops, returning control to the user.
    
         Alternatively, if GDB selects the EXITED event first then:
    
         (c) GDB processes the EXITED event, selects the (now defunct)
             vfork-child, and stops, returning control to the user.
    
         (d) At some future time the user resumes the vfork-parent, at
             which point the VFORK_DONE is reported to GDB, however, GDB
             is ignoring the VFORK_DONE (see handle_vfork_done), so the
             parent is resumed.
    
    For case #2, all-stop-on-non-stop after commit b1e0126ec56e, the
    important difference is in step (2) above, now, instead of resuming
    both the vfork-parent and the vfork-child, only the vfork-child is
    resumed.  As such, when we get to step (5), only a single event, the
    EXITED event is reported.
    
    GDB handles the EXITED event just as in (5)(c).  Later, when the user
    resumes the vfork-parent, the VFORK_DONE is immediately delivered
    from the kernel, but this is ignored just as in (5)(d).  So, although
    the point at which the vfork-parent is resumed changes, the overall
    pattern of which events are reported, and when, doesn't actually
    change.  In fact, by not resuming the vfork-parent, the order of
    events (in this test) is now deterministic, which (maybe?) is a good
    thing.
    
    If we now consider case #3, all-stop-on-all-stop before commit
    b1e0126ec56e, then what happens is:
    
      1. GDB calls proceed with the vfork-parent selected, as schedule
         multiple is on user_visible_resume_ptid returns -1 (everything)
         as the resume_ptid (see proceed function),
    
      2. As this is all-stop-on-all-stop, the resume is passed down to the
         linux-nat target, the vfork-parent is the event thread, while the
         vfork-child is a sibling of the event thread,
    
      3. In linux_nat_target::resume, GDB calls linux_nat_resume_callback
         for all threads, this causes the vfork-child to be resumed.  Then
         in linux_nat_target::resume, the event thread, the vfork-parent,
         is also resumed.
    
      4. The vfork-parent is held stopped by the kernel,
    
      5. The vfork-child completes (exits) at which point GDB sees the
         EXITED event for the vfork-child and the VFORK_DONE event for the
         vfork-parent,
    
      6. We are now in a situation identical to step (5) as for
         all-stop-on-non-stop above, GDB selects one of the events to
         handle, and whichever we select the user sees the correct
         behaviour.
    
    And so, finally, we can consider #4, all-stop-on-all-stop after commit
    b1e0126ec56e, this is the case that started failing.
    
    We start out just like above, in proceed, the resume_ptid is
    -1 (resume everything), due to schedule multiple being on.  And just
    like above, due to the target being all-stop, we call
    proceed_resume_thread_checked just once, for the current thread,
    which, remember, is the vfork-parent thread.
    
    The change in commit b1e0126ec56e was to avoid resuming a vfork-parent
    thread; see that commit's message for the justification.
    
    However, this means that GDB now rejects resuming the vfork-parent in
    this case, which means that nothing gets resumed!  Obviously, if
    nothing resumes, then nothing will ever stop, and so GDB appears to
    hang.
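
    The four cases can be condensed into a toy model.  This is only a
    sketch written for this note: ToyThread, toy_resume_allowed and
    toy_proceed are hypothetical names, not GDB code, and the model
    ignores everything except which threads end up resumed by a
    'continue' with schedule-multiple on.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a thread; not a GDB type.
struct ToyThread
{
  std::string name;
  bool is_vfork_parent;
};

// Toy version of the early-return check.  SKIP_VFORK_PARENT models the
// behaviour after commit b1e0126ec56e, where a vfork-parent is refused.
static bool
toy_resume_allowed (const ToyThread &thr, bool skip_vfork_parent)
{
  return !(skip_vfork_parent && thr.is_vfork_parent);
}

// Return the names of the threads that end up resumed, in resume order.
std::vector<std::string>
toy_proceed (const std::vector<ToyThread> &threads, std::size_t selected,
             bool target_non_stop, bool skip_vfork_parent)
{
  assert (selected < threads.size ());
  std::vector<std::string> resumed;

  if (target_non_stop)
    {
      // All-stop-on-non-stop: each thread is checked and resumed
      // individually.
      for (const ToyThread &thr : threads)
        if (toy_resume_allowed (thr, skip_vfork_parent))
          resumed.push_back (thr.name);
      return resumed;
    }

  // All-stop-on-all-stop: only the selected thread is checked.  If it
  // is refused then nothing is passed down to the target at all.
  if (!toy_resume_allowed (threads[selected], skip_vfork_parent))
    return resumed;

  // The target resumes the siblings first, then the event thread.
  for (std::size_t i = 0; i < threads.size (); ++i)
    if (i != selected)
      resumed.push_back (threads[i].name);
  resumed.push_back (threads[selected].name);
  return resumed;
}
```

    With the vfork-parent selected, the model resumes parent and child
    in cases #1 and #3, only the child in case #2, and nothing at all
    in case #4, which is the hang.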
    
    I considered a couple of solutions which, in the end, I didn't go
    with, these were:
    
      1. Move the vfork-parent check out of proceed_resume_thread_checked,
         and place it in proceed, but only on the all-stop-on-non-stop
         path, this should still address the issue seen in b1e0126ec56e,
         but would avoid the issue seen here.  I rejected this just
         because it didn't feel great to split the checks that exist in
         proceed_resume_thread_checked like this,
    
      2. Extend the condition in proceed_resume_thread_checked by adding a
         target_is_non_stop_p check.  This would have the same effect as
         idea 1, but leaves all the checks in the same place, which I
         think would be better, but this still just didn't feel right to
         me.
    
    What I noticed was that for all-stop-on-non-stop, after commit
    b1e0126ec56e, we only resumed the vfork-child, and this seems fine.
    The vfork-parent isn't going to run anyway (the kernel will hold it
    back), so it feels like there's no harm in just waiting for the
    child to complete, and then resuming the parent.
    
    So then I started looking at follow_fork, which is called from the top
    of proceed.  This function already has the task of switching between
    the parent and child based on which the user wishes to follow.  So, I
    wondered, could we use this to switch to the vfork-child in the case
    that we are attached to both?
    
    Turns out this is pretty simple to do.
    
    Having done that, the process for all-stop-on-all-stop after commit
    b1e0126ec56e, with this new fix, is:
    
      1. GDB calls proceed with the vfork-parent selected, but,
    
      2. In follow_fork, and follow_fork_inferior, GDB switches the
         selected thread to be that of the vfork-child,
    
      3. Back in proceed user_visible_resume_ptid returns -1 (everything)
         as the resume_ptid still, but now,
    
      4. When GDB calls proceed_resume_thread_checked, the vfork-child is
         the current selected thread, this is not a vfork-parent, and so
         GDB allows the proceed to continue to the linux-nat target,
    
      5. In linux_nat_target::resume, GDB calls linux_nat_resume_callback
         for all threads, this does not resume the vfork-parent (because
         it is a vfork-parent), and then the vfork-child is resumed as
         this is the event thread,
    
    At this point we are back in the same situation as for
    all-stop-on-non-stop after commit b1e0126ec56e, that is, the
    vfork-child is resumed, while the vfork-parent is held stopped by
    GDB.
    
    Eventually the vfork-child will exit or exec, at which point the
    vfork-parent will be resumed.
    
    [1] https://inbox.sourceware.org/gdb-patches/3e1e1db0-13d9-dd32-b4bb-051149ae6e76@simark.ca/

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 7efa0617526..7f4e6e50d6b 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -713,7 +713,7 @@ holding the child stopped.  Try \"set detach-on-fork\" or \
 	 (do not restore the parent as the current inferior).  */
       gdb::optional<scoped_restore_current_thread> maybe_restore;
 
-      if (!follow_child)
+      if (!follow_child && !sched_multi)
 	maybe_restore.emplace ();
 
       switch_to_thread (*child_inf->threads ().begin ());
@@ -3400,8 +3400,10 @@ proceed (CORE_ADDR addr, enum gdb_signal siggnal)
   struct gdbarch *gdbarch;
   CORE_ADDR pc;
 
-  /* If we're stopped at a fork/vfork, follow the branch set by the
-     "set follow-fork-mode" command; otherwise, we'll just proceed
+  /* If we're stopped at a fork/vfork, switch to either the parent or child
+     thread as defined by the "set follow-fork-mode" command, or, if both
+     the parent and child are controlled by GDB, and schedule-multiple is
+     on, follow the child.  If none of the above apply then we just proceed
      resuming the current thread.  */
   if (!follow_fork ())
     {
diff --git a/gdb/testsuite/gdb.base/vfork-follow-parent.c b/gdb/testsuite/gdb.base/vfork-follow-parent.c
index df45b9c2dbe..15ff84a0bad 100644
--- a/gdb/testsuite/gdb.base/vfork-follow-parent.c
+++ b/gdb/testsuite/gdb.base/vfork-follow-parent.c
@@ -17,6 +17,10 @@
 
 #include <unistd.h>
 
+#include <string.h>
+#include <limits.h>
+#include <stdio.h>
+
 static volatile int unblock_parent = 0;
 
 static void
@@ -25,7 +29,7 @@ break_parent (void)
 }
 
 int
-main (void)
+main (int argc, char **argv)
 {
   alarm (30);
 
@@ -40,7 +44,28 @@ main (void)
       break_parent ();
     }
   else
-    _exit (0);
+    {
+#if defined TEST_EXEC
+      char prog[PATH_MAX];
+      int len;
+
+      strcpy (prog, argv[0]);
+      len = strlen (prog);
+      for (; len > 0; --len)
+	{
+	  if (prog[len - 1] == '/')
+	    break;
+	}
+      strcpy (&prog[len], "vforked-prog");
+      execlp (prog, prog, (char *) 0);
+      perror ("exec failed");
+      _exit (1);
+#elif defined TEST_EXIT
+      _exit (0);
+#else
+#error Define TEST_EXEC or TEST_EXIT
+#endif
+    }
 
   return 0;
 }
diff --git a/gdb/testsuite/gdb.base/vfork-follow-parent.exp b/gdb/testsuite/gdb.base/vfork-follow-parent.exp
index 89c38001dac..ee1aef128bc 100644
--- a/gdb/testsuite/gdb.base/vfork-follow-parent.exp
+++ b/gdb/testsuite/gdb.base/vfork-follow-parent.exp
@@ -19,20 +19,40 @@
 # schedule-multiple on" or "set detach-on-fork on".  Test these two resolution
 # methods.
 
-standard_testfile
+standard_testfile .c vforked-prog.c
 
-if { [build_executable "failed to prepare" \
-	${testfile} ${srcfile}] } {
+set binfile ${testfile}-exit
+set binfile2 ${testfile}-exec
+set binfile3 vforked-prog
+
+set opts [list debug additional_flags=-DTEST_EXIT]
+if { [build_executable "compile ${binfile}" ${binfile} ${srcfile} ${opts}] } {
+    untested "failed to compile first test binary"
     return
 }
 
+set opts [list debug additional_flags=-DTEST_EXEC]
+if { [build_executable "compile ${binfile2}" ${binfile2} ${srcfile} ${opts}] } {
+    untested "failed to compile second test binary"
+    return
+}
+
+if { [build_executable "compile $binfile3" $binfile3 $srcfile2] } {
+    untested "failed to compile third test binary"
+    return -1
+}
+
 # Test running to the "Can not resume the parent..." message.  Then, resolve
 # the situation using the method in RESOLUTION_METHOD, either "detach-on-fork"
 # or "schedule-multiple" (the two alternatives the message suggests to the
 # user).
 
-proc do_test { resolution_method } {
-    clean_restart $::binfile
+proc do_test { exec_file resolution_method target_non_stop non_stop } {
+    save_vars { ::GDBFLAGS } {
+	append ::GDBFLAGS " -ex \"maint set target-non-stop ${target_non_stop}\""
+	append ::GDBFLAGS " -ex \"set non-stop ${non_stop}\""
+	clean_restart $exec_file
+    }
 
     gdb_test_no_output "set detach-on-fork off"
 
@@ -40,6 +60,10 @@ proc do_test { resolution_method } {
 	return
     }
 
+    # Delete the breakpoint on main so we don't hit the breakpoint in
+    # the case that the vfork child performs an exec.
+    delete_breakpoints
+
     gdb_test "break break_parent"
 
     gdb_test "continue" \
@@ -75,6 +99,16 @@ proc do_test { resolution_method } {
 	"continue to break_parent"
 }
 
-foreach_with_prefix resolution_method {detach-on-fork schedule-multiple} {
-    do_test $resolution_method
+foreach_with_prefix exec_file [list $binfile $binfile2] {
+    foreach_with_prefix target-non-stop {on off} {
+	# This test was written assuming non-stop mode is off.
+	foreach_with_prefix non-stop {off} {
+	    if {!${target-non-stop} && ${non-stop}} {
+		continue
+	    }
+	    foreach_with_prefix resolution_method {detach-on-fork schedule-multiple} {
+		do_test $exec_file $resolution_method ${target-non-stop} ${non-stop}
+	    }
+	}
+    }
 }
  
Andrew Burgess Aug. 16, 2023, 2:02 p.m. UTC | #4
Andrew Burgess <aburgess@redhat.com> writes:

> Simon Marchi <simark@simark.ca> writes:
>
>> On 2023-07-04 11:22, Andrew Burgess via Gdb-patches wrote:
>>> Like the last few commits, this fixes yet another vfork related issue.
>>> Like the commit titled:
>>> 
>>>   gdb: don't restart vfork parent while waiting for child to finish
>>> 
>>> which addressed a case in linux-nat where we would try to resume a
>>> vfork parent, this commit addresses a very similar case, but this time
>>> occurring in infrun.c.  Just like with that previous commit, this bug
>>> results in the assert:
>>> 
>>>   x86-linux-dregs.c:146: internal-error: x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
>>> 
>>> In this case the issue occurs when target-non-stop is on, but non-stop
>>> is off, and again, schedule-multiple is on.  As with the previous
>>> commit, GDB is in follow-fork-mode child.  If you have not done so, it
>>> is worth reading the earlier commit as many of the problems leading to
>>> the failure are the same, they just appear in a different part of GDB.
>>> 
>>> Here are the steps leading to the assertion failure:
>>> 
>>>   1. The user performs a 'next' over a vfork, GDB stops in the vfork
>>>   child,
>>> 
>>>   2. As we are planning to follow the child GDB sets the vfork_parent
>>>   and vfork_child member variables in the two inferiors, the
>>>   thread_waiting_for_vfork_done member is left as nullptr, that member
>>>   is only used when GDB is planning to follow the parent inferior,
>>> 
>>>   3. The user does 'continue', our expectation is that the vfork child
>>>   should resume, and once that process has exited or execd, GDB should
>>>   detach from the vfork parent.  As a result of the 'continue' GDB
>>>   eventually enters the proceed function,
>>> 
>>>   4. In proceed we selected a ptid_t to resume, because
>>>   schedule-multiple is on we select minus_one_ptid (see
>>>   user_visible_resume_ptid),
>>> 
>>>   5. As GDB is running in all-stop on top of non-stop mode, in the
>>>   proceed function we iterate over all threads that match the resume
>>>   ptid, which turns out to be all threads, and call
>>>   proceed_resume_thread_checked.  One of the threads we iterate over
>>>   is the vfork parent thread,
>>> 
>>>   6. As the thread passed to proceed_resume_thread_checked doesn't
>>>   match any of the early return conditions, GDB will set the thread
>>>   resumed,
>>> 
>>>   7. As we are resuming one thread at a time, this thread is seen by
>>>   the lower layers (e.g. linux-nat) as the "event thread", which means
>>>   we don't apply any of the checks, e.g. is this thread a
>>>   vfork parent, instead we assume that GDB core knows what it's doing,
>>>   and linux-nat will resume the thread, we have now incorrectly set
>>>   running the vfork parent thread when this thread should be waiting
>>>   for the vfork child to complete,
>>> 
>>>   8. Back in the proceed function GDB continues to iterate over all
>>>   threads, and now (correctly) resumes the vfork child thread,
>>> 
>>>   8. As the vfork child is still alive the kernel holds the vfork
>>>   parent stopped,
>>> 
>>>   9. Eventually the child performs its exec and GDB is sent an EXECD
>>>   event.  However, because the parent is resumed, as soon as the child
>>>   performs its exec the vfork parent also sends a VFORK_DONE event to
>>>   GDB,
>>> 
>>>   10. Depending on timing both of these events might seem to arrive in
>>>   GDB at the same time.  Normally GDB expects to see the EXECD or
>>>   EXITED/SIGNALED event from the vfork child before getting the
>>>   VFORK_DONE in the parent.  We know this because it is as a result of
>>>   the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
>>>   handle_vfork_child_exec_or_exit for details).  Further the comment
>>>   in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
>>>   when we remain attached to the child (not the parent) we should not
>>>   expect to see a VFORK_DONE,
>>> 
>>>   11. If both events arrive at the same time then GDB will randomly
>>>   choose one event to handle first, in some cases this will be the
>>>   VFORK_DONE.  As described above, upon seeing a VFORK_DONE GDB
>>>   expects that (a) the vfork child has finished, however, in this case
>>>   this is not completely true, the child has finished, but GDB has not
>>>   processed the event associated with the completion yet, and (b) upon
>>>   seeing a VFORK_DONE GDB assumes we are remaining attached to the
>>>   parent, and so resumes the parent process,
>>> 
>>>   12. GDB now handles the EXECD event.  In our case we are detaching
>>>   from the parent, so GDB calls target_detach (see
>>>   handle_vfork_child_exec_or_exit),
>>> 
>>>   13. While this has been going on the vfork parent is executing, and
>>>   might even exit,
>>> 
>>>   14. In linux_nat_target::detach the first thing we do is stop all
>>>   threads in the process we're detaching from, the result of the stop
>>>   request will be cached on the lwp_info object,
>>> 
>>>   15. In our case the vfork parent has exited though, so when GDB
>>>   waits for the thread, instead of a stop due to signal, we instead
>>>   get a thread exited status,
>>> 
>>>   16. Later in the detach process we try to resume the threads just
>>>   prior to making the ptrace call to actually detach (see
>>>   detach_one_lwp), as part of the process to resume a thread we try to
>>>   touch some registers within the thread, and before doing this GDB
>>>   asserts that the thread is stopped,
>>> 
>>>   17. An exited thread is not classified as stopped, and so the assert
>>>   triggers!
>>> 
>>> Just like with the earlier commit, the fix is to spot the vfork parent
>>> status of the thread, and not resume such threads.  Where the earlier
>>> commit fixed this in linux-nat, in this case I think the fix should
>>> live in infrun.c, in proceed_resume_thread_checked.  This function
>>> already has a similar check to not resume the vfork parent in the case
>>> where we are planning to follow the vfork parent, I propose adding a
>>> similar case that checks for the vfork parent when we plan to follow
>>> the vfork child.
>>> 
>>> This new check will mean that at step #6 above GDB doesn't try to
>>> resume the vfork parent thread, which prevents the VFORK_DONE from
>>> ever arriving.  Once GDB sees the EXECD/EXITED/SIGNALLED event from
>>> the vfork child GDB will detach from the parent.
>>> 
>>> There's no test included in this commit.  In a subsequent commit I
>>> will expand gdb.base/foll-vfork.exp which is when this bug would be
>>> exposed.
>>> 
>>> If you do want to reproduce this failure then you will certainly
>>> need to run the gdb.base/foll-vfork.exp test in a loop as the failures
>>> are all very timing sensitive.  I've found that running multiple
>>> copies in parallel makes the failure more likely to appear, I usually
>>> run ~6 copies in parallel and expect to see a failure within
>>> 10mins.
>>
>> Hi Andrew,
>>
>> Since this commit, I see this on native-gdbserver and
>> native-extended-gdbserver:
>>
>> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to end of inferior 2 (timeout)
>> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: inferior 1 (timeout)
>> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: print unblock_parent = 1 (timeout)
>> FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to break_parent (timeout)
>>
>> I haven't had the time to read this vfork series, but I look forward to,
>> since I also did some vfork fixes not too long ago.
>
> If I remember correctly your fixes focused on the follow-parent side of
> vfork, while the fixes I looked at focused on the follow-child side.
>
> I have some more vfork fixes that I'm working on, which I'm hoping to
> get posted soon, but I have a couple of other tasks I need to get done
> first.
>
> Anyway, below is my proposed fix for the above regressions.  The GDB
> part of the fix is trivial, then there's a bunch of changes to the above
> test script so that we check more cases.  Let me know what you think.

Given this was fixing a regression, I've gone ahead and pushed this
fix.  If there's any follow-up feedback, I'm happy to address it.

I actually tweaked the test slightly so that it would pass on boards
with an actual remote target (e.g. local-remote-host-native).  The
version I pushed is below.

Thanks,
Andrew

---

commit 05e1cac2496f842f70744dc5210fb3072ef32f3a
Author: Andrew Burgess <aburgess@redhat.com>
Date:   Sat Jul 22 15:32:29 2023 +0100

    gdb: fix vfork regressions when target-non-stop is off
    
    It was pointed out on the mailing list[1] that after this commit:
    
      commit b1e0126ec56e099d753c20e91a9f8623aabd6b46
      Date:   Wed Jun 21 14:18:54 2023 +0100
    
          gdb: don't resume vfork parent while child is still running
    
    the test gdb.base/vfork-follow-parent.exp now has some failures when
    run with the native-gdbserver or native-extended-gdbserver boards:
    
      FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to end of inferior 2 (timeout)
      FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: inferior 1 (timeout)
      FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: print unblock_parent = 1 (timeout)
      FAIL: gdb.base/vfork-follow-parent.exp: resolution_method=schedule-multiple: continue to break_parent (timeout)
    
    The reason that these failures don't show up when run on the standard
    unix board is that the test is only run in the default operating mode,
    so for Linux this will be all-stop on top of non-stop.
    
    If we adjust the test script so that it runs in the default mode and
    with target-non-stop turned off, then we see the same failures on the
    unix board.  This commit includes this change.
    
    The way that the test is written means that it is not (currently)
    possible to turn on non-stop mode and have the test still work, so
    this commit does not do that.
    
    I have also updated the test script so that the vfork child performs
    an exec as well as the current exit.  Exec and exit are the two ways
    in which a vfork child can release the vfork parent, so testing both
    of these cases is useful I think.
    
    In this test the inferior performs a vfork and the vfork-child
    immediately exits.  The vfork-parent waits for the vfork-child and
    then blocks waiting for GDB.  Once GDB has released the vfork-parent,
    the vfork-parent also exits.
    
    In the test that fails, GDB sets 'detach-on-fork off' and then runs to
    the vfork.  At this point the test tries to just "continue", but this
    fails as the vfork-parent is still selected, and the parent can't
    continue until the vfork-child completes.  As the vfork-child is
    stopped by GDB the parent will never stop once resumed, so GDB refuses
    to resume it.
    
    The test script then sets 'schedule-multiple on' and once again
    continues.  This time GDB, in theory, resumes both the parent and the
    child.  The parent will be held blocked by the kernel, but the child
    will run until it exits, at which point GDB stops again, this time
    with inferior 2, the newly exited vfork-child, selected.
    
    What happens after this in the test script is irrelevant as far as
    this failure is concerned.
    
    To understand why the test started failing we should consider the
    behaviour of four different cases:
    
      1. All-stop-on-non-stop before commit b1e0126ec56e,
    
      2. All-stop-on-non-stop after commit b1e0126ec56e,
    
      3. All-stop-on-all-stop before commit b1e0126ec56e, and
    
      4. All-stop-on-all-stop after commit b1e0126ec56e.
    
    Only case #4 is failing after commit b1e0126ec56e, but I think the
    other cases are interesting because, (a) they inform how we might fix
    the regression, and (b) it turns out the behaviour of #2 changed too
    with the commit, but the change was harmless.
    
    For #1 All-stop-on-non-stop before commit b1e0126ec56e, what happens
    is:
    
      1. GDB calls proceed with the vfork-parent selected, as schedule
         multiple is on user_visible_resume_ptid returns -1 (everything)
         as the resume_ptid (see proceed function),
    
      2. As this is all-stop-on-non-stop, every thread is resumed
         individually, so GDB tries to resume both the vfork-parent and the
         vfork-child, both of which succeed,
    
      3. The vfork-parent is held stopped by the kernel,
    
      4. The vfork-child completes (exits) at which point GDB sees the
         EXITED event for the vfork-child and the VFORK_DONE event for the
         vfork-parent,
    
      5. At this point we might take two paths depending on which event
         GDB handles first, if GDB handles the VFORK_DONE first then:
    
         (a) As GDB is controlling both parent and child the VFORK_DONE is
             ignored (see handle_vfork_done), the vfork-parent will be
             resumed,
    
         (b) GDB processes the EXITED event, selects the (now defunct)
             vfork-child, and stops, returning control to the user.
    
         Alternatively, if GDB selects the EXITED event first then:
    
         (c) GDB processes the EXITED event, selects the (now defunct)
             vfork-child, and stops, returning control to the user.
    
         (d) At some future time the user resumes the vfork-parent, at
             which point the VFORK_DONE is reported to GDB, however, GDB
             is ignoring the VFORK_DONE (see handle_vfork_done), so the
             parent is resumed.
    
    For case #2, all-stop-on-non-stop after commit b1e0126ec56e, the
    important difference is in step (2) above, now, instead of resuming
    both the vfork-parent and the vfork-child, only the vfork-child is
    resumed.  As such, when we get to step (5), only a single event, the
    EXITED event is reported.
    
    GDB handles the EXITED event just as in (5)(c).  Later, when the user
    resumes the vfork-parent, the VFORK_DONE is immediately delivered
    from the kernel, but this is ignored just as in (5)(d).  So, although
    the point at which the vfork-parent is resumed changes, the overall
    pattern of which events are reported, and when, doesn't actually
    change.  In fact, by not resuming the vfork-parent, the order of
    events (in this test) is now deterministic, which (maybe?) is a good
    thing.
    
    If we now consider case #3, all-stop-on-all-stop before commit
    b1e0126ec56e, then what happens is:
    
      1. GDB calls proceed with the vfork-parent selected, as schedule
         multiple is on user_visible_resume_ptid returns -1 (everything)
         as the resume_ptid (see proceed function),
    
      2. As this is all-stop-on-all-stop, the resume is passed down to the
         linux-nat target, the vfork-parent is the event thread, while the
         vfork-child is a sibling of the event thread,
    
      3. In linux_nat_target::resume, GDB calls linux_nat_resume_callback
         for all threads, this causes the vfork-child to be resumed.  Then
         in linux_nat_target::resume, the event thread, the vfork-parent,
         is also resumed.
    
      4. The vfork-parent is held stopped by the kernel,
    
      5. The vfork-child completes (exits) at which point GDB sees the
         EXITED event for the vfork-child and the VFORK_DONE event for the
         vfork-parent,
    
      6. We are now in a situation identical to step (5) as for
         all-stop-on-non-stop above, GDB selects one of the events to
         handle, and whichever we select the user sees the correct
         behaviour.
    
    And so, finally, we can consider #4, all-stop-on-all-stop after commit
    b1e0126ec56e, this is the case that started failing.
    
    We start out just like above, in proceed, the resume_ptid is
    -1 (resume everything), due to schedule multiple being on.  And just
    like above, due to the target being all-stop, we call
    proceed_resume_thread_checked just once, for the current thread,
    which, remember, is the vfork-parent thread.
    
    The change in commit b1e0126ec56e was to avoid resuming a vfork-parent
    thread; see that commit's message for the justification.
    
    However, this means that GDB now rejects resuming the vfork-parent in
    this case, which means that nothing gets resumed!  Obviously, if
    nothing resumes, then nothing will ever stop, and so GDB appears to
    hang.
    
    I considered a couple of solutions which, in the end, I didn't go
    with, these were:
    
      1. Move the vfork-parent check out of proceed_resume_thread_checked,
         and place it in proceed, but only on the all-stop-on-non-stop
         path, this should still address the issue seen in b1e0126ec56e,
         but would avoid the issue seen here.  I rejected this just
         because it didn't feel great to split the checks that exist in
         proceed_resume_thread_checked like this,
    
      2. Extend the condition in proceed_resume_thread_checked by adding a
         target_is_non_stop_p check.  This would have the same effect as
         idea 1, but leaves all the checks in the same place, which I
         think would be better, but this still just didn't feel right to
         me.
    
    What I noticed was that for all-stop-on-non-stop, after commit
    b1e0126ec56e, we only resumed the vfork-child, and this seems fine.
    The vfork-parent isn't going to run anyway (the kernel will hold it
    back), so it feels like there's no harm in just waiting for the
    child to complete, and then resuming the parent.
    
    So then I started looking at follow_fork, which is called from the top
    of proceed.  This function already has the task of switching between
    the parent and child based on which the user wishes to follow.  So, I
    wondered, could we use this to switch to the vfork-child in the case
    that we are attached to both?
    
    Turns out this is pretty simple to do.
    
    Having done that, the process for all-stop-on-all-stop, after commit
    b1e0126ec56e and with this new fix, is now:
    
      1. GDB calls proceed with the vfork-parent selected, but,
    
      2. In follow_fork, and follow_fork_inferior, GDB switches the
         selected thread to be that of the vfork-child,
    
      3. Back in proceed user_visible_resume_ptid returns -1 (everything)
         as the resume_ptid still, but now,
    
      4. When GDB calls proceed_resume_thread_checked, the vfork-child is
         the current selected thread, this is not a vfork-parent, and so
         GDB allows the proceed to continue to the linux-nat target,
    
      5. In linux_nat_target::resume, GDB calls linux_nat_resume_callback
         for all threads, this does not resume the vfork-parent (because
         it is a vfork-parent), and then the vfork-child is resumed as
         this is the event thread,
    
    At this point we are back in the same situation as for
    all-stop-on-non-stop after commit b1e0126ec56e, that is, the
    vfork-child is resumed, while the vfork-parent is held stopped by
    GDB.
    
    Eventually the vfork-child will exit or exec, at which point the
    vfork-parent will be resumed.
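    
    The numbered steps above can be sketched as a small model.  As
    before, everything here is an illustrative assumption: the session
    struct and the names follow_fork_model, resume_allowed and
    proceed_model are hypothetical and only mirror the shape of the real
    functions, and the model covers just the case where GDB is attached
    to both the vfork parent and the vfork child.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, stripped-down stand-ins for GDB's data structures.
struct inferior
{
  int pid = 0;
  inferior *vfork_child = nullptr;   // Set on the vfork parent.
};

struct thread_info
{
  inferior *inf = nullptr;
  bool resumed = false;
};

struct session
{
  std::vector<thread_info *> threads;
  thread_info *current = nullptr;
  bool sched_multi = false;          // Models "set schedule-multiple".
};

// Step 2: if the selected thread belongs to a vfork parent and
// schedule-multiple is on, select a thread of the vfork child instead.
void
follow_fork_model (session &s)
{
  inferior *child = s.current->inf->vfork_child;
  if (child == nullptr || !s.sched_multi)
    return;
  for (thread_info *t : s.threads)
    if (t->inf == child)
      {
	s.current = t;
	return;
      }
}

// The vfork-parent check: threads of a vfork parent stay stopped.
bool
resume_allowed (const thread_info *tp)
{
  return tp->inf->vfork_child == nullptr;
}

// Steps 1 to 5 combined.  Returns the number of threads resumed.
int
proceed_model (session &s)
{
  assert (s.current != nullptr);
  follow_fork_model (s);

  // Step 4: the selected thread must pass the check, otherwise nothing
  // is resumed at all.
  if (!resume_allowed (s.current))
    return 0;

  // Step 5: the target resumes every thread except those belonging to
  // a vfork parent.
  int resumed_count = 0;
  for (thread_info *t : s.threads)
    if (resume_allowed (t))
      {
	t->resumed = true;
	++resumed_count;
      }
  return resumed_count;
}
```

    In this model, starting with the vfork-parent thread selected and
    sched_multi set, proceed_model switches the selection to the child
    thread and resumes only that thread, leaving the parent stopped.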
    
    [1] https://inbox.sourceware.org/gdb-patches/3e1e1db0-13d9-dd32-b4bb-051149ae6e76@simark.ca/

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 8286026e6c6..72852e63906 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -713,7 +713,7 @@ holding the child stopped.  Try \"set detach-on-fork\" or \
 	 (do not restore the parent as the current inferior).  */
       gdb::optional<scoped_restore_current_thread> maybe_restore;
 
-      if (!follow_child)
+      if (!follow_child && !sched_multi)
 	maybe_restore.emplace ();
 
       switch_to_thread (*child_inf->threads ().begin ());
@@ -3400,8 +3400,10 @@ proceed (CORE_ADDR addr, enum gdb_signal siggnal)
   struct gdbarch *gdbarch;
   CORE_ADDR pc;
 
-  /* If we're stopped at a fork/vfork, follow the branch set by the
-     "set follow-fork-mode" command; otherwise, we'll just proceed
+  /* If we're stopped at a fork/vfork, switch to either the parent or child
+     thread as defined by the "set follow-fork-mode" command, or, if both
+     the parent and child are controlled by GDB, and schedule-multiple is
+     on, follow the child.  If none of the above apply then we just proceed
      resuming the current thread.  */
   if (!follow_fork ())
     {
diff --git a/gdb/testsuite/gdb.base/vfork-follow-parent.c b/gdb/testsuite/gdb.base/vfork-follow-parent.c
index df45b9c2dbe..15ff84a0bad 100644
--- a/gdb/testsuite/gdb.base/vfork-follow-parent.c
+++ b/gdb/testsuite/gdb.base/vfork-follow-parent.c
@@ -17,6 +17,10 @@
 
 #include <unistd.h>
 
+#include <string.h>
+#include <limits.h>
+#include <stdio.h>
+
 static volatile int unblock_parent = 0;
 
 static void
@@ -25,7 +29,7 @@ break_parent (void)
 }
 
 int
-main (void)
+main (int argc, char **argv)
 {
   alarm (30);
 
@@ -40,7 +44,28 @@ main (void)
       break_parent ();
     }
   else
-    _exit (0);
+    {
+#if defined TEST_EXEC
+      char prog[PATH_MAX];
+      int len;
+
+      strcpy (prog, argv[0]);
+      len = strlen (prog);
+      for (; len > 0; --len)
+	{
+	  if (prog[len - 1] == '/')
+	    break;
+	}
+      strcpy (&prog[len], "vforked-prog");
+      execlp (prog, prog, (char *) 0);
+      perror ("exec failed");
+      _exit (1);
+#elif defined TEST_EXIT
+      _exit (0);
+#else
+#error Define TEST_EXEC or TEST_EXIT
+#endif
+    }
 
   return 0;
 }
diff --git a/gdb/testsuite/gdb.base/vfork-follow-parent.exp b/gdb/testsuite/gdb.base/vfork-follow-parent.exp
index 89c38001dac..70b54e729a5 100644
--- a/gdb/testsuite/gdb.base/vfork-follow-parent.exp
+++ b/gdb/testsuite/gdb.base/vfork-follow-parent.exp
@@ -19,10 +19,28 @@
 # schedule-multiple on" or "set detach-on-fork on".  Test these two resolution
 # methods.
 
-standard_testfile
+standard_testfile .c vforked-prog.c
 
-if { [build_executable "failed to prepare" \
-	${testfile} ${srcfile}] } {
+set binfile ${testfile}-exit
+set binfile2 ${testfile}-exec
+set binfile3 vforked-prog
+
+if { [build_executable "compile $binfile3" $binfile3 $srcfile2] } {
+    untested "failed to compile third test binary"
+    return -1
+}
+
+set remote_exec_prog [gdb_remote_download target $binfile3]
+
+set opts [list debug additional_flags=-DTEST_EXIT]
+if { [build_executable "compile ${binfile}" ${binfile} ${srcfile} ${opts}] } {
+    untested "failed to compile first test binary"
+    return
+}
+
+set opts [list debug additional_flags=-DTEST_EXEC]
+if { [build_executable "compile ${binfile2}" ${binfile2} ${srcfile} ${opts}] } {
+    untested "failed to compile second test binary"
     return
 }
 
@@ -31,8 +49,12 @@ if { [build_executable "failed to prepare" \
 # or "schedule-multiple" (the two alternatives the message suggests to the
 # user).
 
-proc do_test { resolution_method } {
-    clean_restart $::binfile
+proc do_test { exec_file resolution_method target_non_stop non_stop } {
+    save_vars { ::GDBFLAGS } {
+	append ::GDBFLAGS " -ex \"maint set target-non-stop ${target_non_stop}\""
+	append ::GDBFLAGS " -ex \"set non-stop ${non_stop}\""
+	clean_restart $exec_file
+    }
 
     gdb_test_no_output "set detach-on-fork off"
 
@@ -40,6 +62,10 @@ proc do_test { resolution_method } {
 	return
     }
 
+    # Delete the breakpoint on main so we don't hit the breakpoint in
+    # the case that the vfork child performs an exec.
+    delete_breakpoints
+
     gdb_test "break break_parent"
 
     gdb_test "continue" \
@@ -75,6 +101,16 @@ proc do_test { resolution_method } {
 	"continue to break_parent"
 }
 
-foreach_with_prefix resolution_method {detach-on-fork schedule-multiple} {
-    do_test $resolution_method
+foreach_with_prefix exec_file [list $binfile $binfile2] {
+    foreach_with_prefix target-non-stop {on off} {
+	# This test was written assuming non-stop mode is off.
+	foreach_with_prefix non-stop {off} {
+	    if {!${target-non-stop} && ${non-stop}} {
+		continue
+	    }
+	    foreach_with_prefix resolution_method {detach-on-fork schedule-multiple} {
+		do_test $exec_file $resolution_method ${target-non-stop} ${non-stop}
+	    }
+	}
+    }
 }
  
Tom de Vries Aug. 17, 2023, 6:36 a.m. UTC | #5
On 8/16/23 16:02, Andrew Burgess via Gdb-patches wrote:
> +set remote_exec_prog [gdb_remote_download target $binfile3]

With testing on target board unix I ran into:
...
ERROR: tcl error sourcing 
/data/vries/gdb/src/gdb/testsuite/gdb.base/vfork-follow-parent.exp.
ERROR: error copying "vforked-prog": no such file or directory
     while executing
"file copy -force $fromfile $tofile"
     (procedure "gdb_remote_download" line 29)
     invoked from within
"gdb_remote_download target $binfile3"
...

I managed to reproduce it, also with the mentioned target boards 
native-gdbserver and native-extended-gdbserver.

Did you mean something like:
...
if {[is_remote host]} {
     remote_upload target $binfile3
}
...
?

With that instead, I still see these FAILs with gdbserver boards:
...
FAIL: gdb.base/vfork-follow-parent.exp: 
exec_file=vfork-follow-parent-exec: target-non-stop=off: non-stop=off: 
resolution_method=schedule-multiple: continue to end of inferior 2
FAIL: gdb.base/vfork-follow-parent.exp: 
exec_file=vfork-follow-parent-exec: target-non-stop=off: non-stop=off: 
resolution_method=schedule-multiple: continue to break_parent
...
but not 100% reproducible.

Thanks,
- Tom
  
Tom de Vries Aug. 17, 2023, 7:01 a.m. UTC | #6
On 8/17/23 08:36, Tom de Vries wrote:
> On 8/16/23 16:02, Andrew Burgess via Gdb-patches wrote:
>> +set remote_exec_prog [gdb_remote_download target $binfile3]
> 
> With testing on target board unix I ran into:
> ...
> ERROR: tcl error sourcing 
> /data/vries/gdb/src/gdb/testsuite/gdb.base/vfork-follow-parent.exp.
> ERROR: error copying "vforked-prog": no such file or directory
>      while executing
> "file copy -force $fromfile $tofile"
>      (procedure "gdb_remote_download" line 29)
>      invoked from within
> "gdb_remote_download target $binfile3"
> ...
> 
> I managed to reproduce it, also with the mentioned target boards 
> native-gdbserver and native-extended-gdbserver.
> 
> Did you mean something like:
> ...
> if {[is_remote host]} {
>      remote_upload target $binfile3
> }
> ...
> ?
> 

Oops, of course that should be:
...
if {[is_remote target]} {
     remote_upload target $binfile3
}
...

Thanks,
- Tom

  
Tom de Vries Aug. 17, 2023, 8:06 a.m. UTC | #7
On 8/17/23 09:01, Tom de Vries via Gdb-patches wrote:
> On 8/17/23 08:36, Tom de Vries wrote:
>> On 8/16/23 16:02, Andrew Burgess via Gdb-patches wrote:
>>> +set remote_exec_prog [gdb_remote_download target $binfile3]
>>
>> With testing on target board unix I ran into:
>> ...
>> ERROR: tcl error sourcing 
>> /data/vries/gdb/src/gdb/testsuite/gdb.base/vfork-follow-parent.exp.
>> ERROR: error copying "vforked-prog": no such file or directory
>>      while executing
>> "file copy -force $fromfile $tofile"
>>      (procedure "gdb_remote_download" line 29)
>>      invoked from within
>> "gdb_remote_download target $binfile3"
>> ...
>>
>> I managed to reproduce it, also with the mentioned target boards 
>> native-gdbserver and native-extended-gdbserver.
>>
>> Did you mean something like:
>> ...
>> if {[is_remote host]} {
>>      remote_upload target $binfile3
>> }
>> ...
>> ?
>>
> 
> Oops, of course that should be:
> ...
>   if {[is_remote target]} {
>        remote_upload target $binfile3
> }
> ...
> 

I've got a patch that works, I'll commit shortly.

FWIW, I got my uploads and downloads confused; I forgot that DejaGnu
inverts the meaning of the concepts.

Thanks,
- Tom

  
Tom de Vries Aug. 17, 2023, 8:22 a.m. UTC | #8
On 8/17/23 10:06, Tom de Vries via Gdb-patches wrote:
> On 8/17/23 09:01, Tom de Vries via Gdb-patches wrote:
>> On 8/17/23 08:36, Tom de Vries wrote:
>>> On 8/16/23 16:02, Andrew Burgess via Gdb-patches wrote:
>>>> +set remote_exec_prog [gdb_remote_download target $binfile3]
>>>
>>> With testing on target board unix I ran into:
>>> ...
>>> ERROR: tcl error sourcing 
>>> /data/vries/gdb/src/gdb/testsuite/gdb.base/vfork-follow-parent.exp.
>>> ERROR: error copying "vforked-prog": no such file or directory
>>>      while executing
>>> "file copy -force $fromfile $tofile"
>>>      (procedure "gdb_remote_download" line 29)
>>>      invoked from within
>>> "gdb_remote_download target $binfile3"
>>> ...
>>>
>>> I managed to reproduce it, also with the mentioned target boards 
>>> native-gdbserver and native-extended-gdbserver.
>>>
>>> Did you mean something like:
>>> ...
>>> if {[is_remote host]} {
>>>      remote_upload target $binfile3
>>> }
>>> ...
>>> ?
>>>
>>
>> Oops, of course that should be:
>> ...
>>   if {[is_remote target]} {
>>        remote_upload target $binfile3
>> }
>> ...
>>
> 
> I've got a patch that works, I'll commit shortly.
> 

https://sourceware.org/pipermail/gdb-patches/2023-August/201694.html

Thanks,
- Tom

  

Patch

diff --git a/gdb/infrun.c b/gdb/infrun.c
index 010fcd7952f..2d2f7d67a0f 100644
--- a/gdb/infrun.c
+++ b/gdb/infrun.c
@@ -3299,10 +3299,12 @@  proceed_resume_thread_checked (thread_info *tp)
     }
 
   /* When handling a vfork GDB removes all breakpoints from the program
-     space in which the vfork is being handled, as such we must take care
-     not to resume any thread other than the vfork parent -- resuming the
-     vfork parent allows GDB to receive and handle the 'vfork done'
-     event.  */
+     space in which the vfork is being handled.  If we are following the
+     parent then GDB will set the thread_waiting_for_vfork_done member of
+     the parent inferior.  In this case we should take care to only resume
+     the vfork parent thread, the kernel will hold this thread suspended
+     until the vfork child has exited or execd, at which point the parent
+     will be resumed and a VFORK_DONE event sent to GDB.  */
   if (tp->inf->thread_waiting_for_vfork_done != nullptr)
     {
       if (target_is_non_stop_p ())
@@ -3341,6 +3343,20 @@  proceed_resume_thread_checked (thread_info *tp)
 	}
     }
 
+  /* When handling a vfork GDB removes all breakpoints from the program
+     space in which the vfork is being handled.  If we are following the
+     child then GDB will set the vfork_child member of the vfork parent
+     inferior.  Once the child has either exited or execd then GDB will
+     detach from the parent process.  Until that point GDB should not
+     resume any thread in the parent process.  */
+  if (tp->inf->vfork_child != nullptr)
+    {
+      infrun_debug_printf ("[%s] thread is part of a vfork parent, child is %d",
+			   tp->ptid.to_string ().c_str (),
+			   tp->inf->vfork_child->pid);
+      return;
+    }
+
   infrun_debug_printf ("resuming %s",
 		       tp->ptid.to_string ().c_str ());