[PATCHv2,2/8] gdb: don't restart vfork parent while waiting for child to finish

Message ID a9b31c5abcb5c63bb329c62be568ca0c3a139692.1688484032.git.aburgess@redhat.com
State New
Series Some vfork related fixes

Commit Message

Andrew Burgess July 4, 2023, 3:22 p.m. UTC
  While working on a later patch, which changes gdb.base/foll-vfork.exp,
I noticed that sometimes I would hit this assert:

  x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.

I eventually tracked it down to a combination of schedule-multiple
mode being on, target-non-stop being off, follow-fork-mode being set
to child, and some bad timing.  The failing case is pretty simple: a
single-threaded application performs a vfork, the child process then
execs some other application, while the parent process (once the vfork
child has completed its exec) just exits.  As best I understand
things, here's what happens when things go wrong:

  1. The parent process performs a vfork, GDB sees the VFORKED event
  and creates an inferior and thread for the vfork child,

  2. GDB resumes the vfork child process.  As schedule-multiple is on
  and target-non-stop is off, this is translated into a request to
  start all processes (see user_visible_resume_ptid),

  3. In the linux-nat layer we spot that one of the threads we are
  about to start is a vfork parent, and so don't start that
  thread (see resume_lwp), the vfork child thread is resumed,

  4. GDB waits for the next event, eventually entering
  linux_nat_target::wait, which in turn calls linux_nat_wait_1,

  5. In linux_nat_wait_1 we eventually call
  resume_stopped_resumed_lwps, this should restart threads that have
  stopped but don't actually have anything interesting to report.

  6. Unfortunately, resume_stopped_resumed_lwps doesn't check for
  vfork parents like resume_lwp does, so at this point the vfork
  parent is resumed.  This feels like the start of the bug, and this
  is where I'm proposing to fix things, but, resuming the vfork parent
  isn't the worst thing in the world because....

  7. As the vfork child is still alive the kernel holds the vfork
  parent stopped,

  8. Eventually the child performs its exec and GDB is sent an EXECD
  event.  However, because the parent is resumed, as soon as the child
  performs its exec the vfork parent also sends a VFORK_DONE event to
  GDB,

  9. Depending on timing both of these events might seem to arrive in
  GDB at the same time.  Normally GDB expects to see the EXECD or
  EXITED/SIGNALED event from the vfork child before getting the
  VFORK_DONE in the parent.  We know this because it is as a result of
  the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
  handle_vfork_child_exec_or_exit for details).  Further, the comment
  in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
  when we remain attached to the child (not the parent) we should not
  expect to see a VFORK_DONE,

  10. If both events arrive at the same time then GDB will randomly
  choose one event to handle first, and in some cases this will be the
  VFORK_DONE.  As described above, upon seeing a VFORK_DONE GDB
  expects that (a) the vfork child has finished; in this case that is
  not completely true, as the child has finished but GDB has not yet
  processed the event associated with the completion, and (b) GDB
  assumes we are remaining attached to the parent, and so resumes the
  parent process,

  11. GDB now handles the EXECD event.  In our case we are detaching
  from the parent, so GDB calls target_detach (see
  handle_vfork_child_exec_or_exit),

  12. While this has been going on the vfork parent is executing, and
  might even exit,

  13. In linux_nat_target::detach the first thing we do is stop all
  threads in the process we're detaching from, the result of the stop
  request will be cached on the lwp_info object,

  14. In our case the vfork parent has exited though, so when GDB
  waits for the thread, rather than a stop due to a signal, we get a
  thread exited status,

  15. Later in the detach process we try to resume the threads just
  prior to making the ptrace call to actually detach (see
  detach_one_lwp), as part of the process to resume a thread we try to
  touch some registers within the thread, and before doing this GDB
  asserts that the thread is stopped,

  16. An exited thread is not classified as stopped, and so the assert
  triggers!
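
For reference, the inferior behaviour described before the numbered
steps can be sketched roughly as below.  This is a hypothetical
stand-in, not the gdb.base/foll-vfork.c source; the function name
vfork_then_exec and the program path used in the note that follows are
assumptions made for illustration only.

```cpp
// Hypothetical stand-in for the inferior described above; this is not the
// gdb.base/foll-vfork.c source.
#include <cassert>
#include <sys/types.h>
#include <unistd.h>

// vfork, then exec PROGRAM in the child.  Returns the child's pid in the
// parent, or -1 if vfork itself failed.
pid_t
vfork_then_exec (const char *program)
{
  assert (program != nullptr);

  pid_t pid = vfork ();
  if (pid == 0)
    {
      // Child: only exec or _exit are safe here, since the child borrows
      // the parent's address space until one of them happens.
      execl (program, program, (char *) nullptr);
      _exit (127);  // Only reached if the exec failed.
    }

  // Parent: the kernel holds us here until the child has exec'd or exited.
  return pid;
}
```

A main that calls vfork_then_exec ("/bin/true") (path assumed) and then
returns straight away would match the "parent just exits" part of the
scenario.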

So there are two bugs I see here.  The first, and most critical, is
in step #6.  I think that resume_stopped_resumed_lwps should not
resume a vfork parent, just like resume_lwp doesn't resume a vfork
parent.
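
As a rough illustration of the check being proposed, here is a small
standalone model.  The types model_inferior and model_lwp and the
function reason_not_to_resume are hypothetical stand-ins, not GDB's
inferior or lwp_info, and only the two conditions visible in this patch
are modelled; whatever else the real function checks is left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for GDB's inferior and lwp_info; only the fields
// the patch looks at are modelled.
struct model_inferior
{
  // Non-null while this inferior is a vfork parent waiting on its child.
  const model_inferior *vfork_child = nullptr;
};

struct model_lwp
{
  const model_inferior *inf = nullptr;
  bool stopped = false;
};

// Returns the reason the LWP is left alone, or an empty string if the
// checks modelled here do not object to resuming it.
std::string
reason_not_to_resume (const model_lwp &lp)
{
  assert (lp.inf != nullptr);

  if (lp.inf->vfork_child != nullptr)
    return "vfork parent";
  else if (!lp.stopped)
    return "not stopped";
  return "";
}
```

The vfork parent test comes first so that a stopped vfork parent is
never considered for resumption, which is the ordering the patch uses.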

With this change in place the vfork parent is not resumed in step #6
and so remains stopped; GDB will then only see the
EXECD/EXITED/SIGNALED event.  The problems in #9 and #10 are therefore
skipped and we arrive at #11, handling the EXECD event.  As the parent
is still stopped #12 doesn't apply, and in #13 when we try to stop the
process we will see that it is already stopped, so there's no risk of
the vfork parent exiting before we get to this point.  And finally, in
step #15 we are safe to poke the process registers because the process
will not have exited by this point.

However, I did mention two bugs.

The second bug I've not yet managed to actually trigger, but I'm
convinced it must exist: if we forget vforks for a moment, in step #13
above, when linux_nat_target::detach is called, we first try to stop
all threads in the process GDB is detaching from.  If we imagine a
multi-threaded inferior with many threads, and GDB running in non-stop
mode, then, if the user tries to detach there is a chance that a thread
could exit just as linux_nat_target::detach is entered, in which case
we should be able to trigger the same assert.
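
Steps 13 to 16 and this second scenario share the same shape, which the
toy model below tries to capture.  Everything in it (stop_result,
model_lwp_state, cache_stop_result, safe_to_touch_registers) is a
hypothetical name invented for illustration; it does not reproduce
GDB's data structures and is not a proposed fix.

```cpp
#include <cassert>

// Hypothetical model of the state cached on an LWP after a stop request.
enum class stop_result { stopped_by_signal, thread_exited };

struct model_lwp_state
{
  bool stopped = false;  // Stand-in for what lwp_is_stopped would report.
  bool exited = false;
};

// Record the outcome of waiting for the stop request, as in steps 13 and 14.
model_lwp_state
cache_stop_result (stop_result result)
{
  model_lwp_state lwp;
  if (result == stop_result::stopped_by_signal)
    lwp.stopped = true;
  else
    lwp.exited = true;  // An exited thread is never marked as stopped.
  return lwp;
}

// The precondition that the register update in step 15 asserts on.
bool
safe_to_touch_registers (const model_lwp_state &lwp)
{
  assert (!(lwp.stopped && lwp.exited));
  return lwp.stopped;
}
```

In this model a thread that exits between the stop request and the
wait ends up with stopped still false, which is the state the real
assert objects to.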

But, like I said, I've not (yet) managed to trigger this second bug,
and even if I could, the fix would not belong in this commit, so I'm
pointing this out just for completeness.

There's no test included in this commit.  In a couple of commits' time
I will expand gdb.base/foll-vfork.exp, which is when this bug would be
exposed.  Unfortunately there are at least two other bugs (separate
from the ones discussed above) that need fixing first; these will be
fixed in the next commits before the gdb.base/foll-vfork.exp test is
expanded.

If you do want to reproduce this failure then you will certainly
need to run the gdb.base/foll-vfork.exp test in a loop, as the failures
are all very timing sensitive.  I've found that running multiple
copies in parallel makes the failure more likely to appear; I usually
run ~6 copies in parallel and expect to see a failure within
10 minutes.
---
 gdb/linux-nat.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
  

Comments

Terekhov, Mikhail via Gdb-patches July 5, 2023, 10:08 a.m. UTC | #1
On Tuesday, July 4, 2023 5:23 PM, Andrew Burgess wrote:
> diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
> index 383ef58fa23..7e121b7ab41 100644
> --- a/gdb/linux-nat.c
> +++ b/gdb/linux-nat.c
> @@ -3346,7 +3346,14 @@ linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
>  static int
>  resume_stopped_resumed_lwps (struct lwp_info *lp, const ptid_t wait_ptid)
>  {
> -  if (!lp->stopped)
> +  struct inferior *inf = find_inferior_ptid (linux_target, lp->ptid);

Nit: The 'struct' keyword can be omitted.

-Baris



Patch

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 383ef58fa23..7e121b7ab41 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -3346,7 +3346,14 @@  linux_nat_wait_1 (ptid_t ptid, struct target_waitstatus *ourstatus,
 static int
 resume_stopped_resumed_lwps (struct lwp_info *lp, const ptid_t wait_ptid)
 {
-  if (!lp->stopped)
+  struct inferior *inf = find_inferior_ptid (linux_target, lp->ptid);
+
+  if (inf->vfork_child != nullptr)
+    {
+      linux_nat_debug_printf ("NOT resuming LWP %s (vfork parent)",
+			      lp->ptid.to_string ().c_str ());
+    }
+  else if (!lp->stopped)
     {
       linux_nat_debug_printf ("NOT resuming LWP %s, not stopped",
 			      lp->ptid.to_string ().c_str ());