[PATCHv2,2/8] gdb: don't restart vfork parent while waiting for child to finish

  While working on a later patch, which changes gdb.base/foll-vfork.exp,
I noticed that sometimes I would hit this assert:

  x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.

I eventually tracked it down to a combination of schedule-multiple
mode being on, target-non-stop being off, follow-fork-mode being set
to child, and some bad timing.  The failing case is pretty simple: a
single-threaded application performs a vfork, the child process then
execs some other application, while the parent process (once the vfork
child has completed its exec) just exits.  As best I understand
things, here's what happens when things go wrong:

  1. The parent process performs a vfork, GDB sees the VFORKED event
  and creates an inferior and thread for the vfork child,

  2. GDB resumes the vfork child process.  As schedule-multiple is on
  and target-non-stop is off, this is translated into a request to
  start all processes (see user_visible_resume_ptid),

  3. In the linux-nat layer we spot that one of the threads we are
  about to start is a vfork parent, and so don't start that
  thread (see resume_lwp), the vfork child thread is resumed,

  4. GDB waits for the next event, eventually entering
  linux_nat_target::wait, which in turn calls linux_nat_wait_1,

  5. In linux_nat_wait_1 we eventually call
  resume_stopped_resumed_lwps, this should restart threads that have
  stopped but don't actually have anything interesting to report.

  6. Unfortunately, resume_stopped_resumed_lwps doesn't check for
  vfork parents like resume_lwp does, so at this point the vfork
  parent is resumed.  This feels like the start of the bug, and this
  is where I'm proposing to fix things, but, resuming the vfork parent
  isn't the worst thing in the world because....

  7. As the vfork child is still alive the kernel holds the vfork
  parent stopped,

  8. Eventually the child performs its exec and GDB is sent an EXECD
  event.  However, because the parent is resumed, as soon as the child
  performs its exec the vfork parent also sends a VFORK_DONE event to
  GDB,

  9. Depending on timing both of these events might seem to arrive in
  GDB at the same time.  Normally GDB expects to see the EXECD or
  EXITED/SIGNALED event from the vfork child before getting the
  VFORK_DONE in the parent.  We know this because it is as a result of
  the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
  handle_vfork_child_exec_or_exit for details).  Further the comment
  in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
  when we remain attached to the child (not the parent) we should not
  expect to see a VFORK_DONE,

  10. If both events arrive at the same time then GDB will randomly
  choose one event to handle first, and in some cases this will be the
  VFORK_DONE.  As described above, upon seeing a VFORK_DONE GDB
  assumes that (a) the vfork child has finished; in this case that is
  only partly true: the child has finished, but GDB has not yet
  processed the event associated with the completion, and (b) we are
  remaining attached to the parent, and so GDB resumes the parent
  process,

  11. GDB now handles the EXECD event.  In our case we are detaching
  from the parent, so GDB calls target_detach (see
  handle_vfork_child_exec_or_exit),

  12. While this has been going on the vfork parent is executing, and
  might even exit,

  13. In linux_nat_target::detach the first thing we do is stop all
  threads in the process we're detaching from, the result of the stop
  request will be cached on the lwp_info object,

  14. In our case the vfork parent has exited though, so when GDB
  waits for the thread, instead of a stop due to signal, we instead
  get a thread exited status,

  15. Later in the detach process we try to resume the threads just
  prior to making the ptrace call to actually detach (see
  detach_one_lwp), as part of the process to resume a thread we try to
  touch some registers within the thread, and before doing this GDB
  asserts that the thread is stopped,

  16. An exited thread is not classified as stopped, and so the assert
  triggers!
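
Steps 13 to 16 can be illustrated with a small toy model.  This is only
a sketch: toy_lwp, toy_state and the toy_* functions are hypothetical
names invented for this illustration, they are not the real lwp_info or
linux-nat.c code.  It shows why a thread that exited before the stop
request took effect fails the "is stopped" check that guards register
access.

```cpp
#include <cassert>

/* Hypothetical, simplified thread state; not GDB's real lwp_info.  */
enum class toy_state { running, stopped, exited };

struct toy_lwp
{
  toy_state state = toy_state::running;
};

/* Models the idea behind lwp_is_stopped: only a thread that really
   stopped counts, an exited thread does not.  */
bool
toy_lwp_is_stopped (const toy_lwp &lwp)
{
  return lwp.state == toy_state::stopped;
}

/* Models steps 13 and 14: request a stop, wait, and cache the result.
   If the thread exited first we record an exit instead of a stop.  */
void
toy_stop_and_wait (toy_lwp &lwp, bool exited_before_stop)
{
  lwp.state = exited_before_stop ? toy_state::exited : toy_state::stopped;

  /* After waiting, the thread is never still running.  */
  assert (lwp.state != toy_state::running);
}

/* Models the precondition from step 15: registers may only be touched
   on a stopped thread.  Returns false where GDB would assert.  */
bool
toy_safe_to_touch_registers (const toy_lwp &lwp)
{
  return toy_lwp_is_stopped (lwp);
}
```

In this model, calling toy_stop_and_wait with exited_before_stop set to
true leaves the thread in a state where toy_safe_to_touch_registers
returns false, which corresponds to the assert in step 16.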

So there are two bugs I see here.  The first, and most critical one,
is in step #6.  I think that resume_stopped_resumed_lwps should not
resume a vfork parent, just like resume_lwp doesn't resume a vfork
parent.
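
The shape of the proposed check can be sketched as below.  This is a
minimal model under stated assumptions, not the actual change to
gdb/linux-nat.c: toy_lwp, its fields, toy_should_restart and
toy_resume_stopped_resumed are all hypothetical names made up for this
example.  The only point is that a stopped, resumed thread with nothing
to report is restarted unless it is a vfork parent.

```cpp
#include <cassert>
#include <vector>

/* Hypothetical, simplified per-thread record; not GDB's lwp_info.  */
struct toy_lwp
{
  int pid;
  bool stopped;             /* Thread is currently stopped.  */
  bool resumed;             /* Core asked for this thread to run.  */
  bool has_pending_status;  /* There is an event still to report.  */
  bool is_vfork_parent;     /* Thread is waiting on a vfork child.  */
};

/* Should a "restart stopped but resumed threads" pass restart LWP?  */
bool
toy_should_restart (const toy_lwp &lwp)
{
  if (!lwp.stopped || !lwp.resumed || lwp.has_pending_status)
    return false;

  /* The extra check argued for above: leave a vfork parent stopped
     until the vfork child has finished.  */
  if (lwp.is_vfork_parent)
    return false;

  return true;
}

/* Restart every qualifying thread, returning the pids restarted.  */
std::vector<int>
toy_resume_stopped_resumed (std::vector<toy_lwp> &lwps)
{
  std::vector<int> restarted;

  for (toy_lwp &lwp : lwps)
    if (toy_should_restart (lwp))
      {
	assert (!lwp.is_vfork_parent);
	lwp.stopped = false;
	restarted.push_back (lwp.pid);
      }

  return restarted;
}
```

Given a vfork parent, a running vfork child, and one unrelated stopped
thread, only the unrelated thread is restarted in this model, and the
vfork parent stays stopped.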

With this change in place the vfork parent will remain stopped in
step #6, so no VFORK_DONE is sent in step #8; instead GDB will only see
the EXECD/EXITED/SIGNALLED event.  The problems in #9 and #10 are
therefore skipped and we arrive at #11,
handling the EXECD event.  As the parent is still stopped #12 doesn't
apply, and in #13 when we try to stop the process we will see that it
is already stopped, there's no risk of the vfork parent exiting before
we get to this point.  And finally, in #15 we are safe to poke the
process registers because it will not have exited by this point.

However, I did mention two bugs.

The second bug I've not yet managed to actually trigger, but I'm
convinced it must exist: if we forget vforks for a moment, in step #13
above, when linux_nat_target::detach is called, we first try to stop
all threads in the process GDB is detaching from.  If we imagine a
multi-threaded inferior with many threads, and GDB running in non-stop
mode, then, if the user tries to detach there is a chance that a thread
could exit just as linux_nat_target::detach is entered, in which case
we should be able to trigger the same assert.

But, like I said, I've not (yet) managed to trigger this second bug,
and even if I could, the fix would not belong in this commit, so I'm
pointing this out just for completeness.

There's no test included in this commit.  In a couple of commits' time
I will expand gdb.base/foll-vfork.exp, which is when this bug would be
exposed.  Unfortunately there are at least two other bugs (separate
from the ones discussed above) that need fixing first; these will be
fixed in the next commits before the gdb.base/foll-vfork.exp test is
expanded.

If you do want to reproduce this failure then you will certainly
need to run the gdb.base/foll-vfork.exp test in a loop, as the failures
are all very timing sensitive.  I've found that running multiple
copies in parallel makes the failure more likely to appear; I usually
run ~6 copies in parallel and expect to see a failure within
10 minutes.
---
 gdb/linux-nat.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Message ID	a9b31c5abcb5c63bb329c62be568ca0c3a139692.1688484032.git.aburgess@redhat.com
State	New
From	Andrew Burgess <aburgess@redhat.com>
To	gdb-patches@sourceware.org
Cc	Andrew Burgess <aburgess@redhat.com>, tankut.baris.aktemur@intel.com
Date	Tue, 4 Jul 2023 16:22:52 +0100
Series	Some vfork related fixes
  [PATCHv2,0/8] Some vfork related fixes
  [PATCHv2,1/8] gdb: catch more errors in gdb.base/foll-vfork.exp
  [PATCHv2,2/8] gdb: don't restart vfork parent while waiting for child to finish
  [PATCHv2,3/8] gdb: fix an issue with vfork in non-stop mode
  [PATCHv2,4/8] gdb, infrun: refactor part of `proceed` into separate function
  [PATCHv2,5/8] gdb: don't resume vfork parent while child is still running
  [PATCHv2,6/8] gdb/testsuite: expand gdb.base/foll-vfork.exp
  [PATCHv2,7/8] gdb/testsuite: remove use of sleep from gdb.base/foll-vfork.exp
  [PATCHv2,8/8] gdb: additional debug output in infrun.c and linux-nat.c
