Message ID | 20230228181845.99936-6-jhb@FreeBSD.org |
---|---|
State | New |
Headers |
Return-Path: <gdb-patches-bounces+patchwork=sourceware.org@sourceware.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id D6DF23850209 for <patchwork@sourceware.org>; Tue, 28 Feb 2023 18:19:26 +0000 (GMT) X-Original-To: gdb-patches@sourceware.org Delivered-To: gdb-patches@sourceware.org Received: from mail.baldwin.cx (bigwig.baldwin.cx [IPv6:2607:f138:0:13::2]) by sourceware.org (Postfix) with ESMTPS id DA0653858C50 for <gdb-patches@sourceware.org>; Tue, 28 Feb 2023 18:19:02 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org DA0653858C50 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=FreeBSD.org Authentication-Results: sourceware.org; spf=fail smtp.mailfrom=FreeBSD.org Received: from gimli.baldwin.net (c-98-35-126-114.hsd1.ca.comcast.net [98.35.126.114]) by mail.baldwin.cx (Postfix) with ESMTPSA id 6A5181A84E33 for <gdb-patches@sourceware.org>; Tue, 28 Feb 2023 13:18:57 -0500 (EST) From: John Baldwin <jhb@FreeBSD.org> To: gdb-patches@sourceware.org Subject: [PATCH 5/9] fbsd-nat: Defer any ineligible events reported by wait. Date: Tue, 28 Feb 2023 10:18:41 -0800 Message-Id: <20230228181845.99936-6-jhb@FreeBSD.org> X-Mailer: git-send-email 2.38.1 In-Reply-To: <20230228181845.99936-1-jhb@FreeBSD.org> References: <20230228181845.99936-1-jhb@FreeBSD.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.6.4 (mail.baldwin.cx [0.0.0.0]); Tue, 28 Feb 2023 13:18:57 -0500 (EST) X-Virus-Scanned: clamav-milter 0.103.1 at mail.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-11.9 required=5.0 tests=BAYES_00, FORGED_SPF_HELO, GIT_PATCH_0, KAM_DMARC_STATUS, KHOP_HELO_FCRDNS, SPF_HELO_PASS, SPF_SOFTFAIL, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gdb-patches@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org> List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>, <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe> List-Archive: <https://sourceware.org/pipermail/gdb-patches/> List-Post: <mailto:gdb-patches@sourceware.org> List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help> List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>, <mailto:gdb-patches-request@sourceware.org?subject=subscribe> Errors-To: gdb-patches-bounces+patchwork=sourceware.org@sourceware.org Sender: "Gdb-patches" <gdb-patches-bounces+patchwork=sourceware.org@sourceware.org> |
Series |
Fixes for multiprocess for FreeBSD's native target
|
|
Commit Message
John Baldwin
Feb. 28, 2023, 6:18 p.m. UTC
If wait_1 finds an event for a thread or process that does not match the set of threads and processes previously resumed, defer the event. If the event is for a specific thread, suspend the thread and continue the associated process before waiting for another event. One specific example of such an event is if a thread is created while another thread in the same process hits a breakpoint. If the second thread's event is reported first, the target resume method does not yet "know" about the new thread and will not suspend it via PT_SUSPEND. When wait is called, it will probably return the event from the first thread before the result of the step from second thread. This is the case reported in PR 21497. --- gdb/fbsd-nat.c | 34 +++++++++++++++++++++++++++++++++- 1 file changed, 33 insertions(+), 1 deletion(-)
Comments
On 2/28/23 13:18, John Baldwin wrote: > If wait_1 finds an event for a thread or process that does not match > the set of threads and processes previously resumed, defer the event. > If the event is for a specific thread, suspend the thread and continue > the associated process before waiting for another event. I do not understand this, because it doesn't match my mental model of how things should work (which is derived from working with linux-nat). In that model, a thread with a pending event should never be ptrace-resumed. So the last sentence doesn't make sense, it implies that a thread has a pending and is ptrace-resumed (since we suspend it). > One specific example of such an event is if a thread is created while > another thread in the same process hits a breakpoint. If the second > thread's event is reported first, the target resume method does not > yet "know" about the new thread and will not suspend it via > PT_SUSPEND. I was surprised to read that the resume method suspends some threads. I don't think it should, but I do see the current code. The current code appears to interpret the resume request as: "ensure the state of thread resumption looks exactly like this", so it suspends threads as needed, to match the requested resumption state. I don't think it's supposed to work like that. The resume method should only resume some threads, and if the core of GDB wants some threads stopped, it will call the stop method. Anyhow, just to be sure I understand the problem you describe: when fbsd-nat reports the breakpoint hit from "another thread", is the "thread created" event still in the kernel, not consumed by GDB? > When wait is called, it will probably return the event > from the first thread before the result of the step from second > thread. This is the case reported in PR 21497. > --- > gdb/fbsd-nat.c | 34 +++++++++++++++++++++++++++++++++- > 1 file changed, 33 insertions(+), 1 deletion(-) > > diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c > index ca278b871ef..3f7278c6ea0 100644 > --- a/gdb/fbsd-nat.c > +++ b/gdb/fbsd-nat.c > @@ -1514,7 +1514,39 @@ fbsd_nat_target::wait (ptid_t ptid, struct target_waitstatus *ourstatus, > if (is_async_p ()) > async_file_flush (); > > - wptid = wait_1 (ptid, ourstatus, target_options); > + while (1) > + { > + wptid = wait_1 (ptid, ourstatus, target_options); > + > + /* If no event was found, just return. */ > + if (ourstatus->kind () == TARGET_WAITKIND_IGNORE > + || ourstatus->kind () == TARGET_WAITKIND_NO_RESUMED) > + break; > + > + /* If an event is reported for a thread or process while > + stepping some other thread, suspend the thread reporting the > + event and defer the event until it can be reported to the > + core. */ > + if (!wptid.matches (resume_ptid)) > + { > + add_pending_event (wptid, *ourstatus); > + fbsd_nat_debug_printf ("deferring event [%s], [%s]", > + target_pid_to_str (wptid).c_str (), > + ourstatus->to_string ().c_str ()); > + if (wptid.pid () == resume_ptid.pid ()) > + { > + fbsd_nat_debug_printf ("suspending thread [%s]", > + target_pid_to_str (wptid).c_str ()); > + if (ptrace (PT_SUSPEND, wptid.lwp (), NULL, 0) == -1) > + perror_with_name (("ptrace (PT_SUSPEND)")); This is the part I don't understand. After wait_1 returned an event for a specific thread, I would expect this thread to be ptrace-stopped. So, I don't see the need to suspend it. You'd just leave it in the "resumed from the until. > + if (ptrace (PT_CONTINUE, wptid.pid (), (caddr_t) 1, 0) == -1) > + perror_with_name (("ptrace (PT_CONTINUE)")); I don't get why you continue `wptid.pid ()`. What assures you that this particular thread was stopped previously? Perhaps I don't understand because I assume somethings about pids/lwps and ptrace that are only true on Linux. Simon
On 3/20/23 12:09 PM, Simon Marchi wrote: > On 2/28/23 13:18, John Baldwin wrote: >> If wait_1 finds an event for a thread or process that does not match >> the set of threads and processes previously resumed, defer the event. >> If the event is for a specific thread, suspend the thread and continue >> the associated process before waiting for another event. > > I do not understand this, because it doesn't match my mental model of > how things should work (which is derived from working with linux-nat). > In that model, a thread with a pending event should never be > ptrace-resumed. So the last sentence doesn't make sense, it implies > that a thread has a pending and is ptrace-resumed (since we suspend it). The difference here might be that on FreeBSD ptrace can only PT_CONTINUE and wait() for entire processes. When a thread contains multiple threads, PT_CONTINUE resumes all threads belonging to that process. Individual threads can be controlled by using PT_SUSPEND on a specific LWP. Those LWPs will stay asleep after PT_CONTINUE. If any thread encounters a stop event (breakpoint, fork event, etc.) the entire process stops (the kernel signals threads in userland if needed to get them back into the kernel to stop). Once it is stopped it reports an event via wait(). Currently on FreeBSD wait() only reports a single event at once. So if multiple threads hit a breakpoint, wait() returns for the first one, PT_LWPINFO is used on the pid to find out what specific event happened (and which LWP it happened for). The other thread will keep its event pending in the kernel, and after PT_CONTINUE it will immediately trigger another stop and an event reported from wait(). >> One specific example of such an event is if a thread is created while >> another thread in the same process hits a breakpoint. If the second >> thread's event is reported first, the target resume method does not >> yet "know" about the new thread and will not suspend it via >> PT_SUSPEND. > > I was surprised to read that the resume method suspends some threads. I > don't think it should, but I do see the current code. The current code > appears to interpret the resume request as: "ensure the state of thread > resumption looks exactly like this", so it suspends threads as needed, > to match the requested resumption state. I don't think it's supposed to > work like that. The resume method should only resume some threads, and > if the core of GDB wants some threads stopped, it will call the stop > method. See above. I can't individually suspend/resume threads. I can only resume processes, but decide which set of threads within a process to keep resumed when resuming the entire process. > Anyhow, just to be sure I understand the problem you describe: when > fbsd-nat reports the breakpoint hit from "another thread", is the > "thread created" event still in the kernel, not consumed by GDB? Yes. What happens is that GDB tries to single-step a thread over a breakpoint, so it does a resume() of a single ptid_t. To implement this, fbsd-nat's resume has to explicitly PT_SUSPEND all the other LWPs in the process before doing a PT_CONTINUE so that only the stepped thread will execute. However, this logic doesn't yet know about the new thread (and it hasn't started executing in userland yet). However, once the new thread starts executing, it immediately reports a ptrace event. This confuses GDB currently becuase that ptrace event can be reported right after the PT_CONTINUE. I think in part this is because unlike on Linux, when thread A creates thread B, thread A doesn't report anything. Instead, only thread B reports an event when it starts. On Linux thread creation is more like fork where thread A reports an event saying it created B. >> When wait is called, it will probably return the event >> from the first thread before the result of the step from second >> thread. This is the case reported in PR 21497. > >> --- >> gdb/fbsd-nat.c | 34 +++++++++++++++++++++++++++++++++- >> 1 file changed, 33 insertions(+), 1 deletion(-) >> >> diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c >> index ca278b871ef..3f7278c6ea0 100644 >> --- a/gdb/fbsd-nat.c >> +++ b/gdb/fbsd-nat.c >> @@ -1514,7 +1514,39 @@ fbsd_nat_target::wait (ptid_t ptid, struct target_waitstatus *ourstatus, >> if (is_async_p ()) >> async_file_flush (); >> >> - wptid = wait_1 (ptid, ourstatus, target_options); >> + while (1) >> + { >> + wptid = wait_1 (ptid, ourstatus, target_options); >> + >> + /* If no event was found, just return. */ >> + if (ourstatus->kind () == TARGET_WAITKIND_IGNORE >> + || ourstatus->kind () == TARGET_WAITKIND_NO_RESUMED) >> + break; >> + >> + /* If an event is reported for a thread or process while >> + stepping some other thread, suspend the thread reporting the >> + event and defer the event until it can be reported to the >> + core. */ >> + if (!wptid.matches (resume_ptid)) >> + { >> + add_pending_event (wptid, *ourstatus); >> + fbsd_nat_debug_printf ("deferring event [%s], [%s]", >> + target_pid_to_str (wptid).c_str (), >> + ourstatus->to_string ().c_str ()); >> + if (wptid.pid () == resume_ptid.pid ()) >> + { >> + fbsd_nat_debug_printf ("suspending thread [%s]", >> + target_pid_to_str (wptid).c_str ()); >> + if (ptrace (PT_SUSPEND, wptid.lwp (), NULL, 0) == -1) >> + perror_with_name (("ptrace (PT_SUSPEND)")); > > > This is the part I don't understand. After wait_1 returned an event for > a specific thread, I would expect this thread to be ptrace-stopped. So, > I don't see the need to suspend it. You'd just leave it in the "resumed > from the > until. > >> + if (ptrace (PT_CONTINUE, wptid.pid (), (caddr_t) 1, 0) == -1) >> + perror_with_name (("ptrace (PT_CONTINUE)")); > > I don't get why you continue `wptid.pid ()`. What assures you that this > particular thread was stopped previously? Perhaps I don't understand > because I assume somethings about pids/lwps and ptrace that are only > true on Linux. In this specific case, here is the sequence of events: - process P1 contains two threads, T1, and T2 - T1 creates a new thread T3, but T3 isn't yet scheduled to execute - T2 hits a breakpoint so the process stops - wait() returns P1 - PT_LWPINFO reports the breakpoint event for T2 - GDB undoes the breakpoint and tries to single-step T2 over the breakpoint - GDB calls resume with step=1 and scope_ptid=ptid(P1, T2) - fbsd_nat::resume iterates over the threads it knows about, but it only knows about T1 and T2 - PT_SUSPEND T1 - PT_RESUME T2 - the P1 process is now resumed via PT_CONTINUE - T3 starts executing and reports its "born" event - wait() returns P1 - PT_LWPINFO reports the thread birth event for T3 Previously at this point fbsd_nat::wait() returned an event for T3 which triggers the assertion failure in the PR. With this patch what happens now is: - fbsd_nat::wait() realizes T3 isn't "resumed" from the core's perspective so explicitly suspends it via PT_SUSPEND - the P1 process is continued via PT_CONTINUE - T2 completes its step and reports the event - wait() returns P1 - PT_LWPINFO reports the step-complete event for P2 - the event for T2 is returned from fbsd_nat::wait() to the core
On 3/27/23 16:49, John Baldwin wrote: > On 3/20/23 12:09 PM, Simon Marchi wrote: >> On 2/28/23 13:18, John Baldwin wrote: >>> If wait_1 finds an event for a thread or process that does not match >>> the set of threads and processes previously resumed, defer the event. >>> If the event is for a specific thread, suspend the thread and continue >>> the associated process before waiting for another event. >> >> I do not understand this, because it doesn't match my mental model of >> how things should work (which is derived from working with linux-nat). >> In that model, a thread with a pending event should never be >> ptrace-resumed. So the last sentence doesn't make sense, it implies >> that a thread has a pending and is ptrace-resumed (since we suspend it). > > The difference here might be that on FreeBSD ptrace can only PT_CONTINUE > and wait() for entire processes. When a thread contains multiple threads, > PT_CONTINUE resumes all threads belonging to that process. Individual > threads can be controlled by using PT_SUSPEND on a specific LWP. Those > LWPs will stay asleep after PT_CONTINUE. If any thread encounters a > stop event (breakpoint, fork event, etc.) the entire process stops (the > kernel signals threads in userland if needed to get them back into the > kernel to stop). Once it is stopped it reports an event via wait(). > > Currently on FreeBSD wait() only reports a single event at once. So > if multiple threads hit a breakpoint, wait() returns for the first one, > PT_LWPINFO is used on the pid to find out what specific event happened > (and which LWP it happened for). The other thread will keep its event > pending in the kernel, and after PT_CONTINUE it will immediately trigger > another stop and an event reported from wait(). Thanks for the explanation. I'll keep that in mind when re-reading (perhaps a v2, if you want to rebase and submit a fresh version). >>> One specific example of such an event is if a thread is created while >>> another thread in the same process hits a breakpoint. If the second >>> thread's event is reported first, the target resume method does not >>> yet "know" about the new thread and will not suspend it via >>> PT_SUSPEND. >> >> I was surprised to read that the resume method suspends some threads. I >> don't think it should, but I do see the current code. The current code >> appears to interpret the resume request as: "ensure the state of thread >> resumption looks exactly like this", so it suspends threads as needed, >> to match the requested resumption state. I don't think it's supposed to >> work like that. The resume method should only resume some threads, and >> if the core of GDB wants some threads stopped, it will call the stop >> method. > > See above. I can't individually suspend/resume threads. I can only > resume processes, but decide which set of threads within a process > to keep resumed when resuming the entire process. Ack. > >> Anyhow, just to be sure I understand the problem you describe: when >> fbsd-nat reports the breakpoint hit from "another thread", is the >> "thread created" event still in the kernel, not consumed by GDB? > > Yes. What happens is that GDB tries to single-step a thread over a > breakpoint, so it does a resume() of a single ptid_t. To implement > this, fbsd-nat's resume has to explicitly PT_SUSPEND all the other > LWPs in the process before doing a PT_CONTINUE so that only the > stepped thread will execute. However, this logic doesn't yet know > about the new thread (and it hasn't started executing in userland > yet). However, once the new thread starts executing, it immediately > reports a ptrace event. This confuses GDB currently becuase that > ptrace event can be reported right after the PT_CONTINUE. I think > in part this is because unlike on Linux, when thread A creates thread > B, thread A doesn't report anything. Instead, only thread B reports > an event when it starts. On Linux thread creation is more like fork > where thread A reports an event saying it created B. Ack. > >>> When wait is called, it will probably return the event >>> from the first thread before the result of the step from second >>> thread. This is the case reported in PR 21497. >> >>> --- >>> gdb/fbsd-nat.c | 34 +++++++++++++++++++++++++++++++++- >>> 1 file changed, 33 insertions(+), 1 deletion(-) >>> >>> diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c >>> index ca278b871ef..3f7278c6ea0 100644 >>> --- a/gdb/fbsd-nat.c >>> +++ b/gdb/fbsd-nat.c >>> @@ -1514,7 +1514,39 @@ fbsd_nat_target::wait (ptid_t ptid, struct target_waitstatus *ourstatus, >>> if (is_async_p ()) >>> async_file_flush (); >>> - wptid = wait_1 (ptid, ourstatus, target_options); >>> + while (1) >>> + { >>> + wptid = wait_1 (ptid, ourstatus, target_options); >>> + >>> + /* If no event was found, just return. */ >>> + if (ourstatus->kind () == TARGET_WAITKIND_IGNORE >>> + || ourstatus->kind () == TARGET_WAITKIND_NO_RESUMED) >>> + break; >>> + >>> + /* If an event is reported for a thread or process while >>> + stepping some other thread, suspend the thread reporting the >>> + event and defer the event until it can be reported to the >>> + core. */ >>> + if (!wptid.matches (resume_ptid)) >>> + { >>> + add_pending_event (wptid, *ourstatus); >>> + fbsd_nat_debug_printf ("deferring event [%s], [%s]", >>> + target_pid_to_str (wptid).c_str (), >>> + ourstatus->to_string ().c_str ()); >>> + if (wptid.pid () == resume_ptid.pid ()) >>> + { >>> + fbsd_nat_debug_printf ("suspending thread [%s]", >>> + target_pid_to_str (wptid).c_str ()); >>> + if (ptrace (PT_SUSPEND, wptid.lwp (), NULL, 0) == -1) >>> + perror_with_name (("ptrace (PT_SUSPEND)")); >> >> >> This is the part I don't understand. After wait_1 returned an event for >> a specific thread, I would expect this thread to be ptrace-stopped. So, >> I don't see the need to suspend it. You'd just leave it in the "resumed >> from the >> until. >> >>> + if (ptrace (PT_CONTINUE, wptid.pid (), (caddr_t) 1, 0) == -1) >>> + perror_with_name (("ptrace (PT_CONTINUE)")); >> >> I don't get why you continue `wptid.pid ()`. What assures you that this >> particular thread was stopped previously? Perhaps I don't understand >> because I assume somethings about pids/lwps and ptrace that are only >> true on Linux. > > In this specific case, here is the sequence of events: > > - process P1 contains two threads, T1, and T2 > - T1 creates a new thread T3, but T3 isn't yet scheduled to execute > - T2 hits a breakpoint so the process stops > - wait() returns P1 > - PT_LWPINFO reports the breakpoint event for T2 > - GDB undoes the breakpoint and tries to single-step T2 over the > breakpoint > - GDB calls resume with step=1 and scope_ptid=ptid(P1, T2) > - fbsd_nat::resume iterates over the threads it knows about, but > it only knows about T1 and T2 > - PT_SUSPEND T1 > - PT_RESUME T2 > - the P1 process is now resumed via PT_CONTINUE > - T3 starts executing and reports its "born" event > - wait() returns P1 > - PT_LWPINFO reports the thread birth event for T3 > > Previously at this point fbsd_nat::wait() returned an event for T3 > which triggers the assertion failure in the PR. With this patch > what happens now is: > > - fbsd_nat::wait() realizes T3 isn't "resumed" from the core's > perspective so explicitly suspends it via PT_SUSPEND > - the P1 process is continued via PT_CONTINUE > - T2 completes its step and reports the event > - wait() returns P1 > - PT_LWPINFO reports the step-complete event for P2 > - the event for T2 is returned from fbsd_nat::wait() to the core Ok, thanks, that makes sense. Simon
diff --git a/gdb/fbsd-nat.c b/gdb/fbsd-nat.c index ca278b871ef..3f7278c6ea0 100644 --- a/gdb/fbsd-nat.c +++ b/gdb/fbsd-nat.c @@ -1514,7 +1514,39 @@ fbsd_nat_target::wait (ptid_t ptid, struct target_waitstatus *ourstatus, if (is_async_p ()) async_file_flush (); - wptid = wait_1 (ptid, ourstatus, target_options); + while (1) + { + wptid = wait_1 (ptid, ourstatus, target_options); + + /* If no event was found, just return. */ + if (ourstatus->kind () == TARGET_WAITKIND_IGNORE + || ourstatus->kind () == TARGET_WAITKIND_NO_RESUMED) + break; + + /* If an event is reported for a thread or process while + stepping some other thread, suspend the thread reporting the + event and defer the event until it can be reported to the + core. */ + if (!wptid.matches (resume_ptid)) + { + add_pending_event (wptid, *ourstatus); + fbsd_nat_debug_printf ("deferring event [%s], [%s]", + target_pid_to_str (wptid).c_str (), + ourstatus->to_string ().c_str ()); + if (wptid.pid () == resume_ptid.pid ()) + { + fbsd_nat_debug_printf ("suspending thread [%s]", + target_pid_to_str (wptid).c_str ()); + if (ptrace (PT_SUSPEND, wptid.lwp (), NULL, 0) == -1) + perror_with_name (("ptrace (PT_SUSPEND)")); + if (ptrace (PT_CONTINUE, wptid.pid (), (caddr_t) 1, 0) == -1) + perror_with_name (("ptrace (PT_CONTINUE)")); + } + continue; + } + + break; + } /* If we are in async mode and found an event, there may still be another event pending. Trigger the event pipe so that that the