From patchwork Wed Mar 11 15:44:27 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Pedro Alves X-Patchwork-Id: 5581 Received: (qmail 114284 invoked by alias); 11 Mar 2015 15:44:36 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Delivered-To: mailing list gdb-patches@sourceware.org Received: (qmail 114266 invoked by uid 89); 11 Mar 2015 15:44:35 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL, BAYES_00, SPF_HELO_PASS, SPF_PASS, T_RP_MATCHES_RCVD autolearn=ham version=3.3.2 X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-GCM-SHA384 encrypted) ESMTPS; Wed, 11 Mar 2015 15:44:33 +0000 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id t2BFiUoG001523 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Wed, 11 Mar 2015 11:44:31 -0400 Received: from [127.0.0.1] (ovpn01.gateway.prod.ext.ams2.redhat.com [10.39.146.11]) by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t2BFiSHt029656; Wed, 11 Mar 2015 11:44:29 -0400 Message-ID: <550062DB.9040009@redhat.com> Date: Wed, 11 Mar 2015 15:44:27 +0000 From: Pedro Alves User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: Mark Kettenis CC: gdb-patches@sourceware.org Subject: Re: [PATCH 2/6] Introduce throw_ptrace_error References: <1425671886-7798-1-git-send-email-palves@redhat.com> <1425671886-7798-3-git-send-email-palves@redhat.com> <201503062103.t26L3tef004332@glazunov.sibelius.xs4all.nl> <54FA1EB3.2050706@redhat.com> <201503082029.t28KToYr022852@glazunov.sibelius.xs4all.nl> <54FCC39C.6090302@redhat.com> <201503101453.t2AEr4Wh011546@glazunov.sibelius.xs4all.nl> In-Reply-To: <201503101453.t2AEr4Wh011546@glazunov.sibelius.xs4all.nl> On 03/10/2015 02:53 PM, Mark Kettenis wrote: >> Date: Sun, 08 Mar 2015 21:48:12 +0000 >> From: Pedro Alves >> >> On 03/08/2015 08:29 PM, Mark Kettenis wrote: >> >> The debugger is notified. It's just a fact that a process can >> die (and become zombie) even while it was _stopped_ under >> ptrace control. That's a race you can't prevent, only cope with. > > A process yes, but a thread no. Threads die with the process, so... And we can probably observe stopped threads dying without the "process" dying with programs that use raw clone instead of pthreads. > >> I found NetBSD 5.1 in the GCC compile farm, and I see ESRCH >> there too: >> >> -bash-4.2$ uname -a >> NetBSD gcc70.fsffrance.org 5.1 NetBSD 5.1 (GENERIC) #0: Sat Nov 6 13:19:33 UTC 2010 builds@b6.netbsd.org:/home/builds/ab/netbsd-5-1-RELEASE/amd64/201011061943Z-obj/home/builds/ab/netbsd-5-1-RELEASE/src/sys/arch/amd64/compile/GENERIC amd64 >> >> -bash-4.2$ gdb ./foo >> GNU gdb 6.5 >> ... >> (gdb) start >> Breakpoint 1 at 0x400894: file foo.c, line 5. >> Starting program: /home/palves/foo >> main () at foo.c:5 >> 5 return 0; >> (gdb) p getpid () >> $1 = 24557 >> (gdb) shell kill -9 24557 >> (gdb) c >> Continuing. >> ptrace: No such process. >> (gdb) > > That's an ancient GDB though. > >> But even if some ptrace-based OS uses a different errno >> for that (which I doubt), we can just tweak throw_ptrace_error >> (a centralized place, yay!) to look for a different >> errno value. So what does OpenBSD's ptrace return >> in the test above? > > (gdb) p getpid() > $1 = 24737 > (gdb) shell kill -9 24747 > ksh: kill: 24747: No such process > (gdb) shell kill -9 24737 > (gdb) c > Continuing. > > Program received signal SIGKILL, Killed. > main () at ../../../../src/binutils-gdb/gdb/testsuite/gdb.base/wchar.c:29 > 29 wchar_t narrow = 97; > (gdb) That "Program received" instead of "Program exited with ..." just shows that a ptracer can intercept SIGKILL on OpenBSD... Does that mean that a ptracer can suppress SIGKILL then? At least on Linux and (AFAICS) NetBSD, SIGKILL cannot be caught. I'm sure there must be situations where the kernel or wheel/root can make a user process disappear under the user's debugger's feet in OpenBSD as well. Here's what I see on Linux, after the fix: (gdb) shell kill -9 492 (gdb) c Continuing. Program terminated with signal SIGKILL, Killed. The program no longer exists. (gdb) Though note that more fixing is necessary through out to fix all this better. E.g., that "continue" works because GDB didn't try to insert any breakpoint in the main program. If it does, we get: (gdb) c Continuing. Warning: Cannot insert breakpoint 2. Cannot access memory at address 0x4004c8 and again: (gdb) c Continuing. Warning: Cannot insert breakpoint 2. Cannot access memory at address 0x4004c8 forever and ever. IOW, it's not possible to continue to collect the wait status. Or we get seemingly strange (for a user) things like: (gdb) info threads Id Target Id Frame * 1 process 615 "main" main (argc=, argv=) at main.c:5 (gdb) why am I getting these errors, argh!" Yeah, GDB could have told the user why, but it didn't. I think that to fix this we'll need several TRY/CATCH around some strategic places in the core. I suspect we'll also need to model a "zombie" thread state (different from "exited" in that there's a wait status to collect). > Which strikes me as the proper behaviour in this case. Not sure if > the NetBSD behaviour you're seeing is the result of different GDB > code, or really different kernel behaviour. I'm sure it must be different kernel behavior. I saw the same with mainline gdb when I tried it a couple days ago. The machine seems to be down now, though... (It's gcc70.fsffrance.org. I filed a support request with the gcc compiler farm folks). > > OpenBSD's ptrace(2) will set errno to ESRCH if you pass it a > non-existant process ID or thread ID. As can be seen when using the > "attach" command: > > (gdb) attach 666 > Attaching to program: /home/kettenis/obj/binutils-gdb/gdb/testsuite/gdb.base/wchar, process 666 > ptrace: No such process > > Perhaps the error message here could be improved, but that doesn't > require us to add a throw with more details here. Sure, same thing with Linux's ptrace. But, that's clearly a case where you _want_ the error to end up at the command level. > In the end the meaning of ESRCH is dependent on the context. You > can't just interpret it as a missing thread. The error is THREAD_NOT_FOUND_ERROR. Saying "search for process failed" (ESRCH) or "process not found" sounds like exactly the same to my ears. There's no interpretation intended right there. Sure, OK, I could have called it PROCESS_NOT_FOUND_ERROR. I called it THREAD because it looks to me that if the process is not found, so can't its threads, so saying the (kernel) thread/task that was passed to ptrace wasn't found is correct too, while the opposite is not true: if a thread is not found, but it's containing process (thread group for Linux) exists alive, then PROCESS_NOT_FOUND_ERROR wouldn't be correct. The "thread is gone" part was indeed interpretation, but that was in the code that _catches_ the exception. In the case in question, that's exactly the right context -- we know we were attached to the process, otherwise we wouldn't be trying to resume it! So if ptrace can't find it, "thread/process not found" means "thread/process is gone". >>> I think at this point the right approach is to make >>> linux_resume_one_lwp() call ptrace() directly instead of calling down >>> into the inf_ptrace_resume(). That way you can simply check errno in >>> the place where it matters. >> >> No, your "simply" is not simple as you imply. There can be any number >> of ptrace calls that fail before the PT_CONTINUE in inf_ptrace_resume >> is reached. And whether to ignore the error should be left to some >> caller higher up on the call chain. That was the _whole point_ of this >> fuller fix, as I explained throughout the series. >> E.g., the ptrace call that fails can be the one that tries to write >> debug registers to the inferior, normal registers, reading the auxv, >> any memory read/write, whatever. Any ptrace error that throws ends up >> in the generic perror_with_name today, after the series, they'll >> end up in throw_ptrace_error instead, a single place we can add >> more context info to the error thrown. How is that a bad thing? > > The problem is that you'll always end up losing some context by using > a throw/catch model. I think that should be avoided whenever > possible, and I got the impression that in the specific race you were > trying to fix here it could be avoided. If from local context at the ptrace call site the error can be handled better, sure, but then we wouldn't be calling perror_with_name. But the situation is the reverse here. The context is higher up on the call chain, not around the ptrace call site. It's because errors can't be handled locally that we're _already_ throwing today. The fix was for all those ptrace calls deep down in the call chain that can't tell why they're being called, and so don't know whether they should ignore the error or not. For example, say you're poking a register manually in the command line and that errors out with ESRCH. Should we ignore that error? Probably not. It should probably propagate to the command line. OTOH, if GDB decides to flush registers just while it is resuming a thread, and the thread/process becomes zombie. We should probably ignore the ESRCH, cancel whatever other setup we were doing for the resume / skip the remainder of the ptrace calls, and go straight to collecting the wait status. In both cases, the registers write ptrace call is the same. So now what? > It also strikes me as undesitable having to add a new an dfairly > generic file to the list in config/*.mh. Let me say that objecting on grounds of "new" and "generic" seems pretty groundless. ;-) The file was added to all targets that already list inf-ptrace.o as first approximation, well because they'd all need it. It's a separate file because it's needed by gdbserver as well. Adding it to gdbserver was much simpler given gdbserver/configure.srv though. What would you suggest? Anyway, I've got a better objection to my own patch! :-) Not against using try/catch, but against a new error code. I poked a little at making the core behave better in this whole disappearing process/thread scenario. A new error code doesn't really work nicely with remote targets given that RSP error codes aren't well defined. So instead, I switched to catching any error and then checking whether the LWP is now "dead/zombie" by reading its status in /proc/PID/status. If so, ignore the error, otherwise, rethrow. With a scheme like this, there's a tiny race window where we could have gotten a different error and then the process died, but I don't think that matters. If we carry on fixing core GDB to cope better with a disappearing process, then a scheme like this should work in the core side too. We'd use target_thread_alive and/or add a new target_thread_is_zombie or some such. So here's my current patch. It's entirely Linux-specific now. Let me know what you think. ---- From 2042d98af96655f04d69a0bbc78f0b0a4e203ae1 Mon Sep 17 00:00:00 2001 From: Pedro Alves Date: Wed, 11 Mar 2015 15:20:31 +0000 Subject: [PATCH] Fix race exposed by gdb.threads/killed.exp On GNU/Linux, this test sometimes FAILs like this: (gdb) run Starting program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/killed [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". ptrace: No such process. (gdb) Program terminated with signal SIGKILL, Killed. The program no longer exists. FAIL: gdb.threads/killed.exp: run program to completion (timeout) Note the suspicious "No such process" line (that's errno==ESRCH). Adding debug output we see: linux_nat_wait: [process -1], [TARGET_WNOHANG] LLW: enter LNW: waitpid(-1, ...) returned 18465, ERRNO-OK LLW: waitpid 18465 received Stopped (signal) (stopped) LNW: waitpid(-1, ...) returned 18461, ERRNO-OK LLW: waitpid 18461 received Trace/breakpoint trap (stopped) LLW: Handling extended status 0x03057f LHEW: Got clone event from LWP 18461, new child is LWP 18465 LNW: waitpid(-1, ...) returned 0, ERRNO-OK RSRL: resuming stopped-resumed LWP LWP 18465 at 0x3b36af4b51: step=0 RSRL: resuming stopped-resumed LWP LWP 18461 at 0x3b36af4b51: step=0 sigchld ptrace: No such process. (gdb) linux_nat_wait: [process -1], [TARGET_WNOHANG] LLW: enter LNW: waitpid(-1, ...) returned 18465, ERRNO-OK LLW: waitpid 18465 received Killed (terminated) LLW: LWP 18465 exited. LNW: waitpid(-1, ...) returned 18461, No child processes LLW: waitpid 18461 received Killed (terminated) Process 18461 exited LNW: waitpid(-1, ...) returned -1, No child processes LLW: exit sigchld infrun: target_wait (-1, status) = infrun: 18461 [process 18461], infrun: status->kind = signalled, signal = GDB_SIGNAL_KILL infrun: TARGET_WAITKIND_SIGNALLED Program terminated with signal SIGKILL, Killed. The program no longer exists. infrun: stop_waiting FAIL: gdb.threads/killed.exp: run program to completion (timeout) The ptrace call that fails was a PTRACE_CONTINUE. The issue is that here: RSRL: resuming stopped-resumed LWP LWP 18465 at 0x3b36af4b51: step=0 RSRL: resuming stopped-resumed LWP LWP 18461 at 0x3b36af4b51: step=0 The first line shows we had just resumed LWP 18465, which does: void * child_func (void *dummy) { kill (pid, SIGKILL); exit (1); } So if the kernel manages to schedule that thread fast enough, the process may be killed before GDB has a chance to resume LWP 18461. GDBserver has code at the tail end of linux_resume_one_lwp to cope with this: ~~~ ptrace (step ? PTRACE_SINGLESTEP : PTRACE_CONT, lwpid_of (thread), (PTRACE_TYPE_ARG3) 0, /* Coerce to a uintptr_t first to avoid potential gcc warning of coercing an 8 byte integer to a 4 byte pointer. */ (PTRACE_TYPE_ARG4) (uintptr_t) signal); current_thread = saved_thread; if (errno) { /* ESRCH from ptrace either means that the thread was already running (an error) or that it is gone (a race condition). If it's gone, we will get a notification the next time we wait, so we can ignore the error. We could differentiate these two, but it's tricky without waiting; the thread still exists as a zombie, so sending it signal 0 would succeed. So just ignore ESRCH. */ if (errno == ESRCH) return; perror_with_name ("ptrace"); } ~~~ However, that's not a complete fix, because between starting to handle the resume request and getting that PTRACE_CONTINUE, we run other ptrace calls that can also fail with ESRCH, and that end up throwing an error (with perror_with_name). Whether to ignore ptrace errors or not depends on context that is only available somewhere up the call chain. So the fix is to let ptrace errors throw as they do today, and wrap the resume request in a TRY/CATCH that swallows it iff the lwp that we were trying to resume is now zombie. gdb/gdbserver/ChangeLog: 2015-03-11 Pedro Alves * linux-low.c (linux_resume_one_lwp): Rename to ... (linux_resume_one_lwp_throw): ... this. Don't handle ESRCH here, instead call perror_with_name. (handle_zombie_lwp): New function. (linux_resume_one_lwp): Reimplement as wrapper around linux_resume_one_lwp_throw that swallows errors if the LWP is zombie. gdb/ChangeLog: 2015-03-11 Pedro Alves * linux-nat.c (linux_resume_one_lwp): Rename to ... (linux_resume_one_lwp_throw): ... this. Don't handle ESRCH here, instead call perror_with_name. (handle_zombie_lwp): New function. (linux_resume_one_lwp): Reimplement as wrapper around linux_resume_one_lwp_throw that swallows errors if the LWP is zombie. --- gdb/gdbserver/linux-low.c | 59 +++++++++++++++++++++++++++++++++++------------ gdb/linux-nat.c | 43 +++++++++++++++++++++++++++++++++- 2 files changed, 86 insertions(+), 16 deletions(-) diff --git a/gdb/gdbserver/linux-low.c b/gdb/gdbserver/linux-low.c index 48d905b..e8a66a3 100644 --- a/gdb/gdbserver/linux-low.c +++ b/gdb/gdbserver/linux-low.c @@ -3378,13 +3378,12 @@ stop_all_lwps (int suspend, struct lwp_info *except) } } -/* Resume execution of the inferior process. - If STEP is nonzero, single-step it. - If SIGNAL is nonzero, give it that signal. */ +/* Resume execution of LWP. If STEP is nonzero, single-step it. If + SIGNAL is nonzero, give it that signal. */ static void -linux_resume_one_lwp (struct lwp_info *lwp, - int step, int signal, siginfo_t *info) +linux_resume_one_lwp_throw (struct lwp_info *lwp, + int step, int signal, siginfo_t *info) { struct thread_info *thread = get_lwp_thread (lwp); struct thread_info *saved_thread; @@ -3576,19 +3575,49 @@ linux_resume_one_lwp (struct lwp_info *lwp, current_thread = saved_thread; if (errno) + perror_with_name ("resuming thread"); +} + +/* Called when we try to resume a stopped LWP and that errors out. If + the LWP is zombie, discard the error, clear any pending status the + LWP may have, and return true (we'll collect the exit status soon + enough). Otherwise, return false. */ + +static int +handle_zombie_lwp (struct lwp_info *lp) +{ + struct thread_info *thread = get_lwp_thread (lp); + + if (linux_proc_pid_is_zombie (lwpid_of (thread))) { - /* ESRCH from ptrace either means that the thread was already - running (an error) or that it is gone (a race condition). If - it's gone, we will get a notification the next time we wait, - so we can ignore the error. We could differentiate these - two, but it's tricky without waiting; the thread still exists - as a zombie, so sending it signal 0 would succeed. So just - ignore ESRCH. */ - if (errno == ESRCH) - return; + lp->stop_reason = TARGET_STOPPED_BY_NO_REASON; + lp->status_pending_p = 0; + lp->stopped = 0; + return 1; + } + return 0; +} + +/* Like linux_resume_one_lwp_throw, but no error is thrown if the LWP + disappears while we try to resume it. */ - perror_with_name ("ptrace"); +static void +linux_resume_one_lwp (struct lwp_info *lwp, + int step, int signal, siginfo_t *info) +{ + TRY + { + linux_resume_one_lwp_throw (lwp, step, signal, info); + } + CATCH (ex, RETURN_MASK_ERROR) + { + /* If we got an error because the thread/process is gone, ignore + it; we will collect the exit status the next time we wait, + shortly. */ + if (!handle_zombie_lwp (lwp)) + throw_exception (ex); } + END_CATCH } struct thread_resume_array diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c index 22ce842..1291ab3 100644 --- a/gdb/linux-nat.c +++ b/gdb/linux-nat.c @@ -1503,7 +1503,8 @@ linux_nat_detach (struct target_ops *ops, const char *args, int from_tty) single-step it. If SIGNAL is nonzero, give it that signal. */ static void -linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo) +linux_resume_one_lwp_throw (struct lwp_info *lp, int step, + enum gdb_signal signo) { lp->step = step; @@ -1527,6 +1528,46 @@ linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo) registers_changed_ptid (lp->ptid); } +/* Called when we try to resume a stopped LWP and that errors out. If + the LWP is zombie, discard the error, clear any pending status the + LWP may have, and return true (we'll collect the exit status soon + enough). Otherwise, return false. */ + +static int +handle_zombie_lwp (struct lwp_info *lp) +{ + if (linux_proc_pid_is_zombie (ptid_get_lwp (lp->ptid))) + { + lp->stop_reason = TARGET_STOPPED_BY_NO_REASON; + lp->status = 0; + lp->waitstatus.kind = TARGET_WAITKIND_IGNORE; + lp->stopped = 0; + return 1; + } + return 0; +} + +/* Like linux_resume_one_lwp_throw, but no error is thrown if the LWP + disappears while we try to resume it. */ + +static void +linux_resume_one_lwp (struct lwp_info *lp, int step, enum gdb_signal signo) +{ + TRY + { + linux_resume_one_lwp_throw (lp, step, signo); + } + CATCH (ex, RETURN_MASK_ERROR) + { + /* If we got an error because the thread/process is gone, ignore + it; we will collect the exit status the next time we wait, + shortly. */ + if (!handle_zombie_lwp (lp)) + throw_exception (ex); + } + END_CATCH +} + /* Resume LP. */ static void