[2/6] Introduce throw_ptrace_error

From: Pedro Alves <palves@redhat.com>

  On 03/10/2015 02:53 PM, Mark Kettenis wrote:
>> Date: Sun, 08 Mar 2015 21:48:12 +0000
>> From: Pedro Alves <palves@redhat.com>
>>
>> On 03/08/2015 08:29 PM, Mark Kettenis wrote:
>>
>> The debugger is notified.  It's just a fact that a process can
>> die (and become zombie) even while it was _stopped_ under
>> ptrace control.  That's a race you can't prevent, only cope with.
> 
> A process yes, but a thread no.

Threads die with the process, so...  And we can probably
observe stopped threads dying without the "process" dying
with programs that use raw clone instead of pthreads.

> 
>> I found NetBSD 5.1 in the GCC compile farm, and I see ESRCH
>> there too:
>>
>> -bash-4.2$ uname -a
>> NetBSD gcc70.fsffrance.org 5.1 NetBSD 5.1 (GENERIC) #0: Sat Nov  6 13:19:33 UTC 2010  builds@b6.netbsd.org:/home/builds/ab/netbsd-5-1-RELEASE/amd64/201011061943Z-obj/home/builds/ab/netbsd-5-1-RELEASE/src/sys/arch/amd64/compile/GENERIC amd64
>>
>> -bash-4.2$ gdb ./foo
>> GNU gdb 6.5
>> ...
>> (gdb) start
>> Breakpoint 1 at 0x400894: file foo.c, line 5.
>> Starting program: /home/palves/foo
>> main () at foo.c:5
>> 5         return 0;
>> (gdb) p getpid ()
>> $1 = 24557
>> (gdb) shell kill -9 24557
>> (gdb) c
>> Continuing.
>> ptrace: No such process.
>> (gdb)
> 
> That's an ancient GDB though.
> 
>> But even if some ptrace-based OS uses a different errno
>> for that (which I doubt), we can just tweak throw_ptrace_error
>> (a centralized place, yay!) to look for a different
>> errno value.  So what does OpenBSD's ptrace return
>> in the test above?
> 
> (gdb) p getpid()
> $1 = 24737
> (gdb) shell kill -9 24747
> ksh: kill: 24747: No such process
> (gdb) shell kill -9 24737
> (gdb) c
> Continuing.
> 
> Program received signal SIGKILL, Killed.
> main () at ../../../../src/binutils-gdb/gdb/testsuite/gdb.base/wchar.c:29
> 29        wchar_t narrow = 97;
> (gdb) 

That "Program received" instead of "Program exited with ..." just
shows that a ptracer can intercept SIGKILL on OpenBSD...  Does that
mean that a ptracer can suppress SIGKILL then?

At least on Linux and (AFAICS) NetBSD, SIGKILL cannot be caught.

I'm sure there must be situations where the kernel or wheel/root can
make a user process disappear under the user's debugger's feet
in OpenBSD as well.

Here's what I see on Linux, after the fix:

(gdb) shell kill -9 492
(gdb) c
Continuing.

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb)

Though note that more fixing is necessary through out to
fix all this better.  E.g., that "continue" works because
GDB didn't try to insert any breakpoint in the main program.
If it does, we get:

(gdb) c
Continuing.
Warning:
Cannot insert breakpoint 2.
Cannot access memory at address 0x4004c8

and again:

(gdb) c
Continuing.
Warning:
Cannot insert breakpoint 2.
Cannot access memory at address 0x4004c8

forever and ever.  IOW, it's not possible to continue
to collect the wait status.

Or we get seemingly strange (for a user) things like:

(gdb) info threads
  Id   Target Id         Frame
* 1    process 615 "main" main (argc=<error reading variable: Cannot access memory at address 0x7fffffffd83c>,
    argv=<error reading variable: Cannot access memory at address 0x7fffffffd830>) at main.c:5
(gdb)

why am I getting these errors, argh!"

Yeah, GDB could have told the user why, but it didn't.

I think that to fix this we'll need several TRY/CATCH around some
strategic places in the core.  I suspect we'll also need to
model a "zombie" thread state (different from "exited" in that
there's a wait status to collect).

> Which strikes me as the proper behaviour in this case.  Not sure if
> the NetBSD behaviour you're seeing is the result of different GDB
> code, or really different kernel behaviour.

I'm sure it must be different kernel behavior.  I saw the same with
mainline gdb when I tried it a couple days ago.  The machine seems to
be down now, though... (It's gcc70.fsffrance.org.  I filed a support
request with the gcc compiler farm folks).

> 
> OpenBSD's ptrace(2) will set errno to ESRCH if you pass it a
> non-existant process ID or thread ID.  As can be seen when using the
> "attach" command:
> 
> (gdb) attach 666
> Attaching to program: /home/kettenis/obj/binutils-gdb/gdb/testsuite/gdb.base/wchar, process 666
> ptrace: No such process
> 
> Perhaps the error message here could be improved, but that doesn't
> require us to add a throw with more details here.

Sure, same thing with Linux's ptrace.

But, that's clearly a case where you _want_ the error to
end up at the command level.

> In the end the meaning of ESRCH is dependent on the context.  You
> can't just interpret it as a missing thread.

The error is THREAD_NOT_FOUND_ERROR.  Saying "search for process
failed" (ESRCH) or "process not found" sounds like exactly the same
to my ears.  There's no interpretation intended right there.  Sure,
OK, I could have called it PROCESS_NOT_FOUND_ERROR.  I called it
THREAD because it looks to me that if the process is not found, so
can't its threads, so saying the (kernel) thread/task that was passed
to ptrace wasn't found is correct too, while the opposite is not true: if
a thread is not found, but it's containing process (thread group
for Linux) exists alive, then PROCESS_NOT_FOUND_ERROR wouldn't be
correct.

The "thread is gone" part was indeed interpretation, but that was
in the code that _catches_ the exception.  In the case in question, that's
exactly the right context -- we know we were attached to the process,
otherwise we wouldn't be trying to resume it!  So if ptrace can't find
it, "thread/process not found" means "thread/process is gone".

>>> I think at this point the right approach is to make
>>> linux_resume_one_lwp() call ptrace() directly instead of calling down
>>> into the inf_ptrace_resume().  That way you can simply check errno in
>>> the place where it matters.
>>
>> No, your "simply" is not simple as you imply.  There can be any number
>> of ptrace calls that fail before the PT_CONTINUE in inf_ptrace_resume
>> is reached.  And whether to ignore the error should be left to some
>> caller higher up on the call chain.  That was the _whole point_ of this
>> fuller fix, as I explained throughout the series.
>> E.g., the ptrace call that fails can be the one that tries to write
>> debug registers to the inferior, normal registers, reading the auxv,
>> any memory read/write, whatever.  Any ptrace error that throws ends up
>> in the generic perror_with_name today, after the series, they'll
>> end up in throw_ptrace_error instead, a single place we can add
>> more context info to the error thrown.  How is that a bad thing?
> 
> The problem is that you'll always end up losing some context by using
> a throw/catch model.  I think that should be avoided whenever
> possible, and I got the impression that in the specific race you were
> trying to fix here it could be avoided.

If from local context at the ptrace call site the error can be handled
better, sure, but then we wouldn't be calling perror_with_name.  But
the situation is the reverse here.  The context is higher up on the
call chain, not around the ptrace call site.  It's because errors can't
be handled locally that we're _already_ throwing today.  The fix was for all
those ptrace calls deep down in the call chain that can't tell why they're
being called, and so don't know whether they should ignore the error
or not.  For example, say you're poking a register manually in the command
line and that errors out with ESRCH.  Should we ignore that error?  Probably not.
It should probably propagate to the command line.  OTOH, if GDB decides to
flush registers just while it is resuming a thread, and the thread/process
becomes zombie.  We should probably ignore the ESRCH, cancel whatever other
setup we were doing for the resume / skip the remainder of the ptrace calls,
and go straight to collecting the wait status.  In both cases, the registers
write ptrace call is the same.  So now what?

> It also strikes me as undesitable having to add a new an dfairly
> generic file to the list in config/*.mh.

Let me say that objecting on grounds of "new" and "generic" seems
pretty groundless.  ;-)  The file was added to all targets that
already list inf-ptrace.o as first approximation, well because
they'd all need it.  It's a separate file because it's needed
by gdbserver as well.  Adding it to gdbserver was much
simpler given gdbserver/configure.srv though.  What would you
suggest?

Anyway, I've got a better objection to my own patch! :-)

Not against using try/catch, but against a new error code.

I poked a little at making the core behave better in this
whole disappearing process/thread scenario.  A new error code
doesn't really work nicely with remote targets given that
RSP error codes aren't well defined.

So instead, I switched to catching any error and then checking
whether the LWP is now "dead/zombie" by reading its status
in /proc/PID/status.  If so, ignore the error, otherwise, rethrow.
With a scheme like this, there's a tiny race window where we could
have gotten a different error and then the process died, but I
don't think that matters.

If we carry on fixing core GDB to cope better with a disappearing
process, then a scheme like this should work in the core side
too.  We'd use target_thread_alive and/or add a new
target_thread_is_zombie or some such.

So here's my current patch.  It's entirely Linux-specific now.

Let me know what you think.

----
From 2042d98af96655f04d69a0bbc78f0b0a4e203ae1 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Wed, 11 Mar 2015 15:20:31 +0000
Subject: [PATCH] Fix race exposed by gdb.threads/killed.exp

On GNU/Linux, this test sometimes FAILs like this:

 (gdb) run
 Starting program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/killed
 [Thread debugging using libthread_db enabled]
 Using host libthread_db library "/lib64/libthread_db.so.1".
 ptrace: No such process.
 (gdb)
 Program terminated with signal SIGKILL, Killed.
 The program no longer exists.
 FAIL: gdb.threads/killed.exp: run program to completion (timeout)

Note the suspicious "No such process" line (that's errno==ESRCH).
Adding debug output we see:

  linux_nat_wait: [process -1], [TARGET_WNOHANG]
  LLW: enter
  LNW: waitpid(-1, ...) returned 18465, ERRNO-OK
  LLW: waitpid 18465 received Stopped (signal) (stopped)
  LNW: waitpid(-1, ...) returned 18461, ERRNO-OK
  LLW: waitpid 18461 received Trace/breakpoint trap (stopped)
  LLW: Handling extended status 0x03057f
  LHEW: Got clone event from LWP 18461, new child is LWP 18465
  LNW: waitpid(-1, ...) returned 0, ERRNO-OK
  RSRL: resuming stopped-resumed LWP LWP 18465 at 0x3b36af4b51: step=0
  RSRL: resuming stopped-resumed LWP LWP 18461 at 0x3b36af4b51: step=0
  sigchld
  ptrace: No such process.
  (gdb) linux_nat_wait: [process -1], [TARGET_WNOHANG]
  LLW: enter
  LNW: waitpid(-1, ...) returned 18465, ERRNO-OK
  LLW: waitpid 18465 received Killed (terminated)
  LLW: LWP 18465 exited.
  LNW: waitpid(-1, ...) returned 18461, No child processes
  LLW: waitpid 18461 received Killed (terminated)
  Process 18461 exited
  LNW: waitpid(-1, ...) returned -1, No child processes
  LLW: exit
  sigchld
  infrun: target_wait (-1, status) =
  infrun:   18461 [process 18461],
  infrun:   status->kind = signalled, signal = GDB_SIGNAL_KILL
  infrun: TARGET_WAITKIND_SIGNALLED

  Program terminated with signal SIGKILL, Killed.
  The program no longer exists.
  infrun: stop_waiting
  FAIL: gdb.threads/killed.exp: run program to completion (timeout)

The ptrace call that fails was a PTRACE_CONTINUE.

The issue is that here:

  RSRL: resuming stopped-resumed LWP LWP 18465 at 0x3b36af4b51: step=0
  RSRL: resuming stopped-resumed LWP LWP 18461 at 0x3b36af4b51: step=0

The first line shows we had just resumed LWP 18465, which does:

 void *
 child_func (void *dummy)
 {
   kill (pid, SIGKILL);
   exit (1);
 }

So if the kernel manages to schedule that thread fast enough, the
process may be killed before GDB has a chance to resume LWP 18461.

GDBserver has code at the tail end of linux_resume_one_lwp to cope
with this:

~~~
    ptrace (step ? PTRACE_SINGLESTEP : PTRACE_CONT, lwpid_of (thread),
	    (PTRACE_TYPE_ARG3) 0,
	    /* Coerce to a uintptr_t first to avoid potential gcc warning
	       of coercing an 8 byte integer to a 4 byte pointer.  */
	    (PTRACE_TYPE_ARG4) (uintptr_t) signal);

    current_thread = saved_thread;
    if (errno)
      {
	/* ESRCH from ptrace either means that the thread was already
	   running (an error) or that it is gone (a race condition).  If
	   it's gone, we will get a notification the next time we wait,
	   so we can ignore the error.  We could differentiate these
	   two, but it's tricky without waiting; the thread still exists
	   as a zombie, so sending it signal 0 would succeed.  So just
	   ignore ESRCH.  */
	if (errno == ESRCH)
	  return;

	perror_with_name ("ptrace");
      }
~~~

However, that's not a complete fix, because between starting to handle
the resume request and getting that PTRACE_CONTINUE, we run other
ptrace calls that can also fail with ESRCH, and that end up throwing
an error (with perror_with_name).

Whether to ignore ptrace errors or not depends on context that is only
available somewhere up the call chain.  So the fix is to let ptrace
errors throw as they do today, and wrap the resume request in a
TRY/CATCH that swallows it iff the lwp that we were trying to resume
is now zombie.

gdb/gdbserver/ChangeLog:
2015-03-11  Pedro Alves  <palves@redhat.com>

	* linux-low.c (linux_resume_one_lwp): Rename to ...
	(linux_resume_one_lwp_throw): ... this.  Don't handle ESRCH here,
	instead call perror_with_name.
	(handle_zombie_lwp): New function.
	(linux_resume_one_lwp): Reimplement as wrapper around
	linux_resume_one_lwp_throw that swallows errors if the LWP is
	zombie.

gdb/ChangeLog:
2015-03-11  Pedro Alves  <palves@redhat.com>

	* linux-nat.c (linux_resume_one_lwp): Rename to ...
	(linux_resume_one_lwp_throw): ... this.  Don't handle ESRCH here,
	instead call perror_with_name.
	(handle_zombie_lwp): New function.
	(linux_resume_one_lwp): Reimplement as wrapper around
	linux_resume_one_lwp_throw that swallows errors if the LWP is
	zombie.
---
 gdb/gdbserver/linux-low.c | 59 +++++++++++++++++++++++++++++++++++------------
 gdb/linux-nat.c           | 43 +++++++++++++++++++++++++++++++++-
 2 files changed, 86 insertions(+), 16 deletions(-)

[2/6] Introduce throw_ptrace_error

Commit Message

Patch