[PATCHv2] gdb: Don't drop SIGSTOP during stop_all_threads

  This is an update based on Pedro's feedback:

All of the spelling/grammar issues should have been fixed.

The actual code changes in GDB are identical.

The biggest change is in the test which I've updated to work with:
  --target_board=native-extended-gdbserver

Initially once I changed from using gdb_spawn_with_cmdline_opts to
gdb_start as suggested the test hung with gdbserver, and I though that
I'd found a bug in gdbserver.

Not so, turns out I was just hitting the same issue which caused the
preload library to break GDB with v1 of my patch.

So, in this version the preload library is a little more complex.  I
now try to emulate the kernels sending of SIGCHLD after a short period
of time in order to make it appear like the child thread status really
has just arrived from the kernel later than it really did.  This works
better than I had hoped, with this change in place not only does the
test now pass on gdbserver, but with the preload library I can (in a
very simple test) use GDB as normal.

---

This patch fixes an issue where GDB would sometimes hang when
attaching to a multi-threaded process.  This issue was especially
likely to trigger if the machine (running the inferior) was under
load.

In summary, the problem is an imbalance between two functions in
linux-nat.c, stop_callback and stop_wait_callback.  In stop_callback
we send SIGSTOP to a thread, but _only_ if the thread is not already
stopped, and if it is not signalled, which means it should stop soon.
In stop_wait_callback we wait for the SIGSTOP to arrive, however, we
are aware that the thread might have been signalled for some other
reason, and so if a signal other than SIGSTOP causes the thread to
stop then we stash that signal away so it can be reported back later.
If we get a SIGSTOP then this is discarded, after all, this signal was
sent from stop_callback.  Except that this might not be the case, it
could be that SIGSTOP was sent to a thread from elsewhere in GDB, in
which case we would not have sent another SIGSTOP from stop_callback
and the SIGSTOP received in stop_wait_callback should not be ignored.

Below I've laid out the exact sequence of events that I saw that lead
me to track down the above diagnosis.

After attaching to the inferior GDB sends a SIGSTOP to all of the
threads and then returns to the event loop waiting for interesting
things to happen.

Eventually the first target event is detected (this will be the first
SIGSTOP arriving) and GDB calls inferior_event_handler which calls
fetch_inferior_event.  Inside fetch_inferior_event GDB calls
do_target_wait which calls target_wait to find a thread with an event.

The target_wait call ends up in linux_nat_wait_1, which first checks
to see if any threads already have stashed stop events to report, and
if there are none then we enter a loop fetching as many events as
possible out of the kernel.  This event fetching is non-blocking, and
we give up once the kernel has no more events ready to give us.

All of the events from the kernel are passed through
linux_nat_filter_event which stashes the wait status for all of the
threads that reported a SIGSTOP, these will be returned by future
calls to linux_nat_wait_1.

Lets assume for a moment that we've attached to a multi-threaded
inferior, and that all but one thread has reported its stop during the
initial wait call in linux_nat_wait_1.  The other thread will be
reporting a SIGSTOP, but the kernel has not yet managed to deliver
that signal to GDB before GDB gave up waiting and continued handling
the events it already had.  GDB selects one of the threads that has
reported a SIGSTOP and passes this thread ID back to
fetch_inferior_event.

To handle the thread's SIGSTOP, GDB calls handle_signal_stop, which
calls stop_all_threads, this calls wait_one, which in turn calls
target_wait.

The first call to target_wait at this point will result in a stashed
wait status being returned, at which point we call setup_inferior.
The call to setup_inferior leads to a call into try_thread_db_load_1
which results in a call to linux_stop_and_wait_all_lwps.  This in turn
calls stop_callback on each thread followed by stop_wait_callback on
each thread.

We're now ready to make the mistake.  In stop_callback we see that our
problem thread is not stopped, but is signalled, so it should stop
soon.  As a result we don't send another SIGSTOP.

We then enter stop_wait_callback, eventually the problem thread stops
with SIGSTOP which we _incorrectly_ assume came from stop_callback,
and we discard.

Once stop_wait_callback has done its damage we return from
linux_stop_and_wait_all_lwps, finish in try_thread_db_load_1, and
eventually unwind back to the call to setup_inferior in
stop_all_threads.  GDB now loops around, and performs another
target_wait to get the next event from the inferior.

The target_wait calls causes us to once again reach linux_nat_wait_1,
and we pass through some code that calls resume_stopped_resumed_lwps.
This allows GDB to resume threads that are physically stopped, but
which GDB doesn't see any good reason for the thread to remain
stopped.  In our case, the problem thread which had its SIGSTOP
discarded is stopped, but doesn't have a stashed wait status to
report, and so GDB sets the thread going again.

We are now stuck waiting for an event on the problem thread that might
never arrive.

When considering how to write a test for this bug I struggled.  The
issue was only spotted _randomly_ when a machine was heavily loaded
with many multi-threaded applications, and GDB was being attached (by
script) to all of these applications in parallel.  In one reproducer I
required around 5 applications each of 5 threads per machine core in
order to reproduce the bug 2 out of 3 times.

What we really want to do though is simulate the kernel being slow to
report events through waitpid during the initial attach.  The solution
I came up with was to write an LD_PRELOAD library that intercepts
(some) waitpid calls and rate limits them to one per-second.  Any more
than that simply return 0 indicating there's no event available.
Obviously this can only be applied to waitpid calls that have the
WNOHANG flag set.

Unfortunately, once you ignore a waitpid call GDB can get a bit stuck.
Usually, once the kernel has made a child status available to waitpid
GDB will be sent a SIGCHLD signal.  However, if the kernel makes 5
child statuses available but, due to the preload library we only
collect one of them, then the kernel will not send any further SIGCHLD
signals, and so, when GDB, thinking that the remaining statuses have
not yet arrived sits waiting for a SIGCHLD it will be disappointed.

The solution, implemented within the preload library, is that, when we
hold back a waitpid result from GDB we spawn a new thread.  This
thread delays for a short period, and then sends GDB a SIGCHLD.  This
causes GDB to retry the waitpid, at which point sufficient time has
passed and our library allows the waitpid call to complete.

gdb/ChangeLog:

	* linux-nat.c (stop_wait_callback): Don't discard SIGSTOP if it
	was requested by GDB.

gdb/testsuite/ChangeLog:

	* gdb.threads/attach-slow-waitpid.c: New file.
	* gdb.threads/attach-slow-waitpid.exp: New file.
	* gdb.threads/slow-waitpid.c: New file.
---
 gdb/ChangeLog                                     |   6 +
 gdb/linux-nat.c                                   |  14 +-
 gdb/testsuite/ChangeLog                           |   7 +
 gdb/testsuite/gdb.threads/attach-slow-waitpid.c   |  77 +++++
 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp | 102 +++++++
 gdb/testsuite/gdb.threads/slow-waitpid.c          | 342 ++++++++++++++++++++++
 6 files changed, 544 insertions(+), 4 deletions(-)
 create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.c
 create mode 100644 gdb/testsuite/gdb.threads/attach-slow-waitpid.exp
 create mode 100644 gdb/testsuite/gdb.threads/slow-waitpid.c

[PATCHv2] gdb: Don't drop SIGSTOP during stop_all_threads

Commit Message

Comments

Patch