PR remote/17028: GDB+GDBserver hangs on Windows
Commit Message
Hi Joel,
Here's my current patch to fix PR remote/17028.
It fixes the testcase in the PR for me, but I'm not set up
for any sort of automated testing on Windows...
I'd push this to my github for easier fetching, but github
seems to be busted atm...
8<-----------------
From 6d70a517e8162fd135d932be057e111a10b365a4 Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 10 Jun 2014 18:58:08 +0100
Subject: [PATCH] PR remote/17028: GDB+GDBserver hangs on Windows
Since target-async was turned on by default, debugging on Windows
using GDB+GDBserver sometimes hangs while waiting for a RSP reply.
The problem is a race in the gdb_select machinery.
This is what we see for a faulty next on the GDB side:
(gdb) n
infrun: clear_proceed_status_thread (Thread 4424)
infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT, step=1)
(...)
infrun: resume (step=1, signal=GDB_SIGNAL_0), ...
Sending packet: $vCont;s:1148;c#5e...
*hang*
At this point, attaching a debugger to the hanging GDB confirms that
it is blocked, waiting for a socket event:
#6 0x757841d8 in WaitForMultipleObjects ()
from C:\Windows\syswow64\kernel32.dll
#7 0x004708e7 in gdb_select (n=469, readfds=0x88ca50 <gdb_notifier+784>,
writefds=0x88cb54 <gdb_notifier+1044>,
exceptfds=0x88cc58 <gdb_notifier+1304>, timeout=0x0)
at /[...]/gdb/mingw-hdep.c:172
#8 0x00527926 in gdb_wait_for_event (block=1)
at /[...]/gdb/event-loop.c:831
#9 0x00526ff1 in gdb_do_one_event ()
at /[...]/gdb/event-loop.c:403
However, on the GDBserver side, we that GDBserver already sent a T05
packet reply:
gdbserver: kernel event EXCEPTION_DEBUG_EVENT for pid=4968 tid=1148
EXCEPTION_SINGLE_STEP
Child Stopped with signal = 5
Writing resume reply for LWP 4968.4424:1
DEBUG: write_prim ($T0505:c8fe2800;04:a0fe2800;08:38164000;thread:1148;#f0)
-> 55
To recap, on Windows, `select' only works with sockets, so we have a
wrapper, gdb_select, that uses the GDB serial abstraction to handle
sockets, consoles, pipes, and serial ports. Each serial descriptor
has a thread associated (we call those the select threads), and those
threads communicate with the main thread by means of standard Windows
events.
It basically goes like this: gdb_select first loops through all fds of
interest, calling their wait_handle hooks, which returns an event that
WaitForMultipleObjects can wait on. gdb_select then blocks in
WaitForMultipleObjects. The wait_handle hook is responsible for
arranging for the returned event to become set once data is available.
This is done by setting the descriptor's helper thread running, which
itself knows how to wait data in the type of handle it manages
(sockets, pipes, consoles, files, etc.). Once it data arrives, the
thread sets the corresponding event which unblocks
WaitForMultipleObjects within gdb_select. However, the wait_handle
hook can also apply an optimization --- if data is already pending,
then there's no need to set the thread running, and the descriptors
event can be set immediately. It's around this latter aspect that
lies the bug/race.
Adding some ad hoc debug logs to ser-mingw.c and mingw-hdep.c, we see
the following sequence of events, right after sending
"$vCont;s:1148;c#5e". Thread 1 is the main thread, and thread 2 is
the socket's helper/select thread. gdb_select was only passed one
descriptor to wait on, the remote target's socket.
net_windows_select_thread is the entry point of the select threads for
sockets.
#1 - thread 1: gdb_select: enter
#2 - thread 2: net_windows_select_thread: WaitForMultipleObjects blocking
gdb_select walked over the wait_handle hooks, and woke up the socket's
helper thread. The helper thread is now blocked waiting for socket
events.
#3 - thread 1: gdb_select: WaitForMultipleObjects polling (timeout=0ms)
#4 - thread 1: gdb_select: WaitForMultipleObjects returned 102 (WAIT_TIMEOUT)
There was no pending data available yet, and gdb_select was passed
timeout==0ms, and so WaitForMultipleObjects times out immediately.
#5 - thread 2: net_windows_select_thread: WaitForMultipleObjects returned 1
Just afterwards, socket data arrives, and thread 2 wakes up. thread 2
calls WSAEnumNetworkEvents, which clears state->sock_event, and marks
the serial's read_event event, telling the main thread that data is
available.
#6 - thread 1: gdb_select: call serial_done_wait_handle on each serial
gdb_select stops all the helper/select threads.
#7 - thread 1: gdb_select: return 0 (WAIT_TIMEOUT)
gdb_select in the main thread returns to the caller.
Note that at this point, data is pending on the socket, the serial's
read_event is set, but the socket's sock_event is not set, until
_further_ data arrives.
Now GDB does it's thing and goes back to the event loop. That calls
gdb_select, but with timeout==INFINITE.
Again, gdb_select calls the socket serial's wait_handle hook. It first
clears its events, starting from a clean slate:
ResetEvent (state->base.read_event);
ResetEvent (state->base.except_event);
ResetEvent (state->base.stop_select);
That cleared read_event, which was previously set in #5 above. And
then it checks for pending events, in the sock_event event:
/* Check any pending events. This both avoids starting the thread
unnecessarily, and handles stray FD_READ events (see below). */
if (WaitForSingleObject (state->sock_event, 0) == WAIT_OBJECT_0)
{
That also fails because state->sock_event was cleared in #5 too...
So the wait_handle hook erroneously decides that it needs to start the
helper thread to wait for input:
#8 - thread 2: net_windows_select_thread: WaitForMultipleObjects blocking
#9 - thread 1: gdb_select: WaitForMultipleObjects blocking (INFINITE)
But, GDBserver already sent all it had to send, so both threads waits
forever...
At first I thought that net_windows_wait_handle shouldn't be resetting
state->base.read_event or state->base.except_event, but looking
deeper, the pipe and console wait_handle hooks reset all events too.
It actually makes sense that way -- we should always be able to query
pending state without looking at the state->sock_event from within
net_windows_wait_handle. The end result is much simpler, and makes
net_windows_select_thread look a lot like console_select_thread,
actually.
gdb/
2014-06-10 Pedro Alves <palves@redhat.com>
PR remote/17028
* ser-mingw.c (net_windows_socket_check_pending): New function.
(net_windows_select_thread): Ignore spurious wakeups. Use
net_windows_socket_check_pending.
(net_windows_wait_handle): Check for pending events with
ioctlsocket, through net_windows_socket_check_pending, instead of
checking the socket's event.
---
gdb/ser-mingw.c | 141 +++++++++++++++++++++++++++-----------------------------
1 file changed, 69 insertions(+), 72 deletions(-)
Comments
On 06/10/2014 07:14 PM, Pedro Alves wrote:
> Hi Joel,
>
> Here's my current patch to fix PR remote/17028.
>
> It fixes the testcase in the PR for me, but I'm not set up
> for any sort of automated testing on Windows...
>
> I'd push this to my github for easier fetching, but github
> seems to be busted atm...
And ... it just fixed itself... I've now pushed this to:
git@github.com:palves/gdb.git windows_async
> Here's my current patch to fix PR remote/17028.
>
> It fixes the testcase in the PR for me, but I'm not set up
> for any sort of automated testing on Windows...
Thanks a lot for the patch. It looks pretty good to me, and it also
passed testing with our testsuite.
> I'd push this to my github for easier fetching, but github
> seems to be busted atm...
I know you were able to push it, but patches sent via git-send-email
are fine as far as I am concerned. I find it faster to apply via
git-am, for instance.
> However, on the GDBserver side, we that GDBserver already sent a T05
^^^ see?
> Now GDB does it's thing and goes back to the event loop. That calls
^^^ its
> gdb/
> 2014-06-10 Pedro Alves <palves@redhat.com>
>
> PR remote/17028
> * ser-mingw.c (net_windows_socket_check_pending): New function.
> (net_windows_select_thread): Ignore spurious wakeups. Use
> net_windows_socket_check_pending.
> (net_windows_wait_handle): Check for pending events with
> ioctlsocket, through net_windows_socket_check_pending, instead of
> checking the socket's event.
Thank you!
On 06/11/2014 09:27 AM, Joel Brobecker wrote:
>> Here's my current patch to fix PR remote/17028.
>>
>> It fixes the testcase in the PR for me, but I'm not set up
>> for any sort of automated testing on Windows...
>
> Thanks a lot for the patch. It looks pretty good to me, and it also
> passed testing with our testsuite.
Great! Looks like target-async still makes it safe and sound to
7.8 then. Phew! :-)
I've pushed it in now, with the log issues you pointed out fixed.
>> I'd push this to my github for easier fetching, but github
>> seems to be busted atm...
>
> I know you were able to push it, but patches sent via git-send-email
> are fine as far as I am concerned. I find it faster to apply via
> git-am, for instance.
It's fine with me too in general.
Though, when calling for testing I like to make it as easy as possible
for someone "driving by" to test (in this case, you, but other people
could want to test as well). Applying a patch works OK if the touched
code hasn't changed since the patch was generated, otherwise one
either gets to hunt for the base revision or fix conflicts.
A test branch is immune to that.
Since you ran the testing already, and the patch is in
mainline, that's moot now though. :-)
Thanks,
@@ -1046,6 +1046,32 @@ struct net_windows_state
HANDLE sock_event;
};
+/* Check whether the socket has any pending data to be read. If so,
+ set the select thread's read event. On error, set the select
+ thread's except event. If any event was set, return true,
+ otherwise return false. */
+
+static int
+net_windows_socket_check_pending (struct serial *scb)
+{
+ struct net_windows_state *state = scb->state;
+ unsigned long available;
+
+ if (ioctlsocket (scb->fd, FIONREAD, &available) != 0)
+ {
+ /* The socket closed, or some other error. */
+ SetEvent (state->base.except_event);
+ return 1;
+ }
+ else if (available > 0)
+ {
+ SetEvent (state->base.read_event);
+ return 1;
+ }
+
+ return 0;
+}
+
static DWORD WINAPI
net_windows_select_thread (void *arg)
{
@@ -1065,33 +1091,54 @@ net_windows_select_thread (void *arg)
wait_events[0] = state->base.stop_select;
wait_events[1] = state->sock_event;
- event_index = WaitForMultipleObjects (2, wait_events, FALSE, INFINITE);
-
- if (event_index == WAIT_OBJECT_0
- || WaitForSingleObject (state->base.stop_select, 0) == WAIT_OBJECT_0)
- /* We have been requested to stop. */
- ;
- else if (event_index != WAIT_OBJECT_0 + 1)
- /* Some error has occured. Assume that this is an error
- condition. */
- SetEvent (state->base.except_event);
- else
+ /* Wait for something to happen on the socket. */
+ while (1)
{
+ event_index = WaitForMultipleObjects (2, wait_events, FALSE, INFINITE);
+
+ if (event_index == WAIT_OBJECT_0
+ || WaitForSingleObject (state->base.stop_select, 0) == WAIT_OBJECT_0)
+ {
+ /* We have been requested to stop. */
+ break;
+ }
+
+ if (event_index != WAIT_OBJECT_0 + 1)
+ {
+ /* Some error has occured. Assume that this is an error
+ condition. */
+ SetEvent (state->base.except_event);
+ break;
+ }
+
/* Enumerate the internal network events, and reset the
object that signalled us to catch the next event. */
- WSAEnumNetworkEvents (scb->fd, state->sock_event, &events);
-
- gdb_assert (events.lNetworkEvents & (FD_READ | FD_CLOSE));
-
+ if (WSAEnumNetworkEvents (scb->fd, state->sock_event, &events) != 0)
+ {
+ /* Something went wrong. Maybe the socket is gone. */
+ SetEvent (state->base.except_event);
+ break;
+ }
+
if (events.lNetworkEvents & FD_READ)
- SetEvent (state->base.read_event);
-
+ {
+ if (net_windows_socket_check_pending (scb))
+ break;
+
+ /* Spurious wakeup. That is, the socket's event was
+ signalled before we last called recv. */
+ }
+
if (events.lNetworkEvents & FD_CLOSE)
- SetEvent (state->base.except_event);
+ {
+ SetEvent (state->base.except_event);
+ break;
+ }
}
SetEvent (state->base.have_stopped);
}
+ return 0;
}
static void
@@ -1107,60 +1154,10 @@ net_windows_wait_handle (struct serial *scb, HANDLE *read, HANDLE *except)
*read = state->base.read_event;
*except = state->base.except_event;
- /* Check any pending events. This both avoids starting the thread
- unnecessarily, and handles stray FD_READ events (see below). */
- if (WaitForSingleObject (state->sock_event, 0) == WAIT_OBJECT_0)
- {
- WSANETWORKEVENTS events;
- int any = 0;
-
- /* Enumerate the internal network events, and reset the object that
- signalled us to catch the next event. */
- WSAEnumNetworkEvents (scb->fd, state->sock_event, &events);
-
- /* You'd think that FD_READ or FD_CLOSE would be set here. But,
- sometimes, neither is. I suspect that the FD_READ is set and
- the corresponding event signalled while recv is running, and
- the FD_READ is then lowered when recv consumes all the data,
- but there's no way to un-signal the event. This isn't a
- problem for the call in net_select_thread, since any new
- events after this point will not have been drained by recv.
- It just means that we can't have the obvious assert here. */
-
- /* If there is a read event, it might be still valid, or it might
- not be - it may have been signalled before we last called
- recv. Double-check that there is data. */
- if (events.lNetworkEvents & FD_READ)
- {
- unsigned long available;
-
- if (ioctlsocket (scb->fd, FIONREAD, &available) == 0
- && available > 0)
- {
- SetEvent (state->base.read_event);
- any = 1;
- }
- else
- /* Oops, no data. This call to recv will cause future
- data to retrigger the event, e.g. while we are
- in net_select_thread. */
- recv (scb->fd, NULL, 0, 0);
- }
-
- /* If there's a close event, then record it - it is obviously
- still valid, and it will not be resignalled. */
- if (events.lNetworkEvents & FD_CLOSE)
- {
- SetEvent (state->base.except_event);
- any = 1;
- }
-
- /* If we set either handle, there's no need to wake the thread. */
- if (any)
- return;
- }
-
- start_select_thread (&state->base);
+ /* Check any pending events. Otherwise, start the select
+ thread. */
+ if (!net_windows_socket_check_pending (scb))
+ start_select_thread (&state->base);
}
static void