[v2,2/3] PR remote/19496, interrupted syscall in forking-threads-plus-bkpt

  Hi Pedro,

On 2/1/2016 11:38 AM, Pedro Alves wrote:
> On 01/28/2016 12:48 AM, Don Breazeal wrote:
>> This patch addresses "fork:Interrupted system call" (or wait:) failures
>> in gdb.threads/forking-threads-plus-breakpoint.exp.
>>
>> The test program spawns ten threads, each of which do ten fork/waitpid
>> sequences.  The cause of the problem was that when one of the fork
>> children exited before the corresponding fork parent could initiate its
>> waitpid for that child, a SIGCHLD was delivered and interrupted a fork
>> or waitpid in another thread.

In fact, I think my diagnosis here was incorrect, or at least incorrect
in some cases.  I believe at least some of the interruptions are caused
by SIGSTOP, sent by GDB when stopping all the threads.  More below.

>>
>> The fix was to wrap the system calls in a loop to retry the call if
>> it was interrupted, like:
>>
>> do
>>   {
>>     pid = fork ();
>>   }
>> while (pid == -1 && errno == EINTR);
>>
>> Since this is a Linux-only test I figure it is OK to use errno and EINTR.
>>
>> Tested on Nios II Linux target with x86 Linux host.
> 
> I'd prefer to avoid this if possible.  These loops potentially hide
> bugs like ERESTARTSYS escaping out of a syscall and mishandling of
> signals.  See bc9540e842eb5639ca59cb133adef211d252843c for example:
>    https://sourceware.org/ml/gdb-patches/2015-02/msg00654.html
> 
> How about setting SIGCHLD to SIG_IGN, or making SIGCHLD be SA_RESTART?

I spent a couple of days trying to find an alternate solution, but
couldn't find one that was reliable.  Here is a snapshot of what I tried:

1) SIG_IGN: results in an ECHILD from waitpid.  The man page for waitpid
   says "This can happen for one's own child if the action for SIGCHLD is
   set to SIG_IGN."

2) SA_RESTART: While waitpid is listed as a system call that can be
   restarted by SA_RESTART, fork is not.  Even if I leave the "EINTR loop"
   in place for fork, using SA_RESTART I still see an interrupted system
   call for waitpid.  Possibly because the problem is SIGSTOP and not
   SIGCHLD.

3) pthread_sigblock: With this set for SIGCHLD in all the threads, I
   still saw an interrupted system call.  You can't block SIGSTOP.

4) pthread_sigblock with sigwait: using pthread_sigblock on all the
   blockable signals with a signal thread that called sigwait for all
   the signals in a loop, the signal thread would see a bunch of SIGCHLDs,
   but there would eventually be an interrupted system call.

5) bsd_signal: this function is supposed to automatically restart blocking
   system calls.  fork is not a blocking system call, but it doesn't help
   for waitpid either.

I found this in the ptrace(2) man page: "Note that a suppressed signal
still causes system calls to return prematurely.  In this case, system
calls will be restarted: the tracer will observe the tracee to reexecute
the interrupted system call (or restart_syscall(2) system call for a few
system calls which use a different mechanism for restarting) if the tracer
uses PTRACE_SYSCALL.  Even system calls (such as poll(2)) which are not
restartable after signal are restarted after signal is suppressed; however,
kernel bugs exist which cause some system calls to fail with EINTR even
though no observable signal is injected to the tracee."

The GDB manual mentions something similar about interrupted system calls.

So, the bottom line is that I haven't changed the fix for the interrupted
system calls, because I can't find anything that works as well as the
original fix.  Perhaps this test puts enough stress on the kernel that the
kernel bugs mentioned above are exposed.

One change I did make from the previous version was to increase the
timeout to 90 seconds, which was necessary to get more reliable results
on the Nios II target.

Let me know what you think.
Thanks!
--Don

---
This patch addresses "fork:Interrupted system call" (or wait:) failures
in gdb.threads/forking-threads-plus-breakpoint.exp.

The test program spawns ten threads, each of which do ten fork/waitpid
sequences.  The cause of the problem was that when one of the fork
children exited before the corresponding fork parent could initiate its
waitpid for that child, a SIGCHLD was delivered and interrupted a fork
or waitpid in another thread.

The fix was to wrap the system calls in a loop to retry the call if
it was interrupted, like:

do
  {
    pid = fork ();
  }
while (pid == -1 && errno == EINTR);

Since this is a Linux-only test I figure it is OK to use errno and EINTR.
I tried a number of alternative fixes using SIG_IGN, SA_RESTART,
pthread_sigblock, and bsd_signal, but none of these worked as well.

Tested on Nios II Linux target with x86 Linux host.

gdb/testsuite/ChangeLog:
2016-02-10  Don Breazeal  <donb@codesourcery.com>

	* gdb.threads/forking-threads-plus-breakpoint.c (thread_forks):
	Retry fork and waitpid on interrupted system call errors.
	* gdb.threads/forking-threads-plus-breakpoint.exp: (do_test):
	Increase timeout to 90.

---
 .../gdb.threads/forking-threads-plus-breakpoint.c          | 14 ++++++++++++--
 .../gdb.threads/forking-threads-plus-breakpoint.exp        |  3 +++
 2 files changed, 15 insertions(+), 2 deletions(-)

[v2,2/3] PR remote/19496, interrupted syscall in forking-threads-plus-bkpt

Commit Message

Comments

Patch