[v2,2/3] PR remote/19496, interrupted syscall in forking-threads-plus-bkpt

Message ID 1455150484-12600-1-git-send-email-donb@codesourcery.com
State New, archived
Headers

Commit Message

Don Breazeal Feb. 11, 2016, 12:28 a.m. UTC
  Hi Pedro,

On 2/1/2016 11:38 AM, Pedro Alves wrote:
> On 01/28/2016 12:48 AM, Don Breazeal wrote:
>> This patch addresses "fork:Interrupted system call" (or wait:) failures
>> in gdb.threads/forking-threads-plus-breakpoint.exp.
>>
>> The test program spawns ten threads, each of which do ten fork/waitpid
>> sequences.  The cause of the problem was that when one of the fork
>> children exited before the corresponding fork parent could initiate its
>> waitpid for that child, a SIGCHLD was delivered and interrupted a fork
>> or waitpid in another thread.

In fact, I think my diagnosis here was incorrect, or at least incorrect
in some cases.  I believe at least some of the interruptions are caused
by SIGSTOP, sent by GDB when stopping all the threads.  More below.

>>
>> The fix was to wrap the system calls in a loop to retry the call if
>> it was interrupted, like:
>>
>> do
>>   {
>>     pid = fork ();
>>   }
>> while (pid == -1 && errno == EINTR);
>>
>> Since this is a Linux-only test I figure it is OK to use errno and EINTR.
>>
>> Tested on Nios II Linux target with x86 Linux host.
> 
> I'd prefer to avoid this if possible.  These loops potentially hide
> bugs like ERESTARTSYS escaping out of a syscall and mishandling of
> signals.  See bc9540e842eb5639ca59cb133adef211d252843c for example:
>    https://sourceware.org/ml/gdb-patches/2015-02/msg00654.html
> 
> How about setting SIGCHLD to SIG_IGN, or making SIGCHLD be SA_RESTART?

I spent a couple of days trying to find an alternate solution, but
couldn't find one that was reliable.  Here is a snapshot of what I tried:

1) SIG_IGN: results in an ECHILD from waitpid.  The man page for waitpid
   says "This can happen for one's own child if the action for SIGCHLD is
   set to SIG_IGN."

2) SA_RESTART: While waitpid is listed as a system call that can be
   restarted by SA_RESTART, fork is not.  Even if I leave the "EINTR loop"
   in place for fork, using SA_RESTART I still see an interrupted system
   call for waitpid.  Possibly because the problem is SIGSTOP and not
   SIGCHLD.

3) pthread_sigblock: With this set for SIGCHLD in all the threads, I
   still saw an interrupted system call.  You can't block SIGSTOP.

4) pthread_sigblock with sigwait: using pthread_sigblock on all the
   blockable signals with a signal thread that called sigwait for all
   the signals in a loop, the signal thread would see a bunch of SIGCHLDs,
   but there would eventually be an interrupted system call.

5) bsd_signal: this function is supposed to automatically restart blocking
   system calls.  fork is not a blocking system call, but it doesn't help
   for waitpid either.

I found this in the ptrace(2) man page: "Note that a suppressed signal
still causes system calls to return prematurely.  In this case, system
calls will be restarted: the tracer will observe the tracee to reexecute
the interrupted system call (or restart_syscall(2) system call for a few
system calls which use a different mechanism for restarting) if the tracer
uses PTRACE_SYSCALL.  Even system calls (such as poll(2)) which are not
restartable after signal are restarted after signal is suppressed; however,
kernel bugs exist which cause some system calls to fail with EINTR even
though no observable signal is injected to the tracee."

The GDB manual mentions something similar about interrupted system calls.

So, the bottom line is that I haven't changed the fix for the interrupted
system calls, because I can't find anything that works as well as the
original fix.  Perhaps this test puts enough stress on the kernel that the
kernel bugs mentioned above are exposed.

One change I did make from the previous version was to increase the
timeout to 90 seconds, which was necessary to get more reliable results
on the Nios II target.

Let me know what you think.
Thanks!
--Don

---
This patch addresses "fork:Interrupted system call" (or wait:) failures
in gdb.threads/forking-threads-plus-breakpoint.exp.

The test program spawns ten threads, each of which do ten fork/waitpid
sequences.  The cause of the problem was that when one of the fork
children exited before the corresponding fork parent could initiate its
waitpid for that child, a SIGCHLD was delivered and interrupted a fork
or waitpid in another thread.

The fix was to wrap the system calls in a loop to retry the call if
it was interrupted, like:

do
  {
    pid = fork ();
  }
while (pid == -1 && errno == EINTR);

Since this is a Linux-only test I figure it is OK to use errno and EINTR.
I tried a number of alternative fixes using SIG_IGN, SA_RESTART,
pthread_sigblock, and bsd_signal, but none of these worked as well.

Tested on Nios II Linux target with x86 Linux host.

gdb/testsuite/ChangeLog:
2016-02-10  Don Breazeal  <donb@codesourcery.com>

	* gdb.threads/forking-threads-plus-breakpoint.c (thread_forks):
	Retry fork and waitpid on interrupted system call errors.
	* gdb.threads/forking-threads-plus-breakpoint.exp: (do_test):
	Increase timeout to 90.

---
 .../gdb.threads/forking-threads-plus-breakpoint.c          | 14 ++++++++++++--
 .../gdb.threads/forking-threads-plus-breakpoint.exp        |  3 +++
 2 files changed, 15 insertions(+), 2 deletions(-)
  

Comments

Don Breazeal Feb. 25, 2016, 5:28 p.m. UTC | #1
Ping
Thanks,
--Don

On 2/10/2016 4:28 PM, Don Breazeal wrote:
> Hi Pedro,
> 
> On 2/1/2016 11:38 AM, Pedro Alves wrote:
>> On 01/28/2016 12:48 AM, Don Breazeal wrote:
>>> This patch addresses "fork:Interrupted system call" (or wait:) failures
>>> in gdb.threads/forking-threads-plus-breakpoint.exp.
>>>
>>> The test program spawns ten threads, each of which do ten fork/waitpid
>>> sequences.  The cause of the problem was that when one of the fork
>>> children exited before the corresponding fork parent could initiate its
>>> waitpid for that child, a SIGCHLD was delivered and interrupted a fork
>>> or waitpid in another thread.
> 
> In fact, I think my diagnosis here was incorrect, or at least incorrect
> in some cases.  I believe at least some of the interruptions are caused
> by SIGSTOP, sent by GDB when stopping all the threads.  More below.
> 
>>>
>>> The fix was to wrap the system calls in a loop to retry the call if
>>> it was interrupted, like:
>>>
>>> do
>>>   {
>>>     pid = fork ();
>>>   }
>>> while (pid == -1 && errno == EINTR);
>>>
>>> Since this is a Linux-only test I figure it is OK to use errno and EINTR.
>>>
>>> Tested on Nios II Linux target with x86 Linux host.
>>
>> I'd prefer to avoid this if possible.  These loops potentially hide
>> bugs like ERESTARTSYS escaping out of a syscall and mishandling of
>> signals.  See bc9540e842eb5639ca59cb133adef211d252843c for example:
>>    https://sourceware.org/ml/gdb-patches/2015-02/msg00654.html
>>
>> How about setting SIGCHLD to SIG_IGN, or making SIGCHLD be SA_RESTART?
> 
> I spent a couple of days trying to find an alternate solution, but
> couldn't find one that was reliable.  Here is a snapshot of what I tried:
> 
> 1) SIG_IGN: results in an ECHILD from waitpid.  The man page for waitpid
>    says "This can happen for one's own child if the action for SIGCHLD is
>    set to SIG_IGN."
> 
> 2) SA_RESTART: While waitpid is listed as a system call that can be
>    restarted by SA_RESTART, fork is not.  Even if I leave the "EINTR loop"
>    in place for fork, using SA_RESTART I still see an interrupted system
>    call for waitpid.  Possibly because the problem is SIGSTOP and not
>    SIGCHLD.
> 
> 3) pthread_sigblock: With this set for SIGCHLD in all the threads, I
>    still saw an interrupted system call.  You can't block SIGSTOP.
> 
> 4) pthread_sigblock with sigwait: using pthread_sigblock on all the
>    blockable signals with a signal thread that called sigwait for all
>    the signals in a loop, the signal thread would see a bunch of SIGCHLDs,
>    but there would eventually be an interrupted system call.
> 
> 5) bsd_signal: this function is supposed to automatically restart blocking
>    system calls.  fork is not a blocking system call, but it doesn't help
>    for waitpid either.
> 
> I found this in the ptrace(2) man page: "Note that a suppressed signal
> still causes system calls to return prematurely.  In this case, system
> calls will be restarted: the tracer will observe the tracee to reexecute
> the interrupted system call (or restart_syscall(2) system call for a few
> system calls which use a different mechanism for restarting) if the tracer
> uses PTRACE_SYSCALL.  Even system calls (such as poll(2)) which are not
> restartable after signal are restarted after signal is suppressed; however,
> kernel bugs exist which cause some system calls to fail with EINTR even
> though no observable signal is injected to the tracee."
> 
> The GDB manual mentions something similar about interrupted system calls.
> 
> So, the bottom line is that I haven't changed the fix for the interrupted
> system calls, because I can't find anything that works as well as the
> original fix.  Perhaps this test puts enough stress on the kernel that the
> kernel bugs mentioned above are exposed.
> 
> One change I did make from the previous version was to increase the
> timeout to 90 seconds, which was necessary to get more reliable results
> on the Nios II target.
> 
> Let me know what you think.
> Thanks!
> --Don
> 
> ---
> This patch addresses "fork:Interrupted system call" (or wait:) failures
> in gdb.threads/forking-threads-plus-breakpoint.exp.
> 
> The test program spawns ten threads, each of which do ten fork/waitpid
> sequences.  The cause of the problem was that when one of the fork
> children exited before the corresponding fork parent could initiate its
> waitpid for that child, a SIGCHLD was delivered and interrupted a fork
> or waitpid in another thread.
> 
> The fix was to wrap the system calls in a loop to retry the call if
> it was interrupted, like:
> 
> do
>   {
>     pid = fork ();
>   }
> while (pid == -1 && errno == EINTR);
> 
> Since this is a Linux-only test I figure it is OK to use errno and EINTR.
> I tried a number of alternative fixes using SIG_IGN, SA_RESTART,
> pthread_sigblock, and bsd_signal, but none of these worked as well.
> 
> Tested on Nios II Linux target with x86 Linux host.
> 
> gdb/testsuite/ChangeLog:
> 2016-02-10  Don Breazeal  <donb@codesourcery.com>
> 
> 	* gdb.threads/forking-threads-plus-breakpoint.c (thread_forks):
> 	Retry fork and waitpid on interrupted system call errors.
> 	* gdb.threads/forking-threads-plus-breakpoint.exp: (do_test):
> 	Increase timeout to 90.
> 
> ---
>  .../gdb.threads/forking-threads-plus-breakpoint.c          | 14 ++++++++++++--
>  .../gdb.threads/forking-threads-plus-breakpoint.exp        |  3 +++
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
> index fc64d93..c169e18 100644
> --- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
> +++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
> @@ -22,6 +22,7 @@
>  #include <sys/types.h>
>  #include <sys/wait.h>
>  #include <stdlib.h>
> +#include <errno.h>
>  
>  /* Number of threads.  Each thread continuously spawns a fork and wait
>     for it.  If we have another thread continuously start a step over,
> @@ -49,14 +50,23 @@ thread_forks (void *arg)
>      {
>        pid_t pid;
>  
> -      pid = fork ();
> +      do
> +	{
> +	  pid = fork ();
> +	}
> +      while (pid == -1 && errno == EINTR);
>  
>        if (pid > 0)
>  	{
>  	  int status;
>  
>  	  /* Parent.  */
> -	  pid = waitpid (pid, &status, 0);
> +	  do
> +	    {
> +	      pid = waitpid (pid, &status, 0);
> +	    }
> +	  while (pid == -1 && errno == EINTR);
> +
>  	  if (pid == -1)
>  	    {
>  	      perror ("wait");
> diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
> index ff3ca9a..6889c2b 100644
> --- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
> +++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
> @@ -73,6 +73,9 @@ proc do_test { cond_bp_target detach_on_fork displaced } {
>      global linenum
>      global is_remote_target
>  
> +    global timeout
> +    set timeout 90
> +
>      set saved_gdbflags $GDBFLAGS
>      set GDBFLAGS [concat $GDBFLAGS " -ex \"set non-stop on\""]
>      clean_restart $binfile
>
  
Don Breazeal March 3, 2016, 6:19 p.m. UTC | #2
Ping.
I checked, the patch still applies cleanly to mainline.
Thanks
--Don

On 2/25/2016 9:28 AM, Don Breazeal wrote:
> Ping
> Thanks,
> --Don
> 
> On 2/10/2016 4:28 PM, Don Breazeal wrote:
>> Hi Pedro,
>>
>> On 2/1/2016 11:38 AM, Pedro Alves wrote:
>>> On 01/28/2016 12:48 AM, Don Breazeal wrote:
>>>> This patch addresses "fork:Interrupted system call" (or wait:) failures
>>>> in gdb.threads/forking-threads-plus-breakpoint.exp.
>>>>
>>>> The test program spawns ten threads, each of which do ten fork/waitpid
>>>> sequences.  The cause of the problem was that when one of the fork
>>>> children exited before the corresponding fork parent could initiate its
>>>> waitpid for that child, a SIGCHLD was delivered and interrupted a fork
>>>> or waitpid in another thread.
>>
>> In fact, I think my diagnosis here was incorrect, or at least incorrect
>> in some cases.  I believe at least some of the interruptions are caused
>> by SIGSTOP, sent by GDB when stopping all the threads.  More below.
>>
>>>>
>>>> The fix was to wrap the system calls in a loop to retry the call if
>>>> it was interrupted, like:
>>>>
>>>> do
>>>>   {
>>>>     pid = fork ();
>>>>   }
>>>> while (pid == -1 && errno == EINTR);
>>>>
>>>> Since this is a Linux-only test I figure it is OK to use errno and EINTR.
>>>>
>>>> Tested on Nios II Linux target with x86 Linux host.
>>>
>>> I'd prefer to avoid this if possible.  These loops potentially hide
>>> bugs like ERESTARTSYS escaping out of a syscall and mishandling of
>>> signals.  See bc9540e842eb5639ca59cb133adef211d252843c for example:
>>>    https://sourceware.org/ml/gdb-patches/2015-02/msg00654.html
>>>
>>> How about setting SIGCHLD to SIG_IGN, or making SIGCHLD be SA_RESTART?
>>
>> I spent a couple of days trying to find an alternate solution, but
>> couldn't find one that was reliable.  Here is a snapshot of what I tried:
>>
>> 1) SIG_IGN: results in an ECHILD from waitpid.  The man page for waitpid
>>    says "This can happen for one's own child if the action for SIGCHLD is
>>    set to SIG_IGN."
>>
>> 2) SA_RESTART: While waitpid is listed as a system call that can be
>>    restarted by SA_RESTART, fork is not.  Even if I leave the "EINTR loop"
>>    in place for fork, using SA_RESTART I still see an interrupted system
>>    call for waitpid.  Possibly because the problem is SIGSTOP and not
>>    SIGCHLD.
>>
>> 3) pthread_sigblock: With this set for SIGCHLD in all the threads, I
>>    still saw an interrupted system call.  You can't block SIGSTOP.
>>
>> 4) pthread_sigblock with sigwait: using pthread_sigblock on all the
>>    blockable signals with a signal thread that called sigwait for all
>>    the signals in a loop, the signal thread would see a bunch of SIGCHLDs,
>>    but there would eventually be an interrupted system call.
>>
>> 5) bsd_signal: this function is supposed to automatically restart blocking
>>    system calls.  fork is not a blocking system call, but it doesn't help
>>    for waitpid either.
>>
>> I found this in the ptrace(2) man page: "Note that a suppressed signal
>> still causes system calls to return prematurely.  In this case, system
>> calls will be restarted: the tracer will observe the tracee to reexecute
>> the interrupted system call (or restart_syscall(2) system call for a few
>> system calls which use a different mechanism for restarting) if the tracer
>> uses PTRACE_SYSCALL.  Even system calls (such as poll(2)) which are not
>> restartable after signal are restarted after signal is suppressed; however,
>> kernel bugs exist which cause some system calls to fail with EINTR even
>> though no observable signal is injected to the tracee."
>>
>> The GDB manual mentions something similar about interrupted system calls.
>>
>> So, the bottom line is that I haven't changed the fix for the interrupted
>> system calls, because I can't find anything that works as well as the
>> original fix.  Perhaps this test puts enough stress on the kernel that the
>> kernel bugs mentioned above are exposed.
>>
>> One change I did make from the previous version was to increase the
>> timeout to 90 seconds, which was necessary to get more reliable results
>> on the Nios II target.
>>
>> Let me know what you think.
>> Thanks!
>> --Don
>>
>> ---
>> This patch addresses "fork:Interrupted system call" (or wait:) failures
>> in gdb.threads/forking-threads-plus-breakpoint.exp.
>>
>> The test program spawns ten threads, each of which do ten fork/waitpid
>> sequences.  The cause of the problem was that when one of the fork
>> children exited before the corresponding fork parent could initiate its
>> waitpid for that child, a SIGCHLD was delivered and interrupted a fork
>> or waitpid in another thread.
>>
>> The fix was to wrap the system calls in a loop to retry the call if
>> it was interrupted, like:
>>
>> do
>>   {
>>     pid = fork ();
>>   }
>> while (pid == -1 && errno == EINTR);
>>
>> Since this is a Linux-only test I figure it is OK to use errno and EINTR.
>> I tried a number of alternative fixes using SIG_IGN, SA_RESTART,
>> pthread_sigblock, and bsd_signal, but none of these worked as well.
>>
>> Tested on Nios II Linux target with x86 Linux host.
>>
>> gdb/testsuite/ChangeLog:
>> 2016-02-10  Don Breazeal  <donb@codesourcery.com>
>>
>> 	* gdb.threads/forking-threads-plus-breakpoint.c (thread_forks):
>> 	Retry fork and waitpid on interrupted system call errors.
>> 	* gdb.threads/forking-threads-plus-breakpoint.exp: (do_test):
>> 	Increase timeout to 90.
>>
>> ---
>>  .../gdb.threads/forking-threads-plus-breakpoint.c          | 14 ++++++++++++--
>>  .../gdb.threads/forking-threads-plus-breakpoint.exp        |  3 +++
>>  2 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
>> index fc64d93..c169e18 100644
>> --- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
>> +++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
>> @@ -22,6 +22,7 @@
>>  #include <sys/types.h>
>>  #include <sys/wait.h>
>>  #include <stdlib.h>
>> +#include <errno.h>
>>  
>>  /* Number of threads.  Each thread continuously spawns a fork and wait
>>     for it.  If we have another thread continuously start a step over,
>> @@ -49,14 +50,23 @@ thread_forks (void *arg)
>>      {
>>        pid_t pid;
>>  
>> -      pid = fork ();
>> +      do
>> +	{
>> +	  pid = fork ();
>> +	}
>> +      while (pid == -1 && errno == EINTR);
>>  
>>        if (pid > 0)
>>  	{
>>  	  int status;
>>  
>>  	  /* Parent.  */
>> -	  pid = waitpid (pid, &status, 0);
>> +	  do
>> +	    {
>> +	      pid = waitpid (pid, &status, 0);
>> +	    }
>> +	  while (pid == -1 && errno == EINTR);
>> +
>>  	  if (pid == -1)
>>  	    {
>>  	      perror ("wait");
>> diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
>> index ff3ca9a..6889c2b 100644
>> --- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
>> +++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
>> @@ -73,6 +73,9 @@ proc do_test { cond_bp_target detach_on_fork displaced } {
>>      global linenum
>>      global is_remote_target
>>  
>> +    global timeout
>> +    set timeout 90
>> +
>>      set saved_gdbflags $GDBFLAGS
>>      set GDBFLAGS [concat $GDBFLAGS " -ex \"set non-stop on\""]
>>      clean_restart $binfile
>>
>
  
Don Breazeal March 14, 2016, 9:29 p.m. UTC | #3
Hi Pedro,
Did you have any more suggestions about handling the interrupted system
call, or shall we go with the "loop until we don't get -1 and
errno===EINTR" approach?
Thanks,
--Don

On 3/3/2016 10:19 AM, Don Breazeal wrote:
> Ping.
> I checked, the patch still applies cleanly to mainline.
> Thanks
> --Don
> 
> On 2/25/2016 9:28 AM, Don Breazeal wrote:
>> Ping
>> Thanks,
>> --Don
>>
>> On 2/10/2016 4:28 PM, Don Breazeal wrote:
>>> Hi Pedro,
>>>
>>> On 2/1/2016 11:38 AM, Pedro Alves wrote:
>>>> On 01/28/2016 12:48 AM, Don Breazeal wrote:
>>>>> This patch addresses "fork:Interrupted system call" (or wait:) failures
>>>>> in gdb.threads/forking-threads-plus-breakpoint.exp.
>>>>>
>>>>> The test program spawns ten threads, each of which do ten fork/waitpid
>>>>> sequences.  The cause of the problem was that when one of the fork
>>>>> children exited before the corresponding fork parent could initiate its
>>>>> waitpid for that child, a SIGCHLD was delivered and interrupted a fork
>>>>> or waitpid in another thread.
>>>
>>> In fact, I think my diagnosis here was incorrect, or at least incorrect
>>> in some cases.  I believe at least some of the interruptions are caused
>>> by SIGSTOP, sent by GDB when stopping all the threads.  More below.
>>>
>>>>>
>>>>> The fix was to wrap the system calls in a loop to retry the call if
>>>>> it was interrupted, like:
>>>>>
>>>>> do
>>>>>   {
>>>>>     pid = fork ();
>>>>>   }
>>>>> while (pid == -1 && errno == EINTR);
>>>>>
>>>>> Since this is a Linux-only test I figure it is OK to use errno and EINTR.
>>>>>
>>>>> Tested on Nios II Linux target with x86 Linux host.
>>>>
>>>> I'd prefer to avoid this if possible.  These loops potentially hide
>>>> bugs like ERESTARTSYS escaping out of a syscall and mishandling of
>>>> signals.  See bc9540e842eb5639ca59cb133adef211d252843c for example:
>>>>    https://sourceware.org/ml/gdb-patches/2015-02/msg00654.html
>>>>
>>>> How about setting SIGCHLD to SIG_IGN, or making SIGCHLD be SA_RESTART?
>>>
>>> I spent a couple of days trying to find an alternate solution, but
>>> couldn't find one that was reliable.  Here is a snapshot of what I tried:
>>>
>>> 1) SIG_IGN: results in an ECHILD from waitpid.  The man page for waitpid
>>>    says "This can happen for one's own child if the action for SIGCHLD is
>>>    set to SIG_IGN."
>>>
>>> 2) SA_RESTART: While waitpid is listed as a system call that can be
>>>    restarted by SA_RESTART, fork is not.  Even if I leave the "EINTR loop"
>>>    in place for fork, using SA_RESTART I still see an interrupted system
>>>    call for waitpid.  Possibly because the problem is SIGSTOP and not
>>>    SIGCHLD.
>>>
>>> 3) pthread_sigblock: With this set for SIGCHLD in all the threads, I
>>>    still saw an interrupted system call.  You can't block SIGSTOP.
>>>
>>> 4) pthread_sigblock with sigwait: using pthread_sigblock on all the
>>>    blockable signals with a signal thread that called sigwait for all
>>>    the signals in a loop, the signal thread would see a bunch of SIGCHLDs,
>>>    but there would eventually be an interrupted system call.
>>>
>>> 5) bsd_signal: this function is supposed to automatically restart blocking
>>>    system calls.  fork is not a blocking system call, but it doesn't help
>>>    for waitpid either.
>>>
>>> I found this in the ptrace(2) man page: "Note that a suppressed signal
>>> still causes system calls to return prematurely.  In this case, system
>>> calls will be restarted: the tracer will observe the tracee to reexecute
>>> the interrupted system call (or restart_syscall(2) system call for a few
>>> system calls which use a different mechanism for restarting) if the tracer
>>> uses PTRACE_SYSCALL.  Even system calls (such as poll(2)) which are not
>>> restartable after signal are restarted after signal is suppressed; however,
>>> kernel bugs exist which cause some system calls to fail with EINTR even
>>> though no observable signal is injected to the tracee."
>>>
>>> The GDB manual mentions something similar about interrupted system calls.
>>>
>>> So, the bottom line is that I haven't changed the fix for the interrupted
>>> system calls, because I can't find anything that works as well as the
>>> original fix.  Perhaps this test puts enough stress on the kernel that the
>>> kernel bugs mentioned above are exposed.
>>>
>>> One change I did make from the previous version was to increase the
>>> timeout to 90 seconds, which was necessary to get more reliable results
>>> on the Nios II target.
>>>
>>> Let me know what you think.
>>> Thanks!
>>> --Don
>>>
>>> ---
>>> This patch addresses "fork:Interrupted system call" (or wait:) failures
>>> in gdb.threads/forking-threads-plus-breakpoint.exp.
>>>
>>> The test program spawns ten threads, each of which do ten fork/waitpid
>>> sequences.  The cause of the problem was that when one of the fork
>>> children exited before the corresponding fork parent could initiate its
>>> waitpid for that child, a SIGCHLD was delivered and interrupted a fork
>>> or waitpid in another thread.
>>>
>>> The fix was to wrap the system calls in a loop to retry the call if
>>> it was interrupted, like:
>>>
>>> do
>>>   {
>>>     pid = fork ();
>>>   }
>>> while (pid == -1 && errno == EINTR);
>>>
>>> Since this is a Linux-only test I figure it is OK to use errno and EINTR.
>>> I tried a number of alternative fixes using SIG_IGN, SA_RESTART,
>>> pthread_sigblock, and bsd_signal, but none of these worked as well.
>>>
>>> Tested on Nios II Linux target with x86 Linux host.
>>>
>>> gdb/testsuite/ChangeLog:
>>> 2016-02-10  Don Breazeal  <donb@codesourcery.com>
>>>
>>> 	* gdb.threads/forking-threads-plus-breakpoint.c (thread_forks):
>>> 	Retry fork and waitpid on interrupted system call errors.
>>> 	* gdb.threads/forking-threads-plus-breakpoint.exp: (do_test):
>>> 	Increase timeout to 90.
>>>
>>> ---
>>>  .../gdb.threads/forking-threads-plus-breakpoint.c          | 14 ++++++++++++--
>>>  .../gdb.threads/forking-threads-plus-breakpoint.exp        |  3 +++
>>>  2 files changed, 15 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
>>> index fc64d93..c169e18 100644
>>> --- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
>>> +++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
>>> @@ -22,6 +22,7 @@
>>>  #include <sys/types.h>
>>>  #include <sys/wait.h>
>>>  #include <stdlib.h>
>>> +#include <errno.h>
>>>  
>>>  /* Number of threads.  Each thread continuously spawns a fork and wait
>>>     for it.  If we have another thread continuously start a step over,
>>> @@ -49,14 +50,23 @@ thread_forks (void *arg)
>>>      {
>>>        pid_t pid;
>>>  
>>> -      pid = fork ();
>>> +      do
>>> +	{
>>> +	  pid = fork ();
>>> +	}
>>> +      while (pid == -1 && errno == EINTR);
>>>  
>>>        if (pid > 0)
>>>  	{
>>>  	  int status;
>>>  
>>>  	  /* Parent.  */
>>> -	  pid = waitpid (pid, &status, 0);
>>> +	  do
>>> +	    {
>>> +	      pid = waitpid (pid, &status, 0);
>>> +	    }
>>> +	  while (pid == -1 && errno == EINTR);
>>> +
>>>  	  if (pid == -1)
>>>  	    {
>>>  	      perror ("wait");
>>> diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
>>> index ff3ca9a..6889c2b 100644
>>> --- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
>>> +++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
>>> @@ -73,6 +73,9 @@ proc do_test { cond_bp_target detach_on_fork displaced } {
>>>      global linenum
>>>      global is_remote_target
>>>  
>>> +    global timeout
>>> +    set timeout 90
>>> +
>>>      set saved_gdbflags $GDBFLAGS
>>>      set GDBFLAGS [concat $GDBFLAGS " -ex \"set non-stop on\""]
>>>      clean_restart $binfile
>>>
>>
>
  
Pedro Alves March 15, 2016, 12:55 p.m. UTC | #4
Hi Don,

On 02/11/2016 12:28 AM, Don Breazeal wrote:

> So, the bottom line is that I haven't changed the fix for the interrupted
> system calls, because I can't find anything that works as well as the
> original fix.  Perhaps this test puts enough stress on the kernel that the
> kernel bugs mentioned above are exposed.

Many thanks for the thorough investigation.  Let's take your
original approach then.

> 
> One change I did make from the previous version was to increase the
> timeout to 90 seconds, which was necessary to get more reliable results
> on the Nios II target.
> diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
> index ff3ca9a..6889c2b 100644
> --- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
> +++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
> @@ -73,6 +73,9 @@ proc do_test { cond_bp_target detach_on_fork displaced } {
>       global linenum
>       global is_remote_target
>   
> +    global timeout
> +    set timeout 90
> +

Use with_timeout_factor instead so that the timeout is properly restored,
and put it around the problematic test, only, instead of basically
around the whole test case.  I think that'll be the "inferior 1 exited"
test?

>       set saved_gdbflags $GDBFLAGS
>       set GDBFLAGS [concat $GDBFLAGS " -ex \"set non-stop on\""]
>       clean_restart $binfile

Thanks,
Pedro Alves
  

Patch

diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
index fc64d93..c169e18 100644
--- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
+++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.c
@@ -22,6 +22,7 @@ 
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <stdlib.h>
+#include <errno.h>
 
 /* Number of threads.  Each thread continuously spawns a fork and wait
    for it.  If we have another thread continuously start a step over,
@@ -49,14 +50,23 @@  thread_forks (void *arg)
     {
       pid_t pid;
 
-      pid = fork ();
+      do
+	{
+	  pid = fork ();
+	}
+      while (pid == -1 && errno == EINTR);
 
       if (pid > 0)
 	{
 	  int status;
 
 	  /* Parent.  */
-	  pid = waitpid (pid, &status, 0);
+	  do
+	    {
+	      pid = waitpid (pid, &status, 0);
+	    }
+	  while (pid == -1 && errno == EINTR);
+
 	  if (pid == -1)
 	    {
 	      perror ("wait");
diff --git a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
index ff3ca9a..6889c2b 100644
--- a/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
+++ b/gdb/testsuite/gdb.threads/forking-threads-plus-breakpoint.exp
@@ -73,6 +73,9 @@  proc do_test { cond_bp_target detach_on_fork displaced } {
     global linenum
     global is_remote_target
 
+    global timeout
+    set timeout 90
+
     set saved_gdbflags $GDBFLAGS
     set GDBFLAGS [concat $GDBFLAGS " -ex \"set non-stop on\""]
     clean_restart $binfile