nptl: Fix racy pipe closing in tst-cancel{20,21}

Message ID 1444849395-18800-1-git-send-email-adhemerval.zanella@linaro.com
State Committed
Headers

Commit Message

Adhemerval Zanella Netto Oct. 14, 2015, 7:03 p.m. UTC
  The tst-cancel20 open two pipes and creates a thread which blocks
reading the first pipe.  It then issues a signal to activate an
handler which also blocks reading the second pipe.  Finally the
cancellation cleanup-up handlers are tested by first closing the
all the pipe ends and issuing a pthread_cancel. The tst-cancel21 
have a similar behavior, but use an extra fork after the test itself.

The race condition occurs if the cancellation handling acts after the
pipe close: in this case read will return EOF (indicating side-effects)
and thus the cancellation must not act.  However current GLIBC
cancellation behavior acts regardless the syscalls returns with
sid-effects.

This patch adjust the test by moving the pipe closing after the
cancellation handling.  This avoid spurious cancellation for the
case described.

Checked on x86_64 and i386.

	* nptl/tst-cancel20.c (do_one_test): Move the pipe closing after
	pthread_join.
	* nptl/tst-cancel21.c (tf): Likewise.
---
 ChangeLog           |  6 ++++++
 nptl/tst-cancel20.c | 15 +++++++++------
 nptl/tst-cancel21.c | 15 +++++++++------
 3 files changed, 24 insertions(+), 12 deletions(-)
  

Comments

Rich Felker Oct. 14, 2015, 7:58 p.m. UTC | #1
On Wed, Oct 14, 2015 at 04:03:15PM -0300, Adhemerval Zanella wrote:
> The tst-cancel20 open two pipes and creates a thread which blocks
> reading the first pipe.  It then issues a signal to activate an
> handler which also blocks reading the second pipe.  Finally the
> cancellation cleanup-up handlers are tested by first closing the
> all the pipe ends and issuing a pthread_cancel. The tst-cancel21 
> have a similar behavior, but use an extra fork after the test itself.
> 
> The race condition occurs if the cancellation handling acts after the
> pipe close: in this case read will return EOF (indicating side-effects)
> and thus the cancellation must not act.  However current GLIBC
> cancellation behavior acts regardless the syscalls returns with
> sid-effects.
> 
> This patch adjust the test by moving the pipe closing after the
> cancellation handling.  This avoid spurious cancellation for the
> case described.
> 
> Checked on x86_64 and i386.

I was involved in the discussion of this and believe that the fix is
correct. The only reason the tests "worked" before was that
cancellation was wrongly being acted upon after read succeeded in
reading EOF.

Note that, with this change, the tests will now timeout if read fails
to act on cancellation, rather than exiting with a reportable error.
This could be fixed with some very complicated machinery involving an
additional signal handler and AS-safe synchronization mechanisms to
control the ordering of close with respect to interruption of read,
but as long as timeout is an acceptable way of detecting test failure,
I see no reason to complicate the test logic like that.

Rich
  
Adhemerval Zanella Netto Oct. 19, 2015, 7:01 p.m. UTC | #2
Ping.

On 14-10-2015 16:58, Rich Felker wrote:
> On Wed, Oct 14, 2015 at 04:03:15PM -0300, Adhemerval Zanella wrote:
>> The tst-cancel20 open two pipes and creates a thread which blocks
>> reading the first pipe.  It then issues a signal to activate an
>> handler which also blocks reading the second pipe.  Finally the
>> cancellation cleanup-up handlers are tested by first closing the
>> all the pipe ends and issuing a pthread_cancel. The tst-cancel21 
>> have a similar behavior, but use an extra fork after the test itself.
>>
>> The race condition occurs if the cancellation handling acts after the
>> pipe close: in this case read will return EOF (indicating side-effects)
>> and thus the cancellation must not act.  However current GLIBC
>> cancellation behavior acts regardless the syscalls returns with
>> sid-effects.
>>
>> This patch adjust the test by moving the pipe closing after the
>> cancellation handling.  This avoid spurious cancellation for the
>> case described.
>>
>> Checked on x86_64 and i386.
> 
> I was involved in the discussion of this and believe that the fix is
> correct. The only reason the tests "worked" before was that
> cancellation was wrongly being acted upon after read succeeded in
> reading EOF.
> 
> Note that, with this change, the tests will now timeout if read fails
> to act on cancellation, rather than exiting with a reportable error.
> This could be fixed with some very complicated machinery involving an
> additional signal handler and AS-safe synchronization mechanisms to
> control the ordering of close with respect to interruption of read,
> but as long as timeout is an acceptable way of detecting test failure,
> I see no reason to complicate the test logic like that.
> 
> Rich
>
  
Adhemerval Zanella Netto Oct. 29, 2015, 5:28 p.m. UTC | #3
Ping.

On 19-10-2015 17:01, Adhemerval Zanella wrote:
> Ping.
> 
> On 14-10-2015 16:58, Rich Felker wrote:
>> On Wed, Oct 14, 2015 at 04:03:15PM -0300, Adhemerval Zanella wrote:
>>> The tst-cancel20 open two pipes and creates a thread which blocks
>>> reading the first pipe.  It then issues a signal to activate an
>>> handler which also blocks reading the second pipe.  Finally the
>>> cancellation cleanup-up handlers are tested by first closing the
>>> all the pipe ends and issuing a pthread_cancel. The tst-cancel21 
>>> have a similar behavior, but use an extra fork after the test itself.
>>>
>>> The race condition occurs if the cancellation handling acts after the
>>> pipe close: in this case read will return EOF (indicating side-effects)
>>> and thus the cancellation must not act.  However current GLIBC
>>> cancellation behavior acts regardless the syscalls returns with
>>> sid-effects.
>>>
>>> This patch adjust the test by moving the pipe closing after the
>>> cancellation handling.  This avoid spurious cancellation for the
>>> case described.
>>>
>>> Checked on x86_64 and i386.
>>
>> I was involved in the discussion of this and believe that the fix is
>> correct. The only reason the tests "worked" before was that
>> cancellation was wrongly being acted upon after read succeeded in
>> reading EOF.
>>
>> Note that, with this change, the tests will now timeout if read fails
>> to act on cancellation, rather than exiting with a reportable error.
>> This could be fixed with some very complicated machinery involving an
>> additional signal handler and AS-safe synchronization mechanisms to
>> control the ordering of close with respect to interruption of read,
>> but as long as timeout is an acceptable way of detecting test failure,
>> I see no reason to complicate the test logic like that.
>>
>> Rich
>>
  
Paul E. Murphy Oct. 29, 2015, 7:59 p.m. UTC | #4
This looks ok to me. Failing with a timeout seems better than
incorrectly reporting a pass. Though, a 40 second timeout seems
a bit high to me.

Tested on ppc64.

Paul

On 10/29/2015 12:28 PM, Adhemerval Zanella wrote:
> Ping.
> 
> On 19-10-2015 17:01, Adhemerval Zanella wrote:
>> Ping.
>>
>> On 14-10-2015 16:58, Rich Felker wrote:
>>> On Wed, Oct 14, 2015 at 04:03:15PM -0300, Adhemerval Zanella wrote:
>>>> The tst-cancel20 open two pipes and creates a thread which blocks
>>>> reading the first pipe.  It then issues a signal to activate an
>>>> handler which also blocks reading the second pipe.  Finally the
>>>> cancellation cleanup-up handlers are tested by first closing the
>>>> all the pipe ends and issuing a pthread_cancel. The tst-cancel21 
>>>> have a similar behavior, but use an extra fork after the test itself.
>>>>
>>>> The race condition occurs if the cancellation handling acts after the
>>>> pipe close: in this case read will return EOF (indicating side-effects)
>>>> and thus the cancellation must not act.  However current GLIBC
>>>> cancellation behavior acts regardless the syscalls returns with
>>>> sid-effects.
>>>>
>>>> This patch adjust the test by moving the pipe closing after the
>>>> cancellation handling.  This avoid spurious cancellation for the
>>>> case described.
>>>>
>>>> Checked on x86_64 and i386.
>>>
>>> I was involved in the discussion of this and believe that the fix is
>>> correct. The only reason the tests "worked" before was that
>>> cancellation was wrongly being acted upon after read succeeded in
>>> reading EOF.
>>>
>>> Note that, with this change, the tests will now timeout if read fails
>>> to act on cancellation, rather than exiting with a reportable error.
>>> This could be fixed with some very complicated machinery involving an
>>> additional signal handler and AS-safe synchronization mechanisms to
>>> control the ordering of close with respect to interruption of read,
>>> but as long as timeout is an acceptable way of detecting test failure,
>>> I see no reason to complicate the test logic like that.
>>>
>>> Rich
>>>
>
  
Adhemerval Zanella Netto Nov. 9, 2015, 12:21 p.m. UTC | #5
Thanks for checking this out.  Anyone more have any more comments on that?

On 29-10-2015 17:59, Paul E. Murphy wrote:
> This looks ok to me. Failing with a timeout seems better than
> incorrectly reporting a pass. Though, a 40 second timeout seems
> a bit high to me.
> 
> Tested on ppc64.
> 
> Paul
> 
> On 10/29/2015 12:28 PM, Adhemerval Zanella wrote:
>> Ping.
>>
>> On 19-10-2015 17:01, Adhemerval Zanella wrote:
>>> Ping.
>>>
>>> On 14-10-2015 16:58, Rich Felker wrote:
>>>> On Wed, Oct 14, 2015 at 04:03:15PM -0300, Adhemerval Zanella wrote:
>>>>> The tst-cancel20 open two pipes and creates a thread which blocks
>>>>> reading the first pipe.  It then issues a signal to activate an
>>>>> handler which also blocks reading the second pipe.  Finally the
>>>>> cancellation cleanup-up handlers are tested by first closing the
>>>>> all the pipe ends and issuing a pthread_cancel. The tst-cancel21 
>>>>> have a similar behavior, but use an extra fork after the test itself.
>>>>>
>>>>> The race condition occurs if the cancellation handling acts after the
>>>>> pipe close: in this case read will return EOF (indicating side-effects)
>>>>> and thus the cancellation must not act.  However current GLIBC
>>>>> cancellation behavior acts regardless the syscalls returns with
>>>>> sid-effects.
>>>>>
>>>>> This patch adjust the test by moving the pipe closing after the
>>>>> cancellation handling.  This avoid spurious cancellation for the
>>>>> case described.
>>>>>
>>>>> Checked on x86_64 and i386.
>>>>
>>>> I was involved in the discussion of this and believe that the fix is
>>>> correct. The only reason the tests "worked" before was that
>>>> cancellation was wrongly being acted upon after read succeeded in
>>>> reading EOF.
>>>>
>>>> Note that, with this change, the tests will now timeout if read fails
>>>> to act on cancellation, rather than exiting with a reportable error.
>>>> This could be fixed with some very complicated machinery involving an
>>>> additional signal handler and AS-safe synchronization mechanisms to
>>>> control the ordering of close with respect to interruption of read,
>>>> but as long as timeout is an acceptable way of detecting test failure,
>>>> I see no reason to complicate the test logic like that.
>>>>
>>>> Rich
>>>>
>>
>
  
Adhemerval Zanella Netto Dec. 2, 2015, 1:35 p.m. UTC | #6
If noone opposes I will commit this shortly.

On 09-11-2015 10:21, Adhemerval Zanella wrote:
> Thanks for checking this out.  Anyone more have any more comments on that?
> 
> On 29-10-2015 17:59, Paul E. Murphy wrote:
>> This looks ok to me. Failing with a timeout seems better than
>> incorrectly reporting a pass. Though, a 40 second timeout seems
>> a bit high to me.
>>
>> Tested on ppc64.
>>
>> Paul
>>
>> On 10/29/2015 12:28 PM, Adhemerval Zanella wrote:
>>> Ping.
>>>
>>> On 19-10-2015 17:01, Adhemerval Zanella wrote:
>>>> Ping.
>>>>
>>>> On 14-10-2015 16:58, Rich Felker wrote:
>>>>> On Wed, Oct 14, 2015 at 04:03:15PM -0300, Adhemerval Zanella wrote:
>>>>>> The tst-cancel20 open two pipes and creates a thread which blocks
>>>>>> reading the first pipe.  It then issues a signal to activate an
>>>>>> handler which also blocks reading the second pipe.  Finally the
>>>>>> cancellation cleanup-up handlers are tested by first closing the
>>>>>> all the pipe ends and issuing a pthread_cancel. The tst-cancel21 
>>>>>> have a similar behavior, but use an extra fork after the test itself.
>>>>>>
>>>>>> The race condition occurs if the cancellation handling acts after the
>>>>>> pipe close: in this case read will return EOF (indicating side-effects)
>>>>>> and thus the cancellation must not act.  However current GLIBC
>>>>>> cancellation behavior acts regardless the syscalls returns with
>>>>>> sid-effects.
>>>>>>
>>>>>> This patch adjust the test by moving the pipe closing after the
>>>>>> cancellation handling.  This avoid spurious cancellation for the
>>>>>> case described.
>>>>>>
>>>>>> Checked on x86_64 and i386.
>>>>>
>>>>> I was involved in the discussion of this and believe that the fix is
>>>>> correct. The only reason the tests "worked" before was that
>>>>> cancellation was wrongly being acted upon after read succeeded in
>>>>> reading EOF.
>>>>>
>>>>> Note that, with this change, the tests will now timeout if read fails
>>>>> to act on cancellation, rather than exiting with a reportable error.
>>>>> This could be fixed with some very complicated machinery involving an
>>>>> additional signal handler and AS-safe synchronization mechanisms to
>>>>> control the ordering of close with respect to interruption of read,
>>>>> but as long as timeout is an acceptable way of detecting test failure,
>>>>> I see no reason to complicate the test logic like that.
>>>>>
>>>>> Rich
>>>>>
>>>
>>
  
Florian Weimer Dec. 2, 2015, 7 p.m. UTC | #7
On 10/14/2015 09:03 PM, Adhemerval Zanella wrote:
> +  /* The pipe closing must be issued after the cancellation handling to avoid
> +     a race condition where the cancellation runs after both pipe ends are
> +     closed.  In this case the read syscall returns EOF and the cancellation
> +     must not act.  */

I find this comment confusing.  What is actually happening during a
spurious failure?  The thread which is supposed to be canceled sees a
failed read from the pipe instead?

In this case, the return value from the read should be checked in a more
stringent fashion.  Or is the read not supposed to return at all?  Then
why have two read calls instead of one?

Florian
  
Adhemerval Zanella Netto Dec. 2, 2015, 8:40 p.m. UTC | #8
On 02-12-2015 17:00, Florian Weimer wrote:
> On 10/14/2015 09:03 PM, Adhemerval Zanella wrote:
>> +  /* The pipe closing must be issued after the cancellation handling to avoid
>> +     a race condition where the cancellation runs after both pipe ends are
>> +     closed.  In this case the read syscall returns EOF and the cancellation
>> +     must not act.  */
> 
> I find this comment confusing.  What is actually happening during a
> spurious failure?  The thread which is supposed to be canceled sees a
> failed read from the pipe instead?

For current GLIBC cancellation mechanism it does not matter if the read
have side-effects or not, the cancellation will act regardless.  However
if we change the cancellation to *not* work if side-effects are returned 
from the syscall (which is the case for new cancellation mechanism I am 
currently working and the musl libc), then we must not cancel if the 
'read' syscall returns EOF.

This is the case of this testcase, where there is no synchronization
between the read in the thread and close in the main thread.  If the
cancellation signal handler acts *after* close syscalls, the read
on the other side of the pipe will return with side-effects (EOF). This
does not fail for GLIBC, even with the racy condition.
 

> 
> In this case, the return value from the read should be checked in a more
> stringent fashion.  Or is the read not supposed to return at all?  Then
> why have two read calls instead of one?

My understanding of the testcase if read to act as a blocking point
and to check if it is cancelled correctly.  As I put before, if the
syscall returns of not with side-effects, it does not matter to *GLIBC*.
However, since we are moving to a more robust cancellation mechanism
we must not cancel in this case.

Thus why I moved the closed after the thread is finished, if the
cancellation is not triggered, the testcase will still fail with a
timeout.

> 
> Florian
>
  

Patch

diff --git a/nptl/tst-cancel20.c b/nptl/tst-cancel20.c
index 51b558e..91452fb 100644
--- a/nptl/tst-cancel20.c
+++ b/nptl/tst-cancel20.c
@@ -145,12 +145,6 @@  do_one_test (void)
       return 1;
     }
 
-  /* This will cause the read in the child to return.  */
-  close (fd[0]);
-  close (fd[1]);
-  close (fd[2]);
-  close (fd[3]);
-
   void *ret;
   if (pthread_join (th, &ret) != 0)
     {
@@ -170,6 +164,15 @@  do_one_test (void)
       return 1;
     }
 
+  /* The pipe closing must be issued after the cancellation handling to avoid
+     a race condition where the cancellation runs after both pipe ends are
+     closed.  In this case the read syscall returns EOF and the cancellation
+     must not act.  */
+  close (fd[0]);
+  close (fd[1]);
+  close (fd[2]);
+  close (fd[3]);
+
   return 0;
 }
 
diff --git a/nptl/tst-cancel21.c b/nptl/tst-cancel21.c
index b54f236..d082776 100644
--- a/nptl/tst-cancel21.c
+++ b/nptl/tst-cancel21.c
@@ -123,12 +123,6 @@  tf (void *arg)
       exit (1);
     }
 
-  /* This will cause the read in the initial thread to return.  */
-  close (fd[0]);
-  close (fd[1]);
-  close (fd[2]);
-  close (fd[3]);
-
   void *ret;
   if (pthread_join (th, &ret) != 0)
     {
@@ -154,6 +148,15 @@  tf (void *arg)
       exit (1);
     }
 
+  /* The pipe closing must be issued after the cancellation handling to avoid
+     a race condition where the cancellation runs after both pipe ends are
+     closed.  In this case the read syscall returns EOF and the cancellation
+     must not act.  */
+  close (fd[0]);
+  close (fd[1]);
+  close (fd[2]);
+  close (fd[3]);
+
   exit (0);
 }