Fix race in tst-mqueue5

Message ID 569552C6.8050200@linux.vnet.ibm.com
State Committed
Headers

Commit Message

Paul E. Murphy Jan. 12, 2016, 7:23 p.m. UTC
  This seems to fix the test for ppc, and probably others too.
  

Comments

Carlos O'Donell Jan. 13, 2016, 1:58 a.m. UTC | #1
On 01/12/2016 02:23 PM, Paul E. Murphy wrote:
> This seems to fix the test for ppc, and probably others too.
 
Had you considered the 'sleep (1);' on line 394 as another source of problems?

Must it be the case that by the time '(void) pthread_barrier_wait (b3);' returns,
the SIGRTMIN has been delivered and handled? Why isn't 'sigwait' required to make
this test operate without a race bewtween checking 'rtmin_cnt != 2' and the signal
arriving and being handled?

While I agree that any fix that makes tst-mqueue5 fail less spuriously is a good
thing, I'm curious about your review of the test as a whole (now that I've looked
at it again).

Cheers,
Carlos.
  
Adhemerval Zanella Netto Jan. 13, 2016, 2 p.m. UTC | #2
On 12-01-2016 23:58, Carlos O'Donell wrote:
> On 01/12/2016 02:23 PM, Paul E. Murphy wrote:
>> This seems to fix the test for ppc, and probably others too.
>  
> Had you considered the 'sleep (1);' on line 394 as another source of problems?
> 
> Must it be the case that by the time '(void) pthread_barrier_wait (b3);' returns,
> the SIGRTMIN has been delivered and handled? Why isn't 'sigwait' required to make
> this test operate without a race bewtween checking 'rtmin_cnt != 2' and the signal
> arriving and being handled?

My understanding is let's say due some scheduling the main thread of child process
calls 'mq_notify' before the 'mqrecv' on child thread in child process. In this 
case a notification will be send, since eventually child thread in child process
will call 'mqrecv'. This will trigger 'rtmin_handler' and the test at 413 will fail.

I do not know how often this scenario might arise, but it might happen on a heavy
loaded machine.

This is similar of a fix I added on 'nptl/tst-cancel{20,21}.c' where it used a sleep
to add some 'synchronization'. I don't know which a better solution to synchronize
with a thread syscall that will block.

> 
> While I agree that any fix that makes tst-mqueue5 fail less spuriously is a good
> thing, I'm curious about your review of the test as a whole (now that I've looked
> at it again).
> 

I would say to just remove the spurious synchronization issues, specially the ones
that might arise with the use of 'sleep' in bad ways.

> Cheers,
> Carlos.
>
  
Carlos O'Donell Jan. 13, 2016, 2:11 p.m. UTC | #3
On 01/13/2016 09:00 AM, Adhemerval Zanella wrote:
>> While I agree that any fix that makes tst-mqueue5 fail less spuriously is a good
>> thing, I'm curious about your review of the test as a whole (now that I've looked
>> at it again).
> 
> I would say to just remove the spurious synchronization issues, specially the ones
> that might arise with the use of 'sleep' in bad ways.

Does that mean you'd rather see a more comprehensive fix?

The test still uses sleep() in one place.

Cheers,
Carlos.
  
Adhemerval Zanella Netto Jan. 13, 2016, 5:40 p.m. UTC | #4
On 13-01-2016 12:11, Carlos O'Donell wrote:
> On 01/13/2016 09:00 AM, Adhemerval Zanella wrote:
>>> While I agree that any fix that makes tst-mqueue5 fail less spuriously is a good
>>> thing, I'm curious about your review of the test as a whole (now that I've looked
>>> at it again).
>>
>> I would say to just remove the spurious synchronization issues, specially the ones
>> that might arise with the use of 'sleep' in bad ways.
> 
> Does that mean you'd rather see a more comprehensive fix?
> 
> The test still uses sleep() in one place.

I would prefer a comprehensible test that do raise false positives, even by
a slight chance (such as the case). This is also similar to the nanosleep
regression testcase [1] that generate different opinions whether a possible
racy testcase yields any meaningful validation. 

At first I did no oppose, but recently I noted that potentially racy testcase
are mainly ignored when testers see spurious fails (the tst-mqueue5 is an
example, this issue exists for a long time).

[1] https://sourceware.org/ml/libc-alpha/2015-11/msg00199.html
  
Paul E. Murphy Jan. 13, 2016, 5:57 p.m. UTC | #5
On 01/12/2016 07:58 PM, Carlos O'Donell wrote:
> On 01/12/2016 02:23 PM, Paul E. Murphy wrote:
>> This seems to fix the test for ppc, and probably others too.
> 
> Had you considered the 'sleep (1);' on line 394 as another source of problems?

That is yet another problem, though seemingly less likely. I ran the test in a loop
on a relatively busy machine for a night and didn't catch any failures. I'm not sure
how or if it can be fixed.

> Must it be the case that by the time '(void) pthread_barrier_wait (b3);' returns,
> the SIGRTMIN has been delivered and handled? Why isn't 'sigwait' required to make
> this test operate without a race bewtween checking 'rtmin_cnt != 2' and the signal
> arriving and being handled?

I think there is an assumption that calling mqrecv() in do_test() will generate a
signal in the child process when the child is in one of the following states:

1. thr() blocked on barrier b3, do_child() is still approaching barrier b3.
2. do_child() blocked on barrier b3, thr() is still approaching barrier b3.
3. Both are waiting in barrier b3.

Masking the signal on do_child() guarantees thr() will handle the exception.
Furthermore, this assumes the signal to thr() cannot be delayed or blocked. I don't
know if that is a safe assumption. Though, the assumption is made elsewhere in the
test.

> While I agree that any fix that makes tst-mqueue5 fail less spuriously is a good
> thing, I'm curious about your review of the test as a whole (now that I've looked
> at it again).

Each call to pthread_barrier_wait should have a comment to quickly match it up
with the matching call in the other one or two functions :).

It does validate the documented behavior of mq, so it's a good test. If the test
is still troublesome, maybe the signals can be replaced with another means of
validating the asynchronous API.

Thanks,
Paul
  
Carlos O'Donell Jan. 13, 2016, 6:25 p.m. UTC | #6
On 01/13/2016 12:57 PM, Paul E. Murphy wrote:
> 
> 
> On 01/12/2016 07:58 PM, Carlos O'Donell wrote:
>> On 01/12/2016 02:23 PM, Paul E. Murphy wrote:
>>> This seems to fix the test for ppc, and probably others too.
>>
>> Had you considered the 'sleep (1);' on line 394 as another source of problems?
> 
> That is yet another problem, though seemingly less likely. I ran the test in a loop
> on a relatively busy machine for a night and didn't catch any failures. I'm not sure
> how or if it can be fixed.

OK.

>> Must it be the case that by the time '(void) pthread_barrier_wait (b3);' returns,
>> the SIGRTMIN has been delivered and handled? Why isn't 'sigwait' required to make
>> this test operate without a race bewtween checking 'rtmin_cnt != 2' and the signal
>> arriving and being handled?
> 
> I think there is an assumption that calling mqrecv() in do_test() will generate a
> signal in the child process when the child is in one of the following states:
> 
> 1. thr() blocked on barrier b3, do_child() is still approaching barrier b3.
> 2. do_child() blocked on barrier b3, thr() is still approaching barrier b3.
> 3. Both are waiting in barrier b3.
> 
> Masking the signal on do_child() guarantees thr() will handle the exception.
> Furthermore, this assumes the signal to thr() cannot be delayed or blocked. I don't
> know if that is a safe assumption. Though, the assumption is made elsewhere in the
> test.

Agreed.

This patch is good for me.

Please check this in or have Adhemerval or an IBM team member check it in for you.

Cheers,
Carlos.
  
Florian Weimer Jan. 15, 2016, 3:05 p.m. UTC | #7
On 01/12/2016 08:23 PM, Paul E. Murphy wrote:
> This seems to fix the test for ppc, and probably others too.

I found a powerpc box where it was constantly failing without your fix.
I applied your fix, and the test now passes. So it looks like a net
improvement. :)

The explanation about where the signal is delivered looks reasonable, too.

Florian
  
Tulio Magno Quites Machado Filho Jan. 15, 2016, 6:55 p.m. UTC | #8
"Paul E. Murphy" <murphyp@linux.vnet.ibm.com> writes:

> The check is done on line 117 by a thread spawned
> from do_child(), forked from do_test().  This test
> generates a signal in the forked process.
>
> Either thread may handle the signal, and on ppc,
> it happens to be done on do_child, on the thread
> which is not doing the check on line 117.
>
> This exposes a race condition whereby the test
> incorrectly fails as the signal is caught during
> or after the check.
>
> This is mitigated by ensuring the signal is blocked
> in the child thread while thread is running.

Pushed as a3e5b4f.
  

Patch

From 49d281b08a871cb5835b2ca166082821ea0e9085 Mon Sep 17 00:00:00 2001
From: "Paul E. Murphy" <murphyp@linux.vnet.ibm.com>
Date: Mon, 11 Jan 2016 17:24:04 -0500
Subject: [PATCH] Fix race in tst-mqueue5

The check is done on line 117 by a thread spawned
from do_child(), forked from do_test().  This test
generates a signal in the forked process.

Either thread may handle the signal, and on ppc,
it happens to be done on do_child, on the thread
which is not doing the check on line 117.

This exposes a race condition whereby the test
incorrectly fails as the signal is caught during
or after the check.

This is mitigated by ensuring the signal is blocked
in the child thread while thread is running.

2016-01-11  Paul E. Murphy  <murphyp@linux.vnet.ibm.com>

	* rt/tst-mqueue5.c (thr): Cleanup misleading comment.
	(do_child): Mask SIGRTMIN while thr is running.
---
 rt/tst-mqueue5.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/rt/tst-mqueue5.c b/rt/tst-mqueue5.c
index aa74fa3..25042bc 100644
--- a/rt/tst-mqueue5.c
+++ b/rt/tst-mqueue5.c
@@ -116,7 +116,7 @@  thr (void *arg)
 
   if (rtmin_cnt != 2)
     {
-      puts ("SIGRTMIN signal in child did not arrive");
+      puts ("SIGRTMIN signal in thread did not arrive");
       result = 1;
     }
   else if (rtmin_pid != getppid ()
@@ -403,6 +403,16 @@  do_child (const char *name, pthread_barrier_t *b2, pthread_barrier_t *b3,
       result = 1;
     }
 
+  /* Ensure the thr thread gets the signal, not us.  */
+  sigset_t set;
+  sigemptyset (&set);
+  sigaddset (&set, SIGRTMIN);
+  if (pthread_sigmask (SIG_BLOCK, &set, NULL))
+    {
+      printf ("Failed to block SIGRTMIN in child: %m\n");
+      result = 1;
+    }
+
   (void) pthread_barrier_wait (b2);
 
   /* Parent calls mqsend (q), which should wake up mqrecv (q)
@@ -514,7 +524,14 @@  do_child (const char *name, pthread_barrier_t *b2, pthread_barrier_t *b3,
       result = 1;
     }
 
- void *thr_ret;
+  /* Reenable test signals before cleaning up the thread.  */
+  if (pthread_sigmask (SIG_UNBLOCK, &set, NULL))
+    {
+      printf ("Failed to unblock SIGRTMIN in child: %m\n");
+      result = 1;
+    }
+
+  void *thr_ret;
   ret = pthread_join (th, &thr_ret);
   if (ret)
     {
-- 
2.4.3