[1/3] Use reliable sem_wait interruption in nptl/tst-sem6.

Message ID 1417805577.25868.4.camel@triegel.csb
State Dropped
Headers

Commit Message

Torvald Riegel Dec. 5, 2014, 6:52 p.m. UTC
  This makes the nptl/tst-sem6 interrupt sem_wait reliably by letting the
signal handler call sem_post.  This ensures that if the signal handler
should happen to run before sem_wait actually starts, sem_wait will find
an available token and return.  This is necessary if a program does not
want to rely on timing nor on forward progress / fairness guarantees of
the OS scheduler.

It's needed so that Patch 3/3 can do the right thing in face of a
FUTEX_WAIT that may return EINTR on either signal interruption or
spuriously.
  

Comments

Ondrej Bilka Dec. 6, 2014, 1:50 p.m. UTC | #1
On Fri, Dec 05, 2014 at 07:52:57PM +0100, Torvald Riegel wrote:
>    alarm (1);
>  
>    int res = sem_wait (&s);
> -  if (res == 0)
> -    {
> -      puts ("wait succeeded");
> -      return 1;
> -    }
> -  if (res != -1 || errno != EINTR)
> +  /* We accept all allowed behavior: Implementations that return EINTR and
> +     those that rely on correct interruption through signals to use sem_post
> +     in the signal handler.  */
> +  if (res != 0 && !(res == -1 && errno == EINTR))
>      {
> -      puts ("wait didn't fail with EINTR");
> +      puts ("wait neiter succeeded nor failed with EINTR");
>        return 1;
>      }
>  

That does change test logic as it originally failed when wait succeeded.
  
Torvald Riegel Dec. 8, 2014, 11:43 a.m. UTC | #2
On Sat, 2014-12-06 at 14:50 +0100, Ondřej Bílka wrote:
> On Fri, Dec 05, 2014 at 07:52:57PM +0100, Torvald Riegel wrote:
> >    alarm (1);
> >  
> >    int res = sem_wait (&s);
> > -  if (res == 0)
> > -    {
> > -      puts ("wait succeeded");
> > -      return 1;
> > -    }
> > -  if (res != -1 || errno != EINTR)
> > +  /* We accept all allowed behavior: Implementations that return EINTR and
> > +     those that rely on correct interruption through signals to use sem_post
> > +     in the signal handler.  */
> > +  if (res != 0 && !(res == -1 && errno == EINTR))
> >      {
> > -      puts ("wait didn't fail with EINTR");
> > +      puts ("wait neiter succeeded nor failed with EINTR");
> >        return 1;
> >      }
> >  
> 
> That does change test logic as it originally failed when wait succeeded.

Yes, but why do you think that this is inconsistent?  The previous test
didn't add a token in the signal handler, so if wait succeeded, then the
test should fail.

However, the correct way to interrupt the semaphore with a signal is to
add a token.  My patch does that.  Second, if we do not want to make
timing assumptions (which the existing test would do if we add a token
to the semaphore in the signal handler), then we need to accept that the
(first) signal handler execution might happen before sem_wait actually
executes.  Therefore, we must not fail in this case.

We have to correctly interrupt with signals because as the futex
documentation stands (allowing return of EINTR on spurious wake-ups),
there's no way to implement the specific behavior of the existing
implementation (which assumes EINTR is returned *only* on signals).

IOW, the existing test does white-box testing with timing assumptions;
with this patch, we do make a slightly different black-box test with no
timing assumptions.
  
Ondrej Bilka Dec. 8, 2014, 10:28 p.m. UTC | #3
On Mon, Dec 08, 2014 at 12:43:17PM +0100, Torvald Riegel wrote:
> On Sat, 2014-12-06 at 14:50 +0100, Ondřej Bílka wrote:
> > On Fri, Dec 05, 2014 at 07:52:57PM +0100, Torvald Riegel wrote:
> > >    alarm (1);
> > >  
> > >    int res = sem_wait (&s);
> > > -  if (res == 0)
> > > -    {
> > > -      puts ("wait succeeded");
> > > -      return 1;
> > > -    }
> > > -  if (res != -1 || errno != EINTR)
> > > +  /* We accept all allowed behavior: Implementations that return EINTR and
> > > +     those that rely on correct interruption through signals to use sem_post
> > > +     in the signal handler.  */
> > > +  if (res != 0 && !(res == -1 && errno == EINTR))
> > >      {
> > > -      puts ("wait didn't fail with EINTR");
> > > +      puts ("wait neiter succeeded nor failed with EINTR");
> > >        return 1;
> > >      }
> > >  
> > 
> > That does change test logic as it originally failed when wait succeeded.
> 
> Yes, but why do you think that this is inconsistent?  The previous test
> didn't add a token in the signal handler, so if wait succeeded, then the
> test should fail.
> 
> However, the correct way to interrupt the semaphore with a signal is to
> add a token.  My patch does that.  Second, if we do not want to make
> timing assumptions (which the existing test would do if we add a token
> to the semaphore in the signal handler), then we need to accept that the
> (first) signal handler execution might happen before sem_wait actually
> executes.  Therefore, we must not fail in this case.
> 
> We have to correctly interrupt with signals because as the futex
> documentation stands (allowing return of EINTR on spurious wake-ups),
> there's no way to implement the specific behavior of the existing
> implementation (which assumes EINTR is returned *only* on signals).
> 
> IOW, the existing test does white-box testing with timing assumptions;
> with this patch, we do make a slightly different black-box test with no
> timing assumptions.

Which misses point of test which is to find bugs. What if in new fooarch 
assembly one forget to return -1 on interrupt which breaks user application 
which will assume that semaphore is locked, one can deal with rare false 
positives in tests.

One does not need justify this fact by forward progress/fairness
assumptions. Just assumption that OS which keeps all CPU idle for one
second while there is suitable task is simply broken. It would be better
to add serialization so no other test runs in parallel with this.
  
Torvald Riegel Dec. 9, 2014, 10:16 a.m. UTC | #4
On Mon, 2014-12-08 at 23:28 +0100, Ondřej Bílka wrote:
> On Mon, Dec 08, 2014 at 12:43:17PM +0100, Torvald Riegel wrote:
> > On Sat, 2014-12-06 at 14:50 +0100, Ondřej Bílka wrote:
> > > On Fri, Dec 05, 2014 at 07:52:57PM +0100, Torvald Riegel wrote:
> > > >    alarm (1);
> > > >  
> > > >    int res = sem_wait (&s);
> > > > -  if (res == 0)
> > > > -    {
> > > > -      puts ("wait succeeded");
> > > > -      return 1;
> > > > -    }
> > > > -  if (res != -1 || errno != EINTR)
> > > > +  /* We accept all allowed behavior: Implementations that return EINTR and
> > > > +     those that rely on correct interruption through signals to use sem_post
> > > > +     in the signal handler.  */
> > > > +  if (res != 0 && !(res == -1 && errno == EINTR))
> > > >      {
> > > > -      puts ("wait didn't fail with EINTR");
> > > > +      puts ("wait neiter succeeded nor failed with EINTR");
> > > >        return 1;
> > > >      }
> > > >  
> > > 
> > > That does change test logic as it originally failed when wait succeeded.
> > 
> > Yes, but why do you think that this is inconsistent?  The previous test
> > didn't add a token in the signal handler, so if wait succeeded, then the
> > test should fail.
> > 
> > However, the correct way to interrupt the semaphore with a signal is to
> > add a token.  My patch does that.  Second, if we do not want to make
> > timing assumptions (which the existing test would do if we add a token
> > to the semaphore in the signal handler), then we need to accept that the
> > (first) signal handler execution might happen before sem_wait actually
> > executes.  Therefore, we must not fail in this case.
> > 
> > We have to correctly interrupt with signals because as the futex
> > documentation stands (allowing return of EINTR on spurious wake-ups),
> > there's no way to implement the specific behavior of the existing
> > implementation (which assumes EINTR is returned *only* on signals).
> > 
> > IOW, the existing test does white-box testing with timing assumptions;
> > with this patch, we do make a slightly different black-box test with no
> > timing assumptions.
> 
> Which misses point of test which is to find bugs.

Please read the POSIX spec.  It allows both outcomes, and without timing
assumptions etc., we can't drive executions towards just one of the
outcomes.

> What if in new fooarch 
> assembly one forget to return -1 on interrupt which breaks user application 
> which will assume that semaphore is locked,

I can strengthen the test on res==0, checking whether there is no token
left.  I don't think it buys us much though.

Another option would be to disallow failure; however, this only works if
sem_assume_only_signals_cause_futex_EINTR remains 0 and not set to a
different value by a distribution (see Patch 3/3).

> one can deal with rare false 
> positives in tests.

I think you'd be creating false negatives with what you have in mind.
The false positive would be not failing when 0 is returned, incorrectly,
ie your example.

False negatives are a pain.  You can deal with them of course, but it
stands in the way of doing continuous integration and such.  what we de
facto do is just ignore all tests that we know can have false negatives,
which doesn't make the test useful at all.

> One does not need justify this fact by forward progress/fairness
> assumptions. Just assumption that OS which keeps all CPU idle for one
> second while there is suitable task is simply broken.

Well, that *is* a timing assumption.  There is nothing broken about this
in general.  Remember that we do have tests failing now and then just
because of that.  Which is awful for testing.

> It would be better
> to add serialization so no other test runs in parallel with this.

You can't prevent other load on the machine in general.

Also, please review the actual background for this change.  See patch
3/3 for that.  The change to the test is not ideal, but please see the
trade-offs.
  
Ondrej Bilka Dec. 9, 2014, 4:50 p.m. UTC | #5
On Tue, Dec 09, 2014 at 11:16:00AM +0100, Torvald Riegel wrote:
> On Mon, 2014-12-08 at 23:28 +0100, Ondřej Bílka wrote:
> > On Mon, Dec 08, 2014 at 12:43:17PM +0100, Torvald Riegel wrote:
> > > On Sat, 2014-12-06 at 14:50 +0100, Ondřej Bílka wrote:
> > > > On Fri, Dec 05, 2014 at 07:52:57PM +0100, Torvald Riegel wrote:
> > > > >    alarm (1);
> > > > >  
> > > > >    int res = sem_wait (&s);
> > > > > -  if (res == 0)
> > > > > -    {
> > > > > -      puts ("wait succeeded");
> > > > > -      return 1;
> > > > > -    }
> > > > > -  if (res != -1 || errno != EINTR)
> > > > > +  /* We accept all allowed behavior: Implementations that return EINTR and
> > > > > +     those that rely on correct interruption through signals to use sem_post
> > > > > +     in the signal handler.  */
> > > > > +  if (res != 0 && !(res == -1 && errno == EINTR))
> > > > >      {
> > > > > -      puts ("wait didn't fail with EINTR");
> > > > > +      puts ("wait neiter succeeded nor failed with EINTR");
> > > > >        return 1;
> > > > >      }
> > > > >  
> > > > 
> > > > That does change test logic as it originally failed when wait succeeded.
> > > 
> > > Yes, but why do you think that this is inconsistent?  The previous test
> > > didn't add a token in the signal handler, so if wait succeeded, then the
> > > test should fail.
> > > 
> > > However, the correct way to interrupt the semaphore with a signal is to
> > > add a token.  My patch does that.  Second, if we do not want to make
> > > timing assumptions (which the existing test would do if we add a token
> > > to the semaphore in the signal handler), then we need to accept that the
> > > (first) signal handler execution might happen before sem_wait actually
> > > executes.  Therefore, we must not fail in this case.
> > > 
> > > We have to correctly interrupt with signals because as the futex
> > > documentation stands (allowing return of EINTR on spurious wake-ups),
> > > there's no way to implement the specific behavior of the existing
> > > implementation (which assumes EINTR is returned *only* on signals).
> > > 
> > > IOW, the existing test does white-box testing with timing assumptions;
> > > with this patch, we do make a slightly different black-box test with no
> > > timing assumptions.
> > 
> > Which misses point of test which is to find bugs.
> 
> Please read the POSIX spec.  It allows both outcomes, and without timing
> assumptions etc., we can't drive executions towards just one of the
> outcomes.
>
Which does not answer my objection. What extra bugs could this test catch,
compared to say tst-sem2? If there is no such bug you could just delete
that file.


> > What if in new fooarch 
> > assembly one forget to return -1 on interrupt which breaks user application 
> > which will assume that semaphore is locked,
> 
> I can strengthen the test on res==0, checking whether there is no token
> left.  I don't think it buys us much though.
> 
> Another option would be to disallow failure; however, this only works if
> sem_assume_only_signals_cause_futex_EINTR remains 0 and not set to a
> different value by a distribution (see Patch 3/3).
> 
> > one can deal with rare false 
> > positives in tests.
> 
> I think you'd be creating false negatives with what you have in mind.
> The false positive would be not failing when 0 is returned, incorrectly,
> ie your example.
>
Depends whats your default. Usually false positive means that test shows
there is disease/bug but not in reality.
 
> False negatives are a pain.  You can deal with them of course, but it
> stands in the way of doing continuous integration and such.  what we de
> facto do is just ignore all tests that we know can have false negatives,
> which doesn't make the test useful at all.
> 
> > One does not need justify this fact by forward progress/fairness
> > assumptions. Just assumption that OS which keeps all CPU idle for one
> > second while there is suitable task is simply broken.
> 
> Well, that *is* a timing assumption.  There is nothing broken about this
> in general.  Remember that we do have tests failing now and then just
> because of that.  Which is awful for testing.
> 
Without that assumption there is no guarantee that it will take more
than week to run test suite. That would make it useless.

As failing tests are concerned what is low hanging fruit? It is better
to first fix tests that fail with less load than this one and if one of these
is not worth fixing this is not either.

> > It would be better
> > to add serialization so no other test runs in parallel with this.
> 
> You can't prevent other load on the machine in general.
> 
Its not general case, its when you run test suite, its tradeoff between
how many bugs it can detect and needed mainteinance.


> Also, please review the actual background for this change.  See patch
> 3/3 for that.  The change to the test is not ideal, but please see the
> trade-offs.

The rarity of problem bugged me, as to trigger that behaviour one would
need run also highly parallel program so it looked unlikely that anybody
would report that. As it couldn't detect some bugs it previously could
its hard to see what is better.
  
Torvald Riegel Dec. 9, 2014, 6:24 p.m. UTC | #6
On Tue, 2014-12-09 at 17:50 +0100, Ondřej Bílka wrote:
> On Tue, Dec 09, 2014 at 11:16:00AM +0100, Torvald Riegel wrote:
> > On Mon, 2014-12-08 at 23:28 +0100, Ondřej Bílka wrote:
> > > On Mon, Dec 08, 2014 at 12:43:17PM +0100, Torvald Riegel wrote:
> > > > On Sat, 2014-12-06 at 14:50 +0100, Ondřej Bílka wrote:
> > > > > On Fri, Dec 05, 2014 at 07:52:57PM +0100, Torvald Riegel wrote:
> > > > > >    alarm (1);
> > > > > >  
> > > > > >    int res = sem_wait (&s);
> > > > > > -  if (res == 0)
> > > > > > -    {
> > > > > > -      puts ("wait succeeded");
> > > > > > -      return 1;
> > > > > > -    }
> > > > > > -  if (res != -1 || errno != EINTR)
> > > > > > +  /* We accept all allowed behavior: Implementations that return EINTR and
> > > > > > +     those that rely on correct interruption through signals to use sem_post
> > > > > > +     in the signal handler.  */
> > > > > > +  if (res != 0 && !(res == -1 && errno == EINTR))
> > > > > >      {
> > > > > > -      puts ("wait didn't fail with EINTR");
> > > > > > +      puts ("wait neiter succeeded nor failed with EINTR");
> > > > > >        return 1;
> > > > > >      }
> > > > > >  
> > > > > 
> > > > > That does change test logic as it originally failed when wait succeeded.
> > > > 
> > > > Yes, but why do you think that this is inconsistent?  The previous test
> > > > didn't add a token in the signal handler, so if wait succeeded, then the
> > > > test should fail.
> > > > 
> > > > However, the correct way to interrupt the semaphore with a signal is to
> > > > add a token.  My patch does that.  Second, if we do not want to make
> > > > timing assumptions (which the existing test would do if we add a token
> > > > to the semaphore in the signal handler), then we need to accept that the
> > > > (first) signal handler execution might happen before sem_wait actually
> > > > executes.  Therefore, we must not fail in this case.
> > > > 
> > > > We have to correctly interrupt with signals because as the futex
> > > > documentation stands (allowing return of EINTR on spurious wake-ups),
> > > > there's no way to implement the specific behavior of the existing
> > > > implementation (which assumes EINTR is returned *only* on signals).
> > > > 
> > > > IOW, the existing test does white-box testing with timing assumptions;
> > > > with this patch, we do make a slightly different black-box test with no
> > > > timing assumptions.
> > > 
> > > Which misses point of test which is to find bugs.
> > 
> > Please read the POSIX spec.  It allows both outcomes, and without timing
> > assumptions etc., we can't drive executions towards just one of the
> > outcomes.
> >
> Which does not answer my objection. What extra bugs could this test catch,
> compared to say tst-sem2? If there is no such bug you could just delete
> that file.

tst-sem2 tests that spurious wake-ups and such don't return anything but
-1 and errno==EINTR, in particular that 0 isn't returned.

After the patch, tst-sem6 tests that a signal handler that posts a token
will make sem_wait return.  It *also* allows for sem_wait to return -1
and errno==EINTR in that case.

Thus, one possible error that the patched tst-sem6 will catch is if the
sem_wait itself just retries the futex_wait after the futex_wait
returned EINTR, instead of looking for whether there is an available
token.

> 
> > > What if in new fooarch 
> > > assembly one forget to return -1 on interrupt which breaks user application 
> > > which will assume that semaphore is locked,
> > 
> > I can strengthen the test on res==0, checking whether there is no token
> > left.  I don't think it buys us much though.
> > 
> > Another option would be to disallow failure; however, this only works if
> > sem_assume_only_signals_cause_futex_EINTR remains 0 and not set to a
> > different value by a distribution (see Patch 3/3).
> > 
> > > one can deal with rare false 
> > > positives in tests.
> > 
> > I think you'd be creating false negatives with what you have in mind.
> > The false positive would be not failing when 0 is returned, incorrectly,
> > ie your example.
> >
> Depends whats your default. Usually false positive means that test shows
> there is disease/bug but not in reality.

I disagree regarding "usually", and would say that a negative result for
a test is if the test fails, so false negative being a reportedly failed
test that was not caused by an actual fault.
Nonetheless, we seem to be on the same page regarding what we want to
avoid.

> > False negatives are a pain.  You can deal with them of course, but it
> > stands in the way of doing continuous integration and such.  what we de
> > facto do is just ignore all tests that we know can have false negatives,
> > which doesn't make the test useful at all.
> > 
> > > One does not need justify this fact by forward progress/fairness
> > > assumptions. Just assumption that OS which keeps all CPU idle for one
> > > second while there is suitable task is simply broken.
> > 
> > Well, that *is* a timing assumption.  There is nothing broken about this
> > in general.  Remember that we do have tests failing now and then just
> > because of that.  Which is awful for testing.
> > 
> Without that assumption there is no guarantee that it will take more
> than week to run test suite. That would make it useless.

There are timing assumptions that state something like "after a second,
the thread will have been scheduled".  Those are bad, because they state
a concrete bound (1s in this example).

The forward progress guarantees that make sense for us, in particular in
non-real-time systems, are instead of the form "something good will
*eventually* happen".  In other words, there's no concrete bound
promised, but that there is a (finite) bound.

That's an important difference.

> > Also, please review the actual background for this change.  See patch
> > 3/3 for that.  The change to the test is not ideal, but please see the
> > trade-offs.
> 
> The rarity of problem bugged me, as to trigger that behaviour one would
> need run also highly parallel program so it looked unlikely that anybody
> would report that. As it couldn't detect some bugs it previously could
> its hard to see what is better. 

Let me try to summarize the background behind this change again:

1) Linux documents futex_wait to return EINTR on signals *or* on
spurious wake-ups.
2) If we treat 1) as true -- which we should to unless getting
confirmation otherwise -- sem_wait must not return EINTR to the caller
anymore if futex_wait returned EINTR.
3) Because of 2), the behavior that is tested in tst-sem6 before my
patch cannot be implemented anymore.

Thus, we need to do *something*.  I proposed this patch, and variations.
If you don't see other alternatives, then I guess we'll have to pick
from the options I gave.

If you disagree with 2), then please comment on Patch 3/3, because
that's where this belongs.
  
Rich Felker Dec. 9, 2014, 6:36 p.m. UTC | #7
On Tue, Dec 09, 2014 at 07:24:49PM +0100, Torvald Riegel wrote:
> > Which does not answer my objection. What extra bugs could this test catch,
> > compared to say tst-sem2? If there is no such bug you could just delete
> > that file.
> 
> tst-sem2 tests that spurious wake-ups and such don't return anything but
> -1 and errno==EINTR, in particular that 0 isn't returned.
> 
> After the patch, tst-sem6 tests that a signal handler that posts a token
> will make sem_wait return.  It *also* allows for sem_wait to return -1
> and errno==EINTR in that case.
> 
> Thus, one possible error that the patched tst-sem6 will catch is if the
> sem_wait itself just retries the futex_wait after the futex_wait
> returned EINTR, instead of looking for whether there is an available
> token.

This would not be a bug. Simply retrying the futex_wait would result
in EAGAIN, since the futex value would no longer match.

> > > Also, please review the actual background for this change.  See patch
> > > 3/3 for that.  The change to the test is not ideal, but please see the
> > > trade-offs.
> > 
> > The rarity of problem bugged me, as to trigger that behaviour one would
> > need run also highly parallel program so it looked unlikely that anybody
> > would report that. As it couldn't detect some bugs it previously could
> > its hard to see what is better. 
> 
> Let me try to summarize the background behind this change again:
> 
> 1) Linux documents futex_wait to return EINTR on signals *or* on
> spurious wake-ups.

No, the man pages document this, and they're wrong. I have not seen
any other "Linux documentation" claiming it.

> 2) If we treat 1) as true -- which we should to unless getting
> confirmation otherwise -- sem_wait must not return EINTR to the caller
> anymore if futex_wait returned EINTR.
> 3) Because of 2), the behavior that is tested in tst-sem6 before my
> patch cannot be implemented anymore.

These (2) and (3) are based on false assumptions. I agree this is a
positive change (EINTR is generally undesirable) and it may be
necessary if you want to support old kernels where the futex syscall
could fail with EINTR even when the signal handler was SA_RESTART
type, but I don't think glibc supports those kernels.

> Thus, we need to do *something*.  I proposed this patch, and variations.
> If you don't see other alternatives, then I guess we'll have to pick
> from the options I gave.
> 
> If you disagree with 2), then please comment on Patch 3/3, because
> that's where this belongs.

The most important thing to do is get clarification from the kernel
side that the man page is wrong and that there is no intent to
overload EINTR or allow incorrect/spurious EINTR from futex. This will
possibly affect other interfaces now and in the future, e.g.
aio_suspend. I agree that making sem_[timed]wait ignore EINTR is
desirable too, but I don't think there's any actual current problem
being fixed - the kernel is not generating spurious EINTR, and we just
need to keep it that way.

Rich
  
Torvald Riegel Dec. 9, 2014, 6:57 p.m. UTC | #8
On Tue, 2014-12-09 at 13:36 -0500, Rich Felker wrote:
> On Tue, Dec 09, 2014 at 07:24:49PM +0100, Torvald Riegel wrote:
> > > Which does not answer my objection. What extra bugs could this test catch,
> > > compared to say tst-sem2? If there is no such bug you could just delete
> > > that file.
> > 
> > tst-sem2 tests that spurious wake-ups and such don't return anything but
> > -1 and errno==EINTR, in particular that 0 isn't returned.
> > 
> > After the patch, tst-sem6 tests that a signal handler that posts a token
> > will make sem_wait return.  It *also* allows for sem_wait to return -1
> > and errno==EINTR in that case.
> > 
> > Thus, one possible error that the patched tst-sem6 will catch is if the
> > sem_wait itself just retries the futex_wait after the futex_wait
> > returned EINTR, instead of looking for whether there is an available
> > token.
> 
> This would not be a bug. Simply retrying the futex_wait would result
> in EAGAIN, since the futex value would no longer match.

Right.   So it would catch a bug that did a futex_wait after loading the
new value.

> > > > Also, please review the actual background for this change.  See patch
> > > > 3/3 for that.  The change to the test is not ideal, but please see the
> > > > trade-offs.
> > > 
> > > The rarity of problem bugged me, as to trigger that behaviour one would
> > > need run also highly parallel program so it looked unlikely that anybody
> > > would report that. As it couldn't detect some bugs it previously could
> > > its hard to see what is better. 
> > 
> > Let me try to summarize the background behind this change again:
> > 
> > 1) Linux documents futex_wait to return EINTR on signals *or* on
> > spurious wake-ups.
> 
> No, the man pages document this, and they're wrong. I have not seen
> any other "Linux documentation" claiming it.

But is there other documentation than the man pages?  The sources don't
really count because that's not a guarantee nor a specification, that's
the current implementation.

Also, at least one kernel person seems to have confirmed that the
current manpage is correct: https://lkml.org/lkml/2014/5/15/356

So in absence of any other documentation, I'll follow what we have and
for which we have at least some documentation.

> > 2) If we treat 1) as true -- which we should to unless getting
> > confirmation otherwise -- sem_wait must not return EINTR to the caller
> > anymore if futex_wait returned EINTR.
> > 3) Because of 2), the behavior that is tested in tst-sem6 before my
> > patch cannot be implemented anymore.
> 
> These (2) and (3) are based on false assumptions.

I don't have any evidence to rely on something else.  Don't get me
wrong, if we get confirmation from the kernel that 1) is not true, then
I'm open to doing something else.  But until then, what should we do? 

Also, the change is within what's allowed by POSIX IMO, so we're not
inventing new behavior here.

> I agree this is a
> positive change (EINTR is generally undesirable) and it may be
> necessary if you want to support old kernels where the futex syscall
> could fail with EINTR even when the signal handler was SA_RESTART
> type, but I don't think glibc supports those kernels.
> 
> > Thus, we need to do *something*.  I proposed this patch, and variations.
> > If you don't see other alternatives, then I guess we'll have to pick
> > from the options I gave.
> > 
> > If you disagree with 2), then please comment on Patch 3/3, because
> > that's where this belongs.
> 
> The most important thing to do is get clarification from the kernel
> side that the man page is wrong and that there is no intent to
> overload EINTR or allow incorrect/spurious EINTR from futex.

Once we'll get this, I'll take care to adjust sem_wait accordingly, and
depending on the adjustment, might adapt tst-sem6 as well.

BTW, have you already asked on LKML about this?
  
Rich Felker Dec. 9, 2014, 8:19 p.m. UTC | #9
On Tue, Dec 09, 2014 at 07:57:08PM +0100, Torvald Riegel wrote:
> On Tue, 2014-12-09 at 13:36 -0500, Rich Felker wrote:
> > On Tue, Dec 09, 2014 at 07:24:49PM +0100, Torvald Riegel wrote:
> > > > Which does not answer my objection. What extra bugs could this test catch,
> > > > compared to say tst-sem2? If there is no such bug you could just delete
> > > > that file.
> > > 
> > > tst-sem2 tests that spurious wake-ups and such don't return anything but
> > > -1 and errno==EINTR, in particular that 0 isn't returned.
> > > 
> > > After the patch, tst-sem6 tests that a signal handler that posts a token
> > > will make sem_wait return.  It *also* allows for sem_wait to return -1
> > > and errno==EINTR in that case.
> > > 
> > > Thus, one possible error that the patched tst-sem6 will catch is if the
> > > sem_wait itself just retries the futex_wait after the futex_wait
> > > returned EINTR, instead of looking for whether there is an available
> > > token.
> > 
> > This would not be a bug. Simply retrying the futex_wait would result
> > in EAGAIN, since the futex value would no longer match.
> 
> Right.   So it would catch a bug that did a futex_wait after loading the
> new value.

I don't follow. If I understand what type of bug you're talking about,
there's no way such a bug would arise accidentally and only affect
EINTR. It would be a break in the whole usage pattern for futex waits
and would affect EAGAIN and non-spurious wakes too unless someone
intentionally special-cased EINTR to do the wrong thing.

> > > Let me try to summarize the background behind this change again:
> > > 
> > > 1) Linux documents futex_wait to return EINTR on signals *or* on
> > > spurious wake-ups.
> > 
> > No, the man pages document this, and they're wrong. I have not seen
> > any other "Linux documentation" claiming it.
> 
> But is there other documentation than the man pages?  The sources don't
> really count because that's not a guarantee nor a specification, that's
> the current implementation.
> 
> Also, at least one kernel person seems to have confirmed that the
> current manpage is correct: https://lkml.org/lkml/2014/5/15/356

The linked mailing list message does not contain the text EINTR at
all, so I don't see where your claim that it supports the current man
page text about EINTR comes from.

> > > 2) If we treat 1) as true -- which we should to unless getting
> > > confirmation otherwise -- sem_wait must not return EINTR to the caller
> > > anymore if futex_wait returned EINTR.
> > > 3) Because of 2), the behavior that is tested in tst-sem6 before my
> > > patch cannot be implemented anymore.
> > 
> > These (2) and (3) are based on false assumptions.
> 
> I don't have any evidence to rely on something else.  Don't get me
> wrong, if we get confirmation from the kernel that 1) is not true, then
> I'm open to doing something else.  But until then, what should we do? 
> 
> Also, the change is within what's allowed by POSIX IMO, so we're not
> inventing new behavior here.

It's allowed by POSIX, yes, and as I've said before, I agree it's
better behavior -- programming with interrupting signal handlers is a
backwards, bogus practice, and from a hardening standpoint it seems
preferable not to have sem_wait fail at all. I just don't think the
"spurious EINTR is documented" argument should be used to justify such
a change, because accepting spurious EINTR is going to come back to
bite us if there are ever other interfaces (I believe aio_suspend
already is one?) that need to be implemented with futex and need to
report EINTR.

Rich
  
Torvald Riegel Dec. 10, 2014, 9:34 a.m. UTC | #10
On Tue, 2014-12-09 at 15:19 -0500, Rich Felker wrote:
> On Tue, Dec 09, 2014 at 07:57:08PM +0100, Torvald Riegel wrote:
> > On Tue, 2014-12-09 at 13:36 -0500, Rich Felker wrote:
> > > On Tue, Dec 09, 2014 at 07:24:49PM +0100, Torvald Riegel wrote:
> > > > > Which does not answer my objection. What extra bugs could this test catch,
> > > > > compared to say tst-sem2? If there is no such bug you could just delete
> > > > > that file.
> > > > 
> > > > tst-sem2 tests that spurious wake-ups and such don't return anything but
> > > > -1 and errno==EINTR, in particular that 0 isn't returned.
> > > > 
> > > > After the patch, tst-sem6 tests that a signal handler that posts a token
> > > > will make sem_wait return.  It *also* allows for sem_wait to return -1
> > > > and errno==EINTR in that case.
> > > > 
> > > > Thus, one possible error that the patched tst-sem6 will catch is if the
> > > > sem_wait itself just retries the futex_wait after the futex_wait
> > > > returned EINTR, instead of looking for whether there is an available
> > > > token.
> > > 
> > > This would not be a bug. Simply retrying the futex_wait would result
> > > in EAGAIN, since the futex value would no longer match.
> > 
> > Right.   So it would catch a bug that did a futex_wait after loading the
> > new value.
> 
> I don't follow. If I understand what type of bug you're talking about,
> there's no way such a bug would arise accidentally and only affect
> EINTR. It would be a break in the whole usage pattern for futex waits
> and would affect EAGAIN and non-spurious wakes too unless someone
> intentionally special-cased EINTR to do the wrong thing.
> 
> > > > Let me try to summarize the background behind this change again:
> > > > 
> > > > 1) Linux documents futex_wait to return EINTR on signals *or* on
> > > > spurious wake-ups.
> > > 
> > > No, the man pages document this, and they're wrong. I have not seen
> > > any other "Linux documentation" claiming it.
> > 
> > But is there other documentation than the man pages?  The sources don't
> > really count because that's not a guarantee nor a specification, that's
> > the current implementation.
> > 
> > Also, at least one kernel person seems to have confirmed that the
> > current manpage is correct: https://lkml.org/lkml/2014/5/15/356
> 
> The linked mailing list message does not contain the text EINTR at
> all, so I don't see where your claim that it supports the current man
> page text about EINTR comes from.

The mail states:

FUTEX_WAIT

	< Existing blurb seems ok >

... and then it goes on providing revised documentation for some of the
previously documented error codes and adds documentation for a few
additional ones.  It doesn't mention 0 or EINTR otherwise, so I read
this as meaning that the docs for 0 and EINTR are correct.

> > > > 2) If we treat 1) as true -- which we should to unless getting
> > > > confirmation otherwise -- sem_wait must not return EINTR to the caller
> > > > anymore if futex_wait returned EINTR.
> > > > 3) Because of 2), the behavior that is tested in tst-sem6 before my
> > > > patch cannot be implemented anymore.
> > > 
> > > These (2) and (3) are based on false assumptions.
> > 
> > I don't have any evidence to rely on something else.  Don't get me
> > wrong, if we get confirmation from the kernel that 1) is not true, then
> > I'm open to doing something else.  But until then, what should we do? 
> > 
> > Also, the change is within what's allowed by POSIX IMO, so we're not
> > inventing new behavior here.
> 
> It's allowed by POSIX, yes, and as I've said before, I agree it's
> better behavior -- programming with interrupting signal handlers is a
> backwards, bogus practice, and from a hardening standpoint it seems
> preferable not to have sem_wait fail at all.

Okay.

> I just don't think the
> "spurious EINTR is documented" argument should be used to justify such
> a change, because accepting spurious EINTR is going to come back to
> bite us if there are ever other interfaces (I believe aio_suspend
> already is one?) that need to be implemented with futex and need to
> report EINTR.

I don't want to argue against that.  But we need to separate the issues
here.  This topic is something that the kernel has to decide, so we can
only discuss this with them.  Whether or not glibc sem_wait follows
what's currently documented doesn't change that other decision at all.

Thus, let's keep those separate topics separate.  We can try to clarify
with the kernel folks re EINTR before we agree on the new semaphore
implementation, but I wouldn't like to hold up the new semaphore just
because we can't agree with the kernel soon enough.
  

Patch

commit a819010e2a625eafc66fb1bf1301fcba5c05e5e6
Author: Torvald Riegel <triegel@redhat.com>
Date:   Thu Dec 4 20:43:22 2014 +0100

    Use reliable sem_wait signal interruption in tst-sem6

diff --git a/nptl/tst-sem6.c b/nptl/tst-sem6.c
index 2d9f1ab..3efd4eb 100644
--- a/nptl/tst-sem6.c
+++ b/nptl/tst-sem6.c
@@ -22,6 +22,7 @@ 
 #include <stdio.h>
 #include <unistd.h>
 
+sem_t s;
 
 static void
 handler (int sig)
@@ -34,6 +35,10 @@  handler (int sig)
 
   sigaction (SIGALRM, &sa, NULL);
 
+  /* Correctly interrupting a sem_wait without relying on timing assumptions
+     requires using sem_post in the handler.  */
+  sem_post (&s);
+
   /* Rearm the timer.  */
   alarm (1);
 }
@@ -42,7 +47,6 @@  handler (int sig)
 static int
 do_test (void)
 {
-  sem_t s;
   struct sigaction sa;
 
   sa.sa_handler = handler;
@@ -61,14 +65,12 @@  do_test (void)
   alarm (1);
 
   int res = sem_wait (&s);
-  if (res == 0)
-    {
-      puts ("wait succeeded");
-      return 1;
-    }
-  if (res != -1 || errno != EINTR)
+  /* We accept all allowed behavior: Implementations that return EINTR and
+     those that rely on correct interruption through signals to use sem_post
+     in the signal handler.  */
+  if (res != 0 && !(res == -1 && errno == EINTR))
     {
-      puts ("wait didn't fail with EINTR");
+      puts ("wait neiter succeeded nor failed with EINTR");
       return 1;
     }