Fix race between sem_post and semaphore destruction [BZ #12674]

Message ID 20140521110711.GA3598@spoyarek.pnq.redhat.com
State Rejected

Commit Message

Siddhesh Poyarekar May 21, 2014, 11:07 a.m. UTC
  Summary of the race:

T1: enter sem_post and increment value
T2: enter sem_wait and decrement value
T2: return from sem_wait and destroy the semaphore
T1: Check value of semaphore->nwaiters and find an invalid value

The idea for the fix is adapted from Rich Felker's fix for the same
race in musl.  The fix is to prevent sem_post from accessing nwaiters
after it has incremented the value since the state of the semaphore is
not known beyond this point.  This is fairly easy to achieve using
Rich's idea.  One may set the value to a special value of -1 to
indicate waiters.  That way, sem_post can inspect the old value and
call futex_wake if it is necessary.

Rich used this condition as a primary check and the waiter count as a
secondary check, but I think the secondary check is not required in
glibc.  The only time the secondary check would theoretically be
required is when the old value came up as zero *and* there were
waiters to wake up.  This happens only momentarily when an exiting
waiter decrements nwaiters and resets the semaphore value if it is -1
and that operation races with a new waiter entering and losing the
race, thus keeping the value as 0.  This is momentary because the
futex call on such a value will fail with EWOULDBLOCK since it expects
the value to be -1 and not 0.  After this failure, the waiter fixes up
the value to -1 and goes back to wait.

This requires two other changes:

- The type of value is now int instead of unsigned int.  This should
  not break the ABI since we don't expose the sign of the value.  In
  fact, the only place the value is seen is with sem_getvalue, where
  it is int.  And speaking of sem_getvalue...

- sem_getvalue is patched to lie about the actual value if it sees the
  -1 and return 0.

Siddhesh

	[BZ #12674]
	* nptl/sem_getvalue.c (__new_sem_getvalue): Return 0 for
	negative semaphore value.
	* nptl/sysdeps/unix/sysv/linux/internaltypes.h (struct
	new_sem): Change type of VALUE to int.
	* nptl/sysdeps/unix/sysv/linux/sem_post.c (__new_sem_post):
	Avoid accessing semaphore after incrementing it.
	* sysdeps/unix/sysv/linux/i386/i486/sem_post.S
	(__new_sem_post): Likewise.
	* sysdeps/unix/sysv/linux/x86_64/sem_post.S (__new_sem_post):
	Likewise.
	* nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
	(do_futex_timed_wait): Set expected value of futex to -1.
	(sem_timedwait): Set expected value of semaphore to -1.
	* sysdeps/unix/sysv/linux/i386/i486/sem_timedwait.S
	(sem_wait_cleanup): Reset semaphore value when there are no
	waiters.
	(sem_timedwait): Set expected value of semaphore to -1.
	* sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S
	(sem_wait_cleanup): Reset semaphore value when there are no
	waiters.
	(sem_wait_cleanup2): Likewise.
	(sem_timedwait): Set expected value of semaphore to -1.
	* nptl/sysdeps/unix/sysv/linux/sem_wait.c
	(__sem_wait_cleanup): Reset semaphore value when there are no
	waiters.
	(do_futex_wait): Set expected value of futex to -1.
	(__new_sem_wait): Set expected value of semaphore to -1.
	* sysdeps/unix/sysv/linux/i386/i486/sem_wait.S
	(sem_wait_cleanup): Reset semaphore value when there are no
	waiters.
	(__new_sem_wait): Set expected value of semaphore to -1.
	* sysdeps/unix/sysv/linux/x86_64/sem_wait.S
	(sem_wait_cleanup): Reset semaphore value when there are no
	waiters.
	(__new_sem_wait): Set expected value of semaphore to -1.
  

Comments

Andreas Schwab May 21, 2014, 12:06 p.m. UTC | #1
Siddhesh Poyarekar <siddhesh@redhat.com> writes:

> +     set it to a negative value.  There is a transient condition whe it could

s/whe/&n/

Andreas.
  
Rich Felker May 21, 2014, 10:43 p.m. UTC | #2
On Wed, May 21, 2014 at 04:37:11PM +0530, Siddhesh Poyarekar wrote:
> Rich used this condition as a primary check and the waiter count as a
> secondary check, but I think the secondary check is not required in
> glibc.

I think the case I had in mind goes like this:

1. Thread A and B are waiting, val is -1.
2. Thread C posts the semaphore and wakes up thread A, val is now 0.
3. The semaphore gets posted again, but with no indication there's a
   waiter, thread B never wakes up.

You cannot modify sem_post to set the value to -1 rather than 0 if
there's a waiter, because of two issues; one is fundamental and the
other is merely a performance concern:

1. Multiple sem_post calls before the waiters wake up have to
   accumulate the semaphore value increments. You might be able to
   satisfy this using a flag bit instead of the value -1, but it's ugly.

2. There's no way for sem_post to know if there's another remaining
   waiter. This is because it cannot inspect the waiter count after
   modifying the value -- that's the exact bug you're trying to fix.
   So it would have to always set the waiter indicator, resulting in
   every subsequent sem_post call generating futex syscalls.

If I'm missing something obvious, let me know, but I don't think your
approach works. FWIW, I haven't RTFP'd; I'm just basing this response
on your email text.

Rich
  
Patchwork Bot May 22, 2014, 12:05 a.m. UTC | #3
On 22 May 2014 04:13, Rich Felker <dalias@libc.org> wrote:
> I think the case I had in mind goes like this:
>
> 1. Thread A and B are waiting, val is -1.
> 2. Thread C posts the semaphore and wakes up thread A, val is now 0.

This would be:

2. Thread C posts the semaphore and val is now 1.  It sees that the
old val was -1 and calls futex_wake, which wakes up thread A.

Similar to your implementation, sem_post increments the value by 1 if
it is non-negative and by 2 if it is negative, so sem_post will never
result in the value becoming 0.  However, there is one way in which
the value will become 0, which is in sem_wait.  sem_wait will reset
the value to 0 when it sees no waiters.  This could technically race
with another waiter that has not yet called futex_wait, but that
situation fixes itself since the futex_wait will return with
EWOULDBLOCK and fix the value to -1 and call futex_wait again.

> You cannot modify sem_post to set the value to -1 rather than 0 if
> there's a waiter, because of two issues; one is fundamental and the
> other is merely a performance concern:

sem_wait sets the value to -1, not sem_post.  sem_post always sets the
value to something positive.

Siddhesh
  
Rich Felker May 22, 2014, 2:11 a.m. UTC | #4
On Thu, May 22, 2014 at 05:35:27AM +0530, Siddhesh Poyarekar wrote:
> On 22 May 2014 04:13, Rich Felker <dalias@libc.org> wrote:
> > I think the case I had in mind goes like this:
> >
> > 1. Thread A and B are waiting, val is -1.
> > 2. Thread C posts the semaphore and wakes up thread A, val is now 0.
> 
> This would be:
> 
> 2. Thread C posts the semaphore and val is now 1.  It sees that the
> old val was -1 and calls futex_wake, which wakes up thread A.
> 
> Similar to your implementation, sem_post increments the value by 1 if
> it is non-negative and by 2 if it is negative, so sem_post will never
> result in the value becoming 0.  However, there is one way in which
> the value will become 0, which is in sem_wait.  sem_wait will reset
> the value to 0 when it sees no waiters.  This could technically race
> with another waiter that has not yet called futex_wait, but that
> situation fixes itself since the futex_wait will return with
> EWOULDBLOCK and fix the value to -1 and call futex_wait again.

OK I think your reasoning on this sounds correct.

> > You cannot modify sem_post to set the value to -1 rather than 0 if
> > there's a waiter, because of two issues; one is fundamental and the
> > other is merely a performance concern:
> 
> sem_wait sets the value to -1, not sem_post.  sem_post always sets the
> value to something positive.

Yes, and sem_wait can legitimately inspect the waiters count since the
semaphore is still in use, and use it to determine whether to set -1
or 0. In fact my implementation is doing that now (in sem_trywait) so
it may be possible to just remove the secondary check in my code with
no other changes.

BTW the other confusing case I seem to remember is that waiters can
decrement without the semaphore value decrementing, as a result of
EINTR or ETIMEDOUT. This *might* have an impact on the logic but I
don't see right off how it would, and it's been a while since I put
much thought into it.

Rich
  
Patchwork Bot May 22, 2014, 2:38 a.m. UTC | #5
On 22 May 2014 07:41, Rich Felker <dalias@libc.org> wrote:
> BTW the other confusing case I seem to remember is that waiters can
> decrement without the semaphore value decrementing, as a result of
> EINTR or ETIMEDOUT. This *might* have an impact on the logic but I
> don't see right off how it would, and it's been a while since I put
> much thought into it.

I think resetting the value to 0 when there are no waiters covers
this, since that would only have an impact when nwaiters is 0 and the
semaphore value stayed as -1.

Siddhesh
  
Rich Felker May 22, 2014, 3:11 a.m. UTC | #6
On Thu, May 22, 2014 at 08:08:37AM +0530, Siddhesh Poyarekar wrote:
> On 22 May 2014 07:41, Rich Felker <dalias@libc.org> wrote:
> > BTW the other confusing case I seem to remember is that waiters can
> > decrement without the semaphore value decrementing, as a result of
> > EINTR or ETIMEDOUT. This *might* have an impact on the logic but I
> > don't see right off how it would, and it's been a while since I put
> > much thought into it.
> 
> I think resetting the value to 0 when there are no waiters covers
> this, since that would only have an impact when nwaiters is 0 and the
> semaphore value stayed as -1.

I think you're stuck leaving the value as -1 in this case, resulting
in a spurious futex wake syscall on the next post. Any attempt to
reset it to 0 along with decrementing waiters down to 0 seems like it
would create race conditions. Maybe there's a safe way to do it, but I
don't see it.

Rich
  
Patchwork Bot May 22, 2014, 3:27 a.m. UTC | #7
On 22 May 2014 08:41, Rich Felker <dalias@libc.org> wrote:
> I think you're stuck leaving the value as -1 in this case, resulting
> in a spurious futex wake syscall on the next post. Any attempt to
> reset it to 0 along with decrementing waiters down to 0 seems like it
> would create race conditions. Maybe there's a safe way to do it, but I
> don't see it.

Yes, it does have a race condition between two waiters causing an
additional futex_wait syscall in a waiter.  That is, if I reset the
value to 0, it could race with another waiter queuing up, which then
fails its futex_wait with EWOULDBLOCK, fixes up the value and goes
into futex_wait again.  Without the value being fixed up, sem_post
sends a futex_wake despite there being no waiters.  The spurious wake
is also harmless, since if a waiter happens to enter futex_wait just
as sem_post is about to send a wake, it will go right back to sleep
since the value is -1.

I chose the spurious wait instead of wake with the reasoning that
sem_post performance ought to be better than sem_wait since sem_wait
could block and there is less incentive in trying to optimize performance
in a blocking case.

Siddhesh
  
Torvald Riegel May 22, 2014, 8:10 p.m. UTC | #8
On Wed, 2014-05-21 at 16:37 +0530, Siddhesh Poyarekar wrote:
> Summary of the race:
> 
> T1: enter sem_post and increment value
> T2: enter sem_wait and decrement value
> T2: return from sem_wait and destroy the semaphore
> T1: Check value of semaphore->nwaiters and find an invalid value
> 
> The idea for the fix is adapted from Rich Felker's fix for the same
> race in musl.  The fix is to prevent sem_post from accessing nwaiters
> after it has incremented the value since the state of the semaphore is
> not known beyond this point.  This is fairly easy to achieve using
> Rich's idea.  One may set the value to a special value of -1 to
> indicate waiters.  That way, sem_post can inspect the old value and
> call futex_wake if it is necessary.

I haven't looked at musl, so I'll take it for what it seems to be:
adding a state bit that's set whenever there are threads waiting on a
variable that is also used by a futex.  (That's similar to the value 2
used in the mutex impl.)

> Rich used this condition as a primary check and the waiter count as a
> secondary check, but I think the secondary check is not required in
> glibc.  The only time the secondary check would theoretically be
> required is when the old value came up as zero *and* there were
> waiters to wake up.  This happens only momentarily when an exiting
> waiter decrements nwaiters and resets the semaphore value if it is -1
> and that operation races with a new waiter entering and losing the
> race, thus keeping the value as 0.  This is momentary because the
> futex call on such a value will fail with EWOULDBLOCK since it expects
> the value to be -1 and not 0.  After this failure, the waiter fixes up
> the value to -1 and goes back to wait.

That sounds okay, but I don't think it's sufficient to show that your
changes still work.  To do that, I find it useful to think about (and
document!) what the intent behind each piece of the algorithm is.  IOW,
what does it do on an abstract level?  What's the overall state flow?
Which pieces of the algorithm prevent certain states (so that the thing
becomes manageable, overall).  What patterns do we follow?  So, discuss
more of the why, not just the how.  If you can't explain it elegantly
and intuitively, there's a good chance something isn't well
understood :)

For example, one good way to fix this would be to start by documenting
the existing algorithm: write an (informal) proof of why it works, because
this will give you an understanding of how it works, and it is a good
outline for documenting the algorithm.
Try to break down all possible executions into sub-executions that can
happen, and show which sub-executions can't happen at all.  Keep track
of which sub-executions (in particular, memory writes / CAS) are
"pending" in the sense of being possible at arbitrary times; this will
help cutting down the possible execution+state space.

Maybe start without blocking first, assuming all futex waits are
busy-waiting instead.  Then add the blocking.

> This requires two other changes:
> 
> - The type of value is now int instead of unsigned int.  This should
>   not break the ABI since we don't expose the sign of the value.  In
>   fact, the only place the value is seen is with sem_getvalue, where
>   it is int.  And speaking of sem_getvalue...

That's fine I think.

> - sem_getvalue is patched to lie about the actual value if it sees the
>   -1 and return 0.

OT: We should add a note there about having to clarify the ordering
guarantees that this gives.  This is effectively an mo_relaxed load, so
very weak ordering guarantees; OTOH, POSIX seems to want very strong
ordering guarantees for most of its sync operations.  So, I think we
either need to clarify in POSIX or make this at least an acquire load or
such.

> Siddhesh
> 
> 	[BZ #12674]
> 	* nptl/sem_getvalue.c (__new_sem_getvalue): Return 0 for
> 	negative semaphore value.
> 	* nptl/sysdeps/unix/sysv/linux/internaltypes.h (struct
> 	new_sem): Change type of VALUE to int.
> 	* nptl/sysdeps/unix/sysv/linux/sem_post.c (__new_sem_post):
> 	Avoid accessing semaphore after incrementing it.

See below.

> 	* sysdeps/unix/sysv/linux/i386/i486/sem_post.S
> 	(__new_sem_post): Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/sem_post.S (__new_sem_post):
> 	Likewise.
> 	* nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
> 	(do_futex_timed_wait): Set expected value of futex to -1.
> 	(sem_timedwait): Set expected value of semaphore to -1.
> 	* sysdeps/unix/sysv/linux/i386/i486/sem_timedwait.S
> 	(sem_wait_cleanup): Reset semaphore value when there are no
> 	waiters.
> 	(sem_timedwait): Set expected value of semaphore to -1.
> 	* sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S
> 	(sem_wait_cleanup): Reset semaphore value when there are no
> 	waiters.
> 	(sem_wait_cleanup2): Likewise.
> 	(sem_timedwait): Set expected value of semaphore to -1.
> 	* nptl/sysdeps/unix/sysv/linux/sem_wait.c
> 	(__sem_wait_cleanup): Reset semaphore value when there are no
> 	waiters.
> 	(do_futex_wait): Set expected value of futex to -1.
> 	(__new_sem_wait): Set expected value of semaphore to -1.
> 	* sysdeps/unix/sysv/linux/i386/i486/sem_wait.S
> 	(sem_wait_cleanup): Reset semaphore value when there are no
> 	waiters.
> 	(__new_sem_wait): Set expected value of semaphore to -1.
> 	* sysdeps/unix/sysv/linux/x86_64/sem_wait.S
> 	(sem_wait_cleanup): Reset semaphore value when there are no
> 	waiters.
> 	(__new_sem_wait): Set expected value of semaphore to -1.

I think that any change in the generic C versions is a good time to
review whether we still need the custom assembler versions for
performance.  We'd need a multi-threaded benchtest in this case,
probably (to measure round-trip time and contention), but that might be
easier than fixing up all the assembler versions too :)

> 
> 
> diff --git a/nptl/sem_getvalue.c b/nptl/sem_getvalue.c
> index a4ab41f..d9b70fd 100644
> --- a/nptl/sem_getvalue.c
> +++ b/nptl/sem_getvalue.c
> @@ -22,15 +22,15 @@
>  
> 
>  int
> -__new_sem_getvalue (sem, sval)
> -     sem_t *sem;
> -     int *sval;
> +__new_sem_getvalue (sem_t *sem, int *sval)
>  {
>    struct new_sem *isem = (struct new_sem *) sem;
>  
>    /* XXX Check for valid SEM parameter.  */
>  
>    *sval = isem->value;
> +  if (*sval < 0)
> +    *sval = 0;
>  
>    return 0;
>  }
> diff --git a/nptl/sysdeps/unix/sysv/linux/internaltypes.h b/nptl/sysdeps/unix/sysv/linux/internaltypes.h
> index d127f68..5eea097 100644
> --- a/nptl/sysdeps/unix/sysv/linux/internaltypes.h
> +++ b/nptl/sysdeps/unix/sysv/linux/internaltypes.h
> @@ -141,7 +141,7 @@ struct pthread_key_struct
>  /* Semaphore variable structure.  */
>  struct new_sem
>  {
> -  unsigned int value;
> +  int value;
>    int private;
>    unsigned long int nwaiters;
>  };
> diff --git a/nptl/sysdeps/unix/sysv/linux/sem_post.c b/nptl/sysdeps/unix/sysv/linux/sem_post.c
> index 4906adf..0ff1699 100644
> --- a/nptl/sysdeps/unix/sysv/linux/sem_post.c
> +++ b/nptl/sysdeps/unix/sysv/linux/sem_post.c
> @@ -30,24 +30,35 @@ int
>  __new_sem_post (sem_t *sem)
>  {
>    struct new_sem *isem = (struct new_sem *) sem;
> +  int incr, is_private = isem->private;
>  
>    __typeof (isem->value) cur;
>    do
>      {
>        cur = isem->value;
> +      incr = 1 + (cur < 0);
>        if (isem->value == SEM_VALUE_MAX)
>  	{
>  	  __set_errno (EOVERFLOW);
>  	  return -1;
>  	}
>      }
> -  while (atomic_compare_and_exchange_bool_rel (&isem->value, cur + 1, cur));
> +  while (atomic_compare_and_exchange_bool_rel (&isem->value, cur + incr, cur));
>  
>    atomic_full_barrier ();

Why do you still need the full barrier?  AFAICS, it was necessary before
because of the Dekker-like synchronization between nwaiters and value --
but you've changed that (or not?).

> -  if (isem->nwaiters > 0)
> +  /* This is always a sufficient condition to detect waiters.  This is because
> +     once there is either a waiter or a poster, the value is always non-zero at
> +     this point, either because sem_post set it to a positive value or sem_wait
> +     set it to a negative value.  There is a transient condition whe it could
> +     be 0 with a waiter.  This happens when a waiter is cancelled and another
> +     waiter arrives, where a race condition causes the value to be 0 before the
> +     futex_wait is called.  That is fixed immediately since the futex_wait will
> +     return immediately with EWOULDBLOCK, fix the value and go back to
> +     sleep in futex_wait.  */

Why would this condition be the *only* case where this can happen?  I
think this should be documented.  And it will get easier if you document
the state flow for the whole algorithm :)

The only thing you test below is that *this thread* saw the flag for the
waiter-present state.
However, it's the only place where a wake-up for 1 thread happened -- so
it should actually see and act on *all* necessary wake-ups.  Which I
believe isn't the case with your changes.

> +  if (cur < 0)
>      {
>        int err = lll_futex_wake (&isem->value, 1,
> -				isem->private ^ FUTEX_PRIVATE_FLAG);
> +				is_private ^ FUTEX_PRIVATE_FLAG);
>        if (__builtin_expect (err, 0) < 0)
>  	{
>  	  __set_errno (-err);
> diff --git a/nptl/sysdeps/unix/sysv/linux/sem_timedwait.c b/nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
> index 7dfe51d..1e74c40 100644
> --- a/nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
> +++ b/nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
> @@ -38,7 +38,7 @@ do_futex_timed_wait (struct new_sem *isem, struct timespec *rt)
>  {
>    int err, oldtype = __pthread_enable_asynccancel ();
>  
> -  err = lll_futex_timed_wait (&isem->value, 0, rt,
> +  err = lll_futex_timed_wait (&isem->value, -1, rt,
>  			      isem->private ^ FUTEX_PRIVATE_FLAG);
>  
>    __pthread_disable_asynccancel (oldtype);
> @@ -60,6 +60,8 @@ sem_timedwait (sem_t *sem, const struct timespec *abstime)
>        return -1;
>      }
>  
> +  /* If the value is zero, set it to -1 to indicate waiters.  */
> +  atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);

Please document why the 0 -> -1 transition is always ok.  (This CAS can
occur anywhere in an execution, assuming that an incoming waiter is
allowed.)  The overview documentation of the algorithm would be a good
place to explain that, given that this can happen in a few places.

>    atomic_increment (&isem->nwaiters);

What about the memory orders here?  Why is the _acq above
sufficient/required/insufficient?

>  
>    pthread_cleanup_push (__sem_wait_cleanup, isem);
> @@ -106,11 +108,17 @@ sem_timedwait (sem_t *sem, const struct timespec *abstime)
>  	  err = 0;
>  	  break;
>  	}
> +
> +      /* Still not time to wake up.  Go back to sleep.  */
> +      atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
>      }
>  
>    pthread_cleanup_pop (0);
>  
> -  atomic_decrement (&isem->nwaiters);
> +  /* Reset semaphore value to zero if we are the last waiter.  The reset will

This should say "if we _were_ the last waiter" because ...

> +     actually happen only when we exit due to an error.  */
> +  if (atomic_decrement_and_test (&isem->nwaiters))
> +    atomic_compare_and_exchange_val_acq (&isem->value, 0, -1);

... like above, the CAS can happen at any time; the waiter has already
exited (fetch_dec on nwaiters), so it can be suspended, and some time
later do the -1 -> 0 transition.  Why is that okay?

>  
>    return err;
>  }
> diff --git a/nptl/sysdeps/unix/sysv/linux/sem_wait.c b/nptl/sysdeps/unix/sysv/linux/sem_wait.c
> index b12babb..4d3f91b 100644
> --- a/nptl/sysdeps/unix/sysv/linux/sem_wait.c
> +++ b/nptl/sysdeps/unix/sysv/linux/sem_wait.c
> @@ -34,7 +34,12 @@ __sem_wait_cleanup (void *arg)
>  {
>    struct new_sem *isem = (struct new_sem *) arg;
>  
> -  atomic_decrement (&isem->nwaiters);
> +  /* Decrement nwaiters and reset value if there are no other waiters.  This
> +     could race with the futex_wait call in another waiter and cause it to wake
> +     up when it shouldn't, but that is OK since it will go right back to sleep
> +     when it sees that the semaphore value is not what it wants.  */
> +  if (atomic_decrement_and_test (&isem->nwaiters))
> +    atomic_compare_and_exchange_val_acq (&isem->value, 0, -1);

See above.

>  }
>  
>  /* This is in a seperate function in order to make sure gcc
> @@ -46,7 +51,7 @@ do_futex_wait (struct new_sem *isem)
>  {
>    int err, oldtype = __pthread_enable_asynccancel ();
>  
> -  err = lll_futex_wait (&isem->value, 0, isem->private ^ FUTEX_PRIVATE_FLAG);
> +  err = lll_futex_wait (&isem->value, -1, isem->private ^ FUTEX_PRIVATE_FLAG);
>  
>    __pthread_disable_asynccancel (oldtype);
>    return err;
> @@ -61,6 +66,12 @@ __new_sem_wait (sem_t *sem)
>    if (atomic_decrement_if_positive (&isem->value) > 0)
>      return 0;
>  
> +  /* If we are the only know waiter right now, indicate that by setting the

How do you know that's the case?

> +     value to -1.  This is useful to avoid access to nwaiters in sem_post when
> +     the sole waiter exits and destroys the semaphore before sem_post has a
> +     chance to test the value of nwaiters.  */
> +  atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
> +

See above.

>    atomic_increment (&isem->nwaiters);
>  
>    pthread_cleanup_push (__sem_wait_cleanup, isem);
> @@ -80,11 +91,17 @@ __new_sem_wait (sem_t *sem)
>  	  err = 0;
>  	  break;
>  	}
> +
> +      /* Still not time to wake up.  Go back to sleep.  */
> +      atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
>      }
>  
>    pthread_cleanup_pop (0);
>  
> -  atomic_decrement (&isem->nwaiters);
> +  /* Reset semaphore value to zero if we are the last waiter.  The reset will
> +     actually happen only when we exit due to an error.  */
> +  if (atomic_decrement_and_test (&isem->nwaiters))
> +    atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);

Doesn't this reset to -1?

HTH.
  
Rich Felker May 22, 2014, 8:22 p.m. UTC | #9
On Thu, May 22, 2014 at 10:10:06PM +0200, Torvald Riegel wrote:
> > - sem_getvalue is patched to lie about the actual value if it sees the
> >   -1 and return 0.
> 
> OT: We should add a note there about having to clarify the ordering
> guarantees that this gives.  This is effectively an mo_relaxed load, so
> very weak ordering guarantees; OTOH, POSIX seems to want very strong
> ordering guarantees for most of its sync operations.  So, I think we
> either need to clarify in POSIX or make this at least an acquire load or
> such.

AFAIK POSIX does not impose any synchronization on sem_getvalue.

> > diff --git a/nptl/sysdeps/unix/sysv/linux/sem_post.c b/nptl/sysdeps/unix/sysv/linux/sem_post.c
> > index 4906adf..0ff1699 100644
> > --- a/nptl/sysdeps/unix/sysv/linux/sem_post.c
> > +++ b/nptl/sysdeps/unix/sysv/linux/sem_post.c
> > @@ -30,24 +30,35 @@ int
> >  __new_sem_post (sem_t *sem)
> >  {
> >    struct new_sem *isem = (struct new_sem *) sem;
> > +  int incr, is_private = isem->private;
> >  
> >    __typeof (isem->value) cur;
> >    do
> >      {
> >        cur = isem->value;
> > +      incr = 1 + (cur < 0);
> >        if (isem->value == SEM_VALUE_MAX)
> >  	{
> >  	  __set_errno (EOVERFLOW);
> >  	  return -1;
> >  	}
> >      }
> > -  while (atomic_compare_and_exchange_bool_rel (&isem->value, cur + 1, cur));
> > +  while (atomic_compare_and_exchange_bool_rel (&isem->value, cur + incr, cur));
> >  
> >    atomic_full_barrier ();
> 
> Why do you still need the full barrier?  AFAICS, it was necessary before
> because of the Dekker-like synchronization between nwaiters and value --
> but you've changed that (or not?).

Per POSIX, all functions which synchronize memory are full barriers.

Rich
  
Torvald Riegel May 22, 2014, 8:34 p.m. UTC | #10
On Thu, 2014-05-22 at 16:22 -0400, Rich Felker wrote:
> On Thu, May 22, 2014 at 10:10:06PM +0200, Torvald Riegel wrote:
> > > - sem_getvalue is patched to lie about the actual value if it sees the
> > >   -1 and return 0.
> > 
> > OT: We should add a note there about having to clarify the ordering
> > guarantees that this gives.  This is effectively an mo_relaxed load, so
> > very weak ordering guarantees; OTOH, POSIX seems to want very strong
> > ordering guarantees for most of its sync operations.  So, I think we
> > either need to clarify in POSIX or make this at least an acquire load or
> > such.
> 
> AFAIK POSIX does not impose any synchronization on sem_getvalue.

Standard says:
  The updated value represents an actual semaphore value that occurred
  at some unspecified time during the call, but it need not be the
  actual value of the semaphore when it is returned to the calling
  process.

The second part is obvious (in an asynchronous system).  The first part
makes this, effectively, linearizable without saying that explicitly.
That's not what mo_relaxed guarantees.

> > > diff --git a/nptl/sysdeps/unix/sysv/linux/sem_post.c b/nptl/sysdeps/unix/sysv/linux/sem_post.c
> > > index 4906adf..0ff1699 100644
> > > --- a/nptl/sysdeps/unix/sysv/linux/sem_post.c
> > > +++ b/nptl/sysdeps/unix/sysv/linux/sem_post.c
> > > @@ -30,24 +30,35 @@ int
> > >  __new_sem_post (sem_t *sem)
> > >  {
> > >    struct new_sem *isem = (struct new_sem *) sem;
> > > +  int incr, is_private = isem->private;
> > >  
> > >    __typeof (isem->value) cur;
> > >    do
> > >      {
> > >        cur = isem->value;
> > > +      incr = 1 + (cur < 0);
> > >        if (isem->value == SEM_VALUE_MAX)
> > >  	{
> > >  	  __set_errno (EOVERFLOW);
> > >  	  return -1;
> > >  	}
> > >      }
> > > -  while (atomic_compare_and_exchange_bool_rel (&isem->value, cur + 1, cur));
> > > +  while (atomic_compare_and_exchange_bool_rel (&isem->value, cur + incr, cur));
> > >  
> > >    atomic_full_barrier ();
> > 
> > Why do you still need the full barrier?  AFAICS, it was necessary before
> > because of the Dekker-like synchronization between nwaiters and value --
> > but you've changed that (or not?).
> 
> Per POSIX, all functions which synchronize memory are full barriers.

But we haven't implemented them this way (e.g., see the existing
sem_wait, or mutexes), and that's The Right Thing To Do IMO.  Yes, I
know this should be taken to the Austin group, but it's a nontrivial
thing because it would, I believe, require them to adopt a more complex
memory model and/or at least spell out in detail what's actually
required.
  
Siddhesh Poyarekar May 23, 2014, 11:03 a.m. UTC | #11
On Thu, May 22, 2014 at 10:10:06PM +0200, Torvald Riegel wrote:
> I haven't looked at musl, so I'll take it for what it seems to be:
> adding a state bit that's set whenever there are threads waiting on a
> variable that is also used by a futex.  (That's similar to the value 2
> used in the mutex impl.)

IIRC it is the same as the approach I have used, except for resetting
the semaphore value when there are no waiters, which is why I credited
Rich with the idea.

> That sounds okay, but I don't think it's sufficient to show that your
> changes still work.  To do that, I find it useful to think about (and
> document!) what the intent behind each piece of the algorithm is.  IOW,
> what does it do on an abstract level?  What's the overall state flow?
> Which pieces of the algorithm prevent certain states (so that the thing
> becomes manageable, overall).  What patterns do we follow?  So, discuss
> more of the why, not just the how.  If you can't explain it elegantly
> and intuitively, there's a good chance something isn't well
> understood :)

I attempted to document the algorithm in sem_wait.c, but while doing
so I found some flaws (including the one you pointed out), so I have
to go back to the drawing board on this.  I'll respond to your
questions though, hoping that they'll expose more flaws in my
understanding.

> The only thing you test below is that *this thread* saw the flag for the
> waiter-present state.
> However, it's the only place where a wake-up for 1 thread happened -- so
> it should actually see and act on *all* necessary wake-ups.  Which I
> believe isn't the case with your changes.

Right, subsequent posters won't wake any threads since they see a
positive semaphore value, increment it and return, possibly resulting
in perennially sleeping threads.  This is a problem.

> Please document why the 0 -> -1 transition is always ok.  (This CAS can
> occur anywhere in an execution, assuming that an incoming waiter is
> allowed.)  The overview documentation of the algorithm would be a good
> place to explain that, given that this can happen in a few places.

A semaphore value of 0 can occur only in the following cases:

- Initial state when there were no earlier waiters or posters
- The last waiter exited (either by decrementing the value or resetting it)

If a poster increments the value in the meantime (so that this 0 -> -1
transition fails or is overwritten), futex_wait will return with
EWOULDBLOCK and the subsequent CAS should get us the posted value.

> >    atomic_increment (&isem->nwaiters);
> 
> What about the memory orders here?  Why is the _acq above
> sufficient/required/insufficient?

The order between incrementing nwaiters and setting the semaphore
value does not matter, because we cannot avoid the race between two
waiters that results in an extra futex_wait anyway.

> > +     actually happen only when we exit due to an error.  */
> > +  if (atomic_decrement_and_test (&isem->nwaiters))
> > +    atomic_compare_and_exchange_val_acq (&isem->value, 0, -1);
> 
> ... like above, the CAS can happen at any time; the waiters has already
> exited (fetch_dec on nwaiters), so it can be suspended, and some time
> later do the -1 -> 0 transition.  Why is that okay?

It is not OK, because it could result in sem_post not waking a waiter,
since sem_post will see the 0.

> >  
> > +  /* If we are the only know waiter right now, indicate that by setting the
> 
> How do you know that's the case?

Actually, we don't care whether that is the case.  It's just that we
need to set the value explicitly when we are the first waiter.

> > +  if (atomic_decrement_and_test (&isem->nwaiters))
> > +    atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
> 
> Doesn't this reset to -1?

Typo, thanks for catching that.

Siddhesh
  

Patch

diff --git a/nptl/sem_getvalue.c b/nptl/sem_getvalue.c
index a4ab41f..d9b70fd 100644
--- a/nptl/sem_getvalue.c
+++ b/nptl/sem_getvalue.c
@@ -22,15 +22,15 @@ 
 
 
 int
-__new_sem_getvalue (sem, sval)
-     sem_t *sem;
-     int *sval;
+__new_sem_getvalue (sem_t *sem, int *sval)
 {
   struct new_sem *isem = (struct new_sem *) sem;
 
   /* XXX Check for valid SEM parameter.  */
 
   *sval = isem->value;
+  if (*sval < 0)
+    *sval = 0;
 
   return 0;
 }
diff --git a/nptl/sysdeps/unix/sysv/linux/internaltypes.h b/nptl/sysdeps/unix/sysv/linux/internaltypes.h
index d127f68..5eea097 100644
--- a/nptl/sysdeps/unix/sysv/linux/internaltypes.h
+++ b/nptl/sysdeps/unix/sysv/linux/internaltypes.h
@@ -141,7 +141,7 @@  struct pthread_key_struct
 /* Semaphore variable structure.  */
 struct new_sem
 {
-  unsigned int value;
+  int value;
   int private;
   unsigned long int nwaiters;
 };
diff --git a/nptl/sysdeps/unix/sysv/linux/sem_post.c b/nptl/sysdeps/unix/sysv/linux/sem_post.c
index 4906adf..0ff1699 100644
--- a/nptl/sysdeps/unix/sysv/linux/sem_post.c
+++ b/nptl/sysdeps/unix/sysv/linux/sem_post.c
@@ -30,24 +30,35 @@  int
 __new_sem_post (sem_t *sem)
 {
   struct new_sem *isem = (struct new_sem *) sem;
+  int incr, is_private = isem->private;
 
   __typeof (isem->value) cur;
   do
     {
       cur = isem->value;
+      incr = 1 + (cur < 0);
       if (isem->value == SEM_VALUE_MAX)
 	{
 	  __set_errno (EOVERFLOW);
 	  return -1;
 	}
     }
-  while (atomic_compare_and_exchange_bool_rel (&isem->value, cur + 1, cur));
+  while (atomic_compare_and_exchange_bool_rel (&isem->value, cur + incr, cur));
 
   atomic_full_barrier ();
-  if (isem->nwaiters > 0)
+  /* This is always a sufficient condition to detect waiters.  This is because
+     once there is either a waiter or a poster, the value is always non-zero at
+     this point, either because sem_post set it to a positive value or sem_wait
+     set it to a negative value.  There is a transient condition when it could
+     be 0 with a waiter.  This happens when a waiter is cancelled and another
+     waiter arrives, where a race condition causes the value to be 0 before the
+     futex_wait is called.  That is fixed immediately, since the futex_wait
+     will return with EWOULDBLOCK, fix the value and go back to
+     sleep in futex_wait.  */
+  if (cur < 0)
     {
       int err = lll_futex_wake (&isem->value, 1,
-				isem->private ^ FUTEX_PRIVATE_FLAG);
+				is_private ^ FUTEX_PRIVATE_FLAG);
       if (__builtin_expect (err, 0) < 0)
 	{
 	  __set_errno (-err);
diff --git a/nptl/sysdeps/unix/sysv/linux/sem_timedwait.c b/nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
index 7dfe51d..1e74c40 100644
--- a/nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
+++ b/nptl/sysdeps/unix/sysv/linux/sem_timedwait.c
@@ -38,7 +38,7 @@  do_futex_timed_wait (struct new_sem *isem, struct timespec *rt)
 {
   int err, oldtype = __pthread_enable_asynccancel ();
 
-  err = lll_futex_timed_wait (&isem->value, 0, rt,
+  err = lll_futex_timed_wait (&isem->value, -1, rt,
 			      isem->private ^ FUTEX_PRIVATE_FLAG);
 
   __pthread_disable_asynccancel (oldtype);
@@ -60,6 +60,8 @@  sem_timedwait (sem_t *sem, const struct timespec *abstime)
       return -1;
     }
 
+  /* If the value is zero, set it to -1 to indicate waiters.  */
+  atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
   atomic_increment (&isem->nwaiters);
 
   pthread_cleanup_push (__sem_wait_cleanup, isem);
@@ -106,11 +108,17 @@  sem_timedwait (sem_t *sem, const struct timespec *abstime)
 	  err = 0;
 	  break;
 	}
+
+      /* Still not time to wake up.  Go back to sleep.  */
+      atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
     }
 
   pthread_cleanup_pop (0);
 
-  atomic_decrement (&isem->nwaiters);
+  /* Reset semaphore value to zero if we are the last waiter.  The reset will
+     actually happen only when we exit due to an error.  */
+  if (atomic_decrement_and_test (&isem->nwaiters))
+    atomic_compare_and_exchange_val_acq (&isem->value, 0, -1);
 
   return err;
 }
diff --git a/nptl/sysdeps/unix/sysv/linux/sem_wait.c b/nptl/sysdeps/unix/sysv/linux/sem_wait.c
index b12babb..4d3f91b 100644
--- a/nptl/sysdeps/unix/sysv/linux/sem_wait.c
+++ b/nptl/sysdeps/unix/sysv/linux/sem_wait.c
@@ -34,7 +34,12 @@  __sem_wait_cleanup (void *arg)
 {
   struct new_sem *isem = (struct new_sem *) arg;
 
-  atomic_decrement (&isem->nwaiters);
+  /* Decrement nwaiters and reset value if there are no other waiters.  This
+     could race with the futex_wait call in another waiter and cause it to wake
+     up when it shouldn't, but that is OK since it will go right back to sleep
+     when it sees that the semaphore value is not what it wants.  */
+  if (atomic_decrement_and_test (&isem->nwaiters))
+    atomic_compare_and_exchange_val_acq (&isem->value, 0, -1);
 }
 
 /* This is in a seperate function in order to make sure gcc
@@ -46,7 +51,7 @@  do_futex_wait (struct new_sem *isem)
 {
   int err, oldtype = __pthread_enable_asynccancel ();
 
-  err = lll_futex_wait (&isem->value, 0, isem->private ^ FUTEX_PRIVATE_FLAG);
+  err = lll_futex_wait (&isem->value, -1, isem->private ^ FUTEX_PRIVATE_FLAG);
 
   __pthread_disable_asynccancel (oldtype);
   return err;
@@ -61,6 +66,12 @@  __new_sem_wait (sem_t *sem)
   if (atomic_decrement_if_positive (&isem->value) > 0)
     return 0;
 
+  /* If we are the only known waiter right now, indicate that by setting the
+     value to -1.  This is useful to avoid access to nwaiters in sem_post when
+     the sole waiter exits and destroys the semaphore before sem_post has a
+     chance to test the value of nwaiters.  */
+  atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
+
   atomic_increment (&isem->nwaiters);
 
   pthread_cleanup_push (__sem_wait_cleanup, isem);
@@ -80,11 +91,17 @@  __new_sem_wait (sem_t *sem)
 	  err = 0;
 	  break;
 	}
+
+      /* Still not time to wake up.  Go back to sleep.  */
+      atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
     }
 
   pthread_cleanup_pop (0);
 
-  atomic_decrement (&isem->nwaiters);
+  /* Reset semaphore value to zero if we are the last waiter.  The reset will
+     actually happen only when we exit due to an error.  */
+  if (atomic_decrement_and_test (&isem->nwaiters))
+    atomic_compare_and_exchange_val_acq (&isem->value, -1, 0);
 
   return err;
 }
diff --git a/sysdeps/unix/sysv/linux/i386/i486/sem_post.S b/sysdeps/unix/sysv/linux/i386/i486/sem_post.S
index bc091a0..82435ab 100644
--- a/sysdeps/unix/sysv/linux/i386/i486/sem_post.S
+++ b/sysdeps/unix/sysv/linux/i386/i486/sem_post.S
@@ -35,6 +35,7 @@  __new_sem_post:
 	cfi_offset(%ebx, -8)
 
 	movl	8(%esp), %ebx
+	movl	PRIVATE(%ebx), %ecx
 
 #if VALUE == 0
 	movl	(%ebx), %eax
@@ -43,8 +44,13 @@  __new_sem_post:
 #endif
 0:	cmpl	$SEM_VALUE_MAX, %eax
 	je	3f
+
+	/* Add 2 to the value if it is negative, or else add 1.  */
 	leal	1(%eax), %edx
-	LOCK
+	testl   %eax, %eax
+	jns     6f
+	addl    $1, %edx
+6:	LOCK
 #if VALUE == 0
 	cmpxchgl %edx, (%ebx)
 #else
@@ -52,11 +58,12 @@  __new_sem_post:
 #endif
 	jnz	0b
 
-	cmpl	$0, NWAITERS(%ebx)
-	je	2f
+	/* Test the old semaphore value again.  Don't do the syscall if the
+	   sign bit is not set, i.e. the value is not negative.  */
+	testl	%eax, %eax
+	jns	2f
 
-	movl	$FUTEX_WAKE, %ecx
-	orl	PRIVATE(%ebx), %ecx
+	orl	$FUTEX_WAKE, %ecx
 	movl	$1, %edx
 	movl	$SYS_futex, %eax
 	ENTER_KERNEL
diff --git a/sysdeps/unix/sysv/linux/i386/i486/sem_timedwait.S b/sysdeps/unix/sysv/linux/i386/i486/sem_timedwait.S
index 94d052a..575359b 100644
--- a/sysdeps/unix/sysv/linux/i386/i486/sem_timedwait.S
+++ b/sysdeps/unix/sysv/linux/i386/i486/sem_timedwait.S
@@ -39,7 +39,7 @@  sem_timedwait:
 
 	movl	(%ecx), %eax
 2:	testl	%eax, %eax
-	je	1f
+	jle	1f
 
 	leal	-1(%eax), %edx
 	LOCK
@@ -87,17 +87,25 @@  sem_timedwait:
 	addl	$1000000000, %edx
 	subl	$1, %ecx
 5:	testl	%ecx, %ecx
-	movl	$ETIMEDOUT, %esi
-	js	6f		/* Time is already up.  */
+	movl	$-ETIMEDOUT, %esi
+	movl	28(%esp), %ebx	/* Load semaphore address.  */
+	js	10f		/* Time is already up.  */
 
 	movl	%ecx, (%esp)	/* Store relative timeout.  */
 	movl	%edx, 4(%esp)
 
+	/* If value is 0, set it to -1.  */
+	movl	%eax, %edx
+	movl	$-1, %ecx
+	xorl	%eax, %eax
+	LOCK
+	cmpxchgl %ecx, (%ebx)
+	movl	%edx, %eax
+
 .LcleanupSTART:
 	call	__pthread_enable_asynccancel
 	movl	%eax, 8(%esp)
 
-	movl	28(%esp), %ebx	/* Load semaphore address.  */
 #if FUTEX_WAIT == 0
 	movl	PRIVATE(%ebx), %ecx
 #else
@@ -105,7 +113,7 @@  sem_timedwait:
 	orl	PRIVATE(%ebx), %ecx
 #endif
 	movl	%esp, %esi
-	xorl	%edx, %edx
+	movl	$-1, %edx
 	movl	$SYS_futex, %eax
 	ENTER_KERNEL
 	movl	%eax, %esi
@@ -117,11 +125,11 @@  sem_timedwait:
 	testl	%esi, %esi
 	je	9f
 	cmpl	$-EWOULDBLOCK, %esi
-	jne	3f
+	jne	10f
 
 9:	movl	(%ebx), %eax
 8:	testl	%eax, %eax
-	je	7b
+	jle	7b
 
 	leal	-1(%eax), %ecx
 	LOCK
@@ -130,10 +138,23 @@  sem_timedwait:
 
 	xorl	%eax, %eax
 
-	LOCK
+10:	LOCK
 	decl	NWAITERS(%ebx)
+	jnz	11f
+
+	movl	%eax, %edx
+	movl	$-1, %eax
+	xorl	%ecx, %ecx
+	LOCK
+	cmpxchgl %ecx, (%ebx)
+	movl	%edx, %eax
+	/* Set errno if the kernel returned an error.  We moved the syscall
+	   return value into %esi earlier.  */
+	test	%esi, %esi
+	jnz	6f
+
+11:	addl	$12, %esp
 
-10:	addl	$12, %esp
 .Ladd_esp:
 	popl	%ebx
 .Lpop_ebx:
@@ -144,11 +165,7 @@  sem_timedwait:
 	ret
 
 .Lafter_ret:
-3:	negl	%esi
-6:
-	movl	28(%esp), %ebx	/* Load semaphore address.  */
-	LOCK
-	decl	NWAITERS(%ebx)
+6:	negl	%esi
 .Lerrno_exit:
 #ifdef PIC
 	SETUP_PIC_REG(bx)
@@ -167,15 +184,23 @@  sem_timedwait:
 #endif
 
 	orl	$-1, %eax
-	jmp	10b
+	jmp	11b
 	.size	sem_timedwait,.-sem_timedwait
 
 
 	.type	sem_wait_cleanup,@function
 sem_wait_cleanup:
+	movl	%eax, %edx
 	LOCK
 	decl	NWAITERS(%ebx)
-	movl	%eax, (%esp)
+	jnz	11f
+
+	movl	$-1, %eax
+	xorl	%ecx, %ecx
+	LOCK
+	cmpxchgl %ecx, (%ebx)
+
+11:	movl	%edx, (%esp)
 .LcallUR:
 	call	_Unwind_Resume@PLT
 	hlt
diff --git a/sysdeps/unix/sysv/linux/i386/i486/sem_wait.S b/sysdeps/unix/sysv/linux/i386/i486/sem_wait.S
index 14d616f..db7637f 100644
--- a/sysdeps/unix/sysv/linux/i386/i486/sem_wait.S
+++ b/sysdeps/unix/sysv/linux/i386/i486/sem_wait.S
@@ -45,13 +45,13 @@  __new_sem_wait:
 
 	movl	(%ebx), %eax
 2:	testl	%eax, %eax
-	je	1f
+	jle	1f
 
 	leal	-1(%eax), %edx
 	LOCK
 	cmpxchgl %edx, (%ebx)
 	jne	2b
-7:	xorl	%eax, %eax
+	xorl	%eax, %eax
 
 9:	movl	4(%esp), %esi
 	movl	8(%esp), %ebx
@@ -63,8 +63,16 @@  __new_sem_wait:
 1:	LOCK
 	incl	NWAITERS(%ebx)
 
+	/* If value is 0, set it to -1.  */
+6:	movl	%eax, %edx
+	movl	$-1, %ecx
+	xorl	%eax, %eax
+	LOCK
+	cmpxchgl %ecx, (%ebx)
+	movl	%edx, %eax
+
 .LcleanupSTART:
-6:	call	__pthread_enable_asynccancel
+	call	__pthread_enable_asynccancel
 	movl	%eax, (%esp)
 
 #if FUTEX_WAIT == 0
@@ -74,7 +82,7 @@  __new_sem_wait:
 	orl	PRIVATE(%ebx), %ecx
 #endif
 	xorl	%esi, %esi
-	xorl	%edx, %edx
+	movl	$-1, %edx
 	movl	$SYS_futex, %eax
 	ENTER_KERNEL
 	movl	%eax, %esi
@@ -91,19 +99,30 @@  __new_sem_wait:
 3:
 	movl	(%ebx), %eax
 5:	testl	%eax, %eax
-	je	6b
+	jle	6b
 
 	leal	-1(%eax), %edx
 	LOCK
 	cmpxchgl %edx, (%ebx)
 	jne	5b
 
-	LOCK
+	xorl	%eax, %eax
+	/* Decrement nwaiters and if zero, reset the value to 0 if it is
+	   -1.  */
+7:	LOCK
 	decl	NWAITERS(%ebx)
-	jmp	7b
+	jnz	9b
+	movl	%eax, %edx
+	movl	$-1, %eax
+	xorl	%ecx, %ecx
+	LOCK
+	cmpxchgl %ecx, (%ebx)
+	movl	%edx, %eax
+	jmp	9b
 
-4:	LOCK
-	decl	NWAITERS(%ebx)
+	/* Back up the semaphore pointer before we set up the GOT pointer to
+	   store errno.  */
+4:	movl	%ebx, %ecx
 
 	negl	%esi
 #ifdef PIC
@@ -122,17 +141,24 @@  __new_sem_wait:
 	movl	%esi, %gs:(%edx)
 #endif
 	orl	$-1, %eax
+	movl	%ecx, %ebx
 
-	jmp	9b
+	jmp	7b
 	.size	__new_sem_wait,.-__new_sem_wait
 	versioned_symbol(libpthread, __new_sem_wait, sem_wait, GLIBC_2_1)
 
 
 	.type	sem_wait_cleanup,@function
 sem_wait_cleanup:
+	movl	%eax, %edx
 	LOCK
 	decl	NWAITERS(%ebx)
-	movl	%eax, (%esp)
+	jnz	1f
+	movl	$-1, %eax
+	xorl	%ecx, %ecx
+	LOCK
+	cmpxchgl %ecx, (%ebx)
+1:	movl	%edx, (%esp)
 .LcallUR:
 	call	_Unwind_Resume@PLT
 	hlt
diff --git a/sysdeps/unix/sysv/linux/x86_64/sem_post.S b/sysdeps/unix/sysv/linux/x86_64/sem_post.S
index 1c11600..854fa56 100644
--- a/sysdeps/unix/sysv/linux/x86_64/sem_post.S
+++ b/sysdeps/unix/sysv/linux/x86_64/sem_post.S
@@ -29,6 +29,8 @@ 
 	.type	sem_post,@function
 	.align	16
 sem_post:
+	/* Get the private flag in advance.  */
+	movl	PRIVATE(%rdi), %esi
 #if VALUE == 0
 	movl	(%rdi), %eax
 #else
@@ -36,21 +38,27 @@  sem_post:
 #endif
 0:	cmpl	$SEM_VALUE_MAX, %eax
 	je	3f
-	leal	1(%rax), %esi
-	LOCK
+
+	/* Add 2 to the value if it is negative, or else add 1.  */
+	leal	1(%rax), %edx
+	testl	%eax, %eax
+	jns	5f
+	addl	$1, %edx
+5:	LOCK
 #if VALUE == 0
-	cmpxchgl %esi, (%rdi)
+	cmpxchgl %edx, (%rdi)
 #else
-	cmpxchgl %esi, VALUE(%rdi)
+	cmpxchgl %edx, VALUE(%rdi)
 #endif
 	jnz	0b
 
-	LP_OP(cmp) $0, NWAITERS(%rdi)
-	je	2f
+	/* Test the old semaphore value again.  Don't do the syscall if the
+	   sign bit is not set, i.e. the value is not negative.  */
+	testl	%eax, %eax
+	jns	2f
 
 	movl	$SYS_futex, %eax
-	movl	$FUTEX_WAKE, %esi
-	orl	PRIVATE(%rdi), %esi
+	orl	$FUTEX_WAKE, %esi
 	movl	$1, %edx
 	syscall
 
diff --git a/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S b/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S
index 880610e..104505b 100644
--- a/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S
+++ b/sysdeps/unix/sysv/linux/x86_64/sem_timedwait.S
@@ -45,7 +45,7 @@  sem_timedwait:
 	movl	VALUE(%rdi), %eax
 #endif
 2:	testl	%eax, %eax
-	je	1f
+	jle	1f
 
 	leaq	-1(%rax), %rdx
 	LOCK
@@ -85,10 +85,24 @@  sem_timedwait:
 	LOCK
 	LP_OP(add) $1, NWAITERS(%rdi)
 
+
+13:	movl	%eax, %r8d
+
+	/* If the semaphore value is 0, set it to -1 before going to wait.  */
+	movl	$-1, %esi
+	movl	$0, %eax
+
+	LOCK
+#if VALUE == 0
+	cmpxchgl	%esi, (%rdi)
+#else
+	cmpxchgl	%esi, VALUE(%rdi)
+#endif
+	movl	%r8d, %eax
+
 .LcleanupSTART:
-13:	call	__pthread_enable_asynccancel
+	call	__pthread_enable_asynccancel
 	movl	%eax, %r8d
-
 #if VALUE != 0
 	leaq	VALUE(%rdi), %rdi
 #endif
@@ -96,7 +110,7 @@  sem_timedwait:
 	movl	$FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, %esi
 	orl	PRIVATE(%rdi), %esi
 	movl	$SYS_futex, %eax
-	xorl	%edx, %edx
+	movl	$-1, %edx
 	syscall
 	movq	%rax, %r9
 #if VALUE != 0
@@ -120,7 +134,7 @@  sem_timedwait:
 	movl	VALUE(%rdi), %eax
 #endif
 14:	testl	%eax, %eax
-	je	13b
+	jle	13b
 
 	leaq	-1(%rax), %rcx
 	LOCK
@@ -135,8 +149,24 @@  sem_timedwait:
 
 15:	LOCK
 	LP_OP(sub) $1, NWAITERS(%rdi)
+	jne 17f
+
+	/* If we were the last waiter, reset the value to 0 if it was set to
+	   -1.  This may race with another thread setting itself up to wait,
+	   but it is OK since it will just spin around and set up its wait
+	   again.  */
+	movq	%rax, %r12
+	movl	$-1, %eax
+	movl	$0, %edx
+	LOCK
+#if VALUE == 0
+	cmpxchgl	%edx, (%rdi)
+#else
+	cmpxchgl	%edx, VALUE(%rdi)
+#endif
+	movq	%r12, %rax
 
-	leaq	8(%rsp), %rsp
+17:	leaq	8(%rsp), %rsp
 	cfi_adjust_cfa_offset(-8)
 	retq
 
@@ -215,6 +245,19 @@  sem_timedwait:
 	movq	%rdi, (%rsp)	/* Store relative timeout.  */
 	movq	%rsi, 8(%rsp)
 
+	/* If the semaphore value is 0, set it to -1 before going to wait.  */
+	movl	%eax, %r10d
+	movl	$-1, %esi
+	movl	$0, %eax
+
+	LOCK
+#if VALUE == 0
+	cmpxchgl	%esi, (%r12)
+#else
+	cmpxchgl	%esi, VALUE(%r12)
+#endif
+	movl	%r10d, %eax
+
 .LcleanupSTART2:
 	call	__pthread_enable_asynccancel
 	movl	%eax, 16(%rsp)
@@ -232,7 +275,7 @@  sem_timedwait:
 	orl	PRIVATE(%rdi), %esi
 # endif
 	movl	$SYS_futex, %eax
-	xorl	%edx, %edx
+	movl	$-1, %edx
 	syscall
 	movq	%rax, %r14
 
@@ -252,7 +295,7 @@  sem_timedwait:
 	movl	VALUE(%r12), %eax
 # endif
 8:	testl	%eax, %eax
-	je	7b
+	jle	7b
 
 	leaq	-1(%rax), %rcx
 	LOCK
@@ -267,8 +310,24 @@  sem_timedwait:
 
 45:	LOCK
 	LP_OP(sub) $1, NWAITERS(%r12)
+	jne 46f
+
+	/* If we were the last waiter, reset the value to 0 if it was set to
+	   -1.  This may race with another thread setting itself up to wait,
+	   but it is OK since it will just spin around and set up its wait
+	   again.  */
+	movq	%rax, %r13
+	movl	$-1, %eax
+	movl	$0, %edx
+	LOCK
+#if VALUE == 0
+	cmpxchgl	%edx, (%r12)
+#else
+	cmpxchgl	%edx, VALUE(%r12)
+#endif
+	movq	%r13, %rax
 
-	addq	$STACKFRAME, %rsp
+46:	addq	$STACKFRAME, %rsp
 	cfi_adjust_cfa_offset(-STACKFRAME)
 	popq	%r14
 	cfi_adjust_cfa_offset(-8)
@@ -305,7 +364,24 @@  sem_timedwait_cleanup:
 	movq	(%rsp), %rdi
 	LOCK
 	LP_OP(sub) $1, NWAITERS(%rdi)
-	movq	%rax, %rdi
+	movq	%rax, %r12
+	jne 1f
+
+	/* If we were the last waiter, reset the value to 0 if it was set to
+	   -1.  This may race with another thread setting itself up to wait,
+	   but it is OK since it will just spin around and set up its wait
+	   again.  */
+	movl	$-1, %eax
+	movl	$0, %edx
+
+	LOCK
+#if VALUE == 0
+	cmpxchgl	%edx, (%rdi)
+#else
+	cmpxchgl	%edx, VALUE(%rdi)
+#endif
+
+1:	movq	%r12, %rdi
 .LcallUR:
 	call	_Unwind_Resume@PLT
 	hlt
@@ -326,7 +402,23 @@  sem_timedwait_cleanup2:
 	LOCK
 	LP_OP(sub) $1, NWAITERS(%r12)
 	movq	%rax, %rdi
-	movq	STACKFRAME(%rsp), %r14
+	jne 1f
+
+	/* If we were the last waiter, reset the value to 0 if it was set to
+	   -1.  This may race with another thread setting itself up to wait,
+	   but it is OK since it will just spin around and set up its wait
+	   again.  */
+	movl	$-1, %eax
+	movl	$0, %edx
+
+	LOCK
+#if VALUE == 0
+	cmpxchgl	%edx, (%r12)
+#else
+	cmpxchgl	%edx, VALUE(%r12)
+#endif
+
+1:	movq	STACKFRAME(%rsp), %r14
 	movq	STACKFRAME+8(%rsp), %r13
 	movq	STACKFRAME+16(%rsp), %r12
 .LcallUR2:
diff --git a/sysdeps/unix/sysv/linux/x86_64/sem_wait.S b/sysdeps/unix/sysv/linux/x86_64/sem_wait.S
index 8f4d068..bdbeb8a 100644
--- a/sysdeps/unix/sysv/linux/x86_64/sem_wait.S
+++ b/sysdeps/unix/sysv/linux/x86_64/sem_wait.S
@@ -46,7 +46,7 @@  sem_wait:
 	movl	VALUE(%rdi), %eax
 #endif
 2:	testl	%eax, %eax
-	je	1f
+	jle	1f
 
 	leal	-1(%rax), %edx
 	LOCK
@@ -68,8 +68,21 @@  sem_wait:
 	LOCK
 	LP_OP(add) $1, NWAITERS(%rdi)
 
+	/* If the semaphore value is 0, set it to -1 before going to wait.  */
+6:	movl	%eax, %r8d
+	movl	$-1, %esi
+	movl	$0, %eax
+
+	LOCK
+#if VALUE == 0
+	cmpxchgl	%esi, (%rdi)
+#else
+	cmpxchgl	%esi, VALUE(%rdi)
+#endif
+	movl	%r8d, %eax
+
 .LcleanupSTART:
-6:	call	__pthread_enable_asynccancel
+	call	__pthread_enable_asynccancel
 	movl	%eax, %r8d
 
 	xorq	%r10, %r10
@@ -80,7 +93,7 @@  sem_wait:
 	movl	$FUTEX_WAIT, %esi
 	orl	PRIVATE(%rdi), %esi
 #endif
-	xorl	%edx, %edx
+	movl	$-1, %edx
 	syscall
 	movq	%rax, %rcx
 
@@ -101,7 +114,7 @@  sem_wait:
 	movl	VALUE(%rdi), %eax
 #endif
 5:	testl	%eax, %eax
-	je	6b
+	jle	6b
 
 	leal	-1(%rax), %edx
 	LOCK
@@ -137,7 +150,23 @@  sem_wait_cleanup:
 	movq	(%rsp), %rdi
 	LOCK
 	LP_OP(sub) $1, NWAITERS(%rdi)
-	movq	%rax, %rdi
+	movq	%rax, %r12
+	jne 1f
+
+	/* If we were the last waiter, reset the value to 0 if it was set to
+	   -1.  This may race with another thread setting itself up to wait,
+	   but it is OK since it will just spin around and set up its wait
+	   again.  */
+	movl	$-1, %eax
+	movl	$0, %edx
+
+	LOCK
+#if VALUE == 0
+	cmpxchgl	%edx, (%rdi)
+#else
+	cmpxchgl	%edx, VALUE(%rdi)
+#endif
+1:	movq	%r12, %rdi
 .LcallUR:
 	call	_Unwind_Resume@PLT
 	hlt