[1/2] nptl: Simplify condition variables to fix pthread_cond_signal (BZ 25847)

Message ID 20221015195305.1322087-1-malteskarupke@fastmail.fm
State New
Headers
Series [1/2] nptl: Simplify condition variables to fix pthread_cond_signal (BZ 25847) |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent

Commit Message

Tang, Jun via Libc-alpha Oct. 15, 2022, 7:53 p.m. UTC
  From: Malte Skarupke <malteskarupke@fastmail.fm>

Bug BZ 25847 proved hard to fix mainly because the condition variables
implementation are hard to reason about. An attempt of a description of
the problem:

Problem 1:
There used to be a problem where pthread_cond_signal could wake up future
waiters. E.g. if we are certain that these happened in order:
1. Thread A calls pthread_cond_wait
2. Thread B calls pthread_cond_signal
3. Thread C calls pthread_cond_wait
then it used to be possible that thread C gets woken up and thread A remains
asleep even if we are certain that thread C went to sleep after thread B
signaled.

Solution 1:
As a fix there are now two groups, g1 and g2. Threads always go to sleep in
g2, and signals always signal to g1. When no threads remain in g1, the two
groups switch. This prevents the above problem: Thread C can't be in the same
group that thread B signaled in. (whichever group it signals in is not going
to be g2 afterwards, so thread C goes to sleep in a different group)

Problem 2:
The problem with this is that if a sleeper takes a while to wake up, we can
run into an ABA problem if the group has switched twice. If we were also
currently trying to wake up a sleeper in the newer version of g1, the old
sleeper can cut in and steal the signal. The thread that was supposed to wake
up then goes back to sleep.

Solution 2:
The code tried to detect this and fix the problem by adding an extra signal.

Problem 3:
That extra signal can lead to more problems in complex interleavings, which
causes the current bug, BZ 25847.

New solution:
All this context is here to explain that there are several different ways of
attacking this. We could pop the stack down to any layer of this explanation:
If we could make Problem 1 go away, we wouldn't need to worry
about Problem 2 or 3.

In this feature I choose to pop the stack down to Problem 2: Make the ABA
problem harmless. You may have noticed that I said "the thread that was
supposed to wake up goes back to sleep". This is rather silly. It received a
FUTEX_WAKE then notices that there are 0 available signals and decides to
call FUTEX_WAIT again. Why? Spurious wakes are allowed. If we just allow that
thread to leave pthread_cond_wait, the bug goes away. (in fact if there was
signal stealing, it wasn't even a spurious wake. Both threads were supposed
to wake up)

So in this patch I just allow the thread to stay awake. If a thread has been
woken with FUTEX_WAKE, it will leave pthread_cond_wait. This means if we
spuriously woke up from the futex sleep, the user will notice, but that's
allowed.

After that change our group counts will be wrong if we ran into the ABA
problem from Problem 2 above. So my solution is to not keep group counts.
But how do we know when to switch groups? FUTEX_WAKE can tell us if it woke
someone. If it didn't, that means we're on an empty group and can switch.

This means that on a group switch there will be two calls to FUTEX_WAKE: One
call that does nothing, and one call that actually wakes someone. We still
have the __wrefs count to allow us to early-out if nobody is sleeping in g2,
or in either group.

The result is a much simpler implementation of condition variables that still
solves Problem 1, above.

Benefits:
1. Much easier to reason about. All the pthread_cond_* functions have very
straightforward control flow now. Just try to reason through all the edge
cases in __condvar_cancel_waiting (now deleted). The old code just looked
like code that has more bugs hiding in it.
2. Tested for correctness in TLA+ (the usual caveats apply. Bugs may have
snuck in during translation back to C, this still needs review)
3. Less memory usage. For now I have reserved the same number of bytes
though, to allow future growth without breaking binary compatibility.

Downsides:
1. More spurious wakes. This code makes no attempt to put threads back to
sleep that were woken spuriously.
2. Smaller tolerance for ABA problems: If there were (1 << 30) calls to
pthread_cond_wait in the time that it takes one thread to go to sleep, that
thread will go to sleep even though it shouldn't have. I think this is fine
because it seems very unlikely to get a billion calls.
Note that calls to pthread_cond_signal are not enough. You need waiters to
cause the problem. Meaning if you don't have a billion threads, you need a
lot of calls to pthread_cond_wait and matching wakeups in the time that it
takes one thread to go to sleep.

Neutral:
1. Same speed. As a benchmark I ran pthread_cond_repro.c from BZ 25847 and
this new implementation seems to be 0.5% faster, which might just be noisy
measurements.

Overall I think the benefits outweigh the downsides. Now that the code is
easier to reason about, we could add some complexity back in to make the
downsides less bad. But I don't think they are a big concern.
---
 nptl/pthread_cond_broadcast.c           |  54 +--
 nptl/pthread_cond_common.c              | 259 +-------------
 nptl/pthread_cond_signal.c              |  32 +-
 nptl/pthread_cond_wait.c                | 450 +++---------------------
 nptl/tst-cond22.c                       |  30 +-
 sysdeps/nptl/bits/thread-shared-types.h |  15 +-
 sysdeps/nptl/futex-internal.h           |   8 +-
 sysdeps/nptl/lowlevellock-futex.h       |  19 +-
 sysdeps/nptl/pthread.h                  |   2 +-
 9 files changed, 138 insertions(+), 731 deletions(-)
  

Comments

Carlos O'Donell Oct. 24, 2022, 3:50 a.m. UTC | #1
On 10/15/22 15:53, malteskarupke--- via Libc-alpha wrote:
> From: Malte Skarupke <malteskarupke@fastmail.fm>
> 
> Bug BZ 25847 proved hard to fix mainly because the condition variables
> implementation are hard to reason about. An attempt of a description of
> the problem:
> 
> Problem 1:
> There used to be a problem where pthread_cond_signal could wake up future
> waiters. E.g. if we are certain that these happened in order:
> 1. Thread A calls pthread_cond_wait
> 2. Thread B calls pthread_cond_signal
> 3. Thread C calls pthread_cond_wait
> then it used to be possible that thread C gets woken up and thread A remains
> asleep even if we are certain that thread C went to sleep after thread B
> signaled.
> 
> Solution 1:
> As a fix there are now two groups, g1 and g2. Threads always go to sleep in
> g2, and signals always signal to g1. When no threads remain in g1, the two
> groups switch. This prevents the above problem: Thread C can't be in the same
> group that thread B signaled in. (whichever group it signals in is not going
> to be g2 afterwards, so thread C goes to sleep in a different group)

Agreed.

> Problem 2:
> The problem with this is that if a sleeper takes a while to wake up, we can
> run into an ABA problem if the group has switched twice. If we were also
> currently trying to wake up a sleeper in the newer version of g1, the old
> sleeper can cut in and steal the signal. The thread that was supposed to wake
> up then goes back to sleep.

The only ABA issue I'm aware of is the futex word value.

Which ABA issue is this in particular?

Could you expand on this a bit more?

> Solution 2:
> The code tried to detect this and fix the problem by adding an extra signal.

331    Waiters try to grab a signal from __g_signals without holding a reference
332    count, which can lead to stealing a signal from a more recent group after
333    their own group was already closed.  They cannot always detect whether they
334    in fact did because they do not know when they stole, but they can
335    conservatively add a signal back to the group they stole from; if they
336    did so unnecessarily, all that happens is a spurious wake-up.  To make this
337    even less likely, __g1_start contains the index of the current g2 too,
338    which allows waiters to check if there aliasing on the group slots; if
339    there wasn't, they didn't steal from the current G1, which means that the
340    G1 they stole from must have been already closed and they do not need to
341    fix anything.

Do you mean this ^^^ ?

> Problem 3:
> That extra signal can lead to more problems in complex interleavings, which
> causes the current bug, BZ 25847.

I would be helpful to have a more detailed explanation of the root cause.

> New solution:
> All this context is here to explain that there are several different ways of
> attacking this. We could pop the stack down to any layer of this explanation:
> If we could make Problem 1 go away, we wouldn't need to worry
> about Problem 2 or 3.

Your patch changes the algorithm significantly and so it represents a third (original,
two-group, and now your solution) algorithmic change that needs detailed review.

How does this compare against Frank Barrus' patch?
https://sourceware.org/bugzilla/attachment.cgi?id=13419

I have some reports from other developers in the community that this patch does solve
the issue.

> In this feature I choose to pop the stack down to Problem 2: Make the ABA
> problem harmless. You may have noticed that I said "the thread that was
> supposed to wake up goes back to sleep". This is rather silly. It received a
> FUTEX_WAKE then notices that there are 0 available signals and decides to
> call FUTEX_WAIT again. Why? Spurious wakes are allowed. If we just allow that
> thread to leave pthread_cond_wait, the bug goes away. (in fact if there was
> signal stealing, it wasn't even a spurious wake. Both threads were supposed
> to wake up)
> 
> So in this patch I just allow the thread to stay awake. If a thread has been
> woken with FUTEX_WAKE, it will leave pthread_cond_wait. This means if we
> spuriously woke up from the futex sleep, the user will notice, but that's
> allowed.
> 
> After that change our group counts will be wrong if we ran into the ABA
> problem from Problem 2 above. So my solution is to not keep group counts.
> But how do we know when to switch groups? FUTEX_WAKE can tell us if it woke
> someone. If it didn't, that means we're on an empty group and can switch.
> 
> This means that on a group switch there will be two calls to FUTEX_WAKE: One
> call that does nothing, and one call that actually wakes someone. We still
> have the __wrefs count to allow us to early-out if nobody is sleeping in g2,
> or in either group.
> 
> The result is a much simpler implementation of condition variables that still
> solves Problem 1, above.
> 
> Benefits:
> 1. Much easier to reason about. All the pthread_cond_* functions have very
> straightforward control flow now. Just try to reason through all the edge
> cases in __condvar_cancel_waiting (now deleted). The old code just looked
> like code that has more bugs hiding in it.
> 2. Tested for correctness in TLA+ (the usual caveats apply. Bugs may have
> snuck in during translation back to C, this still needs review)
> 3. Less memory usage. For now I have reserved the same number of bytes
> though, to allow future growth without breaking binary compatibility.
> 
> Downsides:
> 1. More spurious wakes. This code makes no attempt to put threads back to
> sleep that were woken spuriously.
> 2. Smaller tolerance for ABA problems: If there were (1 << 30) calls to
> pthread_cond_wait in the time that it takes one thread to go to sleep, that
> thread will go to sleep even though it shouldn't have. I think this is fine
> because it seems very unlikely to get a billion calls.
> Note that calls to pthread_cond_signal are not enough. You need waiters to
> cause the problem. Meaning if you don't have a billion threads, you need a
> lot of calls to pthread_cond_wait and matching wakeups in the time that it
> takes one thread to go to sleep.
> 
> Neutral:
> 1. Same speed. As a benchmark I ran pthread_cond_repro.c from BZ 25847 and
> this new implementation seems to be 0.5% faster, which might just be noisy
> measurements.
> 
> Overall I think the benefits outweigh the downsides. Now that the code is
> easier to reason about, we could add some complexity back in to make the
> downsides less bad. But I don't think they are a big concern.
> ---
>  nptl/pthread_cond_broadcast.c           |  54 +--
>  nptl/pthread_cond_common.c              | 259 +-------------
>  nptl/pthread_cond_signal.c              |  32 +-
>  nptl/pthread_cond_wait.c                | 450 +++---------------------
>  nptl/tst-cond22.c                       |  30 +-
>  sysdeps/nptl/bits/thread-shared-types.h |  15 +-
>  sysdeps/nptl/futex-internal.h           |   8 +-
>  sysdeps/nptl/lowlevellock-futex.h       |  19 +-
>  sysdeps/nptl/pthread.h                  |   2 +-
>  9 files changed, 138 insertions(+), 731 deletions(-)
> 
> diff --git a/nptl/pthread_cond_broadcast.c b/nptl/pthread_cond_broadcast.c
> index 5ae141ac81..b556c49af7 100644
> --- a/nptl/pthread_cond_broadcast.c
> +++ b/nptl/pthread_cond_broadcast.c
> @@ -29,16 +29,14 @@
>  #include "pthread_cond_common.c"
>  
>  
> -/* We do the following steps from __pthread_cond_signal in one critical
> -   section: (1) signal all waiters in G1, (2) close G1 so that it can become
> -   the new G2 and make G2 the new G1, and (3) signal all waiters in the new
> -   G1.  We don't need to do all these steps if there are no waiters in G1
> -   and/or G2.  See __pthread_cond_signal for further details.  */
> +/* See __pthread_cond_wait for a high-level description of the algorithm.  */
>  int
>  ___pthread_cond_broadcast (pthread_cond_t *cond)
>  {
>    LIBC_PROBE (cond_broadcast, 1, cond);
>  
> +  /* First check whether there are waiters.  See pthread_cond_signal.c for
> +     why relaxed MO is fine.  */
>    unsigned int wrefs = atomic_load_relaxed (&cond->__data.__wrefs);
>    if (wrefs >> 3 == 0)
>      return 0;
> @@ -46,43 +44,27 @@ ___pthread_cond_broadcast (pthread_cond_t *cond)
>  
>    __condvar_acquire_lock (cond, private);
>  
> -  unsigned long long int wseq = __condvar_load_wseq_relaxed (cond);
> +  /* Load the waiter sequence number, which represents our relative ordering
> +     to any waiters.  See pthread_cond_signal.c for why relaxed MO is fine.  */
> +  unsigned int wseq = atomic_load_relaxed (&cond->__data.__wseq);
>    unsigned int g2 = wseq & 1;
>    unsigned int g1 = g2 ^ 1;
> -  wseq >>= 1;
> -  bool do_futex_wake = false;
> +  wseq &= ~1;
> +  bool wake_g2 = wseq != cond->__data.__g2_start;
>  
> -  /* Step (1): signal all waiters remaining in G1.  */
> -  if (cond->__data.__g_size[g1] != 0)
> -    {
> -      /* Add as many signals as the remaining size of the group.  */
> -      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1,
> -				cond->__data.__g_size[g1] << 1);
> -      cond->__data.__g_size[g1] = 0;
> -
> -      /* We need to wake G1 waiters before we quiesce G1 below.  */
> -      /* TODO Only set it if there are indeed futex waiters.  We could
> -	 also try to move this out of the critical section in cases when
> -	 G2 is empty (and we don't need to quiesce).  */
> -      futex_wake (cond->__data.__g_signals + g1, INT_MAX, private);
> -    }
> -
> -  /* G1 is complete.  Step (2) is next unless there are no waiters in G2, in
> -     which case we can stop.  */
> -  if (__condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
> -    {
> -      /* Step (3): Send signals to all waiters in the old G2 / new G1.  */
> -      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1,
> -				cond->__data.__g_size[g1] << 1);
> -      cond->__data.__g_size[g1] = 0;
> -      /* TODO Only set it if there are indeed futex waiters.  */
> -      do_futex_wake = true;
> -    }
> +  /* Modify both g1 and g2 and reset g2_start to the same value. This puts us
> +     in a similar state as a newly initialized condvar: Everything has the
> +     same value.  */
> +  atomic_store_release (cond->__data.__gseq + g1, wseq);
> +  atomic_store_release (cond->__data.__gseq + g2, wseq);
> +  cond->__data.__g2_start = wseq;
>  
>    __condvar_release_lock (cond, private);
>  
> -  if (do_futex_wake)
> -    futex_wake (cond->__data.__g_signals + g1, INT_MAX, private);
> +  /* We don't know if there are waiters in g1, so just wake.  */
> +  futex_wake (cond->__data.__gseq + g1, INT_MAX, private);
> +  if (wake_g2)
> +    futex_wake (cond->__data.__gseq + g2, INT_MAX, private);
>  
>    return 0;
>  }
> diff --git a/nptl/pthread_cond_common.c b/nptl/pthread_cond_common.c
> index fb035f72c3..b9043a218b 100644
> --- a/nptl/pthread_cond_common.c
> +++ b/nptl/pthread_cond_common.c
> @@ -21,90 +21,9 @@
>  #include <stdint.h>
>  #include <pthread.h>
>  
> -/* We need 3 least-significant bits on __wrefs for something else.
> -   This also matches __atomic_wide_counter requirements: The highest
> -   value we add is __PTHREAD_COND_MAX_GROUP_SIZE << 2 to __g1_start
> -   (the two extra bits are for the lock in the two LSBs of
> -   __g1_start).  */
> +/* We use 3 least-significant bits on __wrefs for something else.  */
>  #define __PTHREAD_COND_MAX_GROUP_SIZE ((unsigned) 1 << 29)
>  
> -static inline uint64_t
> -__condvar_load_wseq_relaxed (pthread_cond_t *cond)
> -{
> -  return __atomic_wide_counter_load_relaxed (&cond->__data.__wseq);
> -}
> -
> -static inline uint64_t
> -__condvar_fetch_add_wseq_acquire (pthread_cond_t *cond, unsigned int val)
> -{
> -  return __atomic_wide_counter_fetch_add_acquire (&cond->__data.__wseq, val);
> -}
> -
> -static inline uint64_t
> -__condvar_load_g1_start_relaxed (pthread_cond_t *cond)
> -{
> -  return __atomic_wide_counter_load_relaxed (&cond->__data.__g1_start);
> -}
> -
> -static inline void
> -__condvar_add_g1_start_relaxed (pthread_cond_t *cond, unsigned int val)
> -{
> -  __atomic_wide_counter_add_relaxed (&cond->__data.__g1_start, val);
> -}
> -
> -#if __HAVE_64B_ATOMICS == 1
> -
> -static inline uint64_t
> -__condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
> -{
> -  return atomic_fetch_xor_release (&cond->__data.__wseq.__value64, val);
> -}
> -
> -#else /* !__HAVE_64B_ATOMICS */
> -
> -/* The xor operation needs to be an atomic read-modify-write.  The write
> -   itself is not an issue as it affects just the lower-order half but not bits
> -   used in the add operation.  To make the full fetch-and-xor atomic, we
> -   exploit that concurrently, the value can increase by at most 1<<31 (*): The
> -   xor operation is only called while having acquired the lock, so not more
> -   than __PTHREAD_COND_MAX_GROUP_SIZE waiters can enter concurrently and thus
> -   increment __wseq.  Therefore, if the xor operation observes a value of
> -   __wseq, then the value it applies the modification to later on can be
> -   derived.  */
> -
> -static uint64_t __attribute__ ((unused))
> -__condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
> -{
> -  /* First, get the current value.  See __atomic_wide_counter_load_relaxed.  */
> -  unsigned int h, l, h2;
> -  do
> -    {
> -      h = atomic_load_acquire (&cond->__data.__wseq.__value32.__high);
> -      l = atomic_load_acquire (&cond->__data.__wseq.__value32.__low);
> -      h2 = atomic_load_relaxed (&cond->__data.__wseq.__value32.__high);
> -    }
> -  while (h != h2);
> -  if (((l >> 31) > 0) && ((h >> 31) == 0))
> -    h++;
> -  h &= ~((unsigned int) 1 << 31);
> -  l &= ~((unsigned int) 1 << 31);
> -
> -  /* Now modify.  Due to the coherence rules, the prior load will read a value
> -     earlier in modification order than the following fetch-xor.
> -     This uses release MO to make the full operation have release semantics
> -     (all other operations access the lower-order half).  */
> -  unsigned int l2
> -    = (atomic_fetch_xor_release (&cond->__data.__wseq.__value32.__low, val)
> -       & ~((unsigned int) 1 << 31));
> -  if (l2 < l)
> -    /* The lower-order half overflowed in the meantime.  This happened exactly
> -       once due to the limit on concurrent waiters (see above).  */
> -    h++;
> -  return ((uint64_t) h << 31) + l2;
> -}
> -
> -#endif /* !__HAVE_64B_ATOMICS */
> -
>  /* The lock that signalers use.  See pthread_cond_wait_common for uses.
>     The lock is our normal three-state lock: not acquired (0) / acquired (1) /
>     acquired-with-futex_wake-request (2).  However, we need to preserve the
> @@ -113,10 +32,10 @@ __condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
>  static void __attribute__ ((unused))
>  __condvar_acquire_lock (pthread_cond_t *cond, int private)
>  {
> -  unsigned int s = atomic_load_relaxed (&cond->__data.__g1_orig_size);
> +  unsigned int s = atomic_load_relaxed (&cond->__data.__lock);
>    while ((s & 3) == 0)
>      {
> -      if (atomic_compare_exchange_weak_acquire (&cond->__data.__g1_orig_size,
> +      if (atomic_compare_exchange_weak_acquire (&cond->__data.__lock,
>  	  &s, s | 1))
>  	return;
>        /* TODO Spinning and back-off.  */
> @@ -129,7 +48,7 @@ __condvar_acquire_lock (pthread_cond_t *cond, int private)
>        while ((s & 3) != 2)
>  	{
>  	  if (atomic_compare_exchange_weak_acquire
> -	      (&cond->__data.__g1_orig_size, &s, (s & ~(unsigned int) 3) | 2))
> +	      (&cond->__data.__lock, &s, (s & ~(unsigned int) 3) | 2))
>  	    {
>  	      if ((s & 3) == 0)
>  		return;
> @@ -137,10 +56,10 @@ __condvar_acquire_lock (pthread_cond_t *cond, int private)
>  	    }
>  	  /* TODO Back off.  */
>  	}
> -      futex_wait_simple (&cond->__data.__g1_orig_size,
> +      futex_wait_simple (&cond->__data.__lock,
>  	  (s & ~(unsigned int) 3) | 2, private);
>        /* Reload so we see a recent value.  */
> -      s = atomic_load_relaxed (&cond->__data.__g1_orig_size);
> +      s = atomic_load_relaxed (&cond->__data.__lock);
>      }
>  }
>  
> @@ -148,34 +67,10 @@ __condvar_acquire_lock (pthread_cond_t *cond, int private)
>  static void __attribute__ ((unused))
>  __condvar_release_lock (pthread_cond_t *cond, int private)
>  {
> -  if ((atomic_fetch_and_release (&cond->__data.__g1_orig_size,
> +  if ((atomic_fetch_and_release (&cond->__data.__lock,
>  				 ~(unsigned int) 3) & 3)
>        == 2)
> -    futex_wake (&cond->__data.__g1_orig_size, 1, private);
> -}
> -
> -/* Only use this when having acquired the lock.  */
> -static unsigned int __attribute__ ((unused))
> -__condvar_get_orig_size (pthread_cond_t *cond)
> -{
> -  return atomic_load_relaxed (&cond->__data.__g1_orig_size) >> 2;
> -}
> -
> -/* Only use this when having acquired the lock.  */
> -static void __attribute__ ((unused))
> -__condvar_set_orig_size (pthread_cond_t *cond, unsigned int size)
> -{
> -  /* We have acquired the lock, but might get one concurrent update due to a
> -     lock state change from acquired to acquired-with-futex_wake-request.
> -     The store with relaxed MO is fine because there will be no further
> -     changes to the lock bits nor the size, and we will subsequently release
> -     the lock with release MO.  */
> -  unsigned int s;
> -  s = (atomic_load_relaxed (&cond->__data.__g1_orig_size) & 3)
> -      | (size << 2);
> -  if ((atomic_exchange_relaxed (&cond->__data.__g1_orig_size, s) & 3)
> -      != (s & 3))
> -    atomic_store_relaxed (&cond->__data.__g1_orig_size, (size << 2) | 2);
> +    futex_wake (&cond->__data.__lock, 1, private);
>  }
>  
>  /* Returns FUTEX_SHARED or FUTEX_PRIVATE based on the provided __wrefs
> @@ -189,141 +84,3 @@ __condvar_get_private (int flags)
>      return FUTEX_SHARED;
>  }
>  
> -/* This closes G1 (whose index is in G1INDEX), waits for all futex waiters to
> -   leave G1, converts G1 into a fresh G2, and then switches group roles so that
> -   the former G2 becomes the new G1 ending at the current __wseq value when we
> -   eventually make the switch (WSEQ is just an observation of __wseq by the
> -   signaler).
> -   If G2 is empty, it will not switch groups because then it would create an
> -   empty G1 which would require switching groups again on the next signal.
> -   Returns false iff groups were not switched because G2 was empty.  */
> -static bool __attribute__ ((unused))
> -__condvar_quiesce_and_switch_g1 (pthread_cond_t *cond, uint64_t wseq,
> -    unsigned int *g1index, int private)
> -{
> -  const unsigned int maxspin = 0;
> -  unsigned int g1 = *g1index;
> -
> -  /* If there is no waiter in G2, we don't do anything.  The expression may
> -     look odd but remember that __g_size might hold a negative value, so
> -     putting the expression this way avoids relying on implementation-defined
> -     behavior.
> -     Note that this works correctly for a zero-initialized condvar too.  */
> -  unsigned int old_orig_size = __condvar_get_orig_size (cond);
> -  uint64_t old_g1_start = __condvar_load_g1_start_relaxed (cond) >> 1;
> -  if (((unsigned) (wseq - old_g1_start - old_orig_size)
> -	  + cond->__data.__g_size[g1 ^ 1]) == 0)
> -	return false;
> -
> -  /* Now try to close and quiesce G1.  We have to consider the following kinds
> -     of waiters:
> -     * Waiters from less recent groups than G1 are not affected because
> -       nothing will change for them apart from __g1_start getting larger.
> -     * New waiters arriving concurrently with the group switching will all go
> -       into G2 until we atomically make the switch.  Waiters existing in G2
> -       are not affected.
> -     * Waiters in G1 will be closed out immediately by setting a flag in
> -       __g_signals, which will prevent waiters from blocking using a futex on
> -       __g_signals and also notifies them that the group is closed.  As a
> -       result, they will eventually remove their group reference, allowing us
> -       to close switch group roles.  */
> -
> -  /* First, set the closed flag on __g_signals.  This tells waiters that are
> -     about to wait that they shouldn't do that anymore.  This basically
> -     serves as an advance notificaton of the upcoming change to __g1_start;
> -     waiters interpret it as if __g1_start was larger than their waiter
> -     sequence position.  This allows us to change __g1_start after waiting
> -     for all existing waiters with group references to leave, which in turn
> -     makes recovery after stealing a signal simpler because it then can be
> -     skipped if __g1_start indicates that the group is closed (otherwise,
> -     we would have to recover always because waiters don't know how big their
> -     groups are).  Relaxed MO is fine.  */
> -  atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);
> -
> -  /* Wait until there are no group references anymore.  The fetch-or operation
> -     injects us into the modification order of __g_refs; release MO ensures
> -     that waiters incrementing __g_refs after our fetch-or see the previous
> -     changes to __g_signals and to __g1_start that had to happen before we can
> -     switch this G1 and alias with an older group (we have two groups, so
> -     aliasing requires switching group roles twice).  Note that nobody else
> -     can have set the wake-request flag, so we do not have to act upon it.
> -
> -     Also note that it is harmless if older waiters or waiters from this G1
> -     get a group reference after we have quiesced the group because it will
> -     remain closed for them either because of the closed flag in __g_signals
> -     or the later update to __g1_start.  New waiters will never arrive here
> -     but instead continue to go into the still current G2.  */
> -  unsigned r = atomic_fetch_or_release (cond->__data.__g_refs + g1, 0);
> -  while ((r >> 1) > 0)
> -    {
> -      for (unsigned int spin = maxspin; ((r >> 1) > 0) && (spin > 0); spin--)
> -	{
> -	  /* TODO Back off.  */
> -	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
> -	}
> -      if ((r >> 1) > 0)
> -	{
> -	  /* There is still a waiter after spinning.  Set the wake-request
> -	     flag and block.  Relaxed MO is fine because this is just about
> -	     this futex word.
> -
> -	     Update r to include the set wake-request flag so that the upcoming
> -	     futex_wait only blocks if the flag is still set (otherwise, we'd
> -	     violate the basic client-side futex protocol).  */
> -	  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;
> -
> -	  if ((r >> 1) > 0)
> -	    futex_wait_simple (cond->__data.__g_refs + g1, r, private);
> -	  /* Reload here so we eventually see the most recent value even if we
> -	     do not spin.   */
> -	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
> -	}
> -    }
> -  /* Acquire MO so that we synchronize with the release operation that waiters
> -     use to decrement __g_refs and thus happen after the waiters we waited
> -     for.  */
> -  atomic_thread_fence_acquire ();
> -
> -  /* Update __g1_start, which finishes closing this group.  The value we add
> -     will never be negative because old_orig_size can only be zero when we
> -     switch groups the first time after a condvar was initialized, in which
> -     case G1 will be at index 1 and we will add a value of 1.  See above for
> -     why this takes place after waiting for quiescence of the group.
> -     Relaxed MO is fine because the change comes with no additional
> -     constraints that others would have to observe.  */
> -  __condvar_add_g1_start_relaxed (cond,
> -      (old_orig_size << 1) + (g1 == 1 ? 1 : - 1));
> -
> -  /* Now reopen the group, thus enabling waiters to again block using the
> -     futex controlled by __g_signals.  Release MO so that observers that see
> -     no signals (and thus can block) also see the write __g1_start and thus
> -     that this is now a new group (see __pthread_cond_wait_common for the
> -     matching acquire MO loads).  */
> -  atomic_store_release (cond->__data.__g_signals + g1, 0);
> -
> -  /* At this point, the old G1 is now a valid new G2 (but not in use yet).
> -     No old waiter can neither grab a signal nor acquire a reference without
> -     noticing that __g1_start is larger.
> -     We can now publish the group switch by flipping the G2 index in __wseq.
> -     Release MO so that this synchronizes with the acquire MO operation
> -     waiters use to obtain a position in the waiter sequence.  */
> -  wseq = __condvar_fetch_xor_wseq_release (cond, 1) >> 1;
> -  g1 ^= 1;
> -  *g1index ^= 1;
> -
> -  /* These values are just observed by signalers, and thus protected by the
> -     lock.  */
> -  unsigned int orig_size = wseq - (old_g1_start + old_orig_size);
> -  __condvar_set_orig_size (cond, orig_size);
> -  /* Use and addition to not loose track of cancellations in what was
> -     previously G2.  */
> -  cond->__data.__g_size[g1] += orig_size;
> -
> -  /* The new G1's size may be zero because of cancellations during its time
> -     as G2.  If this happens, there are no waiters that have to receive a
> -     signal, so we do not need to add any and return false.  */
> -  if (cond->__data.__g_size[g1] == 0)
> -    return false;
> -
> -  return true;
> -}
> diff --git a/nptl/pthread_cond_signal.c b/nptl/pthread_cond_signal.c
> index 14800ba00b..e6968164cd 100644
> --- a/nptl/pthread_cond_signal.c
> +++ b/nptl/pthread_cond_signal.c
> @@ -63,34 +63,30 @@ ___pthread_cond_signal (pthread_cond_t *cond)
>          that has been woken but hasn't resumed execution yet), and thus it
>          cannot try to deduce that a signal happened before a particular
>          waiter.  */
> -  unsigned long long int wseq = __condvar_load_wseq_relaxed (cond);
> +  unsigned int wseq = atomic_load_relaxed (&cond->__data.__wseq);
>    unsigned int g1 = (wseq & 1) ^ 1;
> -  wseq >>= 1;
> +  wseq &= ~1;
>    bool do_futex_wake = false;
>  
> -  /* If G1 is still receiving signals, we put the signal there.  If not, we
> -     check if G2 has waiters, and if so, quiesce and switch G1 to the former
> -     G2; if this results in a new G1 with waiters (G2 might have cancellations
> -     already, see __condvar_quiesce_and_switch_g1), we put the signal in the
> -     new G1.  */
> -  if ((cond->__data.__g_size[g1] != 0)
> -      || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
> +  atomic_store_release (cond->__data.__gseq + g1, wseq);
> +  bool woken = futex_wake (cond->__data.__gseq + g1, 1, private) > 0;
> +  if (!woken && wseq != cond->__data.__g2_start)
>      {
> -      /* Add a signal.  Relaxed MO is fine because signaling does not need to
> -	 establish a happens-before relation (see above).  We do not mask the
> -	 release-MO store when initializing a group in
> -	 __condvar_quiesce_and_switch_g1 because we use an atomic
> -	 read-modify-write and thus extend that store's release sequence.  */
> -      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
> -      cond->__data.__g_size[g1]--;
> -      /* TODO Only set it if there are indeed futex waiters.  */
> +      /* Nobody was woken by futex_wake but there are waiters in g2. Switch
> +	 groups and try again.  */
> +      wseq = atomic_fetch_xor_release (&cond->__data.__wseq, 1) & ~1;
> +
> +      g1 ^= 1;
> +      cond->__data.__g2_start = wseq;
> +
> +      atomic_store_release (cond->__data.__gseq + g1, wseq);
>        do_futex_wake = true;
>      }
>  
>    __condvar_release_lock (cond, private);
>  
>    if (do_futex_wake)
> -    futex_wake (cond->__data.__g_signals + g1, 1, private);
> +    futex_wake (cond->__data.__gseq + g1, 1, private);
>  
>    return 0;
>  }
> diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
> index 20c348a503..c146784e96 100644
> --- a/nptl/pthread_cond_wait.c
> +++ b/nptl/pthread_cond_wait.c
> @@ -32,16 +32,13 @@
>  
>  #include "pthread_cond_common.c"
>  
> -
>  struct _condvar_cleanup_buffer
>  {
> -  uint64_t wseq;
>    pthread_cond_t *cond;
>    pthread_mutex_t *mutex;
>    int private;
>  };
>  
> -
>  /* Decrease the waiter reference count.  */
>  static void
>  __condvar_confirm_wakeup (pthread_cond_t *cond, int private)
> @@ -54,112 +51,6 @@ __condvar_confirm_wakeup (pthread_cond_t *cond, int private)
>      futex_wake (&cond->__data.__wrefs, INT_MAX, private);
>  }
>  
> -
> -/* Cancel waiting after having registered as a waiter previously.  SEQ is our
> -   position and G is our group index.
> -   The goal of cancellation is to make our group smaller if that is still
> -   possible.  If we are in a closed group, this is not possible anymore; in
> -   this case, we need to send a replacement signal for the one we effectively
> -   consumed because the signal should have gotten consumed by another waiter
> -   instead; we must not both cancel waiting and consume a signal.
> -
> -   Must not be called while still holding a reference on the group.
> -
> -   Returns true iff we consumed a signal.
> -
> -   On some kind of timeouts, we may be able to pretend that a signal we
> -   effectively consumed happened before the timeout (i.e., similarly to first
> -   spinning on signals before actually checking whether the timeout has
> -   passed already).  Doing this would allow us to skip sending a replacement
> -   signal, but this case might happen rarely because the end of the timeout
> -   must race with someone else sending a signal.  Therefore, we don't bother
> -   trying to optimize this.  */
> -static void
> -__condvar_cancel_waiting (pthread_cond_t *cond, uint64_t seq, unsigned int g,
> -			  int private)
> -{
> -  bool consumed_signal = false;
> -
> -  /* No deadlock with group switching is possible here because we do
> -     not hold a reference on the group.  */
> -  __condvar_acquire_lock (cond, private);
> -
> -  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond) >> 1;
> -  if (g1_start > seq)
> -    {
> -      /* Our group is closed, so someone provided enough signals for it.
> -	 Thus, we effectively consumed a signal.  */
> -      consumed_signal = true;
> -    }
> -  else
> -    {
> -      if (g1_start + __condvar_get_orig_size (cond) <= seq)
> -	{
> -	  /* We are in the current G2 and thus cannot have consumed a signal.
> -	     Reduce its effective size or handle overflow.  Remember that in
> -	     G2, unsigned int size is zero or a negative value.  */
> -	  if (cond->__data.__g_size[g] + __PTHREAD_COND_MAX_GROUP_SIZE > 0)
> -	    {
> -	      cond->__data.__g_size[g]--;
> -	    }
> -	  else
> -	    {
> -	      /* Cancellations would overflow the maximum group size.  Just
> -		 wake up everyone spuriously to create a clean state.  This
> -		 also means we do not consume a signal someone else sent.  */
> -	      __condvar_release_lock (cond, private);
> -	      __pthread_cond_broadcast (cond);
> -	      return;
> -	    }
> -	}
> -      else
> -	{
> -	  /* We are in current G1.  If the group's size is zero, someone put
> -	     a signal in the group that nobody else but us can consume.  */
> -	  if (cond->__data.__g_size[g] == 0)
> -	    consumed_signal = true;
> -	  else
> -	    {
> -	      /* Otherwise, we decrease the size of the group.  This is
> -		 equivalent to atomically putting in a signal just for us and
> -		 consuming it right away.  We do not consume a signal sent
> -		 by someone else.  We also cannot have consumed a futex
> -		 wake-up because if we were cancelled or timed out in a futex
> -		 call, the futex will wake another waiter.  */
> -	      cond->__data.__g_size[g]--;
> -	    }
> -	}
> -    }
> -
> -  __condvar_release_lock (cond, private);
> -
> -  if (consumed_signal)
> -    {
> -      /* We effectively consumed a signal even though we didn't want to.
> -	 Therefore, we need to send a replacement signal.
> -	 If we would want to optimize this, we could do what
> -	 pthread_cond_signal does right in the critical section above.  */
> -      __pthread_cond_signal (cond);
> -    }
> -}
> -
> -/* Wake up any signalers that might be waiting.  */
> -static void
> -__condvar_dec_grefs (pthread_cond_t *cond, unsigned int g, int private)
> -{
> -  /* Release MO to synchronize-with the acquire load in
> -     __condvar_quiesce_and_switch_g1.  */
> -  if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
> -    {
> -      /* Clear the wake-up request flag before waking up.  We do not need more
> -	 than relaxed MO and it doesn't matter if we apply this for an aliased
> -	 group because we wake all futex waiters right after clearing the
> -	 flag.  */
> -      atomic_fetch_and_relaxed (cond->__data.__g_refs + g, ~(unsigned int) 1);
> -      futex_wake (cond->__data.__g_refs + g, INT_MAX, private);
> -    }
> -}
> -
>  /* Clean-up for cancellation of waiters waiting for normal signals.  We cancel
>     our registration as a waiter, confirm we have woken up, and re-acquire the
>     mutex.  */
> @@ -168,21 +59,8 @@ __condvar_cleanup_waiting (void *arg)
>  {
>    struct _condvar_cleanup_buffer *cbuffer =
>      (struct _condvar_cleanup_buffer *) arg;
> -  pthread_cond_t *cond = cbuffer->cond;
> -  unsigned g = cbuffer->wseq & 1;
> -
> -  __condvar_dec_grefs (cond, g, cbuffer->private);
>  
> -  __condvar_cancel_waiting (cond, cbuffer->wseq >> 1, g, cbuffer->private);
> -  /* FIXME With the current cancellation implementation, it is possible that
> -     a thread is cancelled after it has returned from a syscall.  This could
> -     result in a cancelled waiter consuming a futex wake-up that is then
> -     causing another waiter in the same group to not wake up.  To work around
> -     this issue until we have fixed cancellation, just add a futex wake-up
> -     conservatively.  */
> -  futex_wake (cond->__data.__g_signals + g, 1, cbuffer->private);
> -
> -  __condvar_confirm_wakeup (cond, cbuffer->private);
> +  __condvar_confirm_wakeup (cbuffer->cond, cbuffer->private);
>  
>    /* XXX If locking the mutex fails, should we just stop execution?  This
>       might be better than silently ignoring the error.  */
> @@ -197,7 +75,7 @@ __condvar_cleanup_waiting (void *arg)
>     not necessarily result in additional happens-before relations being
>     established (which aligns well with spurious wake-ups being allowed).
>  
> -   All waiters acquire a certain position in a 64b waiter sequence (__wseq).
> +   All waiters acquire a certain position in a 32b waiter sequence (__wseq).
>     This sequence determines which waiters are allowed to consume signals.
>     A broadcast is equal to sending as many signals as are unblocked waiters.
>     When a signal arrives, it samples the current value of __wseq with a
> @@ -210,18 +88,8 @@ __condvar_cleanup_waiting (void *arg)
>     This would be straight-forward to implement if waiters would just spin but
>     we need to let them block using futexes.  Futexes give no guarantee of
>     waking in FIFO order, so we cannot reliably wake eligible waiters if we
> -   just use a single futex.  Also, futex words are 32b in size, but we need
> -   to distinguish more than 1<<32 states because we need to represent the
> -   order of wake-up (and thus which waiters are eligible to consume signals);
> -   blocking in a futex is not atomic with a waiter determining its position in
> -   the waiter sequence, so we need the futex word to reliably notify waiters
> -   that they should not attempt to block anymore because they have been
> -   already signaled in the meantime.  While an ABA issue on a 32b value will
> -   be rare, ignoring it when we are aware of it is not the right thing to do
> -   either.
> -
> -   Therefore, we use a 64b counter to represent the waiter sequence (on
> -   architectures which only support 32b atomics, we use a few bits less).
> +   just use a single futex.
> +
>     To deal with the blocking using futexes, we maintain two groups of waiters:
>     * Group G1 consists of waiters that are all eligible to consume signals;
>       incoming signals will always signal waiters in this group until all
> @@ -233,36 +101,30 @@ __condvar_cleanup_waiting (void *arg)
>  
>     We cannot allocate new memory because of process-shared condvars, so we
>     have just two slots of groups that change their role between G1 and G2.
> -   Each has a separate futex word, a number of signals available for
> -   consumption, a size (number of waiters in the group that have not been
> -   signaled), and a reference count.
> -
> -   The group reference count is used to maintain the number of waiters that
> -   are using the group's futex.  Before a group can change its role, the
> -   reference count must show that no waiters are using the futex anymore; this
> -   prevents ABA issues on the futex word.
>  
>     To represent which intervals in the waiter sequence the groups cover (and
> -   thus also which group slot contains G1 or G2), we use a 64b counter to
> -   designate the start position of G1 (inclusive), and a single bit in the
> +   thus also which group slot contains G1 or G2), we use a 32b counter to
> +   designate the start position of G2 (inclusive), and a single bit in the
>     waiter sequence counter to represent which group slot currently contains
>     G2.  This allows us to switch group roles atomically wrt. waiters obtaining
> -   a position in the waiter sequence.  The G1 start position allows waiters to
> -   figure out whether they are in a group that has already been completely
> -   signaled (i.e., if the current G1 starts at a later position that the
> -   waiter's position).  Waiters cannot determine whether they are currently
> -   in G2 or G1 -- but they do not have too because all they are interested in
> -   is whether there are available signals, and they always start in G2 (whose
> -   group slot they know because of the bit in the waiter sequence.  Signalers
> -   will simply fill the right group until it is completely signaled and can
> -   be closed (they do not switch group roles until they really have to to
> -   decrease the likelihood of having to wait for waiters still holding a
> -   reference on the now-closed G1).
> -
> -   Signalers maintain the initial size of G1 to be able to determine where
> -   G2 starts (G2 is always open-ended until it becomes G1).  They track the
> -   remaining size of a group; when waiters cancel waiting (due to PThreads
> -   cancellation or timeouts), they will decrease this remaining size as well.
> +   a position in the waiter sequence.  Waiters cannot determine whether they
> +   are currently in G2 or G1 -- but they do not have to, and they always start
> +   in G2 (whose group slot they know because of the bit in the waiter sequence.
> +   Signalers will simply fill the right group until it is completely signaled
> +   and can be closed.
> +
> +   Signallers write the current value of __wseq to __gseq. This indicates that
> +   any waiter before that is allowed to wake up, if they are in the right
> +   group.  G1 will be closed when a futex_wake did not wake any waiter.
> +
> +   Since we only use 31 bits of a 32b counter, we have to be aware of overflow
> +   and ABA problems. Overflow is handled by doing an unsigned subtraction
> +   before comparison. This will return the wrong result if there were
> +   (1 << 30) calls to pthread_cond_wait in the time after a waiter increments
> +   __wseq and before the comparison to __gseq. The biggest work we do in that
> +   scope is unlocking the mutex. Since we don't do much more, it seems
> +   unlikely that a billion waits could be scheduled before that waiter gets
> +   scheduled again. So we ignore this possibility.
>  
>     To implement condvar destruction requirements (i.e., that
>     pthread_cond_destroy can be called as soon as all waiters have been
> @@ -279,13 +141,9 @@ __condvar_cleanup_waiting (void *arg)
>       * LSB is index of current G2.
>       * Waiters fetch-add while having acquire the mutex associated with the
>         condvar.  Signalers load it and fetch-xor it concurrently.
> -   __g1_start: Starting position of G1 (inclusive)
> -     * LSB is index of current G2.
> -     * Modified by signalers while having acquired the condvar-internal lock
> -       and observed concurrently by waiters.
> -   __g1_orig_size: Initial size of G1
> -     * The two least-significant bits represent the condvar-internal lock.
> -     * Only accessed while having acquired the condvar-internal lock.
> +   __g2_start: Starting position of G2 (inclusive)
> +     * Used by signallers to determine whether to switch groups.
> +   __lock: The condvar-internal lock, used by signal and broadcast.
>     __wrefs: Waiter reference counter.
>       * Bit 2 is true if waiters should run futex_wake when they remove the
>         last reference.  pthread_cond_destroy uses this as futex word.
> @@ -295,64 +153,15 @@ __condvar_cleanup_waiting (void *arg)
>       (If the format of __wrefs is changed, update nptl_lock_constants.pysym
>        and the pretty printers.)
>     For each of the two groups, we have:
> -   __g_refs: Futex waiter reference count.
> -     * LSB is true if waiters should run futex_wake when they remove the
> -       last reference.
> -     * Reference count used by waiters concurrently with signalers that have
> -       acquired the condvar-internal lock.
> -   __g_signals: The number of signals that can still be consumed.
> +   __gseq: The value of __wseq at the time of the last signal to this group
>       * Used as a futex word by waiters.  Used concurrently by waiters and
>         signalers.
> -     * LSB is true iff this group has been completely signaled (i.e., it is
> -       closed).
> -   __g_size: Waiters remaining in this group (i.e., which have not been
> -     signaled yet.
> -     * Accessed by signalers and waiters that cancel waiting (both do so only
> -       when having acquired the condvar-internal lock.
> -     * The size of G2 is always zero because it cannot be determined until
> -       the group becomes G1.
> -     * Although this is of unsigned type, we rely on using unsigned overflow
> -       rules to make this hold effectively negative values too (in
> -       particular, when waiters in G2 cancel waiting).
>  
>     A PTHREAD_COND_INITIALIZER condvar has all fields set to zero, which yields
>     a condvar that has G2 starting at position 0 and a G1 that is closed.
>  
> -   Because waiters do not claim ownership of a group right when obtaining a
> -   position in __wseq but only reference count the group when using futexes
> -   to block, it can happen that a group gets closed before a waiter can
> -   increment the reference count.  Therefore, waiters have to check whether
> -   their group is already closed using __g1_start.  They also have to perform
> -   this check when spinning when trying to grab a signal from __g_signals.
> -   Note that for these checks, using relaxed MO to load __g1_start is
> -   sufficient because if a waiter can see a sufficiently large value, it could
> -   have also consume a signal in the waiters group.
> -
> -   Waiters try to grab a signal from __g_signals without holding a reference
> -   count, which can lead to stealing a signal from a more recent group after
> -   their own group was already closed.  They cannot always detect whether they
> -   in fact did because they do not know when they stole, but they can
> -   conservatively add a signal back to the group they stole from; if they
> -   did so unnecessarily, all that happens is a spurious wake-up.  To make this
> -   even less likely, __g1_start contains the index of the current g2 too,
> -   which allows waiters to check if there aliasing on the group slots; if
> -   there wasn't, they didn't steal from the current G1, which means that the
> -   G1 they stole from must have been already closed and they do not need to
> -   fix anything.
> -
> -   It is essential that the last field in pthread_cond_t is __g_signals[1]:
> -   The previous condvar used a pointer-sized field in pthread_cond_t, so a
> -   PTHREAD_COND_INITIALIZER from that condvar implementation might only
> -   initialize 4 bytes to zero instead of the 8 bytes we need (i.e., 44 bytes
> -   in total instead of the 48 we need).  __g_signals[1] is not accessed before
> -   the first group switch (G2 starts at index 0), which will set its value to
> -   zero after a harmless fetch-or whose return value is ignored.  This
> -   effectively completes initialization.
> -
>  
>     Limitations:
> -   * This condvar isn't designed to allow for more than
> -     __PTHREAD_COND_MAX_GROUP_SIZE * (1 << 31) calls to __pthread_cond_wait.
>     * More than __PTHREAD_COND_MAX_GROUP_SIZE concurrent waiters are not
>       supported.
>     * Beyond what is allowed as errors by POSIX or documented, we can also
> @@ -379,7 +188,6 @@ static __always_inline int
>  __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>      clockid_t clockid, const struct __timespec64 *abstime)
>  {
> -  const int maxspin = 0;
>    int err;
>    int result = 0;
>  
> @@ -396,13 +204,11 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>       because we do not need to establish any happens-before relation with
>       signalers (see __pthread_cond_signal); modification order alone
>       establishes a total order of waiters/signals.  We do need acquire MO
> -     to synchronize with group reinitialization in
> -     __condvar_quiesce_and_switch_g1.  */
> -  uint64_t wseq = __condvar_fetch_add_wseq_acquire (cond, 2);
> +     to synchronize with group switch in ___pthread_cond_signal.  */
> +  unsigned int wseq = atomic_fetch_add_acquire (&cond->__data.__wseq, 2);
>    /* Find our group's index.  We always go into what was G2 when we acquired
>       our position.  */
>    unsigned int g = wseq & 1;
> -  uint64_t seq = wseq >> 1;
>  
>    /* Increase the waiter reference count.  Relaxed MO is sufficient because
>       we only need to synchronize when decrementing the reference count.  */
> @@ -419,184 +225,40 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>    err = __pthread_mutex_unlock_usercnt (mutex, 0);
>    if (__glibc_unlikely (err != 0))
>      {
> -      __condvar_cancel_waiting (cond, seq, g, private);
>        __condvar_confirm_wakeup (cond, private);
>        return err;
>      }
>  
> -  /* Now wait until a signal is available in our group or it is closed.
> -     Acquire MO so that if we observe a value of zero written after group
> -     switching in __condvar_quiesce_and_switch_g1, we synchronize with that
> -     store and will see the prior update of __g1_start done while switching
> -     groups too.  */
> -  unsigned int signals = atomic_load_acquire (cond->__data.__g_signals + g);
> +  /* Load signals.  Acquire MO to synchronize with the release in
> +     pthread_cond_signal or pthread_cond_broadcast.  */
> +  unsigned int gseq = atomic_load_acquire (cond->__data.__gseq + g);
>  
> -  do
> +  /* Compare (gseq <= wseq) while being robust to overflow. This would be
> +     false if someone signaled after we incremented wseq, above.  */
> +  if ((int)(gseq - wseq) <= 0)
>      {
> -      while (1)
> -	{
> -	  /* Spin-wait first.
> -	     Note that spinning first without checking whether a timeout
> -	     passed might lead to what looks like a spurious wake-up even
> -	     though we should return ETIMEDOUT (e.g., if the caller provides
> -	     an absolute timeout that is clearly in the past).  However,
> -	     (1) spurious wake-ups are allowed, (2) it seems unlikely that a
> -	     user will (ab)use pthread_cond_wait as a check for whether a
> -	     point in time is in the past, and (3) spinning first without
> -	     having to compare against the current time seems to be the right
> -	     choice from a performance perspective for most use cases.  */
> -	  unsigned int spin = maxspin;
> -	  while (signals == 0 && spin > 0)
> -	    {
> -	      /* Check that we are not spinning on a group that's already
> -		 closed.  */
> -	      if (seq < (__condvar_load_g1_start_relaxed (cond) >> 1))
> -		goto done;
> -
> -	      /* TODO Back off.  */
> -
> -	      /* Reload signals.  See above for MO.  */
> -	      signals = atomic_load_acquire (cond->__data.__g_signals + g);
> -	      spin--;
> -	    }
> -
> -	  /* If our group will be closed as indicated by the flag on signals,
> -	     don't bother grabbing a signal.  */
> -	  if (signals & 1)
> -	    goto done;
> -
> -	  /* If there is an available signal, don't block.  */
> -	  if (signals != 0)
> -	    break;
> -
> -	  /* No signals available after spinning, so prepare to block.
> -	     We first acquire a group reference and use acquire MO for that so
> -	     that we synchronize with the dummy read-modify-write in
> -	     __condvar_quiesce_and_switch_g1 if we read from that.  In turn,
> -	     in this case this will make us see the closed flag on __g_signals
> -	     that designates a concurrent attempt to reuse the group's slot.
> -	     We use acquire MO for the __g_signals check to make the
> -	     __g1_start check work (see spinning above).
> -	     Note that the group reference acquisition will not mask the
> -	     release MO when decrementing the reference count because we use
> -	     an atomic read-modify-write operation and thus extend the release
> -	     sequence.  */
> -	  atomic_fetch_add_acquire (cond->__data.__g_refs + g, 2);
> -	  if (((atomic_load_acquire (cond->__data.__g_signals + g) & 1) != 0)
> -	      || (seq < (__condvar_load_g1_start_relaxed (cond) >> 1)))
> -	    {
> -	      /* Our group is closed.  Wake up any signalers that might be
> -		 waiting.  */
> -	      __condvar_dec_grefs (cond, g, private);
> -	      goto done;
> -	    }
> -
> -	  // Now block.
> -	  struct _pthread_cleanup_buffer buffer;
> -	  struct _condvar_cleanup_buffer cbuffer;
> -	  cbuffer.wseq = wseq;
> -	  cbuffer.cond = cond;
> -	  cbuffer.mutex = mutex;
> -	  cbuffer.private = private;
> -	  __pthread_cleanup_push (&buffer, __condvar_cleanup_waiting, &cbuffer);
> -
> -	  err = __futex_abstimed_wait_cancelable64 (
> -	    cond->__data.__g_signals + g, 0, clockid, abstime, private);
> -
> -	  __pthread_cleanup_pop (&buffer, 0);
> -
> -	  if (__glibc_unlikely (err == ETIMEDOUT || err == EOVERFLOW))
> -	    {
> -	      __condvar_dec_grefs (cond, g, private);
> -	      /* If we timed out, we effectively cancel waiting.  Note that
> -		 we have decremented __g_refs before cancellation, so that a
> -		 deadlock between waiting for quiescence of our group in
> -		 __condvar_quiesce_and_switch_g1 and us trying to acquire
> -		 the lock during cancellation is not possible.  */
> -	      __condvar_cancel_waiting (cond, seq, g, private);
> -	      result = err;
> -	      goto done;
> -	    }
> -	  else
> -	    __condvar_dec_grefs (cond, g, private);
> -
> -	  /* Reload signals.  See above for MO.  */
> -	  signals = atomic_load_acquire (cond->__data.__g_signals + g);
> -	}
> -
> -    }
> -  /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
> -     of __g1_start below (see spinning above for a similar case).  In
> -     particular, if we steal from a more recent group, we will also see a
> -     more recent __g1_start below.  */
> -  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
> -						&signals, signals - 2));
> -
> -  /* We consumed a signal but we could have consumed from a more recent group
> -     that aliased with ours due to being in the same group slot.  If this
> -     might be the case our group must be closed as visible through
> -     __g1_start.  */
> -  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
> -  if (seq < (g1_start >> 1))
> -    {
> -      /* We potentially stole a signal from a more recent group but we do not
> -	 know which group we really consumed from.
> -	 We do not care about groups older than current G1 because they are
> -	 closed; we could have stolen from these, but then we just add a
> -	 spurious wake-up for the current groups.
> -	 We will never steal a signal from current G2 that was really intended
> -	 for G2 because G2 never receives signals (until it becomes G1).  We
> -	 could have stolen a signal from G2 that was conservatively added by a
> -	 previous waiter that also thought it stole a signal -- but given that
> -	 that signal was added unnecessarily, it's not a problem if we steal
> -	 it.
> -	 Thus, the remaining case is that we could have stolen from the current
> -	 G1, where "current" means the __g1_start value we observed.  However,
> -	 if the current G1 does not have the same slot index as we do, we did
> -	 not steal from it and do not need to undo that.  This is the reason
> -	 for putting a bit with G2's index into__g1_start as well.  */
> -      if (((g1_start & 1) ^ 1) == g)
> -	{
> -	  /* We have to conservatively undo our potential mistake of stealing
> -	     a signal.  We can stop trying to do that when the current G1
> -	     changes because other spinning waiters will notice this too and
> -	     __condvar_quiesce_and_switch_g1 has checked that there are no
> -	     futex waiters anymore before switching G1.
> -	     Relaxed MO is fine for the __g1_start load because we need to
> -	     merely be able to observe this fact and not have to observe
> -	     something else as well.
> -	     ??? Would it help to spin for a little while to see whether the
> -	     current G1 gets closed?  This might be worthwhile if the group is
> -	     small or close to being closed.  */
> -	  unsigned int s = atomic_load_relaxed (cond->__data.__g_signals + g);
> -	  while (__condvar_load_g1_start_relaxed (cond) == g1_start)
> -	    {
> -	      /* Try to add a signal.  We don't need to acquire the lock
> -		 because at worst we can cause a spurious wake-up.  If the
> -		 group is in the process of being closed (LSB is true), this
> -		 has an effect similar to us adding a signal.  */
> -	      if (((s & 1) != 0)
> -		  || atomic_compare_exchange_weak_relaxed
> -		       (cond->__data.__g_signals + g, &s, s + 2))
> -		{
> -		  /* If we added a signal, we also need to add a wake-up on
> -		     the futex.  We also need to do that if we skipped adding
> -		     a signal because the group is being closed because
> -		     while __condvar_quiesce_and_switch_g1 could have closed
> -		     the group, it might stil be waiting for futex waiters to
> -		     leave (and one of those waiters might be the one we stole
> -		     the signal from, which cause it to block using the
> -		     futex).  */
> -		  futex_wake (cond->__data.__g_signals + g, 1, private);
> -		  break;
> -		}
> -	      /* TODO Back off.  */
> -	    }
> -	}
> +      /* Need to block on the futex until someone signals.  */
> +      struct _pthread_cleanup_buffer buffer;
> +      struct _condvar_cleanup_buffer cbuffer;
> +      cbuffer.cond = cond;
> +      cbuffer.mutex = mutex;
> +      cbuffer.private = private;
> +      __pthread_cleanup_push (&buffer, __condvar_cleanup_waiting, &cbuffer);
> +
> +      /* Sleep. If we get woken by an interrupt, go back to sleep. Wake in
> +	 all other cases. Note that even after an interrupt we won't go back
> +	 to sleep if __gseq changed.  */
> +      do
> +	err = __futex_abstimed_wait_cancelable64 (
> +	  cond->__data.__gseq + g, gseq, clockid, abstime, private);
> +      while (err == EINTR);
> +
> +      __pthread_cleanup_pop (&buffer, 0);
> +
> +      if (__glibc_unlikely (err == ETIMEDOUT || err == EOVERFLOW))
> +	result = err;
>      }
>  
> - done:
> -
>    /* Confirm that we have been woken.  We do that before acquiring the mutex
>       to allow for execution of pthread_cond_destroy while having acquired the
>       mutex.  */
> diff --git a/nptl/tst-cond22.c b/nptl/tst-cond22.c
> index 1336e9c79d..479b5e6daa 100644
> --- a/nptl/tst-cond22.c
> +++ b/nptl/tst-cond22.c
> @@ -106,14 +106,13 @@ do_test (void)
>        status = 1;
>      }
>  
> -  printf ("cond = { 0x%x:%x, 0x%x:%x, %u/%u/%u, %u/%u/%u, %u, %u }\n",
> -	  c.__data.__wseq.__value32.__high,
> -	  c.__data.__wseq.__value32.__low,
> -	  c.__data.__g1_start.__value32.__high,
> -	  c.__data.__g1_start.__value32.__low,
> -	  c.__data.__g_signals[0], c.__data.__g_refs[0], c.__data.__g_size[0],
> -	  c.__data.__g_signals[1], c.__data.__g_refs[1], c.__data.__g_size[1],
> -	  c.__data.__g1_orig_size, c.__data.__wrefs);
> +  printf ("cond = { %u, %u, %u, %u, %u, %u }\n",
> +	  c.__data.__wseq,
> +	  c.__data.__g2_start,
> +	  c.__data.__gseq[0],
> +	  c.__data.__gseq[1],
> +	  c.__data.__lock,
> +	  c.__data.__wrefs);
>  
>    if (pthread_create (&th, NULL, tf, (void *) 1l) != 0)
>      {
> @@ -152,14 +151,13 @@ do_test (void)
>        status = 1;
>      }
>  
> -  printf ("cond = { 0x%x:%x, 0x%x:%x, %u/%u/%u, %u/%u/%u, %u, %u }\n",
> -	  c.__data.__wseq.__value32.__high,
> -	  c.__data.__wseq.__value32.__low,
> -	  c.__data.__g1_start.__value32.__high,
> -	  c.__data.__g1_start.__value32.__low,
> -	  c.__data.__g_signals[0], c.__data.__g_refs[0], c.__data.__g_size[0],
> -	  c.__data.__g_signals[1], c.__data.__g_refs[1], c.__data.__g_size[1],
> -	  c.__data.__g1_orig_size, c.__data.__wrefs);
> +  printf ("cond = { %u, %u, %u, %u, %u, %u }\n",
> +	  c.__data.__wseq,
> +	  c.__data.__g2_start,
> +	  c.__data.__gseq[0],
> +	  c.__data.__gseq[1],
> +	  c.__data.__lock,
> +	  c.__data.__wrefs);
>  
>    return status;
>  }
> diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
> index 5653507e55..e2f1036af4 100644
> --- a/sysdeps/nptl/bits/thread-shared-types.h
> +++ b/sysdeps/nptl/bits/thread-shared-types.h
> @@ -93,13 +93,16 @@ typedef struct __pthread_internal_slist
>  
>  struct __pthread_cond_s
>  {
> -  __atomic_wide_counter __wseq;
> -  __atomic_wide_counter __g1_start;
> -  unsigned int __g_refs[2] __LOCK_ALIGNMENT;
> -  unsigned int __g_size[2];
> -  unsigned int __g1_orig_size;
> +  unsigned int __wseq;
> +  unsigned int __g2_start;
> +  unsigned int __gseq[2];
> +  unsigned int __lock __LOCK_ALIGNMENT;
>    unsigned int __wrefs;
> -  unsigned int __g_signals[2];
> +  /* Reserve some bytes for future changes. It's difficult to know how many we
> +     might need, but the smallest previous implementation used 44 bytes, so
> +     future versions can safely rely on that many bytes being initialized. So
> +     we reserve the same number.  */
> +  unsigned int __unused_padding[5];
>  };
>  
>  typedef unsigned int __tss_t;
> diff --git a/sysdeps/nptl/futex-internal.h b/sysdeps/nptl/futex-internal.h
> index cf54c27ed9..a3d9df8143 100644
> --- a/sysdeps/nptl/futex-internal.h
> +++ b/sysdeps/nptl/futex-internal.h
> @@ -203,13 +203,13 @@ futex_abstimed_supported_clockid (clockid_t clockid)
>     has potentially been reused due to POSIX' requirements on synchronization
>     object destruction (see above); therefore, we must not report or abort
>     on most errors.  */
> -static __always_inline void
> +static __always_inline int
>  futex_wake (unsigned int* futex_word, int processes_to_wake, int private)
>  {
>    int res = lll_futex_wake (futex_word, processes_to_wake, private);
> -  /* No error.  Ignore the number of woken processes.  */
> +  /* No error.  */
>    if (res >= 0)
> -    return;
> +    return res;
>    switch (res)
>      {
>      case -EFAULT: /* Could have happened due to memory reuse.  */
> @@ -218,7 +218,7 @@ futex_wake (unsigned int* futex_word, int processes_to_wake, int private)
>  		     reused for a PI futex.  We cannot distinguish between the
>  		     two causes, and one of them is correct use, so we do not
>  		     act in this case.  */
> -      return;
> +      return 0;
>      case -ENOSYS: /* Must have been caused by a glibc bug.  */
>      /* No other errors are documented at this time.  */
>      default:
> diff --git a/sysdeps/nptl/lowlevellock-futex.h b/sysdeps/nptl/lowlevellock-futex.h
> index eaf1f19168..02d0939f31 100644
> --- a/sysdeps/nptl/lowlevellock-futex.h
> +++ b/sysdeps/nptl/lowlevellock-futex.h
> @@ -62,11 +62,20 @@
>       ? -INTERNAL_SYSCALL_ERRNO (__ret) : 0);                     	\
>    })
>  
> +# define lll_futex_syscall_positive(nargs, futexp, op, ...)                      \
> +  ({                                                                    \
> +    long int __ret = INTERNAL_SYSCALL (futex, nargs, futexp, op,  \
> +               __VA_ARGS__);                    \
> +    (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (__ret))          \
> +     ? -INTERNAL_SYSCALL_ERRNO (__ret) : __ret);                      \
> +  })
> +
>  /* For most of these macros, the return value is never really used.
>     Nevertheless, the protocol is that each one returns a negated errno
> -   code for failure or zero for success.  (Note that the corresponding
> -   Linux system calls can sometimes return positive values for success
> -   cases too.  We never use those values.)  */
> +   code for failure or zero for success.  Only futex_wake will return
> +   a positive value, the number of woken waiters. (Note that the
> +   other Linux system calls can sometimes return positive values for
> +   success cases too.  We never use those values.) */
>  
>  
>  /* Wait while *FUTEXP == VAL for an lll_futex_wake call on FUTEXP.  */
> @@ -85,8 +94,8 @@
>  
>  /* Wake up up to NR waiters on FUTEXP.  */
>  # define lll_futex_wake(futexp, nr, private)                             \
> -  lll_futex_syscall (4, futexp,                                         \
> -		     __lll_private_flag (FUTEX_WAKE, private), nr, 0)
> +  lll_futex_syscall_positive (4, futexp,                                         \
> +         __lll_private_flag (FUTEX_WAKE, private), nr, 0)
>  
>  /* Wake up up to NR_WAKE waiters on FUTEXP.  Move up to NR_MOVE of the
>     rest from waiting on FUTEXP to waiting on MUTEX (a different futex).
> diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
> index dedad4ec86..dae319a19c 100644
> --- a/sysdeps/nptl/pthread.h
> +++ b/sysdeps/nptl/pthread.h
> @@ -152,7 +152,7 @@ enum
>  
>  
>  /* Conditional variable handling.  */
> -#define PTHREAD_COND_INITIALIZER { { {0}, {0}, {0, 0}, {0, 0}, 0, 0, {0, 0} } }
> +#define PTHREAD_COND_INITIALIZER { { 0, 0, {0, 0}, 0, 0, { 0, 0, 0, 0, 0 } } }
>  
>  
>  /* Cleanup buffers */
  
Malte Skarupke Oct. 24, 2022, 11:17 a.m. UTC | #2
>> Solution 2:
>> The code tried to detect this and fix the problem by adding an extra signal.
>
> 331    Waiters try to grab a signal from __g_signals without holding a 
> reference
> 332    count, which can lead to stealing a signal from a more recent 
> group after
> 333    their own group was already closed.  They cannot always detect 
> whether they
> 334    in fact did because they do not know when they stole, but they 
> can
> 335    conservatively add a signal back to the group they stole from; 
> if they
> 336    did so unnecessarily, all that happens is a spurious wake-up.  
> To make this
> 337    even less likely, __g1_start contains the index of the current 
> g2 too,
> 338    which allows waiters to check if there aliasing on the group 
> slots; if
> 339    there wasn't, they didn't steal from the current G1, which means 
> that the
> 340    G1 they stole from must have been already closed and they do not 
> need to
> 341    fix anything.
>
> Do you mean this ^^^ ?

Yes, this is the ABA problem I'm referring to. Here is the order:
1. Thread A calls pthread_cond_wait, is about to decrement g_signals
2. Thread Z signals often enough to switch groups twice. While doing that it resets g_signals down to 0 (in __condvar_quiesce_and_switch_g1)
3. Thread B goes to sleep in the same group as thread A
4. Thread Z signals again, increments g_signals
5. Thread A finally does its decrement
6. Thread B gets woken up by the futex_wake from step 5
7. Thread B finds g_signals==0, goes back to sleep

>> Problem 3:
>> That extra signal can lead to more problems in complex interleavings, which
>> causes the current bug, BZ 25847.
>
> I would be helpful to have a more detailed explanation of the root cause.

The extra signal can mess with the counts. When switching groups, the old code would set g_size to the amount of threads that's in g1. Then pthread_cond_signal would signal exactly that many times before switching groups again. But the extra increment of g_signals from the "potential stealing" code can allow a thread to wake up spuriously. The comment you quoted above says this is harmless, but if there are fewer sleepers in a group than g_size claims, then one pthread_cond_signal call won't do anything. This is bad if there are sleepers in the other group, which should have been woken by that pthread_cond_signal call.


Do you want me to include these details in the patch description? I wasn't sure how much detail to include in there.


> Your patch changes the algorithm significantly and so it represents a 
> third (original,
> two-group, and now your solution) algorithmic change that needs 
> detailed review.

I can see if I can make a series of more incremental changes from the old code. But overall this is going to end up quite different.  I think it may be easier to review than a smaller change though, because the new code taken on its own is easier to reason about.

> How does this compare against Frank Barrus' patch?
> https://sourceware.org/bugzilla/attachment.cgi?id=13419
>
> I have some reports from other developers in the community that this 
> patch does solve
> the issue.

This is very much inspired by Frank Barrus' patch. I had tried to review his patch when I realized how to simplify the code a lot, like I did in my patches. He has a g_signals value that is incrementing on every group switch, until it eventually wraps around. So waiters who observe a value that's too high can simply wake up without decrementing. I still found it tricky to reason about the decrementing and the sizes, so that's how you get to my patch, where we only increment. (by "tricky to reason about" I mean it's not that complex in theory, but hard if you're trying to think about weird interleavings like what caused BZ 25847)


Let me know what you would like me to do here. I also have an idea for making the new code faster than the old code, by taking the condvar internal mutex less often. Then it would be a win not only in correctness, but also in speed. It's what I'm currently working on.
  

Patch

diff --git a/nptl/pthread_cond_broadcast.c b/nptl/pthread_cond_broadcast.c
index 5ae141ac81..b556c49af7 100644
--- a/nptl/pthread_cond_broadcast.c
+++ b/nptl/pthread_cond_broadcast.c
@@ -29,16 +29,14 @@ 
 #include "pthread_cond_common.c"
 
 
-/* We do the following steps from __pthread_cond_signal in one critical
-   section: (1) signal all waiters in G1, (2) close G1 so that it can become
-   the new G2 and make G2 the new G1, and (3) signal all waiters in the new
-   G1.  We don't need to do all these steps if there are no waiters in G1
-   and/or G2.  See __pthread_cond_signal for further details.  */
+/* See __pthread_cond_wait for a high-level description of the algorithm.  */
 int
 ___pthread_cond_broadcast (pthread_cond_t *cond)
 {
   LIBC_PROBE (cond_broadcast, 1, cond);
 
+  /* First check whether there are waiters.  See pthread_cond_signal.c for
+     why relaxed MO is fine.  */
   unsigned int wrefs = atomic_load_relaxed (&cond->__data.__wrefs);
   if (wrefs >> 3 == 0)
     return 0;
@@ -46,43 +44,27 @@  ___pthread_cond_broadcast (pthread_cond_t *cond)
 
   __condvar_acquire_lock (cond, private);
 
-  unsigned long long int wseq = __condvar_load_wseq_relaxed (cond);
+  /* Load the waiter sequence number, which represents our relative ordering
+     to any waiters.  See pthread_cond_signal.c for why relaxed MO is fine.  */
+  unsigned int wseq = atomic_load_relaxed (&cond->__data.__wseq);
   unsigned int g2 = wseq & 1;
   unsigned int g1 = g2 ^ 1;
-  wseq >>= 1;
-  bool do_futex_wake = false;
+  wseq &= ~1;
+  bool wake_g2 = wseq != cond->__data.__g2_start;
 
-  /* Step (1): signal all waiters remaining in G1.  */
-  if (cond->__data.__g_size[g1] != 0)
-    {
-      /* Add as many signals as the remaining size of the group.  */
-      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1,
-				cond->__data.__g_size[g1] << 1);
-      cond->__data.__g_size[g1] = 0;
-
-      /* We need to wake G1 waiters before we quiesce G1 below.  */
-      /* TODO Only set it if there are indeed futex waiters.  We could
-	 also try to move this out of the critical section in cases when
-	 G2 is empty (and we don't need to quiesce).  */
-      futex_wake (cond->__data.__g_signals + g1, INT_MAX, private);
-    }
-
-  /* G1 is complete.  Step (2) is next unless there are no waiters in G2, in
-     which case we can stop.  */
-  if (__condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
-    {
-      /* Step (3): Send signals to all waiters in the old G2 / new G1.  */
-      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1,
-				cond->__data.__g_size[g1] << 1);
-      cond->__data.__g_size[g1] = 0;
-      /* TODO Only set it if there are indeed futex waiters.  */
-      do_futex_wake = true;
-    }
+  /* Modify both g1 and g2 and reset g2_start to the same value. This puts us
+     in a similar state as a newly initialized condvar: Everything has the
+     same value.  */
+  atomic_store_release (cond->__data.__gseq + g1, wseq);
+  atomic_store_release (cond->__data.__gseq + g2, wseq);
+  cond->__data.__g2_start = wseq;
 
   __condvar_release_lock (cond, private);
 
-  if (do_futex_wake)
-    futex_wake (cond->__data.__g_signals + g1, INT_MAX, private);
+  /* We don't know if there are waiters in g1, so just wake.  */
+  futex_wake (cond->__data.__gseq + g1, INT_MAX, private);
+  if (wake_g2)
+    futex_wake (cond->__data.__gseq + g2, INT_MAX, private);
 
   return 0;
 }
diff --git a/nptl/pthread_cond_common.c b/nptl/pthread_cond_common.c
index fb035f72c3..b9043a218b 100644
--- a/nptl/pthread_cond_common.c
+++ b/nptl/pthread_cond_common.c
@@ -21,90 +21,9 @@ 
 #include <stdint.h>
 #include <pthread.h>
 
-/* We need 3 least-significant bits on __wrefs for something else.
-   This also matches __atomic_wide_counter requirements: The highest
-   value we add is __PTHREAD_COND_MAX_GROUP_SIZE << 2 to __g1_start
-   (the two extra bits are for the lock in the two LSBs of
-   __g1_start).  */
+/* We use 3 least-significant bits on __wrefs for something else.  */
 #define __PTHREAD_COND_MAX_GROUP_SIZE ((unsigned) 1 << 29)
 
-static inline uint64_t
-__condvar_load_wseq_relaxed (pthread_cond_t *cond)
-{
-  return __atomic_wide_counter_load_relaxed (&cond->__data.__wseq);
-}
-
-static inline uint64_t
-__condvar_fetch_add_wseq_acquire (pthread_cond_t *cond, unsigned int val)
-{
-  return __atomic_wide_counter_fetch_add_acquire (&cond->__data.__wseq, val);
-}
-
-static inline uint64_t
-__condvar_load_g1_start_relaxed (pthread_cond_t *cond)
-{
-  return __atomic_wide_counter_load_relaxed (&cond->__data.__g1_start);
-}
-
-static inline void
-__condvar_add_g1_start_relaxed (pthread_cond_t *cond, unsigned int val)
-{
-  __atomic_wide_counter_add_relaxed (&cond->__data.__g1_start, val);
-}
-
-#if __HAVE_64B_ATOMICS == 1
-
-static inline uint64_t
-__condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
-{
-  return atomic_fetch_xor_release (&cond->__data.__wseq.__value64, val);
-}
-
-#else /* !__HAVE_64B_ATOMICS */
-
-/* The xor operation needs to be an atomic read-modify-write.  The write
-   itself is not an issue as it affects just the lower-order half but not bits
-   used in the add operation.  To make the full fetch-and-xor atomic, we
-   exploit that concurrently, the value can increase by at most 1<<31 (*): The
-   xor operation is only called while having acquired the lock, so not more
-   than __PTHREAD_COND_MAX_GROUP_SIZE waiters can enter concurrently and thus
-   increment __wseq.  Therefore, if the xor operation observes a value of
-   __wseq, then the value it applies the modification to later on can be
-   derived.  */
-
-static uint64_t __attribute__ ((unused))
-__condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
-{
-  /* First, get the current value.  See __atomic_wide_counter_load_relaxed.  */
-  unsigned int h, l, h2;
-  do
-    {
-      h = atomic_load_acquire (&cond->__data.__wseq.__value32.__high);
-      l = atomic_load_acquire (&cond->__data.__wseq.__value32.__low);
-      h2 = atomic_load_relaxed (&cond->__data.__wseq.__value32.__high);
-    }
-  while (h != h2);
-  if (((l >> 31) > 0) && ((h >> 31) == 0))
-    h++;
-  h &= ~((unsigned int) 1 << 31);
-  l &= ~((unsigned int) 1 << 31);
-
-  /* Now modify.  Due to the coherence rules, the prior load will read a value
-     earlier in modification order than the following fetch-xor.
-     This uses release MO to make the full operation have release semantics
-     (all other operations access the lower-order half).  */
-  unsigned int l2
-    = (atomic_fetch_xor_release (&cond->__data.__wseq.__value32.__low, val)
-       & ~((unsigned int) 1 << 31));
-  if (l2 < l)
-    /* The lower-order half overflowed in the meantime.  This happened exactly
-       once due to the limit on concurrent waiters (see above).  */
-    h++;
-  return ((uint64_t) h << 31) + l2;
-}
-
-#endif /* !__HAVE_64B_ATOMICS */
-
 /* The lock that signalers use.  See pthread_cond_wait_common for uses.
    The lock is our normal three-state lock: not acquired (0) / acquired (1) /
    acquired-with-futex_wake-request (2).  However, we need to preserve the
@@ -113,10 +32,10 @@  __condvar_fetch_xor_wseq_release (pthread_cond_t *cond, unsigned int val)
 static void __attribute__ ((unused))
 __condvar_acquire_lock (pthread_cond_t *cond, int private)
 {
-  unsigned int s = atomic_load_relaxed (&cond->__data.__g1_orig_size);
+  unsigned int s = atomic_load_relaxed (&cond->__data.__lock);
   while ((s & 3) == 0)
     {
-      if (atomic_compare_exchange_weak_acquire (&cond->__data.__g1_orig_size,
+      if (atomic_compare_exchange_weak_acquire (&cond->__data.__lock,
 	  &s, s | 1))
 	return;
       /* TODO Spinning and back-off.  */
@@ -129,7 +48,7 @@  __condvar_acquire_lock (pthread_cond_t *cond, int private)
       while ((s & 3) != 2)
 	{
 	  if (atomic_compare_exchange_weak_acquire
-	      (&cond->__data.__g1_orig_size, &s, (s & ~(unsigned int) 3) | 2))
+	      (&cond->__data.__lock, &s, (s & ~(unsigned int) 3) | 2))
 	    {
 	      if ((s & 3) == 0)
 		return;
@@ -137,10 +56,10 @@  __condvar_acquire_lock (pthread_cond_t *cond, int private)
 	    }
 	  /* TODO Back off.  */
 	}
-      futex_wait_simple (&cond->__data.__g1_orig_size,
+      futex_wait_simple (&cond->__data.__lock,
 	  (s & ~(unsigned int) 3) | 2, private);
       /* Reload so we see a recent value.  */
-      s = atomic_load_relaxed (&cond->__data.__g1_orig_size);
+      s = atomic_load_relaxed (&cond->__data.__lock);
     }
 }
 
@@ -148,34 +67,10 @@  __condvar_acquire_lock (pthread_cond_t *cond, int private)
 static void __attribute__ ((unused))
 __condvar_release_lock (pthread_cond_t *cond, int private)
 {
-  if ((atomic_fetch_and_release (&cond->__data.__g1_orig_size,
+  if ((atomic_fetch_and_release (&cond->__data.__lock,
 				 ~(unsigned int) 3) & 3)
       == 2)
-    futex_wake (&cond->__data.__g1_orig_size, 1, private);
-}
-
-/* Only use this when having acquired the lock.  */
-static unsigned int __attribute__ ((unused))
-__condvar_get_orig_size (pthread_cond_t *cond)
-{
-  return atomic_load_relaxed (&cond->__data.__g1_orig_size) >> 2;
-}
-
-/* Only use this when having acquired the lock.  */
-static void __attribute__ ((unused))
-__condvar_set_orig_size (pthread_cond_t *cond, unsigned int size)
-{
-  /* We have acquired the lock, but might get one concurrent update due to a
-     lock state change from acquired to acquired-with-futex_wake-request.
-     The store with relaxed MO is fine because there will be no further
-     changes to the lock bits nor the size, and we will subsequently release
-     the lock with release MO.  */
-  unsigned int s;
-  s = (atomic_load_relaxed (&cond->__data.__g1_orig_size) & 3)
-      | (size << 2);
-  if ((atomic_exchange_relaxed (&cond->__data.__g1_orig_size, s) & 3)
-      != (s & 3))
-    atomic_store_relaxed (&cond->__data.__g1_orig_size, (size << 2) | 2);
+    futex_wake (&cond->__data.__lock, 1, private);
 }
 
 /* Returns FUTEX_SHARED or FUTEX_PRIVATE based on the provided __wrefs
@@ -189,141 +84,3 @@  __condvar_get_private (int flags)
     return FUTEX_SHARED;
 }
 
-/* This closes G1 (whose index is in G1INDEX), waits for all futex waiters to
-   leave G1, converts G1 into a fresh G2, and then switches group roles so that
-   the former G2 becomes the new G1 ending at the current __wseq value when we
-   eventually make the switch (WSEQ is just an observation of __wseq by the
-   signaler).
-   If G2 is empty, it will not switch groups because then it would create an
-   empty G1 which would require switching groups again on the next signal.
-   Returns false iff groups were not switched because G2 was empty.  */
-static bool __attribute__ ((unused))
-__condvar_quiesce_and_switch_g1 (pthread_cond_t *cond, uint64_t wseq,
-    unsigned int *g1index, int private)
-{
-  const unsigned int maxspin = 0;
-  unsigned int g1 = *g1index;
-
-  /* If there is no waiter in G2, we don't do anything.  The expression may
-     look odd but remember that __g_size might hold a negative value, so
-     putting the expression this way avoids relying on implementation-defined
-     behavior.
-     Note that this works correctly for a zero-initialized condvar too.  */
-  unsigned int old_orig_size = __condvar_get_orig_size (cond);
-  uint64_t old_g1_start = __condvar_load_g1_start_relaxed (cond) >> 1;
-  if (((unsigned) (wseq - old_g1_start - old_orig_size)
-	  + cond->__data.__g_size[g1 ^ 1]) == 0)
-	return false;
-
-  /* Now try to close and quiesce G1.  We have to consider the following kinds
-     of waiters:
-     * Waiters from less recent groups than G1 are not affected because
-       nothing will change for them apart from __g1_start getting larger.
-     * New waiters arriving concurrently with the group switching will all go
-       into G2 until we atomically make the switch.  Waiters existing in G2
-       are not affected.
-     * Waiters in G1 will be closed out immediately by setting a flag in
-       __g_signals, which will prevent waiters from blocking using a futex on
-       __g_signals and also notifies them that the group is closed.  As a
-       result, they will eventually remove their group reference, allowing us
-       to close switch group roles.  */
-
-  /* First, set the closed flag on __g_signals.  This tells waiters that are
-     about to wait that they shouldn't do that anymore.  This basically
-     serves as an advance notificaton of the upcoming change to __g1_start;
-     waiters interpret it as if __g1_start was larger than their waiter
-     sequence position.  This allows us to change __g1_start after waiting
-     for all existing waiters with group references to leave, which in turn
-     makes recovery after stealing a signal simpler because it then can be
-     skipped if __g1_start indicates that the group is closed (otherwise,
-     we would have to recover always because waiters don't know how big their
-     groups are).  Relaxed MO is fine.  */
-  atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);
-
-  /* Wait until there are no group references anymore.  The fetch-or operation
-     injects us into the modification order of __g_refs; release MO ensures
-     that waiters incrementing __g_refs after our fetch-or see the previous
-     changes to __g_signals and to __g1_start that had to happen before we can
-     switch this G1 and alias with an older group (we have two groups, so
-     aliasing requires switching group roles twice).  Note that nobody else
-     can have set the wake-request flag, so we do not have to act upon it.
-
-     Also note that it is harmless if older waiters or waiters from this G1
-     get a group reference after we have quiesced the group because it will
-     remain closed for them either because of the closed flag in __g_signals
-     or the later update to __g1_start.  New waiters will never arrive here
-     but instead continue to go into the still current G2.  */
-  unsigned r = atomic_fetch_or_release (cond->__data.__g_refs + g1, 0);
-  while ((r >> 1) > 0)
-    {
-      for (unsigned int spin = maxspin; ((r >> 1) > 0) && (spin > 0); spin--)
-	{
-	  /* TODO Back off.  */
-	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
-	}
-      if ((r >> 1) > 0)
-	{
-	  /* There is still a waiter after spinning.  Set the wake-request
-	     flag and block.  Relaxed MO is fine because this is just about
-	     this futex word.
-
-	     Update r to include the set wake-request flag so that the upcoming
-	     futex_wait only blocks if the flag is still set (otherwise, we'd
-	     violate the basic client-side futex protocol).  */
-	  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;
-
-	  if ((r >> 1) > 0)
-	    futex_wait_simple (cond->__data.__g_refs + g1, r, private);
-	  /* Reload here so we eventually see the most recent value even if we
-	     do not spin.   */
-	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
-	}
-    }
-  /* Acquire MO so that we synchronize with the release operation that waiters
-     use to decrement __g_refs and thus happen after the waiters we waited
-     for.  */
-  atomic_thread_fence_acquire ();
-
-  /* Update __g1_start, which finishes closing this group.  The value we add
-     will never be negative because old_orig_size can only be zero when we
-     switch groups the first time after a condvar was initialized, in which
-     case G1 will be at index 1 and we will add a value of 1.  See above for
-     why this takes place after waiting for quiescence of the group.
-     Relaxed MO is fine because the change comes with no additional
-     constraints that others would have to observe.  */
-  __condvar_add_g1_start_relaxed (cond,
-      (old_orig_size << 1) + (g1 == 1 ? 1 : - 1));
-
-  /* Now reopen the group, thus enabling waiters to again block using the
-     futex controlled by __g_signals.  Release MO so that observers that see
-     no signals (and thus can block) also see the write __g1_start and thus
-     that this is now a new group (see __pthread_cond_wait_common for the
-     matching acquire MO loads).  */
-  atomic_store_release (cond->__data.__g_signals + g1, 0);
-
-  /* At this point, the old G1 is now a valid new G2 (but not in use yet).
-     No old waiter can neither grab a signal nor acquire a reference without
-     noticing that __g1_start is larger.
-     We can now publish the group switch by flipping the G2 index in __wseq.
-     Release MO so that this synchronizes with the acquire MO operation
-     waiters use to obtain a position in the waiter sequence.  */
-  wseq = __condvar_fetch_xor_wseq_release (cond, 1) >> 1;
-  g1 ^= 1;
-  *g1index ^= 1;
-
-  /* These values are just observed by signalers, and thus protected by the
-     lock.  */
-  unsigned int orig_size = wseq - (old_g1_start + old_orig_size);
-  __condvar_set_orig_size (cond, orig_size);
-  /* Use and addition to not loose track of cancellations in what was
-     previously G2.  */
-  cond->__data.__g_size[g1] += orig_size;
-
-  /* The new G1's size may be zero because of cancellations during its time
-     as G2.  If this happens, there are no waiters that have to receive a
-     signal, so we do not need to add any and return false.  */
-  if (cond->__data.__g_size[g1] == 0)
-    return false;
-
-  return true;
-}
diff --git a/nptl/pthread_cond_signal.c b/nptl/pthread_cond_signal.c
index 14800ba00b..e6968164cd 100644
--- a/nptl/pthread_cond_signal.c
+++ b/nptl/pthread_cond_signal.c
@@ -63,34 +63,30 @@  ___pthread_cond_signal (pthread_cond_t *cond)
         that has been woken but hasn't resumed execution yet), and thus it
         cannot try to deduce that a signal happened before a particular
         waiter.  */
-  unsigned long long int wseq = __condvar_load_wseq_relaxed (cond);
+  unsigned int wseq = atomic_load_relaxed (&cond->__data.__wseq);
   unsigned int g1 = (wseq & 1) ^ 1;
-  wseq >>= 1;
+  wseq &= ~1;
   bool do_futex_wake = false;
 
-  /* If G1 is still receiving signals, we put the signal there.  If not, we
-     check if G2 has waiters, and if so, quiesce and switch G1 to the former
-     G2; if this results in a new G1 with waiters (G2 might have cancellations
-     already, see __condvar_quiesce_and_switch_g1), we put the signal in the
-     new G1.  */
-  if ((cond->__data.__g_size[g1] != 0)
-      || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
+  atomic_store_release (cond->__data.__gseq + g1, wseq);
+  bool woken = futex_wake (cond->__data.__gseq + g1, 1, private) > 0;
+  if (!woken && wseq != cond->__data.__g2_start)
     {
-      /* Add a signal.  Relaxed MO is fine because signaling does not need to
-	 establish a happens-before relation (see above).  We do not mask the
-	 release-MO store when initializing a group in
-	 __condvar_quiesce_and_switch_g1 because we use an atomic
-	 read-modify-write and thus extend that store's release sequence.  */
-      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
-      cond->__data.__g_size[g1]--;
-      /* TODO Only set it if there are indeed futex waiters.  */
+      /* Nobody was woken by futex_wake but there are waiters in g2. Switch
+	 groups and try again.  */
+      wseq = atomic_fetch_xor_release (&cond->__data.__wseq, 1) & ~1;
+
+      g1 ^= 1;
+      cond->__data.__g2_start = wseq;
+
+      atomic_store_release (cond->__data.__gseq + g1, wseq);
       do_futex_wake = true;
     }
 
   __condvar_release_lock (cond, private);
 
   if (do_futex_wake)
-    futex_wake (cond->__data.__g_signals + g1, 1, private);
+    futex_wake (cond->__data.__gseq + g1, 1, private);
 
   return 0;
 }
diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
index 20c348a503..c146784e96 100644
--- a/nptl/pthread_cond_wait.c
+++ b/nptl/pthread_cond_wait.c
@@ -32,16 +32,13 @@ 
 
 #include "pthread_cond_common.c"
 
-
 struct _condvar_cleanup_buffer
 {
-  uint64_t wseq;
   pthread_cond_t *cond;
   pthread_mutex_t *mutex;
   int private;
 };
 
-
 /* Decrease the waiter reference count.  */
 static void
 __condvar_confirm_wakeup (pthread_cond_t *cond, int private)
@@ -54,112 +51,6 @@  __condvar_confirm_wakeup (pthread_cond_t *cond, int private)
     futex_wake (&cond->__data.__wrefs, INT_MAX, private);
 }
 
-
-/* Cancel waiting after having registered as a waiter previously.  SEQ is our
-   position and G is our group index.
-   The goal of cancellation is to make our group smaller if that is still
-   possible.  If we are in a closed group, this is not possible anymore; in
-   this case, we need to send a replacement signal for the one we effectively
-   consumed because the signal should have gotten consumed by another waiter
-   instead; we must not both cancel waiting and consume a signal.
-
-   Must not be called while still holding a reference on the group.
-
-   Returns true iff we consumed a signal.
-
-   On some kind of timeouts, we may be able to pretend that a signal we
-   effectively consumed happened before the timeout (i.e., similarly to first
-   spinning on signals before actually checking whether the timeout has
-   passed already).  Doing this would allow us to skip sending a replacement
-   signal, but this case might happen rarely because the end of the timeout
-   must race with someone else sending a signal.  Therefore, we don't bother
-   trying to optimize this.  */
-static void
-__condvar_cancel_waiting (pthread_cond_t *cond, uint64_t seq, unsigned int g,
-			  int private)
-{
-  bool consumed_signal = false;
-
-  /* No deadlock with group switching is possible here because we do
-     not hold a reference on the group.  */
-  __condvar_acquire_lock (cond, private);
-
-  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond) >> 1;
-  if (g1_start > seq)
-    {
-      /* Our group is closed, so someone provided enough signals for it.
-	 Thus, we effectively consumed a signal.  */
-      consumed_signal = true;
-    }
-  else
-    {
-      if (g1_start + __condvar_get_orig_size (cond) <= seq)
-	{
-	  /* We are in the current G2 and thus cannot have consumed a signal.
-	     Reduce its effective size or handle overflow.  Remember that in
-	     G2, unsigned int size is zero or a negative value.  */
-	  if (cond->__data.__g_size[g] + __PTHREAD_COND_MAX_GROUP_SIZE > 0)
-	    {
-	      cond->__data.__g_size[g]--;
-	    }
-	  else
-	    {
-	      /* Cancellations would overflow the maximum group size.  Just
-		 wake up everyone spuriously to create a clean state.  This
-		 also means we do not consume a signal someone else sent.  */
-	      __condvar_release_lock (cond, private);
-	      __pthread_cond_broadcast (cond);
-	      return;
-	    }
-	}
-      else
-	{
-	  /* We are in current G1.  If the group's size is zero, someone put
-	     a signal in the group that nobody else but us can consume.  */
-	  if (cond->__data.__g_size[g] == 0)
-	    consumed_signal = true;
-	  else
-	    {
-	      /* Otherwise, we decrease the size of the group.  This is
-		 equivalent to atomically putting in a signal just for us and
-		 consuming it right away.  We do not consume a signal sent
-		 by someone else.  We also cannot have consumed a futex
-		 wake-up because if we were cancelled or timed out in a futex
-		 call, the futex will wake another waiter.  */
-	      cond->__data.__g_size[g]--;
-	    }
-	}
-    }
-
-  __condvar_release_lock (cond, private);
-
-  if (consumed_signal)
-    {
-      /* We effectively consumed a signal even though we didn't want to.
-	 Therefore, we need to send a replacement signal.
-	 If we would want to optimize this, we could do what
-	 pthread_cond_signal does right in the critical section above.  */
-      __pthread_cond_signal (cond);
-    }
-}
-
-/* Wake up any signalers that might be waiting.  */
-static void
-__condvar_dec_grefs (pthread_cond_t *cond, unsigned int g, int private)
-{
-  /* Release MO to synchronize-with the acquire load in
-     __condvar_quiesce_and_switch_g1.  */
-  if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
-    {
-      /* Clear the wake-up request flag before waking up.  We do not need more
-	 than relaxed MO and it doesn't matter if we apply this for an aliased
-	 group because we wake all futex waiters right after clearing the
-	 flag.  */
-      atomic_fetch_and_relaxed (cond->__data.__g_refs + g, ~(unsigned int) 1);
-      futex_wake (cond->__data.__g_refs + g, INT_MAX, private);
-    }
-}
-
 /* Clean-up for cancellation of waiters waiting for normal signals.  We cancel
    our registration as a waiter, confirm we have woken up, and re-acquire the
    mutex.  */
@@ -168,21 +59,8 @@  __condvar_cleanup_waiting (void *arg)
 {
   struct _condvar_cleanup_buffer *cbuffer =
     (struct _condvar_cleanup_buffer *) arg;
-  pthread_cond_t *cond = cbuffer->cond;
-  unsigned g = cbuffer->wseq & 1;
-
-  __condvar_dec_grefs (cond, g, cbuffer->private);
 
-  __condvar_cancel_waiting (cond, cbuffer->wseq >> 1, g, cbuffer->private);
-  /* FIXME With the current cancellation implementation, it is possible that
-     a thread is cancelled after it has returned from a syscall.  This could
-     result in a cancelled waiter consuming a futex wake-up that is then
-     causing another waiter in the same group to not wake up.  To work around
-     this issue until we have fixed cancellation, just add a futex wake-up
-     conservatively.  */
-  futex_wake (cond->__data.__g_signals + g, 1, cbuffer->private);
-
-  __condvar_confirm_wakeup (cond, cbuffer->private);
+  __condvar_confirm_wakeup (cbuffer->cond, cbuffer->private);
 
   /* XXX If locking the mutex fails, should we just stop execution?  This
      might be better than silently ignoring the error.  */
@@ -197,7 +75,7 @@  __condvar_cleanup_waiting (void *arg)
    not necessarily result in additional happens-before relations being
    established (which aligns well with spurious wake-ups being allowed).
 
-   All waiters acquire a certain position in a 64b waiter sequence (__wseq).
+   All waiters acquire a certain position in a 32b waiter sequence (__wseq).
    This sequence determines which waiters are allowed to consume signals.
    A broadcast is equal to sending as many signals as are unblocked waiters.
    When a signal arrives, it samples the current value of __wseq with a
@@ -210,18 +88,8 @@  __condvar_cleanup_waiting (void *arg)
    This would be straight-forward to implement if waiters would just spin but
    we need to let them block using futexes.  Futexes give no guarantee of
    waking in FIFO order, so we cannot reliably wake eligible waiters if we
-   just use a single futex.  Also, futex words are 32b in size, but we need
-   to distinguish more than 1<<32 states because we need to represent the
-   order of wake-up (and thus which waiters are eligible to consume signals);
-   blocking in a futex is not atomic with a waiter determining its position in
-   the waiter sequence, so we need the futex word to reliably notify waiters
-   that they should not attempt to block anymore because they have been
-   already signaled in the meantime.  While an ABA issue on a 32b value will
-   be rare, ignoring it when we are aware of it is not the right thing to do
-   either.
-
-   Therefore, we use a 64b counter to represent the waiter sequence (on
-   architectures which only support 32b atomics, we use a few bits less).
+   just use a single futex.
+
    To deal with the blocking using futexes, we maintain two groups of waiters:
    * Group G1 consists of waiters that are all eligible to consume signals;
      incoming signals will always signal waiters in this group until all
@@ -233,36 +101,30 @@  __condvar_cleanup_waiting (void *arg)
 
    We cannot allocate new memory because of process-shared condvars, so we
    have just two slots of groups that change their role between G1 and G2.
-   Each has a separate futex word, a number of signals available for
-   consumption, a size (number of waiters in the group that have not been
-   signaled), and a reference count.
-
-   The group reference count is used to maintain the number of waiters that
-   are using the group's futex.  Before a group can change its role, the
-   reference count must show that no waiters are using the futex anymore; this
-   prevents ABA issues on the futex word.
 
    To represent which intervals in the waiter sequence the groups cover (and
-   thus also which group slot contains G1 or G2), we use a 64b counter to
-   designate the start position of G1 (inclusive), and a single bit in the
+   thus also which group slot contains G1 or G2), we use a 32b counter to
+   designate the start position of G2 (inclusive), and a single bit in the
    waiter sequence counter to represent which group slot currently contains
    G2.  This allows us to switch group roles atomically wrt. waiters obtaining
-   a position in the waiter sequence.  The G1 start position allows waiters to
-   figure out whether they are in a group that has already been completely
-   signaled (i.e., if the current G1 starts at a later position that the
-   waiter's position).  Waiters cannot determine whether they are currently
-   in G2 or G1 -- but they do not have too because all they are interested in
-   is whether there are available signals, and they always start in G2 (whose
-   group slot they know because of the bit in the waiter sequence.  Signalers
-   will simply fill the right group until it is completely signaled and can
-   be closed (they do not switch group roles until they really have to to
-   decrease the likelihood of having to wait for waiters still holding a
-   reference on the now-closed G1).
-
-   Signalers maintain the initial size of G1 to be able to determine where
-   G2 starts (G2 is always open-ended until it becomes G1).  They track the
-   remaining size of a group; when waiters cancel waiting (due to PThreads
-   cancellation or timeouts), they will decrease this remaining size as well.
+   a position in the waiter sequence.  Waiters cannot determine whether they
+   are currently in G2 or G1 -- but they do not have to, and they always start
+   in G2 (whose group slot they know because of the bit in the waiter sequence.
+   Signalers will simply fill the right group until it is completely signaled
+   and can be closed.
+
+   Signallers write the current value of __wseq to __gseq. This indicates that
+   any waiter before that is allowed to wake up, if they are in the right
+   group.  G1 will be closed when a futex_wake did not wake any waiter.
+
+   Since we only use 31 bits of a 32b counter, we have to be aware of overflow
+   and ABA problems. Overflow is handled by doing an unsigned subtraction
+   before comparison. This will return the wrong result if there were
+   (1 << 30) calls to pthread_cond_wait in the time after a waiter increments
+   __wseq and before the comparison to __gseq. The biggest work we do in that
+   scope is unlocking the mutex. Since we don't do much more, it seems
+   unlikely that a billion waits could be scheduled before that waiter gets
+   scheduled again. So we ignore this possibility.
 
    To implement condvar destruction requirements (i.e., that
    pthread_cond_destroy can be called as soon as all waiters have been
@@ -279,13 +141,9 @@  __condvar_cleanup_waiting (void *arg)
      * LSB is index of current G2.
      * Waiters fetch-add while having acquire the mutex associated with the
        condvar.  Signalers load it and fetch-xor it concurrently.
-   __g1_start: Starting position of G1 (inclusive)
-     * LSB is index of current G2.
-     * Modified by signalers while having acquired the condvar-internal lock
-       and observed concurrently by waiters.
-   __g1_orig_size: Initial size of G1
-     * The two least-significant bits represent the condvar-internal lock.
-     * Only accessed while having acquired the condvar-internal lock.
+   __g2_start: Starting position of G2 (inclusive)
+     * Used by signallers to determine whether to switch groups.
+   __lock: The condvar-internal lock, used by signal and broadcast.
    __wrefs: Waiter reference counter.
      * Bit 2 is true if waiters should run futex_wake when they remove the
        last reference.  pthread_cond_destroy uses this as futex word.
@@ -295,64 +153,15 @@  __condvar_cleanup_waiting (void *arg)
      (If the format of __wrefs is changed, update nptl_lock_constants.pysym
       and the pretty printers.)
    For each of the two groups, we have:
-   __g_refs: Futex waiter reference count.
-     * LSB is true if waiters should run futex_wake when they remove the
-       last reference.
-     * Reference count used by waiters concurrently with signalers that have
-       acquired the condvar-internal lock.
-   __g_signals: The number of signals that can still be consumed.
+   __gseq: The value of __wseq at the time of the last signal to this group
      * Used as a futex word by waiters.  Used concurrently by waiters and
        signalers.
-     * LSB is true iff this group has been completely signaled (i.e., it is
-       closed).
-   __g_size: Waiters remaining in this group (i.e., which have not been
-     signaled yet.
-     * Accessed by signalers and waiters that cancel waiting (both do so only
-       when having acquired the condvar-internal lock.
-     * The size of G2 is always zero because it cannot be determined until
-       the group becomes G1.
-     * Although this is of unsigned type, we rely on using unsigned overflow
-       rules to make this hold effectively negative values too (in
-       particular, when waiters in G2 cancel waiting).
 
    A PTHREAD_COND_INITIALIZER condvar has all fields set to zero, which yields
    a condvar that has G2 starting at position 0 and a G1 that is closed.
 
-   Because waiters do not claim ownership of a group right when obtaining a
-   position in __wseq but only reference count the group when using futexes
-   to block, it can happen that a group gets closed before a waiter can
-   increment the reference count.  Therefore, waiters have to check whether
-   their group is already closed using __g1_start.  They also have to perform
-   this check when spinning when trying to grab a signal from __g_signals.
-   Note that for these checks, using relaxed MO to load __g1_start is
-   sufficient because if a waiter can see a sufficiently large value, it could
-   have also consume a signal in the waiters group.
-
-   Waiters try to grab a signal from __g_signals without holding a reference
-   count, which can lead to stealing a signal from a more recent group after
-   their own group was already closed.  They cannot always detect whether they
-   in fact did because they do not know when they stole, but they can
-   conservatively add a signal back to the group they stole from; if they
-   did so unnecessarily, all that happens is a spurious wake-up.  To make this
-   even less likely, __g1_start contains the index of the current g2 too,
-   which allows waiters to check if there aliasing on the group slots; if
-   there wasn't, they didn't steal from the current G1, which means that the
-   G1 they stole from must have been already closed and they do not need to
-   fix anything.
-
-   It is essential that the last field in pthread_cond_t is __g_signals[1]:
-   The previous condvar used a pointer-sized field in pthread_cond_t, so a
-   PTHREAD_COND_INITIALIZER from that condvar implementation might only
-   initialize 4 bytes to zero instead of the 8 bytes we need (i.e., 44 bytes
-   in total instead of the 48 we need).  __g_signals[1] is not accessed before
-   the first group switch (G2 starts at index 0), which will set its value to
-   zero after a harmless fetch-or whose return value is ignored.  This
-   effectively completes initialization.
-
 
    Limitations:
-   * This condvar isn't designed to allow for more than
-     __PTHREAD_COND_MAX_GROUP_SIZE * (1 << 31) calls to __pthread_cond_wait.
    * More than __PTHREAD_COND_MAX_GROUP_SIZE concurrent waiters are not
      supported.
    * Beyond what is allowed as errors by POSIX or documented, we can also
@@ -379,7 +188,6 @@  static __always_inline int
 __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
     clockid_t clockid, const struct __timespec64 *abstime)
 {
-  const int maxspin = 0;
   int err;
   int result = 0;
 
@@ -396,13 +204,11 @@  __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
      because we do not need to establish any happens-before relation with
      signalers (see __pthread_cond_signal); modification order alone
      establishes a total order of waiters/signals.  We do need acquire MO
-     to synchronize with group reinitialization in
-     __condvar_quiesce_and_switch_g1.  */
-  uint64_t wseq = __condvar_fetch_add_wseq_acquire (cond, 2);
+     to synchronize with group switch in ___pthread_cond_signal.  */
+  unsigned int wseq = atomic_fetch_add_acquire (&cond->__data.__wseq, 2);
   /* Find our group's index.  We always go into what was G2 when we acquired
      our position.  */
   unsigned int g = wseq & 1;
-  uint64_t seq = wseq >> 1;
 
   /* Increase the waiter reference count.  Relaxed MO is sufficient because
      we only need to synchronize when decrementing the reference count.  */
@@ -419,184 +225,40 @@  __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
   err = __pthread_mutex_unlock_usercnt (mutex, 0);
   if (__glibc_unlikely (err != 0))
     {
-      __condvar_cancel_waiting (cond, seq, g, private);
       __condvar_confirm_wakeup (cond, private);
       return err;
     }
 
-  /* Now wait until a signal is available in our group or it is closed.
-     Acquire MO so that if we observe a value of zero written after group
-     switching in __condvar_quiesce_and_switch_g1, we synchronize with that
-     store and will see the prior update of __g1_start done while switching
-     groups too.  */
-  unsigned int signals = atomic_load_acquire (cond->__data.__g_signals + g);
+  /* Load signals.  Acquire MO to synchronize with the release in
+     pthread_cond_signal or pthread_cond_broadcast.  */
+  unsigned int gseq = atomic_load_acquire (cond->__data.__gseq + g);
 
-  do
+  /* Compare (gseq <= wseq) while being robust to overflow. This would be
+     false if someone signaled after we incremented wseq, above.  */
+  if ((int)(gseq - wseq) <= 0)
     {
-      while (1)
-	{
-	  /* Spin-wait first.
-	     Note that spinning first without checking whether a timeout
-	     passed might lead to what looks like a spurious wake-up even
-	     though we should return ETIMEDOUT (e.g., if the caller provides
-	     an absolute timeout that is clearly in the past).  However,
-	     (1) spurious wake-ups are allowed, (2) it seems unlikely that a
-	     user will (ab)use pthread_cond_wait as a check for whether a
-	     point in time is in the past, and (3) spinning first without
-	     having to compare against the current time seems to be the right
-	     choice from a performance perspective for most use cases.  */
-	  unsigned int spin = maxspin;
-	  while (signals == 0 && spin > 0)
-	    {
-	      /* Check that we are not spinning on a group that's already
-		 closed.  */
-	      if (seq < (__condvar_load_g1_start_relaxed (cond) >> 1))
-		goto done;
-
-	      /* TODO Back off.  */
-
-	      /* Reload signals.  See above for MO.  */
-	      signals = atomic_load_acquire (cond->__data.__g_signals + g);
-	      spin--;
-	    }
-
-	  /* If our group will be closed as indicated by the flag on signals,
-	     don't bother grabbing a signal.  */
-	  if (signals & 1)
-	    goto done;
-
-	  /* If there is an available signal, don't block.  */
-	  if (signals != 0)
-	    break;
-
-	  /* No signals available after spinning, so prepare to block.
-	     We first acquire a group reference and use acquire MO for that so
-	     that we synchronize with the dummy read-modify-write in
-	     __condvar_quiesce_and_switch_g1 if we read from that.  In turn,
-	     in this case this will make us see the closed flag on __g_signals
-	     that designates a concurrent attempt to reuse the group's slot.
-	     We use acquire MO for the __g_signals check to make the
-	     __g1_start check work (see spinning above).
-	     Note that the group reference acquisition will not mask the
-	     release MO when decrementing the reference count because we use
-	     an atomic read-modify-write operation and thus extend the release
-	     sequence.  */
-	  atomic_fetch_add_acquire (cond->__data.__g_refs + g, 2);
-	  if (((atomic_load_acquire (cond->__data.__g_signals + g) & 1) != 0)
-	      || (seq < (__condvar_load_g1_start_relaxed (cond) >> 1)))
-	    {
-	      /* Our group is closed.  Wake up any signalers that might be
-		 waiting.  */
-	      __condvar_dec_grefs (cond, g, private);
-	      goto done;
-	    }
-
-	  // Now block.
-	  struct _pthread_cleanup_buffer buffer;
-	  struct _condvar_cleanup_buffer cbuffer;
-	  cbuffer.wseq = wseq;
-	  cbuffer.cond = cond;
-	  cbuffer.mutex = mutex;
-	  cbuffer.private = private;
-	  __pthread_cleanup_push (&buffer, __condvar_cleanup_waiting, &cbuffer);
-
-	  err = __futex_abstimed_wait_cancelable64 (
-	    cond->__data.__g_signals + g, 0, clockid, abstime, private);
-
-	  __pthread_cleanup_pop (&buffer, 0);
-
-	  if (__glibc_unlikely (err == ETIMEDOUT || err == EOVERFLOW))
-	    {
-	      __condvar_dec_grefs (cond, g, private);
-	      /* If we timed out, we effectively cancel waiting.  Note that
-		 we have decremented __g_refs before cancellation, so that a
-		 deadlock between waiting for quiescence of our group in
-		 __condvar_quiesce_and_switch_g1 and us trying to acquire
-		 the lock during cancellation is not possible.  */
-	      __condvar_cancel_waiting (cond, seq, g, private);
-	      result = err;
-	      goto done;
-	    }
-	  else
-	    __condvar_dec_grefs (cond, g, private);
-
-	  /* Reload signals.  See above for MO.  */
-	  signals = atomic_load_acquire (cond->__data.__g_signals + g);
-	}
-
-    }
-  /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
-     of __g1_start below (see spinning above for a similar case).  In
-     particular, if we steal from a more recent group, we will also see a
-     more recent __g1_start below.  */
-  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
-						&signals, signals - 2));
-
-  /* We consumed a signal but we could have consumed from a more recent group
-     that aliased with ours due to being in the same group slot.  If this
-     might be the case our group must be closed as visible through
-     __g1_start.  */
-  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
-  if (seq < (g1_start >> 1))
-    {
-      /* We potentially stole a signal from a more recent group but we do not
-	 know which group we really consumed from.
-	 We do not care about groups older than current G1 because they are
-	 closed; we could have stolen from these, but then we just add a
-	 spurious wake-up for the current groups.
-	 We will never steal a signal from current G2 that was really intended
-	 for G2 because G2 never receives signals (until it becomes G1).  We
-	 could have stolen a signal from G2 that was conservatively added by a
-	 previous waiter that also thought it stole a signal -- but given that
-	 that signal was added unnecessarily, it's not a problem if we steal
-	 it.
-	 Thus, the remaining case is that we could have stolen from the current
-	 G1, where "current" means the __g1_start value we observed.  However,
-	 if the current G1 does not have the same slot index as we do, we did
-	 not steal from it and do not need to undo that.  This is the reason
-	 for putting a bit with G2's index into__g1_start as well.  */
-      if (((g1_start & 1) ^ 1) == g)
-	{
-	  /* We have to conservatively undo our potential mistake of stealing
-	     a signal.  We can stop trying to do that when the current G1
-	     changes because other spinning waiters will notice this too and
-	     __condvar_quiesce_and_switch_g1 has checked that there are no
-	     futex waiters anymore before switching G1.
-	     Relaxed MO is fine for the __g1_start load because we need to
-	     merely be able to observe this fact and not have to observe
-	     something else as well.
-	     ??? Would it help to spin for a little while to see whether the
-	     current G1 gets closed?  This might be worthwhile if the group is
-	     small or close to being closed.  */
-	  unsigned int s = atomic_load_relaxed (cond->__data.__g_signals + g);
-	  while (__condvar_load_g1_start_relaxed (cond) == g1_start)
-	    {
-	      /* Try to add a signal.  We don't need to acquire the lock
-		 because at worst we can cause a spurious wake-up.  If the
-		 group is in the process of being closed (LSB is true), this
-		 has an effect similar to us adding a signal.  */
-	      if (((s & 1) != 0)
-		  || atomic_compare_exchange_weak_relaxed
-		       (cond->__data.__g_signals + g, &s, s + 2))
-		{
-		  /* If we added a signal, we also need to add a wake-up on
-		     the futex.  We also need to do that if we skipped adding
-		     a signal because the group is being closed because
-		     while __condvar_quiesce_and_switch_g1 could have closed
-		     the group, it might stil be waiting for futex waiters to
-		     leave (and one of those waiters might be the one we stole
-		     the signal from, which cause it to block using the
-		     futex).  */
-		  futex_wake (cond->__data.__g_signals + g, 1, private);
-		  break;
-		}
-	      /* TODO Back off.  */
-	    }
-	}
+      /* Need to block on the futex until someone signals.  */
+      struct _pthread_cleanup_buffer buffer;
+      struct _condvar_cleanup_buffer cbuffer;
+      cbuffer.cond = cond;
+      cbuffer.mutex = mutex;
+      cbuffer.private = private;
+      __pthread_cleanup_push (&buffer, __condvar_cleanup_waiting, &cbuffer);
+
+      /* Sleep. If we get woken by an interrupt, go back to sleep. Wake in
+	 all other cases. Note that even after an interrupt we won't go back
+	 to sleep if __gseq changed.  */
+      do
+	err = __futex_abstimed_wait_cancelable64 (
+	  cond->__data.__gseq + g, gseq, clockid, abstime, private);
+      while (err == EINTR);
+
+      __pthread_cleanup_pop (&buffer, 0);
+
+      if (__glibc_unlikely (err == ETIMEDOUT || err == EOVERFLOW))
+	result = err;
     }
 
- done:
-
   /* Confirm that we have been woken.  We do that before acquiring the mutex
      to allow for execution of pthread_cond_destroy while having acquired the
      mutex.  */
diff --git a/nptl/tst-cond22.c b/nptl/tst-cond22.c
index 1336e9c79d..479b5e6daa 100644
--- a/nptl/tst-cond22.c
+++ b/nptl/tst-cond22.c
@@ -106,14 +106,13 @@  do_test (void)
       status = 1;
     }
 
-  printf ("cond = { 0x%x:%x, 0x%x:%x, %u/%u/%u, %u/%u/%u, %u, %u }\n",
-	  c.__data.__wseq.__value32.__high,
-	  c.__data.__wseq.__value32.__low,
-	  c.__data.__g1_start.__value32.__high,
-	  c.__data.__g1_start.__value32.__low,
-	  c.__data.__g_signals[0], c.__data.__g_refs[0], c.__data.__g_size[0],
-	  c.__data.__g_signals[1], c.__data.__g_refs[1], c.__data.__g_size[1],
-	  c.__data.__g1_orig_size, c.__data.__wrefs);
+  printf ("cond = { %u, %u, %u, %u, %u, %u }\n",
+	  c.__data.__wseq,
+	  c.__data.__g2_start,
+	  c.__data.__gseq[0],
+	  c.__data.__gseq[1],
+	  c.__data.__lock,
+	  c.__data.__wrefs);
 
   if (pthread_create (&th, NULL, tf, (void *) 1l) != 0)
     {
@@ -152,14 +151,13 @@  do_test (void)
       status = 1;
     }
 
-  printf ("cond = { 0x%x:%x, 0x%x:%x, %u/%u/%u, %u/%u/%u, %u, %u }\n",
-	  c.__data.__wseq.__value32.__high,
-	  c.__data.__wseq.__value32.__low,
-	  c.__data.__g1_start.__value32.__high,
-	  c.__data.__g1_start.__value32.__low,
-	  c.__data.__g_signals[0], c.__data.__g_refs[0], c.__data.__g_size[0],
-	  c.__data.__g_signals[1], c.__data.__g_refs[1], c.__data.__g_size[1],
-	  c.__data.__g1_orig_size, c.__data.__wrefs);
+  printf ("cond = { %u, %u, %u, %u, %u, %u }\n",
+	  c.__data.__wseq,
+	  c.__data.__g2_start,
+	  c.__data.__gseq[0],
+	  c.__data.__gseq[1],
+	  c.__data.__lock,
+	  c.__data.__wrefs);
 
   return status;
 }
diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
index 5653507e55..e2f1036af4 100644
--- a/sysdeps/nptl/bits/thread-shared-types.h
+++ b/sysdeps/nptl/bits/thread-shared-types.h
@@ -93,13 +93,16 @@  typedef struct __pthread_internal_slist
 
 struct __pthread_cond_s
 {
-  __atomic_wide_counter __wseq;
-  __atomic_wide_counter __g1_start;
-  unsigned int __g_refs[2] __LOCK_ALIGNMENT;
-  unsigned int __g_size[2];
-  unsigned int __g1_orig_size;
+  unsigned int __wseq;
+  unsigned int __g2_start;
+  unsigned int __gseq[2];
+  unsigned int __lock __LOCK_ALIGNMENT;
   unsigned int __wrefs;
-  unsigned int __g_signals[2];
+  /* Reserve some bytes for future changes. It's difficult to know how many we
+     might need, but the smallest previous implementation used 44 bytes, so
+     future versions can safely rely on that many bytes being initialized. So
+     we reserve the same number.  */
+  unsigned int __unused_padding[5];
 };
 
 typedef unsigned int __tss_t;
diff --git a/sysdeps/nptl/futex-internal.h b/sysdeps/nptl/futex-internal.h
index cf54c27ed9..a3d9df8143 100644
--- a/sysdeps/nptl/futex-internal.h
+++ b/sysdeps/nptl/futex-internal.h
@@ -203,13 +203,13 @@  futex_abstimed_supported_clockid (clockid_t clockid)
    has potentially been reused due to POSIX' requirements on synchronization
    object destruction (see above); therefore, we must not report or abort
    on most errors.  */
-static __always_inline void
+static __always_inline int
 futex_wake (unsigned int* futex_word, int processes_to_wake, int private)
 {
   int res = lll_futex_wake (futex_word, processes_to_wake, private);
-  /* No error.  Ignore the number of woken processes.  */
+  /* No error.  */
   if (res >= 0)
-    return;
+    return res;
   switch (res)
     {
     case -EFAULT: /* Could have happened due to memory reuse.  */
@@ -218,7 +218,7 @@  futex_wake (unsigned int* futex_word, int processes_to_wake, int private)
 		     reused for a PI futex.  We cannot distinguish between the
 		     two causes, and one of them is correct use, so we do not
 		     act in this case.  */
-      return;
+      return 0;
     case -ENOSYS: /* Must have been caused by a glibc bug.  */
     /* No other errors are documented at this time.  */
     default:
diff --git a/sysdeps/nptl/lowlevellock-futex.h b/sysdeps/nptl/lowlevellock-futex.h
index eaf1f19168..02d0939f31 100644
--- a/sysdeps/nptl/lowlevellock-futex.h
+++ b/sysdeps/nptl/lowlevellock-futex.h
@@ -62,11 +62,20 @@ 
      ? -INTERNAL_SYSCALL_ERRNO (__ret) : 0);                     	\
   })
 
+# define lll_futex_syscall_positive(nargs, futexp, op, ...)                      \
+  ({                                                                    \
+    long int __ret = INTERNAL_SYSCALL (futex, nargs, futexp, op,  \
+               __VA_ARGS__);                    \
+    (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (__ret))          \
+     ? -INTERNAL_SYSCALL_ERRNO (__ret) : __ret);                      \
+  })
+
 /* For most of these macros, the return value is never really used.
    Nevertheless, the protocol is that each one returns a negated errno
-   code for failure or zero for success.  (Note that the corresponding
-   Linux system calls can sometimes return positive values for success
-   cases too.  We never use those values.)  */
+   code for failure or zero for success.  Only futex_wake will return
+   a positive value, the number of woken waiters. (Note that the
+   other Linux system calls can sometimes return positive values for
+   success cases too.  We never use those values.) */
 
 
 /* Wait while *FUTEXP == VAL for an lll_futex_wake call on FUTEXP.  */
@@ -85,8 +94,8 @@ 
 
 /* Wake up up to NR waiters on FUTEXP.  */
 # define lll_futex_wake(futexp, nr, private)                             \
-  lll_futex_syscall (4, futexp,                                         \
-		     __lll_private_flag (FUTEX_WAKE, private), nr, 0)
+  lll_futex_syscall_positive (4, futexp,                                         \
+         __lll_private_flag (FUTEX_WAKE, private), nr, 0)
 
 /* Wake up up to NR_WAKE waiters on FUTEXP.  Move up to NR_MOVE of the
    rest from waiting on FUTEXP to waiting on MUTEX (a different futex).
diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
index dedad4ec86..dae319a19c 100644
--- a/sysdeps/nptl/pthread.h
+++ b/sysdeps/nptl/pthread.h
@@ -152,7 +152,7 @@  enum
 
 
 /* Conditional variable handling.  */
-#define PTHREAD_COND_INITIALIZER { { {0}, {0}, {0, 0}, {0, 0}, 0, 0, {0, 0} } }
+#define PTHREAD_COND_INITIALIZER { { 0, 0, {0, 0}, 0, 0, { 0, 0, 0, 0, 0 } } }
 
 
 /* Cleanup buffers */