New condvar implementation that provides stronger ordering guarantees.

Message ID 1424456307.20941.122.camel@triegel.csb
State Dropped

Commit Message

Torvald Riegel Feb. 20, 2015, 6:18 p.m. UTC
  TLDR: This is a new implementation for condition variables, required
after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
essence, we need to be stricter in which waiters a signal or broadcast
is required to wake up; this couldn't be solved using the old algorithm.
ISO C++ made a similar clarification, so this also fixes a bug in
current libstdc++, for example.
This doesn't contain the existing requeue optimization (see below), and
PI support is not worse than before (except for the lack of requeue).


Arch maintainers: This adapts each arch's definition of pthread_cond_t.
Only x86, x86_64, and hppa have significant arch-specific changes.  I'd
appreciate review considering we want to stay compatible with old
initializers, and we don't want existing padding to alias with bytes
that are now used for real fields (see text on pthread_cond_t below for
details).


And here's the detailed version :)

We can't use the old algorithm, which tries to avoid spurious wake-ups,
anymore because there's no way (AFAICT) to wake in FIFO order from a
futex (the current Linux implementation may do so today, but it's not
guaranteed).  Thus, when we wake, we can't simply let any thread grab a
signal; we need to ensure that one of the waiters that started waiting
before the signal is woken up -- not just any waiter.  This is something
the previous algorithm violated (see bug 13165).
The problem with having to wake in order while also trying to prevent
spurious wake-ups is that one would have to encode the order, and that
needs space (e.g., for separate futexes).  But pthread_cond_t is limited
in space, and we can't use external space for process-shared condvars.

The primary reason for spurious wake-ups with this new algorithm is the
following: if we have waiters W1 and W2, and W2 registers itself as a
waiter later than W1 does, but W2 blocks on the futex earlier than W1,
then a signal will wake both W1 and W2.  IOW, this happens when one
waiter races ahead of an earlier one when doing futex_wait.  Once they
both wait, or if they keep their order, there are no spurious wake-ups.

We could avoid more of these spurious wake-ups by maintaining at least
two groups of waiters that approximate the waiter order (e.g., one group
that is eligible for wake-up and is drained with priority, and a second,
catch-all group that becomes the new first group once the first is
drained).  But this would add substantial complexity to the algorithm,
and it may be a tight fit into the size of pthread_cond_t we have today.


There's another issue specific to condvars: ABA issues on the underlying
futexes.  Unlike mutexes that have just three states, or semaphores that
have no tokens or a limited number of them, the state of a condvar is
the *order* of the waiters.  With a semaphore, we can grab whenever
there is a token, and block in the futex_wait when there is none.  With
condvars, a waiter must not block if there had been more
signals/broadcasts than waiters before it (ie, ordering matters).

Futex words in userspace (ie, those memory locations that control
whether futex_wait blocks or not) are 32b.  Therefore, we can have ABA
issues on them, which could lead to lost wake-ups because a waiter simply
can't distinguish between no new signals being sent and lots of signals
being sent (2^31 in this implementation).
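
To make the ABA issue concrete, here's a minimal sketch of how blocking
on a 32-bit futex word works (purely illustrative; the futex_wait
wrapper and the signals_sent counter below are assumptions for the
example, not code from the patch):

  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <stddef.h>

  /* Thin wrapper around the futex syscall: block only while *ADDR still
     equals EXPECTED.  */
  static long
  futex_wait (unsigned int *addr, unsigned int expected)
  {
    return syscall (SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
  }

  /* Hypothetical 32-bit counter of signals sent so far.  */
  static unsigned int signals_sent;

  void
  waiter_block_sketch (unsigned int observed)
  {
    /* The kernel blocks us only if signals_sent still equals OBSERVED.
       If the counter has wrapped around back to OBSERVED in the meantime
       (a multiple of 2^32 signals for a raw 32-bit word, or 2^31 with
       the counter layout used by this patch), we block even though we
       should have consumed one of those signals -- a lost wake-up caused
       purely by the limited width of the futex word.  */
    futex_wait (&signals_sent, observed);
  }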

It might be unlikely that this occurs, and it needs a specific
scenario, but I'm not comfortable with knowingly ignoring it.
Therefore, this patch avoids the ABA issues by quiescing the condvar
before an overflow on the internal counters for the number of waiters /
signals sent happens, and then resetting the condvar.  On non-PI
condvars, this just increases the number of spurious wake-ups while we
do the reset (but it is a problem for PI; see below).  The quiescence
handling does add complexity, but it seems not excessive relative to
the rest of the algorithm.


This algorithm satisfies the equivalent of the strong mutex destruction
guarantee.  However, unlike with mutexes, because spurious wake-ups are
allowed, a correct program effectively has to ensure that destruction
happens after signal/broadcast calls return.  Thus, the destruction
requirement in POSIX is not as restrictive as with semaphores, but we
still need to take care.
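
For illustration, here's one pattern (my own example, not code from the
patch) where destroying the condvar right after pthread_cond_wait
returns is safe: because the signaler publishes the predicate and
signals while holding the mutex, the waiter knows the signal call has
already returned once it observes the predicate under the mutex:

  #include <pthread.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
  static int done;

  /* Signaler: sets DONE and signals inside the critical section.  */
  void *
  signaler (void *arg)
  {
    pthread_mutex_lock (&m);
    done = 1;
    pthread_cond_signal (&c);   /* Returns before the unlock below.  */
    pthread_mutex_unlock (&m);
    return arg;
  }

  /* Waiter: once it sees DONE while holding the mutex, the signal call
     above has already returned (signal happened before the signaler's
     unlock, which happened before we re-acquired the mutex), so
     destroying the condvar here is safe even if our wake-up was
     spurious.  */
  void *
  waiter (void *arg)
  {
    pthread_mutex_lock (&m);
    while (!done)
      pthread_cond_wait (&c, &m);
    pthread_mutex_unlock (&m);
    pthread_cond_destroy (&c);
    return arg;
  }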


If you want to dive into the code, it's best to start with the comments
on top of __pthread_cond_wait_common in nptl/pthread_cond_wait.c.

One notable difference from the previous implementation is that the new
code doesn't use an internal lock anymore.  This simplifies the PI
implementation (see below), and should speed up things like concurrent
signals/broadcasts, and the general hand-off between wait and
signal/broadcast.

I've merged pthread_cond_timedwait.c into pthread_cond_wait.c because
they both share most of the code.  __pthread_cond_wait_common is
currently an always_inline, but we might also make that a noinline if
you're concerned over statically linked programs that use either the
timed or the non-timed cond_wait variant.

pthread_cond_t is the same on all archs (except on hppa, see below, and
m68k which enforces 4-byte alignment of the first int).  The new
algorithm needs less space (int instead of long long int in the case of
three fields), so the old initializers should remain working.  The x86
version looks fine to me, but I'd appreciate another set of eyes
on this aspect.  We don't need stronger alignment for the new algorithm.

Carlos: the hppa changes are completely untested.  They change the
pthread-once-like code to C11 atomics, which fixes one missing compiler
barrier (acquire load was a plain load).  Let me know if you object to
these.

x86 had custom assembly implementations.  Given that this patch fixes a
correctness problem, I've just removed them.  Additionally, there's no
real fast path in cond_wait, unless perhaps we want to consider just
spinning until a signal becomes available as a fast path; in all other
cases, we have to wait, so that's cache misses at least.  signal and
broadcast could be considered fast paths; the new code doesn't use an
internal lock anymore, so they should have become faster (e.g.,
cond_signal is now just a CAS loop and a call to futex_wake, which we
might avoid too with some more complexity).
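
Here's a rough sketch of what "just a CAS loop plus a futex wake" can
look like (illustrative only; the struct layout, the wseq /
signals_sent field names, and the futex_wake wrapper are assumptions
for the example, not the actual pthread_cond_t layout or glibc code):

  #include <stdatomic.h>
  #include <stdint.h>

  /* Hypothetical condvar fields, for illustration only.  */
  struct cv
  {
    _Atomic uint32_t wseq;          /* Number of waiters that registered.  */
    _Atomic uint32_t signals_sent;  /* Number of signals sent; futex word.  */
  };

  /* Assumed wrapper around futex (FUTEX_WAKE): wake up to NR threads
     blocked on ADDR.  */
  void futex_wake (void *addr, int nr);

  void
  cv_signal_sketch (struct cv *c)
  {
    uint32_t s = atomic_load_explicit (&c->signals_sent,
                                       memory_order_relaxed);
    for (;;)
      {
        uint32_t w = atomic_load_explicit (&c->wseq, memory_order_relaxed);
        if (s >= w)
          return;  /* Every registered waiter already has a signal.  */
        /* Account for one more signal; on failure, S is reloaded.  */
        if (atomic_compare_exchange_weak_explicit (&c->signals_sent, &s,
                                                   s + 1,
                                                   memory_order_release,
                                                   memory_order_relaxed))
          break;
      }
    /* Wake one thread blocked on the signals_sent futex word.  */
    futex_wake (&c->signals_sent, 1);
  }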

There are three new tests, cond26-cond28, which are variations of
existing tests that frequently drive a condvar into the quiescence state
(and thus test quiescence).


This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
waking just one thread and requeueing all others on the futex of the
mutex supplied by the program).  I don't think doing the requeue is
necessarily the right approach (but I haven't done real measurements
yet):
* If a program expects to wake many threads at the same time and make
that scalable, a condvar isn't great anyway because it requires
waiters to operate mutually exclusively (due to the mutex usage).  Thus, a
thundering herd problem is a scalability problem with or without the
optimization.  Using something like a semaphore might be more
appropriate in such a case.
* The scalability problem is actually at the mutex side; the condvar
could help (and it tries to with the requeue optimization), but it
should be the mutex that decides how that is done, and whether it is done
at all.
* Forcing all but one waiter into the kernel-side wait queue of the
mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
prevents the only cure against the underlying scalability problem
inherent to condvars.
* If condvars use short critical sections (ie, hold the mutex just to
check a binary flag or such), which they ideally should, then forcing
all those waiters to proceed serially with kernel-based hand-off (ie,
futex ops in the mutex's contended state, via the futex wait queues) will
be less efficient than just letting a scalable mutex implementation take
care of it.  Our current mutex impl doesn't employ spinning at all, but
if critical sections are short, spinning can be much better.
* Doing the requeue stuff requires all waiters to always drive the mutex
into the contended state.  This leads to each waiter having to call
futex_wake after lock release, even if this wouldn't be necessary.

Therefore, this condvar doesn't do requeue currently.  I'd like to get
your opinion on this.
Once we agree on a way forward, I'd either (1) adapt the condvar to use
requeue or (2) adapt the _cond_ variants of the lowlevellock and
pthread_mutex_* to not always drive the mutex into the contended state.
Here's a sketch of how we could implement requeue (IOW, make sure we
don't requeue to the wrong mutex):
* Use one bit (NEW_MUTEX_BIT or such) in signals_sent as a flag for
whether the mutex associated with the condvar changed.  Add proper
masking of it, adapt WSEQ_THRESHOLD accordingly, etc.
* Let waiters do this:
  if (mutex != cond->mutex) {
    atomic_store_relaxed (&cond->mutex, mutex);
    atomic_fetch_or_release (&cond->signals_sent, NEW_MUTEX_BIT);
  }
  futex_wait (...);
* Let broadcast do:
  s = atomic_fetch_add_acquire (&cond->signals_sent, signals_to_add);
  if (s & NEW_MUTEX_BIT) /* reset the bit with a CAS, retry; */
  m = atomic_load_relaxed (&cond->mutex);
  futex_cmp_requeue (..., s + signals_to_add /* expected value */,
     ..., m /* target mutex */);
This would ensure that if a waiter using a new mutex does its futex_wait,
broadcast will either pick up the new mutex or futex_cmp_requeue will fail
(it will see the signals_sent update because of futex op ordering).
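
For reference, a plain wrapper around the FUTEX_CMP_REQUEUE operation
used in this sketch could look as follows (illustrative only; not part
of the patch).  The kernel wakes up to NR_WAKE waiters on FUTEX,
requeues up to NR_REQUEUE of the remaining ones onto TARGET, and fails
with EAGAIN if *FUTEX no longer equals EXPECTED -- which is exactly the
check the broadcast sketch above relies on:

  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static long
  futex_cmp_requeue (unsigned int *futex, int nr_wake, int nr_requeue,
                     unsigned int *target, unsigned int expected)
  {
    /* NR_REQUEUE is passed in the timeout argument slot, as the futex
       syscall defines for this operation.  */
    return syscall (SYS_futex, futex, FUTEX_CMP_REQUEUE, nr_wake,
                    nr_requeue, target, expected);
  }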


PI support is "kind of" included.  There is no internal lock anymore, so
the thing that Darren proposed the fix for is gone.  So far so good;
however, we don't requeue, and Darren's paper states that requeue would
yield better latency in the PI scenario (is this still the case?).

Darren, I'd propose that we figure out how to best adapt this new
implementation to do PI.  I'm looking forward to your comments.

One thing I don't think we'll be able to solve is ensuring PI during
quiescence.  When doing quiescence, we need all waiters to leave the
condvar, ie, to confirm their wake-up.  I can't see a way of boosting
their priorities if they get suspended between releasing the mutex and
starting to wait; there's no app lock they still hold, and we can't give
waiters per-waiter locks linked from the condvar that we could use to
boost individual threads because this doesn't work in the process-shared
case.  Maybe there's something I'm missing (but I thought a while about
it ;), and maybe there's some PI-futex functionality that I wasn't aware
of that we could (ab)use.
Thus, the most practical approach seems to be to just not do any PI
during quiescence (ie, every 2^31 cond_wait calls).  Any alternative
suggestions?

This problem would go away if we had 64b futexes because then we
wouldn't need quiescence anymore (assuming 64b counters avoid ABA).


Tested on x86_64-linux and x86-linux.


2015-02-20  Torvald Riegel  <triegel@redhat.com>

	[BZ #13165]
	* nptl/pthread_cond_broadcast.c (__pthread_cond_broadcast): Rewrite to
	use new algorithm.
	* nptl/pthread_cond_destroy.c (__pthread_cond_destroy): Likewise.
	* nptl/pthread_cond_init.c (__pthread_cond_init): Likewise.
	* nptl/pthread_cond_signal.c (__pthread_cond_signal): Likewise.
	* nptl/pthread_cond_wait.c (__pthread_cond_wait): Likewise.
	(__pthread_cond_timedwait): Move here from pthread_cond_timedwait.c.
	(__condvar_confirm_wakeup, __condvar_cancel_waiting,
	__condvar_cleanup_waiting, __condvar_cleanup_quiescence,
	__pthread_cond_wait_common): New.
	(__condvar_cleanup): Remove.
	* nptl/pthread_condattr_getclock.c (pthread_condattr_getclock): Adapt.
	* nptl/pthread_condattr_setclock.c (pthread_condattr_setclock): Likewise.
	* nptl/tst-cond1.c: Add comment.
	* nptl/tst-cond18.c (tf): Add quiescence testing.
	* nptl/tst-cond20.c (do_test): Likewise.
	* nptl/tst-cond25.c (do_test_wait): Likewise.
	* nptl/tst-cond20.c (do_test): Adapt.
	* nptl/tst-cond26.c: New file.
	* nptl/tst-cond27.c: Likewise.
	* nptl/tst-cond28.c: Likewise.
	* sysdeps/aarch64/nptl/bits/pthreadtypes.h (pthread_cond_t): Adapt
	structure.
	* sysdeps/arm/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/hppa/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/ia64/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/m68k/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/microblaze/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/mips/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/nios2/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/s390/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/sh/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/sparc/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/tile/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h (pthread_cond_t):
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h (pthread_cond_t):
	Likewise.
	* sysdeps/x86/bits/pthreadtypes.h (pthread_cond_t): Likewise.
	* sysdeps/nptl/internaltypes.h (COND_NWAITERS_SHIFT): Remove.
	(COND_CLOCK_BITS): Adapt.
	* sysdeps/nptl/pthread.h (PTHREAD_COND_INITIALIZER): Adapt.
	* sysdeps/unix/sysv/linux/hppa/internaltypes.h (cond_compat_clear,
	cond_compat_check_and_clear): Adapt.
	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Remove file ...
	* sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
	(__pthread_cond_timedwait): ... and move here.
	* nptl/DESIGN-condvar.txt: Remove file.
	* nptl/lowlevelcond.sym: Likewise.
	* nptl/pthread_cond_timedwait.c: Likewise.
	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Likewise.
	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Likewise.
  

Comments

Rich Felker Feb. 20, 2015, 7:20 p.m. UTC | #1
On Fri, Feb 20, 2015 at 07:18:27PM +0100, Torvald Riegel wrote:
> TLDR: This is a new implementation for condition variables, required
> after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
> essence, we need to be stricter in which waiters a signal or broadcast
> is required to wake up; this couldn't be solved using the old algorithm.
> ISO C++ made a similar clarification, so this also fixes a bug in
> current libstdc++, for example.
> This doesn't contain the existing requeue optimization (see below), and
> PI support is not worse (except no requeue).

First of all, I liked your detailed description of the problem and
your solution enough that I tweeted it. Thanks for putting in the
effort to document all this.

> There's another issue specific to condvars: ABA issues on the underlying
> futexes.  Unlike mutexes that have just three states, or semaphores that
> have no tokens or a limited number of them, the state of a condvar is
> the *order* of the waiters.  With a semaphore, we can grab whenever
> there is a token, and block in the futex_wait when there is none.  With
> condvars, a waiter must not block if there had been more
> signals/broadcasts than waiters before it (ie, ordering matters).

I don't quite follow your claim that the order is part of the state.
My interpretation is that the state is merely the _set_ of waiters at
the instant of signal/broadcast. But perhaps thinking of order as
state makes sense if you're looking at this from the standpoint of a
weakly ordered memory model where such an "instant" does not make
sense without either forced serialization or order-as-data. Is this
what you mean?

> This algorithm satisfies the equivalent of the strong mutex destruction
> guarantee.  However, unlike mutexes, because of spurious wake-ups being
> allowed a correct program has to effectively ensure that destruction
> happens after signals/broadcast calls return.  Thus, the destruction
> requirement in POSIX is not as restrictive as with semaphores, but we
> still need to take care.

If the mutex is held when changing the state on which the predicate
depends and signaling, it's possible for a woken waiter to determine
whether a subsequent signal is possible simply from observing the
predicate. So I think there are some situations where destruction
after pthread_cond_wait returns makes sense, but I don't think they're
a problem.

If the signal is made while the mutex is not held, there is absolutely
no way for the woken thread to determine whether the call to signal
has already been made or not, so I agree that it's invalid to destroy
in this case.

> If you want to dive into the code, it's best to start with the comments
> on top of __pthread_cond_wait_common in nptl/pthread_cond_wait.c.

Thanks. I haven't read it yet but I look forward to seeing how you did
this.

> This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
> waking just one thread and requeueing all others on the futex of the
> mutex supplied by the program).  I don't think doing the requeue is
> necessarily the right approach (but I haven't done real measurements
> yet):
> * If a program expects to wake many threads at the same time and make
> that scalable, a condvar isn't great anyway because of how it requires
> waiters to operate mutually exclusive (due to the mutex usage).  Thus, a
> thundering herd problem is a scalability problem with or without the
> optimization.  Using something like a semaphore might be more
> appropriate in such a case.
> * The scalability problem is actually at the mutex side; the condvar
> could help (and it tries to with the requeue optimization), but it
> should be the mutex who decides how that is done, and whether it is done
> at all.
> * Forcing all but one waiter into the kernel-side wait queue of the
> mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> prevents the only cure against the underlying scalability problem
> inherent to condvars.
> * If condvars use short critical sections (ie, hold the mutex just to
> check a binary flag or such), which they should do ideally, then forcing
> all those waiter to proceed serially with kernel-based hand-off (ie,
> futex ops in the mutex' contended state, via the futex wait queues) will
> be less efficient than just letting a scalable mutex implementation take
> care of it.  Our current mutex impl doesn't employ spinning at all, but
> if critical sections are short, spinning can be much better.
> * Doing the requeue stuff requires all waiters to always drive the mutex
> into the contended state.  This leads to each waiter having to call
> futex_wake after lock release, even if this wouldn't be necessary.

This is very interesting.

Unless I'm misremembering, my experience has been that requeue is
helpful not for scaling to massive numbers of threads but when the
number of threads is slightly to moderately over the number of
physical cores. This should probably be tested.

Rich
  
Torvald Riegel Feb. 22, 2015, 2:07 p.m. UTC | #2
On Fri, 2015-02-20 at 14:20 -0500, Rich Felker wrote:
> On Fri, Feb 20, 2015 at 07:18:27PM +0100, Torvald Riegel wrote:
> > There's another issue specific to condvars: ABA issues on the underlying
> > futexes.  Unlike mutexes that have just three states, or semaphores that
> > have no tokens or a limited number of them, the state of a condvar is
> > the *order* of the waiters.  With a semaphore, we can grab whenever
> > there is a token, and block in the futex_wait when there is none.  With
> > condvars, a waiter must not block if there had been more
> > signals/broadcasts than waiters before it (ie, ordering matters).
> 
> I don't quite follow your claim that the order is part of the state.
> My interpretation is that the state is merely the _set_ of waiters at
> the instant of signal/broadcast.

What I mean is that because a futex_wait can execute long after the
respective waiter started to block logically, the state needs to
"remember" which waiters started to block in the past, so that when new
waiters arrive and block more recently, it isn't lost that the old
waiters should get woken up.

I agree that the requirements regarding which waiters should be woken
are about a *set* of eligible waiters (ie, those waiters that blocked
before the resp. signal/broadcast).  One could model the state as the
state of waiters that are still blocked, and remove from this set on
signal/broadcast.  However, to avoid ABA in this scheme, this would mean
that waiters have to reserve a "slot" in this set for the time they
start to block until they have stopped blocking.  That would mean
additional synchronization, and perhaps some waiters blocking others
(eg, due to having just 32b for the futex word).

What I was trying to highlight wrt the differences from mutexes was really
that the order of waiters and signal/broadcast matters, and that it's
not sufficient to model it as whether a signal is available to wake
*some* waiter or not.  Instead, we need to select a representation of
the order of waiters and signals, so that we don't lose information
necessary to fulfill the who's-eligible-for-wake-up requirements.

Abstractly, the requirements regarding wake-up on futexes are weaker
than on condvars.  So we need to do additional stuff to make it work for
condvars based on futexes.  It's different with mutexes, which have the
same weaker requirements regarding which thread is allowed to grab the
lock (unless one would want to build a lock that hands out ownership in
FIFO order to threads...).

> But perhaps thinking of order as
> state makes sense if you're looking at this from the standpoint of a
> weakly ordered memory model where such an "instant" does not make
> sense without either forced serialization or order-as-data. Is this
> what you mean?
> 
> > This algorithm satisfies the equivalent of the strong mutex destruction
> > guarantee.  However, unlike mutexes, because of spurious wake-ups being
> > allowed a correct program has to effectively ensure that destruction
> > happens after signals/broadcast calls return.  Thus, the destruction
> > requirement in POSIX is not as restrictive as with semaphores, but we
> > still need to take care.
> 
> If the mutex is held when changing the state on which the predicate
> depends and signaling, it's possible for a woken waiter to determine
> whether a subsequent signal is possible simply from observing the
> predicate.

If the predicate change and the signal are in the same critical section,
I agree that the waiter can safely determine that it can do destruction
subsequently.  In this case, the program implicitly ensures that signal
happens-before destruction (due to how the mutex is used).

> So I think there are some situations where destruction
> after pthread_cond_wait returns makes sense, but I don't think they're
> a problem.

Agreed.  The new condvar takes care of this.

> If the signal is made while the mutex is not held, there is absolutely
> no way for the woken thread to determine whether the call to signal
> has already been made or not, so I agree that it's invalid to destroy
> in this case.

Agreed.
  
Rich Felker Feb. 22, 2015, 10:37 p.m. UTC | #3
On Fri, Feb 20, 2015 at 07:18:27PM +0100, Torvald Riegel wrote:
> +   Limitations:
> +   * This condvar isn't designed to allow for more than
> +     WSEQ_THRESHOLD * (1 << (sizeof(GENERATION) * 8 - 1)) calls to
> +     __pthread_cond_wait.  It probably only suffers from potential ABA issues
> +     afterwards, but this hasn't been checked nor tested.
> +   * More than (1 << (sizeof(QUIESCENCE_WAITERS) * 8) -1 concurrent waiters
> +     are not supported.
> +   * Beyond what is allowed as errors by POSIX or documented, we can also
> +     return the following errors:
> +     * EPERM if MUTEX is a recursive mutex and the caller doesn't own it.

This is not beyond POSIX; it's explicitly specified as a "shall fail".

> +     * EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes.  Unlike
> +       for other errors, this can happen when we re-acquire the mutex; this
> +       isn't allowed by POSIX (which requires all errors to virtually happen
> +       before we release the mutex or change the condvar state), but there's
> +       nothing we can do really.

Likewise these are "shall fail" errors specified by POSIX, and while
it's not clearly written in the specification, it's clear that they
only happen on re-locking.

One ugly case I don't think you're prepared to handle is when
EOWNERDEAD happens in the cancellation cleanup path. There's no way
for the caller to know this error happened and thereby no way for it
to recover the state and make it consistent before unlocking the
mutex.

The new cancellation state I'm proposing (PTHREAD_CANCEL_MASKED; see
http://www.openwall.com/lists/musl/2015/02/22/1 for details) makes it
possible to solve this problem by refusing to act on cancellation
(leaving it pending), and instead returning as if by a spurious wake
occurring just before cancellation, so that the error can be reported.

> +     * EAGAIN if MUTEX is a recursive mutex and trying to lock it exceeded
> +       the maximum number of recursive locks.  The caller cannot expect to own
> +       MUTEX.

This can't happen since the lock count was already decremented once
and is only incremented back to its original value. (And in practice
the original value has to be exactly 1 or other threads could never
modify the state on which the predicate depends.)

> +     * When using PTHREAD_MUTEX_PP_* mutexes, we can also return all errors
> +       returned by __pthread_tpp_change_priority.  We will already have
> +       released the mutex in such cases, so the caller cannot expect to own
> +       MUTEX.

This seems correct. This is another case where you would want to be
able to suppress acting upon cancellation and instead cause a spurious
wake.

> +   Other notes:
> +   * Instead of the normal mutex unlock / lock functions, we use
> +     __pthread_mutex_unlock_usercnt(m, 0) / __pthread_mutex_cond_lock(m)
> +     because those will not change the mutex-internal users count, so that it
> +     can be detected when a condvar is still associated with a particular
> +     mutex because there is a waiter blocked on this condvar using this mutex.

I don't follow. What is this users count? Is it the recursive lock
count or something else? Why do you want to be able to detect that
it's associated with a cv? Note that it could be associated with more
than one cv.

Rich
  
Torvald Riegel Feb. 23, 2015, 11:26 a.m. UTC | #4
On Sun, 2015-02-22 at 17:37 -0500, Rich Felker wrote:
> On Fri, Feb 20, 2015 at 07:18:27PM +0100, Torvald Riegel wrote:
> > +   Limitations:
> > +   * This condvar isn't designed to allow for more than
> > +     WSEQ_THRESHOLD * (1 << (sizeof(GENERATION) * 8 - 1)) calls to
> > +     __pthread_cond_wait.  It probably only suffers from potential ABA issues
> > +     afterwards, but this hasn't been checked nor tested.
> > +   * More than (1 << (sizeof(QUIESCENCE_WAITERS) * 8) -1 concurrent waiters
> > +     are not supported.
> > +   * Beyond what is allowed as errors by POSIX or documented, we can also
> > +     return the following errors:
> > +     * EPERM if MUTEX is a recursive mutex and the caller doesn't own it.
> 
> This is not beyond POSIX; it's explicitly specified as a "shall fail".

http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_wait.html

[EPERM]
        The mutex type is PTHREAD_MUTEX_ERRORCHECK or the mutex is a
        robust mutex, and the current thread does not own the mutex.

POSIX does not seem to allow EPERM for *recursive mutexes*.  Is there an
update that I'm missing?

> > +     * EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes.  Unlike
> > +       for other errors, this can happen when we re-acquire the mutex; this
> > +       isn't allowed by POSIX (which requires all errors to virtually happen
> > +       before we release the mutex or change the condvar state), but there's
> > +       nothing we can do really.
> 
> Likewise these are "shall fail" errors specified by POSIX, and while
> it's not clearly written in the specification, it's clear that they
> only happen on re-locking.

Yes, they are "shall fail".  I also agree that POSIX *should* make it
clear that they can happen after releasing and when acquiring the mutex
again -- but that's not what the spec says:

"Except in the case of [ETIMEDOUT], all these error checks shall act as
if they were performed immediately at the beginning of processing for
the function and shall cause an error return, in effect, prior to
modifying the state of the mutex[...]"

Until these two get clarified in the spec, I consider the comments
correct.  We can certainly extend them and document why we think this
behavior is The Right Thing To Do.  But we need to document where we
deviate from what the spec states literally.

> One ugly case I don't think you're prepared to handle is when
> EOWNERDEAD happens in the cancellation cleanup path. There's no way
> for the caller to know this error happened and thereby no way for it
> to recover the state and make it consistent before unlocking the
> mutex.

That's true.  There are "XXX" comments on that in the cancellation
handlers in the patch; we need to decide whether we ignore such errors
or abort.  I'm kind of undecided right now what's the most useful
behavior in practice.

> The new cancellation state I'm proposing (PTHREAD_CANCEL_MASKED; see
> http://www.openwall.com/lists/musl/2015/02/22/1 for details) makes it
> possible to solve this problem by refusing to act on cancellation
> (leaving it pending), and instead returning as if by a spurious wake
> occurring just before cancellation, so that the error can be reported.
> 
> > +     * EAGAIN if MUTEX is a recursive mutex and trying to lock it exceeded
> > +       the maximum number of recursive locks.  The caller cannot expect to own
> > +       MUTEX.
> 
> This can't happen since the lock count was already decremented once
> and is only incremented back to its original value. (And in practice
> the original value has to be exactly 1 or other threads could never
> modify the state on which the predicate depends.)

Good point.  I think I must have been thinking about rwlocks for some
reason... :)

> > +     * When using PTHREAD_MUTEX_PP_* mutexes, we can also return all errors
> > +       returned by __pthread_tpp_change_priority.  We will already have
> > +       released the mutex in such cases, so the caller cannot expect to own
> > +       MUTEX.
> 
> This seems correct. This is another case where you would want to be
> able to suppress acting upon cancellation and instead cause a spurious
> wake.
> 
> > +   Other notes:
> > +   * Instead of the normal mutex unlock / lock functions, we use
> > +     __pthread_mutex_unlock_usercnt(m, 0) / __pthread_mutex_cond_lock(m)
> > +     because those will not change the mutex-internal users count, so that it
> > +     can be detected when a condvar is still associated with a particular
> > +     mutex because there is a waiter blocked on this condvar using this mutex.
> 
> I don't follow. What is this users count? Is it the recursive lock
> count or something else? Why do you want to be able to detect that
> it's associated with a cv? Note that it could be associated with more
> than one cv.

I've maintained what the previous code did.  There's this check in
pthread_mutex_destroy.c:
  if ((mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP) == 0
      && mutex->__data.__nusers != 0)
    return EBUSY;

That's not required (but undefined behavior is allowed by POSIX for
destroying a mutex that is referenced by a concurrent condvar
operation).

It might be better to remove this check, mostly to avoid unnecessary
writes to the mutex.
  
Rich Felker Feb. 23, 2015, 5:59 p.m. UTC | #5
On Mon, Feb 23, 2015 at 12:26:49PM +0100, Torvald Riegel wrote:
> On Sun, 2015-02-22 at 17:37 -0500, Rich Felker wrote:
> > On Fri, Feb 20, 2015 at 07:18:27PM +0100, Torvald Riegel wrote:
> > > +   Limitations:
> > > +   * This condvar isn't designed to allow for more than
> > > +     WSEQ_THRESHOLD * (1 << (sizeof(GENERATION) * 8 - 1)) calls to
> > > +     __pthread_cond_wait.  It probably only suffers from potential ABA issues
> > > +     afterwards, but this hasn't been checked nor tested.
> > > +   * More than (1 << (sizeof(QUIESCENCE_WAITERS) * 8) -1 concurrent waiters
> > > +     are not supported.
> > > +   * Beyond what is allowed as errors by POSIX or documented, we can also
> > > +     return the following errors:
> > > +     * EPERM if MUTEX is a recursive mutex and the caller doesn't own it.
> > 
> > This is not beyond POSIX; it's explicitly specified as a "shall fail".
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_wait.html
> 
> [EPERM]
>         The mutex type is PTHREAD_MUTEX_ERRORCHECK or the mutex is a
>         robust mutex, and the current thread does not own the mutex.
> 
> POSIX does not seem to allow EPERM for *recursive mutexes*.  Is there an
> update that I'm missing?

Well it doesn't specifically require it for recursive (I missed that)
but it also doesn't disallow it.

> > > +     * EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes.  Unlike
> > > +       for other errors, this can happen when we re-acquire the mutex; this
> > > +       isn't allowed by POSIX (which requires all errors to virtually happen
> > > +       before we release the mutex or change the condvar state), but there's
> > > +       nothing we can do really.
> > 
> > Likewise these are "shall fail" errors specified by POSIX, and while
> > it's not clearly written in the specification, it's clear that they
> > only happen on re-locking.
> 
> Yes, they are "shall fail".  I also agree that POSIX *should* make it
> clear that they can happen after releasing and when acquiring the mutex
> again -- but that's not what the spec says:
> 
> "Except in the case of [ETIMEDOUT], all these error checks shall act as
> if they were performed immediately at the beginning of processing for
> the function and shall cause an error return, in effect, prior to
> modifying the state of the mutex[...]"

OK, then I think that text is a bug. There's no way that mutex locking
errors could be meaningful before the mutex is unlocked.

> Until these two get clarified in the spec, I consider the comments
> correct.  We can certainly extend them and document why we think this
> behavior is The Right Thing To Do.  But we need to document where we
> deviate from what the spec states literally.

Yes. Would you like to submit the bug report or should I?

> > One ugly case I don't think you're prepared to handle is when
> > EOWNERDEAD happens in the cancellation cleanup path. There's no way
> > for the caller to know this error happened and thereby no way for it
> > to recover the state and make it consistent before unlocking the
> > mutex.
> 
> That's true.  There's "XXX" comments on that in the cancellation
> handlers in the patch; we need to decide whether we ignore such errors
> or abort.  I'm kind of undecided right now what's the most useful
> behavior in practice.

The most useful behavior in practice is to opt for a spurious wake
that's formally "just before the cancellation request arrived" (note:
there's no ordering relationship possible to preclude this) and leave
the cancellation request pending. However to do that you need a
mechanism for getting cancellation to throw you out of the wait
without acting on it, or a way to stop acting on cancellation once
you've started -- which could be done with sjlj hacks (but those kill
performance in the common case), or possibly with unwinding hacks, or
with the following...

> > The new cancellation state I'm proposing (PTHREAD_CANCEL_MASKED; see
> > http://www.openwall.com/lists/musl/2015/02/22/1 for details) makes it
> > possible to solve this problem by refusing to act on cancellation
> > (leaving it pending), and instead returning as if by a spurious wake
> > occurring just before cancellation, so that the error can be reported.

And if you don't want to add or depend on an API like this internally
in glibc, it would also be possible just to make a special variant of
the futex syscall for internal glibc use that returns an error rather
than acting on cancellation when cancellation happens. This (or adding
the proposed new masked mode) could be done along with the pending
cancellation race overhaul.

> > > +   Other notes:
> > > +   * Instead of the normal mutex unlock / lock functions, we use
> > > +     __pthread_mutex_unlock_usercnt(m, 0) / __pthread_mutex_cond_lock(m)
> > > +     because those will not change the mutex-internal users count, so that it
> > > +     can be detected when a condvar is still associated with a particular
> > > +     mutex because there is a waiter blocked on this condvar using this mutex.
> > 
> > I don't follow. What is this users count? Is it the recursive lock
> > count or something else? Why do you want to be able to detect that
> > it's associated with a cv? Note that it could be associated with more
> > than one cv.
> 
> I've maintained what the previous code did.  There's this check in
> pthread_mutex_destroy.c:
>   if ((mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP) == 0
>       && mutex->__data.__nusers != 0)
>     return EBUSY;
> 
> That's not required (but undefined behavior is allowed by POSIX for
> destroying a mutex that is referenced by a concurrent condvar
> operation).
> 
> It might be better to remove this check, mostly to avoid unnecessary
> writes to the mutex.

I see. I suspect checks like this are going to lead to bad performance
and/or bugs (e.g. interaction with lock elision, etc.) so my
preference would be to remove them.

Rich
  
Torvald Riegel Feb. 23, 2015, 6:09 p.m. UTC | #6
On Mon, 2015-02-23 at 12:59 -0500, Rich Felker wrote:
> On Mon, Feb 23, 2015 at 12:26:49PM +0100, Torvald Riegel wrote:
> > On Sun, 2015-02-22 at 17:37 -0500, Rich Felker wrote:
> > > On Fri, Feb 20, 2015 at 07:18:27PM +0100, Torvald Riegel wrote:
> > > > +   Limitations:
> > > > +   * This condvar isn't designed to allow for more than
> > > > +     WSEQ_THRESHOLD * (1 << (sizeof(GENERATION) * 8 - 1)) calls to
> > > > +     __pthread_cond_wait.  It probably only suffers from potential ABA issues
> > > > +     afterwards, but this hasn't been checked nor tested.
> > > > +   * More than (1 << (sizeof(QUIESCENCE_WAITERS) * 8) -1 concurrent waiters
> > > > +     are not supported.
> > > > +   * Beyond what is allowed as errors by POSIX or documented, we can also
> > > > +     return the following errors:
> > > > +     * EPERM if MUTEX is a recursive mutex and the caller doesn't own it.
> > > 
> > > This is not beyond POSIX; it's explicitly specified as a "shall fail".
> > 
> > http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_wait.html
> > 
> > [EPERM]
> >         The mutex type is PTHREAD_MUTEX_ERRORCHECK or the mutex is a
> >         robust mutex, and the current thread does not own the mutex.
> > 
> > POSIX does not seem to allow EPERM for *recursive mutexes*.  Is there an
> > update that I'm missing?
> 
> Well it doesn't specifically require it for recursive (I missed that)
> but it also doesn't disallow it.

Yes, it doesn't disallow it explicitly, but for it to be allowed, it
would have to be listed at least in the "may fail", right?

> > > > +     * EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes.  Unlike
> > > > +       for other errors, this can happen when we re-acquire the mutex; this
> > > > +       isn't allowed by POSIX (which requires all errors to virtually happen
> > > > +       before we release the mutex or change the condvar state), but there's
> > > > +       nothing we can do really.
> > > 
> > > Likewise these are "shall fail" errors specified by POSIX, and while
> > > it's not clearly written in the specification, it's clear that they
> > > only happen on re-locking.
> > 
> > Yes, they are "shall fail".  I also agree that POSIX *should* make it
> > clear that they can happen after releasing and when acquiring the mutex
> > again -- but that's not what the spec says:
> > 
> > "Except in the case of [ETIMEDOUT], all these error checks shall act as
> > if they were performed immediately at the beginning of processing for
> > the function and shall cause an error return, in effect, prior to
> > modifying the state of the mutex[...]"
> 
> OK, then I think that text is a bug. There's no way that mutex locking
> errors could be meaningful before the mutex is unlocked.
> 
> > Until these two get clarified in the spec, I consider the comments
> > correct.  We can certainly extend them and document why we think this
> > behavior is The Right Thing To Do.  But we need to document where we
> > deviate from what the spec states literally.
> 
> Yes. Would you like to submit the bug report or should I?

If you have some time, please do.
  
Rich Felker Feb. 23, 2015, 6:34 p.m. UTC | #7
On Mon, Feb 23, 2015 at 07:09:57PM +0100, Torvald Riegel wrote:
> On Mon, 2015-02-23 at 12:59 -0500, Rich Felker wrote:
> > On Mon, Feb 23, 2015 at 12:26:49PM +0100, Torvald Riegel wrote:
> > > On Sun, 2015-02-22 at 17:37 -0500, Rich Felker wrote:
> > > > On Fri, Feb 20, 2015 at 07:18:27PM +0100, Torvald Riegel wrote:
> > > > > +   Limitations:
> > > > > +   * This condvar isn't designed to allow for more than
> > > > > +     WSEQ_THRESHOLD * (1 << (sizeof(GENERATION) * 8 - 1)) calls to
> > > > > +     __pthread_cond_wait.  It probably only suffers from potential ABA issues
> > > > > +     afterwards, but this hasn't been checked nor tested.
> > > > > +   * More than (1 << (sizeof(QUIESCENCE_WAITERS) * 8) -1 concurrent waiters
> > > > > +     are not supported.
> > > > > +   * Beyond what is allowed as errors by POSIX or documented, we can also
> > > > > +     return the following errors:
> > > > > +     * EPERM if MUTEX is a recursive mutex and the caller doesn't own it.
> > > > 
> > > > This is not beyond POSIX; it's explicitly specified as a "shall fail".
> > > 
> > > http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_wait.html
> > > 
> > > [EPERM]
> > >         The mutex type is PTHREAD_MUTEX_ERRORCHECK or the mutex is a
> > >         robust mutex, and the current thread does not own the mutex.
> > > 
> > > POSIX does not seem to allow EPERM for *recursive mutexes*.  Is there an
> > > update that I'm missing?
> > 
> > Well it doesn't specifically require it for recursive (I missed that)
> > but it also doesn't disallow it.
> 
> Yes, it doesn't disallow it explicitly, but for it to be allowed, it
> would have to be listed at least in the "may fail", right?

Arbitrary implementation-defined errors are allowed by POSIX, with
some restrictions. It's not permitted to reuse one of the standard
"shall fail" or "may fail" errors to diagnose a semantically different
condition that could prevent accurately diagnosing the standard error,
but I think you can argue that this doesn't apply here since the
non-standard use of EPERM would only happen in a usage case (a
different type of mutex) distinct from the standard-specified one. But I
think adding an
explicit "may fail" for recursive mutexes would be nicer.

> > > > > +     * EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes.  Unlike
> > > > > +       for other errors, this can happen when we re-acquire the mutex; this
> > > > > +       isn't allowed by POSIX (which requires all errors to virtually happen
> > > > > +       before we release the mutex or change the condvar state), but there's
> > > > > +       nothing we can do really.
> > > > 
> > > > Likewise these are "shall fail" errors specified by POSIX, and while
> > > > it's not clearly written in the specification, it's clear that they
> > > > only happen on re-locking.
> > > 
> > > Yes, they are "shall fail".  I also agree that POSIX *should* make it
> > > clear that they can happen after releasing and when acquiring the mutex
> > > again -- but that's not what the spec says:
> > > 
> > > "Except in the case of [ETIMEDOUT], all these error checks shall act as
> > > if they were performed immediately at the beginning of processing for
> > > the function and shall cause an error return, in effect, prior to
> > > modifying the state of the mutex[...]"
> > 
> > OK, then I think that text is a bug. There's no way that mutex locking
> > errors could be meaningful before the mutex is unlocked.
> > 
> > > Until these two get clarified in the spec, I consider the comments
> > > correct.  We can certainly extend them and document why we think this
> > > behavior is The Right Thing To Do.  But we need to document where we
> > > deviate from what the spec states literally.
> > 
> > Yes. Would you like to submit the bug report or should I?
> 
> If you have some time, please do.

Done:

http://www.austingroupbugs.net/view.php?id=927

Rich
  
Rich Felker Feb. 24, 2015, 5:31 p.m. UTC | #8
On Mon, Feb 23, 2015 at 07:09:57PM +0100, Torvald Riegel wrote:
> > > > > +     * EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes.  Unlike
> > > > > +       for other errors, this can happen when we re-acquire the mutex; this
> > > > > +       isn't allowed by POSIX (which requires all errors to virtually happen
> > > > > +       before we release the mutex or change the condvar state), but there's
> > > > > +       nothing we can do really.
> > > > 
> > > > Likewise these are "shall fail" errors specified by POSIX, and while
> > > > it's not clearly written in the specification, it's clear that they
> > > > only happen on re-locking.
> > > 
> > > Yes, they are "shall fail".  I also agree that POSIX *should* make it
> > > clear that they can happen after releasing and when acquiring the mutex
> > > again -- but that's not what the spec says:
> > > 
> > > "Except in the case of [ETIMEDOUT], all these error checks shall act as
> > > if they were performed immediately at the beginning of processing for
> > > the function and shall cause an error return, in effect, prior to
> > > modifying the state of the mutex[...]"
> > 
> > OK, then I think that text is a bug. There's no way that mutex locking
> > errors could be meaningful before the mutex is unlocked.
> > 
> > > Until these two get clarified in the spec, I consider the comments
> > > correct.  We can certainly extend them and document why we think this
> > > behavior is The Right Thing To Do.  But we need to document where we
> > > deviate from what the spec states literally.
> > 
> > Yes. Would you like to submit the bug report or should I?
> 
> If you have some time, please do.

And my report was a duplicate; this was already fixed:

http://austingroupbugs.net/view.php?id=749

Rich
  
Torvald Riegel May 15, 2015, 6:18 p.m. UTC | #9
Ping, 3 months later.

I believe Siddhesh has been testing this in rawhide, and at least I
haven't heard of any breakage.  Nonetheless, the new implementation is
complex, so getting an actual review before this gets checked in would
be good.  At the very least, it would be good if someone could confirm
that the comments are sufficient and easy to follow; alternatively,
please complain if you think more detail is needed.

On Fri, 2015-02-20 at 19:18 +0100, Torvald Riegel wrote:
> TLDR: This is a new implementation for condition variables, required
> after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
> essence, we need to be stricter in which waiters a signal or broadcast
> is required to wake up; this couldn't be solved using the old algorithm.
> ISO C++ made a similar clarification, so this also fixes a bug in
> current libstdc++, for example.
> This doesn't contain the existing requeue optimization (see below), and
> PI support is not worse (except no requeue).
> 
> 
> Arch maintainers: This adapts each arch's definition of pthread_cond_t.
> Only x86, x86_64, and hppa have significant arch-specific changes.  I'd
> appreciate review considering we want to stay compatible with old
> initializers, and we don't want existing padding to alias with bytes
> that are now used for real fields (see text on pthread_cond_t below for
> details).
> 
> 
> And here's the detailed version :)
> 
> We can't use the old algorithm, which tries to avoid spurious wake-ups,
> anymore because there's no way (AFAICT) to wake in FIFO order from a
> futex (the current Linux implementation may do today, but it's not
> guaranteed).  Thus, when we wake, we can't simply let someone grab a
> signal, but we need to ensure that one of the waiters happening before
> the signal is woken up -- not just any waiter.  This is something the
> previous algorithm violated (see bug 13165).
> The problem with having to wake in-order and trying to prevent spurious
> wake-ups is that one would have to encode the order, which one needs
> space for (e.g., for separate futexes).  But pthread_cond_t is limited
> in space, and we can't use external space for process-shared condvars.
> 
> The primary reason for spurious wake-ups with this new algorithm is the
> following:  If we have waiters W1 and W2, and W2 registers itself as a
> waiter later than W1 does, and if W2 blocks earlier than W1 using a
> futex, then a signal will wake both W1 and W2.  IOW, this is when one
> waiter races ahead of an earlier one when doing futex_wait.  Once they
> both wait, or if they keep their order, there's no spurious wake-ups.
> 
> We could avoid more of these spurious wake-ups by maintaining at least
> two groups of waiters that approximate the waiter order (e.g., have one
> group that is eligible for wake-up, and drained with priority, and a
> second that is catch-all and will become the new first group when that
> is drained).  But this would add substantial complexity to the
> algorithm, and it may be a tight fit into the size of pthread_cond_t we
> have today.
> 
> 
> There's another issue specific to condvars: ABA issues on the underlying
> futexes.  Unlike mutexes that have just three states, or semaphores that
> have no tokens or a limited number of them, the state of a condvar is
> the *order* of the waiters.  With a semaphore, we can grab whenever
> there is a token, and block in the futex_wait when there is none.  With
> condvars, a waiter must not block if there had been more
> signals/broadcasts than waiters before it (ie, ordering matters).
> 
> Futex words in userspace (ie, those memory locations that control
> whether futex_wait blocks or not) are 32b.  Therefore, we can have ABA
> issues on it, which could lead to lost wake-ups because a waiter simply
> can't distinguish between no new signals being sent and lots of signals
> being sent (2^31 in this implementation).
> 
> It might be unlikely that this occurs, and needs a specific scenario,
> but I'm not comfortable with just ignoring it -- knowingly.  Therefore,
> this patch avoids the ABA issues by quiescing the condvar before an
> overflow on the internal counters for the number of waiters /
> signals-sent happens, and then resets the condvar.  This just increases
> the number of spurious wake-ups while we do the reset on non-PI condvars
> (but it is a problem for PI; see below).  The quiescence handling does
> add complexity, but it seems not excessive relative to the rest of the
> algorithm.
> 
> 
> This algorithm satisfies the equivalent of the strong mutex destruction
> guarantee.  However, unlike mutexes, because of spurious wake-ups being
> allowed a correct program has to effectively ensure that destruction
> happens after signals/broadcast calls return.  Thus, the destruction
> requirement in POSIX is not as restrictive as with semaphores, but we
> still need to take care.
> 
> 
> If you want to dive into the code, it's best to start with the comments
> on top of __pthread_cond_wait_common in nptl/pthread_cond_wait.c.
> 
> One notable difference to the previous implementation is that the new
> code doesn't use an internal lock anymore.  This simplifies the PI
> implementation (see below), and should speed up things like concurrent
> signals/broadcasts, and the general hand-off between wait and
> signal/broadcast.
> 
> I've merged pthread_cond_timedwait.c into pthread_cond_wait.c because
> they both share most of the code.  __pthread_cond_wait_common is
> currently an always_inline, but we might also make that a noinline if
> you're concerned over statically linked programs that use either the
> timed or the non-timed cond_wait variant.
> 
> pthread_cond_t is the same on all archs (except on hppa, see below, and
> m68k which enforces 4-byte alignment of the first int).  The new
> algorithm needs less space (int instead of long long int in the case of
> three fields), so the old initializers should remain working.  The x86
> version looks fine for me, but I'd appreciate (an)other set(s) of eyes
> on this aspect.  We don't need stronger alignment for the new algorithm.
> 
> Carlos: the hppa changes are completely untested.  They change the
> pthread-once-like code to C11 atomics, which fixes one missing compiler
> barrier (acquire load was a plain load).  Let me know if you object to
> these.
> 
> x86 had custom assembly implementations.  Given that this patch fixes a
> correctness problem, I've just removed them.  Additionally, there's no
> real fast-path in cond_wait unless perhaps if we want to consider just
> spin for a signal to become available as a fast path; in all other
> cases, we have to wait, so that's cache misses at least.  signal and
> broadcast could be considered fast paths; the new code doesn't use an
> internal lock anymore, so they should have become faster (e.g.,
> cond_signal is just a CAS loop now and a call to futex_wait (that we
> might avoid too with some more complexity).
> 
> There are three new tests, cond36-cond28, which are variations of
> existing tests that frequently drive a condvar into the quiescence state
> (and thus test quiescence).
> 
> 
> This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
> waking just one thread and requeueing all others on the futex of the
> mutex supplied by the program).  I don't think doing the requeue is
> necessarily the right approach (but I haven't done real measurements
> yet):
> * If a program expects to wake many threads at the same time and make
> that scalable, a condvar isn't great anyway because of how it requires
> waiters to operate mutually exclusive (due to the mutex usage).  Thus, a
> thundering herd problem is a scalability problem with or without the
> optimization.  Using something like a semaphore might be more
> appropriate in such a case.
> * The scalability problem is actually at the mutex side; the condvar
> could help (and it tries to with the requeue optimization), but it
> should be the mutex who decides how that is done, and whether it is done
> at all.
> * Forcing all but one waiter into the kernel-side wait queue of the
> mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> prevents the only cure against the underlying scalability problem
> inherent to condvars.
> * If condvars use short critical sections (ie, hold the mutex just to
> check a binary flag or such), which they should do ideally, then forcing
> all those waiter to proceed serially with kernel-based hand-off (ie,
> futex ops in the mutex' contended state, via the futex wait queues) will
> be less efficient than just letting a scalable mutex implementation take
> care of it.  Our current mutex impl doesn't employ spinning at all, but
> if critical sections are short, spinning can be much better.
> * Doing the requeue stuff requires all waiters to always drive the mutex
> into the contended state.  This leads to each waiter having to call
> futex_wake after lock release, even if this wouldn't be necessary.
> 
> Therefore, this condvar doesn't do requeue currently.  I'd like to get
> your opinion on this.
> Once we agree on a way forward, I'd either (1) adapt the condvar to use
> requeue or (2) adapt the _cond_ variants of the lowlevellock and
> pthread_mutex_* to not always drive the mutex into the contended state.
> Here's a sketch of how we could implement requeue (IOW, make sure we
> don't requeue to the wrong mutex):
> * Use one bit (NEW_MUTEX_BIT or such) in signals_sent as a flag for
> whether the mutex associated with the condvar changed.  Add proper
> masking of it, adapt WSEQ_THRESHOLD accordingly, etc.
> * Let waiters do this:
>   if (mutex != cond->mutex) {
>     atomic_store_relaxed (&cond->mutex, mutex);
>     atomic_fetch_or_release (&cond->signals_sent, NEW_MUTEX_BIT);
>   }
>   futex_wait(...)
> * Let broadcast do:
>   s = atomic_fetch_add_acquire (&cond->signals_sent, signals_to_add);
>   if (s & NEW_MUTEX_BIT) /* reset the bit with a CAS, retry; */
>   m = atomic_load_relaxed (&cond->mutex);
>   futex_cmp_requeue (..., s + signals_to_add /* expected value */,
>      ..., m /* expected mutex */);
> This would ensure that if a futex_wait on a new mutex comes in,
> broadcast will grab the new mutex or futex_cmp_requeue will fail (it
> will see the signals_sent update because of futex op ordering).
> 
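For reference, a hedged illustration of the raw futex_cmp_requeue call the
sketch above would end up making; the function name and parameters below
are placeholders for illustration, not code from the patch:

  #define _GNU_SOURCE
  #include <limits.h>
  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Wake at most one waiter blocked on cond_futex and move all remaining
     waiters over to mutex_futex, but only if *cond_futex still equals
     'expected' (otherwise the call fails with EAGAIN and must be retried).  */
  static long
  requeue_rest_to_mutex (int *cond_futex, int *mutex_futex, int expected)
  {
    return syscall (SYS_futex, cond_futex, FUTEX_CMP_REQUEUE,
                    1,                        /* wake this many waiters */
                    (void *) (long) INT_MAX,  /* requeue up to this many */
                    mutex_futex,              /* target futex (the mutex) */
                    expected);                /* required value of *cond_futex */
  }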
> 
> PI support is "kind of" included.  There is no internal lock anymore, so
> the thing that Darren proposed the fix for is gone.  So far so good;
> however, we don't requeue, and Darren's paper states that requeue would
> yield better latency in the PI scenario (is this still the case?).
> 
> Darren, I'd propose that we figure out how to best adapt this new
> implementation to do PI.  I'm looking forward to your comments.
> 
> One thing I don't think we'll be able to solve is ensuring PI during
> quiescence.  When doing quiescence, we need all waiters to go out of
> the condvar, that is, confirm their wake-up.  I can't see a way of boosting
> their priorities if they get suspended between releasing the mutex and
> starting to wait; there's no app lock they still hold, and we can't give
> waiters per-waiter locks linked from the condvar that we could use to
> boost individual threads because this doesn't work in the process-shared
> case.  Maybe there's something I'm missing (but I thought a while about
> it ;), and maybe there's some PI-futex functionality that I wasn't aware
> of that we could (ab)use.
> Thus, the most practical approach seems to be to just not do any PI
> during quiescence (ie, every 2^31 cond_wait calls).  Any alternative
> suggestions?
> 
> This problem would go away if we had 64b futexes because then we
> wouldn't need quiescence anymore (assuming 64b counters avoid ABA).
> 
> 
> Tested on x86_64-linux and x86-linux.
> 
> 
> 2015-02-20  Torvald Riegel  <triegel@redhat.com>
> 
> 	[BZ #13165]
> 	* nptl/pthread_cond_broadcast.c (__pthread_cond_broadcast): Rewrite to
> 	use new algorithm.
> 	* nptl/pthread_cond_destroy.c (__pthread_cond_destroy): Likewise.
> 	* nptl/pthread_cond_init.c (__pthread_cond_init): Likewise.
> 	* nptl/pthread_cond_signal.c (__pthread_cond_signal): Likewise.
> 	* nptl/pthread_cond_wait.c (__pthread_cond_wait): Likewise.
> 	(__pthread_cond_timedwait): Move here from pthread_cond_timedwait.c.
> 	(__condvar_confirm_wakeup, __condvar_cancel_waiting,
> 	__condvar_cleanup_waiting, __condvar_cleanup_quiescence,
> 	__pthread_cond_wait_common): New.
> 	(__condvar_cleanup): Remove.
> 	* nptl/pthread_condattr_getclock.c (pthread_condattr_getclock): Adapt.
> 	* nptl/pthread_condattr_setclock.c (pthread_condattr_setclock): Likewise.
> 	* nptl/tst-cond1.c: Add comment.
> 	* nptl/tst-cond18.c (tf): Add quiescence testing.
> 	* nptl/tst-cond20.c (do_test): Likewise.
> 	* nptl/tst-cond25.c (do_test_wait): Likewise.
> 	* nptl/tst-cond20.c (do_test): Adapt.
> 	* nptl/tst-cond26.c: New file.
> 	* nptl/tst-cond27.c: Likewise.
> 	* nptl/tst-cond28.c: Likewise.
> 	* sysdeps/aarch64/nptl/bits/pthreadtypes.h (pthread_cond_t): Adapt
> 	structure.
> 	* sysdeps/arm/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/hppa/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/ia64/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/m68k/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/microblaze/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/mips/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/nios2/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/s390/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/sh/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/sparc/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/tile/nptl/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h (pthread_cond_t):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h (pthread_cond_t):
> 	Likewise.
> 	* sysdeps/x86/bits/pthreadtypes.h (pthread_cond_t): Likewise.
> 	* sysdeps/nptl/internaltypes.h (COND_NWAITERS_SHIFT): Remove.
> 	(COND_CLOCK_BITS): Adapt.
> 	* sysdeps/nptl/pthread.h (PTHREAD_COND_INITIALIZER): Adapt.
> 	* sysdeps/unix/sysv/linux/hppa/internaltypes.h (cond_compat_clear,
> 	cond_compat_check_and_clear): Adapt.
> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Remove file ...
> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
> 	(__pthread_cond_timedwait): ... and move here.
> 	* nptl/DESIGN-condvar.txt: Remove file.
> 	* nptl/lowlevelcond.sym: Likewise.
> 	* nptl/pthread_cond_timedwait.c: Likewise.
> 	* sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S: Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S: Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Likewise.
>
  
Siddhesh Poyarekar May 18, 2015, 4:55 a.m. UTC | #10
On Fri, May 15, 2015 at 08:18:09PM +0200, Torvald Riegel wrote:
> I believe Siddhesh has been testing this in rawhide, and at least I
> haven't heard of any breakage.  Nonetheless, the new implementation is

Sorry, I didn't actually get around to pushing this into rawhide last
week.  I'll do it today.

Siddhesh
  
Ondrej Bilka July 1, 2015, 10:15 p.m. UTC | #11
On Fri, May 15, 2015 at 08:18:09PM +0200, Torvald Riegel wrote:
> Ping, 3 months later.
> 
> I believe Siddhesh has been testing this in rawhide, and at least I
> haven't heard of any breakage.  Nonetheless, the new implementation is
> complex, so getting an actual review before this gets checked in would
> be good.  At the very least, it would be good if someone could confirm
> that the comments are sufficient and easy to follow; alternatively,
> please complain if you think more detail is needed.
> 
> On Fri, 2015-02-20 at 19:18 +0100, Torvald Riegel wrote:
> > TLDR: This is a new implementation for condition variables, required
> > after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
> > essence, we need to be stricter in which waiters a signal or broadcast
> > is required to wake up; this couldn't be solved using the old algorithm.
> > ISO C++ made a similar clarification, so this also fixes a bug in
> > current libstdc++, for example.
> > This doesn't contain the existing requeue optimization (see below), and
> > PI support is not worse (except no requeue).
> > 
> > 
> > Arch maintainers: This adapts each arch's definition of pthread_cond_t.
> > Only x86, x86_64, and hppa have significant arch-specific changes.  I'd
> > appreciate review considering we want to stay compatible with old
> > initializers, and we don't want existing padding to alias with bytes
> > that are now used for real fields (see text on pthread_cond_t below for
> > details).
> > 
> > 
> > And here's the detailed version :)
> > 
> > We can't use the old algorithm, which tries to avoid spurious wake-ups,
> > anymore because there's no way (AFAICT) to wake in FIFO order from a
> > futex (the current Linux implementation may do today, but it's not
> > guaranteed). 

And what is the performance difference between the old algorithm and the
proposed one? If there is a noticeable difference, wouldn't it be better to
write a kernel patch that, for example, sets an AT_MONOTONE_FUTEX auxval to 1,
and when kernel developers decide on non-monotone wakeup they set it to 0?


> 
> > 
> > There's another issue specific to condvars: ABA issues on the underlying
> > futexes.  Unlike mutexes that have just three states, or semaphores that
> > have no tokens or a limited number of them, the state of a condvar is
> > the *order* of the waiters.  With a semaphore, we can grab whenever
> > there is a token, and block in the futex_wait when there is none.  With
> > condvars, a waiter must not block if there had been more
> > signals/broadcasts than waiters before it (ie, ordering matters).
> > 
> > Futex words in userspace (ie, those memory locations that control
> > whether futex_wait blocks or not) are 32b.  Therefore, we can have ABA
> > issues on it, which could lead to lost wake-ups because a waiter simply
> > can't distinguish between no new signals being sent and lots of signals
> > being sent (2^31 in this implementation).
> > 
> > It might be unlikely that this occurs, and needs a specific scenario,
> > but I'm not comfortable with just ignoring it -- knowingly.  Therefore,
> > this patch avoids the ABA issues by quiescing the condvar before an
> > overflow on the internal counters for the number of waiters /
> > signals-sent happens, and then resets the condvar.  This just increases
> > the number of spurious wake-ups while we do the reset on non-PI condvars
> > (but it is a problem for PI; see below).  The quiescence handling does
> > add complexity, but it seems not excessive relative to the rest of the
> > algorithm.
> >
That looks reasonable. I cannot imagine how to reach that number with a
reasonable scheduler.
 
> > 
> > I've merged pthread_cond_timedwait.c into pthread_cond_wait.c because
> > they both share most of the code.  __pthread_cond_wait_common is
> > currently an always_inline, but we might also make that a noinline if
> > you're concerned over statically linked programs that use either the
> > timed or the non-timed cond_wait variant.
> >
I would be more worried about code size; removing the inline could help
with cache usage. That doesn't have to be a gain when most programs call
just wait but not timedwait.
 
> > 
> > x86 had custom assembly implementations.  Given that this patch fixes a
> > correctness problem, I've just removed them.  Additionally, there's no
> > real fast-path in cond_wait unless perhaps if we want to consider just
> > spin for a signal to become available as a fast path; in all other
> > cases, we have to wait, so that's cache misses at least.  signal and
> > broadcast could be considered fast paths; the new code doesn't use an
> > internal lock anymore, so they should have become faster (e.g.,
> > cond_signal is just a CAS loop now and a call to futex_wait (that we
> > might avoid too with some more complexity).
> >
Nice; I said last year that assembly makes no sense here. The only thing you
could micro-optimize is signal/broadcast without waiters. Otherwise, if you
consider a wait+signal pair, the syscall overhead for each will dominate the
few cycles that you tried to save by using assembly.

> > 
> > This condvar doesn't yet use a requeue optimization (ie, on a broadcast,
> > waking just one thread and requeueing all others on the futex of the
> > mutex supplied by the program).  I don't think doing the requeue is
> > necessarily the right approach (but I haven't done real measurements
> > yet):
> > * If a program expects to wake many threads at the same time and make
> > that scalable, a condvar isn't great anyway because of how it requires
> > waiters to operate mutually exclusive (due to the mutex usage).  Thus, a
> > thundering herd problem is a scalability problem with or without the
> > optimization.  Using something like a semaphore might be more
> > appropriate in such a case.

I would focus more on cases where, for example, you have four consumer
threads that wait for data; I don't know what happens there. You could
write an LD_PRELOAD wrapper to see what programs do with condition
variables and the average number of waiters. There are plenty of binaries
that use these.
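As an illustration only, here is a minimal sketch of such an interposer.
It is hypothetical and not part of the patch; among other simplifications,
it ignores pthread_cond_timedwait and the versioned pthread_cond_wait
symbols that a real tool would also have to intercept.

  /* Build (assumed): gcc -shared -fPIC condspy.c -o condspy.so -ldl -lpthread
     Use:             LD_PRELOAD=./condspy.so ./your-program  */
  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static int (*real_cond_wait) (pthread_cond_t *, pthread_mutex_t *);
  static atomic_int cur_waiters;  /* threads currently blocked in cond_wait */
  static atomic_int max_waiters;  /* high-water mark over the process lifetime */

  int
  pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
  {
    if (real_cond_wait == NULL)
      real_cond_wait = (int (*) (pthread_cond_t *, pthread_mutex_t *))
        dlsym (RTLD_NEXT, "pthread_cond_wait");

    /* Track how many threads wait concurrently across all condvars.  */
    int n = atomic_fetch_add (&cur_waiters, 1) + 1;
    int old = atomic_load (&max_waiters);
    while (n > old && !atomic_compare_exchange_weak (&max_waiters, &old, n))
      ;

    int ret = real_cond_wait (cond, mutex);
    atomic_fetch_sub (&cur_waiters, 1);
    return ret;
  }

  static void __attribute__ ((destructor))
  condspy_report (void)
  {
    fprintf (stderr, "condspy: max concurrent cond_wait callers: %d\n",
             atomic_load (&max_waiters));
  }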

> > * The scalability problem is actually at the mutex side; the condvar
> > could help (and it tries to with the requeue optimization), but it
> > should be the mutex who decides how that is done, and whether it is done
> > at all.
> > * Forcing all but one waiter into the kernel-side wait queue of the
> > mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> > prevents the only cure against the underlying scalability problem
> > inherent to condvars.

How exactly does that harm lock elision?


> > * If condvars use short critical sections (ie, hold the mutex just to
> > check a binary flag or such), which they should do ideally, then forcing
> > all those waiter to proceed serially with kernel-based hand-off (ie,
> > futex ops in the mutex' contended state, via the futex wait queues) will
> > be less efficient than just letting a scalable mutex implementation take
> > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > if critical sections are short, spinning can be much better.

That looks like it would complicate the code too much; how do you want to
pass information to differentiate between signal and broadcast?

> > * Doing the requeue stuff requires all waiters to always drive the mutex
> > into the contended state.  This leads to each waiter having to call
> > futex_wake after lock release, even if this wouldn't be necessary.
> >

That is the most important point. My hypothesis is that the mutex will
almost always be unlocked for threads after a wake-up. The question is how
the wake-up and the scheduler interact. A bad case would be a wake-up so
effective that it schedules threads onto free cores simultaneously and they
all collide on the lock. A better situation would be a delay of, say, 1000
cycles between the different cores, because by the time the next thread
tried to lock, the previous thread would already be done. So how does that
behave in practice?

I expect that when unlocking happens reasonably fast, a requeue could help
only with the collisions among the first few threads; after that, the lock
would be contended only when the next thread wakes up, and there is enough
noise in scheduling to make the cost of the collisions less than the cost
of always waking up.
 
> > Therefore, this condvar doesn't do requeue currently.  I'd like to get
> > your opinion on this.
> > Once we agree on a way forward, I'd either (1) adapt the condvar to use
> > requeue or (2) adapt the _cond_ variants of the lowlevellock and
> > pthread_mutex_* to not always drive the mutex into the contended state.
> > Here's a sketch of how we could implement requeue (IOW, make sure we
> > don't requeue to the wrong mutex):
> > * Use one bit (NEW_MUTEX_BIT or such) in signals_sent as a flag for
> > whether the mutex associated with the condvar changed.  Add proper
> > masking of it, adapt WSEQ_THRESHOLD accordingly, etc.
> > * Let waiters do this:
> >   if (mutex != cond->mutex) {
> >     atomic_store_relaxed (&cond->mutex, newmutex);
> >     atomic_fetch_or_release (&cond->signals_sent, NEW_MUTEX_BIT);
> >   }
> >   futex_wait(...)
> > * Let broadcast do:
> >   s = atomic_fetch_add_acquire (&cond->signals_sent, signals_to_add);
> >   if (s & NEW_MUTEX_BIT) /* reset the bit with a CAS, retry; */
> >   m = atomic_load_relaxed (&cond->mutex);
> >   futex_cmp_requeue (..., s + signals_to_add /* expected value */,
> >      ..., m /* expected mutex */
> > This would ensure that if a futex_wait on a new mutex comes in,
> > broadcast will grab the new mutex or futex_cmp_requeue will fail (it
> > will see the signals_sent update because of futex op ordering).
> > 
> > 
> > PI support is "kind of" included.  There is no internal lock anymore, so
> > the thing that Darren proposed the fix for is gone.  So far so good;
> > however, we don't requeue, and Darren's paper states that requeue would
> > yield better latency in the PI scenario (is this still the case?).
> >
You have the problem that, as long as the kernel keeps FIFO wake-up order,
requeue gives you fairness, while with waking everything an important
thread could be buried by lower-priority threads that after each broadcast
do something small and wait for the broadcast again.

 
> > Darren, I'd propose that we figure out how to best adapt this new
> > implementation to do PI.  I'm looking forward to your comments.
> > 
> > One thing I don't think we'll be able to solve is ensuring PI during
> > quiescence.  When doing quiescence, we need for all waiters to go out of
> > the condvar, so confirm their wake-up.  I can't see a way of boosting
> > their priorities if they get suspended between releasing the mutex and
> > starting to wait; there's no app lock they still hold, and we can't give
> > wwaiters per-waiter locks linked from the condvar that we could use to
> > boost individual threads because this doesn't work in the process-shared
> > case.  Maybe there's something I'm missing (but I though a while about
> > it ;), and maybe there's some PI-futex functionality that I wasn't aware
> > of that we could (ab)use.
> > Thus, the most practical approach seems to be to just not do any PI
> > during quiescence (ie, every 2^31 cond_wait calls).  Any alternative
> > suggestions?
> > 

No, not doing anything looks like the best solution; even if there were a
solution, it would make the code more complex. As the situation is
equivalent to first sending a broadcast that wakes all waiters, after which
each waiter calls wait again, a PI algorithm should stabilize quickly after
that.


> > This problem would go away if we had 64b futexes because then we
> > wouldn't need quiescence anymore (assuming 64b counters avoid ABA).
> > 
> > 
> > Tested on x86_64-linux and x86-linux.
> >
  
Torvald Riegel July 2, 2015, 1:25 p.m. UTC | #12
On Thu, 2015-07-02 at 00:15 +0200, Ondřej Bílka wrote:
> On Fri, May 15, 2015 at 08:18:09PM +0200, Torvald Riegel wrote:
> > Ping, 3 months later.
> > 
> > I believe Siddhesh has been testing this in rawhide, and at least I
> > haven't heard of any breakage.  Nonetheless, the new implementation is
> > complex, so getting an actual review before this gets checked in would
> > be good.  At the very least, it would be good if someone could confirm
> > that the comments are sufficient and easy to follow; alternatively,
> > please complain if you think more detail is needed.
> > 
> > On Fri, 2015-02-20 at 19:18 +0100, Torvald Riegel wrote:
> > > TLDR: This is a new implementation for condition variables, required
> > > after http://austingroupbugs.net/view.php?id=609 to fix bug 13165.  In
> > > essence, we need to be stricter in which waiters a signal or broadcast
> > > is required to wake up; this couldn't be solved using the old algorithm.
> > > ISO C++ made a similar clarification, so this also fixes a bug in
> > > current libstdc++, for example.
> > > This doesn't contain the existing requeue optimization (see below), and
> > > PI support is not worse (except no requeue).
> > > 
> > > 
> > > Arch maintainers: This adapts each arch's definition of pthread_cond_t.
> > > Only x86, x86_64, and hppa have significant arch-specific changes.  I'd
> > > appreciate review considering we want to stay compatible with old
> > > initializers, and we don't want existing padding to alias with bytes
> > > that are now used for real fields (see text on pthread_cond_t below for
> > > details).
> > > 
> > > 
> > > And here's the detailed version :)
> > > 
> > > We can't use the old algorithm, which tries to avoid spurious wake-ups,
> > > anymore because there's no way (AFAICT) to wake in FIFO order from a
> > > futex (the current Linux implementation may do today, but it's not
> > > guaranteed). 
> 
> And what is performance difference between old algorithm and proposed
> one?

First of all, this is a bug fix, so any non-catastrophic performance
concerns are secondary.

Regarding your question, which performance aspect are you interested in?
Roughly, I'd say that the new algorithm should be faster and more
scalable in the common use cases (e.g., compare the critical sections
vs. just operating on wseq and ssent).

> If there is noticable difference wouldn't be better to write patch
> into kernel to for example AT_MONOTONE_FUTEX auxval set to 1 and when
> kernel developers decide for nonmonotone wakeup they will set that to 0.

That doesn't fix the problem on existing kernels.
 
> > > * The scalability problem is actually at the mutex side; the condvar
> > > could help (and it tries to with the requeue optimization), but it
> > > should be the mutex who decides how that is done, and whether it is done
> > > at all.
> > > * Forcing all but one waiter into the kernel-side wait queue of the
> > > mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> > > prevents the only cure against the underlying scalability problem
> > > inherent to condvars.
> 
> How exactly that harms lock elision? 

You serialize the woken-up waiters, and lock elision tries to run
non-conflicting critical sections in parallel.  Thus, serializing
up-front prevents lock elision from trying to run them in parallel.

> > > * If condvars use short critical sections (ie, hold the mutex just to
> > > check a binary flag or such), which they should do ideally, then forcing
> > > all those waiter to proceed serially with kernel-based hand-off (ie,
> > > futex ops in the mutex' contended state, via the futex wait queues) will
> > > be less efficient than just letting a scalable mutex implementation take
> > > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > > if critical sections are short, spinning can be much better.
> 
> That looks too complicate code much, how do you want to pass information
> do differentiate between signal/broadcast?

I don't understand your question.  How does it relate to my paragraph
you replied to?

> > > * Doing the requeue stuff requires all waiters to always drive the mutex
> > > into the contended state.  This leads to each waiter having to call
> > > futex_wake after lock release, even if this wouldn't be necessary.
> > >
> 
> That is most important point. My hypotesis is that mutex will almost
> always be unlocked for threads after wake.

Well, that depends on how both the mutex and the condvar are used,
actually.

> Here question is how does
> wake and scheduler interact. Here bad case would be effective wake that
> could simultaneously schedule threads to free cores and all would
> collide. A better deal would be if there would be 1000 cycle delay
> between different cores as when next thread tried to lock previous
> thread would be already done.

I don't understand those sentences.

> So how that behaves in practice?

I don't understand this question.

> > > PI support is "kind of" included.  There is no internal lock anymore, so
> > > the thing that Darren proposed the fix for is gone.  So far so good;
> > > however, we don't requeue, and Darren's paper states that requeue would
> > > yield better latency in the PI scenario (is this still the case?).
> > >
> You have problem that when kernel keeps FIFO api requeue gives you fairness while
> with waking everything a important thread could be buried by lower
> priority threads that after each broadcast do something small and wait
> for broadcast again. 

If you wake several threads, then those with the highest priority will
run.
  
Ondrej Bilka July 2, 2015, 9:48 p.m. UTC | #13
On Thu, Jul 02, 2015 at 03:25:02PM +0200, Torvald Riegel wrote:
> > > > We can't use the old algorithm, which tries to avoid spurious wake-ups,
> > > > anymore because there's no way (AFAICT) to wake in FIFO order from a
> > > > futex (the current Linux implementation may do today, but it's not
> > > > guaranteed). 
> > 
> > And what is performance difference between old algorithm and proposed
> > one?
> 
> First of all, this is a bug fix, so any non-catastrophic performance
> concerns are secondary.
> 
> Regarding your question, which performance aspect are you interested in?
> Roughly, I'd say that the new algorithm should be faster and more
> scalable in the common use cases (e.g., compare the critical sections
> vs. just operating on wseq and ssent).
>
Ok, your original description hinted that the previous algorithm was faster
but incorrect.
 
> > > > * The scalability problem is actually at the mutex side; the condvar
> > > > could help (and it tries to with the requeue optimization), but it
> > > > should be the mutex who decides how that is done, and whether it is done
> > > > at all.
> > > > * Forcing all but one waiter into the kernel-side wait queue of the
> > > > mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> > > > prevents the only cure against the underlying scalability problem
> > > > inherent to condvars.
> > 
> > How exactly that harms lock elision? 
> 
> You serialize the woken-up waiters, and lock elision tries to run
> non-conflicting critical sections in parallel.  Thus, serializing
> up-front prevents lock elision from trying to run them in parallel.
>
Which doesn't help when waiters cause the lock to be contended.
 
> > > > * If condvars use short critical sections (ie, hold the mutex just to
> > > > check a binary flag or such), which they should do ideally, then forcing
> > > > all those waiter to proceed serially with kernel-based hand-off (ie,
> > > > futex ops in the mutex' contended state, via the futex wait queues) will
> > > > be less efficient than just letting a scalable mutex implementation take
> > > > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > > > if critical sections are short, spinning can be much better.
> > 
> > That looks too complicate code much, how do you want to pass information
> > do differentiate between signal/broadcast?
> 
> I don't understand your question.  How does it relate to my paragraph
> you replied to?
>
A problem is that the mutex doesn't just protect a short critical section;
for example, it's a C++ monitor with another method that holds the lock for
a long time.
 
> > > > * Doing the requeue stuff requires all waiters to always drive the mutex
> > > > into the contended state.  This leads to each waiter having to call
> > > > futex_wake after lock release, even if this wouldn't be necessary.
> > > >
> > 
> > That is most important point. My hypotesis is that mutex will almost
> > always be unlocked for threads after wake.
> 
> Well, that depends on how both the mutex and the condvar are used,
> actually.
> 
I would say that scheduling has more impact than the use case, as I tried
to explain below.

> > Here question is how does
> > wake and scheduler interact. Here bad case would be effective wake that
> > could simultaneously schedule threads to free cores and all would
> > collide. A better deal would be if there would be 1000 cycle delay
> > between different cores as when next thread tried to lock previous
> > thread would be already done.
> 
> I don't understand those sentences.
> 
I modelled the case where threads don't block again after the broadcast
until they release the lock.

Initially the threads occupy k free cores and there is contention. After k
threads have acquired the lock, the situation stabilizes, as each core runs
its thread until it blocks. Contention is then less likely, because it has
to happen for two threads within an interval smaller than the time it takes
one thread to get the lock, do its work, and release the lock. If, in the
initial phase, the threads are not started simultaneously but with some
delay, contention could be lower still, as a thread may already have
completed its job.

> > So how that behaves in practice?
> 
> I don't understand this question.
>
Just wanted to know how often the lock takes the fast path after a broadcast.
 
> > > > PI support is "kind of" included.  There is no internal lock anymore, so
> > > > the thing that Darren proposed the fix for is gone.  So far so good;
> > > > however, we don't requeue, and Darren's paper states that requeue would
> > > > yield better latency in the PI scenario (is this still the case?).
> > > >
> > You have problem that when kernel keeps FIFO api requeue gives you fairness while
> > with waking everything a important thread could be buried by lower
> > priority threads that after each broadcast do something small and wait
> > for broadcast again. 
> 
> If you wake several threads, then the those with highest priority will
> run.

While they will run, that doesn't mean that they will win the race. For
example, you have a 4-core CPU, 3 low-priority threads, and one
high-priority thread. Each will check the condition; if it holds, the
thread will consume the data, causing the condition to become false, and if
the condition is not met, it will wait again. Then the high-priority thread
has only a 1/4 chance of being the first to get the lock and consume the
data. With requeue it could always do so.
  
Torvald Riegel July 3, 2015, 9:03 a.m. UTC | #14
On Thu, 2015-07-02 at 23:48 +0200, Ondřej Bílka wrote:
> On Thu, Jul 02, 2015 at 03:25:02PM +0200, Torvald Riegel wrote:
> > > > > * The scalability problem is actually at the mutex side; the condvar
> > > > > could help (and it tries to with the requeue optimization), but it
> > > > > should be the mutex who decides how that is done, and whether it is done
> > > > > at all.
> > > > > * Forcing all but one waiter into the kernel-side wait queue of the
> > > > > mutex prevents/avoids the use of lock elision on the mutex.  Thus, it
> > > > > prevents the only cure against the underlying scalability problem
> > > > > inherent to condvars.
> > > 
> > > How exactly that harms lock elision? 
> > 
> > You serialize the woken-up waiters, and lock elision tries to run
> > non-conflicting critical sections in parallel.  Thus, serializing
> > up-front prevents lock elision from trying to run them in parallel.
> >
> Which doesn't help when waiters cause lock to be contended.

That's why I wrote that if we decide to not requeue, we would remove the
hack that puts the lock into contended state.

> > > > > * If condvars use short critical sections (ie, hold the mutex just to
> > > > > check a binary flag or such), which they should do ideally, then forcing
> > > > > all those waiter to proceed serially with kernel-based hand-off (ie,
> > > > > futex ops in the mutex' contended state, via the futex wait queues) will
> > > > > be less efficient than just letting a scalable mutex implementation take
> > > > > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > > > > if critical sections are short, spinning can be much better.
> > > 
> > > That looks too complicate code much, how do you want to pass information
> > > do differentiate between signal/broadcast?
> > 
> > I don't understand your question.  How does it relate to my paragraph
> > you replied to?
> >
> A problem is that mutex doesn't just protect short critical section, for
> example its C++ monitor with other method that locks for long time.

Yes, we may not just be dealing with short critical sections.  However,
if we have long critical sections on the waiter side, and broadcast or
many signals wake many waiters, then some potential slowdown due to
contention is less important because the program isn't scalable in the
first place (ie, has many long critical sections that need to be
serialized).  Specifically, one of the long critical sections will get
the lock; while it has it, the others will sort things out, in parallel
with the critical section that is running.  That means we may do some
extra work, but it won't slow down the critical section that has the
lock.

> > > > > * Doing the requeue stuff requires all waiters to always drive the mutex
> > > > > into the contended state.  This leads to each waiter having to call
> > > > > futex_wake after lock release, even if this wouldn't be necessary.
> > > > >
> > > 
> > > That is most important point. My hypotesis is that mutex will almost
> > > always be unlocked for threads after wake.
> > 
> > Well, that depends on how both the mutex and the condvar are used,
> > actually.
> > 
> I would say that scheduling has more impact than use case as I tried to
> write below.
> 
> > > Here question is how does
> > > wake and scheduler interact. Here bad case would be effective wake that
> > > could simultaneously schedule threads to free cores and all would
> > > collide. A better deal would be if there would be 1000 cycle delay
> > > between different cores as when next thread tried to lock previous
> > > thread would be already done.
> > 
> > I don't understand those sentences.
> > 
> I did model when threads don't block after broadcast until they release
> lock.
> 
> Initially threads occupy a k free cores and there is contention.

What kind of contention do you mean?  Contention for free compute
resources, or contention on a lock (ie, the high-level resource), or
contention on a memory location (eg, many concurrent acquisition
attempts on the same lock)?

> After k
> threads acquire lock situation stabilizes as each core runs that thread
> until it blocks. Contention is less likely as you need that happen for
> two threads in interval smaller than it takes one thread to get lock, do
> stuff and release lock. If in initial phase threads are not started
> simultaneously but with some delay contention could be lower as thread
> completed its job.
> 
> > > So how that behaves in practice?
> > 
> > I don't understand this question.
> >
> Just wanted to know how often lock takes fast path after broadcast.

I still don't understand what you're trying to ask, sorry.

Whether the lock implementation sees an uncontended lock is really up to
the workload.  And even then the lock itself can spin for a while (which
we don't really do, even ADAPTIVE_MUTEX has a very simplistic spinning)
to not actually block via futexes (ie, the slow path).  So, this is more
about the lock implementation and the workload than about what the
condvar is doing; the existing condvar code tries to be nicer to the
existing lock implementation but, as I wrote, I don't think this is
really effective, prevents lock elision usage, and is likely better
addressed by improving the scalability of our lock implementation.

Does that answer your question?

> > > > > PI support is "kind of" included.  There is no internal lock anymore, so
> > > > > the thing that Darren proposed the fix for is gone.  So far so good;
> > > > > however, we don't requeue, and Darren's paper states that requeue would
> > > > > yield better latency in the PI scenario (is this still the case?).
> > > > >
> > > You have problem that when kernel keeps FIFO api requeue gives you fairness while
> > > with waking everything a important thread could be buried by lower
> > > priority threads that after each broadcast do something small and wait
> > > for broadcast again. 
> > 
> > If you wake several threads, then the those with highest priority will
> > run.
> 
> While they will run that doesn't mean that they will win a race. For
> example you have 4 core cpu and, 3 low priority threads and one high
> priority one. Each will check condition. If true then it will consume 
> data causing condition be false. If condition not met he will wait again. 
> Then high priority thread has only 1/4 chance to be first to get lock 
> to consume data. With requeue he could always do it.

No, if the lock is of a PI kind, the lower-prio thread will acquire it and
put its TID in as lock owner, which then allows a higher-prio thread to
boost the priority of the lower-prio thread.  So, the locks themselves are
fine.
There is no guarantee that a higher-prio thread will be able to grab a
signal when a lower-prio thread tries to grab one concurrently.  But
that's not something that the condvar is required to guarantee.  (But
note that what the new algorithm fixes is that when a waiter is eligible
for consuming a signal (and there's no other thread that's eligible too
and could grab it first), a waiter that starts waiting after the signal
(ie, isn't eligible) cannot steal the signal anymore).
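For readers unfamiliar with the PI protocol referred to here, a minimal
sketch of the userspace side (see futex(2)); this is illustrative only, not
glibc's mutex code, and it omits error handling and the unlock path:

  #define _GNU_SOURCE
  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* PI locks store the owner's TID in the futex word, so the kernel knows
     whom to priority-boost when a higher-priority thread blocks on it.  */
  static void
  pi_lock (atomic_int *futex_word)
  {
    int tid = (int) syscall (SYS_gettid);
    int expected = 0;

    /* Uncontended fast path: 0 -> our TID, entirely in userspace.  */
    if (atomic_compare_exchange_strong (futex_word, &expected, tid))
      return;

    /* Contended path: the kernel sets FUTEX_WAITERS in the word, queues us
       by priority, and boosts the current owner (found via the stored TID).  */
    syscall (SYS_futex, futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
  }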
  
Ondrej Bilka July 3, 2015, 11:28 a.m. UTC | #15
On Fri, Jul 03, 2015 at 11:03:25AM +0200, Torvald Riegel wrote:
> On Thu, 2015-07-02 at 23:48 +0200, Ondřej Bílka wrote:
> > > > > > * If condvars use short critical sections (ie, hold the mutex just to
> > > > > > check a binary flag or such), which they should do ideally, then forcing
> > > > > > all those waiter to proceed serially with kernel-based hand-off (ie,
> > > > > > futex ops in the mutex' contended state, via the futex wait queues) will
> > > > > > be less efficient than just letting a scalable mutex implementation take
> > > > > > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > > > > > if critical sections are short, spinning can be much better.
> > > > 
> > > > That looks too complicate code much, how do you want to pass information
> > > > do differentiate between signal/broadcast?
> > > 
> > > I don't understand your question.  How does it relate to my paragraph
> > > you replied to?
> > >
> > A problem is that mutex doesn't just protect short critical section, for
> > example its C++ monitor with other method that locks for long time.
> 
> Yes, we may not just be dealing with short critical sections.  However,
> if we have long critical sections on the waiter side, and broadcast or
> many signals wake many waiters, then some potential slowdown due to
> contention is less important because the program isn't scalable in the
> first place (ie, has many long critical sections that need to be
> serialized).  Specifically, one of the long critical sections will get
> the lock; while it has it, the others will sort things out, in parallel
> with the critical section that is running.  That means we may do some
> extra work, but it won't slow down the critical section that has the
> lock.
>
My argument was that the other method runs in a different thread, while the
waiters could be fast. So it could be hard to distinguish between these
cases.

The extra work matters if the program uses multiple threads effectively and
could run computation instead of rescheduling threads that will only find
the lock locked.
 
> > > > > > * Doing the requeue stuff requires all waiters to always drive the mutex
> > > > > > into the contended state.  This leads to each waiter having to call
> > > > > > futex_wake after lock release, even if this wouldn't be necessary.
> > > > > >
> > > > 
> > > > That is most important point. My hypotesis is that mutex will almost
> > > > always be unlocked for threads after wake.
> > > 
> > > Well, that depends on how both the mutex and the condvar are used,
> > > actually.
> > > 
> > I would say that scheduling has more impact than use case as I tried to
> > write below.
> > 
> > > > Here question is how does
> > > > wake and scheduler interact. Here bad case would be effective wake that
> > > > could simultaneously schedule threads to free cores and all would
> > > > collide. A better deal would be if there would be 1000 cycle delay
> > > > between different cores as when next thread tried to lock previous
> > > > thread would be already done.
> > > 
> > > I don't understand those sentences.
> > > 
> > I did model when threads don't block after broadcast until they release
> > lock.
> > 
> > Initially threads occupy a k free cores and there is contention.
> 
> What kind of contention do you mean?  Contention for free compute
> resources, or contention on a lock (ie, the high-level resource), or
> contention on a memory location (eg, many concurrent acquisition
> attempts on the same lock)?
> 
As you pointed out before, I realized that my argument was flawed. As the
other cores were free, a computation there wouldn't increase the total
running time. So the benefit from not running these threads, which is what
requeue gives you, is small.

> > After k
> > threads acquire lock situation stabilizes as each core runs that thread
> > until it blocks. Contention is less likely as you need that happen for
> > two threads in interval smaller than it takes one thread to get lock, do
> > stuff and release lock. If in initial phase threads are not started
> > simultaneously but with some delay contention could be lower as thread
> > completed its job.
> > 
> > > > So how that behaves in practice?
> > > 
> > > I don't understand this question.
> > >
> > Just wanted to know how often lock takes fast path after broadcast.
> 
> I still don't understand what you're trying to ask, sorry.
> 
> Whether the lock implementation sees an uncontended lock is really up to
> the workload.  And even then the lock itself can spin for a while (which

Correct. I wanted to see some workloads to find out whether it's a problem there.

> we don't really do, even ADAPTIVE_MUTEX has a very simplistic spinning)
> to not actually block via futexes (ie, the slow path).  So, this is more
> about the lock implementation and the workload than about what the
> condvar is doing; the existing condvar code tries to be nicer to the
> existing lock implementation but, as I wrote, I don't think this is
> really effective, prevents lock elision usage, and is likely better
> addressed by improving the scalability of our lock implementation.
> 
> Does that answer your question?
> 
Not completely; it looks like not doing requeue will help, but I wanted
some data about workloads, instead of relying on just intuition, to see how
this should be optimized. I mainly don't believe that spinning would be of
much help, as the conditions for that are narrow.

> > > > > > PI support is "kind of" included.  There is no internal lock anymore, so
> > > > > > the thing that Darren proposed the fix for is gone.  So far so good;
> > > > > > however, we don't requeue, and Darren's paper states that requeue would
> > > > > > yield better latency in the PI scenario (is this still the case?).
> > > > > >
> > > > You have problem that when kernel keeps FIFO api requeue gives you fairness while
> > > > with waking everything a important thread could be buried by lower
> > > > priority threads that after each broadcast do something small and wait
> > > > for broadcast again. 
> > > 
> > > If you wake several threads, then the those with highest priority will
> > > run.
> > 
> > While they will run that doesn't mean that they will win a race. For
> > example you have 4 core cpu and, 3 low priority threads and one high
> > priority one. Each will check condition. If true then it will consume 
> > data causing condition be false. If condition not met he will wait again. 
> > Then high priority thread has only 1/4 chance to be first to get lock 
> > to consume data. With requeue he could always do it.
> 
> No, if the lock is of a PI kind, the lower prio thread will acquire it,
> put it's TID as lock owner, which then will allow a higher-prio to boost
> the priority of the lower prio thread.  So, the locks itself are fine.
> There is no guarantee that a higher-prio thread will be able to grab a
> signal when a lower-prio thread tries to grab one concurrently.  But
> that's not something that the condvar is required to guarantee.  (But
> note that what the new algorithm fixes is that when a waiter is eligible
> for consuming a signal (and there's no other thread that's eligible too
> and could grab it first), a waiter that starts waiting after the signal
> (ie, isn't eligible) cannot steal the signal anymore). 

As you asked whether requeue would yield better latency, I answered that it
helps the highest-priority thread.
  
Torvald Riegel July 3, 2015, 2 p.m. UTC | #16
On Fri, 2015-07-03 at 13:28 +0200, Ondřej Bílka wrote:
> On Fri, Jul 03, 2015 at 11:03:25AM +0200, Torvald Riegel wrote:
> > On Thu, 2015-07-02 at 23:48 +0200, Ondřej Bílka wrote:
> > > > > > > * If condvars use short critical sections (ie, hold the mutex just to
> > > > > > > check a binary flag or such), which they should do ideally, then forcing
> > > > > > > all those waiter to proceed serially with kernel-based hand-off (ie,
> > > > > > > futex ops in the mutex' contended state, via the futex wait queues) will
> > > > > > > be less efficient than just letting a scalable mutex implementation take
> > > > > > > care of it.  Our current mutex impl doesn't employ spinning at all, but
> > > > > > > if critical sections are short, spinning can be much better.
> > > > > 
> > > > > That looks too complicate code much, how do you want to pass information
> > > > > do differentiate between signal/broadcast?
> > > > 
> > > > I don't understand your question.  How does it relate to my paragraph
> > > > you replied to?
> > > >
> > > A problem is that mutex doesn't just protect short critical section, for
> > > example its C++ monitor with other method that locks for long time.
> > 
> > Yes, we may not just be dealing with short critical sections.  However,
> > if we have long critical sections on the waiter side, and broadcast or
> > many signals wake many waiters, then some potential slowdown due to
> > contention is less important because the program isn't scalable in the
> > first place (ie, has many long critical sections that need to be
> > serialized).  Specifically, one of the long critical sections will get
> > the lock; while it has it, the others will sort things out, in parallel
> > with the critical section that is running.  That means we may do some
> > extra work, but it won't slow down the critical section that has the
> > lock.
> >
> My argument was that a other method is from different thread but waiters
> could be fast. So it could be hard to distinguish between these.
> 
> A extra work matters if program used multiple threads effectively and
> could run computation instead of rescheduling threads that will only
> find lock locked. 

If a program relies on threads waiting to acquire a lock to give up the
compute resource so that it can run other stuff instead, the program
isn't using threads efficiently.  It's oversubscribing the machine, and
the context switches will matter more than a bit of contention.
 
> > we don't really do, even ADAPTIVE_MUTEX has a very simplistic spinning)
> > to not actually block via futexes (ie, the slow path).  So, this is more
> > about the lock implementation and the workload than about what the
> > condvar is doing; the existing condvar code tries to be nicer to the
> > existing lock implementation but, as I wrote, I don't think this is
> > really effective, prevents lock elision usage, and is likely better
> > addressed by improving the scalability of our lock implementation.
> > 
> > Does that answer your question?
> > 
> Not completely, it looks that no requeue will help but I wanted some
> data about workloads instead of relying on just intuition to see how
> should that be optimized. I mainly don't believe that spinning would be
> much of help as conditions for that are narrow.

The literature says otherwise, and many other real-world lock
implementations do spin for a while before blocking.  Just look at Java
locks, for example.
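To make the point concrete, a spin-then-block acquire path might look like
the sketch below.  This is not glibc's mutex code; the 0/1/2 state encoding
and the spin count are arbitrary choices for illustration.

  #define _GNU_SOURCE
  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Futex word states: 0 = unlocked, 1 = locked, 2 = locked with (possible)
     waiters.  A classic futex mutex with a bounded spinning phase up front.  */

  static void
  spin_then_block_lock (atomic_int *f)
  {
    /* Spinning phase: retry the uncontended acquisition for a while before
       falling back to the kernel.  */
    for (int i = 0; i < 128; i++)
      {
        int expected = 0;
        if (atomic_compare_exchange_weak (f, &expected, 1))
          return;
      }

    /* Slow path: mark the lock as contended and block until it is released.  */
    while (atomic_exchange (f, 2) != 0)
      syscall (SYS_futex, f, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
  }

  static void
  spin_then_block_unlock (atomic_int *f)
  {
    /* If the lock might have waiters, wake exactly one of them.  */
    if (atomic_exchange (f, 0) == 2)
      syscall (SYS_futex, f, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
  }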
  
Mike Frysinger July 8, 2015, 6:14 a.m. UTC | #17
not entirely sure how off topic this is for this thread, but i found this paper 
interesting/helpful for HLE and some of its issues:
	http://www.cs.technion.ac.il/~mad/publications/podc2014-sihle.pdf
-mike
  
Ondrej Bilka July 11, 2015, 1:35 p.m. UTC | #18
On Fri, Jul 03, 2015 at 04:00:55PM +0200, Torvald Riegel wrote:
> On Fri, 2015-07-03 at 13:28 +0200, Ondřej Bílka wrote:
> > On Fri, Jul 03, 2015 at 11:03:25AM +0200, Torvald Riegel wrote:
> > > On Thu, 2015-07-02 at 23:48 +0200, Ondřej Bílka wrote:
> > > Yes, we may not just be dealing with short critical sections.  However,
> > > if we have long critical sections on the waiter side, and broadcast or
> > > many signals wake many waiters, then some potential slowdown due to
> > > contention is less important because the program isn't scalable in the
> > > first place (ie, has many long critical sections that need to be
> > > serialized).  Specifically, one of the long critical sections will get
> > > the lock; while it has it, the others will sort things out, in parallel
> > > with the critical section that is running.  That means we may do some
> > > extra work, but it won't slow down the critical section that has the
> > > lock.
> > >
> > My argument was that a other method is from different thread but waiters
> > could be fast. So it could be hard to distinguish between these.
> > 
> > A extra work matters if program used multiple threads effectively and
> > could run computation instead of rescheduling threads that will only
> > find lock locked. 
> 
> If a program relies on threads waiting to acquire a lock to give up the
> compute resource so that it can run other stuff instead, the program
> isn't using threads efficiently.  It's oversubscribing the machine, and
> the context switches will matter more than a bit of contention.
>
That's false, as it doesn't have to be one program at all; it could affect
a different process on the machine.

Also, what exactly did you want to say about oversubscribing the machine?
Using requeue is optimal with respect to context switches: you need to do a
context switch for each waiter anyway, and requeue guarantees that you
don't do any extra ones because of the lock.
  
> > > we don't really do, even ADAPTIVE_MUTEX has a very simplistic spinning)
> > > to not actually block via futexes (ie, the slow path).  So, this is more
> > > about the lock implementation and the workload than about what the
> > > condvar is doing; the existing condvar code tries to be nicer to the
> > > existing lock implementation but, as I wrote, I don't think this is
> > > really effective, prevents lock elision usage, and is likely better
> > > addressed by improving the scalability of our lock implementation.
> > > 
> > > Does that answer your question?
> > > 
> > Not completely, it looks that no requeue will help but I wanted some
> > data about workloads instead of relying on just intuition to see how
> > should that be optimized. I mainly don't believe that spinning would be
> > much of help as conditions for that are narrow.
> 
> The literature says otherwise, and many other real-world lock
> implementations do spin for a while before blocking.  Just look at Java
> locks, for example. 

That's the general case, not this one. Those implementations rely on the
observation that a common usage pattern is several threads that frequently
lock for a short time. As the usage pattern here is different, that doesn't
have to hold.
  
Ondrej Bilka July 11, 2015, 1:41 p.m. UTC | #19
On Fri, Jul 03, 2015 at 11:03:25AM +0200, Torvald Riegel wrote:
> On Thu, 2015-07-02 at 23:48 +0200, Ondřej Bílka wrote:
> > > > > > PI support is "kind of" included.  There is no internal lock anymore, so
> > > > > > the thing that Darren proposed the fix for is gone.  So far so good;
> > > > > > however, we don't requeue, and Darren's paper states that requeue would
> > > > > > yield better latency in the PI scenario (is this still the case?).
> > > > > >
> > > > You have problem that when kernel keeps FIFO api requeue gives you fairness while
> > > > with waking everything a important thread could be buried by lower
> > > > priority threads that after each broadcast do something small and wait
> > > > for broadcast again. 
> > > 
> > > If you wake several threads, then the those with highest priority will
> > > run.
> > 
> > While they will run that doesn't mean that they will win a race. For
> > example you have 4 core cpu and, 3 low priority threads and one high
> > priority one. Each will check condition. If true then it will consume 
> > data causing condition be false. If condition not met he will wait again. 
> > Then high priority thread has only 1/4 chance to be first to get lock 
> > to consume data. With requeue he could always do it.
> 
> No, if the lock is of a PI kind, the lower prio thread will acquire it,
> put it's TID as lock owner, which then will allow a higher-prio to boost
> the priority of the lower prio thread.  So, the locks itself are fine.
> There is no guarantee that a higher-prio thread will be able to grab a
> signal when a lower-prio thread tries to grab one concurrently.  But
> that's not something that the condvar is required to guarantee.  (But
> note that what the new algorithm fixes is that when a waiter is eligible
> for consuming a signal (and there's no other thread that's eligible too
> and could grab it first), a waiter that starts waiting after the signal
> (ie, isn't eligible) cannot steal the signal anymore). 

While there are still issues, I realized that it doesn't matter much, as
the condvar doesn't handle PI well anyway. While it sort of handles
signal/broadcast, it doesn't handle the fact that the condition itself
could be affected by priority inversion.

I don't know how to fix that without allocating memory for each
condvar/thread pair, to give the thread that grabs the mutex to satisfy the
condition the correct priority.
  
Torvald Riegel July 13, 2015, 6:57 p.m. UTC | #20
On Sat, 2015-07-11 at 15:35 +0200, Ondřej Bílka wrote:
> On Fri, Jul 03, 2015 at 04:00:55PM +0200, Torvald Riegel wrote:
> > On Fri, 2015-07-03 at 13:28 +0200, Ondřej Bílka wrote:
> > > On Fri, Jul 03, 2015 at 11:03:25AM +0200, Torvald Riegel wrote:
> > > > On Thu, 2015-07-02 at 23:48 +0200, Ondřej Bílka wrote:
> > > > Yes, we may not just be dealing with short critical sections.  However,
> > > > if we have long critical sections on the waiter side, and broadcast or
> > > > many signals wake many waiters, then some potential slowdown due to
> > > > contention is less important because the program isn't scalable in the
> > > > first place (ie, has many long critical sections that need to be
> > > > serialized).  Specifically, one of the long critical sections will get
> > > > the lock; while it has it, the others will sort things out, in parallel
> > > > with the critical section that is running.  That means we may do some
> > > > extra work, but it won't slow down the critical section that has the
> > > > lock.
> > > >
> > > My argument was that a other method is from different thread but waiters
> > > could be fast. So it could be hard to distinguish between these.
> > > 
> > > A extra work matters if program used multiple threads effectively and
> > > could run computation instead of rescheduling threads that will only
> > > find lock locked. 
> > 
> > If a program relies on threads waiting to acquire a lock to give up the
> > compute resource so that it can run other stuff instead, the program
> > isn't using threads efficiently.  It's oversubscribing the machine, and
> > the context switches will matter more than a bit of contention.
> >
> Thats false as it doesn't have to be one program at all, it could affect
> different process on machine.

Same thing, it's inefficient in this case too.  Granted, many programs
just assume that they have the machine completely for themselves, and
that's often the case -- but that doesn't mean that this is optimal
behavior in general.  Providing abstractions to make this easier to handle
correctly is future work.
  

Patch

commit db7d3860a02a6617d4d77324185aa0547cc58391
Author: Torvald Riegel <triegel@redhat.com>
Date:   Sun Nov 10 15:43:14 2013 +0100

    New condvar implementation that provides stronger ordering guarantees.

diff --git a/nptl/DESIGN-condvar.txt b/nptl/DESIGN-condvar.txt
deleted file mode 100644
index 4845251..0000000
--- a/nptl/DESIGN-condvar.txt
+++ /dev/null
@@ -1,134 +0,0 @@ 
-Conditional Variable pseudocode.
-================================
-
-       int pthread_cond_timedwait (pthread_cond_t *cv, pthread_mutex_t *mutex);
-       int pthread_cond_signal    (pthread_cond_t *cv);
-       int pthread_cond_broadcast (pthread_cond_t *cv);
-
-struct pthread_cond_t {
-
-   unsigned int cond_lock;
-
-         internal mutex
-
-   uint64_t total_seq;
-
-     Total number of threads using the conditional variable.
-
-   uint64_t wakeup_seq;
-
-     sequence number for next wakeup.
-
-   uint64_t woken_seq;
-
-     sequence number of last woken thread.
-
-   uint32_t broadcast_seq;
-
-}
-
-
-struct cv_data {
-
-   pthread_cond_t *cv;
-
-   uint32_t bc_seq
-
-}
-
-
-
-cleanup_handler(cv_data)
-{
-  cv = cv_data->cv;
-  lll_lock(cv->lock);
-
-  if (cv_data->bc_seq == cv->broadcast_seq) {
-    ++cv->wakeup_seq;
-    ++cv->woken_seq;
-  }
-
-  /* make sure no signal gets lost.  */
-  FUTEX_WAKE(cv->wakeup_seq, ALL);
-
-  lll_unlock(cv->lock);
-}
-
-
-cond_timedwait(cv, mutex, timeout):
-{
-   lll_lock(cv->lock);
-   mutex_unlock(mutex);
-
-   cleanup_push
-
-   ++cv->total_seq;
-   val = seq =  cv->wakeup_seq;
-   cv_data.bc = cv->broadcast_seq;
-   cv_data.cv = cv;
-
-   while (1) {
-
-     lll_unlock(cv->lock);
-
-     enable_async(&cv_data);
-
-     ret = FUTEX_WAIT(cv->wakeup_seq, val, timeout);
-
-     restore_async
-
-     lll_lock(cv->lock);
-
-     if (bc != cv->broadcast_seq)
-       goto bc_out;
-
-     val = cv->wakeup_seq;
-
-     if (val != seq && cv->woken_seq != val) {
-       ret = 0;
-       break;
-     }
-
-     if (ret == TIMEDOUT) {
-       ++cv->wakeup_seq;
-       break;
-     }
-   }
-
-   ++cv->woken_seq;
-
- bc_out:
-   lll_unlock(cv->lock);
-
-   cleanup_pop
-
-   mutex_lock(mutex);
-
-   return ret;
-}
-
-cond_signal(cv)
-{
-   lll_lock(cv->lock);
-
-   if (cv->total_seq > cv->wakeup_seq) {
-     ++cv->wakeup_seq;
-     FUTEX_WAKE(cv->wakeup_seq, 1);
-   }
-
-   lll_unlock(cv->lock);
-}
-
-cond_broadcast(cv)
-{
-   lll_lock(cv->lock);
-
-   if (cv->total_seq > cv->wakeup_seq) {
-     cv->wakeup_seq = cv->total_seq;
-     cv->woken_seq = cv->total_seq;
-     ++cv->broadcast_seq;
-     FUTEX_WAKE(cv->wakeup_seq, ALL);
-   }
-
-   lll_unlock(cv->lock);
-}
diff --git a/nptl/Makefile b/nptl/Makefile
index 89fdc8b..50a85a6 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -71,7 +71,7 @@  libpthread-routines = nptl-init vars events version \
 		      pthread_rwlockattr_getkind_np \
 		      pthread_rwlockattr_setkind_np \
 		      pthread_cond_init pthread_cond_destroy \
-		      pthread_cond_wait pthread_cond_timedwait \
+		      pthread_cond_wait \
 		      pthread_cond_signal pthread_cond_broadcast \
 		      old_pthread_cond_init old_pthread_cond_destroy \
 		      old_pthread_cond_wait old_pthread_cond_timedwait \
@@ -178,7 +178,6 @@  CFLAGS-pthread_timedjoin.c = -fexceptions -fasynchronous-unwind-tables
 CFLAGS-pthread_once.c = $(uses-callbacks) -fexceptions \
 			-fasynchronous-unwind-tables
 CFLAGS-pthread_cond_wait.c = -fexceptions -fasynchronous-unwind-tables
-CFLAGS-pthread_cond_timedwait.c = -fexceptions -fasynchronous-unwind-tables
 CFLAGS-sem_wait.c = -fexceptions -fasynchronous-unwind-tables
 CFLAGS-sem_timedwait.c = -fexceptions -fasynchronous-unwind-tables
 
@@ -216,7 +215,7 @@  tests = tst-typesizes \
 	tst-cond8 tst-cond9 tst-cond10 tst-cond11 tst-cond12 tst-cond13 \
 	tst-cond14 tst-cond15 tst-cond16 tst-cond17 tst-cond18 tst-cond19 \
 	tst-cond20 tst-cond21 tst-cond22 tst-cond23 tst-cond24 tst-cond25 \
-	tst-cond-except \
+	tst-cond26 tst-cond27 tst-cond28 tst-cond-except \
 	tst-robust1 tst-robust2 tst-robust3 tst-robust4 tst-robust5 \
 	tst-robust6 tst-robust7 tst-robust8 tst-robust9 \
 	tst-robustpi1 tst-robustpi2 tst-robustpi3 tst-robustpi4 tst-robustpi5 \
@@ -280,8 +279,7 @@  test-srcs = tst-oddstacklimit
 # Files which must not be linked with libpthread.
 tests-nolibpthread = tst-unload
 
-gen-as-const-headers = pthread-errnos.sym \
-		       lowlevelcond.sym lowlevelrwlock.sym \
+gen-as-const-headers = pthread-errnos.sym lowlevelrwlock.sym \
 		       lowlevelbarrier.sym unwindbuf.sym \
 		       lowlevelrobustlock.sym pthread-pi-defines.sym
 
diff --git a/nptl/lowlevelcond.sym b/nptl/lowlevelcond.sym
deleted file mode 100644
index 18e1ada..0000000
--- a/nptl/lowlevelcond.sym
+++ /dev/null
@@ -1,16 +0,0 @@ 
-#include <stddef.h>
-#include <sched.h>
-#include <bits/pthreadtypes.h>
-#include <internaltypes.h>
-
---
-
-cond_lock	offsetof (pthread_cond_t, __data.__lock)
-cond_futex	offsetof (pthread_cond_t, __data.__futex)
-cond_nwaiters	offsetof (pthread_cond_t, __data.__nwaiters)
-total_seq	offsetof (pthread_cond_t, __data.__total_seq)
-wakeup_seq	offsetof (pthread_cond_t, __data.__wakeup_seq)
-woken_seq	offsetof (pthread_cond_t, __data.__woken_seq)
-dep_mutex	offsetof (pthread_cond_t, __data.__mutex)
-broadcast_seq	offsetof (pthread_cond_t, __data.__broadcast_seq)
-nwaiters_shift	COND_NWAITERS_SHIFT
diff --git a/nptl/pthread_cond_broadcast.c b/nptl/pthread_cond_broadcast.c
index 881d098..6848d61 100644
--- a/nptl/pthread_cond_broadcast.c
+++ b/nptl/pthread_cond_broadcast.c
@@ -23,69 +23,74 @@ 
 #include <pthread.h>
 #include <pthreadP.h>
 #include <stap-probe.h>
+#include <atomic.h>
 
 #include <shlib-compat.h>
 #include <kernel-features.h>
 
 
+/* See __pthread_cond_wait for a high-level description of the algorithm.  */
 int
-__pthread_cond_broadcast (cond)
-     pthread_cond_t *cond;
+__pthread_cond_broadcast (pthread_cond_t *cond)
 {
-  LIBC_PROBE (cond_broadcast, 1, cond);
+  unsigned int gen, wseq, ssent;
 
-  int pshared = (cond->__data.__mutex == (void *) ~0l)
+  /* See comment in __pthread_cond_signal.  */
+  int pshared = (atomic_load_relaxed (&cond->__data.__mutex) == (void *) ~0l)
 		? LLL_SHARED : LLL_PRIVATE;
-  /* Make sure we are alone.  */
-  lll_lock (cond->__data.__lock, pshared);
-
-  /* Are there any waiters to be woken?  */
-  if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
-    {
-      /* Yes.  Mark them all as woken.  */
-      cond->__data.__wakeup_seq = cond->__data.__total_seq;
-      cond->__data.__woken_seq = cond->__data.__total_seq;
-      cond->__data.__futex = (unsigned int) cond->__data.__total_seq * 2;
-      int futex_val = cond->__data.__futex;
-      /* Signal that a broadcast happened.  */
-      ++cond->__data.__broadcast_seq;
-
-      /* We are done.  */
-      lll_unlock (cond->__data.__lock, pshared);
 
-      /* Wake everybody.  */
-      pthread_mutex_t *mut = (pthread_mutex_t *) cond->__data.__mutex;
-
-      /* Do not use requeue for pshared condvars.  */
-      if (mut == (void *) ~0l
-	  || PTHREAD_MUTEX_PSHARED (mut) & PTHREAD_MUTEX_PSHARED_BIT)
-	goto wake_all;
+  LIBC_PROBE (cond_broadcast, 1, cond);
 
-#if (defined lll_futex_cmp_requeue_pi \
-     && defined __ASSUME_REQUEUE_PI)
-      if (USE_REQUEUE_PI (mut))
-	{
-	  if (lll_futex_cmp_requeue_pi (&cond->__data.__futex, 1, INT_MAX,
-					&mut->__data.__lock, futex_val,
-					LLL_PRIVATE) == 0)
-	    return 0;
-	}
-      else
-#endif
-	/* lll_futex_requeue returns 0 for success and non-zero
-	   for errors.  */
-	if (!__builtin_expect (lll_futex_requeue (&cond->__data.__futex, 1,
-						  INT_MAX, &mut->__data.__lock,
-						  futex_val, LLL_PRIVATE), 0))
-	  return 0;
+  /* We use the same approach for broadcasts as for normal signals but wake
+     all waiters (i.e., we try to set SIGNALS_SENT to WSEQ).  However, to
+     prevent an excessive number of spurious wake-ups, we need to check
+     whether we read values for SIGNALS_SENT and WSEQ that are from one
+     generation; otherwise, we could read a not-yet-reset WSEQ and a reset
+     SIGNALS_SENT, resulting in a very large number of spurious wake-ups that
+     we make available.  Checking the generation won't prevent an ABA problem
+     for the CAS in the loop below when the generation changes between our
+     generation check and the CAS; however, in this case, we just add a still
+     reasonable number of spurious wake-ups (i.e., equal to the number of
+     waiters that were actually blocked on the condvar at some point in the
+     past).  Therefore, we first load the current generation.  We need
+     acquire MO here to make sure that we next read values for WSEQ and
+     SIGNALS_SENT from this or a later generation (see the matching release
+     MOs in __pthread_cond_wait).  */
+  gen = atomic_load_acquire (&cond->__data.__generation);
+  wseq = atomic_load_relaxed (&cond->__data.__wseq);
+  ssent = atomic_load_relaxed (&cond->__data.__signals_sent);
+  do
+    {
 
-wake_all:
-      lll_futex_wake (&cond->__data.__futex, INT_MAX, pshared);
-      return 0;
+      /* If the generation changed concurrently, then we could have been
+         positioned in the earlier generation; thus, all waiters we must wake
+         have been or will be woken during the quiescence period.  The other
+         conditions are the same as in __pthread_cond_signal.
+         We add an acquire-MO fence to ensure that the stores to WSEQ and
+         SIGNALS_SENT whose values we read above happened before we read
+         GENERATION a second time, which allows us to detect if we read
+         partially reset state or state from a new generation (see
+         __pthread_cond_wait and the matching release MO stores there).  */
+      atomic_thread_fence_acquire ();
+      if (gen != atomic_load_relaxed (&cond->__data.__generation)
+	  || ssent >= wseq || wseq >= __PTHREAD_COND_WSEQ_THRESHOLD)
+	return 0;
     }
-
-  /* We are done.  */
-  lll_unlock (cond->__data.__lock, pshared);
+  while (!atomic_compare_exchange_weak_relaxed (&cond->__data.__signals_sent,
+						&ssent, wseq));
+
+  /* XXX Could we skip the futex_wake if not necessary (eg, if there are just
+     spinning waiters)?  This would need additional communication but could it
+     be more efficient than the kernel-side communication?  Should we spin for
+     a while to see if our signal was consumed in the meantime?  */
+  /* TODO Use futex_requeue on the mutex?  Could increase broadcast scalability
+     if there are many waiters, but this depends on the scalability of the
+     mutex.  It also prevents use of lock elision, and requires all waiters
+     to put the mutex in contended state when they re-acquire it, independent
+     of whether they were woken by a broadcast or not.  Note that we can only
+     requeue if the mutex is set already (ie, we had waiters already).  */
+  /* XXX Really INT_MAX? Would WSEQ-SIGNALS_SENT be possible?  Useful?  */
+  lll_futex_wake (&cond->__data.__signals_sent, INT_MAX, pshared);
 
   return 0;
 }
diff --git a/nptl/pthread_cond_destroy.c b/nptl/pthread_cond_destroy.c
index 410e52d..7c9cf13 100644
--- a/nptl/pthread_cond_destroy.c
+++ b/nptl/pthread_cond_destroy.c
@@ -20,67 +20,94 @@ 
 #include <shlib-compat.h>
 #include "pthreadP.h"
 #include <stap-probe.h>
-
-
+#include <atomic.h>
+
+
+/* See __pthread_cond_wait for a high-level description of the algorithm.
+
+   A correct program must make sure that no waiters are blocked on the condvar
+   when it is destroyed, and that there are no concurrent signals or
+   broadcasts.  To wake waiters reliably, the program must signal or
+   broadcast while holding the mutex or after having held the mutex.  It must
+   also ensure that no signal or broadcast are still pending to unblock
+   waiters; IOW, because waiters can wake up spuriously, the program must
+   effectively ensure that destruction happens after the execution of those
+   signal or broadcast calls.
+   Thus, we can assume that any waiters that are still accessing the condvar
+   will either (1) have been woken but not yet confirmed that they woke up or
+   (2) wait for quiescence to finish (i.e., the only steps they will perform
+   are waiting on GENERATION and then decrementing QUIESCENCE_WAITERS; all
+   other steps related to quiescence are performed by waiters before they
+   release the mutex).
+   Thus, if we are not yet in an ongoing quiescence state, we just make
+   the last concurrently confirming waiter believe we are so that it notifies
+   us; then we wait for QUIESCENCE_WAITERS to finish waiting for the end of
+   the quiescence state.  */
 int
-__pthread_cond_destroy (cond)
-     pthread_cond_t *cond;
+__pthread_cond_destroy (pthread_cond_t *cond)
 {
-  int pshared = (cond->__data.__mutex == (void *) ~0l)
+  unsigned int wseq, val;
+
+  /* See comment in __pthread_cond_signal.  */
+  int pshared = (atomic_load_relaxed (&cond->__data.__mutex) == (void *) ~0l)
 		? LLL_SHARED : LLL_PRIVATE;
 
   LIBC_PROBE (cond_destroy, 1, cond);
 
-  /* Make sure we are alone.  */
-  lll_lock (cond->__data.__lock, pshared);
-
-  if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
-    {
-      /* If there are still some waiters which have not been
-	 woken up, this is an application bug.  */
-      lll_unlock (cond->__data.__lock, pshared);
-      return EBUSY;
-    }
-
-  /* Tell pthread_cond_*wait that this condvar is being destroyed.  */
-  cond->__data.__total_seq = -1ULL;
-
-  /* If there are waiters which have been already signalled or
-     broadcasted, but still are using the pthread_cond_t structure,
-     pthread_cond_destroy needs to wait for them.  */
-  unsigned int nwaiters = cond->__data.__nwaiters;
-
-  if (nwaiters >= (1 << COND_NWAITERS_SHIFT))
+  /* If we are already in the quiescence state, then signals and broadcasts
+     will not modify SIGNALS_SENT anymore because all waiters will wake up
+     anyway (and we don't have to synchronize between signal/broadcast and the
+     reset of SIGNALS_SENT when quiescence is finished).  Thus, never do the
+     following check in this case; it cannot be done reliably anyway, and is
+     also just recommended by POSIX.  */
+  wseq = atomic_load_relaxed (&cond->__data.__wseq);
+  if (wseq != __PTHREAD_COND_WSEQ_THRESHOLD
+      && wseq > atomic_load_relaxed (&cond->__data.__signals_sent))
+    return EBUSY;
+
+  /* Waiters can either be (1) pending to confirm that they have been woken
+     or (2) spinning/blocking on GENERATION to become odd.  Thus, we first
+     need to make sure that any waiter woken by the program has finished the
+     condvar-internal synchronization (i.e., it has confirmed the wake-up).
+     We use the quiescence mechanism to get notified when all of them are
+     finished by adding the right amount of artificial confirmed waiters.
+     XXX Or is just relaxed MO sufficient because happens-before is
+     established through the total modification order on CONFIRMED?  */
+  if (atomic_fetch_add_acq_rel (&cond->__data.__confirmed,
+				__PTHREAD_COND_WSEQ_THRESHOLD - wseq)
+      < wseq)
     {
-      /* Wake everybody on the associated mutex in case there are
-	 threads that have been requeued to it.
-	 Without this, pthread_cond_destroy could block potentially
-	 for a long time or forever, as it would depend on other
-	 thread's using the mutex.
-	 When all threads waiting on the mutex are woken up, pthread_cond_wait
-	 only waits for threads to acquire and release the internal
-	 condvar lock.  */
-      if (cond->__data.__mutex != NULL
-	  && cond->__data.__mutex != (void *) ~0l)
+      /* There are waiters that haven't yet confirmed.  If we have an even
+	 number for generation, wait until it is changed by the last waiter
+	 to confirm.  (The last waiter will increase to WSEQ_THRESHOLD, so
+	 it will increase GENERATION to an odd value.)  We need acquire MO
+	 to make any waiters' accesses to the condvar happen before we
+	 destroy it.*/
+      while (1)
 	{
-	  pthread_mutex_t *mut = (pthread_mutex_t *) cond->__data.__mutex;
-	  lll_futex_wake (&mut->__data.__lock, INT_MAX,
-			  PTHREAD_MUTEX_PSHARED (mut));
+	  val = atomic_load_acquire (&cond->__data.__generation);
+	  if ((val & 1) != 1)
+	    lll_futex_wait (&cond->__data.__generation, val, pshared);
+	  else
+	    break;
 	}
+    }
 
-      do
-	{
-	  lll_unlock (cond->__data.__lock, pshared);
-
-	  lll_futex_wait (&cond->__data.__nwaiters, nwaiters, pshared);
-
-	  lll_lock (cond->__data.__lock, pshared);
-
-	  nwaiters = cond->__data.__nwaiters;
-	}
-      while (nwaiters >= (1 << COND_NWAITERS_SHIFT));
+  /* If we are in a quiescence period, we also need to wait for those waiters
+     that are waiting for quiescence to finish.  Note that we cannot have
+     pushed waiters into this state by artificially introducing quiescence
+     above, so we also do not wake any such waiters.  As above, we need
+     acquire MO.  */
+  while (1)
+    {
+      val = atomic_load_acquire (&cond->__data.__quiescence_waiters);
+      if (val > 0)
+	lll_futex_wait (&cond->__data.__quiescence_waiters, val, pshared);
+      else
+	break;
     }
 
+  /* The memory the condvar occupies can now be reused.  */
   return 0;
 }
 versioned_symbol (libpthread, __pthread_cond_destroy,
diff --git a/nptl/pthread_cond_init.c b/nptl/pthread_cond_init.c
index ce954c7..b3aa779 100644
--- a/nptl/pthread_cond_init.c
+++ b/nptl/pthread_cond_init.c
@@ -28,18 +28,17 @@  __pthread_cond_init (cond, cond_attr)
 {
   struct pthread_condattr *icond_attr = (struct pthread_condattr *) cond_attr;
 
-  cond->__data.__lock = LLL_LOCK_INITIALIZER;
-  cond->__data.__futex = 0;
-  cond->__data.__nwaiters = (icond_attr != NULL
-			     ? ((icond_attr->value >> 1)
-				& ((1 << COND_NWAITERS_SHIFT) - 1))
-			     : CLOCK_REALTIME);
-  cond->__data.__total_seq = 0;
-  cond->__data.__wakeup_seq = 0;
-  cond->__data.__woken_seq = 0;
+  cond->__data.__wseq = 0;
+  cond->__data.__signals_sent = 0;
+  cond->__data.__confirmed = 0;
+  cond->__data.__generation = 0;
   cond->__data.__mutex = (icond_attr == NULL || (icond_attr->value & 1) == 0
 			  ? NULL : (void *) ~0l);
-  cond->__data.__broadcast_seq = 0;
+  cond->__data.__quiescence_waiters = 0;
+  cond->__data.__clockid = (icond_attr != NULL
+			     ? ((icond_attr->value >> 1)
+				& ((1 << COND_CLOCK_BITS) - 1))
+			     : CLOCK_REALTIME);
 
   LIBC_PROBE (cond_init, 2, cond, cond_attr);
 
diff --git a/nptl/pthread_cond_signal.c b/nptl/pthread_cond_signal.c
index ba32f40..86968e9 100644
--- a/nptl/pthread_cond_signal.c
+++ b/nptl/pthread_cond_signal.c
@@ -22,60 +22,88 @@ 
 #include <lowlevellock.h>
 #include <pthread.h>
 #include <pthreadP.h>
+#include <atomic.h>
 
 #include <shlib-compat.h>
 #include <kernel-features.h>
 #include <stap-probe.h>
 
 
+/* See __pthread_cond_wait for a high-level description of the algorithm.  */
 int
-__pthread_cond_signal (cond)
-     pthread_cond_t *cond;
+__pthread_cond_signal (pthread_cond_t *cond)
 {
-  int pshared = (cond->__data.__mutex == (void *) ~0l)
+  unsigned int wseq, ssent;
+
+  /* MUTEX might get modified concurrently, but relaxed memory order is fine:
+     In case of a shared condvar, the field will be set to value ~0l during
+     initialization of the condvar (which happens before any signaling) and
+     is immutable afterwards; otherwise, the field will never be set to a
+     value of ~0l.  */
+  int pshared = (atomic_load_relaxed (&cond->__data.__mutex) == (void *) ~0l)
 		? LLL_SHARED : LLL_PRIVATE;
 
   LIBC_PROBE (cond_signal, 1, cond);
 
-  /* Make sure we are alone.  */
-  lll_lock (cond->__data.__lock, pshared);
-
-  /* Are there any waiters to be woken?  */
-  if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
+  /* Load the waiter sequence number, which represents our relative ordering
+     to any waiters.  Also load the number of signals sent so far.
+     We do not need stronger MOs for both loads nor an atomic snapshot of both
+     because:
+     1) We can pick any position that is allowed by external happens-before
+        constraints.  In particular, if another __pthread_cond_wait call
+        happened before us, this waiter must be eligible for being woken by
+        us.  The only way to establish such a happens-before is by signaling
+        while holding the mutex associated with the condvar and ensuring that
+        the signal's critical section happens after the waiter.  Thus, the
+        mutex ensures that we see this waiter's wseq increase.
+     2) Once we pick a position, we do not need to communicate this to the
+        program via a happens-before that we set up: First, any wake-up could
+        be a spurious wake-up, so the program must not interpret a wake-up as
+        an indication that the waiter happened before a particular signal;
+        second, a program cannot detect whether a waiter has not yet been
+        woken (i.e., it cannot distinguish between a non-woken waiter and one
+        that has been woken but hasn't resumed execution yet), and thus it
+        cannot try to deduce that a signal happened before a particular
+        waiter.
+     3) The load of WSEQ does not need to constrain which value we load for
+        SIGNALS_SENT: If we read an older value for SIGNALS_SENT (compared to
+        what would have been current if we had an atomic snapshot of both),
+        we might send a signal even if we don't need to; thus, we just get a
+        spurious wakeup.  If we read a more recent value (again, compared to
+        an atomic snapshot), then some other signal interfered and might have
+        taken "our" position in the waiter/wake-up sequence; thus the waiters
+        we had to wake will get woken either way.
+     Note that we do not need to check whether the generation changed
+     concurrently: If it would change, we would just skip any signaling
+     because we could have been positioned in the earlier generation -- all
+     the waiters we would have to wake will have been woken during the
+     quiescence period.  Thus, at worst we will cause one additional spurious
+     wake-up if we don't detect this.  */
+  wseq = atomic_load_relaxed (&cond->__data.__wseq);
+  ssent = atomic_load_relaxed (&cond->__data.__signals_sent);
+  do
     {
-      /* Yes.  Mark one of them as woken.  */
-      ++cond->__data.__wakeup_seq;
-      ++cond->__data.__futex;
-
-#if (defined lll_futex_cmp_requeue_pi \
-     && defined __ASSUME_REQUEUE_PI)
-      pthread_mutex_t *mut = cond->__data.__mutex;
-
-      if (USE_REQUEUE_PI (mut)
-	/* This can only really fail with a ENOSYS, since nobody can modify
-	   futex while we have the cond_lock.  */
-	  && lll_futex_cmp_requeue_pi (&cond->__data.__futex, 1, 0,
-				       &mut->__data.__lock,
-				       cond->__data.__futex, pshared) == 0)
-	{
-	  lll_unlock (cond->__data.__lock, pshared);
-	  return 0;
-	}
-      else
-#endif
-	/* Wake one.  */
-	if (! __builtin_expect (lll_futex_wake_unlock (&cond->__data.__futex,
-						       1, 1,
-						       &cond->__data.__lock,
-						       pshared), 0))
-	  return 0;
-
-      /* Fallback if neither of them work.  */
-      lll_futex_wake (&cond->__data.__futex, 1, pshared);
+      /* If we don't have more waiters than signals or are in a quiescence
+         period, just return because all waiters we must wake have been or
+         will be woken.  See above for further details.  */
+      if (ssent >= wseq || wseq >= __PTHREAD_COND_WSEQ_THRESHOLD)
+	return 0;
     }
-
-  /* We are done.  */
-  lll_unlock (cond->__data.__lock, pshared);
+  /* Using a CAS loop here instead of a fetch-and-increment avoids one source
+     of spurious wake-ups, namely several signalers racing to wake up a smaller
+     number of waiters and thus also waking subsequent waiters spuriously.
+     The cost of this is somewhat more contention on SIGNALS_SENT on archs
+     that offer atomic fetch-and-increment.
+     TODO Relaxed MO is sufficient here.
+   */
+  while (!atomic_compare_exchange_weak_relaxed (&cond->__data.__signals_sent,
+						&ssent, ssent + 1));
+
+  /* XXX Could we skip the futex_wake if not necessary (eg, if there are just
+     spinning waiters)?  This would need additional communication but could it
+     be more efficient than the kernel-side communication?  Should we spin for
+     a while to see if our signal was consumed in the meantime?  */
+  lll_futex_wake (&cond->__data.__signals_sent, 1, pshared);
 
   return 0;
 }
diff --git a/nptl/pthread_cond_timedwait.c b/nptl/pthread_cond_timedwait.c
deleted file mode 100644
index bf80467..0000000
--- a/nptl/pthread_cond_timedwait.c
+++ /dev/null
@@ -1,268 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Martin Schwidefsky <schwidefsky@de.ibm.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <endian.h>
-#include <errno.h>
-#include <sysdep.h>
-#include <lowlevellock.h>
-#include <pthread.h>
-#include <pthreadP.h>
-#include <sys/time.h>
-#include <kernel-features.h>
-
-#include <shlib-compat.h>
-
-#ifndef HAVE_CLOCK_GETTIME_VSYSCALL
-# undef INTERNAL_VSYSCALL
-# define INTERNAL_VSYSCALL INTERNAL_SYSCALL
-# undef INLINE_VSYSCALL
-# define INLINE_VSYSCALL INLINE_SYSCALL
-#else
-# include <bits/libc-vdso.h>
-#endif
-
-/* Cleanup handler, defined in pthread_cond_wait.c.  */
-extern void __condvar_cleanup (void *arg)
-     __attribute__ ((visibility ("hidden")));
-
-struct _condvar_cleanup_buffer
-{
-  int oldtype;
-  pthread_cond_t *cond;
-  pthread_mutex_t *mutex;
-  unsigned int bc_seq;
-};
-
-int
-__pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mutex,
-			  const struct timespec *abstime)
-{
-  struct _pthread_cleanup_buffer buffer;
-  struct _condvar_cleanup_buffer cbuffer;
-  int result = 0;
-
-  /* Catch invalid parameters.  */
-  if (abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000)
-    return EINVAL;
-
-  int pshared = (cond->__data.__mutex == (void *) ~0l)
-		? LLL_SHARED : LLL_PRIVATE;
-
-#if (defined lll_futex_timed_wait_requeue_pi \
-     && defined __ASSUME_REQUEUE_PI)
-  int pi_flag = 0;
-#endif
-
-  /* Make sure we are alone.  */
-  lll_lock (cond->__data.__lock, pshared);
-
-  /* Now we can release the mutex.  */
-  int err = __pthread_mutex_unlock_usercnt (mutex, 0);
-  if (err)
-    {
-      lll_unlock (cond->__data.__lock, pshared);
-      return err;
-    }
-
-  /* We have one new user of the condvar.  */
-  ++cond->__data.__total_seq;
-  ++cond->__data.__futex;
-  cond->__data.__nwaiters += 1 << COND_NWAITERS_SHIFT;
-
-  /* Work around the fact that the kernel rejects negative timeout values
-     despite them being valid.  */
-  if (__glibc_unlikely (abstime->tv_sec < 0))
-    goto timeout;
-
-  /* Remember the mutex we are using here.  If there is already a
-     different address store this is a bad user bug.  Do not store
-     anything for pshared condvars.  */
-  if (cond->__data.__mutex != (void *) ~0l)
-    cond->__data.__mutex = mutex;
-
-  /* Prepare structure passed to cancellation handler.  */
-  cbuffer.cond = cond;
-  cbuffer.mutex = mutex;
-
-  /* Before we block we enable cancellation.  Therefore we have to
-     install a cancellation handler.  */
-  __pthread_cleanup_push (&buffer, __condvar_cleanup, &cbuffer);
-
-  /* The current values of the wakeup counter.  The "woken" counter
-     must exceed this value.  */
-  unsigned long long int val;
-  unsigned long long int seq;
-  val = seq = cond->__data.__wakeup_seq;
-  /* Remember the broadcast counter.  */
-  cbuffer.bc_seq = cond->__data.__broadcast_seq;
-
-  while (1)
-    {
-#if (!defined __ASSUME_FUTEX_CLOCK_REALTIME \
-     || !defined lll_futex_timed_wait_bitset)
-      struct timespec rt;
-      {
-# ifdef __NR_clock_gettime
-	INTERNAL_SYSCALL_DECL (err);
-	(void) INTERNAL_VSYSCALL (clock_gettime, err, 2,
-				  (cond->__data.__nwaiters
-				   & ((1 << COND_NWAITERS_SHIFT) - 1)),
-				  &rt);
-	/* Convert the absolute timeout value to a relative timeout.  */
-	rt.tv_sec = abstime->tv_sec - rt.tv_sec;
-	rt.tv_nsec = abstime->tv_nsec - rt.tv_nsec;
-# else
-	/* Get the current time.  So far we support only one clock.  */
-	struct timeval tv;
-	(void) __gettimeofday (&tv, NULL);
-
-	/* Convert the absolute timeout value to a relative timeout.  */
-	rt.tv_sec = abstime->tv_sec - tv.tv_sec;
-	rt.tv_nsec = abstime->tv_nsec - tv.tv_usec * 1000;
-# endif
-      }
-      if (rt.tv_nsec < 0)
-	{
-	  rt.tv_nsec += 1000000000;
-	  --rt.tv_sec;
-	}
-      /* Did we already time out?  */
-      if (__glibc_unlikely (rt.tv_sec < 0))
-	{
-	  if (cbuffer.bc_seq != cond->__data.__broadcast_seq)
-	    goto bc_out;
-
-	  goto timeout;
-	}
-#endif
-
-      unsigned int futex_val = cond->__data.__futex;
-
-      /* Prepare to wait.  Release the condvar futex.  */
-      lll_unlock (cond->__data.__lock, pshared);
-
-      /* Enable asynchronous cancellation.  Required by the standard.  */
-      cbuffer.oldtype = __pthread_enable_asynccancel ();
-
-/* REQUEUE_PI was implemented after FUTEX_CLOCK_REALTIME, so it is sufficient
-   to check just the former.  */
-#if (defined lll_futex_timed_wait_requeue_pi \
-     && defined __ASSUME_REQUEUE_PI)
-      /* If pi_flag remained 1 then it means that we had the lock and the mutex
-	 but a spurious waker raced ahead of us.  Give back the mutex before
-	 going into wait again.  */
-      if (pi_flag)
-	{
-	  __pthread_mutex_cond_lock_adjust (mutex);
-	  __pthread_mutex_unlock_usercnt (mutex, 0);
-	}
-      pi_flag = USE_REQUEUE_PI (mutex);
-
-      if (pi_flag)
-	{
-	  unsigned int clockbit = (cond->__data.__nwaiters & 1
-				   ? 0 : FUTEX_CLOCK_REALTIME);
-	  err = lll_futex_timed_wait_requeue_pi (&cond->__data.__futex,
-						 futex_val, abstime, clockbit,
-						 &mutex->__data.__lock,
-						 pshared);
-	  pi_flag = (err == 0);
-	}
-      else
-#endif
-
-	{
-#if (!defined __ASSUME_FUTEX_CLOCK_REALTIME \
-     || !defined lll_futex_timed_wait_bitset)
-	  /* Wait until woken by signal or broadcast.  */
-	  err = lll_futex_timed_wait (&cond->__data.__futex,
-				      futex_val, &rt, pshared);
-#else
-	  unsigned int clockbit = (cond->__data.__nwaiters & 1
-				   ? 0 : FUTEX_CLOCK_REALTIME);
-	  err = lll_futex_timed_wait_bitset (&cond->__data.__futex, futex_val,
-					     abstime, clockbit, pshared);
-#endif
-	}
-
-      /* Disable asynchronous cancellation.  */
-      __pthread_disable_asynccancel (cbuffer.oldtype);
-
-      /* We are going to look at shared data again, so get the lock.  */
-      lll_lock (cond->__data.__lock, pshared);
-
-      /* If a broadcast happened, we are done.  */
-      if (cbuffer.bc_seq != cond->__data.__broadcast_seq)
-	goto bc_out;
-
-      /* Check whether we are eligible for wakeup.  */
-      val = cond->__data.__wakeup_seq;
-      if (val != seq && cond->__data.__woken_seq != val)
-	break;
-
-      /* Not woken yet.  Maybe the time expired?  */
-      if (__glibc_unlikely (err == -ETIMEDOUT))
-	{
-	timeout:
-	  /* Yep.  Adjust the counters.  */
-	  ++cond->__data.__wakeup_seq;
-	  ++cond->__data.__futex;
-
-	  /* The error value.  */
-	  result = ETIMEDOUT;
-	  break;
-	}
-    }
-
-  /* Another thread woken up.  */
-  ++cond->__data.__woken_seq;
-
- bc_out:
-
-  cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
-
-  /* If pthread_cond_destroy was called on this variable already,
-     notify the pthread_cond_destroy caller all waiters have left
-     and it can be successfully destroyed.  */
-  if (cond->__data.__total_seq == -1ULL
-      && cond->__data.__nwaiters < (1 << COND_NWAITERS_SHIFT))
-    lll_futex_wake (&cond->__data.__nwaiters, 1, pshared);
-
-  /* We are done with the condvar.  */
-  lll_unlock (cond->__data.__lock, pshared);
-
-  /* The cancellation handling is back to normal, remove the handler.  */
-  __pthread_cleanup_pop (&buffer, 0);
-
-  /* Get the mutex before returning.  */
-#if (defined lll_futex_timed_wait_requeue_pi \
-     && defined __ASSUME_REQUEUE_PI)
-  if (pi_flag)
-    {
-      __pthread_mutex_cond_lock_adjust (mutex);
-      err = 0;
-    }
-  else
-#endif
-    err = __pthread_mutex_cond_lock (mutex);
-
-  return err ?: result;
-}
-
-versioned_symbol (libpthread, __pthread_cond_timedwait, pthread_cond_timedwait,
-		  GLIBC_2_3_2);
diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
index 0d6558b..2106bf6 100644
--- a/nptl/pthread_cond_wait.c
+++ b/nptl/pthread_cond_wait.c
@@ -22,216 +22,555 @@ 
 #include <lowlevellock.h>
 #include <pthread.h>
 #include <pthreadP.h>
-#include <kernel-features.h>
+#include <sys/time.h>
+#include <atomic.h>
 
 #include <shlib-compat.h>
 #include <stap-probe.h>
+#include <kernel-features.h>
+
+#ifndef HAVE_CLOCK_GETTIME_VSYSCALL
+# undef INTERNAL_VSYSCALL
+# define INTERNAL_VSYSCALL INTERNAL_SYSCALL
+# undef INLINE_VSYSCALL
+# define INLINE_VSYSCALL INLINE_SYSCALL
+#else
+# include <bits/libc-vdso.h>
+#endif
 
 struct _condvar_cleanup_buffer
 {
   int oldtype;
   pthread_cond_t *cond;
   pthread_mutex_t *mutex;
-  unsigned int bc_seq;
 };
 
-
-void
-__attribute__ ((visibility ("hidden")))
-__condvar_cleanup (void *arg)
+static __always_inline void
+__condvar_confirm_wakeup (pthread_cond_t *cond, int pshared)
 {
-  struct _condvar_cleanup_buffer *cbuffer =
-    (struct _condvar_cleanup_buffer *) arg;
-  unsigned int destroying;
-  int pshared = (cbuffer->cond->__data.__mutex == (void *) ~0l)
-		? LLL_SHARED : LLL_PRIVATE;
-
-  /* We are going to modify shared data.  */
-  lll_lock (cbuffer->cond->__data.__lock, pshared);
-
-  if (cbuffer->bc_seq == cbuffer->cond->__data.__broadcast_seq)
+  /* Confirm that we have been woken.  If the number of confirmations reaches
+     WSEQ_THRESHOLD, we must be in a quiescence period (i.e., WSEQ must be
+     equal to WSEQ_THRESHOLD).
+     We use acquire-release MO to ensure that accesses to this generation's
+     condvar state happen before any reset of the condvar.
+     XXX Or is just relaxed MO sufficient because happens-before is
+     established through the total modification order on CONFIRMED?  */
+  if (atomic_fetch_add_acq_rel (&cond->__data.__confirmed, 1)
+      == __PTHREAD_COND_WSEQ_THRESHOLD - 1)
     {
-      /* This thread is not waiting anymore.  Adjust the sequence counters
-	 appropriately.  We do not increment WAKEUP_SEQ if this would
-	 bump it over the value of TOTAL_SEQ.  This can happen if a thread
-	 was woken and then canceled.  */
-      if (cbuffer->cond->__data.__wakeup_seq
-	  < cbuffer->cond->__data.__total_seq)
-	{
-	  ++cbuffer->cond->__data.__wakeup_seq;
-	  ++cbuffer->cond->__data.__futex;
-	}
-      ++cbuffer->cond->__data.__woken_seq;
+      /* Need release MO to make our accesses to the condvar happen before
+	 the reset that some other thread will execute.  */
+      atomic_fetch_add_release (&cond->__data.__generation, 1);
+      lll_futex_wake (&cond->__data.__generation, INT_MAX, pshared);
     }
 
-  cbuffer->cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
+}
+
+/* Cancel waiting after having registered as a waiter already.
+   We must not consume another waiter's signal, so we must add an artificial
+   signal.  If we are the first blocked waiter (i.e., SEQ == SIGNALS_SENT,
+   SEQ being our position in WSEQ), then an artificial signal is obviously
+   fine.  If we are blocked (i.e., SEQ > SIGNALS_SENT), then a fake signal
+   might lead to spurious wake-ups of waiters with a smaller position in WSEQ;
+   however, not adding the artificial signal could prevent wake-up of waiters
+   with a larger position in WSEQ: even though we weren't actually waiting,
+   we effectively consume a signal because we have reserved a slot in WSEQ.  If
+   we are not blocked anymore (i.e., SEQ < SIGNALS_SENT), we still have to
+   add the artificial signal if there are still unblocked threads (i.e.,
+   SIGNALS_SENT < WSEQ).  */
+static __always_inline void
+__condvar_cancel_waiting (pthread_cond_t *cond, int pshared)
+{
+  unsigned int wseq, ssent;
 
-  /* If pthread_cond_destroy was called on this variable already,
-     notify the pthread_cond_destroy caller all waiters have left
-     and it can be successfully destroyed.  */
-  destroying = 0;
-  if (cbuffer->cond->__data.__total_seq == -1ULL
-      && cbuffer->cond->__data.__nwaiters < (1 << COND_NWAITERS_SHIFT))
+  /* Add an artificial signal.  See __pthread_cond_signal.  */
+  wseq = atomic_load_relaxed (&cond->__data.__wseq);
+  ssent = atomic_load_relaxed (&cond->__data.__signals_sent);
+  do
     {
-      lll_futex_wake (&cbuffer->cond->__data.__nwaiters, 1, pshared);
-      destroying = 1;
+      if (ssent >= wseq || wseq >= __PTHREAD_COND_WSEQ_THRESHOLD)
+	break;
     }
+  while (!atomic_compare_exchange_weak_relaxed (&cond->__data.__signals_sent,
+						&ssent, ssent + 1));
+}
 
-  /* We are done.  */
-  lll_unlock (cbuffer->cond->__data.__lock, pshared);
-
-  /* Wake everybody to make sure no condvar signal gets lost.  */
-  if (! destroying)
-    lll_futex_wake (&cbuffer->cond->__data.__futex, INT_MAX, pshared);
-
-  /* Get the mutex before returning unless asynchronous cancellation
-     is in effect.  We don't try to get the mutex if we already own it.  */
-  if (!(USE_REQUEUE_PI (cbuffer->mutex))
-      || ((cbuffer->mutex->__data.__lock & FUTEX_TID_MASK)
-	  != THREAD_GETMEM (THREAD_SELF, tid)))
-  {
-    __pthread_mutex_cond_lock (cbuffer->mutex);
-  }
-  else
-    __pthread_mutex_cond_lock_adjust (cbuffer->mutex);
+/* Clean-up for cancellation of waiters waiting for normal signals.  We cancel
+   our registration as a waiter, confirm we have woken up, and re-acquire the
+   mutex.  */
+static void
+__condvar_cleanup_waiting (void *arg)
+{
+  struct _condvar_cleanup_buffer *cbuffer =
+    (struct _condvar_cleanup_buffer *) arg;
+  pthread_cond_t *cond = cbuffer->cond;
+  /* See comment in __pthread_cond_signal.  */
+  int pshared = (atomic_load_relaxed (&cond->__data.__mutex) == (void *) ~0l)
+		? LLL_SHARED : LLL_PRIVATE;
+
+  __condvar_cancel_waiting (cond, pshared);
+  __condvar_confirm_wakeup (cond, pshared);
+
+  /* Cancellation can happen after we have been woken by a signal's
+     futex_wake (unlike when we cancel waiting due to a timeout on futex_wait,
+     for example).  We do provide an artificial signal in
+     __condvar_cancel_waiting, but we still can have consumed a futex_wake
+     that should have woken another waiter.  We cannot reliably wake this
+     waiter because there might be other, non-eligible waiters that started
+     to block after we have been cancelled; therefore, we need to wake all
+     blocked waiters to really undo our consumption of the futex_wake.  */
+  /* XXX Once we have implemented a form of cancellation that is just enabled
+     during futex_wait, we can try to optimize this.  */
+  lll_futex_wake (&cond->__data.__signals_sent, INT_MAX, pshared);
+
+  /* XXX If locking the mutex fails, should we just stop execution?  This
+     might be better than silently ignoring the error.  */
+  __pthread_mutex_cond_lock (cbuffer->mutex);
 }
 
+/* Clean-up for cancellation of waiters waiting on quiescence to finish.  */
+static void
+__condvar_cleanup_quiescence (void *arg)
+{
+  struct _condvar_cleanup_buffer *cbuffer =
+    (struct _condvar_cleanup_buffer *) arg;
+  pthread_cond_t *cond = cbuffer->cond;
+  /* See comment in __pthread_cond_signal.  */
+  int pshared = (atomic_load_relaxed (&cond->__data.__mutex) == (void *) ~0l)
+		? LLL_SHARED : LLL_PRIVATE;
 
-int
-__pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
+  /* See __pthread_cond_wait.  */
+  if (atomic_fetch_add_release (&cond->__data.__quiescence_waiters, -1) == 1)
+    lll_futex_wake (&cond->__data.__quiescence_waiters, INT_MAX,
+	pshared);
+
+  /* XXX If locking the mutex fails, should we just stop execution?  This
+     might be better than silently ignoring the error.  */
+  __pthread_mutex_cond_lock (cbuffer->mutex);
+}
+
+/* This condvar implementation guarantees that all calls to signal and
+   broadcast and all of the three virtually atomic parts of each call to wait
+   (i.e., (1) releasing the mutex and blocking, (2) unblocking, and (3) re-
+   acquiring the mutex) happen in some total order that is consistent with the
+   happens-before relations in the calling program.  However, this order does
+   not necessarily result in additional happens-before relations being
+   established (which aligns well with spurious wake-ups being allowed).
+
+   All waiters acquire a certain position in a waiter sequence, WSEQ.  Signals
+   and broadcasts acquire a position or a whole interval in the SIGNALS_SENT
+   sequence.  Waiters are allowed to wake up if either SIGNALS_SENT is larger
+   or equal to their position in WSEQ, or if they have been blocked on a
+   certain futex and selected by the kernel to wake up after a signal or
+   broadcast woke threads that were blocked on this futex.  This is also the
+   primary source for spurious wake-ups: For waiters W1 and W2 with W2's
+   position in WSEQ larger than W1's, if W2 blocks earlier than W1 using this
+   futex, then a signal will wake both W1 and W2.  However, having the
+   possibility of waking waiters spuriously simplifies the algorithm and
+   allows for a lean implementation.
+
+   Futexes only compare 32b values when deciding whether to block a thread,
+   but we need to distinguish more than 1<<32 states for the condvar.  Unlike
+   mutexes, which are just free/acquired/contended, the position of waiters
+   and signals matters because of the requirement of them forming a total
+   order.  Therefore, to avoid ABA issues and prevent potential lost wake-ups,
+   we need to safely reset WSEQ and SIGNALS_SENT.  We do so by quiescing the
+   condvar once WSEQ reaches a threshold (WSEQ_THRESHOLD): We wait for all
+   waiters to confirm that they have woken up by incrementing the CONFIRMED
+   counter, and then reset the condvar state.  Waiters arriving in this
+   quiescence period (i.e., the time between WSEQ reaching WSEQ_THRESHOLD and
+   the reset being complete) will wake up spuriously.
+   To avoid ABA issues for broadcasts that could lead to excessive numbers of
+   spurious wake-ups, we maintain a GENERATION counter that increases
+   whenever we enter and exit a quiescence period; waiters also use this
+   counter to communicate when the quiescence period can be finished by
+   incrementing GENERATION to an odd value.
+   When waiters wait for quiescence to finish, they will have pending accesses
+   to the condvar even though they are not registered as waiters.  Therefore,
+   we count this number of waiters in QUIESCENCE_WAITERS; destruction of the
+   condvar will not take place until there are no such waiters anymore.
+
+   WSEQ is only modified while holding MUTEX, but signals and broadcasts read
+   it without holding the mutex (see both functions for an explanation why
+   this is safe).  SIGNALS_SENT is only modified with CAS operations by
+   signals and broadcast; the only exception is the reset of the condvar
+   during quiescence (but this is fine due to how signals and broadcasts
+   update SIGNALS_SENT).  CONFIRMED is accessed by just waiters with atomic
+   operations, and reset during quiescence.  GENERATION is modified by waiters
+   during quiescence handling, and used by broadcasts to check whether a
+   snapshot of WSEQ and SIGNALS_SENT happened within a single generation.
+   QUIESCENCE_WAITERS is only modified by waiters that wait for quiescence to
+   finish.
+
+   The common-case state is WSEQ < WSEQ_THRESHOLD and GENERATION being even.
+   CONFIRMED is always smaller or equal to WSEQ except during reset.
+   SIGNALS_SENT can be larger than WSEQ, but this happens just during reset
+   or if a signal or broadcast tripped over a reset or the hardware reordered
+   in a funny way, in which case we just get a few more spurious wake-ups
+   (see __pthread_cond_broadcast for details on how we minimize that).
+   If WSEQ equals WSEQ_THRESHOLD, then incoming waiters will wait for all
+   waiters in the current generation to finish, or they will reset the condvar
+   and start a new generation.  If GENERATION is odd, the condvar state is
+   ready for being reset.
+
+   Limitations:
+   * This condvar isn't designed to allow for more than
+     WSEQ_THRESHOLD * (1 << (sizeof(GENERATION) * 8 - 1)) calls to
+     __pthread_cond_wait.  It probably only suffers from potential ABA issues
+     afterwards, but this hasn't been checked nor tested.
+   * More than (1 << (sizeof(QUIESCENCE_WAITERS) * 8)) - 1 concurrent waiters
+     are not supported.
+   * Beyond what is allowed as errors by POSIX or documented, we can also
+     return the following errors:
+     * EPERM if MUTEX is a recursive mutex and the caller doesn't own it.
+     * EOWNERDEAD or ENOTRECOVERABLE when using robust mutexes.  Unlike
+       for other errors, this can happen when we re-acquire the mutex; this
+       isn't allowed by POSIX (which requires all errors to virtually happen
+       before we release the mutex or change the condvar state), but there's
+       nothing we can do really.
+     * EAGAIN if MUTEX is a recursive mutex and trying to lock it exceeded
+       the maximum number of recursive locks.  The caller cannot expect to own
+       MUTEX.
+     * When using PTHREAD_MUTEX_PP_* mutexes, we can also return all errors
+       returned by __pthread_tpp_change_priority.  We will already have
+       released the mutex in such cases, so the caller cannot expect to own
+       MUTEX.
+
+   Other notes:
+   * Instead of the normal mutex unlock / lock functions, we use
+     __pthread_mutex_unlock_usercnt(m, 0) / __pthread_mutex_cond_lock(m)
+     because those will not change the mutex-internal users count, so that it
+     can be detected when a condvar is still associated with a particular
+     mutex because there is a waiter blocked on this condvar using this mutex.
+*/
+static __always_inline int
+__pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
+    const struct timespec *abstime)
 {
   struct _pthread_cleanup_buffer buffer;
   struct _condvar_cleanup_buffer cbuffer;
+  const int maxspin = 0;
   int err;
-  int pshared = (cond->__data.__mutex == (void *) ~0l)
-		? LLL_SHARED : LLL_PRIVATE;
+  int result = 0;
+  unsigned int spin, seq, gen, ssent;
 
-#if (defined lll_futex_wait_requeue_pi \
-     && defined __ASSUME_REQUEUE_PI)
-  int pi_flag = 0;
-#endif
+  /* We (can assume to) hold the mutex, so there are no concurrent
+     modifications.  */
+  int pshared = (atomic_load_relaxed (&cond->__data.__mutex) == (void *) ~0l)
+		? LLL_SHARED : LLL_PRIVATE;
 
   LIBC_PROBE (cond_wait, 2, cond, mutex);
 
-  /* Make sure we are alone.  */
-  lll_lock (cond->__data.__lock, pshared);
+  /* Remember the mutex we are using here, unless it's a pshared condvar.
+     Users must ensure that a condvar is associated with exactly one mutex,
+     so we cannot store an incorrect address if the program is correct.  */
+  if (pshared != LLL_SHARED)
+    atomic_store_relaxed (&cond->__data.__mutex, mutex);
+
+  /* Acquire a position (SEQ) in the waiter sequence (WSEQ) iff this will not
+     cause overflow.  We use an atomic operation because signals and
+     broadcasts may read while not holding the mutex.  We do not need release
+     MO here because we do not need to establish any happens-before relation
+     with signalers (see __pthread_cond_signal).  */
+  seq = atomic_load_relaxed (&cond->__data.__wseq);
+  if (__glibc_likely (seq < __PTHREAD_COND_WSEQ_THRESHOLD))
+    atomic_store_relaxed (&cond->__data.__wseq, seq + 1);
+
+  /* If we reached WSEQ_THRESHOLD, we need to quiesce the condvar.  */
+  if (seq >= __PTHREAD_COND_WSEQ_THRESHOLD - 1)
+    {
+      /* If we are the waiter that triggered quiescence, we need to still
+         confirm that we have woken up (which can update GENERATION if we are
+         the last one active).
+         XXX We probably do not need to wake anyone because we still hold the
+         mutex so no other waiter can observe that we started quiescence.  */
+      if (seq == __PTHREAD_COND_WSEQ_THRESHOLD - 1)
+	__condvar_confirm_wakeup (cond, pshared);
+      /* Check whether all waiters in the current generation have confirmed
+	 that they do not wait anymore (and thus don't use the condvar state
+	 anymore), and either reset or wait for this to happen.  We do that
+	 while holding the mutex so we will never see WSEQ==WSEQ_THRESHOLD and
+	 an even value for GENERATION that is already a new generation.  We
+	 need acquire MO on the load to ensure that we happen after the last
+	 of the current generation's waiters confirmed that it isn't using the
+	 condvar anymore (see below).
+	 Note that in both cases, we must not actually wait for any signal to
+	 arrive but wake up spuriously.  This allows signalers to not take
+	 actively part in quiescence because they can assume that if they
+	 hit a quiescence period, all waiters they might have to wake will
+	 wake up on their own.  */
+      gen = atomic_load_acquire (&cond->__data.__generation);
+      if ((gen & 1) != 0)
+	{
+	  /* No waiter uses the condvar currently, so we can reset.
+	     This barrier / release-MO fence is necessary to match the
+	     acquire-MO fence in __pthread_cond_broadcast.  It makes sure that
+	     if a broadcast sees one of the values stored during reset, it
+	     will also observe an even value for GENERATION (i.e., broadcast
+	     can check whether it read condvar state that was from different
+	     generations or partially reset).  We store atomically because
+	     the fence, according to the memory model, only has the desired
+	     effect in combination with atomic operations.  */
+	  atomic_thread_fence_release ();
+	  atomic_store_relaxed (&cond->__data.__wseq, 0);
+	  atomic_store_relaxed (&cond->__data.__signals_sent, 0);
+	  atomic_store_relaxed (&cond->__data.__confirmed, 0);
+	  /* Need release MO to make sure that if a broadcast loads the new
+	     generation, it will also observe a fully reset condvar.  */
+	  atomic_fetch_add_release (&cond->__data.__generation, 1);
+	  /* TODO Discuss issues around PI support on quiescence.  */
+	  lll_futex_wake (&cond->__data.__generation, INT_MAX, pshared);
+	  /* We haven't released the mutex, so we can just return.  */
+	  return 0;
+	}
+      else
+	{
+	  /* There are still some waiters that haven't yet confirmed to not be
+	     using the condvar anymore.  Wake all of them if this hasn't
+	     happened yet.  Relaxed MO is sufficient because we only need to
+	     max out SIGNALS_SENT and we still hold the mutex, so a new
+	     generation cannot have been started concurrently.  */
+	  ssent = atomic_load_relaxed (&cond->__data.__signals_sent);
+	  while (1)
+	    {
+	      if (ssent == __PTHREAD_COND_WSEQ_THRESHOLD)
+		break;
+	      if (atomic_compare_exchange_weak_relaxed (
+		  &cond->__data.__signals_sent, &ssent,
+		  __PTHREAD_COND_WSEQ_THRESHOLD))
+		{
+		  /* If we made any signals available, wake up all waiters
+		     blocked on the futex.  */
+		  lll_futex_wake (&cond->__data.__signals_sent, INT_MAX,
+				  pshared);
+		  break;
+		}
+	    }
+	  /* Now wait until no waiter is using the condvar anymore, and wake
+	     up spuriously.  Don't hold the mutex while we wait.  We also
+	     need to tell __pthread_cond_destroy that we will have pending
+	     accesses to the condvar state; we do so before we release the
+	     mutex to make sure that this is visible to destruction.  */
+	  atomic_fetch_add_relaxed (&cond->__data.__quiescence_waiters, 1);
+	  err = __pthread_mutex_unlock_usercnt (mutex, 0);
+
+	  if (__glibc_likely (err == 0))
+	    {
+	      /* Enable asynchronous cancellation before we block, as required
+		 by the standard.  In the cancellation handler, we just do
+		 the same steps as after a normal futex wake-up.  */
+	      cbuffer.cond = cond;
+	      cbuffer.mutex = mutex;
+	      __pthread_cleanup_push (&buffer, __condvar_cleanup_quiescence,
+		  &cbuffer);
+	      cbuffer.oldtype = __pthread_enable_asynccancel ();
+
+	      /* We don't really need to care whether the futex_wait fails
+		 because a spurious wake-up is just fine.  */
+	      /* TODO Spin on generation (with acquire MO)?  */
+	      /* TODO Discuss issues around PI support on quiescence.  */
+	      lll_futex_wait (&cond->__data.__generation, gen, pshared);
+
+	      /* Stopped blocking; disable cancellation.  */
+	      __pthread_disable_asynccancel (cbuffer.oldtype);
+	      __pthread_cleanup_pop (&buffer, 0);
+	    }
+	  /* Notify __pthread_cond_destroy that we won't access the condvar
+	     anymore.  Release MO to make our accesses happen before
+	     destruction.  */
+	  if (atomic_fetch_add_release (&cond->__data.__quiescence_waiters, -1)
+	      == 1)
+	    lll_futex_wake (&cond->__data.__quiescence_waiters, INT_MAX,
+		pshared);
+
+	  /* If unlocking the mutex returned an error, we haven't released it.
+	     We have decremented QUIESCENCE_WAITERS already, so we can just
+	     return here.  */
+	  if (__glibc_unlikely (err != 0))
+	    return err;
+
+	  /* Re-acquire the mutex, and just wake up spuriously.  */
+	  /* XXX Rather abort on errors that are disallowed by POSIX?  */
+	  return __pthread_mutex_cond_lock (mutex);
+	}
+    }
 
-  /* Now we can release the mutex.  */
+  /* Now that we are registered as a waiter, we can release the mutex.
+     Waiting on the condvar must be atomic with releasing the mutex, so if
+     the mutex is used to establish a happens-before relation with any
+     signaler, the waiter must be visible to the latter; thus, we release the
+     mutex after registering as waiter.
+     If releasing the mutex fails, we just cancel our registration as a
+     waiter and confirm that we have woken up.  */
   err = __pthread_mutex_unlock_usercnt (mutex, 0);
-  if (__glibc_unlikely (err))
+  if (__glibc_unlikely (err != 0))
     {
-      lll_unlock (cond->__data.__lock, pshared);
+      __condvar_cancel_waiting (cond, pshared);
+      __condvar_confirm_wakeup (cond, pshared);
       return err;
     }
 
-  /* We have one new user of the condvar.  */
-  ++cond->__data.__total_seq;
-  ++cond->__data.__futex;
-  cond->__data.__nwaiters += 1 << COND_NWAITERS_SHIFT;
-
-  /* Remember the mutex we are using here.  If there is already a
-     different address store this is a bad user bug.  Do not store
-     anything for pshared condvars.  */
-  if (cond->__data.__mutex != (void *) ~0l)
-    cond->__data.__mutex = mutex;
-
-  /* Prepare structure passed to cancellation handler.  */
+  /* We might block on a futex, so push the cancellation handler.  */
   cbuffer.cond = cond;
   cbuffer.mutex = mutex;
-
-  /* Before we block we enable cancellation.  Therefore we have to
-     install a cancellation handler.  */
-  __pthread_cleanup_push (&buffer, __condvar_cleanup, &cbuffer);
-
-  /* The current values of the wakeup counter.  The "woken" counter
-     must exceed this value.  */
-  unsigned long long int val;
-  unsigned long long int seq;
-  val = seq = cond->__data.__wakeup_seq;
-  /* Remember the broadcast counter.  */
-  cbuffer.bc_seq = cond->__data.__broadcast_seq;
-
-  do
+  __pthread_cleanup_push (&buffer, __condvar_cleanup_waiting, &cbuffer);
+
+  /* Loop until we might have been woken, which is the case if either (1) more
+     signals have been sent than what is our position in the waiter sequence
+     or (2) the kernel woke us after we blocked in a futex_wait operation.  We
+     have to consider the latter independently of the former because the
+     kernel might wake in an order that is different from the waiter sequence
+     we determined (and we don't know in which order the individual waiters'
+     futex_wait calls were actually processed in the kernel).
+     We do not need acquire MO for the load from SIGNALS_SENT because we do
+     not need to establish a happens-before with the sender of the signal;
+     because every wake-up could be spurious, the program has to check its
+     condition associated with the condvar anyway and must use suitable
+     synchronization to do so.  IOW, we ensure that the virtual ordering of
+     waiters and signalers is consistent with happens-before, but we do not
+     transfer this order back into happens-before.  Also see the comments
+     in __pthread_cond_signal.  */
+  ssent = atomic_load_relaxed (&cond->__data.__signals_sent);
+  spin = maxspin;
+  while (ssent <= seq)
     {
-      unsigned int futex_val = cond->__data.__futex;
-      /* Prepare to wait.  Release the condvar futex.  */
-      lll_unlock (cond->__data.__lock, pshared);
-
-      /* Enable asynchronous cancellation.  Required by the standard.  */
-      cbuffer.oldtype = __pthread_enable_asynccancel ();
-
-#if (defined lll_futex_wait_requeue_pi \
-     && defined __ASSUME_REQUEUE_PI)
-      /* If pi_flag remained 1 then it means that we had the lock and the mutex
-	 but a spurious waker raced ahead of us.  Give back the mutex before
-	 going into wait again.  */
-      if (pi_flag)
-	{
-	  __pthread_mutex_cond_lock_adjust (mutex);
-	  __pthread_mutex_unlock_usercnt (mutex, 0);
-	}
-      pi_flag = USE_REQUEUE_PI (mutex);
-
-      if (pi_flag)
-	{
-	  err = lll_futex_wait_requeue_pi (&cond->__data.__futex,
-					   futex_val, &mutex->__data.__lock,
-					   pshared);
-
-	  pi_flag = (err == 0);
-	}
+      if (spin > 0)
+	spin--;
       else
+	{
+	  if (abstime == NULL)
+	    {
+	      /* Enable asynchronous cancellation before we block, as required
+		 by the standard.  */
+	      cbuffer.oldtype = __pthread_enable_asynccancel ();
+	      /* Block using SIGNALS_SENT as futex.  If the number of signals
+		 sent changes concurrently and futex_wait thus returns
+		 EAGAIN, we fall back to spinning and will eventually try
+		 to block again.  All other possible
+		 errors returned from the futex_wait call are either
+		 programming errors, or similar to EAGAIN (i.e., EINTR
+		 on a spurious wake-up by the futex).  Otherwise, we have
+		 been woken by a real signal, so the kernel picked us for the
+		 wake-up, and we can stop waiting.  */
+	      err = lll_futex_wait (&cond->__data.__signals_sent, ssent,
+				    pshared);
+	      /* Stopped blocking; disable cancellation.  */
+	      __pthread_disable_asynccancel (cbuffer.oldtype);
+	      if (err == 0)
+		break;
+	    }
+	  else
+	    {
+	      /* Block, but with a timeout.  */
+	      /* Work around the fact that the kernel rejects negative timeout
+		 values despite them being valid.  */
+	      if (__glibc_unlikely (abstime->tv_sec < 0))
+	        goto timeout;
+
+#if (!defined __ASSUME_FUTEX_CLOCK_REALTIME \
+     || !defined lll_futex_timed_wait_bitset)
+	      struct timespec rt;
+	      {
+# ifdef __NR_clock_gettime
+		INTERNAL_SYSCALL_DECL (err);
+		(void) INTERNAL_VSYSCALL (clock_gettime, err, 2,
+		    cond->__data.__clockid, &rt);
+		/* Convert the absolute timeout value to a relative
+		   timeout.  */
+		rt.tv_sec = abstime->tv_sec - rt.tv_sec;
+		rt.tv_nsec = abstime->tv_nsec - rt.tv_nsec;
+# else
+		/* Get the current time.  So far, we support only one
+		   clock.  */
+		struct timeval tv;
+		(void) __gettimeofday (&tv, NULL);
+		/* Convert the absolute timeout value to a relative
+		   timeout.  */
+		rt.tv_sec = abstime->tv_sec - tv.tv_sec;
+		rt.tv_nsec = abstime->tv_nsec - tv.tv_usec * 1000;
+# endif
+	      }
+	      if (rt.tv_nsec < 0)
+		{
+		  rt.tv_nsec += 1000000000;
+		  --rt.tv_sec;
+		}
+	      /* Did we already time out?  */
+	      if (__glibc_unlikely (rt.tv_sec < 0))
+		goto timeout;
+
+	      /* Enable asynchronous cancellation before we block, as required
+		 by the standard.  */
+	      cbuffer.oldtype = __pthread_enable_asynccancel ();
+	      err = lll_futex_timed_wait (&cond->__data.__signals_sent, ssent,
+		  &rt, pshared);
+
+#else
+	      unsigned int clockbit = (cond->__data.__clockid == 1
+		  ? 0 : FUTEX_CLOCK_REALTIME);
+	      /* Enable asynchronous cancellation before we block, as required
+		 by the standard.  */
+	      cbuffer.oldtype = __pthread_enable_asynccancel ();
+	      err = lll_futex_timed_wait_bitset (&cond->__data.__signals_sent,
+		  ssent, abstime, clockbit, pshared);
 #endif
-	  /* Wait until woken by signal or broadcast.  */
-	lll_futex_wait (&cond->__data.__futex, futex_val, pshared);
-
-      /* Disable asynchronous cancellation.  */
-      __pthread_disable_asynccancel (cbuffer.oldtype);
-
-      /* We are going to look at shared data again, so get the lock.  */
-      lll_lock (cond->__data.__lock, pshared);
-
-      /* If a broadcast happened, we are done.  */
-      if (cbuffer.bc_seq != cond->__data.__broadcast_seq)
-	goto bc_out;
+	      /* Stopped blocking; disable cancellation.  */
+	      __pthread_disable_asynccancel (cbuffer.oldtype);
+
+	      if (err == 0)
+		break;
+	      else if (__glibc_unlikely (err == -ETIMEDOUT))
+		{
+		  timeout:
+		  /* When we timed out, we effectively cancel waiting.  */
+		  __condvar_cancel_waiting (cond, pshared);
+		  result = ETIMEDOUT;
+		  break;
+		}
+	    }
+
+	  spin = maxspin;
+	}
 
-      /* Check whether we are eligible for wakeup.  */
-      val = cond->__data.__wakeup_seq;
+      /* (Spin-)Wait until enough signals have been sent.  */
+      ssent = atomic_load_relaxed (&cond->__data.__signals_sent);
     }
-  while (val == seq || cond->__data.__woken_seq == val);
-
-  /* Another thread woken up.  */
-  ++cond->__data.__woken_seq;
-
- bc_out:
-
-  cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
 
-  /* If pthread_cond_destroy was called on this varaible already,
-     notify the pthread_cond_destroy caller all waiters have left
-     and it can be successfully destroyed.  */
-  if (cond->__data.__total_seq == -1ULL
-      && cond->__data.__nwaiters < (1 << COND_NWAITERS_SHIFT))
-    lll_futex_wake (&cond->__data.__nwaiters, 1, pshared);
+  /* We won't block on a futex anymore.  */
+  __pthread_cleanup_pop (&buffer, 0);
 
-  /* We are done with the condvar.  */
-  lll_unlock (cond->__data.__lock, pshared);
+  /* Confirm that we have been woken.  We do that before acquiring the mutex
+     to reduce the latency of dealing with quiescence, and to allow
+     pthread_cond_destroy to be executed by a thread that still holds the
+     mutex.
+     Neither signalers nor waiters will wait for quiescence to complete
+     while they hold the mutex.  */
+  __condvar_confirm_wakeup (cond, pshared);
+
+  /* Woken up; now re-acquire the mutex.  If this doesn't fail, return RESULT,
+     which is set to ETIMEDOUT if a timeout occurred, or zero otherwise.  */
+  err = __pthread_mutex_cond_lock (mutex);
+  /* XXX Rather abort on errors that are disallowed by POSIX?  */
+  return (err != 0) ? err : result;
+}
 
-  /* The cancellation handling is back to normal, remove the handler.  */
-  __pthread_cleanup_pop (&buffer, 0);
+int
+__pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
+{
+  return __pthread_cond_wait_common (cond, mutex, NULL);
+}
 
-  /* Get the mutex before returning.  Not needed for PI.  */
-#if (defined lll_futex_wait_requeue_pi \
-     && defined __ASSUME_REQUEUE_PI)
-  if (pi_flag)
-    {
-      __pthread_mutex_cond_lock_adjust (mutex);
-      return 0;
-    }
-  else
-#endif
-    return __pthread_mutex_cond_lock (mutex);
+int
+__pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mutex,
+    const struct timespec *abstime)
+{
+  /* Check parameter validity.  This should also tell the compiler that
+     it can assume that abstime is not NULL.  */
+  if (abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000)
+    return EINVAL;
+  return __pthread_cond_wait_common (cond, mutex, abstime);
 }
 
 versioned_symbol (libpthread, __pthread_cond_wait, pthread_cond_wait,
 		  GLIBC_2_3_2);
+versioned_symbol (libpthread, __pthread_cond_timedwait, pthread_cond_timedwait,
+		  GLIBC_2_3_2);
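To make the timeout handling above easier to follow: when
lll_futex_timed_wait_bitset is not available, the code converts the absolute
deadline into a relative timeout by hand.  The following is only an
illustrative, standalone sketch of that conversion using plain clock_gettime
(the patch itself goes through INTERNAL_VSYSCALL or __gettimeofday); the
function name and signature are made up:

#include <errno.h>
#include <time.h>

/* Sketch of the absolute-to-relative timeout conversion.  Returns ETIMEDOUT
   if the deadline has already passed, otherwise 0 and a normalized relative
   timeout in *REL.  */
static int
abstime_to_reltime (clockid_t clockid, const struct timespec *abstime,
		    struct timespec *rel)
{
  struct timespec now;

  /* The kernel rejects negative timeouts, so treat them as expired.  */
  if (abstime->tv_sec < 0)
    return ETIMEDOUT;

  clock_gettime (clockid, &now);
  rel->tv_sec = abstime->tv_sec - now.tv_sec;
  rel->tv_nsec = abstime->tv_nsec - now.tv_nsec;
  if (rel->tv_nsec < 0)
    {
      /* Borrow one second to keep tv_nsec in [0, 1000000000).  */
      rel->tv_nsec += 1000000000;
      --rel->tv_sec;
    }
  return rel->tv_sec < 0 ? ETIMEDOUT : 0;
}

The same tv_nsec normalization and the early timeout checks appear in the
timed path above.
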
diff --git a/nptl/pthread_condattr_getclock.c b/nptl/pthread_condattr_getclock.c
index 020d21a..2ad585b 100644
--- a/nptl/pthread_condattr_getclock.c
+++ b/nptl/pthread_condattr_getclock.c
@@ -25,6 +25,6 @@  pthread_condattr_getclock (attr, clock_id)
      clockid_t *clock_id;
 {
   *clock_id = (((((const struct pthread_condattr *) attr)->value) >> 1)
-	       & ((1 << COND_NWAITERS_SHIFT) - 1));
+	       & ((1 << COND_CLOCK_BITS) - 1));
   return 0;
 }
diff --git a/nptl/pthread_condattr_setclock.c b/nptl/pthread_condattr_setclock.c
index 0748d78..cb8d8dd 100644
--- a/nptl/pthread_condattr_setclock.c
+++ b/nptl/pthread_condattr_setclock.c
@@ -36,11 +36,11 @@  pthread_condattr_setclock (attr, clock_id)
     return EINVAL;
 
   /* Make sure the value fits in the bits we reserved.  */
-  assert (clock_id < (1 << COND_NWAITERS_SHIFT));
+  assert (clock_id < (1 << COND_CLOCK_BITS));
 
   int *valuep = &((struct pthread_condattr *) attr)->value;
 
-  *valuep = ((*valuep & ~(((1 << COND_NWAITERS_SHIFT) - 1) << 1))
+  *valuep = ((*valuep & ~(((1 << COND_CLOCK_BITS) - 1) << 1))
 	     | (clock_id << 1));
 
   return 0;
diff --git a/nptl/tst-cond1.c b/nptl/tst-cond1.c
index 64f90e0..fab2b19 100644
--- a/nptl/tst-cond1.c
+++ b/nptl/tst-cond1.c
@@ -73,6 +73,9 @@  do_test (void)
 
   puts ("parent: wait for condition");
 
+  /* This test will fail on spurious wake-ups, which are allowed; however,
+     the current implementation shouldn't produce spurious wake-ups in the
+     scenario we are testing here.  */
   err = pthread_cond_wait (&cond, &mut);
   if (err != 0)
    error (EXIT_FAILURE, err, "parent: cannot wait for signal");
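For contrast with the comment added above: portable code must tolerate
spurious wake-ups by re-checking its predicate in a loop, which is exactly
what the tst-cond20 change further down adds.  A minimal sketch of that
idiom (the predicate callback is a made-up stand-in):

#include <pthread.h>
#include <stdbool.h>

/* Canonical waiting idiom that is immune to spurious wake-ups.  */
static void
wait_for_predicate (pthread_cond_t *cond, pthread_mutex_t *mut,
		    bool (*predicate) (void))
{
  pthread_mutex_lock (mut);
  while (!predicate ())
    pthread_cond_wait (cond, mut);
  pthread_mutex_unlock (mut);
}
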
diff --git a/nptl/tst-cond18.c b/nptl/tst-cond18.c
index ceeb1aa..b14ed79 100644
--- a/nptl/tst-cond18.c
+++ b/nptl/tst-cond18.c
@@ -23,6 +23,7 @@ 
 #include <stdlib.h>
 #include <stdio.h>
 #include <unistd.h>
+#include <atomic.h>
 
 pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
 pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
@@ -43,6 +44,26 @@  tf (void *id)
 	  pthread_mutex_unlock (&lock);
 
 	  pthread_mutex_lock (&lock);
+#ifdef TEST_QUIESCENCE
+	  /* Make sure we're triggering quiescence regularly by simply
+	     increasing all of WSEQ, SIGNALS_SENT, and CONFIRMED.
+	     We have acquired the lock, so there's no concurrent registration
+	     of waiters nor quiescence reset; thus, WSEQ is not concurrently
+	     modified, and when we increase CONFIRMED, we can never reach
+	     the threshold (but CONFIRMED can be concurrently modified).
+	     Also, there's no other thread doing signals, so we're the only
+	     one modifying SIGNALS_SENT.  */
+	  unsigned int seq = atomic_load_relaxed (&cv.__data.__wseq);
+	  if (seq < __PTHREAD_COND_WSEQ_THRESHOLD - 3 * count)
+	    {
+	      unsigned int d = __PTHREAD_COND_WSEQ_THRESHOLD - 3 * count
+		  - seq;
+	      atomic_store_relaxed (&cv.__data.__wseq, seq + d);
+	      atomic_store_relaxed (&cv.__data.__signals_sent,
+		  atomic_load_relaxed (&cv.__data.__signals_sent) + d);
+	      atomic_fetch_add_relaxed (&cv.__data.__confirmed, d);
+	    }
+#endif
 	  int njobs = rand () % (count + 1);
 	  nn = njobs;
 	  if ((rand () % 30) == 0)
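Spelled out without the test scaffolding, the TEST_QUIESCENCE hunk above
does the following: while the lock is held, advance all three counters by
the same delta so that the waiter sequence sits just below the quiescence
threshold.  A standalone sketch with stand-in names for the condvar fields
(MARGIN corresponds to 3 * count in tst-cond18):

#include <stdatomic.h>

struct cv_counters
{
  _Atomic unsigned int wseq;		/* like __wseq */
  _Atomic unsigned int signals_sent;	/* like __signals_sent */
  _Atomic unsigned int confirmed;	/* like __confirmed */
};

static void
push_towards_quiescence (struct cv_counters *c, unsigned int threshold,
			 unsigned int margin)
{
  unsigned int seq = atomic_load_explicit (&c->wseq, memory_order_relaxed);
  if (seq < threshold - margin)
    {
      unsigned int d = threshold - margin - seq;
      /* WSEQ and SIGNALS_SENT are not modified concurrently here (the lock
	 is held and nobody else signals), so plain relaxed stores suffice;
	 CONFIRMED can be modified concurrently by waking waiters, so it
	 needs a fetch-and-add.  */
      atomic_store_explicit (&c->wseq, seq + d, memory_order_relaxed);
      atomic_store_explicit (&c->signals_sent,
			     atomic_load_explicit (&c->signals_sent,
						   memory_order_relaxed) + d,
			     memory_order_relaxed);
      atomic_fetch_add_explicit (&c->confirmed, d, memory_order_relaxed);
    }
}
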
diff --git a/nptl/tst-cond20.c b/nptl/tst-cond20.c
index 9de062a..5122370 100644
--- a/nptl/tst-cond20.c
+++ b/nptl/tst-cond20.c
@@ -82,6 +82,7 @@  do_test (void)
       puts ("barrier_init failed");
       return 1;
     }
+  /* We simply don't test quiescence in the first round.  See below.  */
 
   pthread_mutex_lock (&mut);
 
@@ -96,7 +97,10 @@  do_test (void)
 
   for (i = 0; i < ROUNDS; ++i)
     {
-      pthread_cond_wait (&cond2, &mut);
+      /* Make sure we discard spurious wake-ups.  */
+      do
+	pthread_cond_wait (&cond2, &mut);
+      while (count != N);
 
       if (i & 1)
         pthread_mutex_unlock (&mut);
@@ -150,6 +154,14 @@  do_test (void)
 	  printf ("pthread_cond_init failed: %s\n", strerror (err));
 	  return 1;
 	}
+#ifdef TEST_QUIESCENCE
+      /* This results in the condvar being in a quiescence state as soon as
+	 some or all of the waiters have started to block.  Note that we
+	 must not put it immediately in the quiescence state because we
+	 need some of the waiters to change the generation etc.  */
+      cond.__data.__wseq = cond.__data.__signals_sent = cond.__data.__confirmed
+	  = __PTHREAD_COND_WSEQ_THRESHOLD - i % N - 1;
+#endif
     }
 
   for (i = 0; i < N; ++i)
diff --git a/nptl/tst-cond22.c b/nptl/tst-cond22.c
index bd978e5..1ee5188 100644
--- a/nptl/tst-cond22.c
+++ b/nptl/tst-cond22.c
@@ -106,10 +106,10 @@  do_test (void)
       status = 1;
     }
 
-  printf ("cond = { %d, %x, %lld, %lld, %lld, %p, %u, %u }\n",
-	  c.__data.__lock, c.__data.__futex, c.__data.__total_seq,
-	  c.__data.__wakeup_seq, c.__data.__woken_seq, c.__data.__mutex,
-	  c.__data.__nwaiters, c.__data.__broadcast_seq);
+  printf ("cond = { %u, %u, %u, %u, %p, %u }\n",
+	  c.__data.__wseq, c.__data.__signals_sent, c.__data.__confirmed,
+	  c.__data.__generation, c.__data.__mutex,
+	  c.__data.__quiescence_waiters);
 
   if (pthread_create (&th, NULL, tf, (void *) 1l) != 0)
     {
@@ -148,10 +148,10 @@  do_test (void)
       status = 1;
     }
 
-  printf ("cond = { %d, %x, %lld, %lld, %lld, %p, %u, %u }\n",
-	  c.__data.__lock, c.__data.__futex, c.__data.__total_seq,
-	  c.__data.__wakeup_seq, c.__data.__woken_seq, c.__data.__mutex,
-	  c.__data.__nwaiters, c.__data.__broadcast_seq);
+  printf ("cond = { %u, %u, %u, %u, %p, %u }\n",
+	  c.__data.__wseq, c.__data.__signals_sent, c.__data.__confirmed,
+	  c.__data.__generation, c.__data.__mutex,
+	  c.__data.__quiescence_waiters);
 
   return status;
 }
diff --git a/nptl/tst-cond25.c b/nptl/tst-cond25.c
index be0bec4..ddc37a0 100644
--- a/nptl/tst-cond25.c
+++ b/nptl/tst-cond25.c
@@ -216,6 +216,14 @@  do_test_wait (thr_func f)
 	  printf ("cond_init failed: %s\n", strerror (ret));
 	  goto out;
 	}
+#ifdef TEST_QUIESCENCE
+      /* This results in the condvar being in a quiescence state as soon as
+	 some or all of the waiters have started to block.  Note that we
+	 must not put it immediately in the quiescence state because we
+	 need some of the waiters to change the generation etc.  */
+      cond.__data.__wseq = cond.__data.__signals_sent = cond.__data.__confirmed
+	  = __PTHREAD_COND_WSEQ_THRESHOLD - i % NUM - 1;
+#endif
 
       if ((ret = pthread_mutex_init (&mutex, &attr)) != 0)
         {
diff --git a/nptl/tst-cond26.c b/nptl/tst-cond26.c
new file mode 100644
index 0000000..b611d62
--- /dev/null
+++ b/nptl/tst-cond26.c
@@ -0,0 +1,2 @@ 
+#define TEST_QUIESCENCE 1
+#include "tst-cond20.c"
diff --git a/nptl/tst-cond27.c b/nptl/tst-cond27.c
new file mode 100644
index 0000000..8668a24
--- /dev/null
+++ b/nptl/tst-cond27.c
@@ -0,0 +1,2 @@ 
+#define TEST_QUIESCENCE 1
+#include "tst-cond25.c"
diff --git a/nptl/tst-cond28.c b/nptl/tst-cond28.c
new file mode 100644
index 0000000..7fc3b6b
--- /dev/null
+++ b/nptl/tst-cond28.c
@@ -0,0 +1,2 @@ 
+#define TEST_QUIESCENCE 1
+#include "tst-cond18.c"
diff --git a/sysdeps/aarch64/nptl/bits/pthreadtypes.h b/sysdeps/aarch64/nptl/bits/pthreadtypes.h
index 0e4795e..c9ae0d6 100644
--- a/sysdeps/aarch64/nptl/bits/pthreadtypes.h
+++ b/sysdeps/aarch64/nptl/bits/pthreadtypes.h
@@ -90,14 +90,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   long int __align;
diff --git a/sysdeps/arm/nptl/bits/pthreadtypes.h b/sysdeps/arm/nptl/bits/pthreadtypes.h
index 9f2efc2..f84c272 100644
--- a/sysdeps/arm/nptl/bits/pthreadtypes.h
+++ b/sysdeps/arm/nptl/bits/pthreadtypes.h
@@ -93,14 +93,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/hppa/nptl/bits/pthreadtypes.h b/sysdeps/hppa/nptl/bits/pthreadtypes.h
index 845629d..fcd45c9 100644
--- a/sysdeps/hppa/nptl/bits/pthreadtypes.h
+++ b/sysdeps/hppa/nptl/bits/pthreadtypes.h
@@ -119,23 +119,19 @@  typedef union
        start of the 4-word lock structure, the next four words
        are set all to 1 by the Linuxthreads
        PTHREAD_COND_INITIALIZER.  */
-    int __lock __attribute__ ((aligned(16)));
+    unsigned int __wseq __attribute__ ((aligned(16)));
     /* Tracks the initialization of this structure:
        0  initialized with NPTL PTHREAD_COND_INITIALIZER.
        1  initialized with Linuxthreads PTHREAD_COND_INITIALIZER.
        2  initialization in progress.  */
     int __initializer;
-    unsigned int __futex;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    /* In the old Linuxthreads this would have been the start
-       of the pthread_fastlock status word.  */
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
-    /* The NPTL pthread_cond_t is exactly the same size as
-       the Linuxthreads version, there are no words to spare.  */
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/ia64/nptl/bits/pthreadtypes.h b/sysdeps/ia64/nptl/bits/pthreadtypes.h
index e9762f5..9477f9a 100644
--- a/sysdeps/ia64/nptl/bits/pthreadtypes.h
+++ b/sysdeps/ia64/nptl/bits/pthreadtypes.h
@@ -90,14 +90,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   long int __align;
diff --git a/sysdeps/m68k/nptl/bits/pthreadtypes.h b/sysdeps/m68k/nptl/bits/pthreadtypes.h
index 0e2bcdd..40fdec1 100644
--- a/sysdeps/m68k/nptl/bits/pthreadtypes.h
+++ b/sysdeps/m68k/nptl/bits/pthreadtypes.h
@@ -93,14 +93,14 @@  typedef union
 {
   struct
   {
-    int __lock __attribute__ ((__aligned__ (4)));
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq __attribute__ ((__aligned__ (4)));
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/microblaze/nptl/bits/pthreadtypes.h b/sysdeps/microblaze/nptl/bits/pthreadtypes.h
index b8bd828..58a0daa 100644
--- a/sysdeps/microblaze/nptl/bits/pthreadtypes.h
+++ b/sysdeps/microblaze/nptl/bits/pthreadtypes.h
@@ -91,14 +91,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/mips/nptl/bits/pthreadtypes.h b/sysdeps/mips/nptl/bits/pthreadtypes.h
index 8cf4547..4267568 100644
--- a/sysdeps/mips/nptl/bits/pthreadtypes.h
+++ b/sysdeps/mips/nptl/bits/pthreadtypes.h
@@ -122,14 +122,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/nios2/nptl/bits/pthreadtypes.h b/sysdeps/nios2/nptl/bits/pthreadtypes.h
index 4a20803..d35bd01 100644
--- a/sysdeps/nios2/nptl/bits/pthreadtypes.h
+++ b/sysdeps/nios2/nptl/bits/pthreadtypes.h
@@ -93,14 +93,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/nptl/internaltypes.h b/sysdeps/nptl/internaltypes.h
index 8f5cfa4..726a760 100644
--- a/sysdeps/nptl/internaltypes.h
+++ b/sysdeps/nptl/internaltypes.h
@@ -68,20 +68,13 @@  struct pthread_condattr
 {
   /* Combination of values:
 
-     Bit 0  : flag whether conditional variable will be sharable between
-	      processes.
-
-     Bit 1-7: clock ID.  */
+     Bit 0                : flag whether conditional variable will be
+                            sharable between processes.
+     Bit 1-COND_CLOCK_BITS: Clock ID.  COND_CLOCK_BITS is the number of bits
+                            needed to represent the ID of the clock.  */
   int value;
 };
-
-
-/* The __NWAITERS field is used as a counter and to house the number
-   of bits for other purposes.  COND_CLOCK_BITS is the number
-   of bits needed to represent the ID of the clock.  COND_NWAITERS_SHIFT
-   is the number of bits reserved for other purposes like the clock.  */
-#define COND_CLOCK_BITS		1
-#define COND_NWAITERS_SHIFT	1
+#define COND_CLOCK_BITS	1
 
 
 /* Read-write lock variable attribute data structure.  */
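The bit layout described in the comment above corresponds to the accessors
in the pthread_condattr_getclock/setclock diffs; as a standalone sketch
(same COND_CLOCK_BITS value as in internaltypes.h):

/* Bit 0 of VALUE is the process-shared flag; the COND_CLOCK_BITS bits
   above it hold the clock ID.  */
#define COND_CLOCK_BITS 1

static int
condattr_clock (int value)
{
  return (value >> 1) & ((1 << COND_CLOCK_BITS) - 1);
}

static int
condattr_with_clock (int value, int clock_id)
{
  return (value & ~(((1 << COND_CLOCK_BITS) - 1) << 1)) | (clock_id << 1);
}
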
diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
index 70ff250..3749f08 100644
--- a/sysdeps/nptl/pthread.h
+++ b/sysdeps/nptl/pthread.h
@@ -185,7 +185,7 @@  enum
 
 
 /* Conditional variable handling.  */
-#define PTHREAD_COND_INITIALIZER { { 0, 0, 0, 0, 0, (void *) 0, 0, 0 } }
+#define PTHREAD_COND_INITIALIZER { { 0, 0, 0, 0, (void *) 0, 0, 0 } }
 
 
 /* Cleanup buffers */
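The initializer shrinks from eight to seven zero-valued fields because the
new __data struct has seven members.  Written with designated initializers
purely for illustration (the public header has to keep the positional form,
and this only compiles against the patched headers), the mapping is:

#include <pthread.h>

pthread_cond_t c = { .__data = { .__wseq = 0, .__signals_sent = 0,
				 .__confirmed = 0, .__generation = 0,
				 .__mutex = (void *) 0,
				 .__quiescence_waiters = 0,
				 .__clockid = 0 } };
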
diff --git a/sysdeps/s390/nptl/bits/pthreadtypes.h b/sysdeps/s390/nptl/bits/pthreadtypes.h
index 1f3bb14..d96dbbe 100644
--- a/sysdeps/s390/nptl/bits/pthreadtypes.h
+++ b/sysdeps/s390/nptl/bits/pthreadtypes.h
@@ -142,14 +142,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/sh/nptl/bits/pthreadtypes.h b/sysdeps/sh/nptl/bits/pthreadtypes.h
index 5940232..412e831 100644
--- a/sysdeps/sh/nptl/bits/pthreadtypes.h
+++ b/sysdeps/sh/nptl/bits/pthreadtypes.h
@@ -93,14 +93,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/sparc/nptl/bits/pthreadtypes.h b/sysdeps/sparc/nptl/bits/pthreadtypes.h
index 6faf8b2..5e72d77 100644
--- a/sysdeps/sparc/nptl/bits/pthreadtypes.h
+++ b/sysdeps/sparc/nptl/bits/pthreadtypes.h
@@ -122,14 +122,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/tile/nptl/bits/pthreadtypes.h b/sysdeps/tile/nptl/bits/pthreadtypes.h
index 1f6553d..bb521b7 100644
--- a/sysdeps/tile/nptl/bits/pthreadtypes.h
+++ b/sysdeps/tile/nptl/bits/pthreadtypes.h
@@ -122,14 +122,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h b/sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h
index 7121d0b..5d42d70 100644
--- a/sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h
+++ b/sysdeps/unix/sysv/linux/alpha/bits/pthreadtypes.h
@@ -89,14 +89,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
diff --git a/sysdeps/unix/sysv/linux/hppa/internaltypes.h b/sysdeps/unix/sysv/linux/hppa/internaltypes.h
index 651ce2e..d649657 100644
--- a/sysdeps/unix/sysv/linux/hppa/internaltypes.h
+++ b/sysdeps/unix/sysv/linux/hppa/internaltypes.h
@@ -46,32 +46,38 @@  fails because __initializer is zero, and the structure will be used as
 is correctly.  */
 
 #define cond_compat_clear(var) \
-({											\
-  int tmp = 0;										\
-  var->__data.__lock = 0;								\
-  var->__data.__futex = 0;								\
-  var->__data.__mutex = NULL;								\
-  /* Clear __initializer last, to indicate initialization is done.  */			\
-  __asm__ __volatile__ ("stw,ma %1,0(%0)"						\
-			: : "r" (&var->__data.__initializer), "r" (tmp) : "memory");	\
+({									\
+  int tmp = 0;								\
+  var->__data.__wseq = 0;						\
+  var->__data.__signals_sent = 0;					\
+  var->__data.__confirmed = 0;						\
+  var->__data.__generation = 0;						\
+  var->__data.__mutex = NULL;						\
+  var->__data.__quiescence_waiters = 0;					\
+  var->__data.__clockid = 0;						\
+  /* Clear __initializer last, to indicate initialization is done.  */	\
+  /* This synchronizes-with the acquire load below.  */			\
+  atomic_store_release (&var->__data.__initializer, 0);			\
 })
 
 #define cond_compat_check_and_clear(var) \
 ({								\
-  int ret;							\
-  volatile int *value = &var->__data.__initializer;		\
-  if ((ret = atomic_compare_and_exchange_val_acq(value, 2, 1)))	\
+  int v;							\
+  int *value = &var->__data.__initializer;			\
+  /* This synchronizes-with the release store above.  */	\
+  while ((v = atomic_load_acquire (value)) != 0)		\
     {								\
-      if (ret == 1)						\
+      if (v == 1						\
+	  /* Relaxed MO is fine; it only matters who's first.  */        \
+	  && atomic_compare_exchange_acquire_weak_relaxed (value, 1, 2)) \
 	{							\
-	  /* Initialize structure.  */				\
+	  /* We're first; initialize structure.  */		\
 	  cond_compat_clear (var);				\
+	  break;						\
 	}							\
       else							\
-        {							\
-	  /* Yield until structure is initialized.  */		\
-	  while (*value == 2) sched_yield ();			\
-        }							\
+	/* Yield before we re-check initialization status.  */	\
+	sched_yield ();						\
     }								\
 })
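Stripped of the macro plumbing, the handshake above is essentially the
following sketch, written with C11 atomics instead of glibc's atomic_*
wrappers (the function and callback names are invented):

#include <sched.h>
#include <stdatomic.h>

/* __initializer is 1 for a Linuxthreads-style static initializer, 2 while
   one thread converts the object, and 0 once conversion is done.  */
static void
compat_check_and_clear (_Atomic int *initializer,
			void (*clear_fields) (void))
{
  int v;
  /* The acquire load pairs with the release store that publishes the
     cleared fields, so a thread that sees 0 also sees the initialized
     condvar.  */
  while ((v = atomic_load_explicit (initializer, memory_order_acquire)) != 0)
    {
      int expected = 1;
      if (v == 1
	  && atomic_compare_exchange_weak_explicit (initializer, &expected, 2,
						    memory_order_acquire,
						    memory_order_relaxed))
	{
	  clear_fields ();	/* Reset all condvar fields to zero.  */
	  atomic_store_explicit (initializer, 0, memory_order_release);
	  break;
	}
      sched_yield ();		/* Some other thread is converting.  */
    }
}
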
 
diff --git a/sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c b/sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c
deleted file mode 100644
index 6199013..0000000
--- a/sysdeps/unix/sysv/linux/hppa/pthread_cond_timedwait.c
+++ /dev/null
@@ -1,43 +0,0 @@ 
-/* Copyright (C) 2009-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Carlos O'Donell <carlos@codesourcery.com>, 2009.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#ifndef INCLUDED_SELF
-# define INCLUDED_SELF
-# include <pthread_cond_timedwait.c>
-#else
-# include <pthread.h>
-# include <pthreadP.h>
-# include <internaltypes.h>
-# include <shlib-compat.h>
-int
-__pthread_cond_timedwait (cond, mutex, abstime)
-     pthread_cond_t *cond;
-     pthread_mutex_t *mutex;
-     const struct timespec *abstime;
-{
-  cond_compat_check_and_clear (cond);
-  return __pthread_cond_timedwait_internal (cond, mutex, abstime);
-}
-versioned_symbol (libpthread, __pthread_cond_timedwait, pthread_cond_timedwait,
-                  GLIBC_2_3_2);
-# undef versioned_symbol
-# define versioned_symbol(lib, local, symbol, version)
-# undef __pthread_cond_timedwait
-# define __pthread_cond_timedwait __pthread_cond_timedwait_internal
-# include_next <pthread_cond_timedwait.c>
-#endif
diff --git a/sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c b/sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
index 5e1506f..1496730 100644
--- a/sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
+++ b/sysdeps/unix/sysv/linux/hppa/pthread_cond_wait.c
@@ -34,9 +34,22 @@  __pthread_cond_wait (cond, mutex)
 }
 versioned_symbol (libpthread, __pthread_cond_wait, pthread_cond_wait,
                   GLIBC_2_3_2);
+int
+__pthread_cond_timedwait (cond, mutex, abstime)
+     pthread_cond_t *cond;
+     pthread_mutex_t *mutex;
+     const struct timespec *abstime;
+{
+  cond_compat_check_and_clear (cond);
+  return __pthread_cond_timedwait_internal (cond, mutex, abstime);
+}
+versioned_symbol (libpthread, __pthread_cond_timedwait, pthread_cond_timedwait,
+                  GLIBC_2_3_2);
 # undef versioned_symbol
 # define versioned_symbol(lib, local, symbol, version)
 # undef __pthread_cond_wait
 # define __pthread_cond_wait __pthread_cond_wait_internal
+# undef __pthread_cond_timedwait
+# define __pthread_cond_timedwait __pthread_cond_timedwait_internal
 # include_next <pthread_cond_wait.c>
 #endif
diff --git a/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S b/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S
deleted file mode 100644
index 5ddd5ac..0000000
--- a/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_broadcast.S
+++ /dev/null
@@ -1,241 +0,0 @@ 
-/* Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <sysdep.h>
-#include <shlib-compat.h>
-#include <lowlevellock.h>
-#include <lowlevelcond.h>
-#include <kernel-features.h>
-#include <pthread-pi-defines.h>
-#include <pthread-errnos.h>
-#include <stap-probe.h>
-
-	.text
-
-	/* int pthread_cond_broadcast (pthread_cond_t *cond) */
-	.globl	__pthread_cond_broadcast
-	.type	__pthread_cond_broadcast, @function
-	.align	16
-__pthread_cond_broadcast:
-	cfi_startproc
-	pushl	%ebx
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%ebx, 0)
-	pushl	%esi
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%esi, 0)
-	pushl	%edi
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%edi, 0)
-	pushl	%ebp
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%ebp, 0)
-	cfi_remember_state
-
-	movl	20(%esp), %ebx
-
-	LIBC_PROBE (cond_broadcast, 1, %edx)
-
-	/* Get internal lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-#else
-	cmpxchgl %edx, cond_lock(%ebx)
-#endif
-	jnz	1f
-
-2:	addl	$cond_futex, %ebx
-	movl	total_seq+4-cond_futex(%ebx), %eax
-	movl	total_seq-cond_futex(%ebx), %ebp
-	cmpl	wakeup_seq+4-cond_futex(%ebx), %eax
-	ja	3f
-	jb	4f
-	cmpl	wakeup_seq-cond_futex(%ebx), %ebp
-	jna	4f
-
-	/* Cause all currently waiting threads to recognize they are
-	   woken up.  */
-3:	movl	%ebp, wakeup_seq-cond_futex(%ebx)
-	movl	%eax, wakeup_seq-cond_futex+4(%ebx)
-	movl	%ebp, woken_seq-cond_futex(%ebx)
-	movl	%eax, woken_seq-cond_futex+4(%ebx)
-	addl	%ebp, %ebp
-	addl	$1, broadcast_seq-cond_futex(%ebx)
-	movl	%ebp, (%ebx)
-
-	/* Get the address of the mutex used.  */
-	movl	dep_mutex-cond_futex(%ebx), %edi
-
-	/* Unlock.  */
-	LOCK
-	subl	$1, cond_lock-cond_futex(%ebx)
-	jne	7f
-
-	/* Don't use requeue for pshared condvars.  */
-8:	cmpl	$-1, %edi
-	je	9f
-
-	/* Do not use requeue for pshared condvars.  */
-	testl	$PS_BIT, MUTEX_KIND(%edi)
-	jne	9f
-
-	/* Requeue to a non-robust PI mutex if the PI bit is set and
-	   the robust bit is not set.  */
-	movl	MUTEX_KIND(%edi), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	je	81f
-
-	/* Wake up all threads.  */
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$(FUTEX_CMP_REQUEUE|FUTEX_PRIVATE_FLAG), %ecx
-#else
-	movl	%gs:PRIVATE_FUTEX, %ecx
-	orl	$FUTEX_CMP_REQUEUE, %ecx
-#endif
-	movl	$SYS_futex, %eax
-	movl	$0x7fffffff, %esi
-	movl	$1, %edx
-	/* Get the address of the futex involved.  */
-# if MUTEX_FUTEX != 0
-	addl	$MUTEX_FUTEX, %edi
-# endif
-/* FIXME: Until Ingo fixes 4G/4G vDSO, 6 arg syscalls are broken for sysenter.
-	ENTER_KERNEL  */
-	int	$0x80
-
-	/* For any kind of error, which mainly is EAGAIN, we try again
-	   with WAKE.  The general test also covers running on old
-	   kernels.  */
-	cmpl	$0xfffff001, %eax
-	jae	9f
-
-6:	xorl	%eax, %eax
-	popl	%ebp
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%ebp)
-	popl	%edi
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%edi)
-	popl	%esi
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%esi)
-	popl	%ebx
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%ebx)
-	ret
-
-	cfi_restore_state
-
-81:	movl	$(FUTEX_CMP_REQUEUE_PI|FUTEX_PRIVATE_FLAG), %ecx
-	movl	$SYS_futex, %eax
-	movl	$0x7fffffff, %esi
-	movl	$1, %edx
-	/* Get the address of the futex involved.  */
-# if MUTEX_FUTEX != 0
-	addl	$MUTEX_FUTEX, %edi
-# endif
-	int	$0x80
-
-	/* For any kind of error, which mainly is EAGAIN, we try again
-	with WAKE.  The general test also covers running on old
-	kernels.  */
-	cmpl	$0xfffff001, %eax
-	jb	6b
-	jmp	9f
-
-	/* Initial locking failed.  */
-1:
-#if cond_lock == 0
-	movl	%ebx, %edx
-#else
-	leal	cond_lock(%ebx), %edx
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_lock_wait
-	jmp	2b
-
-	.align	16
-	/* Unlock.  */
-4:	LOCK
-	subl	$1, cond_lock-cond_futex(%ebx)
-	je	6b
-
-	/* Unlock in loop requires wakeup.  */
-5:	leal	cond_lock-cond_futex(%ebx), %eax
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_futex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-	jmp	6b
-
-	/* Unlock in loop requires wakeup.  */
-7:	leal	cond_lock-cond_futex(%ebx), %eax
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_futex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-	jmp	8b
-
-9:	/* The futex requeue functionality is not available.  */
-	movl	$0x7fffffff, %edx
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_futex(%ebx)
-	sete	%cl
-	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAKE, %ecx
-	movl	$SYS_futex, %eax
-	ENTER_KERNEL
-	jmp	6b
-	cfi_endproc
-	.size	__pthread_cond_broadcast, .-__pthread_cond_broadcast
-versioned_symbol (libpthread, __pthread_cond_broadcast, pthread_cond_broadcast,
-		  GLIBC_2_3_2)
diff --git a/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S b/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S
deleted file mode 100644
index 8f4d937..0000000
--- a/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_signal.S
+++ /dev/null
@@ -1,216 +0,0 @@ 
-/* Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <sysdep.h>
-#include <shlib-compat.h>
-#include <lowlevellock.h>
-#include <lowlevelcond.h>
-#include <kernel-features.h>
-#include <pthread-pi-defines.h>
-#include <pthread-errnos.h>
-#include <stap-probe.h>
-
-	.text
-
-	/* int pthread_cond_signal (pthread_cond_t *cond) */
-	.globl	__pthread_cond_signal
-	.type	__pthread_cond_signal, @function
-	.align	16
-__pthread_cond_signal:
-
-	cfi_startproc
-	pushl	%ebx
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%ebx, 0)
-	pushl	%edi
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%edi, 0)
-	cfi_remember_state
-
-	movl	12(%esp), %edi
-
-	LIBC_PROBE (cond_signal, 1, %edi)
-
-	/* Get internal lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %edx, (%edi)
-#else
-	cmpxchgl %edx, cond_lock(%edi)
-#endif
-	jnz	1f
-
-2:	leal	cond_futex(%edi), %ebx
-	movl	total_seq+4(%edi), %eax
-	movl	total_seq(%edi), %ecx
-	cmpl	wakeup_seq+4(%edi), %eax
-#if cond_lock != 0
-	/* Must use leal to preserve the flags.  */
-	leal	cond_lock(%edi), %edi
-#endif
-	ja	3f
-	jb	4f
-	cmpl	wakeup_seq-cond_futex(%ebx), %ecx
-	jbe	4f
-
-	/* Bump the wakeup number.  */
-3:	addl	$1, wakeup_seq-cond_futex(%ebx)
-	adcl	$0, wakeup_seq-cond_futex+4(%ebx)
-	addl	$1, (%ebx)
-
-	/* Wake up one thread.  */
-	pushl	%esi
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%esi, 0)
-	pushl	%ebp
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%ebp, 0)
-
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_futex(%ebx)
-	sete	%cl
-	je	8f
-
-	movl	dep_mutex-cond_futex(%ebx), %edx
-	/* Requeue to a non-robust PI mutex if the PI bit is set and
-	   the robust bit is not set.  */
-	movl	MUTEX_KIND(%edx), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	je	9f
-
-8:	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAKE_OP, %ecx
-	movl	$SYS_futex, %eax
-	movl	$1, %edx
-	movl	$1, %esi
-	movl	$FUTEX_OP_CLEAR_WAKE_IF_GT_ONE, %ebp
-	/* FIXME: Until Ingo fixes 4G/4G vDSO, 6 arg syscalls are broken for
-	   sysenter.
-	ENTER_KERNEL  */
-	int	$0x80
-	popl	%ebp
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%ebp)
-	popl	%esi
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%esi)
-
-	/* For any kind of error, we try again with WAKE.
-	   The general test also covers running on old kernels.  */
-	cmpl	$-4095, %eax
-	jae	7f
-
-6:	xorl	%eax, %eax
-	popl	%edi
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%edi)
-	popl	%ebx
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%ebx)
-	ret
-
-	cfi_restore_state
-
-9:	movl	$(FUTEX_CMP_REQUEUE_PI|FUTEX_PRIVATE_FLAG), %ecx
-	movl	$SYS_futex, %eax
-	movl	$1, %edx
-	xorl	%esi, %esi
-	movl	dep_mutex-cond_futex(%ebx), %edi
-	movl	(%ebx), %ebp
-	/* FIXME: Until Ingo fixes 4G/4G vDSO, 6 arg syscalls are broken for
-	   sysenter.
-	ENTER_KERNEL  */
-	int	$0x80
-	popl	%ebp
-	popl	%esi
-
-	leal	-cond_futex(%ebx), %edi
-
-	/* For any kind of error, we try again with WAKE.
-	   The general test also covers running on old kernels.  */
-	cmpl	$-4095, %eax
-	jb	4f
-
-7:
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	orl	$FUTEX_WAKE, %ecx
-
-	movl	$SYS_futex, %eax
-	/* %edx should be 1 already from $FUTEX_WAKE_OP syscall.
-	movl	$1, %edx  */
-	ENTER_KERNEL
-
-	/* Unlock.  Note that at this point %edi always points to
-	   cond_lock.  */
-4:	LOCK
-	subl	$1, (%edi)
-	je	6b
-
-	/* Unlock in loop requires wakeup.  */
-5:	movl	%edi, %eax
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_futex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-	jmp	6b
-
-	/* Initial locking failed.  */
-1:
-#if cond_lock == 0
-	movl	%edi, %edx
-#else
-	leal	cond_lock(%edi), %edx
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%edi)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_lock_wait
-	jmp	2b
-
-	cfi_endproc
-	.size	__pthread_cond_signal, .-__pthread_cond_signal
-versioned_symbol (libpthread, __pthread_cond_signal, pthread_cond_signal,
-		  GLIBC_2_3_2)
diff --git a/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S b/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S
deleted file mode 100644
index 130c090..0000000
--- a/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_timedwait.S
+++ /dev/null
@@ -1,973 +0,0 @@ 
-/* Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <sysdep.h>
-#include <shlib-compat.h>
-#include <lowlevellock.h>
-#include <lowlevelcond.h>
-#include <pthread-errnos.h>
-#include <pthread-pi-defines.h>
-#include <kernel-features.h>
-#include <stap-probe.h>
-
-	.text
-
-/* int pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mutex,
-			       const struct timespec *abstime)  */
-	.globl	__pthread_cond_timedwait
-	.type	__pthread_cond_timedwait, @function
-	.align	16
-__pthread_cond_timedwait:
-.LSTARTCODE:
-	cfi_startproc
-#ifdef SHARED
-	cfi_personality(DW_EH_PE_pcrel | DW_EH_PE_sdata4 | DW_EH_PE_indirect,
-			DW.ref.__gcc_personality_v0)
-	cfi_lsda(DW_EH_PE_pcrel | DW_EH_PE_sdata4, .LexceptSTART)
-#else
-	cfi_personality(DW_EH_PE_udata4, __gcc_personality_v0)
-	cfi_lsda(DW_EH_PE_udata4, .LexceptSTART)
-#endif
-
-	pushl	%ebp
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%ebp, 0)
-	pushl	%edi
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%edi, 0)
-	pushl	%esi
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%esi, 0)
-	pushl	%ebx
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%ebx, 0)
-
-	movl	20(%esp), %ebx
-	movl	28(%esp), %ebp
-
-	LIBC_PROBE (cond_timedwait, 3, %ebx, 24(%esp), %ebp)
-
-	cmpl	$1000000000, 4(%ebp)
-	movl	$EINVAL, %eax
-	jae	18f
-
-	/* Stack frame:
-
-	   esp + 32
-		    +--------------------------+
-	   esp + 24 | timeout value            |
-		    +--------------------------+
-	   esp + 20 | futex pointer            |
-		    +--------------------------+
-	   esp + 16 | pi-requeued flag         |
-		    +--------------------------+
-	   esp + 12 | old broadcast_seq value  |
-		    +--------------------------+
-	   esp +  4 | old wake_seq value       |
-		    +--------------------------+
-	   esp +  0 | old cancellation mode    |
-		    +--------------------------+
-	*/
-
-#ifndef __ASSUME_FUTEX_CLOCK_REALTIME
-# ifdef PIC
-	LOAD_PIC_REG (cx)
-	cmpl	$0, __have_futex_clock_realtime@GOTOFF(%ecx)
-# else
-	cmpl	$0, __have_futex_clock_realtime
-# endif
-	je	.Lreltmo
-#endif
-
-	/* Get internal lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-#else
-	cmpxchgl %edx, cond_lock(%ebx)
-#endif
-	jnz	1f
-
-	/* Store the reference to the mutex.  If there is already a
-	   different value in there this is a bad user bug.  */
-2:	cmpl	$-1, dep_mutex(%ebx)
-	movl	24(%esp), %eax
-	je	17f
-	movl	%eax, dep_mutex(%ebx)
-
-	/* Unlock the mutex.  */
-17:	xorl	%edx, %edx
-	call	__pthread_mutex_unlock_usercnt
-
-	testl	%eax, %eax
-	jne	16f
-
-	addl	$1, total_seq(%ebx)
-	adcl	$0, total_seq+4(%ebx)
-	addl	$1, cond_futex(%ebx)
-	addl	$(1 << nwaiters_shift), cond_nwaiters(%ebx)
-
-#ifdef __ASSUME_FUTEX_CLOCK_REALTIME
-# define FRAME_SIZE 24
-#else
-# define FRAME_SIZE 32
-#endif
-	subl	$FRAME_SIZE, %esp
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-	cfi_remember_state
-
-	/* Get and store current wakeup_seq value.  */
-	movl	wakeup_seq(%ebx), %edi
-	movl	wakeup_seq+4(%ebx), %edx
-	movl	broadcast_seq(%ebx), %eax
-	movl	%edi, 4(%esp)
-	movl	%edx, 8(%esp)
-	movl	%eax, 12(%esp)
-
-	/* Reset the pi-requeued flag.  */
-	movl	$0, 16(%esp)
-
-	cmpl	$0, (%ebp)
-	movl	$-ETIMEDOUT, %esi
-	js	6f
-
-8:	movl	cond_futex(%ebx), %edi
-	movl	%edi, 20(%esp)
-
-	/* Unlock.  */
-	LOCK
-#if cond_lock == 0
-	subl	$1, (%ebx)
-#else
-	subl	$1, cond_lock(%ebx)
-#endif
-	jne	3f
-
-.LcleanupSTART:
-4:	call	__pthread_enable_asynccancel
-	movl	%eax, (%esp)
-
-	leal	(%ebp), %esi
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	sete	%cl
-	je	40f
-
-	movl	dep_mutex(%ebx), %edi
-	/* Requeue to a non-robust PI mutex if the PI bit is set and
-	   the robust bit is not set.  */
-	movl	MUTEX_KIND(%edi), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	jne	40f
-
-	movl	$(FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG), %ecx
-	/* The following only works like this because we only support
-	   two clocks, represented using a single bit.  */
-	testl	$1, cond_nwaiters(%ebx)
-	/* XXX Need to implement using sete instead of a jump.  */
-	jne	42f
-	orl	$FUTEX_CLOCK_REALTIME, %ecx
-
-42:	movl	20(%esp), %edx
-	addl	$cond_futex, %ebx
-.Ladd_cond_futex_pi:
-	movl	$SYS_futex, %eax
-	ENTER_KERNEL
-	subl	$cond_futex, %ebx
-.Lsub_cond_futex_pi:
-	movl	%eax, %esi
-	/* Set the pi-requeued flag only if the kernel has returned 0. The
-	   kernel does not hold the mutex on ETIMEDOUT or any other error.  */
-	cmpl	$0, %eax
-	sete	16(%esp)
-	je	41f
-
-	/* When a futex syscall with FUTEX_WAIT_REQUEUE_PI returns
-	   successfully, it has already locked the mutex for us and the
-	   pi_flag (16(%esp)) is set to denote that fact.  However, if another
-	   thread changed the futex value before we entered the wait, the
-	   syscall may return an EAGAIN and the mutex is not locked.  We go
-	   ahead with a success anyway since later we look at the pi_flag to
-	   decide if we got the mutex or not.  The sequence numbers then make
-	   sure that only one of the threads actually wake up.  We retry using
-	   normal FUTEX_WAIT only if the kernel returned ENOSYS, since normal
-	   and PI futexes don't mix.
-
-	   Note that we don't check for EAGAIN specifically; we assume that the
-	   only other error the futex function could return is EAGAIN (barring
-	   the ETIMEOUT of course, for the timeout case in futex) since
-	   anything else would mean an error in our function.  It is too
-	   expensive to do that check for every call (which is  quite common in
-	   case of a large number of threads), so it has been skipped.  */
-	cmpl	$-ENOSYS, %eax
-	jne	41f
-	xorl	%ecx, %ecx
-
-40:	subl	$1, %ecx
-	movl	$0, 16(%esp)
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAIT_BITSET, %ecx
-	/* The following only works like this because we only support
-	   two clocks, represented using a single bit.  */
-	testl	$1, cond_nwaiters(%ebx)
-	jne	30f
-	orl	$FUTEX_CLOCK_REALTIME, %ecx
-30:
-	movl	20(%esp), %edx
-	movl	$0xffffffff, %ebp
-	addl	$cond_futex, %ebx
-.Ladd_cond_futex:
-	movl	$SYS_futex, %eax
-	ENTER_KERNEL
-	subl	$cond_futex, %ebx
-.Lsub_cond_futex:
-	movl	28+FRAME_SIZE(%esp), %ebp
-	movl	%eax, %esi
-
-41:	movl	(%esp), %eax
-	call	__pthread_disable_asynccancel
-.LcleanupEND:
-
-	/* Lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-#else
-	cmpxchgl %edx, cond_lock(%ebx)
-#endif
-	jnz	5f
-
-6:	movl	broadcast_seq(%ebx), %eax
-	cmpl	12(%esp), %eax
-	jne	23f
-
-	movl	woken_seq(%ebx), %eax
-	movl	woken_seq+4(%ebx), %ecx
-
-	movl	wakeup_seq(%ebx), %edi
-	movl	wakeup_seq+4(%ebx), %edx
-
-	cmpl	8(%esp), %edx
-	jne	7f
-	cmpl	4(%esp), %edi
-	je	15f
-
-7:	cmpl	%ecx, %edx
-	jne	9f
-	cmp	%eax, %edi
-	jne	9f
-
-15:	cmpl	$-ETIMEDOUT, %esi
-	je	28f
-
-	/* We need to go back to futex_wait.  If we're using requeue_pi, then
-	   release the mutex we had acquired and go back.  */
-	movl	16(%esp), %edx
-	test	%edx, %edx
-	jz	8b
-
-	/* Adjust the mutex values first and then unlock it.  The unlock
-	   should always succeed or else the kernel did not lock the mutex
-	   correctly.  */
-	movl	dep_mutex(%ebx), %eax
-	call	__pthread_mutex_cond_lock_adjust
-	xorl	%edx, %edx
-	call	__pthread_mutex_unlock_usercnt
-	jmp	8b
-
-28:	addl	$1, wakeup_seq(%ebx)
-	adcl	$0, wakeup_seq+4(%ebx)
-	addl	$1, cond_futex(%ebx)
-	movl	$ETIMEDOUT, %esi
-	jmp	14f
-
-23:	xorl	%esi, %esi
-	jmp	24f
-
-9:	xorl	%esi, %esi
-14:	addl	$1, woken_seq(%ebx)
-	adcl	$0, woken_seq+4(%ebx)
-
-24:	subl	$(1 << nwaiters_shift), cond_nwaiters(%ebx)
-
-	/* Wake up a thread which wants to destroy the condvar object.  */
-	movl	total_seq(%ebx), %eax
-	andl	total_seq+4(%ebx), %eax
-	cmpl	$0xffffffff, %eax
-	jne	25f
-	movl	cond_nwaiters(%ebx), %eax
-	andl	$~((1 << nwaiters_shift) - 1), %eax
-	jne	25f
-
-	addl	$cond_nwaiters, %ebx
-	movl	$SYS_futex, %eax
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_nwaiters(%ebx)
-	sete	%cl
-	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAKE, %ecx
-	movl	$1, %edx
-	ENTER_KERNEL
-	subl	$cond_nwaiters, %ebx
-
-25:	LOCK
-#if cond_lock == 0
-	subl	$1, (%ebx)
-#else
-	subl	$1, cond_lock(%ebx)
-#endif
-	jne	10f
-
-11:	movl	24+FRAME_SIZE(%esp), %eax
-	/* With requeue_pi, the mutex lock is held in the kernel.  */
-	movl	16(%esp), %ecx
-	testl	%ecx, %ecx
-	jnz	27f
-
-	call	__pthread_mutex_cond_lock
-26:	addl	$FRAME_SIZE, %esp
-	cfi_adjust_cfa_offset(-FRAME_SIZE)
-
-	/* We return the result of the mutex_lock operation if it failed.  */
-	testl	%eax, %eax
-#ifdef HAVE_CMOV
-	cmovel	%esi, %eax
-#else
-	jne	22f
-	movl	%esi, %eax
-22:
-#endif
-
-18:	popl	%ebx
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%ebx)
-	popl	%esi
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%esi)
-	popl	%edi
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%edi)
-	popl	%ebp
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%ebp)
-
-	ret
-
-	cfi_restore_state
-
-27:	call	__pthread_mutex_cond_lock_adjust
-	xorl	%eax, %eax
-	jmp	26b
-
-	cfi_adjust_cfa_offset(-FRAME_SIZE);
-	/* Initial locking failed.  */
-1:
-#if cond_lock == 0
-	movl	%ebx, %edx
-#else
-	leal	cond_lock(%ebx), %edx
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_lock_wait
-	jmp	2b
-
-	/* The initial unlocking of the mutex failed.  */
-16:
-	LOCK
-#if cond_lock == 0
-	subl	$1, (%ebx)
-#else
-	subl	$1, cond_lock(%ebx)
-#endif
-	jne	18b
-
-	movl	%eax, %esi
-#if cond_lock == 0
-	movl	%ebx, %eax
-#else
-	leal	cond_lock(%ebx), %eax
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-
-	movl	%esi, %eax
-	jmp	18b
-
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-
-	/* Unlock in loop requires wakeup.  */
-3:
-#if cond_lock == 0
-	movl	%ebx, %eax
-#else
-	leal	cond_lock(%ebx), %eax
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-	jmp	4b
-
-	/* Locking in loop failed.  */
-5:
-#if cond_lock == 0
-	movl	%ebx, %edx
-#else
-	leal	cond_lock(%ebx), %edx
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_lock_wait
-	jmp	6b
-
-	/* Unlock after loop requires wakeup.  */
-10:
-#if cond_lock == 0
-	movl	%ebx, %eax
-#else
-	leal	cond_lock(%ebx), %eax
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-	jmp	11b
-
-#ifndef __ASSUME_FUTEX_CLOCK_REALTIME
-	cfi_adjust_cfa_offset(-FRAME_SIZE)
-.Lreltmo:
-	/* Get internal lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-# if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-# else
-	cmpxchgl %edx, cond_lock(%ebx)
-# endif
-	jnz	101f
-
-	/* Store the reference to the mutex.  If there is already a
-	   different value in there this is a bad user bug.  */
-102:	cmpl	$-1, dep_mutex(%ebx)
-	movl	24(%esp), %eax
-	je	117f
-	movl	%eax, dep_mutex(%ebx)
-
-	/* Unlock the mutex.  */
-117:	xorl	%edx, %edx
-	call	__pthread_mutex_unlock_usercnt
-
-	testl	%eax, %eax
-	jne	16b
-
-	addl	$1, total_seq(%ebx)
-	adcl	$0, total_seq+4(%ebx)
-	addl	$1, cond_futex(%ebx)
-	addl	$(1 << nwaiters_shift), cond_nwaiters(%ebx)
-
-	subl	$FRAME_SIZE, %esp
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-
-	/* Get and store current wakeup_seq value.  */
-	movl	wakeup_seq(%ebx), %edi
-	movl	wakeup_seq+4(%ebx), %edx
-	movl	broadcast_seq(%ebx), %eax
-	movl	%edi, 4(%esp)
-	movl	%edx, 8(%esp)
-	movl	%eax, 12(%esp)
-
-	/* Reset the pi-requeued flag.  */
-	movl	$0, 16(%esp)
-
-	/* Get the current time.  */
-108:	movl	%ebx, %edx
-# ifdef __NR_clock_gettime
-	/* Get the clock number.  */
-	movl	cond_nwaiters(%ebx), %ebx
-	andl	$((1 << nwaiters_shift) - 1), %ebx
-	/* Only clocks 0 and 1 are allowed so far.  Both are handled in the
-	   kernel.  */
-	leal	24(%esp), %ecx
-	movl	$__NR_clock_gettime, %eax
-	ENTER_KERNEL
-	movl	%edx, %ebx
-
-	/* Compute relative timeout.  */
-	movl	(%ebp), %ecx
-	movl	4(%ebp), %edx
-	subl	24(%esp), %ecx
-	subl	28(%esp), %edx
-# else
-	/* Get the current time.  */
-	leal	24(%esp), %ebx
-	xorl	%ecx, %ecx
-	movl	$__NR_gettimeofday, %eax
-	ENTER_KERNEL
-	movl	%edx, %ebx
-
-	/* Compute relative timeout.  */
-	movl	28(%esp), %eax
-	movl	$1000, %edx
-	mul	%edx		/* Milli seconds to nano seconds.  */
-	movl	(%ebp), %ecx
-	movl	4(%ebp), %edx
-	subl	24(%esp), %ecx
-	subl	%eax, %edx
-# endif
-	jns	112f
-	addl	$1000000000, %edx
-	subl	$1, %ecx
-112:	testl	%ecx, %ecx
-	movl	$-ETIMEDOUT, %esi
-	js	106f
-
-	/* Store relative timeout.  */
-121:	movl	%ecx, 24(%esp)
-	movl	%edx, 28(%esp)
-
-	movl	cond_futex(%ebx), %edi
-	movl	%edi, 20(%esp)
-
-	/* Unlock.  */
-	LOCK
-# if cond_lock == 0
-	subl	$1, (%ebx)
-# else
-	subl	$1, cond_lock(%ebx)
-# endif
-	jne	103f
-
-.LcleanupSTART2:
-104:	call	__pthread_enable_asynccancel
-	movl	%eax, (%esp)
-
-	leal	24(%esp), %esi
-# if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-# endif
-	cmpl	$-1, dep_mutex(%ebx)
-	sete	%cl
-	subl	$1, %ecx
-# ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-# else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-# endif
-# if FUTEX_WAIT != 0
-	addl	$FUTEX_WAIT, %ecx
-# endif
-	movl	20(%esp), %edx
-	addl	$cond_futex, %ebx
-.Ladd_cond_futex2:
-	movl	$SYS_futex, %eax
-	ENTER_KERNEL
-	subl	$cond_futex, %ebx
-.Lsub_cond_futex2:
-	movl	%eax, %esi
-
-141:	movl	(%esp), %eax
-	call	__pthread_disable_asynccancel
-.LcleanupEND2:
-
-
-	/* Lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-# if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-# else
-	cmpxchgl %edx, cond_lock(%ebx)
-# endif
-	jnz	105f
-
-106:	movl	broadcast_seq(%ebx), %eax
-	cmpl	12(%esp), %eax
-	jne	23b
-
-	movl	woken_seq(%ebx), %eax
-	movl	woken_seq+4(%ebx), %ecx
-
-	movl	wakeup_seq(%ebx), %edi
-	movl	wakeup_seq+4(%ebx), %edx
-
-	cmpl	8(%esp), %edx
-	jne	107f
-	cmpl	4(%esp), %edi
-	je	115f
-
-107:	cmpl	%ecx, %edx
-	jne	9b
-	cmp	%eax, %edi
-	jne	9b
-
-115:	cmpl	$-ETIMEDOUT, %esi
-	je	28b
-
-	jmp	8b
-
-	cfi_adjust_cfa_offset(-FRAME_SIZE)
-	/* Initial locking failed.  */
-101:
-# if cond_lock == 0
-	movl	%ebx, %edx
-# else
-	leal	cond_lock(%ebx), %edx
-# endif
-# if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-# endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-# if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-# endif
-	call	__lll_lock_wait
-	jmp	102b
-
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-
-	/* Unlock in loop requires wakeup.  */
-103:
-# if cond_lock == 0
-	movl	%ebx, %eax
-# else
-	leal	cond_lock(%ebx), %eax
-# endif
-# if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-# endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-# if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-# endif
-	call	__lll_unlock_wake
-	jmp	104b
-
-	/* Locking in loop failed.  */
-105:
-# if cond_lock == 0
-	movl	%ebx, %edx
-# else
-	leal	cond_lock(%ebx), %edx
-# endif
-# if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-# endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-# if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-# endif
-	call	__lll_lock_wait
-	jmp	106b
-#endif
-
-	.size	__pthread_cond_timedwait, .-__pthread_cond_timedwait
-versioned_symbol (libpthread, __pthread_cond_timedwait, pthread_cond_timedwait,
-		  GLIBC_2_3_2)
-
-
-	.type	__condvar_tw_cleanup2, @function
-__condvar_tw_cleanup2:
-	subl	$cond_futex, %ebx
-	.size	__condvar_tw_cleanup2, .-__condvar_tw_cleanup2
-	.type	__condvar_tw_cleanup, @function
-__condvar_tw_cleanup:
-	movl	%eax, %esi
-
-	/* Get internal lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-#else
-	cmpxchgl %edx, cond_lock(%ebx)
-#endif
-	jz	1f
-
-#if cond_lock == 0
-	movl	%ebx, %edx
-#else
-	leal	cond_lock(%ebx), %edx
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_lock_wait
-
-1:	movl	broadcast_seq(%ebx), %eax
-	cmpl	12(%esp), %eax
-	jne	3f
-
-	/* We increment the wakeup_seq counter only if it is lower than
-	   total_seq.  If this is not the case the thread was woken and
-	   then canceled.  In this case we ignore the signal.  */
-	movl	total_seq(%ebx), %eax
-	movl	total_seq+4(%ebx), %edi
-	cmpl	wakeup_seq+4(%ebx), %edi
-	jb	6f
-	ja	7f
-	cmpl	wakeup_seq(%ebx), %eax
-	jbe	7f
-
-6:	addl	$1, wakeup_seq(%ebx)
-	adcl	$0, wakeup_seq+4(%ebx)
-	addl	$1, cond_futex(%ebx)
-
-7:	addl	$1, woken_seq(%ebx)
-	adcl	$0, woken_seq+4(%ebx)
-
-3:	subl	$(1 << nwaiters_shift), cond_nwaiters(%ebx)
-
-	/* Wake up a thread which wants to destroy the condvar object.  */
-	xorl	%edi, %edi
-	movl	total_seq(%ebx), %eax
-	andl	total_seq+4(%ebx), %eax
-	cmpl	$0xffffffff, %eax
-	jne	4f
-	movl	cond_nwaiters(%ebx), %eax
-	andl	$~((1 << nwaiters_shift) - 1), %eax
-	jne	4f
-
-	addl	$cond_nwaiters, %ebx
-	movl	$SYS_futex, %eax
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_nwaiters(%ebx)
-	sete	%cl
-	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAKE, %ecx
-	movl	$1, %edx
-	ENTER_KERNEL
-	subl	$cond_nwaiters, %ebx
-	movl	$1, %edi
-
-4:	LOCK
-#if cond_lock == 0
-	subl	$1, (%ebx)
-#else
-	subl	$1, cond_lock(%ebx)
-#endif
-	je	2f
-
-#if cond_lock == 0
-	movl	%ebx, %eax
-#else
-	leal	cond_lock(%ebx), %eax
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-
-	/* Wake up all waiters to make sure no signal gets lost.  */
-2:	testl	%edi, %edi
-	jnz	5f
-	addl	$cond_futex, %ebx
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_futex(%ebx)
-	sete	%cl
-	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAKE, %ecx
-	movl	$SYS_futex, %eax
-	movl	$0x7fffffff, %edx
-	ENTER_KERNEL
-
-	/* Lock the mutex only if we don't own it already.  This only happens
-	   in case of PI mutexes, if we got cancelled after a successful
-	   return of the futex syscall and before disabling async
-	   cancellation.  */
-5:	movl	24+FRAME_SIZE(%esp), %eax
-	movl	MUTEX_KIND(%eax), %ebx
-	andl	$(ROBUST_BIT|PI_BIT), %ebx
-	cmpl	$PI_BIT, %ebx
-	jne	8f
-
-	movl	(%eax), %ebx
-	andl	$TID_MASK, %ebx
-	cmpl	%ebx, %gs:TID
-	jne	8f
-	/* We managed to get the lock.  Fix it up before returning.  */
-	call	__pthread_mutex_cond_lock_adjust
-	jmp	9f
-
-8:	call	__pthread_mutex_cond_lock
-
-9:	movl	%esi, (%esp)
-.LcallUR:
-	call	_Unwind_Resume
-	hlt
-.LENDCODE:
-	cfi_endproc
-	.size	__condvar_tw_cleanup, .-__condvar_tw_cleanup
-
-
-	.section .gcc_except_table,"a",@progbits
-.LexceptSTART:
-	.byte	DW_EH_PE_omit			# @LPStart format (omit)
-	.byte	DW_EH_PE_omit			# @TType format (omit)
-	.byte	DW_EH_PE_sdata4			# call-site format
-						# DW_EH_PE_sdata4
-	.uleb128 .Lcstend-.Lcstbegin
-.Lcstbegin:
-	.long	.LcleanupSTART-.LSTARTCODE
-	.long	.Ladd_cond_futex_pi-.LcleanupSTART
-	.long	__condvar_tw_cleanup-.LSTARTCODE
-	.uleb128  0
-	.long	.Ladd_cond_futex_pi-.LSTARTCODE
-	.long	.Lsub_cond_futex_pi-.Ladd_cond_futex_pi
-	.long	__condvar_tw_cleanup2-.LSTARTCODE
-	.uleb128  0
-	.long	.Lsub_cond_futex_pi-.LSTARTCODE
-	.long	.Ladd_cond_futex-.Lsub_cond_futex_pi
-	.long	__condvar_tw_cleanup-.LSTARTCODE
-	.uleb128  0
-	.long	.Ladd_cond_futex-.LSTARTCODE
-	.long	.Lsub_cond_futex-.Ladd_cond_futex
-	.long	__condvar_tw_cleanup2-.LSTARTCODE
-	.uleb128  0
-	.long	.Lsub_cond_futex-.LSTARTCODE
-	.long	.LcleanupEND-.Lsub_cond_futex
-	.long	__condvar_tw_cleanup-.LSTARTCODE
-	.uleb128  0
-#ifndef __ASSUME_FUTEX_CLOCK_REALTIME
-	.long	.LcleanupSTART2-.LSTARTCODE
-	.long	.Ladd_cond_futex2-.LcleanupSTART2
-	.long	__condvar_tw_cleanup-.LSTARTCODE
-	.uleb128  0
-	.long	.Ladd_cond_futex2-.LSTARTCODE
-	.long	.Lsub_cond_futex2-.Ladd_cond_futex2
-	.long	__condvar_tw_cleanup2-.LSTARTCODE
-	.uleb128  0
-	.long	.Lsub_cond_futex2-.LSTARTCODE
-	.long	.LcleanupEND2-.Lsub_cond_futex2
-	.long	__condvar_tw_cleanup-.LSTARTCODE
-	.uleb128  0
-#endif
-	.long	.LcallUR-.LSTARTCODE
-	.long	.LENDCODE-.LcallUR
-	.long	0
-	.uleb128  0
-.Lcstend:
-
-
-#ifdef SHARED
-	.hidden DW.ref.__gcc_personality_v0
-	.weak	DW.ref.__gcc_personality_v0
-	.section .gnu.linkonce.d.DW.ref.__gcc_personality_v0,"aw",@progbits
-	.align	4
-	.type	DW.ref.__gcc_personality_v0, @object
-	.size	DW.ref.__gcc_personality_v0, 4
-DW.ref.__gcc_personality_v0:
-	.long   __gcc_personality_v0
-#endif
diff --git a/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S b/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S
deleted file mode 100644
index ec3538f..0000000
--- a/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S
+++ /dev/null
@@ -1,641 +0,0 @@ 
-/* Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <sysdep.h>
-#include <shlib-compat.h>
-#include <lowlevellock.h>
-#include <lowlevelcond.h>
-#include <tcb-offsets.h>
-#include <pthread-errnos.h>
-#include <pthread-pi-defines.h>
-#include <kernel-features.h>
-#include <stap-probe.h>
-
-
-	.text
-
-/* int pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)  */
-	.globl	__pthread_cond_wait
-	.type	__pthread_cond_wait, @function
-	.align	16
-__pthread_cond_wait:
-.LSTARTCODE:
-	cfi_startproc
-#ifdef SHARED
-	cfi_personality(DW_EH_PE_pcrel | DW_EH_PE_sdata4 | DW_EH_PE_indirect,
-			DW.ref.__gcc_personality_v0)
-	cfi_lsda(DW_EH_PE_pcrel | DW_EH_PE_sdata4, .LexceptSTART)
-#else
-	cfi_personality(DW_EH_PE_udata4, __gcc_personality_v0)
-	cfi_lsda(DW_EH_PE_udata4, .LexceptSTART)
-#endif
-
-	pushl	%ebp
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%ebp, 0)
-	pushl	%edi
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%edi, 0)
-	pushl	%esi
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%esi, 0)
-	pushl	%ebx
-	cfi_adjust_cfa_offset(4)
-	cfi_rel_offset(%ebx, 0)
-
-	xorl	%esi, %esi
-	movl	20(%esp), %ebx
-
-	LIBC_PROBE (cond_wait, 2, 24(%esp), %ebx)
-
-	/* Get internal lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-#else
-	cmpxchgl %edx, cond_lock(%ebx)
-#endif
-	jnz	1f
-
-	/* Store the reference to the mutex.  If there is already a
-	   different value in there this is a bad user bug.  */
-2:	cmpl	$-1, dep_mutex(%ebx)
-	movl	24(%esp), %eax
-	je	15f
-	movl	%eax, dep_mutex(%ebx)
-
-	/* Unlock the mutex.  */
-15:	xorl	%edx, %edx
-	call	__pthread_mutex_unlock_usercnt
-
-	testl	%eax, %eax
-	jne	12f
-
-	addl	$1, total_seq(%ebx)
-	adcl	$0, total_seq+4(%ebx)
-	addl	$1, cond_futex(%ebx)
-	addl	$(1 << nwaiters_shift), cond_nwaiters(%ebx)
-
-#define FRAME_SIZE 20
-	subl	$FRAME_SIZE, %esp
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-	cfi_remember_state
-
-	/* Get and store current wakeup_seq value.  */
-	movl	wakeup_seq(%ebx), %edi
-	movl	wakeup_seq+4(%ebx), %edx
-	movl	broadcast_seq(%ebx), %eax
-	movl	%edi, 4(%esp)
-	movl	%edx, 8(%esp)
-	movl	%eax, 12(%esp)
-
-	/* Reset the pi-requeued flag.  */
-8:	movl	$0, 16(%esp)
-	movl	cond_futex(%ebx), %ebp
-
-	/* Unlock.  */
-	LOCK
-#if cond_lock == 0
-	subl	$1, (%ebx)
-#else
-	subl	$1, cond_lock(%ebx)
-#endif
-	jne	3f
-
-.LcleanupSTART:
-4:	call	__pthread_enable_asynccancel
-	movl	%eax, (%esp)
-
-	xorl	%ecx, %ecx
-	cmpl	$-1, dep_mutex(%ebx)
-	sete	%cl
-	je	18f
-
-	movl	dep_mutex(%ebx), %edi
-	/* Requeue to a non-robust PI mutex if the PI bit is set and
-	   the robust bit is not set.  */
-	movl	MUTEX_KIND(%edi), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	jne	18f
-
-	movl	$(FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG), %ecx
-	movl	%ebp, %edx
-	xorl	%esi, %esi
-	addl	$cond_futex, %ebx
-.Ladd_cond_futex_pi:
-	movl	$SYS_futex, %eax
-	ENTER_KERNEL
-	subl	$cond_futex, %ebx
-.Lsub_cond_futex_pi:
-	/* Set the pi-requeued flag only if the kernel has returned 0. The
-	   kernel does not hold the mutex on error.  */
-	cmpl	$0, %eax
-	sete	16(%esp)
-	je	19f
-
-	/* When a futex syscall with FUTEX_WAIT_REQUEUE_PI returns
-	   successfully, it has already locked the mutex for us and the
-	   pi_flag (16(%esp)) is set to denote that fact.  However, if another
-	   thread changed the futex value before we entered the wait, the
-	   syscall may return an EAGAIN and the mutex is not locked.  We go
-	   ahead with a success anyway since later we look at the pi_flag to
-	   decide if we got the mutex or not.  The sequence numbers then make
-	   sure that only one of the threads actually wake up.  We retry using
-	   normal FUTEX_WAIT only if the kernel returned ENOSYS, since normal
-	   and PI futexes don't mix.
-
-	   Note that we don't check for EAGAIN specifically; we assume that the
-	   only other error the futex function could return is EAGAIN since
-	   anything else would mean an error in our function.  It is too
-	   expensive to do that check for every call (which is 	quite common in
-	   case of a large number of threads), so it has been skipped.  */
-	cmpl	$-ENOSYS, %eax
-	jne	19f
-	xorl	%ecx, %ecx
-
-18:	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-#if FUTEX_WAIT != 0
-	addl	$FUTEX_WAIT, %ecx
-#endif
-	movl	%ebp, %edx
-	addl	$cond_futex, %ebx
-.Ladd_cond_futex:
-	movl	$SYS_futex, %eax
-	ENTER_KERNEL
-	subl	$cond_futex, %ebx
-.Lsub_cond_futex:
-
-19:	movl	(%esp), %eax
-	call	__pthread_disable_asynccancel
-.LcleanupEND:
-
-	/* Lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-#else
-	cmpxchgl %edx, cond_lock(%ebx)
-#endif
-	jnz	5f
-
-6:	movl	broadcast_seq(%ebx), %eax
-	cmpl	12(%esp), %eax
-	jne	16f
-
-	movl	woken_seq(%ebx), %eax
-	movl	woken_seq+4(%ebx), %ecx
-
-	movl	wakeup_seq(%ebx), %edi
-	movl	wakeup_seq+4(%ebx), %edx
-
-	cmpl	8(%esp), %edx
-	jne	7f
-	cmpl	4(%esp), %edi
-	je	22f
-
-7:	cmpl	%ecx, %edx
-	jne	9f
-	cmp	%eax, %edi
-	je	22f
-
-9:	addl	$1, woken_seq(%ebx)
-	adcl	$0, woken_seq+4(%ebx)
-
-	/* Unlock */
-16:	subl	$(1 << nwaiters_shift), cond_nwaiters(%ebx)
-
-	/* Wake up a thread which wants to destroy the condvar object.  */
-	movl	total_seq(%ebx), %eax
-	andl	total_seq+4(%ebx), %eax
-	cmpl	$0xffffffff, %eax
-	jne	17f
-	movl	cond_nwaiters(%ebx), %eax
-	andl	$~((1 << nwaiters_shift) - 1), %eax
-	jne	17f
-
-	addl	$cond_nwaiters, %ebx
-	movl	$SYS_futex, %eax
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_nwaiters(%ebx)
-	sete	%cl
-	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAKE, %ecx
-	movl	$1, %edx
-	ENTER_KERNEL
-	subl	$cond_nwaiters, %ebx
-
-17:	LOCK
-#if cond_lock == 0
-	subl	$1, (%ebx)
-#else
-	subl	$1, cond_lock(%ebx)
-#endif
-	jne	10f
-
-	/* With requeue_pi, the mutex lock is held in the kernel.  */
-11:	movl	24+FRAME_SIZE(%esp), %eax
-	movl	16(%esp), %ecx
-	testl	%ecx, %ecx
-	jnz	21f
-
-	call	__pthread_mutex_cond_lock
-20:	addl	$FRAME_SIZE, %esp
-	cfi_adjust_cfa_offset(-FRAME_SIZE);
-
-14:	popl	%ebx
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%ebx)
-	popl	%esi
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%esi)
-	popl	%edi
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%edi)
-	popl	%ebp
-	cfi_adjust_cfa_offset(-4)
-	cfi_restore(%ebp)
-
-	/* We return the result of the mutex_lock operation.  */
-	ret
-
-	cfi_restore_state
-
-21:	call	__pthread_mutex_cond_lock_adjust
-	xorl	%eax, %eax
-	jmp	20b
-
-	cfi_adjust_cfa_offset(-FRAME_SIZE);
-
-	/* We need to go back to futex_wait.  If we're using requeue_pi, then
-	   release the mutex we had acquired and go back.  */
-22:	movl	16(%esp), %edx
-	test	%edx, %edx
-	jz	8b
-
-	/* Adjust the mutex values first and then unlock it.  The unlock
-	   should always succeed or else the kernel did not lock the mutex
-	   correctly.  */
-	movl	dep_mutex(%ebx), %eax
-	call    __pthread_mutex_cond_lock_adjust
-	xorl	%edx, %edx
-	call	__pthread_mutex_unlock_usercnt
-	jmp	8b
-
-	/* Initial locking failed.  */
-1:
-#if cond_lock == 0
-	movl	%ebx, %edx
-#else
-	leal	cond_lock(%ebx), %edx
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_lock_wait
-	jmp	2b
-
-	/* The initial unlocking of the mutex failed.  */
-12:
-	LOCK
-#if cond_lock == 0
-	subl	$1, (%ebx)
-#else
-	subl	$1, cond_lock(%ebx)
-#endif
-	jne	14b
-
-	movl	%eax, %esi
-#if cond_lock == 0
-	movl	%ebx, %eax
-#else
-	leal	cond_lock(%ebx), %eax
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-
-	movl	%esi, %eax
-	jmp	14b
-
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-
-	/* Unlock in loop requires wakeup.  */
-3:
-#if cond_lock == 0
-	movl	%ebx, %eax
-#else
-	leal	cond_lock(%ebx), %eax
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-	jmp	4b
-
-	/* Locking in loop failed.  */
-5:
-#if cond_lock == 0
-	movl	%ebx, %edx
-#else
-	leal	cond_lock(%ebx), %edx
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_lock_wait
-	jmp	6b
-
-	/* Unlock after loop requires wakeup.  */
-10:
-#if cond_lock == 0
-	movl	%ebx, %eax
-#else
-	leal	cond_lock(%ebx), %eax
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-	jmp	11b
-
-	.size	__pthread_cond_wait, .-__pthread_cond_wait
-versioned_symbol (libpthread, __pthread_cond_wait, pthread_cond_wait,
-		  GLIBC_2_3_2)
-
-
-	.type	__condvar_w_cleanup2, @function
-__condvar_w_cleanup2:
-	subl	$cond_futex, %ebx
-	.size	__condvar_w_cleanup2, .-__condvar_w_cleanup2
-.LSbl4:
-	.type	__condvar_w_cleanup, @function
-__condvar_w_cleanup:
-	movl	%eax, %esi
-
-	/* Get internal lock.  */
-	movl	$1, %edx
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %edx, (%ebx)
-#else
-	cmpxchgl %edx, cond_lock(%ebx)
-#endif
-	jz	1f
-
-#if cond_lock == 0
-	movl	%ebx, %edx
-#else
-	leal	cond_lock(%ebx), %edx
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_lock_wait
-
-1:	movl	broadcast_seq(%ebx), %eax
-	cmpl	12(%esp), %eax
-	jne	3f
-
-	/* We increment the wakeup_seq counter only if it is lower than
-	   total_seq.  If this is not the case the thread was woken and
-	   then canceled.  In this case we ignore the signal.  */
-	movl	total_seq(%ebx), %eax
-	movl	total_seq+4(%ebx), %edi
-	cmpl	wakeup_seq+4(%ebx), %edi
-	jb	6f
-	ja	7f
-	cmpl	wakeup_seq(%ebx), %eax
-	jbe	7f
-
-6:	addl	$1, wakeup_seq(%ebx)
-	adcl	$0, wakeup_seq+4(%ebx)
-	addl	$1, cond_futex(%ebx)
-
-7:	addl	$1, woken_seq(%ebx)
-	adcl	$0, woken_seq+4(%ebx)
-
-3:	subl	$(1 << nwaiters_shift), cond_nwaiters(%ebx)
-
-	/* Wake up a thread which wants to destroy the condvar object.  */
-	xorl	%edi, %edi
-	movl	total_seq(%ebx), %eax
-	andl	total_seq+4(%ebx), %eax
-	cmpl	$0xffffffff, %eax
-	jne	4f
-	movl	cond_nwaiters(%ebx), %eax
-	andl	$~((1 << nwaiters_shift) - 1), %eax
-	jne	4f
-
-	addl	$cond_nwaiters, %ebx
-	movl	$SYS_futex, %eax
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_nwaiters(%ebx)
-	sete	%cl
-	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAKE, %ecx
-	movl	$1, %edx
-	ENTER_KERNEL
-	subl	$cond_nwaiters, %ebx
-	movl	$1, %edi
-
-4:	LOCK
-#if cond_lock == 0
-	subl	$1, (%ebx)
-#else
-	subl	$1, cond_lock(%ebx)
-#endif
-	je	2f
-
-#if cond_lock == 0
-	movl	%ebx, %eax
-#else
-	leal	cond_lock(%ebx), %eax
-#endif
-#if (LLL_SHARED-LLL_PRIVATE) > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex(%ebx)
-	setne	%cl
-	subl	$1, %ecx
-	andl	$(LLL_SHARED-LLL_PRIVATE), %ecx
-#if LLL_PRIVATE != 0
-	addl	$LLL_PRIVATE, %ecx
-#endif
-	call	__lll_unlock_wake
-
-	/* Wake up all waiters to make sure no signal gets lost.  */
-2:	testl	%edi, %edi
-	jnz	5f
-	addl	$cond_futex, %ebx
-#if FUTEX_PRIVATE_FLAG > 255
-	xorl	%ecx, %ecx
-#endif
-	cmpl	$-1, dep_mutex-cond_futex(%ebx)
-	sete	%cl
-	subl	$1, %ecx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %ecx
-#else
-	andl	%gs:PRIVATE_FUTEX, %ecx
-#endif
-	addl	$FUTEX_WAKE, %ecx
-	movl	$SYS_futex, %eax
-	movl	$0x7fffffff, %edx
-	ENTER_KERNEL
-
-	/* Lock the mutex only if we don't own it already.  This only happens
-	   in case of PI mutexes, if we got cancelled after a successful
-	   return of the futex syscall and before disabling async
-	   cancellation.  */
-5:	movl	24+FRAME_SIZE(%esp), %eax
-	movl	MUTEX_KIND(%eax), %ebx
-	andl	$(ROBUST_BIT|PI_BIT), %ebx
-	cmpl	$PI_BIT, %ebx
-	jne	8f
-
-	movl	(%eax), %ebx
-	andl	$TID_MASK, %ebx
-	cmpl	%ebx, %gs:TID
-	jne	8f
-	/* We managed to get the lock.  Fix it up before returning.  */
-	call	__pthread_mutex_cond_lock_adjust
-	jmp	9f
-
-8:	call	__pthread_mutex_cond_lock
-
-9:	movl	%esi, (%esp)
-.LcallUR:
-	call	_Unwind_Resume
-	hlt
-.LENDCODE:
-	cfi_endproc
-	.size	__condvar_w_cleanup, .-__condvar_w_cleanup
-
-
-	.section .gcc_except_table,"a",@progbits
-.LexceptSTART:
-	.byte	DW_EH_PE_omit			# @LPStart format (omit)
-	.byte	DW_EH_PE_omit			# @TType format (omit)
-	.byte	DW_EH_PE_sdata4			# call-site format
-						# DW_EH_PE_sdata4
-	.uleb128 .Lcstend-.Lcstbegin
-.Lcstbegin:
-	.long	.LcleanupSTART-.LSTARTCODE
-	.long	.Ladd_cond_futex_pi-.LcleanupSTART
-	.long	__condvar_w_cleanup-.LSTARTCODE
-	.uleb128  0
-	.long	.Ladd_cond_futex_pi-.LSTARTCODE
-	.long	.Lsub_cond_futex_pi-.Ladd_cond_futex_pi
-	.long	__condvar_w_cleanup2-.LSTARTCODE
-	.uleb128  0
-	.long	.Lsub_cond_futex_pi-.LSTARTCODE
-	.long	.Ladd_cond_futex-.Lsub_cond_futex_pi
-	.long	__condvar_w_cleanup-.LSTARTCODE
-	.uleb128  0
-	.long	.Ladd_cond_futex-.LSTARTCODE
-	.long	.Lsub_cond_futex-.Ladd_cond_futex
-	.long	__condvar_w_cleanup2-.LSTARTCODE
-	.uleb128  0
-	.long	.Lsub_cond_futex-.LSTARTCODE
-	.long	.LcleanupEND-.Lsub_cond_futex
-	.long	__condvar_w_cleanup-.LSTARTCODE
-	.uleb128  0
-	.long	.LcallUR-.LSTARTCODE
-	.long	.LENDCODE-.LcallUR
-	.long	0
-	.uleb128  0
-.Lcstend:
-
-#ifdef SHARED
-	.hidden DW.ref.__gcc_personality_v0
-	.weak   DW.ref.__gcc_personality_v0
-	.section .gnu.linkonce.d.DW.ref.__gcc_personality_v0,"aw",@progbits
-	.align 4
-	.type   DW.ref.__gcc_personality_v0, @object
-	.size   DW.ref.__gcc_personality_v0, 4
-DW.ref.__gcc_personality_v0:
-	.long   __gcc_personality_v0
-#endif
diff --git a/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S b/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S
deleted file mode 100644
index 9a4006a..0000000
--- a/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_broadcast.S
+++ /dev/null
@@ -1,19 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "../i486/pthread_cond_broadcast.S"
diff --git a/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S b/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S
deleted file mode 100644
index 59f93b6..0000000
--- a/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_signal.S
+++ /dev/null
@@ -1,19 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "../i486/pthread_cond_signal.S"
diff --git a/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S b/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S
deleted file mode 100644
index d96af08..0000000
--- a/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_timedwait.S
+++ /dev/null
@@ -1,19 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "../i486/pthread_cond_timedwait.S"
diff --git a/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S b/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S
deleted file mode 100644
index 9696972..0000000
--- a/sysdeps/unix/sysv/linux/i386/i586/pthread_cond_wait.S
+++ /dev/null
@@ -1,19 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "../i486/pthread_cond_wait.S"
diff --git a/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S b/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S
deleted file mode 100644
index 9a4006a..0000000
--- a/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_broadcast.S
+++ /dev/null
@@ -1,19 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "../i486/pthread_cond_broadcast.S"
diff --git a/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S b/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S
deleted file mode 100644
index 59f93b6..0000000
--- a/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_signal.S
+++ /dev/null
@@ -1,19 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "../i486/pthread_cond_signal.S"
diff --git a/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S b/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S
deleted file mode 100644
index 0e8d7ff..0000000
--- a/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_timedwait.S
+++ /dev/null
@@ -1,20 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#define HAVE_CMOV 1
-#include "../i486/pthread_cond_timedwait.S"
diff --git a/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S b/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S
deleted file mode 100644
index 9696972..0000000
--- a/sysdeps/unix/sysv/linux/i386/i686/pthread_cond_wait.S
+++ /dev/null
@@ -1,19 +0,0 @@ 
-/* Copyright (C) 2003-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "../i486/pthread_cond_wait.S"
diff --git a/sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h b/sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h
index 7cbdb2c..70b65d3 100644
--- a/sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h
+++ b/sysdeps/unix/sysv/linux/powerpc/bits/pthreadtypes.h
@@ -128,14 +128,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
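
(Aside for reviewers who only skim the per-arch hunks: below is a plain-C restatement of the new __data layout introduced just above.  The field comments are my reading of the field names and of the cover letter, not authoritative semantics; treat them as a sketch, not documentation.)

/* Illustrative only -- mirrors the new per-arch __data members above.
   Comments are inferred, hypothetical descriptions.  */
struct __new_condvar_data_sketch
{
  unsigned int __wseq;                /* waiter sequence position            */
#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
  unsigned int __signals_sent;        /* signals/broadcasts issued so far    */
  unsigned int __confirmed;           /* wake-ups confirmed as consumed      */
  unsigned int __generation;          /* bumped when counters are reset      */
  void *__mutex;                      /* associated mutex reference          */
  unsigned int __quiescence_waiters;  /* threads blocked in destroy          */
  int __clockid;                      /* clock used for timed waits          */
};
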
diff --git a/sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S b/sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S
deleted file mode 100644
index df635af..0000000
--- a/sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S
+++ /dev/null
@@ -1,179 +0,0 @@ 
-/* Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <sysdep.h>
-#include <shlib-compat.h>
-#include <lowlevellock.h>
-#include <lowlevelcond.h>
-#include <kernel-features.h>
-#include <pthread-pi-defines.h>
-#include <pthread-errnos.h>
-#include <stap-probe.h>
-
-	.text
-
-	/* int pthread_cond_broadcast (pthread_cond_t *cond) */
-	.globl	__pthread_cond_broadcast
-	.type	__pthread_cond_broadcast, @function
-	.align	16
-__pthread_cond_broadcast:
-
-	LIBC_PROBE (cond_broadcast, 1, %rdi)
-
-	/* Get internal lock.  */
-	movl	$1, %esi
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %esi, (%rdi)
-#else
-	cmpxchgl %esi, cond_lock(%rdi)
-#endif
-	jnz	1f
-
-2:	addq	$cond_futex, %rdi
-	movq	total_seq-cond_futex(%rdi), %r9
-	cmpq	wakeup_seq-cond_futex(%rdi), %r9
-	jna	4f
-
-	/* Cause all currently waiting threads to recognize they are
-	   woken up.  */
-	movq	%r9, wakeup_seq-cond_futex(%rdi)
-	movq	%r9, woken_seq-cond_futex(%rdi)
-	addq	%r9, %r9
-	movl	%r9d, (%rdi)
-	incl	broadcast_seq-cond_futex(%rdi)
-
-	/* Get the address of the mutex used.  */
-	mov	dep_mutex-cond_futex(%rdi), %R8_LP
-
-	/* Unlock.  */
-	LOCK
-	decl	cond_lock-cond_futex(%rdi)
-	jne	7f
-
-8:	cmp	$-1, %R8_LP
-	je	9f
-
-	/* Do not use requeue for pshared condvars.  */
-	testl	$PS_BIT, MUTEX_KIND(%r8)
-	jne	9f
-
-	/* Requeue to a PI mutex if the PI bit is set.  */
-	movl	MUTEX_KIND(%r8), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	je	81f
-
-	/* Wake up all threads.  */
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$(FUTEX_CMP_REQUEUE|FUTEX_PRIVATE_FLAG), %esi
-#else
-	movl	%fs:PRIVATE_FUTEX, %esi
-	orl	$FUTEX_CMP_REQUEUE, %esi
-#endif
-	movl	$SYS_futex, %eax
-	movl	$1, %edx
-	movl	$0x7fffffff, %r10d
-	syscall
-
-	/* For any kind of error, which mainly is EAGAIN, we try again
-	   with WAKE.  The general test also covers running on old
-	   kernels.  */
-	cmpq	$-4095, %rax
-	jae	9f
-
-10:	xorl	%eax, %eax
-	retq
-
-	/* Wake up all threads.  */
-81:	movl	$(FUTEX_CMP_REQUEUE_PI|FUTEX_PRIVATE_FLAG), %esi
-	movl	$SYS_futex, %eax
-	movl	$1, %edx
-	movl	$0x7fffffff, %r10d
-	syscall
-
-	/* For any kind of error, which mainly is EAGAIN, we try again
-	   with WAKE.  The general test also covers running on old
-	   kernels.  */
-	cmpq	$-4095, %rax
-	jb	10b
-	jmp	9f
-
-	.align	16
-	/* Unlock.  */
-4:	LOCK
-	decl	cond_lock-cond_futex(%rdi)
-	jne	5f
-
-6:	xorl	%eax, %eax
-	retq
-
-	/* Initial locking failed.  */
-1:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_lock_wait
-#if cond_lock != 0
-	subq	$cond_lock, %rdi
-#endif
-	jmp	2b
-
-	/* Unlock in loop requires wakeup.  */
-5:	addq	$cond_lock-cond_futex, %rdi
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-	jmp	6b
-
-	/* Unlock in loop requires wakeup.  */
-7:	addq	$cond_lock-cond_futex, %rdi
-	cmp	$-1, %R8_LP
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-	subq	$cond_lock-cond_futex, %rdi
-	jmp	8b
-
-9:	/* The futex requeue functionality is not available.  */
-	cmp	$-1, %R8_LP
-	movl	$0x7fffffff, %edx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$FUTEX_WAKE, %eax
-	movl	$(FUTEX_WAKE|FUTEX_PRIVATE_FLAG), %esi
-	cmove	%eax, %esi
-#else
-	movl	$0, %eax
-	movl	%fs:PRIVATE_FUTEX, %esi
-	cmove	%eax, %esi
-	orl	$FUTEX_WAKE, %esi
-#endif
-	movl	$SYS_futex, %eax
-	syscall
-	jmp	10b
-	.size	__pthread_cond_broadcast, .-__pthread_cond_broadcast
-versioned_symbol (libpthread, __pthread_cond_broadcast, pthread_cond_broadcast,
-		  GLIBC_2_3_2)
diff --git a/sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S b/sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S
deleted file mode 100644
index 0e8fe0c..0000000
--- a/sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S
+++ /dev/null
@@ -1,164 +0,0 @@ 
-/* Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <sysdep.h>
-#include <shlib-compat.h>
-#include <lowlevellock.h>
-#include <lowlevelcond.h>
-#include <pthread-pi-defines.h>
-#include <kernel-features.h>
-#include <pthread-errnos.h>
-#include <stap-probe.h>
-
-
-	.text
-
-	/* int pthread_cond_signal (pthread_cond_t *cond) */
-	.globl	__pthread_cond_signal
-	.type	__pthread_cond_signal, @function
-	.align	16
-__pthread_cond_signal:
-
-	LIBC_PROBE (cond_signal, 1, %rdi)
-
-	/* Get internal lock.  */
-	movq	%rdi, %r8
-	movl	$1, %esi
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %esi, (%rdi)
-#else
-	cmpxchgl %esi, cond_lock(%rdi)
-#endif
-	jnz	1f
-
-2:	addq	$cond_futex, %rdi
-	movq	total_seq(%r8), %rcx
-	cmpq	wakeup_seq(%r8), %rcx
-	jbe	4f
-
-	/* Bump the wakeup number.  */
-	addq	$1, wakeup_seq(%r8)
-	addl	$1, (%rdi)
-
-	/* Wake up one thread.  */
-	LP_OP(cmp) $-1, dep_mutex(%r8)
-	movl	$FUTEX_WAKE_OP, %esi
-	movl	$1, %edx
-	movl	$SYS_futex, %eax
-	je	8f
-
-	/* Get the address of the mutex used.  */
-	mov     dep_mutex(%r8), %RCX_LP
-	movl	MUTEX_KIND(%rcx), %r11d
-	andl	$(ROBUST_BIT|PI_BIT), %r11d
-	cmpl	$PI_BIT, %r11d
-	je	9f
-
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$(FUTEX_WAKE_OP|FUTEX_PRIVATE_FLAG), %esi
-#else
-	orl	%fs:PRIVATE_FUTEX, %esi
-#endif
-
-8:	movl	$1, %r10d
-#if cond_lock != 0
-	addq	$cond_lock, %r8
-#endif
-	movl	$FUTEX_OP_CLEAR_WAKE_IF_GT_ONE, %r9d
-	syscall
-#if cond_lock != 0
-	subq	$cond_lock, %r8
-#endif
-	/* For any kind of error, we try again with WAKE.
-	   The general test also covers running on old kernels.  */
-	cmpq	$-4095, %rax
-	jae	7f
-
-	xorl	%eax, %eax
-	retq
-
-	/* Wake up one thread and requeue none in the PI Mutex case.  */
-9:	movl	$(FUTEX_CMP_REQUEUE_PI|FUTEX_PRIVATE_FLAG), %esi
-	movq	%rcx, %r8
-	xorq	%r10, %r10
-	movl	(%rdi), %r9d	// XXX Can this be right?
-	syscall
-
-	leaq	-cond_futex(%rdi), %r8
-
-	/* For any kind of error, we try again with WAKE.
-	   The general test also covers running on old kernels.  */
-	cmpq	$-4095, %rax
-	jb	4f
-
-7:
-#ifdef __ASSUME_PRIVATE_FUTEX
-	andl	$FUTEX_PRIVATE_FLAG, %esi
-#else
-	andl	%fs:PRIVATE_FUTEX, %esi
-#endif
-	orl	$FUTEX_WAKE, %esi
-	movl	$SYS_futex, %eax
-	/* %rdx should be 1 already from $FUTEX_WAKE_OP syscall.
-	movl	$1, %edx  */
-	syscall
-
-	/* Unlock.  */
-4:	LOCK
-#if cond_lock == 0
-	decl	(%r8)
-#else
-	decl	cond_lock(%r8)
-#endif
-	jne	5f
-
-6:	xorl	%eax, %eax
-	retq
-
-	/* Initial locking failed.  */
-1:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_lock_wait
-#if cond_lock != 0
-	subq	$cond_lock, %rdi
-#endif
-	jmp	2b
-
-	/* Unlock in loop requires wakeup.  */
-5:
-	movq	%r8, %rdi
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-	jmp	6b
-	.size	__pthread_cond_signal, .-__pthread_cond_signal
-versioned_symbol (libpthread, __pthread_cond_signal, pthread_cond_signal,
-		  GLIBC_2_3_2)
diff --git a/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S b/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
deleted file mode 100644
index 15b872d..0000000
--- a/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
+++ /dev/null
@@ -1,623 +0,0 @@ 
-/* Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <sysdep.h>
-#include <shlib-compat.h>
-#include <lowlevellock.h>
-#include <lowlevelcond.h>
-#include <pthread-pi-defines.h>
-#include <pthread-errnos.h>
-#include <stap-probe.h>
-
-#include <kernel-features.h>
-
-
-	.text
-
-
-/* int pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mutex,
-			       const struct timespec *abstime)  */
-	.globl	__pthread_cond_timedwait
-	.type	__pthread_cond_timedwait, @function
-	.align	16
-__pthread_cond_timedwait:
-.LSTARTCODE:
-	cfi_startproc
-#ifdef SHARED
-	cfi_personality(DW_EH_PE_pcrel | DW_EH_PE_sdata4 | DW_EH_PE_indirect,
-			DW.ref.__gcc_personality_v0)
-	cfi_lsda(DW_EH_PE_pcrel | DW_EH_PE_sdata4, .LexceptSTART)
-#else
-	cfi_personality(DW_EH_PE_udata4, __gcc_personality_v0)
-	cfi_lsda(DW_EH_PE_udata4, .LexceptSTART)
-#endif
-
-	pushq	%r12
-	cfi_adjust_cfa_offset(8)
-	cfi_rel_offset(%r12, 0)
-	pushq	%r13
-	cfi_adjust_cfa_offset(8)
-	cfi_rel_offset(%r13, 0)
-	pushq	%r14
-	cfi_adjust_cfa_offset(8)
-	cfi_rel_offset(%r14, 0)
-	pushq	%r15
-	cfi_adjust_cfa_offset(8)
-	cfi_rel_offset(%r15, 0)
-#define FRAME_SIZE (32+8)
-	subq	$FRAME_SIZE, %rsp
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-	cfi_remember_state
-
-	LIBC_PROBE (cond_timedwait, 3, %rdi, %rsi, %rdx)
-
-	cmpq	$1000000000, 8(%rdx)
-	movl	$EINVAL, %eax
-	jae	48f
-
-	/* Stack frame:
-
-	   rsp + 48
-		    +--------------------------+
-	   rsp + 32 | timeout value            |
-		    +--------------------------+
-	   rsp + 24 | old wake_seq value       |
-		    +--------------------------+
-	   rsp + 16 | mutex pointer            |
-		    +--------------------------+
-	   rsp +  8 | condvar pointer          |
-		    +--------------------------+
-	   rsp +  4 | old broadcast_seq value  |
-		    +--------------------------+
-	   rsp +  0 | old cancellation mode    |
-		    +--------------------------+
-	*/
-
-	LP_OP(cmp) $-1, dep_mutex(%rdi)
-
-	/* Prepare structure passed to cancellation handler.  */
-	movq	%rdi, 8(%rsp)
-	movq	%rsi, 16(%rsp)
-	movq	%rdx, %r13
-
-	je	22f
-	mov	%RSI_LP, dep_mutex(%rdi)
-
-22:
-	xorb	%r15b, %r15b
-
-	/* Get internal lock.  */
-	movl	$1, %esi
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %esi, (%rdi)
-#else
-	cmpxchgl %esi, cond_lock(%rdi)
-#endif
-	jnz	31f
-
-	/* Unlock the mutex.  */
-32:	movq	16(%rsp), %rdi
-	xorl	%esi, %esi
-	callq	__pthread_mutex_unlock_usercnt
-
-	testl	%eax, %eax
-	jne	46f
-
-	movq	8(%rsp), %rdi
-	incq	total_seq(%rdi)
-	incl	cond_futex(%rdi)
-	addl	$(1 << nwaiters_shift), cond_nwaiters(%rdi)
-
-	/* Get and store current wakeup_seq value.  */
-	movq	8(%rsp), %rdi
-	movq	wakeup_seq(%rdi), %r9
-	movl	broadcast_seq(%rdi), %edx
-	movq	%r9, 24(%rsp)
-	movl	%edx, 4(%rsp)
-
-	cmpq	$0, (%r13)
-	movq	$-ETIMEDOUT, %r14
-	js	36f
-
-38:	movl	cond_futex(%rdi), %r12d
-
-	/* Unlock.  */
-	LOCK
-#if cond_lock == 0
-	decl	(%rdi)
-#else
-	decl	cond_lock(%rdi)
-#endif
-	jne	33f
-
-.LcleanupSTART1:
-34:	callq	__pthread_enable_asynccancel
-	movl	%eax, (%rsp)
-
-	movq	%r13, %r10
-	movl	$FUTEX_WAIT_BITSET, %esi
-	LP_OP(cmp) $-1, dep_mutex(%rdi)
-	je	60f
-
-	mov	dep_mutex(%rdi), %R8_LP
-	/* Requeue to a non-robust PI mutex if the PI bit is set and
-	the robust bit is not set.  */
-	movl	MUTEX_KIND(%r8), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	jne	61f
-
-	movl	$(FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG), %esi
-	xorl	%eax, %eax
-	/* The following only works like this because we only support
-	   two clocks, represented using a single bit.  */
-	testl	$1, cond_nwaiters(%rdi)
-	movl	$FUTEX_CLOCK_REALTIME, %edx
-	cmove	%edx, %eax
-	orl	%eax, %esi
-	movq	%r12, %rdx
-	addq	$cond_futex, %rdi
-	movl	$SYS_futex, %eax
-	syscall
-
-	cmpl	$0, %eax
-	sete	%r15b
-
-#ifdef __ASSUME_REQUEUE_PI
-	jmp	62f
-#else
-	je	62f
-
-	/* When a futex syscall with FUTEX_WAIT_REQUEUE_PI returns
-	   successfully, it has already locked the mutex for us and the
-	   pi_flag (%r15b) is set to denote that fact.  However, if another
-	   thread changed the futex value before we entered the wait, the
-	   syscall may return an EAGAIN and the mutex is not locked.  We go
-	   ahead with a success anyway since later we look at the pi_flag to
-	   decide if we got the mutex or not.  The sequence numbers then make
-	   sure that only one of the threads actually wake up.  We retry using
-	   normal FUTEX_WAIT only if the kernel returned ENOSYS, since normal
-	   and PI futexes don't mix.
-
-	   Note that we don't check for EAGAIN specifically; we assume that the
-	   only other error the futex function could return is EAGAIN (barring
-	   the ETIMEOUT of course, for the timeout case in futex) since
-	   anything else would mean an error in our function.  It is too
-	   expensive to do that check for every call (which is  quite common in
-	   case of a large number of threads), so it has been skipped.  */
-	cmpl    $-ENOSYS, %eax
-	jne     62f
-
-	subq	$cond_futex, %rdi
-#endif
-
-61:	movl	$(FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG), %esi
-60:	xorb	%r15b, %r15b
-	xorl	%eax, %eax
-	/* The following only works like this because we only support
-	   two clocks, represented using a single bit.  */
-	testl	$1, cond_nwaiters(%rdi)
-	movl	$FUTEX_CLOCK_REALTIME, %edx
-	movl	$0xffffffff, %r9d
-	cmove	%edx, %eax
-	orl	%eax, %esi
-	movq	%r12, %rdx
-	addq	$cond_futex, %rdi
-	movl	$SYS_futex, %eax
-	syscall
-62:	movq	%rax, %r14
-
-	movl	(%rsp), %edi
-	callq	__pthread_disable_asynccancel
-.LcleanupEND1:
-
-	/* Lock.  */
-	movq	8(%rsp), %rdi
-	movl	$1, %esi
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %esi, (%rdi)
-#else
-	cmpxchgl %esi, cond_lock(%rdi)
-#endif
-	jne	35f
-
-36:	movl	broadcast_seq(%rdi), %edx
-
-	movq	woken_seq(%rdi), %rax
-
-	movq	wakeup_seq(%rdi), %r9
-
-	cmpl	4(%rsp), %edx
-	jne	53f
-
-	cmpq	24(%rsp), %r9
-	jbe	45f
-
-	cmpq	%rax, %r9
-	ja	39f
-
-45:	cmpq	$-ETIMEDOUT, %r14
-	je	99f
-
-	/* We need to go back to futex_wait.  If we're using requeue_pi, then
-	   release the mutex we had acquired and go back.  */
-	test	%r15b, %r15b
-	jz	38b
-
-	/* Adjust the mutex values first and then unlock it.  The unlock
-	   should always succeed or else the kernel did not lock the
-	   mutex correctly.  */
-	movq	%r8, %rdi
-	callq	__pthread_mutex_cond_lock_adjust
-	xorl	%esi, %esi
-	callq	__pthread_mutex_unlock_usercnt
-	/* Reload cond_var.  */
-	movq	8(%rsp), %rdi
-	jmp	38b
-
-99:	incq	wakeup_seq(%rdi)
-	incl	cond_futex(%rdi)
-	movl	$ETIMEDOUT, %r14d
-	jmp	44f
-
-53:	xorq	%r14, %r14
-	jmp	54f
-
-39:	xorq	%r14, %r14
-44:	incq	woken_seq(%rdi)
-
-54:	subl	$(1 << nwaiters_shift), cond_nwaiters(%rdi)
-
-	/* Wake up a thread which wants to destroy the condvar object.  */
-	cmpq	$0xffffffffffffffff, total_seq(%rdi)
-	jne	55f
-	movl	cond_nwaiters(%rdi), %eax
-	andl	$~((1 << nwaiters_shift) - 1), %eax
-	jne	55f
-
-	addq	$cond_nwaiters, %rdi
-	LP_OP(cmp) $-1, dep_mutex-cond_nwaiters(%rdi)
-	movl	$1, %edx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$FUTEX_WAKE, %eax
-	movl	$(FUTEX_WAKE|FUTEX_PRIVATE_FLAG), %esi
-	cmove	%eax, %esi
-#else
-	movl	$0, %eax
-	movl	%fs:PRIVATE_FUTEX, %esi
-	cmove	%eax, %esi
-	orl	$FUTEX_WAKE, %esi
-#endif
-	movl	$SYS_futex, %eax
-	syscall
-	subq	$cond_nwaiters, %rdi
-
-55:	LOCK
-#if cond_lock == 0
-	decl	(%rdi)
-#else
-	decl	cond_lock(%rdi)
-#endif
-	jne	40f
-
-	/* If requeue_pi is used the kernel performs the locking of the
-	   mutex. */
-41:	movq	16(%rsp), %rdi
-	testb	%r15b, %r15b
-	jnz	64f
-
-	callq	__pthread_mutex_cond_lock
-
-63:	testq	%rax, %rax
-	cmoveq	%r14, %rax
-
-48:	addq	$FRAME_SIZE, %rsp
-	cfi_adjust_cfa_offset(-FRAME_SIZE)
-	popq	%r15
-	cfi_adjust_cfa_offset(-8)
-	cfi_restore(%r15)
-	popq	%r14
-	cfi_adjust_cfa_offset(-8)
-	cfi_restore(%r14)
-	popq	%r13
-	cfi_adjust_cfa_offset(-8)
-	cfi_restore(%r13)
-	popq	%r12
-	cfi_adjust_cfa_offset(-8)
-	cfi_restore(%r12)
-
-	retq
-
-	cfi_restore_state
-
-64:	callq	__pthread_mutex_cond_lock_adjust
-	movq	%r14, %rax
-	jmp	48b
-
-	/* Initial locking failed.  */
-31:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_lock_wait
-	jmp	32b
-
-	/* Unlock in loop requires wakeup.  */
-33:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-	jmp	34b
-
-	/* Locking in loop failed.  */
-35:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_lock_wait
-#if cond_lock != 0
-	subq	$cond_lock, %rdi
-#endif
-	jmp	36b
-
-	/* Unlock after loop requires wakeup.  */
-40:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-	jmp	41b
-
-	/* The initial unlocking of the mutex failed.  */
-46:	movq	8(%rsp), %rdi
-	movq	%rax, (%rsp)
-	LOCK
-#if cond_lock == 0
-	decl	(%rdi)
-#else
-	decl	cond_lock(%rdi)
-#endif
-	jne	47f
-
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-
-47:	movq	(%rsp), %rax
-	jmp	48b
-
-	.size	__pthread_cond_timedwait, .-__pthread_cond_timedwait
-versioned_symbol (libpthread, __pthread_cond_timedwait, pthread_cond_timedwait,
-		  GLIBC_2_3_2)
-
-
-	.align	16
-	.type	__condvar_cleanup2, @function
-__condvar_cleanup2:
-	/* Stack frame:
-
-	   rsp + 72
-		    +--------------------------+
-	   rsp + 64 | %r12                     |
-		    +--------------------------+
-	   rsp + 56 | %r13                     |
-		    +--------------------------+
-	   rsp + 48 | %r14                     |
-		    +--------------------------+
-	   rsp + 24 | unused                   |
-		    +--------------------------+
-	   rsp + 16 | mutex pointer            |
-		    +--------------------------+
-	   rsp +  8 | condvar pointer          |
-		    +--------------------------+
-	   rsp +  4 | old broadcast_seq value  |
-		    +--------------------------+
-	   rsp +  0 | old cancellation mode    |
-		    +--------------------------+
-	*/
-
-	movq	%rax, 24(%rsp)
-
-	/* Get internal lock.  */
-	movq	8(%rsp), %rdi
-	movl	$1, %esi
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %esi, (%rdi)
-#else
-	cmpxchgl %esi, cond_lock(%rdi)
-#endif
-	jz	1f
-
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_lock_wait
-#if cond_lock != 0
-	subq	$cond_lock, %rdi
-#endif
-
-1:	movl	broadcast_seq(%rdi), %edx
-	cmpl	4(%rsp), %edx
-	jne	3f
-
-	/* We increment the wakeup_seq counter only if it is lower than
-	   total_seq.  If this is not the case the thread was woken and
-	   then canceled.  In this case we ignore the signal.  */
-	movq	total_seq(%rdi), %rax
-	cmpq	wakeup_seq(%rdi), %rax
-	jbe	6f
-	incq	wakeup_seq(%rdi)
-	incl	cond_futex(%rdi)
-6:	incq	woken_seq(%rdi)
-
-3:	subl	$(1 << nwaiters_shift), cond_nwaiters(%rdi)
-
-	/* Wake up a thread which wants to destroy the condvar object.  */
-	xorq	%r12, %r12
-	cmpq	$0xffffffffffffffff, total_seq(%rdi)
-	jne	4f
-	movl	cond_nwaiters(%rdi), %eax
-	andl	$~((1 << nwaiters_shift) - 1), %eax
-	jne	4f
-
-	LP_OP(cmp) $-1, dep_mutex(%rdi)
-	leaq	cond_nwaiters(%rdi), %rdi
-	movl	$1, %edx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$FUTEX_WAKE, %eax
-	movl	$(FUTEX_WAKE|FUTEX_PRIVATE_FLAG), %esi
-	cmove	%eax, %esi
-#else
-	movl	$0, %eax
-	movl	%fs:PRIVATE_FUTEX, %esi
-	cmove	%eax, %esi
-	orl	$FUTEX_WAKE, %esi
-#endif
-	movl	$SYS_futex, %eax
-	syscall
-	subq	$cond_nwaiters, %rdi
-	movl	$1, %r12d
-
-4:	LOCK
-#if cond_lock == 0
-	decl	(%rdi)
-#else
-	decl	cond_lock(%rdi)
-#endif
-	je	2f
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-
-	/* Wake up all waiters to make sure no signal gets lost.  */
-2:	testq	%r12, %r12
-	jnz	5f
-	addq	$cond_futex, %rdi
-	LP_OP(cmp) $-1, dep_mutex-cond_futex(%rdi)
-	movl	$0x7fffffff, %edx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$FUTEX_WAKE, %eax
-	movl	$(FUTEX_WAKE|FUTEX_PRIVATE_FLAG), %esi
-	cmove	%eax, %esi
-#else
-	movl	$0, %eax
-	movl	%fs:PRIVATE_FUTEX, %esi
-	cmove	%eax, %esi
-	orl	$FUTEX_WAKE, %esi
-#endif
-	movl	$SYS_futex, %eax
-	syscall
-
-	/* Lock the mutex only if we don't own it already.  This only happens
-	   in case of PI mutexes, if we got cancelled after a successful
-	   return of the futex syscall and before disabling async
-	   cancellation.  */
-5:	movq	16(%rsp), %rdi
-	movl	MUTEX_KIND(%rdi), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	jne	7f
-
-	movl	(%rdi), %eax
-	andl	$TID_MASK, %eax
-	cmpl	%eax, %fs:TID
-	jne	7f
-	/* We managed to get the lock.  Fix it up before returning.  */
-	callq	__pthread_mutex_cond_lock_adjust
-	jmp	8f
-
-7:	callq	__pthread_mutex_cond_lock
-
-8:	movq	24(%rsp), %rdi
-	movq	FRAME_SIZE(%rsp), %r15
-	movq	FRAME_SIZE+8(%rsp), %r14
-	movq	FRAME_SIZE+16(%rsp), %r13
-	movq	FRAME_SIZE+24(%rsp), %r12
-.LcallUR:
-	call	_Unwind_Resume@PLT
-	hlt
-.LENDCODE:
-	cfi_endproc
-	.size	__condvar_cleanup2, .-__condvar_cleanup2
-
-
-	.section .gcc_except_table,"a",@progbits
-.LexceptSTART:
-	.byte	DW_EH_PE_omit			# @LPStart format
-	.byte	DW_EH_PE_omit			# @TType format
-	.byte	DW_EH_PE_uleb128		# call-site format
-	.uleb128 .Lcstend-.Lcstbegin
-.Lcstbegin:
-	.uleb128 .LcleanupSTART1-.LSTARTCODE
-	.uleb128 .LcleanupEND1-.LcleanupSTART1
-	.uleb128 __condvar_cleanup2-.LSTARTCODE
-	.uleb128  0
-	.uleb128 .LcallUR-.LSTARTCODE
-	.uleb128 .LENDCODE-.LcallUR
-	.uleb128 0
-	.uleb128  0
-.Lcstend:
-
-
-#ifdef SHARED
-	.hidden	DW.ref.__gcc_personality_v0
-	.weak	DW.ref.__gcc_personality_v0
-	.section .gnu.linkonce.d.DW.ref.__gcc_personality_v0,"aw",@progbits
-	.align	LP_SIZE
-	.type	DW.ref.__gcc_personality_v0, @object
-	.size	DW.ref.__gcc_personality_v0, LP_SIZE
-DW.ref.__gcc_personality_v0:
-	ASM_ADDR __gcc_personality_v0
-#endif
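
To make the deleted assembly easier to review: the cancellation cleanup above (__condvar_cleanup2) maintains the old algorithm's counters roughly as in the C sketch below.  This is an illustration only, not the literal code: the struct and the lock/futex helpers are simplified stand-ins, NWAITERS_SHIFT is an assumed value, and the private-vs-shared futex selection, error paths, and the relocking of the user's mutex are omitted.

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define NWAITERS_SHIFT 1	/* Low bits of nwaiters are reserved
				   (assumed value).  */

struct old_condvar
{
  int lock;			/* Internal low-level lock.  */
  unsigned int futex;		/* Futex word the waiters block on.  */
  uint64_t total_seq;		/* Waiters that ever started waiting.  */
  uint64_t wakeup_seq;		/* Wake-ups handed out.  */
  uint64_t woken_seq;		/* Wake-ups actually consumed.  */
  unsigned int nwaiters;	/* Waiter count << NWAITERS_SHIFT.  */
  unsigned int broadcast_seq;	/* Broadcast generation.  */
};

/* Stand-ins for the real low-level lock and futex operations.  */
void internal_lock (struct old_condvar *cv);
void internal_unlock (struct old_condvar *cv);
void futex_wake (unsigned int *addr, int nr);

void
cleanup_sketch (struct old_condvar *cv, unsigned int saved_broadcast_seq)
{
  bool woke_destroyer = false;

  internal_lock (cv);

  if (cv->broadcast_seq == saved_broadcast_seq)
    {
      /* Consume a signal only if one could still be outstanding;
	 otherwise this thread was woken first and cancelled afterwards.  */
      if (cv->total_seq > cv->wakeup_seq)
	{
	  cv->wakeup_seq++;
	  cv->futex++;
	}
      cv->woken_seq++;
    }

  cv->nwaiters -= 1u << NWAITERS_SHIFT;

  /* If pthread_cond_destroy is waiting for the last waiter to leave,
     wake it via the nwaiters word.  */
  if (cv->total_seq == UINT64_MAX
      && (cv->nwaiters & ~((1u << NWAITERS_SHIFT) - 1)) == 0)
    {
      futex_wake (&cv->nwaiters, 1);
      woke_destroyer = true;
    }

  internal_unlock (cv);

  /* Otherwise wake everybody so that no signal gets lost because this
     waiter left.  */
  if (!woke_destroyer)
    futex_wake (&cv->futex, INT_MAX);
}
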
diff --git a/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S b/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
deleted file mode 100644
index 2e564a7..0000000
--- a/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
+++ /dev/null
@@ -1,555 +0,0 @@ 
-/* Copyright (C) 2002-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <sysdep.h>
-#include <shlib-compat.h>
-#include <lowlevellock.h>
-#include <lowlevelcond.h>
-#include <tcb-offsets.h>
-#include <pthread-pi-defines.h>
-#include <pthread-errnos.h>
-#include <stap-probe.h>
-
-#include <kernel-features.h>
-
-
-	.text
-
-/* int pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)  */
-	.globl	__pthread_cond_wait
-	.type	__pthread_cond_wait, @function
-	.align	16
-__pthread_cond_wait:
-.LSTARTCODE:
-	cfi_startproc
-#ifdef SHARED
-	cfi_personality(DW_EH_PE_pcrel | DW_EH_PE_sdata4 | DW_EH_PE_indirect,
-			DW.ref.__gcc_personality_v0)
-	cfi_lsda(DW_EH_PE_pcrel | DW_EH_PE_sdata4, .LexceptSTART)
-#else
-	cfi_personality(DW_EH_PE_udata4, __gcc_personality_v0)
-	cfi_lsda(DW_EH_PE_udata4, .LexceptSTART)
-#endif
-
-#define FRAME_SIZE (32+8)
-	leaq	-FRAME_SIZE(%rsp), %rsp
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-
-	/* Stack frame:
-
-	   rsp + 32
-		    +--------------------------+
-	   rsp + 24 | old wake_seq value       |
-		    +--------------------------+
-	   rsp + 16 | mutex pointer            |
-		    +--------------------------+
-	   rsp +  8 | condvar pointer          |
-		    +--------------------------+
-	   rsp +  4 | old broadcast_seq value  |
-		    +--------------------------+
-	   rsp +  0 | old cancellation mode    |
-		    +--------------------------+
-	*/
-
-	LIBC_PROBE (cond_wait, 2, %rdi, %rsi)
-
-	LP_OP(cmp) $-1, dep_mutex(%rdi)
-
-	/* Prepare structure passed to cancellation handler.  */
-	movq	%rdi, 8(%rsp)
-	movq	%rsi, 16(%rsp)
-
-	je	15f
-	mov	%RSI_LP, dep_mutex(%rdi)
-
-	/* Get internal lock.  */
-15:	movl	$1, %esi
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %esi, (%rdi)
-#else
-	cmpxchgl %esi, cond_lock(%rdi)
-#endif
-	jne	1f
-
-	/* Unlock the mutex.  */
-2:	movq	16(%rsp), %rdi
-	xorl	%esi, %esi
-	callq	__pthread_mutex_unlock_usercnt
-
-	testl	%eax, %eax
-	jne	12f
-
-	movq	8(%rsp), %rdi
-	incq	total_seq(%rdi)
-	incl	cond_futex(%rdi)
-	addl	$(1 << nwaiters_shift), cond_nwaiters(%rdi)
-
-	/* Get and store current wakeup_seq value.  */
-	movq	8(%rsp), %rdi
-	movq	wakeup_seq(%rdi), %r9
-	movl	broadcast_seq(%rdi), %edx
-	movq	%r9, 24(%rsp)
-	movl	%edx, 4(%rsp)
-
-	/* Unlock.  */
-8:	movl	cond_futex(%rdi), %edx
-	LOCK
-#if cond_lock == 0
-	decl	(%rdi)
-#else
-	decl	cond_lock(%rdi)
-#endif
-	jne	3f
-
-.LcleanupSTART:
-4:	callq	__pthread_enable_asynccancel
-	movl	%eax, (%rsp)
-
-	xorq	%r10, %r10
-	LP_OP(cmp) $-1, dep_mutex(%rdi)
-	leaq	cond_futex(%rdi), %rdi
-	movl	$FUTEX_WAIT, %esi
-	je	60f
-
-	mov	dep_mutex-cond_futex(%rdi), %R8_LP
-	/* Requeue to a non-robust PI mutex if the PI bit is set and
-	the robust bit is not set.  */
-	movl	MUTEX_KIND(%r8), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	jne	61f
-
-	movl	$(FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG), %esi
-	movl	$SYS_futex, %eax
-	syscall
-
-	cmpl	$0, %eax
-	sete	%r8b
-
-#ifdef __ASSUME_REQUEUE_PI
-	jmp	62f
-#else
-	je	62f
-
-	/* When a futex syscall with FUTEX_WAIT_REQUEUE_PI returns
-	   successfully, it has already locked the mutex for us and the
-	   pi_flag (%r8b) is set to denote that fact.  However, if another
-	   thread changed the futex value before we entered the wait, the
-	   syscall may return an EAGAIN and the mutex is not locked.  We go
-	   ahead with a success anyway since later we look at the pi_flag to
-	   decide if we got the mutex or not.  The sequence numbers then make
-	   sure that only one of the threads actually wake up.  We retry using
-	   normal FUTEX_WAIT only if the kernel returned ENOSYS, since normal
-	   and PI futexes don't mix.
-
-	   Note that we don't check for EAGAIN specifically; we assume that the
-	   only other error the futex function could return is EAGAIN since
-	   anything else would mean an error in our function.  It is too
-	   expensive to do that check for every call (which is 	quite common in
-	   case of a large number of threads), so it has been skipped.  */
-	cmpl	$-ENOSYS, %eax
-	jne	62f
-
-# ifndef __ASSUME_PRIVATE_FUTEX
-	movl	$FUTEX_WAIT, %esi
-# endif
-#endif
-
-61:
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$(FUTEX_WAIT|FUTEX_PRIVATE_FLAG), %esi
-#else
-	orl	%fs:PRIVATE_FUTEX, %esi
-#endif
-60:	xorb	%r8b, %r8b
-	movl	$SYS_futex, %eax
-	syscall
-
-62:	movl	(%rsp), %edi
-	callq	__pthread_disable_asynccancel
-.LcleanupEND:
-
-	/* Lock.  */
-	movq	8(%rsp), %rdi
-	movl	$1, %esi
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %esi, (%rdi)
-#else
-	cmpxchgl %esi, cond_lock(%rdi)
-#endif
-	jnz	5f
-
-6:	movl	broadcast_seq(%rdi), %edx
-
-	movq	woken_seq(%rdi), %rax
-
-	movq	wakeup_seq(%rdi), %r9
-
-	cmpl	4(%rsp), %edx
-	jne	16f
-
-	cmpq	24(%rsp), %r9
-	jbe	19f
-
-	cmpq	%rax, %r9
-	jna	19f
-
-	incq	woken_seq(%rdi)
-
-	/* Unlock */
-16:	subl	$(1 << nwaiters_shift), cond_nwaiters(%rdi)
-
-	/* Wake up a thread which wants to destroy the condvar object.  */
-	cmpq	$0xffffffffffffffff, total_seq(%rdi)
-	jne	17f
-	movl	cond_nwaiters(%rdi), %eax
-	andl	$~((1 << nwaiters_shift) - 1), %eax
-	jne	17f
-
-	addq	$cond_nwaiters, %rdi
-	LP_OP(cmp) $-1, dep_mutex-cond_nwaiters(%rdi)
-	movl	$1, %edx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$FUTEX_WAKE, %eax
-	movl	$(FUTEX_WAKE|FUTEX_PRIVATE_FLAG), %esi
-	cmove	%eax, %esi
-#else
-	movl	$0, %eax
-	movl	%fs:PRIVATE_FUTEX, %esi
-	cmove	%eax, %esi
-	orl	$FUTEX_WAKE, %esi
-#endif
-	movl	$SYS_futex, %eax
-	syscall
-	subq	$cond_nwaiters, %rdi
-
-17:	LOCK
-#if cond_lock == 0
-	decl	(%rdi)
-#else
-	decl	cond_lock(%rdi)
-#endif
-	jne	10f
-
-	/* If requeue_pi is used the kernel performs the locking of the
-	   mutex. */
-11:	movq	16(%rsp), %rdi
-	testb	%r8b, %r8b
-	jnz	18f
-
-	callq	__pthread_mutex_cond_lock
-
-14:	leaq	FRAME_SIZE(%rsp), %rsp
-	cfi_adjust_cfa_offset(-FRAME_SIZE)
-
-	/* We return the result of the mutex_lock operation.  */
-	retq
-
-	cfi_adjust_cfa_offset(FRAME_SIZE)
-
-18:	callq	__pthread_mutex_cond_lock_adjust
-	xorl	%eax, %eax
-	jmp	14b
-
-	/* We need to go back to futex_wait.  If we're using requeue_pi, then
-	   release the mutex we had acquired and go back.  */
-19:	testb	%r8b, %r8b
-	jz	8b
-
-	/* Adjust the mutex values first and then unlock it.  The unlock
-	   should always succeed or else the kernel did not lock the mutex
-	   correctly.  */
-	movq	16(%rsp), %rdi
-	callq	__pthread_mutex_cond_lock_adjust
-	movq	%rdi, %r8
-	xorl	%esi, %esi
-	callq	__pthread_mutex_unlock_usercnt
-	/* Reload cond_var.  */
-	movq	8(%rsp), %rdi
-	jmp	8b
-
-	/* Initial locking failed.  */
-1:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_lock_wait
-	jmp	2b
-
-	/* Unlock in loop requires wakeup.  */
-3:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	/* The call preserves %rdx.  */
-	callq	__lll_unlock_wake
-#if cond_lock != 0
-	subq	$cond_lock, %rdi
-#endif
-	jmp	4b
-
-	/* Locking in loop failed.  */
-5:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_lock_wait
-#if cond_lock != 0
-	subq	$cond_lock, %rdi
-#endif
-	jmp	6b
-
-	/* Unlock after loop requires wakeup.  */
-10:
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-	jmp	11b
-
-	/* The initial unlocking of the mutex failed.  */
-12:	movq	%rax, %r10
-	movq	8(%rsp), %rdi
-	LOCK
-#if cond_lock == 0
-	decl	(%rdi)
-#else
-	decl	cond_lock(%rdi)
-#endif
-	je	13f
-
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_unlock_wake
-
-13:	movq	%r10, %rax
-	jmp	14b
-
-	.size	__pthread_cond_wait, .-__pthread_cond_wait
-versioned_symbol (libpthread, __pthread_cond_wait, pthread_cond_wait,
-		  GLIBC_2_3_2)
-
-
-	.align	16
-	.type	__condvar_cleanup1, @function
-	.globl	__condvar_cleanup1
-	.hidden	__condvar_cleanup1
-__condvar_cleanup1:
-	/* Stack frame:
-
-	   rsp + 32
-		    +--------------------------+
-	   rsp + 24 | unused                   |
-		    +--------------------------+
-	   rsp + 16 | mutex pointer            |
-		    +--------------------------+
-	   rsp +  8 | condvar pointer          |
-		    +--------------------------+
-	   rsp +  4 | old broadcast_seq value  |
-		    +--------------------------+
-	   rsp +  0 | old cancellation mode    |
-		    +--------------------------+
-	*/
-
-	movq	%rax, 24(%rsp)
-
-	/* Get internal lock.  */
-	movq	8(%rsp), %rdi
-	movl	$1, %esi
-	xorl	%eax, %eax
-	LOCK
-#if cond_lock == 0
-	cmpxchgl %esi, (%rdi)
-#else
-	cmpxchgl %esi, cond_lock(%rdi)
-#endif
-	jz	1f
-
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	callq	__lll_lock_wait
-#if cond_lock != 0
-	subq	$cond_lock, %rdi
-#endif
-
-1:	movl	broadcast_seq(%rdi), %edx
-	cmpl	4(%rsp), %edx
-	jne	3f
-
-	/* We increment the wakeup_seq counter only if it is lower than
-	   total_seq.  If this is not the case the thread was woken and
-	   then canceled.  In this case we ignore the signal.  */
-	movq	total_seq(%rdi), %rax
-	cmpq	wakeup_seq(%rdi), %rax
-	jbe	6f
-	incq	wakeup_seq(%rdi)
-	incl	cond_futex(%rdi)
-6:	incq	woken_seq(%rdi)
-
-3:	subl	$(1 << nwaiters_shift), cond_nwaiters(%rdi)
-
-	/* Wake up a thread which wants to destroy the condvar object.  */
-	xorl	%ecx, %ecx
-	cmpq	$0xffffffffffffffff, total_seq(%rdi)
-	jne	4f
-	movl	cond_nwaiters(%rdi), %eax
-	andl	$~((1 << nwaiters_shift) - 1), %eax
-	jne	4f
-
-	LP_OP(cmp) $-1, dep_mutex(%rdi)
-	leaq	cond_nwaiters(%rdi), %rdi
-	movl	$1, %edx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$FUTEX_WAKE, %eax
-	movl	$(FUTEX_WAKE|FUTEX_PRIVATE_FLAG), %esi
-	cmove	%eax, %esi
-#else
-	movl	$0, %eax
-	movl	%fs:PRIVATE_FUTEX, %esi
-	cmove	%eax, %esi
-	orl	$FUTEX_WAKE, %esi
-#endif
-	movl	$SYS_futex, %eax
-	syscall
-	subq	$cond_nwaiters, %rdi
-	movl	$1, %ecx
-
-4:	LOCK
-#if cond_lock == 0
-	decl	(%rdi)
-#else
-	decl	cond_lock(%rdi)
-#endif
-	je	2f
-#if cond_lock != 0
-	addq	$cond_lock, %rdi
-#endif
-	LP_OP(cmp) $-1, dep_mutex-cond_lock(%rdi)
-	movl	$LLL_PRIVATE, %eax
-	movl	$LLL_SHARED, %esi
-	cmovne	%eax, %esi
-	/* The call preserves %rcx.  */
-	callq	__lll_unlock_wake
-
-	/* Wake up all waiters to make sure no signal gets lost.  */
-2:	testl	%ecx, %ecx
-	jnz	5f
-	addq	$cond_futex, %rdi
-	LP_OP(cmp) $-1, dep_mutex-cond_futex(%rdi)
-	movl	$0x7fffffff, %edx
-#ifdef __ASSUME_PRIVATE_FUTEX
-	movl	$FUTEX_WAKE, %eax
-	movl	$(FUTEX_WAKE|FUTEX_PRIVATE_FLAG), %esi
-	cmove	%eax, %esi
-#else
-	movl	$0, %eax
-	movl	%fs:PRIVATE_FUTEX, %esi
-	cmove	%eax, %esi
-	orl	$FUTEX_WAKE, %esi
-#endif
-	movl	$SYS_futex, %eax
-	syscall
-
-	/* Lock the mutex only if we don't own it already.  This only happens
-	   in case of PI mutexes, if we got cancelled after a successful
-	   return of the futex syscall and before disabling async
-	   cancellation.  */
-5:	movq	16(%rsp), %rdi
-	movl	MUTEX_KIND(%rdi), %eax
-	andl	$(ROBUST_BIT|PI_BIT), %eax
-	cmpl	$PI_BIT, %eax
-	jne	7f
-
-	movl	(%rdi), %eax
-	andl	$TID_MASK, %eax
-	cmpl	%eax, %fs:TID
-	jne	7f
-	/* We managed to get the lock.  Fix it up before returning.  */
-	callq	__pthread_mutex_cond_lock_adjust
-	jmp	8f
-
-
-7:	callq	__pthread_mutex_cond_lock
-
-8:	movq	24(%rsp), %rdi
-.LcallUR:
-	call	_Unwind_Resume@PLT
-	hlt
-.LENDCODE:
-	cfi_endproc
-	.size	__condvar_cleanup1, .-__condvar_cleanup1
-
-
-	.section .gcc_except_table,"a",@progbits
-.LexceptSTART:
-	.byte	DW_EH_PE_omit			# @LPStart format
-	.byte	DW_EH_PE_omit			# @TType format
-	.byte	DW_EH_PE_uleb128		# call-site format
-	.uleb128 .Lcstend-.Lcstbegin
-.Lcstbegin:
-	.uleb128 .LcleanupSTART-.LSTARTCODE
-	.uleb128 .LcleanupEND-.LcleanupSTART
-	.uleb128 __condvar_cleanup1-.LSTARTCODE
-	.uleb128 0
-	.uleb128 .LcallUR-.LSTARTCODE
-	.uleb128 .LENDCODE-.LcallUR
-	.uleb128 0
-	.uleb128 0
-.Lcstend:
-
-
-#ifdef SHARED
-	.hidden	DW.ref.__gcc_personality_v0
-	.weak	DW.ref.__gcc_personality_v0
-	.section .gnu.linkonce.d.DW.ref.__gcc_personality_v0,"aw",@progbits
-	.align	LP_SIZE
-	.type	DW.ref.__gcc_personality_v0, @object
-	.size	DW.ref.__gcc_personality_v0, LP_SIZE
-DW.ref.__gcc_personality_v0:
-	ASM_ADDR __gcc_personality_v0
-#endif
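
Likewise, the requeue-PI wait path in the __pthread_cond_wait deleted above boils down to the sketch below.  The names futex_wait_requeue_pi, futex_wait, and wait_step_sketch are hypothetical wrappers introduced only for this illustration (returning 0 or a positive error code); the cancellation handling and the sequence-number bookkeeping that ultimately decide whether a signal was consumed are left out.

#include <errno.h>
#include <stdbool.h>

/* Hypothetical wrappers around the raw futex syscall.  */
int futex_wait_requeue_pi (unsigned int *cond_futex, unsigned int expected,
			   void *pi_mutex);
int futex_wait (unsigned int *cond_futex, unsigned int expected);

/* Returns after (possibly spuriously) being woken; *pi_locked tells the
   caller whether the kernel already acquired the PI mutex for us.  */
static int
wait_step_sketch (unsigned int *cond_futex, unsigned int expected,
		  void *pi_mutex, bool use_requeue_pi, bool *pi_locked)
{
  *pi_locked = false;

  if (use_requeue_pi)
    {
      int err = futex_wait_requeue_pi (cond_futex, expected, pi_mutex);
      if (err == 0)
	{
	  /* Success means the kernel locked the PI mutex on our behalf,
	     so the userspace mutex_cond_lock call is skipped later.  */
	  *pi_locked = true;
	  return 0;
	}
      /* Any other error (e.g. EAGAIN because the futex value changed) is
	 treated like an ordinary, possibly spurious wake-up; the sequence
	 numbers decide afterwards whether a signal was really consumed.
	 Only ENOSYS makes us fall back to a plain FUTEX_WAIT, because PI
	 and non-PI operations must not be mixed on the same futex.  */
      if (err != ENOSYS)
	return 0;
    }

  return futex_wait (cond_futex, expected);
}
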
diff --git a/sysdeps/x86/bits/pthreadtypes.h b/sysdeps/x86/bits/pthreadtypes.h
index 4460615..0898455 100644
--- a/sysdeps/x86/bits/pthreadtypes.h
+++ b/sysdeps/x86/bits/pthreadtypes.h
@@ -140,14 +140,14 @@  typedef union
 {
   struct
   {
-    int __lock;
-    unsigned int __futex;
-    __extension__ unsigned long long int __total_seq;
-    __extension__ unsigned long long int __wakeup_seq;
-    __extension__ unsigned long long int __woken_seq;
+    unsigned int __wseq;
+#define __PTHREAD_COND_WSEQ_THRESHOLD (~ (unsigned int) 0)
+    unsigned int __signals_sent;
+    unsigned int __confirmed;
+    unsigned int __generation;
     void *__mutex;
-    unsigned int __nwaiters;
-    unsigned int __broadcast_seq;
+    unsigned int __quiescence_waiters;
+    int __clockid;
   } __data;
   char __size[__SIZEOF_PTHREAD_COND_T];
   __extension__ long long int __align;
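
The new x86 __data layout keeps __mutex and replaces the old 64-bit sequence counters with 32-bit fields, so everything still has to fit into the unchanged __SIZEOF_PTHREAD_COND_T bytes of the union.  A hypothetical compile-time check along these lines (not part of the patch) illustrates that constraint:

#include <pthread.h>

/* The union's size is fixed by the ABI; the renamed members merely reuse
   part of the existing storage.  */
_Static_assert (sizeof (pthread_cond_t) == __SIZEOF_PTHREAD_COND_T,
		"pthread_cond_t must keep its ABI size");
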