Transition to C11 atomics and memory model

  I think we should transition to using the C11 memory model and atomics
instead of the current custom implementation.  There are two reasons for
this:

1) Compilers need to know which memory accesses are atomic (and thus
potentially concurrent with other accesses to the same location) and
which aren't (and can thus be optimized more aggressively).  We
currently make the compiler do the right thing by using inline asm, but
the existence of atomic_forced_read shows that this isn't the full
picture.

2) Over time, more and more programmers will become familiar with the
C11 model.  Our current atomics are different, so if we use the C11
model, it will likely be easier for future glibc developers to work on
glibc's concurrent code.  This also applies to the tool support (e.g.,
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/).

We can't rely on having a conforming C11 compiler yet, so I think it's
best if we add new atomic_ functions that closely resemble the atomic
operations C11 provides.  We use the C11 memory model (ie, the
semantics) by either using C11 support (if the compiler has it) or
building our own atomics in such a way that they implement the HW
instruction side of a C11 memory model implementation.  We can't force
the compiler to be C11-conforming in it's transformations, but our
atomics seem to work currently, so we just expect this to work further.

I propose that our phase in transitioning to C11 is to focus on uses of
the atomic operations.  In particular, the rules are:

* All accesses to atomic vars need to use atomic_* functions.  IOW, all
non-atomic accesses are not subject to data races.  The only exceptions
is initialization (ie, when the variable is not visible to any other
thread); nonetheless, initialization accesses must not result in data
races with other accesses.  (This exception isn't allowed by C11, but
eases the transition to C11 atomics and likely works fine in current
implementations; as alternative, we could require MO-relaxed stores for
initialization as well.)

* Atomic vars aren't explicitly annotated with atomic types, but just
use the base types.  They need to be naturally aligned.  This makes the
transition easier because we don't get any dependencies on C11 atomic
types.

* On a certain architecture, we typically only use the atomic_* ops if
the HW actually supports these; we expect to have pointer-sized atomics
at most.  If the arch has no native support for atomics, it can either
use modified algorithms or emulate atomics differently.

* The atomic ops are similar to the _explicit variation of C11's
functions, except that _explicit is replaced with the last part of the
MO argument (ie, acquire, release, acq_rel, relaxed, seq_cst).  All
arguments (except the MO, which is dropped) are the same as for C11.
That avoids using the same names yet should make the names easy to
understand for people familiar with C11.

I also propose an incremental transition.  In particular, the steps are
roughly:

1) Add new C11-like atomics.  If GCC supports them on this architecture,
use GCC's atomic builtins.  Make them fall back to the existing atomics
otherwise.  Attached is a small patch that illustrates this.

2) Refactor one use (ie, all the accesses belonging to one algorithm or
group of functions that synchronize with each other) at a time.  This
involves reviewing the code and basically reimplementing the
synchronization bits in on top of the C11 memory model.  We also should
take this opportunity to add any documentation of concurrent code that's
missing (which is often the case).

3) For non-standard atomic ops (eg, atomic_add_negative()), have a look
at all uses and decide whether we really need to keep them.

4) Once all of glibc uses the new atomics, remove the old ones for a
particular arch if the oldest compiler required has support for the
respective builtins.

Open questions:

* Are the current read/write memory barriers equivalent to C11
acquire/release fences?  I guess that's the case (did I mention lack of
documentation? ;) ), but we should check whether this is true on every
architecture (ie, whether the HW instructions used for read/write
membars are the same as what the compiler would use for
acquire/release).  If not, we can't implement acquire/release based on
read/write membars but need something else for this arch.  I'd
appreciate help from the machine maintainers for this one.

* How do we deal with archs such as older SPARC that don't have CAS and
other archs without HW support for atomics?  Using modified algorithms
should be the best-performing option (eg, if we can use one critical
section instead of a complicated alternative that uses lots of atomic
operations).  However, that means we'll have to maintain more algorithms
(even if they might be simpler).
Furthermore, do all uses of atomics work well with blocking atomics that
might also not be indivisible steps?  For example, the cancellation code
might be affected because a blocking emulation of atomics won't be
async-cancel-safe?

* Which of the catomic_ variants do we really need?  Similarly to the
non-native atomics case, we often might be better off which running a
slightly different nonatomic (or just nonsynchronizing) algorithm in the
first place.  We'll have to review all the uses to be able to tell.

Thoughts?  Any feedback and help is welcome!

Transition to C11 atomics and memory model

Commit Message

Comments

Patch