Transition to C11 atomics and memory model

Message ID 1410719669.4967.160.camel@triegel.csb
State Not applicable

Commit Message

Torvald Riegel Sept. 14, 2014, 6:34 p.m. UTC
  I think we should transition to using the C11 memory model and atomics
instead of the current custom implementation.  There are two reasons for
this:

1) Compilers need to know which memory accesses are atomic (and thus
potentially concurrent with other accesses to the same location) and
which aren't (and can thus be optimized more aggressively).  We
currently make the compiler do the right thing by using inline asm, but
the existence of atomic_forced_read shows that this isn't the full
picture.
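
To illustrate with a minimal sketch (all names made up; the C11
spellings are used since the glibc names don't exist yet): with a plain
access, the compiler may assume data-race freedom and hoist the load out
of a busy-wait loop, whereas a relaxed atomic load must be redone on
every iteration.

  #include <stdatomic.h>

  int plain_flag;        /* concurrently written by another thread */
  atomic_int c11_flag;   /* likewise, but declared atomic */

  /* Plain load: may legally be hoisted, turning this into
     'if (plain_flag == 0) for (;;);'.  */
  while (plain_flag == 0)
    ;

  /* Relaxed atomic load: performed on each iteration.  */
  while (atomic_load_explicit (&c11_flag, memory_order_relaxed) == 0)
    ;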

2) Over time, more and more programmers will become familiar with the
C11 model.  Our current atomics are different, so if we use the C11
model, it will likely be easier for future glibc developers to work on
glibc's concurrent code.  This also applies to the tool support (e.g.,
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/).


We can't rely on having a conforming C11 compiler yet, so I think it's
best if we add new atomic_ functions that closely resemble the atomic
operations C11 provides.  We use the C11 memory model (ie, the
semantics) by either using C11 support (if the compiler has it) or
building our own atomics in such a way that they implement the HW
instruction side of a C11 memory model implementation.  We can't force
the compiler to be C11-conforming in its transformations, but our
atomics seem to work currently, so we expect this to continue working.


I propose that our first phase in transitioning to C11 focus on the
uses of the atomic operations.  In particular, the rules are:

* All accesses to atomic vars need to use atomic_* functions.  IOW, all
non-atomic accesses must be free of data races.  The only exception
is initialization (ie, when the variable is not visible to any other
thread); nonetheless, initialization accesses must not result in data
races with other accesses.  (This exception isn't allowed by C11, but
eases the transition to C11 atomics and likely works fine in current
implementations; as alternative, we could require MO-relaxed stores for
initialization as well.)

* Atomic vars aren't explicitly annotated with atomic types, but just
use the base types.  They need to be naturally aligned.  This makes the
transition easier because we don't get any dependencies on C11 atomic
types.

* On any given architecture, we typically only use the atomic_* ops if
the HW actually supports them; we expect to have pointer-sized atomics
at most.  If the arch has no native support for atomics, it can either
use modified algorithms or emulate atomics differently.

* The atomic ops are similar to the _explicit variation of C11's
functions, except that _explicit is replaced with the last part of the
MO argument (ie, acquire, release, acq_rel, relaxed, seq_cst).  All
arguments (except the MO, which is dropped) are the same as for C11.
That avoids using the same names yet should make the names easy to
understand for people familiar with C11 (see the sketch below).
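
To make the naming and initialization rules concrete, here is a minimal
sketch (the variables and consume() are hypothetical; the atomic_*
names are the proposed ones):

  int data;       /* ordinary variable; all accesses are race-free */
  int ready = 0;  /* atomic variable: plain int, naturally aligned */

  /* Producer: a plain store is fine while nothing is published yet.  */
  data = 42;
  atomic_store_release (&ready, 1);
  /* C11: atomic_store_explicit (&ready, 1, memory_order_release)  */

  /* Consumer.  */
  if (atomic_load_acquire (&ready))
    /* C11: atomic_load_explicit (&ready, memory_order_acquire)  */
    consume (data);  /* race-free: ordered by the acquire/release pair */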


I also propose an incremental transition.  In particular, the steps are
roughly:

1) Add new C11-like atomics.  If GCC supports them on this architecture,
use GCC's atomic builtins.  Make them fall back to the existing atomics
otherwise.  Attached is a small patch that illustrates this.

2) Refactor one use (ie, all the accesses belonging to one algorithm or
group of functions that synchronize with each other) at a time.  This
involves reviewing the code and basically reimplementing the
synchronization bits on top of the C11 memory model (a before/after
sketch follows this list).  We should also take this opportunity to add
any documentation of concurrent code that's missing (which is often the
case).

3) For non-standard atomic ops (eg, atomic_add_negative()), have a look
at all uses and decide whether we really need to keep them.

4) Once all of glibc uses the new atomics, remove the old ones for a
particular arch if the oldest compiler required has support for the
respective builtins.
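
As a before/after sketch of what a refactoring in step 2 might look
like (the flag variable is made up):

  /* Before: ordering expressed by a standalone barrier.  */
  atomic_write_barrier ();
  flag = 1;

  /* After: ordering attached to the access itself, C11-style.  */
  atomic_store_release (&flag, 1);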


Open questions:

* Are the current read/write memory barriers equivalent to C11
acquire/release fences?  I guess that's the case (did I mention lack of
documentation? ;) ), but we should check whether this is true on every
architecture (ie, whether the HW instructions used for read/write
membars are the same as what the compiler would use for
acquire/release).  If not, we can't implement acquire/release based on
read/write membars but need something else for this arch.  I'd
appreciate help from the machine maintainers for this one.

* How do we deal with archs such as older SPARC that don't have CAS and
other archs without HW support for atomics?  Using modified algorithms
should be the best-performing option (eg, if we can use one critical
section instead of a complicated alternative that uses lots of atomic
operations).  However, that means we'll have to maintain more algorithms
(even if they might be simpler).
Furthermore, do all uses of atomics work well with blocking atomics that
might also not be indivisible steps?  For example, the cancellation code
might be affected because a blocking emulation of atomics won't be
async-cancel-safe?

* Which of the catomic_ variants do we really need?  Similarly to the
non-native atomics case, we often might be better off with running a
slightly different nonatomic (or just nonsynchronizing) algorithm in the
first place.  We'll have to review all the uses to be able to tell.


Thoughts?  Any feedback and help is welcome!
  

Comments

Florian Weimer Sept. 15, 2014, 1:56 p.m. UTC | #1
On 09/14/2014 08:34 PM, Torvald Riegel wrote:
> I think we should transition to using the C11 memory model and atomics
> instead of the current custom implementation.  There are two reasons for
> this:

What's the quality of the C11 memory specification?  The things I've 
seen so far do not inspire confidence, e.g.:

“The value of an object visible to a thread T at a particular point is 
the initial value of the object, a value stored in the object by T, or a 
value stored in the object by another thread, according to the rules 
below.” (5.1.2.4/7, perhaps just an “atomic” is missing somewhere.)

“Unlike memset, any call to the memset_s function shall be evaluated 
strictly according to the rules of the abstract machine as described in 
(5.1.2.3).” (K.3.7.4.1/4, if this means anything at all, it seems to 
require suspending all other threads in the program)
  
Torvald Riegel Sept. 15, 2014, 2:28 p.m. UTC | #2
On Mon, 2014-09-15 at 15:56 +0200, Florian Weimer wrote:
> On 09/14/2014 08:34 PM, Torvald Riegel wrote:
> > I think we should transition to using the C11 memory model and atomics
> > instead of the current custom implementation.  There are two reasons for
> > this:
> 
> What's the quality of the C11 memory specification?

The specification's quality is high.  The only open issue I'm aware of
is that there's no proper wording currently to disallow out-of-thin-air
values -- but all the implementers agree that those shouldn't be allowed
to happen by an implementation, and this is something C++ SG1 is working
on.

There's a formalization of the model [1]; the cppmem tool I mentioned
in my previous email, which is based on that formalization, enumerates
all the behaviors allowed for a piece of C11/C++11 code; and there are
formalizations of HW models (eg, Power) together with proofs for how to
implement the C11/C++11 model on top of them [2].

[1] http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
[2] http://www.cl.cam.ac.uk/~pes20/

> The things I've 
> seen so far do not inspire confidence, e.g.:
> 
> “The value of an object visible to a thread T at a particular point is 
> the initial value of the object, a value stored in the object by T, or a 
> value stored in the object by another thread, according to the rules 
> below.” (5.1.2.4/7, perhaps just an “atomic” is missing somewhere.)

That's fine, because there are more "rules below" :)  Those specify
which values are visible, they require data-race freedom, etc.

> “Unlike memset, any call to the memset_s function shall be evaluated 
> strictly according to the rules of the abstract machine as described in 
> (5.1.2.3).” (K.3.7.4.1/4, if this means anything at all, it seems to 
> require suspending all other threads in the program)

Even the abstract machine follows the memory model.  memset isn't
specified to be atomic, so it can participate in data races, so
essentially the programmer has to make sure that there are no data
races.

I suggest looking at the formalization of the model[1].  If you're
interested in understanding things in detail, that's the resource I'd
start with.
  
Carlos O'Donell Sept. 15, 2014, 3:59 p.m. UTC | #3
Torvald,

Thanks for the email, very good questions.

David,

SPARC pre-v9 question at the bottom for you.

On 09/14/2014 02:34 PM, Torvald Riegel wrote:
> I think we should transition to using the C11 memory model and atomics
> instead of the current custom implementation.  There are two reasons for
> this:

Architecturally I think that glibc transitioning to the C11 memory model
is the *only* way forward.

> I propose that our first phase in transitioning to C11 focus on the
> uses of the atomic operations.  In particular, the rules are:
> 
> * All accesses to atomic vars need to use atomic_* functions.  IOW, all
> non-atomic accesses must be free of data races.  The only exception
> is initialization (ie, when the variable is not visible to any other
> thread); nonetheless, initialization accesses must not result in data
> races with other accesses.  (This exception isn't allowed by C11, but
> eases the transition to C11 atomics and likely works fine in current
> implementations; as an alternative, we could require MO-relaxed stores for
> initialization as well.)

At present we rely on small word-length writes to complete atomically;
would you suggest we have to wrap those in true atomic operations?
Won't this hurt performance? What correctness issue exists?

> * Atomic vars aren't explicitly annotated with atomic types, but just
> use the base types.  They need to be naturally aligned.  This makes the
> transition easier because we don't get any dependencies on C11 atomic
> types.

OK.
 
> * On any given architecture, we typically only use the atomic_* ops if
> the HW actually supports them; we expect to have pointer-sized atomics
> at most.  If the arch has no native support for atomics, it can either
> use modified algorithms or emulate atomics differently.

I strongly suggest all such machines should emulate atomics in the kernel
using kernel-level locks. The downside of this is that all atomic vars
must use atomic_* functions because otherwise the release of the lock
word by a normal store won't order correctly. This already happened on
hppa with userspace spinlocks.
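
For reference, the ARM-style kernel assist looks roughly like this (a
sketch following Linux's kernel_user_helpers documentation; the
kernel_cas wrapper is made up):

  /* The kernel publishes __kernel_cmpxchg at a fixed address and
     guarantees it behaves atomically (eg, by restarting it on
     preemption).  It returns 0 iff *ptr was changed from oldval to
     newval.  */
  typedef int (*__kuser_cmpxchg_t) (int oldval, int newval,
                                    volatile int *ptr);
  #define __kuser_cmpxchg (*(__kuser_cmpxchg_t) 0xffff0fc0)

  static inline int
  kernel_cas (volatile int *ptr, int oldval, int newval)
  {
    return __kuser_cmpxchg (oldval, newval, ptr) == 0;
  }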

> * The atomic ops are similar to the _explicit variation of C11's
> functions, except that _explicit is replaced with the last part of the
> MO argument (ie, acquire, release, acq_rel, relaxed, seq_cst).  All
> arguments (except the MO, which is dropped) are the same as for C11.
> That avoids using the same names yet should make the names easy to
> understand for people familiar with C11.

OK.
 
> I also propose an incremental transition.  In particular, the steps are
> roughly:
> 
> 1) Add new C11-like atomics.  If GCC supports them on this architecture,
> use GCC's atomic builtins.  Make them fall back to the existing atomics
> otherwise.  Attached is a small patch that illustrates this.

OK.

> 2) Refactor one use (ie, all the accesses belonging to one algorithm or
> group of functions that synchronize with each other) at a time.  This
> involves reviewing the code and basically reimplementing the
> synchronization bits in on top of the C11 memory model.  We also should
> take this opportunity to add any documentation of concurrent code that's
> missing (which is often the case).

Not OK until we talk about it more.

> 3) For non-standard atomic ops (eg, atomic_add_negative()), have a look
> at all uses and decide whether we really need to keep them.

Agreed e.g. rewrite.

> 4) Once all of glibc uses the new atomics, remove the old ones for a
> particular arch if the oldest compiler required has support for the
> respective builtins.
 
OK.
 
> Open questions:
> 
> * Are the current read/write memory barriers equivalent to C11
> acquire/release fences?  I guess that's the case (did I mention lack of
> documentation? ;) ), but we should check whether this is true on every
> architecture (ie, whether the HW instructions used for read/write
> membars are the same as what the compiler would use for
> acquire/release).  If not, we can't implement acquire/release based on
> read/write membars but need something else for this arch.  I'd
> appreciate help from the machine maintainers for this one.

Don't know.

Create an internals manual? Add a new chapter on atomics? :-)

> * How do we deal with archs such as older SPARC that don't have CAS and
> other archs without HW support for atomics?  Using modified algorithms
> should be the best-performing option (eg, if we can use one critical
> section instead of a complicated alternative that uses lots of atomic
> operations).  However, that means we'll have to maintain more algorithms
> (even if they might be simpler).

No. Stop. One algorithm. All arches that can't meet the HW support for
atomics must enter the kernel and do the work there. This is just what
hppa and ARM do. They use a light-weight syscall mechanism and serialize in the
kernel.

> Furthermore, do all uses of atomics work well with blocking atomics that
> might also not be indivisible steps?  For example, the cancellation code
> might be affected because a blocking emulation of atomics won't be
> async-cancel-safe?

It will be safe because the kernel emulation should not deliver a signal
during the emulation.

> * Which of the catomic_ variants do we really need?  Similarly to the
> non-native atomics case, we often might be better off with running a
> slightly different nonatomic (or just nonsynchronizing) algorithm in the
> first place.  We'll have to review all the uses to be able to tell.
 
Don't know.
 
> Thoughts?  Any feedback and help is welcome!

Cheers,
Carlos.
  
Joseph Myers Sept. 15, 2014, 4:51 p.m. UTC | #4
On Sun, 14 Sep 2014, Torvald Riegel wrote:

> 1) Add new C11-like atomics.  If GCC supports them on this architecture,
> use GCC's atomic builtins.  Make them fall back to the existing atomics
> otherwise.  Attached is a small patch that illustrates this.

I'm not clear on how you propose to determine whether to define 
USE_ATOMIC_COMPILER_BUILTINS.

The obvious approach of testing __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 (etc.), 
together with making sure the GCC version is at least 4.7, runs into 
architecture-specific problems:

* While the builtins are available in 4.7, on some architectures they may 
be suboptimal before a later version (see the MIPS code preferring to use 
them only for 4.8 and later, but using them for 4.7 in the MIPS16 case; 
generally this applies if the GCC support was originally implemented for 
__sync_*, with its stronger barrier requirements, but the architecture 
does support specifying the required memory ordering more precisely, and 
the GCC support for that was added later).

* In some cases those macros are defined by GCC even though the operations 
are out of line in libgcc.  On PA the HAVE_sync_compare_and_swap* macros 
are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to 
be predefined; for Tile GCC has special target-specific code to define 
__GCC_HAVE_SYNC_COMPARE_AND_SWAP_*.  Do we believe such out-of-line libgcc 
versions will work in glibc, and is it preferable to use them or inline 
asm?  (My guess is that they might work except for cases such as the 
64-bit ARM atomics with references to libc interfaces.)

So if you have a default based on those macros, you'd need to allow for 
architectures overriding it.
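
For instance, an arch's atomic-machine.h could make the decision along
these lines (a sketch; the 4.8 cutoff is hypothetical and exactly the
part each architecture would have to choose):

  /* Use the __atomic builtins only if this GCC generates good code for
     them on this architecture.  */
  #if __GNUC_PREREQ (4, 8) && defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
  # define USE_ATOMIC_COMPILER_BUILTINS 1
  #else
  # define USE_ATOMIC_COMPILER_BUILTINS 0
  #endif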
  
Torvald Riegel Sept. 15, 2014, 7:27 p.m. UTC | #5
On Mon, 2014-09-15 at 16:51 +0000, Joseph S. Myers wrote:
> On Sun, 14 Sep 2014, Torvald Riegel wrote:
> 
> > 1) Add new C11-like atomics.  If GCC supports them on this architecture,
> > use GCC's atomic builtins.  Make them fall back to the existing atomics
> > otherwise.  Attached is a small patch that illustrates this.
> 
> I'm not clear on how you propose to determine whether to define 
> USE_ATOMIC_COMPILER_BUILTINS.

Yes, sorry for leaving that unspecified.  I intended this flag to be
arch-specific, not a global setting; in fact, the comments you
previously made on the subject motivated that.

We can handle most of the transition to C11 atomics in a
non-arch-specific way (because the memory model itself is portable).
But for the arch-specific changes I'd really appreciate input / help
from the machine maintainers.

> * In some cases those macros are defined by GCC even though the operations 
> are out of line in libgcc.  On PA the HAVE_sync_compare_and_swap* macros 
> are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to 
> be predefined; for Tile GCC has special target-specific code to define 
> __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*.  Do we believe such out-of-line libgcc 
> versions will work in glibc, and is it preferable to use them or inline 
> asm?  (My guess is that they might work except for cases such as the 
> 64-bit ARM atomics with references to libc interfaces.)

I believe they should work from a correctness perspective, but might
have a slight performance disadvantage.  I'm not sure I understand the
ARM case you mention.
  
Torvald Riegel Sept. 15, 2014, 7:52 p.m. UTC | #6
On Mon, 2014-09-15 at 11:59 -0400, Carlos O'Donell wrote:
> Torvald,
> 
> Thanks for the email, very good questions.
> 
> David,
> 
> SPARC pre-v9 question at the bottom for you.
> 
> On 09/14/2014 02:34 PM, Torvald Riegel wrote:
> > I think we should transition to using the C11 memory model and atomics
> > instead of the current custom implementation.  There are two reasons for
> > this:
> 
> Architecturally I think that glibc transitioning to the C11 memory model
> is the *only* way forward.
> 
> > I propose that our first phase in transitioning to C11 focus on the
> > uses of the atomic operations.  In particular, the rules are:
> > 
> > * All accesses to atomic vars need to use atomic_* functions.  IOW, all
> > non-atomic accesses must be free of data races.  The only exception
> > is initialization (ie, when the variable is not visible to any other
> > thread); nonetheless, initialization accesses must not result in data
> > races with other accesses.  (This exception isn't allowed by C11, but
> > eases the transition to C11 atomics and likely works fine in current
> > implementations; as an alternative, we could require MO-relaxed stores for
> > initialization as well.)
> 
> At present we rely on small word-length writes to complete atomically;
> would you suggest we have to wrap those in true atomic operations?

Yes!  That's what the C11 model requires.  And we don't use special
types for atomically accessed variables, so if we don't use atomic ops
for every access there's no way for the compiler to figure out which
memory locations are accessed concurrently, and which aren't.

> Won't this hurt performance?

I don't think so.  We'll use memory_order_relaxed in each case where we
have a plain memory access right now to concurrently accessed data.  The
only reasons I can think of that might lead to a decrease in performance
are:
* Inefficient implementation of atomics, for example due to a too old
compiler version (see the example that Joseph brought up).  That can be
worked around on an arch-specific basis, for example by using inline-asm
differently.
* Compilers that don't optimize across memory_order_relaxed atomic ops,
in code where glibc actually benefits from compiler optimizations across
the current plain memory accesses.  I doubt that this actually happens
in practice, because it would need a loop or such, and other things in
the loop would need to be performance-critical -- which is not a pattern
I think is frequent in concurrent code.
* If we currently have code where the compiler combines several plain
memory accesses to concurrently accessed data into one, then we could
have more accesses when using memory_order_relaxed atomics.  However,
such an optimization can easily be something the programmer never
intended (e.g., in a busy-waiting loop -- hence atomic_forced_read...).

> What correctness issue exists?

The compiler needs to know for which memory accesses it can assume
data-race-freedom and for which it doesn't.  This results in different
optimizations being valid, or not.  You can try to make the compiler do
the right thing by forcing it towards doing that, but the clean way is
to actually tell the compiler what it should do.

For example, there were bugs fixed in GCC for optimizations across or
involving atomic operations; these optimizations were valid for
sequential code, but not in a multi-threaded setting.  This shows that
things can go wrong if we don't tell the compiler.

> > * Atomic vars aren't explicitly annotated with atomic types, but just
> > use the base types.  They need to be naturally aligned.  This makes the
> > transition easier because we don't get any dependencies on C11 atomic
> > types.
> 
> OK.
>  
> > * On any given architecture, we typically only use the atomic_* ops if
> > the HW actually supports them; we expect to have pointer-sized atomics
> > at most.  If the arch has no native support for atomics, it can either
> > use modified algorithms or emulate atomics differently.
> 
> I strongly suggest all such machines should emulate atomics in the kernel
> using kernel-level locks.

That's fine for me.  I just didn't want to make this decision for
machine maintainers.

> The downside of this is that all atomic vars
> must use atomic_* functions because otherwise the release of the lock
> word by a normal store won't order correctly. This already happened on
> hppa with userspace spinlocks.

But that's The Right Thing To Do anyway, so I don't see it as a
downside.
IMO, when writing concurrent code, the least of your troubles is writing
   atomic_store_relaxed(&foo, 23)
instead of
   foo = 23;
IMHO, it actually aids in readability because it shows where concurrency
has to be considered and where not (ie, for all data accessed by only
one thread).

> > * The atomic ops are similar to the _explicit variation of C11's
> > functions, except that _explicit is replaced with the last part of the
> > MO argument (ie, acquire, release, acq_rel, relaxed, seq_cst).  All
> > arguments (except the MO, which is dropped) are the same as for C11.
> > That avoids using the same names yet should make the names easy to
> > understand for people familiar with C11.
> 
> OK.
>  
> > I also propose an incremental transition.  In particular, the steps are
> > roughly:
> > 
> > 1) Add new C11-like atomics.  If GCC supports them on this architecture,
> > use GCC's atomic builtins.  Make them fall back to the existing atomics
> > otherwise.  Attached is a small patch that illustrates this.
> 
> OK.
> 
> > 2) Refactor one use (ie, all the accesses belonging to one algorithm or
> > group of functions that synchronize with each other) at a time.  This
> > involves reviewing the code and basically reimplementing the
> > synchronization bits on top of the C11 memory model.  We also should
> > take this opportunity to add any documentation of concurrent code that's
> > missing (which is often the case).
> 
> Not OK until we talk about it more.

So, what do you want to talk about? ;)

> > 3) For non-standard atomic ops (eg, atomic_add_negative()), have a look
> > at all uses and decide whether we really need to keep them.
> 
> Agreed e.g. rewrite.
> 
> > 4) Once all of glibc uses the new atomics, remove the old ones for a
> > particular arch if the oldest compiler required has support for the
> > respective builtins.
>  
> OK.
>  
> > Open questions:
> > 
> > * Are the current read/write memory barriers equivalent to C11
> > acquire/release fences?  I guess that's the case (did I mention lack of
> > documentation? ;) ), but we should check whether this is true on every
> > architecture (ie, whether the HW instructions used for read/write
> > membars are the same as what the compiler would use for
> > acquire/release).  If not, we can't implement acquire/release based on
> > read/write membars but need something else for this arch.  I'd
> > appreciate help from the machine maintainers for this one.
> 
> Don't know.
> 
> Create an internals manual? Add a new chapter on atomics? :-)

I had hoped we could make do with comments in atomic.h, given that we
basically just need to explain our exceptions.
I agree we should have documentation for the transition, but I also
think this just documents existing state.  Every new architecture added
should target C11 right away (and every new architecture's designer had
better make it clear how to implement C11 on top of the HW :)

> > * How do we deal with archs such as older SPARC that don't have CAS and
> > other archs without HW support for atomics?  Using modified algorithms
> > should be the best-performing option (eg, if we can use one critical
> > section instead of a complicated alternative that uses lots of atomic
> > operations).  However, that means we'll have to maintain more algorithms
> > (even if they might be simpler).
> 
> No. Stop. One algorithm. All arches that can't meet the HW support for
> atomics must enter the kernel and do the work there. This is just what
> hppa and ARM do. They use a light-weight syscall mechanism and serialize in the
> kernel.

Fine with me.

> > Furthermore, do all uses of atomics work well with blocking atomics that
> > might also not be indivisible steps?  For example, the cancellation code
> > might be affected because a blocking emulation of atomics won't be
> > async-cancel-safe?
> 
> It will be safe because the kernel emulation should not deliver a signal
> during the emulation.

What about PI code?  If we have a non-blocking algorithm in glibc (using
nonblocking atomics), then this will work fine with hard real time
priorities; when using blocking atomics, we might get priority
inversion.
  
Chris Metcalf Sept. 15, 2014, 11:37 p.m. UTC | #7
On 9/15/2014 11:59 AM, Carlos O'Donell wrote:
>> * How do we deal with archs such as older SPARC that don't have CAS and
>> other archs without HW support for atomics?  Using modified algorithms
>> should be the best-performing option (eg, if we can use one critical
>> section instead of a complicated alternative that uses lots of atomic
>> operations).  However, that means we'll have to maintain more algorithms
>> (even if they might be simpler).
> No. Stop. One algorithm. All arches that can't meet the HW support for
> atomics must enter the kernel and do the work there. This is just what
> hppa and ARM do. They use a light-weight syscall mechanism and serialize in the
> kernel.

Agreed!  And, tilepro is also on that list of platforms with weak atomics that uses a light-weight syscall to serialize in the kernel.  (Unlike tilegx, which has fairly rich atomics.)

On 9/15/2014 12:51 PM, Joseph S. Myers wrote:
> * In some cases those macros are defined by GCC even though the operations
> are out of line in libgcc.  On PA the HAVE_sync_compare_and_swap* macros
> are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to
> be predefined; for Tile GCC has special target-specific code to define
> __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*.  Do we believe such out-of-line libgcc
> versions will work in glibc, and is it preferable to use them or inline
> asm?

Yes, the tile out-of-line versions should work fine in glibc.  The tilepro out-of-line atomics do the lightweight syscalls.  On tilegx out-of-line support is only used for 1- and 2-byte atomics, which are not supported in hardware, so the out-of-line code does a 4-byte compare-and-exchange loop to synthesize the operation.  However, I suspect we are also not using those sizes in glibc itself - or if we are, we might decide we don't want to be.
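
For readers unfamiliar with the technique, synthesizing a 1-byte CAS
from a 4-byte one looks roughly like this (a generic sketch, not the
actual tile code; assumes little-endian byte numbering):

  #include <stdint.h>

  /* Returns nonzero iff *p was changed from oldval to newval.  */
  static int
  cas_1byte (uint8_t *p, uint8_t oldval, uint8_t newval)
  {
    uint32_t *word = (uint32_t *) ((uintptr_t) p & ~(uintptr_t) 3);
    unsigned int shift = ((uintptr_t) p & 3) * 8;
    uint32_t mask = (uint32_t) 0xff << shift;
    uint32_t old_word, new_word;
    do
      {
        old_word = *word;  /* real code would use an atomic load */
        if ((uint8_t) ((old_word & mask) >> shift) != oldval)
          return 0;        /* our byte differs: the CAS fails */
        new_word = (old_word & ~mask) | ((uint32_t) newval << shift);
      }
    while (!__sync_bool_compare_and_swap (word, old_word, new_word));
    return 1;
  }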
  
Joseph Myers Sept. 16, 2014, 4:06 p.m. UTC | #8
On Mon, 15 Sep 2014, Torvald Riegel wrote:

> > * In some cases those macros are defined by GCC even though the operations 
> > are out of line in libgcc.  On PA the HAVE_sync_compare_and_swap* macros 
> > are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to 
> > be predefined; for Tile GCC has special target-specific code to define 
> > __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*.  Do we believe such out-of-line libgcc 
> > versions will work in glibc, and is it preferable to use them or inline 
> > asm?  (My guess is that they might work except for cases such as the 
> > 64-bit ARM atomics with references to libc interfaces.)
> 
> I believe they should work from a correctness perspective, but might
> have a slight performance disadvantage.  I'm not sure I understand the
> ARM case you mention.

The 64-bit ARM atomics (linux-atomic-64bit.c) rely on kernel helpers that 
are only available in some kernel versions and involve a function called 
at startup to test for availability of those helpers and call __write / 
abort if not available.

I don't know if that would work in all places where glibc uses atomics, 
but I think we should avoid the issue by making sure not to use 64-bit 
atomics at all on 32-bit platforms (and hopefully ensure that such atomics 
produce a compile / link failure if used).
  
Torvald Riegel Sept. 16, 2014, 4:46 p.m. UTC | #9
On Tue, 2014-09-16 at 16:06 +0000, Joseph S. Myers wrote:
> On Mon, 15 Sep 2014, Torvald Riegel wrote:
> 
> > > * In some cases those macros are defined by GCC even though the operations 
> > > are out of line in libgcc.  On PA the HAVE_sync_compare_and_swap* macros 
> > > are defined in pa-linux.h to cause __GCC_HAVE_SYNC_COMPARE_AND_SWAP_* to 
> > > be predefined; for Tile GCC has special target-specific code to define 
> > > __GCC_HAVE_SYNC_COMPARE_AND_SWAP_*.  Do we believe such out-of-line libgcc 
> > > versions will work in glibc, and is it preferable to use them or inline 
> > > asm?  (My guess is that they might work except for cases such as the 
> > > 64-bit ARM atomics with references to libc interfaces.)
> > 
> > I believe they should work from a correctness perspective, but might
> > have a slight performance disadvantage.  I'm not sure I understand the
> > ARM case you mention.
> 
> The 64-bit ARM atomics (linux-atomic-64bit.c) rely on kernel helpers that 
> are only available in some kernel versions and involve a function called 
> at startup to test for availability of those helpers and call __write / 
> abort if not available.
> 
> I don't know if that would work in all places where glibc uses atomics, 
> but I think we should avoid the issue by making sure not to use 64-bit 
> atomics at all on 32-bit platforms (and hopefully ensure that such atomics 
> produce a compile / link failure if used).

Agreed.  I think we should by default not expect more than pointer-sized
atomic ops.  If a particular concurrent algorithm implementation really
needs more, it's often more efficient to use a solution specific to that
algorithm (e.g., exploiting monotonicity in the values or such) rather
than try to rely on kernel helpers or locks for that.
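
One way to get that link failure is the classic trick of calling a
deliberately undefined function on the unsupported path (a sketch; the
names are made up, and it relies on the compiler eliminating the dead
branch, which it does at the optimization levels glibc builds with):

  /* Never defined anywhere; a surviving call fails at link time.  */
  extern void __atomic_link_error (void);

  #define __atomic_check_size(mem) \
    do                                                                \
      {                                                               \
        if (sizeof (*(mem)) != 4)                                     \
          __atomic_link_error ();                                     \
      }                                                               \
    while (0)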
  
Joseph Myers Sept. 16, 2014, 6:17 p.m. UTC | #10
On Sun, 14 Sep 2014, Torvald Riegel wrote:

> * All accesses to atomic vars need to use atomic_* functions.  IOW, all
> non-atomic accesses must be free of data races.  The only exception
> is initialization (ie, when the variable is not visible to any other
> thread); nonetheless, initialization accesses must not result in data
> races with other accesses.  (This exception isn't allowed by C11, but
> eases the transition to C11 atomics and likely works fine in current
> implementations; as an alternative, we could require MO-relaxed stores for
> initialization as well.)

Note <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273> regarding 
MO-relaxed operations not being well-optimized.  It will be important to 
compare the code generated before and after any changes, and may be 
necessary to map the "relaxed" operations to plain loads and stores in 
some cases depending on the compiler version.
  
Torvald Riegel Sept. 16, 2014, 6:33 p.m. UTC | #11
On Tue, 2014-09-16 at 18:17 +0000, Joseph S. Myers wrote:
> On Sun, 14 Sep 2014, Torvald Riegel wrote:
> 
> > * All accesses to atomic vars need to use atomic_* functions.  IOW, all
> > non-atomic accesses must be free of data races.  The only exception
> > is initialization (ie, when the variable is not visible to any other
> > thread); nonetheless, initialization accesses must not result in data
> > races with other accesses.  (This exception isn't allowed by C11, but
> > eases the transition to C11 atomics and likely works fine in current
> > implementations; as an alternative, we could require MO-relaxed stores for
> > initialization as well.)
> 
> Note <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273> regarding 
> MO-relaxed operations not being well-optimized.

I mentioned this already in my reply to a question by Carlos (in this
thread) -- but thanks for noting the GCC bug:

* Compilers that don't optimize across memory_order_relaxed atomic ops,
in code where glibc actually benefits from compiler optimizations across
the current plain memory accesses.  I doubt that this actually happens
in practice, because it would need a loop or such, and other things in
the loop would need to be performance-critical -- which is not a pattern
I think is frequent in concurrent code.
* If we currently have code where the compiler combines several plain
memory accesses to concurrently accessed data into one, then we could
have more accesses when using memory_order_relaxed atomics.  However,
such an optimization can easily be something the programmer never
intended (e.g., in a busy-waiting loop -- hence atomic_forced_read...).

> It will be important to 
> compare the code generated before and after any changes, and may be 
> necessary to map the "relaxed" operations to plain loads and stores in 
> some cases depending on the compiler version.

I agree that we need to compare, but I wouldn't be happy with mapping
back to plain loads and stores either.  Sure, it seems to work right
now, but really we're just keeping our fingers crossed.

My *guess* would be that we won't see slow-down for any of the pthreads
synchronization data structures, but might see some in cases where
atomics are used for things like statistics counters in malloc or such
(ie, where they are surrounded by significant amounts of nonconcurrent
code without additional memory barriers and such).
  
Torvald Riegel Sept. 16, 2014, 9:48 p.m. UTC | #12
On Sun, 2014-09-14 at 20:34 +0200, Torvald Riegel wrote:
> 1) Add new C11-like atomics.  If GCC supports them on this architecture,
> use GCC's atomic builtins.  Make them fall back to the existing atomics
> otherwise.  Attached is a small patch that illustrates this.

I made a scan through the current atomics implementation on different
archs.  It seems as if there's some disagreement about which operations
have which barrier semantics, and which ones are worthwhile to have
specific implementations of vs. falling back to a CAS loop.

Things that stood out for me (some may be bugs):
* Membars not being defined on i486 and x86_64 (unless I missed
something).  The HW memory model for these two is TSO, which would mean
that algorithms that rely on atomic_full_barrier won't work.  I'm not
sure whether we have these in current code, but atomic_full_barrier is
used in code I'm working on.
* mips use of __sync_lock_test_and_set for both _acq and _rel is
surprising.  Is this __sync builtin different on mips, or is the _rel
variant broken and needs a stronger barrier?
* powerpc having memory_order_relaxed semantics on some operations while
other archs use _acq, yet has acquire semantics for
decrement_if_positive.  Inconsistency?  Just a performance problem or a
correctness problem?
* What's the real semantics of atomic_write_barrier?  The SPARC code
looks as if this wouldn't be equivalent to a memory_order_release fence.
But for Power I'd guess it is?


Archs marked with ??? are those I know very little about.
Read/write/full membar are the current ones (not C11).  _acq and _rel
refer to what the current code understands as that, not C11
memory_order_acquire and memory_order_release.


Default:
* If _rel variant not defined by arch, _acq is used.
* All atomic ops without an _acq or _rel suffix (eg,
atomic_exchange_and_add) default to have _acq semantics.
* atomic_decrement_if_positive has _acq semantics on the success path
(ie, if the value actually gets decremented) but relaxed semantics on
the failure path (ie, if the result of the decrement would be negative).
* Full membars are just compiler barriers (ie, __asm ("" ::: "memory")).
* Read/write membars default to full membars.


aarch64:
* Uses the C11 GCC builtins. 8-64b CAS + exchange.
* _acq and _rel implemented as memory_order_acquire /
memory_order_release.
* CAS failure path has memory_order_relaxed.
* Only full membar defined.

alpha ???:
* Custom asm for CAS + exchange. 8-64b.
* Full membar defined and equal to read membar.  Write membar is
different.
* CAS failure path seems to have memory_order_relaxed.
* atomic_exchange_and_add uses full membar.

arm:
* Uses either C11 GCC builtins, the older _sync* builtins, or CAS
provided by the OS (_arm_assisted...).  32b only CAS + exchange.
* CAS failure path has memory_order_relaxed.
* _acq and _rel implemented as mo_acquire / mo_release, except when
using _arm_assisted*, in which case only _acq is used.
* Only full membar defined.

hppa ???:
* Uses kernel for CAS. ??? How is the kernel aware of the size of the
CAS?
* Only _acq CAS defined.
* Membars are not defined.

i486:
* Custom asm for CAS, exchange, exchange_and_add, add, add_negative,
add_zero, increment, increment_and_test, decrement, decrement_and_test,
bit_set, and, or.  All for 8-32b.
* atomic_delay has custom asm.
* atomic_exchange_and_add uses the __sync builtin, so is a full membar.
* Membars are not defined.

ia64 ???:
* CAS based on __sync builtins; exchange uses __sync_lock_test_and_set
with proper release membar added for _rel.  32-64b.
* atomic_exchange_and_add is a full membar.
* Full barrier defined as __sync_synchronize.

m68k coldfire ???:
* Uses kernel for CAS. ??? 32b only it seems?
* Uses kernel for full membar.

m68k 68020 ???:
* Custom asm for CAS, exchange, exchange_and_add, add,
increment_and_test, decrement_and_test, bit_set, bit_test_set.  8-64b.
* Membars are not defined.

microblaze ???:
* Custom asm for CAS, exchange, exchange_and_add, increment, decrement.
32b.
* Membars are not defined.

mips ???:
* Uses C11 GCC builtins, __sync (see Joseph's notes on GCC version
requirements), or custom asm for CAS, exchange.  32b and 64b unless
ABIO32.
* If __sync used, then more operations use the __sync builtins.
* If custom asm used, then exchange_and_add is defined too.
* CAS failure path is memory_order_relaxed (in case of C11 builtins and
custom asm).
* atomic_exchange uses __sync_lock_test_and_set for both _acq and _rel,
but has different barriers for _acq and _rel elsewhere.
* Full membar is defined.

powerpc32:
* Custom asm for CAS, exchange.  32b.
* Custom asm also exists for exchange_and_add, increment, decrement,
decrement_if_positive. The first three of these don't use any membars,
so are likely memory_order_relaxed; however, decrement_if_positive  uses
an acquire barrier.
* CAS failure path has acquire membar on _acq, memory_order_relaxed
semantics on _rel.
* Read membar defined (lwsync or sync).  Full and write mbars defined.

powerpc64:
* Like powerpc32, but custom asm also provided for 64b (same ops).
* CAS failure path has acquire membar on _acq, memory_order_relaxed
semantics on _rel.
* Read membar defined (lwsync or sync).  Full and write mbars defined.

s390 ???:
* Custom asm for CAS. 32-64b.
* No membars defined.

sh ???:
* Kernel used for CAS, exchange, exchange_and_add, add_negative,
add_zero, increment_and_test, decrement_and_test, bit_set, bit_test_set.
8-32b.
* No membars defined.

sparc32 v9 ???:
* Custom asm for CAS, exchange.  32b.
* Full membar defined.
* Read membar defined.   I guess this would work as a
memory_order_acquire barrier (or C11 fence).
* Write membar defined.  This does not seem to implement a
memory_order_release barrier if I interpret the SPARC asm correctly; it
seems more like a traditional "don't reorder the preceding write
with anything else" barrier; it issues a StoreLoad and StoreStore
barrier, but if used as memory_order_release, it will be followed by an
atomic store, so I guess it should be LoadStore and StoreStore?  I
haven't checked what GCC generates, so I'm not sure.

sparc32 pre v9 ???:
* Either uses custom asm for a lock-based implementation of atomics or
uses v9 code.

sparc64 ???:
* Custom asm for CAS, exchange.  32-64b.
* Membars are same as sparc32 v9.

x86_64:
* Uses __sync builtins for CAS, custom asm for exchange,
exchange_and_add, add, add_zero, increment, increment_and_test,
decrement, decrement_and_test, bit_set, bit_test_set, and, or. 8-64b.
* Custom asm for atomic ops that emit lock prefix conditionally (e.g.,
catomic_add).
* Membars are not defined.
  
Joseph Myers Sept. 16, 2014, 10:02 p.m. UTC | #14
On Tue, 16 Sep 2014, Torvald Riegel wrote:

> * mips use of __sync_lock_test_and_set for both _acq and _rel is
> surprising.  Is this __sync builtin different on mips, or is the _rel
> variant broken and needs a stronger barrier?

It sounds like a bug - specific to the case of (MIPS16, GCC before 4.7), 
as that's the only case where __sync_* get used on MIPS, and that case may 
not be widely used.  Please file in Bugzilla.  (__sync_* for MIPS16 then 
in turn ends up using out-of-line libgcc functions that call __sync_* 
built as non-MIPS16 code, but I see nothing special there to give a 
stronger barrier.)
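
For context: GCC documents __sync_lock_test_and_set as an acquire
barrier only, so a _rel exchange built on it needs an explicit barrier
in front, along the lines of this sketch (__sync_synchronize is stronger
than a release fence, but it's the only portable choice among the __sync
builtins):

  #define atomic_exchange_rel(mem, value) \
    ({ __sync_synchronize ();                                         \
       __sync_lock_test_and_set ((mem), (value)); })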
  
Torvald Riegel Sept. 17, 2014, 1:57 p.m. UTC | #15
On Tue, 2014-09-16 at 23:48 +0200, Torvald Riegel wrote:
> * Membars not being defined on i486 and x86_64 (unless I missed
> something).  The HW memory model for these two is TSO, which would mean
> that algorithms that rely on atomic_full_barrier won't work.  I'm not
> sure whether we have these in current code, but atomic_full_barrier is
> used in code I'm working on.

Based on a quick look at uses of atomic_full_barrier, this doesn't seem
to cause incorrect behavior *currently*:
* nptl/pthread_mutex_setprioceiling.c has a full barrier immediately
followed by a futex syscall, which effectively is a full barrier too.
* sysdeps/nptl/fork.c seems to only need an acquire barrier, so on x86
the compiler-only barrier that the full barrier currently amounts to is
sufficient.
* Other generic code that uses full barriers (eg, sem_post) currently
has custom asm versions on x86/x86_64.  However, once we remove those,
the lack of a correct full barrier will be a bug.

To summarize, I think we need to fix the full barriers as soon as we add
other code that uses atomic_full_barrier and is used by x86/x86_64.

This is #17403.
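
For the record, fixing this on x86_64 would amount to something like the
following sketch (mfence is the usual choice; i486-class CPUs that lack
it would use a lock-prefixed dummy instruction instead):

  #define atomic_full_barrier() \
    __asm __volatile ("mfence" : : : "memory")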
  
Torvald Riegel Sept. 17, 2014, 2:07 p.m. UTC | #16
On Tue, 2014-09-16 at 22:02 +0000, Joseph S. Myers wrote:
> On Tue, 16 Sep 2014, Torvald Riegel wrote:
> 
> > * mips use of __sync_lock_test_and_set for both _acq and _rel is
> > surprising.  Is this __sync builtin different on mips, or is the _rel
> > variant broken and needs a stronger barrier?
> 
> It sounds like a bug - specific to the case of (MIPS16, GCC before 4.7), 
> as that's the only case where __sync_* get used on MIPS, and that case may 
> not be widely used.  Please file in Bugzilla.  (__sync_* for MIPS16 then 
> in turn ends up using out-of-line libgcc functions that call __sync_* 
> built as non-MIPS16 code, but I see nothing special there to give a 
> stronger barrier.)
> 

https://sourceware.org/bugzilla/show_bug.cgi?id=17404
  
Torvald Riegel Sept. 17, 2014, 2:53 p.m. UTC | #17
On Tue, 2014-09-16 at 23:48 +0200, Torvald Riegel wrote:
> sparc32 v9 ???:
> * Custom asm for CAS, exchange.  32b.
> * Full membar defined.
> * Read membar defined.   I guess this would work as a
> memory_order_acquire barrier (or C11 fence).
> * Write membar defined.  This does not seem to implement a
> memory_order_release barrier if I interpret the SPARC asm correctly, but
> rather be like a more traditional "don't reorder the preceding write
> with anything else" barrier; it issues a StoreLoad and StoreStore
> barrier, but if used as memory_order_release, it will be followed by an
> atomic store, so I guess it should be LoadStore and StoreStore?  I
> haven't checked what GCC generates, so I'm not sure.

Dave,

on SPARC the membars are currently defined as:

#define atomic_full_barrier() \
  __asm __volatile ("membar #LoadLoad | #LoadStore"			      \
			  " | #StoreLoad | #StoreStore" : : : "memory")
#define atomic_read_barrier() \
  __asm __volatile ("membar #LoadLoad | #LoadStore" : : : "memory")
#define atomic_write_barrier() \
  __asm __volatile ("membar #StoreLoad | #StoreStore" : : : "memory")

From what I can see in uses of atomic_write_barrier and in how it's
implemented on other archs, this is intended to be similar to
memory_order_release.  Throughout glibc, it's often used like this:
  x = foo;
  y = bar;
  atomic_write_barrier;
  flag = 1; // or pointer initialization

So we'd want to prevent reordering of stuff prior to the membar with the
store to flag becoming visible.  This would mean -- if I interpret SPARC
asm correctly -- to have a #LoadStore | #StoreStore membar.  This would
align with SPARC being TSO, so not needing an actual HW membar for
release or acquire (same as x86).  However, atomic_write_barrier
currently includes #StoreLoad, which should make it a full barrier.
This is correct, but stronger than necessary.
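
In other words, a pure release fence would need only the following
(a sketch, assuming my reading of the asm above is right):

  #define atomic_write_barrier() \
    __asm __volatile ("membar #LoadStore | #StoreStore" : : : "memory")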

Am I missing something, is this intended to be a full barrier, or just
an oversight?
  
Torvald Riegel Sept. 17, 2014, 3:09 p.m. UTC | #18
On Sun, 2014-09-14 at 20:34 +0200, Torvald Riegel wrote:
> 2) Refactor one use (ie, all the accesses belonging to one algorithm or
> group of functions that synchronize with each other) at a time.  This
> involves reviewing the code and basically reimplementing the
> synchronization bits on top of the C11 memory model.  We also should
> take this opportunity to add any documentation of concurrent code that's
> missing (which is often the case).

While trying to determine what the intended semantics of
atomic_write_barrier are -- it seems to be indeed intended to be a
memory_order_release fence -- I looked at several uses of
atomic_write_barrier.

This wasn't a detailed analysis, but it seems that acquire barriers are
missing in several places:
* nis/nss_nisplus/<several files>  (pthread_once)
* stdlib/msort.c
* malloc/arena.c
* nptl/tpp.c (pthread_once)
* nptl/allocatestack.c
* inet/getnetgrent_r.c (pthread_once)
* include/list.h (concurrent list?)

Those marked with (pthread_once) are, at first sight, incorrect
implementations of exactly-once execution, due to a lack of
atomic_read_barrier / atomic_thread_fence(memory_order_acquire).

For archs with weak memory models like ARM or Power, this can result in
HW race conditions and state corruption.  Even for archs with stronger
models like x86 and SPARC, this can result in state corruption due to
code reordering by the compiler (it's all plain memory accesses, so the
compiler is allowed to assume that this is sequential code, and can
potentially load speculatively or similar).
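
For illustration, a correct exactly-once scheme needs the acquire side
to pair with the release side, roughly like this (a sketch using the
proposed atomic names; everything else is made up):

  #include <pthread.h>

  static int once_done;     /* 0 = not yet initialized */
  static pthread_mutex_t once_lock = PTHREAD_MUTEX_INITIALIZER;
  static int global_state;  /* stand-in for the initialized data */

  static void
  init_once (void)
  {
    /* The acquire load pairs with the release store below; without
       it, a thread can observe once_done == 1 but stale
       global_state.  */
    if (atomic_load_acquire (&once_done) == 0)
      {
        pthread_mutex_lock (&once_lock);
        if (once_done == 0)     /* re-check, ordered by the lock */
          {
            global_state = 42;  /* plain stores: not yet visible */
            atomic_store_release (&once_done, 1);
          }
        pthread_mutex_unlock (&once_lock);
      }
  }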

I also don't remember seeing documentation of the synchronization scheme
in most of these files.

I think this further motivates transitioning to the C11 memory model.
  

Patch

commit 7bd3b53f2dc61e0bf2ef018140ef1cc83f0827c5
Author: Torvald Riegel <triegel@redhat.com>
Date:   Sun Sep 14 20:04:54 2014 +0200

    Illustrate new function names for atomic ops.

diff --git a/include/atomic.h b/include/atomic.h
index 3e82b6a..939d1c3 100644
--- a/include/atomic.h
+++ b/include/atomic.h
@@ -543,6 +543,86 @@ 
 #endif
 
 
+/* The following functions are a subset of the atomic operations provided by
+   C11.  Usually, a function named atomic_OP_MO(args) is equivalent to C11's
+   atomic_OP_explicit(args, memory_order_MO); exceptions noted below.  */
+
+#if USE_ATOMIC_COMPILER_BUILTINS
+# define atomic_thread_fence_acquire() \
+  __atomic_thread_fence (__ATOMIC_ACQUIRE)
+# define atomic_thread_fence_release() \
+  __atomic_thread_fence (__ATOMIC_RELEASE)
+# define atomic_thread_fence_seq_cst() \
+  __atomic_thread_fence (__ATOMIC_SEQ_CST)
+
+# define __atomic_load_mo(mem, mo) \
+  ({ __typeof (*(mem)) __atg100_val;					      \
+  __atomic_load (mem, &__atg100_val, mo);				      \
+  __atg100_val; })
+# define atomic_load_relaxed(mem) __atomic_load_mo ((mem), __ATOMIC_RELAXED)
+# define atomic_load_acquire(mem) __atomic_load_mo ((mem), __ATOMIC_ACQUIRE)
+
+# define __atomic_store_mo(mem, val, mo) \
+  ({ __typeof (*(mem)) __atg101_val = (val);				      \
+  __atomic_store (mem, &__atg101_val, mo);				      \
+  __atg101_val; })
+# define atomic_store_relaxed(mem, val) \
+  __atomic_store_mo ((mem), (val), __ATOMIC_RELAXED)
+# define atomic_store_release(mem, val) \
+  __atomic_store_mo ((mem), (val), __ATOMIC_RELEASE)
+
+/* TODO atomic_exchange relaxed acquire release acq_rel?  */
+
+/* TODO atomic_compare_exchange_weak relaxed acquire release acq_rel?  */
+/* TODO atomic_compare_exchange_strong relaxed acquire release acq_rel?  */
+
+/* TODO atomic_fetch_add relaxed acquire? release? acq_rel? */
+/* TODO atomic_fetch_sub relaxed acquire? release? acq_rel? */
+/* TODO atomic_fetch_and relaxed acquire? release? acq_rel? */
+/* TODO atomic_fetch_or relaxed acquire? release? acq_rel? */
+
+#else /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
+/* By default, we assume that read, write, and full barriers are equivalent
+   to acquire, release, and seq_cst barriers.  Archs for which this does not
+   hold have to provide custom definitions of the fences.  */
+# ifndef atomic_thread_fence_acquire
+#  define atomic_thread_fence_acquire() atomic_read_barrier ()
+# endif
+# ifndef atomic_thread_fence_release
+#  define atomic_thread_fence_release() atomic_write_barrier ()
+# endif
+# ifndef atomic_thread_fence_seq_cst
+#  define atomic_thread_fence_seq_cst() atomic_full_barrier ()
+# endif
+
+# ifndef atomic_load_relaxed
+#  define atomic_load_relaxed(mem) \
+   ({ __typeof (*(mem)) __atg100_val;					      \
+   __asm ("" : "=r" (__atg100_val) : "0" (*(mem)));			      \
+   __atg100_val; })
+# endif
+# ifndef atomic_load_acquire
+#  define atomic_load_acquire(mem) \
+   ({ __typeof (*(mem)) __atg101_val = atomic_load_relaxed (mem);	      \
+   atomic_thread_fence_acquire ();					      \
+   __atg101_val; })
+# endif
+# ifndef atomic_store_relaxed
+/* XXX Use inline asm here too?  */
+#  define atomic_store_relaxed(mem, val) do { *(mem) = val; } while (0)
+# endif
+# ifndef atomic_store_release
+#  define atomic_store_release(mem, val) \
+   do {									      \
+     atomic_thread_fence_release ();					      \
+     atomic_store_relaxed ((mem), (val));				      \
+   } while (0)
+# endif
+
+/* TODO same as above  */
+
+#endif /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
+
 #ifndef atomic_delay
 # define atomic_delay() do { /* nothing */ } while (0)
 #endif