[2/4] Add atomic operations similar to those provided by C11.

Message ID 1414619416.10085.46.camel@triegel.csb
State Committed

Commit Message

Torvald Riegel Oct. 29, 2014, 9:50 p.m. UTC
  This patch adds atomic operations similar to C11.

The function naming is essentially the C11 names, but with the memory
order argument removed and added as a suffix.  For example, C11's
  atomic_store_explicit(&foo, 23, memory_order_release)
becomes
  atomic_store_relaxed (&foo, 23);

This avoids naming collisions, and it's not much more verbose because we
would make the memory orders explicit anyway.
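For illustration, a minimal usage sketch with the new names (the variables and
helper functions are hypothetical):

  /* 'ready' and 'data' are shared variables; compute_data/consume_data are
     placeholders.  */
  /* Writer: publish the data, then release the flag.  */
  data = compute_data ();
  atomic_store_release (&ready, 1);

  /* Reader: an acquire load of the flag makes the published data visible.  */
  if (atomic_load_acquire (&ready) != 0)
    consume_data (data);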

The provided functions are a subset of what C11 provides.  I think it's
better to start minimal, as we can easily add more operations if needed.
I've chosen this subset based on the operations that would be necessary
for generic implementations of mutexes, pthread_once, rwlock,
semaphores, and condvars.

Likewise, I've chosen to only provide 32b atomics (which we need for
futexes, for example) and additionally 64b atomics if provided by the
arch (that's what __HAVE_64B_ATOMICS is for, see Patch 1/4).
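For illustration only (a sketch; the variable is hypothetical), code that needs
a 64b atomic access has to guard on that macro:

  #if __HAVE_64B_ATOMICS
    uint64_t snapshot = atomic_load_acquire (&stats_counter64);
  #else
    /* Use a 32b counter or fall back to a lock-protected path.  */
  #endif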

Per Joseph's request, when the __atomic builtins are used, attempts to apply
atomic operations to variables of other sizes result in an early (link-time)
error.
If not using the builtins, I rely on the existing atomics to Do The
Right Thing.
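For illustration, a sketch (hypothetical variables) of what the size check
rejects when the builtins are used:

  int32_t word;
  short half;
  atomic_fetch_add_relaxed (&word, 1);  /* OK: 32b (or 64b) object.  */
  atomic_fetch_add_relaxed (&half, 1);  /* __atomic_check_size emits a call to
                                           __atomic_link_error, which is
                                           declared but never defined, so the
                                           build fails at link time.  */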

	* include/atomic.h (__atomic_link_error, __atomic_check_size,
	atomic_thread_fence_acquire, atomic_thread_fence_release,
	atomic_thread_fence_seq_cst, atomic_load_relaxed,
	atomic_load_acquire, atomic_store_relaxed, atomic_store_release,
	atomic_compare_exchange_weak_relaxed,
	atomic_compare_exchange_weak_acquire,
	atomic_compare_exchange_weak_release,
	atomic_exchange_acquire, atomic_exchange_release,
	atomic_fetch_add_relaxed, atomic_fetch_add_acquire,
	atomic_fetch_add_release, atomic_fetch_add_acq_rel,
	atomic_fetch_and_acquire,
	atomic_fetch_or_relaxed, atomic_fetch_or_acquire): New.
  

Comments

Joseph Myers Oct. 29, 2014, 10 p.m. UTC | #1
On Wed, 29 Oct 2014, Torvald Riegel wrote:

> This patch adds atomic operations similar to C11.
> 
> The function naming is essentially the C11 names, but with the memory
> order argument removed and added as a suffix.  For example, C11's
>   atomic_store_explicit(&foo, 23, memory_order_release)
> becomes
>   atomic_store_relaxed (&foo, 23);

As previously discussed, I'm concerned about the explicit relaxed loads 
and stores being defined in terms of __atomic_* (see 
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273>).  Unless and until 
__atomic_* implements relaxed atomics as plain loads/stores not inhibiting 
optimization (as far as compatible with standard semantics), as evidenced 
by the change not significantly changing code generated for uses in glibc, 
I think the glibc implementation should be using plain loads and stores 
rather than __atomic_*.
  
Torvald Riegel Oct. 29, 2014, 10:13 p.m. UTC | #2
On Wed, 2014-10-29 at 22:00 +0000, Joseph S. Myers wrote:
> On Wed, 29 Oct 2014, Torvald Riegel wrote:
> 
> > This patch adds atomic operations similar to C11.
> > 
> > The function naming is essentially the C11 names, but with the memory
> > order argument removed and added as a suffix.  For example, C11's
> >   atomic_store_explicit(&foo, 23, memory_order_release)
> > becomes
> >   atomic_store_relaxed (&foo, 23);
> 
> As previously discussed, I'm concerned about the explicit relaxed loads 
> and stores being defined in terms of __atomic_* (see 
> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273>).  Unless and until 
> __atomic_* implements relaxed atomics as plain loads/stores not inhibiting 
> optimization (as far as compatible with standard semantics), as evidenced 
> by the change not significantly changing code generated for uses in glibc, 
> I think the glibc implementation should be using plain loads and stores 
> rather than __atomic_*.

Yes, see Patch 4/4 for how this looks with pthread_once.  (You review
faster than I can write emails -- nice! :)
I've chosen to keep the builtins for relaxed loads/stores for now, but
we can revisit later when we have transformed more code.
  
Torvald Riegel Oct. 29, 2014, 10:15 p.m. UTC | #3
On Wed, 2014-10-29 at 22:50 +0100, Torvald Riegel wrote:
> This patch adds atomic operations similar to C11.
> 
> The function naming is essentially the C11 names, but with the memory
> order argument removed and added as a suffix.  For example, C11's
>   atomic_store_explicit(&foo, 23, memory_order_release)
> becomes
>   atomic_store_relaxed (&foo, 23);

Typo, sorry.  The latter should in fact be
  atomic_store_release (&foo, 23);
  
Torvald Riegel Oct. 29, 2014, 10:45 p.m. UTC | #4
On Wed, 2014-10-29 at 22:00 +0000, Joseph S. Myers wrote:
> On Wed, 29 Oct 2014, Torvald Riegel wrote:
> 
> > This patch adds atomic operations similar to C11.
> > 
> > The function naming is essentially the C11 names, but with the memory
> > order argument removed and added as a suffix.  For example, C11's
> >   atomic_store_explicit(&foo, 23, memory_order_release)
> > becomes
> >   atomic_store_relaxed (&foo, 23);
> 
> As previously discussed, I'm concerned about the explicit relaxed loads 
> and stores being defined in terms of __atomic_* (see 
> <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273>).  Unless and until 
> __atomic_* implements relaxed atomics as plain loads/stores not inhibiting 
> optimization (as far as compatible with standard semantics), as evidenced 
> by the change not significantly changing code generated for uses in glibc, 
> I think the glibc implementation should be using plain loads and stores 
> rather than __atomic_*.

Let me reply in some more detail.

First, do you agree that we need to make the compiler aware of
concurrency?  For example, it would be bad if the compiler assumes that
it can safely reload from an atomic variable just because it was able to
prove that the loading thread didn't change it in the meantime.

If we assume that, we can either (1) use __atomic* and check all the
generated code, or (2) use inline asm, or (3) use volatile inline asm.
Any other options?  Plain loads will not reliably make the compiler
aware that it has to take concurrent accesses into account.

That might also mean that atomic_store_relaxed should actually use
inline asm (see the comment in the patch).  Thoughts?
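For concreteness, one possible shape of such an asm-based relaxed store (a
sketch only, mirroring the asm-based atomic_load_relaxed fallback in the patch,
not something the patch currently does):

  /* Do the plain store, then have an empty asm consume the memory operand so
     the compiler must materialize the store here and cannot delay or elide
     it.  Atomicity still relies on aligned word stores being single
     accesses.  */
  #define atomic_store_relaxed(mem, val)          \
    do {                                          \
      *(mem) = (val);                             \
      __asm ("" : : "m" (*(mem)));                \
    } while (0)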


However, I would guess that we won't really be affected by 63273 anyway.
The triggering usage there was special in that the sanitizer generates
lots of relaxed atomic accesses and nothing else around them.  That's not
what we have in typical glibc code.  If we use a relaxed access, it's
either (1) in front of a CAS, so we'll have an optimization-constraining
operation close-by anyway, or (2) it's in combination with an explicit
fence next to it (Dekker sync, relaxed load + acquire fence, etc.), so
it's likely that it can't be optimized as freely anyway.
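For example (a sketch with a hypothetical variable), pattern (1) usually looks
like a relaxed load feeding a CAS loop, and the CAS already constrains the
surrounding code:

  int old = atomic_load_relaxed (&futex_word);
  /* On failure the CAS writes the observed value back into 'old', so we can
     simply retry until the flag bit is set.  */
  while (!atomic_compare_exchange_weak_acquire (&futex_word, &old, old | 1))
    ;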

Are there any other examples where the lack of optimizations of relaxed
accesses in typical concurrent code was really decreasing performance
(ie, ignoring the sanitizer thing and non-optimized code such as maybe
comes out of templates that are *expected* to be optimized)?
  
Joseph Myers Oct. 29, 2014, 11:06 p.m. UTC | #5
On Wed, 29 Oct 2014, Torvald Riegel wrote:

> First, do you agree that we need to make the compiler aware of
> concurrency?  For example, it would be bad if the compiler assumes that
> it can safely reload from an atomic variable just because it was able to
> prove that the loading thread didn't change it in the meantime.

I believe that the code already makes it explicit where such reloading 
would be problematic, either (a) using asm with appropriate clobbers so 
the compiler doesn't know the value is unchanged or (b) using volatile.  
In principle atomics could avoid optimization limitations from use of 
volatile (and allow more code motion than might be allowed by the asms), 
but it's not clear if they do improve on volatile at present.

> If we assume that, we can either (1) use __atomic* and check all the
> generated code, or (2) use inline asm, or (3) use volatile inline asm.
> Any other options?  Plain loads will not reliably make the compiler
> aware that it has to take concurrent accesses into account.

As noted, I think appropriate asms or volatile are already in place where 
this is an issue for a plain load.

> That might also mean that atomic_store_relaxed should actually use
> inline asm (see the comment in the patch).  Thoughts?

Well, I don't see a need to supplement __atomic_* with inline asm.  In the 
case of not using __atomic_*, if you find you are replacing places with 
such asms (and there aren't that many) then the replacement may need such 
an asm, but if replacing other places it may be less obvious if an asm is 
needed (or if it will make any code generation difference; if not, it 
should be safe).
  
Torvald Riegel Oct. 30, 2014, 9:27 a.m. UTC | #6
On Wed, 2014-10-29 at 23:06 +0000, Joseph S. Myers wrote:
> On Wed, 29 Oct 2014, Torvald Riegel wrote:
> 
> > First, do you agree that we need to make the compiler aware of
> > concurrency?  For example, it would be bad if the compiler assumes that
> > it can safely reload from an atomic variable just because it was able to
> > prove that the loading thread didn't change it in the meantime.
> 
> I believe that the code already makes it explicit where such reloading 
> would be problematic,

Why do you think this is the case?  There is an atomic_forced_read, for
example, but no atomic_read_exactly_once.

> either (a) using asm with appropriate clobbers so 
> the compiler doesn't know the value is unchanged

Do you have any examples for this where the clobber is not due to a
compiler barrier used for acquire or release fences?

> or (b) using volatile.

But volatile does not guarantee atomicity.  I also think that current
compilers are probably unlikely to split a load or store into different
parts, but can we be sure about this?  What if, for example, we have
adjacent atomic variables that are loaded from, and they happen to be
not perfectly aligned (e.g., that's fine for x86 atomic accesses IIRC),
and the compiler merges some of the accesses and thus splits the last
one?  Can't we just do the right thing (and safe thing!) and tell the
compiler that this is an atomic access?

> In principle atomics could avoid optimization limitations from use of 
> volatile (and allow more code motion than might be allowed by the asms), 
> but it's not clear if they do improve on volatile at present.

(I suppose you mean relaxed MO atomic loads/stores.  For everything else
I hope it's clear that atomics are an improvement...)

But I also see no real evidence that using atomics for relaxed
loads/stores would hurt at all.  The sanitizer use case in the BZ you
cited is arguably special.  The pthread_once case didn't show any
difference on the fast path.  I can test with a __atomic*-providing
compiler with and without using __atomic to look for differences, if
that would help you.

> > If we assume that, we can either (1) use __atomic* and check all the
> > generated code, or (2) use inline asm, or (3) use volatile inline asm.
> > Any other options?  Plain loads will not reliably make the compiler
> > aware that it has to take concurrent accesses into account.
> 
> As noted, I think appropriate asms or volatile are already in place where 
> this is an issue for a plain load.

We don't even have correct barriers everywhere, so I don't really have a
reason to agree to that.

Can we at least agree on having all glibc code use our own
atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
still argue whether to use __atomic* to implement these.  But at least
we can easily switch to __atomic in the future.  Any objections?
  
Joseph Myers Oct. 30, 2014, 5:17 p.m. UTC | #7
On Thu, 30 Oct 2014, Torvald Riegel wrote:

> On Wed, 2014-10-29 at 23:06 +0000, Joseph S. Myers wrote:
> > On Wed, 29 Oct 2014, Torvald Riegel wrote:
> > 
> > > First, do you agree that we need to make the compiler aware of
> > > concurrency?  For example, it would be bad if the compiler assumes that
> > > it can safely reload from an atomic variable just because it was able to
> > > prove that the loading thread didn't change it in the meantime.
> > 
> > I believe that the code already makes it explicit where such reloading 
> > would be problematic,
> 
> Why do you think this is the case?  There is an atomic_forced_read, for
> example, but no atomic_read_exactly_once.

See for example elf/dl-lookup.c:do_lookup_x, using an asm to ensure reads 
happen in a particular order:

  size_t n = scope->r_nlist;
  /* Make sure we read the value before proceeding.  Otherwise we
     might use r_list pointing to the initial scope and r_nlist being
     the value after a resize.  That is the only path in dl-open.c not
     protected by GSCOPE.  A read barrier here might be to expensive.  */
  __asm volatile ("" : "+r" (n), "+m" (scope->r_list));
  struct link_map **list = scope->r_list;

(There are lots of cases of asms being used - via macros - to control 
exactly when floating-point evaluations take place and associated 
exceptions are raised, but that's not relevant to atomics.  In general, 
any asm of an empty string is probably doing something like this.)

> > either (a) using asm with appropriate clobbers so 
> > the compiler doesn't know the value is unchanged
> 
> Do you have any examples for this where the clobber is not due to a
> compiler barrier used for acquire or release fences?

I don't understand the question.

> > or (b) using volatile.
> 
> But volatile does not guarantee atomicity.  I also think that current

The point is to guarantee ordering (with the atomicity of aligned word 
loads / stores being part of the GNU C language semantics required for 
building glibc).

> But I also see no real evidence that using atomics for relaxed
> loads/stores would hurt at all.  The sanitizer use case in the BZ you
> cited is arguably special.  The pthread_once case didn't show any
> difference on the fast path.  I can test with a __atomic*-providing
> compiler with and without using __atomic to look for differences, if
> that would help you.

Yes.  We may find in particular cases (possibly all of them) that the 
change does not have significant code generation consequences, but given 
the known bug we should actually check that in each case rather than 
presuming it.

> Can we at least agree on having all glibc code use our own
> atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
> still argue whether to use __atomic* to implement these.  But at least
> we can easily switch to __atomic in the future.  Any objections?

I think explicit atomic loads / stores are fine, subject to checking code 
generation before making them use __atomic_*.
  
Torvald Riegel Oct. 30, 2014, 6:20 p.m. UTC | #8
On Thu, 2014-10-30 at 17:17 +0000, Joseph S. Myers wrote:
> On Thu, 30 Oct 2014, Torvald Riegel wrote:
> 
> > On Wed, 2014-10-29 at 23:06 +0000, Joseph S. Myers wrote:
> > > On Wed, 29 Oct 2014, Torvald Riegel wrote:
> > > 
> > > > First, do you agree that we need to make the compiler aware of
> > > > concurrency?  For example, it would be bad if the compiler assumes that
> > > > it can safely reload from an atomic variable just because it was able to
> > > > prove that the loading thread didn't change it in the meantime.
> > > 
> > > I believe that the code already makes it explicit where such reloading 
> > > would be problematic,
> > 
> > Why do you think this is the case?  There is an atomic_forced_read, for
> > example, but no atomic_read_exactly_once.
> 
> See for example elf/dl-lookup.c:do_lookup_x, using an asm to ensure reads 
> happen in a particular order:
> 
>   size_t n = scope->r_nlist;
>   /* Make sure we read the value before proceeding.  Otherwise we
>      might use r_list pointing to the initial scope and r_nlist being
>      the value after a resize.  That is the only path in dl-open.c not
>      protected by GSCOPE.  A read barrier here might be to expensive.  */
>   __asm volatile ("" : "+r" (n), "+m" (scope->r_list));
>   struct link_map **list = scope->r_list;

That ensures that the compiler won't reorder the reads -- but it does
*not* ensure that reads happen in a particular order when executed on an
arch with a weak memory model.  If we don't care about HW reordering, we
won't need the asm (unless perhaps we communicate with external stuff
through volatile, or do something behind the compiler's back), because
the language semantics for sequential code are taking care of that.

Thus, my guess is that this is a bug.
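For comparison, a sketch (not part of this patch set) of how the C11-style
primitives would express the intended ordering, constraining both the compiler
and the hardware:

  size_t n = atomic_load_relaxed (&scope->r_nlist);
  /* Order the r_nlist read before the r_list read on all archs.  */
  atomic_thread_fence_acquire ();
  struct link_map **list = atomic_load_relaxed (&scope->r_list);

(This adds the read barrier the original comment wanted to avoid, which is part
of what is being discussed here.)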

> (There are lots of cases of asms being used - via macros - to control 
> exactly when floating-point evaluations take place and associated 
> exceptions are raised, but that's not relevant to atomics.  In general, 
> any asm of an empty string is probably doing something like this.)
> 
> > > either (a) using asm with appropriate clobbers so 
> > > the compiler doesn't know the value is unchanged
> > 
> > Do you have any examples for this where the clobber is not due to a
> > compiler barrier used for acquire or release fences?
> 
> I don't understand the question.

The fences consist of a compiler barrier and, depending on the arch,
potentially a real HW fence instruction.  The example you gave above is
a compiler barrier, but it looks like it should actually be an acquire
fence.

> > > or (b) using volatile.
> > 
> > But volatile does not guarantee atomicity.  I also think that current
> 
> The point is to guarantee ordering

volatile does not guarantee ordering in a concurrent setting.  The
guarantee is that the loads/stores will be issued in exactly the same
order as in the abstract machine's execution of the program -- but this
doesn't constrain visibility to other threads, or what other threads
observe.

> (with the atomicity of aligned word 
> loads / stores being part of the GNU C language semantics required for 
> building glibc).

Ok.  We can continue to rely on such things, but we can also just
embrace that C11 now has an actual memory model for concurrent code.

> > But I also see no real evidence that using atomics for relaxed
> > loads/stores would hurt at all.  The sanitizer use case in the BZ you
> > cited is arguably special.  The pthread_once case didn't show any
> > difference on the fast path.  I can test with a __atomic*-providing
> > compiler with and without using __atomic to look for differences, if
> > that would help you.
> 
> Yes.  We may find in particular cases (possibly all of them) that the 
> change does not have significant code generation consequences, but given 
> the known bug we should actually check that in each case rather than 
> presuming it.

I can't argue against being conservative by default, but I also can't
imagine this being an actual problem for typical synchronizing code.

> > Can we at least agree on having all glibc code use our own
> > atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
> > still argue whether to use __atomic* to implement these.  But at least
> > we can easily switch to __atomic in the future.  Any objections?
> 
> I think explicit atomic loads / stores are fine, subject to checking code 
> generation before making them use __atomic_*.

How do you envision doing this?

Compare generated code on all archs before and after a change, for each
algorithm on its own?
Do the C11 transition, and then compare all code?
Look at a few algorithms to see whether it makes a difference, and then
see whether we have anecdotal evidence that it might affect performance?
  
Joseph Myers Oct. 30, 2014, 8:12 p.m. UTC | #9
On Thu, 30 Oct 2014, Torvald Riegel wrote:

> > See for example elf/dl-lookup.c:do_lookup_x, using an asm to ensure reads 
> > happen in a particular order:
> > 
> >   size_t n = scope->r_nlist;
> >   /* Make sure we read the value before proceeding.  Otherwise we
> >      might use r_list pointing to the initial scope and r_nlist being
> >      the value after a resize.  That is the only path in dl-open.c not
> >      protected by GSCOPE.  A read barrier here might be to expensive.  */
> >   __asm volatile ("" : "+r" (n), "+m" (scope->r_list));
> >   struct link_map **list = scope->r_list;
> 
> That ensures that the compiler won't reorder the reads -- but it does
> *not* ensure that reads happen in a particular order when executed on an
> arch with a weak memory model.  If we don't care about HW reordering, we
> won't need the asm (unless perhaps we communicate with external stuff
> through volatile, or do something behind the compiler's back), because
> the language semantics for sequential code are taking care of that.
> 
> Thus, my guess is that this is a bug.

My guess is that on the system where a problem reordering was observed, it 
was sufficient to stop the compiler from reordering the loads (but there 
might be issues for some other system).  And this sort of code, where 
there aren't other atomic operations nearby, is a case where we'd need to 
be particularly careful about checking the performance impact of inserting 
barriers.

> > > Can we at least agree on having all glibc code use our own
> > > atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
> > > still argue whether to use __atomic* to implement these.  But at least
> > > we can easily switch to __atomic in the future.  Any objections?
> > 
> > I think explicit atomic loads / stores are fine, subject to checking code 
> > generation before making them use __atomic_*.
> 
> How do you envision doing this?
> 
> Compare generated code on all archs before and after a change, for each
> algorithm on its own?

Comparing code for one architecture, at the time of each conversion, seems 
appropriate (with additional benchmarking for cases that don't otherwise 
involve atomics).
  
Torvald Riegel Oct. 30, 2014, 8:38 p.m. UTC | #10
On Thu, 2014-10-30 at 20:12 +0000, Joseph S. Myers wrote:
> On Thu, 30 Oct 2014, Torvald Riegel wrote:
> 
> > > See for example elf/dl-lookup.c:do_lookup_x, using an asm to ensure reads 
> > > happen in a particular order:
> > > 
> > >   size_t n = scope->r_nlist;
> > >   /* Make sure we read the value before proceeding.  Otherwise we
> > >      might use r_list pointing to the initial scope and r_nlist being
> > >      the value after a resize.  That is the only path in dl-open.c not
> > >      protected by GSCOPE.  A read barrier here might be to expensive.  */
> > >   __asm volatile ("" : "+r" (n), "+m" (scope->r_list));
> > >   struct link_map **list = scope->r_list;
> > 
> > That ensures that the compiler won't reorder the reads -- but it does
> > *not* ensure that reads happen in a particular order when executed on an
> > arch with a weak memory model.  If we don't care about HW reordering, we
> > won't need the asm (unless perhaps we communicate with external stuff
> > through volatile, or do something behind the compiler's back), because
> > the language semantics for sequential code are taking care of that.
> > 
> > Thus, my guess is that this is a bug.
> 
> My guess is that on the system where a problem reordering was observed, it 
> was sufficient to stop the compiler from reordering the loads (but there 
> might be issues for some other system).  And this sort of code, where 
> there aren't other atomic operations nearby, is a case where we'd need to 
> be particularly careful about checking the performance impact of inserting 
> barriers.

But if there are issues for other archs, we need to insert a barrier
anyway, agreed?  If we want to write generic, non-arch-specific code,
this would be a generic barrier too -- which, in turn, would boil down
to a normal compiler barrier (ie, "memory").

We could think about adding specific versions of acquire barriers that
do something like the asm above in that the compiler barrier part just
applies to certain memory locations -- but I doubt that's really worth
it.  If we want to avoid the barrier, we can try other ways to do that.

This piece of code is also similar to the pthread_once fastpath, in that
it's an acquire barrier surrounded by loads.

> > > > Can we at least agree on having all glibc code use our own
> > > > atomic_store_relaxed / atomic_load_relaxed (see the patch)?  Then we can
> > > > still argue whether to use __atomic* to implement these.  But at least
> > > > we can easily switch to __atomic in the future.  Any objections?
> > > 
> > > I think explicit atomic loads / stores are fine, subject to checking code 
> > > generation before making them use __atomic_*.
> > 
> > How do you envision doing this?
> > 
> > Compare generated code on all archs before and after a change, for each
> > algorithm on its own?
> 
> Comparing code for one architecture, at the time of each conversion, seems 
> appropriate (with additional benchmarking for cases that don't otherwise 
> involve atomics).

How do we handle bug fixes?  When fixing a bug, do you want performance
parity too?

By "involve atomics", do you mean specifically atomic read-modify-write
ops, so atomic ops other than just loads and stores?
  
Joseph Myers Oct. 30, 2014, 8:55 p.m. UTC | #11
On Thu, 30 Oct 2014, Torvald Riegel wrote:

> > My guess is that on the system where a problem reordering was observed, it 
> > was sufficient to stop the compiler from reordering the loads (but there 
> > might be issues for some other system).  And this sort of code, where 
> > there aren't other atomic operations nearby, is a case where we'd need to 
> > be particularly careful about checking the performance impact of inserting 
> > barriers.
> 
> But if there are issues for other archs, we need to insert a barrier
> anyway, agreed?  If we want to write generic, non-arch-specific code,

Yes, subject to considering whether there's any more efficient approach 
available.

> > Comparing code for one architecture, at the time of each conversion, seems 
> > appropriate (with additional benchmarking for cases that don't otherwise 
> > involve atomics).
> 
> How do we handle bug fixes?  When fixing a bug, do you want performance
> parity too?

No, but if the fix is in a critical path we should look more carefully at 
what the most efficient way to achieve the fix is.

> By "involve atomics", do you mean specifically atomic read-modify-write
> ops, so atomic ops other than just loads and stores?

Yes.
  
Torvald Riegel Oct. 30, 2014, 9:11 p.m. UTC | #12
On Thu, 2014-10-30 at 20:12 +0000, Joseph S. Myers wrote:
> Comparing code for one architecture, at the time of each conversion, seems 
> appropriate (with additional benchmarking for cases that don't otherwise 
> involve atomics).

So, if you want this code comparison as default, we can as well start
with using relaxed __atomic* loads and stores instead of first using
plain loads and stores, agreed?

Do you have any other feedback on the patch set?
  
Joseph Myers Oct. 30, 2014, 11:31 p.m. UTC | #13
On Thu, 30 Oct 2014, Torvald Riegel wrote:

> On Thu, 2014-10-30 at 20:12 +0000, Joseph S. Myers wrote:
> > Comparing code for one architecture, at the time of each conversion, seems 
> > appropriate (with additional benchmarking for cases that don't otherwise 
> > involve atomics).
> 
> So, if you want this code comparison as default, we can as well start
> with using relaxed __atomic* loads and stores instead of first using
> plain loads and stores, agreed?

Yes.

> Do you have any other feedback on the patch set?

No.
  

Patch

commit 5921ff5b989972b744f5cb7d8f4f0b16c04ab19e
Author: Torvald Riegel <triegel@redhat.com>
Date:   Sun Sep 14 20:04:54 2014 +0200

    Add atomic operations similar to those provided by C11.

diff --git a/include/atomic.h b/include/atomic.h
index 3e82b6a..c4df8f1 100644
--- a/include/atomic.h
+++ b/include/atomic.h
@@ -542,6 +542,218 @@ 
   ({ __typeof (x) __x; __asm ("" : "=r" (__x) : "0" (x)); __x; })
 #endif
 
+/* This is equal to 1 iff the architecture supports 64b atomic operations.  */
+#ifndef __HAVE_64B_ATOMICS
+#error Unable to determine if 64-bit atomics are present.
+#endif
+
+/* The following functions are a subset of the atomic operations provided by
+   C11.  Usually, a function named atomic_OP_MO(args) is equivalent to C11's
+   atomic_OP_explicit(args, memory_order_MO); exceptions noted below.  */
+
+/* Each arch can request to use compiler built-ins for C11 atomics.  If it
+   does, all atomics will be based on these.  */
+#if USE_ATOMIC_COMPILER_BUILTINS
+
+/* We require 32b atomic operations; some archs also support 64b atomic
+   operations.  */
+void __atomic_link_error (void);
+# if __HAVE_64B_ATOMICS == 1
+#  define __atomic_check_size(mem) \
+   if ((sizeof (*mem) != 4) && (sizeof (*mem) != 8))			      \
+     __atomic_link_error ();
+# else
+#  define __atomic_check_size(mem) \
+   if (sizeof (*mem) != 4)						      \
+     __atomic_link_error ();
+# endif
+
+# define atomic_thread_fence_acquire() \
+  __atomic_thread_fence (__ATOMIC_ACQUIRE)
+# define atomic_thread_fence_release() \
+  __atomic_thread_fence (__ATOMIC_RELEASE)
+# define atomic_thread_fence_seq_cst() \
+  __atomic_thread_fence (__ATOMIC_SEQ_CST)
+
+# define atomic_load_relaxed(mem) \
+  ({ __atomic_check_size((mem)); __atomic_load_n ((mem), __ATOMIC_RELAXED); })
+# define atomic_load_acquire(mem) \
+  ({ __atomic_check_size((mem)); __atomic_load_n ((mem), __ATOMIC_ACQUIRE); })
+
+# define atomic_store_relaxed(mem, val) \
+  do {									      \
+    __atomic_check_size((mem));						      \
+    __atomic_store_n ((mem), (val), __ATOMIC_RELAXED);			      \
+  } while (0)
+# define atomic_store_release(mem, val) \
+  do {									      \
+    __atomic_check_size((mem));						      \
+    __atomic_store_n ((mem), (val), __ATOMIC_RELEASE);			      \
+  } while (0)
+
+/* On failure, this CAS has memory_order_relaxed semantics.  */
+# define atomic_compare_exchange_weak_relaxed(mem, expected, desired) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_compare_exchange_n ((mem), (expected), (desired), 1,		      \
+    __ATOMIC_RELAXED, __ATOMIC_RELAXED); })
+# define atomic_compare_exchange_weak_acquire(mem, expected, desired) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_compare_exchange_n ((mem), (expected), (desired), 1,		      \
+    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED); })
+# define atomic_compare_exchange_weak_release(mem, expected, desired) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_compare_exchange_n ((mem), (expected), (desired), 1,		      \
+    __ATOMIC_RELEASE, __ATOMIC_RELAXED); })
+
+# define atomic_exchange_acquire(mem, desired) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_exchange_n ((mem), (desired), __ATOMIC_ACQUIRE); })
+# define atomic_exchange_release(mem, desired) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_exchange_n ((mem), (desired), __ATOMIC_RELEASE); })
+
+# define atomic_fetch_add_relaxed(mem, operand) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_RELAXED); })
+# define atomic_fetch_add_acquire(mem, operand) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_ACQUIRE); })
+# define atomic_fetch_add_release(mem, operand) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_RELEASE); })
+# define atomic_fetch_add_acq_rel(mem, operand) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_ACQ_REL); })
+
+# define atomic_fetch_and_acquire(mem, operand) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_fetch_and ((mem), (operand), __ATOMIC_ACQUIRE); })
+
+# define atomic_fetch_or_relaxed(mem, operand) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_fetch_or ((mem), (operand), __ATOMIC_RELAXED); })
+# define atomic_fetch_or_acquire(mem, operand) \
+  ({ __atomic_check_size((mem));					      \
+  __atomic_fetch_or ((mem), (operand), __ATOMIC_ACQUIRE); })
+
+#else /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
+/* By default, we assume that read, write, and full barriers are equivalent
+   to acquire, release, and seq_cst barriers.  Archs for which this does not
+   hold have to provide custom definitions of the fences.  */
+# ifndef atomic_thread_fence_acquire
+#  define atomic_thread_fence_acquire() atomic_read_barrier ()
+# endif
+# ifndef atomic_thread_fence_release
+#  define atomic_thread_fence_release() atomic_write_barrier ()
+# endif
+# ifndef atomic_thread_fence_seq_cst
+#  define atomic_thread_fence_seq_cst() atomic_full_barrier ()
+# endif
+
+# ifndef atomic_load_relaxed
+#  define atomic_load_relaxed(mem) \
+   ({ __typeof (*(mem)) __atg100_val;					      \
+   __asm ("" : "=r" (__atg100_val) : "0" (*(mem)));			      \
+   __atg100_val; })
+# endif
+# ifndef atomic_load_acquire
+#  define atomic_load_acquire(mem) \
+   ({ __typeof (*(mem)) __atg101_val = atomic_load_relaxed (mem);	      \
+   atomic_thread_fence_acquire ();					      \
+   __atg101_val; })
+# endif
+
+# ifndef atomic_store_relaxed
+/* XXX Use inline asm here?  */
+#  define atomic_store_relaxed(mem, val) do { *(mem) = (val); } while (0)
+# endif
+# ifndef atomic_store_release
+#  define atomic_store_release(mem, val) \
+   do {									      \
+     atomic_thread_fence_release ();					      \
+     atomic_store_relaxed ((mem), (val));				      \
+   } while (0)
+# endif
+
+/* On failure, this CAS has memory_order_relaxed semantics.  */
+/* XXX This potentially has one branch more than necessary, but archs
+   currently do not define a CAS that returns both the previous value and
+   the success flag.  */
+# ifndef atomic_compare_exchange_weak_acquire
+#  define atomic_compare_exchange_weak_acquire(mem, expected, desired) \
+   ({ typeof (*(expected)) __atg102_expected = *(expected);		      \
+   *(expected) =							      \
+     atomic_compare_and_exchange_val_acq ((mem), (desired), *(expected));     \
+   *(expected) == __atg102_expected; })
+# endif
+# ifndef atomic_compare_exchange_weak_relaxed
+/* XXX Fall back to CAS with acquire MO because archs do not define a weaker
+   CAS.  */
+#  define atomic_compare_exchange_weak_relaxed(mem, expected, desired) \
+   atomic_compare_exchange_weak_acquire ((mem), (expected), (desired))
+# endif
+# ifndef atomic_compare_exchange_weak_release
+#  define atomic_compare_exchange_weak_release(mem, expected, desired) \
+   ({ typeof (*(expected)) __atg103_expected = *(expected);		      \
+   *(expected) =							      \
+     atomic_compare_and_exchange_val_rel ((mem), (desired), *(expected));     \
+   *(expected) == __atg103_expected; })
+# endif
+
+# ifndef atomic_exchange_acquire
+#  define atomic_exchange_acquire(mem, val) \
+   atomic_exchange_acq ((mem), (val))
+# endif
+# ifndef atomic_exchange_release
+#  define atomic_exchange_release(mem, val) \
+   atomic_exchange_rel ((mem), (val))
+# endif
+
+# ifndef atomic_fetch_add_acquire
+#  define atomic_fetch_add_acquire(mem, operand) \
+   atomic_exchange_and_add_acq ((mem), (operand))
+# endif
+# ifndef atomic_fetch_add_relaxed
+/* XXX Fall back to acquire MO because the MO semantics of
+   atomic_exchange_and_add are not documented; the generic version falls back
+   to atomic_exchange_and_add_acq if atomic_exchange_and_add is not defined,
+   and vice versa.  */
+#  define atomic_fetch_add_relaxed(mem, operand) \
+   atomic_fetch_add_acquire ((mem), (operand))
+# endif
+# ifndef atomic_fetch_add_release
+#  define atomic_fetch_add_release(mem, operand) \
+   atomic_exchange_and_add_rel ((mem), (operand))
+# endif
+# ifndef atomic_fetch_add_acq_rel
+#  define atomic_fetch_add_acq_rel(mem, operand) \
+   ({ atomic_thread_fence_release ();					      \
+   atomic_exchange_and_add_acq ((mem), (operand)); })
+# endif
+
+/* XXX The default for atomic_and_val has acquire semantics, but this is not
+   documented.  */
+# ifndef atomic_fetch_and_acquire
+#  define atomic_fetch_and_acquire(mem, operand) \
+   atomic_and_val ((mem), (operand))
+# endif
+
+/* XXX The default for atomic_or_val has acquire semantics, but this is not
+   documented.  */
+# ifndef atomic_fetch_or_acquire
+#  define atomic_fetch_or_acquire(mem, operand) \
+   atomic_or_val ((mem), (operand))
+# endif
+/* XXX Fall back to acquire MO because archs do not define a weaker
+   atomic_or_val.  */
+# ifndef atomic_fetch_or_relaxed
+#  define atomic_fetch_or_relaxed(mem, operand) \
+   atomic_fetch_or_acquire ((mem), (operand))
+# endif
+
+#endif /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
 
 #ifndef atomic_delay
 # define atomic_delay() do { /* nothing */ } while (0)