From patchwork Sun Sep 14 18:34:29 2014
X-Patchwork-Submitter: Torvald Riegel
X-Patchwork-Id: 2838
Subject: Transition to C11 atomics and memory model
From: Torvald Riegel
To: GLIBC Devel
Date: Sun, 14 Sep 2014 20:34:29 +0200
Message-ID: <1410719669.4967.160.camel@triegel.csb>

I think we should transition to using the C11 memory model and atomics
instead of the current custom implementation.  There are two reasons for
this:

1) Compilers need to know which memory accesses are atomic (and thus
potentially concurrent with other accesses to the same location) and
which aren't (and can thus be optimized more aggressively).  We currently
make the compiler do the right thing by using inline asm, but the
existence of atomic_forced_read shows that this isn't the full picture.

2) Over time, more and more programmers will become familiar with the C11
model.  Our current atomics are different, so if we use the C11 model, it
will likely be easier for future glibc developers to work on glibc's
concurrent code.  This also applies to the tool support (e.g.,
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/).
We can't rely on having a conforming C11 compiler yet, so I think it's
best if we add new atomic_ functions that closely resemble the atomic
operations C11 provides.  We use the C11 memory model (ie, the semantics)
by either using C11 support (if the compiler has it) or building our own
atomics in such a way that they implement the HW instruction side of a
C11 memory model implementation.  We can't force the compiler to be
C11-conforming in its transformations, but our atomics seem to work
currently, so we expect this to continue to work.

I propose that the first phase of transitioning to C11 focus on the uses
of atomic operations.  In particular, the rules are:

* All accesses to atomic vars need to use atomic_* functions.  IOW, all
non-atomic accesses are not subject to data races.  The only exception is
initialization (ie, when the variable is not visible to any other
thread); nonetheless, initialization accesses must not result in data
races with other accesses.  (This exception isn't allowed by C11, but
eases the transition to C11 atomics and likely works fine in current
implementations; as an alternative, we could require MO-relaxed stores
for initialization as well.)

* Atomic vars aren't explicitly annotated with atomic types, but just use
the base types.  They need to be naturally aligned.  This makes the
transition easier because we don't get any dependencies on C11 atomic
types.

* On a given architecture, we typically only use the atomic_* ops if the
HW actually supports these; we expect to have pointer-sized atomics at
most.  If the arch has no native support for atomics, it can either use
modified algorithms or emulate atomics differently.

* The atomic ops are similar to the _explicit variation of C11's
functions, except that _explicit is replaced with the last part of the MO
argument (ie, acquire, release, acq_rel, relaxed, seq_cst).  All
arguments (except the MO, which is dropped) are the same as for C11.
This naming avoids reusing C11's names, yet should make the names easy to
understand for people familiar with C11.

I also propose an incremental transition.  In particular, the steps are
roughly:

1) Add new C11-like atomics.  If GCC supports them on this architecture,
use GCC's atomic builtins.  Make them fall back to the existing atomics
otherwise.  Attached is a small patch that illustrates this.

2) Refactor one use (ie, all the accesses belonging to one algorithm or
group of functions that synchronize with each other) at a time.  This
involves reviewing the code and basically reimplementing the
synchronization bits on top of the C11 memory model.  We should also take
this opportunity to add any documentation of concurrent code that's
missing (which is often the case).

3) For non-standard atomic ops (eg, atomic_add_negative()), have a look
at all uses and decide whether we really need to keep them.

4) Once all of glibc uses the new atomics, remove the old ones for a
particular arch if the oldest compiler required has support for the
respective builtins.

Open questions:

* Are the current read/write memory barriers equivalent to C11
acquire/release fences?  I guess that's the case (did I mention lack of
documentation? ;) ), but we should check whether this is true on every
architecture (ie, whether the HW instructions used for read/write membars
are the same as what the compiler would use for acquire/release).  If
not, we can't implement acquire/release based on read/write membars but
need something else for that arch.  I'd appreciate help from the machine
maintainers for this one.

* How do we deal with archs such as older SPARC that don't have CAS, and
other archs without HW support for atomics?  Using modified algorithms
should be the best-performing option (eg, if we can use one critical
section instead of a complicated alternative that uses lots of atomic
operations).  However, that means we'll have to maintain more algorithms
(even if they might be simpler).
Furthermore, do all uses of atomics work well with blocking atomics that
might also not be indivisible steps?  For example, the cancellation code
might be affected because a blocking emulation of atomics won't be
async-cancel-safe.

* Which of the catomic_ variants do we really need?  Similarly to the
non-native atomics case, we might often be better off running a slightly
different nonatomic (or just nonsynchronizing) algorithm in the first
place.  We'll have to review all the uses to be able to tell.

Thoughts?  Any feedback and help is welcome!

commit 7bd3b53f2dc61e0bf2ef018140ef1cc83f0827c5
Author: Torvald Riegel
Date:   Sun Sep 14 20:04:54 2014 +0200

    Illustrate new function names for atomic ops.

diff --git a/include/atomic.h b/include/atomic.h
index 3e82b6a..939d1c3 100644
--- a/include/atomic.h
+++ b/include/atomic.h
@@ -543,6 +543,86 @@
 #endif
 
+/* The following functions are a subset of the atomic operations provided by
+   C11.  Usually, a function named atomic_OP_MO(args) is equivalent to C11's
+   atomic_OP_explicit(args, memory_order_MO); exceptions noted below.
+ */
+
+#if USE_ATOMIC_COMPILER_BUILTINS
+# define atomic_thread_fence_acquire() \
+  __atomic_thread_fence (__ATOMIC_ACQUIRE)
+# define atomic_thread_fence_release() \
+  __atomic_thread_fence (__ATOMIC_RELEASE)
+# define atomic_thread_fence_seq_cst() \
+  __atomic_thread_fence (__ATOMIC_SEQ_CST)
+
+# define __atomic_load_mo(mem, mo) \
+  ({ __typeof (*(mem)) __atg100_val; \
+     __atomic_load (mem, &__atg100_val, mo); \
+     __atg100_val; })
+# define atomic_load_relaxed(mem) __atomic_load_mo ((mem), __ATOMIC_RELAXED)
+# define atomic_load_acquire(mem) __atomic_load_mo ((mem), __ATOMIC_ACQUIRE)
+
+# define __atomic_store_mo(mem, val, mo) \
+  ({ __typeof (*(mem)) __atg101_val = (val); \
+     __atomic_store (mem, &__atg101_val, mo); })
+# define atomic_store_relaxed(mem, val) \
+  __atomic_store_mo ((mem), (val), __ATOMIC_RELAXED)
+# define atomic_store_release(mem, val) \
+  __atomic_store_mo ((mem), (val), __ATOMIC_RELEASE)
+
+/* TODO atomic_exchange relaxed acquire release acq_rel?  */
+
+/* TODO atomic_compare_exchange_weak relaxed acquire release acq_rel?  */
+/* TODO atomic_compare_exchange_strong relaxed acquire release acq_rel?  */
+
+/* TODO atomic_fetch_add relaxed acquire? release? acq_rel?  */
+/* TODO atomic_fetch_sub relaxed acquire? release? acq_rel?  */
+/* TODO atomic_fetch_and relaxed acquire? release? acq_rel?  */
+/* TODO atomic_fetch_or relaxed acquire? release? acq_rel?  */
+
+#else /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
+/* By default, we assume that read, write, and full barriers are equivalent
+   to acquire, release, and seq_cst barriers.  Archs for which this does not
+   hold have to provide custom definitions of the fences.
+ */
+# ifndef atomic_thread_fence_acquire
+# define atomic_thread_fence_acquire() atomic_read_barrier ()
+# endif
+# ifndef atomic_thread_fence_release
+# define atomic_thread_fence_release() atomic_write_barrier ()
+# endif
+# ifndef atomic_thread_fence_seq_cst
+# define atomic_thread_fence_seq_cst() atomic_full_barrier ()
+# endif
+
+# ifndef atomic_load_relaxed
+# define atomic_load_relaxed(mem) \
+  ({ __typeof (*(mem)) __atg100_val; \
+     __asm ("" : "=r" (__atg100_val) : "0" (*(mem))); \
+     __atg100_val; })
+# endif
+# ifndef atomic_load_acquire
+# define atomic_load_acquire(mem) \
+  ({ __typeof (*(mem)) __atg101_val = atomic_load_relaxed (mem); \
+     atomic_thread_fence_acquire (); \
+     __atg101_val; })
+# endif
+# ifndef atomic_store_relaxed
+/* XXX Use inline asm here too?  */
+# define atomic_store_relaxed(mem, val) do { *(mem) = (val); } while (0)
+# endif
+# ifndef atomic_store_release
+# define atomic_store_release(mem, val) \
+  do { \
+    atomic_thread_fence_release (); \
+    atomic_store_relaxed ((mem), (val)); \
+  } while (0)
+# endif
+
+/* TODO same as above  */
+
+#endif /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
+
 #ifndef atomic_delay
 # define atomic_delay() do { /* nothing */ } while (0)
 #endif