From patchwork Sun Sep 14 18:34:29 2014
X-Patchwork-Submitter: Torvald Riegel
X-Patchwork-Id: 2838
Subject: Transition to C11 atomics and memory model
From: Torvald Riegel
To: GLIBC Devel
Date: Sun, 14 Sep 2014 20:34:29 +0200
Message-ID: <1410719669.4967.160.camel@triegel.csb>

I think we should transition to using the C11 memory model and atomics
instead of the current custom implementation.  There are two reasons for
this:

1) Compilers need to know which memory accesses are atomic (and thus
potentially concurrent with other accesses to the same location) and
which aren't (and can thus be optimized more aggressively).  We currently
make the compiler do the right thing by using inline asm, but the
existence of atomic_forced_read shows that this isn't the full picture.

2) Over time, more and more programmers will become familiar with the C11
model.  Our current atomics are different, so if we use the C11 model, it
will likely be easier for future glibc developers to work on glibc's
concurrent code.  This also applies to the tool support (e.g.,
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/).
We can't rely on having a conforming C11 compiler yet, so I think it's
best if we add new atomic_ functions that closely resemble the atomic
operations C11 provides.  We use the C11 memory model (ie, the semantics)
by either using C11 support (if the compiler has it) or building our own
atomics in such a way that they implement the HW instruction side of a
C11 memory model implementation.  We can't force the compiler to be
C11-conforming in its transformations, but our atomics seem to work
currently, so we expect this to continue to work.

I propose that the first phase of transitioning to C11 focus on the uses
of atomic operations.  In particular, the rules are:

* All accesses to atomic vars need to use atomic_* functions.  IOW, all
non-atomic accesses are not subject to data races.  The only exception is
initialization (ie, when the variable is not visible to any other
thread); nonetheless, initialization accesses must not result in data
races with other accesses.  (This exception isn't allowed by C11, but
eases the transition to C11 atomics and likely works fine in current
implementations; as an alternative, we could require MO-relaxed stores
for initialization as well.)

* Atomic vars aren't explicitly annotated with atomic types, but just use
the base types.  They need to be naturally aligned.  This makes the
transition easier because we don't get any dependencies on C11 atomic
types.

* On a given architecture, we typically only use the atomic_* ops if the
HW actually supports these; we expect to have pointer-sized atomics at
most.  If the arch has no native support for atomics, it can either use
modified algorithms or emulate atomics differently.

* The atomic ops are similar to the _explicit variation of C11's
functions, except that _explicit is replaced with the last part of the MO
argument (ie, acquire, release, acq_rel, relaxed, seq_cst).  All
arguments (except the MO, which is dropped) are the same as for C11.
This naming avoids reusing C11's names, yet should make the names easy to
understand for people familiar with C11.

I also propose an incremental transition.  In particular, the steps are
roughly:

1) Add new C11-like atomics.  If GCC supports them on this architecture,
use GCC's atomic builtins.  Make them fall back to the existing atomics
otherwise.  Attached is a small patch that illustrates this.

2) Refactor one use (ie, all the accesses belonging to one algorithm or
group of functions that synchronize with each other) at a time.  This
involves reviewing the code and basically reimplementing the
synchronization bits on top of the C11 memory model.  We should also take
this opportunity to add any documentation of concurrent code that's
missing (which is often the case).

3) For non-standard atomic ops (eg, atomic_add_negative()), have a look
at all uses and decide whether we really need to keep them.

4) Once all of glibc uses the new atomics, remove the old ones for a
particular arch if the oldest compiler required has support for the
respective builtins.

Open questions:

* Are the current read/write memory barriers equivalent to C11
acquire/release fences?  I guess that's the case (did I mention lack of
documentation? ;) ), but we should check whether this is true on every
architecture (ie, whether the HW instructions used for read/write membars
are the same as what the compiler would use for acquire/release).  If
not, we can't implement acquire/release based on read/write membars but
need something else for that arch.  I'd appreciate help from the machine
maintainers for this one.

* How do we deal with archs such as older SPARC that don't have CAS, and
other archs without HW support for atomics?  Using modified algorithms
should be the best-performing option (eg, if we can use one critical
section instead of a complicated alternative that uses lots of atomic
operations).  However, that means we'll have to maintain more algorithms
(even if they might be simpler).
Furthermore, do all uses of atomics work well with blocking atomics that
might also not be indivisible steps?  For example, the cancellation code
might be affected because a blocking emulation of atomics won't be
async-cancel-safe.

* Which of the catomic_ variants do we really need?  Similarly to the
non-native atomics case, we might often be better off running a slightly
different nonatomic (or just nonsynchronizing) algorithm in the first
place.  We'll have to review all the uses to be able to tell.

Thoughts?  Any feedback and help is welcome!

commit 7bd3b53f2dc61e0bf2ef018140ef1cc83f0827c5
Author: Torvald Riegel
Date:   Sun Sep 14 20:04:54 2014 +0200

    Illustrate new function names for atomic ops.

diff --git a/include/atomic.h b/include/atomic.h
index 3e82b6a..939d1c3 100644
--- a/include/atomic.h
+++ b/include/atomic.h
@@ -543,6 +543,86 @@
 #endif
 
+/* The following functions are a subset of the atomic operations provided by
+   C11.  Usually, a function named atomic_OP_MO(args) is equivalent to C11's
+   atomic_OP_explicit(args, memory_order_MO); exceptions noted below.
+ */
+
+#if USE_ATOMIC_COMPILER_BUILTINS
+# define atomic_thread_fence_acquire() \
+  __atomic_thread_fence (__ATOMIC_ACQUIRE)
+# define atomic_thread_fence_release() \
+  __atomic_thread_fence (__ATOMIC_RELEASE)
+# define atomic_thread_fence_seq_cst() \
+  __atomic_thread_fence (__ATOMIC_SEQ_CST)
+
+# define __atomic_load_mo(mem, mo) \
+  ({ __typeof (*(mem)) __atg100_val; \
+     __atomic_load (mem, &__atg100_val, mo); \
+     __atg100_val; })
+# define atomic_load_relaxed(mem) __atomic_load_mo ((mem), __ATOMIC_RELAXED)
+# define atomic_load_acquire(mem) __atomic_load_mo ((mem), __ATOMIC_ACQUIRE)
+
+# define __atomic_store_mo(mem, val, mo) \
+  ({ __typeof (*(mem)) __atg101_val = (val); \
+     __atomic_store (mem, &__atg101_val, mo); })
+# define atomic_store_relaxed(mem, val) \
+  __atomic_store_mo ((mem), (val), __ATOMIC_RELAXED)
+# define atomic_store_release(mem, val) \
+  __atomic_store_mo ((mem), (val), __ATOMIC_RELEASE)
+
+/* TODO atomic_exchange relaxed acquire release acq_rel?  */
+
+/* TODO atomic_compare_exchange_weak relaxed acquire release acq_rel?  */
+/* TODO atomic_compare_exchange_strong relaxed acquire release acq_rel?  */
+
+/* TODO atomic_fetch_add relaxed acquire? release? acq_rel?  */
+/* TODO atomic_fetch_sub relaxed acquire? release? acq_rel?  */
+/* TODO atomic_fetch_and relaxed acquire? release? acq_rel?  */
+/* TODO atomic_fetch_or relaxed acquire? release? acq_rel?  */
+
+#else /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
+/* By default, we assume that read, write, and full barriers are equivalent
+   to acquire, release, and seq_cst barriers.  Archs for which this does not
+   hold have to provide custom definitions of the fences.
+ */
+# ifndef atomic_thread_fence_acquire
+# define atomic_thread_fence_acquire() atomic_read_barrier ()
+# endif
+# ifndef atomic_thread_fence_release
+# define atomic_thread_fence_release() atomic_write_barrier ()
+# endif
+# ifndef atomic_thread_fence_seq_cst
+# define atomic_thread_fence_seq_cst() atomic_full_barrier ()
+# endif
+
+# ifndef atomic_load_relaxed
+# define atomic_load_relaxed(mem) \
+  ({ __typeof (*(mem)) __atg100_val; \
+     __asm ("" : "=r" (__atg100_val) : "0" (*(mem))); \
+     __atg100_val; })
+# endif
+# ifndef atomic_load_acquire
+# define atomic_load_acquire(mem) \
+  ({ __typeof (*(mem)) __atg101_val = atomic_load_relaxed (mem); \
+     atomic_thread_fence_acquire (); \
+     __atg101_val; })
+# endif
+# ifndef atomic_store_relaxed
+/* XXX Use inline asm here too?  */
+# define atomic_store_relaxed(mem, val) do { *(mem) = (val); } while (0)
+# endif
+# ifndef atomic_store_release
+# define atomic_store_release(mem, val) \
+  do { \
+    atomic_thread_fence_release (); \
+    atomic_store_relaxed ((mem), (val)); \
+  } while (0)
+# endif
+
+/* TODO same as above  */
+
+#endif /* !USE_ATOMIC_COMPILER_BUILTINS  */
+
+
 #ifndef atomic_delay
 # define atomic_delay() do { /* nothing */ } while (0)
 #endif