From patchwork Wed Oct 29 22:10:50 2014
X-Patchwork-Submitter: Torvald Riegel
X-Patchwork-Id: 3492
Subject: [PATCH 4/4] Use C11 atomics in pthread_once.
From: Torvald Riegel
To: GLIBC Devel
In-Reply-To: <1414617613.10085.23.camel@triegel.csb>
References: <1414617613.10085.23.camel@triegel.csb>
Date: Wed, 29 Oct 2014 23:10:50 +0100
Message-ID: <1414620650.10085.63.camel@triegel.csb>

This patch transforms pthread_once to use C11 atomics.  It's meant as an
illustration and early test.

Please note that I've transformed *all* accesses to concurrently accessed
memory locations to use atomic operations.  This is the right thing to do
to inform the compiler about concurrency and to prevent it from making
optimizations based on assumptions of data-race-freedom that hold only for
sequential code (concurrent accesses are not sequential code...).  You'll
see that atomic_*_relaxed is used quite a bit; it restricts the compiler a
little but does not add any barriers.  It also makes it easy to see which
loads and stores are actually concurrent code and thus need additional
attention from the programmer.

I've compared the generated code on x86_64 with GCC 4.4.7.
The only thing that changes between before/after the patch is that a
"cmp %eax,%edx" becomes a "cmp %edx,%eax"; since this is only used to test
the values for equality, the difference is harmless.

I've also looked at the code generated by a pre-4.9 GCC build.  The code
generated for the pthread_once fast path is the same as with GCC 4.4.7
before the patch.  The slow path gets somewhat different code with the more
recent compiler, with fewer instructions.

Joseph mentioned that GCC doesn't yet optimize memory_order_relaxed loads
as well as it could.  This is something we may have to look at again for
other code, but for pthread_once at least this wasn't a problem.  My gut
feeling is that it won't matter as much for most concurrent code, because
relaxed atomic ops will likely be close to other atomic ops with stronger
memory orders, unless you're doing something special like generating loads
and loads of relaxed ops.

	* nptl/pthread_once.c (clear_once_control, __pthread_once_slow,
	__pthread_once): Use C11 atomics.

commit 985496a0a6cd8dbef60276e400bc3e51528862a6
Author: Torvald Riegel
Date:   Tue Oct 21 00:34:53 2014 +0200

    Use C11 atomics in pthread_once.

diff --git a/nptl/pthread_once.c b/nptl/pthread_once.c
index 05aea84..9bb82e9 100644
--- a/nptl/pthread_once.c
+++ b/nptl/pthread_once.c
@@ -34,7 +34,7 @@ clear_once_control (void *arg)
      other threads that see this value: This function will be called if we
      get interrupted (see __pthread_once), so all we need to relay to other
      threads is the state being reset again.  */
-  *once_control = 0;
+  atomic_store_relaxed (once_control, 0);
   lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
 }
@@ -68,20 +68,18 @@ __pthread_once_slow (pthread_once_t *once_control, void (*init_routine) (void))
 {
   while (1)
     {
-      int oldval, val, newval;
+      int val, newval;

       /* We need acquire memory order for this load because if the value
          signals that initialization has finished, we need to see any
          data modifications done during initialization.  */
-      val = *once_control;
-      atomic_read_barrier ();
+      val = atomic_load_acquire (once_control);
       do
         {
           /* Check if the initialization has already been done.  */
           if (__glibc_likely ((val & __PTHREAD_ONCE_DONE) != 0))
             return 0;

-          oldval = val;
           /* We try to set the state to in-progress and having the current
              fork generation.  We don't need atomic accesses for the fork
              generation because it's immutable in a particular process, and
@@ -90,18 +88,17 @@ __pthread_once_slow (pthread_once_t *once_control, void (*init_routine) (void))
           newval = __fork_generation | __PTHREAD_ONCE_INPROGRESS;

           /* We need acquire memory order here for the same reason as for
              the load from once_control above.  */
-          val = atomic_compare_and_exchange_val_acq (once_control, newval,
-                                                     oldval);
         }
-      while (__glibc_unlikely (val != oldval));
+      while (__glibc_unlikely (!atomic_compare_exchange_weak_acquire (
+          once_control, &val, newval)));

       /* Check if another thread already runs the initializer.  */
-      if ((oldval & __PTHREAD_ONCE_INPROGRESS) != 0)
+      if ((val & __PTHREAD_ONCE_INPROGRESS) != 0)
         {
           /* Check whether the initializer execution was interrupted by a
              fork.  We know that for both values, __PTHREAD_ONCE_INPROGRESS
              is set and __PTHREAD_ONCE_DONE is not.  */
-          if (oldval == newval)
+          if (val == newval)
             {
               /* Same generation, some other thread was faster.  Wait.  */
               lll_futex_wait (once_control, newval, LLL_PRIVATE);
@@ -122,8 +119,7 @@ __pthread_once_slow (pthread_once_t *once_control, void (*init_routine) (void))
       /* Mark *once_control as having finished the initialization.  We need
          release memory order here because we need to synchronize with other
          threads that want to use the initialized data.  */
-      atomic_write_barrier ();
-      *once_control = __PTHREAD_ONCE_DONE;
+      atomic_store_release (once_control, __PTHREAD_ONCE_DONE);

       /* Wake up all other threads.  */
       lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
@@ -138,8 +134,7 @@ __pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
 {
   /* Fast path.  See __pthread_once_slow.  */
   int val;
-  val = *once_control;
-  atomic_read_barrier ();
+  val = atomic_load_acquire (once_control);
   if (__glibc_likely ((val & __PTHREAD_ONCE_DONE) != 0))
     return 0;
   else