diff mbox

[4/4] Use C11 atomics in pthread_once.

Message ID 1414620650.10085.63.camel@triegel.csb
State Committed
Headers show

Commit Message

Torvald Riegel Oct. 29, 2014, 10:10 p.m. UTC
This patch transforms pthread_once to use C11 atomics.  It's meant as an
illustration and early test.

Please note that I've transformed *all* accesses to concurrently
accessed memory locations to use atomic operations.  This is the right
thing to do to inform the compiler about concurrency and prevent it from
making optimizations based on assumptions about data-race-freedom and
sequential code (concurrent accesses are not sequential code...).
You'll see that atomic_*_relaxed is used quite a bit, which restricts
the compiler a little but does not add any barriers.

Also, this makes it easy to see which loads and stores are actually
concurrent code and thus need additional attention by the programmer.

I've compared generated code on x86_64 on GCC 4.4.7.  The only thing
that changes between before/after the patch is that a "cmp %eax,%edx"
becomes a "cmp %edx,%eax", but it's used to test equality of the values.

I've also looked at the code generated by a pre-4.9 GCC build.  The code
generated for the pthread_once fast path is the same as with GCC 4.4.7
before the patch.  The slow path has somewhat different code with the
more recent compiler, with less instructions.

Joseph mentioned that GCC doesn't yet optimize memory_order_relaxed
loads as well as it could.  This is something we may have to look at
again for other code, but for pthread_once at least this wasn't a
problem.  My gut feeling is that it won't matter as much for most
concurrent code, because relaxed atomic ops will likely be close to
other atomic ops with stronger memory orders, unless you're doing
something special like generating loads and loads of relaxed ops.

	* nptl/pthread_once.c (clear_once_control, __pthread_once_slow,
	__pthread_once): Use C11 atomics.

Comments

Will Newton Nov. 4, 2014, 3:18 p.m. UTC | #1
On 29 October 2014 22:10, Torvald Riegel <triegel@redhat.com> wrote:

Hi Torvald,

> This patch transforms pthread_once to use C11 atomics.  It's meant as an
> illustration and early test.
>
> Please note that I've transformed *all* accesses to concurrently
> accessed memory locations to use atomic operations.  This is the right
> thing to do to inform the compiler about concurrency and prevent it from
> making optimizations based on assumptions about data-race-freedom and
> sequential code (concurrent accesses are not sequential code...).
> You'll see that atomic_*_relaxed is used quite a bit, which restricts
> the compiler a little but does not add any barriers.
>
> Also, this makes it easy to see which loads and stores are actually
> concurrent code and thus need additional attention by the programmer.
>
> I've compared generated code on x86_64 on GCC 4.4.7.  The only thing
> that changes between before/after the patch is that a "cmp %eax,%edx"
> becomes a "cmp %edx,%eax", but it's used to test equality of the values.
>
> I've also looked at the code generated by a pre-4.9 GCC build.  The code
> generated for the pthread_once fast path is the same as with GCC 4.4.7
> before the patch.  The slow path has somewhat different code with the
> more recent compiler, with less instructions.

I tried gcc 4.8.3 on x86_64 and the code is slightly longer in the new
version (including pushing a larger frame) but the performance
measured by the pthread_once benchtest is identical. gcc 4.8.2 on ARM
the code is identical apart from some register allocation decisions.

It would be interesting to know what the compiler is doing here but I
guess if the performance is the same it may not really matter.
diff mbox

Patch

commit 985496a0a6cd8dbef60276e400bc3e51528862a6
Author: Torvald Riegel <triegel@redhat.com>
Date:   Tue Oct 21 00:34:53 2014 +0200

    Use C11 atomics in pthread_once.

diff --git a/nptl/pthread_once.c b/nptl/pthread_once.c
index 05aea84..9bb82e9 100644
--- a/nptl/pthread_once.c
+++ b/nptl/pthread_once.c
@@ -34,7 +34,7 @@  clear_once_control (void *arg)
      other threads that see this value: This function will be called if we
      get interrupted (see __pthread_once), so all we need to relay to other
      threads is the state being reset again.  */
-  *once_control = 0;
+  atomic_store_relaxed (once_control, 0);
   lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
 }
 
@@ -68,20 +68,18 @@  __pthread_once_slow (pthread_once_t *once_control, void (*init_routine) (void))
 {
   while (1)
     {
-      int oldval, val, newval;
+      int val, newval;
 
       /* We need acquire memory order for this load because if the value
          signals that initialization has finished, we need to see any
          data modifications done during initialization.  */
-      val = *once_control;
-      atomic_read_barrier ();
+      val = atomic_load_acquire (once_control);
       do
 	{
 	  /* Check if the initialization has already been done.  */
 	  if (__glibc_likely ((val & __PTHREAD_ONCE_DONE) != 0))
 	    return 0;
 
-	  oldval = val;
 	  /* We try to set the state to in-progress and having the current
 	     fork generation.  We don't need atomic accesses for the fork
 	     generation because it's immutable in a particular process, and
@@ -90,18 +88,17 @@  __pthread_once_slow (pthread_once_t *once_control, void (*init_routine) (void))
 	  newval = __fork_generation | __PTHREAD_ONCE_INPROGRESS;
 	  /* We need acquire memory order here for the same reason as for the
 	     load from once_control above.  */
-	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
-						     oldval);
 	}
-      while (__glibc_unlikely (val != oldval));
+      while (__glibc_unlikely (!atomic_compare_exchange_weak_acquire (
+	  once_control, &val, newval)));
 
       /* Check if another thread already runs the initializer.	*/
-      if ((oldval & __PTHREAD_ONCE_INPROGRESS) != 0)
+      if ((val & __PTHREAD_ONCE_INPROGRESS) != 0)
 	{
 	  /* Check whether the initializer execution was interrupted by a
 	     fork.  We know that for both values, __PTHREAD_ONCE_INPROGRESS
 	     is set and __PTHREAD_ONCE_DONE is not.  */
-	  if (oldval == newval)
+	  if (val == newval)
 	    {
 	      /* Same generation, some other thread was faster. Wait.  */
 	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
@@ -122,8 +119,7 @@  __pthread_once_slow (pthread_once_t *once_control, void (*init_routine) (void))
       /* Mark *once_control as having finished the initialization.  We need
          release memory order here because we need to synchronize with other
          threads that want to use the initialized data.  */
-      atomic_write_barrier ();
-      *once_control = __PTHREAD_ONCE_DONE;
+      atomic_store_release (once_control, __PTHREAD_ONCE_DONE);
 
       /* Wake up all other threads.  */
       lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
@@ -138,8 +134,7 @@  __pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
 {
   /* Fast path.  See __pthread_once_slow.  */
   int val;
-  val = *once_control;
-  atomic_read_barrier ();
+  val = atomic_load_acquire (once_control);
   if (__glibc_likely ((val & __PTHREAD_ONCE_DONE) != 0))
     return 0;
   else