PowerPC: Sync pthread_once with default implementation

Message ID 534BF5EB.4000608@linux.vnet.ibm.com
State Committed
Headers

Commit Message

Adhemerval Zanella Netto April 14, 2014, 2:51 p.m. UTC
  This patch removes the arch specific powerpc implementation and instead
uses the linux default one.  Although the current powerpc implementation
already contains the required memory barriers for correct
initialization, the default implementation shows a better performance on
newer chips.

I checked the default implementations with some other tests for powerpc
and everything looks ok.  The only nit that was puzzling me was code
difference I saw checking if a load acquire instruction sequence would
yield better performance than the relaxed load plus acquire memory fence.
>From Torvald Riegel comment:

> Okay.  If lwsync is preferable for an acquire load, it might be best to
> check the GCC atomics implementation because I believe it's following
> http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html, which suggests
> isync for acquire loads.

I rechecked ISA documentation and some literature and the only inhibitor
about using an 'lwsync' as an acquire barrier is because it should not
be used on 'Write Through Required and Caching Inhibited' storage location,
which AFAIK are used only in restricted kernel areas.  In fact, load/store
with reservations does not work on such kind of memory attributes,  so for
general userland lock mechanism 'lwsync' is safe to be used as acquire
memory fence.

GCC it self translate C11 atomic_thread_fence (memory_order_acquire) to
'lwsync', so a relaxed load plus memory acquire fence will be translated
to 'ld; lwsync'.  Two advantage by using 'ld; cmp; bc; isync' for
acquire loads is 1) it is more strict, thus safer (although still not
really required) 2) is it performs much better some chips (POWER6), while
showing very similar performance on other POWER platforms.

I will see if it is worth to change the current core for acquire load/
release store, at least for POWER.  The only chip that I saw a big improvement
of doing it is on POWER6.

Also, this testcase itself is single-thread and think it would be worth to
extend it to check contention on multithread cases as well.

--

* nptl/sysdeps/unix/sysv/linux/powerpc/pthread_once.c: Remove file.

---
  

Comments

Torvald Riegel April 17, 2014, 4:03 p.m. UTC | #1
On Mon, 2014-04-14 at 11:51 -0300, Adhemerval Zanella wrote:
> This patch removes the arch specific powerpc implementation and instead
> uses the linux default one.  Although the current powerpc implementation
> already contains the required memory barriers for correct
> initialization, the default implementation shows a better performance on
> newer chips.
> 
> I checked the default implementations with some other tests for powerpc
> and everything looks ok.  The only nit that was puzzling me was code
> difference I saw checking if a load acquire instruction sequence would
> yield better performance than the relaxed load plus acquire memory fence.
> From Torvald Riegel comment:
> 
> > Okay.  If lwsync is preferable for an acquire load, it might be best to
> > check the GCC atomics implementation because I believe it's following
> > http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html, which suggests
> > isync for acquire loads.
> 
> I rechecked ISA documentation and some literature and the only inhibitor
> about using an 'lwsync' as an acquire barrier is because it should not
> be used on 'Write Through Required and Caching Inhibited' storage location,
> which AFAIK are used only in restricted kernel areas.  In fact, load/store
> with reservations does not work on such kind of memory attributes,  so for
> general userland lock mechanism 'lwsync' is safe to be used as acquire
> memory fence.
> 
> GCC it self translate C11 atomic_thread_fence (memory_order_acquire) to
> 'lwsync', so a relaxed load plus memory acquire fence will be translated
> to 'ld; lwsync'.  Two advantage by using 'ld; cmp; bc; isync' for
> acquire loads is 1) it is more strict, thus safer (although still not
> really required) 2) is it performs much better some chips (POWER6), while
> showing very similar performance on other POWER platforms.

Thanks for looking into this.

> I will see if it is worth to change the current core for acquire load/
> release store, at least for POWER.  The only chip that I saw a big improvement
> of doing it is on POWER6.

Sounds good.  I think that in the long term, having atomic.h provide
acquire loads and release stores wouldn't be bad.  I'd prefer doing that
instead of running with a POWER6-specific pthread_once variant.

> Also, this testcase itself is single-thread and think it would be worth to
> extend it to check contention on multithread cases as well.

Probably.  Although given that the contented case happens once per
pthread_once instance, it could be rather hard to trigger; to increase
the chance that one hits it often, one would need to involve other
synchronization, which could skew the results.

The patch is OK in my opinion.
  
Adhemerval Zanella Netto April 17, 2014, 5:02 p.m. UTC | #2
On 17-04-2014 13:03, Torvald Riegel wrote:
> On Mon, 2014-04-14 at 11:51 -0300, Adhemerval Zanella wrote:
>> This patch removes the arch specific powerpc implementation and instead
>> uses the linux default one.  Although the current powerpc implementation
>> already contains the required memory barriers for correct
>> initialization, the default implementation shows a better performance on
>> newer chips.
>>
>> I checked the default implementations with some other tests for powerpc
>> and everything looks ok.  The only nit that was puzzling me was code
>> difference I saw checking if a load acquire instruction sequence would
>> yield better performance than the relaxed load plus acquire memory fence.
>> From Torvald Riegel comment:
>>
>>> Okay.  If lwsync is preferable for an acquire load, it might be best to
>>> check the GCC atomics implementation because I believe it's following
>>> http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html, which suggests
>>> isync for acquire loads.
>> I rechecked ISA documentation and some literature and the only inhibitor
>> about using an 'lwsync' as an acquire barrier is because it should not
>> be used on 'Write Through Required and Caching Inhibited' storage location,
>> which AFAIK are used only in restricted kernel areas.  In fact, load/store
>> with reservations does not work on such kind of memory attributes,  so for
>> general userland lock mechanism 'lwsync' is safe to be used as acquire
>> memory fence.
>>
>> GCC it self translate C11 atomic_thread_fence (memory_order_acquire) to
>> 'lwsync', so a relaxed load plus memory acquire fence will be translated
>> to 'ld; lwsync'.  Two advantage by using 'ld; cmp; bc; isync' for
>> acquire loads is 1) it is more strict, thus safer (although still not
>> really required) 2) is it performs much better some chips (POWER6), while
>> showing very similar performance on other POWER platforms.
> Thanks for looking into this.
>
>> I will see if it is worth to change the current core for acquire load/
>> release store, at least for POWER.  The only chip that I saw a big improvement
>> of doing it is on POWER6.
> Sounds good.  I think that in the long term, having atomic.h provide
> acquire loads and release stores wouldn't be bad.  I'd prefer doing that
> instead of running with a POWER6-specific pthread_once variant.
>
>> Also, this testcase itself is single-thread and think it would be worth to
>> extend it to check contention on multithread cases as well.
> Probably.  Although given that the contented case happens once per
> pthread_once instance, it could be rather hard to trigger; to increase
> the chance that one hits it often, one would need to involve other
> synchronization, which could skew the results.
>
> The patch is OK in my opinion.

Thanks for the review and in fact my idea of measure contention in a multi-thread environment
is not for pthread_once initialization, but rather the difference of latency of using lwsync
and isync.  Anyway, I'll commit this patch.
  

Patch

diff --git a/nptl/sysdeps/unix/sysv/linux/powerpc/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/powerpc/pthread_once.c
deleted file mode 100644
index e925299..0000000
--- a/nptl/sysdeps/unix/sysv/linux/powerpc/pthread_once.c
+++ /dev/null
@@ -1,110 +0,0 @@ 
-/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Paul Mackerras <paulus@au.ibm.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-
-unsigned long int __fork_generation attribute_hidden;
-
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  __asm __volatile (__lll_rel_instr);
-  *once_control = 0;
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-}
-
-
-int
-__pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
-{
-  for (;;)
-    {
-      int oldval;
-      int newval;
-      int tmp;
-
-      /* Pseudo code:
-	 newval = __fork_generation | 1;
-	 oldval = *once_control;
-	 if ((oldval & 2) == 0)
-	   *once_control = newval;
-	 Do this atomically with an acquire barrier.
-      */
-      newval = __fork_generation | 1;
-      __asm __volatile ("1:	lwarx	%0,0,%3" MUTEX_HINT_ACQ "\n"
-			"	andi.	%1,%0,2\n"
-			"	bne	2f\n"
-			"	stwcx.	%4,0,%3\n"
-			"	bne	1b\n"
-			"2:	" __lll_acq_instr
-			: "=&r" (oldval), "=&r" (tmp), "=m" (*once_control)
-			: "r" (once_control), "r" (newval), "m" (*once_control)
-			: "cr0");
-
-      /* Check if the initializer has already been done.  */
-      if ((oldval & 2) != 0)
-	return 0;
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) == 0)
-	break;
-
-      /* Check whether the initializer execution was interrupted by a fork.  */
-      if (oldval != newval)
-	break;
-
-      /* Same generation, some other thread was faster. Wait.  */
-      lll_futex_wait (once_control, oldval, LLL_PRIVATE);
-    }
-
-
-  /* This thread is the first here.  Do the initialization.
-     Register a cleanup handler so that in case the thread gets
-     interrupted the initialization can be restarted.  */
-  pthread_cleanup_push (clear_once_control, once_control);
-
-  init_routine ();
-
-  pthread_cleanup_pop (0);
-
-
-  /* Add one to *once_control to take the bottom 2 bits from 01 to 10.
-     A release barrier is needed to ensure memory written by init_routine
-     is seen in other threads before *once_control changes.  */
-  int tmp;
-  __asm __volatile (__lll_rel_instr "\n"
-		    "1:	lwarx	%0,0,%2" MUTEX_HINT_REL "\n"
-		    "	addi	%0,%0,1\n"
-		    "	stwcx.	%0,0,%2\n"
-		    "	bne-	1b"
-		    : "=&b" (tmp), "=m" (*once_control)
-		    : "r" (once_control), "m" (*once_control)
-		    : "cr0");
-
-  /* Wake up all other threads.  */
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)