[PATCHv2] powerpc: Spinlock optimization and cleanup

Message ID 560C0DA6.5060409@linux.vnet.ibm.com
State Superseded
Delegated to: Tulio Magno Quites Machado Filho

Commit Message

Paul E. Murphy Sept. 30, 2015, 4:28 p.m. UTC
  Changes from V1:

* Use C macro for atomic_store_release as suggested in
comments.

* Run benchmarks to quantify the performance changes and
note them in the commit message.

---8<---
This patch optimizes the powerpc spinlock implementation by:

* The current algorithm spins on a lwzx, but only after issuing a lwarx
  first.  The optimization is to avoid that initial lwarx in the
  contended case (it is much more costly than a normal load).

* Use the correct EH hint bit on the larx for supported ISAs.  For lock
  acquisition, the thread that acquired the lock with a successful stwcx.
  does not want to give away write ownership of the cacheline.  The
  idea is to make the load reservation "sticky" about retaining write
  authority to the line.  That way, the store that must inevitably come
  to release the lock can succeed quickly and not contend with other
  threads issuing lwarx.  If another thread does a store to the line
  (false sharing), the winning thread must give up write authority to
  the line.  The proper value of EH for the larx for a lock acquisition
  is 1.

* Increase contended lock performance by up to 40%, with no measurable
  impact on uncontended locks on P8.

It also adds some cleanup: using the defined acquire-semantics
instruction macro and converting the function prototypes to default C
style.

Thanks to Adhemerval Zanella, who did most of the work.  I've run some
tests and addressed some minor feedback.
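
A rough C-level sketch of the locking flow described above (illustrative
only, not the committed code -- the real implementation stays in inline
asm so it can emit the lwarx EH hint and the branch+isync acquire
barrier explicitly):

static int
spin_lock_sketch (volatile int *lock)
{
  for (;;)
    {
      /* Outer loop: wait with plain loads (lwzx) while the lock is
         held, so waiters do not issue lwarx and steal write authority
         over the cacheline.  */
      while (*lock != 0)
        ;

      /* The lock was observed free: try to take it with an atomic
         read-modify-write (lwarx/stwcx. with acquire semantics).  */
      int expected = 0;
      if (__atomic_compare_exchange_n (lock, &expected, 1, /* weak */ 1,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return 0;
    }
}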

2015-09-25  Paul E. Murphy  <murphyp@linux.vnet.ibm.com>

	* sysdeps/powerpc/nptl/pthread_spin_lock.c (pthread_spin_lock):
	Optimize first check for contention case and add lwarx hint.
	* sysdeps/powerpc/nptl/pthread_spin_trylock.c (pthread_spin_trylock):
	Use ANSI prototype.
	* sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c: Move to ...
	* sysdeps/powerpc/nptl/pthread_spin_unlock.c: ... here, and
	update to new atomic macros.
---
 sysdeps/powerpc/nptl/pthread_spin_lock.c           |   22 ++++++++-------
 sysdeps/powerpc/nptl/pthread_spin_trylock.c        |    3 +-
 sysdeps/powerpc/nptl/pthread_spin_unlock.c         |   27 +++++++++++++++++++
 .../unix/sysv/linux/powerpc/pthread_spin_unlock.c  |   28 --------------------
 4 files changed, 40 insertions(+), 40 deletions(-)
 create mode 100644 sysdeps/powerpc/nptl/pthread_spin_unlock.c
 delete mode 100644 sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
  

Comments

Tulio Magno Quites Machado Filho Sept. 30, 2015, 8:31 p.m. UTC | #1
"Paul E. Murphy" <murphyp@linux.vnet.ibm.com> writes:

> Changes from V1:
>
> * Use C macro for atomic_store_release as suggested in
> comments.
>
> * Run benchmarks to quantize the performance changes and
> note in commit message.

Thanks for resurrecting this patch!

> This patch optimizes powerpc spinlock implementation by:
>
> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
>   first.  The optimization for such case is to avoid the first lwarx
>   in contention case (which is much more costly than normal loads).
>
> * Use the correct EH hint bit on the larx for supported ISA.  For lock

lwarx

>   acquisition, the thread that acquired the lock with a successful stcx

stwcx

>   do not want to give away the write ownership on the cacheline.  The
>   idea is to make the load reservation "sticky" about retaining write
>   authority to the line.  That way, the store that must inevitably come
>   to release the lock can succeed quickly and not contend with other
>   threads issuing lwarx.  If another thread does a store to the line
>   (false sharing), the winning thread must give up write authority to

Is something missing between the previous line and the next?

>   The proper value of EH for the larx for a lock acquisition is 1.

Could you move part of this explanation to pthread_spin_lock.c as a comment,
please?
I'm referring only to the part where you explain the logic behind this
optimization.

> * Increase contented lock performance by up to 40%, and no measurable
>   impact on uncontended locks on P8.
>
> It also adds some cleanup to use the defined acquire semantic
> instructions and function prototype using default C style.
>
> Thanks to Adhemerval Zanella who did most of the work.  I've run some
> tests, and addressed some minor feedback.
>
> 2015-09-25  Paul E. Murphy  <murphyp@linux.vnet.ibm.com>

As this patch is based on the work of Adhemerval, it's important to mention
him in the ChangeLog.  Here's an example:
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=d3573f61a#patch1

> ---
>  sysdeps/powerpc/nptl/pthread_spin_lock.c           |   22 ++++++++-------
>  sysdeps/powerpc/nptl/pthread_spin_trylock.c        |    3 +-
>  sysdeps/powerpc/nptl/pthread_spin_unlock.c         |   27 +++++++++++++++++++
>  .../unix/sysv/linux/powerpc/pthread_spin_unlock.c  |   28 --------------------
>  4 files changed, 40 insertions(+), 40 deletions(-)
>  create mode 100644 sysdeps/powerpc/nptl/pthread_spin_unlock.c
>  delete mode 100644 sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
>
> diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c
> index cc081f8..8a39da9 100644
> --- a/sysdeps/powerpc/nptl/pthread_spin_lock.c
> +++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c
> @@ -19,24 +19,26 @@
>  #include "pthreadP.h"
>
>  int
> -pthread_spin_lock (lock)
> -     pthread_spinlock_t *lock;
> +pthread_spin_lock (pthread_spinlock_t *lock)
>  {
>    unsigned int __tmp;
>
>    asm volatile (
> -       "1:	lwarx	%0,0,%1\n"
> +       "0:	lwzx    %0,0,%1\n"
> +       "	cmpwi   0,%0,0\n"
> +       "	bne	0b\n"
> +       "1:	lwarx	%0,0,%1" MUTEX_HINT_ACQ "\n"
>         "	cmpwi	0,%0,0\n"
>         "	bne-	2f\n"
>         "	stwcx.	%2,0,%1\n"
>         "	bne-	2f\n"
> -       "	isync\n"
> -       "	.subsection 1\n"
> -       "2:	lwzx	%0,0,%1\n"
> -       "	cmpwi	0,%0,0\n"
> -       "	bne	2b\n"
> -       "	b	1b\n"
> -       "	.previous"
> +                __ARCH_ACQ_INSTR "\n"
> +       "        .subsection 1\n"
> +       "2:	lwzx    %0,0,%1\n"
> +       "        cmpwi   0,%0,0\n"
> +       "        bne     2b\n"
> +       "        b       1b\n"
> +       "        .previous"

Some of the asm instructions are indented with spaces, while most of
them use tabs.
  
Szabolcs Nagy Oct. 1, 2015, 9:18 a.m. UTC | #2
On 30/09/15 17:28, Paul E. Murphy wrote:
>
> ---8<---
> This patch optimizes powerpc spinlock implementation by:
>
...

The glibc pthread spinlock semantics are weaker than what
POSIX requires; I'm wondering whether this is expected to stay
or whether glibc might want to switch to stronger semantics.

Is it worthwhile to add optimized asm with weak semantics
for other targets that currently use the generic C code?

(The issue is that, for correct pthread_spin_trylock behavior,
the lock should be seqcst instead of acquire and the unlock
should be release instead of barrier+store; otherwise trylock
can spuriously report a locked state.)
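
To make the concern concrete, here is a minimal sketch of the kind of
litmus test this is about (my own illustration, not code from the
thread).  Under sequentially consistent lock operations at least one of
the two trylocks below must succeed; with acquire locks and release (or
weaker) unlocks, both may return EBUSY in the same run:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t l1, l2;
static int r1, r2;

static void *
t1 (void *arg)
{
  pthread_spin_lock (&l1);
  pthread_spin_unlock (&l1);
  r1 = pthread_spin_trylock (&l2);   /* Can this observe l2 as held...  */
  if (r1 == 0)
    pthread_spin_unlock (&l2);
  return NULL;
}

static void *
t2 (void *arg)
{
  pthread_spin_lock (&l2);
  pthread_spin_unlock (&l2);
  r2 = pthread_spin_trylock (&l1);   /* ... while this observes l1 as held?  */
  if (r2 == 0)
    pthread_spin_unlock (&l1);
  return NULL;
}

int
main (void)
{
  pthread_t a, b;
  pthread_spin_init (&l1, PTHREAD_PROCESS_PRIVATE);
  pthread_spin_init (&l2, PTHREAD_PROCESS_PRIVATE);
  pthread_create (&a, NULL, t1, NULL);
  pthread_create (&b, NULL, t2, NULL);
  pthread_join (a, NULL);
  pthread_join (b, NULL);
  /* Forbidden if the lock operations were sequentially consistent.  */
  if (r1 == EBUSY && r2 == EBUSY)
    puts ("both trylocks reported a locked state");
  return 0;
}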
  
Steven Munroe Oct. 1, 2015, 3:27 p.m. UTC | #3
On Thu, 2015-10-01 at 10:18 +0100, Szabolcs Nagy wrote:
> On 30/09/15 17:28, Paul E. Murphy wrote:
> >
> > ---8<---
> > This patch optimizes powerpc spinlock implementation by:
> >
> ...
> 
> The glibc pthread spinlock semantics is weaker than what
> posix requires, I'm wondering if this is expected to stay
> or glibc might want to switch to stronger semantics.
> 
Since when?  Include the text that requires this.

> is it worthwhile to add optimized asm with weak semantics
> for other targets that currently use the generic c code?
> 
> (the issue is that for correct pthread_spin_trylock behavior
> the lock should be seqcst instead of acquire and the unlock
> should be release instead of barrier+store otherwise trylock
> can spuriously report locked state).
> 

Paul's patch already changes pthread_spin_unlock to atomic_store_release,
which will generate lwsync/stw.

But I don't think anyone wants or needs pthread_spin_lock to be seqcst.

Also, as the acquire sequence used in Paul's patch is a full "import
barrier", it is sufficient for the critical region.

Read PowerISA-2.07B Book II, Appendix B "Programming Examples for Shared
Storage", Section B.2.1 "Lock Acquisition and Import Barriers".

It specifically says that a hwsync is not required if the acquire
import barrier is used, and this sequence will perform better.
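
Concretely, the acquisition sequence under discussion has roughly the
following shape (a paraphrase of the B.2.1 idiom with my annotations,
not a copy of the patch; the ",1" EH operand needs a new enough
ISA/assembler, which is what MUTEX_HINT_ACQ guards in the patch):

static inline int
try_acquire_sketch (volatile int *lock)
{
  int tmp;
  __asm__ __volatile__ (
    "1:	lwarx	%0,0,%1,1\n"  /* Load-and-reserve, EH=1 (lock acquire).  */
    "	cmpwi	0,%0,0\n"     /* Already held?  */
    "	bne-	2f\n"
    "	stwcx.	%2,0,%1\n"    /* Try to store 1 into the lock word.  */
    "	bne-	1b\n"         /* Reservation lost: retry.  */
    "	isync\n"              /* Branch + isync: acquire import barrier.  */
    "2:"
    : "=&r" (tmp)
    : "r" (lock), "r" (1)
    : "cr0", "memory");
  return tmp == 0;            /* Nonzero tmp means the lock was held.  */
}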
  
Torvald Riegel Oct. 1, 2015, 7:17 p.m. UTC | #4
On Thu, 2015-10-01 at 10:18 +0100, Szabolcs Nagy wrote:
> On 30/09/15 17:28, Paul E. Murphy wrote:
> >
> > ---8<---
> > This patch optimizes powerpc spinlock implementation by:
> >
> ...
> 
> The glibc pthread spinlock semantics is weaker than what
> posix requires, I'm wondering if this is expected to stay
> or glibc might want to switch to stronger semantics.

I think this should stay the way it is.  Thus, do what C++ and so also
C11 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_470)
specify.  Making this (it's all the mtx operations that succeed, not
just trylock) seqcst would decrease performance for the most common case
just to make arcane cases work (e.g., abusing POSIX synchronization
functions such as trylock or sem_getvalue as atomics).

Fixing that at the POSIX level would require POSIX to use a more
involved memory model (hopefully following the C11 model).  If anyone
feels like contributing to make this happen, please do so.

> is it worthwhile to add optimized asm with weak semantics
> for other targets that currently use the generic c code?

I think only nptl/pthread_spin_unlock.c should be changed; the other
generic functions use weaker memory orders already.  OTOH, this would
change existing behavior, so one can argue this is more risky than
keeping weaker-than-POSIX implementations unchanged.

> (the issue is that for correct pthread_spin_trylock behavior
> the lock should be seqcst instead of acquire and the unlock
> should be release instead of barrier+store otherwise trylock
> can spuriously report locked state).

Right now, unlock is a full barrier (ie, seqcst) plus store.  That is
stronger than a release store.  Also note that a failing POSIX
synchronization function is not supposed to synchronize memory.  So, a
failing trylock doesn't help a program unless it synchronizes through
some other way, in which case this other way will "provide" the
barriers.
  
Rich Felker Oct. 1, 2015, 7:27 p.m. UTC | #5
On Thu, Oct 01, 2015 at 09:17:06PM +0200, Torvald Riegel wrote:
> > (the issue is that for correct pthread_spin_trylock behavior
> > the lock should be seqcst instead of acquire and the unlock
> > should be release instead of barrier+store otherwise trylock
> > can spuriously report locked state).
> 
> Right now, unlock is a full barrier (ie, seqcst) plus store.  That is
> stronger than a release store.  Also note that a failing POSIX
> synchronization function is not supposed to synchronize memory.  So, a
> failing trylock doesn't help a program unless it synchronizes through
> some other way, in which case this other way will "provide" the
> barriers.

As always, the issue is not whether it synchronizes memory on failure,
but whether it can accurately determine whether to fail or not without
synchronizing memory. The latter is an implementation detail of
course, but in practice you can't make this determination without
synchronizing memory.

Rich
  
Torvald Riegel Oct. 1, 2015, 7:27 p.m. UTC | #6
On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote:
> Changes from V1:
> 
> * Use C macro for atomic_store_release as suggested in
> comments.
> 
> * Run benchmarks to quantize the performance changes and
> note in commit message.
> 
> ---8<---
> This patch optimizes powerpc spinlock implementation by:
> 
> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
>   first.  The optimization for such case is to avoid the first lwarx
>   in contention case (which is much more costly than normal loads).
> 
> * Use the correct EH hint bit on the larx for supported ISA.  For lock
>   acquisition, the thread that acquired the lock with a successful stcx
>   do not want to give away the write ownership on the cacheline.  The
>   idea is to make the load reservation "sticky" about retaining write
>   authority to the line.  That way, the store that must inevitably come
>   to release the lock can succeed quickly and not contend with other
>   threads issuing lwarx.  If another thread does a store to the line
>   (false sharing), the winning thread must give up write authority to
>   The proper value of EH for the larx for a lock acquisition is 1.
> 
> * Increase contented lock performance by up to 40%, and no measurable
>   impact on uncontended locks on P8.

Could you add the tests you used to the glibc microbenchmarks (or
whatever works best for them)?  We do want to be able to track
performance, and having benchmarks is the first step towards that.

> It also adds some cleanup to use the defined acquire semantic
> instructions and function prototype using default C style.
> 
> Thanks to Adhemerval Zanella who did most of the work.  I've run some
> tests, and addressed some minor feedback.
> 
> 2015-09-25  Paul E. Murphy  <murphyp@linux.vnet.ibm.com>
> 
> 	* sysdeps/powerpc/nptl/pthread_spin_lock.c (pthread_spin_lock):
> 	Optimize first check for contention case and add lwarx hint.
> 	* sysdeps/powerpc/nptl/pthread_spin_trylock.c (pthread_spin_trylock):
> 	Use ANSI prototype.
> 	* sysdep/unix/sysv/linux/powerpc/pthread_spin_unlock.c: Move to ...
> 	* sysdeps/powerpc/nptl/pthread_spin_unlock.c: ... here, and
> 	update to new atomic macros.
> ---
>  sysdeps/powerpc/nptl/pthread_spin_lock.c           |   22 ++++++++-------
>  sysdeps/powerpc/nptl/pthread_spin_trylock.c        |    3 +-
>  sysdeps/powerpc/nptl/pthread_spin_unlock.c         |   27 +++++++++++++++++++
>  .../unix/sysv/linux/powerpc/pthread_spin_unlock.c  |   28 --------------------
>  4 files changed, 40 insertions(+), 40 deletions(-)
>  create mode 100644 sysdeps/powerpc/nptl/pthread_spin_unlock.c
>  delete mode 100644 sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> 
> diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c
> index cc081f8..8a39da9 100644
> --- a/sysdeps/powerpc/nptl/pthread_spin_lock.c
> +++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c
> @@ -19,24 +19,26 @@
>  #include "pthreadP.h"
>  
>  int
> -pthread_spin_lock (lock)
> -     pthread_spinlock_t *lock;
> +pthread_spin_lock (pthread_spinlock_t *lock)
>  {
>    unsigned int __tmp;
>  
>    asm volatile (
> -       "1:	lwarx	%0,0,%1\n"
> +       "0:	lwzx    %0,0,%1\n"
> +       "	cmpwi   0,%0,0\n"
> +       "	bne	0b\n"
> +       "1:	lwarx	%0,0,%1" MUTEX_HINT_ACQ "\n"
>         "	cmpwi	0,%0,0\n"
>         "	bne-	2f\n"
>         "	stwcx.	%2,0,%1\n"
>         "	bne-	2f\n"
> -       "	isync\n"
> -       "	.subsection 1\n"
> -       "2:	lwzx	%0,0,%1\n"
> -       "	cmpwi	0,%0,0\n"
> -       "	bne	2b\n"
> -       "	b	1b\n"
> -       "	.previous"
> +                __ARCH_ACQ_INSTR "\n"
> +       "        .subsection 1\n"
> +       "2:	lwzx    %0,0,%1\n"
> +       "        cmpwi   0,%0,0\n"
> +       "        bne     2b\n"
> +       "        b       1b\n"
> +       "        .previous"
>         : "=&r" (__tmp)
>         : "r" (lock), "r" (1)
>         : "cr0", "memory");

Is this essentially a test-and-test-and-set implementation of a lock,
with the MUTEX_HINT_ACQ hint additionally?  If so, have you considered
adding a variant of atomic_compare_exchange_weak_acquire or
atomic_exchange_acquire that sets the hint, and then writing a C
function using an atomic load in the outer test loop and the new
cmpxchg/xchg variant in the inner loop?  That would make it easier to
eventually merge this with the generic version of spinlocks.  Also,
once we get around to adding spinning and randomized exponential
backoff to the generic locks, powerpc would be able to benefit from
those generic changes too (perhaps with some additional tuning).
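
For reference, a rough sketch of that shape (my reading of the
suggestion, not code from the thread; atomic_compare_exchange_weak_acq_excl
is a hypothetical name for a glibc macro that would behave like
atomic_compare_exchange_weak_acquire but also set the larx EH=1 hint on
Power, and simply alias the plain acquire CAS elsewhere):

int
pthread_spin_lock (pthread_spinlock_t *lock)
{
  int expected;
  do
    {
      /* Outer test loop: spin with plain loads while the lock is
         observed to be held, touching the line read-only.  */
      while (atomic_load_relaxed (lock) != 0)
        ;
      expected = 0;
    }
  /* Inner step: acquire CAS carrying the exclusive-access hint.  */
  while (!atomic_compare_exchange_weak_acq_excl (lock, &expected, 1));

  return 0;
}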

> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> new file mode 100644
> index 0000000..f830ad2
> --- /dev/null
> +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> @@ -0,0 +1,27 @@
> +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
> +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include "pthreadP.h"
> +#include <lowlevellock.h>

Do you really need to include lowlevellock.h?

> +int
> +pthread_spin_unlock (pthread_spinlock_t *lock)
> +{
> +  atomic_store_release (lock, 0);
> +  return 0;
> +}
  
Torvald Riegel Oct. 1, 2015, 7:35 p.m. UTC | #7
On Thu, 2015-10-01 at 15:27 -0400, Rich Felker wrote:
> On Thu, Oct 01, 2015 at 09:17:06PM +0200, Torvald Riegel wrote:
> > > (the issue is that for correct pthread_spin_trylock behavior
> > > the lock should be seqcst instead of acquire and the unlock
> > > should be release instead of barrier+store otherwise trylock
> > > can spuriously report locked state).
> > 
> > Right now, unlock is a full barrier (ie, seqcst) plus store.  That is
> > stronger than a release store.  Also note that a failing POSIX
> > synchronization function is not supposed to synchronize memory.  So, a
> > failing trylock doesn't help a program unless it synchronizes through
> > some other way, in which case this other way will "provide" the
> > barriers.
> 
> As always, the issue is not whether it synchronizes memory on failure,
> but whether it can accurately determine whether to fail or not without
> synchronizing memory. The latter is an implementation detail of
> course, but in practice you can't make this determination without
> synchronizing memory.

Sticking to our example here, you can determine this.  When you do a
memory_order_relaxed load as part of a trylock implementation, for
example, you "pick a position" in happens-before that always represents
a valid observation of state from this thread's perspective.  The
question is whether you make guarantees to your caller about your
observation and what it might mean for something else, or not.
  
Steven Munroe Oct. 1, 2015, 8:28 p.m. UTC | #8
On Thu, 2015-10-01 at 21:27 +0200, Torvald Riegel wrote:
> On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote:
> > Changes from V1:
> > 
> > * Use C macro for atomic_store_release as suggested in
> > comments.
> > 
> > * Run benchmarks to quantize the performance changes and
> > note in commit message.
> > 
> > ---8<---
> > This patch optimizes powerpc spinlock implementation by:
> > 
> > * Current algorithm spin over a lwzx, but only after issuing a lwarx.
> >   first.  The optimization for such case is to avoid the first lwarx
> >   in contention case (which is much more costly than normal loads).
> > 
> > * Use the correct EH hint bit on the larx for supported ISA.  For lock
> >   acquisition, the thread that acquired the lock with a successful stcx
> >   do not want to give away the write ownership on the cacheline.  The
> >   idea is to make the load reservation "sticky" about retaining write
> >   authority to the line.  That way, the store that must inevitably come
> >   to release the lock can succeed quickly and not contend with other
> >   threads issuing lwarx.  If another thread does a store to the line
> >   (false sharing), the winning thread must give up write authority to
> >   The proper value of EH for the larx for a lock acquisition is 1.
> > 
> > * Increase contented lock performance by up to 40%, and no measurable
> >   impact on uncontended locks on P8.
> 
> Could you add the tests you used to the glibc microbenchmarks (or
> whatever works best for them)?  We do want to be able to track
> performance, and having benchmarks is the first step towards that.
> 
> > It also adds some cleanup to use the defined acquire semantic
> > instructions and function prototype using default C style.
> > 
> > Thanks to Adhemerval Zanella who did most of the work.  I've run some
> > tests, and addressed some minor feedback.
> > 
> > 2015-09-25  Paul E. Murphy  <murphyp@linux.vnet.ibm.com>
> > 
> > 	* sysdeps/powerpc/nptl/pthread_spin_lock.c (pthread_spin_lock):
> > 	Optimize first check for contention case and add lwarx hint.
> > 	* sysdeps/powerpc/nptl/pthread_spin_trylock.c (pthread_spin_trylock):
> > 	Use ANSI prototype.
> > 	* sysdep/unix/sysv/linux/powerpc/pthread_spin_unlock.c: Move to ...
> > 	* sysdeps/powerpc/nptl/pthread_spin_unlock.c: ... here, and
> > 	update to new atomic macros.
> > ---
> >  sysdeps/powerpc/nptl/pthread_spin_lock.c           |   22 ++++++++-------
> >  sysdeps/powerpc/nptl/pthread_spin_trylock.c        |    3 +-
> >  sysdeps/powerpc/nptl/pthread_spin_unlock.c         |   27 +++++++++++++++++++
> >  .../unix/sysv/linux/powerpc/pthread_spin_unlock.c  |   28 --------------------
> >  4 files changed, 40 insertions(+), 40 deletions(-)
> >  create mode 100644 sysdeps/powerpc/nptl/pthread_spin_unlock.c
> >  delete mode 100644 sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
> > 
> > diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c
> > index cc081f8..8a39da9 100644
> > --- a/sysdeps/powerpc/nptl/pthread_spin_lock.c
> > +++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c
> > @@ -19,24 +19,26 @@
> >  #include "pthreadP.h"
> >  
> >  int
> > -pthread_spin_lock (lock)
> > -     pthread_spinlock_t *lock;
> > +pthread_spin_lock (pthread_spinlock_t *lock)
> >  {
> >    unsigned int __tmp;
> >  
> >    asm volatile (
> > -       "1:	lwarx	%0,0,%1\n"
> > +       "0:	lwzx    %0,0,%1\n"
> > +       "	cmpwi   0,%0,0\n"
> > +       "	bne	0b\n"
> > +       "1:	lwarx	%0,0,%1" MUTEX_HINT_ACQ "\n"
> >         "	cmpwi	0,%0,0\n"
> >         "	bne-	2f\n"
> >         "	stwcx.	%2,0,%1\n"
> >         "	bne-	2f\n"
> > -       "	isync\n"
> > -       "	.subsection 1\n"
> > -       "2:	lwzx	%0,0,%1\n"
> > -       "	cmpwi	0,%0,0\n"
> > -       "	bne	2b\n"
> > -       "	b	1b\n"
> > -       "	.previous"
> > +                __ARCH_ACQ_INSTR "\n"
> > +       "        .subsection 1\n"
> > +       "2:	lwzx    %0,0,%1\n"
> > +       "        cmpwi   0,%0,0\n"
> > +       "        bne     2b\n"
> > +       "        b       1b\n"
> > +       "        .previous"
> >         : "=&r" (__tmp)
> >         : "r" (lock), "r" (1)
> >         : "cr0", "memory");
> 
> Is this essentially a test-and-test-and-set implementation of a lock,
> with the MUTEX_HINT_ACQ hint additionally?  If so, have you considered
> adding a variant of atomic_compare_exchange_weak_acquire or
> atomic_exchange_acquire that sets the hint, and then writing a C
> function using an atomic load in the outer test loop and the new
> cmpxchg/xchg variant in the inner loop?  That would make it easier to
> eventually merge this with the generic version of spinlocks.  Also,
> once we get around to adding spinning and randomized exponential
> backoff to the generic locks, powerpc would be able to benefit from
> those generic changes too (perhaps with some additional tuning).
> 
Paul is traveling today so I will answer.

It may be functionally possible to use a 

__atomic_compare_exchange_n ((mem), (expected), (desired),
	weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)

in this construct.

But the cmpxchg is a clumsy way to generate the PowerISA Acquire Import
Barrier which has a very specific form. 

Also it is not obvious how the MUTEX_HINT_ACQ applies to
__atomic_compare_exchange_n in general or even
__atomic_compare_exchange_n (*, *, *, *, __ATOMIC_ACQUIRE,
__ATOMIC_RELAXED).

In the PowerISA, MUTEX_HINT_ACQ means:

 The Load and Reserve instructions include an Exclusive
 Access hint (EH), which can be used to indicate
 that the instruction sequence being executed is implementing
 one of two types of algorithms:

 Atomic Update (EH=0)
  This hint indicates that the program is using a fetch and
  operate (e.g., fetch and add) or some similar algorithm
  and that all programs accessing the shared variable are
  likely to use a similar operation to access the shared
  variable for some time.
 Exclusive Access (EH=1)
  This hint indicates that the program is attempting to
  acquire a lock and if it succeeds, will perform another
  store to the lock variable (releasing the lock) before
  another program attempts to modify the lock variable.

It is not clear that C11 __atomic_compare_exchange when/if/ever implies
exclusive access in the meaning of the PowerISA.
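
For context, the hint reaches the asm through a macro along these lines
(a hedged sketch from memory of the powerpc sysdep headers; the exact
file and guard macro may differ):

/* When the target ISA/assembler accepts the EH operand, MUTEX_HINT_ACQ
   expands to ",1" and gets pasted onto the lwarx in the inline asm
   string; otherwise it expands to nothing and a plain lwarx is used.  */
#ifdef _ARCH_PWR6
# define MUTEX_HINT_ACQ ",1"   /* Exclusive Access (lock acquisition).  */
# define MUTEX_HINT_REL ",0"   /* Atomic update.  */
#else
# define MUTEX_HINT_ACQ
# define MUTEX_HINT_REL
#endif

/* Used as in the patch:  "1:	lwarx	%0,0,%1" MUTEX_HINT_ACQ "\n"  */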

Finally trying to combine atomic_compare_exchange_weak_acquire with the
relaxed load spin is likely to generate some awkward sequences with
unnecessary code.

I would prefer to proceed with this implementation and revisit when we
are further along with C11 integration.

> > diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> > new file mode 100644
> > index 0000000..f830ad2
> > --- /dev/null
> > +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
> > @@ -0,0 +1,27 @@
> > +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
> > +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include "pthreadP.h"
> > +#include <lowlevellock.h>
> 
> Do you really need to include lowlevellock.h?
> 
> > +int
> > +pthread_spin_unlock (pthread_spinlock_t *lock)
> > +{
> > +  atomic_store_release (lock, 0);
> > +  return 0;
> > +}
> 
>
  
Torvald Riegel Oct. 2, 2015, 11:38 a.m. UTC | #9
On Thu, 2015-10-01 at 15:28 -0500, Steven Munroe wrote:
> On Thu, 2015-10-01 at 21:27 +0200, Torvald Riegel wrote:
> > On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote:
> > > diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c
> > > index cc081f8..8a39da9 100644
> > > --- a/sysdeps/powerpc/nptl/pthread_spin_lock.c
> > > +++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c
> > > @@ -19,24 +19,26 @@
> > >  #include "pthreadP.h"
> > >  
> > >  int
> > > -pthread_spin_lock (lock)
> > > -     pthread_spinlock_t *lock;
> > > +pthread_spin_lock (pthread_spinlock_t *lock)
> > >  {
> > >    unsigned int __tmp;
> > >  
> > >    asm volatile (
> > > -       "1:	lwarx	%0,0,%1\n"
> > > +       "0:	lwzx    %0,0,%1\n"
> > > +       "	cmpwi   0,%0,0\n"
> > > +       "	bne	0b\n"
> > > +       "1:	lwarx	%0,0,%1" MUTEX_HINT_ACQ "\n"
> > >         "	cmpwi	0,%0,0\n"
> > >         "	bne-	2f\n"
> > >         "	stwcx.	%2,0,%1\n"
> > >         "	bne-	2f\n"
> > > -       "	isync\n"
> > > -       "	.subsection 1\n"
> > > -       "2:	lwzx	%0,0,%1\n"
> > > -       "	cmpwi	0,%0,0\n"
> > > -       "	bne	2b\n"
> > > -       "	b	1b\n"
> > > -       "	.previous"
> > > +                __ARCH_ACQ_INSTR "\n"
> > > +       "        .subsection 1\n"
> > > +       "2:	lwzx    %0,0,%1\n"
> > > +       "        cmpwi   0,%0,0\n"
> > > +       "        bne     2b\n"
> > > +       "        b       1b\n"
> > > +       "        .previous"
> > >         : "=&r" (__tmp)
> > >         : "r" (lock), "r" (1)
> > >         : "cr0", "memory");
> > 
> > Is this essentially a test-and-test-and-set implementation of a lock,
> > with the MUTEX_HINT_ACQ hint additionally?  If so, have you considered
> > adding a variant of atomic_compare_exchange_weak_acquire or
> > atomic_exchange_acquire that sets the hint, and then writing a C
> > function using an atomic load in the outer test loop and the new
> > cmpxchg/xchg variant in the inner loop?  That would make it easier to
> > eventually merge this with the generic version of spinlocks.  Also,
> > once we get around to adding spinning and randomized exponential
> > backoff to the generic locks, powerpc would be able to benefit from
> > those generic changes too (perhaps with some additional tuning).
> > 
> Paul is traveling today so I will answer.
> 
> It may be functionally possible to use a 
> 
> __atomic_compare_exchange_n ((mem), (expected), (desired),
> 	weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)
> 
> in this construct.

I agree, but that's not what I proposed.

> But the cmpxchg is a clumsy way to generate the PowerISA Acquire Import
> Barrier which has a very specific form. 
> 
> Also it is not obvious how the MUTEX_HINT_ACQ applies to
> __atomic_compare_exchange_n in general or even
> __atomic_compare_exchange_n (*, *, *, *, __ATOMIC_ACQUIRE,
> __ATOMIC_RELAXED).
> 
> In the PowerISA, MUTEX_HINT_ACQ means:
> 
>  The Load and Reserve instructions include an Exclusive
>  Access hint (EH), which can be used to indicate
>  that the instruction sequence being executed is implementing
>  one of two types of algorithms:
> 
>  Atomic Update (EH=0)
>   This hint indicates that the program is using a fetch and
>   operate (e.g., fetch and add) or some similar algorithm
>   and that all programs accessing the shared variable are
>   likely to use a similar operation to access the shared
>   variable for some time.
>  Exclusive Access (EH=1)
>   This hint indicates that the program is attempting to
>   acquire a lock and if it succeeds, will perform another
>   store to the lock variable (releasing the lock) before
>   another program attempts to modify the lock variable.
> 
> It is not clear that C11 __atomic_compare_exchange when/if/ever implies
> exclusive access in the meaning of the PowerISA.

Yes.  The hint is orthogonal to the C11 CAS semantics.

This is why I asked about adding a variant for *glibc's*
atomic_compare_exchange_weak_acquire, see include/atomic.h.  This would
likely be implemented with custom asm on Power, and would just defer to
atomic_compare_exchange_weak_acquire on all other archs.

Whether introducing that is better than just specializing the spinlocks
is something you will have to figure out.  But there are several
lock-like CASes throughout the code (e.g., pthread_once, so not just in
the mutexes or rwlock).  You probably don't want to have custom Power
variants of all those synchronization mechanisms.

> Finally trying to combine atomic_compare_exchange_weak_acquire with the
> relaxed load spin is likely to generate some awkward sequences with
> unnecessary code.

Perhaps, but that's determined by the compiler and how good it is at
optimizing atomics.  Whether it matters will have to be tracked by a
benchmark: If this unnecessary code matters and the compiler isn't doing
a good enough job yet, it should be visible in some benchmark; once it
is, the benchmark will tell us and we're not stuck with an "old
assumption" in our code.

> I would prefer to proceed with this implementation and revisit when we
> are further along with C11 integration.

I don't mind having intermediate steps.  But if you do something
Power-specific, and then somebody else improves the generic version, IMO
this other person has no obligation to spend any effort on updating your
Power-specific version too.
  
Paul E. Murphy Oct. 7, 2015, 7 p.m. UTC | #10
On 10/01/2015 02:27 PM, Torvald Riegel wrote:
> On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote:
>> Changes from V1:
>>
>> * Use C macro for atomic_store_release as suggested in
>> comments.
>>
>> * Run benchmarks to quantize the performance changes and
>> note in commit message.
>>
>> ---8<---
>> This patch optimizes powerpc spinlock implementation by:
>>
>> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
>>   first.  The optimization for such case is to avoid the first lwarx
>>   in contention case (which is much more costly than normal loads).
>>
>> * Use the correct EH hint bit on the larx for supported ISA.  For lock
>>   acquisition, the thread that acquired the lock with a successful stcx
>>   do not want to give away the write ownership on the cacheline.  The
>>   idea is to make the load reservation "sticky" about retaining write
>>   authority to the line.  That way, the store that must inevitably come
>>   to release the lock can succeed quickly and not contend with other
>>   threads issuing lwarx.  If another thread does a store to the line
>>   (false sharing), the winning thread must give up write authority to
>>   The proper value of EH for the larx for a lock acquisition is 1.
>>
>> * Increase contented lock performance by up to 40%, and no measurable
>>   impact on uncontended locks on P8.
> 
> Could you add the tests you used to the glibc microbenchmarks (or
> whatever works best for them)?  We do want to be able to track
> performance, and having benchmarks is the first step towards that.

Would you object to that being a separate patch?

As I sketch out the work to integrate it into our benchtest, it seems
like it could be trivially extended to mutexes as well.

>> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
>> new file mode 100644
>> index 0000000..f830ad2
>> --- /dev/null
>> +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
>> @@ -0,0 +1,27 @@
>> +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
>> +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include "pthreadP.h"
>> +#include <lowlevellock.h>
> 
> Do you really need to include lowlevellock.h?

No, atomic.h is needed at minimum.

Thanks,
Paul
  
Torvald Riegel Oct. 8, 2015, 3:22 p.m. UTC | #11
On Wed, 2015-10-07 at 14:00 -0500, Paul E. Murphy wrote:
> 
> On 10/01/2015 02:27 PM, Torvald Riegel wrote:
> > On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote:
> >> Changes from V1:
> >>
> >> * Use C macro for atomic_store_release as suggested in
> >> comments.
> >>
> >> * Run benchmarks to quantize the performance changes and
> >> note in commit message.
> >>
> >> ---8<---
> >> This patch optimizes powerpc spinlock implementation by:
> >>
> >> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
> >>   first.  The optimization for such case is to avoid the first lwarx
> >>   in contention case (which is much more costly than normal loads).
> >>
> >> * Use the correct EH hint bit on the larx for supported ISA.  For lock
> >>   acquisition, the thread that acquired the lock with a successful stcx
> >>   do not want to give away the write ownership on the cacheline.  The
> >>   idea is to make the load reservation "sticky" about retaining write
> >>   authority to the line.  That way, the store that must inevitably come
> >>   to release the lock can succeed quickly and not contend with other
> >>   threads issuing lwarx.  If another thread does a store to the line
> >>   (false sharing), the winning thread must give up write authority to
> >>   The proper value of EH for the larx for a lock acquisition is 1.
> >>
> >> * Increase contented lock performance by up to 40%, and no measurable
> >>   impact on uncontended locks on P8.
> > 
> > Could you add the tests you used to the glibc microbenchmarks (or
> > whatever works best for them)?  We do want to be able to track
> > performance, and having benchmarks is the first step towards that.
> 
> Would you object to that being a separate patch?

No.

> As I sketch out the work to integrate it into our benchtest, it seems
> like it could be trivially extended to mutexes aswell.

Agreed.
  

Patch

diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c
index cc081f8..8a39da9 100644
--- a/sysdeps/powerpc/nptl/pthread_spin_lock.c
+++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c
@@ -19,24 +19,26 @@ 
 #include "pthreadP.h"
 
 int
-pthread_spin_lock (lock)
-     pthread_spinlock_t *lock;
+pthread_spin_lock (pthread_spinlock_t *lock)
 {
   unsigned int __tmp;
 
   asm volatile (
-       "1:	lwarx	%0,0,%1\n"
+       "0:	lwzx    %0,0,%1\n"
+       "	cmpwi   0,%0,0\n"
+       "	bne	0b\n"
+       "1:	lwarx	%0,0,%1" MUTEX_HINT_ACQ "\n"
        "	cmpwi	0,%0,0\n"
        "	bne-	2f\n"
        "	stwcx.	%2,0,%1\n"
        "	bne-	2f\n"
-       "	isync\n"
-       "	.subsection 1\n"
-       "2:	lwzx	%0,0,%1\n"
-       "	cmpwi	0,%0,0\n"
-       "	bne	2b\n"
-       "	b	1b\n"
-       "	.previous"
+                __ARCH_ACQ_INSTR "\n"
+       "        .subsection 1\n"
+       "2:	lwzx    %0,0,%1\n"
+       "        cmpwi   0,%0,0\n"
+       "        bne     2b\n"
+       "        b       1b\n"
+       "        .previous"
        : "=&r" (__tmp)
        : "r" (lock), "r" (1)
        : "cr0", "memory");
diff --git a/sysdeps/powerpc/nptl/pthread_spin_trylock.c b/sysdeps/powerpc/nptl/pthread_spin_trylock.c
index 77a4615..c485aa4 100644
--- a/sysdeps/powerpc/nptl/pthread_spin_trylock.c
+++ b/sysdeps/powerpc/nptl/pthread_spin_trylock.c
@@ -20,8 +20,7 @@ 
 #include "pthreadP.h"
 
 int
-pthread_spin_trylock (lock)
-     pthread_spinlock_t *lock;
+pthread_spin_trylock (pthread_spinlock_t *lock)
 {
   unsigned int old;
   int err = EBUSY;
diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
new file mode 100644
index 0000000..f830ad2
--- /dev/null
+++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
@@ -0,0 +1,27 @@ 
+/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
+   Copyright (C) 2007-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "pthreadP.h"
+#include <lowlevellock.h>
+
+int
+pthread_spin_unlock (pthread_spinlock_t *lock)
+{
+  atomic_store_release (lock, 0);
+  return 0;
+}
diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
deleted file mode 100644
index 7af694f..0000000
--- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
+++ /dev/null
@@ -1,28 +0,0 @@ 
-/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
-   Copyright (C) 2007-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-int
-pthread_spin_unlock (pthread_spinlock_t *lock)
-{
-  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
-  *lock = 0;
-  return 0;
-}