Message ID | 560C0DA6.5060409@linux.vnet.ibm.com
---|---
State | Superseded
Delegated to | Tulio Magno Quites Machado Filho
Headers |
From: "Paul E. Murphy" <murphyp@linux.vnet.ibm.com>
Subject: [PATCHv2] powerpc: Spinlock optimization and cleanup
To: libc-alpha@sourceware.org, triegel@redhat.com, rth@twiddle.net, Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>, Steve Munroe <sjmunroe@us.ibm.com>
Message-ID: <560C0DA6.5060409@linux.vnet.ibm.com>
Date: Wed, 30 Sep 2015 11:28:22 -0500
Commit Message
Paul E. Murphy
Sept. 30, 2015, 4:28 p.m. UTC
Changes from V1:

* Use the C macro for atomic_store_release as suggested in comments.

* Run benchmarks to quantify the performance changes and note them in the commit message.

---8<---
This patch optimizes the powerpc spinlock implementation by:

* The current algorithm spins on a lwzx, but only after first issuing a lwarx. The optimization here is to avoid that first lwarx in the contended case (it is much more costly than a normal load).

* Using the correct EH hint bit on the larx for supported ISAs. For lock acquisition, the thread that acquired the lock with a successful stcx does not want to give away write ownership of the cacheline. The idea is to make the load reservation "sticky" about retaining write authority to the line. That way, the store that must inevitably come to release the lock can succeed quickly and not contend with other threads issuing lwarx. If another thread does a store to the line (false sharing), the winning thread must give up write authority to

The proper value of EH for the larx for a lock acquisition is 1.

* Increasing contended lock performance by up to 40%, with no measurable impact on uncontended locks, on P8.

It also adds some cleanup: using the defined acquire-semantics instructions and default C-style function prototypes.

Thanks to Adhemerval Zanella, who did most of the work. I've run some tests and addressed some minor feedback.

2015-09-25  Paul E. Murphy  <murphyp@linux.vnet.ibm.com>

	* sysdeps/powerpc/nptl/pthread_spin_lock.c (pthread_spin_lock):
	Optimize first check for contention case and add lwarx hint.
	* sysdeps/powerpc/nptl/pthread_spin_trylock.c (pthread_spin_trylock):
	Use ANSI prototype.
	* sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c: Move to ...
	* sysdeps/powerpc/nptl/pthread_spin_unlock.c: ... here, and
	update to new atomic macros.
--- sysdeps/powerpc/nptl/pthread_spin_lock.c | 22 ++++++++------- sysdeps/powerpc/nptl/pthread_spin_trylock.c | 3 +- sysdeps/powerpc/nptl/pthread_spin_unlock.c | 27 +++++++++++++++++++ .../unix/sysv/linux/powerpc/pthread_spin_unlock.c | 28 -------------------- 4 files changed, 40 insertions(+), 40 deletions(-) create mode 100644 sysdeps/powerpc/nptl/pthread_spin_unlock.c delete mode 100644 sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
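The patch turns pthread_spin_lock into a test-and-test-and-set lock: spin on cheap plain loads (the lwzx) while the lock is held, and only attempt the expensive lwarx/stwcx. sequence once the lock is observed free. A portable C sketch of that shape, using GCC `__atomic` builtins (names and structure here are illustrative, not glibc's actual code):

```c
#include <assert.h>

/* Illustrative test-and-test-and-set spinlock: spin on cheap relaxed
   loads while the lock is held, and only attempt the (expensive)
   atomic read-modify-write once the lock is observed free.  */
typedef int ttas_lock_t;

void
ttas_lock (ttas_lock_t *lock)
{
  while (1)
    {
      /* Outer spin: plain load, the analogue of the lwzx loop.  */
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
        ;
      /* Acquisition attempt: the analogue of lwarx/stwcx. with
         acquire semantics; succeeds only if we win the race.  */
      if (__atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) == 0)
        return;
    }
}

void
ttas_unlock (ttas_lock_t *lock)
{
  __atomic_store_n (lock, 0, __ATOMIC_RELEASE);
}
```

The outer relaxed spin keeps contending threads reading a shared cacheline instead of repeatedly invalidating it with reservation requests, which is the point of the optimization.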
Comments
"Paul E. Murphy" <murphyp@linux.vnet.ibm.com> writes:

> Changes from V1:
>
> * Use C macro for atomic_store_release as suggested in
> comments.
>
> * Run benchmarks to quantize the performance changes and
> note in commit message.

Thanks for resurrecting this patch!

> This patch optimizes powerpc spinlock implementation by:
>
> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
> first. The optimization for such case is to avoid the first lwarx
> in contention case (which is much more costly than normal loads).
>
> * Use the correct EH hint bit on the larx for supported ISA. For lock

lwarx

> acquisition, the thread that acquired the lock with a successful stcx

stwcx

> do not want to give away the write ownership on the cacheline. The
> idea is to make the load reservation "sticky" about retaining write
> authority to the line. That way, the store that must inevitably come
> to release the lock can succeed quickly and not contend with other
> threads issuing lwarx. If another thread does a store to the line
> (false sharing), the winning thread must give up write authority to

Is something missing between the previous line and the next?

> The proper value of EH for the larx for a lock acquisition is 1.

Could you move part of this explanation to pthread_spin_lock.c as a comment, please? I'm referring only to the part where you explain the logic behind this optimization.

> * Increase contented lock performance by up to 40%, and no measurable
> impact on uncontended locks on P8.
>
> It also adds some cleanup to use the defined acquire semantic
> instructions and function prototype using default C style.
>
> Thanks to Adhemerval Zanella who did most of the work. I've run some
> tests, and addressed some minor feedback.
>
> 2015-09-25 Paul E. Murphy <murphyp@linux.vnet.ibm.com>

As this patch is based on the work of Adhemerval, it's important to mention him in the ChangeLog.
Here's an example: https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=d3573f61a#patch1 > --- > sysdeps/powerpc/nptl/pthread_spin_lock.c | 22 ++++++++------- > sysdeps/powerpc/nptl/pthread_spin_trylock.c | 3 +- > sysdeps/powerpc/nptl/pthread_spin_unlock.c | 27 +++++++++++++++++++ > .../unix/sysv/linux/powerpc/pthread_spin_unlock.c | 28 -------------------- > 4 files changed, 40 insertions(+), 40 deletions(-) > create mode 100644 sysdeps/powerpc/nptl/pthread_spin_unlock.c > delete mode 100644 sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c > > diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c > index cc081f8..8a39da9 100644 > --- a/sysdeps/powerpc/nptl/pthread_spin_lock.c > +++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c > @@ -19,24 +19,26 @@ > #include "pthreadP.h" > > int > -pthread_spin_lock (lock) > - pthread_spinlock_t *lock; > +pthread_spin_lock (pthread_spinlock_t *lock) > { > unsigned int __tmp; > > asm volatile ( > - "1: lwarx %0,0,%1\n" > + "0: lwzx %0,0,%1\n" > + " cmpwi 0,%0,0\n" > + " bne 0b\n" > + "1: lwarx %0,0,%1" MUTEX_HINT_ACQ "\n" > " cmpwi 0,%0,0\n" > " bne- 2f\n" > " stwcx. %2,0,%1\n" > " bne- 2f\n" > - " isync\n" > - " .subsection 1\n" > - "2: lwzx %0,0,%1\n" > - " cmpwi 0,%0,0\n" > - " bne 2b\n" > - " b 1b\n" > - " .previous" > + __ARCH_ACQ_INSTR "\n" > + " .subsection 1\n" > + "2: lwzx %0,0,%1\n" > + " cmpwi 0,%0,0\n" > + " bne 2b\n" > + " b 1b\n" > + " .previous" Part of the asm instructions are indented with spaces while most of them are using tabs.
On 30/09/15 17:28, Paul E. Murphy wrote:
>
> ---8<---
> This patch optimizes powerpc spinlock implementation by:
>
> ...

The glibc pthread spinlock semantics are weaker than what POSIX requires; I'm wondering if this is expected to stay, or whether glibc might want to switch to stronger semantics.

Is it worthwhile to add optimized asm with weak semantics for other targets that currently use the generic C code?

(The issue is that, for correct pthread_spin_trylock behavior, the lock should be seqcst instead of acquire and the unlock should be release instead of barrier+store; otherwise trylock can spuriously report a locked state.)
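The semantic difference being raised can be sketched in portable C with GCC builtins (illustrative only; neither function is glibc's actual implementation). The weaker form does a cheap relaxed load first and may report "busy" based on a stale observation; the stronger form pays for a seq_cst read-modify-write on every call:

```c
#include <assert.h>
#include <errno.h>

/* Weaker trylock: a relaxed pre-load may observe a stale locked
   value and fail even though the lock was already released.  */
int
trylock_weak (int *lock)
{
  if (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
    return EBUSY;               /* Possibly a stale observation.  */
  if (__atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) != 0)
    return EBUSY;
  return 0;
}

/* Stronger trylock: always performs a seq_cst RMW, so the failure
   decision is based on a globally ordered observation.  */
int
trylock_strong (int *lock)
{
  if (__atomic_exchange_n (lock, 1, __ATOMIC_SEQ_CST) != 0)
    return EBUSY;
  return 0;
}
```

Both behave identically in the uncontended single-thread case; the difference only shows up in what a failed call is allowed to have observed under contention.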
On Thu, 2015-10-01 at 10:18 +0100, Szabolcs Nagy wrote:
> On 30/09/15 17:28, Paul E. Murphy wrote:
> >
> > ---8<---
> > This patch optimizes powerpc spinlock implementation by:
> >
> ...
>
> The glibc pthread spinlock semantics is weaker than what
> posix requires, I'm wondering if this is expected to stay
> or glibc might want to switch to stronger semantics.

Since when? Include the text that requires this?

> is it worthwhile to add optimized asm with weak semantics
> for other targets that currently use the generic c code?
>
> (the issue is that for correct pthread_spin_trylock behavior
> the lock should be seqcst instead of acquire and the unlock
> should be release instead of barrier+store otherwise trylock
> can spuriously report locked state).

Paul's patch already changes pthread_spin_unlock to atomic_store_release, which will generate lwsync/stw. But I don't think anyone wants or needs pthread_spin_lock to be seqcst.

Also, as the acquire sequence used in Paul's patch is a full "import barrier", it is sufficient for the critical region. Read PowerISA-2.07B, Book II, Appendix B "Programming Examples for Shared Storage", Section B.2.1 "Lock Acquisition and Import Barriers", which specifically says that a hwsync is not required if the acquire import barrier is used. And this sequence will perform better.
On Thu, 2015-10-01 at 10:18 +0100, Szabolcs Nagy wrote:
> On 30/09/15 17:28, Paul E. Murphy wrote:
> >
> > ---8<---
> > This patch optimizes powerpc spinlock implementation by:
> >
> ...
>
> The glibc pthread spinlock semantics is weaker than what
> posix requires, I'm wondering if this is expected to stay
> or glibc might want to switch to stronger semantics.

I think this should stay the way it is, and thus do what C++, and so also C11 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_470), specify. Making this (it's all the mtx operations that succeed, not just trylock) seqcst would decrease performance in the most common case just to make arcane cases work (e.g., abusing POSIX synchronization functions such as trylock or sem_getvalue as atomics). Fixing that at the POSIX level would require POSIX to use a more involved memory model (hopefully following the C11 model). If anyone feels like contributing to make this happen, please do so.

> is it worthwhile to add optimized asm with weak semantics
> for other targets that currently use the generic c code?

I think only nptl/pthread_spin_unlock.c should be changed; the other generic functions use weaker memory orders already. OTOH, this would change existing behavior, so one can argue this is more risky than keeping weaker-than-POSIX implementations unchanged.

> (the issue is that for correct pthread_spin_trylock behavior
> the lock should be seqcst instead of acquire and the unlock
> should be release instead of barrier+store otherwise trylock
> can spuriously report locked state).

Right now, unlock is a full barrier (i.e., seqcst) plus store. That is stronger than a release store. Also note that a failing POSIX synchronization function is not supposed to synchronize memory, so a failing trylock doesn't help a program unless it synchronizes through some other way, in which case that other way will "provide" the barriers.
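The two unlock formulations contrasted above can be sketched with GCC builtins (illustrative; glibc's atomic_store_release macro expresses the second form):

```c
#include <assert.h>

/* Old-style unlock: full barrier followed by a plain store.
   On powerpc this corresponds roughly to hwsync; stw.  */
void
unlock_full_barrier (int *lock)
{
  __atomic_thread_fence (__ATOMIC_SEQ_CST);
  __atomic_store_n (lock, 0, __ATOMIC_RELAXED);
}

/* Release-store unlock, as the patch's atomic_store_release does.
   On powerpc this corresponds roughly to lwsync; stw.  */
void
unlock_release (int *lock)
{
  __atomic_store_n (lock, 0, __ATOMIC_RELEASE);
}
```

The full-barrier form is strictly stronger (and slower on powerpc, since hwsync costs more than lwsync); the release form is what a spin unlock needs to order the critical section before the releasing store.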
On Thu, Oct 01, 2015 at 09:17:06PM +0200, Torvald Riegel wrote: > > (the issue is that for correct pthread_spin_trylock behavior > > the lock should be seqcst instead of acquire and the unlock > > should be release instead of barrier+store otherwise trylock > > can spuriously report locked state). > > Right now, unlock is a full barrier (ie, seqcst) plus store. That is > stronger than a release store. Also note that a failing POSIX > synchronization function is not supposed to synchronize memory. So, a > failing trylock doesn't help a program unless it synchronizes through > some other way, in which case this other way will "provide" the > barriers. As always, the issue is not whether it synchronizes memory on failure, but whether it can accurately determine whether to fail or not without synchronizing memory. The latter is an implementation detail of course, but in practice you can't make this determination without synchronizing memory. Rich
On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote: > Changes from V1: > > * Use C macro for atomic_store_release as suggested in > comments. > > * Run benchmarks to quantize the performance changes and > note in commit message. > > ---8<--- > This patch optimizes powerpc spinlock implementation by: > > * Current algorithm spin over a lwzx, but only after issuing a lwarx. > first. The optimization for such case is to avoid the first lwarx > in contention case (which is much more costly than normal loads). > > * Use the correct EH hint bit on the larx for supported ISA. For lock > acquisition, the thread that acquired the lock with a successful stcx > do not want to give away the write ownership on the cacheline. The > idea is to make the load reservation "sticky" about retaining write > authority to the line. That way, the store that must inevitably come > to release the lock can succeed quickly and not contend with other > threads issuing lwarx. If another thread does a store to the line > (false sharing), the winning thread must give up write authority to > The proper value of EH for the larx for a lock acquisition is 1. > > * Increase contented lock performance by up to 40%, and no measurable > impact on uncontended locks on P8. Could you add the tests you used to the glibc microbenchmarks (or whatever works best for them)? We do want to be able to track performance, and having benchmarks is the first step towards that. > It also adds some cleanup to use the defined acquire semantic > instructions and function prototype using default C style. > > Thanks to Adhemerval Zanella who did most of the work. I've run some > tests, and addressed some minor feedback. > > 2015-09-25 Paul E. Murphy <murphyp@linux.vnet.ibm.com> > > * sysdeps/powerpc/nptl/pthread_spin_lock.c (pthread_spin_lock): > Optimize first check for contention case and add lwarx hint. > * sysdeps/powerpc/nptl/pthread_spin_trylock.c (pthread_spin_trylock): > Use ANSI prototype. 
> * sysdep/unix/sysv/linux/powerpc/pthread_spin_unlock.c: Move to ... > * sysdeps/powerpc/nptl/pthread_spin_unlock.c: ... here, and > update to new atomic macros. > --- > sysdeps/powerpc/nptl/pthread_spin_lock.c | 22 ++++++++------- > sysdeps/powerpc/nptl/pthread_spin_trylock.c | 3 +- > sysdeps/powerpc/nptl/pthread_spin_unlock.c | 27 +++++++++++++++++++ > .../unix/sysv/linux/powerpc/pthread_spin_unlock.c | 28 -------------------- > 4 files changed, 40 insertions(+), 40 deletions(-) > create mode 100644 sysdeps/powerpc/nptl/pthread_spin_unlock.c > delete mode 100644 sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c > > diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c > index cc081f8..8a39da9 100644 > --- a/sysdeps/powerpc/nptl/pthread_spin_lock.c > +++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c > @@ -19,24 +19,26 @@ > #include "pthreadP.h" > > int > -pthread_spin_lock (lock) > - pthread_spinlock_t *lock; > +pthread_spin_lock (pthread_spinlock_t *lock) > { > unsigned int __tmp; > > asm volatile ( > - "1: lwarx %0,0,%1\n" > + "0: lwzx %0,0,%1\n" > + " cmpwi 0,%0,0\n" > + " bne 0b\n" > + "1: lwarx %0,0,%1" MUTEX_HINT_ACQ "\n" > " cmpwi 0,%0,0\n" > " bne- 2f\n" > " stwcx. %2,0,%1\n" > " bne- 2f\n" > - " isync\n" > - " .subsection 1\n" > - "2: lwzx %0,0,%1\n" > - " cmpwi 0,%0,0\n" > - " bne 2b\n" > - " b 1b\n" > - " .previous" > + __ARCH_ACQ_INSTR "\n" > + " .subsection 1\n" > + "2: lwzx %0,0,%1\n" > + " cmpwi 0,%0,0\n" > + " bne 2b\n" > + " b 1b\n" > + " .previous" > : "=&r" (__tmp) > : "r" (lock), "r" (1) > : "cr0", "memory"); Is this essentially a test-and-test-and-set implementation of a lock, with the MUTEX_HINT_ACQ hint additionally? 
If so, have you considered adding a variant of atomic_compare_exchange_weak_acquire or atomic_exchange_acquire that sets the hint, and then writing a C function using an atomic load in the outer test loop and the new cmpxchg/xchg variant in the inner loop? That would make it easier to eventually merge this with the generic version of spinlocks. Also, once we get to adding spinning and randomized exponential backoff to the generic locks, powerpc would be able to benefit from those generic changes too (perhaps with some additional tuning).

> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c > new file mode 100644 > index 0000000..f830ad2 > --- /dev/null > +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c > @@ -0,0 +1,27 @@ > +/* pthread_spin_unlock -- unlock a spin lock. PowerPC version. > + Copyright (C) 2007-2015 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <http://www.gnu.org/licenses/>. */ > + > +#include "pthreadP.h" > +#include <lowlevellock.h>

Do you really need to include lowlevellock.h?

> +int > +pthread_spin_unlock (pthread_spinlock_t *lock) > +{ > + atomic_store_release (lock, 0); > + return 0; > +}
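The proposed structure can be sketched in C. Here `atomic_cas_weak_acquire_excl` is a hypothetical name for the hint-carrying CAS variant (not an existing glibc macro); on powerpc it could expand to a lwarx with EH=1, and elsewhere it would simply defer to the ordinary acquire CAS:

```c
#include <assert.h>

/* Hypothetical hint-carrying acquire CAS.  On powerpc this could be
   custom asm using lwarx with EH=1; on other targets it would just be
   the plain weak acquire CAS, as below.  */
int
atomic_cas_weak_acquire_excl (int *mem, int *expected, int desired)
{
  return __atomic_compare_exchange_n (mem, expected, desired,
                                      1 /* weak */, __ATOMIC_ACQUIRE,
                                      __ATOMIC_RELAXED);
}

/* Generic spinlock shape: plain atomic load in the outer test loop,
   the CAS variant in the inner acquisition attempt.  */
int
generic_spin_lock (int *lock)
{
  int expected = 0;
  /* Fast path: try once before spinning.  */
  if (atomic_cas_weak_acquire_excl (lock, &expected, 1))
    return 0;
  do
    {
      /* Outer test loop: wait with relaxed loads until the lock
         looks free, keeping contention off the reservation.  */
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
        ;
      expected = 0;
    }
  while (!atomic_cas_weak_acquire_excl (lock, &expected, 1));
  return 0;
}
```

This keeps the lock algorithm in generic C while isolating the powerpc-specific EH hint inside the one CAS primitive.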
On Thu, 2015-10-01 at 15:27 -0400, Rich Felker wrote:
> On Thu, Oct 01, 2015 at 09:17:06PM +0200, Torvald Riegel wrote:
> > > (the issue is that for correct pthread_spin_trylock behavior
> > > the lock should be seqcst instead of acquire and the unlock
> > > should be release instead of barrier+store otherwise trylock
> > > can spuriously report locked state).
> >
> > Right now, unlock is a full barrier (ie, seqcst) plus store. That is
> > stronger than a release store. Also note that a failing POSIX
> > synchronization function is not supposed to synchronize memory. So, a
> > failing trylock doesn't help a program unless it synchronizes through
> > some other way, in which case this other way will "provide" the
> > barriers.
>
> As always, the issue is not whether it synchronizes memory on failure,
> but whether it can accurately determine whether to fail or not without
> synchronizing memory. The latter is an implementation detail of
> course, but in practice you can't make this determination without
> synchronizing memory.

Sticking to our example here, you can determine this. When you do a memory_order_relaxed load as part of a trylock implementation, for example, you "pick a position" in happens-before that always represents a valid observation of state from this thread's perspective. The question is whether you make guarantees to your caller about your observation and what it might mean for something else, or not.
On Thu, 2015-10-01 at 21:27 +0200, Torvald Riegel wrote: > On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote: > > Changes from V1: > > > > * Use C macro for atomic_store_release as suggested in > > comments. > > > > * Run benchmarks to quantize the performance changes and > > note in commit message. > > > > ---8<--- > > This patch optimizes powerpc spinlock implementation by: > > > > * Current algorithm spin over a lwzx, but only after issuing a lwarx. > > first. The optimization for such case is to avoid the first lwarx > > in contention case (which is much more costly than normal loads). > > > > * Use the correct EH hint bit on the larx for supported ISA. For lock > > acquisition, the thread that acquired the lock with a successful stcx > > do not want to give away the write ownership on the cacheline. The > > idea is to make the load reservation "sticky" about retaining write > > authority to the line. That way, the store that must inevitably come > > to release the lock can succeed quickly and not contend with other > > threads issuing lwarx. If another thread does a store to the line > > (false sharing), the winning thread must give up write authority to > > The proper value of EH for the larx for a lock acquisition is 1. > > > > * Increase contented lock performance by up to 40%, and no measurable > > impact on uncontended locks on P8. > > Could you add the tests you used to the glibc microbenchmarks (or > whatever works best for them)? We do want to be able to track > performance, and having benchmarks is the first step towards that. > > > It also adds some cleanup to use the defined acquire semantic > > instructions and function prototype using default C style. > > > > Thanks to Adhemerval Zanella who did most of the work. I've run some > > tests, and addressed some minor feedback. > > > > 2015-09-25 Paul E. 
Murphy <murphyp@linux.vnet.ibm.com> > > > > * sysdeps/powerpc/nptl/pthread_spin_lock.c (pthread_spin_lock): > > Optimize first check for contention case and add lwarx hint. > > * sysdeps/powerpc/nptl/pthread_spin_trylock.c (pthread_spin_trylock): > > Use ANSI prototype. > > * sysdep/unix/sysv/linux/powerpc/pthread_spin_unlock.c: Move to ... > > * sysdeps/powerpc/nptl/pthread_spin_unlock.c: ... here, and > > update to new atomic macros. > > --- > > sysdeps/powerpc/nptl/pthread_spin_lock.c | 22 ++++++++------- > > sysdeps/powerpc/nptl/pthread_spin_trylock.c | 3 +- > > sysdeps/powerpc/nptl/pthread_spin_unlock.c | 27 +++++++++++++++++++ > > .../unix/sysv/linux/powerpc/pthread_spin_unlock.c | 28 -------------------- > > 4 files changed, 40 insertions(+), 40 deletions(-) > > create mode 100644 sysdeps/powerpc/nptl/pthread_spin_unlock.c > > delete mode 100644 sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c > > > > diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c > > index cc081f8..8a39da9 100644 > > --- a/sysdeps/powerpc/nptl/pthread_spin_lock.c > > +++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c > > @@ -19,24 +19,26 @@ > > #include "pthreadP.h" > > > > int > > -pthread_spin_lock (lock) > > - pthread_spinlock_t *lock; > > +pthread_spin_lock (pthread_spinlock_t *lock) > > { > > unsigned int __tmp; > > > > asm volatile ( > > - "1: lwarx %0,0,%1\n" > > + "0: lwzx %0,0,%1\n" > > + " cmpwi 0,%0,0\n" > > + " bne 0b\n" > > + "1: lwarx %0,0,%1" MUTEX_HINT_ACQ "\n" > > " cmpwi 0,%0,0\n" > > " bne- 2f\n" > > " stwcx. 
%2,0,%1\n" > > " bne- 2f\n" > > - " isync\n" > > - " .subsection 1\n" > > - "2: lwzx %0,0,%1\n" > > - " cmpwi 0,%0,0\n" > > - " bne 2b\n" > > - " b 1b\n" > > - " .previous" > > + __ARCH_ACQ_INSTR "\n" > > + " .subsection 1\n" > > + "2: lwzx %0,0,%1\n" > > + " cmpwi 0,%0,0\n" > > + " bne 2b\n" > > + " b 1b\n" > > + " .previous" > > : "=&r" (__tmp) > > : "r" (lock), "r" (1) > > : "cr0", "memory"); > > Is this essentially a test-and-test-and-set implementation of a lock, > with the MUTEX_HINT_ACQ hint additionally? If so, have you considered > Have you considered adding a variant of > atomic_compare_exchange_weak_acquire or atomic_exchange_acquire that set > and hint, and then writing a C function using atomic an atomic load in > the outer test loop and using the new cmpxchg/xchg variant in the inner > loop? That would make it easier to eventually merge this with the > generic version of spinlocks. Also, once we have to adding spinning and > randomized exponential backoff to the generic locks, powerpc would be > able to benefit from those generic changes too (perhaps with some > additional tuning). > Paul is traveling today so I will answer. It may be functionally possible to use a __atomic_compare_exchange_n ((mem), (expected), (desired), weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) in this construct. But the cmpxchg is a clumsy way to generate the PowerISA Acquire Import Barrier which has a very specific form. Also it is not obvious how the MUTEX_HINT_ACQ applies to __atomic_compare_exchange_n in general or even __atomic_compare_exchange_n (*, *, *, *, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED). 
In the PowerISA, MUTEX_HINT_ACQ means: The Load and Reserve instructions include an Exclusive Access hint (EH), which can be used to indicate that the instruction sequence being executed is implementing one of two types of algorithms: Atomic Update (EH=0) This hint indicates that the program is using a fetch and operate (e.g., fetch and add) or some similar algorithm and that all programs accessing the shared variable are likely to use a similar operation to access the shared variable for some time. Exclusive Access (EH=1) This hint indicates that the program is attempting to acquire a lock and if it succeeds, will perform another store to the lock variable (releasing the lock) before another program attempts to modify the lock variable. It is not clear that C11 __atomic_compare_exchange when/if/ever implies exclusive access in the meaning of the PowerISA. Finally trying to combine atomic_compare_exchange_weak_acquire with the relaxed load spin is likely to generate some awkward sequences with unnecessary code. I would prefer to proceed with this implementation and revisit when we are further along with C11 integration. > > diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c > > new file mode 100644 > > index 0000000..f830ad2 > > --- /dev/null > > +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c > > @@ -0,0 +1,27 @@ > > +/* pthread_spin_unlock -- unlock a spin lock. PowerPC version. > > + Copyright (C) 2007-2015 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. 
> > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <http://www.gnu.org/licenses/>. */ > > + > > +#include "pthreadP.h" > > +#include <lowlevellock.h> > > Do you really need to include lowlevellock.h? > > > +int > > +pthread_spin_unlock (pthread_spinlock_t *lock) > > +{ > > + atomic_store_release (lock, 0); > > + return 0; > > +} > >
On Thu, 2015-10-01 at 15:28 -0500, Steven Munroe wrote: > On Thu, 2015-10-01 at 21:27 +0200, Torvald Riegel wrote: > > On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote: > > > diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c > > > index cc081f8..8a39da9 100644 > > > --- a/sysdeps/powerpc/nptl/pthread_spin_lock.c > > > +++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c > > > @@ -19,24 +19,26 @@ > > > #include "pthreadP.h" > > > > > > int > > > -pthread_spin_lock (lock) > > > - pthread_spinlock_t *lock; > > > +pthread_spin_lock (pthread_spinlock_t *lock) > > > { > > > unsigned int __tmp; > > > > > > asm volatile ( > > > - "1: lwarx %0,0,%1\n" > > > + "0: lwzx %0,0,%1\n" > > > + " cmpwi 0,%0,0\n" > > > + " bne 0b\n" > > > + "1: lwarx %0,0,%1" MUTEX_HINT_ACQ "\n" > > > " cmpwi 0,%0,0\n" > > > " bne- 2f\n" > > > " stwcx. %2,0,%1\n" > > > " bne- 2f\n" > > > - " isync\n" > > > - " .subsection 1\n" > > > - "2: lwzx %0,0,%1\n" > > > - " cmpwi 0,%0,0\n" > > > - " bne 2b\n" > > > - " b 1b\n" > > > - " .previous" > > > + __ARCH_ACQ_INSTR "\n" > > > + " .subsection 1\n" > > > + "2: lwzx %0,0,%1\n" > > > + " cmpwi 0,%0,0\n" > > > + " bne 2b\n" > > > + " b 1b\n" > > > + " .previous" > > > : "=&r" (__tmp) > > > : "r" (lock), "r" (1) > > > : "cr0", "memory"); > > > > Is this essentially a test-and-test-and-set implementation of a lock, > > with the MUTEX_HINT_ACQ hint additionally? If so, have you considered > > Have you considered adding a variant of > > atomic_compare_exchange_weak_acquire or atomic_exchange_acquire that set > > and hint, and then writing a C function using atomic an atomic load in > > the outer test loop and using the new cmpxchg/xchg variant in the inner > > loop? That would make it easier to eventually merge this with the > > generic version of spinlocks. 
Also, once we have to adding spinning and > > randomized exponential backoff to the generic locks, powerpc would be > > able to benefit from those generic changes too (perhaps with some > > additional tuning). > > > Paul is traveling today so I will answer. > > It may be functionally possible to use a > > __atomic_compare_exchange_n ((mem), (expected), (desired), > weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) > > in this construct. I agree, but that's not what I proposed. > But the cmpxchg is a clumsy way to generate the PowerISA Acquire Import > Barrier which has a very specific form. > > Also it is not obvious how the MUTEX_HINT_ACQ applies to > __atomic_compare_exchange_n in general or even > __atomic_compare_exchange_n (*, *, *, *, __ATOMIC_ACQUIRE, > __ATOMIC_RELAXED). > > In the PowerISA, MUTEX_HINT_ACQ means: > > The Load and Reserve instructions include an Exclusive > Access hint (EH), which can be used to indicate > that the instruction sequence being executed is implementing > one of two types of algorithms: > > Atomic Update (EH=0) > This hint indicates that the program is using a fetch and > operate (e.g., fetch and add) or some similar algorithm > and that all programs accessing the shared variable are > likely to use a similar operation to access the shared > variable for some time. > Exclusive Access (EH=1) > This hint indicates that the program is attempting to > acquire a lock and if it succeeds, will perform another > store to the lock variable (releasing the lock) before > another program attempts to modify the lock variable. > > It is not clear that C11 __atomic_compare_exchange when/if/ever implies > exclusive access in the meaning of the PowerISA. Yes. The hint is orthogonal to the C11 CAS semantics. This is why I asked about adding a variant for *glibc's* atomic_compare_exchange_weak_acquire, see include/atomic.h. 
This would likely be implemented with custom asm on Power, and be just deferring to atomic_compare_exchange_weak_acquire for all other archs. Whether introducing that is better than just specializing the spinlocks is something you will have to figure out. But there are several lock-like CASes throughout the code (e.g., pthread_once, so not just in the mutexes or rwlock). You probably don't want to have custom Power variants of all those synchronization mechanisms. > Finally trying to combine atomic_compare_exchange_weak_acquire with the > relaxed load spin is likely to generate some awkward sequences with > unnecessary code. Perhaps, but that's determined by the compiler and how good it is at optimizing atomics. Whether it matters will have to be tracked by a benchmark: If this unnecessary code matters and the compiler isn't doing a good enough job yet, it should be visible in some benchmark; once it is, the benchmark will tell us and we're not stuck with an "old assumption" in our code. > I would prefer to proceed with this implementation and revisit when we > are further along with C11 integration. I don't mind having intermediate steps. But if you do something Power-specific, and then somebody else improves the generic version, IMO this other person has no obligation to spend any effort on updating your Power-specific version too.
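The generic spinning with randomized exponential backoff mentioned above could look roughly like this (constants and structure are arbitrary, purely for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative spinlock acquisition with randomized exponential
   backoff: after a failed CAS, wait a randomized, exponentially
   growing number of iterations before touching the line again,
   reducing cacheline ping-pong under heavy contention.  */
int
backoff_spin_lock (int *lock)
{
  unsigned int backoff = 4;     /* Arbitrary initial window.  */
  while (1)
    {
      int expected = 0;
      if (__atomic_compare_exchange_n (lock, &expected, 1, 1 /* weak */,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return 0;
      /* Randomized delay within the current backoff window.  */
      unsigned int spins = (unsigned int) rand () % backoff;
      for (volatile unsigned int i = 0; i < spins; i++)
        ;
      if (backoff < 1024)       /* Arbitrary cap on the window.  */
        backoff *= 2;
      /* Then wait with cheap loads until the lock looks free.  */
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
        ;
    }
}
```

The randomization desynchronizes waiters so they do not all retry at the same moment; tuning the initial window and cap is exactly the per-arch tuning alluded to above.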
On 10/01/2015 02:27 PM, Torvald Riegel wrote:
> On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote:
>> Changes from V1:
>>
>> * Use C macro for atomic_store_release as suggested in
>> comments.
>>
>> * Run benchmarks to quantize the performance changes and
>> note in commit message.
>>
>> ---8<---
>> This patch optimizes powerpc spinlock implementation by:
>>
>> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
>> first. The optimization for such case is to avoid the first lwarx
>> in contention case (which is much more costly than normal loads).
>>
>> * Use the correct EH hint bit on the larx for supported ISA. For lock
>> acquisition, the thread that acquired the lock with a successful stcx
>> do not want to give away the write ownership on the cacheline. The
>> idea is to make the load reservation "sticky" about retaining write
>> authority to the line. That way, the store that must inevitably come
>> to release the lock can succeed quickly and not contend with other
>> threads issuing lwarx. If another thread does a store to the line
>> (false sharing), the winning thread must give up write authority to
>> The proper value of EH for the larx for a lock acquisition is 1.
>>
>> * Increase contented lock performance by up to 40%, and no measurable
>> impact on uncontended locks on P8.
>
> Could you add the tests you used to the glibc microbenchmarks (or
> whatever works best for them)? We do want to be able to track
> performance, and having benchmarks is the first step towards that.

Would you object to that being a separate patch?

As I sketch out the work to integrate it into our benchtest, it seems
like it could be trivially extended to mutexes as well.
>> diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
>> new file mode 100644
>> index 0000000..f830ad2
>> --- /dev/null
>> +++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
>> @@ -0,0 +1,27 @@
>> +/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
>> +   Copyright (C) 2007-2015 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include "pthreadP.h"
>> +#include <lowlevellock.h>
>
> Do you really need to include lowlevellock.h?

No, atomic.h is needed at minimum.

Thanks,
Paul
On Wed, 2015-10-07 at 14:00 -0500, Paul E. Murphy wrote:
> 
> On 10/01/2015 02:27 PM, Torvald Riegel wrote:
> > On Wed, 2015-09-30 at 11:28 -0500, Paul E. Murphy wrote:
> >> Changes from V1:
> >>
> >> * Use C macro for atomic_store_release as suggested in
> >> comments.
> >>
> >> * Run benchmarks to quantize the performance changes and
> >> note in commit message.
> >>
> >> ---8<---
> >> This patch optimizes powerpc spinlock implementation by:
> >>
> >> * Current algorithm spin over a lwzx, but only after issuing a lwarx.
> >> first. The optimization for such case is to avoid the first lwarx
> >> in contention case (which is much more costly than normal loads).
> >>
> >> * Use the correct EH hint bit on the larx for supported ISA. For lock
> >> acquisition, the thread that acquired the lock with a successful stcx
> >> do not want to give away the write ownership on the cacheline. The
> >> idea is to make the load reservation "sticky" about retaining write
> >> authority to the line. That way, the store that must inevitably come
> >> to release the lock can succeed quickly and not contend with other
> >> threads issuing lwarx. If another thread does a store to the line
> >> (false sharing), the winning thread must give up write authority to
> >> The proper value of EH for the larx for a lock acquisition is 1.
> >>
> >> * Increase contented lock performance by up to 40%, and no measurable
> >> impact on uncontended locks on P8.
> >
> > Could you add the tests you used to the glibc microbenchmarks (or
> > whatever works best for them)? We do want to be able to track
> > performance, and having benchmarks is the first step towards that.
> 
> Would you object to that being a separate patch?

No.

> As I sketch out the work to integrate it into our benchtest, it seems
> like it could be trivially extended to mutexes aswell.

Agreed.
diff --git a/sysdeps/powerpc/nptl/pthread_spin_lock.c b/sysdeps/powerpc/nptl/pthread_spin_lock.c
index cc081f8..8a39da9 100644
--- a/sysdeps/powerpc/nptl/pthread_spin_lock.c
+++ b/sysdeps/powerpc/nptl/pthread_spin_lock.c
@@ -19,24 +19,26 @@
 #include "pthreadP.h"
 
 int
-pthread_spin_lock (lock)
-     pthread_spinlock_t *lock;
+pthread_spin_lock (pthread_spinlock_t *lock)
 {
   unsigned int __tmp;
 
   asm volatile (
-       "1: lwarx %0,0,%1\n"
+       "0: lwzx %0,0,%1\n"
+       "   cmpwi 0,%0,0\n"
+       "   bne 0b\n"
+       "1: lwarx %0,0,%1" MUTEX_HINT_ACQ "\n"
        "   cmpwi 0,%0,0\n"
        "   bne- 2f\n"
        "   stwcx. %2,0,%1\n"
        "   bne- 2f\n"
-       "   isync\n"
-       "   .subsection 1\n"
-       "2: lwzx %0,0,%1\n"
-       "   cmpwi 0,%0,0\n"
-       "   bne 2b\n"
-       "   b 1b\n"
-       "   .previous"
+       __ARCH_ACQ_INSTR "\n"
+       "   .subsection 1\n"
+       "2: lwzx %0,0,%1\n"
+       "   cmpwi 0,%0,0\n"
+       "   bne 2b\n"
+       "   b 1b\n"
+       "   .previous"
        : "=&r" (__tmp)
        : "r" (lock), "r" (1)
        : "cr0", "memory");
diff --git a/sysdeps/powerpc/nptl/pthread_spin_trylock.c b/sysdeps/powerpc/nptl/pthread_spin_trylock.c
index 77a4615..c485aa4 100644
--- a/sysdeps/powerpc/nptl/pthread_spin_trylock.c
+++ b/sysdeps/powerpc/nptl/pthread_spin_trylock.c
@@ -20,8 +20,7 @@
 #include "pthreadP.h"
 
 int
-pthread_spin_trylock (lock)
-     pthread_spinlock_t *lock;
+pthread_spin_trylock (pthread_spinlock_t *lock)
 {
   unsigned int old;
   int err = EBUSY;
diff --git a/sysdeps/powerpc/nptl/pthread_spin_unlock.c b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
new file mode 100644
index 0000000..f830ad2
--- /dev/null
+++ b/sysdeps/powerpc/nptl/pthread_spin_unlock.c
@@ -0,0 +1,27 @@
+/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
+   Copyright (C) 2007-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "pthreadP.h"
+#include <lowlevellock.h>
+
+int
+pthread_spin_unlock (pthread_spinlock_t *lock)
+{
+  atomic_store_release (lock, 0);
+  return 0;
+}
diff --git a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c b/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
deleted file mode 100644
index 7af694f..0000000
--- a/sysdeps/unix/sysv/linux/powerpc/pthread_spin_unlock.c
+++ /dev/null
@@ -1,28 +0,0 @@
-/* pthread_spin_unlock -- unlock a spin lock.  PowerPC version.
-   Copyright (C) 2007-2015 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-int
-pthread_spin_unlock (pthread_spinlock_t *lock)
-{
-  __asm __volatile (__ARCH_REL_INSTR ::: "memory");
-  *lock = 0;
-  return 0;
-}
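[Editor's illustration: the unlock change in the patch above replaces an explicit barrier instruction followed by a plain store with a single release store (atomic_store_release, which on POWER becomes lwsync; stw). A hedged C sketch of the two shapes, using GCC __atomic builtins as stand-ins for the glibc macros:]

```c
/* Old sequence: an explicit release barrier (__ARCH_REL_INSTR in the
   deleted file, i.e. sync or lwsync), then an ordinary store.  The
   fence here is only a portable stand-in for that barrier.  */
static void
spin_unlock_old_style (unsigned int *lock)
{
  __atomic_thread_fence (__ATOMIC_RELEASE);
  *lock = 0;
}

/* New sequence: a single release store, matching the
   atomic_store_release call in the new pthread_spin_unlock.  */
static void
spin_unlock_new_style (unsigned int *lock)
{
  __atomic_store_n (lock, 0, __ATOMIC_RELEASE);
}
```

Both orderings are sufficient for a spinlock release; expressing it through the C-level macro lets the generic atomics machinery pick the cheapest instruction sequence per ISA.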