Message ID | 1530520046-18343-3-git-send-email-kemi.wang@intel.com |
---|---|
State | New, archived |
Headers |
From: Kemi Wang <kemi.wang@intel.com>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>, Florian Weimer <fweimer@redhat.com>, Rical Jason <rj@2c3t.io>, Carlos Donell <carlos@redhat.com>, Glibc alpha <libc-alpha@sourceware.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>, Tim Chen <tim.c.chen@intel.com>, Andi Kleen <andi.kleen@intel.com>, Ying Huang <ying.huang@intel.com>, Aaron Lu <aaron.lu@intel.com>, Lu Aubrey <aubrey.li@intel.com>, Kemi Wang <kemi.wang@intel.com>
Subject: [PATCH v6 3/3] Mutex: Replace trylock by read only while spinning
Date: Mon, 2 Jul 2018 16:27:26 +0800
Message-Id: <1530520046-18343-3-git-send-email-kemi.wang@intel.com>
In-Reply-To: <1530520046-18343-1-git-send-email-kemi.wang@intel.com>
References: <1530520046-18343-1-git-send-email-kemi.wang@intel.com>
Commit Message
Kemi Wang
July 2, 2018, 8:27 a.m. UTC
The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block. But in the current implementation of
spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK (a cmpxchg)
even while the lock is still contended. That is not a good idea on many
targets, because it forces expensive memory synchronization among
processors and penalizes other running threads: it constantly floods the
system with "read for ownership" (RFO) requests, which are much more
expensive to process than a single read. Thus, we spin using only a
relaxed MO read until we observe that the lock is no longer acquired, as
suggested by Andi Kleen.
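
As an illustration of the pattern, here is a minimal, standalone C11
sketch on a hypothetical integer lock (the patch itself uses
glibc-internal atomics on mutex->__data.__lock):

#include <stdatomic.h>

/* Test-and-test-and-set sketch: the RFO-generating cmpxchg is only
   attempted after a plain load has observed the lock free, so waiting
   threads keep the cache line in the shared state.  */
static void
spin_lock_sketch (atomic_int *lock)
{
  for (;;)
    {
      int expected = 0;
      /* The expensive part: an atomic read-modify-write (the trylock).  */
      if (atomic_compare_exchange_weak_explicit (lock, &expected, 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
        return;
      /* Read-only wait: a relaxed load generates no coherence traffic
         while the lock stays held.  */
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
        ;
    }
}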
Performance impact:
A significant mutex performance improvement is not expected from this
patch, though it will probably help in scenarios with severe lock
contention on many architectures. Whole-system performance can benefit,
because the modification eliminates, during spinning, a number of
unnecessary "read for ownership" requests that stress the cache system
by broadcasting cache line invalidations.
Meanwhile, there may be a tiny regression in lock handover latency when
the lock is acquired via spinning, because the lock state is now checked
with a read before the lock is acquired via trylock.
A similar mechanism has already been implemented for the pthread spin
lock, as the sketch below shows.
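
For reference, the analogous read-only wait in the generic pthread spin
lock looks roughly like this (paraphrased from nptl/pthread_spin_lock.c
of this era; details may differ):

int
pthread_spin_lock (pthread_spinlock_t *lock)
{
  int val = 0;

  /* Fast path: try to grab the lock once.  */
  if (__glibc_likely (atomic_exchange_acquire (lock, 1) == 0))
    return 0;

  do
    {
      /* Wait with plain relaxed loads until the lock looks free;
         only then retry the atomic read-modify-write.  */
      do
        {
          atomic_spin_nop ();
          val = atomic_load_relaxed (lock);
        }
      while (val != 0);
    }
  while (!atomic_compare_exchange_weak_acquire (lock, &val, 1));

  return 0;
}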
Test machine:
2-socket Skylake platform, 112 cores, 62 GB RAM
Test case: mutex-adaptive-thread (Contended pthread adaptive spin mutex
with global update)
Usage: make bench BENCHSET=mutex-adaptive-thread
Test result:
+----------------+-----------------+-----------------+------------+
| Configuration | Base | Head | % Change |
| | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
| | Critical section size: 1x |
+----------------+------------------------------------------------+
|1 thread | 2.76681e+07 | 2.7965e+07 | +1.1% |
|2 threads | 3.29905e+07 | 3.55279e+07 | +7.7% |
|3 threads | 4.38102e+07 | 3.98567e+07 | -9.0% |
|4 threads | 1.72172e+07 | 2.09498e+07 | +21.7% |
|28 threads | 1.03732e+07 | 1.05133e+07 | +1.4% |
|56 threads | 1.06308e+07 | 5.06522e+07 | +14.6% |
|112 threads | 8.55177e+06 | 1.02954e+07 | +20.4% |
+----------------+------------------------------------------------+
| | Critical section size: 10x |
+----------------+------------------------------------------------+
|1 thread | 1.57006e+07 | 1.54727e+07 | -1.5% |
|2 threads | 1.8044e+07 | 1.75601e+07 | -2.7% |
|3 threads | 1.35634e+07 | 1.46384e+07 | +7.9% |
|4 threads | 1.21257e+07 | 1.32046e+07 | +8.9% |
|28 threads | 8.09593e+06 | 1.02713e+07 | +26.9% |
|56 threads | 9.09907e+06 | 4.16203e+07 | +16.4% |
|112 threads | 7.09731e+06 | 8.62406e+06 | +21.5% |
+----------------+------------------------------------------------+
| | Critical section size: 100x |
+----------------+------------------------------------------------+
|1 thread | 2.87116e+06 | 2.89188e+06 | +0.7% |
|2 threads | 2.23409e+06 | 2.24216e+06 | +0.4% |
|3 threads | 2.29888e+06 | 2.29964e+06 | +0.0% |
|4 threads | 2.26898e+06 | 2.21394e+06 | -2.4% |
|28 threads | 1.03228e+06 | 1.0051e+06 | -2.6% |
|56 threads | 1.02953e+06 | 1.6344e+07 | -2.3% |
|112 threads | 1.01615e+06 | 1.00134e+06 | -1.5% |
+----------------+------------------------------------------------+
| | Critical section size: 1000x |
+----------------+------------------------------------------------+
|1 thread | 316392 | 315635 | -0.2% |
|2 threads | 302806 | 303469 | +0.2% |
|3 threads | 298506 | 294281 | -1.4% |
|4 threads | 292037 | 289945 | -0.7% |
|28 threads | 155188 | 155250 | +0.0% |
|56 threads | 190657 | 183106 | -4.0% |
|112 threads | 210818 | 220342 | +4.5% |
+----------------+-----------------+-----------------+------------+
* nptl/pthread_mutex_lock.c: Optimize adaptive spin mutex
* nptl/pthread_mutex_timedlock.c: Likewise
* sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c: Enable read only
while spinning for x86 architecture
* sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c: Likewise
* sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c: Likewise
ChangeLog:
V5->V6:
no change
V4->V5:
a) Make the optimization work for pthread_mutex_timedlock() on the
x86 architecture.
b) Move the READ_ONLY_SPIN macro definition from this patch to the
first patch which adds glibc.mutex.spin_count tunable entry
V3->V4:
a) Make the optimization opt-in, and enable for x86 architecture as
default, as suggested by Florian Weimer.
V2->V3:
a) Drop the idea of blocking spinners that fail to acquire the lock,
since it would not be a universal winner. E.g. when several threads
contend for a lock that protects a small critical section, any of
them can probably still acquire the lock via spinning.
b) Fix the formatting issues, as far as I can tell.
V1->V2: Fix a formatting issue.
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
nptl/pthread_mutex_lock.c | 15 +++++++++++++++
nptl/pthread_mutex_timedlock.c | 15 +++++++++++++++
sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c | 1 +
sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c | 1 +
sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c | 1 +
5 files changed, 33 insertions(+)
Comments
On Mon, Jul 2, 2018 at 1:27 AM, Kemi Wang <kemi.wang@intel.com> wrote:
> [full patch quoted above; trimmed]

Please define generic

static inline void __always_inline
atomic_spin_nop_with_limit (int cnt, int max_cnt, pthread_mutex_t *mutex)
{
  atomic_spin_nop ();
}

and x86 version:

static inline void __always_inline
atomic_spin_nop_with_limit (int cnt, int max_cnt, pthread_mutex_t *mutex)
{
  int val;
  do
    {
      atomic_spin_nop ();
      val = atomic_load_relaxed (&mutex->__data.__lock);
    }
  while (val != 0 && ++cnt < max_cnt);
}

in a standalone patch.
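
With such a helper, the generic spin loop in pthread_mutex_lock.c could
then be written without the #ifdef, along these lines (a sketch based on
the suggestion above, not committed code):

  do
    {
      if (cnt++ >= max_cnt)
        {
          LLL_MUTEX_LOCK (mutex);
          break;
        }
      /* Generic targets pause once; x86 overrides the helper with the
         bounded read-only wait loop.  */
      atomic_spin_nop_with_limit (cnt, max_cnt, mutex);
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);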
On Tue, Jul 3, 2018 at 5:45 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> [earlier review, quoting the full patch; trimmed]

Please take a look at

https://github.com/hjl-tools/glibc/tree/hjl/spin/master

I added:

static __always_inline void
atomic_spin_lock (pthread_spinlock_t *lock, int cnt, int max_cnt)
{
  int val = 0;
  do
    {
      atomic_spin_nop ();
      val = atomic_load_relaxed (lock);
    }
  while (val != 0 && ++cnt < max_cnt);
}
On 2018-07-03 21:16, H.J. Lu wrote:
> [earlier thread, quoting the full patch; trimmed]
>
> Please take a look at
>
> https://github.com/hjl-tools/glibc/tree/hjl/spin/master

Reviewed. Thanks for refining.

> I added:
>
> static __always_inline void
> atomic_spin_lock (pthread_spinlock_t *lock, int cnt, int max_cnt)
> {
>   int val = 0;
>   do
>     {
>       atomic_spin_nop ();
>       val = atomic_load_relaxed (lock);
>     }
>   while (val != 0 && ++cnt < max_cnt);
> }
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..9a7b5c2 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -124,8 +124,14 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
       if (LLL_MUTEX_TRYLOCK (mutex) != 0)
         {
           int cnt = 0;
+#ifdef READ_ONLY_SPIN
+          int val = 0;
+          int max_cnt = MIN (__mutex_aconf.spin_count,
+                             mutex->__data.__spins * 2 + 10);
+#else
           int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
                              mutex->__data.__spins * 2 + 10);
+#endif
           do
             {
               if (cnt++ >= max_cnt)
@@ -133,7 +139,16 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
                   LLL_MUTEX_LOCK (mutex);
                   break;
                 }
+#ifdef READ_ONLY_SPIN
+              do
+                {
+                  atomic_spin_nop ();
+                  val = atomic_load_relaxed (&mutex->__data.__lock);
+                }
+              while (val != 0 && ++cnt < max_cnt);
+#else
               atomic_spin_nop ();
+#endif
             }
           while (LLL_MUTEX_TRYLOCK (mutex) != 0);

diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 66efd39..dcaeca8 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -116,8 +116,14 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
       if (lll_trylock (mutex->__data.__lock) != 0)
         {
           int cnt = 0;
+#ifdef READ_ONLY_SPIN
+          int val = 0;
+          int max_cnt = MIN (__mutex_aconf.spin_count,
+                             mutex->__data.__spins * 2 + 10);
+#else
           int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
                              mutex->__data.__spins * 2 + 10);
+#endif
           do
             {
               if (cnt++ >= max_cnt)
@@ -126,7 +132,16 @@ __pthread_mutex_timedlock (pthread_mutex_t *mutex,
                                           PTHREAD_MUTEX_PSHARED (mutex));
                   break;
                 }
+#ifdef READ_ONLY_SPIN
+              do
+                {
+                  atomic_spin_nop ();
+                  val = atomic_load_relaxed (&mutex->__data.__lock);
+                }
+              while (val != 0 && ++cnt < max_cnt);
+#else
               atomic_spin_nop ();
+#endif
             }
           while (lll_trylock (mutex->__data.__lock) != 0);

diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
index 967d007..a44c48c 100644
--- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
+++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_cond_lock.c
@@ -19,4 +19,5 @@
    already elided locks.  */
 #include <elision-conf.h>
+#include <nptl/pthread_mutex_conf.h>
 #include <nptl/pthread_mutex_cond_lock.c>
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
index c23678f..29d20e8 100644
--- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
+++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_lock.c
@@ -19,4 +19,5 @@
 #include <elision-conf.h>
 #include "force-elision.h"
+#include "nptl/pthread_mutex_conf.h"
 #include "nptl/pthread_mutex_lock.c"
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c b/sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c
index 7997580..52345f5 100644
--- a/sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c
+++ b/sysdeps/unix/sysv/linux/x86/pthread_mutex_timedlock.c
@@ -19,4 +19,5 @@
 #include <elision-conf.h>
 #include "force-elision.h"
+#include "nptl/pthread_mutex_conf.h"
 #include "nptl/pthread_mutex_timedlock.c"
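
For context, READ_ONLY_SPIN and __mutex_aconf.spin_count come from the
first patch of this series, which adds the glibc.mutex.spin_count
tunable. Roughly, the nptl/pthread_mutex_conf.h pulled in by the x86
wrappers above amounts to the following sketch (the exact contents live
in patch 1/3 and are assumptions here):

/* Sketch of nptl/pthread_mutex_conf.h from patch 1/3 (assumed, not
   shown in this patch): expose the glibc.mutex.spin_count tunable.  */
struct mutex_config
{
  int spin_count;  /* Upper bound on spin iterations (tunable).  */
};
extern struct mutex_config __mutex_aconf;

/* Architectures that include this header before the generic lock code,
   as the x86 wrappers above do, opt in to the read-only spin path.  */
#define READ_ONLY_SPIN 1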