Patchwork [v7,2/2] Mutex: Replace trylock by read only while spinning

login
register
mail settings
Submitter Kemi Wang
Date July 6, 2018, 7:50 a.m.
Message ID <1530863409-326-2-git-send-email-kemi.wang@intel.com>
Download mbox | patch
Permalink /patch/28255/
State New
Headers show

Comments

Kemi Wang - July 6, 2018, 7:50 a.m.
The pthread adaptive spin mutex spins on the lock for a while before
calling into the kernel to block. But, in the current implementation of
spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK(cmpxchg) when
the lock is contended, it is not a good idea on many targets as that will
force expensive memory synchronization among processors and penalize other
running threads. For example, it constantly floods the system with "read
for ownership" requests, which are much more expensive to process than a
single read. Thus, we only use MO read until we observe the lock to not be
acquired anymore, as suggested by Andi Kleen.

Performance impact:
It would bring some benefit in the scenarios with severe lock contention on
many architectures (significant performance improvement is not expected),
and the whole system performance can benefit from this modification because
a number of unnecessary "read for ownership" requests which stress the
cache system by broadcasting cache line invalidity are eliminated during
spinning.

Meanwhile, it may have some tiny performance regression on the lock holder
transformation for the case of lock acquisition via spinning gets, because
the lock state is checked before acquiring the lock via trylock.

Similar mechanism has been implemented for pthread spin lock.

Test machine:
2-sockets Skylake platform, 112 cores with 62G RAM

Test case: Multiple threads contend for adaptive spin mutex
In this test case, each thread binds to an individual CPU core, and does
the following:
1) lock
2) spend about 50 nanoseconds (~1 pause on Skylake) in the critical section
3) unlock
4) spend 500 nanoseconds in the non-critical section in a loop until 15
seconds, and the lock performance is measured by the total iterations.

Then, enlarge the size of critical section, and repeat. Test result is
shows as below:

+----------------+-----------------+-----------------+------------+
|  Configuration |      Base       |      Head       | % Change   |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |           Critical section size: 1x            |
+----------------+------------------------------------------------+
|1 thread        |  2.76681e+07    |  2.7965e+07     |   +1.1%    |
|2 threads       |  3.29905e+07    |  3.55279e+07    |   +7.7%    |
|3 threads       |  4.38102e+07    |  3.98567e+07    |   -9.0%    |
|4 threads       |  1.72172e+07    |  2.09498e+07    |   +21.7%   |
|28 threads      |  1.03732e+07    |  1.05133e+07    |   +1.4%    |
|56 threads      |  1.06308e+07    |  5.06522e+07    |   +14.6%   |
|112 threads     |  8.55177e+06    |  1.02954e+07    |   +20.4%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 10x           |
+----------------+------------------------------------------------+
|1 thread        |  1.57006e+07    |  1.54727e+07    |   -1.5%    |
|2 threads       |  1.8044e+07     |  1.75601e+07    |   -2.7%    |
|3 threads       |  1.35634e+07    |  1.46384e+07    |   +7.9%    |
|4 threads       |  1.21257e+07    |  1.32046e+07    |   +8.9%    |
|28 threads      |  8.09593e+06    |  1.02713e+07    |   +26.9%   |
|56 threads      |  9.09907e+06    |  4.16203e+07    |   +16.4%   |
|112 threads     |  7.09731e+06    |  8.62406e+06    |   +21.5%   |
+----------------+------------------------------------------------+
|                |           Critical section size: 100x          |
+----------------+------------------------------------------------+
|1 thread        |  2.87116e+06    | 2.89188e+06     |   +0.7%    |
|2 threads       |  2.23409e+06    | 2.24216e+06     |   +0.4%    |
|3 threads       |  2.29888e+06    | 2.29964e+06     |   +0.0%    |
|4 threads       |  2.26898e+06    | 2.21394e+06     |   -2.4%    |
|28 threads      |  1.03228e+06    | 1.0051e+06      |   -2.6%    |
|56 threads      |  1.02953 +06    | 1.6344e+07      |   -2.3%    |
|112 threads     |  1.01615e+06    | 1.00134e+06     |   -1.5%    |
+----------------+------------------------------------------------+
|                |           Critical section size: 1000x         |
+----------------+------------------------------------------------+
|1 thread        |  316392         |  315635         |   -0.2%    |
|2 threads       |  302806         |  303469         |   +0.2%    |
|3 threads       |  298506         |  294281         |   -1.4%    |
|4 threads       |  292037         |  289945         |   -0.7%    |
|28 threads      |  155188         |  155250         |   +0.0%    |
|56 threads      |  190657         |  183106         |   -4.0%    |
|112 threads     |  210818         |  220342         |   +4.5%    |
+----------------+-----------------+-----------------+------------+

    * nptl/pthread_mutex_lock.c: Use architecture-specific atomic spin API
    * nptl/pthread_mutex_timedlock.c: Likewise
    * nptl/pthread_spinlock.h: New file
    * sysdeps/unix/sysv/linux/x86/pthread_spinlock.h: New file

ChangLog:
    V6->V7:
    a) Patch is refined by H.J.Lu

    V5->V6:
    no change

    V4->V5:
    a) Make the optimization work for pthread mutex_timedlock() in x86
    architecture.
    b) Move the READ_ONLY_SPIN macro definition from this patch to the
    first patch which adds glibc.mutex.spin_count tunable entry

    V3->V4:
    a) Make the optimization opt-in, and enable for x86 architecture as
    default, as suggested by Florian Weimer.

    V2->V3:
    a) Drop the idea of blocking spinners if fail to acquire a lock, since
       this idea would not be an universal winner. E.g. several threads
       contend for a lock which protects a small critical section, thus,
       probably any thread can acquire the lock via spinning.
    b) Fix the format issue AFAIC

    V1->V2: fix format issue

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/pthread_mutex_lock.c                      |  3 ++-
 nptl/pthread_mutex_timedlock.c                 |  4 ++--
 nptl/pthread_spinlock.h                        | 23 +++++++++++++++++++
 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h | 31 ++++++++++++++++++++++++++
 4 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 nptl/pthread_spinlock.h
 create mode 100644 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
Carlos O'Donell - July 6, 2018, 6:04 p.m.
On 07/06/2018 03:50 AM, Kemi Wang wrote:
> The pthread adaptive spin mutex spins on the lock for a while before
> calling into the kernel to block. But, in the current implementation of
> spinning, the spinners go straight back to LLL_MUTEX_TRYLOCK(cmpxchg) when
> the lock is contended, it is not a good idea on many targets as that will
> force expensive memory synchronization among processors and penalize other
> running threads. For example, it constantly floods the system with "read
> for ownership" requests, which are much more expensive to process than a
> single read. Thus, we only use MO read until we observe the lock to not be
> acquired anymore, as suggested by Andi Kleen.
> 
> Performance impact:
> It would bring some benefit in the scenarios with severe lock contention on
> many architectures (significant performance improvement is not expected),
> and the whole system performance can benefit from this modification because
> a number of unnecessary "read for ownership" requests which stress the
> cache system by broadcasting cache line invalidity are eliminated during
> spinning.
> 
> Meanwhile, it may have some tiny performance regression on the lock holder
> transformation for the case of lock acquisition via spinning gets, because
> the lock state is checked before acquiring the lock via trylock.
> 
> Similar mechanism has been implemented for pthread spin lock.

Why should I accept this patch?

You make a strong case about the cost of the expensive memory synchronization.

However, the numbers don't appear to back this up.

If the cost of the synchronization was high, when you add the spinning, why
doesn't it improve performance?

Do you need to do a whole system performance measurement?

As it stands it looks like this patch makes the general use case of 1-4
threads roughly 5% slower across a variety of workloads.

I'm not inclined to include this work unless there is some stronger
justification, or perhaps I have just misunderstood the numbers you
have provided.
Kemi Wang - July 8, 2018, 2:04 p.m.
Hi, Carlos
Thanks for your review and your questions.

> Why should I accept this patch?


Current implementation of spinning during the lock is held uses CAS (compare and swap) instead
of the way we called test and CAS.

The first is very unfriendly to the uncore because it constantly floods
the system with "read for ownership" requests, which are much more expensive
to process than a single read.

Andi Kleen may have more input here.

> You make a strong case about the cost of the expensive memory synchronization.

> However, the numbers don't appear to back this up.


As I have said in the commit log,  significant performance improvement is not expected.

> If the cost of the synchronization was high, when you add the spinning, why doesn't it improve performance?


To be simple, we can think the time spent in the critical section and non-critical section is constant (I used lock delay 
and unlock delay to emulate the workload),  lock performance is mainly determined by the latency of lock holder
transition.

This patch reduces meaningless cache line traffic during the lock is *held*, in other words, it *optimizes* the process during
the lock is held.  So, the test result is expected.

However, if the lock contention is severe, too much "read-for-ownership" requests may affect the latency of lock holder transition in 
cache coherency system, in such case, it would improve system performance, as we can see in the test result with 28/56/112 threads.

Please notice, if the critical section increases to a certain degree, the lock performance is mainly determined by lock acquisition via 
futex_wake. In such case, the threshold of spin count would affect lock performance, and this patch performs similar to original 
version because we use the same threshold (100). 

> Do you need to do a whole system performance measurement?


I thought the data posted here is good enough to demonstrate the effectiveness of this patch.
But if you insist, I will try to do something to figure it out.

> As it stands it looks like this patch makes the general use case of 1-4 threads roughly 5% slower across a variety of workloads.


No, this patch did not change the fast path. So it would not affect the general  lock performance (no contention) theoretically.

Also, I don't know why you said this patch make general case roughly 5% slower?  instead, it performs a little better than original
version for most cases.
Further, this patch also makes the use case of 5-10 threads better than original one in my test,  maybe I should post them in the commit log?

> I'm not inclined to include this work unless there is some stronger justification, or perhaps I have just misunderstood the numbers you have provided.


I am not sure whether my reply answers your questions.
If still in doubt, feel free to raise your questions and I will try to figure it out. Thanks

-----Original Message-----
From: Carlos O'Donell [mailto:carlos@redhat.com] 

Sent: Saturday, July 7, 2018 2:04 AM
To: Wang, Kemi <kemi.wang@intel.com>; Adhemerval Zanella <adhemerval.zanella@linaro.org>; Florian Weimer <fweimer@redhat.com>; Rical Jason <rj@2c3t.io>; Glibc alpha <libc-alpha@sourceware.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>; Chen, Tim C <tim.c.chen@intel.com>; Kleen, Andi <andi.kleen@intel.com>; Huang, Ying <ying.huang@intel.com>; Lu, Aaron <aaron.lu@intel.com>; Li, Aubrey <aubrey.li@intel.com>
Subject: Re: [PATCH v7 2/2] Mutex: Replace trylock by read only while spinning

On 07/06/2018 03:50 AM, Kemi Wang wrote:
> The pthread adaptive spin mutex spins on the lock for a while before 

> calling into the kernel to block. But, in the current implementation 

> of spinning, the spinners go straight back to 

> LLL_MUTEX_TRYLOCK(cmpxchg) when the lock is contended, it is not a 

> good idea on many targets as that will force expensive memory 

> synchronization among processors and penalize other running threads. 

> For example, it constantly floods the system with "read for ownership" 

> requests, which are much more expensive to process than a single read. 

> Thus, we only use MO read until we observe the lock to not be acquired anymore, as suggested by Andi Kleen.

> 

> Performance impact:

> It would bring some benefit in the scenarios with severe lock 

> contention on many architectures (significant performance improvement 

> is not expected), and the whole system performance can benefit from 

> this modification because a number of unnecessary "read for ownership" 

> requests which stress the cache system by broadcasting cache line 

> invalidity are eliminated during spinning.

> 

> Meanwhile, it may have some tiny performance regression on the lock 

> holder transformation for the case of lock acquisition via spinning 

> gets, because the lock state is checked before acquiring the lock via trylock.

> 

> Similar mechanism has been implemented for pthread spin lock.


Why should I accept this patch?

You make a strong case about the cost of the expensive memory synchronization.

However, the numbers don't appear to back this up.

If the cost of the synchronization was high, when you add the spinning, why doesn't it improve performance?

Do you need to do a whole system performance measurement?

As it stands it looks like this patch makes the general use case of 1-4 threads roughly 5% slower across a variety of workloads.

I'm not inclined to include this work unless there is some stronger justification, or perhaps I have just misunderstood the numbers you have provided.

--
Cheers,
Carlos.
Kemi Wang - July 11, 2018, 3:17 a.m.
Hi, Carlos
>> Do you need to do a whole system performance measurement?
> 
> I thought the data posted here is good enough to demonstrate the effectiveness of this patch.
> But if you insist, I will try to do something to figure it out.
> 

To measure a whole system performance:

We run *two* processes with locking at the same time. Each process has multiple threads, and
each thread do the following:
a) lock
b) delay 1000ns in the critical section
c) unlock
d) delay 6000ns in the non-critical section
in a loop for 5 seconds, and measure the total iterations for each process.

Each test run 4 times with stable testing result.

Test result:
threads           base(CAS)                     head(test and CAS)          
28                 1681975			 1813090(+7.8%)

56		   1801954			 1890549(+4.9%)

========================cut==========================================
I also run a *single* process and post the test result as below (each test run 4 times with 
stable testing result):
threads           base(CAS)                    head(test and CAS)
28		   2262848                       2274362 (+0.5%)

56		   1949439                       1994526 (+2.3%)

This is what I got in my test, this change is not a big optimization, but should have some 
limited performance improvement. Additionally, the workload I run here is not so Macro 
(Some Micro benchmarks make the size of critical section and non-critical section extremely
 small), so I believe this change is helpful with some practical workload.
Kemi Wang - July 11, 2018, 5:36 a.m.
On 2018年07月11日 11:17, kemi wrote:
> Hi, Carlos
>>> Do you need to do a whole system performance measurement?
>>
>> I thought the data posted here is good enough to demonstrate the effectiveness of this patch.
>> But if you insist, I will try to do something to figure it out.
>>
> 
> To measure a whole system performance:
> 
> We run *two* processes with locking at the same time. Each process has multiple threads, and
> each thread do the following:
> a) lock
> b) delay 1000ns in the critical section
> c) unlock
> d) delay 6000ns in the non-critical section
> in a loop for 5 seconds, and measure the total iterations for each process.
> 
> Each test run 4 times with stable testing result.
> 
> Test result:
> threads           base(CAS)                     head(test and CAS)          
> 28                 1681975			 1813090(+7.8%)
> 
> 56		   1801954			 1890549(+4.9%)
> 
> ========================cut==========================================
> I also run a *single* process and post the test result as below (each test run 4 times with 
> stable testing result):
> threads           base(CAS)                    head(test and CAS)
> 28		   2262848                       2274362 (+0.5%)
> 
> 56		   1949439                       1994526 (+2.3%)
> 
> This is what I got in my test, this change is not a big optimization, but should have some 
> limited performance improvement. Additionally, the workload I run here is not so Macro 
> (Some Micro benchmarks make the size of critical section and non-critical section extremely
          ^~~~~~
Sorry, a typo, reminded by Ying Huang.
s/Micro/Macro

>  small), so I believe this change is helpful with some practical workload.
> 
>
Kemi Wang - July 11, 2018, 9:01 a.m.
On 2018年07月11日 11:17, kemi wrote:
> Hi, Carlos
>>> Do you need to do a whole system performance measurement?
>>
>> I thought the data posted here is good enough to demonstrate the effectiveness of this patch.
>> But if you insist, I will try to do something to figure it out.
>>
> 
> To measure a whole system performance:
> 
> We run *two* processes with locking at the same time. Each process has multiple threads, and
> each thread do the following:
> a) lock
> b) delay 1000ns in the critical section
> c) unlock
> d) delay 6000ns in the non-critical section
> in a loop for 5 seconds, and measure the total iterations for each process.
> 
> Each test run 4 times with stable testing result.
> 
> Test result:
> threads           base(CAS)                     head(test and CAS)          
> 28                 1681975			 1813090(+7.8%)
> 
> 56		   1801954			 1890549(+4.9%)
> 
> ========================cut==========================================
> I also run a *single* process and post the test result as below (each test run 4 times with 
> stable testing result):
> threads           base(CAS)                    head(test and CAS)
> 28		   2262848                       2274362 (+0.5%)
> 
> 56		   1949439                       1994526 (+2.3%)
> 
> This is what I got in my test, this change is not a big optimization, but should have some 
> limited performance improvement. Additionally, the workload I run here is not so Macro 
                                                 ^~~~~
                                                                                    s/Macro/Micro
							                            Ignore the previous email
> (Some Micro benchmarks make the size of critical section and non-critical section extremely
>  small), so I believe this change is helpful with some practical workload.
> 
>

Patch

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..c910ec4 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -25,6 +25,7 @@ 
 #include "pthreadP.h"
 #include <atomic.h>
 #include <lowlevellock.h>
+#include <pthread_spinlock.h>
 #include <stap-probe.h>
 
 #ifndef lll_lock_elision
@@ -133,7 +134,7 @@  __pthread_mutex_lock (pthread_mutex_t *mutex)
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 28237b0..2ede5a0 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,7 +25,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <not-cancel.h>
-
+#include <pthread_spinlock.h>
 #include <stap-probe.h>
 
 #ifndef lll_timedlock_elision
@@ -126,7 +126,7 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 					  PTHREAD_MUTEX_PSHARED (mutex));
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (lll_trylock (mutex->__data.__lock) != 0);
 
diff --git a/nptl/pthread_spinlock.h b/nptl/pthread_spinlock.h
new file mode 100644
index 0000000..8bd7c16
--- /dev/null
+++ b/nptl/pthread_spinlock.h
@@ -0,0 +1,23 @@ 
+/* Functions for pthread_spinlock_t.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  atomic_spin_nop ();
+}
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
new file mode 100644
index 0000000..5ca84d1
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
@@ -0,0 +1,31 @@ 
+/* Functions for pthread_spinlock_t.  X86 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  int val = 0;
+  int cnt = *cnt_p;
+  do
+    {
+      atomic_spin_nop ();
+      val = atomic_load_relaxed (lock);
+    }
+  while (val != 0 && ++cnt < max_cnt);
+  *cnt_p = cnt;
+}