Mutex: Use test and cmpxchg instead of cmpxchg while spinning

Message ID 1532911369-8174-2-git-send-email-kemi.wang@intel.com
State New, archived

Commit Message

Kemi Wang July 30, 2018, 12:42 a.m. UTC
  The pthread adaptive spin mutex uses cmpxchg rather than test and cmpxchg
while spinning on the lock. The former is very unfriendly to the uncore
because it continually floods the system with "read for ownership" (RFO)
requests, which are much more expensive to process than a single read.
  

Comments

Kemi Wang Aug. 2, 2018, 12:28 a.m. UTC | #1
Hi Carlos and other gentle maintainers,
   May I have some of your time to revisit this patch? The analysis and
testing done so far show that it is indeed helpful on the x86 architecture.
Thanks


Patch

============================Some Background===============================
For Intel x86 architecture:
Memory coherence between cores is handled by the "uncore" (roughly, the
logic outside the CPU cores but residing on the same die). Cores can ask
the uncore to "read" or "read for ownership".  If the request is "read",
then generally the cacheline will wind up marked SHARED in all caches that
have a copy, or will wind up EXCLUSIVE in the requesting core if no one
else had it before.  If the request is "read for ownership", any other
copies will be invalidated (or written back and invalidated if MODIFIED)
and the requesting core will get the line in EXCLUSIVE state.

When a read for ownership comes to the uncore, all copies in other caches
must be invalidated.  If there is only one other copy, it may be MODIFIED
and will have to be written back before being invalidated.

The uncore does these things by sending "snoops" to affected cores on the
same or other sockets and waiting for the answers.

When a read comes to the uncore, if more than one copy already exists in
other cores, then adding another one does not require a snoop.  If there is
only one other copy, a snoop is required to find out what state it is in
and to force a transition to the SHARED state.

A lock cmpxchg prevents any other core from modifying the line between the
read phase of cmpxchg and the (potential) write phase.  The way this is
done is to "read for ownership" the line, which will make it EXCLUSIVE.
Then the core defers responding to a snoop from the uncore until the write
phase is complete.  This prevents any other core from acquiring the
necessary exclusive access to the cacheline until the instruction is
complete.

With this background, we can understand how lock cmpxchg works and why
test and cmpxchg is preferable to plain cmpxchg for implementing locks.

============================Performance Impact============================
Now consider the case where a lock is used alternately by two cores.  At
the conclusion of an unlock, the lock's cacheline is in the MODIFIED state
in core A, which did the unlock.

If core B tries for the lock while core A is holding it, then core B will
take the cache line away from core A and only then find out the lock is
held.  Core B will then spin, testing the lock over and over again and
making no progress until core A finally releases it.  The release will be
expensive because core A will have to do an RFO to get the line back, and
then hold off core B's snoops until it can release the lock.  Then core B
will grab the line back and successfully acquire the lock.

In the two-contending cores case, lock-cmpxchg is not too bad.  The real
problem happens with three or more contenders.  In this case, every time
one of the contenders attempts to lock the lock with a lock cmpxchg, a
flurry of RFO transactions happens.

Generally, test and test and set is better.  To acquire the lock, you first
"test" with an ordinary MOV and only when the lock appears free do you try
the lock cmpxchg.  The MOV puts the line into SHARED state, and everyone
waiting can poll locally in their own caches without causing any uncore
traffic. When core A finally unlocks, the uncore will have to invalidate
everyone, then grant ownership to A for the unlock.  Then all the other
cores will pile on again and one of them will win.
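
To make the two access patterns concrete, here is a minimal, self-contained
sketch in C using GCC's __atomic builtins.  It is only a model of the
behavior described above, not the glibc code; an unlock would simply be
__atomic_store_n (lock, 0, __ATOMIC_RELEASE).

/* Spinning with plain cmpxchg: every attempt is a "read for ownership",
   even while the lock is visibly held by another core.  */
static void
spin_lock_cmpxchg (int *lock)
{
  int expected = 0;
  while (!__atomic_compare_exchange_n (lock, &expected, 1, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
    expected = 0;   /* cmpxchg stored the value it observed here.  */
}

/* Test and cmpxchg: poll with plain loads, which keep the line SHARED in
   the local cache, and only issue the lock cmpxchg once the lock appears
   to be free.  */
static void
spin_lock_test_cmpxchg (int *lock)
{
  for (;;)
    {
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
        __builtin_ia32_pause ();   /* The equivalent of atomic_spin_nop.  */
      int expected = 0;
      if (__atomic_compare_exchange_n (lock, &expected, 1, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return;
    }
}

The first version retries the lock cmpxchg blindly; the second only issues
the RFO when there is a realistic chance of acquiring the lock.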

Now, let us go back to the impact of test and cmpxchg on the adaptive
mutex.
For the case with little or no lock contention, there is almost no
difference in lock performance between the two approaches, because the lock
is usually acquired immediately in the fast path.

For the case with slight lock contention, the lock is usually acquired
either immediately or by spinning; test and cmpxchg performs better than
plain cmpxchg even though it adds an extra test operation.  This is
probably because, for both lock acquisition (which decodes to "lock
cmpxchg") and lock release (which decodes to "lock dec"), the uncore snoops
triggered by the requesting core's RFOs can be answered immediately, since
fewer snoops are in flight.

For the case of severe lock contention (e.g. a large number of contending
threads), the improvement in lock performance may not be obvious even
though the number of RFOs is reduced significantly.  This is because the
behavior of the adaptive mutex changes: some lock acquisitions shift from
spinning to sleeping and being woken up via futex, so lock performance is
dominated by the overhead of the futex syscall and context switches.  Even
so, test and cmpxchg still has value because unnecessary uncore traffic is
reduced across the whole system.

================================Testing===================================
Test Machine:
Intel 2-socket Skylake platform (56 cores, 62 GB RAM)

Methodology:
Let us denote the size of the critical section by *s* and the size of the
non-critical section by *t*, with t = k*s.  Then, in the absence of
contention, a single core attempts to acquire the lock once every
s + t = (k+1)*s nanoseconds, i.e. at an arrival rate of 1/(k+1) lock
attempts per s nanoseconds.  We assume there are *n* threads contending for
the lock; each thread is bound to an individual CPU core and does the
following:
1) lock
2) spend *s* nanoseconds in the critical section
3) unlock
4) spend *t* nanoseconds in the non-critical section
in a loop until it reaches 100,000 iterations (the per-thread loop is
sketched after the note below).  Performance is measured by the number of
RFOs and by the average latency of lock acquisition and lock release.
*Note*: latency is measured in CPU cycles with the help of the RDTSCP
instruction.  The starting timestamp is taken before calling lock/unlock,
and the ending timestamp is taken once lock/unlock returns; the delta is
the latency of lock acquisition and lock release, respectively.  Invalid
samples (e.g. those where an interrupt was handled on that core during the
lock/unlock call) are discarded from the statistics.
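
For reference, the per-thread measurement loop can be sketched as follows.
This is only a simplified model of the benchmark described above (not Andi
Kleen's actual benchmark); delay_ns and record are hypothetical helpers,
and the per-thread CPU affinity setup is omitted.

#include <pthread.h>
#include <stdint.h>
#include <x86intrin.h>          /* __rdtscp */

#define ITERATIONS 100000

extern pthread_mutex_t lock;    /* Initialized with PTHREAD_MUTEX_ADAPTIVE_NP.  */
extern void delay_ns (long ns); /* Hypothetical: busy-wait for ns nanoseconds.  */
extern void record (uint64_t lock_cycles, uint64_t unlock_cycles);

static void *
worker (void *arg)
{
  unsigned int aux;
  for (int i = 0; i < ITERATIONS; i++)
    {
      uint64_t t0 = __rdtscp (&aux);
      pthread_mutex_lock (&lock);
      uint64_t t1 = __rdtscp (&aux);    /* t1 - t0: lock acquisition latency.  */

      delay_ns (200);                   /* Critical section, s = 200 ns.  */

      uint64_t t2 = __rdtscp (&aux);
      pthread_mutex_unlock (&lock);
      uint64_t t3 = __rdtscp (&aux);    /* t3 - t2: lock release latency.  */

      record (t1 - t0, t3 - t2);
      delay_ns (6 * 200);               /* Non-critical section, t = k*s = 1200 ns.  */
    }
  return NULL;
}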

To emulate different usage scenarios, we let k=6 and s=200ns and run this
workload with the different spinning methods.  In our workload, 1-5 threads
contending for a lock emulates little or no lock contention, 6-15 threads
emulates slight lock contention, and 16-56 threads emulates severe lock
contention across sockets (the benchmark is provided by Andi Kleen).

Test Command:
perf stat -a -e offcore_requests.demand_rfo ./run_workload

Result:
Generally, test and cmpxchg performs consistently better than cmpxchg
under lock contention; see details below.
For the case with little or no lock contention, there is no obvious
difference in lock performance between the two approaches.
For slight lock contention (8 threads) and severe lock contention (56
threads), the test and cmpxchg way performs better than the cmpxchg way:

                                 cmpxchg          test and cmpxchg
  8 threads (slight contention):
    avg lock acquisition           7501 cycles       5396 cycles  (-28.1%)
    avg lock release               1704 cycles       1095 cycles  (-35.7%)
    total RFOs                 22239643          12484358         (-43.9%)
  56 threads (severe contention):
    avg lock acquisition         136787 cycles     106050 cycles  (-22.5%)
    avg lock release              57787 cycles      49895 cycles  (-13.7%)
    total RFOs                317483283         215756961         (-32.0%)

Many thanks to H.J. Lu for helping to refine this patch, and to Stewart,
Lawrence for explaining these "lock cmpxchg" matters to me in detail.

    * nptl/pthread_mutex_lock.c: Use architecture-specific atomic spin API
    * nptl/pthread_mutex_timedlock.c: Likewise
    * nptl/pthread_spinlock.h: New file
    * sysdeps/unix/sysv/linux/x86/pthread_spinlock.h: New file

Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/pthread_mutex_lock.c                      |  3 ++-
 nptl/pthread_mutex_timedlock.c                 |  4 ++--
 nptl/pthread_spinlock.h                        | 23 +++++++++++++++++++
 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h | 31 ++++++++++++++++++++++++++
 4 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 nptl/pthread_spinlock.h
 create mode 100644 sysdeps/unix/sysv/linux/x86/pthread_spinlock.h

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1519c14..c910ec4 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -25,6 +25,7 @@ 
 #include "pthreadP.h"
 #include <atomic.h>
 #include <lowlevellock.h>
+#include <pthread_spinlock.h>
 #include <stap-probe.h>
 
 #ifndef lll_lock_elision
@@ -133,7 +134,7 @@  __pthread_mutex_lock (pthread_mutex_t *mutex)
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 28237b0..2ede5a0 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,7 +25,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <not-cancel.h>
-
+#include <pthread_spinlock.h>
 #include <stap-probe.h>
 
 #ifndef lll_timedlock_elision
@@ -126,7 +126,7 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 					  PTHREAD_MUTEX_PSHARED (mutex));
 		  break;
 		}
-	      atomic_spin_nop ();
+	      atomic_spin_lock (&mutex->__data.__lock, &cnt, max_cnt);
 	    }
 	  while (lll_trylock (mutex->__data.__lock) != 0);
 
diff --git a/nptl/pthread_spinlock.h b/nptl/pthread_spinlock.h
new file mode 100644
index 0000000..8bd7c16
--- /dev/null
+++ b/nptl/pthread_spinlock.h
@@ -0,0 +1,23 @@ 
+/* Functions for pthread_spinlock_t.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  atomic_spin_nop ();
+}
diff --git a/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
new file mode 100644
index 0000000..5ca84d1
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/pthread_spinlock.h
@@ -0,0 +1,31 @@ 
+/* Functions for pthread_spinlock_t.  X86 version.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+static __always_inline void
+atomic_spin_lock (pthread_spinlock_t *lock, int *cnt_p, int max_cnt)
+{
+  int val = 0;
+  int cnt = *cnt_p;
+  do
+    {
+      atomic_spin_nop ();
+      val = atomic_load_relaxed (lock);
+    }
+  while (val != 0 && ++cnt < max_cnt);
+  *cnt_p = cnt;
+}