nptl: Add backoff mechanism to spinlock loop

Message ID 20220325015050.2065523-1-wangyang.guo@intel.com
State Superseded
Series: nptl: Add backoff mechanism to spinlock loop

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Wangyang Guo March 25, 2022, 1:50 a.m. UTC
When multiple threads wait for the lock at the same time, once the lock
owner releases the lock, the waiters will all see the lock as available
and try to acquire it, which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced. As the
number of try-lock attempts increases, it becomes more likely that many
threads are competing for the adaptive mutex lock, so the wait time is
increased exponentially. Random jitter is added to avoid synchronized
try-lock attempts from other threads.

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 nptl/pthread_mutex_lock.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)
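
For illustration, here is a minimal self-contained sketch of the scheme the
commit message describes, written with C11 atomics rather than glibc
internals; the function name, the plain int lock word, and the use of
rand () in place of random_bits () are assumptions of the sketch, not part
of the patch:

#include <stdatomic.h>
#include <stdlib.h>

/* Sketch only: binary exponential backoff with random jitter.  Each
   failed round spins exp_backoff plus a jitter in [0, exp_backoff - 1]
   iterations, then doubles exp_backoff, up to a spin budget.  */
static void
spin_lock_with_backoff (atomic_int *lock, int max_cnt)
{
  int cnt = 0, exp_backoff = 1;
  unsigned int jitter = (unsigned int) rand ();  /* random_bits () in glibc */
  int expected = 0;
  while (atomic_load_explicit (lock, memory_order_relaxed) != 0
         || !atomic_compare_exchange_weak_explicit (lock, &expected, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
    {
      expected = 0;  /* the failed CAS clobbered it */
      /* exp_backoff is a power of two, so the mask yields a jitter
         in [0, exp_backoff - 1].  */
      int spin_count = exp_backoff + (jitter & (exp_backoff - 1));
      cnt += spin_count;
      if (cnt >= max_cnt)
        {
          /* Spin budget exhausted: glibc would block on the futex here;
             the sketch just keeps retrying the CAS.  */
          while (!atomic_compare_exchange_weak_explicit (
                     lock, &expected, 1,
                     memory_order_acquire, memory_order_relaxed))
            expected = 0;
          return;
        }
      while (spin_count-- > 0)
        ;  /* atomic_spin_nop () in glibc, e.g. the x86 `pause` insn */
      exp_backoff <<= 1;  /* double the wait for the next round */
    }
}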
  

Comments

Noah Goldstein March 25, 2022, 1:58 a.m. UTC | #1
On Thu, Mar 24, 2022 at 8:51 PM Wangyang Guo via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> When multiple threads wait for the lock at the same time, once the
> lock owner releases the lock, the waiters will all see the lock as
> available and try to acquire it, which may cause an expensive CAS storm.
>
> Binary exponential backoff with random jitter is introduced. As the
> number of try-lock attempts increases, it becomes more likely that many
> threads are competing for the adaptive mutex lock, so the wait time is
> increased exponentially. Random jitter is added to avoid synchronized
> try-lock attempts from other threads.
>
> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> ---
>  nptl/pthread_mutex_lock.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index d2e652d151..ec57dc3627 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -26,6 +26,7 @@
>  #include <futex-internal.h>
>  #include <stap-probe.h>
>  #include <shlib-compat.h>
> +#include <random-bits.h>
>
>  /* Some of the following definitions differ when pthread_mutex_cond_lock.c
>     includes this file.  */
> @@ -138,14 +139,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>           int cnt = 0;
>           int max_cnt = MIN (max_adaptive_count (),
>                              mutex->__data.__spins * 2 + 10);
> +         int spin_count, exp_backoff = 1;
> +         unsigned int jitter = random_bits ();
>           do
>             {
> -             if (cnt++ >= max_cnt)
> +             /* In each loop, spin count is exponential backoff plus
> +                random jitter, random range is [0, exp_backoff-1].  */
> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> +             cnt += spin_count;
> +             if (cnt >= max_cnt)
>                 {
> +                 /* If cnt exceeds max spin count, just go to wait
> +                    queue.  */
>                   LLL_MUTEX_LOCK (mutex);
>                   break;
>                 }
> -             atomic_spin_nop ();
> +             do
> +                 atomic_spin_nop ();
> +             while (--spin_count > 0);
> +             /* Binary exponential backoff, prepare for next loop.  */
> +             exp_backoff <<= 1;
>             }
>           while (LLL_MUTEX_READ_LOCK (mutex) != 0
Does this load not already protect against the 'CAS storm'?

>                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> --
> 2.35.1
>
  
develop--- via Libc-alpha March 25, 2022, 3:24 a.m. UTC | #2
>> +                 atomic_spin_nop ();
>> +             while (--spin_count > 0);
>> +             /* Binary exponential backoff, prepare for next loop.  */
>> +             exp_backoff <<= 1;
>>             }
>>           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> Does this load not already protect against the 'CAS storm'?
This only prevents the CAS while the lock is held for a long time.
But if multiple threads are waiting for the lock at the same time, many of them
will be reading the lock state; once the lock owner releases the lock at that
point, those waiting threads will all see the lock as unlocked. In the next
step, they will all try to lock at the same time. Backoff is introduced to
solve this problem.
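
To make that failure mode concrete, here is a hedged sketch of the pattern
being described, again in C11 atomics with illustrative names rather than
the actual glibc code:

#include <stdatomic.h>

/* Plain test-and-test-and-set: all waiters spin on the relaxed load;
   the moment the owner stores 0, they all fall through at once and
   issue competing CASes -- the "CAS storm".  */
static void
ttas_lock (atomic_int *lock)
{
  for (;;)
    {
      /* Read-only spin: cheap, the cache line stays shared.  */
      while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
        ;
      int expected = 0;
      /* On release, every spinning thread reaches this CAS together;
         one succeeds, the rest invalidate the line and retry.  */
      if (atomic_compare_exchange_strong_explicit (lock, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed))
        return;
    }
}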

>>                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
  
Noah Goldstein March 25, 2022, 3:42 a.m. UTC | #3
On Thu, Mar 24, 2022 at 10:24 PM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>
>
> >> +                 atomic_spin_nop ();
> >> +             while (--spin_count > 0);
> >> +             /* Binary exponential backoff, prepare for next loop.  */
> >> +             exp_backoff <<= 1;
> >>             }
> >>           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> > Does this load not already protect against the 'CAS storm'?
> This only prevents the CAS while the lock is held for a long time.
> But if multiple threads are waiting for the lock at the same time, many of them will be reading the lock state;
> once the lock owner releases the lock at that point, those waiting threads will all see the lock as unlocked.
> In the next step, they will all try to lock at the same time. Backoff is introduced to solve this problem.

The loop isn't spinning on CAS failure, which is typically where
you see the poor performance.
I get that there can still be some contention on the CAS, but
shouldn't the read check limit the eviction ping-ponging?



>
> >>                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
>
  
develop--- via Libc-alpha March 25, 2022, 4:32 a.m. UTC | #4
Noah Goldstein wrote:
> > >> +                 atomic_spin_nop ();
> > >> +             while (--spin_count > 0);
> > >> +             /* Binary exponential backoff, prepare for next loop.  */
> > >> +             exp_backoff <<= 1;
> > >>             }
> > >>           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> > > Does this load not already protect against the 'CAS storm'?
> > This only prevents the CAS while the lock is held for a long time.
> > But if multiple threads are waiting for the lock at the same time, many
> > of them will be reading the lock state; once the lock owner releases the lock at that point, those waiting threads will all see the lock as unlocked.
> > In the next step, they will all try to lock at the same time. Backoff is introduced to solve this problem.
> 
> The loop isn't spinning on CAS failure, which is typically where you see the poor performance.
> I get that there can still be some contention on the CAS, but shouldn't the read check limit the eviction ping-ponging?

Yes, the read check can help.
But in a very high contention case, it becomes much easier to hit the above problem, which needs backoff.
> >
> > >>                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> >
  
Noah Goldstein March 25, 2022, 3:25 p.m. UTC | #5
On Thu, Mar 24, 2022 at 11:32 PM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>
> Noah Goldstein wrote:
> > > >> +                 atomic_spin_nop ();
> > > >> +             while (--spin_count > 0);
> > > >> +             /* Binary exponential backoff, prepare for next loop.  */
> > > >> +             exp_backoff <<= 1;
> > > >>             }
> > > >>           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> > > > Does this load not already protect against the 'CAS storm'?
> > > This only prevents the CAS while the lock is held for a long time.
> > > But if multiple threads are waiting for the lock at the same time, many
> > > of them will be reading the lock state; once the lock owner releases the lock at that point, those waiting threads will all see the lock as unlocked.
> > > In the next step, they will all try to lock at the same time. Backoff is introduced to solve this problem.
> >
> > The loop isn't spinning on CAS failure, which is typically where you see the poor performance.
> > I get that there can still be some contention on the CAS, but shouldn't the read check limit the eviction ping-ponging?
>
> Yes, the read check can help.
> But in a very high contention case, it becomes much easier to hit the above problem, which needs backoff.

Do we need both then?

I made a quick benchmark to test this out and you're right: using the lock
for a tiny critical section at least (incrementing an int), I see fewer
failed CAS attempts with this patch and better performance :)
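
For reference, a benchmark along those lines might look like the sketch
below. This is an assumed reconstruction, not the benchmark actually used;
the thread and iteration counts are arbitrary:

#define _GNU_SOURCE  /* for PTHREAD_MUTEX_ADAPTIVE_NP */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m;
static long counter;

static void *
worker (void *arg)
{
  (void) arg;
  for (int i = 0; i < 1000000; i++)
    {
      pthread_mutex_lock (&m);
      counter++;  /* tiny critical section */
      pthread_mutex_unlock (&m);
    }
  return NULL;
}

int
main (void)
{
  pthread_mutexattr_t attr;
  pthread_mutexattr_init (&attr);
  /* PTHREAD_MUTEX_ADAPTIVE_NP selects the spinning path patched here.  */
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init (&m, &attr);

  enum { NTHREADS = 16 };
  pthread_t t[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&t[i], NULL, worker, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (t[i], NULL);

  printf ("counter = %ld\n", counter);
  return 0;
}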


> > >
> > > >>                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> > >
  
develop--- via Libc-alpha March 28, 2022, 8:35 a.m. UTC | #6
> On Thu, Mar 24, 2022 at 11:32 PM Guo, Wangyang <wangyang.guo@intel.com> wrote:
> >
> > Noah Goldstein wrote:
> > > > >> +                 atomic_spin_nop ();
> > > > >> +             while (--spin_count > 0);
> > > > >> +             /* Binary exponential backoff, prepare for next loop.  */
> > > > >> +             exp_backoff <<= 1;
> > > > >>             }
> > > > >>           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> > > > > Does this load not already protect against the 'CAS storm'?
> > > > This only prevents the CAS while the lock is held for a long time.
> > > > But if multiple threads are waiting for the lock at the same time, many
> > > > of them will be reading the lock state; once the lock owner releases the lock at that point, those waiting threads will all see the lock as unlocked.
> > > > In the next step, they will all try to lock at the same time. Backoff is introduced to solve this problem.
> > >
> > > The loop isn't spinning on CAS failure, which is typically where you see the poor performance.
> > > I get that there can still be some contention on the CAS, but shouldn't the read check limit the eviction ping-ponging?
> >
> > Yes, the read check can help.
> > But in a very high contention case, it becomes much easier to hit the above problem, which needs backoff.
> Do we need both then?
> I made a quick benchmark to test this out and you're right: using the lock for a tiny critical section at least (incrementing an int), I see fewer failed CAS attempts with this patch and better performance :)

In theory, the read check still helps when the lock is held for a long time, but it adds extra overhead when the lock can be acquired immediately.
From my testing, dropping the read check gives better performance.
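
In terms of the loop in question, the two alternatives being compared differ
only in the retry condition; roughly, as a sketch of the trade-off (not a
proposed patch):

/* Variant A (this patch): read-only check first, then the trylock.
   Cheap while the lock stays held, but adds a load on every retry.  */
while (LLL_MUTEX_READ_LOCK (mutex) != 0
       || LLL_MUTEX_TRYLOCK (mutex) != 0);

/* Variant B (read check dropped): go straight to the trylock and rely
   on the backoff alone to space out the CAS attempts.  */
while (LLL_MUTEX_TRYLOCK (mutex) != 0);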

> > > >
> > > > >>                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> > > >
  

Patch

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index d2e652d151..ec57dc3627 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@ 
 #include <futex-internal.h>
 #include <stap-probe.h>
 #include <shlib-compat.h>
+#include <random-bits.h>
 
 /* Some of the following definitions differ when pthread_mutex_cond_lock.c
    includes this file.  */
@@ -138,14 +139,26 @@  PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (max_adaptive_count (),
 			     mutex->__data.__spins * 2 + 10);
+	  int spin_count, exp_backoff = 1;
+	  unsigned int jitter = random_bits ();
 	  do
 	    {
-	      if (cnt++ >= max_cnt)
+	      /* In each loop, spin count is exponential backoff plus
+	         random jitter, random range is [0, exp_backoff-1].  */
+	      spin_count = exp_backoff + (jitter & (exp_backoff - 1));
+	      cnt += spin_count;
+	      if (cnt >= max_cnt)
 		{
+		  /* If cnt exceeds max spin count, just go to wait
+		     queue.  */
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      do
+		  atomic_spin_nop ();
+	      while (--spin_count > 0);
+	      /* Binary exponential backoff, prepare for next loop.  */
+	      exp_backoff <<= 1;
 	    }
 	  while (LLL_MUTEX_READ_LOCK (mutex) != 0
 		 || LLL_MUTEX_TRYLOCK (mutex) != 0);