[v6,1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]

Message ID 20211111162428.2286605-2-hjl.tools@gmail.com
State Committed
Commit d672a98a1af106bd68deb15576710cd61363f7a6
Series Optimize CAS [BZ #28537]

Checks

dj/TryBot-apply_patch: success (Patch applied to master at the time it was sent)

Commit Message

H.J. Lu Nov. 11, 2021, 4:24 p.m. UTC
The CAS instruction is expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading.  See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare-and-swap grabs the cache line in exclusive state and causes
excessive cache line bouncing.

Add LLL_MUTEX_READ_LOCK to do an atomic load and skip the CAS in the spinlock
loop when the compare is likely to fail, reducing cache line bouncing on
contended locks.
---
 nptl/pthread_mutex_lock.c | 7 +++++++
 1 file changed, 7 insertions(+)
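
For readers new to the pattern: the change turns the contended path from a
bare CAS spin into a test-and-test-and-set spin.  A minimal standalone sketch
in C11 atomics (toy_lock_t and both functions are illustrative names, not
glibc code; glibc's lock word is mutex->__data.__lock and its pause hint is
atomic_spin_nop ()):

  #include <stdatomic.h>
  #include <sched.h>

  typedef struct { atomic_int word; } toy_lock_t;

  /* Bare CAS spin: every failed attempt still requests the cache line
     in exclusive state, so contending cores keep bouncing it.  */
  static void toy_lock_cas_only (toy_lock_t *l)
  {
    int expected;
    do
      expected = 0;
    while (!atomic_compare_exchange_weak (&l->word, &expected, 1));
  }

  /* Test-and-test-and-set: spin on a relaxed load, which keeps the cache
     line in shared state, and attempt the CAS only once the lock looks
     free.  */
  static void toy_lock_ttas (toy_lock_t *l)
  {
    for (;;)
      {
        if (atomic_load_explicit (&l->word, memory_order_relaxed) == 0)
          {
            int expected = 0;
            if (atomic_compare_exchange_weak (&l->word, &expected, 1))
              return;
          }
        sched_yield ();   /* stand-in for a pause/spin hint */
      }
  }

The unlock side (an atomic store of 0) is omitted; the point is only the
shared-versus-exclusive cache line behaviour of the spinning itself.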
  

Comments

Szabolcs Nagy Nov. 12, 2021, 5:23 p.m. UTC | #1
The 11/11/2021 08:24, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> 
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
> 
> Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> loop if compare may fail to reduce cache line bouncing on contended locks.

essentially the contended loop goes from

  if (cas(lock, 0, 1))
    break;

to

  if (read(lock) == 0)
    if (cas(lock, 0, 1))
      break;

i believe this helps on aarch64 too if it is only
done after contention is detected.

since the first cas access to lock is kept as is,
the common, uncontended case should not be affected.
this looks good to me.

> ---
>  nptl/pthread_mutex_lock.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 2bd41767e0..72058c719c 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>  # define PTHREAD_MUTEX_VERSIONS 1
>  #endif
>  
> +#ifndef LLL_MUTEX_READ_LOCK
> +# define LLL_MUTEX_READ_LOCK(mutex) \
> +  atomic_load_relaxed (&(mutex)->__data.__lock)
> +#endif
> +
>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>       __attribute_noinline__;
>  
> @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>  		  break;
>  		}
>  	      atomic_spin_nop ();
> +	      if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> +		continue;
>  	    }
>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>  
> -- 
> 2.33.1
>
  
Noah Goldstein Nov. 17, 2021, 2:24 a.m. UTC | #2
On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
>
> Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> loop if compare may fail to reduce cache line bouncing on contended locks.
> ---
>  nptl/pthread_mutex_lock.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 2bd41767e0..72058c719c 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>  # define PTHREAD_MUTEX_VERSIONS 1
>  #endif
>
> +#ifndef LLL_MUTEX_READ_LOCK
> +# define LLL_MUTEX_READ_LOCK(mutex) \
> +  atomic_load_relaxed (&(mutex)->__data.__lock)
> +#endif
> +
>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>       __attribute_noinline__;
>
> @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>                   break;
>                 }
>               atomic_spin_nop ();
> +             if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> +               continue;

Now that the lock spins on a simple read, should `max_cnt` be adjusted?
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
>             }
>           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>
> --
> 2.33.1
>
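
For context on the max_cnt question: it bounds the adaptive-mutex spin that
the linked line belongs to.  A paraphrase of that loop as of the commit above
(reconstructed for illustration, so spacing and line numbers may not match
the tree exactly):

  int cnt = 0;
  int max_cnt = MIN (max_adaptive_count (),
                     mutex->__data.__spins * 2 + 10);
  do
    {
      if (cnt++ >= max_cnt)
        {
          /* Give up spinning and block in the kernel.  */
          LLL_MUTEX_LOCK (mutex);
          break;
        }
      atomic_spin_nop ();
      if (LLL_MUTEX_READ_LOCK (mutex) != 0)   /* added by this patch */
        continue;
    }
  while (LLL_MUTEX_TRYLOCK (mutex) != 0);

  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;

With the read check in place, an iteration that merely observes a held lock
is much cheaper than a failed CAS, so the same max_cnt now covers less
wall-clock time; that is the retuning question raised here and in the
replies below.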
  
H.J. Lu Nov. 17, 2021, 11:54 p.m. UTC | #3
On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > a cache line for writing is more expensive than reading.  See Appendix
> > A.2 Spinlock in:
> >
> > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >
> > The full compare and swap will grab the cache line exclusive and cause
> > excessive cache line bouncing.
> >
> > Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> > loop if compare may fail to reduce cache line bouncing on contended locks.
> > ---
> >  nptl/pthread_mutex_lock.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > index 2bd41767e0..72058c719c 100644
> > --- a/nptl/pthread_mutex_lock.c
> > +++ b/nptl/pthread_mutex_lock.c
> > @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> >  # define PTHREAD_MUTEX_VERSIONS 1
> >  #endif
> >
> > +#ifndef LLL_MUTEX_READ_LOCK
> > +# define LLL_MUTEX_READ_LOCK(mutex) \
> > +  atomic_load_relaxed (&(mutex)->__data.__lock)
> > +#endif
> > +
> >  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >       __attribute_noinline__;
> >
> > @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> >                   break;
> >                 }
> >               atomic_spin_nop ();
> > +             if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> > +               continue;
>
> Now that the lock spins on a simple read should `max_cnt` be adjusted?

Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
LLL_MUTEX_TRYLOCK.  It doesn't change the flow.

> https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
> >             }
> >           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> >
> > --
> > 2.33.1
> >
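
To make "more expensive" concrete: in this file LLL_MUTEX_TRYLOCK defaults to
lll_trylock, i.e. a CAS on the lock word, whereas the new macro is a plain
relaxed load.  Paraphrasing the default definitions in
nptl/pthread_mutex_lock.c (quoted from memory, so the exact spelling may
differ slightly):

  /* Pre-existing default: a CAS attempt on the lock word.  */
  # define LLL_MUTEX_TRYLOCK(mutex) \
    lll_trylock ((mutex)->__data.__lock)

  /* New default added by this patch: a read-only probe.  */
  # define LLL_MUTEX_READ_LOCK(mutex) \
    atomic_load_relaxed (&(mutex)->__data.__lock)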
  
Noah Goldstein Nov. 18, 2021, 12:03 a.m. UTC | #4
On Wed, Nov 17, 2021 at 5:55 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> > > loop if compare may fail to reduce cache line bouncing on contended locks.
> > > ---
> > >  nptl/pthread_mutex_lock.c | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > > index 2bd41767e0..72058c719c 100644
> > > --- a/nptl/pthread_mutex_lock.c
> > > +++ b/nptl/pthread_mutex_lock.c
> > > @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> > >  # define PTHREAD_MUTEX_VERSIONS 1
> > >  #endif
> > >
> > > +#ifndef LLL_MUTEX_READ_LOCK
> > > +# define LLL_MUTEX_READ_LOCK(mutex) \
> > > +  atomic_load_relaxed (&(mutex)->__data.__lock)
> > > +#endif
> > > +
> > >  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> > >       __attribute_noinline__;
> > >
> > > @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> > >                   break;
> > >                 }
> > >               atomic_spin_nop ();
> > > +             if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> > > +               continue;
> >
> > Now that the lock spins on a simple read should `max_cnt` be adjusted?
>
> Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
> LLL_MUTEX_TRYLOCK.  It doesn't change the flow.

Yes, but the loop will be able to run `max_cnt` iterations much faster now.
Just wondering if the value needs to be re-tuned.  Not that it necessarily
needs to be.
>
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
> > >             }
> > >           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> > >
> > > --
> > > 2.33.1
> > >
>
>
>
> --
> H.J.
  
H.J. Lu Nov. 18, 2021, 12:31 a.m. UTC | #5
On Wed, Nov 17, 2021 at 4:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Wed, Nov 17, 2021 at 5:55 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading.  See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> > > > loop if compare may fail to reduce cache line bouncing on contended locks.
> > > > ---
> > > >  nptl/pthread_mutex_lock.c | 7 +++++++
> > > >  1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > > > index 2bd41767e0..72058c719c 100644
> > > > --- a/nptl/pthread_mutex_lock.c
> > > > +++ b/nptl/pthread_mutex_lock.c
> > > > @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> > > >  # define PTHREAD_MUTEX_VERSIONS 1
> > > >  #endif
> > > >
> > > > +#ifndef LLL_MUTEX_READ_LOCK
> > > > +# define LLL_MUTEX_READ_LOCK(mutex) \
> > > > +  atomic_load_relaxed (&(mutex)->__data.__lock)
> > > > +#endif
> > > > +
> > > >  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> > > >       __attribute_noinline__;
> > > >
> > > > @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> > > >                   break;
> > > >                 }
> > > >               atomic_spin_nop ();
> > > > +             if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> > > > +               continue;
> > >
> > > Now that the lock spins on a simple read should `max_cnt` be adjusted?
> >
> > Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
> > LLL_MUTEX_TRYLOCK.  It doesn't change the flow.
>
> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
> to be.

Maybe, if we can find some data to show it is needed.

> >
> > > https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
> > > >             }
> > > >           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> > > >
> > > > --
> > > > 2.33.1
> > > >
> >
> >
> >
> > --
> > H.J.
  
Arjan van de Ven Nov. 18, 2021, 1:16 a.m. UTC | #6
On 11/17/2021 4:31 PM, H.J. Lu wrote:
>> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
>> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
>> to be.
> Maybe if we can find some data to show for.
> 

wondering when this was last tuned... I assume today's CPUs and CPUs from, say, 5 or 10 years ago
have an order of magnitude different performance in this regard.
if there wasn't a need to retune during that span, maybe this value is so robust that it doesn't need
retuning.

or maybe it's time to retune this in general sometime soon after this patch goes in ;)
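
On gathering data: a throwaway contention microbenchmark is usually enough to
see whether the spin ceiling matters on a given machine.  The sketch below is
illustrative only (not an existing glibc benchtest); it runs T threads on one
PTHREAD_MUTEX_ADAPTIVE_NP mutex with a tiny critical section and reports
acquisitions per second, which can then be compared across builds or spin
settings:

  #define _GNU_SOURCE           /* for PTHREAD_MUTEX_ADAPTIVE_NP */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define ITERS 1000000L        /* acquisitions per thread */

  static pthread_mutex_t m;
  static long shared;

  static void *worker (void *arg)
  {
    (void) arg;
    for (long i = 0; i < ITERS; i++)
      {
        pthread_mutex_lock (&m);
        shared++;                       /* short critical section */
        pthread_mutex_unlock (&m);
      }
    return NULL;
  }

  int main (int argc, char **argv)
  {
    int nthreads = argc > 1 ? atoi (argv[1]) : 4;
    if (nthreads < 1 || nthreads > 64)
      nthreads = 4;
    pthread_t tid[64];
    pthread_mutexattr_t attr;

    pthread_mutexattr_init (&attr);
    pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init (&m, &attr);

    struct timespec t0, t1;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
      pthread_create (&tid[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
      pthread_join (tid[i], NULL);
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf ("%d threads: %.3f s (%.2f M acquisitions/s)\n",
            nthreads, secs, nthreads * (ITERS / 1e6) / secs);
    return 0;
  }

Build with "gcc -O2 -pthread" and run the same binary against a patched and
an unpatched glibc to compare the contended throughput.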
  
Sunil Pandey Sept. 11, 2022, 8:19 p.m. UTC | #7
On Wed, Nov 17, 2021 at 5:17 PM Arjan van de Ven via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On 11/17/2021 4:31 PM, H.J. Lu wrote:
> >> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
> >> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
> >> to be.
> > Maybe if we can find some data to show for.
> >
>
> wondering when this was last tuned.. I assume todays CPUs and CPUs from, say, 5 or 10 years ago
> have an order of magnitude different performance in this regard....
> if there wasn't a need to retune during that, maybe this value is so robust that it doesn't need
> retuning.
>
> or maybe it's time to retune this in general sometime soon after this patch goes in ;)

I would like to backport this patch to release branches 2.33 and 2.34.

Any comments, suggestions, or objections?

commit d672a98a1af106bd68deb15576710cd61363f7a6
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Nov 2 18:33:07 2021 -0700

    Add LLL_MUTEX_READ_LOCK [BZ #28537]

    CAS instruction is expensive.  From the x86 CPU's point of view, getting
    a cache line for writing is more expensive than reading.  See Appendix
    A.2 Spinlock in:

    https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
  
Noah Goldstein Sept. 29, 2022, 12:10 a.m. UTC | #8
On Sun, Sep 11, 2022 at 4:19 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
> On Wed, Nov 17, 2021 at 5:17 PM Arjan van de Ven via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > On 11/17/2021 4:31 PM, H.J. Lu wrote:
> > >> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
> > >> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
> > >> to be.
> > > Maybe if we can find some data to show for.
> > >
> >
> > wondering when this was last tuned.. I assume todays CPUs and CPUs from, say, 5 or 10 years ago
> > have an order of magnitude different performance in this regard....
> > if there wasn't a need to retune during that, maybe this value is so robust that it doesn't need
> > retuning.
> >
> > or maybe it's time to retune this in general sometime soon after this patch goes in ;)
>
> I would like to backport this patch to release branch 2.33 and 2.34

Fine by me.
>
> Any comments/suggestions or objections on this.
>
> commit d672a98a1af106bd68deb15576710cd61363f7a6
> Author: H.J. Lu <hjl.tools@gmail.com>
> Date:   Tue Nov 2 18:33:07 2021 -0700
>
>     Add LLL_MUTEX_READ_LOCK [BZ #28537]
>
>     CAS instruction is expensive.  From the x86 CPU's point of view, getting
>     a cache line for writing is more expensive than reading.  See Appendix
>     A.2 Spinlock in:
>
>     https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
  

Patch

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 2bd41767e0..72058c719c 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -64,6 +64,11 @@  lll_mutex_lock_optimized (pthread_mutex_t *mutex)
 # define PTHREAD_MUTEX_VERSIONS 1
 #endif
 
+#ifndef LLL_MUTEX_READ_LOCK
+# define LLL_MUTEX_READ_LOCK(mutex) \
+  atomic_load_relaxed (&(mutex)->__data.__lock)
+#endif
+
 static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
      __attribute_noinline__;
 
@@ -141,6 +146,8 @@  PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
 		  break;
 		}
 	      atomic_spin_nop ();
+	      if (LLL_MUTEX_READ_LOCK (mutex) != 0)
+		continue;
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);