[v6,1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
Checks
Context: dj/TryBot-apply_patch
Check: success
Description: Patch applied to master at the time it was sent
Commit Message
The CAS instruction is expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading it.  See Appendix
A.2 "Spinlock" in:
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
A full compare-and-swap grabs the cache line in exclusive state and causes
excessive cache line bouncing.
Add LLL_MUTEX_READ_LOCK to do an atomic load, and skip the CAS in the
spinlock loop when the compare is likely to fail, to reduce cache line
bouncing on contended locks.
---
nptl/pthread_mutex_lock.c | 7 +++++++
1 file changed, 7 insertions(+)
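The optimization described in the commit message is the classic test-and-test-and-set pattern. A minimal standalone sketch using C11 atomics, to illustrate the idea only — the names and structure here are hypothetical, not the actual glibc code:

```c
#include <stdatomic.h>

/* Before attempting the CAS (which needs the cache line in exclusive
   state), a contended waiter spins on a plain relaxed load, which only
   needs the line in shared state.  Sketch of the idea, not glibc code.  */

static atomic_int lock;		/* 0 = unlocked, 1 = locked */

static void
spin_lock (void)
{
  for (;;)
    {
      int expected = 0;
      /* Uncontended fast path: a single CAS, same as before the patch.  */
      if (atomic_compare_exchange_weak_explicit (&lock, &expected, 1,
						 memory_order_acquire,
						 memory_order_relaxed))
	return;

      /* Contended: wait with plain loads until the lock looks free, so
	 the spinning CPU does not bounce the cache line.  */
      while (atomic_load_explicit (&lock, memory_order_relaxed) != 0)
	;			/* a pause hint would go here */
    }
}

static void
spin_unlock (void)
{
  atomic_store_explicit (&lock, 0, memory_order_release);
}
```

The key point is that the inner `while` only performs reads, so many waiters can spin concurrently on a shared copy of the line; only when the lock is observed free do they compete with a CAS.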
Comments
The 11/11/2021 08:24, H.J. Lu via Libc-alpha wrote:
essentially the contended loop goes from

    if (cas(lock, 0, 1))
      break;

to

    if (read(lock) == 0)
      if (cas(lock, 0, 1))
        break;
i believe this helps on aarch64 too if it is only
done after contention is detected.
since the first cas access to lock is kept as is,
the common, uncontended case should not be affected.
this looks good to me.
> ---
> nptl/pthread_mutex_lock.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 2bd41767e0..72058c719c 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> # define PTHREAD_MUTEX_VERSIONS 1
> #endif
>
> +#ifndef LLL_MUTEX_READ_LOCK
> +# define LLL_MUTEX_READ_LOCK(mutex) \
> + atomic_load_relaxed (&(mutex)->__data.__lock)
> +#endif
> +
> static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> __attribute_noinline__;
>
> @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> break;
> }
> atomic_spin_nop ();
> + if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> + continue;
> }
> while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>
> --
> 2.33.1
>
On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> break;
> }
> atomic_spin_nop ();
> + if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> + continue;
Now that the lock spins on a simple read, should `max_cnt` be adjusted?
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> Now that the lock spins on a simple read should `max_cnt` be adjusted?
Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
LLL_MUTEX_TRYLOCK. It doesn't change the flow.
On Wed, Nov 17, 2021 at 5:55 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Now that the lock spins on a simple read should `max_cnt` be adjusted?
>
> Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
> LLL_MUTEX_TRYLOCK. It doesn't change the flow.
Yes, but the loop will be able to run `max_cnt` iterations much faster now.
Just wondering if the value needs to be re-tuned. Not that it necessarily
needs to be.
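To make the question concrete: the spin budget bounds loop iterations, and with the patch a failing iteration costs only a relaxed load instead of a CAS, so the same count now covers a much shorter wall-clock spin. A simplified, runnable model of the bounded spin (hypothetical names, C11 atomics — not the glibc code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the bounded adaptive spin, showing where
   max_cnt fits.  Each failed iteration is a relaxed read; the CAS is
   only attempted once the lock is observed free.  */

static atomic_int lock;		/* 0 = unlocked, 1 = locked */

static bool
adaptive_try_lock (int max_cnt)
{
  for (int cnt = 0; cnt < max_cnt; cnt++)
    {
      /* The patch's read: skip the CAS while the lock still looks
	 held, so this iteration is just a cheap shared-state read.  */
      if (atomic_load_explicit (&lock, memory_order_relaxed) != 0)
	continue;
      int expected = 0;
      if (atomic_compare_exchange_strong_explicit (&lock, &expected, 1,
						   memory_order_acquire,
						   memory_order_relaxed))
	return true;
    }
  return false;	/* spin budget exhausted; the real code blocks on the futex */
}
```

Since the read-only iterations retire far faster than CAS-based ones, a budget tuned for CAS iterations now times out sooner in wall-clock terms, which is the retuning question being raised.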
>
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
> > > }
> > > while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> > >
> > > --
> > > 2.33.1
> > >
>
>
>
> --
> H.J.
On Wed, Nov 17, 2021 at 4:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Wed, Nov 17, 2021 at 5:55 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > Now that the lock spins on a simple read should `max_cnt` be adjusted?
> >
> > Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
> > LLL_MUTEX_TRYLOCK. It doesn't change the flow.
>
> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
> to be.
Maybe if we can find some data to show for.
On 11/17/2021 4:31 PM, H.J. Lu wrote:
>> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
>> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
>> to be.
> Maybe if we can find some data to show for.
>
wondering when this was last tuned.. I assume today's CPUs and CPUs from, say, 5 or 10 years ago
have an order of magnitude difference in performance in this regard....
if there wasn't a need to retune during that time, maybe this value is so robust that it doesn't
need retuning.
or maybe it's time to retune this in general sometime soon after this patch goes in ;)
On Wed, Nov 17, 2021 at 5:17 PM Arjan van de Ven via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
I would like to backport this patch to release branches 2.33 and 2.34.
Any comments, suggestions, or objections?
commit d672a98a1af106bd68deb15576710cd61363f7a6
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Nov 2 18:33:07 2021 -0700
Add LLL_MUTEX_READ_LOCK [BZ #28537]
On Sun, Sep 11, 2022 at 4:19 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
>
> I would like to backport this patch to release branch 2.33 and 2.34
Fine by me.