[v3] x86: Optimize atomic_compare_and_exchange_[val|bool]_acq [BZ #28537]

Message ID 20211104161443.734681-1-hjl.tools@gmail.com
State Superseded
Series [v3] x86: Optimize atomic_compare_and_exchange_[val|bool]_acq [BZ #28537]

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

H.J. Lu Nov. 4, 2021, 4:14 p.m. UTC
From the CPU's point of view, getting a cache line for writing is more
expensive than getting it for reading.  See Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare and swap grabs the cache line exclusively and causes
excessive cache line bouncing.  Instead, load the current memory value
through a volatile pointer first; this load is atomic and will not be
optimized out by the compiler.  If the loaded value does not match the
expected one, return immediately rather than attempting the write, which
reduces cache line bouncing on contended locks.
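
This is essentially the test-and-test-and-set idea: read the line in
shared state and only issue the locked CAS when it can plausibly succeed.
A minimal standalone sketch, assuming a 0/1 lock word and a hypothetical
try_lock helper rather than the glibc macros:

/* Sketch only: "try_lock" is a hypothetical helper, not a glibc
   interface.  *LOCK == 0 means unlocked, 1 means locked.  Returns 0 on
   success, nonzero if the lock was not acquired.  */
static inline int
try_lock (int *lock)
{
  /* Plain aligned int loads are atomic on x86; the volatile pointer
     keeps the compiler from eliminating the load.  This only needs the
     cache line in shared state.  */
  if (*(volatile int *) lock != 0)
    return 1;               /* Already held: skip the locked CAS.  */
  /* The lock looks free; now pay for exclusive (RFO) ownership.  */
  return !__sync_bool_compare_and_swap (lock, 0, 1);
}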

This fixes BZ #28537.

A GCC bug has been opened:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065

The fixed compiler should define __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK
to indicate that the compiler will generate the check with the volatile load.
Then glibc can check __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK to avoid the
extra volatile load.
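
For example, with such a compiler the header could be guarded roughly as
follows (a sketch of the intent only; the bool variant would be handled
the same way):

/* Sketch: prefer the plain builtin when the compiler advertises that it
   already emits the load-and-check itself; otherwise do it by hand as in
   this patch.  __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK is not defined by
   any released compiler.  */
#ifdef __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK
# define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  __sync_val_compare_and_swap (mem, oldval, newval)
#else
# define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  ({ volatile __typeof (*(mem)) *memp = (mem);                          \
     __typeof (*(mem)) oldmem = *memp, ret;                             \
     ret = (oldmem == (oldval)                                          \
            ? __sync_val_compare_and_swap (mem, oldval, newval)         \
            : oldmem);                                                  \
     ret; })
#endif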
---
 sysdeps/x86/atomic-machine.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
  

Comments

Noah Goldstein Nov. 8, 2021, 3:32 p.m. UTC | #1
On Thu, Nov 4, 2021 at 11:15 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..597dc1cf92 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,20 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS       0
>
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ volatile __typeof (*(mem)) *memp = (mem);                         \
> +     __typeof (*(mem)) oldmem = *memp, ret;                            \
> +     ret = (oldmem == (oldval)                                         \
> +           ? __sync_val_compare_and_swap (mem, oldval, newval)         \
> +           : oldmem);                                                  \
> +     ret; })
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
> -  (! __sync_bool_compare_and_swap (mem, oldval, newval))
> +  ({ volatile __typeof (*(mem)) *memp = (mem);                         \
> +     __typeof (*(mem)) oldmem = *memp;                                 \
> +     int ret;                                                          \
> +     ret = (oldmem == (oldval)                                         \
> +           ? !__sync_bool_compare_and_swap (mem, oldval, newval)       \
> +           : 1);                                                       \
> +     ret; })
>
>
>  #define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> --
> 2.33.1
>

Worth noting that on x86, any of the __atomic_fetch_* builtins aside from
add/sub are implemented with a CAS loop that may benefit from this:
https://godbolt.org/z/z87v9Kbcz
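
For example, a fetch-or whose old value is actually used has no single
x86 instruction, so GCC expands something like the following sketch
(function name made up) into a load plus a lock cmpxchg retry loop:

/* Sketch: with the return value used, GCC cannot emit a plain
   "lock or" and instead generates a load followed by a lock cmpxchg
   loop, which the same load-and-check idea could help under
   contention.  */
unsigned int
set_flags (unsigned int *word, unsigned int flags)
{
  return __atomic_fetch_or (word, flags, __ATOMIC_ACQUIRE);
}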
  
H.J. Lu Nov. 8, 2021, 3:45 p.m. UTC | #2
On Mon, Nov 8, 2021 at 7:32 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Worth noting that on x86, any of the __atomic_fetch_* builtins aside from
> add/sub are implemented with a CAS loop that may benefit from this:
> https://godbolt.org/z/z87v9Kbcz

This is:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069
  
Florian Weimer Nov. 8, 2021, 5:36 p.m. UTC | #3
* H. J. Lu:

> The fixed compiler should define __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK
> to indicate that the compiler will generate the check with the volatile load.
> Then glibc can check __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK to avoid the
> extra volatile load.

I don't see __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK in the patch, so
this usage is confusing.

It would be illuminating to know which code path concerns you the most.
Looking at the default mutex implementation, I think we eventually end up
here (in sysdeps/nptl/lowlevellock.h):

/* This is an expression rather than a statement even though its value is
   void, so that it can be used in a comma expression or as an expression
   that's cast to void.  */
/* The inner conditional compiles to a call to __lll_lock_wait_private if
   private is known at compile time to be LLL_PRIVATE, and to a call to
   __lll_lock_wait otherwise.  */
/* If FUTEX is 0 (not acquired), set to 1 (acquired with no waiters) and
   return.  Otherwise, ensure that it is >1 (acquired, possibly with waiters)
   and then block until we acquire the lock, at which point FUTEX will still be
   >1.  The lock is always acquired on return.  */
#define __lll_lock(futex, private)                                      \
  ((void)                                                               \
   ({                                                                   \
     int *__futex = (futex);                                            \
     if (__glibc_unlikely                                               \
         (atomic_compare_and_exchange_bool_acq (__futex, 1, 0)))        \
       {                                                                \
         if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
           __lll_lock_wait_private (__futex);                           \
         else                                                           \
           __lll_lock_wait (__futex, private);                          \
       }                                                                \
   }))
#define lll_lock(futex, private)	\
  __lll_lock (&(futex), private)

We already have an optimization that does not count the waiters (the futex
value is only ever 0, 1, or 2).  However, the code as written clearly
indicates to the compiler that CAS failure is rare (the __glibc_unlikely).

Inside __lll_lock_wait_private, we do another load:

void
__lll_lock_wait_private (int *futex)
{
  if (atomic_load_relaxed (futex) == 2)
    goto futex;

  while (atomic_exchange_acquire (futex, 2) != 0)
    {
    futex:
      LIBC_PROBE (lll_lock_wait_private, 1, futex);
      futex_wait ((unsigned int *) futex, 2, LLL_PRIVATE); /* Wait if *futex == 2.  */
    }
}

On x86 the unconditional exchange does not use CAS.  But there is still
that extra load.
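
Roughly, that step corresponds to a sketch like this (function name made
up), which GCC compiles to a single implicitly locked xchg rather than a
cmpxchg loop:

/* Sketch: set the futex to 2 (acquired, possibly with waiters) and
   return the previous value.  On x86 this is a single xchg, which is
   implicitly locked, so no CAS retry loop is involved.  */
static inline int
exchange_acquire_2 (int *futex)
{
  return __atomic_exchange_n (futex, 2, __ATOMIC_ACQUIRE);
}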

I think we should drop the __glibc_unlikely (because we don't know if
it's actually rare; in the contended case it clearly is not), move the
load into __lll_lock, check it before the CAS, and also pass down the
result to __lll_lock_wait* (to avoid the redundant load there).  It will
slightly bloat the inline __lll_lock implementation, but so does putting
a check into atomic_compare_and_exchange_bool_acq.
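
A rough sketch of that shape, assuming hypothetical __lll_lock_wait_val /
__lll_lock_wait_private_val variants that take the already-loaded value
(nothing like them exists in glibc today):

/* Hypothetical shape only; the *_val wait functions do not exist.  */
#define __lll_lock(futex, private)                                      \
  ((void)                                                               \
   ({                                                                   \
     int *__futex = (futex);                                            \
     /* One relaxed load, needing the line only in shared state.  */    \
     int __val = atomic_load_relaxed (__futex);                         \
     /* Attempt the locked CAS only when it can plausibly succeed.  No  \
        __glibc_unlikely: failure may be common under contention.  */   \
     if (__val != 0                                                     \
         || atomic_compare_and_exchange_bool_acq (__futex, 1, 0))       \
       {                                                                \
         /* Pass the observed value down to avoid reloading it.  */     \
         if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
           __lll_lock_wait_private_val (__futex, __val);                \
         else                                                           \
           __lll_lock_wait_val (__futex, __val, private);               \
       }                                                                \
   }))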

Thanks,
Florian
  

Patch

diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
index 2692d94a92..597dc1cf92 100644
--- a/sysdeps/x86/atomic-machine.h
+++ b/sysdeps/x86/atomic-machine.h
@@ -73,9 +73,20 @@  typedef uintmax_t uatomic_max_t;
 #define ATOMIC_EXCHANGE_USES_CAS	0
 
 #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
-  __sync_val_compare_and_swap (mem, oldval, newval)
+  ({ volatile __typeof (*(mem)) *memp = (mem);				\
+     __typeof (*(mem)) oldmem = *memp, ret;				\
+     ret = (oldmem == (oldval)						\
+	    ? __sync_val_compare_and_swap (mem, oldval, newval)		\
+	    : oldmem);							\
+     ret; })
 #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
-  (! __sync_bool_compare_and_swap (mem, oldval, newval))
+  ({ volatile __typeof (*(mem)) *memp = (mem);				\
+     __typeof (*(mem)) oldmem = *memp;					\
+     int ret;								\
+     ret = (oldmem == (oldval)						\
+	    ? !__sync_bool_compare_and_swap (mem, oldval, newval)	\
+	    : 1);							\
+     ret; })
 
 
 #define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \