x86: Optimize atomic_compare_and_exchange_[val|bool]_acq [BZ #28537]

Message ID 20211103150415.1211388-1-hjl.tools@gmail.com
State Superseded
Series x86: Optimize atomic_compare_and_exchange_[val|bool]_acq [BZ #28537]

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

H.J. Lu Nov. 3, 2021, 3:04 p.m. UTC
From the CPU's point of view, getting a cache line for writing is more
expensive than reading.  See Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare and swap grabs the cache line exclusive and causes
excessive cache line bouncing.  Check the current memory value first and
return immediately if it already differs from the expected value; skipping
the doomed write attempt reduces cache line bouncing on contended locks.

This fixes BZ #28537.
---
 sysdeps/x86/atomic-machine.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
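
The change amounts to the classic "test before test-and-set" idea applied
at the CAS level.  As a minimal standalone sketch of the pattern (with a
hypothetical function name, not the actual glibc macro):

/* check_then_cas is a hypothetical illustration of the pattern the patch
   applies inside the glibc macros.  Returns nonzero on failure, like
   atomic_compare_and_exchange_bool_acq.  */
static inline int
check_then_cas (int *mem, int oldval, int newval)
{
  /* Plain read: leaves the cache line in shared state.  */
  if (*mem != oldval)
    return 1;	/* The CAS would fail anyway; skip the write attempt.  */
  /* Only now take the line exclusive with the locked cmpxchg.  */
  return !__sync_bool_compare_and_swap (mem, oldval, newval);
}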
  

Comments

Andreas Schwab Nov. 3, 2021, 3:14 p.m. UTC | #1
On Nov 03 2021, H.J. Lu via Libc-alpha wrote:

> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..92c7cf58b7 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS	0
>  
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ __typeof (*(mem)) oldmem = *(mem), ret;				\
> +     ret = (oldmem == (oldval)						\
> +	    ? __sync_val_compare_and_swap (mem, oldval, newval)		\
> +	    : oldmem);							\
> +     ret; })

Can't this cause the CAS to fail spuriously if the compiler used
some stale cached value for *mem?

Andreas.
  
Oleh Derevenko Nov. 3, 2021, 3:50 p.m. UTC | #2
Hi, H.J. Lu

You may not perform plain reads on values you want to be atomic; this
results in undefined behavior.
For example, the compiler is NOT obliged to perform the read with a
single CPU instruction -- of course in practice it will not, but it is
allowed to read the value in two halves and compare them separately, or
to reuse a cached value from previous evaluations.
And that is only the compiler-level issue.  Similar issues arise at the
CPU level with all the kinds of memory coherency, caching and
instruction reordering.
Or, if the value crossed a cache line boundary, the plain read might
return a half-updated value, with the part from one cache line being new
and the other part being old.
Finally, I'm not following this thread and have very little knowledge of
the functions' purposes in this library, but the "_acq" suffix suggests
that the memory operation has to exhibit acquire semantics, i.e. it has
to fetch memory updates and make them visible to the thread.  But that
is exactly what you are trying to eliminate with your patch.

On Wed, Nov 3, 2021 at 5:05 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> From the CPU's point of view, getting a cache line for writing is more
> expensive than reading.  See Appendix A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> A full compare and swap grabs the cache line exclusive and causes
> excessive cache line bouncing.  Check the current memory value first and
> return immediately if it already differs from the expected value; skipping
> the doomed write attempt reduces cache line bouncing on contended locks.
>
> This fixes BZ# 28537.
> ---
>  sysdeps/x86/atomic-machine.h | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..92c7cf58b7 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS       0
>
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ __typeof (*(mem)) oldmem = *(mem), ret;                           \
> +     ret = (oldmem == (oldval)                                         \
> +           ? __sync_val_compare_and_swap (mem, oldval, newval)         \
> +           : oldmem);                                                  \
> +     ret; })
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
> -  (! __sync_bool_compare_and_swap (mem, oldval, newval))
> +  ({ __typeof (*(mem)) old = *(mem);                                   \
> +     int ret;                                                          \
> +     if (old != (oldval))                                              \
> +       ret = 1;                                                                \
> +     else                                                              \
> +       ret = !__sync_bool_compare_and_swap (mem, oldval, newval);      \
> +     ret; })
>
>
>  #define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> --
> 2.33.1
>
  
Florian Weimer Nov. 3, 2021, 4:35 p.m. UTC | #3
* H. J. Lu via Libc-alpha:

> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..92c7cf58b7 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS	0
>  
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ __typeof (*(mem)) oldmem = *(mem), ret;				\
> +     ret = (oldmem == (oldval)						\
> +	    ? __sync_val_compare_and_swap (mem, oldval, newval)		\
> +	    : oldmem);							\
> +     ret; })
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
> -  (! __sync_bool_compare_and_swap (mem, oldval, newval))
> +  ({ __typeof (*(mem)) old = *(mem);					\
> +     int ret;								\
> +     if (old != (oldval))						\
> +       ret = 1;								\
> +     else								\
> +       ret = !__sync_bool_compare_and_swap (mem, oldval, newval);	\
> +     ret; })

Shouldn't GCC be fixed to generate the appropriate instruction sequence
for this architecture?  Should we perhaps switch to compiler atomics for
x86 instead of defining our own?

Thanks,
Florian
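
For reference, a sketch of what the macro could look like on top of the
compiler atomics (an untested illustration, assuming a compiler that
provides __atomic_compare_exchange_n):

#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  ({ __typeof (*(mem)) __expected = (oldval);				\
     /* On failure, __expected is updated to the value actually seen,	\
	so returning it matches __sync_val_compare_and_swap.  */	\
     __atomic_compare_exchange_n ((mem), &__expected, (newval), 0,	\
				  __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);	\
     __expected; })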
  
Arjan van de Ven Nov. 3, 2021, 4:59 p.m. UTC | #4
On 11/3/2021 8:50 AM, Oleh Derevenko wrote:
> Hi, H.J. Lu
> 
> You may not perform plain reads on values you want to be atomic. This
> results in undefined behavior.

so the way the patch works is that it does not DEPEND on that read to be atomic.

What the patch does is check non-atomically first whether the actual atomic
operation has a chance of working.  If it has a chance, the actual normal
atomic operation is done as before.  But if the non-atomic read already tells
you the cmpxchg has no chance to succeed, it errors out early.

The big gain is for the contended lock case (_acq suffix!).  Say there are 4
threads spinning on a lock.  Before this patch these 4 CPU cores would take
turns bouncing the cacheline around super aggressively, which degrades the
whole system and, worse, also makes the core that will eventually unlock the
lock wait for the cacheline.

Now with the patch, the "it is locked already" case is noticed before the
cacheline gets taken exclusive, so all 4 spinning cores keep the cacheline in
a shared state -- no pingponging.  The core that is going to unlock the lock
can then acquire the cacheline exclusively without having to fight those 4
cores for it.
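
In spinlock terms this is the classic test-and-test-and-set idea.  A rough
standalone sketch of that pattern (illustrative only, not the glibc code):

/* Illustrative test-and-test-and-set spin lock.  Waiters spin on a plain
   (volatile) read, so the cacheline stays in shared state among them;
   only the CAS attempt takes it exclusive.  */
static void
spin_lock (volatile int *lock)
{
  while (__sync_val_compare_and_swap (lock, 0, 1) != 0)
    while (*lock != 0)
      __builtin_ia32_pause ();	/* spin-wait hint to the CPU */
}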


> For example, the compiler IS NOT obliged to perform the read with a
> single CPU instruction -- of course it will not, but it is allowed to
> read it in two halves and compare them separately. Or it may reuse
> cached value from previous evaluations.

> This is only the compiler level issue. Similar issues will arise at
> CPU level with all the kind of memory coherency, caching and
> instruction reordering.

The CPU in this case won't; the x86 memory model won't allow that
(and this is in the x86 implementation code).

> Or if the value would cross a cache line boundary the plain read might
> return half-updated value with the part from one cache line being new
> and the other part being old.

(I can't say in polite company what cmpxchg across cache lines does)
  
Andreas Schwab Nov. 3, 2021, 5:17 p.m. UTC | #5
On Nov 03 2021, Arjan van de Ven via Libc-alpha wrote:

> What the patch does is check non-atomic first if the actual atomic operation has
> a chance of working. if it has a chance, the actual normal atomic operation is done as
> before. But if non-atomic read already tells you the cmpxchg has no chance to succeed, it errors
> out early.

But if the compiler keeps reusing the same cached value it may conclude
that the CAS _never_ has a chance to succeed.

Andreas.
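
Concretely: if the whole retry loop is visible to the compiler and contains
no barrier, nothing forces *mem to be re-read.  A hypothetical illustration
of the transformation (not real glibc code):

/* What the programmer wrote: retry until the CAS succeeds.  */
while (atomic_compare_and_exchange_bool_acq (mem, 1, 0))
  ;

/* What the compiler may legally turn it into with the proposed macro:
   the value is read once, and the CAS is never even attempted.  */
__typeof (*mem) cached = *mem;
while (cached != 0)
  ;	/* spins forever, even after another thread stores 0 to *mem */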
  
Noah Goldstein Nov. 3, 2021, 5:25 p.m. UTC | #6
On Wed, Nov 3, 2021 at 10:05 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> From the CPU's point of view, getting a cache line for writing is more
> expensive than reading.  See Appendix A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> A full compare and swap grabs the cache line exclusive and causes
> excessive cache line bouncing.  Check the current memory value first and
> return immediately if it already differs from the expected value; skipping
> the doomed write attempt reduces cache line bouncing on contended locks.

How does this affect the no-contention case? AFAIK that is the common case
in most systems. This seems like something a user can do if their application
can benefit from it and not something for everyone.

>
> This fixes BZ# 28537.
> ---
>  sysdeps/x86/atomic-machine.h | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> index 2692d94a92..92c7cf58b7 100644
> --- a/sysdeps/x86/atomic-machine.h
> +++ b/sysdeps/x86/atomic-machine.h
> @@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
>  #define ATOMIC_EXCHANGE_USES_CAS       0
>
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> -  __sync_val_compare_and_swap (mem, oldval, newval)
> +  ({ __typeof (*(mem)) oldmem = *(mem), ret;                           \
> +     ret = (oldmem == (oldval)                                         \
> +           ? __sync_val_compare_and_swap (mem, oldval, newval)         \
> +           : oldmem);                                                  \
> +     ret; })
>  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
> -  (! __sync_bool_compare_and_swap (mem, oldval, newval))
> +  ({ __typeof (*(mem)) old = *(mem);                                   \
> +     int ret;                                                          \
> +     if (old != (oldval))                                              \
> +       ret = 1;                                                                \
> +     else                                                              \
> +       ret = !__sync_bool_compare_and_swap (mem, oldval, newval);      \
> +     ret; })
>
>
>  #define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \
> --
> 2.33.1
>
  
Oleh Derevenko Nov. 3, 2021, 5:26 p.m. UTC | #7
Arjan

> What the patch does is check non-atomic first if the actual atomic operation has
> a chance of working. if it has a chance, the actual normal atomic operation is
> done as before. But if non-atomic read already tells you the cmpxchg has no
> chance to succeed, it errors out early.

The idea of atomic functions is that they are intended to work fairly
with any type of memory.  In your case, the speculative read of cached
device memory may hit the cache only and prevent fetching memory updates
from the device, thus making the reading thread "see" the change later
than it could.
If you want to make a "RAM-specific" version of compare-and-exchange,
give it a distinct specific name.

On Wed, Nov 3, 2021 at 7:00 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> On 11/3/2021 8:50 AM, Oleh Derevenko wrote:
> > Hi, H.J. Lu
> >
> > You may not perform plain reads on values you want to be atomic. This
> > results in undefined behavior.
>
> so the way the patch works is that it does not DEPEND on that read to be atomic.
>
> What the patch does is check non-atomic first if the actual atomic operation has
> a chance of working. if it has a chance, the actual normal atomic operation is done as
> before. But if non-atomic read already tells you the cmpxchg has no chance to succeed, it errors
> out early.
>
> The big gain is for the contended lock case (_acq suffix!). If there's, say, 4 threads spinning
> on a lock. Before this patch these 4 cpu cores would be taking turns bouncing the cacheline around
> super aggressively.. which causes system degradation and worse, also makes the core that will
> eventually unlock the lock wait for the cacheline.
>
> Now with the patch, the "it is locked already" is noticed before the cacheline gets taken exclusive,
> so all 4 spinning cores have the same cacheline in a shared state -- no pingponging.
> Now the core that is going to unlock the lock can now do the exclusive acquire of the cacheline
> without having to fight with those 4 cores in the exclusive acquire fight.
>
>
> > For example, the compiler IS NOT obliged to perform the read with a
> > single CPU instruction -- of course it will not, but it is allowed to
> > read it in two halves and compare them separately. Or it may reuse
> > cached value from previous evaluations.
>
> > This is only the compiler level issue. Similar issues will arise at
> > CPU level with all the kind of memory coherency, caching and
> > instruction reordering.
>
> the cpu in this case won't, the x86 memory model won't allow that
> (and this is in the x86 implementation code)
>
> > Or if the value would cross a cache line boundary the plain read might
> > return half-updated value with the part from one cache line being new
> > and the other part being old.
>
> (I can't say in polite company what cmpxchg across cache lines does)
  
Arjan van de Ven Nov. 3, 2021, 5:30 p.m. UTC | #8
On 11/3/2021 10:26 AM, Oleh Derevenko wrote:
> Arjan
> 
>> What the patch does is check non-atomic first if the actual atomic operation has
>> a chance of working. if it has a chance, the actual normal atomic operation is
>> done as before. But if non-atomic read already tells you the cmpxchg has no
>> chance to succeed, it errors out early.
> 
> The idea of atomic function is that they are intended to work fairly
> with any type of memory. In your case, the speculative reads for a
> cached device memory may result in cache access only and will prevent
> fetching memory updates from the device, thus making the reading
> thread "see" the change later than it could.


Eh, I am not sure I understand what you are saying, since cmpxchg uses the
exact same cache protocol etc. to do its read...  It won't go to device
memory either if the cache line is anywhere in the cache hierarchy
(including core-to-core transfers in case another core has it in its caches).


(and cmpxchg on MMIO space has very very interesting and unexpected behavior. If folks remember
the "linux torches your e1000 eeprom" bug from some years ago, it came from that)
  
Oleh Derevenko Nov. 3, 2021, 5:55 p.m. UTC | #9
Arjan,

> eh I am not sure I understand what you say since cmpxchg uses the exact same
> cache protocol/etc to do its read...
Well, if you are sure... I did not have that information.

One last question, if you permit.
What, in your opinion, are the reasons they did it this way in the
hardware implementation? This thing, I mean:
> A full compare and swap grabs the cache line exclusive and causes excessive cache line bouncing.
Why was this optimization not implemented in the CPU in the first place?

Well, I guess I can answer it myself: because in the success case these
extra checks would add to execution time.  And cmpxchg can be used for
many purposes — not only for polling from four threads in parallel.


On Wed, Nov 3, 2021 at 7:30 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> On 11/3/2021 10:26 AM, Oleh Derevenko wrote:
> > Arjan
> >
> >> What the patch does is check non-atomic first if the actual atomic operation has
> > a chance of working. if it has a chance, the actual normal atomic
> > operation is done as
> > before. But if non-atomic read already tells you the cmpxchg has no
> > chance to succeed, it errors
> > out early.
> >
> > The idea of atomic function is that they are intended to work fairly
> > with any type of memory. In your case, the speculative reads for a
> > cached device memory may result in cache access only and will prevent
> > fetching memory updates from the device, thus making the reading
> > thread "see" the change later than it could.
>
>
> eh I am not sure I understand what you say since cmpxchg uses the exact same
> cache protocol/etc to do its read...  it won't go to device memory either if
> the cache line is anywhere in the cache hierarchy (including core-to-core transfers
> in case another core has it in their caches)
>
>
> (and cmpxchg on MMIO space has very very interesting and unexpected behavior. If folks remember
> the "linux torches your e1000 eeprom" bug from some years ago, it came from that)
  
H.J. Lu Nov. 3, 2021, 7:13 p.m. UTC | #10
On Wed, Nov 3, 2021 at 9:36 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu via Libc-alpha:
>
> > diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
> > index 2692d94a92..92c7cf58b7 100644
> > --- a/sysdeps/x86/atomic-machine.h
> > +++ b/sysdeps/x86/atomic-machine.h
> > @@ -73,9 +73,19 @@ typedef uintmax_t uatomic_max_t;
> >  #define ATOMIC_EXCHANGE_USES_CAS     0
> >
> >  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> > -  __sync_val_compare_and_swap (mem, oldval, newval)
> > +  ({ __typeof (*(mem)) oldmem = *(mem), ret;                         \
> > +     ret = (oldmem == (oldval)                                               \
> > +         ? __sync_val_compare_and_swap (mem, oldval, newval)         \
> > +         : oldmem);                                                  \
> > +     ret; })
> >  #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
> > -  (! __sync_bool_compare_and_swap (mem, oldval, newval))
> > +  ({ __typeof (*(mem)) old = *(mem);                                 \
> > +     int ret;                                                                \
> > +     if (old != (oldval))                                            \
> > +       ret = 1;                                                              \
> > +     else                                                            \
> > +       ret = !__sync_bool_compare_and_swap (mem, oldval, newval);    \
> > +     ret; })
>
> Shouldn't GCC be fixed to generate the appropriate instruction sequence
> for this architecture?  Should we perhaps switch to compiler atomics for
> x86 instead of defining our own?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065

In the meantime, we should help older compilers.
  
Arjan van de Ven Nov. 3, 2021, 7:21 p.m. UTC | #11
On 11/3/2021 10:17 AM, Andreas Schwab wrote:
> On Nov 03 2021, Arjan van de Ven via Libc-alpha wrote:
> 
>> What the patch does is check non-atomic first if the actual atomic operation has
>> a chance of working. if it has a chance, the actual normal atomic operation is done as
>> before. But if non-atomic read already tells you the cmpxchg has no chance to succeed, it errors
>> out early.
> 
> But if the compiler keeps reusing the same cached value it may conclude
> that the CAS _never_ has a chance to succeed.

That's a good point.  Now there SHOULD be a (compiler) barrier around this
somewhere already, due to the pause instruction inserted for such a loop.
But putting an explicit compiler barrier in makes it safer for sure.
  
Arjan van de Ven Nov. 3, 2021, 7:22 p.m. UTC | #12
On 11/3/2021 10:55 AM, Oleh Derevenko wrote:
> Arjan,
> 
>> eh I am not sure I understand what you say since cmpxchg uses the exact same
>> cache protocol/etc to do its read...
> Well, if you are sure... I did not have that information.
> 
> The last question if you permit.
> What, as to your opinion, are the reasons they did it in the hardware
> implementation? This thing, I mean:
>> The full compare and swap will grab the cache line exclusive and cause excessive cache line bouncing.
> Why was not this optimization initially implemented in the CPU?


An instruction that needs the same cacheline for read and maybe-exclusive
will get it exclusive; in CPU microarchitecture there usually isn't a fast
way (and we all clearly want locks to be fast) to do a two-phase cache
protocol on the same line.

> 
> Well, I guess I can explain it myself. Because in the success cases
> these extra checks would add to execution time. And cmpxchg can be
> used for many purposes — not only for polling from four threads in
> parallel.
  
H.J. Lu Nov. 3, 2021, 7:48 p.m. UTC | #13
On Wed, Nov 3, 2021 at 12:21 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> On 11/3/2021 10:17 AM, Andreas Schwab wrote:
> > On Nov 03 2021, Arjan van de Ven via Libc-alpha wrote:
> >
> >> What the patch does is check non-atomic first if the actual atomic operation has
> >> a chance of working. if it has a chance, the actual normal atomic operation is done as
> >> before. But if non-atomic read already tells you the cmpxchg has no chance to succeed, it errors
> >> out early.
> >
> > But if the compiler keeps reusing the same cached value it may conclude
> > that the CAS _never_ has a chance to succeed.
>
> that's a good point. Now there SHOULD be a (compiler) barrier around this somewhere due
> to inserting the pause instruction for such a loop. But putting some explicit compiler barrier in this
> makes it for sure safer.

Like this?

#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  ({ __typeof (*(mem)) oldmem = *(mem), ret;                            \
     atomic_read_barrier ();                                            \
     ret = (oldmem == (oldval)                                          \
            ? __sync_val_compare_and_swap (mem, oldval, newval)         \
            : oldmem);                                                  \
     ret; })
#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
  ({ __typeof (*(mem)) oldmem = *(mem);                                 \
     atomic_read_barrier ();                                            \
     int ret;                                                           \
     ret = (oldmem == (oldval)                                          \
            ? !__sync_bool_compare_and_swap (mem, oldval, newval)       \
            : 1);                                                       \
     ret; })
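
For what it's worth, on x86 atomic_read_barrier should amount to a pure
compiler barrier, since plain x86 loads already have acquire semantics.
If I recall correctly, sysdeps/x86/atomic-machine.h defines it as:

/* Compiler-only barrier: stops the compiler from caching or reordering
   the load around it; emits no instruction.  */
#define atomic_read_barrier() __asm ("" ::: "memory")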
  
Oleh Derevenko Nov. 3, 2021, 8:38 p.m. UTC | #14
To force a compiler to re-read the value it is sufficient to take a
pointer to the storage, cast it to volatile and read via that
pointer-to-volatile.

On Wed, Nov 3, 2021 at 8:00 PM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Nov 03 2021, Arjan van de Ven via Libc-alpha wrote:
>
> > What the patch does is check non-atomic first if the actual atomic operation has
> > a chance of working. if it has a chance, the actual normal atomic operation is done as
> > before. But if non-atomic read already tells you the cmpxchg has no chance to succeed, it errors
> > out early.
>
> But if the compiler keeps reusing the same cached value it may conclude
> that the CAS _never_ has a chance to succeed.
>
> Andreas.
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."
  
H.J. Lu Nov. 3, 2021, 10:12 p.m. UTC | #15
On Wed, Nov 3, 2021 at 1:38 PM Oleh Derevenko <oleh.derevenko@gmail.com> wrote:
>
> To force a compiler to re-read the value it is sufficient to take a
> pointer to the storage, cast it to volatile and read via that
> pointer-to-volatile.

Like this?

#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  ({ volatile __typeof (*(mem)) *memp = (mem);                          \
     __typeof (*(mem)) oldmem = *memp, ret;                             \
     ret = (oldmem == (oldval)                                          \
            ? __sync_val_compare_and_swap (mem, oldval, newval)         \
            : oldmem);                                                  \
     ret; })
#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
  ({ volatile __typeof (*(mem)) *memp = (mem);                          \
     __typeof (*(mem)) oldmem = *memp;                                  \
     int ret;                                                           \
     ret = (oldmem == (oldval)                                          \
            ? !__sync_bool_compare_and_swap (mem, oldval, newval)       \
            : 1);                                                       \
     ret; })

> On Wed, Nov 3, 2021 at 8:00 PM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >
> > On Nov 03 2021, Arjan van de Ven via Libc-alpha wrote:
> >
> > > What the patch does is check non-atomic first if the actual atomic operation has
> > > a chance of working. if it has a chance, the actual normal atomic operation is done as
> > > before. But if non-atomic read already tells you the cmpxchg has no chance to succeed, it errors
> > > out early.
> >
> > But if the compiler keeps reusing the same cached value it may conclude
> > that the CAS _never_ has a chance to succeed.
> >
> > Andreas.
> >
> > --
> > Andreas Schwab, schwab@linux-m68k.org
> > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> > "And now for something completely different."
>
>
>
> --
>
> Oleh Derevenko
>
> -- Skype with underscore
  
Oleh Derevenko Nov. 4, 2021, 8:58 a.m. UTC | #16
Hi,

On Thu, Nov 4, 2021 at 12:12 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Nov 3, 2021 at 1:38 PM Oleh Derevenko <oleh.derevenko@gmail.com> wrote:
> >
> > To force a compiler to re-read the value it is sufficient to take a
> > pointer to the storage, cast it to volatile and read via that
> > pointer-to-volatile.
>
> Like this?
>
> #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
>   ({ volatile __typeof (*(mem)) *memp = (mem);                          \
>      __typeof (*(mem)) oldmem = *memp, ret;                             \
>      ret = (oldmem == (oldval)                                          \
>             ? __sync_val_compare_and_swap (mem, oldval, newval)         \
>             : oldmem);                                                  \
>      ret; })
> #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
>   ({ volatile __typeof (*(mem)) *memp = (mem);                          \
>      __typeof (*(mem)) oldmem = *memp;                                  \
>      int ret;                                                           \
>      ret = (oldmem == (oldval)                                          \
>             ? !__sync_bool_compare_and_swap (mem, oldval, newval)       \
>             : 1);                                                       \
>      ret; })
>

Yes.  The only problem is that I'm not sure whether that extra volatile
might cause a compilation warning/error if the type already had the
volatile modifier in it.

Also, another problem: blindly writing
 __typeof (*(mem)) oldmem = ...
could declare the "oldmem" variable const and/or volatile (depending on
the actual argument to the macro) and produce either an assignment to
const (a compile error) or an assignment to volatile (an unnecessary
forced store to memory with a following reload from there).
  
Oleh Derevenko Nov. 4, 2021, 9:44 a.m. UTC | #17
Hi

On Thu, Nov 4, 2021 at 10:58 AM Oleh Derevenko <oleh.derevenko@gmail.com> wrote:
> Also, another problem, blindly taking
>  __typeof (*(mem)) oldmem = ...
> could declare the "oldmem" variable with const and/or volatile
> (depending on the actual parameter to the macro) and make either an
> assignment to const (a compile error) or an assignment to volatile
> (unnecessary forced store into memory with following retrieval from
> there).

Well, const cannot appear in the target, of course. But volatile can.
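
A small standalone example of the qualifier propagation in question
(illustration only):

volatile int v;
__typeof (v) a;      /* a is volatile int: every access really goes to memory */
__typeof (v + 0) b;  /* arithmetic strips qualifiers (and promotes): plain int */
const int c = 0;
__typeof (c) d = 1;  /* d is const int: a later "d = 2;" would not compile */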
  
Florian Weimer Nov. 4, 2021, 10:15 a.m. UTC | #18
* H. J. Lu:

>> Shouldn't GCC be fixed to generate the appropriate instruction sequence
>> for this architecture?  Should we perhaps switch to compiler atomics for
>> x86 instead of defining our own?
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
>
> In the meantime, we should help older compilers.

But wouldn't that extra read persist with fixed compilers, due to the
memory barrier?  So that there are two reads?

Maybe we need a hint flag for __atomic_compare_exchange_n, encoded in
the memory order?

Thanks,
Florian
  
Oleh Derevenko Nov. 4, 2021, 11:42 a.m. UTC | #19
Hi Arjan,

On Wed, Nov 3, 2021 at 7:00 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
> > Or if the value would cross a cache line boundary the plain read might
> > return half-updated value with the part from one cache line being new
> > and the other part being old.
>
> (I can't say in polite company what cmpxchg across cache lines does)

Well, the "IA-32 Intel® Architecture Software Developer’s Manual,
Volume 3", section 7.1.2.1, doesn't agree with you:

The integrity of a bus lock is not affected by the alignment of the
memory field.  The LOCK semantics are followed for as many bus cycles
as necessary to update the entire operand.  However, it is recommended
that locked accesses be aligned on their natural boundaries for better
system performance:
  
Arjan van de Ven Nov. 4, 2021, 2:15 p.m. UTC | #20
On 11/4/2021 4:42 AM, Oleh Derevenko wrote:
> Hi Arjan,
> 
> On Wed, Nov 3, 2021 at 7:00 PM Arjan van de Ven <arjan@linux.intel.com> wrote:
>>> Or if the value would cross a cache line boundary the plain read might
>>> return half-updated value with the part from one cache line being new
>>> and the other part being old.
>>
>> (I can't say in polite company what cmpxchg across cache lines does)
> 
> Well, the  "IA-32 Intel® Architecture Software Developer’s Manual,
> Volume 3" in section 7.1.2.1 doesn't agree with you
> 
> The integrity of a bus lock is not affected by the alignment of the
> memory field. The LOCK
> semantics are followed for as many bus cycles as necessary to update
> the entire operand.
> However, it is recommend that locked accesses be aligned on their
> natural boundaries for better
> system performance:


Yes, you can do buslocks.  They are functionally correct unless you turn on
the BIOS/OS option to ban them (at which point they cause a fault).
What also happens is that the WHOLE SYSTEM (all cores on all sockets) comes
to a screeching halt if you have a pile of buslocks happening to you
(hence the ability to disable them and make them fault).

On a contended lock that is a huge issue, since at that point the core that
holds the lock and wants to unlock it is also put in that very slow queue of
who gets the cacheline.

>
  
H.J. Lu Nov. 4, 2021, 2:31 p.m. UTC | #21
On Thu, Nov 4, 2021 at 3:15 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> >> Shouldn't GCC be fixed to generate the appropriate instruction sequence
> >> for this architecture?  Should we perhaps switch to compiler atomics for
> >> x86 instead of defining our own?
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
> >
> > In the meantime, we should help older compilers.
>
> But wouldn't that extra read persist with fixed compilers, due to the
> memory barrier?  So that there are two reads?

Yes, this could be a problem.

> Maybe we need a hint flag for __atomic_compare_exchange_n, encoded in
> the memory order?

We need a way to identify the fixed compiler.  We can add a predefined macro,
something like __GCC_HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK.
  
H.J. Lu Nov. 4, 2021, 2:59 p.m. UTC | #22
On Thu, Nov 4, 2021 at 7:31 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Nov 4, 2021 at 3:15 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * H. J. Lu:
> >
> > >> Shouldn't GCC be fixed to generate the appropriate instruction sequence
> > >> for this architecture?  Should we perhaps switch to compiler atomics for
> > >> x86 instead of defining our own?
> > >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103065
> > >
> > > In the meantime, we should help older compilers.
> >
> > But wouldn't that extra read persist with fixed compilers, due to the
> > memory barrier?  So that there are two reads?
>
> Yes, this could be a problem.
>
> > Maybe we need a hint flag for __atomic_compare_exchange_n, encoded in
> > the memory order?
>
> We need a way to identify the fixed compiler.  We can add a predefined macro,
> something like __GCC_HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK.
>

Better to define __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK instead, so
that other compilers can provide it too.
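
A sketch of how such a predefined macro could gate the workaround (the
macro name as proposed above; the usage below is hypothetical):

/* Hypothetical: do the manual load-and-check only when the compiler does
   not already emit one for __sync_val_compare_and_swap.  */
#ifdef __HAVE_SYNC_COMPARE_AND_SWAP_LOAD_CHECK
# define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  __sync_val_compare_and_swap (mem, oldval, newval)
#else
# define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  ({ volatile __typeof (*(mem)) *__memp = (mem);			\
     __typeof (*(mem)) __oldmem = *__memp;				\
     __oldmem == (oldval)						\
       ? __sync_val_compare_and_swap (mem, oldval, newval)		\
       : __oldmem; })
#endif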
  

Patch

diff --git a/sysdeps/x86/atomic-machine.h b/sysdeps/x86/atomic-machine.h
index 2692d94a92..92c7cf58b7 100644
--- a/sysdeps/x86/atomic-machine.h
+++ b/sysdeps/x86/atomic-machine.h
@@ -73,9 +73,19 @@  typedef uintmax_t uatomic_max_t;
 #define ATOMIC_EXCHANGE_USES_CAS	0
 
 #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
-  __sync_val_compare_and_swap (mem, oldval, newval)
+  ({ __typeof (*(mem)) oldmem = *(mem), ret;				\
+     ret = (oldmem == (oldval)						\
+	    ? __sync_val_compare_and_swap (mem, oldval, newval)		\
+	    : oldmem);							\
+     ret; })
 #define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
-  (! __sync_bool_compare_and_swap (mem, oldval, newval))
+  ({ __typeof (*(mem)) old = *(mem);					\
+     int ret;								\
+     if (old != (oldval))						\
+       ret = 1;								\
+     else								\
+       ret = !__sync_bool_compare_and_swap (mem, oldval, newval);	\
+     ret; })
 
 
 #define __arch_c_compare_and_exchange_val_8_acq(mem, newval, oldval) \