Fix atomic_full_barrier on x86 and x86_64.

Message ID 1414606736.10085.1.camel@triegel.csb
State Superseded

Commit Message

Torvald Riegel Oct. 29, 2014, 6:18 p.m. UTC
  Currently, x86 and x86_64 do not define any of the barriers.  The
generic default for the barriers is that full barriers are just compiler
barriers and read and write barriers default to using the full barrier
definition.  On x86 and x86_64, however, only read/write barriers (i.e.,
acquire and release fences in how glibc uses them) are just compiler
barriers -- a full barrier needs an mfence instruction.  This patch
defines atomic_full_barrier accordingly without making read/write
barriers stronger than compiler barriers.

The only use of atomic_full_barrier on x86/x86_64 is in
sysdeps/nptl/fork.c, but there, ISTM that this actually only needs an
acquire barrier.  So, this patch doesn't fix existing bugs in *current*
code, and only makes fork a tiny tiny bit slower.

Nonetheless, there are other uses of full barriers in generic code
(e.g., semaphores).  Those don't apply to x86/x86_64 because we
currently have arch-specific code for those, but they would if we start
using the generic code on x86 too (which we couldn't do until we fix
the x86 full barrier).

        * sysdeps/x86_64/bits/atomic.h (atomic_full_barrier,
        atomic_read_barrier, atomic_write_barrier): Define.
        * sysdeps/i386/i486/bits/atomic.h (atomic_full_barrier,
        atomic_read_barrier, atomic_write_barrier): Define.
  

Comments

Torvald Riegel Oct. 29, 2014, 7:02 p.m. UTC | #1
On Wed, 2014-10-29 at 19:18 +0100, Torvald Riegel wrote:
> Currently, x86 and x86_64 do not define any of the barriers.  The
> generic default for the barriers is that full barriers are just compiler
> barriers and read and write barriers default to using the full barrier
> definition.  On x86 and x86_64, however, only read/write barriers (ie,
> acquire and release fences in how glibc uses them) are just compiler
> barriers -- a full barrier needs an mfence instruction.  This patch
> defines atomic_full_barrier accordingly without making read/write
> barriers stronger than compiler barriers.
> 
> The only use of atomic_full_barrier on x86/x86_64 is in
> sysdeps/nptl/fork.c, but there, ISTM that this actually only needs an
> acquire barrier.  So, this patch doesn't fix existing bugs in *current*
> code, and only makes fork a tiny tiny bit slower.
> 
> Nonetheless, there are other uses of full barriers in generic code
> (e.g., semaphores).  Those don't apply for x86/x86_64 because we
> currently have arch-specific code for those, but would if we start using
> the generic code on x86 too (and we couldn't do until we fix the x86
> full barrier).
> 
>         * sysdeps/x86_64/bits/atomic.h: (atomic_full_barrier,
>         atomic_read_barrier, atomic_write_barrier): Define.
>         * sysdeps/i386/i486/bits/atomic.h (atomic_full_barrier,
>         atomic_read_barrier, atomic_write_barrier): Define.
> 
> 

Forgot to add the BZ to the Changelog:

	[BZ #17403]
	* sysdeps/x86_64/bits/atomic.h (atomic_full_barrier,
	atomic_read_barrier, atomic_write_barrier): Define.
	* sysdeps/i386/i486/bits/atomic.h (atomic_full_barrier,
	atomic_read_barrier, atomic_write_barrier): Define.
  
Rich Felker Oct. 29, 2014, 8:31 p.m. UTC | #2
On Wed, Oct 29, 2014 at 07:18:56PM +0100, Torvald Riegel wrote:
> Currently, x86 and x86_64 do not define any of the barriers.  The
> generic default for the barriers is that full barriers are just compiler
> barriers and read and write barriers default to using the full barrier
> definition.  On x86 and x86_64, however, only read/write barriers (ie,
> acquire and release fences in how glibc uses them) are just compiler
> barriers -- a full barrier needs an mfence instruction.  This patch
> defines atomic_full_barrier accordingly without making read/write
> barriers stronger than compiler barriers.

Can you explain why you claim these are needed? The architecture
inherently has strong memory ordering that prevents all reorderings
except for moving loads before stores when the addresses differ.  Are
you concerned about creating a barrier for the loosely-ordered
non-temporal ("NT") SSE operations, or something else?

> The only use of atomic_full_barrier on x86/x86_64 is in
> sysdeps/nptl/fork.c, but there, ISTM that this actually only needs an
> acquire barrier.  So, this patch doesn't fix existing bugs in *current*
> code, and only makes fork a tiny tiny bit slower.
> 
> Nonetheless, there are other uses of full barriers in generic code
> (e.g., semaphores).  Those don't apply for x86/x86_64 because we
> currently have arch-specific code for those, but would if we start using
> the generic code on x86 too (and we couldn't do until we fix the x86
> full barrier).

Shouldn't the atomics for these already provide sufficient barriers?

> commit be14e33b607609462130a0264c7a9460596da3f8
> Author: Torvald Riegel <triegel@redhat.com>
> Date:   Wed Oct 29 10:34:36 2014 +0100
> 
>     Fix atomic_full_barrier on x86 and x86_64.
>     
>     	* sysdeps/x86_64/bits/atomic.h: (atomic_full_barrier,
>     	atomic_read_barrier, atomic_write_barrier): Define.
>     	* sysdeps/i386/i486/bits/atomic.h (atomic_full_barrier,
>     	atomic_read_barrier, atomic_write_barrier): Define.
> 
> diff --git a/sysdeps/i386/i486/bits/atomic.h b/sysdeps/i386/i486/bits/atomic.h
> index 76e0e8e..7c432f9 100644
> --- a/sysdeps/i386/i486/bits/atomic.h
> +++ b/sysdeps/i386/i486/bits/atomic.h
> @@ -532,3 +532,7 @@ typedef uintmax_t uatomic_max_t;
>  #define atomic_or(mem, mask) __arch_or_body (LOCK_PREFIX, mem, mask)
>  
>  #define catomic_or(mem, mask) __arch_or_body (__arch_cprefix, mem, mask)
> +
> +#define atomic_full_barrier() __asm ("mfence" ::: "memory")
> +#define atomic_read_barrier() __asm ("" ::: "memory")
> +#define atomic_write_barrier() __asm ("" ::: "memory")

Does glibc now require i686+sse? AFAIK mfence is SIGILL on all
previous x86 generations. An alternate portable (to earlier chips) but
slower approach is a useless xchg. It also _might_ work to perform a
store/load pair on the same address (I'm not clear on whether this
method is valid).

Rich
  
Torvald Riegel Oct. 29, 2014, 8:55 p.m. UTC | #3
On Wed, 2014-10-29 at 16:31 -0400, Rich Felker wrote:
> On Wed, Oct 29, 2014 at 07:18:56PM +0100, Torvald Riegel wrote:
> > Currently, x86 and x86_64 do not define any of the barriers.  The
> > generic default for the barriers is that full barriers are just compiler
> > barriers and read and write barriers default to using the full barrier
> > definition.  On x86 and x86_64, however, only read/write barriers (ie,
> > acquire and release fences in how glibc uses them) are just compiler
> > barriers -- a full barrier needs an mfence instruction.  This patch
> > defines atomic_full_barrier accordingly without making read/write
> > barriers stronger than compiler barriers.
> 
> Can you explain why you claim these are needed?

Because TSO is not sequential consistency.  Consider Dekker
synchronization -- you need a full membar there to prevent store/load
reordering.  Also see
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

> The architecture
> inherently has strong memory ordering that prevents all reorderings
> except for moving loads before stores when the addresses differ.

And a full barrier (in the sense of memory_order_seq_cst) prevents that.

> Are
> you concerned about creating a barrier for the loosely-ordered "nt"
> sse operations or something? Or something else?

No.  The focus is on barriers/fences used in combination with
memory_order_relaxed atomics.

> > The only use of atomic_full_barrier on x86/x86_64 is in
> > sysdeps/nptl/fork.c, but there, ISTM that this actually only needs an
> > acquire barrier.  So, this patch doesn't fix existing bugs in *current*
> > code, and only makes fork a tiny tiny bit slower.
> > 
> > Nonetheless, there are other uses of full barriers in generic code
> > (e.g., semaphores).  Those don't apply for x86/x86_64 because we
> > currently have arch-specific code for those, but would if we start using
> > the generic code on x86 too (and we couldn't do until we fix the x86
> > full barrier).
> 
> Shouldn't the atomics for these already provide sufficient barriers?

Synchronization code similar to Dekker synchronization is more
efficiently expressed with seq_cst fences and relaxed loads/stores than
with seq_cst loads/stores.

> > commit be14e33b607609462130a0264c7a9460596da3f8
> > Author: Torvald Riegel <triegel@redhat.com>
> > Date:   Wed Oct 29 10:34:36 2014 +0100
> > 
> >     Fix atomic_full_barrier on x86 and x86_64.
> >     
> >     	* sysdeps/x86_64/bits/atomic.h: (atomic_full_barrier,
> >     	atomic_read_barrier, atomic_write_barrier): Define.
> >     	* sysdeps/i386/i486/bits/atomic.h (atomic_full_barrier,
> >     	atomic_read_barrier, atomic_write_barrier): Define.
> > 
> > diff --git a/sysdeps/i386/i486/bits/atomic.h b/sysdeps/i386/i486/bits/atomic.h
> > index 76e0e8e..7c432f9 100644
> > --- a/sysdeps/i386/i486/bits/atomic.h
> > +++ b/sysdeps/i386/i486/bits/atomic.h
> > @@ -532,3 +532,7 @@ typedef uintmax_t uatomic_max_t;
> >  #define atomic_or(mem, mask) __arch_or_body (LOCK_PREFIX, mem, mask)
> >  
> >  #define catomic_or(mem, mask) __arch_or_body (__arch_cprefix, mem, mask)
> > +
> > +#define atomic_full_barrier() __asm ("mfence" ::: "memory")
> > +#define atomic_read_barrier() __asm ("" ::: "memory")
> > +#define atomic_write_barrier() __asm ("" ::: "memory")
> 
> Does glibc now require i686+sse? AFAIK mfence is SIGILL on all
> previous x86 generations. An alternate portable (to earlier chips) but
> slower approach is a useless xchg. It also _might_ work to perform a
> store/load pair on the same address (I'm not clear on whether this
> method is valid).

Good point.

What's the include order on x86, counting down from the most recent
supported revision (ie, for an i686 machine, include sysdeps/i386/i686
first, then i586 and i486)?
  
Joseph Myers Oct. 29, 2014, 9:54 p.m. UTC | #4
On Wed, 29 Oct 2014, Torvald Riegel wrote:

> What's the include order on x86, counting down from the most recent
> supported revision (ie, for an i686 machine, include sysdeps/i386/i686
> first, then i586 and i486)?

That depends on the sysdeps Implies file.  For i686, it uses i386/i486 but 
*not* i386/i586.

(i586 uses i486, so any files for plain i386 that are overridden for i486, 
or overridden for both i586 and i686, are dead and can be removed.  It 
would also be OK to merge i486 into i386 now that anything before i486
isn't supported.)
  
Torvald Riegel Oct. 29, 2014, 10:32 p.m. UTC | #5
On Wed, 2014-10-29 at 21:54 +0000, Joseph S. Myers wrote:
> On Wed, 29 Oct 2014, Torvald Riegel wrote:
> 
> > What's the include order on x86, counting down from the most recent
> > supported revision (ie, for an i686 machine, include sysdeps/i386/i686
> > first, then i586 and i486)?
> 
> That depends on the sysdeps Implies file.  For i686, it uses i386/i486 but 
> *not* i386/i586.
> 
> (i586 uses i486, so any files for plain i386 that are overridden for i486, 
> or overridden for both i586 and i686, are dead and can be removed.  It 
> would also be OK to merge i486 into i386 now anything before i486 isn't 
> supported.)

Thanks for the background.

So, mfence seems to have been introduced with SSE2.  Should I try to
test for SSE2 specifically, or rather assume SSE2 support for i786?
  
Joseph Myers Oct. 29, 2014, 10:54 p.m. UTC | #6
On Wed, 29 Oct 2014, Torvald Riegel wrote:

> So, mfence seems to have been introduced with SSE2.  Should I try to
> test for SSE2 specifically, or rather assume SSE2 support for i786?

I think the i786 directories should be removed; config.guess will never 
return such a processor name for GNU/Linux at least (I don't know what it 
returns on Hurd).  The comment in sysdeps/i386/i786/Implies suggests it 
was for PII, but PII was still family 6 (and family 15 came after family
6; I don't think there were any x86 processors with family numbers 7 to
14).

So, anything conditional on SSE2 should test for __SSE2__.
  

Patch

commit be14e33b607609462130a0264c7a9460596da3f8
Author: Torvald Riegel <triegel@redhat.com>
Date:   Wed Oct 29 10:34:36 2014 +0100

    Fix atomic_full_barrier on x86 and x86_64.
    
    	* sysdeps/x86_64/bits/atomic.h (atomic_full_barrier,
    	atomic_read_barrier, atomic_write_barrier): Define.
    	* sysdeps/i386/i486/bits/atomic.h (atomic_full_barrier,
    	atomic_read_barrier, atomic_write_barrier): Define.

diff --git a/sysdeps/i386/i486/bits/atomic.h b/sysdeps/i386/i486/bits/atomic.h
index 76e0e8e..7c432f9 100644
--- a/sysdeps/i386/i486/bits/atomic.h
+++ b/sysdeps/i386/i486/bits/atomic.h
@@ -532,3 +532,7 @@  typedef uintmax_t uatomic_max_t;
 #define atomic_or(mem, mask) __arch_or_body (LOCK_PREFIX, mem, mask)
 
 #define catomic_or(mem, mask) __arch_or_body (__arch_cprefix, mem, mask)
+
+#define atomic_full_barrier() __asm ("mfence" ::: "memory")
+#define atomic_read_barrier() __asm ("" ::: "memory")
+#define atomic_write_barrier() __asm ("" ::: "memory")
diff --git a/sysdeps/x86_64/bits/atomic.h b/sysdeps/x86_64/bits/atomic.h
index 4d19ef0..25f73aa 100644
--- a/sysdeps/x86_64/bits/atomic.h
+++ b/sysdeps/x86_64/bits/atomic.h
@@ -466,3 +466,7 @@  typedef uintmax_t uatomic_max_t;
 #define atomic_or(mem, mask) __arch_or_body (LOCK_PREFIX, mem, mask)
 
 #define catomic_or(mem, mask) __arch_or_body (__arch_cprefix, mem, mask)
+
+#define atomic_full_barrier() __asm ("mfence" ::: "memory")
+#define atomic_read_barrier() __asm ("" ::: "memory")
+#define atomic_write_barrier() __asm ("" ::: "memory")