[BZ,18034,AArch64] Lazy TLSDESC relocation data race fix

Message ID 553E5381.504@arm.com
State Dropped

Commit Message

Szabolcs Nagy April 27, 2015, 3:19 p.m. UTC
  On 22/04/15 17:08, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
>> accesses.
> 
> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> Fixes to existing code should use the C11 memory model for concurrent
> code, especially if the fix is about a concurrency issue.
> 
> I agree with Adhemerval that a release MO store seems to be sufficient
> in this case.
> 

Updated the patch.

I used a fence instead of the suggested atomic store, because I
think this part of the concurrency wiki is problematic if glibc
ever tries to move to C11:

"We (currently) do not use special types for the variables accessed
by atomic operations, but the underlying non-atomic types with
suitable alignment on each architecture."

The release fence generates the same code (dmb ish) as the previous
__sync_synchronize.  (A release fence is slightly stronger than what
AArch64 actually needs here: only store-store ordering is required,
but a release fence also orders earlier loads against later stores.)

I haven't changed other parts of the patch.

Changelog:

2015-04-27  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #18034]
	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
	(_dl_tlsdesc_dynamic): Likewise.
	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
	barrier between stores.
  

Comments

Torvald Riegel May 26, 2015, 8:37 p.m. UTC | #1
On Mon, 2015-04-27 at 16:19 +0100, Szabolcs Nagy wrote:
> 
> On 22/04/15 17:08, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> >> Lazy TLSDESC initialization needs to be synchronized with concurrent
> >> TLS accesses.
> >
> > Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> > Fixes to existing code should use the C11 memory model for concurrent
> > code, especially if the fix is about a concurrency issue.
> >
> > I agree with Adhemerval that a release MO store seems to be sufficient
> > in this case.
> > 
> 
> Updated the patch.
> 
> I used a fence instead of the suggested atomic store, because I
> think this part of the concurrency wiki is problematic if glibc
> ever tries to move to C11:
> 
> "We (currently) do not use special types for the variables accessed
> by atomic operations, but the underlying non-atomic types with
> suitable alignment on each architecture."

No, I don't see how it would be problematic.  Why do you think this is
the case?
First, there's no collision, because glibc's atomic functions are named
differently.  Second, we expect the underlying data type for both the
arch's atomic types and what we use now to be identical, so even
switching all of that over would be possible.

This should have relaxed atomic accesses for all things concurrently
accessed.  As a result, you can drop the volatile qualification I
believe (I haven't checked, but it seems this was a pre-memory-model way
to avoid compiler reordering -- it's not actually observable behavior
that we'd need it for).

The release store would be clear, but you can use the release fence as
well.
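
To make the intended pairing concrete, here is a hedged, standalone
sketch in C11 <stdatomic.h> of the writer/reader protocol being
discussed (hypothetical names; the real code uses td->arg/td->entry and
the ldar in the assembly trampolines):

  #include <stdatomic.h>

  /* Hypothetical stand-ins for td->entry and td->arg.  */
  static _Atomic (void *) entry;
  static _Atomic (void *) arg;

  /* Writer (the resolver): relaxed store of the argument, then a
     release store of the entry pointer.  */
  void
  publish (void *a, void *e)
  {
    atomic_store_explicit (&arg, a, memory_order_relaxed);
    atomic_store_explicit (&entry, e, memory_order_release);
  }

  /* Reader (the trampoline; the ldar on AArch64): the acquire load of
     entry synchronizes with the release store above, so the following
     load of arg is guaranteed to see the published value.  */
  void *
  consume (void)
  {
    void *e = atomic_load_explicit (&entry, memory_order_acquire);
    (void) e;  /* the real code jumps through e */
    return atomic_load_explicit (&arg, memory_order_relaxed);
  }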

> 
> diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
> index be9b9b3..ff74b52 100644
> --- a/sysdeps/aarch64/dl-tlsdesc.S
> +++ b/sysdeps/aarch64/dl-tlsdesc.S
> @@ -79,6 +79,25 @@ _dl_tlsdesc_return:
>         cfi_endproc
>         .size   _dl_tlsdesc_return, .-_dl_tlsdesc_return
>  
> +       /* Same as _dl_tlsdesc_return but with synchronization for
> +          lazy relocation.
> +          Prototype:
> +          _dl_tlsdesc_return_lazy (tlsdesc *) ;
> +        */
> +       .hidden _dl_tlsdesc_return_lazy
> +       .global _dl_tlsdesc_return_lazy
> +       .type   _dl_tlsdesc_return_lazy,%function
> +       cfi_startproc
> +       .align 2
> +_dl_tlsdesc_return_lazy:
> +       /* The ldar guarantees ordering between the load from [x0] at the
> +          call site and the load from [x0,#8] here for lazy relocation. */

You should point out where the matching release MO operation is.  For
example, you could say something along the lines of:

"The ldar ensures that we synchronize-with with the release MO store in
_dl_tlsdesc_resolve_rela_fixup; as a result, the load from [x0,#8] will
happen after the initialization of td->arg in
_dl_tlsdesc_resolve_rela_fixup."

> +       ldar    xzr, [x0]
> +       ldr     x0, [x0, #8]
> +       RET
> +       cfi_endproc
> +       .size   _dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
> +
>         /* Handler for undefined weak TLS symbols.
>            Prototype:
>            _dl_tlsdesc_undefweak (tlsdesc *);
> @@ -96,6 +115,9 @@ _dl_tlsdesc_return:
>  _dl_tlsdesc_undefweak:
>         str     x1, [sp, #-16]!
>         cfi_adjust_cfa_offset(16)
> +       /* The ldar guarantees ordering between the load from [x0] at the
> +          call site and the load from [x0,#8] here for lazy relocation. */
> +       ldar    xzr, [x0]

Likewise.

>         ldr     x0, [x0, #8]
>         mrs     x1, tpidr_el0
>         sub     x0, x0, x1
> @@ -152,6 +174,9 @@ _dl_tlsdesc_dynamic:
>         stp     x3,  x4, [sp, #32+16*1]
>  
>         mrs     x4, tpidr_el0
> +       /* The ldar guarantees ordering between the load from [x0] at the
> +          call site and the load from [x0,#8] here for lazy relocation. */
> +       ldar    xzr, [x0]

Likewise.

>         ldr     x1, [x0,#8]
>         ldr     x0, [x4]
>         ldr     x3, [x1,#16]
> diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
> index 7a1285e..e6c0078 100644
> --- a/sysdeps/aarch64/dl-tlsdesc.h
> +++ b/sysdeps/aarch64/dl-tlsdesc.h
> @@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
>  _dl_tlsdesc_return (struct tlsdesc *);
>  
>  extern ptrdiff_t attribute_hidden
> +_dl_tlsdesc_return_lazy (struct tlsdesc *);
> +
> +extern ptrdiff_t attribute_hidden
>  _dl_tlsdesc_undefweak (struct tlsdesc *);
>  
>  extern ptrdiff_t attribute_hidden
> diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
> index 4821f8c..2e55c07 100644
> --- a/sysdeps/aarch64/tlsdesc.c
> +++ b/sysdeps/aarch64/tlsdesc.c
> @@ -25,6 +25,7 @@
>  #include <dl-tlsdesc.h>
>  #include <dl-unmap-segments.h>
>  #include <tlsdeschtab.h>
> +#include <atomic.h>
>  
>  /* The following functions take an entry_check_offset argument.  It's
>     computed by the caller as an offset between its entry point and the
> @@ -87,6 +88,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,

This volatile can be dropped when we use atomics properly, I believe.

>    if (!sym)
>      {
>        td->arg = (void*) reloc->r_addend;

That needs to be an atomic access.

> +      /* Store-store barrier so the two writes are not reordered. */

Say that this release MO store is meant to synchronize with the acquire
MO operation in _dl_tlsdesc_undefweak.

> +      atomic_thread_fence_release ();
>        td->entry = _dl_tlsdesc_undefweak;

Relaxed MO atomic store, or make it a release MO store and drop the
barrier.

>      }
>    else
> @@ -98,6 +101,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
>         {
>           td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
>                                               + reloc->r_addend);
> +         /* Store-store barrier so the two writes are not reordered. */
> +         atomic_thread_fence_release ();
>           td->entry = _dl_tlsdesc_dynamic;

Likewise.

>         }
>        else
> @@ -105,7 +110,9 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
>         {
>           td->arg = (void*) (sym->st_value + result->l_tls_offset
>                              + reloc->r_addend);
> -         td->entry = _dl_tlsdesc_return;
> +         /* Store-store barrier so the two writes are not reordered. */
> +         atomic_thread_fence_release ();
> +         td->entry = _dl_tlsdesc_return_lazy;

Likewise.

>         }
>      }
>  
> 
You should also document why the relaxed MO load in
_dl_tlsdesc_resolve_early_return_p is sufficient (see the other part of
the email thread -- I now see what you meant but this isn't obvious at
all).  The acquire MO operations that make it work are added by you in
this patch.
  
Szabolcs Nagy May 27, 2015, 11:27 a.m. UTC | #2
On 26/05/15 21:37, Torvald Riegel wrote:
> On Mon, 2015-04-27 at 16:19 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 17:08, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>>>> Lazy TLSDESC initialization needs to be synchronized with concurrent
>>>> TLS accesses.
>>>
>>> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
>>> Fixes to existing code should use the C11 memory model for concurrent
>>> code, especially if the fix is about a concurrency issue.
>>>
>>> I agree with Adhemerval that a release MO store seems to be sufficient
>>> in this case.
>>
>> I used a fence instead of the suggested atomic store, because I
>> think this part of the concurrency wiki is problematic if glibc
>> ever tries to move to C11:
>>
>> "We (currently) do not use special types for the variables accessed
>> by atomic operations, but the underlying non-atomic types with
>> suitable alignment on each architecture."
> 
> No, I don't see how it would be problematic.  Why do you think this is
> the case?

ok, i was thinking about type system issues but that's
probably not relevant.

(_Atomic int* is incompatible with int*, and changing the type
everywhere to _Atomic might not work if the type is externally
visible, as with the pthread types).
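
As a tiny, purely hypothetical illustration of that incompatibility
(this is expected to draw a compiler diagnostic, which is the point):

  int x;
  int *plain = &x;             /* fine: &x has type int *              */
  _Atomic int *atomic_p = &x;  /* diagnosed: int * and _Atomic int *
                                  are incompatible pointer types       */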

> This should have relaxed atomic accesses for all things concurrently
> accessed.  As a result, you can drop the volatile qualification I
> believe (I haven't checked, but it seems this was a pre-memory-model way
> to avoid compiler reordering -- it's not actually observable behavior
> that we'd need it for).

i guess volatile prevents reordering wrt other volatiles
(but not wrt other non-volatiles).

the other use of volatile is to prevent spurious loads, e.g.

x = *p;
y = x;
z = x;

can be "optimized" into

y = *p;
z = *p;

or even

y = z = *p & *p;

if every access to *p is changed to use an atomic function
then dropping volatile is ok, but in this case td is passed
around and used in _dl_tlsdesc_resolve_early_return_p too,
so i'm not sure if it's ok to remove volatile yet.

is it ok if that's fixed as a separate commit?

> The release store would be clear, but you can use the release fence as
> well.
> 

ok, i'll use release store then.

>> +_dl_tlsdesc_return_lazy:
>> +       /* The ldar guarantees ordering between the load from [x0] at the
>> +          call site and the load from [x0,#8] here for lazy relocation. */
> 
> You should point out where the matching release MO operation is.  For
> example, you could say something along the lines of:
>
> "The ldar ensures that we synchronize-with with the release MO store in
> _dl_tlsdesc_resolve_rela_fixup; as a result, the load from [x0,#8] will
> happen after the initialization of td->arg in
> _dl_tlsdesc_resolve_rela_fixup."
> 

ok

>> +      /* Store-store barrier so the two writes are not reordered. */
> 
> Say that this release MO store is meant to synchronize with the acquire
> MO operation in _dl_tlsdesc_undefweak.

ok

> You should also document why the relaxed MO load in
> _dl_tlsdesc_resolve_early_return_p is sufficient (see the other part of
> the email thread -- I now see what you meant but this isn't obvious at
> all).  The acquire MO operations that make it work are added by you in
> this patch.

ok,

but i think that should be fixed separately
together with other archs.
  
Torvald Riegel May 27, 2015, 1:02 p.m. UTC | #3
On Wed, 2015-05-27 at 12:27 +0100, Szabolcs Nagy wrote:
> On 26/05/15 21:37, Torvald Riegel wrote:
> > This should have relaxed atomic accesses for all things concurrently
> > accessed.  As a result, you can drop the volatile qualification I
> > believe (I haven't checked, but it seems this was a pre-memory-model way
> > to avoid compiler reordering -- it's not actually observable behavior
> > that we'd need it for).
> 
> i guess volatile prevents reordering wrt other volatiles
> (but not wrt other non-volatiles).

The atomic ops will take care of this (or, allow one to take care of
this).

> the other use of volatile is to prevent spurious loads, eg.
> 
> x = *p;
> y = x;
> z = x;
> 
> can be "optimized" into
> 
> y = *p;
> z = *p;
> 
> or even
> 
> y = z = *p & *p;

Yes, that's allowed for non-atomic, non-volatile accesses (either because
of the data-race-freedom requirement for non-atomics, or because
non-volatile accesses carry no requirement to match the abstract
machine's accesses).  If the load from
*p were even a memory_order_relaxed atomic load, the compiler would not
be allowed to re-load.
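
A small self-contained sketch of that guarantee (hypothetical function,
C11 <stdatomic.h>):

  #include <stdatomic.h>

  /* With a relaxed atomic load the compiler must perform exactly one
     load of *p here, so y and z are guaranteed to receive the same
     value even if another thread stores to *p concurrently.  */
  void
  copy_once (_Atomic int *p, int *y, int *z)
  {
    int x = atomic_load_explicit (p, memory_order_relaxed);
    *y = x;
    *z = x;
  }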

> if every access to *p is changed to use an atomic function
> then dropping volatile is ok, but in this case td is passed
> around and used in _dl_tlsdesc_resolve_early_return_p too,
> so i'm not sure if it's ok to remove volatile yet.

Yes, the clean thing to do would be to change it everywhere at once.

> is it ok if that's fixed as a separate commit?

Yes, if you promise to submit such a patch :)

Please still use mo_relaxed atomic accesses though, so that we can take
care of the DRF requirement for the code you changed.

> > You should also document why the relaxed MO load in
> > _dl_tlsdesc_resolve_early_return_p is sufficient (see the other part of
> > the email thread -- I now see what you meant but this isn't obvious at
> > all).  The acquire MO operations that make it work are added by you in
> > this patch.
> 
> ok,
> 
> but i think that should be fixed separately
> together with other archs.
> 

If you want to merge that into, for example, using a common tlsdesc.c,
then that's fine with me.  I just don't want this information to get
lost; I think it's non-obvious and thus should be documented.
  
Alexandre Oliva June 9, 2015, 3:10 a.m. UTC | #4
On May 26, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> As a result, you can drop the volatile qualification I
> believe (I haven't checked, but it seems this was a pre-memory-model way
> to avoid compiler reordering

That's correct.  It was the only portable way available back then to
force stores and loads to take place in specific orders.
  
Torvald Riegel June 9, 2015, 8:34 a.m. UTC | #5
On Tue, 2015-06-09 at 00:10 -0300, Alexandre Oliva wrote:
> On May 26, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > As a result, you can drop the volatile qualification I
> > believe (I haven't checked, but it seems this was a pre-memory-model way
> > to avoid compiler reordering

I meant that I haven't checked what the intent behind it was, and why it
was chosen instead of other alternatives.

> That's correct.  It was the only portable way available back then to
> force stores and loads to take place in specific orders.

An alternative would have been to do what Boehm's libatomic-ops did:
Have functions/macros for all atomic accesses, and implement them with
asm volatile.
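
A rough sketch of that style for AArch64 (hypothetical names, not the
actual libatomic-ops API): every access goes through a small
function/macro whose asm volatile both emits the right instruction and
acts as a compiler barrier.

  #include <stdint.h>

  static inline uintptr_t
  ao_load_acquire (const uintptr_t *p)
  {
    uintptr_t v;
    __asm__ __volatile__ ("ldar %0, [%1]" : "=r" (v) : "r" (p) : "memory");
    return v;
  }

  static inline void
  ao_store_release (uintptr_t *p, uintptr_t v)
  {
    __asm__ __volatile__ ("stlr %1, [%0]" : : "r" (p), "r" (v) : "memory");
  }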
  

Patch

diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..ff74b52 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,25 @@  _dl_tlsdesc_return:
 	cfi_endproc
 	.size	_dl_tlsdesc_return, .-_dl_tlsdesc_return
 
+	/* Same as _dl_tlsdesc_return but with synchronization for
+	   lazy relocation.
+	   Prototype:
+	   _dl_tlsdesc_return_lazy (tlsdesc *) ;
+	 */
+	.hidden _dl_tlsdesc_return_lazy
+	.global	_dl_tlsdesc_return_lazy
+	.type	_dl_tlsdesc_return_lazy,%function
+	cfi_startproc
+	.align 2
+_dl_tlsdesc_return_lazy:
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
+	ldr	x0, [x0, #8]
+	RET
+	cfi_endproc
+	.size	_dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
 	/* Handler for undefined weak TLS symbols.
 	   Prototype:
 	   _dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +115,9 @@  _dl_tlsdesc_return:
 _dl_tlsdesc_undefweak:
 	str	x1, [sp, #-16]!
 	cfi_adjust_cfa_offset(16)
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
 	ldr	x0, [x0, #8]
 	mrs	x1, tpidr_el0
 	sub	x0, x0, x1
@@ -152,6 +174,9 @@  _dl_tlsdesc_dynamic:
 	stp	x3,  x4, [sp, #32+16*1]
 
 	mrs	x4, tpidr_el0
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
 	ldr	x1, [x0,#8]
 	ldr	x0, [x4]
 	ldr	x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@  extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_return (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_undefweak (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..2e55c07 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -25,6 +25,7 @@ 
 #include <dl-tlsdesc.h>
 #include <dl-unmap-segments.h>
 #include <tlsdeschtab.h>
+#include <atomic.h>
 
 /* The following functions take an entry_check_offset argument.  It's
    computed by the caller as an offset between its entry point and the
@@ -87,6 +88,8 @@  _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Store-store barrier so the two writes are not reordered. */
+      atomic_thread_fence_release ();
       td->entry = _dl_tlsdesc_undefweak;
     }
   else
@@ -98,6 +101,8 @@  _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
 					      + reloc->r_addend);
+	  /* Store-store barrier so the two writes are not reordered. */
+	  atomic_thread_fence_release ();
 	  td->entry = _dl_tlsdesc_dynamic;
 	}
       else
@@ -105,7 +110,9 @@  _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = (void*) (sym->st_value + result->l_tls_offset
 			     + reloc->r_addend);
-	  td->entry = _dl_tlsdesc_return;
+	  /* Store-store barrier so the two writes are not reordered. */
+	  atomic_thread_fence_release ();
+	  td->entry = _dl_tlsdesc_return_lazy;
 	}
     }