diff mbox

[BZ,18034,AArch64] Lazy TLSDESC relocation data race fix

Message ID 553793A3.7030206@arm.com
State Superseded
Headers show

Commit Message

Szabolcs Nagy April 22, 2015, 12:27 p.m. UTC
Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
accesses.

The original ARM TLSDESC design did not discuss this race that
arises on systems with weak memory order guarantees

http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt

"Storing the final value of the TLS descriptor also needs care: the
resolver field must be set to its final value after the argument gets
its final value, such that any if thread attempts to use the
descriptor before it gets its final value, it still goes to the hold
function."

The current glibc code (i386, x86_64, arm, aarch64) is

  td->arg = ...;
  td->entry = ...;

the arm memory model allows store-store reordering, so a barrier is
needed between these two stores (the code is broken on x86 as well in
principle: x86 memory model does not help on the c code level, the
compiler can reorder non-aliasing stores).

What is missing from the TLSDESC design spec is a description of the
ordering requirements on the read side: the TLS access sequence
(generated by the compiler) loads the descriptor function pointer
(td->entry) and then the argument is loaded (td->arg) in that function.
These two loads must observe the changes made on the write side in a
sequentially consistent way. The code in asm:

  ldr x1, [x0]    // load td->entry
  blr [x0]        // call it

entryfunc:
  ldr x0, [x0,#8] // load td->arg

The branch (blr) does not provide a load-load ordering barrier (so the
second load may observe an old arg even though the first load observed
the new entry, this happens with existing aarch64 hardware causing
invalid memory access due to the incorrect TLS offset).

Various alternatives were considered to force the load ordering in the
descriptor function:

(1) barrier

entryfunc:
  dmb ishld
  ldr x0, [x0,#8]

(2) address dependency (if the address of the second load depends on the
result of the first one the ordering is guaranteed):

entryfunc:
  ldr x1,[x0]
  and x1,x1,#8
  orr x1,x1,#8
  ldr x0,[x0,x1]

(3) load-acquire (ARMv8 instruction that is ordered before subsequent
loads and stores)

entryfunc:
  ldar xzr,[x0]
  ldr x0,[x0,#8]

Option (1) is the simplest but slowest (note: this runs at every TLS
access), options (2) and (3) do one extra load from [x0] (same address
loads are ordered so it happens-after the load on the call site),
option (2) clobbers x1, so I think (3) is the best solution (a load
into the zero register has the same effect as with a normal register,
but the result is discarded so nothing is clobbered. Note that the
TLSDESC abi allows the descriptor function to clobber x1, but current
gcc does not implement this correctly, gcc will be fixed independently,
the dynamic and undefweak descriptors currently save/restore x1 so only
static TLS would be affected by this clobbering issue).

On the write side I used a full barrier (__sync_synchronize) although

  dmb ishst

would be enough, but write side is not performance critical.

I introduced a new _dl_tlsdesc_return_lazy descriptor function for
lazily relocated static TLS, so non-lazy use-case is not affected.
The dynamic and undefweak descriptors do enough work so the additional
ldar should not have a performance impact.

Other thoughts:

- Lazy binding for static TLS may be unnecessary complexity: it seems
that gcc generates at most two static TLSDESC relocation entries for a
translation unit (zero and non-zero initialized TLS), so there has to be
a lot of object files with static TLS linked into a shared object to
make the load time relocation overhead significant. Unless there is some
evidence that lazy static TLS relocation makes sense I would suggest
removing that logic (from all archs). (The dynamic case is probably a
similar micro-optimization, but there the lazy vs non-lazy semantics are
different in case of undefined symbols and some applications may depend
on that).

- 32bit arm with -mtls-dialect=gnu2 is affected too, it will have to go
with option (1) or (2) with an additional twist: some existing ARMv7 cpus
don't implement the same address load ordering reliably, see:
http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

- I don't understand the role of the hold function in the general
TLSDESC design, why is it not enough to let other threads wait on the
global lock in the initial resolver function? Is the global dl lock
implemented as a spin lock? Is it for some liveness/fairness property?

- There are some incorrect uses of "cfi_adjust_cfa_offset" in
dl-tlsdesc.S which is a separate patch.

Changelog:

2015-04-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #18034]
	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
	(_dl_tlsdesc_dynamic): Likewise.
	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
	barrier between stores.

Comments

Torvald Riegel April 22, 2015, 4:08 p.m. UTC | #1
On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
> accesses.
> 
> The original ARM TLSDESC design did not discuss this race that
> arises on systems with weak memory order guarantees
> 
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
> 
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that any if thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
> 
> The current glibc code (i386, x86_64, arm, aarch64) is
> 
>   td->arg = ...;
>   td->entry = ...;
> 
> the arm memory model allows store-store reordering, so a barrier is
> needed between these two stores (the code is broken on x86 as well in
> principle: x86 memory model does not help on the c code level, the
> compiler can reorder non-aliasing stores).

Yes, this is not guaranteed to work.  This should thus be fixed for all
the architectures.

> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry) and then the argument is loaded (td->arg) in that function.
> These two loads must observe the changes made on the write side in a
> sequentially consistent way.

Not strictly in a sequentially consistent way, I believe.  What you are
probably looking for (I'm not familiar with TLS in detail) is some
invariant like if you read update X, you will also read update Y or a
more recent update to Y.

> The code in asm:
> 
>   ldr x1, [x0]    // load td->entry
>   blr [x0]        // call it
> 
> entryfunc:
>   ldr x0, [x0,#8] // load td->arg
> 
> The branch (blr) does not provide a load-load ordering barrier (so the
> second load may observe an old arg even though the first load observed
> the new entry, this happens with existing aarch64 hardware causing
> invalid memory access due to the incorrect TLS offset).
> 
> Various alternatives were considered to force the load ordering in the
> descriptor function:
> 
> (1) barrier
> 
> entryfunc:
>   dmb ishld
>   ldr x0, [x0,#8]
> 
> (2) address dependency (if the address of the second load depends on the
> result of the first one the ordering is guaranteed):
> 
> entryfunc:
>   ldr x1,[x0]
>   and x1,x1,#8
>   orr x1,x1,#8
>   ldr x0,[x0,x1]

The address dependencies could be useful in terms of performance, but
the disadvantage that I see are that there's currently no useful way to
use them in C code (C11's memory_order_consume is broken).

> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> loads and stores)
> 
> entryfunc:
>   ldar xzr,[x0]
>   ldr x0,[x0,#8]
> 
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access), options (2) and (3) do one extra load from [x0] (same address
> loads are ordered so it happens-after the load on the call site),
> option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register has the same effect as with a normal register,
> but the result is discarded so nothing is clobbered. Note that the
> TLSDESC abi allows the descriptor function to clobber x1, but current
> gcc does not implement this correctly, gcc will be fixed independently,
> the dynamic and undefweak descriptors currently save/restore x1 so only
> static TLS would be affected by this clobbering issue).
> 
> On the write side I used a full barrier (__sync_synchronize) although
> 
>   dmb ishst
> 
> would be enough, but write side is not performance critical.

Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
Fixes to existing code should use the C11 memory model for concurrent
code, especially if the fix is about a concurrency issue.

I agree with Adhemerval that a release MO store seems to be sufficient
in this case.

I haven't scanned through the TLS code to assess how much work it would
be to make it data-race-free (as defined by C11), but it would certainly
be helpful in showing similar issues (e.g., when documenting it) and
preventing compiler optimizations that are legal for non-concurrent
accesses but not what we need for TLS (e.g., speculative loads
introduced by the compiler on x86 in the case you mention above).

Given that you have looked at the code, could you give a rough estimate
of how much churn it would be to make the TLS code data-race-free?

It also looks as if the x86_64 variant of tlsdesc.c is, before your
changes and ignoring some additional comments, very similar to the
aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
using the C11 model) that can be used on several archs?  This would
certainly decrease maintenance overhead.

I'm aware this might look like I'm requesting extra work that is not at
the core of what you are trying to fix.  However, moving to portable
concurrent code that is shared across several archs is something we've
been doing elsewhere too, and in those cases ARM was on the benefiting
side (e.g., pthread_once, ongoing work in nscd, ...).  I think overall,
some extra work and clean-up will be a good thing for ARM archs too.

Thanks.
Szabolcs Nagy April 22, 2015, 5:14 p.m. UTC | #2
On 22/04/15 17:08, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>> (2) address dependency (if the address of the second load depends on the
>> result of the first one the ordering is guaranteed):
>>
>> entryfunc:
>>   ldr x1,[x0]
>>   and x1,x1,#8
>>   orr x1,x1,#8
>>   ldr x0,[x0,x1]
> 
> The address dependencies could be useful in terms of performance, but
> the disadvantage that I see are that there's currently no useful way to
> use them in C code (C11's memory_order_consume is broken).
> 

this must be in asm so i don't think it matters if consume
is broken in c11.

(tlsdesc resolver cannot clobber registers).

>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>> loads and stores)
>>
>> entryfunc:
>>   ldar xzr,[x0]
>>   ldr x0,[x0,#8]
>>
>> Option (1) is the simplest but slowest (note: this runs at every TLS
>> access), options (2) and (3) do one extra load from [x0] (same address
>> loads are ordered so it happens-after the load on the call site),
>> option (2) clobbers x1, so I think (3) is the best solution (a load
>> into the zero register has the same effect as with a normal register,
>> but the result is discarded so nothing is clobbered. Note that the
>> TLSDESC abi allows the descriptor function to clobber x1, but current
>> gcc does not implement this correctly, gcc will be fixed independently,
>> the dynamic and undefweak descriptors currently save/restore x1 so only
>> static TLS would be affected by this clobbering issue).
>>
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>>   dmb ishst
>>
>> would be enough, but write side is not performance critical.
> 
> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> Fixes to existing code should use the C11 memory model for concurrent
> code, especially if the fix is about a concurrency issue.
> 
> I agree with Adhemerval that a release MO store seems to be sufficient
> in this case.
> 

yes, the write side could use c11 atomics and i agree that
store-release is enough, but i'm not sure how much one can
rely on c11 memory model when interacting with asm code.
(so i thought a full barrier is the easiest to reason about)

i can update the code to use glibc atomics, but i will
have to look into how those work (are there type issues
or are they generic?)

> I haven't scanned through the TLS code to assess how much work it would
> be to make it data-race-free (as defined by C11), but it would certainly
> be helpful in showing similar issues (e.g., when documenting it) and
> preventing compiler optimizations that are legal for non-concurrent
> accesses but not what we need for TLS (e.g., speculative loads
> introduced by the compiler on x86 in the case you mention above).
> 
> Given that you have looked at the code, could you give a rough estimate
> of how much churn it would be to make the TLS code data-race-free?

i think td->entry should only be accessed with atomics once
the loader finished loading the dso with it.
(i assume during dso loading concurrent access is not possible)

(the particular case i'm trying to fix is hard because
the access to td->entry is generated by the compiler, so
it has to be worked around inside the entry function.
i think x86 does not have this issue)

i think these function might need some attention:

elf/tlsdeschtab.h:
_dl_tlsdesc_resolve_early_return_p
sysdep/{x86_64,i386,arm,aarch64}/tlsdesc.c:
_dl_tlsdesc_resolve_rela_fixup
_dl_tlsdesc_resolve_hold_fixup

but i haven't done an detailed analysis

> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> changes and ignoring some additional comments, very similar to the
> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
> using the C11 model) that can be used on several archs?  This would
> certainly decrease maintenance overhead.
> 

should be possible i think: these functions are called
from asm to do the lazy resolution

but i have to check the arch specific details.

> I'm aware this might look like I'm requesting extra work that is not at
> the core of what you are trying to fix.  However, moving to portable
> concurrent code that is shared across several archs is something we've
> been doing elsewhere too, and in those cases ARM was on the benefiting
> side (e.g., pthread_once, ongoing work in nscd, ...).  I think overall,
> some extra work and clean-up will be a good thing for ARM archs too.
> 
> Thanks.
>
Szabolcs Nagy April 23, 2015, 12:08 p.m. UTC | #3
On 22/04/15 18:14, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
>> Given that you have looked at the code, could you give a rough estimate
>> of how much churn it would be to make the TLS code data-race-free?
> 
> elf/tlsdeschtab.h:
> _dl_tlsdesc_resolve_early_return_p
> sysdep/{x86_64,i386,arm,aarch64}/tlsdesc.c:
> _dl_tlsdesc_resolve_rela_fixup
> _dl_tlsdesc_resolve_hold_fixup
> 

i think these all need to use atomics for accessing
td->entry for c11 correctness, but it's hard to tell
what's right since c11 only talks about synchronizing
with c code, not asm.

the td->entry can be in 3 states during lazy resolution:

* init: retries the call of entry until caller==entry
  (using a double-checked locking mechanism, then it
  - grabs GL(dl_load_lock)
  - changes the entry to the hold state
  - does the resolver work
  - changes arg and entry to the final value
  - releases the lock to wake the waiters
  - calls the new entry)
* hold: waits on the GL(dl_load_lock)
  (then retries calling the entry when it's woken up)
* final state: (static, dynamic or undefweak callback)
  calculates the tls offset based on arg
  (currently without any locks which is bad on weak
  memory models)

the code for tls access is generated by the compiler so
there is no atomics there: the loaded entry can be in
init, hold or final state, the guarantee about the state
of arg must come from the target arch memory model.

and to answer my earlier question about the hold state:
it is needed to avoid the spinning in the double-checked
locking in init.

>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>> changes and ignoring some additional comments, very similar to the
>> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
>> using the C11 model) that can be used on several archs?  This would
>> certainly decrease maintenance overhead.
>>
> 
> should be possible i think: these functions are called
> from asm to do the lazy resolution
> 
> but i have to check the arch specific details.

looking at this, but it seems to be significant work:

* i386 and arm determine the caller differently
* i386 uses special calling convention from asm
* arm handles local symbols specially

i think the way to move forward is

* fix the correctness bug now on aarch64
* decide if lazy static tls resolver is worth it
* do the refactoring so tlsdesc is common code
* use c11 atomics

the fix for arm is a lot harder (because atomics are
different on different versions of arm, an ifunc based
dispatch could in principle solve the dmb vs no-dmb
issue for the lazy resolvers).

is this ok?
Torvald Riegel April 23, 2015, 9:46 p.m. UTC | #4
On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> >> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> >> loads and stores)
> >>
> >> entryfunc:
> >>   ldar xzr,[x0]
> >>   ldr x0,[x0,#8]
> >>
> >> Option (1) is the simplest but slowest (note: this runs at every TLS
> >> access), options (2) and (3) do one extra load from [x0] (same address
> >> loads are ordered so it happens-after the load on the call site),
> >> option (2) clobbers x1, so I think (3) is the best solution (a load
> >> into the zero register has the same effect as with a normal register,
> >> but the result is discarded so nothing is clobbered. Note that the
> >> TLSDESC abi allows the descriptor function to clobber x1, but current
> >> gcc does not implement this correctly, gcc will be fixed independently,
> >> the dynamic and undefweak descriptors currently save/restore x1 so only
> >> static TLS would be affected by this clobbering issue).
> >>
> >> On the write side I used a full barrier (__sync_synchronize) although
> >>
> >>   dmb ishst
> >>
> >> would be enough, but write side is not performance critical.
> > 
> > Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> > Fixes to existing code should use the C11 memory model for concurrent
> > code, especially if the fix is about a concurrency issue.
> > 
> > I agree with Adhemerval that a release MO store seems to be sufficient
> > in this case.
> > 
> 
> yes, the write side could use c11 atomics and i agree that
> store-release is enough, but i'm not sure how much one can
> rely on c11 memory model when interacting with asm code.
> (so i thought a full barrier is the easiest to reason about)

IIUC, the asm that you need to interact with is generated by the
compiler.  The C11 atomics implementation (that is, the asm / HW
features used in it) are effectively part of the ABI; C11 translation
units need to be able to synchronize with each other even if compiled by
different compiler versions.

Therefore, making sure that the asm follows what the compiler would
generate for the intended C11 atomics equivalent is a solution.  That
also means that we can say clearly, in C11 memory model terms, what the
asm implementation on an arch should do.

> i can update the code to use glibc atomics, but i will
> have to look into how those work (are there type issues
> or are they generic?)

There are basically generic, and internally use either the __atomic
builtins for newer compilers and archs that provide them, or (fall back
to) the previous implementation based on __sync-builtins or custom asm.
One constraint is that we don't support sizes of atomic access that
aren't actually provided on the particular arch (either through HW
instructions, or with kernel helpers).  So, for example, atomic access
to a naturally aligned pointer-sized type can be expected to work.

> > I haven't scanned through the TLS code to assess how much work it would
> > be to make it data-race-free (as defined by C11), but it would certainly
> > be helpful in showing similar issues (e.g., when documenting it) and
> > preventing compiler optimizations that are legal for non-concurrent
> > accesses but not what we need for TLS (e.g., speculative loads
> > introduced by the compiler on x86 in the case you mention above).
> > 
> > Given that you have looked at the code, could you give a rough estimate
> > of how much churn it would be to make the TLS code data-race-free?
> 
> i think td->entry should only be accessed with atomics once
> the loader finished loading the dso with it.
> (i assume during dso loading concurrent access is not possible)

If that's some kind of initialization you have in mind, using nonatomics
accesses may be fine.  We don't have strict rules around that currently.
But if it's just a few lines of code, then simply using relaxed-MO
atomic accesses could be fine as well.

> (the particular case i'm trying to fix is hard because
> the access to td->entry is generated by the compiler, so
> it has to be worked around inside the entry function.
> i think x86 does not have this issue)

Do you mean that you want to affect the compiler-generated code after it
has been generated?
Szabolcs Nagy April 24, 2015, 11:05 a.m. UTC | #5
On 23/04/15 22:46, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 17:08, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>>>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>>>> loads and stores)
>>>>
>>>> entryfunc:
>>>>   ldar xzr,[x0]
>>>>   ldr x0,[x0,#8]
>>>>
>>>> Option (1) is the simplest but slowest (note: this runs at every TLS
>>>> access), options (2) and (3) do one extra load from [x0] (same address
>>>> loads are ordered so it happens-after the load on the call site),
>>>> option (2) clobbers x1, so I think (3) is the best solution (a load
>>>> into the zero register has the same effect as with a normal register,
>>>> but the result is discarded so nothing is clobbered. Note that the
>>>> TLSDESC abi allows the descriptor function to clobber x1, but current
>>>> gcc does not implement this correctly, gcc will be fixed independently,
>>>> the dynamic and undefweak descriptors currently save/restore x1 so only
>>>> static TLS would be affected by this clobbering issue).
>>>>
>>>> On the write side I used a full barrier (__sync_synchronize) although
>>>>
>>>>   dmb ishst
>>>>
>>>> would be enough, but write side is not performance critical.
>>>
>>> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
>>> Fixes to existing code should use the C11 memory model for concurrent
>>> code, especially if the fix is about a concurrency issue.
>>>
>>> I agree with Adhemerval that a release MO store seems to be sufficient
>>> in this case.
>>>
>>
>> yes, the write side could use c11 atomics and i agree that
>> store-release is enough, but i'm not sure how much one can
>> rely on c11 memory model when interacting with asm code.
>> (so i thought a full barrier is the easiest to reason about)
> 
> IIUC, the asm that you need to interact with is generated by the
> compiler.  The C11 atomics implementation (that is, the asm / HW

to clarify:

// this bit is generated by the compiler for every tls access
// (the exact sequence is different and depends on tiny/small/large
// memory model, it is part of the abi so linkers can do relaxations)
// assume x0 is a pointer to the tls descriptor (td) in the GOT.
// (2 consecutive GOT entries are used: td->entry and td->arg)
// the entry function returns an offset (so tls data is at tp+off)
// note: no barriers are used here

  mrs x2, tpidr_el0 // load the thread pointer
  ldr x1, [x0]      // load td->entry
  blr x1            // call it, returns off in x0
  ldr x0, [x2, x0]  // access tls at tp+off

// this is a function implemented by the libc where td->entry points
// currently the options are
//   _dl_tlsdesc_return,
//   _dl_tlsdesc_undefweak,
//   _dl_tlsdesc_dynamic
// originally td->entry points to a trampoline that calls into
// _dl_tlsdesc_resovle_rela which does the lazy relocation and sets
// td->entry to _dl_tlsdesc_hold temporarily.
entryfunc:
  ldr x0, [x0,#8] // load td->arg

>> i can update the code to use glibc atomics, but i will
>> have to look into how those work (are there type issues
>> or are they generic?)
> 
> There are basically generic, and internally use either the __atomic
> builtins for newer compilers and archs that provide them, or (fall back
> to) the previous implementation based on __sync-builtins or custom asm.
> One constraint is that we don't support sizes of atomic access that
> aren't actually provided on the particular arch (either through HW
> instructions, or with kernel helpers).  So, for example, atomic access
> to a naturally aligned pointer-sized type can be expected to work.

ok

>> i think td->entry should only be accessed with atomics once
>> the loader finished loading the dso with it.
>> (i assume during dso loading concurrent access is not possible)
> 
> If that's some kind of initialization you have in mind, using nonatomics
> accesses may be fine.  We don't have strict rules around that currently.
> But if it's just a few lines of code, then simply using relaxed-MO
> atomic accesses could be fine as well.
> 

at load time the td may be set up and then it is not
written later, but with lazy binding it will be written
latter

care should be taken because the compiler generated
tls access code does not use atomics and can run
in parallel to lazy initialization.

so i guess any td->entry access in the lazy init code
path should use atomics for c11 correctness

>> (the particular case i'm trying to fix is hard because
>> the access to td->entry is generated by the compiler, so
>> it has to be worked around inside the entry function.
>> i think x86 does not have this issue)
> 
> Do you mean that you want to affect the compiler-generated code after it
> has been generated?
> 

i'm fixing the tlsdesc entry functions in libc
Torvald Riegel April 24, 2015, 5:37 p.m. UTC | #6
On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote: 
> On 22/04/15 18:14, Szabolcs Nagy wrote:
> > On 22/04/15 17:08, Torvald Riegel wrote:
> >> Given that you have looked at the code, could you give a rough estimate
> >> of how much churn it would be to make the TLS code data-race-free?
> > 
> > elf/tlsdeschtab.h:
> > _dl_tlsdesc_resolve_early_return_p
> > sysdep/{x86_64,i386,arm,aarch64}/tlsdesc.c:
> > _dl_tlsdesc_resolve_rela_fixup
> > _dl_tlsdesc_resolve_hold_fixup
> > 
> 
> i think these all need to use atomics for accessing
> td->entry for c11 correctness, but it's hard to tell
> what's right since c11 only talks about synchronizing
> with c code, not asm.

That's true, but if we can require the asm code to be conceptually
equivalent to some C11 code, then we can specify and document the
concurrent algorithm or synchronization scheme based on C11 as
foundation, and either use C11 atomics to implement it or use code the
compiler would generate for C11 atomics if implementing in asm.

That doesn't cover cases where in the asm, we would want to use
synchronization that can't be represented through C11 atomics, or that
is incompatible with C11 atomics.  But so far it doesn't look like this,
or have you observed such cases?
(consume MO might be a case, but I guess for that we could still wave
our hands and pretend compilers would simply generate the "intuitively
generated" code).

> the td->entry can be in 3 states during lazy resolution:
> 
> * init: retries the call of entry until caller==entry
>   (using a double-checked locking mechanism, then it
>   - grabs GL(dl_load_lock)
>   - changes the entry to the hold state
>   - does the resolver work
>   - changes arg and entry to the final value
>   - releases the lock to wake the waiters
>   - calls the new entry)
> * hold: waits on the GL(dl_load_lock)
>   (then retries calling the entry when it's woken up)
> * final state: (static, dynamic or undefweak callback)
>   calculates the tls offset based on arg
>   (currently without any locks which is bad on weak
>   memory models)

Could you point me to (or confirm that) a new load of td->entry happens
if caller != td->entry?  In the asm bits you posted so far, it seems
that if the call to *td->entry returns, actual TLS data is loaded.  For
example, you posted this:

  ldr x1, [x0]    // load td->entry
  blr [x0]        // call it

entryfunc:
  ldr x0, [x0,#8] // load td->arg


I'm asking because in tlsdesctab.h we have:

static int
_dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
{
  if (caller != td->entry)
    return 1;

  __rtld_lock_lock_recursive (GL(dl_load_lock));
  if (caller != td->entry)
    {
      __rtld_lock_unlock_recursive (GL(dl_load_lock));
      return 1;
....

If _dl_tlsdesc_resolve_early_return_p returns 1,
_dl_tlsdesc_resolve_rela_fixup just returns.  If the next thing we do is
load td->arg, we need another acquire-MO load for td->entry in
_dl_tlsdesc_resolve_early_return_p.  A relaxed-MO load (or the current
non-atomic access, ingoring DRF) would only be sufficient if all that
caller != atomic_load_relaxed (&td->entry) leads to is further busy
waiting (eg, calling *td->entry again).  But if we really expect
caller != td->entry to have meaning and tell us something about other
state, we need to have another barrier there to not have a typical
double-checked-locking bug.

(This shows another benefit of using atomics for all concurrent
accesses: It forces us to make a conscious decision about the memory
orders, and document why.  If a relaxed-MO access is sufficient, it
should better have a clear explanation...)

> 
> the code for tls access is generated by the compiler so
> there is no atomics there: the loaded entry can be in
> init, hold or final state, the guarantee about the state
> of arg must come from the target arch memory model.

I understood there are no C11 atomics used in the asm implementation
bits.  I'm thinking about abstract the synchronization scheme we want to
use in the implementation, represented through C11 atomics
synchronization; the asm would just do something equivalent (see my
comments about the atomics implementation being effectively part of the
ABI).

> and to answer my earlier question about the hold state:
> it is needed to avoid the spinning in the double-checked
> locking in init.
> 
> >> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> >> changes and ignoring some additional comments, very similar to the
> >> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
> >> using the C11 model) that can be used on several archs?  This would
> >> certainly decrease maintenance overhead.
> >>
> > 
> > should be possible i think: these functions are called
> > from asm to do the lazy resolution
> > 
> > but i have to check the arch specific details.
> 
> looking at this, but it seems to be significant work:
> 
> * i386 and arm determine the caller differently
> * i386 uses special calling convention from asm
> * arm handles local symbols specially

Okay.  But do they still use the same abstract synchronization scheme,
and just differ in a few aspects?  Or are the synchronization schemes
significantly different?

> i think the way to move forward is

I think step 1 should be to document the synchronization scheme on an
abstract, algorithmic level.  Using C11 atomics in this pseudo-code.  We
want to document the abstract level because it's typically much easier
to reason about the abstraction than the concrete implementation in case
of concurrent code, and because confirming that an implementation
matches the abstraction is typically also much easier than inferring the
abstraction from a concrete implementation.

So, for example, if we use a kind of double-checked locking here,
document that, show the pseudo code with C11 atomics, and then refer
back to this when documenting why certain memory orders on atomic
operations (or asm instructions) are sufficient and equivalent to the
abstract algorithm.  In our case, this would mean saying that, for
example, we need load Y to use acquire MO so that it synchronizes with
release-MO store X, making sure that a load B that is sequenced after X
reads from store A (which is a core part of double-checked locking, so
could also be done at the abstract algorithm docs and then just referred
back to from the concrete implementation code).

> * fix the correctness bug now on aarch64

OK.  When doing that, check and document equivalence to the abstract
synchronization scheme.  When you do that, please use atomic_* for all
synchronization that you add or change.
It might be easiest to also change all atomic accesses that you checked
and documented to using atomic_*; this isn't much work compared to the
checking, and it becomes obvious what has been checked.

> * decide if lazy static tls resolver is worth it
> * do the refactoring so tlsdesc is common code
> * use c11 atomics

I'd prefer C11 atomics to be used right away when fixing things;
however, if you mean transforming and checking all other TLS code
unrelated to the bug you're looking at, then I agree that this can be
done incrementally and after you fixed this bug.

> the fix for arm is a lot harder (because atomics are
> different on different versions of arm, an ifunc based
> dispatch could in principle solve the dmb vs no-dmb
> issue for the lazy resolvers).

Is the synchronization used in the asm bits on the different versions of
arm different from how C11 atomics would look like on those versions of
arm?  Or are we just dealing with different ways to implement acquire
loads, for example?
Torvald Riegel April 24, 2015, 5:40 p.m. UTC | #7
On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
> 
> On 23/04/15 22:46, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> >> (the particular case i'm trying to fix is hard because
> >> the access to td->entry is generated by the compiler, so
> >> it has to be worked around inside the entry function.
> >> i think x86 does not have this issue)
> > 
> > Do you mean that you want to affect the compiler-generated code after it
> > has been generated?
> > 
> 
> i'm fixing the tlsdesc entry functions in libc
> 

Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
asm code for TLS accesses may lack acquire barriers in some cases.  Do
you want to work around that by trying to adapt the glibc
implementation, or do you just want to fix all glibc-specific bugs, and
handle the compiler-side issues separately?
Szabolcs Nagy April 24, 2015, 9:12 p.m. UTC | #8
On 24/04/15 18:37, Torvald Riegel wrote:
> On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote: 
>> On 22/04/15 18:14, Szabolcs Nagy wrote:
> 
>> the td->entry can be in 3 states during lazy resolution:
>>
>> * init: retries the call of entry until caller==entry
>>   (using a double-checked locking mechanism, then it
>>   - grabs GL(dl_load_lock)
>>   - changes the entry to the hold state
>>   - does the resolver work
>>   - changes arg and entry to the final value
>>   - releases the lock to wake the waiters
>>   - calls the new entry)
>> * hold: waits on the GL(dl_load_lock)
>>   (then retries calling the entry when it's woken up)
>> * final state: (static, dynamic or undefweak callback)
>>   calculates the tls offset based on arg
>>   (currently without any locks which is bad on weak
>>   memory models)
> 
> Could you point me to (or confirm that) a new load of td->entry happens
> if caller != td->entry?  In the asm bits you posted so far, it seems
> that if the call to *td->entry returns, actual TLS data is loaded.  For
> example, you posted this:

if td->entry is in init state it will do the retry
(when it ends up in _dl_tlsdesc_resolve_early_return_p)

> 
>   ldr x1, [x0]    // load td->entry
>   blr [x0]        // call it
> 

this is supposed to be 'blr x1', but you got the meaning

> entryfunc:
>   ldr x0, [x0,#8] // load td->arg
> 
> 
> I'm asking because in tlsdesctab.h we have:
> 
> static int
> _dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
> {
>   if (caller != td->entry)
>     return 1;
> 
>   __rtld_lock_lock_recursive (GL(dl_load_lock));
>   if (caller != td->entry)
>     {
>       __rtld_lock_unlock_recursive (GL(dl_load_lock));
>       return 1;
> ....
> 
> If _dl_tlsdesc_resolve_early_return_p returns 1,
> _dl_tlsdesc_resolve_rela_fixup just returns.  If the next thing we do is
> load td->arg, we need another acquire-MO load for td->entry in

no, if _dl_tlsdesc_resolve_rela_fixup returns it ends up
in _dl_tlsdesc_resolve_rela (asm code) that loads td->entry again
and calls the entry again (this is the retry loop)

(so when the resolver updated the entry then the new entry is
actually called, and in case of early return there is a retry)

> _dl_tlsdesc_resolve_early_return_p.  A relaxed-MO load (or the current
> non-atomic access, ingoring DRF) would only be sufficient if all that
> caller != atomic_load_relaxed (&td->entry) leads to is further busy
> waiting (eg, calling *td->entry again).  But if we really expect
> caller != td->entry to have meaning and tell us something about other
> state, we need to have another barrier there to not have a typical
> double-checked-locking bug.
> 

this is why i said you need atomics for accessing td->entry

but this is generic code and not related to the aarch64 bug
so i'd prefer if any such cleanup were done independently

> (This shows another benefit of using atomics for all concurrent
> accesses: It forces us to make a conscious decision about the memory
> orders, and document why.  If a relaxed-MO access is sufficient, it
> should better have a clear explanation...)
> 
>>
>> the code for tls access is generated by the compiler so
>> there is no atomics there: the loaded entry can be in
>> init, hold or final state, the guarantee about the state
>> of arg must come from the target arch memory model.
> 
> I understood there are no C11 atomics used in the asm implementation
> bits.  I'm thinking about abstract the synchronization scheme we want to
> use in the implementation, represented through C11 atomics
> synchronization; the asm would just do something equivalent (see my
> comments about the atomics implementation being effectively part of the
> ABI).
> 

ok

>> and to answer my earlier question about the hold state:
>> it is needed to avoid the spinning in the double-checked
>> locking in init.
>>
>>>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>>>> changes and ignoring some additional comments, very similar to the
>>>> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
>>>> using the C11 model) that can be used on several archs?  This would
>>>> certainly decrease maintenance overhead.
>>>>
>>>
>>> should be possible i think: these functions are called
>>> from asm to do the lazy resolution
>>>
>>> but i have to check the arch specific details.
>>
>> looking at this, but it seems to be significant work:
>>
>> * i386 and arm determine the caller differently
>> * i386 uses special calling convention from asm
>> * arm handles local symbols specially
> 
> Okay.  But do they still use the same abstract synchronization scheme,
> and just differ in a few aspects?  Or are the synchronization schemes
> significantly different?
> 

the synchronization scheme is the same

but if the calling convention is non-C (regparm) then you
cannot use the same generic C code (without significant
changes somewhere).

>> i think the way to move forward is
> 
> I think step 1 should be to document the synchronization scheme on an
> abstract, algorithmic level.  Using C11 atomics in this pseudo-code.  We
> want to document the abstract level because it's typically much easier
> to reason about the abstraction than the concrete implementation in case
> of concurrent code, and because confirming that an implementation
> matches the abstraction is typically also much easier than inferring the
> abstraction from a concrete implementation.
> 

i think the documentation is the tlsdesc abi spec
(synchronization requirements can be derived from that)

i think we need to fix the bug instead of writing more documents

(you can build the theories later, it will inevitably cause a lot
of changes because the lazy dl code is far from perfect)

> So, for example, if we use a kind of double-checked locking here,
> document that, show the pseudo code with C11 atomics, and then refer
> back to this when documenting why certain memory orders on atomic
> operations (or asm instructions) are sufficient and equivalent to the
> abstract algorithm.  In our case, this would mean saying that, for
> example, we need load Y to use acquire MO so that it synchronizes with
> release-MO store X, making sure that a load B that is sequenced after X
> reads from store A (which is a core part of double-checked locking, so
> could also be done at the abstract algorithm docs and then just referred
> back to from the concrete implementation code).
> 
>> * fix the correctness bug now on aarch64
> 
> OK.  When doing that, check and document equivalence to the abstract
> synchronization scheme.  When you do that, please use atomic_* for all
> synchronization that you add or change.
> It might be easiest to also change all atomic accesses that you checked
> and documented to using atomic_*; this isn't much work compared to the
> checking, and it becomes obvious what has been checked.
> 
>> * decide if lazy static tls resolver is worth it
>> * do the refactoring so tlsdesc is common code
>> * use c11 atomics
> 
> I'd prefer C11 atomics to be used right away when fixing things;
> however, if you mean transforming and checking all other TLS code
> unrelated to the bug you're looking at, then I agree that this can be
> done incrementally and after you fixed this bug.
> 

ok i'll do the atomic_store on td->entry instead of __sync_synchronize

>> the fix for arm is a lot harder (because atomics are
>> different on different versions of arm, an ifunc based
>> dispatch could in principle solve the dmb vs no-dmb
>> issue for the lazy resolvers).
> 
> Is the synchronization used in the asm bits on the different versions of
> arm different from how C11 atomics would look like on those versions of
> arm?  Or are we just dealing with different ways to implement acquire
> loads, for example?
> 

i dont know the details of c11 atomics on arm

but i assume the emitted code depends on which cpu
you compile for (so lots of ifdefs in the asm, or
you can detect the current cpu at runtime with ifunc
which is also ugly)
Szabolcs Nagy April 24, 2015, 9:27 p.m. UTC | #9
On 24/04/15 18:40, Torvald Riegel wrote:
> On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
>>
>> On 23/04/15 22:46, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>>>> (the particular case i'm trying to fix is hard because
>>>> the access to td->entry is generated by the compiler, so
>>>> it has to be worked around inside the entry function.
>>>> i think x86 does not have this issue)
>>>
>>> Do you mean that you want to affect the compiler-generated code after it
>>> has been generated?
>>>
>>
>> i'm fixing the tlsdesc entry functions in libc
>>
> 
> Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
> asm code for TLS accesses may lack acquire barriers in some cases.  Do
> you want to work around that by trying to adapt the glibc
> implementation, or do you just want to fix all glibc-specific bugs, and
> handle the compiler-side issues separately?
> 

the compiler generated code dont need any change
(that would break the arm and aarch64 abi)

i'm fixing the issue in the libc tlsdesc entry functions
(because glibc is broken currently, very easy to make it
crash)

the fix relies on the arm memory model

the real bug is the complexity of the dynamic linker code in glibc
and the design document that specified lazy binding for tlsdesc
without considering weak memory model issues, for these there is
an easy fix but you won't like it: don't do lazy binding.
Rich Felker April 25, 2015, 4:27 a.m. UTC | #10
On Fri, Apr 24, 2015 at 10:27:36PM +0100, Szabolcs Nagy wrote:
> 
> On 24/04/15 18:40, Torvald Riegel wrote:
> > On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
> >>
> >> On 23/04/15 22:46, Torvald Riegel wrote:
> >>> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> >>>> (the particular case i'm trying to fix is hard because
> >>>> the access to td->entry is generated by the compiler, so
> >>>> it has to be worked around inside the entry function.
> >>>> i think x86 does not have this issue)
> >>>
> >>> Do you mean that you want to affect the compiler-generated code after it
> >>> has been generated?
> >>
> >> i'm fixing the tlsdesc entry functions in libc
> > 
> > Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
> > asm code for TLS accesses may lack acquire barriers in some cases.  Do
> > you want to work around that by trying to adapt the glibc
> > implementation, or do you just want to fix all glibc-specific bugs, and
> > handle the compiler-side issues separately?
> > 
> 
> the compiler generated code dont need any change
> (that would break the arm and aarch64 abi)
> 
> i'm fixing the issue in the libc tlsdesc entry functions
> (because glibc is broken currently, very easy to make it
> crash)
> 
> the fix relies on the arm memory model
> 
> the real bug is the complexity of the dynamic linker code in glibc
> and the design document that specified lazy binding for tlsdesc
> without considering weak memory model issues, for these there is
> an easy fix but you won't like it: don't do lazy binding.

Being that lazy tlsdesc binding seems to break AS-safety too, and
breaks fail-safety (there's an allocation that can fail at resolving
time) I think it should just be dropped. Then this whole issue goes
away and doesn't need complex arch-specific solutions.

Rich
Alexandre Oliva June 9, 2015, 3:43 a.m. UTC | #11
On Apr 22, 2015, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:

> - I don't understand the role of the hold function in the general
> TLSDESC design

On machines that can't overwrite the entire descriptor atomically, it is
intended to stop a thread from using a partially-relocated descriptor.
Users of the descriptor must load the function word before the operand,
and the relocator must first set the function word to hold, then modify
the operand, and then set the function to its final value.

This is supposed to ensure that users of the descriptor can tell the
relocation of the descriptor is already underway, thus the operand may
already be trashed.  I agree it's not extremely relevant for current
implementations, but the x86 TLSDESC RFC (the first that couldn't update
the descriptor atomically) describes other envisioned implementations
that could take advantage of it.
Ondrej Bilka July 11, 2015, 10:16 a.m. UTC | #12
On Wed, Apr 22, 2015 at 01:27:15PM +0100, Szabolcs Nagy wrote:
> Other thoughts:
> 
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries for a
> translation unit (zero and non-zero initialized TLS), so there has to be
> a lot of object files with static TLS linked into a shared object to
> make the load time relocation overhead significant. Unless there is some
> evidence that lazy static TLS relocation makes sense I would suggest
> removing that logic (from all archs). (The dynamic case is probably a
> similar micro-optimization, but there the lazy vs non-lazy semantics are
> different in case of undefined symbols and some applications may depend
> on that).
>
I agree with this. As undefined symbols there is bug that we segfault
with weak tls variable so I doubt that anyone depends on that.

You didn't mention one essential argument in your analysis as overhead
is caused by unused tls variables. When variable is used then you need
to do relocation anyway and lazy one could be mangitude more expensive
than eager.

There is project on my backlog to improve tls access
in libraries which is still slow even with gnu2 so it would need
introduce -mtls-dialect=gnu3.

Eager binding of TLS is prerequisite. Now tls models are flawed design
as they try to save space by being lazy without any evidence that there
is application that uses it.

Main idea that total tls usage is for most application less than 65536
bytes. So unless application uses more we could preallocate that when
loading dso. That makes a code for dynamic library require only one
extra read versus tls access in main binary when you use following.

static int foo_offset;
#define foo *(foo_offset > 0 ? tcb + foo_offset: get_tls_ptr (- foo_offset))
 
Here you could do lazy allocation in get_tls_ptr without taking lock.
There is bug that when you do allocation in signal handler it calls
malloc which could result in deadlock, there was patch to use mmap to
fix it which was reverted.
diff mbox

Patch

diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..ff74b52 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,25 @@  _dl_tlsdesc_return:
 	cfi_endproc
 	.size	_dl_tlsdesc_return, .-_dl_tlsdesc_return
 
+	/* Same as _dl_tlsdesc_return but with synchronization for
+	   lazy relocation.
+	   Prototype:
+	   _dl_tlsdesc_return_lazy (tlsdesc *) ;
+	 */
+	.hidden _dl_tlsdesc_return_lazy
+	.global	_dl_tlsdesc_return_lazy
+	.type	_dl_tlsdesc_return_lazy,%function
+	cfi_startproc
+	.align 2
+_dl_tlsdesc_return_lazy:
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
+	ldr	x0, [x0, #8]
+	RET
+	cfi_endproc
+	.size	_dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
 	/* Handler for undefined weak TLS symbols.
 	   Prototype:
 	   _dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +115,9 @@  _dl_tlsdesc_return:
 _dl_tlsdesc_undefweak:
 	str	x1, [sp, #-16]!
 	cfi_adjust_cfa_offset(16)
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
 	ldr	x0, [x0, #8]
 	mrs	x1, tpidr_el0
 	sub	x0, x0, x1
@@ -152,6 +174,9 @@  _dl_tlsdesc_dynamic:
 	stp	x3,  x4, [sp, #32+16*1]
 
 	mrs	x4, tpidr_el0
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation. */
+	ldar	xzr, [x0]
 	ldr	x1, [x0,#8]
 	ldr	x0, [x4]
 	ldr	x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@  extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_return (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_undefweak (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@  _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Barrier so readers see the write above before the one below. */
+      __sync_synchronize ();
       td->entry = _dl_tlsdesc_undefweak;
     }
   else
@@ -98,6 +100,8 @@  _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
 					      + reloc->r_addend);
+	  /* Barrier so readers see the write above before the one below. */
+	  __sync_synchronize ();
 	  td->entry = _dl_tlsdesc_dynamic;
 	}
       else
@@ -105,7 +109,9 @@  _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = (void*) (sym->st_value + result->l_tls_offset
 			     + reloc->r_addend);
-	  td->entry = _dl_tlsdesc_return;
+	  /* Barrier so readers see the write above before the one below. */
+	  __sync_synchronize ();
+	  td->entry = _dl_tlsdesc_return_lazy;
 	}
     }