| Message ID | 553793A3.7030206@arm.com |
|---|---|
| State | Superseded |
| Headers | |
Message-ID: <553793A3.7030206@arm.com>
Date: Wed, 22 Apr 2015 13:27:15 +0100
From: Szabolcs Nagy <szabolcs.nagy@arm.com>
To: libc-alpha@sourceware.org
CC: Marcus Shawcroft <marcus.shawcroft@arm.com>, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>
Subject: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
List-Id: <libc-alpha.sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
Commit Message
Szabolcs Nagy
April 22, 2015, 12:27 p.m. UTC
Lazy TLSDESC initialization needs to be synchronized with concurrent TLS accesses.

The original ARM TLSDESC design did not discuss this race that arises on systems with weak memory order guarantees:

http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt

"Storing the final value of the TLS descriptor also needs care: the resolver field must be set to its final value after the argument gets its final value, such that if any thread attempts to use the descriptor before it gets its final value, it still goes to the hold function."

The current glibc code (i386, x86_64, arm, aarch64) is

  td->arg = ...;
  td->entry = ...;

The ARM memory model allows store-store reordering, so a barrier is needed between these two stores. (The code is broken on x86 as well in principle: the x86 memory model does not help at the C code level, since the compiler can reorder non-aliasing stores.)

What is missing from the TLSDESC design spec is a description of the ordering requirements on the read side: the TLS access sequence (generated by the compiler) loads the descriptor function pointer (td->entry), and then the argument (td->arg) is loaded in that function. These two loads must observe the changes made on the write side in a sequentially consistent way.

The code in asm:

  ldr x1, [x0]    // load td->entry
  blr x1          // call it

entryfunc:
  ldr x0, [x0,#8] // load td->arg

The branch (blr) does not provide a load-load ordering barrier, so the second load may observe an old arg even though the first load observed the new entry. This happens on existing AArch64 hardware, causing invalid memory access due to the incorrect TLS offset.
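The race can be modeled with C11 atomics. This is a sketch, not glibc code; the struct and function names are invented for illustration. The key point is that the call-site load of td->entry needs acquire semantics, which is what the extra `ldar` in option (3) below retrofits at the asm level:

```c
/* A C11 model of the race just described -- invented names, not glibc code. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct tlsdesc_model
{
  /* td->entry: resolver function pointer, loaded by compiler-generated code. */
  _Atomic (ptrdiff_t (*) (struct tlsdesc_model *)) entry;
  /* td->arg: loaded inside the entry function. */
  ptrdiff_t arg;
};

/* Final-state entry function (plays the role of _dl_tlsdesc_return).  */
static ptrdiff_t
model_entry_final (struct tlsdesc_model *td)
{
  /* Plain load: its ordering must come from the call site (or from the
     extra same-address acquire load that option (3) below inserts).  */
  return td->arg;
}

/* The compiler-generated access sequence, with the acquire semantics
   that the ldar in option (3) provides in the real asm.  */
static ptrdiff_t
model_tls_access (struct tlsdesc_model *td)
{
  ptrdiff_t (*fn) (struct tlsdesc_model *)
    = atomic_load_explicit (&td->entry, memory_order_acquire);
  return fn (td);
}
```

In the real generated code no C11 atomics are available, which is why the ordering has to be recovered inside the entry function.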
Various alternatives were considered to force the load ordering in the descriptor function:

(1) barrier

entryfunc:
  dmb ishld
  ldr x0, [x0,#8]

(2) address dependency (if the address of the second load depends on the result of the first one, the ordering is guaranteed):

entryfunc:
  ldr x1,[x0]
  and x1,x1,#8
  orr x1,x1,#8
  ldr x0,[x0,x1]

(3) load-acquire (ARMv8 instruction that is ordered before subsequent loads and stores)

entryfunc:
  ldar xzr,[x0]
  ldr x0,[x0,#8]

Option (1) is the simplest but slowest (note: this runs at every TLS access). Options (2) and (3) do one extra load from [x0] (same-address loads are ordered, so it happens-after the load on the call site). Option (2) clobbers x1, so I think (3) is the best solution: a load into the zero register has the same effect as with a normal register, but the result is discarded so nothing is clobbered. (Note that the TLSDESC ABI allows the descriptor function to clobber x1, but current gcc does not implement this correctly; gcc will be fixed independently. The dynamic and undefweak descriptors currently save/restore x1, so only static TLS would be affected by this clobbering issue.)

On the write side I used a full barrier (__sync_synchronize) although

  dmb ishst

would be enough, but the write side is not performance critical.

I introduced a new _dl_tlsdesc_return_lazy descriptor function for lazily relocated static TLS, so the non-lazy use-case is not affected. The dynamic and undefweak descriptors do enough work that the additional ldar should not have a performance impact.

Other thoughts:

- Lazy binding for static TLS may be unnecessary complexity: it seems that gcc generates at most two static TLSDESC relocation entries per translation unit (zero- and non-zero-initialized TLS), so a lot of object files with static TLS would have to be linked into a shared object to make the load-time relocation overhead significant. Unless there is some evidence that lazy static TLS relocation makes sense, I would suggest removing that logic (from all archs). (The dynamic case is probably a similar micro-optimization, but there the lazy vs non-lazy semantics differ in case of undefined symbols, and some applications may depend on that.)

- 32-bit ARM with -mtls-dialect=gnu2 is affected too; it will have to go with option (1) or (2), with an additional twist: some existing ARMv7 cpus don't implement same-address load ordering reliably, see:
http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

- I don't understand the role of the hold function in the general TLSDESC design: why is it not enough to let other threads wait on the global lock in the initial resolver function? Is the global dl lock implemented as a spin lock? Is it for some liveness/fairness property?

- There are some incorrect uses of cfi_adjust_cfa_offset in dl-tlsdesc.S, which is a separate patch.

Changelog:

2015-04-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #18034]
	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
	(_dl_tlsdesc_dynamic): Likewise.
	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup):
	Add barrier between stores.
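For reference, the write-side requirement can be sketched in C11 terms (invented names, not the actual glibc code): the patch itself uses __sync_synchronize between plain stores, which a C11 release fence expresses equivalently.

```c
/* Write-side sketch (hypothetical names): td->arg must be globally
   visible before td->entry points at the final entry function.  */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct desc_model
{
  _Atomic (ptrdiff_t (*) (struct desc_model *)) entry;
  ptrdiff_t arg;
};

/* Final-state entry function used after lazy resolution.  */
static ptrdiff_t
desc_final (struct desc_model *td)
{
  return td->arg;
}

static void
model_fixup (struct desc_model *td, ptrdiff_t arg,
             ptrdiff_t (*entry) (struct desc_model *))
{
  td->arg = arg;
  /* The store above is ordered before the store below; on AArch64 a
     release fence typically compiles to "dmb ish", stronger than the
     minimal "dmb ishst" mentioned above.  */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&td->entry, entry, memory_order_relaxed);
}
```

A release store to entry (with arg stored first) would express the same requirement without a separate fence.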
Comments
On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
> accesses.
>
> The original ARM TLSDESC design did not discuss this race that
> arises on systems with weak memory order guarantees
>
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
>
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
>
> The current glibc code (i386, x86_64, arm, aarch64) is
>
> td->arg = ...;
> td->entry = ...;
>
> the arm memory model allows store-store reordering, so a barrier is
> needed between these two stores (the code is broken on x86 as well in
> principle: x86 memory model does not help on the c code level, the
> compiler can reorder non-aliasing stores).

Yes, this is not guaranteed to work. This should thus be fixed for all the architectures.

> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry) and then the argument is loaded (td->arg) in that function.
> These two loads must observe the changes made on the write side in a
> sequentially consistent way.

Not strictly in a sequentially consistent way, I believe. What you are probably looking for (I'm not familiar with TLS in detail) is some invariant like: if you read update X, you will also read update Y or a more recent update to Y.
> The code in asm:
>
>   ldr x1, [x0]    // load td->entry
>   blr x1          // call it
>
> entryfunc:
>   ldr x0, [x0,#8] // load td->arg
>
> The branch (blr) does not provide a load-load ordering barrier (so the
> second load may observe an old arg even though the first load observed
> the new entry, this happens with existing aarch64 hardware causing
> invalid memory access due to the incorrect TLS offset).
>
> Various alternatives were considered to force the load ordering in the
> descriptor function:
>
> (1) barrier
>
> entryfunc:
>   dmb ishld
>   ldr x0, [x0,#8]
>
> (2) address dependency (if the address of the second load depends on the
> result of the first one the ordering is guaranteed):
>
> entryfunc:
>   ldr x1,[x0]
>   and x1,x1,#8
>   orr x1,x1,#8
>   ldr x0,[x0,x1]

The address dependencies could be useful in terms of performance, but the disadvantage that I see is that there's currently no useful way to use them in C code (C11's memory_order_consume is broken).

> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> loads and stores)
>
> entryfunc:
>   ldar xzr,[x0]
>   ldr x0,[x0,#8]
>
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access), options (2) and (3) do one extra load from [x0] (same address
> loads are ordered so it happens-after the load on the call site),
> option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register has the same effect as with a normal register,
> but the result is discarded so nothing is clobbered. Note that the
> TLSDESC abi allows the descriptor function to clobber x1, but current
> gcc does not implement this correctly, gcc will be fixed independently,
> the dynamic and undefweak descriptors currently save/restore x1 so only
> static TLS would be affected by this clobbering issue).
>
> On the write side I used a full barrier (__sync_synchronize) although
>
> dmb ishst
>
> would be enough, but write side is not performance critical.
Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
Fixes to existing code should use the C11 memory model for concurrent code, especially if the fix is about a concurrency issue.

I agree with Adhemerval that a release MO store seems to be sufficient in this case.

I haven't scanned through the TLS code to assess how much work it would be to make it data-race-free (as defined by C11), but it would certainly be helpful in showing similar issues (e.g., when documenting it) and preventing compiler optimizations that are legal for non-concurrent accesses but not what we need for TLS (e.g., speculative loads introduced by the compiler on x86 in the case you mention above).

Given that you have looked at the code, could you give a rough estimate of how much churn it would be to make the TLS code data-race-free?

It also looks as if the x86_64 variant of tlsdesc.c is, before your changes and ignoring some additional comments, very similar to the aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and using the C11 model) that can be used on several archs? This would certainly decrease maintenance overhead.

I'm aware this might look like I'm requesting extra work that is not at the core of what you are trying to fix. However, moving to portable concurrent code that is shared across several archs is something we've been doing elsewhere too, and in those cases ARM was on the benefiting side (e.g., pthread_once, ongoing work in nscd, ...). I think overall, some extra work and clean-up will be a good thing for ARM archs too.

Thanks.
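The invariant named above ("if you read update X, you will also read update Y or a more recent update to Y") is the classic message-passing pattern, which C11 release/acquire pairs guarantee. A minimal runnable sketch with invented names (`ready` plays the role of td->entry, `payload` of td->arg); it assumes POSIX threads are available:

```c
/* Message-passing sketch: release/acquire gives the X-implies-Y invariant. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic int ready;  /* plays the role of td->entry */
static int payload;        /* plays the role of td->arg   */

static void *
writer (void *arg)
{
  (void) arg;
  payload = 42;                                            /* store Y */
  atomic_store_explicit (&ready, 1, memory_order_release); /* store X */
  return NULL;
}

/* Spins until the release store is visible; the acquire load then
   guarantees that the payload store is visible too.  */
static int
reader (void)
{
  while (atomic_load_explicit (&ready, memory_order_acquire) == 0)
    ; /* spin */
  return payload;
}

static int
run_message_passing_demo (void)
{
  pthread_t t;
  pthread_create (&t, NULL, writer, NULL);
  int v = reader ();
  pthread_join (t, NULL);
  return v;
}
```

On AArch64 the acquire load in `reader` is exactly what the `ldar` in the proposed fix provides.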
On 22/04/15 17:08, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>> (2) address dependency (if the address of the second load depends on the
>> result of the first one the ordering is guaranteed):
>>
>> entryfunc:
>>   ldr x1,[x0]
>>   and x1,x1,#8
>>   orr x1,x1,#8
>>   ldr x0,[x0,x1]
>
> The address dependencies could be useful in terms of performance, but
> the disadvantage that I see is that there's currently no useful way to
> use them in C code (C11's memory_order_consume is broken).

this must be in asm so i don't think it matters if consume is broken in c11. (tlsdesc resolver cannot clobber registers).

>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>> loads and stores)
>>
>> entryfunc:
>>   ldar xzr,[x0]
>>   ldr x0,[x0,#8]
>>
>> Option (1) is the simplest but slowest (note: this runs at every TLS
>> access), options (2) and (3) do one extra load from [x0] (same address
>> loads are ordered so it happens-after the load on the call site),
>> option (2) clobbers x1, so I think (3) is the best solution (a load
>> into the zero register has the same effect as with a normal register,
>> but the result is discarded so nothing is clobbered. Note that the
>> TLSDESC abi allows the descriptor function to clobber x1, but current
>> gcc does not implement this correctly, gcc will be fixed independently,
>> the dynamic and undefweak descriptors currently save/restore x1 so only
>> static TLS would be affected by this clobbering issue).
>>
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>> dmb ishst
>>
>> would be enough, but write side is not performance critical.
>
> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> Fixes to existing code should use the C11 memory model for concurrent
> code, especially if the fix is about a concurrency issue.
>
> I agree with Adhemerval that a release MO store seems to be sufficient
> in this case.
>

yes, the write side could use c11 atomics and i agree that store-release is enough, but i'm not sure how much one can rely on c11 memory model when interacting with asm code. (so i thought a full barrier is the easiest to reason about)

i can update the code to use glibc atomics, but i will have to look into how those work (are there type issues or are they generic?)

> I haven't scanned through the TLS code to assess how much work it would
> be to make it data-race-free (as defined by C11), but it would certainly
> be helpful in showing similar issues (e.g., when documenting it) and
> preventing compiler optimizations that are legal for non-concurrent
> accesses but not what we need for TLS (e.g., speculative loads
> introduced by the compiler on x86 in the case you mention above).
>
> Given that you have looked at the code, could you give a rough estimate
> of how much churn it would be to make the TLS code data-race-free?

i think td->entry should only be accessed with atomics once the loader finished loading the dso with it. (i assume during dso loading concurrent access is not possible)

(the particular case i'm trying to fix is hard because the access to td->entry is generated by the compiler, so it has to be worked around inside the entry function. i think x86 does not have this issue)

i think these functions might need some attention:

elf/tlsdeschtab.h:
  _dl_tlsdesc_resolve_early_return_p
sysdeps/{x86_64,i386,arm,aarch64}/tlsdesc.c:
  _dl_tlsdesc_resolve_rela_fixup
  _dl_tlsdesc_resolve_hold_fixup

but i haven't done a detailed analysis

> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> changes and ignoring some additional comments, very similar to the
> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
> using the C11 model) that can be used on several archs? This would
> certainly decrease maintenance overhead.
>

should be possible i think: these functions are called from asm to do the lazy resolution but i have to check the arch specific details.

> I'm aware this might look like I'm requesting extra work that is not at
> the core of what you are trying to fix. However, moving to portable
> concurrent code that is shared across several archs is something we've
> been doing elsewhere too, and in those cases ARM was on the benefiting
> side (e.g., pthread_once, ongoing work in nscd, ...). I think overall,
> some extra work and clean-up will be a good thing for ARM archs too.
>
> Thanks.
On 22/04/15 18:14, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
>> Given that you have looked at the code, could you give a rough estimate
>> of how much churn it would be to make the TLS code data-race-free?
>
> elf/tlsdeschtab.h:
>   _dl_tlsdesc_resolve_early_return_p
> sysdeps/{x86_64,i386,arm,aarch64}/tlsdesc.c:
>   _dl_tlsdesc_resolve_rela_fixup
>   _dl_tlsdesc_resolve_hold_fixup

i think these all need to use atomics for accessing td->entry for c11 correctness, but it's hard to tell what's right since c11 only talks about synchronizing with c code, not asm.

the td->entry can be in 3 states during lazy resolution:

* init: retries the call of the entry until caller==entry (using a double-checked locking mechanism), then it
  - grabs GL(dl_load_lock)
  - changes the entry to the hold state
  - does the resolver work
  - changes arg and entry to the final value
  - releases the lock to wake the waiters
  - calls the new entry
* hold: waits on the GL(dl_load_lock) (then retries calling the entry when it's woken up)
* final state: (static, dynamic or undefweak callback) calculates the tls offset based on arg (currently without any locks, which is bad on weak memory models)

the code for tls access is generated by the compiler so there are no atomics there: the loaded entry can be in init, hold or final state, and the guarantee about the state of arg must come from the target arch memory model.

and to answer my earlier question about the hold state: it is needed to avoid the spinning in the double-checked locking in init.

>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>> changes and ignoring some additional comments, very similar to the
>> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
>> using the C11 model) that can be used on several archs? This would
>> certainly decrease maintenance overhead.
>>
>
> should be possible i think: these functions are called
> from asm to do the lazy resolution
>
> but i have to check the arch specific details.

looking at this, but it seems to be significant work:

* i386 and arm determine the caller differently
* i386 uses special calling convention from asm
* arm handles local symbols specially

i think the way to move forward is:

* fix the correctness bug now on aarch64
* decide if lazy static tls resolver is worth it
* do the refactoring so tlsdesc is common code
* use c11 atomics

the fix for arm is a lot harder (because atomics are different on different versions of arm, an ifunc based dispatch could in principle solve the dmb vs no-dmb issue for the lazy resolvers).

is this ok?
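The three-state scheme (init/hold/final) described in the previous message can be written as double-checked locking in C11 terms. This is a simplified model with invented names, not the glibc implementation: the real code also performs the caller==entry retry check before taking the lock, and uses the rtld recursive lock rather than a plain mutex.

```c
/* Model of the init -> hold -> final lazy-resolution state machine. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct lazy_desc;
typedef ptrdiff_t (*entry_fn) (struct lazy_desc *);

struct lazy_desc
{
  _Atomic entry_fn entry;
  ptrdiff_t arg;
};

/* Plays the role of GL(dl_load_lock).  */
static pthread_mutex_t load_lock = PTHREAD_MUTEX_INITIALIZER;

/* Final state: compute the offset from arg (no locking).  */
static ptrdiff_t
desc_static (struct lazy_desc *td)
{
  return td->arg;
}

static ptrdiff_t desc_init (struct lazy_desc *td);

/* Hold state: wait on the lock, then retry whatever entry is current.  */
static ptrdiff_t
desc_hold (struct lazy_desc *td)
{
  pthread_mutex_lock (&load_lock);
  pthread_mutex_unlock (&load_lock);
  entry_fn fn = atomic_load_explicit (&td->entry, memory_order_acquire);
  return fn (td);
}

/* Init state: double-checked locking, resolve, publish the final state.  */
static ptrdiff_t
desc_init (struct lazy_desc *td)
{
  pthread_mutex_lock (&load_lock);
  if (atomic_load_explicit (&td->entry, memory_order_relaxed) == desc_init)
    {
      /* Send concurrent callers to the hold state while we work.  */
      atomic_store_explicit (&td->entry, desc_hold, memory_order_relaxed);
      td->arg = 42;  /* "resolver work": compute the TLS offset.  */
      /* Publish: arg must be visible before the final entry.  */
      atomic_store_explicit (&td->entry, desc_static, memory_order_release);
    }
  pthread_mutex_unlock (&load_lock);
  entry_fn fn = atomic_load_explicit (&td->entry, memory_order_acquire);
  return fn (td);
}
```

The hold state exists, as noted above, so that waiters block on the lock instead of spinning on the entry pointer.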
On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> >> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> >> loads and stores)
> >>
> >> entryfunc:
> >>   ldar xzr,[x0]
> >>   ldr x0,[x0,#8]
> >>
> >> Option (1) is the simplest but slowest (note: this runs at every TLS
> >> access), options (2) and (3) do one extra load from [x0] (same address
> >> loads are ordered so it happens-after the load on the call site),
> >> option (2) clobbers x1, so I think (3) is the best solution (a load
> >> into the zero register has the same effect as with a normal register,
> >> but the result is discarded so nothing is clobbered. Note that the
> >> TLSDESC abi allows the descriptor function to clobber x1, but current
> >> gcc does not implement this correctly, gcc will be fixed independently,
> >> the dynamic and undefweak descriptors currently save/restore x1 so only
> >> static TLS would be affected by this clobbering issue).
> >>
> >> On the write side I used a full barrier (__sync_synchronize) although
> >>
> >> dmb ishst
> >>
> >> would be enough, but write side is not performance critical.
> >
> > Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> > Fixes to existing code should use the C11 memory model for concurrent
> > code, especially if the fix is about a concurrency issue.
> >
> > I agree with Adhemerval that a release MO store seems to be sufficient
> > in this case.
> >
>
> yes, the write side could use c11 atomics and i agree that
> store-release is enough, but i'm not sure how much one can
> rely on c11 memory model when interacting with asm code.
> (so i thought a full barrier is the easiest to reason about)

IIUC, the asm that you need to interact with is generated by the compiler.
The C11 atomics implementation (that is, the asm / HW features used in it) is effectively part of the ABI; C11 translation units need to be able to synchronize with each other even if compiled by different compiler versions. Therefore, making sure that the asm follows what the compiler would generate for the intended C11 atomics equivalent is a solution. That also means that we can say clearly, in C11 memory model terms, what the asm implementation on an arch should do.

> i can update the code to use glibc atomics, but i will
> have to look into how those work (are there type issues
> or are they generic?)

They are basically generic, and internally use either the __atomic builtins for newer compilers and archs that provide them, or (fall back to) the previous implementation based on __sync builtins or custom asm. One constraint is that we don't support sizes of atomic access that aren't actually provided on the particular arch (either through HW instructions, or with kernel helpers). So, for example, atomic access to a naturally aligned pointer-sized type can be expected to work.

> I haven't scanned through the TLS code to assess how much work it would
> be to make it data-race-free (as defined by C11), but it would certainly
> be helpful in showing similar issues (e.g., when documenting it) and
> preventing compiler optimizations that are legal for non-concurrent
> accesses but not what we need for TLS (e.g., speculative loads
> introduced by the compiler on x86 in the case you mention above).
>
> Given that you have looked at the code, could you give a rough estimate
> of how much churn it would be to make the TLS code data-race-free?
>
> i think td->entry should only be accessed with atomics once
> the loader finished loading the dso with it.
> (i assume during dso loading concurrent access is not possible)

If that's some kind of initialization you have in mind, using nonatomic accesses may be fine.
We don't have strict rules around that currently. But if it's just a few lines of code, then simply using relaxed-MO atomic accesses could be fine as well.

> (the particular case i'm trying to fix is hard because
> the access to td->entry is generated by the compiler, so
> it has to be worked around inside the entry function.
> i think x86 does not have this issue)

Do you mean that you want to affect the compiler-generated code after it has been generated?
On 23/04/15 22:46, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 17:08, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>>>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>>>> loads and stores)
>>>>
>>>> entryfunc:
>>>>   ldar xzr,[x0]
>>>>   ldr x0,[x0,#8]
>>>>
>>>> Option (1) is the simplest but slowest (note: this runs at every TLS
>>>> access), options (2) and (3) do one extra load from [x0] (same address
>>>> loads are ordered so it happens-after the load on the call site),
>>>> option (2) clobbers x1, so I think (3) is the best solution (a load
>>>> into the zero register has the same effect as with a normal register,
>>>> but the result is discarded so nothing is clobbered. Note that the
>>>> TLSDESC abi allows the descriptor function to clobber x1, but current
>>>> gcc does not implement this correctly, gcc will be fixed independently,
>>>> the dynamic and undefweak descriptors currently save/restore x1 so only
>>>> static TLS would be affected by this clobbering issue).
>>>>
>>>> On the write side I used a full barrier (__sync_synchronize) although
>>>>
>>>> dmb ishst
>>>>
>>>> would be enough, but write side is not performance critical.
>>>
>>> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
>>> Fixes to existing code should use the C11 memory model for concurrent
>>> code, especially if the fix is about a concurrency issue.
>>>
>>> I agree with Adhemerval that a release MO store seems to be sufficient
>>> in this case.
>>>
>>
>> yes, the write side could use c11 atomics and i agree that
>> store-release is enough, but i'm not sure how much one can
>> rely on c11 memory model when interacting with asm code.
>> (so i thought a full barrier is the easiest to reason about)
>
> IIUC, the asm that you need to interact with is generated by the
> compiler.
> The C11 atomics implementation (that is, the asm / HW

to clarify:

// this bit is generated by the compiler for every tls access
// (the exact sequence is different and depends on tiny/small/large
// memory model, it is part of the abi so linkers can do relaxations)
// assume x0 is a pointer to the tls descriptor (td) in the GOT.
// (2 consecutive GOT entries are used: td->entry and td->arg)
// the entry function returns an offset (so tls data is at tp+off)
// note: no barriers are used here

  mrs x2, tpidr_el0  // load the thread pointer
  ldr x1, [x0]       // load td->entry
  blr x1             // call it, returns off in x0
  ldr x0, [x2, x0]   // access tls at tp+off

// this is a function implemented by the libc where td->entry points
// currently the options are
//   _dl_tlsdesc_return,
//   _dl_tlsdesc_undefweak,
//   _dl_tlsdesc_dynamic
// originally td->entry points to a trampoline that calls into
// _dl_tlsdesc_resolve_rela which does the lazy relocation and sets
// td->entry to _dl_tlsdesc_hold temporarily.

entryfunc:
  ldr x0, [x0,#8]    // load td->arg

>> i can update the code to use glibc atomics, but i will
>> have to look into how those work (are there type issues
>> or are they generic?)
>
> They are basically generic, and internally use either the __atomic
> builtins for newer compilers and archs that provide them, or (fall back
> to) the previous implementation based on __sync builtins or custom asm.
> One constraint is that we don't support sizes of atomic access that
> aren't actually provided on the particular arch (either through HW
> instructions, or with kernel helpers). So, for example, atomic access
> to a naturally aligned pointer-sized type can be expected to work.

ok

>> i think td->entry should only be accessed with atomics once
>> the loader finished loading the dso with it.
>> (i assume during dso loading concurrent access is not possible)
>
> If that's some kind of initialization you have in mind, using nonatomic
> accesses may be fine.
> We don't have strict rules around that currently.
> But if it's just a few lines of code, then simply using relaxed-MO
> atomic accesses could be fine as well.
>

at load time the td may be set up and then it is not written later, but with lazy binding it will be written later.

care should be taken because the compiler generated tls access code does not use atomics and can run in parallel to lazy initialization. so i guess any td->entry access in the lazy init code path should use atomics for c11 correctness.

>> (the particular case i'm trying to fix is hard because
>> the access to td->entry is generated by the compiler, so
>> it has to be worked around inside the entry function.
>> i think x86 does not have this issue)
>
> Do you mean that you want to affect the compiler-generated code after it
> has been generated?
>

i'm fixing the tlsdesc entry functions in libc
On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote:
> On 22/04/15 18:14, Szabolcs Nagy wrote:
> > On 22/04/15 17:08, Torvald Riegel wrote:
> >> Given that you have looked at the code, could you give a rough estimate
> >> of how much churn it would be to make the TLS code data-race-free?
> >
> > elf/tlsdeschtab.h:
> >   _dl_tlsdesc_resolve_early_return_p
> > sysdeps/{x86_64,i386,arm,aarch64}/tlsdesc.c:
> >   _dl_tlsdesc_resolve_rela_fixup
> >   _dl_tlsdesc_resolve_hold_fixup
> >
>
> i think these all need to use atomics for accessing
> td->entry for c11 correctness, but it's hard to tell
> what's right since c11 only talks about synchronizing
> with c code, not asm.

That's true, but if we can require the asm code to be conceptually equivalent to some C11 code, then we can specify and document the concurrent algorithm or synchronization scheme based on C11 as a foundation, and either use C11 atomics to implement it, or use the code the compiler would generate for C11 atomics if implementing in asm.

That doesn't cover cases where, in the asm, we would want to use synchronization that can't be represented through C11 atomics, or that is incompatible with C11 atomics. But so far it doesn't look like this, or have you observed such cases? (consume MO might be a case, but I guess for that we could still wave our hands and pretend compilers would simply generate the "intuitively generated" code).
> the td->entry can be in 3 states during lazy resolution:
>
> * init: retries the call of the entry until caller==entry
>   (using a double-checked locking mechanism), then it
>   - grabs GL(dl_load_lock)
>   - changes the entry to the hold state
>   - does the resolver work
>   - changes arg and entry to the final value
>   - releases the lock to wake the waiters
>   - calls the new entry
> * hold: waits on the GL(dl_load_lock)
>   (then retries calling the entry when it's woken up)
> * final state: (static, dynamic or undefweak callback)
>   calculates the tls offset based on arg
>   (currently without any locks, which is bad on weak
>   memory models)

Could you point me to (or confirm that) a new load of td->entry happens if caller != td->entry? In the asm bits you posted so far, it seems that if the call to *td->entry returns, actual TLS data is loaded. For example, you posted this:

  ldr x1, [x0]    // load td->entry
  blr x1          // call it

entryfunc:
  ldr x0, [x0,#8] // load td->arg

I'm asking because in tlsdeschtab.h we have:

static int
_dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
{
  if (caller != td->entry)
    return 1;

  __rtld_lock_lock_recursive (GL(dl_load_lock));
  if (caller != td->entry)
    {
      __rtld_lock_unlock_recursive (GL(dl_load_lock));
      return 1;
  ....

If _dl_tlsdesc_resolve_early_return_p returns 1, _dl_tlsdesc_resolve_rela_fixup just returns. If the next thing we do is load td->arg, we need another acquire-MO load of td->entry in _dl_tlsdesc_resolve_early_return_p. A relaxed-MO load (or the current non-atomic access, ignoring DRF) would only be sufficient if all that caller != atomic_load_relaxed (&td->entry) leads to is further busy waiting (e.g., calling *td->entry again). But if we really expect caller != td->entry to have meaning and tell us something about other state, we need to have another barrier there to not have a typical double-checked-locking bug.
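A sketch of this suggestion with simplified, invented types (the real function takes a struct tlsdesc volatile * and uses the rtld recursive lock): both early-return checks use acquire loads, so a caller that returns early also observes the td->arg published together with the entry it saw.

```c
/* Double-checked locking with acquire loads on both checks -- a model
   of the suggested fix, not the glibc code.  */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

struct desc
{
  _Atomic (void *) entry;
  void *arg;
};

/* Plays the role of GL(dl_load_lock).  */
static pthread_mutex_t dl_load_lock = PTHREAD_MUTEX_INITIALIZER;

static int
early_return_p (struct desc *td, void *caller)
{
  /* Acquire: if we see a newer entry, we must also see the newer arg.  */
  if (caller != atomic_load_explicit (&td->entry, memory_order_acquire))
    return 1;

  pthread_mutex_lock (&dl_load_lock);
  /* Second check also needs acquire, otherwise a thread that returns
     early here could still read a stale arg afterwards.  */
  if (caller != atomic_load_explicit (&td->entry, memory_order_acquire))
    {
      pthread_mutex_unlock (&dl_load_lock);
      return 1;
    }
  return 0;  /* Caller still owns the resolution; lock stays held.  */
}
```

As in the real code, the lock is intentionally left held when 0 is returned; the resolver releases it after publishing the final descriptor.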
(This shows another benefit of using atomics for all concurrent accesses: it forces us to make a conscious decision about the memory orders, and document why. If a relaxed-MO access is sufficient, it should better have a clear explanation...)

> the code for tls access is generated by the compiler so
> there are no atomics there: the loaded entry can be in
> init, hold or final state, the guarantee about the state
> of arg must come from the target arch memory model.

I understood there are no C11 atomics used in the asm implementation bits. I'm thinking about abstracting the synchronization scheme we want to use in the implementation, represented through C11 atomics synchronization; the asm would just do something equivalent (see my comments about the atomics implementation being effectively part of the ABI).

> and to answer my earlier question about the hold state:
> it is needed to avoid the spinning in the double-checked
> locking in init.
>
>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>> changes and ignoring some additional comments, very similar to the
>> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
>> using the C11 model) that can be used on several archs? This would
>> certainly decrease maintenance overhead.
>>
>
> should be possible i think: these functions are called
> from asm to do the lazy resolution
>
> but i have to check the arch specific details.
>
> looking at this, but it seems to be significant work:
>
> * i386 and arm determine the caller differently
> * i386 uses special calling convention from asm
> * arm handles local symbols specially

Okay. But do they still use the same abstract synchronization scheme, and just differ in a few aspects? Or are the synchronization schemes significantly different?

> i think the way to move forward is

I think step 1 should be to document the synchronization scheme on an abstract, algorithmic level, using C11 atomics in the pseudo-code.
We want to document the abstract level because it's typically much
easier to reason about the abstraction than the concrete implementation
in the case of concurrent code, and because confirming that an
implementation matches the abstraction is typically also much easier
than inferring the abstraction from a concrete implementation.

So, for example, if we use a kind of double-checked locking here,
document that, show the pseudo-code with C11 atomics, and then refer
back to this when documenting why certain memory orders on atomic
operations (or asm instructions) are sufficient and equivalent to the
abstract algorithm.  In our case, this would mean saying that, for
example, we need load Y to use acquire MO so that it synchronizes with
release-MO store X, making sure that a load B that is sequenced after Y
observes store A (which is a core part of double-checked locking, so
this could also be covered in the abstract algorithm docs and then just
referred back to from the concrete implementation code).

> * fix the correctness bug now on aarch64

OK.  When doing that, check and document equivalence to the abstract
synchronization scheme.  When you do that, please use atomic_* for all
synchronization that you add or change.
It might be easiest to also change all atomic accesses that you checked
and documented to use atomic_*; this isn't much work compared to the
checking, and it becomes obvious what has been checked.

> * decide if lazy static tls resolver is worth it
> * do the refactoring so tlsdesc is common code
> * use c11 atomics

I'd prefer C11 atomics to be used right away when fixing things;
however, if you mean transforming and checking all other TLS code
unrelated to the bug you're looking at, then I agree that this can be
done incrementally and after you have fixed this bug.

> the fix for arm is a lot harder (because atomics are
> different on different versions of arm, an ifunc based
> dispatch could in principle solve the dmb vs no-dmb
> issue for the lazy resolvers).
Is the synchronization used in the asm bits on the different versions of
arm different from what C11 atomics would look like on those versions of
arm?  Or are we just dealing with different ways to implement acquire
loads, for example?
On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
> On 23/04/15 22:46, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> >> (the particular case i'm trying to fix is hard because
> >> the access to td->entry is generated by the compiler, so
> >> it has to be worked around inside the entry function.
> >> i think x86 does not have this issue)
> >
> > Do you mean that you want to affect the compiler-generated code after it
> > has been generated?
>
> i'm fixing the tlsdesc entry functions in libc

Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
asm code for TLS accesses may lack acquire barriers in some cases.  Do
you want to work around that by trying to adapt the glibc
implementation, or do you just want to fix all glibc-specific bugs, and
handle the compiler-side issues separately?
On 24/04/15 18:37, Torvald Riegel wrote:
> On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 18:14, Szabolcs Nagy wrote:
>
>> the td->entry can be in 3 states during lazy resolution:
>>
>> * init: retries the call of entry until caller==entry
>>   (using a double-checked locking mechanism, then it
>>   - grabs GL(dl_load_lock)
>>   - changes the entry to the hold state
>>   - does the resolver work
>>   - changes arg and entry to the final value
>>   - releases the lock to wake the waiters
>>   - calls the new entry)
>> * hold: waits on the GL(dl_load_lock)
>>   (then retries calling the entry when it's woken up)
>> * final state: (static, dynamic or undefweak callback)
>>   calculates the tls offset based on arg
>>   (currently without any locks which is bad on weak
>>   memory models)
>
> Could you point me to (or confirm that) a new load of td->entry happens
> if caller != td->entry?  In the asm bits you posted so far, it seems
> that if the call to *td->entry returns, actual TLS data is loaded.  For
> example, you posted this:

if td->entry is in init state it will do the retry
(when it ends up in _dl_tlsdesc_resolve_early_return_p)

>
> ldr x1, [x0]      // load td->entry
> blr [x0]          // call it

this is supposed to be 'blr x1', but you got the meaning

> entryfunc:
> ldr x0, [x0,#8]   // load td->arg
>
> I'm asking because in tlsdesctab.h we have:
>
> static int
> _dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
> {
>   if (caller != td->entry)
>     return 1;
>
>   __rtld_lock_lock_recursive (GL(dl_load_lock));
>   if (caller != td->entry)
>     {
>       __rtld_lock_unlock_recursive (GL(dl_load_lock));
>       return 1;
> ....
>
> If _dl_tlsdesc_resolve_early_return_p returns 1,
> _dl_tlsdesc_resolve_rela_fixup just returns.
> If the next thing we do is
> load td->arg, we need another acquire-MO load for td->entry in

no, if _dl_tlsdesc_resolve_rela_fixup returns it ends up in
_dl_tlsdesc_resolve_rela (asm code) that loads td->entry again
and calls the entry again (this is the retry loop)

(so when the resolver updated the entry then the new entry is
actually called, and in case of early return there is a retry)

> _dl_tlsdesc_resolve_early_return_p.  A relaxed-MO load (or the current
> non-atomic access, ignoring DRF) would only be sufficient if all that
> caller != atomic_load_relaxed (&td->entry) leads to is further busy
> waiting (e.g., calling *td->entry again).  But if we really expect
> caller != td->entry to have meaning and tell us something about other
> state, we need another barrier there to avoid a typical
> double-checked-locking bug.

this is why i said you need atomics for accessing td->entry
but this is generic code and not related to the aarch64 bug
so i'd prefer if any such cleanup were done independently

> (This shows another benefit of using atomics for all concurrent
> accesses: it forces us to make a conscious decision about the memory
> orders, and document why.  If a relaxed-MO access is sufficient, it
> had better have a clear explanation...)
>
>> the code for tls access is generated by the compiler so
>> there is no atomics there: the loaded entry can be in
>> init, hold or final state, the guarantee about the state
>> of arg must come from the target arch memory model.
>
> I understood there are no C11 atomics used in the asm implementation
> bits.  I'm thinking about abstracting the synchronization scheme we
> want to use in the implementation, represented through C11 atomics
> synchronization; the asm would just do something equivalent (see my
> comments about the atomics implementation being effectively part of
> the ABI).
>

ok

>> and to answer my earlier question about the hold state:
>> it is needed to avoid the spinning in the double-checked
>> locking in init.
>>
>>>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>>>> changes and ignoring some additional comments, very similar to the
>>>> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
>>>> using the C11 model) that can be used on several archs?  This would
>>>> certainly decrease maintenance overhead.
>>>>
>>>
>>> should be possible i think: these functions are called
>>> from asm to do the lazy resolution
>>>
>>> but i have to check the arch specific details.
>>
>> looking at this, but it seems to be significant work:
>>
>> * i386 and arm determine the caller differently
>> * i386 uses special calling convention from asm
>> * arm handles local symbols specially
>
> Okay.  But do they still use the same abstract synchronization scheme,
> and just differ in a few aspects?  Or are the synchronization schemes
> significantly different?

the synchronization scheme is the same but if the calling
convention is non-C (regparm) then you cannot use the same
generic C code (without significant changes somewhere).

>> i think the way to move forward is
>
> I think step 1 should be to document the synchronization scheme on an
> abstract, algorithmic level.  Using C11 atomics in this pseudo-code.  We
> want to document the abstract level because it's typically much easier
> to reason about the abstraction than the concrete implementation in case
> of concurrent code, and because confirming that an implementation
> matches the abstraction is typically also much easier than inferring the
> abstraction from a concrete implementation.
>

i think the documentation is the tlsdesc abi spec
(synchronization requirements can be derived from that)

i think we need to fix the bug instead of writing more
documents (you can build the theories later, it will
inevitably cause a lot of changes because the lazy dl
code is far from perfect)

> So, for example, if we use a kind of double-checked locking here,
> document that, show the pseudo-code with C11 atomics, and then refer
> back to this when documenting why certain memory orders on atomic
> operations (or asm instructions) are sufficient and equivalent to the
> abstract algorithm.  In our case, this would mean saying that, for
> example, we need load Y to use acquire MO so that it synchronizes with
> release-MO store X, making sure that a load B that is sequenced after Y
> observes store A (which is a core part of double-checked locking, so
> this could also be covered in the abstract algorithm docs and then just
> referred back to from the concrete implementation code).
>
>> * fix the correctness bug now on aarch64
>
> OK.  When doing that, check and document equivalence to the abstract
> synchronization scheme.  When you do that, please use atomic_* for all
> synchronization that you add or change.
> It might be easiest to also change all atomic accesses that you checked
> and documented to use atomic_*; this isn't much work compared to the
> checking, and it becomes obvious what has been checked.
>
>> * decide if lazy static tls resolver is worth it
>> * do the refactoring so tlsdesc is common code
>> * use c11 atomics
>
> I'd prefer C11 atomics to be used right away when fixing things;
> however, if you mean transforming and checking all other TLS code
> unrelated to the bug you're looking at, then I agree that this can be
> done incrementally and after you have fixed this bug.
>

ok i'll do the atomic_store on td->entry instead of
__sync_synchronize

>> the fix for arm is a lot harder (because atomics are
>> different on different versions of arm, an ifunc based
>> dispatch could in principle solve the dmb vs no-dmb
>> issue for the lazy resolvers).
>
> Is the synchronization used in the asm bits on the different versions
> of arm different from what C11 atomics would look like on those
> versions of arm?  Or are we just dealing with different ways to
> implement acquire loads, for example?

i don't know the details of c11 atomics on arm but i assume
the emitted code depends on which cpu you compile for
(so lots of ifdefs in the asm, or you can detect the current
cpu at runtime with ifunc which is also ugly)
On 24/04/15 18:40, Torvald Riegel wrote:
> On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
>>
>> On 23/04/15 22:46, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>>>> (the particular case i'm trying to fix is hard because
>>>> the access to td->entry is generated by the compiler, so
>>>> it has to be worked around inside the entry function.
>>>> i think x86 does not have this issue)
>>>
>>> Do you mean that you want to affect the compiler-generated code after it
>>> has been generated?
>>
>> i'm fixing the tlsdesc entry functions in libc
>
> Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
> asm code for TLS accesses may lack acquire barriers in some cases.  Do
> you want to work around that by trying to adapt the glibc
> implementation, or do you just want to fix all glibc-specific bugs, and
> handle the compiler-side issues separately?

the compiler generated code doesn't need any change
(that would break the arm and aarch64 abi)

i'm fixing the issue in the libc tlsdesc entry functions
(because glibc is currently broken; it is very easy to make
it crash)

the fix relies on the arm memory model

the real bug is the complexity of the dynamic linker code in glibc
and the design document that specified lazy binding for tlsdesc
without considering weak memory model issues; for these there is
an easy fix, but you won't like it: don't do lazy binding.
On Fri, Apr 24, 2015 at 10:27:36PM +0100, Szabolcs Nagy wrote:
> On 24/04/15 18:40, Torvald Riegel wrote:
> > On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
> >>
> >> On 23/04/15 22:46, Torvald Riegel wrote:
> >>> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> >>>> (the particular case i'm trying to fix is hard because
> >>>> the access to td->entry is generated by the compiler, so
> >>>> it has to be worked around inside the entry function.
> >>>> i think x86 does not have this issue)
> >>>
> >>> Do you mean that you want to affect the compiler-generated code after it
> >>> has been generated?
> >>
> >> i'm fixing the tlsdesc entry functions in libc
> >
> > Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
> > asm code for TLS accesses may lack acquire barriers in some cases.  Do
> > you want to work around that by trying to adapt the glibc
> > implementation, or do you just want to fix all glibc-specific bugs, and
> > handle the compiler-side issues separately?
>
> the compiler generated code doesn't need any change
> (that would break the arm and aarch64 abi)
>
> i'm fixing the issue in the libc tlsdesc entry functions
> (because glibc is broken currently, very easy to make it
> crash)
>
> the fix relies on the arm memory model
>
> the real bug is the complexity of the dynamic linker code in glibc
> and the design document that specified lazy binding for tlsdesc
> without considering weak memory model issues, for these there is
> an easy fix but you won't like it: don't do lazy binding.

Given that lazy tlsdesc binding seems to break AS-safety too, and
breaks fail-safety (there's an allocation that can fail at resolving
time), I think it should just be dropped.  Then this whole issue goes
away and doesn't need complex arch-specific solutions.

Rich
On Apr 22, 2015, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:

> - I don't understand the role of the hold function in the general
> TLSDESC design

On machines that can't overwrite the entire descriptor atomically, it
is intended to stop a thread from using a partially-relocated
descriptor.  Users of the descriptor must load the function word before
the operand, and the relocator must first set the function word to
hold, then modify the operand, and then set the function word to its
final value.  This is supposed to ensure that users of the descriptor
can tell that relocation of the descriptor is already underway, and
thus that the operand may already be trashed.

I agree it's not extremely relevant for current implementations, but
the x86 TLSDESC RFC (the first that couldn't update the descriptor
atomically) describes other envisioned implementations that could take
advantage of it.
On Wed, Apr 22, 2015 at 01:27:15PM +0100, Szabolcs Nagy wrote:
> Other thoughts:
>
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries for a
> translation unit (zero and non-zero initialized TLS), so there has to be
> a lot of object files with static TLS linked into a shared object to
> make the load time relocation overhead significant.  Unless there is some
> evidence that lazy static TLS relocation makes sense I would suggest
> removing that logic (from all archs).  (The dynamic case is probably a
> similar micro-optimization, but there the lazy vs non-lazy semantics are
> different in case of undefined symbols and some applications may depend
> on that.)

I agree with this.  As for undefined symbols, there is a bug where we
segfault on a weak TLS variable, so I doubt that anyone depends on
that behavior.

You didn't mention one essential argument in your analysis: the
overhead is caused by unused TLS variables.  When a variable is used
you need to do the relocation anyway, and a lazy one could be a
magnitude more expensive than an eager one.

There is a project on my backlog to improve TLS access in libraries,
which is still slow even with gnu2, so it would need to introduce
-mtls-dialect=gnu3.  Eager binding of TLS is a prerequisite.

The current TLS models are a flawed design, as they try to save space
by being lazy without any evidence that an application benefits from
it.  The main idea is that total TLS usage is less than 65536 bytes
for most applications.  So unless an application uses more, we could
preallocate that when loading a dso.  That makes a TLS access in a
dynamic library require only one extra read versus a TLS access in the
main binary when you use the following:

static int foo_offset;
#define foo *(foo_offset > 0 ? tcb + foo_offset : get_tls_ptr (- foo_offset))

Here you could do lazy allocation in get_tls_ptr without taking a
lock.
There is a bug where doing the allocation in a signal handler calls
malloc, which could result in deadlock; there was a patch to use mmap
to fix that, but it was reverted.
diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..ff74b52 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,25 @@ _dl_tlsdesc_return:
 	cfi_endproc
 	.size	_dl_tlsdesc_return, .-_dl_tlsdesc_return
 
+	/* Same as _dl_tlsdesc_return but with synchronization for
+	   lazy relocation.
+	   Prototype:
+	   _dl_tlsdesc_return_lazy (tlsdesc *) ;
+	 */
+	.hidden _dl_tlsdesc_return_lazy
+	.global	_dl_tlsdesc_return_lazy
+	.type	_dl_tlsdesc_return_lazy,%function
+	cfi_startproc
+	.align 2
+_dl_tlsdesc_return_lazy:
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.  */
+	ldar	xzr, [x0]
+	ldr	x0, [x0, #8]
+	RET
+	cfi_endproc
+	.size	_dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
 	/* Handler for undefined weak TLS symbols.
 	   Prototype:
 	   _dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +115,9 @@ _dl_tlsdesc_return:
 _dl_tlsdesc_undefweak:
 	str	x1, [sp, #-16]!
 	cfi_adjust_cfa_offset(16)
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.  */
+	ldar	xzr, [x0]
 	ldr	x0, [x0, #8]
 	mrs	x1, tpidr_el0
 	sub	x0, x0, x1
@@ -152,6 +174,9 @@ _dl_tlsdesc_dynamic:
 	stp	x3, x4, [sp, #32+16*1]
 
 	mrs	x4, tpidr_el0
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.
+	 */
+	ldar	xzr, [x0]
 	ldr	x1, [x0,#8]
 	ldr	x0, [x4]
 	ldr	x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_return (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_undefweak (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Barrier so readers see the write above before the one below.  */
+      __sync_synchronize ();
       td->entry = _dl_tlsdesc_undefweak;
     }
   else
@@ -98,6 +100,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
 					      + reloc->r_addend);
+	  /* Barrier so readers see the write above before the one below.  */
+	  __sync_synchronize ();
 	  td->entry = _dl_tlsdesc_dynamic;
 	}
       else
@@ -105,7 +109,9 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = (void*) (sym->st_value + result->l_tls_offset
 			     + reloc->r_addend);
-	  td->entry = _dl_tlsdesc_return;
+	  /* Barrier so readers see the write above before the one below.  */
+	  __sync_synchronize ();
+	  td->entry = _dl_tlsdesc_return_lazy;
 	}
     }