From patchwork Wed Apr 22 12:27:15 2015
X-Patchwork-Submitter: Szabolcs Nagy
X-Patchwork-Id: 6380
Message-ID: <553793A3.7030206@arm.com>
Date: Wed, 22 Apr 2015 13:27:15 +0100
From: Szabolcs Nagy
To: libc-alpha@sourceware.org
CC: Marcus Shawcroft, Ramana Radhakrishnan
Subject: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix

Lazy TLSDESC initialization needs to be synchronized with concurrent
TLS accesses.  The original ARM TLSDESC design did not discuss this
race, which arises on systems with weak memory ordering guarantees:

http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt

  "Storing the final value of the TLS descriptor also needs care: the
  resolver field must be set to its final value after the argument gets
  its final value, such that if any thread attempts to use the
  descriptor before it gets its final value, it still goes to the hold
  function."

The current glibc code (i386, x86_64, arm, aarch64) is

  td->arg = ...;
  td->entry = ...;

The ARM memory model allows store-store reordering, so a barrier is
needed between these two stores.  (The code is broken on x86 as well
in principle: the x86 memory model does not help at the C code level,
since the compiler may reorder non-aliasing stores.)

What is missing from the TLSDESC design spec is a description of the
ordering requirements on the read side: the TLS access sequence
(generated by the compiler) loads the descriptor function pointer
(td->entry) and then the argument (td->arg) is loaded inside that
function.  These two loads must observe the changes made on the write
side in a sequentially consistent way.  The code in asm:

   ldr x1, [x0]    // load td->entry
   blr x1          // call it
entryfunc:
   ldr x0, [x0,#8] // load td->arg

The branch (blr) does not provide a load-load ordering barrier, so the
second load may observe an old arg even though the first load observed
the new entry.  This happens on existing aarch64 hardware and causes
invalid memory accesses due to the incorrect TLS offset.
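To make the required pairing concrete, here is a minimal C11-atomics
model of the ordering requirement.  This is a sketch only, not the
glibc code: the patch below keeps plain stores and puts a full barrier
between them on the write side, and on the read side it uses an ldar
inside the descriptor function, because the compiler-generated
call-site sequence cannot be changed.  An acquire load of entry at the
call site expresses the same load-load ordering for the purpose of
this model.

  #include <stdatomic.h>
  #include <stddef.h>

  struct tlsdesc;
  typedef ptrdiff_t tlsdesc_fn (struct tlsdesc *);

  /* Simplified descriptor: entry is the resolver/return function,
     arg is its argument (e.g. the TLS offset).  */
  struct tlsdesc
  {
    _Atomic (tlsdesc_fn *) entry;
    void *arg;
  };

  /* Write side (lazy relocation): arg must be globally visible before
     entry.  A release store of entry gives that ordering; a full
     barrier between two plain stores, as the patch does, is also
     sufficient.  */
  void
  write_side (struct tlsdesc *td, tlsdesc_fn *final_entry, void *final_arg)
  {
    td->arg = final_arg;
    atomic_store_explicit (&td->entry, final_entry, memory_order_release);
  }

  /* Read side (compiler-generated TLS access): the load of entry must
     be ordered before the load of arg inside the called function.  */
  ptrdiff_t
  read_side (struct tlsdesc *td)
  {
    tlsdesc_fn *entry
      = atomic_load_explicit (&td->entry, memory_order_acquire);
    return entry (td);	/* entry () then loads td->arg.  */
  }

In these terms the bug is that both sides currently use plain
(relaxed) accesses, so nothing orders the arg store before the entry
store, nor the entry load before the arg load.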
Various alternatives were considered to force the load ordering in the
descriptor function:

(1) barrier

entryfunc:
	dmb	ishld
	ldr	x0, [x0,#8]

(2) address dependency (if the address of the second load depends on
the result of the first one, the ordering is guaranteed):

entryfunc:
	ldr	x1, [x0]
	and	x1, x1, #8
	orr	x1, x1, #8
	ldr	x0, [x0,x1]

(3) load-acquire (an ARMv8 instruction that is ordered before
subsequent loads and stores):

entryfunc:
	ldar	xzr, [x0]
	ldr	x0, [x0,#8]

Option (1) is the simplest but slowest (note: this runs at every TLS
access).  Options (2) and (3) do one extra load from [x0] (same-address
loads are ordered, so it happens-after the load at the call site).
Option (2) clobbers x1, so I think (3) is the best solution: a load
into the zero register has the same ordering effect as a load into a
normal register, but the result is discarded so nothing is clobbered.
(Note that the TLSDESC ABI allows the descriptor function to clobber
x1, but current gcc does not implement this correctly; gcc will be
fixed independently.  The dynamic and undefweak descriptors currently
save/restore x1, so only static TLS would be affected by this
clobbering issue.)

On the write side I used a full barrier (__sync_synchronize) although
dmb ishst would be enough; the write side is not performance critical.

I introduced a new _dl_tlsdesc_return_lazy descriptor function for
lazily relocated static TLS, so the non-lazy use-case is not affected.
The dynamic and undefweak descriptors do enough work that the
additional ldar should not have a performance impact.

Other thoughts:

- Lazy binding for static TLS may be unnecessary complexity: it seems
  that gcc generates at most two static TLSDESC relocation entries per
  translation unit (zero- and non-zero-initialized TLS), so a lot of
  object files with static TLS would have to be linked into a shared
  object to make the load-time relocation overhead significant.  Unless
  there is some evidence that lazy static TLS relocation makes sense, I
  would suggest removing that logic (from all archs).  (The dynamic
  case is probably a similar micro-optimization, but there the lazy vs
  non-lazy semantics differ for undefined symbols and some applications
  may depend on that.)

- 32-bit arm with -mtls-dialect=gnu2 is affected too; it will have to
  go with option (1) or (2), with an additional twist: some existing
  ARMv7 cpus do not implement same-address load ordering reliably, see:
  http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

- I don't understand the role of the hold function in the general
  TLSDESC design: why is it not enough to let other threads wait on the
  global lock in the initial resolver function?  Is the global dl lock
  implemented as a spin lock?  Is it for some liveness/fairness
  property?

- There are some incorrect uses of cfi_adjust_cfa_offset in
  dl-tlsdesc.S; that is a separate patch.

Changelog:

2015-04-22  Szabolcs Nagy

	[BZ #18034]
	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
	(_dl_tlsdesc_dynamic): Likewise.
	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
	barrier between stores.
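For reference (not part of the patch), a reproducer for the race would
look roughly like the sketch below.  The names are hypothetical: the
module libtlsmod.so, the accessor get_tlsvar and the thread count are
made up; the module would define "__thread int tlsvar;" and
"int *get_tlsvar (void) { return &tlsvar; }" and be built with
-mtls-dialect=desc so its TLS accesses go through lazily relocated
descriptors.  With the unfixed resolver, a thread whose first TLS
access races with the lazy fixup may read a stale td->arg and fault on
an invalid TLS offset.

  /* race.c: spawn many threads whose first TLS access through the
     dlopen'd module races with the lazy TLSDESC fixup.  */
  #include <dlfcn.h>
  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 16

  static pthread_barrier_t barrier;
  static int *(*get_tlsvar) (void);

  static void *
  worker (void *unused)
  {
    (void) unused;
    /* Line all threads up so the first TLS accesses happen together:
       one thread triggers the lazy resolution while the others race
       through the half-updated descriptor.  */
    pthread_barrier_wait (&barrier);
    int *p = get_tlsvar ();
    *p = 42;	/* Faults or corrupts memory if td->arg was stale.  */
    return NULL;
  }

  int
  main (void)
  {
    void *h = dlopen ("./libtlsmod.so", RTLD_LAZY);
    if (h == NULL)
      {
        fprintf (stderr, "dlopen: %s\n", dlerror ());
        return 1;
      }
    get_tlsvar = (int *(*) (void)) dlsym (h, "get_tlsvar");
    if (get_tlsvar == NULL)
      {
        fprintf (stderr, "dlsym: %s\n", dlerror ());
        return 1;
      }

    pthread_t t[NTHREADS];
    pthread_barrier_init (&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++)
      pthread_create (&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
      pthread_join (t[i], NULL);
    puts ("no failure observed in this run");
    return 0;
  }

Build with e.g. "gcc -O2 -pthread race.c -ldl" and run repeatedly;
whether a given run fails depends on timing and on the hardware's
reordering behaviour.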
diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..ff74b52 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,25 @@ _dl_tlsdesc_return:
 	cfi_endproc
 	.size	_dl_tlsdesc_return, .-_dl_tlsdesc_return
 
+	/* Same as _dl_tlsdesc_return but with synchronization for
+	   lazy relocation.
+	   Prototype:
+	   _dl_tlsdesc_return_lazy (tlsdesc *) ;
+	 */
+	.hidden _dl_tlsdesc_return_lazy
+	.global	_dl_tlsdesc_return_lazy
+	.type	_dl_tlsdesc_return_lazy,%function
+	cfi_startproc
+	.align 2
+_dl_tlsdesc_return_lazy:
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.  */
+	ldar	xzr, [x0]
+	ldr	x0, [x0, #8]
+	RET
+	cfi_endproc
+	.size	_dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
 	/* Handler for undefined weak TLS symbols.
 	   Prototype:
 	   _dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +115,9 @@ _dl_tlsdesc_return:
 _dl_tlsdesc_undefweak:
 	str	x1, [sp, #-16]!
 	cfi_adjust_cfa_offset(16)
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.  */
+	ldar	xzr, [x0]
 	ldr	x0, [x0, #8]
 	mrs	x1, tpidr_el0
 	sub	x0, x0, x1
@@ -152,6 +174,9 @@ _dl_tlsdesc_dynamic:
 	stp	x3,  x4, [sp, #32+16*1]
 
 	mrs	x4, tpidr_el0
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.  */
+	ldar	xzr, [x0]
 	ldr	x1, [x0,#8]
 	ldr	x0, [x4]
 	ldr	x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_return (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_undefweak (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Barrier so readers see the write above before the one below.  */
+      __sync_synchronize ();
       td->entry = _dl_tlsdesc_undefweak;
     }
   else
@@ -98,6 +100,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
 					      + reloc->r_addend);
+	  /* Barrier so readers see the write above before the one below.  */
+	  __sync_synchronize ();
 	  td->entry = _dl_tlsdesc_dynamic;
 	}
       else
@@ -105,7 +109,9 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = (void*) (sym->st_value + result->l_tls_offset
 			     + reloc->r_addend);
-	  td->entry = _dl_tlsdesc_return;
+	  /* Barrier so readers see the write above before the one below.  */
+	  __sync_synchronize ();
+	  td->entry = _dl_tlsdesc_return_lazy;
 	}
     }