From patchwork Mon Jun 15 11:30:38 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Szabolcs Nagy
X-Patchwork-Id: 7180
Received: (qmail 103137 invoked by alias); 15 Jun 2015 11:30:46 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Unsubscribe:
List-Subscribe:
List-Archive:
List-Post:
List-Help: ,
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 103079 invoked by uid 89); 15 Jun 2015 11:30:44 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.9 required=5.0 tests=AWL, BAYES_40, SPF_PASS autolearn=ham version=3.3.2
X-HELO: eu-smtp-delivery-143.mimecast.com
Message-ID: <557EB75E.6090002@arm.com>
Date: Mon, 15 Jun 2015 12:30:38 +0100
From: Szabolcs Nagy
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Torvald Riegel
CC: "libc-alpha@sourceware.org" , Marcus Shawcroft , Ramana Radhakrishnan
Subject: [PATCH v4][BZ 18034][AArch64] Lazy TLSDESC relocation data race fix
References: <553793A3.7030206@arm.com> <1429718899.6557.17.camel@triegel.csb> <553E5381.504@arm.com> <1432672677.26239.41.camel@triegel.csb> <5565AA2A.7010509@arm.com> <1432731762.30849.53.camel@triegel.csb> <556C3301.1070007@arm.com> <1433328013.21461.86.camel@triegel.csb>
In-Reply-To: <1433328013.21461.86.camel@triegel.csb>
X-MC-Unique: nH96QDegRsO8k40mZWkkNg-1

On 03/06/15 11:40, Torvald Riegel wrote:
> On Mon, 2015-06-01 at 11:25 +0100, Szabolcs Nagy wrote:
>> i added a comment to the _dl_tlsdesc_resolve_early_return_p
>> call in aarch64 tlsdesc.c about the retry loop.
>
> That's good, but it would have been better if you could have briefly
> pointed out that this relates to mo_relaxed loads in
> _dl_tlsdesc_resolve_early_return_p.
> And/or added a comment there saying
> that the mo_relaxed loads are fine because of this retry loop in the
> caller(s).
>
>> -  const ElfW(Rela) *reloc = td->arg;
>> +  const ElfW(Rela) *reloc = atomic_load_relaxed (&td->arg);
>
> Good change.  Can you add a brief comment saying why the mo_relaxed load
> is sufficient?  IIRC, this is because of the acquire loads done by the
> caller.
>
> OK with those changes.
>

Updated the comments in the code and the description:

Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
accesses.

The TLS descriptor contains a function pointer (entry) and an argument
that is accessed from the entry function.  With lazy initialization the
first call to the entry function updates the entry and the argument to
their final value.  A final entry function must make sure that it accesses
an initialized argument; this needs synchronization on systems with weak
memory ordering, otherwise the writes of the first call can be observed
out of order.

There are at least two issues with the current code:

tlsdesc.c (i386, x86_64, arm, aarch64) uses volatile memory accesses on
the write side (in the initial entry function) instead of C11 atomics.

And on systems with weak memory ordering (arm, aarch64) the read side
synchronization is missing from the final entry functions (dl-tlsdesc.S).

This patch only deals with aarch64.

* Write side:

Volatile accesses were replaced with C11 relaxed atomics, and a release
store was used for the initialization of entry so the read side can
synchronize with it.

* Read side:

The TLS access code generated by the compiler and the entry function
code are roughly

  ldr x1, [x0]    // load the entry
  blr x1          // call it

entryfunc:
  ldr x0, [x0,#8] // load the arg
  ret

Various alternatives were considered to force the ordering in the
entry function between the two loads:

(1) barrier

entryfunc:
  dmb ishld
  ldr x0, [x0,#8]

(2) address dependency (if the address of the second load depends on the
result of the first one the ordering is guaranteed):

entryfunc:
  ldr x1,[x0]
  and x1,x1,#8
  orr x1,x1,#8
  ldr x0,[x0,x1]

(3) load-acquire (ARMv8 instruction that is ordered before subsequent
loads and stores)

entryfunc:
  ldar xzr,[x0]
  ldr x0,[x0,#8]

Option (1) is the simplest but slowest (note: this runs at every TLS
access).  Options (2) and (3) do one extra load from [x0] (same-address
loads are ordered, so it happens-after the load on the call site).
Option (2) clobbers x1, which is problematic because existing gcc does
not expect that, so approach (3) was chosen.

A new _dl_tlsdesc_return_lazy entry function was introduced for lazily
relocated static TLS, so non-lazy static TLS can avoid the synchronization
cost.

Changelog:

2015-06-15  Szabolcs Nagy

	[BZ #18034]
	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
	(_dl_tlsdesc_undefweak): Guarantee TLSDESC entry and argument
	load-load ordering using ldar.
	(_dl_tlsdesc_dynamic): Likewise.
	(_dl_tlsdesc_return_lazy): Likewise.
	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Use
	relaxed atomics instead of volatile and synchronize with release store.
	(_dl_tlsdesc_resolve_hold_fixup): Use relaxed atomics instead of
	volatile.
	* elf/tlsdeschtab.h (_dl_tlsdesc_resolve_early_return_p): Likewise.
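
For readers who prefer the ordering argument in portable terms, here is
a minimal C11 sketch of the write side / read side pairing described
above.  It is an illustration under assumptions, not glibc code:
struct desc, resolve, final_entry, tls_access and run_once are made-up
names, <stdatomic.h> stands in for glibc's internal atomic macros, a
NULL entry stands in for the initial (lazy) entry function, and the
explicit acquire load plays the role of the ldar.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct tlsdesc (not the glibc type).  */
struct desc;
typedef ptrdiff_t (*entry_fn) (struct desc *);

struct desc
{
  _Atomic (entry_fn) entry;   /* NULL models the initial entry.  */
  _Atomic (void *) arg;
};

/* Read side, final entry function.  The acquire load models
   "ldar xzr, [x0]": it reads the value written by the release store
   in resolve, so the relaxed load after it sees the initialized arg.  */
static ptrdiff_t
final_entry (struct desc *d)
{
  (void) atomic_load_explicit (&d->entry, memory_order_acquire);
  void *arg = atomic_load_explicit (&d->arg, memory_order_relaxed);
  return (ptrdiff_t) (intptr_t) arg;
}

/* Write side: relaxed store of arg, then release store of entry.  */
static void *
resolve (void *p)
{
  struct desc *d = p;
  atomic_store_explicit (&d->arg, (void *) (intptr_t) 42,
                         memory_order_relaxed);
  atomic_store_explicit (&d->entry, final_entry, memory_order_release);
  return NULL;
}

/* Call site: models "ldr x1, [x0]; blr x1".  The real code calls the
   initial entry and retries; this sketch just spins until the entry
   has its final value.  */
static ptrdiff_t
tls_access (struct desc *d)
{
  entry_fn e;
  while ((e = atomic_load_explicit (&d->entry, memory_order_relaxed))
         == (entry_fn) 0)
    ;
  return e (d);
}

/* Run one writer thread against one reader and return what the
   reader observed through the final entry function.  */
ptrdiff_t
run_once (void)
{
  struct desc d;
  pthread_t writer;

  atomic_init (&d.entry, (entry_fn) 0);
  atomic_init (&d.arg, NULL);

  int rc = pthread_create (&writer, NULL, resolve, &d);
  assert (rc == 0);
  ptrdiff_t v = tls_access (&d);
  pthread_join (writer, NULL);
  return v;
}
```

With the acquire load in place run_once () returns 42.  If final_entry
used only the relaxed load of arg, a weakly ordered machine could in
principle let the reader observe the new entry but the old arg, which
is the kind of reordering the patch guards against.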

diff --git a/elf/tlsdeschtab.h b/elf/tlsdeschtab.h
index d13b4e5..fb0eb88 100644
--- a/elf/tlsdeschtab.h
+++ b/elf/tlsdeschtab.h
@@ -20,6 +20,8 @@
 #ifndef TLSDESCHTAB_H
 # define TLSDESCHTAB_H 1
 
+#include
+
 # ifdef SHARED
 
 # include
@@ -138,17 +140,17 @@ _dl_make_tlsdesc_dynamic (struct link_map *map, size_t ti_offset)
 static int
 _dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
 {
-  if (caller != td->entry)
+  if (caller != atomic_load_relaxed (&td->entry))
     return 1;
 
   __rtld_lock_lock_recursive (GL(dl_load_lock));
-  if (caller != td->entry)
+  if (caller != atomic_load_relaxed (&td->entry))
     {
       __rtld_lock_unlock_recursive (GL(dl_load_lock));
       return 1;
     }
 
-  td->entry = _dl_tlsdesc_resolve_hold;
+  atomic_store_relaxed (&td->entry, _dl_tlsdesc_resolve_hold);
 
   return 0;
 }
diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..c7adf79 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,29 @@ _dl_tlsdesc_return:
         cfi_endproc
         .size   _dl_tlsdesc_return, .-_dl_tlsdesc_return
 
+        /* Same as _dl_tlsdesc_return but with synchronization for
+           lazy relocation.
+           Prototype:
+           _dl_tlsdesc_return_lazy (tlsdesc *) ;
+         */
+        .hidden _dl_tlsdesc_return_lazy
+        .global _dl_tlsdesc_return_lazy
+        .type   _dl_tlsdesc_return_lazy,%function
+        cfi_startproc
+        .align 2
+_dl_tlsdesc_return_lazy:
+        /* The ldar here happens after the load from [x0] at the call site
+           (that is generated by the compiler as part of the TLS access ABI),
+           so it reads the same value (this function is the final value of
+           td->entry) and thus it synchronizes with the release store to
+           td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+           from [x0,#8] here happens after the initialization of td->arg.  */
+        ldar    xzr, [x0]
+        ldr     x0, [x0, #8]
+        RET
+        cfi_endproc
+        .size   _dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
         /* Handler for undefined weak TLS symbols.
            Prototype:
            _dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +119,13 @@ _dl_tlsdesc_return:
 _dl_tlsdesc_undefweak:
         str     x1, [sp, #-16]!
         cfi_adjust_cfa_offset(16)
+        /* The ldar here happens after the load from [x0] at the call site
+           (that is generated by the compiler as part of the TLS access ABI),
+           so it reads the same value (this function is the final value of
+           td->entry) and thus it synchronizes with the release store to
+           td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+           from [x0,#8] here happens after the initialization of td->arg.  */
+        ldar    xzr, [x0]
         ldr     x0, [x0, #8]
         mrs     x1, tpidr_el0
         sub     x0, x0, x1
@@ -152,6 +182,13 @@ _dl_tlsdesc_dynamic:
         stp     x3, x4, [sp, #32+16*1]
 
         mrs     x4, tpidr_el0
+        /* The ldar here happens after the load from [x0] at the call site
+           (that is generated by the compiler as part of the TLS access ABI),
+           so it reads the same value (this function is the final value of
+           td->entry) and thus it synchronizes with the release store to
+           td->entry in _dl_tlsdesc_resolve_rela_fixup ensuring that the load
+           from [x0,#8] here happens after the initialization of td->arg.  */
+        ldar    xzr, [x0]
         ldr     x1, [x0,#8]
         ldr     x0, [x4]
         ldr     x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_return (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_undefweak (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..9f3ff9b 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 
 /* The following functions take an entry_check_offset argument.
    It's computed by the caller as an offset between its entry point and the
@@ -39,11 +40,15 @@
 
 void
 attribute_hidden
-_dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
-                                struct link_map *l)
+_dl_tlsdesc_resolve_rela_fixup (struct tlsdesc *td, struct link_map *l)
 {
-  const ElfW(Rela) *reloc = td->arg;
+  const ElfW(Rela) *reloc = atomic_load_relaxed (&td->arg);
 
+  /* After GL(dl_load_lock) is grabbed only one caller can see td->entry in
+     initial state in _dl_tlsdesc_resolve_early_return_p, other concurrent
+     callers will return and retry calling td->entry.  The updated td->entry
+     synchronizes with the single writer so all read accesses here can use
+     relaxed order.  */
   if (_dl_tlsdesc_resolve_early_return_p
       (td, (void*)(D_PTR (l, l_info[ADDRIDX (DT_TLSDESC_PLT)]) + l->l_addr)))
     return;
@@ -86,8 +91,10 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 
   if (!sym)
     {
-      td->arg = (void*) reloc->r_addend;
-      td->entry = _dl_tlsdesc_undefweak;
+      atomic_store_relaxed (&td->arg, (void *) reloc->r_addend);
+      /* This release store synchronizes with the ldar acquire load
+         instruction in _dl_tlsdesc_undefweak.  */
+      atomic_store_release (&td->entry, _dl_tlsdesc_undefweak);
     }
   else
     {
@@ -96,16 +103,22 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 # else
       if (!TRY_STATIC_TLS (l, result))
         {
-          td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
+          void *p = _dl_make_tlsdesc_dynamic (result, sym->st_value
                                               + reloc->r_addend);
-          td->entry = _dl_tlsdesc_dynamic;
+          atomic_store_relaxed (&td->arg, p);
+          /* This release store synchronizes with the ldar acquire load
+             instruction in _dl_tlsdesc_dynamic.  */
+          atomic_store_release (&td->entry, _dl_tlsdesc_dynamic);
         }
       else
 # endif
         {
-          td->arg = (void*) (sym->st_value + result->l_tls_offset
+          void *p = (void*) (sym->st_value + result->l_tls_offset
                              + reloc->r_addend);
-          td->entry = _dl_tlsdesc_return;
+          atomic_store_relaxed (&td->arg, p);
+          /* This release store synchronizes with the ldar acquire load
+             instruction in _dl_tlsdesc_return_lazy.  */
+          atomic_store_release (&td->entry, _dl_tlsdesc_return_lazy);
         }
     }
 
@@ -120,11 +133,10 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 
 void
 attribute_hidden
-_dl_tlsdesc_resolve_hold_fixup (struct tlsdesc volatile *td,
-                                void *caller)
+_dl_tlsdesc_resolve_hold_fixup (struct tlsdesc *td, void *caller)
 {
   /* Maybe we're lucky and can return early.  */
-  if (caller != td->entry)
+  if (caller != atomic_load_relaxed (&td->entry))
     return;
 
   /* Locking here will stop execution until the running resolver runs