Message ID | 5537B943.2060001@linaro.org |
---|---|
State | Not applicable |
From | Adhemerval Zanella <adhemerval.zanella@linaro.org> |
Date | Wed, 22 Apr 2015 12:07:47 -0300 |
Subject | Re: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix |
Commit Message
Adhemerval Zanella Netto
April 22, 2015, 3:07 p.m. UTC
Hi,

On 22-04-2015 09:27, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
> accesses.
>
> The original ARM TLSDESC design did not discuss this race that
> arises on systems with weak memory order guarantees:
>
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
>
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
>
> The current glibc code (i386, x86_64, arm, aarch64) is
>
>   td->arg = ...;
>   td->entry = ...;
>
> The arm memory model allows store-store reordering, so a barrier is
> needed between these two stores (the code is broken on x86 as well in
> principle: the x86 memory model does not help at the C code level, the
> compiler can reorder non-aliasing stores).
>
> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry) and then the argument is loaded (td->arg) in that function.
> These two loads must observe the changes made on the write side in a
> sequentially consistent way.  The code in asm:
>
>   ldr x1, [x0]      // load td->entry
>   blr x1            // call it
>
> entryfunc:
>   ldr x0, [x0,#8]   // load td->arg
>
> The branch (blr) does not provide a load-load ordering barrier, so the
> second load may observe an old arg even though the first load observed
> the new entry; this happens with existing aarch64 hardware, causing
> invalid memory accesses due to the incorrect TLS offset.
>
> Various alternatives were considered to force the load ordering in the
> descriptor function:
>
> (1) barrier
>
> entryfunc:
>   dmb ishld
>   ldr x0, [x0,#8]
>
> (2) address dependency (if the address of the second load depends on the
> result of the first one, the ordering is guaranteed):
>
> entryfunc:
>   ldr x1, [x0]
>   and x1, x1, #8
>   orr x1, x1, #8
>   ldr x0, [x0,x1]
>
> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> loads and stores)
>
> entryfunc:
>   ldar xzr, [x0]
>   ldr x0, [x0,#8]
>
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access).  Options (2) and (3) do one extra load from [x0] (same-address
> loads are ordered, so it happens-after the load on the call site).
> Option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register has the same effect as with a normal register,
> but the result is discarded so nothing is clobbered.  Note that the
> TLSDESC ABI allows the descriptor function to clobber x1, but current
> gcc does not implement this correctly; gcc will be fixed independently.
> The dynamic and undefweak descriptors currently save/restore x1, so
> only static TLS would be affected by this clobbering issue).

I see option (3) as the preferred one as well, since it fits better with
the C11 memory semantics.

> On the write side I used a full barrier (__sync_synchronize) although
>
>   dmb ishst
>
> would be enough, but the write side is not performance critical.

My understanding is that you do not need a sequentially consistent store,
but rather a store release on 'td->arg', since the code only requires
that 'td->entry' not be reordered with 'td->arg'.  So, to adjust to C11
semantics, I would change:

diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Barrier so readers see the write above before the one below.  */
+      __sync_synchronize ();

to 'atomic_store_release (td->arg, (void*) reloc->r_addend)'.

> I introduced a new _dl_tlsdesc_return_lazy descriptor function for
> lazily relocated static TLS, so the non-lazy use case is not affected.
> The dynamic and undefweak descriptors do enough work so the additional
> ldar should not have a performance impact.
>
> Other thoughts:
>
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries for a
> translation unit (zero and non-zero initialized TLS), so there has to be
> a lot of object files with static TLS linked into a shared object to
> make the load-time relocation overhead significant.  Unless there is
> some evidence that lazy static TLS relocation makes sense, I would
> suggest removing that logic (from all archs).  (The dynamic case is
> probably a similar micro-optimization, but there the lazy vs non-lazy
> semantics are different in case of undefined symbols and some
> applications may depend on that.)

Lately I have been seeing that lazy relocation in general yields smaller
gains than before and adds a lot of complexity.  I would like to see a
use case where lazy TLS, or even normal lazy relocation, shows a real
gain.

> - 32-bit arm with -mtls-dialect=gnu2 is affected too; it will have to go
> with option (1) or (2), with an additional twist: some existing ARMv7
> cpus don't implement same-address load ordering reliably, see:
> http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
>
> - I don't understand the role of the hold function in the general
> TLSDESC design: why is it not enough to let other threads wait on the
> global lock in the initial resolver function?  Is the global dl lock
> implemented as a spin lock?  Is it for some liveness/fairness property?
>
> - There are some incorrect uses of "cfi_adjust_cfa_offset" in
> dl-tlsdesc.S; that is a separate patch.
>
> Changelog:
>
> 2015-04-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>
> 	[BZ #18034]
> 	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
> 	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
> 	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
> 	(_dl_tlsdesc_dynamic): Likewise.
> 	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup): Add
> 	barrier between stores.
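To make the acquire/release pairing above concrete, here is a minimal
self-contained C11 sketch of the write and read sides; the struct layout
and the names publish, entryfunc and tls_access are hypothetical
stand-ins for the example, not glibc's actual declarations:

  #include <stdatomic.h>

  /* Layout mirrors the asm above: entry at offset 0, arg at offset 8
     (on a 64-bit target).  */
  struct tlsdesc
  {
    _Atomic (void *(*) (struct tlsdesc *)) entry;
    _Atomic (void *) arg;
  };

  /* Write side (lazy relocation): publish arg first, then entry.  The
     release store keeps the arg store from becoming visible after it.  */
  static void
  publish (struct tlsdesc *td, void *arg, void *(*entry) (struct tlsdesc *))
  {
    atomic_store_explicit (&td->arg, arg, memory_order_relaxed);
    atomic_store_explicit (&td->entry, entry, memory_order_release);
  }

  /* Descriptor function: a relaxed load of arg suffices because the
     caller's acquire load of entry already ordered it.  */
  static void *
  entryfunc (struct tlsdesc *td)
  {
    return atomic_load_explicit (&td->arg, memory_order_relaxed);
  }

  /* Read side (the compiler-generated TLS access sequence): the acquire
     load of entry pairs with the release store in publish, so the called
     function is guaranteed to see the matching arg; the ldar of option
     (3) plays this role in the asm.  */
  static void *
  tls_access (struct tlsdesc *td)
  {
    void *(*fn) (struct tlsdesc *)
      = atomic_load_explicit (&td->entry, memory_order_acquire);
    return fn (td);
  }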
Comments
On 22/04/15 16:07, Adhemerval Zanella wrote:
> On 22-04-2015 09:27, Szabolcs Nagy wrote:
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>>   dmb ishst
>>
>> would be enough, but the write side is not performance critical.
>
> My understanding is that you do not need a sequentially consistent
> store, but rather a store release on 'td->arg', since the code only
> requires that 'td->entry' not be reordered with 'td->arg'.  So, to
> adjust to C11 semantics, I would change:
>
> diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
> index 4821f8c..f738cc6 100644
> --- a/sysdeps/aarch64/tlsdesc.c
> +++ b/sysdeps/aarch64/tlsdesc.c
> @@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
>    if (!sym)
>      {
>        td->arg = (void*) reloc->r_addend;
> +      /* Barrier so readers see the write above before the one below.  */
> +      __sync_synchronize ();
>
> to 'atomic_store_release (td->arg, (void*) reloc->r_addend)'.

I think

  atomic_store_release (&td->entry, ...)

is the correct replacement: moving later stores before a store-release
is valid, so a release on td->arg alone would not keep the td->entry
store after it.

>> - Lazy binding for static TLS may be unnecessary complexity: it seems
>> that gcc generates at most two static TLSDESC relocation entries for a
>> translation unit (zero and non-zero initialized TLS), so there has to be
>> a lot of object files with static TLS linked into a shared object to
>> make the load-time relocation overhead significant.  Unless there is
>> some evidence that lazy static TLS relocation makes sense, I would
>> suggest removing that logic (from all archs).  (The dynamic case is
>> probably a similar micro-optimization, but there the lazy vs non-lazy
>> semantics are different in case of undefined symbols and some
>> applications may depend on that.)
>
> Lately I have been seeing that lazy relocation in general yields smaller
> gains than before and adds a lot of complexity.  I would like to see a
> use case where lazy TLS, or even normal lazy relocation, shows a real
> gain.

I'm not sure whether glibc considers lazy relocation part of the ABI
contract, but I think static TLS is always safe to relocate non-lazily.
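The placement point can be illustrated with a hedged sketch (stand-in
atomic globals rather than the real struct tlsdesc fields); both
orderings compile, but only the second gives the guarantee the reader's
acquire load relies on:

  #include <stdatomic.h>

  static _Atomic (void *) entry;  /* stand-in for td->entry */
  static _Atomic (void *) arg;    /* stand-in for td->arg   */

  /* Insufficient: a release store only keeps *earlier* accesses from
     moving after it, so the later entry store may still be reordered
     before the arg store, recreating the race.  */
  void
  publish_release_on_arg (void *a, void *e)
  {
    atomic_store_explicit (&arg, a, memory_order_release);
    atomic_store_explicit (&entry, e, memory_order_relaxed); /* may move up */
  }

  /* Correct: the release on the entry store orders the earlier arg
     store before it, which is what a reader's acquire load of entry
     relies on.  */
  void
  publish_release_on_entry (void *a, void *e)
  {
    atomic_store_explicit (&arg, a, memory_order_relaxed);
    atomic_store_explicit (&entry, e, memory_order_release);
  }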
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Barrier so readers see the write above before the one below.  */
+      __sync_synchronize ();

To 'atomic_store_release (td->arg, (void*) reloc->r_addend)'.
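For reference, the replacement suggested in the line above, rendered as
standard C11 (a sketch with a stand-in variable; per the follow-up in
the thread, the release arguably belongs on the td->entry store
instead):

  #include <stdatomic.h>

  static _Atomic (void *) td_arg;  /* stand-in for td->arg */

  /* Collapses the plain store plus __sync_synchronize () of the hunk
     above into a single store-release.  */
  static void
  set_arg_release (long addend)
  {
    atomic_store_explicit (&td_arg, (void *) addend, memory_order_release);
  }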