| Message ID | 553793A3.7030206@arm.com |
|---|---|
| State | Superseded |
| Headers | |
Message-ID: <553793A3.7030206@arm.com>
Date: Wed, 22 Apr 2015 13:27:15 +0100
From: Szabolcs Nagy <szabolcs.nagy@arm.com>
To: libc-alpha@sourceware.org
CC: Marcus Shawcroft <marcus.shawcroft@arm.com>, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>
Subject: [PATCH] [BZ 18034] [AArch64] Lazy TLSDESC relocation data race fix
List-Id: <libc-alpha.sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
Commit Message
Szabolcs Nagy
April 22, 2015, 12:27 p.m. UTC
Lazy TLSDESC initialization needs to be synchronized with concurrent TLS accesses.

The original ARM TLSDESC design did not discuss this race that arises on systems with weak memory order guarantees:

http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt

"Storing the final value of the TLS descriptor also needs care: the resolver field must be set to its final value after the argument gets its final value, such that if any thread attempts to use the descriptor before it gets its final value, it still goes to the hold function."

The current glibc code (i386, x86_64, arm, aarch64) is

  td->arg = ...;
  td->entry = ...;

The ARM memory model allows store-store reordering, so a barrier is needed between these two stores. (The code is broken on x86 as well in principle: the x86 memory model does not help at the C code level, since the compiler can reorder non-aliasing stores.)

What is missing from the TLSDESC design spec is a description of the ordering requirements on the read side: the TLS access sequence (generated by the compiler) loads the descriptor function pointer (td->entry), and then the argument (td->arg) is loaded in that function. These two loads must observe the changes made on the write side in a sequentially consistent way.

The code in asm:

  ldr x1, [x0]    // load td->entry
  blr x1          // call it

entryfunc:
  ldr x0, [x0,#8] // load td->arg

The branch (blr) does not provide a load-load ordering barrier, so the second load may observe an old arg even though the first load observed the new entry. This happens on existing AArch64 hardware, causing invalid memory access due to the incorrect TLS offset.
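The race can be modeled with C11 atomics. This is a sketch, not glibc code; the struct and function names are invented for illustration. The key point is that the call-site load of td->entry needs acquire semantics, which is what the extra `ldar` in option (3) below retrofits at the asm level:

```c
/* A C11 model of the race just described -- invented names, not glibc code. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct tlsdesc_model
{
  /* td->entry: resolver function pointer, loaded by compiler-generated code. */
  _Atomic (ptrdiff_t (*) (struct tlsdesc_model *)) entry;
  /* td->arg: loaded inside the entry function. */
  ptrdiff_t arg;
};

/* Final-state entry function (plays the role of _dl_tlsdesc_return).  */
static ptrdiff_t
model_entry_final (struct tlsdesc_model *td)
{
  /* Plain load: its ordering must come from the call site (or from the
     extra same-address acquire load that option (3) below inserts).  */
  return td->arg;
}

/* The compiler-generated access sequence, with the acquire semantics
   that the ldar in option (3) provides in the real asm.  */
static ptrdiff_t
model_tls_access (struct tlsdesc_model *td)
{
  ptrdiff_t (*fn) (struct tlsdesc_model *)
    = atomic_load_explicit (&td->entry, memory_order_acquire);
  return fn (td);
}
```

In the real generated code no C11 atomics are available, which is why the ordering has to be recovered inside the entry function.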
Various alternatives were considered to force the load ordering in the descriptor function:

(1) barrier

entryfunc:
  dmb ishld
  ldr x0, [x0,#8]

(2) address dependency (if the address of the second load depends on the result of the first one, the ordering is guaranteed):

entryfunc:
  ldr x1,[x0]
  and x1,x1,#8
  orr x1,x1,#8
  ldr x0,[x0,x1]

(3) load-acquire (ARMv8 instruction that is ordered before subsequent loads and stores)

entryfunc:
  ldar xzr,[x0]
  ldr x0,[x0,#8]

Option (1) is the simplest but slowest (note: this runs at every TLS access). Options (2) and (3) do one extra load from [x0] (same-address loads are ordered, so it happens-after the load on the call site). Option (2) clobbers x1, so I think (3) is the best solution: a load into the zero register has the same effect as with a normal register, but the result is discarded so nothing is clobbered. (Note that the TLSDESC ABI allows the descriptor function to clobber x1, but current gcc does not implement this correctly; gcc will be fixed independently. The dynamic and undefweak descriptors currently save/restore x1, so only static TLS would be affected by this clobbering issue.)

On the write side I used a full barrier (__sync_synchronize) although

  dmb ishst

would be enough, but the write side is not performance critical.

I introduced a new _dl_tlsdesc_return_lazy descriptor function for lazily relocated static TLS, so the non-lazy use-case is not affected. The dynamic and undefweak descriptors do enough work that the additional ldar should not have a performance impact.

Other thoughts:

- Lazy binding for static TLS may be unnecessary complexity: it seems that gcc generates at most two static TLSDESC relocation entries per translation unit (zero- and non-zero-initialized TLS), so a lot of object files with static TLS would have to be linked into a shared object to make the load-time relocation overhead significant. Unless there is some evidence that lazy static TLS relocation makes sense, I would suggest removing that logic (from all archs). (The dynamic case is probably a similar micro-optimization, but there the lazy vs non-lazy semantics differ in case of undefined symbols, and some applications may depend on that.)

- 32-bit ARM with -mtls-dialect=gnu2 is affected too; it will have to go with option (1) or (2), with an additional twist: some existing ARMv7 cpus don't implement same-address load ordering reliably, see:
http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

- I don't understand the role of the hold function in the general TLSDESC design: why is it not enough to let other threads wait on the global lock in the initial resolver function? Is the global dl lock implemented as a spin lock? Is it for some liveness/fairness property?

- There are some incorrect uses of cfi_adjust_cfa_offset in dl-tlsdesc.S, which is a separate patch.

Changelog:

2015-04-22  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #18034]
	* sysdeps/aarch64/dl-tlsdesc.h (_dl_tlsdesc_return_lazy): Declare.
	* sysdeps/aarch64/dl-tlsdesc.S (_dl_tlsdesc_return_lazy): Define.
	(_dl_tlsdesc_undefweak): Guarantee load-load ordering.
	(_dl_tlsdesc_dynamic): Likewise.
	* sysdeps/aarch64/tlsdesc.c (_dl_tlsdesc_resolve_rela_fixup):
	Add barrier between stores.
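For reference, the write-side requirement can be sketched in C11 terms (invented names, not the actual glibc code): the patch itself uses __sync_synchronize between plain stores, which a C11 release fence expresses equivalently.

```c
/* Write-side sketch (hypothetical names): td->arg must be globally
   visible before td->entry points at the final entry function.  */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct desc_model
{
  _Atomic (ptrdiff_t (*) (struct desc_model *)) entry;
  ptrdiff_t arg;
};

/* Final-state entry function used after lazy resolution.  */
static ptrdiff_t
desc_final (struct desc_model *td)
{
  return td->arg;
}

static void
model_fixup (struct desc_model *td, ptrdiff_t arg,
             ptrdiff_t (*entry) (struct desc_model *))
{
  td->arg = arg;
  /* The store above is ordered before the store below; on AArch64 a
     release fence typically compiles to "dmb ish", stronger than the
     minimal "dmb ishst" mentioned above.  */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&td->entry, entry, memory_order_relaxed);
}
```

A release store to entry (with arg stored first) would express the same requirement without a separate fence.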
Comments
On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> Lazy TLSDESC initialization needs to be synchronized with concurrent TLS
> accesses.
>
> The original ARM TLSDESC design did not discuss this race that
> arises on systems with weak memory order guarantees
>
> http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
>
> "Storing the final value of the TLS descriptor also needs care: the
> resolver field must be set to its final value after the argument gets
> its final value, such that if any thread attempts to use the
> descriptor before it gets its final value, it still goes to the hold
> function."
>
> The current glibc code (i386, x86_64, arm, aarch64) is
>
> td->arg = ...;
> td->entry = ...;
>
> the arm memory model allows store-store reordering, so a barrier is
> needed between these two stores (the code is broken on x86 as well in
> principle: x86 memory model does not help on the c code level, the
> compiler can reorder non-aliasing stores).

Yes, this is not guaranteed to work. This should thus be fixed for all the architectures.

> What is missing from the TLSDESC design spec is a description of the
> ordering requirements on the read side: the TLS access sequence
> (generated by the compiler) loads the descriptor function pointer
> (td->entry) and then the argument is loaded (td->arg) in that function.
> These two loads must observe the changes made on the write side in a
> sequentially consistent way.

Not strictly in a sequentially consistent way, I believe. What you are probably looking for (I'm not familiar with TLS in detail) is some invariant like: if you read update X, you will also read update Y or a more recent update to Y.
> The code in asm:
>
>   ldr x1, [x0]    // load td->entry
>   blr x1          // call it
>
> entryfunc:
>   ldr x0, [x0,#8] // load td->arg
>
> The branch (blr) does not provide a load-load ordering barrier (so the
> second load may observe an old arg even though the first load observed
> the new entry, this happens with existing aarch64 hardware causing
> invalid memory access due to the incorrect TLS offset).
>
> Various alternatives were considered to force the load ordering in the
> descriptor function:
>
> (1) barrier
>
> entryfunc:
>   dmb ishld
>   ldr x0, [x0,#8]
>
> (2) address dependency (if the address of the second load depends on the
> result of the first one the ordering is guaranteed):
>
> entryfunc:
>   ldr x1,[x0]
>   and x1,x1,#8
>   orr x1,x1,#8
>   ldr x0,[x0,x1]

The address dependencies could be useful in terms of performance, but the disadvantage that I see is that there's currently no useful way to use them in C code (C11's memory_order_consume is broken).

> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> loads and stores)
>
> entryfunc:
>   ldar xzr,[x0]
>   ldr x0,[x0,#8]
>
> Option (1) is the simplest but slowest (note: this runs at every TLS
> access), options (2) and (3) do one extra load from [x0] (same address
> loads are ordered so it happens-after the load on the call site),
> option (2) clobbers x1, so I think (3) is the best solution (a load
> into the zero register has the same effect as with a normal register,
> but the result is discarded so nothing is clobbered. Note that the
> TLSDESC abi allows the descriptor function to clobber x1, but current
> gcc does not implement this correctly, gcc will be fixed independently,
> the dynamic and undefweak descriptors currently save/restore x1 so only
> static TLS would be affected by this clobbering issue).
>
> On the write side I used a full barrier (__sync_synchronize) although
>
> dmb ishst
>
> would be enough, but write side is not performance critical.
Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
Fixes to existing code should use the C11 memory model for concurrent code, especially if the fix is about a concurrency issue.

I agree with Adhemerval that a release MO store seems to be sufficient in this case.

I haven't scanned through the TLS code to assess how much work it would be to make it data-race-free (as defined by C11), but it would certainly be helpful in showing similar issues (e.g., when documenting it) and preventing compiler optimizations that are legal for non-concurrent accesses but not what we need for TLS (e.g., speculative loads introduced by the compiler on x86 in the case you mention above).

Given that you have looked at the code, could you give a rough estimate of how much churn it would be to make the TLS code data-race-free?

It also looks as if the x86_64 variant of tlsdesc.c is, before your changes and ignoring some additional comments, very similar to the aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and using the C11 model) that can be used on several archs? This would certainly decrease maintenance overhead.

I'm aware this might look like I'm requesting extra work that is not at the core of what you are trying to fix. However, moving to portable concurrent code that is shared across several archs is something we've been doing elsewhere too, and in those cases ARM was on the benefiting side (e.g., pthread_once, ongoing work in nscd, ...). I think overall, some extra work and clean-up will be a good thing for ARM archs too.

Thanks.
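The invariant named above ("if you read update X, you will also read update Y or a more recent update to Y") is the classic message-passing pattern, which C11 release/acquire pairs guarantee. A minimal runnable sketch with invented names (`ready` plays the role of td->entry, `payload` of td->arg); it assumes POSIX threads are available:

```c
/* Message-passing sketch: release/acquire gives the X-implies-Y invariant. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic int ready;  /* plays the role of td->entry */
static int payload;        /* plays the role of td->arg   */

static void *
writer (void *arg)
{
  (void) arg;
  payload = 42;                                            /* store Y */
  atomic_store_explicit (&ready, 1, memory_order_release); /* store X */
  return NULL;
}

/* Spins until the release store is visible; the acquire load then
   guarantees that the payload store is visible too.  */
static int
reader (void)
{
  while (atomic_load_explicit (&ready, memory_order_acquire) == 0)
    ; /* spin */
  return payload;
}

static int
run_message_passing_demo (void)
{
  pthread_t t;
  pthread_create (&t, NULL, writer, NULL);
  int v = reader ();
  pthread_join (t, NULL);
  return v;
}
```

On AArch64 the acquire load in `reader` is exactly what the `ldar` in the proposed fix provides.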
On 22/04/15 17:08, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>> (2) address dependency (if the address of the second load depends on the
>> result of the first one the ordering is guaranteed):
>>
>> entryfunc:
>>   ldr x1,[x0]
>>   and x1,x1,#8
>>   orr x1,x1,#8
>>   ldr x0,[x0,x1]
>
> The address dependencies could be useful in terms of performance, but
> the disadvantage that I see is that there's currently no useful way to
> use them in C code (C11's memory_order_consume is broken).

this must be in asm so i don't think it matters if consume is broken in c11. (tlsdesc resolver cannot clobber registers).

>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>> loads and stores)
>>
>> entryfunc:
>>   ldar xzr,[x0]
>>   ldr x0,[x0,#8]
>>
>> Option (1) is the simplest but slowest (note: this runs at every TLS
>> access), options (2) and (3) do one extra load from [x0] (same address
>> loads are ordered so it happens-after the load on the call site),
>> option (2) clobbers x1, so I think (3) is the best solution (a load
>> into the zero register has the same effect as with a normal register,
>> but the result is discarded so nothing is clobbered. Note that the
>> TLSDESC abi allows the descriptor function to clobber x1, but current
>> gcc does not implement this correctly, gcc will be fixed independently,
>> the dynamic and undefweak descriptors currently save/restore x1 so only
>> static TLS would be affected by this clobbering issue).
>>
>> On the write side I used a full barrier (__sync_synchronize) although
>>
>> dmb ishst
>>
>> would be enough, but write side is not performance critical.
>
> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> Fixes to existing code should use the C11 memory model for concurrent
> code, especially if the fix is about a concurrency issue.
>
> I agree with Adhemerval that a release MO store seems to be sufficient
> in this case.
>

yes, the write side could use c11 atomics and i agree that store-release is enough, but i'm not sure how much one can rely on c11 memory model when interacting with asm code. (so i thought a full barrier is the easiest to reason about)

i can update the code to use glibc atomics, but i will have to look into how those work (are there type issues or are they generic?)

> I haven't scanned through the TLS code to assess how much work it would
> be to make it data-race-free (as defined by C11), but it would certainly
> be helpful in showing similar issues (e.g., when documenting it) and
> preventing compiler optimizations that are legal for non-concurrent
> accesses but not what we need for TLS (e.g., speculative loads
> introduced by the compiler on x86 in the case you mention above).
>
> Given that you have looked at the code, could you give a rough estimate
> of how much churn it would be to make the TLS code data-race-free?

i think td->entry should only be accessed with atomics once the loader finished loading the dso with it. (i assume during dso loading concurrent access is not possible)

(the particular case i'm trying to fix is hard because the access to td->entry is generated by the compiler, so it has to be worked around inside the entry function. i think x86 does not have this issue)

i think these functions might need some attention:

elf/tlsdeschtab.h:
  _dl_tlsdesc_resolve_early_return_p
sysdeps/{x86_64,i386,arm,aarch64}/tlsdesc.c:
  _dl_tlsdesc_resolve_rela_fixup
  _dl_tlsdesc_resolve_hold_fixup

but i haven't done a detailed analysis

> It also looks as if the x86_64 variant of tlsdesc.c is, before your
> changes and ignoring some additional comments, very similar to the
> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
> using the C11 model) that can be used on several archs? This would
> certainly decrease maintenance overhead.
>

should be possible i think: these functions are called from asm to do the lazy resolution but i have to check the arch specific details.

> I'm aware this might look like I'm requesting extra work that is not at
> the core of what you are trying to fix. However, moving to portable
> concurrent code that is shared across several archs is something we've
> been doing elsewhere too, and in those cases ARM was on the benefiting
> side (e.g., pthread_once, ongoing work in nscd, ...). I think overall,
> some extra work and clean-up will be a good thing for ARM archs too.
>
> Thanks.
On 22/04/15 18:14, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
>> Given that you have looked at the code, could you give a rough estimate
>> of how much churn it would be to make the TLS code data-race-free?
>
> elf/tlsdeschtab.h:
>   _dl_tlsdesc_resolve_early_return_p
> sysdeps/{x86_64,i386,arm,aarch64}/tlsdesc.c:
>   _dl_tlsdesc_resolve_rela_fixup
>   _dl_tlsdesc_resolve_hold_fixup

i think these all need to use atomics for accessing td->entry for c11 correctness, but it's hard to tell what's right since c11 only talks about synchronizing with c code, not asm.

the td->entry can be in 3 states during lazy resolution:

* init: retries the call of the entry until caller==entry (using a double-checked locking mechanism), then it
  - grabs GL(dl_load_lock)
  - changes the entry to the hold state
  - does the resolver work
  - changes arg and entry to the final value
  - releases the lock to wake the waiters
  - calls the new entry
* hold: waits on the GL(dl_load_lock) (then retries calling the entry when it's woken up)
* final state: (static, dynamic or undefweak callback) calculates the tls offset based on arg (currently without any locks, which is bad on weak memory models)

the code for tls access is generated by the compiler so there are no atomics there: the loaded entry can be in init, hold or final state, and the guarantee about the state of arg must come from the target arch memory model.

and to answer my earlier question about the hold state: it is needed to avoid the spinning in the double-checked locking in init.

>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>> changes and ignoring some additional comments, very similar to the
>> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
>> using the C11 model) that can be used on several archs? This would
>> certainly decrease maintenance overhead.
>>
>
> should be possible i think: these functions are called
> from asm to do the lazy resolution
>
> but i have to check the arch specific details.

looking at this, but it seems to be significant work:

* i386 and arm determine the caller differently
* i386 uses special calling convention from asm
* arm handles local symbols specially

i think the way to move forward is:

* fix the correctness bug now on aarch64
* decide if lazy static tls resolver is worth it
* do the refactoring so tlsdesc is common code
* use c11 atomics

the fix for arm is a lot harder (because atomics are different on different versions of arm, an ifunc based dispatch could in principle solve the dmb vs no-dmb issue for the lazy resolvers).

is this ok?
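The three-state scheme (init/hold/final) described in the previous message can be written as double-checked locking in C11 terms. This is a simplified model with invented names, not the glibc implementation: the real code also performs the caller==entry retry check before taking the lock, and uses the rtld recursive lock rather than a plain mutex.

```c
/* Model of the init -> hold -> final lazy-resolution state machine. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct lazy_desc;
typedef ptrdiff_t (*entry_fn) (struct lazy_desc *);

struct lazy_desc
{
  _Atomic entry_fn entry;
  ptrdiff_t arg;
};

/* Plays the role of GL(dl_load_lock).  */
static pthread_mutex_t load_lock = PTHREAD_MUTEX_INITIALIZER;

/* Final state: compute the offset from arg (no locking).  */
static ptrdiff_t
desc_static (struct lazy_desc *td)
{
  return td->arg;
}

static ptrdiff_t desc_init (struct lazy_desc *td);

/* Hold state: wait on the lock, then retry whatever entry is current.  */
static ptrdiff_t
desc_hold (struct lazy_desc *td)
{
  pthread_mutex_lock (&load_lock);
  pthread_mutex_unlock (&load_lock);
  entry_fn fn = atomic_load_explicit (&td->entry, memory_order_acquire);
  return fn (td);
}

/* Init state: double-checked locking, resolve, publish the final state.  */
static ptrdiff_t
desc_init (struct lazy_desc *td)
{
  pthread_mutex_lock (&load_lock);
  if (atomic_load_explicit (&td->entry, memory_order_relaxed) == desc_init)
    {
      /* Send concurrent callers to the hold state while we work.  */
      atomic_store_explicit (&td->entry, desc_hold, memory_order_relaxed);
      td->arg = 42;  /* "resolver work": compute the TLS offset.  */
      /* Publish: arg must be visible before the final entry.  */
      atomic_store_explicit (&td->entry, desc_static, memory_order_release);
    }
  pthread_mutex_unlock (&load_lock);
  entry_fn fn = atomic_load_explicit (&td->entry, memory_order_acquire);
  return fn (td);
}
```

The hold state exists, as noted above, so that waiters block on the lock instead of spinning on the entry pointer.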
On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> On 22/04/15 17:08, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
> >> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
> >> loads and stores)
> >>
> >> entryfunc:
> >>   ldar xzr,[x0]
> >>   ldr x0,[x0,#8]
> >>
> >> Option (1) is the simplest but slowest (note: this runs at every TLS
> >> access), options (2) and (3) do one extra load from [x0] (same address
> >> loads are ordered so it happens-after the load on the call site),
> >> option (2) clobbers x1, so I think (3) is the best solution (a load
> >> into the zero register has the same effect as with a normal register,
> >> but the result is discarded so nothing is clobbered. Note that the
> >> TLSDESC abi allows the descriptor function to clobber x1, but current
> >> gcc does not implement this correctly, gcc will be fixed independently,
> >> the dynamic and undefweak descriptors currently save/restore x1 so only
> >> static TLS would be affected by this clobbering issue).
> >>
> >> On the write side I used a full barrier (__sync_synchronize) although
> >>
> >> dmb ishst
> >>
> >> would be enough, but write side is not performance critical.
> >
> > Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
> > Fixes to existing code should use the C11 memory model for concurrent
> > code, especially if the fix is about a concurrency issue.
> >
> > I agree with Adhemerval that a release MO store seems to be sufficient
> > in this case.
> >
>
> yes, the write side could use c11 atomics and i agree that
> store-release is enough, but i'm not sure how much one can
> rely on c11 memory model when interacting with asm code.
> (so i thought a full barrier is the easiest to reason about)

IIUC, the asm that you need to interact with is generated by the compiler.
The C11 atomics implementation (that is, the asm / HW features used in it) is effectively part of the ABI; C11 translation units need to be able to synchronize with each other even if compiled by different compiler versions. Therefore, making sure that the asm follows what the compiler would generate for the intended C11 atomics equivalent is a solution. That also means that we can say clearly, in C11 memory model terms, what the asm implementation on an arch should do.

> i can update the code to use glibc atomics, but i will
> have to look into how those work (are there type issues
> or are they generic?)

They are basically generic, and internally use either the __atomic builtins for newer compilers and archs that provide them, or (fall back to) the previous implementation based on __sync builtins or custom asm. One constraint is that we don't support sizes of atomic access that aren't actually provided on the particular arch (either through HW instructions, or with kernel helpers). So, for example, atomic access to a naturally aligned pointer-sized type can be expected to work.

> I haven't scanned through the TLS code to assess how much work it would
> be to make it data-race-free (as defined by C11), but it would certainly
> be helpful in showing similar issues (e.g., when documenting it) and
> preventing compiler optimizations that are legal for non-concurrent
> accesses but not what we need for TLS (e.g., speculative loads
> introduced by the compiler on x86 in the case you mention above).
>
> Given that you have looked at the code, could you give a rough estimate
> of how much churn it would be to make the TLS code data-race-free?
>
> i think td->entry should only be accessed with atomics once
> the loader finished loading the dso with it.
> (i assume during dso loading concurrent access is not possible)

If that's some kind of initialization you have in mind, using nonatomic accesses may be fine.
We don't have strict rules around that currently. But if it's just a few lines of code, then simply using relaxed-MO atomic accesses could be fine as well.

> (the particular case i'm trying to fix is hard because
> the access to td->entry is generated by the compiler, so
> it has to be worked around inside the entry function.
> i think x86 does not have this issue)

Do you mean that you want to affect the compiler-generated code after it has been generated?
On 23/04/15 22:46, Torvald Riegel wrote:
> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 17:08, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 13:27 +0100, Szabolcs Nagy wrote:
>>>> (3) load-acquire (ARMv8 instruction that is ordered before subsequent
>>>> loads and stores)
>>>>
>>>> entryfunc:
>>>>   ldar xzr,[x0]
>>>>   ldr x0,[x0,#8]
>>>>
>>>> Option (1) is the simplest but slowest (note: this runs at every TLS
>>>> access), options (2) and (3) do one extra load from [x0] (same address
>>>> loads are ordered so it happens-after the load on the call site),
>>>> option (2) clobbers x1, so I think (3) is the best solution (a load
>>>> into the zero register has the same effect as with a normal register,
>>>> but the result is discarded so nothing is clobbered. Note that the
>>>> TLSDESC abi allows the descriptor function to clobber x1, but current
>>>> gcc does not implement this correctly, gcc will be fixed independently,
>>>> the dynamic and undefweak descriptors currently save/restore x1 so only
>>>> static TLS would be affected by this clobbering issue).
>>>>
>>>> On the write side I used a full barrier (__sync_synchronize) although
>>>>
>>>> dmb ishst
>>>>
>>>> would be enough, but write side is not performance critical.
>>>
>>> Please get familiar with https://sourceware.org/glibc/wiki/Concurrency
>>> Fixes to existing code should use the C11 memory model for concurrent
>>> code, especially if the fix is about a concurrency issue.
>>>
>>> I agree with Adhemerval that a release MO store seems to be sufficient
>>> in this case.
>>>
>>
>> yes, the write side could use c11 atomics and i agree that
>> store-release is enough, but i'm not sure how much one can
>> rely on c11 memory model when interacting with asm code.
>> (so i thought a full barrier is the easiest to reason about)
>
> IIUC, the asm that you need to interact with is generated by the
> compiler.
> The C11 atomics implementation (that is, the asm / HW

to clarify:

// this bit is generated by the compiler for every tls access
// (the exact sequence is different and depends on tiny/small/large
// memory model, it is part of the abi so linkers can do relaxations)
// assume x0 is a pointer to the tls descriptor (td) in the GOT.
// (2 consecutive GOT entries are used: td->entry and td->arg)
// the entry function returns an offset (so tls data is at tp+off)
// note: no barriers are used here

  mrs x2, tpidr_el0  // load the thread pointer
  ldr x1, [x0]       // load td->entry
  blr x1             // call it, returns off in x0
  ldr x0, [x2, x0]   // access tls at tp+off

// this is a function implemented by the libc where td->entry points
// currently the options are
//   _dl_tlsdesc_return,
//   _dl_tlsdesc_undefweak,
//   _dl_tlsdesc_dynamic
// originally td->entry points to a trampoline that calls into
// _dl_tlsdesc_resolve_rela which does the lazy relocation and sets
// td->entry to _dl_tlsdesc_hold temporarily.

entryfunc:
  ldr x0, [x0,#8]    // load td->arg

>> i can update the code to use glibc atomics, but i will
>> have to look into how those work (are there type issues
>> or are they generic?)
>
> They are basically generic, and internally use either the __atomic
> builtins for newer compilers and archs that provide them, or (fall back
> to) the previous implementation based on __sync builtins or custom asm.
> One constraint is that we don't support sizes of atomic access that
> aren't actually provided on the particular arch (either through HW
> instructions, or with kernel helpers). So, for example, atomic access
> to a naturally aligned pointer-sized type can be expected to work.

ok

>> i think td->entry should only be accessed with atomics once
>> the loader finished loading the dso with it.
>> (i assume during dso loading concurrent access is not possible)
>
> If that's some kind of initialization you have in mind, using nonatomic
> accesses may be fine.
> We don't have strict rules around that currently.
> But if it's just a few lines of code, then simply using relaxed-MO
> atomic accesses could be fine as well.
>

at load time the td may be set up and then it is not written later, but with lazy binding it will be written later.

care should be taken because the compiler generated tls access code does not use atomics and can run in parallel to lazy initialization. so i guess any td->entry access in the lazy init code path should use atomics for c11 correctness.

>> (the particular case i'm trying to fix is hard because
>> the access to td->entry is generated by the compiler, so
>> it has to be worked around inside the entry function.
>> i think x86 does not have this issue)
>
> Do you mean that you want to affect the compiler-generated code after it
> has been generated?
>

i'm fixing the tlsdesc entry functions in libc
On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote:
> On 22/04/15 18:14, Szabolcs Nagy wrote:
> > On 22/04/15 17:08, Torvald Riegel wrote:
> >> Given that you have looked at the code, could you give a rough estimate
> >> of how much churn it would be to make the TLS code data-race-free?
> >
> > elf/tlsdeschtab.h:
> >   _dl_tlsdesc_resolve_early_return_p
> > sysdeps/{x86_64,i386,arm,aarch64}/tlsdesc.c:
> >   _dl_tlsdesc_resolve_rela_fixup
> >   _dl_tlsdesc_resolve_hold_fixup
> >
>
> i think these all need to use atomics for accessing
> td->entry for c11 correctness, but it's hard to tell
> what's right since c11 only talks about synchronizing
> with c code, not asm.

That's true, but if we can require the asm code to be conceptually equivalent to some C11 code, then we can specify and document the concurrent algorithm or synchronization scheme based on C11 as a foundation, and either use C11 atomics to implement it, or use the code the compiler would generate for C11 atomics if implementing in asm.

That doesn't cover cases where, in the asm, we would want to use synchronization that can't be represented through C11 atomics, or that is incompatible with C11 atomics. But so far it doesn't look like this, or have you observed such cases? (consume MO might be a case, but I guess for that we could still wave our hands and pretend compilers would simply generate the "intuitively generated" code).
> the td->entry can be in 3 states during lazy resolution:
>
> * init: retries the call of the entry until caller==entry
>   (using a double-checked locking mechanism), then it
>   - grabs GL(dl_load_lock)
>   - changes the entry to the hold state
>   - does the resolver work
>   - changes arg and entry to the final value
>   - releases the lock to wake the waiters
>   - calls the new entry
> * hold: waits on the GL(dl_load_lock)
>   (then retries calling the entry when it's woken up)
> * final state: (static, dynamic or undefweak callback)
>   calculates the tls offset based on arg
>   (currently without any locks, which is bad on weak
>   memory models)

Could you point me to (or confirm that) a new load of td->entry happens if caller != td->entry? In the asm bits you posted so far, it seems that if the call to *td->entry returns, actual TLS data is loaded. For example, you posted this:

  ldr x1, [x0]    // load td->entry
  blr x1          // call it

entryfunc:
  ldr x0, [x0,#8] // load td->arg

I'm asking because in tlsdeschtab.h we have:

static int
_dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
{
  if (caller != td->entry)
    return 1;

  __rtld_lock_lock_recursive (GL(dl_load_lock));
  if (caller != td->entry)
    {
      __rtld_lock_unlock_recursive (GL(dl_load_lock));
      return 1;
  ....

If _dl_tlsdesc_resolve_early_return_p returns 1, _dl_tlsdesc_resolve_rela_fixup just returns. If the next thing we do is load td->arg, we need another acquire-MO load of td->entry in _dl_tlsdesc_resolve_early_return_p. A relaxed-MO load (or the current non-atomic access, ignoring DRF) would only be sufficient if all that caller != atomic_load_relaxed (&td->entry) leads to is further busy waiting (e.g., calling *td->entry again). But if we really expect caller != td->entry to have meaning and tell us something about other state, we need to have another barrier there to not have a typical double-checked-locking bug.
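A sketch of this suggestion with simplified, invented types (the real function takes a struct tlsdesc volatile * and uses the rtld recursive lock): both early-return checks use acquire loads, so a caller that returns early also observes the td->arg published together with the entry it saw.

```c
/* Double-checked locking with acquire loads on both checks -- a model
   of the suggested fix, not the glibc code.  */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

struct desc
{
  _Atomic (void *) entry;
  void *arg;
};

/* Plays the role of GL(dl_load_lock).  */
static pthread_mutex_t dl_load_lock = PTHREAD_MUTEX_INITIALIZER;

static int
early_return_p (struct desc *td, void *caller)
{
  /* Acquire: if we see a newer entry, we must also see the newer arg.  */
  if (caller != atomic_load_explicit (&td->entry, memory_order_acquire))
    return 1;

  pthread_mutex_lock (&dl_load_lock);
  /* Second check also needs acquire, otherwise a thread that returns
     early here could still read a stale arg afterwards.  */
  if (caller != atomic_load_explicit (&td->entry, memory_order_acquire))
    {
      pthread_mutex_unlock (&dl_load_lock);
      return 1;
    }
  return 0;  /* Caller still owns the resolution; lock stays held.  */
}
```

As in the real code, the lock is intentionally left held when 0 is returned; the resolver releases it after publishing the final descriptor.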
(This shows another benefit of using atomics for all concurrent accesses: it forces us to make a conscious decision about the memory orders, and document why. If a relaxed-MO access is sufficient, it should better have a clear explanation...)

> the code for tls access is generated by the compiler so
> there are no atomics there: the loaded entry can be in
> init, hold or final state, the guarantee about the state
> of arg must come from the target arch memory model.

I understood there are no C11 atomics used in the asm implementation bits. I'm thinking about abstracting the synchronization scheme we want to use in the implementation, represented through C11 atomics synchronization; the asm would just do something equivalent (see my comments about the atomics implementation being effectively part of the ABI).

> and to answer my earlier question about the hold state:
> it is needed to avoid the spinning in the double-checked
> locking in init.
>
>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>> changes and ignoring some additional comments, very similar to the
>> aarch64 variant. Can we get one proper tlsdesc.c (data-race-free and
>> using the C11 model) that can be used on several archs? This would
>> certainly decrease maintenance overhead.
>>
>
> should be possible i think: these functions are called
> from asm to do the lazy resolution
>
> but i have to check the arch specific details.
>
> looking at this, but it seems to be significant work:
>
> * i386 and arm determine the caller differently
> * i386 uses special calling convention from asm
> * arm handles local symbols specially

Okay. But do they still use the same abstract synchronization scheme, and just differ in a few aspects? Or are the synchronization schemes significantly different?

> i think the way to move forward is

I think step 1 should be to document the synchronization scheme on an abstract, algorithmic level, using C11 atomics in the pseudo-code.
We want to document the abstract level because it's typically much
easier to reason about the abstraction than the concrete implementation
in the case of concurrent code, and because confirming that an
implementation matches the abstraction is typically also much easier
than inferring the abstraction from a concrete implementation.

So, for example, if we use a kind of double-checked locking here,
document that, show the pseudo-code with C11 atomics, and then refer
back to this when documenting why certain memory orders on atomic
operations (or asm instructions) are sufficient and equivalent to the
abstract algorithm.  In our case, this would mean saying that, for
example, we need load Y to use acquire MO so that it synchronizes with
release-MO store X, making sure that a load B that is sequenced after Y
observes store A (which is a core part of double-checked locking, so
this could also be covered in the abstract algorithm docs and then just
referred back to from the concrete implementation code).

> * fix the correctness bug now on aarch64

OK.  When doing that, check and document equivalence to the abstract
synchronization scheme.  When you do that, please use atomic_* for all
synchronization that you add or change.
It might be easiest to also change all atomic accesses that you checked
and documented to use atomic_*; this isn't much work compared to the
checking, and it becomes obvious what has been checked.

> * decide if lazy static tls resolver is worth it
> * do the refactoring so tlsdesc is common code
> * use c11 atomics

I'd prefer C11 atomics to be used right away when fixing things;
however, if you mean transforming and checking all other TLS code
unrelated to the bug you're looking at, then I agree that this can be
done incrementally and after you have fixed this bug.

> the fix for arm is a lot harder (because atomics are
> different on different versions of arm, an ifunc based
> dispatch could in principle solve the dmb vs no-dmb
> issue for the lazy resolvers).
Is the synchronization used in the asm bits on the different versions of
arm different from what C11 atomics would look like on those versions of
arm?  Or are we just dealing with different ways to implement acquire
loads, for example?
On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
> On 23/04/15 22:46, Torvald Riegel wrote:
> > On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> >> (the particular case i'm trying to fix is hard because
> >> the access to td->entry is generated by the compiler, so
> >> it has to be worked around inside the entry function.
> >> i think x86 does not have this issue)
> >
> > Do you mean that you want to affect the compiler-generated code after it
> > has been generated?
>
> i'm fixing the tlsdesc entry functions in libc

Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
asm code for TLS accesses may lack acquire barriers in some cases.  Do
you want to work around that by trying to adapt the glibc
implementation, or do you just want to fix all glibc-specific bugs, and
handle the compiler-side issues separately?
On 24/04/15 18:37, Torvald Riegel wrote:
> On Thu, 2015-04-23 at 13:08 +0100, Szabolcs Nagy wrote:
>> On 22/04/15 18:14, Szabolcs Nagy wrote:
>
>> the td->entry can be in 3 states during lazy resolution:
>>
>> * init: retries the call of entry until caller==entry
>>   (using a double-checked locking mechanism, then it
>>   - grabs GL(dl_load_lock)
>>   - changes the entry to the hold state
>>   - does the resolver work
>>   - changes arg and entry to the final value
>>   - releases the lock to wake the waiters
>>   - calls the new entry)
>> * hold: waits on the GL(dl_load_lock)
>>   (then retries calling the entry when it's woken up)
>> * final state: (static, dynamic or undefweak callback)
>>   calculates the tls offset based on arg
>>   (currently without any locks which is bad on weak
>>   memory models)
>
> Could you point me to (or confirm that) a new load of td->entry happens
> if caller != td->entry?  In the asm bits you posted so far, it seems
> that if the call to *td->entry returns, actual TLS data is loaded.  For
> example, you posted this:

if td->entry is in init state it will do the retry
(when it ends up in _dl_tlsdesc_resolve_early_return_p)

>
> ldr x1, [x0]      // load td->entry
> blr [x0]          // call it

this is supposed to be 'blr x1', but you got the meaning

> entryfunc:
> ldr x0, [x0,#8]   // load td->arg
>
> I'm asking because in tlsdesctab.h we have:
>
> static int
> _dl_tlsdesc_resolve_early_return_p (struct tlsdesc volatile *td, void *caller)
> {
>   if (caller != td->entry)
>     return 1;
>
>   __rtld_lock_lock_recursive (GL(dl_load_lock));
>   if (caller != td->entry)
>     {
>       __rtld_lock_unlock_recursive (GL(dl_load_lock));
>       return 1;
> ....
>
> If _dl_tlsdesc_resolve_early_return_p returns 1,
> _dl_tlsdesc_resolve_rela_fixup just returns.
> If the next thing we do is
> load td->arg, we need another acquire-MO load for td->entry in

no, if _dl_tlsdesc_resolve_rela_fixup returns it ends up in
_dl_tlsdesc_resolve_rela (asm code) that loads td->entry again
and calls the entry again (this is the retry loop)

(so when the resolver updated the entry then the new entry is
actually called, and in case of early return there is a retry)

> _dl_tlsdesc_resolve_early_return_p.  A relaxed-MO load (or the current
> non-atomic access, ignoring DRF) would only be sufficient if all that
> caller != atomic_load_relaxed (&td->entry) leads to is further busy
> waiting (e.g., calling *td->entry again).  But if we really expect
> caller != td->entry to have meaning and tell us something about other
> state, we need another barrier there to avoid a typical
> double-checked-locking bug.

this is why i said you need atomics for accessing td->entry
but this is generic code and not related to the aarch64 bug
so i'd prefer if any such cleanup were done independently

> (This shows another benefit of using atomics for all concurrent
> accesses: it forces us to make a conscious decision about the memory
> orders, and document why.  If a relaxed-MO access is sufficient, it
> had better have a clear explanation...)
>
>> the code for tls access is generated by the compiler so
>> there is no atomics there: the loaded entry can be in
>> init, hold or final state, the guarantee about the state
>> of arg must come from the target arch memory model.
>
> I understood there are no C11 atomics used in the asm implementation
> bits.  I'm thinking about abstracting the synchronization scheme we
> want to use in the implementation, represented through C11 atomics
> synchronization; the asm would just do something equivalent (see my
> comments about the atomics implementation being effectively part of
> the ABI).
>

ok

>> and to answer my earlier question about the hold state:
>> it is needed to avoid the spinning in the double-checked
>> locking in init.
>>
>>>> It also looks as if the x86_64 variant of tlsdesc.c is, before your
>>>> changes and ignoring some additional comments, very similar to the
>>>> aarch64 variant.  Can we get one proper tlsdesc.c (data-race-free and
>>>> using the C11 model) that can be used on several archs?  This would
>>>> certainly decrease maintenance overhead.
>>>>
>>>
>>> should be possible i think: these functions are called
>>> from asm to do the lazy resolution
>>>
>>> but i have to check the arch specific details.
>>
>> looking at this, but it seems to be significant work:
>>
>> * i386 and arm determine the caller differently
>> * i386 uses special calling convention from asm
>> * arm handles local symbols specially
>
> Okay.  But do they still use the same abstract synchronization scheme,
> and just differ in a few aspects?  Or are the synchronization schemes
> significantly different?

the synchronization scheme is the same but if the calling
convention is non-C (regparm) then you cannot use the same
generic C code (without significant changes somewhere).

>> i think the way to move forward is
>
> I think step 1 should be to document the synchronization scheme on an
> abstract, algorithmic level.  Using C11 atomics in this pseudo-code.  We
> want to document the abstract level because it's typically much easier
> to reason about the abstraction than the concrete implementation in case
> of concurrent code, and because confirming that an implementation
> matches the abstraction is typically also much easier than inferring the
> abstraction from a concrete implementation.
>

i think the documentation is the tlsdesc abi spec
(synchronization requirements can be derived from that)

i think we need to fix the bug instead of writing more
documents (you can build the theories later, it will
inevitably cause a lot of changes because the lazy dl
code is far from perfect)

> So, for example, if we use a kind of double-checked locking here,
> document that, show the pseudo-code with C11 atomics, and then refer
> back to this when documenting why certain memory orders on atomic
> operations (or asm instructions) are sufficient and equivalent to the
> abstract algorithm.  In our case, this would mean saying that, for
> example, we need load Y to use acquire MO so that it synchronizes with
> release-MO store X, making sure that a load B that is sequenced after Y
> observes store A (which is a core part of double-checked locking, so
> this could also be covered in the abstract algorithm docs and then just
> referred back to from the concrete implementation code).
>
>> * fix the correctness bug now on aarch64
>
> OK.  When doing that, check and document equivalence to the abstract
> synchronization scheme.  When you do that, please use atomic_* for all
> synchronization that you add or change.
> It might be easiest to also change all atomic accesses that you checked
> and documented to use atomic_*; this isn't much work compared to the
> checking, and it becomes obvious what has been checked.
>
>> * decide if lazy static tls resolver is worth it
>> * do the refactoring so tlsdesc is common code
>> * use c11 atomics
>
> I'd prefer C11 atomics to be used right away when fixing things;
> however, if you mean transforming and checking all other TLS code
> unrelated to the bug you're looking at, then I agree that this can be
> done incrementally and after you have fixed this bug.
>

ok i'll do the atomic_store on td->entry instead of
__sync_synchronize

>> the fix for arm is a lot harder (because atomics are
>> different on different versions of arm, an ifunc based
>> dispatch could in principle solve the dmb vs no-dmb
>> issue for the lazy resolvers).
>
> Is the synchronization used in the asm bits on the different versions
> of arm different from what C11 atomics would look like on those
> versions of arm?  Or are we just dealing with different ways to
> implement acquire loads, for example?

i don't know the details of c11 atomics on arm but i assume
the emitted code depends on which cpu you compile for
(so lots of ifdefs in the asm, or you can detect the current
cpu at runtime with ifunc which is also ugly)
On 24/04/15 18:40, Torvald Riegel wrote:
> On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
>>
>> On 23/04/15 22:46, Torvald Riegel wrote:
>>> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
>>>> (the particular case i'm trying to fix is hard because
>>>> the access to td->entry is generated by the compiler, so
>>>> it has to be worked around inside the entry function.
>>>> i think x86 does not have this issue)
>>>
>>> Do you mean that you want to affect the compiler-generated code after it
>>> has been generated?
>>
>> i'm fixing the tlsdesc entry functions in libc
>
> Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
> asm code for TLS accesses may lack acquire barriers in some cases.  Do
> you want to work around that by trying to adapt the glibc
> implementation, or do you just want to fix all glibc-specific bugs, and
> handle the compiler-side issues separately?

the compiler generated code doesn't need any change
(that would break the arm and aarch64 abi)

i'm fixing the issue in the libc tlsdesc entry functions
(because glibc is currently broken; it is very easy to make
it crash)

the fix relies on the arm memory model

the real bug is the complexity of the dynamic linker code in glibc
and the design document that specified lazy binding for tlsdesc
without considering weak memory model issues; for these there is
an easy fix, but you won't like it: don't do lazy binding.
On Fri, Apr 24, 2015 at 10:27:36PM +0100, Szabolcs Nagy wrote:
> On 24/04/15 18:40, Torvald Riegel wrote:
> > On Fri, 2015-04-24 at 12:05 +0100, Szabolcs Nagy wrote:
> >>
> >> On 23/04/15 22:46, Torvald Riegel wrote:
> >>> On Wed, 2015-04-22 at 18:14 +0100, Szabolcs Nagy wrote:
> >>>> (the particular case i'm trying to fix is hard because
> >>>> the access to td->entry is generated by the compiler, so
> >>>> it has to be worked around inside the entry function.
> >>>> i think x86 does not have this issue)
> >>>
> >>> Do you mean that you want to affect the compiler-generated code after it
> >>> has been generated?
> >>
> >> i'm fixing the tlsdesc entry functions in libc
> >
> > Sorry, I'm still not sure I can follow.  IIUC, the compiler-generated
> > asm code for TLS accesses may lack acquire barriers in some cases.  Do
> > you want to work around that by trying to adapt the glibc
> > implementation, or do you just want to fix all glibc-specific bugs, and
> > handle the compiler-side issues separately?
>
> the compiler generated code doesn't need any change
> (that would break the arm and aarch64 abi)
>
> i'm fixing the issue in the libc tlsdesc entry functions
> (because glibc is broken currently, very easy to make it
> crash)
>
> the fix relies on the arm memory model
>
> the real bug is the complexity of the dynamic linker code in glibc
> and the design document that specified lazy binding for tlsdesc
> without considering weak memory model issues, for these there is
> an easy fix but you won't like it: don't do lazy binding.

Given that lazy tlsdesc binding seems to break AS-safety too, and
breaks fail-safety (there's an allocation that can fail at resolving
time), I think it should just be dropped.  Then this whole issue goes
away and doesn't need complex arch-specific solutions.

Rich
On Apr 22, 2015, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:

> - I don't understand the role of the hold function in the general
> TLSDESC design

On machines that can't overwrite the entire descriptor atomically, it
is intended to stop a thread from using a partially-relocated
descriptor.  Users of the descriptor must load the function word before
the operand, and the relocator must first set the function word to
hold, then modify the operand, and then set the function word to its
final value.  This is supposed to ensure that users of the descriptor
can tell that relocation of the descriptor is already underway, and
thus that the operand may already be trashed.

I agree it's not extremely relevant for current implementations, but
the x86 TLSDESC RFC (the first that couldn't update the descriptor
atomically) describes other envisioned implementations that could take
advantage of it.
On Wed, Apr 22, 2015 at 01:27:15PM +0100, Szabolcs Nagy wrote:
> Other thoughts:
>
> - Lazy binding for static TLS may be unnecessary complexity: it seems
> that gcc generates at most two static TLSDESC relocation entries for a
> translation unit (zero and non-zero initialized TLS), so there has to be
> a lot of object files with static TLS linked into a shared object to
> make the load time relocation overhead significant.  Unless there is some
> evidence that lazy static TLS relocation makes sense I would suggest
> removing that logic (from all archs).  (The dynamic case is probably a
> similar micro-optimization, but there the lazy vs non-lazy semantics are
> different in case of undefined symbols and some applications may depend
> on that.)

I agree with this.  As for undefined symbols, there is a bug where we
segfault on a weak TLS variable, so I doubt that anyone depends on
that behavior.

You didn't mention one essential argument in your analysis: the
overhead is caused by unused TLS variables.  When a variable is used
you need to do the relocation anyway, and a lazy one could be a
magnitude more expensive than an eager one.

There is a project on my backlog to improve TLS access in libraries,
which is still slow even with gnu2, so it would need to introduce
-mtls-dialect=gnu3.  Eager binding of TLS is a prerequisite.

The current TLS models are a flawed design, as they try to save space
by being lazy without any evidence that an application benefits from
it.  The main idea is that total TLS usage is less than 65536 bytes
for most applications.  So unless an application uses more, we could
preallocate that when loading a dso.  That makes a TLS access in a
dynamic library require only one extra read versus a TLS access in the
main binary when you use the following:

static int foo_offset;
#define foo *(foo_offset > 0 ? tcb + foo_offset : get_tls_ptr (- foo_offset))

Here you could do lazy allocation in get_tls_ptr without taking a
lock.
There is a bug where doing the allocation in a signal handler calls
malloc, which could result in deadlock; there was a patch to use mmap
to fix that, but it was reverted.
diff --git a/sysdeps/aarch64/dl-tlsdesc.S b/sysdeps/aarch64/dl-tlsdesc.S
index be9b9b3..ff74b52 100644
--- a/sysdeps/aarch64/dl-tlsdesc.S
+++ b/sysdeps/aarch64/dl-tlsdesc.S
@@ -79,6 +79,25 @@ _dl_tlsdesc_return:
 	cfi_endproc
 	.size	_dl_tlsdesc_return, .-_dl_tlsdesc_return
 
+	/* Same as _dl_tlsdesc_return but with synchronization for
+	   lazy relocation.
+	   Prototype:
+	   _dl_tlsdesc_return_lazy (tlsdesc *) ;
+	 */
+	.hidden _dl_tlsdesc_return_lazy
+	.global	_dl_tlsdesc_return_lazy
+	.type	_dl_tlsdesc_return_lazy,%function
+	cfi_startproc
+	.align 2
+_dl_tlsdesc_return_lazy:
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.  */
+	ldar	xzr, [x0]
+	ldr	x0, [x0, #8]
+	RET
+	cfi_endproc
+	.size	_dl_tlsdesc_return_lazy, .-_dl_tlsdesc_return_lazy
+
 	/* Handler for undefined weak TLS symbols.
 	   Prototype:
 	   _dl_tlsdesc_undefweak (tlsdesc *);
@@ -96,6 +115,9 @@ _dl_tlsdesc_return:
 _dl_tlsdesc_undefweak:
 	str	x1, [sp, #-16]!
 	cfi_adjust_cfa_offset(16)
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.  */
+	ldar	xzr, [x0]
 	ldr	x0, [x0, #8]
 	mrs	x1, tpidr_el0
 	sub	x0, x0, x1
@@ -152,6 +174,9 @@ _dl_tlsdesc_dynamic:
 	stp	x3, x4, [sp, #32+16*1]
 
 	mrs	x4, tpidr_el0
+	/* The ldar guarantees ordering between the load from [x0] at the
+	   call site and the load from [x0,#8] here for lazy relocation.
+	 */
+	ldar	xzr, [x0]
 	ldr	x1, [x0,#8]
 	ldr	x0, [x4]
 	ldr	x3, [x1,#16]
diff --git a/sysdeps/aarch64/dl-tlsdesc.h b/sysdeps/aarch64/dl-tlsdesc.h
index 7a1285e..e6c0078 100644
--- a/sysdeps/aarch64/dl-tlsdesc.h
+++ b/sysdeps/aarch64/dl-tlsdesc.h
@@ -46,6 +46,9 @@ extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_return (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
+_dl_tlsdesc_return_lazy (struct tlsdesc *);
+
+extern ptrdiff_t attribute_hidden
 _dl_tlsdesc_undefweak (struct tlsdesc *);
 
 extern ptrdiff_t attribute_hidden
diff --git a/sysdeps/aarch64/tlsdesc.c b/sysdeps/aarch64/tlsdesc.c
index 4821f8c..f738cc6 100644
--- a/sysdeps/aarch64/tlsdesc.c
+++ b/sysdeps/aarch64/tlsdesc.c
@@ -87,6 +87,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
   if (!sym)
     {
       td->arg = (void*) reloc->r_addend;
+      /* Barrier so readers see the write above before the one below.  */
+      __sync_synchronize ();
       td->entry = _dl_tlsdesc_undefweak;
     }
   else
@@ -98,6 +100,8 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = _dl_make_tlsdesc_dynamic (result, sym->st_value
 					      + reloc->r_addend);
+	  /* Barrier so readers see the write above before the one below.  */
+	  __sync_synchronize ();
 	  td->entry = _dl_tlsdesc_dynamic;
 	}
       else
@@ -105,7 +109,9 @@ _dl_tlsdesc_resolve_rela_fixup (struct tlsdesc volatile *td,
 	{
 	  td->arg = (void*) (sym->st_value + result->l_tls_offset
 			     + reloc->r_addend);
-	  td->entry = _dl_tlsdesc_return;
+	  /* Barrier so readers see the write above before the one below.  */
+	  __sync_synchronize ();
+	  td->entry = _dl_tlsdesc_return_lazy;
 	}
     }