Message ID | orwpzkee1h.fsf@livre.home |
---|---|
State | Superseded |
Headers |
From: Alexandre Oliva <aoliva@redhat.com>
To: Torvald Riegel <triegel@redhat.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>, libc-alpha@sourceware.org
Subject: Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
Date: Wed, 03 Jun 2015 17:24:10 -0300
Message-ID: <orwpzkee1h.fsf@livre.home>
In-Reply-To: <1433344426.21461.202.camel@triegel.csb> (Torvald Riegel's message of "Wed, 03 Jun 2015 17:13:46 +0200")
References: <orvbf5ffyt.fsf@livre.home> <1433326788.21461.81.camel@triegel.csb> <ora8whexn9.fsf@livre.home> <1433344426.21461.202.camel@triegel.csb>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) |
Commit Message
Alexandre Oliva
June 3, 2015, 8:24 p.m. UTC
On Jun 3, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> So you think a reload by the compiler would be bad.

Yeah.  It's a double-checked lock entry with sugar on top.  We only need
to take the lock if l_tls_offset is unresolved (NO_TLS_OFFSET), but if
it is resolved, we want different behavior depending on whether it is
FORCED_DYNAMIC_TLS_OFFSET, or something else (a static TLS offset).

> This can only be bad if there is concurrent modification, potentially

Concurrent modification is made while holding the lock.  It shouldn't
happen in the same thread, at least as long as TLS is regarded as
AS-Unsafe, but other threads could concurrently attempt to resolve a
module's l_tls_offset to a static offset or forced dynamic.

> Thus, therefore we need the atomic access

I'm not convinced we do, but I don't mind, and I don't want this to be
a point of contention.

> AFAIU, you seem to speak about memory reuse unrelated to this
> specifically this particular load, right?

Yeah, some other earlier use of the same location.

> But that sounds like an issue independently of whether the specific
> load is an acquire or relaxed load.

Not really.  It is a preexisting issue, yes, but an acquire load would
make sure the (re)initialization of the memory into a link map,
performed while holding the lock (and with an atomic write, no less),
would necessarily be observed by the atomic acquire load.  A relaxed
load might still observe the un(re)initialized value.  Right?

> We established before that you want to prevent reload because there are
> concurrent stores.  Are these by other threads?

Answered above.

> If so, are there any cases of the following pattern:

Dude.  Of course not.  None of them use atomics.  So far this has only
used locks to guard changes, and double-checked locks for access.
> storing thread;
> A;
> atomic_store_relaxed(&l_tls_offset, ...);
>
> observing thread:
> offset = atomic_load_relaxed(&l_tls_offset);
> B(offset);
>
> where something in B (which uses or has a dependency on offset) relies
> on happening after A?

Let's rewrite this into something more like what we have now:

storing thread:
  acquire lock;
  A;
  set l_tls_offset;
  release lock;

observing thread:
  if l_tls_offset is undecided:
    acquire lock;
    if l_tls_offset is undecided:
      set l_tls_offset to forced_dynamic;  // storing thread too!
    release lock;
  assert(l_tls_offset is decided);
  if l_tls_offset is forced_dynamic:
    dynamicB(l_tls_offset)
  else
    staticB(l_tls_offset)

The forced_dynamic case of B(l_tls_offset) will involve at least copying
the TLS init block, which A will have mapped in and relocated.  We don't
take the lock for that copy, so the release after A doesn't guarantee we
see the intended values.  Now, since the area is mmapped and then
relocated, it is very unlikely that any cache would have kept a copy of
the unrelocated block, let alone of any prior mmapping in that range.
So, it is very likely to work, but it's not guaranteed to work.

As for the static TLS case of B(l_tls_offset), the potential for
problems is similar, but not quite the same.  The key difference is that
the initialization of the static TLS block takes place at the storing
thread, rather than in the observing thread, and although
B(l_tls_offset) won't access the thread's static TLS block, the caller
of __tls_get_addr will.  (And so will any IE and LE TLS access.)

Now, in order for any such access to take place, some relocation applied
by A must be seen by the observing thread, and if there isn't some
sequencing event that ensures the dlopen (or initial load) enclosing A
happens-before the use of the relocation, the whole thing is undefined;
otherwise, this sequencing event ought to be enough of a memory barrier
to guarantee the whole thing works.
It's just that the sequencing event is not provided by the TLS machinery
itself, but rather by the user, in sequencing events after the dlopen;
by the init code, in sequencing the initial loading and relocation
before any application code execution; or by the thread library,
sequencing any thread started by module initializers after their
relocation.  Which means to me that a relaxed load might turn out to be
enough, after all.

> I'm trying to find out what you know about the intent behind the TLS
> synchronization

FWIW, in this discussion we're touching just a tiny fraction of it, and
one that's particularly trivial compared with other bits :-(

> Does dlopen just have to decide about this value

It does tons of stuff (loading modules and dependencies, applying
relocations, running initializers), and it must have a say first.  E.g.,
if any IE relocation references a module, we must make it static TLS.
Otherwise, dlopen may leave it undecided, and then a later dlopen might
attempt to make it static TLS again (and fail if that's no longer
possible), or an intervening tls_get_addr may see it's undecided and
make the module's TLS dynamic.

> I disagree.  You added an atomic load on the consumer side (rightly
> so!), so you should not ignore the producer side either.  This is in the
> same function, and you touch most of the lines around it, and it's
> confusing if you make a partial change.

You're missing the other cases elsewhere that set this same field.

> Let me point out that we do have consensus in the project that new code
> must be free of data races.

Is a double-checked lock regarded as a race?  I didn't think so.

So, I'm proposing this patch, which reorganizes the function a bit to
make this absolutely clear, so that we can get the fix in and I can move
on, instead of extending any further the useless part of this
conversation, so that we can focus on the important stuff.

How's this?
We used to store static TLS addresses in the DTV at module load time,
but this required one thread to modify another thread's DTV.  Now that
we defer the DTV update to the first use in the thread, we should do so
without taking the rtld lock if the module is already assigned to static
TLS.  Taking the lock introduces deadlocks where there weren't any
before.

This patch fixes the deadlock caused by tls_get_addr's unnecessarily
taking the rtld lock to initialize the DTV entry for tls_dtor_list
within __call_tls_dtors, which deadlocks with a dlclose of a module
whose finalizer joins with that thread.  The patch does not, however,
attempt to fix other potential sources of similar deadlocks, such as:
the explicit rtld locks taken by __call_tls_dtors, when the dtor list is
not empty; lazy relocation of the reference to tls_dtor_list, when TLS
Descriptors are in use; when TLS dtors call functions through the PLT
and lazy relocation needs to be performed; or when any such called
functions interact with the dynamic loader in ways that require its
lock to be taken.

for ChangeLog

	[PR dynamic-link/18457]
	* elf/dl-tls.c (tls_get_addr_tail): Don't take the rtld lock
	if we already have a final static TLS offset.
	* nptl/tst-join7.c, nptl/tst-join7mod.c: New, from Andreas
	Schwab's bug report.
	* nptl/Makefile (tests): Add tst-join7.
	(module-names): Add tst-join7mod.
	($(objpfx)tst-join7): New.  Add deps.
	($(objpfx)tst-join7.out): Likewise.
	($(objpfx)tst-join7mod.so): Likewise.
	(LDFLAGS-tst-join7mod.so): Set soname.
---
 NEWS                |  2 +-
 elf/dl-tls.c        | 63 +++++++++++++++++++++++++++++++--------------------
 nptl/Makefile       | 10 ++++++--
 nptl/tst-join7.c    | 12 ++++++++++
 nptl/tst-join7mod.c | 29 +++++++++++++++++++++++
 5 files changed, 88 insertions(+), 28 deletions(-)
 create mode 100644 nptl/tst-join7.c
 create mode 100644 nptl/tst-join7mod.c
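The fast path the patch is after can be sketched as follows.  This is a
hypothetical, self-contained mock — mock_link_map, resolve_tls_offset,
the pthread mutex standing in for the rtld lock, and the offset
constants are all illustrative stand-ins, not glibc's actual internal
definitions — showing only the shape of the change: a module that
already has a final static TLS offset is used without taking the lock,
and only an undecided offset enters the critical section.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative stand-ins for glibc's internal sentinel values.  */
#define NO_TLS_OFFSET 0
#define FORCED_DYNAMIC_TLS_OFFSET (-1)

struct mock_link_map
{
  long l_tls_offset;   /* decided under the lock, read outside it */
};

static pthread_mutex_t rtld_lock = PTHREAD_MUTEX_INITIALIZER;

/* Mock of the patched fast path: a final static TLS offset is
   returned without locking; only an undecided offset takes the lock,
   rechecks, and forces the module to dynamic TLS.  */
static long
resolve_tls_offset (struct mock_link_map *map)
{
  long offset = map->l_tls_offset;  /* glibc would use an atomic load */
  if (offset == NO_TLS_OFFSET)
    {
      pthread_mutex_lock (&rtld_lock);
      /* Recheck: another thread may have decided it meanwhile.  */
      if (map->l_tls_offset == NO_TLS_OFFSET)
        map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
      offset = map->l_tls_offset;
      pthread_mutex_unlock (&rtld_lock);
    }
  return offset;
}
```

A module with a static offset (say 16) never touches the lock, which is
what removes the deadlock with a finalizer holding it.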
Comments
On Wed, 2015-06-03 at 17:24 -0300, Alexandre Oliva wrote:
> On Jun 3, 2015, Torvald Riegel <triegel@redhat.com> wrote:
>
> > So you think a reload by the compiler would be bad.
>
> Yeah.  It's a double-checked lock entry with sugar on top.  We only need
> to take the lock if l_tls_offset is unresolved (NO_TLS_OFFSET), but if
> it is resolved, we want different behavior depending on whether it is
> FORCED_DYNAMIC_TLS_OFFSET, or something else (static TLS offset).
>
> > This can only be bad if there is concurrent modification, potentially
>
> Concurrent modification is made while holding the lock.

Unless *all* accesses happen while holding the lock, there are pairs of
accesses that are not ordered by happens-before; if there is a write in
any of those pairs, you must use atomics to prevent a data race.  (If
happens-before is ensured through *other* synchronization (e.g., another
lock), then this needs to be considered too.)  Applied to a
double-checked locking pattern, this means that all data that is
accessed outside the critical section, and is also checked and modified
inside the critical section, must use atomic accesses.

> It shouldn't happen in the same thread, at least as long as TLS is
> regarded as AS-Unsafe, but other threads could concurrently attempt to
> resolve a module's l_tls_offset to a static offset or forced dynamic.

OK.  So we're dealing with inter-thread concurrency here.

> > Thus, therefore we need the atomic access
>
> I'm not convinced we do, but I don't mind, and I don't want this to be a
> point of contention.
>
> > AFAIU, you seem to speak about memory reuse unrelated to this
> > specifically this particular load, right?
>
> Yeah, some other earlier use of the same location.
>
> > But that sounds like an issue independently of whether the specific
> > load is an acquire or relaxed load.
>
> Not really.  It is a preexisting issue, yes, but an acquire load would
> make sure the (re)initialization of the memory into a link map,
> performed while holding the lock (and with an atomic write, no less),
> would necessarily be observed by the atomic acquire load.  A relaxed
> load might still observe the un(re)initialized value.  Right?

I can't follow you here.

One thing to note is that acquire loads synchronize with release stores
(or stronger) on the *same* memory location.  An acquire load does not
synchronize with operations on a lock, unless the acquire load peeks
into the lock and does an acquire load on the lock's state or such.

Therefore, when you think about which effect an acquire load has,
consider which release store you are actually thinking about.  An
acquire operation does not have an effect on its own, only in
combination with other effects in the program.  This is also why we want
to document which release store an acquire load is supposed to
synchronize with.

Thus, which release store are you thinking about in this case?

> > We established before that you want to prevent reload because there are
> > concurrent stores.  Are these by other threads?
>
> Answered above.
>
> > If so, are there any cases of the following pattern:
>
> Dude.  Of course not.  None of them use atomics.  So far this has only
> used locks to guard changes, and double-checked locks for access.

Think conceptually, and consider that atomics don't synchronize with
locks unless the atomics access the lock state themselves.  Thus, the
atomics in there are the atomics inside and outside of the critical
section(s).

> > storing thread;
> > A;
> > atomic_store_relaxed(&l_tls_offset, ...);
>
> > observing thread:
> > offset = atomic_load_relaxed(&l_tls_offset);
> > B(offset);
>
> > where something in B (which uses or has a dependency on offset) relies
> > on happening after A?
> Let's rewrite this into something more like what we have now:
>
> storing thread:
>   acquire lock;
>   A;
>   set l_tls_offset;
>   release lock;
>
> observing thread:
>   if l_tls_offset is undecided:
>     acquire lock;
>     if l_tls_offset is undecided:
>       set l_tls_offset to forced_dynamic;  // storing thread too!
>     release lock;
>   assert(l_tls_offset is decided);
>   if l_tls_offset is forced_dynamic:
>     dynamicB(l_tls_offset)
>   else
>     staticB(l_tls_offset)
>
> The forced_dynamic case of B(l_tls_offset) will involve at least copying
> the TLS init block, which A will have mapped in and relocated.  We don't
> take the lock for that copy, so the release after A doesn't guarantee we
> see the intended values.  Now, since the area is mmapped and then
> relocated, it is very unlikely that any cache would have kept a copy of
> the unrelocated block, let alone of any prior mmapping in that range.
> So, it is very likely to work, but it's not guaranteed to work.

Try to ignore what you think a particular cache implementation may or
may not be likely to do.  Instead, please think about whether the
program enforces the happens-before relationships it relies on.

So, the effects in A are mmap and relocation of the TLS init block
(which involves stores to the memory locations copied by B).  B needs
both to happen-before itself to work correctly.  A can be executed by a
different thread than the thread executing B.  Correct?

In this case, A needs to happen-before B, and you need a release store
for l_tls_offset and an appropriate acquire load of l_tls_offset on the
observer side.  Those establish the happens-before between A and B.
Alternatively, A needs to (conceptually) include issuing a release fence
after the effects, so that a relaxed store is sufficient on the storing
side.

For mmap specifically, we may be able to assume that all mmap
implementations are a syscall, the kernel does the mmap (i.e., the
effect), and the return from the kernel is always a release fence.
However, then we need to document this for mmap, if we want to rely on
it.  And on the observer side, we'd still need an acquire load, unless
we can put an acquire fence somewhere else.

Note that whether the storing thread has acquired the lock or not
doesn't matter in the analysis of this case.

> As for the static TLS case of B(l_tls_offset), the potential for
> problems is similar, but not quite the same.  The key difference is that
> the initialization of the static TLS block takes place at the storing
> thread, rather than in the observing thread, and although
> B(l_tls_offset) won't access the thread's static TLS block, the caller
> of __tls_get_addr will.  (and so will any IE and LE TLS access)

OK.  So let's consider callers to be part of B(l_tls_offset).

> Now, in order for any such access to take place, some relocation applied
> by A must be seen by the observing thread, and if there isn't some
> sequencing event that ensures the dlopen (or initial load) enclosing A
> happens-before the use of the relocation, the whole thing is undefined;
> otherwise, this sequencing event ought to be enough of a memory barrier
> to guarantee the whole thing works.  It's just that the sequencing event
> is not provided by the TLS machinery itself, but rather by the user, in
> sequencing events after the dlopen, by the init code, in sequencing the
> initial loading and relocation before any application code execution, or
> by the thread library, sequencing any thread started by module
> initializers after their relocation.

If that's the situation in the static case, it seems we do not need an
acquire load nor release store because there is either no concurrent
access, naturally, or the user has to ensure happens-before.
We can say something like this (feel free to fix the TLS terminology):

  In the static case, we can assume that the initialization/relocation
  of a TLS slot [I mean A] always happens-before the use of this TLS
  slot: (1) in case of dlopen, we require the user to ensure that dlopen
  happens-before other threads execute the new TLS; (2) in case of
  initial loading, this will happen-before any execution of application
  code; and (3) in case of [module initializers case, don't know how to
  phrase that properly...]  Therefore, relaxed MO is fine for this load
  and the store in [the storing side function].

This documents the assumptions the function is making, and thus what
other things it relies on.  Somebody reading the code of the function
sees the comment, knows why there's just a relaxed MO load, and can
connect it to the rest of the mental model of TLS synchronization.  We
do need to document this to be able to understand how the
synchronization is intended to work, and to be able to cross-check this
with the assumptions and documentation of other related functions.

> Which means to me that a relaxed load might turn out to be enough,
> after all.
>
> > I'm trying to find out what you know about the intent behind the TLS
> > synchronization
>
> FWIW, in this discussion we're touching just a tiny fraction of it, and
> one that's particularly trivial compared with other bits :-(
>
> > Does dlopen just have to decide about this value
>
> It does tons of stuff (loading modules and dependencies, applying
> relocations, running initializers), and it must have a say first.  E.g.,
> if any IE relocation references a module, we must make it static TLS.
> Otherwise, dlopen may leave it undecided, and then a later dlopen might
> attempt to make it static TLS again (and fail if that's no longer
> possible), or an intervening tls_get_addr may see it's undecided and
> make the module's TLS dynamic.
>
> > I disagree.  You added an atomic load on the consumer side (rightly
> > so!), so you should not ignore the producer side either.  This is in
> > the same function, and you touch most of the lines around it, and it's
> > confusing if you make a partial change.
>
> You're missing the other cases elsewhere that set this same field.

What do you mean?  How is it any better if you don't fix it properly in
the functions you have looked at and modified, just because there are
more problems elsewhere?

> > Let me point out that we do have consensus in the project that new
> > code must be free of data races.
>
> Is a double-checked lock regarded as a race?  I didn't think so.

*Correct* double-checked locking isn't.  But correct double-checked
locking doesn't have data races, and this code here does (unless there's
some constraint on execution that I'm not aware of and you haven't told
me about).

Look at your example code above.  "set l_tls_offset;" in the storing
thread can be concurrent with the first access of "l_tls_offset" (or the
ones in the branches after the critical section).  Concurrent here means
not ordered by happens-before (unlike in the static case, because there
we rely on other happens-before, as you point out above).  One of the
concurrent accesses is a store.  This is a data race.

> So, I'm proposing this patch, that reorganizes the function a bit to
> make this absolutely clear, so that we can get the fix in and I can
> move on, instead of extending any further the useless part of this
> conversation, so that we can focus on the important stuff.

Adding data races, or not fixing data races when you touch the code, is
not correct -- thus discussing whether you have a data race in your
patch is absolutely not useless.  We agreed on the no-data-races rule
for a reason.  Please adhere to this like you do for other project
rules.  If you disagree with the rule, bring this up again, as a
separate topic.  Until that happens, I'd like to not have to argue over
this again and again.
> How's this?

I'll comment on the revised patch.
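The release-store/acquire-load pairing described above can be sketched
with C11 atomics (glibc internally uses its own atomic_store_release and
atomic_load_acquire macros with the same semantics; the names payload,
offset, and demo here are invented for illustration).  The plain store
to payload plays the role of the effects in A; the spin on the acquire
load plays the role of the check in B:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;              /* written by A, read by B */
static atomic_long offset;       /* 0 = undecided, like l_tls_offset */

/* Storing thread: perform A, then publish with a release store.  */
static void *
storing_thread (void *arg)
{
  (void) arg;
  payload = 42;                                    /* A */
  atomic_store_explicit (&offset, 1, memory_order_release);
  return NULL;
}

/* Observing thread: the acquire load that observes the release store
   makes A happen-before B, so payload is guaranteed to read 42.  */
static int
observing_thread (void)
{
  while (atomic_load_explicit (&offset, memory_order_acquire) == 0)
    ;                                              /* spin until decided */
  return payload;                                  /* B */
}

static int
demo (void)
{
  pthread_t t;
  pthread_create (&t, NULL, storing_thread, NULL);
  int seen = observing_thread ();
  pthread_join (t, NULL);
  return seen;
}
```

With relaxed MO on both sides, the return value of observing_thread
would no longer be guaranteed — which is exactly the question the thread
is debating for l_tls_offset.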
On Jun 4, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> Applied to a double-checked locking pattern, this means that all data
> accessed outside the critical section, and is also checked and modified
> inside the critical section, must use atomic accesses.

Is the l_tls_offset field the data you're talking about?  We've already
determined that there is a happens-before for everything else, and my
understanding is that, for l_tls_offset alone, being the double-checked
lock key value, and the only one that matters for the uses that aren't
necessarily covered by the happens-before relationship we've already
established, we have no need for atomics there.

> OK.  So we're dealing with inter-thread concurrency here.

Yes.

>> Not really.  It is a preexisting issue, yes, but an acquire load would
>> make sure the (re)initialization of the memory into a link map,
>> performed while holding the lock (and with an atomic write, no less),
>> would necessarily be observed by the atomic acquire load.  A relaxed
>> load might still observe the un(re)initialized value.  Right?

> I can't follow you here.

> One thing to note is that acquire loads synchronize with release stores
> (or stronger) on the *same* memory location.  An acquire load does not
> synchronize with operations on a lock, unless the acquire load peeks
> into the lock and does an acquire load on the lock's state or such.

> Therefore, when you think about which effect an acquire load has,
> consider which release store you are actually thinking about.  An
> acquire operation does not have an effect on its own, only in
> combination with other effects in the program.  This is also why we want
> to document which release store an acquire load is supposed to
> synchronize with.

> Thus, which release store are you thinking about in this case?

Nothing but l_tls_offset.
>> Now, in order for any such access to take place, some relocation applied
>> by A must be seen by the observing thread, and if there isn't some
>> sequencing event that ensures the dlopen (or initial load) enclosing A
>> happens-before the use of the relocation, the whole thing is undefined;
>> otherwise, this sequencing event ought to be enough of a memory barrier
>> to guarantee the whole thing works.  It's just that the sequencing event
>> is not provided by the TLS machinery itself, but rather by the user, in
>> sequencing events after the dlopen, by the init code, in sequencing the
>> initial loading and relocation before any application code execution, or
>> by the thread library, sequencing any thread started by module
>> initializers after their relocation.

> If that's the situation in the static case

The paragraph quoted above applies to both cases.

>> You're missing the other cases elsewhere that set this same field.

> What do you mean?  How is it any better if you don't fix it properly in
> the functions you have looked at and modified, just because there are
> more problems elsewhere?

If you say atomics are only correct if all loads and stores, including
those guarded by locks, are atomic, then adding atomics to only some of
them makes things wrong.

>> Is a double-checked lock regarded as a race?  I didn't think so.

> *Correct* double-checked locking isn't.

> "set l_tls_offset;" in the storing thread can be concurrent with the
> first access of "l_tls_offset" (or the ones in the branches after the
> critical section).

It appears to follow from your statement and example above that
*correct* double-checked locking can only be attained using atomics.  Is
that so?  If not, can you give an example of correct double-checked
locking that doesn't use them, and explain why that's different from
what's in my revised patch?

> If you disagree with the rule,

I don't.  Maybe one of us misunderstands it, or we're otherwise failing
to communicate, but I'm honestly trying to avoid data races.  I just
don't know that the unguarded reads in double-checked locks qualify as
data races.
On Fri, 2015-06-05 at 01:39 -0300, Alexandre Oliva wrote:
> On Jun 4, 2015, Torvald Riegel <triegel@redhat.com> wrote:
>
> > Applied to a double-checked locking pattern, this means that all data
> > accessed outside the critical section, and is also checked and modified
> > inside the critical section, must use atomic accesses.
>
> Is the l_tls_offset field the data you're talking about?

Yes, in this case.

(There may be other data that is initialized once inside the critical
section and then accessed outside.  Accesses to those data can use
nonatomic accesses if data like l_tls_offset makes them data-race-free.)

> We've already
> determined that there is a happens-before for everything else, and my
> understanding is that, for l_tls_offset alone, being the double-checked
> lock key value, and the only one that matters for the uses that aren't
> necessarily covered by the happens-before relationship we've already
> established, we have no need for atomics there.

No, that's not true.  If you use a flag like l_tls_offset to indicate
whether some initialization or other job has been performed already, and
try to apply double-checked locking, this has to be like this
(flag==true means init done, flag initially false):

  if (atomic_load_acquire (&flag) != true)
    {
      lock ();
      if (!flag)   /* recheck, now under the lock */
        {
          init (&data);
          atomic_store_release (&flag, true);
        }
      unlock ();
    }
  /* data ready to use here */
  use (&data);

This is double-checked locking.  For an alternative version that
combines the lock with the flag, look at how pthread_once does it.

The atomic accesses for flag are necessary.  If you wouldn't actually
have concurrent accesses to flag, you're not doing double-checked
locking and can avoid the lock altogether unless you need it for some
other orthogonal reason.

If your code deviates from this, you can't simply say it's
double-checked locking, but need to comment why the deviation is
correct.

> > OK.  So we're dealing with inter-thread concurrency here.
>
> Yes.
>
> >> Not really.  It is a preexisting issue, yes, but an acquire load would
> >> make sure the (re)initialization of the memory into a link map,
> >> performed while holding the lock (and with an atomic write, no less),
> >> would necessarily be observed by the atomic acquire load.  A relaxed
> >> load might still observe the un(re)initialized value.  Right?
>
> > I can't follow you here.
>
> > One thing to note is that acquire loads synchronize with release stores
> > (or stronger) on the *same* memory location.  An acquire load does not
> > synchronize with operations on a lock, unless the acquire load peeks
> > into the lock and does an acquire load on the lock's state or such.
>
> > Therefore, when you think about which effect an acquire load has,
> > consider which release store you are actually thinking about.  An
> > acquire operation does not have an effect on its own, only in
> > combination with other effects in the program.  This is also why we want
> > to document which release store an acquire load is supposed to
> > synchronize with.
>
> > Thus, which release store are you thinking about in this case?
>
> Nothing but l_tls_offset.

I haven't seen a release-MO atomic store to l_tls_offset yet.  Nor a
release fence.  Which one do you mean?

> >> Now, in order for any such access to take place, some relocation applied
> >> by A must be seen by the observing thread, and if there isn't some
> >> sequencing event that ensures the dlopen (or initial load) enclosing A
> >> happens-before the use of the relocation, the whole thing is undefined;
> >> otherwise, this sequencing event ought to be enough of a memory barrier
> >> to guarantee the whole thing works.  It's just that the sequencing event
> >> is not provided by the TLS machinery itself, but rather by the user, in
> >> sequencing events after the dlopen, by the init code, in sequencing the
> >> initial loading and relocation before any application code execution, or
> >> by the thread library, sequencing any thread started by module
> >> initializers after their relocation.
>
> > If that's the situation in the static case
>
> The paragraph quoted above applies to both cases.

Huh?  Why would you then need the double-checked locking at all?
Unless I misunderstand you, what you seem to be saying isn't consistent.

> >> You're missing the other cases elsewhere that set this same field.
>
> > What do you mean?  How is it any better if you don't fix it properly in
> > the functions you have looked at and modified, just because there are
> > more problems elsewhere?
>
> If you say atomics are only correct if all loads and stores, including
> those guarded by locks, are atomic, then adding atomics to only some of
> them makes things wrong.

Yes, that's the case, strictly speaking.  But you didn't want to tackle
a full conversion of every access to l_tls_offset anywhere --
understandably.  Therefore, I didn't request that.  But that doesn't
mean you can ignore what's in the immediate neighborhood of your change.

> >> Is a double-checked lock regarded as a race?  I didn't think so.
>
> > *Correct* double-checked locking isn't.
>
> > "set l_tls_offset;" in the storing thread can be concurrent with the
> > first access of "l_tls_offset" (or the ones in the branches after the
> > critical section).
>
> It appears to follow from your statement and example above that
> *correct* double-checked locking can only be attained using atomics.  Is
> that so?

See above.  Note that init(&data) and use(&data) can use nonatomics to
access data, but only because of the correct double-checked locking
synchronization and the atomic store-release / load-acquire accesses on
the flag.
> If not, can you give an example of correct double-checked
> locking that doesn't use them, and explain why that's different from
> what's in my revised patch?
>
> > If you disagree with the rule,
>
> I don't.  Maybe one of us misunderstands it, or we're otherwise failing
> to communicate, but I'm honestly trying to avoid data races.  I just
> don't know that the unguarded reads in double-checked locks qualify as
> data races.

If neither the double-checked locking example above nor the comments in
pthread_once make it clear, please get in touch and I'll try to explain.
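The flag-based pattern described above can be rendered runnable with
C11 atomics in place of glibc's atomic_load_acquire/atomic_store_release
macros; flag, data, init_data, and use_data are illustrative names
invented for this sketch:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool flag;   /* false: init not done yet */
static int data;           /* initialized once, under the lock */

static void
init_data (void)
{
  data = 123;
}

/* Double-checked locking: the acquire/release pair on flag is what
   makes the nonatomic accesses to data data-race-free.  */
static int
use_data (void)
{
  if (!atomic_load_explicit (&flag, memory_order_acquire))
    {
      pthread_mutex_lock (&lock);
      /* Recheck under the lock: another thread may have won.  The
         relaxed load is fine here; the lock orders it with the store.  */
      if (!atomic_load_explicit (&flag, memory_order_relaxed))
        {
          init_data ();
          atomic_store_explicit (&flag, true, memory_order_release);
        }
      pthread_mutex_unlock (&lock);
    }
  /* data ready to use here */
  return data;
}
```

After the first call initializes data, every later call takes only the
fast path: a single acquire load and no lock.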
On Jun 5, 2015, Torvald Riegel <triegel@redhat.com> wrote:

>> Is the l_tls_offset field the data you're talking about?

> Yes, in this case.

> (There may be other data that is initialized once inside the critical
> section and then accessed outside.  Accesses to those data can use
> nonatomic accesses if data like l_tls_offset makes them
> data-race-free.)

Ok, this means we can take them out of the analysis and focus on
l_tls_offset alone.  Let's do that, shall we?  Consider this:

  enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };

  void *
  init ()
  {
    static enum spin electron;
    take lock;
    electron = UNKNOWN;
    release lock;
    return &electron;
  }

  bool
  make_positive (void *x)
  {
    enum spin *p = x;
    take lock;
    if (*p == UNKNOWN)
      *p = POSITIVE;
    release lock;
    return *p == POSITIVE;
  }

  bool
  positive_p (void *x)
  {
    enum spin *p = x;
    if (*p != NEGATIVE)
      {
        take lock;
        if (*p == UNKNOWN)
          *p = NEGATIVE;
        release lock;
      }
    return *p == POSITIVE;
  }

The program has to call init to get the pointer, so that happens before
any uses of the pointer.  From then on, any threads may call
make_positive or positive_p, however many times they wish, passing them
the pointer returned by init.  init should not be called again.  There
is no other data associated with the electron or the global lock (not
declared here) that guards it.  The enum type is opaque, not visible to
callers, and its alignment and size ensure it will always be loaded and
stored atomically.

This is supposed to implement a state machine in which the spin starts
as UNKNOWN, but once it is determined to be POSITIVE or NEGATIVE, it
won't ever change again.  make_positive will turn UNKNOWN spins into
POSITIVE, whereas positive_p will turn them into NEGATIVE.

This is analogous to the code we have in place right now.  The change
I'm proposing is an optimization to positive_p that changes the
following line:

  if (*p != NEGATIVE)
    {

to:

  if (*p == UNKNOWN)
    {

because we don't have to take the lock when *p has already advanced to
POSITIVE.
I have two questions for you about this proposed change: Is it really that hard to see that the change is correct, safe, and a strict performance improvement? What is the *minimum* set of atomic loads and stores that we have to introduce, replacing unadorned loads and stores, to make the whole thing compliant with our standards, should this be GNU libc code? >> > One thing to note is that acquire loads synchronize with release stores >> > (or stronger) on the *same* memory location. An acquire load does not >> > synchronize with operations on a lock, unless the acquire load peeks >> > into the lock and does an acquire load on the lock's state or such. >> >> > Therefore, when you think about which effect an acquire load has, >> > consider which release store you are actually thinking about. An >> > acquire operation does not have an effect on its own, only in >> > combination with other effects in the program. This is also why we want >> > to document which release store an acquire load is supposed to >> > synchronize with. >> >> > Thus, which release store are you thinking about in this case? >> >> Nothing but l_tls_offset. > I haven't seen a release-MO atomic store to l_tls_offset yet. Of course, there isn't any ATM. What I'm saying is that l_tls_offset is the only memory location that's relevant in this situation. The part of the dynamic loader that loads the module that defines the TLS variable is init, the part of the dynamic loader that resolves a TLS relocation in a way that requires the variable to be in static TLS is make_positive, and the part of tls_get_addr that I'm modifying in the patch is is_positive. Any static TLS offset maps to POSITIVE, FORCED_DYNAMIC maps to NEGATIVE, and NO_TLS_OFFSET maps to UNKNOWN. > Nor a release fence. The release fences are the unlock operations after each and every one of the stores.
>> >> Now, in order for any such access to take place, some relocation applied >> >> by A must be seen by the observing thread, and if there isn't some >> >> sequencing event that ensures the dlopen (or initial load) enclosing A >> >> happens-before the use of the relocation, the whole thing is undefined; >> >> otherwise, this sequencing event ought to be enough of a memory barrier >> >> to guarantee the whole thing works. It's just that the sequencing event >> >> is not provided by the TLS machinery itself, but rather by the user, in >> >> sequencing events after the dlopen, by the init code, in sequencing the >> >> initial loading and relocation before any application code execution, or >> >> by the thread library, sequencing any thread started by module >> >> initializers after their relocation. >> > If that's the situation in the static case >> The paragraph quoted above applies to both cases. > Huh? Why would you then need the double-checked locking at all? Because it's required to ensure no two threads take it upon themselves to decide whether l_tls_offset will be static or dynamic. The paragraph above, about other data such as the TLS initializer block, applies to both cases.
On Fri, 2015-06-05 at 14:43 -0300, Alexandre Oliva wrote: > On Jun 5, 2015, Torvald Riegel <triegel@redhat.com> wrote: > > >> Is the l_tls_offset field the data you're talking about? > > > Yes, in this case. > > > (There may be other data that is initialized once inside the critical > > section and then accessed outside. Accesses to those data can use > > nonatomic accesses if data like l_tls_offset makes them > > data-race-free.) > > Ok, this means we can take them out of the analysis and focus on > l_tls_offset alone. Let's do that, shall we? You can't just ignore them because they can put requirements on the flag. *Iff* they don't exist, we can relax the synchronization on the flag. > > Consider this: > > enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 }; > > void *init () { > static enum spin electron; > take lock; > electron = UNKNOWN; > release lock; > return &electron; > } > > bool make_positive (void *x) { > enum spin *p = x; > take lock; > if (*p == UNKNOWN) > *p = POSITIVE; > release lock; > return *p == POSITIVE; > } > > bool positive_p (void *x) { > enum spin *p = x; > if (*p != NEGATIVE) { > take lock; > if (*p == UNKNOWN) > *p = NEGATIVE; > release lock; > } > > return *p == POSITIVE; > } > > The program has to call init to get the pointer, so that happens before > any uses of the pointer. OK. init sets the value unconditionally, which seems either counterproductive, or init is called exactly once and that call happens before any use by another thread. If the latter (and that seems to be the case as you indicate below):

(1) the critical section is not required

(2) we should use an atomic access for clarity of code (because we have no atomic types), but it can be considered initialization too and then our exception applies that we don't currently enforce atomic accesses for initialization (a brief code comment is useful in this case nonetheless).
> From then on, any threads may call > make_positive or positive_p, however many times they wish, passing them > the pointer returned by init. init should not be called again. There > is no other data associated with the electron or the global lock (not > declared here) that guards it. OK. Then this isn't really double-checked locking, but you're implementing a CAS with locks. (More generally, you are implementing single-memory-location consensus. See the single-memory-location state machine comment you make below.) One could argue that this is a degenerate double-checked locking pattern, but double-checked locking is really meant for things like pthread_once where there is other data that has to be protected with the critical section. If all one wants is single-memory-word consensus, then that's what the very basic HW atomic instructions already provide. (This is why I asked whether you are just implementing a CAS in one of my past emails.) The fact that there is no other data associated with your state machine must be pointed out in the comments. It's often not the case that something is indeed "freestanding", so if it is we want to briefly note why (using a comment in the code). We got similar cases wrong in glibc quite often (e.g., the previously incorrect pthread_once implementation), so such a code comment tells everyone we checked this case, and why it's correct. > The enum type is opaque, not visible to > callers, and its alignment and size ensure it will always be loaded and > stored atomically. Careful here. The alignment and size of the type may *allow* a compiler (and our atomic operations) to actually work and make access atomic. But the compiler is *not required* to do so if plain nonatomic accesses are used; it could load/store byte-wise, it could reload, or it could store speculative values.
Unless you use atomic accesses, there's no guarantee that what looks like a single load or store in source code actually ends up as exactly one atomic, full-size load or store in the generated code. Thus, if you'd add a comment on the type, you should say that it is compatible (or sufficient for) atomic accesses. > This is supposed to implement a state machine in which the spin starts > as UNKNOWN, but once it is determined to be POSITIVE or NEGATIVE, it > won't ever change again. make_positive will turn UNKNOWN spins into > POSITIVE, whereas positive_p will turn them into NEGATIVE. OK. So, this establishes consensus for whether POSITIVE or NEGATIVE is the final outcome. (And I guess UNKNOWN is never exposed to other code.) > This is analogous to the code we have in place right now. > > The change I'm proposing is an optimization to positive_p, that changes > the following line: > > if (*p != NEGATIVE) { > > to: > > if (*p == UNKNOWN) { > > because we don't have to take the lock when *p has already advanced to > POSITIVE. OK. Note that this was the clearest bit of your patch all along. It's everything around that (and the actual synchronization) that was problematic / unclear. > I have two questions for you about this proposed change: > > Is it really that hard to see that the change is correct, safe, and a > strict performance improvement? The intent is good. The synchronization has data races (and also had them before, but that's no reason to keep doing the incorrect thing). It is a performance improvement. Using a CAS instead of a critical section would make it even faster for the first accesses. For example, make_positive could do that:

  /* Establish POSITIVE if no consensus yet.  See [...] for why
     relaxed MO is sufficient.
  */
  enum spin s = atomic_load_relaxed (p);
  while (s == UNKNOWN)
    atomic_compare_exchange_weak_relaxed (p, &s, POSITIVE);

If you want to keep the locks, I suggest mentioning the equivalent CAS anyway because it conveys the intent more clearly in this case. > What is the *minimum* set of atomic loads and stores that we have to > introduce, replacing unadorned loads and stores, to make the whole thing > compliant with our standards, should this be GNU libc code? All accesses to *p must use relaxed atomic accesses (the initialization may be an exception if the change complicates things there (e.g., if it's a memset that also covers other memory locations), but if it's easy it's better to use a relaxed-MO atomic access there as well for clarity). > >> > One thing to note is that acquire loads synchronize with release stores > >> > (or stronger) on the *same* memory location. An acquire load does not > >> > synchronize with operations on a lock, unless the acquire load peeks > >> > into the lock and does an acquire load on the lock's state or such. > >> > >> > Therefore, when you think about which effect an acquire load has, > >> > consider which release store you are actually thinking about. An > >> > acquire operation does not have an effect on its own, only in > >> > combination with other effects in the program. This is also why we want > >> > to document which release store an acquire load is supposed to > >> > synchronize with. > >> > >> > Thus, which release store are you thinking about in this case? > >> > >> Nothing but l_tls_offset. > > > I haven't seen a release-MO atomic store to l_tls_offset yet. > > Of course, there isn't any ATM. What I'm saying is that l_tls_offset is > the only memory location that's relevant in this situation.
> > The part of the dynamic loader that loads the module that defines the > TLS variable is init, the part of the dynamic loader that resolves a TLS > relocation in a way that requires the variable to be in static TLS is > make_positive, and the part of tls_get_addr that I'm modifying in the > patch is is_positive. Just to clarify on previous comments you make: And init() always happens-before make_positive or is_positive? > Any static TLS offset maps to POSITIVE, > FORCED_DYNAMIC maps to NEGATIVE, and NO_TLS_OFFSET maps to UNKNOWN. > > > Nor a release fence. > > The release fences are the unlock operations after each and every one of > the stores. Hmm. This still seems somewhat inconsistent. You are arguing that you do not need an acquire load on the consensus implementation above, because l_tls_offset would stand on its own and program logic doesn't need ordering dependencies between this consensus and anything else. Yet you seem to think the release fence is necessary. Do you think the release fence is necessary for something, and if so, what is it and where is the matching acquire? Also note that an unlock operation on a lock is *not* guaranteed to have the same effects as an atomic_thread_fence_release. A compiler can reorder atomics into a critical section (under certain conditions), but it can't typically reorder to before an atomic_thread_fence_release (there are corner cases though where it can merge with other accesses before the fence). The hardware may or may not treat unlocks (i.e., release stores on a *particular* memory location) and fences the same.
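For reference, the CAS-based make_positive sketched earlier in this message could be written with portable C11 stdatomic operations instead of glibc's internal atomic_compare_exchange_weak_relaxed wrapper. All names here are illustrative, and the claim that relaxed MO suffices is exactly the assumption under discussion (nothing else is published under this flag):

```c
#include <stdatomic.h>

enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };

static _Atomic int electron = UNKNOWN;

/* Establish POSITIVE if no consensus has been reached yet, and
   return the final agreed-upon value.  */
int
make_positive_cas (void)
{
  int s = atomic_load_explicit (&electron, memory_order_relaxed);
  while (s == UNKNOWN
         && !atomic_compare_exchange_weak_explicit (&electron, &s, POSITIVE,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed))
    ;  /* On failure the CAS reloads s with the current value, so we
          retry only while it is still UNKNOWN (spurious failures of
          the weak CAS land here too).  */
  return s == UNKNOWN ? POSITIVE : s;
}
```

Unlike the earlier sketch, this loop also exits immediately when the CAS succeeds, rather than spinning once more to observe the value it just stored.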
On Jun 7, 2015, Torvald Riegel <triegel@redhat.com> wrote: > OK. init sets the value unconditionally, which seems either > counterproductive, or init is called exactly once and that call happens > before any use by another thread. The latter, as I wrote. > If the latter (and that seems to be > the case as you indicate below): And so you noticed. > (1) the critical section is not required Yes, it is. > (2) we should use an atomic access for clarity of code No opposition to that. > OK. Then this isn't really double-checked locking, Well, you used this term for this pattern back when we discussed lock-bypassing in stream orientation. I'm trying hard to overcome my different background and use your terminology, but this doesn't make it any easier :-( > If all one wants is > single-memory-word consensus, That's not all we want. We also want to give the dynamic loader priority in assigning a module to static TLS, while it's loading a set of modules. HW CAS atomic insns don't give us that AFAIK. Also, we want to make sure we wait till the dynamic loader is done with defining the TLS variable before we access it. It might be the case that some module's initializer recursively dlopens additional modules (yuck), or starts a thread that attempts to access the variable. We want to make those accesses wait for the loader to complete its job before they get a chance to make dynamic a variable that other modules being loaded might want to use as static. > The fact that there is no other data associated with your state machine > must be pointed out in the comments. There is all the initialization the dynamic loader performs when the module that defines the variable is loaded. > It's often not the case that something is indeed "freestanding", so if > it is we want to briefly note why (using a comment in the code). I don't think a random participant of an existing synchronization pattern is the right place for this sort of comprehensive synchronization documentation.
It's a cross-cutting concern, and as we (I?) have learned from computational reflection and aspect-oriented programming, there's no single right place in the code to document this; it's a side document that the relevant pieces should reference. >> The enum type is opaque, not visible to callers, and its alignment >> and size ensure it will always be loaded and stored atomically. > Careful here. The alignment and size of the type may *allow* a compiler > (and our atomic operations) to actually work and make access atomic. > But the compiler is *not required* to do so if plain nonatomic accesses > are used; it could load/store byte-wise, it could reload, or it could > store speculative values. *nod*, but not relevant:

- load/store of entire words byte-wise would quickly drive the compiler out of existence, or at least out of the marketplace for compilers of system libraries to be used in production

- reloads would not be a problem for the pattern used in the second version of the patch

- speculative stores that could affect this pattern are not permitted by the relevant standards AFAICT

> Unless you use atomic accesses, there's no guarantee that what looks > like a single load or store in source code actually ends up as exactly > one atomic, full-size load or store in the generated code. Alignment, size, and the compiler's interest in generating reasonable code do. But we both know you don't agree with that. > Thus, if you'd add a comment on the type, you should say that it is > compatible (or sufficient for) atomic accesses. This is a very pervasive assumption in GNU libc. The only reason I can think of to add this to every situation in which it is relevant is to make the source code tarballs get higher compression rates. > OK. Note that this was the clearest bit of your patch all along. Well, then, what are we waiting for, considering that this is *all* the patch does? > It is a performance improvement.
Using a CAS instead of a critical > section would make it even faster for the first accesses. But it wouldn't wait for the dynamic loader to complete the loading or the relocation. > If you want to keep the locks, I suggest mentioning the equivalent CAS > anyway because it conveys the intent more clearly in this case. Since Carlos and Siddhesh took this over, I'll leave it for them too. > Just to clarify on previous comments you make: And init() always > happens-before make_positive or is_positive? Yes. > Hmm. This still seems somewhat inconsistent. You are arguing that you > do not need an acquire load on the consensus implementation above, > because l_tls_offset would stand on its own and program logic doesn't > need ordering dependencies between this consensus and anything else. > Yet you seem to think the release fence is necessary. Do you think the > release fence is necessary for something, and if so, what is it and > where is the matching acquire? We'd already determined the release fence was needed and taken care of by other means. > Also note that an unlock operation on a lock is *not* guaranteed to have > the same effects as an atomic_thread_fence_release. *nod* > The hardware may or may not treat unlocks (i.e., > release stores on a *particular* memory location) It is my understanding that, per POSIX, unlocks ought to release all prior stores made by the thread, and locks must acquire all previously-released stores by any thread. I.e., they don't apply to single memory locations. Locks don't even know what particular memory locations they guard, if any. So I can't make sense of what distinction you intend to point out above.
On Tue, 2015-06-09 at 00:01 -0300, Alexandre Oliva wrote: > On Jun 7, 2015, Torvald Riegel <triegel@redhat.com> wrote: > > > OK. init sets the value unconditionally, which seems either > > counterproductive, or init is called exactly once and that call happens > > before any use by another thread. > > The latter, as I wrote. > > > If the latter (and that seems to be > > the case as you indicate below): > > And so you noticed. > > > (1) the critical section is not required > > Yes, it is. Not for *this*, specifically. You produced a reduced analogy / pattern of the synchronization we're discussing here, so I was referring to that specifically. > > (2) we should use an atomic access for clarity of code > > No opposition to that. Good. > > OK. Then this isn't really double-checked locking, > > Well, you used this term for this pattern back when we discussed > lock-bypassing in stream orientation. I'm trying hard to overcome my > different background and use your terminology, but this doesn't make it > any easier :-( I don't remember all of that case, but I think we discussed whether it is double-checked locking and whether it is a correct implementation. If this other case was similar to this one, and we're just interested in reaching consensus on one variable's value and there's no logical relationship to something else or that is taken care of through other means, then this wouldn't have been typical double-checked locking either. Sorry if this didn't come across clearly. I'm trying to make this as clear as possible. > > If all one wants is > > single-memory-word consensus, > > That's not all we want. We also want to give the dynamic loader > priority in assigning a module to static TLS, while it's loading a set > of modules. HW CAS atomic insns don't give us that AFAIK. OK. That would then be a good point to document. I'm not yet sure I understand how that is consistent with dlopen being required to happen before accesses to TLS by other threads.
> Also, we want to make sure we wait till the dynamic loader is done with > defining the TLS variable before we access it. It might be the case > that some module's initializer recursively dlopens additional modules > (yuck), or starts a thread that attempts to access the variable. We want > to make those accesses wait for the loader to complete its job before > they get a chance to make dynamic a variable that other modules being > loaded might want to use as static. OK, also a good point to document. > > > The fact that there is no other data associated with your state machine > > must be pointed out in the comments. > > There is all the initialization the dynamic loader performs when the > module that defines the variable is loaded. I can't follow you here. You said there are no other dependencies, and we just want to reach consensus on the final value of l_tls_offset. So, from the perspective of just this synchronization state machine you gave, there's no other data that's relevant. Now you say there is other stuff. What's true? Are you just trying to point out that there is other initialization but that other happens-before relationships (e.g., user must synchronize externally) make this initialization happen-before all non-init() functions in the state machine? That would then mean that in this state machine, the critical section in init() is indeed not required (for this particular use!). But above, you said it is. > > It's often not the case that something is indeed "freestanding", so if > > it is we want to briefly note why (using a comment in the code). > > I don't think a random participant of an existing synchronization > pattern is the right place for this sort of comprehensive > synchronization documentation. Maybe you misunderstood, so let me rephrase it. When, such as in this case, something deviates from what's typical, it's worth pointing this out.
It deviates from a typical double-checked locking pattern because you don't have acquire/release pairs. If you comment in the code that you need just consensus on the single variable, and there's no other dependencies, you also clarify it. > It's a cross-cutting concern, and as we > (I?) have learned from computational reflection and aspect-oriented > programming, there's no single right place in the code to document this; > it's a side document that the relevant pieces should reference. You can put the code comments somewhere in the code, for example at one of the functions taking part, or at the decl of the variable, or somewhere else. And then reference it. For example, for the semaphore or the condvar, I just put the more general overview of the synchronization scheme on some of the functions, and referenced that throughout the code of the other functions. You don't need a separate document for it. > >> The enum type is opaque, not visible to callers, and its alignment > >> and size ensure it will always be loaded and stored atomically. > > > Careful here. The alignment and size of the type may *allow* a compiler > > (and our atomic operations) to actually work and make access atomic. > > But the compiler is *not required* to do so if plain nonatomic accesses > > are used; it could load/store byte-wise, it could reload, or it could > > store speculative values. > > *nod*, but not relevant: No, this is very much relevant. We do not speculate about compiler implementations but stick to semantics required by the standards. If we need exceptions to that, we document those (e.g., if we rely on a GNU extension). I thought we had discussed this sufficiently before, but if you want to start this topic again, let's do it. 
> - load/store of entire words byte-wise would quickly drive the compiler > out of existence, or at least out of the marketplace for compilers of > system libraries to be used in production Remember the loads on tile that got mentioned in a previous discussion we had? > - reloads would not be a problem for the pattern used in the second > version of the patch How do you know? You could only argue this way by making assumptions about other code generation in the compiler, which you haven't done. And you certainly don't want to document these, which should be a clear enough indication that you also don't want to reason in detail about them, nor want anybody reading the code to have to reason about that. Yeah, one could speculate about what a compiler may or may not do in this *specific* piece of code. But that's not the level of abstraction we want to use as base for our reasoning. For example, assume the compiler is aware of the code for "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't access l_tls_offset in some way (which it doesn't). You access l_tls_offset inside and out of a critical section, so by data-race-freedom, there must not be concurrent write accesses. So it does not actually have to reload l_tls_offset inside of the critical section, but use the value of your initial load. This correct and reasonable-for-C11 compiler optimization breaks your synchronization. > - speculative stores that could affect this pattern are not permitted by > the relevant standards AFAICT The standard permits speculatively storing a value, if the target is not volatile and the speculative store does not create a data race. For example, the compiler can turn

  spin = NEGATIVE;

into

  spin = POSITIVE;
  spin = NEGATIVE;

You can *speculate* that this is unlikely to happen in this case, but that's speculation. The standard allows such things.
There's no simple way for you to argue that this won't happen without considering what does or does not happen elsewhere in the program, what variables adjacent to spin might be stored to as well, and so on. IOW, you need to reason about this and a whole lot of other stuff. Therefore, to allow for local reasoning, stick to what the standard guarantees. > > Unless you use atomic accesses, there's no guarantee that what looks > > like a single load or store in source code actually ends up as exactly > > one atomic, full-size load or store in the generated code. > > Alignment, size, and the compiler's interest in generating reasonable code > > do. But we both know you don't agree with that. No, this is simply not true in general. You can argue about likelihood in this *particular* case, but then you're doing just that. If you think it's unlikely compilers will try to optimize, have a look at this: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html > > Thus, if you'd add a comment on the type, you should say that it is > > compatible (or sufficient for) atomic accesses. > > This is a very pervasive assumption in GNU libc. It is common, but it sometimes breaks. For example, on some archs the publicly exposed semaphore type's alignment differed from the alignment the internal representation would have needed had we changed it to use the wider atomics offered by the arch. > The only reason I can > think of to add this to every situation in which it is relevant is to > make the source code tarballs get higher compression rates. > > > OK. Note that this was the clearest bit of your patch all along. > > Well, then, what are we waiting for, considering that this is *all* the > patch does? I think I've made it clear what's missing. There's more to a patch than the intent behind it. > > It is a performance improvement. Using a CAS instead of a critical > > section would make it even faster for the first accesses.
> > But it wouldn't wait for the dynamic loader to complete the loading or > the relocation. > > > If you want to keep the locks, I suggest mentioning the equivalent CAS > > anyway because it conveys the intent more clearly in this case. > > Since Carlos and Siddhesh took this over, I'll leave it for them too. So you will stop working on this? If you intend to work on patches affecting synchronization in the future, please remember the feedback you got in this patch. > > Just to clarify on previous comments you make: And init() always > > happens-before make_positive or is_positive? > > Yes. > > > Hmm. This still seems somewhat inconsistent. You are arguing that you > > do not need an acquire load on the consensus implementation above, > > because l_tls_offset would stand on its own and program logic doesn't > > need ordering dependencies between this consensus and anything else. > > Yet you seem to think the release fence is necessary. Do you think the > > release fence is necessary for something, and if so, what is it and > > where is the matching acquire? > > We'd already determined the release fence was needed and taken care of > by other means. Huh? > > Also note that an unlock operation on a lock is *not* guaranteed to have > > the same effects as an atomic_thread_fence_release. > > *nod* > > > The hardware may or may not treat unlocks (i.e., > > release stores on a *particular* memory location) > > It is my understanding that, per POSIX, unlocks ought to release all > prior stores made by the thread, and locks must acquire all > previously-released stores by any thread. I.e., they don't apply to > single memory locations. Locks don't even know what particular memory > locations they guard, if any. So I can't make sense of what distinction > you intend to point out above. We implement POSIX for users of glibc, but we do not implement on top of POSIX inside of glibc -- we implement on top of our own code and the C standard including its memory model.
(And you know that we don't even implement precisely the POSIX semantics with our POSIX locks.) In C11, there's a distinction between a release-MO fence and a mutex unlock operation (i.e., a release-MO store). Does that answer your question?
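To make the "minimum set of atomic accesses" answer above concrete, here is a sketch of positive_p with every racing access made an explicit relaxed-MO atomic, which is what rules out compiler tearing, reloading, and merging of the two loads. This is my illustrative rendering of the pattern under discussion, not the actual glibc code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };

static pthread_mutex_t spin_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic int electron = UNKNOWN;

bool
positive_p_atomic (void)
{
  /* One explicit relaxed load outside the lock: the compiler may
     neither tear it nor assume it yields the same value as the load
     inside the critical section below.  */
  int s = atomic_load_explicit (&electron, memory_order_relaxed);
  if (s == UNKNOWN)   /* The proposed optimization: skip the lock
                         once the value has resolved to POSITIVE.  */
    {
      pthread_mutex_lock (&spin_lock);
      s = atomic_load_explicit (&electron, memory_order_relaxed);
      if (s == UNKNOWN)
        {
          s = NEGATIVE;
          atomic_store_explicit (&electron, s, memory_order_relaxed);
        }
      pthread_mutex_unlock (&spin_lock);
    }
  return s == POSITIVE;
}
```

Because the value can only change UNKNOWN -> {POSITIVE, NEGATIVE} and then never again, the relaxed re-load under the lock is the only reload the source ever performs, and it is now visible as such.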
[I'm dropping Andreas from Cc, since this part of the thread has nothing to do with the reason I copied him at first] On Jun 9, 2015, Torvald Riegel <triegel@redhat.com> wrote: > On Tue, 2015-06-09 at 00:01 -0300, Alexandre Oliva wrote: >> That's not all we want. We also want to give the dynamic loader >> priority in assigning a module to static TLS, while it's loading a set >> of modules. HW CAS atomic insns don't give us that AFAIK. > OK. That would then be a good point to document. I'm not yet sure I > understand how that is consistent with dlopen being required to happen > before accesses to TLS by other threads. I think the confusion may arise because of the two roles played by rtld (s/dlopen/rtld/) in the scene. One is as the loader of the module that defines the TLS variable. It's the one that initializes the TLS initialization block holding the variable. Another is as the relocator, that processes relocations referencing the variable in modules that may have been loaded along with the defining module (it could be the same module too), or at subsequent dlopen requests. When the defining and referencing modules are loaded at once, both initialization and relocation are performed without releasing the lock. When the referencing module is dlopened later, the lock is taken again, and then dlopen might find a relocation that requires a TLS variable to be in static TLS. If the variable is already in dynamic TLS, because some earlier access already made the decision, it returns an error. We could use CAS to this end both in dlopen and in tls_get_addr, but then, given concurrent access, we'd not increase the odds of a successful dlopen. With the lock, we do so to some extent. 
So, it's not a determinant factor, since it is always possible that the concurrent access beats the subsequent dlopen and places the variable in dynamic TLS before dlopen grabs the lock or decides with a CAS, but it shifts the balance slightly so that dlopen, however long it takes, is given way by concurrent tls_get_addr. Now, considering that we're moving towards reducing IE in dlopenable libs, in favor of TLS Descriptors, maybe it's time to revisit this balance and go for HW CAS instead to decide between static and dynamic. Even then, we'd want a lock, it occurs to me now. The reason is that, once dlopen determines a variable that must go in static TLS can go in static TLS, it computes the static TLS offset and proceeds to copy the TLS initialization block to the corresponding area of the Static TLS block of each thread, all while holding the rtld lock. The lock is what currently stops anyone from stepping in between dlopen's determination that the variable must and can go in static TLS, and the ultimate setting of the offset to indicate so. Should we go CAS, we'd need some intermediate hold value, or dlopen would have to retract its decision, or somesuch. Now, for IE access models, this is not a problem: if they execute an IE access model, it means the relocation and initialization happens-before the execution, or that the execution is unordered and therefore undefined. As for dynamic access models involving tls_get_addr, this is not clear-cut: there could be GD relocations processed earlier, but that are only exercised concurrently with a dlopen that places the variable in Static TLS. If we want dlopen to succeed, concurrent dynamic access models must wait until the static TLS area is initialized by dlopen. Taking the lock is the way this is accomplished, with the added benefit of acquiring the static TLS block memory initialized and released by dlopen as it released the lock.
Now, in order for this to work without a lock, we'd need dlopen to set
l_tls_offset *after* completing the initialization of the static TLS
area, and while releasing it, never before.  Heck, we need this even
with my proposed patch.  However, I don't see that dlopen behaves this
way: it appears to set l_tls_offset first, and then use it to initialize
each thread's static TLS area.

Therefore, my current patch is broken, and the code in master is correct
as far as dynamic access to TLS variables in static TLS goes: the lock
is unfortunately necessary under the current design.  I withdraw the
patch.

> You said there are no other dependencies, and we just want to reach
> consensus on the final value of l_tls_offset.  So, from the
> perspective of just this synchronization state machine you gave,
> there's no other data that's relevant.  Now you say there is other
> stuff.  What's true?

The latter.  When I wrote that, I was convinced we'd covered all the
cases of initialization performed by dlopen, because I had failed to
take into account the subsequent cross-thread static TLS area
initialization by a subsequent dlopen that assigns a previously-loaded
TLS variable to static TLS.

It looks like we could still avoid the lock in tls_get_addr by
reorganizing dlopen to choose the offset, initialize and release the
static blocks, and only then CAS the offset, failing if l_tls_offset was
changed along the way.

> Maybe you misunderstood, so let me rephrase it.  When, such as in this
> case, something deviates from what's typical, it's worth pointing this
> out.

I agree.  But does it?  GNU libc has lots of occurrences of this
pattern, so this is hardly a deviation from what's typical.

> You can put the code comments somewhere in the code, for example at one
> of the functions taking part, or at the decl of the variable, or
> somewhere else.  And then reference it.
*nod*

> I thought we had discussed this sufficiently before

We had, but we never reached agreement, and I doubt we ever will on this
point.

> Remember the loads on tile that got mentioned in a previous discussion
> we had?

'fraid I don't.  Reference?

>> - reloads would not be a problem for the pattern used in the second
>> version of the patch

> Yeah, one could speculate about what a compiler may or may not do in
> this *specific* piece of code.  But that's not the level of abstraction
> we want to use as base for our reasoning.

What is clear to me is that our reasonings and thought frameworks are so
different that your personal preferences don't apply to my way of
thinking, and vice versa.  Which is why I'm ok with answering questions
you may have about synchronization patterns I'm familiar with in GNU
libc, but not at all ok with writing the documentation.  From our
discussions so far, I am sure any such documentation I write will be
regarded as useless for you.  Because you and I think in different
terms, different primitives, different abstraction layers.  I'd build
the reasoning from my perspective, and it wouldn't match yours.  And
vice-versa.

> For example, assume the compiler is aware of the code for
> "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
> access l_tls_offset in some way (which it doesn't).  You access
> l_tls_offset inside and out of a critical section, so by
> data-race-freedom, there must not be concurrent write accesses.  So it
> does not actually have to reload l_tls_offset inside of the critical
> section, but use the value of your initial load.  This correct and
> reasonable-for-C11 compiler optimization breaks your synchronization.

Correct and reasonable-for-C11 except for the bold and bald assumption
that a lock operation is not a global acquire.  The compiler is
forbidden from removing the second load because of the existence of the
lock.  Now, that requirement comes from POSIX, not from the C standard.
>> - speculative stores that could affect this pattern are not permitted by
>> the relevant standards AFAICT

> The standard permits to speculatively store a value, if the target is
> not volatile and the speculative store does not create a data race.

*if* the abstract machine would have stored something in the variable to
begin with.  In the case at hand, it wouldn't, so, no speculative
writes.

> No, this is simply not true in general.  You can argue about likelihood
> in this *particular* case, but then you're doing just that.

Indeed, that's just what I'm doing.

> If you think it's unlikely compilers will try to optimize, have a look
> at this:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html

>> Since Carlos and Siddhesh took this over, I'll leave it for them too.

> So you will stop working on this?

Once Carlos wrote he and Siddhesh would take care of the issue, I did.

It doesn't mean I'm not willing to share the knowledge I got from
studying the intricacies of TLS synchronization any more.

Now that the patch is withdrawn because it is broken, we can even focus
on more productive questions and answers about it.  But this only makes
sense (as opposed to being a waste of time) if there's a commitment to
turn the conversation into proper documentation.  Do we have that from
anyone?

> If you intend to work on patches affecting synchronization in the
> future, please remember the feedback you got in this patch.

I'd rather try to avoid it.  It's just too hard for you and me to
communicate in this field, and I assume it's as frustrating for you as
it is for me :-(

>> We'd already determined the release fence was needed and taken care of
>> by other means.

> Huh?
rtld that loads and defines the variable and then releases the rtld lock
(and thus the memory) happens-before any well-defined use of the
variable (without some happens-before, how could it safely get ahold of
the relocations used in the TLS access model?), even if it doesn't take
the rtld lock, as covered in other messages upthread.  (That's a case
not to be confused with the subsequent dlopen that processes relocations
referencing the variable and assigns it to static TLS, described in this
message.)

>>> Also note that an unlock operation on a lock is *not* guaranteed to have
>>> the same effects as an atomic_thread_fence_release.
>>
>> *nod*
>>
>>> The hardware may or may not treat unlocks (ie,
>>> release stores on a *particular* memory location)
>>
>> It is my understanding that, per POSIX, unlocks ought to release all
>> prior stores made by the thread, and locks must acquire all
>> previously-released stores by any thread.  I.e., they don't apply to
>> single memory locations.  Locks don't even know what particular memory
>> locations they guard, if any.  So I can't make sense of what distinction
>> you intend to point out above.

> We implement POSIX for users of glibc, but we do not implement on top of
> POSIX inside of glibc

Is this documented anywhere?  If we're breaking well-set expectations
WRT locks, we'd better document that.  Though I wouldn't recommend
deviating so much from well-established practice.  It would be a huge
burden for anyone who might consider joining our community.

> (And you know that we don't even
> implement precisely the POSIX semantics with our POSIX locks.)

Yeah, and I do find that unfortunate.

> In C11, there's a distinction between a release-MO fence and a mutex
> unlock operation (i.e., a release-MO store).

> Does that answer your question?

I don't think I asked a question, but no, it doesn't help me understand
what you meant by "unlocks (ie release stores on a particular memory
location)".
In my understanding, unlocks are not about particular memory locations, so it comes across as a contradiction and your attempt to answer some unstated underlying question does nothing to clarify it.
On Tue, 2015-06-09 at 19:19 -0300, Alexandre Oliva wrote:
> On Jun 9, 2015, Torvald Riegel <triegel@redhat.com> wrote:
>> You said there are no other dependencies, and we just want to reach
>> consensus on the final value of l_tls_offset.  So, from the
>> perspective of just this synchronization state machine you gave,
>> there's no other data that's relevant.  Now you say there is other
>> stuff.  What's true?
>
> The latter.  When I wrote that, I was convinced we'd covered all the
> cases of initialization performed by dlopen because I had failed to
> take into account the subsequent cross-thread static TLS area
> initialization by a subsequent dlopen that assigns a previously-loaded
> TLS variable to static TLS.
>
> It looks like we could still avoid the lock in tls_get_addr by
> reorganizing dlopen to choose the offset, initialize and release the
> static blocks, and only then CAS the offset, failing if l_tls_offset was
> changed along the way.

Seems so, based on what you wrote.  And then we'd need the CAS to use
release MO and the loads that may read-from that CAS to use acquire MO.

>> Remember the loads on tile that got mentioned in a previous discussion
>> we had?
>
> 'fraid I don't.  Reference?

I didn't find the email thread after looking for a while.  What I
remember is that tile had operations that are not truly atomic (IIRC,
don't necessarily reload a whole cache line from memory, or some such).

>>> - reloads would not be a problem for the pattern used in the second
>>> version of the patch
>
>> Yeah, one could speculate about what a compiler may or may not do in
>> this *specific* piece of code.  But that's not the level of abstraction
>> we want to use as base for our reasoning.
>
> What is clear to me is that our reasonings and thought frameworks are so
> different that your personal preferences don't apply to my way of
> thinking, and vice versa.
> Which is why I'm ok with answering questions
> you may have about synchronization patterns I'm familiar with in GNU
> libc, but not at all ok with writing the documentation.  From our
> discussions so far, I am sure any such documentation I write will be
> regarded as useless for you.

I won't say it's useless if you try to convey knowledge you have to
others.  I may disagree with how it's done or expressed, or not
understand what you're trying to say -- but even then I'd consider it
input and a start of something that just needs further iteration.

> Because you and I think in different
> terms, different primitives, different abstraction layers.  I'd build
> the reasoning from my perspective, and it wouldn't match yours.  And
> vice-versa.

I don't think this is a question of my perspective vs. your perspective.
Eventually, we want to have consensus in the project about how we,
collectively, document and reason about concurrency.  I've been arguing
for what I think are the right layers of abstraction, terminology,
memory model, etc., based on my experience in this field.  If that turns
out to be not what works best for the project, that's fine, but we need
to discuss this.  Everyone may have a favourite way of thinking, but the
project still needs a common "concurrency speak".

>> For example, assume the compiler is aware of the code for
>> "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
>> access l_tls_offset in some way (which it doesn't).  You access
>> l_tls_offset inside and out of a critical section, so by
>> data-race-freedom, there must not be concurrent write accesses.  So it
>> does not actually have to reload l_tls_offset inside of the critical
>> section, but use the value of your initial load.  This correct and
>> reasonable-for-C11 compiler optimization breaks your synchronization.
>
> Correct and reasonable-for-C11 except for the bold and bald assumption
> that a lock operation is not a global acquire.
> The compiler is
> forbidden from removing the second load because of the existence of the
> lock.  Now, that requirement comes from POSIX, not from the C standard.

POSIX doesn't guarantee anything about plain memory accesses that do
have data races, right?  This isn't about the lock itself and whether it
"synchronizes memory", but about the data race and what the compiler can
do based on assuming DRF programs.  If the accesses to l_tls_offset were
sem_getvalue and sem_post (or relaxed-MO atomics), the compiler would
have to reload because of the lock acquisition and the acquire-MO load
part of it.

>>> - speculative stores that could affect this pattern are not permitted by
>>> the relevant standards AFAICT
>
>> The standard permits to speculatively store a value, if the target is
>> not volatile and the speculative store does not create a data race.
>
> *if* the abstract machine would have stored something in the variable to
> begin with.

No, not generally.  The program must behave as-if executed by the
abstract machine.  But behavior is defined as all accesses to
volatile-qualified variables and I/O, so the program is allowed to
speculatively store if it doesn't affect the volatile / I/O behavior of
the program.  That's why I said the target must not be volatile.

If adding a speculative store does not create a data race (and there's
different ways for a compiler to detect that), then there's no other
thread that can observe the speculative store.

Read-only mapped memory isn't considered by the standard, but compilers
may.  Even then, this is on a page granularity, and compilers can detect
whether something else in the page might be stored to.

> In the case at hand, it wouldn't, so, no speculative
> writes.
Just to be clear, this is the code snippet we're talking about:

  if (*p != NEGATIVE)
    {
      take lock;
      if (*p == UNKNOWN)
        *p = NEGATIVE;
      release lock;
    }

Your program told the compiler that there are no concurrent
modifications of the variable by accessing *p inside and outside of the
critical section.  That is, only we modify it while running this piece
of code.  That means we can ignore the lock, and we end up with this,
conceptually:

  // *p must be UNKNOWN or POSITIVE, due to the first if.
  if (*p == UNKNOWN)
    *p = NEGATIVE;

(And assume the compiler doesn't try to support read-only mapped memory,
or it observed that elsewhere, the program writes to the page p is part
of.)  Then, the compiler can speculatively store NEGATIVE to *p, and fix
up afterwards if it was wrong.

>> No, this is simply not true in general.  You can argue about likelihood
>> in this *particular* case, but then you're doing just that.
>
> Indeed, that's just what I'm doing.

But that's not useful.  First, we want rules that apply and are safe in
general, not rules that require case-by-case reasoning.  Second, we
don't want to depend on something being unlikely -- we want the thing to
always work.

>> If you think it's unlikely compilers will try to optimize, have a look
>> at this:
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
>
>>> Since Carlos and Siddhesh took this over, I'll leave it for them too.
>
>> So you will stop working on this?
>
> Once Carlos wrote he and Siddhesh would take care of the issue, I did.
>
> It doesn't mean I'm not willing to share the knowledge I got from
> studying the intricacies of TLS synchronization any more.

And I appreciate that.

> Now that the patch is withdrawn because it is broken, we can even focus
> on more productive questions and answers about it.  But this only makes
> sense (as opposed to being a waste of time) if there's a commitment to
> turn the conversation into proper documentation.  Do we have that from
> anyone?
I suppose Carlos and Siddhesh will do that then.  I have no plans to
work on TLS specifically, but will help with checking that the
concurrency bits are sound (including how they get documented).

>> If you intend to work on patches affecting synchronization in the
>> future, please remember the feedback you got in this patch.
>
> I'd rather try to avoid it.  It's just too hard for you and me to
> communicate in this field, and I assume it's as frustrating for you as
> it is for me :-(
>
>>> We'd already determined the release fence was needed and taken care of
>>> by other means.
>
>> Huh?
>
> rtld that loads and defines the variable and then releases the rtld lock
> (and thus the memory) happens-before any well-defined use of the
> variable (without some happens before, how could it safely get ahold of
> the relocations used in the TLS access model?), even if it doesn't take
> the rtld lock, as covered in other messages upthread.

To avoid confusion, I assume you are referring to such a pattern:

Thread 1:
  data = 1;
  flag = 1;
  unlock;

Thread 2:
  if (flag) foo = data;

POSIX doesn't guarantee anything for non-data-race-free plain memory
accesses, and this has data races.  Even if we'd assume that this isn't
the case, and unlock would magically make all prior modifications become
visible globally, there's still no guarantee that the load of flag
happens after the unlock.  You could make Thread 1 do this (and assume
that unlock is a release fence, and all plain memory accesses are indeed
atomic):

  data = 1;
  unlock;
  flag = 1;

But then you'd still need something at the observer side, to enforce
that the load of data won't be performed before the load of flag.

If the necessary happens-before is ensured through other sync, then
we're good though.

> (That's a case
> not to be confused with the subsequent dlopen that processes relocations
> referencing the variable and assigns it to static TLS, described in this
> message.)
>>> Also note that an unlock operation on a lock is *not* guaranteed to have
>>> the same effects as an atomic_thread_fence_release.
>>
>> *nod*
>>
>>> The hardware may or may not treat unlocks (ie,
>>> release stores on a *particular* memory location)
>>
>> It is my understanding that, per POSIX, unlocks ought to release all
>> prior stores made by the thread, and locks must acquire all
>> previously-released stores by any thread.  I.e., they don't apply to
>> single memory locations.  Locks don't even know what particular memory
>> locations they guard, if any.  So I can't make sense of what distinction
>> you intend to point out above.

>> We implement POSIX for users of glibc, but we do not implement on top of
>> POSIX inside of glibc
>
> Is this documented anywhere?

https://sourceware.org/glibc/wiki/Concurrency says that we're using the
C11 memory model.

> If we're breaking well-set expectations
> WRT locks, we'd better document that.

First of all, the POSIX semantics are not sufficient for us, because
they don't define atomics and also don't try to give programs with data
races some well-defined meaning (if they tried, they'd fail because it
would break plenty of compiler optimizations that are all used in
practice).  Second, we don't implement the stronger "synchronizes
memory" semantics POSIX seems to want to have.  That's not something
new.

>> In C11, there's a distinction between a release-MO fence and a mutex
>> unlock operation (i.e., a release-MO store).
>> Does that answer your question?
>
> I don't think I asked a question, but no, it doesn't help me understand
> what you meant by "unlocks (ie release stores on a particular memory
> location)".  In my understanding, unlocks are not about particular
> memory locations, so it comes across as a contradiction and your attempt
> to answer some unstated underlying question does nothing to clarify it.
See C11, 5.1.2.4p6:

  For example, a call that acquires a mutex will perform an acquire
  operation on the locations composing the mutex.  Correspondingly, a
  call that releases the same mutex will perform a release operation on
  those same locations.

Thus, there will be synchronizes-with (and thus happens-before) edges
between a release of a particular mutex instance and a subsequent
acquisition of the *same* mutex instance.  This creates a total order on
each particular mutex instance, but there is no happens-before enforced
with unrelated release or acquire operations (e.g., on different mutex
instances, atomics, ...).  This makes sense because it allows hardware
and compilers to not have to enforce global ordering.  And it aligns
well with the use cases mutexes are built for, namely critical sections
and mutual exclusion.

Mutexes do not interact with atomics sequenced before or after the mutex
release/acquire in the same way that a fence would interact with such
atomics.
On Jun 15, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> On Tue, 2015-06-09 at 19:19 -0300, Alexandre Oliva wrote:
>> It looks like we could still avoid the lock in tls_get_addr by
>> reorganizing dlopen to choose the offset, initialize and release the
>> static blocks, and only then CAS the offset, failing if l_tls_offset was
>> changed along the way.

> Seems so, based on what you wrote.  And then we'd need the CAS to use
> release MO and the loads that may read-from that CAS to use acquire MO.

*nod*

>>> Remember the loads on tile that got mentioned in a previous discussion
>>> we had?
>>
>> 'fraid I don't.  Reference?

> I didn't find the email thread after looking for a while.  What I
> remember is that tile had operations that are not truly atomic (IIRC,
> don't necessarily reload a whole cache line from memory, or some such).

Aah, now that rings a bell.  It was in a subthread about what a signal
handler might find when interrupting a memset.

>> Correct and reasonable-for-C11 except for the bold and bald assumption
>> that a lock operation is not a global acquire.  The compiler is
>> forbidden from removing the second load because of the existence of the
>> lock.  Now, that requirement comes from POSIX, not from the C standard.

> POSIX doesn't guarantee anything about plain memory accesses that do
> have data races, right?  This isn't about the lock itself and whether it
> "synchronizes memory", but about the data race and what the compiler can
> do based on assuming DRF programs.

Since POSIX doesn't specify atomics, I'm inclined to think this
reasoning rules out useful existing practice that is at least arguably
valid under POSIX, but...  I see where this is coming from, and I kind
of like the better-defined semantics, so I can see it could make sense
for POSIX to deprecate this practice and embrace atomics.

>> *if* the abstract machine would have stored something in the variable to
>> begin with.

> No, not generally.
> The program must behave as-if executed by the
> abstract machine.  But behavior is defined as all accesses to
> volatile-qualified variables and I/O, so the program is allowed to
> speculatively store if it doesn't affect the volatile / I/O behavior of
> the program.

I'm pretty sure I saw wording to the effect of what I wrote above in
some standard or draft a while ago.  That made sense to me: introducing
stores where there weren't any before seems like a great recipe to
introduce races.

>>> No, this is simply not true in general.  You can argue about likelihood
>>> in this *particular* case, but then you're doing just that.
>>
>> Indeed, that's just what I'm doing.

> But that's not useful.  First, we want rules that apply and are safe in
> general, not rules that require case-by-case reasoning.  Second, we
> don't want to depend on something being unlikely -- we want the thing to
> always work.

It looks like you assume "argue about likelihood" does not encompass the
case of "argue it's never going to happen", although that is what I was
doing.  I like general rules, but sometimes we have to reason on a
case-by-case basis.  I don't oppose taking advantage of situations that
only arise in specific cases to improve the code.

>> Now that the patch is withdrawn because it is broken, we can even focus
>> on more productive questions and answers about it.  But this only makes
>> sense (as opposed to being a waste of time) if there's a commitment to
>> turn the conversation into proper documentation.  Do we have that from
>> anyone?

> I suppose Carlos and Siddhesh will do that then.

I wouldn't jump to that conclusion.  The broader problem is not related
with TLS, but about rtld's holding a lock while running module
finalizers.
>> rtld that loads and defines the variable and then releases the rtld lock
>> (and thus the memory) happens-before any well-defined use of the
>> variable (without some happens before, how could it safely get ahold of
>> the relocations used in the TLS access model?), even if it doesn't take
>> the rtld lock, as covered in other messages upthread.

> To avoid confusion, I assume you are referring to such a pattern:
> Thread 1:
>   data = 1;
>   flag = 1;
>   unlock;
> Thread 2:
>   if (flag) foo = data;

You're missing the happens-before relationship.  I don't see any in your
example, but I explicitly stated there was one, so your example does not
match at all my attempt to summarize what we'd already discussed and
agreed on, namely, rtld's setting up the TLS variable and applying TLS
relocations that reference it, so please refer to that part of the
thread so that we don't go in circles.

> If the necessary happens-before is ensured through other sync, then
> we're good though.

That was the conclusion upthread.  Do you see any reason to revisit it
after rereading it?

>>> We implement POSIX for users of glibc, but we do not implement on top of
>>> POSIX inside of glibc
>>
>> Is this documented anywhere?

> https://sourceware.org/glibc/wiki/Concurrency
> says that we're using the C11 memory model.

Well, that's not exactly a true statement ATM.  We are *aiming* at it,
but there is a ton of code in GNU libc that isn't quite there yet.

> Thus, there will be synchronizes-with (and thus happens-before) edges
> between a release of a particular mutex instance and a subsequent
> acquisition of the *same* mutex instance.  This creates a total order
> on each particular mutex instance, but there is no happens-before
> enforced with unrelated release or acquire operations (e.g., on
> different mutex instances, atomics, ...).

...
because those don't introduce a synchronization with the release sequence that precedes releasing that specific mutex, whereas a memory fence would have more pervasive synchronization effects. Is this what you meant with "In C11, there's a distinction between a release-MO fence and a mutex unlock operation (i.e., a release-MO store)"? Is this all you meant?
On Mon, 2015-06-22 at 03:39 -0300, Alexandre Oliva wrote:
>>>> We implement POSIX for users of glibc, but we do not implement on top of
>>>> POSIX inside of glibc
>>>
>>> Is this documented anywhere?
>>
>> https://sourceware.org/glibc/wiki/Concurrency
>> says that we're using the C11 memory model.
>
> Well, that's not exactly a true statement ATM.  We are *aiming* at it,
> but there is a ton of code in GNU libc that isn't quite there yet.

Sure.

>> Thus, there will be synchronizes-with (and thus happens-before) edges
>> between a release of a particular mutex instance and a subsequent
>> acquisition of the *same* mutex instance.  This creates a total order
>> on each particular mutex instance, but there is no happens-before
>> enforced with unrelated release or acquire operations (e.g., on
>> different mutex instances, atomics, ...).
>
> ... because those don't introduce a synchronization with the release
> sequence that precedes releasing that specific mutex, whereas a memory
> fence would have more pervasive synchronization effects.

Fences create synchronizes-with in combination with other atomic
accesses sequenced after them (in case of release fences, otherwise
sequenced before them).  See 7.17.4p1-4 in C11 (N1570 is a draft) for
how that's worded there.

Also, to see this in action, you can try the example below at
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ using a browser that has
javascript enabled:

  int main() {
    atomic_int x = 0;
    atomic_int unrelated = 0;
    int y = 0;
    {{{ {
      y = 1;
      // won't work; will have data races:
      // unrelated.store(23, memory_order_release);
      // does work:
      atomic_thread_fence(memory_order_release);
      x.store(1, memory_order_relaxed);
    } ||| {
      r1 = x.load(memory_order_acquire).readsvalue(1);
      r2 = y.readsvalue(1);
    } }}};

This tries to access the non-atomic initialization of y safely by using
an atomic flag x.  The readsvalue calls are assertions, basically, to
constrain which executions the cppmem tool looks at.
Using the release store instead of the release fence will result in data
races.  The plot at the bottom right shows the relationships that matter
in the possible executions found by the tool.

>> Is this what
>> you meant with "In C11, there's a distinction between a release-MO fence
>> and a mutex unlock operation (i.e., a release-MO store)"?

Yes, basically.  More specifically, the difference between the fence and
a release store in cases similar to the example above.  And, as a
result, that a mutex unlock does not have the same synchronization
effects as a release fence.
diff --git a/NEWS b/NEWS
index ed02de0..eac100c 100644
--- a/NEWS
+++ b/NEWS
@@ -19,7 +19,7 @@ Version 2.22
   18047, 18049, 18068, 18080, 18093, 18100, 18104, 18110, 18111, 18116,
   18125, 18128, 18138, 18185, 18196, 18197, 18206, 18210, 18211, 18217,
   18220, 18221, 18234, 18244, 18247, 18287, 18319, 18333, 18346, 18397,
-  18409, 18410, 18412, 18418, 18422, 18434, 18444, 18469.
+  18409, 18410, 18412, 18418, 18422, 18434, 18444, 18457, 18469.
 
 * Cache information can be queried via sysconf() function on s390 e.g. with
   _SC_LEVEL1_ICACHE_SIZE as argument.
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 20c7e33..8fc210d 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -755,41 +755,54 @@ tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
       the_map = listp->slotinfo[idx].map;
     }
 
-  /* Make sure that, if a dlopen running in parallel forces the
-     variable into static storage, we'll wait until the address in the
-     static TLS block is set up, and use that.  If we're undecided
-     yet, make sure we make the decision holding the lock as well.  */
-  if (__glibc_unlikely (the_map->l_tls_offset
-			!= FORCED_DYNAMIC_TLS_OFFSET))
+  /* If the TLS block for the map is already assigned to dynamic, or
+     to some static TLS offset, the decision is final, and no lock is
+     required.  Now, if the decision hasn't been made, take the rtld
+     lock, so that an ongoing dlopen gets a chance to complete,
+     possibly assigning the module to static TLS and initializing the
+     corresponding TLS area for all threads, and then retest; if the
+     decision is still pending, force the module to dynamic TLS.
+
+     The risk that the thread accesses an earlier value in that memory
+     location, from before it was recycled into a link map in another
+     thread, is removed by the need for some happens before
+     relationship between the loader that set that up and the TLS
+     access that referenced the module id.  In the presence of such a
+     relationship, the value will be at least as recent as the
+     initialization, and in its absence, calling tls_get_addr with its
+     module id invokes undefined behavior.  */
+  if (__glibc_unlikely (the_map->l_tls_offset == NO_TLS_OFFSET))
     {
       __rtld_lock_lock_recursive (GL(dl_load_lock));
       if (__glibc_likely (the_map->l_tls_offset == NO_TLS_OFFSET))
-	{
-	  the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
-	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-	}
-      else if (__glibc_likely (the_map->l_tls_offset
-			       != FORCED_DYNAMIC_TLS_OFFSET))
-	{
+	the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
+      __rtld_lock_unlock_recursive (GL(dl_load_lock));
+    }
+
+  void *p;
+
+  if (the_map->l_tls_offset != FORCED_DYNAMIC_TLS_OFFSET)
+    {
 #if TLS_TCB_AT_TP
-	  void *p = (char *) THREAD_SELF - the_map->l_tls_offset;
+      p = (char *) THREAD_SELF - the_map->l_tls_offset;
 #elif TLS_DTV_AT_TP
-	  void *p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
+      p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
 #else
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
-	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
-	  dtv[GET_ADDR_MODULE].pointer.is_static = true;
-	  dtv[GET_ADDR_MODULE].pointer.val = p;
-	  return (char *) p + GET_ADDR_OFFSET;
-	}
-      else
-	__rtld_lock_unlock_recursive (GL(dl_load_lock));
+      dtv[GET_ADDR_MODULE].pointer.is_static = true;
+      dtv[GET_ADDR_MODULE].pointer.val = p;
+    }
+  else
+    {
+      p = allocate_and_init (the_map);
+      assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
+      /* FIXME: this is AS-Unsafe.  We'd have to atomically set val to
+	 p, if it is still unallocated, or release p and set it to
+	 val.  */
+      dtv[GET_ADDR_MODULE].pointer.val = p;
     }
 
-  void *p = dtv[GET_ADDR_MODULE].pointer.val = allocate_and_init (the_map);
-  assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
   return (char *) p + GET_ADDR_OFFSET;
 }
diff --git a/nptl/Makefile b/nptl/Makefile
index 3dd2944..4ffd13c 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -242,7 +242,7 @@ tests = tst-typesizes \
 	tst-basic7 \
 	tst-kill1 tst-kill2 tst-kill3 tst-kill4 tst-kill5 tst-kill6 \
 	tst-raise1 \
-	tst-join1 tst-join2 tst-join3 tst-join4 tst-join5 tst-join6 \
+	tst-join1 tst-join2 tst-join3 tst-join4 tst-join5 tst-join6 tst-join7 \
 	tst-detach1 \
 	tst-eintr1 tst-eintr2 tst-eintr3 tst-eintr4 tst-eintr5 \
 	tst-tsd1 tst-tsd2 tst-tsd3 tst-tsd4 tst-tsd5 tst-tsd6 \
@@ -320,7 +320,8 @@ endif
 modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
-		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod
+		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
+		tst-join7mod
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) tst-cleanup4aux.o
 test-extras += $(modules-names) tst-cleanup4aux
 test-modules = $(addprefix $(objpfx),$(addsuffix .so,$(modules-names)))
@@ -525,6 +526,11 @@ $(objpfx)tst-tls6.out: tst-tls6.sh $(objpfx)tst-tls5 \
 	$(evaluate-test)
 endif
 
+$(objpfx)tst-join7: $(libdl) $(shared-thread-library)
+$(objpfx)tst-join7.out: $(objpfx)tst-join7mod.so
+$(objpfx)tst-join7mod.so: $(shared-thread-library)
+LDFLAGS-tst-join7mod.so = -Wl,-soname,tst-join7mod.so
+
 $(objpfx)tst-dlsym1: $(libdl) $(shared-thread-library)
 $(objpfx)tst-fini1: $(shared-thread-library) $(objpfx)tst-fini1mod.so
diff --git a/nptl/tst-join7.c b/nptl/tst-join7.c
new file mode 100644
index 0000000..bf6fc76
--- /dev/null
+++ b/nptl/tst-join7.c
@@ -0,0 +1,12 @@
+#include <dlfcn.h>
+
+int
+do_test (void)
+{
+  void *f = dlopen ("tst-join7mod.so", RTLD_NOW | RTLD_GLOBAL);
+  if (f) dlclose (f); else return 1;
+
return 0; +} + +#define TEST_FUNCTION do_test () +#include "../test-skeleton.c" diff --git a/nptl/tst-join7mod.c b/nptl/tst-join7mod.c new file mode 100644 index 0000000..a8c7bc0 --- /dev/null +++ b/nptl/tst-join7mod.c @@ -0,0 +1,29 @@ +#include <stdio.h> +#include <pthread.h> + +static pthread_t th; +static int running = 1; + +static void * +test_run (void *p) +{ + while (running) + fprintf (stderr, "XXX test_run\n"); + fprintf (stderr, "XXX test_run FINISHED\n"); + return NULL; +} + +static void __attribute__ ((constructor)) +do_init (void) +{ + pthread_create (&th, NULL, test_run, NULL); +} + +static void __attribute__ ((destructor)) +do_end (void) +{ + running = 0; + fprintf (stderr, "thread_join...\n"); + pthread_join (th, NULL); + fprintf (stderr, "thread_join DONE\n"); +}