[PR18457] Don't require rtld lock to compute DTV addr for static TLS

Message ID orwpzkee1h.fsf@livre.home
State Superseded

Commit Message

Alexandre Oliva June 3, 2015, 8:24 p.m. UTC
  On Jun  3, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> So you think a reload by the compiler would be bad.

Yeah.  It's a double-checked lock entry with sugar on top.  We only need
to take the lock if l_tls_offset is unresolved (NO_TLS_OFFSET), but if
it is resolved, we want different behavior depending on whether it is
FORCED_DYNAMIC_TLS_OFFSET, or something else (static TLS offset).

> This can only be bad if there is concurrent modification, potentially

Concurrent modification is made while holding the lock.

It shouldn't happen in the same thread, at least as long as TLS is
regarded as AS-Unsafe, but other threads could concurrently attempt to
resolve a module's l_tls_offset to a static offset or forced dynamic.

> Thus we need the atomic access

I'm not convinced we do, but I don't mind, and I don't want this to be a
point of contention.

> AFAIU, you seem to speak about memory reuse unrelated to this
> particular load, right?

Yeah, some other earlier use of the same location.

> But that sounds like an issue independently of whether the specific
> load is an acquire or relaxed load.

Not really.  It is a preexisting issue, yes, but an acquire load would
make sure the (re)initialization of the memory into a link map,
performed while holding the lock (and with an atomic write, no less),
would necessarily be observed by the atomic acquire load.  A relaxed
load might still observe the un(re)initialized value.  Right?

> We established before that you want to prevent reload because there are
> concurrent stores.  Are these by other threads?

Answered above.

> If so, are there any cases of the following pattern:

Dude.  Of course not.  None of them use atomics.  So far this has only
used locks to guard changes, and double-checked locks for access.

> storing thread;
>   A;
>   atomic_store_relaxed(&l_tls_offset, ...);

> observing thread:
>   offset = atomic_load_relaxed(&l_tls_offset);
>   B(offset);

> where something in B (which uses or has a dependency on offset) relies
> on happening after A?

Let's rewrite this into something more like what we have now:

  storing thread:
     acquire lock;
     A;
     set l_tls_offset;
     release lock;

  observing thread:
     if l_tls_offset is undecided:
       acquire lock;
       if l_tls_offset is undecided:
         set l_tls_offset to forced_dynamic; // storing thread too!
       release lock;
     assert(l_tls_offset is decided);
     if l_tls_offset is forced_dynamic:
       dynamicB(l_tls_offset)
     else
       staticB(l_tls_offset)

The forced_dynamic case of B(l_tls_offset) will involve at least copying
the TLS init block, which A will have mapped in and relocated.  We don't
take the lock for that copy, so the release after A doesn't guarantee we
see the intended values.  Now, since the area is mmapped and then
relocated, it is very unlikely that any cache would have kept a copy of
the unrelocated block, let alone of any prior mmapping in that range.
So, it is very likely to work, but it's not guaranteed to work.

As for the static TLS case of B(l_tls_offset), the potential for
problems is similar, but not quite the same.  The key difference is that
the initialization of the static TLS block takes place at the storing
thread, rather than in the observing thread, and although
B(l_tls_offset) won't access the thread's static TLS block, the caller
of __tls_get_addr will.  (and so will any IE and LE TLS access)

Now, in order for any such access to take place, some relocation applied
by A must be seen by the observing thread, and if there isn't some
sequencing event that ensures the dlopen (or initial load) enclosing A
happens-before the use of the relocation, the whole thing is undefined;
otherwise, this sequencing event ought to be enough of a memory barrier
to guarantee the whole thing works.  It's just that the sequencing event
is not provided by the TLS machinery itself, but rather by the user, in
sequencing events after the dlopen, by the init code, in sequencing the
initial loading and relocation before any application code execution, or
by the thread library, sequencing any thread started by module
initializers after their relocation.

Which means to me that a relaxed load might turn out to be enough, after
all.

> I'm trying to find out what you know about the intent behind the TLS
> synchronization

FWIW, in this discussion we're touching just a tiny fraction of it, and
one that's particularly trivial compared with other bits :-(

> Does dlopen just have to decide about this value

It does tons of stuff (loading modules and dependencies, applying
relocations, running initializers), and it must have a say first.  E.g.,
if any IE relocation references a module, we must make it static TLS.
Otherwise, dlopen may leave it undecided, and then a later dlopen might
attempt to make it static TLS again (and fail if that's no longer
possible), or an intervening tls_get_addr may see it's undecided and
make the module's TLS dynamic.

> I disagree.  You added an atomic load on the consumer side (rightly
> so!), so you should not ignore the producer side either.  This is in the
> same function, and you touch most of the lines around it, and it's
> confusing if you make a partial change.

You're missing the other cases elsewhere that set this same field.

> Let me point out that we do have consensus in the project that new code
> must be free of data races.

Is a double-checked lock regarded as a race?  I didn't think so.  So
I'm proposing this patch, which reorganizes the function a bit to make
this absolutely clear, so that we can get the fix in and I can move on,
instead of extending the useless part of this conversation any further,
and so that we can focus on the important stuff.

How's this?


We used to store static TLS addrs in the DTV at module load time, but
this required one thread to modify another thread's DTV.  Now that we
defer the DTV update to the first use in the thread, we should do so
without taking the rtld lock if the module is already assigned to static
TLS.  Taking the lock introduces deadlocks where there weren't any
before.

This patch fixes the deadlock caused by tls_get_addr unnecessarily
taking the rtld lock to initialize the DTV entry for tls_dtor_list
within __call_dtors_list, which deadlocks with a dlclose of a module
whose finalizer joins with that thread.  The patch does not, however,
attempt to fix other potential sources of similar deadlocks, such as:
the explicit rtld locks taken by call_dtors_list when the dtor list is
not empty; lazy relocation of the reference to tls_dtor_list when TLS
descriptors are in use; or TLS dtors that call functions through the
PLT when lazy relocation needs to be performed, or that call functions
that interact with the dynamic loader in ways that require its lock to
be taken.

for  ChangeLog

	[PR dynamic-link/18457]
	* elf/dl-tls.c (tls_get_addr_tail): Don't take the rtld lock
	if we already have a final static TLS offset.
	* nptl/tst-join7.c, nptl/tst-join7mod.c: New, from Andreas
	Schwab's bug report.
	* nptl/Makefile (tests): Add tst-join7.
	(module-names): Add tst-join7mod.
	($(objpfx)tst-join7): New.  Add deps.
	($(objpfx)tst-join7.out): Likewise.
	($(objpfx)tst-join7mod.so): Likewise.
	(LDFLAGS-tst-join7mod.so): Set soname.
---
 NEWS                |    2 +-
 elf/dl-tls.c        |   63 +++++++++++++++++++++++++++++++--------------------
 nptl/Makefile       |   10 ++++++--
 nptl/tst-join7.c    |   12 ++++++++++
 nptl/tst-join7mod.c |   29 +++++++++++++++++++++++
 5 files changed, 88 insertions(+), 28 deletions(-)
 create mode 100644 nptl/tst-join7.c
 create mode 100644 nptl/tst-join7mod.c
  

Comments

Torvald Riegel June 4, 2015, 12:11 p.m. UTC | #1
On Wed, 2015-06-03 at 17:24 -0300, Alexandre Oliva wrote:
> On Jun  3, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > So you think a reload by the compiler would be bad.
> 
> Yeah.  It's a double-checked lock entry with sugar on top.  We only need
> to take the lock if l_tls_offset is unresolved (NO_TLS_OFFSET), but if
> it is resolved, we want different behavior depending on whether it is
> FORCED_DYNAMIC_TLS_OFFSET, or something else (static TLS offset).
> 
> > This can only be bad if there is concurrent modification, potentially
> 
> Concurrent modification is made while holding the lock.

Unless *all* accesses happen while holding the lock, there are pairs of
accesses that are not ordered by happens-before; if there is a write in
any of those pairs, you must use atomics to prevent a data race.

(If happens-before is ensured through *other* synchronization (e.g.,
another lock), then this needs to be considered too.)

Applied to a double-checked locking pattern, this means that all data
that is accessed outside the critical section, and is also checked and
modified inside the critical section, must use atomic accesses.

> It shouldn't happen in the same thread, at least as long as TLS is
> regarded as AS-Unsafe, but other threads could concurrently attempt to
> resolve a module's l_tls_offset to a static offset or forced dynamic.

OK.  So we're dealing with inter-thread concurrency here.

> > Thus we need the atomic access
> 
> I'm not convinced we do, but I don't mind, and I don't want this to be a
> point of contention.
> 
> > AFAIU, you seem to speak about memory reuse unrelated to this
> > particular load, right?
> 
> Yeah, some other earlier use of the same location.
> 
> > But that sounds like an issue independently of whether the specific
> > load is an acquire or relaxed load.
> 
> Not really.  It is a preexisting issue, yes, but an acquire load would
> make sure the (re)initialization of the memory into a link map,
> performed while holding the lock (and with an atomic write, no less),
> would necessarily be observed by the atomic acquire load.  A relaxed
> load might still observe the un(re)initialized value.  Right?

I can't follow you here.

One thing to note is that acquire loads synchronize with release stores
(or stronger) on the *same* memory location.  An acquire load does not
synchronize with operations on a lock, unless the acquire load peeks
into the lock and does an acquire load on the lock's state or such.

Therefore, when you think about which effect an acquire load has,
consider which release store you are actually thinking about.  An
acquire operation does not have an effect on its own, only in
combination with other effects in the program.  This is also why we want
to document which release store an acquire load is supposed to
synchronize with.

Thus, which release store are you thinking about in this case?

> > We established before that you want to prevent reload because there are
> > concurrent stores.  Are these by other threads?
> 
> Answered above.
> 
> > If so, are there any cases of the following pattern:
> 
> Dude.  Of course not.  None of them use atomics.  So far this has only
> used locks to guard changes, and double-checked locks for access.

Think conceptually, and consider that atomics don't synchronize with
locks unless the atomics access the lock state themselves.  Thus, the
atomics in there are the atomics inside and outside of the critical
section(s).

> > storing thread;
> >   A;
> >   atomic_store_relaxed(&l_tls_offset, ...);
> 
> > observing thread:
> >   offset = atomic_load_relaxed(&l_tls_offset);
> >   B(offset);
> 
> > where something in B (which uses or has a dependency on offset) relies
> > on happening after A?
> 
> Let's rewrite this into something more like what we have now:
> 
>   storing thread:
>      acquire lock;
>      A;
>      set l_tls_offset;
>      release lock;
> 
>   observing thread:
>      if l_tls_offset is undecided:
>        acquire lock;
>        if l_tls_offset is undecided:
>          set l_tls_offset to forced_dynamic; // storing thread too!
>        release lock;
>      assert(l_tls_offset is decided);
>      if l_tls_offset is forced_dynamic:
>        dynamicB(l_tls_offset)
>      else
>        staticB(l_tls_offset)
> 
> The forced_dynamic case of B(l_tls_offset) will involve at least copying
> the TLS init block, which A will have mapped in and relocated.  We don't
> take the lock for that copy, so the release after A doesn't guarantee we
> see the intended values.  Now, since the area is mmapped and then
> relocated, it is very unlikely that any cache would have kept a copy of
> the unrelocated block, let alone of any prior mmapping in that range.
> So, it is very likely to work, but it's not guaranteed to work.

Try to ignore what you think a particular cache implementation may or
may not be likely to do.  Instead, please think about whether the
program enforces the happens-before relationships it relies on.

So, the effects in A are mmap and relocation of the TLS init block
(which involves stores to the memory locations copied by B).  B needs
both to happen-before itself to work correctly.  A can be executed by a
different thread than the thread executing B.  Correct?

In this case, A needs to happen-before B, and you need a release store
for l_tls_offset and an appropriate acquire load of l_tls_offset on the
observer side.  Those establish the happens-before between A and B.

Alternatively, A needs to (conceptually) include issuing a release fence
after the effects, so that a relaxed store is sufficient on the storing
side.
For mmap specifically, we may be able to assume that all mmap
implementations are a syscall, the kernel does the mmap (i.e., the
effect), and the return from the kernel is always a release fence.
However, then we need to document this for mmap, if we want to rely on
it.
And on the observer side, we'd still need an acquire load, unless we can
put an acquire fence somewhere else.

Note that whether the storing thread has acquired the lock or not
doesn't matter in the analysis of this case.

> As for the static TLS case of B(l_tls_offset), the potential for
> problems is similar, but not quite the same.  The key difference is that
> the initialization of the static TLS block takes place at the storing
> thread, rather than in the observing thread, and although
> B(l_tls_offset) won't access the thread's static TLS block, the caller
> of __tls_get_addr will.  (and so will any IE and LE TLS access)

OK. So let's consider callers to be part of B(l_tls_offset).

> Now, in order for any such access to take place, some relocation applied
> by A must be seen by the observing thread, and if there isn't some
> sequencing event that ensures the dlopen (or initial load) enclosing A
> happens-before the use of the relocation, the whole thing is undefined;
> otherwise, this sequencing event ought to be enough of a memory barrier
> to guarantee the whole thing works.  It's just that the sequencing event
> is not provided by the TLS machinery itself, but rather by the user, in
> sequencing events after the dlopen, by the init code, in sequencing the
> initial loading and relocation before any application code execution, or
> by the thread library, sequencing any thread started by module
> initializers after their relocation.

If that's the situation in the static case, it seems we do not need an
acquire load nor release store because there is either no concurrent
access, naturally, or the user has to ensure happens-before.  We can say
something like this (feel free to fix the TLS terminology):

In the static case, we can assume that the initialization/relocation of
a TLS slot [I mean A] always happens-before the use of this TLS slot:
(1) in case of dlopen, we require the user to ensure that dlopen
happens-before other threads execute the new TLS;
(2) in case of initial loading, this will happen-before any execution of
application code; and
(3) in case of [module initializers case, don't know how to phrase that
properly...]
Therefore, relaxed MO is fine for this load and the store in [the
storing side function].

This documents the assumptions the function is making, and thus what
other things it relies on.  Somebody reading the code of the function
sees the comment, knows why there's just a relaxed MO load, and can
connect it to the rest of the mental model of TLS synchronization.  We
do need to document this to be able to understand how the
synchronization is intended to work, and to be able to cross-check this
with the assumptions and documentation of other related functions.

> Which means to me that a relaxed load might turn out to be enough, after
> all.
> 
> > I'm trying to find out what you know about the intent behind the TLS
> > synchronization
> 
> FWIW, in this discussion we're touching just a tiny fraction of it, and
> one that's particularly trivial compared with other bits :-(
> 
> > Does dlopen just have to decide about this value
> 
> It does tons of stuff (loading modules and dependencies, applying
> relocations, running initializers), and it must have a say first.  E.g.,
> if any IE relocation references a module, we must make it static TLS.
> Otherwise, dlopen may leave it undecided, and then a later dlopen might
> attempt to make it static TLS again (and fail if that's no longer
> possible), or an intervening tls_get_addr may see it's undecided and
> make the module's TLS dynamic.
> 
> > I disagree.  You added an atomic load on the consumer side (rightly
> > so!), so you should not ignore the producer side either.  This is in the
> > same function, and you touch most of the lines around it, and it's
> > confusing if you make a partial change.
> 
> You're missing the other cases elsewhere that set this same field.

What do you mean?  How is it any better if you don't fix it properly in
the functions you have looked at and modified, just because there are
more problems elsewhere?

> > Let me point out that we do have consensus in the project that new code
> > must be free of data races.
> 
> Is a double-check lock regarded as a race?  I didn't think so.

*Correct* double-checked locking isn't.  But correct double-checked
locking doesn't have data races, and this code here does (unless there's
some constraint on execution that I'm not aware of and you haven't told
me about).

Look at your example code above.  "set l_tls_offset;" in the storing
thread can be concurrent with the first access of "l_tls_offset" (or the
ones in the branches after the critical section).  Concurrent here means
not ordered by happens-before (unlike in the static case, because there
we rely on other happens-before, as you point out above).  One of the
concurrent accesses is a store.  This is a data race.

> So, I'm
> proposing this patch, that reorganizes the function a bit to make this
> absolutely clear, so that we can get the fix in and I can move on,
> instead of extending any further the useless part of this conversation,
> so that we can focus on the important stuff.

Adding data races, or not fixing data races when you touch the code, is
not correct -- thus discussing whether you have a data race in your patch
is absolutely not useless.

We agreed on the no-data-races rule for a reason.  Please adhere to this
like you do for other project rules.  If you disagree with the rule,
bring this up again, as a separate topic.  Until that happens, I'd like
to not have to argue over this again and again.

> How's this?

I'll comment on the revised patch.
  
Alexandre Oliva June 5, 2015, 4:39 a.m. UTC | #2
On Jun  4, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> Applied to a double-checked locking pattern, this means that all data
> accessed outside the critical section, and is also checked and modified
> inside the critical section, must use atomic accesses.

Is the l_tls_offset field the data you're talking about?  We've already
determined that there is a happens-before for everything else, and my
understanding is that, for l_tls_offset alone, being the double-checked
lock's key value and the only one that matters for uses not necessarily
covered by the happens-before relationship we've already established,
we have no need for atomics there.

> OK.  So we're dealing with inter-thread concurrency here.

Yes.

>> Not really.  It is a preexisting issue, yes, but an acquire load would
>> make sure the (re)initialization of the memory into a link map,
>> performed while holding the lock (and with an atomic write, no less),
>> would necessarily be observed by the atomic acquire load.  A relaxed
>> load might still observe the un(re)initialized value.  Right?

> I can't follow you here.

> One thing to note is that acquire loads synchronize with release stores
> (or stronger) on the *same* memory location.  An acquire load does not
> synchronize with operations on a lock, unless the acquire load peeks
> into the lock and does an acquire load on the lock's state or such.

> Therefore, when you think about which effect an acquire load has,
> consider which release store you are actually thinking about.  An
>> > acquire operation does not have an effect on its own, only in
> combination with other effects in the program.  This is also why we want
> to document which release store an acquire load is supposed to
> synchronize with.

> Thus, which release store are you thinking about in this case?

Nothing but l_tls_offset.

>> Now, in order for any such access to take place, some relocation applied
>> by A must be seen by the observing thread, and if there isn't some
>> sequencing event that ensures the dlopen (or initial load) enclosing A
>> happens-before the use of the relocation, the whole thing is undefined;
>> otherwise, this sequencing event ought to be enough of a memory barrier
>> to guarantee the whole thing works.  It's just that the sequencing event
>> is not provided by the TLS machinery itself, but rather by the user, in
>> sequencing events after the dlopen, by the init code, in sequencing the
>> initial loading and relocation before any application code execution, or
>> by the thread library, sequencing any thread started by module
>> initializers after their relocation.

> If that's the situation in the static case

The paragraph quoted above applies to both cases.

>> You're missing the other cases elsewhere that set this same field.

> What do you mean?  How is it any better if you don't fix it properly in
> the functions you have looked at and modified, just because there are
> more problems elsewhere?

If you say atomics are only correct if all loads and stores, including
those guarded by locks, are atomic, then adding atomics to only some of
them makes things wrong.

>> Is a double-check lock regarded as a race?  I didn't think so.

> *Correct* double-checked locking isn't.

> "set l_tls_offset;" in the storing thread can be concurrent with the
> first access of "l_tls_offset" (or the ones in the branches after the
> critical section).

It appears to follow from your statement and example above that
*correct* double-checked locking can only be attained using atomics.  Is
that so?  If not, can you give an example of correct double-checked
locking that doesn't use them, and explain why that's different from
what's in my revised patch?

> If you disagree with the rule,

I don't.  Maybe one of us misunderstands it, or we're otherwise failing
to communicate, but I'm honestly trying to avoid data races.  I just
don't know that the unguarded reads in double-checked locks qualify as
data races.
  
Torvald Riegel June 5, 2015, 10:39 a.m. UTC | #3
On Fri, 2015-06-05 at 01:39 -0300, Alexandre Oliva wrote:
> On Jun  4, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > Applied to a double-checked locking pattern, this means that all data
> > accessed outside the critical section, and is also checked and modified
> > inside the critical section, must use atomic accesses.
> 
> Is the l_tls_offset field the data you're talking about?

Yes, in this case.

(There may be other data that is initialized once inside the critical
section and then accessed outside.  Accesses to those data can use
nonatomic accesses if data like l_tls_offset makes them
data-race-free.)

> We've already
> determined that there is a happens-before for everything else, and my
> understanding is that, for l_tls_offset alone, being the double-checked
> lock key value, and the only one that matters for the uses that isn't
> necessarily covered by the happens-before relationship we've already
> established, we have no need for atomics there.

No, that's not true.  If you use a flag like l_tls_offset to indicate
whether some initialization or other job has been performed already,
and try to apply double-checked locking, it has to look like this
(flag==true means init done, flag initially false):

if (atomic_load_acquire(&flag) != true)
  {
     lock();
     if (atomic_load_relaxed(&flag) != true)  // second check, under lock
       {
          init(&data);
          atomic_store_release(&flag, true);
       }
     unlock();
  }
// data ready to use here
use(&data);

This is double-checked locking.  For an alternative version that
combines the lock with the flag, look at how pthread_once does it.  The
atomic accesses to flag are necessary.  If you didn't actually have
concurrent accesses to flag, you wouldn't be doing double-checked
locking and could avoid the lock altogether, unless you need it for
some other, orthogonal reason.

If your code deviates from this, you can't simply say it's
double-checked locking; you need to comment on why the deviation is
correct.

> > OK.  So we're dealing with inter-thread concurrency here.
> 
> Yes.
> 
> >> Not really.  It is a preexisting issue, yes, but an acquire load would
> >> make sure the (re)initialization of the memory into a link map,
> >> performed while holding the lock (and with an atomic write, no less),
> >> would necessarily be observed by the atomic acquire load.  A relaxed
> >> load might still observe the un(re)initialized value.  Right?
> 
> > I can't follow you here.
> 
> > One thing to note is that acquire loads synchronize with release stores
> > (or stronger) on the *same* memory location.  An acquire load does not
> > synchronize with operations on a lock, unless the acquire load peeks
> > into the lock and does an acquire load on the lock's state or such.
> 
> > Therefore, when you think about which effect an acquire load has,
> > consider which release store you are actually thinking about.  An
> > acquire operation does not have an effect on it's own, only in
> > combination with other effects in the program.  This is also why we want
> > to document which release store an acquire load is supposed to
> > synchronize with.
> 
> > Thus, which release store are you thinking about in this case?
> 
> Nothing but l_tls_offset.

I haven't see a release-MO atomic store to l_tls_offset yet.  Nor a
release fence.  Which one do you mean?

> 
> >> Now, in order for any such access to take place, some relocation applied
> >> by A must be seen by the observing thread, and if there isn't some
> >> sequencing event that ensures the dlopen (or initial load) enclosing A
> >> happens-before the use of the relocation, the whole thing is undefined;
> >> otherwise, this sequencing event ought to be enough of a memory barrier
> >> to guarantee the whole thing works.  It's just that the sequencing event
> >> is not provided by the TLS machinery itself, but rather by the user, in
> >> sequencing events after the dlopen, by the init code, in sequencing the
> >> initial loading and relocation before any application code execution, or
> >> by the thread library, sequencing any thread started by module
> >> initializers after their relocation.
> 
> > If that's the situation in the static case
> 
> The paragraph quoted above applies to both cases.

Huh?  Why would you then need the double-checked locking at all?  Unless
I misunderstand you, what you seem to be saying isn't consistent.

> >> You're missing the other cases elsewhere that set this same field.
> 
> > What do you mean?  How is it any better if you don't fix it properly in
> > the functions you have looked at and modified, just because there are
> > more problems elsewhere?
> 
> If you say atomics are only correct if all loads and stores, including
> those guarded by locks, are atomic, then adding atomics to only some of
> them makes things wrong.

Yes, that's the case, strictly speaking.  But you didn't want to tackle
a full conversion of every access to l_tls_offset anywhere --
understandably.  Therefore, I didn't request that.
But that doesn't mean you can ignore what's in the immediate
neighborhood of your change.

> >> Is a double-check lock regarded as a race?  I didn't think so.
> 
> > *Correct* double-checked locking isn't.
> 
> > "set l_tls_offset;" in the storing thread can be concurrent with the
> > first access of "l_tls_offset" (or the ones in the branches after the
> > critical section).
> 
> It appears to follow from your statement and example above that
> *correct* double-checked locking can only be attained using atomics.  Is
> that so?

See above.  Note that init(data) and use(data) can use nonatomic
accesses to data, but only because of the correct double-checked
locking synchronization and the atomic store-release / load-acquire
accesses on the flag.

> If not, can you give an example of correct double-checked
> locking that doesn't use them, and explain why that's different from
> what's in my revised patch?
> 
> > If you disagree with the rule,
> 
> I don't.  Maybe one of us misunderstands it, or we're otherwise failing
> to communicate, but I'm honestly trying to avoid data races.  I just
> don't know that the unguarded reads in double-checked locks qualify as
> data races.

If neither the double-checked locking example above nor the comments
in pthread_once make it clear, please get in touch and I'll try to
explain.
  
Alexandre Oliva June 5, 2015, 5:43 p.m. UTC | #4
On Jun  5, 2015, Torvald Riegel <triegel@redhat.com> wrote:

>> Is the l_tls_offset field the data you're talking about?

> Yes, in this case.

> (There may be other data that is initialized once inside the critical
> section and then accessed outside.  Accesses to those data can use
> nonatomic accesses if data like l_tls_offset makes them data
> data-race-free.)

Ok, this means we can take them out of the analysis and focus on
l_tls_offset alone.  Let's do that, shall we?

Consider this:

enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };

void *init () {
  static enum spin electron;
  take lock;
  electron = UNKNOWN;
  release lock;
  return &electron;
}

bool make_positive (void *x) {
  enum spin *p = x;
  take lock;
  if (*p == UNKNOWN)
    *p = POSITIVE;
  release lock;
  return *p == POSITIVE;
}

bool positive_p (void *x) {
  enum spin *p = x;
  if (*p != NEGATIVE) {
    take lock;
    if (*p == UNKNOWN)
      *p = NEGATIVE;
    release lock;
  }

  return *p == POSITIVE;
}

The program has to call init to get the pointer, so that happens before
any uses of the pointer.  From then on, any threads may call
make_positive or positive_p, however many times they wish, passing them
the pointer returned by init.  init should not be called again.  There
is no other data associated with the electron or the global lock (not
declared here) that guards it.  The enum type is opaque, not visible to
callers, and its alignment and size ensure it will always be loaded and
stored atomically.

This is supposed to implement a state machine in which the spin starts
as UNKNOWN, but once it is determined to be POSITIVE or NEGATIVE, it
won't ever change again.  make_positive will turn UNKNOWN spins into
positive, whereas positive_p will turn them into NEGATIVE.

This is analogous to the code we have in place right now.

The change I'm proposing is an optimization to positive_p, that changes
the following line:

  if (*p != NEGATIVE) {

to:

  if (*p == UNKNOWN) {

because we don't have to take the lock when *p has already advanced to
POSITIVE.

I have two questions for you about this proposed change:

Is it really that hard to see that the change is correct, safe, and a
strict performance improvement?

What is the *minimum* set of atomic loads and stores that we have to
introduce, replacing unadorned loads and stores, to make the whole thing
compliant with our standards, should this be GNU libc code?

>> > One thing to note is that acquire loads synchronize with release stores
>> > (or stronger) on the *same* memory location.  An acquire load does not
>> > synchronize with operations on a lock, unless the acquire load peeks
>> > into the lock and does an acquire load on the lock's state or such.
>> 
>> > Therefore, when you think about which effect an acquire load has,
>> > consider which release store you are actually thinking about.  An
> acquire operation does not have an effect on its own, only in
>> > combination with other effects in the program.  This is also why we want
>> > to document which release store an acquire load is supposed to
>> > synchronize with.
>> 
>> > Thus, which release store are you thinking about in this case?
>> 
>> Nothing but l_tls_offset.

> I haven't seen a release-MO atomic store to l_tls_offset yet.

Of course, there isn't any ATM.  What I'm saying is that l_tls_offset is
the only memory location that's relevant in this situation.

The part of the dynamic loader that loads the module that defines the
TLS variable is init, the part of the dynamic loader that resolves a TLS
relocation in a way that requires the variable to be in static TLS is
make_positive, and the part of tls_get_addr that I'm modifying in the
patch is positive_p.  Any static TLS offset maps to POSITIVE,
FORCED_DYNAMIC maps to NEGATIVE, and NO_TLS_OFFSET maps to UNKNOWN.

> Nor a release fence.

The release fences are the unlock operations after each and every one of
the stores.

>> >> Now, in order for any such access to take place, some relocation applied
>> >> by A must be seen by the observing thread, and if there isn't some
>> >> sequencing event that ensures the dlopen (or initial load) enclosing A
>> >> happens-before the use of the relocation, the whole thing is undefined;
>> >> otherwise, this sequencing event ought to be enough of a memory barrier
>> >> to guarantee the whole thing works.  It's just that the sequencing event
>> >> is not provided by the TLS machinery itself, but rather by the user, in
>> >> sequencing events after the dlopen, by the init code, in sequencing the
>> >> initial loading and relocation before any application code execution, or
>> >> by the thread library, sequencing any thread started by module
>> >> initializers after their relocation.

>> > If that's the situation in the static case

>> The paragraph quoted above applies to both cases.

> Huh?  Why would you then need the double-checked locking at all?

Because it's required to ensure no two threads take it upon themselves
to decide whether l_tls_offset will be static or dynamic.  The paragraph
above, about other data such as the TLS initializer block, applies to
both cases.
  
Torvald Riegel June 7, 2015, 8:39 p.m. UTC | #5
On Fri, 2015-06-05 at 14:43 -0300, Alexandre Oliva wrote:
> On Jun  5, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> >> Is the l_tls_offset field the data you're talking about?
> 
> > Yes, in this case.
> 
> > (There may be other data that is initialized once inside the critical
> > section and then accessed outside.  Accesses to those data can use
> > nonatomic accesses if data like l_tls_offset makes them
> > data-race-free.)
> 
> Ok, this means we can take them out of the analysis and focus on
> l_tls_offset alone.  Let's do that, shall we?

You can't just ignore them because they can put requirements on the
flag.  *Iff* they don't exist, we can relax the synchronization on the
flag.

> 
> Consider this:
> 
> enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };
> 
> void *init () {
>   static enum spin electron;
>   take lock;
>   electron = UNKNOWN;
>   release lock;
>   return &electron;
> }
> 
> bool make_positive (void *x) {
>   enum spin *p = x;
>   take lock;
>   if (*p == UNKNOWN)
>     *p = POSITIVE;
>   release lock;
>   return *p == POSITIVE;
> }
> 
> bool positive_p (void *x) {
>   enum spin *p = x;
>   if (*p != NEGATIVE) {
>     take lock;
>     if (*p == UNKNOWN)
>       *p = NEGATIVE;
>     release lock;
>   }
> 
>   return *p == POSITIVE;
> }
> 
> The program has to call init to get the pointer, so that happens before
> any uses of the pointer.

OK.  init sets the value unconditionally, which seems either
counterproductive, or init is called exactly once and that call happens
before any use by another thread.  If the latter (and that seems to be
the case as you indicate below):
(1) the critical section is not required
(2) we should use an atomic access for clarity of code (because we have
no atomic types), but it can also be considered initialization, in
which case our existing exception applies: we don't currently enforce
atomic accesses for initialization (a brief code comment is useful in
this case nonetheless).

> From then on, any threads may call
> make_positive or positive_p, however many times they wish, passing them
> the pointer returned by init.  init should not be called again.  There
> is no other data associated with the electron or the global lock (not
> declared here) that guards it.

OK.  Then this isn't really double-checked locking, but you're
implementing a CAS with locks.  (More generally, you are implementing
single-memory-location consensus.  See the single-memory-location state
machine comment you make below.)  One could argue that this is a
degenerate double-checked locking pattern, but double-checked locking
is really meant for things like pthread_once where there is other data
that has to be protected with the critical section.  If all one wants is
single-memory-word consensus, then that's what the very basic HW atomic
instructions already provide.
(This is why I asked whether you are just implementing a CAS in one of
my past emails.)

The fact that there is no other data associated with your state machine
must be pointed out in the comments.  It's often not the case that
something is indeed "freestanding", so if it is we want to briefly note
why (using a comment in the code).  We got similar cases wrong in glibc
quite often (e.g., the previously incorrect pthread_once
implementation), so such a code comment tells everyone we checked this
case, and why it's correct.

> The enum type is opaque, not visible to
> callers, and its alignment and size ensure it will always be loaded and
> stored atomically.

Careful here.  The alignment and size of the type may *allow* a compiler
(and our atomic operations) to actually work and make access atomic.
But the compiler is *not required* to do so if plain nonatomic accesses
are used; it could load/store byte-wise, it could reload, or it could
store speculative values.  Unless you use atomic accesses, there's no
guarantee that what looks like a single load or store in source code
actually ends up as exactly one atomic, full-size load or store in the
generated code.

Thus, if you'd add a comment on the type, you should say that it is
compatible (or sufficient for) atomic accesses.

> This is supposed to implement a state machine in which the spin starts
> as UNKNOWN, but once it is determined to be POSITIVE or NEGATIVE, it
> won't ever change again.  make_positive will turn UNKNOWN spins into
> POSITIVE, whereas positive_p will turn them into NEGATIVE.

OK.  So, this establishes consensus for whether POSITIVE or NEGATIVE is
the final outcome.  (And I guess UNKNOWN is never exposed to other
code.)

> This is analogous to the code we have in place right now.
> 
> The change I'm proposing is an optimization to positive_p, that changes
> the following line:
> 
>   if (*p != NEGATIVE) {
> 
> to:
> 
>   if (*p == UNKNOWN) {
> 
> because we don't have to take the lock when *p has already advanced to
> POSITIVE.

OK.  Note that this was the clearest bit of your patch all along.
It's everything around that (and the actual synchronization) that was
problematic / unclear.

> I have two questions for you about this proposed change:
> 
> Is it really that hard to see that the change is correct, safe, and a
> strict performance improvement?

The intent is good.

The synchronization has data races (and also had them before, but that's
no reason to keep doing the incorrect thing).

It is a performance improvement.  Using a CAS instead of a critical
section would make it even faster for the first accesses.
For example, make_positive could do that:

  /* Establish POSITIVE if no consensus yet.  See [...] for why
     relaxed MO is sufficient.  */ 
  enum spin s = atomic_load_relaxed (p);
  while (s == UNKNOWN)
    atomic_compare_exchange_weak_relaxed(p, &s, POSITIVE);

If you want to keep the locks, I suggest mentioning the equivalent CAS
anyway, because it conveys the intent more clearly in this case.
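In portable C11 atomics (rather than glibc's internal atomic_* wrappers), the suggested relaxed-CAS consensus would look roughly like this; the function name and the use of `_Atomic int` for the flag are illustrative assumptions, not the actual glibc code:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };

/* Establish POSITIVE if no consensus has been reached yet; return
   whether the final consensus is POSITIVE.  Relaxed MO suffices when
   no other data depends on ordering against this flag.  */
bool
make_positive_cas (_Atomic int *p)
{
  int s = atomic_load_explicit (p, memory_order_relaxed);
  while (s == UNKNOWN)
    {
      /* On failure (including spurious failure of the weak CAS), s is
         reloaded with the current value, so the loop exits once any
         thread has established POSITIVE or NEGATIVE.  */
      if (atomic_compare_exchange_weak_explicit (p, &s, POSITIVE,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed))
        return true;  /* we established POSITIVE ourselves */
    }
  return s == POSITIVE;
}
```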

> What is the *minimum* set of atomic loads and stores that we have to
> introduce, replacing unadorned loads and stores, to make the whole thing
> compliant with our standards, should this be GNU libc code?

All accesses to *p must use relaxed atomic accesses (the initialization
may be an exception if the change complicates things there (e.g., if
it's a memset that also covers other memory locations), but if it's easy
it's better to use a relaxed-MO atomic access there as well for
clarity).

> >> > One thing to note is that acquire loads synchronize with release stores
> >> > (or stronger) on the *same* memory location.  An acquire load does not
> >> > synchronize with operations on a lock, unless the acquire load peeks
> >> > into the lock and does an acquire load on the lock's state or such.
> >> 
> >> > Therefore, when you think about which effect an acquire load has,
> >> > consider which release store you are actually thinking about.  An
> >> > acquire operation does not have an effect on its own, only in
> >> > combination with other effects in the program.  This is also why we want
> >> > to document which release store an acquire load is supposed to
> >> > synchronize with.
> >> 
> >> > Thus, which release store are you thinking about in this case?
> >> 
> >> Nothing but l_tls_offset.
> 
> > I haven't seen a release-MO atomic store to l_tls_offset yet.
> 
> Of course, there isn't any ATM.  What I'm saying is that l_tls_offset is
> the only memory location that's relevant in this situation.
> 
> The part of the dynamic loader that loads the module that defines the
> TLS variable is init, the part of the dynamic loader that resolves a TLS
> relocation in a way that requires the variable to be in static TLS is
> make_positive, and the part of tls_get_addr that I'm modifying in the
> patch is positive_p.

Just to clarify your previous comments: init() always
happens-before make_positive or positive_p?

> Any static TLS offset maps to POSITIVE,
> FORCED_DYNAMIC maps to NEGATIVE, and NO_TLS_OFFSET maps to UNKNOWN.
> 
> > Nor a release fence.
> 
> The release fences are the unlock operations after each and every one of
> the stores.

Hmm.  This still seems somewhat inconsistent.  You are arguing that you
do not need an acquire load on the consensus implementation above,
because l_tls_offset would stand on its own and program logic doesn't
need ordering dependencies between this consensus and anything else.
Yet you seem to think the release fence is necessary.  Do you think the
release fence is necessary for something, and if so, what is it and
where is the matching acquire?

Also note that an unlock operation on a lock is *not* guaranteed to have
the same effects as an atomic_thread_fence_release.  A compiler can
reorder atomics into a critical section (under certain conditions), but
it can't typically reorder to before an atomic_thread_fence_release
(there are corner cases though where it can merge with other accesses
before the fence).  The hardware may or may not treat unlocks (ie,
release stores on a *particular* memory location) and fences the same.
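The contrast being drawn can be sketched in C11 terms (a minimal illustration with hypothetical names; glibc internals use their own atomic wrappers):

```c
#include <stdatomic.h>

static int data;
static atomic_int flag;

/* Release fence: orders *all* preceding memory operations before any
   atomic store sequenced after the fence, regardless of location.  */
void
publish_with_fence (void)
{
  data = 42;
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&flag, 1, memory_order_relaxed);
}

/* Release store (what a mutex unlock amounts to, on the lock word):
   the ordering guarantee is attached to this one store to 'flag'.  A
   compiler may move later atomic accesses to before the store, "into
   the critical section", which it typically cannot do across a
   release fence.  */
void
publish_with_store (void)
{
  data = 42;
  atomic_store_explicit (&flag, 1, memory_order_release);
}
```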
  
Alexandre Oliva June 9, 2015, 3:01 a.m. UTC | #6
On Jun  7, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> OK.  init sets the value unconditionally, which seems either
> counterproductive, or init is called exactly once and that call happens
> before any use by another thread.

The latter, as I wrote.

> If the latter (and that seems to be
> the case as you indicate below):

And so you noticed.

> (1) the critical section is not required

Yes, it is.

> (2) we should use an atomic access for clarity of code

No opposition to that.

> OK.  Then this isn't really double-checked locking,

Well, you used this term for this pattern back when we discussed
lock-bypassing in stream orientation.  I'm trying hard to overcome my
different background and use your terminology, but this doesn't make it
any easier :-(

> If all one wants is
> single-memory-word consensus,

That's not all we want.  We also want to give the dynamic loader
priority in assigning a module to static TLS, while it's loading a set
of modules.  HW CAS atomic insns don't give us that AFAIK.

Also, we want to make sure we wait till the dynamic loader is done with
defining the TLS variable before we access it.  It might be the case
that some module's initializer recursively dlopens additional modules
(yuck), or starts a thread that attempts to access the variable.  We
want to make those accesses wait for the loader to complete its job
before they get a chance to make dynamic a variable that other modules
being loaded might want to use as static.


> The fact that there is no other data associated with your state machine
> must be pointed out in the comments.

There is all the initialization the dynamic loader performs when the
module that defines the variable is loaded.

> It's often not the case that something is indeed "freestanding", so if
> it is we want to briefly note why (using a comment in the code).

I don't think a random participant of an existing synchronization
pattern is the right place for this sort of comprehensive
synchronization documentation.  It's a cross-cutting concern, and as we
(I?) have learned from computational reflection and aspect-oriented
programming, there's no single right place in the code to document this;
it's a side document that the relevant pieces should reference.

>> The enum type is opaque, not visible to callers, and its alignment
>> and size ensure it will always be loaded and stored atomically.

> Careful here.  The alignment and size of the type may *allow* a compiler
> (and our atomic operations) to actually work and make access atomic.
> But the compiler is *not required* to do so if plain nonatomic accesses
> are used; it could load/store byte-wise, it could reload, or it could
> store speculative values.

*nod*, but not relevant:

- load/store of entire words byte-wise would quickly drive the compiler
  out of existence, or at least out of the marketplace for compilers of
  system libraries to be used in production

- reloads would not be a problem for the pattern used in the second
  version of the patch

- speculative stores that could affect this pattern are not permitted by
  the relevant standards AFAICT

> Unless you use atomic accesses, there's no guarantee that what looks
> like a single load or store in source code actually ends up as exactly
> one atomic, full-size load or store in the generated code.

Alignment, size, and the compiler's interest in generating reasonable
code do.  But we both know you don't agree with that.

> Thus, if you'd add a comment on the type, you should say that it is
> compatible (or sufficient for) atomic accesses.

This is a very pervasive assumption in GNU libc.  The only reason I can
think of to add this to every situation in which it is relevant is to
make the source code tarballs get higher compression rates.

> OK.  Note that this was the clearest bit of your patch all way along.

Well, then, what are we waiting for, considering that this is *all* the
patch does?

> It is a performance improvement.  Using a CAS instead of a critical
> section would make it even faster for the first accesses.

But it wouldn't wait for the dynamic loader to complete the loading or
the relocation.

> If you want to keep the locks, I suggest mentioning the equivalent CAS
> anyway because it's conveys the intent more clearly in this case.

Since Carlos and Siddhesh took this over, I'll leave it for them too.

> Just to clarify on previous comments you make: And init() always
> happens-before make_positive or is_positive?

Yes.

> Hmm.  This still seems somewhat inconsistent.  You are arguing that you
> do not need an acquire load on the consensus implementation above,
> because l_tls_offset would stand on its own and program logic doesn't
> need ordering dependencies between this consensus and anything else.
> Yet you seem to think the release fence is necessary.  Do you think the
> release fence is necessary for something, and if so, what is it and
> where is the matching acquire?

We'd already determined the release fence was needed and taken care of
by other means.

> Also note that an unlock operation on a lock is *not* guaranteed to have
> the same effects as an atomic_thread_fence_release.

*nod*

> The hardware may or may not treat unlocks (ie,
> release stores on a *particular* memory location)

It is my understanding that, per POSIX, unlocks ought to release all
prior stores made by the thread, and locks must acquire all
previously-released stores by any thread.  I.e., they don't apply to
single memory locations.  Locks don't even know what particular memory
locations they guard, if any.  So I can't make sense of what distinction
you intend to point out above.
  
Torvald Riegel June 9, 2015, 12:01 p.m. UTC | #7
On Tue, 2015-06-09 at 00:01 -0300, Alexandre Oliva wrote:
> On Jun  7, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > OK.  init sets the value unconditionally, which seems either
> > counterproductive, or init is called exactly once and that call happens
> > before any use by another thread.
> 
> The latter, as I wrote.
> 
> > If the latter (and that seems to be
> > the case as you indicate below):
> 
> And so you noticed.
> 
> > (1) the critical section is not required
> 
> Yes, it is.

Not for *this*, specifically.  You produced a reduced analogy / pattern
of the synchronization we're discussing here, so I was referring to that
specifically.

> > (2) we should use an atomic access for clarity of code
> 
> No opposition to that.

Good.

> > OK.  Then this isn't really double-checked locking,
> 
> Well, you used this term for this pattern back when we discussed
> lock-bypassing in stream orientation.  I'm trying hard to overcome my
> different background and use your terminology, but this doesn't make it
> any easier :-(

I don't remember all of that case, but I think we discussed whether it
is double-checked locking and whether it is a correct implementation.
If this other case was similar to this one, and we're just interested in
reaching consensus on one variable's value and there's no logical
relationship to something else or that is taken care of through other
means, then this wouldn't have been typical double-checked locking
either.
Sorry if this didn't come across clearly.  I'm trying to make this as
clear as possible.

> > If all one wants is
> > single-memory-word consensus,
> 
> That's not all we want.  We also want to give the dynamic loader
> priority in assigning a module to static TLS, while it's loading a set
> of modules.  HW CAS atomic insns don't give us that AFAIK.

OK.  That would then be a good point to document.  I'm not yet sure I
understand how that is consistent with dlopen being required to happen
before accesses to TLS by other threads.

> Also, we want to make sure we wait till the dynamic loader is done with
> defining the TLS variable before we access it.  It might be the case
> that some module's initializer recursively dlopens additional modules
> (yuck), or start a thread that attempts to access the variable.  We want
> to make those accesses wait for the loader to complete its job before
> they get a chance to make the variable dynamic, that other modules being
> loaded might want to use as static.

OK, also a good point to document.

> 
> > The fact that there is no other data associated with your state machine
> > must be pointed out in the comments.
> 
> There is all the initialization the dynamic loader performs when the
> module that defines the variable is loaded.

I can't follow you here.  You said there are no other dependencies, and
we just want to reach consensus on the final value of l_tls_offset.  So,
from the perspective of just this synchronization state machine you
gave, there's no other data that's relevant.  Now you say there is other
stuff.  What's true?

Are you just trying to point out that there is other initialization but
that other happens-before relationships (e.g., user must synchronize
externally) make this initialization happen-before all non-init()
functions in the state machine?  That would then mean that in this state
machine, the critical section in init() is indeed not required (for this
particular use!).  But above, you said it is.

> > It's often not the case that something is indeed "freestanding", so if
> > it is we want to briefly note why (using a comment in the code).
> 
> I don't think a random participant of an existing synchronization
> pattern is the right place for this sort of comprehensive
> synchronization documentation.

Maybe you misunderstood, so let me rephrase it.  When, such as in this
case, something deviates from what's typical, it's worth pointing this
out.  It deviates from a typical double-checked locking pattern because
you don't have acquire/release pairs.  If you comment in the code that
you need just consensus on the single variable, and there's no other
dependencies, you also clarify it.

> It's a cross-cutting concern, and as we
> (I?) have learned from computational reflection and aspect-oriented
> programming, there's no single right place in the code to document this;
> it's a side document that the relevant pieces should reference.

You can put the code comments somewhere in the code, for example at one
of the functions taking part, or at the decl of the variable, or
somewhere else.  And then reference it.  For example, for the semaphore
or the condvar, I just put the more general overview of the
synchronization scheme on some of the functions, and referenced that
throughout the code of the other functions.  You don't need a separate
document for it.

> >> The enum type is opaque, not visible to callers, and its alignment
> >> and size ensure it will always be loaded and stored atomically.
> 
> > Careful here.  The alignment and size of the type may *allow* a compiler
> > (and our atomic operations) to actually work and make access atomic.
> > But the compiler is *not required* to do so if plain nonatomic accesses
> > are used; it could load/store byte-wise, it could reload, or it could
> > store speculative values.
> 
> *nod*, but not relevant:

No, this is very much relevant.  We do not speculate about compiler
implementations but stick to semantics required by the standards.  If we
need exceptions to that, we document those (e.g., if we rely on a GNU
extension).

I thought we had discussed this sufficiently before, but if you want to
start this topic again, let's do it.

> - load/store of entire words byte-wise would quickly drive the compiler
>   out of existence, or at least out of the marketplace for compilers of
>   system libraries to be used in production

Remember the loads on tile that got mentioned in a previous discussion
we had?

> - reloads would not be a problem for the pattern used in the second
>   version of the patch

How do you know?  You could only argue this way by making assumptions
about other code generation in the compiler, which you haven't done.
And you certainly don't want to document these, which should be a clear
enough indication that you also don't want to reason in detail about
them, nor want anybody reading the code to have to reason about that.

Yeah, one could speculate about what a compiler may or may not do in
this *specific* piece of code.  But that's not the level of abstraction
we want to use as base for our reasoning.

For example, assume the compiler is aware of the code for
"__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
access l_tls_offset in some way (which it doesn't).   You access
l_tls_offset inside and out of a critical section, so by
data-race-freedom, there must not be concurrent write accesses.  So it
does not actually have to reload l_tls_offset inside of the critical
section, but use the value of your initial load.  This correct and
reasonable-for-C11 compiler optimization breaks your synchronization.

> - speculative stores that could affect this pattern are not permitted by
>   the relevant standards AFAICT

The standard permits speculatively storing a value, provided the target
is not volatile and the speculative store does not create a data race.

For example, the compiler can turn
  spin = NEGATIVE;
into
  spin = POSITIVE; spin = NEGATIVE;

You can *speculate* that this is unlikely to happen in this case, but
that's speculation.  The standard allows such things.  There's no simple
way for you to argue that this won't happen without considering what
does or does not happen elsewhere in the program, what variables
adjacent to spin might be stored to as well, and so on.  IOW, you need
to reason about this and a whole lot of other stuff.

Therefore, to allow for local reasoning, stick to what the standard
guarantees.

> > Unless you use atomic accesses, there's no guarantee that what looks
> > like a single load or store in source code actually ends up as exactly
> > one atomic, full-size load or store in the generated code.
> 
> Alignment, size, and the compiler's interest in generating reasonable
> code do.  But we both know you don't agree with that.

No, this is simply not true in general.  You can argue about likelihood
in this *particular* case, but then you're doing just that.

If you think it's unlikely compilers will try to optimize, have a look
at this:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html

> > Thus, if you'd add a comment on the type, you should say that it is
> > compatible (or sufficient for) atomic accesses.
> 
> This is a very pervasive assumption in GNU libc.

It is common, but it sometimes breaks.  For example, on some archs the
publicly exposed semaphore type had weaker alignment than the internal
representation would have required, had we changed the latter to use
the wider atomics offered by the arch.

> The only reason I can
> think of to add this to every situation in which it is relevant is to
> make the source code tarballs get higher compression rates.
> 
> > OK.  Note that this was the clearest bit of your patch all along.
> 
> Well, then, what are we waiting for, considering that this is *all* the
> patch does?

I think I've made it clear what's missing.  There's more to a patch than
the intent behind it.

> > It is a performance improvement.  Using a CAS instead of a critical
> > section would make it even faster for the first accesses.
> 
> But it wouldn't wait for the dynamic loader to complete the loading or
> the relocation.
> 
> > If you want to keep the locks, I suggest mentioning the equivalent CAS
> > anyway, because it conveys the intent more clearly in this case.
> 
> Since Carlos and Siddhesh took this over, I'll leave it for them too.

So you will stop working on this?

If you intend to work on patches affecting synchronization in the
future, please remember the feedback you got in this patch.

> > Just to clarify your previous comments: init() always
> > happens-before make_positive or positive_p?
> 
> Yes.
> 
> > Hmm.  This still seems somewhat inconsistent.  You are arguing that you
> > do not need an acquire load on the consensus implementation above,
> > because l_tls_offset would stand on its own and program logic doesn't
> > need ordering dependencies between this consensus and anything else.
> > Yet you seem to think the release fence is necessary.  Do you think the
> > release fence is necessary for something, and if so, what is it and
> > where is the matching acquire?
> 
> We'd already determined the release fence was needed and taken care of
> by other means.

Huh?

> > Also note that an unlock operation on a lock is *not* guaranteed to have
> > the same effects as an atomic_thread_fence_release.
> 
> *nod*
> 
> > The hardware may or may not treat unlocks (ie,
> > release stores on a *particular* memory location)
> 
> It is my understanding that, per POSIX, unlocks ought to release all
> prior stores made by the thread, and locks must acquire all
> previously-released stores by any thread.  I.e., they don't apply to
> single memory locations.  Locks don't even know what particular memory
> locations they guard, if any.  So I can't make sense of what distinction
> you intend to point out above.

We implement POSIX for users of glibc, but we do not implement on top of
POSIX inside of glibc -- we implement on top of our own code and the C
standard including its memory model.  (And you know that we don't even
implement precisely the POSIX semantics with our POSIX locks.)
In C11, there's a distinction between a release-MO fence and a mutex
unlock operation (i.e., a release-MO store).
Does that answer your question?
  
Alexandre Oliva June 9, 2015, 10:19 p.m. UTC | #8
[I'm dropping Andreas from Cc, since this part of the thread has nothing
to do with the reason I copied him at first]

On Jun  9, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> On Tue, 2015-06-09 at 00:01 -0300, Alexandre Oliva wrote:
>> That's not all we want.  We also want to give the dynamic loader
>> priority in assigning a module to static TLS, while it's loading a set
>> of modules.  HW CAS atomic insns don't give us that AFAIK.

> OK.  That would then be a good point to document.  I'm not yet sure I
> understand how that is consistent with dlopen being required to happen
> before accesses to TLS by other threads.

I think the confusion may arise because of the two roles played by rtld
(s/dlopen/rtld/) in the scene.

One is as the loader of the module that defines the TLS variable.  It's
the one that initializes the TLS initialization block holding the
variable.

Another is as the relocator, that processes relocations referencing the
variable in modules that may have been loaded along with the defining
module (it could be the same module too), or at subsequent dlopen
requests.

When the defining and referencing modules are loaded at once, both
initialization and relocation are performed without releasing the lock.

When the referencing module is dlopened later, the lock is taken again,
and then dlopen might find a relocation that requires a TLS variable to
be in static TLS.  If the variable is already in dynamic TLS, because
some earlier access already made the decision, it returns an error.

We could use CAS to this end both in dlopen and in tls_get_addr, but
then, given concurrent access, we'd not increase the odds of a
successful dlopen.  With the lock, we do so to some extent.  So, it's
not a determining factor, since it is always possible that a
concurrent access beats the subsequent dlopen and places the variable in
dynamic TLS before dlopen grabs the lock or decides with a CAS, but it
shifts the balance slightly so that concurrent tls_get_addr calls give
way to dlopen, however long the dlopen takes.

Now, considering that we're moving towards reducing IE in dlopenable
libs, in favor of TLS Descriptors, maybe it's time to revisit this
balance and go for HW CAS instead to decide between static and dynamic.


Even then, we'd want a lock, it occurs to me now.  The reason is that,
once dlopen determines a variable that must go in static TLS can go in
static TLS, it computes the static TLS offset and proceeds to copy the
TLS initialization block to the corresponding area of the Static TLS
block of each thread, all while holding the rtld lock.

The lock is what currently stops anyone from stepping in between
dlopen's determination that the variable must and can go in static TLS,
and the ultimate setting of the offset to indicate so.  Should we go
CAS, we'd need some intermediate hold value, or dlopen would have to
retract its decision, or somesuch.

Now, for IE access models, this is not a problem: if a thread executes
an IE access, the relocation and initialization happen-before the
execution, or else the execution is unordered and therefore undefined.

As for dynamic access models involving tls_get_addr, this is not
clear-cut: there could be GD relocations processed earlier that are
only exercised concurrently with a dlopen that places the variable in
Static TLS.  If we want dlopen to succeed, concurrent dynamic access
models must wait until the static TLS area is initialized by dlopen.
Taking the lock is how this is accomplished, with the added benefit of
acquiring the static TLS block memory that dlopen initialized and then
released along with the lock.

Now, in order for this to work without a lock, we'd need dlopen to set
l_tls_offset *after* completing the initialization of the static TLS
area, and while releasing it, never before.  Heck, we need this even
with my proposed patch.  However, I don't see that dlopen behaves this
way: it appears to set l_tls_offset first, and then use it to initialize
each thread's static TLS area.  Therefore, my current patch is broken,
and the code in master is correct as far as dynamic access to TLS
variables in static TLS goes: the lock is unfortunately necessary under
the current design.

I withdraw the patch.


> You said there are no other dependencies, and we just want to reach
> consensus on the final value of l_tls_offset.  So, from the
> perspective of just this synchronization state machine you gave,
> there's no other data that's relevant.  Now you say there is other
> stuff.  What's true?

The latter.  When I wrote that, I was convinced we'd covered all the
cases of initialization performed by dlopen, because I had failed to
take into account the cross-thread static TLS area initialization by a
subsequent dlopen that assigns a previously-loaded TLS variable to
static TLS.


It looks like we could still avoid the lock in tls_get_addr by
reorganizing dlopen to choose the offset, initialize and release the
static blocks, and only then CAS the offset, failing if l_tls_offset was
changed along the way.

> Maybe you misunderstood, so let me rephrase it.  When, such as in this
> case, something deviates from what's typical, it's worth pointing this
> out.

I agree.  But does it?  GNU libc has lots of occurrences of this
pattern, so this is hardly a deviation from what's typical.

> You can put the code comments somewhere in the code, for example at one
> of the functions taking part, or at the decl of the variable, or
> somewhere else.  And then reference it.

*nod*

> I thought we had discussed this sufficiently before

We had, but we never reached agreement, and I doubt we ever will on this
point.

> Remember the loads on tile that got mentioned in a previous discussion
> we had?

'fraid I don't.  Reference?

>> - reloads would not be a problem for the pattern used in the second
>> version of the patch

> Yeah, one could speculate about what a compiler may or may not do in
> this *specific* piece of code.  But that's not the level of abstraction
> we want to use as base for our reasoning.

What is clear to me is that our reasonings and thought frameworks are so
different that your personal preferences don't apply to my way of
thinking, and vice versa.  Which is why I'm ok with answering questions
you may have about synchronization patterns I'm familiar with in GNU
libc, but not at all ok with writing the documentation.  From our
discussions so far, I am sure any such documentation I write will be
regarded as useless for you.  Because you and I think in different
terms, different primitives, different abstraction layers.  I'd build
the reasoning from my perspective, and it wouldn't match yours.  And
vice-versa.

> For example, assume the compiler is aware of the code for
> "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
> access l_tls_offset in some way (which it doesn't).   You access
> l_tls_offset inside and out of a critical section, so by
> data-race-freedom, there must not be concurrent write accesses.  So it
> does not actually have to reload l_tls_offset inside of the critical
> section, but use the value of your initial load.  This correct and
> reasonable-for-C11 compiler optimization breaks your synchronization.

Correct and reasonable-for-C11 except for the bold and bald assumption
that a lock operation is not a global acquire.  The compiler is
forbidden from removing the second load because of the existence of the
lock.  Now, that requirement comes from POSIX, not from the C standard.

>> - speculative stores that could affect this pattern are not permitted by
>> the relevant standards AFAICT

> The standard permits to speculatively store a value, if the target is
> not volatile and the speculative store does not create a data race.

*if* the abstract machine would have stored something in the variable to
begin with.  In the case at hand, it wouldn't, so, no speculative
writes.

> No, this is simply not true in general.  You can argue about likelihood
> in this *particular* case, but then you're doing just that.

Indeed, that's just what I'm doing.

> If you think it's unlikely compilers will try to optimize, have a look
> at this:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html

>> Since Carlos and Siddhesh took this over, I'll leave it for them too.

> So you will stop working on this?

Once Carlos wrote he and Siddhesh would take care of the issue, I did.

It doesn't mean I'm no longer willing to share the knowledge I gained
from studying the intricacies of TLS synchronization.

Now that the patch is withdrawn because it is broken, we can even focus
on more productive questions and answers about it.  But this only makes
sense (as opposed to being a waste of time) if there's a commitment to
turn the conversation into proper documentation.  Do we have that from
anyone?

> If you intend to work on patches affecting synchronization in the
> future, please remember the feedback you got in this patch.

I'd rather try to avoid it.  It's just too hard for you and me to
communicate in this field, and I assume it's as frustrating for you as
it is for me :-(

>> We'd already determined the release fence was needed and taken care of
>> by other means.

> Huh?

rtld that loads and defines the variable and then releases the rtld lock
(and thus the memory) happens-before any well-defined use of the
variable (without some happens-before, how could it safely get ahold of
the relocations used in the TLS access model?), even if it doesn't take
the rtld lock, as covered in other messages upthread.  (That's a case
not to be confused with the subsequent dlopen that processes relocations
referencing the variable and assigns it to static TLS, described in this
message.)

>> > Also note that an unlock operation on a lock is *not* guaranteed to have
>> > the same effects as an atomic_thread_fence_release.
>> 
>> *nod*
>> 
>> > The hardware may or may not treat unlocks (ie,
>> > release stores on a *particular* memory location)
>> 
>> It is my understanding that, per POSIX, unlocks ought to release all
>> prior stores made by the thread, and locks must acquire all
>> previously-released stores by any thread.  I.e., they don't apply to
>> single memory locations.  Locks don't even know what particular memory
>> locations they guard, if any.  So I can't make sense of what distinction
>> you intend to point out above.

> We implement POSIX for users of glibc, but we do not implement on top of
> POSIX inside of glibc

Is this documented anywhere?  If we're breaking well-set expectations
WRT locks, we'd better document that.  Though I wouldn't recommend
deviating so much from well-established practice.  It would be a huge
burden for anyone who might consider joining our community.

> (And you know that we don't even
> implement precisely the POSIX semantics with our POSIX locks.)

Yeah, and I do find that unfortunate.

> In C11, there's a distinction between a release-MO fence and a mutex
> unlock operation (i.e., a release-MO store).
> Does that answer your question?

I don't think I asked a question, but no, it doesn't help me understand
what you meant by "unlocks (ie release stores on a particular memory
location)".  In my understanding, unlocks are not about particular
memory locations, so it comes across as a contradiction and your attempt
to answer some unstated underlying question does nothing to clarify it.
  
Torvald Riegel June 15, 2015, 5:26 p.m. UTC | #9
On Tue, 2015-06-09 at 19:19 -0300, Alexandre Oliva wrote:
> On Jun  9, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> > You said there are no other dependencies, and we just want to reach
> > consensus on the final value of l_tls_offset.  So, from the
> > perspective of just this synchronization state machine you gave,
> > there's no other data that's relevant.  Now you say there is other
> > stuff.  What's true?
> 
> The latter.  When I wrote that, I was convinced we'd covered all the
> cases of initialization performed by dlopen because I had failed to
> take into account the subsequent cross-thread static TLS area
> initialization by a subsequent dlopen that assigns a previously-loaded
> TLS variable to static TLS.
> 
> 
> It looks like we could still avoid the lock in tls_get_addr by
> reorganizing dlopen to choose the offset, initialize and release the
> static blocks, and only then CAS the offset, failing if l_tls_offset was
> changed along the way.

Seems so, based on what you wrote.  And then we'd need the CAS to use
release MO and the loads that may read-from that CAS to use acquire MO.

> > Remember the loads on tile that got mentioned in a previous discussion
> > we had?
> 
> 'fraid I don't.  Reference?

I didn't find the email thread after looking for a while.  What I
remember is that tile had operations that are not truly atomic (IIRC,
don't necessarily reload a whole cache line from memory, or some such).

> >> - reloads would not be a problem for the pattern used in the second
> >> version of the patch
> 
> > Yeah, one could speculate about what a compiler may or may not do in
> > this *specific* piece of code.  But that's not the level of abstraction
> > we want to use as base for our reasoning.
> 
> What is clear to me is that our reasonings and thought frameworks are so
> different that your personal preferences don't apply to my way of
> thinking, and vice versa.  Which is why I'm ok with answering questions
> you may have about synchronization patterns I'm familiar with in GNU
> libc, but not at all ok with writing the documentation.  From our
> discussions so far, I am sure any such documentation I write will be
> regarded as useless for you.

I won't say it's useless if you try to convey knowledge you have to
others.  I may disagree how it's done or expressed, or don't understand
what you're trying to say -- but even then I'd consider it input and a
start of something that just needs further iteration.

> Because you and I think in different
> terms, different primitives, different abstraction layers.  I'd build
> the reasoning from my perspective, and it wouldn't match yours.  And
> vice-versa.

I don't think this is a question of my perspective vs. your perspective.
Eventually, we want to have consensus in the project about how we,
collectively, document and reason about concurrency.  I've been arguing
for what I think is the right layers of abstractions, terminology,
memory model, etc., based on my experience in this field.  If that turns
out to be not what works best for the project, that's fine, but we need
to discuss this.  Everyone may have a favourite way of thinking, but the
project still needs a common "concurrency speak".

> > For example, assume the compiler is aware of the code for
> > "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
> > access l_tls_offset in some way (which it doesn't).   You access
> > l_tls_offset inside and out of a critical section, so by
> > data-race-freedom, there must not be concurrent write accesses.  So it
> > does not actually have to reload l_tls_offset inside of the critical
> > section, but use the value of your initial load.  This correct and
> > reasonable-for-C11 compiler optimization breaks your synchronization.
> 
> Correct and reasonable-for-C11 except for the bold and bald assumption
> that a lock operation is not a global acquire.  The compiler is
> forbidden from removing the second load because of the existence of the
> lock.  Now, that requirement comes from POSIX, not from the C standard.

POSIX doesn't guarantee anything about plain memory accesses that do
have data races, right?  This isn't about the lock itself and whether it
"synchronizes memory", but about the data race and what the compiler can
do based on assuming DRF programs.
If the accesses to l_tls_offset were sem_getvalue and sem_post (or
relaxed-MO atomics), the compiler would have to reload because of the
lock acquisition and the acquire-MO load part of it.

> >> - speculative stores that could affect this pattern are not permitted by
> >> the relevant standards AFAICT
> 
> > The standard permits to speculatively store a value, if the target is
> > not volatile and the speculative store does not create a data race.
> 
> *if* the abstract machine would have stored something in the variable to
> begin with.

No, not generally.  The program must behave as if executed by the
abstract machine.  But behavior is defined as all accesses to
volatile-qualified variables and I/O, so the program is allowed to
speculatively store if it doesn't affect the volatile / I/O behavior of
the program.

That's why I said the target must not be volatile.  If adding a
speculative store does not create a data race (and there's different
ways for a compiler to detect that), then there's no other thread that
can observe the speculative store.

Read-only mapped memory isn't considered by the standard, but
compilers may consider it.  Even then, this works at page granularity,
and compilers can detect whether something else in the page might be
stored to.

> In the case at hand, it wouldn't, so, no speculative
> writes.

Just to be clear, this is the code snippet we're talking about:
  if (*p != NEGATIVE) {
    take lock;
    if (*p == UNKNOWN)
      *p = NEGATIVE;
    release lock;
  }

Your program told the compiler that there are no concurrent
modifications of the variable by accessing *p inside and outside of the
critical section.  That is, only we modify it while running this piece
of code.  That means we can ignore the lock, and we end up with this,
conceptually:
// *p must be UNKNOWN OR POSITIVE, due to the first if.
if (*p == UNKNOWN)
  *p = NEGATIVE;

(And assume the compiler doesn't try to support read-only mapped memory,
or it observed that elsewhere, the program writes to the page p is part
of.)
Then, the compiler can speculatively store NEGATIVE to *p, and fix up
afterwards if it was wrong.

> > No, this is simply not true in general.  You can argue about likelihood
> > in this *particular* case, but then you're doing just that.
> 
> Indeed, that's just what I'm doing.

But that's not useful.  First, we want rules that apply and are safe in
general, not rules that require case-by-case reasoning.  Second, we
don't want to depend on something being unlikely -- we want the thing to
always work.

> > If you think it's unlikely compilers will try to optimize, have a look
> > at this:
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
> 
> >> Since Carlos and Siddhesh took this over, I'll leave it for them too.
> 
> > So you will stop working on this?
> 
> Once Carlos wrote he and Siddhesh would take care of the issue, I did.
> 
> It doesn't mean I'm not willing to share the knowledge I got from
> studying the intricacies of TLS synchronization any more.

And I appreciate that.

> Now that the patch is withdrawn because it is broken, we can even focus
> on more productive questions and answers about it.  But this only makes
> sense (as opposed to being a waste of time) if there's a commitment to
> turn the conversation into proper documentation.  Do we have that from
> anyone?

I suppose Carlos and Siddhesh will do that then.  I have no plans to
work on TLS specifically, but will help with checking that the
concurrency bits are sound (including how they get documented).

> > If you intend to work on patches affecting synchronization in the
> > future, please remember the feedback you got in this patch.
> 
> I'd rather try to avoid it.  It's just too hard for you and me to
> communicate in this field, and I assume it's as frustrating for you as
> it is for me :-(
> 
> >> We'd already determined the release fence was needed and taken care of
> >> by other means.
> 
> > Huh?
> 
> rtld that loads and defines the variable and then releases the rtld lock
> (and thus the memory) happens-before any well-defined use of the
> variable (without some happens before, how could it safely get ahold of
> the relocations used in the TLS access model?), even if it doesn't take
> the rtld lock, as covered in other messages upthread.

To avoid confusion, I assume you are referring to such a pattern:
Thread 1:
  data = 1;
  flag = 1;
  unlock;

Thread 2:
  if (flag) foo = data;

POSIX doesn't guarantee anything for non-data-race-free plain memory
accesses, and this has data races.
Even if we'd assume that this isn't the case, and unlock would magically
make all prior modifications become visible globally, there's still no
guarantee that the load of flag happens after the unlock.

You could make Thread 1 do this (and assume that unlock is a release
fence, and all plain memory accesses are indeed atomic):
  data = 1;
  unlock;
  flag = 1;

But then you'd still need something at the observer side, to enforce
that the load of data won't be performed before the load of flag.

If the necessary happens-before is ensured through other sync, then
we're good though.

> (That's a case
> not to be confused with the subsequent dlopen that processes relocations
> referencing the variable and assigns it to static TLS, described in this
> message.)
> 
> >> > Also note that an unlock operation on a lock is *not* guaranteed to have
> >> > the same effects as an atomic_thread_fence_release.
> >> 
> >> *nod*
> >> 
> >> > The hardware may or may not treat unlocks (ie,
> >> > release stores on a *particular* memory location)
> >> 
> >> It is my understanding that, per POSIX, unlocks ought to release all
> >> prior stores made by the thread, and locks must acquire all
> >> previously-released stores by any thread.  I.e., they don't apply to
> >> single memory locations.  Locks don't even know what particular memory
> >> locations they guard, if any.  So I can't make sense of what distinction
> >> you intend to point out above.
> 
> > We implement POSIX for users of glibc, but we do not implement on top of
> > POSIX inside of glibc
> 
> Is this documented anywhere?

https://sourceware.org/glibc/wiki/Concurrency
says that we're using the C11 memory model.

> If we're breaking well-set expectations
> WRT locks, we'd better document that.

First of all, the POSIX semantics are not sufficient for us, because
they don't define atomics and also don't try to give programs with data
races some well-defined meaning (if they tried, they'd fail because it
would break plenty of compiler optimizations that are all used in
practice).

Second, we don't implement the stronger "synchronizes memory" semantics
POSIX seems to want to have.  That's not something new.

> > In C11, there's a distinction between a release-MO fence and a mutex
> > unlock operation (i.e., a release-MO store).
> > Does that answer your question?
> 
> I don't think I asked a question, but no, it doesn't help me understand
> what you meant by "unlocks (ie release stores on a particular memory
> location)".  In my understanding, unlocks are not about particular
> memory locations, so it comes across as a contradiction and your attempt
> to answer some unstated underlying question does nothing to clarify it.

See C11, 5.1.2.4p6:
For example, a call that acquires a mutex will perform an acquire
operation on the locations composing the mutex. Correspondingly, a call
that releases the same mutex will perform a release operation on those
same locations.

Thus, there will be synchronizes-with (and thus happens-before) edges
between a release of a particular mutex instance and a subsequent
acquisition of the *same* mutex instance.  This creates a total order
on each particular mutex instance, but no happens-before is enforced
with unrelated release or acquire operations (e.g., on different mutex
instances, atomics, ...).

This makes sense because it allows hardware and compilers to not have to
enforce global ordering.  And it aligns well with the use cases mutexes
are built for, namely critical sections and mutual exclusion.  Mutexes
do not interact with atomics sequenced before or after the mutex
release/acquire in the same way that a fence would interact with such
atomics.
  
Alexandre Oliva June 22, 2015, 6:39 a.m. UTC | #10
On Jun 15, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> On Tue, 2015-06-09 at 19:19 -0300, Alexandre Oliva wrote:
>> It looks like we could still avoid the lock in tls_get_addr by
>> reorganizing dlopen to choose the offset, initialize and release the
>> static blocks, and only then CAS the offset, failing if l_tls_offset was
>> changed along the way.

> Seems so, based on what you wrote.  And then we'd need the CAS to use
> release MO and the loads that may read-from that CAS to use acquire MO.

*nod*

>> > Remember the loads on tile that got mentioned in a previous discussion
>> > we had?
>> 
>> 'fraid I don't.  Reference?

> I didn't find the email thread after looking for a while.  What I
> remember is that tile had operations that are not truly atomic (IIRC,
> don't necessarily reload a whole cache line from memory, or some such).

Aah, now that rings a bell.  It was in a subthread about what a signal
handler might find when interrupting a memset.

>> Correct and reasonable-for-C11 except for the bold and bald assumption
>> that a lock operation is not a global acquire.  The compiler is
>> forbidden from removing the second load because of the existence of the
>> lock.  Now, that requirement comes from POSIX, not from the C standard.

> POSIX doesn't guarantee anything about plain memory accesses that do
> have data races, right?  This isn't about the lock itself and whether it
> "synchronizes memory", but about the data race and what the compiler can
> do based on assuming DRF programs.

Since POSIX doesn't specify atomics, I'm inclined to think this
reasoning rules out useful existing practice that is at least arguably
valid under POSIX, but...  I see where this is coming from, and I kind
of like the better-defined semantics, so I can see it could make sense
for POSIX to deprecate this practice and embrace atomics.

>> *if* the abstract machine would have stored something in the variable to
>> begin with.

> No, not generally.  The program must behave as-if executed by the
> virtual machine.  But behavior is defined as all accesses to
> volatile-qualified variables and I/O, so the program is allowed to
> speculatively store if it doesn't affect the volatile / I/O behavior of
> the program.

I'm pretty sure I saw wording to the effect of what I wrote above in
some standard or draft a while ago.  That made sense to me: introducing
stores where there weren't any before seems like a great recipe to
introduce races.

>> > No, this is simply not true in general.  You can argue about likelihood
>> > in this *particular* case, but then you're doing just that.
>> 
>> Indeed, that's just what I'm doing.

> But that's not useful.  First, we want rules that apply and are safe in
> general, not rules that require case-by-case reasoning.  Second, we
> don't want to depend on something being unlikely -- we want the thing to
> always work.

It looks like you assume "argue about likelihood" does not encompass the
case of "argue it's never going to happen", although that is what I was
doing.

I like general rules, but sometimes we have to reason on a case-by-case
basis.  I don't oppose taking advantage of situations that only arise in
specific cases to improve the code.

>> Now that the patch is withdrawn because it is broken, we can even focus
>> on more productive questions and answers about it.  But this only makes
>> sense (as opposed to being a waste of time) if there's a commitment to
>> turn the conversation into proper documentation.  Do we have that from
>> anyone?

> I suppose Carlos and Siddhesh will do that then.

I wouldn't jump to that conclusion.  The broader problem is not
related to TLS, but to rtld's holding a lock while running module
finalizers.

>> rtld that loads and defines the variable and then releases the rtld lock
>> (and thus the memory) happens-before any well-defined use of the
>> variable (without some happens before, how could it safely get ahold of
>> the relocations used in the TLS access model?), even if it doesn't take
>> the rtld lock, as covered in other messages upthread.

> To avoid confusion, I assume you are referring to such a pattern:

> Thread 1:
>   data = 1;
>   flag = 1;
>   unlock;

> Thread 2:
>   if (flag) foo = data;

You're missing the happens-before relationship.  I don't see any in your
example, but I explicitly stated there was one, so your example does not
match at all my attempt to summarize what we'd already discussed and agreed
on, namely, rtld's setting up the TLS variable and applying TLS
relocations that reference it, so please refer to that part of the
thread so that we don't go in circles.

> If the necessary happens-before is ensured through other sync, then
> we're good though.

That was the conclusion upthread.  Do you see any reason to revisit it
after rereading it?

>> > We implement POSIX for users of glibc, but we do not implement on top of
>> > POSIX inside of glibc

>> Is this documented anywhere?

> https://sourceware.org/glibc/wiki/Concurrency
> says that we're using the C11 memory model.

Well, that's not exactly a true statement ATM.  We are *aiming* at it,
but there is plenty of code in GNU libc that isn't quite there yet.


> Thus, there will be synchronizes-with (and thus happens-before) edges
> between a release of a particular mutex instance and a subsequent
> acquisition of the *same* mutex instance.   This creates a total order
> on each particular mutex instance, but no happens-before is enforced
> with unrelated release or acquire operations (e.g., on
> different mutex instances, atomics, ...).

... because those don't introduce a synchronization with the release
sequence that precedes releasing that specific mutex, whereas a memory
fence would have more pervasive synchronization effects.  Is this what
you meant with "In C11, there's a distinction between a release-MO fence
and a mutex unlock operation (i.e., a release-MO store)"?  Is this all
you meant?
  
Torvald Riegel June 22, 2015, 8:38 a.m. UTC | #11
On Mon, 2015-06-22 at 03:39 -0300, Alexandre Oliva wrote:
> >> > We implement POSIX for users of glibc, but we do not implement on top of
> >> > POSIX inside of glibc
> 
> >> Is this documented anywhere?
> 
> > https://sourceware.org/glibc/wiki/Concurrency
> > says that we're using the C11 memory model.
> 
> Well, that's not exactly a true statement ATM.  We are *aiming* at it,
> but there are tons of code in GNU libc that aren't quite there yet.

Sure.

> 
> > Thus, there will be synchronizes-with (and thus happens-before) edges
> > between a release of a particular mutex instance and a subsequent
> > acquisition of the *same* mutex instance.   This creates a total order
> > on each particular mutex instance, but no happens-before is enforced
> > with unrelated release or acquire operations (e.g., on
> > different mutex instances, atomics, ...).
> 
> ... because those don't introduce a synchronization with the release
> sequence that precedes releasing that specific mutex, whereas a memory
> fence would have more pervasive synchronization effects.

Fences create synchronizes-with edges in combination with other atomic
accesses sequenced after them (in the case of release fences; sequenced
before them otherwise).  See 7.17.4p1-4 in C11 (N1570 is a draft) for
how that's worded there.

Also, to see this in action, you can try the example below at
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ using a browser that has
javascript enabled:

int main() {
  atomic_int x = 0;
  atomic_int unrelated = 0;
  int y = 0;
  {{{ {
     y = 1;
     // won't work; will have data races:
     // unrelated.store(23, memory_order_release);
     // does work:
     atomic_thread_fence(memory_order_release);
     x.store(1, memory_order_relaxed);
  } ||| {
     r1 = x.load(memory_order_acquire).readsvalue(1);
     r2 = y.readsvalue(1);
  } }}};
  return 0;
}

This tries to access the non-atomic initialization of y safely by using
an atomic flag x.  The readsvalue calls are basically assertions, used
to constrain which executions the cppmem tool looks at.
Using the release store instead of the release fence will result in data
races.  The plot at the bottom right shows the relationships that matter
in the possible executions found by the tool.

> Is this what
> you meant with "In C11, there's a distinction between a release-MO fence
> and a mutex unlock operation (i.e., a release-MO store)"?

Yes, basically.  More specifically, the difference between the fence and
a release store in cases similar to the example above.  And, as a
result, that a mutex unlock does not have the same synchronization
effects as a release fence.
  

Patch

diff --git a/NEWS b/NEWS
index ed02de0..eac100c 100644
--- a/NEWS
+++ b/NEWS
@@ -19,7 +19,7 @@  Version 2.22
   18047, 18049, 18068, 18080, 18093, 18100, 18104, 18110, 18111, 18116,
   18125, 18128, 18138, 18185, 18196, 18197, 18206, 18210, 18211, 18217,
   18220, 18221, 18234, 18244, 18247, 18287, 18319, 18333, 18346, 18397,
-  18409, 18410, 18412, 18418, 18422, 18434, 18444, 18469.
+  18409, 18410, 18412, 18418, 18422, 18434, 18444, 18457, 18469.
 
 * Cache information can be queried via sysconf() function on s390 e.g. with
   _SC_LEVEL1_ICACHE_SIZE as argument.
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 20c7e33..8fc210d 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -755,41 +755,54 @@  tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
       the_map = listp->slotinfo[idx].map;
     }
 
-  /* Make sure that, if a dlopen running in parallel forces the
-     variable into static storage, we'll wait until the address in the
-     static TLS block is set up, and use that.  If we're undecided
-     yet, make sure we make the decision holding the lock as well.  */
-  if (__glibc_unlikely (the_map->l_tls_offset
-			!= FORCED_DYNAMIC_TLS_OFFSET))
+  /* If the TLS block for the map is already assigned to dynamic, or
+     to some static TLS offset, the decision is final, and no lock is
+     required.  Now, if the decision hasn't been made, take the rtld
+     lock, so that an ongoing dlopen gets a chance to complete,
+     possibly assigning the module to static TLS and initializing the
+     corresponding TLS area for all threads, and then retest; if the
+     decision is still pending, force the module to dynamic TLS.
+
+     The risk that the thread accesses an earlier value in that memory
+     location, from before it was recycled into a link map by another
+     thread, is removed by the need for a happens-before relationship
+     between the loader that set the link map up and the TLS access
+     that referenced the module id.  Given such a relationship, the
+     value observed is at least as recent as the initialization; in
+     its absence, calling tls_get_addr with that module id invokes
+     undefined behavior.  */
+  if (__glibc_unlikely (the_map->l_tls_offset == NO_TLS_OFFSET))
     {
       __rtld_lock_lock_recursive (GL(dl_load_lock));
       if (__glibc_likely (the_map->l_tls_offset == NO_TLS_OFFSET))
-	{
-	  the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
-	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-	}
-      else if (__glibc_likely (the_map->l_tls_offset
-			       != FORCED_DYNAMIC_TLS_OFFSET))
-	{
+	the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
+      __rtld_lock_unlock_recursive (GL(dl_load_lock));
+    }
+
+  void *p;
+
+  if (the_map->l_tls_offset != FORCED_DYNAMIC_TLS_OFFSET)
+    {
 #if TLS_TCB_AT_TP
-	  void *p = (char *) THREAD_SELF - the_map->l_tls_offset;
+      p = (char *) THREAD_SELF - the_map->l_tls_offset;
 #elif TLS_DTV_AT_TP
-	  void *p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
+      p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
 #else
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
-	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
-	  dtv[GET_ADDR_MODULE].pointer.is_static = true;
-	  dtv[GET_ADDR_MODULE].pointer.val = p;
 
-	  return (char *) p + GET_ADDR_OFFSET;
-	}
-      else
-	__rtld_lock_unlock_recursive (GL(dl_load_lock));
+      dtv[GET_ADDR_MODULE].pointer.is_static = true;
+      dtv[GET_ADDR_MODULE].pointer.val = p;
+    }
+  else
+    {
+      p = allocate_and_init (the_map);
+      assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
+      /* FIXME: this is AS-Unsafe.  We'd have to atomically set val to
+	 p, if it is still unallocated, or release p and set it to
+	 val.  */
+      dtv[GET_ADDR_MODULE].pointer.val = p;
     }
-  void *p = dtv[GET_ADDR_MODULE].pointer.val = allocate_and_init (the_map);
-  assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
 
   return (char *) p + GET_ADDR_OFFSET;
 }
diff --git a/nptl/Makefile b/nptl/Makefile
index 3dd2944..4ffd13c 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -242,7 +242,7 @@  tests = tst-typesizes \
 	tst-basic7 \
 	tst-kill1 tst-kill2 tst-kill3 tst-kill4 tst-kill5 tst-kill6 \
 	tst-raise1 \
-	tst-join1 tst-join2 tst-join3 tst-join4 tst-join5 tst-join6 \
+	tst-join1 tst-join2 tst-join3 tst-join4 tst-join5 tst-join6 tst-join7 \
 	tst-detach1 \
 	tst-eintr1 tst-eintr2 tst-eintr3 tst-eintr4 tst-eintr5 \
 	tst-tsd1 tst-tsd2 tst-tsd3 tst-tsd4 tst-tsd5 tst-tsd6 \
@@ -320,7 +320,8 @@  endif
 modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
-		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod
+		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
+		tst-join7mod
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) tst-cleanup4aux.o
 test-extras += $(modules-names) tst-cleanup4aux
 test-modules = $(addprefix $(objpfx),$(addsuffix .so,$(modules-names)))
@@ -525,6 +526,11 @@  $(objpfx)tst-tls6.out: tst-tls6.sh $(objpfx)tst-tls5 \
 	$(evaluate-test)
 endif
 
+$(objpfx)tst-join7: $(libdl) $(shared-thread-library)
+$(objpfx)tst-join7.out: $(objpfx)tst-join7mod.so
+$(objpfx)tst-join7mod.so: $(shared-thread-library)
+LDFLAGS-tst-join7mod.so = -Wl,-soname,tst-join7mod.so
+
 $(objpfx)tst-dlsym1: $(libdl) $(shared-thread-library)
 
 $(objpfx)tst-fini1: $(shared-thread-library) $(objpfx)tst-fini1mod.so
diff --git a/nptl/tst-join7.c b/nptl/tst-join7.c
new file mode 100644
index 0000000..bf6fc76
--- /dev/null
+++ b/nptl/tst-join7.c
@@ -0,0 +1,12 @@ 
+#include <dlfcn.h>
+
+int
+do_test (void)
+{
+  void *f = dlopen ("tst-join7mod.so", RTLD_NOW | RTLD_GLOBAL);
+  if (f) dlclose (f); else return 1;
+  return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
diff --git a/nptl/tst-join7mod.c b/nptl/tst-join7mod.c
new file mode 100644
index 0000000..a8c7bc0
--- /dev/null
+++ b/nptl/tst-join7mod.c
@@ -0,0 +1,29 @@ 
+#include <stdio.h>
+#include <pthread.h>
+
+static pthread_t th;
+static int running = 1;
+
+static void *
+test_run (void *p)
+{
+  while (running)
+    fprintf (stderr, "XXX test_run\n");
+  fprintf (stderr, "XXX test_run FINISHED\n");
+  return NULL;
+}
+
+static void __attribute__ ((constructor))
+do_init (void)
+{
+  pthread_create (&th, NULL, test_run, NULL);
+}
+
+static void __attribute__ ((destructor))
+do_end (void)
+{
+  running = 0;
+  fprintf (stderr, "thread_join...\n");
+  pthread_join (th, NULL);
+  fprintf (stderr, "thread_join DONE\n");
+}