[PR18457] Don't require rtld lock to compute DTV addr for static TLS

Message ID orwpzkee1h.fsf@livre.home
State Superseded

Commit Message

Alexandre Oliva June 3, 2015, 8:24 p.m. UTC
  On Jun  3, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> So you think a reload by the compiler would be bad.

Yeah.  It's a double-checked lock entry with sugar on top.  We only need
to take the lock if l_tls_offset is unresolved (NO_TLS_OFFSET), but if
it is resolved, we want different behavior depending on whether it is
FORCED_DYNAMIC_TLS_OFFSET, or something else (static TLS offset).

> This can only be bad if there is concurrent modification, potentially

Concurrent modification is made while holding the lock.

It shouldn't happen in the same thread, at least as long as TLS is
regarded as AS-Unsafe, but other threads could concurrently attempt to
resolve a module's l_tls_offset to a static offset or forced dynamic.

> Thus we need the atomic access

I'm not convinced we do, but I don't mind, and I don't want this to be a
point of contention.

> AFAIU, you seem to speak about memory reuse unrelated to this
> particular load, right?

Yeah, some other earlier use of the same location.

> But that sounds like an issue independently of whether the specific
> load is an acquire or relaxed load.

Not really.  It is a preexisting issue, yes, but an acquire load would
make sure the (re)initialization of the memory into a link map,
performed while holding the lock (and with an atomic write, no less),
would necessarily be observed by the atomic acquire load.  A relaxed
load might still observe the un(re)initialized value.  Right?

> We established before that you want to prevent reload because there are
> concurrent stores.  Are these by other threads?

Answered above.

> If so, are there any cases of the following pattern:

Dude.  Of course not.  None of them use atomics.  So far this has only
used locks to guard changes, and double-checked locks for access.

> storing thread;
>   A;
>   atomic_store_relaxed(&l_tls_offset, ...);

> observing thread:
>   offset = atomic_load_relaxed(&l_tls_offset);
>   B(offset);

> where something in B (which uses or has a dependency on offset) relies
> on happening after A?

Let's rewrite this into something more like what we have now:

  storing thread:
     acquire lock;
     A;
     set l_tls_offset;
     release lock;

  observing thread:
     if l_tls_offset is undecided:
       acquire lock;
       if l_tls_offset is undecided:
         set l_tls_offset to forced_dynamic; // storing thread too!
       release lock;
     assert(l_tls_offset is decided);
     if l_tls_offset is forced_dynamic:
       dynamicB(l_tls_offset)
     else
       staticB(l_tls_offset)

The forced_dynamic case of B(l_tls_offset) will involve at least copying
the TLS init block, which A will have mapped in and relocated.  We don't
take the lock for that copy, so the release after A doesn't guarantee we
see the intended values.  Now, since the area is mmapped and then
relocated, it is very unlikely that any cache would have kept a copy of
the unrelocated block, let alone of any prior mmapping in that range.
So, it is very likely to work, but it's not guaranteed to work.

As for the static TLS case of B(l_tls_offset), the potential for
problems is similar, but not quite the same.  The key difference is that
the initialization of the static TLS block takes place at the storing
thread, rather than in the observing thread, and although
B(l_tls_offset) won't access the thread's static TLS block, the caller
of __tls_get_addr will.  (and so will any IE and LE TLS access)

Now, in order for any such access to take place, some relocation applied
by A must be seen by the observing thread, and if there isn't some
sequencing event that ensures the dlopen (or initial load) enclosing A
happens-before the use of the relocation, the whole thing is undefined;
otherwise, this sequencing event ought to be enough of a memory barrier
to guarantee the whole thing works.  It's just that the sequencing event
is not provided by the TLS machinery itself, but rather by the user, in
sequencing events after the dlopen, by the init code, in sequencing the
initial loading and relocation before any application code execution, or
by the thread library, sequencing any thread started by module
initializers after their relocation.

Which means to me that a relaxed load might turn out to be enough, after
all.

> I'm trying to find out what you know about the intent behind the TLS
> synchronization

FWIW, in this discussion we're touching just a tiny fraction of it, and
one that's particularly trivial compared with other bits :-(

> Does dlopen just have to decide about this value

It does tons of stuff (loading modules and dependencies, applying
relocations, running initializers), and it must have a say first.  E.g.,
if any IE relocation references a module, we must make it static TLS.
Otherwise, dlopen may leave it undecided, and then a later dlopen might
attempt to make it static TLS again (and fail if that's no longer
possible), or an intervening tls_get_addr may see it's undecided and
make the module's TLS dynamic.

> I disagree.  You added an atomic load on the consumer side (rightly
> so!), so you should not ignore the producer side either.  This is in the
> same function, and you touch most of the lines around it, and it's
> confusing if you make a partial change.

You're missing the other cases elsewhere that set this same field.

> Let me point out that we do have consensus in the project that new code
> must be free of data races.

Is a double-checked lock regarded as a race?  I didn't think so.  So
I'm proposing this patch, which reorganizes the function a bit to make
this absolutely clear, so that we can get the fix in and I can move on,
instead of extending the useless part of this conversation any further,
and so that we can focus on the important stuff.

How's this?


We used to store static TLS addrs in the DTV at module load time, but
this required one thread to modify another thread's DTV.  Now that we
defer the DTV update to the first use in the thread, we should do so
without taking the rtld lock if the module is already assigned to static
TLS.  Taking the lock introduces deadlocks where there weren't any
before.

This patch fixes the deadlock caused by tls_get_addr unnecessarily
taking the rtld lock to initialize the DTV entry for tls_dtor_list
within __call_dtors_list, which deadlocks with a dlclose of a module
whose finalizer joins with that thread.  The patch does not, however,
attempt to fix other potential sources of similar deadlocks, such as:
the explicit rtld locks taken by call_dtors_list when the dtor list is
not empty; lazy relocation of the reference to tls_dtor_list when TLS
descriptors are in use; or TLS dtors that call functions through the
PLT when lazy relocation needs to be performed, or that call functions
that interact with the dynamic loader in ways that require its lock to
be taken.

for  ChangeLog

	[PR dynamic-link/18457]
	* elf/dl-tls.c (tls_get_addr_tail): Don't take the rtld lock
	if we already have a final static TLS offset.
	* nptl/tst-join7.c, nptl/tst-join7mod.c: New, from Andreas
	Schwab's bug report.
	* nptl/Makefile (tests): Add tst-join7.
	(module-names): Add tst-join7mod.
	($(objpfx)tst-join7): New.  Add deps.
	($(objpfx)tst-join7.out): Likewise.
	($(objpfx)tst-join7mod.so): Likewise.
	(LDFLAGS-tst-join7mod.so): Set soname.
---
 NEWS                |    2 +-
 elf/dl-tls.c        |   63 +++++++++++++++++++++++++++++++--------------------
 nptl/Makefile       |   10 ++++++--
 nptl/tst-join7.c    |   12 ++++++++++
 nptl/tst-join7mod.c |   29 +++++++++++++++++++++++
 5 files changed, 88 insertions(+), 28 deletions(-)
 create mode 100644 nptl/tst-join7.c
 create mode 100644 nptl/tst-join7mod.c
  

Comments

Torvald Riegel June 4, 2015, 12:11 p.m. UTC | #1
On Wed, 2015-06-03 at 17:24 -0300, Alexandre Oliva wrote:
> On Jun  3, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > So you think a reload by the compiler would be bad.
> 
> Yeah.  It's a double-checked lock entry with sugar on top.  We only need
> to take the lock if l_tls_offset is unresolved (NO_TLS_OFFSET), but if
> it is resolved, we want different behavior depending on whether it is
> FORCED_DYNAMIC_TLS_OFFSET, or something else (static TLS offset).
> 
> > This can only be bad if there is concurrent modification, potentially
> 
> Concurrent modification is made while holding the lock.

Unless *all* accesses happen while holding the lock, there are pairs of
accesses that are not ordered by happens-before; if there is a write in
any of those pairs, you must use atomics to prevent a data race.

(If happens-before is ensured through *other* synchronization (e.g.,
another lock), then this needs to be considered too.)

Applied to a double-checked locking pattern, this means that all data
that is accessed outside the critical section, and is also checked and
modified inside the critical section, must use atomic accesses.

> It shouldn't happen in the same thread, at least as long as TLS is
> regarded as AS-Unsafe, but other threads could concurrently attempt to
> resolve a module's l_tls_offset to a static offset or forced dynamic.

OK.  So we're dealing with inter-thread concurrency here.

> > Thus we need the atomic access
> 
> I'm not convinced we do, but I don't mind, and I don't want this to be a
> point of contention.
> 
> > AFAIU, you seem to speak about memory reuse unrelated to this
> > particular load, right?
> 
> Yeah, some other earlier use of the same location.
> 
> > But that sounds like an issue independently of whether the specific
> > load is an acquire or relaxed load.
> 
> Not really.  It is a preexisting issue, yes, but an acquire load would
> make sure the (re)initialization of the memory into a link map,
> performed while holding the lock (and with an atomic write, no less),
> would necessarily be observed by the atomic acquire load.  A relaxed
> load might still observe the un(re)initialized value.  Right?

I can't follow you here.

One thing to note is that acquire loads synchronize with release stores
(or stronger) on the *same* memory location.  An acquire load does not
synchronize with operations on a lock, unless the acquire load peeks
into the lock and does an acquire load on the lock's state or such.

Therefore, when you think about which effect an acquire load has,
consider which release store you are actually thinking about.  An
acquire operation does not have an effect on its own, only in
combination with other effects in the program.  This is also why we want
to document which release store an acquire load is supposed to
synchronize with.

Thus, which release store are you thinking about in this case?

> > We established before that you want to prevent reload because there are
> > concurrent stores.  Are these by other threads?
> 
> Answered above.
> 
> > If so, are there any cases of the following pattern:
> 
> Dude.  Of course not.  None of them use atomics.  So far this has only
> used locks to guard changes, and double-checked locks for access.

Think conceptually, and consider that atomics don't synchronize with
locks unless the atomics access the lock state themselves.  Thus, the
atomics in there are the atomics inside and outside of the critical
section(s).

> > storing thread;
> >   A;
> >   atomic_store_relaxed(&l_tls_offset, ...);
> 
> > observing thread:
> >   offset = atomic_load_relaxed(&l_tls_offset);
> >   B(offset);
> 
> > where something in B (which uses or has a dependency on offset) relies
> > on happening after A?
> 
> Let's rewrite this into something more like what we have now:
> 
>   storing thread:
>      acquire lock;
>      A;
>      set l_tls_offset;
>      release lock;
> 
>   observing thread:
>      if l_tls_offset is undecided:
>        acquire lock;
>        if l_tls_offset is undecided:
>          set l_tls_offset to forced_dynamic; // storing thread too!
>        release lock;
>      assert(l_tls_offset is decided);
>      if l_tls_offset is forced_dynamic:
>        dynamicB(l_tls_offset)
>      else
>        staticB(l_tls_offset)
> 
> The forced_dynamic case of B(l_tls_offset) will involve at least copying
> the TLS init block, which A will have mapped in and relocated.  We don't
> take the lock for that copy, so the release after A doesn't guarantee we
> see the intended values.  Now, since the area is mmapped and then
> relocated, it is very unlikely that any cache would have kept a copy of
> the unrelocated block, let alone of any prior mmapping in that range.
> So, it is very likely to work, but it's not guaranteed to work.

Try to ignore what you think a particular cache implementation may or
may not be likely to do.  Instead, please think about whether the
program enforces the happens-before relationships it relies on.

So, the effects in A are mmap and relocation of the TLS init block
(which involves stores to the memory locations copied by B).  B needs
both to happen-before itself to work correctly.  A can be executed by a
different thread than the thread executing B.  Correct?

In this case, A needs to happen-before B, and you need a release store
for l_tls_offset and an appropriate acquire load of l_tls_offset on the
observer side.  Those establish the happens-before between A and B.

Alternatively, A needs to (conceptually) include issuing a release fence
after the effects, so that a relaxed store is sufficient on the storing
side.
For mmap specifically, we may be able to assume that all mmap
implementations are a syscall, the kernel does the mmap (i.e., the
effect), and the return from the kernel is always a release fence.
However, then we need to document this for mmap, if we want to rely on
it.
And on the observer side, we'd still need an acquire load, unless we can
put an acquire fence somewhere else.

Note that whether the storing thread has acquired the lock or not
doesn't matter in the analysis of this case.

> As for the static TLS case of B(l_tls_offset), the potential for
> problems is similar, but not quite the same.  The key difference is that
> the initialization of the static TLS block takes place at the storing
> thread, rather than in the observing thread, and although
> B(l_tls_offset) won't access the thread's static TLS block, the caller
> of __tls_get_addr will.  (and so will any IE and LE TLS access)

OK. So let's consider callers to be part of B(l_tls_offset).

> Now, in order for any such access to take place, some relocation applied
> by A must be seen by the observing thread, and if there isn't some
> sequencing event that ensures the dlopen (or initial load) enclosing A
> happens-before the use of the relocation, the whole thing is undefined;
> otherwise, this sequencing event ought to be enough of a memory barrier
> to guarantee the whole thing works.  It's just that the sequencing event
> is not provided by the TLS machinery itself, but rather by the user, in
> sequencing events after the dlopen, by the init code, in sequencing the
> initial loading and relocation before any application code execution, or
> by the thread library, sequencing any thread started by module
> initializers after their relocation.

If that's the situation in the static case, it seems we do not need an
acquire load nor release store because there is either no concurrent
access, naturally, or the user has to ensure happens-before.  We can say
something like this (feel free to fix the TLS terminology):

In the static case, we can assume that the initialization/relocation of
a TLS slot [I mean A] always happens-before the use of this TLS slot:
(1) in case of dlopen, we require the user to ensure that dlopen
happens-before other threads execute the new TLS;
(2) in case of initial loading, this will happen-before any execution of
application code; and
(3) in case of [module initializers case, don't know how to phrase that
properly...]
Therefore, relaxed MO is fine for this load and the store in [the
storing side function].

This documents the assumptions the function is making, and thus what
other things it relies on.  Somebody reading the code of the function
sees the comment, knows why there's just a relaxed MO load, and can
connect it to the rest of the mental model of TLS synchronization.  We
do need to document this to be able to understand how the
synchronization is intended to work, and to be able to cross-check this
with the assumptions and documentation of other related functions.

> Which means to me that a relaxed load might turn out to be enough, after
> all.
> 
> > I'm trying to find out what you know about the intent behind the TLS
> > synchronization
> 
> FWIW, in this discussion we're touching just a tiny fraction of it, and
> one that's particularly trivial compared with other bits :-(
> 
> > Does dlopen just have to decide about this value
> 
> It does tons of stuff (loading modules and dependencies, applying
> relocations, running initializers), and it must have a say first.  E.g.,
> if any IE relocation references a module, we must make it static TLS.
> Otherwise, dlopen may leave it undecided, and then a later dlopen might
> attempt to make it static TLS again (and fail if that's no longer
> possible), or an intervening tls_get_addr may see it's undecided and
> make the module's TLS dynamic.
> 
> > I disagree.  You added an atomic load on the consumer side (rightly
> > so!), so you should not ignore the producer side either.  This is in the
> > same function, and you touch most of the lines around it, and it's
> > confusing if you make a partial change.
> 
> You're missing the other cases elsewhere that set this same field.

What do you mean?  How is it any better if you don't fix it properly in
the functions you have looked at and modified, just because there are
more problems elsewhere?

> > Let me point out that we do have consensus in the project that new code
> > must be free of data races.
> 
> Is a double-check lock regarded as a race?  I didn't think so.

*Correct* double-checked locking isn't.  But correct double-checked
locking doesn't have data races, and this code here does (unless there's
some constraint on execution that I'm not aware of and you haven't told
me about).

Look at your example code above.  "set l_tls_offset;" in the storing
thread can be concurrent with the first access of "l_tls_offset" (or the
ones in the branches after the critical section).  Concurrent here means
not ordered by happens-before (unlike in the static case, because there
we rely on other happens-before, as you point out above).  One of the
concurrent accesses is a store.  This is a data race.

> So, I'm
> proposing this patch, that reorganizes the function a bit to make this
> absolutely clear, so that we can get the fix in and I can move on,
> instead of extending any further the useless part of this conversation,
> so that we can focus on the important stuff.

Adding data races, or not fixing data races when you touch the code, is
not correct -- thus discussing whether you have a data race in your patch
is absolutely not useless.

We agreed on the no-data-races rule for a reason.  Please adhere to this
like you do for other project rules.  If you disagree with the rule,
bring this up again, as a separate topic.  Until that happens, I'd like
to not have to argue over this again and again.

> How's this?

I'll comment on the revised patch.
  
Alexandre Oliva June 5, 2015, 4:39 a.m. UTC | #2
On Jun  4, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> Applied to a double-checked locking pattern, this means that all data
> accessed outside the critical section, and is also checked and modified
> inside the critical section, must use atomic accesses.

Is the l_tls_offset field the data you're talking about?  We've already
determined that there is a happens-before for everything else, and my
understanding is that, for l_tls_offset alone, being the double-checked
lock's key value and the only one that matters for uses not necessarily
covered by the happens-before relationship we've already established,
we have no need for atomics there.

> OK.  So we're dealing with inter-thread concurrency here.

Yes.

>> Not really.  It is a preexisting issue, yes, but an acquire load would
>> make sure the (re)initialization of the memory into a link map,
>> performed while holding the lock (and with an atomic write, no less),
>> would necessarily be observed by the atomic acquire load.  A relaxed
>> load might still observe the un(re)initialized value.  Right?

> I can't follow you here.

> One thing to note is that acquire loads synchronize with release stores
> (or stronger) on the *same* memory location.  An acquire load does not
> synchronize with operations on a lock, unless the acquire load peeks
> into the lock and does an acquire load on the lock's state or such.

> Therefore, when you think about which effect an acquire load has,
> consider which release store you are actually thinking about.  An
>> > acquire operation does not have an effect on its own, only in
> combination with other effects in the program.  This is also why we want
> to document which release store an acquire load is supposed to
> synchronize with.

> Thus, which release store are you thinking about in this case?

Nothing but l_tls_offset.

>> Now, in order for any such access to take place, some relocation applied
>> by A must be seen by the observing thread, and if there isn't some
>> sequencing event that ensures the dlopen (or initial load) enclosing A
>> happens-before the use of the relocation, the whole thing is undefined;
>> otherwise, this sequencing event ought to be enough of a memory barrier
>> to guarantee the whole thing works.  It's just that the sequencing event
>> is not provided by the TLS machinery itself, but rather by the user, in
>> sequencing events after the dlopen, by the init code, in sequencing the
>> initial loading and relocation before any application code execution, or
>> by the thread library, sequencing any thread started by module
>> initializers after their relocation.

> If that's the situation in the static case

The paragraph quoted above applies to both cases.

>> You're missing the other cases elsewhere that set this same field.

> What do you mean?  How is it any better if you don't fix it properly in
> the functions you have looked at and modified, just because there are
> more problems elsewhere?

If you say atomics are only correct if all loads and stores, including
those guarded by locks, are atomic, then adding atomics to only some of
them makes things wrong.

>> Is a double-check lock regarded as a race?  I didn't think so.

> *Correct* double-checked locking isn't.

> "set l_tls_offset;" in the storing thread can be concurrent with the
> first access of "l_tls_offset" (or the ones in the branches after the
> critical section).

It appears to follow from your statement and example above that
*correct* double-checked locking can only be attained using atomics.  Is
that so?  If not, can you give an example of correct double-checked
locking that doesn't use them, and explain why that's different from
what's in my revised patch?

> If you disagree with the rule,

I don't.  Maybe one of us misunderstands it, or we're otherwise failing
to communicate, but I'm honestly trying to avoid data races.  I just
don't know that the unguarded reads in double-checked locks qualify as
data races.
  
Torvald Riegel June 5, 2015, 10:39 a.m. UTC | #3
On Fri, 2015-06-05 at 01:39 -0300, Alexandre Oliva wrote:
> On Jun  4, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > Applied to a double-checked locking pattern, this means that all data
> > accessed outside the critical section, and is also checked and modified
> > inside the critical section, must use atomic accesses.
> 
> Is the l_tls_offset field the data you're talking about?

Yes, in this case.

(There may be other data that is initialized once inside the critical
section and then accessed outside.  Accesses to those data can use
nonatomic accesses if data like l_tls_offset makes them
data-race-free.)

> We've already
> determined that there is a happens-before for everything else, and my
> understanding is that, for l_tls_offset alone, being the double-checked
> lock key value, and the only one that matters for the uses that isn't
> necessarily covered by the happens-before relationship we've already
> established, we have no need for atomics there.

No, that's not true.  If you use a flag like l_tls_offset to indicate
whether some initialization or other job has been performed already,
and try to apply double-checked locking, it has to look like this
(flag==true means init done, flag initially false):

if (atomic_load_acquire(&flag) != true)
  {
     lock();
     if (atomic_load_relaxed(&flag) != true)  // second check, under lock
       {
          init(&data);
          atomic_store_release(&flag, true);
       }
     unlock();
  }
// data ready to use here
use(&data);

This is double-checked locking.  For an alternative version that
combines the lock with the flag, look at how pthread_once does it.  The
atomic accesses to flag are necessary.  If you didn't actually have
concurrent accesses to flag, you wouldn't be doing double-checked
locking and could avoid the lock altogether, unless you need it for
some other, orthogonal reason.

If your code deviates from this, you can't simply say it's
double-checked locking; you need to comment on why the deviation is
correct.

> > OK.  So we're dealing with inter-thread concurrency here.
> 
> Yes.
> 
> >> Not really.  It is a preexisting issue, yes, but an acquire load would
> >> make sure the (re)initialization of the memory into a link map,
> >> performed while holding the lock (and with an atomic write, no less),
> >> would necessarily be observed by the atomic acquire load.  A relaxed
> >> load might still observe the un(re)initialized value.  Right?
> 
> > I can't follow you here.
> 
> > One thing to note is that acquire loads synchronize with release stores
> > (or stronger) on the *same* memory location.  An acquire load does not
> > synchronize with operations on a lock, unless the acquire load peeks
> > into the lock and does an acquire load on the lock's state or such.
> 
> > Therefore, when you think about which effect an acquire load has,
> > consider which release store you are actually thinking about.  An
> > acquire operation does not have an effect on it's own, only in
> > combination with other effects in the program.  This is also why we want
> > to document which release store an acquire load is supposed to
> > synchronize with.
> 
> > Thus, which release store are you thinking about in this case?
> 
> Nothing but l_tls_offset.

I haven't see a release-MO atomic store to l_tls_offset yet.  Nor a
release fence.  Which one do you mean?

> 
> >> Now, in order for any such access to take place, some relocation applied
> >> by A must be seen by the observing thread, and if there isn't some
> >> sequencing event that ensures the dlopen (or initial load) enclosing A
> >> happens-before the use of the relocation, the whole thing is undefined;
> >> otherwise, this sequencing event ought to be enough of a memory barrier
> >> to guarantee the whole thing works.  It's just that the sequencing event
> >> is not provided by the TLS machinery itself, but rather by the user, in
> >> sequencing events after the dlopen, by the init code, in sequencing the
> >> initial loading and relocation before any application code execution, or
> >> by the thread library, sequencing any thread started by module
> >> initializers after their relocation.
> 
> > If that's the situation in the static case
> 
> The paragraph quoted above applies to both cases.

Huh?  Why would you then need the double-checked locking at all?  Unless
I misunderstand you, what you seem to be saying isn't consistent.

> >> You're missing the other cases elsewhere that set this same field.
> 
> > What do you mean?  How is it any better if you don't fix it properly in
> > the functions you have looked at and modified, just because there are
> > more problems elsewhere?
> 
> If you say atomics are only correct if all loads and stores, including
> those guarded by locks, are atomic, then adding atomics to only some of
> them makes things wrong.

Yes, that's the case, strictly speaking.  But you didn't want to tackle
a full conversion of every access to l_tls_offset anywhere --
understandably.  Therefore, I didn't request that.
But that doesn't mean you can ignore what's in the immediate
neighborhood of your change.

> >> Is a double-check lock regarded as a race?  I didn't think so.
> 
> > *Correct* double-checked locking isn't.
> 
> > "set l_tls_offset;" in the storing thread can be concurrent with the
> > first access of "l_tls_offset" (or the ones in the branches after the
> > critical section).
> 
> It appears to follow from your statement and example above that
> *correct* double-checked locking can only be attained using atomics.  Is
> that so?

See above.  Note that init(data) and use(data) can use nonatomic
accesses to data, but only because of the correct double-checked
locking synchronization and the atomic store-release / load-acquire
accesses on the flag.

> If not, can you give an example of correct double-checked
> locking that doesn't use them, and explain why that's different from
> what's in my revised patch?
> 
> > If you disagree with the rule,
> 
> I don't.  Maybe one of us misunderstands it, or we're otherwise failing
> to communicate, but I'm honestly trying to avoid data races.  I just
> don't know that the unguarded reads in double-checked locks qualify as
> data races.

If neither the double-checked locking example above nor the comments
in pthread_once make it clear, please get in touch and I'll try to
explain.
  
Alexandre Oliva June 5, 2015, 5:43 p.m. UTC | #4
On Jun  5, 2015, Torvald Riegel <triegel@redhat.com> wrote:

>> Is the l_tls_offset field the data you're talking about?

> Yes, in this case.

> (There may be other data that is initialized once inside the critical
> section and then accessed outside.  Accesses to those data can use
> nonatomic accesses if data like l_tls_offset makes them data
> data-race-free.)

Ok, this means we can take them out of the analysis and focus on
l_tls_offset alone.  Let's do that, shall we?

Consider this:

enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };

void *init () {
  static enum spin electron;
  take lock;
  electron = UNKNOWN;
  release lock;
  return &electron;
}

bool make_positive (void *x) {
  enum spin *p = x;
  take lock;
  if (*p == UNKNOWN)
    *p = POSITIVE;
  release lock;
  return *p == POSITIVE;
}

bool positive_p (void *x) {
  enum spin *p = x;
  if (*p != NEGATIVE) {
    take lock;
    if (*p == UNKNOWN)
      *p = NEGATIVE;
    release lock;
  }

  return *p == POSITIVE;
}

The program has to call init to get the pointer, so that happens before
any uses of the pointer.  From then on, any threads may call
make_positive or positive_p, however many times they wish, passing them
the pointer returned by init.  init should not be called again.  There
is no other data associated with the electron or the global lock (not
declared here) that guards it.  The enum type is opaque, not visible to
callers, and its alignment and size ensure it will always be loaded and
stored atomically.

This is supposed to implement a state machine in which the spin starts
as UNKNOWN, but once it is determined to be POSITIVE or NEGATIVE, it
won't ever change again.  make_positive will turn UNKNOWN spins into
positive, whereas positive_p will turn them into NEGATIVE.

This is analogous to the code we have in place right now.

The change I'm proposing is an optimization to positive_p, that changes
the following line:

  if (*p != NEGATIVE) {

to:

  if (*p == UNKNOWN) {

because we don't have to take the lock when *p has already advanced to
POSITIVE.

I have two questions for you about this proposed change:

Is it really that hard to see that the change is correct, safe, and a
strict performance improvement?

What is the *minimum* set of atomic loads and stores that we have to
introduce, replacing unadorned loads and stores, to make the whole thing
compliant with our standards, should this be GNU libc code?

>> > One thing to note is that acquire loads synchronize with release stores
>> > (or stronger) on the *same* memory location.  An acquire load does not
>> > synchronize with operations on a lock, unless the acquire load peeks
>> > into the lock and does an acquire load on the lock's state or such.
>> 
>> > Therefore, when you think about which effect an acquire load has,
>> > consider which release store you are actually thinking about.  An
> acquire operation does not have an effect on its own, only in
>> > combination with other effects in the program.  This is also why we want
>> > to document which release store an acquire load is supposed to
>> > synchronize with.
>> 
>> > Thus, which release store are you thinking about in this case?
>> 
>> Nothing but l_tls_offset.

> I haven't seen a release-MO atomic store to l_tls_offset yet.

Of course, there isn't any ATM.  What I'm saying is that l_tls_offset is
the only memory location that's relevant in this situation.

The part of the dynamic loader that loads the module that defines the
TLS variable is init, the part of the dynamic loader that resolves a TLS
relocation in a way that requires the variable to be in static TLS is
make_positive, and the part of tls_get_addr that I'm modifying in the
patch is positive_p.  Any static TLS offset maps to POSITIVE,
FORCED_DYNAMIC maps to NEGATIVE, and NO_TLS_OFFSET maps to UNKNOWN.

> Nor a release fence.

The release fences are the unlock operations after each and every one of
the stores.

>> >> Now, in order for any such access to take place, some relocation applied
>> >> by A must be seen by the observing thread, and if there isn't some
>> >> sequencing event that ensures the dlopen (or initial load) enclosing A
>> >> happens-before the use of the relocation, the whole thing is undefined;
>> >> otherwise, this sequencing event ought to be enough of a memory barrier
>> >> to guarantee the whole thing works.  It's just that the sequencing event
>> >> is not provided by the TLS machinery itself, but rather by the user, in
>> >> sequencing events after the dlopen, by the init code, in sequencing the
>> >> initial loading and relocation before any application code execution, or
>> >> by the thread library, sequencing any thread started by module
>> >> initializers after their relocation.

>> > If that's the situation in the static case

>> The paragraph quoted above applies to both cases.

> Huh?  Why would you then need the double-checked locking at all?

Because it's required to ensure no two threads take it upon themselves
to decide whether l_tls_offset will be static or dynamic.  The paragraph
above, about other data such as the TLS initializer block, applies to
both cases.
  
Torvald Riegel June 7, 2015, 8:39 p.m. UTC | #5
On Fri, 2015-06-05 at 14:43 -0300, Alexandre Oliva wrote:
> On Jun  5, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> >> Is the l_tls_offset field the data you're talking about?
> 
> > Yes, in this case.
> 
> > (There may be other data that is initialized once inside the critical
> > section and then accessed outside.  Accesses to those data can use
> > nonatomic accesses if data like l_tls_offset makes them
> > data-race-free.)
> 
> Ok, this means we can take them out of the analysis and focus on
> l_tls_offset alone.  Let's do that, shall we?

You can't just ignore them because they can put requirements on the
flag.  *Iff* they don't exist, we can relax the synchronization on the
flag.

> 
> Consider this:
> 
> enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };
> 
> void *init () {
>   static enum spin electron;
>   take lock;
>   electron = UNKNOWN;
>   release lock;
>   return &electron;
> }
> 
> bool make_positive (void *x) {
>   enum spin *p = x;
>   take lock;
>   if (*p == UNKNOWN)
>     *p = POSITIVE;
>   release lock;
>   return *p == POSITIVE;
> }
> 
> bool positive_p (void *x) {
>   enum spin *p = x;
>   if (*p != NEGATIVE) {
>     take lock;
>     if (*p == UNKNOWN)
>       *p = NEGATIVE;
>     release lock;
>   }
> 
>   return *p == POSITIVE;
> }
> 
> The program has to call init to get the pointer, so that happens before
> any uses of the pointer.

OK.  init sets the value unconditionally, which seems either
counterproductive, or init is called exactly once and that call happens
before any use by another thread.  If the latter (and that seems to be
the case as you indicate below):
(1) the critical section is not required
(2) we should use an atomic access for clarity of code (because we have
no atomic types), but it can also be considered initialization, in
which case our existing exception applies: we don't currently enforce
atomic accesses for initialization (a brief code comment is useful in
this case nonetheless).

> From then on, any threads may call
> make_positive or positive_p, however many times they wish, passing them
> the pointer returned by init.  init should not be called again.  There
> is no other data associated with the electron or the global lock (not
> declared here) that guards it.

OK.  Then this isn't really double-checked locking, but you're
implementing a CAS with locks.  (More generally, you are implementing
single-memory-location consensus.  See the single-memory-location state
machine comment you make below.)  One could argue that this is a
degenerate double-checked locking pattern, but double-checked locking
is really meant for things like pthread_once where there is other data
that has to be protected with the critical section.  If all one wants is
single-memory-word consensus, then that's what the very basic HW atomic
instructions already provide.
(This is why I asked whether you are just implementing a CAS in one of
my past emails.)

The fact that there is no other data associated with your state machine
must be pointed out in the comments.  It's often not the case that
something is indeed "freestanding", so if it is we want to briefly note
why (using a comment in the code).  We got similar cases wrong in glibc
quite often (e.g., the previously incorrect pthread_once
implementation), so such a code comment tells everyone we checked this
case, and why it's correct.

> The enum type is opaque, not visible to
> callers, and its alignment and size ensure it will always be loaded and
> stored atomically.

Careful here.  The alignment and size of the type may *allow* a compiler
(and our atomic operations) to actually work and make access atomic.
But the compiler is *not required* to do so if plain nonatomic accesses
are used; it could load/store byte-wise, it could reload, or it could
store speculative values.  Unless you use atomic accesses, there's no
guarantee that what looks like a single load or store in source code
actually ends up as exactly one atomic, full-size load or store in the
generated code.

Thus, if you'd add a comment on the type, you should say that it is
compatible (or sufficient for) atomic accesses.

> This is supposed to implement a state machine in which the spin starts
> as UNKNOWN, but once it is determined to be POSITIVE or NEGATIVE, it
> won't ever change again.  make_positive will turn UNKNOWN spins into
> POSITIVE, whereas positive_p will turn them into NEGATIVE.

OK.  So, this establishes consensus for whether POSITIVE or NEGATIVE is
the final outcome.  (And I guess UNKNOWN is never exposed to other
code.)

> This is analogous to the code we have in place right now.
> 
> The change I'm proposing is an optimization to positive_p, that changes
> the following line:
> 
>   if (*p != NEGATIVE) {
> 
> to:
> 
>   if (*p == UNKNOWN) {
> 
> because we don't have to take the lock when *p has already advanced to
> POSITIVE.

OK.  Note that this was the clearest bit of your patch all along.
It's everything around that (and the actual synchronization) that was
problematic / unclear.

> I have two questions for you about this proposed change:
> 
> Is it really that hard to see that the change is correct, safe, and a
> strict performance improvement?

The intent is good.

The synchronization has data races (and also had them before, but that's
no reason to keep doing the incorrect thing).

It is a performance improvement.  Using a CAS instead of a critical
section would make it even faster for the first accesses.
For example, make_positive could do that:

  /* Establish POSITIVE if no consensus yet.  See [...] for why
     relaxed MO is sufficient.  */ 
  enum spin s = atomic_load_relaxed (p);
  while (s == UNKNOWN)
    atomic_compare_exchange_weak_relaxed(p, &s, POSITIVE);

If you want to keep the locks, I suggest mentioning the equivalent CAS
anyway, because it conveys the intent more clearly in this case.
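In portable C11 atomics (rather than glibc's internal atomic_* wrappers), the suggested relaxed-CAS consensus would look roughly like this; the function name and the use of `_Atomic int` for the flag are illustrative assumptions, not the actual glibc code:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum spin { NEGATIVE = -1, UNKNOWN = 0, POSITIVE = 1 };

/* Establish POSITIVE if no consensus has been reached yet; return
   whether the final consensus is POSITIVE.  Relaxed MO suffices when
   no other data depends on ordering against this flag.  */
bool
make_positive_cas (_Atomic int *p)
{
  int s = atomic_load_explicit (p, memory_order_relaxed);
  while (s == UNKNOWN)
    {
      /* On failure (including spurious failure of the weak CAS), s is
         reloaded with the current value, so the loop exits once any
         thread has established POSITIVE or NEGATIVE.  */
      if (atomic_compare_exchange_weak_explicit (p, &s, POSITIVE,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed))
        return true;  /* we established POSITIVE ourselves */
    }
  return s == POSITIVE;
}
```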

> What is the *minimum* set of atomic loads and stores that we have to
> introduce, replacing unadorned loads and stores, to make the whole thing
> compliant with our standards, should this be GNU libc code?

All accesses to *p must use relaxed atomic accesses (the initialization
may be an exception if the change complicates things there (e.g., if
it's a memset that also covers other memory locations), but if it's easy
it's better to use a relaxed-MO atomic access there as well for
clarity).

> >> > One thing to note is that acquire loads synchronize with release stores
> >> > (or stronger) on the *same* memory location.  An acquire load does not
> >> > synchronize with operations on a lock, unless the acquire load peeks
> >> > into the lock and does an acquire load on the lock's state or such.
> >> 
> >> > Therefore, when you think about which effect an acquire load has,
> >> > consider which release store you are actually thinking about.  An
> >> > acquire operation does not have an effect on its own, only in
> >> > combination with other effects in the program.  This is also why we want
> >> > to document which release store an acquire load is supposed to
> >> > synchronize with.
> >> 
> >> > Thus, which release store are you thinking about in this case?
> >> 
> >> Nothing but l_tls_offset.
> 
> > I haven't seen a release-MO atomic store to l_tls_offset yet.
> 
> Of course, there isn't any ATM.  What I'm saying is that l_tls_offset is
> the only memory location that's relevant in this situation.
> 
> The part of the dynamic loader that loads the module that defines the
> TLS variable is init, the part of the dynamic loader that resolves a TLS
> relocation in a way that requires the variable to be in static TLS is
> make_positive, and the part of tls_get_addr that I'm modifying in the
> patch is positive_p.

Just to clarify your previous comments: init() always
happens-before make_positive or positive_p?

> Any static TLS offset maps to POSITIVE,
> FORCED_DYNAMIC maps to NEGATIVE, and NO_TLS_OFFSET maps to UNKNOWN.
> 
> > Nor a release fence.
> 
> The release fences are the unlock operations after each and every one of
> the stores.

Hmm.  This still seems somewhat inconsistent.  You are arguing that you
do not need an acquire load on the consensus implementation above,
because l_tls_offset would stand on its own and program logic doesn't
need ordering dependencies between this consensus and anything else.
Yet you seem to think the release fence is necessary.  Do you think the
release fence is necessary for something, and if so, what is it and
where is the matching acquire?

Also note that an unlock operation on a lock is *not* guaranteed to have
the same effects as an atomic_thread_fence_release.  A compiler can
reorder atomics into a critical section (under certain conditions), but
it can't typically reorder to before an atomic_thread_fence_release
(there are corner cases though where it can merge with other accesses
before the fence).  The hardware may or may not treat unlocks (ie,
release stores on a *particular* memory location) and fences the same.
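The contrast being drawn can be sketched in C11 terms (a minimal illustration with hypothetical names; glibc internals use their own atomic wrappers):

```c
#include <stdatomic.h>

static int data;
static atomic_int flag;

/* Release fence: orders *all* preceding memory operations before any
   atomic store sequenced after the fence, regardless of location.  */
void
publish_with_fence (void)
{
  data = 42;
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&flag, 1, memory_order_relaxed);
}

/* Release store (what a mutex unlock amounts to, on the lock word):
   the ordering guarantee is attached to this one store to 'flag'.  A
   compiler may move later atomic accesses to before the store, "into
   the critical section", which it typically cannot do across a
   release fence.  */
void
publish_with_store (void)
{
  data = 42;
  atomic_store_explicit (&flag, 1, memory_order_release);
}
```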
  
Alexandre Oliva June 9, 2015, 3:01 a.m. UTC | #6
On Jun  7, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> OK.  init sets the value unconditionally, which seems either
> counterproductive, or init is called exactly once and that call happens
> before any use by another thread.

The latter, as I wrote.

> If the latter (and that seems to be
> the case as you indicate below):

And so you noticed.

> (1) the critical section is not required

Yes, it is.

> (2) we should use an atomic access for clarity of code

No opposition to that.

> OK.  Then this isn't really double-checked locking,

Well, you used this term for this pattern back when we discussed
lock-bypassing in stream orientation.  I'm trying hard to overcome my
different background and use your terminology, but this doesn't make it
any easier :-(

> If all one wants is
> single-memory-word consensus,

That's not all we want.  We also want to give the dynamic loader
priority in assigning a module to static TLS, while it's loading a set
of modules.  HW CAS atomic insns don't give us that AFAIK.

Also, we want to make sure we wait till the dynamic loader is done with
defining the TLS variable before we access it.  It might be the case
that some module's initializer recursively dlopens additional modules
(yuck), or starts a thread that attempts to access the variable.  We
want to make those accesses wait for the loader to complete its job
before they get a chance to make dynamic a variable that other modules
being loaded might want to use as static.


> The fact that there is no other data associated with your state machine
> must be pointed out in the comments.

There is all the initialization the dynamic loader performs when the
module that defines the variable is loaded.

> It's often not the case that something is indeed "freestanding", so if
> it is we want to briefly note why (using a comment in the code).

I don't think a random participant of an existing synchronization
pattern is the right place for this sort of comprehensive
synchronization documentation.  It's a cross-cutting concern, and as we
(I?) have learned from computational reflection and aspect-oriented
programming, there's no single right place in the code to document this;
it's a side document that the relevant pieces should reference.

>> The enum type is opaque, not visible to callers, and its alignment
>> and size ensure it will always be loaded and stored atomically.

> Careful here.  The alignment and size of the type may *allow* a compiler
> (and our atomic operations) to actually work and make access atomic.
> But the compiler is *not required* to do so if plain nonatomic accesses
> are used; it could load/store byte-wise, it could reload, or it could
> store speculative values.

*nod*, but not relevant:

- load/store of entire words byte-wise would quickly drive the compiler
  out of existence, or at least out of the marketplace for compilers of
  system libraries to be used in production

- reloads would not be a problem for the pattern used in the second
  version of the patch

- speculative stores that could affect this pattern are not permitted by
  the relevant standards AFAICT

> Unless you use atomic accesses, there's no guarantee that what looks
> like a single load or store in source code actually ends up as exactly
> one atomic, full-size load or store in the generated code.

Alignment, size, and the compiler's interest in generating reasonable
code do.  But we both know you don't agree with that.

> Thus, if you'd add a comment on the type, you should say that it is
> compatible (or sufficient for) atomic accesses.

This is a very pervasive assumption in GNU libc.  The only reason I can
think of to add this to every situation in which it is relevant is to
make the source code tarballs get higher compression rates.

> OK.  Note that this was the clearest bit of your patch all way along.

Well, then, what are we waiting for, considering that this is *all* the
patch does?

> It is a performance improvement.  Using a CAS instead of a critical
> section would make it even faster for the first accesses.

But it wouldn't wait for the dynamic loader to complete the loading or
the relocation.

> If you want to keep the locks, I suggest mentioning the equivalent CAS
> anyway because it's conveys the intent more clearly in this case.

Since Carlos and Siddhesh took this over, I'll leave it for them too.

> Just to clarify on previous comments you make: And init() always
> happens-before make_positive or is_positive?

Yes.

> Hmm.  This still seems somewhat inconsistent.  You are arguing that you
> do not need an acquire load on the consensus implementation above,
> because l_tls_offset would stand on its own and program logic doesn't
> need ordering dependencies between this consensus and anything else.
> Yet you seem to think the release fence is necessary.  Do you think the
> release fence is necessary for something, and if so, what is it and
> where is the matching acquire?

We'd already determined the release fence was needed and taken care of
by other means.

> Also note that an unlock operation on a lock is *not* guaranteed to have
> the same effects as an atomic_thread_fence_release.

*nod*

> The hardware may or may not treat unlocks (ie,
> release stores on a *particular* memory location)

It is my understanding that, per POSIX, unlocks ought to release all
prior stores made by the thread, and locks must acquire all
previously-released stores by any thread.  I.e., they don't apply to
single memory locations.  Locks don't even know what particular memory
locations they guard, if any.  So I can't make sense of what distinction
you intend to point out above.
  
Torvald Riegel June 9, 2015, 12:01 p.m. UTC | #7
On Tue, 2015-06-09 at 00:01 -0300, Alexandre Oliva wrote:
> On Jun  7, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > OK.  init sets the value unconditionally, which seems either
> > counterproductive, or init is called exactly once and that call happens
> > before any use by another thread.
> 
> The latter, as I wrote.
> 
> > If the latter (and that seems to be
> > the case as you indicate below):
> 
> And so you noticed.
> 
> > (1) the critical section is not required
> 
> Yes, it is.

Not for *this*, specifically.  You produced a reduced analogy / pattern
of the synchronization we're discussing here, so I was referring to that
specifically.

> > (2) we should use an atomic access for clarity of code
> 
> No opposition to that.

Good.

> > OK.  Then this isn't really double-checked locking,
> 
> Well, you used this term for this pattern back when we discussed
> lock-bypassing in stream orientation.  I'm trying hard to overcome my
> different background and use your terminology, but this doesn't make it
> any easier :-(

I don't remember all of that case, but I think we discussed whether it
is double-checked locking and whether it is a correct implementation.
If this other case was similar to this one, and we're just interested in
reaching consensus on one variable's value and there's no logical
relationship to something else or that is taken care of through other
means, then this wouldn't have been typical double-checked locking
either.
Sorry if this didn't come across clearly.  I'm trying to make this as
clear as possible.

> > If all one wants is
> > single-memory-word consensus,
> 
> That's not all we want.  We also want to give the dynamic loader
> priority in assigning a module to static TLS, while it's loading a set
> of modules.  HW CAS atomic insns don't give us that AFAIK.

OK.  That would then be a good point to document.  I'm not yet sure I
understand how that is consistent with dlopen being required to happen
before accesses to TLS by other threads.

> Also, we want to make sure we wait till the dynamic loader is done with
> defining the TLS variable before we access it.  It might be the case
> that some module's initializer recursively dlopens additional modules
> (yuck), or start a thread that attempts to access the variable.  We want
> to make those accesses wait for the loader to complete its job before
> they get a chance to make the variable dynamic, that other modules being
> loaded might want to use as static.

OK, also a good point to document.

> 
> > The fact that there is no other data associated with your state machine
> > must be pointed out in the comments.
> 
> There is all the initialization the dynamic loader performs when the
> module that defines the variable is loaded.

I can't follow you here.  You said there are no other dependencies, and
we just want to reach consensus on the final value of l_tls_offset.  So,
from the perspective of just this synchronization state machine you
gave, there's no other data that's relevant.  Now you say there is other
stuff.  What's true?

Are you just trying to point out that there is other initialization but
that other happens-before relationships (e.g., user must synchronize
externally) make this initialization happen-before all non-init()
functions in the state machine?  That would then mean that in this state
machine, the critical section in init() is indeed not required (for this
particular use!).  But above, you said it is.

> > It's often not the case that something is indeed "freestanding", so if
> > it is we want to briefly note why (using a comment in the code).
> 
> I don't think a random participant of an existing synchronization
> pattern is the right place for this sort of comprehensive
> synchronization documentation.

Maybe you misunderstood, so let me rephrase it.  When, such as in this
case, something deviates from what's typical, it's worth pointing this
out.  It deviates from a typical double-checked locking pattern because
you don't have acquire/release pairs.  If you comment in the code that
you need just consensus on the single variable, and there's no other
dependencies, you also clarify it.

> It's a cross-cutting concern, and as we
> (I?) have learned from computational reflection and aspect-oriented
> programming, there's no single right place in the code to document this;
> it's a side document that the relevant pieces should reference.

You can put the code comments somewhere in the code, for example at one
of the functions taking part, or at the decl of the variable, or
somewhere else.  And then reference it.  For example, for the semaphore
or the condvar, I just put the more general overview of the
synchronization scheme on some of the functions, and referenced that
throughout the code of the other functions.  You don't need a separate
document for it.

> >> The enum type is opaque, not visible to callers, and its alignment
> >> and size ensure it will always be loaded and stored atomically.
> 
> > Careful here.  The alignment and size of the type may *allow* a compiler
> > (and our atomic operations) to actually work and make access atomic.
> > But the compiler is *not required* to do so if plain nonatomic accesses
> > are used; it could load/store byte-wise, it could reload, or it could
> > store speculative values.
> 
> *nod*, but not relevant:

No, this is very much relevant.  We do not speculate about compiler
implementations but stick to semantics required by the standards.  If we
need exceptions to that, we document those (e.g., if we rely on a GNU
extension).

I thought we had discussed this sufficiently before, but if you want to
start this topic again, let's do it.

> - load/store of entire words byte-wise would quickly drive the compiler
>   out of existence, or at least out of the marketplace for compilers of
>   system libraries to be used in production

Remember the loads on tile that got mentioned in a previous discussion
we had?

> - reloads would not be a problem for the pattern used in the second
>   version of the patch

How do you know?  You could only argue this way by making assumptions
about other code generation in the compiler, which you haven't done.
And you certainly don't want to document these, which should be a clear
enough indication that you also don't want to reason in detail about
them, nor want anybody reading the code to have to reason about that.

Yeah, one could speculate about what a compiler may or may not do in
this *specific* piece of code.  But that's not the level of abstraction
we want to use as base for our reasoning.

For example, assume the compiler is aware of the code for
"__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
access l_tls_offset in some way (which it doesn't).   You access
l_tls_offset inside and out of a critical section, so by
data-race-freedom, there must not be concurrent write accesses.  So it
does not actually have to reload l_tls_offset inside of the critical
section, but use the value of your initial load.  This correct and
reasonable-for-C11 compiler optimization breaks your synchronization.

> - speculative stores that could affect this pattern are not permitted by
>   the relevant standards AFAICT

The standard permits speculatively storing a value, provided the target
is not volatile and the speculative store does not create a data race.

For example, the compiler can turn
  spin = NEGATIVE;
into
  spin = POSITIVE; spin = NEGATIVE;

You can *speculate* that this is unlikely to happen in this case, but
that's speculation.  The standard allows such things.  There's no simple
way for you to argue that this won't happen without considering what
does or does not happen elsewhere in the program, what variables
adjacent to spin might be stored to as well, and so on.  IOW, you need
to reason about this and a whole lot of other stuff.

Therefore, to allow for local reasoning, stick to what the standard
guarantees.

> > Unless you use atomic accesses, there's no guarantee that what looks
> > like a single load or store in source code actually ends up as exactly
> > one atomic, full-size load or store in the generated code.
> 
> Alignment, size, and the compiler's interest in generating reasonable
> code do.  But we both know you don't agree with that.

No, this is simply not true in general.  You can argue about likelihood
in this *particular* case, but then you're doing just that.

If you think it's unlikely compilers will try to optimize, have a look
at this:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html

> > Thus, if you'd add a comment on the type, you should say that it is
> > compatible (or sufficient for) atomic accesses.
> 
> This is a very pervasive assumption in GNU libc.

It is common, but it sometimes breaks.  For example, on some archs the
publicly exposed semaphore type had weaker alignment than the internal
representation would have required, had we changed the latter to use
the wider atomics offered by the arch.

> The only reason I can
> think of to add this to every situation in which it is relevant is to
> make the source code tarballs get higher compression rates.
> 
> > OK.  Note that this was the clearest bit of your patch all along.
> 
> Well, then, what are we waiting for, considering that this is *all* the
> patch does?

I think I've made it clear what's missing.  There's more to a patch than
the intent behind it.

> > It is a performance improvement.  Using a CAS instead of a critical
> > section would make it even faster for the first accesses.
> 
> But it wouldn't wait for the dynamic loader to complete the loading or
> the relocation.
> 
> > If you want to keep the locks, I suggest mentioning the equivalent CAS
> > anyway, because it conveys the intent more clearly in this case.
> 
> Since Carlos and Siddhesh took this over, I'll leave it for them too.

So you will stop working on this?

If you intend to work on patches affecting synchronization in the
future, please remember the feedback you got in this patch.

> > Just to clarify your previous comments: init() always
> > happens-before make_positive or positive_p?
> 
> Yes.
> 
> > Hmm.  This still seems somewhat inconsistent.  You are arguing that you
> > do not need an acquire load on the consensus implementation above,
> > because l_tls_offset would stand on its own and program logic doesn't
> > need ordering dependencies between this consensus and anything else.
> > Yet you seem to think the release fence is necessary.  Do you think the
> > release fence is necessary for something, and if so, what is it and
> > where is the matching acquire?
> 
> We'd already determined the release fence was needed and taken care of
> by other means.

Huh?

> > Also note that an unlock operation on a lock is *not* guaranteed to have
> > the same effects as an atomic_thread_fence_release.
> 
> *nod*
> 
> > The hardware may or may not treat unlocks (ie,
> > release stores on a *particular* memory location)
> 
> It is my understanding that, per POSIX, unlocks ought to release all
> prior stores made by the thread, and locks must acquire all
> previously-released stores by any thread.  I.e., they don't apply to
> single memory locations.  Locks don't even know what particular memory
> locations they guard, if any.  So I can't make sense of what distinction
> you intend to point out above.

We implement POSIX for users of glibc, but we do not implement on top of
POSIX inside of glibc -- we implement on top of our own code and the C
standard including its memory model.  (And you know that we don't even
implement precisely the POSIX semantics with our POSIX locks.)
In C11, there's a distinction between a release-MO fence and a mutex
unlock operation (i.e., a release-MO store).
Does that answer your question?
  
Alexandre Oliva June 9, 2015, 10:19 p.m. UTC | #8
[I'm dropping Andreas from Cc, since this part of the thread has nothing
to do with the reason I copied him at first]

On Jun  9, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> On Tue, 2015-06-09 at 00:01 -0300, Alexandre Oliva wrote:
>> That's not all we want.  We also want to give the dynamic loader
>> priority in assigning a module to static TLS, while it's loading a set
>> of modules.  HW CAS atomic insns don't give us that AFAIK.

> OK.  That would then be a good point to document.  I'm not yet sure I
> understand how that is consistent with dlopen being required to happen
> before accesses to TLS by other threads.

I think the confusion may arise because of the two roles played by rtld
(s/dlopen/rtld/) in the scene.

One is as the loader of the module that defines the TLS variable.  It's
the one that initializes the TLS initialization block holding the
variable.

Another is as the relocator, that processes relocations referencing the
variable in modules that may have been loaded along with the defining
module (it could be the same module too), or at subsequent dlopen
requests.

When the defining and referencing modules are loaded at once, both
initialization and relocation are performed without releasing the lock.

When the referencing module is dlopened later, the lock is taken again,
and then dlopen might find a relocation that requires a TLS variable to
be in static TLS.  If the variable is already in dynamic TLS, because
some earlier access already made the decision, it returns an error.

We could use CAS to this end both in dlopen and in tls_get_addr, but
then, given concurrent access, we'd not increase the odds of a
successful dlopen.  With the lock, we do so to some extent.  So, it's
not a determining factor, since it is always possible that a
concurrent access beats the subsequent dlopen and places the variable in
dynamic TLS before dlopen grabs the lock or decides with a CAS, but it
shifts the balance slightly so that concurrent tls_get_addr calls give
way to dlopen, however long the dlopen takes.

Now, considering that we're moving towards reducing IE in dlopenable
libs, in favor of TLS Descriptors, maybe it's time to revisit this
balance and go for HW CAS instead to decide between static and dynamic.


Even then, we'd want a lock, it occurs to me now.  The reason is that,
once dlopen determines a variable that must go in static TLS can go in
static TLS, it computes the static TLS offset and proceeds to copy the
TLS initialization block to the corresponding area of the Static TLS
block of each thread, all while holding the rtld lock.

The lock is what currently stops anyone from stepping in between
dlopen's determination that the variable must and can go in static TLS,
and the ultimate setting of the offset to indicate so.  Should we go
CAS, we'd need some intermediate hold value, or dlopen would have to
retract its decision, or somesuch.

Now, for IE access models, this is not a problem: if a thread executes
an IE access, the relocation and initialization happen-before the
execution, or else the execution is unordered and therefore undefined.

As for dynamic access models involving tls_get_addr, this is not
clear-cut: there could be GD relocations processed earlier that are
only exercised concurrently with a dlopen that places the variable in
Static TLS.  If we want dlopen to succeed, concurrent dynamic access
models must wait until the static TLS area is initialized by dlopen.
Taking the lock is how this is accomplished, with the added benefit of
acquiring the static TLS block memory that dlopen initialized and then
released along with the lock.

Now, in order for this to work without a lock, we'd need dlopen to set
l_tls_offset *after* completing the initialization of the static TLS
area, and while releasing it, never before.  Heck, we need this even
with my proposed patch.  However, I don't see that dlopen behaves this
way: it appears to set l_tls_offset first, and then use it to initialize
each thread's static TLS area.  Therefore, my current patch is broken,
and the code in master is correct as far as dynamic access to TLS
variables in static TLS goes: the lock is unfortunately necessary under
the current design.

I withdraw the patch.


> You said there are no other dependencies, and we just want to reach
> consensus on the final value of l_tls_offset.  So, from the
> perspective of just this synchronization state machine you gave,
> there's no other data that's relevant.  Now you say there is other
> stuff.  What's true?

The latter.  When I wrote that, I was convinced we'd covered all the
cases of initialization performed by dlopen, because I had failed to
take into account the cross-thread static TLS area initialization by a
subsequent dlopen that assigns a previously-loaded TLS variable to
static TLS.


It looks like we could still avoid the lock in tls_get_addr by
reorganizing dlopen to choose the offset, initialize and release the
static blocks, and only then CAS the offset, failing if l_tls_offset was
changed along the way.

> Maybe you misunderstood, so let me rephrase it.  When, such as in this
> case, something deviates from what's typical, it's worth pointing this
> out.

I agree.  But does it?  GNU libc has lots of occurrences of this
pattern, so this is hardly a deviation from what's typical.

> You can put the code comments somewhere in the code, for example at one
> of the functions taking part, or at the decl of the variable, or
> somewhere else.  And then reference it.

*nod*

> I thought we had discussed this sufficiently before

We had, but we never reached agreement, and I doubt we ever will on this
point.

> Remember the loads on tile that got mentioned in a previous discussion
> we had?

'fraid I don't.  Reference?

>> - reloads would not be a problem for the pattern used in the second
>> version of the patch

> Yeah, one could speculate about what a compiler may or may not do in
> this *specific* piece of code.  But that's not the level of abstraction
> we want to use as base for our reasoning.

What is clear to me is that our reasonings and thought frameworks are so
different that your personal preferences don't apply to my way of
thinking, and vice versa.  Which is why I'm ok with answering questions
you may have about synchronization patterns I'm familiar with in GNU
libc, but not at all ok with writing the documentation.  From our
discussions so far, I am sure any such documentation I write will be
regarded as useless for you.  Because you and I think in different
terms, different primitives, different abstraction layers.  I'd build
the reasoning from my perspective, and it wouldn't match yours.  And
vice-versa.

> For example, assume the compiler is aware of the code for
> "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
> access l_tls_offset in some way (which it doesn't).   You access
> l_tls_offset inside and out of a critical section, so by
> data-race-freedom, there must not be concurrent write accesses.  So it
> does not actually have to reload l_tls_offset inside of the critical
> section, but use the value of your initial load.  This correct and
> reasonable-for-C11 compiler optimization breaks your synchronization.

Correct and reasonable-for-C11 except for the bold and bald assumption
that a lock operation is not a global acquire.  The compiler is
forbidden from removing the second load because of the existence of the
lock.  Now, that requirement comes from POSIX, not from the C standard.

>> - speculative stores that could affect this pattern are not permitted by
>> the relevant standards AFAICT

> The standard permits to speculatively store a value, if the target is
> not volatile and the speculative store does not create a data race.

*if* the abstract machine would have stored something in the variable to
begin with.  In the case at hand, it wouldn't, so, no speculative
writes.

> No, this is simply not true in general.  You can argue about likelihood
> in this *particular* case, but then you're doing just that.

Indeed, that's just what I'm doing.

> If you think it's unlikely compilers will try to optimize, have a look
> at this:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html

>> Since Carlos and Siddhesh took this over, I'll leave it for them too.

> So you will stop working on this?

Once Carlos wrote he and Siddhesh would take care of the issue, I did.

It doesn't mean I'm no longer willing to share the knowledge I gained
from studying the intricacies of TLS synchronization.

Now that the patch is withdrawn because it is broken, we can even focus
on more productive questions and answers about it.  But this only makes
sense (as opposed to being a waste of time) if there's a commitment to
turn the conversation into proper documentation.  Do we have that from
anyone?

> If you intend to work on patches affecting synchronization in the
> future, please remember the feedback you got in this patch.

I'd rather try to avoid it.  It's just too hard for you and me to
communicate in this field, and I assume it's as frustrating for you as
it is for me :-(

>> We'd already determined the release fence was needed and taken care of
>> by other means.

> Huh?

rtld that loads and defines the variable and then releases the rtld lock
(and thus the memory) happens-before any well-defined use of the
variable (without some happens-before, how could it safely get ahold of
the relocations used in the TLS access model?), even if it doesn't take
the rtld lock, as covered in other messages upthread.  (That's a case
not to be confused with the subsequent dlopen that processes relocations
referencing the variable and assigns it to static TLS, described in this
message.)

>> > Also note that an unlock operation on a lock is *not* guaranteed to have
>> > the same effects as an atomic_thread_fence_release.
>> 
>> *nod*
>> 
>> > The hardware may or may not treat unlocks (ie,
>> > release stores on a *particular* memory location)
>> 
>> It is my understanding that, per POSIX, unlocks ought to release all
>> prior stores made by the thread, and locks must acquire all
>> previously-released stores by any thread.  I.e., they don't apply to
>> single memory locations.  Locks don't even know what particular memory
>> locations they guard, if any.  So I can't make sense of what distinction
>> you intend to point out above.

> We implement POSIX for users of glibc, but we do not implement on top of
> POSIX inside of glibc

Is this documented anywhere?  If we're breaking well-set expectations
WRT locks, we'd better document that.  Though I wouldn't recommend
deviating so much from well-established practice.  It would be a huge
burden for anyone who might consider joining our community.

> (And you know that we don't even
> implement precisely the POSIX semantics with our POSIX locks.)

Yeah, and I do find that unfortunate.

> In C11, there's a distinction between a release-MO fence and a mutex
> unlock operation (i.e., a release-MO store).
> Does that answer your question?

I don't think I asked a question, but no, it doesn't help me understand
what you meant by "unlocks (ie release stores on a particular memory
location)".  In my understanding, unlocks are not about particular
memory locations, so it comes across as a contradiction and your attempt
to answer some unstated underlying question does nothing to clarify it.
  
Torvald Riegel June 15, 2015, 5:26 p.m. UTC | #9
On Tue, 2015-06-09 at 19:19 -0300, Alexandre Oliva wrote:
> On Jun  9, 2015, Torvald Riegel <triegel@redhat.com> wrote:
> > You said there are no other dependencies, and we just want to reach
> > consensus on the final value of l_tls_offset.  So, from the
> > perspective of just this synchronization state machine you gave,
> > there's no other data that's relevant.  Now you say there is other
> > stuff.  What's true?
> 
> The latter.  When I wrote that, I was convinced we'd covered all the
> cases of initialization performed by dlopen because I had failed to
> take into account the subsequent cross-thread static TLS area
> initialization by a subsequent dlopen that assigns a previously-loaded
> TLS variable to static TLS.
> 
> 
> It looks like we could still avoid the lock in tls_get_addr by
> reorganizing dlopen to choose the offset, initialize and release the
> static blocks, and only then CAS the offset, failing if l_tls_offset was
> changed along the way.

Seems so, based on what you wrote.  And then we'd need the CAS to use
release MO and the loads that may read-from that CAS to use acquire MO.

> > Remember the loads on tile that got mentioned in a previous discussion
> > we had?
> 
> 'fraid I don't.  Reference?

I didn't find the email thread after looking for a while.  What I
remember is that tile had operations that are not truly atomic (IIRC,
don't necessarily reload a whole cache line from memory, or some such).

> >> - reloads would not be a problem for the pattern used in the second
> >> version of the patch
> 
> > Yeah, one could speculate about what a compiler may or may not do in
> > this *specific* piece of code.  But that's not the level of abstraction
> > we want to use as base for our reasoning.
> 
> What is clear to me is that our reasonings and thought frameworks are so
> different that your personal preferences don't apply to my way of
> thinking, and vice versa.  Which is why I'm ok with answering questions
> you may have about synchronization patterns I'm familiar with in GNU
> libc, but not at all ok with writing the documentation.  From our
> discussions so far, I am sure any such documentation I write will be
> regarded as useless for you.

I won't say it's useless if you try to convey knowledge you have to
others.  I may disagree how it's done or expressed, or don't understand
what you're trying to say -- but even then I'd consider it input and a
start of something that just needs further iteration.

> Because you and I think in different
> terms, different primitives, different abstraction layers.  I'd build
> the reasoning from my perspective, and it wouldn't match yours.  And
> vice-versa.

I don't think this is a question of my perspective vs. your perspective.
Eventually, we want to have consensus in the project about how we,
collectively, document and reason about concurrency.  I've been arguing
for what I think is the right layers of abstractions, terminology,
memory model, etc., based on my experience in this field.  If that turns
out to be not what works best for the project, that's fine, but we need
to discuss this.  Everyone may have a favourite way of thinking, but the
project still needs a common "concurrency speak".

> > For example, assume the compiler is aware of the code for
> > "__rtld_lock_lock_recursive (GL(dl_load_lock))" and knows that it won't
> > access l_tls_offset in some way (which it doesn't).   You access
> > l_tls_offset inside and out of a critical section, so by
> > data-race-freedom, there must not be concurrent write accesses.  So it
> > does not actually have to reload l_tls_offset inside of the critical
> > section, but use the value of your initial load.  This correct and
> > reasonable-for-C11 compiler optimization breaks your synchronization.
> 
> Correct and reasonable-for-C11 except for the bold and bald assumption
> that a lock operation is not a global acquire.  The compiler is
> forbidden from removing the second load because of the existence of the
> lock.  Now, that requirement comes from POSIX, not from the C standard.

POSIX doesn't guarantee anything about plain memory accesses that do
have data races, right?  This isn't about the lock itself and whether it
"synchronizes memory", but about the data race and what the compiler can
do based on assuming DRF programs.
If the accesses to l_tls_offset were sem_getvalue and sem_post (or
relaxed-MO atomics), the compiler would have to reload because of the
lock acquisition and the acquire-MO load part of it.

> >> - speculative stores that could affect this pattern are not permitted by
> >> the relevant standards AFAICT
> 
> > The standard permits to speculatively store a value, if the target is
> > not volatile and the speculative store does not create a data race.
> 
> *if* the abstract machine would have stored something in the variable to
> begin with.

No, not generally.  The program must behave as if executed by the
abstract machine.  But behavior is defined as all accesses to
volatile-qualified variables and I/O, so the program is allowed to
speculatively store if it doesn't affect the volatile / I/O behavior of
the program.

That's why I said the target must not be volatile.  If adding a
speculative store does not create a data race (and there's different
ways for a compiler to detect that), then there's no other thread that
can observe the speculative store.

Read-only mapped memory isn't considered by the standard, but
compilers may consider it.  Even then, this works at page granularity,
and compilers can detect whether something else in the page might be
stored to.

> In the case at hand, it wouldn't, so, no speculative
> writes.

Just to be clear, this is the code snippet we're talking about:
  if (*p != NEGATIVE) {
    take lock;
    if (*p == UNKNOWN)
      *p = NEGATIVE;
    release lock;
  }

Your program told the compiler that there are no concurrent
modifications of the variable by accessing *p inside and outside of the
critical section.  That is, only we modify it while running this piece
of code.  That means we can ignore the lock, and we end up with this,
conceptually:
// *p must be UNKNOWN OR POSITIVE, due to the first if.
if (*p == UNKNOWN)
  *p = NEGATIVE;

(And assume the compiler doesn't try to support read-only mapped memory,
or it observed that elsewhere, the program writes to the page p is part
of.)
Then, the compiler can speculatively store NEGATIVE to *p, and fix up
afterwards if it was wrong.

> > No, this is simply not true in general.  You can argue about likelihood
> > in this *particular* case, but then you're doing just that.
> 
> Indeed, that's just what I'm doing.

But that's not useful.  First, we want rules that apply and are safe in
general, not rules that require case-by-case reasoning.  Second, we
don't want to depend on something being unlikely -- we want the thing to
always work.

> > If you think it's unlikely compilers will try to optimize, have a look
> > at this:
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
> 
> >> Since Carlos and Siddhesh took this over, I'll leave it for them too.
> 
> > So you will stop working on this?
> 
> Once Carlos wrote he and Siddhesh would take care of the issue, I did.
> 
> It doesn't mean I'm not willing to share the knowledge I got from
> studying the intricacies of TLS synchronization any more.

And I appreciate that.

> Now that the patch is withdrawn because it is broken, we can even focus
> on more productive questions and answers about it.  But this only makes
> sense (as opposed to being a waste of time) if there's a commitment to
> turn the conversation into proper documentation.  Do we have that from
> anyone?

I suppose Carlos and Siddhesh will do that then.  I have no plans to
work on TLS specifically, but will help with checking that the
concurrency bits are sound (including how they get documented).

> > If you intend to work on patches affecting synchronization in the
> > future, please remember the feedback you got in this patch.
> 
> I'd rather try to avoid it.  It's just too hard for you and me to
> communicate in this field, and I assume it's as frustrating for you as
> it is for me :-(
> 
> >> We'd already determined the release fence was needed and taken care of
> >> by other means.
> 
> > Huh?
> 
> rtld that loads and defines the variable and then releases the rtld lock
> (and thus the memory) happens-before any well-defined use of the
> variable (without some happens before, how could it safely get ahold of
> the relocations used in the TLS access model?), even if it doesn't take
> the rtld lock, as covered in other messages upthread.

To avoid confusion, I assume you are referring to such a pattern:
Thread 1:
  data = 1;
  flag = 1;
  unlock;

Thread 2:
  if (flag) foo = data;

POSIX doesn't guarantee anything for non-data-race-free plain memory
accesses, and this has data races.
Even if we'd assume that this isn't the case, and unlock would magically
make all prior modifications become visible globally, there's still no
guarantee that the load of flag happens after the unlock.

You could make Thread 1 do this (and assume that unlock is a release
fence, and all plain memory accesses are indeed atomic):
  data = 1;
  unlock;
  flag = 1;

But then you'd still need something at the observer side, to enforce
that the load of data won't be performed before the load of flag.

If the necessary happens-before is ensured through other sync, then
we're good though.

> (That's a case
> not to be confused with the subsequent dlopen that processes relocations
> referencing the variable and assigns it to static TLS, described in this
> message.)
> 
> >> > Also note that an unlock operation on a lock is *not* guaranteed to have
> >> > the same effects as an atomic_thread_fence_release.
> >> 
> >> *nod*
> >> 
> >> > The hardware may or may not treat unlocks (ie,
> >> > release stores on a *particular* memory location)
> >> 
> >> It is my understanding that, per POSIX, unlocks ought to release all
> >> prior stores made by the thread, and locks must acquire all
> >> previously-released stores by any thread.  I.e., they don't apply to
> >> single memory locations.  Locks don't even know what particular memory
> >> locations they guard, if any.  So I can't make sense of what distinction
> >> you intend to point out above.
> 
> > We implement POSIX for users of glibc, but we do not implement on top of
> > POSIX inside of glibc
> 
> Is this documented anywhere?

https://sourceware.org/glibc/wiki/Concurrency
says that we're using the C11 memory model.

> If we're breaking well-set expectations
> WRT locks, we'd better document that.

First of all, the POSIX semantics are not sufficient for us, because
they don't define atomics and also don't try to give programs with data
races some well-defined meaning (if they tried, they'd fail because it
would break plenty of compiler optimizations that are all used in
practice).

Second, we don't implement the stronger "synchronizes memory" semantics
POSIX seems to want to have.  That's not something new.

> > In C11, there's a distinction between a release-MO fence and a mutex
> > unlock operation (i.e., a release-MO store).
> > Does that answer your question?
> 
> I don't think I asked a question, but no, it doesn't help me understand
> what you meant by "unlocks (ie release stores on a particular memory
> location)".  In my understanding, unlocks are not about particular
> memory locations, so it comes across as a contradiction and your attempt
> to answer some unstated underlying question does nothing to clarify it.

See C11, 5.1.2.4p6:
For example, a call that acquires a mutex will perform an acquire
operation on the locations composing the mutex. Correspondingly, a call
that releases the same mutex will perform a release operation on those
same locations.

Thus, there will be synchronizes-with (and thus happens-before) edges
between a release of a particular mutex instance and a subsequent
acquisition of the *same* mutex instance.  This creates a total order
on each particular mutex instance, but no happens-before is enforced
with unrelated release or acquire operations (e.g., on different mutex
instances, atomics, ...).

This makes sense because it allows hardware and compilers to not have to
enforce global ordering.  And it aligns well with the use cases mutexes
are built for, namely critical sections and mutual exclusion.  Mutexes
do not interact with atomics sequenced before or after the mutex
release/acquire in the same way that a fence would interact with such
atomics.
  
Alexandre Oliva June 22, 2015, 6:39 a.m. UTC | #10
On Jun 15, 2015, Torvald Riegel <triegel@redhat.com> wrote:

> On Tue, 2015-06-09 at 19:19 -0300, Alexandre Oliva wrote:
>> It looks like we could still avoid the lock in tls_get_addr by
>> reorganizing dlopen to choose the offset, initialize and release the
>> static blocks, and only then CAS the offset, failing if l_tls_offset was
>> changed along the way.

> Seems so, based on what you wrote.  And then we'd need the CAS to use
> release MO and the loads that may read-from that CAS to use acquire MO.

*nod*

>> > Remember the loads on tile that got mentioned in a previous discussion
>> > we had?
>> 
>> 'fraid I don't.  Reference?

> I didn't find the email thread after looking for a while.  What I
> remember is that tile had operations that are not truly atomic (IIRC,
> don't necessarily reload a whole cache line from memory, or some such).

Aah, now that rings a bell.  It was in a subthread about what a signal
handler might find when interrupting a memset.

>> Correct and reasonable-for-C11 except for the bold and bald assumption
>> that a lock operation is not a global acquire.  The compiler is
>> forbidden from removing the second load because of the existence of the
>> lock.  Now, that requirement comes from POSIX, not from the C standard.

> POSIX doesn't guarantee anything about plain memory accesses that do
> have data races, right?  This isn't about the lock itself and whether it
> "synchronizes memory", but about the data race and what the compiler can
> do based on assuming DRF programs.

Since POSIX doesn't specify atomics, I'm inclined to think this
reasoning rules out useful existing practice that is at least arguably
valid under POSIX, but...  I see where this is coming from, and I kind
of like the better-defined semantics, so I can see it could make sense
for POSIX to deprecate this practice and embrace atomics.

>> *if* the abstract machine would have stored something in the variable to
>> begin with.

> No, not generally.  The program must behave as-if executed by the
> virtual machine.  But behavior is defined as all accesses to
> volatile-qualified variables and I/O, so the program is allowed to
> speculatively store if it doesn't affect the volatile / I/O behavior of
> the program.

I'm pretty sure I saw wording to the effect of what I wrote above in
some standard or draft a while ago.  That made sense to me: introducing
stores where there weren't any before seems like a great recipe to
introduce races.

>> > No, this is simply not true in general.  You can argue about likelihood
>> > in this *particular* case, but then you're doing just that.
>> 
>> Indeed, that's just what I'm doing.

> But that's not useful.  First, we want rules that apply and are safe in
> general, not rules that require case-by-case reasoning.  Second, we
> don't want to depend on something being unlikely -- we want the thing to
> always work.

It looks like you assume "argue about likelihood" does not encompass the
case of "argue it's never going to happen", although that is what I was
doing.

I like general rules, but sometimes we have to reason on a case-by-case
basis.  I don't oppose taking advantage of situations that only arise in
specific cases to improve the code.

>> Now that the patch is withdrawn because it is broken, we can even focus
>> on more productive questions and answers about it.  But this only makes
>> sense (as opposed to being a waste of time) if there's a commitment to
>> turn the conversation into proper documentation.  Do we have that from
>> anyone?

> I suppose Carlos and Siddhesh will do that then.

I wouldn't jump to that conclusion.  The broader problem is not
related to TLS, but to rtld's holding a lock while running module
finalizers.

>> rtld that loads and defines the variable and then releases the rtld lock
>> (and thus the memory) happens-before any well-defined use of the
>> variable (without some happens before, how could it safely get ahold of
>> the relocations used in the TLS access model?), even if it doesn't take
>> the rtld lock, as covered in other messages upthread.

> To avoid confusion, I assume you are referring to such a pattern:

> Thread 1:
>   data = 1;
>   flag = 1;
>   unlock;

> Thread 2:
>   if (flag) foo = data;

You're missing the happens-before relationship.  I don't see any in your
example, but I explicitly stated there was one, so your example does not
match at all my attempt to summarize what we'd already discussed and agreed
on, namely, rtld's setting up the TLS variable and applying TLS
relocations that reference it, so please refer to that part of the
thread so that we don't go in circles.

> If the necessary happens-before is ensured through other sync, then
> we're good though.

That was the conclusion upthread.  Do you see any reason to revisit it
after rereading it?

>> > We implement POSIX for users of glibc, but we do not implement on top of
>> > POSIX inside of glibc

>> Is this documented anywhere?

> https://sourceware.org/glibc/wiki/Concurrency
> says that we're using the C11 memory model.

Well, that's not exactly a true statement ATM.  We are *aiming* at it,
but there is plenty of code in GNU libc that isn't quite there yet.


> Thus, there will be synchronizes-with (and thus happens-before) edges
> between a release of a particular mutex instance and a subsequent
> acquisition of the *same* mutex instance.   This creates a total order
> on each particular mutex instance, but no happens-before is enforced
> with unrelated release or acquire operations (e.g., on
> different mutex instances, atomics, ...).

... because those don't introduce a synchronization with the release
sequence that precedes releasing that specific mutex, whereas a memory
fence would have more pervasive synchronization effects.  Is this what
you meant with "In C11, there's a distinction between a release-MO fence
and a mutex unlock operation (i.e., a release-MO store)"?  Is this all
you meant?
  
Torvald Riegel June 22, 2015, 8:38 a.m. UTC | #11
On Mon, 2015-06-22 at 03:39 -0300, Alexandre Oliva wrote:
> >> > We implement POSIX for users of glibc, but we do not implement on top of
> >> > POSIX inside of glibc
> 
> >> Is this documented anywhere?
> 
> > https://sourceware.org/glibc/wiki/Concurrency
> > says that we're using the C11 memory model.
> 
> Well, that's not exactly a true statement ATM.  We are *aiming* at it,
> but there are tons of code in GNU libc that aren't quite there yet.

Sure.

> 
> > Thus, there will be synchronizes-with (and thus happens-before) edges
> > between a release of a particular mutex instance and a subsequent
> > acquisition of the *same* mutex instance.   This creates a total order
> > on each particular mutex instance, but no happens-before is enforced
> > with unrelated release or acquire operations (e.g., on
> > different mutex instances, atomics, ...).
> 
> ... because those don't introduce a synchronization with the release
> sequence that precedes releasing that specific mutex, whereas a memory
> fence would have more pervasive synchronization effects.

Fences create synchronizes-with edges in combination with other atomic
accesses sequenced after them (in the case of release fences; sequenced
before them otherwise).  See 7.17.4p1-4 in C11 (N1570 is a draft) for
how that's worded there.

Also, to see this in action, you can try the example below at
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ using a browser that has
javascript enabled:

int main() {
  atomic_int x = 0;
  atomic_int unrelated = 0;
  int y = 0;
  {{{ {
     y = 1;
     // won't work; will have data races:
     // unrelated.store(23, memory_order_release);
     // does work:
     atomic_thread_fence(memory_order_release);
     x.store(1, memory_order_relaxed);
  } ||| {
     r1 = x.load(memory_order_acquire).readsvalue(1);
     r2 = y.readsvalue(1);
  } }}};
  return 0;
}

This tries to access the non-atomic initialization of y safely by using
an atomic flag x.  The readsvalue calls are basically assertions, used
to constrain which executions the cppmem tool looks at.
Using the release store instead of the release fence will result in data
races.  The plot at the bottom right shows the relationships that matter
in the possible executions found by the tool.

> Is this what
> you meant with "In C11, there's a distinction between a release-MO fence
> and a mutex unlock operation (i.e., a release-MO store)"?

Yes, basically.  More specifically, the difference between the fence and
a release store in cases similar to the example above.  And, as a
result, that a mutex unlock does not have the same synchronization
effects as a release fence.
  

Patch

diff --git a/NEWS b/NEWS
index ed02de0..eac100c 100644
--- a/NEWS
+++ b/NEWS
@@ -19,7 +19,7 @@  Version 2.22
   18047, 18049, 18068, 18080, 18093, 18100, 18104, 18110, 18111, 18116,
   18125, 18128, 18138, 18185, 18196, 18197, 18206, 18210, 18211, 18217,
   18220, 18221, 18234, 18244, 18247, 18287, 18319, 18333, 18346, 18397,
-  18409, 18410, 18412, 18418, 18422, 18434, 18444, 18469.
+  18409, 18410, 18412, 18418, 18422, 18434, 18444, 18457, 18469.
 
 * Cache information can be queried via sysconf() function on s390 e.g. with
   _SC_LEVEL1_ICACHE_SIZE as argument.
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 20c7e33..8fc210d 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -755,41 +755,54 @@  tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
       the_map = listp->slotinfo[idx].map;
     }
 
-  /* Make sure that, if a dlopen running in parallel forces the
-     variable into static storage, we'll wait until the address in the
-     static TLS block is set up, and use that.  If we're undecided
-     yet, make sure we make the decision holding the lock as well.  */
-  if (__glibc_unlikely (the_map->l_tls_offset
-			!= FORCED_DYNAMIC_TLS_OFFSET))
+  /* If the TLS block for the map is already assigned to dynamic, or
+     to some static TLS offset, the decision is final, and no lock is
+     required.  Now, if the decision hasn't been made, take the rtld
+     lock, so that an ongoing dlopen gets a chance to complete,
+     possibly assigning the module to static TLS and initializing the
+     corresponding TLS area for all threads, and then retest; if the
+     decision is still pending, force the module to dynamic TLS.
+
+     The risk that the thread accesses an earlier value in that memory
+     location, from before it was recycled into a link map by another
+     thread, is removed by the need for a happens-before relationship
+     between the loader that set the link map up and the TLS access
+     that referenced the module id.  Given such a relationship, the
+     value observed is at least as recent as the initialization; in
+     its absence, calling tls_get_addr with that module id invokes
+     undefined behavior.  */
+  if (__glibc_unlikely (the_map->l_tls_offset == NO_TLS_OFFSET))
     {
       __rtld_lock_lock_recursive (GL(dl_load_lock));
       if (__glibc_likely (the_map->l_tls_offset == NO_TLS_OFFSET))
-	{
-	  the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
-	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-	}
-      else if (__glibc_likely (the_map->l_tls_offset
-			       != FORCED_DYNAMIC_TLS_OFFSET))
-	{
+	the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
+      __rtld_lock_unlock_recursive (GL(dl_load_lock));
+    }
+
+  void *p;
+
+  if (the_map->l_tls_offset != FORCED_DYNAMIC_TLS_OFFSET)
+    {
 #if TLS_TCB_AT_TP
-	  void *p = (char *) THREAD_SELF - the_map->l_tls_offset;
+      p = (char *) THREAD_SELF - the_map->l_tls_offset;
 #elif TLS_DTV_AT_TP
-	  void *p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
+      p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
 #else
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
-	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
-	  dtv[GET_ADDR_MODULE].pointer.is_static = true;
-	  dtv[GET_ADDR_MODULE].pointer.val = p;
 
-	  return (char *) p + GET_ADDR_OFFSET;
-	}
-      else
-	__rtld_lock_unlock_recursive (GL(dl_load_lock));
+      dtv[GET_ADDR_MODULE].pointer.is_static = true;
+      dtv[GET_ADDR_MODULE].pointer.val = p;
+    }
+  else
+    {
+      p = allocate_and_init (the_map);
+      assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
+      /* FIXME: this is AS-Unsafe.  We'd have to atomically set val to
+	 p, if it is still unallocated, or release p and set it to
+	 val.  */
+      dtv[GET_ADDR_MODULE].pointer.val = p;
     }
-  void *p = dtv[GET_ADDR_MODULE].pointer.val = allocate_and_init (the_map);
-  assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
 
   return (char *) p + GET_ADDR_OFFSET;
 }
diff --git a/nptl/Makefile b/nptl/Makefile
index 3dd2944..4ffd13c 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -242,7 +242,7 @@  tests = tst-typesizes \
 	tst-basic7 \
 	tst-kill1 tst-kill2 tst-kill3 tst-kill4 tst-kill5 tst-kill6 \
 	tst-raise1 \
-	tst-join1 tst-join2 tst-join3 tst-join4 tst-join5 tst-join6 \
+	tst-join1 tst-join2 tst-join3 tst-join4 tst-join5 tst-join6 tst-join7 \
 	tst-detach1 \
 	tst-eintr1 tst-eintr2 tst-eintr3 tst-eintr4 tst-eintr5 \
 	tst-tsd1 tst-tsd2 tst-tsd3 tst-tsd4 tst-tsd5 tst-tsd6 \
@@ -320,7 +320,8 @@  endif
 modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
-		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod
+		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
+		tst-join7mod
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) tst-cleanup4aux.o
 test-extras += $(modules-names) tst-cleanup4aux
 test-modules = $(addprefix $(objpfx),$(addsuffix .so,$(modules-names)))
@@ -525,6 +526,11 @@  $(objpfx)tst-tls6.out: tst-tls6.sh $(objpfx)tst-tls5 \
 	$(evaluate-test)
 endif
 
+$(objpfx)tst-join7: $(libdl) $(shared-thread-library)
+$(objpfx)tst-join7.out: $(objpfx)tst-join7mod.so
+$(objpfx)tst-join7mod.so: $(shared-thread-library)
+LDFLAGS-tst-join7mod.so = -Wl,-soname,tst-join7mod.so
+
 $(objpfx)tst-dlsym1: $(libdl) $(shared-thread-library)
 
 $(objpfx)tst-fini1: $(shared-thread-library) $(objpfx)tst-fini1mod.so
diff --git a/nptl/tst-join7.c b/nptl/tst-join7.c
new file mode 100644
index 0000000..bf6fc76
--- /dev/null
+++ b/nptl/tst-join7.c
@@ -0,0 +1,12 @@ 
+#include <dlfcn.h>
+
+int
+do_test (void)
+{
+  void *f = dlopen ("tst-join7mod.so", RTLD_NOW | RTLD_GLOBAL);
+  if (f) dlclose (f); else return 1;
+  return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
diff --git a/nptl/tst-join7mod.c b/nptl/tst-join7mod.c
new file mode 100644
index 0000000..a8c7bc0
--- /dev/null
+++ b/nptl/tst-join7mod.c
@@ -0,0 +1,29 @@ 
+#include <stdio.h>
+#include <pthread.h>
+
+static pthread_t th;
+static int running = 1;
+
+static void *
+test_run (void *p)
+{
+  while (running)
+    fprintf (stderr, "XXX test_run\n");
+  fprintf (stderr, "XXX test_run FINISHED\n");
+  return NULL;
+}
+
+static void __attribute__ ((constructor))
+do_init (void)
+{
+  pthread_create (&th, NULL, test_run, NULL);
+}
+
+static void __attribute__ ((destructor))
+do_end (void)
+{
+  running = 0;
+  fprintf (stderr, "thread_join...\n");
+  pthread_join (th, NULL);
+  fprintf (stderr, "thread_join DONE\n");
+}