From patchwork Wed Jun 3 20:24:10 2015
X-Patchwork-Id: 7035
From: Alexandre Oliva
To: Torvald Riegel
Cc: Andreas Schwab, libc-alpha@sourceware.org
Subject: Re: [PR18457] Don't require rtld lock to compute DTV addr for static TLS
References: <1433326788.21461.81.camel@triegel.csb> <1433344426.21461.202.camel@triegel.csb>
Date: Wed, 03 Jun 2015 17:24:10 -0300
In-Reply-To: <1433344426.21461.202.camel@triegel.csb> (Torvald Riegel's
	message of "Wed, 03 Jun 2015 17:13:46 +0200")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

On Jun 3, 2015, Torvald Riegel wrote:

> So you think a reload by the compiler would be bad.

Yeah.  It's a double-checked lock entry with sugar on top.  We only
need to take the lock if l_tls_offset is unresolved (NO_TLS_OFFSET),
but if it is resolved, we want different behavior depending on whether
it is FORCED_DYNAMIC_TLS_OFFSET or something else (a static TLS
offset).

> This can only be bad if there is concurrent modification, potentially

Concurrent modification is made while holding the lock.
It shouldn't happen in the same thread, at least as long as TLS is
regarded as AS-Unsafe, but other threads could concurrently attempt to
resolve a module's l_tls_offset to a static offset or to forced
dynamic.

> Thus, therefore we need the atomic access

I'm not convinced we do, but I don't mind, and I don't want this to be
a point of contention.

> AFAIU, you seem to speak about memory reuse unrelated to this
> specifically this particular load, right?

Yeah, some other earlier use of the same location.

> But that sounds like an issue independently of whether the specific
> load is an acquire or relaxed load.

Not really.  It is a preexisting issue, yes, but an acquire load would
make sure the (re)initialization of the memory into a link map,
performed while holding the lock (and with an atomic write, no less),
would necessarily be observed by the atomic acquire load.  A relaxed
load might still observe the un(re)initialized value.  Right?

> We established before that you want to prevent reload because there are
> concurrent stores.  Are these by other threads?

Answered above.

> If so, are there any cases of the following pattern:

Dude.  Of course not.  None of them use atomics.  So far this has only
used locks to guard changes, and double-checked locks for access.

> storing thread;
> A;
> atomic_store_relaxed(&l_tls_offset, ...);
> observing thread:
> offset = atomic_load_relaxed(&l_tls_offset);
> B(offset);
> where something in B (which uses or has a dependency on offset) relies
> on happening after A?

Let's rewrite this into something more like what we have now:

storing thread:
  acquire lock;
  A;
  set l_tls_offset;
  release lock;

observing thread:
  if l_tls_offset is undecided:
    acquire lock;
    if l_tls_offset is undecided:
      set l_tls_offset to forced_dynamic; // storing thread too!
    release lock;
  assert(l_tls_offset is decided);
  if l_tls_offset is forced_dynamic:
    dynamicB(l_tls_offset)
  else
    staticB(l_tls_offset)

The forced_dynamic case of B(l_tls_offset) will involve at least
copying the TLS init block, which A will have mapped in and relocated.
We don't take the lock for that copy, so the release after A doesn't
guarantee we see the intended values.  Now, since the area is mmapped
and then relocated, it is very unlikely that any cache would have kept
a copy of the unrelocated block, let alone of any prior mmapping in
that range.  So, it is very likely to work, but it's not guaranteed to
work.

As for the static TLS case of B(l_tls_offset), the potential for
problems is similar, but not quite the same.  The key difference is
that the initialization of the static TLS block takes place in the
storing thread, rather than in the observing thread, and although
B(l_tls_offset) won't access the thread's static TLS block, the caller
of __tls_get_addr will (and so will any IE and LE TLS access).

Now, in order for any such access to take place, some relocation
applied by A must be seen by the observing thread, and if there isn't
some sequencing event that ensures the dlopen (or initial load)
enclosing A happens-before the use of the relocation, the whole thing
is undefined; otherwise, this sequencing event ought to be enough of a
memory barrier to guarantee the whole thing works.  It's just that the
sequencing event is not provided by the TLS machinery itself, but
rather by the user, in sequencing events after the dlopen; by the init
code, in sequencing the initial loading and relocation before any
application code execution; or by the thread library, in sequencing
any thread started by module initializers after their relocation.
Which means, to me, that a relaxed load might turn out to be enough
after all.
> I'm trying to find out what you know about the intent behind the TLS
> synchronization

FWIW, in this discussion we're touching just a tiny fraction of it,
and one that's particularly trivial compared with other bits :-(

> Does dlopen just have to decide about this value

It does tons of stuff (loading modules and dependencies, applying
relocations, running initializers), and it must have a say first.
E.g., if any IE relocation references a module, we must make it static
TLS.  Otherwise, dlopen may leave it undecided, and then a later
dlopen might attempt to make it static TLS again (and fail if that's
no longer possible), or an intervening tls_get_addr may see it's
undecided and make the module's TLS dynamic.

> I disagree.  You added an atomic load on the consumer side (rightly
> so!), so you should not ignore the producer side either.  This is in the
> same function, and you touch most of the lines around it, and it's
> confusing if you make a partial change.

You're missing the other cases elsewhere that set this same field.

> Let me point out that we do have consensus in the project that new code
> must be free of data races.

Is a double-checked lock regarded as a race?  I didn't think so.

So, I'm proposing this patch, which reorganizes the function a bit to
make this absolutely clear, so that we can get the fix in and I can
move on, instead of extending any further the useless part of this
conversation, and so that we can focus on the important stuff.  How's
this?

We used to store static TLS addrs in the DTV at module load time, but
this required one thread to modify another thread's DTV.  Now that we
defer the DTV update to the first use in the thread, we should do so
without taking the rtld lock if the module is already assigned to
static TLS.  Taking the lock introduces deadlocks where there weren't
any before.
This patch fixes the deadlock caused by tls_get_addr's unnecessarily
taking the rtld lock to initialize the DTV entry for tls_dtor_list
within __call_dtors_list, which deadlocks with a dlclose of a module
whose finalizer joins with that thread.

The patch does not, however, attempt to fix other potential sources of
similar deadlocks, such as: the explicit rtld locks taken by
__call_dtors_list, when the dtor list is not empty; lazy relocation of
the reference to tls_dtor_list, when TLS Descriptors are in use; when
TLS dtors call functions through the PLT and lazy relocation needs to
be performed; or when any such called functions interact with the
dynamic loader in ways that require its lock to be taken.

for  ChangeLog

	[PR dynamic-link/18457]
	* elf/dl-tls.c (tls_get_addr_tail): Don't take the rtld lock
	if we already have a final static TLS offset.
	* nptl/tst-join7.c, nptl/tst-join7mod.c: New, from Andreas
	Schwab's bug report.
	* nptl/Makefile (tests): Add tst-join7.
	(module-names): Add tst-join7mod.
	($(objpfx)tst-join7): New.  Add deps.
	($(objpfx)tst-join7.out): Likewise.
	($(objpfx)tst-join7mod.so): Likewise.
	(LDFLAGS-tst-join7mod.so): Set soname.
---
 NEWS                |  2 +-
 elf/dl-tls.c        | 63 +++++++++++++++++++++++++++++++--------------------
 nptl/Makefile       | 10 ++++++--
 nptl/tst-join7.c    | 12 ++++++++++
 nptl/tst-join7mod.c | 29 +++++++++++++++++++++++
 5 files changed, 88 insertions(+), 28 deletions(-)
 create mode 100644 nptl/tst-join7.c
 create mode 100644 nptl/tst-join7mod.c

diff --git a/NEWS b/NEWS
index ed02de0..eac100c 100644
--- a/NEWS
+++ b/NEWS
@@ -19,7 +19,7 @@ Version 2.22
   18047, 18049, 18068, 18080, 18093, 18100, 18104, 18110, 18111, 18116,
   18125, 18128, 18138, 18185, 18196, 18197, 18206, 18210, 18211, 18217,
   18220, 18221, 18234, 18244, 18247, 18287, 18319, 18333, 18346, 18397,
-  18409, 18410, 18412, 18418, 18422, 18434, 18444, 18469.
+  18409, 18410, 18412, 18418, 18422, 18434, 18444, 18457, 18469.
 
 * Cache information can be queried via sysconf() function on s390 e.g.
   with _SC_LEVEL1_ICACHE_SIZE as argument.

diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 20c7e33..8fc210d 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -755,41 +755,54 @@ tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
       the_map = listp->slotinfo[idx].map;
     }
 
-  /* Make sure that, if a dlopen running in parallel forces the
-     variable into static storage, we'll wait until the address in the
-     static TLS block is set up, and use that.  If we're undecided
-     yet, make sure we make the decision holding the lock as well.  */
-  if (__glibc_unlikely (the_map->l_tls_offset
-			!= FORCED_DYNAMIC_TLS_OFFSET))
+  /* If the TLS block for the map is already assigned to dynamic, or
+     to some static TLS offset, the decision is final, and no lock is
+     required.  Now, if the decision hasn't been made, take the rtld
+     lock, so that an ongoing dlopen gets a chance to complete,
+     possibly assigning the module to static TLS and initializing the
+     corresponding TLS area for all threads, and then retest; if the
+     decision is still pending, force the module to dynamic TLS.
+
+     The risk that the thread accesses an earlier value in that memory
+     location, from before it was recycled into a link map in another
+     thread, is removed by the need for some happens-before
+     relationship between the loader that set that up and the TLS
+     access that referenced the module id.  In the presence of such a
+     relationship, the value will be at least as recent as the
+     initialization, and in its absence, calling tls_get_addr with its
+     module id invokes undefined behavior.  */
+  if (__glibc_unlikely (the_map->l_tls_offset == NO_TLS_OFFSET))
     {
       __rtld_lock_lock_recursive (GL(dl_load_lock));
       if (__glibc_likely (the_map->l_tls_offset == NO_TLS_OFFSET))
-	{
-	  the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
-	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-	}
-      else if (__glibc_likely (the_map->l_tls_offset
-			       != FORCED_DYNAMIC_TLS_OFFSET))
-	{
+	the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
+      __rtld_lock_unlock_recursive (GL(dl_load_lock));
+    }
+
+  void *p;
+
+  if (the_map->l_tls_offset != FORCED_DYNAMIC_TLS_OFFSET)
+    {
 #if TLS_TCB_AT_TP
-	  void *p = (char *) THREAD_SELF - the_map->l_tls_offset;
+      p = (char *) THREAD_SELF - the_map->l_tls_offset;
 #elif TLS_DTV_AT_TP
-	  void *p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
+      p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
 #else
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
-	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-
-	  dtv[GET_ADDR_MODULE].pointer.is_static = true;
-	  dtv[GET_ADDR_MODULE].pointer.val = p;
-	  return (char *) p + GET_ADDR_OFFSET;
-	}
-      else
-	__rtld_lock_unlock_recursive (GL(dl_load_lock));
+      dtv[GET_ADDR_MODULE].pointer.is_static = true;
+      dtv[GET_ADDR_MODULE].pointer.val = p;
+    }
+  else
+    {
+      p = allocate_and_init (the_map);
+      assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
+      /* FIXME: this is AS-Unsafe.  We'd have to atomically set val to
+	 p, if it is still unallocated, or release p and set it to
+	 val.  */
+      dtv[GET_ADDR_MODULE].pointer.val = p;
     }
 
-  void *p = dtv[GET_ADDR_MODULE].pointer.val = allocate_and_init (the_map);
-  assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
   return (char *) p + GET_ADDR_OFFSET;
 }

diff --git a/nptl/Makefile b/nptl/Makefile
index 3dd2944..4ffd13c 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -242,7 +242,7 @@ tests = tst-typesizes \
 	tst-basic7 \
 	tst-kill1 tst-kill2 tst-kill3 tst-kill4 tst-kill5 tst-kill6 \
 	tst-raise1 \
-	tst-join1 tst-join2 tst-join3 tst-join4 tst-join5 tst-join6 \
+	tst-join1 tst-join2 tst-join3 tst-join4 tst-join5 tst-join6 tst-join7 \
 	tst-detach1 \
 	tst-eintr1 tst-eintr2 tst-eintr3 tst-eintr4 tst-eintr5 \
 	tst-tsd1 tst-tsd2 tst-tsd3 tst-tsd4 tst-tsd5 tst-tsd6 \
@@ -320,7 +320,8 @@ endif
 modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
-		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod
+		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
+		tst-join7mod
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) tst-cleanup4aux.o
 test-extras += $(modules-names) tst-cleanup4aux
 test-modules = $(addprefix $(objpfx),$(addsuffix .so,$(modules-names)))
@@ -525,6 +526,11 @@ $(objpfx)tst-tls6.out: tst-tls6.sh $(objpfx)tst-tls5 \
 	$(evaluate-test)
 endif
 
+$(objpfx)tst-join7: $(libdl) $(shared-thread-library)
+$(objpfx)tst-join7.out: $(objpfx)tst-join7mod.so
+$(objpfx)tst-join7mod.so: $(shared-thread-library)
+LDFLAGS-tst-join7mod.so = -Wl,-soname,tst-join7mod.so
+
 $(objpfx)tst-dlsym1: $(libdl) $(shared-thread-library)
 $(objpfx)tst-fini1: $(shared-thread-library) $(objpfx)tst-fini1mod.so

diff --git a/nptl/tst-join7.c b/nptl/tst-join7.c
new file mode 100644
index 0000000..bf6fc76
--- /dev/null
+++ b/nptl/tst-join7.c
@@ -0,0 +1,12 @@
+#include <dlfcn.h>
+
+int
+do_test (void)
+{
+  void *f = dlopen ("tst-join7mod.so", RTLD_NOW | RTLD_GLOBAL);
+  if (f) dlclose (f); else return 1;
+  return 0;
+}
+
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"

diff --git a/nptl/tst-join7mod.c b/nptl/tst-join7mod.c
new file mode 100644
index 0000000..a8c7bc0
--- /dev/null
+++ b/nptl/tst-join7mod.c
@@ -0,0 +1,29 @@
+#include <pthread.h>
+#include <stdio.h>
+
+static pthread_t th;
+static int running = 1;
+
+static void *
+test_run (void *p)
+{
+  while (running)
+    fprintf (stderr, "XXX test_run\n");
+  fprintf (stderr, "XXX test_run FINISHED\n");
+  return NULL;
+}
+
+static void __attribute__ ((constructor))
+do_init (void)
+{
+  pthread_create (&th, NULL, test_run, NULL);
+}
+
+static void __attribute__ ((destructor))
+do_end (void)
+{
+  running = 0;
+  fprintf (stderr, "thread_join...\n");
+  pthread_join (th, NULL);
+  fprintf (stderr, "thread_join DONE\n");
+}