[BZ#17090/17620/17621] : fix DTV race, assert, and DTV_SURPLUS Static TLS limit

Message ID ormw7ol1sf.fsf@free.home
State Superseded
Headers

Commit Message

Alexandre Oliva Nov. 18, 2014, 7:57 p.m. UTC
  This patch fixes 3 bugs, 2 of which I've just filed to have all of the
issues covered in the bug database.


The original goal was to fix #17620, the problem with Static TLS
overflowing the DTV_SURPLUS range of the DTV, causing unnecessary dlopen
failures, more visible on AArch64 due to TLS Descriptor use by default.

In the process of fixing it, I noticed the race condition reported in
#17621.

The fix for the latter is to move Static TLS DTV entry initialization to
DTV updates by the thread itself, since the DTV entry is never used
except in GD/LD access models.

This enables us to get rid of the limit imposed in dlopen, that arises
precisely from the former need to ensure that all threads' DTVs are big
enough.  If we leave it for each thread to update its own DTV, the check
becomes unnecessary and it can be safely dropped.

The assert in bug 17090 was pretty obvious, and since I found it while
looking for dupes of the bugs I filed, and I was touching neighbor code,
I figured I'd just bring it into this patch.


Regression-tested on x86_64-linux-gnu.  Ok to install?


for  ChangeLog

	[BZ #17090]
	[BZ #17620]
	[BZ #17621]
	* NEWS: Update.
	* elf/dl-tls.c (_dl_update_slotinfo): Clean up outdated DTV
	entries with Static TLS too.
	(tls_get_addr_tail): Move Static TLS DTV entry set up from...
	 (_dl_allocate_tls_init): ... here (fix modid assertion), ...
	* elf/dl-reloc.c (_dl_nothread_init_static_tls): ... here...
	* nptl/allocatestack.c (init_one_static_tls): ... and here...
	* elf/dlopen.c (dl_open_worker): Drop l_tls_modid upper bound
        for Static TLS.
---
 NEWS                 |    8 ++++----
 elf/dl-open.c        |   12 +----------
 elf/dl-reloc.c       |    6 ------
 elf/dl-tls.c         |   53 ++++++++++++++++++++++++--------------------------
 nptl/allocatestack.c |    9 +++-----
 5 files changed, 33 insertions(+), 55 deletions(-)
  

Comments

Roland McGrath Nov. 18, 2014, 8:33 p.m. UTC | #1
Use __glibc_{,un}likely rather than __builtin_expect.

Can you confirm that this does not change the set of states that might be
seen by td_thr_tlsbase?


Thanks,
Roland
  
Alexandre Oliva Nov. 18, 2014, 10:08 p.m. UTC | #2
On Nov 18, 2014, Roland McGrath <roland@hack.frob.com> wrote:

> Use __glibc_{,un}likely rather than __builtin_expect.

*nod*, I've now changed the code I had reindented.

> Can you confirm that this does not change the set of states that might be
> seen by td_thr_tlsbase?

No, but I can confirm that, after this change, td_thr_tlsbase may return
as much garbage for Static TLS modules as the current code may for
dynamic TLS modules, since it doesn't check generation counts.

It would be possible to change it so that it compares the generation
count of the module and that of the DTV, so as to avoid returning
garbage; it could even compute the address for Static TLS modules, so
that we kept on returning the same pointer, regardless of what is
actually in the DTV.  Should it?
  
Roland McGrath Nov. 18, 2014, 10:40 p.m. UTC | #3
> No, but I can confirm that, after this change, td_thr_tlsbase may return
> as much garbage for Static TLS modules as the current code may for
> dynamic TLS modules, since it doesn't check generation counts.

That sounds like you're saying your change causes a regression, which
incidentally has the same failure mode as an existing bug for a
different circumstance.  The existing bug existing is reason to fix it,
and indeed the existing bug is not the fault of your change.  But that
bug existing is not an excuse to cause a regression in a related case.

> It would be possible to change it so that it compares the generation
> count of the module and that of the DTV, so as to avoid returning
> garbage; it could even compute the address for Static TLS modules, so
> that we kept on returning the same pointer, regardless of what is
> actually in the DTV.  Should it?

td_the_tlsbase should return success and yield the correct pointer in
any circumstance where the TLS block for the module and thread requested
has already been initialized.  It should fail with TD_TLSDEFER only when
the thread could not possibly have observed any values in that TLS
block.  That way, the debugger can fall back to showing initial values
from the PT_TLS segment (and refusing attempts to mutate) for the
TD_TLSDEFER case, and never fail to make the values the program will
actually see available to the user of the debugger.

I gather from what you've said that it does not already meet that
standard in all cases.  We should fix that, but do it in a way that does
not leave any intermediate state that is a regression from the status
quo ante.


Thanks,
Roland
  
Alexandre Oliva Nov. 19, 2014, 7:56 p.m. UTC | #4
On Nov 18, 2014, Roland McGrath <roland@hack.frob.com> wrote:

>> No, but I can confirm that, after this change, td_thr_tlsbase may return
>> as much garbage for Static TLS modules as the current code may for
>> dynamic TLS modules, since it doesn't check generation counts.

> That sounds like you're saying your change causes a regression, which
> incidentally has the same failure mode as an existing bug for a
> different circumstance.

The lack of docs prevents from from concluding it's a regression, rather
than a change in behavior within the intended boundaries, and the lack
of special-casing for Static TLS in the implementation doesn't enable me
to conclude it's a different circumstance, but I'd trade run-time TLS
failures, memory corruption and memory leaks for more frequent garbage
out of nptl_db any time.

Anyway, if we had to choose between the two, I'd easily go with the
patch that makes this tradeoff.  (Not that I had *meant* the patch as a
trade-off, mind you.  I didn't realize nptl_db could be affected by this
patch, so I only looked into it after you pointed it out.  So I thank
you for catching this, regardless of whether we agree on whether the
change in behavior is a regression or a legitimate change within the
(un?)defined boundaries of behavior for that function.)

Fortunately, we don't have to make that choice; we can have the fix for
the TLS bugs, and we can have reliable TLS accessors in nptl_db.  There
is a caveats, however: we need some means to find a module's DTV
generation and Static TLS offset, given a modid.  The offset can be
obtained from the link_map, whereas the generation and the link_map
pointer can be obtained from the dtv_slotinfo_list.  However, we have no
immediate means to get ot either of these data structures.

The main difficulty I'm yet to address (the rest is already done; I'll
gladly share the WIP patch if anyone's interested) is that
_dl_tls_dtv_slotinfo_list is defined under different names for static
and dynamic programs, a stand-alone symbol for the former and a
_rtld_global member in the latter, and we can't assume from say dynamic
nptl whether we're gonna be dlopened by a static libdl, or by the
dynamic ld.so, and a libpthread.a might be linked into a dynamic
executable, or even into a dlopened library.  I'm not sure which, if
any, of these cases we regard as unsupported, but I see difficulties in
handling them all.

Plus, nptl_db seems to be only equipped to look up symbols in libpthread
proper, and nothing in there links back to _rtld_global or its
corresponding standalone symbols.

I see a number of possibilities to overcome this:

- add a symbol to libpthread that, during early initialization, is set
to points to the head of the dtv slotinfo list.  I'm not sure yet how it
would be initialized though to work under all the situations above.

- add to link_maps a pointer to the corresponding slotinfo entry, so
that we can get ahold of the generation starting from the link_map.
This would enable at least td_thr_tls_get_addr to DTRT; td_thr_tlsbase,
taking only modids, would require some means to map those back to
link_maps to DTRT.

- iterate over the link_map list to locate a module with the looked-for
modid.  Horribly inefficient, but lacking other means, this could be a
viable last resort.  We'd still need gen counts added to link_maps
though.

- special-case the _dl_tls_dtv_slotinfo_list lookup so that we can find
it both as a member of _rtld_global, defined in ld.so, and as a
stand-alone symbol defined in the main static executable.  (Can
ps_pglobal_lookup be generally used to look up a symbol globally, if
given a NULL object_name or somesuch?  I can't find documentation for
this interface, and we only use it to look up symbols in libpthread_so.
Plus, what if we can't get to the slotinfo_list, say, because the static
executable is stripped and it hasn't exported this dynamic symbol we
need?)

- other possibilities I haven't considered?

Any recommendations?

> td_the_tlsbase should return success and yield the correct pointer in
> any circumstance where the TLS block for the module and thread requested
> has already been initialized.  It should fail with TD_TLSDEFER only when
> the thread could not possibly have observed any values in that TLS
> block.  That way, the debugger can fall back to showing initial values
> from the PT_TLS segment (and refusing attempts to mutate) for the
> TD_TLSDEFER case, and never fail to make the values the program will
> actually see available to the user of the debugger.

This sounds like a very reasonable goal.  Is this documented anywhere?
If not, may I paste your paragraph as a comment next to the function
definition?  Or to its declaration?

Thanks,
  
Roland McGrath Nov. 20, 2014, 2:17 a.m. UTC | #5
> - add a symbol to libpthread that, during early initialization, is set
> to points to the head of the dtv slotinfo list.  I'm not sure yet how it
> would be initialized though to work under all the situations above.

This would not be too bad.  But it's of course better to have zero runtime
overhead if it's feasible, which it seems like it is.

> - special-case the _dl_tls_dtv_slotinfo_list lookup so that we can find
> it both as a member of _rtld_global, defined in ld.so, and as a
> stand-alone symbol defined in the main static executable.  

I wouldn't really call this a special case.  It's easy enough to extend the
DB_SYMBOL macro, db-symbols.h, and td_symbol_list.c to distinguish the
object name for different symbols.  Then _rtld_global is a symbol like any
other (but for LD_SO instead of LIBPTHREAD_SO).  The td_thr_tlsbase code
can just look for _rtld_global first and if it doesn't find that, look for
_dl_tls_dtv_slotinfo_list instead.

This seems like the right thing to do.

> (Can
> ps_pglobal_lookup be generally used to look up a symbol globally, if
> given a NULL object_name or somesuch?  I can't find documentation for
> this interface, and we only use it to look up symbols in libpthread_so.

It's possible there's some old Solaris documentation to be found for the
details of the proc_service.h interface, since they invented it.  But all
that really matters is how GDB has used it for glibc systems, which might
not actually adhere to any to what any such Solaris documentation
prescribes.  

In actual fact, GDB ignores the argument entirely and just does a vanilla
symbol lookup like you'd get if you said "p &symbol".  That's why it works
at all now for statically-linked libpthread, where the main executable is
where the symbols are actually found.  That said, since the interface has
the parameter, I'd still be in favor of passing in the correct object name
just pro forma (since it's really not hard to implement).

> Plus, what if we can't get to the slotinfo_list, say, because the static
> executable is stripped and it hasn't exported this dynamic symbol we
> need?)

Then td_ta_new will have failed to find the nptl_version symbol and no use
of libthread_db will ever succeed, so you won't be there to wonder.  (It
needs some local symbols like this one, so a stripped executable with
libpthread statically linked never works, even if it is a dynamic
executable that exports every global in libpthread.a.)

> > td_the_tlsbase should return success and yield the correct pointer in
> > any circumstance where the TLS block for the module and thread requested
> > has already been initialized.  It should fail with TD_TLSDEFER only when
> > the thread could not possibly have observed any values in that TLS
> > block.  That way, the debugger can fall back to showing initial values
> > from the PT_TLS segment (and refusing attempts to mutate) for the
> > TD_TLSDEFER case, and never fail to make the values the program will
> > actually see available to the user of the debugger.
> 
> This sounds like a very reasonable goal.  Is this documented anywhere?
> If not, may I paste your paragraph as a comment next to the function
> definition?  Or to its declaration?

The Solaris documentation specifies td_thr_tlsbase quite simply,
"... returns the base address of a threads local storage block..."
The quality of implementation target that I stated just seems obvious.
Use my text however you like.


Thanks,
Roland
  
Pedro Alves Nov. 20, 2014, 10:28 a.m. UTC | #6
On 11/20/2014 02:17 AM, Roland McGrath wrote:
> In actual fact, GDB ignores the argument entirely and just does a vanilla
> symbol lookup like you'd get if you said "p &symbol".  That's why it works
> at all now for statically-linked libpthread, where the main executable is
> where the symbols are actually found.  

Indeed.  I once tried to change GDB to actually use the passed in argument,
but found that broke the statically-linked libpthread case.

> That said, since the interface has
> the parameter, I'd still be in favor of passing in the correct object name
> just pro forma (since it's really not hard to implement).

Note that's really not sufficient for getting the statically-linked case
fully right.  Some of the libpthread symbols thread_db wants to looks up,
although static, have names that aren't in the implementation namespace.  As
such, if the statically-linked program happens to have other symbols with the
same name, GDB might well find and return those to nptl_db instead.

For example, it should be perfectly legitimate for a program to define a
symbol called "stack_used", but that ends up confusing nptl_db/gdb,
because that's one of the static libpthread symbols nptl_db looks up:

 /* List of the stacks in use.  */
 static LIST_HEAD (stack_used);

And that's exactly what happened in:

 [Bug 9635 - Cannot find new threads: generic error]
 https://sourceware.org/bugzilla/show_bug.cgi?id=9635

Actually looks like stack_used may be the only one with the problem:

 $ grep -rn DB_GET_SYMBOL -rn
 nptl_db/td_ta_thr_iter.c:157:  err = DB_GET_SYMBOL (list, ta, __stack_user);
 nptl_db/td_ta_thr_iter.c:164:    err = DB_GET_SYMBOL (list, ta, stack_used);
 nptl_db/td_ta_tsd_iter.c:53:  err = DB_GET_SYMBOL (addr, ta, __pthread_keys);
 nptl_db/td_ta_event_addr.c:40:      err = DB_GET_SYMBOL (taddr, ta, __nptl_create_event);
 nptl_db/td_ta_event_addr.c:44:      err = DB_GET_SYMBOL (taddr, ta, __nptl_death_event);
 nptl_db/td_thr_validate.c:65:  err = DB_GET_SYMBOL (list, th->th_ta_p, __stack_user);
 nptl_db/td_thr_validate.c:73:      err = DB_GET_SYMBOL (list, th->th_ta_p, stack_used);
 nptl_db/td_ta_map_lwp2thr.c:190:  td_err_e err = DB_GET_SYMBOL (list, ta, __stack_user);
 nptl_db/td_ta_set_event.c:41:  err = DB_GET_SYMBOL (eventmask, ta, __nptl_threads_events);
 nptl_db/thread_dbP.h:150:#define DB_GET_SYMBOL(var, ta, name)                                         \
 nptl_db/td_thr_event_getmsg.c:71:  err = DB_GET_SYMBOL (prevp, th->th_ta_p, __nptl_last_event);
 nptl_db/td_ta_clear_event.c:41:  err = DB_GET_SYMBOL (eventmask, ta, __nptl_threads_events);

Though renaming __stack_user to __nptl_stack_user too might be good, in case
other parts of the implementation (e.g., libgcc) might end up defining their
own unrelated static symbols with that name too.

Thanks,
Pedro Alves
  
Pedro Alves Nov. 20, 2014, 11:48 a.m. UTC | #7
On 11/20/2014 10:28 AM, Pedro Alves wrote:
> On 11/20/2014 02:17 AM, Roland McGrath wrote:
>> In actual fact, GDB ignores the argument entirely and just does a vanilla
>> symbol lookup like you'd get if you said "p &symbol".  That's why it works
>> at all now for statically-linked libpthread, where the main executable is
>> where the symbols are actually found.  
> 
> Indeed.  I once tried to change GDB to actually use the passed in argument,
> but found that broke the statically-linked libpthread case.
> 
>> That said, since the interface has
>> the parameter, I'd still be in favor of passing in the correct object name
>> just pro forma (since it's really not hard to implement).
> 
> Note that's really not sufficient for getting the statically-linked case
> fully right.  Some of the libpthread symbols thread_db wants to looks up,
> although static, have names that aren't in the implementation namespace.  As
> such, if the statically-linked program happens to have other symbols with the
> same name, GDB might well find and return those to nptl_db instead.

FYI, filed PR17629 for this now.

https://sourceware.org/bugzilla/show_bug.cgi?id=17629

Thanks,
Pedro Alves

> 
> For example, it should be perfectly legitimate for a program to define a
> symbol called "stack_used", but that ends up confusing nptl_db/gdb,
> because that's one of the static libpthread symbols nptl_db looks up:
> 
>  /* List of the stacks in use.  */
>  static LIST_HEAD (stack_used);
> 
> And that's exactly what happened in:
> 
>  [Bug 9635 - Cannot find new threads: generic error]
>  https://sourceware.org/bugzilla/show_bug.cgi?id=9635
> 
> Actually looks like stack_used may be the only one with the problem:
> 
>  $ grep -rn DB_GET_SYMBOL -rn
>  nptl_db/td_ta_thr_iter.c:157:  err = DB_GET_SYMBOL (list, ta, __stack_user);
>  nptl_db/td_ta_thr_iter.c:164:    err = DB_GET_SYMBOL (list, ta, stack_used);
>  nptl_db/td_ta_tsd_iter.c:53:  err = DB_GET_SYMBOL (addr, ta, __pthread_keys);
>  nptl_db/td_ta_event_addr.c:40:      err = DB_GET_SYMBOL (taddr, ta, __nptl_create_event);
>  nptl_db/td_ta_event_addr.c:44:      err = DB_GET_SYMBOL (taddr, ta, __nptl_death_event);
>  nptl_db/td_thr_validate.c:65:  err = DB_GET_SYMBOL (list, th->th_ta_p, __stack_user);
>  nptl_db/td_thr_validate.c:73:      err = DB_GET_SYMBOL (list, th->th_ta_p, stack_used);
>  nptl_db/td_ta_map_lwp2thr.c:190:  td_err_e err = DB_GET_SYMBOL (list, ta, __stack_user);
>  nptl_db/td_ta_set_event.c:41:  err = DB_GET_SYMBOL (eventmask, ta, __nptl_threads_events);
>  nptl_db/thread_dbP.h:150:#define DB_GET_SYMBOL(var, ta, name)                                         \
>  nptl_db/td_thr_event_getmsg.c:71:  err = DB_GET_SYMBOL (prevp, th->th_ta_p, __nptl_last_event);
>  nptl_db/td_ta_clear_event.c:41:  err = DB_GET_SYMBOL (eventmask, ta, __nptl_threads_events);
> 
> Though renaming __stack_user to __nptl_stack_user too might be good, in case
> other parts of the implementation (e.g., libgcc) might end up defining their
> own unrelated static symbols with that name too.
> 
> Thanks,
> Pedro Alves
>
  

Patch

diff --git a/NEWS b/NEWS
index b152488..4f208a2 100644
--- a/NEWS
+++ b/NEWS
@@ -9,10 +9,10 @@  Version 2.21
 
 * The following bugs are resolved with this release:
 
-  6652, 12926, 14132, 14138, 14171, 15215, 15884, 17266, 17344, 17363,
-  17370, 17371, 17411, 17460, 17475, 17485, 17501, 17506, 17508, 17522,
-  17555, 17570, 17571, 17572, 17573, 17574, 17582, 17583, 17584, 17585,
-  17589, 17594, 17616.
+  6652, 12926, 14132, 14138, 14171, 15215, 15884, 17090, 17266, 17344,
+  17363, 17370, 17371, 17411, 17460, 17475, 17485, 17501, 17506, 17508,
+  17522, 17555, 17570, 17571, 17572, 17573, 17574, 17582, 17583, 17584,
+  17585, 17589, 17594, 17616, 17620, 17621.
 
 * The minimum GCC version that can be used to build this version of the GNU
   C Library is GCC 4.6.  Older GCC versions, and non-GNU compilers, can
diff --git a/elf/dl-open.c b/elf/dl-open.c
index 7cc4cc1..9d6006b 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -531,17 +531,7 @@  TLS generation counter wrapped!  Please report this."));
 	  && imap->l_tls_blocksize > 0)
 	{
 	  /* For static TLS we have to allocate the memory here and
-	     now.  This includes allocating memory in the DTV.  But we
-	     cannot change any DTV other than our own.  So, if we
-	     cannot guarantee that there is room in the DTV we don't
-	     even try it and fail the load.
-
-	     XXX We could track the minimum DTV slots allocated in
-	     all threads.  */
-	  if (! RTLD_SINGLE_THREAD_P && imap->l_tls_modid > DTV_SURPLUS)
-	    _dl_signal_error (0, "dlopen", NULL, N_("\
-cannot load any more object with static TLS"));
-
+	     now, but we can delay updating the DTV.  */
 	  imap->l_need_tls_init = 0;
 #ifdef SHARED
 	  /* Update the slot information data for at least the
diff --git a/elf/dl-reloc.c b/elf/dl-reloc.c
index 97a7119..1d66f79 100644
--- a/elf/dl-reloc.c
+++ b/elf/dl-reloc.c
@@ -136,12 +136,6 @@  _dl_nothread_init_static_tls (struct link_map *map)
 # error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 #endif
 
-  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
-  dtv_t *dtv = THREAD_DTV ();
-  assert (map->l_tls_modid <= dtv[-1].counter);
-  dtv[map->l_tls_modid].pointer.val = dest;
-  dtv[map->l_tls_modid].pointer.is_static = true;
-
   /* Initialize the memory.  */
   memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
 	  '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 5204fda..2408384 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -437,17 +437,14 @@  _dl_allocate_tls_init (void *result)
 	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
 	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
 
+	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
+	  dtv[map->l_tls_modid].pointer.is_static = false;
+
 	  if (map->l_tls_offset == NO_TLS_OFFSET
 	      || map->l_tls_offset == FORCED_DYNAMIC_TLS_OFFSET)
-	    {
-	      /* For dynamically loaded modules we simply store
-		 the value indicating deferred allocation.  */
-	      dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
-	      dtv[map->l_tls_modid].pointer.is_static = false;
-	      continue;
-	    }
+	    continue;
 
-	  assert (map->l_tls_modid == cnt);
+	  assert (map->l_tls_modid == total + cnt);
 	  assert (map->l_tls_blocksize >= map->l_tls_initimage_size);
 #if TLS_TCB_AT_TP
 	  assert ((size_t) map->l_tls_offset >= map->l_tls_blocksize);
@@ -459,8 +456,6 @@  _dl_allocate_tls_init (void *result)
 #endif
 
 	  /* Copy the initialization image and clear the BSS part.  */
-	  dtv[map->l_tls_modid].pointer.val = dest;
-	  dtv[map->l_tls_modid].pointer.is_static = true;
 	  memset (__mempcpy (dest, map->l_tls_initimage,
 			     map->l_tls_initimage_size), '\0',
 		  map->l_tls_blocksize - map->l_tls_initimage_size);
@@ -632,10 +627,9 @@  _dl_update_slotinfo (unsigned long int req_modid)
 		     might still be allocated.  */
 		  if (! dtv[total + cnt].pointer.is_static
 		      && dtv[total + cnt].pointer.val != TLS_DTV_UNALLOCATED)
-		    {
-		      free (dtv[total + cnt].pointer.val);
-		      dtv[total + cnt].pointer.val = TLS_DTV_UNALLOCATED;
-		    }
+		    free (dtv[total + cnt].pointer.val);
+		  dtv[total + cnt].pointer.val = TLS_DTV_UNALLOCATED;
+		  dtv[total + cnt].pointer.is_static = false;
 
 		  continue;
 		}
@@ -698,10 +692,8 @@  _dl_update_slotinfo (unsigned long int req_modid)
 		   memalign and not malloc.  */
 		free (dtv[modid].pointer.val);
 
-	      /* This module is loaded dynamically- We defer memory
-		 allocation.  */
-	      dtv[modid].pointer.is_static = false;
 	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
+	      dtv[modid].pointer.is_static = false;
 
 	      if (modid == req_modid)
 		the_map = map;
@@ -739,7 +731,6 @@  tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
       the_map = listp->slotinfo[idx].map;
     }
 
- again:
   /* Make sure that, if a dlopen running in parallel forces the
      variable into static storage, we'll wait until the address in the
      static TLS block is set up, and use that.  If we're undecided
@@ -753,22 +744,28 @@  tls_get_addr_tail (GET_ADDR_ARGS, dtv_t *dtv, struct link_map *the_map)
 	  the_map->l_tls_offset = FORCED_DYNAMIC_TLS_OFFSET;
 	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
 	}
-      else
+      else if (__builtin_expect (the_map->l_tls_offset
+				 != FORCED_DYNAMIC_TLS_OFFSET, 1))
 	{
+#if TLS_TCB_AT_TP
+	  void *p = (char *) THREAD_SELF - the_map->l_tls_offset;
+#elif TLS_DTV_AT_TP
+	  void *p = (char *) THREAD_SELF + the_map->l_tls_offset + TLS_PRE_TCB_SIZE;
+#else
+# error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
+#endif
 	  __rtld_lock_unlock_recursive (GL(dl_load_lock));
-	  if (__builtin_expect (the_map->l_tls_offset
-				!= FORCED_DYNAMIC_TLS_OFFSET, 1))
-	    {
-	      void *p = dtv[GET_ADDR_MODULE].pointer.val;
-	      if (__glibc_unlikely (p == TLS_DTV_UNALLOCATED))
-		goto again;
 
-	      return (char *) p + GET_ADDR_OFFSET;
-	    }
+	  dtv[GET_ADDR_MODULE].pointer.is_static = true;
+	  dtv[GET_ADDR_MODULE].pointer.val = p;
+
+	  return (char *) p + GET_ADDR_OFFSET;
 	}
+      else
+	__rtld_lock_unlock_recursive (GL(dl_load_lock));
     }
   void *p = dtv[GET_ADDR_MODULE].pointer.val = allocate_and_init (the_map);
-  dtv[GET_ADDR_MODULE].pointer.is_static = false;
+  assert (!dtv[GET_ADDR_MODULE].pointer.is_static);
 
   return (char *) p + GET_ADDR_OFFSET;
 }
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 8cf0274..8b053c1 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -1190,7 +1190,6 @@  __nptl_setxid (struct xid_command *cmdp)
 static inline void __attribute__((always_inline))
 init_one_static_tls (struct pthread *curp, struct link_map *map)
 {
-  dtv_t *dtv = GET_DTV (TLS_TPADJ (curp));
 # if TLS_TCB_AT_TP
   void *dest = (char *) curp - map->l_tls_offset;
 # elif TLS_DTV_AT_TP
@@ -1199,11 +1198,9 @@  init_one_static_tls (struct pthread *curp, struct link_map *map)
 #  error "Either TLS_TCB_AT_TP or TLS_DTV_AT_TP must be defined"
 # endif
 
-  /* Fill in the DTV slot so that a later LD/GD access will find it.  */
-  dtv[map->l_tls_modid].pointer.val = dest;
-  dtv[map->l_tls_modid].pointer.is_static = true;
-
-  /* Initialize the memory.  */
+  /* We cannot delay the initialization of the Static TLS area, since
+     it can be accessed with LE or IE, but since the DTV is only used
+     by GD and LD, we can delay its update to avoid a race.  */
   memset (__mempcpy (dest, map->l_tls_initimage, map->l_tls_initimage_size),
 	  '\0', map->l_tls_blocksize - map->l_tls_initimage_size);
 }