[BZ,#19329] Fix race between tls allocation at thread creation and dlopen

Message ID 5693D908.8090104@arm.com
State New, archived

Commit Message

Szabolcs Nagy Jan. 11, 2016, 4:32 p.m. UTC
  Changes in patch v2:

* The listp->next race is fixed.

* GL(dl_tls_generation) is loaded before GL(dl_tls_max_dtv_idx).
   (This matters if several dlopens happen during _dl_allocate_tls_init
   between those two loads.)

* More detailed analysis below.

I can trigger the bug on x86_64 by loading a library with >50 deps
with tls and concurrently creating >1000 threads.  This was not turned
into a glibc test case, but it is attached to the bug report.
(On aarch64 the failure happens to be easier to trigger.)


The following failures can be triggered:

Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= 
_rtld_local._dl_tls_generation' failed!

and

Inconsistency detected by ld.so: dl-tls.c: 525: _dl_allocate_tls_init: Assertion `listp != NULL' failed!

dlopen modifies tls-related dynamic linker data structures while holding
the GL(dl_load_lock) if the loaded library has tls.

Meanwhile, at thread creation, the same globals are accessed when the dtv
and tls are set up for the new thread in _dl_allocate_tls_init, without
holding any locks.

At least the following global objects may have conflicting access:

   GL(dl_tls_max_dtv_idx)
   GL(dl_tls_generation)
   listp->slotinfo[i].map
   listp->slotinfo[i].gen
   listp->next

where listp points to a node in GL(dl_tls_dtv_slotinfo_list).
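
For reference, a list node looks roughly like this (simplified from the
declarations in sysdeps/generic/ldsodefs.h; in the real code the structs
are nested and wrapped in the EXTERN/GL machinery):

  #include <stddef.h>                /* size_t */

  struct link_map;                   /* declared in <link.h> */

  struct dtv_slotinfo
  {
    size_t gen;                      /* generation number of the module */
    struct link_map *map;            /* the module; NULL for unused entries */
  };

  struct dtv_slotinfo_list
  {
    size_t len;                      /* number of slotinfo entries */
    struct dtv_slotinfo_list *next;  /* next chunk of the list */
    struct dtv_slotinfo slotinfo[];  /* flexible array of entries */
  };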

The race window for the first assert failure above is short compared to
dlopen and thread creation, so it rarely happens, but the probability
can be increased if the loaded library has a lot of dependencies with
tls.  If the deps don't fit into the same listp->slotinfo array, the
second assert failure starts to happen as well.

The current sequence of events is:

dlopen:

   The modids of the loaded library and its dependencies are set one by
   one to ++GL(dl_tls_max_dtv_idx) (assuming they have tls).

   Then these modules are added to GL(dl_tls_dtv_slotinfo_list) one by
   one with the same generation number: GL(dl_tls_generation)+1.

   Then the GL(dl_tls_generation) is increased.

pthread_create:

   The dtv for the thread is allocated assuming GL(dl_tls_max_dtv_idx) is
   the max modid that will be added to the dtv here.

   The GL(dl_tls_dtv_slotinfo_list) is walked to initialize dtv[modid]
   for all modules in the list and initialize the related tls.

   At the end dtv[0].counter is set to the max generation number seen;
   all modules with a less-than-or-equal generation number must have
   initialized dtv and tls in this thread by then.
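
The race behind the first assert failure can be reproduced with a small
standalone model of the two sequences above (hypothetical code, not
glibc; the data race on the plain variables is intentional):

  /* race-model.c -- build with: cc -pthread -o race-model race-model.c
     The loader adds entries with generation gen+1 and bumps the counter
     only afterwards; if the walker runs in between, the invariant that
     the first assert above checks is violated.  May need many runs to
     trigger.  */
  #include <assert.h>
  #include <pthread.h>
  #include <stddef.h>

  #define NMOD 64

  static size_t generation;        /* models GL(dl_tls_generation) */
  static size_t slot_gen[NMOD];    /* models listp->slotinfo[i].gen */
  static void *slot_map[NMOD];     /* models listp->slotinfo[i].map */

  static void *loader (void *arg)  /* models dlopen with NMOD tls deps */
  {
    for (int i = 0; i < NMOD; i++)
      {
        slot_gen[i] = generation + 1;
        slot_map[i] = &slot_gen[i];   /* any non-NULL value */
      }
    generation++;                     /* bumped only after all entries */
    return NULL;
  }

  static void *walker (void *arg)  /* models _dl_allocate_tls_init */
  {
    for (int i = 0; i < NMOD; i++)
      if (slot_map[i] != NULL)
        /* The invariant asserted by the current code; it does not
           hold while a concurrent dlopen is in progress.  */
        assert (slot_gen[i] <= generation);
    return NULL;
  }

  int main (void)
  {
    pthread_t a, b;
    pthread_create (&a, NULL, loader, NULL);
    pthread_create (&b, NULL, walker, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    return 0;
  }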

The patch uses atomics to add modules to the slotinfo list and to increase
GL(dl_tls_generation) during dlopen, and similarly uses atomics to safely
walk the slotinfo list in pthread_create.  The dtv and tls are initialized
for all modules up to the observed GL(dl_tls_generation); modules with a
larger generation number (concurrently loaded modules) are ignored.
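
In generic C11 atomics, the scheme is a release/acquire publication
pattern; a minimal sketch mirroring the patch below (glibc's internal
atomic_store_release and atomic_load_acquire wrappers correspond to the
memory_order_release/memory_order_acquire operations used here):

  #include <stdatomic.h>
  #include <stddef.h>

  #define NSLOTS 64

  struct slot
  {
    size_t gen;           /* written before the release store of map */
    void *_Atomic map;    /* NULL means unused */
  };
  static struct slot slots[NSLOTS];
  static _Atomic size_t generation;

  /* dlopen side (as in _dl_add_to_slotinfo): write gen first, then
     release-store map, so an acquire load that sees a non-NULL map
     also sees the matching gen.  */
  void add_module (size_t idx, void *map)
  {
    slots[idx].gen
      = atomic_load_explicit (&generation, memory_order_relaxed) + 1;
    atomic_store_explicit (&slots[idx].map, map, memory_order_release);
  }

  /* dlopen side (as in dl_open_worker): publish the new generation
     once all entries are in place.  */
  void bump_generation (void)
  {
    size_t gen = atomic_load_explicit (&generation, memory_order_relaxed);
    atomic_store_explicit (&generation, gen + 1, memory_order_release);
  }

  /* pthread_create side (as in _dl_allocate_tls_init): take one
     acquire snapshot of the counter and skip entries with a larger
     generation, i.e. concurrently loaded modules.  */
  size_t scan (void)
  {
    size_t snapshot
      = atomic_load_explicit (&generation, memory_order_acquire);
    size_t maxgen = 0;
    for (size_t i = 0; i < NSLOTS; i++)
      {
        void *map
          = atomic_load_explicit (&slots[i].map, memory_order_acquire);
        if (map == NULL)
          continue;             /* unused entry */
        size_t gen = slots[i].gen;
        if (gen > snapshot)
          continue;             /* concurrently loaded module */
        if (gen > maxgen)
          maxgen = gen;
      }
    return maxgen;
  }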


Further issues:

dlclose also modifies the slotinfo list in unsafe ways and I don't
immediately see a lock-free way to synchronize that with thread
creation.  (Using GL(dl_load_lock) at thread creation does not work
because it can deadlock if pthread_create is called from a ctor while
dlopen holds the lock; see the sketch below.)
I.e. this patch assumes dlclose is not called concurrently with
pthread_create.
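
To illustrate that deadlock, a minimal sketch of such a library
(hypothetical file name, not part of glibc or its tests):

  /* libctor.c -- build: cc -fPIC -shared -o libctor.so libctor.c -lpthread
     dlopen runs this constructor while still holding GL(dl_load_lock);
     taking the same lock in pthread_create can therefore deadlock, as
     described above.  */
  #include <pthread.h>

  static void *thread_fn (void *arg)
  {
    return arg;
  }

  __attribute__ ((constructor))
  static void ctor (void)
  {
    pthread_t t;
    if (pthread_create (&t, NULL, thread_fn, NULL) == 0)
      pthread_join (t, NULL);
  }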

Two (incorrect) assertions were removed from the code, and I don't
see an easy way to add similar runtime checks that would guard
against such races or memory corruption.

I did not review all accesses to the problematic globals; there are most
likely other problems (e.g. GL(dl_tls_max_dtv_idx) is modified without
atomics, tls access (__tls_get_addr) does not use atomics to access the
same globals, etc.).

Glibc does not try to do worst-case allocation of tls for existing
threads whenever a library is loaded, so allocation might be needed at
the first access to a tls object, which makes tls access non-as-safe
and able to crash on oom.  The dtv and tls of the modules that are
ignored in this patch will be lazily initialized.


Changelog:

2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #19329]
	* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
	atomically.
	* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
	GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
	(_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
	atomically.
  

Comments

Szabolcs Nagy Jan. 12, 2016, 11:02 a.m. UTC | #1
On 11/01/16 16:32, Szabolcs Nagy wrote:
> Further issues:
>
> dlclose also modifies the slotinfo list in unsafe ways and i don't
> immediately see a lockfree way to synchronize that with thread
> creation. (using the GL(dl_load_lock) at thread creation does not work
> because it can deadlock if pthread_create is called from a ctor while
> dlopen holds the lock.)
> i.e. this patch assumes dlclose is not called concurrently with
> pthread_create.
>

dlopen holding the lock during ctors is observably broken; i filed
https://sourceware.org/bugzilla/show_bug.cgi?id=19448

if that's fixed then pthread_create can grab the lock too and then
dlopen/dlclose/pthread_create are trivially synchronized without any
atomics.

the problem of non-as-safety of tls access still remains, but that's
a separate issue.
  
Carlos O'Donell Jan. 13, 2016, 2:34 a.m. UTC | #2
On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
> 
>     [BZ #19329]
>     * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>     atomically.
>     * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>     GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>     (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>     atomically.

You are headed in the right direction. I like where this patch is going,
but don't have enough time to review this in detail yet.

At first glance your patch lacks sufficient concurrency documentation to
be acceptable. You need to document which acquires the releases
synchronize with. Please grep for "CONCURRENCY NOTES" to see the level
of detail we need to maintain these kinds of changes.

Cheers,
Carlos.
  
Szabolcs Nagy Jan. 14, 2016, 3:09 p.m. UTC | #3
On 13/01/16 02:34, Carlos O'Donell wrote:
> On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
>> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>
>>      [BZ #19329]
>>      * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>>      atomically.
>>      * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>>      GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>>      (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>>      atomically.
>
> You are headed in the right direction. I like where this patch is going,
> but don't have enough time to review this in detail yet.
>

will there be time before 2.23?

> At first glance your patch lacks sufficient concurrency documentation to
> be acceptable. You need to document which acquires the releases
> synchronizes-with. Please grep for "CONCURRENCY NOTES" to see the level
> of detail we need to maintain these kinds of changes.
>

i wanted to avoid documenting all the mess in the dynamic linker,
but i can improve the comments.

i see the following bugs:

1) pthread_create is not synced with dlopen
2) pthread_create is not synced with dlclose
3) tls access is not synced with dlopen
4) tls access is not synced with dlclose
5) tls access can oom crash
6) tls access is not as-safe
7) dlopen holds a global lock during ctors

i can fix 1) by adding some atomics (this is the current patch).

for 2) that's not enough, because dlclose has to wait with the
free(link_map) while pthread_create is initializing the tls.

i guess 3) can be fixed similarly to 1) but i don't have a
test case for that.

the rest is harder to fix.

is it ok to only fix 1) for 2.23?
or should i try to fix 3) as well (races on the same globals)?
  
Adhemerval Zanella Netto Jan. 18, 2016, 6 p.m. UTC | #4
On 14-01-2016 13:09, Szabolcs Nagy wrote:
> On 13/01/16 02:34, Carlos O'Donell wrote:
>> On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
>>> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>>
>>>      [BZ #19329]
>>>      * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>>>      atomically.
>>>      * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>>>      GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>>>      (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>>>      atomically.
>>
>> You are headed in the right direction. I like where this patch is going,
>> but don't have enough time to review this in detail yet.
>>
> 
> will there be time before 2.23?

I intend to send a message about the hard freeze today, and the question is
whether this fix is suitable, i.e. will not show regressions in the following
month, and if so whether we can iron out the bugs.  So are you confident
about that?

I am asking you this because these synchronization issues are sometimes a
rat's nest, and I also noted you found multiple issues during the patch
submission and review.  Objections?

> 
>> At first glance your patch lacks sufficient concurrency documentation to
>> be acceptable. You need to document which acquires the releases
>> synchronizes-with. Please grep for "CONCURRENCY NOTES" to see the level
>> of detail we need to maintain these kinds of changes.
>>
> 
> i wanted to avoid documenting all the mess in the dynamic linker,
> but i can improve the comments.
> 
> i see the following bugs:
> 
> 1) pthread_create is not synced with dlopen
> 2) pthread_create is not synced with dlclose
> 3) tls access is not synced with dlopen
> 4) tls access is not synced with dlclose
> 5) tls access can oom crash
> 6) tls access is not as-safe
> 7) dlopen holds a global lock during ctors
> 
> i can fix 1) by adding some atomics (this is the current patch).
> 
> for 2) that's not enough, because dlclose has to wait with the
> free(link_map) while pthread_create is initializing the tls.

So which would be a better approach regarding it? 

> 
> i guess 3) can be fixed similarly to 1) but i don't have a
> test case for that.
> 
> the rest is harder to fix.
> 
> is it ok to only fix 1) for 2.23?
> or should i try to fix 3) as well (races on the same globals)?
> 

I would say we attack one problem at a time, even when they are related.
How hard do you consider it would be to fix 3)?
  
Szabolcs Nagy Jan. 18, 2016, 8:09 p.m. UTC | #5
On 18/01/16 18:00, Adhemerval Zanella wrote:
> On 14-01-2016 13:09, Szabolcs Nagy wrote:
>> On 13/01/16 02:34, Carlos O'Donell wrote:
>>> On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
>>>> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>>>
>>>>       [BZ #19329]
>>>>       * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>>>>       atomically.
>>>>       * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>>>>       GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>>>>       (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>>>>       atomically.
>>>
>>> You are headed in the right direction. I like where this patch is going,
>>> but don't have enough time to review this in detail yet.
>>>
>>
>> will there be time before 2.23?
>
> I intend to send a message about the hard freeze today and the question is
> if this fix is suitable to not show regression in the following mount and
> if so if we could iron out the bugs.  So are you confident about that?
>
> I am asking you this because this synchronization issues are a nest of rats
> sometimes and I also noted you found out multiple issues during the patch
> sending and reviewing. Objections?
>

i'm confident that i can fix the races i can
trigger (new threads vs dlopen), without creating
regressions (other than performance regression).

adding a handful of atomics is enough for that
(it should not alter the behaviour when there is
no conflicting memory access); however, it might
not be the best solution (too conservative) or i
might miss a few races etc.

>>
>>> At first glance your patch lacks sufficient concurrency documentation to
>>> be acceptable. You need to document which acquires the releases
>>> synchronizes-with. Please grep for "CONCURRENCY NOTES" to see the level
>>> of detail we need to maintain these kinds of changes.
>>>
>>
>> i wanted to avoid documenting all the mess in the dynamic linker,
>> but i can improve the comments.
>>
>> i see the following bugs:
>>
>> 1) pthread_create is not synced with dlopen
>> 2) pthread_create is not synced with dlclose
>> 3) tls access is not synced with dlopen
>> 4) tls access is not synced with dlclose
>> 5) tls access can oom crash
>> 6) tls access is not as-safe
>> 7) dlopen holds a global lock during ctors
>>
>> i can fix 1) by adding some atomics (this is the current patch).
>>
>> for 2) that's not enough, because dlclose has to wait with the
>> free(link_map) while pthread_create is initializing the tls.
>
> So which would be a better approach regarding it?
>

i don't know what the best approach is.

one possibility is to not free memory in dlclose
(i guess glibc does not want this).

another is to sync e.g. by using some lock, but it
cannot be GL(dl_load_lock) now because that lock is
held while ctors are running in dlopen, which may
call pthread_create.

of course using both atomics and locks is a bit
ugly..

>>
>> i guess 3) can be fixed similarly to 1) but i don't have a
>> test case for that.
>>
>> the rest is harder to fix.
>>
>> is it ok to only fix 1) for 2.23?
>> or should i try to fix 3) as well (races on the same globals)?
>>
>
> I would say we attack one problem at time, even when they are related.
> How hard do you consider to fix 3.?
>

i haven't looked at that in detail; it involves more
code starting from __tls_get_addr, but it is probably
less critical because the code tries to look only at
slotinfo entries up to a point (the module's gen number
where the accessed tls is) that is guaranteed to be
synced by a dlopen that has returned.
(which means fewer problems with concurrently loading
modules; however, GL(*) should be accessed with atomics.)
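
for reference, the fast path looks roughly like this (shortened from
elf/dl-tls.c; the GET_ADDR_* macros abstract over the two tls models):

  void *
  __tls_get_addr (GET_ADDR_ARGS)
  {
    dtv_t *dtv = THREAD_DTV ();

    /* Unsynchronized read of GL(dl_tls_generation): the dtv is
       updated only if the thread is behind the global generation.  */
    if (__glibc_unlikely (dtv[0].counter != GL(dl_tls_generation)))
      return update_get_addr (GET_ADDR_PARAM);

    void *p = dtv[GET_ADDR_MODULE].pointer.val;

    /* Lazy allocation on first access (the oom issue above).  */
    if (__glibc_unlikely (p == TLS_DTV_UNALLOCATED))
      return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);

    return (char *) p + GET_ADDR_OFFSET;
  }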

i can try to cook up something tomorrow with a detailed
concurrency note and then others can decide whether it's
too risky a change or not.
  
Adhemerval Zanella Netto Jan. 18, 2016, 8:16 p.m. UTC | #6
On 18-01-2016 18:09, Szabolcs Nagy wrote:
> On 18/01/16 18:00, Adhemerval Zanella wrote:
>> On 14-01-2016 13:09, Szabolcs Nagy wrote:
>>> On 13/01/16 02:34, Carlos O'Donell wrote:
>>>> On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
>>>>> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>>>>
>>>>>       [BZ #19329]
>>>>>       * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>>>>>       atomically.
>>>>>       * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>>>>>       GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>>>>>       (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>>>>>       atomically.
>>>>
>>>> You are headed in the right direction. I like where this patch is going,
>>>> but don't have enough time to review this in detail yet.
>>>>
>>>
>>> will there be time before 2.23?
>>
>> I intend to send a message about the hard freeze today and the question is
>> if this fix is suitable to not show regression in the following mount and
>> if so if we could iron out the bugs.  So are you confident about that?
>>
>> I am asking you this because this synchronization issues are a nest of rats
>> sometimes and I also noted you found out multiple issues during the patch
>> sending and reviewing. Objections?
>>
> 
> i'm confident that i can fix the races i can
> trigger (new threads vs dlopen), without creating
> regressions (other than performance regression).
> 
> adding a handful of atomics is enough for that
> (it should not alter the behaviour when there is
> no conflicting memory access), however it might
> not be the best solution (too conservative) or i
> might miss a few races etc.
> 

Fair enough, I have added it to the blockers for the 2.23 release.

>>>
>>>> At first glance your patch lacks sufficient concurrency documentation to
>>>> be acceptable. You need to document which acquires the releases
>>>> synchronizes-with. Please grep for "CONCURRENCY NOTES" to see the level
>>>> of detail we need to maintain these kinds of changes.
>>>>
>>>
>>> i wanted to avoid documenting all the mess in the dynamic linker,
>>> but i can improve the comments.
>>>
>>> i see the following bugs:
>>>
>>> 1) pthread_create is not synced with dlopen
>>> 2) pthread_create is not synced with dlclose
>>> 3) tls access is not synced with dlopen
>>> 4) tls access is not synced with dlclose
>>> 5) tls access can oom crash
>>> 6) tls access is not as-safe
>>> 7) dlopen holds a global lock during ctors
>>>
>>> i can fix 1) by adding some atomics (this is the current patch).
>>>
>>> for 2) that's not enough, because dlclose has to wait with the
>>> free(link_map) while pthread_create is initializing the tls.
>>
>> So which would be a better approach regarding it?
>>
> 
> i don't know what's the best approach
> 
> one possibility is to not free memory in dlclose
> (i guess glibc does not want this).
> 
> another is to sync e.g. by using some lock, but it
> cannot be GL(dl_load_lock) now because that is held
> during ctors are running in dlopen which may call
> pthread_create.
> 
> of course using both atomics and locks is a bit
> ugly..

Right, I think we will need to get back to this after the 2.23 release.

> 
>>>
>>> i guess 3) can be fixed similarly to 1) but i don't have a
>>> test case for that.
>>>
>>> the rest is harder to fix.
>>>
>>> is it ok to only fix 1) for 2.23?
>>> or should i try to fix 3) as well (races on the same globals)?
>>>
>>
>> I would say we attack one problem at time, even when they are related.
>> How hard do you consider to fix 3.?
>>
> 
> i haven't looked at that in detail, it involves more
> code starting from __tls_get_addr, but it is probably
> less critical because the code tries to only look at
> slotinfo entries up to a point (module's gen number where
> the accessed tls is) that is guaranteed to be synced by
> a dlopen that returned.
> (which means less problems with concurrently loading
> modules, however GL(*) should be accessed with atomics).
> 
> i can try to cook up something tomorrow with a detailed
> concurrency note and then others can decide if it's
> too risky change or not.
> 

Ok, let's delay the fix for 3. to 2.24 then.
  
Carlos O'Donell Jan. 20, 2016, 8:09 p.m. UTC | #7
On 01/14/2016 10:09 AM, Szabolcs Nagy wrote:
> On 13/01/16 02:34, Carlos O'Donell wrote:
>> On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
>>> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>>
>>>      [BZ #19329]
>>>      * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>>>      atomically.
>>>      * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>>>      GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>>>      (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>>>      atomically.
>>
>> You are headed in the right direction. I like where this patch is going,
>> but don't have enough time to review this in detail yet.
>>
> 
> will there be time before 2.23?

I don't think there is time to review this sufficiently to get it into
2.23; 2.24 yes, but not 2.23.

Even then I will try to do a round of review, and I've asked Torvald if
he has any spare cycles to help review this.

>> At first glance your patch lacks sufficient concurrency documentation to
>> be acceptable. You need to document which acquires the releases
>> synchronizes-with. Please grep for "CONCURRENCY NOTES" to see the level
>> of detail we need to maintain these kinds of changes.
>>
> 
> i wanted to avoid documenting all the mess in the dynamic linker,
> but i can improve the comments.

You must document *some* of the mess, and you must minimally document
everything you change in order to provide future documentation for
subsequent reviews of the concurrency code.

> i see the following bugs:
> 
> 1) pthread_create is not synced with dlopen
> 2) pthread_create is not synced with dlclose
> 3) tls access is not synced with dlopen
> 4) tls access is not synced with dlclose
> 5) tls access can oom crash
> 6) tls access is not as-safe
> 7) dlopen holds a global lock during ctors
> 
> i can fix 1) by adding some atomics (this is the current patch).
> 
> for 2) that's not enough, because dlclose has to wait with the
> free(link_map) while pthread_create is initializing the tls.
> 
> i guess 3) can be fixed similarly to 1) but i don't have a
> test case for that.
> 
> the rest is harder to fix.
> 
> is it ok to only fix 1) for 2.23?
> or should i try to fix 3) as well (races on the same globals)?

No no. We want to minimize the changes going into 2.23.

We should aim to fix only (1) right now if that is what you are
interested in fixing.

Everything we do should be incremental. At each step though we
might ask ourselves: How do we solve the more global problem
of getting rid of the load lock?

Do we all agree that the better future would be an atomic process
that can add loaded libraries to the link map list without having
to take a lock?

Cheers,
Carlos.
  
Torvald Riegel Jan. 21, 2016, 9:03 p.m. UTC | #8
On Mon, 2016-01-18 at 20:09 +0000, Szabolcs Nagy wrote:
> of course using both atomics and locks is a bit
> ugly..

Not necessarily.  If it works and makes sense otherwise, I would not
advise against it.  Some problems are easier to solve with locks, some
are not.  Mixing them isn't usually a big problem if one follows some
rules.  For example, if you access something with atomics, access it
with atomics everywhere, including in the critical sections.
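
A minimal sketch of that rule in C11 atomics (generic example, not
glibc code):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stddef.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static atomic_size_t counter;

  /* Lock-free fast path: readers use an acquire load.  */
  size_t read_counter (void)
  {
    return atomic_load_explicit (&counter, memory_order_acquire);
  }

  /* Slow path: the lock orders writers against each other, but the
     shared counter is still accessed with atomics inside the critical
     section, so the lock-free readers never race with a plain access.  */
  void bump_counter (void)
  {
    pthread_mutex_lock (&lock);
    size_t cur = atomic_load_explicit (&counter, memory_order_relaxed);
    atomic_store_explicit (&counter, cur + 1, memory_order_release);
    pthread_mutex_unlock (&lock);
  }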
  
Torvald Riegel Jan. 21, 2016, 9:10 p.m. UTC | #9
On Mon, 2016-01-18 at 16:00 -0200, Adhemerval Zanella wrote:
> I would say we attack one problem at time, even when they are related.

I think we should at least have something of a high-level plan for how
to fix related problems.  One will need that eventually unless the
different parts are indeed orthogonal, so it's not wasted work.  But if
we try to fix one part of a group of related problems to completion
before looking at the others, there's a risk of wasting time on the
details of a solution (e.g., in reviews) that might not work when
considering the whole problem space.

So, I think the best approach here is to first understand what we
actually want to solve and provide in terms of behavior, then decide on
a high-level scheme for how to do this (ie, the major atomicity /
ordering pieces), and then implementing parts of this scheme one at a
time (or fixing/adapting current code for this).
  
Torvald Riegel Jan. 21, 2016, 9:41 p.m. UTC | #10
On Wed, 2016-01-20 at 15:09 -0500, Carlos O'Donell wrote:
> Everything we do should be incremental. At each step though we
> might ask ourselves: How do we solve the more global problem
> of getting rid of the load lock?

IMO we should ask that.  A better question would probably be to ask why
we need the load lock; the lock is just a tool to implement some
intended high-level synchronization scheme.  We want to agree on this
scheme early on, otherwise it will be hard to judge correctness.

Note that it can be a perfectly fine approach to simply want to
implement the atomicity and ordering provided by some critical sections
differently -- but in this specific case, it sounds as if we'd like to
have a lock-less scheme anyway, and the current scheme doesn't work or
is not implemented correctly.  Both hint that we should go up in layers
of abstraction and revisit what we actually want.
  

Patch

diff --git a/elf/dl-open.c b/elf/dl-open.c
index 6f178b3..7315c3a 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -524,9 +524,15 @@  dl_open_worker (void *a)
     }
 
   /* Bump the generation number if necessary.  */
-  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
-    _dl_fatal_printf (N_("\
+  if (any_tls)
+    {
+      size_t newgen = GL(dl_tls_generation) + 1;
+      if (__builtin_expect (newgen == 0, 0))
+	_dl_fatal_printf (N_("\
 TLS generation counter wrapped!  Please report this."));
+      /* Synchronize with the load acquire in _dl_allocate_tls_init.  */
+      atomic_store_release (&GL(dl_tls_generation), newgen);
+    }
 
   /* We need a second pass for static tls data, because _dl_update_slotinfo
      must not be run while calls to _dl_add_to_slotinfo are still pending.  */
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index ed13fd9..0afbcab 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -455,9 +455,15 @@  _dl_allocate_tls_init (void *result)
   struct dtv_slotinfo_list *listp;
   size_t total = 0;
   size_t maxgen = 0;
+  size_t gen_count;
+  size_t dtv_slots;
 
+  /* Synchronize with dl_open_worker, concurrently loaded modules with
+     larger generation number should be ignored here.  */
+  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
   /* Check if the current dtv is big enough.   */
-  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
+  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
+  if (dtv[-1].counter < dtv_slots)
     {
       /* Resize the dtv.  */
       dtv = _dl_resize_dtv (dtv);
@@ -480,18 +486,23 @@  _dl_allocate_tls_init (void *result)
 	  void *dest;
 
 	  /* Check for the total number of used slots.  */
-	  if (total + cnt > GL(dl_tls_max_dtv_idx))
+	  if (total + cnt > dtv_slots)
 	    break;
 
-	  map = listp->slotinfo[cnt].map;
+	  /* Synchronize with _dl_add_to_slotinfo.  */
+	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
 	  if (map == NULL)
 	    /* Unused entry.  */
 	    continue;
 
+	  size_t gen = listp->slotinfo[cnt].gen;
+	  if (gen > gen_count)
+	    /* New, concurrently loaded entry.  */
+	    continue;
+
 	  /* Keep track of the maximum generation number.  This might
 	     not be the generation counter.  */
-	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
-	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
+	  maxgen = MAX (maxgen, gen);
 
 	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
 	  dtv[map->l_tls_modid].pointer.is_static = false;
@@ -518,11 +529,13 @@  _dl_allocate_tls_init (void *result)
 	}
 
       total += cnt;
-      if (total >= GL(dl_tls_max_dtv_idx))
+      if (total >= dtv_slots)
 	break;
 
-      listp = listp->next;
-      assert (listp != NULL);
+      /* Synchronize with _dl_add_to_slotinfo.  */
+      listp = atomic_load_acquire (&listp->next);
+      if (listp == NULL)
+	break;
     }
 
   /* The DTV version is up-to-date now.  */
@@ -916,7 +929,7 @@  _dl_add_to_slotinfo (struct link_map *l)
 	 the first slot.  */
       assert (idx == 0);
 
-      listp = prevp->next = (struct dtv_slotinfo_list *)
+      listp = (struct dtv_slotinfo_list *)
 	malloc (sizeof (struct dtv_slotinfo_list)
 		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
       if (listp == NULL)
@@ -939,9 +952,13 @@  cannot create TLS data structures"));
       listp->next = NULL;
       memset (listp->slotinfo, '\0',
 	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+      /* _dl_allocate_tls_init concurrently walks this list at thread
+	 creation, it must only observe initialized nodes in the list.  */
+      atomic_store_release (&prevp->next, listp);
     }
 
   /* Add the information into the slotinfo data structure.  */
-  listp->slotinfo[idx].map = l;
   listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
+  /* Synchronize with the acquire load in _dl_allocate_tls_init.  */
+  atomic_store_release (&listp->slotinfo[idx].map, l);
 }