Patchwork [BZ,#19329] Fix race between tls allocation at thread creation and dlopen

Submitter Szabolcs Nagy
Date Jan. 19, 2016, 6:28 p.m.
Message ID <569E8050.20606@arm.com>
Permalink /patch/10458/
State New
Headers show

Comments

Szabolcs Nagy - Jan. 19, 2016, 6:28 p.m.
On 13/01/16 02:34, Carlos O'Donell wrote:
> On 01/11/2016 11:32 AM, Szabolcs Nagy wrote:
>> 2016-01-11  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>
>>      [BZ #19329]
>>      * elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
>>      atomically.
>>      * elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
>>      GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
>>      (_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
>>      atomically.
>
> You are headed in the right direction. I like where this patch is going,
> but don't have enough time to review this in detail yet.
>
> At first glance your patch lacks sufficient concurrency documentation to
> be acceptable. You need to document which acquires the releases
> synchronize with. Please grep for "CONCURRENCY NOTES" to see the level
> of detail we need to maintain these kinds of changes.

Patch v3:

I added some concurrency notes.  There are some TODO items left in
the comments.

I decided not to change __tls_get_addr and dlclose code paths, this
patch only changes comments compared to v2.

Overall the patch only turns some loads/stores into atomics and
changes their order (but not in a way that can be observed without
concurrently running pthread_create and dlopen).  Two branches
are changed: if slotinfo[i].gen > GL(dl_tls_generation) then that
entry is skipped instead of aborting, and if listp->next == NULL then
the iteration is finished instead of aborting.

These changes might not be entirely safe: dlclose is still not
synced with pthread_create, and there might be other bugs that can
cause a large slotinfo[i].gen or an early listp->next == NULL; the
current glibc code would abort in those cases, while the new code
would continue without noticing.

Note that the current asserts do not cover the concurrency bugs
completely: concurrent dlclose can (even with the patch) lead to
use-after-free, and concurrently added modules can lead to a read of
an uninitialized slotinfo entry where map is already non-null but
gen is still 0 (and thus smaller than the current generation count),
which can lead to an out-of-bounds write to the DTV or other memory
corruption.  (If the asserts more usefully protected against memory
corruption, i'd be reluctant to remove them before a complete fix
for the concurrency issues.)

The acquire load of every slotinfo[i].map might be excessive use
of atomics in _dl_allocate_tls_init, but i think it is necessary
with the current design.

I'm OK if this gets rescheduled after 2.23.

2016-01-19  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #19329]
	* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation)
	atomically.
	* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
	GL(dl_tls_max_dtv_idx), slotinfo entries and listp->next atomically.
	(_dl_add_to_slotinfo): Write the slotinfo entries and listp->next
	atomically.
Torvald Riegel - Jan. 22, 2016, 1:02 a.m.
I've just looked at your patch so far and haven't analyzed all the
affected code.  Nonetheless, a few comments below;  so far, I would
rather postpone this to 2.23 because you do change a few things and I
don't see the level of detail in analysis and code comments that we'd
ideally like to see for concurrency-related changes.

On Tue, 2016-01-19 at 18:28 +0000, Szabolcs Nagy wrote:
> diff --git a/elf/dl-open.c b/elf/dl-open.c
> index 6f178b3..2b97605 100644
> --- a/elf/dl-open.c
> +++ b/elf/dl-open.c
> @@ -524,9 +524,16 @@ dl_open_worker (void *a)
>      }
>  
>    /* Bump the generation number if necessary.  */
> -  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
> -    _dl_fatal_printf (N_("\
> +  if (any_tls)
> +    {
> +      size_t newgen = GL(dl_tls_generation) + 1;

If you access a memory location with atomics somewhere, access it with
atomics everywhere (initialization is typically an exception).  This is
important because it's our way to deal with not having atomic types
right now (and so the compiler is aware, too), and because it's
important for readers of the code because it alerts them that this is
concurrent code.

> +      if (__builtin_expect (newgen == 0, 0))
> +	_dl_fatal_printf (N_("\
>  TLS generation counter wrapped!  Please report this."));
> +      /* Synchronize with the load acquire in _dl_allocate_tls_init.
> +	 See the CONCURRENCY NOTES there in dl-tls.c.  */
> +      atomic_store_release (&GL(dl_tls_generation), newgen);

Is there any reason you wouldn't want a release MO atomic increment
here?

> +    }
>  
>    /* We need a second pass for static tls data, because _dl_update_slotinfo
>       must not be run while calls to _dl_add_to_slotinfo are still pending.  */
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index ed13fd9..7184a54 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -443,6 +443,48 @@ _dl_resize_dtv (dtv_t *dtv)
>  }
>  
> 
> +/* CONCURRENCY NOTES:

This is a good start, but I think we need more detail, and we need to
more carefully describe the high-level synchronization scheme (which we
are still working on, of course).

> +   During dynamic TLS and DTV allocation and setup various objects may be
> +   accessed concurrently:
> +
> +     GL(dl_tls_max_dtv_idx)
> +     GL(dl_tls_generation)
> +     listp->slotinfo[i].map
> +     listp->slotinfo[i].gen
> +     listp->next
> +
> +   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
> +   APIs that may access them are
> +
> +     Writers: dlopen, dlclose and dynamic linker start up code.
> +     Readers only: pthread_create and __tls_get_addr (TLS access).

I would first describe what these functions / features intend to do and
what synchronization problem they have to solve before going into
details of how that's implemented.  This should be sufficiently detailed
to answer all questions such as "why do I need this critical section
described below?".

> +   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
> +   should be used when accessing these globals.

This covers the low-level data races, but it doesn't yet explain why
you'd want the critical sections.  Think about which atomicity and
ordering guarantees you need to make your high-level synchronization
scheme work.

> +   dl_open_worker (called from dlopen) for each loaded module increases
> +   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
> +   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
> +   the next generation number GL(dl_tls_generation)+1.  Then it increases
> +   GL(dl_tls_generation) which signals that the new slotinfo entries are ready.

You describe this as if it were sequential code.  Is it, or do you need
more detail?

> +   This last write is release mo so previous writes can be synchronized.

Terminology: the matching acquire MO load synchronizes with the release
MO write.  The writes sequenced before the release MO store then happen
before stuff sequenced after the acquire MO load.

Also, uppercase MO so it's clear it's an abbreviation.

> +   GL(dl_tls_max_dtv_idx) is always an upper bound of the modids of the ready
> +   entries.  The slotinfo list might be shorter than that during dlopen.
> +   Entries in the slotinfo list might have gen > GL(dl_tls_generation) and
> +   map == NULL.

That bit sounds like a candidate for the high-level scheme.

> +   _dl_allocate_tls_init is called from pthread_create and it looks through
> +   the slotinfo list to do the dynamic TLS and DTV setup for the new thread.
> +   It first loads the current GL(dl_tls_generation) with acquire mo and only
> +   considers modules up to that generation ignoring any later change to the
> +   slotinfo list.
> +
> +   TODO: Entries might get changed and freed in dlclose without sync.
> +   TODO: __tls_get_addr is not yet synchronized with dlopen and dlclose.
> +*/
> +
>  void *
>  internal_function
>  _dl_allocate_tls_init (void *result)
> @@ -455,9 +497,18 @@ _dl_allocate_tls_init (void *result)
>    struct dtv_slotinfo_list *listp;
>    size_t total = 0;
>    size_t maxgen = 0;
> -
> -  /* Check if the current dtv is big enough.   */
> -  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
> +  size_t gen_count;
> +  size_t dtv_slots;
> +
> +  /* Synchronize with the release mo store in dl_open_worker, modules with
> +     larger generation number are ignored.  */
> +  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
> +  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
> +     concurrently modified, but after the release mo store to

"after" in terms of which relation?  This is a good example of a place
where you could probably refer to a part of the high-level
synchronization scheme.

> +     GL(dl_tls_generation) it always remains a modid upper bound for
> +     previously loaded modules so relaxed access is enough.  */
> +  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
> +  if (dtv[-1].counter < dtv_slots)
>      {
>        /* Resize the dtv.  */
>        dtv = _dl_resize_dtv (dtv);
> @@ -480,18 +531,25 @@ _dl_allocate_tls_init (void *result)
>  	  void *dest;
>  
>  	  /* Check for the total number of used slots.  */
> -	  if (total + cnt > GL(dl_tls_max_dtv_idx))
> +	  if (total + cnt > dtv_slots)
>  	    break;
>  
> -	  map = listp->slotinfo[cnt].map;
> +	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
> +	     dlopen, so the generation number read below is for a valid entry.

Isn't it in dl-tls.c?

> +	     TODO: remove_slotinfo in dlclose is not synchronized.  */
> +	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
>  	  if (map == NULL)
>  	    /* Unused entry.  */
>  	    continue;

It might be worthwhile to point out why missing out on a concurrent
update to map would be okay here, or why it can't actually happen.

> +	  size_t gen = listp->slotinfo[cnt].gen;
> +	  if (gen > gen_count)
> +	    /* New, concurrently loaded entry.  */
> +	    continue;
> +
>  	  /* Keep track of the maximum generation number.  This might
>  	     not be the generation counter.  */
> -	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));

See my comments elsewhere in the thread.

> -	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
> +	  maxgen = MAX (maxgen, gen);
>  
>  	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
>  	  dtv[map->l_tls_modid].pointer.is_static = false;
> @@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
>  	}
>  
>        total += cnt;
> -      if (total >= GL(dl_tls_max_dtv_idx))
> +      if (total >= dtv_slots)

Could this have changed in the meantime, and thus be large enough?  You
change here what is potentially more than one memory access (depending
on what the compiler does to this, strictly speaking, sequential code)
to a single one in the beginning.

>  	break;
>  
> -      listp = listp->next;
> -      assert (listp != NULL);
> +      /* Synchronize with the release mo store in _dl_add_to_slotinfo
> +	 so only initialized slotinfo nodes are looked at.  */
> +      listp = atomic_load_acquire (&listp->next);

> +      if (listp == NULL)
> +	break;
>      }
>  
>    /* The DTV version is up-to-date now.  */
> @@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
>  	 the first slot.  */
>        assert (idx == 0);
>  
> -      listp = prevp->next = (struct dtv_slotinfo_list *)
> +      listp = (struct dtv_slotinfo_list *)
>  	malloc (sizeof (struct dtv_slotinfo_list)
>  		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
>        if (listp == NULL)

There's another store to dl_tls_generation around here that you're not
accessing atomically.

> @@ -939,9 +1000,15 @@ cannot create TLS data structures"));
>        listp->next = NULL;
>        memset (listp->slotinfo, '\0',
>  	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> +      /* _dl_allocate_tls_init concurrently walks this list at thread
> +	 creation, it must only observe initialized nodes in the list.

Another bit for the high-level scheme.

> +	 See the CONCURRENCY NOTES there.  */
> +      atomic_store_release (&prevp->next, listp);
>      }
>  
>    /* Add the information into the slotinfo data structure.  */
> -  listp->slotinfo[idx].map = l;
>    listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
> +  /* Synchronize with the acquire load in _dl_allocate_tls_init.

Say why you need to enforce this, or point to part in the higher-level
scheme where this is explained.

> +     See the CONCURRENCY NOTES there.  */
> +  atomic_store_release (&listp->slotinfo[idx].map, l);
>  }
Szabolcs Nagy - Jan. 22, 2016, 7:31 p.m.
On 22/01/16 01:02, Torvald Riegel wrote:
> I've just looked at your patch so far and haven't analyzed all the
> affected code.  Nonetheless, a few comments below;  so far, I would

thanks

> rather postpone this to 2.23 because you do change a few things and I

ok

>> -  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
>> -    _dl_fatal_printf (N_("\
>> +  if (any_tls)
>> +    {
>> +      size_t newgen = GL(dl_tls_generation) + 1;
>
> If you access a memory location with atomics somewhere, access it with
> atomics everywhere (initialization is typically an exception).  This is

ok

> important because it's our way to deal with not having atomic types
> right now (and so the compiler is aware, too), and because it's
> important for readers of the code because it alerts them that this is
> concurrent code.
>
>> +      if (__builtin_expect (newgen == 0, 0))
>> +	_dl_fatal_printf (N_("\
>>   TLS generation counter wrapped!  Please report this."));
>> +      /* Synchronize with the load acquire in _dl_allocate_tls_init.
>> +	 See the CONCURRENCY NOTES there in dl-tls.c.  */
>> +      atomic_store_release (&GL(dl_tls_generation), newgen);
>
> Is there any reason you wouldn't want a release MO atomic increment
> here?

i wanted to avoid storing an overflowing counter.
(didn't want to analyze that case)

>> +   During dynamic TLS and DTV allocation and setup various objects may be
>> +   accessed concurrently:
>> +
>> +     GL(dl_tls_max_dtv_idx)
>> +     GL(dl_tls_generation)
>> +     listp->slotinfo[i].map
>> +     listp->slotinfo[i].gen
>> +     listp->next
>> +
>> +   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
>> +   APIs that may access them are
>> +
>> +     Writers: dlopen, dlclose and dynamic linker start up code.
>> +     Readers only: pthread_create and __tls_get_addr (TLS access).
>
> I would first describe what these functions / features intend to do and
> what synchronization problem they have to solve before going into
> details of how that's implemented.  This should be sufficiently detailed
> to answer all questions such as "why do I need this critical section
> described below?".

i probably don't know all the details of that.

>> +   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
>> +   should be used when accessing these globals.
>
> This covers the low-level data races, but it doesn't yet explain why
> you'd want the critical sections.  Think about which atomicity and
> ordering guarantees you need to make your high-level synchronization
> scheme work.

i can write more about the synchronizations i added,
but not much about what other state dl_load_lock
protects.

>> +   dl_open_worker (called from dlopen) for each loaded module increases
>> +   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
>> +   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
>> +   the next generation number GL(dl_tls_generation)+1.  Then it increases
>> +   GL(dl_tls_generation) which signals that the new slotinfo entries are ready.
>
> You describe this as if it were sequential code.  Is it, or do you need
> more detail?

yes this is sequential code in dlopen.
(but scattered around in many functions so it
is not immediately obvious)

>> +   This last write is release mo so previous writes can be synchronized.
>
> Terminology: the matching acquire MO load synchronizes with the release
> MO write.  The writes sequenced before the release MO store then happen
> before stuff sequenced after the acquire MO load.
>
> Also, uppercase MO so it's clear it's an abbreviation.

ok

>> -
>> -  /* Check if the current dtv is big enough.   */
>> -  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
>> +  size_t gen_count;
>> +  size_t dtv_slots;
>> +
>> +  /* Synchronize with the release mo store in dl_open_worker, modules with
>> +     larger generation number are ignored.  */
>> +  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
>> +  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
>> +     concurrently modified, but after the release mo store to
>
> "after" in terms of which relation?  This is a good example of a place
> where you could probably refer to a part of the high-level
> synchronization scheme.

if the load of *_idx "happens after" the store of *_generation
then it is an upper bound to modids at that generation.

>> +     GL(dl_tls_generation) it always remains a modid upper bound for
>> +     previously loaded modules so relaxed access is enough.  */
>> +  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
...
>> -	  map = listp->slotinfo[cnt].map;
>> +	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
>> +	     dlopen, so the generation number read below is for a valid entry.
>
> Isn't it in dl-tls.c?

yes, but it is called from dlopen.

>> +	     TODO: remove_slotinfo in dlclose is not synchronized.  */
>> +	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
>>   	  if (map == NULL)
>>   	    /* Unused entry.  */
>>   	    continue;
>
> It might be worthwhile to point out why missing out on a concurrent
> update to map would be okay here, or why it can't actually happen.

ok

concurrent updates of the relevant maps don't happen
(except in dlclose.. but that bug is not fixed)

only slotinfo entries that should be ignored here may
change concurrently (and the dtv/tls of those are
initialized lazily at first access).

>> +	  size_t gen = listp->slotinfo[cnt].gen;
>> +	  if (gen > gen_count)
>> +	    /* New, concurrently loaded entry.  */
>> +	    continue;
>> +
>>   	  /* Keep track of the maximum generation number.  This might
>>   	     not be the generation counter.  */
>> -	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
>
> See my comments elsewhere in the thread.

i can keep the assert but the gen > gen_count cases
must be skipped here so the only way the assert can
fail is if the generation number overflows.
(otherwise gen <= gen_count <= GL(dl_tls_generation) holds)

>> -	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
>> +	  maxgen = MAX (maxgen, gen);
>>
>>   	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
>>   	  dtv[map->l_tls_modid].pointer.is_static = false;
>> @@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
>>   	}
>>
>>         total += cnt;
>> -      if (total >= GL(dl_tls_max_dtv_idx))
>> +      if (total >= dtv_slots)
>
> Could this have changed in the meantime, and thus be large enough?  You
> change here what is potentially more than one memory access (depending
> on what the compiler does to this, strictly speaking, sequential code)
> to a single one in the beginning.

dtv_slots at the beginning is already an upper bound to
the relevant modids (and that's how much dtv we allocated),
so we can finish the iteration if total visited modules
grows beyond that.

rereading it does not cause a problem, but it does not
help either (it can be smaller or larger as well, depending
on what dlopen/dlclose happened concurrently; the
slotinfo list does not shrink though)

>> @@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
>>   	 the first slot.  */
>>         assert (idx == 0);
>>
>> -      listp = prevp->next = (struct dtv_slotinfo_list *)
>> +      listp = (struct dtv_slotinfo_list *)
>>   	malloc (sizeof (struct dtv_slotinfo_list)
>>   		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
>>         if (listp == NULL)
>
> There's another store to dl_tls_generation around here that you're not
> accessing atomically.
>

yes, the oom code path

that logic looked broken: it does not check for overflow (bug),
it leaks memory (another bug), and the comment roughly says
"the application will crash anyway", so i wanted to avoid messing
with it.

but yes that should be atomic too.

>> @@ -939,9 +1000,15 @@ cannot create TLS data structures"));
>>         listp->next = NULL;
>>         memset (listp->slotinfo, '\0',
>>   	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
>> +      /* _dl_allocate_tls_init concurrently walks this list at thread
>> +	 creation, it must only observe initialized nodes in the list.
>
> Another bit for the high-level scheme.
>
>> +	 See the CONCURRENCY NOTES there.  */
>> +      atomic_store_release (&prevp->next, listp);
>>       }
>>
>>     /* Add the information into the slotinfo data structure.  */
>> -  listp->slotinfo[idx].map = l;
>>     listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
>> +  /* Synchronize with the acquire load in _dl_allocate_tls_init.
>
> Say why you need to enforce this, or point to part in the higher-level
> scheme where this is explained.

_dl_allocate_tls_init walks the slotinfo list beyond
the entries it actually cares about, the extra entries
are either unused (zero) or initialized here with
large gen number.

so it must be possible to recognize those extra entries.

i used synchronization so that either map==0 or gen>gen_count is
visible to _dl_allocate_tls_init for new entries.

however now i see yet another bug here: the gen can
overflow and become 0 that way too, so the right fix
is to check for (map==0 || gen==0 || gen>gen_count);
then no synchronization is needed, only relaxed atomic access.

but i have to verify gen==0 cannot happen otherwise
so it's ok to skip it in _dl_allocate_tls_init.

>> +     See the CONCURRENCY NOTES there.  */
>> +  atomic_store_release (&listp->slotinfo[idx].map, l);
>>   }
>
>
>
Torvald Riegel - Jan. 25, 2016, 1:03 p.m.
On Fri, 2016-01-22 at 19:31 +0000, Szabolcs Nagy wrote:
> On 22/01/16 01:02, Torvald Riegel wrote:
> > important because it's our way to deal with not having atomic types
> > right now (and so the compiler is aware, too), and because it's
> > important for readers of the code because it alerts them that this is
> > concurrent code.
> >
> >> +      if (__builtin_expect (newgen == 0, 0))
> >> +	_dl_fatal_printf (N_("\
> >>   TLS generation counter wrapped!  Please report this."));
> >> +      /* Synchronize with the load acquire in _dl_allocate_tls_init.
> >> +	 See the CONCURRENCY NOTES there in dl-tls.c.  */
> >> +      atomic_store_release (&GL(dl_tls_generation), newgen);
> >
> > Is there any reason you wouldn't want a release MO atomic increment
> > here?
> 
> i wanted to avoid storing an overflowing counter.
> (didn't want to analyze that case)

If there are concurrent increments, not using an atomic increment will
lose updates.  If that can happen, the code should point out why that's
okay.

Possible countermeasures are using a CAS loop (less scalable if there is
contention (is there?) but simple) or setting the overflow threshold low
enough and implementing an anti-overflow mechanism so that the maximum
number of threads that can run concurrently can never actually cause an
overflow (not simple, but also not too hard; see the new barrier, for
example).

> >> +   During dynamic TLS and DTV allocation and setup various objects may be
> >> +   accessed concurrently:
> >> +
> >> +     GL(dl_tls_max_dtv_idx)
> >> +     GL(dl_tls_generation)
> >> +     listp->slotinfo[i].map
> >> +     listp->slotinfo[i].gen
> >> +     listp->next
> >> +
> >> +   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
> >> +   APIs that may access them are
> >> +
> >> +     Writers: dlopen, dlclose and dynamic linker start up code.
> >> +     Readers only: pthread_create and __tls_get_addr (TLS access).
> >
> > I would first describe what these functions / features intend to do and
> > what synchronization problem they have to solve before going into
> > details of how that's implemented.  This should be sufficiently detailed
> > to answer all questions such as "why do I need this critical section
> > described below?".
> 
> i probably don't know all the details of that.

That's okay given that this is a complex matter and we're still
investigating all its parts.  Nonetheless, if you can't describe the
solution at a high level because you don't know all the details, it's
(1) a good indication that this needs more work (IOW, "if you can't
explain it, it may not be ready") and (2) worth pointing out in a
comment in the code so that everyone else is aware that this piece of
code is still WIP.

> >> +   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
> >> +   should be used when accessing these globals.
> >
> > This covers the low-level data races, but it doesn't yet explain why
> > you'd want the critical sections.  Think about which atomicity and
> > ordering guarantees you need to make your high-level synchronization
> > scheme work.
> 
> i can write more about the synchronizations i added,
> but not much about what other state dl_load_lock
> protects.

Either should be an improvement.  What I wanted to ask for is not just
more information, but also how it's presented and structured.  For
example, knowing which state a lock "protects" is just a bit of the
information necessary; it often makes more sense to think about the
high-level atomicity and ordering constraints one needs first, and then
see how to solve that with mutual exclusion and locks, and then it falls
out which data (or rather logical state and state transitions!) are
"protected" by a lock.

> >> +   dl_open_worker (called from dlopen) for each loaded module increases
> >> +   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
> >> +   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
> >> +   the next generation number GL(dl_tls_generation)+1.  Then it increases
> >> +   GL(dl_tls_generation) which signals that the new slotinfo entries are ready.
> >
> > You describe this as if it were sequential code.  Is it, or do you need
> > more detail?
> 
> yes this is sequential code in dlopen.
> (but scattered around in many functions so it
> is not immediately obvious)
> 
> >> +   This last write is release mo so previous writes can be synchronized.
> >
> > Terminology: the matching acquire MO load synchronizes with the release
> > MO write.  The writes sequenced before the release MO store then happen
> > before stuff sequenced after the acquire MO load.
> >
> > Also, uppercase MO so it's clear it's an abbreviation.
> 
> ok
> 
> >> -
> >> -  /* Check if the current dtv is big enough.   */
> >> -  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
> >> +  size_t gen_count;
> >> +  size_t dtv_slots;
> >> +
> >> +  /* Synchronize with the release mo store in dl_open_worker, modules with
> >> +     larger generation number are ignored.  */
> >> +  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
> >> +  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
> >> +     concurrently modified, but after the release mo store to
> >
> > "after" in terms of which relation?  This is a good example of a place
> > where you could probably refer to a part of the high-level
> > synchronization scheme.
> 
> if the load of *_idx "happens after" the store of *_generation
> then it is an upper bound to modids at that generation.

The "happens after" is either enforced by something, and then you should
refer to that piece.  Or if the loads reads from (as in the reads-from
relation in the memory model) the store, then it either may itself
enforce a happens-before, or use something else.

IOW, I still don't understand what you really mean :)

> >> +     GL(dl_tls_generation) it always remains a modid upper bound for
> >> +     previously loaded modules so relaxed access is enough.  */
> >> +  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
> ...
> >> -	  map = listp->slotinfo[cnt].map;
> >> +	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
> >> +	     dlopen, so the generation number read below is for a valid entry.
> >
> > Isn't it in dl-tls.c?
> 
> yes, but it is called from dlopen.

So please say that it is called from dlopen, not "in dlopen".

> >> +	     TODO: remove_slotinfo in dlclose is not synchronized.  */
> >> +	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
> >>   	  if (map == NULL)
> >>   	    /* Unused entry.  */
> >>   	    continue;
> >
> > It might be worthwhile to point out why missing out on a concurrent
> > update to map would be okay here, or why it can't actually happen.
> 
> ok
> 
> concurrent update of the relevant maps don't happen
> (except in dlclose.. but that bug is not fixed)
> 
> only slotinfo entries that should be ignored here may
> change concurrently (and the dtv/tls of those are
> initialized lazily at first access).
> 
> >> +	  size_t gen = listp->slotinfo[cnt].gen;
> >> +	  if (gen > gen_count)
> >> +	    /* New, concurrently loaded entry.  */
> >> +	    continue;
> >> +
> >>   	  /* Keep track of the maximum generation number.  This might
> >>   	     not be the generation counter.  */
> >> -	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
> >
> > See my comments elsewhere in the thread.
> 
> i can keep the assert but the gen > gen_count cases
> must be skipped here so the only way the assert can
> fail is if the generation number overflows.
> (otherwise gen <= gen_count <= GL(dl_tls_generation) holds)
> 
> >> -	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
> >> +	  maxgen = MAX (maxgen, gen);
> >>
> >>   	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
> >>   	  dtv[map->l_tls_modid].pointer.is_static = false;
> >> @@ -518,11 +576,14 @@ _dl_allocate_tls_init (void *result)
> >>   	}
> >>
> >>         total += cnt;
> >> -      if (total >= GL(dl_tls_max_dtv_idx))
> >> +      if (total >= dtv_slots)
> >
> > Could this have changed in the meantime, and thus be large enough?  You
> > change here what is potentially more than one memory access (depending
> > on what the compiler does to this, strictly speaking, sequential code)
> > to a single one in the beginning.
> 
> dtv_slots at the beginning is already an upper bound to
> the relevant modids (and that's how much dtv we allocated),
> so we can finish the iteration if total visited modules
> grows beyond that.
> 
> rereading it does not cause a problem, but it does not
> help either (can be smaller or larger as well, depending
> on what dlopen/dlclose happened concurrently, the
> slotinfo list does not shrink though)

While we can discuss the analysis here, eventually the outcome needs to
be included as comments in the code, as part of the patch.  If the
intent / high-level scheme isn't obvious to see (which is the case given
that we have to discuss it here), then it should be documented, IMO.

> >> @@ -916,7 +977,7 @@ _dl_add_to_slotinfo (struct link_map *l)
> >>   	 the first slot.  */
> >>         assert (idx == 0);
> >>
> >> -      listp = prevp->next = (struct dtv_slotinfo_list *)
> >> +      listp = (struct dtv_slotinfo_list *)
> >>   	malloc (sizeof (struct dtv_slotinfo_list)
> >>   		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> >>         if (listp == NULL)
> >
> > There's another store to dl_tls_generation around here that you're not
> > accessing atomically.
> >
> 
> yes, the oom code path
> 
> that logic looked broken: it does not check for overflow (a bug),
> it leaks memory (another bug), and the comment roughly says
> "the application will crash anyway", so i wanted to avoid messing
> with it.
> 
> but yes, that store should be atomic too.
> 
> >> @@ -939,9 +1000,15 @@ cannot create TLS data structures"));
> >>         listp->next = NULL;
> >>         memset (listp->slotinfo, '\0',
> >>   	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> >> +      /* _dl_allocate_tls_init concurrently walks this list at thread
> >> +	 creation, it must only observe initialized nodes in the list.
> >
> > Another bit for the high-level scheme.
> >
> >> +	 See the CONCURRENCY NOTES there.  */
> >> +      atomic_store_release (&prevp->next, listp);
> >>       }
> >>
> >>     /* Add the information into the slotinfo data structure.  */
> >> -  listp->slotinfo[idx].map = l;
> >>     listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
> >> +  /* Synchronize with the acquire load in _dl_allocate_tls_init.
> >
> > Say why you need to enforce this, or point to part in the higher-level
> > scheme where this is explained.
> 
> _dl_allocate_tls_init walks the slotinfo list beyond
> the entries it actually cares about, the extra entries
> are either unused (zero) or initialized here with
> large gen number.
> 
> so those extra entries must be possible to recognize.
> 
> i used sync so either map==0 or gen>gen_count is
> visible to _dl_allocate_tls_init for new entries.
> 
> however now i see yet another bug here: the gen can
> overflow and become 0 that way too, so the right fix
> is to check for (map==0 || gen==0 || gen>gen_count)
> and then no sync is needed, only relaxed atomic access.
> 
> but i have to verify gen==0 cannot happen otherwise
> so it's ok to skip it in _dl_allocate_tls_init.

You'll probably have noticed that I'm asking several small corner-case
questions.  In my experience, doing that is helpful for figuring out
whether the synchronization really works correctly all the time.  And a
good way to do this by yourself, before you send a patch, is to try to
write a thorough explanation of the concurrent code you're working on.
I've always found this helpful.  The additional benefit is that you'll
have comments ready so others can understand your code more easily.  If
you have a thorough explanation, you also already have answers for all
the little questions, which may or may not(!) have obvious answers.
Therefore, I'd strongly suggest working on explanations too, even if
that doesn't yield a larger LOC count.  But the complexity of concurrent
code cannot be measured in LOC anyway.

Patch

diff --git a/elf/dl-open.c b/elf/dl-open.c
index 6f178b3..2b97605 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -524,9 +524,16 @@  dl_open_worker (void *a)
     }
 
   /* Bump the generation number if necessary.  */
-  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
-    _dl_fatal_printf (N_("\
+  if (any_tls)
+    {
+      size_t newgen = GL(dl_tls_generation) + 1;
+      if (__builtin_expect (newgen == 0, 0))
+	_dl_fatal_printf (N_("\
 TLS generation counter wrapped!  Please report this."));
+      /* Synchronize with the load acquire in _dl_allocate_tls_init.
+	 See the CONCURRENCY NOTES there in dl-tls.c.  */
+      atomic_store_release (&GL(dl_tls_generation), newgen);
+    }
 
   /* We need a second pass for static tls data, because _dl_update_slotinfo
      must not be run while calls to _dl_add_to_slotinfo are still pending.  */
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index ed13fd9..7184a54 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -443,6 +443,48 @@  _dl_resize_dtv (dtv_t *dtv)
 }
 
 
+/* CONCURRENCY NOTES:
+
+   During dynamic TLS and DTV allocation and setup various objects may be
+   accessed concurrently:
+
+     GL(dl_tls_max_dtv_idx)
+     GL(dl_tls_generation)
+     listp->slotinfo[i].map
+     listp->slotinfo[i].gen
+     listp->next
+
+   where listp is a node in the GL(dl_tls_dtv_slotinfo_list) list.  The public
+   APIs that may access them are
+
+     Writers: dlopen, dlclose and dynamic linker start up code.
+     Readers only: pthread_create and __tls_get_addr (TLS access).
+
+   The writers hold the GL(dl_load_lock), but the readers don't, so atomics
+   should be used when accessing these globals.
+
+   dl_open_worker (called from dlopen) for each loaded module increases
+   GL(dl_tls_max_dtv_idx), sets the link_map of the module up, adds a new
+   slotinfo entry to GL(dl_tls_dtv_slotinfo_list) with the new link_map and
+   the next generation number GL(dl_tls_generation)+1.  Then it increases
+   GL(dl_tls_generation) which signals that the new slotinfo entries are ready.
+   This last write is release mo: it synchronizes with the acquire loads in
+   the readers, making the earlier writes visible to them.
+
+   GL(dl_tls_max_dtv_idx) is always an upper bound of the modids of the ready
+   entries.  The slotinfo list might be shorter than that during dlopen.
+   Entries in the slotinfo list might have gen > GL(dl_tls_generation) and
+   map == NULL.
+
+   _dl_allocate_tls_init is called from pthread_create and it looks through
+   the slotinfo list to do the dynamic TLS and DTV setup for the new thread.
+   It first loads the current GL(dl_tls_generation) with acquire mo and only
+   considers modules up to that generation, ignoring any later change to the
+   slotinfo list.
+
+   TODO: Entries might get changed and freed in dlclose without sync.
+   TODO: __tls_get_addr is not yet synchronized with dlopen and dlclose.
+*/
+
 void *
 internal_function
 _dl_allocate_tls_init (void *result)
@@ -455,9 +497,18 @@  _dl_allocate_tls_init (void *result)
   struct dtv_slotinfo_list *listp;
   size_t total = 0;
   size_t maxgen = 0;
-
-  /* Check if the current dtv is big enough.   */
-  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
+  size_t gen_count;
+  size_t dtv_slots;
+
+  /* Synchronize with the release mo store in dl_open_worker, modules with
+     larger generation number are ignored.  */
+  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
+  /* Check if the current dtv is big enough.  GL(dl_tls_max_dtv_idx) is
+     concurrently modified, but after the release mo store to
+     GL(dl_tls_generation) it always remains a modid upper bound for
+     previously loaded modules so relaxed access is enough.  */
+  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
+  if (dtv[-1].counter < dtv_slots)
     {
       /* Resize the dtv.  */
       dtv = _dl_resize_dtv (dtv);
@@ -480,18 +531,25 @@  _dl_allocate_tls_init (void *result)
 	  void *dest;
 
 	  /* Check for the total number of used slots.  */
-	  if (total + cnt > GL(dl_tls_max_dtv_idx))
+	  if (total + cnt > dtv_slots)
 	    break;
 
-	  map = listp->slotinfo[cnt].map;
+	  /* Synchronize with the release mo store in _dl_add_to_slotinfo in
+	     dlopen, so the generation number read below is for a valid entry.
+	     TODO: remove_slotinfo in dlclose is not synchronized.  */
+	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
 	  if (map == NULL)
 	    /* Unused entry.  */
 	    continue;
 
+	  size_t gen = listp->slotinfo[cnt].gen;
+	  if (gen > gen_count)
+	    /* New, concurrently loaded entry.  */
+	    continue;
+
 	  /* Keep track of the maximum generation number.  This might
 	     not be the generation counter.  */
-	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
-	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
+	  maxgen = MAX (maxgen, gen);
 
 	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
 	  dtv[map->l_tls_modid].pointer.is_static = false;
@@ -518,11 +576,14 @@  _dl_allocate_tls_init (void *result)
 	}
 
       total += cnt;
-      if (total >= GL(dl_tls_max_dtv_idx))
+      if (total >= dtv_slots)
 	break;
 
-      listp = listp->next;
-      assert (listp != NULL);
+      /* Synchronize with the release mo store in _dl_add_to_slotinfo
+	 so only initialized slotinfo nodes are looked at.  */
+      listp = atomic_load_acquire (&listp->next);
+      if (listp == NULL)
+	break;
     }
 
   /* The DTV version is up-to-date now.  */
@@ -916,7 +977,7 @@  _dl_add_to_slotinfo (struct link_map *l)
 	 the first slot.  */
       assert (idx == 0);
 
-      listp = prevp->next = (struct dtv_slotinfo_list *)
+      listp = (struct dtv_slotinfo_list *)
 	malloc (sizeof (struct dtv_slotinfo_list)
 		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
       if (listp == NULL)
@@ -939,9 +1000,15 @@  cannot create TLS data structures"));
       listp->next = NULL;
       memset (listp->slotinfo, '\0',
 	      TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+      /* _dl_allocate_tls_init concurrently walks this list at thread
+	 creation, it must only observe initialized nodes in the list.
+	 See the CONCURRENCY NOTES there.  */
+      atomic_store_release (&prevp->next, listp);
     }
 
   /* Add the information into the slotinfo data structure.  */
-  listp->slotinfo[idx].map = l;
   listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
+  /* Synchronize with the acquire load in _dl_allocate_tls_init.
+     See the CONCURRENCY NOTES there.  */
+  atomic_store_release (&listp->slotinfo[idx].map, l);
 }