[BZ,#19329] Fix race between tls allocation at thread creation and dlopen

Message ID 568D5E11.3010301@arm.com
State: New, archived

Commit Message

Szabolcs Nagy Jan. 6, 2016, 6:33 p.m. UTC
  I've seen the following failure:

Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= 
_rtld_local._dl_tls_generation' failed!

It seems dlopen modifies tls-related dynamic linker data structures in
dl_open_worker if a shared library with tls is loaded, while holding
GL(dl_load_lock).

Meanwhile, at thread creation the same globals are accessed when the dtv
and tls are set up for the new thread in _dl_allocate_tls_init, without
holding any locks.

At least the following objects may have conflicting access:

   GL(dl_tls_max_dtv_idx)
   GL(dl_tls_generation)
   listp->slotinfo[i].map
   listp->slotinfo[i].gen
   listp->next

The race window that triggers the assert failure above is short compared
to dlopen and thread creation, so it rarely happens, but the probability
can be increased by loading a long chain of dependent shared objects
with tls.  Then all the loaded dsos get the same generation number,
GL(dl_tls_generation)+1, in the slotinfo list and GL(dl_tls_generation)
gets incremented only after that, so the assertion can fail in a
concurrently created thread reading the new slotinfos.
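
In other words, before this patch the two paths look roughly like this
(a simplified sketch based on the code touched by the patch below, not
the verbatim glibc source):

  /* dlopen path (dl_open_worker), holding GL(dl_load_lock);
     _dl_add_to_slotinfo runs once per newly loaded dso with tls:  */
  listp->slotinfo[idx].map = l;
  listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
  /* ... and only after all entries have been added:  */
  ++GL(dl_tls_generation);

  /* pthread_create path (_dl_allocate_tls_init), no lock held:  */
  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));

A thread created after the new slotinfo entries are written but before
the final increment sees entries with gen == GL(dl_tls_generation) + 1,
so the assert fires.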

My fix loads the current number of dtv slots, GL(dl_tls_max_dtv_idx),
and the current generation counter, GL(dl_tls_generation), at the
beginning of _dl_allocate_tls_init and then ignores any slotinfo
entry that got added concurrently.  So instead of taking
GL(dl_load_lock), atomics are used to avoid the races.

I think various other accesses to the objects in the list above should
be atomic as well, but that rabbit hole is deeper than what a quick
fix could patch up.

I can trigger the bug on x86_64 by loading a chain of 50 dsos and
concurrently creating >1000 threads, but I cannot easily turn that
into a glibc test case.  (On aarch64 the failure happens to be more
common.)

A related issue is that glibc does not try to do worst-case allocation
of tls for existing threads whenever a library is loaded, so further
allocation may be needed at the first tls access, which makes that
access not as-safe and can crash on OOM.  This lazy allocation allows
me to ignore the slotinfo of concurrently loaded dsos in
_dl_allocate_tls_init; those will be lazily initialized.
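
For reference, the deferral itself is visible in the unchanged context
lines of the patch below; roughly:

  /* Thread creation only records that the module's block is not
     allocated yet:  */
  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
  dtv[map->l_tls_modid].pointer.is_static = false;

  /* The first tls access then goes through __tls_get_addr, which sees
     TLS_DTV_UNALLOCATED and only then allocates and initializes the
     block; that allocation can fail and is not as-safe, which is the
     related issue described above.  */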

Changelog:

2016-01-06  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	[BZ #19329]
	* elf/dl-open.c (dl_open_worker): Write GL(dl_tls_generation) atomically.
	* elf/dl-tls.c (_dl_allocate_tls_init): Read GL(dl_tls_generation),
	GL(dl_tls_max_dtv_idx) and slotinfo entries atomically.
	(_dl_add_to_slotinfo): Write the slotinfo entry atomically.
  

Comments

Szabolcs Nagy Jan. 7, 2016, 4:46 p.m. UTC | #1
On 06/01/16 18:33, Szabolcs Nagy wrote:
> At least the following objects may have conflicting access:
>
>    GL(dl_tls_max_dtv_idx)
>    GL(dl_tls_generation)
>    listp->slotinfo[i].map
>    listp->slotinfo[i].gen
>    listp->next
>
...
> I think various other accesses to the objects in the list above should
> be atomic as well, but that rabbit hole is deeper than what a quick
> fix could patch up.

i thought i could get away without making the listp->next
access atomic, but now i see that can fail too

>     /* Check if the current dtv is big enough.   */
> -  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
> +  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
> +  if (dtv[-1].counter < dtv_slots)
>       {
>         /* Resize the dtv.  */
>         dtv = _dl_resize_dtv (dtv);
> @@ -470,6 +473,7 @@ _dl_allocate_tls_init (void *result)
>        TLS.  For those which are dynamically loaded we add the values
>        indicating deferred allocation.  */
>     listp = GL(dl_tls_dtv_slotinfo_list);
> +  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
>     while (1)
>       {
>         size_t cnt;
> @@ -480,18 +484,22 @@ _dl_allocate_tls_init (void *result)
>   	  void *dest;
>
>   	  /* Check for the total number of used slots.  */
> -	  if (total + cnt > GL(dl_tls_max_dtv_idx))
> +	  if (total + cnt > dtv_slots)
>   	    break;
....
>         total += cnt;
> -      if (total >= GL(dl_tls_max_dtv_idx))
> +      if (total >= dtv_slots)
>   	break;
>
>         listp = listp->next;

here.

a single listp node can hold 64 slotinfo entries; if more
than 64 libraries with tls are loaded then GL(dl_tls_max_dtv_idx) > 64,
but the updated listp->next pointer might not be visible here.

i will update the patch.
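
for illustration, the reader side would presumably need something like
this when advancing to the next list node (a sketch only, using the
same atomic_* wrappers as the rest of the patch; prevp/newp are made-up
names for the writer side):

  /* _dl_allocate_tls_init, advancing to the next node:  */
  listp = atomic_load_acquire (&listp->next);
  if (listp == NULL)
    /* Nodes appended concurrently are ignored, like new slotinfo
       entries.  */
    break;

  /* Writer side (_dl_add_to_slotinfo), publishing a new node only
     after its contents are initialized:  */
  atomic_store_release (&prevp->next, newp);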
  
Ilya Palachev Jan. 11, 2016, 3:48 p.m. UTC | #2
On 06.01.2016 21:33, Szabolcs Nagy wrote:
>   	  /* Keep track of the maximum generation number.  This might
>   	     not be the generation counter.  */
> -	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
> -	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
> +	  maxgen = MAX (maxgen, gen);

Thanks for the patch.

But it seems quite strange that the failed assertion is simply deleted 
from the code.
Is it still failing for your patch?
How can you prove that it is working if the assertion that was failing 
is now just deleted from the code?

If I just remove the assertion and do nothing else, the error will go away.
Can you keep the assertion in place, or otherwise explain why you
want to remove it?

--
Best regards,
Ilya Palachev
  
Szabolcs Nagy Jan. 11, 2016, 4:42 p.m. UTC | #3
On 11/01/16 15:48, Ilya Palachev wrote:
> On 06.01.2016 21:33, Szabolcs Nagy wrote:
>>         /* Keep track of the maximum generation number.  This might
>>            not be the generation counter.  */
>> -      assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
>> -      maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
>> +      maxgen = MAX (maxgen, gen);
>
> Thanks for the patch.
>
> But it seems quite strange that the failed assertion is simply deleted from the code.
> Is it still failing for your patch?

yes

> How can you prove that it is working if the assertion that was failing is now just deleted from the code?

i can only prove that the assertion is wrong by
analysing the code: the condition it verifies
cannot be enforced with the current design.

removing it is harmless since the slotinfo entries
are lazy initialized.

> If I just remove the assertion and do nothing else, the error will go away.

that's not enough: the dtv of the thread will be
in an inconsistent state and tls access in the
thread may crash.

> Can you keep the assertion in place, or otherwise explain why you want to remove it?
>
> --
> Best regards,
> Ilya Palachev
>
  
Szabolcs Nagy Jan. 11, 2016, 4:46 p.m. UTC | #4
On 11/01/16 16:42, Szabolcs Nagy wrote:
> On 11/01/16 15:48, Ilya Palachev wrote:
>> How can you prove that it is working if the assertion that was failing is now just deleted from the code?
>
> i can only prove that the assertion is wrong by
> analysing the code: the condition it verifies
> cannot be enforced with the current design.
>
> removing it is harmless since the slotinfo entries
> are lazy initialized.
>

actually this is not true if dlclose may be called
concurrently with pthread_create as i noted in my
new patch description.

i don't know how to protect against that.
  
Ilya Palachev Jan. 12, 2016, 12:20 p.m. UTC | #5
On 11.01.2016 19:46, Szabolcs Nagy wrote:
> On 11/01/16 16:42, Szabolcs Nagy wrote:
>> On 11/01/16 15:48, Ilya Palachev wrote:
>>> How can you prove that it is working if the assertion that was 
>>> failing is now just deleted from the code?
>>
>> i can only prove that the assertion is wrong by
>> analysing the code: the condition it verifies
>> cannot be enforced with the current design.
>>
>> removing it is harmless since the slotinfo entries
>> are lazy initialized.
>>

Thanks for the explanation.
Do you mean that slotinfo entries are lazily initialized in
_dl_update_slotinfo?

>
> actually this is not true if dlclose may be called
> concurrently with pthread_create as i noted in my
> new patch description.
>
> i don't know how to protect against that.

Does _dl_update_slotinfo have support for updating the DTVs of other
threads when some thread has called dlclose?
As far as I can see, dlclose just sets the map pointer to NULL and
increments the generation counter:

1. In remove_slotinfo function:

           listp->slotinfo[idx - disp].gen = GL(dl_tls_generation) + 1;
           listp->slotinfo[idx - disp].map = NULL;

2. In _dl_close_worker function:
   /* If we removed any object which uses TLS bump the generation
      counter.  */
   if (any_tls)
     {
       if (__glibc_unlikely (++GL(dl_tls_generation) == 0))
         _dl_fatal_printf ("TLS generation counter wrapped!  Please report as described in "REPORT_BUGS_TO".\n");

       if (tls_free_end == GL(dl_tls_static_used))
         GL(dl_tls_static_used) = tls_free_start;
     }

You have just fixed such memory accesses with atomic_load_acquire and 
atomic_store_release.
Won't these atomics be enough to fix races with dlclose? Or am I missing 
something?

Or would you just like to file a separate bug report for the dlclose race?

--
Best regards,
Ilya Palachev
  
Szabolcs Nagy Jan. 12, 2016, 2:28 p.m. UTC | #6
On 12/01/16 12:20, Ilya Palachev wrote:
> On 11.01.2016 19:46, Szabolcs Nagy wrote:
>> On 11/01/16 16:42, Szabolcs Nagy wrote:
>>> On 11/01/16 15:48, Ilya Palachev wrote:
>>>> How can you prove that it is working if the assertion that was failing is now just deleted from the code?
>>>
>>> i can only prove that the assertion is wrong by
>>> analysing the code: the condition it verifies
>>> cannot be enforced with the current design.
>>>
>>> removing it is harmless since the slotinfo entries
>>> are lazy initialized.
>>>
>
> Thanks for the explanation.
> Do you mean that slotinfo entries are lazily initialized in _dl_update_slotinfo?
>
>>
>> actually this is not true if dlclose may be called
>> concurrently with pthread_create as i noted in my
>> new patch description.
>>
>> i don't know how to protect against that.
>
> Does _dl_update_slotinfo have support for updating the DTVs of other threads when some thread has called dlclose?

_dl_update_slotinfo (and tls_get_addr_tail) also walk
the slotinfo list without synchronization like
_dl_allocate_tls_init.

that is a worse case because it happens at tls access
time, not at a library call (pthread_create) time.

they initialize the dtv of the current thread based
on the global slotinfo list and might allocate memory
for tls, resize the dtv and may also hold the global
dl lock, which are all problematic operations.

(these functions should only do as-safe operations
and must not do anything that can fail since, unlike
a library call, a tls access cannot report failures)

i did not try to fix them.

i only attempted to fix the pthread_create vs dlopen case.

> As far as I can see, dlclose just sets the map pointer to NULL and increments the generation counter:
>
> 1. In remove_slotinfo function:
>
>            listp->slotinfo[idx - disp].gen = GL(dl_tls_generation) + 1;
>            listp->slotinfo[idx - disp].map = NULL;
>
> 2. In _dl_close_worker function:
>    /* If we removed any object which uses TLS bump the generation counter.  */
>    if (any_tls)
>      {
>        if (__glibc_unlikely (++GL(dl_tls_generation) == 0))
>          _dl_fatal_printf ("TLS generation counter wrapped!  Please report as described in "REPORT_BUGS_TO".\n");
>
>        if (tls_free_end == GL(dl_tls_static_used))
>          GL(dl_tls_static_used) = tls_free_start;
>      }
>
> You have just fixed such memory accesses with atomic_load_acquire and atomic_store_release.
> Won't these atomics be enough to fix races with dlclose? Or am I missing something?
>

i was looking at that and thought the memory accesses
might be possible to sync (although it's a bit hairy);
however, dlclose frees the link_map at the end and that
is not possible to sync.

after _dl_allocate_tls_init loads .map and .gen from the
slotinfo, it accesses the map assuming it won't go away.

the current design in glibc is very broken: e.g. musl can
do tls initialization and access in a lockfree way because
its dlclose does not free the loaded library; if it did,
then i think using the dl lock would be necessary.

however the dl lock cannot (easily) fix the tls access
case, and currently it does not fix the pthread_create case
either, since the dl lock is held while ctors are running,
so user code can deadlock on it.

tl;dr: if your libc has non-noop dlclose it must use a
global lock in dlopen/dlclose/thread creation/tls access
and user code must not run while that lock is held
(e.g. signals must be blocked)
  
Torvald Riegel Jan. 21, 2016, 8:58 p.m. UTC | #7
On Mon, 2016-01-11 at 16:42 +0000, Szabolcs Nagy wrote:
> On 11/01/16 15:48, Ilya Palachev wrote:
> > How can you prove that it is working if the assertion that was failing is now just deleted from the code?
> 
> i can only prove that the assertion is wrong by
> analysing the code: the condition it verifies
> cannot be enforced with the current design.

But that's not a sufficient reason to remove the assertion.  The
assertion might very well have been the intent of the programmer, and
what the code actually does might not have been what was intended.
Removing the assertion would be similar to removing documentation or
comments just because the code doesn't match them.

IMO, the proper way to solve this is to investigate the conflict and
find out what the intent should be (e.g., to make this work correctly);
after that, one knows whether the code or the assertion is wrong.

(If we are reasonably confident that the assertion might be wrong, or
that the damage caused by it failing is larger than an error triggered
by the potential bug, we might just make it not fail.  But in this case,
the assertion should remain as a comment with a big FIXME or such next
to it.)
  
Torvald Riegel Jan. 22, 2016, 2:49 p.m. UTC | #8
On Tue, 2016-01-12 at 14:28 +0000, Szabolcs Nagy wrote:
> tl;dr: if your libc has non-noop dlclose it must use a
> global lock in dlopen/dlclose/thread creation/tls access
> and user code must not run while that lock is held
> (e.g. signals must be blocked)

Depending on what the non-noop functionality is, it might be possible to
still implement this in a nonblocking way (so you avoid the blocking
sync vs. signals and reentrancy issue).
  
Szabolcs Nagy Jan. 22, 2016, 6:32 p.m. UTC | #9
On 22/01/16 14:49, Torvald Riegel wrote:
> On Tue, 2016-01-12 at 14:28 +0000, Szabolcs Nagy wrote:
>> tl;dr: if your libc has non-noop dlclose it must use a
>> global lock in dlopen/dlclose/thread creation/tls access
>> and user code must not run while that lock is held
>> (e.g. signals must be blocked)
>
> Depending on what the non-noop functionality is, it might be possible to
> still implement this in a nonblocking way (so you avoid the blocking
> sync vs. signals and reentrancy issue).

the non-noop functionality is that dlclose frees
memory that pthread_create and tls access may look at
(the link_map of a dso).

i guess lock-free garbage collection is a possibility
(e.g. refcounting, except tls access cannot call free
so it's not entirely trivial.)
  
Torvald Riegel Jan. 25, 2016, 12:38 p.m. UTC | #10
On Fri, 2016-01-22 at 18:32 +0000, Szabolcs Nagy wrote:
> On 22/01/16 14:49, Torvald Riegel wrote:
> > On Tue, 2016-01-12 at 14:28 +0000, Szabolcs Nagy wrote:
> >> tl;dr: if your libc has non-noop dlclose it must use a
> >> global lock in dlopen/dlclose/thread creation/tls access
> >> and user code must not run while that lock is held
> >> (e.g. signals must be blocked)
> >
> > Depending on what the non-noop functionality is, it might be possible to
> > still implement this in a nonblocking way (so you avoid the blocking
> > sync vs. signals and reentrancy issue).
> 
> the non-noop functionality is that dlclose frees
> memory that pthread_create and tls access may look at
> (the link_map of a dso).
> 
> i guess lock-free garbage collection is a possibility
> (e.g. refcounting, except tls access cannot call free
> so it's not entirely trivial.)

So we need to reach a quiescent state (e.g., like RCU has to).
Not trivial, but it seems releasing memory can be deferred to a later
time, and it does not need to be performed by a particular thread.  So
one can just build a list of things to be freed, and the first thread
that can do it just does it.
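
A minimal sketch of that idea (hypothetical names, nothing like this
exists in glibc today; it assumes <stdlib.h>, <link.h> and the atomic_*
wrappers from include/atomic.h, and it leaves open the hard part of
deciding when draining is actually safe):

  struct deferred_free
  {
    struct deferred_free *next;
    struct link_map *map;
  };

  /* Lock-free LIFO of link_maps that dlclose could not yet release.  */
  static struct deferred_free *deferred_list;

  /* Called from dlclose instead of releasing the link_map directly.  */
  static void
  defer_free (struct link_map *map)
  {
    struct deferred_free *e = malloc (sizeof *e);
    if (e == NULL)
      /* Keeping the memory around is always safe; releasing it later
         is only an optimization.  */
      return;
    e->map = map;
    e->next = atomic_load_relaxed (&deferred_list);
    /* On failure e->next is updated to the current head, so just retry.  */
    while (!atomic_compare_exchange_weak_release (&deferred_list,
                                                  &e->next, e))
      ;
  }

  /* Run by the first thread that is in a context where calling free is
     allowed and where no other thread can still be reading the old
     link_maps (the quiescence detection mentioned above).  */
  static void
  drain_deferred (void)
  {
    struct deferred_free *e = atomic_exchange_acquire (&deferred_list, NULL);
    while (e != NULL)
      {
        struct deferred_free *next = e->next;
        /* Whatever dlclose would normally do to release the map.  */
        free (e->map);
        free (e);
        e = next;
      }
  }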
  

Patch

diff --git a/elf/dl-open.c b/elf/dl-open.c
index 5429d18..4e2cd6a 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -524,9 +524,15 @@  dl_open_worker (void *a)
     }
 
   /* Bump the generation number if necessary.  */
-  if (any_tls && __builtin_expect (++GL(dl_tls_generation) == 0, 0))
-    _dl_fatal_printf (N_("\
+  if (any_tls)
+    {
+      size_t newgen = GL(dl_tls_generation) + 1;
+      if (__builtin_expect (newgen == 0, 0))
+	_dl_fatal_printf (N_("\
 TLS generation counter wrapped!  Please report this."));
+      /* Synchronize with the load acquire in _dl_allocate_tls_init.  */
+      atomic_store_release (&GL(dl_tls_generation), newgen);
+    }
 
   /* We need a second pass for static tls data, because _dl_update_slotinfo
      must not be run while calls to _dl_add_to_slotinfo are still pending.  */
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 20c7e33..3c08a5f 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -455,9 +455,12 @@  _dl_allocate_tls_init (void *result)
   struct dtv_slotinfo_list *listp;
   size_t total = 0;
   size_t maxgen = 0;
+  size_t gen_count;
+  size_t dtv_slots;
 
   /* Check if the current dtv is big enough.   */
-  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
+  dtv_slots = atomic_load_relaxed (&GL(dl_tls_max_dtv_idx));
+  if (dtv[-1].counter < dtv_slots)
     {
       /* Resize the dtv.  */
       dtv = _dl_resize_dtv (dtv);
@@ -470,6 +473,7 @@  _dl_allocate_tls_init (void *result)
      TLS.  For those which are dynamically loaded we add the values
      indicating deferred allocation.  */
   listp = GL(dl_tls_dtv_slotinfo_list);
+  gen_count = atomic_load_acquire (&GL(dl_tls_generation));
   while (1)
     {
       size_t cnt;
@@ -480,18 +484,22 @@  _dl_allocate_tls_init (void *result)
 	  void *dest;
 
 	  /* Check for the total number of used slots.  */
-	  if (total + cnt > GL(dl_tls_max_dtv_idx))
+	  if (total + cnt > dtv_slots)
 	    break;
 
-	  map = listp->slotinfo[cnt].map;
+	  map = atomic_load_acquire (&listp->slotinfo[cnt].map);
 	  if (map == NULL)
 	    /* Unused entry.  */
 	    continue;
 
+	  size_t gen = atomic_load_relaxed (&listp->slotinfo[cnt].gen);
+	  if (gen > gen_count)
+	    /* New, concurrently loaded entry.  */
+	    continue;
+
 	  /* Keep track of the maximum generation number.  This might
 	     not be the generation counter.  */
-	  assert (listp->slotinfo[cnt].gen <= GL(dl_tls_generation));
-	  maxgen = MAX (maxgen, listp->slotinfo[cnt].gen);
+	  maxgen = MAX (maxgen, gen);
 
 	  dtv[map->l_tls_modid].pointer.val = TLS_DTV_UNALLOCATED;
 	  dtv[map->l_tls_modid].pointer.is_static = false;
@@ -518,7 +526,7 @@  _dl_allocate_tls_init (void *result)
 	}
 
       total += cnt;
-      if (total >= GL(dl_tls_max_dtv_idx))
+      if (total >= dtv_slots)
 	break;
 
       listp = listp->next;
@@ -942,6 +950,7 @@  cannot create TLS data structures"));
     }
 
   /* Add the information into the slotinfo data structure.  */
-  listp->slotinfo[idx].map = l;
-  listp->slotinfo[idx].gen = GL(dl_tls_generation) + 1;
+  atomic_store_relaxed (&listp->slotinfo[idx].gen, GL(dl_tls_generation) + 1);
+  /* Synchronize with the acquire load in _dl_allocate_tls_init.  */
+  atomic_store_release (&listp->slotinfo[idx].map, l);
 }