elf: Support recursive use of dynamic TLS in interposed malloc

Message ID 8734p2h0t4.fsf@oldenburg.str.redhat.com
State Superseded
Series elf: Support recursive use of dynamic TLS in interposed malloc

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed

Commit Message

Florian Weimer June 24, 2024, 2:28 p.m. UTC
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS).  The previous workaround from
commit afe42e935b3ee97bac9a7064157587777259c60e ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") unfortunately does not fix
all encountered cases.
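
For illustration, the problematic pattern looks roughly like this
(a minimal sketch; the global-dynamic TLS access is what matters,
and __libc_malloc stands in for the interposer's real allocation
logic):

static __thread size_t allocation_count; /* global-dynamic TLS */

void *
malloc (size_t size)
{
  extern void *__libc_malloc (size_t);
  ++allocation_count;		/* may call __tls_get_addr */
  return __libc_malloc (size);
}

Preloaded as a shared object built without -ftls-model=initial-exec,
the TLS access in malloc can re-enter __tls_get_addr while the
dynamic linker is in the middle of a DTV update.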

This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update.  This
is possible because an interposed malloc has a fixed module ID and
TLS slot.  (It cannot be unloaded.)  If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc.  It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.

The bookkeeping is done using a global counter instead of a per-thread
flag because TLS access in the dynamic linker is tricky.

All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.

Fixes commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow
tls access after dlopen [BZ #19924]").

Tested on aarch64-linux-gnu, i686-linux-gnu, x86_64-linux-gnu,
powerpc64le-linux-gnu.

---
 elf/dl-tls.c               | 97 +++++++++++++++++++++++++++++++++++++++++-----
 elf/rtld.c                 |  2 +
 sysdeps/generic/ldsodefs.h | 14 +++++++
 sysdeps/x86_64/dl-tls.c    |  5 ++-
 4 files changed, 107 insertions(+), 11 deletions(-)


base-commit: da61ba3f8930e9b39fe35f88cb2ed4bc4571e598
  

Comments

Szabolcs Nagy June 25, 2024, 8:46 a.m. UTC | #1
The 06/24/2024 16:28, Florian Weimer wrote:
> It turns out that quite a few applications use bundled mallocs that
> have been built to use global-dynamic TLS (instead of the recommended
> initial-exec TLS).  The previous workaround from
> commit afe42e935b3ee97bac9a7064157587777259c60e ("elf: Avoid some
> free (NULL) calls in _dl_update_slotinfo") unfortunately does not fix
> all encountered cases.
> 
> This change avoids the TLS generation update for recursive use
> of TLS from a malloc that was called during a TLS update.  This
> is possible because an interposed malloc has a fixed module ID and
> TLS slot.  (It cannot be unloaded.)  If an initially-loaded module ID
> is encountered in __tls_get_addr and the dynamic linker is already
> in the middle of a TLS update, use the outdated DTV, thus avoiding
> another call into malloc.  It's still necessary to update the
> DTV to the most recent generation, to get out of the slow path,
> which is why the check for recursion is needed.
> 
> The bookkeeping is done using a global counter instead of a per-thread
> flag because TLS access in the dynamic linker is tricky.
> 
> All this will go away once the dynamic linker stops using malloc
> for TLS, likely as part of a change that pre-allocates all TLS
> during pthread_create/dlopen.
> 
> Fixes commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow
> tls access after dlopen [BZ #19924]").
> 
> Tested on aarch64-linux-gnu, i686-linux-gnu, x86_64-linux-gnu,
> powerpc64le-linux-gnu.
> 

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

but see comments below


> ---
>  elf/dl-tls.c               | 97 +++++++++++++++++++++++++++++++++++++++++-----
>  elf/rtld.c                 |  2 +
>  sysdeps/generic/ldsodefs.h | 14 +++++++
>  sysdeps/x86_64/dl-tls.c    |  5 ++-
>  4 files changed, 107 insertions(+), 11 deletions(-)
> 
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index 670dbc42fc..c0d995d69d 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -75,6 +75,31 @@
>  /* Default for dl_tls_static_optional.  */
>  #define OPTIONAL_TLS 512
>  
> +/* Used to count the number of threads currently executing dynamic TLS
> +   updates.  Used to avoid recursive malloc calls in __tls_get_addr
> +   for an interposed malloc that uses global-dynamic TLS (which is not
> +   recommended); see _dl_tls_allocate_active checks.  This could be a
> +   per-thread flag, but would need TLS access in the dynamic linker.  */
> +unsigned int _dl_tls_threads_in_update;
> +
> +static inline void
> +_dl_tls_allocate_begin (void)
> +{
> +  atomic_fetch_add_relaxed (&_dl_tls_threads_in_update, 1);
> +}
> +
> +static inline void
> +_dl_tls_allocate_end (void)
> +{
> +  atomic_fetch_add_relaxed (&_dl_tls_threads_in_update, -1);
> +}
> +
> +static inline bool
> +_dl_tls_allocate_active (void)
> +{
> +  return atomic_load_relaxed (&_dl_tls_threads_in_update) > 0;
> +}
> +
>  /* Compute the static TLS surplus based on the namespace count and the
>     TLS space that can be used for optimizations.  */
>  static inline int
> @@ -425,12 +450,18 @@ _dl_allocate_tls_storage (void)
>    size += TLS_PRE_TCB_SIZE;
>  #endif
>  
> -  /* Perform the allocation.  Reserve space for the required alignment
> -     and the pointer to the original allocation.  */
> +  /* Reserve space for the required alignment and the pointer to the
> +     original allocation.  */
>    size_t alignment = GLRO (dl_tls_static_align);
> +
> +  /* Perform the allocation.  */
> +  _dl_tls_allocate_begin ();
>    void *allocated = malloc (size + alignment + sizeof (void *));
>    if (__glibc_unlikely (allocated == NULL))
> -    return NULL;
> +    {
> +      _dl_tls_allocate_end ();
> +      return NULL;
> +    }
>  
>    /* Perform alignment and allocate the DTV.  */
>  #if TLS_TCB_AT_TP
> @@ -466,6 +497,8 @@ _dl_allocate_tls_storage (void)
>    result = allocate_dtv (result);
>    if (result == NULL)
>      free (allocated);
> +
> +  _dl_tls_allocate_end ();
>    return result;
>  }
>  
> @@ -483,6 +516,7 @@ _dl_resize_dtv (dtv_t *dtv, size_t max_modid)
>    size_t newsize = max_modid + DTV_SURPLUS;
>    size_t oldsize = dtv[-1].counter;
>  
> +  _dl_tls_allocate_begin ();
>    if (dtv == GL(dl_initial_dtv))
>      {
>        /* This is the initial dtv that was either statically allocated in
> @@ -502,6 +536,7 @@ _dl_resize_dtv (dtv_t *dtv, size_t max_modid)
>        if (newp == NULL)
>  	oom ();
>      }
> +  _dl_tls_allocate_end ();
>  
>    newp[0].counter = newsize;
>  
> @@ -676,7 +711,9 @@ allocate_dtv_entry (size_t alignment, size_t size)
>    if (powerof2 (alignment) && alignment <= _Alignof (max_align_t))
>      {
>        /* The alignment is supported by malloc.  */
> +      _dl_tls_allocate_begin ();
>        void *ptr = malloc (size);
> +      _dl_tls_allocate_end ();
>        return (struct dtv_pointer) { ptr, ptr };
>      }
>  
> @@ -688,7 +725,10 @@ allocate_dtv_entry (size_t alignment, size_t size)
>  
>    /* Perform the allocation.  This is the pointer we need to free
>       later.  */
> +  _dl_tls_allocate_begin ();
>    void *start = malloc (alloc_size);
> +  _dl_tls_allocate_end ();
> +
>    if (start == NULL)
>      return (struct dtv_pointer) {};
>  

these look OK.

> @@ -826,7 +866,11 @@ _dl_update_slotinfo (unsigned long int req_modid, size_t new_gen)
>  		 free implementation.  Checking here papers over at
>  		 least some dynamic TLS usage by interposed mallocs.  */
>  	      if (dtv[modid].pointer.to_free != NULL)
> -		free (dtv[modid].pointer.to_free);
> +		{
> +		  _dl_tls_allocate_begin ();
> +		  free (dtv[modid].pointer.to_free);
> +		  _dl_tls_allocate_end ();
> +		}
>  	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
>  	      dtv[modid].pointer.to_free = NULL;
>  
> @@ -953,13 +997,25 @@ __tls_get_addr (GET_ADDR_ARGS)
>       module, but the global generation counter is easier to check (which
>       must be synchronized up to the generation of the accessed module by
>       user code doing the TLS access so relaxed mo read is enough).  */
> -  size_t gen = atomic_load_relaxed (&GL(dl_tls_generation));
> +    size_t gen = atomic_load_relaxed (&GL(dl_tls_generation));

this whitespace change looks wrong.


>    if (__glibc_unlikely (dtv[0].counter != gen))
>      {
> -      /* Update DTV up to the global generation, see CONCURRENCY NOTES
> -         in _dl_update_slotinfo.  */
> -      gen = atomic_load_acquire (&GL(dl_tls_generation));
> -      return update_get_addr (GET_ADDR_PARAM, gen);
> +      if (_dl_tls_allocate_active ()
> +	  && GET_ADDR_MODULE < _dl_tls_initial_modid_limit)
> +	  /* This is a reentrant __tls_get_addr call, but we can
> +	     satisfy it because it's an initially-loaded module ID.
> +	     These TLS slotinfo slots do not change, so the
> +	     out-of-date generation counter does not matter.  However,
> +	     if not in a TLS update, still update_get_addr below, to
> +	     get off the slow path eventually.  */
> +	;
> +      else
> +	{
> +	  /* Update DTV up to the global generation, see CONCURRENCY NOTES
> +	     in _dl_update_slotinfo.  */
> +	  gen = atomic_load_acquire (&GL(dl_tls_generation));
> +	  return update_get_addr (GET_ADDR_PARAM, gen);
> +	}
>      }

OK.

ideally this would be fixed for any loaded lib that uses
fixed offset tls (static tls, tp+map->l_tls_offset).

but __tls_get_addr fast path looks at dtv so that has to
be updated which needs allocations even for fixed offset
modules.

i wonder if we could do something like ppc tls optimization:
special modid value means the GET_ADDR_OFFSET is already
the final offset relative to tp (assuming this can work for
local dynamic access without DTPOFF reloc). so there is no
dtv entry at all for modules with static tls and handle them
in asm fast paths.
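
roughly, as a sketch (invented names, not actual glibc or ppc code;
assumes the relocation rewrote the offset to be tp-relative):

#define TLS_STATIC_MODID ((size_t) -1)

void *
tls_get_addr_fast (tls_index *ti)
{
  if (ti->ti_module == TLS_STATIC_MODID)
    /* no dtv entry, hence no allocation: the offset is final.  */
    return (char *) __builtin_thread_pointer () + ti->ti_offset;
  return __tls_get_addr (ti);	/* normal dtv-based path */
}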

>  
>    void *p = dtv[GET_ADDR_MODULE].pointer.val;
> @@ -969,7 +1025,7 @@ __tls_get_addr (GET_ADDR_ARGS)
>  
>    return (char *) p + GET_ADDR_OFFSET;
>  }
> -#endif
> +#endif /* SHARED */
>  
>  
>  /* Look up the module's TLS block as for __tls_get_addr,
> @@ -1018,6 +1074,25 @@ _dl_tls_get_addr_soft (struct link_map *l)
>    return data;
>  }
>  
> +size_t _dl_tls_initial_modid_limit;
> +
> +void
> +_dl_tls_initial_modid_limit_setup (void)
> +{
> +  struct dtv_slotinfo_list *listp = GL(dl_tls_dtv_slotinfo_list);
> +  size_t idx;
> +  for (idx = 0; idx < listp->len; ++idx)
> +    {
> +      struct link_map *l = listp->slotinfo[idx].map;
> +      if (l == NULL
> +	  /* The object can be unloaded, so its modid can be
> +	     reassociated.  */
> +	  || !(l->l_type == lt_executable || l->l_type == lt_library))
> +	break;
> +    }
> +  _dl_tls_initial_modid_limit = idx;
> +}
> +

OK.

>  
>  void
>  _dl_add_to_slotinfo (struct link_map *l, bool do_add)
> @@ -1050,9 +1125,11 @@ _dl_add_to_slotinfo (struct link_map *l, bool do_add)
>  	 the first slot.  */
>        assert (idx == 0);
>  
> +      _dl_tls_allocate_begin ();
>        listp = (struct dtv_slotinfo_list *)
>  	malloc (sizeof (struct dtv_slotinfo_list)
>  		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
> +      _dl_tls_allocate_end ();
>        if (listp == NULL)
>  	{
>  	  /* We ran out of memory while resizing the dtv slotinfo list.  */
> diff --git a/elf/rtld.c b/elf/rtld.c
> index e9525ea987..6352ba76c5 100644
> --- a/elf/rtld.c
> +++ b/elf/rtld.c
> @@ -788,6 +788,8 @@ init_tls (size_t naudit)
>      _dl_fatal_printf ("\
>  cannot allocate TLS data structures for initial thread\n");
>  
> +  _dl_tls_initial_modid_limit_setup ();
> +
>    /* Store for detection of the special case by __tls_get_addr
>       so it knows not to pass this dtv to the normal realloc.  */
>    GL(dl_initial_dtv) = GET_DTV (tcbp);
> diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
> index 50f58a60e3..656e8a3fa0 100644
> --- a/sysdeps/generic/ldsodefs.h
> +++ b/sysdeps/generic/ldsodefs.h
> @@ -1256,6 +1256,20 @@ extern struct link_map *_dl_update_slotinfo (unsigned long int req_modid,
>  					     size_t gen)
>       attribute_hidden;
>  
> +/* The last TLS module ID that is initially loaded, plus 1.  TLS
> +   addresses for modules with IDs lower than that can be obtained from
> +   the DTV even if its generation is outdated.  */
> +extern size_t _dl_tls_initial_modid_limit attribute_hidden attribute_relro;
> +
> +/* Compute _dl_tls_initial_modid_limit.  To be called after initial
> +   relocation.  */
> +void _dl_tls_initial_modid_limit_setup (void) attribute_hidden;
> +
> +/* Number of threads currently in a TLS update.  This is used to
> +   detect reentrant __tls_get_addr calls without a per-thread
> +   flag.  */
> +extern unsigned int _dl_tls_threads_in_update attribute_hidden;
> +

OK.

>  /* Look up the module's TLS block as for __tls_get_addr,
>     but never touch anything.  Return null if it's not allocated yet.  */
>  extern void *_dl_tls_get_addr_soft (struct link_map *l) attribute_hidden;
> diff --git a/sysdeps/x86_64/dl-tls.c b/sysdeps/x86_64/dl-tls.c
> index 869023bbba..b3c1e4fcd7 100644
> --- a/sysdeps/x86_64/dl-tls.c
> +++ b/sysdeps/x86_64/dl-tls.c
> @@ -41,7 +41,10 @@ __tls_get_addr_slow (GET_ADDR_ARGS)
>    dtv_t *dtv = THREAD_DTV ();
>  
>    size_t gen = atomic_load_acquire (&GL(dl_tls_generation));
> -  if (__glibc_unlikely (dtv[0].counter != gen))
> +  if (__glibc_unlikely (dtv[0].counter != gen)
> +      /* See comment in __tls_get_addr in elf/dl-tls.c.  */
> +      && !(_dl_tls_allocate_active ()
> +           && GET_ADDR_MODULE < _dl_tls_initial_modid_limit))
>      return update_get_addr (GET_ADDR_PARAM, gen);
>  
>    return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);
> 
> base-commit: da61ba3f8930e9b39fe35f88cb2ed4bc4571e598
>
  
Florian Weimer June 25, 2024, noon UTC | #2
* Szabolcs Nagy:

> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

Thanks.

>> @@ -953,13 +997,25 @@ __tls_get_addr (GET_ADDR_ARGS)
>>       module, but the global generation counter is easier to check (which
>>       must be synchronized up to the generation of the accessed module by
>>       user code doing the TLS access so relaxed mo read is enough).  */
>> -  size_t gen = atomic_load_relaxed (&GL(dl_tls_generation));
>> +    size_t gen = atomic_load_relaxed (&GL(dl_tls_generation));
>
> this whitespace change looks wrong.

Thanks, fixed locally.

> ideally this would be fixed for any loaded lib that uses
> fixed offset tls (static tls, tp+map->l_tls_offset).
>
> but __tls_get_addr fast path looks at dtv so that has to
> be updated which needs allocations even for fixed offset
> modules.
>
> i wonder if we could do something like ppc tls optimization:
> special modid value means the GET_ADDR_OFFSET is already
> the final offset relative to tp (assuming this can work for
> local dynamic access without DTPOFF reloc). so there is no
> dtv entry at all for modules with static tls and handle them
> in asm fast paths.

I don't think optimizing this is worth it.  We should switch to eager
TLS allocation, then __tls_get_addr would look like this:

void *
__tls_get_addr (GET_ADDR_ARGS)
{
  dtv_t *dtv = THREAD_DTV ();
  return (char *) dtv[GET_ADDR_MODULE] + GET_ADDR_OFFSET;
}

With multiple modules sharing one TLS module ID.  I assume that's what
musl is doing today.

I've created a lightly tested variant of the patch that uses TLS instead
of a global variable for the in-malloc state.  I don't think it's an
improvement (particularly because of the missing compiler barrier before
THREAD_GETMEM), so I'd like to stick with the first patch.

Thanks,
Florian

diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 670dbc42fc..66445686fa 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -75,6 +75,49 @@
 /* Default for dl_tls_static_optional.  */
 #define OPTIONAL_TLS 512
 
+/* Called before invoking the malloc subsystem.  Returns true if
+   rtld_in_malloc needs to be reset.  */
+static inline bool __attribute__ ((warn_unused_result))
+_dl_tls_allocate_begin (void)
+{
+#ifdef SHARED
+  if (!__rtld_tls_init_tp_called)
+    /* If there is no TLS yet, we still use the minimal malloc, which
+       does not need reentrancy protection.  */
+    return false;
+  struct pthread *self = THREAD_SELF;
+  if (THREAD_GETMEM (self, rtld_in_malloc))
+    return false;
+  THREAD_SETMEM (self, rtld_in_malloc, true);
+  return true;
+#else
+  return false;
+#endif
+}
+
+/* Must be invoked with the result of _dl_tls_allocate_begin after the
+   call into the malloc subsystem.  */
+static inline void
+_dl_tls_allocate_end (bool reset)
+{
+  if (reset)
+    THREAD_SETMEM (THREAD_SELF, rtld_in_malloc, false);
+}
+
+/* Returns true if this thread is in a call to malloc.  */
+static inline bool
+_dl_tls_allocate_active (void)
+{
+#ifdef SHARED
+  if (!__rtld_tls_init_tp_called)
+    /* Minimal malloc is still in use.  Recursive call is impossible.  */
+    return false;
+  return THREAD_GETMEM (THREAD_SELF, rtld_in_malloc);
+#else
+  return false;
+#endif
+}
+
 /* Compute the static TLS surplus based on the namespace count and the
    TLS space that can be used for optimizations.  */
 static inline int
@@ -425,12 +468,18 @@ _dl_allocate_tls_storage (void)
   size += TLS_PRE_TCB_SIZE;
 #endif
 
-  /* Perform the allocation.  Reserve space for the required alignment
-     and the pointer to the original allocation.  */
+  /* Reserve space for the required alignment and the pointer to the
+     original allocation.  */
   size_t alignment = GLRO (dl_tls_static_align);
+
+  /* Perform the allocation.  */
+  bool reset_rtld_in_malloc = _dl_tls_allocate_begin ();
   void *allocated = malloc (size + alignment + sizeof (void *));
   if (__glibc_unlikely (allocated == NULL))
-    return NULL;
+    {
+      _dl_tls_allocate_end (reset_rtld_in_malloc);
+      return NULL;
+    }
 
   /* Perform alignment and allocate the DTV.  */
 #if TLS_TCB_AT_TP
@@ -466,6 +515,8 @@ _dl_allocate_tls_storage (void)
   result = allocate_dtv (result);
   if (result == NULL)
     free (allocated);
+
+  _dl_tls_allocate_end (reset_rtld_in_malloc);
   return result;
 }
 
@@ -483,6 +534,7 @@ _dl_resize_dtv (dtv_t *dtv, size_t max_modid)
   size_t newsize = max_modid + DTV_SURPLUS;
   size_t oldsize = dtv[-1].counter;
 
+  bool reset_rtld_in_malloc = _dl_tls_allocate_begin ();
   if (dtv == GL(dl_initial_dtv))
     {
       /* This is the initial dtv that was either statically allocated in
@@ -502,6 +554,7 @@ _dl_resize_dtv (dtv_t *dtv, size_t max_modid)
       if (newp == NULL)
 	oom ();
     }
+  _dl_tls_allocate_end (reset_rtld_in_malloc);
 
   newp[0].counter = newsize;
 
@@ -676,7 +729,9 @@ allocate_dtv_entry (size_t alignment, size_t size)
   if (powerof2 (alignment) && alignment <= _Alignof (max_align_t))
     {
       /* The alignment is supported by malloc.  */
+      bool reset_rtld_in_malloc = _dl_tls_allocate_begin ();
       void *ptr = malloc (size);
+      _dl_tls_allocate_end (reset_rtld_in_malloc);
       return (struct dtv_pointer) { ptr, ptr };
     }
 
@@ -688,7 +743,10 @@ allocate_dtv_entry (size_t alignment, size_t size)
 
   /* Perform the allocation.  This is the pointer we need to free
      later.  */
+  bool reset_rtld_in_malloc = _dl_tls_allocate_begin ();
   void *start = malloc (alloc_size);
+  _dl_tls_allocate_end (reset_rtld_in_malloc);
+
   if (start == NULL)
     return (struct dtv_pointer) {};
 
@@ -826,7 +884,11 @@ _dl_update_slotinfo (unsigned long int req_modid, size_t new_gen)
 		 free implementation.  Checking here papers over at
 		 least some dynamic TLS usage by interposed mallocs.  */
 	      if (dtv[modid].pointer.to_free != NULL)
-		free (dtv[modid].pointer.to_free);
+		{
+		  bool reset_rtld_in_malloc = _dl_tls_allocate_begin ();
+		  free (dtv[modid].pointer.to_free);
+		  _dl_tls_allocate_end (reset_rtld_in_malloc);
+		}
 	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
 	      dtv[modid].pointer.to_free = NULL;
 
@@ -956,10 +1018,22 @@ __tls_get_addr (GET_ADDR_ARGS)
   size_t gen = atomic_load_relaxed (&GL(dl_tls_generation));
   if (__glibc_unlikely (dtv[0].counter != gen))
     {
-      /* Update DTV up to the global generation, see CONCURRENCY NOTES
-         in _dl_update_slotinfo.  */
-      gen = atomic_load_acquire (&GL(dl_tls_generation));
-      return update_get_addr (GET_ADDR_PARAM, gen);
+      if (_dl_tls_allocate_active ()
+	  && GET_ADDR_MODULE < _dl_tls_initial_modid_limit)
+	  /* This is a reentrant __tls_get_addr call, but we can
+	     satisfy it because it's an initially-loaded module ID.
+	     These TLS slotinfo slots do not change, so the
+	     out-of-date generation counter does not matter.  However,
+	     if not in a TLS update, still update_get_addr below, to
+	     get off the slow path eventually.  */
+	;
+      else
+	{
+	  /* Update DTV up to the global generation, see CONCURRENCY NOTES
+	     in _dl_update_slotinfo.  */
+	  gen = atomic_load_acquire (&GL(dl_tls_generation));
+	  return update_get_addr (GET_ADDR_PARAM, gen);
+	}
     }
 
   void *p = dtv[GET_ADDR_MODULE].pointer.val;
@@ -969,7 +1043,7 @@ __tls_get_addr (GET_ADDR_ARGS)
 
   return (char *) p + GET_ADDR_OFFSET;
 }
-#endif
+#endif /* SHARED */
 
 
 /* Look up the module's TLS block as for __tls_get_addr,
@@ -1018,6 +1092,25 @@ _dl_tls_get_addr_soft (struct link_map *l)
   return data;
 }
 
+size_t _dl_tls_initial_modid_limit;
+
+void
+_dl_tls_initial_modid_limit_setup (void)
+{
+  struct dtv_slotinfo_list *listp = GL(dl_tls_dtv_slotinfo_list);
+  size_t idx;
+  for (idx = 0; idx < listp->len; ++idx)
+    {
+      struct link_map *l = listp->slotinfo[idx].map;
+      if (l == NULL
+	  /* The object can be unloaded, so its modid can be
+	     reassociated.  */
+	  || !(l->l_type == lt_executable || l->l_type == lt_library))
+	break;
+    }
+  _dl_tls_initial_modid_limit = idx;
+}
+
 
 void
 _dl_add_to_slotinfo (struct link_map *l, bool do_add)
@@ -1050,9 +1143,11 @@ _dl_add_to_slotinfo (struct link_map *l, bool do_add)
 	 the first slot.  */
       assert (idx == 0);
 
+      bool reset_rtld_in_malloc = _dl_tls_allocate_begin ();
       listp = (struct dtv_slotinfo_list *)
 	malloc (sizeof (struct dtv_slotinfo_list)
 		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+      _dl_tls_allocate_end (reset_rtld_in_malloc);
       if (listp == NULL)
 	{
 	  /* We ran out of memory while resizing the dtv slotinfo list.  */
diff --git a/elf/rtld.c b/elf/rtld.c
index e9525ea987..6352ba76c5 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -788,6 +788,8 @@ init_tls (size_t naudit)
     _dl_fatal_printf ("\
 cannot allocate TLS data structures for initial thread\n");
 
+  _dl_tls_initial_modid_limit_setup ();
+
   /* Store for detection of the special case by __tls_get_addr
      so it knows not to pass this dtv to the normal realloc.  */
   GL(dl_initial_dtv) = GET_DTV (tcbp);
diff --git a/nptl/descr.h b/nptl/descr.h
index 8cef95810c..b94c706b82 100644
--- a/nptl/descr.h
+++ b/nptl/descr.h
@@ -399,7 +399,14 @@ struct pthread
      exited or is about to exit.  exit_lock must only be acquired
      after blocking signals.  */
   bool exiting;
-  int exit_lock; /* A low-level lock (for use with __libc_lock_init etc).  */
+
+  /* Used during TLS updates to flag the current thread as being in
+     malloc.  Activates a special path in __tls_get_addr that avoids
+     using malloc again in some cases.  */
+  bool rtld_in_malloc;
+
+  /* A low-level lock (for use with __libc_lock_init etc).  */
+  int exit_lock;
 
   /* Used on strsignal.  */
   struct tls_internal_t tls_state;
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index 50f58a60e3..5ff8d85a4b 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -1256,6 +1256,15 @@ extern struct link_map *_dl_update_slotinfo (unsigned long int req_modid,
 					     size_t gen)
      attribute_hidden;
 
+/* The last TLS module ID that is initially loaded, plus 1.  TLS
+   addresses for modules with IDs lower than that can be obtained from
+   the DTV even if its generation is outdated.  */
+extern size_t _dl_tls_initial_modid_limit attribute_hidden attribute_relro;
+
+/* Compute _dl_tls_initial_modid_limit.  To be called after initial
+   relocation.  */
+void _dl_tls_initial_modid_limit_setup (void) attribute_hidden;
+
 /* Look up the module's TLS block as for __tls_get_addr,
    but never touch anything.  Return null if it's not allocated yet.  */
 extern void *_dl_tls_get_addr_soft (struct link_map *l) attribute_hidden;
diff --git a/sysdeps/mach/hurd/i386/tls.h b/sysdeps/mach/hurd/i386/tls.h
index 08b600aa82..0afc33d728 100644
--- a/sysdeps/mach/hurd/i386/tls.h
+++ b/sysdeps/mach/hurd/i386/tls.h
@@ -52,6 +52,11 @@ typedef struct
 
   /* Used by the exception handling implementation in the dynamic loader.  */
   struct rtld_catch *rtld_catch;
+
+  /* Used during TLS updates to flag the current thread as being in
+     malloc.  Activates a special path in __tls_get_addr that avoids
+     using malloc again in some cases.  */
+  _Bool rtld_in_malloc;
 } tcbhead_t;
 
 /* GCC generates %gs:0x14 to access the stack guard.  */
diff --git a/sysdeps/mach/hurd/x86_64/tls.h b/sysdeps/mach/hurd/x86_64/tls.h
index 3d3253963b..174f3c6cb1 100644
--- a/sysdeps/mach/hurd/x86_64/tls.h
+++ b/sysdeps/mach/hurd/x86_64/tls.h
@@ -59,6 +59,11 @@ typedef struct
 
   /* Used by the exception handling implementation in the dynamic loader.  */
   struct rtld_catch *rtld_catch;
+
+  /* Used during TLS updates to flag the current thread as being in
+     malloc.  Activates a special path in __tls_get_addr that avoids
+     using malloc again in some cases.  */
+  _Bool rtld_in_malloc;
 } tcbhead_t;
 
 /* GCC generates %fs:0x28 to access the stack guard.  */
diff --git a/sysdeps/x86_64/dl-tls.c b/sysdeps/x86_64/dl-tls.c
index 869023bbba..b3c1e4fcd7 100644
--- a/sysdeps/x86_64/dl-tls.c
+++ b/sysdeps/x86_64/dl-tls.c
@@ -41,7 +41,10 @@ __tls_get_addr_slow (GET_ADDR_ARGS)
   dtv_t *dtv = THREAD_DTV ();
 
   size_t gen = atomic_load_acquire (&GL(dl_tls_generation));
-  if (__glibc_unlikely (dtv[0].counter != gen))
+  if (__glibc_unlikely (dtv[0].counter != gen)
+      /* See comment in __tls_get_addr in elf/dl-tls.c.  */
+      && !(_dl_tls_allocate_active ()
+           && GET_ADDR_MODULE < _dl_tls_initial_modid_limit))
     return update_get_addr (GET_ADDR_PARAM, gen);
 
   return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);
  
Szabolcs Nagy June 25, 2024, 1:48 p.m. UTC | #3
The 06/25/2024 14:00, Florian Weimer wrote:
> * Szabolcs Nagy:
> > i wonder if we could do something like ppc tls optimization:
> > special modid value means the GET_ADDR_OFFSET is already
> > the final offset relative to tp (assuming this can work for
> > local dynamic access without DTPOFF reloc). so there is no
> > dtv entry at all for modules with static tls and handle them
> > in asm fast paths.
> 
> I don't think optimizing this is worth it.  We should switch to eager
> TLS allocation, then __tls_get_addr would look like this:
> 
> void *
> __tls_get_addr (GET_ADDR_ARGS)
> {
>   dtv_t *dtv = THREAD_DTV ();
>   return (char *) dtv[GET_ADDR_MODULE] + GET_ADDR_OFFSET;
> }
> 
> With multiple modules sharing one TLS module ID.  I assume that's what
> musl is doing today.

i don't think there is modid sharing in musl, otherwise yes.

at thread exit musl only frees the dtv that was allocated at
thread creation, but if a module is loaded later it will alloc
and install dtv in all threads and this will never be freed.

so each new dlopen may alloc tls_size*nthreads that is leaked.
this may be an issue with frequent dlopen (which is not an
issue in practice for musl because of the nop dlclose).
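(e.g. with 1000 threads and 4k of tls in the loaded module, each such
dlopen strands about 4M.)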

> 
> I've created a lightly tested variant of the patch that uses TLS instead
> of a global variable for the in-malloc state.  I don't think it's an
> improvement (particularly because of the missing compiler barrier before
> THREAD_GETMEM), so I'd like to stick with the first patch.

is it really possible to call the new functions before tp is
initialized?

i think orig is fine as the atomics should be in slow paths.
  
Florian Weimer June 26, 2024, 7:47 a.m. UTC | #4
* Szabolcs Nagy:

>> I don't think optimizing this is worth it.  We should switch to eager
>> TLS allocation, then __tls_get_addr would look like this:
>> 
>> void *
>> __tls_get_addr (GET_ADDR_ARGS)
>> {
>>   dtv_t *dtv = THREAD_DTV ();
>>   return (char *) dtv[GET_ADDR_MODULE] + GET_ADDR_OFFSET;
>> }
>> 
>> With multiple modules sharing one TLS module ID.  I assume that's what
>> musl is doing today.
>
> i don't think there is modid sharing in musl, otherwise yes.
>
> at thread exit musl only frees the dtv that was allocated at
> thread creation, but if a module is loaded later it will alloc
> and install dtv in all threads and this will never be freed.

With module ID sharing, DTV resizing won't be needed because we can have
a fixed number of DTV entries, with exponentially growing backing store
associated with it.
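
A sketch of the idea (invented names, purely illustrative): the DTV
keeps a fixed number of slots, and each slot owns a backing block that
grows geometrically, so no per-dlopen DTV reallocation is needed:

struct dtv_slot
{
  void *block;		/* TLS storage for all modules in this slot */
  size_t capacity;	/* doubled when a new module does not fit */
};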

>> I've created a lightly tested variant of the patch that uses TLS instead
>> of a global variable for the in-malloc state.  I don't think it's an
>> improvement (particularly because of the missing compiler barrier before
>> THREAD_GETMEM), so I'd like to stick with the first patch.
>
> is it really possible to call the new functions before tp is
> initialized?
>
> i think orig is fine as the atomics should be in slow paths.

Looks like _dl_allocate_tls_storage is called from init_tls, before the
call to __tls_init_tp, which installs the TCB for use by the current
thread.

I'll send a v2 with a test case, based on the original patch with the
global counter.

Thanks,
Florian
  

Patch

diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 670dbc42fc..c0d995d69d 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -75,6 +75,31 @@ 
 /* Default for dl_tls_static_optional.  */
 #define OPTIONAL_TLS 512
 
+/* Used to count the number of threads currently executing dynamic TLS
+   updates.  Used to avoid recursive malloc calls in __tls_get_addr
+   for an interposed malloc that uses global-dynamic TLS (which is not
+   recommended); see _dl_tls_allocate_active checks.  This could be a
+   per-thread flag, but would need TLS access in the dynamic linker.  */
+unsigned int _dl_tls_threads_in_update;
+
+static inline void
+_dl_tls_allocate_begin (void)
+{
+  atomic_fetch_add_relaxed (&_dl_tls_threads_in_update, 1);
+}
+
+static inline void
+_dl_tls_allocate_end (void)
+{
+  atomic_fetch_add_relaxed (&_dl_tls_threads_in_update, -1);
+}
+
+static inline bool
+_dl_tls_allocate_active (void)
+{
+  return atomic_load_relaxed (&_dl_tls_threads_in_update) > 0;
+}
+
 /* Compute the static TLS surplus based on the namespace count and the
    TLS space that can be used for optimizations.  */
 static inline int
@@ -425,12 +450,18 @@  _dl_allocate_tls_storage (void)
   size += TLS_PRE_TCB_SIZE;
 #endif
 
-  /* Perform the allocation.  Reserve space for the required alignment
-     and the pointer to the original allocation.  */
+  /* Reserve space for the required alignment and the pointer to the
+     original allocation.  */
   size_t alignment = GLRO (dl_tls_static_align);
+
+  /* Perform the allocation.  */
+  _dl_tls_allocate_begin ();
   void *allocated = malloc (size + alignment + sizeof (void *));
   if (__glibc_unlikely (allocated == NULL))
-    return NULL;
+    {
+      _dl_tls_allocate_end ();
+      return NULL;
+    }
 
   /* Perform alignment and allocate the DTV.  */
 #if TLS_TCB_AT_TP
@@ -466,6 +497,8 @@  _dl_allocate_tls_storage (void)
   result = allocate_dtv (result);
   if (result == NULL)
     free (allocated);
+
+  _dl_tls_allocate_end ();
   return result;
 }
 
@@ -483,6 +516,7 @@  _dl_resize_dtv (dtv_t *dtv, size_t max_modid)
   size_t newsize = max_modid + DTV_SURPLUS;
   size_t oldsize = dtv[-1].counter;
 
+  _dl_tls_allocate_begin ();
   if (dtv == GL(dl_initial_dtv))
     {
       /* This is the initial dtv that was either statically allocated in
@@ -502,6 +536,7 @@  _dl_resize_dtv (dtv_t *dtv, size_t max_modid)
       if (newp == NULL)
 	oom ();
     }
+  _dl_tls_allocate_end ();
 
   newp[0].counter = newsize;
 
@@ -676,7 +711,9 @@  allocate_dtv_entry (size_t alignment, size_t size)
   if (powerof2 (alignment) && alignment <= _Alignof (max_align_t))
     {
       /* The alignment is supported by malloc.  */
+      _dl_tls_allocate_begin ();
       void *ptr = malloc (size);
+      _dl_tls_allocate_end ();
       return (struct dtv_pointer) { ptr, ptr };
     }
 
@@ -688,7 +725,10 @@  allocate_dtv_entry (size_t alignment, size_t size)
 
   /* Perform the allocation.  This is the pointer we need to free
      later.  */
+  _dl_tls_allocate_begin ();
   void *start = malloc (alloc_size);
+  _dl_tls_allocate_end ();
+
   if (start == NULL)
     return (struct dtv_pointer) {};
 
@@ -826,7 +866,11 @@  _dl_update_slotinfo (unsigned long int req_modid, size_t new_gen)
 		 free implementation.  Checking here papers over at
 		 least some dynamic TLS usage by interposed mallocs.  */
 	      if (dtv[modid].pointer.to_free != NULL)
-		free (dtv[modid].pointer.to_free);
+		{
+		  _dl_tls_allocate_begin ();
+		  free (dtv[modid].pointer.to_free);
+		  _dl_tls_allocate_end ();
+		}
 	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
 	      dtv[modid].pointer.to_free = NULL;
 
@@ -953,13 +997,25 @@  __tls_get_addr (GET_ADDR_ARGS)
      module, but the global generation counter is easier to check (which
      must be synchronized up to the generation of the accessed module by
      user code doing the TLS access so relaxed mo read is enough).  */
-  size_t gen = atomic_load_relaxed (&GL(dl_tls_generation));
+    size_t gen = atomic_load_relaxed (&GL(dl_tls_generation));
   if (__glibc_unlikely (dtv[0].counter != gen))
     {
-      /* Update DTV up to the global generation, see CONCURRENCY NOTES
-         in _dl_update_slotinfo.  */
-      gen = atomic_load_acquire (&GL(dl_tls_generation));
-      return update_get_addr (GET_ADDR_PARAM, gen);
+      if (_dl_tls_allocate_active ()
+	  && GET_ADDR_MODULE < _dl_tls_initial_modid_limit)
+	  /* This is a reentrant __tls_get_addr call, but we can
+	     satisfy it because it's an initially-loaded module ID.
+	     These TLS slotinfo slots do not change, so the
+	     out-of-date generation counter does not matter.  However,
+	     if not in a TLS update, still update_get_addr below, to
+	     get off the slow path eventually.  */
+	;
+      else
+	{
+	  /* Update DTV up to the global generation, see CONCURRENCY NOTES
+	     in _dl_update_slotinfo.  */
+	  gen = atomic_load_acquire (&GL(dl_tls_generation));
+	  return update_get_addr (GET_ADDR_PARAM, gen);
+	}
     }
 
   void *p = dtv[GET_ADDR_MODULE].pointer.val;
@@ -969,7 +1025,7 @@  __tls_get_addr (GET_ADDR_ARGS)
 
   return (char *) p + GET_ADDR_OFFSET;
 }
-#endif
+#endif /* SHARED */
 
 
 /* Look up the module's TLS block as for __tls_get_addr,
@@ -1018,6 +1074,25 @@  _dl_tls_get_addr_soft (struct link_map *l)
   return data;
 }
 
+size_t _dl_tls_initial_modid_limit;
+
+void
+_dl_tls_initial_modid_limit_setup (void)
+{
+  struct dtv_slotinfo_list *listp = GL(dl_tls_dtv_slotinfo_list);
+  size_t idx;
+  for (idx = 0; idx < listp->len; ++idx)
+    {
+      struct link_map *l = listp->slotinfo[idx].map;
+      if (l == NULL
+	  /* The object can be unloaded, so its modid can be
+	     reassociated.  */
+	  || !(l->l_type == lt_executable || l->l_type == lt_library))
+	break;
+    }
+  _dl_tls_initial_modid_limit = idx;
+}
+
 
 void
 _dl_add_to_slotinfo (struct link_map *l, bool do_add)
@@ -1050,9 +1125,11 @@  _dl_add_to_slotinfo (struct link_map *l, bool do_add)
 	 the first slot.  */
       assert (idx == 0);
 
+      _dl_tls_allocate_begin ();
       listp = (struct dtv_slotinfo_list *)
 	malloc (sizeof (struct dtv_slotinfo_list)
 		+ TLS_SLOTINFO_SURPLUS * sizeof (struct dtv_slotinfo));
+      _dl_tls_allocate_end ();
       if (listp == NULL)
 	{
 	  /* We ran out of memory while resizing the dtv slotinfo list.  */
diff --git a/elf/rtld.c b/elf/rtld.c
index e9525ea987..6352ba76c5 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -788,6 +788,8 @@  init_tls (size_t naudit)
     _dl_fatal_printf ("\
 cannot allocate TLS data structures for initial thread\n");
 
+  _dl_tls_initial_modid_limit_setup ();
+
   /* Store for detection of the special case by __tls_get_addr
      so it knows not to pass this dtv to the normal realloc.  */
   GL(dl_initial_dtv) = GET_DTV (tcbp);
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index 50f58a60e3..656e8a3fa0 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -1256,6 +1256,20 @@  extern struct link_map *_dl_update_slotinfo (unsigned long int req_modid,
 					     size_t gen)
      attribute_hidden;
 
+/* The last TLS module ID that is initially loaded, plus 1.  TLS
+   addresses for modules with IDs lower than that can be obtained from
+   the DTV even if its generation is outdated.  */
+extern size_t _dl_tls_initial_modid_limit attribute_hidden attribute_relro;
+
+/* Compute _dl_tls_initial_modid_limit.  To be called after initial
+   relocation.  */
+void _dl_tls_initial_modid_limit_setup (void) attribute_hidden;
+
+/* Number of threads currently in a TLS update.  This is used to
+   detect reentrant __tls_get_addr calls without a per-thread
+   flag.  */
+extern unsigned int _dl_tls_threads_in_update attribute_hidden;
+
 /* Look up the module's TLS block as for __tls_get_addr,
    but never touch anything.  Return null if it's not allocated yet.  */
 extern void *_dl_tls_get_addr_soft (struct link_map *l) attribute_hidden;
diff --git a/sysdeps/x86_64/dl-tls.c b/sysdeps/x86_64/dl-tls.c
index 869023bbba..b3c1e4fcd7 100644
--- a/sysdeps/x86_64/dl-tls.c
+++ b/sysdeps/x86_64/dl-tls.c
@@ -41,7 +41,10 @@  __tls_get_addr_slow (GET_ADDR_ARGS)
   dtv_t *dtv = THREAD_DTV ();
 
   size_t gen = atomic_load_acquire (&GL(dl_tls_generation));
-  if (__glibc_unlikely (dtv[0].counter != gen))
+  if (__glibc_unlikely (dtv[0].counter != gen)
+      /* See comment in __tls_get_addr in elf/dl-tls.c.  */
+      && !(_dl_tls_allocate_active ()
+           && GET_ADDR_MODULE < _dl_tls_initial_modid_limit))
     return update_get_addr (GET_ADDR_PARAM, gen);
 
   return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL);