malloc: Use internal TLS for tcache and thread_arena

Message ID: PAWPR08MB8982A7804486D83DFA27EA4883AA2@PAWPR08MB8982.eurprd08.prod.outlook.com
State: Under Review
Delegated to: Florian Weimer

Checks

Context                                           Check    Description
redhat-pt-bot/TryBot-apply_patch                  success  Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm      success  Build passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64  success  Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm      fail     Test failed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64  fail     Test failed

Commit Message

Wilco Dijkstra April 7, 2025, 12:50 p.m. UTC
Use internal TLS for faster access to frequently used thread-local
data.  Add tcache and thread_arena to tls-internal.h to avoid GOT
indirections.  Performance of bench-malloc-thread 32 improves by 2.2%
on Neoverse V2.

---
  

Comments

Florian Weimer April 7, 2025, 5:27 p.m. UTC | #1
* Wilco Dijkstra:

> Use internal TLS for faster access to frequently used thread-local
> data.  Add tcache and thread_arena to tls-internal.h to avoid GOT
> indirections.  Performance of bench-malloc-thread 32 improves by 2.2%
> on Neoverse V2.

I'm pretty sure this is incompatible with dlmopen and auditors.

The separate mallocs work despite their separated data structures
because malloc and free from different address spaces are not expected
to interoperate (and generally can't because an auditor can legitimately
use a completely different malloc implementation than the main program).

Having a shared tcache breaks that because it's totally possible that
tcache gets populated in one namespace, but freed in another namespace,
using the respective lower-level allocators.  For non-main arenas, this
may not trigger heap corruption immediately, but for the main arena, I
expect this to be rather problematic because the locks and everything
are separate.

However, I like the direction of this change.  We absolutely should make
this optimization for the initial namespace.  But we need to find a way
to redirect mallocs in secondary namespaces to a different
implementation, without additional run-time checks in the main
namespace.

Thanks,
Florian
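
The concern raised in this comment can be illustrated with a toy model.
The sketch below is not glibc code: it assumes two independent allocator
instances (standing in for the malloc copies in two dlmopen namespaces)
that share a single per-thread cache, as they would if tcache lived in
shared internal TLS.  All names (ns_allocator, shared_cache, ns_alloc,
ns_free, cross_namespace_leak) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy chunk: remembers which allocator instance it came from.  */
struct chunk
{
  struct chunk *next;
  int owner;
};

/* Hypothetical per-namespace allocator with its own private pool.  */
struct ns_allocator
{
  int id;
  struct chunk pool[4];
  size_t used;
};

/* Assumption: one cache per thread, shared by every namespace.  */
static _Thread_local struct chunk *shared_cache;

static struct chunk *
ns_alloc (struct ns_allocator *ns)
{
  /* Fast path: take whatever is cached, regardless of its owner.  */
  if (shared_cache != NULL)
    {
      struct chunk *c = shared_cache;
      shared_cache = c->next;
      return c;
    }
  /* Slow path: carve a chunk out of this namespace's own pool.  */
  assert (ns->used < 4);
  struct chunk *c = &ns->pool[ns->used++];
  c->owner = ns->id;
  c->next = NULL;
  return c;
}

static void
ns_free (struct ns_allocator *ns, struct chunk *c)
{
  (void) ns;                    /* The cache does not record the namespace.  */
  c->next = shared_cache;
  shared_cache = c;
}

/* Returns true if namespace B is handed a chunk that belongs to A.  */
static bool
cross_namespace_leak (void)
{
  struct ns_allocator a = { .id = 1 };
  struct ns_allocator b = { .id = 2 };

  struct chunk *c = ns_alloc (&a);      /* Comes from A's pool.  */
  ns_free (&a, c);                      /* Lands in the shared cache.  */
  struct chunk *d = ns_alloc (&b);      /* B pops A's chunk.  */

  bool leaked = d->owner != b.id;
  shared_cache = NULL;                  /* Leave the cache empty.  */
  return leaked;
}
```

In this model cross_namespace_leak () returns true: the chunk carved from
A's pool ends up in B's hands, so any later slow-path operation in B would
act on memory that A's data structures and locks still describe.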
  

Patch

diff --git a/malloc/arena.c b/malloc/arena.c
index 5672c699aa1a0306a13ebc54a08263f35024b3f1..4794eff91dcb40cdc7497c91036caefc1a18d800 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -18,6 +18,7 @@ 
 
 #include <stdbool.h>
 #include <setvmaname.h>
+#include <tls-internal.h>
 
 #define TUNABLE_NAMESPACE malloc
 #include <elf/dl-tunables.h>
@@ -86,8 +87,16 @@  extern int sanity_check_heap_info_alignment[(sizeof (heap_info)
 
 /* Thread specific data.  */
 
+#if IS_IN (libc)
+
+#define thread_arena __glibc_tls_internal ()->thread_arena
+
+#else
+
 static __thread mstate thread_arena attribute_tls_model_ie;
 
+#endif
+
 /* Arena free list.  free_list_lock synchronizes access to the
    free_list variable below, and the next_free and attached_threads
    members of struct malloc_state objects.  No other locks must be
diff --git a/malloc/malloc.c b/malloc/malloc.c
index a0bc733482532ce34684d0357cb9076b03ac8a52..74ad32b59a6e9799e3e6ab92f73395707bce2cc6 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -257,6 +257,8 @@ 
 #include <sys/random.h>
 #include <not-cancel.h>
 
+#include <tls-internal.h>
+
 /*
   Debugging:
 
@@ -3127,7 +3129,8 @@  typedef struct tcache_perthread_struct
 } tcache_perthread_struct;
 
 static __thread bool tcache_shutting_down = false;
-static __thread tcache_perthread_struct *tcache = NULL;
+
+#define tcache  __glibc_tls_internal ()->tcache
 
 /* Process-wide key to try and catch a double-free in the same thread.  */
 static uintptr_t tcache_key;
diff --git a/sysdeps/generic/tls-internal-struct.h b/sysdeps/generic/tls-internal-struct.h
index b98db8a9330935dd4c96b5ecc21740c1d43d28de..e0ea2f50dd00786830a1aecdc5d0b24a69b835ba 100644
--- a/sysdeps/generic/tls-internal-struct.h
+++ b/sysdeps/generic/tls-internal-struct.h
@@ -23,6 +23,9 @@  struct tls_internal_t
 {
   char *strsignal_buf;
   char *strerror_l_buf;
+
+  struct tcache_perthread_struct *tcache;
+  struct malloc_state *thread_arena;
 };
 
 #endif
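
The access pattern the patch introduces can be sketched outside glibc as
follows.  This is a minimal illustration under stated assumptions, not the
glibc implementation: sketch_tls_internal () is a hypothetical stand-in
for __glibc_tls_internal (), the two allocator structs are empty
placeholders, and a single _Thread_local block stands in for glibc's
internal TLS storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholders for the allocator's private types.  */
struct tcache_perthread_struct { int unused; };
struct malloc_state { int unused; };

/* Same fields as struct tls_internal_t after the patch.  */
struct tls_internal_t
{
  char *strsignal_buf;
  char *strerror_l_buf;
  struct tcache_perthread_struct *tcache;
  struct malloc_state *thread_arena;
};

/* Assumption: one thread-local block holds all internal fields.  */
static _Thread_local struct tls_internal_t tls_internal_block;

/* Hypothetical accessor playing the role of __glibc_tls_internal ().  */
static inline struct tls_internal_t *
sketch_tls_internal (void)
{
  return &tls_internal_block;
}

/* Raw field readers, defined before the macros below take over the names.  */
static void *
sketch_raw_tcache (void)
{
  return tls_internal_block.tcache;
}

static void *
sketch_raw_arena (void)
{
  return tls_internal_block.thread_arena;
}

/* Existing code keeps using the plain names, as in the patch.  */
#define tcache (sketch_tls_internal ()->tcache)
#define thread_arena (sketch_tls_internal ()->thread_arena)

/* Returns 1 if both names read and write through the shared block.  */
static int
sketch_selfcheck (void)
{
  static struct tcache_perthread_struct tc;
  static struct malloc_state arena;

  assert (sketch_tls_internal () == &tls_internal_block);
  tcache = &tc;
  thread_arena = &arena;
  return sketch_raw_tcache () == &tc && sketch_raw_arena () == &arena;
}
```

Because the macros are object-like, every later use of the identifiers
tcache and thread_arena expands to a field access through the accessor,
which is why the sketch reads the raw fields only from helpers defined
before the #define lines.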