malloc: Improve malloc initialization

Message ID PAWPR08MB8982669218F062038B063C0783A42@PAWPR08MB8982.eurprd08.prod.outlook.com (mailing list archive)
State Superseded
Series malloc: Improve malloc initialization

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed
redhat-pt-bot/TryBot-32bit success Build for i686
redhat-pt-bot/TryBot-still_applies warning Patch no longer applies to master

Commit Message

Wilco Dijkstra March 24, 2025, 11:12 p.m. UTC
  Move malloc initialization out of hot paths - it is only required after checking
fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.

Passes regress, OK for commit?

---
  

Comments

DJ Delorie March 24, 2025, 11:50 p.m. UTC | #1
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Move malloc initialization out of hot paths - it is only required after checking
> fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
> improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.

ptmalloc_init sets up the tunables needed to properly manage tcache and
fastbins...
  
Wilco Dijkstra March 25, 2025, 12:30 a.m. UTC | #2
Hi DJ,

> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>> Move malloc initialization out of hot paths - it is only required after checking
>> fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
>> improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.
>
> ptmalloc_init sets up the tunables needed to properly manage tcache and
> fastbins...

And that's fine because one can't enter tcache code when it has not been
initialized yet. The same is true for the fastbins - global_max_fast is
zero-initialized, so we can't enter the fastbin case until initialization has been run.

Or would you prefer calling ptmalloc_init explicitly during startup?

Cheers,
Wilco
  
DJ Delorie March 25, 2025, 4:02 a.m. UTC | #3
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> Move malloc initialization out of hot paths - it is only required after checking
>>> fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
>>> improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.
>>
>> ptmalloc_init sets up the tunables needed to properly manage tcache and
>> fastbins...
>
> And that's fine because one can't enter tcache code when it has not been
> initialized yet. The same is true for the fastbins - global_max_fast is
> zero-initialized, so we can't enter the fastbin case until initialization has been run.
>
> Or would you prefer calling ptmalloc_init explicitly during startup?

I think I brought that up in the distant past but don't recall the
reason.  My concern is that ptmalloc_init does a *lot* and malloc is
complicated; if we're going to defer calling it we need to be really
sure that anything we do access is in some stable state without
ptmalloc_init being called.  I mean, do we even know if there will be
any arenas to have an uninitialized fastbin in?
  
Florian Weimer March 25, 2025, 4:57 a.m. UTC | #4
* Wilco Dijkstra:

> Or would you prefer calling ptmalloc_init explicitly during startup?

That means the initialization code would run even if malloc is
interposed.  I wonder if that has adverse effects.
  
Wilco Dijkstra March 27, 2025, 2:29 p.m. UTC | #5
Hi Florian,

>> Or would you prefer calling ptmalloc_init explicitly during startup?
>
> That means the initialization code would run even if malloc is
> interposed.  I wonder if that has adverse effects.

It would read some tunables and write a lot of global variables that are
never used afterwards. It's basically wasting cycles, but it's harmless.

We could avoid that by using a function that the interposed allocator would
override. Unfortunately mallopt() is not supported by either Mimalloc or
jemalloc, but we could call malloc_usable_size (NULL) or
posix_memalign (NULL, 0, 0), which could do the initialization without having
any side effects in external allocators.

Cheers,
Wilco
  
Florian Weimer March 31, 2025, 2:21 p.m. UTC | #6
* Wilco Dijkstra:

> Move malloc initialization out of hot paths - it is only required after checking
> fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
> improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.
>
> Passes regress, OK for commit?
>
> ---
>
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index b73ddbf554461da34d99258fae87c6ece6d175ba..b8eb4180570d20223109b1f0084f4a01bb97b9d6 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -3313,6 +3313,10 @@ tcache_init(void)
>    if (tcache_shutting_down)
>      return;
>  
> +  /* Ensure malloc is initialized before tcache.  */
> +  if (!__malloc_initialized)
> +    ptmalloc_init ();
> +
>    arena_get (ar_ptr, bytes);
>    victim = _int_malloc (ar_ptr, bytes);
>    if (!victim && ar_ptr != NULL)
> @@ -3389,8 +3393,6 @@ __libc_malloc (size_t bytes)
>    _Static_assert (PTRDIFF_MAX <= SIZE_MAX / 2,
>                    "PTRDIFF_MAX is not more than half of SIZE_MAX");
>  
> -  if (!__malloc_initialized)
> -    ptmalloc_init ();

I think we should document the invariant that tcache_init is called as
part of pthread_create, before the new thread starts running.  This way,
ptmalloc_init does not need to be thread-safe.  I think it's still the
case after this patch because MAYBE_INIT_TCACHE is called unconditionally
early and does not depend on the allocation size.  (Not sure why this is
a macro and not a function.)  I'd suggest comments on ptmalloc_init and
MAYBE_INIT_TCACHE that reference each other, and one of them should
describe the expected sequence of events.  Maybe also put a comment on
MAYBE_INIT_TCACHE in __libc_malloc2?

Regarding doing this initialization early and unconditionally: This
could well work today, especially if we do it early via
__libc_early_init for dynamically linked builds (so that malloc is
available in PREINIT and LD_PRELOAD objects).

For static builds, additional work will be required, or we could keep
the current scheme for that case.  (I think in general, malloc can be
called before ELF constructors have run in the static case.)

If we ever start allocating during initialization (I expect us to
reserve address space for all possible struct malloc_state objects, for
example) and we want to avoid that for interposed mallocs, then I think
a lazy scheme is the only feasible way.  We can't call into the
interposed malloc from __libc_early_init because its ELF constructor has
not run yet.

Thanks,
Florian
  

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index b73ddbf554461da34d99258fae87c6ece6d175ba..b8eb4180570d20223109b1f0084f4a01bb97b9d6 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3313,6 +3313,10 @@  tcache_init(void)
   if (tcache_shutting_down)
     return;
 
+  /* Ensure malloc is initialized before tcache.  */
+  if (!__malloc_initialized)
+    ptmalloc_init ();
+
   arena_get (ar_ptr, bytes);
   victim = _int_malloc (ar_ptr, bytes);
   if (!victim && ar_ptr != NULL)
@@ -3389,8 +3393,6 @@  __libc_malloc (size_t bytes)
   _Static_assert (PTRDIFF_MAX <= SIZE_MAX / 2,
                   "PTRDIFF_MAX is not more than half of SIZE_MAX");
 
-  if (!__malloc_initialized)
-    ptmalloc_init ();
 #if USE_TCACHE
   bool err = tcache_try_malloc (bytes, &victim);
 
@@ -3488,9 +3490,6 @@  __libc_realloc (void *oldmem, size_t bytes)
 
   void *newp;             /* chunk to return */
 
-  if (!__malloc_initialized)
-    ptmalloc_init ();
-
 #if REALLOC_ZERO_BYTES_FREES
   if (bytes == 0 && oldmem != NULL)
     {
@@ -3619,9 +3618,6 @@  libc_hidden_def (__libc_realloc)
 void *
 __libc_memalign (size_t alignment, size_t bytes)
 {
-  if (!__malloc_initialized)
-    ptmalloc_init ();
-
   void *address = RETURN_ADDRESS (0);
   return _mid_memalign (alignment, bytes, address);
 }
@@ -3632,9 +3628,6 @@  void *
 weak_function
 aligned_alloc (size_t alignment, size_t bytes)
 {
-  if (!__malloc_initialized)
-    ptmalloc_init ();
-
 /* Similar to memalign, but starting with ISO C17 the standard
    requires an error for alignments that are not supported by the
    implementation.  Valid alignments for the current implementation
@@ -3742,9 +3735,6 @@  _mid_memalign (size_t alignment, size_t bytes, void *address)
 void *
 __libc_valloc (size_t bytes)
 {
-  if (!__malloc_initialized)
-    ptmalloc_init ();
-
   void *address = RETURN_ADDRESS (0);
   size_t pagesize = GLRO (dl_pagesize);
   return _mid_memalign (pagesize, bytes, address);
@@ -3753,9 +3743,6 @@  __libc_valloc (size_t bytes)
 void *
 __libc_pvalloc (size_t bytes)
 {
-  if (!__malloc_initialized)
-    ptmalloc_init ();
-
   void *address = RETURN_ADDRESS (0);
   size_t pagesize = GLRO (dl_pagesize);
   size_t rounded_bytes;
@@ -3790,9 +3777,6 @@  __libc_calloc (size_t n, size_t elem_size)
 
   sz = bytes;
 
-  if (!__malloc_initialized)
-    ptmalloc_init ();
-
 #if USE_TCACHE
   bool err = tcache_try_malloc (bytes, &mem);
 
@@ -3816,7 +3800,7 @@  __libc_calloc (size_t n, size_t elem_size)
   else
     arena_get (av, sz);
 
-  if (av)
+  if (av && top (av) != NULL)
     {
       /* Check if we hand out the top chunk, in which case there may be no
 	 need to clear. */
@@ -4029,6 +4013,10 @@  _int_malloc (mstate av, size_t bytes)
 	}
     }
 
+  /* Ensure malloc is initialized before accessing small/large bins.  */
+  if (!__malloc_initialized)
+    ptmalloc_init ();
+
   /*
      If a small request, check regular bin.  Since these "smallbins"
      hold one size each, no searching within bins is necessary.
@@ -5848,9 +5836,6 @@  __posix_memalign (void **memptr, size_t alignment, size_t size)
 {
   void *mem;
 
-  if (!__malloc_initialized)
-    ptmalloc_init ();
-
   /* Test whether the SIZE argument is valid.  It must be a power of
      two multiple of sizeof (void *).  */
   if (alignment % sizeof (void *) != 0