malloc: Improve malloc initialization
Checks

| Context | Check | Description |
|---|---|---|
| redhat-pt-bot/TryBot-apply_patch | success | Patch applied to master at the time it was sent |
| linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 | success | Build passed |
| linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 | success | Test passed |
| linaro-tcwg-bot/tcwg_glibc_build--master-arm | success | Build passed |
| linaro-tcwg-bot/tcwg_glibc_check--master-arm | success | Test passed |
| redhat-pt-bot/TryBot-32bit | success | Build for i686 |
| redhat-pt-bot/TryBot-still_applies | warning | Patch no longer applies to master |
Commit Message
Move malloc initialization out of hot paths - it is only required after checking
fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.
Passes regress, OK for commit?
---
Comments
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Move malloc initialization out of hot paths - it is only required after checking
> fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
> improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.
ptmalloc_init sets up the tunables needed to properly manage tcache and
fastbins...
Hi DJ,
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>> Move malloc initialization out of hot paths - it is only required after checking
>> fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
>> improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.
>
> ptmalloc_init sets up the tunables needed to properly manage tcache and
> fastbins...
And that's fine because one can't enter tcache code when it has not been
initialized yet. The same is true for the fastbins - global_max_fast is zero
initialized, so we can't enter the fastbin case until initialization has been run.
Or would you prefer calling ptmalloc_init explicitly during startup?
Cheers,
Wilco
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> Move malloc initialization out of hot paths - it is only required after checking
>>> fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
>>> improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.
>>
>> ptmalloc_init sets up the tunables needed to properly manage tcache and
>> fastbins...
>
> And that's fine because one can't enter tcache code when it has not been
> initialized yet. The same is true for the fastbins - global_max_fast is zero
> initialized, so we can't enter the fastbin case until initialization has been run.
>
> Or would you prefer calling ptmalloc_init explicitly during startup?
I think I brought that up in the distant past but don't recall the
reason. My concern is that ptmalloc_init does a *lot* and malloc is
complicated; if we're going to defer calling it we need to be really
sure that anything we do access is in some stable state without
ptmalloc_init being called. I mean, do we even know if there will be
any arenas to have an uninitialized fastbin in?
* Wilco Dijkstra:
> Or would you prefer calling ptmalloc_init explicitly during startup?
That means the initialization code would run even if malloc is
interposed. I wonder if that has adverse effects.
Hi Florian,
>> Or would you prefer calling ptmalloc_init explicitly during startup?
>
> That means the initialization code would run even if malloc is
> interposed. I wonder if that has adverse effects.
It would read some tunables and write a lot of global variables which are
never used afterwards. It's basically wasting cycles, but it's harmless.
We could avoid that by using a function that an interposer would override.
Unfortunately mallopt() is not supported by either Mimalloc or jemalloc, but
we could call malloc_usable_size (NULL) or posix_memalign (NULL, 0, 0), which
could do the initialization without having any side effects in external
allocators.
Cheers,
Wilco
* Wilco Dijkstra:
> Move malloc initialization out of hot paths - it is only required after checking
> fastbins in _int_malloc or before tcache initialization. Bench-malloc-thread
> improves by 1.8% for 1 thread and 1.3% for 32 threads on Neoverse V2.
>
> Passes regress, OK for commit?
>
> ---
>
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index b73ddbf554461da34d99258fae87c6ece6d175ba..b8eb4180570d20223109b1f0084f4a01bb97b9d6 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -3313,6 +3313,10 @@ tcache_init(void)
> if (tcache_shutting_down)
> return;
>
> + /* Ensure malloc is initialized before tcache. */
> + if (!__malloc_initialized)
> + ptmalloc_init ();
> +
> arena_get (ar_ptr, bytes);
> victim = _int_malloc (ar_ptr, bytes);
> if (!victim && ar_ptr != NULL)
> @@ -3389,8 +3393,6 @@ __libc_malloc (size_t bytes)
> _Static_assert (PTRDIFF_MAX <= SIZE_MAX / 2,
> "PTRDIFF_MAX is not more than half of SIZE_MAX");
>
> - if (!__malloc_initialized)
> - ptmalloc_init ();
I think we should document the invariant that tcache_init is called as
part of pthread_create, before the new thread starts running. This way,
ptmalloc_init does not need to be thread-safe. I think it's still the
case after this patch because MAYBE_INIT_TCACHE is called unconditionally
early and does not depend on the allocation size. (Not sure why this is
a macro and not a function.) I'd suggest comments on ptmalloc_init and
MAYBE_INIT_TCACHE that reference each other, and one of them should
describe the expected sequence of events. Maybe also put a comment on
MAYBE_INIT_TCACHE in __libc_malloc2?
Regarding doing this initialization early and unconditionally: This
could well work today, especially if we do it early via
__libc_early_init for dynamically linked builds (so that malloc is
available in PREINIT and LD_PRELOAD objects).
For static builds, additional work will be required, or we could keep
the current scheme for that case. (I think in general, malloc can be
called before ELF constructors have run in the static case.)
If we ever start allocating during initialization (I expect us to
reserve address space for all possible struct malloc_state objects, for
example) and we want to avoid that for interposed mallocs, then I think
a lazy scheme is the only feasible way. We can't call into the
interposed malloc from __libc_early_init because its ELF constructor has
not run yet.
Thanks,
Florian
@@ -3313,6 +3313,10 @@ tcache_init(void)
if (tcache_shutting_down)
return;
+ /* Ensure malloc is initialized before tcache. */
+ if (!__malloc_initialized)
+ ptmalloc_init ();
+
arena_get (ar_ptr, bytes);
victim = _int_malloc (ar_ptr, bytes);
if (!victim && ar_ptr != NULL)
@@ -3389,8 +3393,6 @@ __libc_malloc (size_t bytes)
_Static_assert (PTRDIFF_MAX <= SIZE_MAX / 2,
"PTRDIFF_MAX is not more than half of SIZE_MAX");
- if (!__malloc_initialized)
- ptmalloc_init ();
#if USE_TCACHE
bool err = tcache_try_malloc (bytes, &victim);
@@ -3488,9 +3490,6 @@ __libc_realloc (void *oldmem, size_t bytes)
void *newp; /* chunk to return */
- if (!__malloc_initialized)
- ptmalloc_init ();
-
#if REALLOC_ZERO_BYTES_FREES
if (bytes == 0 && oldmem != NULL)
{
@@ -3619,9 +3618,6 @@ libc_hidden_def (__libc_realloc)
void *
__libc_memalign (size_t alignment, size_t bytes)
{
- if (!__malloc_initialized)
- ptmalloc_init ();
-
void *address = RETURN_ADDRESS (0);
return _mid_memalign (alignment, bytes, address);
}
@@ -3632,9 +3628,6 @@ void *
weak_function
aligned_alloc (size_t alignment, size_t bytes)
{
- if (!__malloc_initialized)
- ptmalloc_init ();
-
/* Similar to memalign, but starting with ISO C17 the standard
requires an error for alignments that are not supported by the
implementation. Valid alignments for the current implementation
@@ -3742,9 +3735,6 @@ _mid_memalign (size_t alignment, size_t bytes, void *address)
void *
__libc_valloc (size_t bytes)
{
- if (!__malloc_initialized)
- ptmalloc_init ();
-
void *address = RETURN_ADDRESS (0);
size_t pagesize = GLRO (dl_pagesize);
return _mid_memalign (pagesize, bytes, address);
@@ -3753,9 +3743,6 @@ __libc_valloc (size_t bytes)
void *
__libc_pvalloc (size_t bytes)
{
- if (!__malloc_initialized)
- ptmalloc_init ();
-
void *address = RETURN_ADDRESS (0);
size_t pagesize = GLRO (dl_pagesize);
size_t rounded_bytes;
@@ -3790,9 +3777,6 @@ __libc_calloc (size_t n, size_t elem_size)
sz = bytes;
- if (!__malloc_initialized)
- ptmalloc_init ();
-
#if USE_TCACHE
bool err = tcache_try_malloc (bytes, &mem);
@@ -3816,7 +3800,7 @@ __libc_calloc (size_t n, size_t elem_size)
else
arena_get (av, sz);
- if (av)
+ if (av && top (av) != NULL)
{
/* Check if we hand out the top chunk, in which case there may be no
need to clear. */
@@ -4029,6 +4013,10 @@ _int_malloc (mstate av, size_t bytes)
}
}
+ /* Ensure malloc is initialized before accessing small/large bins. */
+ if (!__malloc_initialized)
+ ptmalloc_init ();
+
/*
If a small request, check regular bin. Since these "smallbins"
hold one size each, no searching within bins is necessary.
@@ -5848,9 +5836,6 @@ __posix_memalign (void **memptr, size_t alignment, size_t size)
{
void *mem;
- if (!__malloc_initialized)
- ptmalloc_init ();
-
/* Test whether the SIZE argument is valid. It must be a power of
two multiple of sizeof (void *). */
if (alignment % sizeof (void *) != 0