malloc: Reduce maximum arenas
Checks

| Context | Check | Description |
| redhat-pt-bot/TryBot-apply_patch | success | Patch applied to master at the time it was sent |
| linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 | success | Build passed |
| linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 | success | Test passed |
| redhat-pt-bot/TryBot-32bit | success | Build for i686 |
| linaro-tcwg-bot/tcwg_glibc_build--master-arm | success | Build passed |
| linaro-tcwg-bot/tcwg_glibc_check--master-arm | success | Test passed |
Commit Message
The default maximum number of arenas is 8 times the number of cores on a 64-bit
system. Since modern CPUs have many cores and big servers have 256 cores, this
results in an excessive number of arenas, which wastes memory. Limit the number
of arenas to max (8, ncores), which is less extreme. In the future the limit
should be lowered further for large systems.
Passes regress, OK for commit?
---
Comments
Do we have a situation where we hit the limit, but didn't need to?
Reading the sources, the number of arenas actually used depends on how
many threads are simultaneously inside the malloc code. So if you hit
the max, that means you actually had that many threads doing malloc()
at the same time...
I mean, I agree that 8 per core is extreme for some machines - by
definition, we shouldn't be able to have more than one thread *running*
in malloc at a time per cpu, but for low-cpu low-ram machines that swap,
8 per core might be reasonable - a thread that's scheduled out shouldn't
block all other malloc'ing threads.
Hi DJ,
> Do we have a situation where we hit the limit, but didn't need to?
Yes, there are various bug reports and complaints that GLIBC creates way
too many arenas. This was before the really big servers with 128+ cores
appeared, so the situation is now far worse...
> Reading the sources, the number of arenas actually used depends on how
> many threads are simultaneously inside the malloc code. So if you hit
> the max, that means you actually had that many threads doing malloc()
> at the same time...
That's not how it works. We create a new arena for every thread that happens
to use malloc, even if only once. So if, say, you create a bunch of worker
threads that are just sitting idle, each gets its own arena and just wastes
memory.
So this is a conservative improvement - we'll need to further reduce the
number of arenas and improve concurrency.
> I mean, I agree that 8 per core is extreme for some machines - by
> definition, we shouldn't be able to have more than one thread *running*
> in malloc at a time per cpu, but for low-cpu low-ram machines that swap,
> 8 per core might be reasonable - a thread that's scheduled out shouldn't
> block all other malloc'ing threads.
Threads that are forced to wait will sleep; the thread that holds the lock
should get priority to finish, and the sleeping threads will be woken up
once it is done.
Malloc could be improved significantly. Firstly, it shouldn't ever hold the
lock for long (as I mentioned, realloc doing a big memcpy while holding the
lock is a really bad design). The locking should be more fine-grained too;
there is no need to use one global lock when you could use different locks
for different chunk sizes. And we can do far more in tcache without needing
locks. Finally, rather than blocking on a lock, we could try the lock and,
if that fails, use another arena as a fallback. There is also the possibility
of spinning for a short time if we can stop malloc holding onto locks for
long (for example, the unsorted bin scan will still process up to 10000
blocks, each of which may need sorting into its bin, potentially taking many
millions of cycles).
Cheers,
Wilco
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> That's not how it works. We create a new arena for every thread that
> happens to use malloc, even if only once. So if say you create a bunch
> of worker threads that are just sitting idle, they get their own arena
> and are just wasting memory.
Ok, so I read it more carefully, and... ewww. It *looks* better than it
is, but arena sharing only ever happens when you hit the limit. Sigh.
@@ -813,16 +813,11 @@ arena_get2 (size_t size, mstate avoid_arena)
{
if (mp_.arena_max != 0)
narenas_limit = mp_.arena_max;
- else if (narenas > mp_.arena_test)
+ else if (narenas >= mp_.arena_test)
{
- int n = __get_nprocs ();
-
- if (n >= 1)
- narenas_limit = NARENAS_FROM_NCORES (n);
- else
- /* We have no information about the system. Assume two
- cores. */
- narenas_limit = NARENAS_FROM_NCORES (2);
+ narenas_limit = __get_nprocs ();
+ if (narenas_limit < mp_.arena_test)
+ narenas_limit = mp_.arena_test;
}
}
repeat:;
@@ -1812,8 +1812,7 @@ static struct malloc_par mp_ =
.n_mmaps_max = DEFAULT_MMAP_MAX,
.mmap_threshold = DEFAULT_MMAP_THRESHOLD,
.trim_threshold = DEFAULT_TRIM_THRESHOLD,
-#define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
- .arena_test = NARENAS_FROM_NCORES (1),
+ .arena_test = 8,
.thp_mode = malloc_thp_mode_not_supported
#if USE_TCACHE
,
@@ -1384,8 +1384,7 @@ This parameter specifies the number of arenas that can be created before the
test on the limit to the number of arenas is conducted. The value is ignored if
@code{M_ARENA_MAX} is set.
-The default value of this parameter is 2 on 32-bit systems and 8 on 64-bit
-systems.
+The default value of this parameter is 8.
This parameter can also be set for the process at startup by setting the
environment variable @env{MALLOC_ARENA_TEST} to the desired value.
@@ -1395,10 +1394,9 @@ This parameter sets the number of arenas to use regardless of the number of
cores in the system.
The default value of this tunable is @code{0}, meaning that the limit on the
-number of arenas is determined by the number of CPU cores online. For 32-bit
-systems the limit is twice the number of cores online and on 64-bit systems, it
-is eight times the number of cores online. Note that the default value is not
-derived from the default value of M_ARENA_TEST and is computed independently.
+number of arenas is determined by the number of CPU cores online. The limit is
+the number of cores online. Note that the default value is not derived from
+the default value of M_ARENA_TEST and is computed independently.
This parameter can also be set for the process at startup by setting the
environment variable @env{MALLOC_ARENA_MAX} to the desired value.