malloc: Reduce maximum arenas

Message ID PAWPR08MB89821FAE648721A1627D33C68351A@PAWPR08MB8982.eurprd08.prod.outlook.com (mailing list archive)
State New
Series malloc: Reduce maximum arenas

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed

Commit Message

Wilco Dijkstra April 2, 2026, 3:05 p.m. UTC
  The default maximum number of arenas is 8 times the number of cores on a 64-bit
system.  Since modern CPUs have many cores and big servers have 256 cores, this
results in an excessive number of arenas, which wastes memory.  Limit the number
of arenas to max (8, ncores), which is less extreme.  In the future the limit
should be lowered further for large systems.
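
In code terms, the new default (with the arena_max tunable unset) amounts to the
following, mirroring the hunk in the patch below:

/* New default arena limit when mp_.arena_max is 0: arena_test defaults
   to 8, so the limit becomes max (8, number of online cores).  */
narenas_limit = __get_nprocs ();
if (narenas_limit < mp_.arena_test)
  narenas_limit = mp_.arena_test;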

Passes regress, OK for commit?

---
  

Comments

DJ Delorie April 2, 2026, 4:19 p.m. UTC | #1
Do we have a situation where we hit the limit, but didn't need to?
Reading the sources, the number of arenas actually used depends on how
many threads are simultaneously inside the malloc code.  So if you hit
the max, that means you actually had that many threads doing malloc()
at the same time...

I mean, I agree that 8 per core is extreme for some machines - by
definition, we shouldn't be able to have more than one thread *running*
in malloc at a time per CPU, but for low-CPU, low-RAM machines that swap,
8 per core might be reasonable - a thread that's scheduled out shouldn't
block all other malloc'ing threads.
  
Wilco Dijkstra April 2, 2026, 5:07 p.m. UTC | #2
Hi DJ,

> Do we have a sitation where we hit the limit, but didn't need to?

Yes, there are various bug reports and complaints that GLIBC creates way
too many arenas. This was before the really big servers with 128+ cores
appeared, so the situation is now far worse...

> Reading the sources, the number of arenas actually used depends on how
> many threads are simultaneously inside the malloc code.  So if you hit
> the max, that means you actually had that many threads actually doing
> malloc() at the same time...

That's not how it works. We create a new arena for every thread that happens
to use malloc, even if only once. So if, say, you create a bunch of worker threads
that are just sitting idle, each gets its own arena and just wastes memory.
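
A hypothetical demo (names like NTHREADS and worker are made up; malloc_info is
the glibc-specific stats dump) shows the effect - each idle worker ends up with
its own arena:

/* Each worker mallocs once and then idles; on glibc each new thread's
   first malloc attaches to (and usually creates) a fresh arena until
   the arena limit is reached.  Build with: gcc -pthread demo.c  */
#include <malloc.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 16

static void *
worker (void *arg)
{
  (void) arg;
  void *p = malloc (16);   /* First malloc in this thread picks an arena.  */
  sleep (5);               /* Idle, but the arena stays attached.  */
  free (p);
  return NULL;
}

int
main (void)
{
  pthread_t t[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create (&t[i], NULL, worker, NULL);
  sleep (1);                /* Let the workers run their first malloc.  */
  malloc_info (0, stdout);  /* XML dump; count the <heap> entries.  */
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (t[i], NULL);
  return 0;
}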

So this is a conservative improvement - we'll need to further reduce the
number of arenas and improve concurrency.

> I mean, I agree that 8 per core is extreme for some machines - by
> definition, we shouldn't be able to have more than one thread *running*
> in malloc at a time per cpu, but for low-cpu low-ram machines that swap,
> 8 per core might be reasonable - a thread that's scheduled out shouldn't
> block all other malloc'ing threads.

Threads that are forced to wait will sleep; the thread that holds the lock
should get priority to finish, and the sleeping threads will then be woken up
once it is done.

Malloc could be improved significantly. Firstly, it shouldn't ever hold the lock
for long (as I mentioned, realloc doing a big memcpy while holding the lock is a
really bad design). The locking should be more fine-grained too: there is no need
for one global lock when you could use different locks for different chunk sizes.
And we can do far more in tcache without needing locks. Finally, rather than
blocking on a lock, we could try the lock and, if that fails, use another arena
as a fallback. There is also the possibility of spinning for a short time if we can
stop malloc holding onto locks for long (for example the unsorted bin scan will
still process up to 10000 blocks, each of which may need sorting into its
bin, potentially taking many millions of cycles).
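
A rough illustration of that trylock-and-fall-back idea (hypothetical struct
arena and try_get_arena; glibc's real arena_get differs in detail):

/* Sketch: try the preferred arena's lock and walk the arena list on
   contention instead of blocking; block only as a last resort.  */
#include <pthread.h>

struct arena
{
  pthread_mutex_t lock;
  struct arena *next;   /* Circular list of all arenas.  */
};

static struct arena *
try_get_arena (struct arena *preferred)
{
  struct arena *a = preferred;
  do
    {
      if (pthread_mutex_trylock (&a->lock) == 0)
        return a;       /* Uncontended arena found - no sleeping.  */
      a = a->next;
    }
  while (a != preferred);
  /* Every arena was busy: fall back to blocking on the preferred one.  */
  pthread_mutex_lock (&preferred->lock);
  return preferred;
}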

Cheers,
Wilco
  
DJ Delorie April 2, 2026, 5:20 p.m. UTC | #3
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> That's not how it works. We create a new arena for every thread that
> happens to use malloc, even if only once. So if say you create a bunch
> of worker threads that are just sitting idle, they get their own arena
> and are just wasting memory.

Ok, so I read it more carefully, and... ewww.  It *looks* better than it
is, but arena sharing only ever happens when you hit the limit.  Sigh.
  

Patch

diff --git a/malloc/arena.c b/malloc/arena.c
index ddde32c7121257b4140c902f40971385411bab52..1b73f74a021f5fd843c0ddf09568a75f67824e29 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -813,16 +813,11 @@  arena_get2 (size_t size, mstate avoid_arena)
         {
           if (mp_.arena_max != 0)
             narenas_limit = mp_.arena_max;
-          else if (narenas > mp_.arena_test)
+          else if (narenas >= mp_.arena_test)
             {
-              int n = __get_nprocs ();
-
-              if (n >= 1)
-                narenas_limit = NARENAS_FROM_NCORES (n);
-              else
-                /* We have no information about the system.  Assume two
-                   cores.  */
-                narenas_limit = NARENAS_FROM_NCORES (2);
+	      narenas_limit = __get_nprocs ();
+	      if (narenas_limit < mp_.arena_test)
+		narenas_limit = mp_.arena_test;
             }
         }
     repeat:;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 6a888b0eb7de53ae7b814275e86d2bd2f06b5e53..48fb368b245833ec10bed4234946ad8cc57d7bfc 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -1812,8 +1812,7 @@  static struct malloc_par mp_ =
   .n_mmaps_max = DEFAULT_MMAP_MAX,
   .mmap_threshold = DEFAULT_MMAP_THRESHOLD,
   .trim_threshold = DEFAULT_TRIM_THRESHOLD,
-#define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))
-  .arena_test = NARENAS_FROM_NCORES (1),
+  .arena_test = 8,
   .thp_mode = malloc_thp_mode_not_supported
 #if USE_TCACHE
   ,
diff --git a/manual/memory.texi b/manual/memory.texi
index 4f0ef51514057136c14ceb6811face0117e69272..35ebe499287035d1dd4bf900debfc7b251ef747a 100644
--- a/manual/memory.texi
+++ b/manual/memory.texi
@@ -1384,8 +1384,7 @@  This parameter specifies the number of arenas that can be created before the
 test on the limit to the number of arenas is conducted. The value is ignored if
 @code{M_ARENA_MAX} is set.
 
-The default value of this parameter is 2 on 32-bit systems and 8 on 64-bit
-systems.
+The default value of this parameter is 8.
 
 This parameter can also be set for the process at startup by setting the
 environment variable @env{MALLOC_ARENA_TEST} to the desired value.
@@ -1395,10 +1394,9 @@  This parameter sets the number of arenas to use regardless of the number of
 cores in the system.
 
 The default value of this tunable is @code{0}, meaning that the limit on the
-number of arenas is determined by the number of CPU cores online. For 32-bit
-systems the limit is twice the number of cores online and on 64-bit systems, it
-is eight times the number of cores online.  Note that the default value is not
-derived from the default value of M_ARENA_TEST and is computed independently.
+number of arenas is determined by the number of CPU cores online. The limit is
+the number of cores online.  Note that the default value is not derived from
+the default value of M_ARENA_TEST and is computed independently.
 
 This parameter can also be set for the process at startup by setting the
 environment variable @env{MALLOC_ARENA_MAX} to the desired value.
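
For reference, the same limits can also be set programmatically before any
allocation via mallopt, which glibc documents alongside these tunables.  A
minimal example (per the manual text above, M_ARENA_MAX makes the arena test
value irrelevant):

/* Cap the process at 2 arenas; M_ARENA_MAX is a glibc extension
   from <malloc.h>.  */
#include <malloc.h>
#include <stdlib.h>

int
main (void)
{
  mallopt (M_ARENA_MAX, 2);   /* Hard cap; overrides the arena test.  */
  void *p = malloc (64);      /* All threads now share at most 2 arenas.  */
  free (p);
  return 0;
}

Equivalently at startup: MALLOC_ARENA_MAX=2 ./prog.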