malloc: Improve documentation of malloc tunables

Message ID PAWPR08MB89825A9F3E4CA697455DA6728346A@PAWPR08MB8982.eurprd08.prod.outlook.com (mailing list archive)
State Under Review
Delegated to: Arjun Shankar
Series: malloc: Improve documentation of malloc tunables

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed
redhat-pt-bot/TryBot-32bit success Build for i686

Commit Message

Wilco Dijkstra March 10, 2026, 3:56 p.m. UTC
  Update default for tcache_count tunable.  Remove existing documentation and
mention removal of fastbins in mxfast tunable.  Improve wording of hugetlb
tunable, including default for AArch64.

OK for commit?

---
  

Comments

Adhemerval Zanella Netto March 10, 2026, 4:32 p.m. UTC | #1
On 10/03/26 12:56, Wilco Dijkstra wrote:
> 
> Update default for tcache_count tunable.  Remove existing documentation and
> mention removal of fastbins in mxfast tunable.  Improve wording of hugetlb
> tunable, including default for AArch64.
> 
> OK for commit?

It does not address the value returned by --list-tunables, which can be
misleading; nor the issues about memory consumption raised by Dimitri.

I really think we should be more conservative and *not* make this the default,
enabling it only through the tunable.  With system-wide tunables we will
have an option to enable this system-wide if users do want it.

> 
> ---
> 
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 72769428e8cac27723fdec9ad83a92dc0b27415a..5850945a59e8494a2bb315da5a3e919112c5c562 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -231,37 +231,29 @@ per-thread cache.  The default (and maximum) value is 1032 bytes on
>  @end deftp
>  
>  @deftp Tunable glibc.malloc.tcache_count
> -The maximum number of chunks of each size to cache. The default is 7.
> +The maximum number of chunks of each size to cache. The default is 16.
>  The upper limit is 65535.  If set to zero, the per-thread cache is effectively
>  disabled.
>  
>  The approximate maximum overhead of the per-thread cache is thus equal
>  to the number of bins times the chunk count in each bin times the size
> -of each chunk.  With defaults, the approximate maximum overhead of the
> -per-thread cache is approximately 236 KB on 64-bit systems and 118 KB
> -on 32-bit systems.
> +of each chunk.
>  @end deftp
>  
>  @deftp Tunable glibc.malloc.mxfast
> -One of the optimizations @code{malloc} uses is to maintain a series of ``fast
> -bins'' that hold chunks up to a specific size.  The default and
> -maximum size which may be held this way is 80 bytes on 32-bit systems
> -or 160 bytes on 64-bit systems.  Applications which value size over
> -speed may choose to reduce the size of requests which are serviced
> -from fast bins with this tunable.  Note that the value specified
> -includes @code{malloc}'s internal overhead, which is normally the size of one
> -pointer, so add 4 on 32-bit systems or 8 on 64-bit systems to the size
> -passed to @code{malloc} for the largest bin size to enable.
> +This tunable has no effect since the ``fastbins'' have been removed.
>  @end deftp
>  
>  @deftp Tunable glibc.malloc.hugetlb
>  This tunable controls the usage of Huge Pages on @code{malloc} calls.  The
> -default value is @code{0}, which disables any additional support on
> -@code{malloc}.
> +default value is @code{0} on most targets.  Using @code{0} disables support
> +that improves use of huge pages in @code{malloc}.  However huge pages may
> +still be created depending on the OS settings.
>  
>  Setting its value to @code{1} enables the use of @code{madvise} with
>  @code{MADV_HUGEPAGE} after memory allocation with @code{mmap}.  It is enabled
>  only if the system supports Transparent Huge Page (currently only on Linux).
> +This is the default used for AArch64.
>  
>  Setting its value to @code{2} enables the use of Huge Page directly with
>  @code{mmap} with the use of @code{MAP_HUGETLB} flag.  The huge page size
>
  
Wilco Dijkstra March 10, 2026, 5:03 p.m. UTC | #2
Hi Adhemerval,

On 10/03/26 12:56, Wilco Dijkstra wrote:
>
> Update default for tcache_count tunable.  Remove existing documentation and
> mention removal of fastbins in mxfast tunable.  Improve wording of hugetlb
> tunable, including default for AArch64.
>
> OK for commit?

> It does not address the value returned by --list-tunables, which can be
> misleading; nor the issues about memory consumption raised by Dimitri.

I'm testing a patch that shows it in --list-tunables; I'll post it when I get
some results back.

> I really think we should be more conservative and *not* make this default
> and only enable through the tunable.  With system-wide tunables we will
> have an option to enable this as side-wide if user do want it.

We have been super conservative for way too long already. It's around 15 years
since THP was enabled by default by distros and 4 years since the hugetlb tunable
was added to glibc. How many applications actually set glibc.malloc.hugetlb?

https://codesearch.debian.net/search?q=glibc.malloc.hugetlb&literal=1&perpkg=1

So the question is, how many more years should we wait before developers
figure out how to use the tunables?

Or we can use the distro settings and use THP when enabled. People can
disable it if it doesn't work well for them.
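For reference, opting in with the existing tunable is a one-liner per process;
a sketch (using `ls` only as a stand-in for any glibc-linked program):

```shell
# Enable madvise(MADV_HUGEPAGE)-based huge pages (tunable value 1) for a
# single process, without changing any system-wide settings.
GLIBC_TUNABLES=glibc.malloc.hugetlb=1 ls / > /dev/null && echo enabled
```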

Cheers,
Wilco
  
Dev Jain March 11, 2026, 6:32 a.m. UTC | #3
On 10/03/26 10:02 pm, Adhemerval Zanella Netto wrote:
> 
> 
> On 10/03/26 12:56, Wilco Dijkstra wrote:
>>
>> Update default for tcache_count tunable.  Remove existing documentation and
>> mention removal of fastbins in mxfast tunable.  Improve wording of hugetlb
>> tunable, including default for AArch64.
>>
>> OK for commit?
> 
> It does not address the value returned by --list-tunables, which can be
> misleading; nor the issues about memory consumption raised by Dimitri.

The memory consumption issue can perhaps be fixed by not aligning the main
arena heap extension (sbrk) to thp_pagesize. I don't think the secondary
heaps are a problem, since the section of the virtual mapping converted to
PROT_READ | PROT_WRITE is exactly the size demanded by the user. Suppose
that out of a 2M section, only a 1K section is made read-write: Linux will
not install a 2M THP underneath it, because the protection flags differ.

Second, since Dimitri said in the bug report that they are on a system with
a low memory-to-cpu ratio, the real question to ask is why the sysfs
setting says madvise/always instead of never.

> 
> I really think we should be more conservative and *not* make this default
> and only enable through the tunable.  With system-wide tunables we will
> have an option to enable this as side-wide if user do want it. 
> 
>>
>> ---
>>
>> diff --git a/manual/tunables.texi b/manual/tunables.texi
>> index 72769428e8cac27723fdec9ad83a92dc0b27415a..5850945a59e8494a2bb315da5a3e919112c5c562 100644
>> --- a/manual/tunables.texi
>> +++ b/manual/tunables.texi
>> @@ -231,37 +231,29 @@ per-thread cache.  The default (and maximum) value is 1032 bytes on
>>  @end deftp
>>  
>>  @deftp Tunable glibc.malloc.tcache_count
>> -The maximum number of chunks of each size to cache. The default is 7.
>> +The maximum number of chunks of each size to cache. The default is 16.
>>  The upper limit is 65535.  If set to zero, the per-thread cache is effectively
>>  disabled.
>>  
>>  The approximate maximum overhead of the per-thread cache is thus equal
>>  to the number of bins times the chunk count in each bin times the size
>> -of each chunk.  With defaults, the approximate maximum overhead of the
>> -per-thread cache is approximately 236 KB on 64-bit systems and 118 KB
>> -on 32-bit systems.
>> +of each chunk.
>>  @end deftp
>>  
>>  @deftp Tunable glibc.malloc.mxfast
>> -One of the optimizations @code{malloc} uses is to maintain a series of ``fast
>> -bins'' that hold chunks up to a specific size.  The default and
>> -maximum size which may be held this way is 80 bytes on 32-bit systems
>> -or 160 bytes on 64-bit systems.  Applications which value size over
>> -speed may choose to reduce the size of requests which are serviced
>> -from fast bins with this tunable.  Note that the value specified
>> -includes @code{malloc}'s internal overhead, which is normally the size of one
>> -pointer, so add 4 on 32-bit systems or 8 on 64-bit systems to the size
>> -passed to @code{malloc} for the largest bin size to enable.
>> +This tunable has no effect since the ``fastbins'' have been removed.
>>  @end deftp
>>  
>>  @deftp Tunable glibc.malloc.hugetlb
>>  This tunable controls the usage of Huge Pages on @code{malloc} calls.  The
>> -default value is @code{0}, which disables any additional support on
>> -@code{malloc}.
>> +default value is @code{0} on most targets.  Using @code{0} disables support
>> +that improves use of huge pages in @code{malloc}.  However huge pages may
>> +still be created depending on the OS settings.
>>  
>>  Setting its value to @code{1} enables the use of @code{madvise} with
>>  @code{MADV_HUGEPAGE} after memory allocation with @code{mmap}.  It is enabled
>>  only if the system supports Transparent Huge Page (currently only on Linux).
>> +This is the default used for AArch64.
>>  
>>  Setting its value to @code{2} enables the use of Huge Page directly with
>>  @code{mmap} with the use of @code{MAP_HUGETLB} flag.  The huge page size
>>
>
  
Adhemerval Zanella Netto March 11, 2026, 2:07 p.m. UTC | #4
On 10/03/26 14:03, Wilco Dijkstra wrote:
> Hi Adhemerval,
> 
> On 10/03/26 12:56, Wilco Dijkstra wrote:
>>
>> Update default for tcache_count tunable.  Remove existing documentation and
>> mention removal of fastbins in mxfast tunable.  Improve wording of hugetlb
>> tunable, including default for AArch64.
>>
>> OK for commit?
> 
>> It does not address the value returned by --list-tunables, which can be
>> misleading; nor the issues about memory consumption raised by Dimitri.
> 
> I'm testing a patch that shows it in --list-tunables, I'll post it when I get some
> results back.
> 
>> I really think we should be more conservative and *not* make this default
>> and only enable through the tunable.  With system-wide tunables we will
>> have an option to enable this as side-wide if user do want it.
> 
> We have been super conservative for way too long already. It's around 15 years
> since it was enabled by default for distros and 4 years since the hugetlb tunable
> was added to GLIBC. How many applications actually set glibc.malloc.hugetlb?
> 
> https://codesearch.debian.net/search?q=glibc.malloc.hugetlb&literal=1&perpkg=1
> 
> So the question is, how many more years should we wait before developers
> figure out how to use the tunables?
> 
> Or we can use the distro settings and use THP when enabled. People can
> disable it if it doesn't work well for them.

I think the main point of contention here is that glibc is the default system-wide
allocator, which users expect to work well across different workloads, while at the
same time expecting the same allocator to prioritize performance for their specific
workloads.

I am focusing on the former because it is not clear that THP usage is beneficial
across all possible workloads and environments (especially given some glibc
ptmalloc shortcomings, such as fragmentation handling in some scenarios), or that
changing the defaults will always benefit users.

From Dimitri's report, I don't think a high-CPU container cluster with a low
memory-to-CPU ratio is uncommon. This is exactly what cloud providers are targeting
for current AArch64 deployments with current chips.

And it worries me that, in such scenarios, sysadmins will need to be aware of
additional configuration to avoid potential pitfalls on aarch64, especially when
comparing with other ABIs that also support THP but do not enable it by default.

Checking on different memory allocators, it really depends on which kind of
workload it targets:

* TCMalloc [1] has THP support baked in and no switch to disable its usage. It
  seems to be widely used by Google and MongoDB [2].

* jemalloc [3] only enables it as an opt-in feature (with the MALLOC_CONF="thp..."
  option).

* mimalloc [4] also enables THP usage and sets it by default (MIMALLOC_ALLOW_THP).

The TCMalloc case is interesting because Google uses its system-wide profiling
(GWP) to drive allocator development, to avoid bias from microbenchmarks or
specific benchmarks. It also takes into consideration, from the start, data
structures and heuristics that avoid fragmentation across different workloads.
mimalloc is similar in design (it takes THP into consideration).

And I do not think we have that amount of research to certify that glibc THP usage
yields the same performance gain across the myriad of workloads it is used for. We
already have a fair number of corner-case issues [6][7][8][9][10] that might be
exacerbated by using THP by default.

Another worry is that this kind of change might create even more attrition and
misconceptions, leading people to believe that a malloc replacement would be
better than glibc’s.

So I think a good first step would be to advertise THP support and how to properly
use the tunable, and I would say it is a failure on our part that this isn't more
widespread. This kind of information is really very project-specific, and I don't
have a good answer on how to make it better known.

[1] https://github.com/google/tcmalloc
[2] https://www.mongodb.com/docs/manual/administration/tcmalloc-performance/
[3] https://jemalloc.net/
[4] https://github.com/microsoft/mimalloc
[5] https://paulcavallaro.com/blog/tcmalloc-temeraire-hugepage-aware-allocator/
[6] https://sourceware.org/bugzilla/show_bug.cgi?id=31556
[7] https://sourceware.org/bugzilla/show_bug.cgi?id=30769
[8] https://sourceware.org/bugzilla/show_bug.cgi?id=15321
[9] https://sourceware.org/bugzilla/show_bug.cgi?id=26969
[10] https://sourceware.org/bugzilla/show_bug.cgi?id=21731
  
Dimitri John Ledkov March 11, 2026, 3:51 p.m. UTC | #5
On Wed, 11 Mar 2026 at 14:07, Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 10/03/26 14:03, Wilco Dijkstra wrote:
> > Hi Adhemerval,
> >
> > On 10/03/26 12:56, Wilco Dijkstra wrote:
> >>
> >> Update default for tcache_count tunable.  Remove existing documentation and
> >> mention removal of fastbins in mxfast tunable.  Improve wording of hugetlb
> >> tunable, including default for AArch64.
> >>
> >> OK for commit?
> >
> >> It does not address the value returned by --list-tunables, which can be
> >> misleading; nor the issues about memory consumption raised by Dimitri.
> >
> > I'm testing a patch that shows it in --list-tunables, I'll post it when I get some
> > results back.
> >
> >> I really think we should be more conservative and *not* make this default
> >> and only enable through the tunable.  With system-wide tunables we will
> >> have an option to enable this as side-wide if user do want it.
> >
> > We have been super conservative for way too long already. It's around 15 years
> > since it was enabled by default for distros and 4 years since the hugetlb tunable
> > was added to GLIBC. How many applications actually set glibc.malloc.hugetlb?
> >
> > https://codesearch.debian.net/search?q=glibc.malloc.hugetlb&literal=1&perpkg=1
> >
> > So the question is, how many more years should we wait before developers
> > figure out how to use the tunables?
> >
> > Or we can use the distro settings and use THP when enabled. People can
> > disable it if it doesn't work well for them.
>
> I think the main point of contention here is that glibc is the default system-wide
> allocator, which users expect to work well across different workloads, and another
> expectation is that the same allocator prioritizes performance for specific workloads.
>
> I am focusing on the former because it is not clear that THP usage is beneficial
> across all possible workloads and environments (especially given some glibc
> ptmalloc shortcomings, such as fragmentation handling in some scenarios), or that
> changing the defaults will always benefit users.
>
> From Dimitri's report, I don't think it is an uncommon high-CPU container cluster
> with a low memory-to-CPU ratio. This is exactly what cloud providers are targeting
> for current Aarch64 deployments with current chips.


Most container deployments in such hardware / workload density
configurations do not currently use cpuset cgroups, meaning all
containers typically see the host cpu count rather than a cpuset
subset. The consequence is that one typically has to tweak
MALLOC_ARENA_MAX manually; the best-known example is this Heroku guidance:
https://devcenter.heroku.com/articles/tuning-glibc-memory-behavior
I don't know how container/cgroups-aware glibc is.

It would be very interesting if the glibc malloc allocator could somehow
sense whether it is in a container or not (even if just by checking the
container environment variable) and, for example, choose to lower
MALLOC_ARENA_MAX from 8 times the count of observable cpus to a fixed
lower value.
I wish containers would universally hint the weight and total amount
of cpu shares, dynamically adjusted for the number of deployed
containers - but I don't believe this is available today.

Also, no config choice can be universal. The high-performance
deployments typically already tune these things. But it seems like the
mix of workloads loading glibc is no longer bare-metal heavy, but
VM-heavy and container-heavy (a given deployment is most likely
to load glibc in a constrained container).

One size does not fit all; but more dynamic defaults would be nice.

I saw patches to add support for setting tunables via a config file.
It would be nice to ship multiple example config files for bare-metal,
VM and container, even if they just set malloc_arena_max to 2. Then
distributions would get a hint to package these and provide them as
config files via something like update-alternatives, possibly shipping
one edition by default in "container" builds, another in "vm" builds,
and so on.

This approach worked really well for tuning ext4 via the e2fsprogs
default config file, which is sensitive to the total size of the
filesystem w.r.t. reserved space, inode reservation, etc., such that
creating ext4 on a small SD card results in a different performance
profile compared to a multi-terabyte filesystem.

I think the desire to change the default mostly stems from a lack of
good levers for applying default tuning, or a lack of automatic
dynamic tuning based on a given environment.
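The config-file idea above might look something like this (purely
illustrative: the tunables-config-file patches are not merged, so the path
and file layout here are hypothetical, modeled on the colon-separated
name=value form GLIBC_TUNABLES already uses):

```
# /etc/glibc-tunables.conf.d/container.conf (hypothetical path and format)
# Shipped by the distribution's "container" edition.
glibc.malloc.arena_max=2
```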

>
> And it worries me that, in such scenarios, sysadmins will need to be aware of
> additional configuration to avoid potential pitfalls on aarch64, especially when
> comparing with other ABIs that also support THP but do not enable it by default.
>
> Checking on different memory allocators, it really depends on which kind of
> workload it targets:
>
> * TCMalloc [1] has baked in THP support and not switch to disable its usage. It
>   seems widely used by Google and MongoDB [2].
>
> * jemalloc [3] only enables it as an opt-in feature (with the MALLOC_CONF="thp..."
>   option).
>
> * mimalloc [4] also enabled THP usage and set it by default (MIMALLOC_ALLOW_THP).
>
> The TCMalloc case is interesting because Google uses its system-wide profile
> (GWP) to drive allocator development to avoid bias from microbenchmarks or specific
> benchmarks. It also takes into consideration, from the start, data structures and
> heuristics to avoid fragmentation across different workloads. The mimalloc in similar
> wrt to design (that takes in consideration THP).
>
> And I do not think we have that amount of research to certify that glibc THP usage
> yields the same performance gain across the myriad of workloads it is used for. We
> already have a fair number of corner issues [6][7][8][9][10] that might be exacerbated
> by using THP by default.
>
> Another worry is that this kind of change might create even more attrition and
> misconceptions, leading people to believe that a malloc replacement would be
> better than glibc’s.
>
> So I think it would be a good way to first advertise THP support and how to properly
> use the tunable, and I would say it is a failure on our part that this isn't more
> widespread. This kind of information is really very project-specific, and I don’t
> have a good answer on how to make it more widespread.
>
> [1] https://github.com/google/tcmalloc
> [2] https://www.mongodb.com/docs/manual/administration/tcmalloc-performance/
> [3] https://jemalloc.net/
> [4] https://github.com/microsoft/mimalloc
> [5] https://paulcavallaro.com/blog/tcmalloc-temeraire-hugepage-aware-allocator/
> [6] https://sourceware.org/bugzilla/show_bug.cgi?id=31556
> [7] https://sourceware.org/bugzilla/show_bug.cgi?id=30769
> [8] https://sourceware.org/bugzilla/show_bug.cgi?id=15321
> [9] https://sourceware.org/bugzilla/show_bug.cgi?id=26969
> [10] https://sourceware.org/bugzilla/show_bug.cgi?id=21731
>
>
  
Dimitri John Ledkov March 11, 2026, 4:50 p.m. UTC | #6
On Tue, 10 Mar 2026 at 15:57, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
>
> Update default for tcache_count tunable.  Remove existing documentation and
> mention removal of fastbins in mxfast tunable.  Improve wording of hugetlb
> tunable, including default for AArch64.
>
> OK for commit?
>

Did a test build and read the new docs; they formatted correctly for
me and I like the updates. All of them are accurate. Thank you.

Tested-By: Dimitri John Ledkov <dimitri.ledkov@surgut.co.uk>

> ---
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 72769428e8cac27723fdec9ad83a92dc0b27415a..5850945a59e8494a2bb315da5a3e919112c5c562 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -231,37 +231,29 @@ per-thread cache.  The default (and maximum) value is 1032 bytes on
>  @end deftp
>
>  @deftp Tunable glibc.malloc.tcache_count
> -The maximum number of chunks of each size to cache. The default is 7.
> +The maximum number of chunks of each size to cache. The default is 16.
>  The upper limit is 65535.  If set to zero, the per-thread cache is effectively
>  disabled.
>
>  The approximate maximum overhead of the per-thread cache is thus equal
>  to the number of bins times the chunk count in each bin times the size
> -of each chunk.  With defaults, the approximate maximum overhead of the
> -per-thread cache is approximately 236 KB on 64-bit systems and 118 KB
> -on 32-bit systems.
> +of each chunk.
>  @end deftp
>
>  @deftp Tunable glibc.malloc.mxfast
> -One of the optimizations @code{malloc} uses is to maintain a series of ``fast
> -bins'' that hold chunks up to a specific size.  The default and
> -maximum size which may be held this way is 80 bytes on 32-bit systems
> -or 160 bytes on 64-bit systems.  Applications which value size over
> -speed may choose to reduce the size of requests which are serviced
> -from fast bins with this tunable.  Note that the value specified
> -includes @code{malloc}'s internal overhead, which is normally the size of one
> -pointer, so add 4 on 32-bit systems or 8 on 64-bit systems to the size
> -passed to @code{malloc} for the largest bin size to enable.
> +This tunable has no effect since the ``fastbins'' have been removed.
>  @end deftp
>
>  @deftp Tunable glibc.malloc.hugetlb
>  This tunable controls the usage of Huge Pages on @code{malloc} calls.  The
> -default value is @code{0}, which disables any additional support on
> -@code{malloc}.
> +default value is @code{0} on most targets.  Using @code{0} disables support
> +that improves use of huge pages in @code{malloc}.  However huge pages may
> +still be created depending on the OS settings.
>
>  Setting its value to @code{1} enables the use of @code{madvise} with
>  @code{MADV_HUGEPAGE} after memory allocation with @code{mmap}.  It is enabled
>  only if the system supports Transparent Huge Page (currently only on Linux).
> +This is the default used for AArch64.
>
>  Setting its value to @code{2} enables the use of Huge Page directly with
>  @code{mmap} with the use of @code{MAP_HUGETLB} flag.  The huge page size
>
  
Dev Jain March 11, 2026, 5:15 p.m. UTC | #7
On 11/03/26 9:21 pm, Dimitri John Ledkov wrote:
> On Wed, 11 Mar 2026 at 14:07, Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 10/03/26 14:03, Wilco Dijkstra wrote:
>>> Hi Adhemerval,
>>>
>>> On 10/03/26 12:56, Wilco Dijkstra wrote:
>>>>
>>>> Update default for tcache_count tunable.  Remove existing documentation and
>>>> mention removal of fastbins in mxfast tunable.  Improve wording of hugetlb
>>>> tunable, including default for AArch64.
>>>>
>>>> OK for commit?
>>>
>>>> It does not address the value returned by --list-tunables, which can be
>>>> misleading; nor the issues about memory consumption raised by Dimitri.
>>>
>>> I'm testing a patch that shows it in --list-tunables, I'll post it when I get some
>>> results back.
>>>
>>>> I really think we should be more conservative and *not* make this default
>>>> and only enable through the tunable.  With system-wide tunables we will
>>>> have an option to enable this as side-wide if user do want it.
>>>
>>> We have been super conservative for way too long already. It's around 15 years
>>> since it was enabled by default for distros and 4 years since the hugetlb tunable
>>> was added to GLIBC. How many applications actually set glibc.malloc.hugetlb?
>>>
>>> https://codesearch.debian.net/search?q=glibc.malloc.hugetlb&literal=1&perpkg=1
>>>
>>> So the question is, how many more years should we wait before developers
>>> figure out how to use the tunables?
>>>
>>> Or we can use the distro settings and use THP when enabled. People can
>>> disable it if it doesn't work well for them.
>>
>> I think the main point of contention here is that glibc is the default system-wide
>> allocator, which users expect to work well across different workloads, and another
>> expectation is that the same allocator prioritizes performance for specific workloads.
>>
>> I am focusing on the former because it is not clear that THP usage is beneficial
>> across all possible workloads and environments (especially given some glibc
>> ptmalloc shortcomings, such as fragmentation handling in some scenarios), or that
>> changing the defaults will always benefit users.
>>
>> From Dimitri's report, I don't think it is an uncommon high-CPU container cluster
>> with a low memory-to-CPU ratio. This is exactly what cloud providers are targeting
>> for current Aarch64 deployments with current chips.
> 
> 
> Most container deployments in such hardware / workload density
> configurations do not currently use cpuset cgroups. Meaning all
> containers typically see the host cpu count, rather than a cpuset
> subset. The consequence is that one has to typically manually tweak
> MALLOC_ARENA_MAX, most famously this heroku guidance
> https://devcenter.heroku.com/articles/tuning-glibc-memory-behavior I
> don't know how container/cgroups aware glibc is.
> 
> It would be very interesting if glibc malloc allocator could somehow
> sense if it is in a container or not (even if by checking the
> container environment variable) and for example choosing to lower
> MALLOC_ARENA_MAX from the 8 times count of observable cpus to a fixed
> lower value.
> I wish containers would universally hint the weight and total amount
> of cpu shares dynamically adjusted for the number of deployed
> containers - but I don't believe this is available today.
> 
> Also, no config choice can be universal. The high-performance
> deployments typically already tune these things. But it seems like the
> number of workloads and glibc loads is no longer bare-metal heavy; but
> is VM heavy and container heavy. (a given deployment is most often
> likely to load glibc in a constrained container).
> 
> One size does not fit all; but more dynamic defaults would be nice.
> 
> I saw patches to add support for setting tunables via a config file.
> It would be nice to ship multiple example config files for bare-metal,
> vm, container => even if they just set malloc_arena_max 2. Then
> distributions will get a hint to package these, and provide them as
> config files with like update-alternatives. And possibly ship one
> edition by default in "container" builds, another in "vm" builds, and
> so on.
> 
> This approach worked really well for tuning ext4 with the e2fsprogs
> default config file sensitive to the total size of filesystem w.r.t.
> reserve space / inode reservation / etc. Such that creating ext4 on a
> small sd-card results in a different performance profile, compared to
> a multi-terabyte filesystem.
> 
> I think the desire to change the default mostly stems from lack of
> good levers for applying the default tuning; or lack of automatic
> dynamic  tuning based on a given environment.

FYI, you may take a look at [1], under the heading "process THP controls". In
short, you can make your container workload override the sysctl and set THP
to never for it. Not sure if this suits your case, or how widely this is
used. But yeah, we have to improve the malloc THP support in any case.

[1]
https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/mm/transhuge.rst

> 
>>
>> And it worries me that, in such scenarios, sysadmins will need to be aware of
>> additional configuration to avoid potential pitfalls on aarch64, especially when
>> comparing with other ABIs that also support THP but do not enable it by default.
>>
>> Checking on different memory allocators, it really depends on which kind of
>> workload it targets:
>>
>> * TCMalloc [1] has baked in THP support and not switch to disable its usage. It
>>   seems widely used by Google and MongoDB [2].
>>
>> * jemalloc [3] only enables it as an opt-in feature (with the MALLOC_CONF="thp..."
>>   option).
>>
>> * mimalloc [4] also enabled THP usage and set it by default (MIMALLOC_ALLOW_THP).
>>
>> The TCMalloc case is interesting because Google uses its system-wide profile
>> (GWP) to drive allocator development to avoid bias from microbenchmarks or specific
>> benchmarks. It also takes into consideration, from the start, data structures and
>> heuristics to avoid fragmentation across different workloads. The mimalloc in similar
>> wrt to design (that takes in consideration THP).
>>
>> And I do not think we have that amount of research to certify that glibc THP usage
>> yields the same performance gain across the myriad of workloads it is used for. We
>> already have a fair number of corner issues [6][7][8][9][10] that might be exacerbated
>> by using THP by default.
>>
>> Another worry is that this kind of change might create even more attrition and
>> misconceptions, leading people to believe that a malloc replacement would be
>> better than glibc’s.
>>
>> So I think it would be a good way to first advertise THP support and how to properly
>> use the tunable, and I would say it is a failure on our part that this isn't more
>> widespread. This kind of information is really very project-specific, and I don’t
>> have a good answer on how to make it more widespread.
>>
>> [1] https://github.com/google/tcmalloc
>> [2] https://www.mongodb.com/docs/manual/administration/tcmalloc-performance/
>> [3] https://jemalloc.net/
>> [4] https://github.com/microsoft/mimalloc
>> [5] https://paulcavallaro.com/blog/tcmalloc-temeraire-hugepage-aware-allocator/
>> [6] https://sourceware.org/bugzilla/show_bug.cgi?id=31556
>> [7] https://sourceware.org/bugzilla/show_bug.cgi?id=30769
>> [8] https://sourceware.org/bugzilla/show_bug.cgi?id=15321
>> [9] https://sourceware.org/bugzilla/show_bug.cgi?id=26969
>> [10] https://sourceware.org/bugzilla/show_bug.cgi?id=21731
>>
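To make the tunable usage mentioned above concrete, here is a minimal sketch of enabling THP support in malloc for a single process (`./myapp` is a placeholder name):

```shell
# glibc.malloc.hugetlb: 0 = off (the default on most targets),
# 1 = madvise(MADV_HUGEPAGE) after mmap, 2 = mmap with MAP_HUGETLB.
export GLIBC_TUNABLES=glibc.malloc.hugetlb=1
# then run the application, e.g.:
#   GLIBC_TUNABLES=$GLIBC_TUNABLES ./myapp
echo "$GLIBC_TUNABLES"
```

Note that value 1 is only a hint to the kernel; it has an effect only when Transparent Huge Pages are enabled on the system.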
  
Wilco Dijkstra March 12, 2026, 6:02 p.m. UTC | #8
Hi Dimitri,

> It would be very interesting if the glibc malloc allocator could somehow
> sense whether it is in a container (even if just by checking the
> container environment variable) and, for example, choose to lower
> MALLOC_ARENA_MAX from 8 times the count of observable cpus to a fixed
> lower value.
> I wish containers would universally hint at the weight and total amount
> of cpu shares, dynamically adjusted for the number of deployed
> containers - but I don't believe this is available today.

So is the issue you are seeing due to a bad default of MALLOC_ARENA_MAX?
Lots of arenas implies malloc isn't trying to use arenas efficiently. I think that
can occur if you have extreme numbers of threads - the current malloc design
tries to give each thread its own arena (which seems like a bad idea).

Setting it to 2 might well be better as a default.  However, it's not obvious why
we would ever need more than, say, 64 arenas even on systems with lots of cores.

> I think the desire to change the default mostly stems from a lack of
> good levers for applying the default tuning, or a lack of automatic
> dynamic tuning based on a given environment.

The goal is to provide good results out of the box. I don't believe in forcing
people to fine tune every feature in their system in order to get it to work
properly.

Cheers,
Wilco
  
Dimitri John Ledkov March 12, 2026, 8:57 p.m. UTC | #9
On Thu, 12 Mar 2026 at 18:03, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Dimitri,
>
> > It would be very interesting if the glibc malloc allocator could somehow
> > sense whether it is in a container (even if just by checking the
> > container environment variable) and, for example, choose to lower
> > MALLOC_ARENA_MAX from 8 times the count of observable cpus to a fixed
> > lower value.
> > I wish containers would universally hint at the weight and total amount
> > of cpu shares, dynamically adjusted for the number of deployed
> > containers - but I don't believe this is available today.
>
> So is the issue you are seeing due to a bad default of MALLOC_ARENA_MAX?

I am relaying different sets of feedback, from very different
deployments with very different setups / workload / hardware / etc;
from different customers / end-users.

Setting MALLOC_ARENA_MAX=2 helped when migrating from glibc 2.35 to 2.43 on x86_64.

> Lots of arenas implies malloc isn't trying to use arenas efficiently. I think that
> can occur if you have extreme numbers of threads - the current malloc design
> tries to give each thread its own arena (which seems like a bad idea).

Yes, in this deployment there is a JVM with many threads, inside
containers; and without MALLOC_ARENA_MAX=2 the performance degraded by
30% (total throughput, but also the number of total and concurrent
requests that could be serviced).  And the characteristics were similar
- a very dense container deployment, a low memory-to-cpu ratio, and yes,
each container has a single process with many threads.

>
> Setting it to 2 might well be better as a default.  However, it's not obvious why
> we would ever need more than, say, 64 arenas even on systems with lots of cores.

Is it arenas per process, at 8 times the number of CPU cores?  Because it
feels excessive in a container which observes all of the CPUs yet cannot
actually use all of them (there is contention / overcommit).

>
> > I think the desire to change the default mostly stems from a lack of
> > good levers for applying the default tuning, or a lack of automatic
> > dynamic tuning based on a given environment.
>
> The goal is to provide good results out of the box. I don't believe in forcing
> people to fine tune every feature in their system in order to get it to work
> properly.
>
> Cheers,
> Wilco

I offered to set MALLOC_ARENA_MAX=2 by default universally for that
deployment and it was received with mixed feelings - the concern being
that it might hurt a single large container allocated to run by itself
on a node; effectively a privileged / bare-metal, full-speed-ahead
container workload.

It does feel like an uncapped linear 8 times the CPU core count is
excessive beyond a high enough number of cores.  Was the 8-times-CPU-cores
value set back in the day, before 192-core public cloud instances became
generally available?  With hyperthreading those may appear as 384 cores,
and times 8 that would be 3072 arenas, which feels like a very excessive
amount, especially when in practice this host actually runs 500 pods.

I wonder if we can keep the 8 times CPU count but cap it at 32 arenas,
maybe?  Unfortunately I don't have benchmark data that I could easily
run to simulate and measure the impact of changing this.
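The arithmetic above, with the hypothetical cap of 32 (not an existing glibc setting), can be sketched as:

```shell
# Current 64-bit default: up to 8 arenas per observable CPU, uncapped.
ncpus=384                # e.g. 192 cores appearing as 384 with hyperthreading
echo $(( 8 * ncpus ))    # prints 3072
# With a hypothetical cap of 32 arenas:
limit=$(( 8 * ncpus ))
[ "$limit" -gt 32 ] && limit=32
echo "$limit"            # prints 32
```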
  
Wilco Dijkstra March 16, 2026, 8:26 p.m. UTC | #10
Hi Dimitri,

>> So is the issue you are seeing due to a bad default of MALLOC_ARENA_MAX?
>
> I am relaying different sets of feedback, from very different
> deployments with very different setups / workload / hardware / etc;
> from different customers / end-users.
>
> Setting MALLOC_ARENA_MAX=2 helped when migrating from glibc 2.35 to 2.43 on x86_64.

Interesting since x86_64 does not use THP by default.

>> Lots of arenas implies malloc isn't trying to use arenas efficiently. I think that
>> can occur if you have extreme numbers of threads - the current malloc design
>> tries to give each thread its own arena (which seems like a bad idea).
>
> Yes, in this deployment there is a JVM with many threads, inside
> containers; and without MALLOC_ARENA_MAX=2 the performance degraded by
> 30% (total throughput, but also the number of total and concurrent
> requests that could be serviced).  And the characteristics were similar
> - a very dense container deployment, a low memory-to-cpu ratio, and yes,
> each container has a single process with many threads.

There was a talk at the Cauldron about JVM/MySQL malloc issues:
https://conf.gnu-tools-cauldron.org/opo25/talk/LXPUYR/

Since GLIBC 2.43 you can increase the maximum size of allocations handled
by tcache via glibc.malloc.tcache_max=N. This allows one to run with only 1
arena without a performance hit (and no fragmentation after days of use).
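A sketch of that configuration (the tcache_max value of 131072 and the application name are illustrative placeholders, not recommendations):

```shell
# Run with a single arena and a larger tcache size limit
# (glibc >= 2.43 is needed for a tcache_max this large).
export GLIBC_TUNABLES=glibc.malloc.arena_max=1:glibc.malloc.tcache_max=131072
# then start the application, e.g.:
#   GLIBC_TUNABLES=$GLIBC_TUNABLES java -jar app.jar
echo "$GLIBC_TUNABLES"
```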

> I offered to set MALLOC_ARENA_MAX=2 by default universally for that
> deployment and it was received with mixed feelings - the concern being
> that it might hurt a single large container allocated to run by itself
> on a node; effectively a privileged / bare-metal, full-speed-ahead
> container workload.

Yes that would be a bit low...

> It does feel like an uncapped linear 8 times the CPU core count is
> excessive beyond a high enough number of cores.  Was the 8-times-CPU-cores
> value set back in the day, before 192-core public cloud instances became
> generally available?  With hyperthreading those may appear as 384 cores,
> and times 8 that would be 3072 arenas, which feels like a very excessive
> amount, especially when in practice this host actually runs 500 pods.

The bad commit is from 2009!

 https://sourceware.org/git/?p=glibc.git;a=commit;h=425ce2edb9d

Obviously zero explanation or benchmarks as to why this value was chosen...
So it looks incorrect from the first commit. I think 2 (or even 1) is a better
value, especially since we removed the slow consolidation pass and added
tcache for small blocks. It's just a matter of running some large applications
to confirm there are no major slowdowns.

> I wonder if we can keep the 8 times CPU count but cap it at 32 arenas,
> maybe?  Unfortunately I don't have benchmark data that I could easily
> run to simulate and measure the impact of changing this.

We can lower the 8 times (for laptops) as well as the maximum (for big servers).

Cheers,
Wilco
  

Patch

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 72769428e8cac27723fdec9ad83a92dc0b27415a..5850945a59e8494a2bb315da5a3e919112c5c562 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -231,37 +231,29 @@  per-thread cache.  The default (and maximum) value is 1032 bytes on
 @end deftp
 
 @deftp Tunable glibc.malloc.tcache_count
-The maximum number of chunks of each size to cache. The default is 7.
+The maximum number of chunks of each size to cache. The default is 16.
 The upper limit is 65535.  If set to zero, the per-thread cache is effectively
 disabled.
 
 The approximate maximum overhead of the per-thread cache is thus equal
 to the number of bins times the chunk count in each bin times the size
-of each chunk.  With defaults, the approximate maximum overhead of the
-per-thread cache is approximately 236 KB on 64-bit systems and 118 KB
-on 32-bit systems.
+of each chunk.
 @end deftp
 
 @deftp Tunable glibc.malloc.mxfast
-One of the optimizations @code{malloc} uses is to maintain a series of ``fast
-bins'' that hold chunks up to a specific size.  The default and
-maximum size which may be held this way is 80 bytes on 32-bit systems
-or 160 bytes on 64-bit systems.  Applications which value size over
-speed may choose to reduce the size of requests which are serviced
-from fast bins with this tunable.  Note that the value specified
-includes @code{malloc}'s internal overhead, which is normally the size of one
-pointer, so add 4 on 32-bit systems or 8 on 64-bit systems to the size
-passed to @code{malloc} for the largest bin size to enable.
+This tunable has no effect since the ``fastbins'' have been removed.
 @end deftp
 
 @deftp Tunable glibc.malloc.hugetlb
 This tunable controls the usage of Huge Pages on @code{malloc} calls.  The
-default value is @code{0}, which disables any additional support on
-@code{malloc}.
+default value is @code{0} on most targets.  A value of @code{0} disables the
+huge page support in @code{malloc}.  However, huge pages may still be created
+depending on the OS settings.
 
 Setting its value to @code{1} enables the use of @code{madvise} with
 @code{MADV_HUGEPAGE} after memory allocation with @code{mmap}.  It is enabled
 only if the system supports Transparent Huge Page (currently only on Linux).
+This is the default used for AArch64.
 
 Setting its value to @code{2} enables the use of Huge Page directly with
 @code{mmap} with the use of @code{MAP_HUGETLB} flag.  The huge page size