[v2] malloc: send freed small chunks to smallbin

Message ID UVNjyu5K3y4Pxi6Rwr1-pohsKhwE3nlbYOnMjmOuSvqGnpUGISJu00igzozfDrHYX26FyZQ1t0PRMc_6XBViXLGIL6YhNmr9gHtz5_g7eXk=@proton.me
State Superseded
Headers
Series [v2] malloc: send freed small chunks to smallbin |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch fail Patch failed to apply to master at the time it was sent
redhat-pt-bot/TryBot-32bit fail Patch series failed to apply

Commit Message

k4lizen July 8, 2024, 5:39 a.m. UTC
  Large chunks get added to the unsorted bin since
sorting them takes time, for small chunks the
benefit of adding them to the unsorted bin is
non-existant, actually hurting performance.

Splitting and malloc_consolidate still add small
chunks to unsorted, but we can hint the compiler
that that is a relatively rare occurance.
Benchmarking shows this to be consistently good.
---
malloc/malloc.c | 59 +++++++++++++++++++++++++++++++++----------------
1 file changed, 40 insertions(+), 19 deletions(-)

--
2.45.2
  

Comments

k4lizen July 8, 2024, 5:42 a.m. UTC | #1
(see attachment for mentioned files)

# The Rationale
The benefit of adding large chunks to the unsorted bin is clear,
we can potentially skip adding them to their bins. This is useful
since large chunks are sorted and this can take time.

Currently in malloc, small chunks are also added to the unsorted bin
(if they do not fit into the fastbins, and the tcache is full).
The reasoning for this is unclear and the benefits dubious as:

1. The only time save one might gain if a small chunk is taken
from the unsorted bin rather than the small bin is that mark_bin()
doesn't need to be called.
However, mark_bin() is very fast, and in any case much faster
than unlink_chunk() which will be performed twice if the small
chunk doesn't get taken from unsorted.

2. There is a clear performance loss in traversing the unsorted bin
if there are more chunks there.

# The Implementation
To implement this, we need an if-statement which will check whether
a chunk is small or large, so it can be added to the appropriate bin
(small or unsorted). This will be after consolidation is performed in
_int_free (more specifically _int_free_create_chunk) as we wish not to
cause any unnecessary fragmentation.

Now lets talk about the variations the patch that will be used to
compare possible approaches and give reasoning for the chosen one.
The models used are:
1. original (current malloc on master branch)
2. mod ("modified", the basic implementation)
3. mod_unlikely (mod with __glibc_unlikely here: https://elixir.bootlin.com/glibc/latest/source/malloc/malloc.c#L4161)
malloc_consolidate() and split chunks can still put small chunks
into the unsorted bin, but it is much rarer now. There are
significant performance advantages to doing this.
4. only_unlikely (original just with the __glibc_unlikely change)
Used to contrast with mod_unlikely.
5. mod_unlikely_inverted (mod_unlikely except the if statement
is such that the first branch is for large chunks)
This is the proposed approach. It's important for the first
branch to be for large chunks, as when doing sequential frees
consolidation will cause there to be many more large than
small chunks hitting this if. This way we give a light
hint to branch prediction without any guarantees.
(Benchmarking shows this is better than simply putting
__glibc_unlikely in this if).

# Performance analysis
All benchmarks are done with glibc compiled with no flags,
just (make -j, make install -j).

The biggest hit to performance is essentially the existence
of the additional if statement in _int_free.

The benefits are this:
(1.) We gain the significant performance benefits of __glibc_unlikely
when taking chunks out of unsorted.
(2.) We gain performance any time a small chunk was supposed to be
added to the unsorted bin.

This manifests in the following ways:
If we are freeing a lot of sequential memory, those chunks will
merge into a large chunk which will keep growing and being
readded into unsorted. This will make our if statement predictable
and overhead will be low, while we will eventually gain performance
on the malloc calls via (1.).

If we are freeing a lot of memory at random locations, any small
chunks are unlikely to be merged into a large chunk, so they will
reach our if statement while small and we gain performance via (2.).

Essentially, the worst case for our algorithm are short sequential
calls, as our if statement will see overhead while not gaining much
performance back.

One might think that code that causes a lot of malloc_consolidate()
is bad for us as it will trick __glibc_unlikely a lot, though this
turns out not to be the case.

# Benchmarks

## uns_filler

The `uns_filler.c` file tries to fill the unsorted bin with small
chunks (which will actually go to their small bins in our algorithm).
The figure showing the benchmark can be seen in
benchmark/_plots/uns_filler.png. We are looking at the time it takes
to perform 10.000 allocations (and frees), averaged over three runs
of the program. This is the metric we will be using from now on.

Our algorithm clearly outpreforms current malloc. To get some numbers:
original / mod_unlikely_inverted: 1.186
original - mod_unlikely_inverted: 3.351 * 10^-5

## unlikely_hit

The next benchmark is unlikely_hit.c which tries to trick
__glibc_unlikely as often as possible. There are two variants _little
and _many, that only differ in the constants. The plots can be seen
in benchmark/_plots/unlikely_hit_little.png and
benchmark/_plots/unlikely_hit_many.png.

As seen from the plots, our algorithm remains largely unaffected
by this attack (even seemingly having better performance than original).

unlikely_hit_little:
original / mod_unlikely_inverted: 1.00476
original - mod_unlikely_inverted: 0.09045 * 10^-5

unlikely_hit_many:
original / mod_unlikely_inverted: 1.00520
original - mod_unlikely_inverted: 0.1732 * 10^-5

## simple-malloc

Next we will look at the benchmark provided by glibc: bench-malloc-simple.c
As a tldr it pretty much does:
```c
for(int i = 0; i < AMOUNT; ++i){
a[i] = malloc(SIZE);
}
for(int i = 0; i < AMOUNT; ++i){
free(a[i]);
}
```
For AMOUNT: [25, 100, 400, 1600] and SIZE: [8, 16, 32, 64, 128,
256, 512, 1024, 2048, 4096]. It test single threaded, single
threaded but with (SINGLE_THREAD_P == false), and multithreaded.
I wrote a similar benchmark (simple.c) which gives plots of the
same shape, but gives me nicer number to work with (I'm having
a hard time interpreting the "*_time" from bench-malloc-simple
as * 200000 / 10000 * 1000 (iterations, 10k allocations,
milliseconds) doesn't seem to give me comparable numbers).

We don't care about the [8, 16, 32, 64] sizes since they belong
to fastbins which we don't care about (those allocations
will never reach our code).

SIZE=128 and AMOUNT=25 is pretty much the worst possible case
for our algorithm, as it doesn't give us enough time to benefit
from either (1.) or (2.), the unsorted bin will be wastefully
looked at for small allocations, and our if-statement will
go to the small branch eight times and then the large branch
seventeen times (so no help from branch prediction) and we only
incur the if-statement overhead.

The plot (from my simple.c) is
benchmark/_plots/simple_alloc128_nb25.png and the numbers are:
original / mod_unlikely_inverted: 0.9126
original - mod_unlikely_inverted: -2.3112 * 10^-5

In this case, our algorithm performs worse than the original.
That being said, as we increase the AMOUNT, we approach the
original, even being better than it in a round (the error is large).
benchmark/_plots/simple_alloc128_nb1600.png:
original / mod_unlikely_inverted: 0.9962
original - mod_unlikely_inverted: -0.21777 * 10^-5

Looking at SIZE=512, the largest size we have where we still allocate
small chunks (though only being half way to the small chunk size
range).

AMOUNT=25 (benchmark/_plots/simple_alloc512_nb25.png) we start off
ten times better than before:
original / mod_unlikely_inverted: 0.9843
original - mod_unlikely_inverted: -0.374843 * 10^-5

AMOUNT=400 (benchmark/_plots/simple_alloc512_nb400.png) and soon
even overtake original!:
original / mod_unlikely_inverted: 1.0024
original - mod_unlikely_inverted: 0.252267 * 10^-5

And further the lead with AMOUNT=1600 (benchmark/_plots/simple_alloc512_nb1600.png)
with tight error margins.
original / mod_unlikely_inverted: 1.0042
original - mod_unlikely_inverted: 0.76221 * 10^-5

Getting to large chunks gets us better branch prediction
and we benefit from (2.) as well! Let's use SIZE=2048.

AMOUNT=25 (benchmark/_plots/simple_alloc2048_nb25.png)
original / mod_unlikely_inverted: 0.98313
original - mod_unlikely_inverted: -0.468283 * 10^-5

AMOUNT=100 (benchmark/_plots/simple_alloc2048_nb100.png)
original / mod_unlikely_inverted: 1.0026
original - mod_unlikely_inverted: 0.84184 * 10^-5

AMOUNT=1600 (benchmark/_plots/simple_alloc2048_nb1600.png)
original / mod_unlikely_inverted: 1.00248
original - mod_unlikely_inverted: 1.75042 * 10^-5

That was singlethreaded only, for multithreading I didn't
rewrite the benchmarks so I can't compare the raw numbers
but I can generate the graphs:

benchmark/_plots/
simple_thread_alloc128_nb25.png
simple_thread_alloc128_nb1600.png
simple_thread_alloc512_nb100.png
simple_thread_alloc2048_nb25.png
simple_thread_alloc2048_nb1600.png

Interestingly, it seems that this version performs
better than original in most cases.

## malloc-thread

This benchmark is provided by glibc and shows how our
algorithm fares against multiple threads (1, 8, 16, 32)
and random (both size and place) allocations.

benchmark/_plots/
malloc_thread_1.png
malloc_thread_8.png
malloc_thread_16.png
malloc_thread_32.png

We can clearly see that mod_unlikely_inverted vastly
outperforms original on every count of threads.

# Security considerations
There is no check that the smallbin is uncorrupted
when small chunks are added to it from the unsorted bin.
Now most freed small chunks will have to go through
```c
if (__glibc_unlikely (fwd->bk != bck))
malloc_printerr ("free(): chunks in smallbin corrupted");
```
which is a net benefit for security and can hinder
heap exploitation attacks.

# Conclusion
All in all, while the proposed version does perform
slightly worse in some cases, that dip is bounded.
Furthermore, cases where it performs better
than current malloc are numerous (covering many
common cases), so I hope it will be taken intoconsideration!
  

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index bcb6e5b83c..ad77cd083e 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -4156,9 +4156,9 @@  _int_malloc (mstate av, size_t bytes)
#endif
}

- /* place chunk in bin */
-
- if (in_smallbin_range (size))
+ /* Place chunk in bin. Only malloc_consolidate() and splitting can put
+ small chunks into the unsorted bin. */
+ if (__glibc_unlikely (in_smallbin_range (size)))
{
victim_index = smallbin_index (size);
bck = bin_at (av, victim_index);
@@ -4723,23 +4723,45 @@  _int_free_create_chunk (mstate av, mchunkptr p, INTERNAL_SIZE_T size,
} else
clear_inuse_bit_at_offset(nextchunk, 0);

- /*
- Place the chunk in unsorted chunk list. Chunks are
- not placed into regular bins until after they have
- been given one chance to be used in malloc.
- */
+ mchunkptr bck, fwd;

- mchunkptr bck = unsorted_chunks (av);
- mchunkptr fwd = bck->fd;
- if (__glibc_unlikely (fwd->bk != bck))
- malloc_printerr ("free(): corrupted unsorted chunks");
- p->fd = fwd;
+ if(!in_smallbin_range (size))
+ {
+ /*
+ Place large chunks in unsorted chunk list. Large chunks are
+ not placed into regular bins until after they have
+ been given one chance to be used in malloc.
+
+ This branch is first in the if-statement to help branch
+ prediction on consecutive adjacent frees.
+ */
+
+ bck = unsorted_chunks (av);
+ fwd = bck->fd;
+ if (__glibc_unlikely (fwd->bk != bck))
+ malloc_printerr ("free(): corrupted unsorted chunks");
+ p->fd_nextsize = NULL;
+ p->bk_nextsize = NULL;
+ }
+ else
+ {
+ /*
+ Place small chunks directly in their smallbin, so they
+ don't pollute the unsorted bin.
+ */
+
+ int chunk_index = smallbin_index (size);
+ bck = bin_at (av, chunk_index);
+ fwd = bck->fd;
+
+ if (__glibc_unlikely (fwd->bk != bck))
+ malloc_printerr ("free(): chunks in smallbin corrupted");
+
+ mark_bin (av, chunk_index);
+ }
+
p->bk = bck;
- if (!in_smallbin_range(size))
- {
- p->fd_nextsize = NULL;
- p->bk_nextsize = NULL;
- }
+ p->fd = fwd;
bck->fd = p;
fwd->bk = p;

@@ -4748,7 +4770,6 @@  _int_free_create_chunk (mstate av, mchunkptr p, INTERNAL_SIZE_T size,

check_free_chunk(av, p);
}
-
else
{
/* If the chunk borders the current high end of memory,