malloc: check tcache mem size in tcache_get_n to avoid arbitrary mem allocation

Message ID 20250407144349.10862-1-dudududuMaxVer@gmail.com (mailing list archive)
State Under Review
Delegated to: Wilco Dijkstra
Headers
Series malloc: check tcache mem size in tcache_get_n to avoid arbitrary mem allocation |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed

Commit Message

dbgbgtf April 7, 2025, 2:43 p.m. UTC
  Hi!

> Thanks, having a test is always useful. This basically overwrites the freelist
> with a correctly swizzled pointer to a fake block. Once an attacker can do that,
> they can just as easily fake a correct chunk header and pass the extra check...

Though attackers could fake a correct chunk header to bypass the check,
it's harder for them to arb alloc address out of heap to hijack pc.
At least an arb alloc on heap wouldn't cause too much threat.
Considering the extra checking cost, that is the best we can do for now.

> Note that it should trigger a check in free() since the size is corrupted.
> If it is feasible for an attacker to break the swizzling, it would be best to improve
> that in some way.

The corruption of freelist happens after free, you can also see that in my
test, so check in free won't work.

> I normally just build glibc and benchtests and then run each benchmark
> a few times like:

> time ./build/glibc/benchtests/bench-malloc-thread 1

Maybe do as the ReadMe in benchtests says?
https://elixir.bootlin.com/glibc/glibc-2.41.9000/source/benchtests/README, for example

And if you want to run `./build/glibc/benchtests/bench-malloc-thread
1`directly, you should make it static linked, otherwise, the program
will use malloc from your system libaray.

`make bench` will run a script that handles the ld and libaray for you,
in that case, it's ok to build dynamic programs as default.

On my machine, this patch is about 4% slower for thread1 and thread32, and seems even
faster for simple 16(maybe because I remove the fastbin checks in _int_malloc)?

❯ ldd /home/dbgbgtf/Main/work/mybuild/benchtests/bench-malloc-simple
	statically linked
❯ time /home/dbgbgtf/Main/work/mybuild/benchtests/bench-malloc-simple 16
{
 "timing_type": "hp_timing",
 "functions": {
  "malloc": {
   "": {
    "malloc_block_size": 16,
    "max_rss": 5848,
    "main_arena_st_allocs_0025_time": 24.1369,
    "main_arena_st_allocs_0100_time": 24.0427,
    "main_arena_st_allocs_0400_time": 24.1582,
    "main_arena_st_allocs_1600_time": 24.2585,
    "main_arena_mt_allocs_0025_time": 64.54,
    "main_arena_mt_allocs_0100_time": 65.0808,
    "main_arena_mt_allocs_0400_time": 67.3119,
    "main_arena_mt_allocs_1600_time": 68.838,
    "thread_arena__allocs_0025_time": 53.995,
    "thread_arena__allocs_0100_time": 66.2979,
    "thread_arena__allocs_0400_time": 67.2065,
    "thread_arena__allocs_1600_time": 68.6876
   }
  }
 }
}/home/dbgbgtf/Main/work/mybuild/benchtests/bench-malloc-simple 16  1.23s user 0.00s system 99% cpu 1.240 total

❯ ldd /home/dbgbgtf/Main/work/build/benchtests/bench-malloc-simple
	statically linked
❯ time /home/dbgbgtf/Main/work/build/benchtests/bench-malloc-simple 16
{
 "timing_type": "hp_timing",
 "functions": {
  "malloc": {
   "": {
    "malloc_block_size": 16,
    "max_rss": 8340,
    "main_arena_st_allocs_0025_time": 25.4869,
    "main_arena_st_allocs_0100_time": 25.6998,
    "main_arena_st_allocs_0400_time": 25.7933,
    "main_arena_st_allocs_1600_time": 26.0882,
    "main_arena_mt_allocs_0025_time": 66.6024,
    "main_arena_mt_allocs_0100_time": 67.5042,
    "main_arena_mt_allocs_0400_time": 69.7948,
    "main_arena_mt_allocs_1600_time": 71.1558,
    "thread_arena__allocs_0025_time": 54.2764,
    "thread_arena__allocs_0100_time": 65.392,
    "thread_arena__allocs_0400_time": 69.6825,
    "thread_arena__allocs_1600_time": 71.431
   }
  }
 }
}/home/dbgbgtf/Main/work/build/benchtests/bench-malloc-simple 16  1.29s user 0.00s system 99% cpu 1.292 total

My cpu is 20 × 13th Gen Intel Core i7-13700H, which is x86_64 arch.
I saw your commit said that you use Neoverse V2, could that be the
difference?

> Sorry, I meant csize2tidx (chunksize_nomask(mem2chunk(e)).
> It's a bit long but it reuses existing macros and needs fewer instructions.

I get it, you are right.It won't need to mask the chunksize this way,
thanks for the suggestion.

Cheers,
dbgbgtf

Signed-off-by: dbgbgtf <dudududuMaxVer@gmail.com>
---
 malloc/malloc.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
  

Comments

Wilco Dijkstra April 8, 2025, 5:27 p.m. UTC | #1
Hi,

>> Thanks, having a test is always useful. This basically overwrites the freelist
>> with a correctly swizzled pointer to a fake block. Once an attacker can do that,
>> they can just as easily fake a correct chunk header and pass the extra check...
>
> Though attackers could fake a correct chunk header to bypass the check,
> it's harder for them to arb alloc address out of heap to hijack pc.
> At least an arb alloc on heap wouldn't cause too much threat.
> Considering the extra checking cost, that is the best we can do for now.

What do you mean with "arb alloc"? Never heard of the term...

>> Note that it should trigger a check in free() since the size is corrupted.
>> If it is feasible for an attacker to break the swizzling, it would be best to improve
>> that in some way.
>
> The corruption of freelist happens after free, you can also see that in my
> test, so check in free won't work.

If the malloc succeeds despite corruption, it will likely fail if that block is freed
again, or on the next malloc that uses the now corrupted freelist. Adding free(ptr[0])
at the end of your test causes "double free or corruption".

> And if you want to run `./build/glibc/benchtests/bench-malloc-thread
> 1`directly, you should make it static linked, otherwise, the program
> will use malloc from your system libaray.

You can statically link with GLIBC, but that's not needed here - "make bench" will generate
benchmarks which will always dynamically link with your local GLIBC build. 

> On my machine, this patch is about 4% slower for thread1 and thread32, and seems even
> faster for simple 16(maybe because I remove the fastbin checks in _int_malloc)?

There is a 3.8% difference between those runs. Which one is old, which one is new? 

> My cpu is 20 × 13th Gen Intel Core i7-13700H, which is x86_64 arch.
> I saw your commit said that you use Neoverse V2, could that be the
> difference?

Note those results seem slow given the i7-13700H is quite new. Are you running on
the fast cores? Does the generated code look well optimized?

The results will depend on the CPU you're testing, but the general trend should be
similar. My old x64 box shows a 14-16% slowdown on bench-malloc-thread 1 / 32,
and bench-malloc-simple 15 is ~0.4% slower.

Cheers,
Wilco
  
dbgbgtf April 18, 2025, 8:28 a.m. UTC | #2
Hi,
It has been a week since I send my latest reply.

Is there any process?
Or anything else I can do to help?

Any new suggestions for my patch?
Or can I get some process?

Cheers,
dbgbgtf
  

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index a0bc733482..74f9668d3b 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3186,6 +3186,8 @@  tcache_get_n (size_t tc_idx, tcache_entry **ep)
 
   if (__glibc_unlikely (!aligned_OK (e)))
     malloc_printerr ("malloc(): unaligned tcache chunk detected");
+  if (__glibc_unlikely (tc_idx != csize2tidx(chunksize_nomask (mem2chunk(e)))))
+    malloc_printerr ("malloc(): tcache mem size vs request size");
 
   if (ep == &(tcache->entries[tc_idx]))
       *ep = REVEAL_PTR (e->next);
@@ -4009,11 +4011,6 @@  _int_malloc (mstate av, size_t bytes)
 		  while (tcache->counts[tc_idx] < mp_.tcache_count
 			 && (tc_victim = *fb) != NULL)
 		    {
-		      if (__glibc_unlikely (misaligned_chunk (tc_victim)))
-			malloc_printerr ("malloc(): unaligned fastbin chunk detected 3");
-		      size_t victim_tc_idx = csize2tidx (chunksize (tc_victim));
-		      if (__glibc_unlikely (tc_idx != victim_tc_idx))
-			malloc_printerr ("malloc(): chunk size mismatch in fastbin");
 		      if (SINGLE_THREAD_P)
 			*fb = REVEAL_PTR (tc_victim->fd);
 		      else