[RFC,0/3] Improved ASLR

Message ID 20201004130938.64575-1-toiwoton@gmail.com

Topi Miettinen Oct. 4, 2020, 1:09 p.m. UTC
  The problem with using sbrk() for allocations is that the location of the
memory is relatively predictable, since it's always located next to the
data segment. This series makes malloc() and the TCB use mmap() instead.

Topi Miettinen (3):
  csu: randomize location of TCB
  malloc: always use mmap() to improve ASLR
  dl-sysdep: disable remaining calls to sbrk()

 csu/libc-tls.c                          | 20 ++++++++++++++------
 elf/dl-sysdep.c                         |  2 ++
 malloc/arena.c                          |  5 ++++-
 malloc/malloc.c                         | 16 +++++++++++++---
 malloc/morecore.c                       |  2 ++
 sysdeps/unix/sysv/linux/dl-sysdep.c     |  2 ++
 sysdeps/unix/sysv/linux/mmap64.c        | 19 +++++++++++++++++++
 sysdeps/unix/sysv/linux/mmap_internal.h |  3 +++
 8 files changed, 59 insertions(+), 10 deletions(-)
  

Comments

Topi Miettinen Nov. 23, 2020, 4:06 p.m. UTC | #1
On 4.10.2020 16.09, Topi Miettinen wrote:
> The problem with using sbrk() for allocations is that the location of the
> memory is relatively predictable, since it's always located next to the
> data segment. This series makes malloc() and the TCB use mmap() instead.

No comments at all? I see several implementation options here:

1. Always use mmap() instead of sbrk(), delete any uses of sbrk()

I have a hard time thinking why sbrk() would ever be the preferred choice 
over mmap(), especially considering security. There may be some bytes 
wasted, so embedded systems could want to save those, and MMU-less 
systems can't map pages at arbitrary addresses, but those probably won't 
use glibc anyway.

2. Conditionally use mmap() instead of sbrk()

Something like `#define USE_SBRK`, enabled by `configure` or a header 
file. Sub-options:

2.1. Default to sbrk(), use mmap() only for Linux

This is of course safer if some obscure system needs sbrk(). It would be 
even safer to limit mmap() only to Linux/x86_64 (which is all I care about).

2.2. Default to mmap() but don't enable sbrk() anywhere

This is pretty much like #1 but after breakage is noticed for some 
obscure systems, it's easy to `#define USE_SBRK` somewhere.


I've been using a patched glibc for a month without seeing problems. I 
enabled audit logging for the brk() system call and installed a global 
seccomp filter (in initrd) which returns ENOSYS, to catch any uses. So 
far I've only noticed that cpp (used by X11 startup in addition to 
compiling) calls sbrk() to check memory usage. Perhaps it should use the 
official malloc statistics interface instead, since malloc() may use 
mmap() for other reasons and then sbrk() won't report accurate data.

It's easy to check that sbrk() has not been used with the command `grep 
'\[heap\]' /proc/*/maps` (as root), which should print nothing since 
there are no heaps (as in "extended data segment") anymore.

-Topi

> Topi Miettinen (3):
>    csu: randomize location of TCB
>    malloc: always use mmap() to improve ASLR
>    dl-sysdep: disable remaining calls to sbrk()
> 
>   csu/libc-tls.c                          | 20 ++++++++++++++------
>   elf/dl-sysdep.c                         |  2 ++
>   malloc/arena.c                          |  5 ++++-
>   malloc/malloc.c                         | 16 +++++++++++++---
>   malloc/morecore.c                       |  2 ++
>   sysdeps/unix/sysv/linux/dl-sysdep.c     |  2 ++
>   sysdeps/unix/sysv/linux/mmap64.c        | 19 +++++++++++++++++++
>   sysdeps/unix/sysv/linux/mmap_internal.h |  3 +++
>   8 files changed, 59 insertions(+), 10 deletions(-)
>
  
Szabolcs Nagy Nov. 23, 2020, 4:41 p.m. UTC | #2
The 11/23/2020 18:06, Topi Miettinen via Libc-alpha wrote:
> No comments at all? I see several implementation options here:
> 
> 1. Always use mmap() instead of sbrk(), delete any uses of sbrk()
> 
> I have a hard time thinking why sbrk() would ever be the preferred choice over
> mmap(), especially considering security. There may be some bytes wasted, so

i'm not against using mmap instead of brk in malloc
but mmap has more overhead so such a change
should be measured.

> 2. Conditionally use mmap() instead of sbrk()
> 
> Something like `#define USE_SBRK`, enabled by `configure` or a header file.

i think a configure time option is not a good idea,
but e.g. it could be a runtime tunable.

> I've been using a patched glibc for a month without seeing problems. I
> enabled audit logging for the brk() system call and installed a global
> seccomp filter (in initrd) which returns ENOSYS to catch any uses. So far
> I've only noticed that cpp (used by X11 startup in addition to compiling)
> calls sbrk() to check memory usage. Perhaps it should use official malloc
> statistics interface instead, since malloc() may use mmap() for other
> reasons and then sbrk() won't return true data.

sbrk should continue to work even if glibc itself
does not use it internally, that's public api/abi.
  
Florian Weimer Nov. 23, 2020, 4:44 p.m. UTC | #3
* Topi Miettinen via Libc-alpha:

> The problem with using sbrk() for allocations is that the location of the
> memory is relatively predictable, since it's always located next to the
> data segment. This series makes malloc() and the TCB use mmap() instead.

Doesn't switching to mmap trade one (relatively) fixed offset for
another?  I think anonymous mmap is not randomized independently from
the file mappings used for loading DSOs.

And the series only changes the TCB allocation for the main thread.
Fixing thread TCB/stack collocation is a completely different matter
(and probably the more significant issue than lack of ASLR).

Thanks,
Florian
  
Topi Miettinen Nov. 23, 2020, 5:55 p.m. UTC | #4
On 23.11.2020 18.41, Szabolcs Nagy wrote:
> The 11/23/2020 18:06, Topi Miettinen via Libc-alpha wrote:
>> No comments at all? I see several implementation options here:
>>
>> 1. Always use mmap() instead of sbrk(), delete any uses of sbrk()
>>
>> I have a hard time thinking why sbrk() would ever be the preferred choice over
>> mmap(), especially considering security. There may be some bytes wasted, so
> 
> i'm not against using mmap instead brk in malloc
> but the latter has more overhead so such change
> should be measured.

This test shows a 48% increase when using mmap() vs. sbrk():

$ cat malloc-vs-sbrk.c
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ROUNDS 1000000
#define SIZES 4
#define SIZE_FACTOR 4

int main(int argc, char **argv) {
        if (argc == 2) {
                for (int i = 0; i < ROUNDS; i++) {
                        for (int j = 0; j < SIZES; j++) {
                                size_t s = 4096 * (1 << (j * SIZE_FACTOR));

                                void *ptr = mmap(NULL, s, PROT_READ | PROT_WRITE,
                                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                                if (ptr == MAP_FAILED) {
                                        fprintf(stderr, "mmap() failed, size %zu iter %d\n", s, i);
                                        return 1;
                                }
                                munmap(ptr, s);
                        }
                }
        } else {
                for (int i = 0; i < ROUNDS; i++) {
                        for (int j = 0; j < SIZES; j++) {
                                size_t s = 4096 * (1 << (j * SIZE_FACTOR));

                                void *ptr = sbrk(s);
                                if (ptr == (void *) -1) {
                                        fprintf(stderr, "sbrk() failed, size %zu iter %d\n", s, i);
                                        return 1;
                                }
                                sbrk(-s);
                        }
                }
        }

        return 0;
}
$ time ./malloc-vs-sbrk

real    0m1.923s
user    0m0.160s
sys     0m1.762s
$ time ./malloc-vs-sbrk 1

real    0m2.847s
user    0m0.176s
sys     0m2.669s

>> 2. Conditionally use mmap() instead of sbrk()
>>
>> Something like `#define USE_SBRK`, enabled by `configure` or a header file.
> 
> i think configure time option is not a good idea,
> but e.g. it can be a runtime tunable.

The runtime option needs to be available very early in the dynamic 
loader, before errno and malloc() are available. Would getenv() work?

>> I've been using a patched glibc for a month without seeing problems. I
>> enabled audit logging for the brk() system call and installed a global
>> seccomp filter (in initrd) which returns ENOSYS to catch any uses. So far
>> I've only noticed that cpp (used by X11 startup in addition to compiling)
>> calls sbrk() to check memory usage. Perhaps it should use official malloc
>> statistics interface instead, since malloc() may use mmap() for other
>> reasons and then sbrk() won't return true data.
> 
> sbrk should continue to work even if glibc itself
> does not use it internally, that's public api/abi.

Yes, the patches don't remove the API/ABI.

-Topi
  
Topi Miettinen Nov. 23, 2020, 6:16 p.m. UTC | #5
On 23.11.2020 18.44, Florian Weimer wrote:
> * Topi Miettinen via Libc-alpha:
> 
>> The problem with using sbrk() for allocations is that the location of the
>> memory is relatively predictable, since it's always located next to the
>> data segment. This series makes malloc() and the TCB use mmap() instead.
> 
> Doesn't switching to mmap trade one (relatively) fixed offset for
> another?  I think anonymous mmap is not randomized independently from
> the file mappings used for loading DSOs.

The mappings are indeed rather predictable relative to each other, even 
with /proc/sys/kernel/randomize_va_space=2. The base address is somewhat 
randomized. I've sent a patch to fully randomize the mappings:

https://patchwork.kernel.org/project/linux-mm/patch/20201026160518.9212-1-toiwoton@gmail.com/

Glibc could do similar randomization by itself, by calling mmap() with 
an address hint generated from random numbers from getrandom(), but I 
think hardening the kernel is a better choice.

> And the series only changes the TCB allocation for the main thread.
> Fixing thread TCB/stack collocation is a completely different matter
> (and probably the more significant issue than lack of ASLR).

I thought I covered all uses of sbrk(), perhaps I missed that one. Does 
the thread TCB/stack allocation use sbrk()?

-Topi
  
Florian Weimer Nov. 23, 2020, 9:45 p.m. UTC | #6
* Topi Miettinen via Libc-alpha:

> $ time ./malloc-vs-sbrk
>
> real    0m1.923s
> user    0m0.160s
> sys     0m1.762s
> $ time ./malloc-vs-sbrk 1
>
> real    0m2.847s
> user    0m0.176s
> sys     0m2.669s

Does the difference go away if you change the mmap granularity to
128 KiB?  I think this happens under the covers (on the kernel side)
with sbrk.

>>> 2. Conditionally use mmap() instead of sbrk()
>>>
>>> Something like `#define USE_SBRK`, enabled by `configure` or a header file.
>> i think configure time option is not a good idea,
>> but e.g. it can be a runtime tunable.
>
> The runtime option needs to be available very early in the dynamic
> loader, before errno and malloc() are available. Would getenv() work?

getenv wouldn't work, but there is a parser for the GLIBC_TUNABLES
environment variable.

Thanks,
Florian
  
Topi Miettinen Nov. 23, 2020, 10:28 p.m. UTC | #7
On 23.11.2020 23.45, Florian Weimer wrote:
> * Topi Miettinen via Libc-alpha:
> 
>> $ time ./malloc-vs-sbrk
>>
>> real    0m1.923s
>> user    0m0.160s
>> sys     0m1.762s
>> $ time ./malloc-vs-sbrk 1
>>
>> real    0m2.847s
>> user    0m0.176s
>> sys     0m2.669s
> 
> Does the difference go away if you change the mmap granularity to
> 128 KiB?  I think this happens under the covers (on the kernel side)
> with sbrk.

Does not seem so, 56% increase:

$ time ./malloc-vs-sbrk

real    0m2.911s
user    0m0.232s
sys     0m2.677s
$ time ./malloc-vs-sbrk 1

real    0m4.545s
user    0m0.196s
sys     0m4.347s

$ cat malloc-vs-sbrk.c
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ROUNDS 1000000
#define SIZES 4
#define SIZE_FACTOR 4
#define SIZE_BASE (128 * 1024)

int main(int argc, char **argv) {
        if (argc == 2) {
                for (int i = 0; i < ROUNDS; i++) {
                        for (int j = 0; j < SIZES; j++) {
                                size_t s = SIZE_BASE * (1 << (j * SIZE_FACTOR));

                                void *ptr = mmap(NULL, s, PROT_READ | PROT_WRITE,
                                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                                if (ptr == MAP_FAILED) {
                                        fprintf(stderr, "mmap() failed, size %zu iter %d\n", s, i);
                                        return 1;
                                }
                                munmap(ptr, s);
                        }
                }
        } else {
                for (int i = 0; i < ROUNDS; i++) {
                        for (int j = 0; j < SIZES; j++) {
                                size_t s = SIZE_BASE * (1 << (j * SIZE_FACTOR));

                                void *ptr = sbrk(s);
                                if (ptr == (void *) -1) {
                                        fprintf(stderr, "sbrk() failed, size %zu iter %d\n", s, i);
                                        return 1;
                                }
                                sbrk(-s);
                        }
                }
        }

        return 0;
}

>>>> 2. Conditionally use mmap() instead of sbrk()
>>>>
>>>> Something like `#define USE_SBRK`, enabled by `configure` or a header file.
>>> i think configure time option is not a good idea,
>>> but e.g. it can be a runtime tunable.
>>
>> The runtime option needs to be available very early in the dynamic
>> loader, before errno and malloc() are available. Would getenv() work?
> 
> getenv wouldn't work, but there is a parser for the GLIBC_TUNABLES
> environment variable.

OK, I'll try to use that in the next version.

-Topi
  
Szabolcs Nagy Nov. 24, 2020, 9:47 a.m. UTC | #8
The 11/23/2020 20:16, Topi Miettinen via Libc-alpha wrote:
> On 23.11.2020 18.44, Florian Weimer wrote:
> > And the series only changes the TCB allocation for the main thread.
> > Fixing thread TCB/stack collocation is a completely different matter
> > (and probably the more significant issue than lack of ASLR).
> 
> I thought I covered all uses of sbrk(), perhaps I missed that one. Does the
> thread TCB/stack allocation use sbrk()?

the point is that thread stack and tls are allocated
as one block (with mmap, but that does not matter).

you want at least a guard page in between, so stack
corruption does not clobber tcb/tls data.

but this has costs (kernel side vma) and significant
work in glibc (separate tls and stack size accounting).
  
Florian Weimer Nov. 24, 2020, 11:24 a.m. UTC | #9
* Topi Miettinen:

> On 23.11.2020 23.45, Florian Weimer wrote:
>> * Topi Miettinen via Libc-alpha:
>> 
>>> $ time ./malloc-vs-sbrk
>>>
>>> real    0m1.923s
>>> user    0m0.160s
>>> sys     0m1.762s
>>> $ time ./malloc-vs-sbrk 1
>>>
>>> real    0m2.847s
>>> user    0m0.176s
>>> sys     0m2.669s
>> Does the difference go away if you change the mmap granularity to
>> 128 KiB?  I think this happens under the covers (on the kernel side)
>> with sbrk.
>
> Does not seem so, 56% increase:

But the test does not seem very realistic because the pages are never
faulted in.  Sorry, I didn't check that before.

Thanks,
Florian
  
Topi Miettinen Nov. 24, 2020, 11:59 a.m. UTC | #10
On 24.11.2020 13.24, Florian Weimer wrote:
> * Topi Miettinen:
> 
>> On 23.11.2020 23.45, Florian Weimer wrote:
>>> * Topi Miettinen via Libc-alpha:
>>>
>>>> $ time ./malloc-vs-sbrk
>>>>
>>>> real    0m1.923s
>>>> user    0m0.160s
>>>> sys     0m1.762s
>>>> $ time ./malloc-vs-sbrk 1
>>>>
>>>> real    0m2.847s
>>>> user    0m0.176s
>>>> sys     0m2.669s
>>> Does the difference go away if you change the mmap granularity to
>>> 128 KiB?  I think this happens under the covers (on the kernel side)
>>> with sbrk.
>>
>> Does not seem so, 56% increase:
> 
> But the test does not seem very realistic because the pages are never
> faulted in.  Sorry, I didn't check that before.

Right, this changes the equation dramatically:

# time ./malloc-vs-sbrk

real    0m19.498s
user    0m1.192s
sys     0m18.302s
# time ./malloc-vs-sbrk 1

real    0m19.428s
user    0m1.276s
sys     0m18.146s

FYI, the effect of full ASLR of mmap() by kernel also seems small:

# echo 3 >/proc/sys/kernel/randomize_va_space
# time ./malloc-vs-sbrk

real    0m19.489s
user    0m1.263s
sys     0m18.211s
# time ./malloc-vs-sbrk 1

real    0m19.532s
user    0m1.148s
sys     0m18.366s

# cat malloc-vs-sbrk.c
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ROUNDS 1000
#define SIZES 4
#define SIZE_FACTOR 3
#define SIZE_BASE (128 * 1024)

int main(int argc, char **argv) {
        if (argc == 2) {
                for (int i = 0; i < ROUNDS; i++) {
                        for (int j = 0; j < SIZES; j++) {
                                size_t s = SIZE_BASE * (1 << (j * SIZE_FACTOR));

                                void *ptr = mmap(NULL, s, PROT_READ | PROT_WRITE,
                                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                                if (ptr == MAP_FAILED) {
                                        fprintf(stderr, "mmap() failed, size %zu iter %d\n", s, i);
                                        return 1;
                                }
                                memset(ptr, 0, s);
                                munmap(ptr, s);
                        }
                }
        } else {
                for (int i = 0; i < ROUNDS; i++) {
                        for (int j = 0; j < SIZES; j++) {
                                size_t s = SIZE_BASE * (1 << (j * SIZE_FACTOR));

                                void *ptr = sbrk(s);
                                if (ptr == (void *) -1) {
                                        fprintf(stderr, "sbrk() failed, size %zu iter %d\n", s, i);
                                        return 1;
                                }
                                memset(ptr, 0, s);
                                sbrk(-s);
                        }
                }
        }

        return 0;
}

-Topi
  
Adhemerval Zanella Nov. 24, 2020, 2:29 p.m. UTC | #11
On 24/11/2020 08:59, Topi Miettinen via Libc-alpha wrote:
> On 24.11.2020 13.24, Florian Weimer wrote:
>> * Topi Miettinen:
>>
>>> On 23.11.2020 23.45, Florian Weimer wrote:
>>>> * Topi Miettinen via Libc-alpha:
>>>>
>>>>> $ time ./malloc-vs-sbrk
>>>>>
>>>>> real    0m1.923s
>>>>> user    0m0.160s
>>>>> sys     0m1.762s
>>>>> $ time ./malloc-vs-sbrk 1
>>>>>
>>>>> real    0m2.847s
>>>>> user    0m0.176s
>>>>> sys     0m2.669s
>>>> Does the difference go away if you change the mmap granularity to
>>>> 128 KiB?  I think this happens under the covers (on the kernel side)
>>>> with sbrk.
>>>
>>> Does not seem so, 56% increase:
>>
>> But the test does not seem very realistic because the pages are never
>> faulted in.  Sorry, I didn't check that before.
> 
> Right, this changes the equation dramatically:
> 
> # time ./malloc-vs-sbrk
> 
> real    0m19.498s
> user    0m1.192s
> sys     0m18.302s
> # time ./malloc-vs-sbrk 1
> 
> real    0m19.428s
> user    0m1.276s
> sys     0m18.146s
> 
> FYI, the effect of full ASLR of mmap() by kernel also seems small:
> 
> # echo 3 >/proc/sys/kernel/randomize_va_space
> # time ./malloc-vs-sbrk
> 
> real    0m19.489s
> user    0m1.263s
> sys     0m18.211s
> # time ./malloc-vs-sbrk 1
> 
> real    0m19.532s
> user    0m1.148s
> sys     0m18.366s

I saw similar results showing little performance difference on other
architectures as well (aarch64, s390x, sparc64, and armhf), so I don't 
think performance should be an impediment to this change.
  
Florian Weimer Nov. 30, 2020, 10:28 a.m. UTC | #12
* Topi Miettinen:

> FYI, the effect of full ASLR of mmap() by kernel also seems small:
>
> # echo 3 >/proc/sys/kernel/randomize_va_space
> # time ./malloc-vs-sbrk
>
> real    0m19.489s
> user    0m1.263s
> sys     0m18.211s
> # time ./malloc-vs-sbrk 1
>
> real    0m19.532s
> user    0m1.148s
> sys     0m18.366s

The value 3 doesn't do anything in a mainline kernel.

I think you will see rather catastrophic effects if you have a workload
that triggers TLB misses.  I expect that the cost of walking page tables
increases dramatically with full ASLR.

Thanks,
Florian
  
Topi Miettinen Nov. 30, 2020, 11:39 a.m. UTC | #13
On 30.11.2020 12.28, Florian Weimer wrote:
> * Topi Miettinen:
> 
>> FYI, the effect of full ASLR of mmap() by kernel also seems small:
>>
>> # echo 3 >/proc/sys/kernel/randomize_va_space
>> # time ./malloc-vs-sbrk
>>
>> real    0m19.489s
>> user    0m1.263s
>> sys     0m18.211s
>> # time ./malloc-vs-sbrk 1
>>
>> real    0m19.532s
>> user    0m1.148s
>> sys     0m18.366s
> 
> The value 3 doesn't do anything in a mainline kernel.

No, you need this patch to randomize mmap(), mremap(..., 
MREMAP_MAYMOVE), stack and vdso:

https://lkml.org/lkml/2020/11/29/214

> I think you will see rather catastrophic effects if you have a workload
> that triggers TLB misses.  I expect that the cost of walking page tables
> increases dramatically with full ASLR.

Each random mapping may need new page tables, up to 4 pages. Thus 
for a workload with lots of small (page-size) mappings, the memory 
consumption will indeed increase a lot with 
sysctl.kernel.randomize_va_space=3. But as a real-world example, the main 
process of Firefox has roughly 1500 lines in /proc/$PID/maps on my 
system. Some are contiguous with other mappings, but for simplicity, 
pretending that they were all independent and each required a full set 
of page tables, it would mean 1500*4*4kB = 24MB of page tables. RSS of 
the process is 340MB, so the effect is not that dramatic by itself, but 
page table entries may also compete for cache slots. So 
randomize_va_space=3 clearly isn't interesting for performance reasons 
but for hardening. I'd also not use it on a 32-bit system because the 
virtual memory space could be fragmented too much. I don't think that 
could happen on a 64-bit system; the physical memory should run out first.

The effect of using mmap() instead of sbrk() in libc is independent of 
the kernel patch: one malloc arena and a few other areas will use mmap() 
instead of sbrk(), so only a few mappings are affected. With 
randomize_va_space=2, the mappings made with mmap() will be allocated 
close to the other mappings and then a lot of page tables will be 
shared, so probably no additional memory is needed. With a patched 
kernel and randomize_va_space=3, each mapping will be placed randomly. 
This could take maybe tens of kB more memory in page tables.

-Topi