[RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead

Message ID 20150209140608.GD2395@suse.de
State Rejected
Headers

Commit Message

Mel Gorman Feb. 9, 2015, 2:06 p.m. UTC
  (As per patch guidelines -- I have not signed the copyright assignment. I
 consider this patch to be trivial and seriously doubt it's "legally
 significant", so hopefully this is not a problem.)

When a thread frees a large amount of memory, shrink_heap() may call
madvise(). The main arena avoids this particular path; it used to be
very rare, but since glibc 2.10 threads always have their own heap for
better scalability.

The problem is that madvise() is not a cheap operation and, if it is
immediately followed by a malloc and a refault, there is a lot of
system overhead. The worst case is a thread doing something like

while (data_to_process) {
	buf = malloc(large_size);
	do_stuff();
	free(buf);
}

For a large size, the free() will call madvise (page table teardown, page
free and TLB flush) every time, followed immediately by a malloc (fault,
kernel page alloc, zeroing and charge accounting). The kernel overhead
can dominate such a workload.

This patch mitigates the worst-case behaviour by tracking growing/shrinking
trends. If the heap size is roughly static or growing then madvise() is
deferred. If the heap is shrinking then the calls to madvise() are batched
until at least mmap_threshold bytes are being freed. This reduces the
overhead in the worst case at the cost of slightly higher RSS.
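
As a worked example of the heuristic: each per-thread heap carries two small
decaying counters, one for recent shrink events and one for recent large
mallocs, and both are halved periodically so that only recent history
matters. On a shrink, the patch below computes

	ratio = (large_mallocs + 1) * 100 / shrink_events

and calls madvise() only if the request is at least mmap_threshold, or if
there has been a recent mix of grows and shrinks with shrinks ahead (ratio
below 100 with more than one recent large malloc). For example, 10 recent
large mallocs against 40 shrinks gives ratio = 11 * 100 / 40 = 27, so the
madvise() goes ahead; 50 large mallocs against 40 shrinks gives ratio = 127,
so it is deferred; and a pure stream of shrinks with no mallocs is simply
batched until an mmap_threshold-sized request arrives.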

This is a basic test case for the worst-case scenario where every free is
a madvise followed by an alloc:

/* gcc bench-free.c -lpthread -o bench-free */

#include <stdlib.h>
#include <pthread.h>

static int num = 1024;

/* Does nothing; noinline/noclone is intended to keep the compiler from
   optimising the malloc+free pair away.  */
void __attribute__((noinline,noclone)) dostuff (void *p)
{
}

void *worker (void *data)
{
  int i;

  for (i = num; i--;)
    {
      void *m = malloc (48*4096);
      dostuff (m);
      free (m);
    }

  return NULL;
}

int main()
{
  int i;
  pthread_t t;
  void *ret;
  if (pthread_create (&t, NULL, worker, NULL))
    exit (2);
  if (pthread_join (t, &ret))
    exit (3);
  return 0;
}

Before the patch, this resulted in 1024 calls to madvise. With the patch applied,
madvise is called twice.
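
The call counts can be confirmed with strace's summary mode, which follows
the worker thread and prints a per-syscall total on exit:

	strace -f -c -e trace=madvise ./bench-free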

ebizzy is meant to generate a workload resembling common web application
server workloads. It is threaded with a large working set that at its core
has an allocation, do_stuff, free loop that also hits this case. The primary
metric of the benchmark is records processed per second. This is running on
my desktop, a single-socket machine with an i7-4770 (4 cores, 8 threads).
Each thread count was run for 30 seconds. It was only run once as the
performance difference is so large that the variation is insignificant.

		glibc 2.21		patch
threads 1            10230              43271
threads 2            19153              85562
threads 4            34295             157634
threads 8            51007             183901

This roughly quadruples the performance of this benchmark. The difference
in system CPU usage illustrates why.

ebizzy running 1 thread with glibc 2.21
10230 records/s 306904
real 30.00 s
user  7.47 s
sys  22.49 s

22.49 seconds were spent in the kernel for a workload running for 30
seconds. With the patch applied

ebizzy running 1 thread with patch applied
43271 records/s 1298133
real 30.00 s
user 29.96 s
sys   0.00 s

System CPU usage was zero with the patch applied. strace shows that glibc
running this workload calls madvise approximately 9000 times a second. With
the patch applied, madvise was called twice during the workload (roughly
0.07 times per second).

	* malloc/malloc.c (sysmalloc): Account for heap grows vs shrinks.
	* malloc/arena.c (shrink_heap): Limit calls to madvise when shrinking.
---
 ChangeLog       |  5 +++++
 malloc/arena.c  | 46 +++++++++++++++++++++++++++++++++++++++++++---
 malloc/malloc.c | 10 ++++++++++
 3 files changed, 58 insertions(+), 3 deletions(-)
  

Comments

Carlos O'Donell Feb. 9, 2015, 8:52 p.m. UTC | #1
On 02/09/2015 09:06 AM, Mel Gorman wrote:
> while (data_to_process) {
> 	buf = malloc(large_size);
> 	do_stuff();
> 	free(buf);
> }

Why isn't the fix to change the application to hoist the
malloc out of the loop?

buf = malloc(large_size);
while (data_to_process)
  {
    do_stuff();
  }
free(buf);

Is it simply that the software frameworks themselves are
unable to do this directly?

I can understand your position. Ebizzy models the workload and
you use the workload model to improve performance by changing
the runtime to match the workload.

The problem I face as a maintainer is that you've added
complexity to malloc in the form of a decaying counter, and
I need a strong justification for that kind of added complexity.

For example, I see you're from SUSE, have you put this change
through testing in your distribution builds or releases?
What were the results? Under what *real* workloads did this
make a difference?

Cheers,
Carlos.
  
Mel Gorman Feb. 9, 2015, 10:49 p.m. UTC | #2
Thanks for reviewing.

On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > while (data_to_process) {
> > 	buf = malloc(large_size);
> > 	do_stuff();
> > 	free(buf);
> > }
> 
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?
> 
> buf = malloc(large_size);
> while (data_to_process)
>   {
>     do_stuff();
>   }
> free(buf);
> 
> Is it simply that the software frameworks themselves are
> unable to do this directly?
> 

Fixing the benchmark in this case hides the problem -- glibc malloc has
a pathological worst case for a relatively basic allocation pattern that
is encountered simply because it happens to use threads (processes would
have avoided the problem). It was spotted when comparing versions of a
distribution and initially I assumed it was a kernel issue until I analysed
the problem. Even if ebizzy was fixed and it had an upstream maintainer
that would accept the patch, glibc would still have the same problem. It
would be a bit of a shame if the recommendation in some cases was simply
to avoid using malloc/free and instead cache buffers within the application.

> I can understand your position. Ebizzy models the workload and
> you use the workload model to improve performance by changing
> the runtime to match the workload.
> 

Exactly.

> The problem I face as a maintainer is that you've added
> complexity to malloc in the form of a decaying counter, and
> I need a strong justification for that kind of added complexity.
> 

I would also welcome suggestions on how madvise could be throttled without
the use of counters. The counters are heap-local, so I do not expect
there will be cache conflicts, and the allocation-side counter is only
updated after a recent heap shrink to minimise updates.

Initially I worked around this in the kernel but any solution there
breaks the existing semantics of MADV_DONTNEED and was rejected. See
last paragraph of https://lkml.org/lkml/2015/2/2/696 .

> For example, I see you're from SUSE, have you put this change
> through testing in your distribution builds or releases?

I'm not the glibc maintainer for our distribution but even if I was, the
distribution has an upstream-first policy. A change of this type would
have to be acceptable to the upstream maintainers. If this can be addressed
here then I can ask our glibc maintainers to apply the patch as a backport.

> What were the results? Under what *real* workloads did this
> make a difference?
> 

It was detected manually, but the behaviour was also spotted in firefox
during normal browsing (200 calls to madvise on average per second when
monitored for a short period) and in evolution when updating search folders
(110 madvise calls per second). In neither case can I actually quantify
the impact because the overhead is a relatively small part of the overall
workload.
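
(For reference, per-process rates like these can be counted with strace's
summary mode against a running process, e.g. strace -f -c -e trace=madvise
-p <pid> for a fixed interval, then dividing the reported call total by the
interval.)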

MariaDB, when populating a database during the startup phase of the sysbench
benchmark, was calling madvise 95 times a second. In that case, the cost of
the page table teardown + refault is negligible in comparison to the IO costs,
and the bulk of the CPU time is spent in mariadb itself, but glancing at perf
top, it looks like about 25% of system CPU time is spent tearing down and
reallocating pages. During the sysbench run itself, mariadb was calling
madvise 2000 times a second. I didn't formally quantify the impact as I do
not have a test setup for testing glibc modifications system-wide, and it's
essentially the same problem seen by ebizzy, except ebizzy is a hell of a lot
easier to test with a modified glibc.

Thanks.
  
Florian Weimer Feb. 10, 2015, 12:30 p.m. UTC | #3
On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>> while (data_to_process) {
>> 	buf = malloc(large_size);
>> 	do_stuff();
>> 	free(buf);
>> }
> 
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?

For a lot of C++ code, this would require replacing global operator new
with a pool allocator.  We do not want programmers to do that, for
various reasons (loss of tooling, our limited malloc hardening, etc.).
  
Carlos O'Donell Feb. 10, 2015, 2:56 p.m. UTC | #4
On 02/10/2015 07:30 AM, Florian Weimer wrote:
> On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
>> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>>> while (data_to_process) {
>>> 	buf = malloc(large_size);
>>> 	do_stuff();
>>> 	free(buf);
>>> }
>>
>> Why isn't the fix to change the application to hoist the
>> malloc out of the loop?
> 
> For a lot of C++ code, this would require replacing global operator new
> with a pool allocator.  We do not want programmers to do that, for
> various reasons (loss of tooling, our limited malloc hardening, etc.).
 
Is this because the objects are implicitly allocated by the language
in the inner loop using new?

Cheers,
Carlos.
  
Carlos O'Donell Feb. 10, 2015, 3:37 p.m. UTC | #5
On 02/09/2015 05:49 PM, Mel Gorman wrote:
> I would also welcome suggestions on how madvise could be throttled without
> the use of counters. The counters are heap-local, so I do not expect
> there will be cache conflicts, and the allocation-side counter is only
> updated after a recent heap shrink to minimise updates.
> 
> Initially I worked around this in the kernel but any solution there
> breaks the existing semantics of MADV_DONTNEED and was rejected. See
> last paragraph of https://lkml.org/lkml/2015/2/2/696 .

The truth is that glibc doesn't want to use MADV_DONTNEED in malloc,
but it's the only interface we have right now that has similar
semantics to what we need.

Similarly, Kostya from Google told me that ASAN also has this problem since
it has tons of statistics pages that it will touch soon, but doesn't care
whether they come back as zero or with the original data.

Two years ago I spoke with Motohiro-san and we talked about
MADV_"Want but don't need", but no mainline solution was present at the
time.

The immediate work that comes to mind is the `vrange` work by Minchan Kim.
http://lwn.net/Articles/542504/

I agree with Rik's comment in the above link that we really want MADV_FREE
in these cases, and it gives a 200% speedup over MADV_DONTNEED (as reported
by testers using jemalloc patches).

Thus, instead of adding more complex heuristics to glibc's malloc I think
we should be testing the addition of MADV_FREE in glibc's malloc and then
supporting or adjusting the proposal for MADV_FREE or vrange dependent on
the outcome.
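
To make that concrete, a minimal sketch of the fallback -- heap_advise_unused
is a made-up name, and this assumes a kernel that exports an MADV_FREE
constant, which mainline does not yet -- might be:

/* Hypothetical helper, sketch only: prefer lazy freeing where the
   experimental kernel provides MADV_FREE, otherwise keep today's
   behaviour.  With MADV_FREE the pages stay mapped and are reclaimed
   lazily, so reusing them quickly avoids the teardown + refault +
   zero-fill cost of MADV_DONTNEED.  */
static void
heap_advise_unused (char *addr, size_t len)
{
#ifdef MADV_FREE
  if (__madvise (addr, len, MADV_FREE) == 0)
    return;
#endif
  __madvise (addr, len, MADV_DONTNEED);
}

shrink_heap() would then call this where it calls __madvise() directly
today, conditionalized as described above.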

In the meantime we can talk about mitigating the problems in glibc's
allocator for systems with old kernels, but it should not be the primary
solution. In glibc we would conditionalize the changes against the first
kernel version that included MADV_FREE, and when the minimum supported
kernel version is higher than that we would remove the code in question.

My suggested next steps are:

(a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
    MADV_FREE, see how that performs, and inform upstream kernel.

(b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
    measure. Before we go any further here, please increase M_TRIM_THRESHOLD
    in ebizzy (see the sketch below) and see if that makes a difference? It
    should make a difference
    by increasing the threshold at which we trim back to the OS, both sbrk,
    and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
    memory pressure. Changing the default though is not a trivial thing, since
    it could lead to immediate OOM for existing applications that run close to
    the limit of RAM. Discussion and analysis will be required.
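
For (b), bumping the threshold in ebizzy itself is a one-line change at the
top of main -- something along these lines (untested sketch; 64MB chosen
arbitrarily):

#include <malloc.h>

int
main (void)
{
  /* Keep up to 64MB of free heap rather than trimming it back to the
     kernel; frees below the threshold should then skip the
     MADV_DONTNEED calls.  */
  mallopt (M_TRIM_THRESHOLD, 64 * 1024 * 1024);

  /* ... existing ebizzy setup and worker threads go here ... */
  return 0;
}

The same experiment can be run without recompiling by setting
MALLOC_TRIM_THRESHOLD_=67108864 in the environment.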

Comments?

Cheers,
Carlos.
  
Florian Weimer Feb. 11, 2015, 11:10 a.m. UTC | #6
On 02/10/2015 03:56 PM, Carlos O'Donell wrote:
> On 02/10/2015 07:30 AM, Florian Weimer wrote:
>> On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
>>> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>>>> while (data_to_process) {
>>>> 	buf = malloc(large_size);
>>>> 	do_stuff();
>>>> 	free(buf);
>>>> }
>>>
>>> Why isn't the fix to change the application to hoist the
>>> malloc out of the loop?
>>
>> For a lot of C++ code, this would require replacing global operator new
>> with a pool allocator.  We do not want programmers to do that, for
>> various reasons (loss of tooling, our limited malloc hardening, etc.).
>  
> Is this because the objects are implicitly allocated by the language
> in the inner loop using new?

Not by the language, but by libraries.  The libraries may not offer a
choice to avoid the large allocation and deallocation.  In general,
object reuse is no longer considered good design because it typically
complicates the different states an object must support.  This is
probably not specific to C++; it may affect modern C code as well (which
tends to hide implementation details with pointers to incomplete structs
etc., and not rely on callers providing sufficiently sized buffers).

Even if there is a way to reuse objects (e.g., we have freopen in
addition to fopen/close), I don't want to suggest it to developers
because it eventually leads to custom allocators, pools and the problems
that come with them.
  
Rich Felker Feb. 11, 2015, 1:26 p.m. UTC | #7
On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > while (data_to_process) {
> > 	buf = malloc(large_size);
> > 	do_stuff();
> > 	free(buf);
> > }
> 
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?

I understand this is impossible for some language idioms (typically
OOP, and despite my personal belief that this indicates they're bad
language idioms, I don't want to descend into that type of argument),
but to me the big question is:

Why, when you have a large buffer -- so large that it can effect
MADV_DONTNEED or munmap when freed -- are you doing so little with it
in do_stuff() that the work performed on the buffer doesn't dominate
the time spent?

This indicates to me that the problem might actually be significant
over-allocation beyond the size that's actually going to be used. Do
we have some real-world specific examples of where this is happening?
If it's poor design in application code and the applications could be
corrected, I think we should consider whether the right fix is on the
application side.

Rich
  
Mel Gorman Feb. 11, 2015, 1:34 p.m. UTC | #8
On Wed, Feb 11, 2015 at 08:26:31AM -0500, Rich Felker wrote:
> On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> > On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > > while (data_to_process) {
> > > 	buf = malloc(large_size);
> > > 	do_stuff();
> > > 	free(buf);
> > > }
> > 
> > Why isn't the fix to change the application to hoist the
> > malloc out of the loop?
> 
> I understand this is impossible for some language idioms (typically
> OOP, and despite my personal belief that this indicates they're bad
> language idioms, I don't want to descend into that type of argument),
> but to me the big question is:
> 
> Why, when you have a large buffer -- so large that it can effect
> MADV_DONTNEED or munmap when freed -- are you doing so little with it
> in do_stuff() that the work performed on the buffer doesn't dominate
> the time spent?
> 

It's less than ideal application behaviour.

> This indicates to me that the problem might actually be significant
> over-allocation beyond the size that's actually going to be used. Do
> we have some real-world specific examples of where this is happening?

In the case of ebizzy, it is the case that do_stuff is so small that the
allocation/free cost dominates. In the cases where I've seen this happen
on other workloads (firefox, evolution, mariadb during database init from
sysbench) the cost of the operations on the buffer dominated. The malloc/free
cost was there but the performance difference is in the noise.

> If it's poor design in application code and the applications could be
> corrected, I think we should consider whether the right fix is on the
> application side.
> 

Could you look at v2 of the patch please? After discussions, I accept
that fixing this with a tricky heuristic is overkill. The second patch
just obeys the trim threshold for per-thread heaps, which is much simpler.
If an application is then identified that both requires trim threshold
tuning to perform and is correctly implemented, then more complex
options can be considered.
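
The essence of v2, in heap_trim() terms, is just to compare the slack at
top against the existing tunable before shrinking -- a sketch, not the
actual hunk:

  /* Mirror what systrim() already does for the main arena: keep the
     slack if it is below the trim threshold, so small transient frees
     no longer trigger a madvise + refault cycle.  */
  if (extra < (long) mp_.trim_threshold)
    return 0;               /* no shrink, hence no madvise */
  shrink_heap (heap, extra);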

Thanks
  
Rich Felker Feb. 11, 2015, 2:07 p.m. UTC | #9
On Wed, Feb 11, 2015 at 01:34:53PM +0000, Mel Gorman wrote:
> > This indicates to me that the problem might actually be significant
> > over-allocation beyond the size that's actually going to be used. Do
> > we have some real-world specific examples of where this is happening?
> 
> In the case of ebizzy, it is the case that do_stuff is so small that the
> allocation/free cost dominates.

ebizzy is supposed to be a benchmark simulating typical workload,
right? If so, I think this specific test operation is a mistake, and I
think glibc should be cautious about optimizing for benchmarks that
don't reflect meaningful real-world usage.

> In the cases where I've seen this happen
> on other workloads (firefox, evolution, mariadb during database init from
> sysbench) the cost of the operations on the buffer dominated. The malloc/free
> cost was there but the performance difference is in the noise.

If it's not distinguishable from noise in actual usage cases, then I'm
skeptical that there's a need to fix this issue.

Rich
  
Mel Gorman Feb. 11, 2015, 2:19 p.m. UTC | #10
On Wed, Feb 11, 2015 at 09:07:37AM -0500, Rich Felker wrote:
> On Wed, Feb 11, 2015 at 01:34:53PM +0000, Mel Gorman wrote:
> > > This indicates to me that the problem might actually be significant
> > > over-allocation beyond the size that's actually going to be used. Do
> > > we have some real-world specific examples of where this is happening?
> > 
> > In the case of ebizzy, it is the case that do_stuff is so small that the
> > allocation/free cost dominates.
> 
> ebizzy is supposed to be a benchmark simulating typical workload,
> right?

Right - web application server workloads specifically. In reality, I think
it would depend on what language said workload was implemented in. Java
workloads would not hit the glibc allocator at all, for example.

> If so, I think this specific test operation is a mistake, and I
> think glibc should be cautious about optimizing for benchmarks that
> don't reflect meaningful real-world usage.
> 

That rules out the complex approach in V1 at least.

> > In the cases where I've seen this happen
> > on other workloads (firefox, evolution, mariadb during database init from
> > sysbench) the cost of the operations on the buffer dominated. The malloc/free
> > cost was there but the performance difference is in the noise.
> 
> If it's not distinguishable from noise in actual usage cases, then I'm
> skeptical that there's a need to fix this issue.
> 

I take that as a NAK for v1 of the patch. How about V2? It is expected
that heap trims can be controlled with tuning parameters but right now,
it's not possible to tune the trim threshold for per-thread heaps. V2 of
the patch fixes that and at least gives consistent behaviour.
  
Rich Felker Feb. 11, 2015, 2:21 p.m. UTC | #11
On Wed, Feb 11, 2015 at 02:19:26PM +0000, Mel Gorman wrote:
> > > In the cases where I've seen this happen
> > > on other workloads (firefox, evolution, mariadb during database init from
> > > sysbench) the cost of the operations on the buffer dominated. The malloc/free
> > > cost was there but the performance difference is in the noise.
> > 
> > If it's not distinguishable from noise in actual usage cases, then I'm
> > skeptical that there's a need to fix this issue.
> > 
> 
> I take that as a NAK for v1 of the patch. How about V2? It is expected
> that heap trims can be controlled with tuning parameters but right now,
> it's not possible to tune the trim threshold for per-thread heaps. V2 of
> the patch fixes that and at least gives consistent behaviour.

I don't think I'm qualified to discuss specific code changes to
glibc's allocator. I'll leave this to others who know it better. I was
just weighing in on the high-level aspects of the proposed changes
which made sense to discuss without knowledge specific to glibc's
malloc internals.

Rich
  
Carlos O'Donell Feb. 11, 2015, 3:50 p.m. UTC | #12
On 02/11/2015 09:21 AM, Rich Felker wrote:
> On Wed, Feb 11, 2015 at 02:19:26PM +0000, Mel Gorman wrote:
>>>> In the cases where I've seen this happen
>>>> on other workloads (firefox, evolution, mariadb during database init from
>>>> sysbench) the cost of the operations on the buffer dominated. The malloc/free
>>>> cost was there but the performance difference is in the noise.
>>>
>>> If it's not distinguishable from noise in actual usage cases, then I'm
>>> skeptical that there's a need to fix this issue.
>>>
>>
>> I take that as a NAK for v1 of the patch. How about V2? It is expected
>> that heap trims can be controlled with tuning parameters but right now,
>> it's not possible to tune the trim threshold for per-thread heaps. V2 of
>> the patch fixes that and at least gives consistent behaviour.
> 
> I don't think I'm qualified to discuss specific code changes to
> glibc's allocator. I'll leave this to others who know it better. I was
> just weighing in on the high-level aspects of the proposed changes
> which made sense to discuss without knowledge specific to glibc's
> malloc internals.

I'm looking at this now.

Cheers,
Carlos.
  

Patch

diff --git a/ChangeLog b/ChangeLog
index dc1ed1ba1249..5208fadaf7c3 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@ 
+2015-02-09  Mel Gorman  <mgorman@suse.de>
+
+	* malloc/malloc.c (sysmalloc): Account for heap grows vs shrinks.
+	* malloc/arena.c (shrink_heap): Limit calls to madvise when shrinking.
+
 2015-02-06  Carlos O'Donell  <carlos@systemhalted.org>
 
 	* version.h (RELEASE): Set to "stable".
diff --git a/malloc/arena.c b/malloc/arena.c
index 886defb074a2..9012f4d2a0b8 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -45,6 +45,20 @@ 
    malloc_chunks.  It is allocated with mmap() and always starts at an
    address aligned to HEAP_MAX_SIZE.  */
 
+/* madvise throttling counters. This structure forces the counters to fit
+   into a size_t so that the alignment requirements of _heap_info can be
+   met. */
+typedef struct _madvise_throttle
+{
+  union {
+    struct {
+      uint16_t event_madvise;
+      uint16_t event_large_malloc;
+    };
+    size_t pad;
+  };
+} madvise_throttle;
+
 typedef struct _heap_info
 {
   mstate ar_ptr; /* Arena for this heap. */
@@ -52,10 +66,13 @@  typedef struct _heap_info
   size_t size;   /* Current size in bytes. */
   size_t mprotect_size; /* Size in bytes that has been mprotected
                            PROT_READ|PROT_WRITE.  */
+
+  madvise_throttle mt;
+
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
      MALLOC_ALIGNMENT. */
-  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
+  char pad[-7 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -632,8 +649,31 @@  shrink_heap (heap_info *h, long diff)
 
       h->mprotect_size = new_size;
     }
-  else
-    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+  else {
+    unsigned int ratio;
+    /* Only keep track of recent madvise events by decaying counters */
+    if (++h->mt.event_madvise >= 100)
+      {
+        h->mt.event_large_malloc >>= 1;
+        h->mt.event_madvise >>= 1;
+      }
+    ratio = (h->mt.event_large_malloc + 1) * 100 / h->mt.event_madvise;
+
+    /* madvise and a refault is an expensive operation if the shrink request
+       is temporary. Only call madvise if it is a request bigger than
+       mmap_threshold or if it is detected that there are a mix of growths and
+       shrinks but more shrink requests recently. One impact is that a stream
+       of free+shrink requests will be batched under a single madvise call. */
+    if (diff >= mp_.mmap_threshold || (ratio < 100 && h->mt.event_large_malloc > 1))
+      {
+        __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+        h->mt.event_large_malloc = h->mt.event_madvise = 0;
+      }
+    else
+      {
+        return -1;
+      }
+  }
   /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
 
   h->size = new_size;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index aa7edbfd4571..41f66cb41557 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2388,6 +2388,16 @@  sysmalloc (INTERNAL_SIZE_T nb, mstate av)
           arena_mem += old_heap->size - old_heap_size;
           set_head (old_top, (((char *) old_heap + old_heap->size) - (char *) old_top)
                     | PREV_INUSE);
+          /* Track ratio of heap grows/shrinks after recent shrinks */
+          if (old_heap->mt.event_madvise)
+            {
+              /* Only keep track of recent allocations by decaying counters */
+              if (++old_heap->mt.event_large_malloc >= 100)
+                {
+                  old_heap->mt.event_large_malloc >>= 1;
+                  old_heap->mt.event_madvise >>= 1;
+                }
+            }
         }
       else if ((heap = new_heap (nb + (MINSIZE + sizeof (*heap)), mp_.top_pad)))
         {