From patchwork Mon Feb 9 14:06:08 2015
Date: Mon, 9 Feb 2015 14:06:08 +0000
From: Mel Gorman
To: libc-alpha@sourceware.org
Subject: [PATCH] [RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead
Message-ID: <20150209140608.GD2395@suse.de>

(As per the patch guidelines -- I have not signed the copyright assignment.
I consider this patch to be trivial and seriously doubt it is "legally
significant", so hopefully this is not a problem.)

When freeing a large amount of memory from a thread, shrink_heap() may call
madvise().  The main arena avoids this particular path and it used to be
very rare, but since glibc 2.10 threads always have their own heap for
better scalability.  The problem is that madvise() is not a cheap operation
and, if it is immediately followed by a malloc and a refault, there is a lot
of system overhead.  The worst case is a thread doing something like

	while (data_to_process) {
		buf = malloc(large_size);
		do_stuff();
		free(buf);
	}

For a large size, every free() calls madvise (page table teardown, page free
and TLB flush) and is followed immediately by a malloc (fault, kernel page
allocation, zeroing and charge accounting).  The kernel overhead can
dominate such a workload.

This patch mitigates the worst-case behaviour by tracking growing/shrinking
trends.  If the heap size is roughly static or growing then madvise() is
deferred.  If the heap is shrinking then the calls to madvise() are batched
until the size being freed exceeds mmap_threshold.  This reduces the
overhead in the worst case at the cost of slightly higher RSS.

This is a basic test case for the worst-case scenario where every free is a
madvise followed by an alloc:

/* gcc bench-free.c -lpthread -o bench-free */
#include <stdlib.h>
#include <pthread.h>

static int num = 1024;

void __attribute__((noinline,noclone))
dostuff (void *p)
{
}

void *
worker (void *data)
{
  int i;

  for (i = num; i--;)
    {
      void *m = malloc (48*4096);
      dostuff (m);
      free (m);
    }

  return NULL;
}

int
main (void)
{
  pthread_t t;
  void *ret;

  if (pthread_create (&t, NULL, worker, NULL))
    exit (2);
  if (pthread_join (t, &ret))
    exit (3);

  return 0;
}

Before the patch, this resulted in 1024 calls to madvise.  With the patch
applied, madvise is called twice (one way to count the calls is shown just
before the results table below).

ebizzy is meant to generate a workload resembling common web application
server workloads.  It is threaded with a large working set that, at its
core, has an allocation, do_stuff, free loop that also hits this case.  The
primary metric of the benchmark is records processed per second.  This is
running on my desktop, which is a single-socket machine with an i7-4770 and
8 cores.  Each thread count was run for 30 seconds.  It was only run once
as the performance difference is so high that the variation is
insignificant.
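As an aside, the madvise counts quoted in this mail can be double-checked
with strace's syscall summary.  Assuming the test case above was built as
bench-free (per the comment at the top of its source), the following counts
the calls across all threads; -f is needed because the allocation loop runs
in a worker thread rather than the main thread:

	strace -f -c -e trace=madvise ./bench-free

The same approach against ebizzy gives the call counts behind the
per-second figures discussed below.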
            glibc 2.21     patch
threads 1        10230     43271
threads 2        19153     85562
threads 4        34295    157634
threads 8        51007    183901

The patch roughly quadruples the performance of this benchmark.  The
difference in system CPU usage illustrates why.

ebizzy running 1 thread with glibc 2.21
10230 records/s 306904
real 30.00 s
user  7.47 s
sys  22.49 s

22.49 seconds was spent in the kernel for a workload running for 30
seconds.  With the patch applied:

ebizzy running 1 thread with patch applied
43271 records/s 1298133
real 30.00 s
user 29.96 s
sys   0.00 s

System CPU usage was zero with the patch applied.  strace shows that glibc
running this workload calls madvise approximately 9000 times a second.
With the patch applied, madvise was called twice during the workload (or
0.06 times per second).

	* malloc/malloc.c (malloc): Account for heap grows vs shrinks
	* malloc/arena.c (free): Limit calls to madvise when shrinking
---
 ChangeLog       |  5 +++++
 malloc/arena.c  | 46 +++++++++++++++++++++++++++++++++++++++++++---
 malloc/malloc.c | 10 ++++++++++
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index dc1ed1ba1249..5208fadaf7c3 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2015-02-09  Mel Gorman
+
+	* malloc/malloc.c (malloc): Account for heap grows vs shrinks
+	* malloc/arena.c (free): Limit calls to madvise when shrinking
+
 2015-02-06  Carlos O'Donell
 
 	* version.h (RELEASE): Set to "stable".
diff --git a/malloc/arena.c b/malloc/arena.c
index 886defb074a2..9012f4d2a0b8 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -45,6 +45,20 @@
    malloc_chunks.  It is allocated with mmap() and always starts at an
    address aligned to HEAP_MAX_SIZE.  */
 
+/* madvise throttling counters.  This structure forces the counters to fit
+   into a size_t so that the alignment requirements of _heap_info can be
+   met.  */
+typedef struct _madvise_throttle
+{
+  union {
+    struct {
+      uint16_t event_madvise;
+      uint16_t event_large_malloc;
+    };
+    size_t pad;
+  };
+} madvise_throttle;
+
 typedef struct _heap_info
 {
   mstate ar_ptr; /* Arena for this heap. */
@@ -52,10 +66,13 @@ typedef struct _heap_info
   size_t size;   /* Current size in bytes. */
   size_t mprotect_size; /* Size in bytes that has been mprotected
                            PROT_READ|PROT_WRITE.  */
+
+  madvise_throttle mt;
+
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
      MALLOC_ALIGNMENT. */
-  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
+  char pad[-7 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -632,8 +649,31 @@ shrink_heap (heap_info *h, long diff)
 
       h->mprotect_size = new_size;
     }
-  else
-    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+  else {
+    unsigned int ratio;
+    /* Only keep track of recent madvise events by decaying counters */
+    if (++h->mt.event_madvise >= 100)
+      {
+        h->mt.event_large_malloc >>= 1;
+        h->mt.event_madvise >>= 1;
+      }
+    ratio = (h->mt.event_large_malloc + 1) * 100 / h->mt.event_madvise;
+    /* madvise and a refault is an expensive operation if the shrink request
+       is temporary.  Only call madvise if it is a request bigger than
+       mmap_threshold or if it is detected that there are a mix of growths
+       and shrinks but more shrink requests recently.  One impact is that a
+       stream of free+shrink requests will be batched under a single madvise
+       call.
+     */
+    if (diff >= mp_.mmap_threshold || (ratio < 100 && h->mt.event_large_malloc > 1))
+      {
+        __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+        h->mt.event_large_malloc = h->mt.event_madvise = 0;
+      }
+    else
+      {
+        return -1;
+      }
+  }
   /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
 
   h->size = new_size;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index aa7edbfd4571..41f66cb41557 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2388,6 +2388,16 @@ sysmalloc (INTERNAL_SIZE_T nb, mstate av)
           arena_mem += old_heap->size - old_heap_size;
           set_head (old_top, (((char *) old_heap + old_heap->size) - (char *) old_top)
                     | PREV_INUSE);
+          /* Track ratio of heap grows/shrinks after recent shrinks */
+          if (old_heap->mt.event_madvise)
+            {
+              /* Only keep track of recent allocations by decaying counters */
+              if (++old_heap->mt.event_large_malloc >= 100)
+                {
+                  old_heap->mt.event_large_malloc >>= 1;
+                  old_heap->mt.event_madvise >>= 1;
+                }
+            }
         }
       else if ((heap = new_heap (nb + (MINSIZE + sizeof (*heap)), mp_.top_pad)))
         {
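For readers who want the heuristic outside the diff context, the following
is a minimal standalone sketch of the two counter updates.  The struct and
function names below are illustrative only and are not the glibc
identifiers; the real changes live in shrink_heap and sysmalloc above, and
the 100-event decay and ratio test mirror those hunks.

/* Standalone sketch of the madvise throttling heuristic in this patch.
   Names are illustrative; they are not the glibc identifiers.  */
#include <stdint.h>
#include <stdbool.h>

struct throttle
{
  uint16_t event_madvise;      /* Recent heap shrink requests seen.  */
  uint16_t event_large_malloc; /* Recent heap growths seen.  */
};

/* Mirror of the sysmalloc hunk: note a heap growth, but only while there
   have been recent shrink events, and decay both counters so that only
   recent history influences the decision.  */
void
note_heap_growth (struct throttle *t)
{
  if (t->event_madvise)
    {
      /* Only keep track of recent allocations by decaying counters.  */
      if (++t->event_large_malloc >= 100)
        {
          t->event_large_malloc >>= 1;
          t->event_madvise >>= 1;
        }
    }
}

/* Mirror of the shrink_heap hunk: decide whether a shrink of DIFF bytes
   should call madvise now.  Returns true if it should; otherwise the
   caller defers, batching the shrink into a later, larger madvise.  */
bool
should_madvise (struct throttle *t, long diff, long mmap_threshold)
{
  unsigned int ratio;

  /* Only keep track of recent madvise events by decaying counters.  */
  if (++t->event_madvise >= 100)
    {
      t->event_large_malloc >>= 1;
      t->event_madvise >>= 1;
    }
  ratio = (t->event_large_malloc + 1) * 100 / t->event_madvise;

  /* madvise immediately for requests of at least mmap_threshold bytes, or
     when there has been a mix of growths and shrinks with shrinks
     dominating recently (ratio < 100).  */
  if (diff >= mmap_threshold
      || (ratio < 100 && t->event_large_malloc > 1))
    {
      t->event_large_malloc = t->event_madvise = 0;
      return true;
    }
  return false;
}

In the patch itself, note_heap_growth corresponds to the sysmalloc change
when an existing heap is grown, and should_madvise to the shrink_heap
change, where a false result maps to the early "return -1" that leaves the
heap untouched until a later, larger shrink request.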