From patchwork Mon Feb 9 14:06:08 2015
Date: Mon, 9 Feb 2015 14:06:08 +0000
From: Mel Gorman
To: libc-alpha@sourceware.org
Subject: [PATCH] [RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead
Message-ID: <20150209140608.GD2395@suse.de>

(As per the patch guidelines -- I have not signed the copyright assignment.
I consider this patch to be trivial and seriously doubt it is "legally
significant", so hopefully this is not a problem.)

When freeing a large amount of memory from a thread, shrink_heap() may call
madvise().  The main arena avoids this particular path and it used to be
very rare, but since glibc 2.10 threads always have their own heap for
better scalability.  The problem is that madvise() is not a cheap operation
and, if it is immediately followed by a malloc and a refault, there is a lot
of system overhead.  The worst case is a thread doing something like

	while (data_to_process) {
		buf = malloc(large_size);
		do_stuff();
		free(buf);
	}

For a large size, every free() calls madvise (page table teardown, page free
and TLB flush) and is followed immediately by a malloc (fault, kernel page
allocation, zeroing and charge accounting).  The kernel overhead can
dominate such a workload.

This patch mitigates the worst-case behaviour by tracking growing/shrinking
trends.  If the heap size is roughly static or growing then madvise() is
deferred.  If the heap is shrinking then the calls to madvise() are batched
until the size being freed exceeds mmap_threshold.  This reduces the
overhead in the worst case at the cost of slightly higher RSS.

This is a basic test case for the worst-case scenario where every free is a
madvise followed by an alloc:

/* gcc bench-free.c -lpthread -o bench-free */
#include <stdlib.h>
#include <pthread.h>

static int num = 1024;

void __attribute__((noinline,noclone))
dostuff (void *p)
{
}

void *
worker (void *data)
{
  int i;

  for (i = num; i--;)
    {
      void *m = malloc (48*4096);
      dostuff (m);
      free (m);
    }

  return NULL;
}

int
main (void)
{
  pthread_t t;
  void *ret;

  if (pthread_create (&t, NULL, worker, NULL))
    exit (2);
  if (pthread_join (t, &ret))
    exit (3);

  return 0;
}

Before the patch, this resulted in 1024 calls to madvise.  With the patch
applied, madvise is called twice (one way to count the calls is shown just
before the results table below).

ebizzy is meant to generate a workload resembling common web application
server workloads.  It is threaded with a large working set that, at its
core, has an allocation, do_stuff, free loop that also hits this case.  The
primary metric of the benchmark is records processed per second.  This is
running on my desktop, which is a single-socket machine with an i7-4770 and
8 cores.  Each thread count was run for 30 seconds.  It was only run once
as the performance difference is so high that the variation is
insignificant.
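As an aside, the madvise counts quoted in this mail can be double-checked
with strace's syscall summary.  Assuming the test case above was built as
bench-free (per the comment at the top of its source), the following counts
the calls across all threads; -f is needed because the allocation loop runs
in a worker thread rather than the main thread:

	strace -f -c -e trace=madvise ./bench-free

The same approach against ebizzy gives the call counts behind the
per-second figures discussed below.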
            glibc 2.21     patch
threads 1        10230     43271
threads 2        19153     85562
threads 4        34295    157634
threads 8        51007    183901

The patch roughly quadruples the performance of this benchmark.  The
difference in system CPU usage illustrates why.

ebizzy running 1 thread with glibc 2.21
10230 records/s 306904
real 30.00 s
user  7.47 s
sys  22.49 s

22.49 seconds was spent in the kernel for a workload running for 30
seconds.  With the patch applied:

ebizzy running 1 thread with patch applied
43271 records/s 1298133
real 30.00 s
user 29.96 s
sys   0.00 s

System CPU usage was zero with the patch applied.  strace shows that glibc
running this workload calls madvise approximately 9000 times a second.
With the patch applied, madvise was called twice during the workload (or
0.06 times per second).

	* malloc/malloc.c (malloc): Account for heap grows vs shrinks
	* malloc/arena.c (free): Limit calls to madvise when shrinking
---
 ChangeLog       |  5 +++++
 malloc/arena.c  | 46 +++++++++++++++++++++++++++++++++++++++++++---
 malloc/malloc.c | 10 ++++++++++
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index dc1ed1ba1249..5208fadaf7c3 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2015-02-09  Mel Gorman
+
+	* malloc/malloc.c (malloc): Account for heap grows vs shrinks
+	* malloc/arena.c (free): Limit calls to madvise when shrinking
+
 2015-02-06  Carlos O'Donell
 
 	* version.h (RELEASE): Set to "stable".
diff --git a/malloc/arena.c b/malloc/arena.c
index 886defb074a2..9012f4d2a0b8 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -45,6 +45,20 @@
    malloc_chunks.  It is allocated with mmap() and always starts at an
    address aligned to HEAP_MAX_SIZE.  */
 
+/* madvise throttling counters.  This structure forces the counters to fit
+   into a size_t so that the alignment requirements of _heap_info can be
+   met.  */
+typedef struct _madvise_throttle
+{
+  union {
+    struct {
+      uint16_t event_madvise;
+      uint16_t event_large_malloc;
+    };
+    size_t pad;
+  };
+} madvise_throttle;
+
 typedef struct _heap_info
 {
   mstate ar_ptr; /* Arena for this heap. */
@@ -52,10 +66,13 @@ typedef struct _heap_info
   size_t size;   /* Current size in bytes. */
   size_t mprotect_size; /* Size in bytes that has been mprotected
                            PROT_READ|PROT_WRITE.  */
+
+  madvise_throttle mt;
+
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
      MALLOC_ALIGNMENT. */
-  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
+  char pad[-7 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -632,8 +649,31 @@ shrink_heap (heap_info *h, long diff)
 
       h->mprotect_size = new_size;
     }
-  else
-    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+  else {
+    unsigned int ratio;
+    /* Only keep track of recent madvise events by decaying counters */
+    if (++h->mt.event_madvise >= 100)
+      {
+        h->mt.event_large_malloc >>= 1;
+        h->mt.event_madvise >>= 1;
+      }
+    ratio = (h->mt.event_large_malloc + 1) * 100 / h->mt.event_madvise;
+    /* madvise and a refault is an expensive operation if the shrink request
+       is temporary.  Only call madvise if it is a request bigger than
+       mmap_threshold or if it is detected that there are a mix of growths
+       and shrinks but more shrink requests recently.  One impact is that a
+       stream of free+shrink requests will be batched under a single madvise
+       call.
+     */
+    if (diff >= mp_.mmap_threshold || (ratio < 100 && h->mt.event_large_malloc > 1))
+      {
+        __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+        h->mt.event_large_malloc = h->mt.event_madvise = 0;
+      }
+    else
+      {
+        return -1;
+      }
+  }
   /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
 
   h->size = new_size;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index aa7edbfd4571..41f66cb41557 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2388,6 +2388,16 @@ sysmalloc (INTERNAL_SIZE_T nb, mstate av)
           arena_mem += old_heap->size - old_heap_size;
           set_head (old_top, (((char *) old_heap + old_heap->size) - (char *) old_top)
                     | PREV_INUSE);
+          /* Track ratio of heap grows/shrinks after recent shrinks */
+          if (old_heap->mt.event_madvise)
+            {
+              /* Only keep track of recent allocations by decaying counters */
+              if (++old_heap->mt.event_large_malloc >= 100)
+                {
+                  old_heap->mt.event_large_malloc >>= 1;
+                  old_heap->mt.event_madvise >>= 1;
+                }
+            }
         }
       else if ((heap = new_heap (nb + (MINSIZE + sizeof (*heap)), mp_.top_pad)))
         {
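For readers who want the heuristic outside the diff context, the following
is a minimal standalone sketch of the two counter updates.  The struct and
function names below are illustrative only and are not the glibc
identifiers; the real changes live in shrink_heap and sysmalloc above, and
the 100-event decay and ratio test mirror those hunks.

/* Standalone sketch of the madvise throttling heuristic in this patch.
   Names are illustrative; they are not the glibc identifiers.  */
#include <stdint.h>
#include <stdbool.h>

struct throttle
{
  uint16_t event_madvise;      /* Recent heap shrink requests seen.  */
  uint16_t event_large_malloc; /* Recent heap growths seen.  */
};

/* Mirror of the sysmalloc hunk: note a heap growth, but only while there
   have been recent shrink events, and decay both counters so that only
   recent history influences the decision.  */
void
note_heap_growth (struct throttle *t)
{
  if (t->event_madvise)
    {
      /* Only keep track of recent allocations by decaying counters.  */
      if (++t->event_large_malloc >= 100)
        {
          t->event_large_malloc >>= 1;
          t->event_madvise >>= 1;
        }
    }
}

/* Mirror of the shrink_heap hunk: decide whether a shrink of DIFF bytes
   should call madvise now.  Returns true if it should; otherwise the
   caller defers, batching the shrink into a later, larger madvise.  */
bool
should_madvise (struct throttle *t, long diff, long mmap_threshold)
{
  unsigned int ratio;

  /* Only keep track of recent madvise events by decaying counters.  */
  if (++t->event_madvise >= 100)
    {
      t->event_large_malloc >>= 1;
      t->event_madvise >>= 1;
    }
  ratio = (t->event_large_malloc + 1) * 100 / t->event_madvise;

  /* madvise immediately for requests of at least mmap_threshold bytes, or
     when there has been a mix of growths and shrinks with shrinks
     dominating recently (ratio < 100).  */
  if (diff >= mmap_threshold
      || (ratio < 100 && t->event_large_malloc > 1))
    {
      t->event_large_malloc = t->event_madvise = 0;
      return true;
    }
  return false;
}

In the patch itself, note_heap_growth corresponds to the sysmalloc change
when an existing heap is grown, and should_madvise to the shrink_heap
change, where a false result maps to the early "return -1" that leaves the
heap untouched until a later, larger shrink request.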