[RFC] malloc: Reduce worst-case behaviour with madvise and refault overhead

Message ID 20150209140608.GD2395@suse.de
State Rejected
Headers

Commit Message

Mel Gorman Feb. 9, 2015, 2:06 p.m. UTC
  (As per patch guidelines -- I have not signed the copyright assignment. I
 consider this patch to be trivial and seriously doubt it's "legally
 significant", so hopefully this is not a problem.)

When a thread frees a large amount of memory, shrink_heap() may call
madvise(). The main arena avoids this particular path; it used to be
very rare, but since glibc 2.10 threads always have their own heap for
better scalability.

The problem is that madvise() is not a cheap operation and, if it is
immediately followed by a malloc and a refault, there is a lot of
system overhead. The worst case is a thread doing something like

while (data_to_process) {
	buf = malloc(large_size);
	do_stuff();
	free(buf);
}

For a large size, the free() will call madvise (page table teardown, page
free and TLB flush) every time, followed immediately by a malloc (fault,
kernel page alloc, zeroing and charge accounting). The kernel overhead
can dominate such a workload.

This patch mitigates the worst-case behaviour by tracking growing/shrinking
trends. If the heap size is roughly static or growing then madvise() is
deferred. If the heap is shrinking then the calls to madvise() are batched
until at least mmap_threshold bytes are being freed. This reduces the
overhead in the worst case at the cost of slightly higher RSS.
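
As a worked example of the heuristic: each per-thread heap carries two small
decaying counters, one for recent shrink events and one for recent large
mallocs, and both are halved periodically so that only recent history
matters. On a shrink, the patch below computes

	ratio = (large_mallocs + 1) * 100 / shrink_events

and calls madvise() only if the request is at least mmap_threshold, or if
there has been a recent mix of grows and shrinks with shrinks ahead (ratio
below 100 with more than one recent large malloc). For example, 10 recent
large mallocs against 40 shrinks gives ratio = 11 * 100 / 40 = 27, so the
madvise() goes ahead; 50 large mallocs against 40 shrinks gives ratio = 127,
so it is deferred; and a pure stream of shrinks with no mallocs is simply
batched until an mmap_threshold-sized request arrives.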

This is a basic test case for the worst-case scenario where every free is
a madvise followed by an alloc:

/* gcc bench-free.c -lpthread -o bench-free */

#include <stdlib.h>
#include <pthread.h>

static int num = 1024;

/* Does nothing; noinline/noclone is intended to keep the compiler from
   optimising the malloc+free pair away.  */
void __attribute__((noinline,noclone)) dostuff (void *p)
{
}

void *worker (void *data)
{
  int i;

  for (i = num; i--;)
    {
      void *m = malloc (48*4096);
      dostuff (m);
      free (m);
    }

  return NULL;
}

int main()
{
  int i;
  pthread_t t;
  void *ret;
  if (pthread_create (&t, NULL, worker, NULL))
    exit (2);
  if (pthread_join (t, &ret))
    exit (3);
  return 0;
}

Before the patch, this resulted in 1024 calls to madvise. With the patch applied,
madvise is called twice.
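
The call counts can be confirmed with strace's summary mode, which follows
the worker thread and prints a per-syscall total on exit:

	strace -f -c -e trace=madvise ./bench-free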

ebizzy is meant to generate a workload resembling common web application
server workloads. It is threaded with a large working set that at its core
has an allocation, do_stuff, free loop that also hits this case. The primary
metric of the benchmark is records processed per second. This is running on
my desktop, a single-socket machine with an i7-4770 (4 cores, 8 threads).
Each thread count was run for 30 seconds. It was only run once as the
performance difference is so large that the variation is insignificant.

		glibc 2.21		patch
threads 1            10230              43271
threads 2            19153              85562
threads 4            34295             157634
threads 8            51007             183901

This roughly quadruples the performance of this benchmark. The difference
in system CPU usage illustrates why.

ebizzy running 1 thread with glibc 2.21
10230 records/s 306904
real 30.00 s
user  7.47 s
sys  22.49 s

22.49 seconds were spent in the kernel for a workload running for 30
seconds. With the patch applied

ebizzy running 1 thread with patch applied
43271 records/s 1298133
real 30.00 s
user 29.96 s
sys   0.00 s

System CPU usage was zero with the patch applied. strace shows that glibc
running this workload calls madvise approximately 9000 times a second. With
the patch applied, madvise was called twice during the workload (roughly
0.07 times per second).

	* malloc/malloc.c (sysmalloc): Account for heap grows vs shrinks.
	* malloc/arena.c (shrink_heap): Limit calls to madvise when shrinking.
---
 ChangeLog       |  5 +++++
 malloc/arena.c  | 46 +++++++++++++++++++++++++++++++++++++++++++---
 malloc/malloc.c | 10 ++++++++++
 3 files changed, 58 insertions(+), 3 deletions(-)
  

Comments

Carlos O'Donell Feb. 9, 2015, 8:52 p.m. UTC | #1
On 02/09/2015 09:06 AM, Mel Gorman wrote:
> while (data_to_process) {
> 	buf = malloc(large_size);
> 	do_stuff();
> 	free(buf);
> }

Why isn't the fix to change the application to hoist the
malloc out of the loop?

buf = malloc(large_size);
while (data_to_process)
  {
    do_stuff();
  }
free(buf);

Is it simply that the software frameworks themselves are
unable to do this directly?

I can understand your position. Ebizzy models the workload and
you use the workload model to improve performance by changing
the runtime to match the workload.

The problem I face as a maintainer is that you've added
complexity to malloc in the form of a decaying counter, and
I need a strong justification for that kind of added complexity.

For example, I see you're from SUSE, have you put this change
through testing in your distribution builds or releases?
What were the results? Under what *real* workloads did this
make a difference?

Cheers,
Carlos.
  
Mel Gorman Feb. 9, 2015, 10:49 p.m. UTC | #2
Thanks for reviewing.

On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > while (data_to_process) {
> > 	buf = malloc(large_size);
> > 	do_stuff();
> > 	free(buf);
> > }
> 
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?
> 
> buf = malloc(large_size);
> while (data_to_process)
>   {
>     do_stuff();
>   }
> free(buf);
> 
> Is it simply that the software frameworks themselves are
> unable to do this directly?
> 

Fixing the benchmark in this case hides the problem -- glibc malloc has
a pathological worst case for a relatively basic allocation pattern that
is encountered simply because it happens to use threads (processes would
have avoided the problem). It was spotted when comparing versions of a
distribution and initially I assumed it was a kernel issue until I analysed
the problem. Even if ebizzy was fixed and it had an upstream maintainer
that would accept the patch, glibc would still have the same problem. It
would be a bit of a shame if the recommendation in some cases was simply
to avoid using malloc/free and instead cache buffers within the application.

> I can understand your position. Ebizzy models the workload and
> you use the workload model to improve performance by changing
> the runtime to match the workload.
> 

Exactly.

> The problem I face as a maintainer is that you've added
> complexity to malloc in the form of a decaying counter, and
> I need a strong justification for that kind of added complexity.
> 

I would also welcome suggestions on how madvise could be throttled without
the use of counters. The counters are heap-local, so I do not expect
there will be cache conflicts, and the allocation-side counter is only
updated after a recent heap shrink to minimise updates.

Initially I worked around this in the kernel but any solution there
breaks the existing semantics of MADV_DONTNEED and was rejected. See
last paragraph of https://lkml.org/lkml/2015/2/2/696 .

> For example, I see you're from SUSE, have you put this change
> through testing in your distribution builds or releases?

I'm not the glibc maintainer for our distribution but even if I was, the
distribution has an upstream-first policy. A change of this type would
have to be acceptable to the upstream maintainers. If this can be addressed
here then I can ask our glibc maintainers to apply the patch as a backport.

> What were the results? Under what *real* workloads did this
> make a difference?
> 

It was detected manually, but the behaviour was also spotted in firefox
during normal browsing (200 calls to madvise on average per second when
monitored for a short period) and in evolution when updating search folders
(110 madvise calls per second). In neither case can I actually quantify
the impact because the overhead is a relatively small part of the overall
workload.
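
(For reference, per-process rates like these can be counted with strace's
summary mode against a running process, e.g. strace -f -c -e trace=madvise
-p <pid> for a fixed interval, then dividing the reported call total by the
interval.)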

MariaDB, when populating a database during the startup phase of the sysbench
benchmark, was calling madvise 95 times a second. In that case, the cost of
the page table teardown + refault is negligible in comparison to the IO costs,
and the bulk of the CPU time is spent in mariadb itself, but glancing at perf
top, it looks like about 25% of system CPU time is spent tearing down and
reallocating pages. During the sysbench run itself, mariadb was calling
madvise 2000 times a second. I didn't formally quantify the impact as I do
not have a test setup for testing glibc modifications system-wide, and it's
essentially the same problem seen by ebizzy, except ebizzy is a hell of a lot
easier to test with a modified glibc.

Thanks.
  
Florian Weimer Feb. 10, 2015, 12:30 p.m. UTC | #3
On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>> while (data_to_process) {
>> 	buf = malloc(large_size);
>> 	do_stuff();
>> 	free(buf);
>> }
> 
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?

For a lot of C++ code, this would require replacing global operator new
with a pool allocator.  We do not want programmers to do that, for
various reasons (loss of tooling, our limited malloc hardening, etc.).
  
Carlos O'Donell Feb. 10, 2015, 2:56 p.m. UTC | #4
On 02/10/2015 07:30 AM, Florian Weimer wrote:
> On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
>> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>>> while (data_to_process) {
>>> 	buf = malloc(large_size);
>>> 	do_stuff();
>>> 	free(buf);
>>> }
>>
>> Why isn't the fix to change the application to hoist the
>> malloc out of the loop?
> 
> For a lot of C++ code, this would require replacing global operator new
> with a pool allocator.  We do not want programmers to do that, for
> various reasons (loss of tooling, our limited malloc hardening, etc.).
 
Is this because the objects are implicitly allocated by the language
in the inner loop using new?

Cheers,
Carlos.
  
Carlos O'Donell Feb. 10, 2015, 3:37 p.m. UTC | #5
On 02/09/2015 05:49 PM, Mel Gorman wrote:
> I would also welcome suggestions on how madvise could be throttled without
> the use of counters. The counters are heap-local, so I do not expect
> there will be cache conflicts, and the allocation-side counter is only
> updated after a recent heap shrink to minimise updates.
> 
> Initially I worked around this in the kernel but any solution there
> breaks the existing semantics of MADV_DONTNEED and was rejected. See
> last paragraph of https://lkml.org/lkml/2015/2/2/696 .

The truth is that glibc doesn't want to use MADV_DONTNEED in malloc,
but it's the only interface we have right now that has similar
semantics to what we need.

Similarly, Kostya from Google told me that ASAN also has this problem since
it has tons of statistics pages that it will touch soon, but doesn't care
whether they come back as zero or with the original data.

Two years ago I spoke with Motohiro-san and we talked about
MADV_"Want but don't need", but no mainline solution was present at the
time.

The immediate work that comes to mind is the `vrange` work by Minchan Kim.
http://lwn.net/Articles/542504/

I agree with Rik's comment in the above link that we really want MADV_FREE
in these cases, and it gives a 200% speedup over MADV_DONTNEED (as reported
by testers using jemalloc patches).

Thus, instead of adding more complex heuristics to glibc's malloc I think
we should be testing the addition of MADV_FREE in glibc's malloc and then
supporting or adjusting the proposal for MADV_FREE or vrange dependent on
the outcome.
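
To make that concrete, a minimal sketch of the fallback -- heap_advise_unused
is a made-up name, and this assumes a kernel that exports an MADV_FREE
constant, which mainline does not yet -- might be:

/* Hypothetical helper, sketch only: prefer lazy freeing where the
   experimental kernel provides MADV_FREE, otherwise keep today's
   behaviour.  With MADV_FREE the pages stay mapped and are reclaimed
   lazily, so reusing them quickly avoids the teardown + refault +
   zero-fill cost of MADV_DONTNEED.  */
static void
heap_advise_unused (char *addr, size_t len)
{
#ifdef MADV_FREE
  if (__madvise (addr, len, MADV_FREE) == 0)
    return;
#endif
  __madvise (addr, len, MADV_DONTNEED);
}

shrink_heap() would then call this where it calls __madvise() directly
today, conditionalized as described above.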

In the meantime we can talk about mitigating the problems in glibc's
allocator for systems with old kernels, but it should not be the primary
solution. In glibc we would conditionalize the changes against the first
kernel version that included MADV_FREE, and when the minimum supported
kernel version is higher than that we would remove the code in question.

My suggested next steps are:

(a) Test using kernel+MADV_FREE with a hacked glibc malloc that uses
    MADV_FREE, see how that performs, and inform upstream kernel.

(b) Continue discussion over rate-limiting MADV_DONTNEED as a temporary
    measure. Before we go any further here, please increase M_TRIM_THRESHOLD
    in ebizzy (see the sketch below) and see if that makes a difference? It
    should make a difference
    by increasing the threshold at which we trim back to the OS, both sbrk,
    and mmaps, and thus reduce the MADV_DONTNEED calls at the cost of increased
    memory pressure. Changing the default though is not a trivial thing, since
    it could lead to immediate OOM for existing applications that run close to
    the limit of RAM. Discussion and analysis will be required.
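
For (b), bumping the threshold in ebizzy itself is a one-line change at the
top of main -- something along these lines (untested sketch; 64MB chosen
arbitrarily):

#include <malloc.h>

int
main (void)
{
  /* Keep up to 64MB of free heap rather than trimming it back to the
     kernel; frees below the threshold should then skip the
     MADV_DONTNEED calls.  */
  mallopt (M_TRIM_THRESHOLD, 64 * 1024 * 1024);

  /* ... existing ebizzy setup and worker threads go here ... */
  return 0;
}

The same experiment can be run without recompiling by setting
MALLOC_TRIM_THRESHOLD_=67108864 in the environment.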

Comments?

Cheers,
Carlos.
  
Florian Weimer Feb. 11, 2015, 11:10 a.m. UTC | #6
On 02/10/2015 03:56 PM, Carlos O'Donell wrote:
> On 02/10/2015 07:30 AM, Florian Weimer wrote:
>> On 02/09/2015 09:52 PM, Carlos O'Donell wrote:
>>> On 02/09/2015 09:06 AM, Mel Gorman wrote:
>>>> while (data_to_process) {
>>>> 	buf = malloc(large_size);
>>>> 	do_stuff();
>>>> 	free(buf);
>>>> }
>>>
>>> Why isn't the fix to change the application to hoist the
>>> malloc out of the loop?
>>
>> For a lot of C++ code, this would require replacing global operator new
>> with a pool allocator.  We do not want programmers to do that, for
>> various reasons (loss of tooling, our limited malloc hardening, etc.).
>  
> Is this because the objects are implicitly allocated by the language
> in the inner loop using new?

Not by the language, but by libraries.  The libraries may not offer a
choice to avoid the large allocation and deallocation.  In general,
object reuse is no longer considered good design because it typically
complicates the different states an object must support.  This is
probably not specific to C++; it may affect modern C code as well (which
tends to hide implementation details with pointers to incomplete structs
etc., and not rely on callers providing sufficiently sized buffers).

Even if there is a way to reuse objects (e.g., we have freopen in
addition to fopen/close), I don't want to suggest it to developers
because it eventually leads to custom allocators, pools and the problems
that come with them.
  
Rich Felker Feb. 11, 2015, 1:26 p.m. UTC | #7
On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > while (data_to_process) {
> > 	buf = malloc(large_size);
> > 	do_stuff();
> > 	free(buf);
> > }
> 
> Why isn't the fix to change the application to hoist the
> malloc out of the loop?

I understand this is impossible for some language idioms (typically
OOP, and despite my personal belief that this indicates they're bad
language idioms, I don't want to descend into that type of argument),
but to me the big question is:

Why, when you have a large buffer -- so large that it can effect
MADV_DONTNEED or munmap when freed -- are you doing so little with it
in do_stuff() that the work performed on the buffer doesn't dominate
the time spent?

This indicates to me that the problem might actually be significant
over-allocation beyond the size that's actually going to be used. Do
we have some real-world specific examples of where this is happening?
If it's poor design in application code and the applications could be
corrected, I think we should consider whether the right fix is on the
application side.

Rich
  
Mel Gorman Feb. 11, 2015, 1:34 p.m. UTC | #8
On Wed, Feb 11, 2015 at 08:26:31AM -0500, Rich Felker wrote:
> On Mon, Feb 09, 2015 at 03:52:22PM -0500, Carlos O'Donell wrote:
> > On 02/09/2015 09:06 AM, Mel Gorman wrote:
> > > while (data_to_process) {
> > > 	buf = malloc(large_size);
> > > 	do_stuff();
> > > 	free(buf);
> > > }
> > 
> > Why isn't the fix to change the application to hoist the
> > malloc out of the loop?
> 
> I understand this is impossible for some language idioms (typically
> OOP, and despite my personal belief that this indicates they're bad
> language idioms, I don't want to descend into that type of argument),
> but to me the big question is:
> 
> Why, when you have a large buffer -- so large that it can effect
> MADV_DONTNEED or munmap when freed -- are you doing so little with it
> in do_stuff() that the work performed on the buffer doesn't dominate
> the time spent?
> 

It's less than ideal application behaviour.

> This indicates to me that the problem might actually be significant
> over-allocation beyond the size that's actually going to be used. Do
> we have some real-world specific examples of where this is happening?

In the case of ebizzy, it is the case that do_stuff is so small that the
allocation/free cost dominates. In the cases where I've seen this happen
on other workloads (firefox, evolution, mariadb during database init from
sysbench) the cost of the operations on the buffer dominated. The malloc/free
cost was there but the performance difference is in the noise.

> If it's poor design in application code and the applications could be
> corrected, I think we should consider whether the right fix is on the
> application side.
> 

Could you look at v2 of the patch please? After discussions, I accept
that fixing this with a tricky heuristic is overkill. The second patch
just obeys the trim threshold for per-thread heaps, which is much simpler.
If an application is then identified that both requires trim threshold
tuning to perform and is correctly implemented, then more complex
options can be considered.
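
The essence of v2, in heap_trim() terms, is just to compare the slack at
top against the existing tunable before shrinking -- a sketch, not the
actual hunk:

  /* Mirror what systrim() already does for the main arena: keep the
     slack if it is below the trim threshold, so small transient frees
     no longer trigger a madvise + refault cycle.  */
  if (extra < (long) mp_.trim_threshold)
    return 0;               /* no shrink, hence no madvise */
  shrink_heap (heap, extra);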

Thanks
  
Rich Felker Feb. 11, 2015, 2:07 p.m. UTC | #9
On Wed, Feb 11, 2015 at 01:34:53PM +0000, Mel Gorman wrote:
> > This indicates to me that the problem might actually be significant
> > over-allocation beyond the size that's actually going to be used. Do
> > we have some real-world specific examples of where this is happening?
> 
> In the case of ebizzy, it is the case that do_stuff is so small that the
> allocation/free cost dominates.

ebizzy is supposed to be a benchmark simulating typical workload,
right? If so, I think this specific test operation is a mistake, and I
think glibc should be cautious about optimizing for benchmarks that
don't reflect meaningful real-world usage.

> In the cases where I've seen this happen
> on other workloads (firefox, evolution, mariadb during database init from
> sysbench) the cost of the operations on the buffer dominated. The malloc/free
> cost was there but the performance difference is in the noise.

If it's not distinguishable from noise in actual usage cases, then I'm
skeptical that there's a need to fix this issue.

Rich
  
Mel Gorman Feb. 11, 2015, 2:19 p.m. UTC | #10
On Wed, Feb 11, 2015 at 09:07:37AM -0500, Rich Felker wrote:
> On Wed, Feb 11, 2015 at 01:34:53PM +0000, Mel Gorman wrote:
> > > This indicates to me that the problem might actually be significant
> > > over-allocation beyond the size that's actually going to be used. Do
> > > we have some real-world specific examples of where this is happening?
> > 
> > In the case of ebizzy, it is the case that do_stuff is so small that the
> > allocation/free cost dominates.
> 
> ebizzy is supposed to be a benchmark simulating typical workload,
> right?

Right - web application server workloads specifically. In reality, I think
it would depend on what language said workload was implemented in. Java
workloads would not hit the glibc allocator at all, for example.

> If so, I think this specific test operation is a mistake, and I
> think glibc should be cautious about optimizing for benchmarks that
> don't reflect meaningful real-world usage.
> 

That rules out the complex approach in V1 at least.

> > In the cases where I've seen this happen
> > on other workloads (firefox, evolution, mariadb during database init from
> > sysbench) the cost of the operations on the buffer dominated. The malloc/free
> > cost was there but the performance difference is in the noise.
> 
> If it's not distinguishable from noise in actual usage cases, then I'm
> skeptical that there's a need to fix this issue.
> 

I take that as a NAK for v1 of the patch. How about V2? It is expected
that heap trims can be controlled with tuning parameters but right now,
it's not possible to tune the trim threshold for per-thread heaps. V2 of
the patch fixes that and at least gives consistent behaviour.
  
Rich Felker Feb. 11, 2015, 2:21 p.m. UTC | #11
On Wed, Feb 11, 2015 at 02:19:26PM +0000, Mel Gorman wrote:
> > > In the cases where I've seen this happen
> > > on other workloads (firefox, evolution, mariadb during database init from
> > > sysbench) the cost of the operations on the buffer dominated. The malloc/free
> > > cost was there but the performance difference is in the noise.
> > 
> > If it's not distinguishable from noise in actual usage cases, then I'm
> > skeptical that there's a need to fix this issue.
> > 
> 
> I take that as a NAK for v1 of the patch. How about V2? It is expected
> that heap trims can be controlled with tuning parameters but right now,
> it's not possible to tune the trim threshold for per-thread heaps. V2 of
> the patch fixes that and at least gives consistent behaviour.

I don't think I'm qualified to discuss specific code changes to
glibc's allocator. I'll leave this to others who know it better. I was
just weighing in on the high-level aspects of the proposed changes
which made sense to discuss without knowledge specific to glibc's
malloc internals.

Rich
  
Carlos O'Donell Feb. 11, 2015, 3:50 p.m. UTC | #12
On 02/11/2015 09:21 AM, Rich Felker wrote:
> On Wed, Feb 11, 2015 at 02:19:26PM +0000, Mel Gorman wrote:
>>>> In the cases where I've seen this happen
>>>> on other workloads (firefox, evolution, mariadb during database init from
>>>> sysbench) the cost of the operations on the buffer dominated. The malloc/free
>>>> cost was there but the performance difference is in the noise.
>>>
>>> If it's not distinguishable from noise in actual usage cases, then I'm
>>> skeptical that there's a need to fix this issue.
>>>
>>
>> I take that as a NAK for v1 of the patch. How about V2? It is expected
>> that heap trims can be controlled with tuning parameters but right now,
>> it's not possible to tune the trim threshold for per-thread heaps. V2 of
>> the patch fixes that and at least gives consistent behaviour.
> 
> I don't think I'm qualified to discuss specific code changes to
> glibc's allocator. I'll leave this to others who know it better. I was
> just weighing in on the high-level aspects of the proposed changes
> which made sense to discuss without knowledge specific to glibc's
> malloc internals.

I'm looking at this now.

Cheers,
Carlos.
  

Patch

diff --git a/ChangeLog b/ChangeLog
index dc1ed1ba1249..5208fadaf7c3 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@ 
+2015-02-09  Mel Gorman  <mgorman@suse.de>
+
+	* malloc/malloc.c (sysmalloc): Account for heap grows vs shrinks.
+	* malloc/arena.c (shrink_heap): Limit calls to madvise when shrinking.
+
 2015-02-06  Carlos O'Donell  <carlos@systemhalted.org>
 
 	* version.h (RELEASE): Set to "stable".
diff --git a/malloc/arena.c b/malloc/arena.c
index 886defb074a2..9012f4d2a0b8 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -45,6 +45,20 @@ 
    malloc_chunks.  It is allocated with mmap() and always starts at an
    address aligned to HEAP_MAX_SIZE.  */
 
+/* madvise throttling counters. This structure forces the counters to fit
+   into a size_t so that the alignment requirements of _heap_info can be
+   met. */
+typedef struct _madvise_throttle
+{
+  union {
+    struct {
+      uint16_t event_madvise;
+      uint16_t event_large_malloc;
+    };
+    size_t pad;
+  };
+} madvise_throttle;
+
 typedef struct _heap_info
 {
   mstate ar_ptr; /* Arena for this heap. */
@@ -52,10 +66,13 @@  typedef struct _heap_info
   size_t size;   /* Current size in bytes. */
   size_t mprotect_size; /* Size in bytes that has been mprotected
                            PROT_READ|PROT_WRITE.  */
+
+  madvise_throttle mt;
+
   /* Make sure the following data is properly aligned, particularly
      that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
      MALLOC_ALIGNMENT. */
-  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
+  char pad[-7 * SIZE_SZ & MALLOC_ALIGN_MASK];
 } heap_info;
 
 /* Get a compile-time error if the heap_info padding is not correct
@@ -632,8 +649,31 @@  shrink_heap (heap_info *h, long diff)
 
       h->mprotect_size = new_size;
     }
-  else
-    __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+  else {
+    unsigned int ratio;
+    /* Only keep track of recent madvise events by decaying counters */
+    if (++h->mt.event_madvise >= 100)
+      {
+        h->mt.event_large_malloc >>= 1;
+        h->mt.event_madvise >>= 1;
+      }
+    ratio = (h->mt.event_large_malloc + 1) * 100 / h->mt.event_madvise;
+
+    /* madvise and a refault is an expensive operation if the shrink request
+       is temporary. Only call madvise if it is a request bigger than
+       mmap_threshold or if it is detected that there are a mix of growths and
+       shrinks but more shrink requests recently. One impact is that a stream
+       of free+shrink requests will be batched under a single madvise call. */
+    if (diff >= mp_.mmap_threshold || (ratio < 100 && h->mt.event_large_malloc > 1))
+      {
+        __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
+        h->mt.event_large_malloc = h->mt.event_madvise = 0;
+      }
+    else
+      {
+        return -1;
+      }
+  }
   /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
 
   h->size = new_size;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index aa7edbfd4571..41f66cb41557 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -2388,6 +2388,16 @@  sysmalloc (INTERNAL_SIZE_T nb, mstate av)
           arena_mem += old_heap->size - old_heap_size;
           set_head (old_top, (((char *) old_heap + old_heap->size) - (char *) old_top)
                     | PREV_INUSE);
+          /* Track ratio of heap grows/shrinks after recent shrinks */
+          if (old_heap->mt.event_madvise)
+            {
+              /* Only keep track of recent allocations by decaying counters */
+              if (++old_heap->mt.event_large_malloc >= 100)
+                {
+                  old_heap->mt.event_large_malloc >>= 1;
+                  old_heap->mt.event_madvise >>= 1;
+                }
+            }
         }
       else if ((heap = new_heap (nb + (MINSIZE + sizeof (*heap)), mp_.top_pad)))
         {