[v6] malloc: Consistently apply trim_threshold to all heaps [BZ #17195]

Message ID 20150220132747.GV3087@suse.de
State Superseded
Headers

Commit Message

Mel Gorman Feb. 20, 2015, 1:27 p.m. UTC
  Trimming heaps is a balance between saving memory and the system overhead
required to update page tables and discard allocated pages. The malloc
option M_TRIM_THRESHOLD is a tunable that users are meant to use to decide
where this balance point is but it is only applied to the main arena.

For scalability reasons, glibc malloc has per-thread heaps but these are
shrunk with madvise() if there is one page free at the top of the heap.
In some circumstances this can lead to high system overhead if a thread
has a control flow like

    while (data_to_process) {
        buf = malloc(large_size);
        do_stuff();
        free(buf);
    }

For a large size, the free() will call madvise (pagetable teardown, page
free and TLB flush) every time followed immediately by a malloc (fault,
kernel page alloc, zeroing and charge accounting). The kernel overhead
can dominate such a workload.

This patch allows the user to tune when madvise gets called by applying
the trim threshold to the per-thread heaps. Alternatively if the dynamic
brk/mmap threshold gets adjusted then the new values will be obeyed by
the per-thread heaps.

Bug 17195 was a test case motivated by a problem encountered in scientific
applications written in python that performance badly due to high page fault
overhead. The basic operation of such a program was posted by Julian Taylor
https://sourceware.org/ml/libc-alpha/2015-02/msg00373.html

With this patch applied, the overhead is eliminated. All numbers in this
report are in seconds and were recorded by running Julian's program 30
times.

pyarray
                                 glibc               madvise
                                  2.21                    v2
System  min             1.81 (  0.00%)        0.00 (100.00%)
System  mean            1.93 (  0.00%)        0.02 ( 99.20%)
System  stddev          0.06 (  0.00%)        0.01 ( 88.99%)
System  max             2.06 (  0.00%)        0.03 ( 98.54%)
Elapsed min             3.26 (  0.00%)        2.37 ( 27.30%)
Elapsed mean            3.39 (  0.00%)        2.41 ( 28.84%)
Elapsed stddev          0.14 (  0.00%)        0.02 ( 82.73%)
Elapsed max             4.05 (  0.00%)        2.47 ( 39.01%)

               glibc     madvise
                2.21          v2
User          141.86      142.28
System         57.94        0.60
Elapsed       102.02       72.66

Note that almost a minutes worth of system time is eliminted and the
program completes 28% faster on average.

To illustrate the problem without python this is a basic test-case for
the worst case scenario where every free is a madvise followed by a an alloc

/* gcc bench-free.c -lpthread -o bench-free */
static int num = 1024;

void __attribute__((noinline,noclone)) dostuff (void *p)
{
}

void *worker (void *data)
{
  int i;

  for (i = num; i--;)
    {
      void *m = malloc (48*4096);
      dostuff (m);
      free (m);
    }

  return NULL;
}

int main()
{
  int i;
  pthread_t t;
  void *ret;
  if (pthread_create (&t, NULL, worker, NULL))
    exit (2);
  if (pthread_join (t, &ret))
    exit (3);
  return 0;
}

Before the patch, this resulted in 1024 calls to madvise. With the patch applied,
madvise is called twice because the default trim threshold is high enough to avoid
this.

This a more complex case where there is a mix of frees. It's simply a different worker
function for the test case above

void *worker (void *data)
{
  int i;
  int j = 0;
  void *free_index[num];

  for (i = num; i--;)
    {
      void *m = malloc ((i % 58) *4096);
      dostuff (m);
      if (i % 2 == 0) {
        free (m);
      } else {
        free_index[j++] = m;
      }
    }
  for (; j >= 0; j--)
    {
      free(free_index[j]);
    }

  return NULL;
}

glibc 2.21 calls malloc 90305 times but with the patch applied, it's
called 13438. Increasing the trim threshold will decrease the number of
times it's called with the option of eliminating the overhead.

ebizzy is meant to generate a workload resembling common web application
server workloads. It is threaded with a large working set that at its core
has an allocation, do_stuff, free loop that also hits this case. The primary
metric of the benchmark is records processed per second. This is running on
my desktop which is a single socket machine with an I7-4770 and 8 cores.
Each thread count was run for 30 seconds. It was only run once as the
performance difference is so high that the variation is insignificant.

                glibc 2.21              patch
threads 1            10230              44114
threads 2            19153              84925
threads 4            34295             134569
threads 8            51007             183387

Note that the saving happens to be a concidence as the size allocated
by ebizzy was less than the default threshold. If a different number of
chunks were specified then it may also be necessary to tune the threshold
to compensate

This is roughly quadrupling the performance of this benchmark. The difference in
system CPU usage illustrates why.

ebizzy running 1 thread with glibc 2.21
10230 records/s 306904
real 30.00 s
user  7.47 s
sys  22.49 s

22.49 seconds was spent in the kernel for a workload runinng 30 seconds. With the
patch applied

ebizzy running 1 thread with patch applied
44126 records/s 1323792
real 30.00 s
user 29.97 s
sys   0.00 s

system CPU usage was zero with the patch applied. strace shows that glibc
running this workload calls madvise approximately 9000 times a second. With
the patch applied madvise was called twice during the workload (or 0.06
times per second).

2015-02-10  Mel Gorman  <mgorman@suse.de>

  [BZ #17195]
  * malloc/arena.c (free): Apply trim threshold to per-thread heaps
    as well as the main arena.
---
 malloc/arena.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Siddhesh Poyarekar Feb. 24, 2015, 7:01 a.m. UTC | #1
On Fri, Feb 20, 2015 at 01:27:47PM +0000, Mel Gorman wrote:
> 2015-02-10  Mel Gorman  <mgorman@suse.de>
> 
>   [BZ #17195]
>   * malloc/arena.c (free): Apply trim threshold to per-thread heaps
>     as well as the main arena.
> ---
>  malloc/arena.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 886defb074a2..fec339819309 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad)
>      }
>    top_size = chunksize (top_chunk);
>    extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
> -  if (extra < (long) pagesz)
> +  if ((unsigned long) extra < mp_.trim_threshold)
>      return 0;
>  
>    /* Try to shrink. */

Looks like this exposes some kind of a race.  I ran the testsuite with
this patch and intl/tst-gettext4 and intl/tst-gettext5 fail with a
segfault.  From a quick peek it looks like a malloc trying to tinker
with the part of an arena that has not yet been allocated, i.e. does
not have any permissions.  Can you please look into it?  I can't push
this till that problem is solved.

Thanks,
Siddhesh
  
Mel Gorman Feb. 24, 2015, 10:47 a.m. UTC | #2
On Tue, Feb 24, 2015 at 12:31:18PM +0530, Siddhesh Poyarekar wrote:
> On Fri, Feb 20, 2015 at 01:27:47PM +0000, Mel Gorman wrote:
> > 2015-02-10  Mel Gorman  <mgorman@suse.de>
> > 
> >   [BZ #17195]
> >   * malloc/arena.c (free): Apply trim threshold to per-thread heaps
> >     as well as the main arena.
> > ---
> >  malloc/arena.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/malloc/arena.c b/malloc/arena.c
> > index 886defb074a2..fec339819309 100644
> > --- a/malloc/arena.c
> > +++ b/malloc/arena.c
> > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad)
> >      }
> >    top_size = chunksize (top_chunk);
> >    extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
> > -  if (extra < (long) pagesz)
> > +  if ((unsigned long) extra < mp_.trim_threshold)
> >      return 0;
> >  
> >    /* Try to shrink. */
> 
> Looks like this exposes some kind of a race.  I ran the testsuite with
> this patch and intl/tst-gettext4 and intl/tst-gettext5 fail with a
> segfault.  From a quick peek it looks like a malloc trying to tinker
> with the part of an arena that has not yet been allocated, i.e. does
> not have any permissions.  Can you please look into it?  I can't push
> this till that problem is solved.
> 

Ok, I'll start taking a look. I have yet to properly investigate but I
do think this is an existing problem that the patch makes worse because
I find this

# This is a test run using the glibc installed on the system
mel@stampy:~/git-public/glibc/obj-baseline > ./intl/tst-gettext4
beauty
thread 1 call 1 returned: beauty
beauty
thread 2 call 1 returned: beauty
beauty
thread 1 call 2 returned: beauty
beauty
thread 2 call 2 returned: beauty

# This is a test run using glibc 2.21
mel@stampy:~/git-public/glibc/obj-baseline > ./testrun.sh ./intl/tst-gettext4
Schï¿œnheit
beautᅵ
Schï¿œnheit
beautᅵ

# This is a test run using glibc 2.21 + the patch
mel@stampy:~/git-public/glibc/obj-madvise-reduce-v6r1 > ./testrun.sh ./intl/tst-gettext4
Segmentation fault (core dumped)

glibc 2.21 does not segfault but the output from glibc 2.21 looks pretty
screwed up relative to glibc 2.19 that is installed on the system.
  
Mel Gorman Feb. 24, 2015, 2:04 p.m. UTC | #3
Joseph, there is a problem with one of your commits since glibc 2.20.

On Tue, Feb 24, 2015 at 10:47:12AM +0000, Mel Gorman wrote:
> On Tue, Feb 24, 2015 at 12:31:18PM +0530, Siddhesh Poyarekar wrote:
> > On Fri, Feb 20, 2015 at 01:27:47PM +0000, Mel Gorman wrote:
> > > 2015-02-10  Mel Gorman  <mgorman@suse.de>
> > > 
> > >   [BZ #17195]
> > >   * malloc/arena.c (free): Apply trim threshold to per-thread heaps
> > >     as well as the main arena.
> > > ---
> > >  malloc/arena.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/malloc/arena.c b/malloc/arena.c
> > > index 886defb074a2..fec339819309 100644
> > > --- a/malloc/arena.c
> > > +++ b/malloc/arena.c
> > > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad)
> > >      }
> > >    top_size = chunksize (top_chunk);
> > >    extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
> > > -  if (extra < (long) pagesz)
> > > +  if ((unsigned long) extra < mp_.trim_threshold)
> > >      return 0;
> > >  
> > >    /* Try to shrink. */
> > 
> > Looks like this exposes some kind of a race.  I ran the testsuite with
> > this patch and intl/tst-gettext4 and intl/tst-gettext5 fail with a
> > segfault.  From a quick peek it looks like a malloc trying to tinker
> > with the part of an arena that has not yet been allocated, i.e. does
> > not have any permissions.  Can you please look into it?  I can't push
> > this till that problem is solved.
> > 
> 
> Ok, I'll start taking a look. I have yet to properly investigate but I
> do think this is an existing problem that the patch makes worse because
> I find this
> 
> # This is a test run using the glibc installed on the system
> mel@stampy:~/git-public/glibc/obj-baseline > ./intl/tst-gettext4
> beauty
> thread 1 call 1 returned: beauty
> beauty
> thread 2 call 1 returned: beauty
> beauty
> thread 1 call 2 returned: beauty
> beauty
> thread 2 call 2 returned: beauty
> 
> # This is a test run using glibc 2.21
> mel@stampy:~/git-public/glibc/obj-baseline > ./testrun.sh ./intl/tst-gettext4
> Schï¿œnheit
> beautᅵ
> Schï¿œnheit
> beautᅵ
> 

I bisected this down to commit 8540f6d2a74 ("Don't require test wrappers
to preserve environment variables, use more consistent environment.").
The test case does not crash but it's clearly corrupted.

Before that commit we have

mel@stampy:~/git-public/glibc > rm -rf obj-verify; git checkout 8540f6d2a74fe9d67440535ebbcfa252180a3172^ | head; mkdir obj-verify; cd obj-verify; ../configure --prefix=/tmp/blah > /dev/null 2>&1 && make -j8 > /dev/null 2>&1 && make -j8 tests > /dev/null 2>&1 ; ./testrun.sh > ./intl/tst-gettext4; cd ..
HEAD is now at ed36bfa18faf... PowerPC: Fix optimized strncat strlen call
beauty
thread 1 call 1 returned: beauty
beauty
thread 2 call 1 returned: beauty
beauty
thread 1 call 2 returned: beauty
beauty
thread 2 call 2 returned: beauty

And on the commit we get
mel@stampy:~/git-public/glibc > rm -rf obj-verify; git checkout 8540f6d2a74fe9d67440535ebbcfa252180a3172 | head; mkdir obj-verify; cd obj-verify; ../configure --prefix=/tmp/blah > /dev/null 2>&1 && make -j8 > /dev/null 2>&1 && make -j8 tests > /dev/null 2>&1 ; ./testrun.sh > ./intl/tst-gettext4; cd ..
Previous HEAD position was ed36bfa18faf... PowerPC: Fix optimized strncat strlen call
HEAD is now at 8540f6d2a74f... Don't require test wrappers to preserve environment variables, use more consistent environment.
Schï¿œnheit
beautᅵ
Schï¿œnheit
beautᅵ
  
Joseph Myers Feb. 24, 2015, 2:35 p.m. UTC | #4
On Tue, 24 Feb 2015, Mel Gorman wrote:

> The test case does not crash but it's clearly corrupted.
> 
> Before that commit we have
> 
> mel@stampy:~/git-public/glibc > rm -rf obj-verify; git checkout 8540f6d2a74fe9d67440535ebbcfa252180a3172^ | head; mkdir obj-verify; cd obj-verify; ../configure --prefix=/tmp/blah > /dev/null 2>&1 && make -j8 > /dev/null 2>&1 && make -j8 tests > /dev/null 2>&1 ; ./testrun.sh > ./intl/tst-gettext4; cd ..
> HEAD is now at ed36bfa18faf... PowerPC: Fix optimized strncat strlen call
> beauty
> thread 1 call 1 returned: beauty
> beauty
> thread 2 call 1 returned: beauty
> beauty
> thread 1 call 2 returned: beauty
> beauty
> thread 2 call 2 returned: beauty

This is the output when the test fails; "thread 1 call 1 returned" etc. 
are failure messages.  Before my commit, testrun.sh was not by itself a 
correct way of running this test, because it lacked the environment setup 
required by this test; that environment setup was local to the 
tst-gettext4.sh script.

> And on the commit we get
> mel@stampy:~/git-public/glibc > rm -rf obj-verify; git checkout 8540f6d2a74fe9d67440535ebbcfa252180a3172 | head; mkdir obj-verify; cd obj-verify; ../configure --prefix=/tmp/blah > /dev/null 2>&1 && make -j8 > /dev/null 2>&1 && make -j8 tests > /dev/null 2>&1 ; ./testrun.sh > ./intl/tst-gettext4; cd ..
> Previous HEAD position was ed36bfa18faf... PowerPC: Fix optimized strncat strlen call
> HEAD is now at 8540f6d2a74f... Don't require test wrappers to preserve environment variables, use more consistent environment.
> Schï¿œnheit
> beautᅵ
> Schï¿œnheit
> beautᅵ

This is passing output (modulo a few character set conversions after the 
test produced its output in ISO-8859-1, but the test clearly worked 
because it didn't produce any of the "returned" messages from failed 
comparisons).  Because the environment setup is now shared for all tests 
in the makefiles, after my commit testrun.sh is suitable to use to run 
this test, and so it runs with the correct passing output.
  
Mel Gorman Feb. 24, 2015, 4:08 p.m. UTC | #5
On Tue, Feb 24, 2015 at 02:35:58PM +0000, Joseph Myers wrote:
> On Tue, 24 Feb 2015, Mel Gorman wrote:
> 
> > The test case does not crash but it's clearly corrupted.
> > 
> > Before that commit we have
> > 
> > mel@stampy:~/git-public/glibc > rm -rf obj-verify; git checkout 8540f6d2a74fe9d67440535ebbcfa252180a3172^ | head; mkdir obj-verify; cd obj-verify; ../configure --prefix=/tmp/blah > /dev/null 2>&1 && make -j8 > /dev/null 2>&1 && make -j8 tests > /dev/null 2>&1 ; ./testrun.sh > ./intl/tst-gettext4; cd ..
> > HEAD is now at ed36bfa18faf... PowerPC: Fix optimized strncat strlen call
> > beauty
> > thread 1 call 1 returned: beauty
> > beauty
> > thread 2 call 1 returned: beauty
> > beauty
> > thread 1 call 2 returned: beauty
> > beauty
> > thread 2 call 2 returned: beauty
> 
> This is the output when the test fails; "thread 1 call 1 returned" etc. 
> are failure messages.  Before my commit, testrun.sh was not by itself a 
> correct way of running this test, because it lacked the environment setup 
> required by this test; that environment setup was local to the 
> tst-gettext4.sh script.
> 

Sorry for the noise. The actual problem is my patch due to the fact that
extra can go negative.
  

Patch

diff --git a/malloc/arena.c b/malloc/arena.c
index 886defb074a2..fec339819309 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -696,7 +696,7 @@  heap_trim (heap_info *heap, size_t pad)
     }
   top_size = chunksize (top_chunk);
   extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1);
-  if (extra < (long) pagesz)
+  if ((unsigned long) extra < mp_.trim_threshold)
     return 0;
 
   /* Try to shrink. */