Message ID | 20150218102900.GF23372@suse.de |
---|---|
State | Superseded |
Headers |
Received: (qmail 1684 invoked by alias); 18 Feb 2015 10:29:09 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: <libc-alpha.sourceware.org> List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org> List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org> List-Archive: <http://sourceware.org/ml/libc-alpha/> List-Post: <mailto:libc-alpha@sourceware.org> List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs> Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 1675 invoked by uid 89); 18 Feb 2015 10:29:08 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-2.2 required=5.0 tests=AWL, BAYES_00, T_RP_MATCHES_RCVD autolearn=ham version=3.3.2 X-HELO: mx2.suse.de Date: Wed, 18 Feb 2015 10:29:00 +0000 From: Mel Gorman <mgorman@suse.de> To: libc-alpha@sourceware.org Cc: Julian Taylor <jtaylor.debian@googlemail.com>, Carlos O'Donell <carlos@redhat.com>, Siddhesh Poyarekar <siddhesh@redhat.com>, Rik van Riel <riel@redhat.com>, KOSAKI Motohiro <kosaki.motohiro@gmail.com>, Konstantin Serebryany <kcc@google.com>, Minchan Kim <minchan.kim@gmail.com> Subject: [PATCH] [v4] malloc: Consistently apply trim_threshold to all heaps [BZ #17195] Message-ID: <20150218102900.GF23372@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) |
Commit Message
Mel Gorman
Feb. 18, 2015, 10:29 a.m. UTC
Trimming heaps is a balance between saving memory and the system overhead required to update page tables and discard allocated pages. The malloc option M_TRIM_THRESHOLD is a tunable that users are meant to use to decide where this balance point is but it is only applied to the main arena. For scalability reasons, glibc malloc has per-thread heaps but these are shrunk with madvise() if there is one page free at the top of the heap. In some circumstances this can lead to high system overhead if a thread has a control flow like while (data_to_process) { buf = malloc(large_size); do_stuff(); free(buf); } For a large size, the free() will call madvise (pagetable teardown, page free and TLB flush) every time followed immediately by a malloc (fault, kernel page alloc, zeroing and charge accounting). The kernel overhead can dominate such a workload. This patch allows the user to tune when madvise gets called by applying the trim threshold to the per-thread heaps. Alternatively if the dynamic brk/mmap threshold gets adjusted then the new values will be obeyed by the per-thread heaps. Bug 17195 was a test case motivated by a problem encountered in scientific applications written in python that performance badly due to high page fault overhead. The basic operation of such a program was posted by Julian Taylor https://sourceware.org/ml/libc-alpha/2015-02/msg00373.html With this patch applied, the overhead is eliminated. All numbers in this report are in seconds and were recorded by running Julian's program 30 times. pyarray glibc madvise 2.21 v2 System min 1.81 ( 0.00%) 0.00 (100.00%) System mean 1.93 ( 0.00%) 0.02 ( 99.20%) System stddev 0.06 ( 0.00%) 0.01 ( 88.99%) System max 2.06 ( 0.00%) 0.03 ( 98.54%) Elapsed min 3.26 ( 0.00%) 2.37 ( 27.30%) Elapsed mean 3.39 ( 0.00%) 2.41 ( 28.84%) Elapsed stddev 0.14 ( 0.00%) 0.02 ( 82.73%) Elapsed max 4.05 ( 0.00%) 2.47 ( 39.01%) glibc madvise 2.21 v2 User 141.86 142.28 System 57.94 0.60 Elapsed 102.02 72.66 Note that almost a minutes worth of system time is eliminted and the program completes 28% faster on average. To illustrate the problem without python this is a basic test-case for the worst case scenario where every free is a madvise followed by a an alloc /* gcc bench-free.c -lpthread -o bench-free */ static int num = 1024; void __attribute__((noinline,noclone)) dostuff (void *p) { } void *worker (void *data) { int i; for (i = num; i--;) { void *m = malloc (48*4096); dostuff (m); free (m); } return NULL; } int main() { int i; pthread_t t; void *ret; if (pthread_create (&t, NULL, worker, NULL)) exit (2); if (pthread_join (t, &ret)) exit (3); return 0; } Before the patch, this resulted in 1024 calls to madvise. With the patch applied, madvise is called twice because the default trim threshold is high enough to avoid this. This a more complex case where there is a mix of frees. It's simply a different worker function for the test case above void *worker (void *data) { int i; int j = 0; void *free_index[num]; for (i = num; i--;) { void *m = malloc ((i % 58) *4096); dostuff (m); if (i % 2 == 0) { free (m); } else { free_index[j++] = m; } } for (; j >= 0; j--) { free(free_index[j]); } return NULL; } glibc 2.21 calls malloc 90305 times but with the patch applied, it's called 13438. Increasing the trim threshold will decrease the number of times it's called with the option of eliminating the overhead. ebizzy is meant to generate a workload resembling common web application server workloads. It is threaded with a large working set that at its core has an allocation, do_stuff, free loop that also hits this case. The primary metric of the benchmark is records processed per second. This is running on my desktop which is a single socket machine with an I7-4770 and 8 cores. Each thread count was run for 30 seconds. It was only run once as the performance difference is so high that the variation is insignificant. glibc 2.21 patch threads 1 10230 44114 threads 2 19153 84925 threads 4 34295 134569 threads 8 51007 183387 Note that the saving happens to be a concidence as the size allocated by ebizzy was less than the default threshold. If a different number of chunks were specified then it may also be necessary to tune the threshold to compensate This is roughly quadrupling the performance of this benchmark. The difference in system CPU usage illustrates why. ebizzy running 1 thread with glibc 2.21 10230 records/s 306904 real 30.00 s user 7.47 s sys 22.49 s 22.49 seconds was spent in the kernel for a workload runinng 30 seconds. With the patch applied ebizzy running 1 thread with patch applied 44126 records/s 1323792 real 30.00 s user 29.97 s sys 0.00 s system CPU usage was zero with the patch applied. strace shows that glibc running this workload calls madvise approximately 9000 times a second. With the patch applied madvise was called twice during the workload (or 0.06 times per second). 2015-02-10 Mel Gorman <mgorman@suse.de> [BZ #17195] * malloc/arena.c (free): Apply trim threshold to per-thread heaps as well as the main arena. --- malloc/arena.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Comments
On 02/18/2015 11:29 AM, Mel Gorman wrote: > diff --git a/malloc/arena.c b/malloc/arena.c > index 886defb074a2..a78d4835a825 100644 > --- a/malloc/arena.c > +++ b/malloc/arena.c > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad) > } > top_size = chunksize (top_chunk); > extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1); > - if (extra < (long) pagesz) > + if (extra < (long) mp_.trim_threshold) > return 0; > > /* Try to shrink. */ > I am not very familiar with how the allocator works, what now happens with this change when you are creating lots of temporary threads that allocate memory below the trim treshold? As I understand it each thread has its own arena, as they are now not trimmed as often anymore is that area memory lost when the thread ends? I recall something along these lines for the original reasoning why the trim was not applied to thread arenas.
On Wed, Feb 18, 2015 at 02:26:20PM +0100, Julian Taylor wrote: > On 02/18/2015 11:29 AM, Mel Gorman wrote: > > > diff --git a/malloc/arena.c b/malloc/arena.c > > index 886defb074a2..a78d4835a825 100644 > > --- a/malloc/arena.c > > +++ b/malloc/arena.c > > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad) > > } > > top_size = chunksize (top_chunk); > > extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1); > > - if (extra < (long) pagesz) > > + if (extra < (long) mp_.trim_threshold) > > return 0; > > > > /* Try to shrink. */ > > > > > I am not very familiar with how the allocator works, This is the first time in years I've looked so take this with a grain of salt. Others with more experience of the allocator may chime in with corrections. > what now happens > with this change when you are creating lots of temporary threads that > allocate memory below the trim treshold? The exact impact would depend on the configured trim threshold. In theory nothing stops a user making this very large and creating processes with a single thread that achieve the same effect. Assuming the default is used and they stay below the default trim threshold then each thread can attempt to use 128K of useless data. It changes with dynamic heap tuning but lets take the simple case. At this point, the maximum amount of wastage is limited by M_ARENA_MAX which is 8 by default. By default, the maximum wastage is low even for a carefully crafted application. The worst-case situation can be tuned by setting trim threshold or arena_max but there are easier ways to waste memory. Furthermore, as the application should not reference the memory (it's freed) the wasted memory eventually be swapped out if swap is available. A hostile application could use knowledge of the allocator to use-after-free the memory and keep it resident but it would be very obvious in top. Overall it would be easier to interfere with a machine but just calling mmap() for a large anonymous buffer. This applies to a carefully crafted application or one that happens to free just below the threshold and never mallocs again. Overall, I think the worst-case impact is marginal. > As I understand it each thread has its own arena, as they are now not > trimmed as often anymore is that area memory lost when the thread ends? No, when a process exits, the memory is given back to the system unconditionally. If it's a thread that's joined then the wastage depends on whether another thread uses that heap or not. > I recall something along these lines for the original reasoning why the > trim was not applied to thread arenas. Is there any chance you can remember keywords in the discussion? The changelog is completely void of information. Worse, it would have been harder to hit this problem originally because new areas were only created when contention occurred. If there was strong motivation for the pagesz value then it's possible that it no longer applies.
On 02/18/2015 04:09 PM, Mel Gorman wrote: > On Wed, Feb 18, 2015 at 02:26:20PM +0100, Julian Taylor wrote: >> On 02/18/2015 11:29 AM, Mel Gorman wrote: >> >> I recall something along these lines for the original reasoning why the >> trim was not applied to thread arenas. > > Is there any chance you can remember keywords in the discussion? The > changelog is completely void of information. Worse, it would have been > harder to hit this problem originally because new areas were only created > when contention occurred. If there was strong motivation for the pagesz > value then it's possible that it no longer applies. > it was probably BZ#11261 that I remembered, its about excessive virtual memory usage due to arenas. As they are now trimmed less aggressively are we making it worse?
On Wed, Feb 18, 2015 at 04:43:04PM +0100, Julian Taylor wrote: > On 02/18/2015 04:09 PM, Mel Gorman wrote: > > On Wed, Feb 18, 2015 at 02:26:20PM +0100, Julian Taylor wrote: > >> On 02/18/2015 11:29 AM, Mel Gorman wrote: > >> > >> I recall something along these lines for the original reasoning why the > >> trim was not applied to thread arenas. > > > > Is there any chance you can remember keywords in the discussion? The > > changelog is completely void of information. Worse, it would have been > > harder to hit this problem originally because new areas were only created > > when contention occurred. If there was strong motivation for the pagesz > > value then it's possible that it no longer applies. > > > > it was probably BZ#11261 that I remembered, its about excessive virtual > memory usage due to arenas. > As they are now trimmed less aggressively are we making it worse? By a small margin. Worst case TRIM_THRESHOLD * ARENA_MAX and only applies for a task that hits the exact case and never mallocs again. As the number of arenas is bound and small by default, I cannot see a case where there would be excessive amounts of memory wastage. Wasting a lot of memory would require the application to be tuned specifically to waste memory. The test case in question shows no difference in behaviour that I could see but it's not actually trying to target the trim threshold case.
On 18.02.2015 11:29, Mel Gorman wrote: > > 2015-02-10 Mel Gorman <mgorman@suse.de> > > [BZ #17195] > * malloc/arena.c (free): Apply trim threshold to per-thread heaps > as well as the main arena. > --- > malloc/arena.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/malloc/arena.c b/malloc/arena.c > index 886defb074a2..a78d4835a825 100644 > --- a/malloc/arena.c > +++ b/malloc/arena.c > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad) > } > top_size = chunksize (top_chunk); > extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1); > - if (extra < (long) pagesz) > + if (extra < (long) mp_.trim_threshold) > return 0; > > /* Try to shrink. */ > the patch does not seem to account for trim threshold being -1 which is documented to disable the threshold.
On Wed, Feb 18, 2015 at 11:08:13PM +0100, Julian Taylor wrote: > On 18.02.2015 11:29, Mel Gorman wrote: > > > > > 2015-02-10 Mel Gorman <mgorman@suse.de> > > > > [BZ #17195] > > * malloc/arena.c (free): Apply trim threshold to per-thread heaps > > as well as the main arena. > > --- > > malloc/arena.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/malloc/arena.c b/malloc/arena.c > > index 886defb074a2..a78d4835a825 100644 > > --- a/malloc/arena.c > > +++ b/malloc/arena.c > > @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad) > > } > > top_size = chunksize (top_chunk); > > extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1); > > - if (extra < (long) pagesz) > > + if (extra < (long) mp_.trim_threshold) > > return 0; > > > > /* Try to shrink. */ > > > > the patch does not seem to account for trim threshold being -1 which is > documented to disable the threshold. Well spotted, will post v5 shortly that matches the style for the main arena test to catch this case.
On Wed, Feb 18, 2015 at 04:43:04PM +0100, Julian Taylor wrote: > it was probably BZ#11261 that I remembered, its about excessive virtual > memory usage due to arenas. > As they are now trimmed less aggressively are we making it worse? That is a completely different problem. The number of arenas went out of control because there was a race in accounting the number of arenas. Siddhesh
diff --git a/malloc/arena.c b/malloc/arena.c index 886defb074a2..a78d4835a825 100644 --- a/malloc/arena.c +++ b/malloc/arena.c @@ -696,7 +696,7 @@ heap_trim (heap_info *heap, size_t pad) } top_size = chunksize (top_chunk); extra = (top_size - pad - MINSIZE - 1) & ~(pagesz - 1); - if (extra < (long) pagesz) + if (extra < (long) mp_.trim_threshold) return 0; /* Try to shrink. */