From patchwork Tue Jan 2 18:20:04 2018
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 25183
From: Wilco Dijkstra
To: Carlos O'Donell, libc-alpha@sourceware.org
CC: nd
Subject: [PATCH v2] Add malloc micro benchmark
Date: Tue, 2 Jan 2018 18:20:04 +0000
In-Reply-To: <6ad98d83-d49b-25a3-ef01-e93e18f4740b@redhat.com>
Carlos O'Donell wrote:

> If you have a pattern of malloc/free of *similar* sized blocks, then
> it overflows the sized bin in the tcache, with other size bins remaining
> empty. The cache itself does not dynamically reconfigure itself to consume
> X MiB or Y % of RSS, instead it uses a simple data structure to contain
> a fixed number of fixed size blocks.
>
> Therefore I agree, that enhancing the core data structure in tcache may
> result in better overall performance, particularly if we got rid of the
> fixed bin sizes and instead found a way to be performant *and* keep a
> running total of consumption.

Well, it could keep track of the sum of all cached block sizes and limit
that. That would be better than a fixed limit on the number of blocks in
each bin.

> Likewise *all* of malloc needs to be moved to a better data structure than
> just linked lists. I would like to see glibc's malloc offer a caching
> footprint of no more than Y % of RSS available, and let the user tweak that.
> Currently we just consume RSS without much regard for overhead. Though this
> is a different case than what you are talking about, the changes are
> related via data-structure enhancements that would benefit both cases IMO.

What kind of data structure do you have in mind? Small blocks could be
allocated together in pages. This would avoid the per-block overhead and
change the linked list into a bitmap scan. However, there aren't that many
alternatives. I've written first-fit allocators using self-balancing trees,
but walking the tree is expensive due to not having good locality.

>> I *wish* we could test main_arena vs. threaded arena, since they
>> have different code and behave differently e.g.
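The byte-budget idea above (limit the sum of cached block sizes rather than a per-bin block count) could be sketched roughly as below. Everything here - cache_put, cache_get, the size-class arithmetic and the 4 KiB budget - is an invented illustration, not glibc's actual tcache code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tcache-like cache bounded by total cached bytes
   instead of a fixed per-bin block count.  */

#define NUM_CLASSES 8
#define CLASS_STEP 16		/* Size classes 16, 32, ..., 128 bytes.  */
#define CACHE_MAX_BYTES 4096	/* Global budget across all bins.  */

struct cached_block { struct cached_block *next; };

static struct cached_block *bins[NUM_CLASSES];
static size_t cached_bytes;	/* Running total of all cached sizes.  */

/* Map a request size to a bin index, or -1 if it is too large.  */
static int
size_class (size_t size)
{
  size_t c = (size + CLASS_STEP - 1) / CLASS_STEP;
  return (c >= 1 && c <= NUM_CLASSES) ? (int) c - 1 : -1;
}

/* Cache BLOCK on free; return 0 if it should go back to the allocator
   because the global byte budget is exhausted.  */
static int
cache_put (void *block, size_t size)
{
  int c = size_class (size);
  if (c < 0)
    return 0;
  size_t rounded = (size_t) (c + 1) * CLASS_STEP;
  if (cached_bytes + rounded > CACHE_MAX_BYTES)
    return 0;
  struct cached_block *b = block;
  b->next = bins[c];
  bins[c] = b;
  cached_bytes += rounded;
  return 1;
}

/* Reuse a cached block for SIZE, or return NULL on a miss.  */
static void *
cache_get (size_t size)
{
  int c = size_class (size);
  if (c < 0 || bins[c] == NULL)
    return NULL;
  struct cached_block *b = bins[c];
  bins[c] = b->next;
  cached_bytes -= (size_t) (c + 1) * CLASS_STEP;
  return b;
}
```

On a miss the caller falls through to the normal allocation path, and a rejected put just frees the block normally, so the cache's RSS contribution stays below the budget regardless of the size mix - unlike a per-bin count, which lets a skewed workload fill one bin while the others sit empty.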
>> sbrk vs. mmap'd heap.

I've added tests of the main and thread arenas in the latest version, as
well as a test with a larger tcache.

> As I suggested in bug 15321:
> https://sourceware.org/bugzilla/show_bug.cgi?id=15321
>
> We need to merge the main_arena and threaded code together, and stop
> treating them as different things. Right now the main_arena, if you
> look at the code, is a *pretend* heap with a partial data structure
> layered in place. This needs to go away. We need to treat all heaps
> as identical, with identical code paths, with just different backing
> storage.

Yes, that sounds like a good plan.

> I think people still expect that thread 0 allocates from the sbrk
> heap in a single-threaded application, and we can do that by ensuring
> sbrk is used to provide the backing store for the main thread. This way
> we can jump the pointer 64MB like we normally do for mmap'd heaps, but
> then on page touch the kernel just extends the heap normally.

A quick benchmark shows sbrk is about 25% faster than mmap, so it appears
useful. Looking at the addresses, sbrk grows up, while mmap grows down the
address space. I think other arenas could use sbrk as well, as long as you
can free sbrk memory via MADV_FREE or MADV_DONTNEED.

>> Implementation:
>>
>> You need to make this robust against env vars changing malloc
>> behaviour. You should use mallopt to change some parameters.

I've added support for changing the tcache count via mallopt, so we can
benchmark different settings.

> We need to move to MADV_FREE, which was designed for memory allocators.
>
> The semantics of MADV_DONTNEED have the problem that one has to consider:
> * Is the data destructively lost in that page?
> * Is the data flushed to the underlying store before being not-needed?
> All of which lead to MADV_DONTNEED doing a lot of teardown work to ensure
> that users don't corrupt the data in their backing stores.
>
> I think that detection of MADV_FREE, and usage, would help performance,
> but only on Linux >= 4.5, and that might be OK for you.

Well, we should detect when MADV_FREE is supported and use that. If not,
tweak the size at which MADV_DONTNEED is used - perhaps based on a
percentage of the RSS size, so we limit the overhead.

Anyway, here is the new version:

Add a malloc micro benchmark to enable accurate testing of the various
paths in malloc and free. The benchmark does a varying number of
allocations of a given block size, then frees them again. It tests 4
different scenarios: single-threaded using the main arena, multi-threaded
using a thread arena, the main arena with SINGLE_THREAD_P false, and the
main arena with the tcache count set larger than the default. To enable
this, add support for M_TCACHE_COUNT in mallopt.

OK for commit?

ChangeLog:
2018-01-02  Wilco Dijkstra

	* benchtests/Makefile: Add malloc-simple benchmark.
	* benchtests/bench-malloc-simple.c: New benchmark.
	* malloc/malloc.h (M_TCACHE_COUNT): Add new define.
	* malloc/malloc.c (__libc_mallopt): Handle M_TCACHE_COUNT.
diff --git a/benchtests/Makefile b/benchtests/Makefile
index 74b3821ccfea6912e68578ad2598d68a9e38223c..5052bbbfe79f6d5a0b16c427dfc4807271805e61 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -90,7 +90,7 @@ CFLAGS-bench-trunc.c += -fno-builtin
 CFLAGS-bench-truncf.c += -fno-builtin
 
 ifeq (${BENCHSET},)
-bench-malloc := malloc-thread
+bench-malloc := malloc-thread malloc-simple
 else
 bench-malloc := $(filter malloc-%,${BENCHSET})
 endif
@@ -98,7 +98,7 @@ endif
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
-$(objpfx)bench-malloc-thread: $(shared-thread-library)
+$(addprefix $(objpfx)bench-,$(bench-malloc)): $(shared-thread-library)
@@ -165,7 +165,7 @@ bench-clean:
 ifneq ($(strip ${BENCHSET}),)
 VALIDBENCHSETNAMES := bench-pthread bench-math bench-string string-benchset \
 	wcsmbs-benchset stdlib-benchset stdio-common-benchset math-benchset \
-	malloc-thread
+	malloc-thread malloc-simple
 INVALIDBENCHSETNAMES := $(filter-out ${VALIDBENCHSETNAMES},${BENCHSET})
 ifneq (${INVALIDBENCHSETNAMES},)
 $(info The following values in BENCHSET are invalid: ${INVALIDBENCHSETNAMES})
@@ -201,10 +201,18 @@ bench-set: $(binaries-benchset)
 bench-malloc: $(binaries-bench-malloc)
 	for run in $^; do \
+	  echo "$${run}"; \
+	  if [ `basename $${run}` = "bench-malloc-thread" ]; then \
 	  for thr in 1 8 16 32; do \
 	    echo "Running $${run} $${thr}"; \
-	  $(run-bench) $${thr} > $${run}-$${thr}.out; \
-	  done;\
+	    $(run-bench) $${thr} > $${run}-$${thr}.out; \
+	  done;\
+	  else \
+	  for thr in 8 16 32 64 128 256 512 1024 2048 4096; do \
+	    echo "Running $${run} $${thr}"; \
+	    $(run-bench) $${thr} > $${run}-$${thr}.out; \
+	  done;\
+	  fi;\
 	  done
 
 # Build and execute the benchmark functions.
 # This target generates JSON
diff --git a/benchtests/bench-malloc-simple.c b/benchtests/bench-malloc-simple.c
new file mode 100644
index 0000000000000000000000000000000000000000..151c38de50c5e747e05d69c717452241a47d7d22
--- /dev/null
+++ b/benchtests/bench-malloc-simple.c
@@ -0,0 +1,201 @@
+/* Benchmark malloc and free functions.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <malloc.h>
+#include <sys/resource.h>
+#include "bench-timing.h"
+#include "json-lib.h"
+
+#define NUM_ITERS 1000000
+#define NUM_ALLOCS 4
+#define MAX_ALLOCS 1000
+
+typedef struct
+{
+  size_t iters;
+  size_t size;
+  int n;
+  timing_t elapsed;
+} malloc_args;
+
+static void
+do_benchmark (malloc_args *args, int **arr)
+{
+  timing_t start, stop;
+  size_t iters = args->iters;
+  size_t size = args->size;
+  int n = args->n;
+
+  TIMING_NOW (start);
+
+  for (int j = 0; j < iters; j++)
+    {
+      for (int i = 0; i < n; i++)
+	arr[i] = malloc (size);
+
+      for (int i = 0; i < n; i++)
+	free (arr[i]);
+    }
+
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (args->elapsed, start, stop);
+}
+
+static malloc_args tests[4][NUM_ALLOCS];
+static int allocs[NUM_ALLOCS] = { 25, 100, 400, MAX_ALLOCS };
+
+static void *
+thread_test (void *p)
+{
+  int **arr = (int**)p;
+
+  /* Run benchmark multi-threaded.
+     */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[2][i], arr);
+
+  return p;
+}
+
+void
+bench (unsigned long size)
+{
+  size_t iters = NUM_ITERS;
+  int **arr = (int**) malloc (MAX_ALLOCS * sizeof (void*));
+  unsigned long res;
+
+  TIMING_INIT (res);
+  (void) res;
+
+  /* Set tcache count to default.  */
+  mallopt (M_TCACHE_COUNT, -1);
+
+  for (int t = 0; t <= 3; t++)
+    for (int i = 0; i < NUM_ALLOCS; i++)
+      {
+	tests[t][i].n = allocs[i];
+	tests[t][i].size = size;
+	tests[t][i].iters = iters / allocs[i];
+
+	/* Do a quick warmup run.  */
+	if (t == 0)
+	  do_benchmark (&tests[0][i], arr);
+      }
+
+  /* Run benchmark single threaded in main_arena.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[0][i], arr);
+
+  /* Run benchmark in a thread_arena.  */
+  pthread_t t;
+  pthread_create (&t, NULL, thread_test, (void*)arr);
+  pthread_join (t, NULL);
+
+  /* Repeat benchmark in main_arena with SINGLE_THREAD_P == false.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[1][i], arr);
+
+  /* Increase size of tcache.  */
+  mallopt (M_TCACHE_COUNT, 100);
+
+  /* Run again but with larger tcache.
+     */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[3][i], arr);
+
+  mallopt (M_TCACHE_COUNT, -1);
+
+  free (arr);
+
+  json_ctx_t json_ctx;
+
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  json_attr_object_begin (&json_ctx, "functions");
+
+  json_attr_object_begin (&json_ctx, "malloc");
+
+  char s[100];
+  double iters2 = iters;
+
+  json_attr_object_begin (&json_ctx, "");
+  json_attr_double (&json_ctx, "malloc_block_size", size);
+
+  struct rusage usage;
+  getrusage (RUSAGE_SELF, &usage);
+  json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss);
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "main_arena_st_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[0][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "main_arena_mt_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[1][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "big_tcache_mt_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[3][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "thread_arena__allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[2][i].elapsed / iters2);
+    }
+
+  json_attr_object_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+
+  json_document_end (&json_ctx);
+}
+
+static void
+usage (const char *name)
+{
+  fprintf (stderr, "%s: <size>\n", name);
+  exit (1);
+}
+
+int
+main (int argc, char **argv)
+{
+  long val = 16;
+  if (argc == 2)
+    val = strtol (argv[1], NULL, 0);
+
+  if (argc > 2 || val <= 0)
+    usage (argv[0]);
+
+  bench (val);
+
+  return 0;
+}
diff --git a/malloc/malloc.h b/malloc/malloc.h
index 339ab64c7d336873211a9057a923d87e8c1e025d..a047385a4fc8d7b3bb3e120a94440193dba306ed 100644
--- a/malloc/malloc.h
+++ b/malloc/malloc.h
@@ -121,6 +121,7 @@
 extern struct mallinfo mallinfo (void) __THROW;
 #define M_PERTURB -6
 #define M_ARENA_TEST -7
 #define M_ARENA_MAX -8
+#define M_TCACHE_COUNT -9
 
 /* General SVID/XPG interface to tunable parameters.  */
 extern int mallopt (int __param, int __val) __THROW;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 0c9e0748b4c10988f6fe99ac2e5b21b8b7b603c3..a07438d276ff4c8177552e1c0d186ee7c8bd7692 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -5177,6 +5177,11 @@ __libc_mallopt (int param_number, int value)
       if (value > 0)
 	do_set_arena_max (value);
       break;
+#if USE_TCACHE
+    case M_TCACHE_COUNT:
+      do_set_tcache_count (value >= 0 ? value : TCACHE_FILL_COUNT);
+      break;
+#endif
     }
   __libc_lock_unlock (av->mutex);
   return res;