Patchwork [v2] Add malloc micro benchmark

Submitter Wilco Dijkstra
Date Jan. 2, 2018, 6:20 p.m.
Message ID <DB6PR0801MB2053641333453CE91496266E83190@DB6PR0801MB2053.eurprd08.prod.outlook.com>
Permalink /patch/25183/
State New

Comments

Wilco Dijkstra - Jan. 2, 2018, 6:20 p.m.
Carlos O'Donell wrote:

> If you have a pattern of malloc/free of *similar* sized blocks, then
> it overflows the sized bin in the tcache, with other size bins remaining
> empty. The cache itself does not dynamically reconfigure itself to consume
> X MiB or Y % of RSS, instead it uses a simple data structure to contain
> a fixed number of fixed size blocks.
>
> Therefore I agree that enhancing the core data structure in tcache may
> result in better overall performance, particularly if we got rid of the
> fixed bin sizes and instead found a way to be performant *and* keep a
> running total of consumption.

Well, it could keep track of the sum of all block sizes and limit that. That
would be better than a fixed limit on the number of blocks in each bin.
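
Something like this (a hypothetical sketch, not existing code - tcache_put and
mchunkptr are the existing glibc names, the byte counter and limit are
invented):

/* Sketch: cap the tcache by total cached bytes rather than by a
   per-bin count.  tcache_total_bytes and TCACHE_MAX_BYTES are
   invented names, not existing glibc code.  */
static __thread size_t tcache_total_bytes;
#define TCACHE_MAX_BYTES (64 * 1024)

static bool
tcache_put_bounded (mchunkptr chunk, size_t tc_idx, size_t bytes)
{
  if (tcache_total_bytes + bytes > TCACHE_MAX_BYTES)
    return false;               /* Cache full: free to the arena instead.  */
  tcache_total_bytes += bytes;
  tcache_put (chunk, tc_idx);   /* Existing helper: push onto bin tc_idx.  */
  return true;
}

(The matching get path would subtract the block size again.)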

> Likewise *all* of malloc needs to be moved to a better data structure than
> just linked lists. I would like to see glibc's malloc offer a cacheing
> footprint of no more than Y % of RSS available, and let the user tweak that.
> Currently we just consume RSS without much regard for overhead. Though this
> is a different case than what you are talking about, the changes are
> related via data-structure enhancements that would benefit both cases IMO.

What kind of data structure do you have in mind? Small blocks could be
allocated together in pages. This would avoid the per-block overhead and
change the linked-list walk into a bitmap scan. However, there aren't many
alternatives. I've written first-fit allocators using self-balancing trees, but
walking the tree is expensive due to poor locality.
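
For reference, a minimal sketch of the bitmap idea (all names invented): a
page holds same-sized slots plus an in-use bitmap, and allocation becomes a
find-first-zero-bit scan instead of a free-list walk:

#include <stddef.h>
#include <stdint.h>

/* A small-block page: 64 fixed-size slots sharing one 64-bit in-use
   bitmap instead of per-block headers.  */
struct small_page
{
  uint64_t used;                /* Bit i is set when slot i is allocated.  */
  size_t slot_size;
  char slots[];                 /* 64 * slot_size bytes follow.  */
};

static void *
page_alloc (struct small_page *p)
{
  if (p->used == ~0ULL)
    return NULL;                             /* Page full.  */
  int i = __builtin_ffsll (~p->used) - 1;    /* First free slot.  */
  p->used |= 1ULL << i;
  return p->slots + (size_t) i * p->slot_size;
}

static void
page_free (struct small_page *p, void *ptr)
{
  size_t i = ((char *) ptr - p->slots) / p->slot_size;
  p->used &= ~(1ULL << i);
}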

>> I *wish* we could test main_arena vs. threaded arena, since they
>> have different code and behave differently e.g. sbrk vs. mmap'd
>> heap.

I've added tests of the main and thread arenas in the latest version, as
well as a test with a larger tcache.

> As I suggested in bug 15321:
> https://sourceware.org/bugzilla/show_bug.cgi?id=15321
>
> We need to merge the main_arena and threaded code together, and stop
> treating them as different things. Right now the main_arena, if you
> look at the code, is a *pretend* heap with a partial data structure
> layered in place. This needs to go away. We need to treat all heaps
> as identical, with identical code paths, with just different backing
> storage.

Yes that sounds like a good plan.

> I think people still expect that thread 0 allocates from the sbrk
> heap in a single-threaded application, and we can do that by ensuring
> sbrk is used to provide the backing store for the main thread. This way
> we can jump the pointer 64MB like we normally do for mmap'd heaps, but
> then on page touch there the kernel just extends the heap normally.

A quick benchmark shows sbrk is about 25% faster than mmap, so it
appears useful. Looking at the addresses, sbrk grows up, while mmap
grows down the address space. I think other arenas could use sbrk as
well, as long as you can free sbrk memory via MADV_FREE or
MADV_DONTNEED.
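
Measured with a trivial loop along these lines (a rough sketch, not the
actual benchmark - it only times the calls, without touching the pages):

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

static double
now (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int
main (void)
{
  enum { N = 1000, SIZE = 65536 };
  double t = now ();
  for (int i = 0; i < N; i++)
    if (sbrk (SIZE) == (void *) -1)
      return 1;
  printf ("sbrk: %g s\n", now () - t);

  t = now ();
  for (int i = 0; i < N; i++)
    if (mmap (NULL, SIZE, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED)
      return 1;
  printf ("mmap: %g s\n", now () - t);
  return 0;
}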


>> Implementation:
>>
>> You need to make this robust against env vars changing malloc
>> behaviour. You should use mallopt to change some parameters.

I've added support to change the tcache count in mallopt, so we can
benchmark different settings.

> We need to move to MADV_FREE, which was designed for memory allocators.
>
> The semantics of MADV_DONTNEED have the problem that one has to consider:
> * Is the data destructively lost in that page?
> * Is the data flushed to the underlying store before being not-needed?
> All of which lead to MADV_DONTNEED doing a lot of teardown work to ensure
> that users don't corrupt the data in their backing stores.
>
> I think that detection of MADV_FREE, and usage, would help performance,
> but only on > Linux 4.5, and that might be OK for you.

Well, we should detect when MADV_FREE is supported and use that.
If not, tweak the size threshold for MADV_DONTNEED - perhaps basing it
on a percentage of RSS so we limit the overhead.
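
Detection can be done lazily on first use - a sketch, assuming the Linux
behaviour that old kernels reject MADV_FREE with EINVAL:

#include <errno.h>
#include <sys/mman.h>

#ifndef MADV_FREE
# define MADV_FREE 8            /* Linux value; missing from older headers.  */
#endif

/* Prefer MADV_FREE (Linux >= 4.5), falling back to MADV_DONTNEED on
   kernels that do not support it.  */
static int
return_pages (void *addr, size_t len)
{
  static int have_madv_free = 1;
  if (have_madv_free)
    {
      if (madvise (addr, len, MADV_FREE) == 0)
        return 0;
      if (errno == EINVAL)
        have_madv_free = 0;     /* Old kernel: remember and fall back.  */
    }
  return madvise (addr, len, MADV_DONTNEED);
}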


Anyway, here is the new version:


Add a malloc micro benchmark to enable accurate testing of the
various paths in malloc and free.  The benchmark does a varying
number of allocations of a given block size, then frees them again.

It tests 4 different scenarios: single-threaded using main arena,
multi-threaded using thread-arena, main arena with SINGLE_THREAD_P
false and main arena with the tcache count set larger than the
default.  To enable this, add support for M_TCACHE_COUNT in mallopt.

OK for commit?

ChangeLog:
2018-01-02  Wilco Dijkstra  <wdijkstr@arm.com>

	* benchtests/Makefile: Add malloc-simple benchmark.
	* benchtests/bench-malloc-simple.c: New benchmark.
	* malloc/malloc.h (M_TCACHE_COUNT): Add new define.
	* malloc/malloc.c (__libc_mallopt): Handle M_TCACHE_COUNT.

--
Wilco Dijkstra - Jan. 3, 2018, 12:12 p.m.
DJ Delorie wrote:

> What other tests do is create a second test that just #include's the
> first test, and set an environment variable in the Makefile specific to
> that test.  Adding an ABI just for a test is a big hammer, although we
> could discuss adding tcache to mallopt() as a separate topic.

Yeah but the makefiles are already insanely complex. Adding the new
benchmark to the makefile took more than 10x as much time as writing
the test itself...

> I don't have any objection to adding tcache to mallopt (although please
> add all three tunables if you do), just saying we should discuss it as
> an ABI change separately.

It doesn't have to be an external ABI, I'd suggest keeping this internal to
GLIBC to make testing and benchmarking easier.

Wilco
Carlos O'Donell - Jan. 3, 2018, 3:07 p.m.
On 01/03/2018 04:12 AM, Wilco Dijkstra wrote:
> DJ Delorie wrote:
> 
>> What other tests do is create a second test that just #include's the
>> first test, and set an environment variable in the Makefile specific to
>> that test.  Adding an ABI just for a test is a big hammer, although we
>> could discuss adding tcache to mallopt() as a separate topic.
> 
> Yeah but the makefiles are already insanely complex. Adding the new
> benchmark to the makefile took more than 10x as much time as writing
> the test itself...
> 
>> I don't have any objection to adding tcache to mallopt (although please
>> add all three tunables if you do), just saying we should discuss it as
>> an ABI change separately.
> 
> It doesn't have to be an external ABI, I'd suggest keeping this internal to
> GLIBC to make testing and benchmarking easier.

Don't use mallopt, please make it a tunable then.

The mallopt API already had 2 secret arena options which eventually became
so well used they were baked into the API and had to be made public.

At least with tunables we are allowed to deprecate them.
Wilco Dijkstra - Jan. 4, 2018, 1:48 p.m.
Carlos O'Donell wrote:
>
> Don't use mallopt, please make it a tunable then.
> 
> The mallopt API already had 2 secret arena options which eventually became
> so well used they were baked into the API and had to be made public.

Unfortunately tunables are not exported so you can't use them outside of GLIBC:

/build/glibc/benchtests/bench-malloc-simple.o: In function `bench':
bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val'
collect2: error: ld returned 1 exit status

Wilco
Adhemerval Zanella Netto - Jan. 4, 2018, 4:37 p.m.
On 04/01/2018 11:48, Wilco Dijkstra wrote:
> Carlos O'Donell wrote:
>>
>> Don't use mallopt, please make it a tunable then.
>>
>> The mallopt API already had 2 secret arena options which eventually became
>> so well used they were baked into the API and had to be made public.
> 
> Unfortunately tunables are not exported so you can't use them outside of GLIBC:
> 
> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench':
> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val'
> collect2: error: ld returned 1 exit status
> 
> Wilco
> 

You will need to use the environment variable to set it outside GLIBC (as
expected for normal programs).
Carlos O'Donell - Jan. 5, 2018, 2:32 p.m.
On 01/04/2018 05:48 AM, Wilco Dijkstra wrote:
> Carlos O'Donell wrote:
>>
>> Don't use mallopt, please make it a tunable then.
>>
>> The mallopt API already had 2 secret arena options which eventually became
>> so well used they were baked into the API and had to be made public.
> 
> Unfortunately tunables are not exported so you can't use them outside of GLIBC:
> 
> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench':
> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val'
> collect2: error: ld returned 1 exit status

Correct, we only have an env-var frontend right now, and the internal API is not
made accessible via GLIBC_PRIVATE.

You have 3 options for tests:

* Use the env vars to adjust test behaviour. Run the tests multiple times.
* Add a new C API frontend, very valuable, but more time consuming.
* Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw
  it away later when we get a proper C API frontend.
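
(With the first option the benchmark itself stays tunable-free; assuming the
tcache count keeps its current tunable name, the Makefile would run it as
e.g. GLIBC_TUNABLES=glibc.malloc.tcache_count=100 bench-malloc-simple.)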
Carlos O'Donell - Jan. 5, 2018, 2:33 p.m.
On 01/02/2018 10:20 AM, Wilco Dijkstra wrote:
> Carlos O'Donell wrote:
> 
>> If you have a pattern of malloc/free of *similar* sized blocks, then
>> it overflows the sized bin in the tcache, with other size bins remaining
>> empty. The cache itself does not dynamically reconfigure itself to consume
>> X MiB or Y % of RSS, instead it uses a simple data structure to contain
>> a fixed number of fixed size blocks.
>>
>> Therefore I agree that enhancing the core data structure in tcache may
>> result in better overall performance, particularly if we got rid of the
>> fixed bin sizes and instead found a way to be performant *and* keep a
>> running total of consumption.
> 
> Well, it could keep track of the sum of all block sizes and limit that. That
> would be better than a fixed limit on the number of blocks in each bin.

Yes, in practice what we want to enforce is % of total RSS, or fixed X MiB of
RSS in thread caches, and we want that value to be a per-thread variable that can
be changed for each thread depending on the thread's workload.

e.g.

Thread 1, 2, and 3 each need 5% of RSS for their workloads.

Threads 4, and 5, need 25% of RSS for their workloads.

Obviously, if RSS is very low, then our 32MiB/64MiB starting heaps may be
unusable, but they could also be tuned; it's a decision that can only be made
once at startup, because we use this embedded assumption in the pointer
arithmetic that finds the arena structure address. It could be changed, but
this would increase costs.

>> Likewise *all* of malloc needs to be moved to a better data structure than
>> just linked lists. I would like to see glibc's malloc offer a cacheing
>> footprint of no more than Y % of RSS available, and let the user tweak that.
>> Currently we just consume RSS without much regard for overhead. Though this
>> is a different case than what you are talking about, the changes are
>> related via data-structure enhancements that would benefit both cases IMO.
> 
> What kind of data structure do you have in mind? Small blocks could be
> allocated together in pages. This would avoid the per-block overhead and
> change the linked-list walk into a bitmap scan. However, there aren't many
> alternatives. I've written first-fit allocators using self-balancing trees, but
> walking the tree is expensive due to poor locality.

Keep in mind we are deviating now from the topic at hand, but I'll lay out a
few improvements, one of which you already touched upon.

(1) Small blocks cost too much RSS.

I have had customers hit terrible corner cases with C++ and small blocks which
see ~50% waste due to colocated metadata, e.g. new of 13-byte objects.

Getting smaller allocations working together for the small blocks would be a big win,
and using some kind of bitmap would be my suggested solution. This requires a data
structure to track the parent allocation, bitmaps, and sub-allocations. We have some
of these data structures in place, but no easy generic way to use them (I think Florian's
pushing of more general data structure use in glibc is the right way to go e.g. dynarray).

I think that for blocks smaller than the fundamental language types (which require
malloc to have 16-byte alignment) we do not have to return sufficiently aligned
memory. For example if you allocate a 3-byte block or a 13-byte block, you cannot
possibly put a 16-byte long double there, nor can you use that for a stack block,
so it's a waste to guarantee alignment.

* Group small allocations.
* Violate the ABI and use < MALLOC_ALIGNMENT sized alignments for sub-group members.

(2) Track total memory used and free back based on more consistent heuristics.

Again, we have customers who complain that glibc's malloc is bad for container
or VM workloads with tight packing because it just keeps allocating more heaps.

We could do better to track free page runs, and when we exceed some %RSS
threshold, free back larger runs to the OS, something which malloc_trim()
already does but in a more costly manner by walking the whole arena heaps from
start to end.

This isn't explicitly about performance.

(3) Make arenas truly a per-cpu data structure.

We do not want per-thread arenas.

We want per-thread caches (avoiding locks).

We want per-cpu arenas (avoiding NUMA issues and getting better locality).

The per-thread cache does the job of caching the thread-local requests
and avoiding locks.

The per-cpu arenas do the job of containing cpu-local memory, and handling
requests for the threads that reside on that CPU, either pinned, or there
temporarily.

Memory being returned should go back to the cpu that the memory came from.

The best we have done today is to have the arenas scale with the number of
CPUs, and let them fall where they may across the cores, and let the
NUMA page scheduler move the memory around them to minimize cost.

The best way would be to use something like restartable sequences to
get the CPU #, select the arena, and get memory from there, and then
perhaps choose another arena next time.
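
In its simplest form, something like this - a sketch using sched_getcpu () as
a stand-in for a real rseq-based fast path, with a hypothetical arena table:

#include <sched.h>

#define NARENAS 16                      /* Assume one arena per CPU, capped.  */

struct malloc_state;                    /* Hypothetical arena type and table.  */
extern struct malloc_state *arenas[NARENAS];

/* Pick the arena belonging to the current CPU.  A real implementation
   would use restartable sequences so that reading the CPU id and
   allocating are atomic with respect to migration; sched_getcpu ()
   here is only an approximation.  */
static struct malloc_state *
arena_for_this_cpu (void)
{
  int cpu = sched_getcpu ();
  if (cpu < 0)
    cpu = 0;                            /* Fallback if unsupported.  */
  return arenas[cpu % NARENAS];
}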

(4) Distribute threads among arenas.

We have no data structure for balancing arena usage across threads.

We have seen workloads where the maximum number of arenas is allocated,
but threads don't move smoothly across arenas; instead the first unlocked
arena is chosen, and that might not be the best from a memory-locality
perspective (see (3)).

> Add a malloc micro benchmark to enable accurate testing of the
> various paths in malloc and free.  The benchmark does a varying
> number of allocations of a given block size, then frees them again.
> 
> It tests 4 different scenarios: single-threaded using main arena,
> multi-threaded using thread-arena, main arena with SINGLE_THREAD_P
> false and main arena with the tcache count set larger than the
> default.  To enable this, add support for M_TCACHE_COUNT in mallopt.
> 
> OK for commit?
> 
> ChangeLog:
> 2018-01-02  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* benchtests/Makefile: Add malloc-simple benchmark.
> 	* benchtests/bench-malloc-simple.c: New benchmark.
> 	* malloc/malloc.h (M_TCACHE_COUNT): Add new define.
> 	* malloc/malloc.c (__libc_mallopt): Handle M_TCACHE_COUNT.

I don't like the creation of a new public ABI via M_TCACHE_COUNT through mallopt.

I suggest splitting these tests apart and using the tunable env var to touch this.

See my other email.
Adhemerval Zanella Netto - Jan. 5, 2018, 3:50 p.m.
On 05/01/2018 12:32, Carlos O'Donell wrote:
> On 01/04/2018 05:48 AM, Wilco Dijkstra wrote:
>> Carlos O'Donell wrote:
>>>
>>> Don't use mallopt, please make it a tunable then.
>>>
>>> The mallopt API already had 2 secret arena options which eventually became
>>> so well used they were baked into the API and had to be made public.
>>
>> Unfortunately tunables are not exported so you can't use them outside of GLIBC:
>>
>> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench':
>> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val'
>> collect2: error: ld returned 1 exit status
> 
> Correct, we only have an env-var frontend right now, and the internal API is not
> made accessible via GLIBC_PRIVATE.
> 
> You have 3 options for tests:
> 
> * Use the env vars to adjust test behaviour. Run the tests multiple times.
> * Add a new C API frontend, very valuable, but more time consuming.
> * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw
>   it away later when we get a proper C API frontend.
> 

Do we want a C API that ties the malloc implementation to some tunables? My
understanding is that the tunables API is not really meant to enforce
backwards compatibility (where a C API would enforce it).
Carlos O'Donell - Jan. 5, 2018, 4:17 p.m.
On 01/05/2018 07:50 AM, Adhemerval Zanella wrote:
> 
> 
> On 05/01/2018 12:32, Carlos O'Donell wrote:
>> On 01/04/2018 05:48 AM, Wilco Dijkstra wrote:
>>> Carlos O'Donell wrote:
>>>>
>>>> Don't use mallopt, please make it a tunable then.
>>>>
>>>> The mallopt API already had 2 secret arena options which eventually became
>>>> so well used they were baked into the API and had to be made public.
>>>
>>> Unfortunately tunables are not exported so you can't use them outside of GLIBC:
>>>
>>> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench':
>>> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val'
>>> collect2: error: ld returned 1 exit status
>>
>> Correct, we only have an env-var frontend right now, and the internal API is not
>> made accessible via GLIBC_PRIVATE.
>>
>> You have 3 options for tests:
>>
>> * Use the env vars to adjust test behaviour. Run the tests multiple times.
>> * Add a new C API frontend, very valuable, but more time consuming.
>> * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw
>>   it away later when we get a proper C API frontend.
>>
> 
> Do we want a C API that ties the malloc implementation to some tunables? My
> understanding is that the tunables API is not really meant to enforce
> backwards compatibility (where a C API would enforce it).
 
If we add a C API to the tunables, we would honour that API for tunables for
all time, but the tunables themselves would not be stable.

e.g.

* get list of tunables supported
* get the default value for a tunable
* get the value of a tunable
* set the value of a tunable

So you would use this API in the tests to get the tunable list, assert the
tcache tunable was accepted (or fail the test), and then set it to a special
value for the part of the test that needs it.
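
That is, roughly this shape - entirely hypothetical, no such API exists today:

/* Hypothetical tunables frontend; names and signatures are only
   illustrative.  */
#include <stdbool.h>
#include <stddef.h>

size_t tunables_list (const char **names, size_t max);  /* Enumerate names.  */
bool tunable_get_default (const char *name, long *value);
bool tunable_get (const char *name, long *value);
bool tunable_set (const char *name, long value);

A test would then do something like:

  if (!tunable_set ("glibc.malloc.tcache_count", 100))
    FAIL_EXIT1 ("tcache tunable not supported");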
Joseph Myers - Jan. 5, 2018, 4:28 p.m.
On Fri, 5 Jan 2018, Carlos O'Donell wrote:

> I think that for blocks smaller than the fundamental language types 
> (which require malloc to have 16-byte alignment) we do not have to 
> return sufficiently aligned memory. For example if you allocate a 3-byte 
> block or a 13-byte block, you cannot possibly put a 16-byte long double 
> there, nor can you use that for a stack block, so it's a waste to 
> guarantee alignment.

As per DR#075, the memory needs to be aligned for any type of object (with 
a fundamental alignment requirement, in C11 and later), not just those 
that will fit in the block.  (This in turn allows for applications using 
low bits for tagged pointers.)

This does not of course rule out having another allocation API that 
supports smaller alignment requirements.
Adhemerval Zanella Netto - Jan. 5, 2018, 4:46 p.m.
On 05/01/2018 14:17, Carlos O'Donell wrote:
> On 01/05/2018 07:50 AM, Adhemerval Zanella wrote:
>>
>>
>> On 05/01/2018 12:32, Carlos O'Donell wrote:
>>> On 01/04/2018 05:48 AM, Wilco Dijkstra wrote:
>>>> Carlos O'Donell wrote:
>>>>>
>>>>> Don't use mallopt, please make it a tunable then.
>>>>>
>>>>> The mallopt API already had 2 secret arena options which eventually became
>>>>> so well used they were baked into the API and had to be made public.
>>>>
>>>> Unfortunately tunables are not exported so you can't use them outside of GLIBC:
>>>>
>>>> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench':
>>>> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val'
>>>> collect2: error: ld returned 1 exit status
>>>
>>> Correct, we only have an env-var frontend right now, and the internal API is not
>>> made accessible via GLIBC_PRIVATE.
>>>
>>> You have 3 options for tests:
>>>
>>> * Use the env vars to adjust test behaviour. Run the tests multiple times.
>>> * Add a new C API frontend, very valuable, but more time consuming.
>>> * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw
>>>   it away later when we get a proper C API frontend.
>>>
>>
>> Do we want a C API that ties the malloc implementation to some tunables? My
>> understanding is that the tunables API is not really meant to enforce
>> backwards compatibility (where a C API would enforce it).
>  
> If we add a C API to the tunables, we would honour that API for tunables for
> all time, but the tunables themselves would not be stable.
> 
> e.g.
> 
> * get list of tunables supported
> * get the default value for a tunable
> * get the value of a tunable
> * set the value of a tunable
> 
> So you would use this API in the tests to get the tunable list, assert the
> tcache tunable was accepted (or fail the test), and then set it to a special
> value for the part of the test that needs it.

Right, this seems a reasonable approach (although I think it is out of scope
for this change).
Carlos O'Donell - Jan. 5, 2018, 5:26 p.m.
On 01/05/2018 08:28 AM, Joseph Myers wrote:
> On Fri, 5 Jan 2018, Carlos O'Donell wrote:
> 
>> I think that for blocks smaller than the fundamental language types 
>> (which require malloc to have 16-byte alignment) we do not have to 
>> return sufficiently aligned memory. For example if you allocate a 3-byte 
>> block or a 13-byte block, you cannot possibly put a 16-byte long double 
>> there, nor can you use that for a stack block, so it's a waste to 
>> guarantee alignment.
> 
> As per DR#075, the memory needs to be aligned for any type of object (with 
> a fundamental alignment requirement, in C11 and later), not just those 
> that will fit in the block.  (This in turn allows for applications using 
> low bits for tagged pointers.)

Thanks for the reference to DR#075, I had not considered the cast equality
issue.

> This does not of course rule out having another allocation API that 
> supports smaller alignment requirements.
Agreed.

It would still be a win if we did not have co-located metadata (something
Florian whispered into my ear years ago now) for small constant-sized blocks.

We would go from this:

N * 1-byte allocations => N * (32-byte header 
                               + 1-byte allocation
                               + 15-bytes alignment)
                          [97% constant waste]

To this:

N * 1-byte allocations => N * (1-byte allocation
                               + 15-bytes alignment) 
                          + (N/8)-bytes in-use-bit + 16-bytes header
                          [96% waste for 1-byte]
                          [94% waste for 100*1-byte]
                          ... towards a 93.75% constant waste (limit of the alignment e.g. 15/16)

This is a gain of roughly 4% RSS efficiency for a structural change.

For a 13-byte allocation:

N * 13-byte allocations => N * (32-byte header
                                + 13-byte allocation
                                + 3-bytes alignment)
                           [73% constant waste]

To this:

N * 13-byte allocations => N * (13-byte allocation
                                + 3-bytes alignment)
                           + (N/8)-bytes in-use-bit + 16-bytes header
                           [60% waste for 13-bytes]
                           [20% waste for 100*13-bytes]
                           [19% waste for 1000*13-bytes]
                           ... towards an 18.75% constant waste (limit of the alignment e.g. 3/16)

Note: We never quite reach the constant limit because the in-use bit array
also grows linearly with N.
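
These percentages are easy to re-derive - a throwaway check, hard-coding the
32-byte colocated header, 16-byte alignment and (N/8)-byte bitmap figures
used above (valid for sizes up to 16 bytes, which round up to one unit):

#include <stdio.h>

static void
waste (int n, int size)
{
  double used = (double) n * size;
  double colocated = n * (32 + 16);          /* Header + aligned block.  */
  double grouped = n * 16 + n / 8.0 + 16;    /* Slots + bitmap + header.  */
  printf ("N=%4d size=%2d: colocated %4.1f%%  grouped %4.1f%%\n", n, size,
          100.0 * (colocated - used) / colocated,
          100.0 * (grouped - used) / grouped);
}

int
main (void)
{
  waste (1, 1);
  waste (100, 1);
  waste (1, 13);
  waste (100, 13);
  waste (1000, 13);
  return 0;
}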
Carlos O'Donell - Jan. 5, 2018, 5:27 p.m.
On 01/05/2018 08:46 AM, Adhemerval Zanella wrote:
> 
> 
> On 05/01/2018 14:17, Carlos O'Donell wrote:
>> On 01/05/2018 07:50 AM, Adhemerval Zanella wrote:
>>>
>>>
>>> On 05/01/2018 12:32, Carlos O'Donell wrote:
>>>> On 01/04/2018 05:48 AM, Wilco Dijkstra wrote:
>>>>> Carlos O'Donell wrote:
>>>>>>
>>>>>> Don't use mallopt, please make it a tunable then.
>>>>>>
>>>>>> The mallopt API already had 2 secret arena options which eventually became
>>>>>> so well used they were baked into the API and had to be made public.
>>>>>
>>>>> Unfortunately tunables are not exported so you can't use them outside of GLIBC:
>>>>>
>>>>> /build/glibc/benchtests/bench-malloc-simple.o: In function `bench':
>>>>> bench-malloc-simple.c:(.text+0x19c): undefined reference to `__tunable_set_val'
>>>>> collect2: error: ld returned 1 exit status
>>>>
>>>> Correct, we only have an env-var frontend right now, and the internal API is not
>>>> made accessible via GLIBC_PRIVATE.
>>>>
>>>> You have 3 options for tests:
>>>>
>>>> * Use the env vars to adjust test behaviour. Run the tests multiple times.
>>>> * Add a new C API frontend, very valuable, but more time consuming.
>>>> * Expose the existing internal C API via GLIBC_PRIVATE for testing, and throw
>>>>   it away later when we get a proper C API frontend.
>>>>
>>>
>>> Do we want a C API that ties the malloc implementation to some tunables? My
>>> understanding is that the tunables API is not really meant to enforce
>>> backwards compatibility (where a C API would enforce it).
>>  
>> If we add a C API to the tunables, we would honour that API for tunables for
>> all time, but the tunables themselves would not be stable.
>>
>> e.g.
>>
>> * get list of tunables supported
>> * get the default value for a tunable
>> * get the value of a tunable
>> * set the value of a tunable
>>
>> So you would use this API in the tests to get the tunable list, assert the
>> tcache tunable was accepted (or fail the test), and then set it to a special
>> value for the part of the test that needs it.
> 
> Right, this seems a reasonable approach (although I think it is out of scope
> for this change).
 
That is up to Wilco to decide, but in general I agree that he need not take on
this work to get the current patch set merged; there are other solutions to the
need to tweak the settings. I think the env-var and multiple-test-run approach
is going to be the simplest.

Patch

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 74b3821ccfea6912e68578ad2598d68a9e38223c..5052bbbfe79f6d5a0b16c427dfc4807271805e61 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -90,7 +90,7 @@  CFLAGS-bench-trunc.c += -fno-builtin
 CFLAGS-bench-truncf.c += -fno-builtin
 
 ifeq (${BENCHSET},)
-bench-malloc := malloc-thread
+bench-malloc := malloc-thread malloc-simple
 else
 bench-malloc := $(filter malloc-%,${BENCHSET})
 endif
@@ -98,7 +98,7 @@  endif
 $(addprefix $(objpfx)bench-,$(bench-math)): $(libm)
 $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm)
 $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library)
-$(objpfx)bench-malloc-thread: $(shared-thread-library)
+$(addprefix $(objpfx)bench-,$(bench-malloc)): $(shared-thread-library)
 
 
 
@@ -165,7 +165,7 @@  bench-clean:
 ifneq ($(strip ${BENCHSET}),)
 VALIDBENCHSETNAMES := bench-pthread bench-math bench-string string-benchset \
    wcsmbs-benchset stdlib-benchset stdio-common-benchset math-benchset \
-   malloc-thread
+   malloc-thread malloc-simple
 INVALIDBENCHSETNAMES := $(filter-out ${VALIDBENCHSETNAMES},${BENCHSET})
 ifneq (${INVALIDBENCHSETNAMES},)
 $(info The following values in BENCHSET are invalid: ${INVALIDBENCHSETNAMES})
@@ -201,10 +201,18 @@  bench-set: $(binaries-benchset)
 
 bench-malloc: $(binaries-bench-malloc)
 	for run in $^; do \
+	  echo "$${run}"; \
+	  if [ `basename $${run}` = "bench-malloc-thread" ]; then \
 		for thr in 1 8 16 32; do \
 			echo "Running $${run} $${thr}"; \
-	  $(run-bench) $${thr} > $${run}-$${thr}.out; \
-	  done;\
+			$(run-bench) $${thr} > $${run}-$${thr}.out; \
+		done;\
+	  else \
+		for thr in 8 16 32 64 128 256 512 1024 2048 4096; do \
+		  echo "Running $${run} $${thr}"; \
+		  $(run-bench) $${thr} > $${run}-$${thr}.out; \
+		done;\
+	  fi;\
 	done
 
 # Build and execute the benchmark functions.  This target generates JSON
diff --git a/benchtests/bench-malloc-simple.c b/benchtests/bench-malloc-simple.c
new file mode 100644
index 0000000000000000000000000000000000000000..151c38de50c5e747e05d69c717452241a47d7d22
--- /dev/null
+++ b/benchtests/bench-malloc-simple.c
@@ -0,0 +1,201 @@ 
+/* Benchmark malloc and free functions.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <malloc.h>
+#include <sys/resource.h>
+#include "bench-timing.h"
+#include "json-lib.h"
+
+#define NUM_ITERS 1000000
+#define NUM_ALLOCS 4
+#define MAX_ALLOCS 1000
+
+typedef struct
+{
+  size_t iters;
+  size_t size;
+  int n;
+  timing_t elapsed;
+} malloc_args;
+
+static void
+do_benchmark (malloc_args *args, int **arr)
+{
+  timing_t start, stop;
+  size_t iters = args->iters;
+  size_t size = args->size;
+  int n = args->n;
+
+  TIMING_NOW (start);
+
+  for (int j = 0; j < iters; j++)
+    {
+      for (int i = 0; i < n; i++)
+	arr[i] = malloc (size);
+
+      for (int i = 0; i < n; i++)
+	free (arr[i]);
+    }
+
+  TIMING_NOW (stop);
+
+  TIMING_DIFF (args->elapsed, start, stop);
+}
+
+static malloc_args tests[4][NUM_ALLOCS];
+static int allocs[NUM_ALLOCS] = { 25, 100, 400, MAX_ALLOCS };
+
+static void *
+thread_test (void *p)
+{
+  int **arr = (int**)p;
+
+  /* Run benchmark multi-threaded.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[2][i], arr);
+
+  return p;
+}
+
+void
+bench (unsigned long size)
+{
+  size_t iters = NUM_ITERS;
+  int **arr = (int**) malloc (MAX_ALLOCS * sizeof (void*));
+  unsigned long res;
+
+  TIMING_INIT (res);
+  (void) res;
+
+  /* Set tcache count to default.  */
+  mallopt (M_TCACHE_COUNT, -1);
+
+  for (int t = 0; t <= 3; t++)
+    for (int i = 0; i < NUM_ALLOCS; i++)
+      {
+	tests[t][i].n = allocs[i];
+	tests[t][i].size = size;
+	tests[t][i].iters = iters / allocs[i];
+
+	/* Do a quick warmup run.  */
+	if (t == 0)
+	  do_benchmark (&tests[0][i], arr);
+      }
+
+  /* Run benchmark single threaded in main_arena.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[0][i], arr);
+
+  /* Run benchmark in a thread_arena.  */
+  pthread_t t;
+  pthread_create (&t, NULL, thread_test, (void*)arr);
+  pthread_join (t, NULL);
+
+  /* Repeat benchmark in main_arena with SINGLE_THREAD_P == false.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[1][i], arr);
+
+  /* Increase size of tcache.  */
+  mallopt (M_TCACHE_COUNT, 100);
+
+  /* Run again but with larger tcache.  */
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    do_benchmark (&tests[3][i], arr);
+
+  mallopt (M_TCACHE_COUNT, -1);
+
+  free (arr);
+
+  json_ctx_t json_ctx;
+
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  json_attr_object_begin (&json_ctx, "functions");
+
+  json_attr_object_begin (&json_ctx, "malloc");
+
+  char s[100];
+  double iters2 = iters;
+
+  json_attr_object_begin (&json_ctx, "");
+  json_attr_double (&json_ctx, "malloc_block_size", size);
+
+  struct rusage usage;
+  getrusage (RUSAGE_SELF, &usage);
+  json_attr_double (&json_ctx, "max_rss", usage.ru_maxrss);
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "main_arena_st_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[0][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "main_arena_mt_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[1][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "big_tcache_mt_allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[3][i].elapsed / iters2);
+    }
+
+  for (int i = 0; i < NUM_ALLOCS; i++)
+    {
+      sprintf (s, "thread_arena__allocs_%04d_time", allocs[i]);
+      json_attr_double (&json_ctx, s, tests[2][i].elapsed / iters2);
+    }
+
+  json_attr_object_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+
+  json_document_end (&json_ctx);
+}
+
+static void usage (const char *name)
+{
+  fprintf (stderr, "%s: <alloc_size>\n", name);
+  exit (1);
+}
+
+int
+main (int argc, char **argv)
+{
+  long val = 16;
+  if (argc == 2)
+    val = strtol (argv[1], NULL, 0);
+
+  if (argc > 2 || val <= 0)
+    usage (argv[0]);
+
+  bench (val);
+
+  return 0;
+}
diff --git a/malloc/malloc.h b/malloc/malloc.h
index 339ab64c7d336873211a9057a923d87e8c1e025d..a047385a4fc8d7b3bb3e120a94440193dba306ed 100644
--- a/malloc/malloc.h
+++ b/malloc/malloc.h
@@ -121,6 +121,7 @@  extern struct mallinfo mallinfo (void) __THROW;
 #define M_PERTURB           -6
 #define M_ARENA_TEST        -7
 #define M_ARENA_MAX         -8
+#define M_TCACHE_COUNT	    -9
 
 /* General SVID/XPG interface to tunable parameters. */
 extern int mallopt (int __param, int __val) __THROW;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 0c9e0748b4c10988f6fe99ac2e5b21b8b7b603c3..a07438d276ff4c8177552e1c0d186ee7c8bd7692 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -5177,6 +5177,11 @@  __libc_mallopt (int param_number, int value)
       if (value > 0)
 	do_set_arena_max (value);
       break;
+#if USE_TCACHE
+    case M_TCACHE_COUNT:
+      do_set_tcache_count (value >= 0 ? value : TCACHE_FILL_COUNT);
+      break;
+#endif
     }
   __libc_lock_unlock (av->mutex);
   return res;