[malloc] Avoid atomics in have_fastchunks

Message ID DB6PR0801MB2053938869AF403F0D95BC5F83600@DB6PR0801MB2053.eurprd08.prod.outlook.com
State Committed, archived

Commit Message

Wilco Dijkstra Sept. 19, 2017, 4:32 p.m. UTC
  Currently free typically uses 2 atomic operations per call.  The have_fastchunks
flag indicates whether there are recently freed blocks in the fastbins.  This
is purely an optimization to avoid calling malloc_consolidate too often and to
avoid the overhead of walking all fast bins even when they are all empty during
a sequence of allocations.  However, using atomics to update the flag is
completely unnecessary since it can be changed into a simple boolean.  There is
no change in multi-threaded behaviour given the flag is already approximate (it
may be set when there are no blocks in any fast bins, or it may be clear when
there are free blocks that could be consolidated).

Performance of malloc/free improves by 27% on a simple benchmark on AArch64
(both single and multithreaded). The number of load/store exclusive instructions is
reduced by 33%. Bench-malloc-thread speeds up by ~3% in all cases.

Passes GLIBC tests. OK to commit?

ChangeLog:
2017-09-19  Wilco Dijkstra  <wdijkstr@arm.com>

	* malloc/malloc.c (FASTCHUNKS_BIT): Remove.
	(have_fastchunks): Remove.
	(clear_fastchunks): Remove.
	(set_fastchunks): Remove.
	(malloc_state): Add have_fastchunks.
	(malloc_init_state): Use have_fastchunks.
	(do_check_malloc_state): Remove incorrect invariant checks.
	(_int_malloc): Use have_fastchunks.
	(_int_free): Likewise.
	(malloc_consolidate): Likewise.
--
  

Comments

Carlos O'Donell Sept. 19, 2017, 5:29 p.m. UTC | #1
On 09/19/2017 10:32 AM, Wilco Dijkstra wrote:
> Currently free typically uses 2 atomic operations per call.  The have_fastchunks
> flag indicates whether there are recently freed blocks in the fastbins.  This
> is purely an optimization to avoid calling malloc_consolidate too often and
> avoiding the overhead of walking all fast bins even if all are empty during a
> sequence of allocations.  However using atomics to update the flag is completely
> unnecessary since it can be changed into a simple boolean.  There is no change
> in multi-threaded behaviour given the flag is already approximate (it may be
> set when there are no blocks in any fast bins, or it may be clear when there
> are free blocks that could be consolidated).
> 
> Performance of malloc/free improves by 27% on a simple benchmark on AArch64
> (both single and multithreaded). The number of load/store exclusive instructions is
> reduced by 33%. Bench-malloc-thread speeds up by ~3% in all cases.
> 
> Passes GLIBC tests. OK to commit?
> 
> ChangeLog:
> 2017-09-19  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* malloc/malloc.c (FASTCHUNKS_BIT): Remove.
> 	(have_fastchunks): Remove.
> 	(clear_fastchunks): Remove.
> 	(set_fastchunks): Remove.
> 	(malloc_state): Add have_fastchunks.
> 	(malloc_init_state): Use have_fastchunks.
> 	(do_check_malloc_state): Remove incorrect invariant checks.
> 	(_int_malloc): Use have_fastchunks.
> 	(_int_free): Likewise.
> 	(malloc_consolidate): Likewise.

Quick review. I think we'll need a v2 before we're ready to commit, or at
least a discussion around the use of relaxed MO loads and stores.

High level:

It is great to see someone looking at the details of malloc with an
atomic-by-atomic cost analysis. I know we have looked briefly at fastbins and
the tradeoff between taking the arena lock (one atomic) and the CAS required
to put a chunk back on the fastbin list.

I'm happy to see this paying dividends for you e.g. ~3% global gains.

I like the direction this fix takes and the process you used to verify it.

Implementation level:

You use unadorned loads and stores of the variable av->have_fastchunks, and
this constitutes a data race which is undefined behaviour in C11.

For example in _int_free have_lock could be 0 and we would write to
av->have_fastchunks, while any other thread might be reading have_fastchunks.
This is a data race in C11.

Please use relaxed MO loads and stores if that is what we need.

After you add the relaxed MO loads and stores the comment for have_fastchunks
will need a little more explicit language about why the relaxed MO loads and
stores are OK from a P&C perspective.
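
For concreteness, the relaxed MO version could look roughly like this (a sketch
only, assuming glibc's internal atomic_load_relaxed/atomic_store_relaxed macros
from include/atomic.h; not the final form of the patch):

  /* _int_free: after preparing the chunk for a fastbin.  */
  atomic_store_relaxed (&av->have_fastchunks, true);

  /* _int_malloc and _int_free: decide whether consolidation is worthwhile.  */
  if (atomic_load_relaxed (&av->have_fastchunks))
    malloc_consolidate (av);

  /* malloc_consolidate: clear the hint before walking the fast bins.  */
  atomic_store_relaxed (&av->have_fastchunks, false);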

Details:

Does this patch change the number of times malloc_consolidate might
be called? Do you have any figures on this? That would be a user visible
change (and require a bug #).
  
Wilco Dijkstra Sept. 19, 2017, 6:51 p.m. UTC | #2
DJ Delorie wrote:
>
> I think the key here is to look at all accesses to that variable, and
> consider "what's the worst that could happen if the wrong value is there
> forever".  If the worst case doesn't counter the benefits, it's a net
> win.
>
> But my biggest concern is wondering how long a "wrong value" can persist
> and cause problems.

Note that the "wrong value" case exists in current GLIBC, even single-threaded.

The single-threaded case is a free that adds a chunk to the fastbin (which sets
have_fastchunks), then a malloc that reuses it (which leaves have_fastchunks
unchanged). If the next malloc is large it will call malloc_consolidate unnecessarily
when there are no actual chunks to consolidate. This is fine as have_fastchunks
is an approximation.
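
Spelled out as a tiny stand-alone program (the sizes are arbitrary examples,
and this ignores tcache, which may absorb the free before it reaches a
fastbin):

#include <stdlib.h>

int main (void)
{
  void *p = malloc (32);        /* Small request, within fastbin range.  */
  free (p);                     /* Goes into a fastbin; have_fastchunks is set.  */

  void *q = malloc (32);        /* Reuses the fastbin chunk; the flag stays set.  */

  void *r = malloc (64 * 1024); /* Large request: have_fastchunks is still set,
                                   so malloc_consolidate runs although the fast
                                   bins are already empty.  */
  free (q);
  free (r);
  return 0;
}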

However since free is lock-free, there is no ordering constraint and all 4 possible
combinations of have_fastchunks and the fastbins being empty or not are 
possible in the multithreaded case. This could happen when some threads
execute free while another does malloc, so we end up racing the multiple writes
to have_fastchunks as well as the insertion into and emptying of the fastbins.

I guess the worst-case scenario would be several blocks being added to the
fast bins while have_fastchunks remains false. Then those blocks would not be
consolidated in subsequent mallocs until a later free sets have_fastchunks,
potentially increasing fragmentation. Best-case scenario is another call to
malloc_consolidate which has no effect.

Again this is what the existing code does before my patch - doing the updates to
have_fastchunks using the most constrained memory order does not change that.

Wilco
  
Wilco Dijkstra Sept. 19, 2017, 9:11 p.m. UTC | #3
Carlos O'Donell wrote:

> It is great to see someone looking at the details of malloc at a atomic by
> atomic cost analysis. I know we have looked briefly at fastbins and the
> tradeoff between taking the arena lock (one atomic) and CAS required to put
> the fastbin back in the list.

Yes it looks like doing free lock-free works fine overall. I wonder whether
malloc can do the same for the fastbin paths since it already has to deal
with free updating the fastbins concurrently.

> You use unadorned loads and stores of the variable av->have_fastchunks, and
> this constitutes a data race which is undefined behaviour in C11.
...
> Please use relaxed MO loads and stores if that is what we need.

I'll do that.

> After you add the relaxed MO loads and stores the comment for have_fastchunks
> will need a little more explicit language about why the relaxed MO loads and
> stores are OK from a P&C perspective.

That's easy given multithreaded interleaving already allows all possible
combinations before even considering memory ordering - see my reply to
DJ for the long version...

> Does this patch change the number of times malloc_consolidate might
> be called? Do you have any figures on this? That would be a user visible
> change (and require a bug #).

The number of calls isn't fixed even today. I'll have a go at hacking the malloc
test to see how much variation there is and whether my patch changes it.


Btw what is your opinion on how to add generic single-threaded optimizations
that work for all targets? Rather than doing more target hacks, I'd like to add
something similar to what we did with stdio getc/putc, i.e. add a high-level
check for the single-threaded case that uses a different code path (with
no/relaxed atomics and no locks for the common cases).

To give an idea how much this helps, creating a dummy thread that does nothing
slows down x64 malloc/free by 2x (it has jumps that skip the 1-byte lock prefix...).
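
To make the idea concrete, here is a rough sketch of how the fastbin push in
_int_free could be split (SINGLE_THREAD_P is used purely as a placeholder for
whatever single-thread predicate we settle on; this is not a worked-out patch):

  mfastbinptr *fb = &fastbin (av, idx);

  if (SINGLE_THREAD_P)
    {
      /* No other thread can touch this arena's fastbins: plain stores.  */
      p->fd = *fb;
      *fb = p;
    }
  else
    {
      /* CAS loop as in the existing lock-free path (double-free check
         omitted for brevity).  */
      mchunkptr old = *fb, old2;
      do
        {
          p->fd = old2 = old;
        }
      while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
             != old2);
    }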

An alternative would be to move all the fastbin handling into the t-cache - but
then I bet it's much easier just to write a fast modern allocator...

Wilco
  
Wilco Dijkstra Sept. 19, 2017, 10:39 p.m. UTC | #4
DJ Delorie wrote:
> tcache is a type of fastbin already.  I tested removing fastbins since
> tcache did something similar, but it turns out that doing both tcache
> and fastbins is still faster than doing just either.

Given tcache works in a limited set of circumstances, that's not a surprise!
In particular, the default setting only allows 7 entries, which doesn't help
any workloads I've tried. Gains only appear above 100 entries, while 250-1000
looks quite good. So that's the kind of setting we need before fastbins could
be considered for removal. Also, while tcache would need consolidation support
to limit fragmentation, it would need to be less aggressive than it is today.
It appears a fair amount of the tcache gains are due to better spatial locality.

> Many have said "just replace glibc's malloc with a fast modern
> allocator" but glibc's malloc is about as fast as a modern allocator
> now, and what we're lacking is a way to *prove* a new malloc is faster
> for all the use cases glibc needs to cater to.  I posted a note about
> the trace-enabled glibc hoping that folks would start collecting the
> workloads we want to benchmark.  Having a large library of
> benchmark-able things is somewhat of a prerequisite before we worry
> about "which malloc is faster for glibc's audience".
>
> I'm not saying we should never replace glibc's malloc, I'm saying we
> need solid metrics to justify it.  Remember that glibc's allocator is
> the default allocator for all of Linux, aside from the few apps that
> have an app-specific allocator.
> 
> Thus, "I bet it's much easier" is not such a sure bet ;-)

I agree we need some benchmarks and workloads. But it's not only about
getting a few percent speedup. The current implementation is without a
doubt ancient and extremely hard to understand. There are about 100
if statements in the fast paths, 90% of which are completely redundant.
It even manages to check for the exact same special case several times in
a row (e.g. mmap).

The sort of questions my simple patch raises is another example of how
complex and nontrivial it has become. So there are very solid reasons to
look beyond the current implementation. I would find it hard to believe
there aren't multiple more advanced implementations available that not
only improve performance but have a significantly simpler and smaller
codebase as well...

Wilco
  
Carlos O'Donell Sept. 20, 2017, 1:57 a.m. UTC | #5
On 09/19/2017 03:11 PM, Wilco Dijkstra wrote:
> Carlos O'Donell wrote:
>> Does this patch change the number of times malloc_consolidate might
>> be called? Do you have any figures on this? That would be a user visible
>> change (and require a bug #).
> 
> The number of calls isn't fixed already. I'll have a go at hacking the malloc
> test to see how much variation there is and whether my patch changes it.

I'm only curious, this isn't a requirement on the submission of this patch.

> Btw what is your opinion on how to add generic single-threaded optimizations
> that work for all targets? Rather than doing more target hacks, I'd like to add
> something similar like we did with stdio getc/putc, ie. add a high-level check for
> the single-threaded case that uses a different code path (with no/relaxed atomics
> and no locks for the common cases).

I don't have an opinion on this for malloc. I haven't thought much about this kind
of optimization. I'm mostly thinking about how to speed up the multi-threaded
case.

> To give an idea how much this helps, creating a dummy thread that does nothing
> slows down x64 malloc/free by 2x (it has jumps that skip the 1-byte lock prefix...).

You might go even further down this path. If each thread has its own arena, then
we don't need to do any locking in the arena, except for the arena selection path?

> An alternative would be to move all the fastbin handling into the t-cache - but
> then I bet it's much easier just to write a fast modern allocator...

Writing a fast modern allocator is not easy :-)
  
Carlos O'Donell Sept. 20, 2017, 2:35 a.m. UTC | #6
On 09/19/2017 04:39 PM, Wilco Dijkstra wrote:
> I agree we need some benchmarks and workloads. But it's not only about
> getting a few percent speedup. The current implementation is without a
> doubt ancient and extremely hard to understand. There are about 100
> if statements in the fast paths, 90% of which are completely redundant.
> It even manages to check for the exact same special case several times in
> a row (eg. mmap).
>
> The sort of questions my simple patch raises is another example of how
> complex and nontrivial it has become. So there are very solid reasons to
> look beyond the current implementation. I would find it hard to believe
> there aren't multiple more advanced implementations available that not
> only improve performance but have a significantly simpler and smaller
> codebase as well... 

You are looking at this from the wrong way around.

Changing allocators in your application is easy. Go grab any allocator
from the internet and LD_PRELOAD it. We go to great lengths to allow this.

What matters to *us* as glibc developers is not "What modern allocator
should we change to?" but "Why should we change to a different allocator?"
Since there will always be a new allocator to pick from. In fact we want
to be able to tell users which allocator may work better for their workload!

Eventually all allocators get crufty, and glibc overall has suffered from
a lack of investment and repair of technical debt. We have turned that
trend around in many subsystems of glibc. In 2013 [1] I argued we must own 
the malloc we have, whatever that means, and turn it into what we need, and 
that can include using newer allocators. Since then we have *really* cleaned
up malloc. You should have seen it in 2013!

The key turning point is providing objective engineering proof that
another allocator is better on a given workload for some measure of better
that includes performance, VmRSS, VmSize, and security.

I will continue to push DJ's trace/simulation to get it ready for upstream
integration as another way to further our engineering rigor in this area
of system benchmarking. The trace/simulation is allocator neutral, it's just
an API trace, and we've used it to compare glibc, jemalloc, and tcmalloc
across several workloads.

In summary:

1) We need more engineering rigor and objective data driven decisions before
   we can decide which allocator will really be the best system default
   allocator.

2) In the mean time we should own the existing glibc malloc and make any 
   changes we need to improve the situation incrementally.

3) We will continue to work diligently to ensure that users can interpose
   a new application-specific allocator without issue.

Your patch today contributes to (2). Thank you! :-)
  
Markus Trippelsdorf Sept. 20, 2017, 6:53 a.m. UTC | #7
On 2017.09.19 at 17:30 -0400, DJ Delorie wrote:
> 
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> > An alternative would be to move all the fastbin handling into the
> > t-cache - but then I bet it's much easier just to write a fast modern
> > allocator...
> 
> tcache is a type of fastbin already.  I tested removing fastbins since
> tcache did something similar, but it turns out that doing both tcache
> and fastbins is still faster than doing just either.
> 
> Many have said "just replace glibc's malloc with a fast modern
> allocator" but glibc's malloc is about as fast as a modern allocator
> now, and what we're lacking is a way to *prove* a new malloc is faster
> for all the use cases glibc needs to cater to.  I posted a note about
> the trace-enabled glibc hoping that folks would start collecting the
> workloads we want to benchmark.  Having a large library of
> benchmark-able things is somewhat of a prerequisite before we worry
> about "which malloc is faster for glibc's audience".

The reason why nobody uses your trace/simulation patches is because they
generate way too much data and are just too complex/invasive for most
users. And someone would have to analyze all this data.
So it is natural to look for other metrics like code complexity instead.
  
Wilco Dijkstra Sept. 20, 2017, 12:47 p.m. UTC | #8
Markus Trippelsdorf wrote:
    
> The reason why nobody uses your trace/simulation patches is because they
> generate way too much data and are just too complex/invasive for most
> users. And someone would have to analyze all this data.
> So it is natural to look for other metrics like code complexity instead.

Indeed, I can generate hundreds of gigabytes of malloc traces in a few hours...
But what's the point? Traces are many orders of magnitude too large to share
(let alone commit), and unlike the workloads for the math functions it is difficult
to reduce a huge trace and yet remain 100% representative. 

So I think we'll need to add microbenchmarks that test various aspects of
memory allocation. Extracting small representative kernels from existing
applications/benchmarks should be easier than dealing with huge traces...
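
As one possible shape for such a kernel (a minimal sketch; the sizes, iteration
count and single-threaded loop are arbitrary choices, not a proposed benchtest):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS  1000000
#define NSLOTS 128

int main (void)
{
  void *slots[NSLOTS] = { 0 };
  struct timespec t0, t1;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < ITERS; i++)
    {
      /* Free-then-allocate within a small working set so the fastbin/tcache
         reuse paths are exercised rather than sbrk/mmap.  */
      int j = i % NSLOTS;
      free (slots[j]);
      slots[j] = malloc (16 + (i % 256));
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf ("%.1f ns per free/malloc pair\n", ns / ITERS);

  for (int j = 0; j < NSLOTS; j++)
    free (slots[j]);
  return 0;
}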

Wilco
  
Carlos O'Donell Sept. 20, 2017, 1:05 p.m. UTC | #9
On 09/20/2017 12:53 AM, Markus Trippelsdorf wrote:
> On 2017.09.19 at 17:30 -0400, DJ Delorie wrote:
>>
>> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> An alternative would be to move all the fastbin handling into the
>>> t-cache - but then I bet it's much easier just to write a fast modern
>>> allocator...
>>
>> tcache is a type of fastbin already.  I tested removing fastbins since
>> tcache did something similar, but it turns out that doing both tcache
>> and fastbins is still faster than doing just either.
>>
>> Many have said "just replace glibc's malloc with a fast modern
>> allocator" but glibc's malloc is about as fast as a modern allocator
>> now, and what we're lacking is a way to *prove* a new malloc is faster
>> for all the use cases glibc needs to cater to.  I posted a note about
>> the trace-enabled glibc hoping that folks would start collecting the
>> workloads we want to benchmark.  Having a large library of
>> benchmark-able things is somewhat of a prerequisite before we worry
>> about "which malloc is faster for glibc's audience".
> 
> The reason why nobody uses your trace/simulation patches is because they
> generate way too much data and are just too complex/invasive for most
> users. And someone would have to analyze all this data.
> So it is natural to look for other metrics like code complexity instead.
 
Yes, it is a hard problem.

We don't want anyone to use the *patches*; we want them to use a pre-compiled
instrumented glibc to gather data at the behest of glibc developers with
cooperation from the distributions.

As luck would have it, I support Fedora's glibc and so does DJ, so we have
specially compiled Fedora patches to gather this information.

The goal is to eventually get trace into upstream so we can do better analysis
of all of our APIs, and so that distros can turn this on without the effort of
patching. Users just use LD_PRELOAD with a new glibc. I don't see that as too
complex (it might be invasive at a trace cost perspective, but it's less than
any other trace infrastructure I've ever seen).

Yes, they do generate a lot of data, but we need that data, and it's just a file
that users can send which has no real information about their systems. We have
had real customers work with us to gather this data, and it's on our long list to
possibly switch to HDF5 to do compressed data as an option (after trace capture).

Lastly, users don't look at code complexity at all.

In summary:

- Users should get packages from their distro to enable trace capture.
- Files are large, but that hasn't stopped people from getting us short workloads.
- Analysis is an issue, and we continue to work on anything beyond the metrics
  the simulator provides on x86_64 (instruction counts, VmRSS, VmSize).

What else can we do to make this better? To further the engineering rigor we use?
  
Arjan van de Ven Sept. 20, 2017, 2:02 p.m. UTC | #10
On 9/19/2017 6:57 PM, Carlos O'Donell wrote:
> 
>> An alternative would be to move all the fastbin handling into the t-cache - but
>> then I bet it's much easier just to write a fast modern allocator...
> 
> Writing a fast modern allocator is not easy :-)
> 

and since glibc 2.26, glibc's malloc() is actually no longer losing against some of those
alternative allocators as well, for example see

https://www.phoronix.com/scan.php?page=news_item&px=Glibc-2.26-Redis-Test
  

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 1c2a0b05b78c84cea60ee998108180d51b1f1ddf..61bea1eb321f4abbc6a843abc49b2e5b1fef9f9c 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -1604,27 +1604,6 @@  typedef struct malloc_chunk *mfastbinptr;
 #define FASTBIN_CONSOLIDATION_THRESHOLD  (65536UL)
 
 /*
-   Since the lowest 2 bits in max_fast don't matter in size comparisons,
-   they are used as flags.
- */
-
-/*
-   FASTCHUNKS_BIT held in max_fast indicates that there are probably
-   some fastbin chunks. It is set true on entering a chunk into any
-   fastbin, and cleared only in malloc_consolidate.
-
-   The truth value is inverted so that have_fastchunks will be true
-   upon startup (since statics are zero-filled), simplifying
-   initialization checks.
- */
-
-#define FASTCHUNKS_BIT        (1U)
-
-#define have_fastchunks(M)     (((M)->flags & FASTCHUNKS_BIT) == 0)
-#define clear_fastchunks(M)    catomic_or (&(M)->flags, FASTCHUNKS_BIT)
-#define set_fastchunks(M)      catomic_and (&(M)->flags, ~FASTCHUNKS_BIT)
-
-/*
    NONCONTIGUOUS_BIT indicates that MORECORE does not return contiguous
    regions.  Otherwise, contiguity is exploited in merging together,
    when possible, results from consecutive MORECORE calls.
@@ -1672,6 +1651,16 @@  get_max_fast (void)
    ----------- Internal state representation and initialization -----------
  */
 
+/*
+   have_fastchunks indicates that there are probably some fastbin chunks.
+   It is set true on entering a chunk into any fastbin, and cleared early in
+   malloc_consolidate.  The value is approximate since it may be set when there
+   are no fastbin chunks, or it may be clear even if there are fastbin chunks
+   available.  Given its sole purpose is to reduce number of redundant calls to
+   malloc_consolidate, it does not affect correctness.
+ */
+
+
 struct malloc_state
 {
   /* Serialize access.  */
@@ -1680,6 +1669,9 @@  struct malloc_state
   /* Flags (formerly in max_fast).  */
   int flags;
 
+  /* Set if the fastbin chunks contain recently inserted free blocks.  */
+  bool have_fastchunks;
+
   /* Fastbins */
   mfastbinptr fastbinsY[NFASTBINS];
 
@@ -1823,7 +1815,7 @@  malloc_init_state (mstate av)
   set_noncontiguous (av);
   if (av == &main_arena)
     set_max_fast (DEFAULT_MXFAST);
-  av->flags |= FASTCHUNKS_BIT;
+  av->have_fastchunks = false;
 
   av->top = initial_top (av);
 }
@@ -2179,11 +2171,6 @@  do_check_malloc_state (mstate av)
         }
     }
 
-  if (total != 0)
-    assert (have_fastchunks (av));
-  else if (!have_fastchunks (av))
-    assert (total == 0);
-
   /* check normal bins */
   for (i = 1; i < NBINS; ++i)
     {
@@ -3650,7 +3637,7 @@  _int_malloc (mstate av, size_t bytes)
   else
     {
       idx = largebin_index (nb);
-      if (have_fastchunks (av))
+      if (av->have_fastchunks)
         malloc_consolidate (av);
     }
 
@@ -4058,7 +4045,7 @@  _int_malloc (mstate av, size_t bytes)
 
       /* When we are using atomic ops to free fast chunks we can get
          here for all block sizes.  */
-      else if (have_fastchunks (av))
+      else if (av->have_fastchunks)
         {
           malloc_consolidate (av);
           /* restore original bin index */
@@ -4163,7 +4150,7 @@  _int_free (mstate av, mchunkptr p, int have_lock)
 
     free_perturb (chunk2mem(p), size - 2 * SIZE_SZ);
 
-    set_fastchunks(av);
+    av->have_fastchunks = true;
     unsigned int idx = fastbin_index(size);
     fb = &fastbin (av, idx);
 
@@ -4291,7 +4278,7 @@  _int_free (mstate av, mchunkptr p, int have_lock)
     */
 
     if ((unsigned long)(size) >= FASTBIN_CONSOLIDATION_THRESHOLD) {
-      if (have_fastchunks(av))
+      if (av->have_fastchunks)
 	malloc_consolidate(av);
 
       if (av == &main_arena) {
@@ -4360,7 +4347,7 @@  static void malloc_consolidate(mstate av)
   */
 
   if (get_max_fast () != 0) {
-    clear_fastchunks(av);
+    av->have_fastchunks = false;
 
     unsorted_bin = unsorted_chunks(av);