malloc: Use correct C11 atomics for fastbin

Message ID PAWPR08MB8982002BD1E5CE0E74F4F19B830A9@PAWPR08MB8982.eurprd08.prod.outlook.com
State New
Series malloc: Use correct C11 atomics for fastbin

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686

Commit Message

Wilco Dijkstra Nov. 21, 2022, 4:09 p.m. UTC
  Fix memory ordering issues in the fastbin implementation: in REMOVE_FB the
next pointer is read before any MO synchronization, however in the C11 atomic
model this is only correct after a load acquire. Refactor the fastbin code
and add a dedicated fastbin_push/pop implementation. The number of acquire
or release atomics remains the same, and the new functions are inlined, so
performance is unaffected.

Passes regress, OK for commit?

---
  

Comments

Florian Weimer Nov. 21, 2022, 4:18 p.m. UTC | #1
* Wilco Dijkstra:

> Fix memory ordering issues in the fastbin implementation: in REMOVE_FB the
> next pointer is read before any MO synchronization, however in the C11 atomic
> model this is only correct after a load acquire. Refactor the fastbin code
> and add a dedicated fastbin_push/pop implementation. The number of acquire
> or release atomics remains the same, and the new functions are inlined, so
> performance is unaffected.
>
> Passes regress, OK for commit?

Did you actually observe any problems resulting from this?  We
previously concluded that the dependency chain would make this valid
(outside the C11 memory model).

Thanks,
Florian
  
Wilco Dijkstra Nov. 21, 2022, 4:55 p.m. UTC | #2
Hi Florian,

>> Fix memory ordering issues in the fastbin implementation: in REMOVE_FB the
>> next pointer is read before any MO synchronization, however in the C11 atomic
>> model this is only correct after a load acquire. Refactor the fastbin code
>> and add a dedicated fastbin_push/pop implementation. The number of acquire
>> or release atomics remains the same, and the new functions are inlined, so
>> performance is unaffected.
>>
>> Passes regress, OK for commit?
>
> Did you actually observe any problems resulting from this?  We
> previously concluded that the dependency chain would make this valid
> (outside the C11 memory model).

No, I believe it works in most weak memory models (except for Alpha).
However the code always looked weird with the acquire done after several
non-atomic accesses... The key is that we need to do the acquire/release
only once per iteration in the CAS loop, and that avoids regressions.

Cheers,
Wilco
  
Florian Weimer Nov. 21, 2022, 5 p.m. UTC | #4
* Wilco Dijkstra:

> Hi Florian,
>
>>> Fix memory ordering issues in the fastbin implementation: in REMOVE_FB the
>>> next pointer is read before any MO synchronization, however in the C11 atomic
>>> model this is only correct after a load acquire. Refactor the fastbin code
>>> and add a dedicated fastbin_push/pop implementation. The number of acquire
>>> or release atomics remains the same, and the new functions are inlined, so
>>> performance is unaffected.
>>>
>>> Passes regress, OK for commit?
>>
>> Did you actually observe any problems resulting from this?  We
>> previously concluded that the dependency chain would make this valid
>> (outside the C11 memory model).
>
> No, I believe it works in most weak memory models (except for Alpha).
> However the code always looked weird with the acquire done after several
> non-atomic accesses... The key is that we need to do the acquire/release
> only once per iteration in the CAS loop, and that avoids regressions.

Sounds good, but I haven't looked at the changes in detail.

Maybe DJ can review it?

Thanks,
Florian
  
DJ Delorie Dec. 2, 2022, 5:11 a.m. UTC | #5
Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
> +/* Atomically pop from the fastbin list.  The arena lock must be held to
> +   block other threads removing entries, avoiding the ABA issue.  */

If the arena lock must be held anyway, why go through the compare and
exchange overhead?  We know we're the only thread accessing it.
  
Florian Weimer Dec. 2, 2022, 6:36 a.m. UTC | #6
* DJ Delorie:

> Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
>> +/* Atomically pop from the fastbin list.  The arena lock must be held to
>> +   block other threads removing entries, avoiding the ABA issue.  */
>
> If the arena lock must be held anyway, why go through the compare and
> exchange overhead?  We know we're the only thread accessing it.

Other threads are adding entries without the arena lock.

Thanks,
Florian
  
Wilco Dijkstra Dec. 2, 2022, 10:56 a.m. UTC | #7
Hi,

>> Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
>>> +/* Atomically pop from the fastbin list.  The arena lock must be held to
>>> +   block other threads removing entries, avoiding the ABA issue.  */
>>
>> If the arena lock must be held anyway, why go through the compare and
>> exchange overhead?  We know we're the only thread accessing it.
>
> Other threads are adding entries without the arena lock.

Yes, malloc is blocking but free isn't and accesses the freelist concurrently.
It's a really weird design. Splitting the free list into a local one and a shared
one would be far better - no atomics when you have the malloc lock, and if
the local free list is empty it takes one atomic to copy all shared entries.

Cheers,
Wilco
  
Florian Weimer Dec. 2, 2022, 11:24 a.m. UTC | #8
* Wilco Dijkstra:

> Hi,
>
>>> Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
>>>> +/* Atomically pop from the fastbin list.  The arena lock must be held to
>>>> +   block other threads removing entries, avoiding the ABA issue.  */
>>>
>>> If the arena lock must be held anyway, why go through the compare and
>>> exchange overhead?  We know we're the only thread accessing it.
>>
>> Other threads are adding entries without the arena lock.
>
> Yes, malloc is blocking but free isn't and accesses the freelist
> concurrently.  It's a really weird design. Splitting the free list
> into a local one and a shared one would be far better - no atomics
> when you have the malloc lock, and if the local free list is empty it
> takes one atomic to copy all shared entries.

The local free list is in the tcache.  I think DJ said that removing the
fastbins still resulted in a performance loss.

The tcache also has the benefit that the chain length is bounded.

Thanks,
Florian
  
Wilco Dijkstra Dec. 2, 2022, 12:02 p.m. UTC | #9
Hi Florian,

>> Yes, malloc is blocking but free isn't and accesses the freelist
>> concurrently.  It's a really weird design. Splitting the free list
>> into a local one and a shared one would be far better - no atomics
>> when you have the malloc lock, and if the local free list is empty it
>> takes one atomic to copy all shared entries.
>
> The local free list is in the tcache.  I think DJ said that removing the
> fastbins still resulted in a performance loss.
> 
> The tcache has also the benefit that the chain length is bounded.

That's because tcache has hardly any effect on programs with a high rate
of (de)allocations. It's so small that you usually end up using the fastbin
code anyway. All the singlethreaded optimizations in fastbins are absolutely
essential - that's the code that shows up in profiles.

If we want to make tcache actually work, it will have to support far more
allocations, particularly for smaller sizes.

Cheers,
Wilco
  
DJ Delorie Dec. 2, 2022, 6:55 p.m. UTC | #10
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> If we want to make tcache actually work, it will have to support far more
> allocations, particularly for smaller sizes.

You can test that with a tunable; the max count per bin is runtime
tunable.
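For instance, the per-bin limit can be raised for a single run via the glibc.malloc.tcache_count tunable (the program name here is just a placeholder):

```shell
# Raise the tcache per-bin chunk limit (default 7) for one invocation.
GLIBC_TUNABLES=glibc.malloc.tcache_count=32 ls /tmp > /dev/null
```

glibc.malloc.tcache_max (largest cached chunk size) can be adjusted the same way.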

But yeah, the point of tcache is to have a few of many sizes for fast
allocations.  Fastbins has a lot of a few small sizes.  Testing showed
that both were required for best performance with the average
application.
  
DJ Delorie Dec. 2, 2022, 6:55 p.m. UTC | #11
Florian Weimer <fweimer@redhat.com> writes:
>> Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
>>> +/* Atomically pop from the fastbin list.  The arena lock must be held to
>>> +   block other threads removing entries, avoiding the ABA issue.  */
>>
>> If the arena lock must be held anyway, why go through the compare and
>> exchange overhead?  We know we're the only thread accessing it.
>
> Other threads are adding entries without the arena lock.

Ah.  That should be in the comment then.
  
Zack Weinberg Dec. 5, 2022, 6:39 p.m. UTC | #12
On 2022-12-02 1:55 PM, DJ Delorie via Libc-alpha wrote:
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>> If we want to make tcache actually work, it will have to support far more
>> allocations, particularly for smaller sizes.
> 
> You can test that with a tunable; the max count per bin is runtime
> tunable.
> 
> But yeah, the point of tcache is to have a few of many sizes for fast
> allocations.  Fastbins has a lot of a few small sizes.  Testing showed
> that both were required for best performance with the average
> application.

Every time we start talking about fastbins vs tcache again I start 
wondering, again, what's stopping us from replacing the entire malloc 
implementation with jemalloc, or any other implementation designed less 
than 20 years ago.

zw
  
Wilco Dijkstra Dec. 6, 2022, 1:29 p.m. UTC | #13
Hi,

On 2022-12-02 1:55 PM, DJ Delorie via Libc-alpha wrote:
>> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> If we want to make tcache actually work, it will have to support far more
>>> allocations, particularly for smaller sizes.
>> 
>> You can test that with a tunable; the max count per bin is runtime
>> tunable.

Yes, you can use tunables, but how many applications actually use more
optimized settings? It's the default that is the problem.

>> But yeah, the point of tcache is to have a few of many sizes for fast
>> allocations.  Fastbins has a lot of a few small sizes.  Testing showed
>> that both were required for best performance with the average
>> application.

Yes, that's due to the tiny size of tcache (let's call it tiny-cache!). Once exhausted,
you mostly end up using the fastbins.

> Every time we start talking about fastbins vs tcache again I start 
> wondering, again, what's stopping us from replacing the entire malloc 
> implementation with jemalloc, or any other implementation designed less 
> than 20 years ago.

I can't see any technical reason why not. It's either that or completely rewriting
the current implementation and getting rid of decades of accumulated cruft...

Modern allocators are not only much faster than GLIBC in their default settings
but also have lower memory usage. The two best allocators seem to be mimalloc
and jemalloc.

Cheers,
Wilco
  
Adhemerval Zanella Dec. 6, 2022, 1:37 p.m. UTC | #14
On 06/12/22 10:29, Wilco Dijkstra via Libc-alpha wrote:
> Hi,
> 
> On 2022-12-02 1:55 PM, DJ Delorie via Libc-alpha wrote:
>>> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>>> If we want to make tcache actually work, it will have to support far more
>>>> allocations, particularly for smaller sizes.
>>>
>>> You can test that with a tunable; the max count per bin is runtime
>>> tunable.
> 
> Yes you can use tunables, but how many applications actually use more
> optimized settings? It's the default that is the problem.
> 
>>> But yeah, the point of tcache is to have a few of many sizes for fast
>>> allocations.  Fastbins has a lot of a few small sizes.  Testing showed
>>> that both were required for best performance with the average
>>> application.
> 
> Yes, that's due to the tiny size of tcache (let's call it tiny-cache!). Once exhausted,
> you mostly end up using the fastbins.
> 
>> Every time we start talking about fastbins vs tcache again I start 
>> wondering, again, what's stopping us from replacing the entire malloc 
>> implementation with jemalloc, or any other implementation designed less 
>> than 20 years ago.
> 
> I can't see any technical reason why not. It's either that or completely rewriting
> the current implementation and getting rid of decades of accumulated cruft...
> 
> Modern allocators are not only much faster than GLIBC in their default settings
> but also have lower memory usage. The two best allocators seem to be mimalloc
> and jemalloc.


Do they have compatible license so we can just import/adapt the code?
  
Zack Weinberg Dec. 6, 2022, 2:31 p.m. UTC | #15
On Tue, Dec 6, 2022, at 8:37 AM, Adhemerval Zanella Netto via Libc-alpha wrote:
>>> Every time we start talking about fastbins vs tcache again I start 
>>> wondering, again, what's stopping us from replacing the entire malloc 
>>> implementation with jemalloc, or any other implementation designed less 
>>> than 20 years ago.
>> 
>> I can't see any technical reason why not. It's either that or completely rewriting
>> the current implementation and getting rid of decades of accumulated cruft...
>> 
>> Modern allocators are not only much faster than GLIBC in their default settings
>> but also have lower memory usage. The two best allocators seem to be mimalloc
>> and jemalloc.
>
>
> Do they have compatible license so we can just import/adapt the code?

Per https://github.com/jemalloc/jemalloc/blob/dev/COPYING jemalloc is 2-clause BSD.  I don't know where to find mimalloc off the top of my head.

zw
  
DJ Delorie Dec. 6, 2022, 4:19 p.m. UTC | #16
Zack Weinberg via Libc-alpha <libc-alpha@sourceware.org> writes:
> Every time we start talking about fastbins vs tcache again I start 
> wondering, again, what's stopping us from replacing the entire malloc 
> implementation with jemalloc,

There's a couple of things, but they're technical details, not political
ones:

* There's a copy of a "mini-malloc" in the dynamic loader that needs to
  be somewhat compatible with the full copy in libc.so so that data
  allocated in the loader can be used later.

* The current malloc knows a lot about glibc internals like pthreads and
  fork, so as to manage those boundaries properly (thread start/stop,
  etc).

* Benchmarks have shown that the various mallocs each have their strong
  points and weak points, so expect some complaints about performance
  loss when switching mallocs.  Your favorite allocator may perform
  poorly in someone else's favorite application.

* Testing, testing, testing.  Libc's malloc is used in all Linux distros
  *and* Hurd, which means it's known to work with tens if not hundreds of
  thousands of programs.  To propose a new system-wide allocator, you
  need to replicate this level of effective testing.

> or any other implementation designed less than 20 years ago.

Comments like this are not appreciated.  Age does not imply quality.
Please stick to technical points.
  
DJ Delorie Dec. 6, 2022, 4:23 p.m. UTC | #17
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> You can test that with a tunable; the max count per bin is runtime
>>> tunable.
>
> Yes you can use tunables, but how many applications actually use more
> optimized settings? It's the default that is the problem.

If testing shows that a larger default makes sense across a wide range
of applications, then we can change the default.  I'm a bit opposed to
changing anything because "it makes sense".  Back it up with data.

> Yes, that's due the tiny size of tcache (let's call it
> tiny-cache!). Once exhausted, you mostly end up using the fastbins.

tcache is wide but shallow, fastbins are narrow but deep (tcache caches
much larger chunks than fastbins do).  Benchmarks show they both provide
a speed boost separately, and more so when combined.

> Modern allocators are not only much faster than GLIBC in their default
> settings

Back this up with data please.  Tcache invalidated most online
benchmarks.

> but also have lower memory usage.

This is a project we've got on our to-do list.
  

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 2a61c8b5ee38f33c42c72f3c5ad441e26ed0e701..a3e7053d2db81cce211dd8c3cff43af0f59a1b71 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -1786,6 +1786,73 @@  get_max_fast (void)
   return global_max_fast;
 }
 
+
+/* Atomically push onto the fastbin list.  Return the previous head of the
+   list; it may only be dereferenced while the arena lock is held.  */
+static __always_inline mchunkptr
+fastbin_push_chunk (mchunkptr *fastbin, mchunkptr p)
+{
+  mchunkptr head = atomic_load_relaxed (fastbin);
+
+  /* Check that the top of the bin is not the record we are going to
+     add (i.e., double free).  */
+  if (__builtin_expect (head == p, 0))
+    malloc_printerr ("double free or corruption (fasttop)");
+
+  if (SINGLE_THREAD_P)
+    {
+      p->fd = PROTECT_PTR (&p->fd, head);
+      *fastbin = p;
+      return head;
+    }
+
+  /* Atomically update the fastbin head.  Use release MO to synchronize
+     with acquire reads in fastbin_pop_chunk and malloc_consolidate (this
+     ensures the next pointer update is valid).  */
+  do
+    {
+      p->fd = PROTECT_PTR (&p->fd, head);
+    }
+  while (!atomic_compare_exchange_weak_release (fastbin, &head, p));
+
+  return head;
+}
+
+
+/* Atomically pop from the fastbin list.  The arena lock must be held to
+   block other threads removing entries, avoiding the ABA issue.  Note
+   that frees may push entries concurrently without holding the lock,
+   which is why a CAS is still required here.  */
+static __always_inline mchunkptr
+fastbin_pop_chunk (mchunkptr *fastbin)
+{
+  mchunkptr head, tail;
+
+  if (SINGLE_THREAD_P)
+    {
+      head = *fastbin;
+      if (head == NULL)
+	return head;
+      if (__glibc_unlikely (misaligned_chunk (head)))
+	malloc_printerr ("malloc(): unaligned fastbin chunk detected");
+      *fastbin = REVEAL_PTR (head->fd);
+      return head;
+    }
+
+  do
+    {
+      /* Synchronize with release MO CAS in fastbin_push_chunk - this ensures
+	 the next pointer is valid.  */
+      head = atomic_load_acquire (fastbin);
+      if (head == NULL)
+	return head;
+      if (__glibc_unlikely (misaligned_chunk (head)))
+	malloc_printerr ("malloc(): unaligned fastbin chunk detected");
+      tail = REVEAL_PTR (head->fd);
+    }
+  while (!atomic_compare_exchange_weak_relaxed (fastbin, &head, tail));
+
+  return head;
+}
+
 /*
    ----------- Internal state representation and initialization -----------
  */
@@ -3798,71 +3865,37 @@  _int_malloc (mstate av, size_t bytes)
      can try it without checking, which saves some time on this fast path.
    */
 
-#define REMOVE_FB(fb, victim, pp)			\
-  do							\
-    {							\
-      victim = pp;					\
-      if (victim == NULL)				\
-	break;						\
-      pp = REVEAL_PTR (victim->fd);                                     \
-      if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
-	malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
-    }							\
-  while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
-	 != victim);					\
-
   if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
     {
       idx = fastbin_index (nb);
       mfastbinptr *fb = &fastbin (av, idx);
-      mchunkptr pp;
-      victim = *fb;
+      victim = fastbin_pop_chunk (fb);
 
       if (victim != NULL)
 	{
-	  if (__glibc_unlikely (misaligned_chunk (victim)))
-	    malloc_printerr ("malloc(): unaligned fastbin chunk detected 2");
-
-	  if (SINGLE_THREAD_P)
-	    *fb = REVEAL_PTR (victim->fd);
-	  else
-	    REMOVE_FB (fb, pp, victim);
-	  if (__glibc_likely (victim != NULL))
-	    {
-	      size_t victim_idx = fastbin_index (chunksize (victim));
-	      if (__builtin_expect (victim_idx != idx, 0))
-		malloc_printerr ("malloc(): memory corruption (fast)");
-	      check_remalloced_chunk (av, victim, nb);
+	  size_t victim_idx = fastbin_index (chunksize (victim));
+	  if (__builtin_expect (victim_idx != idx, 0))
+	    malloc_printerr ("malloc(): memory corruption (fast)");
+	  check_remalloced_chunk (av, victim, nb);
 #if USE_TCACHE
-	      /* While we're here, if we see other chunks of the same size,
-		 stash them in the tcache.  */
-	      size_t tc_idx = csize2tidx (nb);
-	      if (tcache && tc_idx < mp_.tcache_bins)
+	  /* While we're here, if we see other chunks of the same size,
+	     stash them in the tcache.  */
+	  size_t tc_idx = csize2tidx (nb);
+	  if (tcache && tc_idx < mp_.tcache_bins)
+	    {
+	      /* While bin not empty and tcache not full, copy chunks.  */
+	      while (tcache->counts[tc_idx] < mp_.tcache_count)
 		{
-		  mchunkptr tc_victim;
-
-		  /* While bin not empty and tcache not full, copy chunks.  */
-		  while (tcache->counts[tc_idx] < mp_.tcache_count
-			 && (tc_victim = *fb) != NULL)
-		    {
-		      if (__glibc_unlikely (misaligned_chunk (tc_victim)))
-			malloc_printerr ("malloc(): unaligned fastbin chunk detected 3");
-		      if (SINGLE_THREAD_P)
-			*fb = REVEAL_PTR (tc_victim->fd);
-		      else
-			{
-			  REMOVE_FB (fb, pp, tc_victim);
-			  if (__glibc_unlikely (tc_victim == NULL))
-			    break;
-			}
-		      tcache_put (tc_victim, tc_idx);
-		    }
+		  mchunkptr tc_victim = fastbin_pop_chunk (fb);
+		  if (tc_victim == NULL)
+		    break;
+		  tcache_put (tc_victim, tc_idx);
 		}
-#endif
-	      void *p = chunk2mem (victim);
-	      alloc_perturb (p, bytes);
-	      return p;
 	    }
+#endif
+	  void *p = chunk2mem (victim);
+	  alloc_perturb (p, bytes);
+	  return p;
 	}
     }
 
@@ -4505,29 +4538,7 @@  _int_free (mstate av, mchunkptr p, int have_lock)
     fb = &fastbin (av, idx);
 
     /* Atomically link P to its fastbin: P->FD = *FB; *FB = P;  */
-    mchunkptr old = *fb, old2;
-
-    if (SINGLE_THREAD_P)
-      {
-	/* Check that the top of the bin is not the record we are going to
-	   add (i.e., double free).  */
-	if (__builtin_expect (old == p, 0))
-	  malloc_printerr ("double free or corruption (fasttop)");
-	p->fd = PROTECT_PTR (&p->fd, old);
-	*fb = p;
-      }
-    else
-      do
-	{
-	  /* Check that the top of the bin is not the record we are going to
-	     add (i.e., double free).  */
-	  if (__builtin_expect (old == p, 0))
-	    malloc_printerr ("double free or corruption (fasttop)");
-	  old2 = old;
-	  p->fd = PROTECT_PTR (&p->fd, old);
-	}
-      while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
-	     != old2);
+    mchunkptr old = fastbin_push_chunk (fb, p);
 
     /* Check that size of fastbin chunk at the top is the same as
        size of the chunk that we are adding.  We can dereference OLD
@@ -4718,6 +4729,7 @@  static void malloc_consolidate(mstate av)
   maxfb = &fastbin (av, NFASTBINS - 1);
   fb = &fastbin (av, 0);
   do {
+    /* Synchronize with release MO CAS in fastbin_push_chunk.  */
     p = atomic_exchange_acquire (fb, NULL);
     if (p != 0) {
       do {