[v6,3/4] Reduce CAS in malloc spinlocks

Message ID 20211111162428.2286605-4-hjl.tools@gmail.com
State Changes Requested, archived
Series Optimize CAS [BZ #28537]

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent

Commit Message

H.J. Lu Nov. 11, 2021, 4:24 p.m. UTC
  Do an atomic load first and check whether the compare would fail.  If
it would, skip the CAS and spin instead, to reduce cache line bouncing
on contended locks.
---
 malloc/arena.c  |  5 +++++
 malloc/malloc.c | 10 ++++++++++
 2 files changed, 15 insertions(+)
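
The check being added is essentially the load-before-CAS
(test-and-test-and-set) idiom: read the location with a cheap relaxed
load first, and only attempt the CAS when the compare can still
succeed.  A rough stand-alone illustration in plain C11 <stdatomic.h>
(the names are made up for the example; this is not glibc code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_size_t counter;

/* Try to bump COUNTER from EXPECTED to EXPECTED + 1; return true on
   success.  The relaxed pre-read keeps the cache line in a shared
   state when the value has already moved on, instead of taking it
   exclusive for a CAS that is bound to fail.  */
static bool
try_increment (size_t expected)
{
  if (atomic_load_explicit (&counter, memory_order_relaxed) != expected)
    return false;               /* CAS would fail; let the caller spin.  */
  return atomic_compare_exchange_strong_explicit (&counter, &expected,
                                                  expected + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed);
}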
  

Comments

DJ Delorie Feb. 23, 2023, 5:48 a.m. UTC | #1
Sorry for letting this one slip...

"H.J. Lu via Libc-alpha" <libc-alpha@sourceware.org> writes:
>        size_t n = narenas;
>        if (__glibc_unlikely (n <= narenas_limit - 1))
>          {
> +          if (atomic_load_relaxed (&narenas) != n)
> +           {
> +              atomic_spin_nop ();
> +              goto repeat;
> +           }
>            if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
>              goto repeat;

I understand that a congested spinloop will benefit from this kind of
change, but... we JUST loaded narenas into n, and adding arenas is rare.
We probably should have loaded it atomically, but still, we just loaded
it.  The odds of malloc being so congested that we miss the CAS are
essentially (but of course not exactly) zero.  Are we just adding an
unneeded atomic read here?  Do any benchmarks say this would be
beneficial?
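
A microbenchmark that could answer that is straightforward to sketch.
Illustrative only -- not part of the patch and not one of glibc's
benchtests: a few threads hammer one counter through a CAS loop, with
and without the relaxed pre-read, built with -pthread and timed
externally (e.g. with time(1)):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS 8
#define ITERS   1000000

static atomic_ulong counter;
static int use_preread;

static void *
worker (void *arg)
{
  (void) arg;
  for (long i = 0; i < ITERS; i++)
    {
      unsigned long old = atomic_load_explicit (&counter,
                                                memory_order_relaxed);
      for (;;)
        {
          if (use_preread
              && atomic_load_explicit (&counter,
                                       memory_order_relaxed) != old)
            {
              /* The CAS would fail: refresh OLD and retry without
                 taking the cache line exclusive.  */
              old = atomic_load_explicit (&counter, memory_order_relaxed);
              continue;
            }
          /* On failure the weak CAS writes the observed value into OLD.  */
          if (atomic_compare_exchange_weak_explicit (&counter, &old, old + 1,
                                                     memory_order_acquire,
                                                     memory_order_relaxed))
            break;
        }
    }
  return NULL;
}

int
main (int argc, char **argv)
{
  use_preread = argc > 1 && atoi (argv[1]) != 0;
  pthread_t t[THREADS];
  for (int i = 0; i < THREADS; i++)
    pthread_create (&t[i], NULL, worker, NULL);
  for (int i = 0; i < THREADS; i++)
    pthread_join (t[i], NULL);
  printf ("preread=%d counter=%lu\n", use_preread,
          (unsigned long) atomic_load (&counter));
  return 0;
}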

Also, the malloc code is already complicated enough.  Are the extra
lines of code and slight reduction in readability justified?

Also, we've been migrating to C11-like atomics; would this patch need
changing for that?
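
For illustration, and assuming the C11-like helpers that already exist
in include/atomic.h (atomic_load_relaxed,
atomic_compare_exchange_weak_acquire), the narenas bump might end up
looking roughly like the untested sketch below.  Since the weak CAS
writes the value it observed back into its "expected" argument on
failure, each retry already knows the current count and a separate
pre-read is less obviously needed:

/* Untested sketch only, not the patch: a C11-style CAS loop for the
   narenas bump.  On failure the CAS updates N with the observed value,
   so the next iteration re-checks against fresh data.  */
size_t n = atomic_load_relaxed (&narenas);
while (__glibc_unlikely (n <= narenas_limit - 1)
       && !atomic_compare_exchange_weak_acquire (&narenas, &n, n + 1))
  atomic_spin_nop ();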

Should target-specific atomics optimizations be "hidden" somewhere in
the atomics implementation?  Just because x86 may benefit from a
pre-read doesn't mean that all targets will, and if x86 generally
benefits, it should update its implementation of the atomics to do that
at a lower level.

>            a = _int_new_arena (size);
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 095d97a3be..403ffb84ef 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -3717,6 +3717,11 @@ _int_malloc (mstate av, size_t bytes)
>        pp = REVEAL_PTR (victim->fd);                                     \
>        if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
>  	malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
> +      if (atomic_load_relaxed (fb) != victim)		\
> +	{						\
> +	  atomic_spin_nop ();				\
> +	  continue;					\
> +	}						\
>      }							\
>    while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
>  	 != victim);					\
> @@ -4435,6 +4440,11 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>  	    malloc_printerr ("double free or corruption (fasttop)");
>  	  old2 = old;
>  	  p->fd = PROTECT_PTR (&p->fd, old);
> +	  if (atomic_load_relaxed (fb) != old2)
> +	    {
> +	      atomic_spin_nop ();
> +	      continue;
> +	    }
>  	}
>        while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
>  	     != old2);

Likewise here; these paths are less rare, but still not so common that
I'd expect a benefit from the extra code.
  

Patch

diff --git a/malloc/arena.c b/malloc/arena.c
index 78ef4cf18c..e7fbe7c183 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -899,6 +899,11 @@  arena_get2 (size_t size, mstate avoid_arena)
          enough address space to create that many arenas.  */
       if (__glibc_unlikely (n <= narenas_limit - 1))
         {
+          if (atomic_load_relaxed (&narenas) != n)
+           {
+              atomic_spin_nop ();
+              goto repeat;
+           }
           if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
             goto repeat;
           a = _int_new_arena (size);
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 095d97a3be..403ffb84ef 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3717,6 +3717,11 @@  _int_malloc (mstate av, size_t bytes)
       pp = REVEAL_PTR (victim->fd);                                     \
       if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
 	malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
+      if (atomic_load_relaxed (fb) != victim)		\
+	{						\
+	  atomic_spin_nop ();				\
+	  continue;					\
+	}						\
     }							\
   while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
 	 != victim);					\
@@ -4435,6 +4440,11 @@  _int_free (mstate av, mchunkptr p, int have_lock)
 	    malloc_printerr ("double free or corruption (fasttop)");
 	  old2 = old;
 	  p->fd = PROTECT_PTR (&p->fd, old);
+	  if (atomic_load_relaxed (fb) != old2)
+	    {
+	      atomic_spin_nop ();
+	      continue;
+	    }
 	}
       while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
 	     != old2);