[1/2] Single thread optimization for malloc atomics

Message ID 53610133.3070908@linux.vnet.ibm.com
State Rejected

Commit Message

Adhemerval Zanella Netto April 30, 2014, 1:57 p.m. UTC
This patch adds a single-thread optimization for malloc atomic usage: first
check whether the process is single-threaded (ST) and, if so, use normal
load/store instead of atomic instructions.

It is a respin of my initial attempt to add the ST optimization on PowerPC.
Now the malloc optimization is arch-agnostic and relies on macros already
defined in GLIBC (SINGLE_THREAD_P).  x86 would probably like an atomic.h
refactor to avoid the double check for '__local_multiple_threads', but that
is not the aim of this patch.

Tested on powerpc32, powerpc64, and x86_64.

--

	* malloc/malloc.c (MALLOC_ATOMIC_OR): New macro: optimize atomic for
	single thread.
	(MALLOC_ATOMIC_AND): Likewise.
	(MALLOC_ATOMIC_CAS_VAL_ACQ): Likewise.
	(MALLOC_ATOMIC_CAS_VAL_REL): Likewise.
	(clear_fastchunks): Use optimized single thread macro.
	(set_fastchunks): Likewise.
	(_int_malloc): Likewise.
	(_int_free): Likewise.

---
  

Comments

Ondrej Bilka April 30, 2014, 2:18 p.m. UTC | #1
On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> This patch adds a single-thread optimization for malloc atomic usage to
> first check if process is single-thread (ST) and if so use normal
> load/store instead of atomic instructions.
> 
How fast is TLS on POWER? When we add a per-thread cache, as I suggested,
it would most of the time have the same performance as single-thread, with
the overhead of one TLS variable access per malloc call.
  
Adhemerval Zanella Netto April 30, 2014, 2:25 p.m. UTC | #2
On 30-04-2014 11:18, Ondřej Bílka wrote:
> On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
>> This patch adds a single-thread optimization for malloc atomic usage to
>> first check if process is single-thread (ST) and if so use normal
>> load/store instead of atomic instructions.
>>
> How fast is tls on power? When we add a per-thread cache as I suggested
> then it would have most of time same performance as singlethread, with
> overhead one tls variable access per malloc call.
>
The question is subtle: compared to what?  Other memory allocators with
per-thread caches show good performance using TLS.

Anyway, *now* this lock is an issue that has been circumvented in an
arch-specific way (at least for x86_64).  I know that is not the best way
to fix all the current malloc issues, but it is something we can optimize
for single-thread with minor code impact.
  
Ondrej Bilka April 30, 2014, 3:32 p.m. UTC | #3
On Wed, Apr 30, 2014 at 11:25:36AM -0300, Adhemerval Zanella wrote:
> On 30-04-2014 11:18, Ondřej Bílka wrote:
> > On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> >> This patch adds a single-thread optimization for malloc atomic usage to
> >> first check if process is single-thread (ST) and if so use normal
> >> load/store instead of atomic instructions.
> >>
> > How fast is tls on power? When we add a per-thread cache as I suggested
> > then it would have most of time same performance as singlethread, with
> > overhead one tls variable access per malloc call.
> >
> The question is subtle, you mean compared to what?  Others memory allocators
> that per-thread cache shows good performance using TLS.
> 
> Anyway, *now* this lock is an issue that has being circumvented in a
> arch-specific way (at least for x86_64).  I know that is not the best way
> to fix all current malloc issues, but it something we can optimize for
> single-thread that has minor code impacts.

Compared to the single-thread case.  At the start of malloc you could just
get a pointer to the per-thread structure, and the rest of the code would be
identical to the single-thread one, operating on that structure.
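
For illustration, a minimal sketch of that shape in C; the thread_cache
structure and the cache_init/cache_alloc/cached_malloc names are invented
for this example, not existing glibc code:

#include <stdlib.h>

struct thread_cache
{
  void *bins[64];   /* per-size-class free lists, refilled from the arena */
};

/* Initial-exec TLS: the offset from the thread pointer is fixed at load
   time, so the access is a couple of ordinary instructions.  */
static __thread struct thread_cache *cache
  __attribute__ ((tls_model ("initial-exec")));

/* Hypothetical helpers, stubbed out here so the sketch compiles.  */
static struct thread_cache *
cache_init (void)
{
  return calloc (1, sizeof (struct thread_cache));
}

static void *
cache_alloc (struct thread_cache *tc, size_t bytes)
{
  (void) tc;
  (void) bytes;
  return NULL;   /* stub: always miss and fall back */
}

void *
cached_malloc (size_t bytes)
{
  /* One TLS access per call; everything below is plain sequential code
     operating on the caller's own cache, just like the single-thread
     path, with no atomics or locks.  */
  struct thread_cache *tc = cache;
  if (tc == NULL)
    tc = cache = cache_init ();
  void *p = cache_alloc (tc, bytes);
  return p != NULL ? p : malloc (bytes);   /* fall back to the shared arena */
}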
  
Rich Felker April 30, 2014, 4:06 p.m. UTC | #4
On Wed, Apr 30, 2014 at 04:18:45PM +0200, Ondřej Bílka wrote:
> On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> > This patch adds a single-thread optimization for malloc atomic usage to
> > first check if process is single-thread (ST) and if so use normal
> > load/store instead of atomic instructions.
> > 
> How fast is tls on power? When we add a per-thread cache as I suggested
> then it would have most of time same performance as singlethread, with
> overhead one tls variable access per malloc call.

Extremely fast: the TLS address is simply kept in a general-purpose
register.

Rich
  
Steven Munroe April 30, 2014, 8:12 p.m. UTC | #5
On Wed, 2014-04-30 at 12:06 -0400, Rich Felker wrote:
> On Wed, Apr 30, 2014 at 04:18:45PM +0200, Ondřej Bílka wrote:
> > On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> > > This patch adds a single-thread optimization for malloc atomic usage to
> > > first check if process is single-thread (ST) and if so use normal
> > > load/store instead of atomic instructions.
> > > 
> > How fast is tls on power? When we add a per-thread cache as I suggested
> > then it would have most of time same performance as singlethread, with
> > overhead one tls variable access per malloc call.
> 
> Extremely fast: the TLS address is simply kept in a general-purpose
> register.
> 
Depends on the TLS access model.

General Dynamic TLS Model requires a dynamic up-call to __tls_get_addr().
So slow.

If you can get to the Local Exec or Initial Exec form (where the DTV
slot or TLS offset can be known at static link time) it can be a simple
inline computation.

As we are talking about a dynamic library (libc.so) here, you have to
set this up carefully.
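
For reference, a short illustration of the three models at the C level; the
variable names are invented, and which model the compiler actually picks
depends on -fPIC, visibility, and whether the object ends up in an
executable or a DSO (the attributes below force a particular model):

/* General-dynamic (the default for -fPIC code in a shared library):
   an access may go through a call to __tls_get_addr.  */
__thread int gd_counter;

/* Initial-exec: the offset from the thread pointer is fixed when the DSO
   is loaded, so an access is a short inline sequence.  libc.so can use
   this because it is always loaded at startup.  */
__thread int ie_counter __attribute__ ((tls_model ("initial-exec")));

/* Local-exec: offset known at static link time; only valid for TLS
   defined in the main executable.  */
__thread int le_counter __attribute__ ((tls_model ("local-exec")));

int
bump (void)
{
  return ++ie_counter;   /* cheap: thread-pointer-relative load/store */
}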
  
Ondrej Bilka April 30, 2014, 10:04 p.m. UTC | #6
On Wed, Apr 30, 2014 at 03:12:45PM -0500, Steven Munroe wrote:
> On Wed, 2014-04-30 at 12:06 -0400, Rich Felker wrote:
> > On Wed, Apr 30, 2014 at 04:18:45PM +0200, Ondřej Bílka wrote:
> > > On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> > > > This patch adds a single-thread optimization for malloc atomic usage to
> > > > first check if process is single-thread (ST) and if so use normal
> > > > load/store instead of atomic instructions.
> > > > 
> > > How fast is tls on power? When we add a per-thread cache as I suggested
> > > then it would have most of time same performance as singlethread, with
> > > overhead one tls variable access per malloc call.
> > 
> > Extremely fast: the TLS address is simply kept in a general-purpose
> > register.
> > 
> Depends on the TLS access model.
> 
> General Dynamic TLS Model requires a dynamic up-call to __tls_get_addr().
> So slow.
> 
> If you can get to the Local Exec or Initial Exec form (where the DTV
> slot or TLS offset can be known at static link time) it can be a simple
> inline computation.
> 
> As we are talking about a dynamic library (libc.so) here, you have to
> set this up carefully.

In malloc we already use initial-exec; see the tsd_getspecific macro in
malloc/arena.c.

By the way, general TLS slowness is caused mostly by an inefficient
implementation.  If you do not mind adding a pointer variable p in IE form
to each binary, then you could emulate TLS by referencing p plus two array
lookups (except for the first access, which triggers a branch that calls
something.)
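
A rough standalone sketch of that emulation scheme; all the names
(tls_table, emulated_tls_get, lazy_init_module) are invented for this
example, with tls_table playing the role of the per-binary IE pointer p:

#include <stdlib.h>

#define MAX_MODULES 64
#define MAX_SLOTS   128

/* The single initial-exec pointer added to each binary.  */
static __thread void ***tls_table
  __attribute__ ((tls_model ("initial-exec")));

static void **lazy_init_module (size_t module);

static inline void *
emulated_tls_get (size_t module, size_t slot)
{
  void **mod = tls_table ? tls_table[module] : NULL;
  if (__builtin_expect (mod == NULL, 0))
    mod = lazy_init_module (module);   /* first access only */
  return mod[slot];                    /* two array lookups on the fast path */
}

/* The "calls something" slow path: allocate the per-thread tables.  */
static void **
lazy_init_module (size_t module)
{
  if (tls_table == NULL)
    tls_table = calloc (MAX_MODULES, sizeof *tls_table);
  if (tls_table[module] == NULL)
    tls_table[module] = calloc (MAX_SLOTS, sizeof (void *));
  return tls_table[module];
}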
  
Will Newton May 2, 2014, 2:29 p.m. UTC | #7
On 30 April 2014 14:57, Adhemerval Zanella <azanella@linux.vnet.ibm.com> wrote:
> This patch adds a single-thread optimization for malloc atomic usage to
> first check if process is single-thread (ST) and if so use normal
> load/store instead of atomic instructions.
>
> It is a respin of my initial attempt to add ST optimization on PowerPC.
> Now malloc optimization is arch-agnostic and rely in already defined macros
> in GLIBC (SINGLE_THREAD_P).  x86 probably would like an atomic.h refactor
> to avoid the double check for '__local_multiple_threads', but that is not
> the aim of this patch.
>
> Tested on powerpc32, powerpc64, and x86_64.
>
> --
>
>         * malloc/malloc.c (MALLOC_ATOMIC_OR): New macro: optimize atomic for
>         single thread.
>         (MALLOC_ATOMIC_AND): Likewise.
>         (MALLOC_ATOMIC_CAS_VAL_ACQ): Likewise.
>         (MALLOC_ATOMIC_CAS_VAL_REL): Likewise.
>         (clear_fastchunks): Use optimized single thread macro.
>         (set_fastchunks): Likewise.
>         (_int_malloc): Likewise.
>         (_int_free): Likewise.
>
> ---
>
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 1120d4d..bb0aa82 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -231,6 +231,7 @@
>  #include <errno.h>
>
>  #include <shlib-compat.h>
> +#include <sysdep-cancel.h>  /* For SINGLE_THREAD_P macro  */
>
>  /* For uintptr_t.  */
>  #include <stdint.h>
> @@ -243,6 +244,57 @@
>
>
>  /*
> +  Single-thread lock optimization: atomic primitives first check the number of

It's not strictly about locks I suppose...

> +  threads and avoid atomic instructions for single-thread case.
> + */
> +#define MALLOC_ATOMIC_OR(mem, mask)                                          \
> + ({                                                                          \
> +   if (!SINGLE_THREAD_P)                                                     \
> +     catomic_or (mem, mask);                                                 \
> +   else                                                                              \
> +     *mem |= mask;                                                           \
> + })
> +
> +#define MALLOC_ATOMIC_AND(mem, mask)                                         \
> + ({                                                                          \
> +   if (!SINGLE_THREAD_P)                                                     \
> +     catomic_and (mem, mask);                                                \
> +   else                                                                              \
> +     *mem &= mask;                                                           \
> + })
> +
> +#define MALLOC_ATOMIC_CAS_VAL_ACQ(mem, newval, oldval)                       \
> + ({                                                                          \
> +    __typeof (*(mem)) __tmp;                                                 \
> +    __typeof (mem) __memp = (mem);                                           \
> +    if (!SINGLE_THREAD_P)                                                    \
> +      __tmp = catomic_compare_and_exchange_val_acq (mem, newval, oldval);     \
> +    else                                                                     \
> +      {                                                                              \
> +       __tmp = *__memp;                                                      \
> +       if (__tmp == oldval)                                                  \
> +         *__memp = newval;                                                   \
> +      }                                                                              \
> +    __tmp;                                                                   \
> + })
> +
> +#define MALLOC_ATOMIC_CAS_VAL_REL(mem, newval, oldval)                       \
> + ({                                                                          \
> +    __typeof (*(mem)) __tmp;                                                 \
> +    __typeof (mem) __memp = (mem);                                           \
> +    if (!SINGLE_THREAD_P)                                                    \
> +      __tmp = catomic_compare_and_exchange_val_rel (mem, newval, oldval);     \
> +    else                                                                     \
> +      {                                                                              \
> +       __tmp = *__memp;                                                      \
> +       if (__tmp == oldval)                                                  \
> +         *__memp = newval;                                                   \
> +      }                                                                              \
> +    __tmp;                                                                   \
> + })

Is there a reason these have to be macros? If we convert them to
inline functions it would be tidier and we wouldn't have to worry
about multiple evaluation etc.
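
For example, the acquire CAS could look roughly like the following.  This
is a standalone sketch with invented stand-ins (chunkptr, single_thread_p,
and a GCC __atomic builtin) for glibc's mfastbinptr, SINGLE_THREAD_P and
catomic_compare_and_exchange_val_acq:

#include <stdbool.h>

typedef struct malloc_chunk *chunkptr;
struct malloc_chunk { struct malloc_chunk *fd; };

/* Stand-in for glibc's SINGLE_THREAD_P check.  */
static bool
single_thread_p (void)
{
  return false;   /* conservative stub: always take the atomic path */
}

/* An inline function instead of a statement-expression macro: the
   arguments are evaluated exactly once and the types are checked.  */
static inline chunkptr
malloc_cas_val_acq (chunkptr *mem, chunkptr newval, chunkptr oldval)
{
  if (!single_thread_p ())
    {
      /* Multi-thread path; glibc would use
         catomic_compare_and_exchange_val_acq here.  */
      __atomic_compare_exchange_n (mem, &oldval, newval, false,
                                   __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
      return oldval;   /* holds the observed value after the call */
    }

  /* Single-thread path: plain load/store, no atomic instructions.  */
  chunkptr old = *mem;
  if (old == oldval)
    *mem = newval;
  return old;
}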

> +
> +
> +/*
>    Debugging:
>
>    Because freed chunks may be overwritten with bookkeeping fields, this
> @@ -1632,8 +1684,8 @@ typedef struct malloc_chunk *mfastbinptr;
>  #define FASTCHUNKS_BIT        (1U)
>
>  #define have_fastchunks(M)     (((M)->flags & FASTCHUNKS_BIT) == 0)
> -#define clear_fastchunks(M)    catomic_or (&(M)->flags, FASTCHUNKS_BIT)
> -#define set_fastchunks(M)      catomic_and (&(M)->flags, ~FASTCHUNKS_BIT)
> +#define clear_fastchunks(M)    MALLOC_ATOMIC_OR (&(M)->flags, FASTCHUNKS_BIT)
> +#define set_fastchunks(M)      MALLOC_ATOMIC_AND (&(M)->flags, ~FASTCHUNKS_BIT)

And I guess since these are the only users we could expand the functions here.
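
That is, roughly the following, reusing FASTCHUNKS_BIT, SINGLE_THREAD_P and
catomic_* exactly as the patch above does (a sketch, not a tested change):

/* Fold the single-thread check directly into the only two users instead
   of going through MALLOC_ATOMIC_{OR,AND}.  */
#define clear_fastchunks(M)                              \
  do                                                     \
    {                                                    \
      if (SINGLE_THREAD_P)                               \
        (M)->flags |= FASTCHUNKS_BIT;                    \
      else                                               \
        catomic_or (&(M)->flags, FASTCHUNKS_BIT);        \
    }                                                    \
  while (0)

#define set_fastchunks(M)                                \
  do                                                     \
    {                                                    \
      if (SINGLE_THREAD_P)                               \
        (M)->flags &= ~FASTCHUNKS_BIT;                   \
      else                                               \
        catomic_and (&(M)->flags, ~FASTCHUNKS_BIT);      \
    }                                                    \
  while (0)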

>  /*
>     NONCONTIGUOUS_BIT indicates that MORECORE does not return contiguous
> @@ -3334,7 +3386,7 @@ _int_malloc (mstate av, size_t bytes)
>            if (victim == NULL)
>              break;
>          }
> -      while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
> +      while ((pp = MALLOC_ATOMIC_CAS_VAL_ACQ (fb, victim->fd, victim))
>               != victim);
>        if (victim != 0)
>          {
> @@ -3903,7 +3955,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>           old_idx = fastbin_index(chunksize(old));
>         p->fd = old2 = old;
>        }
> -    while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2)) != old2);
> +    while ((old = MALLOC_ATOMIC_CAS_VAL_REL (fb, p, old2)) != old2);
>
>      if (have_lock && old != NULL && __builtin_expect (old_idx != idx, 0))
>        {
> --
> 1.8.4.2
>

The main body of the patch looks ok to me.
  
Torvald Riegel May 2, 2014, 2:34 p.m. UTC | #8
On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
> This patch adds a single-thread optimization for malloc atomic usage to
> first check if process is single-thread (ST) and if so use normal
> load/store instead of atomic instructions.
> 
> It is a respin of my initial attempt to add ST optimization on PowerPC.
> Now malloc optimization is arch-agnostic and rely in already defined macros
> in GLIBC (SINGLE_THREAD_P).

I'm not convinced that this is a good change.  First, I'd like to see
some performance numbers that show whether the change leads to a
non-negligible increase in performance.  That should be a
microbenchmark, so we can track it over time.

malloc is currently marked as Async-Signal-Unsafe, but due to locks
elsewhere in the implementation.  The MTSafety docs even talk about what
would be needed to make it AS-Safe.  Your patch would prevent making it
AS-Safe.  The catomic_* calls that you replace are AS-Safe; the sequential
code you add is not.  Someone added those catomic_* calls rather than
sequential code, so this is at least an indication that there's a disconnect here.

You also add macros with nondescriptive names, that each have exactly
one use.  There's no documentation for how they differ, and you don't
update the comments on the MT-Safe docs.

> x86 probably would like an atomic.h refactor
> to avoid the double check for '__local_multiple_threads', but that is not
> the aim of this patch.

But your patch does introduce this, so you should take care of this as
well.  I guess you wouldn't be happy if someone changed x86 without
caring about Power?
  
Ondrej Bilka May 2, 2014, 4:38 p.m. UTC | #9
On Fri, May 02, 2014 at 04:34:58PM +0200, Torvald Riegel wrote:
> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
> > This patch adds a single-thread optimization for malloc atomic usage to
> > first check if process is single-thread (ST) and if so use normal
> > load/store instead of atomic instructions.
> > 
> > It is a respin of my initial attempt to add ST optimization on PowerPC.
> > Now malloc optimization is arch-agnostic and rely in already defined macros
> > in GLIBC (SINGLE_THREAD_P).
> 
> I'm not convinced that this is a good change.  First, I'd like to see
> some performance numbers that show whether the change leads to a
> non-negligible increase in performance.  That should be a
> microbenchmark, so we can track it over time.
> 
> malloc is currently marked as Async-Signal-Unsafe, but due to locks
> elsewhere in the implementation.  The MTSafety docs even talk about what
> would be needed to make it AS-Safe.  Your patch would prevent making it
> AS-Safe.  The catomic_add that you replace are AS-Safe, the sequential
> code you add is not.  Someone added those catomic_* and not sequential
> code, so this is at least indication that there's a disconnect here.
> 
> You also add macros with nondescriptive names, that each have exactly
> one use.  There's no documentation for how they differ, and you don't
> update the comments on the MT-Safe docs.
> 
> > x86 probably would like an atomic.h refactor
> > to avoid the double check for '__local_multiple_threads', but that is not
> > the aim of this patch.
> 
> But your patch does introduce this, so you should take care of this as
> well.  I guess you wouldn't be happy if someone change x86 without
> caring about Power?

Also, as TLS is fast on POWER, this change is redundant.  The performance
difference with a properly implemented thread cache is negligible, so this
code would soon be removed.
  
Adhemerval Zanella Netto May 2, 2014, 5:53 p.m. UTC | #10
On 02-05-2014 11:34, Torvald Riegel wrote:
> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
>> This patch adds a single-thread optimization for malloc atomic usage to
>> first check if process is single-thread (ST) and if so use normal
>> load/store instead of atomic instructions.
>>
>> It is a respin of my initial attempt to add ST optimization on PowerPC.
>> Now malloc optimization is arch-agnostic and rely in already defined macros
>> in GLIBC (SINGLE_THREAD_P).
> I'm not convinced that this is a good change.  First, I'd like to see
> some performance numbers that show whether the change leads to a
> non-negligible increase in performance.  That should be a
> microbenchmark, so we can track it over time.

In fact I would prefer to make it more arch-specific and let the arch
maintainers evaluate whether such a change is preferable.  One way to do it
is to add a new macro in bits/libc-lock.h, something like __libc_atomic_*, and
define it as catomic_* in ./nptl/sysdeps/pthread/bits/libc-lock.h.  Each arch
can then replace the implementation if it suits them.
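
A sketch of what that could look like; __libc_atomic_or is an invented name
(nothing like it exists today), and SINGLE_THREAD_P / catomic_or are the
macros already available to malloc.c:

/* Generic fallback (e.g. nptl/sysdeps/pthread/bits/libc-lock.h):
   #define __libc_atomic_or(mem, mask)  catomic_or (mem, mask)

   A powerpc bits/libc-lock.h could then override it with a
   single-thread fast path:  */
#define __libc_atomic_or(mem, mask)              \
  ({                                             \
    if (SINGLE_THREAD_P)                         \
      *(mem) |= (mask);                          \
    else                                         \
      catomic_or (mem, mask);                    \
  })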


>
> malloc is currently marked as Async-Signal-Unsafe, but due to locks
> elsewhere in the implementation.  The MTSafety docs even talk about what
> would be needed to make it AS-Safe.  Your patch would prevent making it
> AS-Safe.  The catomic_add that you replace are AS-Safe, the sequential
> code you add is not.  Someone added those catomic_* and not sequential
> code, so this is at least indication that there's a disconnect here.

Another point in favor of making this change arch-specific.  And if we ever
make the malloc functions AS-Safe in the future, the arch maintainer just
needs to reevaluate this change (since it will either make the optimization
unnecessary or make it unsafe).

>
> You also add macros with nondescriptive names, that each have exactly
> one use.  There's no documentation for how they differ, and you don't
> update the comments on the MT-Safe docs.
>
>> x86 probably would like an atomic.h refactor
>> to avoid the double check for '__local_multiple_threads', but that is not
>> the aim of this patch.
> But your patch does introduce this, so you should take care of this as
> well.  I guess you wouldn't be happy if someone change x86 without
> caring about Power?
>
I will drop this patch and rework it as an arch-specific one.
  
Torvald Riegel May 5, 2014, 10:17 a.m. UTC | #11
On Fri, 2014-05-02 at 14:53 -0300, Adhemerval Zanella wrote:
> On 02-05-2014 11:34, Torvald Riegel wrote:
> > On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
> >> This patch adds a single-thread optimization for malloc atomic usage to
> >> first check if process is single-thread (ST) and if so use normal
> >> load/store instead of atomic instructions.
> >>
> >> It is a respin of my initial attempt to add ST optimization on PowerPC.
> >> Now malloc optimization is arch-agnostic and rely in already defined macros
> >> in GLIBC (SINGLE_THREAD_P).
> > I'm not convinced that this is a good change.  First, I'd like to see
> > some performance numbers that show whether the change leads to a
> > non-negligible increase in performance.  That should be a
> > microbenchmark, so we can track it over time.
> 
> In fact I would prefer to make it more arch-specific and let the the arch-maintainers
> evaluate if such change would be preferable.  One way to do it is adding a
> new macro in the bits/libc-locks, some time __libc_atomic_*; and define it as
> catomic_* in ./nptl/sysdeps/pthread/bits/libc-lock.h.  Then arch can be replace
> its implementation if it suit them.

I agree that there is arch-specific input to this optimization problem,
but that doesn't mean that the solution to it will be primarily
arch-specific.

The arch-specific input we have is:
* Whether an atomic op (probably we'll be considering just RMW ops and
CAS) is significantly slower than non-atomic code plus a branch.  For
example, the difference should be small on recent x86, but I assume it's
higher on PowerPC.  We'd need a microbenchmark of some sort to track the
decision we make here.
* Whether an atomic op that's MT-Safe and AS-Safe is significantly
slower than an atomic op that's just AS-Safe.  That needs a
microbenchmark as well.  For example, x86 currently assumes this is the
case, but it's something we should revisit I guess.
* If the former is the case, we need arch-specific atomics (i.e.,
catomic_*) to make use of that.

Thus, we'd end up with two constants and potentially some arch-specific
variants.  We could differentiate the constants as necessary per user
(e.g., make one that's malloc-specific), but I'd prefer not to, and that
can be done in the future.

With that in place, we can look at the atomics users and see how they
could exploit AS-Safe or AS-Unsafe+MT-Unsafe scenarios.  We'd have a
normal branch and let the compiler remove the dead code (we can use the
preprocessor too, but I think that's worse).
That allows us to really look at what we can do differently in a
sequential-execution scenario -- so going beyond just tweaking one
atomic op at a time.

If we really need the single-atomic-op optimizations often, we can
consider adding something like catomic.  That is, if, for example,
  if (as_safe_atomics_faster) x+=1; else atomic_add(&x, 1);
is needed often, we can add something like catomic.
We might want to revisit current uses of catomic too.
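
One way to read that plan, as a standalone sketch with invented names
(ATOMIC_RMW_SLOWER_THAN_BRANCH for the per-arch constant, single_thread_p
for SINGLE_THREAD_P, and a GCC __atomic builtin for catomic_or):

#include <stdbool.h>

#ifndef ATOMIC_RMW_SLOWER_THAN_BRANCH
# define ATOMIC_RMW_SLOWER_THAN_BRANCH 1   /* e.g. 1 on powerpc, 0 on recent x86;
                                              an arch would set this in atomic.h */
#endif

/* Stand-in for SINGLE_THREAD_P.  */
static bool
single_thread_p (void)
{
  return false;
}

static inline void
flags_or (unsigned int *mem, unsigned int mask)
{
  if (ATOMIC_RMW_SLOWER_THAN_BRANCH && single_thread_p ())
    /* Sequential path: MT-Unsafe and AS-Unsafe, and compiled out
       entirely when the constant is 0.  */
    *mem |= mask;
  else
    /* MT-Safe path; glibc would use catomic_or here.  */
    __atomic_fetch_or (mem, mask, __ATOMIC_RELAXED);
}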

> >
> > malloc is currently marked as Async-Signal-Unsafe, but due to locks
> > elsewhere in the implementation.  The MTSafety docs even talk about what
> > would be needed to make it AS-Safe.  Your patch would prevent making it
> > AS-Safe.  The catomic_add that you replace are AS-Safe, the sequential
> > code you add is not.  Someone added those catomic_* and not sequential
> > code, so this is at least indication that there's a disconnect here.
> 
> Another point to make this change arch specific.

Why?  The motivation to use catomic might have been driven by
arch-specific properties, but you want to replace it with something
functionally different.

> And if we even make malloc
> functions AS-Safe in the future, the arch maintainer just need to reevaluate
> this change (since it will either make the optimization not required or make
> it unsafe).

Whether malloc is AS-Safe or not should never be arch-specific IMO.  If
we use the constants as I outlined above, the malloc implementer can
just build code for the different HW performance scenarios, and all the
archs have to do is set the flags accordingly.

We could even further reduce the implementation overhead for the AS-Safe
MT-Unsafe atomics if using the newer GCC builtins for the C11 memory
model by adding support for a special memory order that just provides
AS-Safety.

> >
> > You also add macros with nondescriptive names, that each have exactly
> > one use.  There's no documentation for how they differ, and you don't
> > update the comments on the MT-Safe docs.
> >
> >> x86 probably would like an atomic.h refactor
> >> to avoid the double check for '__local_multiple_threads', but that is not
> >> the aim of this patch.
> > But your patch does introduce this, so you should take care of this as
> > well.  I guess you wouldn't be happy if someone change x86 without
> > caring about Power?
> >
> I will drop this patch and rework on a arch-specific one.

What do you think about the plan I outlined above?
  
Adhemerval Zanella Netto May 5, 2014, 4:59 p.m. UTC | #12
On 05-05-2014 07:17, Torvald Riegel wrote:
> On Fri, 2014-05-02 at 14:53 -0300, Adhemerval Zanella wrote:
>> On 02-05-2014 11:34, Torvald Riegel wrote:
>>> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
>>>> This patch adds a single-thread optimization for malloc atomic usage to
>>>> first check if process is single-thread (ST) and if so use normal
>>>> load/store instead of atomic instructions.
>>>>
>>>> It is a respin of my initial attempt to add ST optimization on PowerPC.
>>>> Now malloc optimization is arch-agnostic and rely in already defined macros
>>>> in GLIBC (SINGLE_THREAD_P).
>>> I'm not convinced that this is a good change.  First, I'd like to see
>>> some performance numbers that show whether the change leads to a
>>> non-negligible increase in performance.  That should be a
>>> microbenchmark, so we can track it over time.
>> In fact I would prefer to make it more arch-specific and let the the arch-maintainers
>> evaluate if such change would be preferable.  One way to do it is adding a
>> new macro in the bits/libc-locks, some time __libc_atomic_*; and define it as
>> catomic_* in ./nptl/sysdeps/pthread/bits/libc-lock.h.  Then arch can be replace
>> its implementation if it suit them.
> I agree that there is arch-specific input to this optimization problem,
> but that doesn't mean that the solution to it will be primarily
> arch-specific.
>
> The arch-specific input we have is:
> * Whether an atomic op (probably we'll be considering just RMW ops and
> CAS) is significantly slower than non-atomic code plus a branch.  For
> example, the difference should be small on recent x86, but I assume it's
> higher on PowerPC.  We'd need a microbenchmark of some sort to track the
> decision we make here.
> * Whether an atomic op that's MT-Safe and AS-Safe is significantly
> slower than an atomic op that's just AS-Safe.  That needs a
> microbenchmark as well.  For example, x86 currently assumes this is the
> case, but it's something we should revisit I guess.

What motivated me to propose this change is the malloc code in the single-thread
case.  To give you some numbers: currently on a POWER7/3GHz, a simple loop that
allocates 32 bytes takes about 130ns per call.  Profilers indicate that the
culprit of this long time is indeed the locking used even in the single-thread case.

Using my first approach (sequential code for single-thread, not AS-safe), the
same calls are now done in about 40ns.  By just tuning the locks (not the atomics)
it rises to 90ns.
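
Roughly the kind of loop behind these numbers, as a reconstruction for
reference rather than the exact benchmark used:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main (void)
{
  enum { ITERS = 10 * 1000 * 1000, SIZE = 32 };
  struct timespec start, end;

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < ITERS; i++)
    {
      /* volatile keeps the compiler from eliding the malloc/free pair.  */
      void *volatile p = malloc (SIZE);
      free (p);
    }
  clock_gettime (CLOCK_MONOTONIC, &end);

  double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
  printf ("%.1f ns per malloc+free pair\n", ns / ITERS);
  return 0;
}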

And yes, I'm aware of the issue about not being AS-safe.  However, the CAS
primitives, even the synchronizing ones with memory barriers, are currently
implemented with LL/SC instructions (ldarx / stdcx.).  And for the single-thread
case the memory barriers are not what causes the latency: I checked by just
removing them (by using CAS primitives without them) and it made no difference
to performance; the problem is the use of the LL/SC instructions themselves.
This is true at least for POWER6, POWER7, and POWER8.
 

> * If the former is the case, we need arch-specific atomics (i.e.,
> catomic_*) to make use of that.
>
> Thus, we'd end up with two constants and potentially some arch-specific
> variants.  We could differentiate the constants as necessary per user
> (e.g., make one that's malloc-specific), but I'd prefer not to, and that
> can be done in the future.
>
> With that in place, we can look at the atomics users and see how they
> could exploit AS-Safe or AS-Unsafe+MT-Unsafe scenarios.  We'd have a
> normal branch and let the compiler remove the dead code (we can use the
> preprocessor too, but I think that's worse).
> That allows us to really look at what we can do differently in a
> sequential-execution scenario -- so going beyond just tweaking one
> atomic op at a time.
>
> If we really need the single-atomic-op optimizations often, we can
> consider adding something like catomic.  That is, if, for example,
>   if (as_safe_atomics_faster) x+=1; else atomic_add(&x, 1);
> is needed often, we can add something like catomic.
> We might want to revisit current uses of catomic too.

That's why I would like to propose that archs be able to override the atomics
in the malloc code with an instruction sequence that checks for multithreading
and uses a code sequence that is not AS-Safe.  I know that it will impose an
additional consideration for AS-safeness in the malloc code; however, since it
is not AS-safe right now, this arch-specific modification is safe.  And if in
the future we ever make malloc AS-Safe, this modification will surely need to
be reevaluated or even dropped altogether.

But I concede that this is a potentially misleading change, so indeed it will
need some discussion.  As Ondrej has proposed, the best course of action would
be to rework the malloc internals altogether and add something like thread
caches to avoid lock contention (even for single-thread); however, for the
*current* code these locks are a real issue for POWER, and we from the POWER
toolchain really want a way to circumvent it, if possible.  I would understand
if it is decided that the engineering cost is not worth the gains.


>
>>> malloc is currently marked as Async-Signal-Unsafe, but due to locks
>>> elsewhere in the implementation.  The MTSafety docs even talk about what
>>> would be needed to make it AS-Safe.  Your patch would prevent making it
>>> AS-Safe.  The catomic_add that you replace are AS-Safe, the sequential
>>> code you add is not.  Someone added those catomic_* and not sequential
>>> code, so this is at least indication that there's a disconnect here.
>> Another point to make this change arch specific.
> Why?  The motivation to use catomic might have been driven by
> arch-specific properties, but you want to replace it with something
> functionally different.

I would like to replace the atomics in the malloc code and the libc lock/unlock
primitives only for POWER, since they are not currently AS-safe.  But as I said,
I will concede if the community decides it is not worth the engineering effort.

>
>> And if we even make malloc
>> functions AS-Safe in the future, the arch maintainer just need to reevaluate
>> this change (since it will either make the optimization not required or make
>> it unsafe).
> Whether malloc is AS-Safe or not should never be arch-specific IMO.  If
> we use the constants as I outlined above, the malloc implementer can
> just build code for the different HW performance scenarios, and all the
> archs have to do is set the flags accordingly.
>
> We could even further reduce the implementation overhead for the AS-Safe
> MT-Unsafe atomics if using the newer GCC builtins for the C11 memory
> model by adding support for as special memory order that just provides
> AS-Safety.

For POWER in the single-thread scenario it is not the memory barriers alongside
the LL/SC instructions that generate the latency, but the LL/SC instructions
themselves.  Even using the C11 memory model won't help in this case.


>
>>> You also add macros with nondescriptive names, that each have exactly
>>> one use.  There's no documentation for how they differ, and you don't
>>> update the comments on the MT-Safe docs.
>>>
>>>> x86 probably would like an atomic.h refactor
>>>> to avoid the double check for '__local_multiple_threads', but that is not
>>>> the aim of this patch.
>>> But your patch does introduce this, so you should take care of this as
>>> well.  I guess you wouldn't be happy if someone change x86 without
>>> caring about Power?
>>>
>> I will drop this patch and rework on a arch-specific one.
> What do you think about the plan I outlined above?
>
  

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 1120d4d..bb0aa82 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -231,6 +231,7 @@ 
 #include <errno.h>
 
 #include <shlib-compat.h>
+#include <sysdep-cancel.h>  /* For SINGLE_THREAD_P macro  */
 
 /* For uintptr_t.  */
 #include <stdint.h>
@@ -243,6 +244,57 @@ 
 
 
 /*
+  Single-thread lock optimization: atomic primitives first check the number of
+  threads and avoid atomic instructions for single-thread case.
+ */
+#define MALLOC_ATOMIC_OR(mem, mask)					      \
+ ({									      \
+   if (!SINGLE_THREAD_P)						      \
+     catomic_or (mem, mask);						      \
+   else									      \
+     *mem |= mask;							      \
+ })
+
+#define MALLOC_ATOMIC_AND(mem, mask)					      \
+ ({									      \
+   if (!SINGLE_THREAD_P)						      \
+     catomic_and (mem, mask);						      \
+   else									      \
+     *mem &= mask;							      \
+ })
+
+#define MALLOC_ATOMIC_CAS_VAL_ACQ(mem, newval, oldval)			      \
+ ({									      \
+    __typeof (*(mem)) __tmp;						      \
+    __typeof (mem) __memp = (mem);					      \
+    if (!SINGLE_THREAD_P)						      \
+      __tmp = catomic_compare_and_exchange_val_acq (mem, newval, oldval);     \
+    else								      \
+      {									      \
+	__tmp = *__memp;						      \
+	if (__tmp == oldval)						      \
+	  *__memp = newval;						      \
+      }									      \
+    __tmp;								      \
+ })
+
+#define MALLOC_ATOMIC_CAS_VAL_REL(mem, newval, oldval)			      \
+ ({									      \
+    __typeof (*(mem)) __tmp;						      \
+    __typeof (mem) __memp = (mem);					      \
+    if (!SINGLE_THREAD_P)						      \
+      __tmp = catomic_compare_and_exchange_val_rel (mem, newval, oldval);     \
+    else								      \
+      {									      \
+	__tmp = *__memp;						      \
+	if (__tmp == oldval)						      \
+	  *__memp = newval;						      \
+      }									      \
+    __tmp;								      \
+ })
+
+
+/*
   Debugging:
 
   Because freed chunks may be overwritten with bookkeeping fields, this
@@ -1632,8 +1684,8 @@  typedef struct malloc_chunk *mfastbinptr;
 #define FASTCHUNKS_BIT        (1U)
 
 #define have_fastchunks(M)     (((M)->flags & FASTCHUNKS_BIT) == 0)
-#define clear_fastchunks(M)    catomic_or (&(M)->flags, FASTCHUNKS_BIT)
-#define set_fastchunks(M)      catomic_and (&(M)->flags, ~FASTCHUNKS_BIT)
+#define clear_fastchunks(M)    MALLOC_ATOMIC_OR (&(M)->flags, FASTCHUNKS_BIT)
+#define set_fastchunks(M)      MALLOC_ATOMIC_AND (&(M)->flags, ~FASTCHUNKS_BIT)
 
 /*
    NONCONTIGUOUS_BIT indicates that MORECORE does not return contiguous
@@ -3334,7 +3386,7 @@  _int_malloc (mstate av, size_t bytes)
           if (victim == NULL)
             break;
         }
-      while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
+      while ((pp = MALLOC_ATOMIC_CAS_VAL_ACQ (fb, victim->fd, victim))
              != victim);
       if (victim != 0)
         {
@@ -3903,7 +3955,7 @@  _int_free (mstate av, mchunkptr p, int have_lock)
 	  old_idx = fastbin_index(chunksize(old));
 	p->fd = old2 = old;
       }
-    while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2)) != old2);
+    while ((old = MALLOC_ATOMIC_CAS_VAL_REL (fb, p, old2)) != old2);
 
     if (have_lock && old != NULL && __builtin_expect (old_idx != idx, 0))
       {