Message ID | 53610133.3070908@linux.vnet.ibm.com |
---|---|
State | Rejected |
Commit Message
Adhemerval Zanella Netto
April 30, 2014, 1:57 p.m. UTC
This patch adds a single-thread optimization for malloc atomic usage: first check whether the process is single-threaded (ST) and, if so, use plain load/store instead of atomic instructions.

It is a respin of my initial attempt to add the ST optimization on PowerPC. The malloc optimization is now arch-agnostic and relies on a macro already defined in GLIBC (SINGLE_THREAD_P). x86 would probably like an atomic.h refactor to avoid the double check of '__local_multiple_threads', but that is not the aim of this patch.

Tested on powerpc32, powerpc64, and x86_64.

--

	* malloc/malloc.c (MALLOC_ATOMIC_OR): New macro: optimize atomic for
	single thread.
	(MALLOC_ATOMIC_AND): Likewise.
	(MALLOC_ATOMIC_CAS_VAL_ACQ): Likewise.
	(MALLOC_ATOMIC_CAS_VAL_REL): Likewise.
	(clear_fastchunks): Use optimized single thread macro.
	(set_fastchunks): Likewise.
	(_int_malloc): Likewise.
	(_int_free): Likewise.

---
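[Editorial note] In essence, the patch wraps each atomic operation in a run-time single-thread check. A stripped-down, hypothetical illustration of the pattern outside glibc (the single_thread_p flag stands in for glibc's SINGLE_THREAD_P macro, and the GCC builtin __atomic_or_fetch stands in for the catomic_or used by the patch below):

#include <stdbool.h>

/* Stand-in for glibc's SINGLE_THREAD_P: true until a second thread is
   created (hypothetical; not glibc code).  */
static bool single_thread_p = true;

/* Set a flag bit, skipping the atomic instruction when the process is
   known to be single-threaded -- the same shape as MALLOC_ATOMIC_OR in
   the patch below.  */
static inline void
flags_or (unsigned int *mem, unsigned int mask)
{
  if (!single_thread_p)
    __atomic_or_fetch (mem, mask, __ATOMIC_RELAXED);  /* MT-safe path.  */
  else
    *mem |= mask;            /* Plain store: no lock prefix or LL/SC cost.  */
}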
Comments
On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> This patch adds a single-thread optimization for malloc atomic usage to
> first check if process is single-thread (ST) and if so use normal
> load/store instead of atomic instructions.
>
How fast is TLS on Power? When we add a per-thread cache as I suggested, it would most of the time have the same performance as the single-thread case, with the overhead of one TLS variable access per malloc call.
On 30-04-2014 11:18, Ondřej Bílka wrote:
> On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
>> This patch adds a single-thread optimization for malloc atomic usage to
>> first check if process is single-thread (ST) and if so use normal
>> load/store instead of atomic instructions.
>>
> How fast is tls on power? When we add a per-thread cache as I suggested
> then it would have most of time same performance as singlethread, with
> overhead one tls variable access per malloc call.
>
The question is subtle: you mean compared to what? Other memory allocators with per-thread caches do show good performance using TLS.

Anyway, *right now* this lock is an issue that has been circumvented in an arch-specific way (at least for x86_64). I know this is not the best way to fix all current malloc issues, but it is something we can optimize for the single-thread case with minor code impact.
On Wed, Apr 30, 2014 at 11:25:36AM -0300, Adhemerval Zanella wrote:
> On 30-04-2014 11:18, Ondřej Bílka wrote:
> > On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> >> This patch adds a single-thread optimization for malloc atomic usage to
> >> first check if process is single-thread (ST) and if so use normal
> >> load/store instead of atomic instructions.
> >>
> > How fast is tls on power? When we add a per-thread cache as I suggested
> > then it would have most of time same performance as singlethread, with
> > overhead one tls variable access per malloc call.
> >
> The question is subtle, you mean compared to what? Others memory allocators
> that per-thread cache shows good performance using TLS.
>
> Anyway, *now* this lock is an issue that has being circumvented in a
> arch-specific way (at least for x86_64). I know that is not the best way
> to fix all current malloc issues, but it something we can optimize for
> single-thread that has minor code impacts.
>
Compared to the single-thread case. At the start of malloc you could just get a pointer to a per-thread structure, and the rest of the code, operating on that structure, would be identical to the single-thread version.
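[Editorial note] A rough sketch of the per-thread-cache idea described above. Everything here is hypothetical (thread_cache, cached_malloc, the bin sizing), not glibc code; the point is that the only TLS cost is fetching one pointer at the top of malloc, after which the fast path runs lock- and atomic-free on thread-private data:

#include <stddef.h>
#include <stdlib.h>

struct thread_cache
{
  void *bins[64];   /* Small per-size free lists, private to the thread.  */
};

/* Hypothetical per-thread cache; one TLS access fetches it.  */
static __thread struct thread_cache tcache;

void *
cached_malloc (size_t size)
{
  /* One TLS access per call: get the pointer to the per-thread structure.
     Everything below is plain single-thread code, no locks or atomics.  */
  struct thread_cache *c = &tcache;

  size_t idx = size / 16;
  if (idx < 64 && c->bins[idx] != NULL)
    {
      void *victim = c->bins[idx];
      c->bins[idx] = *(void **) victim;  /* Pop: next pointer is stored in the chunk.  */
      return victim;
    }
  /* Slow path: fall back to the shared allocator (here, plain malloc).  */
  return malloc (size);
}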
On Wed, Apr 30, 2014 at 04:18:45PM +0200, Ondřej Bílka wrote:
> On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> > This patch adds a single-thread optimization for malloc atomic usage to
> > first check if process is single-thread (ST) and if so use normal
> > load/store instead of atomic instructions.
> >
> How fast is tls on power? When we add a per-thread cache as I suggested
> then it would have most of time same performance as singlethread, with
> overhead one tls variable access per malloc call.

Extremely fast: the TLS address is simply kept in a general-purpose register.

Rich
On Wed, 2014-04-30 at 12:06 -0400, Rich Felker wrote:
> On Wed, Apr 30, 2014 at 04:18:45PM +0200, Ondřej Bílka wrote:
> > On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> > > This patch adds a single-thread optimization for malloc atomic usage to
> > > first check if process is single-thread (ST) and if so use normal
> > > load/store instead of atomic instructions.
> > >
> > How fast is tls on power? When we add a per-thread cache as I suggested
> > then it would have most of time same performance as singlethread, with
> > overhead one tls variable access per malloc call.
>
> Extremely fast: the TLS address is simply kept in a general-purpose
> register.
>
Depends on the TLS access model.

The General Dynamic TLS model requires a dynamic up-call to __tls_get_addr(). So slow.

If you can get to the Local Exec or Initial Exec form (where the DTV slot or TLS offset can be known at static link time) it can be a simple inline computation.

As we are talking about a dynamic library (libc.so) here, you have to set this up carefully.
On Wed, Apr 30, 2014 at 03:12:45PM -0500, Steven Munroe wrote:
> On Wed, 2014-04-30 at 12:06 -0400, Rich Felker wrote:
> > On Wed, Apr 30, 2014 at 04:18:45PM +0200, Ondřej Bílka wrote:
> > > On Wed, Apr 30, 2014 at 10:57:07AM -0300, Adhemerval Zanella wrote:
> > > > This patch adds a single-thread optimization for malloc atomic usage to
> > > > first check if process is single-thread (ST) and if so use normal
> > > > load/store instead of atomic instructions.
> > > >
> > > How fast is tls on power? When we add a per-thread cache as I suggested
> > > then it would have most of time same performance as singlethread, with
> > > overhead one tls variable access per malloc call.
> >
> > Extremely fast: the TLS address is simply kept in a general-purpose
> > register.
> >
> Depends on the TLS access model.
>
> General Dynamic TLS Model requires a dynamic up-call to __tls_get_addr().
> So slow.
>
> If you can get to the Local Exec or Initial Exec form (where the DTV
> slot or TLS offset can be known at static link time) it can be a simple
> inline computation.
>
> As we are talking about a dynamic library (libc.so) here, you have to
> set this up carefully.

In malloc we already use initial-exec; see the tsd_getspecific macro in malloc/arena.c.

By the way, general TLS slowness is caused mostly by an inefficient implementation. If you do not mind adding a pointer variable p in initial-exec form to each binary, you could emulate TLS by referencing p plus two array lookups (except for the first access, which triggers a branch that calls something).
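[Editorial note] For the access-model point above: with GCC, a TLS variable inside a shared library can be pinned to the initial-exec model explicitly (the variable and function below are hypothetical examples, not thread code); accesses then compile to a thread-pointer-relative load (r13-relative on PowerPC) instead of a __tls_get_addr call:

/* Hypothetical counter; tls_model("initial-exec") asks the toolchain to
   resolve the TLS offset at load time, so each access is a simple
   thread-pointer-relative load/store rather than a __tls_get_addr call.  */
static __thread int in_malloc_depth
  __attribute__ ((tls_model ("initial-exec")));

int
enter_malloc (void)
{
  return ++in_malloc_depth;   /* Compiles to TP-relative load/add/store.  */
}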
On 30 April 2014 14:57, Adhemerval Zanella <azanella@linux.vnet.ibm.com> wrote:
> This patch adds a single-thread optimization for malloc atomic usage to
> first check if process is single-thread (ST) and if so use normal
> load/store instead of atomic instructions.
>
> It is a respin of my initial attempt to add ST optimization on PowerPC.
> Now malloc optimization is arch-agnostic and rely in already defined macros
> in GLIBC (SINGLE_THREAD_P). x86 probably would kike an atomic.h refactor
> to avoid the double check for '__local_multiple_threads', but that is not
> the aim of this patch.
>
> Tested on powerpc32, powerpc64, and x86_64.
>
> --
>
> * malloc/malloc.c (MALLOC_ATOMIC_OR): New macro: optimize atomic for
>   single thread.
>   (MALLOC_ATOMIC_AND): Likewise.
>   (MALLOC_ATOMIC_CAS_VAL_ACQ): Likewise.
>   (MALLOC_ATOMIC_CAS_VAL_REL): Likewise.
>   (clear_fastchunks): Use optimized single thread macro.
>   (set_fastchunks): Likewise.
>   (_int_malloc): Likewise.
>   (_int_free): Likewise.
>
> ---
>
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 1120d4d..bb0aa82 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -231,6 +231,7 @@
>  #include <errno.h>
>
>  #include <shlib-compat.h>
> +#include <sysdep-cancel.h>  /* For SINGLE_THREAD_P macro */
>
>  /* For uintptr_t.  */
>  #include <stdint.h>
> @@ -243,6 +244,57 @@
>
>
>  /*
> +  Single-thread lock optimization: atomic primitives first check the number of

It's not strictly about locks I suppose...

> +  threads and avoid atomic instructions for single-thread case.
> + */
> +#define MALLOC_ATOMIC_OR(mem, mask) \
> +  ({ \
> +    if (!SINGLE_THREAD_P) \
> +      catomic_or (mem, mask); \
> +    else \
> +      *mem |= mask; \
> +  })
> +
> +#define MALLOC_ATOMIC_AND(mem, mask) \
> +  ({ \
> +    if (!SINGLE_THREAD_P) \
> +      catomic_and (mem, mask); \
> +    else \
> +      *mem &= mask; \
> +  })
> +
> +#define MALLOC_ATOMIC_CAS_VAL_ACQ(mem, newval, oldval) \
> +  ({ \
> +    __typeof (*(mem)) __tmp; \
> +    __typeof (mem) __memp = (mem); \
> +    if (!SINGLE_THREAD_P) \
> +      __tmp = catomic_compare_and_exchange_val_acq (mem, newval, oldval); \
> +    else \
> +      { \
> +        __tmp = *__memp; \
> +        if (__tmp == oldval) \
> +          *__memp = newval; \
> +      } \
> +    __tmp; \
> +  })
> +
> +#define MALLOC_ATOMIC_CAS_VAL_REL(mem, newval, oldval) \
> +  ({ \
> +    __typeof (*(mem)) __tmp; \
> +    __typeof (mem) __memp = (mem); \
> +    if (!SINGLE_THREAD_P) \
> +      __tmp = catomic_compare_and_exchange_val_rel (mem, newval, oldval); \
> +    else \
> +      { \
> +        __tmp = *__memp; \
> +        if (__tmp == oldval) \
> +          *__memp = newval; \
> +      } \
> +    __tmp; \
> +  })

Is there a reason these have to be macros? If we convert them to inline functions it would be tidier and we wouldn't have to worry about multiple evaluation etc. (A sketch of such a conversion follows this message.)

> +
> +
> +/*
>    Debugging:
>
>    Because freed chunks may be overwritten with bookkeeping fields, this
> @@ -1632,8 +1684,8 @@ typedef struct malloc_chunk *mfastbinptr;
>  #define FASTCHUNKS_BIT (1U)
>
>  #define have_fastchunks(M) (((M)->flags & FASTCHUNKS_BIT) == 0)
> -#define clear_fastchunks(M) catomic_or (&(M)->flags, FASTCHUNKS_BIT)
> -#define set_fastchunks(M) catomic_and (&(M)->flags, ~FASTCHUNKS_BIT)
> +#define clear_fastchunks(M) MALLOC_ATOMIC_OR (&(M)->flags, FASTCHUNKS_BIT)
> +#define set_fastchunks(M) MALLOC_ATOMIC_AND (&(M)->flags, ~FASTCHUNKS_BIT)

And I guess since these are the only users we could expand the functions here.

>
>  /*
>    NONCONTIGUOUS_BIT indicates that MORECORE does not return contiguous
> @@ -3334,7 +3386,7 @@ _int_malloc (mstate av, size_t bytes)
>            if (victim == NULL)
>              break;
>          }
> -      while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
> +      while ((pp = MALLOC_ATOMIC_CAS_VAL_ACQ (fb, victim->fd, victim))
>              != victim);
>        if (victim != 0)
>          {
> @@ -3903,7 +3955,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>           old_idx = fastbin_index(chunksize(old));
>           p->fd = old2 = old;
>         }
> -      while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2)) != old2);
> +      while ((old = MALLOC_ATOMIC_CAS_VAL_REL (fb, p, old2)) != old2);
>
>        if (have_lock && old != NULL && __builtin_expect (old_idx != idx, 0))
>          {
> --
> 1.8.4.2
>

The main body of the patch looks ok to me.
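[Editorial note] A sketch of what the reviewer's inline-function suggestion might look like. This is not code from the thread; it assumes it would live in malloc.c, where SINGLE_THREAD_P, mfastbinptr, and the catomic_* helpers are visible. Unlike the statement-expression macros, a function evaluates its arguments exactly once:

/* Sketch only: an inline-function alternative to MALLOC_ATOMIC_CAS_VAL_ACQ,
   specialized for the fastbin pointer type used by the two call sites.  */
static __always_inline mfastbinptr
malloc_atomic_cas_acq (mfastbinptr *mem, mfastbinptr newval, mfastbinptr oldval)
{
  if (!SINGLE_THREAD_P)
    return catomic_compare_and_exchange_val_acq (mem, newval, oldval);

  /* Single-thread path: plain load/store, no atomic instruction.  */
  mfastbinptr tmp = *mem;
  if (tmp == oldval)
    *mem = newval;
  return tmp;
}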
On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
> This patch adds a single-thread optimization for malloc atomic usage to
> first check if process is single-thread (ST) and if so use normal
> load/store instead of atomic instructions.
>
> It is a respin of my initial attempt to add ST optimization on PowerPC.
> Now malloc optimization is arch-agnostic and rely in already defined macros
> in GLIBC (SINGLE_THREAD_P).

I'm not convinced that this is a good change. First, I'd like to see some performance numbers that show whether the change leads to a non-negligible increase in performance. That should be a microbenchmark, so we can track it over time.

malloc is currently marked as Async-Signal-Unsafe, but due to locks elsewhere in the implementation. The MTSafety docs even talk about what would be needed to make it AS-Safe. Your patch would prevent making it AS-Safe. The catomic_* calls that you replace are AS-Safe; the sequential code you add is not. Someone added those catomic_* calls rather than sequential code, so this is at least an indication that there's a disconnect here.

You also add macros with nondescriptive names that each have exactly one use. There's no documentation for how they differ, and you don't update the comments in the MT-Safe docs.

> x86 probably would kike an atomic.h refactor
> to avoid the double check for '__local_multiple_threads', but that is not
> the aim of this patch.

But your patch does introduce this, so you should take care of this as well. I guess you wouldn't be happy if someone changed x86 without caring about Power?
On Fri, May 02, 2014 at 04:34:58PM +0200, Torvald Riegel wrote:
> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
> > This patch adds a single-thread optimization for malloc atomic usage to
> > first check if process is single-thread (ST) and if so use normal
> > load/store instead of atomic instructions.
> >
> > It is a respin of my initial attempt to add ST optimization on PowerPC.
> > Now malloc optimization is arch-agnostic and rely in already defined macros
> > in GLIBC (SINGLE_THREAD_P).
>
> I'm not convinced that this is a good change. First, I'd like to see
> some performance numbers that show whether the change leads to a
> non-negligible increase in performance. That should be a
> microbenchmark, so we can track it over time.
>
> malloc is currently marked as Async-Signal-Unsafe, but due to locks
> elsewhere in the implementation. The MTSafety docs even talk about what
> would be needed to make it AS-Safe. Your patch would prevent making it
> AS-Safe. The catomic_add that you replace are AS-Safe, the sequential
> code you add is not. Someone added those catomic_* and not sequential
> code, so this is at least indication that there's a disconnect here.
>
> You also add macros with nondescriptive names, that each have exactly
> one use. There's no documentation for how they differ, and you don't
> update the comments on the MT-Safe docs.
>
> > x86 probably would kike an atomic.h refactor
> > to avoid the double check for '__local_multiple_threads', but that is not
> > the aim of this patch.
>
> But your patch does introduce this, so you should take care of this as
> well. I guess you wouldn't be happy if someone change x86 without
> caring about Power?

Also, as TLS is fast on Power, this change is redundant. The performance difference with a properly implemented thread cache is negligible, so this code would soon be removed.
On 02-05-2014 11:34, Torvald Riegel wrote:
> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
>> This patch adds a single-thread optimization for malloc atomic usage to
>> first check if process is single-thread (ST) and if so use normal
>> load/store instead of atomic instructions.
>>
>> It is a respin of my initial attempt to add ST optimization on PowerPC.
>> Now malloc optimization is arch-agnostic and rely in already defined macros
>> in GLIBC (SINGLE_THREAD_P).
> I'm not convinced that this is a good change. First, I'd like to see
> some performance numbers that show whether the change leads to a
> non-negligible increase in performance. That should be a
> microbenchmark, so we can track it over time.

In fact I would prefer to make it more arch-specific and let the arch maintainers evaluate whether such a change is preferable. One way to do it is to add a new macro in the libc-lock bits, something like __libc_atomic_*, and define it as catomic_* in ./nptl/sysdeps/pthread/bits/libc-lock.h. Each arch could then replace the implementation if it suits them.

> malloc is currently marked as Async-Signal-Unsafe, but due to locks
> elsewhere in the implementation. The MTSafety docs even talk about what
> would be needed to make it AS-Safe. Your patch would prevent making it
> AS-Safe. The catomic_add that you replace are AS-Safe, the sequential
> code you add is not. Someone added those catomic_* and not sequential
> code, so this is at least indication that there's a disconnect here.

Another point in favor of making this change arch-specific. And if we ever make the malloc functions AS-Safe in the future, the arch maintainer just needs to reevaluate this change (since that will either make the optimization unnecessary or make it unsafe).

> You also add macros with nondescriptive names, that each have exactly
> one use. There's no documentation for how they differ, and you don't
> update the comments on the MT-Safe docs.
>
>> x86 probably would kike an atomic.h refactor
>> to avoid the double check for '__local_multiple_threads', but that is not
>> the aim of this patch.
> But your patch does introduce this, so you should take care of this as
> well. I guess you wouldn't be happy if someone change x86 without
> caring about Power?
>
I will drop this patch and rework it as an arch-specific one.
On Fri, 2014-05-02 at 14:53 -0300, Adhemerval Zanella wrote:
> On 02-05-2014 11:34, Torvald Riegel wrote:
> > On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
> >> This patch adds a single-thread optimization for malloc atomic usage to
> >> first check if process is single-thread (ST) and if so use normal
> >> load/store instead of atomic instructions.
> >>
> >> It is a respin of my initial attempt to add ST optimization on PowerPC.
> >> Now malloc optimization is arch-agnostic and rely in already defined macros
> >> in GLIBC (SINGLE_THREAD_P).
> > I'm not convinced that this is a good change. First, I'd like to see
> > some performance numbers that show whether the change leads to a
> > non-negligible increase in performance. That should be a
> > microbenchmark, so we can track it over time.
>
> In fact I would prefer to make it more arch-specific and let the the arch-maintainers
> evaluate if such change would be preferable. One way to do it is adding a
> new macro in the bits/libc-locks, some time __libc_atomic_*; and define it as
> catomic_* in ./nptl/sysdeps/pthread/bits/libc-lock.h. Then arch can be replace
> its implementation if it suit them.

I agree that there is arch-specific input to this optimization problem, but that doesn't mean that the solution to it will be primarily arch-specific.

The arch-specific input we have is:
* Whether an atomic op (probably we'll be considering just RMW ops and CAS) is significantly slower than non-atomic code plus a branch. For example, the difference should be small on recent x86, but I assume it's higher on PowerPC. We'd need a microbenchmark of some sort to track the decision we make here.
* Whether an atomic op that's MT-Safe and AS-Safe is significantly slower than an atomic op that's just AS-Safe. That needs a microbenchmark as well. For example, x86 currently assumes this is the case, but it's something we should revisit, I guess.
* If the former is the case, we need arch-specific atomics (i.e., catomic_*) to make use of that.

Thus, we'd end up with two constants and potentially some arch-specific variants. We could differentiate the constants as necessary per user (e.g., make one that's malloc-specific), but I'd prefer not to, and that can be done in the future.

With that in place, we can look at the atomics users and see how they could exploit AS-Safe or AS-Unsafe+MT-Unsafe scenarios. We'd have a normal branch and let the compiler remove the dead code (we can use the preprocessor too, but I think that's worse). That allows us to really look at what we can do differently in a sequential-execution scenario -- so going beyond just tweaking one atomic op at a time.

If we really need the single-atomic-op optimizations often, we can consider adding something like catomic. That is, if, for example,

  if (as_safe_atomics_faster) x += 1; else atomic_add (&x, 1);

is needed often, we can add something like catomic. We might want to revisit current uses of catomic too. (A sketch of this pattern follows this message.)

> >
> > malloc is currently marked as Async-Signal-Unsafe, but due to locks
> > elsewhere in the implementation. The MTSafety docs even talk about what
> > would be needed to make it AS-Safe. Your patch would prevent making it
> > AS-Safe. The catomic_add that you replace are AS-Safe, the sequential
> > code you add is not. Someone added those catomic_* and not sequential
> > code, so this is at least indication that there's a disconnect here.
>
> Another point to make this change arch specific.

Why? The motivation to use catomic might have been driven by arch-specific properties, but you want to replace it with something functionally different.

> And if we even make malloc
> functions AS-Safe in the future, the arch maintainer just need to reevaluate
> this change (since it will either make the optimization not required or make
> it unsafe).

Whether malloc is AS-Safe or not should never be arch-specific IMO. If we use the constants as I outlined above, the malloc implementer can just build code for the different HW performance scenarios, and all the archs have to do is set the flags accordingly.

We could even further reduce the implementation overhead for the AS-Safe MT-Unsafe atomics if using the newer GCC builtins for the C11 memory model, by adding support for a special memory order that just provides AS-Safety.

> >
> > You also add macros with nondescriptive names, that each have exactly
> > one use. There's no documentation for how they differ, and you don't
> > update the comments on the MT-Safe docs.
> >
> >> x86 probably would kike an atomic.h refactor
> >> to avoid the double check for '__local_multiple_threads', but that is not
> >> the aim of this patch.
> > But your patch does introduce this, so you should take care of this as
> > well. I guess you wouldn't be happy if someone change x86 without
> > caring about Power?
> >
> I will drop this patch and rework on a arch-specific one.

What do you think about the plan I outlined above?
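[Editorial note] A sketch of how the constants described above might be consumed by generic code. The flag name AS_SAFE_ATOMICS_FASTER mirrors the as_safe_atomics_faster placeholder from the message and is hypothetical, not an existing glibc tunable; each arch would define it and the compiler drops the untaken branch:

/* Hypothetical per-arch tuning flag, defined to 0 or 1 by sysdeps.  */
#ifndef AS_SAFE_ATOMICS_FASTER
# define AS_SAFE_ATOMICS_FASTER 0
#endif

static unsigned int counter;

static inline void
bump_counter (void)
{
  if (AS_SAFE_ATOMICS_FASTER)
    /* The "x += 1" branch of the example: plain read-modify-write, cheaper
       but with the relaxed safety guarantees discussed in the thread.  */
    counter += 1;
  else
    /* The "atomic_add (&x, 1)" branch: a real atomic RMW.  */
    __atomic_fetch_add (&counter, 1, __ATOMIC_RELAXED);
}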
On 05-05-2014 07:17, Torvald Riegel wrote:
> On Fri, 2014-05-02 at 14:53 -0300, Adhemerval Zanella wrote:
>> On 02-05-2014 11:34, Torvald Riegel wrote:
>>> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
>>>> This patch adds a single-thread optimization for malloc atomic usage to
>>>> first check if process is single-thread (ST) and if so use normal
>>>> load/store instead of atomic instructions.
>>>>
>>>> It is a respin of my initial attempt to add ST optimization on PowerPC.
>>>> Now malloc optimization is arch-agnostic and rely in already defined macros
>>>> in GLIBC (SINGLE_THREAD_P).
>>> I'm not convinced that this is a good change. First, I'd like to see
>>> some performance numbers that show whether the change leads to a
>>> non-negligible increase in performance. That should be a
>>> microbenchmark, so we can track it over time.
>> In fact I would prefer to make it more arch-specific and let the the arch-maintainers
>> evaluate if such change would be preferable. One way to do it is adding a
>> new macro in the bits/libc-locks, some time __libc_atomic_*; and define it as
>> catomic_* in ./nptl/sysdeps/pthread/bits/libc-lock.h. Then arch can be replace
>> its implementation if it suit them.
> I agree that there is arch-specific input to this optimization problem,
> but that doesn't mean that the solution to it will be primarily
> arch-specific.
>
> The arch-specific input we have is:
> * Whether an atomic op (probably we'll be considering just RMW ops and
> CAS) is significantly slower than non-atomic code plus a branch. For
> example, the difference should be small on recent x86, but I assume it's
> higher on PowerPC. We'd need a microbenchmark of some sort to track the
> decision we make here.
> * Whether an atomic op that's MT-Safe and AS-Safe is significantly
> slower than an atomic op that's just AS-Safe. That needs a
> microbenchmark as well. For example, x86 currently assumes this is the
> case, but it's something we should revisit I guess.

What motivated me to propose this change is the malloc code in the single-thread case. To give you some numbers: currently on a POWER7/3GHz, using a simple loop that allocates 32 bytes, each call takes about 130ns. Profilers indicate that the culprit for this time is indeed the locks being used even when single-threaded. Using my first approach (sequential code for single-thread, not AS-safe), the same calls are now done in about 40ns. By just tuning the locks (not the atomics), the time only goes down to about 90ns. (A minimal benchmark sketch follows this message.)

And yes, I'm aware of the issue about being non-AS-safe. However, the CAS primitives, even the AS-safe ones with memory barriers, are currently implemented with LL/SC instructions (ldarx / stdcx.). And for the single-thread case the memory barriers are not what causes the latency: I checked by simply removing them (using CAS primitives without barriers) and it made no difference in performance; the problem is the use of the LL/SC instructions themselves. This holds at least for POWER6, POWER7, and POWER8.

> * If the former is the case, we need arch-specific atomics (i.e.,
> catomic_*) to make use of that.
>
> Thus, we'd end up with two constants and potentially some arch-specific
> variants. We could differentiate the constants as necessary per user
> (e.g., make one that's malloc-specific), but I'd prefer not to, and that
> can be done in the future.
>
> With that in place, we can look at the atomics users and see how they
> could exploit AS-Safe or AS-Unsafe+MT-Unsafe scenarios. We'd have a
> normal branch and let the compiler remove the dead code (we can use the
> preprocessor too, but I think that's worse).
> That allows us to really look at what we can do differently in a
> sequential-execution scenario -- so going beyond just tweaking one
> atomic op at a time.
>
> If we really need the single-atomic-op optimizations often, we can
> consider adding something like catomic. That is, if, for example,
> if (as_safe_atomics_faster) x+=1; else atomic_add(&x, 1);
> is needed often, we can add something like catomic.
> We might want to revisit current uses of catomic too.

That's why I would propose letting archs override the atomics in the malloc code with an instruction sequence that checks for multithreading and uses a code sequence that is not AS-Safe. I know that it imposes an additional consideration for AS-safeness in the malloc code; however, since malloc is not AS-safe right now, this arch-specific modification is safe. And if in the future we ever make malloc AS-Safe, this modification will surely need to be reevaluated or even dropped altogether. But I concede that this is a potentially misleading change, so it will indeed need some discussion.

As Ondrej has proposed, the best course of action would be to rework the malloc internals altogether and add something like thread caches to avoid lock contention (even for single-thread); however, for the *current* code these locks are a real issue for POWER, and we in the POWER toolchain team really want a way to circumvent them, if possible. I would understand if it is decided that the engineering cost is not worth the gains.

>
>>> malloc is currently marked as Async-Signal-Unsafe, but due to locks
>>> elsewhere in the implementation. The MTSafety docs even talk about what
>>> would be needed to make it AS-Safe. Your patch would prevent making it
>>> AS-Safe. The catomic_add that you replace are AS-Safe, the sequential
>>> code you add is not. Someone added those catomic_* and not sequential
>>> code, so this is at least indication that there's a disconnect here.
>> Another point to make this change arch specific.
> Why? The motivation to use catomic might have been driven by
> arch-specific properties, but you want to replace it with something
> functionally different.

I would like to replace the atomics in the malloc code and the libc lock/unlock only for POWER, since they are not currently AS-safe. But as I said, I will concede if the community decides it is not worth the engineering effort.

>
>> And if we even make malloc
>> functions AS-Safe in the future, the arch maintainer just need to reevaluate
>> this change (since it will either make the optimization not required or make
>> it unsafe).
> Whether malloc is AS-Safe or not should never be arch-specific IMO. If
> we use the constants as I outlined above, the malloc implementer can
> just build code for the different HW performance scenarios, and all the
> archs have to do is set the flags accordingly.
>
> We could even further reduce the implementation overhead for the AS-Safe
> MT-Unsafe atomics if using the newer GCC builtins for the C11 memory
> model by adding support for as special memory order that just provides
> AS-Safety.

For POWER in the single-thread scenario it is not the memory barriers around the LL/SC instructions that generate the latency, but the LL/SC instructions themselves. Even using the C11 memory model won't help in this case.

>
>>> You also add macros with nondescriptive names, that each have exactly
>>> one use. There's no documentation for how they differ, and you don't
>>> update the comments on the MT-Safe docs.
>>>
>>>> x86 probably would kike an atomic.h refactor
>>>> to avoid the double check for '__local_multiple_threads', but that is not
>>>> the aim of this patch.
>>> But your patch does introduce this, so you should take care of this as
>>> well. I guess you wouldn't be happy if someone change x86 without
>>> caring about Power?
>>>
>> I will drop this patch and rework on a arch-specific one.
> What do you think about the plan I outlined above?
>
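[Editorial note] The kind of loop described in the performance numbers above can be reproduced with a minimal microbenchmark along these lines. This is a sketch, not the benchmark used in the thread; the 32-byte size comes from the message, while the iteration count and the malloc+free pairing are assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Rough single-thread malloc/free latency probe: allocate and release a
   32-byte block repeatedly and report the average time per iteration.
   Results depend heavily on the allocator's fast paths; this is only
   meant to expose the cost of the locks/atomics discussed above.  */
int
main (void)
{
  enum { ITERS = 10 * 1000 * 1000 };
  struct timespec start, end;

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < ITERS; i++)
    {
      void *p = malloc (32);
      if (p == NULL)
        return 1;
      free (p);
    }
  clock_gettime (CLOCK_MONOTONIC, &end);

  double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
  printf ("%.1f ns per malloc/free pair\n", ns / ITERS);
  return 0;
}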
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 1120d4d..bb0aa82 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -231,6 +231,7 @@
 #include <errno.h>

 #include <shlib-compat.h>
+#include <sysdep-cancel.h>  /* For SINGLE_THREAD_P macro */

 /* For uintptr_t.  */
 #include <stdint.h>
@@ -243,6 +244,57 @@


 /*
+  Single-thread lock optimization: atomic primitives first check the number of
+  threads and avoid atomic instructions for single-thread case.
+ */
+#define MALLOC_ATOMIC_OR(mem, mask) \
+  ({ \
+    if (!SINGLE_THREAD_P) \
+      catomic_or (mem, mask); \
+    else \
+      *mem |= mask; \
+  })
+
+#define MALLOC_ATOMIC_AND(mem, mask) \
+  ({ \
+    if (!SINGLE_THREAD_P) \
+      catomic_and (mem, mask); \
+    else \
+      *mem &= mask; \
+  })
+
+#define MALLOC_ATOMIC_CAS_VAL_ACQ(mem, newval, oldval) \
+  ({ \
+    __typeof (*(mem)) __tmp; \
+    __typeof (mem) __memp = (mem); \
+    if (!SINGLE_THREAD_P) \
+      __tmp = catomic_compare_and_exchange_val_acq (mem, newval, oldval); \
+    else \
+      { \
+        __tmp = *__memp; \
+        if (__tmp == oldval) \
+          *__memp = newval; \
+      } \
+    __tmp; \
+  })
+
+#define MALLOC_ATOMIC_CAS_VAL_REL(mem, newval, oldval) \
+  ({ \
+    __typeof (*(mem)) __tmp; \
+    __typeof (mem) __memp = (mem); \
+    if (!SINGLE_THREAD_P) \
+      __tmp = catomic_compare_and_exchange_val_rel (mem, newval, oldval); \
+    else \
+      { \
+        __tmp = *__memp; \
+        if (__tmp == oldval) \
+          *__memp = newval; \
+      } \
+    __tmp; \
+  })
+
+
+/*
   Debugging:

   Because freed chunks may be overwritten with bookkeeping fields, this
@@ -1632,8 +1684,8 @@ typedef struct malloc_chunk *mfastbinptr;
 #define FASTCHUNKS_BIT (1U)

 #define have_fastchunks(M) (((M)->flags & FASTCHUNKS_BIT) == 0)
-#define clear_fastchunks(M) catomic_or (&(M)->flags, FASTCHUNKS_BIT)
-#define set_fastchunks(M) catomic_and (&(M)->flags, ~FASTCHUNKS_BIT)
+#define clear_fastchunks(M) MALLOC_ATOMIC_OR (&(M)->flags, FASTCHUNKS_BIT)
+#define set_fastchunks(M) MALLOC_ATOMIC_AND (&(M)->flags, ~FASTCHUNKS_BIT)

 /*
   NONCONTIGUOUS_BIT indicates that MORECORE does not return contiguous
@@ -3334,7 +3386,7 @@ _int_malloc (mstate av, size_t bytes)
           if (victim == NULL)
             break;
         }
-      while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
+      while ((pp = MALLOC_ATOMIC_CAS_VAL_ACQ (fb, victim->fd, victim))
             != victim);
       if (victim != 0)
         {
@@ -3903,7 +3955,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
          old_idx = fastbin_index(chunksize(old));
          p->fd = old2 = old;
        }
-      while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2)) != old2);
+      while ((old = MALLOC_ATOMIC_CAS_VAL_REL (fb, p, old2)) != old2);

       if (have_lock && old != NULL && __builtin_expect (old_idx != idx, 0))
        {