From patchwork Thu Apr 10 21:44:25 2014
X-Patchwork-Submitter: Adhemerval Zanella Netto
X-Patchwork-Id: 494
Message-ID: <534710B9.6010303@linux.vnet.ibm.com>
Date: Thu, 10 Apr 2014 18:44:25 -0300
From: Adhemerval Zanella
To: Torvald Riegel
CC: libc-alpha@sourceware.org
Subject: Re: [PATCH] Unify pthread_once (bug 15215)
References: <1368024237.7774.794.camel@triegel.csb> <519D97E4.4030808@redhat.com> <1381018836.8757.3598.camel@triegel.csb> <1381182784.18547.138.camel@triegel.csb> <533605BF.9000005@redhat.com> <534307E1.5050904@linux.vnet.ibm.com> <1397159860.10643.16419.camel@triegel.csb>
In-Reply-To: <1397159860.10643.16419.camel@triegel.csb>

On 10-04-2014 16:57, Torvald Riegel wrote:
> On Mon, 2014-04-07 at 17:17 -0300, Adhemerval Zanella wrote:
>>
>> I tested the patch and benchmark on PowerPC (POWER7) and the results
>> look good:
>>
>> * Current code:
>> "duration": 5.08322e+09, "iterations": 2.2037e+08, "max": 244.863, "min": 22.08, "mean": 23.0668
>> "duration": 5.08316e+09, "iterations": 2.20479e+08, "max": 771.08, "min": 21.984, "mean": 23.0551
>> "duration": 5.08306e+09, "iterations": 2.21093e+08, "max": 53.966, "min": 22.052, "mean": 22.9906
>> "duration": 5.0833e+09, "iterations": 2.20062e+08, "max": 347.895, "min": 22.15, "mean": 23.0994
>> "duration": 5.08277e+09, "iterations": 2.20699e+08, "max": 632.479, "min": 21.997, "mean": 23.0303
>>
>> * Optimization:
>> "duration": 4.97694e+09, "iterations": 8.42834e+08, "max": 134.181, "min": 5.405, "mean": 5.90501
>> "duration": 4.9758e+09, "iterations": 8.66952e+08, "max": 29.163, "min": 5.405, "mean": 5.73941
>> "duration": 4.9778e+09, "iterations": 8.51788e+08, "max": 40.819, "min": 5.405, "mean": 5.84394
>> "duration": 4.97413e+09, "iterations": 8.52432e+08, "max": 37.089, "min": 5.488, "mean": 5.83523
>> "duration": 4.97795e+09, "iterations": 8.43376e+08, "max": 163.703, "min": 5.405, "mean": 5.90241
>>
>> I am running on an 18-core machine, so I guess the 'max' variation is due to timing noise from OS jitter.
> There's no spinning in the algorithm currently (or most of glibc AFAIK,
> except the rather simplistic attempt in adaptive mutexes), so even small
> initialization routines may go straight to blocking via futexes.  (And
> AFAIK, futexes don't spin before actually trying to block.)

Yeap, that is what I observed.

>
>> Digging into the current code and the unified one, I noted the PowerPC one is currently doing a load+and+store-conditional
>> sequence followed by a full barrier.
> Hmm, the code I see is using __lll_acq_instr, which is an isync, which
> AFAICT is not a full barrier (hwsync).
>> This is overkill for some scenarios: if the initialization is already
>> done (val & 2), we don't need a full barrier (isync), since the requirement (as Torvald has explained)
>> is just a load acquire.
> The atomic_read_barrier used in the unified code seems to result in an
> lwsync.  Is that faster than an isync?  If not, the overhead may be due
> to something else.

You are correct, bad wording from me. In fact the issue here, from some profiling, is the load/store with *reservation*; the isync is in fact necessary in the current code as an acquire barrier. The plain atomic read/store shows better performance on POWER. And yes, lwsync is faster than isync (shown below by the experiment with load acquire/store release).

>
>> I ran make check and the results look good. I also checked the GCC issue that originated the fix
>> that added the release barrier in the code (gcc.gnu.org/bugzilla/show_bug.cgi?id=52839#c10), and
>> it does not show the issue with the new implementation.
> Thanks for doing the tests.  Would you want to remove the powerpc
> implementation after this patch has been committed?

Yes, I am planning to do it.

>
> I've also seen that there is a separate s390 implementation as well,
> which seems similar to the powerpc algorithm.  Has someone tested this
> one as well?

I don't have access to s390 hardware; I think we will need to contact the s390 maintainers.
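As an aside, the ordering requirement we keep discussing can be sketched with C11-style atomics. This is only an illustration with made-up names (sketch_once_t, sketch_once), not the glibc code, which uses its internal atomic macros, a generation counter for cancellation, and a futex-based slow path:

  #include <stdatomic.h>

  /* Illustrative only.  State encoding follows the unified algorithm:
     0 = not started, 1 = initialization in progress, 2 = finished.  */
  typedef struct { atomic_int state; } sketch_once_t;

  static void
  sketch_once (sketch_once_t *once, void (*init_routine) (void))
  {
    /* Acquire load: if we observe 2, we must also observe every data
       modification the initializer made before its release store.  */
    if (atomic_load_explicit (&once->state, memory_order_acquire) == 2)
      return;

    int expected = 0;
    if (atomic_compare_exchange_strong (&once->state, &expected, 1))
      {
        init_routine ();
        /* Release store: publish the initialized data before the done
           flag becomes visible to other threads.  */
        atomic_store_explicit (&once->state, 2, memory_order_release);
        /* The real code wakes blocked waiters via a futex here.  */
      }
    else
      /* The real code blocks on a futex; a spin loop keeps the sketch
         self-contained.  */
      while (atomic_load_explicit (&once->state, memory_order_acquire) != 2)
        ;
  }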
>
>> Finally, I also checked whether replacing 'atomic_read_barrier' and 'atomic_write_barrier' with
>> a load acquire and store release on POWER would show any difference. The results are not conclusive:
>>
>> "duration": 4.97196e+09, "iterations": 8.79836e+08, "max": 22.874, "min": 5.405, "mean": 5.651
>> "duration": 4.97144e+09, "iterations": 8.55294e+08, "max": 270.72, "min": 5.4, "mean": 5.81256
>> "duration": 4.97496e+09, "iterations": 8.67163e+08, "max": 27.274, "min": 5.405, "mean": 5.73705
>> "duration": 4.97603e+09, "iterations": 8.61568e+08, "max": 41.631, "min": 5.405, "mean": 5.77556
>> "duration": 4.97347e+09, "iterations": 8.54189e+08, "max": 41.457, "min": 5.405, "mean": 5.82244
>>
> Interesting.  Which HW barrier / code did you use for the load acquire?

I used this code to check the difference:

Your patch shows:

"duration": 4.97589e+09, "iterations": 8.73533e+08, "max": 64.068, "min": 5.301, "mean": 5.69628

While the load acquire/store release:

"duration": 4.9789e+09, "iterations": 8.3872e+08, "max": 54.193, "min": 5.906, "mean": 5.93631

As I pointed out earlier, 'max' shows a lot of variation (probably due to the futex calls), while 'min' does not vary much.

diff --git a/nptl/sysdeps/unix/sysv/linux/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
index 94e2427..a67ef67 100644
--- a/nptl/sysdeps/unix/sysv/linux/pthread_once.c
+++ b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
@@ -63,8 +63,19 @@ __pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
   /* We need acquire memory order for this load because if the value
      signals that initialization has finished, we need to see any data
      modifications done during initialization.  */
+#if 0
   val = *once_control;
   atomic_read_barrier();
+#endif
+  asm volatile (
+    "ld %0,%1\n\t"
+    "cmpw 7,%0,%0\n\t"
+    "bne- 7,$+4\n\t"
+    "isync\n\t"
+    : "=r" (val)
+    : "m"(*once_control)
+    : "cr7", "memory");
+
   do
     {
       /* Check if the initialization has already been done.  */
@@ -112,8 +123,14 @@ __pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
   /* Mark *once_control as having finished the initialization.  We need
      release memory order here because we need to synchronize with other
      threads that want to use the initialized data.  */
+#if 0
   atomic_write_barrier();
   *once_control = 2;
+#endif
+  asm volatile (
+    "lwsync\n\t"
+    "std %0,%1\n\t"
+    : : "r"(2), "m"(*once_control) : "memory");

   /* Wake up all other threads.  */
   lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
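A follow-up note: the same ordering could presumably also be expressed with the GCC __atomic builtins instead of hand-written asm. A minimal sketch, assuming a GCC recent enough to provide the builtins (the helper names are made up):

  #include <pthread.h>

  /* Sketch only: the acquire/release ordering from the experiment
     above, via GCC __atomic builtins.  On POWER the acquire load
     typically expands to a load followed by the cmp/bne-/isync idiom,
     and the release store to lwsync plus a store.  Note these keep the
     access at int width (pthread_once_t is an int), whereas the ld/std
     in the experiment above are doubleword accesses.  */

  static int
  once_load_acquire (pthread_once_t *once_control)
  {
    return __atomic_load_n (once_control, __ATOMIC_ACQUIRE);
  }

  static void
  once_store_release (pthread_once_t *once_control)
  {
    __atomic_store_n (once_control, 2, __ATOMIC_RELEASE);
  }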