From patchwork Thu Apr 10 21:44:25 2014
X-Patchwork-Submitter: Adhemerval Zanella Netto
X-Patchwork-Id: 494
Message-ID: <534710B9.6010303@linux.vnet.ibm.com>
Date: Thu, 10 Apr 2014 18:44:25 -0300
From: Adhemerval Zanella
To: Torvald Riegel
CC: libc-alpha@sourceware.org
Subject: Re: [PATCH] Unify pthread_once (bug 15215)
References: <1368024237.7774.794.camel@triegel.csb> <519D97E4.4030808@redhat.com> <1381018836.8757.3598.camel@triegel.csb> <1381182784.18547.138.camel@triegel.csb> <533605BF.9000005@redhat.com> <534307E1.5050904@linux.vnet.ibm.com> <1397159860.10643.16419.camel@triegel.csb>
In-Reply-To: <1397159860.10643.16419.camel@triegel.csb>

On 10-04-2014 16:57, Torvald Riegel wrote:
> On Mon, 2014-04-07 at 17:17 -0300, Adhemerval Zanella wrote:
>>
>> I tested the patch and benchmark on PowerPC (POWER7) and the results
>> look good:
>>
>> * Current code:
>> "duration": 5.08322e+09, "iterations": 2.2037e+08, "max": 244.863, "min": 22.08, "mean": 23.0668
>> "duration": 5.08316e+09, "iterations": 2.20479e+08, "max": 771.08, "min": 21.984, "mean": 23.0551
>> "duration": 5.08306e+09, "iterations": 2.21093e+08, "max": 53.966, "min": 22.052, "mean": 22.9906
>> "duration": 5.0833e+09, "iterations": 2.20062e+08, "max": 347.895, "min": 22.15, "mean": 23.0994
>> "duration": 5.08277e+09, "iterations": 2.20699e+08, "max": 632.479, "min": 21.997, "mean": 23.0303
>>
>> * Optimization:
>> "duration": 4.97694e+09, "iterations": 8.42834e+08, "max": 134.181, "min": 5.405, "mean": 5.90501
>> "duration": 4.9758e+09, "iterations": 8.66952e+08, "max": 29.163, "min": 5.405, "mean": 5.73941
>> "duration": 4.9778e+09, "iterations": 8.51788e+08, "max": 40.819, "min": 5.405, "mean": 5.84394
>> "duration": 4.97413e+09, "iterations": 8.52432e+08, "max": 37.089, "min": 5.488, "mean": 5.83523
>> "duration": 4.97795e+09, "iterations": 8.43376e+08, "max": 163.703, "min": 5.405, "mean": 5.90241
>>
>> I am running on an 18-core machine, so I guess the 'max' variation is due to timing noise from OS jitter.
> There's no spinning in the algorithm currently (or most of glibc AFAIK,
> except the rather simplistic attempt in adaptive mutexes), so even small
> initialization routines may go straight to blocking via futexes.  (And
> AFAIK, futexes don't spin before actually trying to block.)

Yeap, that is what I observed.

>
>> Digging into the current code and the unified one, I noted the PowerPC one is currently doing a load+and+store-conditional
>> sequence followed by a full barrier.
> Hmm, the code I see is using __lll_acq_instr, which is an isync, which
> AFAICT is not a full barrier (hwsync).
>> This is overkill for some scenarios: if the initialization is already
>> done (val & 2), we don't need a full barrier (isync), since the requirement (as Torvald has explained)
>> is just a load acquire.
> The atomic_read_barrier used in the unified code seems to result in an
> lwsync.  Is that faster than an isync?  If not, the overhead may be due
> to something else.

You are correct, bad wording from me. In fact the issue here, from some profiling, is the load/store with *reservation*; the isync is in fact necessary in the current code as an acquire barrier. The plain atomic read/store shows better performance on POWER. And yes, lwsync is faster than isync (shown below by the experiment with load acquire/store release).

>
>> I ran make check and the results look good. I also checked the GCC issue that originated the fix
>> that added the release barrier in the code (gcc.gnu.org/bugzilla/show_bug.cgi?id=52839#c10), and
>> it does not show the issue with the new implementation.
> Thanks for doing the tests.  Would you want to remove the powerpc
> implementation after this patch has been committed?

Yes, I am planning to do it.

>
> I've also seen that there is a separate s390 implementation as well,
> which seems similar to the powerpc algorithm.  Has someone tested this
> one as well?

I don't have access to s390 hardware; I think we will need to contact the s390 maintainers.
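As an aside, the ordering requirement we keep discussing can be sketched with C11-style atomics. This is only an illustration with made-up names (sketch_once_t, sketch_once), not the glibc code, which uses its internal atomic macros, a generation counter for cancellation, and a futex-based slow path:

  #include <stdatomic.h>

  /* Illustrative only.  State encoding follows the unified algorithm:
     0 = not started, 1 = initialization in progress, 2 = finished.  */
  typedef struct { atomic_int state; } sketch_once_t;

  static void
  sketch_once (sketch_once_t *once, void (*init_routine) (void))
  {
    /* Acquire load: if we observe 2, we must also observe every data
       modification the initializer made before its release store.  */
    if (atomic_load_explicit (&once->state, memory_order_acquire) == 2)
      return;

    int expected = 0;
    if (atomic_compare_exchange_strong (&once->state, &expected, 1))
      {
        init_routine ();
        /* Release store: publish the initialized data before the done
           flag becomes visible to other threads.  */
        atomic_store_explicit (&once->state, 2, memory_order_release);
        /* The real code wakes blocked waiters via a futex here.  */
      }
    else
      /* The real code blocks on a futex; a spin loop keeps the sketch
         self-contained.  */
      while (atomic_load_explicit (&once->state, memory_order_acquire) != 2)
        ;
  }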
>
>> Finally, I also checked whether replacing 'atomic_read_barrier' and 'atomic_write_barrier' with
>> a load acquire and store release on POWER would show any difference. The results are not conclusive:
>>
>> "duration": 4.97196e+09, "iterations": 8.79836e+08, "max": 22.874, "min": 5.405, "mean": 5.651
>> "duration": 4.97144e+09, "iterations": 8.55294e+08, "max": 270.72, "min": 5.4, "mean": 5.81256
>> "duration": 4.97496e+09, "iterations": 8.67163e+08, "max": 27.274, "min": 5.405, "mean": 5.73705
>> "duration": 4.97603e+09, "iterations": 8.61568e+08, "max": 41.631, "min": 5.405, "mean": 5.77556
>> "duration": 4.97347e+09, "iterations": 8.54189e+08, "max": 41.457, "min": 5.405, "mean": 5.82244
>>
> Interesting.  Which HW barrier / code did you use for the load acquire?

I used this code to check the difference:

Your patch shows:

"duration": 4.97589e+09, "iterations": 8.73533e+08, "max": 64.068, "min": 5.301, "mean": 5.69628

While the load acquire/store release:

"duration": 4.9789e+09, "iterations": 8.3872e+08, "max": 54.193, "min": 5.906, "mean": 5.93631

As I pointed out earlier, 'max' shows a lot of variation (probably due to the futex calls), while 'min' does not vary much.

diff --git a/nptl/sysdeps/unix/sysv/linux/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
index 94e2427..a67ef67 100644
--- a/nptl/sysdeps/unix/sysv/linux/pthread_once.c
+++ b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
@@ -63,8 +63,19 @@ __pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
   /* We need acquire memory order for this load because if the value
      signals that initialization has finished, we need to see any data
      modifications done during initialization.  */
+#if 0
   val = *once_control;
   atomic_read_barrier();
+#endif
+  asm volatile (
+    "ld %0,%1\n\t"
+    "cmpw 7,%0,%0\n\t"
+    "bne- 7,$+4\n\t"
+    "isync\n\t"
+    : "=r" (val)
+    : "m"(*once_control)
+    : "cr7", "memory");
+
   do
     {
       /* Check if the initialization has already been done.  */
@@ -112,8 +123,14 @@ __pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
   /* Mark *once_control as having finished the initialization.  We need
      release memory order here because we need to synchronize with other
      threads that want to use the initialized data.  */
+#if 0
   atomic_write_barrier();
   *once_control = 2;
+#endif
+  asm volatile (
+    "lwsync\n\t"
+    "std %0,%1\n\t"
+    : : "r"(2), "m"(*once_control) : "memory");

   /* Wake up all other threads.  */
   lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
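A follow-up note: the same ordering could presumably also be expressed with the GCC __atomic builtins instead of hand-written asm. A minimal sketch, assuming a GCC recent enough to provide the builtins (the helper names are made up):

  #include <pthread.h>

  /* Sketch only: the acquire/release ordering from the experiment
     above, via GCC __atomic builtins.  On POWER the acquire load
     typically expands to a load followed by the cmp/bne-/isync idiom,
     and the release store to lwsync plus a store.  Note these keep the
     access at int width (pthread_once_t is an int), whereas the ld/std
     in the experiment above are doubleword accesses.  */

  static int
  once_load_acquire (pthread_once_t *once_control)
  {
    return __atomic_load_n (once_control, __ATOMIC_ACQUIRE);
  }

  static void
  once_store_release (pthread_once_t *once_control)
  {
    __atomic_store_n (once_control, 2, __ATOMIC_RELEASE);
  }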