From patchwork Fri Dec 16 17:08:13 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Stefan Liebler <stli@linux.vnet.ibm.com>
X-Patchwork-Id: 18517
Received: (qmail 11815 invoked by alias); 16 Dec 2016 17:08:24 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 11666 invoked by uid 89); 16 Dec 2016 17:08:23 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.1 required=5.0 tests=AWL, BAYES_00,
	KAM_LAZY_DOMAIN_SECURITY, RCVD_IN_DNSWL_LOW,
	RCVD_IN_SEMBACKSCATTER autolearn=no version=3.3.2 spammy=Related,
	intention, sk:catomic, Liebler
X-HELO: mx0a-001b2d01.pphosted.com
Subject: Re: [PATCH 1/2] S390: Optimize atomic macros.
To: libc-alpha@sourceware.org
References: <1479913753-20506-1-git-send-email-stli@linux.vnet.ibm.com>
	<1479991865.7146.1526.camel@localhost.localdomain>
	<a6ad0218-bb22-1289-afa0-2b29d75fb438@linux.vnet.ibm.com>
From: Stefan Liebler <stli@linux.vnet.ibm.com>
Cc: Torvald Riegel <triegel@redhat.com>
Date: Fri, 16 Dec 2016 18:08:13 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
	Thunderbird/45.5.1
MIME-Version: 1.0
In-Reply-To: <a6ad0218-bb22-1289-afa0-2b29d75fb438@linux.vnet.ibm.com>
X-TM-AS-GCONF: 00
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 16121617-0032-0000-0000-0000025CDC12
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 16121617-0033-0000-0000-00001E2F5B75
Message-Id: <485d2d38-26ae-b941-e55d-a58f7559049b@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, ,
	definitions=2016-12-16_10:, , signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	spamscore=0 suspectscore=3
	malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam
	adjust=0 reason=mlx scancount=1 engine=8.0.1-1612050000
	definitions=main-1612160261

On 12/09/2016 04:16 PM, Stefan Liebler wrote:
> On 11/24/2016 01:51 PM, Torvald Riegel wrote:
>> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
>>> The atomic_compare_and_exchange_val_acq macro is now implemented with
>>> gcc __sync_val_compare_and_swap instead of an inline assembly with
>>> compare-and-swap instruction.
>>
>> Can you use the new GCC atomic builtins, or are they still insufficient
>> on s390?
>>
>> Related to that, can you define USE_ATOMIC_COMPILER_BUILTINS to 1?
>> Generally, this is where we want to get to, so that we have to maintain
>> as few atomic operations as possible.
>>
>> Also note that we'll be phasing out the old-style glibc atomics
>> eventually and use the new C11-like atomics.
> I've activated it and tried an example like this (also compare it with
> pthread_once):
> int
> foo (int *lock)
> {
>   int val, newval;
>   val = atomic_load_acquire (lock);
>   do
>     {
>       newval = val | 123;
>     }
>   while (!atomic_compare_exchange_weak_acquire (lock, &val, newval));
>
>   return 0;
> }
>
> The assembly produced by GCC 5.4/6.2 contains a new stack-frame, the old
> value is stored and loaded to/from the stack-frame before the
> cs-instruction.
> After the cs, the old value is stored again and reloaded if a further
> round of cs is needed.
> The condition-code of cs is extracted into a register and then compared
> to determine if a further round of cs is needed.
>
> The cs-instruction itself updates the old-value-register and a
> conditional jump can be used for a further round.
>
> An upstream gcc produces better code, but it's also not perfect!
> I've already informed Andreas Krebbel.
>
> Thus I won't set USE_ATOMIC_COMPILER_BUILTINS to 1 now.
> And if gcc is fixed, I will activate it only for this version and newer.
>
>>
>>> The memory is compared against expected OLDVAL before using
>>> compare-and-swap
>>> instruction in case of OLDVAL is constant at compile time.  This is
>>> used in
>>> various locking code.  If the lock is already aquired, another cpu
>>> has not to
>>> exclusively lock the memory.  If OLDVAL is not constant the
>>> compare-and-swap
>>> instruction is used directly as the usages of this macro usually load
>>> the
>>> current value in front of this macro.
>>
>> Generally, I think the compiler should do these optimizations (under the
>> assumption that we drive these using builtin_expect and the like).  If
>> GCC not doing these (with new _atomic* builtins), is there a patch about
>> that or at least a BZ?  I'd be okay with accepting an intermediate
>> solution that optimizes atomics in glibc, but I'd want to make this
>> clear in code comments and reference the GCC BZ that makes the
>> improvement request; this ensures that we can drop the glibc hack once
>> GCC does what you want it to do.
>>
> No. This comparision will not be included in the gcc builtins as they
> are intended to only be compare-and-swap instruction.
>
>> For this particular case, I'm hoping we can just fix this in the
>> compiler.
>>
>> Regarding your patch, the compiler barrier is in the wrong position for
>> an acquire MO operation; for the new C11-like atomics, it would be
>> unnecessary because we only guarantee relaxed MO in the failure case.
>>
> If *mem is equal to constant oldval, the sync builtin is considered as
> full barrier after loading *mem. If *mem is not equal to constant
> oldval, you are right there is no barrier after the load and the
> compiler could hoist memory accesses before the load of mem!?
> Thus I can add the compiler barrier in the else path like below.
> +#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)\
> +  ({ __asm__ __volatile__ ("" ::: "memory");            \
> +    __typeof (*(mem)) __atg1_ret;                \
> +    if (!__builtin_constant_p (oldval)                \
> +    || __builtin_expect ((__atg1_ret = *(mem))        \
> +                 == (oldval), 1))            \
> +      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval),\
> +                        (newval));    \
> +    else                            \
> +      __asm__ __volatile__ ("" ::: "memory");            \
> +    __atg1_ret; })
> The intention of the full barrier before the load and the if-statement
> is to "force" the load and compare directly before cs instruction.
>
>>> The same applies to atomic_compare_and_exchange_bool_acq which wasn't
>>> defined before.  Now it is implemented with gcc
>>> __sync_bool_compare_and_swap.
>>> If the macro is used as condition in an if/while expression, the
>>> condition
>>> code is used to e.g. jump directly to another code sequence.  Before
>>> this
>>> change, the old value returned by compare-and-swap instruction was
>>> compared
>>> with given OLDVAL to determine if e.g. a jump is needed.
> I will add the barrier here in the same way as mention above.
>>>
>>> The atomic_exchange_acq macro is now using the load-and-and
>>> instruction for a
>>> constant zero value instead of a compare-and-swap loop.  This
>>> instruction is
>>> available on a z196 zarch and higher cpus.  This is e.g. used in
>>> unlocking code.
>>
>> See above.  This should be fixed in the compiler.
> There is no sync-builtin which can be used for exchanging the value as
> done in atomic_exchange_acq. Thus I have to do it in this glibc-macro.
> But you are right, the compiler can fix this for c11 builtin
> __atomic_exchange_n.
> But even if it is fixed in a new version, builds with older compilers
> won't use the load-and-and instruction.
> In case of e.g. lll_unlock, an additional register loaded with the
> old-value and an additional jump is needed.
>
>>
>>> The newly defined atomic_exchange_and_add macro is implemented with gcc
>>> builtin __sync_fetch_and_add which uses load-and-add instruction on
>>> z196 zarch
>>> and higher cpus instead of a loop with compare-and-swap instruction.
>>> The same applies to atomic_or_val, atomic_and_val, ... macros, which use
>>> the appropiate z196 instruction.
>>
>> Can you just use the new _atomic builtins?  The compiler should just
>> give use _atomic* builtins that are optimized, and we should use them.
>>
> I've checked, if __atomic_fetch_or|add use lao|laa instructions.
> And yes they use it, too.
> As there are issues with the __atomic* builtins as mentioned above,
> the usage of the __sync_fetch_* is currently needed here.
> I don't intend to use a mix of __sync and __atomic builtins.
>
>> With that in place, we could just implement the old-style glibc atomic
>> operations based on the _atomic* builtins, and have much less code to
>> maintain.
>>
> Yes, you are right, this is the correct way in future.
> I saw at least aarch64 is doing that this way.
>
>>> The macros lll_trylock, lll_cond_trylock are extended by an
>>> __glibc_unlikely
>>> hint. With the hint gcc on s390 emits code in e.g. pthread_mutex_trylock
>>> which does not use jumps in case the lock is free.  Without the hint
>>> it had
>>> to jump if the lock was free.
>>
>> I think it's debatable whether trylock should expect the lock to be
>> free.  OTOH, if it's not free, I guess that we should be in the slow
>> path in most use cases anyway.  Either way, I think it's a choice we
>> should make generically for all architectures, and I want to keep
>> arch-specific code to a minimum.  Thus, please split this out and
>> propose it for the generic lowlevellock.
>>
> Okay. I will send a separate patch to propose this change for common-code.
>


Here is an update of this patch. The compiler barrier in 
atomic_compare_and_exchange_val_acq and 
atomic_compare_and_exchange_bool_acq is now fixed.
The hints in macros lll_trylock/lll_cond_trylock are seperated and 
posted here:
"[PATCH] Add __glibc_unlikely hint in lll_trylock, lll_cond_trylock."
https://www.sourceware.org/ml/libc-alpha/2016-12/msg00451.html

The patch-file contains the updated ChangeLog.

Bye.
Stefan

commit 587b15ad5de619a66d647553310c9da79da57a06
Author: Stefan Liebler <stli@linux.vnet.ibm.com>
Date:   Fri Dec 16 17:53:50 2016 +0100

    S390: Optimize atomic macros.
    
    The atomic_compare_and_exchange_val_acq macro is now implemented with
    gcc __sync_val_compare_and_swap instead of an inline assembly with
    compare-and-swap instruction.
    The memory is compared against expected OLDVAL before using compare-and-swap
    instruction in case of OLDVAL is constant at compile time.  This is used in
    various locking code.  If the lock is already aquired, another cpu has not to
    exclusively lock the memory.  If OLDVAL is not constant the compare-and-swap
    instruction is used directly as the usages of this macro usually load the
    current value in front of this macro.
    
    The same applies to atomic_compare_and_exchange_bool_acq which wasn't
    defined before.  Now it is implemented with gcc __sync_bool_compare_and_swap.
    If the macro is used as condition in an if/while expression, the condition
    code is used to e.g. jump directly to another code sequence.  Before this
    change, the old value returned by compare-and-swap instruction was compared
    with given OLDVAL to determine if e.g. a jump is needed.
    
    The atomic_exchange_acq macro is now using the load-and-and instruction for a
    constant zero value instead of a compare-and-swap loop.  This instruction is
    available on a z196 zarch and higher cpus.  This is e.g. used in unlocking code.
    
    The newly defined atomic_exchange_and_add macro is implemented with gcc
    builtin __sync_fetch_and_add which uses load-and-add instruction on z196 zarch
    and higher cpus instead of a loop with compare-and-swap instruction.
    The same applies to atomic_or_val, atomic_and_val, ... macros, which use
    the appropiate z196 instruction.
    
    ChangeLog:
    
    	* sysdeps/s390/atomic-machine.h
    	(__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN): New define.
    	(atomic_compare_and_exchange_val_acq):
    	Use __sync_val_compare_and_swap and first compare with non-atomic
    	instruction in case of OLDVAL is constant.
    	(atomic_compare_and_exchange_bool_acq): New define.
    	(atomic_exchange_acq): Use load-and-and instruction for constant
    	zero values, if available.
    	(atomic_exchange_and_add, catomic_exchange_and_add, atomic_or_val,
    	atomic_or, catomic_or, atomic_bit_test_set, atomic_and_val,
    	atomic_and, catomic_and): New define.

diff --git a/sysdeps/s390/atomic-machine.h b/sysdeps/s390/atomic-machine.h
index 4ba4107..e6956ff 100644
--- a/sysdeps/s390/atomic-machine.h
+++ b/sysdeps/s390/atomic-machine.h
@@ -45,76 +45,163 @@ typedef uintmax_t uatomic_max_t;
 
 #define USE_ATOMIC_COMPILER_BUILTINS 0
 
-
-#define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_16_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem);					      \
-     __typeof (*mem) __archold = (oldval);				      \
-     __asm__ __volatile__ ("cs %0,%2,%1"				      \
-			   : "+d" (__archold), "=Q" (*__archmem)	      \
-			   : "d" (newval), "m" (*__archmem) : "cc", "memory" );	\
-     __archold; })
-
 #ifdef __s390x__
 # define __HAVE_64B_ATOMICS 1
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem);					      \
-     __typeof (*mem) __archold = (oldval);				      \
-     __asm__ __volatile__ ("csg %0,%2,%1"				      \
-			   : "+d" (__archold), "=Q" (*__archmem)	      \
-			   : "d" ((long) (newval)), "m" (*__archmem) : "cc", "memory" ); \
-     __archold; })
 #else
 # define __HAVE_64B_ATOMICS 0
-/* For 31 bit we do not really need 64-bit compare-and-exchange. We can
-   implement them by use of the csd instruction. The straightforward
-   implementation causes warnings so we skip the definition for now.  */
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
 #endif
 
+#ifdef HAVE_S390_MIN_Z196_ZARCH_ASM_SUPPORT
+# define __ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN 1
+#else
+# define __ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN 0
+#endif
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return the old *MEM value.  */
+/* Compare *MEM against expected OLDVAL before using compare-and-swap
+   instruction in case of OLDVAL is constant.  This is used in various
+   locking code.  If the lock is already aquired another cpu has not to
+   exclusively lock the memory.  */
+#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)	\
+  ({ __asm__ __volatile__ ("" ::: "memory");				\
+    __typeof (*(mem)) __atg1_ret;					\
+    if (!__builtin_constant_p (oldval)					\
+	|| __builtin_expect ((__atg1_ret = *(mem))			\
+			     == (oldval), 1))				\
+      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval),	\
+						(newval));		\
+    else								\
+      __asm__ __volatile__ ("" ::: "memory");				\
+    __atg1_ret; })
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return zero if *MEM was changed or non-zero if no exchange happened.  */
+/* Same as with atomic_compare_and_exchange_val_acq,  constant OLDVALs are
+   compared before using compare-and-swap instruction.  As this macro is
+   normally used in conjunction with if or while, gcc emits a conditional-
+   branch instruction to use the condition-code of compare-and-swap instruction
+   instead of comparing the old value.  */
+#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval)	\
+  ({ __asm__ __volatile__ ("" ::: "memory");				\
+    int __atg3_ret = 1;							\
+    if (!__builtin_constant_p (oldval)					\
+	|| __builtin_expect (*(mem) == (oldval), 1))			\
+      __atg3_ret = !__sync_bool_compare_and_swap ((mem), (oldval),	\
+						  (newval));		\
+    else								\
+      __asm__ __volatile__ ("" ::: "memory");				\
+    __atg3_ret; })
+
 /* Store NEWVALUE in *MEM and return the old value.  */
 /* On s390, the atomic_exchange_acq is different from generic implementation,
    because the generic one does not use the condition-code of cs-instruction
-   to determine if looping is needed. Instead it saves the old-value and
-   compares it against old-value returned by cs-instruction.  */
+   to determine if looping is needed.  Instead it saves the old-value and
+   compares it against old-value returned by cs-instruction.
+   Setting a constant zero can be done with load-and-and instruction which
+   is available on a z196 zarch and higher cpus.  This is used in unlocking
+   code.  */
 #ifdef __s390x__
 # define atomic_exchange_acq(mem, newvalue)				\
-  ({ __typeof (mem) __atg5_memp = (mem);				\
-    __typeof (*(mem)) __atg5_oldval = *__atg5_memp;			\
-    __typeof (*(mem)) __atg5_value = (newvalue);			\
-    if (sizeof (*mem) == 4)						\
-      __asm__ __volatile__ ("0: cs %0,%2,%1\n"				\
-			    "   jl 0b"					\
-			    : "+d" (__atg5_oldval), "=Q" (*__atg5_memp)	\
-			    : "d" (__atg5_value), "m" (*__atg5_memp)	\
-			    : "cc", "memory" );				\
-     else if (sizeof (*mem) == 8)					\
-       __asm__ __volatile__ ("0: csg %0,%2,%1\n"			\
-			     "   jl 0b"					\
-			     : "+d" ( __atg5_oldval), "=Q" (*__atg5_memp) \
-			     : "d" ((long) __atg5_value), "m" (*__atg5_memp) \
-			     : "cc", "memory" );			\
-     else								\
-       abort ();							\
+  ({ __typeof (*(mem)) __atg5_oldval;					\
+    if (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN != 0			\
+	&& __builtin_constant_p (newvalue) && (newvalue) == 0)		\
+      {									\
+	__atg5_oldval = __sync_fetch_and_and (mem, 0);			\
+      }									\
+    else								\
+      {									\
+	__typeof (mem) __atg5_memp = (mem);				\
+	__atg5_oldval = *__atg5_memp;					\
+	__typeof (*(mem)) __atg5_value = (newvalue);			\
+	if (sizeof (*(mem)) == 4)					\
+	  __asm__ __volatile__ ("0: cs %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" (__atg5_oldval),			\
+				  "=Q" (*__atg5_memp)			\
+				: "d" (__atg5_value),			\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else if (sizeof (*(mem)) == 8)					\
+	  __asm__ __volatile__ ("0: csg %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" ( __atg5_oldval),		\
+				  "=Q" (*__atg5_memp)			\
+				: "d" ((long) __atg5_value),		\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else								\
+	  abort ();							\
+      }									\
      __atg5_oldval; })
 #else
 # define atomic_exchange_acq(mem, newvalue)				\
-  ({ __typeof (mem) __atg5_memp = (mem);				\
-    __typeof (*(mem)) __atg5_oldval = *__atg5_memp;			\
-    __typeof (*(mem)) __atg5_value = (newvalue);			\
-    if (sizeof (*mem) == 4)						\
-      __asm__ __volatile__ ("0: cs %0,%2,%1\n"				\
-			    "   jl 0b"					\
-			    : "+d" (__atg5_oldval), "=Q" (*__atg5_memp)	\
-			    : "d" (__atg5_value), "m" (*__atg5_memp)	\
-			    : "cc", "memory" );				\
+  ({ __typeof (*(mem)) __atg5_oldval;					\
+    if (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN != 0			\
+	&& __builtin_constant_p (newvalue) && (newvalue) == 0)		\
+      {									\
+	__atg5_oldval = __sync_fetch_and_and (mem, 0);			\
+      }									\
     else								\
-      abort ();								\
+      {									\
+	__typeof (mem) __atg5_memp = (mem);				\
+	__atg5_oldval = *__atg5_memp;					\
+	__typeof (*(mem)) __atg5_value = (newvalue);			\
+	if (sizeof (*(mem)) == 4)					\
+	  __asm__ __volatile__ ("0: cs %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" (__atg5_oldval),			\
+				  "=Q" (*__atg5_memp)			\
+				: "d" (__atg5_value),			\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else								\
+	  abort ();							\
+      }									\
     __atg5_oldval; })
 #endif
+
+/* Add VALUE to *MEM and return the old value of *MEM.  */
+/* The gcc builtin uses load-and-add instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_exchange_and_add(mem, value)	\
+  __sync_fetch_and_add (mem, value)
+#define catomic_exchange_and_add(mem, value)	\
+  atomic_exchange_and_add (mem, value)
+
+/* Atomically *mem |= mask and return the old value of *mem.  */
+/* The gcc builtin uses load-and-or instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_or_val(mem, mask)		\
+  __sync_fetch_and_or (mem, mask)
+/* Atomically *mem |= mask.  */
+#define atomic_or(mem, mask)			\
+  do {						\
+    atomic_or_val (mem, mask);			\
+  } while (0)
+#define catomic_or(mem, mask)			\
+  atomic_or (mem, mask)
+
+/* Atomically *mem |= 1 << bit and return true if the bit was set in old value
+   of *mem.  */
+/* The load-and-or instruction is used on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_bit_test_set(mem, bit)					\
+  ({ __typeof (*(mem)) __atg14_old;					\
+    __typeof (mem) __atg14_memp = (mem);				\
+    __typeof (*(mem)) __atg14_mask = ((__typeof (*(mem))) 1 << (bit));	\
+    __atg14_old = atomic_or_val (__atg14_memp, __atg14_mask);		\
+    __atg14_old & __atg14_mask; })
+
+/* Atomically *mem &= mask and return the old value of *mem.  */
+/* The gcc builtin uses load-and-and instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_and_val(mem, mask)		\
+  __sync_fetch_and_and (mem, mask)
+/* Atomically *mem &= mask.  */
+#define atomic_and(mem, mask)			\
+  do {						\
+    atomic_and_val (mem, mask);			\
+  } while (0)
+#define catomic_and(mem, mask)			\
+  atomic_and(mem, mask)