From patchwork Wed Nov 23 15:09:12 2016
X-Patchwork-Submitter: Stefan Liebler
X-Patchwork-Id: 17730
From: Stefan Liebler
To: libc-alpha@sourceware.org
Cc: Stefan Liebler
Subject: [PATCH 1/2] S390: Optimize atomic macros.
Date: Wed, 23 Nov 2016 16:09:12 +0100
Message-Id: <1479913753-20506-1-git-send-email-stli@linux.vnet.ibm.com>

The atomic_compare_and_exchange_val_acq macro is now implemented with the
gcc builtin __sync_val_compare_and_swap instead of inline assembly with the
compare-and-swap instruction.  If OLDVAL is constant at compile time, the
memory is compared against the expected OLDVAL before the compare-and-swap
instruction is used.  This pattern occurs in various locking code: if the
lock is already acquired, another cpu does not have to access the memory
exclusively.  If OLDVAL is not constant, the compare-and-swap instruction is
used directly, because callers of this macro usually load the current value
right before invoking it.

The same applies to atomic_compare_and_exchange_bool_acq, which was not
defined before and is now implemented with the gcc builtin
__sync_bool_compare_and_swap.  If the macro is used as the condition of an
if/while expression, the condition code set by the compare-and-swap
instruction is used directly, e.g. to branch to another code sequence.
Before this change, the old value returned by the compare-and-swap
instruction had to be compared with the given OLDVAL to decide whether to
branch.

The atomic_exchange_acq macro now uses the load-and-and instruction for a
constant zero value instead of a compare-and-swap loop.  This instruction is
available on z196 zarch and newer cpus and is used, for example, in
unlocking code.

The newly defined atomic_exchange_and_add macro is implemented with the gcc
builtin __sync_fetch_and_add, which uses the load-and-add instruction on
z196 zarch and newer cpus instead of a compare-and-swap loop.  The same
applies to the atomic_or_val, atomic_and_val, ... macros, which use the
appropriate z196 instructions.

The lll_trylock and lll_cond_trylock macros are extended with a
__glibc_unlikely hint.  With the hint, gcc on s390 emits code, e.g. in
pthread_mutex_trylock, that does not use jumps when the lock is free;
without the hint it has to jump if the lock is free.  (A small usage sketch
follows after the diff below.)
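For illustration only, here is a minimal stand-alone sketch of the
compare-before-CAS idea described above, written for an int lock word.  The
function name cas_val_acq_sketch is hypothetical; in the patch the same idea
is expressed by the atomic_compare_and_exchange_val_acq macro.

    /* Sketch only: do a plain load first.  If OLDVAL is a compile-time
       constant (the common case in locking code, e.g. 0 for "lock is free")
       and the loaded value already differs, skip the compare-and-swap and
       thereby the exclusive access to the cache line.  Relies on inlining
       and constant propagation for __builtin_constant_p to see a constant.  */
    static inline int
    cas_val_acq_sketch (int *mem, int newval, int oldval)
    {
      int old = *mem;
      if (!__builtin_constant_p (oldval)
          || __builtin_expect (old == oldval, 1))
        old = __sync_val_compare_and_swap (mem, oldval, newval);
      return old;
    }

With a constant oldval of 0 and an already-held lock, the function returns
the loaded value without an interlocked update; otherwise it behaves like a
plain compare-and-swap.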
ChangeLog:

        * sysdeps/s390/atomic-machine.h
        (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN): New define.
        (atomic_compare_and_exchange_val_acq):
        Use __sync_val_compare_and_swap and first compare with a non-atomic
        instruction in case OLDVAL is constant.
        (atomic_compare_and_exchange_bool_acq): New define.
        (atomic_exchange_acq): Use load-and-and instruction for constant
        zero values, if available.
        (atomic_exchange_and_add, catomic_exchange_and_add, atomic_or_val,
        atomic_or, catomic_or, atomic_bit_test_set, atomic_and_val,
        atomic_and, catomic_and): New define.
        * sysdeps/unix/sysv/linux/s390/lowlevellock.h
        (lll_trylock, lll_cond_trylock): New define.
---
 sysdeps/s390/atomic-machine.h               | 195 ++++++++++++++++++++--------
 sysdeps/unix/sysv/linux/s390/lowlevellock.h |  25 +++-
 2 files changed, 160 insertions(+), 60 deletions(-)

diff --git a/sysdeps/s390/atomic-machine.h b/sysdeps/s390/atomic-machine.h
index 4ba4107..650bf98 100644
--- a/sysdeps/s390/atomic-machine.h
+++ b/sysdeps/s390/atomic-machine.h
@@ -45,76 +45,159 @@ typedef uintmax_t uatomic_max_t;
 #define USE_ATOMIC_COMPILER_BUILTINS 0
 
-
-#define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_16_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem); \
-     __typeof (*mem) __archold = (oldval); \
-     __asm__ __volatile__ ("cs %0,%2,%1" \
-                           : "+d" (__archold), "=Q" (*__archmem) \
-                           : "d" (newval), "m" (*__archmem) : "cc", "memory" ); \
-     __archold; })
-
 #ifdef __s390x__
 # define __HAVE_64B_ATOMICS 1
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem); \
-     __typeof (*mem) __archold = (oldval); \
-     __asm__ __volatile__ ("csg %0,%2,%1" \
-                           : "+d" (__archold), "=Q" (*__archmem) \
-                           : "d" ((long) (newval)), "m" (*__archmem) : "cc", "memory" ); \
-     __archold; })
 #else
 # define __HAVE_64B_ATOMICS 0
-/* For 31 bit we do not really need 64-bit compare-and-exchange. We can
-   implement them by use of the csd instruction. The straightforward
-   implementation causes warnings so we skip the definition for now.  */
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
 #endif
 
+#ifdef HAVE_S390_MIN_Z196_ZARCH_ASM_SUPPORT
+# define __ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN 1
+#else
+# define __ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN 0
+#endif
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return the old *MEM value.  */
+/* Compare *MEM against the expected OLDVAL before using the compare-and-swap
+   instruction in case OLDVAL is constant.  This is used in various locking
+   code.  If the lock is already acquired, another cpu does not have to lock
+   the memory exclusively.  */
+#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
+  ({ __asm__ __volatile__ ("" ::: "memory"); \
+    __typeof (*(mem)) __atg1_ret; \
+    if (!__builtin_constant_p (oldval) \
+        || __builtin_expect ((__atg1_ret = *(mem)) \
+                             == (oldval), 1)) \
+      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval), \
+                                                (newval)); \
+    __atg1_ret; })
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return zero if *MEM was changed or non-zero if no exchange happened.  */
+/* Same as with atomic_compare_and_exchange_val_acq, constant OLDVALs are
+   compared before using the compare-and-swap instruction.  As this macro is
+   normally used in conjunction with if or while, gcc emits a conditional
+   branch that uses the condition code of the compare-and-swap instruction
+   instead of comparing the old value.  */
+#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
+  ({ __asm__ __volatile__ ("" ::: "memory"); \
+    int __atg3_ret = 1; \
+    if (!__builtin_constant_p (oldval) \
+        || __builtin_expect (*(mem) == (oldval), 1)) \
+      __atg3_ret = !__sync_bool_compare_and_swap ((mem), (oldval), \
+                                                  (newval)); \
+    __atg3_ret; })
+
 /* Store NEWVALUE in *MEM and return the old value.  */
 /* On s390, the atomic_exchange_acq is different from generic implementation,
    because the generic one does not use the condition-code of cs-instruction
-   to determine if looping is needed. Instead it saves the old-value and
-   compares it against old-value returned by cs-instruction. */
+   to determine if looping is needed.  Instead it saves the old-value and
+   compares it against old-value returned by cs-instruction.
+   Setting a constant zero can be done with the load-and-and instruction,
+   which is available on z196 zarch and higher cpus.  This is used in
+   unlocking code.  */
 #ifdef __s390x__
 # define atomic_exchange_acq(mem, newvalue) \
-  ({ __typeof (mem) __atg5_memp = (mem); \
-     __typeof (*(mem)) __atg5_oldval = *__atg5_memp; \
-     __typeof (*(mem)) __atg5_value = (newvalue); \
-     if (sizeof (*mem) == 4) \
-       __asm__ __volatile__ ("0: cs %0,%2,%1\n" \
-                             " jl 0b" \
-                             : "+d" (__atg5_oldval), "=Q" (*__atg5_memp) \
-                             : "d" (__atg5_value), "m" (*__atg5_memp) \
-                             : "cc", "memory" ); \
-     else if (sizeof (*mem) == 8) \
-       __asm__ __volatile__ ("0: csg %0,%2,%1\n" \
-                             " jl 0b" \
-                             : "+d" ( __atg5_oldval), "=Q" (*__atg5_memp) \
-                             : "d" ((long) __atg5_value), "m" (*__atg5_memp) \
-                             : "cc", "memory" ); \
-     else \
-       abort (); \
+  ({ __typeof (*(mem)) __atg5_oldval; \
+    if (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN != 0 \
+        && __builtin_constant_p (newvalue) && (newvalue) == 0) \
+      { \
+        __atg5_oldval = __sync_fetch_and_and (mem, 0); \
+      } \
+    else \
+      { \
+        __typeof (mem) __atg5_memp = (mem); \
+        __atg5_oldval = *__atg5_memp; \
+        __typeof (*(mem)) __atg5_value = (newvalue); \
+        if (sizeof (*(mem)) == 4) \
+          __asm__ __volatile__ ("0: cs %0,%2,%1\n" \
+                                " jl 0b" \
+                                : "+d" (__atg5_oldval), \
+                                  "=Q" (*__atg5_memp) \
+                                : "d" (__atg5_value), \
+                                  "m" (*__atg5_memp) \
+                                : "cc", "memory" ); \
+        else if (sizeof (*(mem)) == 8) \
+          __asm__ __volatile__ ("0: csg %0,%2,%1\n" \
+                                " jl 0b" \
+                                : "+d" ( __atg5_oldval), \
+                                  "=Q" (*__atg5_memp) \
+                                : "d" ((long) __atg5_value), \
+                                  "m" (*__atg5_memp) \
+                                : "cc", "memory" ); \
+        else \
+          abort (); \
+      } \
     __atg5_oldval; })
 #else
 # define atomic_exchange_acq(mem, newvalue) \
-  ({ __typeof (mem) __atg5_memp = (mem); \
-     __typeof (*(mem)) __atg5_oldval = *__atg5_memp; \
-     __typeof (*(mem)) __atg5_value = (newvalue); \
-     if (sizeof (*mem) == 4) \
-       __asm__ __volatile__ ("0: cs %0,%2,%1\n" \
-                             " jl 0b" \
-                             : "+d" (__atg5_oldval), "=Q" (*__atg5_memp) \
-                             : "d" (__atg5_value), "m" (*__atg5_memp) \
-                             : "cc", "memory" ); \
+  ({ __typeof (*(mem)) __atg5_oldval; \
+    if (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN != 0 \
+        && __builtin_constant_p (newvalue) && (newvalue) == 0) \
+      { \
+        __atg5_oldval = __sync_fetch_and_and (mem, 0); \
+      } \
     else \
-      abort (); \
+      { \
+        __typeof (mem) __atg5_memp = (mem); \
+        __atg5_oldval = *__atg5_memp; \
+        __typeof (*(mem)) __atg5_value = (newvalue); \
+        if (sizeof (*(mem)) == 4) \
+          __asm__ __volatile__ ("0: cs %0,%2,%1\n" \
+                                " jl 0b" \
+                                : "+d" (__atg5_oldval), \
+                                  "=Q" (*__atg5_memp) \
+                                : "d" (__atg5_value), \
+                                  "m" (*__atg5_memp) \
+                                : "cc", "memory" ); \
+        else \
+          abort (); \
+      } \
    __atg5_oldval; })
 #endif
+
+/* Add VALUE to *MEM and return the old value of *MEM.  */
+/* The gcc builtin uses the load-and-add instruction on z196 zarch and higher
+   cpus instead of a loop with compare-and-swap instruction.  */
+#define atomic_exchange_and_add(mem, value) \
+  __sync_fetch_and_add (mem, value)
+#define catomic_exchange_and_add(mem, value) \
+  atomic_exchange_and_add (mem, value)
+
+/* Atomically *mem |= mask and return the old value of *mem.  */
+/* The gcc builtin uses the load-and-or instruction on z196 zarch and higher
+   cpus instead of a loop with compare-and-swap instruction.  */
+#define atomic_or_val(mem, mask) \
+  __sync_fetch_and_or (mem, mask)
+/* Atomically *mem |= mask.  */
+#define atomic_or(mem, mask) \
+  do { \
+    atomic_or_val (mem, mask); \
+  } while (0)
+#define catomic_or(mem, mask) \
+  atomic_or (mem, mask)
+
+/* Atomically *mem |= 1 << bit and return true if the bit was set in the old
+   value of *mem.  */
+/* The load-and-or instruction is used on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_bit_test_set(mem, bit) \
+  ({ __typeof (*(mem)) __atg14_old; \
+    __typeof (mem) __atg14_memp = (mem); \
+    __typeof (*(mem)) __atg14_mask = ((__typeof (*(mem))) 1 << (bit)); \
+    __atg14_old = atomic_or_val (__atg14_memp, __atg14_mask); \
+    __atg14_old & __atg14_mask; })
+
+/* Atomically *mem &= mask and return the old value of *mem.  */
+/* The gcc builtin uses the load-and-and instruction on z196 zarch and higher
+   cpus instead of a loop with compare-and-swap instruction.  */
+#define atomic_and_val(mem, mask) \
+  __sync_fetch_and_and (mem, mask)
+/* Atomically *mem &= mask.  */
+#define atomic_and(mem, mask) \
+  do { \
+    atomic_and_val (mem, mask); \
+  } while (0)
+#define catomic_and(mem, mask) \
+  atomic_and (mem, mask)
diff --git a/sysdeps/unix/sysv/linux/s390/lowlevellock.h b/sysdeps/unix/sysv/linux/s390/lowlevellock.h
index ada2e5b..8d564ed 100644
--- a/sysdeps/unix/sysv/linux/s390/lowlevellock.h
+++ b/sysdeps/unix/sysv/linux/s390/lowlevellock.h
@@ -21,13 +21,30 @@
 
 #include <sysdeps/nptl/lowlevellock.h>
 
+#undef lll_trylock
+/* If LOCK is 0 (not acquired), set to 1 (acquired with no waiters) and return
+   0.  Otherwise leave the lock unchanged and return non-zero to indicate that
+   the lock was not acquired.  */
+/* With the __glibc_unlikely hint, gcc on s390 emits code, e.g. in
+   pthread_mutex_trylock, which does not use jumps if the lock is free.
+   Without the hint it has to jump if the lock is free.  */
+#define lll_trylock(lock) \
+  __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 1, 0))
+
+#undef lll_cond_trylock
+/* If LOCK is 0 (not acquired), set to 2 (acquired, possibly with waiters) and
+   return 0.  Otherwise leave the lock unchanged and return non-zero to
+   indicate that the lock was not acquired.  */
+#define lll_cond_trylock(lock) \
+  __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 2, 0))
+
 /* Transactional lock elision definitions.  */
-# ifdef ENABLE_LOCK_ELISION
+#ifdef ENABLE_LOCK_ELISION
 extern int __lll_timedlock_elision
   (int *futex, short *adapt_count, const struct timespec *timeout, int private)
   attribute_hidden;
 
-# define lll_timedlock_elision(futex, adapt_count, timeout, private) \
+# define lll_timedlock_elision(futex, adapt_count, timeout, private)    \
   __lll_timedlock_elision(&(futex), &(adapt_count), timeout, private)
 
 extern int __lll_lock_elision (int *futex, short *adapt_count, int private)
@@ -39,12 +56,12 @@ extern int __lll_unlock_elision(int *futex, int private)
 extern int __lll_trylock_elision(int *futex, short *adapt_count)
   attribute_hidden;
 
-# define lll_lock_elision(futex, adapt_count, private) \
+# define lll_lock_elision(futex, adapt_count, private)                  \
   __lll_lock_elision (&(futex), &(adapt_count), private)
 # define lll_unlock_elision(futex, adapt_count, private) \
   __lll_unlock_elision (&(futex), private)
 # define lll_trylock_elision(futex, adapt_count) \
   __lll_trylock_elision(&(futex), &(adapt_count))
-# endif /* ENABLE_LOCK_ELISION */
+#endif /* ENABLE_LOCK_ELISION */
 
 #endif /* lowlevellock.h */
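
For reference, a hedged stand-alone sketch (hypothetical caller, not part of
the patch) of how the hinted trylock and the constant-zero exchange are
expected to combine; __glibc_unlikely expands to __builtin_expect ((cond), 0).

    /* Models lll_trylock: 0 -> 1 via compare-and-swap; a non-zero result
       means the lock was already held.  The unlikely hint lets gcc on s390
       lay out the uncontended path as straight-line code without a taken
       branch.  */
    static inline int
    trylock_sketch (int *lock)
    {
      return __builtin_expect (!__sync_bool_compare_and_swap (lock, 0, 1), 0);
    }

    int
    try_enter_sketch (int *lock)
    {
      if (trylock_sketch (lock))
        return 1;            /* Contended; caller may block or retry.  */
      /* ... critical section ...  */
      /* Release: models atomic_exchange_acq (lock, 0), which the patch maps
         to load-and-and (here __sync_fetch_and_and) on z196 and newer.  */
      __sync_fetch_and_and (lock, 0);
      return 0;
    }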