[2/2] S390: Test if lock is free before using atomic instruction in spin-lock.

Message ID 1479913753-20506-2-git-send-email-stli@linux.vnet.ibm.com
State Rejected

Commit Message

Stefan Liebler Nov. 23, 2016, 3:09 p.m. UTC
  This patch changes the behaviour of pthread_spin_lock on s390:
Instead of spinning on the lock with a compare-and-swap instruction,
the atomic_compare_and_exchange_bool_acq macro is used.
The s390 atomic_compare_and_exchange_bool_acq macro, called with constant
zero as oldval, first compares the lock with normal instructions.  If it
is free, the compare-and-swap instruction is used to acquire the lock.
While the lock is held by one CPU, another CPU will not exclusively lock
the memory of the lock in a loop.  If the lock is unlocked by another CPU,
this is observed by the normal instruction and the lock is then acquired
with a compare-and-swap instruction.
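Open-coded, the described behaviour corresponds roughly to the following
sketch (illustration only; the function name is hypothetical and the patch
itself only calls the existing macro):

int
spin_lock_sketch (pthread_spinlock_t *lock)
{
  while (1)
    {
      /* Plain load: do not request exclusive access to the cache line
         while the lock is held by another CPU.  */
      if (*lock == 0
          /* The lock looks free: try to take it with compare-and-swap.  */
          && __sync_bool_compare_and_swap (lock, 0, 1))
        return 0;
    }
}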

ChangeLog:

	* sysdeps/s390/nptl/pthread_spin_lock.c:
	Use atomic_compare_and_exchange_bool_acq macro instead of
	inline assembly.
	* sysdeps/s390/nptl/pthread_spin_trylock.c: Likewise.
---
 sysdeps/s390/nptl/pthread_spin_lock.c    | 13 +++++++------
 sysdeps/s390/nptl/pthread_spin_trylock.c | 12 +++++-------
 2 files changed, 12 insertions(+), 13 deletions(-)
  

Comments

Torvald Riegel Nov. 24, 2016, 1:04 p.m. UTC | #1
On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
> This patch changes the behaviour of pthread_spin_lock on s390:
> Instead of spinning on the lock with a compare-and-swap instruction,
> the atomic_compare_and_exchange_bool_acq macro is used.
> The s390 atomic_compare_and_exchange_bool_acq macro, called with constant
> zero as oldval, first compares the lock with normal instructions.  If it
> is free, the compare-and-swap instruction is used to acquire the lock.
> While the lock is held by one CPU, another CPU will not exclusively lock
> the memory of the lock in a loop.  If the lock is unlocked by another CPU,
> this is observed by the normal instruction and the lock is then acquired
> with a compare-and-swap instruction.

I don't want to have more arch-specific code than we really need.
Please compare what you have against the generic spinlock code.  If you
find the generic spinlock code lacking, then please propose a change for
it in a way that does not make things worse for other architectures.

If there are arch-specific properties that significantly affect the
approach the generic spinlock takes (e.g., assumptions about CAS vs
atomic-exchange runtime overheads), then please split them out.

In the long term, this will also benefit s390.  For example, the
spinlock code you have has no backoff, and introduces spinning with
loads in a pretty ugly way through the hack you have added to the CAS
(first loading the lock's value and comparing it manually if the
supplied argument is a constant).
While the generic spinlock code we have is also not very great,
improving it is what will make life easier for the whole glibc project.
Also, if someone else improves the generic spinlock code, s390 would
miss out on this if you have added your custom spinlock code.
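(Just to illustrate what I mean by spinning with loads and backoff: roughly
something like the sketch below, using the C11-like macro names, a
hypothetical function name and untuned constants; the exact shape is of
course up for discussion.)

int
spin_lock_with_backoff_sketch (int *lock)
{
  unsigned int delay = 1;

  while (atomic_exchange_acquire (lock, 1) != 0)
    {
      /* Wait with plain loads so contending CPUs do not keep stealing the
         cache line, and back off a bit more after each failed round.  */
      while (atomic_load_relaxed (lock) != 0)
        {
          for (unsigned int i = 0; i < delay; i++)
            __asm__ __volatile__ ("");  /* Compiler barrier as a small delay.  */
          if (delay < 1024)
            delay *= 2;
        }
    }
  return 0;
}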
  
Stefan Liebler Dec. 9, 2016, 3:16 p.m. UTC | #2
On 11/24/2016 02:04 PM, Torvald Riegel wrote:
> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
>> This patch changes the behaviour of pthread_spin_lock on s390:
>> Instead of spinning on the lock with a compare-and-swap instruction,
>> the atomic_compare_and_exchange_bool_acq macro is used.
>> The s390 atomic_compare_and_exchange_bool_acq macro, called with constant
>> zero as oldval, first compares the lock with normal instructions.  If it
>> is free, the compare-and-swap instruction is used to acquire the lock.
>> While the lock is held by one CPU, another CPU will not exclusively lock
>> the memory of the lock in a loop.  If the lock is unlocked by another CPU,
>> this is observed by the normal instruction and the lock is then acquired
>> with a compare-and-swap instruction.
>
> I don't want to have more arch-specific code than we really need.
> Please compare what you have against the generic spinlock code.  If you
> find the generic spinlock code lacking, then please propose a change for
> it in a way that does not make things worse for other architectures.
>
> If there are arch-specific properties that significantly affect the
> approach the generic spinlock takes (e.g., assumptions about CAS vs
> atomic-exchange runtime overheads), then please split them out.
>
> In the long term, this will also benefit s390.  For example, the
> spinlock code you have has no backoff, and introduces spinning with
> loads in a pretty ugly way through the hack you have added to the CAS
> (first loading the lock's value and comparing it manually if the
> supplied argument is a constant).
> While the generic spinlock code we have is also not very great,
> improving it is what will make life easier for the whole glibc project.
> Also, if someone else improves the generic spinlock code, s390 would
> miss out on this if you have added your custom spinlock code.
>

I had a look at the assembler output of the generic spinlock code, e.g.:
int
pthread_spin_trylock (pthread_spinlock_t *lock)
{
   return atomic_exchange_acq (lock, 1) ? EBUSY : 0;
}

In the s390x assembler output, a new stack frame is generated; the lock
value is loaded, stored to the stack, loaded back from the stack and then
passed to the cs instruction.
After the cs instruction, the old value is stored to the stack, loaded back
from the stack and then compared against zero.

I also had a look at the aarch64 pthread_spin_trylock.os compiled with the
build-many script and GCC 6.2.
aarch64 uses the __atomic builtins for atomic_exchange_acq,
atomic_compare_and_exchange_val_acq, ... .
The atomic_exchange_acq results in the exclusive load/store.  The old value
is then also stored to the stack and immediately loaded back in order to
compare it against zero.

The type pthread_spinlock_t is a volatile int!
If lock is cast to (int *) before passing it to the atomic macros,
the assembler output looks okay.
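For reference, the cast variant I tried is essentially the generic code from
above with only the cast added (a sketch of the experiment, not proposed
code):

int
pthread_spin_trylock (pthread_spinlock_t *lock)
{
   return atomic_exchange_acq ((int *) lock, 1) ? EBUSY : 0;
}

With the cast, the temporaries created inside the macro are plain int
instead of volatile int, so the compiler can keep them in registers.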

If I change the current generic spinlock code, do I have to rewrite it
to use the C11-like macros?

I've tested the following function in advance:
int
foo (pthread_spinlock_t *lock)
{
   return atomic_load_acquire (lock);
}

With the __atomic_load_n version of atomic_load_acquire, the assembler
output contains a single load whose result is returned, as expected.

The non-C11 macro results in the following:
    0:   58 10 20 00             l       %r1,0(%r2)
    4:   b3 c1 00 0f             ldgr    %f0,%r15
    8:   e3 f0 ff 58 ff 71       lay     %r15,-168(%r15)
    e:   50 10 f0 a4             st      %r1,164(%r15)
   12:   58 10 f0 a4             l       %r1,164(%r15)
   16:   50 10 f0 a0             st      %r1,160(%r15)
   1a:   58 20 f0 a0             l       %r2,160(%r15)
   1e:   b3 cd 00 f0             lgdr    %r15,%f0
   22:   b9 14 00 22             lgfr    %r2,%r2
   26:   07 fe                   br      %r14
The extra stores/loads to/from the stack result from the "__typeof (*(mem))
__atg101_val" usages in the atomic_load_* macros in conjunction with volatile.
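Simplified, the non-C11 pattern looks roughly like the sketch below (macro
name hypothetical; the real macros in include/atomic.h are more involved).
With a volatile int * argument, __typeof (*(mem)) is volatile int, so the
temporary itself is volatile and every access to it has to go through the
stack:

#define load_acquire_sketch(mem)                \
  ({ __typeof (*(mem)) __val = *(mem);          \
     __asm__ __volatile__ ("" ::: "memory");    \
     __val; })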

As this behaviour is not s390-specific, I've grep'ed to see which arches
use the generic spinlock code and the C11-like atomic macros:
sysdeps/hppa/nptl/pthread_spin_lock.c:18:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/unix/sysv/linux/hppa/atomic-machine.h:40:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/microblaze/nptl/pthread_spin_lock.c:19:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/microblaze/atomic-machine.h:39:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/aarch64/nptl/pthread_spin_lock.c:19:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/aarch64/atomic-machine.h:40:#define USE_ATOMIC_COMPILER_BUILTINS 1

sysdeps/mips/nptl/pthread_spin_lock.c:18:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/mips/atomic-machine.h:95:#define USE_ATOMIC_COMPILER_BUILTINS 1
sysdeps/mips/atomic-machine.h:216:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/nios2/nptl/pthread_spin_lock.c:19:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/unix/sysv/linux/nios2/atomic-machine.h:35:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/arm/nptl/pthread_spin_lock.c:18:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/arm/atomic-machine.h:37:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/m68k/nptl/pthread_spin_lock.c:19:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/m68k/m680x0/m68020/atomic-machine.h:48:#define USE_ATOMIC_COMPILER_BUILTINS 0
sysdeps/m68k/coldfire/atomic-machine.h:54:#define USE_ATOMIC_COMPILER_BUILTINS 0
sysdeps/unix/sysv/linux/m68k/coldfire/atomic-machine.h:40:#define USE_ATOMIC_COMPILER_BUILTINS 0

What is your suggestion on how to proceed with the volatile int type in
conjunction with the atomic macros?

Bye Stefan
  
Stefan Liebler Dec. 16, 2016, 5:08 p.m. UTC | #3
On 12/09/2016 04:16 PM, Stefan Liebler wrote:
> On 11/24/2016 02:04 PM, Torvald Riegel wrote:
>> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
>>> This patch changes the behaviour of pthread_spin_lock on s390:
>>> Instead of spinning on the lock with a compare-and-swap instruction,
>>> the atomic_compare_and_exchange_bool_acq macro is used.
>>> The s390 atomic_compare_and_exchange_bool_acq macro, called with constant
>>> zero as oldval, first compares the lock with normal instructions.  If it
>>> is free, the compare-and-swap instruction is used to acquire the lock.
>>> While the lock is held by one CPU, another CPU will not exclusively lock
>>> the memory of the lock in a loop.  If the lock is unlocked by another CPU,
>>> this is observed by the normal instruction and the lock is then acquired
>>> with a compare-and-swap instruction.
>>
>> I don't want to have more arch-specific code than we really need.
>> Please compare what you have against the generic spinlock code.  If you
>> find the generic spinlock code lacking, then please propose a change for
>> it in a way that does not make things worse for other architectures.
>>
>> If there are arch-specific properties that significantly affect the
>> approach the generic spinlock takes (e.g., assumptions about CAS vs
>> atomic-exchange runtime overheads), then please split them out.
>>
>> In the long term, this will also benefit s390.  For example, the
>> spinlock code you have has no backoff, and introduces spinning with
>> loads in a pretty ugly way through the hack you have added to the CAS
>> (first loading the lock's value and comparing it manually if the
>> supplied argument is a constant).
>> While the generic spinlock code we have is also not very great,
>> improving it is what will make life easier for the whole glibc project.
>> Also, if someone else improves the generic spinlock code, s390 would
>> miss out on this if you have added your custom spinlock code.
>>
>
> I had a look into the assembler output of generic spinlock code, e.g:
> int
> pthread_spin_trylock (pthread_spinlock_t *lock)
> {
>   return atomic_exchange_acq (lock, 1) ? EBUSY : 0;
> }
>
> On s390x assembler output, a new stack-frame is generated, the lock
> value is loaded, stored to stack, loaded from stack and then passed to
> cs-instruction.
> After cs-instruction, the "old-value" is stored to stack, loaded from
> stack and then compared to zero.
>
> I also had a look into the aarch64 pthread_spin_trylock.os compiled with
> build-many-script and a gcc 6.2.
> aarch64 is using the __atomic-builtins for atomic_exchange_acq,
> atomic_compare_and_exchange_val_acq, ... .
> The atomic_exchange_acq results in the exclusive load/store. Then the
> old-value is also stored to stack and is immediately loaded back in
> order to compare it against zero.
>
> The type pthread_spinlock_t is a volatile int!
> If lock is cast to (int *) before passing it to the atomic macros,
> the assembler output looks okay.
>
> If I change the current generic spinlock code, do I have to rewrite it
> to the c11-like macros?
>
> I've tested the following function in advance:
> int
> foo (pthread_spinlock_t *lock)
> {
>   return atomic_load_acquire (lock);
> }
>
> With the __atomic_load_n version of atomic_load_acquire, the assembler
> output contains a load which is returned as expected.
>
> The non-c11-macro results in the following:
>    0:   58 10 20 00             l       %r1,0(%r2)
>    4:   b3 c1 00 0f             ldgr    %f0,%r15
>    8:   e3 f0 ff 58 ff 71       lay     %r15,-168(%r15)
>    e:   50 10 f0 a4             st      %r1,164(%r15)
>   12:   58 10 f0 a4             l       %r1,164(%r15)
>   16:   50 10 f0 a0             st      %r1,160(%r15)
>   1a:   58 20 f0 a0             l       %r2,160(%r15)
>   1e:   b3 cd 00 f0             lgdr    %r15,%f0
>   22:   b9 14 00 22             lgfr    %r2,%r2
>   26:   07 fe                   br      %r14
> The extra stores/loads to/from stack result from the "__typeof (*(mem))
> __atg101_val" usages in atomic_load_* macros in conjunction with volatile.
>
> As this behaviour is not s390 specific, I've grep'ed to see which arches
> use the generic spin-lock code and the c11-like-atomic-macros:
> sysdeps/hppa/nptl/pthread_spin_lock.c:18:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/unix/sysv/linux/hppa/atomic-machine.h:40:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/microblaze/nptl/pthread_spin_lock.c:19:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/microblaze/atomic-machine.h:39:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/aarch64/nptl/pthread_spin_lock.c:19:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/aarch64/atomic-machine.h:40:#define USE_ATOMIC_COMPILER_BUILTINS 1
>
> sysdeps/mips/nptl/pthread_spin_lock.c:18:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/mips/atomic-machine.h:95:#define USE_ATOMIC_COMPILER_BUILTINS 1
> sysdeps/mips/atomic-machine.h:216:#define USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/nios2/nptl/pthread_spin_lock.c:19:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/unix/sysv/linux/nios2/atomic-machine.h:35:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/arm/nptl/pthread_spin_lock.c:18:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/arm/atomic-machine.h:37:#define USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/m68k/nptl/pthread_spin_lock.c:19:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/m68k/m680x0/m68020/atomic-machine.h:48:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
> sysdeps/m68k/coldfire/atomic-machine.h:54:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
> sysdeps/unix/sysv/linux/m68k/coldfire/atomic-machine.h:40:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
>
> What is your suggestion on how to proceed with the volatile int type in
> conjunction with the atomic macros?
>
> Bye Stefan
>
This patch is not needed anymore, as I've posted an adjusted version of the
generic spinlock code:
[PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.
https://www.sourceware.org/ml/libc-alpha/2016-12/msg00617.html

[PATCH 2/2] S390: Use generic spinlock code.
https://www.sourceware.org/ml/libc-alpha/2016-12/msg00618.html

Please have a look.

Bye.
Stefan
  
Torvald Riegel Dec. 16, 2016, 7:21 p.m. UTC | #4
On Fri, 2016-12-16 at 18:08 +0100, Stefan Liebler wrote:
> > What is your suggestion on how to proceed with the volatile int type in
> > conjunction with the atomic macros?

Good catch re volatile.  It should not be volatile, which is consistent
with how we do atomics elsewhere.  We shouldn't change the user-facing
type, but should cast to non-volatile internally.
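Roughly like this, i.e. only the internal accesses change (a sketch using
the C11-like macro names; the final generic code may of course look
different):

int
pthread_spin_lock (pthread_spinlock_t *lock)
{
  /* Keep the user-facing type, but drop volatile for the atomic accesses.  */
  int *p = (int *) lock;

  while (atomic_exchange_acquire (p, 1) != 0)
    /* Spin with plain loads until the lock appears to be free again.  */
    while (atomic_load_relaxed (p) != 0)
      ;
  return 0;
}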

> This patch is not needed anymore as I've posted an adjusted generic 
> spinlock code:
> [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.
> https://www.sourceware.org/ml/libc-alpha/2016-12/msg00617.html
> 
> [PATCH 2/2] S390: Use generic spinlock code.
> https://www.sourceware.org/ml/libc-alpha/2016-12/msg00618.html
> 
> Please have a look.

Will do.  Thanks.
  
Florian Weimer Dec. 16, 2016, 8:12 p.m. UTC | #5
* Stefan Liebler:

> The type pthread_spinlock_t is a volatile int!
> If lock is cast to (int *) before passing it to the atomic macros,
> the assembler output looks okay.

That's undefined:

“If an attempt is made to refer to an object defined with a
volatile-qualified type through use of an lvalue with
non-volatile-qualified type, the behavior is undefined.”

But we cannot drop the volatile qualifier from the definition of
pthread_spinlock_t because it would change the C++ mangling of
pthread_spinlock_t * and similar types.

Maybe we can get a GCC extension to support this?
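For illustration, with the usual Itanium C++ ABI mangling the volatile
qualifier is encoded in the parameter type, so dropping it would change
exported C++ signatures (declarations hypothetical):

void take_spinlock (pthread_spinlock_t *lock);  /* mangles as _Z13take_spinlockPVi */
void take_plain    (int *lock);                 /* mangles as _Z10take_plainPi     */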
  

Patch

diff --git a/sysdeps/s390/nptl/pthread_spin_lock.c b/sysdeps/s390/nptl/pthread_spin_lock.c
index def6a24..9c06949 100644
--- a/sysdeps/s390/nptl/pthread_spin_lock.c
+++ b/sysdeps/s390/nptl/pthread_spin_lock.c
@@ -21,12 +21,13 @@ 
 int
 pthread_spin_lock (pthread_spinlock_t *lock)
 {
-  int oldval;
+  /* The s390 atomic_compare_and_exchange_bool_acq macro, called with constant
+     zero as oldval, first compares the lock with normal instructions.  If it
+     is free, the compare-and-swap instruction is used to acquire the lock.
+     While the lock is held by one CPU, another CPU will not exclusively lock
+     the memory of the lock in a loop.  */
+  while (atomic_compare_and_exchange_bool_acq (lock, 1, 0))
+    ;
 
-  __asm__ __volatile__ ("0: lhi %0,0\n"
-			"   cs  %0,%2,%1\n"
-			"   jl  0b"
-			: "=&d" (oldval), "=Q" (*lock)
-			: "d" (1), "m" (*lock) : "cc" );
   return 0;
 }
diff --git a/sysdeps/s390/nptl/pthread_spin_trylock.c b/sysdeps/s390/nptl/pthread_spin_trylock.c
index 4c00e08..d9e88ef 100644
--- a/sysdeps/s390/nptl/pthread_spin_trylock.c
+++ b/sysdeps/s390/nptl/pthread_spin_trylock.c
@@ -22,11 +22,9 @@ 
 int
 pthread_spin_trylock (pthread_spinlock_t *lock)
 {
-  int old;
-
-  __asm__ __volatile__ ("cs %0,%3,%1"
-			: "=d" (old), "=Q" (*lock)
-			: "0" (0), "d" (1), "m" (*lock) : "cc" );
-
-  return old != 0 ? EBUSY : 0;
+  /* The s390 atomic_compare_and_exchange_bool_acq macro, called with constant
+     zero as oldval, first compares the lock with normal instructions.  If it
+     is free, the compare-and-swap instruction is used to acquire the lock.  */
+  return __glibc_unlikely (atomic_compare_and_exchange_bool_acq (lock, 1, 0))
+    ? EBUSY : 0;
 }