hppa: Optimize atomic_compare_and_exchange_val_acq

Message ID 58B70052-B987-4C41-B603-F3AAB2FDE34B@bell.net
State New, archived
Headers

Commit Message

John David Anglin Sept. 22, 2016, 2:14 p.m. UTC
  The attached patch replaces the conditional branch tests in atomic_compare_and_exchange_val_acq with
conditional instruction nullification.  This avoids the stalls associated with conditional branches and the resulting
code is shorter.  There are no branches in the fast path when the operation is successful.

The change was intended as an optimization but tst-stack4 now passes.

Please install.

Thanks,
Dave
--
John David Anglin	dave.anglin@bell.net
2016-09-22  John David Anglin  

	* sysdeps/unix/sysv/linux/hppa/atomic-machine.h: Don't include
	abort-instr.h.
	(EFAULT): Remove conditional define.
	(ENOSYS): Likewise.
	(atomic_compare_and_exchange_val_acq): Use instruction nullification
	instead of conditional branch instructions.
  

Comments

Carlos O'Donell Sept. 22, 2016, 8:37 p.m. UTC | #1
On 09/22/2016 10:14 AM, John David Anglin wrote:
> The attached patch replaces the conditional branch tests in
> atomic_compare_and_exchange_val_acq with conditional instruction
> nullification. This avoids the stalls associated with conditional
> branches and the resulting code is shorter. There are no branches in
> the fast path when the operation is successful.

Does this really make a measurable difference? The light-weight-syscall
is probably the most costly part of this entire operation.

If you can show there is a measurable difference I would be willing
to accept the removal of the deadlock looping (it becomes a SIGILL
and you have to look at the core file).

> The change was intended as an optimization but tst-stack4 now passes.

This is a red flag for this patch.

Any idea what changes?
 
The tst-stack4 test creates a bunch of threads, that all create their
own stacks, release them (placing them on the free stack list) and then
they get reused by new threads, all during dlopen/dlsym operation which
is growing the DTV. We try to catch a case where the DTV size is too small
and we overflow. I don't see how that could be related to what you have here?

> 2016-09-22  John David Anglin  
> 
> 	* sysdeps/unix/sysv/linux/hppa/atomic-machine.h: Don't include
> 	abort-instr.h.
> 	(EFAULT): Remove conditional define.
> 	(ENOSYS): Likewise.
> 	(atomic_compare_and_exchange_val_acq): Use instruction nullification
> 	instead of conditional branch instructions.
  
Helge Deller Sept. 22, 2016, 9:38 p.m. UTC | #2
On 22.09.2016 22:37, Carlos O'Donell wrote:
> On 09/22/2016 10:14 AM, John David Anglin wrote:
>> The attached patch replaces the conditional branch tests in
>> atomic_compare_and_exchange_val_acq with conditional instruction
>> nullification. This avoids the stalls associated with conditional
>> branches and the resulting code is shorter. There are no branches in
>> the fast path when the operation is successful.
> 
> Does this really make a measurable difference? The light-weight-syscall
> is probably the most costly part of this entire operation.
> 
> If you can show there is a measurable difference I would be willing
> to accept the removal of the deadlock looping (it becomes a SIGILL
> and you have to look at the core file).

I'm with Dave that the removal of the deadlock looping is OK.
There is no production kernel in real-life (e.g. Debian kernels) which
return EDEADLOCK, because in the Linux kernel we have completely disabled
the LWS debugging by default (ENABLE_LWS_DEBUG=0) and as such I see
no kernel code path where EDEADLOCK can be returned.

Helge
  
John David Anglin Sept. 22, 2016, 9:49 p.m. UTC | #3
On 2016-09-22, at 4:37 PM, Carlos O'Donell wrote:

> On 09/22/2016 10:14 AM, John David Anglin wrote:
>> The attached patch replaces the conditional branch tests in
>> atomic_compare_and_exchange_val_acq with conditional instruction
>> nullification. This avoids the stalls associated with conditional
>> branches and the resulting code is shorter. There are no branches in
>> the fast path when the operation is successful.
> 
> Does this really make a measurable difference? The light-weight-syscall
> is probably the most costly part of this entire operation.

I don't have any numbers but my sense is that it does slightly improve pthread related performance.
I have a kernel patch to slightly improve the LWS code as well.

This is just a guess.  Removing the final hunk of C code may give better register allocation and reduce
stack requirements in the code using atomic_compare_and_exchange_val_acq.

> 
> If you can show there is a measurable difference I would be willing
> to accept the removal of the deadlock looping (it becomes a SIGILL
> and you have to look at the core file).

Yes.  The LWS deadlock code is not enabled in the kernel.  So, we are just wasting a couple of cycles in checking
for the deadlock error code.  This may have been helpful when the code was first written but I believe it should go now.

> 
>> The change was intended as an optimization but tst-stack4 now passes.
> 
> This is a red flag for this patch.

The patch has had a significant amount of testing and I'm sure there are no functional issues.  It is important to simplify
when possible.

Why red flag an improvement in test results?

I think it also helps with glib2.0 async-splice-output-stream test but I don't have enough build data to be
absolutely certain.  Fixing syscall cancellation has changed things a bit.

You can see the Debian glibc test results with the change here:
https://buildd.debian.org/status/fetch.php?pkg=glibc&arch=hppa&ver=2.24-3%2Bb1&stamp=1474530506

> 
> Any idea what changes?

No.  I haven't had a chance to look at the tst-stack4 test.

> 
> The tst-stack4 test creates a bunch of threads, that all create their
> own stacks, release them (placing them on the free stack list) and then
> they get reused by new threads, all during dlopen/dlsym operation which
> is growing the DTV. We try to catch a case where the DTV size is too small
> and we overflow. I don't see how that could be related to what you have here?

I can look but I think it might be better to focus on tests that still fail.

> 
>> 2016-09-22  John David Anglin  
>> 
>> 	* sysdeps/unix/sysv/linux/hppa/atomic-machine.h: Don't include
>> 	abort-instr.h.
>> 	(EFAULT): Remove conditional define.
>> 	(ENOSYS): Likewise.
>> 	(atomic_compare_and_exchange_val_acq): Use instruction nullification
>> 	instead of conditional branch instructions.
> 
> 
> -- 
> Cheers,
> Carlos.


Dave
--
John David Anglin	dave.anglin@bell.net
  
Florian Weimer Sept. 23, 2016, 7:13 a.m. UTC | #4
* John David Anglin:

>>> The change was intended as an optimization but tst-stack4 now passes.
>> 
>> This is a red flag for this patch.
>
> The patch has had a significant amount of testing and I'm sure there
> are no functional issues.  It is important to simplify
> when possible.
>
> Why red flag an improvement in test results?

I share Carlos' concern.  If this patch appears to fix tst-stack4,
then it has unforeseen consequences.  We just don't understand the
full impact of this change.
  
Carlos O'Donell Sept. 27, 2016, 5:15 p.m. UTC | #5
On 09/22/2016 05:38 PM, Helge Deller wrote:
> On 22.09.2016 22:37, Carlos O'Donell wrote:
>> On 09/22/2016 10:14 AM, John David Anglin wrote:
>>> The attached patch replaces the conditional branch tests in
>>> atomic_compare_and_exchange_val_acq with conditional instruction
>>> nullification. This avoids the stalls associated with conditional
>>> branches and the resulting code is shorter. There are no branches in
>>> the fast path when the operation is successful.
>>
>> Does this really make a measurable difference? The light-weight-syscall
>> is probably the most costly part of this entire operation.
>>
>> If you can show there is a measurable difference I would be willing
>> to accept the removal of the deadlock looping (it becomes a SIGILL
>> and you have to look at the core file).
> 
> I'm with Dave that the removal of the deadlock looping is OK.
> There is no production kernel in real-life (e.g. Debian kernels) which
> return EDEADLOCK, because in the Linux kernel we have completely disabled
> the LWS debugging by default (ENABLE_LWS_DEBUG=0) and as such I see
> no kernel code path where EDEADLOCK can be returned.

Thanks.
  
Carlos O'Donell Sept. 27, 2016, 5:17 p.m. UTC | #6
On 09/23/2016 03:13 AM, Florian Weimer wrote:
> * John David Anglin:
> 
>>>> The change was intended as an optimization but tst-stack4 now passes.
>>>
>>> This is a red flag for this patch.
>>
>> The patch has had a significant amount of testing and I'm sure there
>> are no functional issues.  It is important to simplify
>> when possible.
>>
>> Why red flag an improvement in test results?
> 
> I share Carlos' concern.  If this patch appears to fix tst-stack4,
> then it has unforeseen consequences.  We just don't understand the
> full impact of this change.

Exactly.

Before and after the patch the regression testsuite should give
roughly the same results (modulo unreliable tests).
 
I agree that operational hours of testing are important, but it's
not everything one has to consider.

Is the FAIL->PASS for tst-stack4 reliably tied to the addition of
this patch?
  
John David Anglin Sept. 27, 2016, 10:26 p.m. UTC | #7
On 2016-09-27, at 1:17 PM, Carlos O'Donell wrote:

> Is the FAIL->PASS for tst-stack4 reliably tied to the addition of
> this patch?

I don't think so.  It has something to do with testsuite.  When I run test directly, it fails:

dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
Didn't expect signal from child: got `Aborted'


With gdb, I see:

(gdb) r
Starting program: /home/dave/gnu/glibc/objdir/nptl/tst-stack4 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/hppa-linux-gnu/libthread_db.so.1".
[New process 23921]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/hppa-linux-gnu/libthread_db.so.1".
[New Thread 0xf97f7480 (LWP 23922)]
[New Thread 0xf8ff7480 (LWP 23923)]
tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
[New Thread 0xf87f7480 (LWP 23924)]
[New Thread 0xf7ff7480 (LWP 23925)]

Thread 2.2 "tst-stack4" received signal SIGABRT, Aborted.
[Switching to Thread 0xf97f7480 (LWP 23922)]
0xf98278b4 in __assert_fail_base (
    fmt=0x1 <error: Cannot access memory at address 0x1>, 
    assertion=<optimized out>, 
    file=0x6 <error: Cannot access memory at address 0x6>, line=69, 
    function=<optimized out>) at assert.c:73
73	assert.c: No such file or directory.

Dave
--
John David Anglin	dave.anglin@bell.net
  
Florian Weimer Sept. 27, 2016, 10:50 p.m. UTC | #8
* John David Anglin:

> On 2016-09-27, at 1:17 PM, Carlos O'Donell wrote:
>
>> Is the FAIL->PASS for tst-stack4 reliably tied to the addition of
>> this patch?
>
> I don't think so.  It has something to do with testsuite.  When I run
> test directly, it fails:
>
> dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
> tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
> Didn't expect signal from child: got `Aborted'
>
>
> With gdb, I see:
>
> (gdb) r

Did you pass --direct?  The test harness spawns a subprocess.
  
John David Anglin Sept. 27, 2016, 11:04 p.m. UTC | #9
On 2016-09-27, at 6:50 PM, Florian Weimer wrote:

> * John David Anglin:
> 
>> On 2016-09-27, at 1:17 PM, Carlos O'Donell wrote:
>> 
>>> Is the FAIL->PASS for tst-stack4 reliably tied to the addition of
>>> this patch?
>> 
>> I don't think so.  It has something to do with testsuite.  When I run
>> test directly, it fails:
>> 
>> dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
>> tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
>> Didn't expect signal from child: got `Aborted'
>> 
>> 
>> With gdb, I see:
>> 
>> (gdb) r
> 
> Did you pass --direct?  The test harness spawns a subprocess.


Didn't know about direct.  Had "set follow-fork-mode child".  "--direct" didn't make any
difference.

Dave
--
John David Anglin	dave.anglin@bell.net
  
Carlos O'Donell Sept. 28, 2016, 8:29 p.m. UTC | #10
On 09/27/2016 06:26 PM, John David Anglin wrote:
> On 2016-09-27, at 1:17 PM, Carlos O'Donell wrote:
> 
>> Is the FAIL->PASS for tst-stack4 reliably tied to the addition of
>> this patch?
> 
> I don't think so.  It has something to do with testsuite.  When I run test directly, it fails:
> 
> dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
> tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
> Didn't expect signal from child: got `Aborted'

That is odd. It means dlopen failed to find the object and the test aborted.

What does strace show is being opened?
  
Florian Weimer Sept. 28, 2016, 9 p.m. UTC | #11
* John David Anglin:

> On 2016-09-27, at 1:17 PM, Carlos O'Donell wrote:
>
>> Is the FAIL->PASS for tst-stack4 reliably tied to the addition of
>> this patch?
>
> I don't think so.  It has something to do with testsuite.  When I run
> test directly, it fails:
>
> dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
> tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
> Didn't expect signal from child: got `Aborted'

I think you need to put the current directory on the library path.
The test harness does this automatically to locate libpthread.so.0,
but if you run the test manually, you need to do this yourself.
  
John David Anglin Oct. 2, 2016, 4:47 p.m. UTC | #12
On 2016-09-28, at 5:00 PM, Florian Weimer wrote:

> * John David Anglin:
> 
>> On 2016-09-27, at 1:17 PM, Carlos O'Donell wrote:
>> 
>>> Is the FAIL->PASS for tst-stack4 reliably tied to the addition of
>>> this patch?
>> 
>> I don't think so.  It has something to do with testsuite.  When I run
>> test directly, it fails:
>> 
>> dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
>> tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
>> Didn't expect signal from child: got `Aborted'
> 
> I think you need to put the current directory on the library path.
> The test harness does this automatically to locate libpthread.so.0,
> but if you run the test manually, you need to do this yourself.


Yes, this fixes the above problem.

The test fails randomly with debian glibc_2.24-3:

dave@mx3210:~/gnu/glibc/objdir/nptl$ export LD_LIBRARY_PATH=`pwd`
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
dave@mx3210:~/gnu/glibc/objdir/nptl$ echo $?
0
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
dave@mx3210:~/gnu/glibc/objdir/nptl$ echo $?
0
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
dave@mx3210:~/gnu/glibc/objdir/nptl$ echo $?
0
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
dave@mx3210:~/gnu/glibc/objdir/nptl$ echo $?
127
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
dave@mx3210:~/gnu/glibc/objdir/nptl$ echo $?
0

Dave
--
John David Anglin	dave.anglin@bell.net
  
Carlos O'Donell Oct. 12, 2016, 5:24 a.m. UTC | #13
On 10/02/2016 12:47 PM, John David Anglin wrote:
> On 2016-09-28, at 5:00 PM, Florian Weimer wrote:
> 
>> * John David Anglin:
>>
>>> On 2016-09-27, at 1:17 PM, Carlos O'Donell wrote:
>>>
>>>> Is the FAIL->PASS for tst-stack4 reliably tied to the addition of
>>>> this patch?
>>>
>>> I don't think so.  It has something to do with testsuite.  When I run
>>> test directly, it fails:
>>>
>>> dave@mx3210:~/gnu/glibc/objdir/nptl$ ./tst-stack4
>>> tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
>>> Didn't expect signal from child: got `Aborted'
>>
>> I think you need to put the current directory on the library path.
>> The test harness does this automatically to locate libpthread.so.0,
>> but if you run the test manually, you need to do this yourself.
> 
> 
> Yes, this fixes the above problem.
> 
> The test fails randomly with debian glibc_2.24-3:

In which case your patch is OK with me.

Can you commit it or do you need me to do that?

Aurelien or Mike can also commit.
  

Patch

Index: glibc-2.23/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
===================================================================
--- glibc-2.23.orig/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
+++ glibc-2.23/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
@@ -17,13 +17,6 @@ 
    <http://www.gnu.org/licenses/>.  */
 
 #include <stdint.h> /*  Required for type definitions e.g. uint8_t.  */
-#include <abort-instr.h> /*  Required for ABORT_INSTRUCTIUON.  */
-
-/* We need EFAULT, ENONSYS */
-#if !defined EFAULT && !defined ENOSYS
-#define EFAULT	14
-#define ENOSYS	251
-#endif
 
 #ifndef _ATOMIC_MACHINE_H
 #define _ATOMIC_MACHINE_H	1
@@ -62,7 +55,7 @@  typedef uintmax_t uatomic_max_t;
 #define _ASM_EDEADLOCK "-45"
 
 /* The only basic operation needed is compare and exchange.  The mem
-   pointer must be word aligned.  */
+   pointer must be word aligned.  We no longer loop on deadlock.  */
 #define atomic_compare_and_exchange_val_acq(mem, newval, oldval)	\
   ({									\
      register long lws_errno asm("r21");				\
@@ -74,20 +67,15 @@  typedef uintmax_t uatomic_max_t;
 	"0:					\n\t"			\
 	"ble	" _LWS "(%%sr2, %%r0)		\n\t"			\
 	"ldi	" _LWS_CAS ", %%r20		\n\t"			\
-	"ldi	" _ASM_EAGAIN ", %%r20		\n\t"			\
-	"cmpb,=,n %%r20, %%r21, 0b		\n\t"			\
-	"nop					\n\t"			\
-	"ldi	" _ASM_EDEADLOCK ", %%r20	\n\t"			\
-	"cmpb,=,n %%r20, %%r21, 0b		\n\t"			\
-	"nop					\n\t"			\
+	"cmpiclr,<> " _ASM_EAGAIN ", %%r21, %%r0\n\t"			\
+	"b,n 0b					\n\t"			\
+	"cmpclr,= %%r0, %%r21, %%r0		\n\t"			\
+	"iitlbp %%r0,(%%sr0, %%r0)		\n\t"			\
 	: "=r" (lws_ret), "=r" (lws_errno)				\
 	: "r" (lws_mem), "r" (lws_old), "r" (lws_new)			\
 	: _LWS_CLOBBER							\
      );									\
 									\
-     if (lws_errno == -EFAULT || lws_errno == -ENOSYS)			\
-	ABORT_INSTRUCTION;						\
-									\
      (__typeof (oldval)) lws_ret;					\
    })