[AArch64] Single thread lowlevellock optimization

Message ID 59440699.6080900@arm.com
State New, archived

Commit Message

Szabolcs Nagy June 16, 2017, 4:26 p.m. UTC
  Do single thread lock optimization in aarch64 libc. Atomic operations
hurt the performance of some single-threaded programs using stdio
(usually getc/putc in a loop).

Ideally such an optimization should be done at a higher level and in a
target-independent way, as in
https://sourceware.org/ml/libc-alpha/2017-05/msg00479.html
but that approach needs more discussion, so do it in the low-level locks,
similarly to x86, until there is consensus.

Differences compared to the current x86_64 behaviour:
- The optimization is not silently applied to shared locks; in that
case the build fails.
- Unlock assumes the futex value is 0 or 1, i.e. there are no waiters
to wake (waking would not work in the single-thread case, and libc does
not use such locks; to be sure, lll_cond* is undefed).

This speeds up a getchar loop by about 2-4x depending on the cpu,
while only causing around a 5-10% regression for the multi-threaded case
(other libc internal locks are not expected to be performance
critical or significantly affected by this change).
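
(Illustration only, not the submitted code: a rough, portable sketch of the
single-thread fast path using C11 atomics.  The multiple_threads flag and
lock_slow_path function are made-up placeholders; the real implementation is
the lowlevellock.h diff at the end of this page.)

  #include <stdatomic.h>
  #include <stdbool.h>

  extern bool multiple_threads;              /* hypothetical: set once a second thread exists */
  extern void lock_slow_path (atomic_int *); /* hypothetical: futex-wait loop */

  static inline void
  sketch_lock (atomic_int *futex)
  {
    if (!multiple_threads)
      {
        /* Single-threaded: nothing can race with us, so a plain
           load and store avoid the atomic read-modify-write.  */
        if (atomic_load_explicit (futex, memory_order_relaxed) == 0)
          atomic_store_explicit (futex, 1, memory_order_relaxed);
        else
          lock_slow_path (futex);
      }
    else
      {
        int expected = 0;
        if (!atomic_compare_exchange_strong_explicit (futex, &expected, 1,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
          lock_slow_path (futex);
      }
  }

  static inline void
  sketch_unlock (atomic_int *futex)
  {
    if (!multiple_threads)
      /* No other thread, hence no waiter to wake.  */
      atomic_store_explicit (futex, 0, memory_order_relaxed);
    else
      /* The real code uses an atomic exchange and wakes a futex
         waiter when the old value indicates contention.  */
      atomic_store_explicit (futex, 0, memory_order_release);
  }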

2017-06-16  Szabolcs Nagy  <szabolcs.nagy@arm.com>

	* sysdeps/unix/sysv/linux/aarch64/lowlevellock.h: New file.
  

Comments

Szabolcs Nagy June 20, 2017, 9:51 a.m. UTC | #1
On 16/06/17 17:26, Szabolcs Nagy wrote:
> Do single thread lock optimization in aarch64 libc. Atomic operations
> hurt the performance of some single-threaded programs using stdio
> (usually getc/putc in a loop).
> 
> Ideally such optimization should be done at a higher level and in a
> target independent way as in
> https://sourceware.org/ml/libc-alpha/2017-05/msg00479.html
> but that approach will need more discussion so do it in lowlevellocks,
> similarly to x86, until there is consensus.
> 
> Differences compared to the current x86_64 behaviour:
> - The optimization is not silently applied to shared locks, in that
> case the build fails.
> - Unlock assumes the futex value is 0 or 1, there are no waiters to
> wake (that would not work in single thread and libc does not use
> such locks, to be sure lll_cond* is undefed).
> 
> This speeds up a getchar loop about 2-4x depending on the cpu,
> while only cause around 5-10% regression for the multi-threaded case
> (other libc internal locks are not expected to be performance
> critical or significantly affected by this change).
> 
> 2017-06-16  Szabolcs Nagy  <szabolcs.nagy@arm.com>
> 
> 	* sysdeps/unix/sysv/linux/aarch64/lowlevellock.h: New file.
> 

i'd like to commit this in this release
(and work on the more generic solution later)
i'm waiting for feedback for a while in case somebody
spots some issues.
  
Adhemerval Zanella June 20, 2017, 1:18 p.m. UTC | #2
On 20/06/2017 06:51, Szabolcs Nagy wrote:
> On 16/06/17 17:26, Szabolcs Nagy wrote:
>> Do single thread lock optimization in aarch64 libc. Atomic operations
>> hurt the performance of some single-threaded programs using stdio
>> (usually getc/putc in a loop).
>>
>> Ideally such optimization should be done at a higher level and in a
>> target independent way as in
>> https://sourceware.org/ml/libc-alpha/2017-05/msg00479.html
>> but that approach will need more discussion so do it in lowlevellocks,
>> similarly to x86, until there is consensus.
>>
>> Differences compared to the current x86_64 behaviour:
>> - The optimization is not silently applied to shared locks, in that
>> case the build fails.
>> - Unlock assumes the futex value is 0 or 1, there are no waiters to
>> wake (that would not work in single thread and libc does not use
>> such locks, to be sure lll_cond* is undefed).
>>
>> This speeds up a getchar loop about 2-4x depending on the cpu,
>> while only cause around 5-10% regression for the multi-threaded case
>> (other libc internal locks are not expected to be performance
>> critical or significantly affected by this change).
>>
>> 2017-06-16  Szabolcs Nagy  <szabolcs.nagy@arm.com>
>>
>> 	* sysdeps/unix/sysv/linux/aarch64/lowlevellock.h: New file.
>>
> 
> i'd like to commit this in this release
> (and work on the more generic solution later)
> i'm waiting for feedback for a while in case somebody
> spots some issues.

This is similar to a powerpc optimization I suggested some time ago [1],
and the general idea back then was that any single-thread optimizations
belong in the specific concurrent algorithms.  Tulio tried this again
sometime later [2] and Torvald rejected it again for the same reasons.
He also pointed out in another thread (which I can't find a link to now)
that this change is potentially disruptive in case we aim for async-safe
malloc (although this is not an issue currently, as I also pointed out).

And I tend to agree with Torvald: this change adds arch-specific
complexity and semantics, whereas the optimization should be platform-neutral
and focused on the specific algorithm, like your first proposal (and I doubt
such hackery would be accepted in musl, for instance).


[1] https://sourceware.org/ml/libc-alpha/2014-04/msg00137.html
[2] https://patchwork.ozlabs.org/patch/596463/
  
Torvald Riegel June 20, 2017, 1:47 p.m. UTC | #3
On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
> Do single thread lock optimization in aarch64 libc. Atomic operations
> hurt the performance of some single-threaded programs using stdio
> (usually getc/putc in a loop).
> 
> Ideally such optimization should be done at a higher level and in a
> target independent way as in
> https://sourceware.org/ml/libc-alpha/2017-05/msg00479.html
> but that approach will need more discussion so do it in lowlevellocks,
> similarly to x86, until there is consensus.

I disagree that this is sufficient reason not to do the right thing here
(i.e., optimize in the high-level algorithm).  What further discussion is
needed regarding the high-level use case?

> Differences compared to the current x86_64 behaviour:
> - The optimization is not silently applied to shared locks, in that
> case the build fails.
> - Unlock assumes the futex value is 0 or 1, there are no waiters to
> wake (that would not work in single thread and libc does not use
> such locks, to be sure lll_cond* is undefed).
> 
> This speeds up a getchar loop about 2-4x depending on the cpu,
> while only cause around 5-10% regression for the multi-threaded case

What measurement of what benchmark resulted in that number (the latter
one)?  Without details of what you are measuring this isn't meaningful.

> (other libc internal locks are not expected to be performance
> critical or significantly affected by this change).

Why do you think this is the case?
  
Szabolcs Nagy June 20, 2017, 3:05 p.m. UTC | #4
On 20/06/17 14:47, Torvald Riegel wrote:
> On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
>> Do single thread lock optimization in aarch64 libc. Atomic operations
>> hurt the performance of some single-threaded programs using stdio
>> (usually getc/putc in a loop).
>>
>> Ideally such optimization should be done at a higher level and in a
>> target independent way as in
>> https://sourceware.org/ml/libc-alpha/2017-05/msg00479.html
>> but that approach will need more discussion so do it in lowlevellocks,
>> similarly to x86, until there is consensus.
> 
> I disagree that this is sufficient reason to do the right thing here
> (ie, optimize in the high-level algorithm).  What further discussion is
> needed re the high-level use case?
> 

one open issue is to detect malloc interposition at startup
time to disable the optimization (this should be easy i was
just not sure what's the right place to do it).

the current _IO_USER_LOCK flag could use the same mechanism:
instead of doing a check at every flockfile/funlockfile
just check once at entry into getc and jump to getc_unlocked.
but the stdio code may need some refactoring to make this
possible.
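
(a hedged sketch of that dispatch idea, using only public stdio calls;
stream_needs_locking here is a made-up placeholder for the real check,
which would look at _IO_USER_LOCK / the new flag2 bit inside libio)

  #include <stdio.h>

  extern int stream_needs_locking (FILE *);  /* hypothetical flag check */

  int
  getc_sketch (FILE *fp)
  {
    if (!stream_needs_locking (fp))
      /* single threaded and no user lock: skip flockfile/funlockfile.  */
      return getc_unlocked (fp);
    /* otherwise the usual locked path.  */
    return getc (fp);
  }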

i allocated a new flag2 bit, i don't know if there are unwanted
implications (are the flags public abi? then getc_unlocked path
could even be inlined)

stdio can be compiled in non-thread-safe mode, i'm not sure
what that does, i certainly did not test that configuration.

there were a number of _IO* abi symbols in libc but they
didn't quite do what i wanted so i introduced a new symbol
that can be called from libpthread to update FILE objects
when a new thread is created. (i think this should be ok
but again it's not clear to me what might be the downsides
of a new abi symbol).

>> Differences compared to the current x86_64 behaviour:
>> - The optimization is not silently applied to shared locks, in that
>> case the build fails.
>> - Unlock assumes the futex value is 0 or 1, there are no waiters to
>> wake (that would not work in single thread and libc does not use
>> such locks, to be sure lll_cond* is undefed).
>>
>> This speeds up a getchar loop about 2-4x depending on the cpu,
>> while only cause around 5-10% regression for the multi-threaded case
> 
> What measurement of what benchmark resulted in that number (the latter
> one)?  Without details of what you are measuring this isn't meaningful.
> 

these are all about getchar in a loop

for (i=0; i<N; i++) getchar();

and then time ./a.out </dev/zero

it is i think idiomatic input processing code for a number
of cmdline tools and those tools tend to be single threaded.

the multi-threaded case is just creating a dummy thread to
disable the optimization.
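
(putting the two cases above together, a minimal sketch of the benchmark;
the iteration count and the -DMT dummy-thread variant are my assumptions.
build with e.g. gcc -O2 [-DMT -lpthread] and run as  time ./a.out </dev/zero)

  #include <stdio.h>
  #ifdef MT
  # include <pthread.h>
  # include <unistd.h>

  static void *
  idle (void *arg)
  {
    pause ();   /* sleeps forever; its mere existence disables the optimization */
    return arg;
  }
  #endif

  int
  main (void)
  {
  #ifdef MT
    pthread_t t;
    pthread_create (&t, NULL, idle, NULL);   /* dummy thread for the multi-threaded case */
  #endif
    for (long i = 0; i < 400000000L; i++)    /* N chosen arbitrarily for this sketch */
      getchar ();
    return 0;
  }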

>> (other libc internal locks are not expected to be performance
>> critical or significantly affected by this change).
> 
> Why do you think this is the case?
> 

there is only an extra branch in the lock and unlock
code, i don't see locks in libc that can be hot enough
to make that matter, except for stdio and malloc locks.
(it does add some code bloat to libc though)

in stdio only getc/putc/getchar/putchar +w variants are
short enough to make the optimization practically relevant.

the effect on malloc is already much smaller since it has
more surrounding code beyond the lock/unlock (instead of
2-4x speed up you get 10% or so with a naive free(malloc(1))
in a loop, with more complex workloads i'd expect smaller
effect as that would probably go through more branches in
malloc/free)
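
(for reference, a sketch of that naive loop; the iteration count is arbitrary
and the asm barrier only keeps the compiler from deleting the malloc/free
pair at -O2)

  #include <stdlib.h>

  int
  main (void)
  {
    for (long i = 0; i < 100000000L; i++)
      {
        void *p = malloc (1);
        /* prevent the allocation from being optimized away.  */
        __asm__ __volatile__ ("" : : "g" (p) : "memory");
        free (p);
      }
    return 0;
  }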
  
Torvald Riegel June 20, 2017, 6:10 p.m. UTC | #5
On Tue, 2017-06-20 at 16:05 +0100, Szabolcs Nagy wrote:
> On 20/06/17 14:47, Torvald Riegel wrote:
> > On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
> >> Differences compared to the current x86_64 behaviour:
> >> - The optimization is not silently applied to shared locks, in that
> >> case the build fails.
> >> - Unlock assumes the futex value is 0 or 1, there are no waiters to
> >> wake (that would not work in single thread and libc does not use
> >> such locks, to be sure lll_cond* is undefed).
> >>
> >> This speeds up a getchar loop about 2-4x depending on the cpu,
> >> while only cause around 5-10% regression for the multi-threaded case
> > 
> > What measurement of what benchmark resulted in that number (the latter
> > one)?  Without details of what you are measuring this isn't meaningful.
> > 
> 
> these are all about getchar in a loop
> 
> for (i=0; i<N; i++) getchar();
> 
> and then time ./a.out </dev/zero
> 
> it is i think idiomatic input processing code for a number
> of cmdline tools and those tools tend to be single threaded.

Can you measure any CPU time difference for these tools?

> the multi-threaded case is just creating a dummy thread to
> disable the optimization.

Note that half of the overhead will be in the unlock code, and so is
executed during the critical section.  That means that you make the
sequential parts of a program longer, and that will limit the maximum
amount of parallelism you can have.

Also, more and more programs will be multi-threaded (though maybe they
don't do tight getchar() loops like the one above), so it's not quite
clear whether the 5-10% are less important overall or not.

> >> (other libc internal locks are not expected to be performance
> >> critical or significantly affected by this change).
> > 
> > Why do you think this is the case?
> > 
> 
> there is only an extra branch in the lock and unlock
> code, i don't see locks in libc that can be hot enough
> to make that matter, except for stdio and malloc locks.

If it's just a few of the higher-level clients that you think would
benefit, this is another reason to optimize there and leave the
low-level lock unchanged.

> (it does add some code bloat to libc though)
> 
> in stdio only getc/putc/getchar/putchar +w variants are
> short enough to make the optimization practically relevant.
> 
> the effect on malloc is already much smaller since it has
> more surrounding code beyond the lock/unlock (instead of
> 2-4x speed up you get 10% or so with a naive free(malloc(1))
> in a loop, with more complex workloads i'd expect smaller
> effect as that would probably go through more branches in
> malloc/free)

What about multi-threaded malloc?
  
Szabolcs Nagy June 21, 2017, 9:22 a.m. UTC | #6
On 20/06/17 19:10, Torvald Riegel wrote:
> On Tue, 2017-06-20 at 16:05 +0100, Szabolcs Nagy wrote:
>> On 20/06/17 14:47, Torvald Riegel wrote:
>>> On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
>>>> Differences compared to the current x86_64 behaviour:
>>>> - The optimization is not silently applied to shared locks, in that
>>>> case the build fails.
>>>> - Unlock assumes the futex value is 0 or 1, there are no waiters to
>>>> wake (that would not work in single thread and libc does not use
>>>> such locks, to be sure lll_cond* is undefed).
>>>>
>>>> This speeds up a getchar loop about 2-4x depending on the cpu,
>>>> while only cause around 5-10% regression for the multi-threaded case
>>>
>>> What measurement of what benchmark resulted in that number (the latter
>>> one)?  Without details of what you are measuring this isn't meaningful.
>>>
>>
>> these are all about getchar in a loop
>>
>> for (i=0; i<N; i++) getchar();
>>
>> and then time ./a.out </dev/zero
>>
>> it is i think idiomatic input processing code for a number
>> of cmdline tools and those tools tend to be single threaded.
> 
> Can you measure any CPU time difference for these tools?
> 

gnu dc with some generated input:

$ time taskset -c 1 $NOLOCK/lib64/ld-linux-aarch64.so.1 --library-path $NOLOCK/lib64 ./dc <dcinput
0

real	0m21.083s
user	0m21.040s
sys	0m0.040s
$ time taskset -c 1 $ORIG/lib64/ld-linux-aarch64.so.1 --library-path $ORIG/lib64 ./dc <dcinput
0

real	0m29.734s
user	0m29.716s
sys	0m0.016s

this also affects $customer tool

(most gnu tools have their own silly buffering
exactly to avoid the slow libc stdio, some tools
use _unlocked interfaces directly which are less
portable, so there are plenty of maintenance issues
caused by leaving this unfixed)

>> the multi-threaded case is just creating a dummy thread to
>> disable the optimization.
> 
> Note that half of the overhead will be in the unlock code, and so is
> executed during the critical section.  That means that you make the
> sequential parts of a program longer, and that will limit the maximum
> amount of parallelism you can have.
> 
> Also, more and more programs will be multi-threaded (though maybe they
> don't do tight getchar() loops like the one above), so it's not quite
> clear whether the 5-10% are less important overall or not.
> 

if this optimization is so bad, then remove it
from x86_64, it affects a lot of users.

>>>> (other libc internal locks are not expected to be performance
>>>> critical or significantly affected by this change).
>>>
>>> Why do you think this is the case?
>>>
>>
>> there is only an extra branch in the lock and unlock
>> code, i don't see locks in libc that can be hot enough
>> to make that matter, except for stdio and malloc locks.
> 
> If it's just a few of the higher-level clients that you think would
> benefit, this is another reason to optimize there and leave the
> low-level lock unchanged.
> 

i can simplify the stdio patch a bit so it is only
applied to getc/putc/.. then malloc interposition
is not an issue, nor printf hooks.

that should be a safe first step.

>> (it does add some code bloat to libc though)
>>
>> in stdio only getc/putc/getchar/putchar +w variants are
>> short enough to make the optimization practically relevant.
>>
>> the effect on malloc is already much smaller since it has
>> more surrounding code beyond the lock/unlock (instead of
>> 2-4x speed up you get 10% or so with a naive free(malloc(1))
>> in a loop, with more complex workloads i'd expect smaller
>> effect as that would probably go through more branches in
>> malloc/free)
> 
> What about multi-threaded malloc?
> 

<= 5% (value depends on cpu)
  
Torvald Riegel June 21, 2017, 10:46 a.m. UTC | #7
On Wed, 2017-06-21 at 10:22 +0100, Szabolcs Nagy wrote:
> On 20/06/17 19:10, Torvald Riegel wrote:
> > On Tue, 2017-06-20 at 16:05 +0100, Szabolcs Nagy wrote:
> >> On 20/06/17 14:47, Torvald Riegel wrote:
> >>> On Fri, 2017-06-16 at 17:26 +0100, Szabolcs Nagy wrote:
> >>>> Differences compared to the current x86_64 behaviour:
> >>>> - The optimization is not silently applied to shared locks, in that
> >>>> case the build fails.
> >>>> - Unlock assumes the futex value is 0 or 1, there are no waiters to
> >>>> wake (that would not work in single thread and libc does not use
> >>>> such locks, to be sure lll_cond* is undefed).
> >>>>
> >>>> This speeds up a getchar loop about 2-4x depending on the cpu,
> >>>> while only cause around 5-10% regression for the multi-threaded case
> >>>
> >>> What measurement of what benchmark resulted in that number (the latter
> >>> one)?  Without details of what you are measuring this isn't meaningful.
> >>>
> >>
> >> these are all about getchar in a loop
> >>
> >> for (i=0; i<N; i++) getchar();
> >>
> >> and then time ./a.out </dev/zero
> >>
> >> it is i think idiomatic input processing code for a number
> >> of cmdline tools and those tools tend to be single threaded.
> > 
> > Can you measure any CPU time difference for these tools?
> > 
> 
> gnu dc with some generated input:
> 
> $ time taskset -c 1 $NOLOCK/lib64/ld-linux-aarch64.so.1 --library-path $NOLOCK/lib64 ./dc <dcinput
> 0
> 
> real	0m21.083s
> user	0m21.040s
> sys	0m0.040s
> $ time taskset -c 1 $ORIG/lib64/ld-linux-aarch64.so.1 --library-path $ORIG/lib64 ./dc <dcinput
> 0
> 
> real	0m29.734s
> user	0m29.716s
> sys	0m0.016s

Is dcinput realistic?  Is dc that important?  Anything else?

And I think we still have no real data on how it affects the non-getchar
case, in particular the multi-threaded case.  Those questions go away
once this is an optimization that just affects getchar() ...

> this also affects $customer tool

Is that the real motivation behind your work?  Can you roughly describe
what that tool does?

> (most gnu tools have their own silly buffering
> exactly to avoid the slow libc stdio, some tools
> use _unlocked interfaces directly which are less
> portable, so there are plenty of maintenance issues
> caused by leaving this unfixed)

Well, reading one character at a time with something that will likely
remain a function call isn't a great approach in any case, right?

> >> the multi-threaded case is just creating a dummy thread to
> >> disable the optimization.
> > 
> > Note that half of the overhead will be in the unlock code, and so is
> > executed during the critical section.  That means that you make the
> > sequential parts of a program longer, and that will limit the maximum
> > amount of parallelism you can have.
> > 
> > Also, more and more programs will be multi-threaded (though maybe they
> > don't do tight getchar() loops like the one above), so it's not quite
> > clear whether the 5-10% are less important overall or not.
> > 
> 
> if this optimization is so bad, then remove it
> from x86_64, it affects a lot of users.

And I have indeed thought about doing that, wondering whether it's still
useful.  In particular on x86, atomic operations on data in the cache
were much more costly in the past than they are now.  But I haven't
looked at this closely with benchmarks and so forth so far.

Anyway, the fact that we might have an issue on x86 shouldn't be a
reason to potentially introduce the issue on aarch64 too.

> >>>> (other libc internal locks are not expected to be performance
> >>>> critical or significantly affected by this change).
> >>>
> >>> Why do you think this is the case?
> >>>
> >>
> >> there is only an extra branch in the lock and unlock
> >> code, i don't see locks in libc that can be hot enough
> >> to make that matter, except for stdio and malloc locks.
> > 
> > If it's just a few of the higher-level clients that you think would
> > benefit, this is another reason to optimize there and leave the
> > low-level lock unchanged.
> > 
> 
> i can simplify the stdio patch a bit so it is only
> applied to getc/putc/.. then malloc interposition
> is not an issue, nor printf hooks.
> 
> that should be a safe first step.
> 
> >> (it does add some code bloat to libc though)
> >>
> >> in stdio only getc/putc/getchar/putchar +w variants are
> >> short enough to make the optimization practically relevant.
> >>
> >> the effect on malloc is already much smaller since it has
> >> more surrounding code beyond the lock/unlock (instead of
> >> 2-4x speed up you get 10% or so with a naive free(malloc(1))
> >> in a loop, with more complex workloads i'd expect smaller
> >> effect as that would probably go through more branches in
> >> malloc/free)
> > 
> > What about multi-threaded malloc?
> > 
> 
> <= 5% (value depends on cpu)
> 

And definitely not just that (e.g., allocation pattern and frequency,
number of threads, etc.).  Thus, without that context, your
statement of less than 5% doesn't tell me anything, sorry.
  

Patch

diff --git a/sysdeps/unix/sysv/linux/aarch64/lowlevellock.h b/sysdeps/unix/sysv/linux/aarch64/lowlevellock.h
new file mode 100644
index 0000000000..7561a454c9
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/aarch64/lowlevellock.h
@@ -0,0 +1,93 @@ 
+/* AArch64 low-level lock implementation.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _AARCH64_LOWLEVELLOCK_H
+#define _AARCH64_LOWLEVELLOCK_H	1
+
+#include <sysdeps/nptl/lowlevellock.h>
+
+#if IS_IN (libc)
+
+/* See sysdeps/nptl/lowlevellock.h for comments.  This implementation
+   avoids some atomic operations in the single-threaded case in libc,
+   then the lll operations are not thread-safe nor async-signal-safe.
+
+   It is assumed that in libc there are only private futex locks used
+   and there are no waiters on a lock that unlock has to wake.  */
+
+#undef lll_cond_trylock
+#undef lll_cond_lock
+void __aarch64_lll_link_error (void);
+
+extern int __libc_multiple_threads attribute_hidden;
+
+#undef lll_trylock
+#define lll_trylock(futex) __lll_trylock (&(futex))
+static inline int
+__lll_trylock (int *futex)
+{
+  if (__libc_multiple_threads == 0)
+    {
+      int oldval = *futex;
+      if (__glibc_likely (oldval == 0))
+	*futex = 1;
+      return oldval;
+    }
+  return __glibc_unlikely (atomic_compare_and_exchange_bool_acq (futex, 1, 0));
+}
+
+#undef lll_lock
+#define lll_lock(futex, private) \
+  (__builtin_constant_p (private) && (private) == LLL_PRIVATE \
+   ? __lll_lock_private (&(futex)) : __aarch64_lll_link_error ())
+static inline void
+__lll_lock_private (int *futex)
+{
+  if (__libc_multiple_threads == 0)
+    {
+      if (__glibc_likely (*futex == 0))
+	*futex = 1;
+      else
+	__lll_lock_wait_private (futex);
+    }
+  else if (__glibc_unlikely
+	   (atomic_compare_and_exchange_bool_acq (futex, 1, 0)))
+    __lll_lock_wait_private (futex);
+}
+
+#undef lll_unlock
+#define lll_unlock(futex, private) \
+  (__builtin_constant_p (private) && (private) == LLL_PRIVATE \
+   ? __lll_unlock_private (&(futex)) : __aarch64_lll_link_error ())
+/* Note: lll_futex_wake depends on macros defined later.  */
+#define __lll_unlock_private(futex) \
+  ((void)({ \
+    if (__libc_multiple_threads == 0) \
+      *(futex) = 0; \
+    else \
+      { \
+	int *__futex = (futex); \
+	int __oldval = atomic_exchange_rel (__futex, 0); \
+	if (__glibc_unlikely (__oldval > 1)) \
+	  lll_futex_wake (__futex, 1, LLL_PRIVATE); \
+      } \
+  }))
+
+#endif /* IS_IN (libc) */
+
+#endif
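
(Not part of the patch: a hedged sketch of the libc-internal usage pattern
these macros serve, with made-up names; real callers are e.g. the stdio and
malloc locks discussed above.)

  /* Hypothetical libc-internal caller; the futex word starts out 0 (unlocked).  */
  static int some_internal_lock;

  static void
  with_lock_held (void)
  {
    lll_lock (some_internal_lock, LLL_PRIVATE);
    /* ... touch libc-internal state ... */
    lll_unlock (some_internal_lock, LLL_PRIVATE);
  }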