[RFC] introduce dl_iterate_phdr_parallel

Message ID 20160731091642.GF2502@scylladb.com
State New, archived
Headers

Commit Message

Gleb Natapov July 31, 2016, 9:16 a.m. UTC
  Sorry, I missed your answer since I am not subscribed and the reply was
sent to the mailing list only.

On Thu, Jul 28, 2016 at 05:47:16PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 25/07/2016 11:23, Gleb Natapov wrote:
> > Problem: exception handling is not scalable. The attached program mt.cc
> > demonstrates this easily. It runs in around 1 second with 1 thread, but
> > with only 4 threads it takes 4.5 seconds to complete. The reason is the
> > locks that are taken on the unwind path.
> >
> > There are two locks right now.
> >
> > The first is in libgcc's _Unwind_Find_registered_FDE, and it is eliminated by
> > https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01629.html (for dynamic
> > executables only, but this is the common case).
> >
> > The second is dl_load_write_lock in __dl_iterate_phdr. It serves a dual
> > purpose: it stops the list of loaded objects from being modified while
> > iterating over it, and it makes sure that more than one callback will
> > not run in parallel. This means that even if we find a clever way
> > to make __dl_iterate_phdr iterate over the objects list locklessly, the
> > lock will still have to be taken around the call to the callback to
> > preserve the second guarantee.
> >
> > This patch proposes to introduce another API: dl_iterate_phdr_parallel,
> > which does exactly the same thing as dl_iterate_phdr but may run more
> > than one provided callback in parallel. To make it more scalable it
> > breaks the single dl_load_write_lock into an array of locks. A reader
> > takes only one lock, but a writer has to take all of them to proceed
> > with modifying the list.
> 
> Wouldn't a read-write lock solve the performance problem at dl_iterate_phdr?

> It should scale better than the current code, since dl_iterate_phdr should
> act just as a reader on link_map structures and multiple readers should
> only incur a lll_lock/lll_unlock on rwlock->__data.__lock.  I am assuming
> that we set it to prefer readers and use a read-lock in dl_iterate_phdr.
> 
I did try a read-write lock first and it did not improve the test case
at all. The reason is (as I was very surprised to discover) that the rwlock
takes a lock even if there are only readers. Though the lock is taken
only for a short period of time, when it becomes congested scalability
drops to the floor since all new waiters go to sleep in the kernel.
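The surprise is easy to see in the reader fast path of a counter-based rwlock: even with zero writers, every rdlock/unlock pair performs an atomic read-modify-write on one shared word, so all reader cores bounce the same cache line. A stripped-down sketch (C11 atomics, hypothetical names, writer side and kernel sleeping omitted):

```c
#include <stdatomic.h>

/* One word shared by every reader: this is the scalability choke point.  */
static atomic_uint readers;

static void
rdlock (void)
{
  /* Even the uncontended read path is an atomic RMW on shared state,
     i.e. an exclusive cache-line acquisition per reader.  */
  atomic_fetch_add_explicit (&readers, 1, memory_order_acquire);
}

static void
rdunlock (void)
{
  atomic_fetch_sub_explicit (&readers, 1, memory_order_release);
}
```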

Attached is a rather hacky patch that I used for testing.

> I see this as slightly better than trading memory for limited scalability
> (the _DL_LOCKS value), and it also avoids adding another exported API.
The memory tradeoff is rather small. Using an rwlock will not avoid
the API addition though, since two callbacks will be able to run in
parallel, and this is a user-visible difference in behaviour from the
current dl_iterate_phdr. The C++ exception handling code, for instance,
relies on the fact that callbacks do not run in parallel. It uses a global
cache, which I made per-thread in the libgcc patch.

--
			Gleb.
commit a8c321ce2d8c3afe9043ad514e89eccda2e88802
Author: Gleb Natapov <gleb@scylladb.com>
Date:   Sun Jul 24 11:25:07 2016 +0300

    use rwlock in dl-iteratephdr.c
  

Comments

Adhemerval Zanella Netto Aug. 1, 2016, 6:06 p.m. UTC | #1
On 31/07/2016 06:16, Gleb Natapov wrote:
> Sorry, I missed your answer since I am not subscribed and the reply was
> sent to the mailing list only.
> 
> On Thu, Jul 28, 2016 at 05:47:16PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 25/07/2016 11:23, Gleb Natapov wrote:
>>> Problem: exception handling is not scalable. The attached program mt.cc
>>> demonstrates this easily. It runs in around 1 second with 1 thread, but
>>> with only 4 threads it takes 4.5 seconds to complete. The reason is the
>>> locks that are taken on the unwind path.
>>>
>>> There are two locks right now.
>>>
>>> The first is in libgcc's _Unwind_Find_registered_FDE, and it is eliminated by
>>> https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01629.html (for dynamic
>>> executables only, but this is the common case).
>>>
>>> The second is dl_load_write_lock in __dl_iterate_phdr. It serves a dual
>>> purpose: it stops the list of loaded objects from being modified while
>>> iterating over it, and it makes sure that more than one callback will
>>> not run in parallel. This means that even if we find a clever way
>>> to make __dl_iterate_phdr iterate over the objects list locklessly, the
>>> lock will still have to be taken around the call to the callback to
>>> preserve the second guarantee.
>>>
>>> This patch proposes to introduce another API: dl_iterate_phdr_parallel,
>>> which does exactly the same thing as dl_iterate_phdr but may run more
>>> than one provided callback in parallel. To make it more scalable it
>>> breaks the single dl_load_write_lock into an array of locks. A reader
>>> takes only one lock, but a writer has to take all of them to proceed
>>> with modifying the list.
>>
>> Wouldn't a read-write lock solve the performance problem at dl_iterate_phdr?
> 
>> It should scale better than the current code, since dl_iterate_phdr should
>> act just as a reader on link_map structures and multiple readers should
>> only incur a lll_lock/lll_unlock on rwlock->__data.__lock.  I am assuming
>> that we set it to prefer readers and use a read-lock in dl_iterate_phdr.
>>
> I did try a read-write lock first and it did not improve the test case
> at all. The reason is (as I was very surprised to discover) that the rwlock
> takes a lock even if there are only readers. Though the lock is taken
> only for a short period of time, when it becomes congested scalability
> drops to the floor since all new waiters go to sleep in the kernel.
> 

Yes, and I had assumed this short lock contention wouldn't yield worse
performance, but that's not the case.

> Attached is a rather hacky patch that I used for testing.
> 
>> I see this as slightly better than trading memory for limited scalability
>> (the _DL_LOCKS value), and it also avoids adding another exported API.
> The memory tradeoff is rather small. Using an rwlock will not avoid
> the API addition though, since two callbacks will be able to run in
> parallel, and this is a user-visible difference in behaviour from the
> current dl_iterate_phdr. The C++ exception handling code, for instance,
> relies on the fact that callbacks do not run in parallel. It uses a global
> cache, which I made per-thread in the libgcc patch.

Now I understand better the issue your proposed ABI is trying to fix, after
checking the libstdc++/libgcc unwind code.  It looks like the callback mutual
exclusion requirement came from libstdc++ usage, since it is not documented
in the 'dl_iterate_phdr' documentation.

Now, I see we have two options here:

1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
   the current dl_iterate_phdr serializes callback execution and that programs
   which would like to scale its execution should move to dl_iterate_phdr_parallel.
   We would need to further document that the scalable implementation uses an
   internal array of 64 locks, and that depending on the number of threads
   and processors this limit can be hit.

2. Another option is to push a cleanup of the dl_iterate_phdr interface to *not*
   require callback serialization and use a rdlock while accessing the
   maps list.  With the current rwlock implementation performance won't change,
   as you noticed; however, since we are reworking it into a more
   scalable one [1], the read-only path should be *much* better:
   __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
   for the uncontended read case.

   The downside is that the current libgcc/libstdc++ requires serialized callback
   execution, so we will have to version dl_iterate_phdr.  Also I am not
   sure about other programs' usage of dl_iterate_phdr, so it might be an issue.

Also I noticed that other libcs (FreeBSD, uclibc, and musl) do not serialize
the callback execution, and adding a glibc-specific symbol (the parallel
version) would add more portability issues.  So I would prefer to work
toward 2.

[1] https://sourceware.org/ml/libc-alpha/2016-07/msg00636.html
  
Gleb Natapov Aug. 1, 2016, 6:49 p.m. UTC | #2
On Mon, Aug 01, 2016 at 03:06:32PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 31/07/2016 06:16, Gleb Natapov wrote:
> > Sorry, I missed your answer since I am not subscribed and the reply was
> > sent to the mailing list only.
> > 
> > On Thu, Jul 28, 2016 at 05:47:16PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 25/07/2016 11:23, Gleb Natapov wrote:
> >>> Problem: exception handling is not scalable. The attached program mt.cc
> >>> demonstrates this easily. It runs in around 1 second with 1 thread, but
> >>> with only 4 threads it takes 4.5 seconds to complete. The reason is the
> >>> locks that are taken on the unwind path.
> >>>
> >>> There are two locks right now.
> >>>
> >>> The first is in libgcc's _Unwind_Find_registered_FDE, and it is eliminated by
> >>> https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01629.html (for dynamic
> >>> executables only, but this is the common case).
> >>>
> >>> The second is dl_load_write_lock in __dl_iterate_phdr. It serves a dual
> >>> purpose: it stops the list of loaded objects from being modified while
> >>> iterating over it, and it makes sure that more than one callback will
> >>> not run in parallel. This means that even if we find a clever way
> >>> to make __dl_iterate_phdr iterate over the objects list locklessly, the
> >>> lock will still have to be taken around the call to the callback to
> >>> preserve the second guarantee.
> >>>
> >>> This patch proposes to introduce another API: dl_iterate_phdr_parallel,
> >>> which does exactly the same thing as dl_iterate_phdr but may run more
> >>> than one provided callback in parallel. To make it more scalable it
> >>> breaks the single dl_load_write_lock into an array of locks. A reader
> >>> takes only one lock, but a writer has to take all of them to proceed
> >>> with modifying the list.
> >>
> >> Wouldn't a read-write lock solve the performance problem at dl_iterate_phdr?
> > 
> >> It should scale better than the current code, since dl_iterate_phdr should
> >> act just as a reader on link_map structures and multiple readers should
> >> only incur a lll_lock/lll_unlock on rwlock->__data.__lock.  I am assuming
> >> that we set it to prefer readers and use a read-lock in dl_iterate_phdr.
> >>
> > I did try a read-write lock first and it did not improve the test case
> > at all. The reason is (as I was very surprised to discover) that the rwlock
> > takes a lock even if there are only readers. Though the lock is taken
> > only for a short period of time, when it becomes congested scalability
> > drops to the floor since all new waiters go to sleep in the kernel.
> > 
> 
> Yes, and I had assumed this short lock contention wouldn't yield worse
> performance, but that's not the case.
> 
> > Attached is a rather hacky patch that I used for testing.
> > 
> >> I see this as slightly better than trading memory for limited scalability
> >> (the _DL_LOCKS value), and it also avoids adding another exported API.
> > The memory tradeoff is rather small. Using an rwlock will not avoid
> > the API addition though, since two callbacks will be able to run in
> > parallel, and this is a user-visible difference in behaviour from the
> > current dl_iterate_phdr. The C++ exception handling code, for instance,
> > relies on the fact that callbacks do not run in parallel. It uses a global
> > cache, which I made per-thread in the libgcc patch.
> 
> Now I understand better the issue your proposed ABI is trying to fix, after
> checking the libstdc++/libgcc unwind code.  It looks like the callback mutual
> exclusion requirement came from libstdc++ usage, since it is not documented
> in the 'dl_iterate_phdr' documentation.
> 
Yes, precisely. I looked hard to find anything relevant in docs, but did
not see it documented.

> Now, I see we have two options here:
> 
> 1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
>    the current dl_iterate_phdr serializes callback execution and that programs
>    which would like to scale its execution should move to dl_iterate_phdr_parallel.
>    We would need to further document that the scalable implementation uses an
>    internal array of 64 locks, and that depending on the number of threads
>    and processors this limit can be hit.
The 64 locks are an implementation detail. It could be changed to a lock per
thread, where a writer has to take the thread-list lock plus all thread locks.

> 
> 2. Another option is to push a cleanup of the dl_iterate_phdr interface to *not*
>    require callback serialization and use a rdlock while accessing the
>    maps list.  With the current rwlock implementation performance won't change,
>    as you noticed; however, since we are reworking it into a more
>    scalable one [1], the read-only path should be *much* better:
>    __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
>    for the uncontended read case.
I saw the new rwlock implementation and tested it. It is much better than the
current rwlock, but still almost twice as slow as multiple locks.
Profiling shows that the exchange in unlock takes a lot of CPU. No wonder,
since a congested locking operation is very expensive. IMO the ideal solution
is an array of rwlocks.

> 
>    The downside is that the current libgcc/libstdc++ requires serialized callback
>    execution, so we will have to version dl_iterate_phdr.  Also I am not
>    sure about other programs' usage of dl_iterate_phdr, so it might be an issue.
Yes, I was worried about users we do not know about. The one in libgcc can be
fixed. I am not sure how to do the symbol versioning magic though; I will need
help with that.

> 
> Also I noticed that other libcs (FreeBSD, uclibc, and musl) do not serialize
> the callback execution, and adding a glibc-specific symbol (the parallel
> version) would add more portability issues.  So I would prefer to work
> toward 2.
How does libgcc work on those systems? Or do they provide their own
unwinding?

> 
> [1] https://sourceware.org/ml/libc-alpha/2016-07/msg00636.html

--
			Gleb.
  
Torvald Riegel Aug. 1, 2016, 7:46 p.m. UTC | #3
On Mon, 2016-08-01 at 21:49 +0300, Gleb Natapov wrote:
> On Mon, Aug 01, 2016 at 03:06:32PM -0300, Adhemerval Zanella wrote:
> > 2. Another option is to push a cleanup of the dl_iterate_phdr interface to *not*
> >    require callback serialization and use a rdlock while accessing the
> >    maps list.  With the current rwlock implementation performance won't change,
> >    as you noticed; however, since we are reworking it into a more
> >    scalable one [1], the read-only path should be *much* better:
> >    __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
> >    for the uncontended read case.
> I saw the new rwlock implementation and tested it. It is much better than the
> current rwlock, but still almost twice as slow as multiple locks.
> Profiling shows that the exchange in unlock takes a lot of CPU. No wonder,
> since a congested locking operation is very expensive. IMO the ideal solution
> is an array of rwlocks.

The new rwlock is built so that it supports process-shared usage, which
means that we have to put everything into struct pthread_rwlock_t.  This
will lead to contention if you rdlock it frequently from many threads.
There is potential for tuning there because we haven't looked closely at
adding back-off in the CAS loop (and if you tested on an arch without
direct HW support for fetch-add, the CAS loop used instead of that might
also be suboptimal).
Which machine did you test this on?

If we built something custom for this and are willing to make the
wrlock / exclusive-access case much more costly, we can decrease this
overhead.  This could be roughly similar to one lock per thread or a set
of rwlocks as you mentioned, but with less space overhead.
  
Gleb Natapov Aug. 1, 2016, 8:07 p.m. UTC | #4
On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> On Mon, 2016-08-01 at 21:49 +0300, Gleb Natapov wrote:
> > On Mon, Aug 01, 2016 at 03:06:32PM -0300, Adhemerval Zanella wrote:
> > > 2. Another option is to push a cleanup of the dl_iterate_phdr interface to *not*
> > >    require callback serialization and use a rdlock while accessing the
> > >    maps list.  With the current rwlock implementation performance won't change,
> > >    as you noticed; however, since we are reworking it into a more
> > >    scalable one [1], the read-only path should be *much* better:
> > >    __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
> > >    for the uncontended read case.
> > I saw the new rwlock implementation and tested it. It is much better than the
> > current rwlock, but still almost twice as slow as multiple locks.
> > Profiling shows that the exchange in unlock takes a lot of CPU. No wonder,
> > since a congested locking operation is very expensive. IMO the ideal solution
> > is an array of rwlocks.
> 
> The new rwlock is built so that it supports process-shared usage, which
> means that we have to put everything into struct pthread_rwlock_t.  This
> will lead to contention if you rdlock it frequently from many threads.
> There is potential for tuning there because we haven't looked closely at
> adding back-off in the CAS loop (and if you tested on an arch without
> direct HW support for fetch-add, the CAS loop used instead of that might
> also be suboptimal).
> Which machine did you test this on?
> 
x86_64

> If we built something custom for this and are willing to make the
> wrlock / exclusive-access case much more costly, we can decrease this
> overhead.  This could be roughly similar to one lock per thread or a set
> of rwlocks as you mentioned, but with less space overhead.
> 
IMO the space overhead is negligible. A more efficient rwlock is, of course,
better and can be useful in many more places. If you have something to
test I am willing to do so, but if a custom rwlock will take time to
materialize we may start with the lock array and change it later. The lock
array is not part of the interface, but an implementation detail. What I
would like to avoid is stalling the effort while waiting for something
better. Exception scalability is a very pressing issue for us.

--
			Gleb.
  
Torvald Riegel Aug. 1, 2016, 8:19 p.m. UTC | #5
On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > If we built something custom for this and are willing to make the
> > wrlock / exclusive-access case much more costly, we can decrease this
> > overhead.  This could be roughly similar to one lock per thread or a set
> > of rwlocks as you mentioned, but with less space overhead.
> > 
> IMO the space overhead is negligible. A more efficient rwlock is, of course,
> better and can be useful in many more places. If you have something to
> test I am willing to do so, but if a custom rwlock will take time to
> materialize we may start with the lock array and change it later. The lock
> array is not part of the interface, but an implementation detail. What I
> would like to avoid is stalling the effort while waiting for something
> better. Exception scalability is a very pressing issue for us.

I haven't looked at the workload in detail, and so can't say off-hand
what would be best and how long it would take.  Is there a summary
describing the synchronization that is necessary?  Once I know that, I
can give a closer estimate.  It could be simple to do, so it might be
worth having a look before trying the (rw?)lock-array approach.

The POSIX rwlock's complexity is partly due to having to support the
different modes (eg, reader preference), so a custom rwlock for a
specific use case could be much simpler.  I would guess that it looks
more like reference counting, or if the snapshots of the readers are
suitable, it could also be seqlock based (and thus require no memory
modification by readers at all).
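The seqlock idea mentioned above can be sketched as follows (a generic C11 version, not glibc code; the writer is assumed externally serialized, and the reader must copy everything it uses before revalidating the sequence number, which is exactly the snapshot requirement Gleb raises in his reply):

```c
#include <stdatomic.h>

static atomic_uint seq;        /* even: data stable; odd: writer in progress */
static int shared_data;        /* stand-in for the object-list snapshot */

static void
write_update (int value)       /* writers assumed externally serialized */
{
  atomic_fetch_add_explicit (&seq, 1, memory_order_relaxed);   /* now odd */
  atomic_thread_fence (memory_order_release);
  shared_data = value;
  atomic_thread_fence (memory_order_release);
  atomic_fetch_add_explicit (&seq, 1, memory_order_relaxed);   /* even again */
}

static int
read_snapshot (void)
{
  unsigned s;
  int v;
  do
    {
      s = atomic_load_explicit (&seq, memory_order_acquire);
      v = shared_data;         /* must be a full copy of everything used */
      atomic_thread_fence (memory_order_acquire);
      /* Retry if a writer was active (odd) or intervened (seq changed).  */
    }
  while ((s & 1) || s != atomic_load_explicit (&seq, memory_order_relaxed));
  return v;                    /* the reader never wrote shared memory */
}
```

The attraction for dl_iterate_phdr is the last comment: readers modify no shared state at all, so there is no inter-core traffic on the read side.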
  
Adhemerval Zanella Netto Aug. 1, 2016, 8:23 p.m. UTC | #6
On 01/08/2016 15:49, Gleb Natapov wrote:
> On Mon, Aug 01, 2016 at 03:06:32PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 31/07/2016 06:16, Gleb Natapov wrote:
>>> Sorry, I missed your answer since I am not subscribed and the reply was
>>> sent to the mailing list only.
>>>
>>> On Thu, Jul 28, 2016 at 05:47:16PM -0300, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 25/07/2016 11:23, Gleb Natapov wrote:
>>>>> Problem: exception handling is not scalable. The attached program mt.cc
>>>>> demonstrates this easily. It runs in around 1 second with 1 thread, but
>>>>> with only 4 threads it takes 4.5 seconds to complete. The reason is the
>>>>> locks that are taken on the unwind path.
>>>>>
>>>>> There are two locks right now.
>>>>>
>>>>> The first is in libgcc's _Unwind_Find_registered_FDE, and it is eliminated by
>>>>> https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01629.html (for dynamic
>>>>> executables only, but this is the common case).
>>>>>
>>>>> The second is dl_load_write_lock in __dl_iterate_phdr. It serves a dual
>>>>> purpose: it stops the list of loaded objects from being modified while
>>>>> iterating over it, and it makes sure that more than one callback will
>>>>> not run in parallel. This means that even if we find a clever way
>>>>> to make __dl_iterate_phdr iterate over the objects list locklessly, the
>>>>> lock will still have to be taken around the call to the callback to
>>>>> preserve the second guarantee.
>>>>>
>>>>> This patch proposes to introduce another API: dl_iterate_phdr_parallel,
>>>>> which does exactly the same thing as dl_iterate_phdr but may run more
>>>>> than one provided callback in parallel. To make it more scalable it
>>>>> breaks the single dl_load_write_lock into an array of locks. A reader
>>>>> takes only one lock, but a writer has to take all of them to proceed
>>>>> with modifying the list.
>>>>
>>>> Wouldn't a read-write lock solve the performance problem at dl_iterate_phdr?
>>>
>>>> It should scale better than the current code, since dl_iterate_phdr should
>>>> act just as a reader on link_map structures and multiple readers should
>>>> only incur a lll_lock/lll_unlock on rwlock->__data.__lock.  I am assuming
>>>> that we set it to prefer readers and use a read-lock in dl_iterate_phdr.
>>>>
>>> I did try a read-write lock first and it did not improve the test case
>>> at all. The reason is (as I was very surprised to discover) that the rwlock
>>> takes a lock even if there are only readers. Though the lock is taken
>>> only for a short period of time, when it becomes congested scalability
>>> drops to the floor since all new waiters go to sleep in the kernel.
>>>
>>
>> Yes, and I had assumed this short lock contention wouldn't yield worse
>> performance, but that's not the case.
>>
>>> Attached is a rather hacky patch that I used for testing.
>>>
>>>> I see this as slightly better than trading memory for limited scalability
>>>> (the _DL_LOCKS value), and it also avoids adding another exported API.
>>> The memory tradeoff is rather small. Using an rwlock will not avoid
>>> the API addition though, since two callbacks will be able to run in
>>> parallel, and this is a user-visible difference in behaviour from the
>>> current dl_iterate_phdr. The C++ exception handling code, for instance,
>>> relies on the fact that callbacks do not run in parallel. It uses a global
>>> cache, which I made per-thread in the libgcc patch.
>>
>> Now I understand better the issue your proposed ABI is trying to fix, after
>> checking the libstdc++/libgcc unwind code.  It looks like the callback mutual
>> exclusion requirement came from libstdc++ usage, since it is not documented
>> in the 'dl_iterate_phdr' documentation.
>>
> Yes, precisely. I looked hard to find anything relevant in docs, but did
> not see it documented.
> 
>> Now, I see we have two options here:
>>
>> 1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
>>    the current dl_iterate_phdr serializes callback execution and that programs
>>    which would like to scale its execution should move to dl_iterate_phdr_parallel.
>>    We would need to further document that the scalable implementation uses an
>>    internal array of 64 locks, and that depending on the number of threads
>>    and processors this limit can be hit.
> The 64 locks are an implementation detail. It could be changed to a lock per
> thread, where a writer has to take the thread-list lock plus all thread locks.

My point here is more to document the current behaviour and performance
assumptions, and to actually discuss the choice of 64 for the number of locks.

> 
>>
>> 2. Another option is to push a cleanup of the dl_iterate_phdr interface to *not*
>>    require callback serialization and use a rdlock while accessing the
>>    maps list.  With the current rwlock implementation performance won't change,
>>    as you noticed; however, since we are reworking it into a more
>>    scalable one [1], the read-only path should be *much* better:
>>    __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
>>    for the uncontended read case.
> I saw the new rwlock implementation and tested it. It is much better than the
> current rwlock, but still almost twice as slow as multiple locks.
> Profiling shows that the exchange in unlock takes a lot of CPU. No wonder,
> since a congested locking operation is very expensive. IMO the ideal solution
> is an array of rwlocks.

I would expect so, since the unlock phase in the new algorithm will issue an
'atomic_compare_exchange_weak_release' in a loop, so it will be potentially
slower than just a 'lll_unlock' (which issues a lock decl on x86_64).

Performance-wise an array of rdlocks should be the best approach.  What I am
not comfortable with is adding another glibc-specific interface just to
work around a bad interface specification.

> 
>>
>>    The downside is that the current libgcc/libstdc++ requires serialized callback
>>    execution, so we will have to version dl_iterate_phdr.  Also I am not
>>    sure about other programs' usage of dl_iterate_phdr, so it might be an issue.
> Yes, I was worried about users we do not know about. The one in libgcc can be
> fixed. I am not sure how to do the symbol versioning magic though; I will need
> help with that.
> 
>>
>> Also I noticed that other libcs (FreeBSD, uclibc, and musl) do not serialize
>> the callback execution, and adding a glibc-specific symbol (the parallel
>> version) would add more portability issues.  So I would prefer to work
>> toward 2.
> How does libgcc work on those systems? Or do they provide their own
> unwinding?
> 

I looked at the wrong implementation for FreeBSD 10; the one actually
used [1] does add locking around the callback call.

So my guess is that neither uclibc nor musl actually works with libgcc
unwinding.

[1] https://github.com/freebsd/freebsd/blob/f4f3cadf4f46110669367b597c4d31e1425f4cbd/libexec/rtld-elf/rtld.c

>>
>> [1] https://sourceware.org/ml/libc-alpha/2016-07/msg00636.html
> 
> --
> 			Gleb.
>
  
Gleb Natapov Aug. 1, 2016, 8:41 p.m. UTC | #7
On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > If we built something custom for this and are willing to make the
> > > wrlock / exclusive-access case much more costly, we can decrease this
> > > overhead.  This could be roughly similar to one lock per thread or a set
> > > of rwlocks as you mentioned, but with less space overhead.
> > > 
> > IMO the space overhead is negligible. A more efficient rwlock is, of course,
> > better and can be useful in many more places. If you have something to
> > test I am willing to do so, but if a custom rwlock will take time to
> > materialize we may start with the lock array and change it later. The lock
> > array is not part of the interface, but an implementation detail. What I
> > would like to avoid is stalling the effort while waiting for something
> > better. Exception scalability is a very pressing issue for us.
> 
> I haven't looked at the workload in detail, and so can't say off-hand
> what would be best and how long it would take.  Is there a summary
> describing the synchronization that is necessary?  Once I know that, I
> can give a closer estimate.  It could be simple to do, so it might be
> worth having a look before trying the (rw?)lock-array approach.
> 
rwlock semantics are pretty much what is needed. There is a list of
loaded objects that changes rarely (only during dlopen/dlclose), but it
is accessed for reading on each exception. Exceptions on different threads
are independent, so they should not interfere with each other. Having
only one lock (even an rw one) means that all threads will access the same
atomic variable for synchronisation, which has a lot of overhead on
modern CPUs. The lock-array approach is a way to allow multiple threads to
proceed completely independently, without any inter-core communication at
all. It is hard to beat. The new rwlock is substantially slower with only 4
cores (though it is much better than the current rwlock since it avoids
entering the kernel if there are only readers).
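The lock-array scheme being argued for might look roughly like this (hypothetical names; 64 slots matches the _DL_LOCKS value mentioned earlier in the thread, and the actual patch may differ). Each reader takes only the slot its thread hashes to, so independent readers touch disjoint cache lines; the rare writer (dlopen/dlclose) must take every slot, in a fixed order to avoid deadlock:

```c
#include <pthread.h>
#include <stdint.h>

#define NLOCKS 64

/* Pad each slot to its own cache line so readers do not false-share.  */
typedef struct
{
  pthread_mutex_t m;
} __attribute__ ((aligned (64))) lock_slot;

/* GNU C range-designator initializer, as used widely inside glibc.  */
static lock_slot locks[NLOCKS] =
  { [0 ... NLOCKS - 1] = { PTHREAD_MUTEX_INITIALIZER } };

static unsigned
slot_of_thread (void)
{
  /* Any cheap, stable per-thread hash will do; pthread_self is one option.  */
  return (unsigned) (((uintptr_t) pthread_self ()) >> 8) % NLOCKS;
}

static unsigned
read_lock (void)            /* fast path: one uncontended mutex per reader */
{
  unsigned i = slot_of_thread ();
  pthread_mutex_lock (&locks[i].m);
  return i;
}

static void
read_unlock (unsigned i)
{
  pthread_mutex_unlock (&locks[i].m);
}

static void
write_lock_all (void)       /* slow path: dlopen/dlclose take every slot */
{
  for (int i = 0; i < NLOCKS; i++)
    pthread_mutex_lock (&locks[i].m);
}

static void
write_unlock_all (void)
{
  for (int i = NLOCKS - 1; i >= 0; i--)
    pthread_mutex_unlock (&locks[i].m);
}
```

Two or more readers that hash to different slots never touch the same memory, which is what makes this hard to beat; the cost is a 64x more expensive writer, which is acceptable because dlopen/dlclose are rare.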

> The POSIX rwlock's complexity is partly due to having to support the
> different modes (eg, reader preference), so a custom rwlock for a
> specific use case could be much simpler.  I would guess that it looks
> more like reference counting, or if the snapshots of the readers are
> suitable, it could also be seqlock based (and thus require no memory
> modification by readers at all).
> 
I evaluated the seqlock approach, but snapshotting the object list requires
a pretty deep copy (for the exception handling case), and the data accessed
varies for each user of dl_iterate_phdr (elements of the list point into
mmapped parts of .so files).

--
			Gleb.
  
Gleb Natapov Aug. 2, 2016, 10:46 a.m. UTC | #8
On Mon, Aug 01, 2016 at 05:23:47PM -0300, Adhemerval Zanella wrote:
> >> Now, I see we have two options here:
> >>
> >> 1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
> >>    the current dl_iterate_phdr serializes callback execution and that programs
> >>    which would like to scale its execution should move to dl_iterate_phdr_parallel.
> >>    We would need to further document that the scalable implementation uses an
> >>    internal array of 64 locks, and that depending on the number of threads
> >>    and processors this limit can be hit.
> > The 64 locks are an implementation detail. It could be changed to a lock per
> > thread, where a writer has to take the thread-list lock plus all thread locks.
> 
> My point here is more to document the current behaviour and performance
> assumptions, and to actually discuss the choice of 64 for the number of locks.
> 
OK.

> > 
> >>
> >> 2. Another option is to push a cleanup of the dl_iterate_phdr interface to *not*
> >>    require callback serialization and use a rdlock while accessing the
> >>    maps list.  With the current rwlock implementation performance won't change,
> >>    as you noticed; however, since we are reworking it into a more
> >>    scalable one [1], the read-only path should be *much* better:
> >>    __pthread_rwlock_rdlock_full should issue just an atomic_fetch_add_acquire
> >>    for the uncontended read case.
> > I saw the new rwlock implementation and tested it. It is much better than the
> > current rwlock, but still almost twice as slow as multiple locks.
> > Profiling shows that the exchange in unlock takes a lot of CPU. No wonder,
> > since a congested locking operation is very expensive. IMO the ideal solution
> > is an array of rwlocks.
> 
> I would expect so, since the unlock phase in the new algorithm will issue an
> 'atomic_compare_exchange_weak_release' in a loop, so it will be potentially
> slower than just a 'lll_unlock' (which issues a lock decl on x86_64).
>
> Performance-wise an array of rdlocks should be the best approach.  What I am
> not comfortable with is adding another glibc-specific interface just to
> work around a bad interface specification.
> 
I proposed the new interface as an RFC to be on the safe side; I am not against
fixing the existing one. What are the consequences of modifying the current
behaviour, though? Old libgcc will stop working with a new glibc, but as far
as I understand that can be fixed with symbol versioning (how?). What about
other applications, if any?

--
			Gleb.
  
Adhemerval Zanella Netto Aug. 2, 2016, 2:16 p.m. UTC | #9
On 02/08/2016 07:46, Gleb Natapov wrote:
> On Mon, Aug 01, 2016 at 05:23:47PM -0300, Adhemerval Zanella wrote:
>>>> Now, I see we have two options here:
>>>>
>>>> 1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
>>>>    current dl_iterate_phdr serializes callback execution and programs that
>>>>    would like to scale its execution should move to dl_iterate_phdr_parallel.
>>>>    We would need to further document that scalability uses an internal 64
>>>>    locks and using depending of thread and number of processor this limit
>>>>    can be hit.
>>> 64 locks is implementation detail. It could be change to lock per thread
>>> where a writer has to lock thread list + all thread locks.
>>
>> My point here is more to document current behaviour and performance assumption
>> to actually discuss the choice of 64 for lock numbers.
>>
> OK.
> 
>>>
>>>>
>>>> 2. Another option is to push to cleanup dl_iterate_phdr interface to *not*
>>>>    require callback serialization and use a rdlock while accessing the
>>>>    maps list.  With current rwlock implementation performance won't change
>>>>    as you noticed, however since we are reworking and changing to a more
>>>>    scalable one [1] the read only pass should *much* better: the 
>>>>    __pthread_rwlock_rdlock_full should issue just a atomic_fetch_add_acquire
>>>>    for uncontented read case.
>>> I saw new rwlock implementation and tested it. It is much better that
>>> current rwlock, but still almost twice slower than multiple locks.
>>> Profiling shows that exchange in unlock takes a lot of cpu. No wonder
>>> since congested locking operation is very expensive. IMO ideal solution
>>> is array of rwlocks.
>>
>> I would expect so since the unlock phase on new algorithm will yield an
>> 'atomic_compare_exchange_weak_release' in a loop it will be potentially 
>> slower than just a 'lll_unlock' (which issues a lock decl on x86_64).
>>
>> Performance-wise an array of rdlocks should be best approach.  What I am
>> not comfortable with is adding another glibc specific interface to clear
>> circle around a bad interface specification.
>>
> I proposed new interface to be on a safe side as an RFC, I am not against
> fixing existing one. What is the consequences of modifying current behaviour
> though? Old libgcc will stop working with new glibc, but as far as I
> understand that can be fixed with symbol versioning (how?). What about other
> applications if any?
> 

First we need to decide whether we should change the dl_iterate_phdr
interface to remove the requirement of running callbacks mutually
exclusively.  If the idea is to keep the current behaviour, I think we need
to make this behaviour explicit in the documentation and check whether it is
possible to add the lock-array mechanism without adding the parallel version.

If we decide that locking around the callback call is not meant to be an
interface requirement, we can add a new version (2.25 for instance) that does
not have this behaviour while still providing an older one (2.2.5) that still
does the locking.  The main problem is that building an old libstdc++ with a
newer glibc might not work, so I am not sure this is the best course of action.
  
Gleb Natapov Aug. 2, 2016, 2:55 p.m. UTC | #10
On Tue, Aug 02, 2016 at 11:16:33AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 02/08/2016 07:46, Gleb Natapov wrote:
> > On Mon, Aug 01, 2016 at 05:23:47PM -0300, Adhemerval Zanella wrote:
> >>>> Now, I see we have two options here:
> >>>>
> >>>> 1. Work on your proposal for dl_iterate_phdr_parallel and *document* that
> >>>>    current dl_iterate_phdr serializes callback execution and programs that
> >>>>    would like to scale its execution should move to dl_iterate_phdr_parallel.
> >>>>    We would need to further document that scalability uses an internal 64
> >>>>    locks and using depending of thread and number of processor this limit
> >>>>    can be hit.
> >>> 64 locks is implementation detail. It could be change to lock per thread
> >>> where a writer has to lock thread list + all thread locks.
> >>
> >> My point here is more to document current behaviour and performance assumption
> >> to actually discuss the choice of 64 for lock numbers.
> >>
> > OK.
> > 
> >>>
> >>>>
> >>>> 2. Another option is to push to cleanup dl_iterate_phdr interface to *not*
> >>>>    require callback serialization and use a rdlock while accessing the
> >>>>    maps list.  With current rwlock implementation performance won't change
> >>>>    as you noticed, however since we are reworking and changing to a more
> >>>>    scalable one [1] the read only pass should *much* better: the 
> >>>>    __pthread_rwlock_rdlock_full should issue just a atomic_fetch_add_acquire
> >>>>    for uncontented read case.
> >>> I saw new rwlock implementation and tested it. It is much better that
> >>> current rwlock, but still almost twice slower than multiple locks.
> >>> Profiling shows that exchange in unlock takes a lot of cpu. No wonder
> >>> since congested locking operation is very expensive. IMO ideal solution
> >>> is array of rwlocks.
> >>
> >> I would expect so since the unlock phase on new algorithm will yield an
> >> 'atomic_compare_exchange_weak_release' in a loop it will be potentially 
> >> slower than just a 'lll_unlock' (which issues a lock decl on x86_64).
> >>
> >> Performance-wise an array of rdlocks should be best approach.  What I am
> >> not comfortable with is adding another glibc specific interface to clear
> >> circle around a bad interface specification.
> >>
> > I proposed new interface to be on a safe side as an RFC, I am not against
> > fixing existing one. What is the consequences of modifying current behaviour
> > though? Old libgcc will stop working with new glibc, but as far as I
> > understand that can be fixed with symbol versioning (how?). What about other
> > applications if any?
> > 
> 
> First we need to define if we should change dl_iterate_phdr interface to
> remove or not the requirement of running callback mutually exclusive.  If
> this idea is to keep with current behaviour I see we need to work on
> make this behaviour explicit in documentation and check if it is possible
> to add the lock array mechanism without the add of parallel version.
>
What is the procedure for deciding that? Both options are OK with me. You
guys have more experience with glibc compatibility problems/requirements.

> If we decide that locking of callback call are not meant to be the interface
> requirement, we can add a new version (2.25 for instance) that does not
> have this behaviour while still providing an older one (2.2.5) that still
> does locking.  Main problem is building old libstdc++ with newer glibc might
> not work, so I am not sure if this is the best course of action.
I am reading about symbol versioning and it looks like we can define the old
interface (2.2.5) as the default one and use the new one explicitly in new
libstdc++. What I have not figured out yet is whether it is possible for new
libstdc++ to check at runtime that the 2.25 interface is available and use it
if it is. For the _parallel variant I can do it like this:

extern int dl_iterate_phdr_parallel (int (*__callback) (struct dl_phdr_info *, size_t, void *),
                                     void *__data) __attribute__((weak));

  if (dl_iterate_phdr_parallel) {
    if (dl_iterate_phdr_parallel (_Unwind_IteratePhdrCallback, &data) < 0)
      return NULL;
  } else {
    if (dl_iterate_phdr (_Unwind_IteratePhdrCallback, &data) < 0)
      return NULL;
  }

This will use dl_iterate_phdr_parallel if it is available in glibc and
dl_iterate_phdr otherwise.

--
			Gleb.
  
Gleb Natapov Aug. 2, 2016, 3:31 p.m. UTC | #11
On Mon, Aug 01, 2016 at 05:23:47PM -0300, Adhemerval Zanella wrote:
> >> Also I noticed other libcs (freebsd, uclibc, and musl) do not serialize
> >> the callback execution and adding a glibc specific symbol (the parallel
> >> version) would add more portabilities issues.  So I would prefer to work
> >> toward 2.
> > How does libgcc works on those systems? Or do they provide their own
> > unwinding?
> > 
> 
> I looked on the wrong implementation for FreeBSD10, the one actually
> used [1] does add locking on callback call. 
> 
And [2] is where it was added. Not surprisingly the reason for the
locking is C++ exception handling.

[2] https://github.com/freebsd/freebsd/commit/60e5d7d0529be9f877e67cf34013ac32320062f0

> So uclibc and musl my guess is neither actually work with libgcc unwinding.
> 
> [1] https://github.com/freebsd/freebsd/blob/f4f3cadf4f46110669367b597c4d31e1425f4cbd/libexec/rtld-elf/rtld.c
> 

--
			Gleb.
  
Florian Weimer Aug. 3, 2016, 10:53 a.m. UTC | #12
On 08/01/2016 09:46 PM, Torvald Riegel wrote:
> The new rwlock is built so that it supports process-shared usage, which
> means that we have to put everything into struct pthread_rwlock_t.  This
> will lead to contention if you rdlock it frequently from many threads.
> There is potential for tuning there because we haven't looked closely at
> adding back-off in the CAS loop (and if you tested on an arch without
> direct HW support for fetch-add, the CAS loop used instead of that might
> also be suboptimal).

The rwlock doesn't eliminate the contention at the hardware level.

If that causes a performance issue, we could reuse Ingo Molnar's brlock 
approach: per-thread, readers acquire their own lock, writers acquire 
the locks of all threads.  This is fairly efficient in the read case 
(and I suspect you can't get much better than that in a non-managed run 
time), but the write case is obviously extremely costly.  This could be 
the right trade-off here, though.

Florian
  
Adhemerval Zanella Netto Aug. 3, 2016, 2 p.m. UTC | #13
On 03/08/2016 07:53, Florian Weimer wrote:
> On 08/01/2016 09:46 PM, Torvald Riegel wrote:
>> The new rwlock is built so that it supports process-shared usage, which
>> means that we have to put everything into struct pthread_rwlock_t.  This
>> will lead to contention if you rdlock it frequently from many threads.
>> There is potential for tuning there because we haven't looked closely at
>> adding back-off in the CAS loop (and if you tested on an arch without
>> direct HW support for fetch-add, the CAS loop used instead of that might
>> also be suboptimal).
> 
> The rwlock doesn't eliminate the contention at the hardware level.
> 
> If that causes a performance issue, we could reuse Ingo Molnar's brlock approach: per-thread, readers acquire their own lock, writers acquire the locks of all threads.  This is fairly efficient in the read case (and I suspect you can't get much better than that in a non-managed run time), but the write case is obviously extremely costly.  This could be the right trade-off here, though.
> 
> Florian

The only difference is that lglocks/brlocks are per-CPU in the kernel, not
per-thread.  My concern is what kind of writer degradation we could see in a
highly threaded workload (for instance, a threaded C++ workload with some
exceptions that tries to load a plugin).

It could be that a constant write-lock array, as in the initial proposal, is
the better starting point.
  
Gleb Natapov Aug. 3, 2016, 2:30 p.m. UTC | #14
On Wed, Aug 03, 2016 at 11:00:39AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 03/08/2016 07:53, Florian Weimer wrote:
> > On 08/01/2016 09:46 PM, Torvald Riegel wrote:
> >> The new rwlock is built so that it supports process-shared usage, which
> >> means that we have to put everything into struct pthread_rwlock_t.  This
> >> will lead to contention if you rdlock it frequently from many threads.
> >> There is potential for tuning there because we haven't looked closely at
> >> adding back-off in the CAS loop (and if you tested on an arch without
> >> direct HW support for fetch-add, the CAS loop used instead of that might
> >> also be suboptimal).
> > 
> > The rwlock doesn't eliminate the contention at the hardware level.
> > 
> > If that causes a performance issue, we could reuse Ingo Molnar's brlock approach: per-thread, readers acquire their own lock, writers acquire the locks of all threads.  This is fairly efficient in the read case (and I suspect you can't get much better than that in a non-managed run time), but the write case is obviously extremely costly.  This could be the right trade-off here, though.
> > 
> > Florian
> 
> The only difference is lglocks/brlocks are per-cpu in kernel, not per-thread.
I proposed the same algorithm that Florian describes somewhere in this
thread as an alternative too. The Linux kernel does a similar trick in the
mmu notifiers code (search for mm_take_all_locks), but I wouldn't go down
this more complicated route until it is proven that the current version has
scalability problems.

> My concern is what kind of writer degradation it could be in a highly threaded
> workload (for instance, a threaded c++ workload with some exceptions that tries
> to load a plugin).
> 
The forward progress of the plugin loader is guaranteed.

> It could be the case a constant write lock array, as the initial proposal, could
> be a better initial proposal. 
Agree 100%.

--
			Gleb.
  
Torvald Riegel Aug. 3, 2016, 4:12 p.m. UTC | #15
On Wed, 2016-08-03 at 12:53 +0200, Florian Weimer wrote:
> On 08/01/2016 09:46 PM, Torvald Riegel wrote:
> > The new rwlock is built so that it supports process-shared usage, which
> > means that we have to put everything into struct pthread_rwlock_t.  This
> > will lead to contention if you rdlock it frequently from many threads.
> > There is potential for tuning there because we haven't looked closely at
> > adding back-off in the CAS loop (and if you tested on an arch without
> > direct HW support for fetch-add, the CAS loop used instead of that might
> > also be suboptimal).
> 
> The rwlock doesn't eliminate the contention at the hardware level.

It does not, of course, as I wrote.  But there are ways to decrease the
contention (e.g., by proper back-off), which have not been added yet.

> If that causes a performance issue, we could reuse Ingo Molnar's brlock 
> approach: per-thread, readers acquire their own lock, writers acquire 
> the locks of all threads.  This is fairly efficient in the read case 
> (and I suspect you can't get much better than that in a non-managed run 
> time), but the write case is obviously extremely costly.  This could be 
> the right trade-off here, though.

You can do better, for example if all that the rdlock critical sections
do is snapshot data for which the underlying memory will not be
unmapped.  That's why I asked about the details of the synchronization
problem :)
  
Torvald Riegel Aug. 3, 2016, 5:26 p.m. UTC | #16
On Mon, 2016-08-01 at 23:41 +0300, Gleb Natapov wrote:
> On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> > On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > > If we built something custom for this and are willing to make the
> > > > wrlock / exclusive-access case much more costly, we can decrease this
> > > > overhead.  This could be roughly similar to one lock per thread or a set
> > > > of rwlocks as you mentioned, but with less space overhead.
> > > > 
> > > IMO space overhead is negligible. More efficient rwlock is, of course,
> > > better and can be useful in many more places. If you have something to
> > > test I am willing to do so, but if custom rwlock will take time to
> > > materialize we may start with lock array and change it later. The lock
> > > array is not part of the interface, but implementation detail. What I
> > > would like to avoid is stalling the afford while waiting for something
> > > better. Exception scalability is very pressing issue for us.
> > 
> > I haven't looked at the workload in detail, and so can't say off-hand
> > what would be best and how long it would take.  Is there a summary
> > describing the synchronization that is necessary?  Once I know that, I
> > can give a closer estimate.  It could be simple to do, so it might be
> > worth having a look before trying the (rw?)lock-array approach.
> > 
> rwlock semantics is pretty much what is needed. There is a list of
> loaded objects that changes rarely (only during dlopen/dlclose), but it
> is accessed for read on each exception.

Does "for read" mean that the only thing you are interested in is some
consistent snapshot of the list, or something like a yes/no result for a
query about whether some element is contained?
Is this list in memory that is never unmapped?  Can we construct it in
such a way that we'll follow only a few pointers for which we don't
quite know whether they point to proper elements of the list?

If we can do that, then we should be able to run an atomic snapshot
speculatively.  IOW, do what many STMs do, or seqlocks, etc.

> Exceptions on different threads
> are independent, so they should not interfere with each other. Having
> only one lock (even rw one) means that all threads will access same
> atomic variable for synchronisation which has a lot of overhead on
> modern cpus. lock-array approach is a way to allow multiple threads to
> process completely independently without any inter-core communication at
> all. It is hard to beat.

You can try to never modify memory in the common case, which is better.
Additionally, the problem with the lock array is that you have to put
the locks somewhere, preferably so that they are in a page that is local
to where the thread runs.  This makes a simple lock array suboptimal, at
least on bigger systems.  TLS space would be most likely to be backed by
local memory, but as Adhemerval has mentioned, we might have lots of
threads, and if we do TLS, we need to sync between thread destruction
and the heavyweight operation that wants to acquire all locks.

> New rwlock is substantially slower on only 4
> cores (it is much better than current rwlock since it avoids entering
> kernel if there are only readers).

Is this a microbenchmark just for acquisition/release of the rwlock, or
does this significant difference also show up in the real scenario you
are concerned about (the exception storm you mentioned)?

> > The POSIX rwlock's complexity is partly due to having to support the
> > different modes (eg, reader preference), so a custom rwlock for a
> > specific use case could be much simpler.  I would guess that it looks
> > more like reference counting, or if the snapshots of the readers are
> > suitable, it could also be seqlock based (and thus require no memory
> > modification by readers at all).
> > 
> I evaluated seqlock approach, but snapshoting of the object list requires
> pretty deep copy (for exception handling case)

Why do you need to copy?  Can you point me to the list datastructure you
need to snapshot?

> and the data accessed
> varies for each user of dl_iterate_phdr (elements of the list point into
> mmaped parts of .so files).

That doesn't matter.  The only overhead is how often we need to validate
so that we know that a subsequent pointer dereference is safe.  The
number of these validations compared to a lock acquisition+release is the
deciding factor here.
  
Gleb Natapov Aug. 3, 2016, 5:47 p.m. UTC | #17
On Wed, Aug 03, 2016 at 07:26:26PM +0200, Torvald Riegel wrote:
> On Mon, 2016-08-01 at 23:41 +0300, Gleb Natapov wrote:
> > On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> > > On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > > > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > > > If we built something custom for this and are willing to make the
> > > > > wrlock / exclusive-access case much more costly, we can decrease this
> > > > > overhead.  This could be roughly similar to one lock per thread or a set
> > > > > of rwlocks as you mentioned, but with less space overhead.
> > > > > 
> > > > IMO space overhead is negligible. More efficient rwlock is, of course,
> > > > better and can be useful in many more places. If you have something to
> > > > test I am willing to do so, but if custom rwlock will take time to
> > > > materialize we may start with lock array and change it later. The lock
> > > > array is not part of the interface, but implementation detail. What I
> > > > would like to avoid is stalling the afford while waiting for something
> > > > better. Exception scalability is very pressing issue for us.
> > > 
> > > I haven't looked at the workload in detail, and so can't say off-hand
> > > what would be best and how long it would take.  Is there a summary
> > > describing the synchronization that is necessary?  Once I know that, I
> > > can give a closer estimate.  It could be simple to do, so it might be
> > > worth having a look before trying the (rw?)lock-array approach.
> > > 
> > rwlock semantics is pretty much what is needed. There is a list of
> > loaded objects that changes rarely (only during dlopen/dlclose), but it
> > is accessed for read on each exception.
> 
> Does "for read" mean that the only thing you are interested in is some
> consistent snapshot of the list, or something like a yes/no result for a
> query about whether some element is contained?
No. The list contains objects that have pointers into some other memory that
the locking of the list prevents from being unmapped.

> Is this list in memory that is never unmapped?  Can we construct it in
> such a way that we'll follow only a few pointers for which we don't
> quite know whether they point to proper elements of the list?
> 
> If we can do that, then we should be able to run an atomic snapshot
> speculatively.  IOW, do what many STMs do, or seqlocks, etc.
I do not see how.

> 
> > Exceptions on different threads
> > are independent, so they should not interfere with each other. Having
> > only one lock (even rw one) means that all threads will access same
> > atomic variable for synchronisation which has a lot of overhead on
> > modern cpus. lock-array approach is a way to allow multiple threads to
> > process completely independently without any inter-core communication at
> > all. It is hard to beat.
> 
> You can try to never modify memory in the common case, which is better.
Which memory? No memory, besides the lock, is modified in the common case.

> Additionally, the problem with the lock array is that you have to put
> the locks somewhere, preferably so that they are in a page that is local
> to where the thread runs. This makes a simple lock array suboptimal, at
> least on bigger systems.  TLS space would be most likely to be backed by
> local memory, but as Adhemerval has mentioned, we might have lots of
> threads, and if we do TLS, we need to sync between thread destruction
> and the heavyweight operation that wants to acquire all locks.
> 
The lock array is global, no TLS is involved. There is no problem of the
kind you describe.

> > New rwlock is substantially slower on only 4
> > cores (it is much better than current rwlock since it avoids entering
> > kernel if there are only readers).
> 
> Is this a microbenchmark just for acquisition/release of the rwlock, or
> does this significant difference also show up in the real scenario you
> are concerned about (the exception storm you mentioned)?
That's on a micro benchmark. But it simulates pretty closely what
sometimes happens on our real system when multiple threads encounter a
burst of exceptional conditions (a network connection drops and all
queued requests, sometimes thousands, terminate with exceptions).

> 
> > > The POSIX rwlock's complexity is partly due to having to support the
> > > different modes (eg, reader preference), so a custom rwlock for a
> > > specific use case could be much simpler.  I would guess that it looks
> > > more like reference counting, or if the snapshots of the readers are
> > > suitable, it could also be seqlock based (and thus require no memory
> > > modification by readers at all).
> > > 
> > I evaluated seqlock approach, but snapshoting of the object list requires
> > pretty deep copy (for exception handling case)
> 
> Why do you need to copy? 
We need to copy because if we access this data without a lock it may
disappear.
 
>                          Can you point me to the list datastructure you
> need to snapshot?
> 
Look at __dl_iterate_phdr() where it iterates over the GL(dl_ns)[ns]._ns_loaded
list. It calls the provided callback with a structure that holds addresses
which are protected from being unmapped by the same lock that protects
the list from being modified. This is the list of all loaded dynamic
libraries.

> > and the data accessed
> > varies for each user of dl_iterate_phdr (elements of the list point into
> > mmaped parts of .so files).
> 
> That doesn't matter.  The only overhead is how often we need to validate
> so that we know that a subsequent pointer dereference is safe.  The
> number of these validations compared to a lock acquisition+release is the
> deciding factor here.
Pointer dereferencing happens outside of glibc, in a user-provided
callback.

--
			Gleb.
  
Torvald Riegel Aug. 4, 2016, 5:16 p.m. UTC | #18
On Wed, 2016-08-03 at 20:47 +0300, Gleb Natapov wrote:
> On Wed, Aug 03, 2016 at 07:26:26PM +0200, Torvald Riegel wrote:
> > On Mon, 2016-08-01 at 23:41 +0300, Gleb Natapov wrote:
> > > On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> > > > On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > > > > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > > > > If we built something custom for this and are willing to make the
> > > > > > wrlock / exclusive-access case much more costly, we can decrease this
> > > > > > overhead.  This could be roughly similar to one lock per thread or a set
> > > > > > of rwlocks as you mentioned, but with less space overhead.
> > > > > > 
> > > > > IMO space overhead is negligible. More efficient rwlock is, of course,
> > > > > better and can be useful in many more places. If you have something to
> > > > > test I am willing to do so, but if custom rwlock will take time to
> > > > > materialize we may start with lock array and change it later. The lock
> > > > > array is not part of the interface, but implementation detail. What I
> > > > > would like to avoid is stalling the afford while waiting for something
> > > > > better. Exception scalability is very pressing issue for us.
> > > > 
> > > > I haven't looked at the workload in detail, and so can't say off-hand
> > > > what would be best and how long it would take.  Is there a summary
> > > > describing the synchronization that is necessary?  Once I know that, I
> > > > can give a closer estimate.  It could be simple to do, so it might be
> > > > worth having a look before trying the (rw?)lock-array approach.
> > > > 
> > > rwlock semantics is pretty much what is needed. There is a list of
> > > loaded objects that changes rarely (only during dlopen/dlclose), but it
> > > is accessed for read on each exception.
> > 
> > Does "for read" mean that the only thing you are interested in is some
> > consistent snapshot of the list, or something like a yes/no result for a
> > query about whether some element is contained?
> No. The list contains objects that have pointers into some other memory that
> the locking of the list prevents from being unmapped.

Then maybe something more specific to this case, like hazard pointers,
would be better.  This could be used for other things in glibc too, and
you wouldn't need the atomic RMWs required for the lock (RMWs still have
more overhead than plain stores, even though they are not as costly on,
for example, x86 as they were in the past).

> > > Exceptions on different threads
> > > are independent, so they should not interfere with each other. Having
> > > only one lock (even rw one) means that all threads will access same
> > > atomic variable for synchronisation which has a lot of overhead on
> > > modern cpus. lock-array approach is a way to allow multiple threads to
> > > process completely independently without any inter-core communication at
> > > all. It is hard to beat.
> > 
> > You can try to never modify memory in the common case, which is better.
> Which memory? No memory, besides lock, is modified in the common case.

I am referring to the lock.

> > Additionally, the problem with the lock array is that you have to put
> > the locks somewhere, preferably so that they are in a page that is local
> > to where the thread runs. This makes a simple lock array suboptimal, at
> > least on bigger systems.  TLS space would be most likely to be backed by
> > local memory, but as Adhemerval has mentioned, we might have lots of
> > threads, and if we do TLS, we need to sync between thread destruction
> > and the heavyweight operation that wants to acquire all locks.
> > 
> The lock array is global, no TLS is involved. There is no problem of a
> kind you describe.

If the lock array is global, it will be backed by memory on a single
NUMA node.  This is inefficient if you access it from a CPU on a
different NUMA node in a larger system.  You could split the array
between NUMA nodes, but we don't have a mechanism to access CPU-local
memory efficiently right now (it can be approximated, or one can try to
amortize over several calls, but that's it).
So the issue I am describing is having to choose between this
problem and the problems that come with putting the locks into TLS.

> > > New rwlock is substantially slower on only 4
> > > cores (it is much better than current rwlock since it avoids entering
> > > kernel if there are only readers).
> > 
> > Is this a microbenchmark just for acquisition/release of the rwlock, or
> > does this significant difference also show up in the real scenario you
> > are concerned about (the exception storm you mentioned)?
> That's on a micro benchmark. But it simulates pretty closely what
> sometimes happen on our real system when multiple threads encounter
> burst of exceptional conditions (network connection dropped and all
> queued requests, sometimes thousands, terminate with exception).

So the micro benchmark does the exception stuff in the rwlock critical
sections (eg, you throw exceptions from multiple threads concurrently
and measure overall throughput)?
(I'm asking so that I know whether the twice-as-slow claim you made
(IIRC) really holds in the real-world use case too.)

> > > and the data accessed
> > > varies for each user of dl_iterate_phdr (elements of the list point into
> > > mmaped parts of .so files).
> > 
> > That doesn't matter.  The only overhead is how often we need to validate
> > so that we know that a subsequent pointer dereference is safe.  The
> > number of these validations compared to a lock acquisition+release is the
> > deciding factor here.
> Pointer dereferencing happens outside of glibc in a user provided
> callback.

Ah, right.  I must have skipped the last part of the last sentence you
wrote previously.

Is there a way to make callbacks only work on memory that is not going
to be unmapped (e.g., under the assumption that the user won't ever
dlclose stuff that's currently needed for any exception processing)?

(Note that one reason I'm asking these questions is to collect the
information we need to document why we picked one or the other option to
synchronize this.  Documenting, for example, "Gleb says seqlocks don't
work" isn't sufficient because it doesn't tell future glibc developers
why that would be so.)
  
Gleb Natapov Aug. 4, 2016, 5:37 p.m. UTC | #19
On Thu, Aug 04, 2016 at 07:16:55PM +0200, Torvald Riegel wrote:
> On Wed, 2016-08-03 at 20:47 +0300, Gleb Natapov wrote:
> > On Wed, Aug 03, 2016 at 07:26:26PM +0200, Torvald Riegel wrote:
> > > On Mon, 2016-08-01 at 23:41 +0300, Gleb Natapov wrote:
> > > > On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> > > > > On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > > > > > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > > > > > If we built something custom for this and are willing to make the
> > > > > > > wrlock / exclusive-access case much more costly, we can decrease this
> > > > > > > overhead.  This could be roughly similar to one lock per thread or a set
> > > > > > > of rwlocks as you mentioned, but with less space overhead.
> > > > > > > 
> > > > > > IMO space overhead is negligible. More efficient rwlock is, of course,
> > > > > > better and can be useful in many more places. If you have something to
> > > > > > test I am willing to do so, but if custom rwlock will take time to
> > > > > > materialize we may start with lock array and change it later. The lock
> > > > > > array is not part of the interface, but implementation detail. What I
> > > > > > would like to avoid is stalling the afford while waiting for something
> > > > > > better. Exception scalability is very pressing issue for us.
> > > > > 
> > > > > I haven't looked at the workload in detail, and so can't say off-hand
> > > > > what would be best and how long it would take.  Is there a summary
> > > > > describing the synchronization that is necessary?  Once I know that, I
> > > > > can give a closer estimate.  It could be simple to do, so it might be
> > > > > worth having a look before trying the (rw?)lock-array approach.
> > > > > 
> > > > rwlock semantics is pretty much what is needed. There is a list of
> > > > loaded objects that changes rarely (only during dlopen/dlclose), but it
> > > > is accessed for read on each exception.
> > > 
> > > Does "for read" mean that the only thing you are interested in is some
> > > consistent snapshot of the list, or something like a yes/no result for a
> > > query about whether some element is contained?
> > No. The list contains objects that have pointers into some other memory that
> > the locking of the list prevents from been unmapped.
> 
> Then maybe something more specific to this case like hazard pointers
> would be better.  This could be used for other things in glibc too, and
> you wouldn't need atomic RMW as for the lock (which still have more
> overhead then stores, even though the RMWs are not as costly anymore on,
> for example, x86 as in the past).
> 
Access is done outside of glibc's control. User code does a regular
dereference.

> > > Additionally, the problem with the lock array is that you have to put
> > > the locks somewhere, preferably so that they are in a page that is local
> > > to where the thread runs. This makes a simple lock array suboptimal, at
> > > least on bigger systems.  TLS space would be most likely to be backed by
> > > local memory, but as Adhemerval has mentioned, we might have lots of
> > > threads, and if we do TLS, we need to sync between thread destruction
> > > and the heavyweight operation that wants to acquire all locks.
> > > 
> > The lock array is global, no TLS is involved. There is no problem of a
> > kind you describe.
> 
> If the lock array is global, it will be backed by memory to a single
> NUMA node.  This is inefficient if you access it from a CPU on a
> different NUMA node in a larger system.  You could split the array
> between NUMA nodes, but we don't have a mechanism to access CPU-local
> memory efficiently right now (it can be approximated, or one can try to
> amortize over several calls, but that's it).
> So the problem I am describing is having to make a choice between this
> problem and the problems that come with putting the lock into TLS.
> 
The NUMA problem is no different from having just one lock, as we do
now. Can we do something better? Yes, by having a per-thread lock for
instance, but I do not see a point in a more complex solution without
evidence of real performance issues in the proposed one.

> > > > New rwlock is substantially slower on only 4
> > > > cores (it is much better than current rwlock since it avoids entering
> > > > kernel if there are only readers).
> > > 
> > > Is this a microbenchmark just for acquisition/release of the rwlock, or
> > > does this significant difference also show up in the real scenario you
> > > are concerned about (the exception storm you mentioned)?
> > That's on a micro benchmark. But it simulates pretty closely what
> > sometimes happen on our real system when multiple threads encounter
> > burst of exceptional conditions (network connection dropped and all
> > queued requests, sometimes thousands, terminate with exception).
> 
> So the micro benchmark does the exception stuff in the rwlock critical
> sections (eg, you throw exceptions from multiple threads concurrently
> and measure overall throughput)?
What do you mean by "does the exception stuff in the rwlock critical
sections"? Stack unwinding happens in rwlock critical sections; the
benchmark just throws exceptions concurrently.

> (I'm asking so that I know whether the twice-as-slow claim you made
> (IIRC) really holds in the real-world use case too.)
For the worst-case scenario it will. For the best case even the current
code works fine.

> 
> > > > and the data accessed
> > > > varies for each user of dl_iterate_phdr (elements of the list point into
> > > > mmaped parts of .so files).
> > > 
> > > That doesn't matter.  The only overhead is how often we need to validate
> > > so that we know that a subsequent pointer dereference is safe.  The
> > > number of these valuations compared to a lock acquisition+release is the
> > > deciding factor here.
> > Pointer dereferencing happens outside of glibc in a user provided
> > callback.
> 
> Ah, right.  I must have skipped the last part of the last sentence you
> wrote previously.
> 
> Is there a way to make callbacks only work on memory that is not going
> to be unmapped (e.g., under the assumption that the user won't ever
> dlclose stuff that's currently needed for any exception processing)?
> 
The callback may work on any memory of a loaded object. That's the point of
the interface: to run some code on each loaded object. What may work is
an RCU scheme that delays object deletion until a quiescent point, but
there are no such points in glibc. Unfortunately, I do not see how RCU
could be implemented in glibc.

> (Note that one reason I'm asking these questions is to collect the
> information we need to document why we picked on or the other option to
> synchronize this.  Documenting, for example, "Gleb says seqlock don't
> work" isn't sufficient because it doesn't tell future glibc developers
> why that would be so.)
> 
Sure.

--
			Gleb.
  
Torvald Riegel Aug. 4, 2016, 5:58 p.m. UTC | #20
On Thu, 2016-08-04 at 20:37 +0300, Gleb Natapov wrote:
> On Thu, Aug 04, 2016 at 07:16:55PM +0200, Torvald Riegel wrote:
> > On Wed, 2016-08-03 at 20:47 +0300, Gleb Natapov wrote:
> > > On Wed, Aug 03, 2016 at 07:26:26PM +0200, Torvald Riegel wrote:
> > > > On Mon, 2016-08-01 at 23:41 +0300, Gleb Natapov wrote:
> > > > > On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> > > > > > On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > > > > > > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > > > > > > If we built something custom for this and are willing to make the
> > > > > > > > wrlock / exclusive-access case much more costly, we can decrease this
> > > > > > > > overhead.  This could be roughly similar to one lock per thread or a set
> > > > > > > > of rwlocks as you mentioned, but with less space overhead.
> > > > > > > > 
> > > > > > > IMO space overhead is negligible. More efficient rwlock is, of course,
> > > > > > > better and can be useful in many more places. If you have something to
> > > > > > > test I am willing to do so, but if custom rwlock will take time to
> > > > > > > materialize we may start with lock array and change it later. The lock
> > > > > > > array is not part of the interface, but implementation detail. What I
> > > > > > > would like to avoid is stalling the afford while waiting for something
> > > > > > > better. Exception scalability is very pressing issue for us.
> > > > > > 
> > > > > > I haven't looked at the workload in detail, and so can't say off-hand
> > > > > > what would be best and how long it would take.  Is there a summary
> > > > > > describing the synchronization that is necessary?  Once I know that, I
> > > > > > can give a closer estimate.  It could be simple to do, so it might be
> > > > > > worth having a look before trying the (rw?)lock-array approach.
> > > > > > 
> > > > > rwlock semantics is pretty much what is needed. There is a list of
> > > > > loaded objects that changes rarely (only during dlopen/dlclose), but it
> > > > > is accessed for read on each exception.
> > > > 
> > > > Does "for read" mean that the only thing you are interested in is some
> > > > consistent snapshot of the list, or something like a yes/no result for a
> > > > query about whether some element is contained?
> > > No. The list contains objects that have pointers into some other memory that
> > > the locking of the list prevents from been unmapped.
> > 
> > Then maybe something more specific to this case like hazard pointers
> > would be better.  This could be used for other things in glibc too, and
> > you wouldn't need atomic RMW as for the lock (which still have more
> > overhead then stores, even though the RMWs are not as costly anymore on,
> > for example, x86 as in the past).
> > 
> Access is done outside of glibc control. User code does regular
> dereference.

https://en.wikipedia.org/wiki/Hazard_pointer

It can be used for reference-counting-like things too, as in that
example.  The locks you suggest have the same goal, but a different
implementation.

> > > > Additionally, the problem with the lock array is that you have to put
> > > > the locks somewhere, preferably so that they are in a page that is local
> > > > to where the thread runs. This makes a simple lock array suboptimal, at
> > > > least on bigger systems.  TLS space would be most likely to be backed by
> > > > local memory, but as Adhemerval has mentioned, we might have lots of
> > > > threads, and if we do TLS, we need to sync between thread destruction
> > > > and the heavyweight operation that wants to acquire all locks.
> > > > 
> > > The lock array is global, no TLS is involved. There is no problem of a
> > > kind you describe.
> > 
> > If the lock array is global, it will be backed by memory to a single
> > NUMA node.  This is inefficient if you access it from a CPU on a
> > different NUMA node in a larger system.  You could split the array
> > between NUMA nodes, but we don't have a mechanism to access CPU-local
> > memory efficiently right now (it can be approximated, or one can try to
> > amortize over several calls, but that's it).
> > So the problem I am describing is having to make a choice between this
> > problem and the problems that come with putting the lock into TLS.
> > 
> The NUMA problem is not different from having just one lock as we do
> now. Can we do something better, yes. Having per thread lock for instance,
> but I do not see a point in more complex solution without evidence of
> real performance issues in proposed one.

Have you measured it?

You are trying to make a performance improvement, yet we have no
evidence that what you picked is actually useful beyond being better
than a single lock.  Why 64 locks, for example?  Why haven't you put
them at least on separate cache lines?  Is this just an arbitrary choice
(which might be fine, but then we need to make this clear in the code)?

> > > > > New rwlock is substantially slower on only 4
> > > > > cores (it is much better than current rwlock since it avoids entering
> > > > > kernel if there are only readers).
> > > > 
> > > > Is this a microbenchmark just for acquisition/release of the rwlock, or
> > > > does this significant difference also show up in the real scenario you
> > > > are concerned about (the exception storm you mentioned)?
> > > That's on a micro benchmark. But it simulates pretty closely what
> > > sometimes happen on our real system when multiple threads encounter
> > > burst of exceptional conditions (network connection dropped and all
> > > queued requests, sometimes thousands, terminate with exception).
> > 
> > So the micro benchmark does the exception stuff in the rwlock critical
> > sections (eg, you throw exceptions from multiple threads concurrently
> > and measure overall throughput)?
> What do you mean by "does the exception stuff in the rwlock critical
> sections"? Stack unwinding happens in rwlock critical sections, the
> benchmark just throws exception concurrently.

That was my question (see the example in there).

> > 
> > > > > and the data accessed
> > > > > varies for each user of dl_iterate_phdr (elements of the list point into
> > > > > mmaped parts of .so files).
> > > > 
> > > > That doesn't matter.  The only overhead is how often we need to validate
> > > > so that we know that a subsequent pointer dereference is safe.  The
> > > > number of these valuations compared to a lock acquisition+release is the
> > > > deciding factor here.
> > > Pointer dereferencing happens outside of glibc in a user provided
> > > callback.
> > 
> > Ah, right.  I must have skipped the last part of the last sentence you
> > wrote previously.
> > 
> > Is there a way to make callbacks only work on memory that is not going
> > to be unmapped (e.g., under the assumption that the user won't ever
> > dlclose stuff that's currently needed for any exception processing)?
> > 
> Callback may work on any memory of a loaded object. That's the point of
> the interface - to run some code on each loaded object.

You asked for comments on a different interface.  Though we currently
seem to favor not adding a different interface, my question is clearly
relevant in that context.
  
Gleb Natapov Aug. 4, 2016, 6:28 p.m. UTC | #21
On Thu, Aug 04, 2016 at 07:58:19PM +0200, Torvald Riegel wrote:
> On Thu, 2016-08-04 at 20:37 +0300, Gleb Natapov wrote:
> > On Thu, Aug 04, 2016 at 07:16:55PM +0200, Torvald Riegel wrote:
> > > On Wed, 2016-08-03 at 20:47 +0300, Gleb Natapov wrote:
> > > > On Wed, Aug 03, 2016 at 07:26:26PM +0200, Torvald Riegel wrote:
> > > > > On Mon, 2016-08-01 at 23:41 +0300, Gleb Natapov wrote:
> > > > > > On Mon, Aug 01, 2016 at 10:19:55PM +0200, Torvald Riegel wrote:
> > > > > > > On Mon, 2016-08-01 at 23:07 +0300, Gleb Natapov wrote:
> > > > > > > > On Mon, Aug 01, 2016 at 09:46:35PM +0200, Torvald Riegel wrote:
> > > > > > > > > If we built something custom for this and are willing to make the
> > > > > > > > > wrlock / exclusive-access case much more costly, we can decrease this
> > > > > > > > > overhead.  This could be roughly similar to one lock per thread or a set
> > > > > > > > > of rwlocks as you mentioned, but with less space overhead.
> > > > > > > > > 
> > > > > > > > IMO space overhead is negligible. More efficient rwlock is, of course,
> > > > > > > > better and can be useful in many more places. If you have something to
> > > > > > > > test I am willing to do so, but if custom rwlock will take time to
> > > > > > > > materialize we may start with lock array and change it later. The lock
> > > > > > > > array is not part of the interface, but implementation detail. What I
> > > > > > > > would like to avoid is stalling the afford while waiting for something
> > > > > > > > better. Exception scalability is very pressing issue for us.
> > > > > > > 
> > > > > > > I haven't looked at the workload in detail, and so can't say off-hand
> > > > > > > what would be best and how long it would take.  Is there a summary
> > > > > > > describing the synchronization that is necessary?  Once I know that, I
> > > > > > > can give a closer estimate.  It could be simple to do, so it might be
> > > > > > > worth having a look before trying the (rw?)lock-array approach.
> > > > > > > 
> > > > > > rwlock semantics is pretty much what is needed. There is a list of
> > > > > > loaded objects that changes rarely (only during dlopen/dlclose), but it
> > > > > > is accessed for read on each exception.
> > > > > 
> > > > > Does "for read" mean that the only thing you are interested in is some
> > > > > consistent snapshot of the list, or something like a yes/no result for a
> > > > > query about whether some element is contained?
> > > > No. The list contains objects that have pointers into some other memory that
> > > > the locking of the list prevents from been unmapped.
> > > 
> > > Then maybe something more specific to this case like hazard pointers
> > > would be better.  This could be used for other things in glibc too, and
> > > you wouldn't need atomic RMW as for the lock (which still have more
> > > overhead then stores, even though the RMWs are not as costly anymore on,
> > > for example, x86 as in the past).
> > > 
> > Access is done outside of glibc control. User code does regular
> > dereference.
> 
> https://en.wikipedia.org/wiki/Hazard_pointer
> 
Can you elaborate on how this can be used to solve the problem of user
code accessing an object's memory that can be unmapped?

> It can be used for reference-counting like things too, like this
> example.  The locks you suggest have the same goal, but a different
> implementation.
> 
The locks I suggest prevent the list from being modified while any object on
the list is accessed; they do not try to make list manipulation lockless.
There is no point in doing so since list modification is rare.

> > > > > Additionally, the problem with the lock array is that you have to put
> > > > > the locks somewhere, preferably so that they are in a page that is local
> > > > > to where the thread runs. This makes a simple lock array suboptimal, at
> > > > > least on bigger systems.  TLS space would be most likely to be backed by
> > > > > local memory, but as Adhemerval has mentioned, we might have lots of
> > > > > threads, and if we do TLS, we need to sync between thread destruction
> > > > > and the heavyweight operation that wants to acquire all locks.
> > > > > 
> > > > The lock array is global, no TLS is involved. There is no problem of a
> > > > kind you describe.
> > > 
> > > If the lock array is global, it will be backed by memory to a single
> > > NUMA node.  This is inefficient if you access it from a CPU on a
> > > different NUMA node in a larger system.  You could split the array
> > > between NUMA nodes, but we don't have a mechanism to access CPU-local
> > > memory efficiently right now (it can be approximated, or one can try to
> > > amortize over several calls, but that's it).
> > > So the problem I am describing is having to make a choice between this
> > > problem and the problems that come with putting the lock into TLS.
> > > 
> > The NUMA problem is not different from having just one lock as we do
> > now. Can we do something better, yes. Having per thread lock for instance,
> > but I do not see a point in more complex solution without evidence of
> > real performance issues in proposed one.
> 
> Have you measured it?
I did not, but as I said it is an improvement over the current situation,
which has exactly the same problem. "Better is the enemy of good." Do we
want to search for a perfect solution forever, or have a good one now and
improve it afterwards if the need arises?

> 
> You are trying to make a performance improvement, yet we have no
> evidence that what you picked is actually useful beyond being better
> than a single lock. 
I showed that what I picked is much better than a single lock. Its
usefulness is in being better. What else should have been shown?

>                      Why 64 locks, for example.  Why haven't you put
> them at least on separate cache lines?  Is this just an arbitrary choice
> (which might be fine, but then we need to make this clear in the code)?
Putting them on separate cache lines is definitely a good idea. And no, I
do not have a scientific explanation for 64; it is a reasonable number of
cores in a modern medium-sized server.

> 
> > > > > > New rwlock is substantially slower on only 4
> > > > > > cores (it is much better than current rwlock since it avoids entering
> > > > > > kernel if there are only readers).
> > > > > 
> > > > > Is this a microbenchmark just for acquisition/release of the rwlock, or
> > > > > does this significant difference also show up in the real scenario you
> > > > > are concerned about (the exception storm you mentioned)?
> > > > That's on a micro benchmark. But it simulates pretty closely what
> > > > sometimes happen on our real system when multiple threads encounter
> > > > burst of exceptional conditions (network connection dropped and all
> > > > queued requests, sometimes thousands, terminate with exception).
> > > 
> > > So the micro benchmark does the exception stuff in the rwlock critical
> > > sections (eg, you throw exceptions from multiple threads concurrently
> > > and measure overall throughput)?
> > What do you mean by "does the exception stuff in the rwlock critical
> > sections"? Stack unwinding happens in rwlock critical sections, the
> > benchmark just throws exception concurrently.
> 
> That was my question (see the example in there).
> 
Where should I see the example?

> > > 
> > > > > > and the data accessed
> > > > > > varies for each user of dl_iterate_phdr (elements of the list point into
> > > > > > mmaped parts of .so files).
> > > > > 
> > > > > That doesn't matter.  The only overhead is how often we need to validate
> > > > > so that we know that a subsequent pointer dereference is safe.  The
> > > > > number of these valuations compared to a lock acquisition+release is the
> > > > > deciding factor here.
> > > > Pointer dereferencing happens outside of glibc in a user provided
> > > > callback.
> > > 
> > > Ah, right.  I must have skipped the last part of the last sentence you
> > > wrote previously.
> > > 
> > > Is there a way to make callbacks only work on memory that is not going
> > > to be unmapped (e.g., under the assumption that the user won't ever
> > > dlclose stuff that's currently needed for any exception processing)?
> > > 
> > Callback may work on any memory of a loaded object. That's the point of
> > the interface - to run some code on each loaded object.
> 
> You asked for comments on a different interface.  Though we currently
> seem to favor not adding a different interface, my question is clearly
> revelant in that context.
> 
The interface I proposed is a fix of the current one, since the current
one failed to spell out that the callback may be called in parallel. It
should be easy to move from the old one to the new one for users that
never made such assumptions in the first place. The only reason I
proposed a new interface instead of fixing the old one is to be on the
safe side.

--
			Gleb.
  

Patch

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 687d7de..0440fb9 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -535,7 +535,8 @@  _dl_close_worker (struct link_map *map, bool force)
   tls_free_start = tls_free_end = NO_TLS_OFFSET;
 
   /* We modify the list of loaded objects.  */
-  __rtld_lock_lock_recursive (GL(dl_load_write_lock));
+//  __rtld_lock_lock_recursive (GL(dl_load_write_lock));
+  __rtld_rwlock_wrlock (GL(dl_load_write_lock));
 
   /* Check each element of the search list to see if all references to
      it are gone.  */
@@ -748,7 +749,8 @@  _dl_close_worker (struct link_map *map, bool force)
 	}
     }
 
-  __rtld_lock_unlock_recursive (GL(dl_load_write_lock));
+//  __rtld_lock_unlock_recursive (GL(dl_load_write_lock));
+  __rtld_rwlock_unlock (GL(dl_load_write_lock));
 
   /* If we removed any object which uses TLS bump the generation counter.  */
   if (any_tls)
diff --git a/elf/dl-iteratephdr.c b/elf/dl-iteratephdr.c
index 1cb6e26..aa4d0a5 100644
--- a/elf/dl-iteratephdr.c
+++ b/elf/dl-iteratephdr.c
@@ -25,7 +25,8 @@ 
 static void
 cancel_handler (void *arg __attribute__((unused)))
 {
-  __rtld_lock_unlock_recursive (GL(dl_load_write_lock));
+  //__rtld_lock_unlock_recursive (GL(dl_load_write_lock));
+  __rtld_rwlock_unlock (GL(dl_load_write_lock));
 }
 
 hidden_proto (__dl_iterate_phdr)
@@ -38,7 +39,8 @@  __dl_iterate_phdr (int (*callback) (struct dl_phdr_info *info,
   int ret = 0;
 
   /* Make sure nobody modifies the list of loaded objects.  */
-  __rtld_lock_lock_recursive (GL(dl_load_write_lock));
+//  __rtld_lock_lock_recursive (GL(dl_load_write_lock));
+  __rtld_rwlock_rdlock (GL(dl_load_write_lock));
   __libc_cleanup_push (cancel_handler, NULL);
 
   /* We have to determine the namespace of the caller since this determines
@@ -80,7 +82,8 @@  __dl_iterate_phdr (int (*callback) (struct dl_phdr_info *info,
 
   /* Release the lock.  */
   __libc_cleanup_pop (0);
-  __rtld_lock_unlock_recursive (GL(dl_load_write_lock));
+//  __rtld_lock_unlock_recursive (GL(dl_load_write_lock));
+  __rtld_rwlock_unlock (GL(dl_load_write_lock));
 
   return ret;
 }
diff --git a/elf/dl-object.c b/elf/dl-object.c
index 362992b..c7952a1 100644
--- a/elf/dl-object.c
+++ b/elf/dl-object.c
@@ -31,7 +31,8 @@  internal_function
 _dl_add_to_namespace_list (struct link_map *new, Lmid_t nsid)
 {
   /* We modify the list of loaded objects.  */
-  __rtld_lock_lock_recursive (GL(dl_load_write_lock));
+//  __rtld_lock_lock_recursive (GL(dl_load_write_lock));
+  __rtld_rwlock_wrlock (GL(dl_load_write_lock));
 
   if (GL(dl_ns)[nsid]._ns_loaded != NULL)
     {
@@ -48,7 +49,8 @@  _dl_add_to_namespace_list (struct link_map *new, Lmid_t nsid)
   new->l_serial = GL(dl_load_adds);
   ++GL(dl_load_adds);
 
-  __rtld_lock_unlock_recursive (GL(dl_load_write_lock));
+//  __rtld_lock_unlock_recursive (GL(dl_load_write_lock));
+  __rtld_rwlock_unlock (GL(dl_load_write_lock));
 }
 
 
diff --git a/elf/dl-support.c b/elf/dl-support.c
index c30194c..f05d341 100644
--- a/elf/dl-support.c
+++ b/elf/dl-support.c
@@ -213,8 +213,8 @@  __rtld_lock_define_initialized_recursive (, _dl_load_lock)
 /* This lock is used to keep __dl_iterate_phdr from inspecting the
    list of loaded objects while an object is added to or removed from
    that list.  */
-__rtld_lock_define_initialized_recursive (, _dl_load_write_lock)
-
+//__rtld_lock_define_initialized_recursive (, _dl_load_write_lock)
+__libc_rwlock_define_initialized (,_dl_load_write_lock)
 
 #ifdef HAVE_AUX_VECTOR
 int _dl_clktck;
diff --git a/elf/rtld.c b/elf/rtld.c
index 647661c..da91c4e 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -128,7 +128,8 @@  struct rtld_global _rtld_global =
     ._dl_stack_flags = DEFAULT_STACK_PERMS,
 #ifdef _LIBC_REENTRANT
     ._dl_load_lock = _RTLD_LOCK_RECURSIVE_INITIALIZER,
-    ._dl_load_write_lock = _RTLD_LOCK_RECURSIVE_INITIALIZER,
+//    ._dl_load_write_lock = _RTLD_LOCK_RECURSIVE_INITIALIZER,
+    ._dl_load_write_lock = PTHREAD_RWLOCK_INITIALIZER,
 #endif
     ._dl_nns = 1,
     ._dl_ns =
@@ -693,6 +694,22 @@  rtld_lock_default_unlock_recursive (void *lock)
 {
   __rtld_lock_default_unlock_recursive (lock);
 }
+
+static void
+rtld_rwlock_default_wrlock (void *lock)
+{
+}
+
+static void
+rtld_rwlock_default_rdlock (void *lock)
+{
+}
+
+static void
+rtld_rwlock_default_unlock (void *lock)
+{
+}
+
 #endif
 
 
@@ -763,6 +780,9 @@  dl_main (const ElfW(Phdr) *phdr,
     && defined __rtld_lock_default_lock_recursive
   GL(dl_rtld_lock_recursive) = rtld_lock_default_lock_recursive;
   GL(dl_rtld_unlock_recursive) = rtld_lock_default_unlock_recursive;
+  GL(dl_rtld_rwlock_wrlock) = rtld_rwlock_default_wrlock;
+  GL(dl_rtld_rwlock_rdlock) = rtld_rwlock_default_rdlock;
+  GL(dl_rtld_rwlock_unlock) = rtld_rwlock_default_unlock;
 #endif
 
   /* The explicit initialization here is cheaper than processing the reloc
diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
index bdbdfed..a53f914 100644
--- a/nptl/nptl-init.c
+++ b/nptl/nptl-init.c
@@ -475,6 +475,9 @@  __pthread_initialize_minimal_internal (void)
      keep the lock count from the ld.so implementation.  */
   GL(dl_rtld_lock_recursive) = (void *) __pthread_mutex_lock;
   GL(dl_rtld_unlock_recursive) = (void *) __pthread_mutex_unlock;
+  GL(dl_rtld_rwlock_wrlock) = (void *) __pthread_rwlock_wrlock;
+  GL(dl_rtld_rwlock_rdlock) = (void *) __pthread_rwlock_rdlock;
+  GL(dl_rtld_rwlock_unlock) = (void *) __pthread_rwlock_unlock;
   unsigned int rtld_lock_count = GL(dl_load_lock).mutex.__data.__count;
   GL(dl_load_lock).mutex.__data.__count = 0;
   while (rtld_lock_count-- > 0)
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index f68fdf4..b80c7eb 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -334,7 +334,8 @@  struct rtld_global
   /* This lock is used to keep __dl_iterate_phdr from inspecting the
      list of loaded objects while an object is added to or removed
      from that list.  */
-  __rtld_lock_define_recursive (EXTERN, _dl_load_write_lock)
+  //__rtld_lock_define_recursive (EXTERN, _dl_load_write_lock)
+  __libc_rwlock_define (EXTERN, _dl_load_write_lock)
 
   /* Incremented whenever something may have been added to dl_loaded.  */
   EXTERN unsigned long long _dl_load_adds;
@@ -372,6 +373,9 @@  struct rtld_global
     && defined __rtld_lock_default_lock_recursive
   EXTERN void (*_dl_rtld_lock_recursive) (void *);
   EXTERN void (*_dl_rtld_unlock_recursive) (void *);
+  EXTERN void (*_dl_rtld_rwlock_wrlock) (void *);
+  EXTERN void (*_dl_rtld_rwlock_rdlock) (void *);
+  EXTERN void (*_dl_rtld_rwlock_unlock) (void *);
 #endif
 
   /* If loading a shared object requires that we make the stack executable
diff --git a/sysdeps/nptl/libc-lockP.h b/sysdeps/nptl/libc-lockP.h
index 50b86d2..e91e021 100644
--- a/sysdeps/nptl/libc-lockP.h
+++ b/sysdeps/nptl/libc-lockP.h
@@ -236,6 +236,28 @@  typedef pthread_key_t __libc_key_t;
   __libc_maybe_call (__pthread_mutex_unlock, (&(NAME).mutex), 0)
 #endif
 
+#ifdef SHARED
+# define __rtld_rwlock_wrlock(NAME) \
+  GL(dl_rtld_rwlock_wrlock) (&(NAME))
+
+# define __rtld_rwlock_rdlock(NAME) \
+  GL(dl_rtld_rwlock_rdlock) (&(NAME))
+
+# define __rtld_rwlock_unlock(NAME) \
+  GL(dl_rtld_rwlock_unlock) (&(NAME))
+
+#else
+# define __rtld_rwlock_wrlock(NAME) \
+  __libc_maybe_call (__pthread_rwlock_wrlock, (&(NAME)), 0)
+
+# define __rtld_rwlock_rdlock(NAME) \
+  __libc_maybe_call (__pthread_rwlock_rdlock, (&(NAME)), 0)
+
+# define __rtld_rwlock_unlock(NAME) \
+  __libc_maybe_call (__pthread_rwlock_unlock, (&(NAME)), 0)
+
+#endif
+
 /* Define once control variable.  */
 #if PTHREAD_ONCE_INIT == 0
 /* Special case for static variables where we can avoid the initialization