[RFC] introduce dl_iterate_phdr_parallel

  Sorry, I missed your answer since I am not subscribed and the reply was
sent to the mailing list only.

On Thu, Jul 28, 2016 at 05:47:16PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 25/07/2016 11:23, Gleb Natapov wrote:
> > Problem: exception handling is not scalable. Attached program mt.cc
> > demonstrates this easily. It runs around 1 seconds with 1 thread, but
> > with only 4 threads it takes 4.5 seconds to complete. The reason is
> > locks that are taken on unwind path.
> > 
> > There are two locks right now.
> > 
> > Fist is in libgcc's _Unwind_Find_registered_FDE and it is eliminated by
> > https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01629.html (for dynamic
> > executables only, but this is common case).
> > 
> > Second one is dl_load_write_lock in __dl_iterate_phdr. It serves dual
> > purpose: it stops the list of loaded objects from been modified while
> > iterating over it and it makes sure that more than one callback will
> > not run in parallel. This means that even if we will find a cleaver way
> > to make __dl_iterate_phdr iterate over the objects list locklessly the
> > lock will still have to be taken around the call to callback to preserve
> > second guaranty.
> > 
> > This patch here propose to introduce another API: dl_iterate_phdr_parallel
> > which does exactly same thing as dl_iterate_phdr but may run more than
> > one provided callback in parallel. And to make it more scalable it
> > breaks single dl_load_write_lock into arrays of locks. Reader takes only
> > one lock, but writer has to lock all of them to proceed with modifying
> > the list.
> 
> Wouldn't a read-write lock solve the performance problem at dl_iterate_phdr?

> It should scale better than current code, since dl_iterate_phdr should act
> just a reader on link_map structures and multiple readers should only incur
> in a lll_lock/lll_unlock on rwlock->__data.__lock.  I am assuming that
> we set it to prefer readers and use a read-lock on dl_iterate_phdr.
> 
I did try read-write lock first and it did not improved the test case
at all. The reason is (as I was very surprised to discover) that rwlock
takes a lock even if there are only readers. Though the lock is taken
for a short period of time when it becomes congested scalability drops
to the floor since all new waiters go to sleep in the kernel.

Attached is a, rather hacky, patch that I used for testing. 

> I see this slight better than trading memory and a limited scalability 
> (the _DL_LOCKS value) and also avoid adds another exported API.
The memory tradeoff is rather small. Using rwlock will not avoid
API addition though since two callbacks will be able to run in
parallel and this is user visible difference in behaviour from current
dl_iterate_phdr. C++ exception handling code, for instance, relies on the
fact that callbacks do not run in parallel. They use global cache which
I made to be per thread in the libgcc patch.

--
			Gleb.
commit a8c321ce2d8c3afe9043ad514e89eccda2e88802
Author: Gleb Natapov <gleb@scylladb.com>
Date:   Sun Jul 24 11:25:07 2016 +0300

    use rwlock in dl-iteratephdr.c

[RFC] introduce dl_iterate_phdr_parallel

Commit Message

Comments

Patch