[RFC/PoC] malloc: use wfcqueue to speed up remote frees

  The goal is to reduce contention and improve locality of cross-thread
malloc/free traffic common to IPC systems (including Userspace-RCU) and
some garbage-collected runtimes.

Very rough benchmarks using `xthr`[1], a small URCU test program
I wrote years ago, show huge improvements in time and space:

  $ /usr/bin/time ./before.sh ./xthr -a 2 -m 2500 -i $((1024 * 1024 * 5))
  2.46user 3.51system 0:05.50elapsed 108%CPU (0avgtext+0avgdata 3352592maxresident)k
  0inputs+0outputs (17major+838014minor)pagefaults 0swaps

  $ /usr/bin/time ./after.sh ./xthr -a 2 -m 2500 -i $((1024 * 1024 * 5))
  2.68user 0.48system 0:02.55elapsed 123%CPU (0avgtext+0avgdata 532904maxresident)k
  0inputs+0outputs (0major+174304minor)pagefaults 0swaps

Here, before.sh and after.sh are script wrappers around ld-linux for
the appropriate glibc installation:

  #!/bin/sh
  exec /tmp/6/lib/ld-linux-x86-64.so.2 --library-path /tmp/6/lib "$@"

  [1] xthr.c: https://80x24.org/spew/20180731082205.vykyunsm5xg7ml3e@dcvr/

It avoids lock contention by deferring `_int_free' until the arena
lock is already held.  Three functions are added:

* remote_free_begin  - Producer - enqueues the allocation into an arena
                       designated to another thread.  This is wait-free,
                       branchless, and only modifies the last (cold)
                       cacheline of the arena belonging to another thread

* remote_free_step   - Consumer - grabs everything enqueued by
                       remote_free_begin and calls `_int_free' locally
                       without acquiring extra locks.  Returns `true'
                       if it did any work, as other threads may have
                       called `remote_free_begin' in the meantime.

* remote_free_finish - Consumer - calls remote_free_step in a loop until
                       there is no more work to do.  It runs before most
                       calls to malloc_consolidate.
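
The interplay of the three functions can be sketched roughly as below.
This is a simplified stand-in written with C11 atomics, not the patch
itself and not the wfcqueue API: `struct toy_arena', `struct remote_node',
`toy_arena_init' and `demo_remote_free' are hypothetical names, plain
free() stands in for `_int_free', and the consumer is assumed to be the
only thread draining the queue (in the patch, the arena lock holder).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical intrusive node, stored inside the block being freed. */
struct remote_node {
    _Atomic(struct remote_node *) next;
};

/* Hypothetical stand-in for the queue fields added to an arena. */
struct toy_arena {
    struct remote_node head;            /* dummy node; head.next is the first entry */
    _Atomic(struct remote_node *) tail; /* last node, or &head when empty */
    size_t freed;                       /* consumer-only counter for the demo */
};

static void toy_arena_init(struct toy_arena *a)
{
    atomic_init(&a->head.next, NULL);
    atomic_init(&a->tail, &a->head);
    a->freed = 0;
}

/* Producer: one atomic exchange plus one store, no loops or branches.
 * Assumes the block is at least sizeof(struct remote_node) bytes. */
static void remote_free_begin(struct toy_arena *a, void *mem)
{
    struct remote_node *n = mem;
    atomic_init(&n->next, NULL);
    struct remote_node *prev = atomic_exchange(&a->tail, n);
    atomic_store(&prev->next, n);       /* publish the link to the consumer */
}

/* Consumer: detach everything queued so far and free it locally.
 * Returns true if any work was done. */
static bool remote_free_step(struct toy_arena *a)
{
    struct remote_node *first = atomic_load(&a->head.next);
    if (first == NULL)
        return false;                   /* nothing visible yet */

    atomic_store(&a->head.next, NULL);
    struct remote_node *last = atomic_exchange(&a->tail, &a->head);

    for (struct remote_node *n = first;;) {
        struct remote_node *next = NULL;
        if (n != last) {
            /* A producer may have swapped the tail but not linked yet. */
            while ((next = atomic_load(&n->next)) == NULL)
                ;
        }
        free(n);                        /* stand-in for _int_free */
        a->freed++;
        if (n == last)
            break;
        n = next;
    }
    return true;
}

/* Consumer: drain until no more work shows up. */
static void remote_free_finish(struct toy_arena *a)
{
    while (remote_free_step(a))
        ;
}

/* Single-threaded smoke test: queue n blocks, drain, report the count. */
static size_t demo_remote_free(size_t n)
{
    struct toy_arena a;
    toy_arena_init(&a);
    for (size_t i = 0; i < n; i++) {
        void *mem = malloc(64);
        assert(mem != NULL);
        remote_free_begin(&a, mem);
    }
    remote_free_finish(&a);
    return a.freed;
}
```

The demo helper only exercises the single-threaded path; the producer
side is meant to be callable from any number of threads, but I have not
shown a threaded harness here.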

wfcqueue is the LGPL-2.1+ Wait-Free Concurrent Queue distributed
with Userspace-RCU <http://liburcu.org/>.  wfcqueue does not
depend on RCU itself (only atomics), but forms the basis of the
workqueue and call_rcu primitive within liburcu.

The functions I'm using from wfcqueue can be statically-linked
from header files, so it involves no extra linkage at runtime.
Note: Debian users can `apt-get install liburcu-dev' to get
wfcqueue.h; I expect it's available for other distros.

If this proof-of-concept is found acceptable, I can work on
making wfcqueue use the atomics provided by gcc/glibc instead
of the `uatomic` headers of URCU so it can be bundled with
glibc.  But maybe adding liburcu as a build-time dependency
is acceptable :)

Note: I haven't gotten "make -j4 check" even close to passing, even
without this patch, on commit 98864ed0e055583707e37cdb7d41a9cdeac4473b.
It's likely a problem on my end; I'm on a fairly common Debian 9
x86-64 system, but I haven't built glibc in years.

On the other hand, with the exception of fiddle (dl-dependent)
and tz tests, most of the "test-all" suite passes for Ruby
when using either the before.sh or after.sh glibc wrapper
(but I haven't done much testing otherwise):

    make test-all TESTS='-x fiddle -x time_tz' \
		RUNRUBYOPT=--precommand=/path/to/after.sh
---
 malloc/malloc.c | 108 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 103 insertions(+), 5 deletions(-)
