[v3,1/2] Mutex: Accelerate lock acquisition by queuing spinner

Message ID 1547442104-24364-1-git-send-email-kemi.wang@intel.com
State New, archived

Commit Message

Kemi Wang Jan. 14, 2019, 5:01 a.m. UTC
  An adaptive mutex spins for a while before calling into the kernel to
block. Thus, the lock of an adaptive mutex can be acquired immediately, by
spinning, or after being woken up.

Currently, the spin-waiting algorithm of the adaptive mutex has each
processor repeatedly execute a test_and_set instruction until either the
maximum spin count is reached or the lock is acquired. However, lock
performance when the lock is acquired by spinning degrades significantly as
the number of spinning processors increases. At least two factors cause
this degradation [1]. First, in order to release the lock, the lock holder
has to contend with the spinning processors for exclusive access to the
lock cache line (e.g. "lock; decl 0%" of lll_unlock() in
sysdeps/unix/sysv/linux/x86_64/lowlevellock.h for pthread_mutex_unlock()).
On most multiprocessor architectures, it has to wait behind the
test_and_set instructions issued by the spinning processors.
Furthermore, on invalidation-based cache coherency systems, each
test_and_set instruction triggers a "read-for-ownership" request for
exclusive access to the lock cache line, which can slow down the lock
holder's accesses to other locations that share the cache line with the
lock.
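
To make the contrast concrete, here is a minimal sketch (not glibc code) of
this kind of test_and_set spin-waiting; the lock word, MAX_SPIN and
try_spin_lock are illustrative names only:

/* Every spinner hammers the shared lock word, so each iteration issues a
   read-for-ownership request that contends with the holder's release.  */
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_SPIN 1000

static atomic_int lock_word;            /* 0 = unlocked, 1 = locked.  */

static bool
try_spin_lock (void)
{
  for (int cnt = 0; cnt < MAX_SPIN; cnt++)
    if (atomic_exchange (&lock_word, 1) == 0)
      return true;                      /* Acquired by spinning.  */
  return false;                         /* Give up and block in the kernel.  */
}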

In this patch, we propose another spin-waiting algorithm that accelerates
lock acquisition by queuing spinners, based on the MCS lock [2]. With the
MCS algorithm, only the spinner at the head of the queue is allowed to spin
on the mutex lock word; the others spin on locally-accessible flag
variables. Thus, the negative factors described above are eliminated.
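
As a rough outline of the approach (the actual implementation proposed by
this patch is nptl/mcs_lock.c in the diff below): each spinner owns a node,
joins the queue with an atomic exchange on the shared tail pointer, and
then spins only on its own flag; the holder releases by setting its
successor's flag. The sketch below uses C11 atomics and hypothetical names
(mcs_node, mcs_acquire):

#include <stdatomic.h>
#include <stddef.h>

struct mcs_node
{
  struct mcs_node *_Atomic next;
  atomic_int locked;
};

static void
mcs_acquire (struct mcs_node *_Atomic *tail, struct mcs_node *self)
{
  self->next = NULL;
  self->locked = 0;
  /* Join the queue: swing the shared tail pointer to our node.  */
  struct mcs_node *prev = atomic_exchange (tail, self);
  if (prev == NULL)
    return;                     /* Queue was empty: we hold the turn.  */
  atomic_store (&prev->next, self);
  /* Spin only on our own node until the predecessor hands the turn over
     (mcs_unlock in the patch does this with atomic_store_release).  */
  while (!atomic_load (&self->locked))
    ;
}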

The implementation of this MCS-based spin-waiting algorithm requires an
additional pointer to hold the tail of the queue. To keep the size of the
mutex data structure constant and leave the user-space ABI unchanged, the
__list field, which is originally used for implementing robust futexes, is
reused. Therefore, this patch proposes a new mutex type with the GNU
extension PTHREAD_MUTEX_QUEUESPINNER_NP.
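
Assuming this patch is applied, an application would opt in through the
usual mutex-attribute interface; the helper name below is hypothetical, and
PTHREAD_MUTEX_QUEUESPINNER_NP is the extension proposed here, not part of
released glibc:

#define _GNU_SOURCE
#include <pthread.h>

int
make_queuespinner_mutex (pthread_mutex_t *m)
{
  pthread_mutexattr_t attr;
  int err = pthread_mutexattr_init (&attr);
  if (err != 0)
    return err;
  /* Request the queue-spinner type proposed by this patch.  */
  err = pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_QUEUESPINNER_NP);
  if (err == 0)
    err = pthread_mutex_init (m, &attr);
  pthread_mutexattr_destroy (&attr);
  return err;
}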

Passes the ABI compatibility test ("make check-abi").

The benchmark is available on the mutex_testing branch of
https://github.com/kemicoco/tst-spinlock.git.
The tunable pthread.mutex_spin_count is set to 10000, which is large enough
in our testing for a fair comparison.

The first test case emulates a practical workload with mathematical
calculation.
The second test case, provided by our customer, emulates the lock
contention in a distributed file system.

Each workload runs with multiple threads in parallel; each thread does
a) Acquire the lock;
b) Do some work in the critical section;
c) Unlock;
d) Wait a random time (0~3000 TSCs)
in a loop for 5 seconds, and the total number of iterations is obtained
(a sketch of this loop follows).
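
The per-thread loop has roughly the following shape; do_work, rand_delay
and bench_lock are placeholders, and the real harness is in the repository
linked above:

#include <pthread.h>
#include <stdint.h>
#include <time.h>

extern pthread_mutex_t bench_lock;
extern void do_work (void);          /* b) Critical-section work.  */
extern void rand_delay (void);       /* d) Wait 0~3000 TSCs.  */

static uint64_t
bench_thread (void)
{
  uint64_t iters = 0;
  time_t end = time (NULL) + 5;      /* Run for 5 seconds.  */
  while (time (NULL) < end)
    {
      pthread_mutex_lock (&bench_lock);    /* a) Acquire the lock.  */
      do_work ();
      pthread_mutex_unlock (&bench_lock);  /* c) Unlock.  */
      rand_delay ();
      iters++;
    }
  return iters;                      /* Summed across threads for the tables.  */
}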

Testing on a 2-socket Skylake platform (112 cores with 62G RAM, HT=on).
TC1: Hashwork
Thread num      adaptive mutex          mcs mutex
	1               7297792             7298755 (+0.01%)
	2               9332105             9752901 (+4.51%)
	3               10428251            11029771 (+5.7%)
	4               10572596            11203997 (+5.9%)
	5               10496815            11008433 (+4.8%)
	6               10292946            10569467 (+2.6%)
	7               9861111             10153538 (+2.97%)
	14              5845303             9756283  (+66.91%)
	28              4299209             8985135  (+109.0%)
	56              3298821             5747645  (+74.23%)
	112             2941309             5629700  (+91.4%)
	448             2821056             3716799  (+31.75%)

TC2: Test_and_set instruction on shared variables
Thread num      adaptive mutex          mcs mutex
	1               7748765             7751963 (+0.04%)
	2               8521596             9251649 (+8.57%)
	3               9594653             9890211 (+3.08%)
	4               9934361             9800205 (-1.35%)
	5               8146007             9597982 (+17.82%)
	6               6896436             9367882 (+35.84%)
	7               5943880             9159893 (+54.11%)
	14              4194305             8623029 (+105.59%)
	28              3374218             7818631 (+131.72%)
	56              2533912             4836622 (+90.88%)
	112             2541950             4850938 (+90.84%)
	448             2507000             5345149 (+113.21%)

Test results on a workstation (16 cores with 32G RAM, HT=on)
TC1: Hashwork
Thread num      adaptive mutex          mcs mutex
	1               11419169            11430369 (+0.1%)
	2               15364452            15873667 (+3.31%)
	3               17234014            17547329 (+1.82%)
	4               17711736            17613548 (-0.55%)
	5               16583392            17626707 (+6.29%)
	6               14855586            17305468 (+16.49%)
	7               12948851            17130807 (+32.3%)
	14              8698172             15322237 (+76.15%)
	16              8123930             14937645 (+83.87%)
	64              7946006             5685132 (-28.45%)

TC2: Test_and_set instruction on shared variables
Thread num      adaptive mutex          mcs mutex
	1               12535156            12595629 (+0.48%)
	2               15665576            15929652 (+1.69%)
	3               17469331            16881225 (-3.37%)
	4               14833035            15777572 (+6.37%)
	5               12376033            15256528 (+23.27%)
	6               10568899            14693661 (+39.03%)
	7               9657775             14486039 (+49.99%)
	14              8015061             14112508 (+76.07%)
	16              7641725             13935473 (+82.36%)
	64              7571112             7735482 (+2.17%)

Potential issues:
a) Since preemption cannot be disabled in userland while spinning, an
MCS-style lock risks collapsing lock performance when CPUs are heavily
oversubscribed. Generally, however, the MCS-based spin-waiting algorithm
performs much better than the existing one. We may consider mitigating this
issue in the future by using a cancellable MCS lock to avoid unnecessary
active waiting.

Reference:
[1]"The performance of spin lock alternatives for shared-memory
multiprocessors"
https://www.cc.gatech.edu/classes/AY2009/cs4210_fall/papers/anderson-spinlock.pdf.
[2]"Algorithms for scalable synchronization on shared-memory
multiprocessors"
http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf

    * NEWS: New entry.
    * nptl/Makefile (libpthread-routines): Add mcs_lock.
    * nptl/mcs_lock.c: New file.
    * nptl/mcs_lock.h: New file.
    * nptl/allocatestack.c (__data.__list): Convert to __data.__list.__list_t.
    * nptl/descr.h (__data.__list): Likewise.
    * nptl/nptl-init.c (__data.__list): Likewise.
    * nptl/pthreadP.h: Extend the mutex type mask.
    * nptl/pthread_mutex_init.c: Add robust futex checking.
    * nptl/pthread_mutex_lock.c: Implement the new mutex type.
    * nptl/pthread_mutex_timedlock.c: Likewise.
    * nptl/pthread_mutex_trylock.c: Likewise.
    * nptl/pthread_mutex_unlock.c: Likewise.
    * nptl/pthread_mutexattr_settype.c: Accept the new mutex attribute.
    * sysdeps/nptl/bits/thread-shared-types.h: Redefine
      struct __pthread_internal_list and define mcs_lock_t.
    * sysdeps/nptl/pthread.h: Add the new mutex type.
    * sysdeps/unix/sysv/linux/hppa/pthread.h: Likewise.

Possible TODO list:
  a) Add mutex performance test cases to benchtests
  b) Update the mutex description in nptl/thread.texi
  c) Update nptl/printers.py

V2->V3 (mostly addressing Siddhesh Poyarekar's comments):
  a) Add ChangeLog entry
  b) Add a detailed description of the synchronization usage in mcs_lock.c
  c) Correct formatting of the mcs_lock ()/mcs_unlock () definitions
  d) Keep only the necessary info in NEWS
  e) Remove a superfluous barrier in mcs_lock.c
  f) Remove one line of superfluous code (node.locked = 1)
  g) Use atomic_load_acquire in mcs_lock () instead of atomic_load_relaxed to
  synchronize with the atomic_store_release in mcs_unlock ()
  h) Add PI/PP support for the new mutex type with the GNU extension
  PTHREAD_MUTEX_QUEUESPINNER_NP

V1->V2:
  a) Add a one-line description before the copyright notice in new files
  b) Add a NEWS entry introducing the new mutex type with the GNU extension
  PTHREAD_MUTEX_QUEUESPINNER_NP

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 NEWS                                    |  3 ++
 nptl/Makefile                           |  3 +-
 nptl/allocatestack.c                    |  2 +-
 nptl/descr.h                            | 26 +++++-----
 nptl/mcs_lock.c                         | 87 +++++++++++++++++++++++++++++++++
 nptl/mcs_lock.h                         | 23 +++++++++
 nptl/nptl-init.c                        |  2 +-
 nptl/pthreadP.h                         |  6 ++-
 nptl/pthread_mutex_init.c               |  5 ++
 nptl/pthread_mutex_lock.c               | 36 +++++++++++++-
 nptl/pthread_mutex_timedlock.c          | 33 +++++++++++--
 nptl/pthread_mutex_trylock.c            |  7 ++-
 nptl/pthread_mutex_unlock.c             |  9 +++-
 nptl/pthread_mutexattr_settype.c        |  2 +-
 sysdeps/nptl/bits/thread-shared-types.h | 21 ++++++--
 sysdeps/nptl/pthread.h                  | 15 ++++--
 sysdeps/unix/sysv/linux/hppa/pthread.h  |  4 ++
 17 files changed, 247 insertions(+), 37 deletions(-)
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h
  

Comments

Joseph Myers Jan. 14, 2019, 6:08 p.m. UTC | #1
On Mon, 14 Jan 2019, Kemi Wang wrote:

> +/* MCS-style lock for queuing spinners
> +   Copyright (C) 2018 Free Software Foundation, Inc.

Any patch submission now including new files needs to include 2019 in 
their copyright dates.
  
Kemi Wang Jan. 15, 2019, 1:46 a.m. UTC | #2
On 2019/1/15 2:08 AM, Joseph Myers wrote:
> On Mon, 14 Jan 2019, Kemi Wang wrote:
> 
>> +/* MCS-style lock for queuing spinners
>> +   Copyright (C) 2018 Free Software Foundation, Inc.
> 
> Any patch submission now including new files needs to include 2019 in 
> their copyright dates.
> 

sure.
  

Patch

diff --git a/NEWS b/NEWS
index ae80818..0521b56 100644
--- a/NEWS
+++ b/NEWS
@@ -46,6 +46,9 @@  Major new features:
   incosistent mutex state after fork call in multithread environment.
   In both popen and system there is no direct access to user-defined mutexes.
 
+* A new mutex type with the PTHREAD_MUTEX_QUEUESPINNER_NP GNU extension has
+  been added, which accelerates lock acquisition by queuing spinners.
+
 Deprecated and removed features, and other changes affecting compatibility:
 
 * The glibc.tune tunable namespace has been renamed to glibc.cpu and the
diff --git a/nptl/Makefile b/nptl/Makefile
index 34ae830..0de8df3 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -145,7 +145,8 @@  libpthread-routines = nptl-init nptlfreeres vars events version pt-interp \
 		      mtx_destroy mtx_init mtx_lock mtx_timedlock \
 		      mtx_trylock mtx_unlock call_once cnd_broadcast \
 		      cnd_destroy cnd_init cnd_signal cnd_timedwait cnd_wait \
-		      tss_create tss_delete tss_get tss_set pthread_mutex_conf
+		      tss_create tss_delete tss_get tss_set pthread_mutex_conf \
+		      mcs_lock
 #		      pthread_setuid pthread_seteuid pthread_setreuid \
 #		      pthread_setresuid \
 #		      pthread_setgid pthread_setegid pthread_setregid \
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 04e3f08..9f47129 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -749,7 +749,7 @@  allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
      might have happened in the kernel.  */
   pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
 				  - offsetof (pthread_mutex_t,
-					      __data.__list.__next));
+					      __data.__list.__list_t.__next));
   pd->robust_head.list_op_pending = NULL;
 #if __PTHREAD_MUTEX_HAVE_PREV
   pd->robust_prev = &pd->robust_head;
diff --git a/nptl/descr.h b/nptl/descr.h
index 9c01e1b..e47c7cf 100644
--- a/nptl/descr.h
+++ b/nptl/descr.h
@@ -184,38 +184,38 @@  struct pthread
      FIXME We should use relaxed MO atomic operations here and signal fences
      because this kind of concurrency is similar to synchronizing with a
      signal handler.  */
-# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __next))
+# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __list_t.__next))
 
 # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
   do {									      \
     __pthread_list_t *next = (__pthread_list_t *)			      \
       ((((uintptr_t) THREAD_GETMEM (THREAD_SELF, robust_head.list)) & ~1ul)   \
        - QUEUE_PTR_ADJUST);						      \
-    next->__prev = (void *) &mutex->__data.__list.__next;		      \
-    mutex->__data.__list.__next = THREAD_GETMEM (THREAD_SELF,		      \
+    next->__list_t.__prev = (void *) &mutex->__data.__list.__list_t.__next;		      \
+    mutex->__data.__list.__list_t.__next = THREAD_GETMEM (THREAD_SELF,		      \
 						 robust_head.list);	      \
-    mutex->__data.__list.__prev = (void *) &THREAD_SELF->robust_head;	      \
+    mutex->__data.__list.__list_t.__prev = (void *) &THREAD_SELF->robust_head;	      \
     /* Ensure that the new list entry is ready before we insert it.  */	      \
     __asm ("" ::: "memory");						      \
     THREAD_SETMEM (THREAD_SELF, robust_head.list,			      \
-		   (void *) (((uintptr_t) &mutex->__data.__list.__next)	      \
+		   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)	      \
 			     | val));					      \
   } while (0)
 # define DEQUEUE_MUTEX(mutex) \
   do {									      \
     __pthread_list_t *next = (__pthread_list_t *)			      \
-      ((char *) (((uintptr_t) mutex->__data.__list.__next) & ~1ul)	      \
+      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__next) & ~1ul)	      \
        - QUEUE_PTR_ADJUST);						      \
-    next->__prev = mutex->__data.__list.__prev;				      \
+    next->__list_t.__prev = mutex->__data.__list.__list_t.__prev;				      \
     __pthread_list_t *prev = (__pthread_list_t *)			      \
-      ((char *) (((uintptr_t) mutex->__data.__list.__prev) & ~1ul)	      \
+      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__prev) & ~1ul)	      \
        - QUEUE_PTR_ADJUST);						      \
-    prev->__next = mutex->__data.__list.__next;				      \
+    prev->__list_t.__next = mutex->__data.__list.__list_t.__next;			\
     /* Ensure that we remove the entry from the list before we change the     \
        __next pointer of the entry, which is read by the kernel.  */	      \
     __asm ("" ::: "memory");						      \
-    mutex->__data.__list.__prev = NULL;					      \
-    mutex->__data.__list.__next = NULL;					      \
+    mutex->__data.__list.__list_t.__prev = NULL;					      \
+    mutex->__data.__list.__list_t.__next = NULL;					      \
   } while (0)
 #else
   union
@@ -226,7 +226,7 @@  struct pthread
 
 # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
   do {									      \
-    mutex->__data.__list.__next						      \
+    mutex->__data.__list.__list_t.__next						      \
       = THREAD_GETMEM (THREAD_SELF, robust_list.__next);		      \
     /* Ensure that the new list entry is ready before we insert it.  */	      \
     __asm ("" ::: "memory");						      \
@@ -253,7 +253,7 @@  struct pthread
 	/* Ensure that we remove the entry from the list before we change the \
 	   __next pointer of the entry, which is read by the kernel.  */      \
 	    __asm ("" ::: "memory");					      \
-	mutex->__data.__list.__next = NULL;				      \
+	mutex->__data.__list.__list_t.__next = NULL;				      \
       }									      \
   } while (0)
 #endif
diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c
new file mode 100644
index 0000000..835fbce
--- /dev/null
+++ b/nptl/mcs_lock.c
@@ -0,0 +1,87 @@ 
+/* MCS-style lock for queuing spinners
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "pthreadP.h"
+#include <atomic.h>
+
+static __thread mcs_lock_t node = {
+  NULL,
+  0,
+};
+
+void
+mcs_lock (mcs_lock_t **lock)
+{
+  mcs_lock_t *prev;
+
+  /* Initialize node.  */
+  node.next = NULL;
+  node.locked = 0;
+
+  /* Atomically move the tail pointer of the queue (lock) to point to the
+   * current TLS node, and return the previous tail pointer of the queue.
+   * This atomic_exchange_acquire ensures strict FIFO order in the queue
+   * when multiple spinners compete to enqueue.  */
+  prev = atomic_exchange_acquire (lock, &node);
+
+  /* No active spinner; the mcs lock is acquired immediately.  */
+  if (prev == NULL)
+    return;
+
+  /* Add the current spinner to the queue.  */
+  atomic_store_relaxed (&prev->next, &node);
+  /* Wait until mcs lock ownership is passed down from the previous spinner.
+   * atomic_load_acquire synchronizes with the atomic_store_release in
+   * mcs_unlock, and ensures that the prior critical section happens before
+   * this critical section.  */
+  while (!atomic_load_acquire (&node.locked))
+    atomic_spin_nop ();
+}
+
+void
+mcs_unlock (mcs_lock_t **lock)
+{
+  mcs_lock_t *next = node.next;
+
+  /* If the next pointer is not NULL, we are not the last spinner in the
+   * queue, so we have to pass mcs lock ownership down to the next spinner
+   * before exiting.  If the next pointer is NULL, there are two possible
+   * cases: a) we are the last one in the queue, so we reset the tail
+   * pointer of the queue to NULL; b) the tail pointer of the queue has
+   * moved to point to a new spinner, but this new spinner has not yet
+   * been linked into the queue, so we have to wait until it is added to
+   * the queue.  */
+  if (next == NULL)
+    {
+      /* a) Use a CAS instruction to set lock to NULL if the tail pointer
+       * still points to the current node.  */
+      if (atomic_compare_and_exchange_val_acq (lock, NULL, &node) == &node)
+        return;
+
+      /* b) If we reach here, we hit case b) above, because there is a
+       * time window between updating the tail pointer of the queue and
+       * linking the new spinner into the queue in mcs_lock ().  Spin with
+       * relaxed MO until we observe that the new spinner has been added.  */
+      while (! (next = atomic_load_relaxed (&node.next)))
+        atomic_spin_nop ();
+    }
+
+  /* Pass mcs lock ownership down to the next spinner; synchronizes with
+   * the atomic_load_acquire in mcs_lock ().  */
+  atomic_store_release (&next->locked, 1);
+}
diff --git a/nptl/mcs_lock.h b/nptl/mcs_lock.h
new file mode 100644
index 0000000..5ee32f5
--- /dev/null
+++ b/nptl/mcs_lock.h
@@ -0,0 +1,23 @@ 
+/* MCS-style lock for queuing spinners
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+extern void mcs_lock (mcs_lock_t **lock)
+  __attribute__ ((visibility ("hidden")));
+
+extern void mcs_unlock (mcs_lock_t **lock)
+  __attribute__ ((visibility ("hidden")));
diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
index 20ff3fd..d20481a 100644
--- a/nptl/nptl-init.c
+++ b/nptl/nptl-init.c
@@ -289,7 +289,7 @@  __pthread_initialize_minimal_internal (void)
 #ifdef __NR_set_robust_list
     pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
 				    - offsetof (pthread_mutex_t,
-						__data.__list.__next));
+						__data.__list.__list_t.__next));
     INTERNAL_SYSCALL_DECL (err);
     int res = INTERNAL_SYSCALL (set_robust_list, err, 2, &pd->robust_head,
 				sizeof (struct robust_list_head));
diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index 7f16ba9..7f5d8d2 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -67,7 +67,7 @@  static inline short max_adaptive_count (void)
 /* Internal mutex type value.  */
 enum
 {
-  PTHREAD_MUTEX_KIND_MASK_NP = 3,
+  PTHREAD_MUTEX_KIND_MASK_NP = 7,
 
   PTHREAD_MUTEX_ELISION_NP    = 256,
   PTHREAD_MUTEX_NO_ELISION_NP = 512,
@@ -88,6 +88,8 @@  enum
   = PTHREAD_MUTEX_PRIO_INHERIT_NP | PTHREAD_MUTEX_ERRORCHECK_NP,
   PTHREAD_MUTEX_PI_ADAPTIVE_NP
   = PTHREAD_MUTEX_PRIO_INHERIT_NP | PTHREAD_MUTEX_ADAPTIVE_NP,
+  PTHREAD_MUTEX_PI_QUEUESPINNER_NP
+  = PTHREAD_MUTEX_PRIO_INHERIT_NP | PTHREAD_MUTEX_QUEUESPINNER_NP,
   PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP
   = PTHREAD_MUTEX_PRIO_INHERIT_NP | PTHREAD_MUTEX_ROBUST_NORMAL_NP,
   PTHREAD_MUTEX_PI_ROBUST_RECURSIVE_NP
@@ -105,6 +107,8 @@  enum
   = PTHREAD_MUTEX_PRIO_PROTECT_NP | PTHREAD_MUTEX_ERRORCHECK_NP,
   PTHREAD_MUTEX_PP_ADAPTIVE_NP
   = PTHREAD_MUTEX_PRIO_PROTECT_NP | PTHREAD_MUTEX_ADAPTIVE_NP,
+  PTHREAD_MUTEX_PP_QUEUESPINNER_NP
+  = PTHREAD_MUTEX_PRIO_PROTECT_NP | PTHREAD_MUTEX_QUEUESPINNER_NP,
   PTHREAD_MUTEX_ELISION_FLAGS_NP
   = PTHREAD_MUTEX_ELISION_NP | PTHREAD_MUTEX_NO_ELISION_NP,
 
diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
index 5cf290c..99f1707 100644
--- a/nptl/pthread_mutex_init.c
+++ b/nptl/pthread_mutex_init.c
@@ -111,6 +111,11 @@  __pthread_mutex_init (pthread_mutex_t *mutex,
 	return ENOTSUP;
 #endif
 
+      /* Robust mutexes do not support the PTHREAD_MUTEX_QUEUESPINNER_NP
+         GNU extension.  */
+      if ((imutexattr->mutexkind & PTHREAD_MUTEX_QUEUESPINNER_NP) != 0)
+        return ENOTSUP;
+
       mutex_kind |= PTHREAD_MUTEX_ROBUST_NORMAL_NP;
     }
 
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 474b4df..db1b6d1 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include <mcs_lock.h>
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -118,6 +119,35 @@  __pthread_mutex_lock (pthread_mutex_t *mutex)
       mutex->__data.__count = 1;
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
+			 == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
+    {
+      if (! __is_smp)
+        goto simple;
+
+      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+        {
+          int cnt = 0;
+          int max_cnt = MIN (max_adaptive_count (),
+                            mutex->__data.__spins * 2 + 10);
+          int val = 0;
+
+          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
+
+          do
+            {
+              atomic_spin_nop ();
+              val = atomic_load_relaxed (&mutex->__data.__lock);
+            }
+          while (val != 0 && ++cnt < max_cnt);
+
+          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
+          LLL_MUTEX_LOCK (mutex);
+
+          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+        }
+      assert (mutex->__data.__owner == 0);
+    }
+  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			  == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     {
       if (! __is_smp)
@@ -179,7 +209,7 @@  __pthread_mutex_lock_full (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We need to set op_pending before starting the operation.  Also
 	 see comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -347,6 +377,7 @@  __pthread_mutex_lock_full (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_PI_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PI_NORMAL_NP:
     case PTHREAD_MUTEX_PI_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_PI_QUEUESPINNER_NP:
     case PTHREAD_MUTEX_PI_ROBUST_RECURSIVE_NP:
     case PTHREAD_MUTEX_PI_ROBUST_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP:
@@ -365,7 +396,7 @@  __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	  {
 	    /* Note: robust PI futexes are signaled by setting bit 0.  */
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				     | 1));
 	    /* We need to set op_pending before starting the operation.  Also
 	       see comments at ENQUEUE_MUTEX.  */
@@ -509,6 +540,7 @@  __pthread_mutex_lock_full (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_PP_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PP_NORMAL_NP:
     case PTHREAD_MUTEX_PP_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_PP_QUEUESPINNER_NP:
       {
 	/* See concurrency notes regarding __kind in struct __pthread_mutex_s
 	   in sysdeps/nptl/bits/thread-shared-types.h.  */
diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 453b824..75bd734 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,6 +25,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <not-cancel.h>
+#include <mcs_lock.h>
 
 #include <stap-probe.h>
 
@@ -135,13 +136,37 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
       break;
-
+    case PTHREAD_MUTEX_QUEUESPINNER_NP:
+      if (! __is_smp)
+        goto simple;
+
+      if (lll_trylock (mutex) != 0)
+        {
+          int cnt = 0;
+          int max_cnt = MIN (max_adaptive_count (),
+                            mutex->__data.__spins * 2 + 10);
+          int val = 0;
+
+          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
+          do
+            {
+              atomic_spin_nop ();
+              val = atomic_load_relaxed (&mutex->__data.__lock);
+            }
+          while (val != 0 && ++cnt < max_cnt);
+          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock);
+          result = lll_timedlock(mutex->__data.__lock, abstime,
+                      PTHREAD_MUTEX_PSHARED (mutex));
+
+          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+        }
+      break;
     case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
     case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We need to set op_pending before starting the operation.  Also
 	 see comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -335,6 +360,7 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
     case PTHREAD_MUTEX_PI_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PI_NORMAL_NP:
     case PTHREAD_MUTEX_PI_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_PI_QUEUESPINNER_NP:
     case PTHREAD_MUTEX_PI_ROBUST_RECURSIVE_NP:
     case PTHREAD_MUTEX_PI_ROBUST_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP:
@@ -353,7 +379,7 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	  {
 	    /* Note: robust PI futexes are signaled by setting bit 0.  */
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				     | 1));
 	    /* We need to set op_pending before starting the operation.  Also
 	       see comments at ENQUEUE_MUTEX.  */
@@ -516,6 +542,7 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
     case PTHREAD_MUTEX_PP_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PP_NORMAL_NP:
     case PTHREAD_MUTEX_PP_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_PP_QUEUESPINNER_NP:
       {
 	/* See concurrency notes regarding __kind in struct __pthread_mutex_s
 	   in sysdeps/nptl/bits/thread-shared-types.h.  */
diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c
index fa90c1d..2545d15 100644
--- a/nptl/pthread_mutex_trylock.c
+++ b/nptl/pthread_mutex_trylock.c
@@ -78,6 +78,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
       FORCE_ELISION (mutex, goto elision);
       /*FALL THROUGH*/
     case PTHREAD_MUTEX_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_QUEUESPINNER_NP:
     case PTHREAD_MUTEX_ERRORCHECK_NP:
       if (lll_trylock (mutex->__data.__lock) != 0)
 	break;
@@ -93,7 +94,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
 
       oldval = mutex->__data.__lock;
       do
@@ -196,6 +197,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_PI_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PI_NORMAL_NP:
     case PTHREAD_MUTEX_PI_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_PI_QUEUESPINNER_NP:
     case PTHREAD_MUTEX_PI_ROBUST_RECURSIVE_NP:
     case PTHREAD_MUTEX_PI_ROBUST_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP:
@@ -213,7 +215,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	if (robust)
 	  /* Note: robust PI futexes are signaled by setting bit 0.  */
 	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				   | 1));
 
 	oldval = mutex->__data.__lock;
@@ -332,6 +334,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_PP_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PP_NORMAL_NP:
     case PTHREAD_MUTEX_PP_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_PP_QUEUESPINNER_NP:
       {
 	/* See concurrency notes regarding __kind in struct __pthread_mutex_s
 	   in sysdeps/nptl/bits/thread-shared-types.h.  */
diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
index 68d04d5..abca8e7 100644
--- a/nptl/pthread_mutex_unlock.c
+++ b/nptl/pthread_mutex_unlock.c
@@ -78,6 +78,9 @@  __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
       goto normal;
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
+			      == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
+    goto normal;
+  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			      == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     goto normal;
   else
@@ -142,7 +145,7 @@  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
     robust:
       /* Remove mutex from the list.  */
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We must set op_pending before we dequeue the mutex.  Also see
 	 comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -213,6 +216,7 @@  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
     case PTHREAD_MUTEX_PI_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PI_NORMAL_NP:
     case PTHREAD_MUTEX_PI_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_PI_QUEUESPINNER_NP:
     case PTHREAD_MUTEX_PI_ROBUST_ERRORCHECK_NP:
     case PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_PI_ROBUST_ADAPTIVE_NP:
@@ -242,7 +246,7 @@  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
 	  /* Remove mutex from the list.
 	     Note: robust PI futexes are signaled by setting bit 0.  */
 	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				   | 1));
 	  /* We must set op_pending before we dequeue the mutex.  Also see
 	     comments at ENQUEUE_MUTEX.  */
@@ -311,6 +315,7 @@  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
 
     case PTHREAD_MUTEX_PP_NORMAL_NP:
     case PTHREAD_MUTEX_PP_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_PP_QUEUESPINNER_NP:
       /* Always reset the owner field.  */
     pp:
       mutex->__data.__owner = 0;
diff --git a/nptl/pthread_mutexattr_settype.c b/nptl/pthread_mutexattr_settype.c
index 7d36cc3..c2382b4 100644
--- a/nptl/pthread_mutexattr_settype.c
+++ b/nptl/pthread_mutexattr_settype.c
@@ -25,7 +25,7 @@  __pthread_mutexattr_settype (pthread_mutexattr_t *attr, int kind)
 {
   struct pthread_mutexattr *iattr;
 
-  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_ADAPTIVE_NP)
+  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_QUEUESPINNER_NP)
     return EINVAL;
 
   /* Cannot distinguish between DEFAULT and NORMAL. So any settype
diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
index 05c94e7..1cf8874 100644
--- a/sysdeps/nptl/bits/thread-shared-types.h
+++ b/sysdeps/nptl/bits/thread-shared-types.h
@@ -79,15 +79,19 @@ 
 /* Common definition of pthread_mutex_t. */
 
 #if !__PTHREAD_MUTEX_USE_UNION
-typedef struct __pthread_internal_list
+typedef union __pthread_internal_list
 {
-  struct __pthread_internal_list *__prev;
-  struct __pthread_internal_list *__next;
+  struct {
+    union __pthread_internal_list *__prev;
+    union __pthread_internal_list *__next;
+  } __list_t;
+  void *mcs_lock;
 } __pthread_list_t;
 #else
-typedef struct __pthread_internal_slist
+typedef union __pthread_internal_slist
 {
-  struct __pthread_internal_slist *__next;
+  union __pthread_internal_slist *__next;
+  void *mcs_lock;
 } __pthread_slist_t;
 #endif
 
@@ -165,6 +169,13 @@  struct __pthread_mutex_s
   __PTHREAD_COMPAT_PADDING_END
 };
 
+struct mcs_lock
+{
+  struct mcs_lock *next;
+  int locked;
+};
+
+typedef struct mcs_lock mcs_lock_t;
 
 /* Common definition of pthread_cond_t. */
 
diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
index df049ab..4b4b80a 100644
--- a/sysdeps/nptl/pthread.h
+++ b/sysdeps/nptl/pthread.h
@@ -45,7 +45,8 @@  enum
   PTHREAD_MUTEX_TIMED_NP,
   PTHREAD_MUTEX_RECURSIVE_NP,
   PTHREAD_MUTEX_ERRORCHECK_NP,
-  PTHREAD_MUTEX_ADAPTIVE_NP
+  PTHREAD_MUTEX_ADAPTIVE_NP,
+  PTHREAD_MUTEX_QUEUESPINNER_NP
 #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
   ,
   PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
@@ -85,14 +86,16 @@  enum
 
 #if __PTHREAD_MUTEX_HAVE_PREV
 # define PTHREAD_MUTEX_INITIALIZER \
-  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { { 0, 0 } } } }
 # ifdef __USE_GNU
 #  define PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 #  define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
+#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 
 # endif
 #else
@@ -105,6 +108,8 @@  enum
   { { 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, 0, { __PTHREAD_SPINS } } }
 #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
   { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, 0, { __PTHREAD_SPINS } } }
+#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, 0, { __PTHREAD_SPINS } } }
 
 # endif
 #endif
diff --git a/sysdeps/unix/sysv/linux/hppa/pthread.h b/sysdeps/unix/sysv/linux/hppa/pthread.h
index 11a024d..57c101c 100644
--- a/sysdeps/unix/sysv/linux/hppa/pthread.h
+++ b/sysdeps/unix/sysv/linux/hppa/pthread.h
@@ -46,6 +46,7 @@  enum
   PTHREAD_MUTEX_RECURSIVE_NP,
   PTHREAD_MUTEX_ERRORCHECK_NP,
-  PTHREAD_MUTEX_ADAPTIVE_NP
+  PTHREAD_MUTEX_ADAPTIVE_NP,
+  PTHREAD_MUTEX_QUEUESPINNER_NP
 #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
   ,
   PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
@@ -95,6 +96,9 @@  enum
 # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
   { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, { 0, 0, 0, 0 }, 0, \
       { __PTHREAD_SPINS }, { 0, 0 } } }
+# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, { 0, 0, 0, 0 }, 0, \
+      { __PTHREAD_SPINS }, { 0, 0 } } }
 #endif