[v2,1/5] Mutex: Queue spinners to reduce cache line bouncing and ensure fairness

Message ID 1531464752-18830-2-git-send-email-kemi.wang@intel.com

Commit Message

Kemi Wang July 13, 2018, 6:52 a.m. UTC
  The current adaptive mutex has two main problems. The first is fairness:
multiple spinners contend for the lock simultaneously, and there is no
guarantee that a spinner will ever acquire the lock, no matter how long it
has been waiting. The other is heavy cache line bouncing. Since the cache
line containing the mutex is shared among all of the spinners, each
spinner tries to acquire the lock via a cmpxchg instruction when the lock
is released, which constantly floods the system with "read-for-ownership"
requests. As a result, there is a lot of cache line bouncing on a large
system with many CPUs.

This patch introduces a new type of mutex, PTHREAD_MUTEX_QUEUESPINNER_NP,
which puts mutex spinners into a queue before they spin on the mutex lock
and allows only the spinner at the head of the queue to spin on the lock
word. This reduces the overhead of cache line bouncing when lock ownership
is transferred and lets tasks move forward faster, because there is only
one active spinner and the cache line of the mutex lock is contended only
between the lock holder and that spinner. At the same time, a degree of
lock fairness is guaranteed. However, this proposal has a potential issue
when CPUs are oversubscribed: if lock ownership is handed to the next
spinner in the queue while that spinner is not running (its CPU has been
scheduled to run another task), lock performance collapses (see more
details at the end of the commit log). This is why we introduce a new
mutex type rather than building the optimization into the existing mutex
disciplines.
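
For illustration, an application would opt in to the new type like any
other non-portable mutex kind, either through the new static initializer
or via pthread_mutexattr_settype. A minimal sketch (identifier names are
illustrative, error checking omitted):

  #define _GNU_SOURCE
  #include <pthread.h>

  /* Static initialization with the initializer added by this patch.  */
  static pthread_mutex_t m1 = PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP;

  /* Dynamic initialization via the mutex attribute.  */
  static pthread_mutex_t m2;

  static void
  init_queue_spinner_mutex (void)
  {
    pthread_mutexattr_t attr;

    pthread_mutexattr_init (&attr);
    pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_QUEUESPINNER_NP);
    pthread_mutex_init (&m2, &attr);
    pthread_mutexattr_destroy (&attr);
  }

As the pthread_mutex_init change below shows, requesting this type for a
robust mutex fails with ENOTSUP.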

The queue spinner mutex is implemented on top of an MCS lock, which
requires an additional pointer to hold the MCS lock tail. To keep the size
of the mutex data structure unchanged and leave the user space ABI intact,
the __list field, originally used to implement robust futexes, is reused
for this purpose. Consequently, a queue spinner mutex combined with a
robust futex is not supported.
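
For readers unfamiliar with MCS locks, the key idea is that each waiter
spins on a flag inside its own queue node rather than on the shared lock
word, so only the hand-off touches a remote cache line. Below is a
standalone sketch of the general technique using C11 atomics; it is an
illustration only, not the glibc code (which is added in nptl/mcs_lock.c
by this patch):

  #include <stdatomic.h>
  #include <stddef.h>

  struct mcs_node
  {
    struct mcs_node *_Atomic next;
    atomic_int locked;
  };

  static void
  mcs_acquire (struct mcs_node *_Atomic *tail, struct mcs_node *node)
  {
    atomic_store_explicit (&node->next, NULL, memory_order_relaxed);
    atomic_store_explicit (&node->locked, 0, memory_order_relaxed);

    /* Swing the tail to ourselves; the previous tail, if any, becomes
       our predecessor in the queue.  */
    struct mcs_node *prev
      = atomic_exchange_explicit (tail, node, memory_order_acq_rel);
    if (prev == NULL)
      return;  /* The queue was empty: we own the MCS lock.  */

    /* Link behind the predecessor, then spin on our own node only.  */
    atomic_store_explicit (&prev->next, node, memory_order_release);
    while (!atomic_load_explicit (&node->locked, memory_order_acquire))
      ;
  }

  static void
  mcs_release (struct mcs_node *_Atomic *tail, struct mcs_node *node)
  {
    struct mcs_node *next
      = atomic_load_explicit (&node->next, memory_order_acquire);
    if (next == NULL)
      {
        /* No visible successor: if we are still the tail, the queue is
           empty and the lock becomes free.  */
        struct mcs_node *expected = node;
        if (atomic_compare_exchange_strong_explicit
              (tail, &expected, NULL,
               memory_order_acq_rel, memory_order_acquire))
          return;
        /* A successor is enqueueing itself; wait for the link.  */
        while ((next = atomic_load_explicit (&node->next,
                                             memory_order_acquire)) == NULL)
          ;
      }
    /* Hand the lock to the next waiter.  */
    atomic_store_explicit (&next->locked, 1, memory_order_release);
  }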

The patch passes the ABI compatibility test ("make check-abi").

Test machine:
2-socket Skylake platform, 112 cores, 62 GB RAM

Test case: mutex-adaptive-thread/mutex-queuespinner-thread

Usage: make bench BENCHSET="mutex-adaptive-thread mutex-queuespinner-thread"

Test result:
+----------------+-----------------+-----------------+------------+
|  Configuration |      Base       |      Head       | % Change   |
|                | Total iteration | Total iteration | base->head |
+----------------+-----------------+-----------------+------------+
|                |            Critical section size: 1x           |
+----------------+-----------------+-----------------+------------+
|   1 thread     |     9304170     |     9318160     |    0.2%    |
+----------------+-----------------+-----------------+------------+
|   2 threads    |    14718000     |    14947600     |    1.6%    |
+----------------+-----------------+-----------------+------------+
|   3 threads    |    21436800     |    20249800     |   -5.5%    |
+----------------+-----------------+-----------------+------------+
|   4 threads    |    16657600     |    15656500     |   -6.0%    |
+----------------+-----------------+-----------------+------------+
|  28 threads    |     4020620     |    14757000     |  267.0%    |
+----------------+-----------------+-----------------+------------+
|  56 threads    |     3489400     |     8996000     |  157.8%    |
+----------------+-----------------+-----------------+------------+
| 112 threads    |     3102040     |     9106490     |  193.6%    |
+----------------+-----------------+-----------------+------------+
|                |            Critical section size: 10x          |
+----------------+-----------------+-----------------+------------+
|   1 thread     |     5226360     |     5228880     |    0.0%    |
+----------------+-----------------+-----------------+------------+
|   2 threads    |     6875240     |     7016720     |    2.1%    |
+----------------+-----------------+-----------------+------------+
|   3 threads    |     6323230     |     6053060     |   -4.3%    |
+----------------+-----------------+-----------------+------------+
|   4 threads    |     6215860     |     6388180     |    2.8%    |
+----------------+-----------------+-----------------+------------+
|  28 threads    |     3921620     |     5249650     |   33.9%    |
+----------------+-----------------+-----------------+------------+
|  56 threads    |     2855460     |     4308940     |   50.9%    |
+----------------+-----------------+-----------------+------------+
| 112 threads    |     2572420     |     4166650     |   62.0%    |
+----------------+-----------------+-----------------+------------+
|                |            Critical section size: 100x         |
+----------------+-----------------+-----------------+------------+
|   1 thread     |      968946     |      969081     |    0.0%    |
+----------------+-----------------+-----------------+------------+
|   2 threads    |      772844     |      776187     |    0.4%    |
+----------------+-----------------+-----------------+------------+
|   3 threads    |      808041     |      812314     |    0.5%    |
+----------------+-----------------+-----------------+------------+
|   4 threads    |      802213     |      794792     |   -0.9%    |
+----------------+-----------------+-----------------+------------+
|  28 threads    |      338170     |      339024     |    0.3%    |
+----------------+-----------------+-----------------+------------+
|  56 threads    |      339900     |      339932     |    0.0%    |
+----------------+-----------------+-----------------+------------+
| 112 threads    |      331791     |      335243     |    1.0%    |
+----------------+-----------------+-----------------+------------+
|                |            Critical section size: 1000x        |
+----------------+-----------------+-----------------+------------+
|   1 thread     |      106082     |      106102     |    0.0%    |
+----------------+-----------------+-----------------+------------+
|   2 threads    |      100833     |      100823     |   -0.0%    |
+----------------+-----------------+-----------------+------------+
|   3 threads    |      100965     |      100842     |   -0.1%    |
+----------------+-----------------+-----------------+------------+
|   4 threads    |       96813     |       96846     |    0.0%    |
+----------------+-----------------+-----------------+------------+
|  28 threads    |       52230     |       52024     |   -0.4%    |
+----------------+-----------------+-----------------+------------+
|  56 threads    |       48298     |       46427     |   -3.9%    |
+----------------+-----------------+-----------------+------------+
| 112 threads    |       45865     |       44405     |   -3.2%    |
+----------------+-----------------+-----------------+------------+

Although the queue spinner mutex performs better than the adaptive mutex,
people may ask why we need this new mutex type when pthread spin locks and
pthread mutexes already exist. Therefore, we designed the test cases below
and explored how each lock discipline performs.
Let *s* denote the size of the critical section and *t* the size of the
non-critical section, with t = k*s. Each loop iteration then takes (k+1)*s
in the absence of contention, so the rate at which a single core attempts
to acquire the lock is 1/(k+1) (in units of 1/s). We further assume *n*
threads contend for the lock, each thread bound to an individual CPU core,
and each thread does the following:
1) lock
2) spend *s* nanoseconds in the critical section
3) unlock
4) spend *t* nanoseconds in the non-critical section
in a loop for 5 seconds; lock performance is measured as the total
throughput (a sketch of this loop follows).
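
A sketch of one such benchmark thread is shown below; busy_wait_ns () is a
hypothetical calibrated delay helper and the stop flag is assumed to be
raised by the main thread after 5 seconds:

  #include <pthread.h>
  #include <stdatomic.h>

  extern pthread_mutex_t lock;         /* initialized with the type under test */
  extern atomic_int stop;              /* set to 1 after 5 seconds */
  extern void busy_wait_ns (long ns);  /* hypothetical calibrated busy-wait */

  struct params { long s; long t; };   /* critical / non-critical lengths */

  static void *
  worker (void *arg)
  {
    struct params *p = arg;
    long iterations = 0;

    while (!atomic_load (&stop))
      {
        pthread_mutex_lock (&lock);    /* 1) lock */
        busy_wait_ns (p->s);           /* 2) critical section, s ns */
        pthread_mutex_unlock (&lock);  /* 3) unlock */
        busy_wait_ns (p->t);           /* 4) non-critical section, t = k*s ns */
        iterations++;
      }
    /* Per-thread counts are summed to obtain the reported throughput.  */
    return (void *) iterations;
  }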

To emulate different usage scenarios, we set k=6 and s=100ns, run this
workload with each lock discipline, then increase *s* and repeat. In this
workload, 4 threads contending for the lock emulate light lock contention,
28 threads emulate severe lock contention within a socket, and 56 threads
emulate severe lock contention across sockets.

The benchmark was provided by Andi Kleen.

+-------+-------------+--------------+----------------+---------------+
|  Num  |  Spin Lock  | Normal Mutex | Adaptive Mutex | Queue Spinner |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=100ns t=600ns                            |
+-------+-------------+--------------+----------------+---------------+
|   4   |  12117662   |   7124320    |    10372184    |     9557689   |
+-------+-------------+--------------+----------------+---------------+
|  28   |   2695783   |   6385815    |     3927942    |     7182092   |
+-------+-------------+--------------+----------------+---------------+
|  56   |   2203519   |   4555164    |     3143599    |     4690016   |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=1000ns t=6000ns                          |
+-------+-------------+--------------+----------------+---------------+
|   4   |   1529542   |   1380643    |     1495118    |     1503344   |
+-------+-------------+--------------+----------------+---------------+
|  28   |   2063929   |   1695128    |     2064940    |     2205245   |
+-------+-------------+--------------+----------------+---------------+
|  56   |   1507764   |   1427931    |     1704105    |     1720832   |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=10000ns t=60000ns                        |
+-------+-------------+--------------+----------------+---------------+
|   4   |    159407   |    159213    |      159213    |      159215   |
+-------+-------------+--------------+----------------+---------------+
|  28   |    272062   |    153567    |      223229    |      224948   |
+-------+-------------+--------------+----------------+---------------+
|  56   |    269920   |    157287    |      239814    |      239887   |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=100000ns t=600000ns                      |
+-------+-------------+--------------+----------------+---------------+
|   4   |     16024   |     16023    |       16021    |       16021   |
+-------+-------------+--------------+----------------+---------------+
|  28   |     27990   |     20421    |       20372    |       20378   |
+-------+-------------+--------------+----------------+---------------+
|  56   |     27987   |     20395    |       20322    |       20348   |
+-------+-------------+--------------+----------------+---------------+
|                     |    s=1000000ns t=6000000ns                    |
+-------+-------------+--------------+----------------+---------------+
|   4   |      1604   |      1604    |        1604    |        1604   |
+-------+-------------+--------------+----------------+---------------+
|  28   |      2826   |      2748    |        2748    |        2748   |
+-------+-------------+--------------+----------------+---------------+
|  56   |      2853   |      2773    |        2774    |        2773   |
+-------+-------------+--------------+----------------+---------------+

Generally, we can draw some conclusions from the test results above:
a) With light lock contention, the spin lock performs best; the queue
spinner mutex performs similarly to the adaptive mutex, and both perform a
little better than the normal pthread mutex.
b) With severe lock contention on a large number of CPUs and a small
critical section (less than 1000ns), most lock acquisitions are obtained
by spinning, and the queue spinner mutex performs much better than the
spin lock and the adaptive mutex. This is because the overhead of heavy
cache line bouncing plays a big role in lock performance.
c) As the critical section grows, the performance advantage of the queue
spinner mutex gradually shrinks. Cache line bouncing is no longer the
bottleneck of lock performance; instead, the overhead of futex_wait and
futex_wake plays the bigger role. Once the critical section reaches 1ms,
even the latency of the futex syscall becomes negligible compared to the
total time of a lock acquisition.

As shown above, the queue spinner mutex performs well across these kinds
of workload, but using it carries a potential risk: when lock ownership is
handed to the next spinner in the queue while that spinner is not running
(its CPU has been scheduled to run another task), the remaining spinners
have to keep waiting in the queue, which can collapse lock performance. To
emulate this case, we run two identical processes simultaneously, each
with 28 threads running in parallel, and each thread sets its CPU affinity
to an individual CPU according to its thread id. Thus, CPUs [0~27] are
each subscribed by two threads at the same time. Running this test with
the workloads above, the worst case (s=1000ns, t=6000ns) shows lock
performance reduced by 58.1% (2205245->924263).

Therefore, the queue spinner mutex should be used with care, by
applications that pursue fairness and performance without oversubscribing
CPU resources, e.g. an application running within a container on public
cloud infrastructure.

Possible to-do list:
a) Tune the threshold of the spin count

Finally, I would like to sincerely thank Andi Kleen for his guidance and
support during the development of this patch.

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/Makefile                           |  2 +-
 nptl/allocatestack.c                    |  2 +-
 nptl/descr.h                            | 26 ++++++-------
 nptl/mcs_lock.c                         | 68 +++++++++++++++++++++++++++++++++
 nptl/mcs_lock.h                         | 21 ++++++++++
 nptl/nptl-init.c                        |  2 +-
 nptl/pthreadP.h                         |  2 +-
 nptl/pthread_mutex_init.c               |  3 +-
 nptl/pthread_mutex_lock.c               | 35 ++++++++++++++++-
 nptl/pthread_mutex_timedlock.c          | 35 +++++++++++++++--
 nptl/pthread_mutex_trylock.c            |  5 ++-
 nptl/pthread_mutex_unlock.c             |  7 +++-
 nptl/pthread_mutexattr_settype.c        |  2 +-
 sysdeps/nptl/bits/thread-shared-types.h | 21 +++++++---
 sysdeps/nptl/pthread.h                  | 15 +++++---
 sysdeps/unix/sysv/linux/hppa/pthread.h  |  4 ++
 16 files changed, 212 insertions(+), 38 deletions(-)
 create mode 100644 nptl/mcs_lock.c
 create mode 100644 nptl/mcs_lock.h
  

Patch

diff --git a/nptl/Makefile b/nptl/Makefile
index bd1096f..7559907 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -140,7 +140,7 @@  libpthread-routines = nptl-init vars events version pt-interp \
 		      pthread_mutex_setprioceiling \
 		      pthread_setname pthread_getname \
 		      pthread_setattr_default_np pthread_getattr_default_np \
-		      pthread_mutex_conf
+		      pthread_mutex_conf mcs_lock
 #		      pthread_setuid pthread_seteuid pthread_setreuid \
 #		      pthread_setresuid \
 #		      pthread_setgid pthread_setegid pthread_setregid \
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index 9c10b99..cb0fad7 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -743,7 +743,7 @@  allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
      might have happened in the kernel.  */
   pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
 				  - offsetof (pthread_mutex_t,
-					      __data.__list.__next));
+					      __data.__list.__list_t.__next));
   pd->robust_head.list_op_pending = NULL;
 #if __PTHREAD_MUTEX_HAVE_PREV
   pd->robust_prev = &pd->robust_head;
diff --git a/nptl/descr.h b/nptl/descr.h
index 0a0abb4..ddad47c 100644
--- a/nptl/descr.h
+++ b/nptl/descr.h
@@ -184,38 +184,38 @@  struct pthread
      FIXME We should use relaxed MO atomic operations here and signal fences
      because this kind of concurrency is similar to synchronizing with a
      signal handler.  */
-# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __next))
+# define QUEUE_PTR_ADJUST (offsetof (__pthread_list_t, __list_t.__next))
 
 # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
   do {									      \
     __pthread_list_t *next = (__pthread_list_t *)			      \
       ((((uintptr_t) THREAD_GETMEM (THREAD_SELF, robust_head.list)) & ~1ul)   \
        - QUEUE_PTR_ADJUST);						      \
-    next->__prev = (void *) &mutex->__data.__list.__next;		      \
-    mutex->__data.__list.__next = THREAD_GETMEM (THREAD_SELF,		      \
+    next->__list_t.__prev = (void *) &mutex->__data.__list.__list_t.__next;		      \
+    mutex->__data.__list.__list_t.__next = THREAD_GETMEM (THREAD_SELF,		      \
 						 robust_head.list);	      \
-    mutex->__data.__list.__prev = (void *) &THREAD_SELF->robust_head;	      \
+    mutex->__data.__list.__list_t.__prev = (void *) &THREAD_SELF->robust_head;	      \
     /* Ensure that the new list entry is ready before we insert it.  */	      \
     __asm ("" ::: "memory");						      \
     THREAD_SETMEM (THREAD_SELF, robust_head.list,			      \
-		   (void *) (((uintptr_t) &mutex->__data.__list.__next)	      \
+		   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)	      \
 			     | val));					      \
   } while (0)
 # define DEQUEUE_MUTEX(mutex) \
   do {									      \
     __pthread_list_t *next = (__pthread_list_t *)			      \
-      ((char *) (((uintptr_t) mutex->__data.__list.__next) & ~1ul)	      \
+      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__next) & ~1ul)	      \
        - QUEUE_PTR_ADJUST);						      \
-    next->__prev = mutex->__data.__list.__prev;				      \
+    next->__list_t.__prev = mutex->__data.__list.__list_t.__prev;				      \
     __pthread_list_t *prev = (__pthread_list_t *)			      \
-      ((char *) (((uintptr_t) mutex->__data.__list.__prev) & ~1ul)	      \
+      ((char *) (((uintptr_t) mutex->__data.__list.__list_t.__prev) & ~1ul)	      \
        - QUEUE_PTR_ADJUST);						      \
-    prev->__next = mutex->__data.__list.__next;				      \
+    prev->__list_t.__next = mutex->__data.__list.__list_t.__next;				      \
     /* Ensure that we remove the entry from the list before we change the     \
        __next pointer of the entry, which is read by the kernel.  */	      \
     __asm ("" ::: "memory");						      \
-    mutex->__data.__list.__prev = NULL;					      \
-    mutex->__data.__list.__next = NULL;					      \
+    mutex->__data.__list.__list_t.__prev = NULL;					      \
+    mutex->__data.__list.__list_t.__next = NULL;					      \
   } while (0)
 #else
   union
@@ -226,7 +226,7 @@  struct pthread
 
 # define ENQUEUE_MUTEX_BOTH(mutex, val)					      \
   do {									      \
-    mutex->__data.__list.__next						      \
+    mutex->__data.__list.__list_t.__next						      \
       = THREAD_GETMEM (THREAD_SELF, robust_list.__next);		      \
     /* Ensure that the new list entry is ready before we insert it.  */	      \
     __asm ("" ::: "memory");						      \
@@ -253,7 +253,7 @@  struct pthread
 	/* Ensure that we remove the entry from the list before we change the \
 	   __next pointer of the entry, which is read by the kernel.  */      \
 	    __asm ("" ::: "memory");					      \
-	mutex->__data.__list.__next = NULL;				      \
+	mutex->__data.__list.__list_t.__next = NULL;				      \
       }									      \
   } while (0)
 #endif
diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c
new file mode 100644
index 0000000..21d20cf
--- /dev/null
+++ b/nptl/mcs_lock.c
@@ -0,0 +1,68 @@ 
+/* Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "pthreadP.h"
+#include <atomic.h>
+
+void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node)
+{
+  mcs_lock_t *prev;
+
+  /* Initalize node.  */
+  node->next = NULL;
+  node->locked = 0;
+
+  prev = atomic_exchange_acquire(lock, node);
+
+  /* No spinners waiting in the queue, lock is acquired immediately.  */
+  if (prev == NULL)
+    {
+      node->locked = 1;
+      return;
+    }
+
+  /* Add current spinner into the queue.  */
+  atomic_store_release (&prev->next, node);
+  atomic_full_barrier ();
+  /* Waiting until waken up by the previous spinner.  */
+  while (!atomic_load_relaxed (&node->locked))
+    atomic_spin_nop ();
+}
+
+void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node)
+{
+  mcs_lock_t *next = node->next;
+
+  if (next == NULL)
+    {
+      /* Check the tail of the queue:
+       * a) Release the lock and return if current node is the tail
+       * (lock == node).  */
+      if (atomic_compare_and_exchange_val_acq(lock, NULL, node) == node)
+        return;
+
+      /* b) Waiting until new node is added to the queue if current node is
+       * not the tail (lock != node).  */
+      while (! (next = atomic_load_relaxed (&node->next)))
+        atomic_spin_nop ();
+    }
+
+  /* Wake up the next spinner.  */
+  atomic_store_release (&next->locked, 1);
+  atomic_full_barrier ();
+}
diff --git a/nptl/mcs_lock.h b/nptl/mcs_lock.h
new file mode 100644
index 0000000..b779824
--- /dev/null
+++ b/nptl/mcs_lock.h
@@ -0,0 +1,21 @@ 
+/* Copyright (C) 2002-2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node);
+
+void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node);
diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
index 1d3790f..5e8643f 100644
--- a/nptl/nptl-init.c
+++ b/nptl/nptl-init.c
@@ -303,7 +303,7 @@  __pthread_initialize_minimal_internal (void)
 #ifdef __NR_set_robust_list
     pd->robust_head.futex_offset = (offsetof (pthread_mutex_t, __data.__lock)
 				    - offsetof (pthread_mutex_t,
-						__data.__list.__next));
+						__data.__list.__list_t.__next));
     INTERNAL_SYSCALL_DECL (err);
     int res = INTERNAL_SYSCALL (set_robust_list, err, 2, &pd->robust_head,
 				sizeof (struct robust_list_head));
diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index 075530c..0dad1dc 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -62,7 +62,7 @@ 
 /* Internal mutex type value.  */
 enum
 {
-  PTHREAD_MUTEX_KIND_MASK_NP = 3,
+  PTHREAD_MUTEX_KIND_MASK_NP = 7,
 
   PTHREAD_MUTEX_ELISION_NP    = 256,
   PTHREAD_MUTEX_NO_ELISION_NP = 512,
diff --git a/nptl/pthread_mutex_init.c b/nptl/pthread_mutex_init.c
index d8fe473..e682de3 100644
--- a/nptl/pthread_mutex_init.c
+++ b/nptl/pthread_mutex_init.c
@@ -110,6 +110,8 @@  __pthread_mutex_init (pthread_mutex_t *mutex,
 	  && __set_robust_list_avail < 0)
 	return ENOTSUP;
 #endif
+      if ((imutexattr->mutexkind & PTHREAD_MUTEX_QUEUESPINNER_NP) != 0)
+        return ENOTSUP;
 
       mutex->__data.__kind |= PTHREAD_MUTEX_ROBUST_NORMAL_NP;
     }
@@ -146,7 +148,6 @@  __pthread_mutex_init (pthread_mutex_t *mutex,
   if ((imutexattr->mutexkind & (PTHREAD_MUTEXATTR_FLAG_PSHARED
 				| PTHREAD_MUTEXATTR_FLAG_ROBUST)) != 0)
     mutex->__data.__kind |= PTHREAD_MUTEX_PSHARED_BIT;
-
   /* Default values: mutex not used yet.  */
   // mutex->__count = 0;	already done by memset
   // mutex->__owner = 0;	already done by memset
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 26bcebf..5fe6038 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <stap-probe.h>
+#include "mcs_lock.h"
 
 #ifndef lll_lock_elision
 #define lll_lock_elision(lock, try_lock, private)	({ \
@@ -116,6 +117,36 @@  __pthread_mutex_lock (pthread_mutex_t *mutex)
       mutex->__data.__count = 1;
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
+			 == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
+    {
+      if (! __is_smp)
+        goto simple;
+
+      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
+        {
+          int cnt = 0;
+          int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
+                            mutex->__data.__spins * 2 + 10);
+          int val = 0;
+          mcs_lock_t node;
+
+          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
+
+          do
+            {
+              atomic_spin_nop ();
+              val = atomic_load_relaxed (&mutex->__data.__lock);
+            }
+          while (val != 0 && ++cnt < max_cnt);
+
+          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
+          LLL_MUTEX_LOCK (mutex);
+
+          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+        }
+      assert (mutex->__data.__owner == 0);
+    }
+  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			  == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     {
       if (! __is_smp)
@@ -192,7 +223,7 @@  __pthread_mutex_lock_full (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We need to set op_pending before starting the operation.  Also
 	 see comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -372,7 +403,7 @@  __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	  {
 	    /* Note: robust PI futexes are signaled by setting bit 0.  */
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				     | 1));
 	    /* We need to set op_pending before starting the operation.  Also
 	       see comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 66efd39..07fa52a 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -25,6 +25,8 @@ 
 #include <atomic.h>
 #include <lowlevellock.h>
 #include <not-cancel.h>
+#include "mcs_lock.h"
+#include "pthread_mutex_conf.h"
 
 #include <stap-probe.h>
 
@@ -133,13 +135,40 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
       break;
-
+    case PTHREAD_MUTEX_QUEUESPINNER_NP:
+      if (! __is_smp)
+        goto simple;
+
+      if (lll_trylock (mutex) != 0)
+        {
+          int cnt = 0;
+          int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
+                            mutex->__data.__spins * 2 + 10);
+          int val = 0;
+          mcs_lock_t node;
+
+          mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
+
+          do
+            {
+              atomic_spin_nop ();
+              val = atomic_load_relaxed (&mutex->__data.__lock);
+            }
+          while (val != 0 && ++cnt < max_cnt);
+
+          mcs_unlock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
+          result = lll_timedlock(mutex->__data.__lock, abstime,
+                      PTHREAD_MUTEX_PSHARED (mutex));
+
+          mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
+        }
+      break;
     case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
     case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We need to set op_pending before starting the operation.  Also
 	 see comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -345,7 +374,7 @@  __pthread_mutex_timedlock (pthread_mutex_t *mutex,
 	  {
 	    /* Note: robust PI futexes are signaled by setting bit 0.  */
 	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			   (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				     | 1));
 	    /* We need to set op_pending before starting the operation.  Also
 	       see comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutex_trylock.c b/nptl/pthread_mutex_trylock.c
index 7de61f4..6979abd 100644
--- a/nptl/pthread_mutex_trylock.c
+++ b/nptl/pthread_mutex_trylock.c
@@ -76,6 +76,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
       FORCE_ELISION (mutex, goto elision);
       /*FALL THROUGH*/
     case PTHREAD_MUTEX_ADAPTIVE_NP:
+    case PTHREAD_MUTEX_QUEUESPINNER_NP:
     case PTHREAD_MUTEX_ERRORCHECK_NP:
       if (lll_trylock (mutex->__data.__lock) != 0)
 	break;
@@ -91,7 +92,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
     case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
     case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
 
       oldval = mutex->__data.__lock;
       do
@@ -205,7 +206,7 @@  __pthread_mutex_trylock (pthread_mutex_t *mutex)
 	if (robust)
 	  /* Note: robust PI futexes are signaled by setting bit 0.  */
 	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				   | 1));
 
 	oldval = mutex->__data.__lock;
diff --git a/nptl/pthread_mutex_unlock.c b/nptl/pthread_mutex_unlock.c
index 9ea6294..5a5c511 100644
--- a/nptl/pthread_mutex_unlock.c
+++ b/nptl/pthread_mutex_unlock.c
@@ -76,6 +76,9 @@  __pthread_mutex_unlock_usercnt (pthread_mutex_t *mutex, int decr)
       goto normal;
     }
   else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
+			      == PTHREAD_MUTEX_QUEUESPINNER_NP, 1))
+    goto normal;
+  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
 			      == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
     goto normal;
   else
@@ -140,7 +143,7 @@  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
     robust:
       /* Remove mutex from the list.  */
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-		     &mutex->__data.__list.__next);
+		     &mutex->__data.__list.__list_t.__next);
       /* We must set op_pending before we dequeue the mutex.  Also see
 	 comments at ENQUEUE_MUTEX.  */
       __asm ("" ::: "memory");
@@ -234,7 +237,7 @@  __pthread_mutex_unlock_full (pthread_mutex_t *mutex, int decr)
 	  /* Remove mutex from the list.
 	     Note: robust PI futexes are signaled by setting bit 0.  */
 	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
-			 (void *) (((uintptr_t) &mutex->__data.__list.__next)
+			 (void *) (((uintptr_t) &mutex->__data.__list.__list_t.__next)
 				   | 1));
 	  /* We must set op_pending before we dequeue the mutex.  Also see
 	     comments at ENQUEUE_MUTEX.  */
diff --git a/nptl/pthread_mutexattr_settype.c b/nptl/pthread_mutexattr_settype.c
index 7d36cc3..c2382b4 100644
--- a/nptl/pthread_mutexattr_settype.c
+++ b/nptl/pthread_mutexattr_settype.c
@@ -25,7 +25,7 @@  __pthread_mutexattr_settype (pthread_mutexattr_t *attr, int kind)
 {
   struct pthread_mutexattr *iattr;
 
-  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_ADAPTIVE_NP)
+  if (kind < PTHREAD_MUTEX_NORMAL || kind > PTHREAD_MUTEX_QUEUESPINNER_NP)
     return EINVAL;
 
   /* Cannot distinguish between DEFAULT and NORMAL. So any settype
diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
index 1e2092a..9d3c4de 100644
--- a/sysdeps/nptl/bits/thread-shared-types.h
+++ b/sysdeps/nptl/bits/thread-shared-types.h
@@ -79,15 +79,19 @@ 
 /* Common definition of pthread_mutex_t. */
 
 #if !__PTHREAD_MUTEX_USE_UNION
-typedef struct __pthread_internal_list
+typedef union __pthread_internal_list
 {
-  struct __pthread_internal_list *__prev;
-  struct __pthread_internal_list *__next;
+  struct {
+    union __pthread_internal_list *__prev;
+    union __pthread_internal_list *__next;
+  }__list_t;
+  void *mcs_lock;
 } __pthread_list_t;
 #else
-typedef struct __pthread_internal_slist
+typedef union __pthread_internal_slist
 {
-  struct __pthread_internal_slist *__next;
+  union __pthread_internal_slist *__next;
+  void *mcs_lock;
 } __pthread_slist_t;
 #endif
 
@@ -145,6 +149,13 @@  struct __pthread_mutex_s
   __PTHREAD_COMPAT_PADDING_END
 };
 
+struct mcs_lock
+{
+  struct mcs_lock *next;
+  int locked;
+};
+
+typedef struct mcs_lock mcs_lock_t;
 
 /* Common definition of pthread_cond_t. */
 
diff --git a/sysdeps/nptl/pthread.h b/sysdeps/nptl/pthread.h
index df049ab..4b4b80a 100644
--- a/sysdeps/nptl/pthread.h
+++ b/sysdeps/nptl/pthread.h
@@ -45,7 +45,8 @@  enum
   PTHREAD_MUTEX_TIMED_NP,
   PTHREAD_MUTEX_RECURSIVE_NP,
   PTHREAD_MUTEX_ERRORCHECK_NP,
-  PTHREAD_MUTEX_ADAPTIVE_NP
+  PTHREAD_MUTEX_ADAPTIVE_NP,
+  PTHREAD_MUTEX_QUEUESPINNER_NP
 #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
   ,
   PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
@@ -85,14 +86,16 @@  enum
 
 #if __PTHREAD_MUTEX_HAVE_PREV
 # define PTHREAD_MUTEX_INITIALIZER \
-  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, 0, __PTHREAD_SPINS, { { 0, 0 } } } }
 # ifdef __USE_GNU
 #  define PTHREAD_RECURSIVE_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_RECURSIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 #  define PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
-  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { 0, 0 } } }
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
+#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, __PTHREAD_SPINS, { { 0, 0 } } } }
 
 # endif
 #else
@@ -105,6 +108,8 @@  enum
   { { 0, 0, 0, PTHREAD_MUTEX_ERRORCHECK_NP, 0, { __PTHREAD_SPINS } } }
 #  define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
   { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, 0, { __PTHREAD_SPINS } } }
+#  define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, 0, { __PTHREAD_SPINS } } }
 
 # endif
 #endif
diff --git a/sysdeps/unix/sysv/linux/hppa/pthread.h b/sysdeps/unix/sysv/linux/hppa/pthread.h
index 11a024d..57c101c 100644
--- a/sysdeps/unix/sysv/linux/hppa/pthread.h
+++ b/sysdeps/unix/sysv/linux/hppa/pthread.h
@@ -46,6 +46,7 @@  enum
   PTHREAD_MUTEX_RECURSIVE_NP,
   PTHREAD_MUTEX_ERRORCHECK_NP,
   PTHREAD_MUTEX_ADAPTIVE_NP
+  PTHREAD_MUTEX_QUEUESPINNER_NP,
 #if defined __USE_UNIX98 || defined __USE_XOPEN2K8
   ,
   PTHREAD_MUTEX_NORMAL = PTHREAD_MUTEX_TIMED_NP,
@@ -95,6 +96,9 @@  enum
 # define PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP \
   { { 0, 0, 0, PTHREAD_MUTEX_ADAPTIVE_NP, { 0, 0, 0, 0 }, 0, \
       { __PTHREAD_SPINS }, { 0, 0 } } }
+# define PTHREAD_QUEUESPINNER_MUTEX_INITIALIZER_NP \
+  { { 0, 0, 0, PTHREAD_MUTEX_QUEUESPINNER_NP, { 0, 0, 0, 0 }, 0, \
+      { __PTHREAD_SPINS }, { 0, 0 } } }
 #endif