[v2,2/5] MCS lock: Enhance legacy MCS lock algorithm

Message ID 1531464752-18830-3-git-send-email-kemi.wang@intel.com
State New, archived

Commit Message

Kemi Wang July 13, 2018, 6:52 a.m. UTC
  As we have discussed before, lock holder preemption collapses lock
performance when CPUs are oversubscribed. To address this issue, we propose
an evolved MCS lock algorithm which allows a spinner that has waited too
long in the queue to spin on the mutex lock without being woken up. Thus,
active spinners still have a chance to spin on the mutex lock even if their
preceding spinners are not running.

Compared to the legacy MCS lock, the principles of this enhanced MCS lock
differ as follows (a sketch of the node layout and its states follows this
list):
a) A spinner, represented by a node data structure, has to be marked
*black* before being allowed to spin on the mutex lock due to timeout;
b) A black node can only be changed back to a *white* node (a normal node,
which is handled according to the principles of the legacy MCS lock) when
its preceding spinner wakes it up;
c) A black node is handled specially for mcs lock acquisition and mcs lock
release. For mcs lock acquisition (a spinner calls mcs_lock), the spinner
acquires the permission to spin on the mutex lock immediately without being
added to the queue, and a token which identifies this behavior is set
accordingly (node->tag = 1). For mcs lock release, nothing needs to be done
if the token has been set (node->tag == 1); if the token has not been set,
the node is dealt with like a normal node.
d) The nodes which represent spinners are allocated in Thread Local
Storage (TLS), so we do not need to manage their life cycles.
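
For reference, a minimal sketch of the node layout and its states; the
struct mirrors the thread-shared-types.h change in the patch below, and the
comments merely summarize the states described above:

  struct mcs_lock
  {
    struct mcs_lock *next;  /* Successor in the MCS queue.  */
    int locked;             /* 0: waiting (white), 1: woken (white),
                               2: timed out (black).  */
    int tag;                /* 1: this black node bypassed the queue
                               in mcs_lock.  */
  };

  typedef struct mcs_lock mcs_lock_t;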

Implementation details (focusing on the differences):
a) White node to black node
The node data structure already provides two states:
node->locked: 0   The spinner is waiting in the queue
node->locked: 1   The spinner has been woken up by its preceding spinner
We call this type of spinner a *white* node.

To mark a spinner black, we introduce a third state:
node->locked: 2   The spinner jumped out of the queue due to timeout
We call this type of spinner a *black* node.

When a spinner jumps out of the queue due to timeout, we use
atomic_compare_and_exchange_val_acq (&node->locked, 2, 0) to ensure the
atomicity of the transition from a white node to a black node.
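
A minimal sketch of that transition and the race it guards against; the
patch below simply performs the CAS and returns, but since glibc's
atomic_compare_and_exchange_val_acq returns the old value, the two outcomes
can be distinguished as follows:

  /* Attempt white (0) -> black (2).  The CAS can only fail if the
     preceding spinner has already set locked = 1.  */
  if (atomic_compare_and_exchange_val_acq (&node->locked, 2, 0) == 0)
    {
      /* Now black: leave the queue and spin on the mutex lock directly.  */
    }
  else
    {
      /* Woken concurrently: the node stays white and holds the mcs lock
         as in the legacy algorithm.  */
    }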

b) Black node to white node
When a spinner jumps out of the queue due to timeout to spin on the mutex
lock, the next pointer of its preceding spinner still points to it. Thus,
this black node will eventually be "woken" up by its preceding spinner via
next->locked = 1. At that point, the black node becomes a white node again
because its node->locked equals 1.
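
This happens on the unlock side, which is unchanged from the legacy
algorithm; a minimal sketch with a comment spelling out the effect on a
black node:

  /* In mcs_unlock (legacy path, unchanged): waking the successor also
     turns a black node white, since its locked field becomes 1 rather
     than 2, so its next call to mcs_lock takes the normal enqueue path.  */
  atomic_store_release (&next->locked, 1);
  atomic_full_barrier ();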

c) mcs lock acquisition of a black node
We introduce another field in the node data structure:
node->tag: 1   A black node called mcs_lock and acquired the permission to
spin on the mutex lock without queueing.
The token (node->tag) is set to 1 to identify this kind of mcs lock
acquisition.
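
A minimal sketch of this fast path, mirroring the change to mcs_lock in the
patch below:

  /* A black node skips the queue and records that it did so.  */
  if (node->locked == 2)
    {
      node->tag = 1;      /* Token: this acquisition bypassed the queue.  */
      node->next = NULL;
      return;             /* Permission to spin on the mutex lock.  */
    }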

d) mcs lock release of a black node
There are two different scenarios in which a black node releases the mcs
lock.
  d1) Case 1:
  A white node jumps out of the queue due to timeout (becoming black) and
  spins on the mutex lock; after spinning for a while, it calls mcs_unlock
  to release the mcs lock.
  In this case, the mcs lock release of the black node is handled the same
  way as for a normal node.

  d2) Case 2:
  A black node calls mcs_lock to acquire the permission to spin on the
  mutex lock; after spinning for a while, it calls mcs_unlock to release
  the mcs lock.
  In this case (recognized by checking node->tag == 1), nothing needs to
  be done.
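
A minimal sketch of how mcs_unlock distinguishes the two cases, mirroring
the patch below:

  /* Case 2: the acquisition bypassed the queue; nothing to do.  */
  if (node->tag == 1)
    return;
  /* Case 1: tag == 0, fall through to the normal (legacy) release path,
     i.e. hand the lock to node->next or reset the tail pointer.  */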

e) Spinning timeout
Spinning timeout in the queue can be implemented either by 1) setting a
threshold on the spin count, or 2) using a timer.
In this patch, we use the existing MAX_ADAPTIVE_COUNT threshold, which can
be tuned by system administrators. Different architectures may need
different maximum counts, because *pause* lasts for very different amounts
of time on different hardware.
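
For comparison, a hypothetical sketch of option 2 (a timer); this is not
what the patch does, and spin_timed_out / timeout_ns are invented names for
this sketch only:

  #include <time.h>

  /* Hypothetical: return nonzero once timeout_ns nanoseconds have elapsed
     since *start (CLOCK_MONOTONIC).  The patch uses a spin count instead,
     which avoids a clock read on every iteration.  */
  static int
  spin_timed_out (const struct timespec *start, long timeout_ns)
  {
    struct timespec now;
    clock_gettime (CLOCK_MONOTONIC, &now);
    return ((now.tv_sec - start->tv_sec) * 1000000000L
            + (now.tv_nsec - start->tv_nsec)) >= timeout_ns;
  }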

f) The management of the node data structure
The node data structure, which represents the state of a spinner, is
allocated with the *__thread* attribute (Thread Local Storage); thus, we do
not need to manage its life cycle.
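
A minimal sketch, mirroring the pthread_mutex_lock.c change in the patch
below; the node lives exactly as long as its thread:

  /* One node per thread; no explicit allocation or free is needed.  */
  static __thread mcs_lock_t node = { NULL, 0, 0 };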

Test result:
To emulate lock holder preemption, we run two identical processes
simultaneously. Each process has 28 threads running in parallel; each
thread sets its CPU affinity to an individual CPU and does the following:
1) lock
2) spend *s* nanoseconds in the critical section
3) unlock
4) spend *t* nanoseconds in the non-critical section
in a loop for 5 seconds; lock performance is measured by the total number
of iterations.
Thus, CPUs 0-27 are each oversubscribed by two threads at the same time.
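
An illustrative sketch of each benchmark thread's loop (the test program
itself is not part of this patch; busy_wait_ns, struct bench_arg, and
worker are invented for this sketch, and the CPU-affinity setup is
omitted):

  #include <pthread.h>
  #include <time.h>

  struct bench_arg { pthread_mutex_t *mutex; long s, t; volatile int *stop; };

  /* Burn roughly NS nanoseconds of CPU time.  */
  static void
  busy_wait_ns (long ns)
  {
    struct timespec start, now;
    clock_gettime (CLOCK_MONOTONIC, &start);
    do
      clock_gettime (CLOCK_MONOTONIC, &now);
    while ((now.tv_sec - start.tv_sec) * 1000000000L
           + (now.tv_nsec - start.tv_nsec) < ns);
  }

  static void *
  worker (void *p)
  {
    struct bench_arg *a = p;
    unsigned long iters = 0;
    while (!*a->stop)
      {
        pthread_mutex_lock (a->mutex);    /* 1) lock */
        busy_wait_ns (a->s);              /* 2) critical section */
        pthread_mutex_unlock (a->mutex);  /* 3) unlock */
        busy_wait_ns (a->t);              /* 4) non-critical section */
        iters++;                          /* total iterations = lock performance */
      }
    return (void *) iters;
  }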

The rightmost column is the baseline without CPU oversubscription (a single
process running with the legacy MCS lock).

lock_delay   unlock_delay   Legacy MCS       Enhanced MCS     Base
100ns         600ns         pid1: 7282215    pid1: 6959243    7174841
                            pid2: 1828699    pid2: 7115025

1000ns        6000ns        pid1: 437412     pid1: 1796011    2266682
                            pid2: 2238150    pid2: 1774810

10000ns       60000ns       pid1: 177121     pid1: 178121     203068
                            pid2: 178577     pid2: 178110

From the test results, compared to the legacy MCS lock, our enhanced MCS
lock
a) behaves more fairly across processes when CPUs are oversubscribed;
b) does not show an obvious performance collapse.

Alternative solutions (information provided by Tim Chen and Andi Kleen):
a) One other possibility is to set a quota (> 1) on the number of spinners,
so that even if one of the spinners is preempted, other spinners are still
available to grab the lock, preventing the preemption performance collapse.
b) Use hints to the scheduler so that the lock holder of a user program
would not be preempted.

This is what I propose here. I hope people in the community can help
improve this idea (or perhaps drop it :() or may have better ideas.
Thanks for your comments.

Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 nptl/mcs_lock.c                         | 28 ++++++++++++++++++++++++----
 nptl/pthread_mutex_lock.c               |  7 ++++++-
 sysdeps/nptl/bits/thread-shared-types.h |  1 +
 3 files changed, 31 insertions(+), 5 deletions(-)
  

Patch

diff --git a/nptl/mcs_lock.c b/nptl/mcs_lock.c
index 21d20cf..e0d6a05 100644
--- a/nptl/mcs_lock.c
+++ b/nptl/mcs_lock.c
@@ -22,10 +22,19 @@ 
 void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node)
 {
   mcs_lock_t *prev;
+  int cnt = 0;
 
-  /* Initalize node.  */
+  /* A black node is allowed to spin on the mutex lock immediately.  */
+  if (node->locked == 2)
+    {
+      node->tag = 1;
+      node->next = NULL;
+      return;
+    }
+  /* Initialize node.  */
   node->next = NULL;
   node->locked = 0;
+  node->tag = 0;
 
   prev = atomic_exchange_acquire(lock, node);
 
@@ -39,13 +48,25 @@  void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node)
   /* Add current spinner into the queue.  */
   atomic_store_release (&prev->next, node);
   atomic_full_barrier ();
-  /* Waiting until waken up by the previous spinner.  */
+  /* Wait until woken up by the previous spinner, or until timeout.  */
   while (!atomic_load_relaxed (&node->locked))
-    atomic_spin_nop ();
+    {
+      atomic_spin_nop ();
+      /* On timeout, mark this node black before leaving the queue.  */
+      if (++cnt >= MAX_ADAPTIVE_COUNT)
+        {
+          /* Set node->locked = 2 to mark the node black.  */
+          atomic_compare_and_exchange_val_acq (&node->locked, 2, 0);
+          return;
+        }
+    }
 }
 
 void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node)
 {
+  if (node->tag == 1)
+    return;
+
   mcs_lock_t *next = node->next;
 
   if (next == NULL)
@@ -61,7 +82,6 @@  void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node)
       while (! (next = atomic_load_relaxed (&node->next)))
         atomic_spin_nop ();
     }
-
   /* Wake up the next spinner.  */
   atomic_store_release (&next->locked, 1);
   atomic_full_barrier ();
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 5fe6038..fe9e6ed 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -57,6 +57,12 @@ 
 #define FORCE_ELISION(m, s)
 #endif
 
+static __thread mcs_lock_t node = {
+  NULL,
+  0,
+  0
+};
+
 static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
      __attribute_noinline__;
 
@@ -128,7 +134,6 @@  __pthread_mutex_lock (pthread_mutex_t *mutex)
           int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
                             mutex->__data.__spins * 2 + 10);
           int val = 0;
-          mcs_lock_t node;
 
           mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
 
diff --git a/sysdeps/nptl/bits/thread-shared-types.h b/sysdeps/nptl/bits/thread-shared-types.h
index 9d3c4de..4ec17b6 100644
--- a/sysdeps/nptl/bits/thread-shared-types.h
+++ b/sysdeps/nptl/bits/thread-shared-types.h
@@ -153,6 +153,7 @@  struct mcs_lock
 {
   struct mcs_lock *next;
   int locked;
+  int tag;
 };
 
 typedef struct mcs_lock mcs_lock_t;