[v2,2/5] MCS lock: Enhance legacy MCS lock algorithm
Commit Message
As we have discussed before, lock holder preemption collapses lock
performance when CPUs are oversubscribed. To address this issue, we propose
an enhanced MCS lock algorithm that allows a spinner that has waited too
long in the queue to spin on the mutex lock without being woken up. Thus,
active spinners still have a chance to spin on the mutex lock even if their
predecessors are not running.
Compared to the legacy MCS lock, the principles of this enhanced MCS lock
differ as follows:
a) A spinner (represented by a node data structure) has to be marked
*black* before being allowed to spin on the mutex lock after a timeout;
b) A black node can only be changed back to a *white* node (a normal node,
handled according to the legacy MCS lock principles) when its predecessor
wakes it up;
c) A black node is handled specially for mcs lock acquisition and release.
For acquisition (the spinner calls mcs_lock), the spinner is granted
permission to spin on the mutex lock immediately, without being added to
the queue, and a token identifying this behavior is set accordingly
(node->tag = 1). For release, nothing needs to be done if the token has
been set (node->tag == 1); if it has not, the node is handled like a
normal node.
d) The nodes that represent spinners are allocated in Thread Local
Storage (TLS), so we do not need to manage their life cycles.
Implementation details (focusing on the differences):
a) White node to black node
The node data structure already provides:
node->locked: 0 The spinner is waiting in the queue
node->locked: 1 The spinner has been woken up by its predecessor
We call this type of spinner a *white* node.
To mark a spinner black, we introduce a third state:
node->locked: 2 The spinner jumped out of the queue due to timeout
We call this type of spinner a *black* node.
When a spinner jumps out of the queue due to timeout, we use
atomic_compare_and_exchange_val_acq (&node->locked, 2, 0) to make the
transition from a white node to a black node atomic.
b) Black node to white node
When a spinner jumps out of the queue due to timeout to spin on the mutex
lock, the next pointer of its predecessor still points to it. Thus, this
black node is eventually "woken" up by its predecessor via
next->locked = 1. At that point, the black node becomes a white node again
because its node->locked equals 1.
c) mcs lock acquisition by a black node
We introduce another field in the node data structure:
node->tag: 1 A black node has called mcs_lock to acquire permission to
spin on the mutex lock.
The token (node->tag) is set to 1 to identify this kind of mcs lock
acquisition.
d) mcs lock release by a black node
There are two different scenarios in which a black node releases the mcs
lock.
d1) Case 1:
A node jumps out of the queue (turning black) to spin on the mutex lock;
after spinning for a while, it calls mcs_unlock to release the mcs lock.
In this case (node->tag is still 0), the mcs lock release is handled the
same way as for a normal node.
d2) Case 2:
A black node calls mcs_lock to acquire permission to spin on the mutex
lock; after spinning for a while, it calls mcs_unlock to release the mcs
lock.
In this case (detected by checking node->tag == 1), nothing needs to be
done.
e) Spinning timeout
The spinning timeout in the queue can be implemented via either 1) a
threshold on the spin count, or 2) a timer.
In this patch, we use MAX_ADAPTIVE_COUNT, which can be tuned by system
administrators. Different architectures may need different maximum counts
because *pause* takes very different amounts of time on each.
f) Management of the node data structure
The node data structure that represents the state of a spinner is declared
with the *__thread* attribute (Thread Local Storage), so we do not need to
manage its life cycle.
Test result:
To emulate lock holder preemption, we run two identical processes
simultaneously. Each process has 28 threads running in parallel; each
thread sets its CPU affinity to an individual CPU and does the following
in a loop for 5 seconds:
1) lock
2) spend *s* nanoseconds in the critical section
3) unlock
4) spend *t* nanoseconds in the non-critical section
Lock performance is measured by the total number of iterations.
Thus, CPUs 0~27 are each oversubscribed by two threads at the same time.
The rightmost column is the baseline without CPU oversubscription (a
single process with the legacy MCS lock).
lock_delay  unlock_delay  Legacy MCS     Enhanced MCS   Base
   100ns       600ns      pid1: 7282215  pid1: 6959243  7174841
                          pid2: 1828699  pid2: 7115025
  1000ns      6000ns      pid1:  437412  pid1: 1796011  2266682
                          pid2: 2238150  pid2: 1774810
 10000ns     60000ns      pid1:  177121  pid1:  178121   203068
                          pid2:  178577  pid2:  178110
From the test result, compared to the legacy MCS lock, our enhanced MCS
lock:
a) performs in a more balanced way across processes when CPUs are
oversubscribed;
b) does not show an obvious performance collapse.
Alternative solutions (info provided by Tim Chen and Andi Kleen):
a) Set a quota (> 1) on the number of spinners. Then, even if one of the
spinners is preempted, other spinners are still available to grab the
lock, preventing the preemption performance collapse.
b) Give hints to the scheduler so that the lock holder of a user program
is not preempted.
This is what I propose here; I hope the community can help to improve this
idea (or conclude that it should be dropped), or suggest a better one.
Thanks for comments.
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
nptl/mcs_lock.c | 28 ++++++++++++++++++++++++----
nptl/pthread_mutex_lock.c | 7 ++++++-
sysdeps/nptl/bits/thread-shared-types.h | 1 +
3 files changed, 31 insertions(+), 5 deletions(-)
@@ -22,10 +22,19 @@
void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node)
{
mcs_lock_t *prev;
+ int cnt = 0;
- /* Initalize node. */
+ /* A black node is allowed to spin on the mutex immediately.  */
+ if (node->locked == 2)
+ {
+ node->tag = 1;
+ node->next = NULL;
+ return;
+ }
+ /* Initialize node. */
node->next = NULL;
node->locked = 0;
+ node->tag = 0;
prev = atomic_exchange_acquire(lock, node);
@@ -39,13 +48,25 @@ void mcs_lock (mcs_lock_t **lock, mcs_lock_t *node)
/* Add current spinner into the queue. */
atomic_store_release (&prev->next, node);
atomic_full_barrier ();
- /* Waiting until waken up by the previous spinner. */
+ /* Wait until woken up by the previous spinner, or until timeout.  */
while (!atomic_load_relaxed (&node->locked))
- atomic_spin_nop ();
+ {
+ atomic_spin_nop ();
+ /* On timeout, mark this node black before taking the permission.  */
+ if (++cnt >= MAX_ADAPTIVE_COUNT)
+ {
+ /* Set node->locked = 2 to mark the node black.  */
+ atomic_compare_and_exchange_val_acq (&node->locked, 2, 0);
+ return;
+ }
+ }
}
void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node)
{
+ if (node->tag == 1)
+ return;
+
mcs_lock_t *next = node->next;
if (next == NULL)
@@ -61,7 +82,6 @@ void mcs_unlock (mcs_lock_t **lock, mcs_lock_t *node)
while (! (next = atomic_load_relaxed (&node->next)))
atomic_spin_nop ();
}
-
/* Wake up the next spinner. */
atomic_store_release (&next->locked, 1);
atomic_full_barrier ();
@@ -57,6 +57,12 @@
#define FORCE_ELISION(m, s)
#endif
+static __thread mcs_lock_t node = {
+ NULL,
+ 0,
+ 0
+};
+
static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
__attribute_noinline__;
@@ -128,7 +134,6 @@ __pthread_mutex_lock (pthread_mutex_t *mutex)
int max_cnt = MIN (MAX_ADAPTIVE_COUNT,
mutex->__data.__spins * 2 + 10);
int val = 0;
- mcs_lock_t node;
mcs_lock ((mcs_lock_t **)&mutex->__data.__list.mcs_lock, &node);
@@ -153,6 +153,7 @@ struct mcs_lock
{
struct mcs_lock *next;
int locked;
+ int tag;
};
typedef struct mcs_lock mcs_lock_t;