[RFC] ali_workqueue: Adaptive lock integration on multi-socket/core platform

From: "ling.ma" <ling.ml@antfin.com>

  Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes severe cache line ping-pong,
so the process spends a lot of time and power to complete it,
especially on multi-socket/core platforms.

  However, if the serialized work is sent to one core and executed
there ONLY when contention happens, much time and power can be saved,
because all shared data stays in the private cache of that core.
We call this mechanism Adaptive Lock Integration
(ali workqueue).
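
  To make the idea concrete, here is a minimal user-space sketch of
delegation under contention, written with C11 atomics. It is not the
code from this patch: struct wq, struct work, wq_run and wq_demo are
hypothetical names, and the spin loops simply call sched_yield().
A caller that finds the queue empty runs its own work directly; a
caller that finds it busy queues a request, and the current owner
executes it on the caller's behalf, so the shared data stays on one core.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the structures in this patch. */
struct work {
    void (*fn)(void *);            /* serialized work to run */
    void *arg;
    _Atomic(struct work *) next;   /* next queued request */
    atomic_int done;               /* set by the owner once fn has run */
};

struct wq {
    _Atomic(struct work *) tail;   /* NULL means nobody owns the queue */
};

/* Run w->fn(w->arg) serialized against all other wq_run() callers. */
static void wq_run(struct wq *q, struct work *w)
{
    atomic_store(&w->next, NULL);
    atomic_store(&w->done, 0);

    struct work *prev = atomic_exchange(&q->tail, w);
    if (prev != NULL) {
        /* Contended: hand the request to the owner and wait for it. */
        atomic_store(&prev->next, w);
        while (!atomic_load(&w->done))
            sched_yield();
        return;
    }

    /* No contention, or first to arrive: we are the owner. Run our own
       work, then every request that was queued behind us. */
    struct work *cur = w;
    for (;;) {
        cur->fn(cur->arg);
        struct work *next = atomic_load(&cur->next);
        if (next == NULL) {
            struct work *expected = cur;
            if (atomic_compare_exchange_strong(&q->tail, &expected, NULL)) {
                if (cur != w)
                    atomic_store(&cur->done, 1);
                return;            /* queue drained, ownership dropped */
            }
            /* A new request swapped the tail but has not linked yet. */
            while ((next = atomic_load(&cur->next)) == NULL)
                sched_yield();
        }
        if (cur != w)
            atomic_store(&cur->done, 1);   /* last access to cur */
        cur = next;
    }
}

static void add_one(void *p) { ++*(long *)p; }

struct demo_arg { struct wq *q; long *counter; int iters; };

static void *demo_thread(void *p)
{
    struct demo_arg *a = p;
    for (int i = 0; i < a->iters; i++) {
        struct work w = { .fn = add_one, .arg = a->counter };
        wq_run(a->q, &w);
    }
    return NULL;
}

/* Increment a plain (non-atomic) counter from up to 16 threads. */
long wq_demo(int nthreads, int iters)
{
    struct wq q;
    long counter = 0;
    struct demo_arg a = { &q, &counter, iters };
    pthread_t t[16];

    atomic_init(&q.tail, NULL);
    if (nthreads > 16)
        nthreads = 16;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, demo_thread, &a);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

  In this sketch the owner keeps draining the queue until it is empty,
so a busy queue can keep one thread working for a long time; a real
implementation would probably want to bound that.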

  Multiple CPU sockets currently give us better performance per watt,
but they also bring more complex synchronization requirements.
For example, in a critical-section scenario the lock cache line
ping-pongs among CPU sockets, and competition for the lock
among more cores adds further overhead. In this version
we introduce a distributed synchronization mechanism that
reduces these issues considerably. Assuming there are 2 CPU sockets:

1.	if (the thread is from socket_0)
		Lock_from_socket_0
2.	if (the thread is from socket_1)
		Lock_from_socket_1

3.	Lock_Global

4.	Enter critical section

5.	if (the thread is from socket_0)
		UnLock_from_socket_0

6.	if (the thread is from socket_1)
		UnLock_from_socket_1

7.	The threads from the same socket (socket_0 or socket_1) complete the
	critical section one by one, until no threads are waiting in that
	socket. During the process we also accelerate data and lock movement
	within the same socket.

8.	UnLock_Global: we allow threads from other sockets to
	enter the critical section
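
The steps above can be sketched as a two-level lock in plain C11. This
is only an illustration under stated assumptions, not the code in this
patch: the names (two_level_lock, tl_lock, tl_unlock, tl_demo) are
hypothetical, the caller passes its socket index explicitly instead of
discovering it, and MAX_PASSES is a fairness cap that the description
above does not mention. The global lock stays with one socket while
threads of that socket are waiting, and is released only when the last
of them leaves.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NSOCKETS   2    /* as in the 2-socket example above */
#define MAX_PASSES 64   /* fairness cap: an assumption, not in the text */

struct socket_lock {
    atomic_uint next;      /* ticket lock: next ticket to hand out */
    atomic_uint serving;   /* ticket currently allowed in */
    int owns_global;       /* protected by the ticket lock */
    int passes;            /* protected by the ticket lock */
};

struct two_level_lock {
    atomic_flag global;
    struct socket_lock sock[NSOCKETS];
};

static void tl_init(struct two_level_lock *l)
{
    atomic_flag_clear(&l->global);
    for (int s = 0; s < NSOCKETS; s++) {
        atomic_init(&l->sock[s].next, 0);
        atomic_init(&l->sock[s].serving, 0);
        l->sock[s].owns_global = 0;
        l->sock[s].passes = 0;
    }
}

/* Steps 1-3: take the lock of our own socket, then the global lock
   unless a thread of the same socket left it held for us. */
static void tl_lock(struct two_level_lock *l, int s)
{
    struct socket_lock *sl = &l->sock[s];
    unsigned ticket = atomic_fetch_add(&sl->next, 1);

    while (atomic_load(&sl->serving) != ticket)
        sched_yield();
    if (!sl->owns_global) {
        while (atomic_flag_test_and_set(&l->global))
            sched_yield();
        sl->owns_global = 1;
    }
}

/* Steps 5-8: if a thread of our socket is waiting, pass it the socket
   lock and leave the global lock held (step 7); otherwise release the
   global lock first (step 8), then the socket lock (step 5 or 6). */
static void tl_unlock(struct two_level_lock *l, int s)
{
    struct socket_lock *sl = &l->sock[s];
    unsigned serving = atomic_load(&sl->serving);
    int waiters = atomic_load(&sl->next) != serving + 1;

    if (!waiters || ++sl->passes >= MAX_PASSES) {
        sl->passes = 0;
        sl->owns_global = 0;
        atomic_flag_clear(&l->global);
    }
    atomic_store(&sl->serving, serving + 1);
}

struct tl_arg { struct two_level_lock *l; long *counter; int socket, iters; };

static void *tl_thread(void *p)
{
    struct tl_arg *a = p;
    for (int i = 0; i < a->iters; i++) {
        tl_lock(a->l, a->socket);
        ++*a->counter;              /* step 4: critical section */
        tl_unlock(a->l, a->socket);
    }
    return NULL;
}

/* Two threads per pretend socket increment one plain counter. */
long tl_demo(int iters)
{
    struct two_level_lock l;
    long counter = 0;
    pthread_t t[2 * NSOCKETS];
    struct tl_arg a[2 * NSOCKETS];

    tl_init(&l);
    for (int i = 0; i < 2 * NSOCKETS; i++) {
        a[i] = (struct tl_arg){ &l, &counter, i % NSOCKETS, iters };
        pthread_create(&t[i], NULL, tl_thread, &a[i]);
    }
    for (int i = 0; i < 2 * NSOCKETS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

The ticket lock is used per socket because it lets the unlocking thread
see whether another thread of the same socket is already waiting; any
lock with that property would do for the sketch.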

Step 1 or 2 helps us mitigate Global Lock pressure, and only one thread
gets the Global Lock in steps 3 & 4.

Step 5 or 6 helps us reduce Global Lock and shared data movement,
because the lock and shared data are kept in the same socket.
Ali workqueue is very good at step 7, and it also balances the
workload that fell on the Lock Owner in the original version. In the
end we get significant results as below (we will send the benchmark in
this thread soon):

1. Hashwork (higher is better; the benchmark is from kemi.wang@intel.com):
Original Spinlock
Run hashwork in 5 seconds, print statistics below:
1 threads, 10221937 total hashes, 10221937 hashes per thread
2 threads, 18204627 total hashes, 9102313 hashes per thread
4 threads, 21847140 total hashes, 5461785 hashes per thread
8 threads, 13231893 total hashes, 1653986 hashes per thread
16 threads, 9706989 total hashes, 606686 hashes per thread
32 threads, 6096940 total hashes, 190529 hashes per thread
64 threads, 5237120 total hashes, 81830 hashes per thread
80 threads, 5225351 total hashes, 65316 hashes per thread
96 threads, 5345197 total hashes, 55679 hashes per thread

Ali Workqueue
Run hashwork in 5 seconds, print statistics below:
1 threads, 9597719 total hashes, 9597719 hashes per thread
2 threads, 16191658 total hashes, 8095829 hashes per thread
4 threads, 16284311 total hashes, 4071077 hashes per thread
8 threads, 25705715 total hashes, 3213214 hashes per thread
16 threads, 32104276 total hashes, 2006517 hashes per thread
32 threads, 33678957 total hashes, 1052467 hashes per thread
64 threads, 31804354 total hashes, 496943 hashes per thread
80 threads, 34445498 total hashes, 430568 hashes per thread
96 threads, 30523970 total hashes, 317958 hashes per thread

2. Global data benchmark (lower is better;
   the benchmark is from ling.ml@antfin.com, for our real workload):

Original Spinlock

1 threads 50000 num
total time (   1 threads): 32789120
2 threads 50000 num
total time (   2 threads): 208625958
4 threads 50000 num
total time (   4 threads): 1063907644
8 threads 50000 num
total time (   8 threads): 4734218966
16 threads 50000 num
total time (  16 threads): 25088565320
32 threads 50000 num
total time (  32 threads): 149992521624
64 threads 50000 num
total time (  64 threads): 1054508130586
80 threads 50000 num
total time (  80 threads): 1488507826842
96 threads 50000 num
total time (  96 threads): 1787252256456

Ali Workqueue
1 threads 50000 num
total time (   1 threads): 36340476
2 threads 50000 num
total time (   2 threads): 169380062
4 threads 50000 num
total time (   4 threads): 565430140
8 threads 50000 num
total time (   8 threads): 1329263188
16 threads 50000 num
total time (  16 threads): 3385617884
32 threads 50000 num
total time (  32 threads): 10736058730
64 threads 50000 num
total time (  64 threads): 31651343042
80 threads 50000 num
total time (  80 threads): 47133700104
96 threads 50000 num
total time (  96 threads): 62611966622

Any comments are appreciated.

Thanks
Ling
---
 ChangeLog                    |   8 ++++
 include/ali_workqueue.h      |  26 +++++++++++
 nptl/Versions                |   2 +
 nptl/ali_workqueue.c         | 102 +++++++++++++++++++++++++++++++++++++++++++
 sysdeps/x86_64/nptl/Makefile |   1 +
 5 files changed, 139 insertions(+)
 create mode 100644 include/ali_workqueue.h
 create mode 100644 nptl/ali_workqueue.c
