[RFC] ali_workqueue: Adaptive lock integration on multi-socket/core platform
Commit Message
From: "ling.ma" <ling.ml@antfin.com>
Wire latency (RC delay) dominates modern computer performance, and
conventional serialized work causes serious cache line ping-pong,
so a process spends a lot of time and power to complete,
especially on multi-socket/core platforms.
However, if the serialized work is sent to one core and executed there
ONLY when contention happens, much time and power can be saved,
because all shared data stay in the private cache of that one core.
We call this mechanism Adaptive Lock Integration
(ali workqueue).
Currently, multiple CPU sockets give us better performance per watt,
but they also bring more complex synchronization requirements.
For example, in a critical-section scenario the lock cache line
will ping-pong among CPU sockets, and lock competition
among more cores also brings more overhead. In this version
we introduce a distributed synchronization mechanism, which
greatly reduces these issues. Assume there are 2 CPU sockets:
1. If (the thread is from socket_0)
       Lock_from_socket_0
2. If (the thread is from socket_1)
       Lock_from_socket_1
3. Lock_Global
4. Enter critical section
5. If (the thread is from socket_0)
       UnLock_from_socket_0
6. If (the thread is from socket_1)
       UnLock_from_socket_1
7. Threads from the same socket (socket_0 or socket_1) complete the critical
section one by one, until no threads are waiting in that socket. During this
process we also keep data and lock movement within the same socket.
8. UnLock_Global: allow threads from other sockets to
enter the critical section
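The locking steps above can be sketched roughly in C11 atomics. This is an
illustrative, hypothetical sketch (names like `two_level_lock` and `tl_lock`
are invented, not the patch's API), and it omits the same-socket hand-off of
step 7 by releasing the global lock on every unlock:

```c
#include <stdatomic.h>

/* Hypothetical two-level lock: one flag per socket plus a global flag. */
typedef struct {
    atomic_flag socket_lock[2];   /* steps 1/2: per-socket gate */
    atomic_flag global_lock;      /* steps 3/8: cross-socket gate */
} two_level_lock;

void tl_lock(two_level_lock *l, int socket)
{
    /* Steps 1-3: contend with same-socket threads first, so at most
       one thread per socket ever touches the global lock line. */
    while (atomic_flag_test_and_set_explicit(&l->socket_lock[socket],
                                             memory_order_acquire))
        ;
    while (atomic_flag_test_and_set_explicit(&l->global_lock,
                                             memory_order_acquire))
        ;
}

void tl_unlock(two_level_lock *l, int socket)
{
    /* Steps 5/6 and 8: release the local gate, then the global one.
       (The real scheme keeps the global lock held while same-socket
       waiters drain, which this sketch does not model.) */
    atomic_flag_clear_explicit(&l->socket_lock[socket],
                               memory_order_release);
    atomic_flag_clear_explicit(&l->global_lock, memory_order_release);
}
```

Because only the per-socket winner reaches the global flag, cross-socket
traffic on the global lock cache line is bounded by the number of sockets
rather than the number of threads.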
Steps 1 and 2 help us mitigate Global Lock pressure, since only one thread
per socket competes for the Global Lock in steps 3 & 4.
Steps 5 and 6 help us reduce Global Lock and shared-data movement,
because the lock and shared data stay within the same socket.
The ali workqueue is very good at step 7, and it also balances the
lock owner's workload relative to the original version. In the end we get
significant results, shown below (we will send the benchmarks in this thread soon):
1. Hashwork (higher is better; the benchmark is from kemi.wang@intel.com):
Original Spinlock
Run hashwork in 5 seconds, print statistics below:
1 threads, 10221937 total hashes, 10221937 hashes per thread
2 threads, 18204627 total hashes, 9102313 hashes per thread
4 threads, 21847140 total hashes, 5461785 hashes per thread
8 threads, 13231893 total hashes, 1653986 hashes per thread
16 threads, 9706989 total hashes, 606686 hashes per thread
32 threads, 6096940 total hashes, 190529 hashes per thread
64 threads, 5237120 total hashes, 81830 hashes per thread
80 threads, 5225351 total hashes, 65316 hashes per thread
96 threads, 5345197 total hashes, 55679 hashes per thread
Ali Workqueue
Run hashwork in 5 seconds, print statistics below:
1 threads, 9597719 total hashes, 9597719 hashes per thread
2 threads, 16191658 total hashes, 8095829 hashes per thread
4 threads, 16284311 total hashes, 4071077 hashes per thread
8 threads, 25705715 total hashes, 3213214 hashes per thread
16 threads, 32104276 total hashes, 2006517 hashes per thread
32 threads, 33678957 total hashes, 1052467 hashes per thread
64 threads, 31804354 total hashes, 496943 hashes per thread
80 threads, 34445498 total hashes, 430568 hashes per thread
96 threads, 30523970 total hashes, 317958 hashes per thread
2. Global data benchmark (lower is better;
the benchmark is from ling.ml@antfin.com, based on our real workload):
Original Spinlock
1 threads 50000 num
total time ( 1 threads): 32789120
2 threads 50000 num
total time ( 2 threads): 208625958
4 threads 50000 num
total time ( 4 threads): 1063907644
8 threads 50000 num
total time ( 8 threads): 4734218966
16 threads 50000 num
total time ( 16 threads): 25088565320
32 threads 50000 num
total time ( 32 threads): 149992521624
64 threads 50000 num
total time ( 64 threads): 1054508130586
80 threads 50000 num
total time ( 80 threads): 1488507826842
96 threads 50000 num
total time ( 96 threads): 1787252256456
Ali Workqueue
1 threads 50000 num
total time ( 1 threads): 36340476
2 threads 50000 num
total time ( 2 threads): 169380062
4 threads 50000 num
total time ( 4 threads): 565430140
8 threads 50000 num
total time ( 8 threads): 1329263188
16 threads 50000 num
total time ( 16 threads): 3385617884
32 threads 50000 num
total time ( 32 threads): 10736058730
64 threads 50000 num
total time ( 64 threads): 31651343042
80 threads 50000 num
total time ( 80 threads): 47133700104
96 threads 50000 num
total time ( 96 threads): 62611966622
Any comments are appreciated.
Thanks
Ling
---
ChangeLog | 8 ++++
include/ali_workqueue.h | 26 +++++++++++
nptl/Versions | 2 +
nptl/ali_workqueue.c | 102 +++++++++++++++++++++++++++++++++++++++++++
sysdeps/x86_64/nptl/Makefile | 1 +
5 files changed, 139 insertions(+)
create mode 100644 include/ali_workqueue.h
create mode 100644 nptl/ali_workqueue.c
Comments
The test cases are attached.
Thanks
Ling
On 2018/11/28 at 4:23 PM, "Ma Ling" <ling.ma.program@gmail.com> wrote:
diff --git a/ChangeLog b/ChangeLog
index d7ee676..fdfc00a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2018-11-08 Ma Ling <ling.ml@antfin.com>
+
+ * sysdeps/x86_64/nptl/Makefile: Add the ali_workqueue compile command.
+ * nptl/Versions: Export 2 routines of ali_workqueue.
+ * nptl/ali_workqueue.c: New file, the implementation of ali_workqueue by using
+ adaptive lock integration mechanism.
+ * include/ali_workqueue.h: New file, the user API definition.
+
2018-11-05 Arjun Shankar <arjun@redhat.com>
* iconv/gconv_conf.c (__gconv_read_conf): Remove NULL check for
diff --git a/include/ali_workqueue.h b/include/ali_workqueue.h
new file mode 100644
index 0000000..62f3429
--- /dev/null
+++ b/include/ali_workqueue.h
@@ -0,0 +1,26 @@
+#ifndef _ALI_WORKQUEUE_H_
+#define _ALI_WORKQUEUE_H_
+
+#define __aligned(x) __attribute__((aligned(x)))
+struct socket {
+ void *core __aligned(64);
+ char pad __aligned(64);
+};
+
+struct ali_workqueue {
+ struct socket owner;
+ struct socket cpu[0];
+} ali_workqueue_t;
+
+
+struct ali_workqueue_info {
+ struct ali_workqueue_info *next __aligned(64);
+ int pending;
+ void (*fn)(void *);
+ void *para;
+ int socket;
+};
+
+void ali_workqueue_init(struct ali_workqueue *ali_wq, int size);
+void ali_workqueue(struct ali_workqueue *ali_wq, struct ali_workqueue_info *ali);
+#endif
diff --git a/nptl/Versions b/nptl/Versions
index e7f691d..f4afa6d 100644
--- a/nptl/Versions
+++ b/nptl/Versions
@@ -267,6 +267,8 @@ libpthread {
}
GLIBC_2.22 {
+ ali_workqueue_init;
+ ali_workqueue;
}
# C11 thread symbols.
diff --git a/nptl/ali_workqueue.c b/nptl/ali_workqueue.c
new file mode 100644
index 0000000..fe34ca0
--- /dev/null
+++ b/nptl/ali_workqueue.c
@@ -0,0 +1,102 @@
+/* Copyright (C) 2018 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+ Contributed by Ulrich Drepper <drepper@redhat.com>, 2002.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <atomic.h>
+#include "ali_workqueue.h"
+
+static inline void run_workqueue(struct ali_workqueue_info *old, void **cpu)
+{
+
+ struct ali_workqueue_info *next, *ali;
+
+ old->fn(old->para);
+retry:
+ ali = __sync_val_compare_and_swap(cpu, old, NULL);
+
+ if(ali == old)
+ goto end;
+
+ ali = atomic_exchange_acquire(cpu, old);
+
+repeat:
+ if(old == ali)
+ goto retry;
+
+ while (!(next = atomic_load_relaxed(&ali->next)))
+ atomic_spin_nop ();
+
+ ali->fn(ali->para);
+ ali->pending = 0;
+ ali = next;
+ goto repeat;
+
+end:
+ atomic_store_release(&ali->pending, 0);
+ return;
+}
+
+void ali_workqueue(struct ali_workqueue *ali_wq, struct ali_workqueue_info *ali)
+{
+
+ struct ali_workqueue_info *old;
+ void **core;
+ ali->next = NULL;
+ ali->pending = 1;
+ core = &ali_wq->cpu[ali->socket].core;
+ old = atomic_exchange_acquire(core , ali);
+ if(old) {
+ atomic_store_release(&ali->next, old);
+ while(atomic_load_relaxed(&ali->pending))
+ atomic_spin_nop ();
+ return;
+ }
+
+ old = atomic_exchange_acquire(&ali_wq->owner.core, ali);
+ if(old) {
+ atomic_store_release(&old->next, ali);
+ while((atomic_load_relaxed(&ali->pending)))
+ atomic_spin_nop ();
+ }
+
+ run_workqueue(ali, core);
+ old = ali;
+
+ ali = __sync_val_compare_and_swap(&ali_wq->owner.core, old, NULL);
+ if(ali == old)
+ goto end;
+
+ while (!(ali = atomic_load_relaxed(&old->next)))
+ atomic_spin_nop ();
+
+
+ atomic_store_release(&ali->pending, 0);
+
+end:
+ return;
+
+}
+
+/* Init ali work queue */
+void ali_workqueue_init(struct ali_workqueue *ali_wq, int size)
+{
+ memset(ali_wq, 0, size);
+}
+
diff --git a/sysdeps/x86_64/nptl/Makefile b/sysdeps/x86_64/nptl/Makefile
index 7302403..a5d91e2 100644
--- a/sysdeps/x86_64/nptl/Makefile
+++ b/sysdeps/x86_64/nptl/Makefile
@@ -18,3 +18,4 @@
ifeq ($(subdir),csu)
gen-as-const-headers += tcb-offsets.sym
endif
+libpthread-routines += ali_workqueue
--
1.8.3.1
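The core idea in nptl/ali_workqueue.c above — whoever finds the queue empty
becomes the combiner and runs queued work on its own core, keeping the shared
data in that core's cache — can be reduced to a simplified, self-contained
single-list sketch. Names such as `wq_submit` and `wq_tail` are invented for
illustration, and the queue discipline differs from the patch's (which uses
per-socket lists plus a global owner):

```c
#include <stdatomic.h>
#include <stddef.h>

struct work {
    struct work *_Atomic next;
    _Atomic int pending;
    void (*fn)(void *);
    void *arg;
};

/* Tail of the pending-work list; NULL means no combiner is active. */
static struct work *_Atomic wq_tail;

/* Demo payload used in the usage example. */
long wq_counter;
void wq_inc(void *arg) { wq_counter += *(long *)arg; }

/* Publish a work item; returns once the item has been executed. */
void wq_submit(struct work *w)
{
    atomic_store_explicit(&w->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&w->pending, 1, memory_order_relaxed);

    struct work *prev = atomic_exchange_explicit(&wq_tail, w,
                                                 memory_order_acq_rel);
    if (prev) {
        /* A combiner is active: link in and spin until it runs our item. */
        atomic_store_explicit(&prev->next, w, memory_order_release);
        while (atomic_load_explicit(&w->pending, memory_order_acquire))
            ;
        return;
    }

    /* We are the combiner: execute items until the queue drains. */
    struct work *cur = w;
    for (;;) {
        cur->fn(cur->arg);
        struct work *expected = cur;
        if (atomic_compare_exchange_strong_explicit(&wq_tail, &expected,
                NULL, memory_order_acq_rel, memory_order_acquire)) {
            /* cur was still the tail: nothing left to combine. */
            atomic_store_explicit(&cur->pending, 0, memory_order_release);
            return;
        }
        /* A successor has linked (or is about to link) behind us. */
        struct work *next;
        while (!(next = atomic_load_explicit(&cur->next,
                                             memory_order_acquire)))
            ;
        atomic_store_explicit(&cur->pending, 0, memory_order_release);
        cur = next;
    }
}
```

Under contention, waiters spin only on their own cache-local `pending` flag
while a single combiner walks the list, which is what keeps the lock and the
shared data resident in one core's private cache.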
Please see the contribution checklist
<https://sourceware.org/glibc/wiki/Contribution%20checklist>. For
example:
* FSF copyright assignment (with employer assignment / disclaimer as
applicable) needed. People are unlikely to look in detail at code without
an assignment because it could cause problems if the assignment never
appears and they wish to implement something similar in future.
* Please format code according to the GNU Coding Standards.
* New features need documentation in the user manual, and to be mentioned
in the NEWS file.
* APIs need to be architecture-independent, in the absence of a clear
justification for an architecture-specific API.
* A new interface is not useful without an installed header declaring it
for users (this patch only has an internal header, not an installed one).
* New interfaces need testcases added to the glibc testsuite in the patch
adding the interface.
* No "Contributed by" in new files.
* New symbol versions must be the version number of the first glibc
release to have the feature. For something added now that would be
GLIBC_2.29.
* ABI test baselines (for all architectures) must be updated in any patch
adding new interfaces.
* New code should not use __sync_*, and should have comments explicitly
explaining the synchronization used (in terms of the C11 memory model)
when using atomics.
Hi all,
We got our copyright assignment from the Free Software Foundation in 2014, so there is no problem with the patches we send out.
Thanks
Ling
On 2018/11/29 at 9:55 PM, "Joseph Myers" <joseph@codesourcery.com> wrote:
On Thu, 29 Nov 2018, 马凌(彦军) wrote:
> Hi Joseph S. Myers
>
> Thanks for your reminder, we have got assignment from Free Software Foundation as attachment.
> So we have the right to send the formal patch now, correct?
Yes. (It's a good idea to say explicitly when posting the patch that
you're covered by the Alibaba assignment.)
--
Joseph S. Myers
joseph@codesourcery.com