From patchwork Wed Nov 28 08:23:29 2018
X-Patchwork-Submitter: ling.ma.program@gmail.com
X-Patchwork-Id: 30357
From: Ma Ling
To: libc-alpha@sourceware.org
Cc: "ling.ma"
Subject: [RFC PATCH] ali_workqueue: Adaptive lock integration on multi-socket/core platform
Date: Wed, 28 Nov 2018 16:23:29 +0800
Message-Id: <20181128082329.26873-1-ling.ma@MacBook-Pro-7.local>

From: "ling.ma"

Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes severe cache-line ping-pong, so the
process spends a lot of time and power to complete, especially on
multi-socket/multi-core platforms.
However, if the serialized work is sent to one core and executed there
ONLY when contention happens, much time and power can be saved, because
all shared data stays in the private cache of that core. We call this
mechanism Adaptive Lock Integration (ali workqueue).

Multiple CPU sockets currently give us better performance per watt, but
they also involve more complex synchronization requirements. For example,
in a critical-section scenario the lock cache line ping-pongs among CPU
sockets, and competition for the lock among more cores brings additional
overhead. In this version we introduce a distributed synchronization
mechanism, which reduces these issues considerably.

Assuming there are 2 CPU sockets:

1. If (the thread is from socket_0) Lock_from_socket_0
2. If (the thread is from socket_1) Lock_from_socket_1
3. Lock_Global
4. Enter critical section
5. If (the thread is from socket_0) UnLock_from_socket_0
6. If (the thread is from socket_1) UnLock_from_socket_1
7. The threads from the same socket (socket_0 or socket_1) complete the
   critical section one by one, until there are no waiting threads in
   that socket. During this process we also accelerate data and lock
   movement within the same socket.
8. UnLock_Global: we allow threads from other sockets to enter the
   critical section.

Step 1 or 2 helps mitigate pressure on the global lock, and only one
thread gets the global lock in steps 3 and 4. Step 5 or 6 helps reduce
movement of the global lock and shared data, because the lock and shared
data are kept in the same socket. Ali workqueue is very good at step 7,
which also balances the workload of the lock owner compared with the
original version.

In the end we get significant results, shown below (we will send the
benchmark in this thread soon):

1.
Hashwork (higher is better; the benchmark is from kemi.wang@intel.com):

Original Spinlock
Run hashwork in 5 seconds, print statistics below:
1 threads, 10221937 total hashes, 10221937 hashes per thread
2 threads, 18204627 total hashes, 9102313 hashes per thread
4 threads, 21847140 total hashes, 5461785 hashes per thread
8 threads, 13231893 total hashes, 1653986 hashes per thread
16 threads, 9706989 total hashes, 606686 hashes per thread
32 threads, 6096940 total hashes, 190529 hashes per thread
64 threads, 5237120 total hashes, 81830 hashes per thread
80 threads, 5225351 total hashes, 65316 hashes per thread
96 threads, 5345197 total hashes, 55679 hashes per thread

Ali Workqueue
Run hashwork in 5 seconds, print statistics below:
1 threads, 9597719 total hashes, 9597719 hashes per thread
2 threads, 16191658 total hashes, 8095829 hashes per thread
4 threads, 16284311 total hashes, 4071077 hashes per thread
8 threads, 25705715 total hashes, 3213214 hashes per thread
16 threads, 32104276 total hashes, 2006517 hashes per thread
32 threads, 33678957 total hashes, 1052467 hashes per thread
64 threads, 31804354 total hashes, 496943 hashes per thread
80 threads, 34445498 total hashes, 430568 hashes per thread
96 threads, 30523970 total hashes, 317958 hashes per thread

2.
Global data benchmark (smaller is better; the benchmark is from
ling.ml@antfin.com for our real workload):

Original Spinlock
1 threads 50000 num total time ( 1 threads): 32789120
2 threads 50000 num total time ( 2 threads): 208625958
4 threads 50000 num total time ( 4 threads): 1063907644
8 threads 50000 num total time ( 8 threads): 4734218966
16 threads 50000 num total time ( 16 threads): 25088565320
32 threads 50000 num total time ( 32 threads): 149992521624
64 threads 50000 num total time ( 64 threads): 1054508130586
80 threads 50000 num total time ( 80 threads): 1488507826842
96 threads 50000 num total time ( 96 threads): 1787252256456

Ali Workqueue
1 threads 50000 num total time ( 1 threads): 36340476
2 threads 50000 num total time ( 2 threads): 169380062
4 threads 50000 num total time ( 4 threads): 565430140
8 threads 50000 num total time ( 8 threads): 1329263188
16 threads 50000 num total time ( 16 threads): 3385617884
32 threads 50000 num total time ( 32 threads): 10736058730
64 threads 50000 num total time ( 64 threads): 31651343042
80 threads 50000 num total time ( 80 threads): 47133700104
96 threads 50000 num total time ( 96 threads): 62611966622

Any comments are appreciated.

Thanks
Ling
---
 ChangeLog                    |   8 ++++
 include/ali_workqueue.h      |  26 +++++++++++
 nptl/Versions                |   2 +
 nptl/ali_workqueue.c         | 102 +++++++++++++++++++++++++++++++++++++++++++
 sysdeps/x86_64/nptl/Makefile |   1 +
 5 files changed, 139 insertions(+)
 create mode 100644 include/ali_workqueue.h
 create mode 100644 nptl/ali_workqueue.c

diff --git a/ChangeLog b/ChangeLog
index d7ee676..fdfc00a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2018-11-08  Ma Ling
+
+	* sysdeps/x86_64/nptl/Makefile: Add the ali_workqueue compile command.
+	* nptl/Versions: Export 2 routines of ali_workqueue.
+	* nptl/ali_workqueue.c: New file, the implementation of ali_workqueue by using
+	adaptive lock integration mechanism.
+	* include/ali_workqueue.h: New file, the user API definition.
+
 2018-11-05  Arjun Shankar
 
 	* iconv/gconv_conf.c (__gconv_read_conf): Remove NULL check for
diff --git a/include/ali_workqueue.h b/include/ali_workqueue.h
new file mode 100644
index 0000000..62f3429
--- /dev/null
+++ b/include/ali_workqueue.h
@@ -0,0 +1,26 @@
+#ifndef _ALI_WORKQUEUE_H_
+#define _ALI_WORKQUEUE_H_
+
+#define __aligned(x) __attribute__((aligned(x)))
+struct socket {
+	void *core __aligned(64);
+	char pad __aligned(64);
+};
+
+typedef struct ali_workqueue {
+	struct socket owner;
+	struct socket cpu[0];
+} ali_workqueue_t;
+
+
+struct ali_workqueue_info {
+	struct ali_workqueue_info *next __aligned(64);
+	int pending;
+	void (*fn)(void *);
+	void *para;
+	int socket;
+};
+
+void ali_workqueue_init(struct ali_workqueue *ali_wq, int size);
+void ali_workqueue(struct ali_workqueue *ali_wq, struct ali_workqueue_info *ali);
+#endif
diff --git a/nptl/Versions b/nptl/Versions
index e7f691d..f4afa6d 100644
--- a/nptl/Versions
+++ b/nptl/Versions
@@ -267,6 +267,8 @@ libpthread {
   }
 
   GLIBC_2.22 {
+    ali_workqueue_init;
+    ali_workqueue;
   }
 
   # C11 thread symbols.
diff --git a/nptl/ali_workqueue.c b/nptl/ali_workqueue.c
new file mode 100644
index 0000000..fe34ca0
--- /dev/null
+++ b/nptl/ali_workqueue.c
@@ -0,0 +1,102 @@
+/* Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ulrich Drepper , 2002.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include
+#include
+#include
+#include
+#include "ali_workqueue.h"
+
+static inline void run_workqueue(struct ali_workqueue_info *old, void **cpu)
+{
+
+	struct ali_workqueue_info *next, *ali;
+
+	old->fn(old->para);
+retry:
+	ali = __sync_val_compare_and_swap(cpu, old, NULL);
+
+	if(ali == old)
+		goto end;
+
+	ali = atomic_exchange_acquire(cpu, old);
+
+repeat:
+	if(old == ali)
+		goto retry;
+
+	while (!(next = atomic_load_relaxed(&ali->next)))
+		atomic_spin_nop ();
+
+	ali->fn(ali->para);
+	ali->pending = 0;
+	ali = next;
+	goto repeat;
+
+end:
+	atomic_store_release(&ali->pending, 0);
+	return;
+}
+
+void ali_workqueue(struct ali_workqueue *ali_wq, struct ali_workqueue_info *ali)
+{
+
+	struct ali_workqueue_info *old;
+	void **core;
+	ali->next = NULL;
+	ali->pending = 1;
+	core = &ali_wq->cpu[ali->socket].core;
+	old = atomic_exchange_acquire(core , ali);
+	if(old) {
+		atomic_store_release(&ali->next, old);
+		while(atomic_load_relaxed(&ali->pending))
+			atomic_spin_nop ();
+		return;
+	}
+
+	old = atomic_exchange_acquire(&ali_wq->owner.core, ali);
+	if(old) {
+		atomic_store_release(&old->next, ali);
+		while((atomic_load_relaxed(&ali->pending)))
+			atomic_spin_nop ();
+	}
+
+	run_workqueue(ali, core);
+	old = ali;
+
+	ali = __sync_val_compare_and_swap(&ali_wq->owner.core, old, NULL);
+	if(ali == old)
+		goto end;
+
+	while (!(ali = atomic_load_relaxed(&old->next)))
+		atomic_spin_nop ();
+
+
+	atomic_store_release(&ali->pending, 0);
+
+end:
+	return;
+
+}
+
+/* Init ali work queue */
+void ali_workqueue_init(struct ali_workqueue *ali_wq, int size)
+{
+	memset(ali_wq, 0, size);
+}
+
diff --git a/sysdeps/x86_64/nptl/Makefile b/sysdeps/x86_64/nptl/Makefile
index 7302403..a5d91e2 100644
--- a/sysdeps/x86_64/nptl/Makefile
+++ b/sysdeps/x86_64/nptl/Makefile
@@ -18,3 +18,4 @@
 ifeq ($(subdir),csu)
 gen-as-const-headers += tcb-offsets.sym
 endif
+libpthread-routines += ali_workqueue
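
For readers who want to experiment with the locking order from steps 1-8 of the description without building glibc, here is a minimal standalone sketch. It is not the patch's implementation: it substitutes plain pthread mutexes for the lock-free queues in nptl/ali_workqueue.c, and it leaves out the in-socket hand-off of step 7 entirely. All demo_* names are hypothetical, and deriving the socket from the thread index is an assumption made only for illustration (in the patch the caller supplies ali->socket).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_SOCKETS 2        /* assumption: two CPU sockets, as in the description */
#define DEMO_MAX_THREADS 64   /* arbitrary upper bound for this sketch */

/* Hypothetical two-level lock: one mutex per socket plus one global mutex. */
struct demo_hier_lock {
	pthread_mutex_t socket_lock[DEMO_SOCKETS];
	pthread_mutex_t global_lock;
};

struct demo_worker {
	struct demo_hier_lock *lock;
	long *counter;   /* shared data protected by the two-level lock */
	int socket;      /* socket this thread is assumed to run on */
	int iters;       /* number of critical sections to execute */
};

static void demo_hier_lock_init(struct demo_hier_lock *l)
{
	for (int i = 0; i < DEMO_SOCKETS; i++)
		pthread_mutex_init(&l->socket_lock[i], NULL);
	pthread_mutex_init(&l->global_lock, NULL);
}

static void demo_hier_lock_destroy(struct demo_hier_lock *l)
{
	for (int i = 0; i < DEMO_SOCKETS; i++)
		pthread_mutex_destroy(&l->socket_lock[i]);
	pthread_mutex_destroy(&l->global_lock);
}

static void *demo_worker_main(void *arg)
{
	struct demo_worker *w = arg;

	assert(w->socket >= 0 && w->socket < DEMO_SOCKETS);
	for (int i = 0; i < w->iters; i++) {
		/* Steps 1-2: compete only with threads of the same socket. */
		pthread_mutex_lock(&w->lock->socket_lock[w->socket]);
		/* Step 3: at most one thread per socket reaches the global lock. */
		pthread_mutex_lock(&w->lock->global_lock);
		/* Step 4: critical section. */
		(*w->counter)++;
		/* Step 8, then steps 5-6; step 7 (hand-off) is not modelled. */
		pthread_mutex_unlock(&w->lock->global_lock);
		pthread_mutex_unlock(&w->lock->socket_lock[w->socket]);
	}
	return NULL;
}

/* Run nthreads workers and return the final value of the shared counter. */
long demo_run(int nthreads, int iters)
{
	struct demo_hier_lock lock;
	struct demo_worker workers[DEMO_MAX_THREADS];
	pthread_t tids[DEMO_MAX_THREADS];
	long counter = 0;

	assert(nthreads > 0 && nthreads <= DEMO_MAX_THREADS);
	demo_hier_lock_init(&lock);
	for (int i = 0; i < nthreads; i++) {
		/* Assumption: alternate threads between the two sockets. */
		workers[i] = (struct demo_worker){ &lock, &counter,
						   i % DEMO_SOCKETS, iters };
		int rc = pthread_create(&tids[i], NULL, demo_worker_main,
					&workers[i]);
		assert(rc == 0);
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	demo_hier_lock_destroy(&lock);
	return counter;
}
```

Calling demo_run(4, 10000) returns 40000, because every increment happens under the global lock. The point of the per-socket mutex is only that at most DEMO_SOCKETS threads ever contend for the global lock at once; the patch goes further by letting the lock owner execute the queued work of waiters from its own socket.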