From patchwork Wed Nov 28 08:23:29 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: ling.ma.program@gmail.com
X-Patchwork-Id: 30357
From: Ma Ling <ling.ma.program@gmail.com>
To: libc-alpha@sourceware.org
Cc: "ling.ma" <ling.ml@antfin.com>
Subject: [RFC PATCH] ali_workqueue: Adaptive lock integration on
	multi-socket/core platform
Date: Wed, 28 Nov 2018 16:23:29 +0800
Message-Id: <20181128082329.26873-1-ling.ma@MacBook-Pro-7.local>

From: "ling.ma" <ling.ml@antfin.com>

  Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes severe cache-line ping-pong, so
the process spends a lot of time and power to complete it,
especially on multi-socket/multi-core platforms.

  However, if the serialized work is sent to one core and executed there
ONLY when contention happens, much time and power can be saved,
because all shared data stays in the private cache of that core.
We call this mechanism Adaptive Lock Integration
(ali workqueue).
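
As a rough illustration of the calling pattern, the sketch below shows a
caller packaging its critical section as a function and submitting it,
instead of taking a lock itself. The struct mirrors the fields of
struct ali_workqueue_info in this patch, but submit_work() is a hypothetical
stand-in that simply serializes with a pthread mutex; it does not implement
the delegation described above, and run_demo() is only a test driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Mirrors the fields of struct ali_workqueue_info from the patch. */
struct work_item {
    struct work_item *next;
    int pending;
    void (*fn)(void *);
    void *para;
    int socket;
};

/* Hypothetical stand-in for ali_workqueue(): NOT the patch's algorithm.
   It only guarantees that fn runs serialized, using a plain mutex. */
static pthread_mutex_t standin_lock = PTHREAD_MUTEX_INITIALIZER;

static void submit_work(struct work_item *w)
{
    assert(w != NULL && w->fn != NULL);
    pthread_mutex_lock(&standin_lock);
    w->fn(w->para);
    pthread_mutex_unlock(&standin_lock);
}

/* The critical section, expressed as a function over the shared data. */
static void add_one(void *para)
{
    long *counter = para;
    *counter += 1;
}

struct worker_arg {
    long *counter;
    int socket;
    int iterations;
};

static void *worker(void *p)
{
    struct worker_arg *a = p;
    for (int i = 0; i < a->iterations; i++) {
        /* The request lives on the caller's stack for one submission. */
        struct work_item w = {
            .fn = add_one, .para = a->counter, .socket = a->socket
        };
        submit_work(&w);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value. */
static long run_demo(int nthreads, int iterations)
{
    enum { MAX_THREADS = 16 };
    pthread_t tid[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    long counter = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    for (int i = 0; i < nthreads; i++) {
        args[i] = (struct worker_arg){ &counter, i % 2, iterations };
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

With a real implementation behind submit_work(), the function would run on
whichever core currently owns the queue, so the counter would tend to stay
in that core's cache.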

  Multiple CPU sockets currently give us better performance per watt,
but they also bring more complex synchronization requirements.
For example, in a critical-section scenario the lock cache line
ping-pongs among CPU sockets, and competition for the lock
among more cores adds further overhead. In this version
we introduce a distributed synchronization mechanism, which
reduces these issues considerably. Assuming there are 2 CPU sockets:

1.	If (the thread is from socket_0)
		Lock_from_socket_0

2.	If (the thread is from socket_1)
		Lock_from_socket_1

3.	Lock_Global

4.	Enter critical section

5.	If (the thread is from socket_0)
		UnLock_from_socket_0

6.	If (the thread is from socket_1)
		UnLock_from_socket_1

7.	The threads from the same socket (socket_0 or socket_1) complete the
	critical section one by one, until no threads are waiting in that socket.
	During this process we also accelerate data and lock movement within
	the same socket.

8.	UnLock_Global: we allow threads from other sockets to
	enter the critical section
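
The eight steps above can be sketched as a generic two-level lock: a ticket
lock per socket plus one global flag that is kept inside a socket while
that socket still has waiters. This is an illustration of the idea only,
not the code in this patch. The names two_level_lock, two_level_acquire,
two_level_release and two_level_demo are hypothetical, the MAX_HANDOFFS
fairness bound is an added assumption, and the socket index is supplied
by the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SOCKETS   2   /* matches the 2-socket example in the text */
#define MAX_HANDOFFS 64  /* assumption: fairness bound, not in the patch */

struct socket_lock {
    atomic_uint next;     /* next ticket to hand out */
    atomic_uint serving;  /* ticket currently allowed in */
    bool owns_global;     /* only touched by the ticket holder */
    unsigned handoffs;    /* only touched by the ticket holder */
};

struct two_level_lock {
    atomic_bool global;
    struct socket_lock socket[NR_SOCKETS];
};

static void two_level_acquire(struct two_level_lock *l, int s)
{
    assert(s >= 0 && s < NR_SOCKETS);
    struct socket_lock *sl = &l->socket[s];

    /* Steps 1/2: queue on the lock of our own socket. */
    unsigned me = atomic_fetch_add_explicit(&sl->next, 1, memory_order_relaxed);
    while (atomic_load_explicit(&sl->serving, memory_order_acquire) != me)
        sched_yield();

    /* Step 3: take the global lock unless a socket neighbour left it to us. */
    if (!sl->owns_global) {
        while (atomic_exchange_explicit(&l->global, true, memory_order_acquire))
            sched_yield();
        sl->owns_global = true;
        sl->handoffs = 0;
    }
}

static void two_level_release(struct two_level_lock *l, int s)
{
    struct socket_lock *sl = &l->socket[s];
    unsigned serving = atomic_load_explicit(&sl->serving, memory_order_relaxed);
    unsigned next = atomic_load_explicit(&sl->next, memory_order_relaxed);
    bool has_waiter = next != serving + 1;

    if (has_waiter && sl->handoffs < MAX_HANDOFFS) {
        sl->handoffs++;           /* Step 7: keep the global lock in this socket. */
    } else {
        sl->owns_global = false;  /* Step 8: let other sockets in. */
        atomic_store_explicit(&l->global, false, memory_order_release);
    }
    /* Steps 5/6: pass the socket lock to the next ticket holder. */
    atomic_store_explicit(&sl->serving, serving + 1, memory_order_release);
}

struct demo_arg {
    struct two_level_lock *lock;
    long *counter;
    int socket;
    int iters;
};

static void *demo_worker(void *p)
{
    struct demo_arg *a = p;
    for (int i = 0; i < a->iters; i++) {
        two_level_acquire(a->lock, a->socket);
        *a->counter += 1;  /* Step 4: the critical section. */
        two_level_release(a->lock, a->socket);
    }
    return NULL;
}

/* Test driver: returns the counter after nthreads * iters increments. */
static long two_level_demo(int nthreads, int iters)
{
    enum { MAX_THREADS = 16 };
    static struct two_level_lock lock;  /* zero-initialized means unlocked */
    pthread_t tid[MAX_THREADS];
    struct demo_arg args[MAX_THREADS];
    long counter = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    for (int i = 0; i < nthreads; i++) {
        args[i] = (struct demo_arg){ &lock, &counter, i % NR_SOCKETS, iters };
        pthread_create(&tid[i], NULL, demo_worker, &args[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

The sketch spins with sched_yield() to stay well-behaved on small machines;
a tuned implementation would more likely use a pause-style spin hint.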

Steps 1 and 2 help us mitigate pressure on the Global Lock, and only one
thread gets the Global Lock in steps 3 and 4.

Steps 5 and 6 help us reduce movement of the Global Lock and shared data,
because the lock and the shared data stay within the same socket.
Ali workqueue is very good at step 7, and it also balances
the workload of the Lock Owner in the original version. In the end we get
the significant results below (we will send the benchmark in this thread soon):

1. Hashwork (higher is better; the benchmark is from kemi.wang@intel.com):
Original Spinlock
Run hashwork in 5 seconds, print statistics below:
1 threads, 10221937 total hashes, 10221937 hashes per thread
2 threads, 18204627 total hashes, 9102313 hashes per thread
4 threads, 21847140 total hashes, 5461785 hashes per thread
8 threads, 13231893 total hashes, 1653986 hashes per thread
16 threads, 9706989 total hashes, 606686 hashes per thread
32 threads, 6096940 total hashes, 190529 hashes per thread
64 threads, 5237120 total hashes, 81830 hashes per thread
80 threads, 5225351 total hashes, 65316 hashes per thread
96 threads, 5345197 total hashes, 55679 hashes per thread

Ali Workqueue
Run hashwork in 5 seconds, print statistics below:
1 threads, 9597719 total hashes, 9597719 hashes per thread
2 threads, 16191658 total hashes, 8095829 hashes per thread
4 threads, 16284311 total hashes, 4071077 hashes per thread
8 threads, 25705715 total hashes, 3213214 hashes per thread
16 threads, 32104276 total hashes, 2006517 hashes per thread
32 threads, 33678957 total hashes, 1052467 hashes per thread
64 threads, 31804354 total hashes, 496943 hashes per thread
80 threads, 34445498 total hashes, 430568 hashes per thread
96 threads, 30523970 total hashes, 317958 hashes per thread

2. Global data benchmark (lower is better;
   the benchmark is from ling.ml@antfin.com for our real workload):

Original Spinlock

1 threads 50000 num
total time (   1 threads): 32789120
2 threads 50000 num
total time (   2 threads): 208625958
4 threads 50000 num
total time (   4 threads): 1063907644
8 threads 50000 num
total time (   8 threads): 4734218966
16 threads 50000 num
total time (  16 threads): 25088565320
32 threads 50000 num
total time (  32 threads): 149992521624
64 threads 50000 num
total time (  64 threads): 1054508130586
80 threads 50000 num
total time (  80 threads): 1488507826842
96 threads 50000 num
total time (  96 threads): 1787252256456

Ali Workqueue
1 threads 50000 num
total time (   1 threads): 36340476
2 threads 50000 num
total time (   2 threads): 169380062
4 threads 50000 num
total time (   4 threads): 565430140
8 threads 50000 num
total time (   8 threads): 1329263188
16 threads 50000 num
total time (  16 threads): 3385617884
32 threads 50000 num
total time (  32 threads): 10736058730
64 threads 50000 num
total time (  64 threads): 31651343042
80 threads 50000 num
total time (  80 threads): 47133700104
96 threads 50000 num
total time (  96 threads): 62611966622
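
To make the two tables easier to compare, the numbers can be reduced to
one ratio per thread count. The helper below is only a convenience sketch
(the function and array names are hypothetical); the data is copied from
the tables above. At 96 threads it gives roughly 5.7 times the Hashwork
throughput and roughly 28.5 times less total time in the global data
benchmark. For Hashwork the ratio is below 1 at 1, 2 and 4 threads, and for
the global data benchmark it is below 1 at 1 thread.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ROWS 9

static const int threads[NR_ROWS] = { 1, 2, 4, 8, 16, 32, 64, 80, 96 };

/* Hashwork: total hashes in 5 seconds (higher is better). */
static const long long hashes_spin[NR_ROWS] = {
    10221937, 18204627, 21847140, 13231893, 9706989,
    6096940, 5237120, 5225351, 5345197
};
static const long long hashes_ali[NR_ROWS] = {
    9597719, 16191658, 16284311, 25705715, 32104276,
    33678957, 31804354, 34445498, 30523970
};

/* Global data benchmark: total time (lower is better). */
static const long long time_spin[NR_ROWS] = {
    32789120, 208625958, 1063907644, 4734218966, 25088565320,
    149992521624, 1054508130586, 1488507826842, 1787252256456
};
static const long long time_ali[NR_ROWS] = {
    36340476, 169380062, 565430140, 1329263188, 3385617884,
    10736058730, 31651343042, 47133700104, 62611966622
};

/* Returns the table row for a thread count, or -1 if it was not measured. */
static int row_for(int nthreads)
{
    for (int i = 0; i < NR_ROWS; i++)
        if (threads[i] == nthreads)
            return i;
    return -1;
}

/* Throughput ratio, ali workqueue over spinlock; above 1 means ali wins. */
static double hashwork_gain(int nthreads)
{
    int r = row_for(nthreads);
    assert(r >= 0);
    return (double) hashes_ali[r] / (double) hashes_spin[r];
}

/* Time ratio, spinlock over ali workqueue; above 1 means ali wins. */
static double global_data_speedup(int nthreads)
{
    int r = row_for(nthreads);
    assert(r >= 0);
    return (double) time_spin[r] / (double) time_ali[r];
}
```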

Any comments are appreciated.

Thanks
Ling
---
 ChangeLog                    |   8 ++++
 include/ali_workqueue.h      |  26 +++++++++++
 nptl/Versions                |   2 +
 nptl/ali_workqueue.c         | 102 +++++++++++++++++++++++++++++++++++++++++++
 sysdeps/x86_64/nptl/Makefile |   1 +
 5 files changed, 139 insertions(+)
 create mode 100644 include/ali_workqueue.h
 create mode 100644 nptl/ali_workqueue.c

diff --git a/ChangeLog b/ChangeLog
index d7ee676..fdfc00a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2018-11-08  Ma Ling  <ling.ml@antfin.com>
+
+	* sysdeps/x86_64/nptl/Makefile: Add the ali_workqueue compile command.
+	* nptl/Versions: Export 2 routines of ali_workqueue.
+	* nptl/ali_workqueue.c: New file, the implementation of ali_workqueue by using
+	adaptive lock integration mechanism.
+	* include/ali_workqueue.h: New file, the user API definition.
+
 2018-11-05  Arjun Shankar  <arjun@redhat.com>
 
 	* iconv/gconv_conf.c (__gconv_read_conf): Remove NULL check for
diff --git a/include/ali_workqueue.h b/include/ali_workqueue.h
new file mode 100644
index 0000000..62f3429
--- /dev/null
+++ b/include/ali_workqueue.h
@@ -0,0 +1,26 @@
+#ifndef _ALI_WORKQUEUE_H_
+#define _ALI_WORKQUEUE_H_
+
+#define __aligned(x)	__attribute__((aligned(x)))
+struct socket {
+	void *core __aligned(64);
+	char pad __aligned(64);
+};
+
+typedef struct ali_workqueue {
+	struct socket  owner;
+	struct socket  cpu[0];
+} ali_workqueue_t;
+
+
+struct ali_workqueue_info {
+	struct ali_workqueue_info *next  __aligned(64);
+	int pending;
+	void (*fn)(void *);
+	void *para;
+	int socket;
+};
+
+void ali_workqueue_init(struct ali_workqueue *ali_wq, int size);
+void ali_workqueue(struct ali_workqueue *ali_wq, struct ali_workqueue_info *ali);
+#endif
diff --git a/nptl/Versions b/nptl/Versions
index e7f691d..f4afa6d 100644
--- a/nptl/Versions
+++ b/nptl/Versions
@@ -267,6 +267,8 @@ libpthread {
   }
 
   GLIBC_2.22 {
+   ali_workqueue_init;
+   ali_workqueue;
   }
 
   # C11 thread symbols.
diff --git a/nptl/ali_workqueue.c b/nptl/ali_workqueue.c
new file mode 100644
index 0000000..fe34ca0
--- /dev/null
+++ b/nptl/ali_workqueue.c
@@ -0,0 +1,102 @@
+/* Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Ma Ling <ling.ml@antfin.com>, 2018.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <atomic.h>
+#include "ali_workqueue.h"
+
+static inline void run_workqueue(struct ali_workqueue_info *old, void **cpu)
+{
+
+	struct ali_workqueue_info *next, *ali;
+
+	old->fn(old->para);
+retry:
+	ali = __sync_val_compare_and_swap(cpu, old, NULL);
+
+	if(ali == old)
+		goto end;
+
+	ali = atomic_exchange_acquire(cpu, old);
+
+repeat:
+	if(old == ali)
+		goto retry;
+
+	while (!(next = atomic_load_relaxed(&ali->next)))
+		atomic_spin_nop ();
+
+	ali->fn(ali->para);
+	atomic_store_release(&ali->pending, 0);
+	ali = next;
+	goto repeat;
+
+end:
+	atomic_store_release(&ali->pending, 0);
+	return;
+}
+
+void ali_workqueue(struct ali_workqueue *ali_wq, struct ali_workqueue_info *ali)
+{
+
+	struct ali_workqueue_info *old;
+	void **core;
+	ali->next = NULL;
+	ali->pending = 1;
+	core = &ali_wq->cpu[ali->socket].core;
+	old = atomic_exchange_acquire(core, ali);
+	if(old) {
+		atomic_store_release(&ali->next, old);
+		while(atomic_load_relaxed(&ali->pending))
+			atomic_spin_nop ();
+		return;
+	}
+
+	old = atomic_exchange_acquire(&ali_wq->owner.core, ali);
+	if(old) {
+		atomic_store_release(&old->next, ali);
+		while(atomic_load_relaxed(&ali->pending))
+			atomic_spin_nop ();
+	}
+
+	run_workqueue(ali, core);
+	old = ali;
+
+	ali = __sync_val_compare_and_swap(&ali_wq->owner.core, old, NULL);
+	if(ali == old)
+		goto end;
+
+	while (!(ali = atomic_load_relaxed(&old->next)))
+		atomic_spin_nop ();
+
+
+	atomic_store_release(&ali->pending, 0);
+
+end:
+	return;
+
+}
+
+/* Init ali work queue */
+void ali_workqueue_init(struct ali_workqueue *ali_wq, int size)
+{
+	memset(ali_wq, 0, size);
+}
+
diff --git a/sysdeps/x86_64/nptl/Makefile b/sysdeps/x86_64/nptl/Makefile
index 7302403..a5d91e2 100644
--- a/sysdeps/x86_64/nptl/Makefile
+++ b/sysdeps/x86_64/nptl/Makefile
@@ -18,3 +18,4 @@
 ifeq ($(subdir),csu)
 gen-as-const-headers += tcb-offsets.sym
 endif
+libpthread-routines += ali_workqueue