NUMA spinlock [BZ #23962]

Message ID 20181226025019.38752-1-ling.ma@MacBook-Pro-8.local
State New, archived

Commit Message

ling.ma.program@gmail.com Dec. 26, 2018, 2:50 a.m. UTC
  From: "ling.ma" <ling.ml@antfin.com>

On multi-socket systems, memory is shared across the entire system.
Data access to the local socket is much faster than to a remote socket,
and data access within the local core is faster than from sibling cores
on the same socket.  For serialized workloads with a conventional
spinlock, when there is high spinlock contention between threads, lock
ping-pong among sockets becomes the bottleneck and threads spend the
majority of their time in spinlock overhead.

On multi-socket systems, the keys to our NUMA spinlock performance
are to minimize cross-socket traffic and to localize the serialized
workload to one core for execution.  The NUMA spinlock is built on the
following approaches, which reduce data movement and accelerate the
critical section, giving a significant performance improvement.

1. MCS spinlock
The MCS spinlock helps us reduce useless lock cache-line movement
while threads are spinning.  This paper provides a good description
of this kind of lock:
<http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
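
For illustration only, here is a minimal MCS lock sketch in C using the
GCC/Clang __atomic builtins.  The names are made up for this example;
it is not the code added by this patch, but it shows why each waiter
spins only on its own node so the lock line does not bounce:

/* Minimal MCS lock sketch (illustrative only).  */
struct mcs_node { struct mcs_node *next; int locked; };

static struct mcs_node *mcs_tail;  /* the lock word */

static void
mcs_lock (struct mcs_node *self)
{
  self->next = NULL;
  self->locked = 1;
  struct mcs_node *prev
    = __atomic_exchange_n (&mcs_tail, self, __ATOMIC_ACQ_REL);
  if (prev != NULL)
    {
      /* Link ourselves behind the previous waiter and spin on our
         own node only.  */
      __atomic_store_n (&prev->next, self, __ATOMIC_RELEASE);
      while (__atomic_load_n (&self->locked, __ATOMIC_ACQUIRE))
        ;
    }
}

static void
mcs_unlock (struct mcs_node *self)
{
  struct mcs_node *next = __atomic_load_n (&self->next, __ATOMIC_ACQUIRE);
  if (next == NULL)
    {
      struct mcs_node *expected = self;
      if (__atomic_compare_exchange_n (&mcs_tail, &expected, NULL, 0,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
        return;  /* no waiter */
      /* A waiter is still linking itself in; wait for the link.  */
      while ((next = __atomic_load_n (&self->next, __ATOMIC_ACQUIRE)) == NULL)
        ;
    }
  __atomic_store_n (&next->locked, 0, __ATOMIC_RELEASE);
}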

2. Critical Section Integration (CSI)
Essentially, a spinlock serializes execution as if one core completed
the critical sections one by one.  So when contention happens, the
serialized work is sent to the core that owns the lock, which executes
it on behalf of the other threads.  This saves much time and power,
because all shared data stay in the private cache of the lock owner.

We implemented this mechanism on top of a queued spinlock, as in the
kernel; it speeds up the critical section and reduces the probability
of contention.  This paper provides a good description of this kind
of lock:
<https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
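
To illustrate the idea only, here is a small flat-combining style
sketch in C.  All names are made up, and a plain combiner flag stands
in for the MCS-style queue that this patch actually uses:

/* Flat-combining sketch of Critical Section Integration
   (illustrative only).  */
struct csi_request
{
  void *(*fn) (void *);          /* the critical function */
  void *arg;                     /* its argument */
  void *result;                  /* its return value */
  int pending;                   /* 1 until a combiner has run FN */
  struct csi_request *next;
};

static struct csi_request *csi_head;   /* publication list (LIFO) */
static char csi_lock;                  /* combiner flag */

static void
csi_apply (struct csi_request *req)
{
  req->pending = 1;
  /* Publish the request by prepending it to the list.  */
  req->next = __atomic_exchange_n (&csi_head, req, __ATOMIC_ACQ_REL);

  while (__atomic_load_n (&req->pending, __ATOMIC_ACQUIRE))
    {
      /* Try to become the combiner.  */
      if (__atomic_test_and_set (&csi_lock, __ATOMIC_ACQUIRE))
        continue;                       /* someone else is combining */

      /* Detach the published requests and run them all on this core,
         so the shared data stay in this core's private cache.  */
      struct csi_request *r
        = __atomic_exchange_n (&csi_head, NULL, __ATOMIC_ACQ_REL);
      while (r != NULL)
        {
          /* Read NEXT before waking R's owner, which may reuse R.  */
          struct csi_request *next = r->next;
          r->result = r->fn (r->arg);
          __atomic_store_n (&r->pending, 0, __ATOMIC_RELEASE);
          r = next;
        }
      __atomic_clear (&csi_lock, __ATOMIC_RELEASE);
    }
}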

3. NUMA Aware Spinlock (NAS)
Multi-socket systems give us better performance per watt, but they
also impose more complex synchronization requirements, because
off-chip data movement is much slower.  We use a distributed
synchronization mechanism to reduce lock cache-line traffic to and
from different nodes.  This paper provides a good description of this
kind of lock:
<https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>

4. Yield Schedule
When threads apply for Critical Section Integration (CSI) under known
contention, they delegate their work to the thread that owns the lock
and wait for the work to be completed.  The CPU resources they occupy
while waiting should be yielded to other threads.  To accelerate this
scenario, we call sched_yield during the spinning stage.
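
A condensed sketch of the waiter's spinning stage, assuming a pending
flag that the lock owner clears once it has executed the waiter's
delegated work (this mirrors what numa_spinlock_apply in this patch
does with glibc's internal atomics):

#include <sched.h>

/* Waiter side, illustrative only.  *PENDING is cleared by the lock
   owner after it has run the waiter's delegated critical function.  */
static void
wait_for_delegated_work (int *pending)
{
  /* The lock owner will do our work for us, so give up the CPU once
     before spinning.  */
  sched_yield ();
  while (__atomic_load_n (pending, __ATOMIC_ACQUIRE))
    ;  /* ideally a CPU relax hint (e.g. "pause" on x86) here */
}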

5. Optimization when NUMA is ON or OFF.
Although programs can access memory with lower latency when NUMA is
enabled, some programs may need the extra memory bandwidth that
running with NUMA disabled provides.  We therefore also optimize for
multi-socket systems with NUMA disabled.

NUMA spinlock flow chart (assuming there are 2 CPU nodes):

1. Threads from node_0/node_1 acquire the local lock for node_0/node_1
respectively.  If a thread succeeds in acquiring the local lock, it
goes to step 2; otherwise it pushes its critical function onto the
current local work queue and enters the spinning stage in MCS mode.

2. Threads from node_0/node_1 acquire the global lock.  If a thread
succeeds in acquiring the global lock, it becomes the lock owner and
goes to step 3; otherwise it waits until the lock owner thread
releases the global lock.

3. The lock owner thread from node_0/node_1 enters the critical
section, drains the work queue by performing, with CSI, all the local
critical functions pushed at step 1 on behalf of the other threads,
and informs those spinning threads that their work has been done.  It
then releases the local lock.

4. The lock owner thread frees the global lock.  If another thread is
waiting at step 2, the lock owner thread passes the global lock to
the waiting thread and returns.  The new lock owner thread enters
step 3.  If no threads are waiting, the lock owner thread releases
the global lock and returns.  The whole critical section process is
completed.
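
Condensed into code, the acquire path looks roughly like this.  It is
a simplified outline of numa_spinlock_apply in
sysdeps/unix/sysv/linux/numa-spinlock.c from this patch; the
single-node fast path and some details are omitted:

/* Simplified outline only; see numa_spinlock_apply in the patch for
   the real code and memory ordering.  */
void
numa_spinlock_apply_outline (struct numa_spinlock_info *self)
{
  struct numa_spinlock_info **head_p = &self->lock->lists[self->node];
  struct numa_spinlock_info *first;

  /* Step 1: try to take the local (per-node) lock by prepending SELF
     to the local list.  */
  self->next = NULL;
  self->pending = 1;
  first = atomic_exchange_acquire (head_p, self);
  if (first != NULL)
    {
      /* Lost the local lock: the local lock owner will run our
         workload; yield, then spin until it tells us it is done.  */
      atomic_store_release (&self->next, first);
      sched_yield ();
      while (atomic_load_relaxed (&self->pending))
        atomic_spin_nop ();
      return;
    }

  /* Step 2: we own the local lock; compete for the global lock.  */
  first = atomic_exchange_acquire (&self->lock->owner, self);
  if (first != NULL)
    {
      atomic_store_release (&first->next, self);
      while (atomic_load_relaxed (&self->pending))
        atomic_spin_nop ();
    }

  /* Step 3: run our workload and every workload queued on this node,
     then release the local lock (run_numa_spinlock in the patch).  */
  run_numa_spinlock (self, head_p);

  /* Step 4: pass the global lock to the next waiting node, if any;
     otherwise simply release it.  */
  if (atomic_compare_and_exchange_bool_acq (&self->lock->owner, NULL, self))
    atomic_store_release (&get_numa_spinlock_info_next (&self->next)->pending,
                          0);
}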

Steps 1 and 2 mitigate global lock contention: only one thread per
node competes for the global lock in step 2.  Step 3 reduces movement
of the global lock and shared data because they stay within the same
node, and even within the same core.  Our data show that Critical
Section Integration (CSI) improves data locality and the NUMA-aware
spinlock (NAS) helps CSI balance the workload.

The NUMA spinlock can greatly speed up critical sections on
multi-socket systems, and it should improve spinlock performance on
all multi-socket systems.

2018-12-24  Ling Ma  <ling.ml@antfin.com>
	    H.J. Lu  <hongjiu.lu@intel.com>
	    Wei Xiao  <wei3.xiao@intel.com>

	[BZ #23962]
	* NEWS: Mention NUMA spinlock.
	* manual/examples/numa-spinlock.c: New file.
	* sysdeps/unix/sysv/linux/numa-spinlock-private.h: Likewise.
	* sysdeps/unix/sysv/linux/numa-spinlock.c: Likewise.
	* sysdeps/unix/sysv/linux/numa-spinlock.h: Likewise.
	* sysdeps/unix/sysv/linux/numa_spinlock_alloc.c: Likewise.
	* sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c:
	Likewise.
	* sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c:
	Likewise.
	* sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c: Likewise.
	* manual/threads.texi: Document NUMA spinlock.
	* sysdeps/unix/sysv/linux/Makefile (libpthread-sysdep_routines):
	Add numa_spinlock_alloc and numa-spinlock.
	(sysdep_headers): Add numa-spinlock.h.
	* sysdeps/unix/sysv/linux/Versions (libpthread): Add
	numa_spinlock_alloc, numa_spinlock_free, numa_spinlock_init
	and numa_spinlock_apply to GLIBC_2.29.
	* sysdeps/unix/sysv/linux/aarch64/libpthread.abilist: Add
	numa_spinlock_alloc, numa_spinlock_apply, numa_spinlock_free
	and numa_spinlock_init.
	* sysdeps/unix/sysv/linux/alpha/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/arm/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/csky/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/hppa/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/i386/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/ia64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/microblaze/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/nios2/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/sh/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86/Makefile (xtests): Add
	tst-variable-overhead and tst-numa-variable-overhead.
	

Signed-off-by: Ling Ma  <ling.ml@antfin.com>
Signed-off-by: H.J. Lu  <hongjiu.lu@intel.com>
Signed-off-by: Wei Xiao  <wei3.xiao@intel.com>
---
 NEWS                                               |   3 +
 manual/examples/numa-spinlock.c                    |  99 ++++++
 manual/threads.texi                                | 105 ++++++
 sysdeps/unix/sysv/linux/Makefile                   |   2 +
 sysdeps/unix/sysv/linux/Versions                   |   9 +
 sysdeps/unix/sysv/linux/aarch64/libpthread.abilist |   4 +
 sysdeps/unix/sysv/linux/alpha/libpthread.abilist   |   4 +
 sysdeps/unix/sysv/linux/arm/libpthread.abilist     |   4 +
 sysdeps/unix/sysv/linux/csky/libpthread.abilist    |   4 +
 sysdeps/unix/sysv/linux/hppa/libpthread.abilist    |   4 +
 sysdeps/unix/sysv/linux/i386/libpthread.abilist    |   4 +
 sysdeps/unix/sysv/linux/ia64/libpthread.abilist    |   4 +
 .../sysv/linux/m68k/coldfire/libpthread.abilist    |   4 +
 .../unix/sysv/linux/m68k/m680x0/libpthread.abilist |   4 +
 .../unix/sysv/linux/microblaze/libpthread.abilist  |   4 +
 .../unix/sysv/linux/mips/mips32/libpthread.abilist |   4 +
 .../unix/sysv/linux/mips/mips64/libpthread.abilist |   4 +
 sysdeps/unix/sysv/linux/nios2/libpthread.abilist   |   4 +
 sysdeps/unix/sysv/linux/numa-spinlock-private.h    |  38 ++
 sysdeps/unix/sysv/linux/numa-spinlock.c            | 327 ++++++++++++++++++
 sysdeps/unix/sysv/linux/numa-spinlock.h            |  64 ++++
 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c      | 304 ++++++++++++++++
 .../linux/powerpc/powerpc32/libpthread.abilist     |   4 +
 .../linux/powerpc/powerpc64/be/libpthread.abilist  |   4 +
 .../linux/powerpc/powerpc64/le/libpthread.abilist  |   4 +
 .../unix/sysv/linux/riscv/rv64/libpthread.abilist  |   4 +
 .../sysv/linux/s390/s390-32/libpthread.abilist     |   4 +
 .../sysv/linux/s390/s390-64/libpthread.abilist     |   4 +
 sysdeps/unix/sysv/linux/sh/libpthread.abilist      |   4 +
 .../sysv/linux/sparc/sparc32/libpthread.abilist    |   4 +
 .../sysv/linux/sparc/sparc64/libpthread.abilist    |   4 +
 sysdeps/unix/sysv/linux/x86/Makefile               |   1 +
 .../sysv/linux/x86/tst-numa-variable-overhead.c    |  53 +++
 .../linux/x86/tst-variable-overhead-skeleton.c     | 384 +++++++++++++++++++++
 .../unix/sysv/linux/x86/tst-variable-overhead.c    |  47 +++
 .../unix/sysv/linux/x86_64/64/libpthread.abilist   |   4 +
 .../unix/sysv/linux/x86_64/x32/libpthread.abilist  |   4 +
 37 files changed, 1532 insertions(+)
 create mode 100644 manual/examples/numa-spinlock.c
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
 create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
 create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
 create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
 create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
  

Comments

Ma Ling (Yanjun) Jan. 3, 2019, 4:06 a.m. UTC | #1
Hi all,
In the patch we provide a test case (updates on shared data) to simulate our real workload.
The test case produces the data below
(tested on a 2-socket platform, 104 cores with 256 GB RAM, HT on):
1. original spinlock
#./tst-variable-overhead 
Number of processors: 104, Single thread time 24913702

Number of threads:    2, Total time      138311026, Overhead: 2.78
Number of threads:    4, Total time      643308290, Overhead: 6.46
Number of threads:    8, Total time     2607024872, Overhead: 13.08
Number of threads:   16, Total time    12196852896, Overhead: 30.60
Number of threads:   32, Total time    80141230706, Overhead: 100.52
Number of threads:   64, Total time   385495457274, Overhead: 241.77
Number of threads:  104, Total time   885472451762, Overhead: 341.75
Number of threads:  128, Total time  1543267108944, Overhead: 483.94
Number of threads:  256, Total time 11780541636054, Overhead: 1847.09
Number of threads:  360, Total time 32000897533528, Overhead: 3567.97
Number of threads:  416, Total time 48862502147056, Overhead: 4714.59

2. numa spinlock
#./tst-numa-variable-overhead
Number of processors: 104, Single thread time 27906178

Number of threads:    2, Total time      137911942, Overhead: 2.47
Number of threads:    4, Total time      339251710, Overhead: 3.04
Number of threads:    8, Total time      723663466, Overhead: 3.24
Number of threads:   16, Total time     2380050348, Overhead: 5.33
Number of threads:   32, Total time     8084855738, Overhead: 9.05
Number of threads:   64, Total time    27646507880, Overhead: 15.48
Number of threads:  104, Total time    65470567548, Overhead: 22.56
Number of threads:  128, Total time    97966785076, Overhead: 27.43
Number of threads:  256, Total time  1110168309448, Overhead: 155.40
Number of threads:  360, Total time  2516015158740, Overhead: 250.44
Number of threads:  416, Total time  4219233078472, Overhead: 363.45
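
For reference, the Overhead column appears to be computed as
Total time / (Single thread time * Number of threads); for example,
with the original spinlock and 2 threads:
138311026 / (24913702 * 2) ~= 2.78.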

The above data show that the NUMA spinlock improves performance by up to 15x.
Our real online workload indicates that the NUMA spinlock improves system throughput by 14.5%.

Any comments are appreciated.

Thanks a lot
Ling

On 2018/12/26 at 10:50 AM, "Ma Ling" <ling.ma.program@gmail.com> wrote:

    From: "ling.ma" <ling.ml@antfin.com>
    
    On multi-socket systems, memory is shared across the entire system.
    Data access to the local socket is much faster than the remote socket
    and data access to the local core is faster than sibling cores on the
    same socket.  For serialized workloads with conventional spinlock,
    when there is high spinlock contention between threads, lock ping-pong
    among sockets becomes the bottleneck and threads spend majority of
    their time in spinlock overhead.
    
    On multi-socket systems, the keys to our NUMA spinlock performance
    are to minimize cross-socket traffic as well as localize the serialized
    workload to one core for execution.  The basic principles of NUMA
    spinlock are mainly consisted of following approaches, which reduce
    data movement and accelerate critical section, eventually give us
    significant performance improvement.
    
    1. MCS spinlock
    MCS spinlock help us to reduce the useless lock movement in the
    spinning state.  This paper provides a good description for this
    kind of lock:
    <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
    
    2. Critical Section Integration (CSI)
    Essentially spinlock is similar to that one core complete critical
    sections one by one. So when contention happen, the serialized works
    are sent to the core who is the lock owner and responsible to execute
    them, that can save much time and power, because all shared data are
    located in private cache of the lock owner.
    
    We implemented this mechanism based on queued spinlock in kernel, that
    speeds up critical section, and reduces the probability of contention.
    The paper provides a good description for this kind of lock:
    <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
    
    3. NUMA Aware Spinlock (NAS)
    Currently multi-socket systems give us better performance per watt,
    however that also involves more complex synchronization requirement,
    because off-chip data movement is much slower. We use distributed
    synchronization mechanism to decrease Lock cache line to and from
    different nodes. The paper provides a good description for this kind
    of lock:
    <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>
    
    4. Yield Schedule
    When threads are applying for Critical Section Integration(CSI) with
    known contention, they will delegate work to the thread who is the
    lock owner, and wait for work to be completed.  The resources which
    they are using should be transferred to other threads. In order to
    accelerate the scenario, we introduce yield_sched function during
    spinning stage.
    
    5. Optimization when NUMA is ON or OFF.
    Although programs can access memory with lower latency when NUMA is
    enabled, some programs may need more memory bandwidth for computation
    with NUMA disabled.  We also optimize multi-socket systems with NUMA
    disabled.
    
    NUMA spinlock flow chart (assuming there are 2 CPU nodes):
    
    1. Threads from node_0/node_1 acquire local lock for node_0/1
    respectively.  If the thread succeeds in acquiring local lock, it
    goes to step 2, otherwise pushes critical function into current
    local work queue, and enters into spinning stage with MCS mode.
    
    2. Threads from node_0/node_1 acquire the global lock.  If it succeeds
    in acquiring the global lock as the lock owner, it goes to step 3,
    otherwise waits until the lock owner thread releases the global lock.
    
    3. The lock owner thread from node_0/1 enters into critical section,
    cleans up work queue by performing all local critical functions
    pushed at step 1 with CSI on behalf of other threads and informs
    those spinning threads that their works have been done.  It then
    releases the local lock.
    
    4. The lock owner thread frees global lock.  If another thread is
    waiting at step 2, the lock owner thread passes the global lock to
    the waiting thread and returns.  The new lock owner thread enters
    into step 3.  If no threads are waiting, the lock owner thread
    releases the global lock and returns.  The whole critical section
    process is completed.
    
    Steps 1 and 2 mitigate global lock contention.  Only one thread
    from different nodes will compete for the global lock in step 2.
    Step 3 reduces the global lock & shared data movement because they
    are located in the same node as well as the same core.  Our data
    shows that Critical Section Integration (CSI) improves data locality
    and NUMA-aware spinlock (NAS) helps CSI balance the workload.
    
    NUMA spinlock can greatly speed up critical section on multi-socket
    systems.  It should improve spinlock performance on all multi-socket
    systems. 
    
    2018-12-24  Ling Ma  <ling.ml@antfin.com>
    	    H.J. Lu  <hongjiu.lu@intel.com>
    	    Wei Xiao  <wei3.xiao@intel.com>
    
    	[BZ #23962]
    	* NEWS: Mention NUMA spinlock.
    	* manual/examples/numa-spinlock.c: New file.
    	* sysdeps/unix/sysv/linux/numa-spinlock-private.h: Likewise.
    	* sysdeps/unix/sysv/linux/numa-spinlock.c: Likewise.
    	* sysdeps/unix/sysv/linux/numa-spinlock.h: Likewise.
    	* sysdeps/unix/sysv/linux/numa_spinlock_alloc.c: Likewise.
    	* sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c:
    	Likewise.
    	* sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c:
    	Likewise.
    	* sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c: Likewise.
    	* manual/threads.texi: Document NUMA spinlock.
    	* sysdeps/unix/sysv/linux/Makefile (libpthread-sysdep_routines):
    	Add numa_spinlock_alloc and numa-spinlock.
    	(sysdep_headers): Add numa-spinlock.h.
    	* sysdeps/unix/sysv/linux/Versions (libpthread): Add
    	numa_spinlock_alloc, numa_spinlock_free, numa_spinlock_init
    	and numa_spinlock_apply to GLIBC_2.29.
    	* sysdeps/unix/sysv/linux/aarch64/libpthread.abilist: Add
    	numa_spinlock_alloc, numa_spinlock_apply, numa_spinlock_free
    	and numa_spinlock_init.
    	* sysdeps/unix/sysv/linux/alpha/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/arm/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/csky/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/hppa/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/i386/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/ia64/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/microblaze/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/nios2/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/sh/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist:
    	Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist: Likewise.
    	* sysdeps/unix/sysv/linux/x86/Makefile (xtests): Add
    	tst-variable-overhead and tst-numa-variable-overhead.
    	
    
    Signed-off-by: Ling Ma  <ling.ml@antfin.com>
    Signed-off-by: H.J. Lu  <hongjiu.lu@intel.com>
    Signed-off-by: Wei Xiao  <wei3.xiao@intel.com>
    ---
     NEWS                                               |   3 +
     manual/examples/numa-spinlock.c                    |  99 ++++++
     manual/threads.texi                                | 105 ++++++
     sysdeps/unix/sysv/linux/Makefile                   |   2 +
     sysdeps/unix/sysv/linux/Versions                   |   9 +
     sysdeps/unix/sysv/linux/aarch64/libpthread.abilist |   4 +
     sysdeps/unix/sysv/linux/alpha/libpthread.abilist   |   4 +
     sysdeps/unix/sysv/linux/arm/libpthread.abilist     |   4 +
     sysdeps/unix/sysv/linux/csky/libpthread.abilist    |   4 +
     sysdeps/unix/sysv/linux/hppa/libpthread.abilist    |   4 +
     sysdeps/unix/sysv/linux/i386/libpthread.abilist    |   4 +
     sysdeps/unix/sysv/linux/ia64/libpthread.abilist    |   4 +
     .../sysv/linux/m68k/coldfire/libpthread.abilist    |   4 +
     .../unix/sysv/linux/m68k/m680x0/libpthread.abilist |   4 +
     .../unix/sysv/linux/microblaze/libpthread.abilist  |   4 +
     .../unix/sysv/linux/mips/mips32/libpthread.abilist |   4 +
     .../unix/sysv/linux/mips/mips64/libpthread.abilist |   4 +
     sysdeps/unix/sysv/linux/nios2/libpthread.abilist   |   4 +
     sysdeps/unix/sysv/linux/numa-spinlock-private.h    |  38 ++
     sysdeps/unix/sysv/linux/numa-spinlock.c            | 327 ++++++++++++++++++
     sysdeps/unix/sysv/linux/numa-spinlock.h            |  64 ++++
     sysdeps/unix/sysv/linux/numa_spinlock_alloc.c      | 304 ++++++++++++++++
     .../linux/powerpc/powerpc32/libpthread.abilist     |   4 +
     .../linux/powerpc/powerpc64/be/libpthread.abilist  |   4 +
     .../linux/powerpc/powerpc64/le/libpthread.abilist  |   4 +
     .../unix/sysv/linux/riscv/rv64/libpthread.abilist  |   4 +
     .../sysv/linux/s390/s390-32/libpthread.abilist     |   4 +
     .../sysv/linux/s390/s390-64/libpthread.abilist     |   4 +
     sysdeps/unix/sysv/linux/sh/libpthread.abilist      |   4 +
     .../sysv/linux/sparc/sparc32/libpthread.abilist    |   4 +
     .../sysv/linux/sparc/sparc64/libpthread.abilist    |   4 +
     sysdeps/unix/sysv/linux/x86/Makefile               |   1 +
     .../sysv/linux/x86/tst-numa-variable-overhead.c    |  53 +++
     .../linux/x86/tst-variable-overhead-skeleton.c     | 384 +++++++++++++++++++++
     .../unix/sysv/linux/x86/tst-variable-overhead.c    |  47 +++
     .../unix/sysv/linux/x86_64/64/libpthread.abilist   |   4 +
     .../unix/sysv/linux/x86_64/x32/libpthread.abilist  |   4 +
     37 files changed, 1532 insertions(+)
     create mode 100644 manual/examples/numa-spinlock.c
     create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
     create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
     create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
     create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
     create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
     create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
     create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
    
    diff --git a/NEWS b/NEWS
    index cd29ec7..0764d0d 100644
    --- a/NEWS
    +++ b/NEWS
    @@ -9,6 +9,9 @@ Version 2.29
     
     Major new features:
     
    +* NUMA spinlock is added to provide a spinlock implementation optimized
    +  for multi-socket NUMA systems.
    +
     * The getcpu wrapper function has been added, which returns the currently
       used CPU and NUMA node.  This function is Linux-specific.
     
    diff --git a/manual/examples/numa-spinlock.c b/manual/examples/numa-spinlock.c
    new file mode 100644
    index 0000000..ca98443
    --- /dev/null
    +++ b/manual/examples/numa-spinlock.c
    @@ -0,0 +1,99 @@
    +/* NUMA spinlock example.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +
    +   This program is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU General Public License
    +   as published by the Free Software Foundation; either version 2
    +   of the License, or (at your option) any later version.
    +
    +   This program is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    +   GNU General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.
    +*/
    +
    +#include <pthread.h>
    +#include <stdio.h>
    +#include <stdlib.h>
    +#include <unistd.h>
    +#include <errno.h>
    +#include <string.h>
    +#include <numa-spinlock.h>
    +
    +#define NUM_THREADS	20
    +
    +struct numa_spinlock *lock;
    +
    +struct work_todo_argument
    +{
    +  void *arg;
    +};
    +
    +static void *
    +work_todo (void *v)
    +{
    +  /* Do the real work with p->arg. */
    +  struct work_todo_argument *p = v;
    +  /* Return value is set to lock_info.result. */
    +  return NULL;
    +}
    +
    +void *
    +work_thread (void *arg)
    +{
    +  struct work_todo_argument work_todo_arg;
    +  struct numa_spinlock_info lock_info;
    +
    +  if (numa_spinlock_init (lock, &lock_info))
    +    {
    +      printf ("numa_spinlock_init failure: %m\n");
    +      exit (1);
    +    }
    +
    +  work_todo_arg.arg = arg;
    +  lock_info.argument = &work_todo_arg;
    +  lock_info.workload = work_todo;
    +
    +  numa_spinlock_apply (&lock_info);
    +
    +  return lock_info.result;
    +}
    +
    +int
    +main (int argc, char **argv)
    +{
    +  lock = numa_spinlock_alloc ();
    +  pthread_t thr[NUM_THREADS];
    +  void *res[NUM_THREADS];
    +  int numthreads = NUM_THREADS;
    +  int i;
    +
    +  for (i = 0; i < NUM_THREADS; i++)
    +    {
    +      int err_ret = pthread_create (&thr[i], NULL, work_thread,
    +				    (void *) (intptr_t) i);
    +      if (err_ret != 0)
    +	{
    +	  printf ("pthread_create failed: %d, %s\n",
    +		  i, strerror (i));
    +	  numthreads = i;
    +	  break;
    +	}
    +    }
    +
    +  for (i = 0; i < numthreads; i++)
    +    {
    +      if (pthread_join (thr[i], (void *) &res[i]) == 0)
    +	free (res[i]);
    +      else
    +	printf ("pthread_join failure: %m\n");
    +    }
    +
    +  numa_spinlock_free (lock);
    +
    +  return 0;
    +}
    diff --git a/manual/threads.texi b/manual/threads.texi
    index 87fda7d..e82ae0d 100644
    --- a/manual/threads.texi
    +++ b/manual/threads.texi
    @@ -625,6 +625,9 @@ the standard.
     @menu
     * Default Thread Attributes::             Setting default attributes for
     					  threads in a process.
    +* NUMA Spinlock::                         Spinlock optimized for
    +					  multi-socket NUMA platform.
    +* NUMA Spinlock Example::                 A NUMA spinlock example.
     @end menu
     
     @node Default Thread Attributes
    @@ -669,6 +672,108 @@ The system does not have sufficient memory.
     @end table
     @end deftypefun
     
    +@node NUMA Spinlock
    +@subsubsection Spinlock optimized for multi-node NUMA systems
    +
    +To improve performance on multi-socket NUMA platforms for serialized
    +region protected by spinlock, @theglibc{} implements a NUMA spinlock
    +object, which minimizes cross-socket traffic and sends the protected
    +serialized region to one core for execution to reduce spinlock contention
    +overhead.
    +
    +The fundamental data types for a NUMA spinlock are
    +@code{numa_spinlock} and @code{numa_spinlock_info}:
    +
    +@deftp {Data Type} {struct numa_spinlock}
    +@standards{Linux, numa-spinlock.h}
    +This data type is an opaque structure.  A @code{numa_spinlock} pointer
    +uniquely identifies a NUMA spinlock object.
    +@end deftp
    +
    +@deftp {Data Type} {struct numa_spinlock_info}
    +@standards{Linux, numa-spinlock.h}
    +
    +This data type uniquely identifies a NUMA spinlock information object for
    +a thread.  It has the following members, and others internal to NUMA
    +spinlock implemenation:
    +
    +@table @code
    +@item void *(*workload) (void *)
    +A function pointer to the workload function serialized by spinlock.
    +@item void *argument
    +A pointer to argument passed to the @var{workload} function pointer.
    +@item void *result
    +Return value from the @var{workload} function pointer.
    +@end table
    +
    +@end deftp
    +
    +The following functions are provided for NUMA spinlock objects:
    +
    +@deftypefun struct numa_spinlock *numa_spinlock_alloc (void)
    +@standards{Linux, numa-spinlock.h}
    +@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
    +
    +This function returns a pointer to a newly allocated NUMA spinlock or a
    +null pointer if the NUMA spinlock could not be allocated, setting
    +@code{errno} to @code{ENOMEM}.  Caller should call
    +@code{numa_spinlock_free} on the NUMA spinlock pointer to free the
    +memory space when it is no longer needed.
    +
    +This function is Linux-specific and is declared in @file{numa-spinlock.h}.
    +@end deftypefun
    +
    +@deftypefun void numa_spinlock_free (struct numa_spinlock *@var{lock})
    +@standards{Linux, numa-spinlock.h}
    +@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
    +
    +Free the memory space pointed to by @var{lock}, which must have been
    +returned by a previous call to @code{numa_spinlock_alloc}.  Otherwise,
    +or if @code{numa_spinlock_free (@var{lock})} has already been called
    +before, undefined behavior occurs.
    +
    +This function is Linux-specific and is declared in @file{numa-spinlock.h}.
    +@end deftypefun
    +
    +@deftypefun int numa_spinlock_init (struct numa_spinlock *@var{lock},
    +struct numa_spinlock_info *@var{info})
    +@standards{Linux, numa-spinlock.h}
    +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
    +
    +Initialize the NUMA spinlock information block pointed to by @var{info}
    +with a NUMA spinlock pointer @var{lock}.  The return value is @code{0} on
    +success and @code{-1} on failure.  The following @code{errno} error
    +codes are defined for this function:
    +
    +@table @code
    +@item ENOSYS
    +The operating system does not support the @code{getcpu} function.
    +@end table
    +
    +This function is Linux-specific and is declared in @file{numa-spinlock.h}.
    +@end deftypefun
    +
    +@deftypefun void numa_spinlock_apply (struct numa_spinlock_info *@var{info})
    +@standards{Linux, numa-spinlock.h}
    +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
    +
    +Apply for spinlock with a NUMA spinlock information block pointed to by
    +@var{info}.  When @code{numa_spinlock_apply} returns, the spinlock is
    +released and the @var{result} member of @var{info} contains the return
    +value of the @var{workload} member.
    +
    +This function is Linux-specific and is declared in @file{numa-spinlock.h}.
    +@end deftypefun
    +
    +@node NUMA Spinlock Example
    +@subsubsection NUMA Spinlock Example
    +
    +A NUMA spinlock example:
    +
    +@smallexample
    +@include numa-spinlock.c.texi
    +@end smallexample
    +
     @c FIXME these are undocumented:
     @c pthread_atfork
     @c pthread_attr_destroy
    diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
    index f827455..3361597 100644
    --- a/sysdeps/unix/sysv/linux/Makefile
    +++ b/sysdeps/unix/sysv/linux/Makefile
    @@ -222,6 +222,8 @@ CFLAGS-gai.c += -DNEED_NETLINK
     endif
     
     ifeq ($(subdir),nptl)
    +libpthread-sysdep_routines += numa_spinlock_alloc numa-spinlock
    +sysdep_headers += numa-spinlock.h
     tests += tst-align-clone tst-getpid1 \
     	tst-thread-affinity-pthread tst-thread-affinity-pthread2 \
     	tst-thread-affinity-sched
    diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
    index f1e12d9..7ce7e2b 100644
    --- a/sysdeps/unix/sysv/linux/Versions
    +++ b/sysdeps/unix/sysv/linux/Versions
    @@ -185,3 +185,12 @@ libc {
         __netlink_assert_response;
       }
     }
    +
    +libpthread {
    +  GLIBC_2.29 {
    +    numa_spinlock_alloc;
    +    numa_spinlock_free;
    +    numa_spinlock_init;
    +    numa_spinlock_apply;
    +  }
    +}
    diff --git a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
    index 9a9e4ce..eb54a83 100644
    --- a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
    @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
    index b413007..dd08796 100644
    --- a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/arm/libpthread.abilist b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
    index af82a4c..45a5c5a 100644
    --- a/sysdeps/unix/sysv/linux/arm/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
    @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.4 _IO_flockfile F
     GLIBC_2.4 _IO_ftrylockfile F
     GLIBC_2.4 _IO_funlockfile F
    diff --git a/sysdeps/unix/sysv/linux/csky/libpthread.abilist b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
    index ea4b79a..cf65f72 100644
    --- a/sysdeps/unix/sysv/linux/csky/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
    @@ -73,6 +73,10 @@ GLIBC_2.29 mtx_timedlock F
     GLIBC_2.29 mtx_trylock F
     GLIBC_2.29 mtx_unlock F
     GLIBC_2.29 nanosleep F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.29 open F
     GLIBC_2.29 open64 F
     GLIBC_2.29 pause F
    diff --git a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
    index bcba07f..a80475f 100644
    --- a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/i386/libpthread.abilist b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
    index bece86d..40ac05a 100644
    --- a/sysdeps/unix/sysv/linux/i386/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
    index ccc9449..5b190f6 100644
    --- a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
    index af82a4c..45a5c5a 100644
    --- a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
    @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.4 _IO_flockfile F
     GLIBC_2.4 _IO_ftrylockfile F
     GLIBC_2.4 _IO_funlockfile F
    diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
    index bece86d..40ac05a 100644
    --- a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
    index 5067375..e6539bf 100644
    --- a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
    @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
    index 0214496..76edcb8 100644
    --- a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
    index 0214496..76edcb8 100644
    --- a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
    index 78cac2a..3141d08 100644
    --- a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
    @@ -241,3 +241,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/numa-spinlock-private.h b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
    new file mode 100644
    index 0000000..0f7a3be
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
    @@ -0,0 +1,38 @@
    +/* Internal definitions and declarations for NUMA spinlock.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#include "numa-spinlock.h"
    +
    +/* The global NUMA spinlock.  */
    +struct numa_spinlock
    +{
    +  /* List of threads who owns the global NUMA spinlock.  */
    +  struct numa_spinlock_info *owner;
    +  /* The maximium NUMA node number.  */
    +  unsigned int max_node;
    +  /* Non-zero for single node system.  */
    +  unsigned int single_node;
    +  /* The maximium CPU number.  Used only when NUMA is disabled.  */
    +  unsigned int max_cpu;
    +  /* Array of physical_package_id of each core if it isn't NULL.  Used
    +     only when NUMA is disabled.*/
    +  unsigned int *physical_package_id_p;
    +  /* Arrays of lists of threads who are spinning for the local NUMA lock
    +     on NUMA nodes indexed by NUMA node number.  */
    +  struct numa_spinlock_info *lists[];
    +};
    diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.c b/sysdeps/unix/sysv/linux/numa-spinlock.c
    new file mode 100644
    index 0000000..a141e7d
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/numa-spinlock.c
    @@ -0,0 +1,327 @@
    +/* NUMA spinlock
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#include <config.h>
    +#include <string.h>
    +#include <stdlib.h>
    +#include <sched.h>
    +#ifndef HAVE_GETCPU
    +#include <unistd.h>
    +#include <syscall.h>
    +#endif
    +#include <errno.h>
    +#include <atomic.h>
    +#include "numa-spinlock-private.h"
    +
    +#if !defined HAVE_GETCPU && defined _LIBC
    +# define HAVE_GETCPU
    +#endif
    +
    +/* On multi-socket systems, memory is shared across the entire system.
    +   Data access to the local socket is much faster than the remote socket
    +   and data access to the local core is faster than sibling cores on the
    +   same socket.  For serialized workloads with conventional spinlock,
    +   when there is high spinlock contention between threads, lock ping-pong
    +   among sockets becomes the bottleneck and threads spend majority of
    +   their time in spinlock overhead.
    +
    +   On multi-socket systems, the keys to our NUMA spinlock performance
    +   are to minimize cross-socket traffic as well as localize the serialized
    +   workload to one core for execution.  The basic principles of NUMA
    +   spinlock are mainly consisted of following approaches, which reduce
    +   data movement and accelerate critical section, eventually give us
    +   significant performance improvement.
    +
    +   1. MCS spinlock
    +   MCS spinlock help us to reduce the useless lock movement in the
    +   spinning state.  This paper provides a good description for this
    +   kind of lock:
    +   <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
    +
    +   2. Critical Section Integration (CSI)
    +   Essentially spinlock is similar to that one core complete critical
    +   sections one by one. So when contention happen, the serialized works
    +   are sent to the core who is the lock owner and responsible to execute
    +   them, that can save much time and power, because all shared data are
    +   located in private cache of the lock owner.
    +
    +   We implemented this mechanism based on queued spinlock in kernel, that
    +   speeds up critical section, and reduces the probability of contention.
    +   The paper provides a good description for this kind of lock:
    +   <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
    +
    +   3. NUMA Aware Spinlock (NAS)
    +   Currently multi-socket systems give us better performance per watt,
    +   however that also involves more complex synchronization requirement,
    +   because off-chip data movement is much slower. We use distributed
    +   synchronization mechanism to decrease Lock cache line to and from
    +   different nodes. The paper provides a good description for this kind
    +   of lock:
    +   <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>
    +
    +   4. Yield Schedule
    +   When threads are applying for Critical Section Integration(CSI) with
    +   known contention, they will delegate work to the thread who is the
    +   lock owner, and wait for work to be completed.  The resources which
    +   they are using should be transferred to other threads. In order to
    +   accelerate the scenario, we introduce yield_sched function during
    +   spinning stage.
    +
    +   5. Optimization when NUMA is ON or OFF.
    +   Although programs can access memory with lower latency when NUMA is
    +   enabled, some programs may need more memory bandwidth for computation
    +   with NUMA disabled.  We also optimize multi-socket systems with NUMA
    +   disabled.
    +
    +   NUMA spinlock flow chart (assuming there are 2 CPU nodes):
    +
    +   1. Threads from node_0/node_1 acquire local lock for node_0/1
    +   respectively.  If the thread succeeds in acquiring local lock, it
    +   goes to step 2, otherwise pushes critical function into current
    +   local work queue, and enters into spinning stage with MCS mode.
    +
    +   2. Threads from node_0/node_1 acquire the global lock.  If it succeeds
    +   in acquiring the global lock as the lock owner, it goes to step 3,
    +   otherwise waits until the lock owner thread releases the global lock.
    +
    +   3. The lock owner thread from node_0/1 enters into critical section,
    +   cleans up work queue by performing all local critical functions
    +   pushed at step 1 with CSI on behalf of other threads and informs
    +   those spinning threads that their works have been done.  It then
    +   releases the local lock.
    +
    +   4. The lock owner thread frees global lock.  If another thread is
    +   waiting at step 2, the lock owner thread passes the global lock to
    +   the waiting thread and returns.  The new lock owner thread enters
    +   into step 3.  If no threads are waiting, the lock owner thread
    +   releases the global lock and returns.  The whole critical section
    +   process is completed.
    +
    +   Steps 1 and 2 mitigate global lock contention.  Only one thread
    +   from different nodes will compete for the global lock in step 2.
    +   Step 3 reduces the global lock & shared data movement because they
    +   are located in the same node as well as the same core.  Our data
    +   shows that Critical Section Integration (CSI) improves data locality
    +   and NUMA-aware spinlock (NAS) helps CSI balance the workload.
    +
    +   NUMA spinlock can greatly speed up critical section on multi-socket
    +   systems.  It should improve spinlock performance on all multi-socket
    +   systems.
    +
    +   NOTE: LiTL <https://github.com/multicore-locks/litl>, is an open-source
    +   project that provides implementations of dozens of various locks,
    +   including several state-of-the-art NUMA-aware spinlocks.  Among them
    +
    +   1. Hierarchical MCS (HMCS) spinlock.  Milind Chabbi, Michael Fagan,
    +   and John Mellor-Crummey. High Performance Locks for Multi-level NUMA
    +   Systems.  In Proceedings of the ACM SIGPLAN Symposium on Principles
    +   and Practice of Parallel Programming (PPoPP), pages 215–226, 2015.
    +
    +   2. Cohort-MCS (C-MCS) spinlock.  Dave Dice, Virendra J. Marathe, and
    +   Nir Shavit.  Lock Cohorting: A General Technique for Designing NUMA
    +   Locks. ACM Trans. Parallel Comput., 1(2):13:1–13:42, 2015.
    + */
    +
    +/* Get the next thread pointed to by *NEXT_P.  NB: We must use a while
    +   spin loop to load NEXT_P since there is a small window before *NEXT_P
    +   is updated.  */
    +
    +static inline struct numa_spinlock_info *
    +get_numa_spinlock_info_next (struct numa_spinlock_info **next_p)
    +{
    +  struct numa_spinlock_info *next;
    +  while (!(next = atomic_load_relaxed (next_p)))
    +    atomic_spin_nop ();
    +  return next;
    +}
    +
    +/* While holding the global NUMA spinlock, run the workload of the
    +   thread pointed to by SELF first, then run the workload for each
    +   thread on the thread list pointed to by HEAD_P and wake up the
    +   thread so that all workloads run on a single processor.  */
    +
    +static inline void
    +run_numa_spinlock (struct numa_spinlock_info *self,
    +		   struct numa_spinlock_info **head_p)
    +{
    +  struct numa_spinlock_info *next, *current;
    +
    +  /* Run the SELF's workload. */
    +  self->result = self->workload (self->argument);
    +
    +  /* Process workloads for the rest of threads on the thread list.
    +     NB: The thread list may be prepended by other threads at the
    +     same time.  */
    +
    +retry:
    +   /* If SELF is the first thread of the thread list pointed to by
    +      HEAD_P, clear the thread list.  */
    +  current = atomic_compare_and_exchange_val_acq (head_p, NULL, self);
    +  if (current == self)
    +    {
    +      /* Since SELF is the only thread on the list, clear SELF's pending
    +         field and return.  */
    +      atomic_store_release (&current->pending, 0);
    +      return;
    +    }
    +
    +  /* CURRENT will have the previous first thread of the thread list
    +     pointed to by HEAD_P and *HEAD_P will point to SELF.  */
    +  current = atomic_exchange_acquire (head_p, self);
    +
    +  /* NB: No need to check if CURRENT == SELF here since SELF can never
    +     be CURRENT.  */
    +
    +repeat:
    +  /* Get the next thread.  */
    +  next = get_numa_spinlock_info_next (&current->next);
    +
    +  /* Run the CURRENT's workload and clear CURRENT's pending field. */
    +  current->result = current->workload (current->argument);
    +  current->pending = 0;
    +
    +  /* Process the workload for each thread from CURRENT to SELF on the
    +     thread list.  Don't pass beyond SELF since SELF is the last thread
    +     on the list.  */
    +  if (next == self)
    +    goto retry;
    +  current = next;
    +  goto repeat;
    +}
    +
    +/* Apply for the NUMA spinlock with the NUMA spinlock info data pointed
    +   to by SELF.  */
    +
    +void
    +numa_spinlock_apply (struct numa_spinlock_info *self)
    +{
    +  struct numa_spinlock *lock = self->lock;
    +  struct numa_spinlock_info *first, *next;
    +  struct numa_spinlock_info **head_p;
    +
    +  self->next = NULL;
    +  /* We want the global NUMA spinlock.  */
    +  self->pending = 1;
    +  /* Select the local NUMA spinlock list by the NUMA node number.  */
    +  head_p = &lock->lists[self->node];
    +  /* FIRST will have the previous first thread of the local NUMA spinlock
    +     list and *HEAD_P will point to SELF.  */
    +  first = atomic_exchange_acquire (head_p, self);
    +  if (first)
    +    {
    +      /* SELF has been prepended to the thread list pointed to by
    +	 HEAD_P.  NB: There is a small window between updating
    +	 *HEAD_P and self->next.  */
    +      atomic_store_release (&self->next, first);
    +      /* Let other threads run first since another thread will run our
    +	 workload for us.  */
    +      sched_yield ();
    +      /* Spin until our PENDING is cleared.  */
    +      while (atomic_load_relaxed (&self->pending))
    +	atomic_spin_nop ();
    +      return;
    +    }
    +
    +  /* NB: Now SELF must be the only thread on the thread list pointed
    +     to by HEAD_P.  Since thread is always prepended to HEAD_P, we
    +     can use *HEAD_P == SELF to check if SELF is the only thread on
    +     the thread list.  */
    +
    +  if (__glibc_unlikely (lock->single_node))
    +    {
    +      /* If there is only one node, there is no need for the global
    +         NUMA spinlock.  */
    +      run_numa_spinlock (self, head_p);
    +      return;
    +    }
    +
    +  /* FIRST will have the previous first thread of the local NUMA spinlock
    +     list of threads which holds the global NUMA spinlock, which will
    +     point to SELF.  */
    +  first = atomic_exchange_acquire (&lock->owner, self);
    +  if (first)
    +    {
    +      /* SELF has been prepended to the thread list pointed to by
    +	 lock->owner.  NB: There is a small window between updating
    +	 *HEAD_P and first->next.  */
    +      atomic_store_release (&first->next, self);
    +      /* Spin until the list of threads which holds the global NUMA
    +	 spinlock clears our PENDING.  */
    +      while (atomic_load_relaxed (&self->pending))
    +	atomic_spin_nop ();
    +    }
    +
    +  /* We get the global NUMA spinlock now.  Run our workload.  */
    +  run_numa_spinlock (self, head_p);
    +
    +  /* SELF is the only thread on the list if SELF is the first thread
    +     of the thread list pointed to by lock->owner.  In this case, we
    +     simply return.  */
    +  if (!atomic_compare_and_exchange_bool_acq (&lock->owner, NULL, self))
    +    return;
    +
    +  /* Wake up the next thread.  */
    +  next = get_numa_spinlock_info_next (&self->next);
    +  atomic_store_release (&next->pending, 0);
    +}
    +
    +/* Initialize the NUMA spinlock info data pointed to by INFO from a
    +   pointer to the NUMA spinlock, LOCK.  */
    +
    +int
    +numa_spinlock_init (struct numa_spinlock *lock,
    +			 struct numa_spinlock_info *info)
    +{
    +  memset (info, 0, sizeof (*info));
    +  info->lock = lock;
    +  /* For single node system, use 0 as the NUMA node number.  */
    +  if (lock->single_node)
    +    return 0;
    +  /* NB: Use the NUMA node number from getcpu to select the local NUMA
    +     spinlock list.  */
    +  unsigned int cpu;
    +  unsigned int node;
    +#ifdef HAVE_GETCPU
    +  int err_ret = getcpu (&cpu, &node);
    +#else
    +  int err_ret = syscall (SYS_getcpu, &cpu, &node, NULL);
    +#endif
    +  if (err_ret)
    +    return err_ret;
    +  if (lock->physical_package_id_p)
    +    {
     +      /* Defensive: clamp in case getcpu reports a CPU above lock->max_cpu.  */
    +      if (cpu > lock->max_cpu)
    +	cpu = lock->max_cpu;
    +      /* NB: If NUMA is disabled, use physical_package_id.  */
    +      node = lock->physical_package_id_p[cpu];
    +    }
     +  /* Defensive: clamp in case getcpu reports a node above lock->max_node.  */
    +  if (node > lock->max_node)
    +    node = lock->max_node;
    +  info->node = node;
    +  return err_ret;
    +}
    +
    +void
    +numa_spinlock_free (struct numa_spinlock *lock)
    +{
    +  if (lock->physical_package_id_p)
    +    free (lock->physical_package_id_p);
    +  free (lock);
    +}
    diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.h b/sysdeps/unix/sysv/linux/numa-spinlock.h
    new file mode 100644
    index 0000000..b17bda5
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/numa-spinlock.h
    @@ -0,0 +1,64 @@
    +/* Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#ifndef _NUMA_SPINLOCK_H
    +#define _NUMA_SPINLOCK_H
    +
    +#include <features.h>
    +
    +__BEGIN_DECLS
    +
    +/* The NUMA spinlock.  */
    +struct numa_spinlock;
    +
    +/* The NUMA spinlock information for each thread.  */
    +struct numa_spinlock_info
    +{
    +  /* The workload function of this thread.  */
    +  void *(*workload) (void *);
    +  /* The argument pointer passed to the workload function.  */
    +  void *argument;
    +  /* The return value of the workload function.  */
    +  void *result;
    +  /* The pointer to the NUMA spinlock.  */
    +  struct numa_spinlock *lock;
    +  /* The next thread on the local NUMA spinlock thread list.  */
    +  struct numa_spinlock_info *next;
    +  /* The NUMA node number.  */
    +  unsigned int node;
    +  /* Non-zero to indicate that the thread wants the NUMA spinlock.  */
    +  int pending;
    +  /* Reserved for future use.  */
    +  void *__reserved[4];
    +};
    +
    +/* Return a pointer to a newly allocated NUMA spinlock.  */
    +extern struct numa_spinlock *numa_spinlock_alloc (void);
    +
    +/* Free the memory space of the NUMA spinlock.  */
    +extern void numa_spinlock_free (struct numa_spinlock *);
    +
    +/* Initialize the NUMA spinlock information block.  */
    +extern int numa_spinlock_init (struct numa_spinlock *,
    +			       struct numa_spinlock_info *);
    +
     +/* Apply for the NUMA spinlock with a NUMA spinlock information block.  */
    +extern void numa_spinlock_apply (struct numa_spinlock_info *);
    +
    +__END_DECLS
    +
    +#endif /* numa-spinlock.h */
    diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
    new file mode 100644
    index 0000000..8ff4e1a
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
    @@ -0,0 +1,304 @@
    +/* Initialization of NUMA spinlock.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#include <assert.h>
    +#include <ctype.h>
    +#include <string.h>
    +#include <dirent.h>
    +#include <stdio.h>
    +#include <limits.h>
    +#ifdef _LIBC
    +# include <not-cancel.h>
    +#else
    +# include <stdlib.h>
    +# include <unistd.h>
    +# include <fcntl.h>
    +# define __open_nocancel		open
    +# define __close_nocancel_nostatus	close
    +# define __read_nocancel		read
    +#endif
    +
    +#include "numa-spinlock-private.h"
    +
    +static char *
    +next_line (int fd, char *const buffer, char **cp, char **re,
    +	   char *const buffer_end)
    +{
    +  char *res = *cp;
    +  char *nl = memchr (*cp, '\n', *re - *cp);
    +  if (nl == NULL)
    +    {
    +      if (*cp != buffer)
    +	{
    +	  if (*re == buffer_end)
    +	    {
    +	      memmove (buffer, *cp, *re - *cp);
    +	      *re = buffer + (*re - *cp);
    +	      *cp = buffer;
    +
    +	      ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
    +	      if (n < 0)
    +		return NULL;
    +
    +	      *re += n;
    +
    +	      nl = memchr (*cp, '\n', *re - *cp);
    +	      while (nl == NULL && *re == buffer_end)
    +		{
    +		  /* Truncate too long lines.  */
    +		  *re = buffer + 3 * (buffer_end - buffer) / 4;
    +		  n = __read_nocancel (fd, *re, buffer_end - *re);
    +		  if (n < 0)
    +		    return NULL;
    +
    +		  nl = memchr (*re, '\n', n);
    +		  **re = '\n';
    +		  *re += n;
    +		}
    +	    }
    +	  else
    +	    nl = memchr (*cp, '\n', *re - *cp);
    +
    +	  res = *cp;
    +	}
    +
    +      if (nl == NULL)
    +	nl = *re - 1;
    +    }
    +
    +  *cp = nl + 1;
    +  assert (*cp <= *re);
    +
    +  return res == *re ? NULL : res;
    +}
    +
    +static int
    +select_cpu (const struct dirent *d)
    +{
    +  /* Return 1 for "cpuXXX" where XXX are digits.  */
    +  if (strncmp (d->d_name, "cpu", sizeof ("cpu") - 1) == 0)
    +    {
    +      const char *p = d->d_name + 3;
    +
    +      if (*p == '\0')
    +	return 0;
    +
    +      do
    +	{
    +	  if (!isdigit (*p))
    +	    return 0;
    +	  p++;
    +	}
    +      while (*p != '\0');
    +
    +      return 1;
    +    }
    +  return 0;
    +}
    +
    +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
    +   call numa_spinlock_free on the NUMA spinlock pointer to free the
    +   memory when it is no longer needed.  */
    +
    +struct numa_spinlock *
    +numa_spinlock_alloc (void)
    +{
    +  const size_t buffer_size = 1024;
    +  char buffer[buffer_size];
    +  char *buffer_end = buffer + buffer_size;
    +  char *cp = buffer_end;
    +  char *re = buffer_end;
    +
    +  const int flags = O_RDONLY | O_CLOEXEC;
    +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
    +  char *l;
    +  unsigned int max_node = 0;
    +  unsigned int node_count = 0;
    +  if (fd != -1)
    +    {
    +      l = next_line (fd, buffer, &cp, &re, buffer_end);
    +      if (l != NULL)
    +	do
    +	  {
    +	    char *endp;
    +	    unsigned long int n = strtoul (l, &endp, 10);
    +	    if (l == endp)
    +	      {
    +		node_count = 1;
    +		break;
    +	      }
    +
    +	    unsigned long int m = n;
    +	    if (*endp == '-')
    +	      {
    +		l = endp + 1;
    +		m = strtoul (l, &endp, 10);
    +		if (l == endp)
    +		  {
    +		    node_count = 1;
    +		    break;
    +		  }
    +	      }
    +
    +	    node_count += m - n + 1;
    +
    +	    if (m >= max_node)
    +	      max_node = m;
    +
    +	    l = endp;
    +	    while (l < re && isspace (*l))
    +	      ++l;
    +	  }
    +	while (l < re);
    +
    +      __close_nocancel_nostatus (fd);
    +    }
    +
     +  /* NB: Some NUMA nodes may not be available.  If the number of NUMA
     +     nodes is 1, set the maximum NUMA node number to 0.  */
    +  if (node_count == 1)
    +    max_node = 0;
    +
    +  unsigned int max_cpu = 0;
    +  unsigned int *physical_package_id_p = NULL;
    +
    +  if (max_node == 0)
    +    {
    +      /* There is at least 1 node.  */
    +      node_count = 1;
    +
    +      /* If NUMA is disabled, use physical_package_id instead.  */
    +      struct dirent **cpu_list;
    +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
    +			    select_cpu, NULL);
    +      if (nprocs > 0)
    +	{
    +	  int i;
    +	  unsigned int *cpu_id_p = NULL;
    +
    +	  /* Find the maximum CPU number.  */
    +	  if (posix_memalign ((void **) &cpu_id_p,
    +			      __alignof__ (void *),
    +			      nprocs * sizeof (unsigned int)) == 0)
    +	    {
    +	      for (i = 0; i < nprocs; i++)
    +		{
    +		  unsigned int cpu_id
    +		    = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
    +		  cpu_id_p[i] = cpu_id;
    +		  if (cpu_id > max_cpu)
    +		    max_cpu = cpu_id;
    +		}
    +
    +	      if (posix_memalign ((void **) &physical_package_id_p,
    +				  __alignof__ (void *),
    +				  ((max_cpu + 1)
    +				   * sizeof (unsigned int))) == 0)
    +		{
    +		  memset (physical_package_id_p, 0,
    +			  ((max_cpu + 1) * sizeof (unsigned int)));
    +
    +		  max_node = UINT_MAX;
    +
    +		  /* Get physical_package_id.  */
    +		  char path[(sizeof ("/sys/devices/system/cpu")
    +			     + 3 * sizeof (unsigned long int)
    +			     + sizeof ("/topology/physical_package_id"))];
    +		  for (i = 0; i < nprocs; i++)
    +		    {
    +		      struct dirent *d = cpu_list[i];
    +		      if (snprintf (path, sizeof (path),
    +				    "/sys/devices/system/cpu/%s/topology/physical_package_id",
    +				    d->d_name) > 0)
    +			{
    +			  fd = __open_nocancel (path, flags);
    +			  if (fd != -1)
    +			    {
    +			      if (__read_nocancel (fd, buffer,
    +						   buffer_size) > 0)
    +				{
    +				  char *endp;
    +				  unsigned long int package_id
    +				    = strtoul (buffer, &endp, 10);
    +				  if (package_id != ULONG_MAX
    +				      && *buffer != '\0'
    +				      && (*endp == '\0' || *endp == '\n'))
    +				    {
    +				      physical_package_id_p[cpu_id_p[i]]
    +					= package_id;
    +				      if (max_node == UINT_MAX)
    +					{
    +					  /* This is the first node.  */
    +					  max_node = package_id;
    +					}
    +				      else if (package_id != max_node)
    +					{
    +					  /* NB: We only need to know if
    +					     NODE_COUNT > 1.  */
    +					  node_count = 2;
    +					  if (package_id > max_node)
    +					    max_node = package_id;
    +					}
    +				    }
    +				}
    +			      __close_nocancel_nostatus (fd);
    +			    }
    +			}
    +
    +		      free (d);
    +		    }
    +		}
    +
    +	      free (cpu_id_p);
    +	    }
    +	  else
    +	    {
    +	      for (i = 0; i < nprocs; i++)
    +		free (cpu_list[i]);
    +	    }
    +
    +	  free (cpu_list);
    +	}
    +    }
    +
    +  if (physical_package_id_p != NULL && node_count == 1)
    +    {
    +      /* There is only one node.  No need for physical_package_id_p.  */
    +      free (physical_package_id_p);
    +      physical_package_id_p = NULL;
    +      max_cpu = 0;
    +    }
    +
    +  /* Allocate an array of struct numa_spinlock_info pointers to hold info
    +     for all NUMA nodes with NUMA node number from getcpu () as index.  */
    +  size_t size = (sizeof (struct numa_spinlock)
    +		 + ((max_node + 1)
    +		    * sizeof (struct numa_spinlock_info *)));
    +  struct numa_spinlock *lock;
    +  if (posix_memalign ((void **) &lock,
    +		      __alignof__ (struct numa_spinlock_info *), size))
    +    return NULL;
    +  memset (lock, 0, size);
    +
    +  lock->max_node = max_node;
    +  lock->single_node = node_count == 1;
    +  lock->max_cpu = max_cpu;
    +  lock->physical_package_id_p = physical_package_id_p;
    +
    +  return lock;
    +}
    diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
    index 09e8447..dba7df6 100644
    --- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
    index 8300958..a763c0a 100644
    --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
    @@ -27,6 +27,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3 _IO_flockfile F
     GLIBC_2.3 _IO_ftrylockfile F
     GLIBC_2.3 _IO_funlockfile F
    diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
    index 9a9e4ce..eb54a83 100644
    --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
    @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
    index c370fda..366fcac 100644
    --- a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
    @@ -235,3 +235,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
    index d05468f..786d8e1 100644
    --- a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
    @@ -229,6 +229,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
    index e8161aa..dd7c52f 100644
    --- a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
    @@ -221,6 +221,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/sh/libpthread.abilist b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
    index bcba07f..a80475f 100644
    --- a/sysdeps/unix/sysv/linux/sh/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
    index b413007..dd08796 100644
    --- a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
    @@ -227,6 +227,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
    index ccc9449..5b190f6 100644
    --- a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/x86/Makefile b/sysdeps/unix/sysv/linux/x86/Makefile
    index 02ca36c..29d41ad 100644
    --- a/sysdeps/unix/sysv/linux/x86/Makefile
    +++ b/sysdeps/unix/sysv/linux/x86/Makefile
    @@ -14,6 +14,7 @@ endif
     ifeq ($(subdir),nptl)
     libpthread-sysdep_routines += elision-lock elision-unlock elision-timed \
     			      elision-trylock
    +xtests += tst-variable-overhead tst-numa-variable-overhead
     CFLAGS-elision-lock.c += -mrtm
     CFLAGS-elision-unlock.c += -mrtm
     CFLAGS-elision-timed.c += -mrtm
    diff --git a/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
    new file mode 100644
    index 0000000..7cb8542
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
    @@ -0,0 +1,53 @@
    +/* Test case for NUMA spinlock overhead.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#ifndef _GNU_SOURCE
    +# define _GNU_SOURCE
    +#endif
    +#include "numa-spinlock.h"
    +
    +struct numa_spinlock *lock;
    +
    +struct work_todo_argument
    +{
    +  unsigned long *v1;
    +  unsigned long *v2;
    +  unsigned long *v3;
    +  unsigned long *v4;
    +};
    +
    +static void *
    +work_todo (void *v)
    +{
    +  struct work_todo_argument *p = v;
    +  unsigned long ret;
    +  *p->v1 = *p->v1 + 1;
    +  *p->v2 = *p->v2 + 1;
    +  ret = __sync_val_compare_and_swap (p->v4, 0, 1);
    +  *p->v3 = *p->v3 + ret;
    +  return (void *) 2;
    +}
    +
    +static inline void
    +do_work (struct numa_spinlock_info *lock_info)
    +{
    +  numa_spinlock_apply (lock_info);
    +}
    +
    +#define USE_NUMA_SPINLOCK
    +#include "tst-variable-overhead-skeleton.c"
    diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
    new file mode 100644
    index 0000000..4b83dfb
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
    @@ -0,0 +1,384 @@
    +/* Test case skeleton for spinlock overhead.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
     +/* Check spinlock overhead with a large number of threads.  The critical
     +   region is very small.  Critical region + spinlock overhead aren't
     +   noticeable when the number of threads is small.  When the thread
     +   count increases, spinlock overhead becomes the bottleneck and shows
     +   up in the wall time of thread execution.  */
    +
    +#ifndef _GNU_SOURCE
    +# define _GNU_SOURCE
    +#endif
    +#include <config.h>
    +#include <unistd.h>
    +#include <stdio.h>
    +#include <pthread.h>
    +#include <sched.h>
    +#include <stdlib.h>
    +#include <string.h>
    +#include <stdint.h>
    +#include <sys/time.h>
    +#include <sys/param.h>
    +#include <errno.h>
    +#ifdef MODULE_NAME
    +# include <cpu-features.h>
    +# include <support/test-driver.h>
    +#endif
    +
    +#ifndef USE_PTHREAD_ATTR_SETAFFINITY_NP
    +# define USE_PTHREAD_ATTR_SETAFFINITY_NP 1
    +#endif
    +
    +#define memory_barrier() __asm ("" ::: "memory")
    +#define pause() __asm  ("rep ; nop" ::: "memory")
    +
    +#define CACHELINE_SIZE	64
    +#define CACHE_ALIGNED	__attribute__((aligned(CACHELINE_SIZE)))
    +
    +#define constant_time 5
    +unsigned long g_val CACHE_ALIGNED;
    +unsigned long g_val2 CACHE_ALIGNED;
    +unsigned long g_val3 CACHE_ALIGNED;
    +unsigned long cmplock CACHE_ALIGNED;
    +struct count
    +{
    +  unsigned long long total;
    +  unsigned long long spinlock;
    +  unsigned long long wall;
    +} __attribute__((aligned(128)));
    +
    +struct count *gcount;
    +
    +/* The time consumed by one update is about 200 TSCs.  */
    +static int delay_time_unlocked = 400;
    +
    +struct ops
    +{
    +  void *(*test) (void *arg);
    +  void (*print_thread) (void *res, int);
    +} *ops;
    +
    +struct stats_result
    +{
    +  unsigned long num;
    +};
    +
    +void *work_thread (void *arg);
    +
    +#define iterations (10000 * 5)
    +
    +static volatile int start_thread;
    +
     +/* Delay for a fixed number of TSC cycles.  */
    +static void
    +delay_tsc (unsigned n)
    +{
    +  unsigned long long start, current, diff;
    +  unsigned int aux;
    +  start = __builtin_ia32_rdtscp (&aux);
    +  while (1)
    +    {
    +      current = __builtin_ia32_rdtscp (&aux);
    +      diff = current - start;
    +      if (diff < n)
    +	pause ();
    +      else
    +	break;
    +    }
    +}
    +
    +static void
    +wait_a_bit (int delay_time)
    +{
    +  if (delay_time > 0)
    +    delay_tsc (delay_time);
    +}
    +
    +#ifndef USE_NUMA_SPINLOCK
    +static inline void
    +work_todo (void)
    +{
    +  unsigned long ret;
    +  g_val = g_val + 1;
    +  g_val2 = g_val2 + 1;
    +  ret = __sync_val_compare_and_swap (&cmplock, 0, 1);
    +  g_val3 = g_val3 + 1 + ret;
    +}
    +#endif
    +
    +void *
    +work_thread (void *arg)
    +{
    +  long i;
    +  unsigned long pid = (unsigned long) arg;
    +  struct stats_result *res;
    +  unsigned long long start, end;
    +  int err_ret = posix_memalign ((void **)&res, CACHELINE_SIZE,
    +				roundup (sizeof (*res), CACHELINE_SIZE));
    +  if (err_ret)
    +    {
    +      printf ("posix_memalign failure: %s\n", strerror (err_ret));
    +      exit (err_ret);
    +    }
    +  long num = 0;
    +
    +#ifdef USE_NUMA_SPINLOCK
    +  struct work_todo_argument work_todo_arg;
    +  struct numa_spinlock_info lock_info;
    +
    +  if (numa_spinlock_init (lock, &lock_info))
    +    {
    +      printf ("numa_spinlock_init failure: %m\n");
    +      exit (1);
    +    }
    +
    +  work_todo_arg.v1 = &g_val;
    +  work_todo_arg.v2 = &g_val2;
    +  work_todo_arg.v3 = &g_val3;
    +  work_todo_arg.v4 = &cmplock;
    +  lock_info.argument = &work_todo_arg;
    +  lock_info.workload = work_todo;
    +#endif
    +
    +  while (!start_thread)
    +    pause ();
    +
    +  unsigned int aux;
    +  start = __builtin_ia32_rdtscp (&aux);
    +  for (i = 0; i < iterations; i++)
    +    {
    +#ifdef USE_NUMA_SPINLOCK
    +      do_work (&lock_info);
    +#else
    +      do_work ();
    +#endif
    +      wait_a_bit (delay_time_unlocked);
    +      num++;
    +    }
    +  end = __builtin_ia32_rdtscp (&aux);
    +  gcount[pid].total = end - start;
    +  res->num = num;
    +
    +  return res;
    +}
    +
    +void
    +init_global_data(void)
    +{
    +  g_val = 0;
    +  g_val2 = 0;
    +  g_val3 = 0;
    +  cmplock = 0;
    +}
    +
    +void
    +test_threads (int numthreads, int numprocs, unsigned long time)
    +{
    +  start_thread = 0;
    +
    +#ifdef USE_NUMA_SPINLOCK
    +  lock = numa_spinlock_alloc ();
    +#endif
    +
    +  memory_barrier ();
    +
    +  pthread_t thr[numthreads];
    +  void *res[numthreads];
    +  int i;
    +
    +  init_global_data ();
    +  for (i = 0; i < numthreads; i++)
    +    {
    +      pthread_attr_t attr;
    +      const pthread_attr_t *attrp = NULL;
    +      if (USE_PTHREAD_ATTR_SETAFFINITY_NP)
    +	{
    +	  attrp = &attr;
    +	  pthread_attr_init (&attr);
    +	  cpu_set_t set;
    +	  CPU_ZERO (&set);
    +	  int cpu = i % numprocs;
    +	  (void) CPU_SET (cpu, &set);
    +	  pthread_attr_setaffinity_np (&attr, sizeof (cpu_set_t), &set);
    +	}
    +      int err_ret = pthread_create (&thr[i], attrp, ops->test,
    +				    (void *)(uintptr_t) i);
    +      if (err_ret != 0)
    +	{
    +	  printf ("pthread_create failed: %d, %s\n",
     +		  i, strerror (err_ret));
    +	  numthreads = i;
    +	  break;
    +	}
    +    }
    +
    +  memory_barrier ();
    +  start_thread = 1;
    +  memory_barrier ();
    +  sched_yield ();
    +
    +  if (time)
    +    {
    +      struct timespec ts =
    +	{
     +	  .tv_sec = time,
     +	  .tv_nsec = 0
    +	};
    +      clock_nanosleep (CLOCK_MONOTONIC, 0, &ts, NULL);
    +      memory_barrier ();
    +    }
    +
    +  for (i = 0; i < numthreads; i++)
    +    {
    +      if (pthread_join (thr[i], (void *) &res[i]) == 0)
    +	free (res[i]);
    +      else
    +	printf ("pthread_join failure: %m\n");
    +    }
    +
    +#ifdef USE_NUMA_SPINLOCK
    +  numa_spinlock_free (lock);
    +#endif
    +}
    +
    +struct ops hashwork_ops =
    +{
    +  .test = work_thread,
    +};
    +
    +struct ops *ops;
    +
    +static struct count
    +total_cost (int numthreads, int numprocs)
    +{
    +  int i;
    +  unsigned long long total = 0;
    +  unsigned long long spinlock = 0;
    +
    +  memset (gcount, 0, sizeof(gcount[0]) * numthreads);
    +
    +  unsigned long long start, end, diff;
    +  unsigned int aux;
    +
    +  start = __builtin_ia32_rdtscp (&aux);
    +  test_threads (numthreads, numprocs, constant_time);
    +  end = __builtin_ia32_rdtscp (&aux);
    +  diff = end - start;
    +
    +  for (i = 0; i < numthreads; i++)
    +    {
    +      total += gcount[i].total;
    +      spinlock += gcount[i].spinlock;
    +    }
    +
    +  struct count cost = { total, spinlock, diff };
    +  return cost;
    +}
    +
    +#ifdef MODULE_NAME
    +static int
    +do_test (void)
    +{
    +  if (!CPU_FEATURE_USABLE (RDTSCP))
    +    return EXIT_UNSUPPORTED;
    +#else
    +int
    +main (void)
    +{
    +#endif
    +  int numprocs = sysconf (_SC_NPROCESSORS_ONLN);
    +
    +  /* Oversubscribe CPU.  */
    +  int numthreads = 4 * numprocs;
    +
    +  ops = &hashwork_ops;
    +
    +  int err_ret = posix_memalign ((void **)&gcount, 4096,
    +				sizeof(gcount[0]) * numthreads);
    +  if (err_ret)
    +    {
    +      printf ("posix_memalign failure: %s\n", strerror (err_ret));
    +      exit (err_ret);
    +    }
    +
    +  struct count cost, cost1;
    +  double overhead;
    +  int i, last;
    +  int last_increment = numprocs < 16 ? 16 : numprocs;
    +  int numprocs_done = 0;
    +  int numprocs_reset = 0;
    +  cost1 = total_cost (1, numprocs);
    +
    +  printf ("Number of processors: %d, Single thread time %lld\n\n",
    +	  numprocs, cost1.total);
    +
    +  for (last = i = 2; i <= numthreads;)
    +    {
    +      last = i;
    +      cost = total_cost (i, numprocs);
    +      overhead = cost.total;
    +      overhead /= i;
    +      overhead /= cost1.total;
    +      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
    +	      i, cost.total, overhead);
    +      if ((i * 2) < numprocs)
    +	i = i * 2;
    +      else if (numprocs_done)
    +	{
    +	  if (numprocs_reset)
    +	    {
    +	      i = numprocs_reset;
    +	      numprocs_reset = 0;
    +	    }
    +	  else
    +	    {
    +	      if ((i * 2) < numthreads)
    +		i = i * 2;
    +	      else
    +		i = i + last_increment;
    +	    }
    +	}
    +      else
    +	{
    +	  if (numprocs != 2 * i)
    +	    numprocs_reset = 2 * i;
    +	  i = numprocs;
    +	  numprocs_done = 1;
    +	}
    +    }
    +
    +  if (last != numthreads)
    +    {
    +      i = numthreads;
    +      cost = total_cost (i, numprocs);
    +      overhead = cost.total;
    +      overhead /= i;
    +      overhead /= cost1.total;
    +      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
    +	      i, cost.total, overhead);
    +    }
    +
    +  free (gcount);
    +  return 0;
    +}
    +
    +#ifdef MODULE_NAME
    +# define TIMEOUT 900
    +# include <support/test-driver.c>
    +#endif
    diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
    new file mode 100644
    index 0000000..b3ce567
    --- /dev/null
    +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
    @@ -0,0 +1,47 @@
    +/* Test case for spinlock overhead.
    +   Copyright (C) 2018 Free Software Foundation, Inc.
    +   This file is part of the GNU C Library.
    +
    +   The GNU C Library is free software; you can redistribute it and/or
    +   modify it under the terms of the GNU Lesser General Public
    +   License as published by the Free Software Foundation; either
    +   version 2.1 of the License, or (at your option) any later version.
    +
    +   The GNU C Library is distributed in the hope that it will be useful,
    +   but WITHOUT ANY WARRANTY; without even the implied warranty of
    +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    +   Lesser General Public License for more details.
    +
    +   You should have received a copy of the GNU Lesser General Public
    +   License along with the GNU C Library; if not, see
    +   <http://www.gnu.org/licenses/>.  */
    +
    +#ifndef _GNU_SOURCE
    +# define _GNU_SOURCE
    +#endif
    +#include <pthread.h>
    +
    +struct
    +{
    +  pthread_spinlock_t testlock;
    +  char pad[64 - sizeof (pthread_spinlock_t)];
    +} test __attribute__((aligned(64)));
    +
    +static void
    +__attribute__((constructor))
    +init_spin (void)
    +{
    +  pthread_spin_init (&test.testlock, 0);
    +}
    +
    +static void work_todo (void);
    +
    +static inline void
    +do_work (void)
    +{
    +  pthread_spin_lock(&test.testlock);
    +  work_todo ();
    +  pthread_spin_unlock(&test.testlock);
    +}
    +
    +#include "tst-variable-overhead-skeleton.c"
    diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
    index 931c827..e90532e 100644
    --- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
    @@ -219,6 +219,10 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
     GLIBC_2.3.2 pthread_cond_broadcast F
     GLIBC_2.3.2 pthread_cond_destroy F
     GLIBC_2.3.2 pthread_cond_init F
    diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
    index c09c9b0..c74febb 100644
    --- a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
    +++ b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
    @@ -243,3 +243,7 @@ GLIBC_2.28 tss_create F
     GLIBC_2.28 tss_delete F
     GLIBC_2.28 tss_get F
     GLIBC_2.28 tss_set F
    +GLIBC_2.29 numa_spinlock_alloc F
    +GLIBC_2.29 numa_spinlock_apply F
    +GLIBC_2.29 numa_spinlock_free F
    +GLIBC_2.29 numa_spinlock_init F
    -- 
    1.8.3.1
  
Szabolcs Nagy Jan. 3, 2019, 2:52 p.m. UTC | #2
On 03/01/2019 05:35, 马凌(彦军) wrote:
>      create mode 100644 manual/examples/numa-spinlock.c
>      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
>      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
>      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
>      create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
>      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
>      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
>      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c

as far as i can tell the new code is generic
(other than the presence of efficient getcpu),
so i think the test should be generic too.

>     --- /dev/null
>     +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
>     @@ -0,0 +1,384 @@
...
>     +/* Check spinlock overhead with a large number of threads.  The critical
>     +   region is very small.  Critical region + spinlock overhead aren't
>     +   noticeable when the number of threads is small.  When the thread
>     +   count increases, spinlock overhead becomes the bottleneck and shows
>     +   up in the wall time of thread execution.  */

yeah, this is not easy to do in a generic way, i think
even on x86 such measurement is problematic, you don't
know what goes on a system (or vm) when the glibc test
is running.

but doing precise timing is not that important for
checking the correctness of the locks, so i think a
simplified version can be generic test code.
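
for illustration, a stripped-down correctness-only check against the
api in this patch (no timing, no affinity, just verifying that every
delegated workload ran exactly once) could be as small as the sketch
below -- names and thread/iteration counts are made up:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include "numa-spinlock.h"

  #define NTHREADS 16
  #define ITERS 10000

  static struct numa_spinlock *lock;
  static unsigned long counter;

  /* Runs serialized under the NUMA spinlock.  */
  static void *
  count_up (void *arg)
  {
    counter++;
    return NULL;
  }

  static void *
  thread_func (void *arg)
  {
    struct numa_spinlock_info info;
    if (numa_spinlock_init (lock, &info) != 0)
      exit (1);
    info.workload = count_up;
    info.argument = NULL;
    for (int i = 0; i < ITERS; i++)
      numa_spinlock_apply (&info);
    return NULL;
  }

  int
  main (void)
  {
    pthread_t thr[NTHREADS];
    lock = numa_spinlock_alloc ();
    if (lock == NULL)
      return 77;	/* UNSUPPORTED.  */
    for (int i = 0; i < NTHREADS; i++)
      if (pthread_create (&thr[i], NULL, thread_func, NULL) != 0)
        return 1;
    for (int i = 0; i < NTHREADS; i++)
      pthread_join (thr[i], NULL);
    numa_spinlock_free (lock);
    /* Each numa_spinlock_apply call must run COUNT_UP exactly once.  */
    if (counter != (unsigned long) NTHREADS * ITERS)
      {
        printf ("FAIL: counter %lu\n", counter);
        return 1;
      }
    return 0;
  }

something like that exercises the delegation path from several threads
without depending on rdtscp or wall-clock measurements, so it could
live in generic linux sysdeps.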
  
Rich Felker Jan. 3, 2019, 8:43 p.m. UTC | #3
On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> From: "ling.ma" <ling.ml@antfin.com>
> 
> On multi-socket systems, memory is shared across the entire system.
> Data access to the local socket is much faster than the remote socket
> and data access to the local core is faster than sibling cores on the
> same socket.  For serialized workloads with conventional spinlock,
> when there is high spinlock contention between threads, lock ping-pong
> among sockets becomes the bottleneck and threads spend majority of
> their time in spinlock overhead.
> 
> On multi-socket systems, the keys to our NUMA spinlock performance
> are to minimize cross-socket traffic as well as localize the serialized
> workload to one core for execution.  The basic principles of NUMA
> spinlock are mainly consisted of following approaches, which reduce
> data movement and accelerate critical section, eventually give us
> significant performance improvement.

I question whether this belongs in glibc. It seems highly application-
and kernel-specific. Is there a reason you wouldn't prefer to
implement and maintain it in a library for use in the kind of
application that needs it?

Some specific review inline below:

> [...]
> diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> new file mode 100644
> index 0000000..8ff4e1a
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> @@ -0,0 +1,304 @@
> +/* Initialization of NUMA spinlock.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <assert.h>
> +#include <ctype.h>
> +#include <string.h>
> +#include <dirent.h>
> +#include <stdio.h>
> +#include <limits.h>
> +#ifdef _LIBC
> +# include <not-cancel.h>
> +#else
> +# include <stdlib.h>
> +# include <unistd.h>
> +# include <fcntl.h>
> +# define __open_nocancel		open
> +# define __close_nocancel_nostatus	close
> +# define __read_nocancel		read
> +#endif
> +
> +#include "numa-spinlock-private.h"
> +
> +static char *
> +next_line (int fd, char *const buffer, char **cp, char **re,
> +	   char *const buffer_end)
> +{
> +  char *res = *cp;
> +  char *nl = memchr (*cp, '\n', *re - *cp);
> +  if (nl == NULL)
> +    {
> +      if (*cp != buffer)
> +	{
> +	  if (*re == buffer_end)
> +	    {
> +	      memmove (buffer, *cp, *re - *cp);
> +	      *re = buffer + (*re - *cp);
> +	      *cp = buffer;
> +
> +	      ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
> +	      if (n < 0)
> +		return NULL;
> +
> +	      *re += n;
> +
> +	      nl = memchr (*cp, '\n', *re - *cp);
> +	      while (nl == NULL && *re == buffer_end)
> +		{
> +		  /* Truncate too long lines.  */
> +		  *re = buffer + 3 * (buffer_end - buffer) / 4;
> +		  n = __read_nocancel (fd, *re, buffer_end - *re);
> +		  if (n < 0)
> +		    return NULL;
> +
> +		  nl = memchr (*re, '\n', n);
> +		  **re = '\n';
> +		  *re += n;
> +		}
> +	    }
> +	  else
> +	    nl = memchr (*cp, '\n', *re - *cp);
> +
> +	  res = *cp;
> +	}
> +
> +      if (nl == NULL)
> +	nl = *re - 1;
> +    }
> +
> +  *cp = nl + 1;
> +  assert (*cp <= *re);
> +
> +  return res == *re ? NULL : res;
> +}

This looks like fragile duplication of stdio-like buffering logic
that's not at all specific to this file. Does glibc have a policy on
whether things needing this should use stdio or some other shared code
rather than open-coding it like this?
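
For comparison only, here is a rough sketch of the same
/sys/devices/system/node/online scan written against stdio -- not
proposed code, and the handling of "N", "N-M" and comma-separated
ranges is simplified:

  #include <stdio.h>

  static int
  count_online_nodes (unsigned int *max_node)
  {
    FILE *f = fopen ("/sys/devices/system/node/online", "re");
    if (f == NULL)
      return -1;
    unsigned int count = 0, hi = 0;
    unsigned long n, m;
    int c;
    while (fscanf (f, "%lu", &n) == 1)
      {
        m = n;
        c = fgetc (f);
        if (c == '-')
          {
            if (fscanf (f, "%lu", &m) != 1)
              m = n;
            c = fgetc (f);	/* Separator after the range.  */
          }
        count += m - n + 1;
        if (m > hi)
          hi = m;
        if (c != ',')
          break;
      }
    fclose (f);
    *max_node = hi;
    return count;
  }

Whether stdio is acceptable here presumably depends on the same
malloc/cancellation constraints that motivated the _nocancel wrappers.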

> [...]

> +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> +   memory when it is no longer needed.  */
> +
> +struct numa_spinlock *
> +numa_spinlock_alloc (void)
> +{
> +  const size_t buffer_size = 1024;
> +  char buffer[buffer_size];
> +  char *buffer_end = buffer + buffer_size;
> +  char *cp = buffer_end;
> +  char *re = buffer_end;
> +
> +  const int flags = O_RDONLY | O_CLOEXEC;
> +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> +  char *l;
> +  unsigned int max_node = 0;
> +  unsigned int node_count = 0;
> +  if (fd != -1)
> +    {
> +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> +      if (l != NULL)
> +	do
> +	  {
> +	    char *endp;
> +	    unsigned long int n = strtoul (l, &endp, 10);
> +	    if (l == endp)
> +	      {
> +		node_count = 1;
> +		break;
> +	      }
> +
> +	    unsigned long int m = n;
> +	    if (*endp == '-')
> +	      {
> +		l = endp + 1;
> +		m = strtoul (l, &endp, 10);
> +		if (l == endp)
> +		  {
> +		    node_count = 1;
> +		    break;
> +		  }
> +	      }
> +
> +	    node_count += m - n + 1;
> +
> +	    if (m >= max_node)
> +	      max_node = m;
> +
> +	    l = endp;
> +	    while (l < re && isspace (*l))
> +	      ++l;
> +	  }
> +	while (l < re);
> +
> +      __close_nocancel_nostatus (fd);
> +    }
> +
> +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> +     nodes is 1, set the maximium NUMA node number to 0.  */
> +  if (node_count == 1)
> +    max_node = 0;
> +
> +  unsigned int max_cpu = 0;
> +  unsigned int *physical_package_id_p = NULL;
> +
> +  if (max_node == 0)
> +    {
> +      /* There is at least 1 node.  */
> +      node_count = 1;
> +
> +      /* If NUMA is disabled, use physical_package_id instead.  */
> +      struct dirent **cpu_list;
> +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> +			    select_cpu, NULL);
> +      if (nprocs > 0)
> +	{
> +	  int i;
> +	  unsigned int *cpu_id_p = NULL;
> +
> +	  /* Find the maximum CPU number.  */
> +	  if (posix_memalign ((void **) &cpu_id_p,
> +			      __alignof__ (void *),
> +			      nprocs * sizeof (unsigned int)) == 0)

Using posix_memalign to get memory with the alignment of
__alignof__(void*) makes no sense. All allocations via malloc are
suitably aligned for any standard type.
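
Concretely (a sketch, using the same variable names as the patch):

  /* malloc/calloc already return storage suitably aligned for any
     standard type, so the two posix_memalign calls could simply be:  */
  unsigned int *cpu_id_p = malloc (nprocs * sizeof (unsigned int));
  unsigned int *physical_package_id_p
    = calloc (max_cpu + 1, sizeof (unsigned int));

(calloc also subsumes the later memset to zero.)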

> +	    {
> +	      for (i = 0; i < nprocs; i++)
> +		{
> +		  unsigned int cpu_id
> +		    = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
> +		  cpu_id_p[i] = cpu_id;
> +		  if (cpu_id > max_cpu)
> +		    max_cpu = cpu_id;
> +		}
> +
> +	      if (posix_memalign ((void **) &physical_package_id_p,
> +				  __alignof__ (void *),
> +				  ((max_cpu + 1)
> +				   * sizeof (unsigned int))) == 0)

Again.

> +		{
> +		  memset (physical_package_id_p, 0,
> +			  ((max_cpu + 1) * sizeof (unsigned int)));
> +
> +		  max_node = UINT_MAX;
> +
> +		  /* Get physical_package_id.  */
> +		  char path[(sizeof ("/sys/devices/system/cpu")
> +			     + 3 * sizeof (unsigned long int)
> +			     + sizeof ("/topology/physical_package_id"))];
> +		  for (i = 0; i < nprocs; i++)
> +		    {
> +		      struct dirent *d = cpu_list[i];
> +		      if (snprintf (path, sizeof (path),
> +				    "/sys/devices/system/cpu/%s/topology/physical_package_id",
> +				    d->d_name) > 0)

Are these sysfs pathnames documented as stable/permanent by the
kernel?

> +			{
> +			  fd = __open_nocancel (path, flags);
> +			  if (fd != -1)
> +			    {
> +			      if (__read_nocancel (fd, buffer,
> +						   buffer_size) > 0)
> +				{
> +				  char *endp;
> +				  unsigned long int package_id
> +				    = strtoul (buffer, &endp, 10);
> +				  if (package_id != ULONG_MAX
> +				      && *buffer != '\0'
> +				      && (*endp == '\0' || *endp == '\n'))
> +				    {
> +				      physical_package_id_p[cpu_id_p[i]]
> +					= package_id;
> +				      if (max_node == UINT_MAX)
> +					{
> +					  /* This is the first node.  */
> +					  max_node = package_id;
> +					}
> +				      else if (package_id != max_node)
> +					{
> +					  /* NB: We only need to know if
> +					     NODE_COUNT > 1.  */
> +					  node_count = 2;
> +					  if (package_id > max_node)
> +					    max_node = package_id;
> +					}
> +				    }
> +				}
> +			      __close_nocancel_nostatus (fd);
> +			    }
> +			}
> +
> +		      free (d);
> +		    }
> +		}
> +
> +	      free (cpu_id_p);
> +	    }
> +	  else
> +	    {
> +	      for (i = 0; i < nprocs; i++)
> +		free (cpu_list[i]);
> +	    }
> +
> +	  free (cpu_list);
> +	}
> +    }
> +
> +  if (physical_package_id_p != NULL && node_count == 1)
> +    {
> +      /* There is only one node.  No need for physical_package_id_p.  */
> +      free (physical_package_id_p);
> +      physical_package_id_p = NULL;
> +      max_cpu = 0;
> +    }
> +
> +  /* Allocate an array of struct numa_spinlock_info pointers to hold info
> +     for all NUMA nodes with NUMA node number from getcpu () as index.  */
> +  size_t size = (sizeof (struct numa_spinlock)
> +		 + ((max_node + 1)
> +		    * sizeof (struct numa_spinlock_info *)));
> +  struct numa_spinlock *lock;
> +  if (posix_memalign ((void **) &lock,
> +		      __alignof__ (struct numa_spinlock_info *), size))

Another gratuitous posix_memalign.
  
H.J. Lu Jan. 3, 2019, 8:54 p.m. UTC | #4
On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
>
> On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > From: "ling.ma" <ling.ml@antfin.com>
> >
> > On multi-socket systems, memory is shared across the entire system.
> > Data access to the local socket is much faster than the remote socket
> > and data access to the local core is faster than sibling cores on the
> > same socket.  For serialized workloads with conventional spinlock,
> > when there is high spinlock contention between threads, lock ping-pong
> > among sockets becomes the bottleneck and threads spend majority of
> > their time in spinlock overhead.
> >
> > On multi-socket systems, the keys to our NUMA spinlock performance
> > are to minimize cross-socket traffic as well as localize the serialized
> > workload to one core for execution.  The basic principles of NUMA
> > spinlock are mainly consisted of following approaches, which reduce
> > data movement and accelerate critical section, eventually give us
> > significant performance improvement.
>
> I question whether this belongs in glibc. It seems highly application-
> and kernel-specific. Is there a reason you wouldn't prefer to
> implement and maintain it in a library for use in the kind of
> application that needs it?

This is a good question.  On the other hand,  the current spinlock
in glibc hasn't been changed for many years.  It doesn't scale for
today's hardware.  Having a scalable spinlock in glibc is desirable.

> Some specific review inline below:
>
> > [...]
> > diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> > new file mode 100644
> > index 0000000..8ff4e1a
> > --- /dev/null
> > +++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> > @@ -0,0 +1,304 @@
> > +/* Initialization of NUMA spinlock.
> > +   Copyright (C) 2018 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include <assert.h>
> > +#include <ctype.h>
> > +#include <string.h>
> > +#include <dirent.h>
> > +#include <stdio.h>
> > +#include <limits.h>
> > +#ifdef _LIBC
> > +# include <not-cancel.h>
> > +#else
> > +# include <stdlib.h>
> > +# include <unistd.h>
> > +# include <fcntl.h>
> > +# define __open_nocancel             open
> > +# define __close_nocancel_nostatus   close
> > +# define __read_nocancel             read
> > +#endif
> > +
> > +#include "numa-spinlock-private.h"
> > +
> > +static char *
> > +next_line (int fd, char *const buffer, char **cp, char **re,
> > +        char *const buffer_end)
> > +{
> > +  char *res = *cp;
> > +  char *nl = memchr (*cp, '\n', *re - *cp);
> > +  if (nl == NULL)
> > +    {
> > +      if (*cp != buffer)
> > +     {
> > +       if (*re == buffer_end)
> > +         {
> > +           memmove (buffer, *cp, *re - *cp);
> > +           *re = buffer + (*re - *cp);
> > +           *cp = buffer;
> > +
> > +           ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
> > +           if (n < 0)
> > +             return NULL;
> > +
> > +           *re += n;
> > +
> > +           nl = memchr (*cp, '\n', *re - *cp);
> > +           while (nl == NULL && *re == buffer_end)
> > +             {
> > +               /* Truncate too long lines.  */
> > +               *re = buffer + 3 * (buffer_end - buffer) / 4;
> > +               n = __read_nocancel (fd, *re, buffer_end - *re);
> > +               if (n < 0)
> > +                 return NULL;
> > +
> > +               nl = memchr (*re, '\n', n);
> > +               **re = '\n';
> > +               *re += n;
> > +             }
> > +         }
> > +       else
> > +         nl = memchr (*cp, '\n', *re - *cp);
> > +
> > +       res = *cp;
> > +     }
> > +
> > +      if (nl == NULL)
> > +     nl = *re - 1;
> > +    }
> > +
> > +  *cp = nl + 1;
> > +  assert (*cp <= *re);
> > +
> > +  return res == *re ? NULL : res;
> > +}
>
> This looks like fragile duplication of stdio-like buffering logic
> that's not at all specific to this file. Does glibc have a policy on
> whether things needing this should use stdio or some other shared code
> rather than open-coding it like this?

This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c.  Should it
be exported in GLIBC_PRIVATE name space?

> > [...]
>
> > +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> > +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> > +   memory when it is no longer needed.  */
> > +
> > +struct numa_spinlock *
> > +numa_spinlock_alloc (void)
> > +{
> > +  const size_t buffer_size = 1024;
> > +  char buffer[buffer_size];
> > +  char *buffer_end = buffer + buffer_size;
> > +  char *cp = buffer_end;
> > +  char *re = buffer_end;
> > +
> > +  const int flags = O_RDONLY | O_CLOEXEC;
> > +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> > +  char *l;
> > +  unsigned int max_node = 0;
> > +  unsigned int node_count = 0;
> > +  if (fd != -1)
> > +    {
> > +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> > +      if (l != NULL)
> > +     do
> > +       {
> > +         char *endp;
> > +         unsigned long int n = strtoul (l, &endp, 10);
> > +         if (l == endp)
> > +           {
> > +             node_count = 1;
> > +             break;
> > +           }
> > +
> > +         unsigned long int m = n;
> > +         if (*endp == '-')
> > +           {
> > +             l = endp + 1;
> > +             m = strtoul (l, &endp, 10);
> > +             if (l == endp)
> > +               {
> > +                 node_count = 1;
> > +                 break;
> > +               }
> > +           }
> > +
> > +         node_count += m - n + 1;
> > +
> > +         if (m >= max_node)
> > +           max_node = m;
> > +
> > +         l = endp;
> > +         while (l < re && isspace (*l))
> > +           ++l;
> > +       }
> > +     while (l < re);
> > +
> > +      __close_nocancel_nostatus (fd);
> > +    }
> > +
> > +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> > +     nodes is 1, set the maximium NUMA node number to 0.  */
> > +  if (node_count == 1)
> > +    max_node = 0;
> > +
> > +  unsigned int max_cpu = 0;
> > +  unsigned int *physical_package_id_p = NULL;
> > +
> > +  if (max_node == 0)
> > +    {
> > +      /* There is at least 1 node.  */
> > +      node_count = 1;
> > +
> > +      /* If NUMA is disabled, use physical_package_id instead.  */
> > +      struct dirent **cpu_list;
> > +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > +                         select_cpu, NULL);
> > +      if (nprocs > 0)
> > +     {
> > +       int i;
> > +       unsigned int *cpu_id_p = NULL;
> > +
> > +       /* Find the maximum CPU number.  */
> > +       if (posix_memalign ((void **) &cpu_id_p,
> > +                           __alignof__ (void *),
> > +                           nprocs * sizeof (unsigned int)) == 0)
>
> Using posix_memalign to get memory with the alignment of
> __alignof__(void*) makes no sense. All allocations via malloc are
> suitably aligned for any standard type.

Does glibc prefer malloc over posix_memalign?

> > +         {
> > +           for (i = 0; i < nprocs; i++)
> > +             {
> > +               unsigned int cpu_id
> > +                 = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
> > +               cpu_id_p[i] = cpu_id;
> > +               if (cpu_id > max_cpu)
> > +                 max_cpu = cpu_id;
> > +             }
> > +
> > +           if (posix_memalign ((void **) &physical_package_id_p,
> > +                               __alignof__ (void *),
> > +                               ((max_cpu + 1)
> > +                                * sizeof (unsigned int))) == 0)
>
> Again.
>
> > +             {
> > +               memset (physical_package_id_p, 0,
> > +                       ((max_cpu + 1) * sizeof (unsigned int)));
> > +
> > +               max_node = UINT_MAX;
> > +
> > +               /* Get physical_package_id.  */
> > +               char path[(sizeof ("/sys/devices/system/cpu")
> > +                          + 3 * sizeof (unsigned long int)
> > +                          + sizeof ("/topology/physical_package_id"))];
> > +               for (i = 0; i < nprocs; i++)
> > +                 {
> > +                   struct dirent *d = cpu_list[i];
> > +                   if (snprintf (path, sizeof (path),
> > +                                 "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > +                                 d->d_name) > 0)
>
> Are these sysfs pathnames documented as stable/permanent by the
> kernel?

I believe so.

> > +                     {
> > +                       fd = __open_nocancel (path, flags);
> > +                       if (fd != -1)
> > +                         {
> > +                           if (__read_nocancel (fd, buffer,
> > +                                                buffer_size) > 0)
> > +                             {
> > +                               char *endp;
> > +                               unsigned long int package_id
> > +                                 = strtoul (buffer, &endp, 10);
> > +                               if (package_id != ULONG_MAX
> > +                                   && *buffer != '\0'
> > +                                   && (*endp == '\0' || *endp == '\n'))
> > +                                 {
> > +                                   physical_package_id_p[cpu_id_p[i]]
> > +                                     = package_id;
> > +                                   if (max_node == UINT_MAX)
> > +                                     {
> > +                                       /* This is the first node.  */
> > +                                       max_node = package_id;
> > +                                     }
> > +                                   else if (package_id != max_node)
> > +                                     {
> > +                                       /* NB: We only need to know if
> > +                                          NODE_COUNT > 1.  */
> > +                                       node_count = 2;
> > +                                       if (package_id > max_node)
> > +                                         max_node = package_id;
> > +                                     }
> > +                                 }
> > +                             }
> > +                           __close_nocancel_nostatus (fd);
> > +                         }
> > +                     }
> > +
> > +                   free (d);
> > +                 }
> > +             }
> > +
> > +           free (cpu_id_p);
> > +         }
> > +       else
> > +         {
> > +           for (i = 0; i < nprocs; i++)
> > +             free (cpu_list[i]);
> > +         }
> > +
> > +       free (cpu_list);
> > +     }
> > +    }
> > +
> > +  if (physical_package_id_p != NULL && node_count == 1)
> > +    {
> > +      /* There is only one node.  No need for physical_package_id_p.  */
> > +      free (physical_package_id_p);
> > +      physical_package_id_p = NULL;
> > +      max_cpu = 0;
> > +    }
> > +
> > +  /* Allocate an array of struct numa_spinlock_info pointers to hold info
> > +     for all NUMA nodes with NUMA node number from getcpu () as index.  */
> > +  size_t size = (sizeof (struct numa_spinlock)
> > +              + ((max_node + 1)
> > +                 * sizeof (struct numa_spinlock_info *)));
> > +  struct numa_spinlock *lock;
> > +  if (posix_memalign ((void **) &lock,
> > +                   __alignof__ (struct numa_spinlock_info *), size))
>
> Another gratuitous posix_memalign.
  
Rich Felker Jan. 3, 2019, 9:21 p.m. UTC | #5
On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> >
> > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > From: "ling.ma" <ling.ml@antfin.com>
> > >
> > > On multi-socket systems, memory is shared across the entire system.
> > > Data access to the local socket is much faster than the remote socket
> > > and data access to the local core is faster than sibling cores on the
> > > same socket.  For serialized workloads with conventional spinlock,
> > > when there is high spinlock contention between threads, lock ping-pong
> > > among sockets becomes the bottleneck and threads spend majority of
> > > their time in spinlock overhead.
> > >
> > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > are to minimize cross-socket traffic as well as localize the serialized
> > > workload to one core for execution.  The basic principles of NUMA
> > > spinlock are mainly consisted of following approaches, which reduce
> > > data movement and accelerate critical section, eventually give us
> > > significant performance improvement.
> >
> > I question whether this belongs in glibc. It seems highly application-
> > and kernel-specific. Is there a reason you wouldn't prefer to
> > implement and maintain it in a library for use in the kind of
> > application that needs it?
> 
> This is a good question.  On the other hand,  the current spinlock
> in glibc hasn't been changed for many years.  It doesn't scale for
> today's hardware.  Having a scalable spinlock in glibc is desirable.

"Scalable spinlock" is something of an oxymoron. Spinlocks are for
situations where contention is extremely rare, since they inherently
blow up badly under contention. If this is happening it means you
wanted a mutex not a spinlock.

> > This looks like fragile duplication of stdio-like buffering logic
> > that's not at all specific to this file. Does glibc have a policy on
> > whether things needing this should use stdio or some other shared code
> > rather than open-coding it like this?
> 
> This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c.  Should it
> be exported in GLIBC_PRIVATE name space?

Possibly. I don't have a strong opinion here. At least it's good to
know it's copied from code elsewhere that's believed to be
correct/safe.

> 
> > > [...]
> >
> > > +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> > > +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> > > +   memory when it is no longer needed.  */
> > > +
> > > +struct numa_spinlock *
> > > +numa_spinlock_alloc (void)
> > > +{
> > > +  const size_t buffer_size = 1024;
> > > +  char buffer[buffer_size];
> > > +  char *buffer_end = buffer + buffer_size;
> > > +  char *cp = buffer_end;
> > > +  char *re = buffer_end;
> > > +
> > > +  const int flags = O_RDONLY | O_CLOEXEC;
> > > +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> > > +  char *l;
> > > +  unsigned int max_node = 0;
> > > +  unsigned int node_count = 0;
> > > +  if (fd != -1)
> > > +    {
> > > +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> > > +      if (l != NULL)
> > > +     do
> > > +       {
> > > +         char *endp;
> > > +         unsigned long int n = strtoul (l, &endp, 10);
> > > +         if (l == endp)
> > > +           {
> > > +             node_count = 1;
> > > +             break;
> > > +           }
> > > +
> > > +         unsigned long int m = n;
> > > +         if (*endp == '-')
> > > +           {
> > > +             l = endp + 1;
> > > +             m = strtoul (l, &endp, 10);
> > > +             if (l == endp)
> > > +               {
> > > +                 node_count = 1;
> > > +                 break;
> > > +               }
> > > +           }
> > > +
> > > +         node_count += m - n + 1;
> > > +
> > > +         if (m >= max_node)
> > > +           max_node = m;
> > > +
> > > +         l = endp;
> > > +         while (l < re && isspace (*l))
> > > +           ++l;
> > > +       }
> > > +     while (l < re);
> > > +
> > > +      __close_nocancel_nostatus (fd);
> > > +    }
> > > +
> > > +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> > > +     nodes is 1, set the maximium NUMA node number to 0.  */
> > > +  if (node_count == 1)
> > > +    max_node = 0;
> > > +
> > > +  unsigned int max_cpu = 0;
> > > +  unsigned int *physical_package_id_p = NULL;
> > > +
> > > +  if (max_node == 0)
> > > +    {
> > > +      /* There is at least 1 node.  */
> > > +      node_count = 1;
> > > +
> > > +      /* If NUMA is disabled, use physical_package_id instead.  */
> > > +      struct dirent **cpu_list;
> > > +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > > +                         select_cpu, NULL);
> > > +      if (nprocs > 0)
> > > +     {
> > > +       int i;
> > > +       unsigned int *cpu_id_p = NULL;
> > > +
> > > +       /* Find the maximum CPU number.  */
> > > +       if (posix_memalign ((void **) &cpu_id_p,
> > > +                           __alignof__ (void *),
> > > +                           nprocs * sizeof (unsigned int)) == 0)
> >
> > Using posix_memalign to get memory with the alignment of
> > __alignof__(void*) makes no sense. All allocations via malloc are
> > suitably aligned for any standard type.
> 
> Does glibc prefer malloc over posix_memalign?

If not, I think it *should*. Use of posix_memalign suggests to the
reader that some nonstandard alignment is needed, prompting one to ask
"what?" only to figure out it was gratuitous.

What's worse -- and I missed this on the first pass --
posix_memalign's signature is broken by design and almost universally
gets the programmer to invoke undefined behavior, which has happened
here. cpu_id_p has type unsigned int *, but due to the cast,
posix_memalign accesses it as if it had type void *, which is an
aliasing violation. For this reason alone, I would go so far as to say
the function should never be used unless there's no way to avoid it,
and even then it should be wrapped with a function that *returns* the
pointer as void* so that this error can't be made.
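
A minimal sketch of the kind of wrapper meant here (the name is illustrative,
not an existing glibc interface): the allocation is returned as void *, so no
incompatibly typed pointer is ever passed to posix_memalign.

#include <stdlib.h>

/* Hypothetical wrapper: return the allocation as void *, or NULL on
   failure, so callers never pass a pointer of the wrong type to
   posix_memalign.  */
static void *
memalign_wrapper (size_t alignment, size_t size)
{
  void *p;
  if (posix_memalign (&p, alignment, size) != 0)
    return NULL;
  return p;
}

A caller would then write something like
  unsigned int *cpu_id_p = memalign_wrapper (alignment, nprocs * sizeof (unsigned int));
with no cast and no aliasing violation.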

> > > +                   if (snprintf (path, sizeof (path),
> > > +                                 "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > > +                                 d->d_name) > 0)
> >
> > Are these sysfs pathnames documented as stable/permanent by the
> > kernel?
> 
> I believe so.

OK. It is worth checking though since otherwise you have an interface
that will suddenly break when things change on the kernel side.

Rich
  
H.J. Lu Jan. 3, 2019, 9:27 p.m. UTC | #6
On Thu, Jan 3, 2019 at 1:21 PM Rich Felker <dalias@libc.org> wrote:
>
> On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> > >
> > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > > From: "ling.ma" <ling.ml@antfin.com>
> > > >
> > > > On multi-socket systems, memory is shared across the entire system.
> > > > Data access to the local socket is much faster than the remote socket
> > > > and data access to the local core is faster than sibling cores on the
> > > > same socket.  For serialized workloads with conventional spinlock,
> > > > when there is high spinlock contention between threads, lock ping-pong
> > > > among sockets becomes the bottleneck and threads spend majority of
> > > > their time in spinlock overhead.
> > > >
> > > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > > are to minimize cross-socket traffic as well as localize the serialized
> > > > workload to one core for execution.  The basic principles of NUMA
> > > > spinlock are mainly consisted of following approaches, which reduce
> > > > data movement and accelerate critical section, eventually give us
> > > > significant performance improvement.
> > >
> > > I question whether this belongs in glibc. It seems highly application-
> > > and kernel-specific. Is there a reason you wouldn't prefer to
> > > implement and maintain it in a library for use in the kind of
> > > application that needs it?
> >
> > This is a good question.  On the other hand,  the current spinlock
> > in glibc hasn't been changed for many years.  It doesn't scale for
> > today's hardware.  Having a scalable spinlock in glibc is desirable.
>
> "Scalable spinlock" is something of an oxymoron. Spinlocks are for
> situations where contention is extremely rare, since they inherently
> blow up badly under contention. If this is happening it means you
> wanted a mutex not a spinlock.

The critical region serialized by the spinlock is very small.  The overhead
of a mutex is much bigger than the critical region itself.

> > > This looks like fragile duplication of stdio-like buffering logic
> > > that's not at all specific to this file. Does glibc have a policy on
> > > whether things needing this should use stdio or some other shared code
> > > rather than open-coding it like this?
> >
> > This is borrowed from sysdeps/unix/sysv/linux/getsysstats.c.  Should it
> > be exported in GLIBC_PRIVATE name space?
>
> Possibly. I don't have a strong opinion here. At least it's good to
> know it's copied from code elsewhere that's believed to be
> correct/safe.
>
> >
> > > > [...]
> > >
> > > > +/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
> > > > +   call numa_spinlock_free on the NUMA spinlock pointer to free the
> > > > +   memory when it is no longer needed.  */
> > > > +
> > > > +struct numa_spinlock *
> > > > +numa_spinlock_alloc (void)
> > > > +{
> > > > +  const size_t buffer_size = 1024;
> > > > +  char buffer[buffer_size];
> > > > +  char *buffer_end = buffer + buffer_size;
> > > > +  char *cp = buffer_end;
> > > > +  char *re = buffer_end;
> > > > +
> > > > +  const int flags = O_RDONLY | O_CLOEXEC;
> > > > +  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
> > > > +  char *l;
> > > > +  unsigned int max_node = 0;
> > > > +  unsigned int node_count = 0;
> > > > +  if (fd != -1)
> > > > +    {
> > > > +      l = next_line (fd, buffer, &cp, &re, buffer_end);
> > > > +      if (l != NULL)
> > > > +     do
> > > > +       {
> > > > +         char *endp;
> > > > +         unsigned long int n = strtoul (l, &endp, 10);
> > > > +         if (l == endp)
> > > > +           {
> > > > +             node_count = 1;
> > > > +             break;
> > > > +           }
> > > > +
> > > > +         unsigned long int m = n;
> > > > +         if (*endp == '-')
> > > > +           {
> > > > +             l = endp + 1;
> > > > +             m = strtoul (l, &endp, 10);
> > > > +             if (l == endp)
> > > > +               {
> > > > +                 node_count = 1;
> > > > +                 break;
> > > > +               }
> > > > +           }
> > > > +
> > > > +         node_count += m - n + 1;
> > > > +
> > > > +         if (m >= max_node)
> > > > +           max_node = m;
> > > > +
> > > > +         l = endp;
> > > > +         while (l < re && isspace (*l))
> > > > +           ++l;
> > > > +       }
> > > > +     while (l < re);
> > > > +
> > > > +      __close_nocancel_nostatus (fd);
> > > > +    }
> > > > +
> > > > +  /* NB: Some NUMA nodes may not be available, if the number of NUMA
> > > > +     nodes is 1, set the maximium NUMA node number to 0.  */
> > > > +  if (node_count == 1)
> > > > +    max_node = 0;
> > > > +
> > > > +  unsigned int max_cpu = 0;
> > > > +  unsigned int *physical_package_id_p = NULL;
> > > > +
> > > > +  if (max_node == 0)
> > > > +    {
> > > > +      /* There is at least 1 node.  */
> > > > +      node_count = 1;
> > > > +
> > > > +      /* If NUMA is disabled, use physical_package_id instead.  */
> > > > +      struct dirent **cpu_list;
> > > > +      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
> > > > +                         select_cpu, NULL);
> > > > +      if (nprocs > 0)
> > > > +     {
> > > > +       int i;
> > > > +       unsigned int *cpu_id_p = NULL;
> > > > +
> > > > +       /* Find the maximum CPU number.  */
> > > > +       if (posix_memalign ((void **) &cpu_id_p,
> > > > +                           __alignof__ (void *),
> > > > +                           nprocs * sizeof (unsigned int)) == 0)
> > >
> > > Using posix_memalign to get memory with the alignment of
> > > __alignof__(void*) makes no sense. All allocations via malloc are
> > > suitably aligned for any standard type.
> >
> > Does glibc prefer malloc over posix_memalign?
>
> If not, I think it *should*. Use of posix_memalign suggests to the
> reader that some nonstandard alignment is needed, prompting one to ask
> "what?" only to figure out it was gratuitous.
>
> What's worse -- and I missed this on the first pass --
> posix_memalign's signature is broken by design and almost universally
> gets the programmer to invoke undefined behavior, which has happened
> here. cpu_id_p has type unsigned int *, but due to the cast,
> posix_memalign accesses it as if it had type void *, which is an
> aliasing violation. For this reason alone, I would go so far as to say
> the function should never be used unless there's no way to avoid it,
> and even then it should be wrapped with a function that *returns* the
> pointer as void* so that this error can't be made.

We will change it to malloc.

> > > > +                   if (snprintf (path, sizeof (path),
> > > > +                                 "/sys/devices/system/cpu/%s/topology/physical_package_id",
> > > > +                                 d->d_name) > 0)
> > >
> > > Are these sysfs pathnames documented as stable/permanent by the
> > > kernel?
> >
> > I believe so.
>
> OK. It is worth checking though since otherwise you have an interface
> that will suddenly break when things change on the kernel side.
>

Documentation/cputopology.txt has

===========================================
How CPU topology info is exported via sysfs
===========================================

Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo output of some architectures:

1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:

physical package id of cpuX. Typically corresponds to a physical
socket number, but the actual value is architecture and platform
dependent.
....
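
For reference, the documented attribute can be read from outside glibc with
plain stdio; an illustrative sketch (not the patch's glibc-internal code):

#include <stdio.h>

/* Read the physical package id of CPU, or return -1 on error.  */
static int
read_physical_package_id (unsigned int cpu)
{
  char path[128];
  snprintf (path, sizeof (path),
            "/sys/devices/system/cpu/cpu%u/topology/physical_package_id",
            cpu);
  FILE *f = fopen (path, "r");
  if (f == NULL)
    return -1;
  int id = -1;
  if (fscanf (f, "%d", &id) != 1)
    id = -1;
  fclose (f);
  return id;
}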
  
马凌(彦军) Jan. 4, 2019, 4:13 a.m. UTC | #7
On 2019/1/3 at 10:52 PM, "Szabolcs Nagy" <Szabolcs.Nagy@arm.com> wrote:

    On 03/01/2019 05:35, 马凌(彦军) wrote:
    >      create mode 100644 manual/examples/numa-spinlock.c
    >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
    >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
    >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
    >      create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
    >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
    >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
    >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
    
    as far as i can tell the new code is generic
    (other than the presence of efficient getcpu),
    so i think the test should be generic too.

    
    >     --- /dev/null
    >     +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
    >     @@ -0,0 +1,384 @@
    ...
    >     +/* Check spinlock overhead with large number threads.  Critical region is
    >     +   very smmall.  Critical region + spinlock overhead aren't noticeable
    >     +   when number of threads is small.  When thread number increases,
    >     +   spinlock overhead become the bottleneck.  It shows up in wall time
    >     +   of thread execution.  */
    
    yeah, this is not easy to do in a generic way, i think
    even on x86 such measurement is problematic, you don't
    know what goes on a system (or vm) when the glibc test
    is running.
    
    but doing precise timing is not that important for
    checking the correctness of the locks, so i think a
    simplified version can be generic test code.

Ling: That is a good idea; we will try to send a generic test case in the next version.

Thanks
Ling
  
Torvald Riegel Jan. 14, 2019, 10:40 p.m. UTC | #8
On Thu, 2019-01-03 at 12:54 -0800, H.J. Lu wrote:
> On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> > 
> > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > From: "ling.ma" <ling.ml@antfin.com>
> > > 
> > > On multi-socket systems, memory is shared across the entire system.
> > > Data access to the local socket is much faster than the remote socket
> > > and data access to the local core is faster than sibling cores on the
> > > same socket.  For serialized workloads with conventional spinlock,
> > > when there is high spinlock contention between threads, lock ping-pong
> > > among sockets becomes the bottleneck and threads spend majority of
> > > their time in spinlock overhead.
> > > 
> > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > are to minimize cross-socket traffic as well as localize the serialized
> > > workload to one core for execution.  The basic principles of NUMA
> > > spinlock are mainly consisted of following approaches, which reduce
> > > data movement and accelerate critical section, eventually give us
> > > significant performance improvement.
> > 
> > I question whether this belongs in glibc. It seems highly application-
> > and kernel-specific. Is there a reason you wouldn't prefer to
> > implement and maintain it in a library for use in the kind of
> > application that needs it?
> 
> This is a good question.  On the other hand,  the current spinlock
> in glibc hasn't been changed for many years.  It doesn't scale for
> today's hardware.  Having a scalable spinlock in glibc is desirable.

I agree the spinlocks need to improve, but let's do first things first: add
proper back-off.  The biggest problem there is finding a way to select and
maintain the tuning values that is simple for the glibc developers; there
should be good benchmarks that can be used to automatically check that the
tuning values make sense, and adaptation at runtime would be even better
(if it can be shown to improve performance).

There are also several other synchronization algorithms in glibc where
proper (limited) spinning and back-off would help.  Look for comments such
as "TODO Back-off." throughout the code, in particular the synchronization
code I have rewritten in the past.  And this applies to the normal mutexes
too, obviously.
  
Torvald Riegel Jan. 14, 2019, 11:18 p.m. UTC | #9
On Thu, 2019-01-03 at 16:21 -0500, Rich Felker wrote:
> On Thu, Jan 03, 2019 at 12:54:18PM -0800, H.J. Lu wrote:
> > On Thu, Jan 3, 2019 at 12:43 PM Rich Felker <dalias@libc.org> wrote:
> > > 
> > > On Wed, Dec 26, 2018 at 10:50:19AM +0800, Ma Ling wrote:
> > > > From: "ling.ma" <ling.ml@antfin.com>
> > > > 
> > > > On multi-socket systems, memory is shared across the entire system.
> > > > Data access to the local socket is much faster than the remote socket
> > > > and data access to the local core is faster than sibling cores on the
> > > > same socket.  For serialized workloads with conventional spinlock,
> > > > when there is high spinlock contention between threads, lock ping-pong
> > > > among sockets becomes the bottleneck and threads spend majority of
> > > > their time in spinlock overhead.
> > > > 
> > > > On multi-socket systems, the keys to our NUMA spinlock performance
> > > > are to minimize cross-socket traffic as well as localize the serialized
> > > > workload to one core for execution.  The basic principles of NUMA
> > > > spinlock are mainly consisted of following approaches, which reduce
> > > > data movement and accelerate critical section, eventually give us
> > > > significant performance improvement.
> > > 
> > > I question whether this belongs in glibc. It seems highly application-
> > > and kernel-specific. Is there a reason you wouldn't prefer to
> > > implement and maintain it in a library for use in the kind of
> > > application that needs it?
> > 
> > This is a good question.  On the other hand,  the current spinlock
> > in glibc hasn't been changed for many years.  It doesn't scale for
> > today's hardware.  Having a scalable spinlock in glibc is desirable.
> 
> "Scalable spinlock" is something of an oxymoron.

No, that's not true at all.  Most high-performance shared-memory
synchronization constructs (on typical HW we have today) will do some kind
of spinning (and back-off), and there's nothing wrong about it.  This can
scale very well. 

> Spinlocks are for
> situations where contention is extremely rare,

No, the question is rather whether the program needs blocking through the
OS (for performance, or for semantics such as PI) or not.  Energy may be
another factor.  For example, glibc's current mutexes don't scale well on
short critical sections because there's not enough spinning being done.

In particular, in cases where there aren't more threads than cores (ie,
what lots of high-performance parallel applications will ensure), it's
better to just spin (and back off) than to eagerly block using the OS.
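
To illustrate the trade-off being described, a minimal Linux-only sketch (not
glibc's mutex code) of a lock that spins a bounded number of times before
blocking in the kernel; the names and the spin bound are made up:

#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* SPIN_LIMIT is a made-up bound; a real implementation would tune it
   (or adapt it at runtime).  */
#define SPIN_LIMIT 100

static void
spin_then_block_lock (atomic_int *lock)
{
  for (;;)
    {
      for (int i = 0; i < SPIN_LIMIT; i++)
        {
          int expected = 0;
          if (atomic_load_explicit (lock, memory_order_relaxed) == 0
              && atomic_compare_exchange_weak_explicit
                   (lock, &expected, 1,
                    memory_order_acquire, memory_order_relaxed))
            return;
        }
      /* Still contended after spinning: block until the value changes.  */
      syscall (SYS_futex, lock, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
    }
}

static void
spin_then_block_unlock (atomic_int *lock)
{
  atomic_store_explicit (lock, 0, memory_order_release);
  /* Wake one blocked waiter, if any (a real lock would track waiters
     to avoid the unconditional syscall).  */
  syscall (SYS_futex, lock, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}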

> since they inherently
> blow up badly under contention.

Did I mention back-off before? ;)

> If this is happening it means you
> wanted a mutex not a spinlock.

It wouldn't make that much sense to have different interfaces for those if
it weren't for Pthreads mutexes being dynamically typed, unfortunately.

I believe that a high-performance default lock (C++11 or C11 semantics,
non-process-shared) would beat both our spinlocks and mutex implementations
that we have today.

We could tune the C11 mutex implementation, but the number of users would
still be small in the foreseeable future, I guess.  Tuning libstdc++'s
mutex implementation (so that it doesn't just use nptl mutexes but does
something that's closer to the state of the art) would reach more users.
  
Torvald Riegel Jan. 14, 2019, 11:26 p.m. UTC | #10
On Wed, 2018-12-26 at 10:50 +0800, Ma Ling wrote:
> 2. Critical Section Integration (CSI)
> Essentially spinlock is similar to that one core complete critical
> sections one by one. So when contention happen, the serialized works
> are sent to the core who is the lock owner and responsible to execute
> them, that can save much time and power, because all shared data are
> located in private cache of the lock owner.

I agree that this can improve performance because of potentially both
increasing data locality for the critical sections themselves and
decreasing contention in the lock.  However, this will mess with thread-
local storage and assumptions about what OS thread a critical section runs
on.

Maybe it's better to first experiment with this change in semantics in C++;
ISO C++ Study Group 1 on parallelism and concurrency is much deeper into
this topic than the glibc community is.  This isn't really a typical lock
anymore when you do that, but rather a special kind of execution service
for small functions; the study group has talked about executors that
execute in guaranteed sequential fashion.
  
Kemi Wang Jan. 15, 2019, 2:28 a.m. UTC | #11
>> "Scalable spinlock" is something of an oxymoron.
> 
> No, that's not true at all.  Most high-performance shared-memory
> synchronization constructs (on typical HW we have today) will do some kind
> of spinning (and back-off), and there's nothing wrong about it.  This can
> scale very well. 
> 
>> Spinlocks are for
>> situations where contention is extremely rare,
> 
> No, the question is rather whether the program needs blocking through the
> OS (for performance, or for semantics such as PI) or not.  Energy may be
> another factor.  For example, glibc's current mutexes don't scale well on
> short critical because there's not enough spinning being done.
> 

Yes. That's why we proposed the pthread.mutex.spin_count tunable interface before.
But that's not enough. When the tunable is not the bottleneck, the simple busy-waiting
algorithm of the current adaptive mutex is the major negative factor that degrades mutex
performance. That's why I proposed using an MCS-based spin-waiting algorithm for the
adaptive mutex.

https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html

Also, with a very small critical section in the workload, this new type of mutex
with the GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like an MCS spinlock, and performs
much better than the original spinlock.

So, some day, if the adaptive mutex is tuned well enough, it should act like an
MCS spinlock (or NUMA spinlock) when the workload has a small critical section, and
perform like a normal mutex when the critical section is too big for spin-waiting.
  
Kemi Wang Jan. 15, 2019, 2:52 a.m. UTC | #12
On 2018/12/26 at 10:50 AM, Ma Ling wrote:
> From: "ling.ma" <ling.ml@antfin.com>
> 
> On multi-socket systems, memory is shared across the entire system.
> Data access to the local socket is much faster than the remote socket
> and data access to the local core is faster than sibling cores on the
> same socket.  For serialized workloads with conventional spinlock,
> when there is high spinlock contention between threads, lock ping-pong
> among sockets becomes the bottleneck and threads spend majority of
> their time in spinlock overhead.
> 
> On multi-socket systems, the keys to our NUMA spinlock performance
> are to minimize cross-socket traffic as well as localize the serialized
> workload to one core for execution.  The basic principles of NUMA
> spinlock are mainly consisted of following approaches, which reduce
> data movement and accelerate critical section, eventually give us
> significant performance improvement.
> 
> 1. MCS spinlock
> MCS spinlock help us to reduce the useless lock movement in the
> spinning state.  This paper provides a good description for this
> kind of lock:

That's not accurate.
Both the generic spinlock and the x86 version already use the
test and test_and_set approach to reduce useless lock movement in the
spinning state.

See
glibc/nptl/pthread_spin_lock.c
glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S

What the MCS spinlock really helps with is accelerating lock release and lock
acquisition by greatly reducing cache line bouncing.
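
For reference, a minimal sketch of the test-and-test-and-set pattern being
referred to (illustrative C11 code, not the actual nptl sources):

#include <stdatomic.h>

static void
ttas_lock (atomic_int *lock)
{
  /* After a failed attempt, spin with plain loads until the lock looks
     free, then retry the atomic exchange; the read-only inner loop keeps
     the cache line in shared state instead of bouncing it around.  */
  while (atomic_exchange_explicit (lock, 1, memory_order_acquire) != 0)
    while (atomic_load_explicit (lock, memory_order_relaxed) != 0)
      ;
}

static void
ttas_unlock (atomic_int *lock)
{
  atomic_store_explicit (lock, 0, memory_order_release);
}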

> NUMA spinlock can greatly speed up critical section on multi-socket
> systems.  It should improve spinlock performance on all multi-socket
> systems. 
> 

It is beyond question that the NUMA spinlock helps a lot in the case of heavy
lock contention. But we should also present data for the uncontended and
slightly contended cases.

The extra code complexity is expected to degrade lock performance a bit in the
slightly contended case; I would like to see the data for that.

Also, lock starvation would be possible if the running core is always busy with
heavy lock contention. More explanation is expected.
  
马凌(彦军) Jan. 15, 2019, 4:28 a.m. UTC | #13
> 1. MCS spinlock
    > MCS spinlock help us to reduce the useless lock movement in the
    > spinning state.  This paper provides a good description for this
    > kind of lock:
    
    That's not the truth.
    No matter generic spinlock(or x86 version) spinlock has used the way of 
    test and test_and_set to reduce the useless lock movement in the spinning 
    state.
    
    See
    glibc/nptl/pthread_spin_lock.c
    glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S
  
    What MCS-spinlock really helps is to accelerate lock release and lock acquisition
    by reducing lots of cache line bouncing.
  Ling:
	Thanks for your comments.
	1. In the spinning stage, the current generic spinlock reads the lock, so the lock line sits in the private caches of the spinning cores in shared state.
	   When one core updates the lock, every core holding that shared cache line is notified, invalidates the line, and responds to the cache controller;
	   transmitting those messages costs a lot of time, and an MCS lock reduces it.
	2. atomic_compare_exchange_weak_acquire causes plenty of useless lock movement, because only one core can acquire the lock.

	MCS hands the lock from one waiter to the next through per-thread private data, so we say it reduces the useless lock movement of scenarios 1 and 2 above (see the sketch below).
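
A minimal MCS lock sketch in C11 atomics (illustrative only, not the patch's
implementation): each waiter spins on its own node, and release hands the lock
directly to the next waiter, so the lock word itself is not bounced among
spinning cores.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node
{
  _Atomic (struct mcs_node *) next;
  atomic_bool locked;
};

typedef _Atomic (struct mcs_node *) mcs_lock_t;

static void
mcs_lock (mcs_lock_t *lock, struct mcs_node *self)
{
  atomic_store_explicit (&self->next, NULL, memory_order_relaxed);
  atomic_store_explicit (&self->locked, true, memory_order_relaxed);
  struct mcs_node *prev
    = atomic_exchange_explicit (lock, self, memory_order_acq_rel);
  if (prev != NULL)
    {
      /* Link behind the previous waiter, then spin on our own node.  */
      atomic_store_explicit (&prev->next, self, memory_order_release);
      while (atomic_load_explicit (&self->locked, memory_order_acquire))
        ;
    }
}

static void
mcs_unlock (mcs_lock_t *lock, struct mcs_node *self)
{
  struct mcs_node *next
    = atomic_load_explicit (&self->next, memory_order_acquire);
  if (next == NULL)
    {
      /* No visible successor: try to return the lock to the free state.  */
      struct mcs_node *expected = self;
      if (atomic_compare_exchange_strong_explicit
            (lock, &expected, NULL,
             memory_order_acq_rel, memory_order_relaxed))
        return;
      /* A successor is in the middle of enqueueing; wait for the link.  */
      while ((next = atomic_load_explicit (&self->next,
                                           memory_order_acquire)) == NULL)
        ;
    }
  atomic_store_explicit (&next->locked, false, memory_order_release);
}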

    > NUMA spinlock can greatly speed up critical section on multi-socket
    > systems.  It should improve spinlock performance on all multi-socket
    > systems. 
    > 
    
    This is out-of-question that NUMA spinlock helps a lot in case of heavy lock
    contention. But, we should also propose the data for non-contented case and slight
    contended case.

   Ling: we will also present a report on the hash workload, as for the adaptive MCS patch.
  
马凌(彦军) Jan. 15, 2019, 4:48 a.m. UTC | #14
On 2019/1/15 at 7:26 AM, "Torvald Riegel" <triegel@redhat.com> wrote:

    On Wed, 2018-12-26 at 10:50 +0800, Ma Ling wrote:
    > 2. Critical Section Integration (CSI)
    > Essentially spinlock is similar to that one core complete critical
    > sections one by one. So when contention happen, the serialized works
    > are sent to the core who is the lock owner and responsible to execute
    > them, that can save much time and power, because all shared data are
    > located in private cache of the lock owner.
    
    I agree that this can improve performance because of potentially both
    increasing data locality for the critical sections themselves and
    decreasing contention in the lock.  However, this will mess with thread-
    local storage and assumptions about what OS thread a critical section runs
    on.
  Ling: yes, we have to consider that when applying the NUMA spinlock.
    Maybe it's better to first experiment with this change in semantics in C++;
    ISO C++ Study Group 1 on parallelism and concurrency is much deeper into
    this topic than the glibc community is.  This isn't really a typical lock
    anymore when you do that, but rather a special kind of execution service
    for small functions; the study group has talked about executors that
    execute in guaranteed sequential fashion.
 Ling: thanks for your suggestion; we would like to think about it seriously.
  
Torvald Riegel Jan. 15, 2019, 12:36 p.m. UTC | #15
On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
> > > "Scalable spinlock" is something of an oxymoron.
> > 
> > No, that's not true at all.  Most high-performance shared-memory
> > synchronization constructs (on typical HW we have today) will do some kind
> > of spinning (and back-off), and there's nothing wrong about it.  This can
> > scale very well. 
> > 
> > > Spinlocks are for
> > > situations where contention is extremely rare,
> > 
> > No, the question is rather whether the program needs blocking through the
> > OS (for performance, or for semantics such as PI) or not.  Energy may be
> > another factor.  For example, glibc's current mutexes don't scale well on
> > short critical because there's not enough spinning being done.
> > 
> 
> yes. That's why we need pthread.mutex.spin_count tunable interface before.

I don't think we need the tunable interface before that.  Where we need to
improve performance most is for applications that don't want to bother
tuning their mutexes -- that's where the broadest gains are overall, I
think.

In turn, that means that we have spinning and back-off that give good
average-case performance -- whether that's through automatic tuning of
those two things at runtime, or through static default values that we do
regular performance checks for in the glibc community. 

From that perspective, the tunable interface is a nice addition that can
allow users to fine-tune the setting, but it's not how users would enable
it.

> But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
> algorithm of current adaptive mutex is the major negative factor which degrades mutex
> performance.

Note that I'm not advocating for focusing on just the adaptive mutex type. 
IMO, adding this type was a mistake because whether to spin or not does not
affect semantics of the mutexes.  Performance hints shouldn't be done via a
mutex' type, and all mutex implementations should consider to spin at least
a little.

If we just do something about the adaptive mutexes, then I guess this will
reach few users.  I believe most applications just don't use them, and the
current implementation of adaptive mutexes is so simplistic that there's
not much performance to be had by changing to adaptive mutexes (which is
another reason for it having few users).

> That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive
> mutex.

MCS-style spinning (ie, spinning on memory local to the spinning thread) is
helpful, but I think we should tackle spinning on global memory first (ie,
on a location in the mutex, which is shared by all the threads trying to
acquire it).  Of course, always including back-off.

> https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
> 
> Also, if with very small critical section in the worklad, this new type of mutex 
> with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs
> much better than original spinlock.

I don't think we want to have a new type for that.  It maybe useful for
experimenting with it, but it shouldn't be exposed to users as a stable
interface.

Also, have you experimented with different kinds/settings of exponential
back-off?  I just saw normal spinning in your implementation, no varying
amounts of back-off.  The performance comparison should include back-off
though, as that's one way to work around the contention problems (with a
bigger hammer than local spinning of course, but can be effective
nonetheless, and faster in low-contention cases).
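
For comparison, a minimal sketch of bounded exponential back-off around a
plain spinlock (not glibc code; the pause and the cap are placeholders that
would need tuning):

#include <stdatomic.h>

static void
cpu_relax (void)
{
  /* Placeholder for a pause/yield hint; compiler barrier only.  */
  __asm__ __volatile__ ("" ::: "memory");
}

static void
backoff_lock (atomic_int *lock)
{
  unsigned int delay = 1;
  for (;;)
    {
      if (atomic_load_explicit (lock, memory_order_relaxed) == 0
          && atomic_exchange_explicit (lock, 1, memory_order_acquire) == 0)
        return;
      /* Back off before looking at the lock again, doubling the delay
         up to an arbitrary cap.  */
      for (unsigned int i = 0; i < delay; i++)
        cpu_relax ();
      if (delay < 1024)
        delay *= 2;
    }
}

static void
backoff_unlock (atomic_int *lock)
{
  atomic_store_explicit (lock, 0, memory_order_release);
}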

My guess is that a mix of local spinning on memory shared by a few threads
running on cores that are close to each other would perform best (eg,
similar to what's done in flat combining). 

> So, in some day, if adaptive mutex is tuned good enough, it should act like
> mcs-spinlock (or NUMA spinlock) if workload has small critical section, and
> performs like normal mutex if the critical section is too big to spinning-wait.

I agree in some way, but I think that the adaptive mutex type should just
be an alias of the normal mutex type (for API compatibility reasons only). 
And there could be other reasons than just critical-section-size that
determine whether a thread should block using futexes or not.
  
Rich Felker Jan. 15, 2019, 4:44 p.m. UTC | #16
On Tue, Jan 15, 2019 at 01:36:56PM +0100, Torvald Riegel wrote:
> > But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
> > algorithm of current adaptive mutex is the major negative factor which degrades mutex
> > performance.
> 
> Note that I'm not advocating for focusing on just the adaptive mutex type. 
> IMO, adding this type was a mistake because whether to spin or not does not
> affect semantics of the mutexes.  Performance hints shouldn't be done via a
> mutex' type, and all mutex implementations should consider to spin at least
> a little.

Strongly agreed that this was a mistake. This same sentiment is why I
don't like the nomenclature "NUMA spinlock". If there are semantic
differences from a normal spinlock, it should be named for the
semantics, not for the form of extreme performance tuning it does. If
there are no semantic differences, NUMA-optimized spinlock
implementation should just be a drop-in replacement for the standard
spinlock API.

Rich
  
Kemi Wang Jan. 17, 2019, 3:05 a.m. UTC | #17
On 2019/1/15 at 8:36 PM, Torvald Riegel wrote:
> On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
>>>> "Scalable spinlock" is something of an oxymoron.
>>>
>>> No, that's not true at all.  Most high-performance shared-memory
>>> synchronization constructs (on typical HW we have today) will do some kind
>>> of spinning (and back-off), and there's nothing wrong about it.  This can
>>> scale very well. 
>>>
>>>> Spinlocks are for
>>>> situations where contention is extremely rare,
>>>
>>> No, the question is rather whether the program needs blocking through the
>>> OS (for performance, or for semantics such as PI) or not.  Energy may be
>>> another factor.  For example, glibc's current mutexes don't scale well on
>>> short critical because there's not enough spinning being done.
>>>
>>
>> yes. That's why we need pthread.mutex.spin_count tunable interface before.
> 
> I don't think we need the tunable interface before that.  Where we need to
> improve performance most is for applications that don't want to bother
> tuning their mutexes -- that's where the broadest gains are overall, I
> think.
> 
> In turn, that means that we have spinning and back-off that give good
> average-case performance -- whether that's through automatic tuning of
> those two things at runtime, or through static default values that we do
> regular performance checks for in the glibc community. 
> 

Spinning with proportional back-off and auto tuning has been proposed for several years
and never got merged into the upstream kernel.
IMHO, this is because the MCS-style lock won that battle.

Could you tell me why we should consider back-off rather than an MCS-style lock?

> From that perspective, the tunable interface is a nice addition that can
> allow users to fine-tune the setting, but it's not how users would enable
> it.
> 
>> But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
>> algorithm of current adaptive mutex is the major negative factor which degrades mutex
>> performance.
> 
> Note that I'm not advocating for focusing on just the adaptive mutex type. 
> IMO, adding this type was a mistake because whether to spin or not does not
> affect semantics of the mutexes.  Performance hints shouldn't be done via a
> mutex' type, and all mutex implementations should consider to spin at least
> a little.
> 
> If we just do something about the adaptive mutexes, then I guess this will
> reach few users.  I believe most applications just don't use them, and the
> current implementation of adaptive mutexes is so simplistic that there's
> not much performance to be had by changing to adaptive mutexes (which is
> another reason for it having few users).
> 

Generally, I agree with you.
Maybe we can tune the adaptive mutex before applying these optimizations to the normal mutex.

>> That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive
>> mutex.
> 
> MCS-style spinning (ie, spinning on memory local to the spinning thread) is
> helpful, but I think we should tackle spinning on global memory first (ie,
> on a location in the mutex, which is shared by all the threads trying to
> acquire it).  Of course, always including back-off.
> 
>> https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
>>
>> Also, if with very small critical section in the worklad, this new type of mutex 
>> with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs
>> much better than original spinlock.
> 
> I don't think we want to have a new type for that.  It maybe useful for
> experimenting with it, but it shouldn't be exposed to users as a stable
> interface.
> 

I don't like adding a new type either.
As I said in the commit log, that was a trade-off to avoid an ABI change.
I would be very glad to see an MCS-style lock used gracefully without
introducing a new type.

> Also, have you experimented with different kinds/settings of exponential
> back-off?  I just saw normal spinning in your implementation, no varying
> amounts of back-off.  The performance comparison should include back-off
> though, as that's one way to work around the contention problems (with a
> bigger hammer than local spinning of course, but can be effective
> nonetheless, and faster in low-contention cases).
> 

I didn't try back-off, because we don't have to include it if an MCS-style lock is used.

> My guess is that a mix of local spinning on memory shared by a few threads
> running on cores that are close to each other would perform best (eg,
> similar to what's done in flat combining). 
> 
>> So, in some day, if adaptive mutex is tuned good enough, it should act like
>> mcs-spinlock (or NUMA spinlock) if workload has small critical section, and
>> performs like normal mutex if the critical section is too big to spinning-wait.
> 
> I agree in some way, but I think that the adaptive mutex type should just
> be an alias of the normal mutex type (for API compatibility reasons only). 
> And there could be other reasons than just critical-section-size that
> determine whether a thread should block using futexes or not.
> 

Agreed.
I am just moving toward that step by step.
  
Torvald Riegel Feb. 4, 2019, 5:23 p.m. UTC | #18
On Thu, 2019-01-17 at 11:05 +0800, kemi wrote:
> 
> On 2019/1/15 下午8:36, Torvald Riegel wrote:
> > On Tue, 2019-01-15 at 10:28 +0800, kemi wrote:
> > > > > "Scalable spinlock" is something of an oxymoron.
> > > > 
> > > > No, that's not true at all.  Most high-performance shared-memory
> > > > synchronization constructs (on typical HW we have today) will do some kind
> > > > of spinning (and back-off), and there's nothing wrong about it.  This can
> > > > scale very well. 
> > > > 
> > > > > Spinlocks are for
> > > > > situations where contention is extremely rare,
> > > > 
> > > > No, the question is rather whether the program needs blocking through the
> > > > OS (for performance, or for semantics such as PI) or not.  Energy may be
> > > > another factor.  For example, glibc's current mutexes don't scale well on
> > > > short critical because there's not enough spinning being done.
> > > > 
> > > 
> > > yes. That's why we need pthread.mutex.spin_count tunable interface before.
> > 
> > I don't think we need the tunable interface before that.  Where we need to
> > improve performance most is for applications that don't want to bother
> > tuning their mutexes -- that's where the broadest gains are overall, I
> > think.
> > 
> > In turn, that means that we have spinning and back-off that give good
> > average-case performance -- whether that's through automatic tuning of
> > those two things at runtime, or through static default values that we do
> > regular performance checks for in the glibc community. 
> > 
> 
> Spinning and proportional back-off with auto tuning has been proposed for several years
> and never got it merged in upstream kernel.
> IMHO, this is because MCS-style lock wins that battle.
> 
> Could you tell me why we should consider backoff rather than MCS-style lock?

The kernel has no concerns over ABI stability of their locks, AFAIK.  But
pthread_mutex_t is exposed to users, and there is the problem of process-
shared mutexes, so we can't just go to an external list.

Furthermore, if we had proper back-off, we could use it for other
synchronization primitives too (grep for comments containing "back-off").

Do you have any data that would show that MCS-style locks always win on
current machines?

> > From that perspective, the tunable interface is a nice addition that can
> > allow users to fine-tune the setting, but it's not how users would enable
> > it.
> > 
> > > But, that's not enough. When tunable is not the bottleneck, the simple busy-waiting
> > > algorithm of current adaptive mutex is the major negative factor which degrades mutex
> > > performance.
> > 
> > Note that I'm not advocating for focusing on just the adaptive mutex type. 
> > IMO, adding this type was a mistake because whether to spin or not does not
> > affect semantics of the mutexes.  Performance hints shouldn't be done via a
> > mutex' type, and all mutex implementations should consider to spin at least
> > a little.
> > 
> > If we just do something about the adaptive mutexes, then I guess this will
> > reach few users.  I believe most applications just don't use them, and the
> > current implementation of adaptive mutexes is so simplistic that there's
> > not much performance to be had by changing to adaptive mutexes (which is
> > another reason for it having few users).
> > 
> 
> Generally, I agree with you.
> May we tune adaptive mutex before applying these optimization to normal mutex.

That may be a way to start experimenting with it, as it gives you an
already existing "tunable".  However, I would guess that it won't give you
wide testing, because most programs probably don't bother changing
their locks to the adaptive type.

> > > That's why I proposed to use MCS-based spinning-waiting algorithm for adaptive
> > > mutex.
> > 
> > MCS-style spinning (ie, spinning on memory local to the spinning thread) is
> > helpful, but I think we should tackle spinning on global memory first (ie,
> > on a location in the mutex, which is shared by all the threads trying to
> > acquire it).  Of course, always including back-off.
> > 
> > > https://sourceware.org/ml/libc-alpha/2019-01/msg00279.html
> > > 
> > > Also, if with very small critical section in the worklad, this new type of mutex 
> > > with GNU extension PTHREAD_MUTEX_QUEUESPINNER_NP acts like MCS-spinlock, and performs
> > > much better than original spinlock.
> > 
> > I don't think we want to have a new type for that.  It maybe useful for
> > experimenting with it, but it shouldn't be exposed to users as a stable
> > interface.
> > 
> 
> I don't like to add a new type either.

Good.

> As I said in the commit log, that's a trade-off to avoid ABI changed.
> I am very glad to see that MCS-style lock can be used gracefully without
> introducing a new type.

Well, we'd need to introduce MCS-style locks in a way that does not change
the ABI.  If that's possible and done, and performance improves, I don't
see a reason why this shouldn't be accepted.

> > Also, have you experimented with different kinds/settings of exponential
> > back-off?  I just saw normal spinning in your implementation, no varying
> > amounts of back-off.  The performance comparison should include back-off
> > though, as that's one way to work around the contention problems (with a
> > bigger hammer than local spinning of course, but can be effective
> > nonetheless, and faster in low-contention cases).
> > 
> 
> I didn't try back-off, because we don't have to include it if MCS-style lock is used.

But you should at least try to get some decent performance, because it's a
competitor to MCS-style locking.  Even if you can't get the same
performance as MCS in some cases, we still need to know so that we can
assess the pros and cons of choosing MCS.

Also, some initial spinning may be useful even with MCS, I think (eg, short
critical sections with just two threads but enough delay between critical
sections by each thread so that there doesn't need to be a lot of
contention actually).
  

Patch

diff --git a/NEWS b/NEWS
index cd29ec7..0764d0d 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,9 @@  Version 2.29
 
 Major new features:
 
+* NUMA spinlock is added to provide a spinlock implementation optimized
+  for multi-socket NUMA systems.
+
 * The getcpu wrapper function has been added, which returns the currently
   used CPU and NUMA node.  This function is Linux-specific.
 
diff --git a/manual/examples/numa-spinlock.c b/manual/examples/numa-spinlock.c
new file mode 100644
index 0000000..ca98443
--- /dev/null
+++ b/manual/examples/numa-spinlock.c
@@ -0,0 +1,99 @@ 
+/* NUMA spinlock example.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or
+   modify it under the terms of the GNU General Public License
+   as published by the Free Software Foundation; either version 2
+   of the License, or (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.
+*/
+
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <numa-spinlock.h>
+
+#define NUM_THREADS	20
+
+struct numa_spinlock *lock;
+
+struct work_todo_argument
+{
+  void *arg;
+};
+
+static void *
+work_todo (void *v)
+{
+  /* Do the real work with p->arg. */
+  struct work_todo_argument *p = v;
+  /* Return value is set to lock_info.result. */
+  return NULL;
+}
+
+void *
+work_thread (void *arg)
+{
+  struct work_todo_argument work_todo_arg;
+  struct numa_spinlock_info lock_info;
+
+  if (numa_spinlock_init (lock, &lock_info))
+    {
+      printf ("numa_spinlock_init failure: %m\n");
+      exit (1);
+    }
+
+  work_todo_arg.arg = arg;
+  lock_info.argument = &work_todo_arg;
+  lock_info.workload = work_todo;
+
+  numa_spinlock_apply (&lock_info);
+
+  return lock_info.result;
+}
+
+int
+main (int argc, char **argv)
+{
+  lock = numa_spinlock_alloc ();
+  pthread_t thr[NUM_THREADS];
+  void *res[NUM_THREADS];
+  int numthreads = NUM_THREADS;
+  int i;
+
+  for (i = 0; i < NUM_THREADS; i++)
+    {
+      int err_ret = pthread_create (&thr[i], NULL, work_thread,
+				    (void *) (intptr_t) i);
+      if (err_ret != 0)
+	{
+	  printf ("pthread_create failed: %d, %s\n",
+		  i, strerror (err_ret));
+	  numthreads = i;
+	  break;
+	}
+    }
+
+  for (i = 0; i < numthreads; i++)
+    {
+      if (pthread_join (thr[i], (void *) &res[i]) == 0)
+	free (res[i]);
+      else
+	printf ("pthread_join failure: %m\n");
+    }
+
+  numa_spinlock_free (lock);
+
+  return 0;
+}
diff --git a/manual/threads.texi b/manual/threads.texi
index 87fda7d..e82ae0d 100644
--- a/manual/threads.texi
+++ b/manual/threads.texi
@@ -625,6 +625,9 @@  the standard.
 @menu
 * Default Thread Attributes::             Setting default attributes for
 					  threads in a process.
+* NUMA Spinlock::                         Spinlock optimized for
+					  multi-socket NUMA platform.
+* NUMA Spinlock Example::                 A NUMA spinlock example.
 @end menu
 
 @node Default Thread Attributes
@@ -669,6 +672,108 @@  The system does not have sufficient memory.
 @end table
 @end deftypefun
 
+@node NUMA Spinlock
+@subsubsection Spinlock optimized for multi-node NUMA systems
+
+To improve performance of serialized regions protected by spinlocks on
+multi-socket NUMA platforms, @theglibc{} implements a NUMA spinlock
+object, which minimizes cross-socket traffic and sends the protected
+serialized region to one core for execution to reduce spinlock contention
+overhead.
+
+The fundamental data types for a NUMA spinlock are
+@code{numa_spinlock} and @code{numa_spinlock_info}:
+
+@deftp {Data Type} {struct numa_spinlock}
+@standards{Linux, numa-spinlock.h}
+This data type is an opaque structure.  A @code{numa_spinlock} pointer
+uniquely identifies a NUMA spinlock object.
+@end deftp
+
+@deftp {Data Type} {struct numa_spinlock_info}
+@standards{Linux, numa-spinlock.h}
+
+This data type uniquely identifies a NUMA spinlock information object for
+a thread.  It has the following members, as well as others internal to the
+NUMA spinlock implementation:
+
+@table @code
+@item void *(*workload) (void *)
+A function pointer to the workload function serialized by spinlock.
+@item void *argument
+A pointer to argument passed to the @var{workload} function pointer.
+@item void *result
+Return value from the @var{workload} function pointer.
+@end table
+
+@end deftp
+
+The following functions are provided for NUMA spinlock objects:
+
+@deftypefun struct numa_spinlock *numa_spinlock_alloc (void)
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
+
+This function returns a pointer to a newly allocated NUMA spinlock or a
+null pointer if the NUMA spinlock could not be allocated, setting
+@code{errno} to @code{ENOMEM}.  Caller should call
+@code{numa_spinlock_free} on the NUMA spinlock pointer to free the
+memory space when it is no longer needed.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun void numa_spinlock_free (struct numa_spinlock *@var{lock})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
+
+Free the memory space pointed to by @var{lock}, which must have been
+returned by a previous call to @code{numa_spinlock_alloc}.  Otherwise,
+or if @code{numa_spinlock_free (@var{lock})} has already been called
+before, undefined behavior occurs.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun int numa_spinlock_init (struct numa_spinlock *@var{lock},
+struct numa_spinlock_info *@var{info})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+Initialize the NUMA spinlock information block pointed to by @var{info}
+with a NUMA spinlock pointer @var{lock}.  The return value is @code{0} on
+success and @code{-1} on failure.  The following @code{errno} error
+codes are defined for this function:
+
+@table @code
+@item ENOSYS
+The operating system does not support the @code{getcpu} function.
+@end table
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun void numa_spinlock_apply (struct numa_spinlock_info *@var{info})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+Apply for the spinlock with the NUMA spinlock information block pointed to
+by @var{info}.  When @code{numa_spinlock_apply} returns, the spinlock has
+been released and the @var{result} member of @var{info} contains the
+return value of the @var{workload} function.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@node NUMA Spinlock Example
+@subsubsection NUMA Spinlock Example
+
+A NUMA spinlock example:
+
+@smallexample
+@include numa-spinlock.c.texi
+@end smallexample
+
 @c FIXME these are undocumented:
 @c pthread_atfork
 @c pthread_attr_destroy
diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
index f827455..3361597 100644
--- a/sysdeps/unix/sysv/linux/Makefile
+++ b/sysdeps/unix/sysv/linux/Makefile
@@ -222,6 +222,8 @@  CFLAGS-gai.c += -DNEED_NETLINK
 endif
 
 ifeq ($(subdir),nptl)
+libpthread-sysdep_routines += numa_spinlock_alloc numa-spinlock
+sysdep_headers += numa-spinlock.h
 tests += tst-align-clone tst-getpid1 \
 	tst-thread-affinity-pthread tst-thread-affinity-pthread2 \
 	tst-thread-affinity-sched
diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
index f1e12d9..7ce7e2b 100644
--- a/sysdeps/unix/sysv/linux/Versions
+++ b/sysdeps/unix/sysv/linux/Versions
@@ -185,3 +185,12 @@  libc {
     __netlink_assert_response;
   }
 }
+
+libpthread {
+  GLIBC_2.29 {
+    numa_spinlock_alloc;
+    numa_spinlock_free;
+    numa_spinlock_init;
+    numa_spinlock_apply;
+  }
+}
diff --git a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
index 9a9e4ce..eb54a83 100644
--- a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
index b413007..dd08796 100644
--- a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/arm/libpthread.abilist b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
index af82a4c..45a5c5a 100644
--- a/sysdeps/unix/sysv/linux/arm/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.4 _IO_flockfile F
 GLIBC_2.4 _IO_ftrylockfile F
 GLIBC_2.4 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/csky/libpthread.abilist b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
index ea4b79a..cf65f72 100644
--- a/sysdeps/unix/sysv/linux/csky/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
@@ -73,6 +73,10 @@  GLIBC_2.29 mtx_timedlock F
 GLIBC_2.29 mtx_trylock F
 GLIBC_2.29 mtx_unlock F
 GLIBC_2.29 nanosleep F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.29 open F
 GLIBC_2.29 open64 F
 GLIBC_2.29 pause F
diff --git a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
index bcba07f..a80475f 100644
--- a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/i386/libpthread.abilist b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
index bece86d..40ac05a 100644
--- a/sysdeps/unix/sysv/linux/i386/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
index ccc9449..5b190f6 100644
--- a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
index af82a4c..45a5c5a 100644
--- a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.4 _IO_flockfile F
 GLIBC_2.4 _IO_ftrylockfile F
 GLIBC_2.4 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
index bece86d..40ac05a 100644
--- a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
index 5067375..e6539bf 100644
--- a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
index 0214496..76edcb8 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
index 0214496..76edcb8 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
index 78cac2a..3141d08 100644
--- a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
@@ -241,3 +241,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock-private.h b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
new file mode 100644
index 0000000..0f7a3be
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
@@ -0,0 +1,38 @@ 
+/* Internal definitions and declarations for NUMA spinlock.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "numa-spinlock.h"
+
+/* The global NUMA spinlock.  */
+struct numa_spinlock
+{
+  /* List of threads that own the global NUMA spinlock.  */
+  struct numa_spinlock_info *owner;
+  /* The maximum NUMA node number.  */
+  unsigned int max_node;
+  /* Non-zero for single node system.  */
+  unsigned int single_node;
+  /* The maximum CPU number.  Used only when NUMA is disabled.  */
+  unsigned int max_cpu;
+  /* Array of the physical_package_id of each core, or NULL.  Used
+     only when NUMA is disabled.  */
+  unsigned int *physical_package_id_p;
+  /* Array of lists of threads that are spinning for the local NUMA
+     lock, indexed by NUMA node number.  */
+  struct numa_spinlock_info *lists[];
+};
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.c b/sysdeps/unix/sysv/linux/numa-spinlock.c
new file mode 100644
index 0000000..a141e7d
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock.c
@@ -0,0 +1,327 @@ 
+/* NUMA spinlock
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+#include <string.h>
+#include <stdlib.h>
+#include <sched.h>
+#ifndef HAVE_GETCPU
+#include <unistd.h>
+#include <syscall.h>
+#endif
+#include <errno.h>
+#include <atomic.h>
+#include "numa-spinlock-private.h"
+
+#if !defined HAVE_GETCPU && defined _LIBC
+# define HAVE_GETCPU
+#endif
+
+/* On multi-socket systems, memory is shared across the entire system.
+   Data access to the local socket is much faster than the remote socket
+   and data access to the local core is faster than sibling cores on the
+   same socket.  For serialized workloads with conventional spinlock,
+   when there is high spinlock contention between threads, lock ping-pong
+   among sockets becomes the bottleneck and threads spend majority of
+   their time in spinlock overhead.
+
+   On multi-socket systems, the keys to our NUMA spinlock performance
+   are to minimize cross-socket traffic as well as localize the serialized
+   workload to one core for execution.  The NUMA spinlock is mainly
+   based on the following approaches, which reduce data movement and
+   accelerate the critical section, eventually giving us a significant
+   performance improvement.
+
+   1. MCS spinlock
+   The MCS spinlock helps us reduce useless lock movement in the
+   spinning state.  This paper provides a good description for this
+   kind of lock:
+   <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
+
+   2. Critical Section Integration (CSI)
+   Essentially, a spinlock is similar to having one core complete the
+   critical sections one by one.  So when contention happens, the
+   serialized work is sent to the core that owns the lock and is
+   responsible for executing it.  This can save much time and power,
+   because all shared data is located in the private cache of the lock
+   owner.
+
+   We implemented this mechanism based on the queued spinlock in the
+   kernel; it speeds up the critical section and reduces the
+   probability of contention.
+   The paper provides a good description for this kind of lock:
+   <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
+
+   3. NUMA Aware Spinlock (NAS)
+   Currently multi-socket systems give us better performance per watt,
+   however that also involves more complex synchronization requirements,
+   because off-chip data movement is much slower.  We use a distributed
+   synchronization mechanism to decrease movement of the lock cache line
+   to and from different nodes.  The paper provides a good description
+   for this kind
+   of lock:
+   <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>
+
+   4. Yield Schedule
+   When threads are applying for Critical Section Integration (CSI)
+   with known contention, they delegate the work to the thread that
+   owns the lock and wait for the work to be completed.  The resources
+   they are using should be transferred to other threads.  In order to
+   accelerate this scenario, we call sched_yield during the spinning
+   stage.
+
+   5. Optimization when NUMA is ON or OFF.
+   Although programs can access memory with lower latency when NUMA is
+   enabled, some programs may need more memory bandwidth for computation
+   with NUMA disabled.  We also optimize multi-socket systems with NUMA
+   disabled.
+
+   NUMA spinlock flow chart (assuming there are 2 CPU nodes):
+
+   1. Threads from node_0/node_1 acquire local lock for node_0/1
+   respectively.  If the thread succeeds in acquiring local lock, it
+   goes to step 2, otherwise pushes critical function into current
+   local work queue, and enters into spinning stage with MCS mode.
+
+   2. Threads from node_0/node_1 acquire the global lock.  If a thread
+   succeeds in acquiring the global lock as the lock owner, it goes to
+   step 3, otherwise it waits until the lock owner thread releases the
+   global lock.
+
+   3. The lock owner thread from node_0/1 enters into critical section,
+   cleans up the work queue by performing, with CSI, all local critical
+   functions pushed at step 1 on behalf of other threads, and informs
+   those spinning threads that their work has been done.  It then
+   releases the local lock.
+
+   4. The lock owner thread frees global lock.  If another thread is
+   waiting at step 2, the lock owner thread passes the global lock to
+   the waiting thread and returns.  The new lock owner thread enters
+   into step 3.  If no threads are waiting, the lock owner thread
+   releases the global lock and returns.  The whole critical section
+   process is completed.
+
+   Steps 1 and 2 mitigate global lock contention.  Only one thread
+   from each node competes for the global lock in step 2.
+   Step 3 reduces the global lock & shared data movement because they
+   are located in the same node as well as the same core.  Our data
+   shows that Critical Section Integration (CSI) improves data locality
+   and NUMA-aware spinlock (NAS) helps CSI balance the workload.
+
+   NUMA spinlock can greatly speed up critical section on multi-socket
+   systems.  It should improve spinlock performance on all multi-socket
+   systems.
+
+   NOTE: LiTL <https://github.com/multicore-locks/litl> is an open-source
+   project that provides implementations of dozens of different locks,
+   including several state-of-the-art NUMA-aware spinlocks.  Among them:
+
+   1. Hierarchical MCS (HMCS) spinlock.  Milind Chabbi, Michael Fagan,
+   and John Mellor-Crummey. High Performance Locks for Multi-level NUMA
+   Systems.  In Proceedings of the ACM SIGPLAN Symposium on Principles
+   and Practice of Parallel Programming (PPoPP), pages 215–226, 2015.
+
+   2. Cohort-MCS (C-MCS) spinlock.  Dave Dice, Virendra J. Marathe, and
+   Nir Shavit.  Lock Cohorting: A General Technique for Designing NUMA
+   Locks. ACM Trans. Parallel Comput., 1(2):13:1–13:42, 2015.
+ */
+
+/* Get the next thread pointed to by *NEXT_P.  NB: We must use a while
+   spin loop to load NEXT_P since there is a small window before *NEXT_P
+   is updated.  */
+
+static inline struct numa_spinlock_info *
+get_numa_spinlock_info_next (struct numa_spinlock_info **next_p)
+{
+  struct numa_spinlock_info *next;
+  while (!(next = atomic_load_relaxed (next_p)))
+    atomic_spin_nop ();
+  return next;
+}
+
+/* While holding the global NUMA spinlock, run the workload of the
+   thread pointed to by SELF first, then run the workload for each
+   thread on the thread list pointed to by HEAD_P and wake up the
+   thread so that all workloads run on a single processor.  */
+
+static inline void
+run_numa_spinlock (struct numa_spinlock_info *self,
+		   struct numa_spinlock_info **head_p)
+{
+  struct numa_spinlock_info *next, *current;
+
+  /* Run SELF's workload.  */
+  self->result = self->workload (self->argument);
+
+  /* Process workloads for the rest of threads on the thread list.
+     NB: The thread list may be prepended by other threads at the
+     same time.  */
+
+retry:
+   /* If SELF is the first thread of the thread list pointed to by
+      HEAD_P, clear the thread list.  */
+  current = atomic_compare_and_exchange_val_acq (head_p, NULL, self);
+  if (current == self)
+    {
+      /* Since SELF is the only thread on the list, clear SELF's pending
+         field and return.  */
+      atomic_store_release (&current->pending, 0);
+      return;
+    }
+
+  /* CURRENT will have the previous first thread of the thread list
+     pointed to by HEAD_P and *HEAD_P will point to SELF.  */
+  current = atomic_exchange_acquire (head_p, self);
+
+  /* NB: No need to check if CURRENT == SELF here since SELF can never
+     be CURRENT.  */
+
+repeat:
+  /* Get the next thread.  */
+  next = get_numa_spinlock_info_next (&current->next);
+
+  /* Run CURRENT's workload and clear CURRENT's pending field.  */
+  current->result = current->workload (current->argument);
+  current->pending = 0;
+
+  /* Process the workload for each thread from CURRENT to SELF on the
+     thread list.  Don't pass beyond SELF since SELF is the last thread
+     on the list.  */
+  if (next == self)
+    goto retry;
+  current = next;
+  goto repeat;
+}
+
+/* Apply for the NUMA spinlock with the NUMA spinlock info data pointed
+   to by SELF.  */
+
+void
+numa_spinlock_apply (struct numa_spinlock_info *self)
+{
+  struct numa_spinlock *lock = self->lock;
+  struct numa_spinlock_info *first, *next;
+  struct numa_spinlock_info **head_p;
+
+  self->next = NULL;
+  /* We want the global NUMA spinlock.  */
+  self->pending = 1;
+  /* Select the local NUMA spinlock list by the NUMA node number.  */
+  head_p = &lock->lists[self->node];
+  /* FIRST will have the previous first thread of the local NUMA spinlock
+     list and *HEAD_P will point to SELF.  */
+  first = atomic_exchange_acquire (head_p, self);
+  if (first)
+    {
+      /* SELF has been prepended to the thread list pointed to by
+	 HEAD_P.  NB: There is a small window between updating
+	 *HEAD_P and self->next.  */
+      atomic_store_release (&self->next, first);
+      /* Let other threads run first since another thread will run our
+	 workload for us.  */
+      sched_yield ();
+      /* Spin until our PENDING is cleared.  */
+      while (atomic_load_relaxed (&self->pending))
+	atomic_spin_nop ();
+      return;
+    }
+
+  /* NB: Now SELF must be the only thread on the thread list pointed
+     to by HEAD_P.  Since a thread is always prepended to HEAD_P, we
+     can use *HEAD_P == SELF to check if SELF is the only thread on
+     the thread list.  */
+
+  if (__glibc_unlikely (lock->single_node))
+    {
+      /* If there is only one node, there is no need for the global
+         NUMA spinlock.  */
+      run_numa_spinlock (self, head_p);
+      return;
+    }
+
+  /* FIRST will have the previous first thread of the list of threads
+     which hold the global NUMA spinlock, and lock->owner will point
+     to SELF.  */
+  first = atomic_exchange_acquire (&lock->owner, self);
+  if (first)
+    {
+      /* SELF has been prepended to the thread list pointed to by
+	 lock->owner.  NB: There is a small window between updating
+	 *HEAD_P and first->next.  */
+      atomic_store_release (&first->next, self);
+      /* Spin until the current holder of the global NUMA
+	 spinlock clears our PENDING.  */
+      while (atomic_load_relaxed (&self->pending))
+	atomic_spin_nop ();
+    }
+
+  /* We get the global NUMA spinlock now.  Run our workload.  */
+  run_numa_spinlock (self, head_p);
+
+  /* SELF is the only thread on the list if SELF is the first thread
+     of the thread list pointed to by lock->owner.  In this case, we
+     simply return.  */
+  if (!atomic_compare_and_exchange_bool_acq (&lock->owner, NULL, self))
+    return;
+
+  /* Wake up the next thread.  */
+  next = get_numa_spinlock_info_next (&self->next);
+  atomic_store_release (&next->pending, 0);
+}
+
+/* Initialize the NUMA spinlock info data pointed to by INFO from a
+   pointer to the NUMA spinlock, LOCK.  */
+
+int
+numa_spinlock_init (struct numa_spinlock *lock,
+			 struct numa_spinlock_info *info)
+{
+  memset (info, 0, sizeof (*info));
+  info->lock = lock;
+  /* For single node system, use 0 as the NUMA node number.  */
+  if (lock->single_node)
+    return 0;
+  /* NB: Use the NUMA node number from getcpu to select the local NUMA
+     spinlock list.  */
+  unsigned int cpu;
+  unsigned int node;
+#ifdef HAVE_GETCPU
+  int err_ret = getcpu (&cpu, &node);
+#else
+  int err_ret = syscall (SYS_getcpu, &cpu, &node, NULL);
+#endif
+  if (err_ret)
+    return err_ret;
+  if (lock->physical_package_id_p)
+    {
+      /* Can it ever happen?  */
+      if (cpu > lock->max_cpu)
+	cpu = lock->max_cpu;
+      /* NB: If NUMA is disabled, use physical_package_id.  */
+      node = lock->physical_package_id_p[cpu];
+    }
+  /* Can it ever happen?  */
+  if (node > lock->max_node)
+    node = lock->max_node;
+  info->node = node;
+  return err_ret;
+}
+
+void
+numa_spinlock_free (struct numa_spinlock *lock)
+{
+  if (lock->physical_package_id_p)
+    free (lock->physical_package_id_p);
+  free (lock);
+}
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.h b/sysdeps/unix/sysv/linux/numa-spinlock.h
new file mode 100644
index 0000000..b17bda5
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock.h
@@ -0,0 +1,64 @@ 
+/* Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _NUMA_SPINLOCK_H
+#define _NUMA_SPINLOCK_H
+
+#include <features.h>
+
+__BEGIN_DECLS
+
+/* The NUMA spinlock.  */
+struct numa_spinlock;
+
+/* The NUMA spinlock information for each thread.  */
+struct numa_spinlock_info
+{
+  /* The workload function of this thread.  */
+  void *(*workload) (void *);
+  /* The argument pointer passed to the workload function.  */
+  void *argument;
+  /* The return value of the workload function.  */
+  void *result;
+  /* The pointer to the NUMA spinlock.  */
+  struct numa_spinlock *lock;
+  /* The next thread on the local NUMA spinlock thread list.  */
+  struct numa_spinlock_info *next;
+  /* The NUMA node number.  */
+  unsigned int node;
+  /* Non-zero to indicate that the thread wants the NUMA spinlock.  */
+  int pending;
+  /* Reserved for future use.  */
+  void *__reserved[4];
+};
+
+/* Return a pointer to a newly allocated NUMA spinlock.  */
+extern struct numa_spinlock *numa_spinlock_alloc (void);
+
+/* Free the memory space of the NUMA spinlock.  */
+extern void numa_spinlock_free (struct numa_spinlock *);
+
+/* Initialize the NUMA spinlock information block.  */
+extern int numa_spinlock_init (struct numa_spinlock *,
+			       struct numa_spinlock_info *);
+
+/* Apply for spinlock with a NUMA spinlock information block.  */
+extern void numa_spinlock_apply (struct numa_spinlock_info *);
+
+__END_DECLS
+
+#endif /* numa-spinlock.h */
diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
new file mode 100644
index 0000000..8ff4e1a
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
@@ -0,0 +1,304 @@ 
+/* Initialization of NUMA spinlock.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <ctype.h>
+#include <string.h>
+#include <dirent.h>
+#include <stdio.h>
+#include <limits.h>
+#ifdef _LIBC
+# include <not-cancel.h>
+#else
+# include <stdlib.h>
+# include <unistd.h>
+# include <fcntl.h>
+# define __open_nocancel		open
+# define __close_nocancel_nostatus	close
+# define __read_nocancel		read
+#endif
+
+#include "numa-spinlock-private.h"
+
+static char *
+next_line (int fd, char *const buffer, char **cp, char **re,
+	   char *const buffer_end)
+{
+  char *res = *cp;
+  char *nl = memchr (*cp, '\n', *re - *cp);
+  if (nl == NULL)
+    {
+      if (*cp != buffer)
+	{
+	  if (*re == buffer_end)
+	    {
+	      memmove (buffer, *cp, *re - *cp);
+	      *re = buffer + (*re - *cp);
+	      *cp = buffer;
+
+	      ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
+	      if (n < 0)
+		return NULL;
+
+	      *re += n;
+
+	      nl = memchr (*cp, '\n', *re - *cp);
+	      while (nl == NULL && *re == buffer_end)
+		{
+		  /* Truncate too long lines.  */
+		  *re = buffer + 3 * (buffer_end - buffer) / 4;
+		  n = __read_nocancel (fd, *re, buffer_end - *re);
+		  if (n < 0)
+		    return NULL;
+
+		  nl = memchr (*re, '\n', n);
+		  **re = '\n';
+		  *re += n;
+		}
+	    }
+	  else
+	    nl = memchr (*cp, '\n', *re - *cp);
+
+	  res = *cp;
+	}
+
+      if (nl == NULL)
+	nl = *re - 1;
+    }
+
+  *cp = nl + 1;
+  assert (*cp <= *re);
+
+  return res == *re ? NULL : res;
+}
+
+static int
+select_cpu (const struct dirent *d)
+{
+  /* Return 1 for "cpuXXX" where XXX are digits.  */
+  if (strncmp (d->d_name, "cpu", sizeof ("cpu") - 1) == 0)
+    {
+      const char *p = d->d_name + 3;
+
+      if (*p == '\0')
+	return 0;
+
+      do
+	{
+	  if (!isdigit (*p))
+	    return 0;
+	  p++;
+	}
+      while (*p != '\0');
+
+      return 1;
+    }
+  return 0;
+}
+
+/* Allocate a NUMA spinlock and return a pointer to it.  The caller
+   should call numa_spinlock_free on the returned pointer to free the
+   memory when it is no longer needed.  */
+
+struct numa_spinlock *
+numa_spinlock_alloc (void)
+{
+  const size_t buffer_size = 1024;
+  char buffer[buffer_size];
+  char *buffer_end = buffer + buffer_size;
+  char *cp = buffer_end;
+  char *re = buffer_end;
+
+  const int flags = O_RDONLY | O_CLOEXEC;
+  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
+  char *l;
+  unsigned int max_node = 0;
+  unsigned int node_count = 0;
+  if (fd != -1)
+    {
+      l = next_line (fd, buffer, &cp, &re, buffer_end);
+      if (l != NULL)
+	do
+	  {
+	    char *endp;
+	    unsigned long int n = strtoul (l, &endp, 10);
+	    if (l == endp)
+	      {
+		node_count = 1;
+		break;
+	      }
+
+	    unsigned long int m = n;
+	    if (*endp == '-')
+	      {
+		l = endp + 1;
+		m = strtoul (l, &endp, 10);
+		if (l == endp)
+		  {
+		    node_count = 1;
+		    break;
+		  }
+	      }
+
+	    node_count += m - n + 1;
+
+	    if (m >= max_node)
+	      max_node = m;
+
+	    l = endp;
+	    while (l < re && isspace (*l))
+	      ++l;
+	  }
+	while (l < re);
+
+      __close_nocancel_nostatus (fd);
+    }
+
+  /* NB: Some NUMA nodes may not be available.  If the number of NUMA
+     nodes is 1, set the maximum NUMA node number to 0.  */
+  if (node_count == 1)
+    max_node = 0;
+
+  unsigned int max_cpu = 0;
+  unsigned int *physical_package_id_p = NULL;
+
+  if (max_node == 0)
+    {
+      /* There is at least 1 node.  */
+      node_count = 1;
+
+      /* If NUMA is disabled, use physical_package_id instead.  */
+      struct dirent **cpu_list;
+      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
+			    select_cpu, NULL);
+      if (nprocs > 0)
+	{
+	  int i;
+	  unsigned int *cpu_id_p = NULL;
+
+	  /* Find the maximum CPU number.  */
+	  if (posix_memalign ((void **) &cpu_id_p,
+			      __alignof__ (void *),
+			      nprocs * sizeof (unsigned int)) == 0)
+	    {
+	      for (i = 0; i < nprocs; i++)
+		{
+		  unsigned int cpu_id
+		    = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
+		  cpu_id_p[i] = cpu_id;
+		  if (cpu_id > max_cpu)
+		    max_cpu = cpu_id;
+		}
+
+	      if (posix_memalign ((void **) &physical_package_id_p,
+				  __alignof__ (void *),
+				  ((max_cpu + 1)
+				   * sizeof (unsigned int))) == 0)
+		{
+		  memset (physical_package_id_p, 0,
+			  ((max_cpu + 1) * sizeof (unsigned int)));
+
+		  max_node = UINT_MAX;
+
+		  /* Get physical_package_id.  */
+		  char path[(sizeof ("/sys/devices/system/cpu")
+			     + 3 * sizeof (unsigned long int)
+			     + sizeof ("/topology/physical_package_id"))];
+		  for (i = 0; i < nprocs; i++)
+		    {
+		      struct dirent *d = cpu_list[i];
+		      if (snprintf (path, sizeof (path),
+				    "/sys/devices/system/cpu/%s/topology/physical_package_id",
+				    d->d_name) > 0)
+			{
+			  fd = __open_nocancel (path, flags);
+			  if (fd != -1)
+			    {
+			      if (__read_nocancel (fd, buffer,
+						   buffer_size) > 0)
+				{
+				  char *endp;
+				  unsigned long int package_id
+				    = strtoul (buffer, &endp, 10);
+				  if (package_id != ULONG_MAX
+				      && *buffer != '\0'
+				      && (*endp == '\0' || *endp == '\n'))
+				    {
+				      physical_package_id_p[cpu_id_p[i]]
+					= package_id;
+				      if (max_node == UINT_MAX)
+					{
+					  /* This is the first node.  */
+					  max_node = package_id;
+					}
+				      else if (package_id != max_node)
+					{
+					  /* NB: We only need to know if
+					     NODE_COUNT > 1.  */
+					  node_count = 2;
+					  if (package_id > max_node)
+					    max_node = package_id;
+					}
+				    }
+				}
+			      __close_nocancel_nostatus (fd);
+			    }
+			}
+
+		      free (d);
+		    }
+		}
+
+	      free (cpu_id_p);
+	    }
+	  else
+	    {
+	      for (i = 0; i < nprocs; i++)
+		free (cpu_list[i]);
+	    }
+
+	  free (cpu_list);
+	}
+    }
+
+  if (physical_package_id_p != NULL && node_count == 1)
+    {
+      /* There is only one node.  No need for physical_package_id_p.  */
+      free (physical_package_id_p);
+      physical_package_id_p = NULL;
+      max_cpu = 0;
+    }
+
+  /* Allocate an array of struct numa_spinlock_info pointers to hold info
+     for all NUMA nodes with NUMA node number from getcpu () as index.  */
+  size_t size = (sizeof (struct numa_spinlock)
+		 + ((max_node + 1)
+		    * sizeof (struct numa_spinlock_info *)));
+  struct numa_spinlock *lock;
+  if (posix_memalign ((void **) &lock,
+		      __alignof__ (struct numa_spinlock_info *), size))
+    return NULL;
+  memset (lock, 0, size);
+
+  lock->max_node = max_node;
+  lock->single_node = node_count == 1;
+  lock->max_cpu = max_cpu;
+  lock->physical_package_id_p = physical_package_id_p;
+
+  return lock;
+}
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
index 09e8447..dba7df6 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
index 8300958..a763c0a 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3 _IO_flockfile F
 GLIBC_2.3 _IO_ftrylockfile F
 GLIBC_2.3 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
index 9a9e4ce..eb54a83 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
index c370fda..366fcac 100644
--- a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
@@ -235,3 +235,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
index d05468f..786d8e1 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
@@ -229,6 +229,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
index e8161aa..dd7c52f 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
@@ -221,6 +221,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sh/libpthread.abilist b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
index bcba07f..a80475f 100644
--- a/sysdeps/unix/sysv/linux/sh/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
index b413007..dd08796 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
index ccc9449..5b190f6 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/x86/Makefile b/sysdeps/unix/sysv/linux/x86/Makefile
index 02ca36c..29d41ad 100644
--- a/sysdeps/unix/sysv/linux/x86/Makefile
+++ b/sysdeps/unix/sysv/linux/x86/Makefile
@@ -14,6 +14,7 @@  endif
 ifeq ($(subdir),nptl)
 libpthread-sysdep_routines += elision-lock elision-unlock elision-timed \
 			      elision-trylock
+xtests += tst-variable-overhead tst-numa-variable-overhead
 CFLAGS-elision-lock.c += -mrtm
 CFLAGS-elision-unlock.c += -mrtm
 CFLAGS-elision-timed.c += -mrtm
diff --git a/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
new file mode 100644
index 0000000..7cb8542
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
@@ -0,0 +1,53 @@ 
+/* Test case for NUMA spinlock overhead.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include "numa-spinlock.h"
+
+struct numa_spinlock *lock;
+
+struct work_todo_argument
+{
+  unsigned long *v1;
+  unsigned long *v2;
+  unsigned long *v3;
+  unsigned long *v4;
+};
+
+static void *
+work_todo (void *v)
+{
+  struct work_todo_argument *p = v;
+  unsigned long ret;
+  *p->v1 = *p->v1 + 1;
+  *p->v2 = *p->v2 + 1;
+  ret = __sync_val_compare_and_swap (p->v4, 0, 1);
+  *p->v3 = *p->v3 + ret;
+  return (void *) 2;
+}
+
+static inline void
+do_work (struct numa_spinlock_info *lock_info)
+{
+  numa_spinlock_apply (lock_info);
+}
+
+#define USE_NUMA_SPINLOCK
+#include "tst-variable-overhead-skeleton.c"
diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
new file mode 100644
index 0000000..4b83dfb
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
@@ -0,0 +1,384 @@ 
+/* Test case skeleton for spinlock overhead.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Check spinlock overhead with a large number of threads.  The critical
+   region is very small.  Critical region + spinlock overhead aren't
+   noticeable when the number of threads is small.  When the number of
+   threads increases, spinlock overhead becomes the bottleneck.  It
+   shows up in the wall time of thread execution.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include <config.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <sys/time.h>
+#include <sys/param.h>
+#include <errno.h>
+#ifdef MODULE_NAME
+# include <cpu-features.h>
+# include <support/test-driver.h>
+#endif
+
+#ifndef USE_PTHREAD_ATTR_SETAFFINITY_NP
+# define USE_PTHREAD_ATTR_SETAFFINITY_NP 1
+#endif
+
+#define memory_barrier() __asm ("" ::: "memory")
+#define pause() __asm  ("rep ; nop" ::: "memory")
+
+#define CACHELINE_SIZE	64
+#define CACHE_ALIGNED	__attribute__((aligned(CACHELINE_SIZE)))
+
+#define constant_time 5
+unsigned long g_val CACHE_ALIGNED;
+unsigned long g_val2 CACHE_ALIGNED;
+unsigned long g_val3 CACHE_ALIGNED;
+unsigned long cmplock CACHE_ALIGNED;
+struct count
+{
+  unsigned long long total;
+  unsigned long long spinlock;
+  unsigned long long wall;
+} __attribute__((aligned(128)));
+
+struct count *gcount;
+
+/* The time consumed by one update is about 200 TSCs.  */
+static int delay_time_unlocked = 400;
+
+struct ops
+{
+  void *(*test) (void *arg);
+  void (*print_thread) (void *res, int);
+} *ops;
+
+struct stats_result
+{
+  unsigned long num;
+};
+
+void *work_thread (void *arg);
+
+#define iterations (10000 * 5)
+
+static volatile int start_thread;
+
+/* Delay some fixed time */
+static void
+delay_tsc (unsigned n)
+{
+  unsigned long long start, current, diff;
+  unsigned int aux;
+  start = __builtin_ia32_rdtscp (&aux);
+  while (1)
+    {
+      current = __builtin_ia32_rdtscp (&aux);
+      diff = current - start;
+      if (diff < n)
+	pause ();
+      else
+	break;
+    }
+}
+
+static void
+wait_a_bit (int delay_time)
+{
+  if (delay_time > 0)
+    delay_tsc (delay_time);
+}
+
+#ifndef USE_NUMA_SPINLOCK
+static inline void
+work_todo (void)
+{
+  unsigned long ret;
+  g_val = g_val + 1;
+  g_val2 = g_val2 + 1;
+  ret = __sync_val_compare_and_swap (&cmplock, 0, 1);
+  g_val3 = g_val3 + 1 + ret;
+}
+#endif
+
+void *
+work_thread (void *arg)
+{
+  long i;
+  unsigned long pid = (unsigned long) arg;
+  struct stats_result *res;
+  unsigned long long start, end;
+  int err_ret = posix_memalign ((void **)&res, CACHELINE_SIZE,
+				roundup (sizeof (*res), CACHELINE_SIZE));
+  if (err_ret)
+    {
+      printf ("posix_memalign failure: %s\n", strerror (err_ret));
+      exit (err_ret);
+    }
+  long num = 0;
+
+#ifdef USE_NUMA_SPINLOCK
+  struct work_todo_argument work_todo_arg;
+  struct numa_spinlock_info lock_info;
+
+  if (numa_spinlock_init (lock, &lock_info))
+    {
+      printf ("numa_spinlock_init failure: %m\n");
+      exit (1);
+    }
+
+  work_todo_arg.v1 = &g_val;
+  work_todo_arg.v2 = &g_val2;
+  work_todo_arg.v3 = &g_val3;
+  work_todo_arg.v4 = &cmplock;
+  lock_info.argument = &work_todo_arg;
+  lock_info.workload = work_todo;
+#endif
+
+  while (!start_thread)
+    pause ();
+
+  unsigned int aux;
+  start = __builtin_ia32_rdtscp (&aux);
+  for (i = 0; i < iterations; i++)
+    {
+#ifdef USE_NUMA_SPINLOCK
+      do_work (&lock_info);
+#else
+      do_work ();
+#endif
+      wait_a_bit (delay_time_unlocked);
+      num++;
+    }
+  end = __builtin_ia32_rdtscp (&aux);
+  gcount[pid].total = end - start;
+  res->num = num;
+
+  return res;
+}
+
+void
+init_global_data (void)
+{
+  g_val = 0;
+  g_val2 = 0;
+  g_val3 = 0;
+  cmplock = 0;
+}
+
+void
+test_threads (int numthreads, int numprocs, unsigned long time)
+{
+  start_thread = 0;
+
+#ifdef USE_NUMA_SPINLOCK
+  lock = numa_spinlock_alloc ();
+#endif
+
+  memory_barrier ();
+
+  pthread_t thr[numthreads];
+  void *res[numthreads];
+  int i;
+
+  init_global_data ();
+  for (i = 0; i < numthreads; i++)
+    {
+      pthread_attr_t attr;
+      const pthread_attr_t *attrp = NULL;
+      if (USE_PTHREAD_ATTR_SETAFFINITY_NP)
+	{
+	  attrp = &attr;
+	  pthread_attr_init (&attr);
+	  cpu_set_t set;
+	  CPU_ZERO (&set);
+	  int cpu = i % numprocs;
+	  (void) CPU_SET (cpu, &set);
+	  pthread_attr_setaffinity_np (&attr, sizeof (cpu_set_t), &set);
+	}
+      int err_ret = pthread_create (&thr[i], attrp, ops->test,
+				    (void *)(uintptr_t) i);
+      if (err_ret != 0)
+	{
+	  printf ("pthread_create failed: %d, %s\n",
+		  i, strerror (err_ret));
+	  numthreads = i;
+	  break;
+	}
+    }
+
+  memory_barrier ();
+  start_thread = 1;
+  memory_barrier ();
+  sched_yield ();
+
+  if (time)
+    {
+      struct timespec ts =
+	{
+	  .tv_sec = time,
+	  .tv_nsec = 0
+	};
+      clock_nanosleep (CLOCK_MONOTONIC, 0, &ts, NULL);
+      memory_barrier ();
+    }
+
+  for (i = 0; i < numthreads; i++)
+    {
+      if (pthread_join (thr[i], (void *) &res[i]) == 0)
+	free (res[i]);
+      else
+	printf ("pthread_join failure: %m\n");
+    }
+
+#ifdef USE_NUMA_SPINLOCK
+  numa_spinlock_free (lock);
+#endif
+}
+
+struct ops hashwork_ops =
+{
+  .test = work_thread,
+};
+
+struct ops *ops;
+
+static struct count
+total_cost (int numthreads, int numprocs)
+{
+  int i;
+  unsigned long long total = 0;
+  unsigned long long spinlock = 0;
+
+  memset (gcount, 0, sizeof(gcount[0]) * numthreads);
+
+  unsigned long long start, end, diff;
+  unsigned int aux;
+
+  start = __builtin_ia32_rdtscp (&aux);
+  test_threads (numthreads, numprocs, constant_time);
+  end = __builtin_ia32_rdtscp (&aux);
+  diff = end - start;
+
+  for (i = 0; i < numthreads; i++)
+    {
+      total += gcount[i].total;
+      spinlock += gcount[i].spinlock;
+    }
+
+  struct count cost = { total, spinlock, diff };
+  return cost;
+}
+
+#ifdef MODULE_NAME
+static int
+do_test (void)
+{
+  if (!CPU_FEATURE_USABLE (RDTSCP))
+    return EXIT_UNSUPPORTED;
+#else
+int
+main (void)
+{
+#endif
+  int numprocs = sysconf (_SC_NPROCESSORS_ONLN);
+
+  /* Oversubscribe CPU.  */
+  int numthreads = 4 * numprocs;
+
+  ops = &hashwork_ops;
+
+  int err_ret = posix_memalign ((void **)&gcount, 4096,
+				sizeof(gcount[0]) * numthreads);
+  if (err_ret)
+    {
+      printf ("posix_memalign failure: %s\n", strerror (err_ret));
+      exit (err_ret);
+    }
+
+  struct count cost, cost1;
+  double overhead;
+  int i, last;
+  int last_increment = numprocs < 16 ? 16 : numprocs;
+  int numprocs_done = 0;
+  int numprocs_reset = 0;
+  cost1 = total_cost (1, numprocs);
+
+  printf ("Number of processors: %d, Single thread time %lld\n\n",
+	  numprocs, cost1.total);
+
+  for (last = i = 2; i <= numthreads;)
+    {
+      last = i;
+      cost = total_cost (i, numprocs);
+      overhead = cost.total;
+      overhead /= i;
+      overhead /= cost1.total;
+      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
+	      i, cost.total, overhead);
+      if ((i * 2) < numprocs)
+	i = i * 2;
+      else if (numprocs_done)
+	{
+	  if (numprocs_reset)
+	    {
+	      i = numprocs_reset;
+	      numprocs_reset = 0;
+	    }
+	  else
+	    {
+	      if ((i * 2) < numthreads)
+		i = i * 2;
+	      else
+		i = i + last_increment;
+	    }
+	}
+      else
+	{
+	  if (numprocs != 2 * i)
+	    numprocs_reset = 2 * i;
+	  i = numprocs;
+	  numprocs_done = 1;
+	}
+    }
+
+  if (last != numthreads)
+    {
+      i = numthreads;
+      cost = total_cost (i, numprocs);
+      overhead = cost.total;
+      overhead /= i;
+      overhead /= cost1.total;
+      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
+	      i, cost.total, overhead);
+    }
+
+  free (gcount);
+  return 0;
+}
+
+#ifdef MODULE_NAME
+# define TIMEOUT 900
+# include <support/test-driver.c>
+#endif
diff --git a/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
new file mode 100644
index 0000000..b3ce567
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
@@ -0,0 +1,47 @@ 
+/* Test case for spinlock overhead.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include <pthread.h>
+
+struct
+{
+  pthread_spinlock_t testlock;
+  char pad[64 - sizeof (pthread_spinlock_t)];
+} test __attribute__((aligned(64)));
+
+static void
+__attribute__((constructor))
+init_spin (void)
+{
+  pthread_spin_init (&test.testlock, 0);
+}
+
+static void work_todo (void);
+
+static inline void
+do_work (void)
+{
+  pthread_spin_lock (&test.testlock);
+  work_todo ();
+  pthread_spin_unlock (&test.testlock);
+}
+
+#include "tst-variable-overhead-skeleton.c"
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
index 931c827..e90532e 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
index c09c9b0..c74febb 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F