From patchwork Wed Mar 17 02:28:49 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Tamura <naohirot@fujitsu.com>
X-Patchwork-Id: 42660
Return-Path: <libc-alpha-bounces@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id B44203850422;
	Wed, 17 Mar 2021 02:30:30 +0000 (GMT)
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from esa10.hc1455-7.c3s2.iphmx.com (esa10.hc1455-7.c3s2.iphmx.com
 [139.138.36.225])
 by sourceware.org (Postfix) with ESMTPS id C0F243851C26
 for <libc-alpha@sourceware.org>; Wed, 17 Mar 2021 02:30:23 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org C0F243851C26
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=fujitsu.com
Authentication-Results: sourceware.org;
 spf=fail smtp.mailfrom=naohirot@fujitsu.com
IronPort-SDR: 
 SOPKq9FNNCa/b7onIECbTL7nwbQPPrRqcteKrUDQcrWTvzY2uyvn1cLfqg8yOqcR+Hsm6jTjkW
 Oqc83vS4RNy8AaIe8sncIbnFiYj5m/FpbtWeXxAPQSLyUTjaaIzmbyIE95LdW8J8ucZ1yQRUan
 6+h5iFTPMxMqdt6QP+CZVo/9KGMU9+AAaNal5ROuIqfTt+HWB76nqtDOvNHIVUqllPJvqT34CW
 UxHKYWpdU227O9Eg4ZWyfcTSqSh23lsECq9oPQ5YdoTe5qgtfqpB8yAI8rENI23vxcLyu3ZtfF
 ZFg=
X-IronPort-AV: E=McAfee;i="6000,8403,9925"; a="11135634"
X-IronPort-AV: E=Sophos;i="5.81,254,1610377200"; d="scan'208";a="11135634"
Received: from unknown (HELO oym-r3.gw.nic.fujitsu.com) ([210.162.30.91])
 by esa10.hc1455-7.c3s2.iphmx.com with ESMTP; 17 Mar 2021 11:30:20 +0900
Received: from oym-m4.gw.nic.fujitsu.com (oym-nat-oym-m4.gw.nic.fujitsu.com
 [192.168.87.61])
 by oym-r3.gw.nic.fujitsu.com (Postfix) with ESMTP id 74EF11FB301
 for <libc-alpha@sourceware.org>; Wed, 17 Mar 2021 11:30:21 +0900 (JST)
Received: from m3051.s.css.fujitsu.com (m3051.s.css.fujitsu.com
 [10.134.21.209])
 by oym-m4.gw.nic.fujitsu.com (Postfix) with ESMTP id A70124498E7
 for <libc-alpha@sourceware.org>; Wed, 17 Mar 2021 11:30:20 +0900 (JST)
Received: from bionic.lxd (unknown [10.126.53.116])
 by m3051.s.css.fujitsu.com (Postfix) with ESMTP id 8D28793;
 Wed, 17 Mar 2021 11:30:20 +0900 (JST)
From: Naohiro Tamura <naohirot@fujitsu.com>
To: libc-alpha@sourceware.org
Subject: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Wed, 17 Mar 2021 02:28:49 +0000
Message-Id: <20210317022849.323046-1-naohirot@fujitsu.com>
X-Mailer: git-send-email 2.17.1
X-TM-AS-GCONF: 00
X-Spam-Status: No, score=-2.2 required=5.0 tests=BAYES_00, KAM_DMARC_STATUS,
 SPF_HELO_PASS, SPF_NEUTRAL,
 TXREP autolearn=no autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
Errors-To: libc-alpha-bounces@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org>

Fujitsu is in the process of signing the copyright assignment paper.
We'd like to have some feedback in advance.

This patch series optimizes the performance of
memcpy/memmove/memset for A64FX [1], which implements ARMv8-A SVE and
has a 64KB L1 cache per core and an 8MB L2 cache per NUMA node.

The first patch updates the autoconf script to check whether the
assembler is capable of ARMv8-A SVE code generation, and to define
the HAVE_SVE_ASM_SUPPORT macro accordingly.
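
As a rough illustration of what such a probe does, the sketch below
assembles a single SVE instruction and reports whether the toolchain
accepted it. It is not the configure.ac code from the patch: the
function names, the probe instruction (ptrue), the default compiler
driver "cc", and the -march value are all assumptions chosen for the
example.

```python
import os
import subprocess
import tempfile

# One SVE-only instruction; an assembler without SVE support rejects it.
# (Assumed probe, not necessarily the one used in the patch.)
SVE_PROBE = "\tptrue p0.b\n"


def assembler_supports_sve(cc=("cc",), march="armv8.2-a+sve"):
    """Return True if `cc` assembles an SVE instruction, else False.

    `cc` and `march` are illustrative defaults, not values from the patch.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "conftest.s")
        obj = os.path.join(tmp, "conftest.o")
        with open(src, "w") as f:
            f.write(SVE_PROBE)
        cmd = list(cc) + ["-c", "-march=" + march, src, "-o", obj]
        try:
            result = subprocess.run(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
        except OSError:
            # Compiler driver missing or not executable: treat as unsupported.
            return False
        return result.returncode == 0


def config_h_line(supported):
    """Render the config.h line a configure run would emit for the result."""
    if supported:
        return "#define HAVE_SVE_ASM_SUPPORT 1"
    return "/* #undef HAVE_SVE_ASM_SUPPORT */"
```

On a toolchain without SVE support the probe returns False and the
macro stays undefined, so SVE-only sources can be skipped at build
time.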

The second patch optimizes memcpy/memmove performance. It uses the
scalable vector registers together with several techniques such as
loop unrolling, memory access alignment, cache zero fill, prefetch,
and software pipelining.

The third patch optimizes memset performance. It uses the
scalable vector registers together with several techniques such as
loop unrolling, memory access alignment, cache zero fill, and
prefetch.
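
Two of the techniques listed for the memcpy/memmove and memset
patches, memory access alignment and loop unrolling, come down to
splitting a request into an unaligned head, an unrolled main loop,
leftover whole vectors, and a partial tail. The sketch below only
models that arithmetic. The function name and the unroll factor of 8
are hypothetical; the 64-byte default corresponds to the 512-bit
vector length mentioned in this series. It is not taken from
memcpy_a64fx.S or memset_a64fx.S.

```python
def split_for_aligned_unrolled_loop(dst_addr, n, vl=64, unroll=8):
    """Split an n-byte store into (head, blocks, vectors, tail).

    head:    bytes stored first so the destination reaches a vl boundary
    blocks:  iterations of the unrolled loop, each storing unroll * vl bytes
    vectors: leftover whole-vector stores after the unrolled loop
    tail:    final partial vector, in bytes
    """
    # Distance from dst_addr up to the next multiple of vl (0 if aligned).
    head = min(n, (-dst_addr) % vl)
    rest = n - head
    blocks, rest = divmod(rest, unroll * vl)
    vectors, tail = divmod(rest, vl)
    return head, blocks, vectors, tail


# 5000 bytes starting 16 bytes past a 64-byte boundary.
print(split_for_aligned_unrolled_loop(0x1010, 5000))  # (48, 9, 5, 24)
```

The four parts always add up to n, so every byte is stored exactly
once regardless of the starting alignment.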

The fourth patch is a test helper script that changes the vector
length for a child process. This script can be used as the
test-wrapper for 'make check'.
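
For readers unfamiliar with how a wrapper can change the vector
length of a child process, here is a minimal sketch based on the
Linux prctl(PR_SVE_SET_VL) interface. It is not scripts/vltest.py
from this series; the function names are hypothetical. The constants
are those defined in <linux/prctl.h>, and the 128..2048 bit range is
the architectural SVE range.

```python
import ctypes
import os

# Values from the Linux <linux/prctl.h> header.
PR_SVE_SET_VL = 50
PR_SVE_VL_LEN_MASK = 0xFFFF
PR_SVE_VL_INHERIT = 1 << 17  # keep the vector length across execve()


def sve_set_vl_arg(vl_bits):
    """Build the PR_SVE_SET_VL argument for a vector length given in bits."""
    if vl_bits % 128 != 0 or not 128 <= vl_bits <= 2048:
        raise ValueError("SVE vector length must be a multiple of 128 bits "
                         "in the range 128..2048")
    vl_bytes = vl_bits // 8  # the kernel interface takes the length in bytes
    return (vl_bytes & PR_SVE_VL_LEN_MASK) | PR_SVE_VL_INHERIT


def run_with_vl(vl_bits, argv):
    """Set the vector length, then replace this process with argv.

    Only meaningful on an AArch64 Linux kernel with SVE support;
    elsewhere prctl fails and OSError is raised.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    arg = ctypes.c_ulong(sve_set_vl_arg(vl_bits))
    if libc.prctl(PR_SVE_SET_VL, arg) < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    os.execvp(argv[0], argv)
```

A call such as run_with_vl(256, ["./some-test"]) (a placeholder test
binary) would then run that program with a 256-bit vector length,
provided the hardware supports it; the kernel otherwise selects the
largest supported length not exceeding the request.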

The fifth patch adds generic_memcpy and generic_memmove to
bench-memcpy-large.c and bench-memmove-large.c respectively, so that
the 512-bit scalable vector register can be compared with the scalar
64-bit register consistently across the default and large benchtests
for memcpy/memmove/memset.


The SVE assembler code for memcpy/memmove/memset is implemented as
Vector Length Agnostic code, so theoretically it can run on any SoC
that supports the ARMv8-A SVE standard.
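
To illustrate what Vector Length Agnostic means in practice, the
sketch below models a predicated copy loop in which the vector length
is a run-time value rather than a constant baked into the code. It is
a Python model, not the assembler from this series; vla_copy and its
parameters are hypothetical names, and the comments only point at the
SVE instructions (CNTB, WHILELO, INCB) that play the corresponding
roles.

```python
def vla_copy(dst, src, n, vl_bytes):
    """Copy n bytes with a loop that never hard-codes the vector length.

    vl_bytes stands in for the value SVE code reads at run time
    (for example with CNTB).
    """
    i = 0
    while i < n:
        # Predicate: only lanes that fall below n are active (WHILELO-like),
        # so the final partial vector needs no separate scalar tail loop.
        active = min(vl_bytes, n - i)
        dst[i:i + active] = src[i:i + active]  # predicated load and store
        i += vl_bytes                          # advance one vector (INCB-like)
    return dst


# The same loop gives the same result for any vector length.
data = bytes(range(100))
for vl in (16, 32, 64, 256):
    assert vla_copy(bytearray(100), data, 100, vl) == data
```

Because the loop bound and the predicate are both derived from the
run-time vector length, the same code is correct on a 128-bit and on
a 512-bit implementation; only the number of iterations changes.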

We confirmed that all test cases pass by running 'make check' and
'make xcheck' not only on A64FX but also on ThunderX2.

We also confirmed, by running 'make bench', that the SVE 512-bit
vector register performance is roughly 4 times better than the
Advanced SIMD 128-bit register and 8 times better than the scalar
64-bit register.
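
These factors coincide with the ratios of the register widths. As a
quick check of that arithmetic (it compares widths only and models
nothing about caches or the pipeline; width_ratio is a name made up
for this example):

```python
SVE_BITS = 512     # A64FX SVE vector register width
ASIMD_BITS = 128   # Advanced SIMD register width
SCALAR_BITS = 64   # general-purpose register width


def width_ratio(wide_bits, narrow_bits):
    """Bytes moved per register by the wide one relative to the narrow one."""
    return wide_bits / narrow_bits


print(width_ratio(SVE_BITS, ASIMD_BITS))   # 4.0
print(width_ratio(SVE_BITS, SCALAR_BITS))  # 8.0
```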

[1] https://github.com/fujitsu/A64FX


Naohiro Tamura (5):
  config: Added HAVE_SVE_ASM_SUPPORT for aarch64
  aarch64: Added optimized memcpy and memmove for A64FX
  aarch64: Added optimized memset for A64FX
  scripts: Added Vector Length Set test helper script
  benchtests: Added generic_memcpy and generic_memmove to large
    benchtests

 benchtests/bench-memcpy-large.c               |   9 +
 benchtests/bench-memmove-large.c              |   9 +
 config.h.in                                   |   3 +
 manual/tunables.texi                          |   3 +-
 scripts/vltest.py                             |  82 ++
 sysdeps/aarch64/configure                     |  28 +
 sysdeps/aarch64/configure.ac                  |  15 +
 sysdeps/aarch64/multiarch/Makefile            |   3 +-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c   |  17 +-
 sysdeps/aarch64/multiarch/init-arch.h         |   4 +-
 sysdeps/aarch64/multiarch/memcpy.c            |  12 +-
 sysdeps/aarch64/multiarch/memcpy_a64fx.S      | 979 ++++++++++++++++++
 sysdeps/aarch64/multiarch/memmove.c           |  12 +-
 sysdeps/aarch64/multiarch/memset.c            |  11 +-
 sysdeps/aarch64/multiarch/memset_a64fx.S      | 574 ++++++++++
 .../unix/sysv/linux/aarch64/cpu-features.c    |   4 +
 .../unix/sysv/linux/aarch64/cpu-features.h    |   4 +
 17 files changed, 1759 insertions(+), 10 deletions(-)
 create mode 100755 scripts/vltest.py
 create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
 create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S