From patchwork Wed Mar 17 02:28:49 2021
X-Patchwork-Submitter: Naohiro Tamura
X-Patchwork-Id: 42660
From: Naohiro Tamura
To: libc-alpha@sourceware.org
Subject: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Wed, 17 Mar 2021 02:28:49 +0000
Message-Id: <20210317022849.323046-1-naohirot@fujitsu.com>
List-Id: Libc-alpha mailing list

Fujitsu is in the process of signing the copyright assignment
paperwork. In the meantime, we would like to get some feedback in
advance.

This patch series optimizes the performance of memcpy/memmove/memset
for A64FX [1], which implements ARMv8-A SVE and has a 64KB L1 cache
per core and an 8MB L2 cache per NUMA node.

The first patch updates the autoconf checks to test whether the
assembler can generate ARMv8-A SVE code and, if so, defines the
HAVE_SVE_ASM_SUPPORT macro.

The second patch optimizes memcpy/memmove performance. It uses the
scalable vector registers together with several techniques: loop
unrolling, memory access alignment, cache zero fill, prefetch, and
software pipelining.

The third patch optimizes memset performance. It uses the scalable
vector registers together with several techniques: loop unrolling,
memory access alignment, cache zero fill, and prefetch.

The fourth patch is a test helper script that changes the vector
length for a child process.
This script can be used as the test-wrapper for 'make check'.

The fifth patch adds generic_memcpy and generic_memmove to
bench-memcpy-large.c and bench-memmove-large.c respectively, so that
the performance of the 512-bit scalable vector registers can be
compared with that of the 64-bit scalar registers consistently across
the default and large benchtests for memcpy/memmove/memset.

The SVE assembler code for memcpy/memmove/memset is written as Vector
Length Agnostic code, so in theory it can run on any SoC that supports
the ARMv8-A SVE standard.

We confirmed that all test cases pass under 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.
By running 'make bench', we also confirmed that performance with the
512-bit SVE vector registers is roughly 4 times better than with the
128-bit Advanced SIMD registers and roughly 8 times better than with
the 64-bit scalar registers.

[1] https://github.com/fujitsu/A64FX

Naohiro Tamura (5):
  config: Added HAVE_SVE_ASM_SUPPORT for aarch64
  aarch64: Added optimized memcpy and memmove for A64FX
  aarch64: Added optimized memset for A64FX
  scripts: Added Vector Length Set test helper script
  benchtests: Added generic_memcpy and generic_memmove to large
    benchtests

 benchtests/bench-memcpy-large.c               |   9 +
 benchtests/bench-memmove-large.c              |   9 +
 config.h.in                                   |   3 +
 manual/tunables.texi                          |   3 +-
 scripts/vltest.py                             |  82 ++
 sysdeps/aarch64/configure                     |  28 +
 sysdeps/aarch64/configure.ac                  |  15 +
 sysdeps/aarch64/multiarch/Makefile            |   3 +-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c   |  17 +-
 sysdeps/aarch64/multiarch/init-arch.h         |   4 +-
 sysdeps/aarch64/multiarch/memcpy.c            |  12 +-
 sysdeps/aarch64/multiarch/memcpy_a64fx.S      | 979 ++++++++++++++++++
 sysdeps/aarch64/multiarch/memmove.c           |  12 +-
 sysdeps/aarch64/multiarch/memset.c            |  11 +-
 sysdeps/aarch64/multiarch/memset_a64fx.S      | 574 ++++++++++
 .../unix/sysv/linux/aarch64/cpu-features.c    |   4 +
 .../unix/sysv/linux/aarch64/cpu-features.h    |   4 +
 17 files changed, 1759 insertions(+), 10 deletions(-)
 create mode 100755 scripts/vltest.py
 create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
 create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S
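
For readers unfamiliar with how a wrapper can change the vector length
for a child process, as the fourth patch does, the following is a
minimal sketch only. It is not the contents of scripts/vltest.py. It
assumes the Linux arm64 prctl interface (PR_SVE_SET_VL with the
PR_SVE_VL_INHERIT flag, constants as defined in <linux/prctl.h>), and
the function names sve_vl_arg and run_with_vl are hypothetical. The
16..256 byte range check reflects the architectural SVE limits of 128
to 2048 bits.

```python
import ctypes
import os

# Constants from Linux <linux/prctl.h> (arm64 SVE vector length control).
PR_SVE_SET_VL = 50
PR_SVE_VL_INHERIT = 1 << 17  # keep the vector length across execve()


def sve_vl_arg(vl_bytes, inherit=True):
    """Build the PR_SVE_SET_VL argument for a vector length in bytes.

    Hypothetical helper: validates against the architectural SVE range
    (multiples of 16 bytes, from 16 to 256 bytes).
    """
    if vl_bytes % 16 != 0 or not 16 <= vl_bytes <= 256:
        raise ValueError("vector length must be a multiple of 16 in [16, 256]")
    return vl_bytes | (PR_SVE_VL_INHERIT if inherit else 0)


def run_with_vl(vl_bytes, argv):
    """Set the SVE vector length, then replace this process with argv.

    Hypothetical helper: only meaningful on an arm64 Linux kernel with
    SVE support; elsewhere prctl() fails and OSError is raised.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.prctl(PR_SVE_SET_VL, ctypes.c_ulong(sve_vl_arg(vl_bytes)))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # The INHERIT flag makes the new vector length survive this exec.
    os.execvp(argv[0], argv)
```

Under these assumptions, run_with_vl(32, ["./some-test"]) would
request a 256-bit vector length and then exec the test program. The
kernel may select a shorter length if the requested one is not
supported by the hardware.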