From patchwork Tue Feb 7 00:16:09 2023
X-Patchwork-Submitter: Christoph Müllner
X-Patchwork-Id: 64394
From: Christoph Müllner <christoph.muellner@vrull.eu>
To: libc-alpha@sourceware.org, Palmer Dabbelt, Darius Rad, Andrew Waterman,
 DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
 Heiko Stuebner
Cc: Christoph Müllner
Subject: [RFC PATCH 10/19] riscv: Add accelerated memset routines for RV64
Date: Tue, 7 Feb 2023 01:16:09 +0100
Message-Id: <20230207001618.458947-11-christoph.muellner@vrull.eu>
In-Reply-To: <20230207001618.458947-1-christoph.muellner@vrull.eu>
References: <20230207001618.458947-1-christoph.muellner@vrull.eu>
X-Mailer: git-send-email 2.39.1

The implementation of memset() can be accelerated by loop unrolling, fast
unaligned accesses, and cbo.zero.  Let's provide an implementation that
supports all three, with the cbo.zero path being optional and only used
when the cache-block size is 64 bytes.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
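Note for reviewers (not part of the commit message): the ifunc resolver in
sysdeps/riscv/multiarch/memset.c below selects one of three implementations.
As a rough C sketch of the selection order — select_memset is an invented
name for illustration only, and the IS_RV64/HAVE_FAST_UNALIGNED/HAVE_RV/
HAVE_CBOZ_BLOCKSIZE probe helpers are provided elsewhere in this series:

    /* Illustrative only: mirrors the libc_ifunc expression below.  */
    static __typeof (memset) *
    select_memset (void)
    {
      if (IS_RV64() && HAVE_FAST_UNALIGNED()
          && HAVE_RV(zicboz) && HAVE_CBOZ_BLOCKSIZE(64))
        /* Unrolled unaligned stores plus cbo.zero for large zeroing.  */
        return __memset_rv64_unaligned_cboz64;
      if (IS_RV64() && HAVE_FAST_UNALIGNED())
        /* Unrolled unaligned stores only.  */
        return __memset_rv64_unaligned;
      /* Generic fallback.  */
      return __memset_generic;
    }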
 sysdeps/riscv/multiarch/Makefile              |   4 +-
 sysdeps/riscv/multiarch/ifunc-impl-list.c     |   4 +
 sysdeps/riscv/multiarch/memset.c              |  12 +
 .../riscv/multiarch/memset_rv64_unaligned.S   |  31 ++
 .../multiarch/memset_rv64_unaligned_cboz64.S  | 218 ++++++++++++++++++
 5 files changed, 268 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 453f0f4e4c..6e8ebb42d8 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -1,4 +1,6 @@
 ifeq ($(subdir),string)
 sysdep_routines += \
-  memset_generic
+  memset_generic \
+  memset_rv64_unaligned \
+  memset_rv64_unaligned_cboz64
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index fd1752bc46..e878977b73 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -36,6 +36,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   size_t i = 0;
 
   IFUNC_IMPL (i, name, memset,
+#if __riscv_xlen == 64
+	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_rv64_unaligned_cboz64)
+	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_rv64_unaligned)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
 
   return i;
diff --git a/sysdeps/riscv/multiarch/memset.c b/sysdeps/riscv/multiarch/memset.c
index ae4289ab03..7ba10dd3da 100644
--- a/sysdeps/riscv/multiarch/memset.c
+++ b/sysdeps/riscv/multiarch/memset.c
@@ -31,7 +31,19 @@
 extern __typeof (__redirect_memset) __libc_memset;
 extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
 
+#if __riscv_xlen == 64
+extern __typeof (__redirect_memset) __memset_rv64_unaligned_cboz64 attribute_hidden;
+extern __typeof (__redirect_memset) __memset_rv64_unaligned attribute_hidden;
+
+libc_ifunc (__libc_memset,
+	    (IS_RV64() && HAVE_FAST_UNALIGNED() && HAVE_RV(zicboz) && HAVE_CBOZ_BLOCKSIZE(64)
+	     ? __memset_rv64_unaligned_cboz64
+	     : (IS_RV64() && HAVE_FAST_UNALIGNED()
+		? __memset_rv64_unaligned
+		: __memset_generic)));
+#else
 libc_ifunc (__libc_memset, __memset_generic);
+#endif
 
 # undef memset
 strong_alias (__libc_memset, memset);
diff --git a/sysdeps/riscv/multiarch/memset_rv64_unaligned.S b/sysdeps/riscv/multiarch/memset_rv64_unaligned.S
new file mode 100644
index 0000000000..561e564b42
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memset_rv64_unaligned.S
@@ -0,0 +1,31 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+#ifndef MEMSET
+# define MEMSET __memset_rv64_unaligned
+#endif
+
+#undef CBO_ZERO_THRESHOLD
+#define CBO_ZERO_THRESHOLD 0
+
+/* Assumptions: rv64i unaligned accesses.  */
+
+#include "./memset_rv64_unaligned_cboz64.S"
diff --git a/sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S b/sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
new file mode 100644
index 0000000000..710bb41e44
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
@@ -0,0 +1,218 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if __riscv_xlen == 64
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+#define dstin   a0
+#define val     a1
+#define count   a2
+#define dst     a3
+#define dstend  a4
+#define tmp1    a5
+
+#ifndef MEMSET
+# define MEMSET __memset_rv64_unaligned_cboz64
+#endif
+
+/* cbo.zero can be used to improve the performance of memset-zero.
+ * However, the performance gain depends on the amount of data
+ * to be cleared.  This threshold sets the minimum number of bytes
+ * at which the cbo.zero loop is enabled.
+ * To disable cbo.zero, set this threshold to 0.  */
+#ifndef CBO_ZERO_THRESHOLD
+# define CBO_ZERO_THRESHOLD 128
+#endif
+
+/* Assumptions:
+ * rv64i_zicboz, 64 byte cbo.zero block size, unaligned accesses.  */
+
+ENTRY_ALIGN (MEMSET, 6)
+
+        /* Zero-extend the fill byte, then repeat it over the register.  */
+        andi    val, val, 0xff
+        slli    tmp1, val, 8
+        or      val, tmp1, val
+        slli    tmp1, val, 16
+        or      val, tmp1, val
+        slli    tmp1, val, 32
+        or      val, tmp1, val
+
+        /* Calculate the end position.  */
+        add     dstend, dstin, count
+
+        /* Decide how to process.  */
+        li      tmp1, 96
+        bgtu    count, tmp1, L(set_long)
+        li      tmp1, 16
+        bgtu    count, tmp1, L(set_medium)
+
+        /* Set 0..16 bytes.  */
+        li      tmp1, 8
+        bltu    count, tmp1, 1f
+        /* Set 8..16 bytes.  */
+        sd      val, 0(dstin)
+        sd      val, -8(dstend)
+        ret
+
+        .p2align 3
+        /* Set 0..7 bytes.  */
+1:      li      tmp1, 4
+        bltu    count, tmp1, 2f
+        /* Set 4..7 bytes.  */
+        sw      val, 0(dstin)
+        sw      val, -4(dstend)
+        ret
+
+        /* Set 0..3 bytes.  */
+2:      beqz    count, 3f
+        sb      val, 0(dstin)
+        li      tmp1, 2
+        bltu    count, tmp1, 3f
+        sh      val, -2(dstend)
+3:      ret
+
+        .p2align 3
+        /* Set 17..96 bytes.  */
+L(set_medium):
+        sd      val, 0(dstin)
+        sd      val, 8(dstin)
+        li      tmp1, 64
+        bgtu    count, tmp1, L(set96)
+        sd      val, -16(dstend)
+        sd      val, -8(dstend)
+        li      tmp1, 32
+        bleu    count, tmp1, 1f
+        sd      val, 16(dstin)
+        sd      val, 24(dstin)
+        sd      val, -32(dstend)
+        sd      val, -24(dstend)
+1:      ret
+
+        .p2align 4
+        /* Set 65..96 bytes.  Write 64 bytes from the start and
+           32 bytes from the end.  */
+L(set96):
+        sd      val, 16(dstin)
+        sd      val, 24(dstin)
+        sd      val, 32(dstin)
+        sd      val, 40(dstin)
+        sd      val, 48(dstin)
+        sd      val, 56(dstin)
+        sd      val, -32(dstend)
+        sd      val, -24(dstend)
+        sd      val, -16(dstend)
+        sd      val, -8(dstend)
+        ret
+
+        .p2align 4
+        /* Set 97+ bytes.  */
+L(set_long):
+        /* Store 16 bytes unaligned.  */
+        sd      val, 0(dstin)
+        sd      val, 8(dstin)
+
+#if CBO_ZERO_THRESHOLD
+        li      tmp1, CBO_ZERO_THRESHOLD
+        blt     count, tmp1, 1f
+        beqz    val, L(cbo_zero_64)
+1:
+#endif
+
+        /* Round down to the previous 16 byte boundary (keep offset of 16).  */
+        andi    dst, dstin, -16
+
+        /* Calculate the loop termination position.  */
+        addi    tmp1, dstend, -(16+64)
+
+        /* Store 64 bytes in a loop.  */
+        .p2align 4
+1:      sd      val, 16(dst)
+        sd      val, 24(dst)
+        sd      val, 32(dst)
+        sd      val, 40(dst)
+        sd      val, 48(dst)
+        sd      val, 56(dst)
+        sd      val, 64(dst)
+        sd      val, 72(dst)
+        addi    dst, dst, 64
+        bltu    dst, tmp1, 1b
+
+        /* Calculate the remainder (dst is 16 bytes behind the area set so far).  */
+        sub     count, dstend, dst
+
+        /* Check if more than 32 bytes are left to set.  */
+        li      tmp1, (32+16)
+        ble     count, tmp1, 1f
+        sd      val, 16(dst)
+        sd      val, 24(dst)
+        sd      val, 32(dst)
+        sd      val, 40(dst)
+1:      sd      val, -32(dstend)
+        sd      val, -24(dstend)
+        sd      val, -16(dstend)
+        sd      val, -8(dstend)
+        ret
+
+#if CBO_ZERO_THRESHOLD
+        .option push
+        .option arch,+zicboz
+        .p2align 3
+L(cbo_zero_64):
+        /* Finish the first 64 bytes; bytes 0..15 were set in L(set_long).  */
+        sd      val, 16(dstin)
+        sd      val, 24(dstin)
+        sd      val, 32(dstin)
+        sd      val, 40(dstin)
+        sd      val, 48(dstin)
+        sd      val, 56(dstin)
+
+        /* Round up to the next 64 byte boundary.  */
+        andi    dst, dstin, -64
+        addi    dst, dst, 64
+
+        /* Calculate the loop termination position.  */
+        addi    tmp1, dstend, -64
+
+        /* cbo.zero sets 64 bytes each time.  */
+        .p2align 4
+1:      cbo.zero (dst)
+        addi    dst, dst, 64
+        bltu    dst, tmp1, 1b
+
+        sub     count, dstend, dst
+        li      tmp1, 32
+        ble     count, tmp1, 1f
+        sd      val, 0(dst)
+        sd      val, 8(dst)
+        sd      val, 16(dst)
+        sd      val, 24(dst)
+1:      sd      val, -32(dstend)
+        sd      val, -24(dstend)
+        sd      val, -16(dstend)
+        sd      val, -8(dstend)
+        ret
+        .option pop
+#endif /* CBO_ZERO_THRESHOLD */
+
+END (MEMSET)
+libc_hidden_builtin_def (MEMSET)
+
+#endif /* __riscv_xlen == 64 */
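
--
Postscript for reviewers: the large-size zeroing strategy of L(set_long) +
L(cbo_zero_64) can be summarized in C.  This model is only meant to make the
review easier — memset_zero_model is an invented name, and the real code works
on registers, not pointers:

    #include <stdint.h>
    #include <string.h>

    /* C model of the cbo.zero path: set the head and tail with unaligned
       stores and clear the 64-byte-aligned middle with cbo.zero blocks.
       Assumes n >= CBO_ZERO_THRESHOLD (128) and a fill value of 0, as
       checked in L(set_long).  */
    static void
    memset_zero_model (char *s, size_t n)
    {
      char *end = s + n;
      memset (s, 0, 64);                    /* head: bytes 0..63, unaligned ok */
      char *p = (char *) (((uintptr_t) s & ~(uintptr_t) 63) + 64);
      do                                    /* the asm loop is do-while */
        {
          memset (p, 0, 64);                /* cbo.zero (dst) in the real code */
          p += 64;
        }
      while (p < end - 64);
      if (end - p > 32)                     /* up to 64 bytes may remain */
        memset (p, 0, 32);
      memset (end - 32, 0, 32);             /* tail stores, may overlap */
    }

Because this path is only entered for count >= CBO_ZERO_THRESHOLD (128), the
first, unconditional cbo.zero starts at most 64 bytes past dstin and therefore
never writes beyond dstend; a smaller threshold would break that invariant.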