From patchwork Tue Feb 7 00:16:11 2023
X-Patchwork-Submitter: Christoph Müllner
X-Patchwork-Id: 64391
From: Christoph Muellner
To: libc-alpha@sourceware.org, Palmer Dabbelt, Darius Rad, Andrew Waterman,
    DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
    Heiko Stuebner
Cc: Christoph Müllner
Subject: [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64
Date: Tue, 7 Feb 2023 01:16:11 +0100
Message-Id: <20230207001618.458947-13-christoph.muellner@vrull.eu>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20230207001618.458947-1-christoph.muellner@vrull.eu>
References: <20230207001618.458947-1-christoph.muellner@vrull.eu>

From: Christoph Müllner

The implementation of memcpy()/memmove() can be accelerated by loop
unrolling and fast unaligned accesses.  Let's provide an implementation
that is optimized accordingly.
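As a rough illustration of the idea, here is a C sketch (illustrative
only; the function name is made up, and the real routine below is
hand-written assembly with dedicated paths for small sizes and for the
tail of the buffer):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy forward in 32-byte chunks using 64-bit accesses.  The accesses
   may be unaligned, which is why the ifunc below only selects this
   variant when unaligned accesses are known to be fast.  */
static void
copy_forward_sketch (unsigned char *dst, const unsigned char *src, size_t n)
{
  while (n >= 32)
    {
      uint64_t a, b, c, d;
      memcpy (&a, src, 8);       /* typically compiles to one (possibly
                                    unaligned) ld                       */
      memcpy (&b, src + 8, 8);
      memcpy (&c, src + 16, 8);
      memcpy (&d, src + 24, 8);
      memcpy (dst, &a, 8);       /* typically compiles to one sd        */
      memcpy (dst + 8, &b, 8);
      memcpy (dst + 16, &c, 8);
      memcpy (dst + 24, &d, 8);
      src += 32;
      dst += 32;
      n -= 32;
    }
  while (n-- > 0)                /* byte tail; the assembly instead copies
                                    the tail with word accesses relative
                                    to the buffer end                    */
    *dst++ = *src++;
}
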
Signed-off-by: Christoph Müllner
---
 sysdeps/riscv/multiarch/Makefile              |   2 +
 sysdeps/riscv/multiarch/ifunc-impl-list.c     |   6 +
 sysdeps/riscv/multiarch/memcpy.c              |   9 +
 .../riscv/multiarch/memcpy_rv64_unaligned.S   | 475 ++++++++++++++++++
 sysdeps/riscv/multiarch/memmove.c             |   9 +
 5 files changed, 501 insertions(+)
 create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 6bc20c4fe0..b08d7d1c8b 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -2,6 +2,8 @@ ifeq ($(subdir),string)
 sysdep_routines += \
   memcpy_generic \
   memmove_generic \
+  memcpy_rv64_unaligned \
+  \
   memset_generic \
   memset_rv64_unaligned \
   memset_rv64_unaligned_cboz64

diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index 16e4d7137f..84b3eb25a4 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -36,9 +36,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   size_t i = 0;
 
   IFUNC_IMPL (i, name, memcpy,
+#if __riscv_xlen == 64
+              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_rv64_unaligned)
+#endif
              IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
 
   IFUNC_IMPL (i, name, memmove,
+#if __riscv_xlen == 64
+              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_rv64_unaligned)
+#endif
              IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
 
   IFUNC_IMPL (i, name, memset,

diff --git a/sysdeps/riscv/multiarch/memcpy.c b/sysdeps/riscv/multiarch/memcpy.c
index cc9185912a..68ac9bbe35 100644
--- a/sysdeps/riscv/multiarch/memcpy.c
+++ b/sysdeps/riscv/multiarch/memcpy.c
@@ -31,7 +31,16 @@
 extern __typeof (__redirect_memcpy) __libc_memcpy;
 
 extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
+#if __riscv_xlen == 64
+extern __typeof (__redirect_memcpy) __memcpy_rv64_unaligned attribute_hidden;
+
+libc_ifunc (__libc_memcpy,
+            (IS_RV64() && HAVE_FAST_UNALIGNED()
+             ? __memcpy_rv64_unaligned
+             : __memcpy_generic));
+#else
 libc_ifunc (__libc_memcpy, __memcpy_generic);
+#endif
 
 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);

diff --git a/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S b/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
new file mode 100644
index 0000000000..372cd0baea
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
@@ -0,0 +1,475 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if __riscv_xlen == 64
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+#define dst    a0
+#define src    a1
+#define count  a2
+#define srcend a3
+#define dstend a4
+#define tmp1   a5
+#define dst2   t6
+
+#define A_l    a6
+#define A_h    a7
+#define B_l    t0
+#define B_h    t1
+#define C_l    t2
+#define C_h    t3
+#define D_l    t4
+#define D_h    t5
+#define E_l    tmp1
+#define E_h    count
+#define F_l    dst2
+#define F_h    srcend
+
+#ifndef MEMCPY
+# define MEMCPY __memcpy_rv64_unaligned
+#endif
+
+#ifndef MEMMOVE
+# define MEMMOVE __memmove_rv64_unaligned
+#endif
+
+#ifndef COPY97_128
+# define COPY97_128 1
+#endif
+
+/* Assumptions: rv64i, unaligned accesses.  */
+
+/* memcpy/memmove is implemented by unrolling copy loops.
+   We have two strategies:
+   1) copy from front/start to back/end ("forward")
+   2) copy from back/end to front/start ("backward")
+   In case of memcpy(), the strategy does not matter for correctness.
+   For memmove() and overlapping buffers we need to use the following strategy:
+   if dst < src && src-dst < count -> copy from front to back
+   if src < dst && dst-src < count -> copy from back to front  */
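+
+/* Example: with src = 0x1000, dst = 0x1008 and count = 32 the buffers
+   overlap and dst-src = 8 < count, so memmove() must copy from back to
+   front.  */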
+
+ENTRY_ALIGN (MEMCPY, 6)
+        /* Calculate the end position.  */
+        add     srcend, src, count
+        add     dstend, dst, count
+
+        /* Decide how to process.  */
+        li      tmp1, 96
+        bgtu    count, tmp1, L(copy_long_forward)
+        li      tmp1, 32
+        bgtu    count, tmp1, L(copy33_96)
+        li      tmp1, 16
+        bleu    count, tmp1, L(copy0_16)
+
+        /* Copy 17-32 bytes.  */
+        ld      A_l, 0(src)
+        ld      A_h, 8(src)
+        ld      B_l, -16(srcend)
+        ld      B_h, -8(srcend)
+        sd      A_l, 0(dst)
+        sd      A_h, 8(dst)
+        sd      B_l, -16(dstend)
+        sd      B_h, -8(dstend)
+        ret
+
+L(copy0_16):
+        li      tmp1, 8
+        bleu    count, tmp1, L(copy0_8)
+        /* Copy 9-16 bytes.  */
+        ld      A_l, 0(src)
+        ld      A_h, -8(srcend)
+        sd      A_l, 0(dst)
+        sd      A_h, -8(dstend)
+        ret
+
+        .p2align 3
+L(copy0_8):
+        li      tmp1, 4
+        bleu    count, tmp1, L(copy0_4)
+        /* Copy 5-8 bytes.  */
+        lw      A_l, 0(src)
+        lw      B_l, -4(srcend)
+        sw      A_l, 0(dst)
+        sw      B_l, -4(dstend)
+        ret
+
+L(copy0_4):
+        li      tmp1, 2
+        bleu    count, tmp1, L(copy0_2)
+        /* Copy 3-4 bytes.  */
+        lh      A_l, 0(src)
+        lh      B_l, -2(srcend)
+        sh      A_l, 0(dst)
+        sh      B_l, -2(dstend)
+        ret
+
+L(copy0_2):
+        li      tmp1, 1
+        bleu    count, tmp1, L(copy0_1)
+        /* Copy 2 bytes.  */
+        lh      A_l, 0(src)
+        sh      A_l, 0(dst)
+        ret
+
+L(copy0_1):
+        beqz    count, L(copy0)
+        /* Copy 1 byte.  */
+        lb      A_l, 0(src)
+        sb      A_l, 0(dst)
+L(copy0):
+        ret
+
+        .p2align 4
+L(copy33_96):
+        /* Copy 33-96 bytes.  */
+        ld      A_l, 0(src)
+        ld      A_h, 8(src)
+        ld      B_l, 16(src)
+        ld      B_h, 24(src)
+        ld      C_l, -32(srcend)
+        ld      C_h, -24(srcend)
+        ld      D_l, -16(srcend)
+        ld      D_h, -8(srcend)
+
+        li      tmp1, 64
+        bgtu    count, tmp1, L(copy65_96_preloaded)
+
+        sd      A_l, 0(dst)
+        sd      A_h, 8(dst)
+        sd      B_l, 16(dst)
+        sd      B_h, 24(dst)
+        sd      C_l, -32(dstend)
+        sd      C_h, -24(dstend)
+        sd      D_l, -16(dstend)
+        sd      D_h, -8(dstend)
+        ret
+
+        .p2align 4
+L(copy65_96_preloaded):
+        /* Copy 65-96 bytes with pre-loaded A, B, C and D.  */
+        ld      E_l, 32(src)
+        ld      E_h, 40(src)
+        ld      F_l, 48(src)    /* dst2 will be overwritten.  */
+        ld      F_h, 56(src)    /* srcend will be overwritten.  */
+
+        sd      A_l, 0(dst)
+        sd      A_h, 8(dst)
+        sd      B_l, 16(dst)
+        sd      B_h, 24(dst)
+        sd      E_l, 32(dst)
+        sd      E_h, 40(dst)
+        sd      F_l, 48(dst)
+        sd      F_h, 56(dst)
+        sd      C_l, -32(dstend)
+        sd      C_h, -24(dstend)
+        sd      D_l, -16(dstend)
+        sd      D_h, -8(dstend)
+        ret
+
+#ifdef COPY97_128
+        .p2align 4
+L(copy97_128_forward):
+        /* Copy 97-128 bytes from front to back.  */
+        ld      A_l, 0(src)
+        ld      A_h, 8(src)
+        ld      B_l, 16(src)
+        ld      B_h, 24(src)
+        ld      C_l, -16(srcend)
+        ld      C_h, -8(srcend)
+        ld      D_l, -32(srcend)
+        ld      D_h, -24(srcend)
+        ld      E_l, -48(srcend)
+        ld      E_h, -40(srcend)
+        ld      F_l, -64(srcend)        /* dst2 will be overwritten.  */
+        ld      F_h, -56(srcend)        /* srcend will be overwritten.  */
+
+        sd      A_l, 0(dst)
+        sd      A_h, 8(dst)
+        ld      A_l, 32(src)
+        ld      A_h, 40(src)
+        sd      B_l, 16(dst)
+        sd      B_h, 24(dst)
+        ld      B_l, 48(src)
+        ld      B_h, 56(src)
+
+        sd      C_l, -16(dstend)
+        sd      C_h, -8(dstend)
+        sd      D_l, -32(dstend)
+        sd      D_h, -24(dstend)
+        sd      E_l, -48(dstend)
+        sd      E_h, -40(dstend)
+        sd      F_l, -64(dstend)
+        sd      F_h, -56(dstend)
+
+        sd      A_l, 32(dst)
+        sd      A_h, 40(dst)
+        sd      B_l, 48(dst)
+        sd      B_h, 56(dst)
+        ret
+#endif
+
+        .p2align 4
+        /* Copy 97+ bytes from front to back.  */
+L(copy_long_forward):
+#ifdef COPY97_128
+        /* Avoid loop if possible.  */
+        li      tmp1, 128
+        ble     count, tmp1, L(copy97_128_forward)
+#endif
+
+        /* Copy 16 bytes and then align dst to 16-byte alignment.  */
+        ld      D_l, 0(src)
+        ld      D_h, 8(src)
+
+        /* Round down to the previous 16 byte boundary (keep offset of 16).  */
+        andi    tmp1, dst, 15
+        andi    dst2, dst, -16
+        sub     src, src, tmp1
+
+        ld      A_l, 16(src)
+        ld      A_h, 24(src)
+        sd      D_l, 0(dst)
+        sd      D_h, 8(dst)
+        ld      B_l, 32(src)
+        ld      B_h, 40(src)
+        ld      C_l, 48(src)
+        ld      C_h, 56(src)
+        ld      D_l, 64(src)
+        ld      D_h, 72(src)
+        addi    src, src, 64
+
+        /* Calculate loop termination position.  */
+        addi    tmp1, dstend, -(16+128)
+        bgeu    dst2, tmp1, L(copy64_from_end)
+
+        /* Store 64 bytes in a loop.  */
+        .p2align 4
+L(loop64_forward):
+        addi    src, src, 64
+        sd      A_l, 16(dst2)
+        sd      A_h, 24(dst2)
+        ld      A_l, -48(src)
+        ld      A_h, -40(src)
+        sd      B_l, 32(dst2)
+        sd      B_h, 40(dst2)
+        ld      B_l, -32(src)
+        ld      B_h, -24(src)
+        sd      C_l, 48(dst2)
+        sd      C_h, 56(dst2)
+        ld      C_l, -16(src)
+        ld      C_h, -8(src)
+        sd      D_l, 64(dst2)
+        sd      D_h, 72(dst2)
+        ld      D_l, 0(src)
+        ld      D_h, 8(src)
+        addi    dst2, dst2, 64
+        bltu    dst2, tmp1, L(loop64_forward)
+
+L(copy64_from_end):
+        ld      E_l, -64(srcend)
+        ld      E_h, -56(srcend)
+        sd      A_l, 16(dst2)
+        sd      A_h, 24(dst2)
+        ld      A_l, -48(srcend)
+        ld      A_h, -40(srcend)
+        sd      B_l, 32(dst2)
+        sd      B_h, 40(dst2)
+        ld      B_l, -32(srcend)
+        ld      B_h, -24(srcend)
+        sd      C_l, 48(dst2)
+        sd      C_h, 56(dst2)
+        ld      C_l, -16(srcend)
+        ld      C_h, -8(srcend)
+        sd      D_l, 64(dst2)
+        sd      D_h, 72(dst2)
+        sd      E_l, -64(dstend)
+        sd      E_h, -56(dstend)
+        sd      A_l, -48(dstend)
+        sd      A_h, -40(dstend)
+        sd      B_l, -32(dstend)
+        sd      B_h, -24(dstend)
+        sd      C_l, -16(dstend)
+        sd      C_h, -8(dstend)
+        ret
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 6)
+        /* Calculate the end position.  */
+        add     srcend, src, count
+        add     dstend, dst, count
+
+        /* Decide how to process.  */
+        li      tmp1, 96
+        bgtu    count, tmp1, L(move_long)
+        li      tmp1, 32
+        bgtu    count, tmp1, L(copy33_96)
+        li      tmp1, 16
+        bleu    count, tmp1, L(copy0_16)
+
+        /* Copy 17-32 bytes.  */
+        ld      A_l, 0(src)
+        ld      A_h, 8(src)
+        ld      B_l, -16(srcend)
+        ld      B_h, -8(srcend)
+        sd      A_l, 0(dst)
+        sd      A_h, 8(dst)
+        sd      B_l, -16(dstend)
+        sd      B_h, -8(dstend)
+        ret
+
+#ifdef COPY97_128
+        .p2align 4
+L(copy97_128_backward):
+        /* Copy 97-128 bytes from back to front.  */
+        ld      A_l, -16(srcend)
+        ld      A_h, -8(srcend)
+        ld      B_l, -32(srcend)
+        ld      B_h, -24(srcend)
+        ld      C_l, -48(srcend)
+        ld      C_h, -40(srcend)
+        ld      D_l, -64(srcend)
+        ld      D_h, -56(srcend)
+        ld      E_l, -80(srcend)
+        ld      E_h, -72(srcend)
+        ld      F_l, -96(srcend)        /* dst2 will be overwritten.  */
+        ld      F_h, -88(srcend)        /* srcend will be overwritten.  */
+
+        sd      A_l, -16(dstend)
+        sd      A_h, -8(dstend)
+        ld      A_l, 16(src)
+        ld      A_h, 24(src)
+        sd      B_l, -32(dstend)
+        sd      B_h, -24(dstend)
+        ld      B_l, 0(src)
+        ld      B_h, 8(src)
+
+        sd      C_l, -48(dstend)
+        sd      C_h, -40(dstend)
+        sd      D_l, -64(dstend)
+        sd      D_h, -56(dstend)
+        sd      E_l, -80(dstend)
+        sd      E_h, -72(dstend)
+        sd      F_l, -96(dstend)
+        sd      F_h, -88(dstend)
+
+        sd      A_l, 16(dst)
+        sd      A_h, 24(dst)
+        sd      B_l, 0(dst)
+        sd      B_h, 8(dst)
+        ret
+#endif
+
+        .p2align 4
+        /* Copy 97+ bytes.  */
+L(move_long):
+        /* dst-src is positive if src < dst.
+           In this case we must copy forward if dst-src >= count.
+           If dst-src is negative, then we can interpret the difference
+           as unsigned value to enforce dst-src >= count as well.  */
+        sub     tmp1, dst, src
+        beqz    tmp1, L(copy0)
+        bgeu    tmp1, count, L(copy_long_forward)
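+        /* E.g. dst = src + 32 with count = 100: tmp1 = 32 < count, so the
+           buffers overlap and we copy backward below.  For dst < src, tmp1
+           wraps to a huge unsigned value, so the bgeu above always selects
+           the forward copy, which is safe in that case.  */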
+
+#ifdef COPY97_128
+        /* Avoid loop if possible.  */
+        li      tmp1, 128
+        ble     count, tmp1, L(copy97_128_backward)
+#endif
+
+        /* Copy 16 bytes and then align dst to 16-byte alignment.  */
+        ld      D_l, -16(srcend)
+        ld      D_h, -8(srcend)
+
+        /* Round down to the previous 16 byte boundary (keep offset of 16).  */
+        andi    tmp1, dstend, 15
+        sub     srcend, srcend, tmp1
+
+        ld      A_l, -16(srcend)
+        ld      A_h, -8(srcend)
+        ld      B_l, -32(srcend)
+        ld      B_h, -24(srcend)
+        ld      C_l, -48(srcend)
+        ld      C_h, -40(srcend)
+        sd      D_l, -16(dstend)
+        sd      D_h, -8(dstend)
+        ld      D_l, -64(srcend)
+        ld      D_h, -56(srcend)
+        andi    dstend, dstend, -16
+
+        /* Calculate loop termination position.  */
+        addi    tmp1, dst, 128
+        bleu    dstend, tmp1, L(copy64_from_start)
+
+        /* Store 64 bytes in a loop.  */
+        .p2align 4
+L(loop64_backward):
+        addi    srcend, srcend, -64
+        sd      A_l, -16(dstend)
+        sd      A_h, -8(dstend)
+        ld      A_l, -16(srcend)
+        ld      A_h, -8(srcend)
+        sd      B_l, -32(dstend)
+        sd      B_h, -24(dstend)
+        ld      B_l, -32(srcend)
+        ld      B_h, -24(srcend)
+        sd      C_l, -48(dstend)
+        sd      C_h, -40(dstend)
+        ld      C_l, -48(srcend)
+        ld      C_h, -40(srcend)
+        sd      D_l, -64(dstend)
+        sd      D_h, -56(dstend)
+        ld      D_l, -64(srcend)
+        ld      D_h, -56(srcend)
+        addi    dstend, dstend, -64
+        bgtu    dstend, tmp1, L(loop64_backward)
+
+L(copy64_from_start):
+        ld      E_l, 48(src)
+        ld      E_h, 56(src)
+        sd      A_l, -16(dstend)
+        sd      A_h, -8(dstend)
+        ld      A_l, 32(src)
+        ld      A_h, 40(src)
+        sd      B_l, -32(dstend)
+        sd      B_h, -24(dstend)
+        ld      B_l, 16(src)
+        ld      B_h, 24(src)
+        sd      C_l, -48(dstend)
+        sd      C_h, -40(dstend)
+        ld      C_l, 0(src)
+        ld      C_h, 8(src)
+        sd      D_l, -64(dstend)
+        sd      D_h, -56(dstend)
+        sd      E_l, 48(dst)
+        sd      E_h, 56(dst)
+        sd      A_l, 32(dst)
+        sd      A_h, 40(dst)
+        sd      B_l, 16(dst)
+        sd      B_h, 24(dst)
+        sd      C_l, 0(dst)
+        sd      C_h, 8(dst)
+        ret
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+
+#endif /* __riscv_xlen == 64 */
diff --git a/sysdeps/riscv/multiarch/memmove.c b/sysdeps/riscv/multiarch/memmove.c
index 581a8327d6..b446a9e036 100644
--- a/sysdeps/riscv/multiarch/memmove.c
+++ b/sysdeps/riscv/multiarch/memmove.c
@@ -31,7 +31,16 @@
 extern __typeof (__redirect_memmove) __libc_memmove;
 
 extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
+#if __riscv_xlen == 64
+extern __typeof (__redirect_memmove) __memmove_rv64_unaligned attribute_hidden;
+
+libc_ifunc (__libc_memmove,
+            (IS_RV64() && HAVE_FAST_UNALIGNED()
+             ? __memmove_rv64_unaligned
+             : __memmove_generic));
+#else
 libc_ifunc (__libc_memmove, __memmove_generic);
+#endif
 
 # undef memmove
 strong_alias (__libc_memmove, memmove);