From patchwork Tue Jan 14 21:03:39 2025
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 104787
X-Patchwork-Delegate: fweimer@redhat.com
From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com
Subject: [PATCH v2 1/3] x86/string: Factor out large memmove implementation to separate file
Date: Tue, 14 Jan 2025 13:03:39 -0800
Message-Id: <20250114210341.599037-1-goldstein.w.n@gmail.com>

This is to enable us to support multiple large (size greater than the
non-temporal threshold) implementations.

This patch has no effect on the resulting libc.so library.
---
 .../memmove-vec-large-page-unrolled.S         | 272 ++++++++++++++++++
 .../multiarch/memmove-vec-unaligned-erms.S    | 272 +-----------------
 2 files changed, 279 insertions(+), 265 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S

diff --git a/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S
new file mode 100644
index 0000000000..ee1f3aa7f6
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-vec-large-page-unrolled.S
@@ -0,0 +1,272 @@
+#ifdef MEMMOVE_LARGE_IMPL
+# error "Multiple large memmove impls included!"
+#endif
+#define MEMMOVE_LARGE_IMPL 1
+
+/* Copies large regions by copying multiple pages at once.  This is
+   beneficial on some older Intel hardware (Broadwell, Skylake, and
+   Icelake).
+   1. If size < 16 * __x86_shared_non_temporal_threshold and
+      source and destination do not page alias, copy from 2 pages
+      at once using non-temporal stores.  Page aliasing in this case is
+      considered true if destination's page alignment - source's page
+      alignment is less than 8 * VEC_SIZE.
+   2. If size >= 16 * __x86_shared_non_temporal_threshold or source
+      and destination do page alias, copy from 4 pages at once using
+      non-temporal stores.
*/ + +#ifndef LOG_PAGE_SIZE +# define LOG_PAGE_SIZE 12 +#endif + +#if PAGE_SIZE != (1 << LOG_PAGE_SIZE) +# error Invalid LOG_PAGE_SIZE +#endif + +/* Byte per page for large_memcpy inner loop. */ +#if VEC_SIZE == 64 +# define LARGE_LOAD_SIZE (VEC_SIZE * 2) +#else +# define LARGE_LOAD_SIZE (VEC_SIZE * 4) +#endif + +/* Amount to shift __x86_shared_non_temporal_threshold by for + bound for memcpy_large_4x. This is essentially use to to + indicate that the copy is far beyond the scope of L3 + (assuming no user config x86_non_temporal_threshold) and to + use a more aggressively unrolled loop. NB: before + increasing the value also update initialization of + x86_non_temporal_threshold. */ +#ifndef LOG_4X_MEMCPY_THRESH +# define LOG_4X_MEMCPY_THRESH 4 +#endif + +#if LARGE_LOAD_SIZE == (VEC_SIZE * 2) +# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \ + VMOVU (offset)base, vec0; \ + VMOVU ((offset) + VEC_SIZE)base, vec1; +# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \ + VMOVNT vec0, (offset)base; \ + VMOVNT vec1, ((offset) + VEC_SIZE)base; +#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4) +# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ + VMOVU (offset)base, vec0; \ + VMOVU ((offset) + VEC_SIZE)base, vec1; \ + VMOVU ((offset) + VEC_SIZE * 2)base, vec2; \ + VMOVU ((offset) + VEC_SIZE * 3)base, vec3; +# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ + VMOVNT vec0, (offset)base; \ + VMOVNT vec1, ((offset) + VEC_SIZE)base; \ + VMOVNT vec2, ((offset) + VEC_SIZE * 2)base; \ + VMOVNT vec3, ((offset) + VEC_SIZE * 3)base; +#else +# error Invalid LARGE_LOAD_SIZE +#endif + + .p2align 4,, 10 +#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) +L(large_memcpy_check): + /* Entry from L(large_memcpy_2x) has a redundant load of + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x) + is only use for the non-erms memmove which is generally less + common. */ +L(large_memcpy): + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP + cmp %R11_LP, %RDX_LP + jb L(more_8x_vec_check) + /* To reach this point it is impossible for dst > src and + overlap. Remaining to check is src > dst and overlap. rcx + already contains dst - src. Negate rcx to get src - dst. If + length > rcx then there is overlap and forward copy is best. */ + negq %rcx + cmpq %rcx, %rdx + ja L(more_8x_vec_forward) + + /* Cache align destination. First store the first 64 bytes then + adjust alignments. */ + + /* First vec was also loaded into VEC(0). */ +# if VEC_SIZE < 64 + VMOVU VEC_SIZE(%rsi), %VMM(1) +# if VEC_SIZE < 32 + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) +# endif +# endif + VMOVU %VMM(0), (%rdi) +# if VEC_SIZE < 64 + VMOVU %VMM(1), VEC_SIZE(%rdi) +# if VEC_SIZE < 32 + VMOVU %VMM(2), (VEC_SIZE * 2)(%rdi) + VMOVU %VMM(3), (VEC_SIZE * 3)(%rdi) +# endif +# endif + + /* Adjust source, destination, and size. */ + movq %rdi, %r8 + andq $63, %r8 + /* Get the negative of offset for alignment. */ + subq $64, %r8 + /* Adjust source. */ + subq %r8, %rsi + /* Adjust destination which should be aligned now. */ + subq %r8, %rdi + /* Adjust length. */ + addq %r8, %rdx + + /* Test if source and destination addresses will alias. If they + do the larger pipeline in large_memcpy_4x alleviated the + performance drop. */ + + /* ecx contains -(dst - src). not ecx will return dst - src - 1 + which works for testing aliasing. */ + notl %ecx + movq %rdx, %r10 + testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx + jz L(large_memcpy_4x) + + /* r11 has __x86_shared_non_temporal_threshold. 
Shift it left + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold. */ + shlq $LOG_4X_MEMCPY_THRESH, %r11 + cmp %r11, %rdx + jae L(large_memcpy_4x) + + /* edx will store remainder size for copying tail. */ + andl $(PAGE_SIZE * 2 - 1), %edx + /* r10 stores outer loop counter. */ + shrq $(LOG_PAGE_SIZE + 1), %r10 + /* Copy 4x VEC at a time from 2 pages. */ + .p2align 4 +L(loop_large_memcpy_2x_outer): + /* ecx stores inner loop counter. */ + movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx +L(loop_large_memcpy_2x_inner): + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2) + /* Load vectors from rsi. */ + LOAD_ONE_SET ((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) + LOAD_ONE_SET ((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) + subq $-LARGE_LOAD_SIZE, %rsi + /* Non-temporal store vectors to rdi. */ + STORE_ONE_SET ((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) + STORE_ONE_SET ((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) + subq $-LARGE_LOAD_SIZE, %rdi + decl %ecx + jnz L(loop_large_memcpy_2x_inner) + addq $PAGE_SIZE, %rdi + addq $PAGE_SIZE, %rsi + decq %r10 + jne L(loop_large_memcpy_2x_outer) + sfence + + /* Check if only last 4 loads are needed. */ + cmpl $(VEC_SIZE * 4), %edx + jbe L(large_memcpy_2x_end) + + /* Handle the last 2 * PAGE_SIZE bytes. */ +L(loop_large_memcpy_2x_tail): + /* Copy 4 * VEC a time forward with non-temporal stores. */ + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) + VMOVU (%rsi), %VMM(0) + VMOVU VEC_SIZE(%rsi), %VMM(1) + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) + subq $-(VEC_SIZE * 4), %rsi + addl $-(VEC_SIZE * 4), %edx + VMOVA %VMM(0), (%rdi) + VMOVA %VMM(1), VEC_SIZE(%rdi) + VMOVA %VMM(2), (VEC_SIZE * 2)(%rdi) + VMOVA %VMM(3), (VEC_SIZE * 3)(%rdi) + subq $-(VEC_SIZE * 4), %rdi + cmpl $(VEC_SIZE * 4), %edx + ja L(loop_large_memcpy_2x_tail) + +L(large_memcpy_2x_end): + /* Store the last 4 * VEC. */ + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0) + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1) + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2) + VMOVU -VEC_SIZE(%rsi, %rdx), %VMM(3) + + VMOVU %VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx) + VMOVU %VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx) + VMOVU %VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx) + VMOVU %VMM(3), -VEC_SIZE(%rdi, %rdx) + VZEROUPPER_RETURN + + .p2align 4 +L(large_memcpy_4x): + /* edx will store remainder size for copying tail. */ + andl $(PAGE_SIZE * 4 - 1), %edx + /* r10 stores outer loop counter. */ + shrq $(LOG_PAGE_SIZE + 2), %r10 + /* Copy 4x VEC at a time from 4 pages. */ + .p2align 4 +L(loop_large_memcpy_4x_outer): + /* ecx stores inner loop counter. */ + movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx +L(loop_large_memcpy_4x_inner): + /* Only one prefetch set per page as doing 4 pages give more + time for prefetcher to keep up. */ + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE) + /* Load vectors from rsi. 
*/ + LOAD_ONE_SET ((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) + LOAD_ONE_SET ((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) + LOAD_ONE_SET ((%rsi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11)) + LOAD_ONE_SET ((%rsi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15)) + subq $-LARGE_LOAD_SIZE, %rsi + /* Non-temporal store vectors to rdi. */ + STORE_ONE_SET ((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) + STORE_ONE_SET ((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) + STORE_ONE_SET ((%rdi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11)) + STORE_ONE_SET ((%rdi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15)) + subq $-LARGE_LOAD_SIZE, %rdi + decl %ecx + jnz L(loop_large_memcpy_4x_inner) + addq $(PAGE_SIZE * 3), %rdi + addq $(PAGE_SIZE * 3), %rsi + decq %r10 + jne L(loop_large_memcpy_4x_outer) + sfence + /* Check if only last 4 loads are needed. */ + cmpl $(VEC_SIZE * 4), %edx + jbe L(large_memcpy_4x_end) + + /* Handle the last 4 * PAGE_SIZE bytes. */ +L(loop_large_memcpy_4x_tail): + /* Copy 4 * VEC a time forward with non-temporal stores. */ + PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) + PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) + VMOVU (%rsi), %VMM(0) + VMOVU VEC_SIZE(%rsi), %VMM(1) + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) + subq $-(VEC_SIZE * 4), %rsi + addl $-(VEC_SIZE * 4), %edx + VMOVA %VMM(0), (%rdi) + VMOVA %VMM(1), VEC_SIZE(%rdi) + VMOVA %VMM(2), (VEC_SIZE * 2)(%rdi) + VMOVA %VMM(3), (VEC_SIZE * 3)(%rdi) + subq $-(VEC_SIZE * 4), %rdi + cmpl $(VEC_SIZE * 4), %edx + ja L(loop_large_memcpy_4x_tail) + +L(large_memcpy_4x_end): + /* Store the last 4 * VEC. */ + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0) + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1) + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2) + VMOVU -VEC_SIZE(%rsi, %rdx), %VMM(3) + + VMOVU %VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx) + VMOVU %VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx) + VMOVU %VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx) + VMOVU %VMM(3), -VEC_SIZE(%rdi, %rdx) + VZEROUPPER_RETURN +#endif diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S index 5cd8a6286e..70d303687c 100644 --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S @@ -34,17 +34,8 @@ __x86_rep_movsb_threshold and less than __x86_rep_movsb_stop_threshold, then REP MOVSB will be used. 7. If size >= __x86_shared_non_temporal_threshold and there is no - overlap between destination and source, use non-temporal store - instead of aligned store copying from either 2 or 4 pages at - once. - 8. For point 7) if size < 16 * __x86_shared_non_temporal_threshold - and source and destination do not page alias, copy from 2 pages - at once using non-temporal stores. Page aliasing in this case is - considered true if destination's page alignment - sources' page - alignment is less than 8 * VEC_SIZE. - 9. If size >= 16 * __x86_shared_non_temporal_threshold or source - and destination do page alias copy from 4 pages at once using - non-temporal stores. */ + overlap between destination and source, the exact method varies + and is set with MEMMOVE_VEC_LARGE_IMPL". */ #include @@ -95,31 +86,6 @@ # error Unsupported PAGE_SIZE #endif -#ifndef LOG_PAGE_SIZE -# define LOG_PAGE_SIZE 12 -#endif - -#if PAGE_SIZE != (1 << LOG_PAGE_SIZE) -# error Invalid LOG_PAGE_SIZE -#endif - -/* Byte per page for large_memcpy inner loop. 
*/ -#if VEC_SIZE == 64 -# define LARGE_LOAD_SIZE (VEC_SIZE * 2) -#else -# define LARGE_LOAD_SIZE (VEC_SIZE * 4) -#endif - -/* Amount to shift __x86_shared_non_temporal_threshold by for - bound for memcpy_large_4x. This is essentially use to to - indicate that the copy is far beyond the scope of L3 - (assuming no user config x86_non_temporal_threshold) and to - use a more aggressively unrolled loop. NB: before - increasing the value also update initialization of - x86_non_temporal_threshold. */ -#ifndef LOG_4X_MEMCPY_THRESH -# define LOG_4X_MEMCPY_THRESH 4 -#endif /* Avoid short distance rep movsb only with non-SSE vector. */ #ifndef AVOID_SHORT_DISTANCE_REP_MOVSB @@ -160,26 +126,8 @@ # error Unsupported PREFETCH_SIZE! #endif -#if LARGE_LOAD_SIZE == (VEC_SIZE * 2) -# define LOAD_ONE_SET(base, offset, vec0, vec1, ...) \ - VMOVU (offset)base, vec0; \ - VMOVU ((offset) + VEC_SIZE)base, vec1; -# define STORE_ONE_SET(base, offset, vec0, vec1, ...) \ - VMOVNT vec0, (offset)base; \ - VMOVNT vec1, ((offset) + VEC_SIZE)base; -#elif LARGE_LOAD_SIZE == (VEC_SIZE * 4) -# define LOAD_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ - VMOVU (offset)base, vec0; \ - VMOVU ((offset) + VEC_SIZE)base, vec1; \ - VMOVU ((offset) + VEC_SIZE * 2)base, vec2; \ - VMOVU ((offset) + VEC_SIZE * 3)base, vec3; -# define STORE_ONE_SET(base, offset, vec0, vec1, vec2, vec3) \ - VMOVNT vec0, (offset)base; \ - VMOVNT vec1, ((offset) + VEC_SIZE)base; \ - VMOVNT vec2, ((offset) + VEC_SIZE * 2)base; \ - VMOVNT vec3, ((offset) + VEC_SIZE * 3)base; -#else -# error Invalid LARGE_LOAD_SIZE +#ifndef MEMMOVE_VEC_LARGE_IMPL +# define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S" #endif #ifndef SECTION @@ -426,7 +374,7 @@ L(more_8x_vec): #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) /* Check non-temporal store threshold. */ cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP - ja L(large_memcpy_2x) + ja L(large_memcpy) #endif /* To reach this point there cannot be overlap and dst > src. So check for overlap and src > dst in which case correctness @@ -613,7 +561,7 @@ L(movsb): /* If above __x86_rep_movsb_stop_threshold most likely is candidate for NT moves as well. */ cmp __x86_rep_movsb_stop_threshold(%rip), %RDX_LP - jae L(large_memcpy_2x_check) + jae L(large_memcpy_check) # if AVOID_SHORT_DISTANCE_REP_MOVSB || ALIGN_MOVSB /* Only avoid short movsb if CPU has FSRM. */ # if X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB < 256 @@ -673,214 +621,8 @@ L(skip_short_movsb_check): # endif #endif - .p2align 4,, 10 -#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) -L(large_memcpy_2x_check): - /* Entry from L(large_memcpy_2x) has a redundant load of - __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x) - is only use for the non-erms memmove which is generally less - common. */ -L(large_memcpy_2x): - mov __x86_shared_non_temporal_threshold(%rip), %R11_LP - cmp %R11_LP, %RDX_LP - jb L(more_8x_vec_check) - /* To reach this point it is impossible for dst > src and - overlap. Remaining to check is src > dst and overlap. rcx - already contains dst - src. Negate rcx to get src - dst. If - length > rcx then there is overlap and forward copy is best. */ - negq %rcx - cmpq %rcx, %rdx - ja L(more_8x_vec_forward) - - /* Cache align destination. First store the first 64 bytes then - adjust alignments. */ - - /* First vec was also loaded into VEC(0). 
*/ -# if VEC_SIZE < 64 - VMOVU VEC_SIZE(%rsi), %VMM(1) -# if VEC_SIZE < 32 - VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) - VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) -# endif -# endif - VMOVU %VMM(0), (%rdi) -# if VEC_SIZE < 64 - VMOVU %VMM(1), VEC_SIZE(%rdi) -# if VEC_SIZE < 32 - VMOVU %VMM(2), (VEC_SIZE * 2)(%rdi) - VMOVU %VMM(3), (VEC_SIZE * 3)(%rdi) -# endif -# endif +#include MEMMOVE_VEC_LARGE_IMPL - /* Adjust source, destination, and size. */ - movq %rdi, %r8 - andq $63, %r8 - /* Get the negative of offset for alignment. */ - subq $64, %r8 - /* Adjust source. */ - subq %r8, %rsi - /* Adjust destination which should be aligned now. */ - subq %r8, %rdi - /* Adjust length. */ - addq %r8, %rdx - - /* Test if source and destination addresses will alias. If they - do the larger pipeline in large_memcpy_4x alleviated the - performance drop. */ - - /* ecx contains -(dst - src). not ecx will return dst - src - 1 - which works for testing aliasing. */ - notl %ecx - movq %rdx, %r10 - testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx - jz L(large_memcpy_4x) - - /* r11 has __x86_shared_non_temporal_threshold. Shift it left - by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold. - */ - shlq $LOG_4X_MEMCPY_THRESH, %r11 - cmp %r11, %rdx - jae L(large_memcpy_4x) - - /* edx will store remainder size for copying tail. */ - andl $(PAGE_SIZE * 2 - 1), %edx - /* r10 stores outer loop counter. */ - shrq $(LOG_PAGE_SIZE + 1), %r10 - /* Copy 4x VEC at a time from 2 pages. */ - .p2align 4 -L(loop_large_memcpy_2x_outer): - /* ecx stores inner loop counter. */ - movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx -L(loop_large_memcpy_2x_inner): - PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE * 2) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE * 2) - /* Load vectors from rsi. */ - LOAD_ONE_SET((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) - LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) - subq $-LARGE_LOAD_SIZE, %rsi - /* Non-temporal store vectors to rdi. */ - STORE_ONE_SET((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) - STORE_ONE_SET((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) - subq $-LARGE_LOAD_SIZE, %rdi - decl %ecx - jnz L(loop_large_memcpy_2x_inner) - addq $PAGE_SIZE, %rdi - addq $PAGE_SIZE, %rsi - decq %r10 - jne L(loop_large_memcpy_2x_outer) - sfence - - /* Check if only last 4 loads are needed. */ - cmpl $(VEC_SIZE * 4), %edx - jbe L(large_memcpy_2x_end) - - /* Handle the last 2 * PAGE_SIZE bytes. */ -L(loop_large_memcpy_2x_tail): - /* Copy 4 * VEC a time forward with non-temporal stores. */ - PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) - VMOVU (%rsi), %VMM(0) - VMOVU VEC_SIZE(%rsi), %VMM(1) - VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) - VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) - subq $-(VEC_SIZE * 4), %rsi - addl $-(VEC_SIZE * 4), %edx - VMOVA %VMM(0), (%rdi) - VMOVA %VMM(1), VEC_SIZE(%rdi) - VMOVA %VMM(2), (VEC_SIZE * 2)(%rdi) - VMOVA %VMM(3), (VEC_SIZE * 3)(%rdi) - subq $-(VEC_SIZE * 4), %rdi - cmpl $(VEC_SIZE * 4), %edx - ja L(loop_large_memcpy_2x_tail) - -L(large_memcpy_2x_end): - /* Store the last 4 * VEC. 
*/ - VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0) - VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1) - VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2) - VMOVU -VEC_SIZE(%rsi, %rdx), %VMM(3) - - VMOVU %VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx) - VMOVU %VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx) - VMOVU %VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx) - VMOVU %VMM(3), -VEC_SIZE(%rdi, %rdx) - VZEROUPPER_RETURN - - .p2align 4 -L(large_memcpy_4x): - /* edx will store remainder size for copying tail. */ - andl $(PAGE_SIZE * 4 - 1), %edx - /* r10 stores outer loop counter. */ - shrq $(LOG_PAGE_SIZE + 2), %r10 - /* Copy 4x VEC at a time from 4 pages. */ - .p2align 4 -L(loop_large_memcpy_4x_outer): - /* ecx stores inner loop counter. */ - movl $(PAGE_SIZE / LARGE_LOAD_SIZE), %ecx -L(loop_large_memcpy_4x_inner): - /* Only one prefetch set per page as doing 4 pages give more - time for prefetcher to keep up. */ - PREFETCH_ONE_SET(1, (%rsi), PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE + PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 2 + PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET(1, (%rsi), PAGE_SIZE * 3 + PREFETCHED_LOAD_SIZE) - /* Load vectors from rsi. */ - LOAD_ONE_SET((%rsi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) - LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) - LOAD_ONE_SET((%rsi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11)) - LOAD_ONE_SET((%rsi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15)) - subq $-LARGE_LOAD_SIZE, %rsi - /* Non-temporal store vectors to rdi. */ - STORE_ONE_SET((%rdi), 0, %VMM(0), %VMM(1), %VMM(2), %VMM(3)) - STORE_ONE_SET((%rdi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7)) - STORE_ONE_SET((%rdi), PAGE_SIZE * 2, %VMM(8), %VMM(9), %VMM(10), %VMM(11)) - STORE_ONE_SET((%rdi), PAGE_SIZE * 3, %VMM(12), %VMM(13), %VMM(14), %VMM(15)) - subq $-LARGE_LOAD_SIZE, %rdi - decl %ecx - jnz L(loop_large_memcpy_4x_inner) - addq $(PAGE_SIZE * 3), %rdi - addq $(PAGE_SIZE * 3), %rsi - decq %r10 - jne L(loop_large_memcpy_4x_outer) - sfence - /* Check if only last 4 loads are needed. */ - cmpl $(VEC_SIZE * 4), %edx - jbe L(large_memcpy_4x_end) - - /* Handle the last 4 * PAGE_SIZE bytes. */ -L(loop_large_memcpy_4x_tail): - /* Copy 4 * VEC a time forward with non-temporal stores. */ - PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE) - PREFETCH_ONE_SET (1, (%rdi), PREFETCHED_LOAD_SIZE) - VMOVU (%rsi), %VMM(0) - VMOVU VEC_SIZE(%rsi), %VMM(1) - VMOVU (VEC_SIZE * 2)(%rsi), %VMM(2) - VMOVU (VEC_SIZE * 3)(%rsi), %VMM(3) - subq $-(VEC_SIZE * 4), %rsi - addl $-(VEC_SIZE * 4), %edx - VMOVA %VMM(0), (%rdi) - VMOVA %VMM(1), VEC_SIZE(%rdi) - VMOVA %VMM(2), (VEC_SIZE * 2)(%rdi) - VMOVA %VMM(3), (VEC_SIZE * 3)(%rdi) - subq $-(VEC_SIZE * 4), %rdi - cmpl $(VEC_SIZE * 4), %edx - ja L(loop_large_memcpy_4x_tail) - -L(large_memcpy_4x_end): - /* Store the last 4 * VEC. 
 */
-	VMOVU	-(VEC_SIZE * 4)(%rsi, %rdx), %VMM(0)
-	VMOVU	-(VEC_SIZE * 3)(%rsi, %rdx), %VMM(1)
-	VMOVU	-(VEC_SIZE * 2)(%rsi, %rdx), %VMM(2)
-	VMOVU	-VEC_SIZE(%rsi, %rdx), %VMM(3)
-
-	VMOVU	%VMM(0), -(VEC_SIZE * 4)(%rdi, %rdx)
-	VMOVU	%VMM(1), -(VEC_SIZE * 3)(%rdi, %rdx)
-	VMOVU	%VMM(2), -(VEC_SIZE * 2)(%rdi, %rdx)
-	VMOVU	%VMM(3), -VEC_SIZE(%rdi, %rdx)
-	VZEROUPPER_RETURN
-#endif

 END (MEMMOVE_SYMBOL (__memmove, unaligned_erms))

 #if IS_IN (libc)

From patchwork Tue Jan 14 21:03:40 2025
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 104788
X-Patchwork-Delegate: fweimer@redhat.com
From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com
Subject: [PATCH v2 2/3] x86/string: Use simpler approach for large memcpy [BZ #32475]
Date: Tue, 14 Jan 2025 13:03:40 -0800
Message-Id: <20250114210341.599037-2-goldstein.w.n@gmail.com>
In-Reply-To: <20250114210341.599037-1-goldstein.w.n@gmail.com>
References: <20250114210341.599037-1-goldstein.w.n@gmail.com>

The new approach does a simple 4x non-temporal loop (forwards or
backwards to avoid 4k aliasing).  This is similar to what we used to do
prior to:

commit 1a8605b6cd257e8a74e29b5b71c057211f5fb847
Author: noah
Date:   Sat Apr 3 04:12:15 2021 -0400

    x86: Update large memcpy case in memmove-vec-unaligned-erms.S

but with 4k-aliasing detection added to avoid a known pathological slow
case.

The multi-page approach yielded 5-15% better performance for the size
ranges covered by bench-memcpy-large (roughly 64KB-32MB) on the tested
platforms, but it has some notable drawbacks.  The drawbacks stem from
the fact that the multi-page approach is a significantly less
"canonical" form of memcpy and is thus likely to have less reliably
"good" performance on untested platforms (including future ones) and
configurations (i.e. >2GB copies from BZ #32475).
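In rough C terms, the new large-copy path behaves like the sketch
below.  This is only an illustration of the control flow, not the
actual implementation: memcpy of a fixed chunk stands in for the
unrolled VMOVU loads plus VMOVNT non-temporal stores, and the real
alignment/overlap handling is omitted (this path is only reached when
a forward or backward copy is known to be safe).

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 256                  /* stands in for 4 * VEC_SIZE */
#define PAGE_SIZE 4096

/* Sketch of the simple 4x non-temporal large copy: pick a direction
   based on 4k aliasing, then stream CHUNK bytes at a time.  */
static void
large_copy_sketch (char *dst, const char *src, size_t n)
{
  size_t delta = ((uintptr_t) src - (uintptr_t) dst) & (PAGE_SIZE - 1);
  if (delta <= PAGE_SIZE - 512)
    {
      /* Forward copy: loads and stores stay far enough apart within a
         4k page that aliasing is not a concern.  */
      while (n > CHUNK)
        {
          memcpy (dst, src, CHUNK);   /* VMOVU loads + VMOVNT stores.  */
          dst += CHUNK, src += CHUNK, n -= CHUNK;
        }
      memcpy (dst, src, n);           /* Unaligned tail.  */
    }
  else
    {
      /* Backward copy to avoid forward loads/stores landing on the
         same 4k page offsets.  */
      while (n > CHUNK)
        {
          n -= CHUNK;
          memcpy (dst + n, src + n, CHUNK);
        }
      memcpy (dst, src, n);           /* Unaligned head.  */
    }
}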
Since there are known slow cases with the multi-page approach (that far exceed 15%) and the multi-page approach is much more brittle, it seems prudent to switch to this simpler, more reliable, better future-proofed implementation. Tested on x86_64. --- sysdeps/x86_64/multiarch/memmove-vec-large.S | 107 ++++++++++++++++++ .../multiarch/memmove-vec-unaligned-erms.S | 2 +- 2 files changed, 108 insertions(+), 1 deletion(-) create mode 100644 sysdeps/x86_64/multiarch/memmove-vec-large.S diff --git a/sysdeps/x86_64/multiarch/memmove-vec-large.S b/sysdeps/x86_64/multiarch/memmove-vec-large.S new file mode 100644 index 0000000000..fa13bd66a0 --- /dev/null +++ b/sysdeps/x86_64/multiarch/memmove-vec-large.S @@ -0,0 +1,107 @@ +#ifdef MEMMOVE_LARGE_IMPL +# error "Multiple large memmove impls included!" +#endif +#define MEMMOVE_LARGE_IMPL 1 + +/* Copies large regions by with a 4x unrolled loop of non-temporal + stores. */ + +#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) +L(large_memcpy_check): + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP + jb L(more_8x_vec_check) +L(large_memcpy): + /* To reach this point it is impossible for dst > src and + overlap. Remaining to check is src > dst and overlap. rcx + already contains dst - src. Negate rcx to get src - dst. If + length > rcx then there is overlap and forward copy is best. */ + negq %rcx + cmpq %rcx, %rdx + ja L(more_8x_vec_forward) + + /* We are doing non-temporal copy and no overlap. Choose forward + or backward copy based on avoiding 4k aliasing. ecx already + contains src - dst. We check if: + (src % 4096) - (dst % 4096) > (4096 - 512) + If true then we risk aliasing. */ + andl $(PAGE_SIZE - 1), %ecx + cmpl $(PAGE_SIZE - 512), %ecx + ja L(large_backward) + + subq %rdi, %rsi + + /* Store the first VEC. */ + VMOVU %VMM(0), (%rdi) + + /* Store end of buffer minus tail in rdx. */ + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx + + /* Align DST. */ + orq $(VEC_SIZE - 1), %rdi + incq %rdi + leaq (%rdi, %rsi), %rcx + /* Dont use multi-byte nop to align. */ + .p2align 4,, 11 +L(loop_4x_nt_forward): + PREFETCH_ONE_SET (1, (%rcx), VEC_SIZE * 8) + /* Copy 4 * VEC a time forward. */ + VMOVU (VEC_SIZE * 0)(%rcx), %VMM(1) + VMOVU (VEC_SIZE * 1)(%rcx), %VMM(2) + VMOVU (VEC_SIZE * 2)(%rcx), %VMM(3) + VMOVU (VEC_SIZE * 3)(%rcx), %VMM(4) + subq $-(VEC_SIZE * 4), %rcx + VMOVNT %VMM(1), (VEC_SIZE * 0)(%rdi) + VMOVNT %VMM(2), (VEC_SIZE * 1)(%rdi) + VMOVNT %VMM(3), (VEC_SIZE * 2)(%rdi) + VMOVNT %VMM(4), (VEC_SIZE * 3)(%rdi) + subq $-(VEC_SIZE * 4), %rdi + cmpq %rdi, %rdx + ja L(loop_4x_nt_forward) + sfence + + VMOVU (VEC_SIZE * 0)(%rsi, %rdx), %VMM(1) + VMOVU (VEC_SIZE * 1)(%rsi, %rdx), %VMM(2) + VMOVU (VEC_SIZE * 2)(%rsi, %rdx), %VMM(3) + VMOVU (VEC_SIZE * 3)(%rsi, %rdx), %VMM(4) + VMOVU %VMM(1), (VEC_SIZE * 0)(%rdx) + VMOVU %VMM(2), (VEC_SIZE * 1)(%rdx) + VMOVU %VMM(3), (VEC_SIZE * 2)(%rdx) + VMOVU %VMM(4), (VEC_SIZE * 3)(%rdx) + VZEROUPPER_RETURN + + .p2align 4,, 10 +L(large_backward): + leaq (VEC_SIZE * -4 - 1)(%rdi, %rdx), %rcx + VMOVU (VEC_SIZE * -1)(%rsi, %rdx), %VMM(5) + VMOVU %VMM(5), (VEC_SIZE * -1)(%rdi, %rdx) + andq $-(VEC_SIZE), %rcx + subq %rdi, %rsi + leaq (%rsi, %rcx), %rdx + /* Don't use multi-byte nop to align. 
 */
+	.p2align 4,, 11
+L(loop_4x_nt_backward):
+	PREFETCH_ONE_SET (-1, (%rdx), -VEC_SIZE * 8)
+	VMOVU	(VEC_SIZE * 3)(%rdx), %VMM(1)
+	VMOVU	(VEC_SIZE * 2)(%rdx), %VMM(2)
+	VMOVU	(VEC_SIZE * 1)(%rdx), %VMM(3)
+	VMOVU	(VEC_SIZE * 0)(%rdx), %VMM(4)
+	addq	$(VEC_SIZE * -4), %rdx
+	VMOVNT	%VMM(1), (VEC_SIZE * 3)(%rcx)
+	VMOVNT	%VMM(2), (VEC_SIZE * 2)(%rcx)
+	VMOVNT	%VMM(3), (VEC_SIZE * 1)(%rcx)
+	VMOVNT	%VMM(4), (VEC_SIZE * 0)(%rcx)
+	addq	$(VEC_SIZE * -4), %rcx
+	cmpq	%rcx, %rdi
+	jb	L(loop_4x_nt_backward)
+
+	sfence
+	VMOVU	(VEC_SIZE * 3)(%rsi, %rdi), %VMM(4)
+	VMOVU	(VEC_SIZE * 2)(%rsi, %rdi), %VMM(3)
+	VMOVU	(VEC_SIZE * 1)(%rsi, %rdi), %VMM(2)
+	/* We already loaded VMM(0).  */
+	VMOVU	%VMM(4), (VEC_SIZE * 3)(%rdi)
+	VMOVU	%VMM(3), (VEC_SIZE * 2)(%rdi)
+	VMOVU	%VMM(2), (VEC_SIZE * 1)(%rdi)
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VZEROUPPER_RETURN
+#endif
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 70d303687c..7c4765286d 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -127,7 +127,7 @@
 #endif

 #ifndef MEMMOVE_VEC_LARGE_IMPL
-# define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S"
+# define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large.S"
 #endif

 #ifndef SECTION

From patchwork Tue Jan 14 21:03:41 2025
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 104786
X-Patchwork-Delegate: fweimer@redhat.com
From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com
Subject: [PATCH v2 3/3] x86/string: Add version of memmove with page unrolled large impl
Date: Tue, 14 Jan 2025 13:03:41 -0800
Message-Id: <20250114210341.599037-3-goldstein.w.n@gmail.com>
In-Reply-To: <20250114210341.599037-1-goldstein.w.n@gmail.com>
References: <20250114210341.599037-1-goldstein.w.n@gmail.com>

The page unrolled version has been shown to be the best performing on
Intel SnB through ICX hardware.
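For reference, the "page unrolled" strategy preferred here (the one
factored out into memmove-vec-large-page-unrolled.S in patch 1/3)
interleaves copies from two (or four) pages per outer iteration.  A
simplified C sketch of the 2-page case follows; memcpy again stands in
for the unrolled non-temporal loads/stores, and the real tail loop is
reduced to a single copy.

#include <stddef.h>
#include <string.h>

#define PAGE 4096
#define STEP 256                  /* stands in for LARGE_LOAD_SIZE */

/* Sketch of the page-unrolled 2x copy: each outer iteration copies two
   whole pages, interleaving STEP-sized copies from page 0 and page 1.  */
static void
page_unrolled_2x_sketch (char *dst, const char *src, size_t n)
{
  size_t outer = n / (2 * PAGE);        /* outer loop counter (r10) */
  while (outer--)
    {
      for (size_t off = 0; off < PAGE; off += STEP)
        {
          memcpy (dst + off, src + off, STEP);
          memcpy (dst + PAGE + off, src + PAGE + off, STEP);
        }
      dst += 2 * PAGE;
      src += 2 * PAGE;
    }
  /* The remaining n % (2 * PAGE) bytes are handled by a separate
     tail loop in the real implementation.  */
  memcpy (dst, src, n % (2 * PAGE));
}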
--- sysdeps/x86/cpu-features.c | 17 +++ sysdeps/x86/cpu-tunables.c | 6 + ...cpu-features-preferred_feature_index_1.def | 1 + sysdeps/x86/tst-hwcap-tunables.c | 4 +- sysdeps/x86_64/multiarch/Makefile | 3 + sysdeps/x86_64/multiarch/ifunc-impl-list.c | 120 ++++++++++++++++++ sysdeps/x86_64/multiarch/ifunc-memmove.h | 75 +++++++---- ...ove-avx-unaligned-erms-page-unrolled-rtm.S | 5 + ...memmove-avx-unaligned-erms-page-unrolled.S | 5 + .../memmove-avx-unaligned-erms-rtm.S | 2 + ...emmove-evex-unaligned-erms-page-unrolled.S | 5 + 11 files changed, 218 insertions(+), 25 deletions(-) create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S create mode 100644 sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S create mode 100644 sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c index 27abaca8b7..c0ecbbb812 100644 --- a/sysdeps/x86/cpu-features.c +++ b/sysdeps/x86/cpu-features.c @@ -877,6 +877,12 @@ init_cpu_features (struct cpu_features *cpu_features) case INTEL_BIGCORE_HASWELL: case INTEL_BIGCORE_BROADWELL: cpu_features->cachesize_non_temporal_divisor = 8; + /* Benchmarks indicate page unrolled large implementation + performs better than standard copy loop on HSW (and + presumably SnB). */ + cpu_features + ->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy] + |= bit_arch_Prefer_Page_Unrolled_Large_Copy; goto default_tuning; /* Newer Bigcore microarch (larger non-temporal store @@ -890,6 +896,11 @@ init_cpu_features (struct cpu_features *cpu_features) non-temporal on all Skylake servers. */ cpu_features->preferred[index_arch_Avoid_Non_Temporal_Memset] |= bit_arch_Avoid_Non_Temporal_Memset; + /* Benchmarks indicate page unrolled large implementation + performs better than standard copy loop on Skylake/SKX. */ + cpu_features + ->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy] + |= bit_arch_Prefer_Page_Unrolled_Large_Copy; /* fallthrough */ case INTEL_BIGCORE_COMETLAKE: case INTEL_BIGCORE_SKYLAKE: @@ -897,6 +908,12 @@ init_cpu_features (struct cpu_features *cpu_features) case INTEL_BIGCORE_ICELAKE: case INTEL_BIGCORE_TIGERLAKE: case INTEL_BIGCORE_ROCKETLAKE: + /* Benchmarks indicate page unrolled large implementation + performs better than standard copy loop on SKX/ICX. 
*/ + cpu_features + ->preferred[index_arch_Prefer_Page_Unrolled_Large_Copy] + |= bit_arch_Prefer_Page_Unrolled_Large_Copy; + /* fallthrough */ case INTEL_BIGCORE_RAPTORLAKE: case INTEL_BIGCORE_METEORLAKE: case INTEL_BIGCORE_LUNARLAKE: diff --git a/sysdeps/x86/cpu-tunables.c b/sysdeps/x86/cpu-tunables.c index 3423176802..d85b618311 100644 --- a/sysdeps/x86/cpu-tunables.c +++ b/sysdeps/x86/cpu-tunables.c @@ -257,6 +257,12 @@ TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp) (n, cpu_features, Prefer_PMINUB_for_stringop, SSE2, 26); } break; + case 31: + { + CHECK_GLIBC_IFUNC_PREFERRED_BOTH ( + n, cpu_features, Prefer_Page_Unrolled_Large_Copy, 31); + } + break; } } } diff --git a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def index 0f14aaf071..5943fc1423 100644 --- a/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def +++ b/sysdeps/x86/include/cpu-features-preferred_feature_index_1.def @@ -35,3 +35,4 @@ BIT (Prefer_FSRM) BIT (Avoid_Short_Distance_REP_MOVSB) BIT (Avoid_Non_Temporal_Memset) BIT (Avoid_STOSB) +BIT (Prefer_Page_Unrolled_Large_Copy) \ No newline at end of file diff --git a/sysdeps/x86/tst-hwcap-tunables.c b/sysdeps/x86/tst-hwcap-tunables.c index 3e06048dcc..985153fb38 100644 --- a/sysdeps/x86/tst-hwcap-tunables.c +++ b/sysdeps/x86/tst-hwcap-tunables.c @@ -61,7 +61,7 @@ static const struct test_t "-Prefer_ERMS,-Prefer_FSRM,-AVX,-AVX2,-AVX512F,-AVX512VL," "-SSE4_1,-SSE4_2,-SSSE3,-Fast_Unaligned_Load,-ERMS," "-AVX_Fast_Unaligned_Load,-Avoid_Non_Temporal_Memset," - "-Avoid_STOSB", + "-Avoid_STOSB,-Prefer_Page_Unrolled_Large_Copy", test_1, array_length (test_1) }, @@ -70,7 +70,7 @@ static const struct test_t ",-,-Prefer_ERMS,-Prefer_FSRM,-AVX,-AVX2,-AVX512F,-AVX512VL," "-SSE4_1,-SSE4_2,-SSSE3,-Fast_Unaligned_Load,,-," "-ERMS,-AVX_Fast_Unaligned_Load,-Avoid_Non_Temporal_Memset," - "-Avoid_STOSB,-,", + "-Avoid_STOSB,-Prefer_Page_Unrolled_Large_Copy,-,", test_1, array_length (test_1) } diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile index 696cb66991..381eaef455 100644 --- a/sysdeps/x86_64/multiarch/Makefile +++ b/sysdeps/x86_64/multiarch/Makefile @@ -16,11 +16,14 @@ sysdep_routines += \ memcmpeq-evex \ memcmpeq-sse2 \ memmove-avx-unaligned-erms \ + memmove-avx-unaligned-erms-page-unrolled \ + memmove-avx-unaligned-erms-page-unrolled-rtm \ memmove-avx-unaligned-erms-rtm \ memmove-avx512-no-vzeroupper \ memmove-avx512-unaligned-erms \ memmove-erms \ memmove-evex-unaligned-erms \ + memmove-evex-unaligned-erms-page-unrolled \ memmove-sse2-unaligned-erms \ memmove-ssse3 \ memrchr-avx2 \ diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c index a8349775df..424031f0e6 100644 --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c @@ -133,23 +133,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX512VL), __memmove_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX512VL), __memmove_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX), 
__memmove_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX), + __memmove_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, CPU_FEATURE_USABLE (AVX), __memmove_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + CPU_FEATURE_USABLE (AVX), + __memmove_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memmove_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memmove_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memmove_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. */ X86_IFUNC_IMPL_ADD_V2 (array, i, __memmove_chk, CPU_FEATURE_USABLE (SSSE3), @@ -180,23 +200,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, CPU_FEATURE_USABLE (AVX512VL), __memmove_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, CPU_FEATURE_USABLE (AVX512VL), __memmove_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, memmove, + CPU_FEATURE_USABLE (AVX512VL), + __memmove_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, CPU_FEATURE_USABLE (AVX), __memmove_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + CPU_FEATURE_USABLE (AVX), + __memmove_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, CPU_FEATURE_USABLE (AVX), __memmove_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + CPU_FEATURE_USABLE (AVX), + __memmove_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memmove_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memmove_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memmove, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memmove_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, memmove, CPU_FEATURE_USABLE (SSSE3), @@ -1140,23 +1180,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __memcpy_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __memcpy_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX), __memcpy_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX), + __memcpy_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (AVX), __memcpy_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + CPU_FEATURE_USABLE (AVX), + __memcpy_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __memcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. */ X86_IFUNC_IMPL_ADD_V2 (array, i, __memcpy_chk, CPU_FEATURE_USABLE (SSSE3), @@ -1187,23 +1247,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, CPU_FEATURE_USABLE (AVX512VL), __memcpy_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, CPU_FEATURE_USABLE (AVX512VL), __memcpy_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX512VL), + __memcpy_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, CPU_FEATURE_USABLE (AVX), __memcpy_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX), + __memcpy_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, CPU_FEATURE_USABLE (AVX), __memcpy_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + CPU_FEATURE_USABLE (AVX), + __memcpy_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __memcpy_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, memcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __memcpy_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
*/ X86_IFUNC_IMPL_ADD_V2 (array, i, memcpy, CPU_FEATURE_USABLE (SSSE3), @@ -1234,23 +1314,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_chk_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_chk_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_chk_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_chk_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX), __mempcpy_chk_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX), + __mempcpy_chk_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (AVX), __mempcpy_chk_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + CPU_FEATURE_USABLE (AVX), + __mempcpy_chk_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_chk_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_chk_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_chk_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, __mempcpy_chk, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_chk_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. */ X86_IFUNC_IMPL_ADD_V2 (array, i, __mempcpy_chk, CPU_FEATURE_USABLE (SSSE3), @@ -1281,23 +1381,43 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_evex_unaligned) + X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_evex_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX512VL), __mempcpy_evex_unaligned_erms) + X86_IFUNC_IMPL_ADD_V4 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX512VL), + __mempcpy_evex_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX), __mempcpy_avx_unaligned) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX), + __mempcpy_avx_unaligned_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, CPU_FEATURE_USABLE (AVX), __mempcpy_avx_unaligned_erms) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + CPU_FEATURE_USABLE (AVX), + __mempcpy_avx_unaligned_erms_page_unrolled) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_avx_unaligned_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_avx_unaligned_page_unrolled_rtm) X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, (CPU_FEATURE_USABLE (AVX) && CPU_FEATURE_USABLE (RTM)), __mempcpy_avx_unaligned_erms_rtm) + X86_IFUNC_IMPL_ADD_V3 (array, i, mempcpy, + (CPU_FEATURE_USABLE (AVX) + && CPU_FEATURE_USABLE (RTM)), + __mempcpy_avx_unaligned_erms_page_unrolled_rtm) /* By V3 we assume fast aligned copy. 
       X86_IFUNC_IMPL_ADD_V2 (array, i, mempcpy,
                              CPU_FEATURE_USABLE (SSSE3),
diff --git a/sysdeps/x86_64/multiarch/ifunc-memmove.h b/sysdeps/x86_64/multiarch/ifunc-memmove.h
index de0ac73a2a..6d5df8a9eb 100644
--- a/sysdeps/x86_64/multiarch/ifunc-memmove.h
+++ b/sysdeps/x86_64/multiarch/ifunc-memmove.h
@@ -28,18 +28,27 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned_erms)
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_no_vzeroupper)
   attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned)
-  attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned_erms)
-  attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+  OPTIMIZE (evex_unaligned_page_unrolled) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+  OPTIMIZE (evex_unaligned_erms) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+  OPTIMIZE (evex_unaligned_erms_page_unrolled) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned) attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms)
-  attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm)
-  attribute_hidden;
-extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms_rtm)
-  attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+  OPTIMIZE (avx_unaligned_page_unrolled) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+  OPTIMIZE (avx_unaligned_erms_page_unrolled) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+  OPTIMIZE (avx_unaligned_page_unrolled_rtm) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+  OPTIMIZE (avx_unaligned_erms_rtm) attribute_hidden;
+extern __typeof (REDIRECT_NAME)
+  OPTIMIZE (avx_unaligned_erms_page_unrolled_rtm) attribute_hidden;
 
 extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;
 
@@ -71,40 +80,60 @@ IFUNC_SELECTOR (void)
       return OPTIMIZE (avx512_no_vzeroupper);
     }
 
-  if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
-                                   AVX_Fast_Unaligned_Load, ))
+  if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load, ))
     {
       if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL))
         {
           if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-            return OPTIMIZE (evex_unaligned_erms);
-
+            {
+              if (CPU_FEATURES_ARCH_P (cpu_features,
+                                       Prefer_Page_Unrolled_Large_Copy))
+                return OPTIMIZE (evex_unaligned_erms_page_unrolled);
+              return OPTIMIZE (evex_unaligned_erms);
+            }
+
+          if (CPU_FEATURES_ARCH_P (cpu_features,
+                                   Prefer_Page_Unrolled_Large_Copy))
+            return OPTIMIZE (evex_unaligned_page_unrolled);
           return OPTIMIZE (evex_unaligned);
         }
 
       if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
        {
          if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-            return OPTIMIZE (avx_unaligned_erms_rtm);
-
+            {
+              if (CPU_FEATURES_ARCH_P (cpu_features,
+                                       Prefer_Page_Unrolled_Large_Copy))
+                return OPTIMIZE (avx_unaligned_erms_page_unrolled_rtm);
+              return OPTIMIZE (avx_unaligned_erms_rtm);
+            }
+          if (CPU_FEATURES_ARCH_P (cpu_features,
+                                   Prefer_Page_Unrolled_Large_Copy))
+            return OPTIMIZE (avx_unaligned_page_unrolled_rtm);
          return OPTIMIZE (avx_unaligned_rtm);
        }
 
-      if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
-                                       Prefer_No_VZEROUPPER, !))
+      if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER, !))
        {
          if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-            return OPTIMIZE (avx_unaligned_erms);
-
+            {
+              if (CPU_FEATURES_ARCH_P (cpu_features,
+                                       Prefer_Page_Unrolled_Large_Copy))
+                return OPTIMIZE (avx_unaligned_erms_page_unrolled);
+              return OPTIMIZE (avx_unaligned_erms);
+            }
+          if (CPU_FEATURES_ARCH_P (cpu_features,
+                                   Prefer_Page_Unrolled_Large_Copy))
+            return OPTIMIZE (avx_unaligned_page_unrolled);
          return OPTIMIZE (avx_unaligned);
        }
     }
 
   if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, SSSE3)
       /* Leave this as runtime check.  The SSSE3 is optimized almost
-         exclusively for avoiding unaligned memory access during the
-         copy and by and large is not better than the sse2
-         implementation as a general purpose memmove.  */
+         exclusively for avoiding unaligned memory access during the
+         copy and by and large is not better than the sse2
+         implementation as a general purpose memmove.  */
       && !CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Copy))
     {
       return OPTIMIZE (ssse3);
diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S
new file mode 100644
index 0000000000..683d903243
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled-rtm.S
@@ -0,0 +1,5 @@
+#ifndef MEMMOVE_SYMBOL
+# define MEMMOVE_SYMBOL(p,s)	p##_avx_##s##_page_unrolled_rtm
+#endif
+#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S"
+#include "memmove-avx-unaligned-erms-rtm.S"
diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S
new file mode 100644
index 0000000000..57b518e16f
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-page-unrolled.S
@@ -0,0 +1,5 @@
+#ifndef MEMMOVE_SYMBOL
+# define MEMMOVE_SYMBOL(p,s)	p##_avx_##s##_page_unrolled
+#endif
+#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S"
+#include "memmove-avx-unaligned-erms.S"
diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S
index 20746e6713..36e864e935 100644
--- a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S
+++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S
@@ -2,7 +2,9 @@
 
 # include "x86-avx-rtm-vecs.h"
 
+#ifndef MEMMOVE_SYMBOL
 # define MEMMOVE_SYMBOL(p,s)	p##_avx_##s##_rtm
+#endif
 
 # include "memmove-vec-unaligned-erms.S"
 #endif
diff --git a/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S
new file mode 100644
index 0000000000..371b454819
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms-page-unrolled.S
@@ -0,0 +1,5 @@
+#ifndef MEMMOVE_SYMBOL
+# define MEMMOVE_SYMBOL(p,s)	p##_evex_##s##_page_unrolled
+#endif
+#define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S"
+#include "memmove-evex-unaligned-erms.S"
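
The new wrapper files above are intentionally minimal: each one picks the IFUNC symbol suffix via MEMMOVE_SYMBOL, routes the large-copy path to the shared page-unrolled body via MEMMOVE_VEC_LARGE_IMPL, and then reuses the existing unaligned/erms implementation.  As an illustration only (not part of this patch), a wrapper for a further variant would follow the same shape; the _avx512_ suffix and the memmove-avx512-unaligned-erms.S include below are assumptions made for the sake of the sketch:

    /* Hypothetical sketch, not part of this patch: the same two-macro
       wrapper pattern applied to an assumed avx512 page-unrolled variant.
       1. Choose the IFUNC symbol suffix before anything else defines it.
       2. Point the large-copy path at the shared page-unrolled body.
       3. Reuse the existing unaligned/erms implementation.  */
    #ifndef MEMMOVE_SYMBOL
    # define MEMMOVE_SYMBOL(p,s)	p##_avx512_##s##_page_unrolled
    #endif
    #define MEMMOVE_VEC_LARGE_IMPL "memmove-vec-large-page-unrolled.S"
    #include "memmove-avx512-unaligned-erms.S"

Note that the #ifndef guard added to memmove-avx-unaligned-erms-rtm.S above is what makes this composition work: the rtm file only supplies its own _rtm suffix when the including wrapper has not already chosen one.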