From patchwork Mon Apr 1 13:38:36 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Anton Youdkevitch
X-Patchwork-Id: 32122
To: "libc-alpha@sourceware.org"
From: Anton Youdkevitch
Subject: [PATCH v5] aarch64: thunderx2 memcpy optimizations for ext-based code path
Message-ID: <5CA2145C.8040402@bell-sw.com>
Date: Mon, 1 Apr 2019 16:38:36 +0300

Here is the updated patch improving the long unaligned code path
(the one using the "ext" instruction).

1. The always-taken conditional branch at the beginning is removed.

2. The epilogue code is placed after the end of the loop to reduce
   the number of branches.

3. The redundant "mov" instructions inside the loop are gone because
   the register order in the loop's "ext" instructions was changed;
   the prologue has an additional "ext" instruction.

4. The count update in the prologue was hoisted out, as it is the
   same update for each prologue.

5. The invariant code of the loop epilogue was hoisted out.

6. As the ext chunk is currently exactly 16 instructions long, a
   "nop" was added at the beginning of the code sequence so that the
   loop entry of every chunk is aligned.

make check - no regressions (on linux-aarch64)
make bench - no performance regressions (on ThunderX2)

Looks OK?

diff --git a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
index b2215c1..45e9a29 100644
--- a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
+++ b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
@@ -382,7 +382,8 @@ L(bytes_0_to_3):
 	strb	A_lw, [dstin]
 	strb	B_lw, [dstin, tmp1]
 	strb	A_hw, [dstend, -1]
-L(end): ret
+L(end):
+	ret
 
 	.p2align 4
 
@@ -544,43 +545,35 @@ L(dst_unaligned):
 	str	C_q, [dst], #16
 	ldp	F_q, G_q, [src], #32
 	bic	dst, dst, 15
+	subs	count, count, 32
 	adrp	tmp2, L(ext_table)
 	add	tmp2, tmp2, :lo12:L(ext_table)
 	add	tmp2, tmp2, tmp1, LSL #2
 	ldr	tmp3w, [tmp2]
 	add	tmp2, tmp2, tmp3w, SXTW
 	br	tmp2
-
-#define EXT_CHUNK(shft) \
 .p2align 4 ;\
+	nop
+#define EXT_CHUNK(shft) \
 L(ext_size_ ## shft):;\
 	ext	A_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ext	B_v.16b, D_v.16b, E_v.16b, 16-shft;\
-	subs	count, count, 32;\
-	b.ge	2f;\
-1:;\
-	stp	A_q, B_q, [dst], #32;\
 	ext	H_v.16b, E_v.16b, F_v.16b, 16-shft;\
-	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	stp	H_q, I_q, [dst], #16;\
-	add	dst, dst, tmp1;\
-	str	G_q, [dst], #16;\
-	b	L(copy_long_check32);\
-2:;\
+1:;\
 	stp	A_q, B_q, [dst], #32;\
 	prfm	pldl1strm, [src, MEMCPY_PREFETCH_LDR];\
-	ldp	D_q, J_q, [src], #32;\
-	ext	H_v.16b, E_v.16b, F_v.16b, 16-shft;\
+	ldp	C_q, D_q, [src], #32;\
 	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	mov	C_v.16b, G_v.16b;\
 	stp	H_q, I_q, [dst], #32;\
+	ext	A_v.16b, G_v.16b, C_v.16b, 16-shft;\
+	ext	B_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ldp	F_q, G_q, [src], #32;\
-	ext	A_v.16b, C_v.16b, D_v.16b, 16-shft;\
-	ext	B_v.16b, D_v.16b, J_v.16b, 16-shft;\
-	mov	E_v.16b, J_v.16b;\
+	ext	H_v.16b, D_v.16b, F_v.16b, 16-shft;\
 	subs	count, count, 64;\
-	b.ge	2b;\
-	b	1b;\
+	b.ge	1b;\
+2:;\
+	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
+	b	L(ext_tail);
 
 EXT_CHUNK(1)
 EXT_CHUNK(2)
@@ -598,6 +591,14 @@ EXT_CHUNK(13)
 EXT_CHUNK(14)
 EXT_CHUNK(15)
 
+L(ext_tail):
+	stp	A_q, B_q, [dst], #32
+	stp	H_q, I_q, [dst], #16
+	add	dst, dst, tmp1
+	str	G_q, [dst], #16
+	b	L(copy_long_check32)
+
+
 END (MEMCPY)
 	.section	.rodata
 	.p2align 4
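
For readers less familiar with the "ext" trick that EXT_CHUNK relies on, here is a rough, portable C model of it. It is only a sketch and is not part of the patch: ext16() and ext_stitch_ok() are hypothetical names made up for illustration, and the model assumes "ext Vd.16b, Vn.16b, Vm.16b, #idx" returns the 16 bytes starting at byte idx of the 32-byte concatenation Vn:Vm (Vn in the low half). Under that assumption, applying "ext" with index 16-shft to each pair of neighbouring 16-byte blocks yields the same byte stream shifted by shft bytes, which is what lets the loop keep its loads and stores 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable model of AArch64
   "ext Vd.16b, Vn.16b, Vm.16b, #idx": take 16 bytes starting at
   byte idx of the concatenation lo (low half) : hi (high half).  */
static void ext16(uint8_t out[16], const uint8_t lo[16],
                  const uint8_t hi[16], unsigned idx)
{
    uint8_t cat[32];

    assert(idx <= 16);
    memcpy(cat, lo, 16);
    memcpy(cat + 16, hi, 16);
    memcpy(out, cat + idx, 16);
}

/* Hypothetical check mirroring the first two "ext" instructions of
   EXT_CHUNK(shft): C, D, E are three consecutive 16-byte blocks and
   A, B are the stitched results.  Returns 1 when A and B equal the
   source stream starting shft bytes before D and E respectively.  */
static int ext_stitch_ok(unsigned shft)
{
    uint8_t src[48], a[16], b[16];
    unsigned i;

    for (i = 0; i < sizeof src; i++)
        src[i] = (uint8_t) i;

    ext16(a, src, src + 16, 16 - shft);       /* ext A, C, D, 16-shft */
    ext16(b, src + 16, src + 32, 16 - shft);  /* ext B, D, E, 16-shft */

    return memcmp(a, src + 16 - shft, 16) == 0
           && memcmp(b, src + 32 - shft, 16) == 0;
}
```

In this model ext_stitch_ok(shft) is expected to return 1 for every shft from 1 to 15, matching the fifteen EXT_CHUNK instantiations above.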