Patchwork [v5] aarch64: thunderx2 memcpy optimizations for ext-based code path

Submitter Anton Youdkevitch
Date April 1, 2019, 1:38 p.m.
Message ID <5CA2145C.8040402@bell-sw.com>
Permalink /patch/32122/
State New

Comments

Anton Youdkevitch - April 1, 2019, 1:38 p.m.
Here is the updated patch improving the long unaligned
code path (the one using the "ext" instruction).

1. The always-taken conditional branch at the beginning is
removed.

2. The epilogue code is placed after the end of the loop to
reduce the number of branches.

3. The redundant "mov" instructions inside the loop are
gone because the order of the registers in the "ext"
instructions inside the loop was changed; the prologue has
an additional "ext" instruction.

4. Updating count in the prologue was hoisted out, as
it is the same update for each prologue.

5. Invariant code of the loop epilogue was hoisted out.

6. As the ext chunk is currently exactly 16 instructions
long, a "nop" was added at the beginning of the code
sequence so that the loop entry is aligned for all the
chunks.
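
The arithmetic behind item 6 can be sketched in C. This is only an illustration, not part of the patch: the helper names are hypothetical, and the sketch assumes what the diff shows, namely that ".p2align 4" aligns the address just before the "nop", that three "ext" instructions precede the loop label "1:" in each chunk, and that every AArch64 instruction is 4 bytes.

```c
#include <assert.h>

enum {
    INSN_BYTES     = 4,   /* every AArch64 instruction is 4 bytes */
    CHUNK_INSNS    = 16,  /* size of one EXT_CHUNK expansion (item 6) */
    PROLOGUE_INSNS = 3,   /* "ext" instructions before the "1:" label (assumed from the diff) */
    LOOP_ALIGN     = 16   /* alignment requested by ".p2align 4" */
};

/* Byte offset of the loop label "1:" of chunk `shft` (1..15), measured
   from the 16-byte-aligned address produced by ".p2align 4", with
   `pad_insns` padding instructions placed before the first chunk. */
static unsigned loop_entry_offset(unsigned shft, unsigned pad_insns)
{
    assert(shft >= 1 && shft <= 15);
    return (pad_insns + (shft - 1) * CHUNK_INSNS + PROLOGUE_INSNS) * INSN_BYTES;
}

/* Returns 1 if the loop entry of every chunk is 16-byte aligned. */
static int all_loop_entries_aligned(unsigned pad_insns)
{
    for (unsigned shft = 1; shft <= 15; shft++)
        if (loop_entry_offset(shft, pad_insns) % LOOP_ALIGN != 0)
            return 0;
    return 1;
}
```

With one padding instruction the first loop entry lands at offset 16, and since each chunk is 64 bytes long every later loop entry stays on a 16-byte boundary; without the padding each entry would sit 12 bytes past a boundary.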

make check - no regression (on linux-aarch64)
make bench - no performance regressions (on Thunderx2)
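
For readers less familiar with the "ext" instruction this code path is built on, here is a small C model of its byte semantics. It is a sketch for illustration only and not part of the patch: ext16 and ext_matches_stream are hypothetical names, and register byte i is assumed to be the byte loaded from address base + i (a little-endian "ldr"/"ldp" of a q register).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Software model of "ext vd.16b, vn.16b, vm.16b, #imm": the result is
   bytes imm..15 of vn followed by bytes 0..imm-1 of vm. */
static void ext16(uint8_t vd[16], const uint8_t vn[16],
                  const uint8_t vm[16], unsigned imm)
{
    uint8_t tmp[16];   /* temporary so that vd may alias vn or vm */

    assert(imm < 16);
    for (unsigned i = 0; i < 16; i++) {
        unsigned idx = imm + i;
        tmp[i] = idx < 16 ? vn[idx] : vm[idx - 16];
    }
    memcpy(vd, tmp, 16);
}

/* Hypothetical check: treat two consecutive 16-byte loads (C, then D) as
   one 32-byte stream and verify that ext with #(16 - shft) yields the
   16 bytes starting at stream offset 16 - shft. Returns 1 on a match. */
static int ext_matches_stream(unsigned shft)
{
    uint8_t stream[32], a[16];

    assert(shft >= 1 && shft <= 15);
    for (unsigned i = 0; i < 32; i++)
        stream[i] = (uint8_t)(0xA0 + i);
    ext16(a, stream, stream + 16, 16 - shft);
    return memcmp(a, stream + (16 - shft), 16) == 0;
}
```

In other words, "ext A_v.16b, C_v.16b, D_v.16b, 16-shft" stitches the last shft bytes of C to the first 16-shft bytes of D, which is how two 16-byte loads are turned into one 16-byte store at a different alignment.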

Looks OK?
Steve Ellcey - April 2, 2019, 10:48 p.m.
On Mon, 2019-04-01 at 16:38 +0300, Anton Youdkevitch wrote:
> make check - no regression (on linux-aarch64)
> make bench - no performance regressions (on Thunderx2)
>
> Looks OK?

This looks good to me Anton.  I can check it in for you if we have a
consensus that this version is OK and there are no objections.

Steve Ellcey
sellcey@marvell.com
Szabolcs Nagy - April 5, 2019, 3:21 p.m.
On 02/04/2019 23:48, Steve Ellcey wrote:
> On Mon, 2019-04-01 at 16:38 +0300, Anton Youdkevitch wrote:
>> make check - no regression (on linux-aarch64)
>> make bench - no performance regressions (on Thunderx2)
>>
>> Looks OK?
>
> This looks good to me Anton.  I can check it in for you if we have a
> consensus that this version is OK and there are no objections.

yes, this is OK to commit, i have no objections.
Steve Ellcey - April 5, 2019, 9:05 p.m.
On Fri, 2019-04-05 at 15:21 +0000, Szabolcs Nagy wrote:
> On 02/04/2019 23:48, Steve Ellcey wrote:
> > On Mon, 2019-04-01 at 16:38 +0300, Anton Youdkevitch wrote:
> >
> > > make check - no regression (on linux-aarch64)
> > > make bench - no performance regressions (on Thunderx2)
> > >
> > > Looks OK?
> >
> > This looks good to me Anton.  I can check it in for you if we have a
> > consensus that this version is OK and there are no objections.
>
> yes, this is OK to commit, i have no objections.


Anton,  I have gone ahead and committed this for you.

Steve Ellcey
sellcey@marvell.com
Anton Youdkevitch - April 5, 2019, 9:34 p.m.
On 4/6/2019 00:05, Steve Ellcey wrote:
> On Fri, 2019-04-05 at 15:21 +0000, Szabolcs Nagy wrote:
>> On 02/04/2019 23:48, Steve Ellcey wrote:
>>> On Mon, 2019-04-01 at 16:38 +0300, Anton Youdkevitch wrote:
>>>
>>>> make check - no regression (on linux-aarch64)
>>>> make bench - no performance regressions (on Thunderx2)
>>>>
>>>> Looks OK?
>>> This looks good to me Anton.  I can check it in for you if we have a
>>> consensus that this version is OK and there are no objections.
>>
>> yes, this is OK to commit, i have no objections.
>
> Anton,  I have gone ahead and committed this for you.

OK, thanks a lot!

Patch

diff --git a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
index b2215c1..45e9a29 100644
--- a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
+++ b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
@@ -382,7 +382,8 @@  L(bytes_0_to_3):
 	strb    A_lw, [dstin]
 	strb    B_lw, [dstin, tmp1]
 	strb    A_hw, [dstend, -1]
-L(end): ret
+L(end):
+	ret
 
 	.p2align 4
 
@@ -544,43 +545,35 @@  L(dst_unaligned):
 	str     C_q, [dst], #16
 	ldp     F_q, G_q, [src], #32
 	bic	dst, dst, 15
+	subs    count, count, 32
 	adrp	tmp2, L(ext_table)
 	add	tmp2, tmp2, :lo12:L(ext_table)
 	add	tmp2, tmp2, tmp1, LSL #2
 	ldr	tmp3w, [tmp2]
 	add	tmp2, tmp2, tmp3w, SXTW
 	br	tmp2
-
-#define EXT_CHUNK(shft) \
 .p2align 4 ;\
+	nop
+#define EXT_CHUNK(shft) \
 L(ext_size_ ## shft):;\
 	ext     A_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ext     B_v.16b, D_v.16b, E_v.16b, 16-shft;\
-	subs    count, count, 32;\
-	b.ge    2f;\
-1:;\
-	stp     A_q, B_q, [dst], #32;\
 	ext     H_v.16b, E_v.16b, F_v.16b, 16-shft;\
-	ext     I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	stp     H_q, I_q, [dst], #16;\
-	add     dst, dst, tmp1;\
-	str     G_q, [dst], #16;\
-	b       L(copy_long_check32);\
-2:;\
+1:;\
 	stp     A_q, B_q, [dst], #32;\
 	prfm    pldl1strm, [src, MEMCPY_PREFETCH_LDR];\
-	ldp     D_q, J_q, [src], #32;\
-	ext     H_v.16b, E_v.16b, F_v.16b, 16-shft;\
+	ldp     C_q, D_q, [src], #32;\
 	ext     I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	mov     C_v.16b, G_v.16b;\
 	stp     H_q, I_q, [dst], #32;\
+	ext     A_v.16b, G_v.16b, C_v.16b, 16-shft;\
+	ext     B_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ldp     F_q, G_q, [src], #32;\
-	ext     A_v.16b, C_v.16b, D_v.16b, 16-shft;\
-	ext     B_v.16b, D_v.16b, J_v.16b, 16-shft;\
-	mov     E_v.16b, J_v.16b;\
+	ext     H_v.16b, D_v.16b, F_v.16b, 16-shft;\
 	subs    count, count, 64;\
-	b.ge    2b;\
-	b	1b;\
+	b.ge    1b;\
+2:;\
+	ext     I_v.16b, F_v.16b, G_v.16b, 16-shft;\
+	b	L(ext_tail);
 
 EXT_CHUNK(1)
 EXT_CHUNK(2)
@@ -598,6 +591,14 @@  EXT_CHUNK(13)
 EXT_CHUNK(14)
 EXT_CHUNK(15)
 
+L(ext_tail):
+	stp     A_q, B_q, [dst], #32
+	stp     H_q, I_q, [dst], #16
+	add     dst, dst, tmp1
+	str     G_q, [dst], #16
+	b       L(copy_long_check32)
+
+
 END (MEMCPY)
 	.section	.rodata
 	.p2align	4