[v2] aarch64: thunderx2 memcpy branches reordering

Message ID 20190320162930.GB13393@bell-sw.com
State Superseded
Headers

Commit Message

Anton Youdkevitch March 20, 2019, 4:29 p.m. UTC
  Rewrote the branches in load and merge chunk
so that the order is more in line with the 
most probable case.

ChangeLog:
	* sysdeps/aarch64/multiarch/memcpy_thunderx2.S: 
	branches reordering
  

Comments

Anton Youdkevitch March 21, 2019, 6:44 p.m. UTC | #1
Adhemaral,

The reason I did the rewriting is two-fold. First, it
looks more clean as there aren't now two branches
immediately following each other. Second, it can be
performance beneficial as we are saving one branch
on entry for the most of the cases (we take it only if
the first iteration is a partial iteration). But, of course,
the performance benefit was not my concern here.
However, this is one instruction less anyway.


On 3/20/2019 19:29, Anton Youdkevitch wrote:
> Rewrote the branches in load and merge chunk
> so that the order is more in line with the
> most probable case.
>
> ChangeLog:
> 	* sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
> 	branches reordering
>
  
Adhemerval Zanella Netto March 21, 2019, 8:38 p.m. UTC | #2
If is just a non functional change, you should indicate it.  However it
does change the code and thus for such refactor patches on arch-specific 
implementations we usually ask for, besides running the testcase for 
regression (which is also not indicate in your message in which platform 
you actually has tested), that at least some performance evaluation is 
provided (even if this shows no gain).

On 21/03/2019 15:44, Anton Youdkevitch wrote:
> Adhemaral,
> 
> The reason I did the rewriting is two-fold. First, it
> looks more clean as there aren't now two branches
> immediately following each other. Second, it can be
> performance beneficial as we are saving one branch
> on entry for the most of the cases (we take it only if
> the first iteration is a partial iteration). But, of course,
> the performance benefit was not my concern here.
> However, this is one instruction less anyway.
> 
> 
> On 3/20/2019 19:29, Anton Youdkevitch wrote:
>> Rewrote the branches in load and merge chunk
>> so that the order is more in line with the
>> most probable case.
>>
>> ChangeLog:
>>     * sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
>>     branches reordering
>>
>
  
Anton Youdkevitch March 22, 2019, 8:13 a.m. UTC | #3
Adhemerval,

On 3/21/2019 23:38, Adhemerval Zanella wrote:
> If is just a non functional change, you should indicate it.  However it
> does change the code and thus for such refactor patches on arch-specific
> implementations we usually ask for, besides running the testcase for
> regression (which is also not indicate in your message in which platform
> you actually has tested), that at least some performance evaluation is
> provided (even if this shows no gain).
make check - no regressions
make bench - no performance regressions for memcpy

Should I resubmit the same patch with the description mentioning it is a
non functional change?


>
> On 21/03/2019 15:44, Anton Youdkevitch wrote:
>> Adhemaral,
>>
>> The reason I did the rewriting is two-fold. First, it
>> looks more clean as there aren't now two branches
>> immediately following each other. Second, it can be
>> performance beneficial as we are saving one branch
>> on entry for the most of the cases (we take it only if
>> the first iteration is a partial iteration). But, of course,
>> the performance benefit was not my concern here.
>> However, this is one instruction less anyway.
>>
>>
>> On 3/20/2019 19:29, Anton Youdkevitch wrote:
>>> Rewrote the branches in load and merge chunk
>>> so that the order is more in line with the
>>> most probable case.
>>>
>>> ChangeLog:
>>>      * sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
>>>      branches reordering
>>>
>>
  
Adhemerval Zanella Netto March 22, 2019, 2:31 p.m. UTC | #4
On 22/03/2019 05:13, Anton Youdkevitch wrote:
> Adhemerval,
> 
> On 3/21/2019 23:38, Adhemerval Zanella wrote:
>> If is just a non functional change, you should indicate it.  However it
>> does change the code and thus for such refactor patches on arch-specific
>> implementations we usually ask for, besides running the testcase for
>> regression (which is also not indicate in your message in which platform
>> you actually has tested), that at least some performance evaluation is
>> provided (even if this shows no gain).
> make check - no regressions
> make bench - no performance regressions for memcpy
> 
> Should I resubmit the same patch with the description mentioning it is a
> non functional change?

I think the provided information should be suffice. It also a good thing
the patch could reduce code size (from 2964 to 2736), so I am ok with this
change.

Szabolcs, are you ok with this change?
  

Patch

diff --git a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
index b2215c1..f637300 100644
--- a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
+++ b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
@@ -382,7 +382,8 @@  L(bytes_0_to_3):
 	strb    A_lw, [dstin]
 	strb    B_lw, [dstin, tmp1]
 	strb    A_hw, [dstend, -1]
-L(end): ret
+L(end):
+	ret
 
 	.p2align 4
 
@@ -557,17 +558,9 @@  L(ext_size_ ## shft):;\
 	ext     A_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ext     B_v.16b, D_v.16b, E_v.16b, 16-shft;\
 	subs    count, count, 32;\
-	b.ge    2f;\
+	b.lt    2f;\
 1:;\
 	stp     A_q, B_q, [dst], #32;\
-	ext     H_v.16b, E_v.16b, F_v.16b, 16-shft;\
-	ext     I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	stp     H_q, I_q, [dst], #16;\
-	add     dst, dst, tmp1;\
-	str     G_q, [dst], #16;\
-	b       L(copy_long_check32);\
-2:;\
-	stp     A_q, B_q, [dst], #32;\
 	prfm    pldl1strm, [src, MEMCPY_PREFETCH_LDR];\
 	ldp     D_q, J_q, [src], #32;\
 	ext     H_v.16b, E_v.16b, F_v.16b, 16-shft;\
@@ -579,8 +572,15 @@  L(ext_size_ ## shft):;\
 	ext     B_v.16b, D_v.16b, J_v.16b, 16-shft;\
 	mov     E_v.16b, J_v.16b;\
 	subs    count, count, 64;\
-	b.ge    2b;\
-	b	1b;\
+	b.ge    1b;\
+2:;\
+	stp     A_q, B_q, [dst], #32;\
+	ext     H_v.16b, E_v.16b, F_v.16b, 16-shft;\
+	ext     I_v.16b, F_v.16b, G_v.16b, 16-shft;\
+	stp     H_q, I_q, [dst], #16;\
+	add     dst, dst, tmp1;\
+	str     G_q, [dst], #16;\
+	b       L(copy_long_check32);\
 
 EXT_CHUNK(1)
 EXT_CHUNK(2)