[v2] aarch64: thunderx2 memcpy branches reordering
Commit Message
Rewrote the branches in load and merge chunk
so that the order is more in line with the
most probable case.
ChangeLog:
* sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
branches reordering
Comments
Adhemaral,
The reason I did the rewriting is two-fold. First, it
looks more clean as there aren't now two branches
immediately following each other. Second, it can be
performance beneficial as we are saving one branch
on entry for the most of the cases (we take it only if
the first iteration is a partial iteration). But, of course,
the performance benefit was not my concern here.
However, this is one instruction less anyway.
On 3/20/2019 19:29, Anton Youdkevitch wrote:
> Rewrote the branches in load and merge chunk
> so that the order is more in line with the
> most probable case.
>
> ChangeLog:
> * sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
> branches reordering
>
If is just a non functional change, you should indicate it. However it
does change the code and thus for such refactor patches on arch-specific
implementations we usually ask for, besides running the testcase for
regression (which is also not indicate in your message in which platform
you actually has tested), that at least some performance evaluation is
provided (even if this shows no gain).
On 21/03/2019 15:44, Anton Youdkevitch wrote:
> Adhemaral,
>
> The reason I did the rewriting is two-fold. First, it
> looks more clean as there aren't now two branches
> immediately following each other. Second, it can be
> performance beneficial as we are saving one branch
> on entry for the most of the cases (we take it only if
> the first iteration is a partial iteration). But, of course,
> the performance benefit was not my concern here.
> However, this is one instruction less anyway.
>
>
> On 3/20/2019 19:29, Anton Youdkevitch wrote:
>> Rewrote the branches in load and merge chunk
>> so that the order is more in line with the
>> most probable case.
>>
>> ChangeLog:
>> * sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
>> branches reordering
>>
>
Adhemerval,
On 3/21/2019 23:38, Adhemerval Zanella wrote:
> If is just a non functional change, you should indicate it. However it
> does change the code and thus for such refactor patches on arch-specific
> implementations we usually ask for, besides running the testcase for
> regression (which is also not indicate in your message in which platform
> you actually has tested), that at least some performance evaluation is
> provided (even if this shows no gain).
make check - no regressions
make bench - no performance regressions for memcpy
Should I resubmit the same patch with the description mentioning it is a
non functional change?
>
> On 21/03/2019 15:44, Anton Youdkevitch wrote:
>> Adhemaral,
>>
>> The reason I did the rewriting is two-fold. First, it
>> looks more clean as there aren't now two branches
>> immediately following each other. Second, it can be
>> performance beneficial as we are saving one branch
>> on entry for the most of the cases (we take it only if
>> the first iteration is a partial iteration). But, of course,
>> the performance benefit was not my concern here.
>> However, this is one instruction less anyway.
>>
>>
>> On 3/20/2019 19:29, Anton Youdkevitch wrote:
>>> Rewrote the branches in load and merge chunk
>>> so that the order is more in line with the
>>> most probable case.
>>>
>>> ChangeLog:
>>> * sysdeps/aarch64/multiarch/memcpy_thunderx2.S:
>>> branches reordering
>>>
>>
On 22/03/2019 05:13, Anton Youdkevitch wrote:
> Adhemerval,
>
> On 3/21/2019 23:38, Adhemerval Zanella wrote:
>> If is just a non functional change, you should indicate it. However it
>> does change the code and thus for such refactor patches on arch-specific
>> implementations we usually ask for, besides running the testcase for
>> regression (which is also not indicate in your message in which platform
>> you actually has tested), that at least some performance evaluation is
>> provided (even if this shows no gain).
> make check - no regressions
> make bench - no performance regressions for memcpy
>
> Should I resubmit the same patch with the description mentioning it is a
> non functional change?
I think the provided information should be suffice. It also a good thing
the patch could reduce code size (from 2964 to 2736), so I am ok with this
change.
Szabolcs, are you ok with this change?
@@ -382,7 +382,8 @@ L(bytes_0_to_3):
strb A_lw, [dstin]
strb B_lw, [dstin, tmp1]
strb A_hw, [dstend, -1]
-L(end): ret
+L(end):
+ ret
.p2align 4
@@ -557,17 +558,9 @@ L(ext_size_ ## shft):;\
ext A_v.16b, C_v.16b, D_v.16b, 16-shft;\
ext B_v.16b, D_v.16b, E_v.16b, 16-shft;\
subs count, count, 32;\
- b.ge 2f;\
+ b.lt 2f;\
1:;\
stp A_q, B_q, [dst], #32;\
- ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\
- ext I_v.16b, F_v.16b, G_v.16b, 16-shft;\
- stp H_q, I_q, [dst], #16;\
- add dst, dst, tmp1;\
- str G_q, [dst], #16;\
- b L(copy_long_check32);\
-2:;\
- stp A_q, B_q, [dst], #32;\
prfm pldl1strm, [src, MEMCPY_PREFETCH_LDR];\
ldp D_q, J_q, [src], #32;\
ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\
@@ -579,8 +572,15 @@ L(ext_size_ ## shft):;\
ext B_v.16b, D_v.16b, J_v.16b, 16-shft;\
mov E_v.16b, J_v.16b;\
subs count, count, 64;\
- b.ge 2b;\
- b 1b;\
+ b.ge 1b;\
+2:;\
+ stp A_q, B_q, [dst], #32;\
+ ext H_v.16b, E_v.16b, F_v.16b, 16-shft;\
+ ext I_v.16b, F_v.16b, G_v.16b, 16-shft;\
+ stp H_q, I_q, [dst], #16;\
+ add dst, dst, tmp1;\
+ str G_q, [dst], #16;\
+ b L(copy_long_check32);\
EXT_CHUNK(1)
EXT_CHUNK(2)