[RFC] AArch64: optimize stpcpy a bit.
Commit Message
On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
> On 04/06/15 13:28, Ondřej Bílka wrote:
> > On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
> >> On 25/05/15 12:45, Ondřej Bílka wrote:
> >>> Replaces it with strcpy. One could argue that the opposite way, replacing
> >>> strcpy with stpcpy, is faster.
> >>>
> >>> The reason is register pressure: strcpy needs an extra register to save the
> >>> return value, while stpcpy already has the return value in the register
> >>> used for writing the terminating zero.
> >>
> >>
> >> Depends on your architecture. On aarch64 we have plenty of spare
> >> registers, so strcpy simply copies the destination register into a
> >> scratch. It then doesn't have to carefully calculate the return value
> >> at the end of the function (making the tail code simpler - there are
> >> multiple return statements, but only one entry point).
> >>
> > That's correct; the main saving is that the return value is in the first
> > register, which otherwise forces an extra copy and is suboptimal.
>
> No, look at the AArch64 code. The only time we ever end up with a
> simple MOV instruction to copy the register from one location to another
> is in the stPcpy code. In strcpy it always ends up folded into some
> other operation that we have to do anyway. Once it's been copied to
> that other register we never have to use it elsewhere again.
> Furthermore, the need to handle smallish copies with overlapping stores
> means we need both the original base address /and/ the final result, so
> we'd still need to end up saving it for stpcpy.
>
I wrote too fast; I was referring to the copy you would need for small
sizes. With dest in a different register I could make strcpy and stpcpy
consist of the same instructions in most cases, except for sizes 8-15, by
adjusting offsets with some constants.

Also, I think you could remove the extra instructions from the stpcpy loop
with the following, which also removes one instruction from strcpy, if I
read the code correctly.
Comments
On 04/06/15 15:16, Ondřej Bílka wrote:
> On Thu, Jun 04, 2015 at 02:44:59PM +0100, Richard Earnshaw wrote:
>> On 04/06/15 13:28, Ondřej Bílka wrote:
>>> On Thu, Jun 04, 2015 at 11:27:57AM +0100, Richard Earnshaw wrote:
>>>> On 25/05/15 12:45, Ondřej Bílka wrote:
>>>>> Replaces it with strcpy. One could argue that the opposite way, replacing
>>>>> strcpy with stpcpy, is faster.
>>>>>
>>>>> The reason is register pressure: strcpy needs an extra register to save the
>>>>> return value, while stpcpy already has the return value in the register
>>>>> used for writing the terminating zero.
>>>>
>>>>
>>>> Depends on your architecture. On aarch64 we have plenty of spare
>>>> registers, so strcpy simply copies the destination register into a
>>>> scratch. It then doesn't have to carefully calculate the return value
>>>> at the end of the function (making the tail code simpler - there are
>>>> multiple return statements, but only one entry point).
>>>>
>>> That's correct; the main saving is that the return value is in the first
>>> register, which otherwise forces an extra copy and is suboptimal.
>>
>> No, look at the AArch64 code. The only time we ever end up with a
>> simple MOV instruction to copy the register from one location to another
>> is in the stPcpy code. In strcpy it always ends up folded into some
>> other operation that we have to do anyway. Once it's been copied to
>> that other register we never have to use it elsewhere again.
>> Furthermore, the need to handle smallish copies with overlapping stores
>> means we need both the original base address /and/ the final result, so
>> we'd still need to end up saving it for stpcpy.
>>
> I wrote too fast; I was referring to the copy you would need for small
> sizes. With dest in a different register I could make strcpy and stpcpy
> consist of the same instructions in most cases, except for sizes 8-15, by
> adjusting offsets with some constants.
>
> Also, I think you could remove the extra instructions from the stpcpy loop
> with the following, which also removes one instruction from strcpy, if I
> read the code correctly.
>
> diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
> index 28846fb..7ca0412 100644
> --- a/sysdeps/aarch64/strcpy.S
> +++ b/sysdeps/aarch64/strcpy.S
> @@ -204,6 +204,9 @@ L(bulk_entry):
> sub to_align, to_align, #16
> stp data1, data2, [dstin]
> sub src, srcin, to_align
> +#ifdef BUILD_STPCPY
> +# define dst dstin
> +#endif
> sub dst, dstin, to_align
> b L(entry_no_page_cross)
>
> @@ -243,17 +246,16 @@ L(entry_no_page_cross):
> #endif
> rev has_nul1, has_nul1
> clz pos, has_nul1
> - add tmp1, pos, #72
> - add pos, pos, #8
> + add tmp1, pos, #64
> csel pos, pos, tmp1, ne
> add src, src, pos, lsr #3
> add dst, dst, pos, lsr #3
> - ldp data1, data2, [src, #-32]
> - stp data1, data2, [dst, #-16]
> + ldp data1, data2, [src, #-31]
> + stp data1, data2, [dst, #-15]
That's not valid; the offset has to be a multiple of the register size
(8 in this case).
> + ret
> #ifdef BUILD_STPCPY
> - sub dstin, dst, #1
> +# undef dst
Nor is this valid: dst is already a #define, so this leaves it unspecified.
But since you can't avoid the late subtract, that's irrelevant anyway.
R.
> #endif
> - ret
>
> L(page_cross):
> bic src, srcin, #15
>