aarch64: thunderx2 memmove performance improvements

  Wilco,

Thanks a lot for your comments and suggestions.
I have attached the updated patch.

On 4/12/2019 21:03, Wilco Dijkstra wrote:
> Hi Anton,
>
> This looks like a good cleanup! A few comments and suggestions for
> improvements:
>
> 0. The diff is quite large due to tab/space changes. Would it be possible to split
> this into a separate patch?
Sure, I will send the tab/space cleanup as a separate patch later.

> 1. There are a few cases where ldp or stp could be used, but isn't, eg:
>
> +	str	B_q, [dst], #16
> +	ldp	H_q, I_q, [src], #32
> +	str	C_q, [dst], #16
>
> Why not do stp B_q, C_q, [dst], 32?
This is a remnant of trying to schedule the instructions manually.
Fixed.

> 2. There are a lot of writeback instructions used in cases where this isn't
> strictly required. Have you noticed this actually improves performance? Even
> if required for the main loop, it is best to reduce them where possible:
But there are no writebacks in the main loop - they are only in the loop's
tails, aren't they?

> +L(loop128_exit0):
> +	ldp	F_q, G_q, [srcend, -64]
> +	ldp	H_q, I_q, [srcend, -32]
> +	stp	B_q, C_q, [dst], #32
> +	stp	D_q, E_q, [dst], #32
> +	stp	F_q, G_q, [dstend, -64]
> +	stp	H_q, I_q, [dstend, -32]
> +	ret
>
>   L(loop128_exit1):
> +	ldp	B_q, C_q, [srcend, -64]
> +	ldp	D_q, E_q, [srcend, -32]
> +	stp	F_q, G_q, [dst], #32
> +	stp	H_q, I_q, [dst], #32
> +	stp	B_q, C_q, [dstend, -64]
> +	stp	D_q, E_q, [dstend, -32]
> +	ret
>
> Here dst is not used but incremented twice.
Right, good catch, thanks!
Done.

> 3. Missed optimization:
>
> +L(dst_unaligned_tail):
> +	ldp	C_q, D_q, [srcend, -64]
> +	ldp	E_q, F_q, [srcend, -32]
> +	stp	A_q, B_q, [dst], #32
> +	stp	H_q, I_q, [dst], #32
> +	add	dst, dst, tmp1
> +	str	G_q, [dst, -16]
> +	stp	C_q, D_q, [dstend, -64]
> +	stp	E_q, F_q, [dstend, -32]
>   	ret
>
> Surely this could use str	G_q, [dst, tmp1] if we change the writeback on the stp?
Done.

> 4. Unrolling can be more efficient:
>
>   L(loop128):
> +	ldp	F_q, G_q, [src], #32
> +	ldp	H_q, I_q, [src], #32
> +	stp	B_q, C_q, [dst], #32
> +	stp	D_q, E_q, [dst], #32
> +	subs	count, count, 64
> +	b.lt	L(loop128_exit1)
> +	ldp	B_q, C_q, [src], #32
> +	ldp	D_q, E_q, [src], #32
> +	stp	F_q, G_q, [dst], #32
> +	stp	H_q, I_q, [dst], #32
> +	subs	count, count, 64
> +	b.ge	L(loop128)
> +L(loop128_exit0):
> The idea of unrolling 2x is to only have a single loop branch. Using a single branch
> makes it easier to remove all the writebacks too - you only need 2 rather than 8!
The idea was to minimize the length of the loop tail. With a branchless
loop that copies 128 bytes per iteration, the branchless tail has to read
128 bytes. With a branched loop like the one above, the tail processes
only 64 bytes. And I don't see how I can avoid writebacks of up to 127
bytes in the branchless tails for your version.
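
As a rough illustration of the tail-length argument above, here is a
simplified C model. It is not code from the patch: the function name and
the block parameter are made up for the example, and it assumes
non-overlapping buffers and n >= block. The loop copies whole blocks, and
a fixed-size tail addressed from the end finishes the copy, possibly
rewriting bytes the loop has already stored. The return value counts
those rewritten bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model, not the patch itself.
   Copies n bytes (n >= block, buffers must not overlap).
   The loop handles whole blocks while more than one block remains;
   the tail then copies the last `block` bytes addressed from the end.
   Returns how many bytes the tail stored a second time. */
static size_t copy_with_end_tail(unsigned char *dst, const unsigned char *src,
                                 size_t n, size_t block)
{
    size_t done = 0;

    /* Main loop: one exit check per `block` bytes. */
    while (n - done > block) {
        memcpy(dst + done, src + done, block);
        done += block;
    }

    /* Tail: always exactly `block` bytes, ending at dst + n. */
    memcpy(dst + n - block, src + n - block, block);

    /* Bytes in [n - block, done) were written by both loop and tail. */
    return done + block - n;
}
```

With block = 64 (an exit check every 64 bytes, as in the branched loop)
the tail re-stores at most 63 bytes. With block = 128 (a single check
per 128 bytes) it can re-store up to 127. For example, for n = 130 the
model returns 62 in the first case and 126 in the second.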
