[0/5] Added optimized memcpy/memmove/memset for A64FX

Message ID VE1PR08MB559991EE24FFB21C1258B71783709@VE1PR08MB5599.eurprd08.prod.outlook.com

Wilco Dijkstra April 12, 2021, 12:52 p.m. UTC
  Hi,

I have a few comments about memcpy design (the principles apply equally to memset):

1. Overall the code is too large due to enormous unroll factors

Our current memcpy is about 300 bytes (and that includes memmove); this memcpy is ~12 times larger!
This hurts performance because the code no longer fits in the I-cache for common copies.
On a modern OoO core you need very little unrolling, since ALU operations and branches
become essentially free while the CPU executes the loads and stores. So rather than unrolling
by 32-64 times, try 4 times - you just need enough to hide the taken-branch latency.
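To illustrate the unroll factor I have in mind, here is a hedged C sketch (the function name, the 16-byte chunk size, and the omitted alignment handling are my choices, not anything from the patch):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a bulk copy loop unrolled only 4x: four
   16-byte accesses (64 bytes) per iteration, which on an OoO core is
   already enough to hide the taken-branch latency.  In AArch64
   assembly each 16-byte memcpy would be an LDP/STP of Q registers.
   Alignment handling is omitted for brevity.  */
static void
copy_loop_4x (char *dst, const char *src, size_t n)
{
  while (n >= 64)
    {
      memcpy (dst,      src,      16);
      memcpy (dst + 16, src + 16, 16);
      memcpy (dst + 32, src + 32, 16);
      memcpy (dst + 48, src + 48, 16);
      dst += 64;
      src += 64;
      n -= 64;
    }
  memcpy (dst, src, n);   /* Tail, simplified.  */
}
```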

2. I don't see any special handling for small copies

Even if you want to hyper-optimize gigabyte-sized copies, small copies are still extremely common,
so you always want to handle those as quickly (and with as little code) as possible. Special-casing
small copies does not slow down the huge copies - the reverse is more likely, since the bulk path no
longer needs to handle the small cases.
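The standard trick is a pair of overlapping loads and stores that together cover the whole range, so each size class costs only a couple of branches. A hedged C sketch (names and thresholds are mine; larger sizes would fall through to the bulk loop, not shown):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copies of 1..16 bytes handled up front with two
   overlapping accesses covering [0, n), so there is no per-byte loop
   and no extra work on the bulk path.  */
static void
copy_small (void *dstv, const void *srcv, size_t n)
{
  char *dst = dstv;
  const char *src = srcv;
  if (n >= 8)           /* 8..16 bytes: two overlapping 8-byte accesses.  */
    {
      uint64_t a, b;
      memcpy (&a, src, 8);
      memcpy (&b, src + n - 8, 8);
      memcpy (dst, &a, 8);
      memcpy (dst + n - 8, &b, 8);
    }
  else if (n >= 4)      /* 4..7 bytes: two overlapping 4-byte accesses.  */
    {
      uint32_t a, b;
      memcpy (&a, src, 4);
      memcpy (&b, src + n - 4, 4);
      memcpy (dst, &a, 4);
      memcpy (dst + n - 4, &b, 4);
    }
  else if (n > 0)       /* 1..3 bytes.  */
    {
      char first = src[0], last = src[n - 1];
      char mid = src[n / 2];
      dst[0] = first;
      dst[n / 2] = mid;
      dst[n - 1] = last;
    }
}
```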

3. Check whether using SVE helps small/medium copies

Run memcpy-random benchmark to see whether it is faster to use SVE for small cases or just the SIMD
copy on your uarch.

4. Avoid making the code too general or too specialized

I see both appearing in the code - trying to deal with different cacheline sizes and different vector lengths,
and also splitting these out into separate cases. If you depend on a particular cacheline size, specialize
the code for that and check the size in the ifunc selector (as various memsets do already). If you want to
handle multiple vector sizes, just use a register for the increment rather than repeating the same code
several times for each vector length.
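The "register for the increment" idea, sketched in C (hypothetical illustration; on SVE the increment would come from CNTB and the tail would be a predicated copy rather than a scalar fallback):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: one copy loop parameterized by the runtime
   vector length vl (what CNTB returns on SVE), instead of duplicating
   the whole loop body once per supported vector size.  */
static void
copy_any_vl (char *dst, const char *src, size_t n, size_t vl)
{
  size_t i = 0;
  for (; i + vl <= n; i += vl)
    memcpy (dst + i, src + i, vl);   /* Whole-vector LD1B/ST1B in SVE.  */
  memcpy (dst + i, src + i, n - i);  /* Tail: predicated in real SVE.  */
}
```

The same object code then works unchanged on 128-, 256- and 512-bit SVE implementations.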

5. Odd prefetches

I have a hard time believing that first prefetching the data to be written, then clearing it using DC ZVA (???),
then prefetching the same data a second time, before finally writing the loaded data, helps performance...
Generally hardware prefetchers are able to do exactly the right thing, since memcpy is trivial to prefetch.
So what is the performance gain of each prefetch/clear step? And what is the difference between memcpy
and memmove performance (given that memmove doesn't do any of this)?

Cheers,
Wilco