aarch64: optimize the unaligned case of memcmp
Wilco Dijkstra <Wilco dot Dijkstra at arm dot com> wrote:
> Huge memcmp's are very rare, so like most string functions the small case
> is more important than the large case.
Agreed.
> So I think you could just skip the complex
> alignment code and get better results in real applications.
If I remove all the alignment code, I get lower performance on the HiKey
A53 board.
With this patch:
@@ -142,9 +143,23 @@ ENTRY(memcmp)
.p2align 6
.Lmisaligned8:
+
+ cmp limit, #8
+ b.lo .LmisalignedLt8
+
+ .p2align 5
+.Lloop_part_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ subs limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ cbnz diff, .Lnot_limit
+ b.ne .Lloop_part_aligned
+
+.LmisalignedLt8:
sub limit, limit, #1
1:
- /* Perhaps we can do better than this. */
ldrb data1w, [src1], #1
ldrb data2w, [src2], #1
subs limit, limit, #1
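The idea of the patched path can be sketched in C (a rough, hypothetical sketch, not the real implementation; `memcmp_unaligned_sketch` is an illustrative name, and `memcpy` stands in for the unaligned `ldr` loads, which AArch64 permits directly in the assembly above):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C sketch of the patched unaligned path.  The asm issues
   ldr on unaligned addresses directly (AArch64 allows this); memcpy is
   the portable C way to express the same unaligned 8-byte load. */
int memcmp_unaligned_sketch(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1;
    const unsigned char *p2 = s2;

    /* 8-bytes-at-a-time loop, analogous to .Lloop_part_aligned. */
    while (n >= 8) {
        uint64_t d1, d2;
        memcpy(&d1, p1, 8);
        memcpy(&d2, p2, 8);
        if (d1 != d2)          /* eor + cbnz: a difference was found */
            break;             /* let the byte loop pinpoint it */
        p1 += 8;
        p2 += 8;
        n -= 8;
    }

    /* Byte-by-byte tail, analogous to .LmisalignedLt8. */
    while (n--) {
        int c1 = *p1++;
        int c2 = *p2++;
        if (c1 != c2)
            return c1 - c2;
    }
    return 0;
}
```

The point of the change is visible here: buffers of 8 bytes or more get compared a word at a time even when misaligned, and only the tail (or the word that actually differs) falls through to the byte loop.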
Benchmark                              Time           CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8         497 ns        497 ns     1324622   15.3553MB/s
BM_string_memcmp_unaligned/64        884 ns        883 ns      875735   69.0874MB/s
BM_string_memcmp_unaligned/512      3061 ns       3061 ns      284222   159.507MB/s
BM_string_memcmp_unaligned/1024     4185 ns       4185 ns      167230   233.341MB/s
BM_string_memcmp_unaligned/8k      32286 ns      32282 ns       21677   242.006MB/s
BM_string_memcmp_unaligned/16k     65948 ns      65948 ns       10599    236.93MB/s
BM_string_memcmp_unaligned/32k    131212 ns     131203 ns        5199    238.18MB/s
BM_string_memcmp_unaligned/64k    281103 ns     281033 ns        2490   222.394MB/s
With the patch I submitted for review yesterday, I was getting
consistently better numbers:
BM_string_memcmp_unaligned/8         287 ns        287 ns     2441787   26.6141MB/s
BM_string_memcmp_unaligned/64        556 ns        556 ns     1257709   109.764MB/s
BM_string_memcmp_unaligned/512      2167 ns       2166 ns      323159   225.443MB/s
BM_string_memcmp_unaligned/1024     4041 ns       4039 ns      173282   241.797MB/s
BM_string_memcmp_unaligned/8k      32234 ns      32221 ns       21645   242.464MB/s
BM_string_memcmp_unaligned/16k     65715 ns      65684 ns       10573   237.882MB/s
BM_string_memcmp_unaligned/32k    133390 ns     133348 ns        5350   234.349MB/s
BM_string_memcmp_unaligned/64k    264506 ns     264401 ns        2644   236.383MB/s
> Btw why do the max
> alignment thing? - that's a lot of code for very little benefit...
Agreed, I get consistently better performance without the max computation:
Benchmark                              Time           CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8         282 ns        282 ns     2482953    27.062MB/s
BM_string_memcmp_unaligned/64        542 ns        541 ns     1298317    112.77MB/s
BM_string_memcmp_unaligned/512      2152 ns       2152 ns      325267   226.915MB/s
BM_string_memcmp_unaligned/1024     4025 ns       4025 ns      173904   242.622MB/s
BM_string_memcmp_unaligned/8k      32276 ns      32271 ns       21818    242.09MB/s
BM_string_memcmp_unaligned/16k     65970 ns      65970 ns       10554   236.851MB/s
BM_string_memcmp_unaligned/32k    131241 ns     131242 ns        5129    238.11MB/s
BM_string_memcmp_unaligned/64k    266159 ns     266160 ns        2661   234.821MB/s
The numbers are on the same HiKey A53 machine with this patch on top of
yesterday's patch:
Benchmark                              Time           CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8         282 ns        282 ns     2482524   27.0603MB/s
BM_string_memcmp_unaligned/64        542 ns        542 ns     1292480   112.677MB/s
BM_string_memcmp_unaligned/512      2151 ns       2151 ns      325450   227.021MB/s
BM_string_memcmp_unaligned/1024     4026 ns       4025 ns      172599   242.643MB/s
BM_string_memcmp_unaligned/8k      32251 ns      32239 ns       21508   242.332MB/s
BM_string_memcmp_unaligned/16k     65828 ns      65800 ns       10649    237.46MB/s
BM_string_memcmp_unaligned/32k    131058 ns     131002 ns        5254   238.545MB/s
BM_string_memcmp_unaligned/64k    265991 ns     265792 ns        2651   235.146MB/s
Thanks Wilco for your feedback,
Sebastian
@@ -161,12 +161,7 @@ ENTRY(memcmp)
and tmp1, src1, #0x7
orr tmp3, xzr, #0x8
- and tmp2, src2, #0x7
- sub tmp1, tmp3, tmp1
- sub tmp2, tmp3, tmp2
- cmp tmp1, tmp2
- /* Choose the maximum. */
- csel pos, tmp1, tmp2, hi
+ sub pos, tmp3, tmp1
/* Increment SRC pointers by POS so one of the SRC pointers is
word-aligned. */
add src1, src1, pos
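The two offset computations can be written out in C to show what the diff above removes (a sketch; the helper names are mine, and the constant 8 mirrors the "orr tmp3, xzr, #0x8" in the assembly):

```c
#include <stdint.h>

/* Simpler replacement: distance that makes src1 8-byte aligned
   ("sub pos, tmp3, tmp1" with tmp3 = 8). */
uint64_t pos_single(uintptr_t src1)
{
    return 8 - (src1 & 0x7);
}

/* Removed variant: the larger of the two distances, so whichever
   source is further from its next 8-byte boundary is the one that
   ends up aligned ("cmp tmp1, tmp2" + "csel pos, tmp1, tmp2, hi"). */
uint64_t pos_max(uintptr_t src1, uintptr_t src2)
{
    uint64_t t1 = 8 - (src1 & 0x7);
    uint64_t t2 = 8 - (src2 & 0x7);
    return t1 > t2 ? t1 : t2;
}
```

Either way, both source pointers are then advanced by pos, so exactly one of them lands on an 8-byte boundary; the max variant just spends an extra and/sub/cmp/csel to pick which one, which is the "lot of code for very little benefit" Wilco pointed out.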
And performance looks about the same when aligning to src2 instead
of src1.
With the extra patch:
@@ -159,7 +159,7 @@ ENTRY(memcmp)
/* Sources are not aligned align one of the sources find max offset
from aligned boundary. */
- and tmp1, src1, #0x7
+ and tmp1, src2, #0x7
orr tmp3, xzr, #0x8
sub pos, tmp3, tmp1