Patchwork aarch64: optimize the unaligned case of memcmp

Submitter Sebastian Pop
Date June 23, 2017, 7:52 p.m.
Message ID <637cf51c-160d-172f-6520-bba51058f85e@samsung.com>
Permalink /patch/21241/
State New

Comments

Sebastian Pop - June 23, 2017, 7:52 p.m.
Wilco Dijkstra <Wilco dot Dijkstra at arm dot com> wrote:
 > Huge memcmp's are very rare, so like most string functions the small case
 > is more important than the large case.

Agreed.

 > So I think you could just skip the complex
 > alignment code and get better results in real applications.

If I remove all the alignment code, I get lower performance on the HiKey A53 board.
With this patch:

@@ -142,9 +143,23 @@ ENTRY(memcmp)

         .p2align 6
  .Lmisaligned8:
+
+       cmp     limit, #8
+       b.lo    .LmisalignedLt8
+
+       .p2align 5
+.Lloop_part_aligned:
+       ldr     data1, [src1], #8
+       ldr     data2, [src2], #8
+       subs    limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+       eor     diff, data1, data2      /* Non-zero if differences found. */
+       cbnz    diff, .Lnot_limit
+       b.ne    .Lloop_part_aligned
+
+.LmisalignedLt8:
         sub     limit, limit, #1
  1:
-       /* Perhaps we can do better than this.  */
         ldrb    data1w, [src1], #1
         ldrb    data2w, [src2], #1
         subs    limit, limit, #1

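For reference, the .LmisalignedLt8 byte loop above behaves like the following C model (a sketch only; memcmp_bytewise is a hypothetical name, not part of bionic):

```c
#include <stddef.h>

/* Hypothetical C model of the byte-wise fallback loop used when fewer
 * than 8 bytes remain: compare one byte at a time and return the
 * difference of the first mismatching pair (0 if all bytes are equal). */
static int memcmp_bytewise(const unsigned char *s1,
                           const unsigned char *s2, size_t limit)
{
    while (limit != 0) {
        int d = (int)*s1++ - (int)*s2++;
        if (d != 0)
            return d;
        limit--;
    }
    return 0;
}
```

The point of the added "cmp limit, #8" check is that buffers of 8 or more bytes now go through the word-at-a-time loop first, so this byte loop no longer has to walk the whole buffer.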
Benchmark                                Time           CPU Iterations   Throughput
-----------------------------------------------------------------------------------
BM_string_memcmp_unaligned/8           497 ns        497 ns    1324622   15.3553MB/s
BM_string_memcmp_unaligned/64          884 ns        883 ns     875735   69.0874MB/s
BM_string_memcmp_unaligned/512        3061 ns       3061 ns     284222   159.507MB/s
BM_string_memcmp_unaligned/1024       4185 ns       4185 ns     167230   233.341MB/s
BM_string_memcmp_unaligned/8k        32286 ns      32282 ns      21677   242.006MB/s
BM_string_memcmp_unaligned/16k       65948 ns      65948 ns      10599   236.93MB/s
BM_string_memcmp_unaligned/32k      131212 ns     131203 ns       5199   238.18MB/s
BM_string_memcmp_unaligned/64k      281103 ns     281033 ns       2490   222.394MB/s

With the patch I submitted for review yesterday, I was getting consistently better numbers:
Benchmark                                Time           CPU Iterations   Throughput
-----------------------------------------------------------------------------------
BM_string_memcmp_unaligned/8           287 ns        287 ns    2441787   26.6141MB/s
BM_string_memcmp_unaligned/64          556 ns        556 ns    1257709   109.764MB/s
BM_string_memcmp_unaligned/512        2167 ns       2166 ns     323159   225.443MB/s
BM_string_memcmp_unaligned/1024       4041 ns       4039 ns     173282   241.797MB/s
BM_string_memcmp_unaligned/8k        32234 ns      32221 ns      21645   242.464MB/s
BM_string_memcmp_unaligned/16k       65715 ns      65684 ns      10573   237.882MB/s
BM_string_memcmp_unaligned/32k      133390 ns     133348 ns       5350   234.349MB/s
BM_string_memcmp_unaligned/64k      264506 ns     264401 ns       2644   236.383MB/s

 > Btw why do the max
 > alignment thing? - that's a lot of code for very little benefit...

Agreed, I get consistently better performance without the max computation:
Benchmark                                Time           CPU Iterations   Throughput
-----------------------------------------------------------------------------------
BM_string_memcmp_unaligned/8           282 ns        282 ns    2482953   27.062MB/s
BM_string_memcmp_unaligned/64          542 ns        541 ns    1298317   112.77MB/s
BM_string_memcmp_unaligned/512        2152 ns       2152 ns     325267   226.915MB/s
BM_string_memcmp_unaligned/1024       4025 ns       4025 ns     173904   242.622MB/s
BM_string_memcmp_unaligned/8k        32276 ns      32271 ns      21818   242.09MB/s
BM_string_memcmp_unaligned/16k       65970 ns      65970 ns      10554   236.851MB/s
BM_string_memcmp_unaligned/32k      131241 ns     131242 ns       5129   238.11MB/s
BM_string_memcmp_unaligned/64k      266159 ns     266160 ns       2661   234.821MB/s

The numbers are from the same HiKey A53 machine, with this patch on top of yesterday's patch:

Benchmark                                Time           CPU Iterations   Throughput
-----------------------------------------------------------------------------------
BM_string_memcmp_unaligned/8           282 ns        282 ns    2482524   27.0603MB/s
BM_string_memcmp_unaligned/64          542 ns        542 ns    1292480   112.677MB/s
BM_string_memcmp_unaligned/512        2151 ns       2151 ns     325450   227.021MB/s
BM_string_memcmp_unaligned/1024       4026 ns       4025 ns     172599   242.643MB/s
BM_string_memcmp_unaligned/8k        32251 ns      32239 ns      21508   242.332MB/s
BM_string_memcmp_unaligned/16k       65828 ns      65800 ns      10649   237.46MB/s
BM_string_memcmp_unaligned/32k      131058 ns     131002 ns       5254   238.545MB/s
BM_string_memcmp_unaligned/64k      265991 ns     265792 ns       2651   235.146MB/s

Thanks, Wilco, for your feedback.
Sebastian

Patch

--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -161,12 +161,7 @@  ENTRY(memcmp)

         and     tmp1, src1, #0x7
         orr     tmp3, xzr, #0x8
-       and     tmp2, src2, #0x7
-       sub     tmp1, tmp3, tmp1
-       sub     tmp2, tmp3, tmp2
-       cmp     tmp1, tmp2
-       /* Choose the maximum. */
-       csel    pos, tmp1, tmp2, hi
+       sub     pos, tmp3, tmp1

        /* Increment SRC pointers by POS so one of the SRC pointers is word-aligned. */
         add     src1, src1, pos

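In C terms, the removed maximum computation and its replacement look roughly like this (pos_before and pos_after are hypothetical names for illustration):

```c
#include <stdint.h>

/* Before the patch: advance by the larger of the two distances to the
 * next 8-byte boundary (this is what the csel's "choose the maximum"
 * computes), which aligns whichever pointer is less misaligned. */
static uint64_t pos_before(uint64_t src1, uint64_t src2)
{
    uint64_t tmp1 = 8 - (src1 & 0x7);
    uint64_t tmp2 = 8 - (src2 & 0x7);
    return tmp1 > tmp2 ? tmp1 : tmp2;
}

/* After the patch: only src1's misalignment matters, so src1 is always
 * 8-byte aligned after adding pos.  (When src1 is already aligned, pos
 * is 8 rather than 0, which simply consumes one full word first.) */
static uint64_t pos_after(uint64_t src1)
{
    return 8 - (src1 & 0x7);
}
```

Since both pointers advance by the same pos, at most one of them can be made word-aligned anyway; picking src1 unconditionally removes four instructions from the alignment path.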
And the performance looks about the same when aligning to src2 instead of src1.
With the extra patch:

--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@  ENTRY(memcmp)
         /* Sources are not aligned align one of the sources find max offset
            from aligned boundary. */

-       and     tmp1, src1, #0x7
+       and     tmp1, src2, #0x7
         orr     tmp3, xzr, #0x8
         sub     pos, tmp3, tmp1