aarch64: optimize the unaligned case of memcmp

Message ID AM5PR0802MB26106339AAEF3DABB5ACE56F83D80@AM5PR0802MB2610.eurprd08.prod.outlook.com
State Not applicable

Commit Message

Wilco Dijkstra June 23, 2017, 9:28 p.m. UTC
  Sebastian Pop wrote:

> If I remove all the alignment code, I get less performance on the hikey 
> A53 board.
> With this patch:

@@ -142,9 +143,23 @@ ENTRY(memcmp)

         .p2align 6
  .Lmisaligned8:
+
+       cmp     limit, #8
+       b.lo    .LmisalignedLt8
+
+       .p2align 5
+.Lloop_part_aligned:
+       ldr     data1, [src1], #8
+       ldr     data2, [src2], #8
+       subs    limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+       eor     diff, data1, data2      /* Non-zero if differences found. */
+       cbnz    diff, .Lnot_limit
+       b.ne    .Lloop_part_aligned
+
+.LmisalignedLt8:
         sub     limit, limit, #1
  1:
-       /* Perhaps we can do better than this.  */
         ldrb    data1w, [src1], #1
         ldrb    data2w, [src2], #1
         subs    limit, limit, #1

Where is the setup of limit_wd and limit???

I would expect the small cases to be faster since you avoid around 10 cycles of mostly
ALU ops that make very little progress. So it should take several iterations with an extra
unaligned access before you're worse off. In memcpy (which is similar in that it handles
2 streams) I align after 96 bytes.
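
As a rough sketch of that heuristic (the threshold and label here are illustrative, not the actual memcpy code):

        cmp     limit, #96              /* Enough work to amortize aligning?  */
        b.lo    .Lmisaligned_small      /* Hypothetical label: stay unaligned.  */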

> With the extra patch:

Note it's more readable to write mov tmp3, #8. However, it's even better to use a
writeback of 8 in the unaligned loads and then subtract tmp1 from src1 and src2 -
this saves 2 instructions.
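
Concretely, that suggestion is something like this (a sketch, assuming tmp1 holds the misalignment as in the patch under review; not the committed code):

        ldr     data1, [src1], #8       /* Post-index writeback: src1 += 8.  */
        ldr     data2, [src2], #8       /* Likewise src2 += 8.  */
        sub     src1, src1, tmp1        /* Net effect: src1 += 8 - tmp1.  */
        sub     src2, src2, tmp1        /* No mov/sub needed to compute pos.  */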

Wilco
  

Comments

Sebastian Pop June 26, 2017, 6:16 p.m. UTC | #1
On 06/23/2017 04:28 PM, Wilco Dijkstra wrote:
> Sebastian Pop wrote:
>
>> If I remove all the alignment code, I get less performance on the hikey
>> A53 board.
>> With this patch:
> @@ -142,9 +143,23 @@ ENTRY(memcmp)
>
>           .p2align 6
>    .Lmisaligned8:
> +
> +       cmp     limit, #8
> +       b.lo    .LmisalignedLt8
> +
> +       .p2align 5
> +.Lloop_part_aligned:
> +       ldr     data1, [src1], #8
> +       ldr     data2, [src2], #8
> +       subs    limit_wd, limit_wd, #1
> +.Lstart_part_realigned:
> +       eor     diff, data1, data2      /* Non-zero if differences found. */
> +       cbnz    diff, .Lnot_limit
> +       b.ne    .Lloop_part_aligned
> +
> +.LmisalignedLt8:
>           sub     limit, limit, #1
>    1:
> -       /* Perhaps we can do better than this.  */
>           ldrb    data1w, [src1], #1
>           ldrb    data2w, [src2], #1
>           subs    limit, limit, #1
>
> Where is the setup of limit_wd and limit???

You are right, my patch was not quite correct: I was missing the 
initialization of limit_wd, like so:

lsr     limit_wd, limit, #3

limit is the number of bytes to be compared, passed in as a parameter to
memcmp.

With this extra instruction I am still seeing the same low performance on
the A53 hikey:

Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8         345 ns        345 ns    2026483   22.0879MB/s
BM_string_memcmp_unaligned/16        539 ns        539 ns    1298687   28.3159MB/s
BM_string_memcmp_unaligned/20        613 ns        613 ns    1142076   31.1222MB/s
BM_string_memcmp_unaligned/30        794 ns        794 ns     881596   36.0357MB/s
BM_string_memcmp_unaligned/42        957 ns        957 ns     731746   41.8753MB/s
BM_string_memcmp_unaligned/55       1208 ns       1207 ns     579591   43.4525MB/s
BM_string_memcmp_unaligned/60       1231 ns       1231 ns     568372   46.4756MB/s
BM_string_memcmp_unaligned/64       1312 ns       1312 ns     475862   46.5316MB/s

The base is with no patch applied to memcmp.S: (byte by byte memcmp)

Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8         339 ns        339 ns    2066820   22.5274MB/s
BM_string_memcmp_unaligned/16        536 ns        536 ns    1306265   28.4901MB/s
BM_string_memcmp_unaligned/20        612 ns        612 ns    1146573   31.1479MB/s
BM_string_memcmp_unaligned/30        789 ns        789 ns     886755   36.2472MB/s
BM_string_memcmp_unaligned/42       1009 ns       1009 ns     693760      39.7MB/s
BM_string_memcmp_unaligned/55       1233 ns       1233 ns     567719   42.5469MB/s
BM_string_memcmp_unaligned/60       1322 ns       1322 ns     529511   43.2804MB/s
BM_string_memcmp_unaligned/64       1392 ns       1392 ns     502817   43.8426MB/s

And with the patch submitted for review (aligning on src1, without computing the
max offset):

Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8         282 ns        282 ns    2482713    27.061MB/s
BM_string_memcmp_unaligned/16        304 ns        304 ns    2300275   50.1401MB/s
BM_string_memcmp_unaligned/20        322 ns        322 ns    2176437   59.2469MB/s
BM_string_memcmp_unaligned/30        352 ns        352 ns    1988315    81.328MB/s
BM_string_memcmp_unaligned/42        412 ns        412 ns    1699818    97.317MB/s
BM_string_memcmp_unaligned/55        503 ns        503 ns    1393029   104.382MB/s
BM_string_memcmp_unaligned/60        522 ns        522 ns    1340682   109.619MB/s
BM_string_memcmp_unaligned/64        541 ns        541 ns    1297637   112.891MB/s
  
Wilco Dijkstra June 26, 2017, 7 p.m. UTC | #2
Sebastian Pop wrote:
> On 06/23/2017 04:28 PM, Wilco Dijkstra wrote:
>
> > Where is the setup of limit_wd and limit???
>
> You are right, my patch was not quite correct: I was missing the 
> initialization of limit_wd, like so:
>
> lsr     limit_wd, limit, #3
>
> limit is the number of bytes to be compared passed in as a parameter to 
> memcmp.

You're still missing the setting of limit. Your current version will compare words
up to limit - (limit & 7), and then compare byte by byte using the original value
of limit, so it reads well outside its bounds...
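
For the record, the complete setup before the word loop would be something like this (a sketch using the register names from the quoted diff):

        lsr     limit_wd, limit, #3     /* Number of whole words to compare.  */
        and     limit, limit, #7        /* Bytes left over for the byte loop.  */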

Wilco
  
Sebastian Pop June 26, 2017, 7:47 p.m. UTC | #3
On Mon, Jun 26, 2017 at 2:00 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Sebastian Pop wrote:
>> On 06/23/2017 04:28 PM, Wilco Dijkstra wrote:
>>
>> > Where is the setup of limit_wd and limit???
>>
>> You are right, my patch was not quite correct: I was missing the
>> initialization of limit_wd, like so:
>>
>> lsr     limit_wd, limit, #3
>>
>> limit is the number of bytes to be compared passed in as a parameter to
>> memcmp.
>
> You're still missing the setting of limit. Your current version will compare
> words up to limit - (limit & 7), and then compare byte by byte using the
> original value of limit, so it reads well outside its bounds...

You are right, I was missing the "and     limit, limit, #7". With that added,
the performance differs from the byte-by-byte memcmp.
Still, the performance is lower than with aligning src1:

Benchmark                              Time           CPU Iterations
--------------------------------------------------------------------
BM_string_memcmp_unaligned/8        1288 ns       1288 ns     540945   5.92208MB/s
BM_string_memcmp_unaligned/16       1303 ns       1303 ns     537143   11.7123MB/s
BM_string_memcmp_unaligned/20        341 ns        341 ns    2064228   55.9994MB/s
BM_string_memcmp_unaligned/30        405 ns        405 ns    1726750   70.5799MB/s
BM_string_memcmp_unaligned/42        405 ns        405 ns    1728170   98.8833MB/s
BM_string_memcmp_unaligned/55        563 ns        562 ns    1239350   93.2568MB/s
BM_string_memcmp_unaligned/60        539 ns        539 ns    1298194   106.109MB/s
BM_string_memcmp_unaligned/64       2378 ns       2378 ns     359461   25.6695MB/s

And for larger data sets the performance is still lower than when aligning src1:

Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8          1288 ns       1288 ns     543230    5.9221MB/s
BM_string_memcmp_unaligned/64         2377 ns       2377 ns     359351   25.6742MB/s
BM_string_memcmp_unaligned/512        6444 ns       6444 ns     184103   75.7774MB/s
BM_string_memcmp_unaligned/1024       4869 ns       4868 ns     143785   200.599MB/s
BM_string_memcmp_unaligned/8k        33090 ns      33089 ns      21279   236.107MB/s
BM_string_memcmp_unaligned/16k       66748 ns      66738 ns      10436   234.123MB/s
BM_string_memcmp_unaligned/32k      131781 ns     131775 ns       5106   237.147MB/s
BM_string_memcmp_unaligned/64k      291907 ns     291860 ns       2334   214.143MB/s
  
Wilco Dijkstra June 26, 2017, 8:17 p.m. UTC | #4
Sebastian Pop wrote:
>
> And for larger data sets the performance is still lower than when aligning src1:

Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp_unaligned/8          1288 ns       1288 ns     543230    5.9221MB/s
BM_string_memcmp_unaligned/64         2377 ns       2377 ns     359351   25.6742MB/s
BM_string_memcmp_unaligned/512        6444 ns       6444 ns     184103   75.7774MB/s
BM_string_memcmp_unaligned/1024       4869 ns       4868 ns     143785   200.599MB/s
BM_string_memcmp_unaligned/8k        33090 ns      33089 ns      21279   236.107MB/s
BM_string_memcmp_unaligned/16k       66748 ns      66738 ns      10436   234.123MB/s
BM_string_memcmp_unaligned/32k      131781 ns     131775 ns       5106   237.147MB/s
BM_string_memcmp_unaligned/64k      291907 ns     291860 ns       2334   214.143MB/s

These numbers still don't make any sense: the first few results are now many times slower
than the byte-by-byte version (as in your initial mail):

BM_string_memcmp_unaligned/8           339 ns        339 ns    2070998   22.5302MB/s
BM_string_memcmp_unaligned/64         1392 ns       1392 ns     502796   43.8454MB/s
BM_string_memcmp_unaligned/512        9194 ns       9194 ns      76133   53.1104MB/s
BM_string_memcmp_unaligned/1024      18325 ns      18323 ns      38206   53.2963MB/s
BM_string_memcmp_unaligned/8k       148579 ns     148574 ns       4713   52.5831MB/s
BM_string_memcmp_unaligned/16k      298169 ns     298120 ns       2344   52.4118MB/s
BM_string_memcmp_unaligned/32k      598813 ns     598797 ns       1085    52.188MB/s
BM_string_memcmp_unaligned/64k     1196079 ns    1196083 ns        540   52.2539MB/s

Wilco
  
Wilco Dijkstra June 27, 2017, 3:11 p.m. UTC | #5
Hi,

So I had a look at it using the GLIBC bench-memcmp.c. I quickly got the
unaligned loop to go faster than the aligned one using ccmp, so I had to
tune the aligned loop too... This uses a trick similar to the one in your byte
loop to remove a branch and 1-2 ALU operations per iteration.

This gives a 24% speedup on both Cortex-A53 and Cortex-A72 for
the aligned loop, and about 18% for the unaligned loop on top of your
patch. Aligning either src1 or src2 appears best as there isn't enough
work in the loops to hide an unaligned access.
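
The shape of such a loop, as a sketch (register names follow the quoted diffs; the label is hypothetical - see the newlib patch for the exact code):

        .p2align 4
.Lloop8:
        ldr     data1, [src1], #8
        ldr     data2, [src2], #8
        subs    limit, limit, #8
        ccmp    data1, data2, #0, hi    /* If bytes remain (hi), compare the
                                           data; otherwise force NZCV to 0 so
                                           that eq fails and the loop exits.  */
        b.eq    .Lloop8                 /* One branch tests both conditions.  */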

Wilco
  
Sebastian Pop June 27, 2017, 3:25 p.m. UTC | #6
On Tue, Jun 27, 2017 at 10:11 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Hi,
>
> So I had a look at it using the GLIBC bench-memcmp.c. I quickly got the
> unaligned loop to go faster than the aligned one using ccmp, so I had to
> tune the aligned loop too... This uses a trick similar to the one in your byte
> loop to remove a branch and 1-2 ALU operations per iteration.
>
> This gives a 24% speedup on both Cortex-A53 and Cortex-A72 for
> the aligned loop, and about 18% for the unaligned loop on top of your
> patch. Aligning either src1 or src2 appears best as there isn't enough
> work in the loops to hide an unaligned access.

This looks good.  Could you please send a patch?

Thanks,
Sebastian
  
Wilco Dijkstra June 29, 2017, 2:47 p.m. UTC | #7
Sebastian Pop <sebpop@gmail.com> wrote:
>
> This looks good.  Could you please send a patch?

This is the newlib version: https://sourceware.org/ml/newlib/2017/msg00524.html -
maybe you could run your benchmark to verify it's faster?

Wilco
  
Sebastian Pop June 29, 2017, 5:09 p.m. UTC | #8
Hi Wilco,

On Thu, Jun 29, 2017 at 9:47 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> This is the newlib version: https://sourceware.org/ml/newlib/2017/msg00524.html -
> maybe you could run your benchmark to verify it's faster?

Yes, your patch is faster than the patch that I submitted.
Here are the numbers from the bionic-benchmarks with your patch:

Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                      27 ns         27 ns   26198166   286.069MB/s
BM_string_memcmp/64                     45 ns         45 ns   15553753   1.32443GB/s
BM_string_memcmp/512                   242 ns        242 ns    2892423   1.97049GB/s
BM_string_memcmp/1024                  455 ns        455 ns    1537290   2.09436GB/s
BM_string_memcmp/8k                   3446 ns       3446 ns     203295   2.21392GB/s
BM_string_memcmp/16k                  7567 ns       7567 ns      92582   2.01657GB/s
BM_string_memcmp/32k                 16081 ns      16081 ns      43524    1.8977GB/s
BM_string_memcmp/64k                 31029 ns      31028 ns      22565   1.96712GB/s
BM_string_memcmp_aligned/8             184 ns        184 ns    3800912   41.3654MB/s
BM_string_memcmp_aligned/64            287 ns        287 ns    2438835    212.65MB/s
BM_string_memcmp_aligned/512          1370 ns       1370 ns     511014   356.498MB/s
BM_string_memcmp_aligned/1024         2543 ns       2543 ns     275253   384.006MB/s
BM_string_memcmp_aligned/8k          20413 ns      20411 ns      34306   382.764MB/s
BM_string_memcmp_aligned/16k         42908 ns      42907 ns      16132   364.158MB/s
BM_string_memcmp_aligned/32k         88902 ns      88886 ns       8087   351.574MB/s
BM_string_memcmp_aligned/64k        173016 ns     173007 ns       4122   361.258MB/s
BM_string_memcmp_unaligned/8           212 ns        212 ns    3304163   36.0243MB/s
BM_string_memcmp_unaligned/64          361 ns        361 ns    1941597   169.279MB/s
BM_string_memcmp_unaligned/512        1754 ns       1753 ns     399210   278.492MB/s
BM_string_memcmp_unaligned/1024       3308 ns       3308 ns     211622   295.243MB/s
BM_string_memcmp_unaligned/8k        27227 ns      27225 ns      25637   286.964MB/s
BM_string_memcmp_unaligned/16k       55877 ns      55874 ns      12455   279.645MB/s
BM_string_memcmp_unaligned/32k      112397 ns     112366 ns       6200    278.11MB/s
BM_string_memcmp_unaligned/64k      223493 ns     223482 ns       3127   279.665MB/s

And here are the numbers for the base (without your patch):

Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                      30 ns         30 ns   23092418   251.611MB/s
BM_string_memcmp/64                     58 ns         58 ns   12351397    1046.3MB/s
BM_string_memcmp/512                   305 ns        305 ns    2297189   1.56479GB/s
BM_string_memcmp/1024                  571 ns        571 ns    1225040   1.66906GB/s
BM_string_memcmp/8k                   4307 ns       4306 ns     162561   1.77175GB/s
BM_string_memcmp/16k                  9429 ns       9429 ns      74212   1.61826GB/s
BM_string_memcmp/32k                 19166 ns      19164 ns      36521   1.59247GB/s
BM_string_memcmp/64k                 37035 ns      37033 ns      18924   1.64811GB/s
BM_string_memcmp_aligned/8             199 ns        199 ns    3514227   38.3337MB/s
BM_string_memcmp_aligned/64            386 ns        386 ns    1811517   158.015MB/s
BM_string_memcmp_aligned/512          1731 ns       1730 ns     403250   282.163MB/s
BM_string_memcmp_aligned/1024         3198 ns       3198 ns     218758   305.354MB/s
BM_string_memcmp_aligned/8k          25217 ns      25214 ns      27483    309.85MB/s
BM_string_memcmp_aligned/16k         52015 ns      52013 ns      13527   300.407MB/s
BM_string_memcmp_aligned/32k        105034 ns     105034 ns       6689   297.522MB/s
BM_string_memcmp_aligned/64k        208464 ns     208465 ns       3427    299.81MB/s
BM_string_memcmp_unaligned/8           339 ns        339 ns    2066772    22.537MB/s
BM_string_memcmp_unaligned/64         1392 ns       1392 ns     502754   43.8398MB/s
BM_string_memcmp_unaligned/512        9194 ns       9194 ns      76133   53.1087MB/s
BM_string_memcmp_unaligned/1024      18326 ns      18324 ns      38209   53.2938MB/s
BM_string_memcmp_unaligned/8k       148591 ns     148585 ns       4713   52.5793MB/s
BM_string_memcmp_unaligned/16k      298256 ns     298207 ns       2343   52.3964MB/s
BM_string_memcmp_unaligned/32k      598855 ns     598828 ns       1085   52.1853MB/s
BM_string_memcmp_unaligned/64k     1196350 ns    1196355 ns        539    52.242MB/s

Thanks,
Sebastian
  
Sebastian Pop June 29, 2017, 5:15 p.m. UTC | #9
On Thu, Jun 29, 2017 at 12:09 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> Hi Wilco,
>
> On Thu, Jun 29, 2017 at 9:47 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>> This is the newlib version: https://sourceware.org/ml/newlib/2017/msg00524.html -
>> maybe you could run your benchmark to verify it's faster?
>
> Yes, your patch is faster than the patch that I submitted.

Would you like me to submit your patch to Bionic?

Thanks,
Sebastian
  
Sebastian Pop June 29, 2017, 5:55 p.m. UTC | #10
On Thu, Jun 29, 2017 at 12:15 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> On Thu, Jun 29, 2017 at 12:09 PM, Sebastian Pop <sebpop@gmail.com> wrote:
>> Hi Wilco,
>>
>> On Thu, Jun 29, 2017 at 9:47 AM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>> This is the newlib version: https://sourceware.org/ml/newlib/2017/msg00524.html -
>>> maybe you could run your benchmark to verify it's faster?
>>
>> Yes, your patch is faster than the patch that I submitted.
>
> Would you like me to submit your patch to Bionic?

I pushed your change as an update to the patch that I already
submitted for review in Bionic:
https://android-review.googlesource.com/418279

>
> Thanks,
> Sebastian
  
Wilco Dijkstra June 30, 2017, 2:06 p.m. UTC | #11
Sebastian Pop wrote:
>
> I pushed your change as an update to the patch that I already
> submitted for review in Bionic:
> https://android-review.googlesource.com/418279

Thanks, I see it has already been committed to newlib. So I'll post
the GLIBC patch soon.

There are probably some further tweaks possible, e.g. using wider
accesses and using unaligned accesses in the < 8 byte case, but
that requires a bit more work, so not for this GLIBC release...
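
The < 8 byte idea would be something like this for the 4..7 byte case (a sketch; the .Lreturn_diff label is hypothetical, and the byte-reverse/compare step that orders the first differing byte is omitted):

        ldr     data1w, [src1]          /* First 4 bytes of each buffer.  */
        ldr     data2w, [src2]
        eor     diff, data1, data2
        cbnz    diff, .Lreturn_diff
        sub     limit, limit, #4
        ldr     data1w, [src1, limit]   /* Last 4 bytes; may overlap the first.  */
        ldr     data2w, [src2, limit]
        eor     diff, data1, data2
        cbnz    diff, .Lreturn_diff
        mov     result, #0              /* All bytes equal.  */
        ret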

Wilco
  

Patch

--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@  ENTRY(memcmp)
        /* Sources are not aligned: align one of the sources and find the max
           offset from the aligned boundary. */

-       and     tmp1, src1, #0x7
+       and     tmp1, src2, #0x7
         orr     tmp3, xzr, #0x8
         sub     pos, tmp3, tmp1