From patchwork Fri Jun 23 19:52:24 2017
Subject: Re: [PATCH] aarch64: optimize the unaligned case of memcmp
From: Sebastian Pop
To: libc-alpha@sourceware.org, Wilco Dijkstra
Cc: Marcus.Shawcroft@arm.com, maxim.kuvyrkov@linaro.org,
 ramana.radhakrishnan@arm.com, ryan.arnold@linaro.org,
 adhemerval.zanella@linaro.org, sebpop@gmail.com
Date: Fri, 23 Jun 2017 14:52:24 -0500
Message-id: <637cf51c-160d-172f-6520-bba51058f85e@samsung.com>
In-reply-to: <1498174226-16525-1-git-send-email-s.pop@samsung.com>
References: <1498174226-16525-1-git-send-email-s.pop@samsung.com>

Wilco Dijkstra wrote:
> Huge memcmp's are very rare, so like most string functions the small case
> is more important than the large case.

Agreed.

> So I think you could just skip the complex
> alignment code and get better results in real applications.

If I remove all the alignment code, I get lower performance on the
HiKey A53 board.  With this patch:

@@ -142,9 +143,23 @@ ENTRY(memcmp)
     .p2align 6
 .Lmisaligned8:
+
+    cmp     limit, #8
+    b.lo    .LmisalignedLt8
+
+    .p2align 5
+.Lloop_part_aligned:
+    ldr     data1, [src1], #8
+    ldr     data2, [src2], #8
+    subs    limit_wd, limit_wd, #1
+.Lstart_part_realigned:
+    eor     diff, data1, data2  /* Non-zero if differences found.  */
+    cbnz    diff, .Lnot_limit
+    b.ne    .Lloop_part_aligned
+
+.LmisalignedLt8:
     sub     limit, limit, #1
 1:
-    /* Perhaps we can do better than this.  */
     ldrb    data1w, [src1], #1
     ldrb    data2w, [src2], #1
     subs    limit, limit, #1

Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_string_memcmp_unaligned/8       497 ns        497 ns    1324622   15.3553MB/s
BM_string_memcmp_unaligned/64      884 ns        883 ns     875735   69.0874MB/s
BM_string_memcmp_unaligned/512    3061 ns       3061 ns     284222   159.507MB/s
BM_string_memcmp_unaligned/1024   4185 ns       4185 ns     167230   233.341MB/s
BM_string_memcmp_unaligned/8k    32286 ns      32282 ns      21677   242.006MB/s
BM_string_memcmp_unaligned/16k   65948 ns      65948 ns      10599   236.93MB/s
BM_string_memcmp_unaligned/32k  131212 ns     131203 ns       5199   238.18MB/s
BM_string_memcmp_unaligned/64k  281103 ns     281033 ns       2490   222.394MB/s

With the patch I submitted for review yesterday I was getting
consistently better numbers:

BM_string_memcmp_unaligned/8       287 ns        287 ns    2441787   26.6141MB/s
BM_string_memcmp_unaligned/64      556 ns        556 ns    1257709   109.764MB/s
BM_string_memcmp_unaligned/512    2167 ns       2166 ns     323159   225.443MB/s
BM_string_memcmp_unaligned/1024   4041 ns       4039 ns     173282   241.797MB/s
BM_string_memcmp_unaligned/8k    32234 ns      32221 ns      21645   242.464MB/s
BM_string_memcmp_unaligned/16k   65715 ns      65684 ns      10573   237.882MB/s
BM_string_memcmp_unaligned/32k  133390 ns     133348 ns       5350   234.349MB/s
BM_string_memcmp_unaligned/64k  264506 ns     264401 ns       2644   236.383MB/s
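For reference, the word loop added in the hunk above behaves roughly
like the following C model (only a sketch with a hypothetical name,
not code from the patch): compare eight bytes per iteration with
unaligned loads, and let the byte loop either pinpoint the first
differing byte or finish the tail.

#include <stdint.h>
#include <string.h>

static int memcmp_model(const unsigned char *s1, const unsigned char *s2,
                        size_t limit)
{
    while (limit >= 8) {
        uint64_t d1, d2;
        memcpy(&d1, s1, 8);   /* unaligned 8-byte load, like ldr */
        memcpy(&d2, s2, 8);
        if (d1 != d2)         /* eor/cbnz: a difference is in these 8 bytes */
            break;            /* the byte loop below locates it */
        s1 += 8;
        s2 += 8;
        limit -= 8;
    }
    while (limit-- > 0) {     /* the .LmisalignedLt8 byte loop */
        int c1 = *s1++;
        int c2 = *s2++;
        if (c1 != c2)
            return c1 - c2;
    }
    return 0;
}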
> Btw why do the max
> alignment thing? - that's a lot of code for very little benefit...

Agreed, I get consistently better performance without the max
computation:

Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_string_memcmp_unaligned/8       282 ns        282 ns    2482953   27.062MB/s
BM_string_memcmp_unaligned/64      542 ns        541 ns    1298317   112.77MB/s
BM_string_memcmp_unaligned/512    2152 ns       2152 ns     325267   226.915MB/s
BM_string_memcmp_unaligned/1024   4025 ns       4025 ns     173904   242.622MB/s
BM_string_memcmp_unaligned/8k    32276 ns      32271 ns      21818   242.09MB/s
BM_string_memcmp_unaligned/16k   65970 ns      65970 ns      10554   236.851MB/s
BM_string_memcmp_unaligned/32k  131241 ns     131242 ns       5129   238.11MB/s
BM_string_memcmp_unaligned/64k  266159 ns     266160 ns       2661   234.821MB/s

The numbers are from the same A53 HiKey machine with this patch on top
of yesterday's patch:

Benchmark                           Time           CPU Iterations
------------------------------------------------------------------
BM_string_memcmp_unaligned/8       282 ns        282 ns    2482524   27.0603MB/s
BM_string_memcmp_unaligned/64      542 ns        542 ns    1292480   112.677MB/s
BM_string_memcmp_unaligned/512    2151 ns       2151 ns     325450   227.021MB/s
BM_string_memcmp_unaligned/1024   4026 ns       4025 ns     172599   242.643MB/s
BM_string_memcmp_unaligned/8k    32251 ns      32239 ns      21508   242.332MB/s
BM_string_memcmp_unaligned/16k   65828 ns      65800 ns      10649   237.46MB/s
BM_string_memcmp_unaligned/32k  131058 ns     131002 ns       5254   238.545MB/s
BM_string_memcmp_unaligned/64k  265991 ns     265792 ns       2651   235.146MB/s

Thanks Wilco for your feedback,

Sebastian

--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -161,12 +161,7 @@ ENTRY(memcmp)
     and     tmp1, src1, #0x7
     orr     tmp3, xzr, #0x8
-    and     tmp2, src2, #0x7
-    sub     tmp1, tmp3, tmp1
-    sub     tmp2, tmp3, tmp2
-    cmp     tmp1, tmp2
-    /* Choose the maximum.  */
-    csel    pos, tmp1, tmp2, hi
+    sub     pos, tmp3, tmp1
     /* Increment SRC pointers by POS so one of the SRC pointers is
        word-aligned.  */
     add     src1, src1, pos
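In C terms, the removed sequence and its replacement compute the
following (a sketch with hypothetical helper names, not code from the
patch).  Both source pointers are then advanced by pos, so the source
that was furthest from an 8-byte boundary (or always src1 in the
simplified version) becomes word-aligned:

#include <stddef.h>
#include <stdint.h>

/* Removed code: choose the larger distance to alignment.  */
static size_t pos_max(uintptr_t src1, uintptr_t src2)
{
    size_t tmp1 = 8 - (src1 & 0x7);   /* bytes until src1 is aligned */
    size_t tmp2 = 8 - (src2 & 0x7);   /* bytes until src2 is aligned */
    return tmp1 > tmp2 ? tmp1 : tmp2; /* cmp/csel ..., hi */
}

/* Simplified code: always align src1.  */
static size_t pos_src1(uintptr_t src1)
{
    return 8 - (src1 & 0x7);          /* sub pos, tmp3, tmp1 */
}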
And the performance looks about the same when aligning to src2 instead
of src1.  With the extra patch:

--- a/libc/arch-arm64/generic/bionic/memcmp.S
+++ b/libc/arch-arm64/generic/bionic/memcmp.S
@@ -159,7 +159,7 @@ ENTRY(memcmp)
     /* Sources are not aligned align one of the sources
        find max offset from aligned boundary.  */
-    and     tmp1, src1, #0x7
+    and     tmp1, src2, #0x7
     orr     tmp3, xzr, #0x8
     sub     pos, tmp3, tmp1
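As a side note, when swapping variants like this, a quick sanity check
before benchmarking is to compare the candidate's sign against the
system memcmp over random lengths and misalignments.  The harness
below is only a sketch: memcmp_under_test is a hypothetical hook,
defaulted to the system memcmp so the file stands alone; point it at
the routine being tested.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hook for the routine under test (defaults to the
   system memcmp so this compiles and runs as-is).  */
static int (*memcmp_under_test)(const void *, const void *, size_t) = memcmp;

static int sign(int x) { return (x > 0) - (x < 0); }

int main(void)
{
    static unsigned char a[4096 + 16], b[4096 + 16];

    for (int iter = 0; iter < 100000; iter++) {
        size_t n = (size_t) (rand() % 4096);
        size_t o1 = (size_t) (rand() % 8);   /* misalign src1 */
        size_t o2 = (size_t) (rand() % 8);   /* misalign src2 */

        for (size_t i = 0; i < n; i++)
            a[o1 + i] = b[o2 + i] = (unsigned char) (rand() & 0xff);
        if (n && rand() % 2)
            b[o2 + (size_t) rand() % n] ^= 0xff;   /* inject a difference */

        /* Only the sign of the result is specified, so compare signs.  */
        if (sign(memcmp_under_test(a + o1, b + o2, n))
            != sign(memcmp(a + o1, b + o2, n))) {
            printf("mismatch: n=%zu o1=%zu o2=%zu\n", n, o1, o2);
            return 1;
        }
    }
    puts("ok");
    return 0;
}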