From patchwork Thu Jun 22 23:30:26 2017
X-Patchwork-Submitter: Sebastian Pop
X-Patchwork-Id: 21223
From: Sebastian Pop
To: libc-alpha@sourceware.org
Cc: Marcus.Shawcroft@arm.com, maxim.kuvyrkov@linaro.org,
 ramana.radhakrishnan@arm.com, ryan.arnold@linaro.org,
 adhemerval.zanella@linaro.org, sebpop@gmail.com, Sebastian Pop
Subject: [PATCH] aarch64: optimize the unaligned case of memcmp
Date: Thu, 22 Jun 2017 18:30:26 -0500
Message-Id: <1498174226-16525-1-git-send-email-s.pop@samsung.com>

This brings to glibc a performance improvement that we developed in
Bionic libc.  That change has been submitted for review to Bionic libc:
https://android-review.googlesource.com/418279

This patch has been tested on glibc master with "configure; make; make
check" on an aarch64-linux Juno-r0 board with no new failures.  We would
appreciate help testing the performance and correctness of this change.

The patch was written by Vikas Sinha and Sebastian Pop.  Both Vikas and
I work at the Samsung Austin R&D Center, which has a copyright
assignment on file with the FSF for work on glibc.

Performance was measured with bionic-benchmarks on a HiKey (aarch64,
8x Cortex-A53) board.
There was no performance change on the existing benchmark, and there is
a performance improvement on the new benchmark for memcmp with unaligned
sources.  The new benchmark has been submitted for review at
https://android-review.googlesource.com/414860

Overall, performance improves by 18% for the small 8-byte case and by
450% for the large 64 KB case.

The base run uses the libc from /system/lib64.  The Bionic libc with
this patch is in /data.

hikey:/data # export LD_LIBRARY_PATH=/system/lib64
hikey:/data # ./bionic-benchmarks --benchmark_filter='BM_string_memcmp*'
Run on (8 X 2.4 MHz CPU s)
Benchmark                               Time           CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                     30 ns         30 ns    22955680    251.07MB/s
BM_string_memcmp/64                    57 ns         57 ns    12349184   1076.99MB/s
BM_string_memcmp/512                  305 ns        305 ns     2297163   1.56496GB/s
BM_string_memcmp/1024                 571 ns        571 ns     1225211   1.66912GB/s
BM_string_memcmp/8k                  4307 ns       4306 ns      162562   1.77177GB/s
BM_string_memcmp/16k                 8676 ns       8675 ns       80676   1.75887GB/s
BM_string_memcmp/32k                19233 ns      19230 ns       36394   1.58695GB/s
BM_string_memcmp/64k                36986 ns      36984 ns       18952   1.65029GB/s
BM_string_memcmp_aligned/8            199 ns        199 ns     3519166   38.3336MB/s
BM_string_memcmp_aligned/64           386 ns        386 ns     1810734   158.073MB/s
BM_string_memcmp_aligned/512         1735 ns       1734 ns      403981   281.525MB/s
BM_string_memcmp_aligned/1024        3200 ns       3200 ns      218838   305.151MB/s
BM_string_memcmp_aligned/8k         25084 ns      25080 ns       28180   311.507MB/s
BM_string_memcmp_aligned/16k        51730 ns      51729 ns       13521   302.057MB/s
BM_string_memcmp_aligned/32k       103228 ns     103228 ns        6782   302.727MB/s
BM_string_memcmp_aligned/64k       207117 ns     207087 ns        3450   301.806MB/s
BM_string_memcmp_unaligned/8          339 ns        339 ns     2070998   22.5302MB/s
BM_string_memcmp_unaligned/64        1392 ns       1392 ns      502796   43.8454MB/s
BM_string_memcmp_unaligned/512       9194 ns       9194 ns       76133   53.1104MB/s
BM_string_memcmp_unaligned/1024     18325 ns      18323 ns       38206   53.2963MB/s
BM_string_memcmp_unaligned/8k      148579 ns     148574 ns        4713   52.5831MB/s
BM_string_memcmp_unaligned/16k     298169 ns     298120 ns        2344   52.4118MB/s
BM_string_memcmp_unaligned/32k     598813 ns     598797 ns        1085    52.188MB/s
BM_string_memcmp_unaligned/64k    1196079 ns    1196083 ns         540   52.2539MB/s

hikey:/data # export LD_LIBRARY_PATH=/data
hikey:/data # ./bionic-benchmarks --benchmark_filter='BM_string_memcmp*'
Run on (8 X 2.4 MHz CPU s)
Benchmark                               Time           CPU  Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                     30 ns         30 ns    23209918   252.802MB/s
BM_string_memcmp/64                    57 ns         57 ns    12348447   1076.95MB/s
BM_string_memcmp/512                  305 ns        305 ns     2296878   1.56471GB/s
BM_string_memcmp/1024                 572 ns        571 ns     1224426    1.6689GB/s
BM_string_memcmp/8k                  4309 ns       4308 ns      162491   1.77109GB/s
BM_string_memcmp/16k                 9348 ns       9345 ns       74894   1.63285GB/s
BM_string_memcmp/32k                18329 ns      18322 ns       38249    1.6656GB/s
BM_string_memcmp/64k                36992 ns      36981 ns       18952   1.65045GB/s
BM_string_memcmp_aligned/8            199 ns        199 ns     3513925   38.3162MB/s
BM_string_memcmp_aligned/64           386 ns        386 ns     1814038   158.192MB/s
BM_string_memcmp_aligned/512         1735 ns       1735 ns      402279   281.502MB/s
BM_string_memcmp_aligned/1024        3204 ns       3202 ns      218761   304.941MB/s
BM_string_memcmp_aligned/8k         25577 ns      25569 ns       27406   305.548MB/s
BM_string_memcmp_aligned/16k        52143 ns      52123 ns       13522   299.769MB/s
BM_string_memcmp_aligned/32k       105169 ns     105127 ns        6637    297.26MB/s
BM_string_memcmp_aligned/64k       206508 ns     206383 ns        3417   302.835MB/s
BM_string_memcmp_unaligned/8          287 ns        287 ns     2441787   26.6141MB/s
BM_string_memcmp_unaligned/64         556 ns        556 ns     1257709   109.764MB/s
BM_string_memcmp_unaligned/512       2167 ns       2166 ns      323159   225.443MB/s
BM_string_memcmp_unaligned/1024      4041 ns       4039 ns      173282   241.797MB/s
BM_string_memcmp_unaligned/8k       32234 ns      32221 ns       21645   242.464MB/s
BM_string_memcmp_unaligned/16k      65715 ns      65684 ns       10573   237.882MB/s
BM_string_memcmp_unaligned/32k     133390 ns     133348 ns        5350   234.349MB/s
BM_string_memcmp_unaligned/64k     264506 ns     264401 ns        2644   236.383MB/s
---
 sysdeps/aarch64/memcmp.S | 59
 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 1 deletion(-)

diff --git a/sysdeps/aarch64/memcmp.S b/sysdeps/aarch64/memcmp.S
index 4cfcb89..d259831 100644
--- a/sysdeps/aarch64/memcmp.S
+++ b/sysdeps/aarch64/memcmp.S
@@ -138,9 +138,66 @@ L(ret0):

 	.p2align 6
 L(misaligned8):
+	cmp	limit, #8
+	b.lo	L(misalignedLt8)
+
+L(unalignedGe8):
+
+	/* Load the first dword with both src potentially unaligned.  */
+	ldr	data1, [src1]
+	ldr	data2, [src2]
+
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	cbnz	diff, L(not_limit)
+
+	/* Sources are not aligned: align one of the sources and find the
+	   maximum offset from the aligned boundary.  */
+	and	tmp1, src1, #0x7
+	orr	tmp3, xzr, #0x8
+	and	tmp2, src2, #0x7
+	sub	tmp1, tmp3, tmp1
+	sub	tmp2, tmp3, tmp2
+	cmp	tmp1, tmp2
+	/* Choose the maximum.  */
+	csel	pos, tmp1, tmp2, hi
+
+	/* Increment SRC pointers by POS so one of the SRC pointers is
+	   word-aligned.  */
+	add	src1, src1, pos
+	add	src2, src2, pos
+
+	sub	limit, limit, pos
+	lsr	limit_wd, limit, #3
+
+	cmp	limit_wd, #0
+
+	/* Save the number of bytes needed to go back so we can read 8 bytes
+	   at the end: pos = negative offset from which to read 8 bytes when
+	   len % 8 != 0.  */
+	and	limit, limit, #7
+	sub	pos, limit, #8
+
+	b	L(start_part_realigned)
+
+	.p2align 5
+L(loop_part_aligned):
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	subs	limit_wd, limit_wd, #1
+L(start_part_realigned):
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	cbnz	diff, L(not_limit)
+	b.ne	L(loop_part_aligned)
+
+	/* Process leftover bytes: re-read the tail starting at a negative
+	   offset so we can load a full 8 bytes.  */
+	ldr	data1, [src1, pos]
+	ldr	data2, [src2, pos]
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	b	L(not_limit)
+
+L(misalignedLt8):
 	sub	limit, limit, #1
 1:
-	/* Perhaps we can do better than this.  */
 	ldrb	data1w, [src1], #1
 	ldrb	data2w, [src2], #1
 	subs	limit, limit, #1
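For readers following the assembly, the strategy above can be sketched in C.
This is an illustrative model under stated assumptions (len >= 8, as enforced
by the b.lo branch), not the actual glibc code; `load8` and
`memcmp_unaligned_sketch` are hypothetical names invented for this sketch:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Unaligned 8-byte load, modeling the ldr of a dword from an
   unaligned address.  */
static uint64_t load8(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, 8);
    return v;
}

/* Sketch of the patched unaligned path; requires len >= 8.  */
int memcmp_unaligned_sketch(const void *s1, const void *s2, size_t len)
{
    const unsigned char *p1 = s1, *p2 = s2;

    /* First dword: both sources potentially unaligned.  */
    if (load8(p1) != load8(p2))
        return memcmp(p1, p2, 8);    /* stand-in for L(not_limit) */

    /* pos = max(8 - (src1 & 7), 8 - (src2 & 7)); advancing by pos
       word-aligns one of the two sources.  The skipped bytes were
       already covered by the first dword compare.  */
    size_t off1 = 8 - ((uintptr_t)p1 & 7);
    size_t off2 = 8 - ((uintptr_t)p2 & 7);
    size_t pos = off1 > off2 ? off1 : off2;
    p1 += pos;
    p2 += pos;
    len -= pos;

    /* Main loop: 8 bytes per iteration, one source now aligned.  */
    for (size_t words = len >> 3; words != 0; words--, p1 += 8, p2 += 8)
        if (load8(p1) != load8(p2))
            return memcmp(p1, p2, 8);

    /* Tail: re-read the final 8 bytes at a negative offset so a full
       8-byte load covers the len % 8 leftover bytes; the re-read bytes
       are already known equal.  */
    size_t rem = len & 7;
    if (rem != 0) {
        ptrdiff_t back = (ptrdiff_t)rem - 8;
        if (load8(p1 + back) != load8(p2 + back))
            return memcmp(p1 + back, p2 + back, 8);
    }
    return 0;
}
```

The negative-offset tail read is the same trick the assembly uses with
`ldr data1, [src1, pos]`: rather than a byte loop over len % 8 leftovers,
one overlapping dword load finishes the comparison.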