Message ID | 1512105982-17496-1-git-send-email-siddhesh@sourceware.org |
---|---|
State | Committed |
Headers |
From: Siddhesh Poyarekar <siddhesh@sourceware.org>
To: libc-alpha@sourceware.org
Subject: [PATCH] aarch64: Improve strcmp unaligned performance
Date: Fri, 1 Dec 2017 10:56:22 +0530
Message-Id: <1512105982-17496-1-git-send-email-siddhesh@sourceware.org>
Commit Message
Siddhesh Poyarekar
Dec. 1, 2017, 5:26 a.m. UTC
Replace the simple byte-wise compare in the misaligned case with a
dword compare with page boundary checks in place.  For simplicity I've
chosen a 4K page boundary so that we don't have to query the actual
page size on the system.

This results in up to 3x improvement in performance in the unaligned
case on falkor and about 2.5x improvement on mustang as measured using
bench-strcmp.

	* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
	time whenever possible.
---
 sysdeps/aarch64/strcmp.S | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)
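To make the approach described above concrete, here is a minimal C sketch of the same idea (a hypothetical helper for illustration only, not the glibc code): compare byte-wise until SRC1 is 8-byte aligned, then compare a dword at a time, dropping back to byte-wise compares whenever the next 8-byte load from SRC2 falls in the last dword of a 4 KiB page.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the patch's approach (hypothetical helper,
   not the glibc implementation).  Once s1 is 8-byte aligned, compare a
   dword at a time; fall back to byte-wise compares whenever the next
   8-byte read from s2 lies in the last dword of a 4 KiB page, so the
   wide load cannot stray onto an unmapped following page.  */
static int
strcmp_dword_sketch (const char *s1, const char *s2)
{
  const uint64_t zeroones = 0x0101010101010101ULL;
  const uint64_t rep8_7f = 0x7f7f7f7f7f7f7f7fULL;

  for (;;)
    {
      /* Byte-wise until s1 is 8-byte aligned, and also for the 8 bytes
	 nearest a 4 KiB boundary on s2 (mirrors L(do_misaligned)).  */
      while (((uintptr_t) s1 & 7) != 0
	     || ((uintptr_t) s2 & 0xff8) == 0xff8)
	{
	  unsigned char c1 = (unsigned char) *s1++;
	  unsigned char c2 = (unsigned char) *s2++;
	  if (c1 != c2 || c1 == '\0')
	    return c1 - c2;
	}

      /* Dword loop (mirrors L(loop_misaligned)): s1 is aligned, s2 may
	 not be, so read it via memcpy where the assembly uses an
	 unaligned LDR.  */
      uint64_t data1, data2;
      memcpy (&data1, s1, 8);
      memcpy (&data2, s2, 8);

      uint64_t diff = data1 ^ data2;	/* Non-zero if differences found.  */
      uint64_t has_nul = (data1 - zeroones) & ~(data1 | rep8_7f);
      if ((diff | has_nul) != 0)	/* The "syndrome".  */
	{
	  /* A difference or NUL is somewhere in these 8 bytes; locate
	     it byte-wise (the assembly instead reuses its end path).  */
	  for (int i = 0; i < 8; i++)
	    {
	      unsigned char c1 = (unsigned char) s1[i];
	      unsigned char c2 = (unsigned char) s2[i];
	      if (c1 != c2 || c1 == '\0')
		return c1 - c2;
	    }
	}
      s1 += 8;
      s2 += 8;
    }
}
```

Assuming 4 KiB pages (a larger actual page size only makes the check more conservative), the wide load from SRC2 never runs onto a following, possibly unmapped, page, and the aligned load from SRC1 cannot cross a page boundary at all.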
Comments
On Thu, Nov 30, 2017 at 9:26 PM, Siddhesh Poyarekar
<siddhesh@sourceware.org> wrote:
> Replace the simple byte-wise compare in the misaligned case with a
> dword compare with page boundary checks in place.  For simplicity I've
> chosen a 4K page boundary so that we don't have to query the actual
> page size on the system.
>
> This results in up to 3x improvement in performance in the unaligned
> case on falkor and about 2.5x improvement on mustang as measured using
> bench-strcmp.

On at least Cavium OcteonTX T81 and T83 (and ThunderX1 T88), this might
be slower as any unaligned load/stores have a 9 cycle penalty.  I will
double check this next week.  What this means is I will propose a strcmp
version for OcteonTX and ThunderX1 above this patch if I find it is
slower.

Thanks,
Andrew Pinski
Ping!

On Friday 01 December 2017 10:56 AM, Siddhesh Poyarekar wrote:
> Replace the simple byte-wise compare in the misaligned case with a
> dword compare with page boundary checks in place.  For simplicity I've
> chosen a 4K page boundary so that we don't have to query the actual
> page size on the system.
>
> This results in up to 3x improvement in performance in the unaligned
> case on falkor and about 2.5x improvement on mustang as measured using
> bench-strcmp.
>
> 	* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
> 	time whenever possible.
On Friday 01 December 2017 12:25 PM, Andrew Pinski wrote:
> On at least Cavium OcteonTX T81 and T83 (and ThunderX1 T88), this
> might be slower as any unaligned load/stores have a 9 cycle penalty.
> I will double check this next week.  What this means is I will
> propose a strcmp version for OcteonTX and ThunderX1 above this patch
> if I find it is slower.

Hi Andrew, were you able to test this?

Siddhesh
Ping!

On Monday 04 December 2017 09:37 PM, Siddhesh Poyarekar wrote:
> Ping!
>
> On Friday 01 December 2017 10:56 AM, Siddhesh Poyarekar wrote:
>> Replace the simple byte-wise compare in the misaligned case with a
>> dword compare with page boundary checks in place.  For simplicity I've
>> chosen a 4K page boundary so that we don't have to query the actual
>> page size on the system.
>>
>> This results in up to 3x improvement in performance in the unaligned
>> case on falkor and about 2.5x improvement on mustang as measured using
>> bench-strcmp.
>>
>> 	* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
>> 	time whenever possible.
On 12/12/17 18:20, Siddhesh Poyarekar wrote:
> Ping!
>
> On Monday 04 December 2017 09:37 PM, Siddhesh Poyarekar wrote:
>> Ping!
>>
>> On Friday 01 December 2017 10:56 AM, Siddhesh Poyarekar wrote:
>>> Replace the simple byte-wise compare in the misaligned case with a
>>> dword compare with page boundary checks in place.  For simplicity I've
>>> chosen a 4K page boundary so that we don't have to query the actual
>>> page size on the system.
>>>
>>> This results in up to 3x improvement in performance in the unaligned
>>> case on falkor and about 2.5x improvement on mustang as measured using
>>> bench-strcmp.
>>>
>>> 	* sysdeps/aarch64/strcmp.S (misaligned8): Compare dword at a
>>> 	time whenever possible.

Sorry for the delay, this is OK to commit.  I believe it should give a
performance improvement in general.
diff --git a/sysdeps/aarch64/strcmp.S b/sysdeps/aarch64/strcmp.S
index e99d662..c260e1d 100644
--- a/sysdeps/aarch64/strcmp.S
+++ b/sysdeps/aarch64/strcmp.S
@@ -72,6 +72,7 @@ L(start_realigned):
 	cbz	syndrome, L(loop_aligned)
 	/* End of performance-critical section  -- one 64B cache line.  */
 
+L(end):
 #ifndef	__AARCH64EB__
 	rev	syndrome, syndrome
 	rev	data1, data1
@@ -145,12 +146,38 @@ L(mutual_align):
 	b	L(start_realigned)
 
 L(misaligned8):
-	/* We can do better than this.  */
+	/* Align SRC1 to 8 bytes and then compare 8 bytes at a time, always
+	   checking to make sure that we don't access beyond page boundary in
+	   SRC2.  */
+	tst	src1, #7
+	b.eq	L(loop_misaligned)
+L(do_misaligned):
 	ldrb	data1w, [src1], #1
 	ldrb	data2w, [src2], #1
 	cmp	data1w, #1
 	ccmp	data1w, data2w, #0, cs	/* NZCV = 0b0000.  */
-	b.eq	L(misaligned8)
+	b.ne	L(done)
+	tst	src1, #7
+	b.ne	L(misaligned8)
+
+L(loop_misaligned):
+	/* Test if we are within the last dword of the end of a 4K page.  If
+	   yes then jump back to the misaligned loop to copy a byte at a time.  */
+	and	tmp1, src2, #0xff8
+	eor	tmp1, tmp1, #0xff8
+	cbz	tmp1, L(do_misaligned)
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	bic	has_nul, tmp1, tmp2	/* Non-zero if NUL terminator.  */
+	orr	syndrome, diff, has_nul
+	cbz	syndrome, L(loop_misaligned)
+	b	L(end)
+
+L(done):
 	sub	result, data1, data2
 	RET
 END(strcmp)
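For reference, the page-boundary test at the top of L(loop_misaligned) (the and/eor/cbz on #0xff8) asks whether SRC2's offset within a 4 KiB page has reached the page's last dword, and the sub/orr/bic sequence is the usual branch-free zero-byte test.  A small standalone C demonstration of both (illustrative only, not part of the patch):

```c
#include <stdint.h>
#include <stdio.h>

/* (addr & 0xff8) == 0xff8 holds exactly when addr's offset within a
   4 KiB page is in 0xff8..0xfff, i.e. when an 8-byte load starting at
   addr might run past the page.  This mirrors the and/eor/cbz sequence
   in L(loop_misaligned).  The test is slightly conservative: a load at
   offset exactly 0xff8 still fits within the page, but treating the
   whole last dword byte-wise keeps the check short.  */
static int
near_page_end (uintptr_t addr)
{
  return (addr & 0xff8) == 0xff8;
}

int
main (void)
{
  for (uintptr_t off = 0xff4; off <= 0x1002; off++)
    printf ("page offset 0x%03x: %s\n", (unsigned) (off & 0xfff),
	    near_page_end (off) ? "byte-wise" : "dword load");

  /* The NUL-detection expression used for the syndrome:
     (x - 0x0101..) & ~(x | 0x7f7f..) is non-zero iff some byte of x
     is zero; in the assembly this is sub/orr/bic into has_nul.  */
  uint64_t with_nul = 0x1122330055667788ULL;	/* one byte is 0x00 */
  uint64_t no_nul = 0x1122334455667788ULL;	/* no zero bytes */
  printf ("has_nul (with) = %d, has_nul (without) = %d\n",
	  ((with_nul - 0x0101010101010101ULL)
	   & ~(with_nul | 0x7f7f7f7f7f7f7f7fULL)) != 0,
	  ((no_nul - 0x0101010101010101ULL)
	   & ~(no_nul | 0x7f7f7f7f7f7f7f7fULL)) != 0);
  return 0;
}
```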