From patchwork Fri Oct 16 19:21:48 2015
X-Patchwork-Submitter: Steve Ellcey
X-Patchwork-Id: 9194
From: Steve Ellcey <sellcey@imgtec.com>
Date: Fri, 16 Oct 2015 12:21:48 -0700
To: libc-alpha@sourceware.org
Subject: [PATCH] MIPS memcpy performance improvement
Message-ID: <40453179-a1ca-4897-971b-a7197772fb88@BAMAIL02.ba.imgtec.org>

It was brought to my attention that the MIPS N32 (and N64) memcpy was
slower than the MIPS O32 memcpy for small (less than 16 byte) aligned
copies.  This is because for sizes of 8 to 15 bytes, the O32 memcpy would
do two or three word copies followed by byte copies, but the N32 version
would do all byte copies.  Basically, the N32 version did not 'fall back'
to doing word copies when it could not do double-word copies.

This patch addresses the problem with two changes.  One is actually for
large memcpys on N32: after doing as many double-word copies as possible,
the N32 version will now try to do at least one word copy before going to
byte copies.  The other change is that after determining that a memcpy is
small (less than 8 bytes for the O32 ABI, less than 16 bytes for the N32
or N64 ABI), instead of just doing byte copies it checks the size and
alignment of the inputs and, if possible, does word copies (followed by
byte copies if needed).  If word copies are not possible due to size or
alignment, it drops back to byte copies as before.

The glibc memcpy benchmark does not have any tests that catch the first
case (though my own testing showed a small improvement), but it does test
the second case.  There, for inputs of length 4 to 15 bytes (depending on
the ABI), the new code is slower for unaligned copies and faster for
aligned ones.  There is also a slowdown for copies of less than 4 bytes
regardless of alignment.
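For readers who don't want to trace the assembly, here is a rough C sketch
of what the new small-copy path (L(lasts)) does.  It is not part of the
patch; the function name small_memcpy and the use of uint32_t are just
illustrative.  The idea is: if there is at least one full word to move and
both pointers are word aligned, copy whole words first and finish with
bytes, otherwise fall back to byte copies as before.

#include <stddef.h>
#include <stdint.h>

static void *
small_memcpy (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;

  /* Word copies are worthwhile only if there is at least one full word
     to move and both pointers are 4-byte aligned.  */
  if ((n & ~(size_t) 3) != 0
      && ((uintptr_t) d & 3) == 0
      && ((uintptr_t) s & 3) == 0)
    {
      size_t words = n >> 2;
      while (words--)
	{
	  *(uint32_t *) d = *(const uint32_t *) s;	/* one lw/sw pair */
	  d += 4;
	  s += 4;
	}
      n &= 3;		/* 0 to 3 tail bytes remain */
    }

  while (n--)		/* byte copies: the tail, or everything if unaligned */
    *d++ = *s++;

  return dst;
}

The sketch only shows the control flow; the actual code keeps the existing
branch-delay-slot style of the rest of memcpy.S.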
For example with O32:

Original:
                                 memcpy  builtin_memcpy  simple_memcpy
Length 1, alignment 0/ 0:       54.3906         54.0156        37.9062
Length 4, alignment 0/ 0:       66.8438         70.6562        65.2969
Length 4, alignment 2/ 0:       65.8438         71.1406        65.25
Length 8, alignment 0/ 0:       73.7344         82.2656        74.2969
Length 8, alignment 3/ 0:       74.3906         76.6875        74.25

With change:
                                 memcpy  builtin_memcpy  simple_memcpy
Length 1, alignment 0/ 0:       61.7031         51.8125        37.2656
Length 4, alignment 0/ 0:       50.1094         54.5           66.7344
Length 4, alignment 2/ 0:       72.0312         77.0156        65.1719
Length 8, alignment 0/ 0:       72.6406         76.4531        74.125
Length 8, alignment 3/ 0:       80.9375         84.2969        74.125

Or with N32:

Original:
                                 memcpy  builtin_memcpy  simple_memcpy
Length 1, alignment 0/ 0:       57.7188         52.5156        35.687
Length 4, alignment 0/ 0:       66.1719         75.9531        63.4531
Length 4, alignment 2/ 0:       66.7344         75.4531        64.1719
Length 8, alignment 0/ 0:       76.7656         85.5469        72.625
Length 8, alignment 3/ 0:       75.6094         84.9062        73.7031

New:
                                 memcpy  builtin_memcpy  simple_memcpy
Length 1, alignment 0/ 0:       64.3594         54.2344        35.4219
Length 4, alignment 0/ 0:       49.125          59.3281        64.7031
Length 4, alignment 2/ 0:       74.5469         77.3906        63.6562
Length 8, alignment 0/ 0:       57.25           69.0312        73.2188
Length 8, alignment 3/ 0:       94.5            97.9688        73.7031

I have the complete benchmark runs if anyone wants them, but this shows
you the overall pattern.  I also ran the correctness tests and verified
that there are no regressions in correctness.

OK to checkin?

Steve Ellcey
sellcey@imgtec.com


2015-10-16  Steve Ellcey  <sellcey@imgtec.com>

	* sysdeps/mips/memcpy.S (memcpy): Add word copies for small
	aligned data.

diff --git a/sysdeps/mips/memcpy.S b/sysdeps/mips/memcpy.S
index c85935b..6f63405 100644
--- a/sysdeps/mips/memcpy.S
+++ b/sysdeps/mips/memcpy.S
@@ -295,7 +295,7 @@ L(memcpy):
  * size, copy dst pointer to v0 for the return value.
  */
 	slti	t2,a2,(2 * NSIZE)
-	bne	t2,zero,L(lastb)
+	bne	t2,zero,L(lasts)
 #if defined(RETURN_FIRST_PREFETCH) || defined(RETURN_LAST_PREFETCH)
 	move	v0,zero
 #else
@@ -546,7 +546,7 @@ L(chkw):
  */
 L(chk1w):
 	andi	a2,t8,(NSIZE-1)	/* a2 is the reminder past one (d)word chunks */
-	beq	a2,t8,L(lastb)
+	beq	a2,t8,L(lastw)
 	PTR_SUBU a3,t8,a2	/* a3 is count of bytes in one (d)word chunks */
 	PTR_ADDU a3,a0,a3	/* a3 is the dst address after loop */
 
@@ -558,6 +558,20 @@ L(wordCopy_loop):
 	bne	a0,a3,L(wordCopy_loop)
 	C_ST	REG3,UNIT(-1)(a0)
 
+/* If we have been copying double words, see if we can copy a single word
+   before doing byte copies.  We can have, at most, one word to copy.  */
+
+L(lastw):
+#ifdef USE_DOUBLE
+	andi	t8,a2,3		/* a2 is the remainder past 4 byte chunks.  */
+	beq	t8,a2,L(lastb)
+	lw	REG3,0(a1)
+	sw	REG3,0(a0)
+	PTR_ADDIU a0,a0,4
+	PTR_ADDIU a1,a1,4
+	move	a2,t8
+#endif
+
 /* Copy the last 8 (or 16) bytes */
 L(lastb):
 	blez	a2,L(leave)
@@ -572,6 +586,33 @@ L(leave):
 	j	ra
 	nop
 
+/* We jump here with a memcpy of less than 8 or 16 bytes, depending on
+   whether or not USE_DOUBLE is defined.  Instead of just doing byte
+   copies, check the alignment and size and use lw/sw if possible.
+   Otherwise, do byte copies.  */
+
+L(lasts):
+	andi	t8,a2,3
+	beq	t8,a2,L(lastb)
+
+	andi	t9,a0,3
+	bne	t9,zero,L(lastb)
+	andi	t9,a1,3
+	bne	t9,zero,L(lastb)
+
+	PTR_SUBU a3,a2,t8
+	PTR_ADDU a3,a0,a3
+
+L(wcopy_loop):
+	lw	REG3,0(a1)
+	PTR_ADDIU a0,a0,4
+	PTR_ADDIU a1,a1,4
+	bne	a0,a3,L(wcopy_loop)
+	sw	REG3,-4(a0)
+
+	b	L(lastb)
+	move	a2,t8
+
 #ifndef R6_CODE
 /*
  * UNALIGNED case, got here with a3 = "negu a0"