From patchwork Fri Oct 16 19:21:48 2015
X-Patchwork-Submitter: Steve Ellcey
X-Patchwork-Id: 9194
From: Steve Ellcey <sellcey@imgtec.com>
Date: Fri, 16 Oct 2015 12:21:48 -0700
To: libc-alpha@sourceware.org
Subject: [PATCH] MIPS memcpy performance improvement
Message-ID: <40453179-a1ca-4897-971b-a7197772fb88@BAMAIL02.ba.imgtec.org>

It was brought to my attention that the MIPS N32 (and N64) memcpy was
slower than the MIPS O32 memcpy for small (less than 16 byte) aligned
copies.  This is because for sizes of 8 to 15 bytes, the O32 memcpy would
do two or three word copies followed by byte copies, but the N32 version
would do all byte copies.  Basically, the N32 version did not 'fall back'
to doing word copies when it could not do double-word copies.

This patch addresses the problem with two changes.  One is actually for
large memcpys on N32: after doing as many double-word copies as possible,
the N32 version will now try to do at least one word copy before going to
byte copies.  The other change is that after determining that a memcpy is
small (less than 8 bytes for the O32 ABI, less than 16 bytes for the N32
or N64 ABI), instead of just doing byte copies it checks the size and
alignment of the inputs and, if possible, does word copies (followed by
byte copies if needed).  If word copies are not possible due to size or
alignment, it drops back to byte copies as before.

The glibc memcpy benchmark does not have any tests that catch the first
case (though my own testing showed a small improvement), but it does test
the second case.  There, for inputs of length 4 to 15 bytes (depending on
the ABI), the new code is slower for unaligned copies and faster for
aligned ones.  There is also a slowdown for copies of less than 4 bytes
regardless of alignment.
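For readers who don't want to trace the assembly, here is a rough C sketch
of what the new small-copy path (L(lasts)) does.  It is not part of the
patch; the function name small_memcpy and the use of uint32_t are just
illustrative.  The idea is: if there is at least one full word to move and
both pointers are word aligned, copy whole words first and finish with
bytes, otherwise fall back to byte copies as before.

#include <stddef.h>
#include <stdint.h>

static void *
small_memcpy (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;

  /* Word copies are worthwhile only if there is at least one full word
     to move and both pointers are 4-byte aligned.  */
  if ((n & ~(size_t) 3) != 0
      && ((uintptr_t) d & 3) == 0
      && ((uintptr_t) s & 3) == 0)
    {
      size_t words = n >> 2;
      while (words--)
	{
	  *(uint32_t *) d = *(const uint32_t *) s;	/* one lw/sw pair */
	  d += 4;
	  s += 4;
	}
      n &= 3;		/* 0 to 3 tail bytes remain */
    }

  while (n--)		/* byte copies: the tail, or everything if unaligned */
    *d++ = *s++;

  return dst;
}

The sketch only shows the control flow; the actual code keeps the existing
branch-delay-slot style of the rest of memcpy.S.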
For example with O32:

Original:
                                 memcpy  builtin_memcpy  simple_memcpy
Length 1, alignment 0/ 0:       54.3906         54.0156        37.9062
Length 4, alignment 0/ 0:       66.8438         70.6562        65.2969
Length 4, alignment 2/ 0:       65.8438         71.1406        65.25
Length 8, alignment 0/ 0:       73.7344         82.2656        74.2969
Length 8, alignment 3/ 0:       74.3906         76.6875        74.25

With change:
                                 memcpy  builtin_memcpy  simple_memcpy
Length 1, alignment 0/ 0:       61.7031         51.8125        37.2656
Length 4, alignment 0/ 0:       50.1094         54.5           66.7344
Length 4, alignment 2/ 0:       72.0312         77.0156        65.1719
Length 8, alignment 0/ 0:       72.6406         76.4531        74.125
Length 8, alignment 3/ 0:       80.9375         84.2969        74.125

Or with N32:

Original:
                                 memcpy  builtin_memcpy  simple_memcpy
Length 1, alignment 0/ 0:       57.7188         52.5156        35.687
Length 4, alignment 0/ 0:       66.1719         75.9531        63.4531
Length 4, alignment 2/ 0:       66.7344         75.4531        64.1719
Length 8, alignment 0/ 0:       76.7656         85.5469        72.625
Length 8, alignment 3/ 0:       75.6094         84.9062        73.7031

New:
                                 memcpy  builtin_memcpy  simple_memcpy
Length 1, alignment 0/ 0:       64.3594         54.2344        35.4219
Length 4, alignment 0/ 0:       49.125          59.3281        64.7031
Length 4, alignment 2/ 0:       74.5469         77.3906        63.6562
Length 8, alignment 0/ 0:       57.25           69.0312        73.2188
Length 8, alignment 3/ 0:       94.5            97.9688        73.7031

I have the complete benchmark runs if anyone wants them, but this shows
you the overall pattern.  I also ran the correctness tests and verified
that there are no regressions in correctness.

OK to checkin?

Steve Ellcey
sellcey@imgtec.com


2015-10-16  Steve Ellcey  <sellcey@imgtec.com>

	* sysdeps/mips/memcpy.S (memcpy): Add word copies for small
	aligned data.

diff --git a/sysdeps/mips/memcpy.S b/sysdeps/mips/memcpy.S
index c85935b..6f63405 100644
--- a/sysdeps/mips/memcpy.S
+++ b/sysdeps/mips/memcpy.S
@@ -295,7 +295,7 @@ L(memcpy):
  * size, copy dst pointer to v0 for the return value.
  */
 	slti	t2,a2,(2 * NSIZE)
-	bne	t2,zero,L(lastb)
+	bne	t2,zero,L(lasts)
 #if defined(RETURN_FIRST_PREFETCH) || defined(RETURN_LAST_PREFETCH)
 	move	v0,zero
 #else
@@ -546,7 +546,7 @@ L(chkw):
  */
 L(chk1w):
 	andi	a2,t8,(NSIZE-1)	/* a2 is the reminder past one (d)word chunks */
-	beq	a2,t8,L(lastb)
+	beq	a2,t8,L(lastw)
 	PTR_SUBU a3,t8,a2	/* a3 is count of bytes in one (d)word chunks */
 	PTR_ADDU a3,a0,a3	/* a3 is the dst address after loop */
 
@@ -558,6 +558,20 @@ L(wordCopy_loop):
 	bne	a0,a3,L(wordCopy_loop)
 	C_ST	REG3,UNIT(-1)(a0)
 
+/* If we have been copying double words, see if we can copy a single word
+   before doing byte copies.  We can have, at most, one word to copy.  */
+
+L(lastw):
+#ifdef USE_DOUBLE
+	andi	t8,a2,3		/* a2 is the remainder past 4 byte chunks.  */
+	beq	t8,a2,L(lastb)
+	lw	REG3,0(a1)
+	sw	REG3,0(a0)
+	PTR_ADDIU a0,a0,4
+	PTR_ADDIU a1,a1,4
+	move	a2,t8
+#endif
+
 /* Copy the last 8 (or 16) bytes */
 L(lastb):
 	blez	a2,L(leave)
@@ -572,6 +586,33 @@ L(leave):
 	j	ra
 	nop
 
+/* We jump here with a memcpy of less than 8 or 16 bytes, depending on
+   whether or not USE_DOUBLE is defined.  Instead of just doing byte
+   copies, check the alignment and size and use lw/sw if possible.
+   Otherwise, do byte copies.  */
+
+L(lasts):
+	andi	t8,a2,3
+	beq	t8,a2,L(lastb)
+
+	andi	t9,a0,3
+	bne	t9,zero,L(lastb)
+	andi	t9,a1,3
+	bne	t9,zero,L(lastb)
+
+	PTR_SUBU a3,a2,t8
+	PTR_ADDU a3,a0,a3
+
+L(wcopy_loop):
+	lw	REG3,0(a1)
+	PTR_ADDIU a0,a0,4
+	PTR_ADDIU a1,a1,4
+	bne	a0,a3,L(wcopy_loop)
+	sw	REG3,-4(a0)
+
+	b	L(lastb)
+	move	a2,t8
+
 #ifndef R6_CODE
 /*
  * UNALIGNED case, got here with a3 = "negu a0"