[MIPS] Modify memcpy.S for mips32r6/mips64r6

Message ID 1419354526.27606.73.camel@ubuntu-sellcey
State Superseded

Commit Message

Steve Ellcey Dec. 23, 2014, 5:08 p.m. UTC
  On Mon, 2014-12-22 at 17:59 +0000, Joseph Myers wrote:
> On Fri, 19 Dec 2014, Steve Ellcey  wrote:
> 
> > 	* sysdeps/mips/memcpy.S: Fix preprocessor indentation.
> 
> Please separate the formatting fixes from the substantive changes.  The 
> formatting fixes - a patch that shows no changes from "git diff -w" - can 
> go in as obvious.  The r6 changes should then be resubmitted.

Here is a new memcpy patch.  It has just the changes needed for
mips32r6/mips64r6 support.  Note that there are still some preprocessor
indentation changes where existing ifdefs are now under a newly
introduced !R6_CODE ifdef.

Tested with the mips32r6/mips64r6 GCC, binutils and qemu simulator.

OK to checkin?

Steve Ellcey
sellcey@imgtec.com


2014-12-22  Steve Ellcey  <sellcey@imgtec.com>

	* sysdeps/mips/memcpy.S: Add support for mips32r6/mips64r6.
  

Comments

Ondrej Bilka Dec. 23, 2014, 5:25 p.m. UTC | #1
On Tue, Dec 23, 2014 at 09:08:46AM -0800, Steve Ellcey wrote:
> On Mon, 2014-12-22 at 17:59 +0000, Joseph Myers wrote:
> > On Fri, 19 Dec 2014, Steve Ellcey  wrote:
> > 
> > > 	* sysdeps/mips/memcpy.S: Fix preprocessor indentation.
> > 
> > Please separate the formatting fixes from the substantive changes.  The 
> > formatting fixes - a patch that shows no changes from "git diff -w" - can 
> > go in as obvious.  The r6 changes should then be resubmitted.
> 
> Here is a new memcpy patch.  It has just the changes needed for
> mips32r6/mips64r6 support.  Note that there are still some preprocessor
> indentation changes where existing ifdefs are now under a newly
> introduced !R6_CODE ifdef.
> 
> Tested with the mips32r6/mips64r6 GCC, binutils and qemu simulator.
> 
> OK to checkin?
> 
This still contains a likely performance regression.  Using indirect jumps is
slow; try different approaches.
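
For reference, the construct in question is the head-alignment dispatch: an
indirect jump, indexed by the destination's low bits, into a table of branches
that in turn land in fall-through byte-copy stubs.  In rough C terms it does
something like the following (an illustrative sketch only, not the submitted
assembly; the function name is invented):

#include <stddef.h>
#include <stdint.h>

/* Sketch of the dispatch performed before the aligned loop: switch on the
   number of bytes needed to align the destination and fall through the
   byte-copy cases.  A compiler typically lowers such a switch to an indirect
   jump through a table, i.e. the branch-to-branch pattern discussed here.
   Assumes *n is large enough that the head bytes are available, as on the
   large-copy path in the patch.  */
static void
copy_head_sketch (unsigned char **dst, const unsigned char **src, size_t *n)
{
  unsigned int rem = (uintptr_t) *dst & 7;   /* low bits of the destination */
  unsigned int todo = rem ? 8 - rem : 0;     /* bytes needed to align dst */

  switch (todo)
    {
    case 7: (*dst)[6] = (*src)[6];  /* fall through */
    case 6: (*dst)[5] = (*src)[5];  /* fall through */
    case 5: (*dst)[4] = (*src)[4];  /* fall through */
    case 4: (*dst)[3] = (*src)[3];  /* fall through */
    case 3: (*dst)[2] = (*src)[2];  /* fall through */
    case 2: (*dst)[1] = (*src)[1];  /* fall through */
    case 1: (*dst)[0] = (*src)[0];
	    *dst += todo;
	    *src += todo;
	    *n -= todo;
	    break;
    case 0:
	    break;
    }
}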
  
Steve Ellcey Dec. 23, 2014, 5:34 p.m. UTC | #2
On Tue, 2014-12-23 at 18:25 +0100, Ondřej Bílka wrote:

> > Here is a new memcpy patch.  It has just the changes needed for
> > mips32r6/mips64r6 support.  Note that there are still some preprocessor
> > indentation changes where existing ifdefs are now under a newly
> > introduced !R6_CODE ifdef.
> > 
> > Tested with the mips32r6/mips64r6 GCC, binutils and qemu simulator.
> > 
> > OK to checkin?
> > 
> still contains likely performance regression. using indirect jumps is
> slow, try different approaches.

There is no performance regression for existing MIPS architectures
because that code has not changed.  The code may not be optimal for
mips32r6/mips64r6 (yet) but I would rather get the functionality in
now and optimize it later instead of trying to get it perfect
immediately.  Especially with a release coming up soon.

Steve Ellcey
sellcey@imgtec.com
  
Richard Henderson Dec. 23, 2014, 5:52 p.m. UTC | #3
On 12/23/2014 09:08 AM, Steve Ellcey wrote:
> +	andi	t8,a0,7
> +	lapc	t9,L(atable)
> +	PTR_LSA	t9,t8,t9,2
> +	jrc	t9
> +L(atable):
> +	bc	L(lb0)
> +	bc	L(lb7)
> +	bc	L(lb6)
> +	bc	L(lb5)
> +	bc	L(lb4)
> +	bc	L(lb3)
> +	bc	L(lb2)
> +	bc	L(lb1)
> +L(lb7):
> +	lb	a3, 6(a1)
> +	sb	a3, 6(a0)
> +L(lb6):
> +	lb	a3, 5(a1)
> +	sb	a3, 5(a0)
> +L(lb5):
> +	lb	a3, 4(a1)
> +	sb	a3, 4(a0)
> +L(lb4):
> +	lb	a3, 3(a1)
> +	sb	a3, 3(a0)
> +L(lb3):
> +	lb	a3, 2(a1)
> +	sb	a3, 2(a0)
> +L(lb2):
> +	lb	a3, 1(a1)
> +	sb	a3, 1(a0)
> +L(lb1):
> +	lb	a3, 0(a1)
> +	sb	a3, 0(a0)
L(lbx):
> +
> +	li	t9,8
> +	subu	t8,t9,t8
> +	PTR_SUBU a2,a2,t8
> +	PTR_ADDU a0,a0,t8
> +	PTR_ADDU a1,a1,t8
> +L(lb0):

This table is regular enough that I wonder if it wouldn't be better to do some
arithmetic instead of a branch-to-branch.  E.g.

	andi	t7,a0,7
	li	t8,L(lb0)-L(lbx)
	lsa	t8,t7,t8,8
	lapc	t9,L(lb0)
	selnez	t8,t8,t7
	PTR_SUBU t9,t9,t8
	jrc	t9

Which is certainly smaller than your 12 insns, unlikely to be slower on any
conceivable hardware, but probably faster on most.


r~
  
Ondrej Bilka Dec. 23, 2014, 8:30 p.m. UTC | #4
On Tue, Dec 23, 2014 at 09:52:56AM -0800, Richard Henderson wrote:
> On 12/23/2014 09:08 AM, Steve Ellcey wrote:
> > +	andi	t8,a0,7
> > +	lapc	t9,L(atable)
> > +	PTR_LSA	t9,t8,t9,2
> > +	jrc	t9
> > +L(atable):
> > +	bc	L(lb0)
> > +	bc	L(lb7)
> > +	bc	L(lb6)
> > +	bc	L(lb5)
> > +	bc	L(lb4)
> > +	bc	L(lb3)
> > +	bc	L(lb2)
> > +	bc	L(lb1)
> > +L(lb7):
> > +	lb	a3, 6(a1)
> > +	sb	a3, 6(a0)
> > +L(lb6):
> > +	lb	a3, 5(a1)
> > +	sb	a3, 5(a0)
> > +L(lb5):
> > +	lb	a3, 4(a1)
> > +	sb	a3, 4(a0)
> > +L(lb4):
> > +	lb	a3, 3(a1)
> > +	sb	a3, 3(a0)
> > +L(lb3):
> > +	lb	a3, 2(a1)
> > +	sb	a3, 2(a0)
> > +L(lb2):
> > +	lb	a3, 1(a1)
> > +	sb	a3, 1(a0)
> > +L(lb1):
> > +	lb	a3, 0(a1)
> > +	sb	a3, 0(a0)
> L(lbx):
> > +
> > +	li	t9,8
> > +	subu	t8,t9,t8
> > +	PTR_SUBU a2,a2,t8
> > +	PTR_ADDU a0,a0,t8
> > +	PTR_ADDU a1,a1,t8
> > +L(lb0):
> 
> This table is regular enough that I wonder if it wouldn't be better to do some
> arithmetic instead of a branch-to-branch.  E.g.
> 
> 	andi	t7,a0,7
> 	li	t8,L(lb0)-L(lbx)
> 	lsa	t8,t7,t8,8
> 	lapc	t9,L(lb0)
> 	selnez	t8,t8,t7
> 	PTR_SUBU t9,t9,t8
> 	jrc	t9
> 
> Which is certainly smaller than your 12 insns, unlikely to be slower on any
> conceivable hardware, but probably faster on most.
> 
Do you have that hardware? I already objected to the table approach but do not
have data. I wouldn't be surprised if it's slower than a byte-by-byte copy
with a check after each byte. Or just copy 8 bytes unconditionally, but I am
not sure how the hardware handles overlapping stores. The difference will be
bigger in practice: in profiling, around 50% of calls are 8-byte aligned, and
you save the address-calculation cost on those.
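
In rough C terms, the two alternatives being suggested look something like the
following (an illustrative sketch only; the names are invented, and the second
variant assumes at least 8 bytes remain to be copied):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Alternative 1: align the destination byte by byte, testing after each
   byte, with no table and no indirect jump.  */
static void
align_head_bytewise (unsigned char **dst, const unsigned char **src, size_t *n)
{
  while (((uintptr_t) *dst & 7) != 0 && *n > 0)
    {
      *(*dst)++ = *(*src)++;
      --*n;
    }
}

/* Alternative 2: copy 8 bytes unconditionally, then advance only by the
   number of bytes needed to align the destination.  The first store of the
   following aligned loop then overlaps the tail of these 8 bytes, which is
   the overlapping-store behaviour in question.  */
static void
align_head_overlap (unsigned char **dst, const unsigned char **src, size_t *n)
{
  size_t adv = (0 - (uintptr_t) *dst) & 7;   /* bytes to reach 8-byte alignment */
  memcpy (*dst, *src, 8);                    /* assumes *n >= 8 on this path */
  *dst += adv;
  *src += adv;
  *n -= adv;
}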
  
Matthew Fortune Dec. 23, 2014, 11:15 p.m. UTC | #5
Ondřej Bílka <neleai@seznam.cz>  writes:
> On Tue, Dec 23, 2014 at 09:52:56AM -0800, Richard Henderson wrote:
> > On 12/23/2014 09:08 AM, Steve Ellcey wrote:
> > > +	andi	t8,a0,7
> > > +	lapc	t9,L(atable)
> > > +	PTR_LSA	t9,t8,t9,2
> > > +	jrc	t9
> > > +L(atable):
> > > +	bc	L(lb0)
> > > +	bc	L(lb7)
> > > +	bc	L(lb6)
> > > +	bc	L(lb5)
> > > +	bc	L(lb4)
> > > +	bc	L(lb3)
> > > +	bc	L(lb2)
> > > +	bc	L(lb1)
> > > +L(lb7):
> > > +	lb	a3, 6(a1)
> > > +	sb	a3, 6(a0)
> > > +L(lb6):
> > > +	lb	a3, 5(a1)
> > > +	sb	a3, 5(a0)
> > > +L(lb5):
> > > +	lb	a3, 4(a1)
> > > +	sb	a3, 4(a0)
> > > +L(lb4):
> > > +	lb	a3, 3(a1)
> > > +	sb	a3, 3(a0)
> > > +L(lb3):
> > > +	lb	a3, 2(a1)
> > > +	sb	a3, 2(a0)
> > > +L(lb2):
> > > +	lb	a3, 1(a1)
> > > +	sb	a3, 1(a0)
> > > +L(lb1):
> > > +	lb	a3, 0(a1)
> > > +	sb	a3, 0(a0)
> > L(lbx):
> > > +
> > > +	li	t9,8
> > > +	subu	t8,t9,t8
> > > +	PTR_SUBU a2,a2,t8
> > > +	PTR_ADDU a0,a0,t8
> > > +	PTR_ADDU a1,a1,t8
> > > +L(lb0):
> >
> > This table is regular enough that I wonder if it wouldn't be better to
> > do some arithmetic instead of a branch-to-branch.  E.g.
> >
> > 	andi	t7,a0,7
> > 	li	t8,L(lb0)-L(lbx)
> > 	lsa	t8,t7,t8,8
> > 	lapc	t9,L(lb0)
> > 	selnez	t8,t8,t7
> > 	PTR_SUBU t9,t9,t8
> > 	jrc	t9
> >
> > Which is certainly smaller than your 12 insns, unlikely to be slower
> > on any conceivable hardware, but probably faster on most.
> >
> Do you have that hardware? I already objected versus table but do not
> have data. I wouldn't be surprised if its slower than byte-by-byte copy
> with if after each byte. Or just copy 8 bytes without condition but I am
> not sure how hardware handles overlapping stores. Difference will be
> bigger in practice, in profiling around 50% calls are 8 byte aligned and
> you save address calculation cost on these.

I think Richard's idea is good, but I do agree with Steve that the tried and
tested code should go in first and be optimized later. There is lots of
exploration to do with MIPSR6 and there are many new ways to optimize. If
we don't have R6 support in glibc 2.21 then there is a definite performance
regression on R6, as the R5/R2 code will trap and emulate on an R6 core, making
any non-trapping code several orders of magnitude faster.

Overall we are trying to hit as many package release dates as possible to
provide everyone with initial R6 support for experimentation. For glibc that
not only includes all the R6-specific patches from Steve but also requires
the .MIPS.abiflags (FPXX/FP64 ABI) patch from me.

Thanks,
Matthew
  
Ondrej Bilka Dec. 24, 2014, 2:30 p.m. UTC | #6
On Tue, Dec 23, 2014 at 11:15:02PM +0000, Matthew Fortune wrote:
> Ondřej Bílka <neleai@seznam.cz>  writes:
> > On Tue, Dec 23, 2014 at 09:52:56AM -0800, Richard Henderson wrote:
> > >
> > Do you have that hardware? I already objected versus table but do not
> > have data. I wouldn't be surprised if its slower than byte-by-byte copy
> > with if after each byte. Or just copy 8 bytes without condition but I am
> > not sure how hardware handles overlapping stores. Difference will be
> > bigger in practice, in profiling around 50% calls are 8 byte aligned and
> > you save address calculation cost on these.
> 
> I think Richard's idea is good but I do agree with Steve that the tried and
> tested code should go in first and then optimise it. There is lots of
> exploration to do with MIPSR6 and there are many new ways to optimize. If
> we don't have R6 support in glibc 2.21 then there is a definite performance
> regression on R6 as the R5/R2 code will trap and emulate on an R6 core making
> any non-trapping code several orders of magnitude better.
> 
> Overall we are trying to hit as many package release dates as possible to
> provide everyone with initial R6 support for experimentation. For GLIBC that
> not only includes all the specific R6 patches from Steve but also requires
> the .MIPS.abiflags (FPXX/FP64 ABI) patch from myself.
> 
That is a valid argument. If that is the objective, then keep the patch
simple (KISS) when sending it, as it will be easier to review and modify.
When you try to add optimizations, you can expect comments that there is a
better way to optimize it.
  
Joseph Myers Dec. 30, 2014, 8:50 p.m. UTC | #7
On Tue, 23 Dec 2014, Steve Ellcey wrote:

> +#if !defined (R6_CODE)

Just #ifndef.

> @@ -339,22 +426,22 @@ L(aligned):
>  	PREFETCH_FOR_STORE (3, a0)
>  #endif
>  #if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
> -# if PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE
> +#if PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE
>  	sltu    v1,t9,a0
>  	bgtz    v1,L(skip_set)
>  	nop
>  	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
>  L(skip_set):
> -# else
> +#else
>  	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*1)
> -# endif
> +#endif
>  #endif
>  #if defined(RETURN_LAST_PREFETCH) && defined(USE_PREFETCH) \
>      && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
>  	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*3)
> -# ifdef USE_DOUBLE
> +#ifdef USE_DOUBLE
>  	PTR_ADDIU v0,v0,32
> -# endif
> +#endif
>  #endif
>  L(loop16w):
>  	C_LD	t0,UNIT(0)(a1)

These indentation changes seem wrong.
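
For context, glibc indents nested preprocessor directives by one space after
the '#' per nesting level, which is what the original code followed; a minimal
sketch of the expected form:

#if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
# if PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE
	/* ... prepare-for-store variant ... */
# else
	/* ... default variant ... */
# endif
#endif

The hunks quoted above remove that space on the inner directives, losing the
nesting.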

> @@ -363,8 +450,12 @@ L(loop16w):
>  	bgtz	v1,L(skip_pref)
>  #endif
>  	C_LD	t1,UNIT(1)(a1)
> +#if defined(R6_CODE)

Just #ifdef.

> +#if defined(R6_CODE)

> +#if !defined(R6_CODE)

> +#if !defined (R6_CODE)

Likewise.

> @@ -523,15 +622,15 @@ L(ua_chk16w):
>  	PREFETCH_FOR_STORE (3, a0)
>  #endif
>  #if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
> -# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
> +#if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
>  	sltu    v1,t9,a0
>  	bgtz    v1,L(ua_skip_set)
>  	nop
>  	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
>  L(ua_skip_set):
> -# else
> +#else
>  	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*1)
> -# endif
> +#endif
>  #endif

More wrong indentation changes.
  

Patch

diff --git a/sysdeps/mips/memcpy.S b/sysdeps/mips/memcpy.S
index 7574fdc..1370e73 100644
--- a/sysdeps/mips/memcpy.S
+++ b/sysdeps/mips/memcpy.S
@@ -51,6 +51,13 @@ 
 #endif
 
 
+#if __mips_isa_rev > 5
+# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
+#  undef PREFETCH_STORE_HINT
+#  define PREFETCH_STORE_HINT PREFETCH_HINT_STORE_STREAMED
+# endif
+# define R6_CODE
+#endif
 
 /* Some asm.h files do not have the L macro definition.  */
 #ifndef L
@@ -79,6 +86,14 @@ 
 # endif
 #endif
 
+/* New R6 instructions that may not be in asm.h.  */
+#ifndef PTR_LSA
+# if _MIPS_SIM == _ABI64
+#  define PTR_LSA	dlsa
+# else
+#  define PTR_LSA	lsa
+# endif
+#endif
 
 /*
  * Using PREFETCH_HINT_LOAD_STREAMED instead of PREFETCH_LOAD on load
@@ -221,6 +236,7 @@ 
 #  define C_LDLO	ldl	/* low part is left in little-endian	*/
 #  define C_STLO	sdl	/* low part is left in little-endian	*/
 # endif
+# define C_ALIGN	dalign	/* r6 align instruction			*/
 #else
 # define C_ST	sw
 # define C_LD	lw
@@ -235,6 +251,7 @@ 
 #  define C_LDLO	lwl	/* low part is left in little-endian	*/
 #  define C_STLO	swl	/* low part is left in little-endian	*/
 # endif
+# define C_ALIGN	align	/* r6 align instruction			*/
 #endif
 
 /* Bookkeeping values for 32 vs. 64 bit mode.  */
@@ -285,6 +302,9 @@  L(memcpy):
 #else
 	move	v0,a0
 #endif
+
+#if !defined (R6_CODE)
+
 /*
  * If src and dst have different alignments, go to L(unaligned), if they
  * have the same alignment (but are not actually aligned) do a partial
@@ -305,6 +325,74 @@  L(memcpy):
 	C_STHI	t8,0(a0)
 	PTR_ADDU a0,a0,a3
 
+#else /* R6_CODE */
+
+/* 
+ * Align the destination and hope that the source gets aligned too.  If it
+ * doesn't we jump to L(r6_unaligned*) to do unaligned copies using the r6
+ * align instruction.
+ */
+	andi	t8,a0,7
+	lapc	t9,L(atable)
+	PTR_LSA	t9,t8,t9,2
+	jrc	t9
+L(atable):
+	bc	L(lb0)
+	bc	L(lb7)
+	bc	L(lb6)
+	bc	L(lb5)
+	bc	L(lb4)
+	bc	L(lb3)
+	bc	L(lb2)
+	bc	L(lb1)
+L(lb7):
+	lb	a3, 6(a1)
+	sb	a3, 6(a0)
+L(lb6):
+	lb	a3, 5(a1)
+	sb	a3, 5(a0)
+L(lb5):
+	lb	a3, 4(a1)
+	sb	a3, 4(a0)
+L(lb4):
+	lb	a3, 3(a1)
+	sb	a3, 3(a0)
+L(lb3):
+	lb	a3, 2(a1)
+	sb	a3, 2(a0)
+L(lb2):
+	lb	a3, 1(a1)
+	sb	a3, 1(a0)
+L(lb1):
+	lb	a3, 0(a1)
+	sb	a3, 0(a0)
+
+	li	t9,8
+	subu	t8,t9,t8
+	PTR_SUBU a2,a2,t8
+	PTR_ADDU a0,a0,t8
+	PTR_ADDU a1,a1,t8
+L(lb0):
+
+	andi	t8,a1,(NSIZE-1)
+	lapc	t9,L(jtable)
+	PTR_LSA	t9,t8,t9,2
+	jrc	t9
+L(jtable):
+        bc      L(aligned)
+        bc      L(r6_unaligned1)
+        bc      L(r6_unaligned2)
+        bc      L(r6_unaligned3)
+# ifdef USE_DOUBLE
+        bc      L(r6_unaligned4)
+        bc      L(r6_unaligned5)
+        bc      L(r6_unaligned6)
+        bc      L(r6_unaligned7)
+# endif
+#endif /* R6_CODE */
+
+L(aligned):
+
 /*
  * Now dst/src are both aligned to (word or double word) aligned addresses
  * Set a2 to count how many bytes we have to copy after all the 64/128 byte
@@ -313,7 +401,6 @@  L(memcpy):
  * equals a3.
  */
 
-L(aligned):
 	andi	t8,a2,NSIZEDMASK /* any whole 64-byte/128-byte chunks? */
 	beq	a2,t8,L(chkw)	 /* if a2==t8, no 64-byte/128-byte chunks */
 	PTR_SUBU a3,a2,t8	 /* subtract from a2 the reminder */
@@ -339,22 +426,22 @@  L(aligned):
 	PREFETCH_FOR_STORE (3, a0)
 #endif
 #if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
-# if PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE
+#if PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE
 	sltu    v1,t9,a0
 	bgtz    v1,L(skip_set)
 	nop
 	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
 L(skip_set):
-# else
+#else
 	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*1)
-# endif
+#endif
 #endif
 #if defined(RETURN_LAST_PREFETCH) && defined(USE_PREFETCH) \
     && (PREFETCH_STORE_HINT != PREFETCH_HINT_PREPAREFORSTORE)
 	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*3)
-# ifdef USE_DOUBLE
+#ifdef USE_DOUBLE
 	PTR_ADDIU v0,v0,32
-# endif
+#endif
 #endif
 L(loop16w):
 	C_LD	t0,UNIT(0)(a1)
@@ -363,8 +450,12 @@  L(loop16w):
 	bgtz	v1,L(skip_pref)
 #endif
 	C_LD	t1,UNIT(1)(a1)
+#if defined(R6_CODE)
+	PREFETCH_FOR_STORE (2, a0)
+#else
 	PREFETCH_FOR_STORE (4, a0)
 	PREFETCH_FOR_STORE (5, a0)
+#endif
 #if defined(RETURN_LAST_PREFETCH) && defined(USE_PREFETCH)
 	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*5)
 # ifdef USE_DOUBLE
@@ -378,7 +469,11 @@  L(skip_pref):
 	C_LD	REG5,UNIT(5)(a1)
 	C_LD	REG6,UNIT(6)(a1)
 	C_LD	REG7,UNIT(7)(a1)
-        PREFETCH_FOR_LOAD (4, a1)
+#if defined(R6_CODE)
+	PREFETCH_FOR_LOAD (3, a1)
+#else
+	PREFETCH_FOR_LOAD (4, a1)
+#endif
 
 	C_ST	t0,UNIT(0)(a0)
 	C_ST	t1,UNIT(1)(a0)
@@ -397,7 +492,9 @@  L(skip_pref):
 	C_LD	REG5,UNIT(13)(a1)
 	C_LD	REG6,UNIT(14)(a1)
 	C_LD	REG7,UNIT(15)(a1)
+#if !defined(R6_CODE)
         PREFETCH_FOR_LOAD (5, a1)
+#endif
 	C_ST	t0,UNIT(8)(a0)
 	C_ST	t1,UNIT(9)(a0)
 	C_ST	REG2,UNIT(10)(a0)
@@ -476,6 +573,8 @@  L(lastbloop):
 L(leave):
 	j	ra
 	nop
+
+#if !defined (R6_CODE)
 /*
  * UNALIGNED case, got here with a3 = "negu a0"
  * This code is nearly identical to the aligned code above
@@ -523,15 +622,15 @@  L(ua_chk16w):
 	PREFETCH_FOR_STORE (3, a0)
 #endif
 #if defined(RETURN_FIRST_PREFETCH) && defined(USE_PREFETCH)
-# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
+#if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
 	sltu    v1,t9,a0
 	bgtz    v1,L(ua_skip_set)
 	nop
 	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*4)
 L(ua_skip_set):
-# else
+#else
 	PTR_ADDIU v0,a0,(PREFETCH_CHUNK*1)
-# endif
+#endif
 #endif
 L(ua_loop16w):
 	PREFETCH_FOR_LOAD  (3, a1)
@@ -667,6 +766,59 @@  L(ua_smallCopy_loop):
 	j	ra
 	nop
 
+#else /* R6_CODE */
+
+# if __MIPSEB
+#  define SWAP_REGS(X,Y) X, Y
+#  define ALIGN_OFFSET(N) (N)
+# else
+#  define SWAP_REGS(X,Y) Y, X
+#  define ALIGN_OFFSET(N) (NSIZE-N)
+# endif
+# define R6_UNALIGNED_WORD_COPY(BYTEOFFSET) \
+	andi	REG7, a2, (NSIZE-1);/* REG7 is # of bytes to by bytes.     */ \
+	beq	REG7, a2, L(lastb); /* Check for bytes to copy by word	   */ \
+	PTR_SUBU a3, a2, REG7;	/* a3 is number of bytes to be copied in   */ \
+				/* (d)word chunks.			   */ \
+	move	a2, REG7;	/* a2 is # of bytes to copy byte by byte   */ \
+				/* after word loop is finished.		   */ \
+	PTR_ADDU REG6, a0, a3;	/* REG6 is the dst address after loop.	   */ \
+	PTR_SUBU REG2, a1, t8;	/* REG2 is the aligned src address.	   */ \
+	PTR_ADDU a1, a1, a3;	/* a1 is addr of source after word loop.   */ \
+	C_LD	t0, UNIT(0)(REG2);  /* Load first part of source.	   */ \
+L(r6_ua_wordcopy##BYTEOFFSET):						      \
+	C_LD	t1, UNIT(1)(REG2);  /* Load second part of source.	   */ \
+	C_ALIGN	REG3, SWAP_REGS(t1,t0), ALIGN_OFFSET(BYTEOFFSET);	      \
+	PTR_ADDIU a0, a0, UNIT(1);  /* Increment destination pointer.	   */ \
+	PTR_ADDIU REG2, REG2, UNIT(1); /* Increment aligned source pointer.*/ \
+	move	t0, t1;		/* Move second part of source to first.	   */ \
+	bne	a0, REG6,L(r6_ua_wordcopy##BYTEOFFSET);			      \
+	C_ST	REG3, UNIT(-1)(a0);					      \
+	j	L(lastb);						      \
+	nop
+
+	/* We are generating R6 code, the destination is 4 byte aligned and
+	   the source is not 4 byte aligned. t8 is 1, 2, or 3 depending on the
+           alignment of the source.  */
+
+L(r6_unaligned1):
+	R6_UNALIGNED_WORD_COPY(1)
+L(r6_unaligned2):
+	R6_UNALIGNED_WORD_COPY(2)
+L(r6_unaligned3):
+	R6_UNALIGNED_WORD_COPY(3)
+# ifdef USE_DOUBLE
+L(r6_unaligned4):
+	R6_UNALIGNED_WORD_COPY(4)
+L(r6_unaligned5):
+	R6_UNALIGNED_WORD_COPY(5)
+L(r6_unaligned6):
+	R6_UNALIGNED_WORD_COPY(6)
+L(r6_unaligned7):
+	R6_UNALIGNED_WORD_COPY(7)
+# endif
+#endif /* R6_CODE */
+
 	.set	at
 	.set	reorder
 END(MEMCPY_NAME)