MIPS memcpy performance improvement

Message ID 40453179-a1ca-4897-971b-a7197772fb88@BAMAIL02.ba.imgtec.org
State Committed

Commit Message

Steve Ellcey Oct. 16, 2015, 7:21 p.m. UTC
  It was brought to my attention that the MIPS N32 (and N64) memcpy was slower
than the MIPS O32 memcpy for small (less than 16 bytes) aligned copies.
This is because, for sizes of 8 to 15 bytes, the O32 memcpy would do two
or three word copies followed by byte copies, but the N32 version would do
all byte copies; for example, a 12-byte aligned copy took three word copies
under O32 but twelve byte copies under N32.  Basically, the N32 version did
not fall back to doing word copies when it could not do double-word copies.

This patch addresses the problem with two changes.  The first actually
affects large copies on N32: after doing as many double-word copies as
possible, the N32 version will now try to do at least one word copy before
going to byte copies, roughly as sketched below.
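
In C terms the new tail handling (the L(lastw) label in the patch) behaves
roughly as follows.  This is an illustrative sketch, not the committed
assembly, and the function name is made up for the example.

#include <stddef.h>
#include <stdint.h>

/* Sketch of the new L(lastw) step on N32/N64 (USE_DOUBLE builds).
   On entry n is the 0..7 byte remainder left over from the
   double-word loop and both pointers are still word aligned, so at
   most one 4-byte copy is possible before the byte loop finishes.  */
static void
copy_tail_after_doublewords (unsigned char *dst, const unsigned char *src,
                             size_t n)
{
  if (n >= 4)                   /* andi t8,a2,3; beq t8,a2,L(lastb) */
    {
      /* Mirrors the lw/sw pair; the double-word loop guaranteed the
         alignment, so a direct word access is safe here.  */
      *(uint32_t *) dst = *(const uint32_t *) src;
      dst += 4;
      src += 4;
      n -= 4;
    }
  while (n-- > 0)               /* byte copies, as before */
    *dst++ = *src++;
}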

The other change is that after determining that a memcpy is small (less
than 8 bytes for the O32 ABI, less than 16 bytes for the N32 or N64 ABI),
instead of just doing byte copies it will check the size and alignment of
the inputs and, if possible, do word copies (followed by byte copies if
needed).  If word copies are not possible due to size or alignment, it
drops back to byte copies as before; see the sketch after this paragraph.
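
A C sketch of that dispatch (the L(lasts) entry point in the patch) is
shown below; again the function name and structure are illustrative
rather than the committed code.

#include <stddef.h>
#include <stdint.h>

/* Sketch of the new L(lasts) path for copies below 2*NSIZE bytes
   (n < 8 on O32, n < 16 on N32/N64).  Word copies are attempted only
   when n >= 4 and both pointers are 4-byte aligned.  */
static void
copy_small (unsigned char *dst, const unsigned char *src, size_t n)
{
  if (n >= 4                            /* andi t8,a2,3; beq t8,a2,L(lastb) */
      && ((uintptr_t) dst & 3) == 0     /* andi t9,a0,3 */
      && ((uintptr_t) src & 3) == 0)    /* andi t9,a1,3 */
    {
      do                                /* L(wcopy_loop) */
        {
          *(uint32_t *) dst = *(const uint32_t *) src;  /* lw/sw */
          dst += 4;
          src += 4;
          n -= 4;
        }
      while (n >= 4);
    }
  while (n-- > 0)                       /* L(lastb): last 0-3 bytes */
    *dst++ = *src++;
}

The do/while mirrors the assembly, which enters the word loop only after
the checks have guaranteed at least one full word to copy.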

The glibc memcpy benchmark does not have any tests that catch the first
case (though my own testing showed a small improvement), but it does test
the second case.  For the second case, with inputs of length 4 to 15 bytes
(depending on the ABI), the tests are slower for unaligned copies and
faster for aligned ones.  There is also a slowdown for copies of less than
4 bytes regardless of alignment, presumably because the new size and
alignment checks add overhead without enabling any word copies.
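
For context, the three columns in the tables below are glibc's memcpy,
the compiler's __builtin_memcpy, and the benchmark's simple_memcpy
reference.  The last of these is, roughly speaking, a naive byte loop;
the version below is a sketch of what it does, not a quote of
benchtests/bench-memcpy.c.

#include <stddef.h>

/* Naive byte-at-a-time copy, approximating the simple_memcpy
   baseline in glibc's memcpy benchmark.  */
static char *
simple_memcpy_sketch (char *dst, const char *src, size_t n)
{
  char *ret = dst;
  while (n--)
    *dst++ = *src++;
  return ret;
}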

For example with O32 (lower numbers are better):

Original:
				memcpy	builtin_memcpy	simple_memcpy
	Length    1, alignment  0/ 0:	54.3906	54.0156	37.9062
	Length    4, alignment  0/ 0:	66.8438	70.6562	65.2969
	Length    4, alignment  2/ 0:	65.8438	71.1406	65.25
	Length    8, alignment  0/ 0:	73.7344	82.2656	74.2969
	Length    8, alignment  3/ 0:	74.3906	76.6875	74.25
With change:
				memcpy	builtin_memcpy	simple_memcpy
	Length    1, alignment  0/ 0:	61.7031	51.8125	37.2656
	Length    4, alignment  0/ 0:	50.1094	54.5	66.7344
	Length    4, alignment  2/ 0:	72.0312	77.0156	65.1719
	Length    8, alignment  0/ 0:	72.6406	76.4531	74.125
	Length    8, alignment  3/ 0:	80.9375	84.2969	74.125

Or with N32:

Original:
				memcpy	builtin_memcpy	simple_memcpy
	Length    1, alignment  0/ 0:	57.7188	52.5156	35.687
	Length    4, alignment  0/ 0:	66.1719	75.9531	63.4531
	Length    4, alignment  2/ 0:	66.7344	75.4531	64.1719
	Length    8, alignment  0/ 0:	76.7656	85.5469	72.625
	Length    8, alignment  3/ 0:	75.6094	84.9062	73.7031
New:
				memcpy	builtin_memcpy	simple_memcpy
	Length    1, alignment  0/ 0:	64.3594	54.2344	35.4219
	Length    4, alignment  0/ 0:	49.125	59.3281	64.7031
	Length    4, alignment  2/ 0:	74.5469	77.3906	63.6562
	Length    8, alignment  0/ 0:	57.25	69.0312	73.2188
	Length    8, alignment  3/ 0:	94.5	97.9688	73.7031

I have the complete benchmark runs if anyone wants them, but this
shows the overall pattern.  I also ran the correctness tests and
verified that there are no regressions.

OK to check in?

Steve Ellcey
sellcey@imgtec.com


2015-10-16  Steve Ellcey  <sellcey@imgtec.com>

	* sysdeps/mips/memcpy.S (memcpy):  Add word copies for small aligned
	data.
  

Comments

Joseph Myers Oct. 16, 2015, 8:19 p.m. UTC | #1
On Fri, 16 Oct 2015, Steve Ellcey  wrote:

> 2015-10-16  Steve Ellcey  <sellcey@imgtec.com>
> 
> 	* sysdeps/mips/memcpy.S (memcpy):  Add word copies for small aligned
> 	data.

OK.
  

Patch

diff --git a/sysdeps/mips/memcpy.S b/sysdeps/mips/memcpy.S
index c85935b..6f63405 100644
--- a/sysdeps/mips/memcpy.S
+++ b/sysdeps/mips/memcpy.S
@@ -295,7 +295,7 @@  L(memcpy):
  * size, copy dst pointer to v0 for the return value.
  */
 	slti	t2,a2,(2 * NSIZE)
-	bne	t2,zero,L(lastb)
+	bne	t2,zero,L(lasts)
 #if defined(RETURN_FIRST_PREFETCH) || defined(RETURN_LAST_PREFETCH)
 	move	v0,zero
 #else
@@ -546,7 +546,7 @@  L(chkw):
  */
 L(chk1w):
 	andi	a2,t8,(NSIZE-1)	/* a2 is the reminder past one (d)word chunks */
-	beq	a2,t8,L(lastb)
+	beq	a2,t8,L(lastw)
 	PTR_SUBU a3,t8,a2	/* a3 is count of bytes in one (d)word chunks */
 	PTR_ADDU a3,a0,a3	/* a3 is the dst address after loop */
 
@@ -558,6 +558,20 @@  L(wordCopy_loop):
 	bne	a0,a3,L(wordCopy_loop)
 	C_ST	REG3,UNIT(-1)(a0)
 
+/* If we have been copying double words, see if we can copy a single word
+   before doing byte copies.  We can have, at most, one word to copy.  */
+
+L(lastw):
+#ifdef USE_DOUBLE
+	andi    t8,a2,3		/* a2 is the remainder past 4 byte chunks.  */
+	beq	t8,a2,L(lastb)
+	lw	REG3,0(a1)
+	sw	REG3,0(a0)
+	PTR_ADDIU a0,a0,4
+	PTR_ADDIU a1,a1,4
+	move	a2,t8
+#endif
+
 /* Copy the last 8 (or 16) bytes */
 L(lastb):
 	blez	a2,L(leave)
@@ -572,6 +586,33 @@  L(leave):
 	j	ra
 	nop
 
+/* We jump here with a memcpy of less than 8 or 16 bytes, depending on
+   whether or not USE_DOUBLE is defined.  Instead of just doing byte
+   copies, check the alignment and size and use lw/sw if possible.
+   Otherwise, do byte copies.  */
+
+L(lasts):
+	andi	t8,a2,3
+	beq	t8,a2,L(lastb)
+
+	andi	t9,a0,3
+	bne	t9,zero,L(lastb)
+	andi	t9,a1,3
+	bne	t9,zero,L(lastb)
+
+	PTR_SUBU a3,a2,t8
+	PTR_ADDU a3,a0,a3
+
+L(wcopy_loop):
+	lw	REG3,0(a1)
+	PTR_ADDIU a0,a0,4
+	PTR_ADDIU a1,a1,4
+	bne	a0,a3,L(wcopy_loop)
+	sw	REG3,-4(a0)
+
+	b	L(lastb)
+	move	a2,t8
+
 #ifndef R6_CODE
 /*
  * UNALIGNED case, got here with a3 = "negu a0"