[MIPS] Modify memset.S for mips32r6/mips64r6
Commit Message
Here is the last of my patches for mips32r6/mips64r6 support. It updates
memset to use byte copies instead of swl/sdl to align the destination
because those instructions are not supported in mips32r6 or mips64r6.
It also avoids using the 'prepare for store' prefetch hint because that
is not supported on mips32r6 or mips64r6 either.
Tested with the mips32r6/mips64r6 GCC, binutils and qemu simulator.
OK to checkin?
Steve Ellcey
sellcey@imgtec.com
2014-12-19 Steve Ellcey <sellcey@imgtec.com>
* sysdeps/mips/memset.S (memset): Modify for mips32r6/mips64r6
to avoid using swl/sdl to align the destination.
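The realignment change described above can be sketched in C (illustrative only; the function name memset_r6_model is not from the patch, and the real code is MIPS assembly):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the r6 strategy: mips32r6/mips64r6 drop the
   unaligned store-left/store-right instructions, so the destination is
   aligned with plain byte stores instead of one unaligned store.  */
static void *
memset_r6_model (void *dst, int c, size_t n)
{
  unsigned char *p = dst;

  /* Byte stores until p reaches an 8-byte boundary (or n runs out).  */
  while (n > 0 && ((uintptr_t) p & 7) != 0)
    {
      *p++ = (unsigned char) c;
      n--;
    }

  /* Aligned bulk: the assembly uses word/doubleword stores here;
     library memset stands in for that part of the sketch.  */
  memset (p, c, n);
  return dst;
}
```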
Comments
> +#if __mips_isa_rev > 5
I just so happened to be starting some mips2 tests with glibc
and found that __mips_isa_rev is not defined below mips32.
This is then an error given that -Werror is in force.
i.e. I think we will need a defined(__mips_isa_rev) in any of the
conditions which use it. This is likely to affect all of the R6
patches.
Thanks,
Matthew
On Fri, 19 Dec 2014, Matthew Fortune wrote:
> > +#if __mips_isa_rev > 5
>
> I just so happened to be starting some mips2 tests with glibc
> and found that __mips_isa_rev is not defined below mips32.
> This is then an error given that -Werror is in force.
We still have -Wno-error=undef, so it *shouldn't* be an error (but once
Siddhesh's remaining -Wundef fixes are in, and any further fixes needed to
get clean for -Wundef in at least one configuration, we should remove
-Wno-error=undef, so such issues certainly need fixing soon).
> i.e. I think we will need a defined(__mips_isa_rev) in any of the
> conditions which use it. This is likely to affect all of the R6
> patches.
Please put this in a central place (a single header that defines
__mips_isa_rev to 0 if not already defined, for example) rather than
duplicating "defined" tests in different places ("defined" tests are
typo-prone and so discouraged in glibc).
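Such a central fallback might look like this (the placement in a sysdeps header is hypothetical; the point is a single definition site):

```c
/* Hypothetical central fallback: define __mips_isa_rev to 0 when the
   compiler does not define it (anything below mips32), so tests like
   "#if __mips_isa_rev > 5" are -Wundef-clean without scattering
   defined(__mips_isa_rev) checks through every condition.  */
#ifndef __mips_isa_rev
# define __mips_isa_rev 0
#endif
```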
Joseph Myers <joseph@codesourcery.com> writes:
> On Fri, 19 Dec 2014, Matthew Fortune wrote:
>
> > > +#if __mips_isa_rev > 5
> >
> > I just so happened to be starting some mips2 tests with glibc and
> > found that __mips_isa_rev is not defined below mips32.
> > This is then an error given that -Werror is in force.
>
> We still have -Wno-error=undef, so it *shouldn't* be an error (but once
> Siddhesh's remaining -Wundef fixes are in, and any further fixes needed
> to get clean for -Wundef in at least one configuration, we should remove
> -Wno-error=undef, so such issues certainly need fixing soon).
Ah, OK. I think my build must have failed for a different reason
and there was a __mips_isa_rev warning close to the end that I
misread as an error.
I guess we can leave it to a cleanup commit for all the instances
then.
Matthew
On Fri, Dec 19, 2014 at 03:26:44PM -0800, Steve Ellcey wrote:
> Here is the last of my patches for mips32r6/mips64r6 support. It updates
> memset to use byte copies instead of swl/sdl to align the destination
> because those instructions are not supported in mips32r6 or mips64r6.
> It also avoids using the 'prepare for store' prefetch hint because that
> is not supported on mips32r6 or mips64r6 either.
>
> Tested with the mips32r6/mips64r6 GCC, binutils and qemu simulator.
>
> OK to checkin?
>
> Steve Ellcey
> sellcey@imgtec.com
>
>
> PTR_ADDU a0,a0,t2
> +#else /* R6_CODE */
> + andi t2,a0,7
> + lapc t9,L(atable)
> + PTR_LSA t9,t2,t9,2
> + jrc t9
> +L(atable):
> + bc L(aligned)
> + bc L(lb7)
That could be a performance regression; test whether it is faster than
the existing loop on unpredictable branches [B].
Also try whether plain branches are better, as in the following C code [A].
A table lookup could be even slower in real workloads, as it adds latency
when the table is not in cache.
From a practical standpoint the realigning code looks like dead code: on
x64, 83 percent of calls are 16-byte aligned and I cannot find an
application that makes calls unaligned to 8 bytes.
You will get a better speedup by adding a check for whether the
destination is already aligned and moving the realignment code to the
bottom of the file to improve instruction cache usage.
[A]
  unsigned char *x = dst;
  if ((uintptr_t) x & 1)
    *x++ = mask;                /* byte store, advance to 2-byte boundary */
  if ((uintptr_t) x & 2)
    {
      *(uint16_t *) x = mask16; /* mask replicated into 16 bits */
      x += 2;
    }
  if ((uintptr_t) x & 4)
    {
      *(uint32_t *) x = mask32; /* mask replicated into 32 bits */
      x += 4;
    }
[B]
#include <string.h>

int main (void)
{
  char foo[100];
  int i;
  for (i = 0; i < 100000000; i++)
    memset (foo + (i % 16), 1, 32 - (i % 16));
  return foo[17];
}
On Sat, 20 Dec 2014, Matthew Fortune wrote:
> Joseph Myers <joseph@codesourcery.com> writes:
> > On Fri, 19 Dec 2014, Matthew Fortune wrote:
> >
> > > > +#if __mips_isa_rev > 5
> > >
> > > I just so happened to be starting some mips2 tests with glibc and
> > > found that __mips_isa_rev is not defined below mips32.
> > > This is then an error given that -Werror is in force.
> >
> > We still have -Wno-error=undef, so it *shouldn't* be an error (but once
> > Siddhesh's remaining -Wundef fixes are in, and any further fixes needed
> > to get clean for -Wundef in at least one configuration, we should remove
> > -Wno-error=undef, so such issues certainly need fixing soon).
>
> Ah, OK. I think my build must have failed for a different reason
> and there was a __mips_isa_rev warning close to the end that I
> misread as an error.
>
> I guess we can leave it to a cleanup commit for all the instances
> then.
I'll wait for the cleanup and revised versions of the memset / memcpy
patches (on the basis that the identified -Wundef issue means that the
memset / memcpy patches can't go in as-is).
On Fri, 19 Dec 2014, Steve Ellcey wrote:
> @@ -188,9 +205,9 @@ LEAF(MEMSET_NAME)
>
> .set nomips16
> .set noreorder
> -/* If the size is less than 2*NSIZE (8 or 16), go to L(lastb). Regardless of
> +/* If the size is less than 4*NSIZE (16 or 32), go to L(lastb). Regardless of
> size, copy dst pointer to v0 for the return value. */
> - slti t2,a2,(2 * NSIZE)
> + slti t2,a2,(4 * NSIZE)
> bne t2,zero,L(lastb)
> move v0,a0
>
This doesn't appear to have anything obvious to do with r6. Please submit
it separately from the r6 changes, with its own rationale (or if it is in
fact needed as part of the r6 changes, please explain further).
In general, r6 changes should not result in any changes to the code in
glibc built for non-r6 - if the best way to support r6 also involves
changes to code for non-r6 (e.g. if some cleanup or optimization naturally
results in r6 support because the non-r6-compatible code was bad in some
way and r6-compatible code is better for all versions, or if code
compatible with both r6 and pre-r6 can be just as good on all versions as
the existing non-r6-compatible code) then it's best to submit those
changes separately.
On Mon, 2014-12-22 at 18:02 +0000, Joseph Myers wrote:
> On Fri, 19 Dec 2014, Steve Ellcey wrote:
>
> > @@ -188,9 +205,9 @@ LEAF(MEMSET_NAME)
> >
> > .set nomips16
> > .set noreorder
> > -/* If the size is less than 2*NSIZE (8 or 16), go to L(lastb). Regardless of
> > +/* If the size is less than 4*NSIZE (16 or 32), go to L(lastb). Regardless of
> > size, copy dst pointer to v0 for the return value. */
> > - slti t2,a2,(2 * NSIZE)
> > + slti t2,a2,(4 * NSIZE)
> > bne t2,zero,L(lastb)
> > move v0,a0
> >
>
> This doesn't appear to have anything obvious to do with r6. Please submit
> it separately from the r6 changes, with its own rationale (or if it is in
> fact needed as part of the r6 changes, please explain further).
>
> In general, r6 changes should not result in any changes to the code in
> glibc built for non-r6 - if the best way to support r6 also involves
> changes to code for non-r6 (e.g. if some cleanup or optimization naturally
> results in r6 support because the non-r6-compatible code was bad in some
> way and r6-compatible code is better for all versions, or if code
> compatible with both r6 and pre-r6 can be just as good on all versions as
> the existing non-r6-compatible code) then it's best to submit those
> changes separately.
The change is needed for r6 because I now align the buffer to 8 bytes,
instead of to 4 bytes in 32-bit mode and 8 bytes in 64-bit mode. This
should help with cache alignment and, depending on the CPU, with memory
bonding, where two sequential 4-byte loads can be handled as if they were
a single 8-byte load.
Without the slti change (assuming O32 mode), if I have a non-aligned 9
byte buffer I could store up to 7 bytes to get it 8-byte aligned, and
that leaves less than a full 4-byte word to be set. The existing code
after this check assumes there is at least one full word that needs to
be set.
I considered changing the alignment code to only align on a 4-byte
boundary in O32 mode, or ifdef'ing this test, but it seemed cleaner to
increase the minimum size of buffers that get handled via a simple byte
copy loop for both r6 and earlier CPUs.
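The worst case can be checked with a little arithmetic (an illustrative C helper, not glibc code, using the O32 value NSIZE == 4 and the 8-byte alignment discussed above):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes left for the word-store path after byte-aligning to an 8-byte
   boundary, given the destination's offset within an 8-byte block.
   Hypothetical helper for checking the cutoff; the real code is asm.  */
static size_t
bytes_after_align (size_t addr_mod_8, size_t len)
{
  size_t fixup = addr_mod_8 ? 8 - addr_mod_8 : 0;
  return len - fixup;
}
```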
If you don't like this part of the patch I could go back to aligning on
a 4 byte boundary in 32 bit mode for now and get some performance data
on the different alignment options later.
Steve Ellcey
sellcey@imgtec.com
On Mon, 22 Dec 2014, Steve Ellcey wrote:
> I considered changing the alignment code to only align on a 4 byte
> boundary for O32 mode, or ifdef'ing this test but it seemed cleaner to
> increase the minimum size of buffers that get handled via a simple byte
> copy loop for both r6 and earlier CPUs.
In that case, submit the change as a preparatory patch, with its own
justification for being OK for existing CPUs (that it doesn't affect
performance, or whatever), so that the r6 patch can avoid changing pre-r6
code.
@@ -54,6 +54,14 @@
# endif
#endif
+#if __mips_isa_rev > 5
+# if (PREFETCH_STORE_HINT == PREFETCH_HINT_PREPAREFORSTORE)
+# undef PREFETCH_STORE_HINT
+# define PREFETCH_STORE_HINT PREFETCH_HINT_STORE_STREAMED
+# endif
+# define R6_CODE
+#endif
+
/* Some asm.h files do not have the L macro definition. */
#ifndef L
# if _MIPS_SIM == _ABIO32
@@ -72,6 +80,15 @@
# endif
#endif
+/* New R6 instructions that may not be in asm.h. */
+#ifndef PTR_LSA
+# if _MIPS_SIM == _ABI64
+# define PTR_LSA dlsa
+# else
+# define PTR_LSA lsa
+# endif
+#endif
+
/* Using PREFETCH_HINT_PREPAREFORSTORE instead of PREFETCH_STORE
or PREFETCH_STORE_STREAMED offers a large performance advantage
but PREPAREFORSTORE has some special restrictions to consider.
@@ -188,9 +205,9 @@ LEAF(MEMSET_NAME)
.set nomips16
.set noreorder
-/* If the size is less than 2*NSIZE (8 or 16), go to L(lastb). Regardless of
+/* If the size is less than 4*NSIZE (16 or 32), go to L(lastb). Regardless of
size, copy dst pointer to v0 for the return value. */
- slti t2,a2,(2 * NSIZE)
+ slti t2,a2,(4 * NSIZE)
bne t2,zero,L(lastb)
move v0,a0
@@ -231,11 +248,46 @@ LEAF(MEMSET_NAME)
/* If the destination address is not aligned do a partial store to get it
aligned. If it is already aligned just jump to L(aligned). */
L(set0):
+#if !defined (R6_CODE)
andi t2,a3,(NSIZE-1) /* word-unaligned address? */
beq t2,zero,L(aligned) /* t2 is the unalignment count */
PTR_SUBU a2,a2,t2
C_STHI a1,0(a0)
PTR_ADDU a0,a0,t2
+#else /* R6_CODE */
+ andi t2,a0,7
+ lapc t9,L(atable)
+ PTR_LSA t9,t2,t9,2
+ jrc t9
+L(atable):
+ bc L(aligned)
+ bc L(lb7)
+ bc L(lb6)
+ bc L(lb5)
+ bc L(lb4)
+ bc L(lb3)
+ bc L(lb2)
+ bc L(lb1)
+L(lb7):
+ sb a1,6(a0)
+L(lb6):
+ sb a1,5(a0)
+L(lb5):
+ sb a1,4(a0)
+L(lb4):
+ sb a1,3(a0)
+L(lb3):
+ sb a1,2(a0)
+L(lb2):
+ sb a1,1(a0)
+L(lb1):
+ sb a1,0(a0)
+
+ li t9,8
+ subu t2,t9,t2
+ PTR_SUBU a2,a2,t2
+ PTR_ADDU a0,a0,t2
+#endif /* R6_CODE */
L(aligned):
/* If USE_DOUBLE is not set we may still want to align the data on a 16
@@ -286,8 +338,12 @@ L(loop16w):
bgtz v1,L(skip_pref)
nop
#endif
+#if defined(R6_CODE)
+ PREFETCH_FOR_STORE (2, a0)
+#else
PREFETCH_FOR_STORE (4, a0)
PREFETCH_FOR_STORE (5, a0)
+#endif
L(skip_pref):
C_ST a1,UNIT(0)(a0)
C_ST a1,UNIT(1)(a0)
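The R6 realignment hunk above can be modeled in C roughly as follows (a sketch, not the shipped implementation; the switch fall-through mirrors the bc table dispatching into L(lb7)..L(lb1)):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* C model of the R6 jump-table realignment: for a misalignment
   m = addr & 7, the table dispatches to L(lb(8-m)) and the sb stores
   fall through, so exactly 8 - m bytes get written.  */
static size_t
realign_bytes (unsigned char *p, unsigned char c)
{
  unsigned int m = (uintptr_t) p & 7;   /* andi t2,a0,7 */
  if (m == 0)
    return 0;                           /* bc L(aligned) */
  unsigned int n = 8 - m;               /* li t9,8; subu t2,t9,t2 */
  switch (n)                            /* lapc/PTR_LSA/jrc dispatch */
    {
    case 7: p[6] = c;                   /* L(lb7): sb a1,6(a0) */
    case 6: p[5] = c;
    case 5: p[4] = c;
    case 4: p[3] = c;
    case 3: p[2] = c;
    case 2: p[1] = c;
    case 1: p[0] = c;                   /* L(lb1): sb a1,0(a0) */
    }
  return n;  /* caller then advances a0 and shrinks a2 by n bytes */
}
```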