[AArch64] Optimized strcpy

Message ID 54917329.4090601@arm.com
State Superseded

Commit Message

Richard Earnshaw Dec. 17, 2014, 12:12 p.m. UTC
  This patch contains an optimized implementation of strcpy for AArch64
systems.  Benchmarking shows that it is approximately 20-25% faster than
the generic implementation across the board.

R.

<date>  Richard Earnshaw  <rearnsha@arm.com>

	* sysdeps/aarch64/strcpy.S: New file.
  

Comments

Ondrej Bilka Dec. 18, 2014, 1:05 a.m. UTC | #1
On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
> This patch contains an optimized implementation of strcpy for AArch64
> systems.  Benchmarking shows that it is approximately 20-25% faster than
> the generic implementation across the board.
> 
I looked quickly at the patch; I found two possible micro-optimizations
below and a probable performance problem.

Handling sizes 1-8 is definitely not a slow path, it's the hot path.  My
profiler shows that 88.36% of calls use less than 16 bytes, and the 1-8
byte range is more likely than 9-16 bytes, so you should optimize that
case well.

See the number-of-calls graph for the strcpy function at

http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/results/result.html

My three main strcpy users are mutt, firefox and bash/sh.  As the first
two are interactive it's hard to benchmark them directly, so you should
try how this affects bash.

You could measure bash running time directly, as it uses strcpy
relatively often.  I measured the following bash script; if you
LD_PRELOAD a byte-by-byte loop [A], it decreases performance by 10%.

http://kam.mff.cuni.cz/~ondra/bashtest.sh

Using bash only differs a bit from the aggregated case: there is an
additional peak at ~500 bytes caused by copying the PATH variable.
Otherwise it does things like copying each command name ten times
before running it, etc.  The graph is here.

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/result.html


[A]

char *
strcpy (char *dest, const char *src)
{
  char *destp = dest;

  /* Copy bytes until the terminating NUL is found in the source.  */
  while (*src)
    *dest++ = *src++;

  *dest = '\0';

  return destp;
}


and comments:
> 
> +ENTRY_ALIGN (strcpy,6)
> +	mov	zeroones, #REP8_01
> +	mov	dst, dstin
> +	ands	tmp1, src, #15
> +	b.ne	L(misaligned)
> +	/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
> +	   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
> +	   can be done in parallel across the entire word.  */
> +	/* The inner loop deals with two Dwords at a time.  This has a
> +	   slightly higher start-up cost, but we should win quite quickly,
> +	   especially on cores with a high number of issue slots per
> +	   cycle, as we get much better parallelism out of the operations.  */
> +	b	L(first_pass)

Why do you use a branch instead of moving it here?

> +L(main_loop):

Is this aligned because you calculated the size of the instructions
above, or is the alignment directive just missing?

...
> +L(nul_in_data1):
> +	/* Slow path.  We can't be sure we've moved at least 8 bytes, so
> +	   fall back to a slow byte-by byte store of the bits already
> +	   loaded.
> +
> +	   The worst case when coming through this path is that we've had
> +	   to copy seven individual bytes to get to alignment and we then
> +	   have to copy another seven (eight for big-endian) again here.
> +	   We could try to detect that case (and any case where more than
> +	   eight bytes have to be copied), but it really doesn't seem
> +	   worth it.  */

Which happens frequently; see above.
  
Adhemerval Zanella Netto Dec. 18, 2014, 1:44 a.m. UTC | #2
On 17-12-2014 23:05, Ondřej Bílka wrote:
> On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
>> This patch contains an optimized implementation of strcpy for AArch64
>> systems.  Benchmarking shows that it is approximately 20-25% faster than
>> the generic implementation across the board.
>>
> I looked quickly for patch, I found two microoptimizations below and
> probable performance problem.
>
> Handing sizes 1-8 is definitely not slow path, its hot path. My profiler
> shows that 88.36% of calls use less than 16 bytes and 1-8 byte range is
> more likely than 9-16 bytes so you should optimize that case well.
>
> See number of calls graph in strcpy function at
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/results/result.html
>
> My main three strcpy users are mutt, firefox, bash/sh. As first two are
> interactive its hard to do direct benchmark. So you should try how that
> affects bash.
>
> You could try to measure bash running time directly as it uses strcpy
> relatively often. I measured following bash script and if you LD_PRELOAD
> a byte-by-byte loop [A] then it decreases performance by 10%.
>
> http://kam.mff.cuni.cz/~ondra/bashtest.sh

I gave this benchmark a try and, at least on powerpc64, I could not see
*any* real strcpy usage while running bash (version 4.3.11).  Profilers
also do not show strcpy being a hotspot.  Also, and I don't know if it
is intentional, it is invoking 'gnuplot2'.

Also, the strcpy usage you characterized comes straight from desktop
workloads, which is a limited usage pattern.  For powerpc64, for
example, 'firefox' and 'mutt' are far from the default workloads
running on it.

So the point is not whether '1-8 bytes *is* the hot path' or whether
your profiler shows that most calls are less than 16 bytes, but what
kind of workloads the aarch64 implementation is trying to address here.
It would be good if the patch submission described better what kind of
testing was done, e.g. whether only the GLIBC benchtests were run or
whether it is trying to optimize for a real usage case.

>
> Using bash only differs bit from aggegated case, there is additional peak 
> at ~500 bytes caused by copying PATH variable. Otherwise it does things 
> like copying each command name ten times before it runs it etc. Graph is
> here.
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/result.html
>
>
> [A]
>
> char *
> strcpy (char *dest, const char *src)
> {
>   char *destp = dest;
>   while (*dest)
>     *dest++ = *src++;
>
>   *dest = '\0';
>
>   return destp;
> }
>
>
> and comments:
>> +ENTRY_ALIGN (strcpy,6)
>> +	mov	zeroones, #REP8_01
>> +	mov	dst, dstin
>> +	ands	tmp1, src, #15
>> +	b.ne	L(misaligned)
>> +	/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
>> +	   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
>> +	   can be done in parallel across the entire word.  */
>> +	/* The inner loop deals with two Dwords at a time.  This has a
>> +	   slightly higher start-up cost, but we should win quite quickly,
>> +	   especially on cores with a high number of issue slots per
>> +	   cycle, as we get much better parallelism out of the operations.  */
>> +	b	L(first_pass)
> Why do you use branch instead moving it here?
>
>> +L(main_loop):
> Is this aligned because you calculated size of instruction above or its
> just missing?
>
> ...
>> +L(nul_in_data1):
>> +	/* Slow path.  We can't be sure we've moved at least 8 bytes, so
>> +	   fall back to a slow byte-by byte store of the bits already
>> +	   loaded.
>> +
>> +	   The worst case when coming through this path is that we've had
>> +	   to copy seven individual bytes to get to alignment and we then
>> +	   have to copy another seven (eight for big-endian) again here.
>> +	   We could try to detect that case (and any case where more than
>> +	   eight bytes have to be copied), but it really doesn't seem
>> +	   worth it.  */
> which happens frequently, see above.
>
>
  
Richard Earnshaw Dec. 18, 2014, 10:55 a.m. UTC | #3
On 18/12/14 01:05, Ondřej Bílka wrote:
> On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
>> This patch contains an optimized implementation of strcpy for AArch64
>> systems.  Benchmarking shows that it is approximately 20-25% faster than
>> the generic implementation across the board.
>>
> I looked quickly for patch, I found two microoptimizations below and
> probable performance problem.
> 

Ondrej,

Thanks for looking at this.  Unfortunately, you've looked at the wrong
version -- the version I accidentally posted first.  The correct version
was a near complete rewrite following my own benchmarking: I posted that
as a follow-up.

> Handing sizes 1-8 is definitely not slow path, its hot path. My profiler
> shows that 88.36% of calls use less than 16 bytes and 1-8 byte range is
> more likely than 9-16 bytes so you should optimize that case well.

I'd expect that.  But the key question is 'how much more likely'?
Having an inner loop dealing with 16 bytes at a time probably only costs
about 5-10% more time *per iteration* than an inner loop dealing with 8
bytes, since much of the cost is in waiting for the load instruction to
return data from the cache or memory and most of the other instructions
will dual-issue on any reasonable implementation of the architecture.
So to win in the overall game of performance we'd need to show that
short (< 8 byte) strings were significantly more likely than 8-16 byte
strings.  That seems quite unlikely to me, though I admit I don't have
hard numbers; looking at your data suggests that (approximately) each
larger block size is ~20% of the size of the previous block (I'm
guessing these are 8-byte blocks, but I can't immediately see a
definition), which suggests to me that we have a net win with preferring
16 bytes over 8 bytes per iteration.

Note that starting off with 8-byte blocks and later switching to 16-byte
blocks would most likely involve another unpredictable branch as we
inserted the late realignment check.

R.

> 
> See number of calls graph in strcpy function at
> 
> http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/results/result.html
> 
> My main three strcpy users are mutt, firefox, bash/sh. As first two are
> interactive its hard to do direct benchmark. So you should try how that
> affects bash.
> 
> You could try to measure bash running time directly as it uses strcpy
> relatively often. I measured following bash script and if you LD_PRELOAD
> a byte-by-byte loop [A] then it decreases performance by 10%.
> 
> http://kam.mff.cuni.cz/~ondra/bashtest.sh
> 
> Using bash only differs bit from aggegated case, there is additional peak 
> at ~500 bytes caused by copying PATH variable. Otherwise it does things 
> like copying each command name ten times before it runs it etc. Graph is
> here.
> 
> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/result.html
> 
> 
> [A]
> 
> char *
> strcpy (char *dest, const char *src)
> {
>   char *destp = dest;
>   while (*dest)
>     *dest++ = *src++;
> 
>   *dest = '\0';
> 
>   return destp;
> }
> 
> 
> and comments:
>>
>> +ENTRY_ALIGN (strcpy,6)
>> +	mov	zeroones, #REP8_01
>> +	mov	dst, dstin
>> +	ands	tmp1, src, #15
>> +	b.ne	L(misaligned)
>> +	/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
>> +	   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
>> +	   can be done in parallel across the entire word.  */
>> +	/* The inner loop deals with two Dwords at a time.  This has a
>> +	   slightly higher start-up cost, but we should win quite quickly,
>> +	   especially on cores with a high number of issue slots per
>> +	   cycle, as we get much better parallelism out of the operations.  */
>> +	b	L(first_pass)
> 
> Why do you use branch instead moving it here?
> 
>> +L(main_loop):
> 
> Is this aligned because you calculated size of instruction above or its
> just missing?
> 
> ...
>> +L(nul_in_data1):
>> +	/* Slow path.  We can't be sure we've moved at least 8 bytes, so
>> +	   fall back to a slow byte-by byte store of the bits already
>> +	   loaded.
>> +
>> +	   The worst case when coming through this path is that we've had
>> +	   to copy seven individual bytes to get to alignment and we then
>> +	   have to copy another seven (eight for big-endian) again here.
>> +	   We could try to detect that case (and any case where more than
>> +	   eight bytes have to be copied), but it really doesn't seem
>> +	   worth it.  */
> 
> which happens frequently, see above.
> 
> 
>
  
Richard Earnshaw Dec. 18, 2014, 12:45 p.m. UTC | #4
On 18/12/14 01:44, Adhemerval Zanella wrote:
> On 17-12-2014 23:05, Ondřej Bílka wrote:
>> On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
>>> This patch contains an optimized implementation of strcpy for AArch64
>>> systems.  Benchmarking shows that it is approximately 20-25% faster than
>>> the generic implementation across the board.
>>>
>> I looked quickly for patch, I found two microoptimizations below and
>> probable performance problem.
>>
>> Handing sizes 1-8 is definitely not slow path, its hot path. My profiler
>> shows that 88.36% of calls use less than 16 bytes and 1-8 byte range is
>> more likely than 9-16 bytes so you should optimize that case well.
>>
>> See number of calls graph in strcpy function at
>>
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/results/result.html
>>
>> My main three strcpy users are mutt, firefox, bash/sh. As first two are
>> interactive its hard to do direct benchmark. So you should try how that
>> affects bash.
>>
>> You could try to measure bash running time directly as it uses strcpy
>> relatively often. I measured following bash script and if you LD_PRELOAD
>> a byte-by-byte loop [A] then it decreases performance by 10%.
>>
>> http://kam.mff.cuni.cz/~ondra/bashtest.sh
> 
> I gave a try to this benchmark and, at least on powerpc64, I could not see *any*
> real strcpy usage running using bash (version 4.3.11).  And profilers also do
> not show strcpy being a hotspots.  Also, and I don't know if it is intentional,
> it is issuing 'gnuplot2'.
> 
> And I see your characterized strcpy being direct from desktop workloads, which
> is a limited usage behavior.  For powerpc64, as example, 'firefox' and 'mutt'
> is far from the default workloads running on it.
> 
> So the focus is not that '1-8 *are* hot path' or if you 'profiler' shows that
> mostly of calls are less than 16 bytes, but what kind of workloads aarch64
> implementation are trying to solve here.  It would be good if the patch
> proposal disclaimer better what kind of tests it did, like if he only ran
> the GLIBC benchtests of if he is trying to optimize for a real case usage.
> 

I'm not targeting a specific workload; the code is intended to be
generic.  On that basis I made the following assumptions:

String lengths would likely follow a negative exponential distribution
over a large set of workloads.  Individual workloads would undoubtedly
not follow such a distribution.

Sequential calls to the string routine would not have a predictable
alignment or length of the source string, so branch predictors would be
unlikely to be able to accurately predict the outcome of most branches.

Source strings might be aligned to an 8-byte boundary but are unlikely
to be specifically aligned to a 16-byte boundary (i.e. most likely ~50%
of 8-byte aligned strings will also be 16-byte aligned).  It's unlikely
that the proportion of 8-byte aligned strings would significantly
dominate the profile to the extent that a branch predictor would be
able to make real use of this property.

The start address of source strings will be randomly distributed across
a page, making a page-cross check very predictable (~99.5%): with
4096-byte pages, only a start address in the last 15 bytes of a page
can make a 16-byte access cross into the next page, i.e. about 0.4% of
addresses.

Running the glibc strcpy benchmark showed that the new code was 20-25%
faster than the existing C code for *all* cases measured, not just at
specific points, so overall this is a notable improvement.  Note that
the C code uses strlen and memcpy, so it is already using similar
techniques and is pretty good now.
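
For reference, the generic C version being compared against boils down
to something like the following (a paraphrase from memory, not the
exact glibc source):

#include <string.h>

char *
strcpy (char *dest, const char *src)
{
  /* Find the length, then copy the string together with its
     terminating NUL in a single memcpy.  */
  return memcpy (dest, src, strlen (src) + 1);
}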

Note, the version of the patch that I intended to submit is this one
(not the version first posted):

	https://sourceware.org/ml/libc-alpha/2014-12/msg00609.html

It's worth noting that if, say, 75% or more of strings are <= 16 bytes
in length it's likely that the main path through the code will be
reasonably predictable, since we avoid any loops in that case.  The
only hard-to-predict branches will be those in the final decision tree
needed to handle copies of < 16 bytes.  In those cases we split the
copies into buckets:

1 byte ; 2-3 bytes ; 4-7 bytes ; 8-16 bytes

and binary chop between them.
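
To make that concrete, here is a rough C sketch of the bucketing idea
using overlapping fixed-width moves (an illustration only; the real
code does this in assembly on data already held in registers, and the
exact branch ordering differs):

#include <stdint.h>
#include <string.h>

/* Copy n bytes, 1 <= n <= 16, with no loop: pick a bucket and do two
   (possibly overlapping) fixed-size moves, one anchored at each end.  */
void
copy_1_to_16 (char *dst, const char *src, size_t n)
{
  if (n >= 8)
    {
      uint64_t a, b;                   /* 8-16 bytes: two 8-byte moves.  */
      memcpy (&a, src, 8);
      memcpy (&b, src + n - 8, 8);
      memcpy (dst, &a, 8);
      memcpy (dst + n - 8, &b, 8);
    }
  else if (n >= 4)
    {
      uint32_t a, b;                   /* 4-7 bytes: two 4-byte moves.  */
      memcpy (&a, src, 4);
      memcpy (&b, src + n - 4, 4);
      memcpy (dst, &a, 4);
      memcpy (dst + n - 4, &b, 4);
    }
  else if (n >= 2)
    {
      uint16_t a, b;                   /* 2-3 bytes: two 2-byte moves.  */
      memcpy (&a, src, 2);
      memcpy (&b, src + n - 2, 2);
      memcpy (dst, &a, 2);
      memcpy (dst + n - 2, &b, 2);
    }
  else
    *dst = *src;                       /* 1 byte.  */
}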

R.

>>
>> Using bash only differs bit from aggegated case, there is additional peak 
>> at ~500 bytes caused by copying PATH variable. Otherwise it does things 
>> like copying each command name ten times before it runs it etc. Graph is
>> here.
>>
>> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/result.html
>>
>>
>> [A]
>>
>> char *
>> strcpy (char *dest, const char *src)
>> {
>>   char *destp = dest;
>>   while (*dest)
>>     *dest++ = *src++;
>>
>>   *dest = '\0';
>>
>>   return destp;
>> }
>>
>>
>> and comments:
>>> +ENTRY_ALIGN (strcpy,6)
>>> +	mov	zeroones, #REP8_01
>>> +	mov	dst, dstin
>>> +	ands	tmp1, src, #15
>>> +	b.ne	L(misaligned)
>>> +	/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
>>> +	   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
>>> +	   can be done in parallel across the entire word.  */
>>> +	/* The inner loop deals with two Dwords at a time.  This has a
>>> +	   slightly higher start-up cost, but we should win quite quickly,
>>> +	   especially on cores with a high number of issue slots per
>>> +	   cycle, as we get much better parallelism out of the operations.  */
>>> +	b	L(first_pass)
>> Why do you use branch instead moving it here?
>>
>>> +L(main_loop):
>> Is this aligned because you calculated size of instruction above or its
>> just missing?
>>
>> ...
>>> +L(nul_in_data1):
>>> +	/* Slow path.  We can't be sure we've moved at least 8 bytes, so
>>> +	   fall back to a slow byte-by byte store of the bits already
>>> +	   loaded.
>>> +
>>> +	   The worst case when coming through this path is that we've had
>>> +	   to copy seven individual bytes to get to alignment and we then
>>> +	   have to copy another seven (eight for big-endian) again here.
>>> +	   We could try to detect that case (and any case where more than
>>> +	   eight bytes have to be copied), but it really doesn't seem
>>> +	   worth it.  */
>> which happens frequently, see above.
>>
>>
> 
>
  
Ondrej Bilka Dec. 18, 2014, 1:45 p.m. UTC | #5
On Thu, Dec 18, 2014 at 10:55:25AM +0000, Richard Earnshaw wrote:
> On 18/12/14 01:05, Ondřej Bílka wrote:
> > On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
> >> This patch contains an optimized implementation of strcpy for AArch64
> >> systems.  Benchmarking shows that it is approximately 20-25% faster than
> >> the generic implementation across the board.
> >>
> > I looked quickly for patch, I found two microoptimizations below and
> > probable performance problem.
> > 
> 
> Ondrej,
> 
> Thanks for looking at this.  Unfortunately, you've looked at the wrong
> version -- the version I accidentally posted first.  The correct version
> was a near complete rewrite following my own benchmarking: I posted that
> as a follow-up.
>
Yes, in the new version it's fixed and it's probably the best possible
approach for short strings.  Now I see only a few possible
micro-optimizations.
 
> > Handing sizes 1-8 is definitely not slow path, its hot path. My profiler
> > shows that 88.36% of calls use less than 16 bytes and 1-8 byte range is
> > more likely than 9-16 bytes so you should optimize that case well.
> 
> I'd expect that.  But the key question is 'how much more likely'?

From the raw data used to create the graphs it's around 600000 calls
for 1-8 bytes versus 260000 for 8-16 bytes; the numbers of calls are
here:

http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/functionbytes_10_3
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/functionbytes_100_3

> Having an inner loop dealing with 16 bytes at a time probably only costs
> about 5-10% more time *per iteration* than an inner loop dealing with 8
> bytes, since much of the cost is in waiting for the load instruction to
> return data from the cache or memory and most of the other instructions
> will dual-issue on any reasonable implementation of the architecture.
> So to win in the overall game of performance we'd need to show that
> short (<8 byte strings) were significantly more likely than 8-16 byte
> strings.  That seems quite unlikely to me, though I admit I don't have
> hard numbers; looking at your data suggests that (approximately) each
> larger block size is ~20% of the size of the previous block (I'm
> guessing these are 8-byte blocks, but I can't immediately see a
> definition), which suggests to me that we have a net win with preferring
> 16-bytes over 8 bytes per iteration.
> 
These are 16-byte blocks aligned to 16 bytes; there is short
documentation that I should expand:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strcpy_profile/results_gcc/doc/properties.html

My objection was only about the handling of the first 8 bytes.

As far as loops are concerned a 16-byte one is fine; I could even try
what you gain with 32 bytes, but then code size could be a problem.

I am not that worried about loop overhead; for each implementation you
could construct a workload where it is slow.  You can only shift that
weakness so that it is unlikely to be encountered.  For the 16-byte
loop you would need a lot of 32-64 byte strings but few 1-32 byte ones.
OTOH an 8-byte loop would have problems mainly with the 16-32 byte
range, but would be quite fast up to 16 bytes, as that case is handled
by the header.



> Note that starting off with 8-byte blocks and later switching to 16-byte
> blocks would most likely involve another unpredictable branch as we
> inserted the late realignment check.
> 
Yes, these do not work.
  
Ondrej Bilka Jan. 3, 2015, 10:44 a.m. UTC | #6
On Thu, Dec 18, 2014 at 12:45:12PM +0000, Richard Earnshaw wrote:
> On 18/12/14 01:44, Adhemerval Zanella wrote:
> > On 17-12-2014 23:05, Ondřej Bílka wrote:
> >> On Wed, Dec 17, 2014 at 12:12:25PM +0000, Richard Earnshaw wrote:
> >>> This patch contains an optimized implementation of strcpy for AArch64
> >>> systems.  Benchmarking shows that it is approximately 20-25% faster than
> >>> the generic implementation across the board.
> >>>
> >> I looked quickly for patch, I found two microoptimizations below and
> >> probable performance problem.
> >>
> >> Handing sizes 1-8 is definitely not slow path, its hot path. My profiler
> >> shows that 88.36% of calls use less than 16 bytes and 1-8 byte range is
> >> more likely than 9-16 bytes so you should optimize that case well.
> >>
> >> See number of calls graph in strcpy function at
> >>
> >> http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/results/result.html
> >>
> >> My main three strcpy users are mutt, firefox, bash/sh. As first two are
> >> interactive its hard to do direct benchmark. So you should try how that
> >> affects bash.
> >>
> >> You could try to measure bash running time directly as it uses strcpy
> >> relatively often. I measured following bash script and if you LD_PRELOAD
> >> a byte-by-byte loop [A] then it decreases performance by 10%.
> >>
> >> http://kam.mff.cuni.cz/~ondra/bashtest.sh
> > 
> > I gave a try to this benchmark and, at least on powerpc64, I could not see *any*
> > real strcpy usage running using bash (version 4.3.11).  And profilers also do
> > not show strcpy being a hotspots.  Also, and I don't know if it is intentional,
> > it is issuing 'gnuplot2'.
> > 
> > And I see your characterized strcpy being direct from desktop workloads, which
> > is a limited usage behavior.  For powerpc64, as example, 'firefox' and 'mutt'
> > is far from the default workloads running on it.
> > 
> > So the focus is not that '1-8 *are* hot path' or if you 'profiler' shows that
> > mostly of calls are less than 16 bytes, but what kind of workloads aarch64
> > implementation are trying to solve here.  It would be good if the patch
> > proposal disclaimer better what kind of tests it did, like if he only ran
> > the GLIBC benchtests of if he is trying to optimize for a real case usage.
> > 
> 
> I'm not targeting a specific workload; the code is intended to be
> generic.  On that basis I made the following assumptions:
> 
> String lengths would likely follow a negative exponential distribution
> over a large set of workloads.  Individual workloads would undoubtedly
> not follow such a distribution.
> 
> Sequential calls to the string routine would not have a predictable
> alignment or length of the source string, so branch predictors would be
> unlikely to be able to accurately predict the outcome of most branches.
> 
Which is true, but for a slightly different reason: branch predictor
capacity is limited, and when there is a considerable delay between
calls, even a branch that could be perfectly predicted will no longer
be in the predictor.

In that situation knowing about static prediction helps: most sensible
archs assume that a branch that is not in the predictor will not be
taken.
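
As a small C illustration of mine (not anything from the patch): with
GCC you can express the same preference with __builtin_expect, so the
rare case sits behind a forward branch that such a static predictor
treats as not taken.

#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, defined only so the sketch is self-contained.  */
static char *aligned_copy (char *dst, const char *src)
{ return strcpy (dst, src); }
static char *byte_copy (char *dst, const char *src)
{ return strcpy (dst, src); }

char *
copy_hinted (char *dst, const char *src)
{
  /* Mark the misaligned case as unlikely; the compiler will normally
     keep the aligned case on the fall-through path, which is also what
     a "cold branch not taken" static predictor assumes.  */
  if (__builtin_expect (((uintptr_t) src & 15) != 0, 0))
    return byte_copy (dst, src);

  return aligned_copy (dst, src);
}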
  

Patch

diff --git a/sysdeps/aarch64/strcpy.S b/sysdeps/aarch64/strcpy.S
new file mode 100644
index 0000000..1cdf2a1
--- /dev/null
+++ b/sysdeps/aarch64/strcpy.S
@@ -0,0 +1,202 @@ 
+/* Copyright (C) 2013-2014 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, unaligned accesses
+ */
+
+/* Arguments and results.  */
+#define dstin		x0
+#define src		x1
+
+/* Locals and temporaries.  */
+#define dst		x2
+#define data1		x3
+#define data1w		w3
+#define data2		x4
+#define has_nul1	x5
+#define has_nul2	x6
+#define tmp1		x7
+#define tmp2		x8
+#define tmp3		x9
+#define tmp4		x10
+#define zeroones	x11
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+	/* Start of critical section -- keep to one 64Byte cache line.  */
+ENTRY_ALIGN (strcpy,6)
+	mov	zeroones, #REP8_01
+	mov	dst, dstin
+	ands	tmp1, src, #15
+	b.ne	L(misaligned)
+	/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	   can be done in parallel across the entire word.  */
+	/* The inner loop deals with two Dwords at a time.  This has a
+	   slightly higher start-up cost, but we should win quite quickly,
+	   especially on cores with a high number of issue slots per
+	   cycle, as we get much better parallelism out of the operations.  */
+	b	L(first_pass)
+L(main_loop):
+	stp	data1, data2, [dst], #16
+L(startloop_fast):
+	ldp	data1, data2, [src], #16
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	sub	tmp3, data2, zeroones
+	orr	tmp4, data2, #REP8_7f
+	bic	has_nul1, tmp1, tmp2
+	bics	has_nul2, tmp3, tmp4
+	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
+	b.eq	L(main_loop)
+	/* End of critical section -- keep to one 64Byte cache line.  */
+
+	cbnz	has_nul1, L(nul_in_data1_fast)
+L(nul_in_data2_fast):
+	str	data1, [dst], #8
+L(nul_in_data2_fast_after_d1):
+	/* For a NUL in data2, we always know that we've moved at least 8
+	   bytes, so no need for a slow path.  */
+#ifdef __AARCH64EB__
+	/* For big-endian only, carry propagation means we can't trust
+	   the MSB of the syndrome value calculated above (the byte
+	   sequence 01 00 will generate a syndrome of 80 80 rather than
+	   00 80).  We get around this by byte-swapping the data and
+	   re-calculating.  */
+	rev	data2, data2
+	sub	tmp1, data2, zeroones
+	orr	tmp2, data2, #REP8_7f
+	bic	has_nul2, tmp1, tmp2
+#endif
+	rev	has_nul2, has_nul2
+	sub	src, src, #(8+7)
+	clz	has_nul2, has_nul2
+	lsr	has_nul2, has_nul2, #3		/* Bits to bytes.  */
+	sub	dst, dst, #7
+	ldr	data2, [src, has_nul2]
+	str	data2, [dst, has_nul2]
+	ret
+
+L(nul_in_data1_fast):
+	/* Since we know we've already copied at least 8 bytes, we can
+	   safely handle the tail with one misaligned dword move.  To do this
+	   we calculate the location of the trailing NUL byte and go seven
+	   bytes back from that.  */
+#ifdef __AARCH64EB__
+	/* For big-endian only, carry propagation means we can't trust
+	   the MSB of the syndrome value calculated above (the byte
+	   sequence 01 00 will generate a syndrome of 80 80 rather than
+	   00 80).  We get around this by byte-swapping the data and
+	   re-calculating.  */
+	rev	data1, data1
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bic	has_nul1, tmp1, tmp2
+#endif
+	rev	has_nul1, has_nul1
+	sub	src, src, #(16+7)
+	clz	has_nul1, has_nul1
+	lsr	has_nul1, has_nul1, #3		/* Bits to bytes.  */
+	sub	dst, dst, #7
+	ldr	data1, [src, has_nul1]
+	str	data1, [dst, has_nul1]
+	ret
+
+L(first_pass):
+	ldp	data1, data2, [src], #16
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	sub	tmp3, data2, zeroones
+	orr	tmp4, data2, #REP8_7f
+	bic	has_nul1, tmp1, tmp2
+	bics	has_nul2, tmp3, tmp4
+	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
+	b.eq	L(main_loop)
+
+	cbz	has_nul1, L(nul_in_data2_fast)
+L(nul_in_data1):
+	/* Slow path.  We can't be sure we've moved at least 8 bytes, so
+	   fall back to a slow byte-by-byte store of the bits already
+	   loaded.
+
+	   The worst case when coming through this path is that we've had
+	   to copy seven individual bytes to get to alignment and we then
+	   have to copy another seven (eight for big-endian) again here.
+	   We could try to detect that case (and any case where more than
+	   eight bytes have to be copied), but it really doesn't seem
+	   worth it.  */
+#ifdef __AARCH64EB__
+	rev	data1, data1
+#else
+	/* On little-endian, we can easily check if the NULL byte was
+	   in the last byte of the Dword.  For big-endian we'd have to
+	   recalculate the syndrome, which is unlikely to be worth it.  */
+	lsl	has_nul1, has_nul1, #8
+	cbnz	has_nul1, 1f
+	str	data1, [dst]
+	ret
+#endif
+1:
+	strb	data1w, [dst], #1
+	tst	data1, #0xff
+	lsr	data1, data1, #8
+	b.ne	1b
+L(done):
+	ret
+
+L(misaligned):
+	cmp	tmp1, #8
+	b.ge	2f
+	/* There's at least one Dword before we reach alignment, so we can
+	   deal with that efficiently.  */
+	ldr	data1, [src]
+	bic	src, src, #15
+	sub	tmp3, data1, zeroones
+	orr	tmp4, data1, #REP8_7f
+	bics	has_nul1, tmp3, tmp4
+	b.ne	L(nul_in_data1)
+	str	data1, [dst], #8
+	ldr	data2, [src, #8]
+	add	src, src, #16
+	sub	dst, dst, tmp1
+	sub	tmp3, data2, zeroones
+	orr	tmp4, data2, #REP8_7f
+	bics	has_nul2, tmp3, tmp4
+	b.ne	L(nul_in_data2_fast_after_d1)
+	str	data2, [dst], #8
+	/* We can by-pass the first-pass version of the loop in this case
+	   since we know that at least 8 bytes have already been copied.  */
+	b	L(startloop_fast)
+
+2:
+	sub	tmp1, tmp1, #16
+3:
+	ldrb	data1w, [src], #1
+	strb	data1w, [dst], #1
+	cbz	data1w, L(done)
+	add	tmp1, tmp1, #1
+	cbnz	tmp1, 3b
+	b	L(first_pass)
+END (strcpy)
+libc_hidden_builtin_def (strcpy)
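
For reference, the NUL-detection identity described in the comments
above can be written out in C (an illustrative sketch, not part of the
patch):

#include <stdint.h>

/* For a 64-bit word x, (x - 0x01..01) & ~x & 0x80..80 is non-zero iff
   some byte of x is zero.  As the patch's comments note, carry
   propagation means the *position* of the syndrome bit is only
   reliable after byte-reversing and recomputing on big-endian; whether
   the result is zero or non-zero is always valid.  */
int
has_zero_byte (uint64_t x)
{
  const uint64_t rep8_01 = 0x0101010101010101ULL;  /* REP8_01 */
  const uint64_t rep8_80 = 0x8080808080808080ULL;  /* REP8_80 */

  return ((x - rep8_01) & ~x & rep8_80) != 0;
}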