Patchwork aarch64: Thunderx specific memcpy and memmove

Submitter Steve Ellcey
Date March 24, 2017, 11:25 p.m.
Message ID <1490397926.19074.73.camel@caviumnetworks.com>
Permalink /patch/19726/
State New

Comments

Steve Ellcey - March 24, 2017, 11:25 p.m.
Now that the IFUNC infrastructure for aarch64 is in place, here is a
patch to use it to create ThunderX specific versions of memcpy and
memmove.
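
As a rough illustration of the selection an IFUNC resolver performs, here is a minimal C sketch. It is not the code from the patch. The MIDR field layout (implementer in bits 31:24, part number in bits 15:4) follows the architecture, while the ThunderX values (implementer 0x43, part 0x0a1) are stated from memory and should be treated as assumptions. The names is_thunderx, select_memcpy, memcpy_generic_sketch and memcpy_thunderx_sketch are hypothetical stand-ins, and a plain function pointer is used in place of a real IFUNC symbol so the sketch builds on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field extraction for the AArch64 MIDR_EL1 layout:
   implementer in bits [31:24], part number in bits [15:4].  */
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfffu)

/* Assumed values: implementer 0x43 ('C', Cavium), part 0x0a1 (ThunderX).  */
static int
is_thunderx (uint64_t midr)
{
  return MIDR_IMPLEMENTER (midr) == 0x43 && MIDR_PARTNUM (midr) == 0x0a1;
}

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Hypothetical stand-ins for __memcpy_generic and __memcpy_thunderx.  */
static void *
memcpy_generic_sketch (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *
memcpy_thunderx_sketch (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* The choice a resolver would make once, when the symbol is bound.  */
static memcpy_fn
select_memcpy (uint64_t midr)
{
  memcpy_fn chosen = is_thunderx (midr) ? memcpy_thunderx_sketch
                                        : memcpy_generic_sketch;
  assert (chosen != NULL);
  return chosen;
}
```

With a real IFUNC the caller never sees this choice: the dynamic linker runs the resolver once and every later call to memcpy goes straight to the selected implementation.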

This was part of my original patch before it was split in two, and a
couple of issues were raised at that time.

Siddhesh Poyarekar wanted to separate the generic and thunderx copies
of memcpy/memmove instead of using ifdefs in a combined source file.
I prefer the ifdef version as a cleaner implementation with less code
duplication but I can change it if that is the consensus.

Also, Adhemerval Zanella did some benchmarking that showed the
prefetching done in the thunderx version might be appropriate for the
generic version.  However, if you look at the prefetching, we only do it
every other time through the loop.  This is because the loop copies 64
bytes per iteration and the ThunderX cache line size is 128 bytes.  If
other aarch64 chips have a 64 byte cache line they might want a
different prefetching setup.
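
The "every other time through the loop" pattern can be sketched in C as follows. This is only an illustration of the idea, not the assembly in the patch. It assumes a GCC-compatible compiler for __builtin_prefetch; the PREFETCH_AHEAD distance of 512 bytes and the name copy_with_prefetch are made up for the example; and the line boundary is approximated by the offset into the buffer rather than the actual source address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* CHUNK and CACHE_LINE come from the description above; PREFETCH_AHEAD
   is an assumed distance chosen only for illustration.  */
enum { CHUNK = 64, CACHE_LINE = 128, PREFETCH_AHEAD = 512 };

/* Copy n bytes in 64-byte chunks and issue one prefetch per 128-byte
   line, i.e. on every second pass.  Returns the number of prefetches
   issued so the pattern can be checked.  */
static size_t
copy_with_prefetch (unsigned char *dst, const unsigned char *src, size_t n)
{
  size_t prefetches = 0;
  size_t off = 0;

  assert (dst != NULL && src != NULL);
  for (; off + CHUNK <= n; off += CHUNK)
    {
      /* Prefetch only at the start of a line and only inside the buffer.  */
      if (off % CACHE_LINE == 0 && off + PREFETCH_AHEAD < n)
        {
          __builtin_prefetch (src + off + PREFETCH_AHEAD, 0, 3);
          prefetches++;
        }
      memcpy (dst + off, src + off, CHUNK);
    }

  /* Tail shorter than one chunk.  */
  memcpy (dst + off, src + off, n - off);
  return prefetches;
}
```

On a chip with a 64 byte cache line the same loop would need the prefetch on every pass (CACHE_LINE equal to CHUNK), which is the tuning difference the paragraph above points at.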

If people think we should use the ThunderX version of memcpy for all
aarch64 systems I am happy to drop this patch and create one that just
changes memcpy.S to do the ThunderX style prefetches for all aarch64
systems.

Steve Ellcey
sellcey@cavium.com


2017-03-24  Steve Ellcey  <sellcey@caviumnetworks.com>

	* sysdeps/aarch64/memcpy.S (MEMMOVE, MEMCPY): New macros.
	(memmove): Use MEMMOVE for name.
	(memcpy): Use MEMCPY for name.  Add loop with prefetching
	under USE_THUNDERX macro.
	* sysdeps/aarch64/multiarch/Makefile: New file.
	* sysdeps/aarch64/multiarch/ifunc-impl-list.c: Likewise.
	* sysdeps/aarch64/multiarch/init-arch.h: Likewise.
	* sysdeps/aarch64/multiarch/memcpy.c: Likewise.
	* sysdeps/aarch64/multiarch/memcpy_generic.S: Likewise.
	* sysdeps/aarch64/multiarch/memcpy_thunderx.S: Likewise.
	* sysdeps/aarch64/multiarch/memmove.c: Likewise.
Szabolcs Nagy - March 27, 2017, 10:45 a.m.
On 24/03/17 23:25, Steve Ellcey wrote:
> Now that the IFUNC infrastructure for aarch64 is in place, here is a
> patch to use it to create ThunderX specific versions of memcpy and
> memmove.
> 
> This was part of my original patch before it was split in two and a
> couple of issues were raised at that time. 
> 
> Siddhesh Poyarekar wanted to separate the generic and thunderx copies
> of memcpy/memmove instead of using ifdefs in a combined source file.
> I prefer the ifdef version as a cleaner implementation with less code
> duplication but I can change it if that is the consensus.
> 

both are fine with me.

> Also Adhemerval Zanella did some benchmarking that showed the
> prefetching done in the thunderx version might be appropriate for the
> generic version.  However if you look at the prefetching we only do it
> every other time through the loop.  This is because the loop copies 64
> bytes and the ThunderX cache line size is 128 bytes.  If other aarch64
> chips have a 64 byte cache line they might want a different prefetching
> setup.
> 
> If people think we should use the ThunderX version of memcpy for all
> aarch64 systems I am happy to drop this patch and create one that just
> changes memcpy.S to do the ThunderX style prefetches for all aarch64
> systems.
> 

adding prefetches to the generic code is preferable
if it can make both thunderx and generic users happy.

we need to find the best way to add the prefetches;
the new memcpy benchmarks may help here.

Ramana Radhakrishnan - March 27, 2017, 10:52 a.m.
On Fri, Mar 24, 2017 at 11:25 PM, Steve Ellcey
<sellcey@caviumnetworks.com> wrote:
> Now that the IFUNC infrastructure for aarch64 is in place, here is a
> patch to use it to create ThunderX specific versions of memcpy and
> memmove.
>
> This was part of my original patch before it was split in two and a
> couple of issues were raised at that time.
>
> Siddhesh Poyarekar wanted to separate the generic and thunderx copies
> of memcpy/memmove instead of using ifdefs in a combined source file.
> I prefer the ifdef version as a cleaner implementation with less code
> duplication but I can change it if that is the consensus.
>
> Also Adhemerval Zanella did some benchmarking that showed the
> prefetching done in the thunderx version might be appropriate for the
> generic version.  However if you look at the prefetching we only do it
> every other time through the loop.  This is because the loop copies 64
> bytes and the ThunderX cache line size is 128 bytes.  If other aarch64
> chips have a 64 byte cache line they might want a different prefetching
> setup.

Can you link to the benchmark numbers and say which workloads and
systems they cover?

Ramana

Steve Ellcey - March 27, 2017, 9:35 p.m.
On Mon, 2017-03-27 at 11:52 +0100, Ramana Radhakrishnan wrote:

> > Also Adhemerval Zanella did some benchmarking that showed the
> > prefetching done in the thunderx version might be appropriate for the
> > generic version.  However if you look at the prefetching we only do it
> > every other time through the loop.  This is because the loop copies 64
> > bytes and the ThunderX cache line size is 128 bytes.  If other aarch64
> > chips have a 64 byte cache line they might want a different prefetching
> > setup.

> Can you link to the benchmark numbers, workloads and what systems ?
> 
> Ramana

The only reference I have to Adhemerval's results is at:

https://sourceware.org/ml/libc-alpha/2017-02/msg00118.html

Attached are my latest results on ThunderX with the IFUNC numbers from
the glibc memcpy performance benchmarks.  They include the new
bench-memcpy-random benchmark, which doesn't show much difference.  It
is really bench-memcpy-large that stands out.
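
To put a number on "stands out", the small C sketch below computes the generic/ThunderX ratio for three aligned (0/0) rows taken from the two-column results at the end of the attached output. It assumes the figures are times per call, so lower is better; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct bench_row
{
  unsigned length;   /* copy length in bytes */
  double thunderx;   /* __memcpy_thunderx column */
  double generic;    /* __memcpy_generic column */
};

/* Aligned (0/0) rows copied from the two-column results.  */
static const struct bench_row rows[] = {
  { 65543, 8917.5, 16964.4 },
  { 131079, 17373.8, 33872.5 },
  { 262151, 35076.9, 67497.5 },
};

/* How many times longer the generic version takes than the ThunderX
   version, assuming lower figures are better.  */
static double
speedup (const struct bench_row *row)
{
  assert (row != NULL && row->thunderx > 0.0);
  return row->generic / row->thunderx;
}
```

Under that reading, all three rows come out between 1.90 and 1.95, so the ThunderX version takes roughly half the time of the generic one for large aligned copies.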

Steve Ellcey
sellcey@cavium.com

                        	builtin_memcpy	simple_memcpy	__memcpy_thunderx	__memcpy_generic
Length    1, alignment  0/ 0:	39.2188	18.75	23.4375	23.125
Length    1, alignment  0/ 0:	27.0312	17.6562	23.125	23.125
Length    1, alignment  0/ 0:	27.0312	17.0312	22.9688	22.8125
Length    1, alignment  0/ 0:	27.9688	17.1875	26.875	24.375
Length    2, alignment  0/ 0:	27.0312	26.875	23.125	23.2812
Length    2, alignment  1/ 0:	27.0312	25.625	23.125	22.8125
Length    2, alignment  0/ 1:	27.0312	25.1562	22.9688	22.8125
Length    2, alignment  1/ 1:	26.875	25.4688	22.9688	22.8125
Length    4, alignment  0/ 0:	26.25	26.7188	21.25	20.9375
Length    4, alignment  2/ 0:	25	25.9375	20.9375	20.7812
Length    4, alignment  0/ 2:	24.6875	25.9375	21.0938	20.7812
Length    4, alignment  2/ 2:	24.8438	25.4688	20.7812	20.625
Length    8, alignment  0/ 0:	24.2188	38.5938	19.6875	19.8438
Length    8, alignment  3/ 0:	34.2188	37.1875	28.9062	28.75
Length    8, alignment  0/ 3:	35.7812	36.875	30.4688	30.3125
Length    8, alignment  3/ 3:	44.2188	36.875	38.9062	38.5938
Length   16, alignment  0/ 0:	23.75	75	19.5312	19.375
Length   16, alignment  4/ 0:	34.0625	74.5312	28.9062	28.5938
Length   16, alignment  0/ 4:	35.9375	74.375	30.4688	30.3125
Length   16, alignment  4/ 4:	44.2188	74.5312	38.9062	38.5938
Length   32, alignment  0/ 0:	25.3125	110	19.6875	19.0625
Length   32, alignment  5/ 0:	35.3125	110	30.3125	30
Length   32, alignment  0/ 5:	35.3125	110.156	30.1562	30
Length   32, alignment  5/ 5:	45.3125	110	40	40
Length   64, alignment  0/ 0:	26.25	198.906	21.25	21.25
Length   64, alignment  6/ 0:	45	198.906	39.6875	39.8438
Length   64, alignment  0/ 6:	46.5625	198.75	41.25	41.25
Length   64, alignment  6/ 6:	64.375	198.906	59.2188	58.9062
Length  128, alignment  0/ 0:	34.0625	376.875	29.6875	27.9688
Length  128, alignment  7/ 0:	75.625	376.719	71.25	70
Length  128, alignment  0/ 7:	77.8125	376.875	73.5938	71.25
Length  128, alignment  7/ 7:	80.625	376.562	75.9375	74.6875
Length  256, alignment  0/ 0:	44.375	732.344	39.0625	41.0938
Length  256, alignment  8/ 0:	120.312	732.188	116.094	121.406
Length  256, alignment  0/ 8:	122.5	732.344	118.438	122.812
Length  256, alignment  8/ 8:	90.3125	732.344	86.0938	88.4375
Length  512, alignment  0/ 0:	64.375	1443.44	59.375	57.6562
Length  512, alignment  9/ 0:	216.406	1443.59	212.812	211.25
Length  512, alignment  0/ 9:	218.594	1443.44	214.844	212.656
Length  512, alignment  9/ 9:	110.469	1443.44	106.25	104.844
Length 1024, alignment  0/ 0:	107.344	2865.94	103.281	101.719
Length 1024, alignment 10/ 0:	414.219	2866.09	410.312	405.312
Length 1024, alignment  0/10:	416.094	2865.47	412.344	406.562
Length 1024, alignment 10/10:	154.219	2865	150	147.812
Length 2048, alignment  0/ 0:	216.406	5714.69	212.969	209.531
Length 2048, alignment 11/ 0:	793.281	5710.47	789.844	787.969
Length 2048, alignment  0/11:	796.094	5710.62	791.875	789.688
Length 2048, alignment 11/11:	262.344	5710.62	259.219	254.844
Length 4096, alignment  0/ 0:	408.75	11399.7	406.094	398.75
Length 4096, alignment 12/ 0:	1558.28	11399.7	1555.78	1552.97
Length 4096, alignment  0/12:	1559.84	11400	1556.88	1554.22
Length 4096, alignment 12/12:	455.312	11399.7	452.5	445.312
Length 8192, alignment  0/ 0:	796.094	22779.5	944.375	782.344
Length 8192, alignment 13/ 0:	3089.38	22779.7	3084.53	3082.81
Length 8192, alignment  0/13:	3091.56	22922.8	3087.5	3085.16
Length 8192, alignment 13/13:	841.875	22779.8	838.906	827.031
Length 16384, alignment  0/ 0:	1585.78	45738.1	1579.22	1567.66
Length 16384, alignment 14/ 0:	6164.69	45726.9	6155.31	6166.88
Length 16384, alignment  0/14:	6160.94	45736.9	6158.75	6166.88
Length 16384, alignment 14/14:	1624.84	45793.9	1622.03	1608.75
Length 32768, alignment  0/ 0:	3905.47	93004.7	3902.34	4998.44
Length 32768, alignment 15/ 0:	13493.4	92454.4	13462.8	14771.6
Length 32768, alignment  0/15:	13685.5	92742.5	13495	13854.5
Length 32768, alignment 15/15:	4035.31	92889.4	4008.44	4661.09
Length 65536, alignment  0/ 0:	8697.66	193559	8674.38	16843.6
Length 65536, alignment 16/ 0:	8698.12	193557	8677.5	17120.6
Length 65536, alignment  0/16:	8845.62	193541	8678.12	16837.8
Length 65536, alignment 16/16:	8834.38	193557	8679.53	17148.3
Length    0, alignment  0/ 0:	28.2812	18.2812	23.4375	23.5938
Length    0, alignment  0/ 0:	27.3438	17.8125	23.2812	23.4375
Length    0, alignment  0/ 0:	27.3438	17.8125	23.2812	23.2812
Length    0, alignment  0/ 0:	27.1875	17.8125	23.2812	23.2812
Length    1, alignment  0/ 0:	27.3438	17.6562	22.8125	22.9688
Length    1, alignment  1/ 0:	27.0312	17.3438	22.9688	22.9688
Length    1, alignment  0/ 1:	27.0312	17.3438	22.9688	22.9688
Length    1, alignment  1/ 1:	27.0312	17.1875	22.9688	22.8125
Length    2, alignment  0/ 0:	27.1875	25.3125	22.9688	22.8125
Length    2, alignment  2/ 0:	27.0312	25.4688	22.9688	22.9688
Length    2, alignment  0/ 2:	27.0312	25.1562	22.9688	22.8125
Length    2, alignment  2/ 2:	27.1875	25.1562	22.9688	22.8125
Length    3, alignment  0/ 0:	27.0312	24.6875	22.9688	22.8125
Length    3, alignment  3/ 0:	27.0312	22.9688	22.9688	22.9688
Length    3, alignment  0/ 3:	27.0312	22.6562	22.9688	22.9688
Length    3, alignment  3/ 3:	27.0312	22.6562	22.9688	22.8125
Length    4, alignment  0/ 0:	25.3125	26.0938	21.0938	20.7812
Length    4, alignment  4/ 0:	25	25.9375	20.7812	20.7812
Length    4, alignment  0/ 4:	25	25.4688	20.9375	20.7812
Length    4, alignment  4/ 4:	24.8438	25.625	20.7812	20.625
Length    5, alignment  0/ 0:	25	29.6875	20.7812	20.625
Length    5, alignment  5/ 0:	34.5312	28.5938	30.4688	30.3125
Length    5, alignment  0/ 5:	36.25	28.2812	31.875	31.875
Length    5, alignment  5/ 5:	44.5312	28.2812	40.3125	40.1562
Length    6, alignment  0/ 0:	25	32.1875	22.9688	20.625
Length    6, alignment  6/ 0:	29.5312	31.5625	25.4688	25.1562
Length    6, alignment  0/ 6:	30.9375	30.9375	27.0312	26.7188
Length    6, alignment  6/ 6:	35	31.0938	30.7812	30.7812
Length    7, alignment  0/ 0:	24.8438	35.4688	20.7812	20.625
Length    7, alignment  7/ 0:	29.375	34.6875	25.1562	25
Length    7, alignment  0/ 7:	31.0938	34.2188	27.0312	26.875
Length    7, alignment  7/ 7:	35.1562	34.0625	30.9375	30.625
Length    8, alignment  0/ 0:	24.0625	37.5	19.8438	19.2188
Length    8, alignment  8/ 0:	23.5938	37.5	19.375	19.375
Length    8, alignment  0/ 8:	23.5938	37.1875	19.5312	19.2188
Length    8, alignment  8/ 8:	23.75	37.1875	19.5312	19.2188
Length    9, alignment  0/ 0:	35.3125	40.7812	29.8438	29.6875
Length    9, alignment  9/ 0:	39.6875	40.3125	34.5312	34.2188
Length    9, alignment  0/ 9:	39.8438	40	34.375	34.2188
Length    9, alignment  9/ 9:	44.2188	40	38.75	38.5938
Length   10, alignment  0/ 0:	35.3125	43.5938	30	29.8438
Length   10, alignment 10/ 0:	39.6875	42.9688	34.375	34.2188
Length   10, alignment  0/10:	39.6875	42.8125	34.375	34.2188
Length   10, alignment 10/10:	44.2188	42.8125	38.75	38.5938
Length   11, alignment  0/ 0:	35.3125	46.25	29.8438	29.6875
Length   11, alignment 11/ 0:	39.8438	45.9375	34.375	34.0625
Length   11, alignment  0/11:	39.6875	45.625	34.5312	34.2188
Length   11, alignment 11/11:	44.0625	45.7812	38.75	38.5938
Length   12, alignment  0/ 0:	35.3125	48.9062	29.8438	29.6875
Length   12, alignment 12/ 0:	35.9375	48.4375	30.625	30.3125
Length   12, alignment  0/12:	34.8438	48.2812	29.375	29.0625
Length   12, alignment 12/12:	34.6875	48.2812	29.375	29.2188
Length   13, alignment  0/ 0:	35.1562	51.25	30	29.6875
Length   13, alignment 13/ 0:	39.8438	51.0938	34.375	34.0625
Length   13, alignment  0/13:	39.6875	51.25	34.375	34.2188
Length   13, alignment 13/13:	44.2188	51.25	38.9062	38.5938
Length   14, alignment  0/ 0:	35.3125	58.4375	29.8438	29.6875
Length   14, alignment 14/ 0:	39.8438	58.4375	34.375	34.2188
Length   14, alignment  0/14:	39.6875	58.4375	34.5312	34.375
Length   14, alignment 14/14:	44.2188	58.4375	38.9062	38.75
Length   15, alignment  0/ 0:	35.3125	72.0312	29.8438	29.6875
Length   15, alignment 15/ 0:	39.8438	71.875	34.5312	34.375
Length   15, alignment  0/15:	39.6875	72.0312	34.375	34.2188
Length   15, alignment 15/15:	44.2188	72.0312	38.75	38.75
Length   16, alignment  0/ 0:	23.75	74.8438	19.5312	19.375
Length   16, alignment 16/ 0:	23.5938	74.6875	19.375	19.2188
Length   16, alignment  0/16:	23.5938	74.5312	19.2188	17.6562
Length   16, alignment 16/16:	23.5938	74.5312	19.375	19.2188
Length   17, alignment  0/ 0:	36.7188	68.4375	31.5625	31.4062
Length   17, alignment 17/ 0:	40.9375	68.2812	35.7812	35.7812
Length   17, alignment  0/17:	40.4688	68.4375	35.625	35.4688
Length   17, alignment 17/17:	45	68.4375	40	39.8438
Length   18, alignment  0/ 0:	36.0938	71.25	31.25	30.9375
Length   18, alignment 18/ 0:	40.625	71.0938	35.625	35.4688
Length   18, alignment  0/18:	40.4688	71.25	35.625	35.3125
Length   18, alignment 18/18:	45	71.0938	40.1562	39.8438
Length   19, alignment  0/ 0:	36.0938	73.9062	31.25	31.0938
Length   19, alignment 19/ 0:	40.4688	74.0625	35.625	35.4688
Length   19, alignment  0/19:	40.4688	73.9062	35.7812	35.4688
Length   19, alignment 19/19:	45	73.9062	40	39.8438
Length   20, alignment  0/ 0:	36.0938	76.7188	31.0938	30.9375
Length   20, alignment 20/ 0:	40.625	76.7188	35.7812	35.625
Length   20, alignment  0/20:	40.4688	76.7188	35.7812	35.4688
Length   20, alignment 20/20:	45	76.5625	40.1562	39.8438
Length   21, alignment  0/ 0:	36.0938	79.5312	31.25	30.9375
Length   21, alignment 21/ 0:	40.625	79.5312	35.625	35.4688
Length   21, alignment  0/21:	40.4688	79.5312	35.625	35.4688
Length   21, alignment 21/21:	45	79.5312	40.1562	40
Length   22, alignment  0/ 0:	36.0938	82.3438	31.0938	30.9375
Length   22, alignment 22/ 0:	40.625	82.3438	35.625	35.4688
Length   22, alignment  0/22:	40.625	82.3438	35.7812	35.4688
Length   22, alignment 22/22:	45	82.1875	40.1562	40
Length   23, alignment  0/ 0:	36.0938	85.1562	31.25	30.9375
Length   23, alignment 23/ 0:	40.4688	85.1562	35.7812	35.625
Length   23, alignment  0/23:	40.4688	85.1562	35.7812	35.4688
Length   23, alignment 23/23:	45	85	40.1562	39.8438
Length   24, alignment  0/ 0:	36.0938	87.8125	31.25	31.0938
Length   24, alignment 24/ 0:	35.4688	87.8125	30.625	30.4688
Length   24, alignment  0/24:	35.4688	87.8125	30.625	30.4688
Length   24, alignment 24/24:	35	87.8125	30.1562	30
Length   25, alignment  0/ 0:	36.0938	90.625	31.0938	30.9375
Length   25, alignment 25/ 0:	40.625	233.906	36.0938	35.3125
Length   25, alignment  0/25:	40.9375	90.7812	35.7812	35.4688
Length   25, alignment 25/25:	45.3125	90.4688	40	39.8438
Length   26, alignment  0/ 0:	36.0938	93.2812	31.0938	30.9375
Length   26, alignment 26/ 0:	40.4688	93.2812	35.625	35.4688
Length   26, alignment  0/26:	40.625	93.2812	35.625	35.4688
Length   26, alignment 26/26:	45	93.2812	40.1562	40
Length   27, alignment  0/ 0:	36.0938	96.0938	31.0938	30.9375
Length   27, alignment 27/ 0:	40.625	96.0938	35.625	35.4688
Length   27, alignment  0/27:	40.625	96.0938	35.7812	35.4688
Length   27, alignment 27/27:	45	96.0938	40	39.8438
Length   28, alignment  0/ 0:	36.0938	98.9062	31.25	31.0938
Length   28, alignment 28/ 0:	40.4688	98.75	35.625	35.625
Length   28, alignment  0/28:	40.4688	98.9062	35.625	35.4688
Length   28, alignment 28/28:	45	99.0625	40.1562	39.8438
Length   29, alignment  0/ 0:	36.25	101.719	31.25	31.0938
Length   29, alignment 29/ 0:	40.4688	101.719	35.625	35.3125
Length   29, alignment  0/29:	40.4688	101.719	35.625	35.4688
Length   29, alignment 29/29:	45	101.719	40	39.8438
Length   30, alignment  0/ 0:	36.0938	104.531	31.0938	31.0938
Length   30, alignment 30/ 0:	40.625	104.375	35.625	35.4688
Length   30, alignment  0/30:	40.625	104.375	35.7812	35.4688
Length   30, alignment 30/30:	45	104.531	40.1562	39.8438
Length   31, alignment  0/ 0:	36.0938	107.344	31.25	30.9375
Length   31, alignment 31/ 0:	40.4688	107.344	35.7812	35.625
Length   31, alignment  0/31:	40.4688	107.188	35.7812	35.4688
Length   31, alignment 31/31:	45	107.344	40	40
Length   48, alignment  0/ 0:	25.4688	154.375	20.9375	20.9375
Length   48, alignment  3/ 0:	44.5312	154.375	39.6875	39.6875
Length   48, alignment  0/ 3:	46.25	154.375	41.25	41.0938
Length   48, alignment  3/ 3:	64.0625	154.375	59.2188	58.9062
Length   80, alignment  0/ 0:	27.9688	243.281	23.125	22.5
Length   80, alignment  5/ 0:	57.5	243.281	53.125	53.125
Length   80, alignment  0/ 5:	56.875	243.281	52.0312	51.875
Length   80, alignment  5/ 5:	87.5	243.281	82.8125	82.5
Length   96, alignment  0/ 0:	27.1875	287.656	22.6562	22.3438
Length   96, alignment  6/ 0:	57.8125	287.656	53.4375	53.2812
Length   96, alignment  0/ 6:	57.1875	287.656	52.0312	51.875
Length   96, alignment  6/ 6:	87.6562	287.656	82.5	82.3438
Length  112, alignment  0/ 0:	33.5938	332.344	29.5312	27.0312
Length  112, alignment  7/ 0:	75.3125	332.188	71.4062	69.6875
Length  112, alignment  0/ 7:	77.6562	332.188	73.5938	71.25
Length  112, alignment  7/ 7:	80.3125	332.188	76.25	74.5312
Length  144, alignment  0/ 0:	33.2812	421.25	29.0625	26.875
Length  144, alignment  9/ 0:	75.3125	421.25	71.0938	69.6875
Length  144, alignment  0/ 9:	99.5312	421.094	95.1562	93.9062
Length  144, alignment  9/ 9:	84.8438	421.25	80.9375	79.5312
Length  160, alignment  0/ 0:	37.5	465.625	33.9062	31.7188
Length  160, alignment 10/ 0:	96.5625	465.625	92.6562	90.9375
Length  160, alignment  0/10:	98.75	465.625	94.8438	92.8125
Length  160, alignment 10/10:	84.8438	465.625	80.9375	79.2188
Length  176, alignment  0/ 0:	37.6562	510.156	33.9062	31.5625
Length  176, alignment 11/ 0:	96.5625	510	92.6562	91.0938
Length  176, alignment  0/11:	98.75	510	95	92.8125
Length  176, alignment 11/11:	84.8438	510	80.9375	79.2188
Length  192, alignment  0/ 0:	37.6562	554.531	33.75	31.5625
Length  192, alignment 12/ 0:	96.4062	554.531	92.6562	91.0938
Length  192, alignment  0/12:	98.75	554.531	94.8438	92.6562
Length  192, alignment 12/12:	84.6875	554.531	232.656	79.6875
Length  208, alignment  0/ 0:	38.2812	598.906	34.6875	31.7188
Length  208, alignment 13/ 0:	97.1875	598.906	92.9688	91.0938
Length  208, alignment  0/13:	123.125	598.906	119.062	123.594
Length  208, alignment 13/13:	90	598.906	85.9375	88.5938
Length  224, alignment  0/ 0:	42.6562	643.438	38.75	40.625
Length  224, alignment 14/ 0:	120.625	643.438	116.719	121.875
Length  224, alignment  0/14:	122.812	643.438	118.75	123.438
Length  224, alignment 14/14:	90	643.438	85.9375	88.4375
Length  240, alignment  0/ 0:	42.8125	687.656	38.9062	40.7812
Length  240, alignment 15/ 0:	120.625	687.969	116.562	121.719
Length  240, alignment  0/15:	122.812	687.812	118.75	123.281
Length  240, alignment 15/15:	90	687.812	85.9375	88.4375
Length  272, alignment  0/ 0:	42.8125	776.719	38.9062	40.625
Length  272, alignment 17/ 0:	120.469	776.719	116.562	121.719
Length  272, alignment  0/17:	147.812	776.875	143.125	141.719
Length  272, alignment 17/17:	95.3125	776.875	91.25	89.8438
Length  288, alignment  0/ 0:	47.9688	821.25	43.9062	41.7188
Length  288, alignment 18/ 0:	144.531	821.25	140.781	138.906
Length  288, alignment  0/18:	146.719	821.25	142.812	140.469
Length  288, alignment 18/18:	95.1562	821.25	91.25	89.375
Length  304, alignment  0/ 0:	47.9688	865.625	44.0625	41.7188
Length  304, alignment 19/ 0:	144.531	865.625	140.625	138.75
Length  304, alignment  0/19:	146.719	865.781	142.812	140.469
Length  304, alignment 19/19:	95.1562	865.625	91.25	89.375
Length  320, alignment  0/ 0:	47.9688	910	44.0625	41.7188
Length  320, alignment 20/ 0:	144.531	910	140.625	138.906
Length  320, alignment  0/20:	146.875	910.156	142.969	140.625
Length  320, alignment 20/20:	95.1562	910	91.25	89.5312
Length  336, alignment  0/ 0:	47.9688	954.531	43.9062	41.7188
Length  336, alignment 21/ 0:	144.531	954.531	140.781	138.906
Length  336, alignment  0/21:	171.562	954.531	167.031	165.312
Length  336, alignment 21/21:	100	954.531	96.0938	94.6875
Length  352, alignment  0/ 0:	52.6562	999.062	48.9062	46.7188
Length  352, alignment 22/ 0:	168.281	998.906	164.531	162.812
Length  352, alignment  0/22:	172.031	998.906	165.312	162.656
Length  352, alignment 22/22:	100.469	999.219	96.25	94.6875
Length  368, alignment  0/ 0:	52.9688	1043.44	49.2188	46.7188
Length  368, alignment 23/ 0:	168.594	1043.91	164.844	162.812
Length  368, alignment  0/23:	170.781	1043.44	166.562	164.375
Length  368, alignment 23/23:	99.8438	1043.59	96.0938	94.5312
Length  384, alignment  0/ 0:	52.8125	1087.97	49.0625	46.5625
Length  384, alignment 24/ 0:	167.656	1087.97	164.062	162.188
Length  384, alignment  0/24:	170	1087.97	165.938	163.906
Length  384, alignment 24/24:	100.625	1087.34	95.9375	93.9062
Length  400, alignment  0/ 0:	53.4375	1132.5	49.375	46.7188
Length  400, alignment 25/ 0:	169.062	1132.34	164.531	163.281
Length  400, alignment  0/25:	195.625	1132.34	190.938	189.375
Length  400, alignment 25/25:	105.312	1132.5	101.094	100
Length  416, alignment  0/ 0:	58.125	1176.72	54.0625	51.7188
Length  416, alignment 26/ 0:	192.656	1176.88	188.281	186.719
Length  416, alignment  0/26:	194.844	1176.88	190.625	188.281
Length  416, alignment 26/26:	105.469	1176.88	101.094	99.375
Length  432, alignment  0/ 0:	58.125	1221.25	54.0625	51.7188
Length  432, alignment 27/ 0:	192.5	1221.41	188.438	186.562
Length  432, alignment  0/27:	194.688	1221.25	190.625	188.438
Length  432, alignment 27/27:	105.312	1221.25	101.094	99.375
Length  448, alignment  0/ 0:	58.125	1265.78	53.9062	51.7188
Length  448, alignment 28/ 0:	192.656	1265.62	188.438	186.719
Length  448, alignment  0/28:	194.844	1265.62	190.625	188.281
Length  448, alignment 28/28:	105.312	1265.62	101.25	99.375
Length  464, alignment  0/ 0:	58.125	1310.16	53.9062	51.7188
Length  464, alignment 29/ 0:	192.5	1311.25	189.062	186.562
Length  464, alignment  0/29:	219.062	1310.31	215.156	212.969
Length  464, alignment 29/29:	110.781	1310.16	106.25	105
Length  480, alignment  0/ 0:	63.2812	1354.69	59.0625	57.1875
Length  480, alignment 30/ 0:	216.406	1354.69	212.344	210.938
Length  480, alignment  0/30:	218.75	1354.53	214.531	212.5
Length  480, alignment 30/30:	110.312	1354.53	106.25	104.844
Length  496, alignment  0/ 0:	63.125	1399.06	59.0625	57.0312
Length  496, alignment 31/ 0:	216.25	1399.06	212.5	210.938
Length  496, alignment  0/31:	218.594	1399.06	215	212.656
Length  496, alignment 31/31:	110.469	1398.91	106.25	104.688
Length 1024, alignment  0/ 0:	107.031	2866.09	103.125	100.938
Length 1024, alignment 32/ 0:	106.875	2863.91	101.406	100.781
Length 1024, alignment  0/32:	106.719	2865.47	102.812	100.625
Length 1024, alignment 32/32:	106.875	2865	102.656	100.156
Length 1056, alignment  0/ 0:	115.781	2954.69	111.875	108.906
Length 1056, alignment 33/ 0:	434.688	2954.69	430.938	428.438
Length 1056, alignment  0/33:	436.875	2954.53	433.125	430.156
Length 1056, alignment 33/33:	159.219	3092.19	155.469	161.094
Length 1088, alignment  0/ 0:	112.031	3043.44	108.75	112.969
Length 1088, alignment 34/ 0:	435	3043.59	430.781	428.281
Length 1088, alignment  0/34:	436.875	3043.59	433.125	430.156
Length 1088, alignment 34/34:	159.219	3043.44	155.469	160.469
Length 1120, alignment  0/ 0:	117.031	3132.34	113.125	117.812
Length 1120, alignment 35/ 0:	458.906	3132.34	455	452.344
Length 1120, alignment  0/35:	461.094	3132.5	457.344	453.906
Length 1120, alignment 35/35:	164.531	3132.34	160.625	157.812
Length 1152, alignment  0/ 0:	117.5	3221.25	113.594	110.156
Length 1152, alignment 36/ 0:	458.906	3221.25	455	452.344
Length 1152, alignment  0/36:	461.094	3221.25	457.344	454.062
Length 1152, alignment 36/36:	164.531	3221.25	160.781	157.812
Length 1184, alignment  0/ 0:	122.344	3445	118.281	115.781
Length 1184, alignment 37/ 0:	482.969	3310.31	479.062	476.25
Length 1184, alignment  0/37:	485	3310.16	481.25	477.812
Length 1184, alignment 37/37:	169.375	3310.16	165.469	163.125
Length 1216, alignment  0/ 0:	122.344	3398.91	118.594	115.312
Length 1216, alignment 38/ 0:	482.812	3399.06	479.062	476.25
Length 1216, alignment  0/38:	485	3398.91	481.25	477.812
Length 1216, alignment 38/38:	169.375	3398.91	165.312	163.125
Length 1248, alignment  0/ 0:	127.344	3487.97	123.594	120.156
Length 1248, alignment 39/ 0:	506.562	3487.97	502.812	500.156
Length 1248, alignment  0/39:	508.906	3487.97	505	501.719
Length 1248, alignment 39/39:	174.531	3487.81	299.062	168.906
Length 1280, alignment  0/ 0:	127.344	3577.19	123.125	120.156
Length 1280, alignment 40/ 0:	506.562	3576.72	502.812	500.156
Length 1280, alignment  0/40:	508.906	3576.72	505	501.719
Length 1280, alignment 40/40:	174.531	3576.72	170.312	168.125
Length 1312, alignment  0/ 0:	132.5	3665.78	128.438	125.156
Length 1312, alignment 41/ 0:	530.625	3665.62	526.719	524.062
Length 1312, alignment  0/41:	532.812	3665.78	529.062	526.094
Length 1312, alignment 41/41:	179.531	3666.56	175.469	172.812
Length 1344, alignment  0/ 0:	132.344	3754.84	128.594	125
Length 1344, alignment 42/ 0:	530.469	3755	526.875	524.062
Length 1344, alignment  0/42:	532.812	3754.84	668.125	526.875
Length 1344, alignment 42/42:	179.531	3755	175.625	173.125
Length 1376, alignment  0/ 0:	137.344	3843.44	133.75	130.156
Length 1376, alignment 43/ 0:	554.375	3843.44	550.625	547.812
Length 1376, alignment  0/43:	556.719	3843.59	552.969	549.531
Length 1376, alignment 43/43:	184.688	3843.44	180.469	178.125
Length 1408, alignment  0/ 0:	137.344	3932.34	133.594	130.156
Length 1408, alignment 44/ 0:	554.375	3932.5	550.625	547.812
Length 1408, alignment  0/44:	556.719	3932.5	552.812	549.688
Length 1408, alignment 44/44:	184.531	3932.34	180.312	178.125
Length 1440, alignment  0/ 0:	142.344	4021.41	138.438	135.312
Length 1440, alignment 45/ 0:	578.438	4158.28	574.844	572.5
Length 1440, alignment  0/45:	580.469	4021.25	576.875	573.281
Length 1440, alignment 45/45:	189.531	4021.56	185.312	183.125
Length 1472, alignment  0/ 0:	142.344	4110.47	138.594	135.156
Length 1472, alignment 46/ 0:	578.281	4110.47	574.531	571.719
Length 1472, alignment  0/46:	580.625	4110.47	576.719	573.438
Length 1472, alignment 46/46:	189.531	4110.47	185.312	183.125
Length 1504, alignment  0/ 0:	147.344	4199.22	143.594	140.156
Length 1504, alignment 47/ 0:	602.188	4199.38	598.438	595.938
Length 1504, alignment  0/47:	604.375	4199.38	600.312	597.344
Length 1504, alignment 47/47:	195.625	4199.38	190.312	188.594
Length 1536, alignment  0/ 0:	147.344	4288.28	143.75	140.312
Length 1536, alignment 48/ 0:	147.344	4288.28	143.594	140.312
Length 1536, alignment  0/48:	148.125	4288.28	144.219	140.781
Length 1536, alignment 48/48:	147.969	4288.28	144.375	141.25
Length 1568, alignment  0/ 0:	153.125	4377.19	149.062	145.938
Length 1568, alignment 49/ 0:	626.094	4377.19	622.344	619.531
Length 1568, alignment  0/49:	628.281	4377.19	624.219	621.094
Length 1568, alignment 49/49:	200.156	4377.19	195.938	193.438
Length 1600, alignment  0/ 0:	152.969	4466.09	148.75	145.469
Length 1600, alignment 50/ 0:	626.094	4466.88	622.031	620.312
Length 1600, alignment  0/50:	628.281	4465.78	624.531	621.406
Length 1600, alignment 50/50:	200	4465.62	195.938	193.594
Length 1632, alignment  0/ 0:	157.969	4554.69	154.844	150.938
Length 1632, alignment 51/ 0:	650	4554.69	646.094	643.438
Length 1632, alignment  0/51:	652.188	4554.69	648.125	645
Length 1632, alignment 51/51:	206.406	4555.16	201.406	202.344
Length 1664, alignment  0/ 0:	157.812	4643.59	153.906	150.781
Length 1664, alignment 52/ 0:	650	4643.59	645.938	643.438
Length 1664, alignment  0/52:	780	4644.22	648.125	645.156
Length 1664, alignment 52/52:	209.219	4644.38	206.25	201.25
Length 1696, alignment  0/ 0:	165.469	4733.12	163.906	158.281
Length 1696, alignment 53/ 0:	673.906	4732.81	669.844	667.344
Length 1696, alignment  0/53:	676.094	4732.81	672.344	669.219
Length 1696, alignment 53/53:	214.062	4732.66	226.719	211.562
Length 1728, alignment  0/ 0:	171.875	4821.41	163.125	159.531
Length 1728, alignment 54/ 0:	673.75	4821.41	669.688	667.188
Length 1728, alignment  0/54:	676.094	4954.69	672.031	669.688
Length 1728, alignment 54/54:	225.781	4821.88	211.094	223.594
Length 1760, alignment  0/ 0:	177.188	4910.47	166.25	163.438
Length 1760, alignment 55/ 0:	697.656	4910.31	693.75	691.094
Length 1760, alignment  0/55:	700	4910.31	695.938	692.812
Length 1760, alignment 55/55:	237.656	4910.31	234.062	230.469
Length 1792, alignment  0/ 0:	170.312	4999.22	167.656	162.188
Length 1792, alignment 56/ 0:	697.812	4999.22	693.594	691.094
Length 1792, alignment  0/56:	700	4999.22	695.781	692.812
Length 1792, alignment 56/56:	235.625	5001.09	232.344	229.375
Length 1824, alignment  0/ 0:	192.031	5088.75	182.5	183.594
Length 1824, alignment 57/ 0:	721.719	5088.28	718.125	715.625
Length 1824, alignment  0/57:	723.906	5088.75	720	716.875
Length 1824, alignment 57/57:	242.5	5088.28	240	235.312
Length 1856, alignment  0/ 0:	185	5177.03	187.031	175.625
Length 1856, alignment 58/ 0:	721.719	5177.03	717.656	715
Length 1856, alignment  0/58:	724.062	5177.34	720	717.344
Length 1856, alignment 58/58:	242.969	5316.56	239.688	236.094
Length 1888, alignment  0/ 0:	187.969	5266.72	188.125	190
Length 1888, alignment 59/ 0:	745.625	5266.41	741.719	739.375
Length 1888, alignment  0/59:	747.969	5266.56	743.594	740.781
Length 1888, alignment 59/59:	250	5266.56	247.031	242.812
Length 1920, alignment  0/ 0:	192.031	5355.16	200.312	180.312
Length 1920, alignment 60/ 0:	745.469	5355.16	741.562	739.219
Length 1920, alignment  0/60:	747.656	5355.16	743.594	740.781
Length 1920, alignment 60/60:	250.625	5355.31	247.5	242.656
Length 1952, alignment  0/ 0:	211.25	5444.53	208.281	203.906
Length 1952, alignment 61/ 0:	769.375	5444.22	765.781	763.281
Length 1952, alignment  0/61:	771.562	5444.38	768.125	765.312
Length 1952, alignment 61/61:	254.688	5443.91	251.875	247.344
Length 1984, alignment  0/ 0:	210.938	5532.81	208.125	203.594
Length 1984, alignment 62/ 0:	769.531	5532.97	765.625	763.281
Length 1984, alignment  0/62:	771.562	5664.53	767.5	765.781
Length 1984, alignment 62/62:	254.531	5533.44	252.188	247.344
Length 2016, alignment  0/ 0:	215.938	5622.03	213.125	208.438
Length 2016, alignment 63/ 0:	793.281	5621.72	789.531	787.969
Length 2016, alignment  0/63:	795.469	5621.72	791.719	789.531
Length 2016, alignment 63/63:	261.875	5621.72	259.531	255.312
Length 65536, alignment  0/ 0:	8698.28	193524	8673.91	16843.6
__memcpy_thunderx	__memcpy_generic
Length 65543, alignment  0/ 0:	8917.5	16964.4
Length 65551, alignment  0/ 3:	27263.1	35761.9
Length 65567, alignment  3/ 0:	26981.2	37039.4
Length 65599, alignment  3/ 5:	27258.1	35762.5
Length 131079, alignment  0/ 0:	17373.8	33872.5
Length 131087, alignment  0/ 3:	54279.4	71325
Length 131103, alignment  3/ 0:	53722.5	72745.6
Length 131135, alignment  3/ 5:	54763.8	71325.6
Length 262151, alignment  0/ 0:	35076.9	67497.5
Length 262159, alignment  0/ 3:	108816	142948
Length 262175, alignment  3/ 0:	107732	145918
Length 262207, alignment  3/ 5:	108326	142461
Length 524295, alignment  0/ 0:	68905	134738
Length 524303, alignment  0/ 3:	216870	285178
Length 524319, alignment  3/ 0:	214163	290869
Length 524351, alignment  3/ 5:	216894	285184
Length 1048583, alignment  0/ 0:	138173	270021
Length 1048591, alignment  0/ 3:	433695	570389
Length 1048607, alignment  3/ 0:	429163	582216
Length 1048639, alignment  3/ 5:	433696	570890
Length 2097159, alignment  0/ 0:	275731	540180
Length 2097167, alignment  0/ 3:	867276	1.14113e+06
Length 2097183, alignment  3/ 0:	857778	1.16346e+06
Length 2097215, alignment  3/ 5:	866851	1.1407e+06
Length 4194311, alignment  0/ 0:	551626	1.08047e+06
Length 4194319, alignment  0/ 3:	1.73384e+06	2.2816e+06
Length 4194335, alignment  3/ 0:	1.71571e+06	2.32717e+06
Length 4194367, alignment  3/ 5:	1.73384e+06	2.2816e+06
Length 8388615, alignment  0/ 0:	1.29659e+06	3.89121e+06
Length 8388623, alignment  0/ 3:	3.52809e+06	6.25292e+06
Length 8388639, alignment  3/ 0:	3.4988e+06	6.3012e+06
Length 8388671, alignment  3/ 5:	3.52861e+06	6.21914e+06
Length 16777223, alignment  0/ 0:	3.72447e+06	1.37534e+07
Length 16777231, alignment  0/ 3:	7.49027e+06	1.86301e+07
Length 16777247, alignment  3/ 0:	7.46343e+06	1.88143e+07
Length 16777279, alignment  3/ 5:	7.49158e+06	1.85915e+07
Length 33554439, alignment  0/ 0:	7.98001e+06	2.81398e+07
Length 33554447, alignment  0/ 3:	1.52274e+07	3.77285e+07
Length 33554463, alignment  3/ 0:	1.51635e+07	3.81515e+07
Length 33554495, alignment  3/ 5:	1.52274e+07	3.77297e+07
__memcpy_thunderx	__memcpy_generic
Memory size   4096:	96434.8	95387.3
Memory size   8192:	94016.1	93048.1
Memory size  16384:	102725	101778
Memory size  32768:	108842	107799
Memory size  65536:	149327	148883
Wainer dos Santos Moschetta - April 2, 2017, 12:01 a.m.
In sysdeps/aarch64/multiarch/memcpy_generic.S, it has:
+#include "../memcpy.S"

Is it OK to use a relative path here, or is it recommended to use the full path starting from sysdeps?

Steve Ellcey - April 6, 2017, 8:48 p.m.
On Sat, 2017-04-01 at 21:01 -0300, Wainer dos Santos Moschetta wrote:
> In sysdeps/aarch64/multiarch/memcpy_generic.S, it has:
> +#include "../memcpy.S"
> 
> Is it OK to use a relative path here, or is it recommended to use the
> full path starting from sysdeps?

I think it's OK.  I don't see any preference listed in the Coding Style
page of the glibc wiki for one way or the other.  I see other includes
of relative paths; the most common one is '#include "../test-skeleton.c"',
but I also see other examples:

sysdeps/sparc/sparc64/multiarch/rtld-memset.c:#include "../rtld-memset.c"
sysdeps/sparc/sparc64/multiarch/rtld-memcpy.c:#include "../rtld-memcpy.c"
sysdeps/wordsize-64/ftw.c:#include "../../io/ftw.c"
sysdeps/wordsize-64/fts.c:#include "../../io/fts.c"
sysdeps/unix/sysv/linux/sparc/sparc64/xstat.c:#include "../../i386/xstat.c"
sysdeps/unix/sysv/linux/sparc/sparc64/fxstat.c:#include "../../i386/fxstat.c"
sysdeps/unix/sysv/linux/sparc/sparc64/fxstatat.c:#include "../../i386/fxstatat.c"
sysdeps/unix/sysv/linux/sparc/sparc64/lxstat.c:#include "../../i386/lxstat.c"
sysdeps/unix/sysv/linux/aarch64/readelflib.c:#include "../arm/readelflib.c"
sysdeps/unix/sysv/linux/wordsize-64/statvfs.c:#include "../statvfs.c"
sysdeps/unix/sysv/linux/wordsize-64/getdirentries.c:#include "../getdirentries.c"
sysdeps/unix/sysv/linux/wordsize-64/fstatvfs.c:#include "../fstatvfs.c"
sysdeps/unix/sysv/linux/wordsize-64/aio_write.c:#include "../../../../pthread/aio_write.c"
sysdeps/unix/sysv/linux/wordsize-64/openat.c:#include "../openat.c"

That seems more common than using:

sysdeps/unix/sysv/linux/s390/s390-32/updwtmp.c:#include "sysdeps/gnu/updwtmp.c"
sysdeps/unix/sysv/linux/s390/s390-32/getutmp.c:#include "sysdeps/gnu/getutmp.c"
sysdeps/x86_64/fpu/e_sqrtl.c:#include "sysdeps/i386/fpu/e_sqrtl.c"
sysdeps/x86_64/fpu/e_atan2l.c:#include "sysdeps/i386/fpu/e_atan2l.c"
sysdeps/x86_64/fpu/s_atanl.c:#include "sysdeps/i386/fpu/s_atanl.c"
sysdeps/x86_64/fpu/e_acosl.c:#include "sysdeps/i386/fpu/e_acosl.c"

Steve Ellcey
Siddhesh Poyarekar - May 2, 2017, 3:51 a.m.
On Saturday 25 March 2017 04:55 AM, Steve Ellcey wrote:
> If people think we should use the ThunderX version of memcpy for all
> aarch64 systems I am happy to drop this patch and create one that just
> changes memcpy.S to do the ThunderX style prefetches for all aarch64
> systems.

That could be done as an add-on if we find out that it is the case.
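
For readers following along, the ThunderX-style prefetch being discussed
can be sketched in C roughly as below.  This is only an illustration
under stated assumptions, not the code in the patch: sketch_copy_long
and SKETCH_PREFETCH_MIN are hypothetical names, __builtin_prefetch is a
GCC/Clang builtin, and the inner 64-byte copy is delegated to memcpy
for brevity.  It mirrors the two ideas in the assembly: the prefetching
loop is used only for copies of at least 32768 bytes, and a prefetch is
issued on every other 64-byte iteration because the ThunderX cache line
is 128 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Threshold taken from the patch: below this size the prefetching
   loop is skipped.  */
#define SKETCH_PREFETCH_MIN 32768

/* Hypothetical helper, not part of the patch.  Copies N bytes in
   64-byte chunks and, for large copies, prefetches 512 bytes ahead
   on every other iteration.  */
static void *
sketch_copy_long (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  int prefetch = n >= SKETCH_PREFETCH_MIN;

  while (n >= 64)
    {
      /* Bit 6 of the source address flips every 64 bytes, so this
         test succeeds on alternate iterations, in the same way as
         "tbz src, #6, 1f" skips the prfm when the bit is clear.  */
      if (prefetch && ((uintptr_t) s & 64) != 0)
        __builtin_prefetch ((const void *) ((uintptr_t) s + 512), 0, 0);

      memcpy (d, s, 64);
      d += 64;
      s += 64;
      n -= 64;
    }

  memcpy (d, s, n);  /* Remaining tail of fewer than 64 bytes.  */
  return dst;
}
```

A chip with a 64-byte cache line would presumably drop the bit-6 test
and prefetch on every iteration, which is the tuning question raised
in the original message.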

The patch looks good to me with the formatting fixups I have specified
inline.
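
As a rough illustration of the "single char indent" fixups, assuming
the usual glibc convention of one space after the '#' per level of
preprocessor nesting, the name-override macros would presumably end up
looking like the sketch below.  sketch_copy is a hypothetical caller
added only so the fragment compiles on its own.

```c
#include <stddef.h>
#include <string.h>

/* Nested directives get one space after '#' per nesting level,
   rather than the two spaces used in the posted patch.  */
#ifndef MEMMOVE
# define MEMMOVE memmove
#endif
#ifndef MEMCPY
# define MEMCPY memcpy
#endif

/* Hypothetical caller: uses whichever name MEMCPY resolved to.  A
   wrapper file can define MEMCPY before including the shared source
   to build the same code under a different symbol name.  */
static void *
sketch_copy (void *dst, const void *src, size_t n)
{
  return MEMCPY (dst, src, n);
}
```

The same one-space rule would apply to the '#define' and '#undef'
lines that sit inside '#if IS_IN (libc)' in the new multiarch files.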

Siddhesh

> 2017-03-24  Steve Ellcey  <sellcey@caviumnetworks.com>
> 
> 	* sysdeps/aarch64/memcpy.S (MEMMOVE, MEMCPY): New macros.
> 	(memmove): Use MEMMOVE for name.
> 	(memcpy): Use MEMCPY for name.  Add loop with prefetching
> 	under USE_THUNDERX macro.
> 	* sysdeps/aarch64/multiarch/Makefile: New file.
> 	* sysdeps/aarch64/multiarch/ifunc-impl-list.c: Likewise.
> 	* sysdeps/aarch64/multiarch/init-arch.h: Likewise.
> 	* sysdeps/aarch64/multiarch/memcpy.c: Likewise.
> 	* sysdeps/aarch64/multiarch/memcpy_generic.S: Likewise.
> 	* sysdeps/aarch64/multiarch/memcpy_thunderx.S: Likewise.
> 	* sysdeps/aarch64/multiarch/memmove.c: Likewise.
> 
> 
> ifunc.patch
> 
> 
> diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
> index 29af8b1..74444b4 100644
> --- a/sysdeps/aarch64/memcpy.S
> +++ b/sysdeps/aarch64/memcpy.S
> @@ -59,7 +59,14 @@
>     Overlapping large forward memmoves use a loop that copies backwards.
>  */
>  
> -ENTRY_ALIGN (memmove, 6)
> +#ifndef MEMMOVE
> +#  define MEMMOVE memmove

Single char indent.

> +#endif
> +#ifndef MEMCPY
> +#  define MEMCPY memcpy

Likewise.

> +#endif
> +
> +ENTRY_ALIGN (MEMMOVE, 6)
>  
>  	DELOUSE (0)
>  	DELOUSE (1)
> @@ -71,9 +78,9 @@ ENTRY_ALIGN (memmove, 6)
>  	b.lo	L(move_long)
>  
>  	/* Common case falls through into memcpy.  */
> -END (memmove)
> -libc_hidden_builtin_def (memmove)
> -ENTRY (memcpy)
> +END (MEMMOVE)
> +libc_hidden_builtin_def (MEMMOVE)
> +ENTRY (MEMCPY)
>  
>  	DELOUSE (0)
>  	DELOUSE (1)
> @@ -158,10 +165,22 @@ L(copy96):
>  
>  	.p2align 4
>  L(copy_long):
> +
> +#ifdef USE_THUNDERX
> +
> +	/* On thunderx, large memcpy's are helped by software prefetching.
> +	   This loop is identical to the one below it but with prefetching
> +	   instructions included.  For loops that are less than 32768 bytes,
> +	   the prefetching does not help and slows the code down so we only
> +	   use the prefetching loop for the largest memcpys.  */
> +
> +	cmp	count, #32768
> +	b.lo	L(copy_long_without_prefetch)
>  	and	tmp1, dstin, 15
>  	bic	dst, dstin, 15
>  	ldp	D_l, D_h, [src]
>  	sub	src, src, tmp1
> +	prfm	pldl1strm, [src, 384]
>  	add	count, count, tmp1	/* Count is now 16 too large.  */
>  	ldp	A_l, A_h, [src, 16]
>  	stp	D_l, D_h, [dstin]
> @@ -169,7 +188,10 @@ L(copy_long):
>  	ldp	C_l, C_h, [src, 48]
>  	ldp	D_l, D_h, [src, 64]!
>  	subs	count, count, 128 + 16	/* Test and readjust count.  */
> -	b.ls	2f
> +
> +L(prefetch_loop64):
> +	tbz	src, #6, 1f
> +	prfm	pldl1strm, [src, 512]
>  1:
>  	stp	A_l, A_h, [dst, 16]
>  	ldp	A_l, A_h, [src, 16]
> @@ -180,12 +202,40 @@ L(copy_long):
>  	stp	D_l, D_h, [dst, 64]!
>  	ldp	D_l, D_h, [src, 64]!
>  	subs	count, count, 64
> -	b.hi	1b
> +	b.hi	L(prefetch_loop64)
> +	b	L(last64)
> +
> +L(copy_long_without_prefetch):
> +#endif
> +
> +	and	tmp1, dstin, 15
> +	bic	dst, dstin, 15
> +	ldp	D_l, D_h, [src]
> +	sub	src, src, tmp1
> +	add	count, count, tmp1	/* Count is now 16 too large.  */
> +	ldp	A_l, A_h, [src, 16]
> +	stp	D_l, D_h, [dstin]
> +	ldp	B_l, B_h, [src, 32]
> +	ldp	C_l, C_h, [src, 48]
> +	ldp	D_l, D_h, [src, 64]!
> +	subs	count, count, 128 + 16	/* Test and readjust count.  */
> +	b.ls	L(last64)
> +L(loop64):
> +	stp	A_l, A_h, [dst, 16]
> +	ldp	A_l, A_h, [src, 16]
> +	stp	B_l, B_h, [dst, 32]
> +	ldp	B_l, B_h, [src, 32]
> +	stp	C_l, C_h, [dst, 48]
> +	ldp	C_l, C_h, [src, 48]
> +	stp	D_l, D_h, [dst, 64]!
> +	ldp	D_l, D_h, [src, 64]!
> +	subs	count, count, 64
> +	b.hi	L(loop64)
>  
>  	/* Write the last full set of 64 bytes.  The remainder is at most 64
>  	   bytes, so it is safe to always copy 64 bytes from the end even if
>  	   there is just 1 byte left.  */
> -2:
> +L(last64):
>  	ldp	E_l, E_h, [srcend, -64]
>  	stp	A_l, A_h, [dst, 16]
>  	ldp	A_l, A_h, [srcend, -48]
> @@ -256,5 +306,5 @@ L(move_long):
>  	stp	C_l, C_h, [dstin]
>  3:	ret
>  
> -END (memcpy)
> -libc_hidden_builtin_def (memcpy)
> +END (MEMCPY)
> +libc_hidden_builtin_def (MEMCPY)
> diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
> index e69de29..78d52c7 100644
> --- a/sysdeps/aarch64/multiarch/Makefile
> +++ b/sysdeps/aarch64/multiarch/Makefile
> @@ -0,0 +1,3 @@
> +ifeq ($(subdir),string)
> +sysdep_routines += memcpy_generic memcpy_thunderx
> +endif
> diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> index e69de29..c4f23df 100644
> --- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> @@ -0,0 +1,51 @@
> +/* Enumerate available IFUNC implementations of a function.  AARCH64 version.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <assert.h>
> +#include <string.h>
> +#include <wchar.h>
> +#include <ldsodefs.h>
> +#include <ifunc-impl-list.h>
> +#include <init-arch.h>
> +#include <stdio.h>
> +
> +/* Maximum number of IFUNC implementations.  */
> +#define MAX_IFUNC	2
> +
> +size_t
> +__libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> +			size_t max)
> +{
> +  assert (max >= MAX_IFUNC);
> +
> +  size_t i = 0;
> +
> +  INIT_ARCH ();
> +
> +  /* Support sysdeps/aarch64/multiarch/memcpy.c and memmove.c.  */
> +  IFUNC_IMPL (i, name, memcpy,
> +	      IFUNC_IMPL_ADD (array, i, memcpy, IS_THUNDERX (midr),
> +			      __memcpy_thunderx)
> +	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
> +  IFUNC_IMPL (i, name, memmove,
> +	      IFUNC_IMPL_ADD (array, i, memmove, IS_THUNDERX (midr),
> +			      __memmove_thunderx)
> +	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
> +
> +  return i;
> +}
> diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
> index e69de29..e690e00 100644
> --- a/sysdeps/aarch64/multiarch/init-arch.h
> +++ b/sysdeps/aarch64/multiarch/init-arch.h
> @@ -0,0 +1,22 @@
> +/* This file is part of the GNU C Library.

One line description of the file.

> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <ldsodefs.h>
> +
> +#define INIT_ARCH()				\
> +  uint64_t __attribute__((unused)) midr =	\
> +    GLRO(dl_aarch64_cpu_features).midr_el1;
> diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
> index e69de29..4e3f251 100644
> --- a/sysdeps/aarch64/multiarch/memcpy.c
> +++ b/sysdeps/aarch64/multiarch/memcpy.c
> @@ -0,0 +1,39 @@
> +/* Multiple versions of memcpy. AARCH64 version.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Define multiple versions only for the definition in libc.  */
> +
> +#if IS_IN (libc)
> +/* Redefine memcpy so that the compiler won't complain about the type
> +   mismatch with the IFUNC selector in strong_alias, below.  */
> +# undef memcpy
> +# define memcpy __redirect_memcpy
> +# include <string.h>
> +# include <init-arch.h>
> +
> +extern __typeof (__redirect_memcpy) __libc_memcpy;
> +
> +extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
> +extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
> +
> +libc_ifunc (__libc_memcpy,
> +            IS_THUNDERX (midr) ? __memcpy_thunderx : __memcpy_generic);
> +
> +#undef memcpy

Single char indent.

> +strong_alias (__libc_memcpy, memcpy);
> +#endif
> diff --git a/sysdeps/aarch64/multiarch/memcpy_generic.S b/sysdeps/aarch64/multiarch/memcpy_generic.S
> index e69de29..50e1a1c 100644
> --- a/sysdeps/aarch64/multiarch/memcpy_generic.S
> +++ b/sysdeps/aarch64/multiarch/memcpy_generic.S
> @@ -0,0 +1,42 @@
> +/* A Generic Optimized memcpy implementation for AARCH64.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* The actual memcpy and memmove code is in ../memcpy.S.  If we are
> +   building libc this file defines __memcpy_generic and __memmove_generic.
> +   Otherwise the include of ../memcpy.S will define the normal __memcpy
> +   and __memmove entry points.  */
> +
> +#include <sysdep.h>
> +
> +#if IS_IN (libc)
> +
> +#define MEMCPY __memcpy_generic
> +#define MEMMOVE __memmove_generic
> +
> +/* Do not hide the generic versions of memcpy and memmove, we use them
> +   internally.  */
> +#undef libc_hidden_builtin_def
> +#define libc_hidden_builtin_def(name)
> +
> +/* It doesn't make sense to send libc-internal memcpy calls through a PLT. */
> +	.globl __GI_memcpy; __GI_memcpy = __memcpy_generic
> +	.globl __GI_memmove; __GI_memmove = __memmove_generic

Single char indent for all macro defs.

> +
> +#endif
> +
> +#include "../memcpy.S"
> diff --git a/sysdeps/aarch64/multiarch/memcpy_thunderx.S b/sysdeps/aarch64/multiarch/memcpy_thunderx.S
> index e69de29..ee971c8 100644
> --- a/sysdeps/aarch64/multiarch/memcpy_thunderx.S
> +++ b/sysdeps/aarch64/multiarch/memcpy_thunderx.S
> @@ -0,0 +1,32 @@
> +/* A Thunderx Optimized memcpy implementation for AARCH64.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* The actual thunderx optimized code is in ../memcpy.S under the USE_THUNDERX
> +   ifdef.  If we are not building libc then we do not build anything when
> +   compiling this file and __memcpy is defined by memcpy_generic.S.  */
> +
> +#include <sysdep.h>
> +
> +#if IS_IN (libc)
> +
> +#define MEMCPY __memcpy_thunderx
> +#define MEMMOVE __memmove_thunderx
> +#define USE_THUNDERX
> +#include "../memcpy.S"

Single char indent for all macro defs.

> +
> +#endif
> diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
> index e69de29..8d7a146 100644
> --- a/sysdeps/aarch64/multiarch/memmove.c
> +++ b/sysdeps/aarch64/multiarch/memmove.c
> @@ -0,0 +1,39 @@
> +/* Multiple versions of memmove. AARCH64 version.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Define multiple versions only for the definition in libc.  */
> +
> +#if IS_IN (libc)
> +/* Redefine memmove so that the compiler won't complain about the type
> +   mismatch with the IFUNC selector in strong_alias, below.  */
> +# undef memmove
> +# define memmove __redirect_memmove
> +# include <string.h>
> +# include <init-arch.h>
> +
> +extern __typeof (__redirect_memmove) __libc_memmove;
> +
> +extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
> +extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
> +
> +libc_ifunc (__libc_memmove,
> +            IS_THUNDERX (midr) ? __memmove_thunderx : __memmove_generic);
> +
> +#undef memmove

Single char indent.

> +strong_alias (__libc_memmove, memmove);
> +#endif
>
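
As a side note on the selectors reviewed above, the choice made by
libc_ifunc in memcpy.c and memmove.c amounts to picking a function
pointer from the MIDR value.  A minimal sketch follows; it is not the
glibc implementation.  All SKETCH_ and sketch_ names are hypothetical,
and the definition of SKETCH_IS_THUNDERX is an assumption, since
IS_THUNDERX itself is not part of this patch: it assumes the MIDR_EL1
layout with the implementer in bits [31:24] and the part number in
bits [15:4], an implementer code of 'C' (0x43) for Cavium and a part
number of 0x0a1 for ThunderX.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed MIDR_EL1 field layout and ThunderX identification; these
   values are assumptions of this sketch, not taken from the patch.  */
#define SKETCH_MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 0xff)
#define SKETCH_MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfff)
#define SKETCH_IS_THUNDERX(midr) \
  (SKETCH_MIDR_IMPLEMENTOR (midr) == 'C' \
   && SKETCH_MIDR_PARTNUM (midr) == 0x0a1)

typedef void *(*sketch_memcpy_fn) (void *, const void *, size_t);

/* Stand-in for __memcpy_generic: a plain byte loop.  */
static void *
sketch_memcpy_generic (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n-- > 0)
    *d++ = *s++;
  return dst;
}

/* Stand-in for __memcpy_thunderx: defers to the library memcpy.  */
static void *
sketch_memcpy_thunderx (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Same shape as the expression passed to libc_ifunc in the patch.  */
static sketch_memcpy_fn
sketch_select_memcpy (uint64_t midr)
{
  return SKETCH_IS_THUNDERX (midr)
	 ? sketch_memcpy_thunderx : sketch_memcpy_generic;
}
```

In the real patch the selection runs once, when the dynamic linker
resolves the IFUNC symbol, so callers pay no per-call dispatch cost;
the sketch only shows the decision itself.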

Patch

diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
index 29af8b1..74444b4 100644
--- a/sysdeps/aarch64/memcpy.S
+++ b/sysdeps/aarch64/memcpy.S
@@ -59,7 +59,14 @@ 
    Overlapping large forward memmoves use a loop that copies backwards.
 */
 
-ENTRY_ALIGN (memmove, 6)
+#ifndef MEMMOVE
+#  define MEMMOVE memmove
+#endif
+#ifndef MEMCPY
+#  define MEMCPY memcpy
+#endif
+
+ENTRY_ALIGN (MEMMOVE, 6)
 
 	DELOUSE (0)
 	DELOUSE (1)
@@ -71,9 +78,9 @@  ENTRY_ALIGN (memmove, 6)
 	b.lo	L(move_long)
 
 	/* Common case falls through into memcpy.  */
-END (memmove)
-libc_hidden_builtin_def (memmove)
-ENTRY (memcpy)
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+ENTRY (MEMCPY)
 
 	DELOUSE (0)
 	DELOUSE (1)
@@ -158,10 +165,22 @@  L(copy96):
 
 	.p2align 4
 L(copy_long):
+
+#ifdef USE_THUNDERX
+
+	/* On thunderx, large memcpy's are helped by software prefetching.
+	   This loop is identical to the one below it but with prefetching
+	   instructions included.  For loops that are less than 32768 bytes,
+	   the prefetching does not help and slows the code down so we only
+	   use the prefetching loop for the largest memcpys.  */
+
+	cmp	count, #32768
+	b.lo	L(copy_long_without_prefetch)
 	and	tmp1, dstin, 15
 	bic	dst, dstin, 15
 	ldp	D_l, D_h, [src]
 	sub	src, src, tmp1
+	prfm	pldl1strm, [src, 384]
 	add	count, count, tmp1	/* Count is now 16 too large.  */
 	ldp	A_l, A_h, [src, 16]
 	stp	D_l, D_h, [dstin]
@@ -169,7 +188,10 @@  L(copy_long):
 	ldp	C_l, C_h, [src, 48]
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 128 + 16	/* Test and readjust count.  */
-	b.ls	2f
+
+L(prefetch_loop64):
+	tbz	src, #6, 1f
+	prfm	pldl1strm, [src, 512]
 1:
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [src, 16]
@@ -180,12 +202,40 @@  L(copy_long):
 	stp	D_l, D_h, [dst, 64]!
 	ldp	D_l, D_h, [src, 64]!
 	subs	count, count, 64
-	b.hi	1b
+	b.hi	L(prefetch_loop64)
+	b	L(last64)
+
+L(copy_long_without_prefetch):
+#endif
+
+	and	tmp1, dstin, 15
+	bic	dst, dstin, 15
+	ldp	D_l, D_h, [src]
+	sub	src, src, tmp1
+	add	count, count, tmp1	/* Count is now 16 too large.  */
+	ldp	A_l, A_h, [src, 16]
+	stp	D_l, D_h, [dstin]
+	ldp	B_l, B_h, [src, 32]
+	ldp	C_l, C_h, [src, 48]
+	ldp	D_l, D_h, [src, 64]!
+	subs	count, count, 128 + 16	/* Test and readjust count.  */
+	b.ls	L(last64)
+L(loop64):
+	stp	A_l, A_h, [dst, 16]
+	ldp	A_l, A_h, [src, 16]
+	stp	B_l, B_h, [dst, 32]
+	ldp	B_l, B_h, [src, 32]
+	stp	C_l, C_h, [dst, 48]
+	ldp	C_l, C_h, [src, 48]
+	stp	D_l, D_h, [dst, 64]!
+	ldp	D_l, D_h, [src, 64]!
+	subs	count, count, 64
+	b.hi	L(loop64)
 
 	/* Write the last full set of 64 bytes.  The remainder is at most 64
 	   bytes, so it is safe to always copy 64 bytes from the end even if
 	   there is just 1 byte left.  */
-2:
+L(last64):
 	ldp	E_l, E_h, [srcend, -64]
 	stp	A_l, A_h, [dst, 16]
 	ldp	A_l, A_h, [srcend, -48]
@@ -256,5 +306,5 @@  L(move_long):
 	stp	C_l, C_h, [dstin]
 3:	ret
 
-END (memcpy)
-libc_hidden_builtin_def (memcpy)
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index e69de29..78d52c7 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -0,0 +1,3 @@ 
+ifeq ($(subdir),string)
+sysdep_routines += memcpy_generic memcpy_thunderx
+endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index e69de29..c4f23df 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -0,0 +1,51 @@ 
+/* Enumerate available IFUNC implementations of a function.  AARCH64 version.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <string.h>
+#include <wchar.h>
+#include <ldsodefs.h>
+#include <ifunc-impl-list.h>
+#include <init-arch.h>
+#include <stdio.h>
+
+/* Maximum number of IFUNC implementations.  */
+#define MAX_IFUNC	2
+
+size_t
+__libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+			size_t max)
+{
+  assert (max >= MAX_IFUNC);
+
+  size_t i = 0;
+
+  INIT_ARCH ();
+
+  /* Support sysdeps/aarch64/multiarch/memcpy.c and memmove.c.  */
+  IFUNC_IMPL (i, name, memcpy,
+	      IFUNC_IMPL_ADD (array, i, memcpy, IS_THUNDERX (midr),
+			      __memcpy_thunderx)
+	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
+  IFUNC_IMPL (i, name, memmove,
+	      IFUNC_IMPL_ADD (array, i, memmove, IS_THUNDERX (midr),
+			      __memmove_thunderx)
+	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
+
+  return i;
+}
diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
index e69de29..e690e00 100644
--- a/sysdeps/aarch64/multiarch/init-arch.h
+++ b/sysdeps/aarch64/multiarch/init-arch.h
@@ -0,0 +1,22 @@ 
+/* This file is part of the GNU C Library.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <ldsodefs.h>
+
+#define INIT_ARCH()				\
+  uint64_t __attribute__((unused)) midr =	\
+    GLRO(dl_aarch64_cpu_features).midr_el1;
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index e69de29..4e3f251 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -0,0 +1,39 @@ 
+/* Multiple versions of memcpy. AARCH64 version.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+
+#if IS_IN (libc)
+/* Redefine memcpy so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef memcpy
+# define memcpy __redirect_memcpy
+# include <string.h>
+# include <init-arch.h>
+
+extern __typeof (__redirect_memcpy) __libc_memcpy;
+
+extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
+extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
+
+libc_ifunc (__libc_memcpy,
+            IS_THUNDERX (midr) ? __memcpy_thunderx : __memcpy_generic);
+
+#undef memcpy
+strong_alias (__libc_memcpy, memcpy);
+#endif
diff --git a/sysdeps/aarch64/multiarch/memcpy_generic.S b/sysdeps/aarch64/multiarch/memcpy_generic.S
index e69de29..50e1a1c 100644
--- a/sysdeps/aarch64/multiarch/memcpy_generic.S
+++ b/sysdeps/aarch64/multiarch/memcpy_generic.S
@@ -0,0 +1,42 @@ 
+/* A Generic Optimized memcpy implementation for AARCH64.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* The actual memcpy and memmove code is in ../memcpy.S.  If we are
+   building libc this file defines __memcpy_generic and __memmove_generic.
+   Otherwise the include of ../memcpy.S will define the normal __memcpy
+   and __memmove entry points.  */
+
+#include <sysdep.h>
+
+#if IS_IN (libc)
+
+#define MEMCPY __memcpy_generic
+#define MEMMOVE __memmove_generic
+
+/* Do not hide the generic versions of memcpy and memmove; we use them
+   internally.  */
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+/* It doesn't make sense to send libc-internal memcpy calls through a PLT.  */
+	.globl __GI_memcpy; __GI_memcpy = __memcpy_generic
+	.globl __GI_memmove; __GI_memmove = __memmove_generic
+
+#endif
+
+#include "../memcpy.S"
diff --git a/sysdeps/aarch64/multiarch/memcpy_thunderx.S b/sysdeps/aarch64/multiarch/memcpy_thunderx.S
index e69de29..ee971c8 100644
--- a/sysdeps/aarch64/multiarch/memcpy_thunderx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_thunderx.S
@@ -0,0 +1,32 @@ 
+/* A ThunderX Optimized memcpy implementation for AARCH64.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* The actual ThunderX-optimized code is in ../memcpy.S under the USE_THUNDERX
+   ifdef.  If we are not building libc then we do not build anything when
+   compiling this file and __memcpy is defined by memcpy_generic.S.  */
+
+#include <sysdep.h>
+
+#if IS_IN (libc)
+
+#define MEMCPY __memcpy_thunderx
+#define MEMMOVE __memmove_thunderx
+#define USE_THUNDERX
+#include "../memcpy.S"
+
+#endif
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index e69de29..8d7a146 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -0,0 +1,39 @@ 
+/* Multiple versions of memmove. AARCH64 version.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+
+#if IS_IN (libc)
+/* Redefine memmove so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef memmove
+# define memmove __redirect_memmove
+# include <string.h>
+# include <init-arch.h>
+
+extern __typeof (__redirect_memmove) __libc_memmove;
+
+extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
+extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
+
+libc_ifunc (__libc_memmove,
+            IS_THUNDERX (midr) ? __memmove_thunderx : __memmove_generic);
+
+# undef memmove
+strong_alias (__libc_memmove, memmove);
+#endif
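
For readers unfamiliar with the selection step in memcpy.c and memmove.c
above, here is a minimal, portable sketch of the idea: decode the MIDR
value once and return one of two implementations. It is not the code in
this patch. The real IS_THUNDERX macro is defined in a header that is not
shown in this part of the diff, so SKETCH_IS_THUNDERX, select_memcpy,
copy_generic, copy_thunderx and the sample MIDR values are all assumptions
made for illustration. The field positions follow the architectural
MIDR_EL1 layout (implementer in bits 31:24, part number in bits 15:4).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: field extraction per the MIDR_EL1 layout.  */
#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 0xff)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfff)

/* Assumption: implementer 'C' (0x43) and part number 0x0a1 identify
   ThunderX.  The macro actually used by the patch may differ.  */
#define SKETCH_IS_THUNDERX(midr) \
  (MIDR_IMPLEMENTOR (midr) == 'C' && MIDR_PARTNUM (midr) == 0x0a1)

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Stand-ins for __memcpy_generic and __memcpy_thunderx.  Both simply
   forward to memcpy here; only the selection logic is of interest.  */
static void *
copy_generic (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

static void *
copy_thunderx (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Same shape as the ternary passed to libc_ifunc in the patch.  */
static memcpy_fn
select_memcpy (uint64_t midr)
{
  return SKETCH_IS_THUNDERX (midr) ? copy_thunderx : copy_generic;
}

/* Usage example with two sample MIDR values (assumed, not taken from
   the patch): one with implementer 0x43 / part 0x0a1, and one with
   implementer 0x41 / part 0xd07.  */
static void
self_check (void)
{
  char dst[6] = { 0 };
  memcpy_fn fn = select_memcpy (UINT64_C (0x431f0a10));

  assert (fn == copy_thunderx);
  assert (select_memcpy (UINT64_C (0x411fd070)) == copy_generic);

  fn (dst, "hello", sizeof dst);
  assert (memcmp (dst, "hello", sizeof dst) == 0);
}
```

In the patch itself the choice is made by the dynamic linker when the
IFUNC symbol is resolved, so callers pay no per-call dispatch cost. The
sketch returns a plain function pointer only so that it builds without
IFUNC support.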