[PATCHv2] powerpc: P9 vector load instruction change in memcpy and memmove

Message ID 20171019182056.11179-1-tuliom@linux.vnet.ibm.com
State Superseded

Commit Message

Tulio Magno Quites Machado Filho Oct. 19, 2017, 6:20 p.m. UTC
  From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>

Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:

> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
> (it is buried behind a sign-up wall at [1]), both lxvd2x/lvx and stxvd2x/stvx
> use the same pipeline and have the same latency and throughput.  The
> only difference is that lxvd2x/stxvd2x have microcode handling for the
> unaligned case and for 4k crossings or 32-byte-crossing L1 misses (which
> should not occur with aligned addresses).
>
> Why not change POWER7 implementation instead of dropping another one
> which is exactly the same for POWER9?

We're trying to limit the impact of this requirement on other processors, so
that future P7 or P8 optimizations can still benefit from lxvd2x and stxvd2x.

However, we can avoid source code duplication with the LVX and STVX macros
I propose here in version 2.
That way, we postpone the duplication until/unless a new P7 optimization is
contributed.

Do you think it's better?

--- 8< ---

POWER9 DD2.1 and earlier have an issue where some cache-inhibited
vector loads trap to the kernel, causing a performance degradation.  To
handle this in memcpy and memmove, lvx/stvx is used for aligned
addresses instead of lxvd2x/stxvd2x.  The rest of the
optimization remains the same as the existing POWER7 code.

Reference: https://patchwork.ozlabs.org/patch/814059/
Tested on powerpc64le.

2017-10-19  Rajalakshmi Srinivasaraghavan  <raji@linux.vnet.ibm.com>
	    Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>

	* sysdeps/powerpc/powerpc64/multiarch/Makefile
	(sysdep_routines): Add memcpy_power9 and memmove_power9.
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
	(memcpy): Add __memcpy_power9 to list of memcpy functions.
	(memmove): Add __memmove_power9 to list of memmove functions.
	(bcopy): Add __bcopy_power9 to list of bcopy functions.
	* sysdeps/powerpc/powerpc64/multiarch/memcpy.c
	(memcpy): Add __memcpy_power9 to ifunc list.
	* sysdeps/powerpc/powerpc64/power9/memcpy.S: New file.
	* sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/bcopy.c
	(bcopy): Add __bcopy_power9 to ifunc list.
	* sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S
	Change bcopy as __bcopy.
	* sysdeps/powerpc/powerpc64/multiarch/memmove.c
	(memmove): Add __memmove_power9 to ifunc list.
	* sysdeps/powerpc/powerpc64/power7/memcpy.S (LVX, STVX): New
	macros to help reuse this code on POWER9.
	* sysdeps/powerpc/powerpc64/power7/memmove.S:
	Alias bcopy only if not defined before.
	(LVX, STVX): New macros to help reuse this code on POWER9.
	* sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S:
	New file.
	* sysdeps/powerpc/powerpc64/power9/memmove.S: Likewise.
---
 sysdeps/powerpc/powerpc64/multiarch/Makefile       |   7 +-
 sysdeps/powerpc/powerpc64/multiarch/bcopy.c        |   6 +-
 .../powerpc/powerpc64/multiarch/ifunc-impl-list.c  |   6 +
 .../powerpc/powerpc64/multiarch/memcpy-power9.S    |  26 ++++
 sysdeps/powerpc/powerpc64/multiarch/memcpy.c       |   3 +
 .../powerpc/powerpc64/multiarch/memmove-power7.S   |   4 +-
 .../powerpc/powerpc64/multiarch/memmove-power9.S   |  29 +++++
 sysdeps/powerpc/powerpc64/multiarch/memmove.c      |   5 +-
 sysdeps/powerpc/powerpc64/power7/memcpy.S          |  68 ++++++-----
 sysdeps/powerpc/powerpc64/power7/memmove.S         | 134 +++++++++++----------
 sysdeps/powerpc/powerpc64/power9/memcpy.S          |  23 ++++
 sysdeps/powerpc/powerpc64/power9/memmove.S         |  22 ++++
 12 files changed, 230 insertions(+), 103 deletions(-)
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S
 create mode 100644 sysdeps/powerpc/powerpc64/power9/memcpy.S
 create mode 100644 sysdeps/powerpc/powerpc64/power9/memmove.S
  

Comments

Adhemerval Zanella Netto Oct. 19, 2017, 6:33 p.m. UTC | #1
On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
> 
> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> 
>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>> uses the same pipeline, have the same latency and same throughput.  The
>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>> occur in the with aligned address).
>>
>> Why not change POWER7 implementation instead of dropping another one
>> which is exactly the same for POWER9?
> 
> We're trying to limit the impact of this requirement on other processors so
> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
> 
> However, we could avoid source code duplication with the macros LVX and STVX
> I propose here in version 2.
> That way, we will postpone the copy to when/if a P7 optimization is
> contributed.

And what exactly would the benefit be? For this specific case, the current
code already does only aligned accesses, so it does not really matter whether
you use VSX or VMX instructions. If I recall correctly, both lxvd2x/lvx
and stxvd2x/stvx show the same latency and throughput on POWER7 as well.

I see no gain in adding this POWER9-specific variant when you could adjust
the POWER7 one instead.


> 
> Do you think it's better?
> 
> --- 8< ---
> 
> POWER9 DD2.1 and earlier has an issue where some cache inhibited
> vector load traps to the kernel, causing a performance degradation.  To
> handle this in memcpy and memmove, lvx/stvx is used for aligned
> addresses instead of lxvd2x/stxvd2x.  The remaining part of the
> optimization remains same as existing POWER7 code.
> 
> Reference: https://patchwork.ozlabs.org/patch/814059/
> Tested on powerpc64le.
> 
> 2017-10-19  Rajalakshmi Srinivasaraghavan  <raji@linux.vnet.ibm.com>
> 	    Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>
> 
> 	* sysdeps/powerpc/powerpc64/multiarch/Makefile
> 	(sysdep_routines): Add memcpy_power9 and memmove_power9.
> 	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> 	(memcpy): Add __memcpy_power9 to list of memcpy functions.
> 	(memmove): Add __memmove_power9 to list of memmove functions.
> 	(bcopy): Add __bcopy_power9 to list of bcopy functions.
> 	* sysdeps/powerpc/powerpc64/multiarch/memcpy.c
> 	(memcpy): Add __memcpy_power9 to ifunc list.
> 	* sysdeps/powerpc/powerpc64/power9/memcpy.S: New File.
> 	* sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S: Likewise.
>  	* sysdeps/powerpc/powerpc64/multiarch/bcopy.c
> 	(bcopy): Add __bcopy_power9 to ifunc list.
> 	* sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S
> 	Change bcopy as __bcopy.
> 	* sysdeps/powerpc/powerpc64/multiarch/memmove.c
> 	(memmove): Add __memmove_power9 to ifunc list.
> 	* sysdeps/powerpc/powerpc64/power7/memcpy.S (LVX, STVX): New
> 	macros to help reuse this code on POWER9.
> 	* sysdeps/powerpc/powerpc64/power7/memmove.S:
> 	Alias bcopy only if not defined before.
> 	(LVX, STVX): New macros to help reuse this code on POWER9.
> 	* sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S:
> 	New file.
> 	* sysdeps/powerpc/powerpc64/power9/memmove.S: Likewise.
> ---
>  sysdeps/powerpc/powerpc64/multiarch/Makefile       |   7 +-
>  sysdeps/powerpc/powerpc64/multiarch/bcopy.c        |   6 +-
>  .../powerpc/powerpc64/multiarch/ifunc-impl-list.c  |   6 +
>  .../powerpc/powerpc64/multiarch/memcpy-power9.S    |  26 ++++
>  sysdeps/powerpc/powerpc64/multiarch/memcpy.c       |   3 +
>  .../powerpc/powerpc64/multiarch/memmove-power7.S   |   4 +-
>  .../powerpc/powerpc64/multiarch/memmove-power9.S   |  29 +++++
>  sysdeps/powerpc/powerpc64/multiarch/memmove.c      |   5 +-
>  sysdeps/powerpc/powerpc64/power7/memcpy.S          |  68 ++++++-----
>  sysdeps/powerpc/powerpc64/power7/memmove.S         | 134 +++++++++++----------
>  sysdeps/powerpc/powerpc64/power9/memcpy.S          |  23 ++++
>  sysdeps/powerpc/powerpc64/power9/memmove.S         |  22 ++++
>  12 files changed, 230 insertions(+), 103 deletions(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S
>  create mode 100644 sysdeps/powerpc/powerpc64/power9/memcpy.S
>  create mode 100644 sysdeps/powerpc/powerpc64/power9/memmove.S
> 
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> index dea49ac..82728fa 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> @@ -1,6 +1,6 @@
>  ifeq ($(subdir),string)
> -sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
> -		   memcpy-power4 memcpy-ppc64 \
> +sysdep_routines += memcpy-power9 memcpy-power7 memcpy-a2 memcpy-power6 \
> +		   memcpy-cell memcpy-power4 memcpy-ppc64 \
>  		   memcmp-power8 memcmp-power7 memcmp-power4 memcmp-ppc64 \
>  		   memset-power7 memset-power6 memset-power4 \
>  		   memset-ppc64 memset-power8 \
> @@ -24,7 +24,8 @@ sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
>  		   stpncpy-power8 stpncpy-power7 stpncpy-ppc64 \
>  		   strcmp-power9 strcmp-power8 strcmp-power7 strcmp-ppc64 \
>  		   strcat-power8 strcat-power7 strcat-ppc64 \
> -		   memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
> +		   memmove-power9 memmove-power7 memmove-ppc64 \
> +		   wordcopy-ppc64 bcopy-ppc64 \
>  		   strncpy-power8 strstr-power7 strstr-ppc64 \
>  		   strspn-power8 strspn-ppc64 strcspn-power8 strcspn-ppc64 \
>  		   strlen-power8 strcasestr-power8 strcasestr-ppc64 \
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/bcopy.c b/sysdeps/powerpc/powerpc64/multiarch/bcopy.c
> index 05d46e2..4a4ee6e 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/bcopy.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/bcopy.c
> @@ -22,8 +22,12 @@
>  extern __typeof (bcopy) __bcopy_ppc attribute_hidden;
>  /* __bcopy_power7 symbol is implemented at memmove-power7.S  */
>  extern __typeof (bcopy) __bcopy_power7 attribute_hidden;
> +/* __bcopy_power9 symbol is implemented at memmove-power9.S.  */
> +extern __typeof (bcopy) __bcopy_power9 attribute_hidden;
>  
>  libc_ifunc (bcopy,
> -            (hwcap & PPC_FEATURE_HAS_VSX)
> +	    (hwcap2 & PPC_FEATURE2_ARCH_3_00)
> +	    ? __bcopy_power9
> +	    : (hwcap & PPC_FEATURE_HAS_VSX)
>              ? __bcopy_power7
>              : __bcopy_ppc);
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> index 6a88536..9040bbc 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> @@ -51,6 +51,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  #ifdef SHARED
>    /* Support sysdeps/powerpc/powerpc64/multiarch/memcpy.c.  */
>    IFUNC_IMPL (i, name, memcpy,
> +	      IFUNC_IMPL_ADD (array, i, memcpy, hwcap2 & PPC_FEATURE2_ARCH_3_00,
> +			      __memcpy_power9)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, hwcap & PPC_FEATURE_HAS_VSX,
>  			      __memcpy_power7)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, hwcap & PPC_FEATURE_ARCH_2_06,
> @@ -65,6 +67,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  
>    /* Support sysdeps/powerpc/powerpc64/multiarch/memmove.c.  */
>    IFUNC_IMPL (i, name, memmove,
> +	      IFUNC_IMPL_ADD (array, i, memmove, hwcap2 & PPC_FEATURE2_ARCH_3_00,
> +			      __memmove_power9)
>  	      IFUNC_IMPL_ADD (array, i, memmove, hwcap & PPC_FEATURE_HAS_VSX,
>  			      __memmove_power7)
>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_ppc))
> @@ -168,6 +172,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  
>    /* Support sysdeps/powerpc/powerpc64/multiarch/bcopy.c.  */
>    IFUNC_IMPL (i, name, bcopy,
> +	      IFUNC_IMPL_ADD (array, i, bcopy, hwcap2 & PPC_FEATURE2_ARCH_3_00,
> +			      __bcopy_power9)
>  	      IFUNC_IMPL_ADD (array, i, bcopy, hwcap & PPC_FEATURE_HAS_VSX,
>  			      __bcopy_power7)
>  	      IFUNC_IMPL_ADD (array, i, bcopy, 1, __bcopy_ppc))
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S b/sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S
> new file mode 100644
> index 0000000..fbd0788
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S
> @@ -0,0 +1,26 @@
> +/* Optimized memcpy implementation for PowerPC/POWER9.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +#define MEMCPY __memcpy_power9
> +
> +#undef libc_hidden_builtin_def
> +#define libc_hidden_builtin_def(name)
> +
> +#include <sysdeps/powerpc/powerpc64/power9/memcpy.S>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memcpy.c b/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
> index 9f4286c..4c16fa0 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
> @@ -35,8 +35,11 @@ extern __typeof (__redirect_memcpy) __memcpy_cell attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_power6 attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_a2 attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_power7 attribute_hidden;
> +extern __typeof (__redirect_memcpy) __memcpy_power9 attribute_hidden;
>  
>  libc_ifunc (__libc_memcpy,
> +	   (hwcap2 & PPC_FEATURE2_ARCH_3_00)
> +	   ? __memcpy_power9 :
>              (hwcap & PPC_FEATURE_HAS_VSX)
>              ? __memcpy_power7 :
>  	      (hwcap & PPC_FEATURE_ARCH_2_06)
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S
> index a9435fa..0599a39 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S
> @@ -23,7 +23,7 @@
>  #undef libc_hidden_builtin_def
>  #define libc_hidden_builtin_def(name)
>  
> -#undef bcopy
> -#define bcopy __bcopy_power7
> +#undef __bcopy
> +#define __bcopy __bcopy_power7
>  
>  #include <sysdeps/powerpc/powerpc64/power7/memmove.S>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S
> new file mode 100644
> index 0000000..16a2267
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S
> @@ -0,0 +1,29 @@
> +/* Optimized memmove implementation for PowerPC64/POWER9.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +#define MEMMOVE __memmove_power9
> +
> +#undef libc_hidden_builtin_def
> +#define libc_hidden_builtin_def(name)
> +
> +#undef __bcopy
> +#define __bcopy __bcopy_power9
> +
> +#include <sysdeps/powerpc/powerpc64/power9/memmove.S>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove.c b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> index db2bbc7..f02498e 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
> @@ -31,9 +31,12 @@ extern __typeof (__redirect_memmove) __libc_memmove;
>  
>  extern __typeof (__redirect_memmove) __memmove_ppc attribute_hidden;
>  extern __typeof (__redirect_memmove) __memmove_power7 attribute_hidden;
> +extern __typeof (__redirect_memmove) __memmove_power9 attribute_hidden;
>  
>  libc_ifunc (__libc_memmove,
> -            (hwcap & PPC_FEATURE_HAS_VSX)
> +	    (hwcap2 & PPC_FEATURE2_ARCH_3_00)
> +	    ? __memmove_power9
> +	    : (hwcap & PPC_FEATURE_HAS_VSX)
>              ? __memmove_power7
>              : __memmove_ppc);
>  
> diff --git a/sysdeps/powerpc/powerpc64/power7/memcpy.S b/sysdeps/powerpc/powerpc64/power7/memcpy.S
> index 1ccbc2e..aea1224 100644
> --- a/sysdeps/powerpc/powerpc64/power7/memcpy.S
> +++ b/sysdeps/powerpc/powerpc64/power7/memcpy.S
> @@ -27,6 +27,10 @@
>  # define MEMCPY memcpy
>  #endif
>  
> +#define LVX lxvd2x
> +#define STVX stxvd2x
> +
> +
>  #define dst 11		/* Use r11 so r3 kept unchanged.  */
>  #define src 4
>  #define cnt 5
> @@ -91,63 +95,63 @@ L(aligned_copy):
>  	srdi	12,cnt,7
>  	cmpdi	12,0
>  	beq	L(aligned_tail)
> -	lxvd2x	6,0,src
> -	lxvd2x	7,src,6
> +	LVX	6,0,src
> +	LVX	7,src,6
>  	mtctr	12
>  	b	L(aligned_128loop)
>  
>  	.align  4
>  L(aligned_128head):
>  	/* for the 2nd + iteration of this loop. */
> -	lxvd2x	6,0,src
> -	lxvd2x	7,src,6
> +	LVX	6,0,src
> +	LVX	7,src,6
>  L(aligned_128loop):
> -	lxvd2x	8,src,7
> -	lxvd2x	9,src,8
> -	stxvd2x	6,0,dst
> +	LVX	8,src,7
> +	LVX	9,src,8
> +	STVX	6,0,dst
>  	addi	src,src,64
> -	stxvd2x	7,dst,6
> -	stxvd2x	8,dst,7
> -	stxvd2x	9,dst,8
> -	lxvd2x	6,0,src
> -	lxvd2x	7,src,6
> +	STVX	7,dst,6
> +	STVX	8,dst,7
> +	STVX	9,dst,8
> +	LVX	6,0,src
> +	LVX	7,src,6
>  	addi	dst,dst,64
> -	lxvd2x	8,src,7
> -	lxvd2x	9,src,8
> +	LVX	8,src,7
> +	LVX	9,src,8
>  	addi	src,src,64
> -	stxvd2x	6,0,dst
> -	stxvd2x	7,dst,6
> -	stxvd2x	8,dst,7
> -	stxvd2x	9,dst,8
> +	STVX	6,0,dst
> +	STVX	7,dst,6
> +	STVX	8,dst,7
> +	STVX	9,dst,8
>  	addi	dst,dst,64
>  	bdnz	L(aligned_128head)
>  
>  L(aligned_tail):
>  	mtocrf	0x01,cnt
>  	bf	25,32f
> -	lxvd2x	6,0,src
> -	lxvd2x	7,src,6
> -	lxvd2x	8,src,7
> -	lxvd2x	9,src,8
> +	LVX	6,0,src
> +	LVX	7,src,6
> +	LVX	8,src,7
> +	LVX	9,src,8
>  	addi	src,src,64
> -	stxvd2x	6,0,dst
> -	stxvd2x	7,dst,6
> -	stxvd2x	8,dst,7
> -	stxvd2x	9,dst,8
> +	STVX	6,0,dst
> +	STVX	7,dst,6
> +	STVX	8,dst,7
> +	STVX	9,dst,8
>  	addi	dst,dst,64
>  32:
>  	bf	26,16f
> -	lxvd2x	6,0,src
> -	lxvd2x	7,src,6
> +	LVX	6,0,src
> +	LVX	7,src,6
>  	addi	src,src,32
> -	stxvd2x	6,0,dst
> -	stxvd2x	7,dst,6
> +	STVX	6,0,dst
> +	STVX	7,dst,6
>  	addi	dst,dst,32
>  16:
>  	bf	27,8f
> -	lxvd2x	6,0,src
> +	LVX	6,0,src
>  	addi	src,src,16
> -	stxvd2x	6,0,dst
> +	STVX	6,0,dst
>  	addi	dst,dst,16
>  8:
>  	bf	28,4f
> diff --git a/sysdeps/powerpc/powerpc64/power7/memmove.S b/sysdeps/powerpc/powerpc64/power7/memmove.S
> index 93baa69..253f541 100644
> --- a/sysdeps/powerpc/powerpc64/power7/memmove.S
> +++ b/sysdeps/powerpc/powerpc64/power7/memmove.S
> @@ -30,6 +30,10 @@
>  #ifndef MEMMOVE
>  # define MEMMOVE memmove
>  #endif
> +
> +#define LVX lxvd2x
> +#define STVX stxvd2x
> +
>  	.machine power7
>  ENTRY_TOCLESS (MEMMOVE, 5)
>  	CALL_MCOUNT 3
> @@ -92,63 +96,63 @@ L(aligned_copy):
>  	srdi	12,r5,7
>  	cmpdi	12,0
>  	beq	L(aligned_tail)
> -	lxvd2x	6,0,r4
> -	lxvd2x	7,r4,6
> +	LVX	6,0,r4
> +	LVX	7,r4,6
>  	mtctr	12
>  	b	L(aligned_128loop)
>  
>  	.align  4
>  L(aligned_128head):
>  	/* for the 2nd + iteration of this loop. */
> -	lxvd2x	6,0,r4
> -	lxvd2x	7,r4,6
> +	LVX	6,0,r4
> +	LVX	7,r4,6
>  L(aligned_128loop):
> -	lxvd2x	8,r4,7
> -	lxvd2x	9,r4,8
> -	stxvd2x	6,0,r11
> +	LVX	8,r4,7
> +	LVX	9,r4,8
> +	STVX	6,0,r11
>  	addi	r4,r4,64
> -	stxvd2x	7,r11,6
> -	stxvd2x	8,r11,7
> -	stxvd2x	9,r11,8
> -	lxvd2x	6,0,r4
> -	lxvd2x	7,r4,6
> +	STVX	7,r11,6
> +	STVX	8,r11,7
> +	STVX	9,r11,8
> +	LVX	6,0,r4
> +	LVX	7,r4,6
>  	addi	r11,r11,64
> -	lxvd2x	8,r4,7
> -	lxvd2x	9,r4,8
> +	LVX	8,r4,7
> +	LVX	9,r4,8
>  	addi	r4,r4,64
> -	stxvd2x	6,0,r11
> -	stxvd2x	7,r11,6
> -	stxvd2x	8,r11,7
> -	stxvd2x	9,r11,8
> +	STVX	6,0,r11
> +	STVX	7,r11,6
> +	STVX	8,r11,7
> +	STVX	9,r11,8
>  	addi	r11,r11,64
>  	bdnz	L(aligned_128head)
>  
>  L(aligned_tail):
>  	mtocrf	0x01,r5
>  	bf	25,32f
> -	lxvd2x	6,0,r4
> -	lxvd2x	7,r4,6
> -	lxvd2x	8,r4,7
> -	lxvd2x	9,r4,8
> +	LVX	6,0,r4
> +	LVX	7,r4,6
> +	LVX	8,r4,7
> +	LVX	9,r4,8
>  	addi	r4,r4,64
> -	stxvd2x	6,0,r11
> -	stxvd2x	7,r11,6
> -	stxvd2x	8,r11,7
> -	stxvd2x	9,r11,8
> +	STVX	6,0,r11
> +	STVX	7,r11,6
> +	STVX	8,r11,7
> +	STVX	9,r11,8
>  	addi	r11,r11,64
>  32:
>  	bf	26,16f
> -	lxvd2x	6,0,r4
> -	lxvd2x	7,r4,6
> +	LVX	6,0,r4
> +	LVX	7,r4,6
>  	addi	r4,r4,32
> -	stxvd2x	6,0,r11
> -	stxvd2x	7,r11,6
> +	STVX	6,0,r11
> +	STVX	7,r11,6
>  	addi	r11,r11,32
>  16:
>  	bf	27,8f
> -	lxvd2x	6,0,r4
> +	LVX	6,0,r4
>  	addi	r4,r4,16
> -	stxvd2x	6,0,r11
> +	STVX	6,0,r11
>  	addi	r11,r11,16
>  8:
>  	bf	28,4f
> @@ -488,63 +492,63 @@ L(aligned_copy_bwd):
>  	srdi	r12,r5,7
>  	cmpdi	r12,0
>  	beq	L(aligned_tail_bwd)
> -	lxvd2x	v6,r4,r6
> -	lxvd2x	v7,r4,r7
> +	LVX	v6,r4,r6
> +	LVX	v7,r4,r7
>  	mtctr	12
>  	b	L(aligned_128loop_bwd)
>  
>  	.align  4
>  L(aligned_128head_bwd):
>  	/* for the 2nd + iteration of this loop. */
> -	lxvd2x	v6,r4,r6
> -	lxvd2x	v7,r4,r7
> +	LVX	v6,r4,r6
> +	LVX	v7,r4,r7
>  L(aligned_128loop_bwd):
> -	lxvd2x	v8,r4,r8
> -	lxvd2x	v9,r4,r9
> -	stxvd2x	v6,r11,r6
> +	LVX	v8,r4,r8
> +	LVX	v9,r4,r9
> +	STVX	v6,r11,r6
>  	subi	r4,r4,64
> -	stxvd2x	v7,r11,r7
> -	stxvd2x	v8,r11,r8
> -	stxvd2x	v9,r11,r9
> -	lxvd2x	v6,r4,r6
> -	lxvd2x	v7,r4,7
> +	STVX	v7,r11,r7
> +	STVX	v8,r11,r8
> +	STVX	v9,r11,r9
> +	LVX	v6,r4,r6
> +	LVX	v7,r4,7
>  	subi	r11,r11,64
> -	lxvd2x	v8,r4,r8
> -	lxvd2x	v9,r4,r9
> +	LVX	v8,r4,r8
> +	LVX	v9,r4,r9
>  	subi	r4,r4,64
> -	stxvd2x	v6,r11,r6
> -	stxvd2x	v7,r11,r7
> -	stxvd2x	v8,r11,r8
> -	stxvd2x	v9,r11,r9
> +	STVX	v6,r11,r6
> +	STVX	v7,r11,r7
> +	STVX	v8,r11,r8
> +	STVX	v9,r11,r9
>  	subi	r11,r11,64
>  	bdnz	L(aligned_128head_bwd)
>  
>  L(aligned_tail_bwd):
>  	mtocrf	0x01,r5
>  	bf	25,32f
> -	lxvd2x	v6,r4,r6
> -	lxvd2x	v7,r4,r7
> -	lxvd2x	v8,r4,r8
> -	lxvd2x	v9,r4,r9
> +	LVX	v6,r4,r6
> +	LVX	v7,r4,r7
> +	LVX	v8,r4,r8
> +	LVX	v9,r4,r9
>  	subi	r4,r4,64
> -	stxvd2x	v6,r11,r6
> -	stxvd2x	v7,r11,r7
> -	stxvd2x	v8,r11,r8
> -	stxvd2x	v9,r11,r9
> +	STVX	v6,r11,r6
> +	STVX	v7,r11,r7
> +	STVX	v8,r11,r8
> +	STVX	v9,r11,r9
>  	subi	r11,r11,64
>  32:
>  	bf	26,16f
> -	lxvd2x	v6,r4,r6
> -	lxvd2x	v7,r4,r7
> +	LVX	v6,r4,r6
> +	LVX	v7,r4,r7
>  	subi	r4,r4,32
> -	stxvd2x	v6,r11,r6
> -	stxvd2x	v7,r11,r7
> +	STVX	v6,r11,r6
> +	STVX	v7,r11,r7
>  	subi	r11,r11,32
>  16:
>  	bf	27,8f
> -	lxvd2x	v6,r4,r6
> +	LVX	v6,r4,r6
>  	subi	r4,r4,16
> -	stxvd2x	v6,r11,r6
> +	STVX	v6,r11,r6
>  	subi	r11,r11,16
>  8:
>  	bf	28,4f
> @@ -832,4 +836,6 @@ ENTRY_TOCLESS (__bcopy)
>  	mr	r4,r6
>  	b	L(_memmove)
>  END (__bcopy)
> +#ifndef __bcopy
>  weak_alias (__bcopy, bcopy)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc64/power9/memcpy.S b/sysdeps/powerpc/powerpc64/power9/memcpy.S
> new file mode 100644
> index 0000000..d827cdf
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/power9/memcpy.S
> @@ -0,0 +1,23 @@
> +/* Optimized memcpy implementation for PowerPC64/POWER9.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +/* Avoid unnecessary traps on cache-inhibited memory on POWER9 DD2.1.  */
> +#define LVX lvx
> +#define STVX stvx
> +
> +#include <sysdeps/powerpc/powerpc64/power7/memcpy.S>
> diff --git a/sysdeps/powerpc/powerpc64/power9/memmove.S b/sysdeps/powerpc/powerpc64/power9/memmove.S
> new file mode 100644
> index 0000000..2c5887e
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/power9/memmove.S
> @@ -0,0 +1,22 @@
> +/* Optimized memmove implementation for PowerPC64/POWER9.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#define LVX lvx
> +#define STVX stvx
> +
> +#include <sysdeps/powerpc/powerpc64/power7/memmove.S>
>
  
Tulio Magno Quites Machado Filho Oct. 19, 2017, 6:48 p.m. UTC | #2
Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:

> On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
>> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
>> 
>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>> 
>>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>>> uses the same pipeline, have the same latency and same throughput.  The
>>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>>> occur in the with aligned address).
>>>
>>> Why not change POWER7 implementation instead of dropping another one
>>> which is exactly the same for POWER9?
>> 
>> We're trying to limit the impact of this requirement on other processors so
>> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
>> 
>> However, we could avoid source code duplication with the macros LVX and STVX
>> I propose here in version 2.
>> That way, we will postpone the copy to when/if a P7 optimization is
>> contributed.
>
> And which benefit will be exactly? For this specific case current code 
> already only does aligned accesses, so it does not really matter whether 
> you use VSX or VMX instruction. If I recall correctly, both lxv2dx/lvx 
> and stxvd2x/stvx shows the same latency and throughput also for POWER7.  
>
> I see no gain on using this POWER9 specific case where you could adjust
> POWER7 one.

There are no gains now.  The problem arises when contributing a new
optimization, e.g. a memcpy optimization for POWER8 using lxvd2x or stxvd2x.

If POWER9 doesn't have its own implementation, this problem will appear again.
  
Adhemerval Zanella Netto Oct. 19, 2017, 8:19 p.m. UTC | #3
On 19/10/2017 16:48, Tulio Magno Quites Machado Filho wrote:
> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> 
>> On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
>>> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
>>>
>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>
>>>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>>>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>>>> uses the same pipeline, have the same latency and same throughput.  The
>>>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>>>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>>>> occur in the with aligned address).
>>>>
>>>> Why not change POWER7 implementation instead of dropping another one
>>>> which is exactly the same for POWER9?
>>>
>>> We're trying to limit the impact of this requirement on other processors so
>>> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
>>>
>>> However, we could avoid source code duplication with the macros LVX and STVX
>>> I propose here in version 2.
>>> That way, we will postpone the copy to when/if a P7 optimization is
>>> contributed.
>>
>> And which benefit will be exactly? For this specific case current code 
>> already only does aligned accesses, so it does not really matter whether 
>> you use VSX or VMX instruction. If I recall correctly, both lxv2dx/lvx 
>> and stxvd2x/stvx shows the same latency and throughput also for POWER7.  
>>
>> I see no gain on using this POWER9 specific case where you could adjust
>> POWER7 one.
> 
> There are no gains now.  The problem arises when contributing a new
> optimization, e.g. a memcpy optimization for POWER8 using lxv2dx or stxvd2x.
> 
> If POWER9 doesn't have its own implementation, this problem will appear again.
> 

I think if eventually a POWER8 optimization could not be used as is for POWER9,
then a new ifunc variant would make sense.  But for the current variant, I still
think a much simpler solution (in code size and maintainability) would be to
just adapt the POWER7 variant to use VMX instructions.
  
Carlos O'Donell Oct. 19, 2017, 9:06 p.m. UTC | #4
On 10/19/2017 01:19 PM, Adhemerval Zanella wrote:
> 
> 
> On 19/10/2017 16:48, Tulio Magno Quites Machado Filho wrote:
>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>
>>> On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
>>>> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
>>>>
>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>
>>>>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>>>>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>>>>> uses the same pipeline, have the same latency and same throughput.  The
>>>>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>>>>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>>>>> occur in the with aligned address).
>>>>>
>>>>> Why not change POWER7 implementation instead of dropping another one
>>>>> which is exactly the same for POWER9?
>>>>
>>>> We're trying to limit the impact of this requirement on other processors so
>>>> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
>>>>
>>>> However, we could avoid source code duplication with the macros LVX and STVX
>>>> I propose here in version 2.
>>>> That way, we will postpone the copy to when/if a P7 optimization is
>>>> contributed.
>>>
>>> And which benefit will be exactly? For this specific case current code 
>>> already only does aligned accesses, so it does not really matter whether 
>>> you use VSX or VMX instruction. If I recall correctly, both lxv2dx/lvx 
>>> and stxvd2x/stvx shows the same latency and throughput also for POWER7.  
>>>
>>> I see no gain on using this POWER9 specific case where you could adjust
>>> POWER7 one.
>>
>> There are no gains now.  The problem arises when contributing a new
>> optimization, e.g. a memcpy optimization for POWER8 using lxv2dx or stxvd2x.
>>
>> If POWER9 doesn't have its own implementation, this problem will appear again.
>>
> 
> I think if eventually a POWER8 optimization could not be used as is for POWER9,
> then a new ifunc variant would make sense.  But I still think we current
> variant, a much simpler solutions (in code sense and maintainability) would be
> to just adapt POWER7 variant to use VMX instructions.
 
We are arguing about taste and style here, about duplication versus functionality.
I would leave it up to the machine maintainer to decide how best to move forward.

Tulio knows, and may not be able to say, if there are future optimizations coming
down the line. So we lack a clear picture for deciding on this issue of duplication.

My opinion is that I would *rather* see a POWER7 version that is just for POWER7,
and a POWER8 or POWER9 version that is *just* for POWER8 or POWER9.

The separation of the files allows for simpler incremental distro testing of the
changes without needing to revalidate the POWER7 code again.
  
Adhemerval Zanella Netto Oct. 19, 2017, 9:41 p.m. UTC | #5
On 19/10/2017 19:06, Carlos O'Donell wrote:
> On 10/19/2017 01:19 PM, Adhemerval Zanella wrote:
>>
>>
>> On 19/10/2017 16:48, Tulio Magno Quites Machado Filho wrote:
>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>
>>>> On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
>>>>> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
>>>>>
>>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>>
>>>>>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>>>>>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>>>>>> uses the same pipeline, have the same latency and same throughput.  The
>>>>>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>>>>>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>>>>>> occur in the with aligned address).
>>>>>>
>>>>>> Why not change POWER7 implementation instead of dropping another one
>>>>>> which is exactly the same for POWER9?
>>>>>
>>>>> We're trying to limit the impact of this requirement on other processors so
>>>>> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
>>>>>
>>>>> However, we could avoid source code duplication with the macros LVX and STVX
>>>>> I propose here in version 2.
>>>>> That way, we will postpone the copy to when/if a P7 optimization is
>>>>> contributed.
>>>>
>>>> And which benefit will be exactly? For this specific case current code 
>>>> already only does aligned accesses, so it does not really matter whether 
>>>> you use VSX or VMX instruction. If I recall correctly, both lxv2dx/lvx 
>>>> and stxvd2x/stvx shows the same latency and throughput also for POWER7.  
>>>>
>>>> I see no gain on using this POWER9 specific case where you could adjust
>>>> POWER7 one.
>>>
>>> There are no gains now.  The problem arises when contributing a new
>>> optimization, e.g. a memcpy optimization for POWER8 using lxv2dx or stxvd2x.
>>>
>>> If POWER9 doesn't have its own implementation, this problem will appear again.
>>>
>>
>> I think if eventually a POWER8 optimization could not be used as is for POWER9,
>> then a new ifunc variant would make sense.  But I still think we current
>> variant, a much simpler solutions (in code sense and maintainability) would be
>> to just adapt POWER7 variant to use VMX instructions.
>  
> We are arguing about taste and style here. About duplication versus functionality.
> I would leave it up the machine maintainer to decide how best to move forward.
> 
> Tulio knows, and may not be able to say, if there are future optimizations coming
> down the line. So we lack a clear picture for deciding on this issue of duplication.
> 
> My opinion is that I would *rather* see a POWER7 version that is just for POWER7,
> and a POWER8 or POWER9 version that is *just* for POWER8 or POWER9.
> 
> The separation of the files allows for simpler incremental distro testing of the
> changes without needing to revalidate the POWER7 code again.

I agree with you if the case of the new file justifies a new implementation,
for instance by using new ISA instructions, a new strategy (such
as unaligned memory accesses vs. aligned ones), hoisting some internal checks
which can lead to different implementations (such as the aarch64 memset),
or avoiding some chip limitation (such as the different selections on x86
for Intel and AMD chips).

However, for this *specific* case there is absolutely no gain in adding a
similar copy for POWER9 where the same implementation will work perfectly
fine on POWER7.  And my position is based on the provided information:
the new implementation is to fix an *issue* within a chip revision.

Now, Tulio told me that the idea is indeed to add a POWER8 optimization,
but even if the idea is to have a POWER8-specialized implementation,
it does not prevent glibc from selecting the POWER7 memcpy for POWER7 and
POWER9 (if that is the case, a better name for the memcpy implementation
would be, for instance, __memcpy_vsx_aligned).

In fact, I think having fewer ifunc implementations is indeed better
for testing; in the memcpy case, for instance, a developer would not
need an actual POWER9 to validate the algorithm's correctness.
This might not be the best strategy for incremental testing if the
idea is to backport to distros, but even then I think having the minimum
required ifunc variants is still a better way forward.
  
Carlos O'Donell Oct. 19, 2017, 10:12 p.m. UTC | #6
On 10/19/2017 02:41 PM, Adhemerval Zanella wrote:
> 
> 
> On 19/10/2017 19:06, Carlos O'Donell wrote:
>> On 10/19/2017 01:19 PM, Adhemerval Zanella wrote:
>>>
>>>
>>> On 19/10/2017 16:48, Tulio Magno Quites Machado Filho wrote:
>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>
>>>>> On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
>>>>>> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
>>>>>>
>>>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>>>
>>>>>>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>>>>>>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>>>>>>> uses the same pipeline, have the same latency and same throughput.  The
>>>>>>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>>>>>>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>>>>>>> occur in the with aligned address).
>>>>>>>
>>>>>>> Why not change POWER7 implementation instead of dropping another one
>>>>>>> which is exactly the same for POWER9?
>>>>>>
>>>>>> We're trying to limit the impact of this requirement on other processors so
>>>>>> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
>>>>>>
>>>>>> However, we could avoid source code duplication with the macros LVX and STVX
>>>>>> I propose here in version 2.
>>>>>> That way, we will postpone the copy to when/if a P7 optimization is
>>>>>> contributed.
>>>>>
>>>>> And which benefit will be exactly? For this specific case current code 
>>>>> already only does aligned accesses, so it does not really matter whether 
>>>>> you use VSX or VMX instruction. If I recall correctly, both lxv2dx/lvx 
>>>>> and stxvd2x/stvx shows the same latency and throughput also for POWER7.  
>>>>>
>>>>> I see no gain on using this POWER9 specific case where you could adjust
>>>>> POWER7 one.
>>>>
>>>> There are no gains now.  The problem arises when contributing a new
>>>> optimization, e.g. a memcpy optimization for POWER8 using lxv2dx or stxvd2x.
>>>>
>>>> If POWER9 doesn't have its own implementation, this problem will appear again.
>>>>
>>>
>>> I think if eventually a POWER8 optimization could not be used as is for POWER9,
>>> then a new ifunc variant would make sense.  But I still think we current
>>> variant, a much simpler solutions (in code sense and maintainability) would be
>>> to just adapt POWER7 variant to use VMX instructions.
>>  
>> We are arguing about taste and style here. About duplication versus functionality.
>> I would leave it up the machine maintainer to decide how best to move forward.
>>
>> Tulio knows, and may not be able to say, if there are future optimizations coming
>> down the line. So we lack a clear picture for deciding on this issue of duplication.
>>
>> My opinion is that I would *rather* see a POWER7 version that is just for POWER7,
>> and a POWER8 or POWER9 version that is *just* for POWER8 or POWER9.
>>
>> The separation of the files allows for simpler incremental distro testing of the
>> changes without needing to revalidate the POWER7 code again.
> 
> I agree with you if the case of the new file justify a new implementation
> for instance by either using new ISA instructions, a new strategy (such 
> as unaligned memory access vs aligned ones), to hoist some internal checks
> which can lead to different implementations (such as the aarch64 memset), 
> or to avoid some chip limitation (such as the different selections on x86 
> for intel and amd chips).
> 
> However for this *specific* case there absolutely no gain by adding a
> similar copy for POWER9 where the same implementation will work perfectly
> fine on POWER7.  And my position is based with the provided information:
> the new implementation is to fix an *issue* within a chip revision.
> 
> Now Tulio told me that the idea is indeed adding a POWER8 optimization,
> but even if the idea is to have a POWER8 specialized implementation
> is does prevent glibc to select the POWER7 memcpy for POWER7 and POWER9.
> (if it will be case a better name for the memcpy implementation would be
> better, for instance __memcpy_vsx_aligned).
> 
> In fact I think have less possible ifunc implementation is indeed better
> for testing, in the memcpy case for instance a developer would not
> require to actually use a POWER9 to validate the algorithm correctness.
> This might not be the best strategy for an incremental testing if the
> idea is backport on distros, but even then I think having the minimum
> required ifunc variant is still a better way forward.

Let me summarize your positions if I can:

Tulio suggests:
* Add a POWER N specific optimization, which is largely a copy of POWER N-M
  but fixes one issue with POWER N, and in the future may be more optimized.
* Incremental testing only requires testing on POWER N.
* Full testing requires testing on POWER N, and POWER N-M. (No difference)
* Maintenance cost increased for additional POWER N specific assembly files.
* Future optimizations for POWER N possible with minor tweaks. (A little easier)

You suggest:
* Simplify the POWER N-M implementation to cover all POWER variants >M.
* Incremental testing requires testing POWER M and POWER N. (Increased cost)
* Full testing requires testing on POWER N, and POWER N-M. (No difference)
* Maintenance cost decreased, with only one file (the POWER M one) to maintain.
* Future optimizations for POWER N require a full review again like this one.

These two positions seem, to me, to be a matter of development taste and
discretion when it comes to future changes, current incremental testing
cost versus maintenance burden in the short term for IBM.

Did I understand this correctly?
  
Adhemerval Zanella Netto Oct. 20, 2017, 12:25 p.m. UTC | #7
On 19/10/2017 20:12, Carlos O'Donell wrote:
> On 10/19/2017 02:41 PM, Adhemerval Zanella wrote:
>>
>>
>> On 19/10/2017 19:06, Carlos O'Donell wrote:
>>> On 10/19/2017 01:19 PM, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 19/10/2017 16:48, Tulio Magno Quites Machado Filho wrote:
>>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>>
>>>>>> On 19/10/2017 16:20, Tulio Magno Quites Machado Filho wrote:
>>>>>>> From: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
>>>>>>>
>>>>>>> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
>>>>>>>
>>>>>>>> According to "POWER8 Processor User’s Manual for the Single-Chip Module"
>>>>>>>> (it is buried on a sign wall at [1]), both lxv2dx/lvx and stxvd2x/stvx
>>>>>>>> uses the same pipeline, have the same latency and same throughput.  The
>>>>>>>> only difference is lxv2dx/stxv2x have microcode handling for unaligned
>>>>>>>> case and for 4k crossing or 32-byte cross L1 miss (which should not
>>>>>>>> occur in the with aligned address).
>>>>>>>>
>>>>>>>> Why not change POWER7 implementation instead of dropping another one
>>>>>>>> which is exactly the same for POWER9?
>>>>>>>
>>>>>>> We're trying to limit the impact of this requirement on other processors so
>>>>>>> that newer P7 or P8 optimizations can still benefit from lxv2dx and stxvd2x.
>>>>>>>
>>>>>>> However, we could avoid source code duplication with the macros LVX and STVX
>>>>>>> I propose here in version 2.
>>>>>>> That way, we will postpone the copy to when/if a P7 optimization is
>>>>>>> contributed.
>>>>>>
>>>>>> And which benefit will be exactly? For this specific case current code 
>>>>>> already only does aligned accesses, so it does not really matter whether 
>>>>>> you use VSX or VMX instruction. If I recall correctly, both lxv2dx/lvx 
>>>>>> and stxvd2x/stvx shows the same latency and throughput also for POWER7.  
>>>>>>
>>>>>> I see no gain on using this POWER9 specific case where you could adjust
>>>>>> POWER7 one.
>>>>>
>>>>> There are no gains now.  The problem arises when contributing a new
>>>>> optimization, e.g. a memcpy optimization for POWER8 using lxv2dx or stxvd2x.
>>>>>
>>>>> If POWER9 doesn't have its own implementation, this problem will appear again.
>>>>>
>>>>
>>>> I think if eventually a POWER8 optimization could not be used as is for POWER9,
>>>> then a new ifunc variant would make sense.  But I still think we current
>>>> variant, a much simpler solutions (in code sense and maintainability) would be
>>>> to just adapt POWER7 variant to use VMX instructions.
>>>  
>>> We are arguing about taste and style here. About duplication versus functionality.
>>> I would leave it up the machine maintainer to decide how best to move forward.
>>>
>>> Tulio knows, and may not be able to say, if there are future optimizations coming
>>> down the line. So we lack a clear picture for deciding on this issue of duplication.
>>>
>>> My opinion is that I would *rather* see a POWER7 version that is just for POWER7,
>>> and a POWER8 or POWER9 version that is *just* for POWER8 or POWER9.
>>>
>>> The separation of the files allows for simpler incremental distro testing of the
>>> changes without needing to revalidate the POWER7 code again.
>>
>> I agree with you if the case of the new file justify a new implementation
>> for instance by either using new ISA instructions, a new strategy (such 
>> as unaligned memory access vs aligned ones), to hoist some internal checks
>> which can lead to different implementations (such as the aarch64 memset), 
>> or to avoid some chip limitation (such as the different selections on x86 
>> for intel and amd chips).
>>
>> However for this *specific* case there absolutely no gain by adding a
>> similar copy for POWER9 where the same implementation will work perfectly
>> fine on POWER7.  And my position is based with the provided information:
>> the new implementation is to fix an *issue* within a chip revision.
>>
>> Now Tulio told me that the idea is indeed adding a POWER8 optimization,
>> but even if the idea is to have a POWER8 specialized implementation
>> is does prevent glibc to select the POWER7 memcpy for POWER7 and POWER9.
>> (if it will be case a better name for the memcpy implementation would be
>> better, for instance __memcpy_vsx_aligned).
>>
>> In fact I think have less possible ifunc implementation is indeed better
>> for testing, in the memcpy case for instance a developer would not
>> require to actually use a POWER9 to validate the algorithm correctness.
>> This might not be the best strategy for an incremental testing if the
>> idea is backport on distros, but even then I think having the minimum
>> required ifunc variant is still a better way forward.
> 
> Let me summarize your positions if I can:
> 
> Tulio suggests:
> * Add a POWER N specific optimization, which is largely a copy of POWER N-M
>   but fixes one issue with POWER N, and in the future may be more optimized.
> * Incremental testing only requires testing on POWER N.
> * Full testing requires testing on POWER N, and POWER N-M. (No difference)
> * Maintenance cost increased for additional POWER N specific assembly files.
> * Future optimizations for POWER N possible with minor tweaks. (A little easier)
> 
> You suggest:
> * Simplify the POWER N-M implementation to cover all POWER variants >M.
> * Incremental testing requires testing POWER M and POWER N. (Increased cost)
> * Full testing requires testing on POWER N, and POWER N-M. (No difference)
> * Maintenance decreased with only one file the POWER M one to maintain.
> * Future optimizations for POWER N require a full review again like this one.
> 
> These two positions seem, to me, to be a matter of development taste and
> discretion when it comes to future changes, current incremental testing
> cost versus maintenance burden in the short term for IBM.
> 
> Did I understand this correctly?
> 

My objection here is not to set a general development policy on how to code the
POWER ifunc variants, but rather that this *specific* patch does not bring
significant gains. As I put before, if new ifunc implementations for POWER use
a different strategy, instructions, etc., I see no reason to object.

My point here is that this is not really a "POWER9" optimization, but rather a
VSX variant which should run effortlessly on any ISA 2.06 chip.  The current
implementation already takes care of aligned accesses, so it really does not
matter whether you use VMX or VSX instructions (no unaligned traps, no
microcode involved, works on non-cacheable memory). Now, for the unlikely case
where there is indeed an issue with VMX instructions on POWER7 on some
hardware revision or constraint (which I am not aware of), I would see a
reason to have a different VSX implementation.

Which leads to a question: how would we proceed in the case someone sends
a patch in the future to do exactly what I am suggesting here?
  

Patch

diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index dea49ac..82728fa 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -1,6 +1,6 @@ 
 ifeq ($(subdir),string)
-sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
-		   memcpy-power4 memcpy-ppc64 \
+sysdep_routines += memcpy-power9 memcpy-power7 memcpy-a2 memcpy-power6 \
+		   memcpy-cell memcpy-power4 memcpy-ppc64 \
 		   memcmp-power8 memcmp-power7 memcmp-power4 memcmp-ppc64 \
 		   memset-power7 memset-power6 memset-power4 \
 		   memset-ppc64 memset-power8 \
@@ -24,7 +24,8 @@  sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
 		   stpncpy-power8 stpncpy-power7 stpncpy-ppc64 \
 		   strcmp-power9 strcmp-power8 strcmp-power7 strcmp-ppc64 \
 		   strcat-power8 strcat-power7 strcat-ppc64 \
-		   memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
+		   memmove-power9 memmove-power7 memmove-ppc64 \
+		   wordcopy-ppc64 bcopy-ppc64 \
 		   strncpy-power8 strstr-power7 strstr-ppc64 \
 		   strspn-power8 strspn-ppc64 strcspn-power8 strcspn-ppc64 \
 		   strlen-power8 strcasestr-power8 strcasestr-ppc64 \
diff --git a/sysdeps/powerpc/powerpc64/multiarch/bcopy.c b/sysdeps/powerpc/powerpc64/multiarch/bcopy.c
index 05d46e2..4a4ee6e 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/bcopy.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/bcopy.c
@@ -22,8 +22,12 @@ 
 extern __typeof (bcopy) __bcopy_ppc attribute_hidden;
 /* __bcopy_power7 symbol is implemented at memmove-power7.S  */
 extern __typeof (bcopy) __bcopy_power7 attribute_hidden;
+/* __bcopy_power9 symbol is implemented at memmove-power9.S.  */
+extern __typeof (bcopy) __bcopy_power9 attribute_hidden;
 
 libc_ifunc (bcopy,
-            (hwcap & PPC_FEATURE_HAS_VSX)
+	    (hwcap2 & PPC_FEATURE2_ARCH_3_00)
+	    ? __bcopy_power9
+	    : (hwcap & PPC_FEATURE_HAS_VSX)
             ? __bcopy_power7
             : __bcopy_ppc);
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index 6a88536..9040bbc 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -51,6 +51,8 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 #ifdef SHARED
   /* Support sysdeps/powerpc/powerpc64/multiarch/memcpy.c.  */
   IFUNC_IMPL (i, name, memcpy,
+	      IFUNC_IMPL_ADD (array, i, memcpy, hwcap2 & PPC_FEATURE2_ARCH_3_00,
+			      __memcpy_power9)
 	      IFUNC_IMPL_ADD (array, i, memcpy, hwcap & PPC_FEATURE_HAS_VSX,
 			      __memcpy_power7)
 	      IFUNC_IMPL_ADD (array, i, memcpy, hwcap & PPC_FEATURE_ARCH_2_06,
@@ -65,6 +67,8 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   /* Support sysdeps/powerpc/powerpc64/multiarch/memmove.c.  */
   IFUNC_IMPL (i, name, memmove,
+	      IFUNC_IMPL_ADD (array, i, memmove, hwcap2 & PPC_FEATURE2_ARCH_3_00,
+			      __memmove_power9)
 	      IFUNC_IMPL_ADD (array, i, memmove, hwcap & PPC_FEATURE_HAS_VSX,
 			      __memmove_power7)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_ppc))
@@ -168,6 +172,8 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   /* Support sysdeps/powerpc/powerpc64/multiarch/bcopy.c.  */
   IFUNC_IMPL (i, name, bcopy,
+	      IFUNC_IMPL_ADD (array, i, bcopy, hwcap2 & PPC_FEATURE2_ARCH_3_00,
+			      __bcopy_power9)
 	      IFUNC_IMPL_ADD (array, i, bcopy, hwcap & PPC_FEATURE_HAS_VSX,
 			      __bcopy_power7)
 	      IFUNC_IMPL_ADD (array, i, bcopy, 1, __bcopy_ppc))
diff --git a/sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S b/sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S
new file mode 100644
index 0000000..fbd0788
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/memcpy-power9.S
@@ -0,0 +1,26 @@ 
+/* Optimized memcpy implementation for PowerPC64/POWER9.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+#define MEMCPY __memcpy_power9
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#include <sysdeps/powerpc/powerpc64/power9/memcpy.S>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/memcpy.c b/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
index 9f4286c..4c16fa0 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/memcpy.c
@@ -35,8 +35,11 @@  extern __typeof (__redirect_memcpy) __memcpy_cell attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_power6 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_a2 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_power7 attribute_hidden;
+extern __typeof (__redirect_memcpy) __memcpy_power9 attribute_hidden;
 
 libc_ifunc (__libc_memcpy,
+	   (hwcap2 & PPC_FEATURE2_ARCH_3_00)
+	   ? __memcpy_power9 :
             (hwcap & PPC_FEATURE_HAS_VSX)
             ? __memcpy_power7 :
 	      (hwcap & PPC_FEATURE_ARCH_2_06)
diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S
index a9435fa..0599a39 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S
+++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power7.S
@@ -23,7 +23,7 @@ 
 #undef libc_hidden_builtin_def
 #define libc_hidden_builtin_def(name)
 
-#undef bcopy
-#define bcopy __bcopy_power7
+#undef __bcopy
+#define __bcopy __bcopy_power7
 
 #include <sysdeps/powerpc/powerpc64/power7/memmove.S>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S b/sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S
new file mode 100644
index 0000000..16a2267
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/memmove-power9.S
@@ -0,0 +1,29 @@ 
+/* Optimized memmove implementation for PowerPC64/POWER9.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+#define MEMMOVE __memmove_power9
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#undef __bcopy
+#define __bcopy __bcopy_power9
+
+#include <sysdeps/powerpc/powerpc64/power9/memmove.S>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/memmove.c b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
index db2bbc7..f02498e 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/memmove.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/memmove.c
@@ -31,9 +31,12 @@  extern __typeof (__redirect_memmove) __libc_memmove;
 
 extern __typeof (__redirect_memmove) __memmove_ppc attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_power7 attribute_hidden;
+extern __typeof (__redirect_memmove) __memmove_power9 attribute_hidden;
 
 libc_ifunc (__libc_memmove,
-            (hwcap & PPC_FEATURE_HAS_VSX)
+	    (hwcap2 & PPC_FEATURE2_ARCH_3_00)
+	    ? __memmove_power9
+	    : (hwcap & PPC_FEATURE_HAS_VSX)
             ? __memmove_power7
             : __memmove_ppc);
 
diff --git a/sysdeps/powerpc/powerpc64/power7/memcpy.S b/sysdeps/powerpc/powerpc64/power7/memcpy.S
index 1ccbc2e..aea1224 100644
--- a/sysdeps/powerpc/powerpc64/power7/memcpy.S
+++ b/sysdeps/powerpc/powerpc64/power7/memcpy.S
@@ -27,6 +27,10 @@ 
 # define MEMCPY memcpy
 #endif
 
+#define LVX lxvd2x
+#define STVX stxvd2x
+
+
 #define dst 11		/* Use r11 so r3 kept unchanged.  */
 #define src 4
 #define cnt 5
@@ -91,63 +95,63 @@  L(aligned_copy):
 	srdi	12,cnt,7
 	cmpdi	12,0
 	beq	L(aligned_tail)
-	lxvd2x	6,0,src
-	lxvd2x	7,src,6
+	LVX	6,0,src
+	LVX	7,src,6
 	mtctr	12
 	b	L(aligned_128loop)
 
 	.align  4
 L(aligned_128head):
 	/* for the 2nd + iteration of this loop. */
-	lxvd2x	6,0,src
-	lxvd2x	7,src,6
+	LVX	6,0,src
+	LVX	7,src,6
 L(aligned_128loop):
-	lxvd2x	8,src,7
-	lxvd2x	9,src,8
-	stxvd2x	6,0,dst
+	LVX	8,src,7
+	LVX	9,src,8
+	STVX	6,0,dst
 	addi	src,src,64
-	stxvd2x	7,dst,6
-	stxvd2x	8,dst,7
-	stxvd2x	9,dst,8
-	lxvd2x	6,0,src
-	lxvd2x	7,src,6
+	STVX	7,dst,6
+	STVX	8,dst,7
+	STVX	9,dst,8
+	LVX	6,0,src
+	LVX	7,src,6
 	addi	dst,dst,64
-	lxvd2x	8,src,7
-	lxvd2x	9,src,8
+	LVX	8,src,7
+	LVX	9,src,8
 	addi	src,src,64
-	stxvd2x	6,0,dst
-	stxvd2x	7,dst,6
-	stxvd2x	8,dst,7
-	stxvd2x	9,dst,8
+	STVX	6,0,dst
+	STVX	7,dst,6
+	STVX	8,dst,7
+	STVX	9,dst,8
 	addi	dst,dst,64
 	bdnz	L(aligned_128head)
 
 L(aligned_tail):
 	mtocrf	0x01,cnt
 	bf	25,32f
-	lxvd2x	6,0,src
-	lxvd2x	7,src,6
-	lxvd2x	8,src,7
-	lxvd2x	9,src,8
+	LVX	6,0,src
+	LVX	7,src,6
+	LVX	8,src,7
+	LVX	9,src,8
 	addi	src,src,64
-	stxvd2x	6,0,dst
-	stxvd2x	7,dst,6
-	stxvd2x	8,dst,7
-	stxvd2x	9,dst,8
+	STVX	6,0,dst
+	STVX	7,dst,6
+	STVX	8,dst,7
+	STVX	9,dst,8
 	addi	dst,dst,64
 32:
 	bf	26,16f
-	lxvd2x	6,0,src
-	lxvd2x	7,src,6
+	LVX	6,0,src
+	LVX	7,src,6
 	addi	src,src,32
-	stxvd2x	6,0,dst
-	stxvd2x	7,dst,6
+	STVX	6,0,dst
+	STVX	7,dst,6
 	addi	dst,dst,32
 16:
 	bf	27,8f
-	lxvd2x	6,0,src
+	LVX	6,0,src
 	addi	src,src,16
-	stxvd2x	6,0,dst
+	STVX	6,0,dst
 	addi	dst,dst,16
 8:
 	bf	28,4f
diff --git a/sysdeps/powerpc/powerpc64/power7/memmove.S b/sysdeps/powerpc/powerpc64/power7/memmove.S
index 93baa69..253f541 100644
--- a/sysdeps/powerpc/powerpc64/power7/memmove.S
+++ b/sysdeps/powerpc/powerpc64/power7/memmove.S
@@ -30,6 +30,10 @@ 
 #ifndef MEMMOVE
 # define MEMMOVE memmove
 #endif
+
+#define LVX lxvd2x
+#define STVX stxvd2x
+
 	.machine power7
 ENTRY_TOCLESS (MEMMOVE, 5)
 	CALL_MCOUNT 3
@@ -92,63 +96,63 @@  L(aligned_copy):
 	srdi	12,r5,7
 	cmpdi	12,0
 	beq	L(aligned_tail)
-	lxvd2x	6,0,r4
-	lxvd2x	7,r4,6
+	LVX	6,0,r4
+	LVX	7,r4,6
 	mtctr	12
 	b	L(aligned_128loop)
 
 	.align  4
 L(aligned_128head):
 	/* for the 2nd + iteration of this loop. */
-	lxvd2x	6,0,r4
-	lxvd2x	7,r4,6
+	LVX	6,0,r4
+	LVX	7,r4,6
 L(aligned_128loop):
-	lxvd2x	8,r4,7
-	lxvd2x	9,r4,8
-	stxvd2x	6,0,r11
+	LVX	8,r4,7
+	LVX	9,r4,8
+	STVX	6,0,r11
 	addi	r4,r4,64
-	stxvd2x	7,r11,6
-	stxvd2x	8,r11,7
-	stxvd2x	9,r11,8
-	lxvd2x	6,0,r4
-	lxvd2x	7,r4,6
+	STVX	7,r11,6
+	STVX	8,r11,7
+	STVX	9,r11,8
+	LVX	6,0,r4
+	LVX	7,r4,6
 	addi	r11,r11,64
-	lxvd2x	8,r4,7
-	lxvd2x	9,r4,8
+	LVX	8,r4,7
+	LVX	9,r4,8
 	addi	r4,r4,64
-	stxvd2x	6,0,r11
-	stxvd2x	7,r11,6
-	stxvd2x	8,r11,7
-	stxvd2x	9,r11,8
+	STVX	6,0,r11
+	STVX	7,r11,6
+	STVX	8,r11,7
+	STVX	9,r11,8
 	addi	r11,r11,64
 	bdnz	L(aligned_128head)
 
 L(aligned_tail):
 	mtocrf	0x01,r5
 	bf	25,32f
-	lxvd2x	6,0,r4
-	lxvd2x	7,r4,6
-	lxvd2x	8,r4,7
-	lxvd2x	9,r4,8
+	LVX	6,0,r4
+	LVX	7,r4,6
+	LVX	8,r4,7
+	LVX	9,r4,8
 	addi	r4,r4,64
-	stxvd2x	6,0,r11
-	stxvd2x	7,r11,6
-	stxvd2x	8,r11,7
-	stxvd2x	9,r11,8
+	STVX	6,0,r11
+	STVX	7,r11,6
+	STVX	8,r11,7
+	STVX	9,r11,8
 	addi	r11,r11,64
 32:
 	bf	26,16f
-	lxvd2x	6,0,r4
-	lxvd2x	7,r4,6
+	LVX	6,0,r4
+	LVX	7,r4,6
 	addi	r4,r4,32
-	stxvd2x	6,0,r11
-	stxvd2x	7,r11,6
+	STVX	6,0,r11
+	STVX	7,r11,6
 	addi	r11,r11,32
 16:
 	bf	27,8f
-	lxvd2x	6,0,r4
+	LVX	6,0,r4
 	addi	r4,r4,16
-	stxvd2x	6,0,r11
+	STVX	6,0,r11
 	addi	r11,r11,16
 8:
 	bf	28,4f
@@ -488,63 +492,63 @@  L(aligned_copy_bwd):
 	srdi	r12,r5,7
 	cmpdi	r12,0
 	beq	L(aligned_tail_bwd)
-	lxvd2x	v6,r4,r6
-	lxvd2x	v7,r4,r7
+	LVX	v6,r4,r6
+	LVX	v7,r4,r7
 	mtctr	12
 	b	L(aligned_128loop_bwd)
 
 	.align  4
 L(aligned_128head_bwd):
 	/* for the 2nd + iteration of this loop. */
-	lxvd2x	v6,r4,r6
-	lxvd2x	v7,r4,r7
+	LVX	v6,r4,r6
+	LVX	v7,r4,r7
 L(aligned_128loop_bwd):
-	lxvd2x	v8,r4,r8
-	lxvd2x	v9,r4,r9
-	stxvd2x	v6,r11,r6
+	LVX	v8,r4,r8
+	LVX	v9,r4,r9
+	STVX	v6,r11,r6
 	subi	r4,r4,64
-	stxvd2x	v7,r11,r7
-	stxvd2x	v8,r11,r8
-	stxvd2x	v9,r11,r9
-	lxvd2x	v6,r4,r6
-	lxvd2x	v7,r4,7
+	STVX	v7,r11,r7
+	STVX	v8,r11,r8
+	STVX	v9,r11,r9
+	LVX	v6,r4,r6
+	LVX	v7,r4,7
 	subi	r11,r11,64
-	lxvd2x	v8,r4,r8
-	lxvd2x	v9,r4,r9
+	LVX	v8,r4,r8
+	LVX	v9,r4,r9
 	subi	r4,r4,64
-	stxvd2x	v6,r11,r6
-	stxvd2x	v7,r11,r7
-	stxvd2x	v8,r11,r8
-	stxvd2x	v9,r11,r9
+	STVX	v6,r11,r6
+	STVX	v7,r11,r7
+	STVX	v8,r11,r8
+	STVX	v9,r11,r9
 	subi	r11,r11,64
 	bdnz	L(aligned_128head_bwd)
 
 L(aligned_tail_bwd):
 	mtocrf	0x01,r5
 	bf	25,32f
-	lxvd2x	v6,r4,r6
-	lxvd2x	v7,r4,r7
-	lxvd2x	v8,r4,r8
-	lxvd2x	v9,r4,r9
+	LVX	v6,r4,r6
+	LVX	v7,r4,r7
+	LVX	v8,r4,r8
+	LVX	v9,r4,r9
 	subi	r4,r4,64
-	stxvd2x	v6,r11,r6
-	stxvd2x	v7,r11,r7
-	stxvd2x	v8,r11,r8
-	stxvd2x	v9,r11,r9
+	STVX	v6,r11,r6
+	STVX	v7,r11,r7
+	STVX	v8,r11,r8
+	STVX	v9,r11,r9
 	subi	r11,r11,64
 32:
 	bf	26,16f
-	lxvd2x	v6,r4,r6
-	lxvd2x	v7,r4,r7
+	LVX	v6,r4,r6
+	LVX	v7,r4,r7
 	subi	r4,r4,32
-	stxvd2x	v6,r11,r6
-	stxvd2x	v7,r11,r7
+	STVX	v6,r11,r6
+	STVX	v7,r11,r7
 	subi	r11,r11,32
 16:
 	bf	27,8f
-	lxvd2x	v6,r4,r6
+	LVX	v6,r4,r6
 	subi	r4,r4,16
-	stxvd2x	v6,r11,r6
+	STVX	v6,r11,r6
 	subi	r11,r11,16
 8:
 	bf	28,4f
@@ -832,4 +836,6 @@  ENTRY_TOCLESS (__bcopy)
 	mr	r4,r6
 	b	L(_memmove)
 END (__bcopy)
+#ifndef __bcopy
 weak_alias (__bcopy, bcopy)
+#endif
diff --git a/sysdeps/powerpc/powerpc64/power9/memcpy.S b/sysdeps/powerpc/powerpc64/power9/memcpy.S
new file mode 100644
index 0000000..d827cdf
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/power9/memcpy.S
@@ -0,0 +1,23 @@ 
+/* Optimized memcpy implementation for PowerPC64/POWER9.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Avoid unnecessary traps on cache-inhibited memory on POWER9 DD2.1.  */
+#define LVX lvx
+#define STVX stvx
+
+#include <sysdeps/powerpc/powerpc64/power7/memcpy.S>
diff --git a/sysdeps/powerpc/powerpc64/power9/memmove.S b/sysdeps/powerpc/powerpc64/power9/memmove.S
new file mode 100644
index 0000000..2c5887e
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/power9/memmove.S
@@ -0,0 +1,23 @@ 
+/* Optimized memmove implementation for PowerPC64/POWER9.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Avoid unnecessary traps on cache-inhibited memory on POWER9 DD2.1.  */
+#define LVX lvx
+#define STVX stvx
+
+#include <sysdeps/powerpc/powerpc64/power7/memmove.S>
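For the two `#include`s above to work, the POWER7 sources must still assemble to their original `lxvd2x`/`stxvd2x` form when built on their own, i.e. when `LVX`/`STVX` are not predefined. That hunk falls outside this excerpt, but a minimal sketch of the fallback expected near the top of `power7/memcpy.S` and `power7/memmove.S` (names as proposed in this v2; the exact placement is an assumption) would be:

```
/* Default to the VSX load/store instructions unless a caller
   (e.g. the POWER9 variant) has already overridden the macros.  */
#ifndef LVX
# define LVX lxvd2x
#endif
#ifndef STVX
# define STVX stxvd2x
#endif
```

With these guards, the POWER9 files select the trap-free VMX instructions simply by defining the macros before the `#include`, and no loop code is duplicated.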