[v2] Add TARGET_MOVE_WITH_MODE_P

Message ID Yh/fD6YGRT+G3Ltt@gmail.com
State New
Headers
Series [v2] Add TARGET_MOVE_WITH_MODE_P |

Commit Message

H.J. Lu March 2, 2022, 9:18 p.m. UTC
  On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> > The default is
> >
> > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> >
> > For x86, it is MOVE_MAX to restore the old behavior before
> 
> I know we've discussed this to death in the PR, I just want to repeat here
> that the GIMPLE folding expects to generate a single load and a single
> store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> was chosen originally (it's documented to what a "single instruction" does).
> In practice MOVE_MAX does not seem to cover vector register sizes
> so Richard pulled MOVE_RATIO which is really intended to cover
> the case of using multiple instructions for moving memory (but then I
> don't remember whether for the ARM case the single load/store GIMPLE
> will be expanded to multiple load/store instructions).
> 
> TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> being very specific for memcpy folding (we also fold memmove btw).
> 
> There is also MOVE_MAX_PIECES which _might_ be more appropriate
> than MOVE_MAX here and still honor the idea of single instructions.
> Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> not MOVE_MAX * MOVE_RATIO.
> 
> So if we need a new hook then that hook should at least get the
> 'speed' argument of MOVE_RATIO and it should get a better name.
> 
> I still think that it should be possible to improve the insn check to
> avoid use of "disabled" modes, maybe that's also a point to add
> a new hook like .move_with_mode_p or so?  To quote, we do

Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.

> 
>               scalar_int_mode mode;
>               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
>                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
>                   && have_insn_for (SET, mode)
>                   /* If the destination pointer is not aligned we must be able
>                      to emit an unaligned store.  */
>                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
>                       || !targetm.slow_unaligned_access (mode, dest_align)
>                       || (optab_handler (movmisalign_optab, mode)
>                           != CODE_FOR_nothing)))
> 
> where I understand the ISA is enabled and if the user explicitely
> uses it that's OK but -mprefer-avx128 should tell GCC to never
> generate AVX256 code where the user was not explicitely using it
> (still for example glibc might happily use AVX256 code to implement
> the memcpy we are folding!)
> 
> Note the BB vectorizer also might end up with using AVX256 because
> in places it also relies on optab queries and the vector_mode_supported_p
> check (but the memcpy folding uses the fake integer modes).  So
> x86 might need to implement the related_mode hook to avoid "auto"-using
> a larger vector mode which the default implementation would happily do.
> 
> Richard.

OK for master?

Thanks.

H.J.
---
Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be
generated implicitly.  The default definition returns true.  The x86
version returns true if the mode size <= MOVE_MAX, which is the max
number of bytes we can move in one reasonably fast instruction.

gcc/

	PR target/103393
	* gimple-fold.cc (gimple_fold_builtin_memory_op): Call
	targetm.move_with_mode_p to check if move with mode can be
	generated implicitly.
	* target.def: Add move_with_mode_p.
	* targhooks.cc (default_move_with_mode_p): New.
	* targhooks.h (default_move_with_mode_p): Likewise.
	* config/i386/i386.cc (ix86_move_with_mode_p): New.
	(TARGET_MOVE_WITH_MODE_P): Likewise.
	* doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P.
	* doc/tm.texi: Regenerate.

gcc/testsuite/

	PR target/103393
	* gcc.target/i386/pr103393-1.c: New test.
	* gcc.target/i386/pr103393-2.c: Likewise.
	* gcc.target/i386/pr103393-3.c: Likewise.
	* gcc.target/i386/pr103393-4.c: Likewise.
	* gcc.target/i386/pr103393-5.c: Likewise.
---
 gcc/config/i386/i386.cc                    | 12 ++++++++++++
 gcc/doc/tm.texi                            |  5 +++++
 gcc/doc/tm.texi.in                         |  2 ++
 gcc/gimple-fold.cc                         |  1 +
 gcc/target.def                             |  7 +++++++
 gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++
 10 files changed, 107 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c
  

Comments

Richard Biener March 7, 2022, 1:45 p.m. UTC | #1
On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> > > The default is
> > >
> > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > >
> > > For x86, it is MOVE_MAX to restore the old behavior before
> >
> > I know we've discussed this to death in the PR, I just want to repeat here
> > that the GIMPLE folding expects to generate a single load and a single
> > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> > was chosen originally (it's documented to what a "single instruction" does).
> > In practice MOVE_MAX does not seem to cover vector register sizes
> > so Richard pulled MOVE_RATIO which is really intended to cover
> > the case of using multiple instructions for moving memory (but then I
> > don't remember whether for the ARM case the single load/store GIMPLE
> > will be expanded to multiple load/store instructions).
> >
> > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > being very specific for memcpy folding (we also fold memmove btw).
> >
> > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > than MOVE_MAX here and still honor the idea of single instructions.
> > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > not MOVE_MAX * MOVE_RATIO.
> >
> > So if we need a new hook then that hook should at least get the
> > 'speed' argument of MOVE_RATIO and it should get a better name.
> >
> > I still think that it should be possible to improve the insn check to
> > avoid use of "disabled" modes, maybe that's also a point to add
> > a new hook like .move_with_mode_p or so?  To quote, we do
>
> Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.

Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
mentions "a load or store used TO COPY MEMORY" (emphasis mine)
and whose x86 implementation would already be fine (doing larger moves
and also not doing too large moves).  But appearantly the arm folks
decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.

Yes, MOVE_MAX_PIECES is documented to apply to move_by_pieces.
Still GIMPLE memcpy/memmove inlining wants to mimic exactly that but
restrict itself to a single load and a single store.

> >
> >               scalar_int_mode mode;
> >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> >                   && have_insn_for (SET, mode)
> >                   /* If the destination pointer is not aligned we must be able
> >                      to emit an unaligned store.  */
> >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> >                       || !targetm.slow_unaligned_access (mode, dest_align)
> >                       || (optab_handler (movmisalign_optab, mode)
> >                           != CODE_FOR_nothing)))
> >
> > where I understand the ISA is enabled and if the user explicitely
> > uses it that's OK but -mprefer-avx128 should tell GCC to never
> > generate AVX256 code where the user was not explicitely using it
> > (still for example glibc might happily use AVX256 code to implement
> > the memcpy we are folding!)
> >
> > Note the BB vectorizer also might end up with using AVX256 because
> > in places it also relies on optab queries and the vector_mode_supported_p
> > check (but the memcpy folding uses the fake integer modes).  So
> > x86 might need to implement the related_mode hook to avoid "auto"-using
> > a larger vector mode which the default implementation would happily do.
> >
> > Richard.
>
> OK for master?

Looking for opinions from others as well.

Btw, there's a similar use in expand_DEFERRED_INIT:

          && int_mode_for_size (tree_to_uhwi (var_size) * BITS_PER_UNIT,
                                0).exists (&var_mode)
          && have_insn_for (SET, var_mode))

So it occured to me that maybe targetm.move_with_mode_p should eventually
check have_insn_for (SET, var_mode) or we should abstract checking the two
things to a generic API somewhere (in optabs-query.h maybe, or expmed.h,
not sure where it would be more appropriate).

> +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> +This target hook returns true if move with mode @var{mode} can be
> +generated implicitly.  The default definition returns true.
> +@end deftypefn

I know what you mean but I'm not sure "can be generated implicitly" captures
that.  The compiler always generated stuff explicitely.  It's also
"should", not "can", no?

Thanks,
Richard.

> Thanks.
>
> H.J.
> ---
> Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be
> generated implicitly.  The default definition returns true.  The x86
> version returns true if the mode size <= MOVE_MAX, which is the max
> number of bytes we can move in one reasonably fast instruction.
>
> gcc/
>
>         PR target/103393
>         * gimple-fold.cc (gimple_fold_builtin_memory_op): Call
>         targetm.move_with_mode_p to check if move with mode can be
>         generated implicitly.
>         * target.def: Add move_with_mode_p.
>         * targhooks.cc (default_move_with_mode_p): New.
>         * targhooks.h (default_move_with_mode_p): Likewise.
>         * config/i386/i386.cc (ix86_move_with_mode_p): New.
>         (TARGET_MOVE_WITH_MODE_P): Likewise.
>         * doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P.
>         * doc/tm.texi: Regenerate.
>
> gcc/testsuite/
>
>         PR target/103393
>         * gcc.target/i386/pr103393-1.c: New test.
>         * gcc.target/i386/pr103393-2.c: Likewise.
>         * gcc.target/i386/pr103393-3.c: Likewise.
>         * gcc.target/i386/pr103393-4.c: Likewise.
>         * gcc.target/i386/pr103393-5.c: Likewise.
> ---
>  gcc/config/i386/i386.cc                    | 12 ++++++++++++
>  gcc/doc/tm.texi                            |  5 +++++
>  gcc/doc/tm.texi.in                         |  2 ++
>  gcc/gimple-fold.cc                         |  1 +
>  gcc/target.def                             |  7 +++++++
>  gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++
>  10 files changed, 107 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..68a2c59220c 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23918,6 +23918,15 @@ ix86_push_rounding (poly_int64 bytes)
>    return ROUND_UP (bytes, UNITS_PER_WORD);
>  }
>
> +/* Implement the TARGET_MOVE_WITH_MODE_P hook.  Return true if move
> +   with MODE can be generated implicitly.  */
> +
> +static bool
> +ix86_move_with_mode_p (machine_mode mode)
> +{
> +  return GET_MODE_SIZE (mode) <= MOVE_MAX;
> +}
> +
>  /* Target-specific selftests.  */
>
>  #if CHECKING_P
> @@ -24735,6 +24744,9 @@ static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
>  #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
>  #endif /* #if CHECKING_P */
>
> +#undef TARGET_MOVE_WITH_MODE_P
> +#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p
> +
>  struct gcc_target targetm = TARGET_INITIALIZER;
>
>  #include "gt-i386.h"
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 49864dd79f8..9d5058610e1 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -11924,6 +11924,11 @@ statement holding the function call.  Returns true if any change
>  was made to the GIMPLE stream.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> +This target hook returns true if move with mode @var{mode} can be
> +generated implicitly.  The default definition returns true.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree @var{decl1}, tree @var{decl2})
>  This hook is used to compare the target attributes in two functions to
>  determine which function's features get higher priority.  This is used
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 95e5e341f07..e9158ab90c6 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -7924,6 +7924,8 @@ to by @var{ce_info}.
>
>  @hook TARGET_GIMPLE_FOLD_BUILTIN
>
> +@hook TARGET_MOVE_WITH_MODE_P
> +
>  @hook TARGET_COMPARE_VERSION_PRIORITY
>
>  @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER
> diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> index c9179abb27e..93267eeabb2 100644
> --- a/gcc/gimple-fold.cc
> +++ b/gcc/gimple-fold.cc
> @@ -1007,6 +1007,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
>               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
>                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
>                   && have_insn_for (SET, mode)
> +                 && targetm.move_with_mode_p (mode)
>                   /* If the destination pointer is not aligned we must be able
>                      to emit an unaligned store.  */
>                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> diff --git a/gcc/target.def b/gcc/target.def
> index 72c2e1ef756..041d944cb38 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2489,6 +2489,13 @@ was made to the GIMPLE stream.",
>   bool, (gimple_stmt_iterator *gsi),
>   hook_bool_gsiptr_false)
>
> +DEFHOOK
> +(move_with_mode_p,
> + "This target hook returns true if move with mode @var{mode} can be\n\
> +generated implicitly.  The default definition returns true.",
> + bool, (machine_mode mode),
> + hook_bool_mode_true)
> +
>  /* Target hook is used to compare the target attributes in two functions to
>     determine which function's features get higher priority.  This is used
>     during function multi-versioning to figure out the order in which two
> diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> new file mode 100644
> index 00000000000..2091d33c45f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 -mprefer-vector-width=128" } */
> +
> +struct TestData {
> +  float arr[8];
> +};
> +
> +void
> +cpy (struct TestData *s1, struct TestData *s2 )
> +{
> +  for(int i=0; i<8; ++i)
> +    s1->arr[i] = s2->arr[i];
> +}
> +
> +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> new file mode 100644
> index 00000000000..4ad8c8bf379
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake-avx512" } */
> +
> +struct TestData {
> +  float arr[8];
> +};
> +
> +void
> +cpy (struct TestData *s1, struct TestData *s2 )
> +{
> +  for(int i=0; i<16; ++i)
> +    s1->arr[i] = s2->arr[i];
> +}
> +
> +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> new file mode 100644
> index 00000000000..7a03160e512
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids" } */
> +
> +struct TestData {
> +  float arr[8];
> +};
> +
> +void
> +cpy (struct TestData *s1, struct TestData *s2 )
> +{
> +  for(int i=0; i<16; ++i)
> +    s1->arr[i] = s2->arr[i];
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> new file mode 100644
> index 00000000000..ae2599f6411
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */
> +
> +struct TestData {
> +  float arr[8];
> +};
> +
> +void
> +cpy (struct TestData *s1, struct TestData *s2 )
> +{
> +  for(int i=0; i<16; ++i)
> +    s1->arr[i] = s2->arr[i];
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> new file mode 100644
> index 00000000000..f9173684212
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */
> +
> +struct TestData {
> +  float arr[8];
> +};
> +
> +void
> +cpy (struct TestData *s1, struct TestData *s2 )
> +{
> +  for(int i=0; i<16; ++i)
> +    s1->arr[i] = s2->arr[i];
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> --
> 2.35.1
>
  
H.J. Lu March 8, 2022, 3:43 p.m. UTC | #2
On Mon, Mar 7, 2022 at 5:45 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> > > > The default is
> > > >
> > > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > > >
> > > > For x86, it is MOVE_MAX to restore the old behavior before
> > >
> > > I know we've discussed this to death in the PR, I just want to repeat here
> > > that the GIMPLE folding expects to generate a single load and a single
> > > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> > > was chosen originally (it's documented to what a "single instruction" does).
> > > In practice MOVE_MAX does not seem to cover vector register sizes
> > > so Richard pulled MOVE_RATIO which is really intended to cover
> > > the case of using multiple instructions for moving memory (but then I
> > > don't remember whether for the ARM case the single load/store GIMPLE
> > > will be expanded to multiple load/store instructions).
> > >
> > > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > > being very specific for memcpy folding (we also fold memmove btw).
> > >
> > > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > > than MOVE_MAX here and still honor the idea of single instructions.
> > > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > > not MOVE_MAX * MOVE_RATIO.
> > >
> > > So if we need a new hook then that hook should at least get the
> > > 'speed' argument of MOVE_RATIO and it should get a better name.
> > >
> > > I still think that it should be possible to improve the insn check to
> > > avoid use of "disabled" modes, maybe that's also a point to add
> > > a new hook like .move_with_mode_p or so?  To quote, we do
> >
> > Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
>
> Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> and whose x86 implementation would already be fine (doing larger moves
> and also not doing too large moves).  But appearantly the arm folks
> decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
>
> Yes, MOVE_MAX_PIECES is documented to apply to move_by_pieces.
> Still GIMPLE memcpy/memmove inlining wants to mimic exactly that but
> restrict itself to a single load and a single store.
>
> > >
> > >               scalar_int_mode mode;
> > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > >                   && have_insn_for (SET, mode)
> > >                   /* If the destination pointer is not aligned we must be able
> > >                      to emit an unaligned store.  */
> > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > >                       || !targetm.slow_unaligned_access (mode, dest_align)
> > >                       || (optab_handler (movmisalign_optab, mode)
> > >                           != CODE_FOR_nothing)))
> > >
> > > where I understand the ISA is enabled and if the user explicitely
> > > uses it that's OK but -mprefer-avx128 should tell GCC to never
> > > generate AVX256 code where the user was not explicitely using it
> > > (still for example glibc might happily use AVX256 code to implement
> > > the memcpy we are folding!)
> > >
> > > Note the BB vectorizer also might end up with using AVX256 because
> > > in places it also relies on optab queries and the vector_mode_supported_p
> > > check (but the memcpy folding uses the fake integer modes).  So
> > > x86 might need to implement the related_mode hook to avoid "auto"-using
> > > a larger vector mode which the default implementation would happily do.
> > >
> > > Richard.
> >
> > OK for master?
>
> Looking for opinions from others as well.
>
> Btw, there's a similar use in expand_DEFERRED_INIT:
>
>           && int_mode_for_size (tree_to_uhwi (var_size) * BITS_PER_UNIT,
>                                 0).exists (&var_mode)
>           && have_insn_for (SET, var_mode))
>
> So it occured to me that maybe targetm.move_with_mode_p should eventually

TARGET_MOVE_WITH_MODE_P shouldn't be used here since we do want to
use var_mode.

> check have_insn_for (SET, var_mode) or we should abstract checking the two
> things to a generic API somewhere (in optabs-query.h maybe, or expmed.h,
> not sure where it would be more appropriate).
>
> > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> > +This target hook returns true if move with mode @var{mode} can be
> > +generated implicitly.  The default definition returns true.
> > +@end deftypefn
>
> I know what you mean but I'm not sure "can be generated implicitly" captures
> that.  The compiler always generated stuff explicitely.  It's also
> "should", not "can", no?

TARGET_MOVE_WITH_MODE_P is a target hook for compiler optimization.
It is totally OK for TARGET_MOVE_WITH_MODE_P to return false.   Will
TARGET_FAST_MOVE_WITH_MODE_P be better?

> Thanks,
> Richard.
>
> > Thanks.
> >
> > H.J.
> > ---
> > Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be
> > generated implicitly.  The default definition returns true.  The x86
> > version returns true if the mode size <= MOVE_MAX, which is the max
> > number of bytes we can move in one reasonably fast instruction.
> >
> > gcc/
> >
> >         PR target/103393
> >         * gimple-fold.cc (gimple_fold_builtin_memory_op): Call
> >         targetm.move_with_mode_p to check if move with mode can be
> >         generated implicitly.
> >         * target.def: Add move_with_mode_p.
> >         * targhooks.cc (default_move_with_mode_p): New.
> >         * targhooks.h (default_move_with_mode_p): Likewise.
> >         * config/i386/i386.cc (ix86_move_with_mode_p): New.
> >         (TARGET_MOVE_WITH_MODE_P): Likewise.
> >         * doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P.
> >         * doc/tm.texi: Regenerate.
> >
> > gcc/testsuite/
> >
> >         PR target/103393
> >         * gcc.target/i386/pr103393-1.c: New test.
> >         * gcc.target/i386/pr103393-2.c: Likewise.
> >         * gcc.target/i386/pr103393-3.c: Likewise.
> >         * gcc.target/i386/pr103393-4.c: Likewise.
> >         * gcc.target/i386/pr103393-5.c: Likewise.
> > ---
> >  gcc/config/i386/i386.cc                    | 12 ++++++++++++
> >  gcc/doc/tm.texi                            |  5 +++++
> >  gcc/doc/tm.texi.in                         |  2 ++
> >  gcc/gimple-fold.cc                         |  1 +
> >  gcc/target.def                             |  7 +++++++
> >  gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++
> >  10 files changed, 107 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index b2bf90576d5..68a2c59220c 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23918,6 +23918,15 @@ ix86_push_rounding (poly_int64 bytes)
> >    return ROUND_UP (bytes, UNITS_PER_WORD);
> >  }
> >
> > +/* Implement the TARGET_MOVE_WITH_MODE_P hook.  Return true if move
> > +   with MODE can be generated implicitly.  */
> > +
> > +static bool
> > +ix86_move_with_mode_p (machine_mode mode)
> > +{
> > +  return GET_MODE_SIZE (mode) <= MOVE_MAX;
> > +}
> > +
> >  /* Target-specific selftests.  */
> >
> >  #if CHECKING_P
> > @@ -24735,6 +24744,9 @@ static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
> >  #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
> >  #endif /* #if CHECKING_P */
> >
> > +#undef TARGET_MOVE_WITH_MODE_P
> > +#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p
> > +
> >  struct gcc_target targetm = TARGET_INITIALIZER;
> >
> >  #include "gt-i386.h"
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index 49864dd79f8..9d5058610e1 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -11924,6 +11924,11 @@ statement holding the function call.  Returns true if any change
> >  was made to the GIMPLE stream.
> >  @end deftypefn
> >
> > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> > +This target hook returns true if move with mode @var{mode} can be
> > +generated implicitly.  The default definition returns true.
> > +@end deftypefn
> > +
> >  @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree @var{decl1}, tree @var{decl2})
> >  This hook is used to compare the target attributes in two functions to
> >  determine which function's features get higher priority.  This is used
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index 95e5e341f07..e9158ab90c6 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -7924,6 +7924,8 @@ to by @var{ce_info}.
> >
> >  @hook TARGET_GIMPLE_FOLD_BUILTIN
> >
> > +@hook TARGET_MOVE_WITH_MODE_P
> > +
> >  @hook TARGET_COMPARE_VERSION_PRIORITY
> >
> >  @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER
> > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> > index c9179abb27e..93267eeabb2 100644
> > --- a/gcc/gimple-fold.cc
> > +++ b/gcc/gimple-fold.cc
> > @@ -1007,6 +1007,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
> >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> >                   && have_insn_for (SET, mode)
> > +                 && targetm.move_with_mode_p (mode)
> >                   /* If the destination pointer is not aligned we must be able
> >                      to emit an unaligned store.  */
> >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > diff --git a/gcc/target.def b/gcc/target.def
> > index 72c2e1ef756..041d944cb38 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -2489,6 +2489,13 @@ was made to the GIMPLE stream.",
> >   bool, (gimple_stmt_iterator *gsi),
> >   hook_bool_gsiptr_false)
> >
> > +DEFHOOK
> > +(move_with_mode_p,
> > + "This target hook returns true if move with mode @var{mode} can be\n\
> > +generated implicitly.  The default definition returns true.",
> > + bool, (machine_mode mode),
> > + hook_bool_mode_true)
> > +
> >  /* Target hook is used to compare the target attributes in two functions to
> >     determine which function's features get higher priority.  This is used
> >     during function multi-versioning to figure out the order in which two
> > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > new file mode 100644
> > index 00000000000..2091d33c45f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 -mprefer-vector-width=128" } */
> > +
> > +struct TestData {
> > +  float arr[8];
> > +};
> > +
> > +void
> > +cpy (struct TestData *s1, struct TestData *s2 )
> > +{
> > +  for(int i=0; i<8; ++i)
> > +    s1->arr[i] = s2->arr[i];
> > +}
> > +
> > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > new file mode 100644
> > index 00000000000..4ad8c8bf379
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=skylake-avx512" } */
> > +
> > +struct TestData {
> > +  float arr[8];
> > +};
> > +
> > +void
> > +cpy (struct TestData *s1, struct TestData *s2 )
> > +{
> > +  for(int i=0; i<16; ++i)
> > +    s1->arr[i] = s2->arr[i];
> > +}
> > +
> > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > new file mode 100644
> > index 00000000000..7a03160e512
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > +
> > +struct TestData {
> > +  float arr[8];
> > +};
> > +
> > +void
> > +cpy (struct TestData *s1, struct TestData *s2 )
> > +{
> > +  for(int i=0; i<16; ++i)
> > +    s1->arr[i] = s2->arr[i];
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > new file mode 100644
> > index 00000000000..ae2599f6411
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */
> > +
> > +struct TestData {
> > +  float arr[8];
> > +};
> > +
> > +void
> > +cpy (struct TestData *s1, struct TestData *s2 )
> > +{
> > +  for(int i=0; i<16; ++i)
> > +    s1->arr[i] = s2->arr[i];
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > new file mode 100644
> > index 00000000000..f9173684212
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */
> > +
> > +struct TestData {
> > +  float arr[8];
> > +};
> > +
> > +void
> > +cpy (struct TestData *s1, struct TestData *s2 )
> > +{
> > +  for(int i=0; i<16; ++i)
> > +    s1->arr[i] = s2->arr[i];
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > --
> > 2.35.1
> >
  
Richard Biener March 9, 2022, 8:25 a.m. UTC | #3
On Tue, Mar 8, 2022 at 4:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Mar 7, 2022 at 5:45 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > > > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> > > > > The default is
> > > > >
> > > > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > > > >
> > > > > For x86, it is MOVE_MAX to restore the old behavior before
> > > >
> > > > I know we've discussed this to death in the PR, I just want to repeat here
> > > > that the GIMPLE folding expects to generate a single load and a single
> > > > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> > > > was chosen originally (it's documented to what a "single instruction" does).
> > > > In practice MOVE_MAX does not seem to cover vector register sizes
> > > > so Richard pulled MOVE_RATIO which is really intended to cover
> > > > the case of using multiple instructions for moving memory (but then I
> > > > don't remember whether for the ARM case the single load/store GIMPLE
> > > > will be expanded to multiple load/store instructions).
> > > >
> > > > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > > > being very specific for memcpy folding (we also fold memmove btw).
> > > >
> > > > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > > > than MOVE_MAX here and still honor the idea of single instructions.
> > > > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > > > not MOVE_MAX * MOVE_RATIO.
> > > >
> > > > So if we need a new hook then that hook should at least get the
> > > > 'speed' argument of MOVE_RATIO and it should get a better name.
> > > >
> > > > I still think that it should be possible to improve the insn check to
> > > > avoid use of "disabled" modes, maybe that's also a point to add
> > > > a new hook like .move_with_mode_p or so?  To quote, we do
> > >
> > > Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> >
> > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> > and whose x86 implementation would already be fine (doing larger moves
> > and also not doing too large moves).  But appearantly the arm folks
> > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
> >
> > Yes, MOVE_MAX_PIECES is documented to apply to move_by_pieces.
> > Still GIMPLE memcpy/memmove inlining wants to mimic exactly that but
> > restrict itself to a single load and a single store.
> >
> > > >
> > > >               scalar_int_mode mode;
> > > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > > >                   && have_insn_for (SET, mode)
> > > >                   /* If the destination pointer is not aligned we must be able
> > > >                      to emit an unaligned store.  */
> > > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > >                       || !targetm.slow_unaligned_access (mode, dest_align)
> > > >                       || (optab_handler (movmisalign_optab, mode)
> > > >                           != CODE_FOR_nothing)))
> > > >
> > > > where I understand the ISA is enabled and if the user explicitely
> > > > uses it that's OK but -mprefer-avx128 should tell GCC to never
> > > > generate AVX256 code where the user was not explicitely using it
> > > > (still for example glibc might happily use AVX256 code to implement
> > > > the memcpy we are folding!)
> > > >
> > > > Note the BB vectorizer also might end up with using AVX256 because
> > > > in places it also relies on optab queries and the vector_mode_supported_p
> > > > check (but the memcpy folding uses the fake integer modes).  So
> > > > x86 might need to implement the related_mode hook to avoid "auto"-using
> > > > a larger vector mode which the default implementation would happily do.
> > > >
> > > > Richard.
> > >
> > > OK for master?
> >
> > Looking for opinions from others as well.
> >
> > Btw, there's a similar use in expand_DEFERRED_INIT:
> >
> >           && int_mode_for_size (tree_to_uhwi (var_size) * BITS_PER_UNIT,
> >                                 0).exists (&var_mode)
> >           && have_insn_for (SET, var_mode))
> >
> > So it occured to me that maybe targetm.move_with_mode_p should eventually
>
> TARGET_MOVE_WITH_MODE_P shouldn't be used here since we do want to
> use var_mode.
>
> > check have_insn_for (SET, var_mode) or we should abstract checking the two
> > things to a generic API somewhere (in optabs-query.h maybe, or expmed.h,
> > not sure where it would be more appropriate).
> >
> > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> > > +This target hook returns true if move with mode @var{mode} can be
> > > +generated implicitly.  The default definition returns true.
> > > +@end deftypefn
> >
> > I know what you mean but I'm not sure "can be generated implicitly" captures
> > that.  The compiler always generated stuff explicitely.  It's also
> > "should", not "can", no?
>
> TARGET_MOVE_WITH_MODE_P is a target hook for compiler optimization.
> It is totally OK for TARGET_MOVE_WITH_MODE_P to return false.   Will
> TARGET_FAST_MOVE_WITH_MODE_P be better?

I thought the PR to address was about -mprefer-avx128 but memcpy using
YMM regs?  So it's not about "fast" it's about fulfilling user expectations, no?

As said elsewhere the vectorizer gets to honor -mprefer-avx128 via the
vector_modes hook to iterate over but also with the related_mode hook
(which x86 chooses to not implement).  It sounds like we need something
similar for integer modes (aka "move" modes), but then does x86 "honor"
-mprefer-avx128 when doing *_by_pieces?  And how does it do that there?

Richard.

> > Thanks,
> > Richard.
> >
> > > Thanks.
> > >
> > > H.J.
> > > ---
> > > Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be
> > > generated implicitly.  The default definition returns true.  The x86
> > > version returns true if the mode size <= MOVE_MAX, which is the max
> > > number of bytes we can move in one reasonably fast instruction.
> > >
> > > gcc/
> > >
> > >         PR target/103393
> > >         * gimple-fold.cc (gimple_fold_builtin_memory_op): Call
> > >         targetm.move_with_mode_p to check if move with mode can be
> > >         generated implicitly.
> > >         * target.def: Add move_with_mode_p.
> > >         * targhooks.cc (default_move_with_mode_p): New.
> > >         * targhooks.h (default_move_with_mode_p): Likewise.
> > >         * config/i386/i386.cc (ix86_move_with_mode_p): New.
> > >         (TARGET_MOVE_WITH_MODE_P): Likewise.
> > >         * doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P.
> > >         * doc/tm.texi: Regenerate.
> > >
> > > gcc/testsuite/
> > >
> > >         PR target/103393
> > >         * gcc.target/i386/pr103393-1.c: New test.
> > >         * gcc.target/i386/pr103393-2.c: Likewise.
> > >         * gcc.target/i386/pr103393-3.c: Likewise.
> > >         * gcc.target/i386/pr103393-4.c: Likewise.
> > >         * gcc.target/i386/pr103393-5.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386.cc                    | 12 ++++++++++++
> > >  gcc/doc/tm.texi                            |  5 +++++
> > >  gcc/doc/tm.texi.in                         |  2 ++
> > >  gcc/gimple-fold.cc                         |  1 +
> > >  gcc/target.def                             |  7 +++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++
> > >  10 files changed, 107 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index b2bf90576d5..68a2c59220c 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -23918,6 +23918,15 @@ ix86_push_rounding (poly_int64 bytes)
> > >    return ROUND_UP (bytes, UNITS_PER_WORD);
> > >  }
> > >
> > > +/* Implement the TARGET_MOVE_WITH_MODE_P hook.  Return true if move
> > > +   with MODE can be generated implicitly.  */
> > > +
> > > +static bool
> > > +ix86_move_with_mode_p (machine_mode mode)
> > > +{
> > > +  return GET_MODE_SIZE (mode) <= MOVE_MAX;
> > > +}
> > > +
> > >  /* Target-specific selftests.  */
> > >
> > >  #if CHECKING_P
> > > @@ -24735,6 +24744,9 @@ static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
> > >  #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
> > >  #endif /* #if CHECKING_P */
> > >
> > > +#undef TARGET_MOVE_WITH_MODE_P
> > > +#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p
> > > +
> > >  struct gcc_target targetm = TARGET_INITIALIZER;
> > >
> > >  #include "gt-i386.h"
> > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > index 49864dd79f8..9d5058610e1 100644
> > > --- a/gcc/doc/tm.texi
> > > +++ b/gcc/doc/tm.texi
> > > @@ -11924,6 +11924,11 @@ statement holding the function call.  Returns true if any change
> > >  was made to the GIMPLE stream.
> > >  @end deftypefn
> > >
> > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> > > +This target hook returns true if move with mode @var{mode} can be
> > > +generated implicitly.  The default definition returns true.
> > > +@end deftypefn
> > > +
> > >  @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree @var{decl1}, tree @var{decl2})
> > >  This hook is used to compare the target attributes in two functions to
> > >  determine which function's features get higher priority.  This is used
> > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > index 95e5e341f07..e9158ab90c6 100644
> > > --- a/gcc/doc/tm.texi.in
> > > +++ b/gcc/doc/tm.texi.in
> > > @@ -7924,6 +7924,8 @@ to by @var{ce_info}.
> > >
> > >  @hook TARGET_GIMPLE_FOLD_BUILTIN
> > >
> > > +@hook TARGET_MOVE_WITH_MODE_P
> > > +
> > >  @hook TARGET_COMPARE_VERSION_PRIORITY
> > >
> > >  @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER
> > > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> > > index c9179abb27e..93267eeabb2 100644
> > > --- a/gcc/gimple-fold.cc
> > > +++ b/gcc/gimple-fold.cc
> > > @@ -1007,6 +1007,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
> > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > >                   && have_insn_for (SET, mode)
> > > +                 && targetm.move_with_mode_p (mode)
> > >                   /* If the destination pointer is not aligned we must be able
> > >                      to emit an unaligned store.  */
> > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > diff --git a/gcc/target.def b/gcc/target.def
> > > index 72c2e1ef756..041d944cb38 100644
> > > --- a/gcc/target.def
> > > +++ b/gcc/target.def
> > > @@ -2489,6 +2489,13 @@ was made to the GIMPLE stream.",
> > >   bool, (gimple_stmt_iterator *gsi),
> > >   hook_bool_gsiptr_false)
> > >
> > > +DEFHOOK
> > > +(move_with_mode_p,
> > > + "This target hook returns true if move with mode @var{mode} can be\n\
> > > +generated implicitly.  The default definition returns true.",
> > > + bool, (machine_mode mode),
> > > + hook_bool_mode_true)
> > > +
> > >  /* Target hook is used to compare the target attributes in two functions to
> > >     determine which function's features get higher priority.  This is used
> > >     during function multi-versioning to figure out the order in which two
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > new file mode 100644
> > > index 00000000000..2091d33c45f
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 -mprefer-vector-width=128" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<8; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > new file mode 100644
> > > index 00000000000..4ad8c8bf379
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake-avx512" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<16; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > new file mode 100644
> > > index 00000000000..7a03160e512
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<16; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > new file mode 100644
> > > index 00000000000..ae2599f6411
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<16; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > new file mode 100644
> > > index 00000000000..f9173684212
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */
> > > +
> > > +struct TestData {
> > > +  float arr[8];
> > > +};
> > > +
> > > +void
> > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > +{
> > > +  for(int i=0; i<16; ++i)
> > > +    s1->arr[i] = s2->arr[i];
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > --
> > > 2.35.1
> > >
>
>
>
> --
> H.J.
  
Richard Sandiford March 9, 2022, 6:04 p.m. UTC | #4
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
>> > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
>> > <gcc-patches@gcc.gnu.org> wrote:
>> > >
>> > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
>> > > The default is
>> > >
>> > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
>> > >
>> > > For x86, it is MOVE_MAX to restore the old behavior before
>> >
>> > I know we've discussed this to death in the PR, I just want to repeat here
>> > that the GIMPLE folding expects to generate a single load and a single
>> > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
>> > was chosen originally (it's documented to what a "single instruction" does).
>> > In practice MOVE_MAX does not seem to cover vector register sizes
>> > so Richard pulled MOVE_RATIO which is really intended to cover
>> > the case of using multiple instructions for moving memory (but then I
>> > don't remember whether for the ARM case the single load/store GIMPLE
>> > will be expanded to multiple load/store instructions).
>> >
>> > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
>> > being very specific for memcpy folding (we also fold memmove btw).
>> >
>> > There is also MOVE_MAX_PIECES which _might_ be more appropriate
>> > than MOVE_MAX here and still honor the idea of single instructions.
>> > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
>> > not MOVE_MAX * MOVE_RATIO.
>> >
>> > So if we need a new hook then that hook should at least get the
>> > 'speed' argument of MOVE_RATIO and it should get a better name.
>> >
>> > I still think that it should be possible to improve the insn check to
>> > avoid use of "disabled" modes, maybe that's also a point to add
>> > a new hook like .move_with_mode_p or so?  To quote, we do
>>
>> Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
>
> Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> and whose x86 implementation would already be fine (doing larger moves
> and also not doing too large moves).  But appearantly the arm folks
> decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.

It seems like there are old comments and old documentation that justify
both interpretations, so there are good arguments on both sides.  But
with this kind of thing I think we have to infer the meaning of the
macro from the way it's currently used, rather than trusting such old
and possibly out-of-date and contradictory information.

FWIW, I agree that (if we exclude old reload, which we should!) the
only direct uses of MOVE_MAX before the patch were not specific to
integer registers and so MOVE_MAX should include vectors if the
target wants vector modes to be used for general movement.

Even if people disagree that that's the current meaning, I think it's
at least a sensible meaning.  It provides information that AFAIK isn't
available otherwise, and it avoids overlap with MAX_FIXED_MODE_SIZE.

So FWIW, I think it'd be reasonable to change non-x86 targets if they
want vector modes to be used for single-insn copies.

Thanks,
Richard
  
H.J. Lu March 9, 2022, 6:07 p.m. UTC | #5
On Wed, Mar 9, 2022 at 12:25 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Mar 8, 2022 at 4:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Mar 7, 2022 at 5:45 AM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > > > > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> > > > > > The default is
> > > > > >
> > > > > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > > > > >
> > > > > > For x86, it is MOVE_MAX to restore the old behavior before
> > > > >
> > > > > I know we've discussed this to death in the PR, I just want to repeat here
> > > > > that the GIMPLE folding expects to generate a single load and a single
> > > > > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> > > > > was chosen originally (it's documented to what a "single instruction" does).
> > > > > In practice MOVE_MAX does not seem to cover vector register sizes
> > > > > so Richard pulled MOVE_RATIO which is really intended to cover
> > > > > the case of using multiple instructions for moving memory (but then I
> > > > > don't remember whether for the ARM case the single load/store GIMPLE
> > > > > will be expanded to multiple load/store instructions).
> > > > >
> > > > > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > > > > being very specific for memcpy folding (we also fold memmove btw).
> > > > >
> > > > > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > > > > than MOVE_MAX here and still honor the idea of single instructions.
> > > > > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > > > > not MOVE_MAX * MOVE_RATIO.
> > > > >
> > > > > So if we need a new hook then that hook should at least get the
> > > > > 'speed' argument of MOVE_RATIO and it should get a better name.
> > > > >
> > > > > I still think that it should be possible to improve the insn check to
> > > > > avoid use of "disabled" modes, maybe that's also a point to add
> > > > > a new hook like .move_with_mode_p or so?  To quote, we do
> > > >
> > > > Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> > >
> > > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> > > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> > > and whose x86 implementation would already be fine (doing larger moves
> > > and also not doing too large moves).  But appearantly the arm folks
> > > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
> > >
> > > Yes, MOVE_MAX_PIECES is documented to apply to move_by_pieces.
> > > Still GIMPLE memcpy/memmove inlining wants to mimic exactly that but
> > > restrict itself to a single load and a single store.
> > >
> > > > >
> > > > >               scalar_int_mode mode;
> > > > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > > > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > > > >                   && have_insn_for (SET, mode)
> > > > >                   /* If the destination pointer is not aligned we must be able
> > > > >                      to emit an unaligned store.  */
> > > > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > > >                       || !targetm.slow_unaligned_access (mode, dest_align)
> > > > >                       || (optab_handler (movmisalign_optab, mode)
> > > > >                           != CODE_FOR_nothing)))
> > > > >
> > > > > where I understand the ISA is enabled and if the user explicitely
> > > > > uses it that's OK but -mprefer-avx128 should tell GCC to never
> > > > > generate AVX256 code where the user was not explicitely using it
> > > > > (still for example glibc might happily use AVX256 code to implement
> > > > > the memcpy we are folding!)
> > > > >
> > > > > Note the BB vectorizer also might end up with using AVX256 because
> > > > > in places it also relies on optab queries and the vector_mode_supported_p
> > > > > check (but the memcpy folding uses the fake integer modes).  So
> > > > > x86 might need to implement the related_mode hook to avoid "auto"-using
> > > > > a larger vector mode which the default implementation would happily do.
> > > > >
> > > > > Richard.
> > > >
> > > > OK for master?
> > >
> > > Looking for opinions from others as well.
> > >
> > > Btw, there's a similar use in expand_DEFERRED_INIT:
> > >
> > >           && int_mode_for_size (tree_to_uhwi (var_size) * BITS_PER_UNIT,
> > >                                 0).exists (&var_mode)
> > >           && have_insn_for (SET, var_mode))
> > >
> > > So it occured to me that maybe targetm.move_with_mode_p should eventually
> >
> > TARGET_MOVE_WITH_MODE_P shouldn't be used here since we do want to
> > use var_mode.
> >
> > > check have_insn_for (SET, var_mode) or we should abstract checking the two
> > > things to a generic API somewhere (in optabs-query.h maybe, or expmed.h,
> > > not sure where it would be more appropriate).
> > >
> > > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> > > > +This target hook returns true if move with mode @var{mode} can be
> > > > +generated implicitly.  The default definition returns true.
> > > > +@end deftypefn
> > >
> > > I know what you mean but I'm not sure "can be generated implicitly" captures
> > > that.  The compiler always generated stuff explicitely.  It's also
> > > "should", not "can", no?
> >
> > TARGET_MOVE_WITH_MODE_P is a target hook for compiler optimization.
> > It is totally OK for TARGET_MOVE_WITH_MODE_P to return false.   Will
> > TARGET_FAST_MOVE_WITH_MODE_P be better?
>
> I thought the PR to address was about -mprefer-avx128 but memcpy using
> YMM regs?  So it's not about "fast" it's about fulfilling user expectations, no?
>
> As said elsewhere the vectorizer gets to honor -mprefer-avx128 via the
> vector_modes hook to iterate over but also with the related_mode hook
> (which x86 chooses to not implement).  It sounds like we need something
> similar for integer modes (aka "move" modes), but then does x86 "honor"
> -mprefer-avx128 when doing *_by_pieces?  And how does it do that there?

TARGET_PREFERRED_MOVE_WITH_MODE_P?

> Richard.
>
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks.
> > > >
> > > > H.J.
> > > > ---
> > > > Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be
> > > > generated implicitly.  The default definition returns true.  The x86
> > > > version returns true if the mode size <= MOVE_MAX, which is the max
> > > > number of bytes we can move in one reasonably fast instruction.
> > > >
> > > > gcc/
> > > >
> > > >         PR target/103393
> > > >         * gimple-fold.cc (gimple_fold_builtin_memory_op): Call
> > > >         targetm.move_with_mode_p to check if move with mode can be
> > > >         generated implicitly.
> > > >         * target.def: Add move_with_mode_p.
> > > >         * targhooks.cc (default_move_with_mode_p): New.
> > > >         * targhooks.h (default_move_with_mode_p): Likewise.
> > > >         * config/i386/i386.cc (ix86_move_with_mode_p): New.
> > > >         (TARGET_MOVE_WITH_MODE_P): Likewise.
> > > >         * doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P.
> > > >         * doc/tm.texi: Regenerate.
> > > >
> > > > gcc/testsuite/
> > > >
> > > >         PR target/103393
> > > >         * gcc.target/i386/pr103393-1.c: New test.
> > > >         * gcc.target/i386/pr103393-2.c: Likewise.
> > > >         * gcc.target/i386/pr103393-3.c: Likewise.
> > > >         * gcc.target/i386/pr103393-4.c: Likewise.
> > > >         * gcc.target/i386/pr103393-5.c: Likewise.
> > > > ---
> > > >  gcc/config/i386/i386.cc                    | 12 ++++++++++++
> > > >  gcc/doc/tm.texi                            |  5 +++++
> > > >  gcc/doc/tm.texi.in                         |  2 ++
> > > >  gcc/gimple-fold.cc                         |  1 +
> > > >  gcc/target.def                             |  7 +++++++
> > > >  gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++
> > > >  10 files changed, 107 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index b2bf90576d5..68a2c59220c 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -23918,6 +23918,15 @@ ix86_push_rounding (poly_int64 bytes)
> > > >    return ROUND_UP (bytes, UNITS_PER_WORD);
> > > >  }
> > > >
> > > > +/* Implement the TARGET_MOVE_WITH_MODE_P hook.  Return true if move
> > > > +   with MODE can be generated implicitly.  */
> > > > +
> > > > +static bool
> > > > +ix86_move_with_mode_p (machine_mode mode)
> > > > +{
> > > > +  return GET_MODE_SIZE (mode) <= MOVE_MAX;
> > > > +}
> > > > +
> > > >  /* Target-specific selftests.  */
> > > >
> > > >  #if CHECKING_P
> > > > @@ -24735,6 +24744,9 @@ static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
> > > >  #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
> > > >  #endif /* #if CHECKING_P */
> > > >
> > > > +#undef TARGET_MOVE_WITH_MODE_P
> > > > +#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p
> > > > +
> > > >  struct gcc_target targetm = TARGET_INITIALIZER;
> > > >
> > > >  #include "gt-i386.h"
> > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > index 49864dd79f8..9d5058610e1 100644
> > > > --- a/gcc/doc/tm.texi
> > > > +++ b/gcc/doc/tm.texi
> > > > @@ -11924,6 +11924,11 @@ statement holding the function call.  Returns true if any change
> > > >  was made to the GIMPLE stream.
> > > >  @end deftypefn
> > > >
> > > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> > > > +This target hook returns true if move with mode @var{mode} can be
> > > > +generated implicitly.  The default definition returns true.
> > > > +@end deftypefn
> > > > +
> > > >  @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree @var{decl1}, tree @var{decl2})
> > > >  This hook is used to compare the target attributes in two functions to
> > > >  determine which function's features get higher priority.  This is used
> > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > index 95e5e341f07..e9158ab90c6 100644
> > > > --- a/gcc/doc/tm.texi.in
> > > > +++ b/gcc/doc/tm.texi.in
> > > > @@ -7924,6 +7924,8 @@ to by @var{ce_info}.
> > > >
> > > >  @hook TARGET_GIMPLE_FOLD_BUILTIN
> > > >
> > > > +@hook TARGET_MOVE_WITH_MODE_P
> > > > +
> > > >  @hook TARGET_COMPARE_VERSION_PRIORITY
> > > >
> > > >  @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER
> > > > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> > > > index c9179abb27e..93267eeabb2 100644
> > > > --- a/gcc/gimple-fold.cc
> > > > +++ b/gcc/gimple-fold.cc
> > > > @@ -1007,6 +1007,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
> > > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > > >                   && have_insn_for (SET, mode)
> > > > +                 && targetm.move_with_mode_p (mode)
> > > >                   /* If the destination pointer is not aligned we must be able
> > > >                      to emit an unaligned store.  */
> > > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > index 72c2e1ef756..041d944cb38 100644
> > > > --- a/gcc/target.def
> > > > +++ b/gcc/target.def
> > > > @@ -2489,6 +2489,13 @@ was made to the GIMPLE stream.",
> > > >   bool, (gimple_stmt_iterator *gsi),
> > > >   hook_bool_gsiptr_false)
> > > >
> > > > +DEFHOOK
> > > > +(move_with_mode_p,
> > > > + "This target hook returns true if move with mode @var{mode} can be\n\
> > > > +generated implicitly.  The default definition returns true.",
> > > > + bool, (machine_mode mode),
> > > > + hook_bool_mode_true)
> > > > +
> > > >  /* Target hook is used to compare the target attributes in two functions to
> > > >     determine which function's features get higher priority.  This is used
> > > >     during function multi-versioning to figure out the order in which two
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > > new file mode 100644
> > > > index 00000000000..2091d33c45f
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > > @@ -0,0 +1,16 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 -mprefer-vector-width=128" } */
> > > > +
> > > > +struct TestData {
> > > > +  float arr[8];
> > > > +};
> > > > +
> > > > +void
> > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > +{
> > > > +  for(int i=0; i<8; ++i)
> > > > +    s1->arr[i] = s2->arr[i];
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> > > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > > new file mode 100644
> > > > index 00000000000..4ad8c8bf379
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > > @@ -0,0 +1,16 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -march=skylake-avx512" } */
> > > > +
> > > > +struct TestData {
> > > > +  float arr[8];
> > > > +};
> > > > +
> > > > +void
> > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > +{
> > > > +  for(int i=0; i<16; ++i)
> > > > +    s1->arr[i] = s2->arr[i];
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> > > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > > new file mode 100644
> > > > index 00000000000..7a03160e512
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > > @@ -0,0 +1,16 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > > +
> > > > +struct TestData {
> > > > +  float arr[8];
> > > > +};
> > > > +
> > > > +void
> > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > +{
> > > > +  for(int i=0; i<16; ++i)
> > > > +    s1->arr[i] = s2->arr[i];
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > > new file mode 100644
> > > > index 00000000000..ae2599f6411
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > > @@ -0,0 +1,16 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */
> > > > +
> > > > +struct TestData {
> > > > +  float arr[8];
> > > > +};
> > > > +
> > > > +void
> > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > +{
> > > > +  for(int i=0; i<16; ++i)
> > > > +    s1->arr[i] = s2->arr[i];
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > > new file mode 100644
> > > > index 00000000000..f9173684212
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > > @@ -0,0 +1,16 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */
> > > > +
> > > > +struct TestData {
> > > > +  float arr[8];
> > > > +};
> > > > +
> > > > +void
> > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > +{
> > > > +  for(int i=0; i<16; ++i)
> > > > +    s1->arr[i] = s2->arr[i];
> > > > +}
> > > > +
> > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > --
> > > > 2.35.1
> > > >
> >
> >
> >
> > --
> > H.J.
  
Richard Biener March 10, 2022, 8:06 a.m. UTC | #6
On Wed, Mar 9, 2022 at 7:04 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> >> > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> > >
> >> > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> >> > > The default is
> >> > >
> >> > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> >> > >
> >> > > For x86, it is MOVE_MAX to restore the old behavior before
> >> >
> >> > I know we've discussed this to death in the PR, I just want to repeat here
> >> > that the GIMPLE folding expects to generate a single load and a single
> >> > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> >> > was chosen originally (it's documented to what a "single instruction" does).
> >> > In practice MOVE_MAX does not seem to cover vector register sizes
> >> > so Richard pulled MOVE_RATIO which is really intended to cover
> >> > the case of using multiple instructions for moving memory (but then I
> >> > don't remember whether for the ARM case the single load/store GIMPLE
> >> > will be expanded to multiple load/store instructions).
> >> >
> >> > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> >> > being very specific for memcpy folding (we also fold memmove btw).
> >> >
> >> > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> >> > than MOVE_MAX here and still honor the idea of single instructions.
> >> > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> >> > not MOVE_MAX * MOVE_RATIO.
> >> >
> >> > So if we need a new hook then that hook should at least get the
> >> > 'speed' argument of MOVE_RATIO and it should get a better name.
> >> >
> >> > I still think that it should be possible to improve the insn check to
> >> > avoid use of "disabled" modes, maybe that's also a point to add
> >> > a new hook like .move_with_mode_p or so?  To quote, we do
> >>
> >> Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> >
> > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> > and whose x86 implementation would already be fine (doing larger moves
> > and also not doing too large moves).  But appearantly the arm folks
> > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
>
> It seems like there are old comments and old documentation that justify
> both interpretations, so there are good arguments on both sides.  But
> with this kind of thing I think we have to infer the meaning of the
> macro from the way it's currently used, rather than trusting such old
> and possibly out-of-date and contradictory information.
>
> FWIW, I agree that (if we exclude old reload, which we should!) the
> only direct uses of MOVE_MAX before the patch were not specific to
> integer registers and so MOVE_MAX should include vectors if the
> target wants vector modes to be used for general movement.
>
> Even if people disagree that that's the current meaning, I think it's
> at least a sensible meaning.  It provides information that AFAIK isn't
> available otherwise, and it avoids overlap with MAX_FIXED_MODE_SIZE.
>
> So FWIW, I think it'd be reasonable to change non-x86 targets if they
> want vector modes to be used for single-insn copies.

Note a slight complication in the GIMPLE folding case is that we
do not end up using vector modes but we're using "fake"
integer modes like OImode which x86 has move patterns for.
If we'd use vector modes we could use existing target hooks to
eventually decide whether auto-using those is desired or not.

>
> Thanks,
> Richard
  
Richard Biener March 10, 2022, 8:20 a.m. UTC | #7
On Wed, Mar 9, 2022 at 7:08 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 12:25 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Tue, Mar 8, 2022 at 4:44 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Mon, Mar 7, 2022 at 5:45 AM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > >
> > > > > On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > > > > > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> > > > > > > The default is
> > > > > > >
> > > > > > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > > > > > >
> > > > > > > For x86, it is MOVE_MAX to restore the old behavior before
> > > > > >
> > > > > > I know we've discussed this to death in the PR, I just want to repeat here
> > > > > > that the GIMPLE folding expects to generate a single load and a single
> > > > > > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> > > > > > was chosen originally (it's documented to what a "single instruction" does).
> > > > > > In practice MOVE_MAX does not seem to cover vector register sizes
> > > > > > so Richard pulled MOVE_RATIO which is really intended to cover
> > > > > > the case of using multiple instructions for moving memory (but then I
> > > > > > don't remember whether for the ARM case the single load/store GIMPLE
> > > > > > will be expanded to multiple load/store instructions).
> > > > > >
> > > > > > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > > > > > being very specific for memcpy folding (we also fold memmove btw).
> > > > > >
> > > > > > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > > > > > than MOVE_MAX here and still honor the idea of single instructions.
> > > > > > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > > > > > not MOVE_MAX * MOVE_RATIO.
> > > > > >
> > > > > > So if we need a new hook then that hook should at least get the
> > > > > > 'speed' argument of MOVE_RATIO and it should get a better name.
> > > > > >
> > > > > > I still think that it should be possible to improve the insn check to
> > > > > > avoid use of "disabled" modes, maybe that's also a point to add
> > > > > > a new hook like .move_with_mode_p or so?  To quote, we do
> > > > >
> > > > > Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> > > >
> > > > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> > > > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> > > > and whose x86 implementation would already be fine (doing larger moves
> > > > and also not doing too large moves).  But appearantly the arm folks
> > > > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
> > > >
> > > > Yes, MOVE_MAX_PIECES is documented to apply to move_by_pieces.
> > > > Still GIMPLE memcpy/memmove inlining wants to mimic exactly that but
> > > > restrict itself to a single load and a single store.
> > > >
> > > > > >
> > > > > >               scalar_int_mode mode;
> > > > > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > > > > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > > > > >                   && have_insn_for (SET, mode)
> > > > > >                   /* If the destination pointer is not aligned we must be able
> > > > > >                      to emit an unaligned store.  */
> > > > > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > > > >                       || !targetm.slow_unaligned_access (mode, dest_align)
> > > > > >                       || (optab_handler (movmisalign_optab, mode)
> > > > > >                           != CODE_FOR_nothing)))
> > > > > >
> > > > > > where I understand the ISA is enabled and if the user explicitely
> > > > > > uses it that's OK but -mprefer-avx128 should tell GCC to never
> > > > > > generate AVX256 code where the user was not explicitely using it
> > > > > > (still for example glibc might happily use AVX256 code to implement
> > > > > > the memcpy we are folding!)
> > > > > >
> > > > > > Note the BB vectorizer also might end up with using AVX256 because
> > > > > > in places it also relies on optab queries and the vector_mode_supported_p
> > > > > > check (but the memcpy folding uses the fake integer modes).  So
> > > > > > x86 might need to implement the related_mode hook to avoid "auto"-using
> > > > > > a larger vector mode which the default implementation would happily do.
> > > > > >
> > > > > > Richard.
> > > > >
> > > > > OK for master?
> > > >
> > > > Looking for opinions from others as well.
> > > >
> > > > Btw, there's a similar use in expand_DEFERRED_INIT:
> > > >
> > > >           && int_mode_for_size (tree_to_uhwi (var_size) * BITS_PER_UNIT,
> > > >                                 0).exists (&var_mode)
> > > >           && have_insn_for (SET, var_mode))
> > > >
> > > > So it occured to me that maybe targetm.move_with_mode_p should eventually
> > >
> > > TARGET_MOVE_WITH_MODE_P shouldn't be used here since we do want to
> > > use var_mode.
> > >
> > > > check have_insn_for (SET, var_mode) or we should abstract checking the two
> > > > things to a generic API somewhere (in optabs-query.h maybe, or expmed.h,
> > > > not sure where it would be more appropriate).
> > > >
> > > > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> > > > > +This target hook returns true if move with mode @var{mode} can be
> > > > > +generated implicitly.  The default definition returns true.
> > > > > +@end deftypefn
> > > >
> > > > I know what you mean but I'm not sure "can be generated implicitly" captures
> > > > that.  The compiler always generated stuff explicitely.  It's also
> > > > "should", not "can", no?
> > >
> > > TARGET_MOVE_WITH_MODE_P is a target hook for compiler optimization.
> > > It is totally OK for TARGET_MOVE_WITH_MODE_P to return false.   Will
> > > TARGET_FAST_MOVE_WITH_MODE_P be better?
> >
> > I thought the PR to address was about -mprefer-avx128 but memcpy using
> > YMM regs?  So it's not about "fast" it's about fulfilling user expectations, no?
> >
> > As said elsewhere the vectorizer gets to honor -mprefer-avx128 via the
> > vector_modes hook to iterate over but also with the related_mode hook
> > (which x86 chooses to not implement).  It sounds like we need something
> > similar for integer modes (aka "move" modes), but then does x86 "honor"
> > -mprefer-avx128 when doing *_by_pieces?  And how does it do that there?
>
> TARGET_PREFERRED_MOVE_WITH_MODE_P?

That's an alternative suggestion for the name?  Looking at what we have
(I'm still trying to avoid yet another hook), there is
TARGET_USE_BY_PIECES_INFRASTRUCTURE_P which gets passed
length, alignment, kind and speed_p.  I think that hook would be the one
to tell the middle-end the largest piece (mode or size) to use?

GIMPLE folding would then disable itself if that largest mode < size.
The default implementation would use MOVE_MAX_PIECES * MOVE_RATIO
(instead of MOVE_MAX * MOVE_RATIO) and arn could implement
this hook to unleash it to not only allow MAX_MOVE_PIECES but
MAX_MOVE_PIECES * MOVE_RATIO here.  That also highlights that
Richards change was somewhat dubious since a corresponding change
in _by_pieces is not done (and following through would then enable
this there).  That is, it looks like Richards change is a (bad) workaround
for GIMPLE folding not considering multiple loads/stores?

The hook is implemented by 5 targets right now so the amount of changes
should be managable, the obvious default max-mode size is then
just {MOVE,STORE,COMPARE}_MAX_PIECES.

Given all of the above I'm inclined to revert the offending change and
instead change MOVE_MAX to MOVE_MAX_PIECES (because we're
doing by-pieces with an additional restriction of 1 piece).

Richard.

> > Richard.
> >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > Thanks.
> > > > >
> > > > > H.J.
> > > > > ---
> > > > > Add TARGET_MOVE_WITH_MODE_P to return true if move with mode can be
> > > > > generated implicitly.  The default definition returns true.  The x86
> > > > > version returns true if the mode size <= MOVE_MAX, which is the max
> > > > > number of bytes we can move in one reasonably fast instruction.
> > > > >
> > > > > gcc/
> > > > >
> > > > >         PR target/103393
> > > > >         * gimple-fold.cc (gimple_fold_builtin_memory_op): Call
> > > > >         targetm.move_with_mode_p to check if move with mode can be
> > > > >         generated implicitly.
> > > > >         * target.def: Add move_with_mode_p.
> > > > >         * targhooks.cc (default_move_with_mode_p): New.
> > > > >         * targhooks.h (default_move_with_mode_p): Likewise.
> > > > >         * config/i386/i386.cc (ix86_move_with_mode_p): New.
> > > > >         (TARGET_MOVE_WITH_MODE_P): Likewise.
> > > > >         * doc/tm.texi.in: Add TARGET_MOVE_WITH_MODE_P.
> > > > >         * doc/tm.texi: Regenerate.
> > > > >
> > > > > gcc/testsuite/
> > > > >
> > > > >         PR target/103393
> > > > >         * gcc.target/i386/pr103393-1.c: New test.
> > > > >         * gcc.target/i386/pr103393-2.c: Likewise.
> > > > >         * gcc.target/i386/pr103393-3.c: Likewise.
> > > > >         * gcc.target/i386/pr103393-4.c: Likewise.
> > > > >         * gcc.target/i386/pr103393-5.c: Likewise.
> > > > > ---
> > > > >  gcc/config/i386/i386.cc                    | 12 ++++++++++++
> > > > >  gcc/doc/tm.texi                            |  5 +++++
> > > > >  gcc/doc/tm.texi.in                         |  2 ++
> > > > >  gcc/gimple-fold.cc                         |  1 +
> > > > >  gcc/target.def                             |  7 +++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-1.c | 16 ++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-2.c | 16 ++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-3.c | 16 ++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-4.c | 16 ++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr103393-5.c | 16 ++++++++++++++++
> > > > >  10 files changed, 107 insertions(+)
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > index b2bf90576d5..68a2c59220c 100644
> > > > > --- a/gcc/config/i386/i386.cc
> > > > > +++ b/gcc/config/i386/i386.cc
> > > > > @@ -23918,6 +23918,15 @@ ix86_push_rounding (poly_int64 bytes)
> > > > >    return ROUND_UP (bytes, UNITS_PER_WORD);
> > > > >  }
> > > > >
> > > > > +/* Implement the TARGET_MOVE_WITH_MODE_P hook.  Return true if move
> > > > > +   with MODE can be generated implicitly.  */
> > > > > +
> > > > > +static bool
> > > > > +ix86_move_with_mode_p (machine_mode mode)
> > > > > +{
> > > > > +  return GET_MODE_SIZE (mode) <= MOVE_MAX;
> > > > > +}
> > > > > +
> > > > >  /* Target-specific selftests.  */
> > > > >
> > > > >  #if CHECKING_P
> > > > > @@ -24735,6 +24744,9 @@ static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
> > > > >  #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
> > > > >  #endif /* #if CHECKING_P */
> > > > >
> > > > > +#undef TARGET_MOVE_WITH_MODE_P
> > > > > +#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p
> > > > > +
> > > > >  struct gcc_target targetm = TARGET_INITIALIZER;
> > > > >
> > > > >  #include "gt-i386.h"
> > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > index 49864dd79f8..9d5058610e1 100644
> > > > > --- a/gcc/doc/tm.texi
> > > > > +++ b/gcc/doc/tm.texi
> > > > > @@ -11924,6 +11924,11 @@ statement holding the function call.  Returns true if any change
> > > > >  was made to the GIMPLE stream.
> > > > >  @end deftypefn
> > > > >
> > > > > +@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
> > > > > +This target hook returns true if move with mode @var{mode} can be
> > > > > +generated implicitly.  The default definition returns true.
> > > > > +@end deftypefn
> > > > > +
> > > > >  @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree @var{decl1}, tree @var{decl2})
> > > > >  This hook is used to compare the target attributes in two functions to
> > > > >  determine which function's features get higher priority.  This is used
> > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > index 95e5e341f07..e9158ab90c6 100644
> > > > > --- a/gcc/doc/tm.texi.in
> > > > > +++ b/gcc/doc/tm.texi.in
> > > > > @@ -7924,6 +7924,8 @@ to by @var{ce_info}.
> > > > >
> > > > >  @hook TARGET_GIMPLE_FOLD_BUILTIN
> > > > >
> > > > > +@hook TARGET_MOVE_WITH_MODE_P
> > > > > +
> > > > >  @hook TARGET_COMPARE_VERSION_PRIORITY
> > > > >
> > > > >  @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER
> > > > > diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> > > > > index c9179abb27e..93267eeabb2 100644
> > > > > --- a/gcc/gimple-fold.cc
> > > > > +++ b/gcc/gimple-fold.cc
> > > > > @@ -1007,6 +1007,7 @@ gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
> > > > >               if (int_mode_for_size (ilen * 8, 0).exists (&mode)
> > > > >                   && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
> > > > >                   && have_insn_for (SET, mode)
> > > > > +                 && targetm.move_with_mode_p (mode)
> > > > >                   /* If the destination pointer is not aligned we must be able
> > > > >                      to emit an unaligned store.  */
> > > > >                   && (dest_align >= GET_MODE_ALIGNMENT (mode)
> > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > index 72c2e1ef756..041d944cb38 100644
> > > > > --- a/gcc/target.def
> > > > > +++ b/gcc/target.def
> > > > > @@ -2489,6 +2489,13 @@ was made to the GIMPLE stream.",
> > > > >   bool, (gimple_stmt_iterator *gsi),
> > > > >   hook_bool_gsiptr_false)
> > > > >
> > > > > +DEFHOOK
> > > > > +(move_with_mode_p,
> > > > > + "This target hook returns true if move with mode @var{mode} can be\n\
> > > > > +generated implicitly.  The default definition returns true.",
> > > > > + bool, (machine_mode mode),
> > > > > + hook_bool_mode_true)
> > > > > +
> > > > >  /* Target hook is used to compare the target attributes in two functions to
> > > > >     determine which function's features get higher priority.  This is used
> > > > >     during function multi-versioning to figure out the order in which two
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > > > new file mode 100644
> > > > > index 00000000000..2091d33c45f
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 -mprefer-vector-width=128" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<8; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> > > > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > > > new file mode 100644
> > > > > index 00000000000..4ad8c8bf379
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=skylake-avx512" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<16; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
> > > > > +/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > > > new file mode 100644
> > > > > index 00000000000..7a03160e512
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<16; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > > > new file mode 100644
> > > > > index 00000000000..ae2599f6411
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<16; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > > > new file mode 100644
> > > > > index 00000000000..f9173684212
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c
> > > > > @@ -0,0 +1,16 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */
> > > > > +
> > > > > +struct TestData {
> > > > > +  float arr[8];
> > > > > +};
> > > > > +
> > > > > +void
> > > > > +cpy (struct TestData *s1, struct TestData *s2 )
> > > > > +{
> > > > > +  for(int i=0; i<16; ++i)
> > > > > +    s1->arr[i] = s2->arr[i];
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > > > +/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
> > > > > --
> > > > > 2.35.1
> > > > >
> > >
> > >
> > >
> > > --
> > > H.J.
>
>
>
> --
> H.J.
  
Richard Sandiford March 14, 2022, 3:44 p.m. UTC | #8
Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Mar 9, 2022 at 7:04 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>> >>
>> >> On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
>> >> > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
>> >> > <gcc-patches@gcc.gnu.org> wrote:
>> >> > >
>> >> > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
>> >> > > The default is
>> >> > >
>> >> > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
>> >> > >
>> >> > > For x86, it is MOVE_MAX to restore the old behavior before
>> >> >
>> >> > I know we've discussed this to death in the PR, I just want to repeat here
>> >> > that the GIMPLE folding expects to generate a single load and a single
>> >> > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
>> >> > was chosen originally (it's documented to what a "single instruction" does).
>> >> > In practice MOVE_MAX does not seem to cover vector register sizes
>> >> > so Richard pulled MOVE_RATIO which is really intended to cover
>> >> > the case of using multiple instructions for moving memory (but then I
>> >> > don't remember whether for the ARM case the single load/store GIMPLE
>> >> > will be expanded to multiple load/store instructions).
>> >> >
>> >> > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
>> >> > being very specific for memcpy folding (we also fold memmove btw).
>> >> >
>> >> > There is also MOVE_MAX_PIECES which _might_ be more appropriate
>> >> > than MOVE_MAX here and still honor the idea of single instructions.
>> >> > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
>> >> > not MOVE_MAX * MOVE_RATIO.
>> >> >
>> >> > So if we need a new hook then that hook should at least get the
>> >> > 'speed' argument of MOVE_RATIO and it should get a better name.
>> >> >
>> >> > I still think that it should be possible to improve the insn check to
>> >> > avoid use of "disabled" modes, maybe that's also a point to add
>> >> > a new hook like .move_with_mode_p or so?  To quote, we do
>> >>
>> >> Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
>> >
>> > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
>> > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
>> > and whose x86 implementation would already be fine (doing larger moves
>> > and also not doing too large moves).  But appearantly the arm folks
>> > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
>>
>> It seems like there are old comments and old documentation that justify
>> both interpretations, so there are good arguments on both sides.  But
>> with this kind of thing I think we have to infer the meaning of the
>> macro from the way it's currently used, rather than trusting such old
>> and possibly out-of-date and contradictory information.
>>
>> FWIW, I agree that (if we exclude old reload, which we should!) the
>> only direct uses of MOVE_MAX before the patch were not specific to
>> integer registers and so MOVE_MAX should include vectors if the
>> target wants vector modes to be used for general movement.
>>
>> Even if people disagree that that's the current meaning, I think it's
>> at least a sensible meaning.  It provides information that AFAIK isn't
>> available otherwise, and it avoids overlap with MAX_FIXED_MODE_SIZE.
>>
>> So FWIW, I think it'd be reasonable to change non-x86 targets if they
>> want vector modes to be used for single-insn copies.
>
> Note a slight complication in the GIMPLE folding case is that we
> do not end up using vector modes but we're using "fake"
> integer modes like OImode which x86 has move patterns for.
> If we'd use vector modes we could use existing target hooks to
> eventually decide whether auto-using those is desired or not.

Hmm, yeah.  Certainly we shouldn't require the target to support
a scalar integer equivalent of every vector mode.

Richard
  
H.J. Lu March 22, 2022, 2:50 a.m. UTC | #9
On Mon, Mar 14, 2022 at 8:44 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Mar 9, 2022 at 7:04 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> >>
> >> >> On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> >> >> > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> >> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> >> > >
> >> >> > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> >> >> > > The default is
> >> >> > >
> >> >> > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> >> >> > >
> >> >> > > For x86, it is MOVE_MAX to restore the old behavior before
> >> >> >
> >> >> > I know we've discussed this to death in the PR, I just want to repeat here
> >> >> > that the GIMPLE folding expects to generate a single load and a single
> >> >> > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> >> >> > was chosen originally (it's documented to what a "single instruction" does).
> >> >> > In practice MOVE_MAX does not seem to cover vector register sizes
> >> >> > so Richard pulled MOVE_RATIO which is really intended to cover
> >> >> > the case of using multiple instructions for moving memory (but then I
> >> >> > don't remember whether for the ARM case the single load/store GIMPLE
> >> >> > will be expanded to multiple load/store instructions).
> >> >> >
> >> >> > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> >> >> > being very specific for memcpy folding (we also fold memmove btw).
> >> >> >
> >> >> > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> >> >> > than MOVE_MAX here and still honor the idea of single instructions.
> >> >> > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> >> >> > not MOVE_MAX * MOVE_RATIO.
> >> >> >
> >> >> > So if we need a new hook then that hook should at least get the
> >> >> > 'speed' argument of MOVE_RATIO and it should get a better name.
> >> >> >
> >> >> > I still think that it should be possible to improve the insn check to
> >> >> > avoid use of "disabled" modes, maybe that's also a point to add
> >> >> > a new hook like .move_with_mode_p or so?  To quote, we do
> >> >>
> >> >> Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> >> >
> >> > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> >> > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> >> > and whose x86 implementation would already be fine (doing larger moves
> >> > and also not doing too large moves).  But appearantly the arm folks
> >> > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
> >>
> >> It seems like there are old comments and old documentation that justify
> >> both interpretations, so there are good arguments on both sides.  But
> >> with this kind of thing I think we have to infer the meaning of the
> >> macro from the way it's currently used, rather than trusting such old
> >> and possibly out-of-date and contradictory information.
> >>
> >> FWIW, I agree that (if we exclude old reload, which we should!) the
> >> only direct uses of MOVE_MAX before the patch were not specific to
> >> integer registers and so MOVE_MAX should include vectors if the
> >> target wants vector modes to be used for general movement.
> >>
> >> Even if people disagree that that's the current meaning, I think it's
> >> at least a sensible meaning.  It provides information that AFAIK isn't
> >> available otherwise, and it avoids overlap with MAX_FIXED_MODE_SIZE.
> >>
> >> So FWIW, I think it'd be reasonable to change non-x86 targets if they
> >> want vector modes to be used for single-insn copies.
> >
> > Note a slight complication in the GIMPLE folding case is that we
> > do not end up using vector modes but we're using "fake"
> > integer modes like OImode which x86 has move patterns for.
> > If we'd use vector modes we could use existing target hooks to
> > eventually decide whether auto-using those is desired or not.
>
> Hmm, yeah.  Certainly we shouldn't require the target to support
> a scalar integer equivalent of every vector mode.
>

I'd like to resolve this before GCC 12 is released.

Thanks.
  
Richard Biener March 22, 2022, 8:20 a.m. UTC | #10
On Tue, Mar 22, 2022 at 3:51 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Mar 14, 2022 at 8:44 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Richard Biener <richard.guenther@gmail.com> writes:
> > > On Wed, Mar 9, 2022 at 7:04 PM Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >>
> > >> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > >> > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >> >>
> > >> >> On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > >> >> > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > >> >> > <gcc-patches@gcc.gnu.org> wrote:
> > >> >> > >
> > >> >> > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> > >> >> > > The default is
> > >> >> > >
> > >> >> > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > >> >> > >
> > >> >> > > For x86, it is MOVE_MAX to restore the old behavior before
> > >> >> >
> > >> >> > I know we've discussed this to death in the PR, I just want to repeat here
> > >> >> > that the GIMPLE folding expects to generate a single load and a single
> > >> >> > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> > >> >> > was chosen originally (it's documented to what a "single instruction" does).
> > >> >> > In practice MOVE_MAX does not seem to cover vector register sizes
> > >> >> > so Richard pulled MOVE_RATIO which is really intended to cover
> > >> >> > the case of using multiple instructions for moving memory (but then I
> > >> >> > don't remember whether for the ARM case the single load/store GIMPLE
> > >> >> > will be expanded to multiple load/store instructions).
> > >> >> >
> > >> >> > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > >> >> > being very specific for memcpy folding (we also fold memmove btw).
> > >> >> >
> > >> >> > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > >> >> > than MOVE_MAX here and still honor the idea of single instructions.
> > >> >> > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > >> >> > not MOVE_MAX * MOVE_RATIO.
> > >> >> >
> > >> >> > So if we need a new hook then that hook should at least get the
> > >> >> > 'speed' argument of MOVE_RATIO and it should get a better name.
> > >> >> >
> > >> >> > I still think that it should be possible to improve the insn check to
> > >> >> > avoid use of "disabled" modes, maybe that's also a point to add
> > >> >> > a new hook like .move_with_mode_p or so?  To quote, we do
> > >> >>
> > >> >> Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> > >> >
> > >> > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> > >> > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> > >> > and whose x86 implementation would already be fine (doing larger moves
> > >> > and also not doing too large moves).  But appearantly the arm folks
> > >> > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
> > >>
> > >> It seems like there are old comments and old documentation that justify
> > >> both interpretations, so there are good arguments on both sides.  But
> > >> with this kind of thing I think we have to infer the meaning of the
> > >> macro from the way it's currently used, rather than trusting such old
> > >> and possibly out-of-date and contradictory information.
> > >>
> > >> FWIW, I agree that (if we exclude old reload, which we should!) the
> > >> only direct uses of MOVE_MAX before the patch were not specific to
> > >> integer registers and so MOVE_MAX should include vectors if the
> > >> target wants vector modes to be used for general movement.
> > >>
> > >> Even if people disagree that that's the current meaning, I think it's
> > >> at least a sensible meaning.  It provides information that AFAIK isn't
> > >> available otherwise, and it avoids overlap with MAX_FIXED_MODE_SIZE.
> > >>
> > >> So FWIW, I think it'd be reasonable to change non-x86 targets if they
> > >> want vector modes to be used for single-insn copies.
> > >
> > > Note a slight complication in the GIMPLE folding case is that we
> > > do not end up using vector modes but we're using "fake"
> > > integer modes like OImode which x86 has move patterns for.
> > > If we'd use vector modes we could use existing target hooks to
> > > eventually decide whether auto-using those is desired or not.
> >
> > Hmm, yeah.  Certainly we shouldn't require the target to support
> > a scalar integer equivalent of every vector mode.
> >
>
> I'd like to resolve this before GCC 12 is released.

I've come to the conclusion that we should revert r12-3482-g5f6a6c91d7c592,
it abuses MOVE_MAX * MOVE_RATIO to trick GIMPLE into thinking we can
move a larger amount of data with a single instruction while in reality it wants
to use multiple load/store stmts - that's something the GIMPLE folding
doesn't do.
If arm wants to use larger moves with a single instruction it should
adjust MOVE_MAX
instead.  In fact the motivating testcase doesn't use larger loads -
we are just able
to elide the memcpy without the stack copy.

Alternatively as Richard hints above, the folding code should try
using integer vector
modes (and adhere to the vector target hooks as to which modes to "auto" use).

That said, iff we do not want to "regress" arm by reversion the new
hackish target hook
should be on the arm side - sth like MOVE_MAX_FOR_GIMPLE, defaulting
to MOVE_MAX.

If we want to use vector integer types/modes or fold to multiple (up
to MOVE_RATIO) stmts
that's stage1 material.

Richard.



> Thanks.
>
>
> --
> H.J.
  
Richard Biener March 23, 2022, 1:58 p.m. UTC | #11
On Tue, Mar 22, 2022 at 9:20 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 3:51 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, Mar 14, 2022 at 8:44 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Richard Biener <richard.guenther@gmail.com> writes:
> > > > On Wed, Mar 9, 2022 at 7:04 PM Richard Sandiford
> > > > <richard.sandiford@arm.com> wrote:
> > > >>
> > > >> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > >> > On Wed, Mar 2, 2022 at 10:18 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >> >>
> > > >> >> On Wed, Mar 02, 2022 at 09:51:26AM +0100, Richard Biener wrote:
> > > >> >> > On Tue, Mar 1, 2022 at 11:41 PM H.J. Lu via Gcc-patches
> > > >> >> > <gcc-patches@gcc.gnu.org> wrote:
> > > >> >> > >
> > > >> >> > > Add TARGET_FOLD_MEMCPY_MAX for the maximum number of bytes to fold memcpy.
> > > >> >> > > The default is
> > > >> >> > >
> > > >> >> > > MOVE_MAX * MOVE_RATIO (optimize_function_for_size_p (cfun))
> > > >> >> > >
> > > >> >> > > For x86, it is MOVE_MAX to restore the old behavior before
> > > >> >> >
> > > >> >> > I know we've discussed this to death in the PR, I just want to repeat here
> > > >> >> > that the GIMPLE folding expects to generate a single load and a single
> > > >> >> > store (that is what it does on the GIMPLE level) which is why MOVE_MAX
> > > >> >> > was chosen originally (it's documented to what a "single instruction" does).
> > > >> >> > In practice MOVE_MAX does not seem to cover vector register sizes
> > > >> >> > so Richard pulled MOVE_RATIO which is really intended to cover
> > > >> >> > the case of using multiple instructions for moving memory (but then I
> > > >> >> > don't remember whether for the ARM case the single load/store GIMPLE
> > > >> >> > will be expanded to multiple load/store instructions).
> > > >> >> >
> > > >> >> > TARGET_FOLD_MEMCPY_MAX sounds like a stop-gap solution,
> > > >> >> > being very specific for memcpy folding (we also fold memmove btw).
> > > >> >> >
> > > >> >> > There is also MOVE_MAX_PIECES which _might_ be more appropriate
> > > >> >> > than MOVE_MAX here and still honor the idea of single instructions.
> > > >> >> > Now neither arm nor aarch64 define this and it defaults to MOVE_MAX,
> > > >> >> > not MOVE_MAX * MOVE_RATIO.
> > > >> >> >
> > > >> >> > So if we need a new hook then that hook should at least get the
> > > >> >> > 'speed' argument of MOVE_RATIO and it should get a better name.
> > > >> >> >
> > > >> >> > I still think that it should be possible to improve the insn check to
> > > >> >> > avoid use of "disabled" modes, maybe that's also a point to add
> > > >> >> > a new hook like .move_with_mode_p or so?  To quote, we do
> > > >> >>
> > > >> >> Here is the v2 patch to add TARGET_MOVE_WITH_MODE_P.
> > > >> >
> > > >> > Again I'd like to shine light on MOVE_MAX_PIECES which explicitely
> > > >> > mentions "a load or store used TO COPY MEMORY" (emphasis mine)
> > > >> > and whose x86 implementation would already be fine (doing larger moves
> > > >> > and also not doing too large moves).  But appearantly the arm folks
> > > >> > decided that that's not fit and instead (mis-?)used MOVE_MAX * MOVE_RATIO.
> > > >>
> > > >> It seems like there are old comments and old documentation that justify
> > > >> both interpretations, so there are good arguments on both sides.  But
> > > >> with this kind of thing I think we have to infer the meaning of the
> > > >> macro from the way it's currently used, rather than trusting such old
> > > >> and possibly out-of-date and contradictory information.
> > > >>
> > > >> FWIW, I agree that (if we exclude old reload, which we should!) the
> > > >> only direct uses of MOVE_MAX before the patch were not specific to
> > > >> integer registers and so MOVE_MAX should include vectors if the
> > > >> target wants vector modes to be used for general movement.
> > > >>
> > > >> Even if people disagree that that's the current meaning, I think it's
> > > >> at least a sensible meaning.  It provides information that AFAIK isn't
> > > >> available otherwise, and it avoids overlap with MAX_FIXED_MODE_SIZE.
> > > >>
> > > >> So FWIW, I think it'd be reasonable to change non-x86 targets if they
> > > >> want vector modes to be used for single-insn copies.
> > > >
> > > > Note a slight complication in the GIMPLE folding case is that we
> > > > do not end up using vector modes but we're using "fake"
> > > > integer modes like OImode which x86 has move patterns for.
> > > > If we'd use vector modes we could use existing target hooks to
> > > > eventually decide whether auto-using those is desired or not.
> > >
> > > Hmm, yeah.  Certainly we shouldn't require the target to support
> > > a scalar integer equivalent of every vector mode.
> > >
> >
> > I'd like to resolve this before GCC 12 is released.
>
> I've come to the conclusion that we should revert r12-3482-g5f6a6c91d7c592,
> it abuses MOVE_MAX * MOVE_RATIO to trick GIMPLE into thinking we can
> move a larger amount of data with a single instruction while in reality it wants
> to use multiple load/store stmts - that's something the GIMPLE folding
> doesn't do.
> If arm wants to use larger moves with a single instruction it should
> adjust MOVE_MAX
> instead.  In fact the motivating testcase doesn't use larger loads -
> we are just able
> to elide the memcpy without the stack copy.

Btw, the memcpy would have been folded by the following code but
chickens out due to arm being STRICT_ALIGNMENT.

      /* Now that we chose an access type express the other side in
         terms of it if the target allows that with respect to alignment
         constraints.  */
      if (srcvar == NULL_TREE)
        {
          if (src_align >= TYPE_ALIGN (desttype))
            srcvar = fold_build2 (MEM_REF, desttype, src, off0);
          else
            {
              if (STRICT_ALIGNMENT)
                return false;

that path could be improved to handle the case where the desttype mode
is not BLKmode and thus we can check movmisalign_optab.

I'm going to test the attached.  I've verified that with this I can revert
r12-3482-g5f6a6c91d7c592 without regressing

        uint64_t bar64(const uint8_t *rData1)
        {
            uint64_t buffer;
            memcpy(&buffer, rData1, sizeof(buffer));
            return buffer;
        }

on arm.

Richard.

>
> Alternatively as Richard hints above, the folding code should try
> using integer vector
> modes (and adhere to the vector target hooks as to which modes to "auto" use).
>
> That said, iff we do not want to "regress" arm by reversion the new
> hackish target hook
> should be on the arm side - sth like MOVE_MAX_FOR_GIMPLE, defaulting
> to MOVE_MAX.
>
> If we want to use vector integer types/modes or fold to multiple (up
> to MOVE_RATIO) stmts
> that's stage1 material.
>
> Richard.
>
>
>
> > Thanks.
> >
> >
> > --
> > H.J.
  

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b2bf90576d5..68a2c59220c 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23918,6 +23918,15 @@  ix86_push_rounding (poly_int64 bytes)
   return ROUND_UP (bytes, UNITS_PER_WORD);
 }
 
+/* Implement the TARGET_MOVE_WITH_MODE_P hook.  Return true if move
+   with MODE can be generated implicitly.  */
+
+static bool
+ix86_move_with_mode_p (machine_mode mode)
+{
+  return GET_MODE_SIZE (mode) <= MOVE_MAX;
+}
+
 /* Target-specific selftests.  */
 
 #if CHECKING_P
@@ -24735,6 +24744,9 @@  static bool ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
 #define TARGET_RUN_TARGET_SELFTESTS selftest::ix86_run_selftests
 #endif /* #if CHECKING_P */
 
+#undef TARGET_MOVE_WITH_MODE_P
+#define TARGET_MOVE_WITH_MODE_P ix86_move_with_mode_p
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-i386.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 49864dd79f8..9d5058610e1 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11924,6 +11924,11 @@  statement holding the function call.  Returns true if any change
 was made to the GIMPLE stream.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_MOVE_WITH_MODE_P (machine_mode @var{mode})
+This target hook returns true if move with mode @var{mode} can be
+generated implicitly.  The default definition returns true.
+@end deftypefn
+
 @deftypefn {Target Hook} int TARGET_COMPARE_VERSION_PRIORITY (tree @var{decl1}, tree @var{decl2})
 This hook is used to compare the target attributes in two functions to
 determine which function's features get higher priority.  This is used
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 95e5e341f07..e9158ab90c6 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7924,6 +7924,8 @@  to by @var{ce_info}.
 
 @hook TARGET_GIMPLE_FOLD_BUILTIN
 
+@hook TARGET_MOVE_WITH_MODE_P
+
 @hook TARGET_COMPARE_VERSION_PRIORITY
 
 @hook TARGET_GET_FUNCTION_VERSIONS_DISPATCHER
diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
index c9179abb27e..93267eeabb2 100644
--- a/gcc/gimple-fold.cc
+++ b/gcc/gimple-fold.cc
@@ -1007,6 +1007,7 @@  gimple_fold_builtin_memory_op (gimple_stmt_iterator *gsi,
 	      if (int_mode_for_size (ilen * 8, 0).exists (&mode)
 		  && GET_MODE_SIZE (mode) * BITS_PER_UNIT == ilen * 8
 		  && have_insn_for (SET, mode)
+		  && targetm.move_with_mode_p (mode)
 		  /* If the destination pointer is not aligned we must be able
 		     to emit an unaligned store.  */
 		  && (dest_align >= GET_MODE_ALIGNMENT (mode)
diff --git a/gcc/target.def b/gcc/target.def
index 72c2e1ef756..041d944cb38 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2489,6 +2489,13 @@  was made to the GIMPLE stream.",
  bool, (gimple_stmt_iterator *gsi),
  hook_bool_gsiptr_false)
 
+DEFHOOK
+(move_with_mode_p,
+ "This target hook returns true if move with mode @var{mode} can be\n\
+generated implicitly.  The default definition returns true.",
+ bool, (machine_mode mode),
+ hook_bool_mode_true)
+
 /* Target hook is used to compare the target attributes in two functions to
    determine which function's features get higher priority.  This is used
    during function multi-versioning to figure out the order in which two
diff --git a/gcc/testsuite/gcc.target/i386/pr103393-1.c b/gcc/testsuite/gcc.target/i386/pr103393-1.c
new file mode 100644
index 00000000000..2091d33c45f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103393-1.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mavx -mprefer-avx128 -mprefer-vector-width=128" } */
+
+struct TestData {
+  float arr[8];
+};
+
+void
+cpy (struct TestData *s1, struct TestData *s2 )
+{
+  for(int i=0; i<8; ++i)
+    s1->arr[i] = s2->arr[i];
+}
+
+/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr103393-2.c b/gcc/testsuite/gcc.target/i386/pr103393-2.c
new file mode 100644
index 00000000000..4ad8c8bf379
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103393-2.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+struct TestData {
+  float arr[8];
+};
+
+void
+cpy (struct TestData *s1, struct TestData *s2 )
+{
+  for(int i=0; i<16; ++i)
+    s1->arr[i] = s2->arr[i];
+}
+
+/* { dg-final { scan-assembler "jmp\[\\t \]+_?memmove" { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler "call\[\\t \]+_?memmove" { target ia32 } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr103393-3.c b/gcc/testsuite/gcc.target/i386/pr103393-3.c
new file mode 100644
index 00000000000..7a03160e512
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103393-3.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids" } */
+
+struct TestData {
+  float arr[8];
+};
+
+void
+cpy (struct TestData *s1, struct TestData *s2 )
+{
+  for(int i=0; i<16; ++i)
+    s1->arr[i] = s2->arr[i];
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
+/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr103393-4.c b/gcc/testsuite/gcc.target/i386/pr103393-4.c
new file mode 100644
index 00000000000..ae2599f6411
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103393-4.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake-avx512 -mstore-max=512" } */
+
+struct TestData {
+  float arr[8];
+};
+
+void
+cpy (struct TestData *s1, struct TestData *s2 )
+{
+  for(int i=0; i<16; ++i)
+    s1->arr[i] = s2->arr[i];
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
+/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr103393-5.c b/gcc/testsuite/gcc.target/i386/pr103393-5.c
new file mode 100644
index 00000000000..f9173684212
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103393-5.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=skylake-avx512 -mmove-max=512" } */
+
+struct TestData {
+  float arr[8];
+};
+
+void
+cpy (struct TestData *s1, struct TestData *s2 )
+{
+  for(int i=0; i<16; ++i)
+    s1->arr[i] = s2->arr[i];
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
+/* { dg-final { scan-assembler-not "\[\\t \]+_?memmove" } } */