x86: Add -mmove-max=bits and -mstore-max=bits

Message ID 20211125224744.4124623-1-hjl.tools@gmail.com
State New
Series x86: Add -mmove-max=bits and -mstore-max=bits

Commit Message

H.J. Lu Nov. 25, 2021, 10:47 p.m. UTC
Add -mmove-max=bits and -mstore-max=bits to enable 256-bit/512-bit moves
and stores independently of -mprefer-vector-width=bits:

1. Add X86_TUNE_AVX512_MOVE_BY_PIECES and X86_TUNE_AVX512_STORE_BY_PIECES,
which are enabled for Intel Sapphire Rapids processors.
2. Add -mmove-max=bits to set the maximum number of bits that can be moved
from memory to memory efficiently.  The default value is derived from
X86_TUNE_AVX512_MOVE_BY_PIECES, X86_TUNE_AVX256_MOVE_BY_PIECES, and the
preferred vector width.
3. Add -mstore-max=bits to set the maximum number of bits that can be stored
to memory efficiently.  The default value is derived from
X86_TUNE_AVX512_STORE_BY_PIECES, X86_TUNE_AVX256_STORE_BY_PIECES, and the
preferred vector width.
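
The default selection in items 2 and 3 can be sketched as a standalone
function.  This is a simplified model of the hunk the patch adds to
ix86_option_override_internal, read literally as posted, and not the GCC
code itself.  The names default_max, user_set, tune512, tune256 and
preferred are hypothetical; only the PVW_* enumerators come from the patch.

```c
#include <stdbool.h>

/* Same ordering the patch relies on when it compares with >=.  */
enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Hypothetical helper: default for -mmove-max= or -mstore-max=.
   user_set   value given on the command line, PVW_NONE if absent
   tune512    X86_TUNE_AVX512_{MOVE,STORE}_BY_PIECES is enabled
   tune256    X86_TUNE_AVX256_{MOVE,STORE}_BY_PIECES is enabled
   preferred  the preferred vector width
   avx512f    TARGET_AVX512F_P holds for the selected ISA  */
static enum pvw
default_max (enum pvw user_set, bool tune512, bool tune256,
             enum pvw preferred, bool avx512f)
{
  if (user_set != PVW_NONE)
    return user_set;            /* An explicit option is left alone.  */
  if (tune512)
    return PVW_AVX512;
  if (tune256)
    return PVW_AVX256;
  /* The posted hunk copies the preferred width here and then tests the
     user-set copy again.  That test is always true on this path, so in
     this model the ISA fallback below decides the result.  Whether that
     is intended is not stated in the thread.  */
  (void) preferred;
  return avx512f ? PVW_AVX512 : PVW_AVX128;
}
```

For example, default_max (PVW_NONE, false, true, PVW_NONE, true) yields
PVW_AVX256: the 256-bit tuning flag is consulted before the ISA.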

gcc/

	PR target/103269
	* config/i386/i386-expand.c (ix86_expand_builtin): Pass PVW_NONE
	and PVW_NONE to ix86_target_string.
	* config/i386/i386-options.c (ix86_target_string): Add arguments
	for move_max and store_max.
	(ix86_target_string::add_vector_width): New lambda.
	(ix86_debug_options): Pass ix86_move_max and ix86_store_max to
	ix86_target_string.
	(ix86_function_specific_print): Pass ptr->x_ix86_move_max and
	ptr->x_ix86_store_max to ix86_target_string.
	(ix86_valid_target_attribute_tree): Handle x_ix86_move_max and
	x_ix86_store_max.
	(ix86_option_override_internal): Set the default x_ix86_move_max
	and x_ix86_store_max.
	* config/i386/i386-options.h (ix86_target_string): Add arguments
	for move_max and store_max.
	* config/i386/i386.h (TARGET_AVX256_MOVE_BY_PIECES): Removed.
	(TARGET_AVX256_STORE_BY_PIECES): Likewise.
	(MOVE_MAX): Use 64 if ix86_move_max or ix86_store_max ==
	PVW_AVX512.  Use 32 if ix86_move_max or ix86_store_max >=
	PVW_AVX256.
	(STORE_MAX_PIECES): Use 64 if ix86_store_max == PVW_AVX512.
	Use 32 if ix86_store_max >= PVW_AVX256.
	* config/i386/i386.opt: Add -mmove-max=bits and -mstore-max=bits.
	* config/i386/x86-tune.def (X86_TUNE_AVX512_MOVE_BY_PIECES): New.
	(X86_TUNE_AVX512_STORE_BY_PIECES): Likewise.
	* doc/invoke.texi: Document -mmove-max=bits and -mstore-max=bits.
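
The MOVE_MAX and STORE_MAX_PIECES entries above can be modelled as two
small functions.  This is only a sketch of the 64-byte and 32-byte
branches that the patch changes.  The struct and function names are
hypothetical, the narrower SSE2/word-size branches are folded into one
assumed field (narrow_bytes), and TARGET_INTER_UNIT_MOVES_TO_VEC is
assumed to hold for STORE_MAX_PIECES.

```c
#include <stdbool.h>

enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Hypothetical stand-in for the target state the macros read.  */
struct target
{
  bool avx512f;         /* TARGET_AVX512F */
  bool avx;             /* TARGET_AVX */
  enum pvw move_max;    /* ix86_move_max */
  enum pvw store_max;   /* ix86_store_max */
  int narrow_bytes;     /* assumed result of the unchanged narrower branches */
};

/* Model of MOVE_MAX, in bytes.  */
static int
move_max_bytes (const struct target *t)
{
  if (t->avx512f
      && (t->move_max == PVW_AVX512 || t->store_max == PVW_AVX512))
    return 64;
  if (t->avx
      && (t->move_max >= PVW_AVX256 || t->store_max >= PVW_AVX256))
    return 32;
  return t->narrow_bytes;
}

/* Model of STORE_MAX_PIECES, in bytes; only ix86_store_max matters.  */
static int
store_max_pieces_bytes (const struct target *t)
{
  if (t->avx512f && t->store_max == PVW_AVX512)
    return 64;
  if (t->avx && t->store_max >= PVW_AVX256)
    return 32;
  return t->narrow_bytes;
}
```

In this model a target without AVX512F but with AVX gets 32 bytes even
when move_max is PVW_AVX512, which matches the %ymm expectations of the
tests that use -mtune=sapphirerapids -march=x86-64 -mavx2.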

gcc/testsuite/

	PR target/103269
	* gcc.target/i386/pieces-memcpy-17.c: New test.
	* gcc.target/i386/pieces-memcpy-18.c: Likewise.
	* gcc.target/i386/pieces-memcpy-19.c: Likewise.
	* gcc.target/i386/pieces-memcpy-20.c: Likewise.
	* gcc.target/i386/pieces-memcpy-21.c: Likewise.
	* gcc.target/i386/pieces-memset-45.c: Likewise.
	* gcc.target/i386/pieces-memset-46.c: Likewise.
	* gcc.target/i386/pieces-memset-47.c: Likewise.
	* gcc.target/i386/pieces-memset-48.c: Likewise.
	* gcc.target/i386/pieces-memset-49.c: Likewise.
---
 gcc/config/i386/i386-expand.c                 |  1 +
 gcc/config/i386/i386-options.c                | 75 +++++++++++++++++--
 gcc/config/i386/i386-options.h                |  6 +-
 gcc/config/i386/i386.h                        | 18 ++---
 gcc/config/i386/i386.opt                      |  8 ++
 gcc/config/i386/x86-tune.def                  | 10 +++
 gcc/doc/invoke.texi                           | 13 ++++
 .../gcc.target/i386/pieces-memcpy-17.c        | 16 ++++
 .../gcc.target/i386/pieces-memcpy-18.c        | 16 ++++
 .../gcc.target/i386/pieces-memcpy-19.c        | 16 ++++
 .../gcc.target/i386/pieces-memcpy-20.c        | 16 ++++
 .../gcc.target/i386/pieces-memcpy-21.c        | 16 ++++
 .../gcc.target/i386/pieces-memset-45.c        | 16 ++++
 .../gcc.target/i386/pieces-memset-46.c        | 17 +++++
 .../gcc.target/i386/pieces-memset-47.c        | 17 +++++
 .../gcc.target/i386/pieces-memset-48.c        | 17 +++++
 .../gcc.target/i386/pieces-memset-49.c        | 16 ++++
 17 files changed, 276 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-45.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-46.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-47.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-48.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-49.c
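
A minimal usage sketch, modelled on the new pieces-memcpy tests.  The
function foo has the same shape as in the tests; the driver
copy_is_correct is hypothetical and only checks that the copy is right
whichever piece width the compiler chooses.

```c
#include <string.h>

char *dst, *src;

/* Fixed 66-byte copy that the compiler may expand inline by pieces.  */
void
foo (void)
{
  __builtin_memcpy (dst, src, 66);
}

/* Hypothetical driver, not part of the patch.  Returns 1 when the
   66 bytes arrive intact.  */
int
copy_is_correct (void)
{
  static char from[66], to[66];
  for (int i = 0; i < 66; i++)
    from[i] = (char) i;
  src = from;
  dst = to;
  foo ();
  return memcmp (from, to, sizeof from) == 0;
}
```

With a compiler that includes this patch, building the file with the
options used by pieces-memcpy-17.c (-O2 -march=x86-64
-mprefer-vector-width=256 -mavx512f -mmove-max=512) plus -S should show
the %zmm moves that test scans for, while -mmove-max=128 -mstore-max=128
should keep the copy in %xmm registers.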
  

Comments

H.J. Lu Dec. 3, 2021, 1:24 p.m. UTC | #1
On Thu, Nov 25, 2021 at 2:47 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index 0d5d1a0e205..7e77ff56ddc 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -12295,6 +12295,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
>        char *opts = ix86_target_string (bisa, bisa2, 0, 0, NULL, NULL,
>                                        (enum fpmath_unit) 0,
>                                        (enum prefer_vector_width) 0,
> +                                      PVW_NONE, PVW_NONE,
>                                        false, add_abi_p);
>        if (!opts)
>         error ("%qE needs unknown isa option", fndecl);
> diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> index a4da8331b8b..77712a07aef 100644
> --- a/gcc/config/i386/i386-options.c
> +++ b/gcc/config/i386/i386-options.c
> @@ -364,6 +364,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
>                     const char *arch, const char *tune,
>                     enum fpmath_unit fpmath,
>                     enum prefer_vector_width pvw,
> +                   enum prefer_vector_width move_max,
> +                   enum prefer_vector_width store_max,
>                     bool add_nl_p, bool add_abi_p)
>  {
>    /* Flag options.  */
> @@ -542,10 +544,10 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
>         }
>      }
>
> -  /* Add -mprefer-vector-width= option.  */
> -  if (pvw)
> +  auto add_vector_width = [&opts, &num] (prefer_vector_width pvw,
> +                                        const char *cmd)
>      {
> -      opts[num][0] = "-mprefer-vector-width=";
> +      opts[num][0] = cmd;
>        switch ((int) pvw)
>         {
>         case PVW_AVX128:
> @@ -563,7 +565,19 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
>         default:
>           gcc_unreachable ();
>         }
> -    }
> +    };
> +
> +  /* Add -mprefer-vector-width= option.  */
> +  if (pvw)
> +    add_vector_width (pvw, "-mprefer-vector-width=");
> +
> +  /* Add -mmove-max= option.  */
> +  if (move_max)
> +    add_vector_width (move_max, "-mmove-max=");
> +
> +  /* Add -mstore-max= option.  */
> +  if (store_max)
> +    add_vector_width (store_max, "-mstore-max=");
>
>    /* Any options?  */
>    if (num == 0)
> @@ -630,6 +644,7 @@ ix86_debug_options (void)
>                                    target_flags, ix86_target_flags,
>                                    ix86_arch_string, ix86_tune_string,
>                                    ix86_fpmath, prefer_vector_width_type,
> +                                  ix86_move_max, ix86_store_max,
>                                    true, true);
>
>    if (opts)
> @@ -892,7 +907,9 @@ ix86_function_specific_print (FILE *file, int indent,
>      = ix86_target_string (ptr->x_ix86_isa_flags, ptr->x_ix86_isa_flags2,
>                           ptr->x_target_flags, ptr->x_ix86_target_flags,
>                           NULL, NULL, ptr->x_ix86_fpmath,
> -                         ptr->x_prefer_vector_width_type, false, true);
> +                         ptr->x_prefer_vector_width_type,
> +                         ptr->x_ix86_move_max, ptr->x_ix86_store_max,
> +                         false, true);
>
>    gcc_assert (ptr->arch < PROCESSOR_max);
>    fprintf (file, "%*sarch = %d (%s)\n",
> @@ -1318,6 +1335,10 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
>    const char *orig_tune_string = opts->x_ix86_tune_string;
>    enum fpmath_unit orig_fpmath_set = opts_set->x_ix86_fpmath;
>    enum prefer_vector_width orig_pvw_set = opts_set->x_prefer_vector_width_type;
> +  enum prefer_vector_width orig_ix86_move_max_set
> +    = opts_set->x_ix86_move_max;
> +  enum prefer_vector_width orig_ix86_store_max_set
> +    = opts_set->x_ix86_store_max;
>    int orig_tune_defaulted = ix86_tune_defaulted;
>    int orig_arch_specified = ix86_arch_specified;
>    char *option_strings[IX86_FUNCTION_SPECIFIC_MAX] = { NULL, NULL };
> @@ -1393,6 +1414,8 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
>        opts->x_ix86_tune_string = orig_tune_string;
>        opts_set->x_ix86_fpmath = orig_fpmath_set;
>        opts_set->x_prefer_vector_width_type = orig_pvw_set;
> +      opts_set->x_ix86_move_max = orig_ix86_move_max_set;
> +      opts_set->x_ix86_store_max = orig_ix86_store_max_set;
>        opts->x_ix86_excess_precision = orig_ix86_excess_precision;
>        opts->x_ix86_unsafe_math_optimizations
>         = orig_ix86_unsafe_math_optimizations;
> @@ -2667,6 +2690,48 @@ ix86_option_override_internal (bool main_args_p,
>        && (opts_set->x_prefer_vector_width_type == PVW_NONE))
>      opts->x_prefer_vector_width_type = PVW_AVX256;
>
> +  if (opts_set->x_ix86_move_max == PVW_NONE)
> +    {
> +      /* Set the maximum number of bits can be moved from memory to
> +        memory efficiently.  */
> +      if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
> +       opts->x_ix86_move_max = PVW_AVX512;
> +      else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
> +       opts->x_ix86_move_max = PVW_AVX256;
> +      else
> +       {
> +         opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
> +         if (opts_set->x_ix86_move_max == PVW_NONE)
> +           {
> +             if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> +               opts->x_ix86_move_max = PVW_AVX512;
> +             else
> +               opts->x_ix86_move_max = PVW_AVX128;
> +           }
> +       }
> +    }
> +
> +  if (opts_set->x_ix86_store_max == PVW_NONE)
> +    {
> +      /* Set the maximum number of bits can be stored to memory
> +        efficiently.  */
> +      if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
> +       opts->x_ix86_store_max = PVW_AVX512;
> +      else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
> +       opts->x_ix86_store_max = PVW_AVX256;
> +      else
> +       {
> +         opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
> +         if (opts_set->x_ix86_store_max == PVW_NONE)
> +           {
> +             if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> +               opts->x_ix86_store_max = PVW_AVX512;
> +             else
> +               opts->x_ix86_store_max = PVW_AVX128;
> +           }
> +       }
> +    }
> +
>    if (opts->x_ix86_recip_name)
>      {
>        char *p = ASTRDUP (opts->x_ix86_recip_name);
> diff --git a/gcc/config/i386/i386-options.h b/gcc/config/i386/i386-options.h
> index cdaca2644f4..e218e24d15b 100644
> --- a/gcc/config/i386/i386-options.h
> +++ b/gcc/config/i386/i386-options.h
> @@ -26,8 +26,10 @@ char *ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
>                           int flags, int flags2,
>                           const char *arch, const char *tune,
>                           enum fpmath_unit fpmath,
> -                         enum prefer_vector_width pvw, bool add_nl_p,
> -                         bool add_abi_p);
> +                         enum prefer_vector_width pvw,
> +                         enum prefer_vector_width move_max,
> +                         enum prefer_vector_width store_max,
> +                         bool add_nl_p, bool add_abi_p);
>
>  extern enum attr_cpu ix86_schedule;
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 2fda1e0686e..4f70085d793 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -408,10 +408,6 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
>         ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
>  #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
>         ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
> -#define TARGET_AVX256_MOVE_BY_PIECES \
> -       ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES]
> -#define TARGET_AVX256_STORE_BY_PIECES \
> -       ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES]
>  #define TARGET_AVX256_SPLIT_REGS \
>         ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
>  #define TARGET_GENERAL_REGS_SSE_SPILL \
> @@ -1807,12 +1803,13 @@ typedef struct ix86_args {
>     MOVE_MAX_PIECES defaults to MOVE_MAX.  */
>
>  #define MOVE_MAX \
> -  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> +  ((TARGET_AVX512F \
> +    && (ix86_move_max == PVW_AVX512 \
> +       || ix86_store_max == PVW_AVX512)) \
>     ? 64 \
>     : ((TARGET_AVX \
> -       && !TARGET_PREFER_AVX128 \
> -       && (TARGET_AVX256_MOVE_BY_PIECES \
> -          || TARGET_AVX256_STORE_BY_PIECES)) \
> +       && (ix86_move_max >= PVW_AVX256 \
> +          || ix86_store_max >= PVW_AVX256)) \
>        ? 32 \
>        : ((TARGET_SSE2 \
>           && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> @@ -1825,11 +1822,10 @@ typedef struct ix86_args {
>     store_by_pieces of 16/32/64 bytes.  */
>  #define STORE_MAX_PIECES \
>    (TARGET_INTER_UNIT_MOVES_TO_VEC \
> -   ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> +   ? ((TARGET_AVX512F && ix86_store_max == PVW_AVX512) \
>        ? 64 \
>        : ((TARGET_AVX \
> -         && !TARGET_PREFER_AVX128 \
> -         && TARGET_AVX256_STORE_BY_PIECES) \
> +         && ix86_store_max >= PVW_AVX256) \
>           ? 32 \
>           : ((TARGET_SSE2 \
>               && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index 3e67c537bb7..620dab6b672 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -624,6 +624,14 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
>  EnumValue
>  Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
>
> +mmove-max=
> +Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> +Maximum number of bits can be moved from memory to memory efficiently.
> +
> +mstore-max=
> +Target RejectNegative Joined Var(ix86_store_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> +Maximum number of bits can be stored to memory efficiently.
> +
>  ;; ISA support
>
>  m32
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 4ae0b569841..26981f657af 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -512,6 +512,16 @@ DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
>  DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
>           m_CORE_AVX512)
>
> +/* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
> +   AVX instructions.  */
> +DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
> +         m_SAPPHIRERAPIDS)
> +
> +/* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit
> +   AVX instructions.  */
> +DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> +         m_SAPPHIRERAPIDS)
> +
>  /*****************************************************************************/
>  /*****************************************************************************/
>  /* Historical relics: tuning flags that helps a specific old CPU designs     */
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 3bddfbaae6a..3412b9ede44 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -1393,6 +1393,7 @@ See RS/6000 and PowerPC Options.
>  -mcld  -mcx16  -msahf  -mmovbe  -mcrc32 -mmwait @gol
>  -mrecip  -mrecip=@var{opt} @gol
>  -mvzeroupper  -mprefer-avx128  -mprefer-vector-width=@var{opt} @gol
> +-mmove-max=@var{bits} -mstore-max=@var{bits} @gol
>  -mmmx  -msse  -msse2  -msse3  -mssse3  -msse4.1  -msse4.2  -msse4  -mavx @gol
>  -mavx2  -mavx512f  -mavx512pf  -mavx512er  -mavx512cd  -mavx512vl @gol
>  -mavx512bw  -mavx512dq  -mavx512ifma  -mavx512vbmi  -msha  -maes @gol
> @@ -31848,6 +31849,18 @@ This option instructs GCC to use 128-bit AVX instructions instead of
>  This option instructs GCC to use @var{opt}-bit vector width in instructions
>  instead of default on the selected platform.
>
> +@item -mmove-max=@var{bits}
> +@opindex mmove-max
> +This option instructs GCC to set the maximum number of bits can be
> +moved from memory to memory efficiently to @var{bits}.  The valid
> +@var{bits} are 128, 256 and 512.
> +
> +@item -mstore-max=@var{bits}
> +@opindex mstore-max
> +This option instructs GCC to set the maximum number of bits can be
> +stored to memory efficiently to @var{bits}.  The valid @var{bits} are
> +128, 256 and 512.
> +
>  @table @samp
>  @item none
>  No extra limitations applied to GCC other than defined by the selected platform.
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> new file mode 100644
> index 00000000000..28ab7a6d41c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mmove-max=512" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> new file mode 100644
> index 00000000000..b15a0db9ff0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> new file mode 100644
> index 00000000000..a5b5b617578
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=128 -mstore-max=128" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 8 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> new file mode 100644
> index 00000000000..1feff48c5b2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=256 -mstore-max=256" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> new file mode 100644
> index 00000000000..ef439f20f74
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> new file mode 100644
> index 00000000000..70c80e5064b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> new file mode 100644
> index 00000000000..ab7894aa2e6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu8\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-47.c b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> new file mode 100644
> index 00000000000..8f2c254ad03
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=128" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-48.c b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> new file mode 100644
> index 00000000000..9a7da962183
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=256" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> new file mode 100644
> index 00000000000..ad43f89a9bd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> +  __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> --
> 2.33.1
>
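
The scan-assembler-times counts in the tests above are consistent with
simple arithmetic on the 66-byte size: one full-width piece per
width-sized chunk, counted twice for memcpy (a load and a store) and once
for memset (a store).  The helper below is a hypothetical illustration of
that arithmetic only; it ignores the 2-byte tail and is not a description
of GCC's by-pieces algorithm.

```c
/* Number of full-width vector instructions expected when SIZE bytes are
   expanded inline with WIDTH_BITS-wide pieces.  OPS_PER_PIECE is 2 for
   memcpy (load plus store) and 1 for memset (store only).  Tail bytes
   that do not fill a whole piece are not counted.  */
static unsigned
full_width_ops (unsigned size, unsigned width_bits, unsigned ops_per_piece)
{
  unsigned width_bytes = width_bits / 8;
  return (size / width_bytes) * ops_per_piece;
}
```

For the 66-byte memcpy this gives 2, 4 and 8 instructions at 512, 256 and
128 bits; for the 66-byte memset it gives 1, 2 and 4, the same numbers the
dg-final directives expect.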

PING.
  
Uros Bizjak Dec. 3, 2021, 4:55 p.m. UTC | #2
On Fri, Dec 3, 2021 at 2:24 PM H.J. Lu <hjl.tools@gmail.com> wrote:

LGTM with two grammar fixes below.

Thanks,
Uros.

> > ---
> >  gcc/config/i386/i386-expand.c                 |  1 +
> >  gcc/config/i386/i386-options.c                | 75 +++++++++++++++++--
> >  gcc/config/i386/i386-options.h                |  6 +-
> >  gcc/config/i386/i386.h                        | 18 ++---
> >  gcc/config/i386/i386.opt                      |  8 ++
> >  gcc/config/i386/x86-tune.def                  | 10 +++
> >  gcc/doc/invoke.texi                           | 13 ++++
> >  .../gcc.target/i386/pieces-memcpy-17.c        | 16 ++++
> >  .../gcc.target/i386/pieces-memcpy-18.c        | 16 ++++
> >  .../gcc.target/i386/pieces-memcpy-19.c        | 16 ++++
> >  .../gcc.target/i386/pieces-memcpy-20.c        | 16 ++++
> >  .../gcc.target/i386/pieces-memcpy-21.c        | 16 ++++
> >  .../gcc.target/i386/pieces-memset-45.c        | 16 ++++
> >  .../gcc.target/i386/pieces-memset-46.c        | 17 +++++
> >  .../gcc.target/i386/pieces-memset-47.c        | 17 +++++
> >  .../gcc.target/i386/pieces-memset-48.c        | 17 +++++
> >  .../gcc.target/i386/pieces-memset-49.c        | 16 ++++
> >  17 files changed, 276 insertions(+), 18 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> >
> > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > index 0d5d1a0e205..7e77ff56ddc 100644
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -12295,6 +12295,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
> >        char *opts = ix86_target_string (bisa, bisa2, 0, 0, NULL, NULL,
> >                                        (enum fpmath_unit) 0,
> >                                        (enum prefer_vector_width) 0,
> > +                                      PVW_NONE, PVW_NONE,
> >                                        false, add_abi_p);
> >        if (!opts)
> >         error ("%qE needs unknown isa option", fndecl);
> > diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> > index a4da8331b8b..77712a07aef 100644
> > --- a/gcc/config/i386/i386-options.c
> > +++ b/gcc/config/i386/i386-options.c
> > @@ -364,6 +364,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> >                     const char *arch, const char *tune,
> >                     enum fpmath_unit fpmath,
> >                     enum prefer_vector_width pvw,
> > +                   enum prefer_vector_width move_max,
> > +                   enum prefer_vector_width store_max,
> >                     bool add_nl_p, bool add_abi_p)
> >  {
> >    /* Flag options.  */
> > @@ -542,10 +544,10 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> >         }
> >      }
> >
> > -  /* Add -mprefer-vector-width= option.  */
> > -  if (pvw)
> > +  auto add_vector_width = [&opts, &num] (prefer_vector_width pvw,
> > +                                        const char *cmd)
> >      {
> > -      opts[num][0] = "-mprefer-vector-width=";
> > +      opts[num][0] = cmd;
> >        switch ((int) pvw)
> >         {
> >         case PVW_AVX128:
> > @@ -563,7 +565,19 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> >         default:
> >           gcc_unreachable ();
> >         }
> > -    }
> > +    };
> > +
> > +  /* Add -mprefer-vector-width= option.  */
> > +  if (pvw)
> > +    add_vector_width (pvw, "-mprefer-vector-width=");
> > +
> > +  /* Add -mmove-max= option.  */
> > +  if (move_max)
> > +    add_vector_width (move_max, "-mmove-max=");
> > +
> > +  /* Add -mstore-max= option.  */
> > +  if (store_max)
> > +    add_vector_width (store_max, "-mstore-max=");
> >
> >    /* Any options?  */
> >    if (num == 0)
> > @@ -630,6 +644,7 @@ ix86_debug_options (void)
> >                                    target_flags, ix86_target_flags,
> >                                    ix86_arch_string, ix86_tune_string,
> >                                    ix86_fpmath, prefer_vector_width_type,
> > +                                  ix86_move_max, ix86_store_max,
> >                                    true, true);
> >
> >    if (opts)
> > @@ -892,7 +907,9 @@ ix86_function_specific_print (FILE *file, int indent,
> >      = ix86_target_string (ptr->x_ix86_isa_flags, ptr->x_ix86_isa_flags2,
> >                           ptr->x_target_flags, ptr->x_ix86_target_flags,
> >                           NULL, NULL, ptr->x_ix86_fpmath,
> > -                         ptr->x_prefer_vector_width_type, false, true);
> > +                         ptr->x_prefer_vector_width_type,
> > +                         ptr->x_ix86_move_max, ptr->x_ix86_store_max,
> > +                         false, true);
> >
> >    gcc_assert (ptr->arch < PROCESSOR_max);
> >    fprintf (file, "%*sarch = %d (%s)\n",
> > @@ -1318,6 +1335,10 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> >    const char *orig_tune_string = opts->x_ix86_tune_string;
> >    enum fpmath_unit orig_fpmath_set = opts_set->x_ix86_fpmath;
> >    enum prefer_vector_width orig_pvw_set = opts_set->x_prefer_vector_width_type;
> > +  enum prefer_vector_width orig_ix86_move_max_set
> > +    = opts_set->x_ix86_move_max;
> > +  enum prefer_vector_width orig_ix86_store_max_set
> > +    = opts_set->x_ix86_store_max;
> >    int orig_tune_defaulted = ix86_tune_defaulted;
> >    int orig_arch_specified = ix86_arch_specified;
> >    char *option_strings[IX86_FUNCTION_SPECIFIC_MAX] = { NULL, NULL };
> > @@ -1393,6 +1414,8 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> >        opts->x_ix86_tune_string = orig_tune_string;
> >        opts_set->x_ix86_fpmath = orig_fpmath_set;
> >        opts_set->x_prefer_vector_width_type = orig_pvw_set;
> > +      opts_set->x_ix86_move_max = orig_ix86_move_max_set;
> > +      opts_set->x_ix86_store_max = orig_ix86_store_max_set;
> >        opts->x_ix86_excess_precision = orig_ix86_excess_precision;
> >        opts->x_ix86_unsafe_math_optimizations
> >         = orig_ix86_unsafe_math_optimizations;
> > @@ -2667,6 +2690,48 @@ ix86_option_override_internal (bool main_args_p,
> >        && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> >      opts->x_prefer_vector_width_type = PVW_AVX256;
> >
> > +  if (opts_set->x_ix86_move_max == PVW_NONE)
> > +    {
> > +      /* Set the maximum number of bits can be moved from memory to
> > +        memory efficiently.  */
> > +      if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
> > +       opts->x_ix86_move_max = PVW_AVX512;
> > +      else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
> > +       opts->x_ix86_move_max = PVW_AVX256;
> > +      else
> > +       {
> > +         opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
> > +         if (opts_set->x_ix86_move_max == PVW_NONE)
> > +           {
> > +             if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> > +               opts->x_ix86_move_max = PVW_AVX512;
> > +             else
> > +               opts->x_ix86_move_max = PVW_AVX128;
> > +           }
> > +       }
> > +    }
> > +
> > +  if (opts_set->x_ix86_store_max == PVW_NONE)
> > +    {
> > +      /* Set the maximum number of bits can be stored to memory
> > +        efficiently.  */
> > +      if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
> > +       opts->x_ix86_store_max = PVW_AVX512;
> > +      else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
> > +       opts->x_ix86_store_max = PVW_AVX256;
> > +      else
> > +       {
> > +         opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
> > +         if (opts_set->x_ix86_store_max == PVW_NONE)
> > +           {
> > +             if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> > +               opts->x_ix86_store_max = PVW_AVX512;
> > +             else
> > +               opts->x_ix86_store_max = PVW_AVX128;
> > +           }
> > +       }
> > +    }
> > +
> >    if (opts->x_ix86_recip_name)
> >      {
> >        char *p = ASTRDUP (opts->x_ix86_recip_name);
> > diff --git a/gcc/config/i386/i386-options.h b/gcc/config/i386/i386-options.h
> > index cdaca2644f4..e218e24d15b 100644
> > --- a/gcc/config/i386/i386-options.h
> > +++ b/gcc/config/i386/i386-options.h
> > @@ -26,8 +26,10 @@ char *ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> >                           int flags, int flags2,
> >                           const char *arch, const char *tune,
> >                           enum fpmath_unit fpmath,
> > -                         enum prefer_vector_width pvw, bool add_nl_p,
> > -                         bool add_abi_p);
> > +                         enum prefer_vector_width pvw,
> > +                         enum prefer_vector_width move_max,
> > +                         enum prefer_vector_width store_max,
> > +                         bool add_nl_p, bool add_abi_p);
> >
> >  extern enum attr_cpu ix86_schedule;
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index 2fda1e0686e..4f70085d793 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -408,10 +408,6 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> >         ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
> >  #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
> >         ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
> > -#define TARGET_AVX256_MOVE_BY_PIECES \
> > -       ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES]
> > -#define TARGET_AVX256_STORE_BY_PIECES \
> > -       ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES]
> >  #define TARGET_AVX256_SPLIT_REGS \
> >         ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
> >  #define TARGET_GENERAL_REGS_SSE_SPILL \
> > @@ -1807,12 +1803,13 @@ typedef struct ix86_args {
> >     MOVE_MAX_PIECES defaults to MOVE_MAX.  */
> >
> >  #define MOVE_MAX \
> > -  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > +  ((TARGET_AVX512F \
> > +    && (ix86_move_max == PVW_AVX512 \
> > +       || ix86_store_max == PVW_AVX512)) \
> >     ? 64 \
> >     : ((TARGET_AVX \
> > -       && !TARGET_PREFER_AVX128 \
> > -       && (TARGET_AVX256_MOVE_BY_PIECES \
> > -          || TARGET_AVX256_STORE_BY_PIECES)) \
> > +       && (ix86_move_max >= PVW_AVX256 \
> > +          || ix86_store_max >= PVW_AVX256)) \
> >        ? 32 \
> >        : ((TARGET_SSE2 \
> >           && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> > @@ -1825,11 +1822,10 @@ typedef struct ix86_args {
> >     store_by_pieces of 16/32/64 bytes.  */
> >  #define STORE_MAX_PIECES \
> >    (TARGET_INTER_UNIT_MOVES_TO_VEC \
> > -   ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > +   ? ((TARGET_AVX512F && ix86_store_max == PVW_AVX512) \
> >        ? 64 \
> >        : ((TARGET_AVX \
> > -         && !TARGET_PREFER_AVX128 \
> > -         && TARGET_AVX256_STORE_BY_PIECES) \
> > +         && ix86_store_max >= PVW_AVX256) \
> >           ? 32 \
> >           : ((TARGET_SSE2 \
> >               && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > index 3e67c537bb7..620dab6b672 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -624,6 +624,14 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
> >  EnumValue
> >  Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
> >
> > +mmove-max=
> > +Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> > +Maximum number of bits can be moved from memory to memory efficiently.

... number of bits THAT can be ...

> > +
> > +mstore-max=
> > +Target RejectNegative Joined Var(ix86_store_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> > +Maximum number of bits can be stored to memory efficiently.

... number of bits THAT can be ...

> > +
> >  ;; ISA support
> >
> >  m32
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 4ae0b569841..26981f657af 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -512,6 +512,16 @@ DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
> >  DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
> >           m_CORE_AVX512)
> >
> > +/* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
> > +   AVX instructions.  */
> > +DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
> > +         m_SAPPHIRERAPIDS)
> > +
> > +/* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit
> > +   AVX instructions.  */
> > +DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> > +         m_SAPPHIRERAPIDS)
> > +
> >  /*****************************************************************************/
> >  /*****************************************************************************/
> >  /* Historical relics: tuning flags that helps a specific old CPU designs     */
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 3bddfbaae6a..3412b9ede44 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -1393,6 +1393,7 @@ See RS/6000 and PowerPC Options.
> >  -mcld  -mcx16  -msahf  -mmovbe  -mcrc32 -mmwait @gol
> >  -mrecip  -mrecip=@var{opt} @gol
> >  -mvzeroupper  -mprefer-avx128  -mprefer-vector-width=@var{opt} @gol
> > +-mmove-max=@var{bits} -mstore-max=@var{bits} @gol
> >  -mmmx  -msse  -msse2  -msse3  -mssse3  -msse4.1  -msse4.2  -msse4  -mavx @gol
> >  -mavx2  -mavx512f  -mavx512pf  -mavx512er  -mavx512cd  -mavx512vl @gol
> >  -mavx512bw  -mavx512dq  -mavx512ifma  -mavx512vbmi  -msha  -maes @gol
> > @@ -31848,6 +31849,18 @@ This option instructs GCC to use 128-bit AVX instructions instead of
> >  This option instructs GCC to use @var{opt}-bit vector width in instructions
> >  instead of default on the selected platform.
> >
> > +@item -mmove-max=@var{bits}
> > +@opindex mmove-max
> > +This option instructs GCC to set the maximum number of bits can be
> > +moved from memory to memory efficiently to @var{bits}.  The valid
> > +@var{bits} are 128, 256 and 512.
> > +
> > +@item -mstore-max=@var{bits}
> > +@opindex mstore-max
> > +This option instructs GCC to set the maximum number of bits can be
> > +stored to memory efficiently to @var{bits}.  The valid @var{bits} are
> > +128, 256 and 512.
> > +
> >  @table @samp
> >  @item none
> >  No extra limitations applied to GCC other than defined by the selected platform.
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > new file mode 100644
> > index 00000000000..28ab7a6d41c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mmove-max=512" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > new file mode 100644
> > index 00000000000..b15a0db9ff0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > new file mode 100644
> > index 00000000000..a5b5b617578
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=128 -mstore-max=128" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 8 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > new file mode 100644
> > index 00000000000..1feff48c5b2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=256 -mstore-max=256" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > new file mode 100644
> > index 00000000000..ef439f20f74
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > new file mode 100644
> > index 00000000000..70c80e5064b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > new file mode 100644
> > index 00000000000..ab7894aa2e6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu8\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-47.c b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > new file mode 100644
> > index 00000000000..8f2c254ad03
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=128" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-48.c b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > new file mode 100644
> > index 00000000000..9a7da962183
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=256" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > new file mode 100644
> > index 00000000000..ad43f89a9bd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > +  __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> > +/* No need to dynamically realign the stack here.  */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer.  */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > --
> > 2.33.1
> >
>
> PING.
>
>
> --
> H.J.
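
For readers following the i386.h hunk, the first two arms of the patched
MOVE_MAX and STORE_MAX_PIECES can be sketched as plain C functions.  This is
only an illustrative model, not GCC code: the helper names are hypothetical,
the enum ordering (NONE < 128 < 256 < 512) is assumed from the >= comparisons
in the patch, the TARGET_INTER_UNIT_MOVES_TO_VEC guard is left out, and the
narrower arms that the hunk does not show are represented by a return value
of 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ordering, so that the >= tests behave as in the patch.  */
enum prefer_vector_width { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Hypothetical model of the first two arms of the patched MOVE_MAX.
   Returns the size in bytes, or 0 when neither arm applies (the
   remaining arms are not modeled here).  */
static int
model_move_max (bool avx512f, bool avx,
                enum prefer_vector_width move_max,
                enum prefer_vector_width store_max)
{
  if (avx512f && (move_max == PVW_AVX512 || store_max == PVW_AVX512))
    return 64;
  if (avx && (move_max >= PVW_AVX256 || store_max >= PVW_AVX256))
    return 32;
  return 0;
}

/* Hypothetical model of the same two arms of STORE_MAX_PIECES, which
   looks only at the store limit.  */
static int
model_store_max_pieces (bool avx512f, bool avx,
                        enum prefer_vector_width store_max)
{
  if (avx512f && store_max == PVW_AVX512)
    return 64;
  if (avx && store_max >= PVW_AVX256)
    return 32;
  return 0;
}

/* A few sanity checks on the model.  */
static void
model_self_check (void)
{
  /* A 512-bit limit without AVX512F falls through to the 32-byte arm.  */
  assert (model_move_max (false, true, PVW_AVX512, PVW_AVX512) == 32);
  /* Either limit being 512 is enough for the 64-byte arm.  */
  assert (model_move_max (true, true, PVW_AVX512, PVW_AVX128) == 64);
  /* 128-bit limits reach neither arm.  */
  assert (model_move_max (true, true, PVW_AVX128, PVW_AVX128) == 0);
  assert (model_store_max_pieces (true, true, PVW_AVX256) == 32);
}
```

The first check appears consistent with pieces-memcpy-21.c, which tunes for
Sapphire Rapids with only -mavx2 enabled and expects %ymm moves.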
  
H.J. Lu Dec. 3, 2021, 5:36 p.m. UTC | #3
On Fri, Dec 3, 2021 at 8:55 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Fri, Dec 3, 2021 at 2:24 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 2:47 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > Add -mmove-max=bits and -mstore-max=bits to enable 256-bit/512-bit move
> > > and store, independent of -mprefer-vector-width=bits:
> > >
> > > > 1. Add X86_TUNE_AVX512_MOVE_BY_PIECES and X86_TUNE_AVX512_STORE_BY_PIECES,
> > > > which are enabled for Intel Sapphire Rapids processors.
> > > > 2. Add -mmove-max=bits to set the maximum number of bits that can be moved
> > > > from memory to memory efficiently.  The default value is derived from
> > > > X86_TUNE_AVX512_MOVE_BY_PIECES, X86_TUNE_AVX256_MOVE_BY_PIECES, and the
> > > > preferred vector width.
> > > > 3. Add -mstore-max=bits to set the maximum number of bits that can be
> > > > stored to memory efficiently.  The default value is derived from
> > > > X86_TUNE_AVX512_STORE_BY_PIECES, X86_TUNE_AVX256_STORE_BY_PIECES, and the
> > > > preferred vector width.
> > >
> > > gcc/
> > >
> > >         PR target/103269
> > >         * config/i386/i386-expand.c (ix86_expand_builtin): Pass PVW_NONE
> > >         and PVW_NONE to ix86_target_string.
> > >         * config/i386/i386-options.c (ix86_target_string): Add arguments
> > >         for move_max and store_max.
> > >         (ix86_target_string::add_vector_width): New lambda.
> > >         (ix86_debug_options): Pass ix86_move_max and ix86_store_max to
> > >         ix86_target_string.
> > >         (ix86_function_specific_print): Pass ptr->x_ix86_move_max and
> > >         ptr->x_ix86_store_max to ix86_target_string.
> > >         (ix86_valid_target_attribute_tree): Handle x_ix86_move_max and
> > >         x_ix86_store_max.
> > >         (ix86_option_override_internal): Set the default x_ix86_move_max
> > >         and x_ix86_store_max.
> > > >         * config/i386/i386-options.h (ix86_target_string): Add
> > > >         arguments for move_max and store_max.
> > >         * config/i386/i386.h (TARGET_AVX256_MOVE_BY_PIECES): Removed.
> > >         (TARGET_AVX256_STORE_BY_PIECES): Likewise.
> > >         (MOVE_MAX): Use 64 if ix86_move_max or ix86_store_max ==
> > >         PVW_AVX512.  Use 32 if ix86_move_max or ix86_store_max >=
> > >         PVW_AVX256.
> > >         (STORE_MAX_PIECES): Use 64 if ix86_store_max == PVW_AVX512.
> > >         Use 32 if ix86_store_max >= PVW_AVX256.
> > >         * config/i386/i386.opt: Add -mmove-max=bits and -mstore-max=bits.
> > >         * config/i386/x86-tune.def (X86_TUNE_AVX512_MOVE_BY_PIECES): New.
> > >         (X86_TUNE_AVX512_STORE_BY_PIECES): Likewise.
> > >         * doc/invoke.texi: Document -mmove-max=bits and -mstore-max=bits.
> > >
> > > gcc/testsuite/
> > >
> > >         PR target/103269
> > >         * gcc.target/i386/pieces-memcpy-17.c: New test.
> > >         * gcc.target/i386/pieces-memcpy-18.c: Likewise.
> > >         * gcc.target/i386/pieces-memcpy-19.c: Likewise.
> > >         * gcc.target/i386/pieces-memcpy-20.c: Likewise.
> > >         * gcc.target/i386/pieces-memcpy-21.c: Likewise.
> > >         * gcc.target/i386/pieces-memset-45.c: Likewise.
> > >         * gcc.target/i386/pieces-memset-46.c: Likewise.
> > >         * gcc.target/i386/pieces-memset-47.c: Likewise.
> > >         * gcc.target/i386/pieces-memset-48.c: Likewise.
> > >         * gcc.target/i386/pieces-memset-49.c: Likewise.
>
> LGTM with two grammar fixes below.

Fixed.

> Thanks,
> Uros.

This is the patch I am checking in.

Thanks.
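
As a rough illustration of how the defaults are chosen when -mmove-max= or
-mstore-max= is not given, the priority of the two tuning flags in the
ix86_option_override_internal hunk can be sketched as below.  This is a
simplified model with hypothetical names, not the GCC implementation: the
enum ordering is assumed, and the final fallback arm (preferred vector width
and ISA) is deliberately not modeled and is represented by PVW_NONE.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ordering of the option values.  */
enum prefer_vector_width { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Hypothetical model of the default selection for either option.
   USER_VALUE is what was given on the command line (PVW_NONE if the
   option was absent); the two flags stand for the AVX512 and AVX256
   *_BY_PIECES tuning bits.  */
static enum prefer_vector_width
model_default_max (enum prefer_vector_width user_value,
                   bool tune_avx512_by_pieces,
                   bool tune_avx256_by_pieces)
{
  /* An explicit option always wins; no default is computed.  */
  if (user_value != PVW_NONE)
    return user_value;
  /* The 512-bit tuning flag is checked before the 256-bit one.  */
  if (tune_avx512_by_pieces)
    return PVW_AVX512;
  if (tune_avx256_by_pieces)
    return PVW_AVX256;
  /* Fallback arm of the patch is not modeled in this sketch.  */
  return PVW_NONE;
}

/* A few sanity checks on the model.  */
static void
model_default_self_check (void)
{
  assert (model_default_max (PVW_NONE, true, true) == PVW_AVX512);
  assert (model_default_max (PVW_NONE, false, true) == PVW_AVX256);
  assert (model_default_max (PVW_AVX128, true, true) == PVW_AVX128);
}
```

The last check mirrors the intent of tests such as pieces-memcpy-19.c, where
an explicit 128-bit limit is given on top of -march=sapphirerapids.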

> > > ---
> > >  gcc/config/i386/i386-expand.c                 |  1 +
> > >  gcc/config/i386/i386-options.c                | 75 +++++++++++++++++--
> > >  gcc/config/i386/i386-options.h                |  6 +-
> > >  gcc/config/i386/i386.h                        | 18 ++---
> > >  gcc/config/i386/i386.opt                      |  8 ++
> > >  gcc/config/i386/x86-tune.def                  | 10 +++
> > >  gcc/doc/invoke.texi                           | 13 ++++
> > >  .../gcc.target/i386/pieces-memcpy-17.c        | 16 ++++
> > >  .../gcc.target/i386/pieces-memcpy-18.c        | 16 ++++
> > >  .../gcc.target/i386/pieces-memcpy-19.c        | 16 ++++
> > >  .../gcc.target/i386/pieces-memcpy-20.c        | 16 ++++
> > >  .../gcc.target/i386/pieces-memcpy-21.c        | 16 ++++
> > >  .../gcc.target/i386/pieces-memset-45.c        | 16 ++++
> > >  .../gcc.target/i386/pieces-memset-46.c        | 17 +++++
> > >  .../gcc.target/i386/pieces-memset-47.c        | 17 +++++
> > >  .../gcc.target/i386/pieces-memset-48.c        | 17 +++++
> > >  .../gcc.target/i386/pieces-memset-49.c        | 16 ++++
> > >  17 files changed, 276 insertions(+), 18 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > >
> > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > > index 0d5d1a0e205..7e77ff56ddc 100644
> > > --- a/gcc/config/i386/i386-expand.c
> > > +++ b/gcc/config/i386/i386-expand.c
> > > @@ -12295,6 +12295,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
> > >        char *opts = ix86_target_string (bisa, bisa2, 0, 0, NULL, NULL,
> > >                                        (enum fpmath_unit) 0,
> > >                                        (enum prefer_vector_width) 0,
> > > +                                      PVW_NONE, PVW_NONE,
> > >                                        false, add_abi_p);
> > >        if (!opts)
> > >         error ("%qE needs unknown isa option", fndecl);
> > > diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> > > index a4da8331b8b..77712a07aef 100644
> > > --- a/gcc/config/i386/i386-options.c
> > > +++ b/gcc/config/i386/i386-options.c
> > > @@ -364,6 +364,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > >                     const char *arch, const char *tune,
> > >                     enum fpmath_unit fpmath,
> > >                     enum prefer_vector_width pvw,
> > > +                   enum prefer_vector_width move_max,
> > > +                   enum prefer_vector_width store_max,
> > >                     bool add_nl_p, bool add_abi_p)
> > >  {
> > >    /* Flag options.  */
> > > @@ -542,10 +544,10 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > >         }
> > >      }
> > >
> > > -  /* Add -mprefer-vector-width= option.  */
> > > -  if (pvw)
> > > +  auto add_vector_width = [&opts, &num] (prefer_vector_width pvw,
> > > +                                        const char *cmd)
> > >      {
> > > -      opts[num][0] = "-mprefer-vector-width=";
> > > +      opts[num][0] = cmd;
> > >        switch ((int) pvw)
> > >         {
> > >         case PVW_AVX128:
> > > @@ -563,7 +565,19 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > >         default:
> > >           gcc_unreachable ();
> > >         }
> > > -    }
> > > +    };
> > > +
> > > +  /* Add -mprefer-vector-width= option.  */
> > > +  if (pvw)
> > > +    add_vector_width (pvw, "-mprefer-vector-width=");
> > > +
> > > +  /* Add -mmove-max= option.  */
> > > +  if (move_max)
> > > +    add_vector_width (move_max, "-mmove-max=");
> > > +
> > > +  /* Add -mstore-max= option.  */
> > > +  if (store_max)
> > > +    add_vector_width (store_max, "-mstore-max=");
> > >
> > >    /* Any options?  */
> > >    if (num == 0)
> > > @@ -630,6 +644,7 @@ ix86_debug_options (void)
> > >                                    target_flags, ix86_target_flags,
> > >                                    ix86_arch_string, ix86_tune_string,
> > >                                    ix86_fpmath, prefer_vector_width_type,
> > > +                                  ix86_move_max, ix86_store_max,
> > >                                    true, true);
> > >
> > >    if (opts)
> > > @@ -892,7 +907,9 @@ ix86_function_specific_print (FILE *file, int indent,
> > >      = ix86_target_string (ptr->x_ix86_isa_flags, ptr->x_ix86_isa_flags2,
> > >                           ptr->x_target_flags, ptr->x_ix86_target_flags,
> > >                           NULL, NULL, ptr->x_ix86_fpmath,
> > > -                         ptr->x_prefer_vector_width_type, false, true);
> > > +                         ptr->x_prefer_vector_width_type,
> > > +                         ptr->x_ix86_move_max, ptr->x_ix86_store_max,
> > > +                         false, true);
> > >
> > >    gcc_assert (ptr->arch < PROCESSOR_max);
> > >    fprintf (file, "%*sarch = %d (%s)\n",
> > > @@ -1318,6 +1335,10 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> > >    const char *orig_tune_string = opts->x_ix86_tune_string;
> > >    enum fpmath_unit orig_fpmath_set = opts_set->x_ix86_fpmath;
> > >    enum prefer_vector_width orig_pvw_set = opts_set->x_prefer_vector_width_type;
> > > +  enum prefer_vector_width orig_ix86_move_max_set
> > > +    = opts_set->x_ix86_move_max;
> > > +  enum prefer_vector_width orig_ix86_store_max_set
> > > +    = opts_set->x_ix86_store_max;
> > >    int orig_tune_defaulted = ix86_tune_defaulted;
> > >    int orig_arch_specified = ix86_arch_specified;
> > >    char *option_strings[IX86_FUNCTION_SPECIFIC_MAX] = { NULL, NULL };
> > > @@ -1393,6 +1414,8 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> > >        opts->x_ix86_tune_string = orig_tune_string;
> > >        opts_set->x_ix86_fpmath = orig_fpmath_set;
> > >        opts_set->x_prefer_vector_width_type = orig_pvw_set;
> > > +      opts_set->x_ix86_move_max = orig_ix86_move_max_set;
> > > +      opts_set->x_ix86_store_max = orig_ix86_store_max_set;
> > >        opts->x_ix86_excess_precision = orig_ix86_excess_precision;
> > >        opts->x_ix86_unsafe_math_optimizations
> > >         = orig_ix86_unsafe_math_optimizations;
> > > @@ -2667,6 +2690,48 @@ ix86_option_override_internal (bool main_args_p,
> > >        && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> > >      opts->x_prefer_vector_width_type = PVW_AVX256;
> > >
> > > +  if (opts_set->x_ix86_move_max == PVW_NONE)
> > > +    {
> > > +      /* Set the maximum number of bits can be moved from memory to
> > > +        memory efficiently.  */
> > > +      if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
> > > +       opts->x_ix86_move_max = PVW_AVX512;
> > > +      else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
> > > +       opts->x_ix86_move_max = PVW_AVX256;
> > > +      else
> > > +       {
> > > +         opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
> > > +         if (opts_set->x_ix86_move_max == PVW_NONE)
> > > +           {
> > > +             if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> > > +               opts->x_ix86_move_max = PVW_AVX512;
> > > +             else
> > > +               opts->x_ix86_move_max = PVW_AVX128;
> > > +           }
> > > +       }
> > > +    }
> > > +
> > > +  if (opts_set->x_ix86_store_max == PVW_NONE)
> > > +    {
> > > +      /* Set the maximum number of bits can be stored to memory
> > > +        efficiently.  */
> > > +      if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
> > > +       opts->x_ix86_store_max = PVW_AVX512;
> > > +      else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
> > > +       opts->x_ix86_store_max = PVW_AVX256;
> > > +      else
> > > +       {
> > > +         opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
> > > +         if (opts_set->x_ix86_store_max == PVW_NONE)
> > > +           {
> > > +             if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> > > +               opts->x_ix86_store_max = PVW_AVX512;
> > > +             else
> > > +               opts->x_ix86_store_max = PVW_AVX128;
> > > +           }
> > > +       }
> > > +    }
> > > +
> > >    if (opts->x_ix86_recip_name)
> > >      {
> > >        char *p = ASTRDUP (opts->x_ix86_recip_name);
> > > diff --git a/gcc/config/i386/i386-options.h b/gcc/config/i386/i386-options.h
> > > index cdaca2644f4..e218e24d15b 100644
> > > --- a/gcc/config/i386/i386-options.h
> > > +++ b/gcc/config/i386/i386-options.h
> > > @@ -26,8 +26,10 @@ char *ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > >                           int flags, int flags2,
> > >                           const char *arch, const char *tune,
> > >                           enum fpmath_unit fpmath,
> > > -                         enum prefer_vector_width pvw, bool add_nl_p,
> > > -                         bool add_abi_p);
> > > +                         enum prefer_vector_width pvw,
> > > +                         enum prefer_vector_width move_max,
> > > +                         enum prefer_vector_width store_max,
> > > +                         bool add_nl_p, bool add_abi_p);
> > >
> > >  extern enum attr_cpu ix86_schedule;
> > >
> > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > index 2fda1e0686e..4f70085d793 100644
> > > --- a/gcc/config/i386/i386.h
> > > +++ b/gcc/config/i386/i386.h
> > > @@ -408,10 +408,6 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> > >         ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
> > >  #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
> > >         ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
> > > -#define TARGET_AVX256_MOVE_BY_PIECES \
> > > -       ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES]
> > > -#define TARGET_AVX256_STORE_BY_PIECES \
> > > -       ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES]
> > >  #define TARGET_AVX256_SPLIT_REGS \
> > >         ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
> > >  #define TARGET_GENERAL_REGS_SSE_SPILL \
> > > @@ -1807,12 +1803,13 @@ typedef struct ix86_args {
> > >     MOVE_MAX_PIECES defaults to MOVE_MAX.  */
> > >
> > >  #define MOVE_MAX \
> > > -  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > > +  ((TARGET_AVX512F \
> > > +    && (ix86_move_max == PVW_AVX512 \
> > > +       || ix86_store_max == PVW_AVX512)) \
> > >     ? 64 \
> > >     : ((TARGET_AVX \
> > > -       && !TARGET_PREFER_AVX128 \
> > > -       && (TARGET_AVX256_MOVE_BY_PIECES \
> > > -          || TARGET_AVX256_STORE_BY_PIECES)) \
> > > +       && (ix86_move_max >= PVW_AVX256 \
> > > +          || ix86_store_max >= PVW_AVX256)) \
> > >        ? 32 \
> > >        : ((TARGET_SSE2 \
> > >           && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> > > @@ -1825,11 +1822,10 @@ typedef struct ix86_args {
> > >     store_by_pieces of 16/32/64 bytes.  */
> > >  #define STORE_MAX_PIECES \
> > >    (TARGET_INTER_UNIT_MOVES_TO_VEC \
> > > -   ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > > +   ? ((TARGET_AVX512F && ix86_store_max == PVW_AVX512) \
> > >        ? 64 \
> > >        : ((TARGET_AVX \
> > > -         && !TARGET_PREFER_AVX128 \
> > > -         && TARGET_AVX256_STORE_BY_PIECES) \
> > > +         && ix86_store_max >= PVW_AVX256) \
> > >           ? 32 \
> > >           : ((TARGET_SSE2 \
> > >               && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > index 3e67c537bb7..620dab6b672 100644
> > > --- a/gcc/config/i386/i386.opt
> > > +++ b/gcc/config/i386/i386.opt
> > > @@ -624,6 +624,14 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
> > >  EnumValue
> > >  Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
> > >
> > > +mmove-max=
> > > +Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> > > +Maximum number of bits can be moved from memory to memory efficiently.
>
> ... number of bits THAT can be ...
>
> > > +
> > > +mstore-max=
> > > +Target RejectNegative Joined Var(ix86_store_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> > > +Maximum number of bits can be stored to memory efficiently.
>
> ... number of bits THAT can be ...
>
> > > +
> > >  ;; ISA support
> > >
> > >  m32
> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 4ae0b569841..26981f657af 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -512,6 +512,16 @@ DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
> > >  DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
> > >           m_CORE_AVX512)
> > >
> > > +/* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
> > > +   AVX instructions.  */
> > > +DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
> > > +         m_SAPPHIRERAPIDS)
> > > +
> > > +/* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit
> > > +   AVX instructions.  */
> > > +DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> > > +         m_SAPPHIRERAPIDS)
> > > +
> > >  /*****************************************************************************/
> > >  /*****************************************************************************/
> > >  /* Historical relics: tuning flags that helps a specific old CPU designs     */
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 3bddfbaae6a..3412b9ede44 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -1393,6 +1393,7 @@ See RS/6000 and PowerPC Options.
> > >  -mcld  -mcx16  -msahf  -mmovbe  -mcrc32 -mmwait @gol
> > >  -mrecip  -mrecip=@var{opt} @gol
> > >  -mvzeroupper  -mprefer-avx128  -mprefer-vector-width=@var{opt} @gol
> > > +-mmove-max=@var{bits} -mstore-max=@var{bits} @gol
> > >  -mmmx  -msse  -msse2  -msse3  -mssse3  -msse4.1  -msse4.2  -msse4  -mavx @gol
> > >  -mavx2  -mavx512f  -mavx512pf  -mavx512er  -mavx512cd  -mavx512vl @gol
> > >  -mavx512bw  -mavx512dq  -mavx512ifma  -mavx512vbmi  -msha  -maes @gol
> > > @@ -31848,6 +31849,18 @@ This option instructs GCC to use 128-bit AVX instructions instead of
> > >  This option instructs GCC to use @var{opt}-bit vector width in instructions
> > >  instead of default on the selected platform.
> > >
> > > +@item -mmove-max=@var{bits}
> > > +@opindex mmove-max
> > > +This option instructs GCC to set the maximum number of bits can be
> > > +moved from memory to memory efficiently to @var{bits}.  The valid
> > > +@var{bits} are 128, 256 and 512.
> > > +
> > > +@item -mstore-max=@var{bits}
> > > +@opindex mstore-max
> > > +This option instructs GCC to set the maximum number of bits can be
> > > +stored to memory efficiently to @var{bits}.  The valid @var{bits} are
> > > +128, 256 and 512.
> > > +
> > >  @table @samp
> > >  @item none
> > >  No extra limitations applied to GCC other than defined by the selected platform.
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > > new file mode 100644
> > > index 00000000000..28ab7a6d41c
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mmove-max=512" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > new file mode 100644
> > > index 00000000000..b15a0db9ff0
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > > new file mode 100644
> > > index 00000000000..a5b5b617578
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=128 -mstore-max=128" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 8 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > > new file mode 100644
> > > index 00000000000..1feff48c5b2
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=256 -mstore-max=256" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > new file mode 100644
> > > index 00000000000..ef439f20f74
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > > new file mode 100644
> > > index 00000000000..70c80e5064b
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > new file mode 100644
> > > index 00000000000..ab7894aa2e6
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > @@ -0,0 +1,17 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu8\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> > > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-47.c b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > > new file mode 100644
> > > index 00000000000..8f2c254ad03
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > > @@ -0,0 +1,17 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=128" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> > > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-48.c b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > > new file mode 100644
> > > index 00000000000..9a7da962183
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > > @@ -0,0 +1,17 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=256" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> > > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > new file mode 100644
> > > index 00000000000..ad43f89a9bd
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > +  __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> > > +/* No need to dynamically realign the stack here.  */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer.  */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > --
> > > 2.33.1
> > >
> >
> > PING.
> >
> >
> > --
> > H.J.
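
For readers following the ix86_option_override_internal hunk quoted above, here is a minimal C sketch of how the effective -mmove-max value (and, symmetrically, -mstore-max) appears to be chosen when the user does not pass the option. The struct and function names are hypothetical, and the enum ordering is only assumed to match prefer_vector_width. Read literally, the inner opts_set test in the posted hunk repeats the outer one, so the value copied from the preferred vector width is always replaced by the AVX512F-based choice. The sketch follows the hunk as posted rather than guessing at the intent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed to mirror the ordering of GCC's prefer_vector_width enum.  */
enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Hypothetical stand-in for the tuning and ISA state the hunk consults.  */
struct target_model
{
  bool tune_avx512_by_pieces; /* X86_TUNE_AVX512_{MOVE,STORE}_BY_PIECES */
  bool tune_avx256_by_pieces; /* X86_TUNE_AVX256_{MOVE,STORE}_BY_PIECES */
  bool has_avx512f;           /* TARGET_AVX512F_P */
};

/* Return the effective limit: the user's value if one was given,
   otherwise the default derived from tuning flags and the ISA.  */
static enum pvw
effective_max (enum pvw user_value, const struct target_model *t)
{
  assert (t != NULL);
  if (user_value != PVW_NONE)
    return user_value;
  if (t->tune_avx512_by_pieces)
    return PVW_AVX512;
  if (t->tune_avx256_by_pieces)
    return PVW_AVX256;
  /* Fallback as posted: decided by AVX512F availability alone.  */
  return t->has_avx512f ? PVW_AVX512 : PVW_AVX128;
}
```

With only the 256-bit tuning flag set, effective_max (PVW_NONE, &t) yields PVW_AVX256 even when AVX512F is available, while an explicit user value is returned unchanged.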
  
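The i386.h hunk quoted above rewrites the two AVX tiers of MOVE_MAX and STORE_MAX_PIECES in terms of ix86_move_max and ix86_store_max. The following C sketch restates just those two tiers as functions. The function names and boolean parameters are hypothetical stand-ins for the TARGET_* macros, the enum ordering is assumed, and a return value of 0 means "falls through to the SSE and word-size tiers that the hunk does not show".

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the ordering of GCC's prefer_vector_width enum.  */
enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Top two tiers of MOVE_MAX, in bytes; 0 = lower tiers, not modeled.  */
static int
move_max_bytes (bool avx512f, bool avx, enum pvw move_max, enum pvw store_max)
{
  /* Option override has already resolved both defaults.  */
  assert (move_max != PVW_NONE && store_max != PVW_NONE);
  if (avx512f && (move_max == PVW_AVX512 || store_max == PVW_AVX512))
    return 64;
  if (avx && (move_max >= PVW_AVX256 || store_max >= PVW_AVX256))
    return 32;
  return 0;
}

/* Top two tiers of STORE_MAX_PIECES, in bytes; 0 = not modeled.  */
static int
store_max_pieces_bytes (bool inter_unit_moves_to_vec, bool avx512f, bool avx,
                        enum pvw store_max)
{
  assert (store_max != PVW_NONE);
  if (!inter_unit_moves_to_vec)
    return 0;
  if (avx512f && store_max == PVW_AVX512)
    return 64;
  if (avx && store_max >= PVW_AVX256)
    return 32;
  return 0;
}
```

Two consequences are visible in this form: MOVE_MAX takes the wider of the two limits, so a 512-bit move limit with a 256-bit store limit gives 64 for MOVE_MAX but 32 for STORE_MAX_PIECES. A 512-bit limit without AVX512F still selects the 32-byte tier when AVX is enabled.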

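The scan-assembler-times counts in the new pieces-memcpy and pieces-memset tests can be sanity-checked with simple arithmetic on the 66-byte size they use. The C sketch below is only a counting model with hypothetical function names: it counts full-width vector accesses and ignores how the 2-byte tail is emitted, which the real by-pieces expansion decides.

```c
#include <assert.h>
#include <stddef.h>

/* Number of full vector-width chunks in NBYTES.  */
static size_t
full_chunks (size_t nbytes, unsigned width_bits)
{
  assert (width_bits == 128 || width_bits == 256 || width_bits == 512);
  return nbytes / (width_bits / 8);
}

/* A memcpy needs one vector load and one vector store per chunk.  */
static size_t
memcpy_vector_moves (size_t nbytes, unsigned width_bits)
{
  return 2 * full_chunks (nbytes, width_bits);
}

/* A memset needs one vector store per chunk.  */
static size_t
memset_vector_stores (size_t nbytes, unsigned width_bits)
{
  return full_chunks (nbytes, width_bits);
}
```

For 66 bytes this gives 2, 4 and 8 vector moves for memcpy at 512, 256 and 128 bits, and 1, 2 and 4 vector stores for memset, which are the counts the tests scan for.
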
Patch

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index 0d5d1a0e205..7e77ff56ddc 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -12295,6 +12295,7 @@  ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
       char *opts = ix86_target_string (bisa, bisa2, 0, 0, NULL, NULL,
 				       (enum fpmath_unit) 0,
 				       (enum prefer_vector_width) 0,
+				       PVW_NONE, PVW_NONE,
 				       false, add_abi_p);
       if (!opts)
 	error ("%qE needs unknown isa option", fndecl);
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index a4da8331b8b..77712a07aef 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -364,6 +364,8 @@  ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
 		    const char *arch, const char *tune,
 		    enum fpmath_unit fpmath,
 		    enum prefer_vector_width pvw,
+		    enum prefer_vector_width move_max,
+		    enum prefer_vector_width store_max,
 		    bool add_nl_p, bool add_abi_p)
 {
   /* Flag options.  */
@@ -542,10 +544,10 @@  ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
 	}
     }
 
-  /* Add -mprefer-vector-width= option.  */
-  if (pvw)
+  auto add_vector_width = [&opts, &num] (prefer_vector_width pvw,
+					 const char *cmd)
     {
-      opts[num][0] = "-mprefer-vector-width=";
+      opts[num][0] = cmd;
       switch ((int) pvw)
 	{
 	case PVW_AVX128:
@@ -563,7 +565,19 @@  ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
 	default:
 	  gcc_unreachable ();
 	}
-    }
+    };
+
+  /* Add -mprefer-vector-width= option.  */
+  if (pvw)
+    add_vector_width (pvw, "-mprefer-vector-width=");
+
+  /* Add -mmove-max= option.  */
+  if (move_max)
+    add_vector_width (move_max, "-mmove-max=");
+
+  /* Add -mstore-max= option.  */
+  if (store_max)
+    add_vector_width (store_max, "-mstore-max=");
 
   /* Any options?  */
   if (num == 0)
@@ -630,6 +644,7 @@  ix86_debug_options (void)
 				   target_flags, ix86_target_flags,
 				   ix86_arch_string, ix86_tune_string,
 				   ix86_fpmath, prefer_vector_width_type,
+				   ix86_move_max, ix86_store_max,
 				   true, true);
 
   if (opts)
@@ -892,7 +907,9 @@  ix86_function_specific_print (FILE *file, int indent,
     = ix86_target_string (ptr->x_ix86_isa_flags, ptr->x_ix86_isa_flags2,
 			  ptr->x_target_flags, ptr->x_ix86_target_flags,
 			  NULL, NULL, ptr->x_ix86_fpmath,
-			  ptr->x_prefer_vector_width_type, false, true);
+			  ptr->x_prefer_vector_width_type,
+			  ptr->x_ix86_move_max, ptr->x_ix86_store_max,
+			  false, true);
 
   gcc_assert (ptr->arch < PROCESSOR_max);
   fprintf (file, "%*sarch = %d (%s)\n",
@@ -1318,6 +1335,10 @@  ix86_valid_target_attribute_tree (tree fndecl, tree args,
   const char *orig_tune_string = opts->x_ix86_tune_string;
   enum fpmath_unit orig_fpmath_set = opts_set->x_ix86_fpmath;
   enum prefer_vector_width orig_pvw_set = opts_set->x_prefer_vector_width_type;
+  enum prefer_vector_width orig_ix86_move_max_set
+    = opts_set->x_ix86_move_max;
+  enum prefer_vector_width orig_ix86_store_max_set
+    = opts_set->x_ix86_store_max;
   int orig_tune_defaulted = ix86_tune_defaulted;
   int orig_arch_specified = ix86_arch_specified;
   char *option_strings[IX86_FUNCTION_SPECIFIC_MAX] = { NULL, NULL };
@@ -1393,6 +1414,8 @@  ix86_valid_target_attribute_tree (tree fndecl, tree args,
       opts->x_ix86_tune_string = orig_tune_string;
       opts_set->x_ix86_fpmath = orig_fpmath_set;
       opts_set->x_prefer_vector_width_type = orig_pvw_set;
+      opts_set->x_ix86_move_max = orig_ix86_move_max_set;
+      opts_set->x_ix86_store_max = orig_ix86_store_max_set;
       opts->x_ix86_excess_precision = orig_ix86_excess_precision;
       opts->x_ix86_unsafe_math_optimizations
 	= orig_ix86_unsafe_math_optimizations;
@@ -2667,6 +2690,48 @@  ix86_option_override_internal (bool main_args_p,
       && (opts_set->x_prefer_vector_width_type == PVW_NONE))
     opts->x_prefer_vector_width_type = PVW_AVX256;
 
+  if (opts_set->x_ix86_move_max == PVW_NONE)
+    {
+      /* Set the maximum number of bits that can be moved from memory
+	 to memory efficiently.  */
+      if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
+	opts->x_ix86_move_max = PVW_AVX512;
+      else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
+	opts->x_ix86_move_max = PVW_AVX256;
+      else
+	{
+	  opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
+	  if (opts_set->x_ix86_move_max == PVW_NONE)
+	    {
+	      if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
+		opts->x_ix86_move_max = PVW_AVX512;
+	      else
+		opts->x_ix86_move_max = PVW_AVX128;
+	    }
+	}
+    }
+
+  if (opts_set->x_ix86_store_max == PVW_NONE)
+    {
+      /* Set the maximum number of bits that can be stored to memory
+	 efficiently.  */
+      if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
+	opts->x_ix86_store_max = PVW_AVX512;
+      else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
+	opts->x_ix86_store_max = PVW_AVX256;
+      else
+	{
+	  opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
+	  if (opts_set->x_ix86_store_max == PVW_NONE)
+	    {
+	      if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
+		opts->x_ix86_store_max = PVW_AVX512;
+	      else
+		opts->x_ix86_store_max = PVW_AVX128;
+	    }
+	}
+    }
+
   if (opts->x_ix86_recip_name)
     {
       char *p = ASTRDUP (opts->x_ix86_recip_name);
diff --git a/gcc/config/i386/i386-options.h b/gcc/config/i386/i386-options.h
index cdaca2644f4..e218e24d15b 100644
--- a/gcc/config/i386/i386-options.h
+++ b/gcc/config/i386/i386-options.h
@@ -26,8 +26,10 @@  char *ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
 			  int flags, int flags2,
 			  const char *arch, const char *tune,
 			  enum fpmath_unit fpmath,
-			  enum prefer_vector_width pvw, bool add_nl_p,
-			  bool add_abi_p);
+			  enum prefer_vector_width pvw,
+			  enum prefer_vector_width move_max,
+			  enum prefer_vector_width store_max,
+			  bool add_nl_p, bool add_abi_p);
 
 extern enum attr_cpu ix86_schedule;
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 2fda1e0686e..4f70085d793 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -408,10 +408,6 @@  extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
-#define TARGET_AVX256_MOVE_BY_PIECES \
-	ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES]
-#define TARGET_AVX256_STORE_BY_PIECES \
-	ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES]
 #define TARGET_AVX256_SPLIT_REGS \
 	ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
 #define TARGET_GENERAL_REGS_SSE_SPILL \
@@ -1807,12 +1803,13 @@  typedef struct ix86_args {
    MOVE_MAX_PIECES defaults to MOVE_MAX.  */
 
 #define MOVE_MAX \
-  ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
+  ((TARGET_AVX512F \
+    && (ix86_move_max == PVW_AVX512 \
+	|| ix86_store_max == PVW_AVX512)) \
    ? 64 \
    : ((TARGET_AVX \
-       && !TARGET_PREFER_AVX128 \
-       && (TARGET_AVX256_MOVE_BY_PIECES \
-	   || TARGET_AVX256_STORE_BY_PIECES)) \
+       && (ix86_move_max >= PVW_AVX256 \
+	   || ix86_store_max >= PVW_AVX256)) \
       ? 32 \
       : ((TARGET_SSE2 \
 	  && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
@@ -1825,11 +1822,10 @@  typedef struct ix86_args {
    store_by_pieces of 16/32/64 bytes.  */
 #define STORE_MAX_PIECES \
   (TARGET_INTER_UNIT_MOVES_TO_VEC \
-   ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
+   ? ((TARGET_AVX512F && ix86_store_max == PVW_AVX512) \
       ? 64 \
       : ((TARGET_AVX \
-	  && !TARGET_PREFER_AVX128 \
-	  && TARGET_AVX256_STORE_BY_PIECES) \
+	  && ix86_store_max >= PVW_AVX256) \
 	  ? 32 \
 	  : ((TARGET_SSE2 \
 	      && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 3e67c537bb7..620dab6b672 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -624,6 +624,14 @@  Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
 EnumValue
 Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
 
+mmove-max=
+Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
+Maximum number of bits that can be moved from memory to memory efficiently.
+
+mstore-max=
+Target RejectNegative Joined Var(ix86_store_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
+Maximum number of bits that can be stored to memory efficiently.
+
 ;; ISA support
 
 m32
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 4ae0b569841..26981f657af 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -512,6 +512,16 @@  DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
 DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
 	  m_CORE_AVX512)
 
+/* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
+   AVX instructions.  */
+DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
+	  m_SAPPHIRERAPIDS)
+
+/* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit
+   AVX instructions.  */
+DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
+	  m_SAPPHIRERAPIDS)
+
 /*****************************************************************************/
 /*****************************************************************************/
 /* Historical relics: tuning flags that helps a specific old CPU designs     */
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 3bddfbaae6a..3412b9ede44 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1393,6 +1393,7 @@  See RS/6000 and PowerPC Options.
 -mcld  -mcx16  -msahf  -mmovbe  -mcrc32 -mmwait @gol
 -mrecip  -mrecip=@var{opt} @gol
 -mvzeroupper  -mprefer-avx128  -mprefer-vector-width=@var{opt} @gol
+-mmove-max=@var{bits} -mstore-max=@var{bits} @gol
 -mmmx  -msse  -msse2  -msse3  -mssse3  -msse4.1  -msse4.2  -msse4  -mavx @gol
 -mavx2  -mavx512f  -mavx512pf  -mavx512er  -mavx512cd  -mavx512vl @gol
 -mavx512bw  -mavx512dq  -mavx512ifma  -mavx512vbmi  -msha  -maes @gol
@@ -31848,6 +31849,18 @@  This option instructs GCC to use 128-bit AVX instructions instead of
 This option instructs GCC to use @var{opt}-bit vector width in instructions
 instead of default on the selected platform.
 
+@item -mmove-max=@var{bits}
+@opindex mmove-max
+This option instructs GCC to set the maximum number of bits that can be
+moved from memory to memory efficiently to @var{bits}.  The valid
+values of @var{bits} are 128, 256 and 512.
+
+@item -mstore-max=@var{bits}
+@opindex mstore-max
+This option instructs GCC to set the maximum number of bits that can be
+stored to memory efficiently to @var{bits}.  The valid values of
+@var{bits} are 128, 256 and 512.
+
 @table @samp
 @item none
 No extra limitations applied to GCC other than defined by the selected platform.
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
new file mode 100644
index 00000000000..28ab7a6d41c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mmove-max=512" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
new file mode 100644
index 00000000000..b15a0db9ff0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
new file mode 100644
index 00000000000..a5b5b617578
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids -mmove-max=128 -mstore-max=128" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 8 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
new file mode 100644
index 00000000000..1feff48c5b2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids -mmove-max=256 -mstore-max=256" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
new file mode 100644
index 00000000000..ef439f20f74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
new file mode 100644
index 00000000000..70c80e5064b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
new file mode 100644
index 00000000000..ab7894aa2e6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu8\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-47.c b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
new file mode 100644
index 00000000000..8f2c254ad03
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids -mstore-max=128" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-48.c b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
new file mode 100644
index 00000000000..9a7da962183
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids -mstore-max=256" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
new file mode 100644
index 00000000000..ad43f89a9bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+  __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here.  */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer.  */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
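
Note on the expected counts in the tests above: every test copies or sets 66 bytes, and the scan-assembler-times counts appear to follow from how many full vector-sized pieces fit in 66 bytes at the selected width. The sketch below is only a simplified model of that arithmetic, not GCC's by-pieces algorithm; the helper names are hypothetical and it assumes the tail (66 modulo the vector size) is handled by narrower moves that the zmm/ymm/xmm patterns do not count.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GCC: number of full vector pieces
   of width BITS that fit in N bytes.  The remaining tail bytes are
   assumed to be covered by narrower moves.  */
static size_t
full_pieces (size_t n, unsigned bits)
{
  return n / (bits / 8);
}

/* A memory-to-memory copy needs one vector load and one vector store
   per full piece.  */
static size_t
memcpy_vector_moves (size_t n, unsigned bits)
{
  return 2 * full_pieces (n, bits);
}

/* A memset needs only one vector store per full piece; the broadcast
   of the fill value is not counted here.  */
static size_t
memset_vector_stores (size_t n, unsigned bits)
{
  return full_pieces (n, bits);
}
```

Under this model, memcpy_vector_moves (66, bits) gives 2, 4 and 8 for 512, 256 and 128 bits, and memset_vector_stores (66, bits) gives 1, 2 and 4, which agrees with the counts scanned for in pieces-memcpy-17.c through pieces-memcpy-21.c and pieces-memset-45.c through pieces-memset-49.c.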