x86: Add -mmove-max=bits and -mstore-max=bits
Commit Message
Add -mmove-max=bits and -mstore-max=bits to enable 256-bit/512-bit moves
and stores, independent of -mprefer-vector-width=bits:
1. Add X86_TUNE_AVX512_MOVE_BY_PIECES and X86_TUNE_AVX512_STORE_BY_PIECES,
which are enabled for Intel Sapphire Rapids processors.
2. Add -mmove-max=bits to set the maximum number of bits that can be moved
from memory to memory efficiently. The default value is derived from
X86_TUNE_AVX512_MOVE_BY_PIECES, X86_TUNE_AVX256_MOVE_BY_PIECES, and the
preferred vector width.
3. Add -mstore-max=bits to set the maximum number of bits that can be stored
to memory efficiently. The default value is derived from
X86_TUNE_AVX512_STORE_BY_PIECES, X86_TUNE_AVX256_STORE_BY_PIECES, and the
preferred vector width.
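As a rough illustration of how the defaults in items 2 and 3 are derived, here is a minimal sketch in plain C. It is not the GCC code: struct tuning and default_max are hypothetical names, and the enum only mirrors the ordering of GCC's prefer_vector_width values. The final fallback (preferred width first, then the ISA) follows the description above; the exact condition in the patch may differ.

```c
#include <stdbool.h>

/* Same ordering as GCC's prefer_vector_width enum; PVW_NONE means "unset".  */
enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Hypothetical bundle of the inputs the patch consults.  */
struct tuning
{
  bool avx512_by_pieces;  /* X86_TUNE_AVX512_{MOVE,STORE}_BY_PIECES */
  bool avx256_by_pieces;  /* X86_TUNE_AVX256_{MOVE,STORE}_BY_PIECES */
  enum pvw preferred;     /* -mprefer-vector-width=, or PVW_NONE */
  bool have_avx512f;      /* AVX512F enabled in the ISA flags */
};

/* Return the effective -mmove-max/-mstore-max value.  An explicit user
   value always wins; otherwise the tuning flags are tried, then the
   preferred vector width, then the ISA.  */
static inline enum pvw
default_max (enum pvw user_value, const struct tuning *t)
{
  if (user_value != PVW_NONE)
    return user_value;
  if (t->avx512_by_pieces)
    return PVW_AVX512;
  if (t->avx256_by_pieces)
    return PVW_AVX256;
  if (t->preferred != PVW_NONE)
    return t->preferred;
  return t->have_avx512f ? PVW_AVX512 : PVW_AVX128;
}
```

With both tuning flags set, as for Sapphire Rapids, default_max (PVW_NONE, ...) yields PVW_AVX512, while an explicit -mmove-max=128 is returned unchanged.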
gcc/
PR target/103269
* config/i386/i386-expand.c (ix86_expand_builtin): Pass PVW_NONE
and PVW_NONE to ix86_target_string.
* config/i386/i386-options.c (ix86_target_string): Add arguments
for move_max and store_max.
(ix86_target_string::add_vector_width): New lambda.
(ix86_debug_options): Pass ix86_move_max and ix86_store_max to
ix86_target_string.
(ix86_function_specific_print): Pass ptr->x_ix86_move_max and
ptr->x_ix86_store_max to ix86_target_string.
(ix86_valid_target_attribute_tree): Handle x_ix86_move_max and
x_ix86_store_max.
(ix86_option_override_internal): Set the default x_ix86_move_max
and x_ix86_store_max.
* config/i386/i386-options.h (ix86_target_string): Add arguments
for move_max and store_max.
* config/i386/i386.h (TARGET_AVX256_MOVE_BY_PIECES): Removed.
(TARGET_AVX256_STORE_BY_PIECES): Likewise.
(MOVE_MAX): Use 64 if ix86_move_max or ix86_store_max ==
PVW_AVX512. Use 32 if ix86_move_max or ix86_store_max >=
PVW_AVX256.
(STORE_MAX_PIECES): Use 64 if ix86_store_max == PVW_AVX512.
Use 32 if ix86_store_max >= PVW_AVX256.
* config/i386/i386.opt: Add -mmove-max=bits and -mstore-max=bits.
* config/i386/x86-tune.def (X86_TUNE_AVX512_MOVE_BY_PIECES): New.
(X86_TUNE_AVX512_STORE_BY_PIECES): Likewise.
* doc/invoke.texi: Document -mmove-max=bits and -mstore-max=bits.
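The MOVE_MAX and STORE_MAX_PIECES entries above can be read as the following selection logic. This is only a sketch under assumptions: the function names are made up, the fallback parameter is a hypothetical stand-in for the SSE2 and word-size arms that the patch does not change, and the STORE_MAX_PIECES variant assumes TARGET_INTER_UNIT_MOVES_TO_VEC holds.

```c
#include <stdbool.h>

/* Same ordering as GCC's prefer_vector_width enum.  */
enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Bytes selected by the patched MOVE_MAX: either option asking for the
   wider size is enough, provided the ISA supports it.  */
static inline int
move_max_bytes (bool avx512f, bool avx, enum pvw move_max,
                enum pvw store_max, int fallback)
{
  if (avx512f && (move_max == PVW_AVX512 || store_max == PVW_AVX512))
    return 64;
  if (avx && (move_max >= PVW_AVX256 || store_max >= PVW_AVX256))
    return 32;
  return fallback;  /* unchanged SSE2/word-size arms (hypothetical) */
}

/* Bytes selected by the patched STORE_MAX_PIECES: only the store limit
   is consulted.  */
static inline int
store_max_pieces_bytes (bool avx512f, bool avx, enum pvw store_max,
                        int fallback)
{
  if (avx512f && store_max == PVW_AVX512)
    return 64;
  if (avx && store_max >= PVW_AVX256)
    return 32;
  return fallback;  /* unchanged SSE2/word-size arms (hypothetical) */
}
```

Note that a 512-bit limit without AVX512F still yields 32 bytes when AVX is available, which is the situation pieces-memcpy-21.c exercises with -mtune=sapphirerapids -march=x86-64 -mavx2.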
gcc/testsuite/
PR target/103269
* gcc.target/i386/pieces-memcpy-17.c: New test.
* gcc.target/i386/pieces-memcpy-18.c: Likewise.
* gcc.target/i386/pieces-memcpy-19.c: Likewise.
* gcc.target/i386/pieces-memcpy-20.c: Likewise.
* gcc.target/i386/pieces-memcpy-21.c: Likewise.
* gcc.target/i386/pieces-memset-45.c: Likewise.
* gcc.target/i386/pieces-memset-46.c: Likewise.
* gcc.target/i386/pieces-memset-47.c: Likewise.
* gcc.target/i386/pieces-memset-48.c: Likewise.
* gcc.target/i386/pieces-memset-49.c: Likewise.
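The new tests all copy or set 66 bytes and count vector instructions at the widest allowed size. The counts they expect can be reproduced with a simplified model, sketched below. It is not GCC's by-pieces algorithm: it only counts chunks at the maximum width and ignores how the 2-byte tail is handled, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of full chunks of MAX_BITS that fit in N_BYTES.  */
static inline size_t
full_chunks (size_t n_bytes, unsigned max_bits)
{
  assert (max_bits == 128 || max_bits == 256 || max_bits == 512);
  return n_bytes / (max_bits / 8);
}

/* A copy needs one vector load and one vector store per chunk.  */
static inline size_t
memcpy_vector_insns (size_t n_bytes, unsigned max_bits)
{
  return 2 * full_chunks (n_bytes, max_bits);
}

/* A fill needs only one vector store per chunk.  */
static inline size_t
memset_vector_insns (size_t n_bytes, unsigned max_bits)
{
  return full_chunks (n_bytes, max_bits);
}
```

For 66 bytes this gives 2, 4 and 8 vector moves for memcpy at 512, 256 and 128 bits, and 1, 2 and 4 vector stores for memset, matching the scan-assembler-times counts in the tests.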
---
gcc/config/i386/i386-expand.c | 1 +
gcc/config/i386/i386-options.c | 75 +++++++++++++++++--
gcc/config/i386/i386-options.h | 6 +-
gcc/config/i386/i386.h | 18 ++---
gcc/config/i386/i386.opt | 8 ++
gcc/config/i386/x86-tune.def | 10 +++
gcc/doc/invoke.texi | 13 ++++
.../gcc.target/i386/pieces-memcpy-17.c | 16 ++++
.../gcc.target/i386/pieces-memcpy-18.c | 16 ++++
.../gcc.target/i386/pieces-memcpy-19.c | 16 ++++
.../gcc.target/i386/pieces-memcpy-20.c | 16 ++++
.../gcc.target/i386/pieces-memcpy-21.c | 16 ++++
.../gcc.target/i386/pieces-memset-45.c | 16 ++++
.../gcc.target/i386/pieces-memset-46.c | 17 +++++
.../gcc.target/i386/pieces-memset-47.c | 17 +++++
.../gcc.target/i386/pieces-memset-48.c | 17 +++++
.../gcc.target/i386/pieces-memset-49.c | 16 ++++
17 files changed, 276 insertions(+), 18 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-45.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-46.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-47.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-48.c
create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-49.c
Comments
On Thu, Nov 25, 2021 at 2:47 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> index 0d5d1a0e205..7e77ff56ddc 100644
> --- a/gcc/config/i386/i386-expand.c
> +++ b/gcc/config/i386/i386-expand.c
> @@ -12295,6 +12295,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
> char *opts = ix86_target_string (bisa, bisa2, 0, 0, NULL, NULL,
> (enum fpmath_unit) 0,
> (enum prefer_vector_width) 0,
> + PVW_NONE, PVW_NONE,
> false, add_abi_p);
> if (!opts)
> error ("%qE needs unknown isa option", fndecl);
> diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> index a4da8331b8b..77712a07aef 100644
> --- a/gcc/config/i386/i386-options.c
> +++ b/gcc/config/i386/i386-options.c
> @@ -364,6 +364,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> const char *arch, const char *tune,
> enum fpmath_unit fpmath,
> enum prefer_vector_width pvw,
> + enum prefer_vector_width move_max,
> + enum prefer_vector_width store_max,
> bool add_nl_p, bool add_abi_p)
> {
> /* Flag options. */
> @@ -542,10 +544,10 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> }
> }
>
> - /* Add -mprefer-vector-width= option. */
> - if (pvw)
> + auto add_vector_width = [&opts, &num] (prefer_vector_width pvw,
> + const char *cmd)
> {
> - opts[num][0] = "-mprefer-vector-width=";
> + opts[num][0] = cmd;
> switch ((int) pvw)
> {
> case PVW_AVX128:
> @@ -563,7 +565,19 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> default:
> gcc_unreachable ();
> }
> - }
> + };
> +
> + /* Add -mprefer-vector-width= option. */
> + if (pvw)
> + add_vector_width (pvw, "-mprefer-vector-width=");
> +
> + /* Add -mmove-max= option. */
> + if (move_max)
> + add_vector_width (move_max, "-mmove-max=");
> +
> + /* Add -mstore-max= option. */
> + if (store_max)
> + add_vector_width (store_max, "-mstore-max=");
>
> /* Any options? */
> if (num == 0)
> @@ -630,6 +644,7 @@ ix86_debug_options (void)
> target_flags, ix86_target_flags,
> ix86_arch_string, ix86_tune_string,
> ix86_fpmath, prefer_vector_width_type,
> + ix86_move_max, ix86_store_max,
> true, true);
>
> if (opts)
> @@ -892,7 +907,9 @@ ix86_function_specific_print (FILE *file, int indent,
> = ix86_target_string (ptr->x_ix86_isa_flags, ptr->x_ix86_isa_flags2,
> ptr->x_target_flags, ptr->x_ix86_target_flags,
> NULL, NULL, ptr->x_ix86_fpmath,
> - ptr->x_prefer_vector_width_type, false, true);
> + ptr->x_prefer_vector_width_type,
> + ptr->x_ix86_move_max, ptr->x_ix86_store_max,
> + false, true);
>
> gcc_assert (ptr->arch < PROCESSOR_max);
> fprintf (file, "%*sarch = %d (%s)\n",
> @@ -1318,6 +1335,10 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> const char *orig_tune_string = opts->x_ix86_tune_string;
> enum fpmath_unit orig_fpmath_set = opts_set->x_ix86_fpmath;
> enum prefer_vector_width orig_pvw_set = opts_set->x_prefer_vector_width_type;
> + enum prefer_vector_width orig_ix86_move_max_set
> + = opts_set->x_ix86_move_max;
> + enum prefer_vector_width orig_ix86_store_max_set
> + = opts_set->x_ix86_store_max;
> int orig_tune_defaulted = ix86_tune_defaulted;
> int orig_arch_specified = ix86_arch_specified;
> char *option_strings[IX86_FUNCTION_SPECIFIC_MAX] = { NULL, NULL };
> @@ -1393,6 +1414,8 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> opts->x_ix86_tune_string = orig_tune_string;
> opts_set->x_ix86_fpmath = orig_fpmath_set;
> opts_set->x_prefer_vector_width_type = orig_pvw_set;
> + opts_set->x_ix86_move_max = orig_ix86_move_max_set;
> + opts_set->x_ix86_store_max = orig_ix86_store_max_set;
> opts->x_ix86_excess_precision = orig_ix86_excess_precision;
> opts->x_ix86_unsafe_math_optimizations
> = orig_ix86_unsafe_math_optimizations;
> @@ -2667,6 +2690,48 @@ ix86_option_override_internal (bool main_args_p,
> && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> opts->x_prefer_vector_width_type = PVW_AVX256;
>
> + if (opts_set->x_ix86_move_max == PVW_NONE)
> + {
> + /* Set the maximum number of bits that can be moved from memory to
> + memory efficiently. */
> + if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
> + opts->x_ix86_move_max = PVW_AVX512;
> + else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
> + opts->x_ix86_move_max = PVW_AVX256;
> + else
> + {
> + opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
> + if (opts_set->x_ix86_move_max == PVW_NONE)
> + {
> + if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> + opts->x_ix86_move_max = PVW_AVX512;
> + else
> + opts->x_ix86_move_max = PVW_AVX128;
> + }
> + }
> + }
> +
> + if (opts_set->x_ix86_store_max == PVW_NONE)
> + {
> + /* Set the maximum number of bits that can be stored to memory
> + efficiently. */
> + if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
> + opts->x_ix86_store_max = PVW_AVX512;
> + else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
> + opts->x_ix86_store_max = PVW_AVX256;
> + else
> + {
> + opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
> + if (opts_set->x_ix86_store_max == PVW_NONE)
> + {
> + if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> + opts->x_ix86_store_max = PVW_AVX512;
> + else
> + opts->x_ix86_store_max = PVW_AVX128;
> + }
> + }
> + }
> +
> if (opts->x_ix86_recip_name)
> {
> char *p = ASTRDUP (opts->x_ix86_recip_name);
> diff --git a/gcc/config/i386/i386-options.h b/gcc/config/i386/i386-options.h
> index cdaca2644f4..e218e24d15b 100644
> --- a/gcc/config/i386/i386-options.h
> +++ b/gcc/config/i386/i386-options.h
> @@ -26,8 +26,10 @@ char *ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> int flags, int flags2,
> const char *arch, const char *tune,
> enum fpmath_unit fpmath,
> - enum prefer_vector_width pvw, bool add_nl_p,
> - bool add_abi_p);
> + enum prefer_vector_width pvw,
> + enum prefer_vector_width move_max,
> + enum prefer_vector_width store_max,
> + bool add_nl_p, bool add_abi_p);
>
> extern enum attr_cpu ix86_schedule;
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 2fda1e0686e..4f70085d793 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -408,10 +408,6 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
> #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
> ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
> -#define TARGET_AVX256_MOVE_BY_PIECES \
> - ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES]
> -#define TARGET_AVX256_STORE_BY_PIECES \
> - ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES]
> #define TARGET_AVX256_SPLIT_REGS \
> ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
> #define TARGET_GENERAL_REGS_SSE_SPILL \
> @@ -1807,12 +1803,13 @@ typedef struct ix86_args {
> MOVE_MAX_PIECES defaults to MOVE_MAX. */
>
> #define MOVE_MAX \
> - ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> + ((TARGET_AVX512F \
> + && (ix86_move_max == PVW_AVX512 \
> + || ix86_store_max == PVW_AVX512)) \
> ? 64 \
> : ((TARGET_AVX \
> - && !TARGET_PREFER_AVX128 \
> - && (TARGET_AVX256_MOVE_BY_PIECES \
> - || TARGET_AVX256_STORE_BY_PIECES)) \
> + && (ix86_move_max >= PVW_AVX256 \
> + || ix86_store_max >= PVW_AVX256)) \
> ? 32 \
> : ((TARGET_SSE2 \
> && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> @@ -1825,11 +1822,10 @@ typedef struct ix86_args {
> store_by_pieces of 16/32/64 bytes. */
> #define STORE_MAX_PIECES \
> (TARGET_INTER_UNIT_MOVES_TO_VEC \
> - ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> + ? ((TARGET_AVX512F && ix86_store_max == PVW_AVX512) \
> ? 64 \
> : ((TARGET_AVX \
> - && !TARGET_PREFER_AVX128 \
> - && TARGET_AVX256_STORE_BY_PIECES) \
> + && ix86_store_max >= PVW_AVX256) \
> ? 32 \
> : ((TARGET_SSE2 \
> && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index 3e67c537bb7..620dab6b672 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -624,6 +624,14 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
> EnumValue
> Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
>
> +mmove-max=
> +Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> +Maximum number of bits that can be moved from memory to memory efficiently.
> +
> +mstore-max=
> +Target RejectNegative Joined Var(ix86_store_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> +Maximum number of bits that can be stored to memory efficiently.
> +
> ;; ISA support
>
> m32
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 4ae0b569841..26981f657af 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -512,6 +512,16 @@ DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
> DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
> m_CORE_AVX512)
>
> +/* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
> + AVX instructions. */
> +DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
> + m_SAPPHIRERAPIDS)
> +
> +/* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit
> + AVX instructions. */
> +DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> + m_SAPPHIRERAPIDS)
> +
> /*****************************************************************************/
> /*****************************************************************************/
> /* Historical relics: tuning flags that helps a specific old CPU designs */
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 3bddfbaae6a..3412b9ede44 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -1393,6 +1393,7 @@ See RS/6000 and PowerPC Options.
> -mcld -mcx16 -msahf -mmovbe -mcrc32 -mmwait @gol
> -mrecip -mrecip=@var{opt} @gol
> -mvzeroupper -mprefer-avx128 -mprefer-vector-width=@var{opt} @gol
> +-mmove-max=@var{bits} -mstore-max=@var{bits} @gol
> -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx @gol
> -mavx2 -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl @gol
> -mavx512bw -mavx512dq -mavx512ifma -mavx512vbmi -msha -maes @gol
> @@ -31848,6 +31849,18 @@ This option instructs GCC to use 128-bit AVX instructions instead of
> This option instructs GCC to use @var{opt}-bit vector width in instructions
> instead of default on the selected platform.
>
> +@item -mmove-max=@var{bits}
> +@opindex mmove-max
> +This option instructs GCC to set the maximum number of bits that can be
> +moved from memory to memory efficiently to @var{bits}. The valid
> +@var{bits} are 128, 256 and 512.
> +
> +@item -mstore-max=@var{bits}
> +@opindex mstore-max
> +This option instructs GCC to set the maximum number of bits that can be
> +stored to memory efficiently to @var{bits}. The valid @var{bits} are
> +128, 256 and 512.
> +
> @table @samp
> @item none
> No extra limitations applied to GCC other than defined by the selected platform.
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> new file mode 100644
> index 00000000000..28ab7a6d41c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mmove-max=512" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> + __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> new file mode 100644
> index 00000000000..b15a0db9ff0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> + __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> new file mode 100644
> index 00000000000..a5b5b617578
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=128 -mstore-max=128" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> + __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 8 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> new file mode 100644
> index 00000000000..1feff48c5b2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=256 -mstore-max=256" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> + __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> new file mode 100644
> index 00000000000..ef439f20f74
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> +
> +extern char *dst, *src;
> +
> +void
> +foo (void)
> +{
> + __builtin_memcpy (dst, src, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> new file mode 100644
> index 00000000000..70c80e5064b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> + __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> new file mode 100644
> index 00000000000..ab7894aa2e6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> + __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu8\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-47.c b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> new file mode 100644
> index 00000000000..8f2c254ad03
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=128" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> + __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-48.c b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> new file mode 100644
> index 00000000000..9a7da962183
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=256" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> + __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> new file mode 100644
> index 00000000000..ad43f89a9bd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> +
> +extern char *dst;
> +
> +void
> +foo (void)
> +{
> + __builtin_memset (dst, 3, 66);
> +}
> +
> +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> +/* No need to dynamically realign the stack here. */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer. */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> --
> 2.33.1
>
PING.
On Fri, Dec 3, 2021 at 2:24 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Thu, Nov 25, 2021 at 2:47 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
LGTM with two grammar fixes below.
Thanks,
Uros.
> >
> > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > index 0d5d1a0e205..7e77ff56ddc 100644
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -12295,6 +12295,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
> > char *opts = ix86_target_string (bisa, bisa2, 0, 0, NULL, NULL,
> > (enum fpmath_unit) 0,
> > (enum prefer_vector_width) 0,
> > + PVW_NONE, PVW_NONE,
> > false, add_abi_p);
> > if (!opts)
> > error ("%qE needs unknown isa option", fndecl);
> > diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> > index a4da8331b8b..77712a07aef 100644
> > --- a/gcc/config/i386/i386-options.c
> > +++ b/gcc/config/i386/i386-options.c
> > @@ -364,6 +364,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > const char *arch, const char *tune,
> > enum fpmath_unit fpmath,
> > enum prefer_vector_width pvw,
> > + enum prefer_vector_width move_max,
> > + enum prefer_vector_width store_max,
> > bool add_nl_p, bool add_abi_p)
> > {
> > /* Flag options. */
> > @@ -542,10 +544,10 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > }
> > }
> >
> > - /* Add -mprefer-vector-width= option. */
> > - if (pvw)
> > + auto add_vector_width = [&opts, &num] (prefer_vector_width pvw,
> > + const char *cmd)
> > {
> > - opts[num][0] = "-mprefer-vector-width=";
> > + opts[num][0] = cmd;
> > switch ((int) pvw)
> > {
> > case PVW_AVX128:
> > @@ -563,7 +565,19 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > default:
> > gcc_unreachable ();
> > }
> > - }
> > + };
> > +
> > + /* Add -mprefer-vector-width= option. */
> > + if (pvw)
> > + add_vector_width (pvw, "-mprefer-vector-width=");
> > +
> > + /* Add -mmove-max= option. */
> > + if (move_max)
> > + add_vector_width (move_max, "-mmove-max=");
> > +
> > + /* Add -mstore-max= option. */
> > + if (store_max)
> > + add_vector_width (store_max, "-mstore-max=");
> >
> > /* Any options? */
> > if (num == 0)
> > @@ -630,6 +644,7 @@ ix86_debug_options (void)
> > target_flags, ix86_target_flags,
> > ix86_arch_string, ix86_tune_string,
> > ix86_fpmath, prefer_vector_width_type,
> > + ix86_move_max, ix86_store_max,
> > true, true);
> >
> > if (opts)
> > @@ -892,7 +907,9 @@ ix86_function_specific_print (FILE *file, int indent,
> > = ix86_target_string (ptr->x_ix86_isa_flags, ptr->x_ix86_isa_flags2,
> > ptr->x_target_flags, ptr->x_ix86_target_flags,
> > NULL, NULL, ptr->x_ix86_fpmath,
> > - ptr->x_prefer_vector_width_type, false, true);
> > + ptr->x_prefer_vector_width_type,
> > + ptr->x_ix86_move_max, ptr->x_ix86_store_max,
> > + false, true);
> >
> > gcc_assert (ptr->arch < PROCESSOR_max);
> > fprintf (file, "%*sarch = %d (%s)\n",
> > @@ -1318,6 +1335,10 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> > const char *orig_tune_string = opts->x_ix86_tune_string;
> > enum fpmath_unit orig_fpmath_set = opts_set->x_ix86_fpmath;
> > enum prefer_vector_width orig_pvw_set = opts_set->x_prefer_vector_width_type;
> > + enum prefer_vector_width orig_ix86_move_max_set
> > + = opts_set->x_ix86_move_max;
> > + enum prefer_vector_width orig_ix86_store_max_set
> > + = opts_set->x_ix86_store_max;
> > int orig_tune_defaulted = ix86_tune_defaulted;
> > int orig_arch_specified = ix86_arch_specified;
> > char *option_strings[IX86_FUNCTION_SPECIFIC_MAX] = { NULL, NULL };
> > @@ -1393,6 +1414,8 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> > opts->x_ix86_tune_string = orig_tune_string;
> > opts_set->x_ix86_fpmath = orig_fpmath_set;
> > opts_set->x_prefer_vector_width_type = orig_pvw_set;
> > + opts_set->x_ix86_move_max = orig_ix86_move_max_set;
> > + opts_set->x_ix86_store_max = orig_ix86_store_max_set;
> > opts->x_ix86_excess_precision = orig_ix86_excess_precision;
> > opts->x_ix86_unsafe_math_optimizations
> > = orig_ix86_unsafe_math_optimizations;
> > @@ -2667,6 +2690,48 @@ ix86_option_override_internal (bool main_args_p,
> > && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> > opts->x_prefer_vector_width_type = PVW_AVX256;
> >
> > + if (opts_set->x_ix86_move_max == PVW_NONE)
> > + {
> > + /* Set the maximum number of bits can be moved from memory to
> > + memory efficiently. */
> > + if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
> > + opts->x_ix86_move_max = PVW_AVX512;
> > + else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
> > + opts->x_ix86_move_max = PVW_AVX256;
> > + else
> > + {
> > + opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
> > + if (opts_set->x_ix86_move_max == PVW_NONE)
> > + {
> > + if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> > + opts->x_ix86_move_max = PVW_AVX512;
> > + else
> > + opts->x_ix86_move_max = PVW_AVX128;
> > + }
> > + }
> > + }
> > +
> > + if (opts_set->x_ix86_store_max == PVW_NONE)
> > + {
> > + /* Set the maximum number of bits can be stored to memory
> > + efficiently. */
> > + if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
> > + opts->x_ix86_store_max = PVW_AVX512;
> > + else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
> > + opts->x_ix86_store_max = PVW_AVX256;
> > + else
> > + {
> > + opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
> > + if (opts_set->x_ix86_store_max == PVW_NONE)
> > + {
> > + if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> > + opts->x_ix86_store_max = PVW_AVX512;
> > + else
> > + opts->x_ix86_store_max = PVW_AVX128;
> > + }
> > + }
> > + }
> > +
> > if (opts->x_ix86_recip_name)
> > {
> > char *p = ASTRDUP (opts->x_ix86_recip_name);
> > diff --git a/gcc/config/i386/i386-options.h b/gcc/config/i386/i386-options.h
> > index cdaca2644f4..e218e24d15b 100644
> > --- a/gcc/config/i386/i386-options.h
> > +++ b/gcc/config/i386/i386-options.h
> > @@ -26,8 +26,10 @@ char *ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > int flags, int flags2,
> > const char *arch, const char *tune,
> > enum fpmath_unit fpmath,
> > - enum prefer_vector_width pvw, bool add_nl_p,
> > - bool add_abi_p);
> > + enum prefer_vector_width pvw,
> > + enum prefer_vector_width move_max,
> > + enum prefer_vector_width store_max,
> > + bool add_nl_p, bool add_abi_p);
> >
> > extern enum attr_cpu ix86_schedule;
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index 2fda1e0686e..4f70085d793 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -408,10 +408,6 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> > ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
> > #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
> > ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
> > -#define TARGET_AVX256_MOVE_BY_PIECES \
> > - ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES]
> > -#define TARGET_AVX256_STORE_BY_PIECES \
> > - ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES]
> > #define TARGET_AVX256_SPLIT_REGS \
> > ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
> > #define TARGET_GENERAL_REGS_SSE_SPILL \
> > @@ -1807,12 +1803,13 @@ typedef struct ix86_args {
> > MOVE_MAX_PIECES defaults to MOVE_MAX. */
> >
> > #define MOVE_MAX \
> > - ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > + ((TARGET_AVX512F \
> > + && (ix86_move_max == PVW_AVX512 \
> > + || ix86_store_max == PVW_AVX512)) \
> > ? 64 \
> > : ((TARGET_AVX \
> > - && !TARGET_PREFER_AVX128 \
> > - && (TARGET_AVX256_MOVE_BY_PIECES \
> > - || TARGET_AVX256_STORE_BY_PIECES)) \
> > + && (ix86_move_max >= PVW_AVX256 \
> > + || ix86_store_max >= PVW_AVX256)) \
> > ? 32 \
> > : ((TARGET_SSE2 \
> > && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> > @@ -1825,11 +1822,10 @@ typedef struct ix86_args {
> > store_by_pieces of 16/32/64 bytes. */
> > #define STORE_MAX_PIECES \
> > (TARGET_INTER_UNIT_MOVES_TO_VEC \
> > - ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > + ? ((TARGET_AVX512F && ix86_store_max == PVW_AVX512) \
> > ? 64 \
> > : ((TARGET_AVX \
> > - && !TARGET_PREFER_AVX128 \
> > - && TARGET_AVX256_STORE_BY_PIECES) \
> > + && ix86_store_max >= PVW_AVX256) \
> > ? 32 \
> > : ((TARGET_SSE2 \
> > && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > index 3e67c537bb7..620dab6b672 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -624,6 +624,14 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
> > EnumValue
> > Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
> >
> > +mmove-max=
> > +Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> > +Maximum number of bits can be moved from memory to memory efficiently.
... number of bits THAT can be ...
> > +
> > +mstore-max=
> > +Target RejectNegative Joined Var(ix86_store_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> > +Maximum number of bits can be stored to memory efficiently.
... number of bits THAT can be ...
> > +
> > ;; ISA support
> >
> > m32
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 4ae0b569841..26981f657af 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -512,6 +512,16 @@ DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
> > DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
> > m_CORE_AVX512)
> >
> > +/* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
> > + AVX instructions. */
> > +DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
> > + m_SAPPHIRERAPIDS)
> > +
> > +/* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit
> > + AVX instructions. */
> > +DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> > + m_SAPPHIRERAPIDS)
> > +
> > /*****************************************************************************/
> > /*****************************************************************************/
> > /* Historical relics: tuning flags that helps a specific old CPU designs */
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 3bddfbaae6a..3412b9ede44 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -1393,6 +1393,7 @@ See RS/6000 and PowerPC Options.
> > -mcld -mcx16 -msahf -mmovbe -mcrc32 -mmwait @gol
> > -mrecip -mrecip=@var{opt} @gol
> > -mvzeroupper -mprefer-avx128 -mprefer-vector-width=@var{opt} @gol
> > +-mmove-max=@var{bits} -mstore-max=@var{bits} @gol
> > -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx @gol
> > -mavx2 -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl @gol
> > -mavx512bw -mavx512dq -mavx512ifma -mavx512vbmi -msha -maes @gol
> > @@ -31848,6 +31849,18 @@ This option instructs GCC to use 128-bit AVX instructions instead of
> > This option instructs GCC to use @var{opt}-bit vector width in instructions
> > instead of default on the selected platform.
> >
> > +@item -mmove-max=@var{bits}
> > +@opindex mmove-max
> > +This option instructs GCC to set the maximum number of bits that can be
> > +moved from memory to memory efficiently to @var{bits}. The valid
> > +@var{bits} are 128, 256 and 512.
> > +
> > +@item -mstore-max=@var{bits}
> > +@opindex mstore-max
> > +This option instructs GCC to set the maximum number of bits that can be
> > +stored to memory efficiently to @var{bits}. The valid @var{bits} are
> > +128, 256 and 512.
> > +
> > @table @samp
> > @item none
> > No extra limitations applied to GCC other than defined by the selected platform.
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > new file mode 100644
> > index 00000000000..28ab7a6d41c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mmove-max=512" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > new file mode 100644
> > index 00000000000..b15a0db9ff0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > new file mode 100644
> > index 00000000000..a5b5b617578
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=128 -mstore-max=128" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 8 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > new file mode 100644
> > index 00000000000..1feff48c5b2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=256 -mstore-max=256" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > new file mode 100644
> > index 00000000000..ef439f20f74
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> > +
> > +extern char *dst, *src;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memcpy (dst, src, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > new file mode 100644
> > index 00000000000..70c80e5064b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > new file mode 100644
> > index 00000000000..ab7894aa2e6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu8\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-47.c b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > new file mode 100644
> > index 00000000000..8f2c254ad03
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=128" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-48.c b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > new file mode 100644
> > index 00000000000..9a7da962183
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=256" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > new file mode 100644
> > index 00000000000..ad43f89a9bd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> > +
> > +extern char *dst;
> > +
> > +void
> > +foo (void)
> > +{
> > + __builtin_memset (dst, 3, 66);
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> > +/* No need to dynamically realign the stack here. */
> > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > +/* Nor use a frame pointer. */
> > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > --
> > 2.33.1
> >
>
> PING.
>
>
> --
> H.J.
On Fri, Dec 3, 2021 at 8:55 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Fri, Dec 3, 2021 at 2:24 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 2:47 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > Add -mmove-max=bits and -mstore-max=bits to enable 256-bit/512-bit move
> > > and store, independent of -mprefer-vector-width=bits:
> > >
> > > 1. Add X86_TUNE_AVX512_MOVE_BY_PIECES and X86_TUNE_AVX512_STORE_BY_PIECES,
> > > which are enabled for Intel Sapphire Rapids processors.
> > > 2. Add -mmove-max=bits to set the maximum number of bits that can be moved
> > > from memory to memory efficiently. The default value is derived from
> > > X86_TUNE_AVX512_MOVE_BY_PIECES, X86_TUNE_AVX256_MOVE_BY_PIECES, and the
> > > preferred vector width.
> > > 3. Add -mstore-max=bits to set the maximum number of bits that can be stored
> > > to memory efficiently. The default value is derived from
> > > X86_TUNE_AVX512_STORE_BY_PIECES, X86_TUNE_AVX256_STORE_BY_PIECES, and the
> > > preferred vector width.
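
For readers following the default selection described in items 2 and 3, here is a minimal standalone sketch of how the ix86_option_override_internal hunk appears to pick the value. It is a model, not GCC code: the function name default_max_bits and its parameters are hypothetical, and the enum is a stand-in whose numeric values are assumed (only the ordering matters). In the hunk as quoted in this thread, the fallback first copies the preferred vector width and then re-tests the same "not set by the user" condition, so the model ends on the ISA check. That is a reading of the quoted code, not a statement of intent.

```c
#include <stdbool.h>

/* Stand-in for GCC's prefer_vector_width enum.  The numeric values are
   an assumption; only the ordering NONE < 128 < 256 < 512 is relied on.  */
enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Hypothetical model of the default for -mmove-max= (the -mstore-max=
   case has the same shape with the STORE_BY_PIECES tuning flags).  */
static enum pvw
default_max_bits (bool user_set, enum pvw user_value,
                  bool tune_avx512_by_pieces, bool tune_avx256_by_pieces,
                  enum pvw preferred_width, bool have_avx512f)
{
  /* An explicit option on the command line always wins.  */
  if (user_set)
    return user_value;

  /* Otherwise the tuning flags are consulted, widest first.  */
  if (tune_avx512_by_pieces)
    return PVW_AVX512;
  if (tune_avx256_by_pieces)
    return PVW_AVX256;

  /* Fallback: start from the preferred vector width ...  */
  enum pvw result = preferred_width;

  /* ... then, mirroring the quoted hunk, re-test the "user did not set
     it" condition, which still holds on this path.  */
  if (!user_set)
    result = have_avx512f ? PVW_AVX512 : PVW_AVX128;

  return result;
}
```

Note that the value returned here is only a request: the MOVE_MAX and STORE_MAX_PIECES macros still require the matching ISA before a wide move is actually used.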
> > >
> > > gcc/
> > >
> > > PR target/103269
> > > * config/i386/i386-expand.c (ix86_expand_builtin): Pass PVW_NONE
> > > and PVW_NONE to ix86_target_string.
> > > * config/i386/i386-options.c (ix86_target_string): Add arguments
> > > for move_max and store_max.
> > > (ix86_target_string::add_vector_width): New lambda.
> > > (ix86_debug_options): Pass ix86_move_max and ix86_store_max to
> > > ix86_target_string.
> > > (ix86_function_specific_print): Pass ptr->x_ix86_move_max and
> > > ptr->x_ix86_store_max to ix86_target_string.
> > > (ix86_valid_target_attribute_tree): Handle x_ix86_move_max and
> > > x_ix86_store_max.
> > > (ix86_option_override_internal): Set the default x_ix86_move_max
> > > and x_ix86_store_max.
> > > * config/i386/i386-options.h (ix86_target_string): Add
> > > move_max and store_max arguments.
> > > * config/i386/i386.h (TARGET_AVX256_MOVE_BY_PIECES): Removed.
> > > (TARGET_AVX256_STORE_BY_PIECES): Likewise.
> > > (MOVE_MAX): Use 64 if ix86_move_max or ix86_store_max ==
> > > PVW_AVX512. Use 32 if ix86_move_max or ix86_store_max >=
> > > PVW_AVX256.
> > > (STORE_MAX_PIECES): Use 64 if ix86_store_max == PVW_AVX512.
> > > Use 32 if ix86_store_max >= PVW_AVX256.
> > > * config/i386/i386.opt: Add -mmove-max=bits and -mstore-max=bits.
> > > * config/i386/x86-tune.def (X86_TUNE_AVX512_MOVE_BY_PIECES): New.
> > > (X86_TUNE_AVX512_STORE_BY_PIECES): Likewise.
> > > * doc/invoke.texi: Document -mmove-max=bits and -mstore-max=bits.
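
The MOVE_MAX and STORE_MAX_PIECES entries above can be read as a two-tier decision. The sketch below restates the two wide tiers as plain C functions. The function names and the narrower_bytes parameter are hypothetical: the quoted hunks do not show the SSE2 and word-size tiers in full, so those are represented by a caller-supplied value. The TARGET_INTER_UNIT_MOVES_TO_VEC gate on STORE_MAX_PIECES is also left out.

```c
#include <stdbool.h>

/* Stand-in for GCC's prefer_vector_width enum; values are assumed,
   only the ordering is relied on.  */
enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Model of the two widest tiers of MOVE_MAX, in bytes.  narrower_bytes
   stands for whatever the remaining tiers would select.  */
static int
move_max_bytes (bool have_avx512f, bool have_avx,
                enum pvw move_max, enum pvw store_max, int narrower_bytes)
{
  /* 64-byte moves need the AVX512F ISA and a 512-bit request.  */
  if (have_avx512f
      && (move_max == PVW_AVX512 || store_max == PVW_AVX512))
    return 64;

  /* 32-byte moves need AVX and a request of at least 256 bits.  */
  if (have_avx
      && (move_max >= PVW_AVX256 || store_max >= PVW_AVX256))
    return 32;

  return narrower_bytes;
}

/* Model of the two widest tiers of STORE_MAX_PIECES, in bytes.  Only
   the store limit is consulted here.  */
static int
store_max_pieces_bytes (bool have_avx512f, bool have_avx,
                        enum pvw store_max, int narrower_bytes)
{
  if (have_avx512f && store_max == PVW_AVX512)
    return 64;
  if (have_avx && store_max >= PVW_AVX256)
    return 32;
  return narrower_bytes;
}
```

Under this model, a 512-bit request without AVX512F drops to 32 bytes when AVX is available. That is consistent with the pieces-memcpy-21.c and pieces-memset-49.c tests, which tune for Sapphire Rapids with only -mavx2 and expect %ymm moves.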
> > >
> > > gcc/testsuite/
> > >
> > > PR target/103269
> > > * gcc.target/i386/pieces-memcpy-17.c: New test.
> > > * gcc.target/i386/pieces-memcpy-18.c: Likewise.
> > > * gcc.target/i386/pieces-memcpy-19.c: Likewise.
> > > * gcc.target/i386/pieces-memcpy-20.c: Likewise.
> > > * gcc.target/i386/pieces-memcpy-21.c: Likewise.
> > > * gcc.target/i386/pieces-memset-45.c: Likewise.
> > > * gcc.target/i386/pieces-memset-46.c: Likewise.
> > > * gcc.target/i386/pieces-memset-47.c: Likewise.
> > > * gcc.target/i386/pieces-memset-48.c: Likewise.
> > > * gcc.target/i386/pieces-memset-49.c: Likewise.
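
All of the new tests copy or set 66 bytes, and the instruction counts in their scan-assembler-times directives follow from simple arithmetic. The sketch below assumes the by-pieces expansion greedily uses the widest allowed vector for as many full chunks as fit and handles the 2-byte tail with narrower moves. The helper names are hypothetical and this is not how GCC computes anything internally.

```c
#include <stddef.h>

/* Number of full vector-width chunks in LEN bytes.  */
static size_t
full_chunks (size_t len, size_t width_bytes)
{
  return len / width_bytes;
}

/* Bytes left over after the full chunks.  */
static size_t
tail_bytes (size_t len, size_t width_bytes)
{
  return len % width_bytes;
}

/* A memcpy chunk needs one vector load and one vector store.  */
static size_t
memcpy_vector_moves (size_t len, size_t width_bytes)
{
  return 2 * full_chunks (len, width_bytes);
}

/* A memset chunk needs only one vector store.  */
static size_t
memset_vector_stores (size_t len, size_t width_bytes)
{
  return full_chunks (len, width_bytes);
}
```

For 66 bytes this gives 2, 4 and 8 vector moves for memcpy at 64, 32 and 16 bytes per vector, and 1, 2 and 4 vector stores for memset. These match the %zmm, %ymm and %xmm counts the tests scan for.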
>
> LGTM with two grammar fixes below.
Fixed.
> Thanks,
> Uros.
This is the patch I am checking in.
Thanks.
> > > ---
> > > gcc/config/i386/i386-expand.c | 1 +
> > > gcc/config/i386/i386-options.c | 75 +++++++++++++++++--
> > > gcc/config/i386/i386-options.h | 6 +-
> > > gcc/config/i386/i386.h | 18 ++---
> > > gcc/config/i386/i386.opt | 8 ++
> > > gcc/config/i386/x86-tune.def | 10 +++
> > > gcc/doc/invoke.texi | 13 ++++
> > > .../gcc.target/i386/pieces-memcpy-17.c | 16 ++++
> > > .../gcc.target/i386/pieces-memcpy-18.c | 16 ++++
> > > .../gcc.target/i386/pieces-memcpy-19.c | 16 ++++
> > > .../gcc.target/i386/pieces-memcpy-20.c | 16 ++++
> > > .../gcc.target/i386/pieces-memcpy-21.c | 16 ++++
> > > .../gcc.target/i386/pieces-memset-45.c | 16 ++++
> > > .../gcc.target/i386/pieces-memset-46.c | 17 +++++
> > > .../gcc.target/i386/pieces-memset-47.c | 17 +++++
> > > .../gcc.target/i386/pieces-memset-48.c | 17 +++++
> > > .../gcc.target/i386/pieces-memset-49.c | 16 ++++
> > > 17 files changed, 276 insertions(+), 18 deletions(-)
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > >
> > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
> > > index 0d5d1a0e205..7e77ff56ddc 100644
> > > --- a/gcc/config/i386/i386-expand.c
> > > +++ b/gcc/config/i386/i386-expand.c
> > > @@ -12295,6 +12295,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
> > > char *opts = ix86_target_string (bisa, bisa2, 0, 0, NULL, NULL,
> > > (enum fpmath_unit) 0,
> > > (enum prefer_vector_width) 0,
> > > + PVW_NONE, PVW_NONE,
> > > false, add_abi_p);
> > > if (!opts)
> > > error ("%qE needs unknown isa option", fndecl);
> > > diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> > > index a4da8331b8b..77712a07aef 100644
> > > --- a/gcc/config/i386/i386-options.c
> > > +++ b/gcc/config/i386/i386-options.c
> > > @@ -364,6 +364,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > > const char *arch, const char *tune,
> > > enum fpmath_unit fpmath,
> > > enum prefer_vector_width pvw,
> > > + enum prefer_vector_width move_max,
> > > + enum prefer_vector_width store_max,
> > > bool add_nl_p, bool add_abi_p)
> > > {
> > > /* Flag options. */
> > > @@ -542,10 +544,10 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > > }
> > > }
> > >
> > > - /* Add -mprefer-vector-width= option. */
> > > - if (pvw)
> > > + auto add_vector_width = [&opts, &num] (prefer_vector_width pvw,
> > > + const char *cmd)
> > > {
> > > - opts[num][0] = "-mprefer-vector-width=";
> > > + opts[num][0] = cmd;
> > > switch ((int) pvw)
> > > {
> > > case PVW_AVX128:
> > > @@ -563,7 +565,19 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > > default:
> > > gcc_unreachable ();
> > > }
> > > - }
> > > + };
> > > +
> > > + /* Add -mprefer-vector-width= option. */
> > > + if (pvw)
> > > + add_vector_width (pvw, "-mprefer-vector-width=");
> > > +
> > > + /* Add -mmove-max= option. */
> > > + if (move_max)
> > > + add_vector_width (move_max, "-mmove-max=");
> > > +
> > > + /* Add -mstore-max= option. */
> > > + if (store_max)
> > > + add_vector_width (store_max, "-mstore-max=");
> > >
> > > /* Any options? */
> > > if (num == 0)
> > > @@ -630,6 +644,7 @@ ix86_debug_options (void)
> > > target_flags, ix86_target_flags,
> > > ix86_arch_string, ix86_tune_string,
> > > ix86_fpmath, prefer_vector_width_type,
> > > + ix86_move_max, ix86_store_max,
> > > true, true);
> > >
> > > if (opts)
> > > @@ -892,7 +907,9 @@ ix86_function_specific_print (FILE *file, int indent,
> > > = ix86_target_string (ptr->x_ix86_isa_flags, ptr->x_ix86_isa_flags2,
> > > ptr->x_target_flags, ptr->x_ix86_target_flags,
> > > NULL, NULL, ptr->x_ix86_fpmath,
> > > - ptr->x_prefer_vector_width_type, false, true);
> > > + ptr->x_prefer_vector_width_type,
> > > + ptr->x_ix86_move_max, ptr->x_ix86_store_max,
> > > + false, true);
> > >
> > > gcc_assert (ptr->arch < PROCESSOR_max);
> > > fprintf (file, "%*sarch = %d (%s)\n",
> > > @@ -1318,6 +1335,10 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> > > const char *orig_tune_string = opts->x_ix86_tune_string;
> > > enum fpmath_unit orig_fpmath_set = opts_set->x_ix86_fpmath;
> > > enum prefer_vector_width orig_pvw_set = opts_set->x_prefer_vector_width_type;
> > > + enum prefer_vector_width orig_ix86_move_max_set
> > > + = opts_set->x_ix86_move_max;
> > > + enum prefer_vector_width orig_ix86_store_max_set
> > > + = opts_set->x_ix86_store_max;
> > > int orig_tune_defaulted = ix86_tune_defaulted;
> > > int orig_arch_specified = ix86_arch_specified;
> > > char *option_strings[IX86_FUNCTION_SPECIFIC_MAX] = { NULL, NULL };
> > > @@ -1393,6 +1414,8 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
> > > opts->x_ix86_tune_string = orig_tune_string;
> > > opts_set->x_ix86_fpmath = orig_fpmath_set;
> > > opts_set->x_prefer_vector_width_type = orig_pvw_set;
> > > + opts_set->x_ix86_move_max = orig_ix86_move_max_set;
> > > + opts_set->x_ix86_store_max = orig_ix86_store_max_set;
> > > opts->x_ix86_excess_precision = orig_ix86_excess_precision;
> > > opts->x_ix86_unsafe_math_optimizations
> > > = orig_ix86_unsafe_math_optimizations;
> > > @@ -2667,6 +2690,48 @@ ix86_option_override_internal (bool main_args_p,
> > > && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> > > opts->x_prefer_vector_width_type = PVW_AVX256;
> > >
> > > + if (opts_set->x_ix86_move_max == PVW_NONE)
> > > + {
> > > + /* Set the maximum number of bits can be moved from memory to
> > > + memory efficiently. */
> > > + if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
> > > + opts->x_ix86_move_max = PVW_AVX512;
> > > + else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
> > > + opts->x_ix86_move_max = PVW_AVX256;
> > > + else
> > > + {
> > > + opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
> > > + if (opts_set->x_ix86_move_max == PVW_NONE)
> > > + {
> > > + if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> > > + opts->x_ix86_move_max = PVW_AVX512;
> > > + else
> > > + opts->x_ix86_move_max = PVW_AVX128;
> > > + }
> > > + }
> > > + }
> > > +
> > > + if (opts_set->x_ix86_store_max == PVW_NONE)
> > > + {
> > > + /* Set the maximum number of bits can be stored to memory
> > > + efficiently. */
> > > + if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
> > > + opts->x_ix86_store_max = PVW_AVX512;
> > > + else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
> > > + opts->x_ix86_store_max = PVW_AVX256;
> > > + else
> > > + {
> > > + opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
> > > + if (opts_set->x_ix86_store_max == PVW_NONE)
> > > + {
> > > + if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
> > > + opts->x_ix86_store_max = PVW_AVX512;
> > > + else
> > > + opts->x_ix86_store_max = PVW_AVX128;
> > > + }
> > > + }
> > > + }
> > > +
> > > if (opts->x_ix86_recip_name)
> > > {
> > > char *p = ASTRDUP (opts->x_ix86_recip_name);
> > > diff --git a/gcc/config/i386/i386-options.h b/gcc/config/i386/i386-options.h
> > > index cdaca2644f4..e218e24d15b 100644
> > > --- a/gcc/config/i386/i386-options.h
> > > +++ b/gcc/config/i386/i386-options.h
> > > @@ -26,8 +26,10 @@ char *ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
> > > int flags, int flags2,
> > > const char *arch, const char *tune,
> > > enum fpmath_unit fpmath,
> > > - enum prefer_vector_width pvw, bool add_nl_p,
> > > - bool add_abi_p);
> > > + enum prefer_vector_width pvw,
> > > + enum prefer_vector_width move_max,
> > > + enum prefer_vector_width store_max,
> > > + bool add_nl_p, bool add_abi_p);
> > >
> > > extern enum attr_cpu ix86_schedule;
> > >
> > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > index 2fda1e0686e..4f70085d793 100644
> > > --- a/gcc/config/i386/i386.h
> > > +++ b/gcc/config/i386/i386.h
> > > @@ -408,10 +408,6 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> > > ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
> > > #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
> > > ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
> > > -#define TARGET_AVX256_MOVE_BY_PIECES \
> > > - ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES]
> > > -#define TARGET_AVX256_STORE_BY_PIECES \
> > > - ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES]
> > > #define TARGET_AVX256_SPLIT_REGS \
> > > ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
> > > #define TARGET_GENERAL_REGS_SSE_SPILL \
> > > @@ -1807,12 +1803,13 @@ typedef struct ix86_args {
> > > MOVE_MAX_PIECES defaults to MOVE_MAX. */
> > >
> > > #define MOVE_MAX \
> > > - ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > > + ((TARGET_AVX512F \
> > > + && (ix86_move_max == PVW_AVX512 \
> > > + || ix86_store_max == PVW_AVX512)) \
> > > ? 64 \
> > > : ((TARGET_AVX \
> > > - && !TARGET_PREFER_AVX128 \
> > > - && (TARGET_AVX256_MOVE_BY_PIECES \
> > > - || TARGET_AVX256_STORE_BY_PIECES)) \
> > > + && (ix86_move_max >= PVW_AVX256 \
> > > + || ix86_store_max >= PVW_AVX256)) \
> > > ? 32 \
> > > : ((TARGET_SSE2 \
> > > && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
> > > @@ -1825,11 +1822,10 @@ typedef struct ix86_args {
> > > store_by_pieces of 16/32/64 bytes. */
> > > #define STORE_MAX_PIECES \
> > > (TARGET_INTER_UNIT_MOVES_TO_VEC \
> > > - ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
> > > + ? ((TARGET_AVX512F && ix86_store_max == PVW_AVX512) \
> > > ? 64 \
> > > : ((TARGET_AVX \
> > > - && !TARGET_PREFER_AVX128 \
> > > - && TARGET_AVX256_STORE_BY_PIECES) \
> > > + && ix86_store_max >= PVW_AVX256) \
> > > ? 32 \
> > > : ((TARGET_SSE2 \
> > > && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
> > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > index 3e67c537bb7..620dab6b672 100644
> > > --- a/gcc/config/i386/i386.opt
> > > +++ b/gcc/config/i386/i386.opt
> > > @@ -624,6 +624,14 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
> > > EnumValue
> > > Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
> > >
> > > +mmove-max=
> > > +Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> > > +Maximum number of bits can be moved from memory to memory efficiently.
>
> ... number of bits THAT can be ...
>
> > > +
> > > +mstore-max=
> > > +Target RejectNegative Joined Var(ix86_store_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
> > > +Maximum number of bits can be stored to memory efficiently.
>
> ... number of bits THAT can be ...
>
> > > +
> > > ;; ISA support
> > >
> > > m32
> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 4ae0b569841..26981f657af 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -512,6 +512,16 @@ DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
> > > DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
> > > m_CORE_AVX512)
> > >
> > > +/* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
> > > + AVX instructions. */
> > > +DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
> > > + m_SAPPHIRERAPIDS)
> > > +
> > > +/* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit
> > > + AVX instructions. */
> > > +DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> > > + m_SAPPHIRERAPIDS)
> > > +
> > > /*****************************************************************************/
> > > /*****************************************************************************/
> > > /* Historical relics: tuning flags that helps a specific old CPU designs */
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 3bddfbaae6a..3412b9ede44 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -1393,6 +1393,7 @@ See RS/6000 and PowerPC Options.
> > > -mcld -mcx16 -msahf -mmovbe -mcrc32 -mmwait @gol
> > > -mrecip -mrecip=@var{opt} @gol
> > > -mvzeroupper -mprefer-avx128 -mprefer-vector-width=@var{opt} @gol
> > > +-mmove-max=@var{bits} -mstore-max=@var{bits} @gol
> > > -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx @gol
> > > -mavx2 -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl @gol
> > > -mavx512bw -mavx512dq -mavx512ifma -mavx512vbmi -msha -maes @gol
> > > @@ -31848,6 +31849,18 @@ This option instructs GCC to use 128-bit AVX instructions instead of
> > > This option instructs GCC to use @var{opt}-bit vector width in instructions
> > > instead of default on the selected platform.
> > >
> > > +@item -mmove-max=@var{bits}
> > > +@opindex mmove-max
> > > +This option instructs GCC to set the maximum number of bits can be
> > > +moved from memory to memory efficiently to @var{bits}. The valid
> > > +@var{bits} are 128, 256 and 512.
> > > +
> > > +@item -mstore-max=@var{bits}
> > > +@opindex mstore-max
> > > +This option instructs GCC to set the maximum number of bits can be
> > > +stored to memory efficiently to @var{bits}. The valid @var{bits} are
> > > +128, 256 and 512.
> > > +
> > > @table @samp
> > > @item none
> > > No extra limitations applied to GCC other than defined by the selected platform.
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > > new file mode 100644
> > > index 00000000000..28ab7a6d41c
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-17.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mmove-max=512" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > new file mode 100644
> > > index 00000000000..b15a0db9ff0
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > > new file mode 100644
> > > index 00000000000..a5b5b617578
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-19.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=128 -mstore-max=128" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 8 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > > new file mode 100644
> > > index 00000000000..1feff48c5b2
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-20.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids -mmove-max=256 -mstore-max=256" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > new file mode 100644
> > > index 00000000000..ef439f20f74
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> > > +
> > > +extern char *dst, *src;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memcpy (dst, src, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-45.c b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > > new file mode 100644
> > > index 00000000000..70c80e5064b
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-45.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > new file mode 100644
> > > index 00000000000..ab7894aa2e6
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > @@ -0,0 +1,17 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu8\[ \\t\]+\[^\n\]*%zmm" 1 } } */
> > > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-47.c b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > > new file mode 100644
> > > index 00000000000..8f2c254ad03
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-47.c
> > > @@ -0,0 +1,17 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=128" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%xmm" 4 } } */
> > > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-48.c b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > > new file mode 100644
> > > index 00000000000..9a7da962183
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-48.c
> > > @@ -0,0 +1,17 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=sapphirerapids -mstore-max=256" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> > > +/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > new file mode 100644
> > > index 00000000000..ad43f89a9bd
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > @@ -0,0 +1,16 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
> > > +
> > > +extern char *dst;
> > > +
> > > +void
> > > +foo (void)
> > > +{
> > > + __builtin_memset (dst, 3, 66);
> > > +}
> > > +
> > > +/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
> > > +/* No need to dynamically realign the stack here. */
> > > +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> > > +/* Nor use a frame pointer. */
> > > +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
> > > --
> > > 2.33.1
> > >
> >
> > PING.
> >
> >
> > --
> > H.J.
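
For readers following the thread, the selection logic discussed above can be modeled outside of GCC. The block below is only a sketch under stated assumptions, not the code from i386-options.c or i386.h: the struct `tune_inputs` and the functions `default_max` and `move_max_bytes` are hypothetical names, the fallback to the preferred vector width follows the commit message's description of the default, and the SSE2 and narrower tiers of MOVE_MAX (cut off in the hunk) are collapsed to 16 bytes.

```c
#include <assert.h>
#include <stdbool.h>

/* Same ordering as the prefer_vector_width values used in the patch;
   the patch relies on this ordering for its ">= PVW_AVX256" tests.  */
enum pvw { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Hypothetical bundle of the inputs the default depends on.  */
struct tune_inputs
{
  bool avx512_by_pieces; /* X86_TUNE_AVX512_{MOVE,STORE}_BY_PIECES  */
  bool avx256_by_pieces; /* X86_TUNE_AVX256_{MOVE,STORE}_BY_PIECES  */
  bool have_avx512f;     /* AVX512F is enabled in the ISA flags  */
  enum pvw preferred;    /* -mprefer-vector-width=, PVW_NONE if unset  */
};

/* Assumed reading of the default: an explicit -mmove-max=/-mstore-max=
   wins, then the tuning flags, then the preferred vector width, then
   the ISA.  */
static enum pvw
default_max (enum pvw explicit_value, const struct tune_inputs *in)
{
  assert (in);
  if (explicit_value != PVW_NONE)
    return explicit_value;
  if (in->avx512_by_pieces)
    return PVW_AVX512;
  if (in->avx256_by_pieces)
    return PVW_AVX256;
  if (in->preferred != PVW_NONE)
    return in->preferred;
  return in->have_avx512f ? PVW_AVX512 : PVW_AVX128;
}

/* Simplified model of the new MOVE_MAX: the tiers below 32 bytes are
   collapsed to 16 in this sketch.  */
static int
move_max_bytes (bool avx512f, bool avx, enum pvw move_max,
                enum pvw store_max)
{
  if (avx512f && (move_max == PVW_AVX512 || store_max == PVW_AVX512))
    return 64;
  if (avx && (move_max >= PVW_AVX256 || store_max >= PVW_AVX256))
    return 32;
  return 16;
}
```

Note that in this model a 512-bit setting on a target without AVX512F still falls through to the 32-byte tier when AVX is available, which matches the ISA guards in the macro.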
@@ -12295,6 +12295,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
char *opts = ix86_target_string (bisa, bisa2, 0, 0, NULL, NULL,
(enum fpmath_unit) 0,
(enum prefer_vector_width) 0,
+ PVW_NONE, PVW_NONE,
false, add_abi_p);
if (!opts)
error ("%qE needs unknown isa option", fndecl);
@@ -364,6 +364,8 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
const char *arch, const char *tune,
enum fpmath_unit fpmath,
enum prefer_vector_width pvw,
+ enum prefer_vector_width move_max,
+ enum prefer_vector_width store_max,
bool add_nl_p, bool add_abi_p)
{
/* Flag options. */
@@ -542,10 +544,10 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
}
}
- /* Add -mprefer-vector-width= option. */
- if (pvw)
+ auto add_vector_width = [&opts, &num] (prefer_vector_width pvw,
+ const char *cmd)
{
- opts[num][0] = "-mprefer-vector-width=";
+ opts[num][0] = cmd;
switch ((int) pvw)
{
case PVW_AVX128:
@@ -563,7 +565,19 @@ ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
default:
gcc_unreachable ();
}
- }
+ };
+
+ /* Add -mprefer-vector-width= option. */
+ if (pvw)
+ add_vector_width (pvw, "-mprefer-vector-width=");
+
+ /* Add -mmove-max= option. */
+ if (move_max)
+ add_vector_width (move_max, "-mmove-max=");
+
+ /* Add -mstore-max= option. */
+ if (store_max)
+ add_vector_width (store_max, "-mstore-max=");
/* Any options? */
if (num == 0)
@@ -630,6 +644,7 @@ ix86_debug_options (void)
target_flags, ix86_target_flags,
ix86_arch_string, ix86_tune_string,
ix86_fpmath, prefer_vector_width_type,
+ ix86_move_max, ix86_store_max,
true, true);
if (opts)
@@ -892,7 +907,9 @@ ix86_function_specific_print (FILE *file, int indent,
= ix86_target_string (ptr->x_ix86_isa_flags, ptr->x_ix86_isa_flags2,
ptr->x_target_flags, ptr->x_ix86_target_flags,
NULL, NULL, ptr->x_ix86_fpmath,
- ptr->x_prefer_vector_width_type, false, true);
+ ptr->x_prefer_vector_width_type,
+ ptr->x_ix86_move_max, ptr->x_ix86_store_max,
+ false, true);
gcc_assert (ptr->arch < PROCESSOR_max);
fprintf (file, "%*sarch = %d (%s)\n",
@@ -1318,6 +1335,10 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
const char *orig_tune_string = opts->x_ix86_tune_string;
enum fpmath_unit orig_fpmath_set = opts_set->x_ix86_fpmath;
enum prefer_vector_width orig_pvw_set = opts_set->x_prefer_vector_width_type;
+ enum prefer_vector_width orig_ix86_move_max_set
+ = opts_set->x_ix86_move_max;
+ enum prefer_vector_width orig_ix86_store_max_set
+ = opts_set->x_ix86_store_max;
int orig_tune_defaulted = ix86_tune_defaulted;
int orig_arch_specified = ix86_arch_specified;
char *option_strings[IX86_FUNCTION_SPECIFIC_MAX] = { NULL, NULL };
@@ -1393,6 +1414,8 @@ ix86_valid_target_attribute_tree (tree fndecl, tree args,
opts->x_ix86_tune_string = orig_tune_string;
opts_set->x_ix86_fpmath = orig_fpmath_set;
opts_set->x_prefer_vector_width_type = orig_pvw_set;
+ opts_set->x_ix86_move_max = orig_ix86_move_max_set;
+ opts_set->x_ix86_store_max = orig_ix86_store_max_set;
opts->x_ix86_excess_precision = orig_ix86_excess_precision;
opts->x_ix86_unsafe_math_optimizations
= orig_ix86_unsafe_math_optimizations;
@@ -2667,6 +2690,48 @@ ix86_option_override_internal (bool main_args_p,
&& (opts_set->x_prefer_vector_width_type == PVW_NONE))
opts->x_prefer_vector_width_type = PVW_AVX256;
+ if (opts_set->x_ix86_move_max == PVW_NONE)
+ {
+ /* Set the maximum number of bits that can be moved from memory
+ to memory efficiently. */
+ if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
+ opts->x_ix86_move_max = PVW_AVX512;
+ else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
+ opts->x_ix86_move_max = PVW_AVX256;
+ else
+ {
+ opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
+ if (opts_set->x_ix86_move_max == PVW_NONE)
+ {
+ if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
+ opts->x_ix86_move_max = PVW_AVX512;
+ else
+ opts->x_ix86_move_max = PVW_AVX128;
+ }
+ }
+ }
+
+ if (opts_set->x_ix86_store_max == PVW_NONE)
+ {
+ /* Set the maximum number of bits that can be stored to
+ memory efficiently. */
+ if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
+ opts->x_ix86_store_max = PVW_AVX512;
+ else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
+ opts->x_ix86_store_max = PVW_AVX256;
+ else
+ {
+ opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
+ if (opts_set->x_ix86_store_max == PVW_NONE)
+ {
+ if (TARGET_AVX512F_P (opts->x_ix86_isa_flags))
+ opts->x_ix86_store_max = PVW_AVX512;
+ else
+ opts->x_ix86_store_max = PVW_AVX128;
+ }
+ }
+ }
+
if (opts->x_ix86_recip_name)
{
char *p = ASTRDUP (opts->x_ix86_recip_name);
@@ -26,8 +26,10 @@ char *ix86_target_string (HOST_WIDE_INT isa, HOST_WIDE_INT isa2,
int flags, int flags2,
const char *arch, const char *tune,
enum fpmath_unit fpmath,
- enum prefer_vector_width pvw, bool add_nl_p,
- bool add_abi_p);
+ enum prefer_vector_width pvw,
+ enum prefer_vector_width move_max,
+ enum prefer_vector_width store_max,
+ bool add_nl_p, bool add_abi_p);
extern enum attr_cpu ix86_schedule;
@@ -408,10 +408,6 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
#define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
-#define TARGET_AVX256_MOVE_BY_PIECES \
- ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES]
-#define TARGET_AVX256_STORE_BY_PIECES \
- ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES]
#define TARGET_AVX256_SPLIT_REGS \
ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
#define TARGET_GENERAL_REGS_SSE_SPILL \
@@ -1807,12 +1803,13 @@ typedef struct ix86_args {
MOVE_MAX_PIECES defaults to MOVE_MAX. */
#define MOVE_MAX \
- ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
+ ((TARGET_AVX512F \
+ && (ix86_move_max == PVW_AVX512 \
+ || ix86_store_max == PVW_AVX512)) \
? 64 \
: ((TARGET_AVX \
- && !TARGET_PREFER_AVX128 \
- && (TARGET_AVX256_MOVE_BY_PIECES \
- || TARGET_AVX256_STORE_BY_PIECES)) \
+ && (ix86_move_max >= PVW_AVX256 \
+ || ix86_store_max >= PVW_AVX256)) \
? 32 \
: ((TARGET_SSE2 \
&& TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
@@ -1825,11 +1822,10 @@ typedef struct ix86_args {
store_by_pieces of 16/32/64 bytes. */
#define STORE_MAX_PIECES \
(TARGET_INTER_UNIT_MOVES_TO_VEC \
- ? ((TARGET_AVX512F && !TARGET_PREFER_AVX256) \
+ ? ((TARGET_AVX512F && ix86_store_max == PVW_AVX512) \
? 64 \
: ((TARGET_AVX \
- && !TARGET_PREFER_AVX128 \
- && TARGET_AVX256_STORE_BY_PIECES) \
+ && ix86_store_max >= PVW_AVX256) \
? 32 \
: ((TARGET_SSE2 \
&& TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
@@ -624,6 +624,14 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
EnumValue
Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
+mmove-max=
+Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
+Maximum number of bits that can be moved from memory to memory efficiently.
+
+mstore-max=
+Target RejectNegative Joined Var(ix86_store_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
+Maximum number of bits that can be stored to memory efficiently.
+
;; ISA support
m32
@@ -512,6 +512,16 @@ DEF_TUNE (X86_TUNE_AVX256_MOVE_BY_PIECES, "avx256_move_by_pieces",
DEF_TUNE (X86_TUNE_AVX256_STORE_BY_PIECES, "avx256_store_by_pieces",
m_CORE_AVX512)
+/* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces with 512-bit
+ AVX instructions. */
+DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
+ m_SAPPHIRERAPIDS)
+
+/* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with 512-bit
+ AVX instructions. */
+DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
+ m_SAPPHIRERAPIDS)
+
/*****************************************************************************/
/*****************************************************************************/
/* Historical relics: tuning flags that helps a specific old CPU designs */
@@ -1393,6 +1393,7 @@ See RS/6000 and PowerPC Options.
-mcld -mcx16 -msahf -mmovbe -mcrc32 -mmwait @gol
-mrecip -mrecip=@var{opt} @gol
-mvzeroupper -mprefer-avx128 -mprefer-vector-width=@var{opt} @gol
+-mmove-max=@var{bits} -mstore-max=@var{bits} @gol
-mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx @gol
-mavx2 -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl @gol
-mavx512bw -mavx512dq -mavx512ifma -mavx512vbmi -msha -maes @gol
@@ -31848,6 +31849,18 @@ This option instructs GCC to use 128-bit AVX instructions instead of
This option instructs GCC to use @var{opt}-bit vector width in instructions
instead of default on the selected platform.
+@item -mmove-max=@var{bits}
+@opindex mmove-max
+This option instructs GCC to set the maximum number of bits that can be
+moved from memory to memory efficiently to @var{bits}. The valid
+values of @var{bits} are 128, 256 and 512.
+
+@item -mstore-max=@var{bits}
+@opindex mstore-max
+This option instructs GCC to set the maximum number of bits that can be
+stored to memory efficiently to @var{bits}. The valid values of
+@var{bits} are 128, 256 and 512.
+
@table @samp
@item none
No extra limitations applied to GCC other than defined by the selected platform.
new file mode 100644
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mmove-max=512" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 2 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids -mmove-max=128 -mstore-max=128" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 8 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids -mmove-max=256 -mstore-max=256" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+ __builtin_memcpy (dst, src, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:64|)\[ \\t\]+\[^\n\]*%ymm" 4 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=x86-64 -mprefer-vector-width=256 -mavx512f -mtune-ctrl=avx512_store_by_pieces" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+ __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu64\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+ __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu8\[ \\t\]+\[^\n\]*%zmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids -mstore-max=128" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+ __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=sapphirerapids -mstore-max=256" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+ __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* { dg-final { scan-assembler-times "vmovw\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
new file mode 100644
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" } */
+
+extern char *dst;
+
+void
+foo (void)
+{
+ __builtin_memset (dst, 3, 66);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu(?:8|)\[ \\t\]+\[^\n\]*%ymm" 2 } } */
+/* No need to dynamically realign the stack here. */
+/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
+/* Nor use a frame pointer. */
+/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
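
The instruction counts expected by the scan-assembler-times directives in the tests above follow from simple arithmetic on the 66-byte size. The block below is a rough sketch of that arithmetic, not a model of GCC's by-pieces expansion; the helper names are hypothetical, and it assumes each full-width piece costs one load plus one store for memcpy and one store for memset, with the 2-byte tail handled by a narrower move.

```c
#include <assert.h>
#include <stddef.h>

/* Number of full-width vector pieces that fit in LEN bytes when the
   widest piece is WIDTH_BYTES (16 for xmm, 32 for ymm, 64 for zmm).  */
static size_t
full_vector_pieces (size_t len, size_t width_bytes)
{
  assert (width_bytes != 0);
  return len / width_bytes;
}

/* Bytes left over after the full-width pieces.  */
static size_t
tail_bytes (size_t len, size_t width_bytes)
{
  assert (width_bytes != 0);
  return len % width_bytes;
}

/* Assumed cost model: memcpy needs a load and a store per piece.  */
static size_t
memcpy_vector_insns (size_t len, size_t width_bytes)
{
  return 2 * full_vector_pieces (len, width_bytes);
}

/* Assumed cost model: memset needs only a store per piece.  */
static size_t
memset_vector_insns (size_t len, size_t width_bytes)
{
  return full_vector_pieces (len, width_bytes);
}
```

For len = 66 this gives 2, 4 and 8 vector moves for memcpy at 512, 256 and 128 bits, and 1, 2 and 4 vector stores for memset, with a 2-byte tail in every case. These are the counts the tests scan for; on Sapphire Rapids the memset tests additionally expect one vmovw for the tail.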