[i386] Add extra cost for unsigned_load which may have stall forward issue.

Message ID 20220316021934.106345-1-hongtao.liu@intel.com
State New
Headers
Series [i386] Add extra cost for unsigned_load which may have stall forward issue. |

Commit Message

Liu, Hongtao March 16, 2022, 2:19 a.m. UTC
  This patch only handle pure-slp for by-value passed parameter which
has nothing to do with IPA but psABI. For by-reference passed
parameter IPA is required.

The patch is aggressive in determining STLF failure, any
unaligned_load for parm_decl passed by stack is thought to have STLF
stall issue. It could lose some perf where there's no such issue(1
vector_load vs n scalar_load + CTOR).

According to microbenchmark in PR, cost of STLF failure is generally
between 8 scalar_loads and 16 scalar loads on most latest Intel/AMD
processors.

gcc/ChangeLog:

	PR target/101908
	* config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
	(ix86_vector_costs::add_stmt_cost): Add extra cost for
	unsigned_load which may have store forwarding stall issue.
	* config/i386/i386.h (processor_costs): Add new member
	stfs.
	* config/i386/x86-tune-costs.h (i386_size_cost): Initialize
	stfs.
	(i386_cost, i486_cost, pentium_cost, lakemont_cost,
	pentiumpro_cost, geode_cost, k6_cost, athlon_cost, k8_cost,
	amdfam10_cost, bdver_cost, znver1_cost, znver2_cost,
	znver3_cost, skylake_cost, icelake_cost, alderlake_cost,
	btver1_cost, btver2_cost, pentium4_cost, nocano_cost,
	atom_cost, slm_cost, tremont_cost, intel_cost, generic_cost,
	core_cost): Ditto.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr101908-1.c: New test.
	* gcc.target/i386/pr101908-2.c: New test.
	* gcc.target/i386/pr101908-3.c: New test.
	* gcc.target/i386/pr101908-v16hi.c: New test.
	* gcc.target/i386/pr101908-v16qi.c: New test.
	* gcc.target/i386/pr101908-v16sf.c: New test.
	* gcc.target/i386/pr101908-v16si.c: New test.
	* gcc.target/i386/pr101908-v2df.c: New test.
	* gcc.target/i386/pr101908-v2di.c: New test.
	* gcc.target/i386/pr101908-v2hi.c: New test.
	* gcc.target/i386/pr101908-v2qi.c: New test.
	* gcc.target/i386/pr101908-v2sf.c: New test.
	* gcc.target/i386/pr101908-v2si.c: New test.
	* gcc.target/i386/pr101908-v4df.c: New test.
	* gcc.target/i386/pr101908-v4di.c: New test.
	* gcc.target/i386/pr101908-v4hi.c: New test.
	* gcc.target/i386/pr101908-v4qi.c: New test.
	* gcc.target/i386/pr101908-v4sf.c: New test.
	* gcc.target/i386/pr101908-v4si.c: New test.
	* gcc.target/i386/pr101908-v8df-adl.c: New test.
	* gcc.target/i386/pr101908-v8df.c: New test.
	* gcc.target/i386/pr101908-v8di-adl.c: New test.
	* gcc.target/i386/pr101908-v8di.c: New test.
	* gcc.target/i386/pr101908-v8hi-adl.c: New test.
	* gcc.target/i386/pr101908-v8hi.c: New test.
	* gcc.target/i386/pr101908-v8qi-adl.c: New test.
	* gcc.target/i386/pr101908-v8qi.c: New test.
	* gcc.target/i386/pr101908-v8sf-adl.c: New test.
	* gcc.target/i386/pr101908-v8sf.c: New test.
	* gcc.target/i386/pr101908-v8si-adl.c: New test.
	* gcc.target/i386/pr101908-v8si.c: New test.
---
 gcc/config/i386/i386.cc                       | 51 +++++++++++
 gcc/config/i386/i386.h                        |  1 +
 gcc/config/i386/x86-tune-costs.h              | 28 ++++++
 gcc/testsuite/gcc.target/i386/pr101908-1.c    | 12 +++
 gcc/testsuite/gcc.target/i386/pr101908-2.c    | 12 +++
 gcc/testsuite/gcc.target/i386/pr101908-3.c    | 90 +++++++++++++++++++
 .../gcc.target/i386/pr101908-v16hi.c          |  6 ++
 .../gcc.target/i386/pr101908-v16qi.c          | 30 +++++++
 .../gcc.target/i386/pr101908-v16sf.c          |  6 ++
 .../gcc.target/i386/pr101908-v16si.c          |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2df.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2di.c |  7 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2hi.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2qi.c | 16 ++++
 gcc/testsuite/gcc.target/i386/pr101908-v2sf.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2si.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4df.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4di.c |  7 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4hi.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4qi.c | 18 ++++
 gcc/testsuite/gcc.target/i386/pr101908-v4sf.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4si.c |  6 ++
 .../gcc.target/i386/pr101908-v8df-adl.c       |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8df.c |  6 ++
 .../gcc.target/i386/pr101908-v8di-adl.c       |  7 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8di.c |  7 ++
 .../gcc.target/i386/pr101908-v8hi-adl.c       |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8hi.c |  6 ++
 .../gcc.target/i386/pr101908-v8qi-adl.c       | 22 +++++
 gcc/testsuite/gcc.target/i386/pr101908-v8qi.c | 22 +++++
 .../gcc.target/i386/pr101908-v8sf-adl.c       |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8sf.c |  6 ++
 .../gcc.target/i386/pr101908-v8si-adl.c       |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8si.c |  6 ++
 34 files changed, 444 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16si.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2di.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2si.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4di.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4si.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si.c
  

Comments

Richard Biener March 16, 2022, 9:54 a.m. UTC | #1
On Wed, Mar 16, 2022 at 3:19 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> This patch only handle pure-slp for by-value passed parameter which
> has nothing to do with IPA but psABI. For by-reference passed
> parameter IPA is required.
>
> The patch is aggressive in determining STLF failure, any
> unaligned_load for parm_decl passed by stack is thought to have STLF
> stall issue. It could lose some perf where there's no such issue(1
> vector_load vs n scalar_load + CTOR).
>
> According to microbenchmark in PR, cost of STLF failure is generally
> between 8 scalar_loads and 16 scalar loads on most latest Intel/AMD
> processors.
>
> gcc/ChangeLog:
>
>         PR target/101908
>         * config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
>         (ix86_vector_costs::add_stmt_cost): Add extra cost for
>         unsigned_load which may have store forwarding stall issue.
>         * config/i386/i386.h (processor_costs): Add new member
>         stfs.
>         * config/i386/x86-tune-costs.h (i386_size_cost): Initialize
>         stfs.
>         (i386_cost, i486_cost, pentium_cost, lakemont_cost,
>         pentiumpro_cost, geode_cost, k6_cost, athlon_cost, k8_cost,
>         amdfam10_cost, bdver_cost, znver1_cost, znver2_cost,
>         znver3_cost, skylake_cost, icelake_cost, alderlake_cost,
>         btver1_cost, btver2_cost, pentium4_cost, nocano_cost,
>         atom_cost, slm_cost, tremont_cost, intel_cost, generic_cost,
>         core_cost): Ditto.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr101908-1.c: New test.
>         * gcc.target/i386/pr101908-2.c: New test.
>         * gcc.target/i386/pr101908-3.c: New test.
>         * gcc.target/i386/pr101908-v16hi.c: New test.
>         * gcc.target/i386/pr101908-v16qi.c: New test.
>         * gcc.target/i386/pr101908-v16sf.c: New test.
>         * gcc.target/i386/pr101908-v16si.c: New test.
>         * gcc.target/i386/pr101908-v2df.c: New test.
>         * gcc.target/i386/pr101908-v2di.c: New test.
>         * gcc.target/i386/pr101908-v2hi.c: New test.
>         * gcc.target/i386/pr101908-v2qi.c: New test.
>         * gcc.target/i386/pr101908-v2sf.c: New test.
>         * gcc.target/i386/pr101908-v2si.c: New test.
>         * gcc.target/i386/pr101908-v4df.c: New test.
>         * gcc.target/i386/pr101908-v4di.c: New test.
>         * gcc.target/i386/pr101908-v4hi.c: New test.
>         * gcc.target/i386/pr101908-v4qi.c: New test.
>         * gcc.target/i386/pr101908-v4sf.c: New test.
>         * gcc.target/i386/pr101908-v4si.c: New test.
>         * gcc.target/i386/pr101908-v8df-adl.c: New test.
>         * gcc.target/i386/pr101908-v8df.c: New test.
>         * gcc.target/i386/pr101908-v8di-adl.c: New test.
>         * gcc.target/i386/pr101908-v8di.c: New test.
>         * gcc.target/i386/pr101908-v8hi-adl.c: New test.
>         * gcc.target/i386/pr101908-v8hi.c: New test.
>         * gcc.target/i386/pr101908-v8qi-adl.c: New test.
>         * gcc.target/i386/pr101908-v8qi.c: New test.
>         * gcc.target/i386/pr101908-v8sf-adl.c: New test.
>         * gcc.target/i386/pr101908-v8sf.c: New test.
>         * gcc.target/i386/pr101908-v8si-adl.c: New test.
>         * gcc.target/i386/pr101908-v8si.c: New test.
> ---
>  gcc/config/i386/i386.cc                       | 51 +++++++++++
>  gcc/config/i386/i386.h                        |  1 +
>  gcc/config/i386/x86-tune-costs.h              | 28 ++++++
>  gcc/testsuite/gcc.target/i386/pr101908-1.c    | 12 +++
>  gcc/testsuite/gcc.target/i386/pr101908-2.c    | 12 +++
>  gcc/testsuite/gcc.target/i386/pr101908-3.c    | 90 +++++++++++++++++++
>  .../gcc.target/i386/pr101908-v16hi.c          |  6 ++
>  .../gcc.target/i386/pr101908-v16qi.c          | 30 +++++++
>  .../gcc.target/i386/pr101908-v16sf.c          |  6 ++
>  .../gcc.target/i386/pr101908-v16si.c          |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2df.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2di.c |  7 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2hi.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2qi.c | 16 ++++
>  gcc/testsuite/gcc.target/i386/pr101908-v2sf.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v2si.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4df.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4di.c |  7 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4hi.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4qi.c | 18 ++++
>  gcc/testsuite/gcc.target/i386/pr101908-v4sf.c |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v4si.c |  6 ++
>  .../gcc.target/i386/pr101908-v8df-adl.c       |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8df.c |  6 ++
>  .../gcc.target/i386/pr101908-v8di-adl.c       |  7 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8di.c |  7 ++
>  .../gcc.target/i386/pr101908-v8hi-adl.c       |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8hi.c |  6 ++
>  .../gcc.target/i386/pr101908-v8qi-adl.c       | 22 +++++
>  gcc/testsuite/gcc.target/i386/pr101908-v8qi.c | 22 +++++
>  .../gcc.target/i386/pr101908-v8sf-adl.c       |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8sf.c |  6 ++
>  .../gcc.target/i386/pr101908-v8si-adl.c       |  6 ++
>  gcc/testsuite/gcc.target/i386/pr101908-v8si.c |  6 ++
>  34 files changed, 444 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16si.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2df.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2di.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2si.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4df.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4di.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4si.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d77ad83e437..c01809cc3da 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22988,6 +22988,46 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
>    return default_noce_conversion_profitable_p (seq, if_info);
>  }
>
> +/* Return true if REF may have STF issue, otherwise false.
> +   Any unaligned_load from parm_decl which is passed by stack
> +   is considered to have STLF stall issue.  */
> +static bool
> +ix86_load_maybe_stfs_p (data_reference* dr)
> +{
> +  tree addr = DR_BASE_ADDRESS (dr);
> +  if (TREE_CODE (addr) != ADDR_EXPR)
> +    return false;
> +  addr = get_base_address (TREE_OPERAND (addr, 0));
> +
> +  if (TREE_CODE (addr) != PARM_DECL)
> +    return false;
> +  tree type = TREE_TYPE (addr);
> +  if (!type)

type should never be NULL

> +    return false;
> +
> +  machine_mode mode = TYPE_MODE (type);
> +
> +  /* There could be false positive in determine parameter passed by stack.
> +     .i.e. parameter can be put in registers but finally passed by stack
> +     because registers are ran out.  */
> +  if (TARGET_64BIT)
> +    {
> +      /* From function_arg_64.  */
> +      enum x86_64_reg_class regclass[MAX_CLASSES];
> +      int zero_width_bitfields = 0;
> +      return !classify_argument (mode, type, regclass, 0, zero_width_bitfields);
> +    }
> +  else
> +    {
> +      /* From function_arg_32.  */
> +      return (mode == E_BLKmode
> +             || (AGGREGATE_TYPE_P (type)
> +                 && (VECTOR_MODE_P (mode) || mode == TImode)));
> +    }
> +
> +  return false;

that stmt is unreachable.

> +}
> +
>  /* x86-specific vector costs.  */
>  class ix86_vector_costs : public vector_costs
>  {
> @@ -23218,6 +23258,17 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
> +  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
> +     Performance may lose when there's no STF issue(1 vector_load vs n
> +     scalar_load + CTOR).
> +     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine

cost(2000) is no longer there

> +     tuned.  */
> +  if (kind == unaligned_load && stmt_info
> +      && stmt_info->slp_type == pure_slp

You want to restrict this to BB vectorization?  pure_slp isn't exactly that,
instead you can do

           && is_a <bb_vec_info> (m_vinfo)

> +      && STMT_VINFO_DATA_REF (stmt_info)
> +      && ix86_load_maybe_stfs_p (STMT_VINFO_DATA_REF (stmt_info)))
> +    stmt_cost += COSTS_N_INSNS (ix86_cost->stfs / 2);

I wonder why we divide stfs by two?

I'd suggest an additional check, that the DR is close to function start.  One
possible check that occurs to me is to check

  STMT_VINFO_DR_INFO (stmt_info)->group == 0

that will for example avoid the penalty for

struct Y y;
void foo (struct X x)
{
  bar();
  y.a = x.a;
  y.b = x.b;
}

but also (maybe not wanted) when the access happens after control
flow transfer like with

struct Y y;
void foo (struct X x, int flag)
{
  if (flag)
   {
    y.a = x.a;
    y.b = x.b;
   }
}

I think we should be conservative with what we pessimize until we have
evidence that we need to include more cases, also since this after-the-fact
handling of the issue in costing is sub-optimal.  Ideally the vectorizer itself
would decide the vectorize the load in a way to avoid STLF fails, but that's
nothing we can easily arrange for at this stage.

Another option could be to split such loads during md-reorg where we could
somehow "count" the latency from function entry, only scanning paths from
there up to a point where the store buffer is likely not drained (with
a different
target cost parameter?) and only scanning not optimize_for_size BBs.  That
might be a better place to do after-the-fact adjustments (the cost adjustment
won't avoid the STLF fail if the rest of the vectorization compensates the
penalty).

Richard.

> +
>    /* Penalize DFmode vector operations for Bonnell.  */
>    if (TARGET_CPU_P (BONNELL) && kind == vector_stmt
>        && vectype && GET_MODE_INNER (TYPE_MODE (vectype)) == DFmode)
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 0d28e57f8f2..341f1c47981 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -168,6 +168,7 @@ struct processor_costs {
>                                    in 32bit, 64bit, 128bit, 256bit and 512bit */
>    const int sse_unaligned_load[5];/* cost of unaligned load.  */
>    const int sse_unaligned_store[5];/* cost of unaligned store.  */
> +  const int stfs;               /* cost of store forward stalls.  */
>    const int xmm_move, ymm_move, /* cost of moving XMM and YMM register.  */
>             zmm_move;
>    const int sse_to_integer;    /* cost of moving SSE register to integer.  */
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 017ffa69958..3a5fcdeefdd 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -100,6 +100,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>                                            in 128bit, 256bit and 512bit */
>    {3, 3, 3, 3, 3},                     /* cost of unaligned SSE store
>                                            in 128bit, 256bit and 512bit */
> +  6,                                   /* cost of store forward stall.  */
>    3, 3, 3,                             /* cost of moving XMM,YMM,ZMM register */
>    3,                                   /* cost of moving SSE register to integer.  */
>    5, 0,                                        /* Gather load static, per_elt.  */
> @@ -209,6 +210,7 @@ struct processor_costs i386_cost = {        /* 386 specific costs */
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> +  8,                                   /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    3,                                   /* cost of moving SSE register to integer.  */
>    4, 4,                                        /* Gather load static, per_elt.  */
> @@ -317,6 +319,7 @@ struct processor_costs i486_cost = {        /* 486 specific costs */
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> +  8,                                   /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    3,                                   /* cost of moving SSE register to integer.  */
>    4, 4,                                        /* Gather load static, per_elt.  */
> @@ -427,6 +430,7 @@ struct processor_costs pentium_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> +  8,                                   /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    3,                                   /* cost of moving SSE register to integer.  */
>    4, 4,                                        /* Gather load static, per_elt.  */
> @@ -528,6 +532,7 @@ struct processor_costs lakemont_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> +  8,                                   /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    3,                                   /* cost of moving SSE register to integer.  */
>    4, 4,                                        /* Gather load static, per_elt.  */
> @@ -644,6 +649,7 @@ struct processor_costs pentiumpro_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
>    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> +  24,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    3,                                   /* cost of moving SSE register to integer.  */
>    4, 4,                                        /* Gather load static, per_elt.  */
> @@ -751,6 +757,7 @@ struct processor_costs geode_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {2, 2, 8, 16, 32},                   /* cost of unaligned loads.  */
>    {2, 2, 8, 16, 32},                   /* cost of unaligned stores.  */
> +  14,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    6,                                   /* cost of moving SSE register to integer.  */
>    2, 2,                                        /* Gather load static, per_elt.  */
> @@ -858,6 +865,7 @@ struct processor_costs k6_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {2, 2, 8, 16, 32},                   /* cost of unaligned loads.  */
>    {2, 2, 8, 16, 32},                   /* cost of unaligned stores.  */
> +  24,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    6,                                   /* cost of moving SSE register to integer.  */
>    2, 2,                                        /* Gather load static, per_elt.  */
> @@ -971,6 +979,7 @@ struct processor_costs athlon_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {4, 4, 12, 12, 24},                  /* cost of unaligned loads.  */
>    {4, 4, 10, 10, 20},                  /* cost of unaligned stores.  */
> +  14,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    5,                                   /* cost of moving SSE register to integer.  */
>    4, 4,                                        /* Gather load static, per_elt.  */
> @@ -1086,6 +1095,7 @@ struct processor_costs k8_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {4, 3, 12, 12, 24},                  /* cost of unaligned loads.  */
>    {4, 4, 10, 10, 20},                  /* cost of unaligned stores.  */
> +  14,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    5,                                   /* cost of moving SSE register to integer.  */
>    4, 4,                                        /* Gather load static, per_elt.  */
> @@ -1214,6 +1224,7 @@ struct processor_costs amdfam10_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {4, 4, 3, 7, 12},                    /* cost of unaligned loads.  */
>    {4, 4, 5, 10, 20},                   /* cost of unaligned stores.  */
> +  21,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    3,                                   /* cost of moving SSE register to integer.  */
>    4, 4,                                        /* Gather load static, per_elt.  */
> @@ -1334,6 +1345,7 @@ const struct processor_costs bdver_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {12, 12, 10, 40, 60},                        /* cost of unaligned loads.  */
>    {10, 10, 10, 40, 60},                        /* cost of unaligned stores.  */
> +  54,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    16,                                  /* cost of moving SSE register to integer.  */
>    12, 12,                              /* Gather load static, per_elt.  */
> @@ -1475,6 +1487,7 @@ struct processor_costs znver1_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 12, 24},                   /* cost of unaligned loads.  */
>    {8, 8, 8, 16, 32},                   /* cost of unaligned stores.  */
> +  42,                                  /* cost of store forward stall.  */
>    2, 3, 6,                             /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                   /* cost of moving SSE register to integer.  */
>    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> @@ -1630,6 +1643,7 @@ struct processor_costs znver2_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
>    {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> +  42,                                  /* cost of store forward stall.  */
>    2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
>                                            register.  */
>    6,                                   /* cost of moving SSE register to integer.  */
> @@ -1762,6 +1776,7 @@ struct processor_costs znver3_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
>    {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> +  42,                                  /* cost of store forward stall.  */
>    2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
>                                            register.  */
>    6,                                   /* cost of moving SSE register to integer.  */
> @@ -1907,6 +1922,7 @@ struct processor_costs skylake_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 10, 20},                   /* cost of unaligned loads.  */
>    {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> +  26,                                  /* cost of store forward stall.  */
>    2, 2, 4,                             /* cost of moving XMM,YMM,ZMM register */
>    6,                                   /* cost of moving SSE register to integer.  */
>    20, 8,                               /* Gather load static, per_elt.  */
> @@ -2033,6 +2049,7 @@ struct processor_costs icelake_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 10, 20},                   /* cost of unaligned loads.  */
>    {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> +  26,                                  /* cost of store forward stall.  */
>    2, 2, 4,                             /* cost of moving XMM,YMM,ZMM register */
>    6,                                   /* cost of moving SSE register to integer.  */
>    20, 8,                               /* Gather load static, per_elt.  */
> @@ -2153,6 +2170,7 @@ struct processor_costs alderlake_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 10, 15},                   /* cost of unaligned loads.  */
>    {6, 6, 6, 10, 15},                   /* cost of unaligned storess.  */
> +  90,                                  /* cost of store forward stall.  */
>    2, 3, 4,                             /* cost of moving XMM,YMM,ZMM register */
>    6,                                   /* cost of moving SSE register to integer.  */
>    18, 6,                               /* Gather load static, per_elt.  */
> @@ -2266,6 +2284,7 @@ const struct processor_costs btver1_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {10, 10, 12, 48, 96},                        /* cost of unaligned loads.  */
>    {10, 10, 12, 48, 96},                        /* cost of unaligned stores.  */
> +  36,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    14,                                  /* cost of moving SSE register to integer.  */
>    10, 10,                              /* Gather load static, per_elt.  */
> @@ -2376,6 +2395,7 @@ const struct processor_costs btver2_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {10, 10, 12, 48, 96},                        /* cost of unaligned loads.  */
>    {10, 10, 12, 48, 96},                        /* cost of unaligned stores.  */
> +  36,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    14,                                  /* cost of moving SSE register to integer.  */
>    10, 10,                              /* Gather load static, per_elt.  */
> @@ -2485,6 +2505,7 @@ struct processor_costs pentium4_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {32, 32, 32, 64, 128},               /* cost of unaligned loads.  */
>    {32, 32, 32, 64, 128},               /* cost of unaligned stores.  */
> +  10,                                  /* cost of store forward stall.  */
>    12, 24, 48,                          /* cost of moving XMM,YMM,ZMM register */
>    20,                                  /* cost of moving SSE register to integer.  */
>    16, 16,                              /* Gather load static, per_elt.  */
> @@ -2597,6 +2618,7 @@ struct processor_costs nocona_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {24, 24, 24, 48, 96},                        /* cost of unaligned loads.  */
>    {24, 24, 24, 48, 96},                        /* cost of unaligned stores.  */
> +  8,                                   /* cost of store forward stall.  */
>    6, 12, 24,                           /* cost of moving XMM,YMM,ZMM register */
>    20,                                  /* cost of moving SSE register to integer.  */
>    12, 12,                              /* Gather load static, per_elt.  */
> @@ -2707,6 +2729,7 @@ struct processor_costs atom_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {16, 16, 16, 32, 64},                        /* cost of unaligned loads.  */
>    {16, 16, 16, 32, 64},                        /* cost of unaligned stores.  */
> +  32,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    8,                                   /* cost of moving SSE register to integer.  */
>    8, 8,                                        /* Gather load static, per_elt.  */
> @@ -2817,6 +2840,7 @@ struct processor_costs slm_cost = {
>                                            in SImode, DImode and TImode.  */
>    {16, 16, 16, 32, 64},                        /* cost of unaligned loads.  */
>    {16, 16, 16, 32, 64},                        /* cost of unaligned stores.  */
> +  48,                                  /* cost of store forward stall.  */
>    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
>    8,                                   /* cost of moving SSE register to integer.  */
>    8, 8,                                        /* Gather load static, per_elt.  */
> @@ -2939,6 +2963,7 @@ struct processor_costs tremont_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 10, 15},                   /* cost of unaligned loads.  */
>    {6, 6, 6, 10, 15},                   /* cost of unaligned storess.  */
> +  42,                                  /* cost of store forward stall.  */
>    2, 3, 4,                             /* cost of moving XMM,YMM,ZMM register */
>    6,                                   /* cost of moving SSE register to integer.  */
>    18, 6,                               /* Gather load static, per_elt.  */
> @@ -3051,6 +3076,7 @@ struct processor_costs intel_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {10, 10, 10, 10, 10},                        /* cost of unaligned loads.  */
>    {10, 10, 10, 10, 10},                        /* cost of unaligned loads.  */
> +  22,                                  /* cost of store forward stall.  */
>    2, 2, 2,                             /* cost of moving XMM,YMM,ZMM register */
>    4,                                   /* cost of moving SSE register to integer.  */
>    6, 6,                                        /* Gather load static, per_elt.  */
> @@ -3168,6 +3194,7 @@ struct processor_costs generic_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 10, 15},                   /* cost of unaligned loads.  */
>    {6, 6, 6, 10, 15},                   /* cost of unaligned storess.  */
> +  54,                                  /* cost of store forward stall.  */
>    2, 3, 4,                             /* cost of moving XMM,YMM,ZMM register */
>    6,                                   /* cost of moving SSE register to integer.  */
>    18, 6,                               /* Gather load static, per_elt.  */
> @@ -3291,6 +3318,7 @@ struct processor_costs core_cost = {
>                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
>    {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
>    {6, 6, 6, 6, 12},                    /* cost of unaligned stores.  */
> +  26,                                  /* cost of store forward stall.  */
>    2, 2, 4,                             /* cost of moving XMM,YMM,ZMM register */
>    2,                                   /* cost of moving SSE register to integer.  */
>    /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> new file mode 100644
> index 00000000000..f8e0f2e26bb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +struct X { double x[2]; };
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df __attribute__((noipa))
> +foo (struct X* x, struct X* y)
> +{
> +  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> new file mode 100644
> index 00000000000..f4ff7a83c82
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +struct X { double x[4]; };
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df __attribute__((noipa))
> +foo (struct X x, struct X y)
> +{
> +  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-3.c b/gcc/testsuite/gcc.target/i386/pr101908-3.c
> new file mode 100644
> index 00000000000..6f853aa7750
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-3.c
> @@ -0,0 +1,90 @@
> +/* PR target/101908.  */
> +/* { dg-do compile } */
> +/* { dg-options "-march=x86-64 -O2 -mtune=generic -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not "add new stmt:.*MEM \<vector(2) double\>.*ray + 24B" "slp2" } }  */
> +/* This testcase is used to avoid STLF stall.  */
> +
> +#define sqrt __builtin_sqrt
> +#define SQ(x)          ((x) * (x))
> +struct vec3 {
> +  double x, y, z;
> +};
> +
> +struct ray {
> +  struct vec3 orig, dir;
> +};
> +
> +struct material {
> +  struct vec3 col;     /* color */
> +  double spow;         /* specular power */
> +  double refl;         /* reflection intensity */
> +};
> +
> +struct sphere {
> +  struct vec3 pos;
> +  double rad;
> +  struct material mat;
> +  struct sphere *next;
> +};
> +
> +struct spoint {
> +  struct vec3 pos, normal, vref;       /* position, normal and view reflection */
> +  double dist;         /* parametric distance of intersection along the ray */
> +};
> +
> +#define ERR_MARGIN             1e-6
> +
> +#define DOT(a, b)      ((a).x * (b).x + (a).y * (b).y + (a).z * (b).z)
> +#define NORMALIZE(a)  do {                     \
> +    double len = sqrt(DOT(a, a));              \
> +    (a).x /= len; (a).y /= len; (a).z /= len;  \
> +  } while(0);
> +
> +static struct vec3
> +reflect(struct vec3 v, struct vec3 n) {
> +  struct vec3 res;
> +  double dot = v.x * n.x + v.y * n.y + v.z * n.z;
> +  res.x = -(2.0 * dot * n.x - v.x);
> +  res.y = -(2.0 * dot * n.y - v.y);
> +  res.z = -(2.0 * dot * n.z - v.z);
> +  return res;
> +}
> +
> +int ray_sphere(const struct sphere *sph,
> +              struct ray ray, struct spoint *sp) {
> +  double a, b, c, d, sqrt_d, t1, t2;
> +
> +  a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z);
> +  b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
> +    2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
> +    2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
> +  c = SQ(sph->pos.x) + SQ(sph->pos.y) + SQ(sph->pos.z) +
> +    SQ(ray.orig.x) + SQ(ray.orig.y) + SQ(ray.orig.z) +
> +    2.0 * (-sph->pos.x * ray.orig.x - sph->pos.y * ray.orig.y - sph->pos.z * ray.orig.z) - SQ(sph->rad);
> +
> +  if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
> +
> +  sqrt_d = sqrt(d);
> +  t1 = (-b + sqrt_d) / (2.0 * a);
> +  t2 = (-b - sqrt_d) / (2.0 * a);
> +
> +  if((t1 < ERR_MARGIN && t2 < ERR_MARGIN) || (t1 > 1.0 && t2 > 1.0)) return 0;
> +
> +  if(sp) {
> +    if(t1 < ERR_MARGIN) t1 = t2;
> +    if(t2 < ERR_MARGIN) t2 = t1;
> +    sp->dist = t1 < t2 ? t1 : t2;
> +
> +    sp->pos.x = ray.orig.x + ray.dir.x * sp->dist;
> +    sp->pos.y = ray.orig.y + ray.dir.y * sp->dist;
> +    sp->pos.z = ray.orig.z + ray.dir.z * sp->dist;
> +
> +    sp->normal.x = (sp->pos.x - sph->pos.x) / sph->rad;
> +    sp->normal.y = (sp->pos.y - sph->pos.y) / sph->rad;
> +    sp->normal.z = (sp->pos.z - sph->pos.z) / sph->rad;
> +
> +    sp->vref = reflect(ray.dir, sp->normal);
> +    NORMALIZE(sp->vref);
> +  }
> +  return 1;
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
> new file mode 100644
> index 00000000000..fcd3ee8122f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) short int\>} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v16qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
> new file mode 100644
> index 00000000000..6d43788600e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3  -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) char\>} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[0] = x.a[1] + y.a[1];
> +  p[1] = x.a[2] + y.a[2];
> +  p[2] = x.a[3] + y.a[3];
> +  p[3] = x.a[4] + y.a[4];
> +  p[4] = x.a[5] + y.a[5];
> +  p[5] = x.a[6] + y.a[6];
> +  p[6] = x.a[7] + y.a[7];
> +  p[7] = x.a[8] + y.a[8];
> +  p[8] = x.a[9] + y.a[9];
> +  p[9] = x.a[10] + y.a[10];
> +  p[10] = x.a[11] + y.a[11];
> +  p[11] = x.a[12] + y.a[12];
> +  p[12] = x.a[13] + y.a[13];
> +  p[13] = x.a[14] + y.a[14];
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
> new file mode 100644
> index 00000000000..f95b85abbc6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) float\>} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v16qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16si.c b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
> new file mode 100644
> index 00000000000..5c48aa5da69
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) int\>} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v16qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2df.c b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
> new file mode 100644
> index 00000000000..9d3f157718c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +#define TYPE double
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2di.c b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
> new file mode 100644
> index 00000000000..c7cf9a71f21
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) long long int\>} "slp2" } } */
> +
> +typedef long long int64_t;
> +#define TYPE int64_t
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
> new file mode 100644
> index 00000000000..e6024d70780
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) short int\>} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
> new file mode 100644
> index 00000000000..cf876cc70d4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) char\>} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
> new file mode 100644
> index 00000000000..eb6349b957e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) float\>} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2si.c b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
> new file mode 100644
> index 00000000000..ae5fa0749c6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) int\>} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v2qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4df.c b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
> new file mode 100644
> index 00000000000..94497422704
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) double\>} "slp2" } } */
> +
> +#define TYPE double
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4di.c b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
> new file mode 100644
> index 00000000000..71407aa9fc7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) long long int\>} "slp2" } } */
> +
> +typedef long long int64_t;
> +#define TYPE int64_t
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
> new file mode 100644
> index 00000000000..4b207b91225
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) short int\>} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
> new file mode 100644
> index 00000000000..5292d3442ec
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) char\>} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[12] = x.a[13] + y.a[13];
> +  p[13] = x.a[14] + y.a[14];
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
> new file mode 100644
> index 00000000000..a2c6273120d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) float\>} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4si.c b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
> new file mode 100644
> index 00000000000..c6824285c74
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) int\>} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v4qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
> new file mode 100644
> index 00000000000..248c6d0fb91
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) double\>} "slp2" } } */
> +
> +#define TYPE double
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
> new file mode 100644
> index 00000000000..05eb2dd51d0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) double\>} "slp2" } } */
> +
> +#define TYPE double
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
> new file mode 100644
> index 00000000000..b0055d7d2c0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) long long int\>} "slp2" } } */
> +
> +typedef long long int64_t;
> +#define TYPE int64_t
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
> new file mode 100644
> index 00000000000..76a393bcc6c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) long long int\>} "slp2" } } */
> +
> +typedef long long int64_t;
> +#define TYPE int64_t
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
> new file mode 100644
> index 00000000000..28977adae28
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) short int\>} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v8qi-adl.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
> new file mode 100644
> index 00000000000..89b50885366
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) short int\>} "slp2" } } */
> +
> +#define TYPE short
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
> new file mode 100644
> index 00000000000..be668e5d006
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) char\>} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[8] = x.a[9] + y.a[9];
> +  p[9] = x.a[10] + y.a[10];
> +  p[10] = x.a[11] + y.a[11];
> +  p[11] = x.a[12] + y.a[12];
> +  p[12] = x.a[13] + y.a[13];
> +  p[13] = x.a[14] + y.a[14];
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
> new file mode 100644
> index 00000000000..842c88c8952
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) char\>} "slp2" } } */
> +
> +#ifndef TYPE
> +#define TYPE char
> +#endif
> +
> +struct X { TYPE a[128]; };
> +
> +void __attribute__((noipa))
> +foo16 (struct X x, struct X y, TYPE* __restrict p)
> +{
> +  p[8] = x.a[9] + y.a[9];
> +  p[9] = x.a[10] + y.a[10];
> +  p[10] = x.a[11] + y.a[11];
> +  p[11] = x.a[12] + y.a[12];
> +  p[12] = x.a[13] + y.a[13];
> +  p[13] = x.a[14] + y.a[14];
> +  p[14] = x.a[15] + y.a[15];
> +  p[15] = x.a[16] + y.a[16];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
> new file mode 100644
> index 00000000000..89d33566a40
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) float\>} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
> new file mode 100644
> index 00000000000..81557c7b9b7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) float\>} "slp2" } } */
> +
> +#define TYPE float
> +#include "pr101908-v8qi.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
> new file mode 100644
> index 00000000000..883956a0d49
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) int\>} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v8qi-adl.c"
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
> new file mode 100644
> index 00000000000..142f46012d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) int\>} "slp2" } } */
> +
> +#define TYPE int
> +#include "pr101908-v8qi.c"
> --
> 2.18.1
>
  
Hongtao Liu March 17, 2022, 7:12 a.m. UTC | #2
On Wed, Mar 16, 2022 at 5:54 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Wed, Mar 16, 2022 at 3:19 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > This patch only handle pure-slp for by-value passed parameter which
> > has nothing to do with IPA but psABI. For by-reference passed
> > parameter IPA is required.
> >
> > The patch is aggressive in determining STLF failure, any
> > unaligned_load for parm_decl passed by stack is thought to have STLF
> > stall issue. It could lose some perf where there's no such issue(1
> > vector_load vs n scalar_load + CTOR).
> >
> > According to microbenchmark in PR, cost of STLF failure is generally
> > between 8 scalar_loads and 16 scalar loads on most latest Intel/AMD
> > processors.
> >
> > gcc/ChangeLog:
> >
> >         PR target/101908
> >         * config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
> >         (ix86_vector_costs::add_stmt_cost): Add extra cost for
> >         unsigned_load which may have store forwarding stall issue.
> >         * config/i386/i386.h (processor_costs): Add new member
> >         stfs.
> >         * config/i386/x86-tune-costs.h (i386_size_cost): Initialize
> >         stfs.
> >         (i386_cost, i486_cost, pentium_cost, lakemont_cost,
> >         pentiumpro_cost, geode_cost, k6_cost, athlon_cost, k8_cost,
> >         amdfam10_cost, bdver_cost, znver1_cost, znver2_cost,
> >         znver3_cost, skylake_cost, icelake_cost, alderlake_cost,
> >         btver1_cost, btver2_cost, pentium4_cost, nocano_cost,
> >         atom_cost, slm_cost, tremont_cost, intel_cost, generic_cost,
> >         core_cost): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/i386/pr101908-1.c: New test.
> >         * gcc.target/i386/pr101908-2.c: New test.
> >         * gcc.target/i386/pr101908-3.c: New test.
> >         * gcc.target/i386/pr101908-v16hi.c: New test.
> >         * gcc.target/i386/pr101908-v16qi.c: New test.
> >         * gcc.target/i386/pr101908-v16sf.c: New test.
> >         * gcc.target/i386/pr101908-v16si.c: New test.
> >         * gcc.target/i386/pr101908-v2df.c: New test.
> >         * gcc.target/i386/pr101908-v2di.c: New test.
> >         * gcc.target/i386/pr101908-v2hi.c: New test.
> >         * gcc.target/i386/pr101908-v2qi.c: New test.
> >         * gcc.target/i386/pr101908-v2sf.c: New test.
> >         * gcc.target/i386/pr101908-v2si.c: New test.
> >         * gcc.target/i386/pr101908-v4df.c: New test.
> >         * gcc.target/i386/pr101908-v4di.c: New test.
> >         * gcc.target/i386/pr101908-v4hi.c: New test.
> >         * gcc.target/i386/pr101908-v4qi.c: New test.
> >         * gcc.target/i386/pr101908-v4sf.c: New test.
> >         * gcc.target/i386/pr101908-v4si.c: New test.
> >         * gcc.target/i386/pr101908-v8df-adl.c: New test.
> >         * gcc.target/i386/pr101908-v8df.c: New test.
> >         * gcc.target/i386/pr101908-v8di-adl.c: New test.
> >         * gcc.target/i386/pr101908-v8di.c: New test.
> >         * gcc.target/i386/pr101908-v8hi-adl.c: New test.
> >         * gcc.target/i386/pr101908-v8hi.c: New test.
> >         * gcc.target/i386/pr101908-v8qi-adl.c: New test.
> >         * gcc.target/i386/pr101908-v8qi.c: New test.
> >         * gcc.target/i386/pr101908-v8sf-adl.c: New test.
> >         * gcc.target/i386/pr101908-v8sf.c: New test.
> >         * gcc.target/i386/pr101908-v8si-adl.c: New test.
> >         * gcc.target/i386/pr101908-v8si.c: New test.
> > ---
> >  gcc/config/i386/i386.cc                       | 51 +++++++++++
> >  gcc/config/i386/i386.h                        |  1 +
> >  gcc/config/i386/x86-tune-costs.h              | 28 ++++++
> >  gcc/testsuite/gcc.target/i386/pr101908-1.c    | 12 +++
> >  gcc/testsuite/gcc.target/i386/pr101908-2.c    | 12 +++
> >  gcc/testsuite/gcc.target/i386/pr101908-3.c    | 90 +++++++++++++++++++
> >  .../gcc.target/i386/pr101908-v16hi.c          |  6 ++
> >  .../gcc.target/i386/pr101908-v16qi.c          | 30 +++++++
> >  .../gcc.target/i386/pr101908-v16sf.c          |  6 ++
> >  .../gcc.target/i386/pr101908-v16si.c          |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v2df.c |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v2di.c |  7 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v2hi.c |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v2qi.c | 16 ++++
> >  gcc/testsuite/gcc.target/i386/pr101908-v2sf.c |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v2si.c |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v4df.c |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v4di.c |  7 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v4hi.c |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v4qi.c | 18 ++++
> >  gcc/testsuite/gcc.target/i386/pr101908-v4sf.c |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v4si.c |  6 ++
> >  .../gcc.target/i386/pr101908-v8df-adl.c       |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v8df.c |  6 ++
> >  .../gcc.target/i386/pr101908-v8di-adl.c       |  7 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v8di.c |  7 ++
> >  .../gcc.target/i386/pr101908-v8hi-adl.c       |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v8hi.c |  6 ++
> >  .../gcc.target/i386/pr101908-v8qi-adl.c       | 22 +++++
> >  gcc/testsuite/gcc.target/i386/pr101908-v8qi.c | 22 +++++
> >  .../gcc.target/i386/pr101908-v8sf-adl.c       |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v8sf.c |  6 ++
> >  .../gcc.target/i386/pr101908-v8si-adl.c       |  6 ++
> >  gcc/testsuite/gcc.target/i386/pr101908-v8si.c |  6 ++
> >  34 files changed, 444 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16si.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2df.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2di.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2si.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4df.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4di.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4si.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d77ad83e437..c01809cc3da 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -22988,6 +22988,46 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
> >    return default_noce_conversion_profitable_p (seq, if_info);
> >  }
> >
> > +/* Return true if REF may have STF issue, otherwise false.
> > +   Any unaligned_load from parm_decl which is passed by stack
> > +   is considered to have STLF stall issue.  */
> > +static bool
> > +ix86_load_maybe_stfs_p (data_reference* dr)
> > +{
> > +  tree addr = DR_BASE_ADDRESS (dr);
> > +  if (TREE_CODE (addr) != ADDR_EXPR)
> > +    return false;
> > +  addr = get_base_address (TREE_OPERAND (addr, 0));
> > +
> > +  if (TREE_CODE (addr) != PARM_DECL)
> > +    return false;
> > +  tree type = TREE_TYPE (addr);
> > +  if (!type)
>
> type should never be NULL
Will change.
>
> > +    return false;
> > +
> > +  machine_mode mode = TYPE_MODE (type);
> > +
> > +  /* There could be false positive in determine parameter passed by stack.
> > +     .i.e. parameter can be put in registers but finally passed by stack
> > +     because registers are ran out.  */
> > +  if (TARGET_64BIT)
> > +    {
> > +      /* From function_arg_64.  */
> > +      enum x86_64_reg_class regclass[MAX_CLASSES];
> > +      int zero_width_bitfields = 0;
> > +      return !classify_argument (mode, type, regclass, 0, zero_width_bitfields);
> > +    }
> > +  else
> > +    {
> > +      /* From function_arg_32.  */
> > +      return (mode == E_BLKmode
> > +             || (AGGREGATE_TYPE_P (type)
> > +                 && (VECTOR_MODE_P (mode) || mode == TImode)));
> > +    }
> > +
> > +  return false;
>
> that stmt is unreachable.
Will change.
>
> > +}
> > +
> >  /* x86-specific vector costs.  */
> >  class ix86_vector_costs : public vector_costs
> >  {
> > @@ -23218,6 +23258,17 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >    if (stmt_cost == -1)
> >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >
> > +  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
> > +     Performance may lose when there's no STF issue(1 vector_load vs n
> > +     scalar_load + CTOR).
> > +     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine
>
> cost(2000) is no longer there
>
> > +     tuned.  */
> > +  if (kind == unaligned_load && stmt_info
> > +      && stmt_info->slp_type == pure_slp
>
> You want to restrict this to BB vectorization?  pure_slp isn't exactly that,
> instead you can do
>
>            && is_a <bb_vec_info> (m_vinfo)
>
> > +      && STMT_VINFO_DATA_REF (stmt_info)
> > +      && ix86_load_maybe_stfs_p (STMT_VINFO_DATA_REF (stmt_info)))
> > +    stmt_cost += COSTS_N_INSNS (ix86_cost->stfs / 2);
>
> I wonder why we divide stfs by two?
Just align with the calculation for costs of vec_load/scalar_load/unalign_load.
>
> I'd suggest an additional check, that the DR is close to function start.  One
> possible check that occurs to me is to check
>
>   STMT_VINFO_DR_INFO (stmt_info)->group == 0
>
> that will for example avoid the penalty for
>
> struct Y y;
> void foo (struct X x)
> {
>   bar();
>   y.a = x.a;
>   y.b = x.b;
> }
>
> but also (maybe not wanted) when the access happens after control
> flow transfer like with
>
> struct Y y;
> void foo (struct X x, int flag)
> {
>   if (flag)
>    {
>     y.a = x.a;
>     y.b = x.b;
>    }
> }
>
> I think we should be conservative with what we pessimize until we have
> evidence that we need to include more cases, also since this after-the-fact
> handling of the issue in costing is sub-optimal.  Ideally the vectorizer itself
> would decide the vectorize the load in a way to avoid STLF fails, but that's
> nothing we can easily arrange for at this stage.
>
> Another option could be to split such loads during md-reorg where we could
Let me try this.
> somehow "count" the latency from function entry, only scanning paths from
> there up to a point where the store buffer is likely not drained (with
> a different
> target cost parameter?) and only scanning not optimize_for_size BBs.  That
> might be a better place to do after-the-fact adjustments (the cost adjustment
> won't avoid the STLF fail if the rest of the vectorization compensates the
> penalty).
>
> Richard.
>
> > +
> >    /* Penalize DFmode vector operations for Bonnell.  */
> >    if (TARGET_CPU_P (BONNELL) && kind == vector_stmt
> >        && vectype && GET_MODE_INNER (TYPE_MODE (vectype)) == DFmode)
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index 0d28e57f8f2..341f1c47981 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -168,6 +168,7 @@ struct processor_costs {
> >                                    in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    const int sse_unaligned_load[5];/* cost of unaligned load.  */
> >    const int sse_unaligned_store[5];/* cost of unaligned store.  */
> > +  const int stfs;               /* cost of store forward stalls.  */
> >    const int xmm_move, ymm_move, /* cost of moving XMM and YMM register.  */
> >             zmm_move;
> >    const int sse_to_integer;    /* cost of moving SSE register to integer.  */
> > diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> > index 017ffa69958..3a5fcdeefdd 100644
> > --- a/gcc/config/i386/x86-tune-costs.h
> > +++ b/gcc/config/i386/x86-tune-costs.h
> > @@ -100,6 +100,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
> >                                            in 128bit, 256bit and 512bit */
> >    {3, 3, 3, 3, 3},                     /* cost of unaligned SSE store
> >                                            in 128bit, 256bit and 512bit */
> > +  6,                                   /* cost of store forward stall.  */
> >    3, 3, 3,                             /* cost of moving XMM,YMM,ZMM register */
> >    3,                                   /* cost of moving SSE register to integer.  */
> >    5, 0,                                        /* Gather load static, per_elt.  */
> > @@ -209,6 +210,7 @@ struct processor_costs i386_cost = {        /* 386 specific costs */
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> > +  8,                                   /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    3,                                   /* cost of moving SSE register to integer.  */
> >    4, 4,                                        /* Gather load static, per_elt.  */
> > @@ -317,6 +319,7 @@ struct processor_costs i486_cost = {        /* 486 specific costs */
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> > +  8,                                   /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    3,                                   /* cost of moving SSE register to integer.  */
> >    4, 4,                                        /* Gather load static, per_elt.  */
> > @@ -427,6 +430,7 @@ struct processor_costs pentium_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> > +  8,                                   /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    3,                                   /* cost of moving SSE register to integer.  */
> >    4, 4,                                        /* Gather load static, per_elt.  */
> > @@ -528,6 +532,7 @@ struct processor_costs lakemont_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> > +  8,                                   /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    3,                                   /* cost of moving SSE register to integer.  */
> >    4, 4,                                        /* Gather load static, per_elt.  */
> > @@ -644,6 +649,7 @@ struct processor_costs pentiumpro_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned loads.  */
> >    {4, 8, 16, 32, 64},                  /* cost of unaligned stores.  */
> > +  24,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    3,                                   /* cost of moving SSE register to integer.  */
> >    4, 4,                                        /* Gather load static, per_elt.  */
> > @@ -751,6 +757,7 @@ struct processor_costs geode_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {2, 2, 8, 16, 32},                   /* cost of unaligned loads.  */
> >    {2, 2, 8, 16, 32},                   /* cost of unaligned stores.  */
> > +  14,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    6,                                   /* cost of moving SSE register to integer.  */
> >    2, 2,                                        /* Gather load static, per_elt.  */
> > @@ -858,6 +865,7 @@ struct processor_costs k6_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {2, 2, 8, 16, 32},                   /* cost of unaligned loads.  */
> >    {2, 2, 8, 16, 32},                   /* cost of unaligned stores.  */
> > +  24,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    6,                                   /* cost of moving SSE register to integer.  */
> >    2, 2,                                        /* Gather load static, per_elt.  */
> > @@ -971,6 +979,7 @@ struct processor_costs athlon_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {4, 4, 12, 12, 24},                  /* cost of unaligned loads.  */
> >    {4, 4, 10, 10, 20},                  /* cost of unaligned stores.  */
> > +  14,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    5,                                   /* cost of moving SSE register to integer.  */
> >    4, 4,                                        /* Gather load static, per_elt.  */
> > @@ -1086,6 +1095,7 @@ struct processor_costs k8_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {4, 3, 12, 12, 24},                  /* cost of unaligned loads.  */
> >    {4, 4, 10, 10, 20},                  /* cost of unaligned stores.  */
> > +  14,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    5,                                   /* cost of moving SSE register to integer.  */
> >    4, 4,                                        /* Gather load static, per_elt.  */
> > @@ -1214,6 +1224,7 @@ struct processor_costs amdfam10_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {4, 4, 3, 7, 12},                    /* cost of unaligned loads.  */
> >    {4, 4, 5, 10, 20},                   /* cost of unaligned stores.  */
> > +  21,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    3,                                   /* cost of moving SSE register to integer.  */
> >    4, 4,                                        /* Gather load static, per_elt.  */
> > @@ -1334,6 +1345,7 @@ const struct processor_costs bdver_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {12, 12, 10, 40, 60},                        /* cost of unaligned loads.  */
> >    {10, 10, 10, 40, 60},                        /* cost of unaligned stores.  */
> > +  54,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    16,                                  /* cost of moving SSE register to integer.  */
> >    12, 12,                              /* Gather load static, per_elt.  */
> > @@ -1475,6 +1487,7 @@ struct processor_costs znver1_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 12, 24},                   /* cost of unaligned loads.  */
> >    {8, 8, 8, 16, 32},                   /* cost of unaligned stores.  */
> > +  42,                                  /* cost of store forward stall.  */
> >    2, 3, 6,                             /* cost of moving XMM,YMM,ZMM register.  */
> >    6,                                   /* cost of moving SSE register to integer.  */
> >    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> > @@ -1630,6 +1643,7 @@ struct processor_costs znver2_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
> >    {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> > +  42,                                  /* cost of store forward stall.  */
> >    2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
> >                                            register.  */
> >    6,                                   /* cost of moving SSE register to integer.  */
> > @@ -1762,6 +1776,7 @@ struct processor_costs znver3_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
> >    {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> > +  42,                                  /* cost of store forward stall.  */
> >    2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
> >                                            register.  */
> >    6,                                   /* cost of moving SSE register to integer.  */
> > @@ -1907,6 +1922,7 @@ struct processor_costs skylake_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 10, 20},                   /* cost of unaligned loads.  */
> >    {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> > +  26,                                  /* cost of store forward stall.  */
> >    2, 2, 4,                             /* cost of moving XMM,YMM,ZMM register */
> >    6,                                   /* cost of moving SSE register to integer.  */
> >    20, 8,                               /* Gather load static, per_elt.  */
> > @@ -2033,6 +2049,7 @@ struct processor_costs icelake_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 10, 20},                   /* cost of unaligned loads.  */
> >    {8, 8, 8, 8, 16},                    /* cost of unaligned stores.  */
> > +  26,                                  /* cost of store forward stall.  */
> >    2, 2, 4,                             /* cost of moving XMM,YMM,ZMM register */
> >    6,                                   /* cost of moving SSE register to integer.  */
> >    20, 8,                               /* Gather load static, per_elt.  */
> > @@ -2153,6 +2170,7 @@ struct processor_costs alderlake_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 10, 15},                   /* cost of unaligned loads.  */
> >    {6, 6, 6, 10, 15},                   /* cost of unaligned storess.  */
> > +  90,                                  /* cost of store forward stall.  */
> >    2, 3, 4,                             /* cost of moving XMM,YMM,ZMM register */
> >    6,                                   /* cost of moving SSE register to integer.  */
> >    18, 6,                               /* Gather load static, per_elt.  */
> > @@ -2266,6 +2284,7 @@ const struct processor_costs btver1_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {10, 10, 12, 48, 96},                        /* cost of unaligned loads.  */
> >    {10, 10, 12, 48, 96},                        /* cost of unaligned stores.  */
> > +  36,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    14,                                  /* cost of moving SSE register to integer.  */
> >    10, 10,                              /* Gather load static, per_elt.  */
> > @@ -2376,6 +2395,7 @@ const struct processor_costs btver2_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {10, 10, 12, 48, 96},                        /* cost of unaligned loads.  */
> >    {10, 10, 12, 48, 96},                        /* cost of unaligned stores.  */
> > +  36,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    14,                                  /* cost of moving SSE register to integer.  */
> >    10, 10,                              /* Gather load static, per_elt.  */
> > @@ -2485,6 +2505,7 @@ struct processor_costs pentium4_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {32, 32, 32, 64, 128},               /* cost of unaligned loads.  */
> >    {32, 32, 32, 64, 128},               /* cost of unaligned stores.  */
> > +  10,                                  /* cost of store forward stall.  */
> >    12, 24, 48,                          /* cost of moving XMM,YMM,ZMM register */
> >    20,                                  /* cost of moving SSE register to integer.  */
> >    16, 16,                              /* Gather load static, per_elt.  */
> > @@ -2597,6 +2618,7 @@ struct processor_costs nocona_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {24, 24, 24, 48, 96},                        /* cost of unaligned loads.  */
> >    {24, 24, 24, 48, 96},                        /* cost of unaligned stores.  */
> > +  8,                                   /* cost of store forward stall.  */
> >    6, 12, 24,                           /* cost of moving XMM,YMM,ZMM register */
> >    20,                                  /* cost of moving SSE register to integer.  */
> >    12, 12,                              /* Gather load static, per_elt.  */
> > @@ -2707,6 +2729,7 @@ struct processor_costs atom_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {16, 16, 16, 32, 64},                        /* cost of unaligned loads.  */
> >    {16, 16, 16, 32, 64},                        /* cost of unaligned stores.  */
> > +  32,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    8,                                   /* cost of moving SSE register to integer.  */
> >    8, 8,                                        /* Gather load static, per_elt.  */
> > @@ -2817,6 +2840,7 @@ struct processor_costs slm_cost = {
> >                                            in SImode, DImode and TImode.  */
> >    {16, 16, 16, 32, 64},                        /* cost of unaligned loads.  */
> >    {16, 16, 16, 32, 64},                        /* cost of unaligned stores.  */
> > +  48,                                  /* cost of store forward stall.  */
> >    2, 4, 8,                             /* cost of moving XMM,YMM,ZMM register */
> >    8,                                   /* cost of moving SSE register to integer.  */
> >    8, 8,                                        /* Gather load static, per_elt.  */
> > @@ -2939,6 +2963,7 @@ struct processor_costs tremont_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 10, 15},                   /* cost of unaligned loads.  */
> >    {6, 6, 6, 10, 15},                   /* cost of unaligned storess.  */
> > +  42,                                  /* cost of store forward stall.  */
> >    2, 3, 4,                             /* cost of moving XMM,YMM,ZMM register */
> >    6,                                   /* cost of moving SSE register to integer.  */
> >    18, 6,                               /* Gather load static, per_elt.  */
> > @@ -3051,6 +3076,7 @@ struct processor_costs intel_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {10, 10, 10, 10, 10},                        /* cost of unaligned loads.  */
> >    {10, 10, 10, 10, 10},                        /* cost of unaligned loads.  */
> > +  22,                                  /* cost of store forward stall.  */
> >    2, 2, 2,                             /* cost of moving XMM,YMM,ZMM register */
> >    4,                                   /* cost of moving SSE register to integer.  */
> >    6, 6,                                        /* Gather load static, per_elt.  */
> > @@ -3168,6 +3194,7 @@ struct processor_costs generic_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 10, 15},                   /* cost of unaligned loads.  */
> >    {6, 6, 6, 10, 15},                   /* cost of unaligned storess.  */
> > +  54,                                  /* cost of store forward stall.  */
> >    2, 3, 4,                             /* cost of moving XMM,YMM,ZMM register */
> >    6,                                   /* cost of moving SSE register to integer.  */
> >    18, 6,                               /* Gather load static, per_elt.  */
> > @@ -3291,6 +3318,7 @@ struct processor_costs core_cost = {
> >                                            in 32bit, 64bit, 128bit, 256bit and 512bit */
> >    {6, 6, 6, 6, 12},                    /* cost of unaligned loads.  */
> >    {6, 6, 6, 6, 12},                    /* cost of unaligned stores.  */
> > +  26,                                  /* cost of store forward stall.  */
> >    2, 2, 4,                             /* cost of moving XMM,YMM,ZMM register */
> >    2,                                   /* cost of moving SSE register to integer.  */
> >    /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> > new file mode 100644
> > index 00000000000..f8e0f2e26bb
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> > +
> > +struct X { double x[2]; };
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df __attribute__((noipa))
> > +foo (struct X* x, struct X* y)
> > +{
> > +  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> > new file mode 100644
> > index 00000000000..f4ff7a83c82
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> > +
> > +struct X { double x[4]; };
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df __attribute__((noipa))
> > +foo (struct X x, struct X y)
> > +{
> > +  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-3.c b/gcc/testsuite/gcc.target/i386/pr101908-3.c
> > new file mode 100644
> > index 00000000000..6f853aa7750
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-3.c
> > @@ -0,0 +1,90 @@
> > +/* PR target/101908.  */
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=x86-64 -O2 -mtune=generic -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not "add new stmt:.*MEM \<vector(2) double\>.*ray + 24B" "slp2" } }  */
> > +/* This testcase is used to avoid STLF stall.  */
> > +
> > +#define sqrt __builtin_sqrt
> > +#define SQ(x)          ((x) * (x))
> > +struct vec3 {
> > +  double x, y, z;
> > +};
> > +
> > +struct ray {
> > +  struct vec3 orig, dir;
> > +};
> > +
> > +struct material {
> > +  struct vec3 col;     /* color */
> > +  double spow;         /* specular power */
> > +  double refl;         /* reflection intensity */
> > +};
> > +
> > +struct sphere {
> > +  struct vec3 pos;
> > +  double rad;
> > +  struct material mat;
> > +  struct sphere *next;
> > +};
> > +
> > +struct spoint {
> > +  struct vec3 pos, normal, vref;       /* position, normal and view reflection */
> > +  double dist;         /* parametric distance of intersection along the ray */
> > +};
> > +
> > +#define ERR_MARGIN             1e-6
> > +
> > +#define DOT(a, b)      ((a).x * (b).x + (a).y * (b).y + (a).z * (b).z)
> > +#define NORMALIZE(a)  do {                     \
> > +    double len = sqrt(DOT(a, a));              \
> > +    (a).x /= len; (a).y /= len; (a).z /= len;  \
> > +  } while(0);
> > +
> > +static struct vec3
> > +reflect(struct vec3 v, struct vec3 n) {
> > +  struct vec3 res;
> > +  double dot = v.x * n.x + v.y * n.y + v.z * n.z;
> > +  res.x = -(2.0 * dot * n.x - v.x);
> > +  res.y = -(2.0 * dot * n.y - v.y);
> > +  res.z = -(2.0 * dot * n.z - v.z);
> > +  return res;
> > +}
> > +
> > +int ray_sphere(const struct sphere *sph,
> > +              struct ray ray, struct spoint *sp) {
> > +  double a, b, c, d, sqrt_d, t1, t2;
> > +
> > +  a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z);
> > +  b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
> > +    2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
> > +    2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
> > +  c = SQ(sph->pos.x) + SQ(sph->pos.y) + SQ(sph->pos.z) +
> > +    SQ(ray.orig.x) + SQ(ray.orig.y) + SQ(ray.orig.z) +
> > +    2.0 * (-sph->pos.x * ray.orig.x - sph->pos.y * ray.orig.y - sph->pos.z * ray.orig.z) - SQ(sph->rad);
> > +
> > +  if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
> > +
> > +  sqrt_d = sqrt(d);
> > +  t1 = (-b + sqrt_d) / (2.0 * a);
> > +  t2 = (-b - sqrt_d) / (2.0 * a);
> > +
> > +  if((t1 < ERR_MARGIN && t2 < ERR_MARGIN) || (t1 > 1.0 && t2 > 1.0)) return 0;
> > +
> > +  if(sp) {
> > +    if(t1 < ERR_MARGIN) t1 = t2;
> > +    if(t2 < ERR_MARGIN) t2 = t1;
> > +    sp->dist = t1 < t2 ? t1 : t2;
> > +
> > +    sp->pos.x = ray.orig.x + ray.dir.x * sp->dist;
> > +    sp->pos.y = ray.orig.y + ray.dir.y * sp->dist;
> > +    sp->pos.z = ray.orig.z + ray.dir.z * sp->dist;
> > +
> > +    sp->normal.x = (sp->pos.x - sph->pos.x) / sph->rad;
> > +    sp->normal.y = (sp->pos.y - sph->pos.y) / sph->rad;
> > +    sp->normal.z = (sp->pos.z - sph->pos.z) / sph->rad;
> > +
> > +    sp->vref = reflect(ray.dir, sp->normal);
> > +    NORMALIZE(sp->vref);
> > +  }
> > +  return 1;
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
> > new file mode 100644
> > index 00000000000..fcd3ee8122f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) short int\>} "slp2" } } */
> > +
> > +#define TYPE short
> > +#include "pr101908-v16qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
> > new file mode 100644
> > index 00000000000..6d43788600e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
> > @@ -0,0 +1,30 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3  -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) char\>} "slp2" } } */
> > +
> > +#ifndef TYPE
> > +#define TYPE char
> > +#endif
> > +
> > +struct X { TYPE a[128]; };
> > +
> > +void __attribute__((noipa))
> > +foo16 (struct X x, struct X y, TYPE* __restrict p)
> > +{
> > +  p[0] = x.a[1] + y.a[1];
> > +  p[1] = x.a[2] + y.a[2];
> > +  p[2] = x.a[3] + y.a[3];
> > +  p[3] = x.a[4] + y.a[4];
> > +  p[4] = x.a[5] + y.a[5];
> > +  p[5] = x.a[6] + y.a[6];
> > +  p[6] = x.a[7] + y.a[7];
> > +  p[7] = x.a[8] + y.a[8];
> > +  p[8] = x.a[9] + y.a[9];
> > +  p[9] = x.a[10] + y.a[10];
> > +  p[10] = x.a[11] + y.a[11];
> > +  p[11] = x.a[12] + y.a[12];
> > +  p[12] = x.a[13] + y.a[13];
> > +  p[13] = x.a[14] + y.a[14];
> > +  p[14] = x.a[15] + y.a[15];
> > +  p[15] = x.a[16] + y.a[16];
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
> > new file mode 100644
> > index 00000000000..f95b85abbc6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) float\>} "slp2" } } */
> > +
> > +#define TYPE float
> > +#include "pr101908-v16qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16si.c b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
> > new file mode 100644
> > index 00000000000..5c48aa5da69
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) int\>} "slp2" } } */
> > +
> > +#define TYPE int
> > +#include "pr101908-v16qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2df.c b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
> > new file mode 100644
> > index 00000000000..9d3f157718c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> > +
> > +#define TYPE double
> > +#include "pr101908-v2qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2di.c b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
> > new file mode 100644
> > index 00000000000..c7cf9a71f21
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
> > @@ -0,0 +1,7 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) long long int\>} "slp2" } } */
> > +
> > +typedef long long int64_t;
> > +#define TYPE int64_t
> > +#include "pr101908-v2qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
> > new file mode 100644
> > index 00000000000..e6024d70780
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) short int\>} "slp2" } } */
> > +
> > +#define TYPE short
> > +#include "pr101908-v2qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
> > new file mode 100644
> > index 00000000000..cf876cc70d4
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) char\>} "slp2" } } */
> > +
> > +#ifndef TYPE
> > +#define TYPE char
> > +#endif
> > +
> > +struct X { TYPE a[128]; };
> > +
> > +void __attribute__((noipa))
> > +foo16 (struct X x, struct X y, TYPE* __restrict p)
> > +{
> > +  p[14] = x.a[15] + y.a[15];
> > +  p[15] = x.a[16] + y.a[16];
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
> > new file mode 100644
> > index 00000000000..eb6349b957e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) float\>} "slp2" } } */
> > +
> > +#define TYPE float
> > +#include "pr101908-v2qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2si.c b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
> > new file mode 100644
> > index 00000000000..ae5fa0749c6
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) int\>} "slp2" } } */
> > +
> > +#define TYPE int
> > +#include "pr101908-v2qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4df.c b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
> > new file mode 100644
> > index 00000000000..94497422704
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) double\>} "slp2" } } */
> > +
> > +#define TYPE double
> > +#include "pr101908-v4qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4di.c b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
> > new file mode 100644
> > index 00000000000..71407aa9fc7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
> > @@ -0,0 +1,7 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) long long int\>} "slp2" } } */
> > +
> > +typedef long long int64_t;
> > +#define TYPE int64_t
> > +#include "pr101908-v4qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
> > new file mode 100644
> > index 00000000000..4b207b91225
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) short int\>} "slp2" } } */
> > +
> > +#define TYPE short
> > +#include "pr101908-v4qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
> > new file mode 100644
> > index 00000000000..5292d3442ec
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) char\>} "slp2" } } */
> > +
> > +#ifndef TYPE
> > +#define TYPE char
> > +#endif
> > +
> > +struct X { TYPE a[128]; };
> > +
> > +void __attribute__((noipa))
> > +foo16 (struct X x, struct X y, TYPE* __restrict p)
> > +{
> > +  p[12] = x.a[13] + y.a[13];
> > +  p[13] = x.a[14] + y.a[14];
> > +  p[14] = x.a[15] + y.a[15];
> > +  p[15] = x.a[16] + y.a[16];
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
> > new file mode 100644
> > index 00000000000..a2c6273120d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) float\>} "slp2" } } */
> > +
> > +#define TYPE float
> > +#include "pr101908-v4qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4si.c b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
> > new file mode 100644
> > index 00000000000..c6824285c74
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) int\>} "slp2" } } */
> > +
> > +#define TYPE int
> > +#include "pr101908-v4qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
> > new file mode 100644
> > index 00000000000..248c6d0fb91
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) double\>} "slp2" } } */
> > +
> > +#define TYPE double
> > +#include "pr101908-v8qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
> > new file mode 100644
> > index 00000000000..05eb2dd51d0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) double\>} "slp2" } } */
> > +
> > +#define TYPE double
> > +#include "pr101908-v8qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
> > new file mode 100644
> > index 00000000000..b0055d7d2c0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
> > @@ -0,0 +1,7 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) long long int\>} "slp2" } } */
> > +
> > +typedef long long int64_t;
> > +#define TYPE int64_t
> > +#include "pr101908-v8qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
> > new file mode 100644
> > index 00000000000..76a393bcc6c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
> > @@ -0,0 +1,7 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) long long int\>} "slp2" } } */
> > +
> > +typedef long long int64_t;
> > +#define TYPE int64_t
> > +#include "pr101908-v8qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
> > new file mode 100644
> > index 00000000000..28977adae28
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) short int\>} "slp2" } } */
> > +
> > +#define TYPE short
> > +#include "pr101908-v8qi-adl.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
> > new file mode 100644
> > index 00000000000..89b50885366
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) short int\>} "slp2" } } */
> > +
> > +#define TYPE short
> > +#include "pr101908-v8qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
> > new file mode 100644
> > index 00000000000..be668e5d006
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
> > @@ -0,0 +1,22 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) char\>} "slp2" } } */
> > +
> > +#ifndef TYPE
> > +#define TYPE char
> > +#endif
> > +
> > +struct X { TYPE a[128]; };
> > +
> > +void __attribute__((noipa))
> > +foo16 (struct X x, struct X y, TYPE* __restrict p)
> > +{
> > +  p[8] = x.a[9] + y.a[9];
> > +  p[9] = x.a[10] + y.a[10];
> > +  p[10] = x.a[11] + y.a[11];
> > +  p[11] = x.a[12] + y.a[12];
> > +  p[12] = x.a[13] + y.a[13];
> > +  p[13] = x.a[14] + y.a[14];
> > +  p[14] = x.a[15] + y.a[15];
> > +  p[15] = x.a[16] + y.a[16];
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
> > new file mode 100644
> > index 00000000000..842c88c8952
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
> > @@ -0,0 +1,22 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) char\>} "slp2" } } */
> > +
> > +#ifndef TYPE
> > +#define TYPE char
> > +#endif
> > +
> > +struct X { TYPE a[128]; };
> > +
> > +void __attribute__((noipa))
> > +foo16 (struct X x, struct X y, TYPE* __restrict p)
> > +{
> > +  p[8] = x.a[9] + y.a[9];
> > +  p[9] = x.a[10] + y.a[10];
> > +  p[10] = x.a[11] + y.a[11];
> > +  p[11] = x.a[12] + y.a[12];
> > +  p[12] = x.a[13] + y.a[13];
> > +  p[13] = x.a[14] + y.a[14];
> > +  p[14] = x.a[15] + y.a[15];
> > +  p[15] = x.a[16] + y.a[16];
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
> > new file mode 100644
> > index 00000000000..89d33566a40
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) float\>} "slp2" } } */
> > +
> > +#define TYPE float
> > +#include "pr101908-v8qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
> > new file mode 100644
> > index 00000000000..81557c7b9b7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) float\>} "slp2" } } */
> > +
> > +#define TYPE float
> > +#include "pr101908-v8qi.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
> > new file mode 100644
> > index 00000000000..883956a0d49
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) int\>} "slp2" } } */
> > +
> > +#define TYPE int
> > +#include "pr101908-v8qi-adl.c"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
> > new file mode 100644
> > index 00000000000..142f46012d7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) int\>} "slp2" } } */
> > +
> > +#define TYPE int
> > +#include "pr101908-v8qi.c"
> > --
> > 2.18.1
> >
  

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index d77ad83e437..c01809cc3da 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22988,6 +22988,46 @@  ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
   return default_noce_conversion_profitable_p (seq, if_info);
 }
 
+/* Return true if REF may have STF issue, otherwise false.
+   Any unaligned_load from parm_decl which is passed by stack
+   is considered to have STLF stall issue.  */
+static bool
+ix86_load_maybe_stfs_p (data_reference* dr)
+{
+  tree addr = DR_BASE_ADDRESS (dr);
+  if (TREE_CODE (addr) != ADDR_EXPR)
+    return false;
+  addr = get_base_address (TREE_OPERAND (addr, 0));
+
+  if (TREE_CODE (addr) != PARM_DECL)
+    return false;
+  tree type = TREE_TYPE (addr);
+  if (!type)
+    return false;
+
+  machine_mode mode = TYPE_MODE (type);
+
+  /* There could be false positive in determine parameter passed by stack.
+     .i.e. parameter can be put in registers but finally passed by stack
+     because registers are ran out.  */
+  if (TARGET_64BIT)
+    {
+      /* From function_arg_64.  */
+      enum x86_64_reg_class regclass[MAX_CLASSES];
+      int zero_width_bitfields = 0;
+      return !classify_argument (mode, type, regclass, 0, zero_width_bitfields);
+    }
+  else
+    {
+      /* From function_arg_32.  */
+      return (mode == E_BLKmode
+	      || (AGGREGATE_TYPE_P (type)
+		  && (VECTOR_MODE_P (mode) || mode == TImode)));
+    }
+
+  return false;
+}
+
 /* x86-specific vector costs.  */
 class ix86_vector_costs : public vector_costs
 {
@@ -23218,6 +23258,17 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
 
+  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
+     Performance may lose when there's no STF issue(1 vector_load vs n
+     scalar_load + CTOR).
+     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine
+     tuned.  */
+  if (kind == unaligned_load && stmt_info
+      && stmt_info->slp_type == pure_slp
+      && STMT_VINFO_DATA_REF (stmt_info)
+      && ix86_load_maybe_stfs_p (STMT_VINFO_DATA_REF (stmt_info)))
+    stmt_cost += COSTS_N_INSNS (ix86_cost->stfs / 2);
+
   /* Penalize DFmode vector operations for Bonnell.  */
   if (TARGET_CPU_P (BONNELL) && kind == vector_stmt
       && vectype && GET_MODE_INNER (TYPE_MODE (vectype)) == DFmode)
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 0d28e57f8f2..341f1c47981 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -168,6 +168,7 @@  struct processor_costs {
 				   in 32bit, 64bit, 128bit, 256bit and 512bit */
   const int sse_unaligned_load[5];/* cost of unaligned load.  */
   const int sse_unaligned_store[5];/* cost of unaligned store.  */
+  const int stfs;		 /* cost of store forward stalls.  */
   const int xmm_move, ymm_move, /* cost of moving XMM and YMM register.  */
 	    zmm_move;
   const int sse_to_integer;	/* cost of moving SSE register to integer.  */
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..3a5fcdeefdd 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -100,6 +100,7 @@  struct processor_costs ix86_size_cost = {/* costs for tuning for size */
 					   in 128bit, 256bit and 512bit */
   {3, 3, 3, 3, 3},			/* cost of unaligned SSE store
 					   in 128bit, 256bit and 512bit */
+  6,					/* cost of store forward stall.  */
   3, 3, 3,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   5, 0,					/* Gather load static, per_elt.  */
@@ -209,6 +210,7 @@  struct processor_costs i386_cost = {	/* 386 specific costs */
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -317,6 +319,7 @@  struct processor_costs i486_cost = {	/* 486 specific costs */
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -427,6 +430,7 @@  struct processor_costs pentium_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -528,6 +532,7 @@  struct processor_costs lakemont_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -644,6 +649,7 @@  struct processor_costs pentiumpro_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  24,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -751,6 +757,7 @@  struct processor_costs geode_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
   {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
+  14,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   2, 2,					/* Gather load static, per_elt.  */
@@ -858,6 +865,7 @@  struct processor_costs k6_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
   {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
+  24,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   2, 2,					/* Gather load static, per_elt.  */
@@ -971,6 +979,7 @@  struct processor_costs athlon_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
   {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
+  14,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   5,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -1086,6 +1095,7 @@  struct processor_costs k8_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
   {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
+  14,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   5,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -1214,6 +1224,7 @@  struct processor_costs amdfam10_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 4, 3, 7, 12},			/* cost of unaligned loads.  */
   {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
+  21,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -1334,6 +1345,7 @@  const struct processor_costs bdver_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
   {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
+  54,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   16,					/* cost of moving SSE register to integer.  */
   12, 12,				/* Gather load static, per_elt.  */
@@ -1475,6 +1487,7 @@  struct processor_costs znver1_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
   {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
+  42,					/* cost of store forward stall.  */
   2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
   6,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
@@ -1630,6 +1643,7 @@  struct processor_costs znver2_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  42,					/* cost of store forward stall.  */
   2, 2, 3,				/* cost of moving XMM,YMM,ZMM
 					   register.  */
   6,					/* cost of moving SSE register to integer.  */
@@ -1762,6 +1776,7 @@  struct processor_costs znver3_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  42,					/* cost of store forward stall.  */
   2, 2, 3,				/* cost of moving XMM,YMM,ZMM
 					   register.  */
   6,					/* cost of moving SSE register to integer.  */
@@ -1907,6 +1922,7 @@  struct processor_costs skylake_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  26,					/* cost of store forward stall.  */
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   20, 8,				/* Gather load static, per_elt.  */
@@ -2033,6 +2049,7 @@  struct processor_costs icelake_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  26,					/* cost of store forward stall.  */
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   20, 8,				/* Gather load static, per_elt.  */
@@ -2153,6 +2170,7 @@  struct processor_costs alderlake_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
   {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
+  90,					/* cost of store forward stall.  */
   2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   18, 6,				/* Gather load static, per_elt.  */
@@ -2266,6 +2284,7 @@  const struct processor_costs btver1_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
   {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
+  36,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   14,					/* cost of moving SSE register to integer.  */
   10, 10,				/* Gather load static, per_elt.  */
@@ -2376,6 +2395,7 @@  const struct processor_costs btver2_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
   {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
+  36,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   14,					/* cost of moving SSE register to integer.  */
   10, 10,				/* Gather load static, per_elt.  */
@@ -2485,6 +2505,7 @@  struct processor_costs pentium4_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {32, 32, 32, 64, 128},		/* cost of unaligned loads.  */
   {32, 32, 32, 64, 128},		/* cost of unaligned stores.  */
+  10,					/* cost of store forward stall.  */
   12, 24, 48,				/* cost of moving XMM,YMM,ZMM register */
   20,					/* cost of moving SSE register to integer.  */
   16, 16,				/* Gather load static, per_elt.  */
@@ -2597,6 +2618,7 @@  struct processor_costs nocona_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {24, 24, 24, 48, 96},			/* cost of unaligned loads.  */
   {24, 24, 24, 48, 96},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   6, 12, 24,				/* cost of moving XMM,YMM,ZMM register */
   20,					/* cost of moving SSE register to integer.  */
   12, 12,				/* Gather load static, per_elt.  */
@@ -2707,6 +2729,7 @@  struct processor_costs atom_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
   {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
+  32,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   8,					/* cost of moving SSE register to integer.  */
   8, 8,					/* Gather load static, per_elt.  */
@@ -2817,6 +2840,7 @@  struct processor_costs slm_cost = {
 					   in SImode, DImode and TImode.  */
   {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
   {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
+  48,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   8,					/* cost of moving SSE register to integer.  */
   8, 8,					/* Gather load static, per_elt.  */
@@ -2939,6 +2963,7 @@  struct processor_costs tremont_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
   {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
+  42,					/* cost of store forward stall.  */
   2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   18, 6,				/* Gather load static, per_elt.  */
@@ -3051,6 +3076,7 @@  struct processor_costs intel_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
   {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
+  22,					/* cost of store forward stall.  */
   2, 2, 2,				/* cost of moving XMM,YMM,ZMM register */
   4,					/* cost of moving SSE register to integer.  */
   6, 6,					/* Gather load static, per_elt.  */
@@ -3168,6 +3194,7 @@  struct processor_costs generic_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
   {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
+  54,					/* cost of store forward stall.  */
   2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   18, 6,				/* Gather load static, per_elt.  */
@@ -3291,6 +3318,7 @@  struct processor_costs core_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {6, 6, 6, 6, 12},			/* cost of unaligned stores.  */
+  26,					/* cost of store forward stall.  */
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   2,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
new file mode 100644
index 00000000000..f8e0f2e26bb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
+
+struct X { double x[2]; };
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df __attribute__((noipa))
+foo (struct X* x, struct X* y)
+{
+  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
new file mode 100644
index 00000000000..f4ff7a83c82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
+
+struct X { double x[4]; };
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df __attribute__((noipa))
+foo (struct X x, struct X y)
+{
+  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-3.c b/gcc/testsuite/gcc.target/i386/pr101908-3.c
new file mode 100644
index 00000000000..6f853aa7750
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-3.c
@@ -0,0 +1,90 @@ 
+/* PR target/101908.  */
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64 -O2 -mtune=generic -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not "add new stmt:.*MEM \<vector(2) double\>.*ray + 24B" "slp2" } }  */
+/* This testcase is used to avoid STLF stall.  */
+
+#define sqrt __builtin_sqrt
+#define SQ(x)		((x) * (x))
+struct vec3 {
+  double x, y, z;
+};
+
+struct ray {
+  struct vec3 orig, dir;
+};
+
+struct material {
+  struct vec3 col;	/* color */
+  double spow;		/* specular power */
+  double refl;		/* reflection intensity */
+};
+
+struct sphere {
+  struct vec3 pos;
+  double rad;
+  struct material mat;
+  struct sphere *next;
+};
+
+struct spoint {
+  struct vec3 pos, normal, vref;	/* position, normal and view reflection */
+  double dist;		/* parametric distance of intersection along the ray */
+};
+
+#define ERR_MARGIN		1e-6
+
+#define DOT(a, b)	((a).x * (b).x + (a).y * (b).y + (a).z * (b).z)
+#define NORMALIZE(a)  do {			\
+    double len = sqrt(DOT(a, a));		\
+    (a).x /= len; (a).y /= len; (a).z /= len;	\
+  } while(0);
+
+static struct vec3
+reflect(struct vec3 v, struct vec3 n) {
+  struct vec3 res;
+  double dot = v.x * n.x + v.y * n.y + v.z * n.z;
+  res.x = -(2.0 * dot * n.x - v.x);
+  res.y = -(2.0 * dot * n.y - v.y);
+  res.z = -(2.0 * dot * n.z - v.z);
+  return res;
+}
+
+int ray_sphere(const struct sphere *sph,
+	       struct ray ray, struct spoint *sp) {
+  double a, b, c, d, sqrt_d, t1, t2;
+	
+  a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z);
+  b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
+    2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
+    2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
+  c = SQ(sph->pos.x) + SQ(sph->pos.y) + SQ(sph->pos.z) +
+    SQ(ray.orig.x) + SQ(ray.orig.y) + SQ(ray.orig.z) +
+    2.0 * (-sph->pos.x * ray.orig.x - sph->pos.y * ray.orig.y - sph->pos.z * ray.orig.z) - SQ(sph->rad);
+	
+  if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
+
+  sqrt_d = sqrt(d);
+  t1 = (-b + sqrt_d) / (2.0 * a);
+  t2 = (-b - sqrt_d) / (2.0 * a);
+
+  if((t1 < ERR_MARGIN && t2 < ERR_MARGIN) || (t1 > 1.0 && t2 > 1.0)) return 0;
+
+  if(sp) {
+    if(t1 < ERR_MARGIN) t1 = t2;
+    if(t2 < ERR_MARGIN) t2 = t1;
+    sp->dist = t1 < t2 ? t1 : t2;
+		
+    sp->pos.x = ray.orig.x + ray.dir.x * sp->dist;
+    sp->pos.y = ray.orig.y + ray.dir.y * sp->dist;
+    sp->pos.z = ray.orig.z + ray.dir.z * sp->dist;
+		
+    sp->normal.x = (sp->pos.x - sph->pos.x) / sph->rad;
+    sp->normal.y = (sp->pos.y - sph->pos.y) / sph->rad;
+    sp->normal.z = (sp->pos.z - sph->pos.z) / sph->rad;
+
+    sp->vref = reflect(ray.dir, sp->normal);
+    NORMALIZE(sp->vref);
+  }
+  return 1;
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
new file mode 100644
index 00000000000..fcd3ee8122f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v16qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
new file mode 100644
index 00000000000..6d43788600e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
@@ -0,0 +1,30 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3  -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[0] = x.a[1] + y.a[1];
+  p[1] = x.a[2] + y.a[2];
+  p[2] = x.a[3] + y.a[3];
+  p[3] = x.a[4] + y.a[4];
+  p[4] = x.a[5] + y.a[5];
+  p[5] = x.a[6] + y.a[6];
+  p[6] = x.a[7] + y.a[7];
+  p[7] = x.a[8] + y.a[8];
+  p[8] = x.a[9] + y.a[9];
+  p[9] = x.a[10] + y.a[10];
+  p[10] = x.a[11] + y.a[11];
+  p[11] = x.a[12] + y.a[12];
+  p[12] = x.a[13] + y.a[13];
+  p[13] = x.a[14] + y.a[14];
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
new file mode 100644
index 00000000000..f95b85abbc6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v16qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16si.c b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
new file mode 100644
index 00000000000..5c48aa5da69
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v16qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2df.c b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
new file mode 100644
index 00000000000..9d3f157718c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
+
+#define TYPE double
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2di.c b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
new file mode 100644
index 00000000000..c7cf9a71f21
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) long long int\>} "slp2" } } */
+
+typedef long long int64_t;
+#define TYPE int64_t
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
new file mode 100644
index 00000000000..e6024d70780
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
new file mode 100644
index 00000000000..cf876cc70d4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
new file mode 100644
index 00000000000..eb6349b957e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2si.c b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
new file mode 100644
index 00000000000..ae5fa0749c6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4df.c b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
new file mode 100644
index 00000000000..94497422704
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) double\>} "slp2" } } */
+
+#define TYPE double
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4di.c b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
new file mode 100644
index 00000000000..71407aa9fc7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) long long int\>} "slp2" } } */
+
+typedef long long int64_t;
+#define TYPE int64_t
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
new file mode 100644
index 00000000000..4b207b91225
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
new file mode 100644
index 00000000000..5292d3442ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[12] = x.a[13] + y.a[13];
+  p[13] = x.a[14] + y.a[14];
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
new file mode 100644
index 00000000000..a2c6273120d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4si.c b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
new file mode 100644
index 00000000000..c6824285c74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
new file mode 100644
index 00000000000..248c6d0fb91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) double\>} "slp2" } } */
+
+#define TYPE double
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
new file mode 100644
index 00000000000..05eb2dd51d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) double\>} "slp2" } } */
+
+#define TYPE double
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
new file mode 100644
index 00000000000..b0055d7d2c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) long long int\>} "slp2" } } */
+
+typedef long long int64_t;
+#define TYPE int64_t
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
new file mode 100644
index 00000000000..76a393bcc6c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) long long int\>} "slp2" } } */
+
+typedef long long int64_t;
+#define TYPE int64_t
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
new file mode 100644
index 00000000000..28977adae28
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v8qi-adl.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
new file mode 100644
index 00000000000..89b50885366
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
new file mode 100644
index 00000000000..be668e5d006
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
@@ -0,0 +1,22 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[8] = x.a[9] + y.a[9];
+  p[9] = x.a[10] + y.a[10];
+  p[10] = x.a[11] + y.a[11];
+  p[11] = x.a[12] + y.a[12];
+  p[12] = x.a[13] + y.a[13];
+  p[13] = x.a[14] + y.a[14];
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
new file mode 100644
index 00000000000..842c88c8952
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
@@ -0,0 +1,22 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[8] = x.a[9] + y.a[9];
+  p[9] = x.a[10] + y.a[10];
+  p[10] = x.a[11] + y.a[11];
+  p[11] = x.a[12] + y.a[12];
+  p[12] = x.a[13] + y.a[13];
+  p[13] = x.a[14] + y.a[14];
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
new file mode 100644
index 00000000000..89d33566a40
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
new file mode 100644
index 00000000000..81557c7b9b7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
new file mode 100644
index 00000000000..883956a0d49
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v8qi-adl.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
new file mode 100644
index 00000000000..142f46012d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v8qi.c"