[i386] Prevent vectorization for load from parm_decl at O2 to avoid STF issue.

Message ID 20220304072714.55713-1-hongtao.liu@intel.com
State New
Headers
Series [i386] Prevent vectorization for load from parm_decl at O2 to avoid STF issue. |

Commit Message

Liu, Hongtao March 4, 2022, 7:27 a.m. UTC
  For parameter passing through stack, vectorized load from parm_decl
in callee may trigger serious STF issue. This is why GCC12 regresses
50% for cray at -O2 compared to GCC11.

The patch add an extremely large number to stmt_cost to prevent
vectorization for loads from parm_decl under very-cheap cost model,
this can at least prevent O2 regression due to STF issue, but may lose
some perf where there's no such issue(1 vector_load vs n scalar_load +
CTOR).

No impact for SPEC2017 for both plain O2 and native O2 on ICX.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	PR target/101908
	* config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
	(ix86_vector_costs::add_stmt_cost): Add extra cost for
	vector_load/unsigned_load which may have stall forward issue.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr101908-1.c: New test.
	* gcc.target/i386/pr101908-2.c: New test.
---
 gcc/config/i386/i386.cc                    | 31 ++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++
 gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++
 3 files changed, 55 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
  

Comments

Hongtao Liu March 7, 2022, 8:26 a.m. UTC | #1
On Fri, Mar 4, 2022 at 3:28 PM liuhongt via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> For parameter passing through stack, vectorized load from parm_decl
> in callee may trigger serious STF issue. This is why GCC12 regresses
> 50% for cray at -O2 compared to GCC11.
>
> The patch add an extremely large number to stmt_cost to prevent
> vectorization for loads from parm_decl under very-cheap cost model,
> this can at least prevent O2 regression due to STF issue, but may lose
> some perf where there's no such issue(1 vector_load vs n scalar_load +
> CTOR).
>
> No impact for SPEC2017 for both plain O2 and native O2 on ICX.
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
>         PR target/101908
>         * config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
>         (ix86_vector_costs::add_stmt_cost): Add extra cost for
>         vector_load/unsigned_load which may have stall forward issue.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr101908-1.c: New test.
>         * gcc.target/i386/pr101908-2.c: New test.
> ---
>  gcc/config/i386/i386.cc                    | 31 ++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++
>  gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++
>  3 files changed, 55 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..3bbaaf65ea8 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22976,6 +22976,19 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
>    return default_noce_conversion_profitable_p (seq, if_info);
>  }
>
> +/* Return true if REF may have STF issue, otherwise false.  */
> +static bool
> +ix86_load_maybe_stfs_p (tree ref)
> +{
> +  tree addr = get_base_address (ref);
> +
> +  if (TREE_CODE (addr) != PARM_DECL
> +      || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr)))
> +      || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD)
> +    return false;
> +  return true;
> +}
> +
>  /* x86-specific vector costs.  */
>  class ix86_vector_costs : public vector_costs
>  {
> @@ -23203,6 +23216,24 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>         if (TREE_CODE (op) == SSA_NAME)
>           TREE_VISITED (op) = 0;
>      }
> +
> +  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
> +     Performance may lose when there's no STF issue(1 vector_load vs n
> +     scalar_load + CTOR).
> +     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine
> +     tuned.  */
> +  if ((kind == vector_load || kind == unaligned_load)
> +      && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
> +      && stmt_info
> +      && stmt_info->slp_type == pure_slp
> +      && stmt_info->stmt
> +      && gimple_assign_load_p (stmt_info->stmt)
> +      && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt)))
> +    {
> +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> +      stmt_cost += 2000;
> +    }
> +
>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> new file mode 100644
> index 00000000000..f8e0f2e26bb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +struct X { double x[2]; };
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df __attribute__((noipa))
> +foo (struct X* x, struct X* y)
> +{
> +  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> new file mode 100644
> index 00000000000..7f2f00cebab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +struct X { double x[2]; };
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df __attribute__((noipa))
> +foo (struct X x, struct X y)
> +{
> +  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
> +}
> --
> 2.18.1
>
  
Richard Biener March 7, 2022, 9:37 a.m. UTC | #2
On Fri, Mar 4, 2022 at 8:27 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> For parameter passing through stack, vectorized load from parm_decl
> in callee may trigger serious STF issue. This is why GCC12 regresses
> 50% for cray at -O2 compared to GCC11.
>
> The patch add an extremely large number to stmt_cost to prevent
> vectorization for loads from parm_decl under very-cheap cost model,
> this can at least prevent O2 regression due to STF issue, but may lose
> some perf where there's no such issue(1 vector_load vs n scalar_load +
> CTOR).

Note this is just heuristics in that by-value passed parameters are usually
stored to the stack close before the function call.  It does not catch the
similar case from

  foo (const X &bar)  { ... }

where a

  foo ({ 1., 2. })

will have the object passed by reference constructed right before the
call.  In the end a full solution will need to perform some IPA analysis
that computes the initialization distance from the call and uses should
factor in the use distance from function entry.

> No impact for SPEC2017 for both plain O2 and native O2 on ICX.
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
>         PR target/101908
>         * config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
>         (ix86_vector_costs::add_stmt_cost): Add extra cost for
>         vector_load/unsigned_load which may have stall forward issue.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr101908-1.c: New test.
>         * gcc.target/i386/pr101908-2.c: New test.
> ---
>  gcc/config/i386/i386.cc                    | 31 ++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++
>  gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++
>  3 files changed, 55 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index b2bf90576d5..3bbaaf65ea8 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22976,6 +22976,19 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
>    return default_noce_conversion_profitable_p (seq, if_info);
>  }
>
> +/* Return true if REF may have STF issue, otherwise false.  */
> +static bool
> +ix86_load_maybe_stfs_p (tree ref)
> +{
> +  tree addr = get_base_address (ref);
> +
> +  if (TREE_CODE (addr) != PARM_DECL
> +      || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr)))
> +      || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD)
> +    return false;
> +  return true;
> +}
> +
>  /* x86-specific vector costs.  */
>  class ix86_vector_costs : public vector_costs
>  {
> @@ -23203,6 +23216,24 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>         if (TREE_CODE (op) == SSA_NAME)
>           TREE_VISITED (op) = 0;
>      }
> +
> +  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
> +     Performance may lose when there's no STF issue(1 vector_load vs n
> +     scalar_load + CTOR).
> +     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine
> +     tuned.  */
> +  if ((kind == vector_load || kind == unaligned_load)
> +      && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP

This check doesn't make much sense, I'd rather remove it.

> +      && stmt_info
> +      && stmt_info->slp_type == pure_slp
> +      && stmt_info->stmt
> +      && gimple_assign_load_p (stmt_info->stmt)
> +      && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt)))

I'd pass down STMT_VINFO_DATA_REF instead and have ix86_load_maybe_stfs_p
and use

  tree addr = DR_BASE_ADDRESS (dr);
  if (TREE_CODE (addr) != ADDR_EXPR)
    return false;
  addr = get_base_address (TREE_OPERAND (addr, 0));
  ...

since that gets you a more reliable way to look at the actual object referenced.

> +    {
> +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> +      stmt_cost += 2000;
> +    }
> +

Maybe handle this like the Bonell case and thus put it after the
stmt_cost == -1 handling, just bumping the cost (also noting the actual number
is arbitrary).  It would be nice to have a better estimate on the penalty than
"2000", maybe formulate it in terms of the target costs simple-sse op at least.

That said, it might be interesting to micro-benchmark

v2df __attribute__((noipa))
foo (struct X* x, struct X* y)
{
  double temx0, temx1, temy0, temy1;
  temx0 = x->x[0];
  temx0 += temx0;
 ...
  temx1 = x->x[1];
  temx1 += temx1;
...
  return (v2df) {temx1, temx0 } + (v2df) { temy1, temy0 };
}

(without -ffast-math) to see how many vector adds we'd need to compensate the
STLF penalty (just to have an idea whether the magic number is closer to 200,
2000 or 20000).  Maybe also put that respective kernel into the i386 testsuite
with a specific -mtune (and make the thing a target tunable?).

>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> new file mode 100644
> index 00000000000..f8e0f2e26bb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +struct X { double x[2]; };
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df __attribute__((noipa))
> +foo (struct X* x, struct X* y)
> +{
> +  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> new file mode 100644
> index 00000000000..7f2f00cebab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> +
> +struct X { double x[2]; };
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df __attribute__((noipa))
> +foo (struct X x, struct X y)
> +{
> +  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
> +}
> --
> 2.18.1
>
  
Hongtao Liu March 8, 2022, 3:39 a.m. UTC | #3
On Mon, Mar 7, 2022 at 5:37 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, Mar 4, 2022 at 8:27 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > For parameter passing through stack, vectorized load from parm_decl
> > in callee may trigger serious STF issue. This is why GCC12 regresses
> > 50% for cray at -O2 compared to GCC11.
> >
> > The patch add an extremely large number to stmt_cost to prevent
> > vectorization for loads from parm_decl under very-cheap cost model,
> > this can at least prevent O2 regression due to STF issue, but may lose
> > some perf where there's no such issue(1 vector_load vs n scalar_load +
> > CTOR).
>
> Note this is just heuristics in that by-value passed parameters are usually
> stored to the stack close before the function call.  It does not catch the
> similar case from
>
>   foo (const X &bar)  { ... }
>
> where a
>
>   foo ({ 1., 2. })
>
> will have the object passed by reference constructed right before the
> call.  In the end a full solution will need to perform some IPA analysis
> that computes the initialization distance from the call and uses should
> factor in the use distance from function entry.
Yes, this patch only deals with by-value passed objects which has
nothing to do with ipa, but only depends on psABI.
For the case of passing references, IPA is necessary to help analyze
some obvious STFS scenarios, but static analysis still has
limitations, for example, for the store across the cache line (or
page) boundary case, IPA can not give a judgment.
>
> > No impact for SPEC2017 for both plain O2 and native O2 on ICX.
> > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >         PR target/101908
> >         * config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
> >         (ix86_vector_costs::add_stmt_cost): Add extra cost for
> >         vector_load/unsigned_load which may have stall forward issue.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/i386/pr101908-1.c: New test.
> >         * gcc.target/i386/pr101908-2.c: New test.
> > ---
> >  gcc/config/i386/i386.cc                    | 31 ++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++++++
> >  gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++++++
> >  3 files changed, 55 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index b2bf90576d5..3bbaaf65ea8 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -22976,6 +22976,19 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
> >    return default_noce_conversion_profitable_p (seq, if_info);
> >  }
> >
> > +/* Return true if REF may have STF issue, otherwise false.  */
> > +static bool
> > +ix86_load_maybe_stfs_p (tree ref)
> > +{
> > +  tree addr = get_base_address (ref);
> > +
> > +  if (TREE_CODE (addr) != PARM_DECL
> > +      || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr)))
> > +      || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD)
> > +    return false;
> > +  return true;
> > +}
> > +
> >  /* x86-specific vector costs.  */
> >  class ix86_vector_costs : public vector_costs
> >  {
> > @@ -23203,6 +23216,24 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >         if (TREE_CODE (op) == SSA_NAME)
> >           TREE_VISITED (op) = 0;
> >      }
> > +
> > +  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
> > +     Performance may lose when there's no STF issue(1 vector_load vs n
> > +     scalar_load + CTOR).
> > +     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine
> > +     tuned.  */
> > +  if ((kind == vector_load || kind == unaligned_load)
> > +      && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
>
> This check doesn't make much sense, I'd rather remove it.
Will change.
>
> > +      && stmt_info
> > +      && stmt_info->slp_type == pure_slp
> > +      && stmt_info->stmt
> > +      && gimple_assign_load_p (stmt_info->stmt)
> > +      && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt)))
>
> I'd pass down STMT_VINFO_DATA_REF instead and have ix86_load_maybe_stfs_p
> and use
>
>   tree addr = DR_BASE_ADDRESS (dr);
>   if (TREE_CODE (addr) != ADDR_EXPR)
>     return false;
>   addr = get_base_address (TREE_OPERAND (addr, 0));
>   ...
>
> since that gets you a more reliable way to look at the actual object referenced.
Yes.
>
> > +    {
> > +      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > +      stmt_cost += 2000;
> > +    }
> > +
>
> Maybe handle this like the Bonell case and thus put it after the
> stmt_cost == -1 handling, just bumping the cost (also noting the actual number
> is arbitrary).  It would be nice to have a better estimate on the penalty than
> "2000", maybe formulate it in terms of the target costs simple-sse op at least.
>
I'll add a member to ix86_cost for STFS to make it target specific,
the initial value will come from [1] and real experiments.

[1] https://www.agner.org/optimize/microarchitecture.pdf

> That said, it might be interesting to micro-benchmark
>
> v2df __attribute__((noipa))
> foo (struct X* x, struct X* y)
> {
>   double temx0, temx1, temy0, temy1;
>   temx0 = x->x[0];
>   temx0 += temx0;
>  ...
>   temx1 = x->x[1];
>   temx1 += temx1;
> ...
>   return (v2df) {temx1, temx0 } + (v2df) { temy1, temy0 };
> }
>
> (without -ffast-math) to see how many vector adds we'd need to compensate the
> STLF penalty (just to have an idea whether the magic number is closer to 200,
> 2000 or 20000).  Maybe also put that respective kernel into the i386 testsuite
> with a specific -mtune (and make the thing a target tunable?).
>
> >    if (stmt_cost == -1)
> >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> > new file mode 100644
> > index 00000000000..f8e0f2e26bb
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> > +
> > +struct X { double x[2]; };
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df __attribute__((noipa))
> > +foo (struct X* x, struct X* y)
> > +{
> > +  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> > new file mode 100644
> > index 00000000000..7f2f00cebab
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-slp-details" } */
> > +/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
> > +
> > +struct X { double x[2]; };
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df __attribute__((noipa))
> > +foo (struct X x, struct X y)
> > +{
> > +  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
> > +}
> > --
> > 2.18.1
> >
  

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b2bf90576d5..3bbaaf65ea8 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22976,6 +22976,19 @@  ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
   return default_noce_conversion_profitable_p (seq, if_info);
 }
 
+/* Return true if REF may have STF issue, otherwise false.  */
+static bool
+ix86_load_maybe_stfs_p (tree ref)
+{
+  tree addr = get_base_address (ref);
+
+  if (TREE_CODE (addr) != PARM_DECL
+      || !tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (addr)))
+      || tree_to_uhwi (TYPE_SIZE (TREE_TYPE (addr))) <= MAX_BITS_PER_WORD)
+    return false;
+  return true;
+}
+
 /* x86-specific vector costs.  */
 class ix86_vector_costs : public vector_costs
 {
@@ -23203,6 +23216,24 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	if (TREE_CODE (op) == SSA_NAME)
 	  TREE_VISITED (op) = 0;
     }
+
+  /* Prevent vectorization for load from parm_decl at O2 to avoid STF issue.
+     Performance may lose when there's no STF issue(1 vector_load vs n
+     scalar_load + CTOR).
+     TODO: both extra cost(2000) and ix86_load_maybe_stfs_p need to be fine
+     tuned.  */
+  if ((kind == vector_load || kind == unaligned_load)
+      && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
+      && stmt_info
+      && stmt_info->slp_type == pure_slp
+      && stmt_info->stmt
+      && gimple_assign_load_p (stmt_info->stmt)
+      && ix86_load_maybe_stfs_p (gimple_assign_rhs1 (stmt_info->stmt)))
+    {
+      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
+      stmt_cost += 2000;
+    }
+
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
 
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
new file mode 100644
index 00000000000..f8e0f2e26bb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
+
+struct X { double x[2]; };
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df __attribute__((noipa))
+foo (struct X* x, struct X* y)
+{
+  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
new file mode 100644
index 00000000000..7f2f00cebab
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
+
+struct X { double x[2]; };
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df __attribute__((noipa))
+foo (struct X x, struct X y)
+{
+  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
+}