[x86,RFC] Prevent loop vectorization if it's in a deeply nested big loop.

Message ID 20241127032530.2968290-1-hongtao.liu@intel.com
State New

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 warning Skipped upon request
linaro-tcwg-bot/tcwg_gcc_build--master-arm warning Skipped upon request

Commit Message

liuhongt Nov. 27, 2024, 3:25 a.m. UTC
When a loop requires any kind of versioning, which could increase register
pressure too much, and it is inside a deeply nested big loop, don't
vectorize it.

I tested the patch with both -Ofast and -O2 on SPEC2017; apart from
548.exchange_r, all other benchmarks produce the same binary.

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Any comments?

gcc/ChangeLog:

	PR target/117088
	* config/i386/i386.cc
	(ix86_vector_costs::ix86_vect_in_deep_nested_loop_p): New function.
	(ix86_vector_costs::finish_cost): Prevent loop vectorization
	if it's in a deeply nested loop and requires versioning.
	* config/i386/i386.opt (--param=vect-max-loop-depth=): New
	param.
---
 gcc/config/i386/i386.cc  | 89 ++++++++++++++++++++++++++++++++++++++++
 gcc/config/i386/i386.opt |  4 ++
 2 files changed, 93 insertions(+)
  

Comments

Richard Biener Nov. 27, 2024, 1:43 p.m. UTC | #1
On Wed, Nov 27, 2024 at 4:26 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> When a loop requires any kind of versioning, which could increase register
> pressure too much, and it is inside a deeply nested big loop, don't
> vectorize it.
>
> I tested the patch with both -Ofast and -O2 on SPEC2017; apart from
> 548.exchange_r, all other benchmarks produce the same binary.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Any comments?

The vectorizer tries to version an outer loop when vectorizing a loop nest
and the versioning condition is invariant.  See vect_loop_versioning.  This
tries to handle such cases.  Often the generated runtime alias checks are
not invariant because we do not consider the outer evolutions.  I think we
should instead fix this there.

Question below ...

> gcc/ChangeLog:
>
>         PR target/117088
>         * config/i386/i386.cc
>         (ix86_vector_costs::ix86_vect_in_deep_nested_loop_p): New function.
>         (ix86_vector_costs::finish_cost): Prevent loop vectorization
>         if it's in a deeply nested loop and requires versioning.
>         * config/i386/i386.opt (--param=vect-max-loop-depth=): New
>         param.
> ---
>  gcc/config/i386/i386.cc  | 89 ++++++++++++++++++++++++++++++++++++++++
>  gcc/config/i386/i386.opt |  4 ++
>  2 files changed, 93 insertions(+)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 526c9df7618..608f40413d2 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -25019,6 +25019,8 @@ private:
>
>    /* Estimate register pressure of the vectorized code.  */
>    void ix86_vect_estimate_reg_pressure ();
> +  /* Check if vect_loop is in a deeply-nested loop.  */
> +  bool ix86_vect_in_deep_nested_loop_p (class loop *vect_loop);
>    /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
>       estimation of register pressure.
>       ??? Currently it's only used by vec_construct/scalar_to_vec
> @@ -25324,6 +25326,84 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
>      }
>  }
>
> +/* Return true if vect_loop is in a deeply-nested loop.
> +   i.e. vect_loop_n in the loop structure below.
> +loop1
> +{
> + loop2
> + {
> +  loop3
> +  {
> +   vect_loop_1;
> +   loop4
> +   {
> +    vect_loop_2;
> +    loop5
> +    {
> +     vect_loop_3;
> +     loop6
> +     {
> +      vect_loop_4;
> +      loop7
> +      {
> +       vect_loop_5;
> +       loop8
> +       {
> +       loop9
> +       }
> +      vect_loop_6;
> +      }
> +     vect_loop_7;
> +     }
> +    }
> +   }
> + }
> + It's a big hammer to fix the O2 regression for 548.exchange_r after
> + vectorization was enhanced by r15-4225-g70c3db511ba14f.  */
> +bool
> +ix86_vector_costs::ix86_vect_in_deep_nested_loop_p (class loop *vect_loop)
> +{
> +  if (loop_depth (vect_loop) > (unsigned) ix86_vect_max_loop_depth)
> +    return true;
> +
> +  if (loop_depth (vect_loop) < 2)
> +    return false;
> +

While the above two are "obvious", what you check below isn't clear to me.
Is this trying to compute whether 'vect_loop' is inside a loop nest in which
some sibling of vect_loop (or even a sibling of an outer loop of vect_loop,
recursively) is a sub-nest whose loop depth (relative to what?) exceeds
ix86_vect_max_loop_depth?

> +  class loop* outer_loop = loop_outer (vect_loop);
> +
> +  auto_vec<class loop*> m_loop_stack;
> +  auto_sbitmap m_visited_loops (number_of_loops (cfun));
> +
> +  /* Get all sibling loops for vect_loop.  */
> +  class loop* next_loop = outer_loop->inner;
> +  for (; next_loop; next_loop = next_loop->next)
> +    {
> +      m_loop_stack.safe_push (next_loop);
> +      bitmap_set_bit (m_visited_loops, next_loop->num);
> +    }
> +
> +  /* DFS the max depth of all sibling loop.  */
> +  while (!m_loop_stack.is_empty ())
> +    {
> +      next_loop = m_loop_stack.pop ();
> +      if (loop_depth (next_loop) > (unsigned) ix86_vect_max_loop_depth)
> +       return true;
> +
> +      class loop* inner_loop = next_loop->inner;
> +      while (inner_loop)
> +       {
> +         if (!bitmap_bit_p (m_visited_loops, inner_loop->num))
> +           {
> +             m_loop_stack.safe_push (inner_loop);
> +             bitmap_set_bit (m_visited_loops, inner_loop->num);
> +           }
> +         inner_loop = inner_loop->next;
> +       }
> +    }
> +
> +  return false;
> +}
> +
>  void
>  ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
>  {
> @@ -25344,6 +25424,15 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
>           && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
>               > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
>         m_costs[vect_body] = INT_MAX;
> +
> +      /* Prohibit vectorization when the loop requires versioning
> +        and loop_depth exceeds threshold.  */
> +      if ((LOOP_REQUIRES_VERSIONING (loop_vinfo)
> +          || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> +          || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> +          || vect_apply_runtime_profitability_check_p (loop_vinfo))

This also applies to loops not requiring versioning - it practically applies to
all loops whose iteration count is not a constant multiple of VF.

> +         && ix86_vect_in_deep_nested_loop_p (LOOP_VINFO_LOOP (loop_vinfo)))
> +       m_costs[vect_body] = INT_MAX;
>      }
>
>    ix86_vect_estimate_reg_pressure ();
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index 99e86f545e8..c5abf83473d 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -1414,3 +1414,7 @@ Support MOVRS built-in functions and code generation.
>  mamx-movrs
>  Target Mask(ISA2_AMX_MOVRS) Var(ix86_isa_flags2) Save
>  Support AMX-MOVRS built-in functions and code generation.
> +
> +-param=vect-max-loop-depth=
> +Target Joined UInteger Var(ix86_vect_max_loop_depth) Init(8) Param
> +Prevent loop vectorization when it's in a deeply nested loop and requires versioning, since it may increase register pressure too much.
> --
> 2.34.1
>
  
Hongtao Liu Nov. 28, 2024, 2:04 a.m. UTC | #2
On Wed, Nov 27, 2024 at 9:43 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Nov 27, 2024 at 4:26 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > +bool
> > +ix86_vector_costs::ix86_vect_in_deep_nested_loop_p (class loop *vect_loop)
> > +{
> > +  if (loop_depth (vect_loop) > (unsigned) ix86_vect_max_loop_depth)
> > +    return true;
> > +
> > +  if (loop_depth (vect_loop) < 2)
> > +    return false;
> > +
>
> while the above two are "obvious", what you check below isn't clear to me.
> Is this trying to compute whether 'vect_loop' is inside of a loop nest which
> at any sibling of vect_loop (or even sibling of an outer loop of vect_loop,
> recursively) is a sub-nest with a loop depth (relative to what?) exceeds
> ix86_vect_max_loop_depth?
Yes, the function tries to find out whether vect_loop is in a "big outer
loop" that contains an innermost loop with loop_depth >
ix86_vect_max_loop_depth.
If so, it prevents vectorization of the loop when its trip count is
not a constant multiple of VF. ("Requires any kind of versioning" is not
accurate, and yes, it's a big hammer.)
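
For illustration, the check described above can be modelled outside GCC. The following is a minimal standalone sketch, not the patch itself: `Loop`, `add_child` and `in_deep_nested_loop` are hypothetical names, and the `Loop` struct is an assumed simplification of GCC's `class loop` that keeps only the depth, the parent and the children. As in the patch, only vect_loop's siblings and their sub-nests are searched.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for GCC's `class loop`; only what the check needs.
// Depth 0 models the function-level root, depth 1 an outermost real loop.
struct Loop {
  unsigned depth = 0;
  Loop *outer = nullptr;
  std::vector<std::unique_ptr<Loop>> inner;
};

// Append a new child loop to PARENT and return a pointer to it.
Loop *add_child(Loop &parent) {
  auto child = std::make_unique<Loop>();
  child->depth = parent.depth + 1;
  child->outer = &parent;
  parent.inner.push_back(std::move(child));
  return parent.inner.back().get();
}

// Return true if VECT_LOOP itself, one of its siblings, or any loop nested
// inside a sibling is deeper than MAX_DEPTH.
bool in_deep_nested_loop(const Loop &vect_loop, unsigned max_depth) {
  if (vect_loop.depth > max_depth)
    return true;
  if (vect_loop.depth < 2 || vect_loop.outer == nullptr)
    return false;

  // Seed the worklist with vect_loop and all of its siblings.
  std::vector<const Loop *> stack;
  for (const auto &sibling : vect_loop.outer->inner)
    stack.push_back(sibling.get());

  // Iterative DFS; the loop tree has no cycles, so no visited set is needed.
  while (!stack.empty()) {
    const Loop *l = stack.back();
    stack.pop_back();
    if (l->depth > max_depth)
      return true;
    for (const auto &child : l->inner)
      stack.push_back(child.get());
  }
  return false;
}
```

With the nest from the patch comment, vect_loop_1 sits at depth 4 next to loop4, whose sub-nest reaches depth 9, so the sketch reports it as deeply nested for a threshold of 8 even though its own depth is well below the threshold.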
  
Richard Biener Nov. 28, 2024, 8:57 a.m. UTC | #3
On Thu, Nov 28, 2024 at 3:04 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Nov 27, 2024 at 9:43 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Nov 27, 2024 at 4:26 AM liuhongt <hongtao.liu@intel.com> wrote:
> > >
> > while the above two are "obvious", what you check below isn't clear to me.
> > Is this trying to compute whether 'vect_loop' is inside of a loop nest which
> > at any sibling of vect_loop (or even sibling of an outer loop of vect_loop,
> > recursively) is a sub-nest with a loop depth (relative to what?) exceeds
> > ix86_vect_max_loop_depth?
> Yes, the function tries to find if the vect_loop is in a "big outer
> loop" which contains an innermost loop with loop_depth >
> ix86_vect_max_loop_depth.
> If yes, then prevent vectorization for the loop if its tripcount is
> not constant VF-times.(requires any kind of versioning is not
> accurate, and yes it's a big hammer.)

I'll note it also doesn't seem to look at register pressure at all or limit
the cut-off to the very-cheap cost model?

That said, it feels like a hack specifically for 548.exchange_r; in particular,
vectorization by itself shouldn't increase register pressure (much), but
exchange is known to operate on the bounds of "awful" with regard to
register pressure.  If you enabled APX, would exchange benefit from
vectorizing?

That said, I think we have to live with the regression, the change feels
odd and a strcmp (main_file_name, "exchange") would be similar.  So
we either need to make the pattern matching more precise, like counting
live IVs from the loop nest plus applying heuristics on how vectorization
increases register pressure (maybe it's an IVOPTs issue in the end?),
or defer a solution to GCC 16.

Richard.

> >
> > > +  class loop* outer_loop = loop_outer (vect_loop);
> > > +
> > > +  auto_vec<class loop*> m_loop_stack;
> > > +  auto_sbitmap m_visited_loops (number_of_loops (cfun));
> > > +
> > > +  /* Get all sibling loops for vect_loop.  */
> > > +  class loop* next_loop = outer_loop->inner;
> > > +  for (; next_loop; next_loop = next_loop->next)
> > > +    {
> > > +      m_loop_stack.safe_push (next_loop);
> > > +      bitmap_set_bit (m_visited_loops, next_loop->num);
> > > +    }
> > > +
> > > +  /* DFS the max depth of all sibling loop.  */
> > > +  while (!m_loop_stack.is_empty ())
> > > +    {
> > > +      next_loop = m_loop_stack.pop ();
> > > +      if (loop_depth (next_loop) > (unsigned) ix86_vect_max_loop_depth)
> > > +       return true;
> > > +
> > > +      class loop* inner_loop = next_loop->inner;
> > > +      while (inner_loop)
> > > +       {
> > > +         if (!bitmap_bit_p (m_visited_loops, inner_loop->num))
> > > +           {
> > > +             m_loop_stack.safe_push (inner_loop);
> > > +             bitmap_set_bit (m_visited_loops, inner_loop->num);
> > > +           }
> > > +         inner_loop = inner_loop->next;
> > > +       }
> > > +    }
> > > +
> > > +  return false;
> > > +}
> > > +
> > >  void
> > >  ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > >  {
> > > @@ -25344,6 +25424,15 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > >           && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> > >               > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
> > >         m_costs[vect_body] = INT_MAX;
> > > +
> > > +      /* Prohibit vectorization when the loop requires versioning
> > > +        and loop_depth exceeds threshold.  */
> > > +      if ((LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > +          || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > +          || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > > +          || vect_apply_runtime_profitability_check_p (loop_vinfo))
> >
> > this also applies to loops not requiring versioning - it practically applies to
> > all loops that do not run for a constant VF-times times.
> >
> > > +         && ix86_vect_in_deep_nested_loop_p (LOOP_VINFO_LOOP (loop_vinfo)))
> > > +       m_costs[vect_body] = INT_MAX;
> > >      }
> > >
> > >    ix86_vect_estimate_reg_pressure ();
> > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > index 99e86f545e8..c5abf83473d 100644
> > > --- a/gcc/config/i386/i386.opt
> > > +++ b/gcc/config/i386/i386.opt
> > > @@ -1414,3 +1414,7 @@ Support MOVRS built-in functions and code generation.
> > >  mamx-movrs
> > >  Target Mask(ISA2_AMX_MOVRS) Var(ix86_isa_flags2) Save
> > >  Support AMX-MOVRS built-in functions and code generation.
> > > +
> > > +-param=vect-max-loop-depth=
> > > +Target Joined UInteger Var(ix86_vect_max_loop_depth) Init(8) Param
> > > +Preversion loop vectorization when it's in a deeply nested loop and requires versioning, since it may increase register pressure too much.
> > > --
> > > 2.34.1
> > >
>
>
>
> --
> BR,
> Hongtao
  
Hongtao Liu Nov. 29, 2024, 1:30 a.m. UTC | #4
On Thu, Nov 28, 2024 at 4:57 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Thu, Nov 28, 2024 at 3:04 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Wed, Nov 27, 2024 at 9:43 PM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Nov 27, 2024 at 4:26 AM liuhongt <hongtao.liu@intel.com> wrote:
> > > >
> > > while the above two are "obvious", what you check below isn't clear to me.
> > > Is this trying to compute whether 'vect_loop' is inside of a loop nest which
> > > at any sibling of vect_loop (or even sibling of an outer loop of vect_loop,
> > > recursively) is a sub-nest with a loop depth (relative to what?) exceeds
> > > ix86_vect_max_loop_depth?
> > Yes, the function tries to find if the vect_loop is in a "big outer
> > loop" which contains an innermost loop with loop_depth >
> > ix86_vect_max_loop_depth.
> > If yes, then prevent vectorization for the loop if its tripcount is
> > not constant VF-times.(requires any kind of versioning is not
> > accurate, and yes it's a big hammer.)
>
> I'll note it also doesn't seem to look at register pressure at all or limit
> the cut-off to the very-cheap cost model?
The default value of ix86_vect_max_loop_depth is a proxy for register
pressure: each loop level generally needs 2 registers, 1 IV
+ 1 trip count.
A loop depth > 8 will therefore run out of the 16 registers. Vectorizing
a loop with an unknown trip count takes 1 more register for the "new
trip count of the main vectorized loop", and that causes an extra spill in
the outer loop.
If the trip count of vect_loop is big, the penalty can be
compensated by the vectorization, but for exchange_r the trip count is
small, which is why performance regresses.
Range info doesn't give a real range for the trip count, since Fortran
allows negative indices for arrays.
I'm not sure whether ivopts or IRA could be improved for this case, so maybe
we should live with this regression.
Restricting the cut-off to the very-cheap cost model would also be a
good idea.
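
The register arithmetic above can be written down as a back-of-the-envelope model. This is only a sketch under the assumptions stated in this message (2 registers per loop level, 16 registers in total, 1 extra register when vectorizing with an unknown trip count); the names are hypothetical, and the real register allocator is of course not this simple.

```cpp
#include <cassert>

// Assumed cost model from the discussion: one IV plus one trip count per level.
constexpr unsigned kRegsPerLoopLevel = 2;
// General-purpose registers on x86-64 without APX.
constexpr unsigned kGeneralRegs = 16;

// Estimate how many registers stay live across a nest of DEPTH loops.
// Vectorizing a loop whose trip count is unknown adds one register for the
// trip count of the main vectorized loop.
constexpr unsigned estimated_live_regs(unsigned depth,
                                       bool vectorized_unknown_niters) {
  return depth * kRegsPerLoopLevel + (vectorized_unknown_niters ? 1u : 0u);
}

// True if the estimate no longer fits into the available registers.
constexpr bool likely_spills(unsigned depth, bool vectorized_unknown_niters) {
  return estimated_live_regs(depth, vectorized_unknown_niters) > kGeneralRegs;
}

// A nest of depth 8 just fits; the extra register from vectorization tips it over.
static_assert(!likely_spills(8, false), "depth 8 fits when not vectorized");
static_assert(likely_spills(8, true), "depth 8 spills when vectorized");
```

Under this model a nest of depth 8 is exactly at the limit, which is consistent with the default of 8 chosen for the parameter.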
>
> That said, it feels like a hack specifically for 548.exchange_r, in particular
> vectorization by itself shouldn't increase register pressure (much), but
> exchange is known to operate on the bounds of "awful" with regard to
> register pressure.  If you'd enable APX would exchange benefit from
> vectorizing?
APX would fix this (it even improves exchange w/o vectorization).
>
> That said, I think we have to live with the regression, the change feels
> odd and a strcmp (main_file_name, "exchange") would be similar.  So
> we either need to make the pattern matching more precise, like counting
> live IVs from the loop nest plus applying heuristics on how vectorization
> increases register pressure (maybe it's an IVOPTs issue in the end?),
> or defer a solution to GCC 16.
>
> Richard.
  
Richard Biener Nov. 29, 2024, 8:15 a.m. UTC | #5
On Fri, Nov 29, 2024 at 2:30 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, Nov 28, 2024 at 4:57 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Thu, Nov 28, 2024 at 3:04 AM Hongtao Liu <crazylht@gmail.com> wrote:
> > >
> > > On Wed, Nov 27, 2024 at 9:43 PM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > > On Wed, Nov 27, 2024 at 4:26 AM liuhongt <hongtao.liu@intel.com> wrote:
> > > > >
> > > > > When loop requires any kind of versioning which could increase register
> > > > > pressure too much, and it's in a deeply nest big loop, don't do
> > > > > vectorization.
> > > > >
> > > > > I tested the patch with both Ofast and O2 for SPEC2017, besides 548.exchange_r,
> > > > > other benchmarks are same binary.
> > > > >
> > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> > > > > Any comments?
> > > >
> > > > The vectorizer tries to version an outer loop when vectorizing a loop nest
> > > > and the versioning condition is invariant.  See vect_loop_versioning.  This
> > > > tries to handle such cases.  Often the generated runtime alias checks are
> > > > not invariant because we do not consider the outer evolutions.  I think we
> > > > should instead fix this there.
> > > >
> > > > Question below ...
> > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         pr target/117088
> > > > >         * config/i386/i386.cc
> > > > >         (ix86_vector_costs::ix86_vect_in_deep_nested_loop_p): New function.
> > > > >         (ix86_vector_costs::finish_cost): Prevent loop vectorization
> > > > >         if it's in a deeply nested loop and require versioning.
> > > > >         * config/i386/i386.opt (--param=vect-max-loop-depth=): New
> > > > >         param.
> > > > > ---
> > > > >  gcc/config/i386/i386.cc  | 89 ++++++++++++++++++++++++++++++++++++++++
> > > > >  gcc/config/i386/i386.opt |  4 ++
> > > > >  2 files changed, 93 insertions(+)
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > > index 526c9df7618..608f40413d2 100644
> > > > > --- a/gcc/config/i386/i386.cc
> > > > > +++ b/gcc/config/i386/i386.cc
> > > > > @@ -25019,6 +25019,8 @@ private:
> > > > >
> > > > >    /* Estimate register pressure of the vectorized code.  */
> > > > >    void ix86_vect_estimate_reg_pressure ();
> > > > > +  /* Check if vect_loop is in a deeply-nested loop.  */
> > > > > +  bool ix86_vect_in_deep_nested_loop_p (class loop *vect_loop);
> > > > >    /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
> > > > >       estimation of register pressure.
> > > > >       ??? Currently it's only used by vec_construct/scalar_to_vec
> > > > > @@ -25324,6 +25326,84 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
> > > > >      }
> > > > >  }
> > > > >
> > > > > +/* Return true if vect_loop is in a deeply-nested loop.
> > > > > > +   i.e. vect_loop_n in the loop structure below.
> > > > > +loop1
> > > > > +{
> > > > > + loop2
> > > > > + {
> > > > > +  loop3
> > > > > +  {
> > > > > +   vect_loop_1;
> > > > > +   loop4
> > > > > +   {
> > > > > +    vect_loop_2;
> > > > > +    loop5
> > > > > +    {
> > > > > +     vect_loop_3;
> > > > > +     loop6
> > > > > +     {
> > > > > +      vect_loop_4;
> > > > > +      loop7
> > > > > +      {
> > > > > +       vect_loop_5;
> > > > > +       loop8
> > > > > +       {
> > > > > +       loop9
> > > > > +       }
> > > > > +      vect_loop_6;
> > > > > +      }
> > > > > +     vect_loop_7;
> > > > > +     }
> > > > > +    }
> > > > > +   }
> > > > > + }
> > > > > + It's a big hammer to fix O2 regression for 548.exchange_r after vectorization
> > > > > + is enhanced by (r15-4225-g70c3db511ba14f)  */
> > > > > +bool
> > > > > +ix86_vector_costs::ix86_vect_in_deep_nested_loop_p (class loop *vect_loop)
> > > > > +{
> > > > > +  if (loop_depth (vect_loop) > (unsigned) ix86_vect_max_loop_depth)
> > > > > +    return true;
> > > > > +
> > > > > +  if (loop_depth (vect_loop) < 2)
> > > > > +    return false;
> > > > > +
> > > >
> > > > while the above two are "obvious", what you check below isn't clear to me.
> > > > Is this trying to compute whether 'vect_loop' is inside of a loop nest which
> > > > at any sibling of vect_loop (or even sibling of an outer loop of vect_loop,
> > > > recursively) is a sub-nest with a loop depth (relative to what?) exceeds
> > > > ix86_vect_max_loop_depth?
> > > Yes, the function checks whether vect_loop is in a "big outer
> > > loop" which contains an innermost loop with loop_depth >
> > > ix86_vect_max_loop_depth.
> > > If so, it prevents vectorization of the loop when its tripcount is
> > > not a constant multiple of VF.  (Saying it "requires any kind of
> > > versioning" is not accurate, and yes, it's a big hammer.)
> >
> > I'll note it also doesn't seem to look at register pressure at all or limit
> > the cut-off to the very-cheap cost model?
> The default value of ix86_vect_max_loop_depth stands in for register
> pressure: each loop level generally needs 2 registers (1 IV +
> 1 tripcount), so a loop depth > 8 runs out of the 16 registers.  The
> vectorization for an unknown tripcount needs 1 more register for the "new
> tripcount of main vectorized loop", and it causes an extra spill in the
> outer loop.

Hmm, yeah - one of my long-term TODOs is to re-do how the vectorizer
generates the main IV code to use a decrement to zero remaining
scalar iterations for prologue/vector/epilogue loops.  The disadvantage
is we'd get a decrement by VF and test >= VF in the vector loop
rather than decrement by one and test for zero, but I think IVOPTs
will eventually recover this IV.
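
To make the two counting schemes contrasted above concrete, here is a minimal
standalone sketch.  It is not GCC code and not what the vectorizer emits; the
type IterSplit and both function names are hypothetical, and the loops only
count iterations instead of doing work.  Scheme A keeps a separate vector
tripcount next to the scalar one, scheme B keeps a single count of remaining
scalar iterations.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: how many times each loop body runs.
struct IterSplit {
  std::size_t vector_iters;  // iterations of the vector body
  std::size_t scalar_iters;  // iterations of the scalar epilogue
};

// Scheme A: derive a separate vector tripcount and count it down by one,
// testing against zero.  vec_niters is an extra value that is live in
// addition to niters.
IterSplit split_with_vector_tripcount(std::size_t niters, std::size_t vf) {
  assert(vf > 0);
  IterSplit r{0, 0};
  std::size_t vec_niters = niters / vf;
  for (std::size_t i = vec_niters; i != 0; --i)
    ++r.vector_iters;
  for (std::size_t i = niters - vec_niters * vf; i != 0; --i)
    ++r.scalar_iters;
  return r;
}

// Scheme B: a single counter of remaining scalar iterations shared by the
// vector loop and the epilogue.  The vector loop decrements by VF and
// tests >= VF; only the epilogue counts down by one to zero.
IterSplit split_with_remaining_counter(std::size_t niters, std::size_t vf) {
  assert(vf > 0);
  IterSplit r{0, 0};
  std::size_t remaining = niters;
  for (; remaining >= vf; remaining -= vf)
    ++r.vector_iters;
  for (; remaining != 0; --remaining)
    ++r.scalar_iters;
  return r;
}
```

Both functions split the iterations identically; the difference is only in
which values stay live and which exit test the vector loop uses.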

My experience with exchange is that LRA spills the "wrong" regs here
and that IVOPTs only considers a loop in isolation rather than a set
of sibling loops within an outer loop for example.

At some point I thought of doing aggressive unswitching of outer loops
on 'mod(row,3)' somehow, that might reduce the complexity of the
nests considerably, enabling IRA/LRA to do a better job eventually.

But anyway, I think exchange is special enough that we shouldn't
worry about the regression too much - at least not enough to invent
this kind of odd heuristic for it.

> If the tripcount of vect_loop is big, the penalty is compensated
> by the vectorization, but for exchange_r the tripcount is
> small, which is why performance regresses.
> Range info doesn't give a real range for the tripcount since Fortran
> allows negative indices for arrays.
> I'm not sure if IVOPTs or IRA could be improved for this case, so maybe
> we should live with this regression.
> Restricting the cut-off to the very-cheap cost model would also be a
> good idea.
> >
> > That said, it feels like a hack specifically for 548.exchange_r, in particular
> > vectorization by itself shouldn't increase register pressure (much), but
> > exchange is known to operate on the bounds of "awful" with regard to
> > register pressure.  If you'd enable APX would exchange benefit from
> > vectorizing?
> APX would fix this (it even improves exchange w/o vectorization).
> >
> > That said, I think we have to live with the regression, the change feels
> > odd and a strcmp (main_file_name, "exchange") would be similar.  So
> > we either need to make the pattern matching more precise, like counting
> > live IVs from the loop nest plus applying heuristics on how vectorization
> > increases register pressure (maybe it's an IVOPTs issue in the end?),
> > or defer a solution to GCC 16.
> >
> > Richard.
> >
> > > >
> > > > > +  class loop* outer_loop = loop_outer (vect_loop);
> > > > > +
> > > > > +  auto_vec<class loop*> m_loop_stack;
> > > > > +  auto_sbitmap m_visited_loops (number_of_loops (cfun));
> > > > > +
> > > > > +  /* Get all sibling loops for vect_loop.  */
> > > > > +  class loop* next_loop = outer_loop->inner;
> > > > > +  for (; next_loop; next_loop = next_loop->next)
> > > > > +    {
> > > > > +      m_loop_stack.safe_push (next_loop);
> > > > > +      bitmap_set_bit (m_visited_loops, next_loop->num);
> > > > > +    }
> > > > > +
> > > > > +  /* DFS the max depth of all sibling loop.  */
> > > > > +  while (!m_loop_stack.is_empty ())
> > > > > +    {
> > > > > +      next_loop = m_loop_stack.pop ();
> > > > > +      if (loop_depth (next_loop) > (unsigned) ix86_vect_max_loop_depth)
> > > > > +       return true;
> > > > > +
> > > > > +      class loop* inner_loop = next_loop->inner;
> > > > > +      while (inner_loop)
> > > > > +       {
> > > > > +         if (!bitmap_bit_p (m_visited_loops, inner_loop->num))
> > > > > +           {
> > > > > +             m_loop_stack.safe_push (inner_loop);
> > > > > +             bitmap_set_bit (m_visited_loops, inner_loop->num);
> > > > > +           }
> > > > > +         inner_loop = inner_loop->next;
> > > > > +       }
> > > > > +    }
> > > > > +
> > > > > +  return false;
> > > > > +}
> > > > > +
> > > > >  void
> > > > >  ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > > > >  {
> > > > > @@ -25344,6 +25424,15 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > > > >           && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> > > > >               > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
> > > > >         m_costs[vect_body] = INT_MAX;
> > > > > +
> > > > > +      /* Prohibit vectorization when the loop requires versioning
> > > > > +        and loop_depth exceeds threshold.  */
> > > > > +      if ((LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > > > +          || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > > +          || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > > > > +          || vect_apply_runtime_profitability_check_p (loop_vinfo))
> > > >
> > > > > this also applies to loops not requiring versioning - it practically applies to
> > > > > all loops whose iteration count is not a constant multiple of VF.
> > > >
> > > > > +         && ix86_vect_in_deep_nested_loop_p (LOOP_VINFO_LOOP (loop_vinfo)))
> > > > > +       m_costs[vect_body] = INT_MAX;
> > > > >      }
> > > > >
> > > > >    ix86_vect_estimate_reg_pressure ();
> > > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > > > index 99e86f545e8..c5abf83473d 100644
> > > > > --- a/gcc/config/i386/i386.opt
> > > > > +++ b/gcc/config/i386/i386.opt
> > > > > @@ -1414,3 +1414,7 @@ Support MOVRS built-in functions and code generation.
> > > > >  mamx-movrs
> > > > >  Target Mask(ISA2_AMX_MOVRS) Var(ix86_isa_flags2) Save
> > > > >  Support AMX-MOVRS built-in functions and code generation.
> > > > > +
> > > > > +-param=vect-max-loop-depth=
> > > > > +Target Joined UInteger Var(ix86_vect_max_loop_depth) Init(8) Param
> > > > > > +Prevent loop vectorization when it's in a deeply nested loop and requires versioning, since it may increase register pressure too much.
> > > > > --
> > > > > 2.34.1
> > > > >
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
>
>
>
> --
> BR,
> Hongtao
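
For readers following the discussion of what the nest check computes, the
traversal can be modeled outside GCC roughly as below.  This is a sketch under
assumptions, not the patch itself: the Loop struct, add_child and
in_deep_nested_loop are hypothetical stand-ins for GCC's class loop,
loop_depth () and ix86_vect_in_deep_nested_loop_p, and max_depth plays the
role of ix86_vect_max_loop_depth.  Since the loop nest is a tree, the sketch
drops the visited bitmap used in the patch.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical stand-in for GCC's loop tree node.
struct Loop {
  unsigned depth = 0;     // 1 for an outermost loop
  Loop *outer = nullptr;  // enclosing loop, if any
  Loop *inner = nullptr;  // first child loop
  Loop *next = nullptr;   // next sibling loop
};

// Hypothetical helper: create a child of `parent`, stored in `pool`.
// std::deque keeps earlier elements at stable addresses on push_back.
Loop *add_child(Loop *parent, std::deque<Loop> &pool) {
  pool.emplace_back();
  Loop *l = &pool.back();
  l->depth = parent->depth + 1;
  l->outer = parent;
  l->next = parent->inner;  // prepend to the sibling list
  parent->inner = l;
  return l;
}

// Return true if vect_loop itself, or any loop nested in one of its
// siblings (vect_loop included), is deeper than max_depth.
bool in_deep_nested_loop(const Loop *vect_loop, unsigned max_depth) {
  assert(vect_loop != nullptr);
  if (vect_loop->depth > max_depth)
    return true;
  if (vect_loop->depth < 2)
    return false;

  // Seed the stack with all loops sharing vect_loop's parent.
  std::vector<const Loop *> stack;
  for (const Loop *l = vect_loop->outer->inner; l; l = l->next)
    stack.push_back(l);

  // Depth-first walk over the sub-nests of those siblings.
  while (!stack.empty()) {
    const Loop *l = stack.back();
    stack.pop_back();
    if (l->depth > max_depth)
      return true;
    for (const Loop *c = l->inner; c; c = c->next)
      stack.push_back(c);
  }
  return false;
}
```

Note that, as in the patch, only the siblings of vect_loop and their sub-nests
are searched; siblings of enclosing loops are not visited.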
  

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 526c9df7618..608f40413d2 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25019,6 +25019,8 @@  private:
 
   /* Estimate register pressure of the vectorized code.  */
   void ix86_vect_estimate_reg_pressure ();
+  /* Check if vect_loop is in a deeply-nested loop.  */
+  bool ix86_vect_in_deep_nested_loop_p (class loop *vect_loop);
   /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
      estimation of register pressure.
      ??? Currently it's only used by vec_construct/scalar_to_vec
@@ -25324,6 +25326,84 @@  ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
     }
 }
 
+/* Return true if vect_loop is in a deeply-nested loop.
+   i.e. vect_loop_n in the loop structure below.
+loop1
+{
+ loop2
+ {
+  loop3
+  {
+   vect_loop_1;
+   loop4
+   {
+    vect_loop_2;
+    loop5
+    {
+     vect_loop_3;
+     loop6
+     {
+      vect_loop_4;
+      loop7
+      {
+       vect_loop_5;
+       loop8
+       {
+       loop9
+       }
+      vect_loop_6;
+      }
+     vect_loop_7;
+     }
+    }
+   }
+ }
+ It's a big hammer to fix O2 regression for 548.exchange_r after vectorization
+ is enhanced by (r15-4225-g70c3db511ba14f)  */
+bool
+ix86_vector_costs::ix86_vect_in_deep_nested_loop_p (class loop *vect_loop)
+{
+  if (loop_depth (vect_loop) > (unsigned) ix86_vect_max_loop_depth)
+    return true;
+
+  if (loop_depth (vect_loop) < 2)
+    return false;
+
+  class loop* outer_loop = loop_outer (vect_loop);
+
+  auto_vec<class loop*> m_loop_stack;
+  auto_sbitmap m_visited_loops (number_of_loops (cfun));
+
+  /* Get all sibling loops for vect_loop.  */
+  class loop* next_loop = outer_loop->inner;
+  for (; next_loop; next_loop = next_loop->next)
+    {
+      m_loop_stack.safe_push (next_loop);
+      bitmap_set_bit (m_visited_loops, next_loop->num);
+    }
+
+  /* DFS the max depth of all sibling loop.  */
+  while (!m_loop_stack.is_empty ())
+    {
+      next_loop = m_loop_stack.pop ();
+      if (loop_depth (next_loop) > (unsigned) ix86_vect_max_loop_depth)
+	return true;
+
+      class loop* inner_loop = next_loop->inner;
+      while (inner_loop)
+	{
+	  if (!bitmap_bit_p (m_visited_loops, inner_loop->num))
+	    {
+	      m_loop_stack.safe_push (inner_loop);
+	      bitmap_set_bit (m_visited_loops, inner_loop->num);
+	    }
+	  inner_loop = inner_loop->next;
+	}
+    }
+
+  return false;
+}
+
 void
 ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
 {
@@ -25344,6 +25424,15 @@  ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
 	  && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
 	      > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
 	m_costs[vect_body] = INT_MAX;
+
+      /* Prohibit vectorization when the loop requires versioning
+	 and loop_depth exceeds threshold.  */
+      if ((LOOP_REQUIRES_VERSIONING (loop_vinfo)
+	   || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
+	   || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+	   || vect_apply_runtime_profitability_check_p (loop_vinfo))
+	  && ix86_vect_in_deep_nested_loop_p (LOOP_VINFO_LOOP (loop_vinfo)))
+	m_costs[vect_body] = INT_MAX;
     }
 
   ix86_vect_estimate_reg_pressure ();
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 99e86f545e8..c5abf83473d 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1414,3 +1414,7 @@  Support MOVRS built-in functions and code generation.
 mamx-movrs
 Target Mask(ISA2_AMX_MOVRS) Var(ix86_isa_flags2) Save
 Support AMX-MOVRS built-in functions and code generation.
+
+-param=vect-max-loop-depth=
+Target Joined UInteger Var(ix86_vect_max_loop_depth) Init(8) Param
+Prevent loop vectorization when it's in a deeply nested loop and requires versioning, since it may increase register pressure too much.