vect: Hookize better_loop_vinfo_p

Message ID mptwnlj0vut.fsf@arm.com
State Committed
Headers
Series vect: Hookize better_loop_vinfo_p |

Commit Message

Richard Sandiford Nov. 8, 2021, 10:45 a.m. UTC
  One of the things we want to do on AArch64 is compare vector loops
side-by-side and pick the best one.  For some targets, we want this
to be based on issue rates as well as the usual latency-based costs
(at least for loops with relatively high iteration counts).

The current approach to doing this is: when costing vectorisation
candidate A, try to guess what the other main candidate B will look
like and adjust A's latency-based cost up or down based on the likely
difference between A and B's issue rates.  This effectively means
that we try to cost parts of B at the same time as A, without actually
being able to see B.

This is needlessly indirect and complex.  It was a compromise due
to the code being added (too) late in the GCC 11 cycle, so that
target-independent changes weren't possible.

The target-independent code already compares two candidate loop_vec_infos
side-by-side, so that information about A and B above are available
directly.  This patch creates a way for targets to hook into this
comparison.

The AArch64 code can therefore hook into better_main_loop_than_p to
compare issue rates.  If the issue rate comparison isn't decisive,
the code can fall back to the normal latency-based comparison instead.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard


gcc/
	* tree-vectorizer.h (vector_costs::better_main_loop_than_p)
	(vector_costs::better_epilogue_loop_than_p)
	(vector_costs::compare_inside_loop_cost)
	(vector_costs::compare_outside_loop_cost): Likewise.
	* tree-vectorizer.c (vector_costs::better_main_loop_than_p)
	(vector_costs::better_epilogue_loop_than_p)
	(vector_costs::compare_inside_loop_cost)
	(vector_costs::compare_outside_loop_cost): New functions,
	containing code moved from...
	* tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
---
 gcc/tree-vect-loop.c  | 142 ++---------------------------
 gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-vectorizer.h |  17 ++++
 3 files changed, 226 insertions(+), 137 deletions(-)
  

Comments

Richard Biener Nov. 8, 2021, 10:57 a.m. UTC | #1
On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> One of the things we want to do on AArch64 is compare vector loops
> side-by-side and pick the best one.  For some targets, we want this
> to be based on issue rates as well as the usual latency-based costs
> (at least for loops with relatively high iteration counts).
>
> The current approach to doing this is: when costing vectorisation
> candidate A, try to guess what the other main candidate B will look
> like and adjust A's latency-based cost up or down based on the likely
> difference between A and B's issue rates.  This effectively means
> that we try to cost parts of B at the same time as A, without actually
> being able to see B.

I think the idea of the current costing is that compares are always
to the original scalar loop (with comparing the resulting magic
cost numbers) so that by means of transitivity comparing two
vector variants work as well.  So I'm not sure where 'variant B'
comes into play here?

> This is needlessly indirect and complex.  It was a compromise due
> to the code being added (too) late in the GCC 11 cycle, so that
> target-independent changes weren't possible.
>
> The target-independent code already compares two candidate loop_vec_infos
> side-by-side, so that information about A and B above are available
> directly.  This patch creates a way for targets to hook into this
> comparison.

You mean it has both loop_vec_infos.  But as said I'm not sure it's a good
idea to compare those side-by-side - will that not lead to _more_ special-casing
since you need to have a working compare to the scalar variant as well
to determine the vectorization threshold?

>
> The AArch64 code can therefore hook into better_main_loop_than_p to
> compare issue rates.  If the issue rate comparison isn't decisive,
> the code can fall back to the normal latency-based comparison instead.
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>
> Richard
>
>
> gcc/
>         * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
>         (vector_costs::better_epilogue_loop_than_p)
>         (vector_costs::compare_inside_loop_cost)
>         (vector_costs::compare_outside_loop_cost): Likewise.
>         * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
>         (vector_costs::better_epilogue_loop_than_p)
>         (vector_costs::compare_inside_loop_cost)
>         (vector_costs::compare_outside_loop_cost): New functions,
>         containing code moved from...
>         * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
> ---
>  gcc/tree-vect-loop.c  | 142 ++---------------------------
>  gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-vectorizer.h |  17 ++++
>  3 files changed, 226 insertions(+), 137 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index dd4a363fee5..c9ee2e15e35 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>         return new_simdlen_p;
>      }
>
> -  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
> -  if (main_loop)
> -    {
> -      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
> -      unsigned HOST_WIDE_INT main_vf;
> -      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
> -      /* If we can determine how many iterations are left for the epilogue
> -        loop, that is if both the main loop's vectorization factor and number
> -        of iterations are constant, then we use them to calculate the cost of
> -        the epilogue loop together with a 'likely value' for the epilogues
> -        vectorization factor.  Otherwise we use the main loop's vectorization
> -        factor and the maximum poly value for the epilogue's.  If the target
> -        has not provided with a sensible upper bound poly vectorization
> -        factors are likely to be favored over constant ones.  */
> -      if (main_poly_vf.is_constant (&main_vf)
> -         && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
> -       {
> -         unsigned HOST_WIDE_INT niters
> -           = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> -         HOST_WIDE_INT old_likely_vf
> -           = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
> -         HOST_WIDE_INT new_likely_vf
> -           = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
> -
> -         /* If the epilogue is using partial vectors we account for the
> -            partial iteration here too.  */
> -         old_factor = niters / old_likely_vf;
> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
> -             && niters % old_likely_vf != 0)
> -           old_factor++;
> -
> -         new_factor = niters / new_likely_vf;
> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
> -             && niters % new_likely_vf != 0)
> -           new_factor++;
> -       }
> -      else
> -       {
> -         unsigned HOST_WIDE_INT main_vf_max
> -           = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> -
> -         old_factor = main_vf_max / estimated_poly_value (old_vf,
> -                                                          POLY_VALUE_MAX);
> -         new_factor = main_vf_max / estimated_poly_value (new_vf,
> -                                                          POLY_VALUE_MAX);
> -
> -         /* If the loop is not using partial vectors then it will iterate one
> -            time less than one that does.  It is safe to subtract one here,
> -            because the main loop's vf is always at least 2x bigger than that
> -            of an epilogue.  */
> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
> -           old_factor -= 1;
> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
> -           new_factor -= 1;
> -       }
> -
> -      /* Compute the costs by multiplying the inside costs with the factor and
> -        add the outside costs for a more complete picture.  The factor is the
> -        amount of times we are expecting to iterate this epilogue.  */
> -      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
> -      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
> -      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
> -      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
> -      return new_cost < old_cost;
> -    }
> -
> -  /* Limit the VFs to what is likely to be the maximum number of iterations,
> -     to handle cases in which at least one loop_vinfo is fully-masked.  */
> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> -  if (estimated_max_niter != -1)
> -    {
> -      if (known_le (estimated_max_niter, new_vf))
> -       new_vf = estimated_max_niter;
> -      if (known_le (estimated_max_niter, old_vf))
> -       old_vf = estimated_max_niter;
> -    }
> -
> -  /* Check whether the (fractional) cost per scalar iteration is lower
> -     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
> -  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
> -  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
> -
> -  HOST_WIDE_INT est_rel_new_min
> -    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
> -  HOST_WIDE_INT est_rel_new_max
> -    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
> -
> -  HOST_WIDE_INT est_rel_old_min
> -    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
> -  HOST_WIDE_INT est_rel_old_max
> -    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
> -
> -  /* Check first if we can make out an unambigous total order from the minimum
> -     and maximum estimates.  */
> -  if (est_rel_new_min < est_rel_old_min
> -      && est_rel_new_max < est_rel_old_max)
> -    return true;
> -  else if (est_rel_old_min < est_rel_new_min
> -          && est_rel_old_max < est_rel_new_max)
> -    return false;
> -  /* When old_loop_vinfo uses a variable vectorization factor,
> -     we know that it has a lower cost for at least one runtime VF.
> -     However, we don't know how likely that VF is.
> -
> -     One option would be to compare the costs for the estimated VFs.
> -     The problem is that that can put too much pressure on the cost
> -     model.  E.g. if the estimated VF is also the lowest possible VF,
> -     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
> -     for the estimated VF, we'd then choose new_loop_vinfo even
> -     though (a) new_loop_vinfo might not actually be better than
> -     old_loop_vinfo for that VF and (b) it would be significantly
> -     worse at larger VFs.
> -
> -     Here we go for a hacky compromise: pick new_loop_vinfo if it is
> -     no more expensive than old_loop_vinfo even after doubling the
> -     estimated old_loop_vinfo VF.  For all but trivial loops, this
> -     ensures that we only pick new_loop_vinfo if it is significantly
> -     better than old_loop_vinfo at the estimated VF.  */
> -
> -  if (est_rel_old_min != est_rel_new_min
> -      || est_rel_old_max != est_rel_new_max)
> -    {
> -      HOST_WIDE_INT est_rel_new_likely
> -       = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
> -      HOST_WIDE_INT est_rel_old_likely
> -       = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
> -
> -      return est_rel_new_likely * 2 <= est_rel_old_likely;
> -    }
> -
> -  /* If there's nothing to choose between the loop bodies, see whether
> -     there's a difference in the prologue and epilogue costs.  */
> -  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
> -  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
> -  if (new_outside_cost != old_outside_cost)
> -    return new_outside_cost < old_outside_cost;
> +  const auto *old_costs = old_loop_vinfo->vector_costs;
> +  const auto *new_costs = new_loop_vinfo->vector_costs;
> +  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
> +    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
>
> -  return false;
> +  return new_costs->better_main_loop_than_p (old_costs);
>  }
>
>  /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 9ef76ce654b..dcbb2a3f13a 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
>      }
>    return cost;
>  }
> +
> +/* See the comment above the declaration for details.  */
> +
> +bool
> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
> +{
> +  int diff = compare_inside_loop_cost (other);
> +  if (diff != 0)
> +    return diff < 0;
> +
> +  /* If there's nothing to choose between the loop bodies, see whether
> +     there's a difference in the prologue and epilogue costs.  */
> +  diff = compare_outside_loop_cost (other);
> +  if (diff != 0)
> +    return diff < 0;
> +
> +  return false;
> +}
> +
> +
> +/* See the comment above the declaration for details.  */
> +
> +bool
> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
> +                                          loop_vec_info main_loop) const
> +{
> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
> +
> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
> +
> +  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
> +  unsigned HOST_WIDE_INT main_vf;
> +  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
> +  /* If we can determine how many iterations are left for the epilogue
> +     loop, that is if both the main loop's vectorization factor and number
> +     of iterations are constant, then we use them to calculate the cost of
> +     the epilogue loop together with a 'likely value' for the epilogues
> +     vectorization factor.  Otherwise we use the main loop's vectorization
> +     factor and the maximum poly value for the epilogue's.  If the target
> +     has not provided with a sensible upper bound poly vectorization
> +     factors are likely to be favored over constant ones.  */
> +  if (main_poly_vf.is_constant (&main_vf)
> +      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
> +    {
> +      unsigned HOST_WIDE_INT niters
> +       = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> +      HOST_WIDE_INT other_likely_vf
> +       = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
> +      HOST_WIDE_INT this_likely_vf
> +       = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
> +
> +      /* If the epilogue is using partial vectors we account for the
> +        partial iteration here too.  */
> +      other_factor = niters / other_likely_vf;
> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
> +         && niters % other_likely_vf != 0)
> +       other_factor++;
> +
> +      this_factor = niters / this_likely_vf;
> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
> +         && niters % this_likely_vf != 0)
> +       this_factor++;
> +    }
> +  else
> +    {
> +      unsigned HOST_WIDE_INT main_vf_max
> +       = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> +
> +      other_factor = main_vf_max / estimated_poly_value (other_vf,
> +                                                      POLY_VALUE_MAX);
> +      this_factor = main_vf_max / estimated_poly_value (this_vf,
> +                                                      POLY_VALUE_MAX);
> +
> +      /* If the loop is not using partial vectors then it will iterate one
> +        time less than one that does.  It is safe to subtract one here,
> +        because the main loop's vf is always at least 2x bigger than that
> +        of an epilogue.  */
> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
> +       other_factor -= 1;
> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
> +       this_factor -= 1;
> +    }
> +
> +  /* Compute the costs by multiplying the inside costs with the factor and
> +     add the outside costs for a more complete picture.  The factor is the
> +     amount of times we are expecting to iterate this epilogue.  */
> +  other_cost = other->body_cost () * other_factor;
> +  this_cost = this->body_cost () * this_factor;
> +  other_cost += other->outside_cost ();
> +  this_cost += this->outside_cost ();
> +  return this_cost < other_cost;
> +}
> +
> +/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
> +   determine the return value of better_main_loop_than_p by comparing the
> +   inside (loop body) costs of THIS and OTHER.  Return:
> +
> +   * -1 if better_main_loop_than_p should return true.
> +   * 1 if better_main_loop_than_p should return false.
> +   * 0 if we can't decide.  */
> +
> +int
> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
> +{
> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
> +
> +  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
> +  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
> +
> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
> +
> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> +  if (estimated_max_niter != -1)
> +    {
> +      if (known_le (estimated_max_niter, this_vf))
> +       this_vf = estimated_max_niter;
> +      if (known_le (estimated_max_niter, other_vf))
> +       other_vf = estimated_max_niter;
> +    }
> +
> +  /* Check whether the (fractional) cost per scalar iteration is lower or
> +     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
> +  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
> +  poly_int64 rel_other
> +    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
> +
> +  HOST_WIDE_INT est_rel_this_min
> +    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
> +  HOST_WIDE_INT est_rel_this_max
> +    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
> +
> +  HOST_WIDE_INT est_rel_other_min
> +    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
> +  HOST_WIDE_INT est_rel_other_max
> +    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
> +
> +  /* Check first if we can make out an unambigous total order from the minimum
> +     and maximum estimates.  */
> +  if (est_rel_this_min < est_rel_other_min
> +      && est_rel_this_max < est_rel_other_max)
> +    return -1;
> +
> +  if (est_rel_other_min < est_rel_this_min
> +      && est_rel_other_max < est_rel_this_max)
> +    return 1;
> +
> +  /* When other_loop_vinfo uses a variable vectorization factor,
> +     we know that it has a lower cost for at least one runtime VF.
> +     However, we don't know how likely that VF is.
> +
> +     One option would be to compare the costs for the estimated VFs.
> +     The problem is that that can put too much pressure on the cost
> +     model.  E.g. if the estimated VF is also the lowest possible VF,
> +     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
> +     for the estimated VF, we'd then choose this_loop_vinfo even
> +     though (a) this_loop_vinfo might not actually be better than
> +     other_loop_vinfo for that VF and (b) it would be significantly
> +     worse at larger VFs.
> +
> +     Here we go for a hacky compromise: pick this_loop_vinfo if it is
> +     no more expensive than other_loop_vinfo even after doubling the
> +     estimated other_loop_vinfo VF.  For all but trivial loops, this
> +     ensures that we only pick this_loop_vinfo if it is significantly
> +     better than other_loop_vinfo at the estimated VF.  */
> +  if (est_rel_other_min != est_rel_this_min
> +      || est_rel_other_max != est_rel_this_max)
> +    {
> +      HOST_WIDE_INT est_rel_this_likely
> +       = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
> +      HOST_WIDE_INT est_rel_other_likely
> +       = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
> +
> +      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
> +    }
> +
> +  return 0;
> +}
> +
> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
> +   nothing to choose between the inside (loop body) costs of THIS and OTHER.
> +   Check whether we can determine the return value of better_main_loop_than_p
> +   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
> +   Return:
> +
> +   * -1 if better_main_loop_than_p should return true.
> +   * 1 if better_main_loop_than_p should return false.
> +   * 0 if we can't decide.  */
> +
> +int
> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
> +{
> +  auto this_outside_cost = this->outside_cost ();
> +  auto other_outside_cost = other->outside_cost ();
> +  if (this_outside_cost != other_outside_cost)
> +    return this_outside_cost < other_outside_cost ? -1 : 1;
> +
> +  return 0;
> +}
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 87d3f211a2e..0e3aad590e8 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -1419,6 +1419,21 @@ public:
>       read back using the functions below.  */
>    virtual void finish_cost ();
>
> +  /* The costs in THIS and OTHER both describe ways of vectorizing
> +     a main loop.  Return true if the costs described by THIS are
> +     cheaper than the costs described by OTHER.  Return false if any
> +     of the following are true:
> +
> +     - THIS and OTHER are of equal cost
> +     - OTHER is better than THIS
> +     - we can't be sure about the relative costs of THIS and OTHER.  */
> +  virtual bool better_main_loop_than_p (const vector_costs *other) const;
> +
> +  /* Likewise, but the costs in THIS and OTHER both describe ways of
> +     vectorizing an epilogue loop of MAIN_LOOP.  */
> +  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
> +                                           loop_vec_info main_loop) const;
> +
>    unsigned int prologue_cost () const;
>    unsigned int body_cost () const;
>    unsigned int epilogue_cost () const;
> @@ -1429,6 +1444,8 @@ protected:
>                                  unsigned int);
>    unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
>                                      unsigned int);
> +  int compare_inside_loop_cost (const vector_costs *) const;
> +  int compare_outside_loop_cost (const vector_costs *) const;
>
>    /* The region of code that we're considering vectorizing.  */
>    vec_info *m_vinfo;
> --
> 2.25.1
>
  
Richard Sandiford Nov. 8, 2021, 1:06 p.m. UTC | #2
Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> One of the things we want to do on AArch64 is compare vector loops
>> side-by-side and pick the best one.  For some targets, we want this
>> to be based on issue rates as well as the usual latency-based costs
>> (at least for loops with relatively high iteration counts).
>>
>> The current approach to doing this is: when costing vectorisation
>> candidate A, try to guess what the other main candidate B will look
>> like and adjust A's latency-based cost up or down based on the likely
>> difference between A and B's issue rates.  This effectively means
>> that we try to cost parts of B at the same time as A, without actually
>> being able to see B.
>
> I think the idea of the current costing is that compares are always
> to the original scalar loop (with comparing the resulting magic
> cost numbers) so that by means of transitivity comparing two
> vector variants work as well.  So I'm not sure where 'variant B'
> comes into play here?

E.g. A could be SVE and B could be Advanced SIMD, or A could be
SVE with fully-packed vectors and B could be SVE with partially
unpacked vectors.

One motivating case is Neoverse V1, which is a 256-bit SVE target.
There, scalar code, Advanced SIMD code and SVE code have different issue
characteristics.  SVE vector ops generally have the same latencies as
the corresponding Advanced SIMD ops.  Advanced SIMD ops generally
have double the throughput of SVE ones, so that the overall bit-for-bit
throughputs are roughly equal.  However, there are differences due to
predication, load/store handling, and so on, and those differences
should be decisive in some cases.

For SLP, latency-based costs are fine.  But for loop costs, they hide
too many details.  What works best in practice, both for vector vs.
vector and vector vs. scalar, is:

(1) compare issue rates between two loop bodies (vector vs. vector
    or vector vs. scalar)
(2) if issue rates are equal for a given number of scalar iterations,
    compare the latencies of the loop bodies, as we do now
(3) if both the above are equal, compare the latencies of the prologue
    and epilogue, as we do now

It's very difficult to combine latency and issue rate information
into a single per-statement integer, in such a way that the integers
can just be summed and compared.

So returning to the above, when costing an SVE loop A, we currently
collect on-the-side issue information about both A and the likely
Advanced SIMD implementation B.  If B would have a higher throughput
than A then we multiply A's latency-based costs by:

   ceil(B throughput/A throughput)

The problem is, we don't actually know whether B exists or what it
would look like (given that Advanced SIMD and SVE have different
features).

In principle, we should do the same in reverse, but since SVE needs
fewer ops than Advanced SIMD, the latencies are usually smaller already.

We also do something similar for the scalar code.  When costing both
Advanced SIMD and SVE, we try to estimate the issue rate of the
original scalar code.  If the scalar code could issue more quickly
than the equivalent vector code, we increase the latency-based cost
of the vector code to account for that.

The idea of the patch (and others) is that we can do the (1), (2),
(3) comparison above directly rather than indirectly.  We can also
use the issue information calculated while costing the scalar code,
rather than having to estimate the scalar issue information while
costing the vector code.

>> This is needlessly indirect and complex.  It was a compromise due
>> to the code being added (too) late in the GCC 11 cycle, so that
>> target-independent changes weren't possible.
>>
>> The target-independent code already compares two candidate loop_vec_infos
>> side-by-side, so that information about A and B above are available
>> directly.  This patch creates a way for targets to hook into this
>> comparison.
>
> You mean it has both loop_vec_infos.  But as said I'm not sure it's a good
> idea to compare those side-by-side - will that not lead to _more_ special-casing
> since you need to have a working compare to the scalar variant as well
> to determine the vectorization threshold?

A later patch allows the code to do the same comparison for vector
vs. scalar.  It makes the target code significantly simpler compared
to now, since add_stmt_cost only needs to consider the latency and
issue rate of the code it's actually supposed to be costing, rather than
trying also to estimate the issue rate of the scalar code and the issue
rate of the Advanced SIMD code.

Thanks,
Richard

>
>>
>> The AArch64 code can therefore hook into better_main_loop_than_p to
>> compare issue rates.  If the issue rate comparison isn't decisive,
>> the code can fall back to the normal latency-based comparison instead.
>>
>> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>>
>> Richard
>>
>>
>> gcc/
>>         * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
>>         (vector_costs::better_epilogue_loop_than_p)
>>         (vector_costs::compare_inside_loop_cost)
>>         (vector_costs::compare_outside_loop_cost): Likewise.
>>         * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
>>         (vector_costs::better_epilogue_loop_than_p)
>>         (vector_costs::compare_inside_loop_cost)
>>         (vector_costs::compare_outside_loop_cost): New functions,
>>         containing code moved from...
>>         * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
>> ---
>>  gcc/tree-vect-loop.c  | 142 ++---------------------------
>>  gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
>>  gcc/tree-vectorizer.h |  17 ++++
>>  3 files changed, 226 insertions(+), 137 deletions(-)
>>
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index dd4a363fee5..c9ee2e15e35 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>>         return new_simdlen_p;
>>      }
>>
>> -  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
>> -  if (main_loop)
>> -    {
>> -      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> -      unsigned HOST_WIDE_INT main_vf;
>> -      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
>> -      /* If we can determine how many iterations are left for the epilogue
>> -        loop, that is if both the main loop's vectorization factor and number
>> -        of iterations are constant, then we use them to calculate the cost of
>> -        the epilogue loop together with a 'likely value' for the epilogues
>> -        vectorization factor.  Otherwise we use the main loop's vectorization
>> -        factor and the maximum poly value for the epilogue's.  If the target
>> -        has not provided with a sensible upper bound poly vectorization
>> -        factors are likely to be favored over constant ones.  */
>> -      if (main_poly_vf.is_constant (&main_vf)
>> -         && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> -       {
>> -         unsigned HOST_WIDE_INT niters
>> -           = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> -         HOST_WIDE_INT old_likely_vf
>> -           = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
>> -         HOST_WIDE_INT new_likely_vf
>> -           = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
>> -
>> -         /* If the epilogue is using partial vectors we account for the
>> -            partial iteration here too.  */
>> -         old_factor = niters / old_likely_vf;
>> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
>> -             && niters % old_likely_vf != 0)
>> -           old_factor++;
>> -
>> -         new_factor = niters / new_likely_vf;
>> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
>> -             && niters % new_likely_vf != 0)
>> -           new_factor++;
>> -       }
>> -      else
>> -       {
>> -         unsigned HOST_WIDE_INT main_vf_max
>> -           = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> -
>> -         old_factor = main_vf_max / estimated_poly_value (old_vf,
>> -                                                          POLY_VALUE_MAX);
>> -         new_factor = main_vf_max / estimated_poly_value (new_vf,
>> -                                                          POLY_VALUE_MAX);
>> -
>> -         /* If the loop is not using partial vectors then it will iterate one
>> -            time less than one that does.  It is safe to subtract one here,
>> -            because the main loop's vf is always at least 2x bigger than that
>> -            of an epilogue.  */
>> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
>> -           old_factor -= 1;
>> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
>> -           new_factor -= 1;
>> -       }
>> -
>> -      /* Compute the costs by multiplying the inside costs with the factor and
>> -        add the outside costs for a more complete picture.  The factor is the
>> -        amount of times we are expecting to iterate this epilogue.  */
>> -      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
>> -      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
>> -      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
>> -      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
>> -      return new_cost < old_cost;
>> -    }
>> -
>> -  /* Limit the VFs to what is likely to be the maximum number of iterations,
>> -     to handle cases in which at least one loop_vinfo is fully-masked.  */
>> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
>> -  if (estimated_max_niter != -1)
>> -    {
>> -      if (known_le (estimated_max_niter, new_vf))
>> -       new_vf = estimated_max_niter;
>> -      if (known_le (estimated_max_niter, old_vf))
>> -       old_vf = estimated_max_niter;
>> -    }
>> -
>> -  /* Check whether the (fractional) cost per scalar iteration is lower
>> -     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
>> -  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
>> -  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
>> -
>> -  HOST_WIDE_INT est_rel_new_min
>> -    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
>> -  HOST_WIDE_INT est_rel_new_max
>> -    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
>> -
>> -  HOST_WIDE_INT est_rel_old_min
>> -    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
>> -  HOST_WIDE_INT est_rel_old_max
>> -    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
>> -
>> -  /* Check first if we can make out an unambigous total order from the minimum
>> -     and maximum estimates.  */
>> -  if (est_rel_new_min < est_rel_old_min
>> -      && est_rel_new_max < est_rel_old_max)
>> -    return true;
>> -  else if (est_rel_old_min < est_rel_new_min
>> -          && est_rel_old_max < est_rel_new_max)
>> -    return false;
>> -  /* When old_loop_vinfo uses a variable vectorization factor,
>> -     we know that it has a lower cost for at least one runtime VF.
>> -     However, we don't know how likely that VF is.
>> -
>> -     One option would be to compare the costs for the estimated VFs.
>> -     The problem is that that can put too much pressure on the cost
>> -     model.  E.g. if the estimated VF is also the lowest possible VF,
>> -     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
>> -     for the estimated VF, we'd then choose new_loop_vinfo even
>> -     though (a) new_loop_vinfo might not actually be better than
>> -     old_loop_vinfo for that VF and (b) it would be significantly
>> -     worse at larger VFs.
>> -
>> -     Here we go for a hacky compromise: pick new_loop_vinfo if it is
>> -     no more expensive than old_loop_vinfo even after doubling the
>> -     estimated old_loop_vinfo VF.  For all but trivial loops, this
>> -     ensures that we only pick new_loop_vinfo if it is significantly
>> -     better than old_loop_vinfo at the estimated VF.  */
>> -
>> -  if (est_rel_old_min != est_rel_new_min
>> -      || est_rel_old_max != est_rel_new_max)
>> -    {
>> -      HOST_WIDE_INT est_rel_new_likely
>> -       = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
>> -      HOST_WIDE_INT est_rel_old_likely
>> -       = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
>> -
>> -      return est_rel_new_likely * 2 <= est_rel_old_likely;
>> -    }
>> -
>> -  /* If there's nothing to choose between the loop bodies, see whether
>> -     there's a difference in the prologue and epilogue costs.  */
>> -  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
>> -  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
>> -  if (new_outside_cost != old_outside_cost)
>> -    return new_outside_cost < old_outside_cost;
>> +  const auto *old_costs = old_loop_vinfo->vector_costs;
>> +  const auto *new_costs = new_loop_vinfo->vector_costs;
>> +  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
>> +    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
>>
>> -  return false;
>> +  return new_costs->better_main_loop_than_p (old_costs);
>>  }
>>
>>  /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
>> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
>> index 9ef76ce654b..dcbb2a3f13a 100644
>> --- a/gcc/tree-vectorizer.c
>> +++ b/gcc/tree-vectorizer.c
>> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
>>      }
>>    return cost;
>>  }
>> +
>> +/* See the comment above the declaration for details.  */
>> +
>> +bool
>> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
>> +{
>> +  int diff = compare_inside_loop_cost (other);
>> +  if (diff != 0)
>> +    return diff < 0;
>> +
>> +  /* If there's nothing to choose between the loop bodies, see whether
>> +     there's a difference in the prologue and epilogue costs.  */
>> +  diff = compare_outside_loop_cost (other);
>> +  if (diff != 0)
>> +    return diff < 0;
>> +
>> +  return false;
>> +}
>> +
>> +
>> +/* See the comment above the declaration for details.  */
>> +
>> +bool
>> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
>> +                                          loop_vec_info main_loop) const
>> +{
>> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> +
>> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> +
>> +  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> +  unsigned HOST_WIDE_INT main_vf;
>> +  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
>> +  /* If we can determine how many iterations are left for the epilogue
>> +     loop, that is if both the main loop's vectorization factor and number
>> +     of iterations are constant, then we use them to calculate the cost of
>> +     the epilogue loop together with a 'likely value' for the epilogues
>> +     vectorization factor.  Otherwise we use the main loop's vectorization
>> +     factor and the maximum poly value for the epilogue's.  If the target
>> +     has not provided with a sensible upper bound poly vectorization
>> +     factors are likely to be favored over constant ones.  */
>> +  if (main_poly_vf.is_constant (&main_vf)
>> +      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> +    {
>> +      unsigned HOST_WIDE_INT niters
>> +       = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> +      HOST_WIDE_INT other_likely_vf
>> +       = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
>> +      HOST_WIDE_INT this_likely_vf
>> +       = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
>> +
>> +      /* If the epilogue is using partial vectors we account for the
>> +        partial iteration here too.  */
>> +      other_factor = niters / other_likely_vf;
>> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
>> +         && niters % other_likely_vf != 0)
>> +       other_factor++;
>> +
>> +      this_factor = niters / this_likely_vf;
>> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
>> +         && niters % this_likely_vf != 0)
>> +       this_factor++;
>> +    }
>> +  else
>> +    {
>> +      unsigned HOST_WIDE_INT main_vf_max
>> +       = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> +
>> +      other_factor = main_vf_max / estimated_poly_value (other_vf,
>> +                                                      POLY_VALUE_MAX);
>> +      this_factor = main_vf_max / estimated_poly_value (this_vf,
>> +                                                      POLY_VALUE_MAX);
>> +
>> +      /* If the loop is not using partial vectors then it will iterate one
>> +        time less than one that does.  It is safe to subtract one here,
>> +        because the main loop's vf is always at least 2x bigger than that
>> +        of an epilogue.  */
>> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
>> +       other_factor -= 1;
>> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
>> +       this_factor -= 1;
>> +    }
>> +
>> +  /* Compute the costs by multiplying the inside costs with the factor and
>> +     add the outside costs for a more complete picture.  The factor is the
>> +     amount of times we are expecting to iterate this epilogue.  */
>> +  other_cost = other->body_cost () * other_factor;
>> +  this_cost = this->body_cost () * this_factor;
>> +  other_cost += other->outside_cost ();
>> +  this_cost += this->outside_cost ();
>> +  return this_cost < other_cost;
>> +}
>> +
>> +/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
>> +   determine the return value of better_main_loop_than_p by comparing the
>> +   inside (loop body) costs of THIS and OTHER.  Return:
>> +
>> +   * -1 if better_main_loop_than_p should return true.
>> +   * 1 if better_main_loop_than_p should return false.
>> +   * 0 if we can't decide.  */
>> +
>> +int
>> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
>> +{
>> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> +
>> +  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
>> +  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
>> +
>> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> +
>> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
>> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
>> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
>> +  if (estimated_max_niter != -1)
>> +    {
>> +      if (known_le (estimated_max_niter, this_vf))
>> +       this_vf = estimated_max_niter;
>> +      if (known_le (estimated_max_niter, other_vf))
>> +       other_vf = estimated_max_niter;
>> +    }
>> +
>> +  /* Check whether the (fractional) cost per scalar iteration is lower or
>> +     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
>> +  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
>> +  poly_int64 rel_other
>> +    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
>> +
>> +  HOST_WIDE_INT est_rel_this_min
>> +    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
>> +  HOST_WIDE_INT est_rel_this_max
>> +    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
>> +
>> +  HOST_WIDE_INT est_rel_other_min
>> +    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
>> +  HOST_WIDE_INT est_rel_other_max
>> +    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
>> +
>> +  /* Check first if we can make out an unambigous total order from the minimum
>> +     and maximum estimates.  */
>> +  if (est_rel_this_min < est_rel_other_min
>> +      && est_rel_this_max < est_rel_other_max)
>> +    return -1;
>> +
>> +  if (est_rel_other_min < est_rel_this_min
>> +      && est_rel_other_max < est_rel_this_max)
>> +    return 1;
>> +
>> +  /* When other_loop_vinfo uses a variable vectorization factor,
>> +     we know that it has a lower cost for at least one runtime VF.
>> +     However, we don't know how likely that VF is.
>> +
>> +     One option would be to compare the costs for the estimated VFs.
>> +     The problem is that that can put too much pressure on the cost
>> +     model.  E.g. if the estimated VF is also the lowest possible VF,
>> +     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
>> +     for the estimated VF, we'd then choose this_loop_vinfo even
>> +     though (a) this_loop_vinfo might not actually be better than
>> +     other_loop_vinfo for that VF and (b) it would be significantly
>> +     worse at larger VFs.
>> +
>> +     Here we go for a hacky compromise: pick this_loop_vinfo if it is
>> +     no more expensive than other_loop_vinfo even after doubling the
>> +     estimated other_loop_vinfo VF.  For all but trivial loops, this
>> +     ensures that we only pick this_loop_vinfo if it is significantly
>> +     better than other_loop_vinfo at the estimated VF.  */
>> +  if (est_rel_other_min != est_rel_this_min
>> +      || est_rel_other_max != est_rel_this_max)
>> +    {
>> +      HOST_WIDE_INT est_rel_this_likely
>> +       = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
>> +      HOST_WIDE_INT est_rel_other_likely
>> +       = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
>> +
>> +      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
>> +    }
>> +
>> +  return 0;
>> +}
>> +
>> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
>> +   nothing to choose between the inside (loop body) costs of THIS and OTHER.
>> +   Check whether we can determine the return value of better_main_loop_than_p
>> +   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
>> +   Return:
>> +
>> +   * -1 if better_main_loop_than_p should return true.
>> +   * 1 if better_main_loop_than_p should return false.
>> +   * 0 if we can't decide.  */
>> +
>> +int
>> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
>> +{
>> +  auto this_outside_cost = this->outside_cost ();
>> +  auto other_outside_cost = other->outside_cost ();
>> +  if (this_outside_cost != other_outside_cost)
>> +    return this_outside_cost < other_outside_cost ? -1 : 1;
>> +
>> +  return 0;
>> +}
>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> index 87d3f211a2e..0e3aad590e8 100644
>> --- a/gcc/tree-vectorizer.h
>> +++ b/gcc/tree-vectorizer.h
>> @@ -1419,6 +1419,21 @@ public:
>>       read back using the functions below.  */
>>    virtual void finish_cost ();
>>
>> +  /* The costs in THIS and OTHER both describe ways of vectorizing
>> +     a main loop.  Return true if the costs described by THIS are
>> +     cheaper than the costs described by OTHER.  Return false if any
>> +     of the following are true:
>> +
>> +     - THIS and OTHER are of equal cost
>> +     - OTHER is better than THIS
>> +     - we can't be sure about the relative costs of THIS and OTHER.  */
>> +  virtual bool better_main_loop_than_p (const vector_costs *other) const;
>> +
>> +  /* Likewise, but the costs in THIS and OTHER both describe ways of
>> +     vectorizing an epilogue loop of MAIN_LOOP.  */
>> +  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
>> +                                           loop_vec_info main_loop) const;
>> +
>>    unsigned int prologue_cost () const;
>>    unsigned int body_cost () const;
>>    unsigned int epilogue_cost () const;
>> @@ -1429,6 +1444,8 @@ protected:
>>                                  unsigned int);
>>    unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
>>                                      unsigned int);
>> +  int compare_inside_loop_cost (const vector_costs *) const;
>> +  int compare_outside_loop_cost (const vector_costs *) const;
>>
>>    /* The region of code that we're considering vectorizing.  */
>>    vec_info *m_vinfo;
>> --
>> 2.25.1
>>
  
Richard Biener Nov. 8, 2021, 2:32 p.m. UTC | #3
On Mon, Nov 8, 2021 at 2:06 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> One of the things we want to do on AArch64 is compare vector loops
> >> side-by-side and pick the best one.  For some targets, we want this
> >> to be based on issue rates as well as the usual latency-based costs
> >> (at least for loops with relatively high iteration counts).
> >>
> >> The current approach to doing this is: when costing vectorisation
> >> candidate A, try to guess what the other main candidate B will look
> >> like and adjust A's latency-based cost up or down based on the likely
> >> difference between A and B's issue rates.  This effectively means
> >> that we try to cost parts of B at the same time as A, without actually
> >> being able to see B.
> >
> > I think the idea of the current costing is that compares are always
> > to the original scalar loop (with comparing the resulting magic
> > cost numbers) so that by means of transitivity comparing two
> > vector variants work as well.  So I'm not sure where 'variant B'
> > comes into play here?
>
> E.g. A could be SVE and B could be Advanced SIMD, or A could be
> SVE with fully-packed vectors and B could be SVE with partially
> unpacked vectors.
>
> One motivating case is Neoverse V1, which is a 256-bit SVE target.
> There, scalar code, Advanced SIMD code and SVE code have different issue
> characteristics.  SVE vector ops generally have the same latencies as
> the corresponding Advanced SIMD ops.  Advanced SIMD ops generally
> have double the throughput of SVE ones, so that the overall bit-for-bit
> throughputs are roughly equal.  However, there are differences due to
> predication, load/store handling, and so on, and those differences
> should be decisive in some cases.
>
> For SLP, latency-based costs are fine.  But for loop costs, they hide
> too many details.  What works best in practice, both for vector vs.
> vector and vector vs. scalar, is:
>
> (1) compare issue rates between two loop bodies (vector vs. vector
>     or vector vs. scalar)
> (2) if issue rates are equal for a given number of scalar iterations,
>     compare the latencies of the loop bodies, as we do now
> (3) if both the above are equal, compare the latencies of the prologue
>     and epilogue, as we do now
>
> It's very difficult to combine latency and issue rate information
> into a single per-statement integer, in such a way that the integers
> can just be summed and compared.

Yeah, so the idea I had when proposing the init_cost/add_stmt/finish_cost
was that finish_cost would determine the issue rate cost part, for
example if the uarch can issue 2 ops with ISA A and 4 ops with ISA B
then finish_cost would compute the number of cycles required to
issue all vector stmts.  That yields sth like IPC the latency cost could
be divided by.  Doing that for both ISA A and ISA B would produce
weighted latency values that can be compared?  Alternatively
you can of course simulate issue with the actual instruction latencies
in mind and produce an overall iteration latency number.

Comparing just latency or issue rate in isolation is likely not good
enough?

> So returning to the above, when costing an SVE loop A, we currently
> collect on-the-side issue information about both A and the likely
> Advanced SIMD implementation B.  If B would have a higher throughput
> than A then we multiply A's latency-based costs by:
>
>    ceil(B throughput/A throughput)
>
> The problem is, we don't actually know whether B exists or what it
> would look like (given that Advanced SIMD and SVE have different
> features).
>
> In principle, we should do the same in reverse, but since SVE needs
> fewer ops than Advanced SIMD, the latencies are usually smaller already.
>
> We also do something similar for the scalar code.  When costing both
> Advanced SIMD and SVE, we try to estimate the issue rate of the
> original scalar code.  If the scalar code could issue more quickly
> than the equivalent vector code, we increase the latency-based cost
> of the vector code to account for that.
>
> The idea of the patch (and others) is that we can do the (1), (2),
> (3) comparison above directly rather than indirectly.  We can also
> use the issue information calculated while costing the scalar code,
> rather than having to estimate the scalar issue information while
> costing the vector code.
>
> >> This is needlessly indirect and complex.  It was a compromise due
> >> to the code being added (too) late in the GCC 11 cycle, so that
> >> target-independent changes weren't possible.
> >>
> >> The target-independent code already compares two candidate loop_vec_infos
> >> side-by-side, so that information about A and B above are available
> >> directly.  This patch creates a way for targets to hook into this
> >> comparison.
> >
> > You mean it has both loop_vec_infos.  But as said I'm not sure it's a good
> > idea to compare those side-by-side - will that not lead to _more_ special-casing
> > since you need to have a working compare to the scalar variant as well
> > to determine the vectorization threshold?
>
> A later patch allows the code to do the same comparison for vector
> vs. scalar.  It makes the target code significantly simpler compared
> to now, since add_stmt_cost only needs to consider the latency and
> issue rate of the code it's actually supposed to be costing, rather than
> trying also to estimate the issue rate of the scalar code and the issue
> rate of the Advanced SIMD code.
>
> Thanks,
> Richard
>
> >
> >>
> >> The AArch64 code can therefore hook into better_main_loop_than_p to
> >> compare issue rates.  If the issue rate comparison isn't decisive,
> >> the code can fall back to the normal latency-based comparison instead.
> >>
> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
> >>
> >> Richard
> >>
> >>
> >> gcc/
> >>         * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
> >>         (vector_costs::better_epilogue_loop_than_p)
> >>         (vector_costs::compare_inside_loop_cost)
> >>         (vector_costs::compare_outside_loop_cost): Likewise.
> >>         * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
> >>         (vector_costs::better_epilogue_loop_than_p)
> >>         (vector_costs::compare_inside_loop_cost)
> >>         (vector_costs::compare_outside_loop_cost): New functions,
> >>         containing code moved from...
> >>         * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
> >> ---
> >>  gcc/tree-vect-loop.c  | 142 ++---------------------------
> >>  gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
> >>  gcc/tree-vectorizer.h |  17 ++++
> >>  3 files changed, 226 insertions(+), 137 deletions(-)
> >>
> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >> index dd4a363fee5..c9ee2e15e35 100644
> >> --- a/gcc/tree-vect-loop.c
> >> +++ b/gcc/tree-vect-loop.c
> >> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
> >>         return new_simdlen_p;
> >>      }
> >>
> >> -  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
> >> -  if (main_loop)
> >> -    {
> >> -      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
> >> -      unsigned HOST_WIDE_INT main_vf;
> >> -      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
> >> -      /* If we can determine how many iterations are left for the epilogue
> >> -        loop, that is if both the main loop's vectorization factor and number
> >> -        of iterations are constant, then we use them to calculate the cost of
> >> -        the epilogue loop together with a 'likely value' for the epilogues
> >> -        vectorization factor.  Otherwise we use the main loop's vectorization
> >> -        factor and the maximum poly value for the epilogue's.  If the target
> >> -        has not provided with a sensible upper bound poly vectorization
> >> -        factors are likely to be favored over constant ones.  */
> >> -      if (main_poly_vf.is_constant (&main_vf)
> >> -         && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
> >> -       {
> >> -         unsigned HOST_WIDE_INT niters
> >> -           = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> >> -         HOST_WIDE_INT old_likely_vf
> >> -           = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
> >> -         HOST_WIDE_INT new_likely_vf
> >> -           = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
> >> -
> >> -         /* If the epilogue is using partial vectors we account for the
> >> -            partial iteration here too.  */
> >> -         old_factor = niters / old_likely_vf;
> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
> >> -             && niters % old_likely_vf != 0)
> >> -           old_factor++;
> >> -
> >> -         new_factor = niters / new_likely_vf;
> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
> >> -             && niters % new_likely_vf != 0)
> >> -           new_factor++;
> >> -       }
> >> -      else
> >> -       {
> >> -         unsigned HOST_WIDE_INT main_vf_max
> >> -           = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> >> -
> >> -         old_factor = main_vf_max / estimated_poly_value (old_vf,
> >> -                                                          POLY_VALUE_MAX);
> >> -         new_factor = main_vf_max / estimated_poly_value (new_vf,
> >> -                                                          POLY_VALUE_MAX);
> >> -
> >> -         /* If the loop is not using partial vectors then it will iterate one
> >> -            time less than one that does.  It is safe to subtract one here,
> >> -            because the main loop's vf is always at least 2x bigger than that
> >> -            of an epilogue.  */
> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
> >> -           old_factor -= 1;
> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
> >> -           new_factor -= 1;
> >> -       }
> >> -
> >> -      /* Compute the costs by multiplying the inside costs with the factor and
> >> -        add the outside costs for a more complete picture.  The factor is the
> >> -        amount of times we are expecting to iterate this epilogue.  */
> >> -      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
> >> -      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
> >> -      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
> >> -      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
> >> -      return new_cost < old_cost;
> >> -    }
> >> -
> >> -  /* Limit the VFs to what is likely to be the maximum number of iterations,
> >> -     to handle cases in which at least one loop_vinfo is fully-masked.  */
> >> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> >> -  if (estimated_max_niter != -1)
> >> -    {
> >> -      if (known_le (estimated_max_niter, new_vf))
> >> -       new_vf = estimated_max_niter;
> >> -      if (known_le (estimated_max_niter, old_vf))
> >> -       old_vf = estimated_max_niter;
> >> -    }
> >> -
> >> -  /* Check whether the (fractional) cost per scalar iteration is lower
> >> -     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
> >> -  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
> >> -  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
> >> -
> >> -  HOST_WIDE_INT est_rel_new_min
> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
> >> -  HOST_WIDE_INT est_rel_new_max
> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
> >> -
> >> -  HOST_WIDE_INT est_rel_old_min
> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
> >> -  HOST_WIDE_INT est_rel_old_max
> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
> >> -
> >> -  /* Check first if we can make out an unambigous total order from the minimum
> >> -     and maximum estimates.  */
> >> -  if (est_rel_new_min < est_rel_old_min
> >> -      && est_rel_new_max < est_rel_old_max)
> >> -    return true;
> >> -  else if (est_rel_old_min < est_rel_new_min
> >> -          && est_rel_old_max < est_rel_new_max)
> >> -    return false;
> >> -  /* When old_loop_vinfo uses a variable vectorization factor,
> >> -     we know that it has a lower cost for at least one runtime VF.
> >> -     However, we don't know how likely that VF is.
> >> -
> >> -     One option would be to compare the costs for the estimated VFs.
> >> -     The problem is that that can put too much pressure on the cost
> >> -     model.  E.g. if the estimated VF is also the lowest possible VF,
> >> -     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
> >> -     for the estimated VF, we'd then choose new_loop_vinfo even
> >> -     though (a) new_loop_vinfo might not actually be better than
> >> -     old_loop_vinfo for that VF and (b) it would be significantly
> >> -     worse at larger VFs.
> >> -
> >> -     Here we go for a hacky compromise: pick new_loop_vinfo if it is
> >> -     no more expensive than old_loop_vinfo even after doubling the
> >> -     estimated old_loop_vinfo VF.  For all but trivial loops, this
> >> -     ensures that we only pick new_loop_vinfo if it is significantly
> >> -     better than old_loop_vinfo at the estimated VF.  */
> >> -
> >> -  if (est_rel_old_min != est_rel_new_min
> >> -      || est_rel_old_max != est_rel_new_max)
> >> -    {
> >> -      HOST_WIDE_INT est_rel_new_likely
> >> -       = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
> >> -      HOST_WIDE_INT est_rel_old_likely
> >> -       = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
> >> -
> >> -      return est_rel_new_likely * 2 <= est_rel_old_likely;
> >> -    }
> >> -
> >> -  /* If there's nothing to choose between the loop bodies, see whether
> >> -     there's a difference in the prologue and epilogue costs.  */
> >> -  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
> >> -  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
> >> -  if (new_outside_cost != old_outside_cost)
> >> -    return new_outside_cost < old_outside_cost;
> >> +  const auto *old_costs = old_loop_vinfo->vector_costs;
> >> +  const auto *new_costs = new_loop_vinfo->vector_costs;
> >> +  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
> >> +    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
> >>
> >> -  return false;
> >> +  return new_costs->better_main_loop_than_p (old_costs);
> >>  }
> >>
> >>  /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
> >> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> >> index 9ef76ce654b..dcbb2a3f13a 100644
> >> --- a/gcc/tree-vectorizer.c
> >> +++ b/gcc/tree-vectorizer.c
> >> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
> >>      }
> >>    return cost;
> >>  }
> >> +
> >> +/* See the comment above the declaration for details.  */
> >> +
> >> +bool
> >> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
> >> +{
> >> +  int diff = compare_inside_loop_cost (other);
> >> +  if (diff != 0)
> >> +    return diff < 0;
> >> +
> >> +  /* If there's nothing to choose between the loop bodies, see whether
> >> +     there's a difference in the prologue and epilogue costs.  */
> >> +  diff = compare_outside_loop_cost (other);
> >> +  if (diff != 0)
> >> +    return diff < 0;
> >> +
> >> +  return false;
> >> +}
> >> +
> >> +
> >> +/* See the comment above the declaration for details.  */
> >> +
> >> +bool
> >> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
> >> +                                          loop_vec_info main_loop) const
> >> +{
> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
> >> +
> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
> >> +
> >> +  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
> >> +  unsigned HOST_WIDE_INT main_vf;
> >> +  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
> >> +  /* If we can determine how many iterations are left for the epilogue
> >> +     loop, that is if both the main loop's vectorization factor and number
> >> +     of iterations are constant, then we use them to calculate the cost of
> >> +     the epilogue loop together with a 'likely value' for the epilogues
> >> +     vectorization factor.  Otherwise we use the main loop's vectorization
> >> +     factor and the maximum poly value for the epilogue's.  If the target
> >> +     has not provided with a sensible upper bound poly vectorization
> >> +     factors are likely to be favored over constant ones.  */
> >> +  if (main_poly_vf.is_constant (&main_vf)
> >> +      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
> >> +    {
> >> +      unsigned HOST_WIDE_INT niters
> >> +       = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> >> +      HOST_WIDE_INT other_likely_vf
> >> +       = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
> >> +      HOST_WIDE_INT this_likely_vf
> >> +       = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
> >> +
> >> +      /* If the epilogue is using partial vectors we account for the
> >> +        partial iteration here too.  */
> >> +      other_factor = niters / other_likely_vf;
> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
> >> +         && niters % other_likely_vf != 0)
> >> +       other_factor++;
> >> +
> >> +      this_factor = niters / this_likely_vf;
> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
> >> +         && niters % this_likely_vf != 0)
> >> +       this_factor++;
> >> +    }
> >> +  else
> >> +    {
> >> +      unsigned HOST_WIDE_INT main_vf_max
> >> +       = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> >> +
> >> +      other_factor = main_vf_max / estimated_poly_value (other_vf,
> >> +                                                      POLY_VALUE_MAX);
> >> +      this_factor = main_vf_max / estimated_poly_value (this_vf,
> >> +                                                      POLY_VALUE_MAX);
> >> +
> >> +      /* If the loop is not using partial vectors then it will iterate one
> >> +        time less than one that does.  It is safe to subtract one here,
> >> +        because the main loop's vf is always at least 2x bigger than that
> >> +        of an epilogue.  */
> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
> >> +       other_factor -= 1;
> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
> >> +       this_factor -= 1;
> >> +    }
> >> +
> >> +  /* Compute the costs by multiplying the inside costs with the factor and
> >> +     add the outside costs for a more complete picture.  The factor is the
> >> +     amount of times we are expecting to iterate this epilogue.  */
> >> +  other_cost = other->body_cost () * other_factor;
> >> +  this_cost = this->body_cost () * this_factor;
> >> +  other_cost += other->outside_cost ();
> >> +  this_cost += this->outside_cost ();
> >> +  return this_cost < other_cost;
> >> +}
> >> +
> >> +/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
> >> +   determine the return value of better_main_loop_than_p by comparing the
> >> +   inside (loop body) costs of THIS and OTHER.  Return:
> >> +
> >> +   * -1 if better_main_loop_than_p should return true.
> >> +   * 1 if better_main_loop_than_p should return false.
> >> +   * 0 if we can't decide.  */
> >> +
> >> +int
> >> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
> >> +{
> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
> >> +
> >> +  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
> >> +  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
> >> +
> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
> >> +
> >> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
> >> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
> >> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> >> +  if (estimated_max_niter != -1)
> >> +    {
> >> +      if (known_le (estimated_max_niter, this_vf))
> >> +       this_vf = estimated_max_niter;
> >> +      if (known_le (estimated_max_niter, other_vf))
> >> +       other_vf = estimated_max_niter;
> >> +    }
> >> +
> >> +  /* Check whether the (fractional) cost per scalar iteration is lower or
> >> +     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
> >> +  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
> >> +  poly_int64 rel_other
> >> +    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
> >> +
> >> +  HOST_WIDE_INT est_rel_this_min
> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
> >> +  HOST_WIDE_INT est_rel_this_max
> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
> >> +
> >> +  HOST_WIDE_INT est_rel_other_min
> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
> >> +  HOST_WIDE_INT est_rel_other_max
> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
> >> +
> >> +  /* Check first if we can make out an unambigous total order from the minimum
> >> +     and maximum estimates.  */
> >> +  if (est_rel_this_min < est_rel_other_min
> >> +      && est_rel_this_max < est_rel_other_max)
> >> +    return -1;
> >> +
> >> +  if (est_rel_other_min < est_rel_this_min
> >> +      && est_rel_other_max < est_rel_this_max)
> >> +    return 1;
> >> +
> >> +  /* When other_loop_vinfo uses a variable vectorization factor,
> >> +     we know that it has a lower cost for at least one runtime VF.
> >> +     However, we don't know how likely that VF is.
> >> +
> >> +     One option would be to compare the costs for the estimated VFs.
> >> +     The problem is that that can put too much pressure on the cost
> >> +     model.  E.g. if the estimated VF is also the lowest possible VF,
> >> +     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
> >> +     for the estimated VF, we'd then choose this_loop_vinfo even
> >> +     though (a) this_loop_vinfo might not actually be better than
> >> +     other_loop_vinfo for that VF and (b) it would be significantly
> >> +     worse at larger VFs.
> >> +
> >> +     Here we go for a hacky compromise: pick this_loop_vinfo if it is
> >> +     no more expensive than other_loop_vinfo even after doubling the
> >> +     estimated other_loop_vinfo VF.  For all but trivial loops, this
> >> +     ensures that we only pick this_loop_vinfo if it is significantly
> >> +     better than other_loop_vinfo at the estimated VF.  */
> >> +  if (est_rel_other_min != est_rel_this_min
> >> +      || est_rel_other_max != est_rel_this_max)
> >> +    {
> >> +      HOST_WIDE_INT est_rel_this_likely
> >> +       = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
> >> +      HOST_WIDE_INT est_rel_other_likely
> >> +       = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
> >> +
> >> +      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
> >> +    }
> >> +
> >> +  return 0;
> >> +}
> >> +
> >> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
> >> +   nothing to choose between the inside (loop body) costs of THIS and OTHER.
> >> +   Check whether we can determine the return value of better_main_loop_than_p
> >> +   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
> >> +   Return:
> >> +
> >> +   * -1 if better_main_loop_than_p should return true.
> >> +   * 1 if better_main_loop_than_p should return false.
> >> +   * 0 if we can't decide.  */
> >> +
> >> +int
> >> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
> >> +{
> >> +  auto this_outside_cost = this->outside_cost ();
> >> +  auto other_outside_cost = other->outside_cost ();
> >> +  if (this_outside_cost != other_outside_cost)
> >> +    return this_outside_cost < other_outside_cost ? -1 : 1;
> >> +
> >> +  return 0;
> >> +}
> >> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> >> index 87d3f211a2e..0e3aad590e8 100644
> >> --- a/gcc/tree-vectorizer.h
> >> +++ b/gcc/tree-vectorizer.h
> >> @@ -1419,6 +1419,21 @@ public:
> >>       read back using the functions below.  */
> >>    virtual void finish_cost ();
> >>
> >> +  /* The costs in THIS and OTHER both describe ways of vectorizing
> >> +     a main loop.  Return true if the costs described by THIS are
> >> +     cheaper than the costs described by OTHER.  Return false if any
> >> +     of the following are true:
> >> +
> >> +     - THIS and OTHER are of equal cost
> >> +     - OTHER is better than THIS
> >> +     - we can't be sure about the relative costs of THIS and OTHER.  */
> >> +  virtual bool better_main_loop_than_p (const vector_costs *other) const;
> >> +
> >> +  /* Likewise, but the costs in THIS and OTHER both describe ways of
> >> +     vectorizing an epilogue loop of MAIN_LOOP.  */
> >> +  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
> >> +                                           loop_vec_info main_loop) const;
> >> +
> >>    unsigned int prologue_cost () const;
> >>    unsigned int body_cost () const;
> >>    unsigned int epilogue_cost () const;
> >> @@ -1429,6 +1444,8 @@ protected:
> >>                                  unsigned int);
> >>    unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
> >>                                      unsigned int);
> >> +  int compare_inside_loop_cost (const vector_costs *) const;
> >> +  int compare_outside_loop_cost (const vector_costs *) const;
> >>
> >>    /* The region of code that we're considering vectorizing.  */
> >>    vec_info *m_vinfo;
> >> --
> >> 2.25.1
> >>
  
Richard Sandiford Nov. 8, 2021, 2:46 p.m. UTC | #4
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, Nov 8, 2021 at 2:06 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
>> > <gcc-patches@gcc.gnu.org> wrote:
>> >>
>> >> One of the things we want to do on AArch64 is compare vector loops
>> >> side-by-side and pick the best one.  For some targets, we want this
>> >> to be based on issue rates as well as the usual latency-based costs
>> >> (at least for loops with relatively high iteration counts).
>> >>
>> >> The current approach to doing this is: when costing vectorisation
>> >> candidate A, try to guess what the other main candidate B will look
>> >> like and adjust A's latency-based cost up or down based on the likely
>> >> difference between A and B's issue rates.  This effectively means
>> >> that we try to cost parts of B at the same time as A, without actually
>> >> being able to see B.
>> >
>> > I think the idea of the current costing is that compares are always
>> > to the original scalar loop (with comparing the resulting magic
>> > cost numbers) so that by means of transitivity comparing two
>> > vector variants work as well.  So I'm not sure where 'variant B'
>> > comes into play here?
>>
>> E.g. A could be SVE and B could be Advanced SIMD, or A could be
>> SVE with fully-packed vectors and B could be SVE with partially
>> unpacked vectors.
>>
>> One motivating case is Neoverse V1, which is a 256-bit SVE target.
>> There, scalar code, Advanced SIMD code and SVE code have different issue
>> characteristics.  SVE vector ops generally have the same latencies as
>> the corresponding Advanced SIMD ops.  Advanced SIMD ops generally
>> have double the throughput of SVE ones, so that the overall bit-for-bit
>> throughputs are roughly equal.  However, there are differences due to
>> predication, load/store handling, and so on, and those differences
>> should be decisive in some cases.
>>
>> For SLP, latency-based costs are fine.  But for loop costs, they hide
>> too many details.  What works best in practice, both for vector vs.
>> vector and vector vs. scalar, is:
>>
>> (1) compare issue rates between two loop bodies (vector vs. vector
>>     or vector vs. scalar)
>> (2) if issue rates are equal for a given number of scalar iterations,
>>     compare the latencies of the loop bodies, as we do now
>> (3) if both the above are equal, compare the latencies of the prologue
>>     and epilogue, as we do now
>>
>> It's very difficult to combine latency and issue rate information
>> into a single per-statement integer, in such a way that the integers
>> can just be summed and compared.
>
> Yeah, so the idea I had when proposing the init_cost/add_stmt/finish_cost
> was that finish_cost would determine the issue rate cost part, for
> example if the uarch can issue 2 ops with ISA A and 4 ops with ISA B
> then finish_cost would compute the number of cycles required to
> issue all vector stmts.  That yields sth like IPC the latency cost could
> be divided by.  Doing that for both ISA A and ISA B would produce
> weighted latency values that can be compared?  Alternatively
> you can of course simulate issue with the actual instruction latencies
> in mind and produce an overall iteration latency number.
>
> Comparing just latency or issue rate in isolation is likely not good
> enough?

In practice, comparing issue rates in isolation does seem to be what we
want as the first level of comparison.  I don't think total latency /
issue rate is really a meaningful combination.  It makes sense to the
extent that “lower latency == good” and “higher issue rate == good”,
but I don't think it's the case that halving the total latency is
equally valuable as doubling the issue rate.  In practice the latter is
a significant win while the former might or might not be.

total latency / issue rate also runs the risk of double-counting:
reducing the number of ops decreases the total latency *and* increases
the issue rate, which could have a quadratic effect.  (This is why we
don't decrease SVE costs when we think that the SVE code will issue
more quickly than the Advanced SIMD code.)

For a long-running loop, the only latencies that really matter are
for loop-carried dependencies.  The issue rate information takes
those into account, in that it tracks reduction and (with later
patches) induction limiters.

Thanks,
Richard

>
>> So returning to the above, when costing an SVE loop A, we currently
>> collect on-the-side issue information about both A and the likely
>> Advanced SIMD implementation B.  If B would have a higher throughput
>> than A then we multiply A's latency-based costs by:
>>
>>    ceil(B throughput/A throughput)
>>
>> The problem is, we don't actually know whether B exists or what it
>> would look like (given that Advanced SIMD and SVE have different
>> features).
>>
>> In principle, we should do the same in reverse, but since SVE needs
>> fewer ops than Advanced SIMD, the latencies are usually smaller already.
>>
>> We also do something similar for the scalar code.  When costing both
>> Advanced SIMD and SVE, we try to estimate the issue rate of the
>> original scalar code.  If the scalar code could issue more quickly
>> than the equivalent vector code, we increase the latency-based cost
>> of the vector code to account for that.
>>
>> The idea of the patch (and others) is that we can do the (1), (2),
>> (3) comparison above directly rather than indirectly.  We can also
>> use the issue information calculated while costing the scalar code,
>> rather than having to estimate the scalar issue information while
>> costing the vector code.
>>
>> >> This is needlessly indirect and complex.  It was a compromise due
>> >> to the code being added (too) late in the GCC 11 cycle, so that
>> >> target-independent changes weren't possible.
>> >>
>> >> The target-independent code already compares two candidate loop_vec_infos
>> >> side-by-side, so that information about A and B above are available
>> >> directly.  This patch creates a way for targets to hook into this
>> >> comparison.
>> >
>> > You mean it has both loop_vec_infos.  But as said I'm not sure it's a good
>> > idea to compare those side-by-side - will that not lead to _more_ special-casing
>> > since you need to have a working compare to the scalar variant as well
>> > to determine the vectorization threshold?
>>
>> A later patch allows the code to do the same comparison for vector
>> vs. scalar.  It makes the target code significantly simpler compared
>> to now, since add_stmt_cost only needs to consider the latency and
>> issue rate of the code it's actually supposed to be costing, rather than
>> trying also to estimate the issue rate of the scalar code and the issue
>> rate of the Advanced SIMD code.
>>
>> Thanks,
>> Richard
>>
>> >
>> >>
>> >> The AArch64 code can therefore hook into better_main_loop_than_p to
>> >> compare issue rates.  If the issue rate comparison isn't decisive,
>> >> the code can fall back to the normal latency-based comparison instead.
>> >>
>> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>> >>
>> >> Richard
>> >>
>> >>
>> >> gcc/
>> >>         * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
>> >>         (vector_costs::better_epilogue_loop_than_p)
>> >>         (vector_costs::compare_inside_loop_cost)
>> >>         (vector_costs::compare_outside_loop_cost): Likewise.
>> >>         * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
>> >>         (vector_costs::better_epilogue_loop_than_p)
>> >>         (vector_costs::compare_inside_loop_cost)
>> >>         (vector_costs::compare_outside_loop_cost): New functions,
>> >>         containing code moved from...
>> >>         * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
>> >> ---
>> >>  gcc/tree-vect-loop.c  | 142 ++---------------------------
>> >>  gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
>> >>  gcc/tree-vectorizer.h |  17 ++++
>> >>  3 files changed, 226 insertions(+), 137 deletions(-)
>> >>
>> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> >> index dd4a363fee5..c9ee2e15e35 100644
>> >> --- a/gcc/tree-vect-loop.c
>> >> +++ b/gcc/tree-vect-loop.c
>> >> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>> >>         return new_simdlen_p;
>> >>      }
>> >>
>> >> -  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
>> >> -  if (main_loop)
>> >> -    {
>> >> -      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> >> -      unsigned HOST_WIDE_INT main_vf;
>> >> -      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
>> >> -      /* If we can determine how many iterations are left for the epilogue
>> >> -        loop, that is if both the main loop's vectorization factor and number
>> >> -        of iterations are constant, then we use them to calculate the cost of
>> >> -        the epilogue loop together with a 'likely value' for the epilogues
>> >> -        vectorization factor.  Otherwise we use the main loop's vectorization
>> >> -        factor and the maximum poly value for the epilogue's.  If the target
>> >> -        has not provided with a sensible upper bound poly vectorization
>> >> -        factors are likely to be favored over constant ones.  */
>> >> -      if (main_poly_vf.is_constant (&main_vf)
>> >> -         && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> >> -       {
>> >> -         unsigned HOST_WIDE_INT niters
>> >> -           = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> >> -         HOST_WIDE_INT old_likely_vf
>> >> -           = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
>> >> -         HOST_WIDE_INT new_likely_vf
>> >> -           = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
>> >> -
>> >> -         /* If the epilogue is using partial vectors we account for the
>> >> -            partial iteration here too.  */
>> >> -         old_factor = niters / old_likely_vf;
>> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
>> >> -             && niters % old_likely_vf != 0)
>> >> -           old_factor++;
>> >> -
>> >> -         new_factor = niters / new_likely_vf;
>> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
>> >> -             && niters % new_likely_vf != 0)
>> >> -           new_factor++;
>> >> -       }
>> >> -      else
>> >> -       {
>> >> -         unsigned HOST_WIDE_INT main_vf_max
>> >> -           = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> >> -
>> >> -         old_factor = main_vf_max / estimated_poly_value (old_vf,
>> >> -                                                          POLY_VALUE_MAX);
>> >> -         new_factor = main_vf_max / estimated_poly_value (new_vf,
>> >> -                                                          POLY_VALUE_MAX);
>> >> -
>> >> -         /* If the loop is not using partial vectors then it will iterate one
>> >> -            time less than one that does.  It is safe to subtract one here,
>> >> -            because the main loop's vf is always at least 2x bigger than that
>> >> -            of an epilogue.  */
>> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
>> >> -           old_factor -= 1;
>> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
>> >> -           new_factor -= 1;
>> >> -       }
>> >> -
>> >> -      /* Compute the costs by multiplying the inside costs with the factor and
>> >> -        add the outside costs for a more complete picture.  The factor is the
>> >> -        amount of times we are expecting to iterate this epilogue.  */
>> >> -      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
>> >> -      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
>> >> -      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
>> >> -      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
>> >> -      return new_cost < old_cost;
>> >> -    }
>> >> -
>> >> -  /* Limit the VFs to what is likely to be the maximum number of iterations,
>> >> -     to handle cases in which at least one loop_vinfo is fully-masked.  */
>> >> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
>> >> -  if (estimated_max_niter != -1)
>> >> -    {
>> >> -      if (known_le (estimated_max_niter, new_vf))
>> >> -       new_vf = estimated_max_niter;
>> >> -      if (known_le (estimated_max_niter, old_vf))
>> >> -       old_vf = estimated_max_niter;
>> >> -    }
>> >> -
>> >> -  /* Check whether the (fractional) cost per scalar iteration is lower
>> >> -     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
>> >> -  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
>> >> -  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
>> >> -
>> >> -  HOST_WIDE_INT est_rel_new_min
>> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
>> >> -  HOST_WIDE_INT est_rel_new_max
>> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
>> >> -
>> >> -  HOST_WIDE_INT est_rel_old_min
>> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
>> >> -  HOST_WIDE_INT est_rel_old_max
>> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
>> >> -
>> >> -  /* Check first if we can make out an unambigous total order from the minimum
>> >> -     and maximum estimates.  */
>> >> -  if (est_rel_new_min < est_rel_old_min
>> >> -      && est_rel_new_max < est_rel_old_max)
>> >> -    return true;
>> >> -  else if (est_rel_old_min < est_rel_new_min
>> >> -          && est_rel_old_max < est_rel_new_max)
>> >> -    return false;
>> >> -  /* When old_loop_vinfo uses a variable vectorization factor,
>> >> -     we know that it has a lower cost for at least one runtime VF.
>> >> -     However, we don't know how likely that VF is.
>> >> -
>> >> -     One option would be to compare the costs for the estimated VFs.
>> >> -     The problem is that that can put too much pressure on the cost
>> >> -     model.  E.g. if the estimated VF is also the lowest possible VF,
>> >> -     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
>> >> -     for the estimated VF, we'd then choose new_loop_vinfo even
>> >> -     though (a) new_loop_vinfo might not actually be better than
>> >> -     old_loop_vinfo for that VF and (b) it would be significantly
>> >> -     worse at larger VFs.
>> >> -
>> >> -     Here we go for a hacky compromise: pick new_loop_vinfo if it is
>> >> -     no more expensive than old_loop_vinfo even after doubling the
>> >> -     estimated old_loop_vinfo VF.  For all but trivial loops, this
>> >> -     ensures that we only pick new_loop_vinfo if it is significantly
>> >> -     better than old_loop_vinfo at the estimated VF.  */
>> >> -
>> >> -  if (est_rel_old_min != est_rel_new_min
>> >> -      || est_rel_old_max != est_rel_new_max)
>> >> -    {
>> >> -      HOST_WIDE_INT est_rel_new_likely
>> >> -       = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
>> >> -      HOST_WIDE_INT est_rel_old_likely
>> >> -       = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
>> >> -
>> >> -      return est_rel_new_likely * 2 <= est_rel_old_likely;
>> >> -    }
>> >> -
>> >> -  /* If there's nothing to choose between the loop bodies, see whether
>> >> -     there's a difference in the prologue and epilogue costs.  */
>> >> -  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
>> >> -  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
>> >> -  if (new_outside_cost != old_outside_cost)
>> >> -    return new_outside_cost < old_outside_cost;
>> >> +  const auto *old_costs = old_loop_vinfo->vector_costs;
>> >> +  const auto *new_costs = new_loop_vinfo->vector_costs;
>> >> +  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
>> >> +    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
>> >>
>> >> -  return false;
>> >> +  return new_costs->better_main_loop_than_p (old_costs);
>> >>  }
>> >>
>> >>  /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
>> >> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
>> >> index 9ef76ce654b..dcbb2a3f13a 100644
>> >> --- a/gcc/tree-vectorizer.c
>> >> +++ b/gcc/tree-vectorizer.c
>> >> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
>> >>      }
>> >>    return cost;
>> >>  }
>> >> +
>> >> +/* See the comment above the declaration for details.  */
>> >> +
>> >> +bool
>> >> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
>> >> +{
>> >> +  int diff = compare_inside_loop_cost (other);
>> >> +  if (diff != 0)
>> >> +    return diff < 0;
>> >> +
>> >> +  /* If there's nothing to choose between the loop bodies, see whether
>> >> +     there's a difference in the prologue and epilogue costs.  */
>> >> +  diff = compare_outside_loop_cost (other);
>> >> +  if (diff != 0)
>> >> +    return diff < 0;
>> >> +
>> >> +  return false;
>> >> +}
>> >> +
>> >> +
>> >> +/* See the comment above the declaration for details.  */
>> >> +
>> >> +bool
>> >> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
>> >> +                                          loop_vec_info main_loop) const
>> >> +{
>> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> >> +
>> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> >> +
>> >> +  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> >> +  unsigned HOST_WIDE_INT main_vf;
>> >> +  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
>> >> +  /* If we can determine how many iterations are left for the epilogue
>> >> +     loop, that is if both the main loop's vectorization factor and number
>> >> +     of iterations are constant, then we use them to calculate the cost of
>> >> +     the epilogue loop together with a 'likely value' for the epilogues
>> >> +     vectorization factor.  Otherwise we use the main loop's vectorization
>> >> +     factor and the maximum poly value for the epilogue's.  If the target
>> >> +     has not provided with a sensible upper bound poly vectorization
>> >> +     factors are likely to be favored over constant ones.  */
>> >> +  if (main_poly_vf.is_constant (&main_vf)
>> >> +      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> >> +    {
>> >> +      unsigned HOST_WIDE_INT niters
>> >> +       = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> >> +      HOST_WIDE_INT other_likely_vf
>> >> +       = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
>> >> +      HOST_WIDE_INT this_likely_vf
>> >> +       = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
>> >> +
>> >> +      /* If the epilogue is using partial vectors we account for the
>> >> +        partial iteration here too.  */
>> >> +      other_factor = niters / other_likely_vf;
>> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
>> >> +         && niters % other_likely_vf != 0)
>> >> +       other_factor++;
>> >> +
>> >> +      this_factor = niters / this_likely_vf;
>> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
>> >> +         && niters % this_likely_vf != 0)
>> >> +       this_factor++;
>> >> +    }
>> >> +  else
>> >> +    {
>> >> +      unsigned HOST_WIDE_INT main_vf_max
>> >> +       = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> >> +
>> >> +      other_factor = main_vf_max / estimated_poly_value (other_vf,
>> >> +                                                      POLY_VALUE_MAX);
>> >> +      this_factor = main_vf_max / estimated_poly_value (this_vf,
>> >> +                                                      POLY_VALUE_MAX);
>> >> +
>> >> +      /* If the loop is not using partial vectors then it will iterate one
>> >> +        time less than one that does.  It is safe to subtract one here,
>> >> +        because the main loop's vf is always at least 2x bigger than that
>> >> +        of an epilogue.  */
>> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
>> >> +       other_factor -= 1;
>> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
>> >> +       this_factor -= 1;
>> >> +    }
>> >> +
>> >> +  /* Compute the costs by multiplying the inside costs with the factor and
>> >> +     add the outside costs for a more complete picture.  The factor is the
>> >> +     amount of times we are expecting to iterate this epilogue.  */
>> >> +  other_cost = other->body_cost () * other_factor;
>> >> +  this_cost = this->body_cost () * this_factor;
>> >> +  other_cost += other->outside_cost ();
>> >> +  this_cost += this->outside_cost ();
>> >> +  return this_cost < other_cost;
>> >> +}
>> >> +
>> >> +/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
>> >> +   determine the return value of better_main_loop_than_p by comparing the
>> >> +   inside (loop body) costs of THIS and OTHER.  Return:
>> >> +
>> >> +   * -1 if better_main_loop_than_p should return true.
>> >> +   * 1 if better_main_loop_than_p should return false.
>> >> +   * 0 if we can't decide.  */
>> >> +
>> >> +int
>> >> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
>> >> +{
>> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> >> +
>> >> +  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
>> >> +  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
>> >> +
>> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> >> +
>> >> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
>> >> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
>> >> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
>> >> +  if (estimated_max_niter != -1)
>> >> +    {
>> >> +      if (known_le (estimated_max_niter, this_vf))
>> >> +       this_vf = estimated_max_niter;
>> >> +      if (known_le (estimated_max_niter, other_vf))
>> >> +       other_vf = estimated_max_niter;
>> >> +    }
>> >> +
>> >> +  /* Check whether the (fractional) cost per scalar iteration is lower or
>> >> +     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
>> >> +  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
>> >> +  poly_int64 rel_other
>> >> +    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
>> >> +
>> >> +  HOST_WIDE_INT est_rel_this_min
>> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
>> >> +  HOST_WIDE_INT est_rel_this_max
>> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
>> >> +
>> >> +  HOST_WIDE_INT est_rel_other_min
>> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
>> >> +  HOST_WIDE_INT est_rel_other_max
>> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
>> >> +
>> >> +  /* Check first if we can make out an unambigous total order from the minimum
>> >> +     and maximum estimates.  */
>> >> +  if (est_rel_this_min < est_rel_other_min
>> >> +      && est_rel_this_max < est_rel_other_max)
>> >> +    return -1;
>> >> +
>> >> +  if (est_rel_other_min < est_rel_this_min
>> >> +      && est_rel_other_max < est_rel_this_max)
>> >> +    return 1;
>> >> +
>> >> +  /* When other_loop_vinfo uses a variable vectorization factor,
>> >> +     we know that it has a lower cost for at least one runtime VF.
>> >> +     However, we don't know how likely that VF is.
>> >> +
>> >> +     One option would be to compare the costs for the estimated VFs.
>> >> +     The problem is that that can put too much pressure on the cost
>> >> +     model.  E.g. if the estimated VF is also the lowest possible VF,
>> >> +     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
>> >> +     for the estimated VF, we'd then choose this_loop_vinfo even
>> >> +     though (a) this_loop_vinfo might not actually be better than
>> >> +     other_loop_vinfo for that VF and (b) it would be significantly
>> >> +     worse at larger VFs.
>> >> +
>> >> +     Here we go for a hacky compromise: pick this_loop_vinfo if it is
>> >> +     no more expensive than other_loop_vinfo even after doubling the
>> >> +     estimated other_loop_vinfo VF.  For all but trivial loops, this
>> >> +     ensures that we only pick this_loop_vinfo if it is significantly
>> >> +     better than other_loop_vinfo at the estimated VF.  */
>> >> +  if (est_rel_other_min != est_rel_this_min
>> >> +      || est_rel_other_max != est_rel_this_max)
>> >> +    {
>> >> +      HOST_WIDE_INT est_rel_this_likely
>> >> +       = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
>> >> +      HOST_WIDE_INT est_rel_other_likely
>> >> +       = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
>> >> +
>> >> +      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
>> >> +    }
>> >> +
>> >> +  return 0;
>> >> +}
>> >> +
>> >> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
>> >> +   nothing to choose between the inside (loop body) costs of THIS and OTHER.
>> >> +   Check whether we can determine the return value of better_main_loop_than_p
>> >> +   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
>> >> +   Return:
>> >> +
>> >> +   * -1 if better_main_loop_than_p should return true.
>> >> +   * 1 if better_main_loop_than_p should return false.
>> >> +   * 0 if we can't decide.  */
>> >> +
>> >> +int
>> >> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
>> >> +{
>> >> +  auto this_outside_cost = this->outside_cost ();
>> >> +  auto other_outside_cost = other->outside_cost ();
>> >> +  if (this_outside_cost != other_outside_cost)
>> >> +    return this_outside_cost < other_outside_cost ? -1 : 1;
>> >> +
>> >> +  return 0;
>> >> +}
>> >> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> >> index 87d3f211a2e..0e3aad590e8 100644
>> >> --- a/gcc/tree-vectorizer.h
>> >> +++ b/gcc/tree-vectorizer.h
>> >> @@ -1419,6 +1419,21 @@ public:
>> >>       read back using the functions below.  */
>> >>    virtual void finish_cost ();
>> >>
>> >> +  /* The costs in THIS and OTHER both describe ways of vectorizing
>> >> +     a main loop.  Return true if the costs described by THIS are
>> >> +     cheaper than the costs described by OTHER.  Return false if any
>> >> +     of the following are true:
>> >> +
>> >> +     - THIS and OTHER are of equal cost
>> >> +     - OTHER is better than THIS
>> >> +     - we can't be sure about the relative costs of THIS and OTHER.  */
>> >> +  virtual bool better_main_loop_than_p (const vector_costs *other) const;
>> >> +
>> >> +  /* Likewise, but the costs in THIS and OTHER both describe ways of
>> >> +     vectorizing an epilogue loop of MAIN_LOOP.  */
>> >> +  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
>> >> +                                           loop_vec_info main_loop) const;
>> >> +
>> >>    unsigned int prologue_cost () const;
>> >>    unsigned int body_cost () const;
>> >>    unsigned int epilogue_cost () const;
>> >> @@ -1429,6 +1444,8 @@ protected:
>> >>                                  unsigned int);
>> >>    unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
>> >>                                      unsigned int);
>> >> +  int compare_inside_loop_cost (const vector_costs *) const;
>> >> +  int compare_outside_loop_cost (const vector_costs *) const;
>> >>
>> >>    /* The region of code that we're considering vectorizing.  */
>> >>    vec_info *m_vinfo;
>> >> --
>> >> 2.25.1
>> >>
  
Richard Biener Nov. 9, 2021, 9:30 a.m. UTC | #5
On Mon, Nov 8, 2021 at 3:46 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Mon, Nov 8, 2021 at 2:06 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> >>
> >> >> One of the things we want to do on AArch64 is compare vector loops
> >> >> side-by-side and pick the best one.  For some targets, we want this
> >> >> to be based on issue rates as well as the usual latency-based costs
> >> >> (at least for loops with relatively high iteration counts).
> >> >>
> >> >> The current approach to doing this is: when costing vectorisation
> >> >> candidate A, try to guess what the other main candidate B will look
> >> >> like and adjust A's latency-based cost up or down based on the likely
> >> >> difference between A and B's issue rates.  This effectively means
> >> >> that we try to cost parts of B at the same time as A, without actually
> >> >> being able to see B.
> >> >
> >> > I think the idea of the current costing is that compares are always
> >> > to the original scalar loop (with comparing the resulting magic
> >> > cost numbers) so that by means of transitivity comparing two
> >> > vector variants work as well.  So I'm not sure where 'variant B'
> >> > comes into play here?
> >>
> >> E.g. A could be SVE and B could be Advanced SIMD, or A could be
> >> SVE with fully-packed vectors and B could be SVE with partially
> >> unpacked vectors.
> >>
> >> One motivating case is Neoverse V1, which is a 256-bit SVE target.
> >> There, scalar code, Advanced SIMD code and SVE code have different issue
> >> characteristics.  SVE vector ops generally have the same latencies as
> >> the corresponding Advanced SIMD ops.  Advanced SIMD ops generally
> >> have double the throughput of SVE ones, so that the overall bit-for-bit
> >> throughputs are roughly equal.  However, there are differences due to
> >> predication, load/store handling, and so on, and those differences
> >> should be decisive in some cases.
> >>
> >> For SLP, latency-based costs are fine.  But for loop costs, they hide
> >> too many details.  What works best in practice, both for vector vs.
> >> vector and vector vs. scalar, is:
> >>
> >> (1) compare issue rates between two loop bodies (vector vs. vector
> >>     or vector vs. scalar)
> >> (2) if issue rates are equal for a given number of scalar iterations,
> >>     compare the latencies of the loop bodies, as we do now
> >> (3) if both the above are equal, compare the latencies of the prologue
> >>     and epilogue, as we do now
> >>
> >> It's very difficult to combine latency and issue rate information
> >> into a single per-statement integer, in such a way that the integers
> >> can just be summed and compared.
> >
> > Yeah, so the idea I had when proposing the init_cost/add_stmt/finish_cost
> > was that finish_cost would determine the issue rate cost part, for
> > example if the uarch can issue 2 ops with ISA A and 4 ops with ISA B
> > then finish_cost would compute the number of cycles required to
> > issue all vector stmts.  That yields sth like IPC the latency cost could
> > be divided by.  Doing that for both ISA A and ISA B would produce
> > weighted latency values that can be compared?  Alternatively
> > you can of course simulate issue with the actual instruction latencies
> > in mind and produce an overall iteration latency number.
> >
> > Comparing just latency or issue rate in isolation is likely not good
> > enough?
>
> In practice, comparing issue rates in isolation does seem to be what we
> want as the first level of comparison.  I don't think total latency /
> issue rate is really a meaningful combination.  It makes sense to the
> extent that “lower latency == good” and “higher issue rate == good”,
> but I don't think it's the case that halving the total latency is
> equally valuable as doubling the issue rate.  In practice the latter is
> a significant win while the former might or might not be.
>
> total latency / issue rate also runs the risk of double-counting:
> reducing the number of ops decreases the total latency *and* increases
> the issue rate, which could have a quadratic effect.  (This is why we
> don't decrease SVE costs when we think that the SVE code will issue
> more quickly than the Advanced SIMD code.)
>
> For a long-running loop, the only latencies that really matter are
> for loop-carried dependencies.  The issue rate information takes
> those into account, in that it tracks reduction and (with later
> patches) induction limiters.

I see.  I'm somewhat worried that we basically move all intelligence
to the target hooks this way, requiring each target to duplicate code.

In fact one item on my list is to try tackling instruction issue on
the vectorizer side where we have an idea on the data dependencies
where the target then would continue to mostly work on the
instruction level, returning "I can issue at most 4 of these, each
with a latency of 2",
plus globally specifying "I can issue 6 ops per clock".

So when we rework the costing side we'll have to deal with the target
doing magic that's not per insn and somehow preserve it - that might
prove mightly complicated ...

That said, if you are prepared to deal with this on the arm side
when such reorg ever happens then I'm fine with the two patches.

Thanks,
Richard.

> Thanks,
> Richard
>
> >
> >> So returning to the above, when costing an SVE loop A, we currently
> >> collect on-the-side issue information about both A and the likely
> >> Advanced SIMD implementation B.  If B would have a higher throughput
> >> than A then we multiply A's latency-based costs by:
> >>
> >>    ceil(B throughput/A throughput)
> >>
> >> The problem is, we don't actually know whether B exists or what it
> >> would look like (given that Advanced SIMD and SVE have different
> >> features).
> >>
> >> In principle, we should do the same in reverse, but since SVE needs
> >> fewer ops than Advanced SIMD, the latencies are usually smaller already.
> >>
> >> We also do something similar for the scalar code.  When costing both
> >> Advanced SIMD and SVE, we try to estimate the issue rate of the
> >> original scalar code.  If the scalar code could issue more quickly
> >> than the equivalent vector code, we increase the latency-based cost
> >> of the vector code to account for that.
> >>
> >> The idea of the patch (and others) is that we can do the (1), (2),
> >> (3) comparison above directly rather than indirectly.  We can also
> >> use the issue information calculated while costing the scalar code,
> >> rather than having to estimate the scalar issue information while
> >> costing the vector code.
> >>
> >> >> This is needlessly indirect and complex.  It was a compromise due
> >> >> to the code being added (too) late in the GCC 11 cycle, so that
> >> >> target-independent changes weren't possible.
> >> >>
> >> >> The target-independent code already compares two candidate loop_vec_infos
> >> >> side-by-side, so that information about A and B above are available
> >> >> directly.  This patch creates a way for targets to hook into this
> >> >> comparison.
> >> >
> >> > You mean it has both loop_vec_infos.  But as said I'm not sure it's a good
> >> > idea to compare those side-by-side - will that not lead to _more_ special-casing
> >> > since you need to have a working compare to the scalar variant as well
> >> > to determine the vectorization threshold?
> >>
> >> A later patch allows the code to do the same comparison for vector
> >> vs. scalar.  It makes the target code significantly simpler compared
> >> to now, since add_stmt_cost only needs to consider the latency and
> >> issue rate of the code it's actually supposed to be costing, rather than
> >> trying also to estimate the issue rate of the scalar code and the issue
> >> rate of the Advanced SIMD code.
> >>
> >> Thanks,
> >> Richard
> >>
> >> >
> >> >>
> >> >> The AArch64 code can therefore hook into better_main_loop_than_p to
> >> >> compare issue rates.  If the issue rate comparison isn't decisive,
> >> >> the code can fall back to the normal latency-based comparison instead.
> >> >>
> >> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
> >> >>
> >> >> Richard
> >> >>
> >> >>
> >> >> gcc/
> >> >>         * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
> >> >>         (vector_costs::better_epilogue_loop_than_p)
> >> >>         (vector_costs::compare_inside_loop_cost)
> >> >>         (vector_costs::compare_outside_loop_cost): Likewise.
> >> >>         * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
> >> >>         (vector_costs::better_epilogue_loop_than_p)
> >> >>         (vector_costs::compare_inside_loop_cost)
> >> >>         (vector_costs::compare_outside_loop_cost): New functions,
> >> >>         containing code moved from...
> >> >>         * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
> >> >> ---
> >> >>  gcc/tree-vect-loop.c  | 142 ++---------------------------
> >> >>  gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
> >> >>  gcc/tree-vectorizer.h |  17 ++++
> >> >>  3 files changed, 226 insertions(+), 137 deletions(-)
> >> >>
> >> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >> >> index dd4a363fee5..c9ee2e15e35 100644
> >> >> --- a/gcc/tree-vect-loop.c
> >> >> +++ b/gcc/tree-vect-loop.c
> >> >> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
> >> >>         return new_simdlen_p;
> >> >>      }
> >> >>
> >> >> -  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
> >> >> -  if (main_loop)
> >> >> -    {
> >> >> -      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
> >> >> -      unsigned HOST_WIDE_INT main_vf;
> >> >> -      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
> >> >> -      /* If we can determine how many iterations are left for the epilogue
> >> >> -        loop, that is if both the main loop's vectorization factor and number
> >> >> -        of iterations are constant, then we use them to calculate the cost of
> >> >> -        the epilogue loop together with a 'likely value' for the epilogues
> >> >> -        vectorization factor.  Otherwise we use the main loop's vectorization
> >> >> -        factor and the maximum poly value for the epilogue's.  If the target
> >> >> -        has not provided with a sensible upper bound poly vectorization
> >> >> -        factors are likely to be favored over constant ones.  */
> >> >> -      if (main_poly_vf.is_constant (&main_vf)
> >> >> -         && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
> >> >> -       {
> >> >> -         unsigned HOST_WIDE_INT niters
> >> >> -           = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> >> >> -         HOST_WIDE_INT old_likely_vf
> >> >> -           = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
> >> >> -         HOST_WIDE_INT new_likely_vf
> >> >> -           = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
> >> >> -
> >> >> -         /* If the epilogue is using partial vectors we account for the
> >> >> -            partial iteration here too.  */
> >> >> -         old_factor = niters / old_likely_vf;
> >> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
> >> >> -             && niters % old_likely_vf != 0)
> >> >> -           old_factor++;
> >> >> -
> >> >> -         new_factor = niters / new_likely_vf;
> >> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
> >> >> -             && niters % new_likely_vf != 0)
> >> >> -           new_factor++;
> >> >> -       }
> >> >> -      else
> >> >> -       {
> >> >> -         unsigned HOST_WIDE_INT main_vf_max
> >> >> -           = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> >> >> -
> >> >> -         old_factor = main_vf_max / estimated_poly_value (old_vf,
> >> >> -                                                          POLY_VALUE_MAX);
> >> >> -         new_factor = main_vf_max / estimated_poly_value (new_vf,
> >> >> -                                                          POLY_VALUE_MAX);
> >> >> -
> >> >> -         /* If the loop is not using partial vectors then it will iterate one
> >> >> -            time less than one that does.  It is safe to subtract one here,
> >> >> -            because the main loop's vf is always at least 2x bigger than that
> >> >> -            of an epilogue.  */
> >> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
> >> >> -           old_factor -= 1;
> >> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
> >> >> -           new_factor -= 1;
> >> >> -       }
> >> >> -
> >> >> -      /* Compute the costs by multiplying the inside costs with the factor and
> >> >> -        add the outside costs for a more complete picture.  The factor is the
> >> >> -        amount of times we are expecting to iterate this epilogue.  */
> >> >> -      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
> >> >> -      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
> >> >> -      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
> >> >> -      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
> >> >> -      return new_cost < old_cost;
> >> >> -    }
> >> >> -
> >> >> -  /* Limit the VFs to what is likely to be the maximum number of iterations,
> >> >> -     to handle cases in which at least one loop_vinfo is fully-masked.  */
> >> >> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> >> >> -  if (estimated_max_niter != -1)
> >> >> -    {
> >> >> -      if (known_le (estimated_max_niter, new_vf))
> >> >> -       new_vf = estimated_max_niter;
> >> >> -      if (known_le (estimated_max_niter, old_vf))
> >> >> -       old_vf = estimated_max_niter;
> >> >> -    }
> >> >> -
> >> >> -  /* Check whether the (fractional) cost per scalar iteration is lower
> >> >> -     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
> >> >> -  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
> >> >> -  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
> >> >> -
> >> >> -  HOST_WIDE_INT est_rel_new_min
> >> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
> >> >> -  HOST_WIDE_INT est_rel_new_max
> >> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
> >> >> -
> >> >> -  HOST_WIDE_INT est_rel_old_min
> >> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
> >> >> -  HOST_WIDE_INT est_rel_old_max
> >> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
> >> >> -
> >> >> -  /* Check first if we can make out an unambigous total order from the minimum
> >> >> -     and maximum estimates.  */
> >> >> -  if (est_rel_new_min < est_rel_old_min
> >> >> -      && est_rel_new_max < est_rel_old_max)
> >> >> -    return true;
> >> >> -  else if (est_rel_old_min < est_rel_new_min
> >> >> -          && est_rel_old_max < est_rel_new_max)
> >> >> -    return false;
> >> >> -  /* When old_loop_vinfo uses a variable vectorization factor,
> >> >> -     we know that it has a lower cost for at least one runtime VF.
> >> >> -     However, we don't know how likely that VF is.
> >> >> -
> >> >> -     One option would be to compare the costs for the estimated VFs.
> >> >> -     The problem is that that can put too much pressure on the cost
> >> >> -     model.  E.g. if the estimated VF is also the lowest possible VF,
> >> >> -     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
> >> >> -     for the estimated VF, we'd then choose new_loop_vinfo even
> >> >> -     though (a) new_loop_vinfo might not actually be better than
> >> >> -     old_loop_vinfo for that VF and (b) it would be significantly
> >> >> -     worse at larger VFs.
> >> >> -
> >> >> -     Here we go for a hacky compromise: pick new_loop_vinfo if it is
> >> >> -     no more expensive than old_loop_vinfo even after doubling the
> >> >> -     estimated old_loop_vinfo VF.  For all but trivial loops, this
> >> >> -     ensures that we only pick new_loop_vinfo if it is significantly
> >> >> -     better than old_loop_vinfo at the estimated VF.  */
> >> >> -
> >> >> -  if (est_rel_old_min != est_rel_new_min
> >> >> -      || est_rel_old_max != est_rel_new_max)
> >> >> -    {
> >> >> -      HOST_WIDE_INT est_rel_new_likely
> >> >> -       = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
> >> >> -      HOST_WIDE_INT est_rel_old_likely
> >> >> -       = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
> >> >> -
> >> >> -      return est_rel_new_likely * 2 <= est_rel_old_likely;
> >> >> -    }
> >> >> -
> >> >> -  /* If there's nothing to choose between the loop bodies, see whether
> >> >> -     there's a difference in the prologue and epilogue costs.  */
> >> >> -  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
> >> >> -  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
> >> >> -  if (new_outside_cost != old_outside_cost)
> >> >> -    return new_outside_cost < old_outside_cost;
> >> >> +  const auto *old_costs = old_loop_vinfo->vector_costs;
> >> >> +  const auto *new_costs = new_loop_vinfo->vector_costs;
> >> >> +  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
> >> >> +    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
> >> >>
> >> >> -  return false;
> >> >> +  return new_costs->better_main_loop_than_p (old_costs);
> >> >>  }
> >> >>
> >> >>  /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
> >> >> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> >> >> index 9ef76ce654b..dcbb2a3f13a 100644
> >> >> --- a/gcc/tree-vectorizer.c
> >> >> +++ b/gcc/tree-vectorizer.c
> >> >> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
> >> >>      }
> >> >>    return cost;
> >> >>  }
> >> >> +
> >> >> +/* See the comment above the declaration for details.  */
> >> >> +
> >> >> +bool
> >> >> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
> >> >> +{
> >> >> +  int diff = compare_inside_loop_cost (other);
> >> >> +  if (diff != 0)
> >> >> +    return diff < 0;
> >> >> +
> >> >> +  /* If there's nothing to choose between the loop bodies, see whether
> >> >> +     there's a difference in the prologue and epilogue costs.  */
> >> >> +  diff = compare_outside_loop_cost (other);
> >> >> +  if (diff != 0)
> >> >> +    return diff < 0;
> >> >> +
> >> >> +  return false;
> >> >> +}
> >> >> +
> >> >> +
> >> >> +/* See the comment above the declaration for details.  */
> >> >> +
> >> >> +bool
> >> >> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
> >> >> +                                          loop_vec_info main_loop) const
> >> >> +{
> >> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
> >> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
> >> >> +
> >> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
> >> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
> >> >> +
> >> >> +  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
> >> >> +  unsigned HOST_WIDE_INT main_vf;
> >> >> +  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
> >> >> +  /* If we can determine how many iterations are left for the epilogue
> >> >> +     loop, that is if both the main loop's vectorization factor and number
> >> >> +     of iterations are constant, then we use them to calculate the cost of
> >> >> +     the epilogue loop together with a 'likely value' for the epilogues
> >> >> +     vectorization factor.  Otherwise we use the main loop's vectorization
> >> >> +     factor and the maximum poly value for the epilogue's.  If the target
> >> >> +     has not provided with a sensible upper bound poly vectorization
> >> >> +     factors are likely to be favored over constant ones.  */
> >> >> +  if (main_poly_vf.is_constant (&main_vf)
> >> >> +      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
> >> >> +    {
> >> >> +      unsigned HOST_WIDE_INT niters
> >> >> +       = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> >> >> +      HOST_WIDE_INT other_likely_vf
> >> >> +       = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
> >> >> +      HOST_WIDE_INT this_likely_vf
> >> >> +       = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
> >> >> +
> >> >> +      /* If the epilogue is using partial vectors we account for the
> >> >> +        partial iteration here too.  */
> >> >> +      other_factor = niters / other_likely_vf;
> >> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
> >> >> +         && niters % other_likely_vf != 0)
> >> >> +       other_factor++;
> >> >> +
> >> >> +      this_factor = niters / this_likely_vf;
> >> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
> >> >> +         && niters % this_likely_vf != 0)
> >> >> +       this_factor++;
> >> >> +    }
> >> >> +  else
> >> >> +    {
> >> >> +      unsigned HOST_WIDE_INT main_vf_max
> >> >> +       = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> >> >> +
> >> >> +      other_factor = main_vf_max / estimated_poly_value (other_vf,
> >> >> +                                                      POLY_VALUE_MAX);
> >> >> +      this_factor = main_vf_max / estimated_poly_value (this_vf,
> >> >> +                                                      POLY_VALUE_MAX);
> >> >> +
> >> >> +      /* If the loop is not using partial vectors then it will iterate one
> >> >> +        time less than one that does.  It is safe to subtract one here,
> >> >> +        because the main loop's vf is always at least 2x bigger than that
> >> >> +        of an epilogue.  */
> >> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
> >> >> +       other_factor -= 1;
> >> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
> >> >> +       this_factor -= 1;
> >> >> +    }
> >> >> +
> >> >> +  /* Compute the costs by multiplying the inside costs with the factor and
> >> >> +     add the outside costs for a more complete picture.  The factor is the
> >> >> +     amount of times we are expecting to iterate this epilogue.  */
> >> >> +  other_cost = other->body_cost () * other_factor;
> >> >> +  this_cost = this->body_cost () * this_factor;
> >> >> +  other_cost += other->outside_cost ();
> >> >> +  this_cost += this->outside_cost ();
> >> >> +  return this_cost < other_cost;
> >> >> +}
> >> >> +
> >> >> +/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
> >> >> +   determine the return value of better_main_loop_than_p by comparing the
> >> >> +   inside (loop body) costs of THIS and OTHER.  Return:
> >> >> +
> >> >> +   * -1 if better_main_loop_than_p should return true.
> >> >> +   * 1 if better_main_loop_than_p should return false.
> >> >> +   * 0 if we can't decide.  */
> >> >> +
> >> >> +int
> >> >> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
> >> >> +{
> >> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
> >> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
> >> >> +
> >> >> +  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
> >> >> +  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
> >> >> +
> >> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
> >> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
> >> >> +
> >> >> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
> >> >> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
> >> >> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> >> >> +  if (estimated_max_niter != -1)
> >> >> +    {
> >> >> +      if (known_le (estimated_max_niter, this_vf))
> >> >> +       this_vf = estimated_max_niter;
> >> >> +      if (known_le (estimated_max_niter, other_vf))
> >> >> +       other_vf = estimated_max_niter;
> >> >> +    }
> >> >> +
> >> >> +  /* Check whether the (fractional) cost per scalar iteration is lower or
> >> >> +     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
> >> >> +  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
> >> >> +  poly_int64 rel_other
> >> >> +    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
> >> >> +
> >> >> +  HOST_WIDE_INT est_rel_this_min
> >> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
> >> >> +  HOST_WIDE_INT est_rel_this_max
> >> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
> >> >> +
> >> >> +  HOST_WIDE_INT est_rel_other_min
> >> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
> >> >> +  HOST_WIDE_INT est_rel_other_max
> >> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
> >> >> +
> >> >> +  /* Check first if we can make out an unambigous total order from the minimum
> >> >> +     and maximum estimates.  */
> >> >> +  if (est_rel_this_min < est_rel_other_min
> >> >> +      && est_rel_this_max < est_rel_other_max)
> >> >> +    return -1;
> >> >> +
> >> >> +  if (est_rel_other_min < est_rel_this_min
> >> >> +      && est_rel_other_max < est_rel_this_max)
> >> >> +    return 1;
> >> >> +
> >> >> +  /* When other_loop_vinfo uses a variable vectorization factor,
> >> >> +     we know that it has a lower cost for at least one runtime VF.
> >> >> +     However, we don't know how likely that VF is.
> >> >> +
> >> >> +     One option would be to compare the costs for the estimated VFs.
> >> >> +     The problem is that that can put too much pressure on the cost
> >> >> +     model.  E.g. if the estimated VF is also the lowest possible VF,
> >> >> +     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
> >> >> +     for the estimated VF, we'd then choose this_loop_vinfo even
> >> >> +     though (a) this_loop_vinfo might not actually be better than
> >> >> +     other_loop_vinfo for that VF and (b) it would be significantly
> >> >> +     worse at larger VFs.
> >> >> +
> >> >> +     Here we go for a hacky compromise: pick this_loop_vinfo if it is
> >> >> +     no more expensive than other_loop_vinfo even after doubling the
> >> >> +     estimated other_loop_vinfo VF.  For all but trivial loops, this
> >> >> +     ensures that we only pick this_loop_vinfo if it is significantly
> >> >> +     better than other_loop_vinfo at the estimated VF.  */
> >> >> +  if (est_rel_other_min != est_rel_this_min
> >> >> +      || est_rel_other_max != est_rel_this_max)
> >> >> +    {
> >> >> +      HOST_WIDE_INT est_rel_this_likely
> >> >> +       = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
> >> >> +      HOST_WIDE_INT est_rel_other_likely
> >> >> +       = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
> >> >> +
> >> >> +      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
> >> >> +    }
> >> >> +
> >> >> +  return 0;
> >> >> +}
> >> >> +
> >> >> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
> >> >> +   nothing to choose between the inside (loop body) costs of THIS and OTHER.
> >> >> +   Check whether we can determine the return value of better_main_loop_than_p
> >> >> +   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
> >> >> +   Return:
> >> >> +
> >> >> +   * -1 if better_main_loop_than_p should return true.
> >> >> +   * 1 if better_main_loop_than_p should return false.
> >> >> +   * 0 if we can't decide.  */
> >> >> +
> >> >> +int
> >> >> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
> >> >> +{
> >> >> +  auto this_outside_cost = this->outside_cost ();
> >> >> +  auto other_outside_cost = other->outside_cost ();
> >> >> +  if (this_outside_cost != other_outside_cost)
> >> >> +    return this_outside_cost < other_outside_cost ? -1 : 1;
> >> >> +
> >> >> +  return 0;
> >> >> +}
> >> >> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> >> >> index 87d3f211a2e..0e3aad590e8 100644
> >> >> --- a/gcc/tree-vectorizer.h
> >> >> +++ b/gcc/tree-vectorizer.h
> >> >> @@ -1419,6 +1419,21 @@ public:
> >> >>       read back using the functions below.  */
> >> >>    virtual void finish_cost ();
> >> >>
> >> >> +  /* The costs in THIS and OTHER both describe ways of vectorizing
> >> >> +     a main loop.  Return true if the costs described by THIS are
> >> >> +     cheaper than the costs described by OTHER.  Return false if any
> >> >> +     of the following are true:
> >> >> +
> >> >> +     - THIS and OTHER are of equal cost
> >> >> +     - OTHER is better than THIS
> >> >> +     - we can't be sure about the relative costs of THIS and OTHER.  */
> >> >> +  virtual bool better_main_loop_than_p (const vector_costs *other) const;
> >> >> +
> >> >> +  /* Likewise, but the costs in THIS and OTHER both describe ways of
> >> >> +     vectorizing an epilogue loop of MAIN_LOOP.  */
> >> >> +  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
> >> >> +                                           loop_vec_info main_loop) const;
> >> >> +
> >> >>    unsigned int prologue_cost () const;
> >> >>    unsigned int body_cost () const;
> >> >>    unsigned int epilogue_cost () const;
> >> >> @@ -1429,6 +1444,8 @@ protected:
> >> >>                                  unsigned int);
> >> >>    unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
> >> >>                                      unsigned int);
> >> >> +  int compare_inside_loop_cost (const vector_costs *) const;
> >> >> +  int compare_outside_loop_cost (const vector_costs *) const;
> >> >>
> >> >>    /* The region of code that we're considering vectorizing.  */
> >> >>    vec_info *m_vinfo;
> >> >> --
> >> >> 2.25.1
> >> >>
  
Richard Sandiford Nov. 9, 2021, 10:47 a.m. UTC | #6
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, Nov 8, 2021 at 3:46 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > On Mon, Nov 8, 2021 at 2:06 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Richard Biener <richard.guenther@gmail.com> writes:
>> >> > On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
>> >> > <gcc-patches@gcc.gnu.org> wrote:
>> >> >>
>> >> >> One of the things we want to do on AArch64 is compare vector loops
>> >> >> side-by-side and pick the best one.  For some targets, we want this
>> >> >> to be based on issue rates as well as the usual latency-based costs
>> >> >> (at least for loops with relatively high iteration counts).
>> >> >>
>> >> >> The current approach to doing this is: when costing vectorisation
>> >> >> candidate A, try to guess what the other main candidate B will look
>> >> >> like and adjust A's latency-based cost up or down based on the likely
>> >> >> difference between A and B's issue rates.  This effectively means
>> >> >> that we try to cost parts of B at the same time as A, without actually
>> >> >> being able to see B.
>> >> >
>> >> > I think the idea of the current costing is that compares are always
>> >> > to the original scalar loop (with comparing the resulting magic
>> >> > cost numbers) so that by means of transitivity comparing two
>> >> > vector variants work as well.  So I'm not sure where 'variant B'
>> >> > comes into play here?
>> >>
>> >> E.g. A could be SVE and B could be Advanced SIMD, or A could be
>> >> SVE with fully-packed vectors and B could be SVE with partially
>> >> unpacked vectors.
>> >>
>> >> One motivating case is Neoverse V1, which is a 256-bit SVE target.
>> >> There, scalar code, Advanced SIMD code and SVE code have different issue
>> >> characteristics.  SVE vector ops generally have the same latencies as
>> >> the corresponding Advanced SIMD ops.  Advanced SIMD ops generally
>> >> have double the throughput of SVE ones, so that the overall bit-for-bit
>> >> throughputs are roughly equal.  However, there are differences due to
>> >> predication, load/store handling, and so on, and those differences
>> >> should be decisive in some cases.
>> >>
>> >> For SLP, latency-based costs are fine.  But for loop costs, they hide
>> >> too many details.  What works best in practice, both for vector vs.
>> >> vector and vector vs. scalar, is:
>> >>
>> >> (1) compare issue rates between two loop bodies (vector vs. vector
>> >>     or vector vs. scalar)
>> >> (2) if issue rates are equal for a given number of scalar iterations,
>> >>     compare the latencies of the loop bodies, as we do now
>> >> (3) if both the above are equal, compare the latencies of the prologue
>> >>     and epilogue, as we do now
>> >>
>> >> It's very difficult to combine latency and issue rate information
>> >> into a single per-statement integer, in such a way that the integers
>> >> can just be summed and compared.
>> >
>> > Yeah, so the idea I had when proposing the init_cost/add_stmt/finish_cost
>> > was that finish_cost would determine the issue rate cost part, for
>> > example if the uarch can issue 2 ops with ISA A and 4 ops with ISA B
>> > then finish_cost would compute the number of cycles required to
>> > issue all vector stmts.  That yields sth like IPC the latency cost could
>> > be divided by.  Doing that for both ISA A and ISA B would produce
>> > weighted latency values that can be compared?  Alternatively
>> > you can of course simulate issue with the actual instruction latencies
>> > in mind and produce an overall iteration latency number.
>> >
>> > Comparing just latency or issue rate in isolation is likely not good
>> > enough?
>>
>> In practice, comparing issue rates in isolation does seem to be what we
>> want as the first level of comparison.  I don't think total latency /
>> issue rate is really a meaningful combination.  It makes sense to the
>> extent that “lower latency == good” and “higher issue rate == good”,
>> but I don't think it's the case that halving the total latency is
>> equally valuable as doubling the issue rate.  In practice the latter is
>> a significant win while the former might or might not be.
>>
>> total latency / issue rate also runs the risk of double-counting:
>> reducing the number of ops decreases the total latency *and* increases
>> the issue rate, which could have a quadratic effect.  (This is why we
>> don't decrease SVE costs when we think that the SVE code will issue
>> more quickly than the Advanced SIMD code.)
>>
>> For a long-running loop, the only latencies that really matter are
>> for loop-carried dependencies.  The issue rate information takes
>> those into account, in that it tracks reduction and (with later
>> patches) induction limiters.
>
> I see.  I'm somewhat worried that we basically move all intelligence
> to the target hooks this way, requiring each target to duplicate code.

Yeah, I agree that's a concern.

> In fact one item on my list is to try tackling instruction issue on
> the vectorizer side where we have an idea on the data dependencies
> where the target then would continue to mostly work on the
> instruction level, returning "I can issue at most 4 of these, each
> with a latency of 2",
> plus globally specifying "I can issue 6 ops per clock".

This would be too simplistic for what we need on AArch64.  The broad
summary for Neoverse V1 would be “I can issue two SVE ops per cycle”
and “I can issue four Advanced SIMD ops per cycle”, but that hides
a lot of the details that matter.

Currently we model 4 issue rates:

(1) number of stores per cycle
(2) number of loads and stores per cycle
(3) number of “general” operations per cycle
(4) number of predicate operations per cycle

For Neoverse V1, some operations count against multiple issue rates.
E.g. a scalar store counts against (1) and (2).  A vector store counts
against (1), (2) and (3).

In some cases we also take renaming into account, which could perhaps
be seen as a (5).  I guess we would need more issue rates to handle
asymetrical ALUs properly (a bit like (1) and (2) above).

So I guess the target-independent way would need to maintain a
target-configurable tuple of issue rates, with each statement
yielding a tuple of ops to count against those rates.

That's probably worth doing, but even then, I think there would
still be cases in which the target would need to force the choice
based on target-specific properties, even if it's just as a tie
breaker when the target-independent checks come back equal.

> So when we rework the costing side we'll have to deal with the target
> doing magic that's not per insn and somehow preserve it - that might
> prove mightly complicated ...
>
> That said, if you are prepared to deal with this on the arm side
> when such reorg ever happens then I'm fine with the two patches.

Thanks.  And yeah, it would be up to us to keep the target code updated
if the target-independent costing improves.

Although it might not look like it yet, the full patch series simplifies
the current target code rather than making it more complicated. :-)
The patches also make things more direct and so hopefully easier
to maintain.

If target-independent code starts asking for issue rates (in the tuple
form sketched out above) then I think we're in a good position to
provide it.

Thanks,
Richard

>
> Thanks,
> Richard.
>
>> Thanks,
>> Richard
>>
>> >
>> >> So returning to the above, when costing an SVE loop A, we currently
>> >> collect on-the-side issue information about both A and the likely
>> >> Advanced SIMD implementation B.  If B would have a higher throughput
>> >> than A then we multiply A's latency-based costs by:
>> >>
>> >>    ceil(B throughput/A throughput)
>> >>
>> >> The problem is, we don't actually know whether B exists or what it
>> >> would look like (given that Advanced SIMD and SVE have different
>> >> features).
>> >>
>> >> In principle, we should do the same in reverse, but since SVE needs
>> >> fewer ops than Advanced SIMD, the latencies are usually smaller already.
>> >>
>> >> We also do something similar for the scalar code.  When costing both
>> >> Advanced SIMD and SVE, we try to estimate the issue rate of the
>> >> original scalar code.  If the scalar code could issue more quickly
>> >> than the equivalent vector code, we increase the latency-based cost
>> >> of the vector code to account for that.
>> >>
>> >> The idea of the patch (and others) is that we can do the (1), (2),
>> >> (3) comparison above directly rather than indirectly.  We can also
>> >> use the issue information calculated while costing the scalar code,
>> >> rather than having to estimate the scalar issue information while
>> >> costing the vector code.
>> >>
>> >> >> This is needlessly indirect and complex.  It was a compromise due
>> >> >> to the code being added (too) late in the GCC 11 cycle, so that
>> >> >> target-independent changes weren't possible.
>> >> >>
>> >> >> The target-independent code already compares two candidate loop_vec_infos
>> >> >> side-by-side, so that information about A and B above are available
>> >> >> directly.  This patch creates a way for targets to hook into this
>> >> >> comparison.
>> >> >
>> >> > You mean it has both loop_vec_infos.  But as said I'm not sure it's a good
>> >> > idea to compare those side-by-side - will that not lead to _more_ special-casing
>> >> > since you need to have a working compare to the scalar variant as well
>> >> > to determine the vectorization threshold?
>> >>
>> >> A later patch allows the code to do the same comparison for vector
>> >> vs. scalar.  It makes the target code significantly simpler compared
>> >> to now, since add_stmt_cost only needs to consider the latency and
>> >> issue rate of the code it's actually supposed to be costing, rather than
>> >> trying also to estimate the issue rate of the scalar code and the issue
>> >> rate of the Advanced SIMD code.
>> >>
>> >> Thanks,
>> >> Richard
>> >>
>> >> >
>> >> >>
>> >> >> The AArch64 code can therefore hook into better_main_loop_than_p to
>> >> >> compare issue rates.  If the issue rate comparison isn't decisive,
>> >> >> the code can fall back to the normal latency-based comparison instead.
>> >> >>
>> >> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
>> >> >>
>> >> >> Richard
>> >> >>
>> >> >>
>> >> >> gcc/
>> >> >>         * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
>> >> >>         (vector_costs::better_epilogue_loop_than_p)
>> >> >>         (vector_costs::compare_inside_loop_cost)
>> >> >>         (vector_costs::compare_outside_loop_cost): Likewise.
>> >> >>         * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
>> >> >>         (vector_costs::better_epilogue_loop_than_p)
>> >> >>         (vector_costs::compare_inside_loop_cost)
>> >> >>         (vector_costs::compare_outside_loop_cost): New functions,
>> >> >>         containing code moved from...
>> >> >>         * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
>> >> >> ---
>> >> >>  gcc/tree-vect-loop.c  | 142 ++---------------------------
>> >> >>  gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
>> >> >>  gcc/tree-vectorizer.h |  17 ++++
>> >> >>  3 files changed, 226 insertions(+), 137 deletions(-)
>> >> >>
>> >> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> >> >> index dd4a363fee5..c9ee2e15e35 100644
>> >> >> --- a/gcc/tree-vect-loop.c
>> >> >> +++ b/gcc/tree-vect-loop.c
>> >> >> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>> >> >>         return new_simdlen_p;
>> >> >>      }
>> >> >>
>> >> >> -  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
>> >> >> -  if (main_loop)
>> >> >> -    {
>> >> >> -      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> >> >> -      unsigned HOST_WIDE_INT main_vf;
>> >> >> -      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
>> >> >> -      /* If we can determine how many iterations are left for the epilogue
>> >> >> -        loop, that is if both the main loop's vectorization factor and number
>> >> >> -        of iterations are constant, then we use them to calculate the cost of
>> >> >> -        the epilogue loop together with a 'likely value' for the epilogues
>> >> >> -        vectorization factor.  Otherwise we use the main loop's vectorization
>> >> >> -        factor and the maximum poly value for the epilogue's.  If the target
>> >> >> -        has not provided with a sensible upper bound poly vectorization
>> >> >> -        factors are likely to be favored over constant ones.  */
>> >> >> -      if (main_poly_vf.is_constant (&main_vf)
>> >> >> -         && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> >> >> -       {
>> >> >> -         unsigned HOST_WIDE_INT niters
>> >> >> -           = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> >> >> -         HOST_WIDE_INT old_likely_vf
>> >> >> -           = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
>> >> >> -         HOST_WIDE_INT new_likely_vf
>> >> >> -           = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
>> >> >> -
>> >> >> -         /* If the epilogue is using partial vectors we account for the
>> >> >> -            partial iteration here too.  */
>> >> >> -         old_factor = niters / old_likely_vf;
>> >> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
>> >> >> -             && niters % old_likely_vf != 0)
>> >> >> -           old_factor++;
>> >> >> -
>> >> >> -         new_factor = niters / new_likely_vf;
>> >> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
>> >> >> -             && niters % new_likely_vf != 0)
>> >> >> -           new_factor++;
>> >> >> -       }
>> >> >> -      else
>> >> >> -       {
>> >> >> -         unsigned HOST_WIDE_INT main_vf_max
>> >> >> -           = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> >> >> -
>> >> >> -         old_factor = main_vf_max / estimated_poly_value (old_vf,
>> >> >> -                                                          POLY_VALUE_MAX);
>> >> >> -         new_factor = main_vf_max / estimated_poly_value (new_vf,
>> >> >> -                                                          POLY_VALUE_MAX);
>> >> >> -
>> >> >> -         /* If the loop is not using partial vectors then it will iterate one
>> >> >> -            time less than one that does.  It is safe to subtract one here,
>> >> >> -            because the main loop's vf is always at least 2x bigger than that
>> >> >> -            of an epilogue.  */
>> >> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
>> >> >> -           old_factor -= 1;
>> >> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
>> >> >> -           new_factor -= 1;
>> >> >> -       }
>> >> >> -
>> >> >> -      /* Compute the costs by multiplying the inside costs with the factor and
>> >> >> -        add the outside costs for a more complete picture.  The factor is the
>> >> >> -        amount of times we are expecting to iterate this epilogue.  */
>> >> >> -      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
>> >> >> -      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
>> >> >> -      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
>> >> >> -      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
>> >> >> -      return new_cost < old_cost;
>> >> >> -    }
>> >> >> -
>> >> >> -  /* Limit the VFs to what is likely to be the maximum number of iterations,
>> >> >> -     to handle cases in which at least one loop_vinfo is fully-masked.  */
>> >> >> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
>> >> >> -  if (estimated_max_niter != -1)
>> >> >> -    {
>> >> >> -      if (known_le (estimated_max_niter, new_vf))
>> >> >> -       new_vf = estimated_max_niter;
>> >> >> -      if (known_le (estimated_max_niter, old_vf))
>> >> >> -       old_vf = estimated_max_niter;
>> >> >> -    }
>> >> >> -
>> >> >> -  /* Check whether the (fractional) cost per scalar iteration is lower
>> >> >> -     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
>> >> >> -  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
>> >> >> -  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
>> >> >> -
>> >> >> -  HOST_WIDE_INT est_rel_new_min
>> >> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
>> >> >> -  HOST_WIDE_INT est_rel_new_max
>> >> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
>> >> >> -
>> >> >> -  HOST_WIDE_INT est_rel_old_min
>> >> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
>> >> >> -  HOST_WIDE_INT est_rel_old_max
>> >> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
>> >> >> -
>> >> >> -  /* Check first if we can make out an unambigous total order from the minimum
>> >> >> -     and maximum estimates.  */
>> >> >> -  if (est_rel_new_min < est_rel_old_min
>> >> >> -      && est_rel_new_max < est_rel_old_max)
>> >> >> -    return true;
>> >> >> -  else if (est_rel_old_min < est_rel_new_min
>> >> >> -          && est_rel_old_max < est_rel_new_max)
>> >> >> -    return false;
>> >> >> -  /* When old_loop_vinfo uses a variable vectorization factor,
>> >> >> -     we know that it has a lower cost for at least one runtime VF.
>> >> >> -     However, we don't know how likely that VF is.
>> >> >> -
>> >> >> -     One option would be to compare the costs for the estimated VFs.
>> >> >> -     The problem is that that can put too much pressure on the cost
>> >> >> -     model.  E.g. if the estimated VF is also the lowest possible VF,
>> >> >> -     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
>> >> >> -     for the estimated VF, we'd then choose new_loop_vinfo even
>> >> >> -     though (a) new_loop_vinfo might not actually be better than
>> >> >> -     old_loop_vinfo for that VF and (b) it would be significantly
>> >> >> -     worse at larger VFs.
>> >> >> -
>> >> >> -     Here we go for a hacky compromise: pick new_loop_vinfo if it is
>> >> >> -     no more expensive than old_loop_vinfo even after doubling the
>> >> >> -     estimated old_loop_vinfo VF.  For all but trivial loops, this
>> >> >> -     ensures that we only pick new_loop_vinfo if it is significantly
>> >> >> -     better than old_loop_vinfo at the estimated VF.  */
>> >> >> -
>> >> >> -  if (est_rel_old_min != est_rel_new_min
>> >> >> -      || est_rel_old_max != est_rel_new_max)
>> >> >> -    {
>> >> >> -      HOST_WIDE_INT est_rel_new_likely
>> >> >> -       = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
>> >> >> -      HOST_WIDE_INT est_rel_old_likely
>> >> >> -       = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
>> >> >> -
>> >> >> -      return est_rel_new_likely * 2 <= est_rel_old_likely;
>> >> >> -    }
>> >> >> -
>> >> >> -  /* If there's nothing to choose between the loop bodies, see whether
>> >> >> -     there's a difference in the prologue and epilogue costs.  */
>> >> >> -  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
>> >> >> -  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
>> >> >> -  if (new_outside_cost != old_outside_cost)
>> >> >> -    return new_outside_cost < old_outside_cost;
>> >> >> +  const auto *old_costs = old_loop_vinfo->vector_costs;
>> >> >> +  const auto *new_costs = new_loop_vinfo->vector_costs;
>> >> >> +  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
>> >> >> +    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
>> >> >>
>> >> >> -  return false;
>> >> >> +  return new_costs->better_main_loop_than_p (old_costs);
>> >> >>  }
>> >> >>
>> >> >>  /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
>> >> >> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
>> >> >> index 9ef76ce654b..dcbb2a3f13a 100644
>> >> >> --- a/gcc/tree-vectorizer.c
>> >> >> +++ b/gcc/tree-vectorizer.c
>> >> >> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
>> >> >>      }
>> >> >>    return cost;
>> >> >>  }
>> >> >> +
>> >> >> +/* See the comment above the declaration for details.  */
>> >> >> +
>> >> >> +bool
>> >> >> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
>> >> >> +{
>> >> >> +  int diff = compare_inside_loop_cost (other);
>> >> >> +  if (diff != 0)
>> >> >> +    return diff < 0;
>> >> >> +
>> >> >> +  /* If there's nothing to choose between the loop bodies, see whether
>> >> >> +     there's a difference in the prologue and epilogue costs.  */
>> >> >> +  diff = compare_outside_loop_cost (other);
>> >> >> +  if (diff != 0)
>> >> >> +    return diff < 0;
>> >> >> +
>> >> >> +  return false;
>> >> >> +}
>> >> >> +
>> >> >> +
>> >> >> +/* See the comment above the declaration for details.  */
>> >> >> +
>> >> >> +bool
>> >> >> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
>> >> >> +                                          loop_vec_info main_loop) const
>> >> >> +{
>> >> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> >> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> >> >> +
>> >> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> >> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> >> >> +
>> >> >> +  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
>> >> >> +  unsigned HOST_WIDE_INT main_vf;
>> >> >> +  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
>> >> >> +  /* If we can determine how many iterations are left for the epilogue
>> >> >> +     loop, that is if both the main loop's vectorization factor and number
>> >> >> +     of iterations are constant, then we use them to calculate the cost of
>> >> >> +     the epilogue loop together with a 'likely value' for the epilogues
>> >> >> +     vectorization factor.  Otherwise we use the main loop's vectorization
>> >> >> +     factor and the maximum poly value for the epilogue's.  If the target
>> >> >> +     has not provided with a sensible upper bound poly vectorization
>> >> >> +     factors are likely to be favored over constant ones.  */
>> >> >> +  if (main_poly_vf.is_constant (&main_vf)
>> >> >> +      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
>> >> >> +    {
>> >> >> +      unsigned HOST_WIDE_INT niters
>> >> >> +       = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
>> >> >> +      HOST_WIDE_INT other_likely_vf
>> >> >> +       = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
>> >> >> +      HOST_WIDE_INT this_likely_vf
>> >> >> +       = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
>> >> >> +
>> >> >> +      /* If the epilogue is using partial vectors we account for the
>> >> >> +        partial iteration here too.  */
>> >> >> +      other_factor = niters / other_likely_vf;
>> >> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
>> >> >> +         && niters % other_likely_vf != 0)
>> >> >> +       other_factor++;
>> >> >> +
>> >> >> +      this_factor = niters / this_likely_vf;
>> >> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
>> >> >> +         && niters % this_likely_vf != 0)
>> >> >> +       this_factor++;
>> >> >> +    }
>> >> >> +  else
>> >> >> +    {
>> >> >> +      unsigned HOST_WIDE_INT main_vf_max
>> >> >> +       = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
>> >> >> +
>> >> >> +      other_factor = main_vf_max / estimated_poly_value (other_vf,
>> >> >> +                                                      POLY_VALUE_MAX);
>> >> >> +      this_factor = main_vf_max / estimated_poly_value (this_vf,
>> >> >> +                                                      POLY_VALUE_MAX);
>> >> >> +
>> >> >> +      /* If the loop is not using partial vectors then it will iterate one
>> >> >> +        time less than one that does.  It is safe to subtract one here,
>> >> >> +        because the main loop's vf is always at least 2x bigger than that
>> >> >> +        of an epilogue.  */
>> >> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
>> >> >> +       other_factor -= 1;
>> >> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
>> >> >> +       this_factor -= 1;
>> >> >> +    }
>> >> >> +
>> >> >> +  /* Compute the costs by multiplying the inside costs with the factor and
>> >> >> +     add the outside costs for a more complete picture.  The factor is the
>> >> >> +     amount of times we are expecting to iterate this epilogue.  */
>> >> >> +  other_cost = other->body_cost () * other_factor;
>> >> >> +  this_cost = this->body_cost () * this_factor;
>> >> >> +  other_cost += other->outside_cost ();
>> >> >> +  this_cost += this->outside_cost ();
>> >> >> +  return this_cost < other_cost;
>> >> >> +}
>> >> >> +
>> >> >> +/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
>> >> >> +   determine the return value of better_main_loop_than_p by comparing the
>> >> >> +   inside (loop body) costs of THIS and OTHER.  Return:
>> >> >> +
>> >> >> +   * -1 if better_main_loop_than_p should return true.
>> >> >> +   * 1 if better_main_loop_than_p should return false.
>> >> >> +   * 0 if we can't decide.  */
>> >> >> +
>> >> >> +int
>> >> >> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
>> >> >> +{
>> >> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
>> >> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
>> >> >> +
>> >> >> +  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
>> >> >> +  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
>> >> >> +
>> >> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
>> >> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
>> >> >> +
>> >> >> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
>> >> >> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
>> >> >> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
>> >> >> +  if (estimated_max_niter != -1)
>> >> >> +    {
>> >> >> +      if (known_le (estimated_max_niter, this_vf))
>> >> >> +       this_vf = estimated_max_niter;
>> >> >> +      if (known_le (estimated_max_niter, other_vf))
>> >> >> +       other_vf = estimated_max_niter;
>> >> >> +    }
>> >> >> +
>> >> >> +  /* Check whether the (fractional) cost per scalar iteration is lower or
>> >> >> +     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
>> >> >> +  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
>> >> >> +  poly_int64 rel_other
>> >> >> +    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
>> >> >> +
>> >> >> +  HOST_WIDE_INT est_rel_this_min
>> >> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
>> >> >> +  HOST_WIDE_INT est_rel_this_max
>> >> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
>> >> >> +
>> >> >> +  HOST_WIDE_INT est_rel_other_min
>> >> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
>> >> >> +  HOST_WIDE_INT est_rel_other_max
>> >> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
>> >> >> +
>> >> >> +  /* Check first if we can make out an unambigous total order from the minimum
>> >> >> +     and maximum estimates.  */
>> >> >> +  if (est_rel_this_min < est_rel_other_min
>> >> >> +      && est_rel_this_max < est_rel_other_max)
>> >> >> +    return -1;
>> >> >> +
>> >> >> +  if (est_rel_other_min < est_rel_this_min
>> >> >> +      && est_rel_other_max < est_rel_this_max)
>> >> >> +    return 1;
>> >> >> +
>> >> >> +  /* When other_loop_vinfo uses a variable vectorization factor,
>> >> >> +     we know that it has a lower cost for at least one runtime VF.
>> >> >> +     However, we don't know how likely that VF is.
>> >> >> +
>> >> >> +     One option would be to compare the costs for the estimated VFs.
>> >> >> +     The problem is that that can put too much pressure on the cost
>> >> >> +     model.  E.g. if the estimated VF is also the lowest possible VF,
>> >> >> +     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
>> >> >> +     for the estimated VF, we'd then choose this_loop_vinfo even
>> >> >> +     though (a) this_loop_vinfo might not actually be better than
>> >> >> +     other_loop_vinfo for that VF and (b) it would be significantly
>> >> >> +     worse at larger VFs.
>> >> >> +
>> >> >> +     Here we go for a hacky compromise: pick this_loop_vinfo if it is
>> >> >> +     no more expensive than other_loop_vinfo even after doubling the
>> >> >> +     estimated other_loop_vinfo VF.  For all but trivial loops, this
>> >> >> +     ensures that we only pick this_loop_vinfo if it is significantly
>> >> >> +     better than other_loop_vinfo at the estimated VF.  */
>> >> >> +  if (est_rel_other_min != est_rel_this_min
>> >> >> +      || est_rel_other_max != est_rel_this_max)
>> >> >> +    {
>> >> >> +      HOST_WIDE_INT est_rel_this_likely
>> >> >> +       = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
>> >> >> +      HOST_WIDE_INT est_rel_other_likely
>> >> >> +       = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
>> >> >> +
>> >> >> +      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
>> >> >> +    }
>> >> >> +
>> >> >> +  return 0;
>> >> >> +}
>> >> >> +
>> >> >> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
>> >> >> +   nothing to choose between the inside (loop body) costs of THIS and OTHER.
>> >> >> +   Check whether we can determine the return value of better_main_loop_than_p
>> >> >> +   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
>> >> >> +   Return:
>> >> >> +
>> >> >> +   * -1 if better_main_loop_than_p should return true.
>> >> >> +   * 1 if better_main_loop_than_p should return false.
>> >> >> +   * 0 if we can't decide.  */
>> >> >> +
>> >> >> +int
>> >> >> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
>> >> >> +{
>> >> >> +  auto this_outside_cost = this->outside_cost ();
>> >> >> +  auto other_outside_cost = other->outside_cost ();
>> >> >> +  if (this_outside_cost != other_outside_cost)
>> >> >> +    return this_outside_cost < other_outside_cost ? -1 : 1;
>> >> >> +
>> >> >> +  return 0;
>> >> >> +}
>> >> >> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> >> >> index 87d3f211a2e..0e3aad590e8 100644
>> >> >> --- a/gcc/tree-vectorizer.h
>> >> >> +++ b/gcc/tree-vectorizer.h
>> >> >> @@ -1419,6 +1419,21 @@ public:
>> >> >>       read back using the functions below.  */
>> >> >>    virtual void finish_cost ();
>> >> >>
>> >> >> +  /* The costs in THIS and OTHER both describe ways of vectorizing
>> >> >> +     a main loop.  Return true if the costs described by THIS are
>> >> >> +     cheaper than the costs described by OTHER.  Return false if any
>> >> >> +     of the following are true:
>> >> >> +
>> >> >> +     - THIS and OTHER are of equal cost
>> >> >> +     - OTHER is better than THIS
>> >> >> +     - we can't be sure about the relative costs of THIS and OTHER.  */
>> >> >> +  virtual bool better_main_loop_than_p (const vector_costs *other) const;
>> >> >> +
>> >> >> +  /* Likewise, but the costs in THIS and OTHER both describe ways of
>> >> >> +     vectorizing an epilogue loop of MAIN_LOOP.  */
>> >> >> +  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
>> >> >> +                                           loop_vec_info main_loop) const;
>> >> >> +
>> >> >>    unsigned int prologue_cost () const;
>> >> >>    unsigned int body_cost () const;
>> >> >>    unsigned int epilogue_cost () const;
>> >> >> @@ -1429,6 +1444,8 @@ protected:
>> >> >>                                  unsigned int);
>> >> >>    unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
>> >> >>                                      unsigned int);
>> >> >> +  int compare_inside_loop_cost (const vector_costs *) const;
>> >> >> +  int compare_outside_loop_cost (const vector_costs *) const;
>> >> >>
>> >> >>    /* The region of code that we're considering vectorizing.  */
>> >> >>    vec_info *m_vinfo;
>> >> >> --
>> >> >> 2.25.1
>> >> >>
  
Richard Biener Nov. 9, 2021, 11:01 a.m. UTC | #7
On Tue, Nov 9, 2021 at 11:47 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Mon, Nov 8, 2021 at 3:46 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > On Mon, Nov 8, 2021 at 2:06 PM Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> >> > On Mon, Nov 8, 2021 at 11:45 AM Richard Sandiford via Gcc-patches
> >> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> >> >>
> >> >> >> One of the things we want to do on AArch64 is compare vector loops
> >> >> >> side-by-side and pick the best one.  For some targets, we want this
> >> >> >> to be based on issue rates as well as the usual latency-based costs
> >> >> >> (at least for loops with relatively high iteration counts).
> >> >> >>
> >> >> >> The current approach to doing this is: when costing vectorisation
> >> >> >> candidate A, try to guess what the other main candidate B will look
> >> >> >> like and adjust A's latency-based cost up or down based on the likely
> >> >> >> difference between A and B's issue rates.  This effectively means
> >> >> >> that we try to cost parts of B at the same time as A, without actually
> >> >> >> being able to see B.
> >> >> >
> >> >> > I think the idea of the current costing is that compares are always
> >> >> > to the original scalar loop (with comparing the resulting magic
> >> >> > cost numbers) so that by means of transitivity comparing two
> >> >> > vector variants work as well.  So I'm not sure where 'variant B'
> >> >> > comes into play here?
> >> >>
> >> >> E.g. A could be SVE and B could be Advanced SIMD, or A could be
> >> >> SVE with fully-packed vectors and B could be SVE with partially
> >> >> unpacked vectors.
> >> >>
> >> >> One motivating case is Neoverse V1, which is a 256-bit SVE target.
> >> >> There, scalar code, Advanced SIMD code and SVE code have different issue
> >> >> characteristics.  SVE vector ops generally have the same latencies as
> >> >> the corresponding Advanced SIMD ops.  Advanced SIMD ops generally
> >> >> have double the throughput of SVE ones, so that the overall bit-for-bit
> >> >> throughputs are roughly equal.  However, there are differences due to
> >> >> predication, load/store handling, and so on, and those differences
> >> >> should be decisive in some cases.
> >> >>
> >> >> For SLP, latency-based costs are fine.  But for loop costs, they hide
> >> >> too many details.  What works best in practice, both for vector vs.
> >> >> vector and vector vs. scalar, is:
> >> >>
> >> >> (1) compare issue rates between two loop bodies (vector vs. vector
> >> >>     or vector vs. scalar)
> >> >> (2) if issue rates are equal for a given number of scalar iterations,
> >> >>     compare the latencies of the loop bodies, as we do now
> >> >> (3) if both the above are equal, compare the latencies of the prologue
> >> >>     and epilogue, as we do now
> >> >>
> >> >> It's very difficult to combine latency and issue rate information
> >> >> into a single per-statement integer, in such a way that the integers
> >> >> can just be summed and compared.
> >> >
> >> > Yeah, so the idea I had when proposing the init_cost/add_stmt/finish_cost
> >> > was that finish_cost would determine the issue rate cost part, for
> >> > example if the uarch can issue 2 ops with ISA A and 4 ops with ISA B
> >> > then finish_cost would compute the number of cycles required to
> >> > issue all vector stmts.  That yields sth like IPC the latency cost could
> >> > be divided by.  Doing that for both ISA A and ISA B would produce
> >> > weighted latency values that can be compared?  Alternatively
> >> > you can of course simulate issue with the actual instruction latencies
> >> > in mind and produce an overall iteration latency number.
> >> >
> >> > Comparing just latency or issue rate in isolation is likely not good
> >> > enough?
> >>
> >> In practice, comparing issue rates in isolation does seem to be what we
> >> want as the first level of comparison.  I don't think total latency /
> >> issue rate is really a meaningful combination.  It makes sense to the
> >> extent that “lower latency == good” and “higher issue rate == good”,
> >> but I don't think it's the case that halving the total latency is
> >> equally valuable as doubling the issue rate.  In practice the latter is
> >> a significant win while the former might or might not be.
> >>
> >> total latency / issue rate also runs the risk of double-counting:
> >> reducing the number of ops decreases the total latency *and* increases
> >> the issue rate, which could have a quadratic effect.  (This is why we
> >> don't decrease SVE costs when we think that the SVE code will issue
> >> more quickly than the Advanced SIMD code.)
> >>
> >> For a long-running loop, the only latencies that really matter are
> >> for loop-carried dependencies.  The issue rate information takes
> >> those into account, in that it tracks reduction and (with later
> >> patches) induction limiters.
> >
> > I see.  I'm somewhat worried that we basically move all intelligence
> > to the target hooks this way, requiring each target to duplicate code.
>
> Yeah, I agree that's a concern.
>
> > In fact one item on my list is to try tackling instruction issue on
> > the vectorizer side where we have an idea on the data dependencies
> > where the target then would continue to mostly work on the
> > instruction level, returning "I can issue at most 4 of these, each
> > with a latency of 2",
> > plus globally specifying "I can issue 6 ops per clock".
>
> This would be too simplistic for what we need on AArch64.  The broad
> summary for Neoverse V1 would be “I can issue two SVE ops per cycle”
> and “I can issue four Advanced SIMD ops per cycle”, but that hides
> a lot of the details that matter.
>
> Currently we model 4 issue rates:
>
> (1) number of stores per cycle
> (2) number of loads and stores per cycle
> (3) number of “general” operations per cycle
> (4) number of predicate operations per cycle
>
> For Neoverse V1, some operations count against multiple issue rates.
> E.g. a scalar store counts against (1) and (2).  A vector store counts
> against (1), (2) and (3).
>
> In some cases we also take renaming into account, which could perhaps
> be seen as a (5).  I guess we would need more issue rates to handle
> asymetrical ALUs properly (a bit like (1) and (2) above).
>
> So I guess the target-independent way would need to maintain a
> target-configurable tuple of issue rates, with each statement
> yielding a tuple of ops to count against those rates.

Yeah, I agree it's complicated.  I do hope the job of tracking dependent
ops will become easier when we can just shove an SLP tree to the
target for this - at the moment for non-SLP the target will have to
re-construct dependencies from the stmt_info SSA ops in some way,
that's clearly sub-optimal (esp. considering patterns).

We need this kind of rough analysis in other loop optimization contexts
as well - for example loop distribution would mainly look at (1) and (2),
and a possible unroll pass would need to do similar analysis.  For
unrolling we'd also need to model instruction fetch somehow since
you cannot issue what you didn't fetch (but fetch is difficult due to
the lack of a schedule on GIMPLE).  Since it's a problem that solving
might help more generally I was hoping at some high-level modeling
to be good enough (eventually the high-level could extract most info
from the DFA even...).

> That's probably worth doing, but even then, I think there would
> still be cases in which the target would need to force the choice
> based on target-specific properties, even if it's just as a tie
> breaker when the target-independent checks come back equal.
>
> > So when we rework the costing side we'll have to deal with the target
> > doing magic that's not per insn and somehow preserve it - that might
> > prove mightly complicated ...
> >
> > That said, if you are prepared to deal with this on the arm side
> > when such reorg ever happens then I'm fine with the two patches.
>
> Thanks.  And yeah, it would be up to us to keep the target code updated
> if the target-independent costing improves.
>
> Although it might not look like it yet, the full patch series simplifies
> the current target code rather than making it more complicated. :-)
> The patches also make things more direct and so hopefully easier
> to maintain.
>
> If target-independent code starts asking for issue rates (in the tuple
> form sketched out above) then I think we're in a good position to
> provide it.
>
> Thanks,
> Richard
>
> >
> > Thanks,
> > Richard.
> >
> >> Thanks,
> >> Richard
> >>
> >> >
> >> >> So returning to the above, when costing an SVE loop A, we currently
> >> >> collect on-the-side issue information about both A and the likely
> >> >> Advanced SIMD implementation B.  If B would have a higher throughput
> >> >> than A then we multiply A's latency-based costs by:
> >> >>
> >> >>    ceil(B throughput/A throughput)
> >> >>
> >> >> The problem is, we don't actually know whether B exists or what it
> >> >> would look like (given that Advanced SIMD and SVE have different
> >> >> features).
> >> >>
> >> >> In principle, we should do the same in reverse, but since SVE needs
> >> >> fewer ops than Advanced SIMD, the latencies are usually smaller already.
> >> >>
> >> >> We also do something similar for the scalar code.  When costing both
> >> >> Advanced SIMD and SVE, we try to estimate the issue rate of the
> >> >> original scalar code.  If the scalar code could issue more quickly
> >> >> than the equivalent vector code, we increase the latency-based cost
> >> >> of the vector code to account for that.
> >> >>
> >> >> The idea of the patch (and others) is that we can do the (1), (2),
> >> >> (3) comparison above directly rather than indirectly.  We can also
> >> >> use the issue information calculated while costing the scalar code,
> >> >> rather than having to estimate the scalar issue information while
> >> >> costing the vector code.
> >> >>
> >> >> >> This is needlessly indirect and complex.  It was a compromise due
> >> >> >> to the code being added (too) late in the GCC 11 cycle, so that
> >> >> >> target-independent changes weren't possible.
> >> >> >>
> >> >> >> The target-independent code already compares two candidate loop_vec_infos
> >> >> >> side-by-side, so that information about A and B above are available
> >> >> >> directly.  This patch creates a way for targets to hook into this
> >> >> >> comparison.
> >> >> >
> >> >> > You mean it has both loop_vec_infos.  But as said I'm not sure it's a good
> >> >> > idea to compare those side-by-side - will that not lead to _more_ special-casing
> >> >> > since you need to have a working compare to the scalar variant as well
> >> >> > to determine the vectorization threshold?
> >> >>
> >> >> A later patch allows the code to do the same comparison for vector
> >> >> vs. scalar.  It makes the target code significantly simpler compared
> >> >> to now, since add_stmt_cost only needs to consider the latency and
> >> >> issue rate of the code it's actually supposed to be costing, rather than
> >> >> trying also to estimate the issue rate of the scalar code and the issue
> >> >> rate of the Advanced SIMD code.
> >> >>
> >> >> Thanks,
> >> >> Richard
> >> >>
> >> >> >
> >> >> >>
> >> >> >> The AArch64 code can therefore hook into better_main_loop_than_p to
> >> >> >> compare issue rates.  If the issue rate comparison isn't decisive,
> >> >> >> the code can fall back to the normal latency-based comparison instead.
> >> >> >>
> >> >> >> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?
> >> >> >>
> >> >> >> Richard
> >> >> >>
> >> >> >>
> >> >> >> gcc/
> >> >> >>         * tree-vectorizer.h (vector_costs::better_main_loop_than_p)
> >> >> >>         (vector_costs::better_epilogue_loop_than_p)
> >> >> >>         (vector_costs::compare_inside_loop_cost)
> >> >> >>         (vector_costs::compare_outside_loop_cost): Likewise.
> >> >> >>         * tree-vectorizer.c (vector_costs::better_main_loop_than_p)
> >> >> >>         (vector_costs::better_epilogue_loop_than_p)
> >> >> >>         (vector_costs::compare_inside_loop_cost)
> >> >> >>         (vector_costs::compare_outside_loop_cost): New functions,
> >> >> >>         containing code moved from...
> >> >> >>         * tree-vect-loop.c (vect_better_loop_vinfo_p): ...here.
> >> >> >> ---
> >> >> >>  gcc/tree-vect-loop.c  | 142 ++---------------------------
> >> >> >>  gcc/tree-vectorizer.c | 204 ++++++++++++++++++++++++++++++++++++++++++
> >> >> >>  gcc/tree-vectorizer.h |  17 ++++
> >> >> >>  3 files changed, 226 insertions(+), 137 deletions(-)
> >> >> >>
> >> >> >> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> >> >> >> index dd4a363fee5..c9ee2e15e35 100644
> >> >> >> --- a/gcc/tree-vect-loop.c
> >> >> >> +++ b/gcc/tree-vect-loop.c
> >> >> >> @@ -2784,144 +2784,12 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
> >> >> >>         return new_simdlen_p;
> >> >> >>      }
> >> >> >>
> >> >> >> -  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
> >> >> >> -  if (main_loop)
> >> >> >> -    {
> >> >> >> -      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
> >> >> >> -      unsigned HOST_WIDE_INT main_vf;
> >> >> >> -      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
> >> >> >> -      /* If we can determine how many iterations are left for the epilogue
> >> >> >> -        loop, that is if both the main loop's vectorization factor and number
> >> >> >> -        of iterations are constant, then we use them to calculate the cost of
> >> >> >> -        the epilogue loop together with a 'likely value' for the epilogues
> >> >> >> -        vectorization factor.  Otherwise we use the main loop's vectorization
> >> >> >> -        factor and the maximum poly value for the epilogue's.  If the target
> >> >> >> -        has not provided with a sensible upper bound poly vectorization
> >> >> >> -        factors are likely to be favored over constant ones.  */
> >> >> >> -      if (main_poly_vf.is_constant (&main_vf)
> >> >> >> -         && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
> >> >> >> -       {
> >> >> >> -         unsigned HOST_WIDE_INT niters
> >> >> >> -           = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> >> >> >> -         HOST_WIDE_INT old_likely_vf
> >> >> >> -           = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
> >> >> >> -         HOST_WIDE_INT new_likely_vf
> >> >> >> -           = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
> >> >> >> -
> >> >> >> -         /* If the epilogue is using partial vectors we account for the
> >> >> >> -            partial iteration here too.  */
> >> >> >> -         old_factor = niters / old_likely_vf;
> >> >> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
> >> >> >> -             && niters % old_likely_vf != 0)
> >> >> >> -           old_factor++;
> >> >> >> -
> >> >> >> -         new_factor = niters / new_likely_vf;
> >> >> >> -         if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
> >> >> >> -             && niters % new_likely_vf != 0)
> >> >> >> -           new_factor++;
> >> >> >> -       }
> >> >> >> -      else
> >> >> >> -       {
> >> >> >> -         unsigned HOST_WIDE_INT main_vf_max
> >> >> >> -           = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> >> >> >> -
> >> >> >> -         old_factor = main_vf_max / estimated_poly_value (old_vf,
> >> >> >> -                                                          POLY_VALUE_MAX);
> >> >> >> -         new_factor = main_vf_max / estimated_poly_value (new_vf,
> >> >> >> -                                                          POLY_VALUE_MAX);
> >> >> >> -
> >> >> >> -         /* If the loop is not using partial vectors then it will iterate one
> >> >> >> -            time less than one that does.  It is safe to subtract one here,
> >> >> >> -            because the main loop's vf is always at least 2x bigger than that
> >> >> >> -            of an epilogue.  */
> >> >> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
> >> >> >> -           old_factor -= 1;
> >> >> >> -         if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
> >> >> >> -           new_factor -= 1;
> >> >> >> -       }
> >> >> >> -
> >> >> >> -      /* Compute the costs by multiplying the inside costs with the factor and
> >> >> >> -        add the outside costs for a more complete picture.  The factor is the
> >> >> >> -        amount of times we are expecting to iterate this epilogue.  */
> >> >> >> -      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
> >> >> >> -      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
> >> >> >> -      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
> >> >> >> -      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
> >> >> >> -      return new_cost < old_cost;
> >> >> >> -    }
> >> >> >> -
> >> >> >> -  /* Limit the VFs to what is likely to be the maximum number of iterations,
> >> >> >> -     to handle cases in which at least one loop_vinfo is fully-masked.  */
> >> >> >> -  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> >> >> >> -  if (estimated_max_niter != -1)
> >> >> >> -    {
> >> >> >> -      if (known_le (estimated_max_niter, new_vf))
> >> >> >> -       new_vf = estimated_max_niter;
> >> >> >> -      if (known_le (estimated_max_niter, old_vf))
> >> >> >> -       old_vf = estimated_max_niter;
> >> >> >> -    }
> >> >> >> -
> >> >> >> -  /* Check whether the (fractional) cost per scalar iteration is lower
> >> >> >> -     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
> >> >> >> -  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
> >> >> >> -  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
> >> >> >> -
> >> >> >> -  HOST_WIDE_INT est_rel_new_min
> >> >> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
> >> >> >> -  HOST_WIDE_INT est_rel_new_max
> >> >> >> -    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
> >> >> >> -
> >> >> >> -  HOST_WIDE_INT est_rel_old_min
> >> >> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
> >> >> >> -  HOST_WIDE_INT est_rel_old_max
> >> >> >> -    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
> >> >> >> -
> >> >> >> -  /* Check first if we can make out an unambigous total order from the minimum
> >> >> >> -     and maximum estimates.  */
> >> >> >> -  if (est_rel_new_min < est_rel_old_min
> >> >> >> -      && est_rel_new_max < est_rel_old_max)
> >> >> >> -    return true;
> >> >> >> -  else if (est_rel_old_min < est_rel_new_min
> >> >> >> -          && est_rel_old_max < est_rel_new_max)
> >> >> >> -    return false;
> >> >> >> -  /* When old_loop_vinfo uses a variable vectorization factor,
> >> >> >> -     we know that it has a lower cost for at least one runtime VF.
> >> >> >> -     However, we don't know how likely that VF is.
> >> >> >> -
> >> >> >> -     One option would be to compare the costs for the estimated VFs.
> >> >> >> -     The problem is that that can put too much pressure on the cost
> >> >> >> -     model.  E.g. if the estimated VF is also the lowest possible VF,
> >> >> >> -     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
> >> >> >> -     for the estimated VF, we'd then choose new_loop_vinfo even
> >> >> >> -     though (a) new_loop_vinfo might not actually be better than
> >> >> >> -     old_loop_vinfo for that VF and (b) it would be significantly
> >> >> >> -     worse at larger VFs.
> >> >> >> -
> >> >> >> -     Here we go for a hacky compromise: pick new_loop_vinfo if it is
> >> >> >> -     no more expensive than old_loop_vinfo even after doubling the
> >> >> >> -     estimated old_loop_vinfo VF.  For all but trivial loops, this
> >> >> >> -     ensures that we only pick new_loop_vinfo if it is significantly
> >> >> >> -     better than old_loop_vinfo at the estimated VF.  */
> >> >> >> -
> >> >> >> -  if (est_rel_old_min != est_rel_new_min
> >> >> >> -      || est_rel_old_max != est_rel_new_max)
> >> >> >> -    {
> >> >> >> -      HOST_WIDE_INT est_rel_new_likely
> >> >> >> -       = estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
> >> >> >> -      HOST_WIDE_INT est_rel_old_likely
> >> >> >> -       = estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
> >> >> >> -
> >> >> >> -      return est_rel_new_likely * 2 <= est_rel_old_likely;
> >> >> >> -    }
> >> >> >> -
> >> >> >> -  /* If there's nothing to choose between the loop bodies, see whether
> >> >> >> -     there's a difference in the prologue and epilogue costs.  */
> >> >> >> -  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
> >> >> >> -  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
> >> >> >> -  if (new_outside_cost != old_outside_cost)
> >> >> >> -    return new_outside_cost < old_outside_cost;
> >> >> >> +  const auto *old_costs = old_loop_vinfo->vector_costs;
> >> >> >> +  const auto *new_costs = new_loop_vinfo->vector_costs;
> >> >> >> +  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
> >> >> >> +    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
> >> >> >>
> >> >> >> -  return false;
> >> >> >> +  return new_costs->better_main_loop_than_p (old_costs);
> >> >> >>  }
> >> >> >>
> >> >> >>  /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
> >> >> >> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> >> >> >> index 9ef76ce654b..dcbb2a3f13a 100644
> >> >> >> --- a/gcc/tree-vectorizer.c
> >> >> >> +++ b/gcc/tree-vectorizer.c
> >> >> >> @@ -1744,3 +1744,207 @@ vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
> >> >> >>      }
> >> >> >>    return cost;
> >> >> >>  }
> >> >> >> +
> >> >> >> +/* See the comment above the declaration for details.  */
> >> >> >> +
> >> >> >> +bool
> >> >> >> +vector_costs::better_main_loop_than_p (const vector_costs *other) const
> >> >> >> +{
> >> >> >> +  int diff = compare_inside_loop_cost (other);
> >> >> >> +  if (diff != 0)
> >> >> >> +    return diff < 0;
> >> >> >> +
> >> >> >> +  /* If there's nothing to choose between the loop bodies, see whether
> >> >> >> +     there's a difference in the prologue and epilogue costs.  */
> >> >> >> +  diff = compare_outside_loop_cost (other);
> >> >> >> +  if (diff != 0)
> >> >> >> +    return diff < 0;
> >> >> >> +
> >> >> >> +  return false;
> >> >> >> +}
> >> >> >> +
> >> >> >> +
> >> >> >> +/* See the comment above the declaration for details.  */
> >> >> >> +
> >> >> >> +bool
> >> >> >> +vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
> >> >> >> +                                          loop_vec_info main_loop) const
> >> >> >> +{
> >> >> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
> >> >> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
> >> >> >> +
> >> >> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
> >> >> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
> >> >> >> +
> >> >> >> +  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
> >> >> >> +  unsigned HOST_WIDE_INT main_vf;
> >> >> >> +  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
> >> >> >> +  /* If we can determine how many iterations are left for the epilogue
> >> >> >> +     loop, that is if both the main loop's vectorization factor and number
> >> >> >> +     of iterations are constant, then we use them to calculate the cost of
> >> >> >> +     the epilogue loop together with a 'likely value' for the epilogues
> >> >> >> +     vectorization factor.  Otherwise we use the main loop's vectorization
> >> >> >> +     factor and the maximum poly value for the epilogue's.  If the target
> >> >> >> +     has not provided with a sensible upper bound poly vectorization
> >> >> >> +     factors are likely to be favored over constant ones.  */
> >> >> >> +  if (main_poly_vf.is_constant (&main_vf)
> >> >> >> +      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
> >> >> >> +    {
> >> >> >> +      unsigned HOST_WIDE_INT niters
> >> >> >> +       = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
> >> >> >> +      HOST_WIDE_INT other_likely_vf
> >> >> >> +       = estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
> >> >> >> +      HOST_WIDE_INT this_likely_vf
> >> >> >> +       = estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
> >> >> >> +
> >> >> >> +      /* If the epilogue is using partial vectors we account for the
> >> >> >> +        partial iteration here too.  */
> >> >> >> +      other_factor = niters / other_likely_vf;
> >> >> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
> >> >> >> +         && niters % other_likely_vf != 0)
> >> >> >> +       other_factor++;
> >> >> >> +
> >> >> >> +      this_factor = niters / this_likely_vf;
> >> >> >> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
> >> >> >> +         && niters % this_likely_vf != 0)
> >> >> >> +       this_factor++;
> >> >> >> +    }
> >> >> >> +  else
> >> >> >> +    {
> >> >> >> +      unsigned HOST_WIDE_INT main_vf_max
> >> >> >> +       = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
> >> >> >> +
> >> >> >> +      other_factor = main_vf_max / estimated_poly_value (other_vf,
> >> >> >> +                                                      POLY_VALUE_MAX);
> >> >> >> +      this_factor = main_vf_max / estimated_poly_value (this_vf,
> >> >> >> +                                                      POLY_VALUE_MAX);
> >> >> >> +
> >> >> >> +      /* If the loop is not using partial vectors then it will iterate one
> >> >> >> +        time less than one that does.  It is safe to subtract one here,
> >> >> >> +        because the main loop's vf is always at least 2x bigger than that
> >> >> >> +        of an epilogue.  */
> >> >> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
> >> >> >> +       other_factor -= 1;
> >> >> >> +      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
> >> >> >> +       this_factor -= 1;
> >> >> >> +    }
> >> >> >> +
> >> >> >> +  /* Compute the costs by multiplying the inside costs with the factor and
> >> >> >> +     add the outside costs for a more complete picture.  The factor is the
> >> >> >> +     amount of times we are expecting to iterate this epilogue.  */
> >> >> >> +  other_cost = other->body_cost () * other_factor;
> >> >> >> +  this_cost = this->body_cost () * this_factor;
> >> >> >> +  other_cost += other->outside_cost ();
> >> >> >> +  this_cost += this->outside_cost ();
> >> >> >> +  return this_cost < other_cost;
> >> >> >> +}
> >> >> >> +
> >> >> >> +/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
> >> >> >> +   determine the return value of better_main_loop_than_p by comparing the
> >> >> >> +   inside (loop body) costs of THIS and OTHER.  Return:
> >> >> >> +
> >> >> >> +   * -1 if better_main_loop_than_p should return true.
> >> >> >> +   * 1 if better_main_loop_than_p should return false.
> >> >> >> +   * 0 if we can't decide.  */
> >> >> >> +
> >> >> >> +int
> >> >> >> +vector_costs::compare_inside_loop_cost (const vector_costs *other) const
> >> >> >> +{
> >> >> >> +  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
> >> >> >> +  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
> >> >> >> +
> >> >> >> +  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
> >> >> >> +  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
> >> >> >> +
> >> >> >> +  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
> >> >> >> +  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
> >> >> >> +
> >> >> >> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
> >> >> >> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
> >> >> >> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> >> >> >> +  if (estimated_max_niter != -1)
> >> >> >> +    {
> >> >> >> +      if (known_le (estimated_max_niter, this_vf))
> >> >> >> +       this_vf = estimated_max_niter;
> >> >> >> +      if (known_le (estimated_max_niter, other_vf))
> >> >> >> +       other_vf = estimated_max_niter;
> >> >> >> +    }
> >> >> >> +
> >> >> >> +  /* Check whether the (fractional) cost per scalar iteration is lower or
> >> >> >> +     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
> >> >> >> +  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
> >> >> >> +  poly_int64 rel_other
> >> >> >> +    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
> >> >> >> +
> >> >> >> +  HOST_WIDE_INT est_rel_this_min
> >> >> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
> >> >> >> +  HOST_WIDE_INT est_rel_this_max
> >> >> >> +    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
> >> >> >> +
> >> >> >> +  HOST_WIDE_INT est_rel_other_min
> >> >> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
> >> >> >> +  HOST_WIDE_INT est_rel_other_max
> >> >> >> +    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
> >> >> >> +
> >> >> >> +  /* Check first if we can make out an unambigous total order from the minimum
> >> >> >> +     and maximum estimates.  */
> >> >> >> +  if (est_rel_this_min < est_rel_other_min
> >> >> >> +      && est_rel_this_max < est_rel_other_max)
> >> >> >> +    return -1;
> >> >> >> +
> >> >> >> +  if (est_rel_other_min < est_rel_this_min
> >> >> >> +      && est_rel_other_max < est_rel_this_max)
> >> >> >> +    return 1;
> >> >> >> +
> >> >> >> +  /* When other_loop_vinfo uses a variable vectorization factor,
> >> >> >> +     we know that it has a lower cost for at least one runtime VF.
> >> >> >> +     However, we don't know how likely that VF is.
> >> >> >> +
> >> >> >> +     One option would be to compare the costs for the estimated VFs.
> >> >> >> +     The problem is that that can put too much pressure on the cost
> >> >> >> +     model.  E.g. if the estimated VF is also the lowest possible VF,
> >> >> >> +     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
> >> >> >> +     for the estimated VF, we'd then choose this_loop_vinfo even
> >> >> >> +     though (a) this_loop_vinfo might not actually be better than
> >> >> >> +     other_loop_vinfo for that VF and (b) it would be significantly
> >> >> >> +     worse at larger VFs.
> >> >> >> +
> >> >> >> +     Here we go for a hacky compromise: pick this_loop_vinfo if it is
> >> >> >> +     no more expensive than other_loop_vinfo even after doubling the
> >> >> >> +     estimated other_loop_vinfo VF.  For all but trivial loops, this
> >> >> >> +     ensures that we only pick this_loop_vinfo if it is significantly
> >> >> >> +     better than other_loop_vinfo at the estimated VF.  */
> >> >> >> +  if (est_rel_other_min != est_rel_this_min
> >> >> >> +      || est_rel_other_max != est_rel_this_max)
> >> >> >> +    {
> >> >> >> +      HOST_WIDE_INT est_rel_this_likely
> >> >> >> +       = estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
> >> >> >> +      HOST_WIDE_INT est_rel_other_likely
> >> >> >> +       = estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
> >> >> >> +
> >> >> >> +      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
> >> >> >> +    }
> >> >> >> +
> >> >> >> +  return 0;
> >> >> >> +}
> >> >> >> +
> >> >> >> +/* A <=>-style subroutine of better_main_loop_than_p, used when there is
> >> >> >> +   nothing to choose between the inside (loop body) costs of THIS and OTHER.
> >> >> >> +   Check whether we can determine the return value of better_main_loop_than_p
> >> >> >> +   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
> >> >> >> +   Return:
> >> >> >> +
> >> >> >> +   * -1 if better_main_loop_than_p should return true.
> >> >> >> +   * 1 if better_main_loop_than_p should return false.
> >> >> >> +   * 0 if we can't decide.  */
> >> >> >> +
> >> >> >> +int
> >> >> >> +vector_costs::compare_outside_loop_cost (const vector_costs *other) const
> >> >> >> +{
> >> >> >> +  auto this_outside_cost = this->outside_cost ();
> >> >> >> +  auto other_outside_cost = other->outside_cost ();
> >> >> >> +  if (this_outside_cost != other_outside_cost)
> >> >> >> +    return this_outside_cost < other_outside_cost ? -1 : 1;
> >> >> >> +
> >> >> >> +  return 0;
> >> >> >> +}
> >> >> >> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> >> >> >> index 87d3f211a2e..0e3aad590e8 100644
> >> >> >> --- a/gcc/tree-vectorizer.h
> >> >> >> +++ b/gcc/tree-vectorizer.h
> >> >> >> @@ -1419,6 +1419,21 @@ public:
> >> >> >>       read back using the functions below.  */
> >> >> >>    virtual void finish_cost ();
> >> >> >>
> >> >> >> +  /* The costs in THIS and OTHER both describe ways of vectorizing
> >> >> >> +     a main loop.  Return true if the costs described by THIS are
> >> >> >> +     cheaper than the costs described by OTHER.  Return false if any
> >> >> >> +     of the following are true:
> >> >> >> +
> >> >> >> +     - THIS and OTHER are of equal cost
> >> >> >> +     - OTHER is better than THIS
> >> >> >> +     - we can't be sure about the relative costs of THIS and OTHER.  */
> >> >> >> +  virtual bool better_main_loop_than_p (const vector_costs *other) const;
> >> >> >> +
> >> >> >> +  /* Likewise, but the costs in THIS and OTHER both describe ways of
> >> >> >> +     vectorizing an epilogue loop of MAIN_LOOP.  */
> >> >> >> +  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
> >> >> >> +                                           loop_vec_info main_loop) const;
> >> >> >> +
> >> >> >>    unsigned int prologue_cost () const;
> >> >> >>    unsigned int body_cost () const;
> >> >> >>    unsigned int epilogue_cost () const;
> >> >> >> @@ -1429,6 +1444,8 @@ protected:
> >> >> >>                                  unsigned int);
> >> >> >>    unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
> >> >> >>                                      unsigned int);
> >> >> >> +  int compare_inside_loop_cost (const vector_costs *) const;
> >> >> >> +  int compare_outside_loop_cost (const vector_costs *) const;
> >> >> >>
> >> >> >>    /* The region of code that we're considering vectorizing.  */
> >> >> >>    vec_info *m_vinfo;
> >> >> >> --
> >> >> >> 2.25.1
> >> >> >>
  

Patch

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index dd4a363fee5..c9ee2e15e35 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2784,144 +2784,12 @@  vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
 	return new_simdlen_p;
     }
 
-  loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo);
-  if (main_loop)
-    {
-      poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
-      unsigned HOST_WIDE_INT main_vf;
-      unsigned HOST_WIDE_INT old_factor, new_factor, old_cost, new_cost;
-      /* If we can determine how many iterations are left for the epilogue
-	 loop, that is if both the main loop's vectorization factor and number
-	 of iterations are constant, then we use them to calculate the cost of
-	 the epilogue loop together with a 'likely value' for the epilogues
-	 vectorization factor.  Otherwise we use the main loop's vectorization
-	 factor and the maximum poly value for the epilogue's.  If the target
-	 has not provided with a sensible upper bound poly vectorization
-	 factors are likely to be favored over constant ones.  */
-      if (main_poly_vf.is_constant (&main_vf)
-	  && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
-	{
-	  unsigned HOST_WIDE_INT niters
-	    = LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
-	  HOST_WIDE_INT old_likely_vf
-	    = estimated_poly_value (old_vf, POLY_VALUE_LIKELY);
-	  HOST_WIDE_INT new_likely_vf
-	    = estimated_poly_value (new_vf, POLY_VALUE_LIKELY);
-
-	  /* If the epilogue is using partial vectors we account for the
-	     partial iteration here too.  */
-	  old_factor = niters / old_likely_vf;
-	  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo)
-	      && niters % old_likely_vf != 0)
-	    old_factor++;
-
-	  new_factor = niters / new_likely_vf;
-	  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo)
-	      && niters % new_likely_vf != 0)
-	    new_factor++;
-	}
-      else
-	{
-	  unsigned HOST_WIDE_INT main_vf_max
-	    = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
-
-	  old_factor = main_vf_max / estimated_poly_value (old_vf,
-							   POLY_VALUE_MAX);
-	  new_factor = main_vf_max / estimated_poly_value (new_vf,
-							   POLY_VALUE_MAX);
-
-	  /* If the loop is not using partial vectors then it will iterate one
-	     time less than one that does.  It is safe to subtract one here,
-	     because the main loop's vf is always at least 2x bigger than that
-	     of an epilogue.  */
-	  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (old_loop_vinfo))
-	    old_factor -= 1;
-	  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (new_loop_vinfo))
-	    new_factor -= 1;
-	}
-
-      /* Compute the costs by multiplying the inside costs with the factor and
-	 add the outside costs for a more complete picture.  The factor is the
-	 amount of times we are expecting to iterate this epilogue.  */
-      old_cost = old_loop_vinfo->vector_costs->body_cost () * old_factor;
-      new_cost = new_loop_vinfo->vector_costs->body_cost () * new_factor;
-      old_cost += old_loop_vinfo->vector_costs->outside_cost ();
-      new_cost += new_loop_vinfo->vector_costs->outside_cost ();
-      return new_cost < old_cost;
-    }
-
-  /* Limit the VFs to what is likely to be the maximum number of iterations,
-     to handle cases in which at least one loop_vinfo is fully-masked.  */
-  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
-  if (estimated_max_niter != -1)
-    {
-      if (known_le (estimated_max_niter, new_vf))
-	new_vf = estimated_max_niter;
-      if (known_le (estimated_max_niter, old_vf))
-	old_vf = estimated_max_niter;
-    }
-
-  /* Check whether the (fractional) cost per scalar iteration is lower
-     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
-  poly_int64 rel_new = new_loop_vinfo->vector_costs->body_cost () * old_vf;
-  poly_int64 rel_old = old_loop_vinfo->vector_costs->body_cost () * new_vf;
-
-  HOST_WIDE_INT est_rel_new_min
-    = estimated_poly_value (rel_new, POLY_VALUE_MIN);
-  HOST_WIDE_INT est_rel_new_max
-    = estimated_poly_value (rel_new, POLY_VALUE_MAX);
-
-  HOST_WIDE_INT est_rel_old_min
-    = estimated_poly_value (rel_old, POLY_VALUE_MIN);
-  HOST_WIDE_INT est_rel_old_max
-    = estimated_poly_value (rel_old, POLY_VALUE_MAX);
-
-  /* Check first if we can make out an unambigous total order from the minimum
-     and maximum estimates.  */
-  if (est_rel_new_min < est_rel_old_min
-      && est_rel_new_max < est_rel_old_max)
-    return true;
-  else if (est_rel_old_min < est_rel_new_min
-	   && est_rel_old_max < est_rel_new_max)
-    return false;
-  /* When old_loop_vinfo uses a variable vectorization factor,
-     we know that it has a lower cost for at least one runtime VF.
-     However, we don't know how likely that VF is.
-
-     One option would be to compare the costs for the estimated VFs.
-     The problem is that that can put too much pressure on the cost
-     model.  E.g. if the estimated VF is also the lowest possible VF,
-     and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
-     for the estimated VF, we'd then choose new_loop_vinfo even
-     though (a) new_loop_vinfo might not actually be better than
-     old_loop_vinfo for that VF and (b) it would be significantly
-     worse at larger VFs.
-
-     Here we go for a hacky compromise: pick new_loop_vinfo if it is
-     no more expensive than old_loop_vinfo even after doubling the
-     estimated old_loop_vinfo VF.  For all but trivial loops, this
-     ensures that we only pick new_loop_vinfo if it is significantly
-     better than old_loop_vinfo at the estimated VF.  */
-
-  if (est_rel_old_min != est_rel_new_min
-      || est_rel_old_max != est_rel_new_max)
-    {
-      HOST_WIDE_INT est_rel_new_likely
-	= estimated_poly_value (rel_new, POLY_VALUE_LIKELY);
-      HOST_WIDE_INT est_rel_old_likely
-	= estimated_poly_value (rel_old, POLY_VALUE_LIKELY);
-
-      return est_rel_new_likely * 2 <= est_rel_old_likely;
-    }
-
-  /* If there's nothing to choose between the loop bodies, see whether
-     there's a difference in the prologue and epilogue costs.  */
-  auto old_outside_cost = old_loop_vinfo->vector_costs->outside_cost ();
-  auto new_outside_cost = new_loop_vinfo->vector_costs->outside_cost ();
-  if (new_outside_cost != old_outside_cost)
-    return new_outside_cost < old_outside_cost;
+  const auto *old_costs = old_loop_vinfo->vector_costs;
+  const auto *new_costs = new_loop_vinfo->vector_costs;
+  if (loop_vec_info main_loop = LOOP_VINFO_ORIG_LOOP_INFO (old_loop_vinfo))
+    return new_costs->better_epilogue_loop_than_p (old_costs, main_loop);
 
-  return false;
+  return new_costs->better_main_loop_than_p (old_costs);
 }
 
 /* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 9ef76ce654b..dcbb2a3f13a 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -1744,3 +1744,207 @@  vector_costs::adjust_cost_for_freq (stmt_vec_info stmt_info,
     }
   return cost;
 }
+
+/* See the comment above the declaration for details.  */
+
+bool
+vector_costs::better_main_loop_than_p (const vector_costs *other) const
+{
+  int diff = compare_inside_loop_cost (other);
+  if (diff != 0)
+    return diff < 0;
+
+  /* If there's nothing to choose between the loop bodies, see whether
+     there's a difference in the prologue and epilogue costs.  */
+  diff = compare_outside_loop_cost (other);
+  if (diff != 0)
+    return diff < 0;
+
+  return false;
+}
+
+
+/* See the comment above the declaration for details.  */
+
+bool
+vector_costs::better_epilogue_loop_than_p (const vector_costs *other,
+					   loop_vec_info main_loop) const
+{
+  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
+  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
+
+  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
+  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
+
+  poly_uint64 main_poly_vf = LOOP_VINFO_VECT_FACTOR (main_loop);
+  unsigned HOST_WIDE_INT main_vf;
+  unsigned HOST_WIDE_INT other_factor, this_factor, other_cost, this_cost;
+  /* If we can determine how many iterations are left for the epilogue
+     loop, that is if both the main loop's vectorization factor and number
+     of iterations are constant, then we use them to calculate the cost of
+     the epilogue loop together with a 'likely value' for the epilogues
+     vectorization factor.  Otherwise we use the main loop's vectorization
+     factor and the maximum poly value for the epilogue's.  If the target
+     has not provided with a sensible upper bound poly vectorization
+     factors are likely to be favored over constant ones.  */
+  if (main_poly_vf.is_constant (&main_vf)
+      && LOOP_VINFO_NITERS_KNOWN_P (main_loop))
+    {
+      unsigned HOST_WIDE_INT niters
+	= LOOP_VINFO_INT_NITERS (main_loop) % main_vf;
+      HOST_WIDE_INT other_likely_vf
+	= estimated_poly_value (other_vf, POLY_VALUE_LIKELY);
+      HOST_WIDE_INT this_likely_vf
+	= estimated_poly_value (this_vf, POLY_VALUE_LIKELY);
+
+      /* If the epilogue is using partial vectors we account for the
+	 partial iteration here too.  */
+      other_factor = niters / other_likely_vf;
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo)
+	  && niters % other_likely_vf != 0)
+	other_factor++;
+
+      this_factor = niters / this_likely_vf;
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo)
+	  && niters % this_likely_vf != 0)
+	this_factor++;
+    }
+  else
+    {
+      unsigned HOST_WIDE_INT main_vf_max
+	= estimated_poly_value (main_poly_vf, POLY_VALUE_MAX);
+
+      other_factor = main_vf_max / estimated_poly_value (other_vf,
+						       POLY_VALUE_MAX);
+      this_factor = main_vf_max / estimated_poly_value (this_vf,
+						       POLY_VALUE_MAX);
+
+      /* If the loop is not using partial vectors then it will iterate one
+	 time less than one that does.  It is safe to subtract one here,
+	 because the main loop's vf is always at least 2x bigger than that
+	 of an epilogue.  */
+      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (other_loop_vinfo))
+	other_factor -= 1;
+      if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (this_loop_vinfo))
+	this_factor -= 1;
+    }
+
+  /* Compute the costs by multiplying the inside costs with the factor and
+     add the outside costs for a more complete picture.  The factor is the
+     amount of times we are expecting to iterate this epilogue.  */
+  other_cost = other->body_cost () * other_factor;
+  this_cost = this->body_cost () * this_factor;
+  other_cost += other->outside_cost ();
+  this_cost += this->outside_cost ();
+  return this_cost < other_cost;
+}
+
+/* A <=>-style subroutine of better_main_loop_than_p.  Check whether we can
+   determine the return value of better_main_loop_than_p by comparing the
+   inside (loop body) costs of THIS and OTHER.  Return:
+
+   * -1 if better_main_loop_than_p should return true.
+   * 1 if better_main_loop_than_p should return false.
+   * 0 if we can't decide.  */
+
+int
+vector_costs::compare_inside_loop_cost (const vector_costs *other) const
+{
+  loop_vec_info this_loop_vinfo = as_a<loop_vec_info> (this->m_vinfo);
+  loop_vec_info other_loop_vinfo = as_a<loop_vec_info> (other->m_vinfo);
+
+  struct loop *loop = LOOP_VINFO_LOOP (this_loop_vinfo);
+  gcc_assert (LOOP_VINFO_LOOP (other_loop_vinfo) == loop);
+
+  poly_int64 this_vf = LOOP_VINFO_VECT_FACTOR (this_loop_vinfo);
+  poly_int64 other_vf = LOOP_VINFO_VECT_FACTOR (other_loop_vinfo);
+
+  /* Limit the VFs to what is likely to be the maximum number of iterations,
+     to handle cases in which at least one loop_vinfo is fully-masked.  */
+  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
+  if (estimated_max_niter != -1)
+    {
+      if (known_le (estimated_max_niter, this_vf))
+	this_vf = estimated_max_niter;
+      if (known_le (estimated_max_niter, other_vf))
+	other_vf = estimated_max_niter;
+    }
+
+  /* Check whether the (fractional) cost per scalar iteration is lower or
+     higher: this_inside_cost / this_vf vs. other_inside_cost / other_vf.  */
+  poly_int64 rel_this = this_loop_vinfo->vector_costs->body_cost () * other_vf;
+  poly_int64 rel_other
+    = other_loop_vinfo->vector_costs->body_cost () * this_vf;
+
+  HOST_WIDE_INT est_rel_this_min
+    = estimated_poly_value (rel_this, POLY_VALUE_MIN);
+  HOST_WIDE_INT est_rel_this_max
+    = estimated_poly_value (rel_this, POLY_VALUE_MAX);
+
+  HOST_WIDE_INT est_rel_other_min
+    = estimated_poly_value (rel_other, POLY_VALUE_MIN);
+  HOST_WIDE_INT est_rel_other_max
+    = estimated_poly_value (rel_other, POLY_VALUE_MAX);
+
+  /* Check first if we can make out an unambigous total order from the minimum
+     and maximum estimates.  */
+  if (est_rel_this_min < est_rel_other_min
+      && est_rel_this_max < est_rel_other_max)
+    return -1;
+
+  if (est_rel_other_min < est_rel_this_min
+      && est_rel_other_max < est_rel_this_max)
+    return 1;
+
+  /* When other_loop_vinfo uses a variable vectorization factor,
+     we know that it has a lower cost for at least one runtime VF.
+     However, we don't know how likely that VF is.
+
+     One option would be to compare the costs for the estimated VFs.
+     The problem is that that can put too much pressure on the cost
+     model.  E.g. if the estimated VF is also the lowest possible VF,
+     and if other_loop_vinfo is 1 unit worse than this_loop_vinfo
+     for the estimated VF, we'd then choose this_loop_vinfo even
+     though (a) this_loop_vinfo might not actually be better than
+     other_loop_vinfo for that VF and (b) it would be significantly
+     worse at larger VFs.
+
+     Here we go for a hacky compromise: pick this_loop_vinfo if it is
+     no more expensive than other_loop_vinfo even after doubling the
+     estimated other_loop_vinfo VF.  For all but trivial loops, this
+     ensures that we only pick this_loop_vinfo if it is significantly
+     better than other_loop_vinfo at the estimated VF.  */
+  if (est_rel_other_min != est_rel_this_min
+      || est_rel_other_max != est_rel_this_max)
+    {
+      HOST_WIDE_INT est_rel_this_likely
+	= estimated_poly_value (rel_this, POLY_VALUE_LIKELY);
+      HOST_WIDE_INT est_rel_other_likely
+	= estimated_poly_value (rel_other, POLY_VALUE_LIKELY);
+
+      return est_rel_this_likely * 2 <= est_rel_other_likely ? -1 : 1;
+    }
+
+  return 0;
+}
+
+/* A <=>-style subroutine of better_main_loop_than_p, used when there is
+   nothing to choose between the inside (loop body) costs of THIS and OTHER.
+   Check whether we can determine the return value of better_main_loop_than_p
+   by comparing the outside (prologue and epilogue) costs of THIS and OTHER.
+   Return:
+
+   * -1 if better_main_loop_than_p should return true.
+   * 1 if better_main_loop_than_p should return false.
+   * 0 if we can't decide.  */
+
+int
+vector_costs::compare_outside_loop_cost (const vector_costs *other) const
+{
+  auto this_outside_cost = this->outside_cost ();
+  auto other_outside_cost = other->outside_cost ();
+  if (this_outside_cost != other_outside_cost)
+    return this_outside_cost < other_outside_cost ? -1 : 1;
+
+  return 0;
+}
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 87d3f211a2e..0e3aad590e8 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1419,6 +1419,21 @@  public:
      read back using the functions below.  */
   virtual void finish_cost ();
 
+  /* The costs in THIS and OTHER both describe ways of vectorizing
+     a main loop.  Return true if the costs described by THIS are
+     cheaper than the costs described by OTHER.  Return false if any
+     of the following are true:
+
+     - THIS and OTHER are of equal cost
+     - OTHER is better than THIS
+     - we can't be sure about the relative costs of THIS and OTHER.  */
+  virtual bool better_main_loop_than_p (const vector_costs *other) const;
+
+  /* Likewise, but the costs in THIS and OTHER both describe ways of
+     vectorizing an epilogue loop of MAIN_LOOP.  */
+  virtual bool better_epilogue_loop_than_p (const vector_costs *other,
+					    loop_vec_info main_loop) const;
+
   unsigned int prologue_cost () const;
   unsigned int body_cost () const;
   unsigned int epilogue_cost () const;
@@ -1429,6 +1444,8 @@  protected:
 				 unsigned int);
   unsigned int adjust_cost_for_freq (stmt_vec_info, vect_cost_model_location,
 				     unsigned int);
+  int compare_inside_loop_cost (const vector_costs *) const;
+  int compare_outside_loop_cost (const vector_costs *) const;
 
   /* The region of code that we're considering vectorizing.  */
   vec_info *m_vinfo;