[aarch64] Implement determine_suggested_unroll_factor

Message ID 698d5b64-5e1b-d0bd-f2ff-f5cd2763dbaa@arm.com
State New
Headers
Series [aarch64] Implement determine_suggested_unroll_factor |

Commit Message

Andre Vieira (lists) March 16, 2022, 2:59 p.m. UTC
  Hi,

This patch implements the costing function 
determine_suggested_unroll_factor for aarch64.
It determines the unrolling factor by dividing the number of X 
operations we can do per cycle by the number of X operations in the loop 
body, taking this information from the vec_ops analysis during vector 
costing and the available issue_info information.
We multiply the dividend by a potential reduction_latency, to improve 
our pipeline utilization if we are stalled waiting on a particular 
reduction operation.

Right now we also have a work around for vectorization choices where the 
main loop uses a NEON mode and predication is available, such that if 
the main loop makes use of a NEON pattern that is not directly supported 
by SVE we do not unroll, as that might cause performance regressions in 
cases where we would enter the original main loop's VF. As an example if 
you have a loop where you could use AVG_CEIL with a V8HI mode, you would 
originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated 
epilogue, using other instructions. Whereas with the unrolling you would 
end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping 
the original 8x NEON. In the future, we could handle this differently, 
by either using a different costing model for epilogues, or potentially 
vectorizing more than one single epilogue.

gcc/ChangeLog:

         * config/aarch64/aarch64.cc (aarch64_vector_costs): Define 
determine_suggested_unroll_factor.
         (determine_suggested_unroll_factor): New function.
         (aarch64_vector_costs::finish_costs): Use 
determine_suggested_unroll_factor.
  

Comments

Richard Sandiford March 16, 2022, 6:01 p.m. UTC | #1
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> Hi,
>
> This patch implements the costing function 
> determine_suggested_unroll_factor for aarch64.
> It determines the unrolling factor by dividing the number of X 
> operations we can do per cycle by the number of X operations in the loop 
> body, taking this information from the vec_ops analysis during vector 
> costing and the available issue_info information.
> We multiply the dividend by a potential reduction_latency, to improve 
> our pipeline utilization if we are stalled waiting on a particular 
> reduction operation.
>
> Right now we also have a work around for vectorization choices where the 
> main loop uses a NEON mode and predication is available, such that if 
> the main loop makes use of a NEON pattern that is not directly supported 
> by SVE we do not unroll, as that might cause performance regressions in 
> cases where we would enter the original main loop's VF. As an example if 
> you have a loop where you could use AVG_CEIL with a V8HI mode, you would 
> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated 
> epilogue, using other instructions. Whereas with the unrolling you would 
> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping 
> the original 8x NEON. In the future, we could handle this differently, 
> by either using a different costing model for epilogues, or potentially 
> vectorizing more than one single epilogue.
>
> gcc/ChangeLog:
>
>          * config/aarch64/aarch64.cc (aarch64_vector_costs): Define 
> determine_suggested_unroll_factor.
>          (determine_suggested_unroll_factor): New function.
>          (aarch64_vector_costs::finish_costs): Use 
> determine_suggested_unroll_factor.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15680,6 +15680,7 @@ private:
>    unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>  				 unsigned int);
>    bool prefer_unrolled_loop () const;
> +  unsigned int determine_suggested_unroll_factor ();
>  
>    /* True if we have performed one-time initialization based on the
>       vec_info.  */
> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>    return sve_cycles_per_iter;
>  }
>  
> +unsigned int
> +aarch64_vector_costs::determine_suggested_unroll_factor ()
> +{
> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
> +  if (!issue_info)
> +    return 1;
> +  bool sve = false;
> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))

Other code uses m_vec_flags & VEC_ANY_SVE for this.

> +    {
> +      if (!issue_info->sve)
> +	return 1;
> +      sve = true;
> +    }
> +  else
> +    {
> +      if (!issue_info->advsimd)
> +	return 1;

The issue info should instead be taken from vec_ops.simd_issue_info ()
in the loop below.  It can vary for each entry.

> +      /* If we are trying to unroll a NEON main loop that contains patterns

s/a NEON/an Advanced SIMD/

> +	 that we do not support with SVE and we might use a predicated
> +	 epilogue, we need to be conservative and block unrolling as this might
> +	 lead to a less optimal loop for the first and only epilogue using the
> +	 original loop's vectorization factor.
> +	 TODO: Remove this constraint when we add support for multiple epilogue
> +	 vectorization.  */
> +      if (partial_vectors_supported_p ()
> +	  && param_vect_partial_vector_usage != 0
> +	  && !TARGET_SVE2)
> +	{
> +	  unsigned int i;
> +	  stmt_vec_info stmt_vinfo;
> +	  FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
> +	    {
> +	      if (is_pattern_stmt_p (stmt_vinfo))
> +		{
> +		  gimple *stmt = stmt_vinfo->stmt;
> +		  if (is_gimple_call (stmt)
> +		      && gimple_call_internal_p (stmt))
> +		    {
> +		      enum internal_fn ifn
> +			= gimple_call_internal_fn (stmt);
> +		      switch (ifn)
> +			{
> +			case IFN_AVG_FLOOR:
> +			case IFN_AVG_CEIL:
> +			  return 1;
> +			default:
> +			  break;
> +			}
> +		    }
> +		}
> +	    }
> +	}

I think we should instead record this during add_stmt_cost, like we do
for the other data that this function uses.

We need to drop the partial_vectors_supported_p () condition though:
enabling SVE must not make Advanced SIMD code worse.  So for now,
Advanced SIMD-only code will have to suffer the missed unrolling
opportunity too.

> +    }
> +
> +  unsigned int max_unroll_factor = 1;
> +  aarch64_simd_vec_issue_info const *vec_issue
> +    = sve ? issue_info->sve : issue_info->advsimd;
> +  for (auto vec_ops : m_ops)
> +    {
> +      /* Limit unroll factor to 4 for now.  */
> +      unsigned int unroll_factor = 4;

Would be good to make this an aarch64-specific --param in case people
want to play with different values.

> +      unsigned int factor
> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
> +      unsigned int temp;
> +
> +      /* Sanity check, this should never happen.  */
> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
> +	return 1;
> +
> +      /* Check stores.  */
> +      if (vec_ops.stores > 0)
> +	{
> +	  temp = CEIL (factor * vec_issue->stores_per_cycle,
> +		       vec_ops.stores);
> +	  unroll_factor = MIN (unroll_factor, temp);
> +	}
> +
> +      /* Check loads.  */
> +      if (vec_ops.loads > 0)
> +	{
> +	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
> +		       vec_ops.loads);

This should use loads + stores rather than just loads.

> +	  unroll_factor = MIN (unroll_factor, temp);
> +	}
> +
> +      /* Check general ops.  */
> +      if (vec_ops.general_ops > 0)
> +	{
> +	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
> +	               vec_ops.general_ops);
> +	  unroll_factor = MIN (unroll_factor, temp);
> +	 }

In principle we should also take pred_ops into account when
sve_issue_info () is nonnull, while ignoring the pred_ops associated
with loop predication (added by analyze_loop_vinfo).  I guess we
should maintain two pred_op counts.

That's just a suggestion for a possible future improvement though.
It doesn't need to be done as part of this patch.

Thanks,
Richard

> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
> +    }
> +
> +  /* Make sure unroll factor is power of 2.  */
> +  return 1 << ceil_log2 (max_unroll_factor);
> +}
> +
>  /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>     and return the new cost.  */
>  unsigned int
> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>    if (loop_vinfo
>        && m_vec_flags
>        && aarch64_use_new_vector_costs_p ())
> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> -					   m_costs[vect_body]);
> +    {
> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> +					     m_costs[vect_body]);
> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> +    }
>  
>    /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>       the scalar code in the event of a tie, since there is more chance
  
Andre Vieira (lists) March 25, 2022, 5:11 p.m. UTC | #2
Hi,

Addressed all of your comments bar the pred ops one.

Is this OK?


gcc/ChangeLog:

         * config/aarch64/aarch64.cc (aarch64_vector_costs): Define 
determine_suggested_unroll_factor and m_nosve_pattern.
         (determine_suggested_unroll_factor): New function.
         (aarch64_vector_costs::add_stmt_cost): Check for a qualifying 
pattern
         to set m_nosve_pattern.
         (aarch64_vector_costs::finish_costs): Use 
determine_suggested_unroll_factor.
         * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New.

On 16/03/2022 18:01, Richard Sandiford wrote:
> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
>> Hi,
>>
>> This patch implements the costing function
>> determine_suggested_unroll_factor for aarch64.
>> It determines the unrolling factor by dividing the number of X
>> operations we can do per cycle by the number of X operations in the loop
>> body, taking this information from the vec_ops analysis during vector
>> costing and the available issue_info information.
>> We multiply the dividend by a potential reduction_latency, to improve
>> our pipeline utilization if we are stalled waiting on a particular
>> reduction operation.
>>
>> Right now we also have a work around for vectorization choices where the
>> main loop uses a NEON mode and predication is available, such that if
>> the main loop makes use of a NEON pattern that is not directly supported
>> by SVE we do not unroll, as that might cause performance regressions in
>> cases where we would enter the original main loop's VF. As an example if
>> you have a loop where you could use AVG_CEIL with a V8HI mode, you would
>> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated
>> epilogue, using other instructions. Whereas with the unrolling you would
>> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping
>> the original 8x NEON. In the future, we could handle this differently,
>> by either using a different costing model for epilogues, or potentially
>> vectorizing more than one single epilogue.
>>
>> gcc/ChangeLog:
>>
>>           * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>> determine_suggested_unroll_factor.
>>           (determine_suggested_unroll_factor): New function.
>>           (aarch64_vector_costs::finish_costs): Use
>> determine_suggested_unroll_factor.
>>
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -15680,6 +15680,7 @@ private:
>>     unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>   				 unsigned int);
>>     bool prefer_unrolled_loop () const;
>> +  unsigned int determine_suggested_unroll_factor ();
>>   
>>     /* True if we have performed one-time initialization based on the
>>        vec_info.  */
>> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>     return sve_cycles_per_iter;
>>   }
>>   
>> +unsigned int
>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>> +{
>> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
>> +  if (!issue_info)
>> +    return 1;
>> +  bool sve = false;
>> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))
> Other code uses m_vec_flags & VEC_ANY_SVE for this.
>
>> +    {
>> +      if (!issue_info->sve)
>> +	return 1;
>> +      sve = true;
>> +    }
>> +  else
>> +    {
>> +      if (!issue_info->advsimd)
>> +	return 1;
> The issue info should instead be taken from vec_ops.simd_issue_info ()
> in the loop below.  It can vary for each entry.
>
>> +      /* If we are trying to unroll a NEON main loop that contains patterns
> s/a NEON/an Advanced SIMD/
>
>> +	 that we do not support with SVE and we might use a predicated
>> +	 epilogue, we need to be conservative and block unrolling as this might
>> +	 lead to a less optimal loop for the first and only epilogue using the
>> +	 original loop's vectorization factor.
>> +	 TODO: Remove this constraint when we add support for multiple epilogue
>> +	 vectorization.  */
>> +      if (partial_vectors_supported_p ()
>> +	  && param_vect_partial_vector_usage != 0
>> +	  && !TARGET_SVE2)
>> +	{
>> +	  unsigned int i;
>> +	  stmt_vec_info stmt_vinfo;
>> +	  FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
>> +	    {
>> +	      if (is_pattern_stmt_p (stmt_vinfo))
>> +		{
>> +		  gimple *stmt = stmt_vinfo->stmt;
>> +		  if (is_gimple_call (stmt)
>> +		      && gimple_call_internal_p (stmt))
>> +		    {
>> +		      enum internal_fn ifn
>> +			= gimple_call_internal_fn (stmt);
>> +		      switch (ifn)
>> +			{
>> +			case IFN_AVG_FLOOR:
>> +			case IFN_AVG_CEIL:
>> +			  return 1;
>> +			default:
>> +			  break;
>> +			}
>> +		    }
>> +		}
>> +	    }
>> +	}
> I think we should instead record this during add_stmt_cost, like we do
> for the other data that this function uses.
>
> We need to drop the partial_vectors_supported_p () condition though:
> enabling SVE must not make Advanced SIMD code worse.  So for now,
> Advanced SIMD-only code will have to suffer the missed unrolling
> opportunity too.
>
>> +    }
>> +
>> +  unsigned int max_unroll_factor = 1;
>> +  aarch64_simd_vec_issue_info const *vec_issue
>> +    = sve ? issue_info->sve : issue_info->advsimd;
>> +  for (auto vec_ops : m_ops)
>> +    {
>> +      /* Limit unroll factor to 4 for now.  */
>> +      unsigned int unroll_factor = 4;
> Would be good to make this an aarch64-specific --param in case people
> want to play with different values.
>
>> +      unsigned int factor
>> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>> +      unsigned int temp;
>> +
>> +      /* Sanity check, this should never happen.  */
>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>> +	return 1;
>> +
>> +      /* Check stores.  */
>> +      if (vec_ops.stores > 0)
>> +	{
>> +	  temp = CEIL (factor * vec_issue->stores_per_cycle,
>> +		       vec_ops.stores);
>> +	  unroll_factor = MIN (unroll_factor, temp);
>> +	}
>> +
>> +      /* Check loads.  */
>> +      if (vec_ops.loads > 0)
>> +	{
>> +	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>> +		       vec_ops.loads);
> This should use loads + stores rather than just loads.
>
>> +	  unroll_factor = MIN (unroll_factor, temp);
>> +	}
>> +
>> +      /* Check general ops.  */
>> +      if (vec_ops.general_ops > 0)
>> +	{
>> +	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>> +	               vec_ops.general_ops);
>> +	  unroll_factor = MIN (unroll_factor, temp);
>> +	 }
> In principle we should also take pred_ops into account when
> sve_issue_info () is nonnull, while ignoring the pred_ops associated
> with loop predication (added by analyze_loop_vinfo).  I guess we
> should maintain two pred_op counts.
>
> That's just a suggestion for a possible future improvement though.
> It doesn't need to be done as part of this patch.
>
> Thanks,
> Richard
>
>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>> +    }
>> +
>> +  /* Make sure unroll factor is power of 2.  */
>> +  return 1 << ceil_log2 (max_unroll_factor);
>> +}
>> +
>>   /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>      and return the new cost.  */
>>   unsigned int
>> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>     if (loop_vinfo
>>         && m_vec_flags
>>         && aarch64_use_new_vector_costs_p ())
>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>> -					   m_costs[vect_body]);
>> +    {
>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>> +					     m_costs[vect_body]);
>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>> +    }
>>   
>>     /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>        the scalar code in the event of a tie, since there is more chance
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -15637,6 +15637,7 @@ private:
   unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
 				 unsigned int);
   bool prefer_unrolled_loop () const;
+  unsigned int determine_suggested_unroll_factor ();
 
   /* True if we have performed one-time initialization based on the
      vec_info.  */
@@ -15699,6 +15700,10 @@ private:
      or vector loop.  There is one entry for each tuning option of
      interest.  */
   auto_vec<aarch64_vec_op_count, 2> m_ops;
+
+  /* This loop uses a pattern that is not supported by SVE, but is supported by
+     Advanced SIMD and SVE2.  */
+  bool m_nosve_pattern = false;
 };
 
 aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
@@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	 as one iteration of the SVE loop.  */
       if (where == vect_body && m_unrolled_advsimd_niters)
 	m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters;
+
+      if (is_pattern_stmt_p (stmt_info))
+	{
+	  gimple *stmt = stmt_info->stmt;
+	  if (is_gimple_call (stmt)
+	      && gimple_call_internal_p (stmt))
+	    {
+	      enum internal_fn ifn
+		= gimple_call_internal_fn (stmt);
+	      switch (ifn)
+		{
+		case IFN_AVG_FLOOR:
+		case IFN_AVG_CEIL:
+		  m_nosve_pattern = true;
+		default:
+		  break;
+		}
+	    }
+	}
     }
   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
 }
@@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
   return sve_cycles_per_iter;
 }
 
+unsigned int
+aarch64_vector_costs::determine_suggested_unroll_factor ()
+{
+  bool sve = m_vec_flags & VEC_ANY_SVE;
+  /* If we are trying to unroll an Advanced SIMD main loop that contains
+     patterns that we do not support with SVE and we might use a predicated
+     epilogue, we need to be conservative and block unrolling as this might
+     lead to a less optimal loop for the first and only epilogue using the
+     original loop's vectorization factor.
+     TODO: Remove this constraint when we add support for multiple epilogue
+     vectorization.  */
+  if (!sve && TARGET_SVE2 && m_nosve_pattern)
+    return 1;
+
+  unsigned int max_unroll_factor = 1;
+  for (auto vec_ops : m_ops)
+    {
+      aarch64_simd_vec_issue_info const *vec_issue
+	= vec_ops.simd_issue_info ();
+      if (!vec_issue)
+	return 1;
+      /* Limit unroll factor to a value adjustable by the user, the default
+	 value is 4. */
+      unsigned int unroll_factor = aarch64_vect_unroll_limit;
+      unsigned int factor
+       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
+      unsigned int temp;
+
+      /* Sanity check, this should never happen.  */
+      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
+	return 1;
+
+      /* Check stores.  */
+      if (vec_ops.stores > 0)
+	{
+	  temp = CEIL (factor * vec_issue->stores_per_cycle,
+		       vec_ops.stores);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+
+      /* Check loads + stores.  */
+      if (vec_ops.loads > 0)
+	{
+	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
+		       vec_ops.loads + vec_ops.stores);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+
+      /* Check general ops.  */
+      if (vec_ops.general_ops > 0)
+	{
+	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
+		       vec_ops.general_ops);
+	  unroll_factor = MIN (unroll_factor, temp);
+	 }
+      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
+    }
+
+  /* Make sure unroll factor is power of 2.  */
+  return 1 << ceil_log2 (max_unroll_factor);
+}
+
 /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
    and return the new cost.  */
 unsigned int
@@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
   if (loop_vinfo
       && m_vec_flags
       && aarch64_use_new_vector_costs_p ())
-    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
-					   m_costs[vect_body]);
+    {
+      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
+					     m_costs[vect_body]);
+      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
+    }
 
   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
      the scalar code in the event of a tie, since there is more chance
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence.
 -param=aarch64-mops-memset-size-threshold=
 Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param
 Constant memset size in bytes from which to start using MOPS sequence.
+
+-param=aarch64-vect-unroll-limit=
+Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
+Limit how much the autovectorizer may unroll a loop.
  
Richard Sandiford March 28, 2022, 2:59 p.m. UTC | #3
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> Hi,
>
> Addressed all of your comments bar the pred ops one.
>
> Is this OK?
>
>
> gcc/ChangeLog:
>
>          * config/aarch64/aarch64.cc (aarch64_vector_costs): Define 
> determine_suggested_unroll_factor and m_nosve_pattern.
>          (determine_suggested_unroll_factor): New function.
>          (aarch64_vector_costs::add_stmt_cost): Check for a qualifying 
> pattern
>          to set m_nosve_pattern.
>          (aarch64_vector_costs::finish_costs): Use 
> determine_suggested_unroll_factor.
>          * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New.
>
> On 16/03/2022 18:01, Richard Sandiford wrote:
>> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
>>> Hi,
>>>
>>> This patch implements the costing function
>>> determine_suggested_unroll_factor for aarch64.
>>> It determines the unrolling factor by dividing the number of X
>>> operations we can do per cycle by the number of X operations in the loop
>>> body, taking this information from the vec_ops analysis during vector
>>> costing and the available issue_info information.
>>> We multiply the dividend by a potential reduction_latency, to improve
>>> our pipeline utilization if we are stalled waiting on a particular
>>> reduction operation.
>>>
>>> Right now we also have a work around for vectorization choices where the
>>> main loop uses a NEON mode and predication is available, such that if
>>> the main loop makes use of a NEON pattern that is not directly supported
>>> by SVE we do not unroll, as that might cause performance regressions in
>>> cases where we would enter the original main loop's VF. As an example if
>>> you have a loop where you could use AVG_CEIL with a V8HI mode, you would
>>> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated
>>> epilogue, using other instructions. Whereas with the unrolling you would
>>> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping
>>> the original 8x NEON. In the future, we could handle this differently,
>>> by either using a different costing model for epilogues, or potentially
>>> vectorizing more than one single epilogue.
>>>
>>> gcc/ChangeLog:
>>>
>>>           * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>>> determine_suggested_unroll_factor.
>>>           (determine_suggested_unroll_factor): New function.
>>>           (aarch64_vector_costs::finish_costs): Use
>>> determine_suggested_unroll_factor.
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
>>> --- a/gcc/config/aarch64/aarch64.cc
>>> +++ b/gcc/config/aarch64/aarch64.cc
>>> @@ -15680,6 +15680,7 @@ private:
>>>     unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>>   				 unsigned int);
>>>     bool prefer_unrolled_loop () const;
>>> +  unsigned int determine_suggested_unroll_factor ();
>>>   
>>>     /* True if we have performed one-time initialization based on the
>>>        vec_info.  */
>>> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>>     return sve_cycles_per_iter;
>>>   }
>>>   
>>> +unsigned int
>>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>>> +{
>>> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
>>> +  if (!issue_info)
>>> +    return 1;
>>> +  bool sve = false;
>>> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))
>> Other code uses m_vec_flags & VEC_ANY_SVE for this.
>>
>>> +    {
>>> +      if (!issue_info->sve)
>>> +	return 1;
>>> +      sve = true;
>>> +    }
>>> +  else
>>> +    {
>>> +      if (!issue_info->advsimd)
>>> +	return 1;
>> The issue info should instead be taken from vec_ops.simd_issue_info ()
>> in the loop below.  It can vary for each entry.
>>
>>> +      /* If we are trying to unroll a NEON main loop that contains patterns
>> s/a NEON/an Advanced SIMD/
>>
>>> +	 that we do not support with SVE and we might use a predicated
>>> +	 epilogue, we need to be conservative and block unrolling as this might
>>> +	 lead to a less optimal loop for the first and only epilogue using the
>>> +	 original loop's vectorization factor.
>>> +	 TODO: Remove this constraint when we add support for multiple epilogue
>>> +	 vectorization.  */
>>> +      if (partial_vectors_supported_p ()
>>> +	  && param_vect_partial_vector_usage != 0
>>> +	  && !TARGET_SVE2)
>>> +	{
>>> +	  unsigned int i;
>>> +	  stmt_vec_info stmt_vinfo;
>>> +	  FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
>>> +	    {
>>> +	      if (is_pattern_stmt_p (stmt_vinfo))
>>> +		{
>>> +		  gimple *stmt = stmt_vinfo->stmt;
>>> +		  if (is_gimple_call (stmt)
>>> +		      && gimple_call_internal_p (stmt))
>>> +		    {
>>> +		      enum internal_fn ifn
>>> +			= gimple_call_internal_fn (stmt);
>>> +		      switch (ifn)
>>> +			{
>>> +			case IFN_AVG_FLOOR:
>>> +			case IFN_AVG_CEIL:
>>> +			  return 1;
>>> +			default:
>>> +			  break;
>>> +			}
>>> +		    }
>>> +		}
>>> +	    }
>>> +	}
>> I think we should instead record this during add_stmt_cost, like we do
>> for the other data that this function uses.
>>
>> We need to drop the partial_vectors_supported_p () condition though:
>> enabling SVE must not make Advanced SIMD code worse.  So for now,
>> Advanced SIMD-only code will have to suffer the missed unrolling
>> opportunity too.
>>
>>> +    }
>>> +
>>> +  unsigned int max_unroll_factor = 1;
>>> +  aarch64_simd_vec_issue_info const *vec_issue
>>> +    = sve ? issue_info->sve : issue_info->advsimd;
>>> +  for (auto vec_ops : m_ops)
>>> +    {
>>> +      /* Limit unroll factor to 4 for now.  */
>>> +      unsigned int unroll_factor = 4;
>> Would be good to make this an aarch64-specific --param in case people
>> want to play with different values.
>>
>>> +      unsigned int factor
>>> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>>> +      unsigned int temp;
>>> +
>>> +      /* Sanity check, this should never happen.  */
>>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>>> +	return 1;
>>> +
>>> +      /* Check stores.  */
>>> +      if (vec_ops.stores > 0)
>>> +	{
>>> +	  temp = CEIL (factor * vec_issue->stores_per_cycle,
>>> +		       vec_ops.stores);
>>> +	  unroll_factor = MIN (unroll_factor, temp);
>>> +	}
>>> +
>>> +      /* Check loads.  */
>>> +      if (vec_ops.loads > 0)
>>> +	{
>>> +	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>>> +		       vec_ops.loads);
>> This should use loads + stores rather than just loads.
>>
>>> +	  unroll_factor = MIN (unroll_factor, temp);
>>> +	}
>>> +
>>> +      /* Check general ops.  */
>>> +      if (vec_ops.general_ops > 0)
>>> +	{
>>> +	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>>> +	               vec_ops.general_ops);
>>> +	  unroll_factor = MIN (unroll_factor, temp);
>>> +	 }
>> In principle we should also take pred_ops into account when
>> sve_issue_info () is nonnull, while ignoring the pred_ops associated
>> with loop predication (added by analyze_loop_vinfo).  I guess we
>> should maintain two pred_op counts.
>>
>> That's just a suggestion for a possible future improvement though.
>> It doesn't need to be done as part of this patch.
>>
>> Thanks,
>> Richard
>>
>>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>>> +    }
>>> +
>>> +  /* Make sure unroll factor is power of 2.  */
>>> +  return 1 << ceil_log2 (max_unroll_factor);
>>> +}
>>> +
>>>   /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>>      and return the new cost.  */
>>>   unsigned int
>>> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>>     if (loop_vinfo
>>>         && m_vec_flags
>>>         && aarch64_use_new_vector_costs_p ())
>>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>> -					   m_costs[vect_body]);
>>> +    {
>>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>> +					     m_costs[vect_body]);
>>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>>> +    }
>>>   
>>>     /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>>        the scalar code in the event of a tie, since there is more chance
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15637,6 +15637,7 @@ private:
>    unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>  				 unsigned int);
>    bool prefer_unrolled_loop () const;
> +  unsigned int determine_suggested_unroll_factor ();
>  
>    /* True if we have performed one-time initialization based on the
>       vec_info.  */
> @@ -15699,6 +15700,10 @@ private:
>       or vector loop.  There is one entry for each tuning option of
>       interest.  */
>    auto_vec<aarch64_vec_op_count, 2> m_ops;
> +
> +  /* This loop uses a pattern that is not supported by SVE, but is supported by
> +     Advanced SIMD and SVE2.  */
> +  bool m_nosve_pattern = false;

This is true for other things besides AVG_FLOOR/CEIL.  I think we should
just accept that this is an x264 hack ;-) and name it “m_has_avg” instead.

Could you put it with the other bool field?  That would lead to a nicer
layout, although it hardly matters much here.

>  };
>  
>  aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
> @@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>  	 as one iteration of the SVE loop.  */
>        if (where == vect_body && m_unrolled_advsimd_niters)
>  	m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters;
> +
> +      if (is_pattern_stmt_p (stmt_info))
> +	{

Not sure this is necessary.  We'd want the same thing if (for whatever
reason) we had a pre-existing IFN_AVG_FLOOR/CEIL in future.

> +	  gimple *stmt = stmt_info->stmt;
> +	  if (is_gimple_call (stmt)
> +	      && gimple_call_internal_p (stmt))
> +	    {
> +	      enum internal_fn ifn
> +		= gimple_call_internal_fn (stmt);
> +	      switch (ifn)

Very minor nit, but might as well switch on gimple_call_internal_fn (stmt)
directly, to avoid the awkward line break.

> +		{
> +		case IFN_AVG_FLOOR:
> +		case IFN_AVG_CEIL:
> +		  m_nosve_pattern = true;
> +		default:
> +		  break;
> +		}
> +	    }
> +	}
>      }
>    return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>  }
> @@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>    return sve_cycles_per_iter;
>  }
>  
> +unsigned int
> +aarch64_vector_costs::determine_suggested_unroll_factor ()
> +{
> +  bool sve = m_vec_flags & VEC_ANY_SVE;
> +  /* If we are trying to unroll an Advanced SIMD main loop that contains
> +     patterns that we do not support with SVE and we might use a predicated
> +     epilogue, we need to be conservative and block unrolling as this might
> +     lead to a less optimal loop for the first and only epilogue using the
> +     original loop's vectorization factor.
> +     TODO: Remove this constraint when we add support for multiple epilogue
> +     vectorization.  */
> +  if (!sve && TARGET_SVE2 && m_nosve_pattern)
> +    return 1;

Shouldn't this be !TARGET_SVE2?

> +
> +  unsigned int max_unroll_factor = 1;
> +  for (auto vec_ops : m_ops)
> +    {
> +      aarch64_simd_vec_issue_info const *vec_issue
> +	= vec_ops.simd_issue_info ();
> +      if (!vec_issue)
> +	return 1;
> +      /* Limit unroll factor to a value adjustable by the user, the default
> +	 value is 4. */
> +      unsigned int unroll_factor = aarch64_vect_unroll_limit;
> +      unsigned int factor
> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
> +      unsigned int temp;
> +
> +      /* Sanity check, this should never happen.  */
> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
> +	return 1;
> +
> +      /* Check stores.  */
> +      if (vec_ops.stores > 0)
> +	{
> +	  temp = CEIL (factor * vec_issue->stores_per_cycle,
> +		       vec_ops.stores);
> +	  unroll_factor = MIN (unroll_factor, temp);
> +	}
> +
> +      /* Check loads + stores.  */
> +      if (vec_ops.loads > 0)
> +	{
> +	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
> +		       vec_ops.loads + vec_ops.stores);
> +	  unroll_factor = MIN (unroll_factor, temp);
> +	}
> +
> +      /* Check general ops.  */
> +      if (vec_ops.general_ops > 0)
> +	{
> +	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
> +		       vec_ops.general_ops);
> +	  unroll_factor = MIN (unroll_factor, temp);
> +	 }
> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
> +    }
> +
> +  /* Make sure unroll factor is power of 2.  */
> +  return 1 << ceil_log2 (max_unroll_factor);
> +}
> +
>  /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>     and return the new cost.  */
>  unsigned int
> @@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>    if (loop_vinfo
>        && m_vec_flags
>        && aarch64_use_new_vector_costs_p ())
> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> -					   m_costs[vect_body]);
> +    {
> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> +					     m_costs[vect_body]);
> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> +    }
>  
>    /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>       the scalar code in the event of a tie, since there is more chance
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence.
>  -param=aarch64-mops-memset-size-threshold=
>  Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param
>  Constant memset size in bytes from which to start using MOPS sequence.
> +
> +-param=aarch64-vect-unroll-limit=
> +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
> +Limit how much the autovectorizer may unroll a loop.

This needs to be documented in invoke.texi, alongside the existing
AArch64-specific params.

OK with those changes, thanks.

Richard
  
Andre Vieira (lists) March 31, 2022, 4:15 p.m. UTC | #4
On 28/03/2022 15:59, Richard Sandiford wrote:
> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
>> Hi,
>>
>> Addressed all of your comments bar the pred ops one.
>>
>> Is this OK?
>>
>>
>> gcc/ChangeLog:
>>
>>           * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>> determine_suggested_unroll_factor and m_nosve_pattern.
>>           (determine_suggested_unroll_factor): New function.
>>           (aarch64_vector_costs::add_stmt_cost): Check for a qualifying
>> pattern
>>           to set m_nosve_pattern.
>>           (aarch64_vector_costs::finish_costs): Use
>> determine_suggested_unroll_factor.
>>           * config/aarch64/aarch64.opt (aarch64-vect-unroll-limit): New.
>>
>> On 16/03/2022 18:01, Richard Sandiford wrote:
>>> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
>>>> Hi,
>>>>
>>>> This patch implements the costing function
>>>> determine_suggested_unroll_factor for aarch64.
>>>> It determines the unrolling factor by dividing the number of X
>>>> operations we can do per cycle by the number of X operations in the loop
>>>> body, taking this information from the vec_ops analysis during vector
>>>> costing and the available issue_info information.
>>>> We multiply the dividend by a potential reduction_latency, to improve
>>>> our pipeline utilization if we are stalled waiting on a particular
>>>> reduction operation.
>>>>
>>>> Right now we also have a work around for vectorization choices where the
>>>> main loop uses a NEON mode and predication is available, such that if
>>>> the main loop makes use of a NEON pattern that is not directly supported
>>>> by SVE we do not unroll, as that might cause performance regressions in
>>>> cases where we would enter the original main loop's VF. As an example if
>>>> you have a loop where you could use AVG_CEIL with a V8HI mode, you would
>>>> originally get 8x NEON using AVG_CEIL followed by a 8x SVE predicated
>>>> epilogue, using other instructions. Whereas with the unrolling you would
>>>> end up with 16x AVG_CEIL NEON + 8x SVE predicated loop, thus skipping
>>>> the original 8x NEON. In the future, we could handle this differently,
>>>> by either using a different costing model for epilogues, or potentially
>>>> vectorizing more than one single epilogue.
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>            * config/aarch64/aarch64.cc (aarch64_vector_costs): Define
>>>> determine_suggested_unroll_factor.
>>>>            (determine_suggested_unroll_factor): New function.
>>>>            (aarch64_vector_costs::finish_costs): Use
>>>> determine_suggested_unroll_factor.
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>>> index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>> @@ -15680,6 +15680,7 @@ private:
>>>>      unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>>>    				 unsigned int);
>>>>      bool prefer_unrolled_loop () const;
>>>> +  unsigned int determine_suggested_unroll_factor ();
>>>>    
>>>>      /* True if we have performed one-time initialization based on the
>>>>         vec_info.  */
>>>> @@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>>>      return sve_cycles_per_iter;
>>>>    }
>>>>    
>>>> +unsigned int
>>>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>>>> +{
>>>> +  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
>>>> +  if (!issue_info)
>>>> +    return 1;
>>>> +  bool sve = false;
>>>> +  if (aarch64_sve_mode_p (m_vinfo->vector_mode))
>>> Other code uses m_vec_flags & VEC_ANY_SVE for this.
>>>
>>>> +    {
>>>> +      if (!issue_info->sve)
>>>> +	return 1;
>>>> +      sve = true;
>>>> +    }
>>>> +  else
>>>> +    {
>>>> +      if (!issue_info->advsimd)
>>>> +	return 1;
>>> The issue info should instead be taken from vec_ops.simd_issue_info ()
>>> in the loop below.  It can vary for each entry.
>>>
>>>> +      /* If we are trying to unroll a NEON main loop that contains patterns
>>> s/a NEON/an Advanced SIMD/
>>>
>>>> +	 that we do not support with SVE and we might use a predicated
>>>> +	 epilogue, we need to be conservative and block unrolling as this might
>>>> +	 lead to a less optimal loop for the first and only epilogue using the
>>>> +	 original loop's vectorization factor.
>>>> +	 TODO: Remove this constraint when we add support for multiple epilogue
>>>> +	 vectorization.  */
>>>> +      if (partial_vectors_supported_p ()
>>>> +	  && param_vect_partial_vector_usage != 0
>>>> +	  && !TARGET_SVE2)
>>>> +	{
>>>> +	  unsigned int i;
>>>> +	  stmt_vec_info stmt_vinfo;
>>>> +	  FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
>>>> +	    {
>>>> +	      if (is_pattern_stmt_p (stmt_vinfo))
>>>> +		{
>>>> +		  gimple *stmt = stmt_vinfo->stmt;
>>>> +		  if (is_gimple_call (stmt)
>>>> +		      && gimple_call_internal_p (stmt))
>>>> +		    {
>>>> +		      enum internal_fn ifn
>>>> +			= gimple_call_internal_fn (stmt);
>>>> +		      switch (ifn)
>>>> +			{
>>>> +			case IFN_AVG_FLOOR:
>>>> +			case IFN_AVG_CEIL:
>>>> +			  return 1;
>>>> +			default:
>>>> +			  break;
>>>> +			}
>>>> +		    }
>>>> +		}
>>>> +	    }
>>>> +	}
>>> I think we should instead record this during add_stmt_cost, like we do
>>> for the other data that this function uses.
>>>
>>> We need to drop the partial_vectors_supported_p () condition though:
>>> enabling SVE must not make Advanced SIMD code worse.  So for now,
>>> Advanced SIMD-only code will have to suffer the missed unrolling
>>> opportunity too.
>>>
>>>> +    }
>>>> +
>>>> +  unsigned int max_unroll_factor = 1;
>>>> +  aarch64_simd_vec_issue_info const *vec_issue
>>>> +    = sve ? issue_info->sve : issue_info->advsimd;
>>>> +  for (auto vec_ops : m_ops)
>>>> +    {
>>>> +      /* Limit unroll factor to 4 for now.  */
>>>> +      unsigned int unroll_factor = 4;
>>> Would be good to make this an aarch64-specific --param in case people
>>> want to play with different values.
>>>
>>>> +      unsigned int factor
>>>> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>>>> +      unsigned int temp;
>>>> +
>>>> +      /* Sanity check, this should never happen.  */
>>>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>>>> +	return 1;
>>>> +
>>>> +      /* Check stores.  */
>>>> +      if (vec_ops.stores > 0)
>>>> +	{
>>>> +	  temp = CEIL (factor * vec_issue->stores_per_cycle,
>>>> +		       vec_ops.stores);
>>>> +	  unroll_factor = MIN (unroll_factor, temp);
>>>> +	}
>>>> +
>>>> +      /* Check loads.  */
>>>> +      if (vec_ops.loads > 0)
>>>> +	{
>>>> +	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>>>> +		       vec_ops.loads);
>>> This should use loads + stores rather than just loads.
>>>
>>>> +	  unroll_factor = MIN (unroll_factor, temp);
>>>> +	}
>>>> +
>>>> +      /* Check general ops.  */
>>>> +      if (vec_ops.general_ops > 0)
>>>> +	{
>>>> +	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>>>> +	               vec_ops.general_ops);
>>>> +	  unroll_factor = MIN (unroll_factor, temp);
>>>> +	 }
>>> In principle we should also take pred_ops into account when
>>> sve_issue_info () is nonnull, while ignoring the pred_ops associated
>>> with loop predication (added by analyze_loop_vinfo).  I guess we
>>> should maintain two pred_op counts.
>>>
>>> That's just a suggestion for a possible future improvement though.
>>> It doesn't need to be done as part of this patch.
>>>
>>> Thanks,
>>> Richard
>>>
>>>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>>>> +    }
>>>> +
>>>> +  /* Make sure unroll factor is power of 2.  */
>>>> +  return 1 << ceil_log2 (max_unroll_factor);
>>>> +}
>>>> +
>>>>    /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>>>       and return the new cost.  */
>>>>    unsigned int
>>>> @@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>>>      if (loop_vinfo
>>>>          && m_vec_flags
>>>>          && aarch64_use_new_vector_costs_p ())
>>>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>> -					   m_costs[vect_body]);
>>>> +    {
>>>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>> +					     m_costs[vect_body]);
>>>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>>>> +    }
>>>>    
>>>>      /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>>>         the scalar code in the event of a tie, since there is more chance
>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> index 5da3d14dc357ba01351bca961af4f100a89665e1..3e4595889ce72ff4f1a1e603d5c4fa93be6a8ff4 100644
>> --- a/gcc/config/aarch64/aarch64.cc
>> +++ b/gcc/config/aarch64/aarch64.cc
>> @@ -15637,6 +15637,7 @@ private:
>>     unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
>>   				 unsigned int);
>>     bool prefer_unrolled_loop () const;
>> +  unsigned int determine_suggested_unroll_factor ();
>>   
>>     /* True if we have performed one-time initialization based on the
>>        vec_info.  */
>> @@ -15699,6 +15700,10 @@ private:
>>        or vector loop.  There is one entry for each tuning option of
>>        interest.  */
>>     auto_vec<aarch64_vec_op_count, 2> m_ops;
>> +
>> +  /* This loop uses a pattern that is not supported by SVE, but is supported by
>> +     Advanced SIMD and SVE2.  */
>> +  bool m_nosve_pattern = false;
> This is true for other things besides AVG_FLOOR/CEIL.  I think we should
> just accept that this is an x264 hack ;-) and name it “m_has_avg” instead.
>
> Could you put it with the other bool field?  That would lead to a nicer
> layout, although it hardly matters much here.
>
>>   };
>>   
>>   aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
>> @@ -16642,6 +16647,25 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>   	 as one iteration of the SVE loop.  */
>>         if (where == vect_body && m_unrolled_advsimd_niters)
>>   	m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters;
>> +
>> +      if (is_pattern_stmt_p (stmt_info))
>> +	{
> Not sure this is necessary.  We'd want the same thing if (for whatever
> reason) we had a pre-existing IFN_AVG_FLOOR/CEIL in future.
>
>> +	  gimple *stmt = stmt_info->stmt;
>> +	  if (is_gimple_call (stmt)
>> +	      && gimple_call_internal_p (stmt))
>> +	    {
>> +	      enum internal_fn ifn
>> +		= gimple_call_internal_fn (stmt);
>> +	      switch (ifn)
> Very minor nit, but might as well switch on gimple_call_internal_fn (stmt)
> directly, to avoid the awkward line break.
>
>> +		{
>> +		case IFN_AVG_FLOOR:
>> +		case IFN_AVG_CEIL:
>> +		  m_nosve_pattern = true;
>> +		default:
>> +		  break;
>> +		}
>> +	    }
>> +	}
>>       }
>>     return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
>>   }
>> @@ -16725,6 +16749,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
>>     return sve_cycles_per_iter;
>>   }
>>   
>> +unsigned int
>> +aarch64_vector_costs::determine_suggested_unroll_factor ()
>> +{
>> +  bool sve = m_vec_flags & VEC_ANY_SVE;
>> +  /* If we are trying to unroll an Advanced SIMD main loop that contains
>> +     patterns that we do not support with SVE and we might use a predicated
>> +     epilogue, we need to be conservative and block unrolling as this might
>> +     lead to a less optimal loop for the first and only epilogue using the
>> +     original loop's vectorization factor.
>> +     TODO: Remove this constraint when we add support for multiple epilogue
>> +     vectorization.  */
>> +  if (!sve && TARGET_SVE2 && m_nosve_pattern)
>> +    return 1;
> Shouldn't this be !TARGET_SVE2?
>
>> +
>> +  unsigned int max_unroll_factor = 1;
>> +  for (auto vec_ops : m_ops)
>> +    {
>> +      aarch64_simd_vec_issue_info const *vec_issue
>> +	= vec_ops.simd_issue_info ();
>> +      if (!vec_issue)
>> +	return 1;
>> +      /* Limit unroll factor to a value adjustable by the user, the default
>> +	 value is 4. */
>> +      unsigned int unroll_factor = aarch64_vect_unroll_limit;
>> +      unsigned int factor
>> +       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
>> +      unsigned int temp;
>> +
>> +      /* Sanity check, this should never happen.  */
>> +      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
>> +	return 1;
>> +
>> +      /* Check stores.  */
>> +      if (vec_ops.stores > 0)
>> +	{
>> +	  temp = CEIL (factor * vec_issue->stores_per_cycle,
>> +		       vec_ops.stores);
>> +	  unroll_factor = MIN (unroll_factor, temp);
>> +	}
>> +
>> +      /* Check loads + stores.  */
>> +      if (vec_ops.loads > 0)
>> +	{
>> +	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
>> +		       vec_ops.loads + vec_ops.stores);
>> +	  unroll_factor = MIN (unroll_factor, temp);
>> +	}
>> +
>> +      /* Check general ops.  */
>> +      if (vec_ops.general_ops > 0)
>> +	{
>> +	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
>> +		       vec_ops.general_ops);
>> +	  unroll_factor = MIN (unroll_factor, temp);
>> +	 }
>> +      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
>> +    }
>> +
>> +  /* Make sure unroll factor is power of 2.  */
>> +  return 1 << ceil_log2 (max_unroll_factor);
>> +}
>> +
>>   /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
>>      and return the new cost.  */
>>   unsigned int
>> @@ -16861,8 +16947,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>     if (loop_vinfo
>>         && m_vec_flags
>>         && aarch64_use_new_vector_costs_p ())
>> -    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>> -					   m_costs[vect_body]);
>> +    {
>> +      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>> +					     m_costs[vect_body]);
>> +      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>> +    }
>>   
>>     /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
>>        the scalar code in the event of a tie, since there is more chance
>> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
>> index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644
>> --- a/gcc/config/aarch64/aarch64.opt
>> +++ b/gcc/config/aarch64/aarch64.opt
>> @@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence.
>>   -param=aarch64-mops-memset-size-threshold=
>>   Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param
>>   Constant memset size in bytes from which to start using MOPS sequence.
>> +
>> +-param=aarch64-vect-unroll-limit=
>> +Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
>> +Limit how much the autovectorizer may unroll a loop.
> This needs to be documented in invoke.texi, alongside the existing
> AArch64-specific params.
>
> OK with those changes, thanks.
>
> Richard
Committed attached patch.
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 5da3d14dc357ba01351bca961af4f100a89665e1..d66626748e5edc46e46edf10e243114a9f74be97 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -15637,11 +15637,16 @@ private:
   unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
 				 unsigned int);
   bool prefer_unrolled_loop () const;
+  unsigned int determine_suggested_unroll_factor ();
 
   /* True if we have performed one-time initialization based on the
      vec_info.  */
   bool m_analyzed_vinfo = false;
 
+  /* This loop uses an average operation that is not supported by SVE, but is
+     supported by Advanced SIMD and SVE2.  */
+  bool m_has_avg = false;
+
   /* - If M_VEC_FLAGS is zero then we're costing the original scalar code.
      - If M_VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
        SIMD code.
@@ -16642,6 +16647,21 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	 as one iteration of the SVE loop.  */
       if (where == vect_body && m_unrolled_advsimd_niters)
 	m_unrolled_advsimd_stmts += count * m_unrolled_advsimd_niters;
+
+      /* Detect the use of an averaging operation.  */
+      gimple *stmt = stmt_info->stmt;
+      if (is_gimple_call (stmt)
+	  && gimple_call_internal_p (stmt))
+	{
+	  switch (gimple_call_internal_fn (stmt))
+	    {
+	    case IFN_AVG_FLOOR:
+	    case IFN_AVG_CEIL:
+	      m_has_avg = true;
+	    default:
+	      break;
+	    }
+	}
     }
   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
 }
@@ -16725,6 +16745,68 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
   return sve_cycles_per_iter;
 }
 
+unsigned int
+aarch64_vector_costs::determine_suggested_unroll_factor ()
+{
+  bool sve = m_vec_flags & VEC_ANY_SVE;
+  /* If we are trying to unroll an Advanced SIMD main loop that contains
+     an averaging operation that we do not support with SVE and we might use a
+     predicated epilogue, we need to be conservative and block unrolling as
+     this might lead to a less optimal loop for the first and only epilogue
+     using the original loop's vectorization factor.
+     TODO: Remove this constraint when we add support for multiple epilogue
+     vectorization.  */
+  if (!sve && !TARGET_SVE2 && m_has_avg)
+    return 1;
+
+  unsigned int max_unroll_factor = 1;
+  for (auto vec_ops : m_ops)
+    {
+      aarch64_simd_vec_issue_info const *vec_issue
+	= vec_ops.simd_issue_info ();
+      if (!vec_issue)
+	return 1;
+      /* Limit unroll factor to a value adjustable by the user, the default
+	 value is 4. */
+      unsigned int unroll_factor = aarch64_vect_unroll_limit;
+      unsigned int factor
+       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
+      unsigned int temp;
+
+      /* Sanity check, this should never happen.  */
+      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
+	return 1;
+
+      /* Check stores.  */
+      if (vec_ops.stores > 0)
+	{
+	  temp = CEIL (factor * vec_issue->stores_per_cycle,
+		       vec_ops.stores);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+
+      /* Check loads + stores.  */
+      if (vec_ops.loads > 0)
+	{
+	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
+		       vec_ops.loads + vec_ops.stores);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+
+      /* Check general ops.  */
+      if (vec_ops.general_ops > 0)
+	{
+	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
+		       vec_ops.general_ops);
+	  unroll_factor = MIN (unroll_factor, temp);
+	 }
+      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
+    }
+
+  /* Make sure unroll factor is power of 2.  */
+  return 1 << ceil_log2 (max_unroll_factor);
+}
+
 /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
    and return the new cost.  */
 unsigned int
@@ -16861,8 +16943,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
   if (loop_vinfo
       && m_vec_flags
       && aarch64_use_new_vector_costs_p ())
-    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
-					   m_costs[vect_body]);
+    {
+      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
+					     m_costs[vect_body]);
+      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
+    }
 
   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
      the scalar code in the event of a tie, since there is more chance
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 98ce9c0ab61709b29bd29f3853f025e3a5a1bef2..92220b26ee2bf9f95c9a387c3155779596ee5ad5 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -292,3 +292,7 @@ Constant memmove size in bytes above which to start using MOPS sequence.
 -param=aarch64-mops-memset-size-threshold=
 Target Joined UInteger Var(aarch64_mops_memset_size_threshold) Init(256) Param
 Constant memset size in bytes from which to start using MOPS sequence.
+
+-param=aarch64-vect-unroll-limit=
+Target Joined UInteger Var(aarch64_vect_unroll_limit) Init(4) Param
+Limit how much the autovectorizer may unroll a loop.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index fb3dec4ab0c37810d4409fa6da6a82f0154b0a3a..8c27d55d1edf8bc0f4df6fed7e1df3523bc9686c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -15174,6 +15174,12 @@ If this parameter is set to @var{n}, GCC will not use this heuristic
 for loops that are known to execute in fewer than @var{n} Advanced
 SIMD iterations.
 
+@item aarch64-vect-unroll-limit
+The vectorizer will use available tuning information to determine whether it
+would be beneficial to unroll the main vectorized loop and by how much.  This
+parameter set's the upper bound of how much the vectorizer will unroll the main
+loop.  The default value is four.
+
 @end table
 
 @end table
  

Patch

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -15680,6 +15680,7 @@  private:
   unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
 				 unsigned int);
   bool prefer_unrolled_loop () const;
+  unsigned int determine_suggested_unroll_factor ();
 
   /* True if we have performed one-time initialization based on the
      vec_info.  */
@@ -16768,6 +16769,105 @@  adjust_body_cost_sve (const aarch64_vec_op_count *ops,
   return sve_cycles_per_iter;
 }
 
+unsigned int
+aarch64_vector_costs::determine_suggested_unroll_factor ()
+{
+  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
+  if (!issue_info)
+    return 1;
+  bool sve = false;
+  if (aarch64_sve_mode_p (m_vinfo->vector_mode))
+    {
+      if (!issue_info->sve)
+	return 1;
+      sve = true;
+    }
+  else
+    {
+      if (!issue_info->advsimd)
+	return 1;
+      /* If we are trying to unroll a NEON main loop that contains patterns
+	 that we do not support with SVE and we might use a predicated
+	 epilogue, we need to be conservative and block unrolling as this might
+	 lead to a less optimal loop for the first and only epilogue using the
+	 original loop's vectorization factor.
+	 TODO: Remove this constraint when we add support for multiple epilogue
+	 vectorization.  */
+      if (partial_vectors_supported_p ()
+	  && param_vect_partial_vector_usage != 0
+	  && !TARGET_SVE2)
+	{
+	  unsigned int i;
+	  stmt_vec_info stmt_vinfo;
+	  FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
+	    {
+	      if (is_pattern_stmt_p (stmt_vinfo))
+		{
+		  gimple *stmt = stmt_vinfo->stmt;
+		  if (is_gimple_call (stmt)
+		      && gimple_call_internal_p (stmt))
+		    {
+		      enum internal_fn ifn
+			= gimple_call_internal_fn (stmt);
+		      switch (ifn)
+			{
+			case IFN_AVG_FLOOR:
+			case IFN_AVG_CEIL:
+			  return 1;
+			default:
+			  break;
+			}
+		    }
+		}
+	    }
+	}
+    }
+
+  unsigned int max_unroll_factor = 1;
+  aarch64_simd_vec_issue_info const *vec_issue
+    = sve ? issue_info->sve : issue_info->advsimd;
+  for (auto vec_ops : m_ops)
+    {
+      /* Limit unroll factor to 4 for now.  */
+      unsigned int unroll_factor = 4;
+      unsigned int factor
+       = vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
+      unsigned int temp;
+
+      /* Sanity check, this should never happen.  */
+      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
+	return 1;
+
+      /* Check stores.  */
+      if (vec_ops.stores > 0)
+	{
+	  temp = CEIL (factor * vec_issue->stores_per_cycle,
+		       vec_ops.stores);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+
+      /* Check loads.  */
+      if (vec_ops.loads > 0)
+	{
+	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
+		       vec_ops.loads);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+
+      /* Check general ops.  */
+      if (vec_ops.general_ops > 0)
+	{
+	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
+	               vec_ops.general_ops);
+	  unroll_factor = MIN (unroll_factor, temp);
+	 }
+      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
+    }
+
+  /* Make sure unroll factor is power of 2.  */
+  return 1 << ceil_log2 (max_unroll_factor);
+}
+
 /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
    and return the new cost.  */
 unsigned int
@@ -16904,8 +17004,11 @@  aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
   if (loop_vinfo
       && m_vec_flags
       && aarch64_use_new_vector_costs_p ())
-    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
-					   m_costs[vect_body]);
+    {
+      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
+					     m_costs[vect_body]);
+      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
+    }
 
   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
      the scalar code in the event of a tie, since there is more chance