AArch64: Take into account when VF is higher than known scalar iters

Message ID patch-18797-tamar@arm.com
State New
Series AArch64: Take into account when VF is higher than known scalar iters

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed

Commit Message

Tamar Christina Sept. 20, 2024, 12:06 p.m. UTC
  Hi All,

Consider low overhead loops like:

void
foo (char *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < 9; i++)
    {
      int res = c[i];
      int t = b[i];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}

For such loops we use latency-only costing since the loop bounds are known
and small.

The current costing, however, does not consider the case where niters < VF.

So when comparing the scalar vs vector costs it doesn't keep in mind that the
scalar code can't perform VF iterations.  This makes it overestimate the cost
of the scalar loop, and we incorrectly vectorize.

This patch takes the minimum of the VF and niters in such cases.
Before the patch we generate:

 note:  Original vector body cost = 46
 note:  Vector loop iterates at most 1 times
 note:  Scalar issue estimate:
 note:    load operations = 2
 note:    store operations = 1
 note:    general operations = 1
 note:    reduction latency = 0
 note:    estimated min cycles per iteration = 1.000000
 note:    estimated cycles per vector iteration (for VF 32) = 32.000000
 note:  SVE issue estimate:
 note:    load operations = 5
 note:    store operations = 4
 note:    general operations = 11
 note:    predicate operations = 12
 note:    reduction latency = 0
 note:    estimated min cycles per iteration without predication = 5.500000
 note:    estimated min cycles per iteration for predication = 12.000000
 note:    estimated min cycles per iteration = 12.000000
 note:  Low iteration count, so using pure latency costs
 note:  Cost model analysis:

vs after:

 note:  Original vector body cost = 46
 note:  Known loop bounds, capping VF to 9 for analysis
 note:  Vector loop iterates at most 1 times
 note:  Scalar issue estimate:
 note:    load operations = 2
 note:    store operations = 1
 note:    general operations = 1
 note:    reduction latency = 0
 note:    estimated min cycles per iteration = 1.000000
 note:    estimated cycles per vector iteration (for VF 9) = 9.000000
 note:  SVE issue estimate:
 note:    load operations = 5
 note:    store operations = 4
 note:    general operations = 11
 note:    predicate operations = 12
 note:    reduction latency = 0
 note:    estimated min cycles per iteration without predication = 5.500000
 note:    estimated min cycles per iteration for predication = 12.000000
 note:    estimated min cycles per iteration = 12.000000
 note:  Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations
 note:  Low iteration count, so using pure latency costs
 note:  Cost model analysis:
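
With the VF capped at 9, the estimated scalar cycles per vector iteration
drop from 32.0 to 9.0, which is within the limit imposed by the SVE predicate
operations (12.0 cycles per iteration), so the SVE body cost is raised to
1472 and this mode is no longer considered profitable.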

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.  This
also fixes the upcoming vectorization regression on exchange2 in SPEC CPU
2017.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (adjust_body_cost): Cap estimated_vf at the
	known number of scalar iterations.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/asrdiv_4.c: Update bounds.
	* gcc.target/aarch64/sve/cond_asrd_2.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_6.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_7.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_8.c: Likewise.
	* gcc.target/aarch64/sve/miniloop_1.c: Likewise.
	* gcc.target/aarch64/sve/spill_6.c: Likewise.
	* gcc.target/aarch64/sve/sve_iters_low_1.c: New test.
	* gcc.target/aarch64/sve/sve_iters_low_2.c: New test.

Comments

Richard Sandiford Sept. 20, 2024, 2:02 p.m. UTC | #1
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> Consider low overhead loops like:
>
> void
> foo (char *restrict a, int *restrict b, int *restrict c, int n)
> {
>   for (int i = 0; i < 9; i++)
>     {
>       int res = c[i];
>       int t = b[i];
>       if (a[i] != 0)
>         res = t;
>       c[i] = res;
>     }
> }

Eek.

> For such loops we use latency only costing since the loop bounds is known and
> small.
>
> The current costing however does not consider the case where niters < VF.
>
> So when comparing the scalar vs vector costs it doesn't keep in mind that the
> scalar code can't perform VF iterations.  This makes it overestimate the cost
> for the scalar loop and we incorrectly vectorize.

I don't think that in itself is the reason though.  niters < VF isn't
particularly exceptional for SVE.  The vector code is then costed as
performing 1 VF-element iteration plus the prologue (i.e. one full
vector iteration), whereas the scalar code is costed as doing niters
scalar iterations plus the prologue.  The lower the niters, the more
we should favour scalar code.
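
(Concretely, with the default-tuning numbers below, that comparison is
roughly 20 * 1 + 6 = 26 on the vector side against 4 * niters on the scalar
side, so the scalar code only wins once niters drops below 7.)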

I suspect instead it's a combination of:

- The latencies for the vector costs are too optimistic, especially given
  all the predicate overhead in the loop above, and especially for
  default tuning.  For default tuning I see:

  Vector inside of loop cost: 20
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7

  and grep -c '\(z[0-9]\|p[0-9]*\.\)' shows there are indeed 26
  SVE instructions.  But assuming a 1 cycle latency for every vector
  operation seems too idealistic (especially for loads).  It might be
  ok for "generic" (the "architectural intent" option), but not for
  the new architecture-level default tuning targets.

  I realise that isn't the whole story though, since -mcpu=neoverse-v2
  also vectorises.

- Having a threshold for latency costs based on niters is too simplistic,
  since the size of the loop matters too.  I think the intention was
  to capture cases where loops are so small that they are in practice
  always issued with surrounding code, so that the throughput of the
  loop itself is kind of irrelevant.  But this loop is big enough that
  throughput makes sense.

So my gut instinct is that we should instead tweak the condition for
using latency costs, but I'll need to think about it more when I get
back from holiday.

Another thing about the expansion is that we generate 4 .S instructions
for the int elements, even though it should be possible for later gimple
passes to fold the last one away at compile time.  That would lessen
the horror to some extent, but maybe not enough to make it an actual win.

Specifically:

        uqdecw  w0, all, mul #3                // this is always 0
        whilelo p7.s, wzr, w0                  // this is always a pfalse
        ld1w    z28.s, p7/z, [x1, #3, mul vl]  // this always loads zero
        and     p6.b, p15/z, p7.b, p7.b        // this is always a pfalse
        st1w    z28.s, p6, [x2, #3, mul vl]    // this is a no-op

One thing I remember considering in the past, but apparently never actually
did, was rejecting any vectorisation in which some vector instructions
are provably redundant (as above).  But perhaps in the end it seemed
that we should just be able to optimise them away later instead.

Thanks,
Richard

> This patch takes the minimum of the VF and niters in such cases.
> Before the patch we generate:
>
>  note:  Original vector body cost = 46
>  note:  Vector loop iterates at most 1 times
>  note:  Scalar issue estimate:
>  note:    load operations = 2
>  note:    store operations = 1
>  note:    general operations = 1
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration = 1.000000
>  note:    estimated cycles per vector iteration (for VF 32) = 32.000000
>  note:  SVE issue estimate:
>  note:    load operations = 5
>  note:    store operations = 4
>  note:    general operations = 11
>  note:    predicate operations = 12
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration without predication = 5.500000
>  note:    estimated min cycles per iteration for predication = 12.000000
>  note:    estimated min cycles per iteration = 12.000000
>  note:  Low iteration count, so using pure latency costs
>  note:  Cost model analysis:
>
> vs after:
>
>  note:  Original vector body cost = 46
>  note:  Known loop bounds, capping VF to 9 for analysis
>  note:  Vector loop iterates at most 1 times
>  note:  Scalar issue estimate:
>  note:    load operations = 2
>  note:    store operations = 1
>  note:    general operations = 1
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration = 1.000000
>  note:    estimated cycles per vector iteration (for VF 9) = 9.000000
>  note:  SVE issue estimate:
>  note:    load operations = 5
>  note:    store operations = 4
>  note:    general operations = 11
>  note:    predicate operations = 12
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration without predication = 5.500000
>  note:    estimated min cycles per iteration for predication = 12.000000
>  note:    estimated min cycles per iteration = 12.000000
>  note:  Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations
>  note:  Low iteration count, so using pure latency costs
>  note:  Cost model analysis:
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues, and fixes the
> upcoming vectorization regression on exchange2 in SPECCPU 2017.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (adjust_body_cost):
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/sve/asrdiv_4.c: Update bounds.
> 	* gcc.target/aarch64/sve/cond_asrd_2.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_6.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_7.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_8.c: Likewise.
> 	* gcc.target/aarch64/sve/miniloop_1.c: Likewise.
> 	* gcc.target/aarch64/sve/spill_6.c: Likewise.
> 	* gcc.target/aarch64/sve/sve_iters_low_1.c: New test.
> 	* gcc.target/aarch64/sve/sve_iters_low_2.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 6ccf08d1cc0a1aecfc72f95b105ace2c00b1a51d..afb58fd88795a26064c8c74f337324e3ecebc389 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17565,6 +17565,20 @@ adjust_body_cost (loop_vec_info loop_vinfo,
>      dump_printf_loc (MSG_NOTE, vect_location,
>  		     "Original vector body cost = %d\n", body_cost);
>  
> +  /* If the iteration count is known and low we use latency only calculation,
> +     however if the iteration count is lower than VF then the estimate for the
> +     scalar loops will be too high.  Cap it at NITERS.  */
> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    {
> +      auto niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
> +      if (niters < estimated_vf && dump_enabled_p ())
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "Scalar loop iterates at most %wd times.  Capping VF "
> +			 " from %d to %wd\n", niters, estimated_vf, niters);
> +
> +      estimated_vf = MIN (estimated_vf, niters);
> +    }
> +
>    fractional_cost scalar_cycles_per_iter
>      = scalar_ops.min_cycles_per_iter () * estimated_vf;
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c b/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> index 6684fe1c12441399faac941bdca79d90de1eb611..10a96a894afd690cf9eb521ae5195f6669e3e2aa 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> @@ -15,12 +15,12 @@
>    }
>  
>  #define TEST_ALL(T) \
> -  T (int16_t, int8_t, 7) \
> -  T (int32_t, int8_t, 3) \
> -  T (int32_t, int16_t, 3) \
> -  T (int64_t, int8_t, 5) \
> -  T (int64_t, int16_t, 5) \
> -  T (int64_t, int32_t, 5)
> +  T (int16_t, int8_t, 70) \
> +  T (int32_t, int8_t, 30) \
> +  T (int32_t, int16_t, 30) \
> +  T (int64_t, int8_t, 50) \
> +  T (int64_t, int16_t, 50) \
> +  T (int64_t, int32_t, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> index e4040ee3520c8dd50282e4aeaf4930c7f66c929c..db1721efbc7bd68f0954f1c76eb15ed75dc91cfb 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> @@ -14,12 +14,12 @@
>    }
>  
>  #define TEST_ALL(T) \
> -  T (int16_t, int8_t, 7) \
> -  T (int32_t, int8_t, 3) \
> -  T (int32_t, int16_t, 3) \
> -  T (int64_t, int8_t, 5) \
> -  T (int64_t, int16_t, 5) \
> -  T (int64_t, int32_t, 5)
> +  T (int16_t, int8_t, 70) \
> +  T (int32_t, int8_t, 30) \
> +  T (int32_t, int16_t, 30) \
> +  T (int64_t, int8_t, 50) \
> +  T (int64_t, int16_t, 50) \
> +  T (int64_t, int32_t, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> index e47276a3a352bd61d2c508167344a2e60e8bc84c..b8b3e862d0a1968ddf9f51278e6b3813e16a7665 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> @@ -14,11 +14,11 @@
>    }
>  
>  #define TEST_ALL(T)			\
> -  T (int32_t, uint16_t, 0xff, 3)	\
> +  T (int32_t, uint16_t, 0xff, 30)	\
>  					\
> -  T (int64_t, uint16_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> index f49915c4ac1430b0260589156d98b0793c78999f..2d02fb70f33fcf9a76c73b03c23f5a2d33e01b92 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> @@ -14,11 +14,11 @@
>    }
>  
>  #define TEST_ALL(T)			\
> -  T (int32_t, uint16_t, 0xff, 3)	\
> +  T (int32_t, uint16_t, 0xff, 30)	\
>  					\
> -  T (int64_t, uint16_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> index 42eb4b2661b3f152763ac2c8382877e5116dedd4..8fe2455687b5a310179c26c2c6d7c70b33101612 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> @@ -14,11 +14,11 @@
>    }
>  
>  #define TEST_ALL(T)			\
> -  T (int32_t, uint16_t, 0xff, 3)	\
> +  T (int32_t, uint16_t, 0xff, 30)	\
>  					\
> -  T (int64_t, uint16_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c b/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> index 09eb4146816cc51af5829f15b6c287aca086c382..cd1fd2b8a078a3a6aa9741aea0d10a7a420e137c 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> @@ -6,7 +6,7 @@ void loop (int * __restrict__ a, int * __restrict__ b, int * __restrict__ c,
>  	   int * __restrict__ g, int * __restrict__ h)
>  {
>    int i = 0;
> -  for (i = 0; i < 3; i++)
> +  for (i = 0; i < 30; i++)
>      {
>        a[i] += i;
>        b[i] += i;
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c b/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> index ae9c338f5696a9f743696b58f3d8e1dd991de501..2ff969ced00955403eaa1aea7223e209fdb6f139 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> @@ -11,20 +11,20 @@ void consumer (void *);
>    {									\
>      if (which)								\
>        {									\
> -	for (int i = 0; i < 7; ++i)					\
> +	for (int i = 0; i < 70; ++i)					\
>  	  x1[i] += VAL;							\
>  	consumer (x1);							\
> -	for (int i = 0; i < 7; ++i)					\
> +	for (int i = 0; i < 70; ++i)					\
>  	  x2[i] -= VAL;							\
>  	consumer (x2);							\
>        }									\
>      else								\
>        {									\
> -	for (int i = 0; i < 7; ++i)					\
> +	for (int i = 0; i < 70; ++i)					\
>  	  x3[i] &= VAL;							\
>  	consumer (x3);							\
>        }									\
> -    for (int i = 0; i < 7; ++i)						\
> +    for (int i = 0; i < 70; ++i)					\
>        x4[i] |= VAL;							\
>      consumer (x4);							\
>    }
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..952a4b1cd580c4eaed4c7c470cd1efba8e5e931d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Ofast -fdump-tree-vect-details" } */
> +
> +void
> +foo (char *restrict a, int *restrict b, int *restrict c, int n)
> +{
> +  for (int i = 0; i < 9; i++)
> +    {
> +      int res = c[i];
> +      int t = b[i];
> +      if (a[i] != 0)
> +        res = t;
> +      c[i] = res;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..02d10de2a6215d38728a84123d038c2db605b5a0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Ofast -fdump-tree-vect-details" } */
> +
> +void
> +foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
> +{
> +  if (stride <= 1)
> +    return;
> +
> +  for (int i = 0; i < 9; i++)
> +    {
> +      int res = c[i];
> +      int t = b[i*stride];
> +      if (a[i] != 0)
> +        res = t;
> +      c[i] = res;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
  
Tamar Christina Sept. 20, 2024, 2:13 p.m. UTC | #2
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, September 20, 2024 3:02 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH]AArch64: Take into account when VF is higher than known
> scalar iters
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > Consider low overhead loops like:
> >
> > void
> > foo (char *restrict a, int *restrict b, int *restrict c, int n)
> > {
> >   for (int i = 0; i < 9; i++)
> >     {
> >       int res = c[i];
> >       int t = b[i];
> >       if (a[i] != 0)
> >         res = t;
> >       c[i] = res;
> >     }
> > }
> 
> Eek.
> 
> > For such loops we use latency only costing since the loop bounds is known and
> > small.
> >
> > The current costing however does not consider the case where niters < VF.
> >
> > So when comparing the scalar vs vector costs it doesn't keep in mind that the
> > scalar code can't perform VF iterations.  This makes it overestimate the cost
> > for the scalar loop and we incorrectly vectorize.
> 
> I don't think that in itself is the reason though.  niters < VF isn't
> particularly exceptional for SVE.  The vector code is then costed as
> performing 1 VF-element iteration plus the prologue (i.e. one full
> vector iteration), whereas the scalar code is costed as doing niters
> scalar iterations plus the prologue.  The lower the niters, the more
> we should favour scalar code.
> 
> I suspect instead it's a combination of:
> 
> - The latency for the vector costs are too optimistic, especially given
>   all the predicate overhead in the loop above, and especially for
>   default tuning.  For default tuning I see:
> 

But it's not the latency in itself that's the problem.  The problem is
that we turn off the throughput-based comparison.

Yes, the code doesn't take the length of the sequence into account,
but equally latency-based calculations are going to be wrong as well,
as they don't take into account throughput and dependency chains.

>   Vector inside of loop cost: 20
>   Vector prologue cost: 6
>   Vector epilogue cost: 0
>   Scalar iteration cost: 4
>   Scalar outside cost: 0
>   Vector outside cost: 6
>   prologue iterations: 0
>   epilogue iterations: 0
>   Minimum number of vector iterations: 1
>   Calculated minimum iters for profitability: 7
> 
>   and grep -c '\(z[0-9]\|p[0-9]*\.\)' shows there are indeed 26
>   SVE instructions.  But assuming a 1 cycle latency for every vector
>   operation seems too idealistic (especially for loads).  It might be
>   ok for "generic" (the "architectural intent" option), but not for
>   the new architecture-level default tuning targets.
> 
>   I realise that isn't the whole story though, since -mcpu=neoverse-v2
>   also vectorises.
> 
> - Having a threshold for latency costs based on niters is too simplistic,
>   since the size of the loop matters too.  I think the intention was
>   to capture cases where loops are so small that they are in practice
>   always issued with surrounding code, so that the throughput of the
>   loop itself is kind of irrelevant.  But this loop is big enough that
>   throughput makes sense.

Then I think the pure latency-based calculation should be removed.
You can't model these kinds of operations based on latency alone.

> 
> So my gut instinct is that we should instead tweak the condition for
> using latency costs, but I'll need to think about it more when I get
> back from holiday.
> 

I think that's a separate problem.  From first principles it should already
be very wrong to compare the scalar loop to an iteration count it will
*NEVER* reach.  So I don't understand why that would ever be valid.

Particularly on a longer-VL machine the VF can be quite large.  So scalar will
always lose the way the current costing works, which doesn't take into account
at all how wide the scalar pipes are.

As I mentioned internally there's still more work to be done for the
general case where you have niters > VF.

And FTR the testcase above is still vectorized with this patch; what is not
vectorized is the gather version.  See the testcases.

Tamar

> Another thing about the expansion is that we generate 4 .S instructions
> for the int elements, even though it should be possible for later gimple
> passes to fold the last one away at compile time.  That would lessen
> the horror to some extent, but maybe not enough to make it an actual win.
> 
> Specifically:
> 
>         uqdecw  w0, all, mul #3                // this is always 0
>         whilelo p7.s, wzr, w0                  // this is always a pfalse
>         ld1w    z28.s, p7/z, [x1, #3, mul vl]  // this always loads zero
>         and     p6.b, p15/z, p7.b, p7.b        // this is always a pfalse
>         st1w    z28.s, p6, [x2, #3, mul vl]    // this is a no-op
> 
> One thing I remember considering in the past, but apparently never actually
> did, was rejecting any vectorisation in which some vector instructions
> are provably redundant (as above).  But perhaps in the end it seemed
> that we should just be able to optimise them away later instead.
> 
> Thanks,
> Richard
> 
> > This patch takes the minimum of the VF and niters in such cases.
> > Before the patch we generate:
> >
> >  note:  Original vector body cost = 46
> >  note:  Vector loop iterates at most 1 times
> >  note:  Scalar issue estimate:
> >  note:    load operations = 2
> >  note:    store operations = 1
> >  note:    general operations = 1
> >  note:    reduction latency = 0
> >  note:    estimated min cycles per iteration = 1.000000
> >  note:    estimated cycles per vector iteration (for VF 32) = 32.000000
> >  note:  SVE issue estimate:
> >  note:    load operations = 5
> >  note:    store operations = 4
> >  note:    general operations = 11
> >  note:    predicate operations = 12
> >  note:    reduction latency = 0
> >  note:    estimated min cycles per iteration without predication = 5.500000
> >  note:    estimated min cycles per iteration for predication = 12.000000
> >  note:    estimated min cycles per iteration = 12.000000
> >  note:  Low iteration count, so using pure latency costs
> >  note:  Cost model analysis:
> >
> > vs after:
> >
> >  note:  Original vector body cost = 46
> >  note:  Known loop bounds, capping VF to 9 for analysis
> >  note:  Vector loop iterates at most 1 times
> >  note:  Scalar issue estimate:
> >  note:    load operations = 2
> >  note:    store operations = 1
> >  note:    general operations = 1
> >  note:    reduction latency = 0
> >  note:    estimated min cycles per iteration = 1.000000
> >  note:    estimated cycles per vector iteration (for VF 9) = 9.000000
> >  note:  SVE issue estimate:
> >  note:    load operations = 5
> >  note:    store operations = 4
> >  note:    general operations = 11
> >  note:    predicate operations = 12
> >  note:    reduction latency = 0
> >  note:    estimated min cycles per iteration without predication = 5.500000
> >  note:    estimated min cycles per iteration for predication = 12.000000
> >  note:    estimated min cycles per iteration = 12.000000
> >  note:  Increasing body cost to 1472 because the scalar code could issue within
> the limit imposed by predicate operations
> >  note:  Low iteration count, so using pure latency costs
> >  note:  Cost model analysis:
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues, and fixes the
> > upcoming vectorization regression on exchange2 in SPECCPU 2017.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64.cc (adjust_body_cost):
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/sve/asrdiv_4.c: Update bounds.
> > 	* gcc.target/aarch64/sve/cond_asrd_2.c: Likewise.
> > 	* gcc.target/aarch64/sve/cond_uxt_6.c: Likewise.
> > 	* gcc.target/aarch64/sve/cond_uxt_7.c: Likewise.
> > 	* gcc.target/aarch64/sve/cond_uxt_8.c: Likewise.
> > 	* gcc.target/aarch64/sve/miniloop_1.c: Likewise.
> > 	* gcc.target/aarch64/sve/spill_6.c: Likewise.
> > 	* gcc.target/aarch64/sve/sve_iters_low_1.c: New test.
> > 	* gcc.target/aarch64/sve/sve_iters_low_2.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index
> 6ccf08d1cc0a1aecfc72f95b105ace2c00b1a51d..afb58fd88795a26064c8c74f337
> 324e3ecebc389 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -17565,6 +17565,20 @@ adjust_body_cost (loop_vec_info loop_vinfo,
> >      dump_printf_loc (MSG_NOTE, vect_location,
> >  		     "Original vector body cost = %d\n", body_cost);
> >
> > +  /* If the iteration count is known and low we use latency only calculation,
> > +     however if the iteration count is lower than VF then the estimate for the
> > +     scalar loops will be too high.  Cap it at NITERS.  */
> > +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> > +    {
> > +      auto niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
> > +      if (niters < estimated_vf && dump_enabled_p ())
> > +	dump_printf_loc (MSG_NOTE, vect_location,
> > +			 "Scalar loop iterates at most %wd times.  Capping VF "
> > +			 " from %d to %wd\n", niters, estimated_vf, niters);
> > +
> > +      estimated_vf = MIN (estimated_vf, niters);
> > +    }
> > +
> >    fractional_cost scalar_cycles_per_iter
> >      = scalar_ops.min_cycles_per_iter () * estimated_vf;
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> b/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> > index
> 6684fe1c12441399faac941bdca79d90de1eb611..10a96a894afd690cf9eb521ae
> 5195f6669e3e2aa 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> > @@ -15,12 +15,12 @@
> >    }
> >
> >  #define TEST_ALL(T) \
> > -  T (int16_t, int8_t, 7) \
> > -  T (int32_t, int8_t, 3) \
> > -  T (int32_t, int16_t, 3) \
> > -  T (int64_t, int8_t, 5) \
> > -  T (int64_t, int16_t, 5) \
> > -  T (int64_t, int32_t, 5)
> > +  T (int16_t, int8_t, 70) \
> > +  T (int32_t, int8_t, 30) \
> > +  T (int32_t, int16_t, 30) \
> > +  T (int64_t, int8_t, 50) \
> > +  T (int64_t, int16_t, 50) \
> > +  T (int64_t, int32_t, 50)
> >
> >  TEST_ALL (DEF_LOOP)
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> > index
> e4040ee3520c8dd50282e4aeaf4930c7f66c929c..db1721efbc7bd68f0954f1c76
> eb15ed75dc91cfb 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> > @@ -14,12 +14,12 @@
> >    }
> >
> >  #define TEST_ALL(T) \
> > -  T (int16_t, int8_t, 7) \
> > -  T (int32_t, int8_t, 3) \
> > -  T (int32_t, int16_t, 3) \
> > -  T (int64_t, int8_t, 5) \
> > -  T (int64_t, int16_t, 5) \
> > -  T (int64_t, int32_t, 5)
> > +  T (int16_t, int8_t, 70) \
> > +  T (int32_t, int8_t, 30) \
> > +  T (int32_t, int16_t, 30) \
> > +  T (int64_t, int8_t, 50) \
> > +  T (int64_t, int16_t, 50) \
> > +  T (int64_t, int32_t, 50)
> >
> >  TEST_ALL (DEF_LOOP)
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> > index
> e47276a3a352bd61d2c508167344a2e60e8bc84c..b8b3e862d0a1968ddf9f512
> 78e6b3813e16a7665 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> > @@ -14,11 +14,11 @@
> >    }
> >
> >  #define TEST_ALL(T)			\
> > -  T (int32_t, uint16_t, 0xff, 3)	\
> > +  T (int32_t, uint16_t, 0xff, 30)	\
> >  					\
> > -  T (int64_t, uint16_t, 0xff, 5)	\
> > -  T (int64_t, uint32_t, 0xff, 5)	\
> > -  T (int64_t, uint32_t, 0xffff, 5)
> > +  T (int64_t, uint16_t, 0xff, 50)	\
> > +  T (int64_t, uint32_t, 0xff, 50)	\
> > +  T (int64_t, uint32_t, 0xffff, 50)
> >
> >  TEST_ALL (DEF_LOOP)
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> > index
> f49915c4ac1430b0260589156d98b0793c78999f..2d02fb70f33fcf9a76c73b03c
> 23f5a2d33e01b92 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> > @@ -14,11 +14,11 @@
> >    }
> >
> >  #define TEST_ALL(T)			\
> > -  T (int32_t, uint16_t, 0xff, 3)	\
> > +  T (int32_t, uint16_t, 0xff, 30)	\
> >  					\
> > -  T (int64_t, uint16_t, 0xff, 5)	\
> > -  T (int64_t, uint32_t, 0xff, 5)	\
> > -  T (int64_t, uint32_t, 0xffff, 5)
> > +  T (int64_t, uint16_t, 0xff, 50)	\
> > +  T (int64_t, uint32_t, 0xff, 50)	\
> > +  T (int64_t, uint32_t, 0xffff, 50)
> >
> >  TEST_ALL (DEF_LOOP)
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> > index
> 42eb4b2661b3f152763ac2c8382877e5116dedd4..8fe2455687b5a310179c26c
> 2c6d7c70b33101612 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> > @@ -14,11 +14,11 @@
> >    }
> >
> >  #define TEST_ALL(T)			\
> > -  T (int32_t, uint16_t, 0xff, 3)	\
> > +  T (int32_t, uint16_t, 0xff, 30)	\
> >  					\
> > -  T (int64_t, uint16_t, 0xff, 5)	\
> > -  T (int64_t, uint32_t, 0xff, 5)	\
> > -  T (int64_t, uint32_t, 0xffff, 5)
> > +  T (int64_t, uint16_t, 0xff, 50)	\
> > +  T (int64_t, uint32_t, 0xff, 50)	\
> > +  T (int64_t, uint32_t, 0xffff, 50)
> >
> >  TEST_ALL (DEF_LOOP)
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> b/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> > index
> 09eb4146816cc51af5829f15b6c287aca086c382..cd1fd2b8a078a3a6aa9741aea
> 0d10a7a420e137c 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> > @@ -6,7 +6,7 @@ void loop (int * __restrict__ a, int * __restrict__ b, int *
> __restrict__ c,
> >  	   int * __restrict__ g, int * __restrict__ h)
> >  {
> >    int i = 0;
> > -  for (i = 0; i < 3; i++)
> > +  for (i = 0; i < 30; i++)
> >      {
> >        a[i] += i;
> >        b[i] += i;
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> b/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> > index
> ae9c338f5696a9f743696b58f3d8e1dd991de501..2ff969ced00955403eaa1aea7
> 223e209fdb6f139 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> > @@ -11,20 +11,20 @@ void consumer (void *);
> >    {									\
> >      if (which)								\
> >        {									\
> > -	for (int i = 0; i < 7; ++i)					\
> > +	for (int i = 0; i < 70; ++i)					\
> >  	  x1[i] += VAL;							\
> >  	consumer (x1);							\
> > -	for (int i = 0; i < 7; ++i)					\
> > +	for (int i = 0; i < 70; ++i)					\
> >  	  x2[i] -= VAL;							\
> >  	consumer (x2);							\
> >        }									\
> >      else								\
> >        {									\
> > -	for (int i = 0; i < 7; ++i)					\
> > +	for (int i = 0; i < 70; ++i)					\
> >  	  x3[i] &= VAL;							\
> >  	consumer (x3);							\
> >        }									\
> > -    for (int i = 0; i < 7; ++i)						\
> > +    for (int i = 0; i < 70; ++i)					\
> >        x4[i] |= VAL;							\
> >      consumer (x4);							\
> >    }
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..952a4b1cd580c4eaed4c7c
> 470cd1efba8e5e931d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-march=armv9-a -Ofast -fdump-tree-vect-details" }
> */
> > +
> > +void
> > +foo (char *restrict a, int *restrict b, int *restrict c, int n)
> > +{
> > +  for (int i = 0; i < 9; i++)
> > +    {
> > +      int res = c[i];
> > +      int t = b[i];
> > +      if (a[i] != 0)
> > +        res = t;
> > +      c[i] = res;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..02d10de2a6215d38728a8
> 4123d038c2db605b5a0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-march=armv9-a -Ofast -fdump-tree-vect-details" }
> */
> > +
> > +void
> > +foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
> > +{
> > +  if (stride <= 1)
> > +    return;
> > +
> > +  for (int i = 0; i < 9; i++)
> > +    {
> > +      int res = c[i];
> > +      int t = b[i*stride];
> > +      if (a[i] != 0)
> > +        res = t;
> > +      c[i] = res;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
  
Richard Sandiford Sept. 20, 2024, 2:47 p.m. UTC | #3
Tamar Christina <Tamar.Christina@arm.com> writes:
>> 
>> So my gut instinct is that we should instead tweak the condition for
>> using latency costs, but I'll need to think about it more when I get
>> back from holiday.
>> 
>
> I think that's a separate problem.. From first principals it should already
> be very wrong to compare the scalar loop to an iteration count it will
> *NEVER* reach.  So I don't understand why that would ever be valid.

But I don't think we're doing that, or at least, not as the final result.
Instead, we first calculate the minimum number of vector iterations for
which the vector loop is sometimes profitable.  If this is N, then we're
saying that the vector code is better than the scalar code for N*VF
iterations.  Like you say, this part ignores whether N*VF is actually
achievable.  But then:

	  /* Now that we know the minimum number of vector iterations,
	     find the minimum niters for which the scalar cost is larger:

	     SIC * niters > VIC * vniters + VOC - SOC

	     We know that the minimum niters is no more than
	     vniters * VF + NPEEL, but it might be (and often is) less
	     than that if a partial vector iteration is cheaper than the
	     equivalent scalar code.  */
	  int threshold = (vec_inside_cost * min_vec_niters
			   + vec_outside_cost
			   - scalar_outside_cost);
	  if (threshold <= 0)
	    min_profitable_iters = 1;
	  else
	    min_profitable_iters = threshold / scalar_single_iter_cost + 1;

calculates which number of iterations in the range [(N-1)*VF + 1, N*VF]
is the first to be profitable.  This is specifically taking partial
iterations into account and includes the N==1 case.  The lower niters is,
the easier it is for the scalar code to win.
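
(Plugging in the numbers from the earlier dump: threshold = 20 * 1 + 6 - 0
= 26, and 26 / 4 + 1 = 7.)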

This is what is printed as:

  Calculated minimum iters for profitability: 7

So we think that vectorisation should be rejected if the loop count
is <= 6, but accepted if it's >= 7.

So I think the costing framework is set up to handle niters<VF correctly
on first principles.  It's "just" that the numbers being fed in give the
wrong answer in this case.

Thanks,
Richard
  
Tamar Christina Sept. 20, 2024, 3:35 p.m. UTC | #4
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Friday, September 20, 2024 3:48 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH]AArch64: Take into account when VF is higher than known
> scalar iters
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >>
> >> So my gut instinct is that we should instead tweak the condition for
> >> using latency costs, but I'll need to think about it more when I get
> >> back from holiday.
> >>
> >
> > I think that's a separate problem.. From first principals it should already
> > be very wrong to compare the scalar loop to an iteration count it will
> > *NEVER* reach.  So I don't understand why that would ever be valid.
> 
> But I don't think we're doing that, or at least, not as the final result.
> Instead, we first calculate the minimum number of vector iterations for
> which the vector loop is sometimes profitable.  If this is N, then we're
> saying that the vector code is better than the scalar code for N*VF
> iterations.  Like you say, this part ignores whether N*VF is actually
> achievable.  But then:
> 
> 	  /* Now that we know the minimum number of vector iterations,
> 	     find the minimum niters for which the scalar cost is larger:
> 
> 	     SIC * niters > VIC * vniters + VOC - SOC
> 
> 	     We know that the minimum niters is no more than
> 	     vniters * VF + NPEEL, but it might be (and often is) less
> 	     than that if a partial vector iteration is cheaper than the
> 	     equivalent scalar code.  */
> 	  int threshold = (vec_inside_cost * min_vec_niters
> 			   + vec_outside_cost
> 			   - scalar_outside_cost);
> 	  if (threshold <= 0)
> 	    min_profitable_iters = 1;
> 	  else
> 	    min_profitable_iters = threshold / scalar_single_iter_cost + 1;
> 
> calculates which number of iterations in the range [(N-1)*VF + 1, N*VF]
> is the first to be profitable.  This is specifically taking partial
> iterations into account and includes the N==1 case.  The lower niters is,
> the easier it is for the scalar code to win.
> 
> This is what is printed as:
> 
>   Calculated minimum iters for profitability: 7
> 
> So we think that vectorisation should be rejected if the loop count
> is <= 6, but accepted if it's >= 7.

This 7 is the vector iteration count.

  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/app/example.c:4:21: note:    Runtime profitability threshold = 7
/app/example.c:4:21: note:    Static estimate profitability threshold = 7

Which says the vector code has to iterate at least 7 times for it to be profitable.
This on its own is already impossible.  Even at the smallest VL possible for SVE you
can never reach 7 iterations.  At 7 iterations you've processed more elements than
you know are in the loop.  Since every scalar iteration processes 1 array element and
niters is 9, 7 vector iterations is an impossible situation.

So to me, you are essentially quoting back what I said, but differently.  You cannot
ever reach 7 vector iterations.  This vector code will *NEVER* be profitable.

Furthermore, the SIMD unrolling code also gets in the way:

  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 2
/app/example.c:4:21: note:    Runtime profitability threshold = 2
/app/example.c:4:21: note:    Static estimate profitability threshold = 2
/app/example.c:4:21: note:  ***** Analysis succeeded with vector mode VNx4QI
/app/example.c:4:21: note:  Comparing two main loops (VNx4QI at VF 4 vs VNx4SI at VF 16)
/app/example.c:4:21: note:  Number of insns in unrolled Advanced SIMD loop = 8
/app/example.c:4:21: note:  Preferring Advanced SIMD loop because it can be unrolled

And because of this it doesn't actually do any comparisons. That's why we end up with such
horrible SVE codegen.  It assumes Adv. SIMD will be taken but doesn't actually know if
Adv. SIMD is even possible... And sure enough, the vectorizer itself does the sensible thing and
blocks most Adv. SIMD modes in generic code:

/app/example.c:4:21: missed:  not vectorized: iteration count smaller than vectorization factor.
/app/example.c:4:21: missed:  Loop costings not worthwhile.
/app/example.c:4:21: note:  ***** Analysis failed with vector mode V16QI

And the smaller modes lose based on costing.  So then the first SVE mode tested is picked.
This is an essentially arbitrary selection.

The patch works by disqualifying those modes where you have this impossibility.  You can view it
instead as I phrased it above: the result of the computation is an impossibility at any SVE VL.
Even at the lowest VL your last 4 iterations are with a pfalse predicate, even if you somehow get there.

The vectorizer knows this:

/app/example.c:4:21: note:  Original vector body cost = 46
/app/example.c:4:21: note:  Vector loop iterates at most 1 times
...
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/app/example.c:4:21: note:    Runtime profitability threshold = 7
/app/example.c:4:21: note:    Static estimate profitability threshold = 7

So clearly this mode should never be used.  But instead of rejecting the mode we continue with it
and just disable the throughput-based calculations:

  if (m_num_vector_iterations >= 1
      && m_num_vector_iterations < threshold)
    {
      if (dump_enabled_p ())
	dump_printf_loc (MSG_NOTE, vect_location,
			 "Low iteration count, so using pure latency"
			 " costs\n");
    }

But clearly, looking at latency here is wrong.  I assumed the reason we do this is that the throughput-based
model may have been too conservative.  But the issue with ignoring throughput here is that you
then also incorrectly model predicate latency.  As I mentioned already, the pure latency-based calculation
is inaccurate in that it just looks at the max of the pred or vector ops.  But the pred ops are required for the
vector ops.  So the relationship cannot be a MAX.

And the code admits this is simplistic:

/* Estimate the minimum number of cycles needed to issue the operations.
   This is a very simplistic model!  */
fractional_cost
aarch64_vec_op_count::min_cycles_per_iter () const
{
  return std::max (min_nonpred_cycles_per_iter (),
		   min_pred_cycles_per_iter ());
}

So this vastly underestimates the vector loop.
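
To illustrate (purely as a hypothetical sketch, not something this patch
proposes): since the predicate ops feed the vector ops, a less optimistic
combination would be something closer to

fractional_cost
aarch64_vec_op_count::min_cycles_per_iter () const
{
  /* Hypothetical sketch only: account for the predicate ops that the vector
     ops depend on, instead of taking whichever of the two is larger.  */
  return min_nonpred_cycles_per_iter () + min_pred_cycles_per_iter ();
}

which for the dump above would give 5.5 + 12.0 = 17.5 cycles per iteration
instead of 12.0.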

The reason it doesn't help generic code is that, as we already discussed internally, the Advanced SIMD
unrolling code needs to compare fractional VFs before punting to Adv. SIMD, so that there's at least some
ordering when Adv. SIMD isn't usable and it falls back to SVE.

It works for specific CPU models because those aren't truly scalable.  E.g. Neoverse-V1 with the patch generates:

.L2:
        ld1b    z30.s, p7/z, [x0, x3]
        ld1w    z31.s, p7/z, [x1, x3, lsl 2]
        cmpne   p6.b, p7/z, z30.b, #0
        st1w    z31.s, p6, [x2, x3, lsl 2]
        add     x3, x3, x5
        whilelo p7.s, w3, w4
        b.any   .L2

because it disqualified the impossible SVE modes.  And given that the entire model is approximate, and that
based purely on latency it's hard to model the interaction between predicate ops and vector ops, this made sense.

I thought we had agreed before that this was the correct thing to do;
I guess I misunderstood the outcome of that conversation.

Tamar

> 
> So I think the costing framework is set up to handle niters<VF correctly
> on first principles.  It's "just" that the numbers being fed in give the
> wrong answer in this case.
> 
> Thanks,
> Richard
  
Richard Sandiford Sept. 20, 2024, 3:44 p.m. UTC | #5
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Friday, September 20, 2024 3:48 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
>> Subject: Re: [PATCH]AArch64: Take into account when VF is higher than known
>> scalar iters
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >>
>> >> So my gut instinct is that we should instead tweak the condition for
>> >> using latency costs, but I'll need to think about it more when I get
>> >> back from holiday.
>> >>
>> >
>> > I think that's a separate problem.. From first principals it should already
>> > be very wrong to compare the scalar loop to an iteration count it will
>> > *NEVER* reach.  So I don't understand why that would ever be valid.
>> 
>> But I don't think we're doing that, or at least, not as the final result.
>> Instead, we first calculate the minimum number of vector iterations for
>> which the vector loop is sometimes profitable.  If this is N, then we're
>> saying that the vector code is better than the scalar code for N*VF
>> iterations.  Like you say, this part ignores whether N*VF is actually
>> achievable.  But then:
>> 
>> 	  /* Now that we know the minimum number of vector iterations,
>> 	     find the minimum niters for which the scalar cost is larger:
>> 
>> 	     SIC * niters > VIC * vniters + VOC - SOC
>> 
>> 	     We know that the minimum niters is no more than
>> 	     vniters * VF + NPEEL, but it might be (and often is) less
>> 	     than that if a partial vector iteration is cheaper than the
>> 	     equivalent scalar code.  */
>> 	  int threshold = (vec_inside_cost * min_vec_niters
>> 			   + vec_outside_cost
>> 			   - scalar_outside_cost);
>> 	  if (threshold <= 0)
>> 	    min_profitable_iters = 1;
>> 	  else
>> 	    min_profitable_iters = threshold / scalar_single_iter_cost + 1;
>> 
>> calculates which number of iterations in the range [(N-1)*VF + 1, N*VF]
>> is the first to be profitable.  This is specifically taking partial
>> iterations into account and includes the N==1 case.  The lower niters is,
>> the easier it is for the scalar code to win.
>> 
>> This is what is printed as:
>> 
>>   Calculated minimum iters for profitability: 7
>> 
>> So we think that vectorisation should be rejected if the loop count
>> is <= 6, but accepted if it's >= 7.
>
> This 7 is the vector iteration count.
>
>   epilogue iterations: 0
>   Minimum number of vector iterations: 1
>   Calculated minimum iters for profitability: 7
> /app/example.c:4:21: note:    Runtime profitability threshold = 7
> /app/example.c:4:21: note:    Static estimate profitability threshold = 7
>
> Which says the vector code has to iterate at least 7 iteration for it to be profitable.

It doesn't though:

>   Minimum number of vector iterations: 1

This is in vector iterations but:

>   Calculated minimum iters for profitability: 7

This is in scalar iterations.  (Yes, it would be nice if the dump line
was more explicit. :))

This is why, if we change the loop count to 7 rather than 9:

  for (int i = 0; i < 7; i++)

we still get:

/tmp/foo.c:4:21: note:  Cost model analysis:
  Vector inside of loop cost: 20
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/tmp/foo.c:4:21: note:    Runtime profitability threshold = 7
/tmp/foo.c:4:21: note:    Static estimate profitability threshold = 7
/tmp/foo.c:4:21: note:  ***** Analysis succeeded with vector mode VNx4SI

But if we change it to 6:

  for (int i = 0; i < 6; i++)

we get:

/tmp/foo.c:4:21: note:  Cost model analysis:
  Vector inside of loop cost: 20
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/tmp/foo.c:4:21: note:    Runtime profitability threshold = 7
/tmp/foo.c:4:21: note:    Static estimate profitability threshold = 7
/tmp/foo.c:4:21: missed:  not vectorized: vectorization not profitable.
/tmp/foo.c:4:21: note:  not vectorized: iteration count smaller than user specified loop bound parameter or minimum profitable iterations (whichever is more conservative).
/tmp/foo.c:4:21: missed:  Loop costings not worthwhile.
/tmp/foo.c:4:21: note:  ***** Analysis failed with vector mode VNx4SI

Thanks,
Richard
  
Richard Sandiford Sept. 20, 2024, 4:30 p.m. UTC | #6
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> Consider low overhead loops like:
>
> void
> foo (char *restrict a, int *restrict b, int *restrict c, int n)
> {
>   for (int i = 0; i < 9; i++)
>     {
>       int res = c[i];
>       int t = b[i];
>       if (a[i] != 0)
>         res = t;
>       c[i] = res;
>     }
> }
>
> For such loops we use latency only costing since the loop bounds is known and
> small.
>
> The current costing however does not consider the case where niters < VF.
>
> So when comparing the scalar vs vector costs it doesn't keep in mind that the
> scalar code can't perform VF iterations.  This makes it overestimate the cost
> for the scalar loop and we incorrectly vectorize.
>
> This patch takes the minimum of the VF and niters in such cases.
> Before the patch we generate:
>
>  note:  Original vector body cost = 46
>  note:  Vector loop iterates at most 1 times
>  note:  Scalar issue estimate:
>  note:    load operations = 2
>  note:    store operations = 1
>  note:    general operations = 1
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration = 1.000000
>  note:    estimated cycles per vector iteration (for VF 32) = 32.000000
>  note:  SVE issue estimate:
>  note:    load operations = 5
>  note:    store operations = 4
>  note:    general operations = 11
>  note:    predicate operations = 12
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration without predication = 5.500000
>  note:    estimated min cycles per iteration for predication = 12.000000
>  note:    estimated min cycles per iteration = 12.000000
>  note:  Low iteration count, so using pure latency costs
>  note:  Cost model analysis:
>
> vs after:
>
>  note:  Original vector body cost = 46
>  note:  Known loop bounds, capping VF to 9 for analysis
>  note:  Vector loop iterates at most 1 times
>  note:  Scalar issue estimate:
>  note:    load operations = 2
>  note:    store operations = 1
>  note:    general operations = 1
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration = 1.000000
>  note:    estimated cycles per vector iteration (for VF 9) = 9.000000
>  note:  SVE issue estimate:
>  note:    load operations = 5
>  note:    store operations = 4
>  note:    general operations = 11
>  note:    predicate operations = 12
>  note:    reduction latency = 0
>  note:    estimated min cycles per iteration without predication = 5.500000
>  note:    estimated min cycles per iteration for predication = 12.000000
>  note:    estimated min cycles per iteration = 12.000000
>  note:  Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations
>  note:  Low iteration count, so using pure latency costs
>  note:  Cost model analysis:
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues, and fixes the
> upcoming vectorization regression on exchange2 in SPECCPU 2017.

Sorry, I'd forgotten about the earlier internal conversation.
I went back over that and:

> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (adjust_body_cost):
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/sve/asrdiv_4.c: Update bounds.
> 	* gcc.target/aarch64/sve/cond_asrd_2.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_6.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_7.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_uxt_8.c: Likewise.
> 	* gcc.target/aarch64/sve/miniloop_1.c: Likewise.
> 	* gcc.target/aarch64/sve/spill_6.c: Likewise.
> 	* gcc.target/aarch64/sve/sve_iters_low_1.c: New test.
> 	* gcc.target/aarch64/sve/sve_iters_low_2.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 6ccf08d1cc0a1aecfc72f95b105ace2c00b1a51d..afb58fd88795a26064c8c74f337324e3ecebc389 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17565,6 +17565,20 @@ adjust_body_cost (loop_vec_info loop_vinfo,
>      dump_printf_loc (MSG_NOTE, vect_location,
>  		     "Original vector body cost = %d\n", body_cost);
>  
> +  /* If the iteration count is known and low we use latency only calculation,
> +     however if the iteration count is lower than VF then the estimate for the
> +     scalar loops will be too high.  Cap it at NITERS.  */

...I don't think this is related to latency costs per se (which was
what was confusing me).  It's instead that we want the port-specific:

  /* If the scalar version of the loop could issue at least as
     quickly as the predicate parts of the SVE loop, make the SVE loop
     prohibitively expensive.  In this case vectorization is adding an
     overhead that the original scalar code didn't have.

     This is mostly intended to detect cases in which WHILELOs dominate
     for very tight loops, which is something that normal latency-based
     costs would not model.  Adding this kind of cliffedge would be
     too drastic for scalar_cycles_per_iter vs. sve_cycles_per_iter;
     code in the caller handles that case in a more conservative way.  */
  fractional_cost sve_estimate = sve_pred_cycles_per_iter + 1;
  if (scalar_cycles_per_iter < sve_estimate)

to kick in for loops like this, and it's not doing so because
scalar_cycles_per_iter is too high.  (AFAIK, this is the only thing that
depends on scalar_cycles_per_iter when latency costs are being used,
but it isn't itself specific to or related to latency costs.)
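
(With the VF capped at 9, scalar_cycles_per_iter becomes 1.0 * 9 = 9.0, which
is below sve_estimate = 12.0 + 1, so this check now fires and raises the body
cost to 1472, as the new dump shows.)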

How about simply:

  /* If we know we have a single partial vector iteration, cap the VF
     to the number of scalar iterations for costing purposes.  */

I realise all I'm really doing here is removing the reference to "latency
costs", but like I say, that's what confused me. :)

OK with that change, and sorry again for the confusion.

Richard

> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    {
> +      auto niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
> +      if (niters < estimated_vf && dump_enabled_p ())
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "Scalar loop iterates at most %wd times.  Capping VF "
> +			 " from %d to %wd\n", niters, estimated_vf, niters);
> +
> +      estimated_vf = MIN (estimated_vf, niters);
> +    }
> +
>    fractional_cost scalar_cycles_per_iter
>      = scalar_ops.min_cycles_per_iter () * estimated_vf;
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c b/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> index 6684fe1c12441399faac941bdca79d90de1eb611..10a96a894afd690cf9eb521ae5195f6669e3e2aa 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c
> @@ -15,12 +15,12 @@
>    }
>  
>  #define TEST_ALL(T) \
> -  T (int16_t, int8_t, 7) \
> -  T (int32_t, int8_t, 3) \
> -  T (int32_t, int16_t, 3) \
> -  T (int64_t, int8_t, 5) \
> -  T (int64_t, int16_t, 5) \
> -  T (int64_t, int32_t, 5)
> +  T (int16_t, int8_t, 70) \
> +  T (int32_t, int8_t, 30) \
> +  T (int32_t, int16_t, 30) \
> +  T (int64_t, int8_t, 50) \
> +  T (int64_t, int16_t, 50) \
> +  T (int64_t, int32_t, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> index e4040ee3520c8dd50282e4aeaf4930c7f66c929c..db1721efbc7bd68f0954f1c76eb15ed75dc91cfb 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c
> @@ -14,12 +14,12 @@
>    }
>  
>  #define TEST_ALL(T) \
> -  T (int16_t, int8_t, 7) \
> -  T (int32_t, int8_t, 3) \
> -  T (int32_t, int16_t, 3) \
> -  T (int64_t, int8_t, 5) \
> -  T (int64_t, int16_t, 5) \
> -  T (int64_t, int32_t, 5)
> +  T (int16_t, int8_t, 70) \
> +  T (int32_t, int8_t, 30) \
> +  T (int32_t, int16_t, 30) \
> +  T (int64_t, int8_t, 50) \
> +  T (int64_t, int16_t, 50) \
> +  T (int64_t, int32_t, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> index e47276a3a352bd61d2c508167344a2e60e8bc84c..b8b3e862d0a1968ddf9f51278e6b3813e16a7665 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c
> @@ -14,11 +14,11 @@
>    }
>  
>  #define TEST_ALL(T)			\
> -  T (int32_t, uint16_t, 0xff, 3)	\
> +  T (int32_t, uint16_t, 0xff, 30)	\
>  					\
> -  T (int64_t, uint16_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> index f49915c4ac1430b0260589156d98b0793c78999f..2d02fb70f33fcf9a76c73b03c23f5a2d33e01b92 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c
> @@ -14,11 +14,11 @@
>    }
>  
>  #define TEST_ALL(T)			\
> -  T (int32_t, uint16_t, 0xff, 3)	\
> +  T (int32_t, uint16_t, 0xff, 30)	\
>  					\
> -  T (int64_t, uint16_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> index 42eb4b2661b3f152763ac2c8382877e5116dedd4..8fe2455687b5a310179c26c2c6d7c70b33101612 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c
> @@ -14,11 +14,11 @@
>    }
>  
>  #define TEST_ALL(T)			\
> -  T (int32_t, uint16_t, 0xff, 3)	\
> +  T (int32_t, uint16_t, 0xff, 30)	\
>  					\
> -  T (int64_t, uint16_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xff, 5)	\
> -  T (int64_t, uint32_t, 0xffff, 5)
> +  T (int64_t, uint16_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xff, 50)	\
> +  T (int64_t, uint32_t, 0xffff, 50)
>  
>  TEST_ALL (DEF_LOOP)
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c b/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> index 09eb4146816cc51af5829f15b6c287aca086c382..cd1fd2b8a078a3a6aa9741aea0d10a7a420e137c 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c
> @@ -6,7 +6,7 @@ void loop (int * __restrict__ a, int * __restrict__ b, int * __restrict__ c,
>  	   int * __restrict__ g, int * __restrict__ h)
>  {
>    int i = 0;
> -  for (i = 0; i < 3; i++)
> +  for (i = 0; i < 30; i++)
>      {
>        a[i] += i;
>        b[i] += i;
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c b/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> index ae9c338f5696a9f743696b58f3d8e1dd991de501..2ff969ced00955403eaa1aea7223e209fdb6f139 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/spill_6.c
> @@ -11,20 +11,20 @@ void consumer (void *);
>    {									\
>      if (which)								\
>        {									\
> -	for (int i = 0; i < 7; ++i)					\
> +	for (int i = 0; i < 70; ++i)					\
>  	  x1[i] += VAL;							\
>  	consumer (x1);							\
> -	for (int i = 0; i < 7; ++i)					\
> +	for (int i = 0; i < 70; ++i)					\
>  	  x2[i] -= VAL;							\
>  	consumer (x2);							\
>        }									\
>      else								\
>        {									\
> -	for (int i = 0; i < 7; ++i)					\
> +	for (int i = 0; i < 70; ++i)					\
>  	  x3[i] &= VAL;							\
>  	consumer (x3);							\
>        }									\
> -    for (int i = 0; i < 7; ++i)						\
> +    for (int i = 0; i < 70; ++i)					\
>        x4[i] |= VAL;							\
>      consumer (x4);							\
>    }
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..952a4b1cd580c4eaed4c7c470cd1efba8e5e931d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Ofast -fdump-tree-vect-details" } */
> +
> +void
> +foo (char *restrict a, int *restrict b, int *restrict c, int n)
> +{
> +  for (int i = 0; i < 9; i++)
> +    {
> +      int res = c[i];
> +      int t = b[i];
> +      if (a[i] != 0)
> +        res = t;
> +      c[i] = res;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..02d10de2a6215d38728a84123d038c2db605b5a0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Ofast -fdump-tree-vect-details" } */
> +
> +void
> +foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
> +{
> +  if (stride <= 1)
> +    return;
> +
> +  for (int i = 0; i < 9; i++)
> +    {
> +      int res = c[i];
> +      int t = b[i*stride];
> +      if (a[i] != 0)
> +        res = t;
> +      c[i] = res;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
  
