vect, tree-optimization/105219: Disable epilogue vectorization when peeling for alignment

Message ID 8462f41b-895f-9aca-499e-7713ec161673@arm.com
State New
Series vect, tree-optimization/105219: Disable epilogue vectorization when peeling for alignment

Commit Message

Andre Vieira (lists) April 26, 2022, 2:34 p.m. UTC
  Hi,

This patch disables epilogue vectorization when we are peeling for 
alignment in the prologue and we can't guarantee the main vectorized 
loop is entered.  This is to prevent executing vectorized code with an 
unaligned access if the target has indicated it wants to peel for 
alignment. We take this conservative approach as we currently do not 
distinguish between peeling for alignment for correctness or for 
performance.

Better codegen would be to skip to the scalar epilogue when the main 
loop isn't entered and alignment peeling is required. However, that 
would require a more aggressive change to the codebase, which we chose 
to avoid at this point in development.  We can revisit this option 
during stage 1 if we choose to.
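For reference, the check the patch implements can be modelled in plain C. This is an illustrative scalar sketch only: the real code works on poly_uint64 bounds through the LOOP_VINFO_* accessors, and the function and parameter names below are invented for this sketch.

```c
/* Scalar model of the new vect_epilogue_when_peeling_for_alignment
   check: with positive prologue peeling and a known iteration count,
   only vectorize the epilogue if the count provably exceeds what is
   needed to enter the main vector loop.  */

static int
epilogue_allowed (int prologue_peeling, int niters_known, long niters,
                  long vf, long cost_threshold, long versioning_threshold)
{
  if (prologue_peeling > 0 && niters_known)
    {
      /* Upper bound on iterations consumed before the main loop runs.  */
      long niters_for_main = vf > cost_threshold ? vf : cost_threshold;
      if (versioning_threshold > niters_for_main)
        niters_for_main = versioning_threshold;
      niters_for_main += prologue_peeling;
      if (niters <= niters_for_main)   /* maybe_le in the patch */
        return 0;
    }
  else if (prologue_peeling < 0)       /* peel amount unknown at compile time */
    return 0;
  return 1;
}
```

For example, with a vectorization factor of 4 and a peel of 2, a known iteration count of 6 cannot guarantee entering the main loop, so the epilogue stays scalar.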

Bootstrapped on aarch64-none-linux and regression tested on 
aarch64-none-elf.

gcc/ChangeLog:

     PR tree-optimization/105219
     * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New
     function.
     (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment to
     determine whether to vectorize epilogue.
     * testsuite/gcc.target/aarch64/pr105219.c: New.
     * testsuite/gcc.target/aarch64/pr105219-2.c: New.
     * testsuite/gcc.target/aarch64/pr105219-3.c: New.
  

Comments

Richard Sandiford April 26, 2022, 2:43 p.m. UTC | #1
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> Hi,
>
> This patch disables epilogue vectorization when we are peeling for 
> alignment in the prologue and we can't guarantee the main vectorized 
> loop is entered.  This is to prevent executing vectorized code with an 
> unaligned access if the target has indicated it wants to peel for 
> alignment. We take this conservative approach as we currently do not 
> distinguish between peeling for alignment for correctness or for 
> performance.
>
> A better codegen would be to make it skip to the scalar epilogue in case 
> the main loop isn't entered when alignment peeling is required. However, 
> that would require a more aggressive change to the codebase which we 
> chose to avoid at this point of development.  We can revisit this option 
> during stage 1 if we choose to.
>
> Bootstrapped on aarch64-none-linux and regression tested on 
> aarch64-none-elf.
>
> gcc/ChangeLog:
>
>      PR tree-optimization/105219
>      * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New 
> function.
>      (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment 
> to determine
>      whether to vectorize epilogue.
>      * testsuite/gcc.target/aarch64/pr105219.c: New.
>      * testsuite/gcc.target/aarch64/pr105219-2.c: New.
>      * testsuite/gcc.target/aarch64/pr105219-3.c: New.
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> @@ -0,0 +1,29 @@
> +/* { dg-do run } */
> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */

I think this should be in gcc.dg/vect, with the options forced
for { target aarch64 }.

Are the skips necessary?  It looks like the test should work correctly
for all options/targets.

> +/* PR 105219.  */
> +int data[128];
> +
> +void __attribute((noipa))
> +foo (int *data, int n)
> +{
> +  for (int i = 0; i < n; ++i)
> +    data[i] = i;
> +}
> +
> +int main()
> +{
> +  for (int start = 0; start < 16; ++start)
> +    for (int n = 1; n < 3*16; ++n)
> +      {
> +        __builtin_memset (data, 0, sizeof (data));
> +        foo (&data[start], n);
> +        for (int j = 0; j < n; ++j)
> +          if (data[start + j] != j)
> +            __builtin_abort ();
> +      }
> +  return 0;
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model -fdump-tree-vect-all" } */
> +/* PR 105219.  */
> +int data[128];
> +
> +void foo (void)
> +{
> +  for (int i = 0; i < 9; ++i)
> +    data[i + 1] = i;
> +}
> +
> +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> @@ -0,0 +1,28 @@
> +/* { dg-do run { target aarch64_sve128_hw } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a+sve" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*" } { "-msve-vector-bits=128" } } */
> +/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128 -mtune=thunderx" } */

Same here.

> +/* PR 105219.  */
> +int a;
> +char b[60];
> +short c[18];
> +short d[4][19];
> +long long f;
> +void e(int g, int h, short k[][19]) {
> +  for (signed i = 0; i < 3; i += 2)
> +    for (signed j = 1; j < h + 14; j++) {
> +      b[i * 14 + j] = 1;
> +      c[i + j] = k[2][j];
> +      a = g ? k[i][j] : 0;
> +    }
> +}
> +int main() {
> +  e(9, 1, d);
> +  for (long l = 0; l < 6; ++l)
> +    for (long m = 0; m < 4; ++m)
> +      f ^= b[l + m * 4];
> +  if (f)
> +    __builtin_abort ();
> +}
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
>    return opt_loop_vec_info::success (loop_vinfo);
>  }
>  
> +/* Function vect_epilogue_when_peeling_for_alignment
> +
> +   PR 105219: If we are peeling for alignment in the prologue then we do not
> +   vectorize the epilogue unless we are certain we will enter the main
> +   vectorized loop.  This is to prevent entering the vectorized epilogue in
> +   case there aren't enough iterations to enter the main loop.
> +*/
> +
> +static bool
> +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> +{
> +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> +    return true;
> +
> +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    {
> +      poly_uint64 niters_for_main
> +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> +      niters_for_main
> +	= upper_bound (niters_for_main,
> +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> +      niters_for_main += prologue_peeling;
> +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> +	return false;
> +    }
> +  else if (prologue_peeling < 0)

I was surprised that this tests < 0 rather than != 0.  If that's the
right behaviour, could you add a comment saying why?  I would have
expected:

  prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

to be handled more conservatively than:

  prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)

LGTM otherwise, but Richard B should have the final say.

Thanks,
Richard

> +    return false;
> +  return true;
> +}
> +
>  /* Function vect_analyze_loop.
>  
>     Apply a set of analyses on LOOP, and create a loop_vec_info struct
> @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>  		}
>  	    }
>  	  /* For now only allow one epilogue loop.  */
> -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
>  	    {
>  	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
>  	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
  
Jakub Jelinek April 26, 2022, 3:12 p.m. UTC | #2
On Tue, Apr 26, 2022 at 03:43:13PM +0100, Richard Sandiford via Gcc-patches wrote:
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> > @@ -0,0 +1,29 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> 
> I think this should be in gcc.dg/vect, with the options forced
> for { target aarch64 }.

I think it needs more than just aarch64: doesn't it need some effective
target guaranteeing that the HW it is tested on is Armv8.2-A compatible,
and that binutils can assemble v8.2-a instructions?
Sure, it can be done in gcc.dg/vect too if those effective targets
aren't defined in aarch64.exp.  But probably needs dg-additional-options
there instead of dg-options.

	Jakub
  
Andre Vieira (lists) April 26, 2022, 3:41 p.m. UTC | #3
On 26/04/2022 15:43, Richard Sandiford wrote:
> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
>> Hi,
>>
>> This patch disables epilogue vectorization when we are peeling for
>> alignment in the prologue and we can't guarantee the main vectorized
>> loop is entered.  This is to prevent executing vectorized code with an
>> unaligned access if the target has indicated it wants to peel for
>> alignment. We take this conservative approach as we currently do not
>> distinguish between peeling for alignment for correctness or for
>> performance.
>>
>> A better codegen would be to make it skip to the scalar epilogue in case
>> the main loop isn't entered when alignment peeling is required. However,
>> that would require a more aggressive change to the codebase which we
>> chose to avoid at this point of development.  We can revisit this option
>> during stage 1 if we choose to.
>>
>> Bootstrapped on aarch64-none-linux and regression tested on
>> aarch64-none-elf.
>>
>> gcc/ChangeLog:
>>
>>       PR tree-optimization/105219
>>       * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New
>> function.
>>       (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment
>> to determine
>>       whether to vectorize epilogue.
>>       * testsuite/gcc.target/aarch64/pr105219.c: New.
>>       * testsuite/gcc.target/aarch64/pr105219-2.c: New.
>>       * testsuite/gcc.target/aarch64/pr105219-3.c: New.
>>
>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
>> @@ -0,0 +1,29 @@
>> +/* { dg-do run } */
>> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> I think this should be in gcc.dg/vect, with the options forced
> for { target aarch64 }.
>
> Are the skips necessary?  It looks like the test should work correctly
> for all options/targets.
The -mtune and -march skips I guess aren't necessary, but if I drop the 
-mcpu skip-if I have to drop the -march option from dg-options, as the 
user-provided -mcpu might conflict with the -march and the test will fail.
>> +/* PR 105219.  */
>> +int data[128];
>> +
>> +void __attribute((noipa))
>> +foo (int *data, int n)
>> +{
>> +  for (int i = 0; i < n; ++i)
>> +    data[i] = i;
>> +}
>> +
>> +int main()
>> +{
>> +  for (int start = 0; start < 16; ++start)
>> +    for (int n = 1; n < 3*16; ++n)
>> +      {
>> +        __builtin_memset (data, 0, sizeof (data));
>> +        foo (&data[start], n);
>> +        for (int j = 0; j < n; ++j)
>> +          if (data[start + j] != j)
>> +            __builtin_abort ();
>> +      }
>> +  return 0;
>> +}
>> +
>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
>> @@ -0,0 +1,15 @@
>> +/* { dg-do compile } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
>> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model -fdump-tree-vect-all" } */
>> +/* PR 105219.  */
>> +int data[128];
>> +
>> +void foo (void)
>> +{
>> +  for (int i = 0; i < 9; ++i)
>> +    data[i + 1] = i;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/gcc.target/aarch64/pr105219.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
>> @@ -0,0 +1,28 @@
>> +/* { dg-do run { target aarch64_sve128_hw } } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a+sve" } } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*" } { "-msve-vector-bits=128" } } */
>> +/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128 -mtune=thunderx" } */
> Same here.
Here the reason is even stronger: if the user provides a different 
-msve-vector-bits the test will fail at run time too (given we are 
requesting 128-bit hardware).

Also, these were the conditions required for this test to fail; I could 
of course leave this test out altogether and only keep richi's test.
>
>> +/* PR 105219.  */
>> +int a;
>> +char b[60];
>> +short c[18];
>> +short d[4][19];
>> +long long f;
>> +void e(int g, int h, short k[][19]) {
>> +  for (signed i = 0; i < 3; i += 2)
>> +    for (signed j = 1; j < h + 14; j++) {
>> +      b[i * 14 + j] = 1;
>> +      c[i + j] = k[2][j];
>> +      a = g ? k[i][j] : 0;
>> +    }
>> +}
>> +int main() {
>> +  e(9, 1, d);
>> +  for (long l = 0; l < 6; ++l)
>> +    for (long m = 0; m < 4; ++m)
>> +      f ^= b[l + m * 4];
>> +  if (f)
>> +    __builtin_abort ();
>> +}
>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>> index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644
>> --- a/gcc/tree-vect-loop.cc
>> +++ b/gcc/tree-vect-loop.cc
>> @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
>>     return opt_loop_vec_info::success (loop_vinfo);
>>   }
>>   
>> +/* Function vect_epilogue_when_peeling_for_alignment
>> +
>> +   PR 105219: If we are peeling for alignment in the prologue then we do not
>> +   vectorize the epilogue unless we are certain we will enter the main
>> +   vectorized loop.  This is to prevent entering the vectorized epilogue in
>> +   case there aren't enough iterations to enter the main loop.
>> +*/
>> +
>> +static bool
>> +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
>> +{
>> +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
>> +    return true;
>> +
>> +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
>> +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>> +    {
>> +      poly_uint64 niters_for_main
>> +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
>> +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
>> +      niters_for_main
>> +	= upper_bound (niters_for_main,
>> +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
>> +      niters_for_main += prologue_peeling;
>> +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
>> +	return false;
>> +    }
>> +  else if (prologue_peeling < 0)
> I was surprised that this tests < 0 rather than != 0.  If that's the
> right behaviour, could you add a comment saying why?  I would have
> expected:
>
>    prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>
> to be handled more conservatively than:
>
>    prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
Oopsie... You are right, if prologue_peeling > 0 and 
!LOOP_VINFO_NITERS_KNOWN_P it should also return false. I'll make the 
change.
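In the same illustrative scalar-model spirit (function and parameter names here are invented, not GCC code), the agreed fix makes positive peeling with an unknown iteration count bail out too:

```c
/* Corrected predicate: any nonzero prologue peeling disables epilogue
   vectorization unless a known iteration count guarantees the main
   vector loop is entered.  NITERS_FOR_MAIN stands for the already-computed
   upper bound (max of VF, cost and versioning thresholds).  */

static int
epilogue_allowed_fixed (int prologue_peeling, int niters_known, long niters,
                        long niters_for_main)
{
  if (prologue_peeling < 0)
    return 0;                    /* runtime-unknown peel amount */
  if (prologue_peeling > 0)
    {
      if (!niters_known)
        return 0;                /* the case Richard pointed out */
      if (niters <= niters_for_main + prologue_peeling)
        return 0;
    }
  return 1;
}
```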
>
> LGTM otherwise, but Richard B should have the final say.
>
> Thanks,
> Richard
>
>> +    return false;
>> +  return true;
>> +}
>> +
>>   /* Function vect_analyze_loop.
>>   
>>      Apply a set of analyses on LOOP, and create a loop_vec_info struct
>> @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>>   		}
>>   	    }
>>   	  /* For now only allow one epilogue loop.  */
>> -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
>> +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
>> +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
>>   	    {
>>   	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
>>   	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
  
Andre Vieira (lists) April 26, 2022, 3:45 p.m. UTC | #4
On 26/04/2022 16:12, Jakub Jelinek wrote:
> On Tue, Apr 26, 2022 at 03:43:13PM +0100, Richard Sandiford via Gcc-patches wrote:
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
>>> @@ -0,0 +1,29 @@
>>> +/* { dg-do run } */
>>> +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
>>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
>>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
>>> +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
>> I think this should be in gcc.dg/vect, with the options forced
>> for { target aarch64 }.
> I think not just aarch64, doesn't it need some effective target that
> the HW on which it is tested is ARM v8.2-a compatible plus that binutils
> can assemble v8.2-a instructions?
> Sure, it can be done in gcc.dg/vect too if those effective targets
> aren't defined in aarch64.exp.  But probably needs dg-additional-options
> there instead of dg-options.
>
> 	Jakub
For some reason I thought richi wasn't able to reproduce this on other 
targets, but from my last read of the PR I think he was... Regardless, 
it is probably worth testing it for all targets.
The question is: how do I make it run for all targets but use 
target-specific options for each to try to trigger the original issue? 
Multiple dg-additional-options directives with different target selectors?

Kind regards,
Andre
  
Jakub Jelinek April 26, 2022, 4:02 p.m. UTC | #5
On Tue, Apr 26, 2022 at 04:45:14PM +0100, Andre Vieira (lists) wrote:
> For some reason I thought richi wasn't able to reproduce this on other
> targets, but from my last read of the PR I think he was... Regardless

Note, a test added as a generic test doesn't strictly need to fail
before the fix on all arches, or even many of them.  When a test is
not itself target specific, it can be useful to run it on all targets:
it will catch a regression of the same bug on the originally failing
target, and sometimes it catches other bugs on other targets
(this has happened many times in the past).

> probably worth testing it for all targets for sure.
> Question is how do I make it run for all targets but use target specific
> options for each to try and trigger the original issue? Multiple
> dg-additional-options with different target selectors?

Yes.  But they really need to be guarded also by effective targets
which guarantee hw support.
Say, if you use /* { dg-additional-options "-mavx2" } */
then it would need to be
/* { dg-additional-options "-mavx2" { target { { i?86-* x86_64-* } && avx2_runtime } } } */
or so, where that effective target ensures both that the assembler can
assemble avx2 instructions and that the hw it is tested on supports them too.
No idea about aarch64/arm effective targets.

Another way sometimes used is to place a plain test without magic
options into gcc.dg/vect/, i.e. test it with whatever options the user
configured gcc with or asks for through RUNTESTFLAGS, and, when needed,
add a gcc.target/*/ test that has extra dg-options, renames main to
something else, and calls it only after checking hw capabilities.
grep ../../gcc.dg/vect testsuite/gcc.target/i386/*.c
for some examples.

	Jakub
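To make that concrete, a gcc.dg/vect version of the runtime test might start like this. The directives and effective-target names below are illustrative only (unverified against aarch64.exp or i386.exp); the function is the one from the posted pr105219-2.c:

```c
/* Hypothetical gcc.dg/vect test: per-target extra options guarded by
   target selectors (and, where needed, runtime effective targets),
   on top of the harness's default vectorization options.  */
/* { dg-do run } */
/* { dg-additional-options "-march=armv8.2-a -mtune=thunderx" { target aarch64*-*-* } } */
/* { dg-additional-options "-mavx2" { target { { i?86-*-* x86_64-*-* } && avx2_runtime } } } */

int data[128];

void __attribute((noipa))
foo (int *p, int n)
{
  for (int i = 0; i < n; ++i)
    p[i] = i;
}
```

The main function and the checking loops would follow unchanged from pr105219-2.c.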
  
Richard Biener April 27, 2022, 6:35 a.m. UTC | #6
On Tue, 26 Apr 2022, Richard Sandiford wrote:

> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> > Hi,
> >
> > This patch disables epilogue vectorization when we are peeling for 
> > alignment in the prologue and we can't guarantee the main vectorized 
> > loop is entered.  This is to prevent executing vectorized code with an 
> > unaligned access if the target has indicated it wants to peel for 
> > alignment. We take this conservative approach as we currently do not 
> > distinguish between peeling for alignment for correctness or for 
> > performance.
> >
> > A better codegen would be to make it skip to the scalar epilogue in case 
> > the main loop isn't entered when alignment peeling is required. However, 
> > that would require a more aggressive change to the codebase which we 
> > chose to avoid at this point of development.  We can revisit this option 
> > during stage 1 if we choose to.
> >
> > Bootstrapped on aarch64-none-linux and regression tested on 
> > aarch64-none-elf.
> >
> > gcc/ChangeLog:
> >
> >      PR tree-optimization/105219
> >      * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New 
> > function.
> >      (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment 
> > to determine
> >      whether to vectorize epilogue.
> >      * testsuite/gcc.target/aarch64/pr105219.c: New.
> >      * testsuite/gcc.target/aarch64/pr105219-2.c: New.
> >      * testsuite/gcc.target/aarch64/pr105219-3.c: New.
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> > @@ -0,0 +1,29 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> 
> I think this should be in gcc.dg/vect, with the options forced
> for { target aarch64 }.
> 
> Are the skips necessary?  It looks like the test should work correctly
> for all options/targets.
> 
> > +/* PR 105219.  */
> > +int data[128];
> > +
> > +void __attribute((noipa))
> > +foo (int *data, int n)
> > +{
> > +  for (int i = 0; i < n; ++i)
> > +    data[i] = i;
> > +}
> > +
> > +int main()
> > +{
> > +  for (int start = 0; start < 16; ++start)
> > +    for (int n = 1; n < 3*16; ++n)
> > +      {
> > +        __builtin_memset (data, 0, sizeof (data));
> > +        foo (&data[start], n);
> > +        for (int j = 0; j < n; ++j)
> > +          if (data[start + j] != j)
> > +            __builtin_abort ();
> > +      }
> > +  return 0;
> > +}
> > +
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model -fdump-tree-vect-all" } */
> > +/* PR 105219.  */
> > +int data[128];
> > +
> > +void foo (void)
> > +{
> > +  for (int i = 0; i < 9; ++i)
> > +    data[i + 1] = i;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> > @@ -0,0 +1,28 @@
> > +/* { dg-do run { target aarch64_sve128_hw } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a+sve" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> > +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*" } { "-msve-vector-bits=128" } } */
> > +/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128 -mtune=thunderx" } */
> 
> Same here.
> 
> > +/* PR 105219.  */
> > +int a;
> > +char b[60];
> > +short c[18];
> > +short d[4][19];
> > +long long f;
> > +void e(int g, int h, short k[][19]) {
> > +  for (signed i = 0; i < 3; i += 2)
> > +    for (signed j = 1; j < h + 14; j++) {
> > +      b[i * 14 + j] = 1;
> > +      c[i + j] = k[2][j];
> > +      a = g ? k[i][j] : 0;
> > +    }
> > +}
> > +int main() {
> > +  e(9, 1, d);
> > +  for (long l = 0; l < 6; ++l)
> > +    for (long m = 0; m < 4; ++m)
> > +      f ^= b[l + m * 4];
> > +  if (f)
> > +    __builtin_abort ();
> > +}
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
> >    return opt_loop_vec_info::success (loop_vinfo);
> >  }
> >  
> > +/* Function vect_epilogue_when_peeling_for_alignment
> > +
> > +   PR 105219: If we are peeling for alignment in the prologue then we do not
> > +   vectorize the epilogue unless we are certain we will enter the main
> > +   vectorized loop.  This is to prevent entering the vectorized epilogue in
> > +   case there aren't enough iterations to enter the main loop.
> > +*/
> > +
> > +static bool
> > +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> > +{
> > +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> > +    return true;
> > +
> > +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> > +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> > +    {
> > +      poly_uint64 niters_for_main
> > +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> > +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> > +      niters_for_main
> > +	= upper_bound (niters_for_main,
> > +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> > +      niters_for_main += prologue_peeling;
> > +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> > +	return false;
> > +    }
> > +  else if (prologue_peeling < 0)
> 
> I was surprised that this tests < 0 rather than != 0.  If that's the
> right behaviour, could you add a comment saying why?  I would have
> expected:
> 
>   prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> 
> to be handled more conservatively than:
> 
>   prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> 
> LGTM otherwise, but Richard B should have the final say.

It works for me.  Note it strictly prefers peeling for alignment over
epilogue vectorization (which is probably OK in most cases), but the
number of targets affected is quite small.

For GCC 13 we should look into making it possible to skip to the
scalar epilogue.
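A plain-C sketch of that desired shape, with scalar stand-ins for the vector operations and made-up VF/peel constants, might look like this (illustrative only, not GCC's actual codegen):

```c
/* Desired loop structure: prologue peel for alignment, main "vector"
   loop, then a shared scalar epilogue.  If the iteration count cannot
   cover peel + one full vector iteration, jump straight to the scalar
   epilogue, bypassing all vector code (and hence any alignment
   assumptions the vector loops make).  */

static long
sum (const long *a, long n)
{
  enum { VF = 4, PEEL = 2 };    /* made-up constants for the sketch */
  long s = 0, i = 0;

  if (n < PEEL + VF)
    goto scalar_epilogue;        /* main loop not entered: skip vector paths */

  for (; i < PEEL; ++i)          /* prologue: peel for alignment */
    s += a[i];
  for (; i + VF <= n; i += VF)   /* main "vector" loop, VF at a time */
    s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];

scalar_epilogue:
  for (; i < n; ++i)             /* remaining iterations, one at a time */
    s += a[i];
  return s;
}
```

Either way the function sums all n elements; only the path taken differs with n.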

Thanks,
Richard.

> Thanks,
> Richard
> 
> > +    return false;
> > +  return true;
> > +}
> > +
> >  /* Function vect_analyze_loop.
> >  
> >     Apply a set of analyses on LOOP, and create a loop_vec_info struct
> > @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> >  		}
> >  	    }
> >  	  /* For now only allow one epilogue loop.  */
> > -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> > +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> > +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
> >  	    {
> >  	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
> >  	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
>
  
Richard Biener April 27, 2022, 10:59 a.m. UTC | #7
On Wed, 27 Apr 2022, Richard Biener wrote:

> On Tue, 26 Apr 2022, Richard Sandiford wrote:
> 
> > "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> > > Hi,
> > >
> > > This patch disables epilogue vectorization when we are peeling for 
> > > alignment in the prologue and we can't guarantee the main vectorized 
> > > loop is entered.  This is to prevent executing vectorized code with an 
> > > unaligned access if the target has indicated it wants to peel for 
> > > alignment. We take this conservative approach as we currently do not 
> > > distinguish between peeling for alignment for correctness or for 
> > > performance.
> > >
> > > A better codegen would be to make it skip to the scalar epilogue in case 
> > > the main loop isn't entered when alignment peeling is required. However, 
> > > that would require a more aggressive change to the codebase which we 
> > > chose to avoid at this point of development.  We can revisit this option 
> > > during stage 1 if we choose to.
> > >
> > > Bootstrapped on aarch64-none-linux and regression tested on 
> > > aarch64-none-elf.
> > >
> > > gcc/ChangeLog:
> > >
> > >      PR tree-optimization/105219
> > >      * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New 
> > > function.
> > >      (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment 
> > > to determine
> > >      whether to vectorize epilogue.
> > >      * testsuite/gcc.target/aarch64/pr105219.c: New.
> > >      * testsuite/gcc.target/aarch64/pr105219-2.c: New.
> > >      * testsuite/gcc.target/aarch64/pr105219-3.c: New.
> > >
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
> > > @@ -0,0 +1,29 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> > 
> > I think this should be in gcc.dg/vect, with the options forced
> > for { target aarch64 }.
> > 
> > Are the skips necessary?  It looks like the test should work correctly
> > for all options/targets.
> > 
> > > +/* PR 105219.  */
> > > +int data[128];
> > > +
> > > +void __attribute((noipa))
> > > +foo (int *data, int n)
> > > +{
> > > +  for (int i = 0; i < n; ++i)
> > > +    data[i] = i;
> > > +}
> > > +
> > > +int main()
> > > +{
> > > +  for (int start = 0; start < 16; ++start)
> > > +    for (int n = 1; n < 3*16; ++n)
> > > +      {
> > > +        __builtin_memset (data, 0, sizeof (data));
> > > +        foo (&data[start], n);
> > > +        for (int j = 0; j < n; ++j)
> > > +          if (data[start + j] != j)
> > > +            __builtin_abort ();
> > > +      }
> > > +  return 0;
> > > +}
> > > +
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
> > > @@ -0,0 +1,15 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> > > +/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model -fdump-tree-vect-all" } */
> > > +/* PR 105219.  */
> > > +int data[128];
> > > +
> > > +void foo (void)
> > > +{
> > > +  for (int i = 0; i < 9; ++i)
> > > +    data[i + 1] = i;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
> > > @@ -0,0 +1,28 @@
> > > +/* { dg-do run { target aarch64_sve128_hw } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a+sve" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
> > > +/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*" } { "-msve-vector-bits=128" } } */
> > > +/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128 -mtune=thunderx" } */
> > 
> > Same here.
> > 
> > > +/* PR 105219.  */
> > > +int a;
> > > +char b[60];
> > > +short c[18];
> > > +short d[4][19];
> > > +long long f;
> > > +void e(int g, int h, short k[][19]) {
> > > +  for (signed i = 0; i < 3; i += 2)
> > > +    for (signed j = 1; j < h + 14; j++) {
> > > +      b[i * 14 + j] = 1;
> > > +      c[i + j] = k[2][j];
> > > +      a = g ? k[i][j] : 0;
> > > +    }
> > > +}
> > > +int main() {
> > > +  e(9, 1, d);
> > > +  for (long l = 0; l < 6; ++l)
> > > +    for (long m = 0; m < 4; ++m)
> > > +      f ^= b[l + m * 4];
> > > +  if (f)
> > > +    __builtin_abort ();
> > > +}
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
> > >    return opt_loop_vec_info::success (loop_vinfo);
> > >  }
> > >  
> > > +/* Function vect_epilogue_when_peeling_for_alignment
> > > +
> > > +   PR 105219: If we are peeling for alignment in the prologue then we do not
> > > +   vectorize the epilogue unless we are certain we will enter the main
> > > +   vectorized loop.  This is to prevent entering the vectorized epilogue in
> > > +   case there aren't enough iterations to enter the main loop.
> > > +*/
> > > +
> > > +static bool
> > > +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> > > +{
> > > +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> > > +    return true;
> > > +
> > > +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> > > +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> > > +    {
> > > +      poly_uint64 niters_for_main
> > > +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> > > +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> > > +      niters_for_main
> > > +	= upper_bound (niters_for_main,
> > > +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> > > +      niters_for_main += prologue_peeling;
> > > +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> > > +	return false;
> > > +    }
> > > +  else if (prologue_peeling < 0)
> > 
> > I was surprised that this tests < 0 rather than != 0.  If that's the
> > right behaviour, could you add a comment saying why?  I would have
> > expected:
> > 
> >   prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > 
> > to be handled more conservatively than:
> > 
> >   prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > 
> > LGTM otherwise, but Richard B should have the final say.
> 
> It works for me.  Note it strictly prefers peeling for alignment over
> epilogue vectorization (which is probably OK in most cases), but the
> number of targets affected is quite small.

Btw, I wonder if it is possible to detect the situation by
checking if there's more than the fallthru edge from the vector
loop to the scalar loop when vectorizing the epilogue.

> For GCC 13 we should see to make skipping to the scalar epilogue
> possible.

Possibly by keeping the scalar loop of the main vectorized loop as
scalar loop of the vectorized epilogue - that of course needs
adjustments as to how we vectorize the epilogue with possibly
more divergence from the main loop vectorization.  But maybe not
by too much.

Richard.

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Richard
> > 
> > > +    return false;
> > > +  return true;
> > > +}
> > > +
> > >  /* Function vect_analyze_loop.
> > >  
> > >     Apply a set of analyses on LOOP, and create a loop_vec_info struct
> > > @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > >  		}
> > >  	    }
> > >  	  /* For now only allow one epilogue loop.  */
> > > -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> > > +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> > > +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
> > >  	    {
> > >  	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
> > >  	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> > 
> 
>
  
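As a standalone illustration of the check discussed above, the logic of vect_epilogue_when_peeling_for_alignment can be modeled with plain integers. This is only a sketch: the real code works on poly_uint64 values via upper_bound/maybe_le, and the numbers below are made up.

```python
def epilogue_allowed(niters_known, niters, prologue_peeling,
                     vf, cost_threshold, versioning_threshold):
    # Scalar stand-in for the poly_uint64 arithmetic in the patch.
    if prologue_peeling > 0 and niters_known:
        # Smallest trip count that still guarantees the main
        # vectorized loop is entered after prologue peeling.
        niters_for_main = max(vf, cost_threshold, versioning_threshold)
        niters_for_main += prologue_peeling
        if niters <= niters_for_main:  # maybe_le: main loop may be skipped
            return False
    elif prologue_peeling < 0:
        # Peel amount unknown at compile time; be conservative.
        return False
    return True

# vf = 4, 3 iterations peeled for alignment:
print(epilogue_allowed(True, 6, 3, 4, 0, 0))    # False: main loop may be skipped
print(epilogue_allowed(True, 32, 3, 4, 0, 0))   # True
print(epilogue_allowed(True, 32, -1, 4, 0, 0))  # False
```

With niters = 6, vf = 4 and 3 peeled iterations, only 3 iterations remain for the main loop, so the epilogue is refused; with niters = 32 the main loop is certainly entered and epilogue vectorization stays enabled.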
Richard Biener April 27, 2022, 11:10 a.m. UTC | #8
On Wed, 27 Apr 2022, Richard Biener wrote:

> On Wed, 27 Apr 2022, Richard Biener wrote:
> 
> > On Tue, 26 Apr 2022, Richard Sandiford wrote:
> > 
> > > [...]
> > > 
> > > > +/* Function vect_epilogue_when_peeling_for_alignment
> > > > +
> > > > +   PR 105219: If we are peeling for alignment in the prologue then we do not
> > > > +   vectorize the epilogue unless we are certain we will enter the main
> > > > +   vectorized loop.  This is to prevent entering the vectorized epilogue in
> > > > +   case there aren't enough iterations to enter the main loop.
> > > > +*/
> > > > +
> > > > +static bool
> > > > +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> > > > +{
> > > > +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> > > > +    return true;
> > > > +
> > > > +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> > > > +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> > > > +    {
> > > > +      poly_uint64 niters_for_main
> > > > +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> > > > +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> > > > +      niters_for_main
> > > > +	= upper_bound (niters_for_main,
> > > > +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> > > > +      niters_for_main += prologue_peeling;
> > > > +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> > > > +	return false;
> > > > +    }
> > > > +  else if (prologue_peeling < 0)
> > > 
> > > I was surprised that this tests < 0 rather than != 0.  If that's the
> > > right behaviour, could you add a comment saying why?  I would have
> > > expected:
> > > 
> > >   prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > 
> > > to be handled more conservatively than:
> > > 
> > >   prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > 
> > > LGTM otherwise, but Richard B should have the final say.
> > 
> > It works for me.  Note it strictly prefers peeling for alignment over
> > epilogue vectorization (which is probably OK in most cases), but the
> > number of targets affected is quite small.
> 
> Btw, I wonder if it is possible to detect the situation by
> checking if there's more than the fallthru edge from the vector
> loop to the scalar loop when vectorizing the epilogue.

And I just checked that we do _not_ seem to assume aligned accesses
in the vectorized epilogue, so it might indeed be only the upper bound
computation that is wrong (and the aligned access is a missed
optimization if we statically do not skip the vector loop).  We
possibly reset the info that we have aligned accesses during the DR
info copying for the epilogue?

In that case it would be enough to fix the bogus upper iteration
bound, and it would not be necessary to regress epilogue vectorization.
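To see why the recorded upper bound can be bogus, here is a rough numeric model. The vf and peel values are hypothetical, and it assumes, as in the PR, that the guard which skips the main vector loop also skips the prologue peeling, so the epilogue then handles all remaining iterations:

```python
def epilogue_iterations(niters, peel, vf):
    # If fewer than vf iterations remain after peeling, the guard
    # branches straight to the epilogue: neither the prologue nor the
    # main vector loop runs, and the epilogue sees all niters iterations.
    if niters - peel < vf:
        return niters
    # Otherwise the epilogue only handles the remainder left over
    # after the main vector loop.
    return (niters - peel) % vf

vf, peel = 4, 3
worst = max(epilogue_iterations(n, peel, vf) for n in range(1, 64))
print(worst)  # 6 == vf + peel - 1, above the bound vf - 1 often assumed
```

So the epilogue can see up to vf + peel - 1 iterations rather than vf - 1, which is exactly the kind of upper-bound miscomputation described above.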

Richard.

> > For GCC 13 we should see to make skipping to the scalar epilogue
> > possible.
> 
> Possibly by keeping the scalar loop of the main vectorized loop as
> scalar loop of the vectorized epilogue - that of course needs
> adjustments as to how we vectorize the epilogue with possibly
> more divergence from the main loop vectorization.  But maybe not
> by too much.
> 
> Richard.
> 
> > 
> > Thanks,
> > Richard.
> > 
> > > Thanks,
> > > Richard
> > > 
> > > > +    return false;
> > > > +  return true;
> > > > +}
> > > > +
> > > >  /* Function vect_analyze_loop.
> > > >  
> > > >     Apply a set of analyses on LOOP, and create a loop_vec_info struct
> > > > @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > > >  		}
> > > >  	    }
> > > >  	  /* For now only allow one epilogue loop.  */
> > > > -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> > > > +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> > > > +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
> > > >  	    {
> > > >  	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
> > > >  	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> > > 
> > 
> > 
> 
>
  
Andre Vieira (lists) April 27, 2022, 1:46 p.m. UTC | #9
On 27/04/2022 07:35, Richard Biener wrote:
> On Tue, 26 Apr 2022, Richard Sandiford wrote:
>
>> [...]
>>
>>> +/* Function vect_epilogue_when_peeling_for_alignment
>>> +
>>> +   PR 105219: If we are peeling for alignment in the prologue then we do not
>>> +   vectorize the epilogue unless we are certain we will enter the main
>>> +   vectorized loop.  This is to prevent entering the vectorized epilogue in
>>> +   case there aren't enough iterations to enter the main loop.
>>> +*/
>>> +
>>> +static bool
>>> +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
>>> +{
>>> +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
>>> +    return true;
>>> +
>>> +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
>>> +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>>> +    {
>>> +      poly_uint64 niters_for_main
>>> +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
>>> +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
>>> +      niters_for_main
>>> +	= upper_bound (niters_for_main,
>>> +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
>>> +      niters_for_main += prologue_peeling;
>>> +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
>>> +	return false;
>>> +    }
>>> +  else if (prologue_peeling < 0)
>> I was surprised that this tests < 0 rather than != 0.  If that's the
>> right behaviour, could you add a comment saying why?  I would have
>> expected:
>>
>>    prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>
>> to be handled more conservatively than:
>>
>>    prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>
>> LGTM otherwise, but Richard B should have the final say.
> It works for me.  Note it strictly prefers peeling for alignment over
> epilogue vectorization (which is probably OK in most cases), but the
> number of targets affected is quite small.
>
> For GCC 13 we should see to make skipping to the scalar epilogue
> possible.
>
> Thanks,
> Richard.
>
>> Thanks,
>> Richard
>>
>>> +    return false;
>>> +  return true;
>>> +}
>>> +
>>>   /* Function vect_analyze_loop.
>>>   
>>>      Apply a set of analyses on LOOP, and create a loop_vec_info struct
>>> @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>>>   		}
>>>   	    }
>>>   	  /* For now only allow one epilogue loop.  */
>>> -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
>>> +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
>>> +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
>>>   	    {
>>>   	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
>>>   	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
IIUC, after the IRC and PR discussions we are dropping this approach in 
favour of fixing the loop iteration-bound tightening instead, as we now 
believe it is fine to enter the vectorized epilogue without peeling?

Kind Regards,
Andre
  
Richard Biener April 27, 2022, 2:09 p.m. UTC | #10
On Wed, 27 Apr 2022, Andre Vieira (lists) wrote:

> 
> On 27/04/2022 07:35, Richard Biener wrote:
> > On Tue, 26 Apr 2022, Richard Sandiford wrote:
> >
> >> "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> >>> Hi,
> >>>
> >>> This patch disables epilogue vectorization when we are peeling for
> >>> alignment in the prologue and we can't guarantee the main vectorized
> >>> loop is entered.  This is to prevent executing vectorized code with an
> >>> unaligned access if the target has indicated it wants to peel for
> >>> alignment. We take this conservative approach as we currently do not
> >>> distinguish between peeling for alignment for correctness or for
> >>> performance.
> >>>
> >>> A better codegen would be to make it skip to the scalar epilogue in case
> >>> the main loop isn't entered when alignment peeling is required. However,
> >>> that would require a more aggressive change to the codebase which we
> >>> chose to avoid at this point of development.  We can revisit this option
> >>> during stage 1 if we choose to.
> >>>
> >>> Bootstrapped on aarch64-none-linux and regression tested on
> >>> aarch64-none-elf.
> >>>
> >>> gcc/ChangeLog:
> >>>
> >>>       PR tree-optimization/105219
> >>>       * tree-vect-loop.cc (vect_epilogue_when_peeling_for_alignment): New
> >>> function.
> >>>       (vect_analyze_loop): Use vect_epilogue_when_peeling_for_alignment
> >>> to determine
> >>>       whether to vectorize epilogue.
> >>>       * testsuite/gcc.target/aarch64/pr105219.c: New.
> >>>       * testsuite/gcc.target/aarch64/pr105219-2.c: New.
> >>>       * testsuite/gcc.target/aarch64/pr105219-3.c: New.
> >>>
> >>> [...]
> >>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> >>> index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644
> >>> --- a/gcc/tree-vect-loop.cc
> >>> +++ b/gcc/tree-vect-loop.cc
> >>> @@ -2942,6 +2942,38 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
> >>>     return opt_loop_vec_info::success (loop_vinfo);
> >>>   }
> >>>   
> >>> +/* Function vect_epilogue_when_peeling_for_alignment
> >>> +
> >>> +   PR 105219: If we are peeling for alignment in the prologue then we do not
> >>> +   vectorize the epilogue unless we are certain we will enter the main
> >>> +   vectorized loop.  This is to prevent entering the vectorized epilogue in
> >>> +   case there aren't enough iterations to enter the main loop.
> >>> +*/
> >>> +
> >>> +static bool
> >>> +vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
> >>> +{
> >>> +  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
> >>> +    return true;
> >>> +
> >>> +  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> >>> +  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> >>> +    {
> >>> +      poly_uint64 niters_for_main
> >>> +	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> >>> +		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
> >>> +      niters_for_main
> >>> +	= upper_bound (niters_for_main,
> >>> +		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
> >>> +      niters_for_main += prologue_peeling;
> >>> +      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
> >>> +	return false;
> >>> +    }
> >>> +  else if (prologue_peeling < 0)
> >> I was surprised that this tests < 0 rather than != 0.  If that's the
> >> right behaviour, could you add a comment saying why?  I would have
> >> expected:
> >>
> >>    prologue_peeling > 0 && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >>
> >> to be handled more conservatively than:
> >>
> >>    prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >>
> >> LGTM otherwise, but Richard B should have the final say.
> > It works for me.  Note it strictly prefers peeling for alignment over
> > epilogue vectorization (which is probably OK in most cases), but the
> > number of targets affected is quite small.
> >
> > For GCC 13 we should see to make skipping to the scalar epilogue
> > possible.
> >
> > Thanks,
> > Richard.
> >
> >> Thanks,
> >> Richard
> >>
> >>> +    return false;
> >>> +  return true;
> >>> +}
> >>> +
> >>>   /* Function vect_analyze_loop.
> >>>   
> >>>      Apply a set of analyses on LOOP, and create a loop_vec_info struct
> >>> @@ -3151,7 +3183,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> >>>        	}
> >>>        }
> >>>   	  /* For now only allow one epilogue loop.  */
> >>> -	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
> >>> +	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
> >>> +	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
> >>>        {
> >>>          first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
> >>>          poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> IIUC, after the IRC and PR discussions we are dropping this approach in
> favour of changing the loop_iteration tightening, as we now believe it's
> fine to enter the vectorized epilogue without peeling?

Yes, I think we are conservative there; at least I don't see an overly
large alignment being used, and I saw the DRs being offset by a correct
amount in update_epilogue_loop_vinfo.  So unless we have a contrary
indication we should be safe here.  You can look at a -gimple suffixed
dump, where for example with -mavx2 I see

  __MEM <vector(8) int> ((int *)vectp_data.13_72) = vect_vec_iv_.12_67;

in the main loop (alignment according to the type) and

  __MEM <vector(4) int, 32> ((int *)vectp_data.21_112) = vect_vec_iv_.20_106;

in the epilogue (32-bit alignment).

Richard.
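For readers following the thread, the three loop shapes being discussed (alignment-peeling prologue, vectorized main loop, epilogue) can be sketched in plain C. This is a hand-written illustration, not compiler output: the chunk width of 4 and the 16-byte alignment target are arbitrary stand-ins for the vectorization factor and the preferred vector alignment, and `fill` is an invented name.

```c
#include <assert.h>
#include <stdint.h>

/* Prologue peeling + main loop + epilogue, spelled out by hand.
   The prologue peels scalar iterations until the store address is
   16-byte aligned; the main loop then handles whole aligned groups
   of 4 (standing in for aligned vector stores); the epilogue picks
   up the remaining tail iterations.  */
static void
fill (int *data, int n)
{
  int i = 0;
  /* Prologue: peel until data + i is 16-byte aligned.  */
  while (i < n && ((uintptr_t) (data + i) & 15) != 0)
    {
      data[i] = i;
      ++i;
    }
  /* Main loop: whole groups of 4 starting at an aligned address.  */
  for (; i + 4 <= n; i += 4)
    for (int j = 0; j < 4; ++j)
      data[i + j] = i + j;
  /* Epilogue: at most 3 leftover iterations.  */
  for (; i < n; ++i)
    data[i] = i;
}
```

In the compiler's actual code layout a guard can branch around the main loop when there are too few iterations; if a vectorized epilogue is entered on that path without the alignment the analysis assumed, it may perform unaligned vector accesses, which is why the patch conservatively disables epilogue vectorization in that situation.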
  

Patch

diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-2.c b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..c97d1dc100181b77af0766e08407e1e352f604fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr105219-2.c
@@ -0,0 +1,29 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model" } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
+/* PR 105219.  */
+int data[128];
+
+void __attribute((noipa))
+foo (int *data, int n)
+{
+  for (int i = 0; i < n; ++i)
+    data[i] = i;
+}
+
+int main()
+{
+  for (int start = 0; start < 16; ++start)
+    for (int n = 1; n < 3*16; ++n)
+      {
+        __builtin_memset (data, 0, sizeof (data));
+        foo (&data[start], n);
+        for (int j = 0; j < n; ++j)
+          if (data[start + j] != j)
+            __builtin_abort ();
+      }
+  return 0;
+}
+
diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219-3.c b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..444352fc051b787369f6f1be6236d1ff0fc2d392
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr105219-3.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a" } } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
+/* { dg-options "-O3 -march=armv8.2-a -mtune=thunderx -fno-vect-cost-model -fdump-tree-vect-all" } */
+/* PR 105219.  */
+int data[128];
+
+void foo (void)
+{
+  for (int i = 0; i < 9; ++i)
+    data[i + 1] = i;
+}
+
+/* { dg-final { scan-tree-dump "EPILOGUE VECTORIZED" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr105219.c b/gcc/testsuite/gcc.target/aarch64/pr105219.c
new file mode 100644
index 0000000000000000000000000000000000000000..bbdefb549f6a4e803852f69d20ce1ef9152a526c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr105219.c
@@ -0,0 +1,28 @@ 
+/* { dg-do run { target aarch64_sve128_hw } } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-march=*" } { "-march=armv8.2-a+sve" } } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-mtune=*" } { "-mtune=thunderx" } } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-mcpu=*" } } */
+/* { dg-skip-if "incompatible options" { *-*-* } { "-msve-vector-bits=*" } { "-msve-vector-bits=128" } } */
+/* { dg-options "-O3 -march=armv8.2-a+sve -msve-vector-bits=128 -mtune=thunderx" } */
+/* PR 105219.  */
+int a;
+char b[60];
+short c[18];
+short d[4][19];
+long long f;
+void e(int g, int h, short k[][19]) {
+  for (signed i = 0; i < 3; i += 2)
+    for (signed j = 1; j < h + 14; j++) {
+      b[i * 14 + j] = 1;
+      c[i + j] = k[2][j];
+      a = g ? k[i][j] : 0;
+    }
+}
+int main() {
+  e(9, 1, d);
+  for (long l = 0; l < 6; ++l)
+    for (long m = 0; m < 4; ++m)
+      f ^= b[l + m * 4];
+  if (f)
+    __builtin_abort ();
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index d7bc34636bd52b2f67cdecd3dc16fcff684dba07..a23e6181dec8126bcb691ea9474095bf65483863 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2942,6 +2942,38 @@  vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared,
   return opt_loop_vec_info::success (loop_vinfo);
 }
 
+/* Function vect_epilogue_when_peeling_for_alignment
+
+   PR 105219: If we are peeling for alignment in the prologue then we do not
+   vectorize the epilogue unless we are certain we will enter the main
+   vectorized loop.  This is to prevent entering the vectorized epilogue in
+   case there aren't enough iterations to enter the main loop.
+*/
+
+static bool
+vect_epilogue_when_peeling_for_alignment (loop_vec_info loop_vinfo)
+{
+  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
+    return true;
+
+  int prologue_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
+  if (prologue_peeling > 0 && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+    {
+      poly_uint64 niters_for_main
+	= upper_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+		       LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo));
+      niters_for_main
+	= upper_bound (niters_for_main,
+		       LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo));
+      niters_for_main += prologue_peeling;
+      if (maybe_le (LOOP_VINFO_INT_NITERS (loop_vinfo), niters_for_main))
+	return false;
+    }
+  else if (prologue_peeling < 0)
+    return false;
+  return true;
+}
+
 /* Function vect_analyze_loop.
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
@@ -3151,7 +3183,8 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		}
 	    }
 	  /* For now only allow one epilogue loop.  */
-	  if (first_loop_vinfo->epilogue_vinfos.is_empty ())
+	  if (first_loop_vinfo->epilogue_vinfos.is_empty ()
+	      && vect_epilogue_when_peeling_for_alignment (first_loop_vinfo))
 	    {
 	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
 	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
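To make the predicate's arithmetic easier to follow, here is a plain-integer model of the decision. This is a sketch with invented names: the real function operates on poly_uint64 values through upper_bound/maybe_le, reads its inputs from the loop_vec_info, and has an early-out for vect_use_loop_mask_for_alignment_p that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Scalar model of vect_epilogue_when_peeling_for_alignment: allow an
   epilogue only when we can show the main vector loop will be entered
   despite prologue peeling.  */
static bool
epilogue_allowed (int prologue_peeling, bool niters_known, long niters,
                  long vf, long cost_threshold, long versioning_threshold)
{
  if (prologue_peeling > 0 && niters_known)
    {
      /* Smallest iteration count that guarantees entering the main loop:
         the largest of VF, the cost-model threshold and the versioning
         threshold, plus the iterations eaten by the peeled prologue.  */
      long niters_for_main = vf > cost_threshold ? vf : cost_threshold;
      if (versioning_threshold > niters_for_main)
        niters_for_main = versioning_threshold;
      niters_for_main += prologue_peeling;
      if (niters <= niters_for_main)
        return false;  /* Main loop might be skipped.  */
    }
  else if (prologue_peeling < 0)
    return false;      /* Peel amount unknown at compile time.  */
  return true;
}
```

For example, with a vectorization factor of 4, no extra thresholds and 3 peeled iterations, 9 known iterations still allow an epilogue (9 > 4 + 3), while 7 do not.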