VECT: Change flow of decrement IV

Message ID 20230530112824.297599-1-juzhe.zhong@rivai.ai
State New
Headers
Series VECT: Change flow of decrement IV |

Commit Message

钟居哲 May 30, 2023, 11:28 a.m. UTC
  From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

Follow Richi's suggestion, I change current decrement IV flow from:

do {
   remain -= MIN (vf, remain);
} while (remain != 0);

into:

do {
   old_remain = remain;
   len = MIN (vf, remain);
   remain -= vf;
} while (old_remain >= vf);

to enhance SCEV.

ALL tests (decrement IV) of RVV are passed.
Ok for trunk?

gcc/ChangeLog:

        * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Change decrement IV flow.
        (vect_set_loop_condition_partial_vectors): Ditto.

---
 gcc/tree-vect-loop-manip.cc | 40 +++++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 13 deletions(-)
  

Comments

Richard Sandiford May 30, 2023, 11:31 a.m. UTC | #1
juzhe.zhong@rivai.ai writes:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>
> Follow Richi's suggestion, I change current decrement IV flow from:
>
> do {
>    remain -= MIN (vf, remain);
> } while (remain != 0);
>
> into:
>
> do {
>    old_remain = remain;
>    len = MIN (vf, remain);
>    remain -= vf;
> } while (old_remain >= vf);
>
> to enhance SCEV.
>
> ALL tests (decrement IV) of RVV are passed.

How does it affect RVV code quality?  I thought you specifically chose
the previous approach because code quality was better that way.

Richard
  
钟居哲 May 30, 2023, 11:36 a.m. UTC | #2
Before this patch:
foo:
ble a2,zero,.L5
csrr a3,vlenb
srli a4,a3,2
.L3:
minu a5,a2,a4
vsetvli zero,a5,e32,m1,ta,ma
vle32.v v2,0(a1)
vle32.v v1,0(a0)
vsetvli t1,zero,e32,m1,ta,ma
vadd.vv v1,v1,v2
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a0)
add a1,a1,a3
add a0,a0,a3
      sub   a2,a2,a5
bne a2,zero,.L3
.L5:
ret

After this patch:

foo:
ble a2,zero,.L5
csrr a3,vlenb
srli a4,a3,2
neg a7,a4   -->>>additional instruction
.L3:
minu a5,a2,a4
vsetvli zero,a5,e32,m1,ta,ma
vle32.v v2,0(a1)
vle32.v v1,0(a0)
vsetvli t1,zero,e32,m1,ta,ma
mv a6,a2  -->>>additional instruction
vadd.vv v1,v1,v2
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a0)
add a1,a1,a3
add a0,a0,a3
add a2,a2,a7
bgtu a6,a4,.L3
.L5:
ret

There is 1 more instruction in preheader and 1 more instruction in loop.
But I think it's OK for RVV since we will definitely be using SELECT_VL so this issue will gone.
As long as this flow is better to power (SCEV)。



juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-30 19:31
To: juzhe.zhong
CC: gcc-patches; rguenther; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
juzhe.zhong@rivai.ai writes:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>
> Follow Richi's suggestion, I change current decrement IV flow from:
>
> do {
>    remain -= MIN (vf, remain);
> } while (remain != 0);
>
> into:
>
> do {
>    old_remain = remain;
>    len = MIN (vf, remain);
>    remain -= vf;
> } while (old_remain >= vf);
>
> to enhance SCEV.
>
> ALL tests (decrement IV) of RVV are passed.
 
How does it affect RVV code quality?  I thought you specifically chose
the previous approach because code quality was better that way.
 
Richard
  
钟居哲 May 30, 2023, 11:38 a.m. UTC | #3
>> How does it affect RVV code quality?  I thought you specifically chose
>> the previous approach because code quality was better that way.
Yes, previous way is better for RVV.  But as I said, we will definitely use SELECT_VL then
in SELECT_VL,  we will using remain - step (produced by SELET_VL).



juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-30 19:31
To: juzhe.zhong
CC: gcc-patches; rguenther; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
juzhe.zhong@rivai.ai writes:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>
> Follow Richi's suggestion, I change current decrement IV flow from:
>
> do {
>    remain -= MIN (vf, remain);
> } while (remain != 0);
>
> into:
>
> do {
>    old_remain = remain;
>    len = MIN (vf, remain);
>    remain -= vf;
> } while (old_remain >= vf);
>
> to enhance SCEV.
>
> ALL tests (decrement IV) of RVV are passed.
 
How does it affect RVV code quality?  I thought you specifically chose
the previous approach because code quality was better that way.
 
Richard
  
Richard Sandiford May 30, 2023, 11:41 a.m. UTC | #4
"juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> Before this patch:
> foo:
> ble a2,zero,.L5
> csrr a3,vlenb
> srli a4,a3,2
> .L3:
> minu a5,a2,a4
> vsetvli zero,a5,e32,m1,ta,ma
> vle32.v v2,0(a1)
> vle32.v v1,0(a0)
> vsetvli t1,zero,e32,m1,ta,ma
> vadd.vv v1,v1,v2
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v1,0(a0)
> add a1,a1,a3
> add a0,a0,a3
>       sub   a2,a2,a5
> bne a2,zero,.L3
> .L5:
> ret
>
> After this patch:
>
> foo:
> ble a2,zero,.L5
> csrr a3,vlenb
> srli a4,a3,2
> neg a7,a4   -->>>additional instruction
> .L3:
> minu a5,a2,a4
> vsetvli zero,a5,e32,m1,ta,ma
> vle32.v v2,0(a1)
> vle32.v v1,0(a0)
> vsetvli t1,zero,e32,m1,ta,ma
> mv a6,a2  -->>>additional instruction
> vadd.vv v1,v1,v2
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v1,0(a0)
> add a1,a1,a3
> add a0,a0,a3
> add a2,a2,a7
> bgtu a6,a4,.L3
> .L5:
> ret
>
> There is 1 more instruction in preheader and 1 more instruction in loop.
> But I think it's OK for RVV since we will definitely be using SELECT_VL so this issue will gone.

But what about cases where you won't be using SELECT_VL, such as SLP?

Richard
  
Richard Sandiford May 30, 2023, 12:01 p.m. UTC | #5
"juzhe.zhong" <juzhe.zhong@rivai.ai> writes:
> Maybe we can include rgroup number into select vl pattern?So that, I always
> use select vl pattern. In my backend, if it is single rgroup,we gen vsetvl,
> otherwise we gen min.

That just seems to be a way of hiding an “is the target RVV?” test though.

IMO targets shouldn't implement SELECT_VL in cases where it's always
equivalent to MIN.

Also, my understanding was that Kewen was happy with the new IV approach
in general.  It's just the (half-expected) effect on SCEV/niters analysis
that's the problem.

I think it's worth at least looking how difficult it would be to extend
niters analysis.  It could end up being beneficial for RVV too (in future).

Thanks,
Richard
  
Richard Biener May 30, 2023, 12:33 p.m. UTC | #6
On Tue, 30 May 2023, juzhe.zhong wrote:

> This patch will generate the number of rgroup ?mov? instructions inside the
> loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> instruction in loop. If this patch is necessary? I think I should find a way
> to fix it.

That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.

> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandiford@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguenther@suse.de>,
> linkw<linkw@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >       sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> this issue will gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
>
  
钟居哲 May 30, 2023, 12:37 p.m. UTC | #7
>> That's odd, you only need to adjust the IV which is used in the exit test,
>> not all the others.
Sorry for my incorrect information. I checked the codegen of both single-rgroup and multi-rgroup.
Their codegen are same behavior, after this patch, there will be 1 more neg instruction in preheader
and 1 more mv instruction inside the loop.



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate the number of rgroup ?mov? instructions inside the
> loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> instruction in loop. If this patch is necessary? I think I should find a way
> to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandiford@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguenther@suse.de>,
> linkw<linkw@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >       sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> this issue will gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
  
钟居哲 May 30, 2023, 2:13 p.m. UTC | #8
Hi, all. After several investigations:
Here is my experiements:
void
single_rgroup (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] + a[i];
}

void
mutiple_rgroup (float *__restrict f, double *__restrict d, int n)
{
  for (int i = 0; i < n; ++i)
    {
      f[i * 2 + 0] = 1;
      f[i * 2 + 1] = 2;
      d[i] = 3;
    }
} 


single_rgroup:
ble a2,zero,.L5
li a4,4
.L3:
minu a5,a2,a4
vsetvli zero,a5,e32,m1,ta,ma
vle32.v v1,0(a0)
vle32.v v2,0(a1)
vsetivli zero,4,e32,m1,ta,ma
mv a3,a2                                       ---------> 1 more "mv" instruction
vadd.vv v1,v1,v2
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a0)
addi a1,a1,16
addi a0,a0,16
addi a2,a2,-4
bgtu a3,a4,.L3
.L5:
ret
.size single_rgroup, .-single_rgroup
.align 1
.globl foo5
.type foo5, @function
mutiple_rgroup :
ble a2,zero,.L11
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
vl1re32.v v2,0(a5)
lui a5,%hi(.LANCHOR0+16)
addi a5,a5,%lo(.LANCHOR0+16)
slli a2,a2,1
li a3,8
li a7,4
vl1re64.v v1,0(a5)
.L9:
minu a5,a2,a3
minu a4,a5,a7
sub a5,a5,a4
addi a6,a0,16
vsetvli zero,a4,e32,m1,ta,ma
vse32.v v2,0(a0)
srli a4,a4,1
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v2,0(a6)
srli a5,a5,1
vsetvli zero,a4,e64,m1,ta,ma
addi a6,a1,16
vse64.v v1,0(a1)
mv a4,a2                                ---------> 1 more "mv" instruction
vsetvli zero,a5,e64,m1,ta,ma
vse64.v v1,0(a6)
addi a0,a0,32
addi a1,a1,32
addi a2,a2,-8
bgtu a4,a3,.L9
.L11:
ret

These are the examples, I have tried enough amount cases. This is the worst case after this patch for RVV:
no matter single-rgroup or multiple-rgroup, we will end up with 1 more "mv" instruction inside the loop.
There are also some examples I have tried with no more instructions (It seems IVOPTS has done some optimization in some cases).

From my side (RVV),  I think one more "mv" instruction is not a big deal if this patch (apply vf step and check conditon by remain > vf)
can help IBM. 

For single-rgroup, this 'mv' instruction will gone when we use SELECT_VL. For multiple-rgroup, the 'mv' instruction remains
but as I said, not a big deal.

If this patch's approach is approved, I will rebase and send SELECT_VL patch again base on this patch.

Looking forward your suggestions.

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate the number of rgroup ?mov? instructions inside the
> loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> instruction in loop. If this patch is necessary? I think I should find a way
> to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandiford@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguenther@suse.de>,
> linkw<linkw@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >       sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> this issue will gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
  
钟居哲 May 30, 2023, 2:47 p.m. UTC | #9
Also, I have investigated power's testcase in RVV:

#include <stdint.h>

#define TEST_ALL(T)                                                            \
  T (int8_t)                                                                   \
  T (uint8_t)                                                                  \
  T (int16_t)                                                                  \
  T (uint16_t)                                                                 \
  T (int32_t)                                                                  \
  T (uint32_t)                                                                 \
  T (int64_t)                                                                  \
  T (uint64_t)                                                                 \
  T (float)                                                                    \
  T (double)
  
#define N 64
#define START 1
#define END 59

#define test(TYPE)                                                             \
  TYPE x_##TYPE[N] __attribute__((aligned(16)));                                \
  void __attribute__((noinline, noclone)) test_npeel_##TYPE() {                \
    TYPE v = 0;                                                                \
    for (unsigned int i = START; i < END; i++) {                               \
      x_##TYPE[i] = v;                                                         \
      v += 1;                                                                  \
    }                                                                          \
  }

TEST_ALL (test)

RVV compile option:
-march=rv64gcv_zba_zbb_zbc_zbs_zvl256b -O2 -ftree-vectorize -fno-vect-cost-model -fno-unroll-loops -ffast-math --param=riscv-autovec-preference=fixed-vlmax -S -fdump-tree-optimized

Before this patch:
void test_npeel_int16_t ()
{
  unsigned long ivtmp.39;
  vector(16) short int vect_vec_iv_.33;
  void * _2;
  vector(16) short int * _8;
  vector(16) short int _10;
  unsigned long loop_len_19;
  unsigned long ivtmp_21;
  unsigned long ivtmp_22;

  <bb 2> [local count: 18146240]:
  ivtmp.39_13 = (unsigned long) &MEM <int16_t[64]> [(void *)&x_int16_t + 2B];

  <bb 3> [local count: 72584963]:
  # vect_vec_iv_.33_12 = PHI <_10(3), { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }(2)>
  # ivtmp_21 = PHI <ivtmp_22(3), 58(2)>
  # ivtmp.39_5 = PHI <ivtmp.39_14(3), ivtmp.39_13(2)>
  loop_len_19 = MIN_EXPR <ivtmp_21, 16>;
  _10 = vect_vec_iv_.33_12 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 };
  _2 = (void *) ivtmp.39_5;
  _8 = &MEM <vector(16) short int> [(short int *)_2];
  .LEN_STORE (_8, 16B, loop_len_19, vect_vec_iv_.33_12, 0);
  ivtmp_22 = ivtmp_21 - loop_len_19;
  ivtmp.39_14 = ivtmp.39_5 + 32;
  if (ivtmp_22 != 0)
    goto <bb 3>; [75.00%]
  else
    goto <bb 4>; [25.00%]

  <bb 4> [local count: 18146240]:
  return;

}

After this patch:
void test_npeel_int16_t ()
{
  <bb 2> [local count: 18146240]:
  .LEN_STORE (&MEM <int16_t[64]> [(void *)&x_int16_t + 2B], 16B, 32, { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }, 0);
  .LEN_STORE (&MEM <int16_t[64]> [(void *)&x_int16_t + 66B], 16B, 26, { 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63 }, 0); [tail call]
  return;

}

It seems this patch fixed power's issue now.

So, My conclusion:
1. This patch does produce 1 more redundant 'mv' instructions in some cases (not all cases). But it can partially be solved by select_vl
    pattern. And even we can't fix this issue, one more 'mv' instruction is not a big deal for RVV.
2. This patch can solve power's issue.

Thanks. 


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate the number of rgroup ?mov? instructions inside the
> loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> instruction in loop. If this patch is necessary? I think I should find a way
> to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandiford@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguenther@suse.de>,
> linkw<linkw@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >       sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> this issue will gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
  
钟居哲 May 30, 2023, 3:05 p.m. UTC | #10
More information of power's testcase:

Before this patch:
test_npeel_int16_t:
lui a4,%hi(.LANCHOR0+130)
lui a3,%hi(.LANCHOR1)
addi a3,a3,%lo(.LANCHOR1)
addi a4,a4,%lo(.LANCHOR0+130)
li a5,58
li a2,16
vsetivli zero,16,e16,m1,ta,ma
vl1re16.v v3,0(a3)
vid.v v1
.L5:
minu a3,a5,a2
vsetvli zero,a3,e16,m1,ta,ma
sub a5,a5,a3
vse16.v v1,0(a4)
vsetivli zero,16,e16,m1,ta,ma
addi a4,a4,32
vadd.vv v1,v1,v3
bne a5,zero,.L5
ret

After this patch:
test_npeel_int16_t:
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
li a1,16
vsetivli zero,16,e16,m1,ta,ma
addi a2,a5,130
vid.v v1
addi a3,a5,162
vadd.vx v4,v1,a1
addi a4,a5,194
li a1,32
vadd.vx v3,v1,a1
vse16.v v1,0(a2)
vse16.v v4,0(a3)
vse16.v v3,0(a4)
addi a5,a5,226
li a1,48
vadd.vx v2,v1,a1
vsetivli zero,10,e16,m1,ta,ma
vse16.v v2,0(a5)
ret

It's obvious, previously, power's testcase in RVV side can not unroll, but after this patch, in RVV side, it can unroll now.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate the number of rgroup ?mov? instructions inside the
> loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> instruction in loop. If this patch is necessary? I think I should find a way
> to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandiford@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguenther@suse.de>,
> linkw<linkw@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >       sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> this issue will gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
  
钟居哲 May 30, 2023, 10:51 p.m. UTC | #11
Hi, Richi.

>> As I said in the PR with the proposed scheme you get a loop around copy of the IV since both the pre and the post decrement values are live at the same time.  
>> If the CPU has a underflow bit set from the subtraction and a branch on that test using that could avoid the copy need.

RISC-V port doesn't have such instructions so such copy is needed in RISC-V port.
But as I said, such copy is very cheap.

So, I wonder whether you will consider take && review this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620086.html 
or not?

Or you have another plan ?

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-31 00:40
To: 钟居哲
CC: richard.sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV


Am 30.05.2023 um 14:38 schrieb 钟居哲 <juzhe.zhong@rivai.ai>:

 

>> That's odd, you only need to adjust the IV which is used in the exit test,
>> not all the others.
Sorry for my incorrect information. I checked the codegen of both single-rgroup and multi-rgroup.
Their codegen are same behavior, after this patch, there will be 1 more neg instruction in preheader
and 1 more mv instruction inside the loop.

As I said in the PR with the proposed scheme you get a loop around copy of the IV since both the pre and the post decrement values are live at the same time.  If the CPU has a underflow bit set from the subtraction and a branch on that test using that could avoid the copy need.



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate the number of rgroup ?mov? instructions inside the
> loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> instruction in loop. If this patch is necessary? I think I should find a way
> to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandiford@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguenther@suse.de>,
> linkw<linkw@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >       sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> this issue will gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
  
钟居哲 May 31, 2023, 1:42 a.m. UTC | #12
Hi,all. I have posted my several investigations:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 

Turns out when "niters is a constant value and vf is a constant value"
This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.

Actually, for current flow:

step = MIN ()
...
remain = remain - step.

I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
So, could you make a decision for this patch?

I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
we should extend SCEV/IVOPTS ?

Thanks. 


juzhe.zhong@rivai.ai
 
From: 钟居哲
Date: 2023-05-30 23:05
To: rguenther
CC: richard.sandiford; gcc-patches; linkw
Subject: Re: Re: [PATCH] VECT: Change flow of decrement IV
More information of power's testcase:

Before this patch:
test_npeel_int16_t:
lui a4,%hi(.LANCHOR0+130)
lui a3,%hi(.LANCHOR1)
addi a3,a3,%lo(.LANCHOR1)
addi a4,a4,%lo(.LANCHOR0+130)
li a5,58
li a2,16
vsetivli zero,16,e16,m1,ta,ma
vl1re16.v v3,0(a3)
vid.v v1
.L5:
minu a3,a5,a2
vsetvli zero,a3,e16,m1,ta,ma
sub a5,a5,a3
vse16.v v1,0(a4)
vsetivli zero,16,e16,m1,ta,ma
addi a4,a4,32
vadd.vv v1,v1,v3
bne a5,zero,.L5
ret

After this patch:
test_npeel_int16_t:
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
li a1,16
vsetivli zero,16,e16,m1,ta,ma
addi a2,a5,130
vid.v v1
addi a3,a5,162
vadd.vx v4,v1,a1
addi a4,a5,194
li a1,32
vadd.vx v3,v1,a1
vse16.v v1,0(a2)
vse16.v v4,0(a3)
vse16.v v3,0(a4)
addi a5,a5,226
li a1,48
vadd.vx v2,v1,a1
vsetivli zero,10,e16,m1,ta,ma
vse16.v v2,0(a5)
ret

It's obvious, previously, power's testcase in RVV side can not unroll, but after this patch, in RVV side, it can unroll now.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate the number of rgroup ?mov? instructions inside the
> loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> instruction in loop. If this patch is necessary? I think I should find a way
> to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandiford@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguenther@suse.de>,
> linkw<linkw@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >       sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> this issue will gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
  
Richard Biener May 31, 2023, 6:41 a.m. UTC | #13
On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:

> Hi?all. I have posted my several investigations:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
> 
> Turns out when "niters is a constant value and vf is a constant value"
> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
> 
> Actually, for current flow:
> 
> step = MIN ()
> ...
> remain = remain - step.
> 
> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
> So, could you make a decision for this patch?
> 
> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
> we should extend SCEV/IVOPTS ?

I don't think we can do anything in SCEV for this which means we'd
need to special-case this in niter analysis, in IVOPTs and any other
passes that might be affected (and not fixed by handling it in niter
analysis).  While improving niter analysis would be good (the user
could write this pattern as well) I do not have time to try
implementing that (I have no idea how ugly or robust it is going to be).

So I think we should patch this up in the vectorizer itself like with
your patch.  I'm going to wait for Richards input though since he
seems to disagree.

Note with SELECT_VL all bets will be off since as I understand the
value it gives can vary from iteration to iteration (but we know
a lower and maybe an upper bound?)

Thanks,
Richard.

> Thanks. 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: ???
> Date: 2023-05-30 23:05
> To: rguenther
> CC: richard.sandiford; gcc-patches; linkw
> Subject: Re: Re: [PATCH] VECT: Change flow of decrement IV
> More information of power's testcase:
> 
> Before this patch:
> test_npeel_int16_t:
> lui a4,%hi(.LANCHOR0+130)
> lui a3,%hi(.LANCHOR1)
> addi a3,a3,%lo(.LANCHOR1)
> addi a4,a4,%lo(.LANCHOR0+130)
> li a5,58
> li a2,16
> vsetivli zero,16,e16,m1,ta,ma
> vl1re16.v v3,0(a3)
> vid.v v1
> .L5:
> minu a3,a5,a2
> vsetvli zero,a3,e16,m1,ta,ma
> sub a5,a5,a3
> vse16.v v1,0(a4)
> vsetivli zero,16,e16,m1,ta,ma
> addi a4,a4,32
> vadd.vv v1,v1,v3
> bne a5,zero,.L5
> ret
> 
> After this patch:
> test_npeel_int16_t:
> lui a5,%hi(.LANCHOR0)
> addi a5,a5,%lo(.LANCHOR0)
> li a1,16
> vsetivli zero,16,e16,m1,ta,ma
> addi a2,a5,130
> vid.v v1
> addi a3,a5,162
> vadd.vx v4,v1,a1
> addi a4,a5,194
> li a1,32
> vadd.vx v3,v1,a1
> vse16.v v1,0(a2)
> vse16.v v4,0(a3)
> vse16.v v3,0(a4)
> addi a5,a5,226
> li a1,48
> vadd.vx v2,v1,a1
> vsetivli zero,10,e16,m1,ta,ma
> vse16.v v2,0(a5)
> ret
> 
> It's obvious, previously, power's testcase in RVV side can not unroll, but after this patch, in RVV side, it can unroll now.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-05-30 20:33
> To: juzhe.zhong
> CC: Richard Sandiford; gcc-patches; linkw
> Subject: Re: [PATCH] VECT: Change flow of decrement IV
> On Tue, 30 May 2023, juzhe.zhong wrote:
>  
> > This patch will generate the number of rgroup ?mov? instructions inside the
> > loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> > instruction in loop. If this patch is necessary? I think I should find a way
> > to fix it.
>  
> That's odd, you only need to adjust the IV which is used in the exit test,
> not all the others.
>  
> > ---- Replied Message ----
> > From
> > Richard Sandiford<richard.sandiford@arm.com>
> > Date
> > 05/30/2023 19:41
> > To
> > juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> > Cc
> > gcc-patches<gcc-patches@gcc.gnu.org>,
> > rguenther<rguenther@suse.de>,
> > linkw<linkw@linux.ibm.com>
> > Subject
> > Re: [PATCH] VECT: Change flow of decrement IV
> > "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > > Before this patch:
> > > foo:
> > > ble a2,zero,.L5
> > > csrr a3,vlenb
> > > srli a4,a3,2
> > > .L3:
> > > minu a5,a2,a4
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vle32.v v2,0(a1)
> > > vle32.v v1,0(a0)
> > > vsetvli t1,zero,e32,m1,ta,ma
> > > vadd.vv v1,v1,v2
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a0)
> > > add a1,a1,a3
> > > add a0,a0,a3
> > >       sub   a2,a2,a5
> > > bne a2,zero,.L3
> > > .L5:
> > > ret
> > >
> > > After this patch:
> > >
> > > foo:
> > > ble a2,zero,.L5
> > > csrr a3,vlenb
> > > srli a4,a3,2
> > > neg a7,a4   -->>>additional instruction
> > > .L3:
> > > minu a5,a2,a4
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vle32.v v2,0(a1)
> > > vle32.v v1,0(a0)
> > > vsetvli t1,zero,e32,m1,ta,ma
> > > mv a6,a2  -->>>additional instruction
> > > vadd.vv v1,v1,v2
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a0)
> > > add a1,a1,a3
> > > add a0,a0,a3
> > > add a2,a2,a7
> > > bgtu a6,a4,.L3
> > > .L5:
> > > ret
> > >
> > > There is 1 more instruction in preheader and 1 more instruction in loop.
> > > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> > this issue will gone.
> > 
> > But what about cases where you won't be using SELECT_VL, such as SLP?
> > 
> > Richard
> > 
> > 
>  
>
  
钟居哲 May 31, 2023, 6:50 a.m. UTC | #14
Hi, Richi.

>> Note with SELECT_VL all bets will be off since as I understand the
>> value it gives can vary from iteration to iteration (but we know
>> a lower and maybe an upper bound?)
Yes, in RVV side, the SELECT_VL output can be in range of [ceil(avl/2), vlmax], 
can be any value between the range depending on the hardware implementation.

>> So I think we should patch this up in the vectorizer itself like with
>> your patch.  I'm going to wait for Richards input though since he
>> seems to disagree.

According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971, 
Kewen is happy with this patch, turns out this patch can fix power's issue.
So, Let's wait for Richard's comments.

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-31 14:41
To: juzhe.zhong@rivai.ai
CC: richard.sandiford; gcc-patches; linkw
Subject: Re: Re: [PATCH] VECT: Change flow of decrement IV
On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
 
> Hi?all. I have posted my several investigations:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
> 
> Turns out when "niters is a constant value and vf is a constant value"
> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
> 
> Actually, for current flow:
> 
> step = MIN ()
> ...
> remain = remain - step.
> 
> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
> So, could you make a decision for this patch?
> 
> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
> we should extend SCEV/IVOPTS ?
 
I don't think we can do anything in SCEV for this which means we'd
need to special-case this in niter analysis, in IVOPTs and any other
passes that might be affected (and not fixed by handling it in niter
analysis).  While improving niter analysis would be good (the user
could write this pattern as well) I do not have time to try
implementing that (I have no idea how ugly or robust it is going to be).
 
So I think we should patch this up in the vectorizer itself like with
your patch.  I'm going to wait for Richards input though since he
seems to disagree.
 
Note with SELECT_VL all bets will be off since as I understand the
value it gives can vary from iteration to iteration (but we know
a lower and maybe an upper bound?)
 
Thanks,
Richard.
 
> Thanks. 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: ???
> Date: 2023-05-30 23:05
> To: rguenther
> CC: richard.sandiford; gcc-patches; linkw
> Subject: Re: Re: [PATCH] VECT: Change flow of decrement IV
> More information of power's testcase:
> 
> Before this patch:
> test_npeel_int16_t:
> lui a4,%hi(.LANCHOR0+130)
> lui a3,%hi(.LANCHOR1)
> addi a3,a3,%lo(.LANCHOR1)
> addi a4,a4,%lo(.LANCHOR0+130)
> li a5,58
> li a2,16
> vsetivli zero,16,e16,m1,ta,ma
> vl1re16.v v3,0(a3)
> vid.v v1
> .L5:
> minu a3,a5,a2
> vsetvli zero,a3,e16,m1,ta,ma
> sub a5,a5,a3
> vse16.v v1,0(a4)
> vsetivli zero,16,e16,m1,ta,ma
> addi a4,a4,32
> vadd.vv v1,v1,v3
> bne a5,zero,.L5
> ret
> 
> After this patch:
> test_npeel_int16_t:
> lui a5,%hi(.LANCHOR0)
> addi a5,a5,%lo(.LANCHOR0)
> li a1,16
> vsetivli zero,16,e16,m1,ta,ma
> addi a2,a5,130
> vid.v v1
> addi a3,a5,162
> vadd.vx v4,v1,a1
> addi a4,a5,194
> li a1,32
> vadd.vx v3,v1,a1
> vse16.v v1,0(a2)
> vse16.v v4,0(a3)
> vse16.v v3,0(a4)
> addi a5,a5,226
> li a1,48
> vadd.vx v2,v1,a1
> vsetivli zero,10,e16,m1,ta,ma
> vse16.v v2,0(a5)
> ret
> 
> It's obvious, previously, power's testcase in RVV side can not unroll, but after this patch, in RVV side, it can unroll now.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-05-30 20:33
> To: juzhe.zhong
> CC: Richard Sandiford; gcc-patches; linkw
> Subject: Re: [PATCH] VECT: Change flow of decrement IV
> On Tue, 30 May 2023, juzhe.zhong wrote:
>  
> > This patch will generate the number of rgroup ?mov? instructions inside the
> > loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
> > instruction in loop. If this patch is necessary? I think I should find a way
> > to fix it.
>  
> That's odd, you only need to adjust the IV which is used in the exit test,
> not all the others.
>  
> > ---- Replied Message ----
> > From
> > Richard Sandiford<richard.sandiford@arm.com>
> > Date
> > 05/30/2023 19:41
> > To
> > juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> > Cc
> > gcc-patches<gcc-patches@gcc.gnu.org>,
> > rguenther<rguenther@suse.de>,
> > linkw<linkw@linux.ibm.com>
> > Subject
> > Re: [PATCH] VECT: Change flow of decrement IV
> > "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > > Before this patch:
> > > foo:
> > > ble a2,zero,.L5
> > > csrr a3,vlenb
> > > srli a4,a3,2
> > > .L3:
> > > minu a5,a2,a4
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vle32.v v2,0(a1)
> > > vle32.v v1,0(a0)
> > > vsetvli t1,zero,e32,m1,ta,ma
> > > vadd.vv v1,v1,v2
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a0)
> > > add a1,a1,a3
> > > add a0,a0,a3
> > >       sub   a2,a2,a5
> > > bne a2,zero,.L3
> > > .L5:
> > > ret
> > >
> > > After this patch:
> > >
> > > foo:
> > > ble a2,zero,.L5
> > > csrr a3,vlenb
> > > srli a4,a3,2
> > > neg a7,a4   -->>>additional instruction
> > > .L3:
> > > minu a5,a2,a4
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vle32.v v2,0(a1)
> > > vle32.v v1,0(a0)
> > > vsetvli t1,zero,e32,m1,ta,ma
> > > mv a6,a2  -->>>additional instruction
> > > vadd.vv v1,v1,v2
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a0)
> > > add a1,a1,a3
> > > add a0,a0,a3
> > > add a2,a2,a7
> > > bgtu a6,a4,.L3
> > > .L5:
> > > ret
> > >
> > > There is 1 more instruction in preheader and 1 more instruction in loop.
> > > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> > this issue will gone.
> > 
> > But what about cases where you won't be using SELECT_VL, such as SLP?
> > 
> > Richard
> > 
> > 
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
  
Richard Sandiford May 31, 2023, 7:28 a.m. UTC | #15
Richard Biener <rguenther@suse.de> writes:
> On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
>
>> Hi?all. I have posted my several investigations:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
>> 
>> Turns out when "niters is a constant value and vf is a constant value"
>> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
>> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
>> 
>> Actually, for current flow:
>> 
>> step = MIN ()
>> ...
>> remain = remain - step.
>> 
>> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
>> So, could you make a decision for this patch?
>> 
>> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
>> we should extend SCEV/IVOPTS ?
>
> I don't think we can do anything in SCEV for this which means we'd
> need to special-case this in niter analysis, in IVOPTs and any other
> passes that might be affected (and not fixed by handling it in niter
> analysis).  While improving niter analysis would be good (the user
> could write this pattern as well) I do not have time to try
> implementing that (I have no idea how ugly or robust it is going to be).
>
> So I think we should patch this up in the vectorizer itself like with
> your patch.  I'm going to wait for Richards input though since he
> seems to disagree.

I think my main disagreement is that the IV phi can be analysed
as a SCEV with sufficient work (realising that the MIN result is
always VF when the latch is executed).  That SCEV might be useful
“as is” for things like IVOPTS, without specific work in those passes.
(Although perhaps not too useful, since most other IVs will be upcounting.)

I don't object though.  It just feels like we're giving up easily.
And that's a bit frustrating, since this potential problem was flagged
ahead of time.

> Note with SELECT_VL all bets will be off since as I understand the
> value it gives can vary from iteration to iteration (but we know
> a lower and maybe an upper bound?)

Right.  All IVs will have a variable step for SELECT_VL.

Thanks,
Richard
  
钟居哲 May 31, 2023, 7:36 a.m. UTC | #16
Hi, Richard.

>> I don't object though.  It just feels like we're giving up easily.
>> And that's a bit frustrating, since this potential problem was flagged
>> ahead of time.

I can take a look at it. Would you mind giving me some hints?
Should I do this in which PASS ? "ivopts" PASS?
Is that right that we can enhance analysis when we see the statement as follows:
remain = remain - step and step is coming from a MIN_EXPR (remain, vf).
Then what we need to do?
 
Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-31 15:28
To: Richard Biener
CC: juzhe.zhong\@rivai.ai; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
Richard Biener <rguenther@suse.de> writes:
> On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
>
>> Hi?all. I have posted my several investigations:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
>> 
>> Turns out when "niters is a constant value and vf is a constant value"
>> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
>> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
>> 
>> Actually, for current flow:
>> 
>> step = MIN ()
>> ...
>> remain = remain - step.
>> 
>> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
>> So, could you make a decision for this patch?
>> 
>> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
>> we should extend SCEV/IVOPTS ?
>
> I don't think we can do anything in SCEV for this which means we'd
> need to special-case this in niter analysis, in IVOPTs and any other
> passes that might be affected (and not fixed by handling it in niter
> analysis).  While improving niter analysis would be good (the user
> could write this pattern as well) I do not have time to try
> implementing that (I have no idea how ugly or robust it is going to be).
>
> So I think we should patch this up in the vectorizer itself like with
> your patch.  I'm going to wait for Richards input though since he
> seems to disagree.
 
I think my main disagreement is that the IV phi can be analysed
as a SCEV with sufficient work (realising that the MIN result is
always VF when the latch is executed).  That SCEV might be useful
“as is” for things like IVOPTS, without specific work in those passes.
(Although perhaps not too useful, since most other IVs will be upcounting.)
 
I don't object though.  It just feels like we're giving up easily.
And that's a bit frustrating, since this potential problem was flagged
ahead of time.
 
> Note with SELECT_VL all bets will be off since as I understand the
> value it gives can vary from iteration to iteration (but we know
> a lower and maybe an upper bound?)
 
Right.  All IVs will have a variable step for SELECT_VL.
 
Thanks,
Richard
  
Richard Biener May 31, 2023, 7:38 a.m. UTC | #17
On Wed, 31 May 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
> >
> >> Hi?all. I have posted my several investigations:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
> >> 
> >> Turns out when "niters is a constant value and vf is a constant value"
> >> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
> >> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
> >> 
> >> Actually, for current flow:
> >> 
> >> step = MIN ()
> >> ...
> >> remain = remain - step.
> >> 
> >> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
> >> So, could you make a decision for this patch?
> >> 
> >> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
> >> we should extend SCEV/IVOPTS ?
> >
> > I don't think we can do anything in SCEV for this which means we'd
> > need to special-case this in niter analysis, in IVOPTs and any other
> > passes that might be affected (and not fixed by handling it in niter
> > analysis).  While improving niter analysis would be good (the user
> > could write this pattern as well) I do not have time to try
> > implementing that (I have no idea how ugly or robust it is going to be).
> >
> > So I think we should patch this up in the vectorizer itself like with
> > your patch.  I'm going to wait for Richards input though since he
> > seems to disagree.
> 
> I think my main disagreement is that the IV phi can be analysed
> as a SCEV with sufficient work (realising that the MIN result is
> always VF when the latch is executed).  That SCEV might be useful
> ?as is? for things like IVOPTS, without specific work in those passes.
> (Although perhaps not too useful, since most other IVs will be upcounting.)

I think we'd need another API for SCEV there then,
analyze_scalar_evolution_for_latch () so we can disregard the
value on the exit edges then.  That means we'd still need to touch
all users and decide whether it's safe to use that or not.

> I don't object though.  It just feels like we're giving up easily.
> And that's a bit frustrating, since this potential problem was flagged
> ahead of time.

Well, I expect that massaging SCEV and niter analysis will take
up quite some developer time while avoiding the situation in
the vectorizer is possible (and would fix the observed regressions).
We can always improve later here and I'd suggest to file an
enhancement bugreport with a simple C testcase using this kind of
iteration.

I'm just saying that to go forward the vectorizer change looks
more promising (also considering the pace RISC-V people are working at 
...)

Richard.

> > Note with SELECT_VL all bets will be off since as I understand the
> > value it gives can vary from iteration to iteration (but we know
> > a lower and maybe an upper bound?)
> 
> Right.  All IVs will have a variable step for SELECT_VL.
> 
> Thanks,
> Richard
>
  
Kewen.Lin May 31, 2023, 7:38 a.m. UTC | #18
> Hi, Richi.
> 
>>> Note with SELECT_VL all bets will be off since as I understand the
>>> value it gives can vary from iteration to iteration (but we know
>>> a lower and maybe an upper bound?)
> Yes, in RVV side, the SELECT_VL output can be in range of [ceil(avl/2), vlmax], 
> can be any value between the range depending on the hardware implementation.
> 
>>> So I think we should patch this up in the vectorizer itself like with
>>> your patch.  I'm going to wait for Richards input though since he
>>> seems to disagree.
> 
> According tohttps://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971, <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971,> 
> Kewen is happy with this patch, turns out this patch can fix power's issue.

Yeah, the exposed degradation and failures can be fixed by this patch.
I'd expect both approaches (this patch or extending niter analysis and
others) should work for the exposed issues.

A new finding is that my SPEC2017 rerun with this patch exposed some
verification failures, I made a regression test on Power10, it showed
a few failures too (mainly from fortran).  By looking into one of them
(case gfortran.dg/array_alloc_2.f90), I think the patch needs some
adjustment on chosen code according to exit_edge->flags like:

diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index ef28711c58f..5d518460b6d 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -892,8 +892,9 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
     {
       gcc_assert (compare_step);
-      cond_stmt = gimple_build_cond (GT_EXPR, test_ctrl, compare_step,
-				     NULL_TREE, NULL_TREE);
+      tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : GT_EXPR;
+      cond_stmt = gimple_build_cond (code, test_ctrl, compare_step, NULL_TREE,
+				     NULL_TREE);
       gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
     }
   else

I'm running regression testing again based on this adjustment, will see
if it can fix all exposed failures.

BR,
Kewen

> So, Let's wait for Richard's comments.
> 
> Thanks.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> juzhe.zhong@rivai.ai
> 
>      
>     *From:* Richard Biener <mailto:rguenther@suse.de>
>     *Date:* 2023-05-31 14:41
>     *To:* juzhe.zhong@rivai.ai <mailto:juzhe.zhong@rivai.ai>
>     *CC:* richard.sandiford <mailto:richard.sandiford@arm.com>; gcc-patches <mailto:gcc-patches@gcc.gnu.org>; linkw <mailto:linkw@linux.ibm.com>
>     *Subject:* Re: Re: [PATCH] VECT: Change flow of decrement IV
>     On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
>      
>     > Hi?all. I have posted my several investigations:
>     > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html
>     > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html
>     > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html
>     >
>     > Turns out when "niters is a constant value and vf is a constant value"
>     > This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
>     > Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
>     >
>     > Actually, for current flow:
>     >
>     > step = MIN ()
>     > ...
>     > remain = remain - step.
>     >
>     > I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
>     > So, could you make a decision for this patch?
>     >
>     > I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
>     > we should extend SCEV/IVOPTS ?
>      
>     I don't think we can do anything in SCEV for this which means we'd
>     need to special-case this in niter analysis, in IVOPTs and any other
>     passes that might be affected (and not fixed by handling it in niter
>     analysis).  While improving niter analysis would be good (the user
>     could write this pattern as well) I do not have time to try
>     implementing that (I have no idea how ugly or robust it is going to be).
>      
>     So I think we should patch this up in the vectorizer itself like with
>     your patch.  I'm going to wait for Richards input though since he
>     seems to disagree.
>      
>     Note with SELECT_VL all bets will be off since as I understand the
>     value it gives can vary from iteration to iteration (but we know
>     a lower and maybe an upper bound?)
>      
>     Thanks,
>     Richard.
>      
>     > Thanks.
>     >
>     >
>     > juzhe.zhong@rivai.ai
>     > 
>     > From: ???
>     > Date: 2023-05-30 23:05
>     > To: rguenther
>     > CC: richard.sandiford; gcc-patches; linkw
>     > Subject: Re: Re: [PATCH] VECT: Change flow of decrement IV
>     > More information of power's testcase:
>     >
>     > Before this patch:
>     > test_npeel_int16_t:
>     > lui a4,%hi(.LANCHOR0+130)
>     > lui a3,%hi(.LANCHOR1)
>     > addi a3,a3,%lo(.LANCHOR1)
>     > addi a4,a4,%lo(.LANCHOR0+130)
>     > li a5,58
>     > li a2,16
>     > vsetivli zero,16,e16,m1,ta,ma
>     > vl1re16.v v3,0(a3)
>     > vid.v v1
>     > .L5:
>     > minu a3,a5,a2
>     > vsetvli zero,a3,e16,m1,ta,ma
>     > sub a5,a5,a3
>     > vse16.v v1,0(a4)
>     > vsetivli zero,16,e16,m1,ta,ma
>     > addi a4,a4,32
>     > vadd.vv v1,v1,v3
>     > bne a5,zero,.L5
>     > ret
>     >
>     > After this patch:
>     > test_npeel_int16_t:
>     > lui a5,%hi(.LANCHOR0)
>     > addi a5,a5,%lo(.LANCHOR0)
>     > li a1,16
>     > vsetivli zero,16,e16,m1,ta,ma
>     > addi a2,a5,130
>     > vid.v v1
>     > addi a3,a5,162
>     > vadd.vx v4,v1,a1
>     > addi a4,a5,194
>     > li a1,32
>     > vadd.vx v3,v1,a1
>     > vse16.v v1,0(a2)
>     > vse16.v v4,0(a3)
>     > vse16.v v3,0(a4)
>     > addi a5,a5,226
>     > li a1,48
>     > vadd.vx v2,v1,a1
>     > vsetivli zero,10,e16,m1,ta,ma
>     > vse16.v v2,0(a5)
>     > ret
>     >
>     > It's obvious, previously, power's testcase in RVV side can not unroll, but after this patch, in RVV side, it can unroll now.
>     >
>     >
>     > juzhe.zhong@rivai.ai
>     > 
>     > From: Richard Biener
>     > Date: 2023-05-30 20:33
>     > To: juzhe.zhong
>     > CC: Richard Sandiford; gcc-patches; linkw
>     > Subject: Re: [PATCH] VECT: Change flow of decrement IV
>     > On Tue, 30 May 2023, juzhe.zhong wrote:
>     > 
>     > > This patch will generate the number of rgroup ?mov? instructions inside the
>     > > loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
>     > > instruction in loop. If this patch is necessary? I think I should find a way
>     > > to fix it.
>     > 
>     > That's odd, you only need to adjust the IV which is used in the exit test,
>     > not all the others.
>     > 
>     > > ---- Replied Message ----
>     > > From
>     > > Richard Sandiford<richard.sandiford@arm.com>
>     > > Date
>     > > 05/30/2023 19:41
>     > > To
>     > > juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
>     > > Cc
>     > > gcc-patches<gcc-patches@gcc.gnu.org>,
>     > > rguenther<rguenther@suse.de>,
>     > > linkw<linkw@linux.ibm.com>
>     > > Subject
>     > > Re: [PATCH] VECT: Change flow of decrement IV
>     > > "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
>     > > > Before this patch:
>     > > > foo:
>     > > > ble a2,zero,.L5
>     > > > csrr a3,vlenb
>     > > > srli a4,a3,2
>     > > > .L3:
>     > > > minu a5,a2,a4
>     > > > vsetvli zero,a5,e32,m1,ta,ma
>     > > > vle32.v v2,0(a1)
>     > > > vle32.v v1,0(a0)
>     > > > vsetvli t1,zero,e32,m1,ta,ma
>     > > > vadd.vv v1,v1,v2
>     > > > vsetvli zero,a5,e32,m1,ta,ma
>     > > > vse32.v v1,0(a0)
>     > > > add a1,a1,a3
>     > > > add a0,a0,a3
>     > > >       sub   a2,a2,a5
>     > > > bne a2,zero,.L3
>     > > > .L5:
>     > > > ret
>     > > >
>     > > > After this patch:
>     > > >
>     > > > foo:
>     > > > ble a2,zero,.L5
>     > > > csrr a3,vlenb
>     > > > srli a4,a3,2
>     > > > neg a7,a4   -->>>additional instruction
>     > > > .L3:
>     > > > minu a5,a2,a4
>     > > > vsetvli zero,a5,e32,m1,ta,ma
>     > > > vle32.v v2,0(a1)
>     > > > vle32.v v1,0(a0)
>     > > > vsetvli t1,zero,e32,m1,ta,ma
>     > > > mv a6,a2  -->>>additional instruction
>     > > > vadd.vv v1,v1,v2
>     > > > vsetvli zero,a5,e32,m1,ta,ma
>     > > > vse32.v v1,0(a0)
>     > > > add a1,a1,a3
>     > > > add a0,a0,a3
>     > > > add a2,a2,a7
>     > > > bgtu a6,a4,.L3
>     > > > .L5:
>     > > > ret
>     > > >
>     > > > There is 1 more instruction in preheader and 1 more instruction in loop.
>     > > > But I think it's OK for RVV since we will definitely be using SELECT_VL so
>     > > this issue will gone.
>     > >
>     > > But what about cases where you won't be using SELECT_VL, such as SLP?
>     > >
>     > > Richard
>     > >
>     > >
>     > 
>     >
>      
>     -- 
>     Richard Biener <rguenther@suse.de>
>     SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
>     Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
>     HRB 36809 (AG Nuernberg)
>      
>
  
钟居哲 May 31, 2023, 7:49 a.m. UTC | #19
>> I'm just saying that to go forward the vectorizer change looks
>>more promising (also considering the pace RISC-V people are working at
>>...)

Yeah,  RVV needs a lot of middle-end support:
SELECT_VL, LEN_MASK_LOAD/LEN_MASK_STORE,.....etc

LEN_ADD for RVV reduction support like COND_ADD for ARM SVE...etc

SELECT_VL is still pending.

Without support in middle-end, GCC can not support powerful auto-vectorization (Performance will be much worse than RVV LLVM).
And unfortunately, I am the only guy working on middle-end support of RVV auto-vectorization. :)

I think we can make this patch merged and record the enhancement of SCEV in bugzilla to see we can improve that in the future.

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-31 15:38
To: Richard Sandiford
CC: juzhe.zhong@rivai.ai; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Wed, 31 May 2023, Richard Sandiford wrote:
 
> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
> >
> >> Hi?all. I have posted my several investigations:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
> >> 
> >> Turns out when "niters is a constant value and vf is a constant value"
> >> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
> >> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
> >> 
> >> Actually, for current flow:
> >> 
> >> step = MIN ()
> >> ...
> >> remain = remain - step.
> >> 
> >> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
> >> So, could you make a decision for this patch?
> >> 
> >> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
> >> we should extend SCEV/IVOPTS ?
> >
> > I don't think we can do anything in SCEV for this which means we'd
> > need to special-case this in niter analysis, in IVOPTs and any other
> > passes that might be affected (and not fixed by handling it in niter
> > analysis).  While improving niter analysis would be good (the user
> > could write this pattern as well) I do not have time to try
> > implementing that (I have no idea how ugly or robust it is going to be).
> >
> > So I think we should patch this up in the vectorizer itself like with
> > your patch.  I'm going to wait for Richards input though since he
> > seems to disagree.
> 
> I think my main disagreement is that the IV phi can be analysed
> as a SCEV with sufficient work (realising that the MIN result is
> always VF when the latch is executed).  That SCEV might be useful
> ?as is? for things like IVOPTS, without specific work in those passes.
> (Although perhaps not too useful, since most other IVs will be upcounting.)
 
I think we'd need another API for SCEV there then,
analyze_scalar_evolution_for_latch () so we can disregard the
value on the exit edges then.  That means we'd still need to touch
all users and decide whether it's safe to use that or not.
 
> I don't object though.  It just feels like we're giving up easily.
> And that's a bit frustrating, since this potential problem was flagged
> ahead of time.
 
Well, I expect that massaging SCEV and niter analysis will take
up quite some developer time while avoiding the situation in
the vectorizer is possible (and would fix the observed regressions).
We can always improve later here and I'd suggest to file an
enhancement bugreport with a simple C testcase using this kind of
iteration.
 
I'm just saying that to go forward the vectorizer change looks
more promising (also considering the pace RISC-V people are working at 
...)
 
Richard.
 
> > Note with SELECT_VL all bets will be off since as I understand the
> > value it gives can vary from iteration to iteration (but we know
> > a lower and maybe an upper bound?)
> 
> Right.  All IVs will have a variable step for SELECT_VL.
> 
> Thanks,
> Richard
>
  
钟居哲 May 31, 2023, 7:50 a.m. UTC | #20
Oh, it's correct fix. Thanks for catching this.




juzhe.zhong@rivai.ai
 
From: Kewen.Lin
Date: 2023-05-31 15:38
To: juzhe.zhong@rivai.ai
CC: richard.sandiford; gcc-patches; rguenther
Subject: Re: [PATCH] VECT: Change flow of decrement IV
> Hi, Richi.
> 
>>> Note with SELECT_VL all bets will be off since as I understand the
>>> value it gives can vary from iteration to iteration (but we know
>>> a lower and maybe an upper bound?)
> Yes, in RVV side, the SELECT_VL output can be in range of [ceil(avl/2), vlmax], 
> can be any value between the range depending on the hardware implementation.
> 
>>> So I think we should patch this up in the vectorizer itself like with
>>> your patch.  I'm going to wait for Richards input though since he
>>> seems to disagree.
> 
> According tohttps://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971, <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971,> 
> Kewen is happy with this patch, turns out this patch can fix power's issue.
 
Yeah, the exposed degradation and failures can be fixed by this patch.
I'd expect both approaches (this patch or extending niter analysis and
others) should work for the exposed issues.
 
A new finding is that my SPEC2017 rerun with this patch exposed some
verification failures, I made a regression test on Power10, it showed
a few failures too (mainly from fortran).  By looking into one of them
(case gfortran.dg/array_alloc_2.f90), I think the patch needs some
adjustment on chosen code according to exit_edge->flags like:
 
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index ef28711c58f..5d518460b6d 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -892,8 +892,9 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
     {
       gcc_assert (compare_step);
-      cond_stmt = gimple_build_cond (GT_EXPR, test_ctrl, compare_step,
-      NULL_TREE, NULL_TREE);
+      tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : GT_EXPR;
+      cond_stmt = gimple_build_cond (code, test_ctrl, compare_step, NULL_TREE,
+      NULL_TREE);
       gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
     }
   else
 
I'm running regression testing again based on this adjustment, will see
if it can fix all exposed failures.
 
BR,
Kewen
 
> So, Let's wait for Richard's comments.
> 
> Thanks.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> juzhe.zhong@rivai.ai
> 
>      
>     *From:* Richard Biener <mailto:rguenther@suse.de>
>     *Date:* 2023-05-31 14:41
>     *To:* juzhe.zhong@rivai.ai <mailto:juzhe.zhong@rivai.ai>
>     *CC:* richard.sandiford <mailto:richard.sandiford@arm.com>; gcc-patches <mailto:gcc-patches@gcc.gnu.org>; linkw <mailto:linkw@linux.ibm.com>
>     *Subject:* Re: Re: [PATCH] VECT: Change flow of decrement IV
>     On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
>      
>     > Hi?all. I have posted my several investigations:
>     > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html
>     > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html
>     > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html
>     >
>     > Turns out when "niters is a constant value and vf is a constant value"
>     > This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
>     > Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
>     >
>     > Actually, for current flow:
>     >
>     > step = MIN ()
>     > ...
>     > remain = remain - step.
>     >
>     > I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
>     > So, could you make a decision for this patch?
>     >
>     > I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
>     > we should extend SCEV/IVOPTS ?
>      
>     I don't think we can do anything in SCEV for this which means we'd
>     need to special-case this in niter analysis, in IVOPTs and any other
>     passes that might be affected (and not fixed by handling it in niter
>     analysis).  While improving niter analysis would be good (the user
>     could write this pattern as well) I do not have time to try
>     implementing that (I have no idea how ugly or robust it is going to be).
>      
>     So I think we should patch this up in the vectorizer itself like with
>     your patch.  I'm going to wait for Richards input though since he
>     seems to disagree.
>      
>     Note with SELECT_VL all bets will be off since as I understand the
>     value it gives can vary from iteration to iteration (but we know
>     a lower and maybe an upper bound?)
>      
>     Thanks,
>     Richard.
>      
>     > Thanks.
>     >
>     >
>     > juzhe.zhong@rivai.ai
>     > 
>     > From: ???
>     > Date: 2023-05-30 23:05
>     > To: rguenther
>     > CC: richard.sandiford; gcc-patches; linkw
>     > Subject: Re: Re: [PATCH] VECT: Change flow of decrement IV
>     > More information of power's testcase:
>     >
>     > Before this patch:
>     > test_npeel_int16_t:
>     > lui a4,%hi(.LANCHOR0+130)
>     > lui a3,%hi(.LANCHOR1)
>     > addi a3,a3,%lo(.LANCHOR1)
>     > addi a4,a4,%lo(.LANCHOR0+130)
>     > li a5,58
>     > li a2,16
>     > vsetivli zero,16,e16,m1,ta,ma
>     > vl1re16.v v3,0(a3)
>     > vid.v v1
>     > .L5:
>     > minu a3,a5,a2
>     > vsetvli zero,a3,e16,m1,ta,ma
>     > sub a5,a5,a3
>     > vse16.v v1,0(a4)
>     > vsetivli zero,16,e16,m1,ta,ma
>     > addi a4,a4,32
>     > vadd.vv v1,v1,v3
>     > bne a5,zero,.L5
>     > ret
>     >
>     > After this patch:
>     > test_npeel_int16_t:
>     > lui a5,%hi(.LANCHOR0)
>     > addi a5,a5,%lo(.LANCHOR0)
>     > li a1,16
>     > vsetivli zero,16,e16,m1,ta,ma
>     > addi a2,a5,130
>     > vid.v v1
>     > addi a3,a5,162
>     > vadd.vx v4,v1,a1
>     > addi a4,a5,194
>     > li a1,32
>     > vadd.vx v3,v1,a1
>     > vse16.v v1,0(a2)
>     > vse16.v v4,0(a3)
>     > vse16.v v3,0(a4)
>     > addi a5,a5,226
>     > li a1,48
>     > vadd.vx v2,v1,a1
>     > vsetivli zero,10,e16,m1,ta,ma
>     > vse16.v v2,0(a5)
>     > ret
>     >
>     > It's obvious, previously, power's testcase in RVV side can not unroll, but after this patch, in RVV side, it can unroll now.
>     >
>     >
>     > juzhe.zhong@rivai.ai
>     > 
>     > From: Richard Biener
>     > Date: 2023-05-30 20:33
>     > To: juzhe.zhong
>     > CC: Richard Sandiford; gcc-patches; linkw
>     > Subject: Re: [PATCH] VECT: Change flow of decrement IV
>     > On Tue, 30 May 2023, juzhe.zhong wrote:
>     > 
>     > > This patch will generate the number of rgroup ?mov? instructions inside the
>     > > loop. This is unacceptable. For example?if number of rgroups=3? will be 3 more
>     > > instruction in loop. If this patch is necessary? I think I should find a way
>     > > to fix it.
>     > 
>     > That's odd, you only need to adjust the IV which is used in the exit test,
>     > not all the others.
>     > 
>     > > ---- Replied Message ----
>     > > From
>     > > Richard Sandiford<richard.sandiford@arm.com>
>     > > Date
>     > > 05/30/2023 19:41
>     > > To
>     > > juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
>     > > Cc
>     > > gcc-patches<gcc-patches@gcc.gnu.org>,
>     > > rguenther<rguenther@suse.de>,
>     > > linkw<linkw@linux.ibm.com>
>     > > Subject
>     > > Re: [PATCH] VECT: Change flow of decrement IV
>     > > "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
>     > > > Before this patch:
>     > > > foo:
>     > > > ble a2,zero,.L5
>     > > > csrr a3,vlenb
>     > > > srli a4,a3,2
>     > > > .L3:
>     > > > minu a5,a2,a4
>     > > > vsetvli zero,a5,e32,m1,ta,ma
>     > > > vle32.v v2,0(a1)
>     > > > vle32.v v1,0(a0)
>     > > > vsetvli t1,zero,e32,m1,ta,ma
>     > > > vadd.vv v1,v1,v2
>     > > > vsetvli zero,a5,e32,m1,ta,ma
>     > > > vse32.v v1,0(a0)
>     > > > add a1,a1,a3
>     > > > add a0,a0,a3
>     > > >       sub   a2,a2,a5
>     > > > bne a2,zero,.L3
>     > > > .L5:
>     > > > ret
>     > > >
>     > > > After this patch:
>     > > >
>     > > > foo:
>     > > > ble a2,zero,.L5
>     > > > csrr a3,vlenb
>     > > > srli a4,a3,2
>     > > > neg a7,a4   -->>>additional instruction
>     > > > .L3:
>     > > > minu a5,a2,a4
>     > > > vsetvli zero,a5,e32,m1,ta,ma
>     > > > vle32.v v2,0(a1)
>     > > > vle32.v v1,0(a0)
>     > > > vsetvli t1,zero,e32,m1,ta,ma
>     > > > mv a6,a2  -->>>additional instruction
>     > > > vadd.vv v1,v1,v2
>     > > > vsetvli zero,a5,e32,m1,ta,ma
>     > > > vse32.v v1,0(a0)
>     > > > add a1,a1,a3
>     > > > add a0,a0,a3
>     > > > add a2,a2,a7
>     > > > bgtu a6,a4,.L3
>     > > > .L5:
>     > > > ret
>     > > >
>     > > > There is 1 more instruction in preheader and 1 more instruction in loop.
>     > > > But I think it's OK for RVV since we will definitely be using SELECT_VL so
>     > > this issue will gone.
>     > >
>     > > But what about cases where you won't be using SELECT_VL, such as SLP?
>     > >
>     > > Richard
>     > >
>     > >
>     > 
>     >
>      
>     -- 
>     Richard Biener <rguenther@suse.de>
>     SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
>     Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
>     HRB 36809 (AG Nuernberg)
>      
>
  
Richard Biener May 31, 2023, 8:44 a.m. UTC | #21
On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:

> Hi, Richard.
> 
> >> I don't object though.  It just feels like we're giving up easily.
> >> And that's a bit frustrating, since this potential problem was flagged
> >> ahead of time.
> 
> I can take a look at it. Would you mind giving me some hints?
> Should I do this in which PASS ? "ivopts" PASS?
> Is that right that we can enhance analysis when we see the statement as follows:
> remain = remain - step and step is coming from a MIN_EXPR (remain, vf).
> Then what we need to do?

The key is that we have

 # iv = PHI <initial, iv'>
 step = MIN (iv, invariant);
 iv' = iv - step;
 if (iv' != 0)
   continue;

in that case for the purpose of niter analysis we can ignore
the MIN expression and use 'invariant' as step (whether constant
or not).  Of course it only works for unsigned 'step'.

niter analysis uses simple_iv () but that necessarily fails here
so one idea would be to enhance simple_iv when passed a special
flag.  Another idea is to add to the set of patterns niter analysis
already has to compute niters resolving to popcount and friends
and pattern match the above (I'd start with that to see which other
passes / analyses besides niter analysis need improvement).

number_of_iterations_exit_assumptions does

  if (!simple_iv_with_niters (loop, loop_containing_stmt (stmt),
                              op0, &iv0, safe ? &iv0_niters : NULL, 
false))
    return number_of_iterations_bitcount (loop, exit, code, niter);

so I'd add that to the set of matchers in number_of_iterations_bitcount
(and maybe rename that to number_of_iterations_pattern).

Richard.

> Thanks.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Sandiford
> Date: 2023-05-31 15:28
> To: Richard Biener
> CC: juzhe.zhong\@rivai.ai; gcc-patches; linkw
> Subject: Re: [PATCH] VECT: Change flow of decrement IV
> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
> >
> >> Hi?all. I have posted my several investigations:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
> >> 
> >> Turns out when "niters is a constant value and vf is a constant value"
> >> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
> >> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
> >> 
> >> Actually, for current flow:
> >> 
> >> step = MIN ()
> >> ...
> >> remain = remain - step.
> >> 
> >> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
> >> So, could you make a decision for this patch?
> >> 
> >> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
> >> we should extend SCEV/IVOPTS ?
> >
> > I don't think we can do anything in SCEV for this which means we'd
> > need to special-case this in niter analysis, in IVOPTs and any other
> > passes that might be affected (and not fixed by handling it in niter
> > analysis).  While improving niter analysis would be good (the user
> > could write this pattern as well) I do not have time to try
> > implementing that (I have no idea how ugly or robust it is going to be).
> >
> > So I think we should patch this up in the vectorizer itself like with
> > your patch.  I'm going to wait for Richards input though since he
> > seems to disagree.
>  
> I think my main disagreement is that the IV phi can be analysed
> as a SCEV with sufficient work (realising that the MIN result is
> always VF when the latch is executed).  That SCEV might be useful
> ?as is? for things like IVOPTS, without specific work in those passes.
> (Although perhaps not too useful, since most other IVs will be upcounting.)
>  
> I don't object though.  It just feels like we're giving up easily.
> And that's a bit frustrating, since this potential problem was flagged
> ahead of time.
>  
> > Note with SELECT_VL all bets will be off since as I understand the
> > value it gives can vary from iteration to iteration (but we know
> > a lower and maybe an upper bound?)
>  
> Right.  All IVs will have a variable step for SELECT_VL.
>  
> Thanks,
> Richard
>  
>
  
Richard Sandiford May 31, 2023, 9:01 a.m. UTC | #22
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Wed, 31 May 2023, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
>> >
>> >> Hi?all. I have posted my several investigations:
>> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
>> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
>> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
>> >> 
>> >> Turns out when "niters is a constant value and vf is a constant value"
>> >> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
>> >> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
>> >> 
>> >> Actually, for current flow:
>> >> 
>> >> step = MIN ()
>> >> ...
>> >> remain = remain - step.
>> >> 
>> >> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
>> >> So, could you make a decision for this patch?
>> >> 
>> >> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
>> >> we should extend SCEV/IVOPTS ?
>> >
>> > I don't think we can do anything in SCEV for this which means we'd
>> > need to special-case this in niter analysis, in IVOPTs and any other
>> > passes that might be affected (and not fixed by handling it in niter
>> > analysis).  While improving niter analysis would be good (the user
>> > could write this pattern as well) I do not have time to try
>> > implementing that (I have no idea how ugly or robust it is going to be).
>> >
>> > So I think we should patch this up in the vectorizer itself like with
>> > your patch.  I'm going to wait for Richards input though since he
>> > seems to disagree.
>> 
>> I think my main disagreement is that the IV phi can be analysed
>> as a SCEV with sufficient work (realising that the MIN result is
>> always VF when the latch is executed).  That SCEV might be useful
>> ?as is? for things like IVOPTS, without specific work in those passes.
>> (Although perhaps not too useful, since most other IVs will be upcounting.)
>
> I think we'd need another API for SCEV there then,
> analyze_scalar_evolution_for_latch () so we can disregard the
> value on the exit edges then.  That means we'd still need to touch
> all users and decide whether it's safe to use that or not.

I'd expect the phi for the IV with the constant step to have the same
value as the phi for the IV with a MIN step.  I realise that the phi
isn't the thing that matters for niters, but I'd expect IVOPTS to
consider both the phi and the adjusted value to be candidates.  Only the
phi can be a candidate with the MIN step, but it feels like it should
still be a candidate, even with current interfaces.

You know this stuff much better than I do though, so I^m almost
certainly oversimplifying/overlooking things.

Like I say, I don't object to the vectoriser change, so please
don't go down a rabbit hole on my account. :)

Thanks,
Richard
  
钟居哲 May 31, 2023, 9:30 a.m. UTC | #23
Thanks Richard.
Seems that this patch's approach is ok to trunk?
Maybe the only thing we should do is to wait Kewen's testing feedback, am I right ?

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-31 17:01
To: Richard Biener via Gcc-patches
CC: Richard Biener; juzhe.zhong\@rivai.ai; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Wed, 31 May 2023, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
>> >
>> >> Hi?all. I have posted my several investigations:
>> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
>> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
>> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
>> >> 
>> >> Turns out when "niters is a constant value and vf is a constant value"
>> >> This patch can allow SCEV/IVOPTS optimize a lot for RVV too (Take tesecase from IBM's testsuite for example) and I think this patch can fix IBM's cunroll issue.
>> >> Even though it will produce a 'mv' instruction in some ohter cases for RVV, I think Gain > Pain overal.
>> >> 
>> >> Actually, for current flow:
>> >> 
>> >> step = MIN ()
>> >> ...
>> >> remain = remain - step.
>> >> 
>> >> I don't know how difficult to extend SCEV/IVOPTS to fix this issue.
>> >> So, could you make a decision for this patch?
>> >> 
>> >> I wonder whether we should apply the approach of this patch (the codes can be refined after well reviewed) or
>> >> we should extend SCEV/IVOPTS ?
>> >
>> > I don't think we can do anything in SCEV for this which means we'd
>> > need to special-case this in niter analysis, in IVOPTs and any other
>> > passes that might be affected (and not fixed by handling it in niter
>> > analysis).  While improving niter analysis would be good (the user
>> > could write this pattern as well) I do not have time to try
>> > implementing that (I have no idea how ugly or robust it is going to be).
>> >
>> > So I think we should patch this up in the vectorizer itself like with
>> > your patch.  I'm going to wait for Richards input though since he
>> > seems to disagree.
>> 
>> I think my main disagreement is that the IV phi can be analysed
>> as a SCEV with sufficient work (realising that the MIN result is
>> always VF when the latch is executed).  That SCEV might be useful
>> ?as is? for things like IVOPTS, without specific work in those passes.
>> (Although perhaps not too useful, since most other IVs will be upcounting.)
>
> I think we'd need another API for SCEV there then,
> analyze_scalar_evolution_for_latch () so we can disregard the
> value on the exit edges then.  That means we'd still need to touch
> all users and decide whether it's safe to use that or not.
 
I'd expect the phi for the IV with the constant step to have the same
value as the phi for the IV with a MIN step.  I realise that the phi
isn't the thing that matters for niters, but I'd expect IVOPTS to
consider both the phi and the adjusted value to be candidates.  Only the
phi can be a candidate with the MIN step, but it feels like it should
still be a candidate, even with current interfaces.
 
You know this stuff much better than I do though, so I^m almost
certainly oversimplifying/overlooking things.
 
Like I say, I don't object to the vectoriser change, so please
don't go down a rabbit hole on my account. :)
 
Thanks,
Richard
  
Richard Biener May 31, 2023, 10:53 a.m. UTC | #24
On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:

> Thanks Richard.
> Seems that this patch's approach is ok to trunk?
> Maybe the only thing we should do is to wait Kewen's testing feedback, am I right ?

Can you repost the patch with Kevens fix and state how you tested it?

Thanks,
Richard.
  
钟居哲 May 31, 2023, 12:16 p.m. UTC | #25
Sure. I will repost it with kewen's fix.
Previously, this patch I tested on X86 and apply to upstream RVV gcc tested on upstream RVV testsuite.
Since the upstream RVV GCC is not ready and fragile, we can't use it compile big program.
I will apply this patch on my downstream GCC and do my downstream regression (Regression is very big and include so many benchmarks).

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-31 18:53
To: juzhe.zhong@rivai.ai
CC: richard.sandiford; gcc-patches; linkw
Subject: Re: Re: [PATCH] VECT: Change flow of decrement IV
On Wed, 31 May 2023, juzhe.zhong@rivai.ai wrote:
 
> Thanks Richard.
> Seems that this patch's approach is ok to trunk?
> Maybe the only thing we should do is to wait Kewen's testing feedback, am I right ?
 
Can you repost the patch with Kevens fix and state how you tested it?
 
Thanks,
Richard.
  

Patch

diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index acf3642ceb2..ef28711c58f 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -483,7 +483,7 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 				 gimple_stmt_iterator loop_cond_gsi,
 				 rgroup_controls *rgc, tree niters,
 				 tree niters_skip, bool might_wrap_p,
-				 tree *iv_step)
+				 tree *iv_step, tree *compare_step)
 {
   tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
   tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
@@ -538,24 +538,26 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
 	   ...
 	   vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
 	   ...
-	   ivtmp_35 = ivtmp_9 - _36;
+	   ivtmp_35 = ivtmp_9 - POLY_INT_CST [4, 4];
 	   ...
-	   if (ivtmp_35 != 0)
-	     goto <bb 4>; [83.33%]
+	   if (ivtmp_9 > POLY_INT_CST [4, 4])
+	     goto <bb 6>; [83.33%]
 	   else
-	     goto <bb 5>; [16.67%]
+	     goto <bb 7>; [16.67%]
       */
       nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
       tree step = rgc->controls.length () == 1 ? rgc->controls[0]
 					       : make_ssa_name (iv_type);
       /* Create decrement IV.  */
-      create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, &incr_gsi,
-		 insert_after, &index_before_incr, &index_after_incr);
+      create_iv (nitems_total, MINUS_EXPR, nitems_step, NULL_TREE, loop,
+		 &incr_gsi, insert_after, &index_before_incr,
+		 &index_after_incr);
       gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
 							    index_before_incr,
 							    nitems_step));
       *iv_step = step;
-      return index_after_incr;
+      *compare_step = nitems_step;
+      return index_before_incr;
     }
 
   /* Create increment IV.  */
@@ -825,6 +827,7 @@  vect_set_loop_condition_partial_vectors (class loop *loop,
      arbitrarily pick the last.  */
   tree test_ctrl = NULL_TREE;
   tree iv_step = NULL_TREE;
+  tree compare_step = NULL_TREE;
   rgroup_controls *rgc;
   rgroup_controls *iv_rgc = nullptr;
   unsigned int i;
@@ -861,7 +864,7 @@  vect_set_loop_condition_partial_vectors (class loop *loop,
 						 &preheader_seq, &header_seq,
 						 loop_cond_gsi, rgc, niters,
 						 niters_skip, might_wrap_p,
-						 &iv_step);
+						 &iv_step, &compare_step);
 
 	    iv_rgc = rgc;
 	  }
@@ -884,11 +887,22 @@  vect_set_loop_condition_partial_vectors (class loop *loop,
 
   /* Get a boolean result that tells us whether to iterate.  */
   edge exit_edge = single_exit (loop);
-  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
+  gcond *cond_stmt;
   tree zero_ctrl = build_zero_cst (TREE_TYPE (test_ctrl));
-  gcond *cond_stmt = gimple_build_cond (code, test_ctrl, zero_ctrl,
-					NULL_TREE, NULL_TREE);
-  gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+    {
+      gcc_assert (compare_step);
+      cond_stmt = gimple_build_cond (GT_EXPR, test_ctrl, compare_step,
+				     NULL_TREE, NULL_TREE);
+      gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
+    }
+  else
+    {
+      tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
+      cond_stmt
+	= gimple_build_cond (code, test_ctrl, zero_ctrl, NULL_TREE, NULL_TREE);
+      gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
+    }
 
   /* The loop iterates (NITERS - 1) / VF + 1 times.
      Subtract one from this to get the latch count.  */