RISC-V: Increase scalar_to_vec_cost from 1 to 3

Message ID 20240111082329.1198064-1-juzhe.zhong@rivai.ai
State Superseded
Series RISC-V: Increase scalar_to_vec_cost from 1 to 3

Checks

Context Check Description
rivoscibot/toolchain-ci-rivos-lint warning Lint failed
rivoscibot/toolchain-ci-rivos-apply-patch success Patch applied
rivoscibot/toolchain-ci-rivos-test warning Testing skipped
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Testing passed
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gc-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv32gc_zba_zbb_zbc_zbs-ilp32d-non-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc_zba_zbb_zbc_zbs-lp64d-non-multilib success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Testing passed
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gcv-lp64d-multilib success Build passed

Commit Message

juzhe.zhong@rivai.ai Jan. 11, 2024, 8:23 a.m. UTC
  This patch fixes the following inefficient vectorized code:

	vsetvli	a5,zero,e8,mf2,ta,ma
	li	a2,17
	vid.v	v1
	li	a4,-32768
	vsetvli	zero,zero,e16,m1,ta,ma
	addiw	a4,a4,104
	vmv.v.i	v3,15
	lui	a1,%hi(a)
	li	a0,19
	vsetvli	zero,zero,e8,mf2,ta,ma
	vadd.vx	v1,v1,a2
	sb	a0,%lo(a)(a1)
	vsetvli	zero,zero,e16,m1,ta,ma
	vzext.vf2	v2,v1
	vmv.v.x	v1,a4
	vminu.vv	v2,v2,v3
	vsrl.vv	v1,v1,v2
	vslidedown.vi	v1,v1,1
	vmv.x.s	a0,v1
	snez	a0,a0
	ret

The reason is that scalar_to_vec_cost is too low.

Consider that in VEC_SET we always have a slide plus a scalar move
instruction, so scalar_to_vec_cost = 1 (the current cost) is not reasonable.

I tried setting it to 2 but that failed to fix this case; I had to
set it to 3.

Whether a scalar move or a slide instruction, I believe both are more
costly than normal vector instructions (e.g. vadd.vv), so setting the
cost to 3 looks reasonable to me.

After this patch:

	lui	a5,%hi(a)
	li	a4,19
	sb	a4,%lo(a)(a5)
	li	a0,0
	ret

Tested on both RV32 and RV64 with no regressions. OK for trunk?

	PR target/113281

gcc/ChangeLog:

	* config/riscv/riscv.cc: Set scalar_to_vec_cost as 3.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test.
	* gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test.

---
 gcc/config/riscv/riscv.cc                      |  4 ++--
 .../vect/costmodel/riscv/rvv/pr113281-1.c      | 18 ++++++++++++++++++
 .../gcc.target/riscv/rvv/autovec/pr113209.c    |  2 +-
 3 files changed, 21 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
  

Comments

Richard Biener Jan. 11, 2024, 9:18 a.m. UTC | #1
On Thu, Jan 11, 2024 at 9:24 AM Juzhe-Zhong <juzhe.zhong@rivai.ai> wrote:
>
> This patch fixes the following inefficient vectorized code:
>
>         vsetvli a5,zero,e8,mf2,ta,ma
>         li      a2,17
>         vid.v   v1
>         li      a4,-32768
>         vsetvli zero,zero,e16,m1,ta,ma
>         addiw   a4,a4,104
>         vmv.v.i v3,15
>         lui     a1,%hi(a)
>         li      a0,19
>         vsetvli zero,zero,e8,mf2,ta,ma
>         vadd.vx v1,v1,a2
>         sb      a0,%lo(a)(a1)
>         vsetvli zero,zero,e16,m1,ta,ma
>         vzext.vf2       v2,v1
>         vmv.v.x v1,a4
>         vminu.vv        v2,v2,v3
>         vsrl.vv v1,v1,v2
>         vslidedown.vi   v1,v1,1
>         vmv.x.s a0,v1
>         snez    a0,a0
>         ret
>
> The reason is that scalar_to_vec_cost is too low.
>
> Consider that in VEC_SET we always have a slide plus a scalar move
> instruction, so scalar_to_vec_cost = 1 (the current cost) is not reasonable.

scalar_to_vec is supposed to model a splat of GPR/FPR to a vector register.

We probably want to overhaul the cost classes, esp. vec_to_scalar, but of
course not now.
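
For illustration (a made-up example, not the PR testcase), the splat that
scalar_to_vec is meant to cover is the broadcast of a loop-invariant
scalar, e.g.:

    /* x is broadcast once before the loop, e.g. with a single vmv.v.x;
       that broadcast is what scalar_to_vec is supposed to cost.  */
    void f (short *a, short *b, short x, int n)
    {
      for (int i = 0; i < n; i++)
        a[i] = b[i] + x;
    }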

> I tried setting it to 2 but that failed to fix this case; I had to
> set it to 3.
>
> Whether a scalar move or a slide instruction, I believe both are more
> costly than normal vector instructions (e.g. vadd.vv), so setting the
> cost to 3 looks reasonable to me.
>
> After this patch:
>
>         lui     a5,%hi(a)
>         li      a4,19
>         sb      a4,%lo(a)(a5)
>         li      a0,0
>         ret
>
> Tested on both RV32 and RV64 with no regressions. OK for trunk?
>
>         PR target/113281
>
> gcc/ChangeLog:
>
>         * config/riscv/riscv.cc: Set scalar_to_vec_cost as 3.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test.
>         * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test.
>
> ---
>  gcc/config/riscv/riscv.cc                      |  4 ++--
>  .../vect/costmodel/riscv/rvv/pr113281-1.c      | 18 ++++++++++++++++++
>  .../gcc.target/riscv/rvv/autovec/pr113209.c    |  2 +-
>  3 files changed, 21 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index df9799d9c5e..bcfb3c15a39 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = {
>    1, /* gather_load_cost  */
>    1, /* scatter_store_cost  */
>    1, /* vec_to_scalar_cost  */
> -  1, /* scalar_to_vec_cost  */
> +  3, /* scalar_to_vec_cost  */
>    1, /* permute_cost  */
>    1, /* align_load_cost  */
>    1, /* align_store_cost  */
> @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = {
>      1, /* gather_load_cost  */
>      1, /* scatter_store_cost  */
>      1, /* vec_to_scalar_cost  */
> -    1, /* scalar_to_vec_cost  */
> +    3, /* scalar_to_vec_cost  */
>      1, /* permute_cost  */
>      1, /* align_load_cost  */
>      1, /* align_store_cost  */
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
> new file mode 100644
> index 00000000000..331cf961a1f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize -fdump-tree-vect-details" } */
> +
> +unsigned char a;
> +
> +int main() {
> +  short b = a = 0;
> +  for (; a != 19; a++)
> +    if (a)
> +      b = 32872 >> a;
> +
> +  if (b == 0)
> +    return 0;
> +  else
> +    return 1;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> index 081ee369394..70aae151000 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */
> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -fno-vect-cost-model" } */
>
>  int b, c, d, f, i, a;
>  int e[1] = {0};
> --
> 2.36.3
>
  
juzhe.zhong@rivai.ai Jan. 11, 2024, 9:30 a.m. UTC | #2
Thanks Richard.

So you think increasing the scalar_to_vec cost is not the correct approach to fix this case?

Or could you give me a suggestion for fixing it?

Thanks.



  
juzhe.zhong@rivai.ai Jan. 11, 2024, 9:46 a.m. UTC | #3
Oh, I see. I think I have done this wrong.

I should adjust the cost for VEC_EXTRACT, not VEC_SET.

But it's odd: I didn't see the loop vectorizer counting the
vec_to_scalar cost in the vect dump.

The vect tree:

  # a.4_25 = PHI <1(2), _4(11)>
  # ivtmp_30 = PHI <18(2), ivtmp_20(11)>
  # vect_vec_iv_.10_137 = PHI <{ 1, 2, 3, ... }(2), vect_vec_iv_.10_137(11)>
  # ivtmp_149 = PHI <0(2), ivtmp_150(11)>
  # loop_len_146 = PHI <18(2), _155(11)>
  vect_patt_28.11_139 = (vector([2048,2048]) unsigned short) vect_vec_iv_.10_137;
  _22 = (int) a.4_25;
  vect_patt_26.12_141 = MIN_EXPR <vect_patt_28.11_139, { 15, ... }>;
  vect_patt_10.13_143 = { 32872, ... } >> vect_patt_26.12_141;
  _12 = 32872 >> _22;
  vect_patt_27.14_144 = VIEW_CONVERT_EXPR<vector([2048,2048]) short int>(vect_patt_10.13_143);
  b_7 = (short int) _12;
  _4 = a.4_25 + 1;
  ivtmp_20 = ivtmp_30 - 1;
  ivtmp_150 = ivtmp_149 + POLY_INT_CST [2048, 2048];
  _153 = MIN_EXPR <ivtmp_150, 18>;
  _154 = 18 - _153;
  _155 = MIN_EXPR <_154, POLY_INT_CST [2048, 2048]>;
  if (_155 != 0)
    goto <bb 11>; [0.00%]
  else
    goto <bb 16>; [100.00%]

  <bb 16> [local count: 118111600]:
  # vect_patt_27.14_145 = PHI <vect_patt_27.14_144(8)>
  # loop_len_156 = PHI <loop_len_146(8)>
  _147 = loop_len_156 + 18446744073709551615;
  _148 = .VEC_EXTRACT (vect_patt_27.14_145, _147);
  b_5 = _148;
  a = 19;
  _14 = b_5 != 0;
  _15 = (int) _14;
  return _15;

The vect dump only computes costs for vector_stmt and scalar_to_vec.
It seems it didn't consider the VEC_EXTRACT cost?




  
Robin Dapp Jan. 11, 2024, 9:52 a.m. UTC | #4
On 1/11/24 10:46, juzhe.zhong@rivai.ai wrote:
> Oh, I see. I think I have done this wrong.
> 
> I should adjust the cost for VEC_EXTRACT, not VEC_SET.
> 
> But it's odd: I didn't see the loop vectorizer counting the
> vec_to_scalar cost in the vect dump.

The slidedown/vmv.x.s part is of course vec_extract but we indeed
don't seem to cost it as vec_to_scalar here.

vmv.v.x corresponds to scalar_to_vec and I'd say 3 seems a
bit high when a regular vector instruction is "1".
It should rather be dependent on the latency between register
files.  We can't really say in general but I'd say "2" is not so bad.

I would suggest adding special handling in builtin_vectorization_cost
like:

/* Add register-register latency.  */
case scalar_to_vec:
  return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)

and adjust register_move_cost accordingly.  Instead of using
register_move_cost we could also use a cost structure directly.
(E.g. like aarch64's regmove tuning structures.  Those don't
contain VRs but for us it could make sense to add them).
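
A rough sketch of what such a structure could look like (all names
made up, untested):

    /* Hypothetical per-uarch register-move latencies, analogous to
       aarch64's regmove tuning structures but with vector registers
       added.  */
    struct regmove_vector_cost
    {
      int GR2VR; /* GPR -> vector register, e.g. vmv.v.x.  */
      int FR2VR; /* FPR -> vector register, e.g. vfmv.v.f.  */
      int VR2GR; /* vector register -> GPR, e.g. vmv.x.s.  */
    };

builtin_vectorization_cost would then return scalar_to_vec_cost plus
the GR2VR (or FR2VR) latency of the active tuning instead of a
hard-coded constant.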

> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize -fdump-tree-vect-details" } */
With a cost of "3" we still vectorize for zvl512b and larger.
Is that intended?  I don't really see why 512 should be vectorized
but 256 not.  Disregarding that everything should be optimized
away, 2 iterations for the whole loop with 256 bits doesn't
seem that bad.

Regards
 Robin
  
juzhe.zhong@rivai.ai Jan. 11, 2024, 9:56 a.m. UTC | #5
>> With a cost of "3" we still vectorize for zvl512b and larger.
>>Is that intended?  I don't really see why 512 should be vectorized
>>but 256 not.  Disregarding that everything should be optimized
>>away, 2 iterations for the whole loop with 256 bits doesn't
>>seem that bad.
 Yeah... I just noticed. I would have to set it to 4 to fix it with the
biggest VLEN size, that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8...

I am confused now about how to fix this case.


  
Richard Biener Jan. 11, 2024, 9:57 a.m. UTC | #6
On Thu, Jan 11, 2024 at 10:52 AM Robin Dapp <rdapp.gcc@gmail.com> wrote:
>
> On 1/11/24 10:46, juzhe.zhong@rivai.ai wrote:
> > Oh, I see. I think I have done this wrong.
> >
> > I should adjust the cost for VEC_EXTRACT, not VEC_SET.
> >
> > But it's odd: I didn't see the loop vectorizer counting the
> > vec_to_scalar cost in the vect dump.
>
> The slidedown/vmv.x.s part is of course vec_extract but we indeed
> don't seem to cost it as vec_to_scalar here.

It looks like a vectorized live operation as it's not in the loop body
(and thus really irrelevant for costing in practice).  This has

      /* ???  Enable for loop costing as well.  */
      if (!loop_vinfo)
        record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
                          0, vect_epilogue);

so live ops are not costed at all.  I would suggest trying to
enable this unconditionally.
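
I.e., roughly (untested), dropping the guard so the epilogue cost is
recorded for loops as well:

      record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
                        0, vect_epilogue);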

  
Robin Dapp Jan. 11, 2024, 10:09 a.m. UTC | #7
>> The slidedown/vmv.x.s part is of course vec_extract but we indeed
>> don't seem to cost it as vec_to_scalar here.
> 
> It looks like a vectorized live operation as it's not in the loop body
> (and thus really irrelevant for costing in practice).  This has
> 
>       /* ???  Enable for loop costing as well.  */
>       if (!loop_vinfo)
>         record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
>                           0, vect_epilogue);
> 
> so live ops are not costed at all.  I would suggest trying to
> enable this unconditionally.
> 

IMHO this example is not really ideal to start with anyway, so we should
maybe try another one first.  I'd still argue that one or two vector
iterations vs. potentially 16+ scalar ones is not necessarily bad.

That said, we also don't really cost all our vsetvls yet (difficult...).

Regards
 Robin
  
juzhe.zhong@rivai.ai Jan. 11, 2024, 10:11 a.m. UTC | #8
>> That said, we also don't really cost all our vsetvls yet (difficult...).
If we cost vsetvls, we would need to add 1 more cost unit for each statement.

However, that is not accurate, since our VSETVL pass will eliminate redundancies...


  
juzhe.zhong@rivai.ai Jan. 11, 2024, 10:13 a.m. UTC | #9
Also, I have investigated the LLVM cost model. They don't cost vsetvli in their vectorization cost model.
But their cost model does a good job...



  
Robin Dapp Jan. 11, 2024, 10:14 a.m. UTC | #10
>  Yeah... I just noticed. I would have to set it to 4 to fix it with the
> biggest VLEN size, that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8...
> 
> I am confused now about how to fix this case.

4 is definitely too high compared to a regular instruction.
vmv.v.x could even be zero-cost for constants.

To catch constants we could add handling in add_stmt_cost, inspecting
the stmt directly.
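
A rough sketch of that (untested, helper logic made up):

  /* In add_stmt_cost: charge extra for splats of constants that do
     not fit the vmv.v.i immediate range [-16, 15] and therefore have
     to be synthesized in a GPR first.  */
  if (kind == scalar_to_vec && stmt_info)
    {
      gimple *stmt = stmt_info->stmt;
      for (unsigned int i = 1; i < gimple_num_ops (stmt); i++)
        {
          tree op = gimple_op (stmt, i);
          if (TREE_CODE (op) == INTEGER_CST
              && tree_fits_shwi_p (op)
              && !IN_RANGE (tree_to_shwi (op), -16, 15))
            stmt_cost += 1; /* Or a riscv_const_insns-based count.  */
        }
    }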

Regards
 Robin
  
Richard Biener Jan. 11, 2024, 10:18 a.m. UTC | #11
On Thu, Jan 11, 2024 at 11:20 AM juzhe.zhong@rivai.ai
<juzhe.zhong@rivai.ai> wrote:
>
> Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the loop we have these 2 scalar_to_vec:
>
> 1. MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 1 in prologue
>
>    This scalar_to_vec cost should be 0 or 1 since it only generates a single instruction: vmv.v.i v16,15
>
> 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
>
>    This cost should be higher since it costs 3 instructions:
>     li a4,-32768
>     addiw a4,a4,104
>     vmv.v.x v16,a4
>
> Am I correct?
>
> I guess if we cost case 1 as 1 and case 2 as 3, then we will be good.

The cost model interface isn't set up very well to cost different constants
differently.  For scalar_to_vec we should get the backend the actual scalar
to duplicate but we don't.

  
Richard Biener Jan. 11, 2024, 10:19 a.m. UTC | #12
On Thu, Jan 11, 2024 at 11:18 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Thu, Jan 11, 2024 at 11:20 AM juzhe.zhong@rivai.ai
> <juzhe.zhong@rivai.ai> wrote:
> >
> > Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the loop we have these 2 scalar_to_vec:
> >
> > 1. MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 1 in prologue
> >
> >    This scalar_to_vec cost should be 0 or 1 since it only generates a single instruction: vmv.v.i v16,15
> >
> > 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
> >
> >    This cost should be higher since it costs 3 instructions:
> >     li a4,-32768
> >     addiw a4,a4,104
> >     vmv.v.x v16,a4
> >
> > Am I correct?
> >
> > I guess if we cost case 1 as 1 and case 2 as 3, then we will be good.
>
> The cost model interface isn't set up very well to cost different constants
> differently.  For scalar_to_vec we should get the backend the actual scalar
> to duplicate but we don't.

It's similar for permutes btw.  My plan was always to defer changes to when
we only have SLP since there all invariants are represented by SLP nodes
which we can hand down.

  
juzhe.zhong@rivai.ai Jan. 11, 2024, 10:20 a.m. UTC | #13
Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the loop we have these 2 scalar_to_vec:

1. MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 1 in prologue

   This scalar_to_vec cost should be 0 or 1 since it only generates a single instruction: vmv.v.i v16,15

2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue

   This cost should be higher since it costs 3 instructions:
    li a4,-32768
    addiw a4,a4,104
    vmv.v.x v16,a4

Am I correct?

I guess if we cost case 1 as 1 and case 2 as 3, then we will be good.



  
Robin Dapp Jan. 11, 2024, 10:40 a.m. UTC | #14
On 1/11/24 11:20, juzhe.zhong@rivai.ai wrote:
> Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the loop we have these 2 scalar_to_vec:
> 
> 1. MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 1 in prologue
> 
>    This scalar_to_vec cost should be 0 or 1 since it only generates a single instruction: vmv.v.i v16,15
> 
> 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
> 
>    This cost should be higher since it costs 3 instructions:
>     li a4,-32768
>     addiw a4,a4,104
>     vmv.v.x v16,a4
> 
> Am I correct?
> 
> I guess if we cost case 1 as 1 and case 2 as 3, then we will be good.

That would be the general idea, yes.  As Richard mentioned, it doesn't
always work well but for this case here it could help a bit.
(My question why we shouldn't vectorize this at 256b
and above still stands, though.)

As mentioned before, the other thing that needs to be considered
is register-move costs (or the respective cost structure).  On
some uarchs the vfmv.v.f might be more expensive than vmv.v.x and
so on - in addition to the instructions needed to synthesize the
constant.

Regards
 Robin
  
juzhe.zhong@rivai.ai Jan. 11, 2024, 10:44 a.m. UTC | #15
>> (My question why we shouldn't vectorize this at 256b
>> and above still stands, though.)
I think we shouldn't vectorize it with any vlen, since the non-vectorized codegen is much better.
Also, I have tested -msve-vector-bits=2048; ARM SVE doesn't vectorize it.
With -zvl65536b, RVV Clang also doesn't vectorize it.



  
Robin Dapp Jan. 11, 2024, 11:15 a.m. UTC | #16
> I think we shouldn't vectorize it with any vlen, since the non-vectorized codegen is much better.
> Also, I have tested -msve-vector-bits=2048; ARM SVE doesn't vectorize it.
> With -zvl65536b, RVV Clang also doesn't vectorize it.

Of course I agree that optimizing everything to return 0 is
what should happen (tree-ssa-dom or vrp do that).  Unfortunately
they don't anymore after vectorizing the loop.

My point is that the cost comparison only has the scalar loop to compare
against, which is:

	li	a5,1
	li	a3,19
.L2:
	mv	a4,a5
	addiw	a5,a5,1
	bne	a5,a3,.L2

That's effectively 2 * 18 instructions and more than what we get
when vectorizing - therefore it's not totally outrageous to
vectorize here, and we need to make sure not to go overboard with
costing just for this example.

What does aarch64's cost comparison look like?  What's, comparatively,
more expensive with their tuning?  I've seen scalar_to_vec = 4 and
vec_to_scalar = 4 but a regular operation is 2 already.   This
would equal scalar_to_vec = 2 for us (and is not sufficient) so
something else must come into play still.

Regards
 Robin
  
juzhe.zhong@rivai.ai Jan. 11, 2024, 11:22 a.m. UTC | #17
Oh, sorry, I made a mistake: it's with -mrvv-vector-bits=2048 that RVV Clang doesn't vectorize it.

But ARM SVE GCC doesn't avoid the issue: if we specify -msve-vector-bits=2048, it will vectorize it.

https://godbolt.org/z/x7s8Kz87a

I guess LLVM has some magic in their cost model that works well.



  
juzhe.zhong@rivai.ai Jan. 11, 2024, 11:58 a.m. UTC | #18
Hi, Robin.

I modeled the scalar value initialization accurately with the following patch:

+/* Adjust vectorization cost after calling
+   targetm.vectorize.builtin_vectorization_cost. For some statement, we would
+   like to further fine-grain tweak the cost on top of
+   targetm.vectorize.builtin_vectorization_cost handling which doesn't have any
+   information on statement operation codes etc.  */
+
+static unsigned
+adjust_stmt_cost (class vec_info *vinfo, enum vect_cost_for_stmt kind,
+                 struct _stmt_vec_info *stmt_info, tree vectype, int count,
+                 int stmt_cost)
+{
+  gimple *stmt = stmt_info->stmt;
+  machine_mode smode = TYPE_MODE (TREE_TYPE (vectype));
+  switch (kind)
+    {
+      case scalar_to_vec: {
+       stmt_cost *= count;
+       /* Adjust cost by counting the scalar value initialization.  */
+       for (int i = 0; i < count; i++)
+         {
+           tree arg = gimple_arg (stmt, i);
+           if (poly_int_tree_p (arg))
+             {
+               poly_int64 value = tree_to_poly_int64 (arg);
+               int scalar_cost
+                 = riscv_const_insns (gen_int_mode (value, smode));
+               stmt_cost += scalar_cost;
+             }
+           else
+             stmt_cost += 1;
+         }
+       return stmt_cost;
+      }
+    default:
+      break;
+    }
+  return count * stmt_cost;
+}
+
 unsigned
 costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
                      stmt_vec_info stmt_info, slp_tree, tree vectype,
@@ -1082,9 +1122,13 @@ costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
         as one iteration of the VLA loop.  */
       if (where == vect_body && m_unrolled_vls_niters)
        m_unrolled_vls_stmts += count * m_unrolled_vls_niters;
+
+      if (vectype)
+       stmt_cost = adjust_stmt_cost (m_vinfo, kind, stmt_info, vectype, count,
+                                     stmt_cost);
     }

-  return record_stmt_cost (stmt_info, where, count * stmt_cost);
+  return record_stmt_cost (stmt_info, where, stmt_cost);
 }

32872 >> patt_126 1 times scalar_to_vec costs 3 in prologue

32872 spends 2 scalar instructions + 1 scalar_to_vec cost:

li a4,-32768
addiw a4,a4,104
vmv.v.x v16,a4

It seems reasonable, but it only fixes the test with -march=rv64gcv_zvl256b; it still fails with -march=rv64gcv_zvl4096b.



  
juzhe.zhong@rivai.ai Jan. 11, 2024, 12:07 p.m. UTC | #19
OK, but with scalar_to_vec set to 2:

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index df9799d9c5e..a14fb36817a 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = {
   1, /* gather_load_cost  */
   1, /* scatter_store_cost  */
   1, /* vec_to_scalar_cost  */
-  1, /* scalar_to_vec_cost  */
+  2, /* scalar_to_vec_cost  */
   1, /* permute_cost  */
   1, /* align_load_cost  */
   1, /* align_store_cost  */
@@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = {
     1, /* gather_load_cost  */
     1, /* scatter_store_cost  */
     1, /* vec_to_scalar_cost  */
-    1, /* scalar_to_vec_cost  */
+    2, /* scalar_to_vec_cost  */
     1, /* permute_cost  */
     1, /* align_load_cost  */
     1, /* align_store_cost  */

Then the test is fixed for all zvl*b.

Is that reasonable?  IMHO, a scalar move (vmv.v.x or vfmv.v.f) should be more costly than a normal vadd.vv since it transfers data between different pipelines/register classes.



  
juzhe.zhong@rivai.ai Jan. 11, 2024, 12:39 p.m. UTC | #20
I see that after this accurate cost adjustment, it is still vectorized, but with a different vect dump:

<bb 8> [local count: 118111602]:
  # a.4_25 = PHI <1(2), _4(11)>
  # ivtmp_30 = PHI <18(2), ivtmp_20(11)>
  # vect_vec_iv_.12_149 = PHI <{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }(2), _150(11)>
  # ivtmp_159 = PHI <0(2), ivtmp_160(11)>
  _150 = vect_vec_iv_.12_149 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 };
  vect_patt_46.13_151 = (vector(16) unsigned short) vect_vec_iv_.12_149;
  _22 = (int) a.4_25;
  vect_patt_48.14_153 = MIN_EXPR <vect_patt_46.13_151, { 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 }>;
  vect_patt_49.15_155 = { 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> vect_patt_48.14_153;
  _12 = 32872 >> _22;
  vect_patt_51.16_156 = VIEW_CONVERT_EXPR<vector(16) short int>(vect_patt_49.15_155);
  b_7 = (short int) _12;
  _4 = a.4_25 + 1;
  ivtmp_20 = ivtmp_30 - 1;
  ivtmp_160 = ivtmp_159 + 1;
  if (ivtmp_160 < 1)
    goto <bb 11>; [0.00%]
  else
    goto <bb 18>; [100.00%]

  <bb 16> [local count: 118111600]:
  # b_5 = PHI <b_141(19)>
  a = 19;
  _14 = b_5 != 0;
  _15 = (int) _14;
  return _15;

  <bb 11> [local count: 4]:
  goto <bb 8>; [100.00%]

  <bb 18> [local count: 118111601]:
  # a.4_146 = PHI <_4(8)>
  # ivtmp_147 = PHI <ivtmp_20(8)>
  # vect_patt_51.16_157 = PHI <vect_patt_51.16_156(8)>
  _158 = BIT_FIELD_REF <vect_patt_51.16_157, 16, 240>;
  b_144 = _158;

This is purely VLS vectorized code, so the later CSE pass is able to optimize it into the simple scalar code.

Whereas before, it involved some VLA vectorized code, which made the later passes fail to CSE it...



  
Robin Dapp Jan. 11, 2024, 12:56 p.m. UTC | #21
> 32872 spends 2 scalar instructions + 1 scalar_to_vec cost:
> 
> li a4,-32768
> addiw a4,a4,104
> vmv.v.x v16,a4
> 
> It seems reasonable, but it only fixes the test with -march=rv64gcv_zvl256b; it still fails with -march=rv64gcv_zvl4096b.
The scalar version also needs both instructions:

	li	a0,32768
	addiw	a0,a0,104

Therefore I don't think we should just add them to the
vectorization costs.  That would only be necessary if we needed
to synthesize a different constant (e.g. if a scalar constant
cannot be used directly in a vector setting).

Currently, scalar_outside_cost = 0 so we don't model it on the
scalar side either.

With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b:

a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body
a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue
(unsigned short) a.4_25 1 times vector_stmt costs 1 in body
MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 2 in prologue
MIN_EXPR <patt_28, 15> 1 times vector_stmt costs 1 in body
32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue
32872 >> patt_26 1 times vector_stmt costs 1 in body
<unknown> 1 times scalar_stmt costs 1 in prologue
<unknown> 1 times scalar_stmt costs 1 in body

  Vector inside of loop cost: 5
  Scalar iteration cost: 1 (shouldn't that be 2? but anyway)

So one vector iteration costs 5 right now regardless of
scalar_to_vec because there are 5 vector operations (phi,
promote, min, shift, vsetvl/len adjustment).
The scalar_to_vec costs are added to the prologue because it is
assumed that broadcasts are hoisted out of the loop.

Then vectorization is supposed to be profitable if
#iterations = 18 > (body_cost * min_iters)
   + vector_outside_cost - scalar_outside_cost + 1 = 15,
which holds here, so the loop is considered profitable to vectorize.

If we really don't want to vectorize, then we can either
further increase the prologue cost or the body itself.  The
body statements are all vector_stmts, though.  For the
prologue we need a good argument for increasing scalar_to_vec
beyond, say, 2.

> Is that reasonable?  IMHO, a scalar move (vmv.v.x or vfmv.v.f) should be
> more costly than a normal vadd.vv since it transfers data between
> different pipelines/register classes.

We want it to be more expensive, yes.  In one of the last
messages I explained how I would model it using either
register_move_cost or using (tune-specific) costs directly.
I would start with scalar_to_vec = 1 and add 1 or 2 depending
on the uarch/tune-specific reg-move costs.

Regards
 Robin
  
juzhe.zhong@rivai.ai Jan. 11, 2024, 1 p.m. UTC | #22
I think we can let it vectorize as long as we purely use VLS modes.
The following patch looks reasonable:

diff --git a/gcc/config/riscv/riscv-vector-costs.cc b/gcc/config/riscv/riscv-vector-costs.cc
index 58ec0b9b503..4e351bc066c 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -42,6 +42,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "backend.h"
 #include "tree-data-ref.h"
 #include "tree-ssa-loop-niter.h"
+#include "emit-rtl.h"

 /* This file should be included last.  */
 #include "riscv-vector-costs.h"
@@ -1055,6 +1056,42 @@ costs::better_main_loop_than_p (const vector_costs *uncast_other) const
   return vector_costs::better_main_loop_than_p (other);
 }

+/* Adjust vectorization cost after calling
+   targetm.vectorize.builtin_vectorization_cost. For some statement, we would
+   like to further fine-grain tweak the cost on top of
+   targetm.vectorize.builtin_vectorization_cost handling which doesn't have any
+   information on statement operation codes etc.  */
+
+static unsigned
+adjust_stmt_cost (enum vect_cost_for_stmt kind,
+                 struct _stmt_vec_info *stmt_info, int count, int stmt_cost)
+{
+  gimple *stmt = stmt_info->stmt;
+  switch (kind)
+    {
+      case scalar_to_vec: {
+       stmt_cost *= count;
+       /* Adjust cost by counting the scalar value initialization.  */
+       for (unsigned int i = 1; i < gimple_num_ops (stmt); i++)
+         {
+           tree op = gimple_op (stmt, i);
+           if (poly_int_tree_p (op))
+             {
+               poly_int64 value = tree_to_poly_int64 (op);
+               /* We don't need to count scalar costs if it
+                  is in range of [-16, 15] since we can use
+                  vmv.v.i.  */
+               stmt_cost += riscv_const_insns (gen_int_mode (value, Pmode));
+             }
+         }
+       return stmt_cost;
+      }
+    default:
+      break;
+    }
+  return count * stmt_cost;
+}
+
 unsigned
 costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
                      stmt_vec_info stmt_info, slp_tree, tree vectype,
@@ -1082,9 +1119,12 @@ costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
         as one iteration of the VLA loop.  */
       if (where == vect_body && m_unrolled_vls_niters)
        m_unrolled_vls_stmts += count * m_unrolled_vls_niters;
+
+      if (vectype)
+       stmt_cost = adjust_stmt_cost (kind, stmt_info, count, stmt_cost);
     }

-  return record_stmt_cost (stmt_info, where, count * stmt_cost);
+  return record_stmt_cost (stmt_info, where, stmt_cost);
 }

 void
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index df9799d9c5e..a14fb36817a 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = {
   1, /* gather_load_cost  */
   1, /* scatter_store_cost  */
   1, /* vec_to_scalar_cost  */
-  1, /* scalar_to_vec_cost  */
+  2, /* scalar_to_vec_cost  */
   1, /* permute_cost  */
   1, /* align_load_cost  */
   1, /* align_store_cost  */
@@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = {
     1, /* gather_load_cost  */
     1, /* scatter_store_cost  */
     1, /* vec_to_scalar_cost  */
-    1, /* scalar_to_vec_cost  */
+    2, /* scalar_to_vec_cost  */
     1, /* permute_cost  */
     1, /* align_load_cost  */
     1, /* align_store_cost  */

cost.c:3:5: note: vectorized 1 loops in function.
***** Choosing vector mode V16QI
int main ()
{
  vector(16) short int vect_patt_111.16;
  vector(16) unsigned short vect_patt_109.15;
  vector(16) unsigned short vect_patt_108.14;
  vector(16) unsigned short vect_patt_106.13;
  vector(16) unsigned char vect_vec_iv_.12;
  unsigned char tmp.11;
  unsigned char tmp.10;
  unsigned char D.2767;
  unsigned char a_lsm.8;
  short int b;
  unsigned char _4;
  int _12;
  _Bool _14;
  int _15;
  unsigned char _19(D);
  unsigned char ivtmp_20;
  int _22;
  unsigned char a.4_25;
  unsigned char ivtmp_30;
  unsigned char a.4_256;
  unsigned char ivtmp_258;
  int _259;
  int _260;
  unsigned char _262;
  unsigned char ivtmp_263;
  unsigned char a.4_266;
  unsigned char ivtmp_267;
  vector(16) unsigned char _270;
  short int _278;
  unsigned char ivtmp_279;
  unsigned char ivtmp_280;

  <bb 2> [local count: 118111601]:

  <bb 8> [local count: 118111602]:
  # a.4_25 = PHI <1(2), _4(11)>
  # ivtmp_30 = PHI <18(2), ivtmp_20(11)>
  # vect_vec_iv_.12_269 = PHI <{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }(2), _270(11)>
  # ivtmp_279 = PHI <0(2), ivtmp_280(11)>
  _270 = vect_vec_iv_.12_269 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 };
  vect_patt_106.13_271 = (vector(16) unsigned short) vect_vec_iv_.12_269;
  _22 = (int) a.4_25;
  vect_patt_108.14_273 = MIN_EXPR <vect_patt_106.13_271, { 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 }>;
  vect_patt_109.15_275 = { 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> vect_patt_108.14_273;
  _12 = 32872 >> _22;
  vect_patt_111.16_276 = VIEW_CONVERT_EXPR<vector(16) short int>(vect_patt_109.15_275);
  b_7 = (short int) _12;
  _4 = a.4_25 + 1;
  ivtmp_20 = ivtmp_30 - 1;
  ivtmp_280 = ivtmp_279 + 1;
  if (ivtmp_280 < 1)
    goto <bb 11>; [0.00%]
  else
    goto <bb 18>; [100.00%]

  <bb 16> [local count: 118111600]:
  # b_5 = PHI <b_261(19)>
  a = 19;
  _14 = b_5 != 0;
  _15 = (int) _14;
  return _15;

  <bb 11> [local count: 4]:
  goto <bb 8>; [100.00%]

  <bb 18> [local count: 118111601]:
  # a.4_266 = PHI <_4(8)>
  # ivtmp_267 = PHI <ivtmp_20(8)>
  # vect_patt_111.16_277 = PHI <vect_patt_111.16_276(8)>
  _278 = BIT_FIELD_REF <vect_patt_111.16_277, 16, 240>;
  b_264 = _278;

  <bb 19> [local count: 477576208]:
  # a.4_256 = PHI <17(18), _262(20)>
  # ivtmp_258 = PHI <2(18), ivtmp_263(20)>
  _259 = (int) a.4_256;
  _260 = 32872 >> _259;
  b_261 = (short int) _260;
  _262 = a.4_256 + 1;
  ivtmp_263 = ivtmp_258 - 1;
  if (ivtmp_263 != 0)
    goto <bb 20>; [75.27%]
  else
    goto <bb 16>; [24.73%]

  <bb 20> [local count: 359464610]:
  goto <bb 19>; [100.00%]

}

Final ASM:

main:
lui a5,%hi(a)
li a4,19
sb a4,%lo(a)(a5)
li a0,0
ret


juzhe.zhong@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 20:56
To: juzhe.zhong@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> 32872 spends 2 scalar instructions + 1 scalar_to_vec cost:
> 
> lia4,-32768
> addiwa4,a4,104
> vmv.v.xv16,a4
> 
> It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but failed on -march=rv64gcv_zvl4096b.
The scalar version also needs both instructions:
 
li a0,32768
addiw a0,a0,104
 
Therefore I don't think we should just add them to the
vectorization costs.  That would only be necessary if we needed
to synthesize a different constant (e.g. if a scalar constant
cannot be used directly in a vector setting).
 
Currently, scalar_outside_cost = 0 so we don't model it on the
scalar side either.
 
With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b:
 
a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body
a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue
(unsigned short) a.4_25 1 times vector_stmt costs 1 in body
MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 2 in prologue
MIN_EXPR <patt_28, 15> 1 times vector_stmt costs 1 in body
32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue
32872 >> patt_26 1 times vector_stmt costs 1 in body
<unknown> 1 times scalar_stmt costs 1 in prologue
<unknown> 1 times scalar_stmt costs 1 in body
 
  Vector inside of loop cost: 5
  Scalar iteration cost: 1 (shouldn't that be 2? but anyway)
 
So one vector iteration costs 5 right now regardless of
scalar_to_vec because there are 5 vector operations (phi,
promote, min, shift, vsetvl/len adjustment).
The scalar_to_vec costs are added to the prologue because it is
assumed that broadcasts are hoisted out of the loop.
 
Then vectorization is supposed to be profitable if
#iterations = 18 > (body_cost * min_iters)
   + vector_outside_cost - scalar_outside_cost + 1 = 15.
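 
(Reading those numbers off the dump above, presumably body_cost = 5,
min_iters = 1, vector_outside_cost = 4 + 2 + 2 + 1 = 9 in the prologue
and scalar_outside_cost = 0, so the bound is 5 * 1 + 9 - 0 + 1 = 15,
and 18 > 15 makes the loop look profitable.)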
 
If we really don't want to vectorize, then we can either
further increase the prologue cost or the body cost itself.  The
body statements are all vector_stmts, though.  For the
prologue we need a good argument why to increase scalar_to_vec
beyond, say, 2.
 
> Is it reasonable?  IMHO, a scalar move (vmv.v.x or vfmv.v.f) should be
> more costly than a normal vadd.vv since it is transferring data between
> different pipelines/register classes.
 
We want it to be more expensive, yes.  In one of the last
messages I explained how I would model it using either
register_move_cost or using (tune-specific) costs directly.
I would start with scalar_to_vec = 1 and add 1 or 2 depending
on the uarch/tune-specific reg-move costs.
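 
A minimal sketch of that direction (illustrative only, not the patch
under review; it assumes the backend's GR_REGS/V_REGS classes and the
existing riscv_register_move_cost hook):
 
    /* Base broadcast cost plus the GPR->VR crossing cost, so that a
       tune with a slow scalar-to-vector path pays more.  */
    case scalar_to_vec:
      return common_costs->scalar_to_vec_cost
             + riscv_register_move_cost (word_mode, GR_REGS, V_REGS);
 
A uarch with a cheap crossing would then land at the "2" suggested
above, while one with a slow path could report 3 or more.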
 
Regards
Robin
  
juzhe.zhong@rivai.ai Jan. 12, 2024, 8:50 a.m. UTC | #23
Hi, Richard.

I tried hard in the RISC-V backend. I found that the case with -march=rv64gcv_zvl4096b cannot be fixed without counting the vec_to_scalar cost.

Is there an approach by which we can count the vec_to_scalar cost without this piece of code in the middle end?

      /* ???  Enable for loop costing as well.  */
      if (!loop_vinfo)
        record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
                          0, vect_epilogue);

Since it's stage 4, I guess we can't change this now.
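 
A backend-only sketch of what counting it ourselves might look like
(hypothetical and untested), inside the riscv_vector_costs::add_stmt_cost
override:
 
    /* Charge one vec_to_scalar per live statement for the loop-exit
       extract that the middle end currently does not cost.  */
    if (stmt_info && STMT_VINFO_LIVE_P (stmt_info))
      m_costs[vect_epilogue] += 1;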



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2024-01-11 17:57
To: Robin Dapp
CC: juzhe.zhong@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 10:52 AM Robin Dapp <rdapp.gcc@gmail.com> wrote:
>
> On 1/11/24 10:46, juzhe.zhong@rivai.ai wrote:
> > Oh, I see.  I think I have done this wrong.
> >
> > I should adjust the cost for VEC_EXTRACT, not VEC_SET.
> >
> > But it's odd: I didn't see the loop vectorizer scanning the
> > scalar_to_vec cost in vect.dump.
>
> The slidedown/vmv.x.s part is of course vec_extract but we indeed
> don't seem to cost it as vec_to_scalar here.
 
It looks like a vectorized live operation as it's not in the loop body
(and thus really irrelevant for costing in practice).  This has
 
      /* ???  Enable for loop costing as well.  */
      if (!loop_vinfo)
        record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
                          0, vect_epilogue);
 
so live ops are not costed at all.  I would suggest trying to enable
this unconditionally.
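 
In diff form that would presumably be just dropping the guard (untested
sketch):
 
-      /* ???  Enable for loop costing as well.  */
-      if (!loop_vinfo)
-        record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
-                          0, vect_epilogue);
+      record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
+                        0, vect_epilogue);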
 
> vmv.v.x corresponds to scalar_to_vec and I'd say 3 seems a
> bit high when a regular vector instruction is "1".
> It should rather be dependent on the latency between register
> files.  We can't really say in general but I'd say "2" is not so bad.
>
> I would suggest adding special handling in builtin_vectorization_cost
> like:
>
> /* Add register-register latency.  */
> case scalar_to_vec:
>   return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)
>
> and adjust register_move_cost accordingly.  Instead of using
> register_move_cost we could also use a cost structure directly.
> (E.g. like aarch64's regmove tuning structures.  Those don't
> contain VRs but for us it could make sense to add them).
>
> > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize -fdump-tree-vect-details" } */
> With a cost of "3" we still vectorize for zvl512b and larger.
> Is that intended?  I don't really see why 512 should be vectorized
> but 256 not.  Disregarding that everything should be optimized
> away, 2 iterations for the whole loop with 256 bits doesn't
> seem that bad.
>
> Regards
>  Robin
>
  

Patch

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index df9799d9c5e..bcfb3c15a39 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -366,7 +366,7 @@  static const common_vector_cost rvv_vls_vector_cost = {
   1, /* gather_load_cost  */
   1, /* scatter_store_cost  */
   1, /* vec_to_scalar_cost  */
-  1, /* scalar_to_vec_cost  */
+  3, /* scalar_to_vec_cost  */
   1, /* permute_cost  */
   1, /* align_load_cost  */
   1, /* align_store_cost  */
@@ -382,7 +382,7 @@  static const scalable_vector_cost rvv_vla_vector_cost = {
     1, /* gather_load_cost  */
     1, /* scatter_store_cost  */
     1, /* vec_to_scalar_cost  */
-    1, /* scalar_to_vec_cost  */
+    3, /* scalar_to_vec_cost  */
     1, /* permute_cost  */
     1, /* align_load_cost  */
     1, /* align_store_cost  */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
new file mode 100644
index 00000000000..331cf961a1f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize -fdump-tree-vect-details" } */
+
+unsigned char a;
+
+int main() {
+  short b = a = 0;
+  for (; a != 19; a++)
+    if (a)
+      b = 32872 >> a;
+
+  if (b == 0)
+    return 0;
+  else
+    return 1;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
index 081ee369394..70aae151000 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */
+/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -fno-vect-cost-model" } */
 
 int b, c, d, f, i, a;
 int e[1] = {0};