RISC-V: Prevent speculative vsetvl insn scheduling

Message ID 20250212233029.2258031-1-ewlu@rivosinc.com
State Superseded
Series RISC-V: Prevent speculative vsetvl insn scheduling

Checks

Context Check Description
rivoscibot/toolchain-ci-rivos-apply-patch success Patch applied
rivoscibot/toolchain-ci-rivos-lint success Lint passed
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc_zba_zbb_zbc_zbs-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-test success Testing passed
linaro-tcwg-bot/tcwg_simplebootstrap_build--master-arm-bootstrap success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_simplebootstrap_build--master-aarch64-bootstrap success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed

Commit Message

Edwin Lu Feb. 12, 2025, 11:27 p.m. UTC
The instruction scheduler appears to be speculatively hoisting vsetvl
insns outside of their basic block without checking for data
dependencies.  This resulted in a situation where the following occurs:

        vsetvli a5,a1,e32,m1,tu,ma
        vle32.v v2,0(a0)
        sub     a1,a1,a5 <-- a1 potentially set to 0
        sh2add  a0,a5,a0
        vfmacc.vv       v1,v2,v2
        vsetvli a5,a1,e32,m1,tu,ma <-- incompatible vinfo. update vl to 0
        beq     a1,zero,.L12 <-- check if avl is 0

This patch essentially delays the vsetvl update until after the branch
to avoid unnecessarily updating the vinfo at the end of a basic block.

	PR/117974

gcc/ChangeLog:

	* config/riscv/riscv.cc (riscv_sched_can_speculate_insn): New function.
	(TARGET_SCHED_CAN_SPECULATE_INSN): Implement.

Signed-off-by: Edwin Lu <ewlu@rivosinc.com>
---
 gcc/config/riscv/riscv.cc | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
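For context, the assembly above plausibly comes from a strip-mined reduction
loop of the following shape (a hypothetical C reconstruction, not the PR's
actual testcase; the VLMAX value and the e32,m1 mapping are assumptions):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical model of the strip-mined loop behind the assembly
// above: each pass handles vl = min(avl, VLMAX) elements and then
// decrements avl, so on the final pass avl reaches 0.  A vsetvli
// hoisted above the "beq a1,zero" exit test therefore runs with
// AVL == 0, producing VL = 0.
constexpr std::size_t VLMAX = 4;  // e.g. e32,m1 with VLEN = 128

float sum_squares (const float *a, std::size_t n)
{
  float acc = 0.0f;
  std::size_t avl = n;                       // a1 in the assembly
  while (avl > 0)
    {
      std::size_t vl = std::min (avl, VLMAX);  // vsetvli a5,a1,...
      for (std::size_t i = 0; i < vl; i++)
        acc += a[i] * a[i];                    // vle32.v + vfmacc.vv
      a += vl;                                 // sh2add a0,a5,a0
      avl -= vl;                               // sub a1,a1,a5
      // A speculated vsetvli would recompute VL from avl here,
      // before the "avl == 0" exit check.
    }
  return acc;
}
```

The loop makes the hazard concrete: with n a multiple of VLMAX, the last
decrement leaves avl at 0 exactly when the hoisted vsetvli would consume it.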
  

Comments

钟居哲 Feb. 12, 2025, 11:38 p.m. UTC | #1
Could you add a PR117974 testcase?



juzhe.zhong@rivai.ai
 
From: Edwin Lu
Date: 2025-02-13 07:27
To: gcc-patches
CC: gnu-toolchain; vineetg; juzhe.zhong; Edwin Lu
Subject: [PATCH] RISC-V: Prevent speculative vsetvl insn scheduling
The instruction scheduler appears to be speculatively hoisting vsetvl
insns outside of their basic block without checking for data
dependencies. This resulted in a situation where the following occurs
 
        vsetvli a5,a1,e32,m1,tu,ma
        vle32.v v2,0(a0)
        sub     a1,a1,a5 <-- a1 potentially set to 0
        sh2add  a0,a5,a0
        vfmacc.vv       v1,v2,v2
        vsetvli a5,a1,e32,m1,tu,ma <-- incompatible vinfo. update vl to 0
        beq     a1,zero,.L12 <-- check if avl is 0
 
This patch would essentially delay the vsetvl update to after the branch
to prevent unnecessarily updating the vinfo at the end of a basic block.
 
PR/117974
 
gcc/ChangeLog:
 
* config/riscv/riscv.cc (riscv_sched_can_speculate_insn):
(TARGET_SCHED_CAN_SPECULATE_INSN): Implement.
 
Signed-off-by: Edwin Lu <ewlu@rivosinc.com>
---
gcc/config/riscv/riscv.cc | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
 
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 6e14126e3a4..24450bae517 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -10209,6 +10209,23 @@ riscv_sched_adjust_cost (rtx_insn *, int, rtx_insn *insn, int cost,
   return new_cost;
}
+/* Implement TARGET_SCHED_CAN_SPECULATE_INSN hook.  Return true if insn
+   can be scheduled for speculative execution.  Reject vsetvl instructions to
+   prevent the scheduler from hoisting them out of basic blocks without
+   checking for data dependencies (PR117974).  */
+static bool
+riscv_sched_can_speculate_insn (rtx_insn *insn)
+{
+  switch (get_attr_type (insn))
+    {
+      case TYPE_VSETVL:
+      case TYPE_VSETVL_PRE:
+ return false;
+      default:
+ return true;
+    }
+}
+
/* Auxiliary function to emit RISC-V ELF attribute. */
static void
riscv_emit_attribute ()
@@ -14055,6 +14072,9 @@ bool need_shadow_stack_push_pop_p ()
#undef  TARGET_SCHED_ADJUST_COST
#define TARGET_SCHED_ADJUST_COST riscv_sched_adjust_cost
+#undef TARGET_SCHED_CAN_SPECULATE_INSN
+#define TARGET_SCHED_CAN_SPECULATE_INSN riscv_sched_can_speculate_insn
+
#undef TARGET_FUNCTION_OK_FOR_SIBCALL
#define TARGET_FUNCTION_OK_FOR_SIBCALL riscv_function_ok_for_sibcall
-- 
2.43.0
  
Edwin Lu Feb. 12, 2025, 11:49 p.m. UTC | #2
Oops I made a mental note to add one and then completely forgot. I'll 
run the testsuite again with the testcase before sending it up as v2.

Edwin

On 2/12/2025 3:38 PM, 钟居哲 wrote:
> Could you add PR117974 testcase ?
  
Jeffrey Law Feb. 13, 2025, 12:47 a.m. UTC | #3
On 2/12/25 4:27 PM, Edwin Lu wrote:
> The instruction scheduler appears to be speculatively hoisting vsetvl
> insns outside of their basic block without checking for data
> dependencies. This resulted in a situation where the following occurs
> 
>          vsetvli a5,a1,e32,m1,tu,ma
>          vle32.v v2,0(a0)
>          sub     a1,a1,a5 <-- a1 potentially set to 0
>          sh2add  a0,a5,a0
>          vfmacc.vv       v1,v2,v2
>          vsetvli a5,a1,e32,m1,tu,ma <-- incompatible vinfo. update vl to 0
>          beq     a1,zero,.L12 <-- check if avl is 0
> 
> This patch would essentially delay the vsetvl update to after the branch
> to prevent unnecessarily updating the vinfo at the end of a basic block.
> 
> 	PR/117974
> 
> gcc/ChangeLog:
> 
> 	* config/riscv/riscv.cc (riscv_sched_can_speculate_insn):
> 	(TARGET_SCHED_CAN_SPECULATE_INSN): Implement.
Correct me if I'm wrong, there's not anything inherently wrong with the 
speculation from a correctness standpoint.  This is "just" a performance 
issue, right?

And from a performance standpoint, speculation of the vsetvl could vary
pretty wildly based on uarch characteristics.  I can easily see cases
where it is wildly bad, wildly good, or where we don't really care.


Point being it seems like it should be controlled by a uarch setting 
rather than always avoiding or always enabling.

Other thoughts?

Jeff
  
钟居哲 Feb. 13, 2025, 1:44 a.m. UTC | #4
The VSETVL pass is supposed to insert "vsetvli" at the optimal place, so "vsetvli" insns inserted by the VSETVL pass shouldn't be involved in instruction scheduling.



juzhe.zhong@rivai.ai
 
  
Jeffrey Law Feb. 13, 2025, 3:33 a.m. UTC | #5
On 2/12/25 6:44 PM, 钟居哲 wrote:
> VSETVL PASS is supposed to insert "vsetvli" into optimal place so 
> "vsetvli" inserted by VSETVL PASS shouldn't involved into instruction 
> scheduling.
vsetvl pass inserts based on needs of vector instructions.  The 
scheduler will move code to try and minimize the critical path length. 
That includes potentially hoisting any insn into different control 
blocks if doing so has the same semantics, which is what's happening 
here.  The hoisting, AFAICT doesn't change semantics.


jeff
  
钟居哲 Feb. 13, 2025, 3:49 a.m. UTC | #6
>> vsetvl pass inserts based on needs of vector instructions.
Yes.  Vector instructions should make use of the scheduler, which can minimize "vsetvli" insertions in the VSETVL pass.
What I said is that we should not involve the "vsetvli" instruction itself (as opposed to vector instructions like vadd.vv) in the scheduler.



juzhe.zhong@rivai.ai
 
  
Palmer Dabbelt Feb. 13, 2025, 4:23 a.m. UTC | #7
On Wed, 12 Feb 2025 16:47:07 PST (-0800), jeffreyalaw@gmail.com wrote:
>
>
> On 2/12/25 4:27 PM, Edwin Lu wrote:
>> The instruction scheduler appears to be speculatively hoisting vsetvl
>> insns outside of their basic block without checking for data
>> dependencies. This resulted in a situation where the following occurs
>>
>>          vsetvli a5,a1,e32,m1,tu,ma
>>          vle32.v v2,0(a0)
>>          sub     a1,a1,a5 <-- a1 potentially set to 0
>>          sh2add  a0,a5,a0
>>          vfmacc.vv       v1,v2,v2
>>          vsetvli a5,a1,e32,m1,tu,ma <-- incompatible vinfo. update vl to 0
>>          beq     a1,zero,.L12 <-- check if avl is 0
>>
>> This patch would essentially delay the vsetvl update to after the branch
>> to prevent unnecessarily updating the vinfo at the end of a basic block.
>>
>> 	PR/117974
>>
>> gcc/ChangeLog:
>>
>> 	* config/riscv/riscv.cc (riscv_sched_can_speculate_insn):
>> 	(TARGET_SCHED_CAN_SPECULATE_INSN): Implement.
> Correct me if I'm wrong, there's not anything inherently wrong with the
> speculation from a correctness standpoint.  This is "just" a performance
> issue, right?

Hopefully, we didn't go into this looking for functional bugs.

We were seeing some odd behavior from the scheduler: it only moves the 
first vsetvli before the branch, the second crosses the branch while 
merging.  That tripped up a "maybe there's a functional bug lurking 
here", but thinking about it again it seems more likely we're just 
missing an opportunity to schedule that's getting made irrelevant by the 
vsetvli elimination pass.

> And from a performance standpoint speculation of the vsetvl could vary
> pretty wildly based on uarch characteristics.   I can easily see cases
> where it is wildly bad, wildly good and don't really care.
>
> Point being it seems like it should be controlled by a uarch setting
> rather than always avoiding or always enabling.

Ya, very much seems like a uarch thing.

We kind of just threw this one together as we're still experimenting 
with this.  The goal was to avoid the VL=0 cases, but I think that's 
even just sort of a side effect of avoiding speculative scheduling here.  
So I think we need to go benchmark this before we can really get a feel 
for what's going on, as it might not be direct enough to catch the 
interesting cases.

> Other thoughts?

The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff 
we can't/don't model in the pipeline, but I have no idea how to model 
the VL=0 case there.
  
Jeffrey Law Feb. 13, 2025, 5:36 a.m. UTC | #8
On 2/12/25 9:23 PM, Palmer Dabbelt wrote:

>>>
>>>     PR/117974
>>>
>>> gcc/ChangeLog:
>>>
>>>     * config/riscv/riscv.cc (riscv_sched_can_speculate_insn):
>>>     (TARGET_SCHED_CAN_SPECULATE_INSN): Implement.
>> Correct me if I'm wrong, there's not anything inherently wrong with the
>> speculation from a correctness standpoint.  This is "just" a performance
>> issue, right?
> 
> Hopefully, we didn't go into this looking for functional bugs.
Right.  The usual looking for one thing and finding something odd along 
the way...


> 
> We kind of just threw this one together as we're still experimenting 
> with this.  The goal was to avoid the VL=0 cases, but I think that's 
> even just sort of a side effect of avoiding speculative scheduling here. 
> So I think we need to go benchmark this before we can really get a feel 
> for what's going on, as it might not be direct enough to catch the 
> interesting cases.
Yea.  And it's that vl=0 case that falls into the wildly bad bucket of 
scenarios.

> 
>> Other thoughts?
> 
> The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff 
> we can't/don't model in the pipeline, but I have no idea how to model 
> the VL=0 case there.
Maybe so, but what Edwin is doing looks sensible enough.  It wouldn't be 
the first time a hook got (ab)used in ways that weren't part of the 
original intent.

Jeff
  
Robin Dapp Feb. 13, 2025, 8:47 a.m. UTC | #9
>>> Other thoughts?
>> 
>> The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff 
>> we can't/don't model in the pipeline, but I have no idea how to model 
>> the VL=0 case there.
> Maybe so, but what Edwin is doing looks sensible enough.  It wouldn't be 
> the first time a hook got (ab)used in ways that weren't part of the 
> original intent.

I don't fully understand what's happening.  So the hoisting is being done
speculatively here?  And it just happens to be "bad" because that might
cause a VL=0 case.  But are we sure a lack of speculation cannot cause
such cases?

Also, why doesn't the vsetvl pass fix the situation?  IMHO we need to
understand the problem more thoroughly before changing things.
In the end LCM minimizes the number of vsetvls and inserts them at the
"earliest" point.  If that is not sufficient I'd say we need modify
the constraints (maybe on a per-uarch basis)?

On a separate note:  How about we move the vsetvl pass after sched2?
Then we could at least rely on LCM doing its work uninhibited and wouldn't
reorder vsetvls afterwards.  Or do we somehow rely on rtl_dce and BB
reorder to run afterwards?

That won't help with the problem here but might with others.
  
Vineet Gupta Feb. 13, 2025, 12:12 p.m. UTC | #10
On 2/13/25 14:17, Robin Dapp wrote:
>>>> Other thoughts?
>>> The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff 
>>> we can't/don't model in the pipeline, but I have no idea how to model 
>>> the VL=0 case there.
>> Maybe so, but what Edwin is doing looks sensible enough.  It wouldn't be 
>> the first time a hook got (ab)used in ways that weren't part of the 
>> original intent.
> I don't fully understand what's happening.  So the hoisting is being done
> speculatively here?  And it just happens to be "bad" because that might
> cause a VL=0 case.  But are we sure a lack of speculation cannot cause
> such cases?

Exactly. My gut feeling w/o deep dive was this seemed like papering over the issue.

BTW, what exactly is speculative scheduling?  As in, what is it actually trying to
schedule ahead?

> Also, why doesn't the vsetvl pass fix the situation?  IMHO we need to
> understand the problem more thoroughly before changing things.
> In the end LCM minimizes the number of vsetvls and inserts them at the
> "earliest" point.  If that is not sufficient I'd say we need modify
> the constraints (maybe on a per-uarch basis)?

As far as LCM is concerned, it is hoisting the insn to the optimal spot.  However,
there's some additional logic, such as in can_use_next_avl_p (), which influences
whether things can be moved around.

> On a separate note:  How about we move the vsetvl pass after sched2?
> Then we could at least rely on LCM doing its work uninhibited and wouldn't
> reorder vsetvls afterwards. 

Bingo ! excellent idea. This would ensure scheduling doesn't undo carefully
placed stuff, but ....

>  Or do we somehow rely on rtl_dce and BB
> reorder to run afterwards?

... I have no idea if any of this is in play.

> That won't help with the problem here but might with others.

Right this needs to be evaluated independently with both icounts and BPI3 runs
to see if anything falls out.

-Vineet
  
Jeffrey Law Feb. 13, 2025, 2:46 p.m. UTC | #11
On 2/13/25 1:47 AM, Robin Dapp wrote:
>>>> Other thoughts?
>>>
>>> The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff
>>> we can't/don't model in the pipeline, but I have no idea how to model
>>> the VL=0 case there.
>> Maybe so, but what Edwin is doing looks sensible enough.  It wouldn't be
>> the first time a hook got (ab)used in ways that weren't part of the
>> original intent.
> 
> I don't fully understand what's happening.  So the hoisting is being done
> speculatively here?  And it just happens to be "bad" because that might
> cause a VL=0 case.  But are we sure a lack of speculation cannot cause
> such cases?
Yes/No.  The scheduler certainly has code to avoid hoisting when doing 
so would  change semantics.  That's not what's happening here.

I'd have to put it in a debugger or read the full dumps with some crazy 
scheduler dump verbosity setting to be sure, but what I suspect is 
happening is the scheduler is processing a multi-block region 
(effectively an extended basic block).   In this scenario the scheduler 
can pull insns from a later block into an earlier block, including past 
a conditional branch as long as it doesn't change program semantics.


> 
> Also, why doesn't the vsetvl pass fix the situation?  IMHO we need to
> understand the problem more thoroughly before changing things.
> In the end LCM minimizes the number of vsetvls and inserts them at the
> "earliest" point.  If that is not sufficient I'd say we need modify
> the constraints (maybe on a per-uarch basis)?
The vsetvl pass is LCM based.  So it's not allowed to add a vsetvl on a
path that didn't have a vsetvl before.  Consider this simple graph.

     0
    / \
   2-->3

If we have need for a vsetvl in bb2, but not bb0 or bb3, then the vsetvl
will land in bb2.  bb0 is not a valid insertion point for the vsetvl
pass because the path 0->3 doesn't strictly need a vsetvl.  That's
inherent in the LCM algorithm (anticipatable).

The scheduler has no such limitations.  The scheduler might create a 
scheduling region out of blocks 0 and 2.  In that scenario, insns from 
block 2 may speculate into block 0 as long as doing so doesn't change 
semantics.

> 
> On a separate note:  How about we move the vsetvl pass after sched2?
> Then we could at least rely on LCM doing its work uninhibited and wouldn't
> reorder vsetvls afterwards.  Or do we somehow rely on rtl_dce and BB
> reorder to run afterwards?
> 
> That won't help with the problem here but might with others.
It's a double-edged sword.  If you defer placement until after
scheduling, then the vsetvls can wreak havoc with whatever schedule
sched2 came up with.  It won't matter much for out-of-order designs, but
potentially does for others.

In theory at sched2 time the insn stream should be fixed.  There are 
practical/historical exceptions, but changes to the insn stream after 
that point are discouraged.

jeff
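Jeff's anticipatability argument can be sketched as a tiny backward dataflow
over that three-block CFG (illustrative only; GCC's real implementation lives
in lcm.cc and also tracks transparency, availability, and more):

```cpp
#include <array>

// CFG from the example above: bb0 -> {bb2, bb3}, bb2 -> bb3.
// Index 1 is unused so block numbers match the example.
// need[b] marks blocks whose vector code requires the vsetvl.
// ANTIN(b) ("anticipatable in") holds when the vsetvl is wanted on
// every path leaving b; LCM may only insert where ANTIN holds.
std::array<bool, 4> compute_antin (const std::array<bool, 4> &need)
{
  std::array<bool, 4> antin = {};
  antin[3] = need[3];                            // bb3: exit block
  antin[2] = need[2] || antin[3];                // bb2's only succ is bb3
  antin[0] = need[0] || (antin[2] && antin[3]);  // ALL succs must want it
  return antin;
}
```

With need set only for bb2, ANTIN holds for bb2 but not bb0 (the 0->3 path
needs nothing), so LCM leaves the vsetvl in bb2; only the scheduler is then
free to hoist it into bb0.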
  
Jeffrey Law Feb. 13, 2025, 3:16 p.m. UTC | #12
On 2/13/25 5:12 AM, Vineet Gupta wrote:
> On 2/13/25 14:17, Robin Dapp wrote:
>>>>> Other thoughts?
>>>> The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff
>>>> we can't/don't model in the pipeline, but I have no idea how to model
>>>> the VL=0 case there.
>>> Maybe so, but what Edwin is doing looks sensible enough.  It wouldn't be
>>> the first time a hook got (ab)used in ways that weren't part of the
>>> original intent.
>> I don't fully understand what's happening.  So the hoisting is being done
>> speculatively here?  And it just happens to be "bad" because that might
>> cause a VL=0 case.  But are we sure a lack of speculation cannot cause
>> such cases?
> 
> Exactly. My gut feeling w/o deep dive was this seemed like papering over the issue.
Perhaps, but I'm pretty confident that even if this specific situation
turns out to be slightly different, the scenario I see can/will
happen elsewhere.

> 
> BTW what exactly is speculative scheduling ? As in what is it actually trying to
> schedule ahead ?
In simplest terms assume we have this kind of graph

     0
    / \
   1-->2


The scheduler knows how to build scheduling regions, essentially 
extended basic blocks.  In this case we have two regions one with the 
blocks 0,1 the other being just block 2.

In the multi-block region 0,1 we allow insns from block 1 to speculate 
into block 0.

Let's assume we're on a simple 2-wide in-order machine and somewhere in
bb0 there's a slot available for an insn that we couldn't fill with
anything useful from bb0.  In that case we may speculate an insn from
bb1 into bb0 to execute "for free" in that unused slot.

That's the basic idea.  It was particularly helpful for in-order cores 
in the past. It's dramatically less important for an out of order core 
since those are likely doing the speculation in hardware.

Naturally, if you're using icounts for evaluation, this kind of behavior
is highly undesirable since that kind of evaluation says the
transformation is bad, but in reality it is quite helpful on certain designs.

Jeff
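For reference, this is the decision point the patch hooks into: before an
insn is pulled across a block boundary within a region, the scheduler asks
the target whether speculating it is allowed.  A rough model of that gating
(not GCC's actual haifa-sched logic; the real hook takes an rtx_insn *):

```cpp
// Toy model of how a TARGET_SCHED_CAN_SPECULATE_INSN-style predicate
// gates cross-block motion inside a scheduling region.  The insn
// types mirror the attr values rejected by the patch.
enum insn_type { TYPE_ARITH, TYPE_LOAD, TYPE_VSETVL, TYPE_VSETVL_PRE };

static bool
can_speculate_insn (insn_type type)
{
  switch (type)
    {
    case TYPE_VSETVL:
    case TYPE_VSETVL_PRE:
      return false;  // keep vsetvls in their home block (PR117974)
    default:
      return true;
    }
}

// An insn from a later block may fill a free slot in an earlier block
// of the region only if the target allows speculating it; same-block
// motion is never speculation.
static bool
may_hoist (insn_type type, bool crosses_block_boundary)
{
  return !crosses_block_boundary || can_speculate_insn (type);
}
```

In this model the sub/sh2add insns may still cross the branch to fill free
slots, while both vsetvl flavors stay put.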
  
Robin Dapp Feb. 13, 2025, 3:19 p.m. UTC | #13
> The vsetvl pass is LCM based.  So it's not allowed to add a vsetvl on a
> path that didn't have a vsetvl before.  Consider this simple graph.
>
>      0
>     / \
>    2-->3
>
> If we have need for a vsetvl in bb2, but not bb0 or bb3, then the vsetvl
> will land in bb4.  bb0 is not a valid insertion point for the vsetvl
> pass because the path 0->3 doesn't strictly need a vsetvl.  That's
> inherent in the LCM algorithm (anticipatable).

Yeah, I remember the same issue with the rounding-mode setter placement.

Wouldn't that be fixable by requiring a dummy/wildcard/dontcare vsetvl in bb3
(or any other block that doesn't require one)?  Such a dummy vsetvl would be
fusible with every other vsetvl.  If there are dummy vsetvls remaining after
LCM just delete them?

Just thinking out loud, the devil will be in the details.
  
Robin Dapp Feb. 13, 2025, 3:26 p.m. UTC | #14
> Yeah, I remember the same issue with the rounding-mode setter placement.
>
> Wouldn't that be fixable by requiring a dummy/wildcard/dontcare vsetvl in bb3
> (or any other block that doesn't require one)?  Such a dummy vsetvl would be
> fusible with every other vsetvl.  If there are dummy vsetvls remaining after
> LCM just delete them?
>
> Just thinking out loud, the devil will be in the details.

Register liveness is of course relevant here.  Will surely depend on the
specific example whether that makes sense or not.
  
Jeffrey Law Feb. 13, 2025, 3:38 p.m. UTC | #15
On 2/13/25 8:19 AM, Robin Dapp wrote:
>> The vsevl pass is LCM based.  So it's not allowed to add a vsetvl on a
>> path that didn't have a vsetvl before.  Consider this simple graph.
>>
>>       0
>>      / \
>>     2-->3
>>
>> If we have need for a vsetvl in bb2, but not bb0 or bb3, then the vsetvl
>> will land in bb4.  bb0 is not a valid insertion point for the vsetvl
>> pass because the path 0->3 doesn't strictly need a vsetvl.  That's
>> inherent in the LCM algorithm (anticipatable).
> 
> Yeah, I remember the same issue with the rounding-mode setter placement.
Yes.  For VXRM placement, under the right circumstances we pretend there 
is a need for the VXRM state at the first instruction in the first BB. 
That enables very aggressive hoisting by LCM in those limited cases.



> 
> Wouldn't that be fixable by requiring a dummy/wildcard/dontcare vsetvl in bb3
> (or any other block that doesn't require one)?  Such a dummy vsetvl would be
> fusible with every other vsetvl.  If there are dummy vsetvls remaining after
> LCM just delete them?
> 
> Just thinking out loud, the devil will be in the details.
But in Vineet's case they want to avoid speculation as that can result 
in a vl=0 case.  If we had a dummy fusible vsetvl in bb3, then that 
would allow movement into bb0 which is undesirable.

WRT a question Palmer asked earlier in the thread.  I went back and 
reviewed the code/docs around the hook Edwin is using.  My reading is a 
bit different and that what Edwin is doing is perfectly fine.

Jeff


>
  
Palmer Dabbelt Feb. 13, 2025, 4:50 p.m. UTC | #16
On Thu, 13 Feb 2025 07:38:13 PST (-0800), jeffreyalaw@gmail.com wrote:
>
>
> On 2/13/25 8:19 AM, Robin Dapp wrote:
>>> The vsetvl pass is LCM based.  So it's not allowed to add a vsetvl on a
>>> path that didn't have a vsetvl before.  Consider this simple graph.
>>>
>>>       0
>>>      / \
>>>     2-->3
>>>
>>> If we have need for a vsetvl in bb2, but not bb0 or bb3, then the vsetvl
>>> will land in bb2.  bb0 is not a valid insertion point for the vsetvl
>>> pass because the path 0->3 doesn't strictly need a vsetvl.  That's
>>> inherent in the LCM algorithm (anticipatable).
>>
>> Yeah, I remember the same issue with the rounding-mode setter placement.
> Yes.  For VXRM placement, under the right circumstances we pretend there
> is a need for the VXRM state at the first instruction in the first BB.
> That enables very aggressive hoisting by LCM in those limited cases.
>
>
>
>>
>> Wouldn't that be fixable by requiring a dummy/wildcard/dontcare vsetvl in bb3
>> (or any other block that doesn't require one)?  Such a dummy vsetvl would be
>> fusible with every other vsetvl.  If there are dummy vsetvls remaining after
>> LCM just delete them?
>>
>> Just thinking out loud, the devil will be in the details.
> But in Vineet's case they want to avoid speculation as that can result
> in a vl=0 case.  If we had a dummy fusible vsetvl in bb3, then that
> would allow movement into bb0 which is undesirable.

Ya, I think we confused everyone because there's really two 
vsetvli/branch movement things we've been talking about and they're kind 
of the opposite.

There's the issue this patch works around, where we found some vsetvli 
instances that set VL=0 in unrolled loops.  That makes some of our 
hardware people upset.  Turns out the reduced test case has the branches 
to early-out of the unrolled loop when VL would be 0, so just banning 
vsetvli speculation fixes the issue.  It's kind of a indirect way to 
solve a uarch-specific problem, so who knows if it'll be worth doing.

Then there's the vsetvli loop-invariant hoisting / vector tail generation
thing we were talking about in the meeting this week.  Having the 
vsetvli in the loop made a different subset of our hardware people upset.  
That's kind of the opposite optimization, though we'd want to avoid the 
VL=0 case.  

They're both "Vineet's bug", the hardware people tend to call Vineet 
when they get upset ;)

> WRT a question Palmer asked earlier in the thread.  I went back and
> reviewed the code/docs around the hook Edwin is using.  My reading is a
> bit different and that what Edwin is doing is perfectly fine.

Awesome, thanks.  So I think if this is sane enough to run experiments 
we can at least try that out and see what happens.

> Jeff
>
>
>>
  
Palmer Dabbelt Feb. 13, 2025, 6:13 p.m. UTC | #17
On Thu, 13 Feb 2025 06:46:10 PST (-0800), jeffreyalaw@gmail.com wrote:
>
>
> On 2/13/25 1:47 AM, Robin Dapp wrote:
>>>>> Other thoughts?
>>>>
>>>> The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff
>>>> we can't/don't model in the pipeline, but I have no idea how to model
>>>> the VL=0 case there.
>>> Maybe so, but what Edwin is doing looks sensible enough.  It wouldn't be
>>> the first time a hook got (ab)used in ways that weren't part of the
>>> original intent.
>>
>> I don't fully understand what's happening.  So the hoisting is being done
>> speculatively here?  And it just happens to be "bad" because that might
>> cause a VL=0 case.  But are we sure a lack of speculation cannot cause
>> such cases?
> Yes/No.  The scheduler certainly has code to avoid hoisting when doing
> so would  change semantics.  That's not what's happening here.
> 
> I'd have to put it in a debugger or read the full dumps with some crazy
> scheduler dump verbosity setting to be sure, but what I suspect is
> happening is the scheduler is processing a multi-block region
> (effectively an extended basic block).   In this scenario the scheduler
> can pull insns from a later block into an earlier block, including past
> a conditional branch as long as it doesn't change program semantics.

(Sorry to keep crossing the threads here, there's just a lot in this one 
and stuff gets truncated.)

FWIW, that's what tripped up my "maybe there's a functional bug here" 
thought.  It looks like the scheduling is seeing

    bne t0, x0, end
    vsetvli t1, t2, ...
    vsetvli x0, t2, ...
    ...
  end:
    vsetvli x0, t2, ...

and thinking it's safe to schedule that like

    vsetvli t1, t2, ...
    bne t0, x0, end
    vsetvli x0, t2, ...
    ...
  end:
    vsetvli x0, t2, ...

which I'd assumed is because the scheduler sees both execution paths 
overwriting the vector control registers and thus thinks it's safe to 
move the first vsetvli to execute speculatively.  From reading 
"6. Configuration-Setting Instructions" in vector.md that seems 
intentional, though, so maybe it's all just fine?

>> Also, why doesn't the vsetvl pass fix the situation?  IMHO we need to
>> understand the problem more thoroughly before changing things.
>> In the end LCM minimizes the number of vsetvls and inserts them at the
>> "earliest" point.  If that is not sufficient I'd say we need to modify
>> the constraints (maybe on a per-uarch basis)?
> The vsetvl pass is LCM based.  So it's not allowed to add a vsetvl on a
> path that didn't have a vsetvl before.  Consider this simple graph.
>
>      0
>     / \
>    2-->3
>
> If we have need for a vsetvl in bb2, but not bb0 or bb3, then the vsetvl
> will land in bb2.  bb0 is not a valid insertion point for the vsetvl
> pass because the path 0->3 doesn't strictly need a vsetvl.  That's
> inherent in the LCM algorithm (anticipatable).
>
> The scheduler has no such limitations.  The scheduler might create a
> scheduling region out of blocks 0 and 2.  In that scenario, insns from
> block 2 may speculate into block 0 as long as doing so doesn't change
> semantics.

Ya.  The combination of the scheduler moving a vsetvli before the 
branch (IIUC from bb2 to bb0 here) and the vsetvli merging causes it to 
look like the whole vsetvli was moved before the branch.

I'm not sure why the scheduler doesn't move both vsetvli instructions to 
execute speculatively, but otherwise this seems to be behaving as 
designed.  It's just tripping up the VL=0 cases for us.

>> On a separate note:  How about we move the vsetvl pass after sched2?
>> Then we could at least rely on LCM doing its work uninhibited and wouldn't
>> reorder vsetvls afterwards.  Or do we somehow rely on rtl_dce and BB
>> reorder to run afterwards?
>>
>> That won't help with the problem here but might with others.
> It's a double-edged sword.  If you defer placement until after
> scheduling, then the vsetvls can wreak havoc with whatever schedule that
> sched2 came up with.  It won't matter much for out of order designs, but
> potentially does for others.

Maybe that's a broad uarch split point here?  For OOO designs we'd 
want to rely on HW scheduling and thus avoid hoisting possibly-expensive 
vsetvli instructions (where they'd need to execute in HW because of the 
side effects), while on in-order designs we'd want to aggressively 
schedule vsetvli instructions because we can't rely on HW scheduling to 
hide the latency.

> In theory at sched2 time the insn stream should be fixed.  There are
> practical/historical exceptions, but changes to the insn stream after
> that point are discouraged.

We were just talking about this in our toolchain team meeting, and it 
seems like both GCC and LLVM are in similar spots here -- essentially 
the required set of vsetvli instructions depends very strongly on 
scheduling, so trying to do them independently is just always going to 
lead to sub-par results.  It feels kind of like we want some 
scheduling-based cost feedback in the vsetvli pass (or the other way 
around if they're in the other order) to get better results.

Maybe that's too much of a time sink for the OOO machines, though?  If 
we've got HW scheduling then the SW just has to be in the ballpark and 
everything should be fine.
  
Edwin Lu Feb. 13, 2025, 6:34 p.m. UTC | #18
On 2/13/2025 4:12 AM, Vineet Gupta wrote:
> On 2/13/25 14:17, Robin Dapp wrote:
>>>>> Other thoughts?
>>>> The docs seem to hint TARGET_SCHED_CAN_SPECULATE_INSN is meant for stuff
>>>> we can't/don't model in the pipeline, but I have no idea how to model
>>>> the VL=0 case there.
>>> Maybe so, but what Edwin is doing looks sensible enough.  It wouldn't be
>>> the first time a hook got (ab)used in ways that weren't part of the
>>> original intent.
>> I don't fully understand what's happening.  So the hoisting is being done
>> speculatively here?  And it just happens to be "bad" because that might
>> cause a VL=0 case.  But are we sure a lack of speculation cannot cause
>> such cases?
> Exactly. My gut feeling w/o deep dive was this seemed like papering over the issue.
>
> BTW what exactly is speculative scheduling?  As in, what is it actually trying to
> schedule ahead?
>
>> Also, why doesn't the vsetvl pass fix the situation?  IMHO we need to
>> understand the problem more thoroughly before changing things.
>> In the end LCM minimizes the number of vsetvls and inserts them at the
>> "earliest" point.  If that is not sufficient I'd say we need to modify
>> the constraints (maybe on a per-uarch basis)?
> As far as LCM is concerned it is hoisting the insn to the optimal spot. However
> there's some additional logic such as in can_use_next_avl_p () which influences
> if things can be moved around.

Since sched1 put the vsetvl right before the branch, that was always 
determined to be the "earliest" point because it was now available on 
all outgoing edges. Without the vsetvl right before the branch, the 
"earliest" point to insert the vsetvls was determined to be the 
beginning of each basic block.

I did try adding some additional logic to adjust the way vsetvl fusion 
occurs across basic blocks in these scenarios, i.e. performing the 
fusion in the opposite manner (breaking LCM guarantees); however, from 
my testing, fusing two vsetvls didn't actually remove the fused 
expression from the vinfo list. I'm not sure if that's intended, but as a 
result, phase 3 would remove the fused block and use the vinfo that 
should've been fused into the other.

>> That won't help with the problem here but might with others.
> Right this needs to be evaluated independently with both icounts and BPI3 runs
> to see if anything falls out.
>
> -Vineet

I'll add an opt flag to gate this for testing purposes.

Edwin
  
Robin Dapp Feb. 13, 2025, 6:57 p.m. UTC | #19
> I did try adding some additional logic to adjust the way vsetvl fusion 
> occurs across basic blocks in these scenarios  i.e. performing the 
> fusion in the opposite manner (breaking lcm guarantees); however, from 
> my testing, fusing two vsetvls didn't actually remove the fused 
> expression from the vinfo list. I'm not sure if that's intended but as a 
> result, phase 3 would remove the fused block and use the vinfo that 
> should've been fused into the other.

It depends on the specific example, but keeping deleted vsetvls/infos around
has a purpose because it can still help delete other vsetvls.  I don't recall
the details, but I remember having at least a few examples of it.
  
Jeffrey Law Feb. 13, 2025, 7:17 p.m. UTC | #20
On 2/13/25 11:57 AM, Robin Dapp wrote:
>> I did try adding some additional logic to adjust the way vsetvl fusion
>> occurs across basic blocks in these scenarios  i.e. performing the
>> fusion in the opposite manner (breaking lcm guarantees); however, from
>> my testing, fusing two vsetvls didn't actually remove the fused
>> expression from the vinfo list. I'm not sure if that's intended but as a
>> result, phase 3 would remove the fused block and use the vinfo that
>> should've been fused into the other.
> 
> It depends on the specific example but keeping deleted vsetvls/infos around
> has a purpose because it helps delete other vsetvls still.  I don't recall
> details but I remember having at least a few examples for it.
Yea, that can certainly happen with LCM based algorithms when computing 
the availability and anticipatable sets.

Jeff
  
Jeffrey Law Feb. 13, 2025, 11:28 p.m. UTC | #21
On 2/13/25 11:13 AM, Palmer Dabbelt wrote:

> 
> FWIW, that's what tripped up my "maybe there's a functional bug here" 
> thought.  It looks like the scheduling is seeing
> 
>     bne t0, x0, end
>     vsetvli t1, t2, ...
>     vsetvli x0, t2, ...
>     ...
>   end:
>     vsetvli x0, t2, ...
> 
> and thinking it's safe to schedule that like
> 
>     vsetvli t1, t2, ...
>     bne t0, x0, end
>     vsetvli x0, t2, ...
>     ...
>   end:
>     vsetvli x0, t2, ...
> 
> which I'd assumed is because the scheduler sees both execution paths 
> overwriting the vector control registers and thus thinks it's safe to 
> move the first vsetvli to execute speculatively.  From reading "6. 
> Configuration-Setting Instructions" in vector.md that seems intentional, 
> though, so maybe it's all just fine?
I think it's fine.   Perhaps not what we want from a performance 
standpoint, but functionally safe.


> 
>>> Also, why doesn't the vsetvl pass fix the situation?  IMHO we need to
>>> understand the problem more thoroughly before changing things.
>>> In the end LCM minimizes the number of vsetvls and inserts them at the
>>> "earliest" point.  If that is not sufficient I'd say we need to modify
>>> the constraints (maybe on a per-uarch basis)?
>> The vsetvl pass is LCM based.  So it's not allowed to add a vsetvl on a
>> path that didn't have a vsetvl before.  Consider this simple graph.
>>
>>      0
>>     / \
>>    2-->3
>>
>> If we have need for a vsetvl in bb2, but not bb0 or bb3, then the vsetvl
>> will land in bb2.  bb0 is not a valid insertion point for the vsetvl
>> pass because the path 0->3 doesn't strictly need a vsetvl.  That's
>> inherent in the LCM algorithm (anticipatable).
>>
>> The scheduler has no such limitations.  The scheduler might create a
>> scheduling region out of blocks 0 and 2.  In that scenario, insns from
>> block 2 may speculate into block 0 as long as doing so doesn't change
>> semantics.
> 
> Ya.  The combination of the scheduler moving a vsetvli before the branch 
> (IIUC from bb2 to bb0 here) and the vsetvli merging causes it to look 
> like the whole vsetvli was moved before the branch.
> 
> I'm not sure why the scheduler doesn't move both vsetvli instructions to 
> execute speculatively, but otherwise this seems to be behaving as 
> designed.  It's just tripping up the VL=0 cases for us.
You'd have to get into those dumps and possibly throw the compiler under 
a debugger.  My guess is it didn't see any advantage in doing so.


> 
> Maybe that's a broad uarch split point here?  For OOO designs we'd want 
> to rely on HW scheduling and thus avoid hoisting possibly-expensive 
> vsetvli instructions (where they'd need to execute in HW because of the 
> side effects), while on in-order designs we'd want to aggressively 
> schedule vsetvli instructions because we can't rely on HW scheduling to 
> hide the latency.
There may be.  But the natural question would be cost/benefit.  It may 
not buy us anything on the performance side to defer vsetvl insertion 
for OOO cores.  At which point the only advantage is testsuite 
stability.  And if that's the only benefit, we may be able to do that 
through other mechanisms.


> 
>> In theory at sched2 time the insn stream should be fixed.  There are
>> practical/historical exceptions, but changes to the insn stream after
>> that point are discouraged.
> 
> We were just talking about this in our toolchain team meeting, and it 
> seems like both GCC and LLVM are in similar spots here -- essentially 
> the required set of vsetvli instructions depends very strongly on 
> scheduling, so trying to do them independently is just always going to 
> lead to sub-par results.  It feels kind of like we want some scheduling- 
> based cost feedback in the vsetvli pass (or the other way around if 
> they're in the other order) to get better results.
> 
> Maybe that's too much of a time sink for the OOO machines, though?  If 
> we've got HW scheduling then the SW just has to be in the ballpark and 
> everything should be fine.
I'd guess it's more work than it'd be worth.  We're just not seeing 
vsetvls being all that problematic on our design.  I do see a lot of 
seemingly gratuitous changes in the vector config, but when we make 
changes to fix that we generally end up with worse-performing code.

Jeff
  
Vineet Gupta Feb. 14, 2025, 3:29 a.m. UTC | #22
On 2/14/25 04:58, Jeff Law wrote:
> I'd guess it's more work than it'd be worth.  We're just not seeing 
> vsetvls being all that problematic on our design.  I do see a lot of 
> seemingly gratuitous changes in the vector config, but when we make 
> changes to fix that we generally end up with worse-performing code.

To be clear, the VSETVLs on their own are not problematic for us either;
it's when they cause VL=0 that they are.
I have a change in the works which could introduce additional VSETVLs ;-)
-Vineet
  
Vineet Gupta Feb. 14, 2025, 4:18 a.m. UTC | #23
On 2/13/25 20:46, Jeff Law wrote:
>> BTW what exactly is speculative scheduling?  As in, what is it actually trying to
>> schedule ahead?
> In simplest terms assume we have this kind of graph
>
>      0
>     / \
>    1-->2
>
>
> The scheduler knows how to build scheduling regions, essentially 
> extended basic blocks.  In this case we have two regions one with the 
> blocks 0,1 the other being just block 2.
>
> In the multi-block region 0,1 we allow insns from block 1 to speculate 
> into block 0.
>
> Let's assume we're on a simple 2-wide in-order machine and somewhere in 
> bb0 there's a slot available for an insn that we couldn't fill with 
> anything useful from bb0.  In that case we may speculate an insn from 
> bb1 into bb0 to execute "for free" in that unused slot.
>
> That's the basic idea.  It was particularly helpful for in-order cores 
> in the past. It's dramatically less important for an out of order core 
> since those are likely doing the speculation in hardware.

That is great info, super helpful.

Given this background, I'd argue that Edwin's patch to barricade vsetvls in
scheduling is the right thing to do anyways, this issue or otherwise.

> Naturally if you're using icounts for evaluation this kind of behavior 
> is highly undesirable, since that kind of evaluation says the 
> transformation is bad when in reality on certain designs it's quite helpful.

Sure.

Thx,
-Vineet
  

Patch

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 6e14126e3a4..24450bae517 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -10209,6 +10209,23 @@  riscv_sched_adjust_cost (rtx_insn *, int, rtx_insn *insn, int cost,
   return new_cost;
 }
 
+/* Implement TARGET_SCHED_CAN_SPECULATE_INSN hook.  Return true if INSN
+   can be scheduled for speculative execution.  Reject vsetvl instructions to
+   prevent the scheduler from hoisting them out of basic blocks without
+   checking for data dependencies (PR 117974).  */
+static bool
+riscv_sched_can_speculate_insn (rtx_insn *insn)
+{
+  switch (get_attr_type (insn))
+    {
+      case TYPE_VSETVL:
+      case TYPE_VSETVL_PRE:
+	return false;
+      default:
+	return true;
+    }
+}
+
 /* Auxiliary function to emit RISC-V ELF attribute. */
 static void
 riscv_emit_attribute ()
@@ -14055,6 +14072,9 @@  bool need_shadow_stack_push_pop_p ()
 #undef  TARGET_SCHED_ADJUST_COST
 #define TARGET_SCHED_ADJUST_COST riscv_sched_adjust_cost
 
+#undef TARGET_SCHED_CAN_SPECULATE_INSN
+#define TARGET_SCHED_CAN_SPECULATE_INSN riscv_sched_can_speculate_insn
+
 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL riscv_function_ok_for_sibcall