[RISC-V] Reorder the ready queue to avoid extraneous vsetvls

Message ID e4e417df-3b1a-4443-94cd-1589f26842c5@gmail.com
State Dropped
Delegated to: Jeff Law
Series [RISC-V] Reorder the ready queue to avoid extraneous vsetvls

Checks

Context Check Description
rivoscibot/toolchain-ci-rivos-lint success Lint passed
rivoscibot/toolchain-ci-rivos-apply-patch success Patch applied
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc_zba_zbb_zbc_zbs-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gc-lp64d-non-multilib success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc-lp64d-non-multilib success Build passed
rivoscibot/toolchain-ci-rivos-test success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed

Commit Message

Jeff Law Oct. 31, 2024, 1:06 a.m. UTC
  So this patch is a very conservative approach to eliminate more vsetvl 
instructions.

As we know, scheduling can scramble the instruction stream based on a 
variety of factors and can easily result in an instruction sequence 
where we ping-pong between different vector configurations, thus 
resulting in more vsetvl instructions than if we had issued instructions 
in a slightly different order.
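
To make that concrete, here's a hypothetical hand-written sequence (not
taken from the patch or any benchmark) where the scheduler has interleaved
independent e16 and e64 work, forcing a vtype change at nearly every
instruction, versus the same work grouped by configuration:

	# ping-pong ordering: one vsetivli per vector insn
	vsetivli	zero,4,e16,m1,ta,ma
	vadd.vv		v1,v2,v3
	vsetivli	zero,4,e64,m1,ta,ma
	vadd.vv		v4,v5,v6
	vsetivli	zero,4,e16,m1,ta,ma
	vadd.vv		v7,v8,v9

	# grouped ordering: two vsetivlis total
	vsetivli	zero,4,e16,m1,ta,ma
	vadd.vv		v1,v2,v3
	vadd.vv		v7,v8,v9
	vsetivli	zero,4,e64,m1,ta,ma
	vadd.vv		v4,v5,v6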

Robin did some experiments with vsetvl aware scheduling several weeks 
ago.  I believe he was adjusting the priority or cost of the insns based 
on their vsetvl needs.  This experiment actually showed worse 
performance on our design, which is a good indicator that scheduling for 
latency and functional hazards is more important than eliminating vsetvl 
instructions (not a surprise obviously).

That experiment greatly influenced this implementation.  Specifically we 
don't adjust cost/priority of any insns.  Instead we use vector 
configuration information to potentially swap two insns in the 
scheduler's ready queue iff swapping would result in avoiding a vsetvl 
and the two insns being swapped in the ready queue have the same 
priority/cost.

So it's quite conservative, essentially using vector configuration as a 
sort key of last resort and only for the highest priority insns in the 
ready queue and only when it's going to let us eliminate a vsetvl.
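
Concretely, suppose the ready queue (head at the right, as the patch's
loop assumes) holds three equal-priority insns and the last issued vector
insn used an e16 configuration -- a hypothetical queue, purely for
illustration:

	before:  C(e16)  B(e64)  A(e64)   <- A issues next, needs a vsetvl
	after:   A(e64)  B(e64)  C(e16)   <- C issues next, no vsetvl needed

The hook simply swaps A and C; everything else about the schedule is
left untouched.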

For something like SATD we eliminate a few gratuitous vsetvl 
instructions.  As expected, it makes no performance difference in the 
pico benchmarks we've built on our design.  The main goal here is to 
sidestep questions about the over-active changing of vector configurations.

My BPI is busy, so I don't have performance data on that design, but it 
does result in a ~1% drop in dynamic instructions.  On a high 
performance design I wouldn't expect this to help performance in any 
significant way, but it does make the code look better.

Bootstrapped and regression tested on riscv64-linux-gnu, also regression 
tested in my tester for rv32 and rv64 elf configurations.


I'll wait for the pre-commit tester to render a verdict *and* for others 
who know the vsetvl code to have time to chime in.  Assuming clean from 
pre-commit and no negative comments the plan would be to commit next 
week sometime.

Jeff
* config/riscv/riscv-protos.h (has_vtype_op): Declare.
	(mask_agnostic_p, get_avl, vsetvl_insn_p): Likewise.
	* config/riscv/riscv-vsetvl.cc (has_vtype_op): No longer static.
	(vsetvl_insn_p, get_avl, mask_agnostic_p): Likewise.
	* config/riscv/riscv.cc (last_vconfig): New structure.
	(clear_vconfig, compatible_with_last_vconfig): New functions.
	(riscv_sched_init, riscv_sched_reorder): Likewise.
	(riscv_sched_variable_issue): Record vector config as insns
	are issued.
	(TARGET_SCHED_INIT, TARGET_SCHED_REORDER): Define.

	* gcc.target/riscv/rvv/vsetvl/pr111037-1.c: Adjust expected output.
	* gcc.target/riscv/rvv/vsetvl/pr111037-4.c: Likewise.
	* gcc.target/riscv/rvv/vsetvl/pr113248.c: Likewise.
  

Comments

Robin Dapp Oct. 31, 2024, 10:30 a.m. UTC | #1
> So it's quite conservative, essentially using vector configuration as a 
> sort key of last resort and only for the highest priority insns in the 
> ready queue and only when it's going to let us eliminate a vsetvl.

On the one hand I wondered if it does make sense to re-use compatible_p from
the vsetvl pass for the match check but on the other hand I don't see a clear
gain by doing that and it would increase the complexity significantly.
So your approach makes sense to me.

My first very simple attempt with re-ordering actually involved only the last
vector mode but I didn't continue with that attempt.  Out of curiosity, did
you try that as well at some point?
  
Jeff Law Oct. 31, 2024, 1:17 p.m. UTC | #2
On 10/31/24 4:30 AM, Robin Dapp wrote:
>> So it's quite conservative, essentially using vector configuration as a
>> sort key of last resort and only for the highest priority insns in the
>> ready queue and only when it's going to let us eliminate a vsetvl.
> 
> On the one hand I wondered if it does make sense to re-use compatible_p from
> the vsetvl pass for the match check but on the other hand I don't see a clear
> gain by doing that and it would increase the complexity significantly.
> So your approach makes sense to me.
Yea, that would have been my preference rather than writing a new 
function, but there was enough of an impedance mismatch that the 
simplest approach was cleaner.

> 
> My first very simple attempt with re-ordering actually involved only the last
> vector mode but I didn't continue with that attempt.  Out of curiosity, did
> you try that as well at some point?
Do you mean only testing vector modes, ignoring the other attributes? 
No, I never tested that.  "Stay conservative and iterate later if we 
find the need" was my mantra.

We could do things similar to the vsetvl insertion pass, where we realize 
that two configs are equivalent for data movement even if they aren't 
identical.  But given it's not expected to actually improve performance, 
just drop the icount and make the code look cleaner, I kept it as simple 
as possible.

jeff
  
Jeff Law Nov. 5, 2024, 12:03 a.m. UTC | #3
On 10/30/24 7:06 PM, Jeff Law wrote:
> So this patch is a very conservative approach to eliminate more vsetvl 
> instructions.
> 
> As we know, scheduling can scramble the instruction stream based on a 
> variety of factors and can easily result in an instruction sequence 
> where we ping-pong between different vector configurations, thus 
> resulting in more vsetvl instructions than if we had issued instructions 
> in a slightly different order.
> 
> Robin did some experiments with vsetvl aware scheduling several weeks 
> ago.  I believe he was adjusting the priority or cost of the insns based 
> on their vsetvl needs.  This experiment actually showed worse 
> performance on our design, which is a good indicator that scheduling for 
> latency and functional hazards is more important than eliminating vsetvl 
> instructions (not a surprise obviously).
> 
> That experiment greatly influenced this implementation.  Specifically we 
> don't adjust cost/priority of any insns.  Instead we use vector 
> configuration information to potentially swap two insns in the 
> scheduler's ready queue iff swapping would result in avoiding a vsetvl 
> and the two insns being swapped in the ready queue have the same 
> priority/cost.
> 
> So it's quite conservative, essentially using vector configuration as a 
> sort key of last resort and only for the highest priority insns in the 
> ready queue and only when it's going to let us eliminate a vsetvl.
> 
> For something like SATD we eliminate a few gratuitous vsetvl 
> instructions.  As expected, it makes no performance difference in the 
> pico benchmarks we've built on our design.  The main goal here is to 
> sidestep questions about the over-active changing of vector 
> configurations.
> 
> My BPI is busy, so I don't have performance data on that design, but it 
> does result in a ~1% drop in dynamic instructions.  On a high 
> performance design I wouldn't expect this to help performance in any 
> significant way, but it does make the code look better.
> 
> Bootstrapped and regression tested on riscv64-linux-gnu, also regression 
> tested in my tester for rv32 and rv64 elf configurations.
> 
> 
> I'll wait for the pre-commit tester to render a verdict *and* for others 
> who know the vsetvl code to have time to chime in.  Assuming clean from 
> pre-commit and no negative comments the plan would be to commit next 
> week sometime.
So I'm putting this on hold.  I saw unexpected performance 
regressions when I ran this on the Ventana design.  Not really sure 
what's going on, and given this was supposed to be neutral, something's 
clearly not right.  That would tend to indicate something is goofy in 
the priorities.

The other possibility is the swap step.  It would have been better to 
slide the entries into the hole vacated by the insn we want to move to 
the head of the list -- that permutes the schedule a bit less.  But if 
that's making a performance difference, again that would be a sign that 
priorities are goofy.
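
For reference, a minimal sketch of that alternative (hypothetical -- it is
not what the posted patch does), replacing the std::swap call in
riscv_sched_reorder.  std::rotate moves the compatible insn into the head
slot while sliding the intervening entries down one position, preserving
their relative order:

	  /* Move READY[I] to the head of the queue (READY[NREADY - 1] in
	     GCC's convention) and shift READY[I+1 .. NREADY-1] down by one
	     slot, keeping their relative order intact.  */
	  std::rotate (&ready[i], &ready[i + 1], &ready[nready]);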

Either way, I'm putting this on ice for now.  Hopefully I'll come back 
to it (or someone else can take a looksie) in the not terribly distant 
future.

jeff
  

Patch

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 1e6d10a1402..96d0e13577e 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -197,6 +197,12 @@  rtl_opt_pass * make_pass_shorten_memrefs (gcc::context *ctxt);
 rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
 rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
 
+/* Routines implemented in riscv-vsetvl.cc.  */
+extern bool has_vtype_op (rtx_insn *);
+extern bool mask_agnostic_p (rtx_insn *);
+extern rtx get_avl (rtx_insn *);
+extern bool vsetvl_insn_p (rtx_insn *);
+
 /* Routines implemented in riscv-string.c.  */
 extern bool riscv_expand_block_compare (rtx, rtx, rtx, rtx);
 extern bool riscv_expand_block_move (rtx, rtx, rtx);
diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index 030ffbe2ebb..925b95b4063 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -259,7 +259,7 @@  policy_to_str (bool agnostic_p)
 
 /* Return true if it is an RVV instruction depends on VTYPE global
    status register.  */
-static bool
+bool
 has_vtype_op (rtx_insn *rinsn)
 {
   return recog_memoized (rinsn) >= 0 && get_attr_has_vtype_op (rinsn);
@@ -307,7 +307,7 @@  vector_config_insn_p (rtx_insn *rinsn)
 }
 
 /* Return true if it is vsetvldi or vsetvlsi.  */
-static bool
+bool
 vsetvl_insn_p (rtx_insn *rinsn)
 {
   if (!rinsn || !vector_config_insn_p (rinsn))
@@ -387,7 +387,7 @@  get_vl (rtx_insn *rinsn)
 }
 
 /* Helper function to get AVL operand.  */
-static rtx
+rtx
 get_avl (rtx_insn *rinsn)
 {
   if (vsetvl_insn_p (rinsn) || vsetvl_discard_result_insn_p (rinsn))
@@ -412,7 +412,7 @@  get_default_ma ()
 }
 
 /* Helper function to get MA operand.  */
-static bool
+bool
 mask_agnostic_p (rtx_insn *rinsn)
 {
   /* If it doesn't have MA, we return agnostic by default.  */
diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index e111cb07284..71a1aaf96fb 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -9612,6 +9612,71 @@  riscv_issue_rate (void)
   return tune_param->issue_rate;
 }
 
+/* Structure for very basic vector configuration tracking in the scheduler.  */
+struct last_vconfig
+{
+  bool valid;
+  bool ta;
+  bool ma;
+  uint8_t sew;
+  uint8_t vlmul;
+  rtx avl;
+} last_vconfig;
+
+/* Clear LAST_VCONFIG so we have no known state.  */
+static void
+clear_vconfig (void)
+{
+  memset (&last_vconfig, 0, sizeof (last_vconfig));
+}
+
+/* Return TRUE if INSN is a vector insn needing a particular
+   vector configuration that is trivially equal to the last
+   vector insn issued.  Return FALSE otherwise.  */
+static bool
+compatible_with_last_vconfig (rtx_insn *insn)
+{
+  /* We might be able to extract the data from a preexisting vsetvl.  */
+  if (vsetvl_insn_p (insn))
+    return false;
+
+  /* Nothing to do for these cases.  */
+  if (!NONDEBUG_INSN_P (insn) || !has_vtype_op (insn))
+    return false;
+
+  extract_insn_cached (insn);
+
+  rtx avl = get_avl (insn);
+  if (avl != last_vconfig.avl)
+    return false;
+
+  if (get_sew (insn) != last_vconfig.sew)
+    return false;
+
+  if (get_vlmul (insn) != last_vconfig.vlmul)
+    return false;
+
+  if (tail_agnostic_p (insn) != last_vconfig.ta)
+    return false;
+
+  if (mask_agnostic_p (insn) != last_vconfig.ma)
+    return false;
+
+  /* No differences found, they're trivially compatible.  */
+  return true;
+}
+
+/* Implement TARGET_SCHED_INIT, we use this to track the vector configuration
+   of the last issued vector instruction.  We can then use that information
+   to potentially adjust the ready queue to issue instructions of a compatible
+   vector configuration instead of a conflicting configuration.  That will
+   reduce the number of vsetvl instructions we ultimately emit.  */
+static void
+riscv_sched_init (FILE *, int, int)
+{
+  clear_vconfig ();
+}
+
 /* Implement TARGET_SCHED_VARIABLE_ISSUE.  */
 static int
 riscv_sched_variable_issue (FILE *, int, rtx_insn *insn, int more)
@@ -9636,9 +9701,85 @@  riscv_sched_variable_issue (FILE *, int, rtx_insn *insn, int more)
      an assert so we can find and fix this problem.  */
   gcc_assert (insn_has_dfa_reservation_p (insn));
 
+  /* If this is a vector insn with vl/vtype info, then record the last
+     vector configuration.  */
+  if (vsetvl_insn_p (insn))
+    clear_vconfig ();
+  else if (NONDEBUG_INSN_P (insn) && has_vtype_op (insn))
+    {
+      extract_insn_cached (insn);
+
+      rtx avl = get_avl (insn);
+      if (avl == RVV_VLMAX)
+	avl = const0_rtx;
+
+      if (!avl || !CONST_INT_P (avl))
+	clear_vconfig ();
+      else
+	{
+	  last_vconfig.valid = true;
+	  last_vconfig.avl = avl;
+	  last_vconfig.sew = get_sew (insn);
+	  last_vconfig.vlmul = get_vlmul (insn);
+	  last_vconfig.ta = tail_agnostic_p (insn);
+	  last_vconfig.ma = mask_agnostic_p (insn);
+	}
+    }
   return more - 1;
 }
 
+/* Implement TARGET_SCHED_REORDER.  The goal here is to look at the ready
+   queue and reorder it ever so slightly to encourage issuing an insn with
+   the same vector configuration as the most recently issued vector
+   instruction.  That will reduce vsetvl instructions.  */
+static int
+riscv_sched_reorder (FILE *, int, rtx_insn **ready, int *nreadyp, int)
+{
+  /* If we don't have a valid prior vector configuration, then there is
+     no point in reordering the ready queue, similarly if there is
+     just one entry in the queue.  */
+  if (!last_vconfig.valid || *nreadyp == 1)
+    return riscv_issue_rate ();
+
+  int nready = *nreadyp;
+  int priority = INSN_PRIORITY (ready[nready - 1]);
+  for (int i = nready - 1; i >= 0; i--)
+    {
+      rtx_insn *insn = ready[i];
+
+      /* On a high performance core, vsetvl instructions should be
+	 inexpensive.  Removing them is very much a secondary concern, so
+	 be extremely conservative with reordering, essentially only
+	 allowing reordering within the highest priority value.
+
+	 Lower end cores may benefit from more flexibility here.  That
+	 tuning is left to those who understand their core's behavior
+	 and can thoroughly benchmark the result.  Assuming such
+	 designs appear, we can probably put an entry in the tuning
+	 structure to indicate how much difference in priority to allow.  */
+      if (INSN_PRIORITY (insn) < priority)
+	break;
+
+      if (compatible_with_last_vconfig (insn))
+	{
+	  /* This entry is compatible with the last vconfig and has
+	     the same priority as the most important insn.  So swap
+	     it so that we keep the vector configuration as-is and
+	     ultimately eliminate a vsetvl.
+
+	     Note no need to swap if this is the first entry in the
+	     queue.  */
+	  if (i == nready - 1)
+	    break;
+
+	  std::swap (ready[i], ready[nready - 1]);
+	  break;
+	}
+    }
+
+  return riscv_issue_rate ();
+}
+
 /* Implement TARGET_SCHED_MACRO_FUSION_P.  Return true if target supports
    instruction fusion of some sort.  */
 
@@ -12610,9 +12751,15 @@  riscv_stack_clash_protection_alloca_probe_range (void)
 #undef TARGET_SCHED_MACRO_FUSION_PAIR_P
 #define TARGET_SCHED_MACRO_FUSION_PAIR_P riscv_macro_fusion_pair_p
 
+#undef TARGET_SCHED_INIT
+#define TARGET_SCHED_INIT riscv_sched_init
+
 #undef  TARGET_SCHED_VARIABLE_ISSUE
 #define TARGET_SCHED_VARIABLE_ISSUE riscv_sched_variable_issue
 
+#undef  TARGET_SCHED_REORDER
+#define TARGET_SCHED_REORDER riscv_sched_reorder
+
 #undef  TARGET_SCHED_ADJUST_COST
 #define TARGET_SCHED_ADJUST_COST riscv_sched_adjust_cost
 
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr111037-1.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr111037-1.c
index 803ce5702eb..96c1f312cc4 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr111037-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr111037-1.c
@@ -11,5 +11,5 @@  void foo(_Float16 y, int64_t *i64p)
   asm volatile ("# use %0 %1" : : "vr"(vx), "vr" (vy));
 }
 
-/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e16,\s*mf4,\s*t[au],\s*m[au]} 1 } } */
-/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*zero,\s*e64,\s*m1,\s*t[au],\s*m[au]} 1 } } */
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e16,\s*m1,\s*t[au],\s*m[au]} 1 } } */
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e64,\s*m1,\s*t[au],\s*m[au]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr111037-4.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr111037-4.c
index 5949085bdc9..6cf3c64679f 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr111037-4.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr111037-4.c
@@ -12,5 +12,5 @@  void foo(_Float16 y, int16_t z, int64_t *i64p)
   asm volatile ("# use %0 %1" : : "vr"(vx), "vr" (vy), "vr" (vz));
 }
 
-/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e16,\s*mf4,\s*t[au],\s*m[au]} 1 } } */
-/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*zero,\s*e64,\s*m1,\s*t[au],\s*m[au]} 1 } } */
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e64,\s*m1,\s*t[au],\s*m[au]} 1 } } */
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e16,\s*m1,\s*t[au],\s*m[au]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr113248.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr113248.c
index d95281362a8..fe573933977 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr113248.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/pr113248.c
@@ -11,5 +11,5 @@  void foo(_Float16 y, int64_t *i64p)
   asm volatile ("# use %0 %1" : : "vr"(vx), "vr" (vy));
 }
 
-/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e16,\s*mf4,\s*t[au],\s*m[au]} 1 } } */
-/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*zero,\s*e64,\s*m1,\s*t[au],\s*m[au]} 1 } } */
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e64,\s*m1,\s*t[au],\s*m[au]} 1 } } */
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*1,\s*e16,\s*m1,\s*t[au],\s*m[au]} 1 } } */