[to-be-committed,V2,RISC-V] Avoid unnecessary extensions after sCC insns

Message ID bbcfcba8-068b-4a59-81d8-a2816a58313f@gmail.com
State Committed
Commit b567e5ead5d54f022c57b48f31653f6ae6ece007
Delegated to: Jeff Law
Headers
Series [to-be-committed,V2,RISC-V] Avoid unnecessary extensions after sCC insns |

Checks

Context Check Description
rivoscibot/toolchain-ci-rivos-apply-patch success Patch applied
rivoscibot/toolchain-ci-rivos-lint success Lint passed
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc_zba_zbb_zbc_zbs-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gc-lp64d-non-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc-lp64d-non-multilib success Build passed
rivoscibot/toolchain-ci-rivos-test success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 warning Patch is already merged
linaro-tcwg-bot/tcwg_gcc_check--master-arm warning Patch is already merged

Commit Message

Jeff Law Sept. 5, 2024, 6:03 p.m. UTC
  So the first patch failed the pre-commit CI; it didn't fail in my 
testing because I'm using --with-arch to set a default configuration 
that includes things like zicond to ensure that's always tested.  And 
the failing test is skipped when zicond is enabled by default.

The failing test is designed to ensure that we don't miss an 
if-conversion due to costing issues around the extension that was 
typically done in an sCC sequence (which is why it's only run when 
zicond is off).

The test failed because we have a little routine that is highly 
dependent on the code generated by the sCC expander and will adjust the 
costing to account for expansion quirks that usually go away in register 
allocation.


That code needs to be enhanced to work after the sCC expansion change. 
Essentially it needs to account for the subreg extraction that shows up 
in the sequence as well as being a bit looser on mode checking.

I kept the code working for the old sequences -- in theory a user could 
conjure up the old sequence so handling them seems useful.

This also drops the testsuite changes.  Palmer's change makes them 
unnecessary.

Waiting on pre-commit CI before taking any further action...

Jeff
  

Comments

Palmer Dabbelt Sept. 5, 2024, 6:46 p.m. UTC | #1
On Thu, 05 Sep 2024 11:03:18 PDT (-0700), jeffreyalaw@gmail.com wrote:
>
> So the first patch failed the pre-commit CI; it didn't fail in my
> testing because I'm using --with-arch to set a default configuration
> that includes things like zicond to ensure that's always tested.  And
> the failing test is skipped when zicond is enabled by default.
>
> The failing test is designed to ensure that we don't miss an
> if-conversion due to costing issues around the extension that was
> typically done in an sCC sequence (which is why it's only run when
> zicond is off).
>
> The test failed because we have a little routine that is highly
> dependent on the code generated by the sCC expander and will adjust the
> costing to account for expansion quirks that usually go away in register
> allocation.
>
>
> That code needs to be enhanced to work after the sCC expansion change.
> Essentially it needs to account for the subreg extraction that shows up
> in the sequence as well as being a bit looser on mode checking.
>
> I kept the code working for the old sequences -- in theory a user could
> conjure up the old sequence so handling them seems useful.
>
> This also drops the testsuite changes.  Palmer's change makes them
> unnecessary.

OK, so we'll just go with that one assuming it passes the tests?  I 
don't really care a ton either way, I was mostly just interested in the 
sign extension stuff as we've had so many issues there that I don't know 
how to solve.  So I figured I'd poke around to see if there was anything 
interesting going on, but it was pretty boring.

>
> Waiting on pre-commit CI before taking any further action...
>
> Jeff
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index a38cb72f09f..39489c4377e 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -4218,11 +4218,29 @@ riscv_noce_conversion_profitable_p (rtx_insn *seq,
>  	      riscv_if_info.original_cost += COSTS_N_INSNS (1);
>  	      riscv_if_info.max_seq_cost += COSTS_N_INSNS (1);
>  	    }
> -	  last_dest = NULL_RTX;
> +
>  	  rtx dest = SET_DEST (x);
> -	  if (COMPARISON_P (src)
> +
> +	  /* Do something similar for the  moves that are likely to
> +	     turn into NOP moves by the time the register allocator is
> +	     done.  These are also side effects of how our sCC expanders
> +	     work.  We'll want to check and update LAST_DEST here too.  */
> +	  if (last_dest
>  	      && REG_P (dest)
> -	      && GET_MODE (dest) == SImode)
> +	      && GET_MODE (dest) == SImode
> +	      && SUBREG_P (src)
> +	      && SUBREG_PROMOTED_VAR_P (src)
> +	      && REGNO (SUBREG_REG (src)) == REGNO (last_dest))
> +	    {
> +	      riscv_if_info.original_cost += COSTS_N_INSNS (1);
> +	      riscv_if_info.max_seq_cost += COSTS_N_INSNS (1);
> +	      if (last_dest)
> +		last_dest = dest;
> +	    }
> +	  else
> +	    last_dest = NULL_RTX;
> +
> +	  if (COMPARISON_P (src) && REG_P (dest))
>  	    last_dest = dest;
>  	}
>        else
> @@ -4904,13 +4922,31 @@ riscv_expand_int_scc (rtx target, enum rtx_code code, rtx op0, rtx op1, bool *in
>    riscv_extend_comparands (code, &op0, &op1);
>    op0 = force_reg (word_mode, op0);
>
> +  /* For sub-word targets on rv64, do the computation in DImode
> +     then extract the lowpart for the final target, marking it
> +     as sign extended.  Note that it's also properly zero extended,
> +     but it's probably more profitable to expose it as sign extended.  */
> +  rtx t;
> +  if (TARGET_64BIT && GET_MODE (target) == SImode)
> +    t = gen_reg_rtx (DImode);
> +  else
> +    t = target;
> +
>    if (code == EQ || code == NE)
>      {
>        rtx zie = riscv_zero_if_equal (op0, op1);
> -      riscv_emit_binary (code, target, zie, const0_rtx);
> +      riscv_emit_binary (code, t, zie, const0_rtx);
>      }
>    else
> -    riscv_emit_int_order_test (code, invert_ptr, target, op0, op1);
> +    riscv_emit_int_order_test (code, invert_ptr, t, op0, op1);
> +
> +  if (t != target)
> +    {
> +      t = gen_lowpart (SImode, t);
> +      SUBREG_PROMOTED_VAR_P (t) = 1;
> +      SUBREG_PROMOTED_SET (t, SRP_SIGNED);
> +      emit_move_insn (target, t);
> +    }
>  }
>
>  /* Like riscv_expand_int_scc, but for floating-point comparisons.  */
  
Jeff Law Sept. 5, 2024, 7:23 p.m. UTC | #2
On 9/5/24 12:46 PM, Palmer Dabbelt wrote:
> On Thu, 05 Sep 2024 11:03:18 PDT (-0700), jeffreyalaw@gmail.com wrote:
>>
>> So the first patch failed the pre-commit CI; it didn't fail in my
>> testing because I'm using --with-arch to set a default configuration
>> that includes things like zicond to ensure that's always tested.  And
>> the failing test is skipped when zicond is enabled by default.
>>
>> The failing test is designed to ensure that we don't miss an
>> if-conversion due to costing issues around the extension that was
>> typically done in an sCC sequence (which is why it's only run when
>> zicond is off).
>>
>> The test failed because we have a little routine that is highly
>> dependent on the code generated by the sCC expander and will adjust the
>> costing to account for expansion quirks that usually go away in register
>> allocation.
>>
>>
>> That code needs to be enhanced to work after the sCC expansion change.
>> Essentially it needs to account for the subreg extraction that shows up
>> in the sequence as well as being a bit looser on mode checking.
>>
>> I kept the code working for the old sequences -- in theory a user could
>> conjure up the old sequence so handling them seems useful.
>>
>> This also drops the testsuite changes.  Palmer's change makes them
>> unnecessary.
> 
> OK, so we'll just go with that one assuming it passes the tests? 
That's the plan.  I pushed your change last night, so I just need a 
clean run on my change now (fingers crossed).



> don't really care a ton either way, I was mostly just interested in the 
> sign extension stuff as we've had so many issues there that I don't know 
> how to solve.  So I figured I'd poke around to see if there was anything 
> interesting going on, but it was pretty boring.
There's still "stuff" in this space, but it's of less and less of a concern.

Extensions are typically less than 1% of our dynamic instruction stream 
for specint these days.   The worst cases are 502.gcc where extensions 
vary from 1% - 1.3% of the dynamic stream and 557.xz where they range 
from 1.2% - 1.4% of the dynamic instruction stream.

If it weren't for the measurable real performance regression we saw 
internally on x264 I wouldn't have been looking in this space at all. 
Finding the nugget for sCC expansion was just a bit of frosting from 
that effort.

As far as "stuff" goes.  There's probably on the order of 2b unnecessary 
extensions in 541.leela.  I haven't chased that down yet -- it 
represents a tiny fraction of the dynamic count.  Whatever it is, it was 
caught by the REP_MODE_EXTENDED bits from VRULL and isn't by any of the 
other mechanisms we have in place right now.


Jeff
  

Patch

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index a38cb72f09f..39489c4377e 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -4218,11 +4218,29 @@  riscv_noce_conversion_profitable_p (rtx_insn *seq,
 	      riscv_if_info.original_cost += COSTS_N_INSNS (1);
 	      riscv_if_info.max_seq_cost += COSTS_N_INSNS (1);
 	    }
-	  last_dest = NULL_RTX;
+
 	  rtx dest = SET_DEST (x);
-	  if (COMPARISON_P (src)
+
+	  /* Do something similar for the  moves that are likely to
+	     turn into NOP moves by the time the register allocator is
+	     done.  These are also side effects of how our sCC expanders
+	     work.  We'll want to check and update LAST_DEST here too.  */
+	  if (last_dest
 	      && REG_P (dest)
-	      && GET_MODE (dest) == SImode)
+	      && GET_MODE (dest) == SImode
+	      && SUBREG_P (src)
+	      && SUBREG_PROMOTED_VAR_P (src)
+	      && REGNO (SUBREG_REG (src)) == REGNO (last_dest))
+	    {
+	      riscv_if_info.original_cost += COSTS_N_INSNS (1);
+	      riscv_if_info.max_seq_cost += COSTS_N_INSNS (1);
+	      if (last_dest)
+		last_dest = dest;
+	    }
+	  else
+	    last_dest = NULL_RTX;
+
+	  if (COMPARISON_P (src) && REG_P (dest))
 	    last_dest = dest;
 	}
       else
@@ -4904,13 +4922,31 @@  riscv_expand_int_scc (rtx target, enum rtx_code code, rtx op0, rtx op1, bool *in
   riscv_extend_comparands (code, &op0, &op1);
   op0 = force_reg (word_mode, op0);
 
+  /* For sub-word targets on rv64, do the computation in DImode
+     then extract the lowpart for the final target, marking it
+     as sign extended.  Note that it's also properly zero extended,
+     but it's probably more profitable to expose it as sign extended.  */
+  rtx t;
+  if (TARGET_64BIT && GET_MODE (target) == SImode)
+    t = gen_reg_rtx (DImode);
+  else
+    t = target;
+
   if (code == EQ || code == NE)
     {
       rtx zie = riscv_zero_if_equal (op0, op1);
-      riscv_emit_binary (code, target, zie, const0_rtx);
+      riscv_emit_binary (code, t, zie, const0_rtx);
     }
   else
-    riscv_emit_int_order_test (code, invert_ptr, target, op0, op1);
+    riscv_emit_int_order_test (code, invert_ptr, t, op0, op1);
+
+  if (t != target)
+    {
+      t = gen_lowpart (SImode, t);
+      SUBREG_PROMOTED_VAR_P (t) = 1;
+      SUBREG_PROMOTED_SET (t, SRP_SIGNED);
+      emit_move_insn (target, t);
+    }
 }
 
 /* Like riscv_expand_int_scc, but for floating-point comparisons.  */