[rs6000] Improve constant integer multiply using rldimi.

Message ID 006101d8899f$295807b0$7c081710$@nextmovesoftware.com
State New
Headers
Series [rs6000] Improve constant integer multiply using rldimi. |

Commit Message

Roger Sayle June 26, 2022, 8:56 p.m. UTC
  This patch tweaks the code generated on POWER for integer multiplications

by a constant, by making use of rldimi instructions.  Much like x86's

lea instruction, rldimi can be used to implement a shift and add pair

in some circumstances.  For rldimi this is when the shifted operand

is known to have no bits in common with the added operand.

 

Hence for the new testcase below:

 

int foo(int x)

{

  int t = x & 42;

  return t * 0x2001;

}

 

when compiled with -O2, GCC currently generates:

 

        andi. 3,3,0x2a

        slwi 9,3,13

        add 3,9,3

        extsw 3,3

        blr

 

with this patch, we now generate:

 

        andi. 3,3,0x2a

        rlwimi 3,3,13,0,31-13

        extsw 3,3

        blr

 

It turns out this optimization already exists in the form of a combine

splitter in rs6000.md, but the constraints on combine splitters,

requiring three of four input instructions (and generating one or two

output instructions) mean it doesn't get applied as often as it could.

This patch converts the define_split into a define_insn_and_split to

catch more cases (such as the one above).

 

The one bit that's tricky/controversial is the use of RTL's

nonzero_bits which is accurate during the combine pass when this

pattern is first recognized, but not as advanced (not kept up to

date) when this pattern is eventually split.  To support this,

I've used a "|| reload_completed" idiom.  Does this approach seem

reasonable? [I've another patch of x86 that uses the same idiom].

 

This patch has been tested on powerpc64le-unknown-linux-gnu with

make bootstrap and make -k check with no new failures.

Ok for mainline?

 

 

2022-06-26  Roger Sayle  <roger@nextmovesoftware.com>

 

gcc/ChangeLog

* config/rs6000/rs6000.md (*rotl<mode>3_insert_3b_<code>): New

define_insn_and_split created from exisiting define_split.

 

gcc/testsuite/ChangeLog

* gcc.target/powerpc/rldimi-3.c: New test case.

 

 

Thanks in advance,

Roger

--
  

Comments

Kewen.Lin June 27, 2022, 9:04 a.m. UTC | #1
Hi Roger,

on 2022/6/27 04:56, Roger Sayle wrote:
>  
> 
> This patch tweaks the code generated on POWER for integer multiplications
> 
> by a constant, by making use of rldimi instructions.  Much like x86's
> 
> lea instruction, rldimi can be used to implement a shift and add pair
> 
> in some circumstances.  For rldimi this is when the shifted operand
> 
> is known to have no bits in common with the added operand.
> 
>  
> 
> Hence for the new testcase below:
> 
>  
> 
> int foo(int x)
> 
> {
> 
>   int t = x & 42;
> 
>   return t * 0x2001;
> 
> }
> 
>  
> 
> when compiled with -O2, GCC currently generates:
> 
>  
> 
>         andi. 3,3,0x2a
> 
>         slwi 9,3,13
> 
>         add 3,9,3
> 
>         extsw 3,3
> 
>         blr
> 
>  
> 
> with this patch, we now generate:
> 
>  
> 
>         andi. 3,3,0x2a
> 
>         rlwimi 3,3,13,0,31-13
> 
>         extsw 3,3
> 
>         blr
> 
>  
> 
> It turns out this optimization already exists in the form of a combine
> 
> splitter in rs6000.md, but the constraints on combine splitters,
> 
> requiring three of four input instructions (and generating one or two
> 
> output instructions) mean it doesn't get applied as often as it could.
> 
> This patch converts the define_split into a define_insn_and_split to
> 
> catch more cases (such as the one above).
> 
>  
> 
> The one bit that's tricky/controversial is the use of RTL's
> 
> nonzero_bits which is accurate during the combine pass when this
> 
> pattern is first recognized, but not as advanced (not kept up to
> 
> date) when this pattern is eventually split.  To support this,
> 
> I've used a "|| reload_completed" idiom.  Does this approach seem
> 
> reasonable? [I've another patch of x86 that uses the same idiom].
> 
>  

I tested this patch on powerpc64-linux-gnu, it caused the below ICE
against test case gcc/testsuite/gcc.c-torture/compile/pr93098.c.

gcc/testsuite/gcc.c-torture/compile/pr93098.c: In function ‘foo’:
gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: error: unrecognizable insn:
(insn 104 32 34 2 (set (reg:SI 185 [+4 ])
        (ior:SI (and:SI (reg:SI 200 [+4 ])
                (const_int 4294967295 [0xffffffff]))
            (ashift:SI (reg:SI 140)
                (const_int 32 [0x20])))) "gcc/testsuite/gcc.c-torture/compile/pr93098.c":6:11 -1
     (nil))
during RTL pass: subreg3
dump file: pr93098.c.291r.subreg3
gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: internal compiler error: in extract_insn, at recog.cc:2791
0x101f664b _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
        gcc/rtl-error.cc:108
0x101f6697 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
        gcc/rtl-error.cc:116
0x10ae427f extract_insn(rtx_insn*)
        gcc/recog.cc:2791
0x11b239bb decompose_multiword_subregs
        gcc/lower-subreg.cc:1555
0x11b25013 execute
        gcc/lower-subreg.cc:1818

The above trace shows we fails to recog the pattern again due to
the inaccurate nonzero_bits information as you pointed out above.

There was another patch [1] which wasn't on trunk but touched this
same define_split, not sure if that can help or we can follow the
similar idea.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-December/585841.html

BR,
Kewen
  
Segher Boessenkool June 27, 2022, 2:38 p.m. UTC | #2
Hi!

[Something is wrong with your mail program.  It puts white lines
everywhere, and prefixes a space to the existing white lines.]

On Sun, Jun 26, 2022 at 09:56:07PM +0100, Roger Sayle wrote:
> It turns out this optimization already exists in the form of a combine
> splitter in rs6000.md, but the constraints on combine splitters,
> requiring three of four input instructions (and generating one or two
> output instructions) mean it doesn't get applied as often as it could.

There are no such constraints.  Or, it is more complicated, it depends
on your viewpoint :-)

Combine works on two to four related insns.  It combines those insns to
one (and one only) instruction, hopefully in canonical form.  If that
insn is not recognised by recog combine tries to split up this
instruction.  It has multiple strategies to do that.  One is to try a
define_split, which indeed is only done if there are at least three
input insns.  This is left that way even after 2->2 combines: firstly
because various backends depend on this, but also because it would cause
a lot of code movement without improvement.

> This patch converts the define_split into a define_insn_and_split to
> catch more cases (such as the one above).
>  
> The one bit that's tricky/controversial is the use of RTL's
> nonzero_bits which is accurate during the combine pass when this
> pattern is first recognized, but not as advanced (not kept up to
> date) when this pattern is eventually split.  To support this,
> I've used a "|| reload_completed" idiom.  Does this approach seem
> reasonable? [I've another patch of x86 that uses the same idiom].

No, this does not work.  All passes after combine and until reload has
completed can then get a nonzero_bits that is a subset of the one
combine saw, and thus, the insn no longer matches (as Ke Wen has
encountered.  Often you do not see this in your testing, but almost
always someone will report an ICE within days!)

[Btw, you posted the patch as quoted-printable, which means other
people can not apply the patch.]


Segher
  

Patch

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 090dbcf..b8aada32 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -4209,13 +4209,19 @@ 
 
 (define_code_iterator plus_ior_xor [plus ior xor])
 
-(define_split
-  [(set (match_operand:GPR 0 "gpc_reg_operand")
-	(plus_ior_xor:GPR (ashift:GPR (match_operand:GPR 1 "gpc_reg_operand")
-				      (match_operand:SI 2 "const_int_operand"))
-			  (match_operand:GPR 3 "gpc_reg_operand")))]
-  "nonzero_bits (operands[3], <MODE>mode)
-   < HOST_WIDE_INT_1U << INTVAL (operands[2])"
+(define_insn_and_split "*rotl<mode>3_insert_3b_<code>"
+  [(set (match_operand:GPR 0 "gpc_reg_operand" "=r")
+	(plus_ior_xor:GPR
+	  (ashift:GPR (match_operand:GPR 1 "gpc_reg_operand" "r")
+		      (match_operand:SI 2 "const_int_operand" "n"))
+	  (match_operand:GPR 3 "gpc_reg_operand" "0")))]
+  "INTVAL (operands[2]) > 0
+   && INTVAL (operands[2]) < 64
+   && ((nonzero_bits (operands[3], <MODE>mode)
+	< HOST_WIDE_INT_1U << INTVAL (operands[2]))
+       || reload_completed)"
+  "#"
+  "&& 1"
   [(set (match_dup 0)
 	(ior:GPR (and:GPR (match_dup 3)
 			  (match_dup 4))
diff --git a/gcc/testsuite/gcc.target/powerpc/rldimi-3.c b/gcc/testsuite/gcc.target/powerpc/rldimi-3.c
new file mode 100644
index 0000000..80ff86e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/rldimi-3.c
@@ -0,0 +1,9 @@ 
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2" } */
+
+int foo(int x)
+{
+  int t = x & 42;
+  return t * 0x2001;
+}
+/* { dg-final { scan-assembler {\mrldimi\M} } } */