[x86] PR rtl-optimization/96692: ((A|B)^C)^A using andn with -mbmi.

Message ID 025701d88954$f281ca40$d7855ec0$@nextmovesoftware.com
State New
Headers
Series [x86] PR rtl-optimization/96692: ((A|B)^C)^A using andn with -mbmi. |

Commit Message

Roger Sayle June 26, 2022, 12:04 p.m. UTC
  This patch addresses PR rtl-optimization/96692 on x86_64, by providing
a define_split for combine to convert the three operation ((A|B)^C)^D
into a two operation sequence using andn when either A or B is the same
register as C or D.  This is essentially a reassociation problem that's
only a win if the target supports an and-not instruction (as with -mbmi).

Hence for the new test case:

int f(int a, int b, int c)
{
    return (a ^ b) ^ (a | c);
}

GCC on x86_64-pc-linux-gnu wth -O2 -mbmi would previously generate:

        xorl    %edi, %esi
        orl     %edx, %edi
        movl    %esi, %eax
        xorl    %edi, %eax
        ret

but with this patch now generates:

        andn    %edx, %edi, %eax
        xorl    %esi, %eax
        ret

I'll investigate whether this optimization can also be implemented
more generically in simplify_rtx when the backend provides
accurate rtx_costs for "(and (not ..." (as there's no optab for
andn).

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-06-26  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	PR rtl-optimization/96692
	* config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D
	as (X & ~Y) ^ Z on target BMI when either C or D is A or B.

gcc/testsuite/ChangeLog
	PR rtl-optimization/96692
	* gcc.target/i386/bmi-andn-4.c: New test case.


Thanks in advance,
Roger
--
  

Comments

Uros Bizjak June 26, 2022, 5:07 p.m. UTC | #1
On Sun, Jun 26, 2022 at 2:04 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch addresses PR rtl-optimization/96692 on x86_64, by providing
> a define_split for combine to convert the three operation ((A|B)^C)^D
> into a two operation sequence using andn when either A or B is the same
> register as C or D.  This is essentially a reassociation problem that's
> only a win if the target supports an and-not instruction (as with -mbmi).
>
> Hence for the new test case:
>
> int f(int a, int b, int c)
> {
>     return (a ^ b) ^ (a | c);
> }
>
> GCC on x86_64-pc-linux-gnu wth -O2 -mbmi would previously generate:
>
>         xorl    %edi, %esi
>         orl     %edx, %edi
>         movl    %esi, %eax
>         xorl    %edi, %eax
>         ret
>
> but with this patch now generates:
>
>         andn    %edx, %edi, %eax
>         xorl    %esi, %eax
>         ret
>
> I'll investigate whether this optimization can also be implemented
> more generically in simplify_rtx when the backend provides
> accurate rtx_costs for "(and (not ..." (as there's no optab for
> andn).
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  Ok for mainline?
>
>
> 2022-06-26  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR rtl-optimization/96692
>         * config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D
>         as (X & ~Y) ^ Z on target BMI when either C or D is A or B.
>
> gcc/testsuite/ChangeLog
>         PR rtl-optimization/96692
>         * gcc.target/i386/bmi-andn-4.c: New test case.

+  "TARGET_BMI
+   && ix86_pre_reload_split ()
+   && (rtx_equal_p (operands[1], operands[3])
+       || rtx_equal_p (operands[1], operands[4])
+       || (REG_P (operands[2])
+   && (rtx_equal_p (operands[2], operands[3])
+       || rtx_equal_p (operands[2], operands[4]))))"

You don't need a ix86_pre_reload_split for combine splitter*

OTOH, please split the pattern to two for each commutative operand and
use (match_dup x) instead. Something similar to [1].

*combine splitter is described in the documentation as the splitter
pattern that does *not* match any existing insn pattern.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596804.html

Uros.
  
Segher Boessenkool June 27, 2022, 3:03 p.m. UTC | #2
Hi!

On Sun, Jun 26, 2022 at 07:07:35PM +0200, Uros Bizjak via Gcc-patches wrote:
> On Sun, Jun 26, 2022 at 2:04 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> > I'll investigate whether this optimization can also be implemented
> > more generically in simplify_rtx when the backend provides
> > accurate rtx_costs for "(and (not ..." (as there's no optab for
> > andn).

"Accurate rtx_cost" is a fallacy.  insn_cost can be seen as somewhat
accurate (in some simplified model of reality), but rtx_cost always
misses too much context to be useful as anything but a rough estimate.

> *combine splitter is described in the documentation as the splitter
> pattern that does *not* match any existing insn pattern.

Yeah, that documentation is somewhat misleading, in both directions :-(

combine can and will and does use any splitter, whether or not there is
an insn with that insn pattern.  If there is an insn that passes recog()
combine will not even try a split, but this is not the same thing at
all: if the insn conditions are not met, the insn does not recog().  It
is quite common to have the same instruction pattern with different
insn conditions.

In the other direction, other passes can call split_insns as well, of
course.  Nothing currently does, but that does not change much :-)


Segher
  
Roger Sayle July 4, 2022, 5:27 p.m. UTC | #3
Hi Uros,
Thanks for the review.  This patch implements all of your suggestions, both
removing ix86_pre_reload_split from the combine splitter(s), and dividing
the original splitter up into four simpler variants, that use match_dup to 
handle the variants/permutations caused by operator commutativity.

This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32} with no
new failures.  Ok for mainline?

2022-07-04  Roger Sayle  <roger@nextmovesoftware.com>
            UroŇ° Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
        PR rtl-optimization/96692
        * config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D
        as (X & ~Y) ^ Z on target BMI when either C or D is A or B.

gcc/testsuite/ChangeLog
        PR rtl-optimization/96692
        * gcc.target/i386/bmi-andn-4.c: New test case.


Thanks again,
Roger
--

> -----Original Message-----
> From: Uros Bizjak <ubizjak@gmail.com>
> Sent: 26 June 2022 18:08
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86 PATCH] PR rtl-optimization/96692: ((A|B)^C)^A using andn with
> -mbmi.
> 
> On Sun, Jun 26, 2022 at 2:04 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> >
> >
> > This patch addresses PR rtl-optimization/96692 on x86_64, by providing
> > a define_split for combine to convert the three operation ((A|B)^C)^D
> > into a two operation sequence using andn when either A or B is the
> > same register as C or D.  This is essentially a reassociation problem
> > that's only a win if the target supports an and-not instruction (as with -mbmi).
> >
> > Hence for the new test case:
> >
> > int f(int a, int b, int c)
> > {
> >     return (a ^ b) ^ (a | c);
> > }
> >
> > GCC on x86_64-pc-linux-gnu wth -O2 -mbmi would previously generate:
> >
> >         xorl    %edi, %esi
> >         orl     %edx, %edi
> >         movl    %esi, %eax
> >         xorl    %edi, %eax
> >         ret
> >
> > but with this patch now generates:
> >
> >         andn    %edx, %edi, %eax
> >         xorl    %esi, %eax
> >         ret
> >
> > I'll investigate whether this optimization can also be implemented
> > more generically in simplify_rtx when the backend provides accurate
> > rtx_costs for "(and (not ..." (as there's no optab for andn).
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-06-26  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR rtl-optimization/96692
> >         * config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D
> >         as (X & ~Y) ^ Z on target BMI when either C or D is A or B.
> >
> > gcc/testsuite/ChangeLog
> >         PR rtl-optimization/96692
> >         * gcc.target/i386/bmi-andn-4.c: New test case.
> 
> +  "TARGET_BMI
> +   && ix86_pre_reload_split ()
> +   && (rtx_equal_p (operands[1], operands[3])
> +       || rtx_equal_p (operands[1], operands[4])
> +       || (REG_P (operands[2])
> +   && (rtx_equal_p (operands[2], operands[3])
> +       || rtx_equal_p (operands[2], operands[4]))))"
> 
> You don't need a ix86_pre_reload_split for combine splitter*
> 
> OTOH, please split the pattern to two for each commutative operand and use
> (match_dup x) instead. Something similar to [1].
> 
> *combine splitter is described in the documentation as the splitter pattern that
> does *not* match any existing insn pattern.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596804.html
> 
> Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 20c3b9a..d114754 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -10522,6 +10522,82 @@
    (set (match_dup 0) (match_op_dup 1
                         [(and:SI (match_dup 3) (match_dup 2))
 			 (const_int 0)]))])
+
+;; Variant 1 of 4: Split ((A | B) ^ A) ^ C as (B & ~A) ^ C.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand")
+	(xor:SWI48
+	   (xor:SWI48
+	      (ior:SWI48 (match_operand:SWI48 1 "register_operand")
+			 (match_operand:SWI48 2 "nonimmediate_operand"))
+	      (match_dup 1))
+	   (match_operand:SWI48 3 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI"
+  [(parallel
+      [(set (match_dup 4) (and:SWI48 (not:SWI48 (match_dup 1)) (match_dup 2)))
+       (clobber (reg:CC FLAGS_REG))])
+   (parallel
+      [(set (match_dup 0) (xor:SWI48 (match_dup 4) (match_dup 3)))
+       (clobber (reg:CC FLAGS_REG))])]
+  "operands[4] = gen_reg_rtx (<MODE>mode);")
+
+;; Variant 2 of 4: Split ((A | B) ^ B) ^ C as (A & ~B) ^ C.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand")
+	(xor:SWI48
+	   (xor:SWI48
+	      (ior:SWI48 (match_operand:SWI48 1 "register_operand")
+			 (match_operand:SWI48 2 "register_operand"))
+	      (match_dup 2))
+	   (match_operand:SWI48 3 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI"
+  [(parallel
+      [(set (match_dup 4) (and:SWI48 (not:SWI48 (match_dup 2)) (match_dup 1)))
+       (clobber (reg:CC FLAGS_REG))])
+   (parallel
+      [(set (match_dup 0) (xor:SWI48 (match_dup 4) (match_dup 3)))
+       (clobber (reg:CC FLAGS_REG))])]
+  "operands[4] = gen_reg_rtx (<MODE>mode);")
+
+;; Variant 3 of 4: Split ((A | B) ^ C) ^ A as (B & ~A) ^ C.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand")
+	(xor:SWI48
+	   (xor:SWI48
+	      (ior:SWI48 (match_operand:SWI48 1 "register_operand")
+			 (match_operand:SWI48 2 "nonimmediate_operand"))
+	      (match_operand:SWI48 3 "nonimmediate_operand"))
+	   (match_dup 1)))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI"
+  [(parallel
+      [(set (match_dup 4) (and:SWI48 (not:SWI48 (match_dup 1)) (match_dup 2)))
+       (clobber (reg:CC FLAGS_REG))])
+   (parallel
+      [(set (match_dup 0) (xor:SWI48 (match_dup 4) (match_dup 3)))
+       (clobber (reg:CC FLAGS_REG))])]
+  "operands[4] = gen_reg_rtx (<MODE>mode);")
+
+;; Variant 4 of 4: Split ((A | B) ^ C) ^ B as (A & ~B) ^ C.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand")
+	(xor:SWI48
+	   (xor:SWI48
+	      (ior:SWI48 (match_operand:SWI48 1 "register_operand")
+			 (match_operand:SWI48 2 "register_operand"))
+	      (match_operand:SWI48 3 "nonimmediate_operand"))
+	   (match_dup 2)))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI"
+  [(parallel
+      [(set (match_dup 4) (and:SWI48 (not:SWI48 (match_dup 2)) (match_dup 1)))
+       (clobber (reg:CC FLAGS_REG))])
+   (parallel
+      [(set (match_dup 0) (xor:SWI48 (match_dup 4) (match_dup 3)))
+       (clobber (reg:CC FLAGS_REG))])]
+  "operands[4] = gen_reg_rtx (<MODE>mode);")
 
 ;; Logical inclusive and exclusive OR instructions
 
diff --git a/gcc/testsuite/gcc.target/i386/bmi-andn-4.c b/gcc/testsuite/gcc.target/i386/bmi-andn-4.c
new file mode 100644
index 0000000..fb89529
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/bmi-andn-4.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mbmi" } */
+
+int f(int a, int b, int c)
+{
+    return (a ^ b) ^ (a | c);
+}
+
+/* { dg-final { scan-assembler "andn\[ \\t\]+" } } */
  
Uros Bizjak July 4, 2022, 6:51 p.m. UTC | #4
On Mon, Jul 4, 2022 at 7:27 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Thanks for the review.  This patch implements all of your suggestions, both
> removing ix86_pre_reload_split from the combine splitter(s), and dividing
> the original splitter up into four simpler variants, that use match_dup to
> handle the variants/permutations caused by operator commutativity.
>
> This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32} with no
> new failures.  Ok for mainline?
>
> 2022-07-04  Roger Sayle  <roger@nextmovesoftware.com>
>             UroŇ° Bizjak  <ubizjak@gmail.com>
>
> gcc/ChangeLog
>         PR rtl-optimization/96692
>         * config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D
>         as (X & ~Y) ^ Z on target BMI when either C or D is A or B.
>
> gcc/testsuite/ChangeLog
>         PR rtl-optimization/96692
>         * gcc.target/i386/bmi-andn-4.c: New test case.

OK.

Thanks,
Uros.

>
> Thanks again,
> Roger
> --
>
> > -----Original Message-----
> > From: Uros Bizjak <ubizjak@gmail.com>
> > Sent: 26 June 2022 18:08
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [x86 PATCH] PR rtl-optimization/96692: ((A|B)^C)^A using andn with
> > -mbmi.
> >
> > On Sun, Jun 26, 2022 at 2:04 PM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > This patch addresses PR rtl-optimization/96692 on x86_64, by providing
> > > a define_split for combine to convert the three operation ((A|B)^C)^D
> > > into a two operation sequence using andn when either A or B is the
> > > same register as C or D.  This is essentially a reassociation problem
> > > that's only a win if the target supports an and-not instruction (as with -mbmi).
> > >
> > > Hence for the new test case:
> > >
> > > int f(int a, int b, int c)
> > > {
> > >     return (a ^ b) ^ (a | c);
> > > }
> > >
> > > GCC on x86_64-pc-linux-gnu wth -O2 -mbmi would previously generate:
> > >
> > >         xorl    %edi, %esi
> > >         orl     %edx, %edi
> > >         movl    %esi, %eax
> > >         xorl    %edi, %eax
> > >         ret
> > >
> > > but with this patch now generates:
> > >
> > >         andn    %edx, %edi, %eax
> > >         xorl    %esi, %eax
> > >         ret
> > >
> > > I'll investigate whether this optimization can also be implemented
> > > more generically in simplify_rtx when the backend provides accurate
> > > rtx_costs for "(and (not ..." (as there's no optab for andn).
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32},
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2022-06-26  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         PR rtl-optimization/96692
> > >         * config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D
> > >         as (X & ~Y) ^ Z on target BMI when either C or D is A or B.
> > >
> > > gcc/testsuite/ChangeLog
> > >         PR rtl-optimization/96692
> > >         * gcc.target/i386/bmi-andn-4.c: New test case.
> >
> > +  "TARGET_BMI
> > +   && ix86_pre_reload_split ()
> > +   && (rtx_equal_p (operands[1], operands[3])
> > +       || rtx_equal_p (operands[1], operands[4])
> > +       || (REG_P (operands[2])
> > +   && (rtx_equal_p (operands[2], operands[3])
> > +       || rtx_equal_p (operands[2], operands[4]))))"
> >
> > You don't need a ix86_pre_reload_split for combine splitter*
> >
> > OTOH, please split the pattern to two for each commutative operand and use
> > (match_dup x) instead. Something similar to [1].
> >
> > *combine splitter is described in the documentation as the splitter pattern that
> > does *not* match any existing insn pattern.
> >
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596804.html
> >
> > Uros.
  

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 3093cb5..27dddaf 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -10525,6 +10525,57 @@ 
    (set (match_dup 0) (match_op_dup 1
                         [(and:SI (match_dup 3) (match_dup 2))
 			 (const_int 0)]))])
+
+;; Split ((A | B) ^ C) ^ D as (X & ~Y) ^ Z.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand")
+	(xor:SWI48
+	   (xor:SWI48
+	      (ior:SWI48 (match_operand:SWI48 1 "register_operand")
+			 (match_operand:SWI48 2 "nonimmediate_operand"))
+	      (match_operand:SWI48 3 "nonimmediate_operand"))
+	   (match_operand:SWI48 4 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI
+   && ix86_pre_reload_split ()
+   && (rtx_equal_p (operands[1], operands[3])
+       || rtx_equal_p (operands[1], operands[4])
+       || (REG_P (operands[2])
+	   && (rtx_equal_p (operands[2], operands[3])
+	       || rtx_equal_p (operands[2], operands[4]))))"
+  [(parallel
+      [(set (match_dup 5) (and:SWI48 (not:SWI48 (match_dup 6)) (match_dup 7)))
+       (clobber (reg:CC FLAGS_REG))])
+   (parallel
+      [(set (match_dup 0) (xor:SWI48 (match_dup 5) (match_dup 8)))
+       (clobber (reg:CC FLAGS_REG))])]
+{
+  operands[5] = gen_reg_rtx (<MODE>mode);
+  if (rtx_equal_p (operands[1], operands[3]))
+    {
+      operands[6] = operands[1];
+      operands[7] = operands[2];
+      operands[8] = operands[4];
+    }
+  else if (rtx_equal_p (operands[1], operands[4]))
+    {
+      operands[6] = operands[1];
+      operands[7] = operands[2];
+      operands[8] = operands[3];
+    }
+  else if (rtx_equal_p (operands[2], operands[3]))
+    {
+      operands[6] = operands[2];
+      operands[7] = operands[1];
+      operands[8] = operands[4];
+    }
+  else
+    {
+      operands[6] = operands[2];
+      operands[7] = operands[1];
+      operands[8] = operands[3];
+    }
+})
 
 ;; Logical inclusive and exclusive OR instructions
 
diff --git a/gcc/testsuite/gcc.target/i386/bmi-andn-4.c b/gcc/testsuite/gcc.target/i386/bmi-andn-4.c
new file mode 100644
index 0000000..fb89529
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/bmi-andn-4.c
@@ -0,0 +1,9 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mbmi" } */
+
+int f(int a, int b, int c)
+{
+    return (a ^ b) ^ (a | c);
+}
+
+/* { dg-final { scan-assembler "andn\[ \\t\]+" } } */