[x86,PING] Peephole pand;pxor into pandn

Message ID 00f501d86e81$5c48c8a0$14da59e0$@nextmovesoftware.com
State New

Commit Message

Roger Sayle May 23, 2022, 8:44 a.m. UTC
  This is a ping of a patch from April (a dependency of another stage1 patch):
https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593123.html

This patch has been refreshed/retested against gcc 13 trunk on
x86_64-pc-linux-gnu with make bootstrap and make -k check,
both with and without --target_board=unix{-m32}, with no new failures.
Ok for mainline?

2022-05-23  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/i386/sse.md (peephole2): Convert suitable pand followed
	by pxor into pandn, i.e. (X&Y)^X into X & ~Y.

Many thanks in advance,
Roger
--
  

Comments

Uros Bizjak May 23, 2022, 8:50 a.m. UTC | #1
On Mon, May 23, 2022 at 10:44 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This is a ping of a patch from April (a dependency of another stage1 patch):
> https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593123.html
>
> This patch has been refreshed/retested against gcc 13 trunk on
> x86_64-pc-linux-gnu with make bootstrap and make -k check,
> both with and without --target_board=unix{-m32}, with no new failures.
> Ok for mainline?

I think this should be handled in a pre-reload splitter (or perhaps
combine splitter). We have so many variants of SSE/AVX logic
instructions that the transform after reload barely makes sense
(please see the number of regno checks in the proposed patch).

Uros.

  
Roger Sayle May 23, 2022, 8:59 a.m. UTC | #2
Hi Uros,

Thanks for the speedy review.  The point of this patch is that (with
pending changes to STV) the pand;pxor sequence isn't created until
after combine, and hence doesn't/won't get caught by any of the
current pre-reload/combine splitters.


  
Uros Bizjak May 23, 2022, 9:11 a.m. UTC | #3
On Mon, May 23, 2022 at 10:59 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
>
> Thanks for the speedy review.  The point of this patch is that (with
> pending changes to STV) the pand;pxor sequence isn't created until
> after combine, and hence doesn't/won't get caught by any of the
> current pre-reload/combine splitters.

IMO this happens due to inconsistencies between the integer and vector
instruction sets, where integer andn is absent without BMI. However, we
don't re-run combine after reload, and I don't think it is worth
reimplementing it via peephole2 patterns. Please note that AVX allows
many more combinations that are not caught by your patch, and
considering that combine already does the transformation, I don't see
a compelling reason for this very specialized peephole2.

Let's keep the patch shelved until a testcase shows the benefits of the patch.

Uros.

  
Roger Sayle May 23, 2022, 10:48 a.m. UTC | #4
Hi Uros,
Hopefully, if I explain even more of the context, you'll better understand why
this harmless (and at worst seemingly redundant) peephole2 is actually critical
for addressing significant regressions in the compiler without introducing new
testsuite failures.  I wouldn't ask (again) if I didn't feel it's important.

Basically, I'm trying to unblock Hongtao's patch (for PR target/104610),
which, as you explained in your own review, is better handled by/during STV:
https://gcc.gnu.org/pipermail/gcc-patches/2022-May/594070.html

Unfortunately, my STV patch (which I want to ping next), which fixes
the P2 code quality regression PR target/70321, is itself blocked by another
review of yours:
https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593200.html
where this fix (alone) leads to a regression of the test case pr65105-5.c.

This pending regression has nothing to do with TARGET_BMI's andn, but
the idiom "if ((x & y) != y)" on ia32, where x and y are DImode, and stv/reload
has decided to place these values in SSE registers.

After combine we have an *anddi3_doubleword and *cmpdi_doubleword:
(insn 22 21 23 4 (parallel [
            (set (reg:DI 97)
                (and:DI (reg/v:DI 92 [ p2 ])
                    (reg:DI 88 [ _25 ])))
            (clobber (reg:CC 17 flags))
        ]) "pr65105-5.c":20:18 530 {*anddi3_doubleword}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(insn 23 22 24 4 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg/v:DI 92 [ p2 ])
            (reg:DI 97))) "pr65105-5.c":20:8 29 {*cmpdi_doubleword}
     (expr_list:REG_DEAD (reg:DI 97)
        (nil)))

After STV we have:
(insn 22 21 45 4 (set (subreg:V2DI (reg:DI 97) 0)
        (and:V2DI (subreg:V2DI (reg/v:DI 92 [ p2 ]) 0)
            (subreg:V2DI (reg:DI 88 [ _25 ]) 0))) "pr65105-5.c":20:18 6640 {*andv2di3}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(insn 45 22 46 4 (set (reg:V2DI 103)
        (xor:V2DI (subreg:V2DI (reg/v:DI 92 [ p2 ]) 0)
            (subreg:V2DI (reg:DI 97) 0))) "pr65105-5.c":20:8 -1
     (nil))
(insn 46 45 23 4 (set (reg:V2DI 103)
        (vec_select:V2DI (vec_concat:V4DI (reg:V2DI 103)
                (reg:V2DI 103))
            (parallel [
                    (const_int 0 [0])
                    (const_int 2 [0x2])
                ]))) "pr65105-5.c":20:8 -1
     (nil))
(insn 23 46 24 4 (set (reg:CC 17 flags)
        (unspec:CC [
                (reg:V2DI 103) repeated x2
            ] UNSPEC_PTEST)) "pr65105-5.c":20:8 7425 {sse4_1_ptestv2di}
     (expr_list:REG_DEAD (reg:DI 97)
        (nil)))

where the XOR has been introduced to implement the equality,
as P == Q is effectively implemented as (P ^ Q) == 0.  At this point,
the only remaining pass that can optimize the pand followed by
the pxor is peephole2.
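
The equivalence relied on here can be checked in plain C.  The following
is only a sketch: the helper names (and_then_xor, andn, differs_via_xor,
check_identity) are hypothetical, and 64-bit scalars stand in for the
V2DI vector values; it is not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* The sequence shown after STV: pand followed by pxor, i.e. (x & y) ^ x.  */
uint64_t and_then_xor (uint64_t x, uint64_t y)
{
  return (x & y) ^ x;
}

/* The single operation the peephole2 aims for: pandn, i.e. x & ~y.  */
uint64_t andn (uint64_t x, uint64_t y)
{
  return x & ~y;
}

/* The comparison (x & y) != x, expressed as an XOR tested against zero.  */
int differs_via_xor (uint64_t x, uint64_t y)
{
  return and_then_xor (x, y) != 0;
}

/* Both forms keep exactly the bits of x that are clear in y.  */
void check_identity (uint64_t x, uint64_t y)
{
  assert (and_then_xor (x, y) == andn (x, y));
}
```

Calling check_identity on any pair of values should never trip the
assertion, since each bit of x survives only where the matching bit of y
is zero in both forms.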

The requirement to optimize this is from gcc.target/i386/pr65105-5.c
where the desired implementation is explicitly looking for pandn+ptest:

/* { dg-do compile { target ia32 } } */
/* { dg-options "-O2 -march=core-avx2 -mno-stackrealign" } */
/* { dg-final { scan-assembler "pandn" } } */
/* { dg-final { scan-assembler "pxor" } } */
/* { dg-final { scan-assembler "ptest" } } */
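
For readers without the testsuite at hand, the idiom in question has
roughly the following shape.  This is a hypothetical reduction written
for illustration (the function name count_missing and its signature are
assumptions), not the contents of pr65105-5.c.

```c
/* Hypothetical example of the "(x & y) != y" idiom on 64-bit values.
   On ia32 each long long is DImode, so the AND and the comparison are
   candidates for being carried out in SSE registers.  */
int count_missing (const long long *p, int n, long long mask)
{
  int missing = 0;
  for (int i = 0; i < n; i++)
    if ((p[i] & mask) != mask)   /* some bit of mask is clear in p[i] */
      missing++;
  return missing;
}
```

Whether a given compiler version actually emits pandn and ptest for such
a function depends on the options and on the pending STV changes, so the
sketch only illustrates the source-level pattern.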


Confusingly, I've even more patches in the queue/backlog for this part
of the compiler (it's an air traffic control problem, fallout from stage 4).

And of course, very many thanks for the various andn related patches
that have already been approved/committed to the backend, to avoid
potential regressions related to code size (-Os and -Oz).  It's a long road
with many steps.

Might you reconsider?  Pretty  please?
Roger
--

  
Uros Bizjak May 23, 2022, 11:28 a.m. UTC | #5
On Mon, May 23, 2022 at 12:49 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Hopefully, if I explain even more of the context, you'll better understand why
> this harmless (and at worst seemingly redundant) peephole2 is actually critical
> for addressing significant regressions in the compiler without introducing new
> testsuite failures.  I wouldn't ask (again) if I didn't feel it's important.
>
> Basically, I'm trying to unblock Hongtao's patch (for PR target/104610),
> which, as you explained in your own review, is better handled by/during STV:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/594070.html
>
> Unfortunately, my STV patch (which I want to ping next), which fixes
> the P2 code quality regression PR target/70321, is itself blocked by another
> review of yours:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593200.html
> where this fix (alone) leads to a regression of the test case pr65105-5.c.

Is it possible to start with the STV patch? If it introduces only a few
regressions, we can afford them at this stage of development and fix
them later with follow-up patches. This way, it is much easier for me
to see the effect of the patch, and its benefit can be weighed
appropriately. I was indeed under the impression that we were trying to
peephole a combination that appears once in a blue moon, but if the
situation appears regularly, this is a completely different matter.

> This pending regression has nothing to do with TARGET_BMI's andn, but
> the idiom "if ((x & y) != y)" on ia32, where x and y are DImode, and stv/reload
> has decided to place these values in SSE registers.
>
> After combine we have an *anddi3_doubleword and *cmpdi_doubleword:
> (insn 22 21 23 4 (parallel [
>             (set (reg:DI 97)
>                 (and:DI (reg/v:DI 92 [ p2 ])
>                     (reg:DI 88 [ _25 ])))
>             (clobber (reg:CC 17 flags))
>         ]) "pr65105-5.c":20:18 530 {*anddi3_doubleword}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 23 22 24 4 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg/v:DI 92 [ p2 ])
>             (reg:DI 97))) "pr65105-5.c":20:8 29 {*cmpdi_doubleword}
>      (expr_list:REG_DEAD (reg:DI 97)
>         (nil)))

One possible approach is to introduce an intermediate compound (but
non-existent) instruction that is created by the combine pass and is
later split into real instructions. But a real testcase is needed so
that the correct strategy is used.

> After STV we have:
> (insn 22 21 45 4 (set (subreg:V2DI (reg:DI 97) 0)
>         (and:V2DI (subreg:V2DI (reg/v:DI 92 [ p2 ]) 0)
>             (subreg:V2DI (reg:DI 88 [ _25 ]) 0))) "pr65105-5.c":20:18 6640 {*andv2di3}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 45 22 46 4 (set (reg:V2DI 103)
>         (xor:V2DI (subreg:V2DI (reg/v:DI 92 [ p2 ]) 0)
>             (subreg:V2DI (reg:DI 97) 0))) "pr65105-5.c":20:8 -1
>      (nil))
> (insn 46 45 23 4 (set (reg:V2DI 103)
>         (vec_select:V2DI (vec_concat:V4DI (reg:V2DI 103)
>                 (reg:V2DI 103))
>             (parallel [
>                     (const_int 0 [0])
>                     (const_int 2 [0x2])
>                 ]))) "pr65105-5.c":20:8 -1
>      (nil))
> (insn 23 46 24 4 (set (reg:CC 17 flags)
>         (unspec:CC [
>                 (reg:V2DI 103) repeated x2
>             ] UNSPEC_PTEST)) "pr65105-5.c":20:8 7425 {sse4_1_ptestv2di}
>      (expr_list:REG_DEAD (reg:DI 97)
>         (nil)))
>
> where the XOR has been introduced to implement the equality,
> as P == Q is effectively implemented as (P ^ Q) == 0.  At this point,
> the only remaining pass that can optimize the pand followed by
> the pxor is peephole2.
>
> The requirement to optimize this is from gcc.target/i386/pr65105-5.c
> where the desired implementation is explicitly looking for pandn+ptest:
>
> /* { dg-do compile { target ia32 } } */
> /* { dg-options "-O2 -march=core-avx2 -mno-stackrealign" } */
> /* { dg-final { scan-assembler "pandn" } } */
> /* { dg-final { scan-assembler "pxor" } } */
> /* { dg-final { scan-assembler "ptest" } } */
>
>
> Confusingly, I've even more patches in the queue/backlog for this part
> of the compiler (it's an air traffic control problem, fallout from stage 4).
>
> And of course, very many thanks for the various andn related patches
> that have already been approved/committed to the backend, to avoid
> potential regressions related to code size (-Os and -Oz).  It's a long road
> with many steps.
>
> Might you reconsider?  Pretty  please?

No problem for me, but the testcase would really help.

Uros.
  
Uros Bizjak May 23, 2022, 12:16 p.m. UTC | #6
On Mon, May 23, 2022 at 12:49 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Hopefully, if I explain even more of the context, you'll better understand why
> this harmless (and at worst seemingly redundant) peephole2 is actually critical
> for addressing significant regressions in the compiler without introducing new
> testsuite failures.  I wouldn't ask (again) if I didn't feel it's important.
>
> Basically, I'm trying to unblock Hongtao's patch (for PR target/104610),
> which, as you explained in your own review, is better handled by/during STV:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/594070.html
>
> Unfortunately, my STV patch (which I want to ping next), which fixes
> the P2 code quality regression PR target/70321, is itself blocked by another
> review of yours:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593200.html
> where this fix (alone) leads to a regression of the test case pr65105-5.c.
>
> This pending regression has nothing to do with TARGET_BMI's andn, but
> the idiom "if ((x & y) != y)" on ia32, where x and y are DImode, and stv/reload
> has decided to place these values in SSE registers.
>
> After combine we have an *anddi3_doubleword and *cmpdi_doubleword:
> (insn 22 21 23 4 (parallel [
>             (set (reg:DI 97)
>                 (and:DI (reg/v:DI 92 [ p2 ])
>                     (reg:DI 88 [ _25 ])))
>             (clobber (reg:CC 17 flags))
>         ]) "pr65105-5.c":20:18 530 {*anddi3_doubleword}
>      (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (nil)))
> (insn 23 22 24 4 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg/v:DI 92 [ p2 ])
>             (reg:DI 97))) "pr65105-5.c":20:8 29 {*cmpdi_doubleword}
>      (expr_list:REG_DEAD (reg:DI 97)
>         (nil)))

But originally, during combine we have (pr65105-5.c):

Trying 22 -> 23:
   22: {r97:DI=r92:DI&r88:DI;clobber flags:CC;}
      REG_UNUSED flags:CC
   23: {r98:DI=r92:DI^r97:DI;clobber flags:CC;}
      REG_DEAD r97:DI
      REG_UNUSED flags:CC
Successfully matched this instruction:
(parallel [
        (set (reg:DI 98)
            (and:DI (not:DI (reg:DI 88 [ _25 ]))
                (reg/v:DI 92 [ p2 ])))
        (clobber (reg:CC 17 flags))
    ])
allowing combination of insns 22 and 23
original costs 8 + 8 = 16
replacement cost 16
deferring deletion of insn with uid = 22.
modifying insn i3    23: {r98:DI=~r88:DI&r92:DI;clobber flags:CC;}
      REG_UNUSED flags:CC
deferring rescan insn with uid = 23.

so combine is creating:

(insn 23 22 24 4 (parallel [
            (set (reg:DI 98)
                (and:DI (not:DI (reg:DI 88 [ _25 ]))
                    (reg/v:DI 92 [ p2 ])))
            (clobber (reg:CC 17 flags))
        ]) "pr65105-5.c":20:8 552 {*andndi3_doubleword}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

Why is this no longer the case with your patch?

Uros.
  

Patch

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 191371b..4203fe0 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17021,6 +17021,44 @@ 
 			(match_dup 2)))]
   "operands[3] = gen_reg_rtx (<MODE>mode);")
 
+;; Combine pand;pxor into pandn.  (X&Y)^X -> X & ~Y.
+(define_peephole2
+  [(set (match_operand:VMOVE 0 "register_operand")
+	(and:VMOVE (match_operand:VMOVE 1 "register_operand")
+		   (match_operand:VMOVE 2 "register_operand")))
+   (set (match_operand:VMOVE 3 "register_operand")
+	(xor:VMOVE (match_operand:VMOVE 4 "register_operand")
+		   (match_operand:VMOVE 5 "register_operand")))]
+  "TARGET_SSE
+   && REGNO (operands[1]) != REGNO (operands[2])
+   && REGNO (operands[4]) != REGNO (operands[5])
+   && (REGNO (operands[0]) == REGNO (operands[3])
+       || peep2_reg_dead_p (2, operands[0]))"
+  [(set (match_dup 3)
+	(and:VMOVE (not:VMOVE (match_dup 6)) (match_dup 7)))]
+{
+  if (REGNO (operands[0]) != REGNO (operands[1])
+      && ((REGNO (operands[4]) == REGNO (operands[0])
+	   && REGNO (operands[5]) == REGNO (operands[1]))
+	  || (REGNO (operands[4]) == REGNO (operands[1])
+	      && REGNO (operands[5]) == REGNO (operands[0]))))
+    {
+      operands[6] = operands[2];
+      operands[7] = operands[1];
+    }
+  else if (REGNO (operands[0]) != REGNO (operands[2])
+	   && ((REGNO (operands[4]) == REGNO (operands[0])
+		&& REGNO (operands[5]) == REGNO (operands[2]))
+	       || (REGNO (operands[4]) == REGNO (operands[2])
+		   && REGNO (operands[5]) == REGNO (operands[0]))))
+    {
+      operands[6] = operands[1];
+      operands[7] = operands[2];
+    }
+  else
+    FAIL;
+})
+
 (define_insn "*andnot<mode>3_mask"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
 	(vec_merge:VI48_AVX512VL