[x86] PR rtl-optimization/107991: peephole2 to tweak register allocation.

Message ID 011401d9243b$3782ce10$a6886a30$@nextmovesoftware.com
State New
Headers
Series [x86] PR rtl-optimization/107991: peephole2 to tweak register allocation. |

Commit Message

Roger Sayle Jan. 9, 2023, 3:01 p.m. UTC
  This patch addresses PR rtl-optimization/107991, which is a P2 regression
where GCC currently requires more "mov" instructions than GCC 7.

The x86's two address ISA creates some interesting challenges for reload.
For example, the tricky "x = y - x" usually needs to be implemented on x86
as

        tmp = x
        x = y
        x -= tmp

where a scratch register and two mov's are required to work around
the lack of a subf (subtract from) or rsub (reverse subtract) insn.

Not uncommonly, if y is dead after this subtraction, register allocation
can be improved by clobbering y.

        y -= x
        x = y

For the testcase in PR 107991, things are slightly more complicated,
where y is not itself dead, but is assigned from (i.e. equivalent to)
a value that is dead.  Hence we have something like:

        y = z
        x = y - x

so, GCC's reload currently generates the expected shuffle (as y is live):

        y = z
        tmp = x
        x = y
        x -= tmp

but we can use a peephole2 that understands that y and z are equivalent,
and that z is dead, to produce the shorter sequence:

        y = z
        z -= x
        x = z

In practice, for the new testcase from PR 107991, which before produced:

foo:    movl    %edx, %ecx
        movl    %esi, %edx
        movl    %esi, %eax
        subl    %ecx, %edx
        testb   %dil, %dil
        cmovne  %edx, %eax
        ret

with this patch/peephole2 we now produce the much improved:

foo:    movl    %esi, %eax
        subl    %edx, %esi
        testb   %dil, %dil
        cmovne  %esi, %eax
        ret


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2023-01-09  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR rtl-optimization/107991
        * config/i386/i386.md (peephole2): New peephole2 to avoid register
        shuffling before a subtraction, after a register-to-register move.

gcc/testsuite/ChangeLog
        PR rtl-optimization/107991
        * gcc.target/i386/pr107991.c: New test case.


Thanks in advance,
Roger
--
  

Comments

Uros Bizjak Jan. 9, 2023, 4:02 p.m. UTC | #1
On Mon, Jan 9, 2023 at 4:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch addresses PR rtl-optimization/107991, which is a P2 regression
> where GCC currently requires more "mov" instructions than GCC 7.
>
> The x86's two address ISA creates some interesting challenges for reload.
> For example, the tricky "x = y - x" usually needs to be implemented on x86
> as
>
>         tmp = x
>         x = y
>         x -= tmp
>
> where a scratch register and two mov's are required to work around
> the lack of a subf (subtract from) or rsub (reverse subtract) insn.
>
> Not uncommonly, if y is dead after this subtraction, register allocation
> can be improved by clobbering y.
>
>         y -= x
>         x = y
>
> For the testcase in PR 107991, things are slightly more complicated,
> where y is not itself dead, but is assigned from (i.e. equivalent to)
> a value that is dead.  Hence we have something like:
>
>         y = z
>         x = y - x
>
> so, GCC's reload currently generates the expected shuffle (as y is live):
>
>         y = z
>         tmp = x
>         x = y
>         x -= tmp
>
> but we can use a peephole2 that understands that y and z are equivalent,
> and that z is dead, to produce the shorter sequence:
>
>         y = z
>         z -= x
>         x = z
>
> In practice, for the new testcase from PR 107991, which before produced:
>
> foo:    movl    %edx, %ecx
>         movl    %esi, %edx
>         movl    %esi, %eax
>         subl    %ecx, %edx
>         testb   %dil, %dil
>         cmovne  %edx, %eax
>         ret
>
> with this patch/peephole2 we now produce the much improved:
>
> foo:    movl    %esi, %eax
>         subl    %edx, %esi
>         testb   %dil, %dil
>         cmovne  %esi, %eax
>         ret
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  Ok for mainline?

Looking at the PR, it looks to me that Richard S (CC'd) wants to solve
this issue in the register allocator. This would be preferred
(compared to a very specialized peephole2), since peephole2 pass comes
very late in the game, so one freed register does not contribute to
lower the register pressure at all.

Peephole2 should be used to clean after reload only in rare cases when
target ISA prevents generic solution. From your description, a generic
solution would benefit all targets with destructive subtraction (or
perhaps also for other noncommutative operations).

So, please coordinate with Richard S regarding this issue.

Thanks,
Uros.

>
>
> 2023-01-09  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR rtl-optimization/107991
>         * config/i386/i386.md (peephole2): New peephole2 to avoid register
>         shuffling before a subtraction, after a register-to-register move.
>
> gcc/testsuite/ChangeLog
>         PR rtl-optimization/107991
>         * gcc.target/i386/pr107991.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
  
Richard Sandiford Jan. 10, 2023, 10:48 a.m. UTC | #2
Uros Bizjak <ubizjak@gmail.com> writes:
> On Mon, Jan 9, 2023 at 4:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>>
>>
>> This patch addresses PR rtl-optimization/107991, which is a P2 regression
>> where GCC currently requires more "mov" instructions than GCC 7.
>>
>> The x86's two address ISA creates some interesting challenges for reload.
>> For example, the tricky "x = y - x" usually needs to be implemented on x86
>> as
>>
>>         tmp = x
>>         x = y
>>         x -= tmp
>>
>> where a scratch register and two mov's are required to work around
>> the lack of a subf (subtract from) or rsub (reverse subtract) insn.
>>
>> Not uncommonly, if y is dead after this subtraction, register allocation
>> can be improved by clobbering y.
>>
>>         y -= x
>>         x = y
>>
>> For the testcase in PR 107991, things are slightly more complicated,
>> where y is not itself dead, but is assigned from (i.e. equivalent to)
>> a value that is dead.  Hence we have something like:
>>
>>         y = z
>>         x = y - x
>>
>> so, GCC's reload currently generates the expected shuffle (as y is live):
>>
>>         y = z
>>         tmp = x
>>         x = y
>>         x -= tmp
>>
>> but we can use a peephole2 that understands that y and z are equivalent,
>> and that z is dead, to produce the shorter sequence:
>>
>>         y = z
>>         z -= x
>>         x = z
>>
>> In practice, for the new testcase from PR 107991, which before produced:
>>
>> foo:    movl    %edx, %ecx
>>         movl    %esi, %edx
>>         movl    %esi, %eax
>>         subl    %ecx, %edx
>>         testb   %dil, %dil
>>         cmovne  %edx, %eax
>>         ret
>>
>> with this patch/peephole2 we now produce the much improved:
>>
>> foo:    movl    %esi, %eax
>>         subl    %edx, %esi
>>         testb   %dil, %dil
>>         cmovne  %esi, %eax
>>         ret
>>
>>
>> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
>> and make -k check, both with and without --target_board=unix{-m32},
>> with no new failures.  Ok for mainline?
>
> Looking at the PR, it looks to me that Richard S (CC'd) wants to solve
> this issue in the register allocator. This would be preferred
> (compared to a very specialized peephole2), since peephole2 pass comes
> very late in the game, so one freed register does not contribute to
> lower the register pressure at all.

Yeah, I think there are three issues that would all be good to fix:

(1) the fwprop regression
(2) the unhelpful pre-RA move
(3) the RA handling of what it sees now

I had a look last year to see where (1) was coming from, and it turned
out to be bad bookkeeping during a recursive walk.  I've got a couple of
competing ideas for how to fix it, but I've not had time to work on it
since then, sorry.

> Peephole2 should be used to clean after reload only in rare cases when
> target ISA prevents generic solution. From your description, a generic
> solution would benefit all targets with destructive subtraction (or
> perhaps also for other noncommutative operations).

Yeah, agree that peepholes aren't the best fit here.  The problem could
occur with instructions that are too far apart to be peepholed.

Thanks,
Richard

> So, please coordinate with Richard S regarding this issue.
>
> Thanks,
> Uros.
>
>>
>>
>> 2023-01-09  Roger Sayle  <roger@nextmovesoftware.com>
>>
>> gcc/ChangeLog
>>         PR rtl-optimization/107991
>>         * config/i386/i386.md (peephole2): New peephole2 to avoid register
>>         shuffling before a subtraction, after a register-to-register move.
>>
>> gcc/testsuite/ChangeLog
>>         PR rtl-optimization/107991
>>         * gcc.target/i386/pr107991.c: New test case.
>>
>>
>> Thanks in advance,
>> Roger
>> --
>>
  
Roger Sayle Jan. 10, 2023, 3:01 p.m. UTC | #3
Hi Richard and Uros,
I believe I've managed to reduce a minimal test case that exhibits the
underlying
problem with reload.   The following snippet when compiled on x86-64 with
-O2:

void ext(int x);
void foo(int x, int y) { ext(y - x); }

produces the following 5 instructions prior to reload:
insn 13: r86:SI=di:SI				// REG_DEAD di:SI
insn 14: r87:SI=si:SI				// REG_READ si:SI
insn 7: {r85:SI=r87:SI-r86:SI;clobber flags:CC;}	// REG_DEAD r86:SI,
r87:SI
insn 8: di:SI=r85:SI				// REG_READ r85:SI
insn 9: call [`ext'] argc:0

Hence there are three pseudos (allocnos) to be register allocated; r85, r86
& r87.

Currently, reload produces the following assignments/colouring using 3 hard
regs.
r85 in di
r86 in ax
r87 in si

A better (optimal) register allocation requires only 2 hard regs.
r85 in di
r86 in si
r87 in di

Fortunately, this over-allocation is cleaned up later (during
cprop_hardreg), but
as pointed out by Uros, there's little benefit in reducing register pressure
this
late (after peephole2).

As far as I understand it, Richard's patch to handle fully-tied destinations
looks
very reasonable (and is impressively tested/benchmarked):
https://gcc.gnu.org/pipermail/gcc-patches/2019-September/530743.html
but in the prototypical 0:"=r", 1:"0", 2:"r" constraint case, as used in the
problematic subsi3_1 pattern (of insn 7), I'm trying to figure out why r85
and r87 don't get allocated to the same register [given the local spilling
of non-eliminable hard regs in insn 7, temporarily introducing a new pseudo
r89].

In closing, reload is a complex piece of code that's shared between a large
number of backends; if Richard's patch is a win "statistically", then it's
not unreasonable to use a peephole2 to clean-up/catch the corner cases
on class_likely_spilled_p targets [indeed many of the peephole2s in i386.md
tidy up register allocation issues], and such a "specialized" fix is more
suitable
for stage 3, than a potentially disruptive tweak to reload.  At worst, the
peephole2 becomes dead if/when the problem is fixed upstream.

Or put another way, if reload worked perfectly, i386.md wouldn't need
many of the peephole2s that it currently has.  Oh, for such an ideal world.

I hope this helps.
Cheers,
Roger
--

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: 10 January 2023 10:48
> To: Uros Bizjak <ubizjak@gmail.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Roger Sayle
> <roger@nextmovesoftware.com>
> Subject: Re: [x86 PATCH] PR rtl-optimization/107991: peephole2 to tweak
> register allocation.
> 
> Uros Bizjak <ubizjak@gmail.com> writes:
> > On Mon, Jan 9, 2023 at 4:01 PM Roger Sayle
> <roger@nextmovesoftware.com> wrote:
> >>
> >>
> >> This patch addresses PR rtl-optimization/107991, which is a P2
> >> regression where GCC currently requires more "mov" instructions than
GCC 7.
> >>
> >> The x86's two address ISA creates some interesting challenges for
reload.
> >> For example, the tricky "x = y - x" usually needs to be implemented
> >> on x86 as
> >>
> >>         tmp = x
> >>         x = y
> >>         x -= tmp
> >>
> >> where a scratch register and two mov's are required to work around
> >> the lack of a subf (subtract from) or rsub (reverse subtract) insn.
> >>
> >> Not uncommonly, if y is dead after this subtraction, register
> >> allocation can be improved by clobbering y.
> >>
> >>         y -= x
> >>         x = y
> >>
> >> For the testcase in PR 107991, things are slightly more complicated,
> >> where y is not itself dead, but is assigned from (i.e. equivalent to)
> >> a value that is dead.  Hence we have something like:
> >>
> >>         y = z
> >>         x = y - x
> >>
> >> so, GCC's reload currently generates the expected shuffle (as y is
live):
> >>
> >>         y = z
> >>         tmp = x
> >>         x = y
> >>         x -= tmp
> >>
> >> but we can use a peephole2 that understands that y and z are
> >> equivalent, and that z is dead, to produce the shorter sequence:
> >>
> >>         y = z
> >>         z -= x
> >>         x = z
> >>
> >> In practice, for the new testcase from PR 107991, which before
produced:
> >>
> >> foo:    movl    %edx, %ecx
> >>         movl    %esi, %edx
> >>         movl    %esi, %eax
> >>         subl    %ecx, %edx
> >>         testb   %dil, %dil
> >>         cmovne  %edx, %eax
> >>         ret
> >>
> >> with this patch/peephole2 we now produce the much improved:
> >>
> >> foo:    movl    %esi, %eax
> >>         subl    %edx, %esi
> >>         testb   %dil, %dil
> >>         cmovne  %esi, %eax
> >>         ret
> >>
> >>
> >> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> >> and make -k check, both with and without --target_board=unix{-m32},
> >> with no new failures.  Ok for mainline?
> >
> > Looking at the PR, it looks to me that Richard S (CC'd) wants to solve
> > this issue in the register allocator. This would be preferred
> > (compared to a very specialized peephole2), since peephole2 pass comes
> > very late in the game, so one freed register does not contribute to
> > lower the register pressure at all.
> 
> Yeah, I think there are three issues that would all be good to fix:
> 
> (1) the fwprop regression
> (2) the unhelpful pre-RA move
> (3) the RA handling of what it sees now
> 
> I had a look last year to see where (1) was coming from, and it turned out
to be
> bad bookkeeping during a recursive walk.  I've got a couple of competing
ideas
> for how to fix it, but I've not had time to work on it since then, sorry.
> 
> > Peephole2 should be used to clean after reload only in rare cases when
> > target ISA prevents generic solution. From your description, a generic
> > solution would benefit all targets with destructive subtraction (or
> > perhaps also for other noncommutative operations).
> 
> Yeah, agree that peepholes aren't the best fit here.  The problem could
occur
> with instructions that are too far apart to be peepholed.
> 
> Thanks,
> Richard
> 
> > So, please coordinate with Richard S regarding this issue.
> >
> > Thanks,
> > Uros.
> >
> >>
> >>
> >> 2023-01-09  Roger Sayle  <roger@nextmovesoftware.com>
> >>
> >> gcc/ChangeLog
> >>         PR rtl-optimization/107991
> >>         * config/i386/i386.md (peephole2): New peephole2 to avoid
register
> >>         shuffling before a subtraction, after a register-to-register
move.
> >>
> >> gcc/testsuite/ChangeLog
> >>         PR rtl-optimization/107991
> >>         * gcc.target/i386/pr107991.c: New test case.
> >>
> >>
> >> Thanks in advance,
> >> Roger
> >> --
> >>
  
Uros Bizjak Jan. 11, 2023, 9:28 a.m. UTC | #4
On Tue, Jan 10, 2023 at 4:01 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Richard and Uros,
> I believe I've managed to reduce a minimal test case that exhibits the
> underlying
> problem with reload.   The following snippet when compiled on x86-64 with
> -O2:
>
> void ext(int x);
> void foo(int x, int y) { ext(y - x); }
>
> produces the following 5 instructions prior to reload:
> insn 13: r86:SI=di:SI                           // REG_DEAD di:SI
> insn 14: r87:SI=si:SI                           // REG_READ si:SI
> insn 7: {r85:SI=r87:SI-r86:SI;clobber flags:CC;}        // REG_DEAD r86:SI,
> r87:SI
> insn 8: di:SI=r85:SI                            // REG_READ r85:SI
> insn 9: call [`ext'] argc:0
>
> Hence there are three pseudos (allocnos) to be register allocated; r85, r86
> & r87.
>
> Currently, reload produces the following assignments/colouring using 3 hard
> regs.
> r85 in di
> r86 in ax
> r87 in si
>
> A better (optimal) register allocation requires only 2 hard regs.
> r85 in di
> r86 in si
> r87 in di
>
> Fortunately, this over-allocation is cleaned up later (during
> cprop_hardreg), but
> as pointed out by Uros, there's little benefit in reducing register pressure
> this
> late (after peephole2).
>
> As far as I understand it, Richard's patch to handle fully-tied destinations
> looks
> very reasonable (and is impressively tested/benchmarked):
> https://gcc.gnu.org/pipermail/gcc-patches/2019-September/530743.html
> but in the prototypical 0:"=r", 1:"0", 2:"r" constraint case, as used in the
> problematic subsi3_1 pattern (of insn 7), I'm trying to figure out why r85
> and r87 don't get allocated to the same register [given the local spilling
> of non-eliminable hard regs in insn 7, temporarily introducing a new pseudo
> r89].
>
> In closing, reload is a complex piece of code that's shared between a large
> number of backends; if Richard's patch is a win "statistically", then it's
> not unreasonable to use a peephole2 to clean-up/catch the corner cases
> on class_likely_spilled_p targets [indeed many of the peephole2s in i386.md
> tidy up register allocation issues], and such a "specialized" fix is more
> suitable
> for stage 3, than a potentially disruptive tweak to reload.  At worst, the
> peephole2 becomes dead if/when the problem is fixed upstream.
>
> Or put another way, if reload worked perfectly, i386.md wouldn't need
> many of the peephole2s that it currently has.  Oh, for such an ideal world.

I have benchmarked the new peephole a bit and during the build of
linux kernel and during the whole gcc bootstrap, it didn't trigger
even once. It looks to me that the compiler produces the problematic
sequence only for specially crafted testcases, when argument setup is
involved. These testcases expose a minor annoyance with the reload
(which IMO should be fixed in the reload and not papered over with a
peephole).

Technically, the pattern is OK, but it really doesn't bring much to
the table. OTOH, the pattern is simple enough that it won't hurt if we
have another specialized pattern in the .md file. I'll leave the
decision to you.

Uros.
  

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 76f55ec..3090cea 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -7603,6 +7603,31 @@ 
   "sub{l}\t{%2, %1|%1, %2}"
   [(set_attr "type" "alu")
    (set_attr "mode" "SI")])
+
+;; PR 107991: Use peephole2 to avoid suffling before subtraction.
+;; ax = si; cx = dx; dx = ax; dx -= cx where both si and cx
+;; are dead becomes ax = si; si -= dx; dx = si.
+(define_peephole2
+  [(set (match_operand:SWI 0 "general_reg_operand")
+	(match_operand:SWI 1 "general_reg_operand"))
+   (set (match_operand:SWI 2 "general_reg_operand")
+	(match_operand:SWI 3 "general_reg_operand"))
+   (set (match_dup 3) (match_dup 0))
+   (parallel
+     [(set (match_dup 3) (minus:SWI (match_dup 3) (match_dup 2)))
+      (clobber (reg:CC FLAGS_REG))])]
+  "REGNO (operands[0]) != REGNO (operands[1])
+   && REGNO (operands[0]) != REGNO (operands[2])
+   && REGNO (operands[0]) != REGNO (operands[3])
+   && REGNO (operands[1]) != REGNO (operands[2])
+   && REGNO (operands[1]) != REGNO (operands[3])
+   && REGNO (operands[2]) != REGNO (operands[3])
+   && peep2_reg_dead_p (1, operands[1])
+   && peep2_reg_dead_p (4, operands[2])"
+  [(set (match_dup 0) (match_dup 1))
+   (parallel [(set (match_dup 1) (minus:SWI (match_dup 1) (match_dup 3)))
+	      (clobber (reg:CC FLAGS_REG))])
+   (set (match_dup 3) (match_dup 1))])
 
 ;; Add with carry and subtract with borrow
 
diff --git a/gcc/testsuite/gcc.target/i386/pr107991.c b/gcc/testsuite/gcc.target/i386/pr107991.c
new file mode 100644
index 0000000..9d0d9b6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr107991.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2" } */
+
+int foo(_Bool b, int i, int j) {
+    return b ? i - j : i;
+}
+
+int bar(_Bool b, int i, int j) {
+    return i + (b ? -j : 0);
+}
+
+int baz(_Bool b, int i, int j) {
+    return i - (b ? j : 0);
+}
+
+/* { dg-final { scan-assembler-times "movl" 3 } } */