[x86] PR target/92578: Peephole2s to tweak cmove register allocation.
Commit Message
This patch addresses a (minor) missed-optimization regression revealed
by Richard Biener's example/variant in comment #1 of PR target/92578.
int foo(int moves, int movecnt, int komove) {
int newcnt = movecnt;
if (moves == komove)
newcnt -= 2;
return newcnt;
}
Comparing code generation on godbolt.org shows an interesting evolution
over time, as changes in register allocation affect the cmove sequence.
GCC 4.1.2 (4 instructions, suboptimal mov after cmov).
leal -2(%rsi), %eax
cmpl %edx, %edi
cmove %eax, %esi
movl %esi, %eax
GCC 4.4-4.7 (3 instructions, optimal)
leal -2(%rsi), %eax
cmpl %edx, %edi
cmovne %esi, %eax
GCC 5-7 (4 instructions, suboptimal mov before cmov)
leal -2(%rsi), %ecx
movl %esi, %eax
cmpl %edx, %edi
cmove %ecx, %eax
GCC 8 (4 instructions, suboptimal mov before cmov, reordered)
movl %esi, %eax
leal -2(%rsi), %ecx
cmpl %edx, %edi
cmove %ecx, %eax
GCC 9-trunk (5 instructions, two suboptimal movs before cmov)
movl %edx, %ecx
movl %esi, %eax
leal -2(%rsi), %edx
cmpl %ecx, %edi
cmove %edx, %eax
The challenge is that x86's two operand conditional moves, that require
the destination to be one of the (register) sources, are tricky for reload,
whose heuristics unify pseudos early (greedily?). In this case, we have
the equivalent of "pseudo1 = cond ? pseudo2 : expression", and we'd like
to see "pseudo1 = expression; pseudo1 = cond ? pseudo1 : pseudo2", but
alas reload (currently and quite reasonably) prefers to place pseudo1 and
pseudo2 in the same hard register if possible. Hence the solution is to
fixup/tweak the register allocation during peephole2, as previously with
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575998.html
Instead of a single peephole2 to catch just the current idiom (last above),
I've added the four peephole2s that would catch each of the (historical)
suboptimal variants above and transform them into the ideal 3 insn form.
Instrumenting the compiler shows, for example, that the (earliest) movl
after cmov triggers over 50 times during stage2 of a GCC bootstrap.
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}, with
no new failures. Ok for mainline? Or if this regression isn't serious
enough for stage4 (or these patterns considered too risky), for stage1
when it reopens? I suspect the poor interaction between cmove usage
and register allocation is one source of confusion when comparing code
generation with vs. without cmove (the other major source of confusion
being that well-predicted branches are free, but that prediction-quality
is poorly predictable).
2022-04-25 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR target/92578
* config/i386/i386.md (peephole2): Eliminate register-to-register
moves by inverting the condition of a conditional move.
gcc/testsuite/ChangeLog
PR target/92578
* gcc.target/i386/pr92758.c: New test case.
Thanks in advance,
Roger
--
Comments
On Mon, Apr 25, 2022 at 1:16 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch addresses a (minor) missed-optimization regression revealed
> by Richard Biener's example/variant in comment #1 of PR target/92578.
>
> int foo(int moves, int movecnt, int komove) {
> int newcnt = movecnt;
> if (moves == komove)
> newcnt -= 2;
> return newcnt;
> }
>
> Comparing code generation on godbolt.org shows an interesting evolution
> over time, as changes in register allocation affect the cmove sequence.
>
> GCC 4.1.2 (4 instructions, suboptimal mov after cmov).
> leal -2(%rsi), %eax
> cmpl %edx, %edi
> cmove %eax, %esi
> movl %esi, %eax
>
> GCC 4.4-4.7 (3 instructions, optimal)
> leal -2(%rsi), %eax
> cmpl %edx, %edi
> cmovne %esi, %eax
>
> GCC 5-7 (4 instructions, suboptimal mov before cmov)
> leal -2(%rsi), %ecx
> movl %esi, %eax
> cmpl %edx, %edi
> cmove %ecx, %eax
>
> GCC 8 (4 instructions, suboptimal mov before cmov, reordered)
> movl %esi, %eax
> leal -2(%rsi), %ecx
> cmpl %edx, %edi
> cmove %ecx, %eax
>
> GCC 9-trunk (5 instructions, two suboptimal movs before cmov)
> movl %edx, %ecx
> movl %esi, %eax
> leal -2(%rsi), %edx
> cmpl %ecx, %edi
> cmove %edx, %eax
>
> The challenge is that x86's two operand conditional moves, that require
> the destination to be one of the (register) sources, are tricky for reload,
> whose heuristics unify pseudos early (greedily?). In this case, we have
> the equivalent of "pseudo1 = cond ? pseudo2 : expression", and we'd like
> to see "pseudo1 = expression; pseudo1 = cond ? pseudo1 : pseudo2", but
> alas reload (currently and quite reasonably) prefers to place pseudo1 and
> pseudo2 in the same hard register if possible. Hence the solution is to
> fixup/tweak the register allocation during peephole2, as previously with
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575998.html
>
> Instead of a single peephole2 to catch just the current idiom (last above),
> I've added the four peephole2s that would catch each of the (historical)
> suboptimal variants above and transform them into the ideal 3 insn form.
> Instrumenting the compiler shows, for example, that the (earliest) movl
> after cmov triggers over 50 times during stage2 of a GCC bootstrap.
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}, with
> no new failures. Ok for mainline? Or if this regression isn't serious
> enough for stage4 (or these patterns considered too risky), for stage1
> when it reopens? I suspect the poor interaction between cmove usage
> and register allocation is one source of confusion when comparing code
> generation with vs. without cmove (the other major source of confusion
> being that well-predicted branches are free, but that prediction-quality
> is poorly predictable).
>
>
> 2022-04-25 Roger Sayle <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> PR target/92578
> * config/i386/i386.md (peephole2): Eliminate register-to-register
> moves by inverting the condition of a conditional move.
>
> gcc/testsuite/ChangeLog
> PR target/92578
> * gcc.target/i386/pr92758.c: New test case.
+;; Eliminate a reg-reg mov by inverting the condition of a cmov (#3).
+;; cmov r0,r1; mov r1,r0 -> cmov r1,r0
+(define_peephole2
+ [(set (match_operand:SWI248 0 "general_reg_operand")
+ (if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+ [(reg FLAGS_REG) (const_int 0)])
+ (match_operand:SWI248 2 "general_reg_operand")
+ (match_operand:SWI248 3 "general_reg_operand")))
+ (set (match_operand:SWI248 4 "general_reg_operand")
+ (match_dup 0))]
+ "TARGET_CMOVE
+ && ((REGNO (operands[0]) == REGNO (operands[2])
+ && REGNO (operands[3]) == REGNO (operands[4]))
+ || (REGNO (operands[0]) == REGNO (operands[3])
+ && REGNO (operands[2]) == REGNO (operands[4])))
+ && peep2_reg_dead_p (2, operands[0])"
+ [(set (match_dup 4) (if_then_else:SWI248 (match_dup 1)
+ (match_dup 2)
+ (match_dup 3)))])
We have a valid cmov insn here, so no need to match operand 0 with 2
or 3. But it doesn't hurt to have some extra level of safety. OTOH,
splitting this pattern to two and using (match_dup X) instead would be
IMO more comprehensible.
+;; Eliminate a reg-reg mov by inverting the condition of a cmov (#5).
+;; mov x,r0; mov r1,r2; cmp; cmov r0,r2 -> mov x,r2; cmp; cmov r1,r2
+(define_peephole2
+ [(set (match_operand:SWI248 0 "general_reg_operand")
+ (match_operand:SWI248 1))
You probably want the "general_gr_operand" predicate here. Otherwise,
LEA also fits here, which is probably not what you intended.
I think that we can apply the first pattern (which probably represents
90% of matches), which looks simple and safe. Others look a bit too
complicated (and hard to review) for this stage.
Uros.
>
> Thanks in advance,
> Roger
> --
>
@@ -20751,6 +20751,158 @@
operands[9] = replace_rtx (operands[6], operands[0], operands[1], true);
})
+;; Eliminate a reg-reg mov by inverting the condition of a cmov (#3).
+;; cmov r0,r1; mov r1,r0 -> cmov r1,r0
+(define_peephole2
+ [(set (match_operand:SWI248 0 "general_reg_operand")
+ (if_then_else:SWI248 (match_operator 1 "ix86_comparison_operator"
+ [(reg FLAGS_REG) (const_int 0)])
+ (match_operand:SWI248 2 "general_reg_operand")
+ (match_operand:SWI248 3 "general_reg_operand")))
+ (set (match_operand:SWI248 4 "general_reg_operand")
+ (match_dup 0))]
+ "TARGET_CMOVE
+ && ((REGNO (operands[0]) == REGNO (operands[2])
+ && REGNO (operands[3]) == REGNO (operands[4]))
+ || (REGNO (operands[0]) == REGNO (operands[3])
+ && REGNO (operands[2]) == REGNO (operands[4])))
+ && peep2_reg_dead_p (2, operands[0])"
+ [(set (match_dup 4) (if_then_else:SWI248 (match_dup 1)
+ (match_dup 2)
+ (match_dup 3)))])
+
+;; Eliminate a reg-reg mov by inverting the condition of a cmov (#4).
+;; mov r1,r0; mov x,r2; cmp; cmov r2,r0 -> mov x,r0; cmp; cmov r1,r0
+(define_peephole2
+ [(set (match_operand:SWI248 0 "general_reg_operand")
+ (match_operand:SWI248 1 "general_reg_operand"))
+ (set (match_operand:SWI248 2 "general_reg_operand")
+ (match_operand:SWI248 3))
+ (set (match_operand 4 "flags_reg_operand") (match_operand 5))
+ (set (match_dup 0)
+ (if_then_else:SWI248 (match_operator 6 "ix86_comparison_operator"
+ [(match_dup 4) (const_int 0)])
+ (match_operand:SWI248 7 "general_reg_operand")
+ (match_operand:SWI248 8 "general_reg_operand")))]
+ "TARGET_CMOVE
+ && REGNO (operands[0]) != REGNO (operands[2])
+ && REGNO (operands[1]) != REGNO (operands[2])
+ && REGNO (operands[7]) != REGNO (operands[8])
+ /* insns 1 and 2 feed the cmove insn 4. */
+ && (REGNO (operands[2]) == REGNO (operands[7])
+ || REGNO (operands[2]) == REGNO (operands[8]))
+ && !reg_overlap_mentioned_p (operands[0], operands[5])
+ && !reg_overlap_mentioned_p (operands[2], operands[5])
+ && peep2_reg_dead_p (4, operands[2])"
+ [(set (match_dup 0) (match_dup 3))
+ (set (match_dup 4) (match_dup 5))
+ (set (match_dup 0) (if_then_else:SWI248 (match_dup 6)
+ (match_dup 7)
+ (match_dup 8)))]
+{
+ operands[3] = replace_rtx (operands[3], operands[0], operands[1], true);
+ if (REGNO (operands[2]) == REGNO (operands[7]))
+ {
+ operands[7] = operands[0];
+ operands[8] = operands[1];
+ }
+ else
+ {
+ operands[7] = operands[1];
+ operands[8] = operands[0];
+ }
+})
+
+;; Eliminate a reg-reg mov by inverting the condition of a cmov (#5).
+;; mov x,r0; mov r1,r2; cmp; cmov r0,r2 -> mov x,r2; cmp; cmov r1,r2
+(define_peephole2
+ [(set (match_operand:SWI248 0 "general_reg_operand")
+ (match_operand:SWI248 1))
+ (set (match_operand:SWI248 2 "general_reg_operand")
+ (match_operand:SWI248 3 "general_reg_operand"))
+ (set (match_operand 4 "flags_reg_operand") (match_operand 5))
+ (set (match_dup 2)
+ (if_then_else:SWI248 (match_operator 6 "ix86_comparison_operator"
+ [(match_dup 4) (const_int 0)])
+ (match_operand:SWI248 7 "general_reg_operand")
+ (match_operand:SWI248 8 "general_reg_operand")))]
+ "TARGET_CMOVE
+ && REGNO (operands[0]) != REGNO (operands[2])
+ && REGNO (operands[0]) != REGNO (operands[3])
+ && REGNO (operands[2]) != REGNO (operands[3])
+ && REGNO (operands[7]) != REGNO (operands[8])
+ /* insns 1 and 2 feed the cmove insn 4. */
+ && (REGNO (operands[0]) == REGNO (operands[7])
+ || REGNO (operands[0]) == REGNO (operands[8]))
+ && !reg_overlap_mentioned_p (operands[0], operands[5])
+ && !reg_overlap_mentioned_p (operands[2], operands[5])
+ && peep2_reg_dead_p (4, operands[0])"
+ [(set (match_dup 2) (match_dup 1))
+ (set (match_dup 4) (match_dup 5))
+ (set (match_dup 2) (if_then_else:SWI248 (match_dup 6)
+ (match_dup 7)
+ (match_dup 8)))]
+{
+ if (REGNO (operands[0]) == REGNO (operands[7]))
+ {
+ operands[7] = operands[2];
+ operands[8] = operands[3];
+ }
+ else
+ {
+ operands[7] = operands[3];
+ operands[8] = operands[2];
+ }
+})
+
+;; Eliminate two reg-reg movs by inverting the condition of a cmov (#6).
+;; mov r1,r0; mov x,r2; cmp; cmov r2,r0 -> mov x,r0; cmp; cmov r1,r0
+(define_peephole2
+ [(set (match_operand:SWI248 0 "general_reg_operand")
+ (match_operand:SWI248 1 "general_reg_operand"))
+ (set (match_operand 9 "general_reg_operand")
+ (match_operand 10 "general_reg_operand"))
+ (set (match_operand:SWI248 2 "general_reg_operand")
+ (match_operand:SWI248 3))
+ (set (match_operand 4 "flags_reg_operand") (match_operand 5))
+ (set (match_dup 0)
+ (if_then_else:SWI248 (match_operator 6 "ix86_comparison_operator"
+ [(match_dup 4) (const_int 0)])
+ (match_operand:SWI248 7 "general_reg_operand")
+ (match_operand:SWI248 8 "general_reg_operand")))]
+ "TARGET_CMOVE
+ && REGNO (operands[0]) != REGNO (operands[2])
+ && REGNO (operands[0]) != REGNO (operands[9])
+ && REGNO (operands[0]) != REGNO (operands[10])
+ && REGNO (operands[1]) != REGNO (operands[2])
+ && REGNO (operands[1]) != REGNO (operands[9])
+ && REGNO (operands[7]) != REGNO (operands[8])
+ /* insns 1 and 3 feed the cmove insn 5. */
+ && (REGNO (operands[2]) == REGNO (operands[7])
+ || REGNO (operands[2]) == REGNO (operands[8]))
+ && !reg_overlap_mentioned_p (operands[0], operands[5])
+ && !reg_overlap_mentioned_p (operands[2], operands[5])
+ && peep2_reg_dead_p (5, operands[2])"
+ [(set (match_dup 9) (match_dup 10))
+ (set (match_dup 0) (match_dup 3))
+ (set (match_dup 4) (match_dup 5))
+ (set (match_dup 0) (if_then_else:SWI248 (match_dup 6)
+ (match_dup 7)
+ (match_dup 8)))]
+{
+ operands[3] = replace_rtx (operands[3], operands[0], operands[1], true);
+ if (REGNO (operands[2]) == REGNO (operands[7]))
+ {
+ operands[7] = operands[0];
+ operands[8] = operands[1];
+ }
+ else
+ {
+ operands[7] = operands[1];
+ operands[8] = operands[0];
+ }
+})
+
(define_insn "movhf_mask"
[(set (match_operand:HF 0 "nonimmediate_operand" "=v,m,v")
(unspec:HF
new file mode 100644
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2" } */
+
+int foo(int moves, int movecnt, int komove) {
+ int newcnt = movecnt;
+ if (moves == komove)
+ newcnt -= 2;
+ return newcnt;
+}
+
+/* { dg-final { scan-assembler-not "\[ \t]movl\[ \t]" } } */
+/* { dg-final { scan-assembler "\[ \t]cmovne\[ \t]" } } */
+