Canonicalize X&-Y as X*Y in match.pd when Y is [0,1].

Message ID  024f01d86f75$d40ae450$7c20acf0$@nextmovesoftware.com
State       New
Series      Canonicalize X&-Y as X*Y in match.pd when Y is [0,1].

Commit Message

Roger Sayle May 24, 2022, 1:54 p.m. UTC
  "For every pessimization, there's an equal and opposite optimization".

In the review of my original patch for PR middle-end/98865, Richard
Biener pointed out that match.pd shouldn't be transforming X*Y into
X&-Y as the former is considered cheaper by tree-ssa's cost model
(operator count).  A corollary of this is that we should instead be
transforming X&-Y into the cheaper X*Y as a preferred canonical form
(especially as RTL expansion now intelligently selects the appropriate
implementation based on the target's costs).

With this patch we now generate identical code for:
int foo(int x, int y) { return -(x&1) & y; }
int bar(int x, int y) { return (x&1) * y; }

specifically on x86_64-pc-linux-gnu both use and/neg/and when
optimizing for speed, but both use and/mul when optimizing for
size.

One minor wrinkle/improvement is that this patch includes three
additional optimizations (that account for the change in canonical
form) to continue to optimize PR92834 and PR94786.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-05-24  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* match.pd (match_zero_one_valued_p): New predicate.
	(mult @0 @1): Use zero_one_valued_p for optimization to the
	expression "bit_and @0 @1".
	(bit_and (negate zero_one_valued_p@0) @1): Optimize to MULT_EXPR.
	(plus @0 (mult (minus @1 @0) zero_one_valued_p@2)): New transform.
	(minus @0 (mult (minus @0 @1) zero_one_valued_p@2)): Likewise.
	(bit_xor @0 (mult (bit_xor @0 @1) zero_one_valued_p@2)): Likewise.

gcc/testsuite/ChangeLog
	* gcc.dg/pr98865.c: New test case.


Thanks in advance,
Roger
--
  

Comments

Richard Biener May 25, 2022, 11:34 a.m. UTC | #1
On Tue, May 24, 2022 at 3:55 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> "For every pessimization, there's an equal and opposite optimization".
>
> In the review of my original patch for PR middle-end/98865, Richard
> Biener pointed out that match.pd shouldn't be transforming X*Y into
> X&-Y as the former is considered cheaper by tree-ssa's cost model
> (operator count).  A corollary of this is that we should instead be
> transforming X&-Y into the cheaper X*Y as a preferred canonical form
> (especially as RTL expansion now intelligently selects the appropriate
> implementation based on the target's costs).
>
> With this patch we now generate identical code for:
> int foo(int x, int y) { return -(x&1) & y; }
> int bar(int x, int y) { return (x&1) * y; }
>
> specifically on x86_64-pc-linux-gnu both use and/neg/and when
> optimizing for speed, but both use and/mul when optimizing for
> size.
>
> One minor wrinkle/improvement is that this patch includes three
> additional optimizations (that account for the change in canonical
> form) to continue to optimize PR92834 and PR94786.

Those are presumably the preceding patterns which match

(convert? (negate@4 (convert? (cmp@5 @2 @3))))

for the multiplication operand - those should be zero_one_valued_p
maybe with the exception of the conversions?  So are the original
patterns still needed after the canonicalization to a multiplication?

Otherwise looks good to me.

Thanks,
Richard.

> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  Ok for mainline?
>
>
> 2022-05-24  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * match.pd (match_zero_one_valued_p): New predicate.
>         (mult @0 @1): Use zero_one_valued_p for optimization to the
>         expression "bit_and @0 @1".
>         (bit_and (negate zero_one_valued_p@0) @1): Optimize to MULT_EXPR.
>         (plus @0 (mult (minus @1 @0) zero_one_valued_p@2)): New transform.
>         (minus @0 (mult (minus @0 @1) zero_one_valued_p@2)): Likewise.
>         (bit_xor @0 (mult (bit_xor @0 @1) zero_one_valued_p@2)): Likewise.
>
> gcc/testsuite/ChangeLog
>         * gcc.dg/pr98865.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
  
Li, Pan2 via Gcc-patches May 25, 2022, 1:40 p.m. UTC | #2
> On May 25, 2022, at 7:34 AM, Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> 
> On Tue, May 24, 2022 at 3:55 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>> 
>> 
>> "For every pessimization, there's an equal and opposite optimization".
>> 
>> In the review of my original patch for PR middle-end/98865, Richard
>> Biener pointed out that match.pd shouldn't be transforming X*Y into
>> X&-Y as the former is considered cheaper by tree-ssa's cost model
>> (operator count).  A corollary of this is that we should instead be
>> transforming X&-Y into the cheaper X*Y as a preferred canonical form
>> (especially as RTL expansion now intelligently selects the appropriate
>> implementation based on the target's costs).
>> 
>> With this patch we now generate identical code for:
>> int foo(int x, int y) { return -(x&1) & y; }
>> int bar(int x, int y) { return (x&1) * y; }

What, if anything, does the target description have to do for "the appropriate implementation" to be selected?  For example, if the target has an "AND with complement" operation, it's probably cheaper than multiply and would be the preferred generated code.

	paul
  
Roger Sayle May 25, 2022, 2:39 p.m. UTC | #3
> > On May 25, 2022, at 7:34 AM, Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Tue, May 24, 2022 at 3:55 PM Roger Sayle
> <roger@nextmovesoftware.com> wrote:
> >>
> >>
> >> "For every pessimization, there's an equal and opposite optimization".
> >>
> >> In the review of my original patch for PR middle-end/98865, Richard
> >> Biener pointed out that match.pd shouldn't be transforming X*Y into
> >> X&-Y as the former is considered cheaper by tree-ssa's cost model
> >> (operator count).  A corollary of this is that we should instead be
> >> transforming X&-Y into the cheaper X*Y as a preferred canonical form
> >> (especially as RTL expansion now intelligently selects the
> >> appropriate implementation based on the target's costs).
> >>
> >> With this patch we now generate identical code for:
> >> int foo(int x, int y) { return -(x&1) & y; }
> >> int bar(int x, int y) { return (x&1) * y; }
> 
> What, if anything, does the target description have to do for "the
> appropriate implementation" to be selected?  For example, if the target
> has an "AND with complement" operation, it's probably cheaper than
> multiply and would be the preferred generated code.

RTL expansion will use an AND and NEG instruction pair if that's cheaper
than the cost of a MULT or a synth_mult sequence.  Even without the
backend providing an rtx_costs function, GCC will default to AND and NEG
having COSTS_N_INSNS(1), and MULT having COSTS_N_INSNS(4).
But consider the case where y is cloned/inlined/CSE'd to have the
value 2, in which case (on many targets) a left shift is cheaper than
an AND and a NEG.

Alas, I don't believe the existence of ANDN, such as with BMI or SSE, has
any impact on the decision, as this is NEG;AND not NOT;AND.  If you
know of any target that has an "AND with negation" instruction, I'll
probably need to tweak RTL expansion to check for that explicitly.

The correct way to think about this canonicalization is that the
default implementation of RTL expansion of a multiply by a 0/1
value is to use NEG/AND; only in the rare cases where a multiply
(or synth_mult sequence) is extremely cheap, for example a
single-cycle multiply, will the multiply itself be used.

Roger
--
  
Li, Pan2 via Gcc-patches May 25, 2022, 2:48 p.m. UTC | #4
On May 25, 2022, at 10:39 AM, Roger Sayle <roger@nextmovesoftware.com> wrote:

>>> On May 25, 2022, at 7:34 AM, Richard Biener via Gcc-patches
>>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> On Tue, May 24, 2022 at 3:55 PM Roger Sayle
>>> <roger@nextmovesoftware.com> wrote:
>>>>
>>>> "For every pessimization, there's an equal and opposite optimization".
>>>>
>>>> In the review of my original patch for PR middle-end/98865, Richard
>>>> Biener pointed out that match.pd shouldn't be transforming X*Y into
>>>> X&-Y as the former is considered cheaper by tree-ssa's cost model
>>>> (operator count).  A corollary of this is that we should instead be
>>>> transforming X&-Y into the cheaper X*Y as a preferred canonical form
>>>> (especially as RTL expansion now intelligently selects the
>>>> appropriate implementation based on the target's costs).
>>>>
>>>> With this patch we now generate identical code for:
>>>> int foo(int x, int y) { return -(x&1) & y; }
>>>> int bar(int x, int y) { return (x&1) * y; }
>>
>> What, if anything, does the target description have to do for "the
>> appropriate implementation" to be selected?  For example, if the target
>> has an "AND with complement" operation, it's probably cheaper than
>> multiply and would be the preferred generated code.
>
> RTL expansion will use an AND and NEG instruction pair if that's cheaper
> than the cost of a MULT or a synth_mult sequence.  Even without the
> backend providing an rtx_costs function, GCC will default to AND and NEG
> having COSTS_N_INSNS(1), and MULT having COSTS_N_INSNS(4).
> But consider the case where y is cloned/inlined/CSE'd to have the
> value 2, in which case (on many targets) a left shift is cheaper than
> an AND and a NEG.
>
> Alas, I don't believe the existence of ANDN, such as with BMI or SSE, has
> any impact on the decision, as this is NEG;AND not NOT;AND.  If you
> know of any target that has an "AND with negation" instruction, I'll
> probably need to tweak RTL expansion to check for that explicitly.

I don't know of one either (in the two's complement world); I misread the
minus as a tilde in the "before".  Sorry about the mixup.

	paul
  

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index c2fed9b..ce97d85 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -285,14 +285,6 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
            || !COMPLEX_FLOAT_TYPE_P (type)))
    (negate @0)))
 
-/* Transform { 0 or 1 } * { 0 or 1 } into { 0 or 1 } & { 0 or 1 } */
-(simplify
- (mult SSA_NAME@1 SSA_NAME@2)
-  (if (INTEGRAL_TYPE_P (type)
-       && get_nonzero_bits (@1) == 1
-       && get_nonzero_bits (@2) == 1)
-   (bit_and @1 @2)))
-
 /* Transform x * { 0 or 1, 0 or 1, ... } into x & { 0 or -1, 0 or -1, ...},
    unless the target has native support for the former but not the latter.  */
 (simplify
@@ -1787,6 +1779,24 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (bit_not (bit_not @0))
   @0)
 
+(match zero_one_valued_p
+ @0
+ (if (INTEGRAL_TYPE_P (type) && tree_nonzero_bits (@0) == 1)))
+(match zero_one_valued_p
+ truth_valued_p@0)
+
+/* Transform { 0 or 1 } * { 0 or 1 } into { 0 or 1 } & { 0 or 1 }.  */
+(simplify
+ (mult zero_one_valued_p@0 zero_one_valued_p@1)
+ (if (INTEGRAL_TYPE_P (type))
+  (bit_and @0 @1)))
+
+/* Transform X & -Y into X * Y when Y is { 0 or 1 }.  */
+(simplify
+ (bit_and:c (negate zero_one_valued_p@0) @1)
+ (if (INTEGRAL_TYPE_P (type))
+  (mult @0 @1)))
+
 /* Convert ~ (-A) to A - 1.  */
 (simplify
  (bit_not (convert? (negate @0)))
@@ -3320,6 +3330,25 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && (GIMPLE || !TREE_SIDE_EFFECTS (@1)))
    (cond (cmp @2 @3) @1 @0))))
 
+/* Likewise using multiplication, A + (B-A)*cmp into cmp ? B : A.  */
+(simplify
+ (plus:c @0 (mult:c (minus @1 @0) zero_one_valued_p@2))
+ (if (INTEGRAL_TYPE_P (type)
+      && (GIMPLE || !TREE_SIDE_EFFECTS (@1)))
+  (cond @2 @1 @0)))
+/* Likewise using multiplication, A - (A-B)*cmp into cmp ? B : A.  */
+(simplify
+ (minus @0 (mult:c (minus @0 @1) zero_one_valued_p@2))
+ (if (INTEGRAL_TYPE_P (type)
+      && (GIMPLE || !TREE_SIDE_EFFECTS (@1)))
+  (cond @2 @1 @0)))
+/* Likewise using multiplication, A ^ (A^B)*cmp into cmp ? B : A.  */
+(simplify
+ (bit_xor:c @0 (mult:c (bit_xor:c @0 @1) zero_one_valued_p@2))
+ (if (INTEGRAL_TYPE_P (type)
+      && (GIMPLE || !TREE_SIDE_EFFECTS (@1)))
+  (cond @2 @1 @0)))
+
 /* Simplifications of shift and rotates.  */
 
 (for rotate (lrotate rrotate)
diff --git a/gcc/testsuite/gcc.dg/pr98865.c b/gcc/testsuite/gcc.dg/pr98865.c
new file mode 100644
index 0000000..95f7270
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98865.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int foo(int x, int y)
+{
+  return -(x&1) & y;
+}
+
+int bar(int x, int y)
+{
+  return (x&1) * y;
+}
+
+/* { dg-final { scan-tree-dump-times " \\* " 2 "optimized" } } */