PR tree-optimization/101895: Fold VEC_PERM to help recognize FMA.

Message ID 032301d835a2$86214110$9263c330$@nextmovesoftware.com
State New
Headers
Series PR tree-optimization/101895: Fold VEC_PERM to help recognize FMA. |

Commit Message

Roger Sayle March 11, 2022, 11:48 p.m. UTC
  This patch resolves PR tree-optimization/101895 a missed optimization
regression, by adding a constant folding simplification to match.pd to
simplify the transform "mult; vec_perm; plus" into "vec_perm; mult; plus"
with the aim that keeping the multiplication and addition next to each
other allows them to be recognized as fused-multiply-add on suitable
targets.  This transformation requires a tweak to match.pd's
vec_same_elem_p predicate to handle CONSTRUCTOR_EXPRs using the same
SSA_NAME_DEF_STMT idiom used for constructors elsewhere in match.pd.

The net effect is that the following code example:

void foo(float * __restrict__ a, float b, float *c) {
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

when compiled on x86_64-pc-linux-gnu with -O2 -march=cascadelake
currently generates:

        vbroadcastss    %xmm0, %xmm0
        vmulps  (%rsi), %xmm0, %xmm0
        vpermilps       $216, %xmm0, %xmm0
        vaddps  (%rdi), %xmm0, %xmm0
        vmovups %xmm0, (%rdi)
        ret

but with this patch now generates the improved:

        vpermilps       $216, (%rsi), %xmm1
        vbroadcastss    %xmm0, %xmm0
        vfmadd213ps     (%rdi), %xmm0, %xmm1
        vmovups %xmm1, (%rdi)
        ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new failures.  Ok for mainline?


2022-03-11  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	PR tree-optimization/101895
	* match.pd (vec_same_elem_p): Handle CONSTRUCTOR_EXPR def.
	(plus (vec_perm (mult ...) ...) ...): New reordering simplification.

gcc/testsuite/ChangeLog
	* gcc.target/i386/pr101895.c: New test case.


Thanks in advance,
Roger
--
  

Comments

Marc Glisse March 12, 2022, 11:39 p.m. UTC | #1
On Fri, 11 Mar 2022, Roger Sayle wrote:

+(match vec_same_elem_p
+  CONSTRUCTOR@0
+  (if (uniform_vector_p (TREE_CODE (@0) == SSA_NAME
+			 ? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0)) : @0))))

Ah, I didn't remember we needed that, we don't seem to be very consistent 
about it. Probably for this reason, the transformation "Prefer vector1 << 
scalar to vector1 << vector2" does not match

typedef int vec __attribute__((vector_size(16)));
vec f(vec a, int b){
   vec bb = { b, b, b, b };
   return a << bb;
}

which is only optimized at vector lowering time.

+/* Push VEC_PERM earlier if that may help FMA perception (PR101895).  */
+(for plusminus (plus minus)
+  (simplify
+    (plusminus (vec_perm (mult@0 @1 vec_same_elem_p@2) @0 @3) @4)
+    (plusminus (mult (vec_perm @1 @1 @3) @2) @4)))

Don't you want :s on mult and vec_perm?
  
Richard Biener March 14, 2022, 7:37 a.m. UTC | #2
On Sun, Mar 13, 2022 at 12:39 AM Marc Glisse via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, 11 Mar 2022, Roger Sayle wrote:
>
> +(match vec_same_elem_p
> +  CONSTRUCTOR@0
> +  (if (uniform_vector_p (TREE_CODE (@0) == SSA_NAME
> +                        ? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0)) : @0))))
>
> Ah, I didn't remember we needed that, we don't seem to be very consistent
> about it. Probably for this reason, the transformation "Prefer vector1 <<
> scalar to vector1 << vector2" does not match
>
> typedef int vec __attribute__((vector_size(16)));
> vec f(vec a, int b){
>    vec bb = { b, b, b, b };
>    return a << bb;
> }
>
> which is only optimized at vector lowering time.

Few more comments - since match.pd is matching in match.pd order the

(match vec_same_elem_p
  @0
  (...))

should come last.  Please use

+(match vec_same_elem_p
+  CONSTRUCTOR@0
    (if (TREE_CODE (@0) == SSA_NAME
         && uniform_vector_p (...

since otherwise we'll try uniform_vector_p twice on all CTORs (that
are not uniform).

> +/* Push VEC_PERM earlier if that may help FMA perception (PR101895).  */
> +(for plusminus (plus minus)
> +  (simplify
> +    (plusminus (vec_perm (mult@0 @1 vec_same_elem_p@2) @0 @3) @4)
> +    (plusminus (mult (vec_perm @1 @1 @3) @2) @4)))
>
> Don't you want :s on mult and vec_perm?

Yes.  Also for plus you want :c on it , likewise you want :c on the
mult.  The :c
on the plus will require splitting the plus and minus case :/

Otherwise looks reasonable.

Richard.

>
> --
> Marc Glisse
  
Roger Sayle March 15, 2022, 7:25 a.m. UTC | #3
Hi Richard and Marc,
Many thanks for both your feedback on my patch for PR 101895.
Here's version 2 of this patch, incorporating all of the suggested improvements.
The one minor complication is that the :s qualifier doesn't automatically
recognize that a capture already has two (or N) uses in a pattern,
so I have to manually confirm that there are no other uses of the mult
using num_imm_uses.

This revision has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new failures.  Ok for mainline?

2022-03-15  Roger Sayle  <roger@nextmovesoftware.com>
	    Marc Glisse  <marc.glisse@inria.fr>
	    Richard Biener  <rguenther@suse.de>

gcc/ChangeLog
	PR tree-optimization/101895
	* match.pd (vec_same_elem_p): Handle CONSTRUCTOR_EXPR def.
	(plus (vec_perm (mult ...) ...) ...): New reordering simplification.

gcc/testsuite/ChangeLog
	PR tree-optimization/101895
	* gcc.target/i386/pr101895.c: New test case.


Thanks in advance,
Roger
--

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: 14 March 2022 07:38
> To: GCC Patches <gcc-patches@gcc.gnu.org>
> Cc: Roger Sayle <roger@nextmovesoftware.com>; Marc Glisse
> <marc.glisse@inria.fr>
> Subject: Re: [PATCH] PR tree-optimization/101895: Fold VEC_PERM to help
> recognize FMA.
> 
> On Sun, Mar 13, 2022 at 12:39 AM Marc Glisse via Gcc-patches <gcc-
> patches@gcc.gnu.org> wrote:
> >
> > On Fri, 11 Mar 2022, Roger Sayle wrote:
> >
> > +(match vec_same_elem_p
> > +  CONSTRUCTOR@0
> > +  (if (uniform_vector_p (TREE_CODE (@0) == SSA_NAME
> > +                        ? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0))
> > +: @0))))
> >
> > Ah, I didn't remember we needed that, we don't seem to be very
> > consistent about it. Probably for this reason, the transformation
> > "Prefer vector1 << scalar to vector1 << vector2" does not match
> >
> > typedef int vec __attribute__((vector_size(16))); vec f(vec a, int b){
> >    vec bb = { b, b, b, b };
> >    return a << bb;
> > }
> >
> > which is only optimized at vector lowering time.
> 
> Few more comments - since match.pd is matching in match.pd order the
> 
> (match vec_same_elem_p
>   @0
>   (...))
> 
> should come last.  Please use
> 
> +(match vec_same_elem_p
> +  CONSTRUCTOR@0
>     (if (TREE_CODE (@0) == SSA_NAME
>          && uniform_vector_p (...
> 
> since otherwise we'll try uniform_vector_p twice on all CTORs (that are not
> uniform).
> 
> > +/* Push VEC_PERM earlier if that may help FMA perception (PR101895).
> > +*/ (for plusminus (plus minus)
> > +  (simplify
> > +    (plusminus (vec_perm (mult@0 @1 vec_same_elem_p@2) @0 @3) @4)
> > +    (plusminus (mult (vec_perm @1 @1 @3) @2) @4)))
> >
> > Don't you want :s on mult and vec_perm?
> 
> Yes.  Also for plus you want :c on it , likewise you want :c on the mult.  The :c on
> the plus will require splitting the plus and minus case :/
> 
> Otherwise looks reasonable.
> 
> Richard.
> 
> >
> > --
> > Marc Glisse
diff --git a/gcc/match.pd b/gcc/match.pd
index 97399e5..12c92f4 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7689,16 +7689,33 @@ and,
 /* VEC_PERM_EXPR (v, v, mask) -> v where v contains same element.  */
 
 (match vec_same_elem_p
+ (vec_duplicate @0))
+
+(match vec_same_elem_p
+ CONSTRUCTOR@0
+ (if (TREE_CODE (@0) == SSA_NAME
+      && uniform_vector_p (gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0))))))
+
+(match vec_same_elem_p
  @0
  (if (uniform_vector_p (@0))))
 
-(match vec_same_elem_p
- (vec_duplicate @0))
 
 (simplify
  (vec_perm vec_same_elem_p@0 @0 @1)
  @0)
 
+/* Push VEC_PERM earlier if that may help FMA perception (PR101895).  */
+(simplify
+ (plus:c (vec_perm:s (mult:c@0 @1 vec_same_elem_p@2) @0 @3) @4)
+ (if (TREE_CODE (@0) == SSA_NAME && num_imm_uses (@0) == 2)
+  (plus (mult (vec_perm @1 @1 @3) @2) @4)))
+(simplify
+ (minus (vec_perm:s (mult:c@0 @1 vec_same_elem_p@2) @0 @3) @4)
+ (if (TREE_CODE (@0) == SSA_NAME && num_imm_uses (@0) == 2)
+  (minus (mult (vec_perm @1 @1 @3) @2) @4)))
+
+
 /* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
    The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
    constant which when multiplied by a power of 2 contains a unique value
diff --git a/gcc/testsuite/gcc.target/i386/pr101895.c b/gcc/testsuite/gcc.target/i386/pr101895.c
new file mode 100644
index 0000000..4d0f1cb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101895.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=cascadelake" } */
+
+void foo(float * __restrict__ a, float b, float *c) {
+  a[0] = c[0]*b + a[0];
+  a[1] = c[2]*b + a[1];
+  a[2] = c[1]*b + a[2];
+  a[3] = c[3]*b + a[3];
+}
+
+/* { dg-final { scan-assembler "vfmadd" } } */
  
Richard Biener March 15, 2022, 7:43 a.m. UTC | #4
On Tue, Mar 15, 2022 at 8:25 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Richard and Marc,
> Many thanks for both your feedback on my patch for PR 101895.
> Here's version 2 of this patch, incorporating all of the suggested improvements.
> The one minor complication is that the :s qualifier doesn't automatically
> recognize that a capture already has two (or N) uses in a pattern,
> so I have to manually confirm that there are no other uses of the mult
> using num_imm_uses.
>
> This revision has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check with no new failures.  Ok for mainline?

OK.

Thanks,
Richard.

> 2022-03-15  Roger Sayle  <roger@nextmovesoftware.com>
>             Marc Glisse  <marc.glisse@inria.fr>
>             Richard Biener  <rguenther@suse.de>
>
> gcc/ChangeLog
>         PR tree-optimization/101895
>         * match.pd (vec_same_elem_p): Handle CONSTRUCTOR_EXPR def.
>         (plus (vec_perm (mult ...) ...) ...): New reordering simplification.
>
> gcc/testsuite/ChangeLog
>         PR tree-optimization/101895
>         * gcc.target/i386/pr101895.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: 14 March 2022 07:38
> > To: GCC Patches <gcc-patches@gcc.gnu.org>
> > Cc: Roger Sayle <roger@nextmovesoftware.com>; Marc Glisse
> > <marc.glisse@inria.fr>
> > Subject: Re: [PATCH] PR tree-optimization/101895: Fold VEC_PERM to help
> > recognize FMA.
> >
> > On Sun, Mar 13, 2022 at 12:39 AM Marc Glisse via Gcc-patches <gcc-
> > patches@gcc.gnu.org> wrote:
> > >
> > > On Fri, 11 Mar 2022, Roger Sayle wrote:
> > >
> > > +(match vec_same_elem_p
> > > +  CONSTRUCTOR@0
> > > +  (if (uniform_vector_p (TREE_CODE (@0) == SSA_NAME
> > > +                        ? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0))
> > > +: @0))))
> > >
> > > Ah, I didn't remember we needed that, we don't seem to be very
> > > consistent about it. Probably for this reason, the transformation
> > > "Prefer vector1 << scalar to vector1 << vector2" does not match
> > >
> > > typedef int vec __attribute__((vector_size(16))); vec f(vec a, int b){
> > >    vec bb = { b, b, b, b };
> > >    return a << bb;
> > > }
> > >
> > > which is only optimized at vector lowering time.
> >
> > Few more comments - since match.pd is matching in match.pd order the
> >
> > (match vec_same_elem_p
> >   @0
> >   (...))
> >
> > should come last.  Please use
> >
> > +(match vec_same_elem_p
> > +  CONSTRUCTOR@0
> >     (if (TREE_CODE (@0) == SSA_NAME
> >          && uniform_vector_p (...
> >
> > since otherwise we'll try uniform_vector_p twice on all CTORs (that are not
> > uniform).
> >
> > > +/* Push VEC_PERM earlier if that may help FMA perception (PR101895).
> > > +*/ (for plusminus (plus minus)
> > > +  (simplify
> > > +    (plusminus (vec_perm (mult@0 @1 vec_same_elem_p@2) @0 @3) @4)
> > > +    (plusminus (mult (vec_perm @1 @1 @3) @2) @4)))
> > >
> > > Don't you want :s on mult and vec_perm?
> >
> > Yes.  Also for plus you want :c on it , likewise you want :c on the mult.  The :c on
> > the plus will require splitting the plus and minus case :/
> >
> > Otherwise looks reasonable.
> >
> > Richard.
> >
> > >
> > > --
> > > Marc Glisse
  

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index 97399e5..9184276 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -7695,10 +7695,22 @@  and,
 (match vec_same_elem_p
  (vec_duplicate @0))
 
+(match vec_same_elem_p
+  CONSTRUCTOR@0
+  (if (uniform_vector_p (TREE_CODE (@0) == SSA_NAME
+			 ? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0)) : @0))))
+
 (simplify
  (vec_perm vec_same_elem_p@0 @0 @1)
  @0)
 
+/* Push VEC_PERM earlier if that may help FMA perception (PR101895).  */
+(for plusminus (plus minus)
+  (simplify
+    (plusminus (vec_perm (mult@0 @1 vec_same_elem_p@2) @0 @3) @4)
+    (plusminus (mult (vec_perm @1 @1 @3) @2) @4)))
+  
+
 /* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
    The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
    constant which when multiplied by a power of 2 contains a unique value
diff --git a/gcc/testsuite/gcc.target/i386/pr101895.c b/gcc/testsuite/gcc.target/i386/pr101895.c
new file mode 100644
index 0000000..4d0f1cb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101895.c
@@ -0,0 +1,11 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=cascadelake" } */
+
+void foo(float * __restrict__ a, float b, float *c) {
+  a[0] = c[0]*b + a[0];
+  a[1] = c[2]*b + a[1];
+  a[2] = c[1]*b + a[2];
+  a[3] = c[3]*b + a[3];
+}
+
+/* { dg-final { scan-assembler "vfmadd" } } */