Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]

Message ID 20221104000432.15254-1-hongyu.wang@intel.com
State New
Headers
Series Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167] |

Commit Message

Hongyu Wang Nov. 4, 2022, 12:04 a.m. UTC
  Hi,

This is a follow-up patch for PR98167

The sequence
     c1 = VEC_PERM_EXPR (a, a, mask)
     c2 = VEC_PERM_EXPR (b, b, mask)
     c3 = c1 op c2
can be optimized to
     c = a op b
     c3 = VEC_PERM_EXPR (c, c, mask)
for all integer vector operation, and float operation with
full permutation.

Bootstrapped & regrtested on x86_64-pc-linux-gnu.

Ok for trunk?

gcc/ChangeLog:

	PR target/98167
	* match.pd: New perm + vector op patterns for int and fp vector.

gcc/testsuite/ChangeLog:

	PR target/98167
	* gcc.target/i386/pr98167.c: New test.
---
 gcc/match.pd                            | 49 +++++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
 2 files changed, 93 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
  

Comments

Prathamesh Kulkarni Nov. 4, 2022, 6:43 a.m. UTC | #1
On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> This is a follow-up patch for PR98167
>
> The sequence
>      c1 = VEC_PERM_EXPR (a, a, mask)
>      c2 = VEC_PERM_EXPR (b, b, mask)
>      c3 = c1 op c2
> can be optimized to
>      c = a op b
>      c3 = VEC_PERM_EXPR (c, c, mask)
> for all integer vector operation, and float operation with
> full permutation.
>
> Bootstrapped & regrtested on x86_64-pc-linux-gnu.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
>         PR target/98167
>         * match.pd: New perm + vector op patterns for int and fp vector.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/98167
>         * gcc.target/i386/pr98167.c: New test.
> ---
>  gcc/match.pd                            | 49 +++++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
>  2 files changed, 93 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 194ba8f5188..b85ad34f609 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -8189,3 +8189,52 @@ and,
>   (bit_and (negate @0) integer_onep@1)
>   (if (!TYPE_OVERFLOW_SANITIZED (type))
>    (bit_and @0 @1)))
> +
> +/* Optimize
> +   c1 = VEC_PERM_EXPR (a, a, mask)
> +   c2 = VEC_PERM_EXPR (b, b, mask)
> +   c3 = c1 op c2
> +   -->
> +   c = a op b
> +   c3 = VEC_PERM_EXPR (c, c, mask)
> +   For all integer non-div operations.  */
> +(for op (plus minus mult bit_and bit_ior bit_xor
> +        lshift rshift)
> + (simplify
> +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> +    (if (VECTOR_INTEGER_TYPE_P (type))
> +     (vec_perm (op @0 @1) (op @0 @1) @2))))
Just wondering, why should mask be CST here ?
I guess the transform should work as long as mask is same for both
vectors even if it's
not constant ?
> +
> +/* Similar for float arithmetic when permutation constant covers
> +   all vector elements.  */
> +(for op (plus minus mult)
> + (simplify
> +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> +    (if (VECTOR_FLOAT_TYPE_P (type))
> +     (with
> +      {
> +       tree perm_cst = @2;
> +       vec_perm_builder builder;
> +       bool full_perm_p = false;
> +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> +         {
> +           /* Create a vec_perm_indices for the integer vector.  */
> +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
If this transform is meant only for VLS vectors, I guess you should
bail out if TYPE_VECTOR_SUBPARTS is not constant,
otherwise it will crash for VLA vectors.

Thanks,
Prathamesh
> +           vec_perm_indices sel (builder, 1, nelts);
> +
> +           /* Check if perm indices covers all vector elements.  */
> +           int count = 0, i, j;
> +           for (i = 0; i < nelts; i++)
> +             for (j = 0; j < nelts; j++)
> +               {
> +                 if (sel[j].to_constant () == i)
> +                   {
> +                     count++;
> +                     break;
> +                   }
> +               }
> +           full_perm_p = count == nelts;
> +         }
> +       }
> +       (if (full_perm_p)
> +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> new file mode 100644
> index 00000000000..40e0ac11332
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> @@ -0,0 +1,44 @@
> +/* PR target/98167 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx2" } */
> +
> +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> +
> +#define VEC_PERM_4 \
> +  2, 3, 1, 0
> +#define VEC_PERM_8 \
> +  4, 5, 6, 7, 3, 2, 1, 0
> +#define VEC_PERM_16 \
> +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> +
> +#define TYPE_PERM_OP(type, size, op, name) \
> +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> +                                             v##size##s##type b) \
> +  { \
> +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> +                                                  VEC_PERM_##size); \
> +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> +                                                  VEC_PERM_##size); \
> +    return a1 op b1; \
> +  }
> +
> +#define INT_PERMS(op, name) \
> +  TYPE_PERM_OP (int, 4, op, name) \
> +
> +#define FP_PERMS(op, name) \
> +  TYPE_PERM_OP (float, 4, op, name) \
> +
> +INT_PERMS (+, add)
> +INT_PERMS (-, sub)
> +INT_PERMS (*, mul)
> +INT_PERMS (|, ior)
> +INT_PERMS (^, xor)
> +INT_PERMS (&, and)
> +INT_PERMS (<<, shl)
> +INT_PERMS (>>, shr)
> +FP_PERMS (+, add)
> +FP_PERMS (-, sub)
> +FP_PERMS (*, mul)
> +
> --
> 2.18.1
>
  
Richard Biener Nov. 8, 2022, 2:37 p.m. UTC | #2
On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Hi,
> >
> > This is a follow-up patch for PR98167
> >
> > The sequence
> >      c1 = VEC_PERM_EXPR (a, a, mask)
> >      c2 = VEC_PERM_EXPR (b, b, mask)
> >      c3 = c1 op c2
> > can be optimized to
> >      c = a op b
> >      c3 = VEC_PERM_EXPR (c, c, mask)
> > for all integer vector operation, and float operation with
> > full permutation.
> >
> > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> >
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >         PR target/98167
> >         * match.pd: New perm + vector op patterns for int and fp vector.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/98167
> >         * gcc.target/i386/pr98167.c: New test.
> > ---
> >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> >  2 files changed, 93 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 194ba8f5188..b85ad34f609 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -8189,3 +8189,52 @@ and,
> >   (bit_and (negate @0) integer_onep@1)
> >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> >    (bit_and @0 @1)))
> > +
> > +/* Optimize
> > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > +   c3 = c1 op c2
> > +   -->
> > +   c = a op b
> > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > +   For all integer non-div operations.  */
> > +(for op (plus minus mult bit_and bit_ior bit_xor
> > +        lshift rshift)
> > + (simplify
> > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> Just wondering, why should mask be CST here ?
> I guess the transform should work as long as mask is same for both
> vectors even if it's
> not constant ?

Yes, please change accordingly (and maybe push separately).

> > +
> > +/* Similar for float arithmetic when permutation constant covers
> > +   all vector elements.  */
> > +(for op (plus minus mult)
> > + (simplify
> > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > +     (with
> > +      {
> > +       tree perm_cst = @2;
> > +       vec_perm_builder builder;
> > +       bool full_perm_p = false;
> > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > +         {
> > +           /* Create a vec_perm_indices for the integer vector.  */
> > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> If this transform is meant only for VLS vectors, I guess you should
> bail out if TYPE_VECTOR_SUBPARTS is not constant,
> otherwise it will crash for VLA vectors.

I suppose it's difficult to create a VLA permute that covers all elements
and that is not trivial though.  But indeed add ().is_constant to the
VECTOR_FLOAT_TYPE_P guard.

>
> Thanks,
> Prathamesh
> > +           vec_perm_indices sel (builder, 1, nelts);
> > +
> > +           /* Check if perm indices covers all vector elements.  */
> > +           int count = 0, i, j;
> > +           for (i = 0; i < nelts; i++)
> > +             for (j = 0; j < nelts; j++)

Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
(as said I can't think of a non-full encoding that isn't trivial
but covers all elements) and then simply .qsort () the vector_builder
(it derives
from vec<>) so the scan is O(n log n).

Maybe Richard has a better idea here though.

Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
always that they explode the match decision tree, we'd ideally have a way to
match those with (op ..) (op ..) first to be able to share more of the matching
code.  That said, match.pd is a less than ideal place for these (but mostly
because of the way we code generate *-match.cc)

Richard.

> > +               {
> > +                 if (sel[j].to_constant () == i)
> > +                   {
> > +                     count++;
> > +                     break;
> > +                   }
> > +               }
> > +           full_perm_p = count == nelts;
> > +         }
> > +       }
> > +       (if (full_perm_p)
> > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > new file mode 100644
> > index 00000000000..40e0ac11332
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > @@ -0,0 +1,44 @@
> > +/* PR target/98167 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx2" } */
> > +
> > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > +
> > +#define VEC_PERM_4 \
> > +  2, 3, 1, 0
> > +#define VEC_PERM_8 \
> > +  4, 5, 6, 7, 3, 2, 1, 0
> > +#define VEC_PERM_16 \
> > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > +
> > +#define TYPE_PERM_OP(type, size, op, name) \
> > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > +                                             v##size##s##type b) \
> > +  { \
> > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > +                                                  VEC_PERM_##size); \
> > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > +                                                  VEC_PERM_##size); \
> > +    return a1 op b1; \
> > +  }
> > +
> > +#define INT_PERMS(op, name) \
> > +  TYPE_PERM_OP (int, 4, op, name) \
> > +
> > +#define FP_PERMS(op, name) \
> > +  TYPE_PERM_OP (float, 4, op, name) \
> > +
> > +INT_PERMS (+, add)
> > +INT_PERMS (-, sub)
> > +INT_PERMS (*, mul)
> > +INT_PERMS (|, ior)
> > +INT_PERMS (^, xor)
> > +INT_PERMS (&, and)
> > +INT_PERMS (<<, shl)
> > +INT_PERMS (>>, shr)
> > +FP_PERMS (+, add)
> > +FP_PERMS (-, sub)
> > +FP_PERMS (*, mul)
> > +
> > --
> > 2.18.1
> >
  
Hongyu Wang Nov. 10, 2022, 2:22 a.m. UTC | #3
Hi Prathamesh and Richard,

Thanks for the review and nice suggestions!

> > I guess the transform should work as long as mask is same for both
> > vectors even if it's
> > not constant ?
>
> Yes, please change accordingly (and maybe push separately).
>

Removed VECTOR_CST for integer ops.

> > If this transform is meant only for VLS vectors, I guess you should
> > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > otherwise it will crash for VLA vectors.
>
> I suppose it's difficult to create a VLA permute that covers all elements
> and that is not trivial though.  But indeed add ().is_constant to the
> VECTOR_FLOAT_TYPE_P guard.

Added.

> Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> (as said I can't think of a non-full encoding that isn't trivial
> but covers all elements) and then simply .qsort () the vector_builder
> (it derives
> from vec<>) so the scan is O(n log n).

The .qsort () approach requires an extra cmp_func that IMO would not
be feasible to be implemented in match.pd (I suppose lambda function
would not be a good idea either).
Another solution would be using hash_set but it does not work here for
int64_t or poly_int64 type.
So I kept current O(n^2) simple code here, and I suppose usually the
permutation indices would be a small number even for O(n^2)
complexity.

Attached updated patch.

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年11月8日周二 22:38写道:


>
> On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > >
> > > This is a follow-up patch for PR98167
> > >
> > > The sequence
> > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > >      c3 = c1 op c2
> > > can be optimized to
> > >      c = a op b
> > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > for all integer vector operation, and float operation with
> > > full permutation.
> > >
> > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > >
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > >         PR target/98167
> > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         PR target/98167
> > >         * gcc.target/i386/pr98167.c: New test.
> > > ---
> > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > >  2 files changed, 93 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > index 194ba8f5188..b85ad34f609 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -8189,3 +8189,52 @@ and,
> > >   (bit_and (negate @0) integer_onep@1)
> > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > >    (bit_and @0 @1)))
> > > +
> > > +/* Optimize
> > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > +   c3 = c1 op c2
> > > +   -->
> > > +   c = a op b
> > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > +   For all integer non-div operations.  */
> > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > +        lshift rshift)
> > > + (simplify
> > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > Just wondering, why should mask be CST here ?
> > I guess the transform should work as long as mask is same for both
> > vectors even if it's
> > not constant ?
>
> Yes, please change accordingly (and maybe push separately).
>
> > > +
> > > +/* Similar for float arithmetic when permutation constant covers
> > > +   all vector elements.  */
> > > +(for op (plus minus mult)
> > > + (simplify
> > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > +     (with
> > > +      {
> > > +       tree perm_cst = @2;
> > > +       vec_perm_builder builder;
> > > +       bool full_perm_p = false;
> > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > +         {
> > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > If this transform is meant only for VLS vectors, I guess you should
> > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > otherwise it will crash for VLA vectors.
>
> I suppose it's difficult to create a VLA permute that covers all elements
> and that is not trivial though.  But indeed add ().is_constant to the
> VECTOR_FLOAT_TYPE_P guard.
>
> >
> > Thanks,
> > Prathamesh
> > > +           vec_perm_indices sel (builder, 1, nelts);
> > > +
> > > +           /* Check if perm indices covers all vector elements.  */
> > > +           int count = 0, i, j;
> > > +           for (i = 0; i < nelts; i++)
> > > +             for (j = 0; j < nelts; j++)
>
> Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> (as said I can't think of a non-full encoding that isn't trivial
> but covers all elements) and then simply .qsort () the vector_builder
> (it derives
> from vec<>) so the scan is O(n log n).
>
> Maybe Richard has a better idea here though.
>
> Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> always that they explode the match decision tree, we'd ideally have a way to
> match those with (op ..) (op ..) first to be able to share more of the matching
> code.  That said, match.pd is a less than ideal place for these (but mostly
> because of the way we code generate *-match.cc)
>
> Richard.
>
> > > +               {
> > > +                 if (sel[j].to_constant () == i)
> > > +                   {
> > > +                     count++;
> > > +                     break;
> > > +                   }
> > > +               }
> > > +           full_perm_p = count == nelts;
> > > +         }
> > > +       }
> > > +       (if (full_perm_p)
> > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > new file mode 100644
> > > index 00000000000..40e0ac11332
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > @@ -0,0 +1,44 @@
> > > +/* PR target/98167 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx2" } */
> > > +
> > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > +
> > > +#define VEC_PERM_4 \
> > > +  2, 3, 1, 0
> > > +#define VEC_PERM_8 \
> > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > +#define VEC_PERM_16 \
> > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > +
> > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > +                                             v##size##s##type b) \
> > > +  { \
> > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > +                                                  VEC_PERM_##size); \
> > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > +                                                  VEC_PERM_##size); \
> > > +    return a1 op b1; \
> > > +  }
> > > +
> > > +#define INT_PERMS(op, name) \
> > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > +
> > > +#define FP_PERMS(op, name) \
> > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > +
> > > +INT_PERMS (+, add)
> > > +INT_PERMS (-, sub)
> > > +INT_PERMS (*, mul)
> > > +INT_PERMS (|, ior)
> > > +INT_PERMS (^, xor)
> > > +INT_PERMS (&, and)
> > > +INT_PERMS (<<, shl)
> > > +INT_PERMS (>>, shr)
> > > +FP_PERMS (+, add)
> > > +FP_PERMS (-, sub)
> > > +FP_PERMS (*, mul)
> > > +
> > > --
> > > 2.18.1
> > >
  
Richard Biener Nov. 10, 2022, 8:56 a.m. UTC | #4
On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
>
> Hi Prathamesh and Richard,
>
> Thanks for the review and nice suggestions!
>
> > > I guess the transform should work as long as mask is same for both
> > > vectors even if it's
> > > not constant ?
> >
> > Yes, please change accordingly (and maybe push separately).
> >
>
> Removed VECTOR_CST for integer ops.
>
> > > If this transform is meant only for VLS vectors, I guess you should
> > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > otherwise it will crash for VLA vectors.
> >
> > I suppose it's difficult to create a VLA permute that covers all elements
> > and that is not trivial though.  But indeed add ().is_constant to the
> > VECTOR_FLOAT_TYPE_P guard.
>
> Added.
>
> > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > (as said I can't think of a non-full encoding that isn't trivial
> > but covers all elements) and then simply .qsort () the vector_builder
> > (it derives
> > from vec<>) so the scan is O(n log n).
>
> The .qsort () approach requires an extra cmp_func that IMO would not
> be feasible to be implemented in match.pd (I suppose lambda function
> would not be a good idea either).
> Another solution would be using hash_set but it does not work here for
> int64_t or poly_int64 type.
> So I kept current O(n^2) simple code here, and I suppose usually the
> permutation indices would be a small number even for O(n^2)
> complexity.

Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
a lambda function is fine to use.  The alternative (used by the vectorizer
in some places) is to use sth like

 auto_sbitmap seen (nelts);
 for (i = 0; i < nelts; i++)
   {
     if (!bitmap_set_bit (seen, i))
       break;
     count++;
   }
 full_perm_p = count == nelts;

I'll note that you should still check .encoding ().encoded_full_vector_p ()
and only bother to check that case, that's a very simple check.

>
> Attached updated patch.
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年11月8日周二 22:38写道:
>
>
> >
> > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This is a follow-up patch for PR98167
> > > >
> > > > The sequence
> > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > >      c3 = c1 op c2
> > > > can be optimized to
> > > >      c = a op b
> > > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > > for all integer vector operation, and float operation with
> > > > full permutation.
> > > >
> > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > >
> > > > Ok for trunk?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         PR target/98167
> > > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         PR target/98167
> > > >         * gcc.target/i386/pr98167.c: New test.
> > > > ---
> > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > > >  2 files changed, 93 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > > >
> > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > index 194ba8f5188..b85ad34f609 100644
> > > > --- a/gcc/match.pd
> > > > +++ b/gcc/match.pd
> > > > @@ -8189,3 +8189,52 @@ and,
> > > >   (bit_and (negate @0) integer_onep@1)
> > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > >    (bit_and @0 @1)))
> > > > +
> > > > +/* Optimize
> > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > +   c3 = c1 op c2
> > > > +   -->
> > > > +   c = a op b
> > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > +   For all integer non-div operations.  */
> > > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > > +        lshift rshift)
> > > > + (simplify
> > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > Just wondering, why should mask be CST here ?
> > > I guess the transform should work as long as mask is same for both
> > > vectors even if it's
> > > not constant ?
> >
> > Yes, please change accordingly (and maybe push separately).
> >
> > > > +
> > > > +/* Similar for float arithmetic when permutation constant covers
> > > > +   all vector elements.  */
> > > > +(for op (plus minus mult)
> > > > + (simplify
> > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > +     (with
> > > > +      {
> > > > +       tree perm_cst = @2;
> > > > +       vec_perm_builder builder;
> > > > +       bool full_perm_p = false;
> > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > +         {
> > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > If this transform is meant only for VLS vectors, I guess you should
> > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > otherwise it will crash for VLA vectors.
> >
> > I suppose it's difficult to create a VLA permute that covers all elements
> > and that is not trivial though.  But indeed add ().is_constant to the
> > VECTOR_FLOAT_TYPE_P guard.
> >
> > >
> > > Thanks,
> > > Prathamesh
> > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > +
> > > > +           /* Check if perm indices covers all vector elements.  */
> > > > +           int count = 0, i, j;
> > > > +           for (i = 0; i < nelts; i++)
> > > > +             for (j = 0; j < nelts; j++)
> >
> > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > (as said I can't think of a non-full encoding that isn't trivial
> > but covers all elements) and then simply .qsort () the vector_builder
> > (it derives
> > from vec<>) so the scan is O(n log n).
> >
> > Maybe Richard has a better idea here though.
> >
> > Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> > always that they explode the match decision tree, we'd ideally have a way to
> > match those with (op ..) (op ..) first to be able to share more of the matching
> > code.  That said, match.pd is a less than ideal place for these (but mostly
> > because of the way we code generate *-match.cc)
> >
> > Richard.
> >
> > > > +               {
> > > > +                 if (sel[j].to_constant () == i)
> > > > +                   {
> > > > +                     count++;
> > > > +                     break;
> > > > +                   }
> > > > +               }
> > > > +           full_perm_p = count == nelts;
> > > > +         }
> > > > +       }
> > > > +       (if (full_perm_p)
> > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > new file mode 100644
> > > > index 00000000000..40e0ac11332
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > @@ -0,0 +1,44 @@
> > > > +/* PR target/98167 */
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx2" } */
> > > > +
> > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > +
> > > > +#define VEC_PERM_4 \
> > > > +  2, 3, 1, 0
> > > > +#define VEC_PERM_8 \
> > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > +#define VEC_PERM_16 \
> > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > +
> > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > > +                                             v##size##s##type b) \
> > > > +  { \
> > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > +                                                  VEC_PERM_##size); \
> > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > +                                                  VEC_PERM_##size); \
> > > > +    return a1 op b1; \
> > > > +  }
> > > > +
> > > > +#define INT_PERMS(op, name) \
> > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > +
> > > > +#define FP_PERMS(op, name) \
> > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > +
> > > > +INT_PERMS (+, add)
> > > > +INT_PERMS (-, sub)
> > > > +INT_PERMS (*, mul)
> > > > +INT_PERMS (|, ior)
> > > > +INT_PERMS (^, xor)
> > > > +INT_PERMS (&, and)
> > > > +INT_PERMS (<<, shl)
> > > > +INT_PERMS (>>, shr)
> > > > +FP_PERMS (+, add)
> > > > +FP_PERMS (-, sub)
> > > > +FP_PERMS (*, mul)
> > > > +
> > > > --
> > > > 2.18.1
> > > >
  
Hongyu Wang Nov. 10, 2022, 2:27 p.m. UTC | #5
> Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> a lambda function is fine to use.  The alternative (used by the vectorizer
> in some places) is to use sth like
>
>  auto_sbitmap seen (nelts);
>  for (i = 0; i < nelts; i++)
>    {
>      if (!bitmap_set_bit (seen, i))
>        break;
>      count++;
>    }
>  full_perm_p = count == nelts;
>
> I'll note that you should still check .encoding ().encoded_full_vector_p ()
> and only bother to check that case, that's a very simple check.

Thanks for the good example! We also tried using wide_int as a bitmask
but your code looks more simple and reasonable.

Updated the patch accordingly.

Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四 16:56写道:


>
> On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> >
> > Hi Prathamesh and Richard,
> >
> > Thanks for the review and nice suggestions!
> >
> > > > I guess the transform should work as long as mask is same for both
> > > > vectors even if it's
> > > > not constant ?
> > >
> > > Yes, please change accordingly (and maybe push separately).
> > >
> >
> > Removed VECTOR_CST for integer ops.
> >
> > > > If this transform is meant only for VLS vectors, I guess you should
> > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > otherwise it will crash for VLA vectors.
> > >
> > > I suppose it's difficult to create a VLA permute that covers all elements
> > > and that is not trivial though.  But indeed add ().is_constant to the
> > > VECTOR_FLOAT_TYPE_P guard.
> >
> > Added.
> >
> > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > (as said I can't think of a non-full encoding that isn't trivial
> > > but covers all elements) and then simply .qsort () the vector_builder
> > > (it derives
> > > from vec<>) so the scan is O(n log n).
> >
> > The .qsort () approach requires an extra cmp_func that IMO would not
> > be feasible to be implemented in match.pd (I suppose lambda function
> > would not be a good idea either).
> > Another solution would be using hash_set but it does not work here for
> > int64_t or poly_int64 type.
> > So I kept current O(n^2) simple code here, and I suppose usually the
> > permutation indices would be a small number even for O(n^2)
> > complexity.
>
> Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> a lambda function is fine to use.  The alternative (used by the vectorizer
> in some places) is to use sth like
>
>  auto_sbitmap seen (nelts);
>  for (i = 0; i < nelts; i++)
>    {
>      if (!bitmap_set_bit (seen, i))
>        break;
>      count++;
>    }
>  full_perm_p = count == nelts;
>
> I'll note that you should still check .encoding ().encoded_full_vector_p ()
> and only bother to check that case, that's a very simple check.
>
> >
> > Attached updated patch.
> >
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年11月8日周二 22:38写道:
> >
> >
> > >
> > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > This is a follow-up patch for PR98167
> > > > >
> > > > > The sequence
> > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > >      c3 = c1 op c2
> > > > > can be optimized to
> > > > >      c = a op b
> > > > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > for all integer vector operation, and float operation with
> > > > > full permutation.
> > > > >
> > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > >
> > > > > Ok for trunk?
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         PR target/98167
> > > > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         PR target/98167
> > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > ---
> > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > > > >  2 files changed, 93 insertions(+)
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > > > >
> > > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > > index 194ba8f5188..b85ad34f609 100644
> > > > > --- a/gcc/match.pd
> > > > > +++ b/gcc/match.pd
> > > > > @@ -8189,3 +8189,52 @@ and,
> > > > >   (bit_and (negate @0) integer_onep@1)
> > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > >    (bit_and @0 @1)))
> > > > > +
> > > > > +/* Optimize
> > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > +   c3 = c1 op c2
> > > > > +   -->
> > > > > +   c = a op b
> > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > +   For all integer non-div operations.  */
> > > > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > > > +        lshift rshift)
> > > > > + (simplify
> > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > Just wondering, why should mask be CST here ?
> > > > I guess the transform should work as long as mask is same for both
> > > > vectors even if it's
> > > > not constant ?
> > >
> > > Yes, please change accordingly (and maybe push separately).
> > >
> > > > > +
> > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > +   all vector elements.  */
> > > > > +(for op (plus minus mult)
> > > > > + (simplify
> > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > +     (with
> > > > > +      {
> > > > > +       tree perm_cst = @2;
> > > > > +       vec_perm_builder builder;
> > > > > +       bool full_perm_p = false;
> > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > +         {
> > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > > If this transform is meant only for VLS vectors, I guess you should
> > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > otherwise it will crash for VLA vectors.
> > >
> > > I suppose it's difficult to create a VLA permute that covers all elements
> > > and that is not trivial though.  But indeed add ().is_constant to the
> > > VECTOR_FLOAT_TYPE_P guard.
> > >
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > +
> > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > +           int count = 0, i, j;
> > > > > +           for (i = 0; i < nelts; i++)
> > > > > +             for (j = 0; j < nelts; j++)
> > >
> > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > (as said I can't think of a non-full encoding that isn't trivial
> > > but covers all elements) and then simply .qsort () the vector_builder
> > > (it derives
> > > from vec<>) so the scan is O(n log n).
> > >
> > > Maybe Richard has a better idea here though.
> > >
> > > Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> > > always that they explode the match decision tree, we'd ideally have a way to
> > > match those with (op ..) (op ..) first to be able to share more of the matching
> > > code.  That said, match.pd is a less than ideal place for these (but mostly
> > > because of the way we code generate *-match.cc)
> > >
> > > Richard.
> > >
> > > > > +               {
> > > > > +                 if (sel[j].to_constant () == i)
> > > > > +                   {
> > > > > +                     count++;
> > > > > +                     break;
> > > > > +                   }
> > > > > +               }
> > > > > +           full_perm_p = count == nelts;
> > > > > +         }
> > > > > +       }
> > > > > +       (if (full_perm_p)
> > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > new file mode 100644
> > > > > index 00000000000..40e0ac11332
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > @@ -0,0 +1,44 @@
> > > > > +/* PR target/98167 */
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > +
> > > > > +#define VEC_PERM_4 \
> > > > > +  2, 3, 1, 0
> > > > > +#define VEC_PERM_8 \
> > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > +#define VEC_PERM_16 \
> > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > +
> > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > > > +                                             v##size##s##type b) \
> > > > > +  { \
> > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > +                                                  VEC_PERM_##size); \
> > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > +                                                  VEC_PERM_##size); \
> > > > > +    return a1 op b1; \
> > > > > +  }
> > > > > +
> > > > > +#define INT_PERMS(op, name) \
> > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > +
> > > > > +#define FP_PERMS(op, name) \
> > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > +
> > > > > +INT_PERMS (+, add)
> > > > > +INT_PERMS (-, sub)
> > > > > +INT_PERMS (*, mul)
> > > > > +INT_PERMS (|, ior)
> > > > > +INT_PERMS (^, xor)
> > > > > +INT_PERMS (&, and)
> > > > > +INT_PERMS (<<, shl)
> > > > > +INT_PERMS (>>, shr)
> > > > > +FP_PERMS (+, add)
> > > > > +FP_PERMS (-, sub)
> > > > > +FP_PERMS (*, mul)
> > > > > +
> > > > > --
> > > > > 2.18.1
> > > > >
  
Richard Biener Nov. 14, 2022, 2:53 p.m. UTC | #6
On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
>
> > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> > a lambda function is fine to use.  The alternative (used by the vectorizer
> > in some places) is to use sth like
> >
> >  auto_sbitmap seen (nelts);
> >  for (i = 0; i < nelts; i++)
> >    {
> >      if (!bitmap_set_bit (seen, i))
> >        break;
> >      count++;
> >    }
> >  full_perm_p = count == nelts;
> >
> > I'll note that you should still check .encoding ().encoded_full_vector_p ()
> > and only bother to check that case, that's a very simple check.
>
> Thanks for the good example! We also tried using wide_int as a bitmask
> but your code looks more simple and reasonable.
>
> Updated the patch accordingly.

OK.

Thanks,
Richard.

> Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四 16:56写道:
>
>
> >
> > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> > >
> > > Hi Prathamesh and Richard,
> > >
> > > Thanks for the review and nice suggestions!
> > >
> > > > > I guess the transform should work as long as mask is same for both
> > > > > vectors even if it's
> > > > > not constant ?
> > > >
> > > > Yes, please change accordingly (and maybe push separately).
> > > >
> > >
> > > Removed VECTOR_CST for integer ops.
> > >
> > > > > If this transform is meant only for VLS vectors, I guess you should
> > > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > otherwise it will crash for VLA vectors.
> > > >
> > > > I suppose it's difficult to create a VLA permute that covers all elements
> > > > and that is not trivial though.  But indeed add ().is_constant to the
> > > > VECTOR_FLOAT_TYPE_P guard.
> > >
> > > Added.
> > >
> > > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > > (as said I can't think of a non-full encoding that isn't trivial
> > > > but covers all elements) and then simply .qsort () the vector_builder
> > > > (it derives
> > > > from vec<>) so the scan is O(n log n).
> > >
> > > The .qsort () approach requires an extra cmp_func that IMO would not
> > > be feasible to be implemented in match.pd (I suppose lambda function
> > > would not be a good idea either).
> > > Another solution would be using hash_set but it does not work here for
> > > int64_t or poly_int64 type.
> > > So I kept current O(n^2) simple code here, and I suppose usually the
> > > permutation indices would be a small number even for O(n^2)
> > > complexity.
> >
> > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> > a lambda function is fine to use.  The alternative (used by the vectorizer
> > in some places) is to use sth like
> >
> >  auto_sbitmap seen (nelts);
> >  for (i = 0; i < nelts; i++)
> >    {
> >      if (!bitmap_set_bit (seen, i))
> >        break;
> >      count++;
> >    }
> >  full_perm_p = count == nelts;
> >
> > I'll note that you should still check .encoding ().encoded_full_vector_p ()
> > and only bother to check that case, that's a very simple check.
> >
> > >
> > > Attached updated patch.
> > >
> > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年11月8日周二 22:38写道:
> > >
> > >
> > > >
> > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > This is a follow-up patch for PR98167
> > > > > >
> > > > > > The sequence
> > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > >      c3 = c1 op c2
> > > > > > can be optimized to
> > > > > >      c = a op b
> > > > > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > for all integer vector operation, and float operation with
> > > > > > full permutation.
> > > > > >
> > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > >
> > > > > > Ok for trunk?
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         PR target/98167
> > > > > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         PR target/98167
> > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > ---
> > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > > > > >  2 files changed, 93 insertions(+)
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > >
> > > > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > > > index 194ba8f5188..b85ad34f609 100644
> > > > > > --- a/gcc/match.pd
> > > > > > +++ b/gcc/match.pd
> > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > >    (bit_and @0 @1)))
> > > > > > +
> > > > > > +/* Optimize
> > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > +   c3 = c1 op c2
> > > > > > +   -->
> > > > > > +   c = a op b
> > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > +   For all integer non-div operations.  */
> > > > > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > > > > +        lshift rshift)
> > > > > > + (simplify
> > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > Just wondering, why should mask be CST here ?
> > > > > I guess the transform should work as long as mask is same for both
> > > > > vectors even if it's
> > > > > not constant ?
> > > >
> > > > Yes, please change accordingly (and maybe push separately).
> > > >
> > > > > > +
> > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > +   all vector elements.  */
> > > > > > +(for op (plus minus mult)
> > > > > > + (simplify
> > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > +     (with
> > > > > > +      {
> > > > > > +       tree perm_cst = @2;
> > > > > > +       vec_perm_builder builder;
> > > > > > +       bool full_perm_p = false;
> > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > +         {
> > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > > > If this transform is meant only for VLS vectors, I guess you should
> > > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > otherwise it will crash for VLA vectors.
> > > >
> > > > I suppose it's difficult to create a VLA permute that covers all elements
> > > > and that is not trivial though.  But indeed add ().is_constant to the
> > > > VECTOR_FLOAT_TYPE_P guard.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > +
> > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > +           int count = 0, i, j;
> > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > +             for (j = 0; j < nelts; j++)
> > > >
> > > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > > (as said I can't think of a non-full encoding that isn't trivial
> > > > but covers all elements) and then simply .qsort () the vector_builder
> > > > (it derives
> > > > from vec<>) so the scan is O(n log n).
> > > >
> > > > Maybe Richard has a better idea here though.
> > > >
> > > > Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> > > > always that they explode the match decision tree, we'd ideally have a way to
> > > > match those with (op ..) (op ..) first to be able to share more of the matching
> > > > code.  That said, match.pd is a less than ideal place for these (but mostly
> > > > because of the way we code generate *-match.cc)
> > > >
> > > > Richard.
> > > >
> > > > > > +               {
> > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > +                   {
> > > > > > +                     count++;
> > > > > > +                     break;
> > > > > > +                   }
> > > > > > +               }
> > > > > > +           full_perm_p = count == nelts;
> > > > > > +         }
> > > > > > +       }
> > > > > > +       (if (full_perm_p)
> > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..40e0ac11332
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > @@ -0,0 +1,44 @@
> > > > > > +/* PR target/98167 */
> > > > > > +/* { dg-do compile } */
> > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > +
> > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > +
> > > > > > +#define VEC_PERM_4 \
> > > > > > +  2, 3, 1, 0
> > > > > > +#define VEC_PERM_8 \
> > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > +#define VEC_PERM_16 \
> > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > +
> > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > > > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > > > > +                                             v##size##s##type b) \
> > > > > > +  { \
> > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > +                                                  VEC_PERM_##size); \
> > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > +                                                  VEC_PERM_##size); \
> > > > > > +    return a1 op b1; \
> > > > > > +  }
> > > > > > +
> > > > > > +#define INT_PERMS(op, name) \
> > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > +
> > > > > > +#define FP_PERMS(op, name) \
> > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > +
> > > > > > +INT_PERMS (+, add)
> > > > > > +INT_PERMS (-, sub)
> > > > > > +INT_PERMS (*, mul)
> > > > > > +INT_PERMS (|, ior)
> > > > > > +INT_PERMS (^, xor)
> > > > > > +INT_PERMS (&, and)
> > > > > > +INT_PERMS (<<, shl)
> > > > > > +INT_PERMS (>>, shr)
> > > > > > +FP_PERMS (+, add)
> > > > > > +FP_PERMS (-, sub)
> > > > > > +FP_PERMS (*, mul)
> > > > > > +
> > > > > > --
> > > > > > 2.18.1
> > > > > >
  
Tamar Christina Nov. 16, 2022, 3:25 p.m. UTC | #7
Hi,

This patch is causing several ICEs because it changes the permutes from a single register
permute to a multi register due to the lowering of the expressions to different SSA names.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717

I have a prototype fix which adds a new rule to CSE this back to a single register permute,
but would this be the right solution? It seems hard to later on during expand realize that
the two operands are the same.

It's probably also ok to just block this from happening after vec_lower, however I'm worried that
If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.

Thanks,
Tamar

> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> Biener via Gcc-patches
> Sent: Monday, November 14, 2022 2:53 PM
> To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> patches@gcc.gnu.org
> Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> index and operation [PR98167]
> 
> On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> <wwwhhhyyy333@gmail.com> wrote:
> >
> > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > think a lambda function is fine to use.  The alternative (used by
> > > the vectorizer in some places) is to use sth like
> > >
> > >  auto_sbitmap seen (nelts);
> > >  for (i = 0; i < nelts; i++)
> > >    {
> > >      if (!bitmap_set_bit (seen, i))
> > >        break;
> > >      count++;
> > >    }
> > >  full_perm_p = count == nelts;
> > >
> > > I'll note that you should still check .encoding
> > > ().encoded_full_vector_p () and only bother to check that case, that's a
> very simple check.
> >
> > Thanks for the good example! We also tried using wide_int as a bitmask
> > but your code looks more simple and reasonable.
> >
> > Updated the patch accordingly.
> 
> OK.
> 
> Thanks,
> Richard.
> 
> > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四
> 16:56写道:
> >
> >
> > >
> > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> <wwwhhhyyy333@gmail.com> wrote:
> > > >
> > > > Hi Prathamesh and Richard,
> > > >
> > > > Thanks for the review and nice suggestions!
> > > >
> > > > > > I guess the transform should work as long as mask is same for
> > > > > > both vectors even if it's not constant ?
> > > > >
> > > > > Yes, please change accordingly (and maybe push separately).
> > > > >
> > > >
> > > > Removed VECTOR_CST for integer ops.
> > > >
> > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > otherwise it will crash for VLA vectors.
> > > > >
> > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > elements and that is not trivial though.  But indeed add
> > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > >
> > > > Added.
> > > >
> > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > encoding that isn't trivial but covers all elements) and then
> > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > the scan is O(n log n).
> > > >
> > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > function would not be a good idea either).
> > > > Another solution would be using hash_set but it does not work here
> > > > for int64_t or poly_int64 type.
> > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > the permutation indices would be a small number even for O(n^2)
> > > > complexity.
> > >
> > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > think a lambda function is fine to use.  The alternative (used by
> > > the vectorizer in some places) is to use sth like
> > >
> > >  auto_sbitmap seen (nelts);
> > >  for (i = 0; i < nelts; i++)
> > >    {
> > >      if (!bitmap_set_bit (seen, i))
> > >        break;
> > >      count++;
> > >    }
> > >  full_perm_p = count == nelts;
> > >
> > > I'll note that you should still check .encoding
> > > ().encoded_full_vector_p () and only bother to check that case, that's a
> very simple check.
> > >
> > > >
> > > > Attached updated patch.
> > > >
> > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > 于2022年11月8日周二 22:38写道:
> > > >
> > > >
> > > > >
> > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > This is a follow-up patch for PR98167
> > > > > > >
> > > > > > > The sequence
> > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > >      c3 = c1 op c2
> > > > > > > can be optimized to
> > > > > > >      c = a op b
> > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > operation, and float operation with full permutation.
> > > > > > >
> > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > >
> > > > > > > Ok for trunk?
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         PR target/98167
> > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> vector.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >         PR target/98167
> > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > ---
> > > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > ++++++++++++++++++++++
> > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > >
> > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > --- a/gcc/match.pd
> > > > > > > +++ b/gcc/match.pd
> > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > >    (bit_and @0 @1)))
> > > > > > > +
> > > > > > > +/* Optimize
> > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > +   c3 = c1 op c2
> > > > > > > +   -->
> > > > > > > +   c = a op b
> > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > +        lshift rshift)
> > > > > > > + (simplify
> > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> VECTOR_CST@2))
> > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > Just wondering, why should mask be CST here ?
> > > > > > I guess the transform should work as long as mask is same for
> > > > > > both vectors even if it's not constant ?
> > > > >
> > > > > Yes, please change accordingly (and maybe push separately).
> > > > >
> > > > > > > +
> > > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > +(simplify
> > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> VECTOR_CST@2))
> > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > +     (with
> > > > > > > +      {
> > > > > > > +       tree perm_cst = @2;
> > > > > > > +       vec_perm_builder builder;
> > > > > > > +       bool full_perm_p = false;
> > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > +         {
> > > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > +(type).to_constant ();
> > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > otherwise it will crash for VLA vectors.
> > > > >
> > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > elements and that is not trivial though.  But indeed add
> > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > +
> > > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > > +           int count = 0, i, j;
> > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > +             for (j = 0; j < nelts; j++)
> > > > >
> > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > encoding that isn't trivial but covers all elements) and then
> > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > the scan is O(n log n).
> > > > >
> > > > > Maybe Richard has a better idea here though.
> > > > >
> > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > ..)) patterns it's always that they explode the match decision
> > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > ..) first to be able to share more of the matching code.  That
> > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > because of the way we code generate *-match.cc)
> > > > >
> > > > > Richard.
> > > > >
> > > > > > > +               {
> > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > +                   {
> > > > > > > +                     count++;
> > > > > > > +                     break;
> > > > > > > +                   }
> > > > > > > +               }
> > > > > > > +           full_perm_p = count == nelts;
> > > > > > > +         }
> > > > > > > +       }
> > > > > > > +       (if (full_perm_p)
> > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > new file mode 100644
> > > > > > > index 00000000000..40e0ac11332
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > @@ -0,0 +1,44 @@
> > > > > > > +/* PR target/98167 */
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > +
> > > > > > > +#define VEC_PERM_4 \
> > > > > > > +  2, 3, 1, 0
> > > > > > > +#define VEC_PERM_8 \
> > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > +#define VEC_PERM_16 \
> > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > +
> > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > +((vector_size(4*size))); \
> > > > > > > +  v##size##s##type type##foo##size##i_##name
> (v##size##s##type a, \
> > > > > > > +
> > > > > > > +v##size##s##type b) \
> > > > > > > +  { \
> > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > +    return a1 op b1; \
> > > > > > > +  }
> > > > > > > +
> > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > +
> > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > +
> > > > > > > +INT_PERMS (+, add)
> > > > > > > +INT_PERMS (-, sub)
> > > > > > > +INT_PERMS (*, mul)
> > > > > > > +INT_PERMS (|, ior)
> > > > > > > +INT_PERMS (^, xor)
> > > > > > > +INT_PERMS (&, and)
> > > > > > > +INT_PERMS (<<, shl)
> > > > > > > +INT_PERMS (>>, shr)
> > > > > > > +FP_PERMS (+, add)
> > > > > > > +FP_PERMS (-, sub)
> > > > > > > +FP_PERMS (*, mul)
> > > > > > > +
> > > > > > > --
> > > > > > > 2.18.1
> > > > > > >
  
Richard Biener Nov. 16, 2022, 3:29 p.m. UTC | #8
On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
>
> Hi,
>
> This patch is causing several ICEs because it changes the permutes from a single register
> permute to a multi register due to the lowering of the expressions to different SSA names.
>
> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
>
> I have a prototype fix which adds a new rule to CSE this back to a single register permute,
> but would this be the right solution? It seems hard to later on during expand realize that
> the two operands are the same.
>
> It's probably also ok to just block this from happening after vec_lower, however I'm worried that
> If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.

You can use

 (vec_perm (op@7 @0 @1) @3)

to avoid this issue.

> Thanks,
> Tamar
>
> > -----Original Message-----
> > From: Gcc-patches <gcc-patches-
> > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > Biener via Gcc-patches
> > Sent: Monday, November 14, 2022 2:53 PM
> > To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> > Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> > Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> > <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> > patches@gcc.gnu.org
> > Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> > index and operation [PR98167]
> >
> > On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> > <wwwhhhyyy333@gmail.com> wrote:
> > >
> > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > think a lambda function is fine to use.  The alternative (used by
> > > > the vectorizer in some places) is to use sth like
> > > >
> > > >  auto_sbitmap seen (nelts);
> > > >  for (i = 0; i < nelts; i++)
> > > >    {
> > > >      if (!bitmap_set_bit (seen, i))
> > > >        break;
> > > >      count++;
> > > >    }
> > > >  full_perm_p = count == nelts;
> > > >
> > > > I'll note that you should still check .encoding
> > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > very simple check.
> > >
> > > Thanks for the good example! We also tried using wide_int as a bitmask
> > > but your code looks more simple and reasonable.
> > >
> > > Updated the patch accordingly.
> >
> > OK.
> >
> > Thanks,
> > Richard.
> >
> > > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四
> > 16:56写道:
> > >
> > >
> > > >
> > > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> > <wwwhhhyyy333@gmail.com> wrote:
> > > > >
> > > > > Hi Prathamesh and Richard,
> > > > >
> > > > > Thanks for the review and nice suggestions!
> > > > >
> > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > both vectors even if it's not constant ?
> > > > > >
> > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > >
> > > > >
> > > > > Removed VECTOR_CST for integer ops.
> > > > >
> > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > otherwise it will crash for VLA vectors.
> > > > > >
> > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > elements and that is not trivial though.  But indeed add
> > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > >
> > > > > Added.
> > > > >
> > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > the scan is O(n log n).
> > > > >
> > > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > > function would not be a good idea either).
> > > > > Another solution would be using hash_set but it does not work here
> > > > > for int64_t or poly_int64 type.
> > > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > > the permutation indices would be a small number even for O(n^2)
> > > > > complexity.
> > > >
> > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > think a lambda function is fine to use.  The alternative (used by
> > > > the vectorizer in some places) is to use sth like
> > > >
> > > >  auto_sbitmap seen (nelts);
> > > >  for (i = 0; i < nelts; i++)
> > > >    {
> > > >      if (!bitmap_set_bit (seen, i))
> > > >        break;
> > > >      count++;
> > > >    }
> > > >  full_perm_p = count == nelts;
> > > >
> > > > I'll note that you should still check .encoding
> > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > very simple check.
> > > >
> > > > >
> > > > > Attached updated patch.
> > > > >
> > > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > > 于2022年11月8日周二 22:38写道:
> > > > >
> > > > >
> > > > > >
> > > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > This is a follow-up patch for PR98167
> > > > > > > >
> > > > > > > > The sequence
> > > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > >      c3 = c1 op c2
> > > > > > > > can be optimized to
> > > > > > > >      c = a op b
> > > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > > operation, and float operation with full permutation.
> > > > > > > >
> > > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > > >
> > > > > > > > Ok for trunk?
> > > > > > > >
> > > > > > > > gcc/ChangeLog:
> > > > > > > >
> > > > > > > >         PR target/98167
> > > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> > vector.
> > > > > > > >
> > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > >
> > > > > > > >         PR target/98167
> > > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > > ---
> > > > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > > ++++++++++++++++++++++
> > > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > >
> > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > > --- a/gcc/match.pd
> > > > > > > > +++ b/gcc/match.pd
> > > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > > >    (bit_and @0 @1)))
> > > > > > > > +
> > > > > > > > +/* Optimize
> > > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > +   c3 = c1 op c2
> > > > > > > > +   -->
> > > > > > > > +   c = a op b
> > > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > > +        lshift rshift)
> > > > > > > > + (simplify
> > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > VECTOR_CST@2))
> > > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > > Just wondering, why should mask be CST here ?
> > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > both vectors even if it's not constant ?
> > > > > >
> > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > >
> > > > > > > > +
> > > > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > > +(simplify
> > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > VECTOR_CST@2))
> > > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > > +     (with
> > > > > > > > +      {
> > > > > > > > +       tree perm_cst = @2;
> > > > > > > > +       vec_perm_builder builder;
> > > > > > > > +       bool full_perm_p = false;
> > > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > > +         {
> > > > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > > +(type).to_constant ();
> > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > otherwise it will crash for VLA vectors.
> > > > > >
> > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > elements and that is not trivial though.  But indeed add
> > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > > +
> > > > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > > > +           int count = 0, i, j;
> > > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > > +             for (j = 0; j < nelts; j++)
> > > > > >
> > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > the scan is O(n log n).
> > > > > >
> > > > > > Maybe Richard has a better idea here though.
> > > > > >
> > > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > > ..)) patterns it's always that they explode the match decision
> > > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > > ..) first to be able to share more of the matching code.  That
> > > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > > because of the way we code generate *-match.cc)
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > > +               {
> > > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > > +                   {
> > > > > > > > +                     count++;
> > > > > > > > +                     break;
> > > > > > > > +                   }
> > > > > > > > +               }
> > > > > > > > +           full_perm_p = count == nelts;
> > > > > > > > +         }
> > > > > > > > +       }
> > > > > > > > +       (if (full_perm_p)
> > > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > new file mode 100644
> > > > > > > > index 00000000000..40e0ac11332
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > @@ -0,0 +1,44 @@
> > > > > > > > +/* PR target/98167 */
> > > > > > > > +/* { dg-do compile } */
> > > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > > +
> > > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > > +
> > > > > > > > +#define VEC_PERM_4 \
> > > > > > > > +  2, 3, 1, 0
> > > > > > > > +#define VEC_PERM_8 \
> > > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > > +#define VEC_PERM_16 \
> > > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > > +
> > > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > > +((vector_size(4*size))); \
> > > > > > > > +  v##size##s##type type##foo##size##i_##name
> > (v##size##s##type a, \
> > > > > > > > +
> > > > > > > > +v##size##s##type b) \
> > > > > > > > +  { \
> > > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > +    return a1 op b1; \
> > > > > > > > +  }
> > > > > > > > +
> > > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > > +
> > > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > > +
> > > > > > > > +INT_PERMS (+, add)
> > > > > > > > +INT_PERMS (-, sub)
> > > > > > > > +INT_PERMS (*, mul)
> > > > > > > > +INT_PERMS (|, ior)
> > > > > > > > +INT_PERMS (^, xor)
> > > > > > > > +INT_PERMS (&, and)
> > > > > > > > +INT_PERMS (<<, shl)
> > > > > > > > +INT_PERMS (>>, shr)
> > > > > > > > +FP_PERMS (+, add)
> > > > > > > > +FP_PERMS (-, sub)
> > > > > > > > +FP_PERMS (*, mul)
> > > > > > > > +
> > > > > > > > --
> > > > > > > > 2.18.1
> > > > > > > >
  
Richard Biener Nov. 16, 2022, 3:30 p.m. UTC | #9
On Wed, Nov 16, 2022 at 4:29 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
> >
> > Hi,
> >
> > This patch is causing several ICEs because it changes the permutes from a single register
> > permute to a multi register due to the lowering of the expressions to different SSA names.
> >
> > See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
> >
> > I have a prototype fix which adds a new rule to CSE this back to a single register permute,
> > but would this be the right solution? It seems hard to later on during expand realize that
> > the two operands are the same.
> >
> > It's probably also ok to just block this from happening after vec_lower, however I'm worried that
> > If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.
>
> You can use
>
>  (vec_perm (op@7 @0 @1) @3)

Err, (vec_perm (op@7 @0 @1) @7) obviously.

> to avoid this issue.
>
> > Thanks,
> > Tamar
> >
> > > -----Original Message-----
> > > From: Gcc-patches <gcc-patches-
> > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > > Biener via Gcc-patches
> > > Sent: Monday, November 14, 2022 2:53 PM
> > > To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> > > Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> > > Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> > > <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> > > patches@gcc.gnu.org
> > > Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> > > index and operation [PR98167]
> > >
> > > On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> > > <wwwhhhyyy333@gmail.com> wrote:
> > > >
> > > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > > think a lambda function is fine to use.  The alternative (used by
> > > > > the vectorizer in some places) is to use sth like
> > > > >
> > > > >  auto_sbitmap seen (nelts);
> > > > >  for (i = 0; i < nelts; i++)
> > > > >    {
> > > > >      if (!bitmap_set_bit (seen, i))
> > > > >        break;
> > > > >      count++;
> > > > >    }
> > > > >  full_perm_p = count == nelts;
> > > > >
> > > > > I'll note that you should still check .encoding
> > > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > > very simple check.
> > > >
> > > > Thanks for the good example! We also tried using wide_int as a bitmask
> > > > but your code looks more simple and reasonable.
> > > >
> > > > Updated the patch accordingly.
> > >
> > > OK.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四
> > > 16:56写道:
> > > >
> > > >
> > > > >
> > > > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> > > <wwwhhhyyy333@gmail.com> wrote:
> > > > > >
> > > > > > Hi Prathamesh and Richard,
> > > > > >
> > > > > > Thanks for the review and nice suggestions!
> > > > > >
> > > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > > both vectors even if it's not constant ?
> > > > > > >
> > > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > > >
> > > > > >
> > > > > > Removed VECTOR_CST for integer ops.
> > > > > >
> > > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > > otherwise it will crash for VLA vectors.
> > > > > > >
> > > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > > elements and that is not trivial though.  But indeed add
> > > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > >
> > > > > > Added.
> > > > > >
> > > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > > the scan is O(n log n).
> > > > > >
> > > > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > > > function would not be a good idea either).
> > > > > > Another solution would be using hash_set but it does not work here
> > > > > > for int64_t or poly_int64 type.
> > > > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > > > the permutation indices would be a small number even for O(n^2)
> > > > > > complexity.
> > > > >
> > > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > > think a lambda function is fine to use.  The alternative (used by
> > > > > the vectorizer in some places) is to use sth like
> > > > >
> > > > >  auto_sbitmap seen (nelts);
> > > > >  for (i = 0; i < nelts; i++)
> > > > >    {
> > > > >      if (!bitmap_set_bit (seen, i))
> > > > >        break;
> > > > >      count++;
> > > > >    }
> > > > >  full_perm_p = count == nelts;
> > > > >
> > > > > I'll note that you should still check .encoding
> > > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > > very simple check.
> > > > >
> > > > > >
> > > > > > Attached updated patch.
> > > > > >
> > > > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > > > 于2022年11月8日周二 22:38写道:
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > This is a follow-up patch for PR98167
> > > > > > > > >
> > > > > > > > > The sequence
> > > > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > >      c3 = c1 op c2
> > > > > > > > > can be optimized to
> > > > > > > > >      c = a op b
> > > > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > > > operation, and float operation with full permutation.
> > > > > > > > >
> > > > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > > > >
> > > > > > > > > Ok for trunk?
> > > > > > > > >
> > > > > > > > > gcc/ChangeLog:
> > > > > > > > >
> > > > > > > > >         PR target/98167
> > > > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> > > vector.
> > > > > > > > >
> > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > >
> > > > > > > > >         PR target/98167
> > > > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > > > ---
> > > > > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > > > ++++++++++++++++++++++
> > > > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > >
> > > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > > > --- a/gcc/match.pd
> > > > > > > > > +++ b/gcc/match.pd
> > > > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > > > >    (bit_and @0 @1)))
> > > > > > > > > +
> > > > > > > > > +/* Optimize
> > > > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > > +   c3 = c1 op c2
> > > > > > > > > +   -->
> > > > > > > > > +   c = a op b
> > > > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > > > +        lshift rshift)
> > > > > > > > > + (simplify
> > > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > > VECTOR_CST@2))
> > > > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > > > Just wondering, why should mask be CST here ?
> > > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > > both vectors even if it's not constant ?
> > > > > > >
> > > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > > >
> > > > > > > > > +
> > > > > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > > > +(simplify
> > > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > > VECTOR_CST@2))
> > > > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > > > +     (with
> > > > > > > > > +      {
> > > > > > > > > +       tree perm_cst = @2;
> > > > > > > > > +       vec_perm_builder builder;
> > > > > > > > > +       bool full_perm_p = false;
> > > > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > > > +         {
> > > > > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > > > +(type).to_constant ();
> > > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > > otherwise it will crash for VLA vectors.
> > > > > > >
> > > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > > elements and that is not trivial though.  But indeed add
> > > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > > > +
> > > > > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > > > > +           int count = 0, i, j;
> > > > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > > > +             for (j = 0; j < nelts; j++)
> > > > > > >
> > > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > > the scan is O(n log n).
> > > > > > >
> > > > > > > Maybe Richard has a better idea here though.
> > > > > > >
> > > > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > > > ..)) patterns it's always that they explode the match decision
> > > > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > > > ..) first to be able to share more of the matching code.  That
> > > > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > > > because of the way we code generate *-match.cc)
> > > > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > > +               {
> > > > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > > > +                   {
> > > > > > > > > +                     count++;
> > > > > > > > > +                     break;
> > > > > > > > > +                   }
> > > > > > > > > +               }
> > > > > > > > > +           full_perm_p = count == nelts;
> > > > > > > > > +         }
> > > > > > > > > +       }
> > > > > > > > > +       (if (full_perm_p)
> > > > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 00000000000..40e0ac11332
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > @@ -0,0 +1,44 @@
> > > > > > > > > +/* PR target/98167 */
> > > > > > > > > +/* { dg-do compile } */
> > > > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > > > +
> > > > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > > > +
> > > > > > > > > +#define VEC_PERM_4 \
> > > > > > > > > +  2, 3, 1, 0
> > > > > > > > > +#define VEC_PERM_8 \
> > > > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > > > +#define VEC_PERM_16 \
> > > > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > > > +
> > > > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > > > +((vector_size(4*size))); \
> > > > > > > > > +  v##size##s##type type##foo##size##i_##name
> > > (v##size##s##type a, \
> > > > > > > > > +
> > > > > > > > > +v##size##s##type b) \
> > > > > > > > > +  { \
> > > > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > > +    return a1 op b1; \
> > > > > > > > > +  }
> > > > > > > > > +
> > > > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > > > +
> > > > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > > > +
> > > > > > > > > +INT_PERMS (+, add)
> > > > > > > > > +INT_PERMS (-, sub)
> > > > > > > > > +INT_PERMS (*, mul)
> > > > > > > > > +INT_PERMS (|, ior)
> > > > > > > > > +INT_PERMS (^, xor)
> > > > > > > > > +INT_PERMS (&, and)
> > > > > > > > > +INT_PERMS (<<, shl)
> > > > > > > > > +INT_PERMS (>>, shr)
> > > > > > > > > +FP_PERMS (+, add)
> > > > > > > > > +FP_PERMS (-, sub)
> > > > > > > > > +FP_PERMS (*, mul)
> > > > > > > > > +
> > > > > > > > > --
> > > > > > > > > 2.18.1
> > > > > > > > >
  
Tamar Christina Nov. 16, 2022, 3:34 p.m. UTC | #10
> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Wednesday, November 16, 2022 3:30 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Hongyu Wang <wwwhhhyyy333@gmail.com>; Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org>; Richard Sandiford
> <Richard.Sandiford@arm.com>; Hongyu Wang <hongyu.wang@intel.com>;
> hongtao.liu@intel.com; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> index and operation [PR98167]
> 
> On Wed, Nov 16, 2022 at 4:29 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina
> <Tamar.Christina@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > This patch is causing several ICEs because it changes the permutes
> > > from a single register permute to a multi register due to the lowering of
> the expressions to different SSA names.
> > >
> > > See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
> > >
> > > I have a prototype fix which adds a new rule to CSE this back to a
> > > single register permute, but would this be the right solution? It
> > > seems hard to later on during expand realize that the two operands are
> the same.
> > >
> > > It's probably also ok to just block this from happening after
> > > vec_lower, however I'm worried that If it wasn't CSE'd before vec_lower
> it'll lower it so something much less efficient.
> >
> > You can use
> >
> >  (vec_perm (op@7 @0 @1) @3)
> 
> Err, (vec_perm (op@7 @0 @1) @7) obviously.

Oh wow, I had no idea you could capture during rewrite!

That's nifty and good to know.

I'll regtest and submit patch.

Thanks,
Tamar

> 
> > to avoid this issue.
> >
> > > Thanks,
> > > Tamar
> > >
> > > > -----Original Message-----
> > > > From: Gcc-patches <gcc-patches-
> > > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > > > Biener via Gcc-patches
> > > > Sent: Monday, November 14, 2022 2:53 PM
> > > > To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> > > > Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> > > > Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> > > > <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> > > > patches@gcc.gnu.org
> > > > Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same
> permutation
> > > > index and operation [PR98167]
> > > >
> > > > On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> > > > <wwwhhhyyy333@gmail.com> wrote:
> > > > >
> > > > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.
> > > > > > I think a lambda function is fine to use.  The alternative
> > > > > > (used by the vectorizer in some places) is to use sth like
> > > > > >
> > > > > >  auto_sbitmap seen (nelts);
> > > > > >  for (i = 0; i < nelts; i++)
> > > > > >    {
> > > > > >      if (!bitmap_set_bit (seen, i))
> > > > > >        break;
> > > > > >      count++;
> > > > > >    }
> > > > > >  full_perm_p = count == nelts;
> > > > > >
> > > > > > I'll note that you should still check .encoding
> > > > > > ().encoded_full_vector_p () and only bother to check that
> > > > > > case, that's a
> > > > very simple check.
> > > > >
> > > > > Thanks for the good example! We also tried using wide_int as a
> > > > > bitmask but your code looks more simple and reasonable.
> > > > >
> > > > > Updated the patch accordingly.
> > > >
> > > > OK.
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周
> 四
> > > > 16:56写道:
> > > > >
> > > > >
> > > > > >
> > > > > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> > > > <wwwhhhyyy333@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Prathamesh and Richard,
> > > > > > >
> > > > > > > Thanks for the review and nice suggestions!
> > > > > > >
> > > > > > > > > I guess the transform should work as long as mask is
> > > > > > > > > same for both vectors even if it's not constant ?
> > > > > > > >
> > > > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > > > >
> > > > > > >
> > > > > > > Removed VECTOR_CST for integer ops.
> > > > > > >
> > > > > > > > > If this transform is meant only for VLS vectors, I guess
> > > > > > > > > you should bail out if TYPE_VECTOR_SUBPARTS is not
> > > > > > > > > constant, otherwise it will crash for VLA vectors.
> > > > > > > >
> > > > > > > > I suppose it's difficult to create a VLA permute that
> > > > > > > > covers all elements and that is not trivial though.  But
> > > > > > > > indeed add ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > > >
> > > > > > > Added.
> > > > > > >
> > > > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > > > ().encoded_full_vector_p () (as said I can't think of a
> > > > > > > > non-full encoding that isn't trivial but covers all
> > > > > > > > elements) and then simply .qsort () the vector_builder (it
> > > > > > > > derives from vec<>) so the scan is O(n log n).
> > > > > > >
> > > > > > > The .qsort () approach requires an extra cmp_func that IMO
> > > > > > > would not be feasible to be implemented in match.pd (I
> > > > > > > suppose lambda function would not be a good idea either).
> > > > > > > Another solution would be using hash_set but it does not
> > > > > > > work here for int64_t or poly_int64 type.
> > > > > > > So I kept current O(n^2) simple code here, and I suppose
> > > > > > > usually the permutation indices would be a small number even
> > > > > > > for O(n^2) complexity.
> > > > > >
> > > > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.
> > > > > > I think a lambda function is fine to use.  The alternative
> > > > > > (used by the vectorizer in some places) is to use sth like
> > > > > >
> > > > > >  auto_sbitmap seen (nelts);
> > > > > >  for (i = 0; i < nelts; i++)
> > > > > >    {
> > > > > >      if (!bitmap_set_bit (seen, i))
> > > > > >        break;
> > > > > >      count++;
> > > > > >    }
> > > > > >  full_perm_p = count == nelts;
> > > > > >
> > > > > > I'll note that you should still check .encoding
> > > > > > ().encoded_full_vector_p () and only bother to check that
> > > > > > case, that's a
> > > > very simple check.
> > > > > >
> > > > > > >
> > > > > > > Attached updated patch.
> > > > > > >
> > > > > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > > > > 于2022年11月8日周二 22:38写道:
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > This is a follow-up patch for PR98167
> > > > > > > > > >
> > > > > > > > > > The sequence
> > > > > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > > >      c3 = c1 op c2
> > > > > > > > > > can be optimized to
> > > > > > > > > >      c = a op b
> > > > > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer
> > > > > > > > > > vector operation, and float operation with full permutation.
> > > > > > > > > >
> > > > > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > > > > >
> > > > > > > > > > Ok for trunk?
> > > > > > > > > >
> > > > > > > > > > gcc/ChangeLog:
> > > > > > > > > >
> > > > > > > > > >         PR target/98167
> > > > > > > > > >         * match.pd: New perm + vector op patterns for
> > > > > > > > > > int and fp
> > > > vector.
> > > > > > > > > >
> > > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > > >
> > > > > > > > > >         PR target/98167
> > > > > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > > > > ---
> > > > > > > > > >  gcc/match.pd                            | 49
> +++++++++++++++++++++++++
> > > > > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > > > > ++++++++++++++++++++++
> > > > > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > >
> > > > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > > > > --- a/gcc/match.pd
> > > > > > > > > > +++ b/gcc/match.pd
> > > > > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > > > > >    (bit_and @0 @1)))
> > > > > > > > > > +
> > > > > > > > > > +/* Optimize
> > > > > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > > > +   c3 = c1 op c2
> > > > > > > > > > +   -->
> > > > > > > > > > +   c = a op b
> > > > > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > > > > +   For all integer non-div operations.  */ (for op
> > > > > > > > > > +(plus minus mult bit_and bit_ior bit_xor
> > > > > > > > > > +        lshift rshift)  (simplify
> > > > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1
> @1
> > > > VECTOR_CST@2))
> > > > > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > > > > Just wondering, why should mask be CST here ?
> > > > > > > > > I guess the transform should work as long as mask is
> > > > > > > > > same for both vectors even if it's not constant ?
> > > > > > > >
> > > > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +/* Similar for float arithmetic when permutation constant
> covers
> > > > > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > > > > +(simplify
> > > > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1
> @1
> > > > VECTOR_CST@2))
> > > > > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > > > > +     (with
> > > > > > > > > > +      {
> > > > > > > > > > +       tree perm_cst = @2;
> > > > > > > > > > +       vec_perm_builder builder;
> > > > > > > > > > +       bool full_perm_p = false;
> > > > > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > > > > +         {
> > > > > > > > > > +           /* Create a vec_perm_indices for the integer vector.
> */
> > > > > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > > > > +(type).to_constant ();
> > > > > > > > > If this transform is meant only for VLS vectors, I guess
> > > > > > > > > you should bail out if TYPE_VECTOR_SUBPARTS is not
> > > > > > > > > constant, otherwise it will crash for VLA vectors.
> > > > > > > >
> > > > > > > > I suppose it's difficult to create a VLA permute that
> > > > > > > > covers all elements and that is not trivial though.  But
> > > > > > > > indeed add ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > > > > +
> > > > > > > > > > +           /* Check if perm indices covers all vector elements.
> */
> > > > > > > > > > +           int count = 0, i, j;
> > > > > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > > > > +             for (j = 0; j < nelts; j++)
> > > > > > > >
> > > > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > > > ().encoded_full_vector_p () (as said I can't think of a
> > > > > > > > non-full encoding that isn't trivial but covers all
> > > > > > > > elements) and then simply .qsort () the vector_builder (it
> > > > > > > > derives from vec<>) so the scan is O(n log n).
> > > > > > > >
> > > > > > > > Maybe Richard has a better idea here though.
> > > > > > > >
> > > > > > > > Otherwise looks OK, though with these kind of (* (op ..)
> > > > > > > > (op
> > > > > > > > ..)) patterns it's always that they explode the match
> > > > > > > > decision tree, we'd ideally have a way to match those with
> > > > > > > > (op ..) (op
> > > > > > > > ..) first to be able to share more of the matching code.
> > > > > > > > That said, match.pd is a less than ideal place for these
> > > > > > > > (but mostly because of the way we code generate
> > > > > > > > *-match.cc)
> > > > > > > >
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > > > > +               {
> > > > > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > > > > +                   {
> > > > > > > > > > +                     count++;
> > > > > > > > > > +                     break;
> > > > > > > > > > +                   }
> > > > > > > > > > +               }
> > > > > > > > > > +           full_perm_p = count == nelts;
> > > > > > > > > > +         }
> > > > > > > > > > +       }
> > > > > > > > > > +       (if (full_perm_p)
> > > > > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > > new file mode 100644
> > > > > > > > > > index 00000000000..40e0ac11332
> > > > > > > > > > --- /dev/null
> > > > > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > > @@ -0,0 +1,44 @@
> > > > > > > > > > +/* PR target/98167 */
> > > > > > > > > > +/* { dg-do compile } */
> > > > > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > > > > +
> > > > > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 }
> > > > > > > > > > +} */
> > > > > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3
> > > > > > > > > > +} } */
> > > > > > > > > > +
> > > > > > > > > > +#define VEC_PERM_4 \
> > > > > > > > > > +  2, 3, 1, 0
> > > > > > > > > > +#define VEC_PERM_8 \
> > > > > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0 #define VEC_PERM_16 \
> > > > > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1,
> > > > > > > > > > +0
> > > > > > > > > > +
> > > > > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > > > > +((vector_size(4*size))); \
> > > > > > > > > > +  v##size##s##type type##foo##size##i_##name
> > > > (v##size##s##type a, \
> > > > > > > > > > +
> > > > > > > > > > +v##size##s##type b) \
> > > > > > > > > > +  { \
> > > > > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > > > +    return a1 op b1; \
> > > > > > > > > > +  }
> > > > > > > > > > +
> > > > > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > > > > +
> > > > > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > > > > +
> > > > > > > > > > +INT_PERMS (+, add)
> > > > > > > > > > +INT_PERMS (-, sub)
> > > > > > > > > > +INT_PERMS (*, mul)
> > > > > > > > > > +INT_PERMS (|, ior)
> > > > > > > > > > +INT_PERMS (^, xor)
> > > > > > > > > > +INT_PERMS (&, and)
> > > > > > > > > > +INT_PERMS (<<, shl)
> > > > > > > > > > +INT_PERMS (>>, shr)
> > > > > > > > > > +FP_PERMS (+, add)
> > > > > > > > > > +FP_PERMS (-, sub)
> > > > > > > > > > +FP_PERMS (*, mul)
> > > > > > > > > > +
> > > > > > > > > > --
> > > > > > > > > > 2.18.1
> > > > > > > > > >
  
Jakub Jelinek Nov. 16, 2022, 3:37 p.m. UTC | #11
On Wed, Nov 16, 2022 at 04:30:06PM +0100, Richard Biener via Gcc-patches wrote:
> On Wed, Nov 16, 2022 at 4:29 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > This patch is causing several ICEs because it changes the permutes from a single register
> > > permute to a multi register due to the lowering of the expressions to different SSA names.
> > >
> > > See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
> > >
> > > I have a prototype fix which adds a new rule to CSE this back to a single register permute,
> > > but would this be the right solution? It seems hard to later on during expand realize that
> > > the two operands are the same.
> > >
> > > It's probably also ok to just block this from happening after vec_lower, however I'm worried that
> > > If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.
> >
> > You can use
> >
> >  (vec_perm (op@7 @0 @1) @3)
> 
> Err, (vec_perm (op@7 @0 @1) @7) obviously.

Even:

--- gcc/match.pd.jj	2022-11-15 07:56:05.240348804 +0100
+++ gcc/match.pd	2022-11-16 16:35:34.854080956 +0100
@@ -8259,7 +8259,7 @@ and,
  (simplify
   (op (vec_perm @0 @0 @2) (vec_perm @1 @1 @2))
    (if (VECTOR_INTEGER_TYPE_P (type))
-    (vec_perm (op @0 @1) (op @0 @1) @2))))
+    (vec_perm (op@3 @0 @1) @3 @2))))
 
 /* Similar for float arithmetic when permutation constant covers
    all vector elements.  */
@@ -8298,4 +8298,4 @@ and,
 	 }
       }
       (if (full_perm_p)
-	(vec_perm (op @0 @1) (op @0 @1) @2))))))
+	(vec_perm (op@3 @0 @1) @3 @2))))))

From quick look at the dumps, it seems to work fine.

	Jakub
  
Marc Glisse Nov. 16, 2022, 7:21 p.m. UTC | #12
On Fri, 4 Nov 2022, Hongyu Wang via Gcc-patches wrote:

> This is a follow-up patch for PR98167
>
> The sequence
>     c1 = VEC_PERM_EXPR (a, a, mask)
>     c2 = VEC_PERM_EXPR (b, b, mask)
>     c3 = c1 op c2
> can be optimized to
>     c = a op b
>     c3 = VEC_PERM_EXPR (c, c, mask)
> for all integer vector operation, and float operation with
> full permutation.

Hello,

I assume the "full permutation" condition is to avoid performing some 
extra operations that would raise exception flags. If so, are there 
conditions (-fno-trapping-math?) where the transformation would be safe 
with arbitrary shuffles?
  
Hongyu Wang Nov. 17, 2022, 6:05 a.m. UTC | #13
> I assume the "full permutation" condition is to avoid performing some
> extra operations that would raise exception flags. If so, are there
> conditions (-fno-trapping-math?) where the transformation would be safe
> with arbitrary shuffles?

Yes, that could be an alternative choice with -fno-trapping-math on
arbitrary shuffles. I may have a follow-up patch about this after the
ICE is fixed. Thanks.


> --
> Marc Glisse
  

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index 194ba8f5188..b85ad34f609 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8189,3 +8189,52 @@  and,
  (bit_and (negate @0) integer_onep@1)
  (if (!TYPE_OVERFLOW_SANITIZED (type))
   (bit_and @0 @1)))
+
+/* Optimize
+   c1 = VEC_PERM_EXPR (a, a, mask)
+   c2 = VEC_PERM_EXPR (b, b, mask)
+   c3 = c1 op c2
+   -->
+   c = a op b
+   c3 = VEC_PERM_EXPR (c, c, mask)
+   For all integer non-div operations.  */
+(for op (plus minus mult bit_and bit_ior bit_xor
+	 lshift rshift)
+ (simplify
+  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
+    (if (VECTOR_INTEGER_TYPE_P (type))
+     (vec_perm (op @0 @1) (op @0 @1) @2))))
+
+/* Similar for float arithmetic when permutation constant covers
+   all vector elements.  */
+(for op (plus minus mult)
+ (simplify
+  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
+    (if (VECTOR_FLOAT_TYPE_P (type))
+     (with
+      {
+	tree perm_cst = @2;
+	vec_perm_builder builder;
+	bool full_perm_p = false;
+	if (tree_to_vec_perm_builder (&builder, perm_cst))
+	  {
+	    /* Create a vec_perm_indices for the integer vector.  */
+	    int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+	    vec_perm_indices sel (builder, 1, nelts);
+
+	    /* Check if perm indices covers all vector elements.  */
+	    int count = 0, i, j;
+	    for (i = 0; i < nelts; i++)
+	      for (j = 0; j < nelts; j++)
+		{
+		  if (sel[j].to_constant () == i)
+		    {
+		      count++;
+		      break;
+		    }
+		}
+	    full_perm_p = count == nelts;
+	  }
+       }
+       (if (full_perm_p)
+	(vec_perm (op @0 @1) (op @0 @1) @2))))))
diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
new file mode 100644
index 00000000000..40e0ac11332
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr98167.c
@@ -0,0 +1,44 @@ 
+/* PR target/98167 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2" } */
+
+/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
+/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
+
+#define VEC_PERM_4 \
+  2, 3, 1, 0
+#define VEC_PERM_8 \
+  4, 5, 6, 7, 3, 2, 1, 0
+#define VEC_PERM_16 \
+  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
+
+#define TYPE_PERM_OP(type, size, op, name) \
+  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
+  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
+					      v##size##s##type b) \
+  { \
+    v##size##s##type a1 = __builtin_shufflevector (a, a, \
+						   VEC_PERM_##size); \
+    v##size##s##type b1 = __builtin_shufflevector (b, b, \
+						   VEC_PERM_##size); \
+    return a1 op b1; \
+  }
+
+#define INT_PERMS(op, name) \
+  TYPE_PERM_OP (int, 4, op, name) \
+
+#define FP_PERMS(op, name) \
+  TYPE_PERM_OP (float, 4, op, name) \
+
+INT_PERMS (+, add)
+INT_PERMS (-, sub)
+INT_PERMS (*, mul)
+INT_PERMS (|, ior)
+INT_PERMS (^, xor)
+INT_PERMS (&, and)
+INT_PERMS (<<, shl)
+INT_PERMS (>>, shr)
+FP_PERMS (+, add)
+FP_PERMS (-, sub)
+FP_PERMS (*, mul)
+