Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner

Message ID CAAgBjMmRt3H9KVPCT3z5YHRRmDJw6K2rHVKc9+cTCeLTi9EWqQ@mail.gmail.com

Commit Message

Prathamesh Kulkarni Oct. 10, 2022, 10:48 a.m. UTC
  On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> >> For num_poly_int_coeffs == 2,
> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> >> If a1/trunc n1 succeeds,
> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> >> So, a1 has to be < n1.coeffs[0] ?
> >
> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> >
> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> >
> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
>
> Sorry, should have been:
>
>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
Hi Richard,
Thanks for the clarifications, and sorry for the late reply.
I have attached a POC patch that tries to implement the above approach.
It passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.

For VLA vectors, I have only done limited testing so far.
It seems to pass a couple of tests written in the patch for
nelts_per_pattern == 3,
and folds the following svld1rq test:
int32x4_t v = {1, 2, 3, 4};
return svld1rq_s32 (svptrue_b8 (), &v[0]);
into:
return {1, 2, 3, 4, ...};
I will try to bootstrap+test it on an SVE machine to test VLA folding further.

I have a couple of questions:
1] When the mask selects elements from the same vector but from different
patterns. For example:
arg0 = {1, 11, 2, 12, 3, 13, ...},
arg1 = {21, 31, 22, 32, 23, 33, ...},
mask = {0, 0, 0, 1, 0, 2, ... },
All have npatterns = 2, nelts_per_pattern = 3.

With the above mask,
Pattern {0, ...} selects arg0[0], i.e. {1, ...}
Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], i.e. {1, 11, 2, ...}
While arg0[0] and arg0[2] belong to the same pattern, arg0[1] belongs to a
different pattern in arg0.
The result is:
res = {1, 1, 1, 11, 1, 2, ...}
In this case, res's 2nd pattern {1, 11, 2, ...} is encoded
with a0 = 1, a1 = 11, S = -9.
Is that expected, though? It seems to create a new encoding which
wasn't present in the input vector. For instance, the next element in
the sequence would be -7,
which is not present originally in arg0.
I suppose it's fine, since if the user defines the mask to have the pattern
{0, 1, 2, ...},
they intended the result to have a pattern with the above encoding.
I just wanted to confirm whether this is correct.

2] Could you please suggest a test-case for S < 0 ?
I am not able to come up with one :/
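
To make question 1 concrete, here is a minimal standalone sketch of how a stepped encoding is extended past its encoded elements. It is not GCC code: EncodedVec and decode_elt are hypothetical names, and the extension rule (step = third encoded element minus second) is an assumption based on the a0/a1/S description in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a VECTOR_CST encoding with integer elements.
struct EncodedVec {
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<int64_t> elems;  // npatterns * nelts_per_pattern values
};

// Return element I of the (conceptually unbounded) vector described by V.
int64_t decode_elt(const EncodedVec &v, uint64_t i) {
  assert(v.nelts_per_pattern >= 1 && v.nelts_per_pattern <= 3);
  assert(v.elems.size()
         == static_cast<std::size_t>(v.npatterns) * v.nelts_per_pattern);

  uint64_t pattern = i % v.npatterns;  // which interleaved pattern
  uint64_t k = i / v.npatterns;        // position inside that pattern

  // Explicitly encoded elements are returned as-is.
  if (k < v.nelts_per_pattern)
    return v.elems[k * v.npatterns + pattern];

  // Beyond the encoding: repeat the last element (1 or 2 elements per
  // pattern), or continue with a constant step (3 elements per pattern).
  int64_t last = v.elems[(v.nelts_per_pattern - 1) * v.npatterns + pattern];
  if (v.nelts_per_pattern < 3)
    return last;
  int64_t prev = v.elems[v.npatterns + pattern];
  int64_t step = last - prev;
  return last + (static_cast<int64_t>(k) - 2) * step;
}
```

With the res encoding from question 1 (npatterns = 2, nelts_per_pattern = 3, elements {1, 1, 1, 11, 1, 2}), decode_elt (res, 7) continues the second pattern with step -9 and gives -7, the value that does not occur in arg0.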

Thanks,
Prathamesh
>
> > which is an interleaving of the two patterns:
> >
> >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
  

Comments

Prathamesh Kulkarni Oct. 17, 2022, 10:32 a.m. UTC | #1
On Mon, 10 Oct 2022 at 16:18, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > >> For num_poly_int_coeffs == 2,
> > >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > >> If a1/trunc n1 succeeds,
> > >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > >> So, a1 has to be < n1.coeffs[0] ?
> > >
> > > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > >
> > > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > >
> > >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> >
> > Sorry, should have been:
> >
> >   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> Hi Richard,
> Thanks for the clarifications, and sorry for late reply.
> I have attached POC patch that tries to implement the above approach.
> Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
>
> For VLA vectors, I have only done limited testing so far.
> It seems to pass couple of tests written in the patch for
> nelts_per_pattern == 3,
> and folds the following svld1rq test:
> int32x4_t v = {1, 2, 3, 4};
> return svld1rq_s32 (svptrue_b8 (), &v[0])
> into:
> return {1, 2, 3, 4, ...};
> I will try to bootstrap+test it on SVE machine to test further for VLA folding.
With the attached patch, bootstrap+test seems to pass with SVE enabled.
The only difference w.r.t. the previous patch is that
get_vector_for_pattern
now checks whether S is constant, and returns NULL_TREE otherwise.

I added this check because 930325-1.c ICE'd with the previous patch:
it had the following VEC_PERM_EXPR,
where S was non-constant:
vect__16.13_70 = VEC_PERM_EXPR <vect__16.12_69, vect__16.12_69, {
POLY_INT_CST [3, 4], POLY_INT_CST [6, 8], POLY_INT_CST [9, 12], ...
}>;
I am not sure how to proceed in this case, so chose to bail out.
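
A minimal sketch of that check, outside GCC: Poly2 is a hypothetical two-coefficient stand-in for poly_int64 (value c0 + c1*x, with x only known at run time), and constant_step is an illustrative name, not a function from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a two-coefficient poly_int64: c0 + c1 * x.
struct Poly2 {
  int64_t c0;
  int64_t c1;
};

Poly2 operator-(Poly2 a, Poly2 b) { return {a.c0 - b.c0, a.c1 - b.c1}; }

// A poly value is a compile-time constant when it does not depend on x.
bool is_constant(Poly2 p) { return p.c1 == 0; }

// Step of a stepped pattern {a0, a1, a2, ...} is S = a2 - a1.
// Store S and return true only when S is independent of x.
bool constant_step(Poly2 a1, Poly2 a2, int64_t *step) {
  assert(step != nullptr);
  Poly2 diff = a2 - a1;
  if (!is_constant(diff))
    return false;  // the case the patch bails out on
  *step = diff.c0;
  return true;
}
```

For the selector above, a1 = [6, 8] and a2 = [9, 12] give a difference of [3, 4], which depends on x, so constant_step returns false. For a TRN1-style pattern such as {2 + 2x, 4 + 2x, 6 + 2x} the difference is [2, 0] and the step is the constant 2.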

Thanks,
Prathamesh

>
> I have a couple of questions:
> 1] When mask selects elements from same vector but from different patterns:
> For eg:
> arg0 = {1, 11, 2, 12, 3, 13, ...},
> arg1 = {21, 31, 22, 32, 23, 33, ...},
> mask = {0, 0, 0, 1, 0, 2, ... },
> All have npatterns = 2, nelts_per_pattern = 3.
>
> With above mask,
> Pattern {0, ...} selects arg0[0], ie {1, ...}
> Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> pattern in arg0.
> The result is:
> res = {1, 1, 1, 11, 1, 2, ...}
> In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> with a0 = 1, a1 = 11, S = -9.
> Is that expected tho ? It seems to create a new encoding which
> wasn't present in the input vector. For instance, the next elem in
> sequence would be -7,
> which is not present originally in arg0.
> I suppose it's fine since if the user defines mask to have pattern {0,
> 1, 2, ...}
> they intended result to have pattern with above encoding.
> Just wanted to confirm if this is correct ?
>
> 2] Could you please suggest a test-case for S < 0 ?
> I am not able to come up with one :/
>
> Thanks,
> Prathamesh
> >
> > > which is an interleaving of the two patterns:
> > >
> > >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> > >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 9f7beae14e5..e93f2c7b592 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10494,38 +10497,56 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 			  build_zero_cst (itype));
 }
 
+/* Check if PATTERN in SEL selects either ARG0 or ARG1,
+   and return the selected arg, otherwise return NULL_TREE.  */
 
-/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
-   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
-   true if successful.  */
-
-static bool
-vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
+static tree
+get_vector_for_pattern (tree arg0, tree arg1,
+			const vec_perm_indices &sel, unsigned pattern)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
+  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+  poly_uint64 nsel = sel.length ();
+  poly_uint64 esel;
+
+  if (!multiple_p (nsel, sel_npatterns, &esel))
+    return NULL_TREE;
+
+  poly_uint64 a1 = sel[pattern + sel_npatterns];
+  int64_t S = 0;
+  if (sel_nelts_per_pattern == 3)
     {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      poly_uint64 diff = a2 - a1;
+      if (!diff.is_constant ())
+	return NULL_TREE;
+      S = diff.to_constant ();
     }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+  
+  poly_uint64 ae = a1 + (esel - 2) * S;
+  uint64_t q1, qe;
+  poly_uint64 r1, re;
+
+  if (!can_div_trunc_p (a1, n1, &q1, &r1)
+      || !can_div_trunc_p (ae, n1, &qe, &re)
+      || (q1 != qe))
+    return NULL_TREE;
+
+  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
+
+  if (S < 0)
     {
-      constructor_elt *elt;
+      poly_uint64 a0 = sel[pattern];
+      if (!known_eq (S, a1 - a0))
+        return NULL_TREE;
 
-      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
-	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
-	  return false;
-	else
-	  elts[i] = elt->value;
+      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
+        return NULL_TREE;
     }
-  else
-    return false;
-  for (; i < nelts; i++)
-    elts[i]
-      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
-  return true;
+  
+  return arg;
 }
 
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
@@ -10539,41 +10560,112 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
   unsigned HOST_WIDE_INT nelts;
   bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
-  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
-  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
-      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
+  unsigned res_npatterns = 0;
+  unsigned res_nelts_per_pattern = 0;
+  unsigned sel_npatterns = 0;
+  tree *vector_for_pattern = NULL;
+
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST
+      && !sel.length ().is_constant ())
+    {
+      sel_npatterns = sel.encoding ().npatterns ();
+      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
+      for (unsigned i = 0; i < sel_npatterns; i++)
+	{
+	  tree op = get_vector_for_pattern (arg0, arg1, sel, i);
+	  if (!op)
+	    return NULL_TREE;
+	  vector_for_pattern[i] = op;
+	}
+
+      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
+      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
+
+      res_npatterns
+        = least_common_multiple (sel_npatterns,
+				 least_common_multiple (arg0_npatterns,
+				 			arg1_npatterns));
+      res_nelts_per_pattern
+	= std::max(sel.encoding ().nelts_per_pattern (),
+		   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+			     VECTOR_CST_NELTS_PER_PATTERN (arg1)));
+    }
+  else if (sel.length ().is_constant (&nelts)
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
+    {
+      /* For VLS vectors, treat all vectors with
+	 npatterns = nelts, nelts_per_pattern = 1. */
+      res_npatterns = sel_npatterns = nelts;
+      res_nelts_per_pattern = 1;
+      vector_for_pattern = XALLOCAVEC (tree, nelts);
+      for (unsigned i = 0; i < nelts; i++)
+        {
+	  HOST_WIDE_INT index;
+	  if (!sel[i].is_constant (&index))
+	    return NULL_TREE;
+	  vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;	
+	}
+    }
+  else
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
-  for (i = 0; i < nelts; i++)
+  tree_vector_builder out_elts (type, res_npatterns,
+				res_nelts_per_pattern);
+  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
+  for (unsigned i = 0; i < res_nelts; i++)
     {
-      HOST_WIDE_INT index;
-      if (!sel[i].is_constant (&index))
+      poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+
+      /* Divide sel[i] by input vector length, to obtain remainder,
+	 which would be the index for either input vector.  */
+      if (!can_div_trunc_p (sel[i], n1, &q, &r))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+
+      unsigned HOST_WIDE_INT index;
+      if (!r.is_constant (&index))
+	return NULL_TREE;
+
+      /* For VLA vectors, i % sel_npatterns would give the pattern
+         in sel that ith elem belongs to.
+	 For VLS vectors, sel_npatterns == res_nelts == nelts,
+	 so i % sel_npatterns == i since i < nelts */
+      tree arg = vector_for_pattern[i % sel_npatterns];
+      tree elem;
+      if (TREE_CODE (arg) == CONSTRUCTOR)
+        {
+	  gcc_assert (index < nelts);
+	  if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
+	    return NULL_TREE;
+	  elem = CONSTRUCTOR_ELT (arg, index)->value;
+	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
+	    return NULL_TREE;
+	  need_ctor = true;
+	}
+      else
+        elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
     }
 
   if (need_ctor)
     {
       vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
+      vec_alloc (v, res_nelts);
+      for (i = 0; i < res_nelts; i++)
 	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
       return build_constructor (type, v);
     }
-  else
-    return out_elts.build ();
+  return out_elts.build ();
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16910,6 +17002,97 @@ test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  //machine_mode vmode = VNx4SImode;
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
+  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
+
+  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
+  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should select arg0.  */
+  {
+    int mask_elems[] = {0, 1, 2};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    unsigned res_nelts = vector_cst_encoded_nelts (res);
+    for (unsigned i = 0; i < res_nelts; i++)
+      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
+				    VECTOR_CST_ELT (arg0, i), 0));
+  }
+
+  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should return NULL because for len = 4 + 4x,
+     if x == 0, we select from arg1
+     if x > 0, we select from arg0
+     and thus cannot determine result at compile time.  */
+  {
+    int mask_elems[] = {4, 5, 6};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    gcc_assert (res == NULL_TREE);
+  }
+
+  /* Case 3:
+     mask: {0, 0, 0, 1, 0, 2, ...} 
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0], ie, 1.
+     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
+     so res = {1, 1, 1, 11, 1, 2, ...}.  */
+  {
+    int mask_elems[] = {0, 0, 0, 1, 0, 2};
+    tree mask = build_vec_int_cst (2, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    /* Check encoding: {1, 11, 2, ...} */
+    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2};
+    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
+  }
+
+  /* Case 4:
+     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0]
+     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
+     a1 = 5 + 4x
+     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
+        = 5 + 6x
+     Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ...
+     res: {1, 21, 1, 31, 1, 22, ... }
+     FIXME: How to build vector with poly_int elems ?  */
+
+  /* Case 5: S < 0.  */
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16918,6 +17101,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest
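
The input selection done by get_vector_for_pattern in the patch above can be illustrated with a simplified standalone model. This is a sketch under stated assumptions, not GCC's can_div_trunc_p: Poly2, div_trunc_const_quotient and input_for_index are hypothetical names, and the model only handles two non-negative coefficients with x >= 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a two-coefficient poly value: c0 + c1 * x, x >= 0.
struct Poly2 {
  int64_t c0;
  int64_t c1;
};

// Simplified model of a truncating division with a constant quotient:
// succeed only if some constant q satisfies 0 <= a - q*b < b for every x >= 0.
bool div_trunc_const_quotient(Poly2 a, Poly2 b, int64_t *q, Poly2 *r) {
  assert(q != nullptr && r != nullptr);
  if (b.c0 <= 0 || b.c1 < 0 || a.c0 < 0 || a.c1 < 0)
    return false;

  // Any valid q must be the quotient at x == 0.
  int64_t cand = a.c0 / b.c0;
  Poly2 rem = {a.c0 - cand * b.c0, a.c1 - cand * b.c1};

  // rem >= 0 for all x >= 0 requires both coefficients to be non-negative.
  if (rem.c0 < 0 || rem.c1 < 0)
    return false;
  // rem < b for all x >= 0 requires b - rem to stay strictly positive.
  if (rem.c0 >= b.c0 || rem.c1 > b.c1)
    return false;

  *q = cand;
  *r = rem;
  return true;
}

// Return 0 for arg0, 1 for arg1, or -1 if the input cannot be determined
// at compile time. Mirrors the (q & 1) test in the patch.
int input_for_index(Poly2 a, Poly2 n1) {
  int64_t q;
  Poly2 r;
  if (!div_trunc_const_quotient(a, n1, &q, &r))
    return -1;
  return (q & 1) == 0 ? 0 : 1;
}
```

With n1 = 4 + 4x, the indices 5 + 4x and 5 + 6x from Case 4 both give quotient 1 and so select arg1, while the constant index 4 from Case 2 has no constant quotient (it is in arg1 for x == 0 and in arg0 for x > 0), so the model returns -1.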
  
Prathamesh Kulkarni Oct. 24, 2022, 8:12 a.m. UTC | #2
On Mon, 17 Oct 2022 at 16:02, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 10 Oct 2022 at 16:18, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > > >> For num_poly_int_coeffs == 2,
> > > >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > > >> If a1/trunc n1 succeeds,
> > > >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > > >> So, a1 has to be < n1.coeffs[0] ?
> > > >
> > > > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > > >
> > > > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > > >
> > > >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > >
> > > Sorry, should have been:
> > >
> > >   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > Hi Richard,
> > Thanks for the clarifications, and sorry for late reply.
> > I have attached POC patch that tries to implement the above approach.
> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> >
> > For VLA vectors, I have only done limited testing so far.
> > It seems to pass couple of tests written in the patch for
> > nelts_per_pattern == 3,
> > and folds the following svld1rq test:
> > int32x4_t v = {1, 2, 3, 4};
> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > into:
> > return {1, 2, 3, 4, ...};
> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> With the attached patch it seems to pass bootstrap+test with SVE enabled.
> The only difference w.r.t previous patch is it adds check in
> get_vector_for_pattern
> if S is constant otherwise returns NULL_TREE.
>
> I added this check because 930325-1.c ICE'd with previous patch
> because it had following vec_perm_expr,
> where S was non-constant:
> vect__16.13_70 = VEC_PERM_EXPR <vect__16.12_69, vect__16.12_69, {
> POLY_INT_CST [3, 4], POLY_INT_CST [6, 8], POLY_INT_CST [9, 12], ...
> }>;
> I am not sure how to proceed in this case, so chose to bail out.
Hi Richard,
ping https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603717.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
>
> >
> > I have a couple of questions:
> > 1] When mask selects elements from same vector but from different patterns:
> > For eg:
> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > mask = {0, 0, 0, 1, 0, 2, ... },
> > All have npatterns = 2, nelts_per_pattern = 3.
> >
> > With above mask,
> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > pattern in arg0.
> > The result is:
> > res = {1, 1, 1, 11, 1, 2, ...}
> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > with a0 = 1, a1 = 11, S = -9.
> > Is that expected tho ? It seems to create a new encoding which
> > wasn't present in the input vector. For instance, the next elem in
> > sequence would be -7,
> > which is not present originally in arg0.
> > I suppose it's fine since if the user defines mask to have pattern {0,
> > 1, 2, ...}
> > they intended result to have pattern with above encoding.
> > Just wanted to confirm if this is correct ?
> >
> > 2] Could you please suggest a test-case for S < 0 ?
> > I am not able to come up with one :/
> >
> > Thanks,
> > Prathamesh
> > >
> > > > which is an interleaving of the two patterns:
> > > >
> > > >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> > > >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
  
Richard Sandiford Oct. 26, 2022, 3:37 p.m. UTC | #3
Sorry for the slow response.  I wanted to find some time to think
about this a bit more.

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
>> >> For num_poly_int_coeffs == 2,
>> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
>> >> If a1/trunc n1 succeeds,
>> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
>> >> So, a1 has to be < n1.coeffs[0] ?
>> >
>> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
>> >
>> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
>> >
>> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
>>
>> Sorry, should have been:
>>
>>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> Hi Richard,
> Thanks for the clarifications, and sorry for late reply.
> I have attached POC patch that tries to implement the above approach.
> Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
>
> For VLA vectors, I have only done limited testing so far.
> It seems to pass couple of tests written in the patch for
> nelts_per_pattern == 3,
> and folds the following svld1rq test:
> int32x4_t v = {1, 2, 3, 4};
> return svld1rq_s32 (svptrue_b8 (), &v[0])
> into:
> return {1, 2, 3, 4, ...};
> I will try to bootstrap+test it on SVE machine to test further for VLA folding.
>
> I have a couple of questions:
> 1] When mask selects elements from same vector but from different patterns:
> For eg:
> arg0 = {1, 11, 2, 12, 3, 13, ...},
> arg1 = {21, 31, 22, 32, 23, 33, ...},
> mask = {0, 0, 0, 1, 0, 2, ... },
> All have npatterns = 2, nelts_per_pattern = 3.
>
> With above mask,
> Pattern {0, ...} selects arg0[0], ie {1, ...}
> Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> pattern in arg0.
> The result is:
> res = {1, 1, 1, 11, 1, 2, ...}
> In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> with a0 = 1, a1 = 11, S = -9.
> Is that expected tho ? It seems to create a new encoding which
> wasn't present in the input vector. For instance, the next elem in
> sequence would be -7,
> which is not present originally in arg0.

Yeah, you're right, sorry.  Going back to:

(2) The explicit encoding can be used to produce a sequence of N*Ex*Px
    elements for any integer N.  This extended sequence can be reencoded
    as having N*Px patterns, with Ex staying the same.

I guess we need to pick an N for the selector such that each new
selector pattern (each one out of the N*Px patterns) selects from
the *same pattern* of the same data input.

So if a particular pattern in the selector has a step S, and the data
input it selects from has Pi patterns, N*S must be a multiple of Pi.
N must be a multiple of least_common_multiple(S,Pi)/S.

I think that means that the total number of patterns in the result
(Pr from previous messages) can safely be:

  Ps * least_common_multiple(
    least_common_multiple(S[1], P[input(1)]) / S[1],
    ...
    least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
  )

where:

  Ps = the number of patterns in the selector
  S[I] = the step for selector pattern I (I being 1-based)
  input(I) = the data input selected by selector pattern I (I being 1-based)
  P[I] = the number of patterns in data input I

That's getting quite complicated :-)  If we allow arbitrary P[...]
and S[...] then it could also get large.  Perhaps we should finally
give up on the general case and limit this to power-of-2 patterns and
power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
simplifies other things as well.

What do you think?
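
As a rough illustration of the formula above, here is a standalone sketch that computes the result pattern count. SelPattern and result_npatterns are hypothetical names. Two choices are assumptions that the formula does not spell out: a step of 0 is treated as imposing no constraint, and a negative step is handled through its absolute value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

static uint64_t gcd_u64(uint64_t a, uint64_t b) {
  while (b != 0) {
    uint64_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

static uint64_t lcm_u64(uint64_t a, uint64_t b) {
  assert(a != 0 && b != 0);
  return a / gcd_u64(a, b) * b;
}

// One selector pattern: its step S[I] and P[input(I)], the number of
// patterns in the data input it selects from.
struct SelPattern {
  int64_t step;
  unsigned input_npatterns;
};

// Pr = Ps * lcm over I of (lcm(S[I], P[input(I)]) / S[I]).
uint64_t result_npatterns(const std::vector<SelPattern> &sel) {
  uint64_t n = 1;
  for (const SelPattern &p : sel) {
    assert(p.input_npatterns != 0);
    if (p.step == 0)
      continue;  // assumption: a constant pattern works for any N
    // assumption: only the magnitude of the step matters here
    uint64_t s = static_cast<uint64_t>(std::llabs(p.step));
    n = lcm_u64(n, lcm_u64(s, p.input_npatterns) / s);
  }
  return sel.size() * n;
}
```

For the mask {0, 0, 0, 1, 0, 2, ...} discussed earlier (steps 0 and 1, both selecting from an input with 2 patterns) this gives 2 * 2 = 4 result patterns. When all steps and pattern counts are powers of 2, the inner and outer least_common_multiple calls reduce to a maximum.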

> I suppose it's fine since if the user defines mask to have pattern {0,
> 1, 2, ...}
> they intended result to have pattern with above encoding.
> Just wanted to confirm if this is correct ?
>
> 2] Could you please suggest a test-case for S < 0 ?
> I am not able to come up with one :/

svrev is one way of creating negative steps.
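
A small sketch of what such a selector looks like, assuming a reversal over n elements is encoded as the single stepped pattern {n - 1, n - 2, n - 3, ...}. Poly2, RevSelector, reverse_selector and selector_step are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a two-coefficient poly value: c0 + c1 * x.
struct Poly2 {
  int64_t c0;
  int64_t c1;
};

// First three encoded elements of a single-pattern reversing selector.
struct RevSelector {
  Poly2 a0;
  Poly2 a1;
  Poly2 a2;
};

// Build {n - 1, n - 2, n - 3} for a vector of n = c0 + c1 * x elements.
RevSelector reverse_selector(Poly2 n) {
  assert(n.c0 > 0 && n.c1 >= 0);
  return {{n.c0 - 1, n.c1}, {n.c0 - 2, n.c1}, {n.c0 - 3, n.c1}};
}

// Step of the pattern, S = a2 - a1.
Poly2 selector_step(const RevSelector &s) {
  return {s.a2.c0 - s.a1.c0, s.a2.c1 - s.a1.c1};
}
```

For n = 4 + 4x this gives {3 + 4x, 2 + 4x, 1 + 4x, ...}, whose step is the constant -1.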

Thanks,
Richard

>
> Thanks,
> Prathamesh
>>
>> > which is an interleaving of the two patterns:
>> >
>> >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
>> >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
  
Prathamesh Kulkarni Oct. 28, 2022, 2:46 p.m. UTC | #4
On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Sorry for the slow response.  I wanted to find some time to think
> about this a bit more.
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> >> >> For num_poly_int_coeffs == 2,
> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> >> >> If a1/trunc n1 succeeds,
> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> >> >> So, a1 has to be < n1.coeffs[0] ?
> >> >
> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> >> >
> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> >> >
> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> >>
> >> Sorry, should have been:
> >>
> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > Hi Richard,
> > Thanks for the clarifications, and sorry for late reply.
> > I have attached POC patch that tries to implement the above approach.
> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> >
> > For VLA vectors, I have only done limited testing so far.
> > It seems to pass couple of tests written in the patch for
> > nelts_per_pattern == 3,
> > and folds the following svld1rq test:
> > int32x4_t v = {1, 2, 3, 4};
> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > into:
> > return {1, 2, 3, 4, ...};
> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> >
> > I have a couple of questions:
> > 1] When mask selects elements from same vector but from different patterns:
> > For eg:
> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > mask = {0, 0, 0, 1, 0, 2, ... },
> > All have npatterns = 2, nelts_per_pattern = 3.
> >
> > With above mask,
> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > pattern in arg0.
> > The result is:
> > res = {1, 1, 1, 11, 1, 2, ...}
> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > with a0 = 1, a1 = 11, S = -9.
> > Is that expected tho ? It seems to create a new encoding which
> > wasn't present in the input vector. For instance, the next elem in
> > sequence would be -7,
> > which is not present originally in arg0.
>
> Yeah, you're right, sorry.  Going back to:
>
> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>     elements for any integer N.  This extended sequence can be reencoded
>     as having N*Px patterns, with Ex staying the same.
>
> I guess we need to pick an N for the selector such that each new
> selector pattern (each one out of the N*Px patterns) selects from
> the *same pattern* of the same data input.
>
> So if a particular pattern in the selector has a step S, and the data
> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> N must be a multiple of least_common_multiple(S,Pi)/S.
>
> I think that means that the total number of patterns in the result
> (Pr from previous messages) can safely be:
>
>   Ps * least_common_multiple(
>     least_common_multiple(S[1], P[input(1)]) / S[1],
>     ...
>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
>   )
>
> where:
>
>   Ps = the number of patterns in the selector
>   S[I] = the step for selector pattern I (I being 1-based)
>   input(I) = the data input selected by selector pattern I (I being 1-based)
>   P[I] = the number of patterns in data input I
>
> That's getting quite complicated :-)  If we allow arbitrary P[...]
> and S[...] then it could also get large.  Perhaps we should finally
> give up on the general case and limit this to power-of-2 patterns and
> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> simplifies other things as well.
>
> What do you think?
Hi Richard,
Thanks for the suggestions. Yeah, I suppose we can initially add support for
power-of-2 patterns and power-of-2 steps, and try to generalize it in
follow-up patches if possible.

Sorry if this sounds like a silly question: if each pattern in the
selector is going to select the *same pattern from the same input vector*,
then instead of re-encoding the selector to have N * Ps patterns, would it
make sense for the elements in the selector to denote the pattern number
itself instead of the element index
if the input vectors are VLA?

For eg:
op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
op1 = {...}
with npatterns == 4, nelts_per_pattern == 3,
sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
so, res = {1, 4, 1, 5, 1, 6, ...}
Not sure if this is correct tho.
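
A minimal sketch of this pattern-number idea, to make the example checkable. EncodedVec and select_patterns are hypothetical names, and the sketch only covers selection from a single input (op0).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an encoded VLA constant vector.
struct EncodedVec {
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<int> elems;  // npatterns * nelts_per_pattern values
};

// Build a result whose patterns are the patterns of IN listed in SEL,
// where SEL holds pattern numbers rather than element indices.
EncodedVec select_patterns(const EncodedVec &in,
                           const std::vector<unsigned> &sel) {
  assert(in.elems.size()
         == static_cast<std::size_t>(in.npatterns) * in.nelts_per_pattern);

  EncodedVec out{static_cast<unsigned>(sel.size()), in.nelts_per_pattern, {}};
  for (unsigned k = 0; k < in.nelts_per_pattern; ++k)
    for (unsigned p : sel) {
      assert(p < in.npatterns);
      // k-th encoded element of pattern p in the input
      out.elems.push_back(in.elems[k * in.npatterns + p]);
    }
  return out;
}
```

Applied to op0 above with sel = {0, 3}, this produces an encoding with npatterns = 2, nelts_per_pattern = 3 and elements {1, 4, 1, 5, 1, 6}, matching the res written out in the example.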

Thanks,
Prathamesh
>
> > I suppose it's fine since if the user defines mask to have pattern {0,
> > 1, 2, ...}
> > they intended result to have pattern with above encoding.
> > Just wanted to confirm if this is correct ?
> >
> > 2] Could you please suggest a test-case for S < 0 ?
> > I am not able to come up with one :/
>
> svrev is one way of creating negative steps.
>
> Thanks,
> Richard
>
> >
> > Thanks,
> > Prathamesh
> >>
> >> > which is an interleaving of the two patterns:
> >> >
> >> >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> >> >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
  
Richard Sandiford Oct. 31, 2022, 9:57 a.m. UTC | #5
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Sorry for the slow response.  I wanted to find some time to think
>> about this a bit more.
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
>> >> >> For num_poly_int_coeffs == 2,
>> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
>> >> >> If a1/trunc n1 succeeds,
>> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
>> >> >> So, a1 has to be < n1.coeffs[0] ?
>> >> >
>> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
>> >> >
>> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
>> >> >
>> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
>> >>
>> >> Sorry, should have been:
>> >>
>> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
>> > Hi Richard,
>> > Thanks for the clarifications, and sorry for late reply.
>> > I have attached POC patch that tries to implement the above approach.
>> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
>> >
>> > For VLA vectors, I have only done limited testing so far.
>> > It seems to pass couple of tests written in the patch for
>> > nelts_per_pattern == 3,
>> > and folds the following svld1rq test:
>> > int32x4_t v = {1, 2, 3, 4};
>> > return svld1rq_s32 (svptrue_b8 (), &v[0])
>> > into:
>> > return {1, 2, 3, 4, ...};
>> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
>> >
>> > I have a couple of questions:
>> > 1] When mask selects elements from same vector but from different patterns:
>> > For eg:
>> > arg0 = {1, 11, 2, 12, 3, 13, ...},
>> > arg1 = {21, 31, 22, 32, 23, 33, ...},
>> > mask = {0, 0, 0, 1, 0, 2, ... },
>> > All have npatterns = 2, nelts_per_pattern = 3.
>> >
>> > With above mask,
>> > Pattern {0, ...} selects arg0[0], ie {1, ...}
>> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
>> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
>> > pattern in arg0.
>> > The result is:
>> > res = {1, 1, 1, 11, 1, 2, ...}
>> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
>> > with a0 = 1, a1 = 11, S = -9.
>> > Is that expected tho ? It seems to create a new encoding which
>> > wasn't present in the input vector. For instance, the next elem in
>> > sequence would be -7,
>> > which is not present originally in arg0.
>>
>> Yeah, you're right, sorry.  Going back to:
>>
>> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>>     elements for any integer N.  This extended sequence can be reencoded
>>     as having N*Px patterns, with Ex staying the same.
>>
>> I guess we need to pick an N for the selector such that each new
>> selector pattern (each one out of the N*Px patterns) selects from
>> the *same pattern* of the same data input.
>>
>> So if a particular pattern in the selector has a step S, and the data
>> input it selects from has Pi patterns, N*S must be a multiple of Pi.
>> N must be a multiple of least_common_multiple(S,Pi)/S.
>>
>> I think that means that the total number of patterns in the result
>> (Pr from previous messages) can safely be:
>>
>>   Ps * least_common_multiple(
>>     least_common_multiple(S[1], P[input(1)]) / S[1],
>>     ...
>>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
>>   )
>>
>> where:
>>
>>   Ps = the number of patterns in the selector
>>   S[I] = the step for selector pattern I (I being 1-based)
>>   input(I) = the data input selected by selector pattern I (I being 1-based)
>>   P[I] = the number of patterns in data input I
>>
>> That's getting quite complicated :-)  If we allow arbitrary P[...]
>> and S[...] then it could also get large.  Perhaps we should finally
>> give up on the general case and limit this to power-of-2 patterns and
>> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
>> simplifies other things as well.
>>
>> What do you think?
> Hi Richard,
> Thanks for the suggestions. Yeah I suppose we can initially add support for
> power-of-2 patterns and power-of-2 steps and try to generalize it in
> follow up patches if possible.
>
> Sorry if this sounds like a silly ques -- if we are going to have
> pattern in selector, select *same pattern from same input vector*,
> instead of re-encoding the selector to have N * Ps patterns, would it
> make sense for elements in selector to denote pattern number itself
> instead of element index
> if input vectors are VLA ?
>
> For eg:
> op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> op1 = {...}
> with npatterns == 4, nelts_per_pattern == 3,
> sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> so, res = {1, 4, 1, 5, 1, 6, ...}
> Not sure if this is correct tho.

This wouldn't allow us to represent things like a "duplicate one
element", or "copy the leading N elements from the first input and
the other elements from elements N+ of the second input", which we
can with the current scheme.

The restriction about each (unwound) selector pattern selecting from the
same input pattern only applies to the case where the selector pattern is
stepped (and only applies to the stepped part of the pattern, not the
leading element).  The restriction is also local to this code; it
doesn't make other VEC_PERM_EXPRs invalid.

Thanks,
Richard

  
Prathamesh Kulkarni Nov. 4, 2022, 8:30 a.m. UTC | #6
On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Sorry for the slow response.  I wanted to find some time to think
> >> about this a bit more.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> >> >> >> For num_poly_int_coeffs == 2,
> >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> >> >> >> If a1/trunc n1 succeeds,
> >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> >> >> >> So, a1 has to be < n1.coeffs[0] ?
> >> >> >
> >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> >> >> >
> >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> >> >> >
> >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> >> >>
> >> >> Sorry, should have been:
> >> >>
> >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> >> > Hi Richard,
> >> > Thanks for the clarifications, and sorry for late reply.
> >> > I have attached POC patch that tries to implement the above approach.
> >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> >> >
> >> > For VLA vectors, I have only done limited testing so far.
> >> > It seems to pass couple of tests written in the patch for
> >> > nelts_per_pattern == 3,
> >> > and folds the following svld1rq test:
> >> > int32x4_t v = {1, 2, 3, 4};
> >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> >> > into:
> >> > return {1, 2, 3, 4, ...};
> >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> >> >
> >> > I have a couple of questions:
> >> > 1] When mask selects elements from same vector but from different patterns:
> >> > For eg:
> >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> >> > mask = {0, 0, 0, 1, 0, 2, ... },
> >> > All have npatterns = 2, nelts_per_pattern = 3.
> >> >
> >> > With above mask,
> >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> >> > pattern in arg0.
> >> > The result is:
> >> > res = {1, 1, 1, 11, 1, 2, ...}
> >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> >> > with a0 = 1, a1 = 11, S = -9.
> >> > Is that expected tho ? It seems to create a new encoding which
> >> > wasn't present in the input vector. For instance, the next elem in
> >> > sequence would be -7,
> >> > which is not present originally in arg0.
> >>
> >> Yeah, you're right, sorry.  Going back to:
> >>
> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >>     elements for any integer N.  This extended sequence can be reencoded
> >>     as having N*Px patterns, with Ex staying the same.
> >>
> >> I guess we need to pick an N for the selector such that each new
> >> selector pattern (each one out of the N*Px patterns) selects from
> >> the *same pattern* of the same data input.
> >>
> >> So if a particular pattern in the selector has a step S, and the data
> >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> >> N must be a multiple of least_common_multiple(S,Pi)/S.
> >>
> >> I think that means that the total number of patterns in the result
> >> (Pr from previous messages) can safely be:
> >>
> >>   Ps * least_common_multiple(
> >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> >>     ...
> >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> >>   )
> >>
> >> where:
> >>
> >>   Ps = the number of patterns in the selector
> >>   S[I] = the step for selector pattern I (I being 1-based)
> >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> >>   P[I] = the number of patterns in data input I
> >>
> >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> >> and S[...] then it could also get large.  Perhaps we should finally
> >> give up on the general case and limit this to power-of-2 patterns and
> >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> >> simplifies other things as well.
> >>
> >> What do you think?
> > Hi Richard,
> > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > follow up patches if possible.
> >
> > Sorry if this sounds like a silly ques -- if we are going to have
> > pattern in selector, select *same pattern from same input vector*,
> > instead of re-encoding the selector to have N * Ps patterns, would it
> > make sense for elements in selector to denote pattern number itself
> > instead of element index
> > if input vectors are VLA ?
> >
> > For eg:
> > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > op1 = {...}
> > with npatterns == 4, nelts_per_pattern == 3,
> > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > so, res = {1, 4, 1, 5, 1, 6, ...}
> > Not sure if this is correct tho.
>
> This wouldn't allow us to represent things like a "duplicate one
> element", or "copy the leading N elements from the first input and
> the other elements from elements N+ of the second input", which we
> can with the current scheme.
>
> The restriction about each (unwound) selector pattern selecting from the
> same input pattern only applies to case where the selector pattern is
> stepped (and only applies to the stepped part of the pattern, not the
> leading element).  The restriction is also local to this code; it
> doesn't make other VEC_PERM_EXPRs invalid.
Hi Richard,
Thanks for the clarifications.
Just to clarify your approach with an example:
Let selected input vector be:
arg0: {a0, b0, c0, d0,
          a0 + S, b0 + S, c0 + S, d0 + S,
          a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
where arg0 has npatterns = 4, and nelts_per_pattern = 3.

Let sel = {0, 0, 1, 2, 2, 4, ...}
where sel_npatterns = 2 and sel_nelts_per_pattern = 3

So, the first pattern in sel:
p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
which would be incorrect, since they belong to different patterns in arg0.
So to select elements from same pattern in arg0, we need to divide p1
into at least N1 = P_arg0 / S0 = 4 distinct patterns.

Similarly for second pattern in sel:
p2: {0, 2, 4, ...}, we need to divide it into
at least N2 = P_arg0 / S1 = 2 distinct patterns.

Select N = max(N1, N2) = 4
So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
and will be re-encoded with N*Ps = 8 patterns:

re-encoded sel:
{a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
...}

with 8 patterns,
p1: {a0, a0 + 2S, a0 + 4S, ...}
p2: {b0, b0 + 2S, b0 + 4S, ...}
...
which select elements from same pattern from same input vector.
Does this look correct ?

For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
arg1_npatterns are powers of 2 and that, for each stepped pattern,
its step size S is a power of 2. I suppose this will be sufficient
to ensure that sel can be re-encoded with N*Ps npatterns
such that each new pattern selects elements from the same pattern
of the input vector ?

Then compute N:
N = 1;
for (every pattern p in sel)
  {
     op = corresponding input vector for pattern;
     S = step_size (p);
     N_pattern = max (S, npatterns (op)) / S;
     N = max (N, N_pattern);
  }

and re-encode selector with N*Ps patterns.
I guess rest of the patch will mostly stay the same.
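
A runnable sketch of the feasibility check and the loop above, outside of GCC. SelPattern, is_pow2 and compute_reencode_factor are hypothetical names; treating zero or negative steps as contributing a factor of 1 is an assumption of this sketch, not something settled in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary of one selector pattern.
struct SelPattern {
  long long step;            // S; 0 means the pattern is not stepped
  unsigned input_npatterns;  // npatterns of the input vector it selects from
};

static bool is_pow2(unsigned long long x) {
  return x != 0 && (x & (x - 1)) == 0;
}

// Return false if the power-of-2 restrictions do not hold.  Otherwise store
// in *N_OUT the factor N by which sel_npatterns has to be multiplied.
bool compute_reencode_factor(unsigned sel_npatterns,
                             const std::vector<SelPattern>& patterns,
                             unsigned* n_out) {
  assert(n_out != nullptr);
  if (!is_pow2(sel_npatterns))
    return false;

  unsigned n = 1;
  for (const SelPattern& p : patterns) {
    if (!is_pow2(p.input_npatterns))
      return false;
    if (p.step <= 0)
      continue;  // assumption: only positive steps need re-encoding here
    if (!is_pow2(static_cast<unsigned long long>(p.step)))
      return false;
    unsigned step = static_cast<unsigned>(p.step);
    // For powers of two, lcm (S, P) / S is max (S, P) / S.
    n = std::max(n, std::max(step, p.input_npatterns) / step);
  }
  *n_out = n;
  return true;
}
```

For the example above (steps 1 and 2, both selecting from an input with 4 patterns) this yields N = 4, so the selector would be re-encoded with 4 * 2 = 8 patterns.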

Thanks,
Prathamesh

  
Prathamesh Kulkarni Nov. 21, 2022, 9:07 a.m. UTC | #7
On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >>
> > >> Sorry for the slow response.  I wanted to find some time to think
> > >> about this a bit more.
> > >>
> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > >> > <richard.sandiford@arm.com> wrote:
> > >> >>
> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > >> >> >> For num_poly_int_coeffs == 2,
> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > >> >> >> If a1/trunc n1 succeeds,
> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > >> >> >
> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > >> >> >
> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > >> >> >
> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > >> >>
> > >> >> Sorry, should have been:
> > >> >>
> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > >> > Hi Richard,
> > >> > Thanks for the clarifications, and sorry for late reply.
> > >> > I have attached POC patch that tries to implement the above approach.
> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > >> >
> > >> > For VLA vectors, I have only done limited testing so far.
> > >> > It seems to pass couple of tests written in the patch for
> > >> > nelts_per_pattern == 3,
> > >> > and folds the following svld1rq test:
> > >> > int32x4_t v = {1, 2, 3, 4};
> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > >> > into:
> > >> > return {1, 2, 3, 4, ...};
> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > >> >
> > >> > I have a couple of questions:
> > >> > 1] When mask selects elements from same vector but from different patterns:
> > >> > For eg:
> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > >> >
> > >> > With above mask,
> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > >> > pattern in arg0.
> > >> > The result is:
> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > >> > with a0 = 1, a1 = 11, S = -9.
> > >> > Is that expected tho ? It seems to create a new encoding which
> > >> > wasn't present in the input vector. For instance, the next elem in
> > >> > sequence would be -7,
> > >> > which is not present originally in arg0.
> > >>
> > >> Yeah, you're right, sorry.  Going back to:
> > >>
> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > >>     elements for any integer N.  This extended sequence can be reencoded
> > >>     as having N*Px patterns, with Ex staying the same.
> > >>
> > >> I guess we need to pick an N for the selector such that each new
> > >> selector pattern (each one out of the N*Px patterns) selects from
> > >> the *same pattern* of the same data input.
> > >>
> > >> So if a particular pattern in the selector has a step S, and the data
> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > >>
> > >> I think that means that the total number of patterns in the result
> > >> (Pr from previous messages) can safely be:
> > >>
> > >>   Ps * least_common_multiple(
> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > >>     ...
> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > >>   )
> > >>
> > >> where:
> > >>
> > >>   Ps = the number of patterns in the selector
> > >>   S[I] = the step for selector pattern I (I being 1-based)
> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > >>   P[I] = the number of patterns in data input I
> > >>
> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > >> and S[...] then it could also get large.  Perhaps we should finally
> > >> give up on the general case and limit this to power-of-2 patterns and
> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > >> simplifies other things as well.
> > >>
> > >> What do you think?
> > > Hi Richard,
> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > > follow up patches if possible.
> > >
> > > Sorry if this sounds like a silly ques -- if we are going to have
> > > pattern in selector, select *same pattern from same input vector*,
> > > instead of re-encoding the selector to have N * Ps patterns, would it
> > > make sense for elements in selector to denote pattern number itself
> > > instead of element index
> > > if input vectors are VLA ?
> > >
> > > For eg:
> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > > op1 = {...}
> > > with npatterns == 4, nelts_per_pattern == 3,
> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > > Not sure if this is correct tho.
> >
> > This wouldn't allow us to represent things like a "duplicate one
> > element", or "copy the leading N elements from the first input and
> > the other elements from elements N+ of the second input", which we
> > can with the current scheme.
> >
> > The restriction about each (unwound) selector pattern selecting from the
> > same input pattern only applies to case where the selector pattern is
> > stepped (and only applies to the stepped part of the pattern, not the
> > leading element).  The restriction is also local to this code; it
> > doesn't make other VEC_PERM_EXPRs invalid.
> Hi Richard,
> Thanks for the clarifications.
> Just to clarify your approach with an eg:
> Let selected input vector be:
> arg0: {a0, b0, c0, d0,
>           a0 + S, b0 + S, c0 + S, d0 + S,
>           a0 + 2S, b0 + 2S, c0 + 2S, dd + 2S, ...}
> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
>
> Let sel = {0, 0, 1, 2, 2, 4, ...}
> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
>
> So, the first pattern in sel:
> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> which would be incorrect, since they belong to different patterns in arg0.
> So to select elements from same pattern in arg0, we need to divide p1
> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
>
> Similarly for second pattern in sel:
> p2: {0, 2, 4, ...}, we need to divide it into
> at least N2 = P_arg0 / S1 = 2 distinct patterns.
>
> Select N = max(N1, N2) = 4
> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> and will be re-encoded with N*Ps = 8 patterns:
>
> re-encoded sel:
> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> a0 + 4S, b0 + 4S, c0 + 4s, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> ...}
>
> with 8 patterns,
> p1: {a0, a0 + 2S, a0 + 4S, ...}
> p2: {b0, b0 + 2S, b0 + 4S, ...}
> ...
> which select elements from same pattern from same input vector.
> Does this look correct ?
>
> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> arg1_npatterns are powers of 2 and for each stepped pattern,
> it's stepped size S is a power of 2. I suppose this will be sufficient
> to ensure that sel can be re-encoded with N*Ps npatterns
> such that each new pattern selects elements from same pattern
> of the input vector ?
>
> Then compute N:
> N = 1;
> for (every pattern p in sel)
>   {
>      op = corresponding input vector for pattern;
>      S = step_size (p);
>      N_pattern = max (S, npatterns (op)) / S;
>      N = max(N, N_pattern)
>   }
>
> and re-encode selector with N*Ps patterns.
> I guess rest of the patch will mostly stay the same.
Hi,
I have attached a POC patch based on the above approach.
For the above eg:
arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
and
sel = {0, 0, 0, 1, 0, 2, ...}
with sel_npatterns == 2 and sel_nelts_per_pattern == 3.

The pattern {0, 1, 2, ...} would select elements from different
patterns of arg0, which is incorrect.
So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
with following patterns:
p1 = { 0, ... }
p2 = { 0, 2, 4, ... }
p3 = { 0, ... }
p4 = { 1, 3, 5, ... }
which should be correct since each element from the respective
patterns in sel chooses
elements from same pattern from arg0.
So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
Does this look correct ?
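
The example can be checked with a small standalone model of the encoding. This is only a sketch: Encoded, element_at, reencode and fold_single_input are hypothetical names (not the functions in the attached patch), only a single data input is modelled, and indices are plain integers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: NPATTERNS interleaved patterns with
// NELTS_PER_PATTERN (1, 2 or 3) explicit elements each.
struct Encoded {
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<long long> elts;
};

// Element I of the conceptually unbounded vector V.
long long element_at(const Encoded& v, unsigned long long i) {
  assert(v.npatterns > 0 && v.nelts_per_pattern >= 1
         && v.nelts_per_pattern <= 3);
  unsigned pattern = static_cast<unsigned>(i % v.npatterns);
  unsigned long long k = i / v.npatterns;
  if (k < v.nelts_per_pattern)
    return v.elts[k * v.npatterns + pattern];
  long long last = v.elts[(v.nelts_per_pattern - 1) * v.npatterns + pattern];
  if (v.nelts_per_pattern < 3)
    return last;
  long long prev = v.elts[(v.nelts_per_pattern - 2) * v.npatterns + pattern];
  return last + static_cast<long long>(k - 2) * (last - prev);
}

// Re-encode V with N times as many patterns, keeping nelts_per_pattern.
Encoded reencode(const Encoded& v, unsigned n) {
  Encoded out{v.npatterns * n, v.nelts_per_pattern, {}};
  unsigned total = out.npatterns * out.nelts_per_pattern;
  for (unsigned i = 0; i < total; ++i)
    out.elts.push_back(element_at(v, i));
  return out;
}

// Fold a permute of the single input ARG by SEL, reusing SEL's encoding
// for the result: only the explicitly encoded elements are looked up.
Encoded fold_single_input(const Encoded& arg, const Encoded& sel) {
  Encoded res{sel.npatterns, sel.nelts_per_pattern, {}};
  for (long long index : sel.elts) {
    assert(index >= 0);
    res.elts.push_back(
        element_at(arg, static_cast<unsigned long long>(index)));
  }
  return res;
}
```

Folding with the original 2-pattern selector encodes the second result pattern as {1, 11, 2}, so element 7 extrapolates to -7 even though arg0[sel[7]] is 12. After reencode (sel, 2) the explicit result elements are { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13 }, and extrapolated elements agree with selecting directly from arg0.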

Thanks,
Prathamesh

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 9f7beae14e5..2f45979d4ac 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 			  build_zero_cst (itype));
 }
 
+/* Check if PATTERN in SEL selects either ARG0 or ARG1,
+   and return the selected arg, otherwise return NULL_TREE.  */
 
-/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
-   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
-   true if successful.  */
-
-static bool
-vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
+static tree
+get_vector_for_pattern (tree arg0, tree arg1,
+			const vec_perm_indices &sel, unsigned pattern,
+			unsigned sel_npatterns, int &S)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
+  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+  poly_uint64 nsel = sel.length ();
+  poly_uint64 esel;
+
+  if (!multiple_p (nsel, sel_npatterns, &esel))
+    return NULL_TREE;
+
+  poly_uint64 a1 = sel[pattern + sel_npatterns];
+  S = 0;
+  if (sel_nelts_per_pattern == 3)
     {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      S = (a2 - a1).to_constant ();
+      if (S != 0 && !pow2p_hwi (S))
+	return NULL_TREE;
     }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+
+  poly_uint64 ae = a1 + (esel - 2) * S;
+  uint64_t q1, qe;
+  poly_uint64 r1, re;
+
+  if (!can_div_trunc_p (a1, n1, &q1, &r1)
+      || !can_div_trunc_p (ae, n1, &qe, &re)
+      || (q1 != qe))
+    return NULL_TREE;
+
+  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
+
+  if (S < 0)
     {
-      constructor_elt *elt;
+      poly_uint64 a0 = sel[pattern];
+      if (!known_eq (S, a1 - a0))
+        return NULL_TREE;
 
-      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
-	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
-	  return false;
-	else
-	  elts[i] = elt->value;
+      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
+        return NULL_TREE;
     }
-  else
-    return false;
-  for (; i < nelts; i++)
-    elts[i]
-      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
-  return true;
+  
+  return arg;
 }
 
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
@@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
   unsigned HOST_WIDE_INT nelts;
   bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
-  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
-  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
-      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
+  unsigned res_npatterns = 0;
+  unsigned res_nelts_per_pattern = 0;
+  unsigned sel_npatterns = 0;
+  tree *vector_for_pattern = NULL;
+
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST
+      && !sel.length ().is_constant ())
+    {
+      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
+      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
+      sel_npatterns = sel.encoding ().npatterns ();
+
+      if (!pow2p_hwi (arg0_npatterns)
+	  || !pow2p_hwi (arg1_npatterns)
+	  || !pow2p_hwi (sel_npatterns))
+        return NULL_TREE;
+
+      unsigned N = 1;
+      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
+      for (unsigned i = 0; i < sel_npatterns; i++)
+	{
+	  int S = 0;
+	  tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
+	  if (!op)
+	    return NULL_TREE;
+	  vector_for_pattern[i] = op;
+	  unsigned N_pattern =
+	    (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
+	  N = std::max (N, N_pattern);
+	}
+      
+      res_npatterns
+        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
+
+      res_nelts_per_pattern
+	= std::max(sel.encoding ().nelts_per_pattern (),
+		   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+			     VECTOR_CST_NELTS_PER_PATTERN (arg1)));
+    }
+  else if (sel.length ().is_constant (&nelts)
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
+    {
+      /* For VLS vectors, treat all vectors with
+	 npatterns = nelts, nelts_per_pattern = 1. */
+      res_npatterns = sel_npatterns = nelts;
+      res_nelts_per_pattern = 1;
+      vector_for_pattern = XALLOCAVEC (tree, nelts);
+      for (unsigned i = 0; i < nelts; i++)
+        {
+	  HOST_WIDE_INT index;
+	  if (!sel[i].is_constant (&index))
+	    return NULL_TREE;
+	  vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;	
+	}
+    }
+  else
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
-  for (i = 0; i < nelts; i++)
+  tree_vector_builder out_elts (type, res_npatterns,
+				res_nelts_per_pattern);
+  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
+  for (unsigned i = 0; i < res_nelts; i++)
     {
-      HOST_WIDE_INT index;
-      if (!sel[i].is_constant (&index))
-	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+      /* For VLA vectors, i % sel_npatterns would give the original
+         pattern the element belongs to, which is sufficient to get the arg.
+	 Even if sel_npatterns has been multiplied by N,
+	 they will always come from the same input vector.
+	 For VLS vectors, sel_npatterns == res_nelts == nelts,
+	 so i % sel_npatterns == i since i < nelts */
+       
+      tree arg = vector_for_pattern[i % sel_npatterns];
+      unsigned HOST_WIDE_INT index;
+
+      if (arg == arg0)
+	{
+	  if (!sel[i].is_constant ())
+	    return NULL_TREE;
+	  index = sel[i].to_constant ();
+	}
+      else
+        {
+	  gcc_assert (arg == arg1);
+	  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+	  uint64_t q;
+	  poly_uint64 r;
+
+	  /* Divide sel[i] by input vector length, to obtain remainder,
+	     which would be the index for either input vector.  */
+	  if (!can_div_trunc_p (sel[i], n1, &q, &r))
+	    return NULL_TREE;
+
+	  if (!r.is_constant (&index))
+	    return NULL_TREE;
+	}
+
+      tree elem;
+      if (TREE_CODE (arg) == CONSTRUCTOR)
+        {
+	  gcc_assert (index < nelts);
+	  if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
+	    return NULL_TREE;
+	  elem = CONSTRUCTOR_ELT (arg, index)->value;
+	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
+	    return NULL_TREE;
+	  need_ctor = true;
+	}
+      else
+        elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
     }
 
   if (need_ctor)
     {
       vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
+      vec_alloc (v, res_nelts);
+      for (i = 0; i < res_nelts; i++)
 	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
       return build_constructor (type, v);
     }
-  else
-    return out_elts.build ();
+  return out_elts.build ();
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  //machine_mode vmode = VNx4SImode;
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
+  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
+
+  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
+  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should select arg0.  */
+  {
+    int mask_elems[] = {0, 1, 2};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (res != NULL_TREE);
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    unsigned res_nelts = vector_cst_encoded_nelts (res);
+    for (unsigned i = 0; i < res_nelts; i++)
+      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
+				    VECTOR_CST_ELT (arg0, i), 0));
+  }
+
+  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should return NULL because for len = 4 + 4x,
+     if x == 0, we select from arg1
+     if x > 0, we select from arg0
+     and thus cannot determine result at compile time.  */
+  {
+    int mask_elems[] = {4, 5, 6};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 3:
+     mask: {0, 0, 0, 1, 0, 2, ...} 
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0], ie, 1.
+     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
+     so res = {1, 1, 1, 11, 1, 2, ...}.  */
+  {
+    int mask_elems[] = {0, 0, 0, 1, 0, 2};
+    tree mask = build_vec_int_cst (2, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
+    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
+    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
+  }
+
+  /* Case 4:
+     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0]
+     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
+     a1 = 5 + 4x
+     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
+        = 5 + 6x
+     Since a1 / (4 + 4x) == ae / (4 + 4x) == 1, we select arg1[0], arg1[1], ...
+     res: {1, 21, 1, 31, 1, 22, ... }
+     FIXME: How to build vector with poly_int elems ?  */
+
+  /* Case 5: S < 0.  */
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest
  
Prathamesh Kulkarni Nov. 28, 2022, 11:44 a.m. UTC | #8
On Mon, 21 Nov 2022 at 14:37, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > > > <richard.sandiford@arm.com> wrote:
> > > >>
> > > >> Sorry for the slow response.  I wanted to find some time to think
> > > >> about this a bit more.
> > > >>
> > > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > > >> > <richard.sandiford@arm.com> wrote:
> > > >> >>
> > > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > > >> >> >> For num_poly_int_coeffs == 2,
> > > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > > >> >> >> If a1/trunc n1 succeeds,
> > > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > > >> >> >
> > > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > > >> >> >
> > > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > > >> >> >
> > > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > > >> >>
> > > >> >> Sorry, should have been:
> > > >> >>
> > > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > > >> > Hi Richard,
> > > >> > Thanks for the clarifications, and sorry for late reply.
> > > >> > I have attached POC patch that tries to implement the above approach.
> > > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > > >> >
> > > >> > For VLA vectors, I have only done limited testing so far.
> > > >> > It seems to pass couple of tests written in the patch for
> > > >> > nelts_per_pattern == 3,
> > > >> > and folds the following svld1rq test:
> > > >> > int32x4_t v = {1, 2, 3, 4};
> > > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > > >> > into:
> > > >> > return {1, 2, 3, 4, ...};
> > > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > > >> >
> > > >> > I have a couple of questions:
> > > >> > 1] When mask selects elements from same vector but from different patterns:
> > > >> > For eg:
> > > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > > >> >
> > > >> > With above mask,
> > > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > > >> > pattern in arg0.
> > > >> > The result is:
> > > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > > >> > with a0 = 1, a1 = 11, S = -9.
> > > >> > Is that expected tho ? It seems to create a new encoding which
> > > >> > wasn't present in the input vector. For instance, the next elem in
> > > >> > sequence would be -7,
> > > >> > which is not present originally in arg0.
> > > >>
> > > >> Yeah, you're right, sorry.  Going back to:
> > > >>
> > > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > > >>     elements for any integer N.  This extended sequence can be reencoded
> > > >>     as having N*Px patterns, with Ex staying the same.
> > > >>
> > > >> I guess we need to pick an N for the selector such that each new
> > > >> selector pattern (each one out of the N*Px patterns) selects from
> > > >> the *same pattern* of the same data input.
> > > >>
> > > >> So if a particular pattern in the selector has a step S, and the data
> > > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > > >>
> > > >> I think that means that the total number of patterns in the result
> > > >> (Pr from previous messages) can safely be:
> > > >>
> > > >>   Ps * least_common_multiple(
> > > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > > >>     ...
> > > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > > >>   )
> > > >>
> > > >> where:
> > > >>
> > > >>   Ps = the number of patterns in the selector
> > > >>   S[I] = the step for selector pattern I (I being 1-based)
> > > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > > >>   P[I] = the number of patterns in data input I
> > > >>
> > > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > > >> and S[...] then it could also get large.  Perhaps we should finally
> > > >> give up on the general case and limit this to power-of-2 patterns and
> > > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > > >> simplifies other things as well.
> > > >>
> > > >> What do you think?
> > > > Hi Richard,
> > > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > > > follow up patches if possible.
> > > >
> > > > Sorry if this sounds like a silly ques -- if we are going to have
> > > > pattern in selector, select *same pattern from same input vector*,
> > > > instead of re-encoding the selector to have N * Ps patterns, would it
> > > > make sense for elements in selector to denote pattern number itself
> > > > instead of element index
> > > > if input vectors are VLA ?
> > > >
> > > > For eg:
> > > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > > > op1 = {...}
> > > > with npatterns == 4, nelts_per_pattern == 3,
> > > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > > > Not sure if this is correct tho.
> > >
> > > This wouldn't allow us to represent things like a "duplicate one
> > > element", or "copy the leading N elements from the first input and
> > > the other elements from elements N+ of the second input", which we
> > > can with the current scheme.
> > >
> > > The restriction about each (unwound) selector pattern selecting from the
> > > same input pattern only applies to case where the selector pattern is
> > > stepped (and only applies to the stepped part of the pattern, not the
> > > leading element).  The restriction is also local to this code; it
> > > doesn't make other VEC_PERM_EXPRs invalid.
> > Hi Richard,
> > Thanks for the clarifications.
> > Just to clarify your approach with an eg:
> > Let selected input vector be:
> > arg0: {a0, b0, c0, d0,
> >           a0 + S, b0 + S, c0 + S, d0 + S,
> >           a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
> > where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> >
> > Let sel = {0, 0, 1, 2, 2, 4, ...}
> > where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> >
> > So, the first pattern in sel:
> > p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> > which would be incorrect, since they belong to different patterns in arg0.
> > So to select elements from same pattern in arg0, we need to divide p1
> > into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> >
> > Similarly for second pattern in sel:
> > p2: {0, 2, 4, ...}, we need to divide it into
> > at least N2 = P_arg0 / S1 = 2 distinct patterns.
> >
> > Select N = max(N1, N2) = 4
> > So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> > and will be re-encoded with N*Ps = 8 patterns:
> >
> > re-encoded sel:
> > {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> > a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> > a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> > ...}
> >
> > with 8 patterns,
> > p1: {a0, a0 + 2S, a0 + 4S, ...}
> > p2: {b0, b0 + 2S, b0 + 4S, ...}
> > ...
> > which select elements from same pattern from same input vector.
> > Does this look correct ?
> >
> > For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> > arg1_npatterns are powers of 2 and for each stepped pattern,
> > its step size S is a power of 2. I suppose this will be sufficient
> > to ensure that sel can be re-encoded with N*Ps npatterns
> > such that each new pattern selects elements from same pattern
> > of the input vector ?
> >
> > Then compute N:
> > N = 1;
> > for (every pattern p in sel)
> >   {
> >      op = corresponding input vector for pattern;
> >      S = step_size (p);
> >      N_pattern = max (S, npatterns (op)) / S;
> >      N = max(N, N_pattern)
> >   }
> >
> > and re-encode selector with N*Ps patterns.
> > I guess rest of the patch will mostly stay the same.
> Hi,
> I have attached a POC patch based on the above approach.
> For the above eg:
> arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> and
> sel = {0, 0, 0, 1, 0, 2, ...}
> with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
>
> For pattern, {0, 1, 2, ...} it will select elements from different
> patterns from arg0, which is incorrect.
> So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
> So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> with following patterns:
> p1 = { 0, ... }
> p2 = { 0, 2, 4, ... }
> p3 = { 0, ... }
> p4 = { 1, 3, 5, ... }
> which should be correct since each element from the respective
> patterns in sel chooses
> elements from same pattern from arg0.
> So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> Does this look correct ?
Hi Richard,
ping https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606850.html

Thanks,
Prathamesh
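
The re-encoding example quoted above can be checked with a small standalone
sketch.  This is only an illustration under stated assumptions, not code from
the patch: encoded_vec, unroll_factor, unroll_selector and permute are
hypothetical names, the model handles a single input vector, and it assumes
positive power-of-2 steps and pattern counts.  It does not check that the
permuted elements themselves form a valid stepped encoding.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-in for a VECTOR_CST-style encoding (hypothetical, not GCC code):
// NPATTERNS interleaved patterns, each described by its first
// NELTS_PER_PATTERN elements.  With three elements per pattern, the pattern
// continues with the step between its second and third elements.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<long> elts;

  // Return element I of the conceptually unbounded vector.
  long elt (unsigned i) const
  {
    unsigned pattern = i % npatterns;
    unsigned count = i / npatterns;
    if (count < nelts_per_pattern)
      return elts[count * npatterns + pattern];
    long last = elts[(nelts_per_pattern - 1) * npatterns + pattern];
    if (nelts_per_pattern < 3)
      return last;
    long prev = elts[(nelts_per_pattern - 2) * npatterns + pattern];
    return last + long (count - (nelts_per_pattern - 1)) * (last - prev);
  }
};

// Unroll factor N for selector SEL reading an input with INPUT_NPATTERNS
// patterns: the maximum over stepped patterns of max (S, P) / S.
// Assumes positive power-of-2 steps and pattern counts.
unsigned
unroll_factor (const encoded_vec &sel, unsigned input_npatterns)
{
  unsigned n = 1;
  if (sel.nelts_per_pattern < 3)
    return n;
  for (unsigned p = 0; p < sel.npatterns; p++)
    {
      long step = (sel.elts[2 * sel.npatterns + p]
                   - sel.elts[sel.npatterns + p]);
      if (step > 0)
        n = std::max (n, unsigned (std::max<long> (step, input_npatterns)
                                   / step));
    }
  return n;
}

// Expand SEL to N * npatterns * nelts_per_pattern explicit indices.
std::vector<long>
unroll_selector (const encoded_vec &sel, unsigned n)
{
  std::vector<long> out;
  for (unsigned i = 0; i < n * sel.npatterns * sel.nelts_per_pattern; i++)
    out.push_back (sel.elt (i));
  return out;
}

// Apply explicit indices to a single input (indices assumed to select ARG).
std::vector<long>
permute (const encoded_vec &arg, const std::vector<long> &indices)
{
  std::vector<long> out;
  for (long index : indices)
    out.push_back (arg.elt (unsigned (index)));
  return out;
}
```

With arg0 = {1, 11, 2, 12, 3, 13, ...} and sel = {0, 0, 0, 1, 0, 2, ...} the
sketch gives N = 2, the unrolled selector { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5 }
and the elements { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13 }.  For the earlier
example with sel = {0, 0, 1, 2, 2, 4, ...} and a 4-pattern input it gives N = 4.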
>
> Thanks,
> Prathamesh
>
> >
> > Thanks,
> > Prathamesh
> >
> > >
> > > Thanks,
> > > Richard
> > >
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > >>
> > > >> > I suppose it's fine since if the user defines mask to have pattern {0,
> > > >> > 1, 2, ...}
> > > >> > they intended result to have pattern with above encoding.
> > > >> > Just wanted to confirm if this is correct ?
> > > >> >
> > > >> > 2] Could you please suggest a test-case for S < 0 ?
> > > >> > I am not able to come up with one :/
> > > >>
> > > >> svrev is one way of creating negative steps.
> > > >>
> > > >> Thanks,
> > > >> Richard
> > > >>
> > > >> >
> > > >> > Thanks,
> > > >> > Prathamesh
> > > >> >>
> > > >> >> > which is an interleaving of the two patterns:
> > > >> >> >
> > > >> >> >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> > > >> >> >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
  
Richard Sandiford Dec. 6, 2022, 3:30 p.m. UTC | #9
Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi,
> I have attached a POC patch based on the above approach.
> For the above eg:
> arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> and
> sel = {0, 0, 0, 1, 0, 2, ...}
> with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
>
> For pattern, {0, 1, 2, ...} it will select elements from different
> patterns from arg0, which is incorrect.
> So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
> So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> with following patterns:
> p1 = { 0, ... }
> p2 = { 0, 2, 4, ... }
> p3 = { 0, ... }
> p4 = { 1, 3, 5, ... }
> which should be correct since each element from the respective
> patterns in sel chooses
> elements from same pattern from arg0.
> So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> Does this look correct ?

Yeah.  But like I said above:

  The restriction about each (unwound) selector pattern selecting from the
  same input pattern only applies to case where the selector pattern is
  stepped (and only applies to the stepped part of the pattern, not the
  leading element).

If the selector nelts-per-pattern is 1 or 2 then we can support all
power-of-2 cases, with the final npatterns being the maximum of the
source nelts-per-patterns.

Also, going back to an earlier part of the discussion, I think we
should use this technique for both VLA and VLS, and only fall back
to the VLS-specific approach if the VLA approach fails.

So I suggest we put the VLA code in its own function and have
the VLS-only path kick in when the VLA code fails.  If the code is
having to pass a lot of state around, it might make sense to define
a local class, store the state in member variables, and use member
functions for the various subroutines.  I don't know if that will
work out neater though.

> @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>  			  build_zero_cst (itype));
>  }
>  
> +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> +   and return the selected arg, otherwise return NULL_TREE.  */
>  
> -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> -   true if successful.  */
> -
> -static bool
> -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> +static tree
> +get_vector_for_pattern (tree arg0, tree arg1,
> +			const vec_perm_indices &sel, unsigned pattern,
> +			unsigned sel_npatterns, int &S)
>  {
> -  unsigned HOST_WIDE_INT i, nunits;
> +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>  
> -  if (TREE_CODE (arg) == VECTOR_CST
> -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +  poly_uint64 nsel = sel.length ();
> +  poly_uint64 esel;
> +
> +  if (!multiple_p (nsel, sel_npatterns, &esel))
> +    return NULL_TREE;
> +
> +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> +  S = 0;
> +  if (sel_nelts_per_pattern == 3)
>      {
> -      for (i = 0; i < nunits; ++i)
> -	elts[i] = VECTOR_CST_ELT (arg, i);
> +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> +      S = (a2 - a1).to_constant ();

The code hasn't proven that this to_constant is safe.
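
One possible shape for the check, sketched with a toy type rather than GCC's
poly_int: poly2 and pattern_step below are hypothetical stand-ins.  The idea is
to use the out-parameter form of is_constant, as the patch already does for
sel[i].is_constant (&index), and to bail out instead of calling to_constant on
a difference that may still depend on the runtime vector length.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a two-coefficient poly_int64 (hypothetical, not GCC's
// class): the value is c0 + c1 * x for a runtime-only x >= 0.
struct poly2
{
  int64_t c0, c1;

  // Store the value in *OUT and return true if it does not depend on x.
  bool is_constant (int64_t *out) const
  {
    if (c1 != 0)
      return false;
    *out = c0;
    return true;
  }
};

inline poly2
operator- (poly2 a, poly2 b)
{
  return {a.c0 - b.c0, a.c1 - b.c1};
}

// Compute the step A2 - A1 of a stepped selector pattern.  Return false
// rather than asserting when the difference still depends on x.
bool
pattern_step (poly2 a1, poly2 a2, int64_t *step)
{
  return (a2 - a1).is_constant (step);
}
```

For the TRN1 pattern { 2 + 2x, 4 + 2x, 6 + 2x, ... } the difference of the
second and third elements is the constant 2, so the step is accepted.  A
pattern whose second and third elements are 0 and 2 + 2x has a step that
depends on x and is rejected.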

> +      if (S != 0 && !pow2p_hwi (S))
> +	return NULL_TREE;
>      }
> -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> +
> +  poly_uint64 ae = a1 + (esel - 2) * S;
> +  uint64_t q1, qe;
> +  poly_uint64 r1, re;
> +
> +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> +      || !can_div_trunc_p (ae, n1, &qe, &re)
> +      || (q1 != qe))
> +    return NULL_TREE;

Going back to the above: this check doesn't make sense for
sel_nelts_per_pattern != 3.

Thanks,
Richard

> +
> +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> +
> +  if (S < 0)
>      {
> -      constructor_elt *elt;
> +      poly_uint64 a0 = sel[pattern];
> +      if (!known_eq (S, a1 - a0))
> +        return NULL_TREE;
>  
> -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> -	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> -	  return false;
> -	else
> -	  elts[i] = elt->value;
> +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> +        return NULL_TREE;
>      }
> -  else
> -    return false;
> -  for (; i < nelts; i++)
> -    elts[i]
> -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> -  return true;
> +  
> +  return arg;
>  }
>  
>  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
>    unsigned HOST_WIDE_INT nelts;
>    bool need_ctor = false;
>  
> -  if (!sel.length ().is_constant (&nelts))
> -    return NULL_TREE;
> -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> +	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> +			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
>    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
>        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
>      return NULL_TREE;
>  
> -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> +  unsigned res_npatterns = 0;
> +  unsigned res_nelts_per_pattern = 0;
> +  unsigned sel_npatterns = 0;
> +  tree *vector_for_pattern = NULL;
> +
> +  if (TREE_CODE (arg0) == VECTOR_CST
> +      && TREE_CODE (arg1) == VECTOR_CST
> +      && !sel.length ().is_constant ())
> +    {
> +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> +      sel_npatterns = sel.encoding ().npatterns ();
> +
> +      if (!pow2p_hwi (arg0_npatterns)
> +	  || !pow2p_hwi (arg1_npatterns)
> +	  || !pow2p_hwi (sel_npatterns))
> +        return NULL_TREE;
> +
> +      unsigned N = 1;
> +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> +      for (unsigned i = 0; i < sel_npatterns; i++)
> +	{
> +	  int S = 0;
> +	  tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> +	  if (!op)
> +	    return NULL_TREE;
> +	  vector_for_pattern[i] = op;
> +	  unsigned N_pattern =
> +	    (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> +	  N = std::max (N, N_pattern);
> +	}
> +      
> +      res_npatterns
> +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> +
> +      res_nelts_per_pattern
> +	= std::max(sel.encoding ().nelts_per_pattern (),
> +		   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> +			     VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> +    }
> +  else if (sel.length ().is_constant (&nelts)
> +	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> +	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> +    {
> +      /* For VLS vectors, treat all vectors with
> +	 npatterns = nelts, nelts_per_pattern = 1. */
> +      res_npatterns = sel_npatterns = nelts;
> +      res_nelts_per_pattern = 1;
> +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> +      for (unsigned i = 0; i < nelts; i++)
> +        {
> +	  HOST_WIDE_INT index;
> +	  if (!sel[i].is_constant (&index))
> +	    return NULL_TREE;
> +	  vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;	
> +	}
> +    }
> +  else
>      return NULL_TREE;
>  
> -  tree_vector_builder out_elts (type, nelts, 1);
> -  for (i = 0; i < nelts; i++)
> +  tree_vector_builder out_elts (type, res_npatterns,
> +				res_nelts_per_pattern);
> +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> +  for (unsigned i = 0; i < res_nelts; i++)
>      {
> -      HOST_WIDE_INT index;
> -      if (!sel[i].is_constant (&index))
> -	return NULL_TREE;
> -      if (!CONSTANT_CLASS_P (in_elts[index]))
> -	need_ctor = true;
> -      out_elts.quick_push (unshare_expr (in_elts[index]));
> +      /* For VLA vectors, i % sel_npatterns would give the original
> +         pattern the element belongs to, which is sufficient to get the arg.
> +	 Even if sel_npatterns has been multiplied by N,
> +	 they will always come from the same input vector.
> +	 For VLS vectors, sel_npatterns == res_nelts == nelts,
> +	 so i % sel_npatterns == i since i < nelts */
> +       
> +      tree arg = vector_for_pattern[i % sel_npatterns];
> +      unsigned HOST_WIDE_INT index;
> +
> +      if (arg == arg0)
> +	{
> +	  if (!sel[i].is_constant ())
> +	    return NULL_TREE;
> +	  index = sel[i].to_constant ();
> +	}
> +      else
> +        {
> +	  gcc_assert (arg == arg1);
> +	  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +	  uint64_t q;
> +	  poly_uint64 r;
> +
> +	  /* Divide sel[i] by input vector length, to obtain remainder,
> +	     which would be the index for either input vector.  */
> +	  if (!can_div_trunc_p (sel[i], n1, &q, &r))
> +	    return NULL_TREE;
> +
> +	  if (!r.is_constant (&index))
> +	    return NULL_TREE;
> +	}
> +
> +      tree elem;
> +      if (TREE_CODE (arg) == CONSTRUCTOR)
> +        {
> +	  gcc_assert (index < nelts);
> +	  if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> +	    return NULL_TREE;
> +	  elem = CONSTRUCTOR_ELT (arg, index)->value;
> +	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> +	    return NULL_TREE;
> +	  need_ctor = true;
> +	}
> +      else
> +        elem = vector_cst_elt (arg, index);
> +      out_elts.quick_push (elem);
>      }
>  
>    if (need_ctor)
>      {
>        vec<constructor_elt, va_gc> *v;
> -      vec_alloc (v, nelts);
> -      for (i = 0; i < nelts; i++)
> +      vec_alloc (v, res_nelts);
> +      for (i = 0; i < res_nelts; i++)
>  	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
>        return build_constructor (type, v);
>      }
> -  else
> -    return out_elts.build ();
> +  return out_elts.build ();
>  }
>  
>  /* Try to fold a pointer difference of type TYPE two address expressions of
> @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
>    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
>  }
>  
> +static tree
> +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> +		   int *encoded_elems)
> +{
> +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> +  //machine_mode vmode = VNx4SImode;
> +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> +  tree vectype = build_vector_type (integer_type_node, nunits);
> +
> +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> +  return builder.build ();
> +}
> +
> +static void
> +test_vec_perm_vla_folding ()
> +{
> +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> +
> +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> +
> +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> +    return;
> +
> +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> +     should select arg0.  */
> +  {
> +    int mask_elems[] = {0, 1, 2};
> +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    ASSERT_TRUE (res != NULL_TREE);
> +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> +
> +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> +    for (unsigned i = 0; i < res_nelts; i++)
> +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> +				    VECTOR_CST_ELT (arg0, i), 0));
> +  }
> +
> +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> +     should return NULL because for len = 4 + 4x,
> +     if x == 0, we select from arg1
> +     if x > 0, we select from arg0
> +     and thus cannot determine result at compile time.  */
> +  {
> +    int mask_elems[] = {4, 5, 6};
> +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    gcc_assert (res == NULL_TREE);
> +  }
> +
> +  /* Case 3:
> +     mask: {0, 0, 0, 1, 0, 2, ...} 
> +     npatterns == 2, nelts_per_pattern == 3
> +     Pattern {0, ...} should select arg0[0], ie, 1.
> +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> +  {
> +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> +
> +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> +  }
> +
> +  /* Case 4:
> +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> +     npatterns == 2, nelts_per_pattern == 3
> +     Pattern {0, ...} should select arg0[1]
> +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> +     a1 = 5 + 4x
> +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> +        = 5 + 6x
> +     Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ...
> +     res: {1, 21, 1, 31, 1, 22, ... }
> +     FIXME: How to build vector with poly_int elems ?  */
> +
> +  /* Case 5: S < 0.  */
> +}
> +
>  /* Run all of the selftests within this file.  */
>  
>  void
> @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
>    test_arithmetic_folding ();
>    test_vector_folding ();
>    test_vec_duplicate_folding ();
> +  test_vec_perm_vla_folding ();
>  }
>  
>  } // namespace selftest
  
Prathamesh Kulkarni Dec. 13, 2022, 6:05 a.m. UTC | #10
On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> >>
> >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >> >
> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> >> > > <richard.sandiford@arm.com> wrote:
> >> > >>
> >> > >> Sorry for the slow response.  I wanted to find some time to think
> >> > >> about this a bit more.
> >> > >>
> >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> >> > >> > <richard.sandiford@arm.com> wrote:
> >> > >> >>
> >> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> >> > >> >> >> For num_poly_int_coeffs == 2,
> >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> >> > >> >> >> If a1/trunc n1 succeeds,
> >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> >> > >> >> >
> >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> >> > >> >> >
> >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> >> > >> >> >
> >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> >> > >> >>
> >> > >> >> Sorry, should have been:
> >> > >> >>
> >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> >> > >> > Hi Richard,
> >> > >> > Thanks for the clarifications, and sorry for late reply.
> >> > >> > I have attached POC patch that tries to implement the above approach.
> >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> >> > >> >
> >> > >> > For VLA vectors, I have only done limited testing so far.
> >> > >> > It seems to pass couple of tests written in the patch for
> >> > >> > nelts_per_pattern == 3,
> >> > >> > and folds the following svld1rq test:
> >> > >> > int32x4_t v = {1, 2, 3, 4};
> >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> >> > >> > into:
> >> > >> > return {1, 2, 3, 4, ...};
> >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> >> > >> >
> >> > >> > I have a couple of questions:
> >> > >> > 1] When mask selects elements from same vector but from different patterns:
> >> > >> > For eg:
> >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> >> > >> >
> >> > >> > With above mask,
> >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> >> > >> > pattern in arg0.
> >> > >> > The result is:
> >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> >> > >> > with a0 = 1, a1 = 11, S = -9.
> >> > >> > Is that expected tho ? It seems to create a new encoding which
> >> > >> > wasn't present in the input vector. For instance, the next elem in
> >> > >> > sequence would be -7,
> >> > >> > which is not present originally in arg0.
> >> > >>
> >> > >> Yeah, you're right, sorry.  Going back to:
> >> > >>
> >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >> > >>     elements for any integer N.  This extended sequence can be reencoded
> >> > >>     as having N*Px patterns, with Ex staying the same.
> >> > >>
> >> > >> I guess we need to pick an N for the selector such that each new
> >> > >> selector pattern (each one out of the N*Px patterns) selects from
> >> > >> the *same pattern* of the same data input.
> >> > >>
> >> > >> So if a particular pattern in the selector has a step S, and the data
> >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> >> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> >> > >>
> >> > >> I think that means that the total number of patterns in the result
> >> > >> (Pr from previous messages) can safely be:
> >> > >>
> >> > >>   Ps * least_common_multiple(
> >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> >> > >>     ...
> >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> >> > >>   )
> >> > >>
> >> > >> where:
> >> > >>
> >> > >>   Ps = the number of patterns in the selector
> >> > >>   S[I] = the step for selector pattern I (I being 1-based)
> >> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> >> > >>   P[I] = the number of patterns in data input I
> >> > >>
> >> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> >> > >> and S[...] then it could also get large.  Perhaps we should finally
> >> > >> give up on the general case and limit this to power-of-2 patterns and
> >> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> >> > >> simplifies other things as well.
> >> > >>
> >> > >> What do you think?
> >> > > Hi Richard,
> >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> >> > > follow up patches if possible.
> >> > >
> >> > > Sorry if this sounds like a silly ques -- if we are going to have
> >> > > pattern in selector, select *same pattern from same input vector*,
> >> > > instead of re-encoding the selector to have N * Ps patterns, would it
> >> > > make sense for elements in selector to denote pattern number itself
> >> > > instead of element index
> >> > > if input vectors are VLA ?
> >> > >
> >> > > For eg:
> >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> >> > > op1 = {...}
> >> > > with npatterns == 4, nelts_per_pattern == 3,
> >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> >> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> >> > > Not sure if this is correct tho.
> >> >
> >> > This wouldn't allow us to represent things like a "duplicate one
> >> > element", or "copy the leading N elements from the first input and
> >> > the other elements from elements N+ of the second input", which we
> >> > can with the current scheme.
> >> >
> >> > The restriction about each (unwound) selector pattern selecting from the
> >> > same input pattern only applies to case where the selector pattern is
> >> > stepped (and only applies to the stepped part of the pattern, not the
> >> > leading element).  The restriction is also local to this code; it
> >> > doesn't make other VEC_PERM_EXPRs invalid.
> >> Hi Richard,
> >> Thanks for the clarifications.
> >> Just to clarify your approach with an eg:
> >> Let selected input vector be:
> >> arg0: {a0, b0, c0, d0,
> >>           a0 + S, b0 + S, c0 + S, d0 + S,
> >>           a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
> >> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> >>
> >> Let sel = {0, 0, 1, 2, 2, 4, ...}
> >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> >>
> >> So, the first pattern in sel:
> >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> >> which would be incorrect, since they belong to different patterns in arg0.
> >> So to select elements from same pattern in arg0, we need to divide p1
> >> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> >>
> >> Similarly for second pattern in sel:
> >> p2: {0, 2, 4, ...}, we need to divide it into
> >> at least N2 = P_arg0 / S1 = 2 distinct patterns.
> >>
> >> Select N = max(N1, N2) = 4
> >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> >> and will be re-encoded with N*Ps = 8 patterns:
> >>
> >> re-encoded sel:
> >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> >> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> >> ...}
> >>
> >> with 8 patterns,
> >> p1: {a0, a0 + 2S, a0 + 4S, ...}
> >> p2: {b0, b0 + 2S, b0 + 4S, ...}
> >> ...
> >> which select elements from same pattern from same input vector.
> >> Does this look correct ?
> >>
> >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> >> arg1_npatterns are powers of 2 and for each stepped pattern,
> >> it's stepped size S is a power of 2. I suppose this will be sufficient
> >> to ensure that sel can be re-encoded with N*Ps npatterns
> >> such that each new pattern selects elements from same pattern
> >> of the input vector ?
> >>
> >> Then compute N:
> >> N = 1;
> >> for (every pattern p in sel)
> >>   {
> >>      op = corresponding input vector for pattern;
> >>      S = step_size (p);
> >>      N_pattern = max (S, npatterns (op)) / S;
> >>      N = max(N, N_pattern)
> >>   }
> >>
> >> and re-encode selector with N*Ps patterns.
> >> I guess rest of the patch will mostly stay the same.
> > Hi,
> > I have attached a POC patch based on the above approach.
> > For the above eg:
> > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> > and
> > sel = {0, 0, 0, 1, 0, 2, ...}
> > with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
> >
> > For pattern, {0, 1, 2, ...} it will select elements from different
> > patterns from arg0, which is incorrect.
> > So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
> > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> > with following patterns:
> > p1 = { 0, ... }
> > p2 = { 0, 2, 4, ... }
> > p3 = { 0, ... }
> > p4 = { 1, 3, 5, ... }
> > which should be correct since each element from the respective
> > patterns in sel chooses
> > elements from same pattern from arg0.
> > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> > Does this look correct ?
>
> Yeah.  But like I said above:
>
>   The restriction about each (unwound) selector pattern selecting from the
>   same input pattern only applies to case where the selector pattern is
>   stepped (and only applies to the stepped part of the pattern, not the
>   leading element).
>
> If the selector nelts-per-pattern is 1 or 2 then we can support all
> power-of-2 cases, with the final npatterns being the maximum of the
> source nelts-per-patterns.
>
> Also, going back to an earlier part of the discussion, I think we
> should use this technique for both VLA and VLS, and only fall back
> to the VLS-specific approach if the VLA approach fails.
>
> So I suggest we put the VLA code in its own function and have
> the VLS-only path kick in when the VLA code fails.  If the code is
> having to pass a lot of state around, it might make sense to define
> a local class, store the state in member variables, and use member
> functions for the various subroutines.  I don't know if that will
> work out neater though.
Hi Richard,
Thanks for the suggestions. I have attached an updated POC patch
that does the following:
(a) Uses the VLA approach by default, and falls back to VLS-specific
folding if the VLA approach fails for VLS vectors.
(b) Separates the cases for sel_nelts_per_pattern < 3 and
sel_nelts_per_pattern == 3.
(c) Allows a0 to select a different vector from a1 .. ae.
I have written a few unit tests in the patch for testing the same.
Does the patch look like it is going in the right direction?
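
For reference, the power-of-2 rule for choosing N discussed earlier in the thread (N_pattern = max (S, npatterns (op)) / S, with N the maximum over all stepped selector patterns) can be sketched outside GCC as follows. This is only an illustration under stated assumptions: sel_pattern_info and compute_reencode_factor are hypothetical names, not functions from the patch, and steps are assumed to be non-negative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary of one selector pattern: its step S and the number
// of patterns P in the input vector that the pattern selects from.
struct sel_pattern_info
{
  unsigned step;            // S; 0 means the pattern is not stepped
  unsigned input_npatterns; // P; assumed to be a power of 2
};

static bool
is_pow2 (unsigned v)
{
  return v != 0 && (v & (v - 1)) == 0;
}

// Return the factor N by which the selector's pattern count is multiplied,
// so that every re-encoded stepped pattern stays inside one input pattern.
static unsigned
compute_reencode_factor (const std::vector<sel_pattern_info> &patterns)
{
  unsigned n = 1;
  for (const sel_pattern_info &p : patterns)
    {
      // Non-stepped patterns place no constraint on N.
      if (p.step == 0)
        continue;
      // With powers of 2, least_common_multiple (S, P) is max (S, P).
      assert (is_pow2 (p.step) && is_pow2 (p.input_npatterns));
      n = std::max (n, std::max (p.step, p.input_npatterns) / p.step);
    }
  return n;
}
```

With sel = {0, 0, 0, 1, 0, 2, ...} over an arg0 that has 2 patterns, the inputs are {step 0, P 2} and {step 1, P 2}, which gives N = 2 and hence 4 result patterns, matching the example quoted above.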

The patch has an issue for the following case marked as "case 9"
in test_vec_perm_vla_folding:
arg0 = { 1, 11, 2, 12, 3, 13, ... }
arg1 = { 21, 31, 22, 32, 23, 33, ... }
arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.

mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
where 4 + 4x is the runtime vector length.
npatterns = 1, nelts_per_pattern = 3.

a1 = 5 + 4x
ae = a1 + (esel - 2) * S
     = (5 + 4x) + (4 + 4x - 2) * 1
     = 7 + 8x

Since can_div_trunc_p fails for (7 + 8x) /trunc (4 + 4x), fold_vec_perm
returns NULL_TREE.
Is that expected for the above mask?
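
As a sanity check on the arithmetic above, the following standalone sketch evaluates a1 and ae for concrete values of x and compares the truncating quotient by the vector length. It is plain C++ and does not use GCC's poly_int or can_div_trunc_p; poly2, eval and quotient_is are hypothetical names introduced only for this illustration, and sampling a range of x is not a proof for all x.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-coefficient value c0 + c1 * x, where x >= 0 is the
// runtime indeterminate.
struct poly2
{
  uint64_t c0;
  uint64_t c1;
};

static uint64_t
eval (poly2 p, uint64_t x)
{
  return p.c0 + p.c1 * x;
}

// Return true if A /trunc N equals Q for every sampled x in [0, max_x].
// N is assumed to be nonzero for all sampled x.
static bool
quotient_is (poly2 a, poly2 n, uint64_t q, uint64_t max_x)
{
  for (uint64_t x = 0; x <= max_x; ++x)
    {
      assert (eval (n, x) != 0);
      if (eval (a, x) / eval (n, x) != q)
        return false;
    }
  return true;
}
```

For a1 = 5 + 4x and ae = 7 + 8x with n1 = 4 + 4x, quotient_is reports a quotient of 1 over the sampled range, since 4 + 4x <= 7 + 8x < 8 + 8x for x >= 0. For the constant index 5 of "case 2" the quotient is 1 at x == 0 and 0 at x == 1, so no single quotient exists there.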

I intended it to select the second vector, similar to how
sel = { 0, 1, 2, ... } selects the first vector:
that sel is re-encoded as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...}.
The first pattern selects elements from the first pattern of arg0,
while the second pattern selects elements from the second pattern of arg0,
so the result effectively has the same encoding as arg0.
Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1?
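
The constant-selector half of this argument can be checked with a small standalone model of the stepped encoding. This is a sketch only: encoded_vec, elt and permute_prefix are hypothetical names, the model is not GCC's tree_vector_builder, and it handles only selectors whose indices are constants referring to arg0, so it says nothing about the 4 + 4x case.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a stepped vector encoding: npatterns interleaved
// patterns, each described by its first nelts_per_pattern elements.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<long> elts; // npatterns * nelts_per_pattern encoded elements
};

// Return element I of the conceptually unbounded vector V.
static long
elt (const encoded_vec &v, std::size_t i)
{
  std::size_t pattern = i % v.npatterns;
  std::size_t k = i / v.npatterns; // position within the pattern
  if (k < v.nelts_per_pattern)
    return v.elts[k * v.npatterns + pattern];
  // Beyond the encoded part: repeat the last element (1 or 2 elements
  // per pattern) or continue the series with a constant step (3).
  long last = v.elts[(v.nelts_per_pattern - 1) * v.npatterns + pattern];
  if (v.nelts_per_pattern < 3)
    return last;
  long prev = v.elts[(v.nelts_per_pattern - 2) * v.npatterns + pattern];
  return last + (long) (k - (v.nelts_per_pattern - 1)) * (last - prev);
}

// Return the first COUNT elements of the permutation of ARG0 by SEL,
// assuming every selector index refers to ARG0.
static std::vector<long>
permute_prefix (const encoded_vec &arg0, const encoded_vec &sel,
                std::size_t count)
{
  std::vector<long> res;
  for (std::size_t i = 0; i < count; ++i)
    res.push_back (elt (arg0, (std::size_t) elt (sel, i)));
  return res;
}
```

With arg0 = {1, 11, 2, 12, 3, 13, ...} (2 patterns, 3 elements each) and sel = {0, 1, 2, ...}, the first six result elements equal arg0's encoded elements. With sel = {0, 0, 0, 1, 0, 2, ...} the first twelve are {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13}, the 4-pattern encoding expected by "case 3" of the selftest.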

PS: I will be on vacation next week.

Thanks,
Prathamesh

>
> > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >                         build_zero_cst (itype));
> >  }
> >
> > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > +   and return the selected arg, otherwise return NULL_TREE.  */
> >
> > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > -   true if successful.  */
> > -
> > -static bool
> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > +static tree
> > +get_vector_for_pattern (tree arg0, tree arg1,
> > +                     const vec_perm_indices &sel, unsigned pattern,
> > +                     unsigned sel_npatterns, int &S)
> >  {
> > -  unsigned HOST_WIDE_INT i, nunits;
> > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >
> > -  if (TREE_CODE (arg) == VECTOR_CST
> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +  poly_uint64 nsel = sel.length ();
> > +  poly_uint64 esel;
> > +
> > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > +    return NULL_TREE;
> > +
> > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > +  S = 0;
> > +  if (sel_nelts_per_pattern == 3)
> >      {
> > -      for (i = 0; i < nunits; ++i)
> > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > +      S = (a2 - a1).to_constant ();
>
> The code hasn't proven that this to_constant is safe.
>
> > +      if (S != 0 && !pow2p_hwi (S))
> > +     return NULL_TREE;
> >      }
> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > +
> > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > +  uint64_t q1, qe;
> > +  poly_uint64 r1, re;
> > +
> > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > +      || (q1 != qe))
> > +    return NULL_TREE;
>
> Going back to the above: this check doesn't make sense for
> sel_nelts_per_pattern != 3.
>
> Thanks,
> Richard
>
> > +
> > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > +
> > +  if (S < 0)
> >      {
> > -      constructor_elt *elt;
> > +      poly_uint64 a0 = sel[pattern];
> > +      if (!known_eq (S, a1 - a0))
> > +        return NULL_TREE;
> >
> > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > -       return false;
> > -     else
> > -       elts[i] = elt->value;
> > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > +        return NULL_TREE;
> >      }
> > -  else
> > -    return false;
> > -  for (; i < nelts; i++)
> > -    elts[i]
> > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > -  return true;
> > +
> > +  return arg;
> >  }
> >
> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> >    unsigned HOST_WIDE_INT nelts;
> >    bool need_ctor = false;
> >
> > -  if (!sel.length ().is_constant (&nelts))
> > -    return NULL_TREE;
> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> >      return NULL_TREE;
> >
> > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > +  unsigned res_npatterns = 0;
> > +  unsigned res_nelts_per_pattern = 0;
> > +  unsigned sel_npatterns = 0;
> > +  tree *vector_for_pattern = NULL;
> > +
> > +  if (TREE_CODE (arg0) == VECTOR_CST
> > +      && TREE_CODE (arg1) == VECTOR_CST
> > +      && !sel.length ().is_constant ())
> > +    {
> > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > +      sel_npatterns = sel.encoding ().npatterns ();
> > +
> > +      if (!pow2p_hwi (arg0_npatterns)
> > +       || !pow2p_hwi (arg1_npatterns)
> > +       || !pow2p_hwi (sel_npatterns))
> > +        return NULL_TREE;
> > +
> > +      unsigned N = 1;
> > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > +     {
> > +       int S = 0;
> > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > +       if (!op)
> > +         return NULL_TREE;
> > +       vector_for_pattern[i] = op;
> > +       unsigned N_pattern =
> > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > +       N = std::max (N, N_pattern);
> > +     }
> > +
> > +      res_npatterns
> > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > +
> > +      res_nelts_per_pattern
> > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > +    }
> > +  else if (sel.length ().is_constant (&nelts)
> > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > +    {
> > +      /* For VLS vectors, treat all vectors with
> > +      npatterns = nelts, nelts_per_pattern = 1. */
> > +      res_npatterns = sel_npatterns = nelts;
> > +      res_nelts_per_pattern = 1;
> > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > +      for (unsigned i = 0; i < nelts; i++)
> > +        {
> > +       HOST_WIDE_INT index;
> > +       if (!sel[i].is_constant (&index))
> > +         return NULL_TREE;
> > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > +     }
> > +    }
> > +  else
> >      return NULL_TREE;
> >
> > -  tree_vector_builder out_elts (type, nelts, 1);
> > -  for (i = 0; i < nelts; i++)
> > +  tree_vector_builder out_elts (type, res_npatterns,
> > +                             res_nelts_per_pattern);
> > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > +  for (unsigned i = 0; i < res_nelts; i++)
> >      {
> > -      HOST_WIDE_INT index;
> > -      if (!sel[i].is_constant (&index))
> > -     return NULL_TREE;
> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > -     need_ctor = true;
> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > +      /* For VLA vectors, i % sel_npatterns would give the original
> > +         pattern the element belongs to, which is sufficient to get the arg.
> > +      Even if sel_npatterns has been multiplied by N,
> > +      they will always come from the same input vector.
> > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > +      so i % sel_npatterns == i since i < nelts */
> > +
> > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > +      unsigned HOST_WIDE_INT index;
> > +
> > +      if (arg == arg0)
> > +     {
> > +       if (!sel[i].is_constant ())
> > +         return NULL_TREE;
> > +       index = sel[i].to_constant ();
> > +     }
> > +      else
> > +        {
> > +       gcc_assert (arg == arg1);
> > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +       uint64_t q;
> > +       poly_uint64 r;
> > +
> > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > +          which would be the index for either input vector.  */
> > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > +         return NULL_TREE;
> > +
> > +       if (!r.is_constant (&index))
> > +         return NULL_TREE;
> > +     }
> > +
> > +      tree elem;
> > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > +        {
> > +       gcc_assert (index < nelts);
> > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > +         return NULL_TREE;
> > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > +         return NULL_TREE;
> > +       need_ctor = true;
> > +     }
> > +      else
> > +        elem = vector_cst_elt (arg, index);
> > +      out_elts.quick_push (elem);
> >      }
> >
> >    if (need_ctor)
> >      {
> >        vec<constructor_elt, va_gc> *v;
> > -      vec_alloc (v, nelts);
> > -      for (i = 0; i < nelts; i++)
> > +      vec_alloc (v, res_nelts);
> > +      for (i = 0; i < res_nelts; i++)
> >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> >        return build_constructor (type, v);
> >      }
> > -  else
> > -    return out_elts.build ();
> > +  return out_elts.build ();
> >  }
> >
> >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> >  }
> >
> > +static tree
> > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > +                int *encoded_elems)
> > +{
> > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > +  //machine_mode vmode = VNx4SImode;
> > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > +
> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > +  return builder.build ();
> > +}
> > +
> > +static void
> > +test_vec_perm_vla_folding ()
> > +{
> > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > +
> > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > +
> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > +    return;
> > +
> > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > +     should select arg0.  */
> > +  {
> > +    int mask_elems[] = {0, 1, 2};
> > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    ASSERT_TRUE (res != NULL_TREE);
> > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > +
> > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > +    for (unsigned i = 0; i < res_nelts; i++)
> > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > +  }
> > +
> > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > +     should return NULL because for len = 4 + 4x,
> > +     if x == 0, we select from arg1
> > +     if x > 0, we select from arg0
> > +     and thus cannot determine result at compile time.  */
> > +  {
> > +    int mask_elems[] = {4, 5, 6};
> > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    ASSERT_TRUE (res == NULL_TREE);
> > +  }
> > +
> > +  /* Case 3:
> > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > +     npatterns == 2, nelts_per_pattern == 3
> > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > +  {
> > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > +
> > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > +      ASSERT_TRUE (wi::to_wide (VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > +  }
> > +
> > +  /* Case 4:
> > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > +     npatterns == 2, nelts_per_pattern == 3
> > +     Pattern {0, ...} should select arg0[0]
> > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > +     a1 = 5 + 4x
> > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > +        = 5 + 6x
> > +     Since a1 / (4 + 4x) == ae / (4 + 4x) == 1, we select arg1[0], arg1[1], arg1[2], ...
> > +     res: {1, 21, 1, 31, 1, 22, ... }
> > +     FIXME: How to build vector with poly_int elems ?  */
> > +
> > +  /* Case 5: S < 0.  */
> > +}
> > +
> >  /* Run all of the selftests within this file.  */
> >
> >  void
> > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> >    test_arithmetic_folding ();
> >    test_vector_folding ();
> >    test_vec_duplicate_folding ();
> > +  test_vec_perm_vla_folding ();
> >  }
> >
> >  } // namespace selftest
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index e80be8049e1..b1ed90e629b 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "gimple-pretty-print.h"
+#include "tree-pretty-print.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10544,6 +10547,124 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+static std::pair<tree, int>
+get_vector_for_pattern (tree arg0, tree arg1, const vec_perm_indices &sel,
+			unsigned pattern)
+{
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+  gcc_assert (sel_nelts_per_pattern == 3);
+
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  poly_uint64 len = sel.length ();
+  poly_uint64 esel;
+
+  if (!multiple_p (len, sel_npatterns, &esel))
+    return std::make_pair (NULL_TREE, 0);
+
+  poly_uint64 a1 = sel[pattern + sel_npatterns];
+  poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+  poly_uint64 diff = a2 - a1;
+  if (!diff.is_constant ())
+    return std::make_pair (NULL_TREE, 0);
+  int S = diff.to_constant ();
+  if (S != 0 && !pow2p_hwi (S))
+    return std::make_pair (NULL_TREE, 0);
+
+  poly_uint64 ae = a1 + (esel - 2) * S;
+  uint64_t q1, qe;
+  poly_uint64 r1, re;
+  if (!can_div_trunc_p (a1, len, &q1, &r1)
+      || !can_div_trunc_p (ae, len, &qe, &re)
+      || (q1 != qe))
+    return std::make_pair (NULL_TREE, 0);
+
+  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
+
+  if (S < 0
+      && (!known_eq (S, a1 - sel[pattern])
+	  || !known_gt (re, VECTOR_CST_NPATTERNS (arg))))
+    return std::make_pair (NULL_TREE, 0);
+
+  return std::make_pair (arg, S);
+}
+
+static tree
+fold_vec_perm_vla (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
+{
+  if (TREE_CODE (arg0) != VECTOR_CST
+      || TREE_CODE (arg1) != VECTOR_CST)
+    return NULL_TREE;
+
+  unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
+  unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+  poly_uint64 nelts = sel.length ();
+
+  if (!pow2p_hwi (arg0_npatterns)
+      || !pow2p_hwi (arg1_npatterns)
+      || !pow2p_hwi (sel_npatterns))
+    return NULL_TREE;
+
+  unsigned N = 1;
+  if (sel_nelts_per_pattern == 3)
+    for (unsigned i = 0; i < sel_npatterns; i++)
+      {
+	std::pair<tree, int> ret = get_vector_for_pattern (arg0, arg1, sel, i);
+	tree arg = ret.first;
+	if (!arg)
+	  return NULL_TREE;
+
+	int S = ret.second;
+	/* S can be 0 if one of the patterns is a dup but
+	   other is a stepped sequence. For eg: {0, 0, 0, 1, 0, 2, ...} */
+	unsigned N_pattern
+	  = (S > 0) ? std::max<int> (S, VECTOR_CST_NPATTERNS (arg)) / S : 1;
+	N = std::max (N, N_pattern);
+      }
+
+  unsigned res_npatterns
+    = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
+
+  unsigned res_nelts_per_pattern
+    = std::max (sel_nelts_per_pattern, std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+						 VECTOR_CST_NELTS_PER_PATTERN (arg1)));
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  for (unsigned i = 0; i < res_npatterns * res_nelts_per_pattern; i++)
+    {
+      /* Get pattern corresponding to sel[i] and use that to figure out
+	 the input vector.
+	 For a stepped sequence, a0 may choose different vector,
+	 however a1 ... ae must select from same pattern from same vector.
+	 So if i < sel_npatterns, set pattern_index to index of a0,
+	 and if i >= sel_npatterns, set pattern_index to index of a1.  */
+
+      unsigned pattern_index = i % sel_npatterns;
+      if (i >= sel_npatterns)
+	pattern_index += sel_npatterns;
+
+      uint64_t q;
+      poly_uint64 r;
+      if (!can_div_trunc_p (sel[pattern_index], nelts, &q, &r))
+	return NULL_TREE;
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+
+      unsigned HOST_WIDE_INT index;
+      if (sel[i].is_constant ())
+	index = sel[i].to_constant ();
+      else
+	{
+	  poly_uint64 diff = sel[i] - nelts;
+	  if (!diff.is_constant (&index))
+	    return NULL_TREE;
+	}
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10555,15 +10676,20 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
   unsigned HOST_WIDE_INT nelts;
   bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (tree res = fold_vec_perm_vla (type, arg0, arg1, sel))
+    return res;
+
+  if (!sel.length ().is_constant (&nelts))
+    return NULL_TREE;
+
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
@@ -16958,6 +17084,168 @@ test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  //machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  machine_mode vmode = VNx4SImode;
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static tree
+fold_vec_perm_vla_mask (tree type, tree arg0, tree arg1,
+			unsigned mask_npatterns,
+			unsigned mask_nelts_per_pattern,
+			poly_uint64 *mask_elems)
+{
+  poly_uint64 len = TYPE_VECTOR_SUBPARTS (type);
+  vec_perm_builder builder (len, mask_npatterns, mask_nelts_per_pattern);
+  for (unsigned i = 0; i < mask_npatterns * mask_nelts_per_pattern; i++)
+    builder.quick_push (mask_elems[i]);
+  vec_perm_indices sel (builder, (arg0 == arg1) ? 1 : 2, len);
+  return fold_vec_perm_vla (type, arg0, arg1, sel);
+}
+
+static void
+check_vec_perm_vla_result (tree res, int *res_elems,
+			   unsigned npatterns, unsigned nelts_per_pattern)
+{
+  ASSERT_TRUE (TREE_CODE (res) == VECTOR_CST);
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
+  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
+
+  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+    ASSERT_TRUE (wi::to_wide (VECTOR_CST_ELT (res, i)) == res_elems[i]);
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
+  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
+
+  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
+  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: mask is {0, 1, 2, 3, ... }.
+     npatterns = 4, nelts_per_pattern = 1.
+     All elements in mask < 4 + 4x.  */
+  {
+    poly_uint64 mask_elems[] = {0, 1, 2, 3};
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 1, mask_elems);
+    int res_elems[] = { arg0_elems[0], arg0_elems[1], arg0_elems[2], arg0_elems[3] };
+    check_vec_perm_vla_result (res, res_elems, 4, 1);
+  }
+
+  /* Case 2: mask = {0, 4, 1, 5, ...}
+     npatterns = 4, nelts_per_pattern = 1.
+     Should return NULL, because result cannot be determined at compile time,
+     since len = 4 + 4x and thus {4, 5} can select either from arg0 or arg1
+     depending on runtime length of the vector.  */
+  {
+    poly_uint64 mask_elems[] = {0, 4, 1, 5};
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 1, mask_elems);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 3: mask = { 4 + 4x, 5 + 4x, 6 + 4x, 7 + 4x, ... }
+     npatterns = 4, nelts_per_pattern = 1
+     All elements in mask >= 4 + 4x.  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = { len, len + 1, len + 2, len + 3 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 1, mask_elems);
+    int res_elems[] = { arg1_elems[0], arg1_elems[1], arg1_elems[2], arg1_elems[3] };
+    check_vec_perm_vla_result (res, res_elems, 4, 1);
+  }
+
+  /* Case 4: mask = {0, 1, 4 + 4x, 5 + 4x, ... }
+     npatterns = 4, nelts_per_pattern = 1
+     res = { arg0[0], arg0[1], arg1[0], arg1[1], ... }  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = { 0, 1, len, len + 1 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 1, mask_elems);
+    int res_elems[] = { arg0_elems[0], arg0_elems[1], arg1_elems[0], arg1_elems[1] };
+    check_vec_perm_vla_result (res, res_elems, 4, 1);
+  }
+
+  /* Case 5: mask = {0, 1, 2, 3, 4 + 4x, 5 + 4x, 6 + 4x, 7 + 4x, ... }
+     npatterns = 4, nelts_per_pattern = 2.
+     res = { arg0[0], arg0[1], arg0[2], arg0[3], arg1[0], arg1[1], arg1[2], arg1[3], ... }  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = { 0, 1, 2, 3, len, len + 1, len + 2, len + 3 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 2, mask_elems);
+    int res_elems[] = { arg0_elems[0], arg0_elems[1], arg0_elems[2], arg0_elems[3],
+			arg1_elems[0], arg1_elems[1], arg1_elems[2], arg1_elems[3] };
+    check_vec_perm_vla_result (res, res_elems, 4, 2);
+  }
+
+  /* Case 6: mask = {0, ... }.
+     npatterns = 1, nelts_per_pattern = 1.
+     Test for npatterns(mask) < npatterns(arg0)  */
+  {
+    poly_uint64 mask_elems[] = {0};
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 1, 1, mask_elems);
+    int res_elems[] = { arg0_elems[0] };
+    check_vec_perm_vla_result (res, res_elems, 1, 1);
+  }
+
+  /* Case 7: mask = { 0, 1, 2, ... }.
+     npatterns = 1, nelts_per_pattern = 3.
+     Since {0, 1, 2} will select {1, 11, 2} it will be incorrect.
+     Re-encode sel such that each pattern of sel selects elements from
+     same pattern from arg0.
+     Thus the pattern must be divided into
+     npatterns(arg0) / S = 2 / 1 = 2 distinct patterns.
+     Re-encoded sel: {0, 1, 2, 3, 4, 5, ...}
+     with patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
+     Now each pattern selects elements only from same pattern
+     of arg0.
+     Expected res: {1, 11, 2, 12, 3, 13, ...}  */
+  {
+    poly_uint64 mask_elems[] = { 0, 1, 2 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 1, 3, mask_elems);
+    int res_elems[] = { arg0_elems[0], arg0_elems[1], arg0_elems[2], arg0_elems[3],
+			arg0_elems[4], arg0_elems[5] };
+    check_vec_perm_vla_result (res, res_elems, 2, 3);
+  }
+
+  /* Case 8: mask = {len, 0, 1, ... }
+     npatterns = 1, nelts_per_pattern = 3.
+     Test for case when a0 selects a different vector from a1 ... ae.  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = {len, 0, 1};
+    int res_elems[] = { arg1_elems[0], arg0_elems[0], arg0_elems[1], arg0_elems[2],
+			arg0_elems[3], arg0_elems[4] };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 1, 3, mask_elems);
+    check_vec_perm_vla_result (res, res_elems, 2, 3);
+  }
+
+  /* Case 9: mask = {len, len + 1, len + 2, ...}
+     npatterns = 1, nelts_per_pattern = 3.  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = { len, len + 1, len + 2 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 1, 3, mask_elems);
+  }
+
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16966,6 +17254,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest
  
Prathamesh Kulkarni Dec. 26, 2022, 4:26 a.m. UTC | #11
On Tue, 13 Dec 2022 at 11:35, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > >>
> > >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> > >> <richard.sandiford@arm.com> wrote:
> > >> >
> > >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > >> > > <richard.sandiford@arm.com> wrote:
> > >> > >>
> > >> > >> Sorry for the slow response.  I wanted to find some time to think
> > >> > >> about this a bit more.
> > >> > >>
> > >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > >> > >> > <richard.sandiford@arm.com> wrote:
> > >> > >> >>
> > >> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > >> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > >> > >> >> >> For num_poly_int_coeffs == 2,
> > >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > >> > >> >> >> If a1/trunc n1 succeeds,
> > >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > >> > >> >> >
> > >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > >> > >> >> >
> > >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > >> > >> >> >
> > >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > >> > >> >>
> > >> > >> >> Sorry, should have been:
> > >> > >> >>
> > >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > >> > >> > Hi Richard,
> > >> > >> > Thanks for the clarifications, and sorry for late reply.
> > >> > >> > I have attached POC patch that tries to implement the above approach.
> > >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > >> > >> >
> > >> > >> > For VLA vectors, I have only done limited testing so far.
> > >> > >> > It seems to pass couple of tests written in the patch for
> > >> > >> > nelts_per_pattern == 3,
> > >> > >> > and folds the following svld1rq test:
> > >> > >> > int32x4_t v = {1, 2, 3, 4};
> > >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > >> > >> > into:
> > >> > >> > return {1, 2, 3, 4, ...};
> > >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > >> > >> >
> > >> > >> > I have a couple of questions:
> > >> > >> > 1] When mask selects elements from same vector but from different patterns:
> > >> > >> > For eg:
> > >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > >> > >> >
> > >> > >> > With above mask,
> > >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > >> > >> > pattern in arg0.
> > >> > >> > The result is:
> > >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > >> > >> > with a0 = 1, a1 = 11, S = -9.
> > >> > >> > Is that expected tho ? It seems to create a new encoding which
> > >> > >> > wasn't present in the input vector. For instance, the next elem in
> > >> > >> > sequence would be -7,
> > >> > >> > which is not present originally in arg0.
> > >> > >>
> > >> > >> Yeah, you're right, sorry.  Going back to:
> > >> > >>
> > >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > >> > >>     elements for any integer N.  This extended sequence can be reencoded
> > >> > >>     as having N*Px patterns, with Ex staying the same.
> > >> > >>
> > >> > >> I guess we need to pick an N for the selector such that each new
> > >> > >> selector pattern (each one out of the N*Px patterns) selects from
> > >> > >> the *same pattern* of the same data input.
> > >> > >>
> > >> > >> So if a particular pattern in the selector has a step S, and the data
> > >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > >> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > >> > >>
> > >> > >> I think that means that the total number of patterns in the result
> > >> > >> (Pr from previous messages) can safely be:
> > >> > >>
> > >> > >>   Ps * least_common_multiple(
> > >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > >> > >>     ...
> > >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > >> > >>   )
> > >> > >>
> > >> > >> where:
> > >> > >>
> > >> > >>   Ps = the number of patterns in the selector
> > >> > >>   S[I] = the step for selector pattern I (I being 1-based)
> > >> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > >> > >>   P[I] = the number of patterns in data input I
> > >> > >>
> > >> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > >> > >> and S[...] then it could also get large.  Perhaps we should finally
> > >> > >> give up on the general case and limit this to power-of-2 patterns and
> > >> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > >> > >> simplifies other things as well.
> > >> > >>
> > >> > >> What do you think?
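
A minimal sketch of the pattern-count formula above, assuming the step of each selector pattern and the pattern count of the input it selects from are already known as plain positive integers. The names gcd_u, lcm_u and result_npatterns are hypothetical and are not part of GCC. It also illustrates the simplification mentioned above: when S and Pi are both powers of 2, least_common_multiple (S, Pi) is simply MAX (S, Pi).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helpers, not GCC code.
unsigned
gcd_u (unsigned a, unsigned b)
{
  while (b != 0)
    {
      unsigned t = a % b;
      a = b;
      b = t;
    }
  return a;
}

unsigned
lcm_u (unsigned a, unsigned b)
{
  return a / gcd_u (a, b) * b;
}

// steps[i] is the step S of selector pattern i (assumed > 0) and
// input_npatterns[i] is the number of patterns in the data input that
// selector pattern i selects from.  Returns Ps * N, where N is the least
// common multiple of lcm (S, Pi) / S over all selector patterns.
unsigned
result_npatterns (unsigned sel_npatterns,
                  const std::vector<unsigned> &steps,
                  const std::vector<unsigned> &input_npatterns)
{
  unsigned n = 1;
  for (unsigned i = 0; i < steps.size (); i++)
    {
      // Smallest N_i such that N_i * S is a multiple of Pi.
      unsigned n_i = lcm_u (steps[i], input_npatterns[i]) / steps[i];
      n = lcm_u (n, n_i);
    }
  return sel_npatterns * n;
}
```

For example, result_npatterns (2, {1, 2}, {4, 4}) evaluates to 2 * lcm (4, 2) = 8.
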
> > >> > > Hi Richard,
> > >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > >> > > follow up patches if possible.
> > >> > >
> > >> > > Sorry if this sounds like a silly ques -- if we are going to have
> > >> > > pattern in selector, select *same pattern from same input vector*,
> > >> > > instead of re-encoding the selector to have N * Ps patterns, would it
> > >> > > make sense for elements in selector to denote pattern number itself
> > >> > > instead of element index
> > >> > > if input vectors are VLA ?
> > >> > >
> > >> > > For eg:
> > >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > >> > > op1 = {...}
> > >> > > with npatterns == 4, nelts_per_pattern == 3,
> > >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > >> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > >> > > Not sure if this is correct tho.
> > >> >
> > >> > This wouldn't allow us to represent things like a "duplicate one
> > >> > element", or "copy the leading N elements from the first input and
> > >> > the other elements from elements N+ of the second input", which we
> > >> > can with the current scheme.
> > >> >
> > >> > The restriction about each (unwound) selector pattern selecting from the
> > >> > same input pattern only applies to case where the selector pattern is
> > >> > stepped (and only applies to the stepped part of the pattern, not the
> > >> > leading element).  The restriction is also local to this code; it
> > >> > doesn't make other VEC_PERM_EXPRs invalid.
> > >> Hi Richard,
> > >> Thanks for the clarifications.
> > >> Just to clarify your approach with an eg:
> > >> Let selected input vector be:
> > >> arg0: {a0, b0, c0, d0,
> > >>           a0 + S, b0 + S, c0 + S, d0 + S,
> > >>           a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
> > >> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> > >>
> > >> Let sel = {0, 0, 1, 2, 2, 4, ...}
> > >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> > >>
> > >> So, the first pattern in sel:
> > >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> > >> which would be incorrect, since they belong to different patterns in arg0.
> > >> So to select elements from same pattern in arg0, we need to divide p1
> > >> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> > >>
> > >> Similarly for second pattern in sel:
> > >> p2: {0, 2, 4, ...}, we need to divide it into
> > >> at least N2 = P_arg0 / S1 = 2 distinct patterns.
> > >>
> > >> Select N = max(N1, N2) = 4
> > >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> > >> and will be re-encoded with N*Ps = 8 patterns:
> > >>
> > >> re-encoded sel:
> > >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> > >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> > >> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> > >> ...}
> > >>
> > >> with 8 patterns,
> > >> p1: {a0, a0 + 2S, a0 + 4S, ...}
> > >> p2: {b0, b0 + 2S, b0 + 4S, ...}
> > >> ...
> > >> which select elements from same pattern from same input vector.
> > >> Does this look correct ?
> > >>
> > >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> > >> arg1_npatterns are powers of 2 and for each stepped pattern,
> > >> its step size S is a power of 2. I suppose this will be sufficient
> > >> to ensure that sel can be re-encoded with N*Ps npatterns
> > >> such that each new pattern selects elements from same pattern
> > >> of the input vector ?
> > >>
> > >> Then compute N:
> > >> N = 1;
> > >> for (every pattern p in sel)
> > >>   {
> > >>      op = corresponding input vector for pattern;
> > >>      S = step_size (p);
> > >>      N_pattern = max (S, npatterns (op)) / S;
> > >>      N = max(N, N_pattern)
> > >>   }
> > >>
> > >> and re-encode selector with N*Ps patterns.
> > >> I guess rest of the patch will mostly stay the same.
> > > Hi,
> > > I have attached a POC patch based on the above approach.
> > > For the above eg:
> > > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> > > and
> > > sel = {0, 0, 0, 1, 0, 2, ...}
> > > with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
> > >
> > > For pattern, {0, 1, 2, ...} it will select elements from different
> > > patterns from arg0, which is incorrect.
> > > So we choose N = P1/S = 2/1 = 2, where P1 is number of patterns in arg0.
> > > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> > > with following patterns:
> > > p1 = { 0, ... }
> > > p2 = { 0, 2, 4, ... }
> > > p3 = { 0, ... }
> > > p4 = { 1, 3, 5, ... }
> > > which should be correct since each element from the respective
> > > patterns in sel chooses
> > > elements from same pattern from arg0.
> > > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> > > Does this look correct ?
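
The example above can be checked element by element with a small standalone sketch. It is not GCC code: expand_encoding and permute are hypothetical helpers that unwind a stepped encoding (3 encoded elements per pattern) into explicit elements and then apply the selector to a single input, assuming every selector index refers to arg0.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not GCC code.  Expand a stepped encoding with
// NPATTERNS interleaved patterns and 3 encoded elements per pattern into
// its first COUNT elements.  Elements past the encoded ones continue each
// pattern using the step between its second and third encoded elements.
std::vector<long>
expand_encoding (const std::vector<long> &encoded, unsigned npatterns,
                 unsigned count)
{
  std::vector<long> out (count);
  for (unsigned i = 0; i < count; i++)
    {
      unsigned pattern = i % npatterns;
      unsigned k = i / npatterns;   // position within the pattern
      if (k < 3)
        out[i] = encoded[i];
      else
        {
          long a1 = encoded[pattern + npatterns];
          long a2 = encoded[pattern + 2 * npatterns];
          out[i] = a2 + (long) (k - 2) * (a2 - a1);
        }
    }
  return out;
}

// Apply selector SEL to the single input ARG0, one element at a time.
// Assumes every index in SEL is within ARG0.
std::vector<long>
permute (const std::vector<long> &arg0, const std::vector<long> &sel)
{
  std::vector<long> res;
  for (long index : sel)
    res.push_back (arg0[index]);
  return res;
}
```

Expanding arg0 = {1, 11, 2, 12, 3, 13, ...} and sel = {0, 0, 0, 1, 0, 2, ...} (both with 2 patterns) to 12 elements and calling permute on them gives {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13}, matching the res quoted above.
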
> >
> > Yeah.  But like I said above:
> >
> >   The restriction about each (unwound) selector pattern selecting from the
> >   same input pattern only applies to case where the selector pattern is
> >   stepped (and only applies to the stepped part of the pattern, not the
> >   leading element).
> >
> > If the selector nelts-per-pattern is 1 or 2 then we can support all
> > power-of-2 cases, with the final npatterns being the maximum of the
> > source nelts-per-patterns.
> >
> > Also, going back to an earlier part of the discussion, I think we
> > should use this technique for both VLA and VLS, and only fall back
> > to the VLS-specific approach if the VLA approach fails.
> >
> > So I suggest we put the VLA code in its own function and have
> > the VLS-only path kick in when the VLA code fails.  If the code is
> > having to pass a lot of state around, it might make sense to define
> > a local class, store the state in member variables, and use member
> > functions for the various subroutines.  I don't know if that will
> > work out neater though.
> Hi Richard,
> Thanks for the suggestions. I have attached an updated POC patch,
> that does the following:
> (a) Uses VLA approach by default, and falls back to VLS specific
> folding if VLA approach fails for VLS vectors.
> (b) Separates cases for sel_nelts_per_pattern < 3 and
> sel_nelts_per_pattern == 3.
> (c) Allows a0 to select a different vector from a1 ... ae.
> I have written a few unit tests in the patch for testing the same.
> Does the patch look in the right direction ?
>
> The patch has an issue for the following case marked as "case 9"
> in test_vec_perm_vla_folding:
> arg0 = { 1, 11, 2, 12, 3, 13, ... }
> arg1 = { 21, 31, 22, 32, 23, 33, ... }
> arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.
>
> mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
> where 4 + 4x is runtime vector length.
> npatterns = 1, nelts_per_pattern = 3.
>
> a1 = 5 + 4x
> ae = a1 + (esel - 2) * S
>      = (5 + 4x) + (4 + 4x - 2) * 1
>      = 7 + 8x
>
> Since can_div_trunc_p fails for (7 + 8x) /trunc (4 + 4x), fold_vec_perm
> returns NULL_TREE.
> Is that expected for the above mask ?
>
> I intended it to select the second vector similar to,
> sel = { 0, 1, 2 .. }, which would select the first vector
> by re-encoding sel as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
> with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
> The first would select elements from first pattern from arg0,
> while the second pattern would select elements from the second pattern of arg0,
> with the result effectively having the same encoding as arg0.
> Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ?
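
The arithmetic in the question above can be checked for individual vector lengths. The sketch below is not GCC code and does not model can_div_trunc_p, which has to establish a quotient symbolically for every x; it only evaluates len = 4 + 4x, a1 = 5 + 4x and ae = 7 + 8x at one concrete x. The helper names len_at, a1_at, ae_at and same_input_at are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not GCC code: evaluate the polynomials from the
// example above at one concrete value of x.
uint64_t len_at (uint64_t x) { return 4 + 4 * x; }   // runtime vector length
uint64_t a1_at (uint64_t x) { return 5 + 4 * x; }    // second selector element
uint64_t ae_at (uint64_t x) { return 7 + 8 * x; }    // last selector element

// True if a1 and ae index the same input vector for this x, i.e. their
// truncating quotients by the vector length agree.
bool
same_input_at (uint64_t x)
{
  return a1_at (x) / len_at (x) == ae_at (x) / len_at (x);
}
```

Since 4 + 4x <= 7 + 8x < 2 * (4 + 4x) for every x >= 0, both quotients are 1 at each concrete x, so same_input_at holds for any sample value tried.
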
Hi Richard,
ping https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html

Thanks,
Prathamesh
>
> PS: I will be on vacation next week.
>
> Thanks,
> Prathamesh
>
> >
> > > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> > >                         build_zero_cst (itype));
> > >  }
> > >
> > > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > > +   and return the selected arg, otherwise return NULL_TREE.  */
> > >
> > > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > > -   true if successful.  */
> > > -
> > > -static bool
> > > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > > +static tree
> > > +get_vector_for_pattern (tree arg0, tree arg1,
> > > +                     const vec_perm_indices &sel, unsigned pattern,
> > > +                     unsigned sel_npatterns, int &S)
> > >  {
> > > -  unsigned HOST_WIDE_INT i, nunits;
> > > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > >
> > > -  if (TREE_CODE (arg) == VECTOR_CST
> > > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +  poly_uint64 nsel = sel.length ();
> > > +  poly_uint64 esel;
> > > +
> > > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > > +    return NULL_TREE;
> > > +
> > > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > > +  S = 0;
> > > +  if (sel_nelts_per_pattern == 3)
> > >      {
> > > -      for (i = 0; i < nunits; ++i)
> > > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > > +      S = (a2 - a1).to_constant ();
> >
> > The code hasn't proven that this to_constant is safe.
> >
> > > +      if (S != 0 && !pow2p_hwi (S))
> > > +     return NULL_TREE;
> > >      }
> > > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > > +
> > > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > > +  uint64_t q1, qe;
> > > +  poly_uint64 r1, re;
> > > +
> > > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > > +      || (q1 != qe))
> > > +    return NULL_TREE;
> >
> > Going back to the above: this check doesn't make sense for
> > sel_nelts_per_pattern != 3.
> >
> > Thanks,
> > Richard
> >
> > > +
> > > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > > +
> > > +  if (S < 0)
> > >      {
> > > -      constructor_elt *elt;
> > > +      poly_uint64 a0 = sel[pattern];
> > > +      if (!known_eq (S, a1 - a0))
> > > +        return NULL_TREE;
> > >
> > > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > > -       return false;
> > > -     else
> > > -       elts[i] = elt->value;
> > > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > > +        return NULL_TREE;
> > >      }
> > > -  else
> > > -    return false;
> > > -  for (; i < nelts; i++)
> > > -    elts[i]
> > > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > > -  return true;
> > > +
> > > +  return arg;
> > >  }
> > >
> > >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> > >    unsigned HOST_WIDE_INT nelts;
> > >    bool need_ctor = false;
> > >
> > > -  if (!sel.length ().is_constant (&nelts))
> > > -    return NULL_TREE;
> > > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> > >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> > >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> > >      return NULL_TREE;
> > >
> > > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > > +  unsigned res_npatterns = 0;
> > > +  unsigned res_nelts_per_pattern = 0;
> > > +  unsigned sel_npatterns = 0;
> > > +  tree *vector_for_pattern = NULL;
> > > +
> > > +  if (TREE_CODE (arg0) == VECTOR_CST
> > > +      && TREE_CODE (arg1) == VECTOR_CST
> > > +      && !sel.length ().is_constant ())
> > > +    {
> > > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > > +      sel_npatterns = sel.encoding ().npatterns ();
> > > +
> > > +      if (!pow2p_hwi (arg0_npatterns)
> > > +       || !pow2p_hwi (arg1_npatterns)
> > > +       || !pow2p_hwi (sel_npatterns))
> > > +        return NULL_TREE;
> > > +
> > > +      unsigned N = 1;
> > > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > > +     {
> > > +       int S = 0;
> > > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > > +       if (!op)
> > > +         return NULL_TREE;
> > > +       vector_for_pattern[i] = op;
> > > +       unsigned N_pattern =
> > > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > > +       N = std::max (N, N_pattern);
> > > +     }
> > > +
> > > +      res_npatterns
> > > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > > +
> > > +      res_nelts_per_pattern
> > > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > > +    }
> > > +  else if (sel.length ().is_constant (&nelts)
> > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > > +    {
> > > +      /* For VLS vectors, treat all vectors with
> > > +      npatterns = nelts, nelts_per_pattern = 1. */
> > > +      res_npatterns = sel_npatterns = nelts;
> > > +      res_nelts_per_pattern = 1;
> > > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > > +      for (unsigned i = 0; i < nelts; i++)
> > > +        {
> > > +       HOST_WIDE_INT index;
> > > +       if (!sel[i].is_constant (&index))
> > > +         return NULL_TREE;
> > > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > > +     }
> > > +    }
> > > +  else
> > >      return NULL_TREE;
> > >
> > > -  tree_vector_builder out_elts (type, nelts, 1);
> > > -  for (i = 0; i < nelts; i++)
> > > +  tree_vector_builder out_elts (type, res_npatterns,
> > > +                             res_nelts_per_pattern);
> > > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > > +  for (unsigned i = 0; i < res_nelts; i++)
> > >      {
> > > -      HOST_WIDE_INT index;
> > > -      if (!sel[i].is_constant (&index))
> > > -     return NULL_TREE;
> > > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > > -     need_ctor = true;
> > > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > > +      /* For VLA vectors, i % sel_npatterns would give the original
> > > +         pattern the element belongs to, which is sufficient to get the arg.
> > > +      Even if sel_npatterns has been multiplied by N,
> > > +      they will always come from the same input vector.
> > > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > > +      so i % sel_npatterns == i since i < nelts */
> > > +
> > > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > > +      unsigned HOST_WIDE_INT index;
> > > +
> > > +      if (arg == arg0)
> > > +     {
> > > +       if (!sel[i].is_constant ())
> > > +         return NULL_TREE;
> > > +       index = sel[i].to_constant ();
> > > +     }
> > > +      else
> > > +        {
> > > +       gcc_assert (arg == arg1);
> > > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +       uint64_t q;
> > > +       poly_uint64 r;
> > > +
> > > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > > +          which would be the index for either input vector.  */
> > > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > > +         return NULL_TREE;
> > > +
> > > +       if (!r.is_constant (&index))
> > > +         return NULL_TREE;
> > > +     }
> > > +
> > > +      tree elem;
> > > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > > +        {
> > > +       gcc_assert (index < nelts);
> > > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > > +         return NULL_TREE;
> > > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > > +         return NULL_TREE;
> > > +       need_ctor = true;
> > > +     }
> > > +      else
> > > +        elem = vector_cst_elt (arg, index);
> > > +      out_elts.quick_push (elem);
> > >      }
> > >
> > >    if (need_ctor)
> > >      {
> > >        vec<constructor_elt, va_gc> *v;
> > > -      vec_alloc (v, nelts);
> > > -      for (i = 0; i < nelts; i++)
> > > +      vec_alloc (v, res_nelts);
> > > +      for (i = 0; i < res_nelts; i++)
> > >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > >        return build_constructor (type, v);
> > >      }
> > > -  else
> > > -    return out_elts.build ();
> > > +  return out_elts.build ();
> > >  }
> > >
> > >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> > >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> > >  }
> > >
> > > +static tree
> > > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > > +                int *encoded_elems)
> > > +{
> > > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > > +  //machine_mode vmode = VNx4SImode;
> > > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > > +
> > > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > > +  return builder.build ();
> > > +}
> > > +
> > > +static void
> > > +test_vec_perm_vla_folding ()
> > > +{
> > > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > > +
> > > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > > +
> > > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > > +    return;
> > > +
> > > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > +     should select arg0.  */
> > > +  {
> > > +    int mask_elems[] = {0, 1, 2};
> > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > +    ASSERT_TRUE (res != NULL_TREE);
> > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > +
> > > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > > +    for (unsigned i = 0; i < res_nelts; i++)
> > > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > > +  }
> > > +
> > > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > +     should return NULL because for len = 4 + 4x,
> > > +     if x == 0, we select from arg1
> > > +     if x > 0, we select from arg0
> > > +     and thus cannot determine result at compile time.  */
> > > +  {
> > > +    int mask_elems[] = {4, 5, 6};
> > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > +    gcc_assert (res == NULL_TREE);
> > > +  }
> > > +
> > > +  /* Case 3:
> > > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > > +     npatterns == 2, nelts_per_pattern == 3
> > > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > > +  {
> > > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > +
> > > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > > +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > > +  }
> > > +
> > > +  /* Case 4:
> > > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > > +     npatterns == 2, nelts_per_pattern == 3
> > > +     Pattern {0, ...} should select arg0[0]
> > > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > > +     a1 = 5 + 4x
> > > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > > +        = 5 + 6x
> > > +     Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ...
> > > +     res: {1, 21, 1, 31, 1, 22, ... }
> > > +     FIXME: How to build vector with poly_int elems ?  */
> > > +
> > > +  /* Case 5: S < 0.  */
> > > +}
> > > +
> > >  /* Run all of the selftests within this file.  */
> > >
> > >  void
> > > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> > >    test_arithmetic_folding ();
> > >    test_vector_folding ();
> > >    test_vec_duplicate_folding ();
> > > +  test_vec_perm_vla_folding ();
> > >  }
> > >
> > >  } // namespace selftest
  
Prathamesh Kulkarni Jan. 17, 2023, 11:54 a.m. UTC | #12
On Mon, 26 Dec 2022 at 09:56, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 13 Dec 2022 at 11:35, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >>
> > > >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> > > >> <richard.sandiford@arm.com> wrote:
> > > >> >
> > > >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > > >> > > <richard.sandiford@arm.com> wrote:
> > > >> > >>
> > > >> > >> Sorry for the slow response.  I wanted to find some time to think
> > > >> > >> about this a bit more.
> > > >> > >>
> > > >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > > >> > >> > <richard.sandiford@arm.com> wrote:
> > > >> > >> >>
> > > >> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > >> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > > >> > >> >> >> For num_poly_int_coeffs == 2,
> > > >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > > >> > >> >> >> If a1/trunc n1 succeeds,
> > > >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > > >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > > >> > >> >> >
> > > >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > > >> > >> >> >
> > > >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > > >> > >> >> >
> > > >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > > >> > >> >>
> > > >> > >> >> Sorry, should have been:
> > > >> > >> >>
> > > >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > > >> > >> > Hi Richard,
> > > >> > >> > Thanks for the clarifications, and sorry for late reply.
> > > >> > >> > I have attached POC patch that tries to implement the above approach.
> > > >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > > >> > >> >
> > > >> > >> > For VLA vectors, I have only done limited testing so far.
> > > >> > >> > It seems to pass couple of tests written in the patch for
> > > >> > >> > nelts_per_pattern == 3,
> > > >> > >> > and folds the following svld1rq test:
> > > >> > >> > int32x4_t v = {1, 2, 3, 4};
> > > >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > > >> > >> > into:
> > > >> > >> > return {1, 2, 3, 4, ...};
> > > >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > > >> > >> >
> > > >> > >> > I have a couple of questions:
> > > >> > >> > 1] When mask selects elements from same vector but from different patterns:
> > > >> > >> > For eg:
> > > >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > > >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > > >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > > >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > > >> > >> >
> > > >> > >> > With above mask,
> > > >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > > >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > > >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > > >> > >> > pattern in arg0.
> > > >> > >> > The result is:
> > > >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > > >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > > >> > >> > with a0 = 1, a1 = 11, S = -9.
> > > >> > >> > Is that expected tho ? It seems to create a new encoding which
> > > >> > >> > wasn't present in the input vector. For instance, the next elem in
> > > >> > >> > sequence would be -7,
> > > >> > >> > which is not present originally in arg0.
> > > >> > >>
> > > >> > >> Yeah, you're right, sorry.  Going back to:
> > > >> > >>
> > > >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > > >> > >>     elements for any integer N.  This extended sequence can be reencoded
> > > >> > >>     as having N*Px patterns, with Ex staying the same.
> > > >> > >>
> > > >> > >> I guess we need to pick an N for the selector such that each new
> > > >> > >> selector pattern (each one out of the N*Px patterns) selects from
> > > >> > >> the *same pattern* of the same data input.
> > > >> > >>
> > > >> > >> So if a particular pattern in the selector has a step S, and the data
> > > >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > > >> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > > >> > >>
> > > >> > >> I think that means that the total number of patterns in the result
> > > >> > >> (Pr from previous messages) can safely be:
> > > >> > >>
> > > >> > >>   Ps * least_common_multiple(
> > > >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > > >> > >>     ...
> > > >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > > >> > >>   )
> > > >> > >>
> > > >> > >> where:
> > > >> > >>
> > > >> > >>   Ps = the number of patterns in the selector
> > > >> > >>   S[I] = the step for selector pattern I (I being 1-based)
> > > >> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > > >> > >>   P[I] = the number of patterns in data input I
> > > >> > >>
> > > >> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > > >> > >> and S[...] then it could also get large.  Perhaps we should finally
> > > >> > >> give up on the general case and limit this to power-of-2 patterns and
> > > >> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > > >> > >> simplifies other things as well.
> > > >> > >>
> > > >> > >> What do you think?
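
A minimal sketch of the pattern-count formula quoted above, written as standalone C++17 rather than against GCC's internal types. The names `sel_pattern_info` and `result_npatterns` are hypothetical and appear nowhere in the patch. Treating a step of 0 as imposing no constraint is an assumption made here for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical description of one selector pattern I:
// its step S[I] and P[input(I)], the number of patterns in the
// data input that the pattern selects from.
struct sel_pattern_info
{
  unsigned step;
  unsigned input_npatterns;
};

// Pr = Ps * lcm over I of (lcm (S[I], P[input(I)]) / S[I]).
static unsigned
result_npatterns (const std::vector<sel_pattern_info> &sel)
{
  unsigned n = 1;
  for (const sel_pattern_info &p : sel)
    {
      // Assumption: an unstepped pattern (step 0) adds no constraint.
      if (p.step == 0)
        continue;
      n = std::lcm (n, std::lcm (p.step, p.input_npatterns) / p.step);
    }
  return static_cast<unsigned> (sel.size ()) * n;
}
```

When every step and pattern count is a power of 2, each `std::lcm` call returns the larger of its two arguments, which is the simplification to MAX suggested above.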
> > > >> > > Hi Richard,
> > > >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > > >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > > >> > > follow up patches if possible.
> > > >> > >
> > > >> > > Sorry if this sounds like a silly ques -- if we are going to have
> > > >> > > pattern in selector, select *same pattern from same input vector*,
> > > >> > > instead of re-encoding the selector to have N * Ps patterns, would it
> > > >> > > make sense for elements in selector to denote pattern number itself
> > > >> > > instead of element index
> > > >> > > if input vectors are VLA ?
> > > >> > >
> > > >> > > For eg:
> > > >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > > >> > > op1 = {...}
> > > >> > > with npatterns == 4, nelts_per_pattern == 3,
> > > >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > > >> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > > >> > > Not sure if this is correct tho.
> > > >> >
> > > >> > This wouldn't allow us to represent things like a "duplicate one
> > > >> > element", or "copy the leading N elements from the first input and
> > > >> > the other elements from elements N+ of the second input", which we
> > > >> > can with the current scheme.
> > > >> >
> > > >> > The restriction about each (unwound) selector pattern selecting from the
> > > >> > same input pattern only applies to case where the selector pattern is
> > > >> > stepped (and only applies to the stepped part of the pattern, not the
> > > >> > leading element).  The restriction is also local to this code; it
> > > >> > doesn't make other VEC_PERM_EXPRs invalid.
> > > >> Hi Richard,
> > > >> Thanks for the clarifications.
> > > >> Just to clarify your approach with an eg:
> > > >> Let selected input vector be:
> > > >> arg0: {a0, b0, c0, d0,
> > > >>           a0 + S, b0 + S, c0 + S, d0 + S,
> > > >>           a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
> > > >> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> > > >>
> > > >> Let sel = {0, 0, 1, 2, 2, 4, ...}
> > > >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> > > >>
> > > >> So, the first pattern in sel:
> > > >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> > > >> which would be incorrect, since they belong to different patterns in arg0.
> > > >> So to select elements from same pattern in arg0, we need to divide p1
> > > >> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> > > >>
> > > >> Similarly for second pattern in sel:
> > > >> p2: {0, 2, 4, ...}, we need to divide it into
> > > >> at least N2 = P_arg0 / S1 = 2 distinct patterns.
> > > >>
> > > >> Select N = max(N1, N2) = 4
> > > >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> > > >> and will be re-encoded with N*Ps = 8 patterns:
> > > >>
> > > >> re-encoded sel:
> > > >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> > > >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> > > >> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> > > >> ...}
> > > >>
> > > >> with 8 patterns,
> > > >> p1: {a0, a0 + 2S, a0 + 4S, ...}
> > > >> p2: {b0, b0 + 2S, b0 + 4S, ...}
> > > >> ...
> > > >> which select elements from same pattern from same input vector.
> > > >> Does this look correct ?
> > > >>
> > > >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> > > >> arg1_npatterns are powers of 2 and for each stepped pattern,
> > > >> its step size S is a power of 2. I suppose this will be sufficient
> > > >> to ensure that sel can be re-encoded with N*Ps npatterns
> > > >> such that each new pattern selects elements from same pattern
> > > >> of the input vector ?
> > > >>
> > > >> Then compute N:
> > > >> N = 1;
> > > >> for (every pattern p in sel)
> > > >>   {
> > > >>      op = corresponding input vector for pattern;
> > > >>      S = step_size (p);
> > > >>      N_pattern = max (S, npatterns (op)) / S;
> > > >>      N = max(N, N_pattern)
> > > >>   }
> > > >>
> > > >> and re-encode selector with N*Ps patterns.
> > > >> I guess rest of the patch will mostly stay the same.
> > > > Hi,
> > > > I have attached a POC patch based on the above approach.
> > > > For the above eg:
> > > > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> > > > and
> > > > sel = {0, 0, 0, 1, 0, 2, ...}
> > > > with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
> > > >
> > > > For pattern, {0, 1, 2, ...} it will select elements from different
> > > > patterns from arg0, which is incorrect.
> > > > > So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
> > > > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> > > > with following patterns:
> > > > p1 = { 0, ... }
> > > > p2 = { 0, 2, 4, ... }
> > > > p3 = { 0, ... }
> > > > p4 = { 1, 3, 5, ... }
> > > > which should be correct since each element from the respective
> > > > patterns in sel chooses
> > > > elements from same pattern from arg0.
> > > > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> > > > Does this look correct ?
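
The re-encoding in this example can be checked with a small standalone C++ model, sketched here under stated assumptions. `encoded_vec`, `elt` and `fold_single_input_perm` are hypothetical names, not GCC functions. The element-extension rule in `elt` (repeat the last encoded element for 1 or 2 elements per pattern, continue the linear series for 3) follows the encoding as described in this thread. The model only handles a selector whose indices all fall within arg0.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a pattern-encoded vector: the leading
// npatterns * nelts_per_pattern elements, interleaved by pattern.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<long> elts;
};

// Return element I of the (conceptually unbounded) vector V.
static long
elt (const encoded_vec &v, unsigned i)
{
  if (i < v.elts.size ())
    return v.elts[i];
  unsigned pattern = i % v.npatterns;
  unsigned count = i / v.npatterns;
  unsigned last = pattern + (v.nelts_per_pattern - 1) * v.npatterns;
  if (v.nelts_per_pattern < 3)
    return v.elts[last];
  // Stepped pattern: continue the series defined by the last two elements.
  long step = v.elts[last] - v.elts[last - v.npatterns];
  return v.elts[last] + static_cast<long> (count - 2) * step;
}

// Permute ARG0 by SEL, encoding the result with N times as many
// patterns as SEL and 3 elements per pattern.
static encoded_vec
fold_single_input_perm (const encoded_vec &arg0, const encoded_vec &sel,
                        unsigned n)
{
  encoded_vec res { sel.npatterns * n, 3, {} };
  for (unsigned i = 0; i < res.npatterns * res.nelts_per_pattern; i++)
    res.elts.push_back (elt (arg0, static_cast<unsigned> (elt (sel, i))));
  return res;
}
```

With arg0 = {1, 11, 2, 12, 3, 13} and sel = {0, 0, 0, 1, 0, 2}, both with 2 patterns of 3 elements, N = 2 gives the 12 encoded elements listed above, and extending that encoding agrees with permuting element by element. With N = 1 the second result pattern is {1, 11, 2}, so its next element would be -7, which is the problem raised earlier in the thread.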
> > >
> > > Yeah.  But like I said above:
> > >
> > >   The restriction about each (unwound) selector pattern selecting from the
> > >   same input pattern only applies to case where the selector pattern is
> > >   stepped (and only applies to the stepped part of the pattern, not the
> > >   leading element).
> > >
> > > If the selector nelts-per-pattern is 1 or 2 then we can support all
> > > power-of-2 cases, with the final npatterns being the maximum of the
> > > source nelts-per-patterns.
> > >
> > > Also, going back to an earlier part of the discussion, I think we
> > > should use this technique for both VLA and VLS, and only fall back
> > > to the VLS-specific approach if the VLA approach fails.
> > >
> > > So I suggest we put the VLA code in its own function and have
> > > the VLS-only path kick in when the VLA code fails.  If the code is
> > > having to pass a lot of state around, it might make sense to define
> > > a local class, store the state in member variables, and use member
> > > functions for the various subroutines.  I don't know if that will
> > > work out neater though.
> > Hi Richard,
> > Thanks for the suggestions. I have attached an updated POC patch,
> > that does the following:
> > (a) Uses VLA approach by default, and falls back to VLS specific
> > folding if VLA approach fails for VLS vectors.
> > (b) Separates cases for sel_nelts_per_pattern < 3 and
> > sel_nelts_per_pattern == 3.
> > (c) Allows a0 to select a different vector from a1 .. ae.
> > I have written a few unit tests in the patch for testing the same.
> > Does the patch look in the right direction ?
> >
> > The patch has an issue for the following case marked as "case 9"
> > in test_vec_perm_vla_folding:
> > arg0 = { 1, 11, 2, 12, 3, 13, ... }
> > arg1 = { 21, 31, 22, 32, 23, 33, ... }
> > arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.
> >
> > mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
> > where 4 + 4x is runtime vector length.
> > npatterns = 1, nelts_per_pattern = 3.
> >
> > a1 = 5 + 4x
> > ae = a1 + (esel - 2) * S
> >      = (5 + 4x) + (4 + 4x - 2) * 1
> >      = 7 + 8x
> >
> > Since (7 + 8x) /trunc (4 + 4x) returns false, fold_vec_perm returns NULL_TREE.
> > Is that expected for the above mask ?
> >
> > I intended it to select the second vector similar to,
> > sel = { 0, 1, 2 .. }, which would select the first vector
> > by re-encoding sel as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
> > with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
> > The first would select elements from first pattern from arg0,
> > while the second pattern would select elements from second pattern from arg0.
> > with result effectively having same encoding as arg0.
> > Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ?
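
The arithmetic for ae in "case 9" can be reproduced with a small standalone C++ sketch. `poly2` is a hypothetical two-coefficient stand-in for a `poly_uint64` of the form c0 + c1 * x, and `last_selected_index` is likewise an illustrative name. The sketch only evaluates the polynomials at sample values of x. It makes no claim about how `can_div_trunc_p` reaches its answer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of c0 + c1 * x, where x is the runtime
// vector-length parameter (x >= 0).
struct poly2
{
  uint64_t c0, c1;
  uint64_t at (uint64_t x) const { return c0 + c1 * x; }
};

static poly2 add (poly2 a, poly2 b) { return { a.c0 + b.c0, a.c1 + b.c1 }; }
static poly2 scale (poly2 a, uint64_t k) { return { a.c0 * k, a.c1 * k }; }

// ae = a1 + (esel - 2) * S, the last index selected by a stepped
// pattern with ESEL elements.  Assumes esel.c0 >= 2.
static poly2
last_selected_index (poly2 a1, poly2 esel, uint64_t step)
{
  poly2 esel_minus_2 = { esel.c0 - 2, esel.c1 };
  return add (a1, scale (esel_minus_2, step));
}
```

For a1 = 5 + 4x, esel = 4 + 4x and S = 1 this gives ae = 7 + 8x, as computed above. Evaluated at x = 0, 1, 2, ... the truncating quotient of ae by 4 + 4x is 1 each time, while the remainder 3 + 4x varies with x.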
> Hi Richard,
> ping https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html
Hi Richard,
ping * 2: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > PS: I will be on vacation next week.
> >
> > Thanks,
> > Prathamesh
> >
> > >
> > > > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> > > >                         build_zero_cst (itype));
> > > >  }
> > > >
> > > > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > > > +   and return the selected arg, otherwise return NULL_TREE.  */
> > > >
> > > > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > > > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > > > -   true if successful.  */
> > > > -
> > > > -static bool
> > > > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > > > +static tree
> > > > +get_vector_for_pattern (tree arg0, tree arg1,
> > > > +                     const vec_perm_indices &sel, unsigned pattern,
> > > > +                     unsigned sel_npatterns, int &S)
> > > >  {
> > > > -  unsigned HOST_WIDE_INT i, nunits;
> > > > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > > >
> > > > -  if (TREE_CODE (arg) == VECTOR_CST
> > > > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > > > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > > +  poly_uint64 nsel = sel.length ();
> > > > +  poly_uint64 esel;
> > > > +
> > > > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > > > +    return NULL_TREE;
> > > > +
> > > > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > > > +  S = 0;
> > > > +  if (sel_nelts_per_pattern == 3)
> > > >      {
> > > > -      for (i = 0; i < nunits; ++i)
> > > > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > > > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > > > +      S = (a2 - a1).to_constant ();
> > >
> > > The code hasn't proven that this to_constant is safe.
> > >
> > > > +      if (S != 0 && !pow2p_hwi (S))
> > > > +     return NULL_TREE;
> > > >      }
> > > > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > > > +
> > > > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > > > +  uint64_t q1, qe;
> > > > +  poly_uint64 r1, re;
> > > > +
> > > > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > > > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > > > +      || (q1 != qe))
> > > > +    return NULL_TREE;
> > >
> > > Going back to the above: this check doesn't make sense for
> > > sel_nelts_per_pattern != 3.
> > >
> > > Thanks,
> > > Richard
> > >
> > > > +
> > > > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > > > +
> > > > +  if (S < 0)
> > > >      {
> > > > -      constructor_elt *elt;
> > > > +      poly_uint64 a0 = sel[pattern];
> > > > +      if (!known_eq (S, a1 - a0))
> > > > +        return NULL_TREE;
> > > >
> > > > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > > > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > > > -       return false;
> > > > -     else
> > > > -       elts[i] = elt->value;
> > > > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > > > +        return NULL_TREE;
> > > >      }
> > > > -  else
> > > > -    return false;
> > > > -  for (; i < nelts; i++)
> > > > -    elts[i]
> > > > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > > > -  return true;
> > > > +
> > > > +  return arg;
> > > >  }
> > > >
> > > >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > > > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> > > >    unsigned HOST_WIDE_INT nelts;
> > > >    bool need_ctor = false;
> > > >
> > > > -  if (!sel.length ().is_constant (&nelts))
> > > > -    return NULL_TREE;
> > > > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > > > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > > > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > > > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> > > >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> > > >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> > > >      return NULL_TREE;
> > > >
> > > > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > > > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > > > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > > > +  unsigned res_npatterns = 0;
> > > > +  unsigned res_nelts_per_pattern = 0;
> > > > +  unsigned sel_npatterns = 0;
> > > > +  tree *vector_for_pattern = NULL;
> > > > +
> > > > +  if (TREE_CODE (arg0) == VECTOR_CST
> > > > +      && TREE_CODE (arg1) == VECTOR_CST
> > > > +      && !sel.length ().is_constant ())
> > > > +    {
> > > > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > > > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > > > +      sel_npatterns = sel.encoding ().npatterns ();
> > > > +
> > > > +      if (!pow2p_hwi (arg0_npatterns)
> > > > +       || !pow2p_hwi (arg1_npatterns)
> > > > +       || !pow2p_hwi (sel_npatterns))
> > > > +        return NULL_TREE;
> > > > +
> > > > +      unsigned N = 1;
> > > > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > > > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > > > +     {
> > > > +       int S = 0;
> > > > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > > > +       if (!op)
> > > > +         return NULL_TREE;
> > > > +       vector_for_pattern[i] = op;
> > > > +       unsigned N_pattern =
> > > > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > > > +       N = std::max (N, N_pattern);
> > > > +     }
> > > > +
> > > > +      res_npatterns
> > > > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > > > +
> > > > +      res_nelts_per_pattern
> > > > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > > > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > > > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > > > +    }
> > > > +  else if (sel.length ().is_constant (&nelts)
> > > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > > > +    {
> > > > +      /* For VLS vectors, treat all vectors with
> > > > +      npatterns = nelts, nelts_per_pattern = 1. */
> > > > +      res_npatterns = sel_npatterns = nelts;
> > > > +      res_nelts_per_pattern = 1;
> > > > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > > > +      for (unsigned i = 0; i < nelts; i++)
> > > > +        {
> > > > +       HOST_WIDE_INT index;
> > > > +       if (!sel[i].is_constant (&index))
> > > > +         return NULL_TREE;
> > > > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > > > +     }
> > > > +    }
> > > > +  else
> > > >      return NULL_TREE;
> > > >
> > > > -  tree_vector_builder out_elts (type, nelts, 1);
> > > > -  for (i = 0; i < nelts; i++)
> > > > +  tree_vector_builder out_elts (type, res_npatterns,
> > > > +                             res_nelts_per_pattern);
> > > > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > > > +  for (unsigned i = 0; i < res_nelts; i++)
> > > >      {
> > > > -      HOST_WIDE_INT index;
> > > > -      if (!sel[i].is_constant (&index))
> > > > -     return NULL_TREE;
> > > > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > > > -     need_ctor = true;
> > > > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > > > +      /* For VLA vectors, i % sel_npatterns would give the original
> > > > +         pattern the element belongs to, which is sufficient to get the arg.
> > > > +      Even if sel_npatterns has been multiplied by N,
> > > > +      they will always come from the same input vector.
> > > > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > > > +      so i % sel_npatterns == i since i < nelts */
> > > > +
> > > > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > > > +      unsigned HOST_WIDE_INT index;
> > > > +
> > > > +      if (arg == arg0)
> > > > +     {
> > > > +       if (!sel[i].is_constant ())
> > > > +         return NULL_TREE;
> > > > +       index = sel[i].to_constant ();
> > > > +     }
> > > > +      else
> > > > +        {
> > > > +       gcc_assert (arg == arg1);
> > > > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > > +       uint64_t q;
> > > > +       poly_uint64 r;
> > > > +
> > > > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > > > +          which would be the index for either input vector.  */
> > > > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > > > +         return NULL_TREE;
> > > > +
> > > > +       if (!r.is_constant (&index))
> > > > +         return NULL_TREE;
> > > > +     }
> > > > +
> > > > +      tree elem;
> > > > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > > > +        {
> > > > +       gcc_assert (index < nelts);
> > > > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > > > +         return NULL_TREE;
> > > > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > > > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > > > +         return NULL_TREE;
> > > > +       need_ctor = true;
> > > > +     }
> > > > +      else
> > > > +        elem = vector_cst_elt (arg, index);
> > > > +      out_elts.quick_push (elem);
> > > >      }
> > > >
> > > >    if (need_ctor)
> > > >      {
> > > >        vec<constructor_elt, va_gc> *v;
> > > > -      vec_alloc (v, nelts);
> > > > -      for (i = 0; i < nelts; i++)
> > > > +      vec_alloc (v, res_nelts);
> > > > +      for (i = 0; i < res_nelts; i++)
> > > >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > > >        return build_constructor (type, v);
> > > >      }
> > > > -  else
> > > > -    return out_elts.build ();
> > > > +  return out_elts.build ();
> > > >  }
> > > >
> > > >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > > > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> > > >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> > > >  }
> > > >
> > > > +static tree
> > > > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > > > +                int *encoded_elems)
> > > > +{
> > > > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > > > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > > > +  //machine_mode vmode = VNx4SImode;
> > > > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > > > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > > > +
> > > > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > > > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > > > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > > > +  return builder.build ();
> > > > +}
> > > > +
> > > > +static void
> > > > +test_vec_perm_vla_folding ()
> > > > +{
> > > > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > > > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > > > +
> > > > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > > > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > > > +
> > > > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > > > +    return;
> > > > +
> > > > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > > +     should select arg0.  */
> > > > +  {
> > > > +    int mask_elems[] = {0, 1, 2};
> > > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > +    ASSERT_TRUE (res != NULL_TREE);
> > > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > > +
> > > > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > > > +    for (unsigned i = 0; i < res_nelts; i++)
> > > > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > > > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > > > +  }
> > > > +
> > > > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > > +     should return NULL because for len = 4 + 4x,
> > > > +     if x == 0, we select from arg1
> > > > +     if x > 0, we select from arg0
> > > > +     and thus cannot determine result at compile time.  */
> > > > +  {
> > > > +    int mask_elems[] = {4, 5, 6};
> > > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > +    gcc_assert (res == NULL_TREE);
> > > > +  }
> > > > +
> > > > +  /* Case 3:
> > > > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > > > +     npatterns == 2, nelts_per_pattern == 3
> > > > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > > > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > > > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > > > +  {
> > > > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > > > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > > +
> > > > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > > > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > > > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > > > +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > > > +  }
> > > > +
> > > > +  /* Case 4:
> > > > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > > > +     npatterns == 2, nelts_per_pattern == 3
> > > > > +     Pattern {0, ...} should select arg0[0]
> > > > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > > > +     a1 = 5 + 4x
> > > > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > > > +        = 5 + 6x
> > > > > +     Since a1 / (4 + 4x) == ae / (4 + 4x) == 1, we select arg1[0], arg1[1], arg1[2], ...
> > > > +     res: {1, 21, 1, 31, 1, 22, ... }
> > > > +     FIXME: How to build vector with poly_int elems ?  */
> > > > +
> > > > +  /* Case 5: S < 0.  */
> > > > +}
> > > > +
> > > >  /* Run all of the selftests within this file.  */
> > > >
> > > >  void
> > > > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> > > >    test_arithmetic_folding ();
> > > >    test_vector_folding ();
> > > >    test_vec_duplicate_folding ();
> > > > +  test_vec_perm_vla_folding ();
> > > >  }
> > > >
> > > >  } // namespace selftest
  
Prathamesh Kulkarni Feb. 1, 2023, 10:01 a.m. UTC | #13
On Tue, 17 Jan 2023 at 17:24, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 26 Dec 2022 at 09:56, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 13 Dec 2022 at 11:35, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > > >
> > > > Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >>
> > > > >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> > > > >> <richard.sandiford@arm.com> wrote:
> > > > >> >
> > > > >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > > > >> > > <richard.sandiford@arm.com> wrote:
> > > > >> > >>
> > > > >> > >> Sorry for the slow response.  I wanted to find some time to think
> > > > >> > >> about this a bit more.
> > > > >> > >>
> > > > >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > > > >> > >> > <richard.sandiford@arm.com> wrote:
> > > > >> > >> >>
> > > > >> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > >> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > > > >> > >> >> >> For num_poly_int_coeffs == 2,
> > > > >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > > > >> > >> >> >> If a1/trunc n1 succeeds,
> > > > >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > > > >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > > > >> > >> >> >
> > > > >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > > > >> > >> >> >
> > > > >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > > > >> > >> >> >
> > > > >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > > > >> > >> >>
> > > > >> > >> >> Sorry, should have been:
> > > > >> > >> >>
> > > > >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > > > >> > >> > Hi Richard,
> > > > >> > >> > Thanks for the clarifications, and sorry for late reply.
> > > > >> > >> > I have attached POC patch that tries to implement the above approach.
> > > > >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > > > >> > >> >
> > > > >> > >> > For VLA vectors, I have only done limited testing so far.
> > > > >> > >> > It seems to pass couple of tests written in the patch for
> > > > >> > >> > nelts_per_pattern == 3,
> > > > >> > >> > and folds the following svld1rq test:
> > > > >> > >> > int32x4_t v = {1, 2, 3, 4};
> > > > >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > > > >> > >> > into:
> > > > >> > >> > return {1, 2, 3, 4, ...};
> > > > >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > > > >> > >> >
> > > > >> > >> > I have a couple of questions:
> > > > >> > >> > 1] When mask selects elements from same vector but from different patterns:
> > > > >> > >> > For eg:
> > > > >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > > > >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > > > >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > > > >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > > > >> > >> >
> > > > >> > >> > With above mask,
> > > > >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > > > >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > > > >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > > > >> > >> > pattern in arg0.
> > > > >> > >> > The result is:
> > > > >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > > > > >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded
> > > > > >> > >> > with a0 = 1, a1 = 11, S = -9.
> > > > >> > >> > Is that expected tho ? It seems to create a new encoding which
> > > > >> > >> > wasn't present in the input vector. For instance, the next elem in
> > > > >> > >> > sequence would be -7,
> > > > >> > >> > which is not present originally in arg0.
> > > > >> > >>
> > > > >> > >> Yeah, you're right, sorry.  Going back to:
> > > > >> > >>
> > > > >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > > > >> > >>     elements for any integer N.  This extended sequence can be reencoded
> > > > >> > >>     as having N*Px patterns, with Ex staying the same.
> > > > >> > >>
> > > > >> > >> I guess we need to pick an N for the selector such that each new
> > > > >> > >> selector pattern (each one out of the N*Px patterns) selects from
> > > > >> > >> the *same pattern* of the same data input.
> > > > >> > >>
> > > > >> > >> So if a particular pattern in the selector has a step S, and the data
> > > > >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > > > >> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > > > >> > >>
> > > > >> > >> I think that means that the total number of patterns in the result
> > > > >> > >> (Pr from previous messages) can safely be:
> > > > >> > >>
> > > > >> > >>   Ps * least_common_multiple(
> > > > >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > > > >> > >>     ...
> > > > >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > > > >> > >>   )
> > > > >> > >>
> > > > >> > >> where:
> > > > >> > >>
> > > > >> > >>   Ps = the number of patterns in the selector
> > > > >> > >>   S[I] = the step for selector pattern I (I being 1-based)
> > > > >> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > > > >> > >>   P[I] = the number of patterns in data input I
> > > > >> > >>
> > > > >> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > > > >> > >> and S[...] then it could also get large.  Perhaps we should finally
> > > > >> > >> give up on the general case and limit this to power-of-2 patterns and
> > > > >> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > > > >> > >> simplifies other things as well.
> > > > >> > >>
> > > > >> > >> What do you think?
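
For the power-of-2 restriction suggested above, the formula collapses as follows. This is only a worked sketch, assuming every S[I] and P[...] is a nonzero power of 2; the exponents s and p are names introduced here for illustration and do not appear in the thread.

```latex
\begin{aligned}
S = 2^{s},\ P = 2^{p}
  &\implies \operatorname{lcm}(S, P) = 2^{\max(s,\,p)} = \max(S, P) \\
  &\implies \frac{\operatorname{lcm}(S, P)}{S} = \frac{\max(S, P)}{S}
     = \max\!\left(1, \frac{P}{S}\right) \\
P_r &= P_s \cdot \max_{1 \le I \le P_s}
     \max\!\left(1, \frac{P[\mathrm{input}(I)]}{S[I]}\right)
\end{aligned}
```

The outer least_common_multiple also becomes a MAX because each per-pattern factor is itself a power of 2.
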
> > > > >> > > Hi Richard,
> > > > >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > > > >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > > > >> > > follow up patches if possible.
> > > > >> > >
> > > > >> > > Sorry if this sounds like a silly ques -- if we are going to have
> > > > >> > > pattern in selector, select *same pattern from same input vector*,
> > > > >> > > instead of re-encoding the selector to have N * Ps patterns, would it
> > > > >> > > make sense for elements in selector to denote pattern number itself
> > > > >> > > instead of element index
> > > > >> > > if input vectors are VLA ?
> > > > >> > >
> > > > >> > > For eg:
> > > > >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > > > >> > > op1 = {...}
> > > > >> > > with npatterns == 4, nelts_per_pattern == 3,
> > > > >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > > > >> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > > > >> > > Not sure if this is correct tho.
> > > > >> >
> > > > >> > This wouldn't allow us to represent things like a "duplicate one
> > > > >> > element", or "copy the leading N elements from the first input and
> > > > >> > the other elements from elements N+ of the second input", which we
> > > > >> > can with the current scheme.
> > > > >> >
> > > > >> > The restriction about each (unwound) selector pattern selecting from the
> > > > >> > same input pattern only applies to case where the selector pattern is
> > > > >> > stepped (and only applies to the stepped part of the pattern, not the
> > > > >> > leading element).  The restriction is also local to this code; it
> > > > >> > doesn't make other VEC_PERM_EXPRs invalid.
> > > > >> Hi Richard,
> > > > >> Thanks for the clarifications.
> > > > >> Just to clarify your approach with an eg:
> > > > >> Let selected input vector be:
> > > > >> arg0: {a0, b0, c0, d0,
> > > > >>           a0 + S, b0 + S, c0 + S, d0 + S,
> > > > >>           a0 + 2S, b0 + 2S, c0 + 2S, dd + 2S, ...}
> > > > >> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> > > > >>
> > > > >> Let sel = {0, 0, 1, 2, 2, 4, ...}
> > > > >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> > > > >>
> > > > >> So, the first pattern in sel:
> > > > >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> > > > >> which would be incorrect, since they belong to different patterns in arg0.
> > > > >> So to select elements from same pattern in arg0, we need to divide p1
> > > > >> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> > > > >>
> > > > >> Similarly for second pattern in sel:
> > > > >> p2: {0, 2, 4, ...}, we need to divide it into
> > > > >> at least N2 = P_arg0 / S1 = 2 distinct patterns.
> > > > >>
> > > > >> Select N = max(N1, N2) = 4
> > > > >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> > > > >> and will be re-encoded with N*Ps = 8 patterns:
> > > > >>
> > > > >> re-encoded sel:
> > > > >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> > > > >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> > > > > >> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> > > > >> ...}
> > > > >>
> > > > >> with 8 patterns,
> > > > >> p1: {a0, a0 + 2S, a0 + 4S, ...}
> > > > >> p2: {b0, b0 + 2S, b0 + 4S, ...}
> > > > >> ...
> > > > >> which select elements from same pattern from same input vector.
> > > > >> Does this look correct ?
> > > > >>
> > > > >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> > > > >> arg1_npatterns are powers of 2 and for each stepped pattern,
> > > > > >> its step size S is a power of 2. I suppose this will be sufficient
> > > > >> to ensure that sel can be re-encoded with N*Ps npatterns
> > > > >> such that each new pattern selects elements from same pattern
> > > > >> of the input vector ?
> > > > >>
> > > > >> Then compute N:
> > > > >> N = 1;
> > > > >> for (every pattern p in sel)
> > > > >>   {
> > > > >>      op = corresponding input vector for pattern;
> > > > >>      S = step_size (p);
> > > > >>      N_pattern = max (S, npatterns (op)) / S;
> > > > >>      N = max(N, N_pattern)
> > > > >>   }
> > > > >>
> > > > >> and re-encode selector with N*Ps patterns.
> > > > >> I guess rest of the patch will mostly stay the same.
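
The pseudocode above could be sketched standalone roughly as below. This is not the patch's code: `sel_pattern`, `is_pow2` and `compute_n` are hypothetical names, plain unsigned integers stand in for GCC's trees and poly_ints, and treating a zero step as "no constraint" is an assumption based on the restriction applying only to stepped patterns. Negative steps are not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// True if X is a nonzero power of two.
static bool
is_pow2 (unsigned x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

// Hypothetical stand-in for one selector pattern: its step S and the
// npatterns of the input vector that the pattern selects from.
struct sel_pattern
{
  unsigned step;          // S; 0 means the pattern is not stepped
  unsigned op_npatterns;  // npatterns (op)
};

// Return N = max over patterns of max (S, npatterns (op)) / S,
// or 0 if some step or npatterns is not a power of two.
unsigned
compute_n (const std::vector<sel_pattern> &sel)
{
  unsigned n = 1;
  for (const sel_pattern &p : sel)
    {
      if (!is_pow2 (p.op_npatterns))
        return 0;
      if (p.step == 0)
        continue;  // assumption: unstepped patterns impose no constraint
      if (!is_pow2 (p.step))
        return 0;
      unsigned n_pattern = std::max (p.step, p.op_npatterns) / p.step;
      n = std::max (n, n_pattern);
    }
  return n;
}
```

For the example above (steps 1 and 2 selecting from an input with 4 patterns) this gives N1 = 4, N2 = 2 and N = 4, so the selector would be re-encoded with N*Ps = 8 patterns.
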
> > > > > Hi,
> > > > > I have attached a POC patch based on the above approach.
> > > > > For the above eg:
> > > > > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> > > > > and
> > > > > sel = {0, 0, 0, 1, 0, 2, ...}
> > > > > with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
> > > > >
> > > > > For pattern, {0, 1, 2, ...} it will select elements from different
> > > > > patterns from arg0, which is incorrect.
> > > > > > So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
> > > > > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> > > > > with following patterns:
> > > > > p1 = { 0, ... }
> > > > > p2 = { 0, 2, 4, ... }
> > > > > p3 = { 0, ... }
> > > > > p4 = { 1, 3, 5, ... }
> > > > > which should be correct since each element from the respective
> > > > > patterns in sel chooses
> > > > > elements from same pattern from arg0.
> > > > > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> > > > > Does this look correct ?
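
The example above can be checked with a small standalone model of the stepped encoding. This is a sketch only: `encoded_vec`, `decode_elt` and `permute_leading` are hypothetical names, elements are plain longs rather than trees, and every selector index is assumed to refer to arg0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model of a stepped vector encoding: ELEMS holds the
// npatterns * nelts_per_pattern leading elements.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<long> elems;
};

// Return element I of the (conceptually unbounded) vector V.
long
decode_elt (const encoded_vec &v, std::size_t i)
{
  std::size_t encoded_nelts = v.elems.size ();
  assert (encoded_nelts == std::size_t (v.npatterns) * v.nelts_per_pattern);
  if (i < encoded_nelts)
    return v.elems[i];
  std::size_t pattern = i % v.npatterns;
  std::size_t count = i / v.npatterns;
  // Last encoded element of the pattern that contains element I.
  std::size_t final_i = encoded_nelts - v.npatterns + pattern;
  long final_elt = v.elems[final_i];
  if (v.nelts_per_pattern <= 2)
    return final_elt;
  // Stepped pattern: continue the series from the last two elements.
  long step = final_elt - v.elems[final_i - v.npatterns];
  return final_elt + long (count - 2) * step;
}

// Return the leading NELTS elements of permuting ARG0 by SEL.
std::vector<long>
permute_leading (const encoded_vec &arg0, const encoded_vec &sel,
                 std::size_t nelts)
{
  std::vector<long> res;
  for (std::size_t i = 0; i < nelts; i++)
    res.push_back (decode_elt (arg0, std::size_t (decode_elt (sel, i))));
  return res;
}
```

With arg0 = {2, 3, {1, 11, 2, 12, 3, 13}} and sel = {2, 3, {0, 0, 0, 1, 0, 2}}, the 12 leading elements come out as { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13 }. Decoding those as 4 patterns of 3 elements agrees with the element-by-element permute further along the vector, whereas decoding only the first 6 as 2 patterns of 3 extrapolates the second pattern with step -9.
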
> > > >
> > > > Yeah.  But like I said above:
> > > >
> > > >   The restriction about each (unwound) selector pattern selecting from the
> > > >   same input pattern only applies to case where the selector pattern is
> > > >   stepped (and only applies to the stepped part of the pattern, not the
> > > >   leading element).
> > > >
> > > > If the selector nelts-per-pattern is 1 or 2 then we can support all
> > > > power-of-2 cases, with the final npatterns being the maximum of the
> > > > source nelts-per-patterns.
> > > >
> > > > Also, going back to an earlier part of the discussion, I think we
> > > > should use this technique for both VLA and VLS, and only fall back
> > > > to the VLS-specific approach if the VLA approach fails.
> > > >
> > > > So I suggest we put the VLA code in its own function and have
> > > > the VLS-only path kick in when the VLA code fails.  If the code is
> > > > having to pass a lot of state around, it might make sense to define
> > > > a local class, store the state in member variables, and use member
> > > > functions for the various subroutines.  I don't know if that will
> > > > work out neater though.
> > > Hi Richard,
> > > Thanks for the suggestions. I have attached an updated POC patch,
> > > that does the following:
> > > (a) Uses VLA approach by default, and falls back to VLS specific
> > > folding if VLA approach fails for VLS vectors.
> > > (b) Separates cases for sel_nelts_per_pattern < 3 and
> > > sel_nelts_per_pattern == 3.
> > > (c) Allows a0 to select a different vector from a1 .. ae.
> > > I have written a few unit tests in the patch for testing the same.
> > > Does the patch look in the right direction ?
> > >
> > > The patch has an issue for the following case marked as "case 9"
> > > in test_vec_perm_vla_folding:
> > > arg0 = { 1, 11, 2, 12, 3, 13, ... }
> > > arg1 = { 21, 31, 22, 32, 23, 33, ... }
> > > arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.
> > >
> > > mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
> > > where 4 + 4x is runtime vector length.
> > > npatterns = 1, nelts_per_pattern = 3.
> > >
> > > a1 = 5 + 4x
> > > ae = a1 + (esel - 2) * S
> > >      = (5 + 4x) + (4 + 4x - 2) * 1
> > >      = 7 + 8x
> > >
> > > Since (7 + 8x) /trunc (4 + 4x) returns false, fold_vec_perm returns NULL_TREE.
> > > Is that expected for the above mask ?
> > >
> > > I intended it to select the second vector similar to,
> > > sel = { 0, 1, 2 .. }, which would select the first vector
> > > by re-encoding sel as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
> > > with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
> > > The first would select elements from first pattern from arg0,
> > > while the second pattern would select elements from second pattern from arg0.
> > > with result effectively having same encoding as arg0.
> > > Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ?
> > Hi Richard,
> > ping https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html
> Hi Richard,
> ping * 2: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html
Hi Richard,
ping * 3: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > PS: I will be on vacation next week.
> > >
> > > Thanks,
> > > Prathamesh
> > >
> > > >
> > > > > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> > > > >                         build_zero_cst (itype));
> > > > >  }
> > > > >
> > > > > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > > > > +   and return the selected arg, otherwise return NULL_TREE.  */
> > > > >
> > > > > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > > > > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > > > > -   true if successful.  */
> > > > > -
> > > > > -static bool
> > > > > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > > > > +static tree
> > > > > +get_vector_for_pattern (tree arg0, tree arg1,
> > > > > +                     const vec_perm_indices &sel, unsigned pattern,
> > > > > +                     unsigned sel_npatterns, int &S)
> > > > >  {
> > > > > -  unsigned HOST_WIDE_INT i, nunits;
> > > > > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > > > >
> > > > > -  if (TREE_CODE (arg) == VECTOR_CST
> > > > > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > > > > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > > > +  poly_uint64 nsel = sel.length ();
> > > > > +  poly_uint64 esel;
> > > > > +
> > > > > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > > > > +    return NULL_TREE;
> > > > > +
> > > > > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > > > > +  S = 0;
> > > > > +  if (sel_nelts_per_pattern == 3)
> > > > >      {
> > > > > -      for (i = 0; i < nunits; ++i)
> > > > > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > > > > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > > > > +      S = (a2 - a1).to_constant ();
> > > >
> > > > The code hasn't proven that this to_constant is safe.
> > > >
> > > > > +      if (S != 0 && !pow2p_hwi (S))
> > > > > +     return NULL_TREE;
> > > > >      }
> > > > > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > > > > +
> > > > > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > > > > +  uint64_t q1, qe;
> > > > > +  poly_uint64 r1, re;
> > > > > +
> > > > > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > > > > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > > > > +      || (q1 != qe))
> > > > > +    return NULL_TREE;
> > > >
> > > > Going back to the above: this check doesn't make sense for
> > > > sel_nelts_per_pattern != 3.
> > > >
> > > > Thanks,
> > > > Richard
> > > >
> > > > > +
> > > > > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > > > > +
> > > > > +  if (S < 0)
> > > > >      {
> > > > > -      constructor_elt *elt;
> > > > > +      poly_uint64 a0 = sel[pattern];
> > > > > +      if (!known_eq (S, a1 - a0))
> > > > > +        return NULL_TREE;
> > > > >
> > > > > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > > > > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > > > > -       return false;
> > > > > -     else
> > > > > -       elts[i] = elt->value;
> > > > > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > > > > +        return NULL_TREE;
> > > > >      }
> > > > > -  else
> > > > > -    return false;
> > > > > -  for (; i < nelts; i++)
> > > > > -    elts[i]
> > > > > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > > > > -  return true;
> > > > > +
> > > > > +  return arg;
> > > > >  }
> > > > >
> > > > >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > > > > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> > > > >    unsigned HOST_WIDE_INT nelts;
> > > > >    bool need_ctor = false;
> > > > >
> > > > > -  if (!sel.length ().is_constant (&nelts))
> > > > > -    return NULL_TREE;
> > > > > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > > > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > > > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > > > > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > > > > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > > > > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> > > > >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> > > > >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> > > > >      return NULL_TREE;
> > > > >
> > > > > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > > > > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > > > > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > > > > +  unsigned res_npatterns = 0;
> > > > > +  unsigned res_nelts_per_pattern = 0;
> > > > > +  unsigned sel_npatterns = 0;
> > > > > +  tree *vector_for_pattern = NULL;
> > > > > +
> > > > > +  if (TREE_CODE (arg0) == VECTOR_CST
> > > > > +      && TREE_CODE (arg1) == VECTOR_CST
> > > > > +      && !sel.length ().is_constant ())
> > > > > +    {
> > > > > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > > > > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > > > > +      sel_npatterns = sel.encoding ().npatterns ();
> > > > > +
> > > > > +      if (!pow2p_hwi (arg0_npatterns)
> > > > > +       || !pow2p_hwi (arg1_npatterns)
> > > > > +       || !pow2p_hwi (sel_npatterns))
> > > > > +        return NULL_TREE;
> > > > > +
> > > > > +      unsigned N = 1;
> > > > > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > > > > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > > > > +     {
> > > > > +       int S = 0;
> > > > > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > > > > +       if (!op)
> > > > > +         return NULL_TREE;
> > > > > +       vector_for_pattern[i] = op;
> > > > > +       unsigned N_pattern =
> > > > > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > > > > +       N = std::max (N, N_pattern);
> > > > > +     }
> > > > > +
> > > > > +      res_npatterns
> > > > > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > > > > +
> > > > > +      res_nelts_per_pattern
> > > > > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > > > > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > > > > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > > > > +    }
> > > > > +  else if (sel.length ().is_constant (&nelts)
> > > > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > > > > +    {
> > > > > +      /* For VLS vectors, treat all vectors with
> > > > > +      npatterns = nelts, nelts_per_pattern = 1. */
> > > > > +      res_npatterns = sel_npatterns = nelts;
> > > > > +      res_nelts_per_pattern = 1;
> > > > > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > > > > +      for (unsigned i = 0; i < nelts; i++)
> > > > > +        {
> > > > > +       HOST_WIDE_INT index;
> > > > > +       if (!sel[i].is_constant (&index))
> > > > > +         return NULL_TREE;
> > > > > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > > > > +     }
> > > > > +    }
> > > > > +  else
> > > > >      return NULL_TREE;
> > > > >
> > > > > -  tree_vector_builder out_elts (type, nelts, 1);
> > > > > -  for (i = 0; i < nelts; i++)
> > > > > +  tree_vector_builder out_elts (type, res_npatterns,
> > > > > +                             res_nelts_per_pattern);
> > > > > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > > > > +  for (unsigned i = 0; i < res_nelts; i++)
> > > > >      {
> > > > > -      HOST_WIDE_INT index;
> > > > > -      if (!sel[i].is_constant (&index))
> > > > > -     return NULL_TREE;
> > > > > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > > > > -     need_ctor = true;
> > > > > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > > > > +      /* For VLA vectors, i % sel_npatterns would give the original
> > > > > +         pattern the element belongs to, which is sufficient to get the arg.
> > > > > +      Even if sel_npatterns has been multiplied by N,
> > > > > +      they will always come from the same input vector.
> > > > > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > > > > +      so i % sel_npatterns == i since i < nelts */
> > > > > +
> > > > > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > > > > +      unsigned HOST_WIDE_INT index;
> > > > > +
> > > > > +      if (arg == arg0)
> > > > > +     {
> > > > > +       if (!sel[i].is_constant ())
> > > > > +         return NULL_TREE;
> > > > > +       index = sel[i].to_constant ();
> > > > > +     }
> > > > > +      else
> > > > > +        {
> > > > > +       gcc_assert (arg == arg1);
> > > > > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > > > +       uint64_t q;
> > > > > +       poly_uint64 r;
> > > > > +
> > > > > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > > > > +          which would be the index for either input vector.  */
> > > > > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > > > > +         return NULL_TREE;
> > > > > +
> > > > > +       if (!r.is_constant (&index))
> > > > > +         return NULL_TREE;
> > > > > +     }
> > > > > +
> > > > > +      tree elem;
> > > > > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > > > > +        {
> > > > > +       gcc_assert (index < nelts);
> > > > > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > > > > +         return NULL_TREE;
> > > > > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > > > > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > > > > +         return NULL_TREE;
> > > > > +       need_ctor = true;
> > > > > +     }
> > > > > +      else
> > > > > +        elem = vector_cst_elt (arg, index);
> > > > > +      out_elts.quick_push (elem);
> > > > >      }
> > > > >
> > > > >    if (need_ctor)
> > > > >      {
> > > > >        vec<constructor_elt, va_gc> *v;
> > > > > -      vec_alloc (v, nelts);
> > > > > -      for (i = 0; i < nelts; i++)
> > > > > +      vec_alloc (v, res_nelts);
> > > > > +      for (i = 0; i < res_nelts; i++)
> > > > >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > > > >        return build_constructor (type, v);
> > > > >      }
> > > > > -  else
> > > > > -    return out_elts.build ();
> > > > > +  return out_elts.build ();
> > > > >  }
> > > > >
> > > > >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > > > > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> > > > >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> > > > >  }
> > > > >
> > > > > +static tree
> > > > > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > > > > +                int *encoded_elems)
> > > > > +{
> > > > > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > > > > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > > > > +  //machine_mode vmode = VNx4SImode;
> > > > > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > > > > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > > > > +
> > > > > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > > > > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > > > > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > > > > +  return builder.build ();
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +test_vec_perm_vla_folding ()
> > > > > +{
> > > > > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > > > > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > > > > +
> > > > > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > > > > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > > > > +
> > > > > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > > > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > > > > +    return;
> > > > > +
> > > > > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > > > +     should select arg0.  */
> > > > > +  {
> > > > > +    int mask_elems[] = {0, 1, 2};
> > > > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > > +    ASSERT_TRUE (res != NULL_TREE);
> > > > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > > > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > > > +
> > > > > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > > > > +    for (unsigned i = 0; i < res_nelts; i++)
> > > > > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > > > > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > > > > +  }
> > > > > +
> > > > > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > > > +     should return NULL because for len = 4 + 4x,
> > > > > +     if x == 0, we select from arg1
> > > > > +     if x > 0, we select from arg0
> > > > > +     and thus cannot determine result at compile time.  */
> > > > > +  {
> > > > > +    int mask_elems[] = {4, 5, 6};
> > > > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > > +    gcc_assert (res == NULL_TREE);
> > > > > +  }
> > > > > +
> > > > > +  /* Case 3:
> > > > > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > > > > +     npatterns == 2, nelts_per_pattern == 3
> > > > > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > > > > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > > > > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > > > > +  {
> > > > > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > > > > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > > > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > > > +
> > > > > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > > > > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > > > > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > > > > +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > > > > +  }
> > > > > +
> > > > > > +  /* Case 4:
> > > > > > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > > > > > +     npatterns == 2, nelts_per_pattern == 3
> > > > > > +     Pattern {0, ...} should select arg0[0]
> > > > > > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > > > > > +     a1 = 5 + 4x
> > > > > > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > > > > > +        = 5 + 6x
> > > > > > +     Since a1 / (4 + 4x) == ae / (4 + 4x) == 1, we select arg1[0], arg1[1], ...
> > > > > > +     res: {1, 21, 1, 31, 1, 22, ... }
> > > > > > +     FIXME: How to build vector with poly_int elems ?  */
> > > > > +
> > > > > +  /* Case 5: S < 0.  */
> > > > > +}
> > > > > +
> > > > >  /* Run all of the selftests within this file.  */
> > > > >
> > > > >  void
> > > > > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> > > > >    test_arithmetic_folding ();
> > > > >    test_vector_folding ();
> > > > >    test_vec_duplicate_folding ();
> > > > > +  test_vec_perm_vla_folding ();
> > > > >  }
> > > > >
> > > > >  } // namespace selftest
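
For readers following the npatterns / nelts_per_pattern discussion, here is a minimal standalone sketch of how a stepped encoding expands to element I of the full vector. It is only a model written for illustration: encoded_vec and decode_elt are hypothetical names, not GCC types or functions, and it assumes plain integer elements.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a VECTOR_CST encoding; not a GCC type.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<int64_t> elems;	// npatterns * nelts_per_pattern values
};

// Return element I of the full vector described by V.
int64_t
decode_elt (const encoded_vec &v, uint64_t i)
{
  uint64_t encoded_nelts = (uint64_t) v.npatterns * v.nelts_per_pattern;
  if (i < encoded_nelts)
    return v.elems[i];		// explicitly encoded element

  unsigned pattern = i % v.npatterns;
  uint64_t count = i / v.npatterns;	// position of I within its pattern
  uint64_t final_idx = encoded_nelts - v.npatterns + pattern;
  int64_t last = v.elems[final_idx];
  if (v.nelts_per_pattern < 3)
    return last;		// no step: the last encoded value repeats

  // Three elements per pattern: extrapolate with the step between the
  // second and third encoded values of this pattern.
  int64_t prev = v.elems[final_idx - v.npatterns];
  int64_t step = last - prev;
  return last + (int64_t) (count - 2) * step;
}
```

With arg0 = {2, 3, {1, 11, 2, 12, 3, 13}} this model gives 4 and 14 for elements 6 and 7. Under the same model, the two-pattern result encoding {1, 1, 1, 11, 1, 2} from Case 3 extrapolates element 7 as -7, whereas arg0[3] is 12; the four-pattern encoding {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13} stores that element explicitly.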
  

Patch

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 9f7beae14e5..a150a75faf5 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@  along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10494,38 +10497,53 @@  fold_mult_zconjz (location_t loc, tree type, tree expr)
 			  build_zero_cst (itype));
 }
 
+/* If every element selected by PATTERN in SEL comes from the same input
+   vector (ARG0 or ARG1), return that vector, otherwise return NULL_TREE.  */
 
-/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
-   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
-   true if successful.  */
-
-static bool
-vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
+static tree
+get_vector_for_pattern (tree arg0, tree arg1,
+			const vec_perm_indices &sel, unsigned pattern)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
+  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+  poly_uint64 nsel = sel.length ();
+  poly_uint64 esel;
+
+  if (!multiple_p (nsel, sel_npatterns, &esel))
+    return NULL_TREE;
+
+  poly_uint64 a1 = sel[pattern + sel_npatterns];
+  int S = 0;
+  if (sel_nelts_per_pattern == 3)
     {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      S = (a2 - a1).to_constant ();
     }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+
+  poly_uint64 ae = a1 + (esel - 2) * S;
+  uint64_t q1, qe;
+  poly_uint64 r1, re;
+
+  if (!can_div_trunc_p (a1, n1, &q1, &r1)
+      || !can_div_trunc_p (ae, n1, &qe, &re)
+      || (q1 != qe))
+    return NULL_TREE;
+
+  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
+
+  if (S < 0)
     {
-      constructor_elt *elt;
+      poly_uint64 a0 = sel[pattern];
+      if (!known_eq (S, a1 - a0))
+        return NULL_TREE;
 
-      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
-	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
-	  return false;
-	else
-	  elts[i] = elt->value;
+      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
+        return NULL_TREE;
     }
-  else
-    return false;
-  for (; i < nelts; i++)
-    elts[i]
-      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
-  return true;
+
+  return arg;
 }
 
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
@@ -10539,41 +10557,112 @@  fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
   unsigned HOST_WIDE_INT nelts;
   bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
-  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
-  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
-      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
+  unsigned res_npatterns = 0;
+  unsigned res_nelts_per_pattern = 0;
+  unsigned sel_npatterns = 0;
+  tree *vector_for_pattern = NULL;
+
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST
+      && !sel.length ().is_constant ())
+    {
+      sel_npatterns = sel.encoding ().npatterns ();
+      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
+      for (unsigned i = 0; i < sel_npatterns; i++)
+	{
+	  tree op = get_vector_for_pattern (arg0, arg1, sel, i);
+	  if (!op)
+	    return NULL_TREE;
+	  vector_for_pattern[i] = op;
+	}
+
+      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
+      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
+
+      res_npatterns
+	= least_common_multiple (sel_npatterns,
+				 least_common_multiple (arg0_npatterns,
+							arg1_npatterns));
+      res_nelts_per_pattern
+	= std::max (sel.encoding ().nelts_per_pattern (),
+		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+			      VECTOR_CST_NELTS_PER_PATTERN (arg1)));
+    }
+  else if (sel.length ().is_constant (&nelts)
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
+    {
+      /* For VLS vectors, treat all vectors as having
+	 npatterns == nelts and nelts_per_pattern == 1.  */
+      res_npatterns = sel_npatterns = nelts;
+      res_nelts_per_pattern = 1;
+      vector_for_pattern = XALLOCAVEC (tree, nelts);
+      for (unsigned i = 0; i < nelts; i++)
+	{
+	  HOST_WIDE_INT index;
+	  if (!sel[i].is_constant (&index))
+	    return NULL_TREE;
+	  vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
+	}
+    }
+  else
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
-  for (i = 0; i < nelts; i++)
+  tree_vector_builder out_elts (type, res_npatterns,
+				res_nelts_per_pattern);
+  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
+  for (unsigned i = 0; i < res_nelts; i++)
     {
-      HOST_WIDE_INT index;
-      if (!sel[i].is_constant (&index))
+      poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+
+      /* Divide sel[i] by input vector length, to obtain remainder,
+	 which would be the index for either input vector.  */
+      if (!can_div_trunc_p (sel[i], n1, &q, &r))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+
+      unsigned HOST_WIDE_INT index;
+      if (!r.is_constant (&index))
+	return NULL_TREE;
+
+      /* For VLA vectors, i % sel_npatterns gives the pattern in SEL
+	 that the ith element belongs to.
+	 For VLS vectors, sel_npatterns == res_nelts == nelts,
+	 so i % sel_npatterns == i since i < nelts.  */
+      tree arg = vector_for_pattern[i % sel_npatterns];
+      tree elem;
+      if (TREE_CODE (arg) == CONSTRUCTOR)
+        {
+	  gcc_assert (index < nelts);
+	  if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
+	    return NULL_TREE;
+	  elem = CONSTRUCTOR_ELT (arg, index)->value;
+	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
+	    return NULL_TREE;
+	  need_ctor = true;
+	}
+      else
+        elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
     }
 
   if (need_ctor)
     {
       vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
+      vec_alloc (v, res_nelts);
+      for (i = 0; i < res_nelts; i++)
 	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
       return build_constructor (type, v);
     }
-  else
-    return out_elts.build ();
+  return out_elts.build ();
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16910,6 +16999,97 @@  test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  //machine_mode vmode = VNx4SImode;
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
+  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
+
+  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
+  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should select arg0.  */
+  {
+    int mask_elems[] = {0, 1, 2};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    unsigned res_nelts = vector_cst_encoded_nelts (res);
+    for (unsigned i = 0; i < res_nelts; i++)
+      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
+				    VECTOR_CST_ELT (arg0, i), 0));
+  }
+
+  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should return NULL_TREE because for len = 4 + 4x,
+     if x == 0, we select only from arg1,
+     if x > 0, the selection starts in arg0,
+     and thus the result cannot be determined at compile time.  */
+  {
+    int mask_elems[] = {4, 5, 6};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 3:
+     mask: {0, 0, 0, 1, 0, 2, ...}
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0], i.e. 1.
+     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
+     so res = {1, 1, 1, 11, 1, 2, ...}.  */
+  {
+    int mask_elems[] = {0, 0, 0, 1, 0, 2};
+    tree mask = build_vec_int_cst (2, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    /* Check encoding: {1, 1, 1, 11, 1, 2, ...}  */
+    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2};
+    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+      ASSERT_TRUE (wi::to_wide (VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
+  }
+
+  /* Case 4:
+     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0]
+     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
+     a1 = 5 + 4x
+     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
+        = 5 + 6x
+     Since a1 / (4 + 4x) == ae / (4 + 4x) == 1, we select arg1[0], arg1[1], ...
+     res: {1, 21, 1, 31, 1, 22, ... }
+     FIXME: How to build vector with poly_int elems ?  */
+
+  /* Case 5: S < 0.  */
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16918,6 +17098,7 @@  fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest
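
The a1 / ae reasoning in the Case 2 and Case 4 comments of test_vec_perm_vla_folding can be checked numerically with a small standalone sketch. This is not GCC code: poly2, input_for_pattern and foldable_p are hypothetical names, indices are assumed non-negative, and sampling a finite range of x only approximates what can_div_trunc_p decides symbolically.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a two-coefficient poly_int: value = c0 + c1 * x.
struct poly2
{
  int64_t c0, c1;
  int64_t at (int64_t x) const { return c0 + c1 * x; }
};

// For runtime parameter X, report which input the pattern starting at A1
// with step STEP reads from: 0 for arg0, 1 for arg1, or -1 if its first and
// last elements land in different input vectors.
int
input_for_pattern (poly2 a1, int64_t step, poly2 len, unsigned npatterns,
		   int64_t x)
{
  int64_t n = len.at (x);		// vector length for this X
  int64_t esel = n / npatterns;		// elements per selector pattern
  int64_t first = a1.at (x);
  int64_t last = first + (esel - 2) * step;	// "ae" in the patch
  if (first / n != last / n)
    return -1;
  return (first / n) & 1;
}

// True if the answer is the same, and never -1, for every X in [0, MAX_X].
bool
foldable_p (poly2 a1, int64_t step, poly2 len, unsigned npatterns,
	    int64_t max_x)
{
  int first = input_for_pattern (a1, step, len, npatterns, 0);
  if (first < 0)
    return false;
  for (int64_t x = 1; x <= max_x; x++)
    if (input_for_pattern (a1, step, len, npatterns, x) != first)
      return false;
  return true;
}
```

With len = 4 + 4x, the Case 2 mask (a1 = 5, step 1, one pattern) reads only arg1 when x == 0 but straddles both inputs when x == 1, so the model reports it as not foldable. The second pattern of Case 4 (a1 = 5 + 4x, step 1, two patterns) stays within arg1 for every sampled x.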