[2/3] Support group-size of three in SLP load permutation lowering

Message ID 20240709113552.DC57A386C5AC@sourceware.org
State New
Headers
Series [1/3] lower SLP load permutation to interleaving |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed

Commit Message

Richard Biener July 9, 2024, 11:35 a.m. UTC
  The following adds support for group-size three in SLP load permutation
lowering to match the non-SLP capabilities.  This is done by using
the non-interleaving fallback code which then creates at VF == 4 from
{ { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
to produce { c0, c1, c2, c3 }.

This turns out to be more effective than the scheme implemented
for non-SLP for SSE and only slightly worse for AVX512 and a bit
more worse for AVX2.  It seems to me that this would extend to
other non-power-of-two group-sizes though (but the patch does not).
Optimal schemes are likely difficult to lay out in VF agnostic form.

I'll note that while the lowering assumes even/odd extract is
generally available for all vector element sizes (which is probably
a good assumption), it doesn't in any way constrain the other
permutes it generates based on target availability.  Again difficult
to do in a VF agnostic way (but at least currently the vector type
is fixed).

I'll also note that the SLP store side merges lanes in a way
producing three-vector permutes for store group-size of three, so
the testcase uses a store group-size of four.

	* tree-vect-slp.cc (vect_lower_load_permutations): Support
	group-size of three.

	* gcc.dg/vect/slp-52.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-52.c | 14 ++++++++++++
 gcc/tree-vect-slp.cc               | 35 +++++++++++++++++-------------
 2 files changed, 34 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-52.c
  

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/slp-52.c b/gcc/testsuite/gcc.dg/vect/slp-52.c
new file mode 100644
index 00000000000..ba49f0046e2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-52.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[4*i+0] = y[3*i+0];
+      x[4*i+1] = y[3*i+1] * 2;
+      x[4*i+2] = y[3*i+2] + 3;
+      x[4*i+3] = y[3*i+2] * 2 - 5;
+    }
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 0f830c1ad9c..2dc6d365303 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3710,7 +3710,8 @@  vect_build_slp_instance (vec_info *vinfo,
 		 with the least number of lanes to one and then repeat until
 		 we end up with two inputs.  That scheme makes sure we end
 		 up with permutes satisfying the restriction of requiring at
-		 most two vector inputs to produce a single vector output.  */
+		 most two vector inputs to produce a single vector output
+		 when the number of lanes is even.  */
 	      while (SLP_TREE_CHILDREN (perm).length () > 2)
 		{
 		  /* When we have three equal sized groups left the pairwise
@@ -4050,11 +4051,10 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
     = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
 
   /* Only a power-of-two number of lanes matches interleaving with N levels.
-     The non-SLP path also supports DR_GROUP_SIZE == 3.
      ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 lanes
      at each step.  */
   unsigned group_lanes = DR_GROUP_SIZE (first);
-  if (exact_log2 (group_lanes) == -1)
+  if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
   for (slp_tree load : loads)
@@ -4071,7 +4071,7 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= group_lanes / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4107,7 +4107,7 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
-	  if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	  if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
 	    break;
 
 	  /* Try to lower by reducing the group to half its size using an
@@ -4117,19 +4117,24 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	     Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
 	     with N == 2.  */
 	  /* ???  Only an even number of lanes can be handed this way, but the
-	     fallback below could work for any number.  */
-	  gcc_assert ((group_lanes & 1) == 0);
-	  unsigned even = (1 << ceil_log2 (group_lanes)) - 1;
-	  unsigned odd = even;
-	  for (auto l : final_perm)
+	     fallback below could work for any number.  We have to make sure
+	     to round up in that case.  */
+	  gcc_assert ((group_lanes & 1) == 0 || group_lanes == 3);
+	  unsigned even = 0, odd = 0;
+	  if ((group_lanes & 1) == 0)
 	    {
-	      even &= ~l.second;
-	      odd &= l.second;
+	      even = (1 << ceil_log2 (group_lanes)) - 1;
+	      odd = even;
+	      for (auto l : final_perm)
+		{
+		  even &= ~l.second;
+		  odd &= l.second;
+		}
 	    }
 
 	  /* Now build an even or odd extraction from the unpermuted load.  */
 	  lane_permutation_t perm;
-	  perm.create (group_lanes / 2);
+	  perm.create ((group_lanes + 1) / 2);
 	  unsigned level;
 	  if (even
 	      && ((level = 1 << ctz_hwi (even)), true)
@@ -4164,7 +4169,7 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	      bitmap_iterator bi;
 	      EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi)
 		  perm.quick_push (std::make_pair (0, i));
-	      while (perm.length () < group_lanes / 2)
+	      while (perm.length () < (group_lanes + 1) / 2)
 		perm.quick_push (perm.last ());
 	    }
 
@@ -4200,7 +4205,7 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	     have a "local" CSE map here.  */
 	  SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
 
-	  /* We now have a node for group_lanes / 2 lanes.  */
+	  /* We now have a node for (group_lanes + 1) / 2 lanes.  */
 	  l0 = p;
 	}