[v2] load and store-lanes with SLP

Message ID 20240708113356.A60DF3861C69@sourceware.org
State New
Headers
Series [v2] load and store-lanes with SLP |

Checks

Context Check Description
rivoscibot/toolchain-ci-rivos-lint warning Lint failed
rivoscibot/toolchain-ci-rivos-apply-patch fail Patch failed to apply
linaro-tcwg-bot/tcwg_gcc_build--master-arm fail Patch failed to apply
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 fail Patch failed to apply

Commit Message

Richard Biener July 8, 2024, 11:33 a.m. UTC
  The following is a prototype for how to represent load/store-lanes
within SLP.  I've for now settled with having a single load node
with multiple permute nodes, one for each loaded lane and a single
store node plus a single permute node feeding it.  For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you have the following SLP graph where I explain how things are set
up and code-generated:

t.c:3:21: note:   SLP graph after lowering permutations:
t.c:3:21: note:   node 0x578af60 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note:   op template: *_6 = _7;
t.c:3:21: note:         stmt 0 *_6 = _7;
t.c:3:21: note:         stmt 1 *_12 = _13;
t.c:3:21: note:         children 0x578aff8

This is the store node, it's marked with ldst_lanes = true during
SLP discovery.  This node code-generates

  .MASK_LEN_STORE_LANES (vectp_a.10_30, 32B, { -1, ... }, loop_len_38, 0, vect_array.9);

t.c:3:21: note:   node 0x578aff8 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:         { }
t.c:3:21: note:         lane permutation { 0[0] 1[0] }
t.c:3:21: note:         children 0x578ab38 0x578ad98

This is the store-lane shuffling node, it's also marked with ldst_lanes
= true during SLP discovery.  This node code-generates

  vect_array.9[0] = vect__7.7_35;
  vect_array.9[1] = vect__13.8_33;

it records virtual defs as vectorized defs.  This is a bit awkward,
on the store side it would be nicer to elide 'vect_array' and instead
make .MASK_LEN_STORE_LANES "varargs", having the vector defs as
arguments.

t.c:3:21: note:   node 0x578ab38 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:3:21: note:   op template: _7 = _5 + 7;
t.c:3:21: note:         stmt 0 _7 = _5 + 7;
t.c:3:21: note:         children 0x578abd0 0x578ac68
t.c:3:21: note:   node (constant) 0x578ac68 (max_nunits=1, refcnt=1)
t.c:3:21: note:         { 7 }
t.c:3:21: note:   node 0x578b978 (max_nunits=2, refcnt=2) vector(2) int
t.c:3:21: note:   op template: _13 = _11 * 3;
t.c:3:21: note:         stmt 0 _13 = _11 * 3;
t.c:3:21: note:         children 0x578b4b8 0x578b8e0
t.c:3:21: note:   node (constant) 0x578b8e0 (max_nunits=1, refcnt=1)
t.c:3:21: note:         { 3 }
t.c:3:21: note:   node 0x578abd0 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:         stmt 0 _5 = *_4;
t.c:3:21: note:         lane permutation { 0[0] }
t.c:3:21: note:         children 0x578b090
t.c:3:21: note:   node 0x578b4b8 (max_nunits=2, refcnt=2) vector(2) int
t.c:3:21: note:   op: VEC_PERM_EXPR
t.c:3:21: note:         stmt 0 _11 = *_10;
t.c:3:21: note:         lane permutation { 0[1] }
t.c:3:21: note:         children 0x578bbd8

The above two are the load-lane shufflings, they are marked with
ldst_lanes during load permutation lowering.  They code generate

  _34 = vect_array.6[1];
  _36 = vect_array.6[0];

they pick up the vect_array.6 by looking at the virtual operand def
in the children node.  As with the store side this is awkward but
much harder to avoid since we lack multiple SSA defs support.  What
could be done though is to write vect_array.6 into SSA and use
BIT_FIELD_REF for the extracts?  (yes, that would require ARRAY_TYPE
gimple registers)

t.c:3:21: note:   node 0x578b090 (max_nunits=4, refcnt=3) vector([4,4]) int
t.c:3:21: note:   op template: _5 = *_4;
t.c:3:21: note:         stmt 0 _5 = *_4;
t.c:3:21: note:         stmt 1 _11 = *_10;
t.c:3:21: note:         load permutation { 0 1 }

This is the load itself, marked ldst_lanes during load permutation
lowering.  It code generates

  vect_array.6 = .MASK_LEN_LOAD_LANES (vectp_b.4_40, 32B, { -1, ... }, loop_len_38, 0);

The patch below, tested with just the above testcase and on riscv,
aarch64 with and without SVE and on x86_64 (no load/store lanes)
seems to work though definitely the theoretical support for multi-lane
lanes (aka using ld3 for 3 lane pairs and a total of 6 lanes) is broken.

Code generation for AdvSIMD is

        ld2     {v30.4s - v31.4s}, [x1], 32
        shl     v29.4s, v31.4s, 1
        add     v28.4s, v27.4s, v30.4s
        add     v29.4s, v29.4s, v31.4s
        st2     {v28.4s - v29.4s}, [x0], 32

for SVE

        ld2w    {z30.s - z31.s}, p7/z, [x1, x2, lsl 2]
        add     z30.s, z30.s, #7
        adr     z31.s, [z31.s, z31.s, lsl 1]
        st2w    {z30.s - z31.s}, p7, [x0, x2, lsl 2]

and for RVV

        vsetvli zero,a3,e32,m1,ta,ma
        vlseg2e32.v     v4,(a1)
        vsetvli t1,zero,e32,m1,ta,ma
        sub     a5,a5,a4
        add     a1,a1,a2
        vmul.vv v3,v1,v5
        vadd.vi v2,v4,7
        vsetvli zero,a3,e32,m1,ta,ma
        vsseg2e32.v     v2,(a0)

The way this SLP representation forces us to separate IFN and array
access code generation to be separated introduces problems not present
with non-SLP.  Keeping the operation in one node would be managable
on the store side by recording an incoming lane permutation for it.
On the load side we could code generate everything from the load
node and make a special contract between the "extract" and the load
(but I suppose that's already present with the patch below), it would
then become just a forwarder doing no code generation.

I'm posting this as-is, I will probably try the alternative to decide
which is more ugly.  I guess I'm hoping for magic other ideas here ;)

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (vect_build_slp_instance): For stores
	iff the target prefers store-lanes discover single-lane
	sub-groups, do not perform interleaving lowering but mark
	the nodes with ldst_lanes.
	(vect_lower_load_permutations): When the target supports
	load lanes and the loads all fit the pattern split out
	a single level of permutes only and mark the load and
	permute nodes with ldst_lanes.
	(vectorizable_slp_permutation_1): Code-generate the
	load-/store-lane permute parts.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Hack code generation for store-lanes.
	(vectorizable_load): Hack code generation for load-lanes.
---
 gcc/tree-vect-slp.cc   | 147 +++++++++++++++++++++++++++++++++++++++--
 gcc/tree-vect-stmts.cc |  68 ++++++++++++++-----
 gcc/tree-vectorizer.h  |   3 +
 3 files changed, 197 insertions(+), 21 deletions(-)
  

Patch

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 2dc6d365303..a5648cacd70 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -120,6 +120,7 @@  _slp_tree::_slp_tree ()
   SLP_TREE_SIMD_CLONE_INFO (this) = vNULL;
   SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def;
   SLP_TREE_CODE (this) = ERROR_MARK;
+  this->ldst_lanes = false;
   SLP_TREE_VECTYPE (this) = NULL_TREE;
   SLP_TREE_REPRESENTATIVE (this) = NULL;
   SLP_TREE_REF_COUNT (this) = 1;
@@ -3601,9 +3602,22 @@  vect_build_slp_instance (vec_info *vinfo,
 	 size >= 1.  */
       else if (is_a <loop_vec_info> (vinfo)
 	       && (i > 0 && i < group_size)
-	       && !vect_slp_prefer_store_lanes_p (vinfo,
-						  stmt_info, group_size, i))
-	{
+	       /*&& !vect_slp_prefer_store_lanes_p (vinfo,
+						  stmt_info, group_size, i)*/)
+	{
+	  /* There are targets that cannot do even/odd interleaving schemes
+	     so they absolutely need to use load/store-lanes.  For now
+	     force single-lane SLP for them - they would be happy with
+	     uniform power-of-two lanes (but depending on element size),
+	     but even if we can use 'i' as indicator we would need to
+	     backtrack when later lanes fail to discover with the same
+	     granularity.  */
+	  bool want_store_lanes
+	    = vect_slp_prefer_store_lanes_p (vinfo,
+					     stmt_info, group_size, 1 /*i*/);
+	  if (want_store_lanes)
+	    i = 1;
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "Splitting SLP group at stmt %u\n", i);
@@ -3637,7 +3651,10 @@  vect_build_slp_instance (vec_info *vinfo,
 					       (max_nunits, end - start));
 		  rhs_nodes.safe_push (node);
 		  start = end;
-		  end = group_size;
+		  if (want_store_lanes)
+		    end = start + 1;
+		  else
+		    end = group_size;
 		}
 	      else
 		{
@@ -3711,7 +3728,14 @@  vect_build_slp_instance (vec_info *vinfo,
 		 we end up with two inputs.  That scheme makes sure we end
 		 up with permutes satisfying the restriction of requiring at
 		 most two vector inputs to produce a single vector output
-		 when the number of lanes is even.  */
+		 when the number of lanes is even.
+		 Leave the merge as is when we want to use store-lanes.  */
+	      if (want_store_lanes)
+		{
+		  perm->ldst_lanes = 1;
+		  node->ldst_lanes = 1;
+		}
+	      else
 	      while (SLP_TREE_CHILDREN (perm).length () > 2)
 		{
 		  /* When we have three equal sized groups left the pairwise
@@ -4057,6 +4081,36 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
   if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
+  /* Verify if all load permutations can be implemented with a suitably
+     large element load-lanes operation.  */
+  unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]);
+  if (exact_log2 (ld_lanes_lanes) == -1
+      /* ???  Need to handle ld_lanes_lanes != 1 correctly.  */
+      || vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]),
+				     group_lanes, false) == IFN_LAST)
+    ld_lanes_lanes = 0;
+  else
+    for (slp_tree load : loads)
+      {
+	if (SLP_TREE_LANES (load) != ld_lanes_lanes)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0];
+	if (first % ld_lanes_lanes != 0)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i)
+	  if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i)
+	    {
+	      ld_lanes_lanes = 0;
+	      break;
+	    }
+      }
+
   for (slp_tree load : loads)
     {
       /* Leave masked or gather loads alone for now.  */
@@ -4071,7 +4125,8 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2
+	  && ld_lanes_lanes == 0)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4104,6 +4159,12 @@  vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	final_perm.quick_push
 	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
 
+      if (ld_lanes_lanes != 0)
+	{
+	  l0->ldst_lanes = true;
+	  load->ldst_lanes = true;
+	}
+      else
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
@@ -9705,6 +9766,8 @@  vect_add_slp_permutation (vec_info *vinfo, gimple_stmt_iterator *gsi,
   node->push_vec_def (perm_stmt);
 }
 
+extern tree create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems);
+
 /* Subroutine of vectorizable_slp_permutation.  Check whether the target
    can perform permutation PERM on the (1 or 2) input nodes in CHILDREN.
    If GSI is nonnull, emit the permutation there.
@@ -9758,6 +9821,78 @@  vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
 
   gcc_assert (perm.length () == SLP_TREE_LANES (node));
 
+  /* Load/store-lanes permute.  */
+  if (node->ldst_lanes)
+    {
+      if (!gsi)
+	/* ???  We should have this flag set on the permute before a store
+	   or the permutes after a load.  Check vect_load_lanes_supported
+	   or leave that to the store?  */
+	return 1;
+      /* For the load case there should be a single child and
+	 it's vector def should be the virtual SSA def of the
+	 .LOAD_LANES call.  Create the vect__5.7_48 = vect_array.6[n]
+	 load here plus punning of the vector type (in case an element
+	 represents multiple lanes).  */
+      if (children.length () == 1)
+	{
+	  slp_tree child = children[0];
+	  unsigned vec_num = (SLP_TREE_LANE_PERMUTATION (node)[0].second
+			      / SLP_TREE_LANES (node));
+	  for (unsigned i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i)
+	    {
+	      tree def = SLP_TREE_VEC_DEFS (child)[i];
+	      gcc_assert (virtual_operand_p (def));
+	      tree decl = gimple_call_lhs (SSA_NAME_DEF_STMT (def));
+	      tree vec = make_ssa_name (TREE_TYPE (TREE_TYPE (decl)));
+	      gassign *new_stmt
+		= gimple_build_assign (vec,
+				       build4 (ARRAY_REF, TREE_TYPE (vec),
+					       decl, build_int_cstu
+						 (unsigned_type_node, vec_num),
+					       NULL_TREE, NULL_TREE));
+	      vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+	      if (!useless_type_conversion_p (vectype, TREE_TYPE (vec)))
+		{
+		  vec = make_ssa_name (vectype);
+		  new_stmt = gimple_build_assign (vec, build1 (VIEW_CONVERT_EXPR, vectype, gimple_assign_lhs (new_stmt)));
+		  vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+		}
+	      node->push_vec_def (vec);
+	    }
+	  return SLP_TREE_VEC_DEFS (node).length ();
+	}
+      /* For the store case create the array and the set of stores from
+	 all children vector defs.  The vector def should then be the
+	 virtual def of the last store into the array.  */
+      else
+	{
+	  unsigned vec_num = SLP_TREE_VEC_DEFS (children[0]).length ();
+	  for (unsigned i = 0; i < vec_num; ++i)
+	    {
+	      tree decl = create_vector_array (op_vectype,
+					       SLP_TREE_LANES (node)
+					       / SLP_TREE_LANES (children[0]));
+	      tree last_vdef = NULL_TREE;
+	      for (unsigned j = 0; j < children.length (); ++j)
+		{
+		  slp_tree child = children[j];
+		  tree ref = build4 (ARRAY_REF, op_vectype, decl,
+				     build_int_cstu (unsigned_type_node, j),
+				     NULL_TREE, NULL_TREE);
+		  gimple *new_stmt
+		    = gimple_build_assign (ref, SLP_TREE_VEC_DEFS (child)[i]);
+		  vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+		  last_vdef = make_ssa_name (gimple_vop (cfun));
+		  SSA_NAME_DEF_STMT (last_vdef) = new_stmt;
+		  gimple_set_vdef (new_stmt, last_vdef);
+		}
+	      node->push_vec_def (last_vdef);
+	    }
+	  return vec_num;
+	}
+    }
+
   /* REPEATING_P is true if every output vector is guaranteed to use the
      same permute vector.  We can handle that case for both variable-length
      and constant-length vectors, but we only handle other cases for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fdcda0d2aba..7169ad4f9c9 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -146,7 +146,7 @@  record_stmt_cost (stmt_vector_for_cost *body_cost_vec, int count,
 
 /* Return a variable of type ELEM_TYPE[NELEMS].  */
 
-static tree
+tree
 create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems)
 {
   return create_tmp_var (build_array_type_nelts (elem_type, nelems),
@@ -1508,7 +1508,8 @@  check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
 
   unsigned int nvectors;
   if (slp_node)
-    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+    /* ???  Incorrect for multi-lane lanes.  */
+    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
   else
     nvectors = vect_get_num_copies (loop_vinfo, vectype);
 
@@ -2069,6 +2070,14 @@  get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 		 is irrelevant for them.  */
 	      *alignment_support_scheme = dr_unaligned_supported;
 	    }
+	  /* Try using LOAD/STORE_LANES.  */
+	  else if (slp_node->ldst_lanes
+		   && (*lanes_ifn
+			 = (vls_type == VLS_LOAD
+			    ? vect_load_lanes_supported (vectype, group_size, masked_p)
+			    : vect_store_lanes_supported (vectype, group_size,
+							  masked_p))) != IFN_LAST)
+	    *memory_access_type = VMAT_LOAD_STORE_LANES;
 	  else
 	    *memory_access_type = VMAT_CONTIGUOUS;
 
@@ -8762,7 +8771,7 @@  vectorizable_store (vec_info *vinfo,
 
   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
-      gcc_assert (!slp && grouped_store);
+      //gcc_assert (!slp && grouped_store);
       unsigned inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector stores, we'd like to cost with
 	 the total number of them once instead of cost each one by one. */
@@ -8784,7 +8793,7 @@  vectorizable_store (vec_info *vinfo,
 		  op = vect_get_store_rhs (next_stmt_info);
 		  if (costing_p)
 		    update_prologue_cost (&prologue_cost, op);
-		  else
+		  else if (!slp)
 		    {
 		      vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
 						     ncopies, op,
@@ -8799,15 +8808,15 @@  vectorizable_store (vec_info *vinfo,
 		{
 		  if (mask)
 		    {
-		      vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-						     mask, &vec_masks,
-						     mask_vectype);
+		      if (slp_node)
+			vect_get_slp_defs (mask_node, &vec_masks);
+		      else
+			vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+						       mask, &vec_masks,
+						       mask_vectype);
 		      vec_mask = vec_masks[0];
 		    }
 
-		  /* We should have catched mismatched types earlier.  */
-		  gcc_assert (
-		    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
 		  dataref_ptr
 		    = vect_create_data_ref_ptr (vinfo, first_stmt_info,
 						aggr_type, NULL, offset, &dummy,
@@ -8819,11 +8828,17 @@  vectorizable_store (vec_info *vinfo,
 	      gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	      /* DR_CHAIN is then used as an input to
 		 vect_permute_store_chain().  */
+	      if (!slp)
+		{
+		  /* We should have catched mismatched types earlier.  */
+		  gcc_assert (
+		    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
 	      for (i = 0; i < group_size; i++)
 		{
 		  vec_oprnd = (*gvec_oprnds[i])[j];
 		  dr_chain[i] = vec_oprnd;
 		}
+		}
 	      if (mask)
 		vec_mask = vec_masks[j];
 	      dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi,
@@ -8836,8 +8851,16 @@  vectorizable_store (vec_info *vinfo,
 	      continue;
 	    }
 
+	  tree vec_array;
+	  if (slp)
+	    {
+	      tree def = SLP_TREE_VEC_DEFS (SLP_TREE_CHILDREN (slp_node)[0])[j];
+	      vec_array = TREE_OPERAND (gimple_assign_lhs (SSA_NAME_DEF_STMT (def)), 0);
+	    }
+	  else
+	    {
 	  /* Get an array into which we can store the individual vectors.  */
-	  tree vec_array = create_vector_array (vectype, vec_num);
+	  vec_array = create_vector_array (vectype, vec_num);
 
 	  /* Invalidate the current contents of VEC_ARRAY.  This should
 	     become an RTL clobber too, which prevents the vector registers
@@ -8851,6 +8874,7 @@  vectorizable_store (vec_info *vinfo,
 	      write_vector_array (vinfo, stmt_info, gsi, vec_oprnd, vec_array,
 				  i);
 	    }
+	    }
 
 	  tree final_mask = NULL;
 	  tree final_len = NULL;
@@ -8917,9 +8941,10 @@  vectorizable_store (vec_info *vinfo,
 
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
-	  if (j == 0)
+	  if (j == 0 && !slp)
 	    *vec_stmt = new_stmt;
-	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+	  if (!slp)
+	    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
 	}
 
       if (costing_p)
@@ -10765,12 +10790,14 @@  vectorizable_load (vec_info *vinfo,
     {
       gcc_assert (alignment_support_scheme == dr_aligned
 		  || alignment_support_scheme == dr_unaligned_supported);
-      gcc_assert (grouped_load && !slp);
+      //gcc_assert (grouped_load && !slp);
 
       unsigned int inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector loads, we'd like to cost with
 	 the total number of them once instead of cost each one by one. */
       unsigned int n_adjacent_loads = 0;
+      if (slp_node)
+	ncopies = slp_node->vec_stmts_size / vec_num;
       for (j = 0; j < ncopies; j++)
 	{
 	  if (costing_p)
@@ -10884,6 +10911,15 @@  vectorizable_load (vec_info *vinfo,
 	  gimple_call_set_nothrow (call, true);
 	  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 
+	  if (slp)
+	    {
+	      tree def = copy_ssa_name (gimple_vuse (call));
+	      gimple_set_vdef (call, def);
+	      SSA_NAME_DEF_STMT (def) = call;
+	      slp_node->push_vec_def (def);
+	    }
+	  else
+	    {
 	  dr_chain.create (vec_num);
 	  /* Extract each vector into an SSA_NAME.  */
 	  for (i = 0; i < vec_num; i++)
@@ -10900,8 +10936,10 @@  vectorizable_load (vec_info *vinfo,
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 
 	  dr_chain.release ();
+	    }
 
-	  *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+	  if (!slp_node)
+	    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 	}
 
       if (costing_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df86..ac288541c51 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -222,6 +222,9 @@  struct _slp_tree {
   unsigned int lanes;
   /* The operation of this node.  */
   enum tree_code code;
+  /* Whether uses of this load or feeders of this store are suitable
+     for load/store-lanes.  */
+  bool ldst_lanes;
 
   int vertex;