vect+aarch64: Fix ldp_stp_* regressions

Message ID mpt1r05xxzs.fsf@arm.com
State New
Series vect+aarch64: Fix ldp_stp_* regressions

Commit Message

Richard Sandiford Feb. 14, 2022, 3:35 p.m. UTC
  ldp_stp_1.c, ldp_stp_4.c and ldp_stp_5.c have been failing since
vectorisation was enabled at -O2.  In all three cases SLP is
generating vector code when scalar code would be better.

The problem is that the target costs do not model whether STP could
be used for the scalar or vector code, so the normal latency-based
costs for store-heavy code can be way off.  It would be good to fix
that “properly” at some point, but it isn't easy; see the existing
discussion in aarch64_sve_adjust_stmt_cost for more details.

This patch therefore adds an on-the-side check for whether the
code is doing nothing more than set-up+stores.  It then applies
STP-based costs to those cases only, in addition to the normal
latency-based costs.  (That is, the vector code has to win on
both counts rather than on one count individually.)
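As a concrete illustration, consider the two-statement example that the
patch's new aarch64.cc comment also uses (the surrounding function and
the instruction sequences mentioned below are only a sketch of typical
codegen, not taken from the failing tests):

  /* Store-heavy code: the scalar version can issue both word stores
     as a single STP, but without modelling STP it is costed as two
     separate stores, which makes the vector splat + store look
     artificially competitive.  */
  void
  set_pair (int *a, int x)
  {
    a[0] = x;
    a[1] = x;
  }

With STP taken into account, the scalar version is roughly a single
"stp w1, w1, [x0]", while the vector version still needs a duplicate
into a vector register plus a store, so the scalar code should win.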

However, at the moment, SLP costs one vector set-up instruction
for every vector in an SLP node, even if the contents are the
same as a previous vector in the same node.  Fixing the STP costs
without fixing that would regress other cases, tested in the patch.

The patch therefore makes the SLP costing code check for duplicates
within a node.  Ideally we'd check for duplicates more globally,
but that would require a more global approach to costs: the cost
of an initialisation should be amortised across all trees that
use the initialisation, rather than fully counted against one
arbitrarily-chosen subtree.
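
A hypothetical example of the duplicate problem (not one of the tests
added by the patch):

  /* With 128-bit vectors (four ints per vector), the invariant SLP
     node for the stored values is { x, y, x, y, x, y, x, y }: two
     vector statements whose contents are identical, each previously
     costed as a separate vec_construct.  */
  void
  set_pairs (int *a, int x, int y)
  {
    a[0] = x;  a[1] = y;
    a[2] = x;  a[3] = y;
    a[4] = x;  a[5] = y;
    a[6] = x;  a[7] = y;
  }

With the change, vect_prologue_cost_for_slp hashes each vector-sized
slice of the node's scalar ops and records a construction cost only
once per distinct slice.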

Back on aarch64: an earlier version of the patch tried to apply
the new heuristic to constant stores.  However, that didn't work
too well in practice; see the comments for details.  The patch
therefore just tests the status quo for constant cases, leaving out
a match if the current choice is dubious.
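
For reference, the constant cases in question look something like the
following sketch (illustrative only; see the comment in the aarch64.cc
change for the actual reasoning):

  /* A group of constant stores.  The vector code can load
     { 1, 2, 3, 4 } from the constant pool with one LDR and store it
     with one STR, whereas the scalar code needs several MOVs before
     its STPs, so the trade-off is less clear-cut than for variable
     operands.  */
  void
  set_consts (int *a)
  {
    a[0] = 1;
    a[1] = 2;
    a[2] = 3;
    a[3] = 4;
  }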

ldp_stp_5.c was affected by the same thing.  The test would be
worth vectorising if we generated better vector code, but:

(1) We do a bad job of moving the { -1, 1 } constant, given that
    we have { -1, -1 } and { 1, 1 } to hand.

(2) The vector code has 6 pairable stores to misaligned offsets.
    We have peephole patterns to handle such misalignment for
    4 pairable stores, but not 6.

So the SLP decision isn't wrong as such.  It's just being let
down by later codegen.

The patch therefore adds -mstrict-align to preserve the original
intention of the test while adding ldp_stp_19.c to check for the
preferred vector code (XFAILed for now).

Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
OK for the vectoriser bits?

Thanks,
Richard


gcc/
	* tree-vectorizer.h (vect_scalar_ops_slice): New struct.
	(vect_scalar_ops_slice_hash): Likewise.
	(vect_scalar_ops_slice::op): New function.
	* tree-vect-slp.cc (vect_scalar_ops_slice::all_same_p): New function.
	(vect_scalar_ops_slice_hash::hash): Likewise.
	(vect_scalar_ops_slice_hash::equal): Likewise.
	(vect_prologue_cost_for_slp): Check for duplicate vectors.
	* config/aarch64/aarch64.cc
	(aarch64_vector_costs::m_stp_sequence_cost): New member variable.
	(aarch64_aligned_constant_offset_p): New function.
	(aarch64_stp_sequence_cost): Likewise.
	(aarch64_vector_costs::add_stmt_cost): Handle new STP heuristic.
	(aarch64_vector_costs::finish_cost): Likewise.

gcc/testsuite/
	* gcc.target/aarch64/ldp_stp_5.c: Require -mstrict-align.
	* gcc.target/aarch64/ldp_stp_14.h,
	* gcc.target/aarch64/ldp_stp_14.c: New test.
	* gcc.target/aarch64/ldp_stp_15.c: Likewise.
	* gcc.target/aarch64/ldp_stp_16.c: Likewise.
	* gcc.target/aarch64/ldp_stp_17.c: Likewise.
	* gcc.target/aarch64/ldp_stp_18.c: Likewise.
	* gcc.target/aarch64/ldp_stp_19.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc                 | 140 ++++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c |  89 +++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h |  50 +++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c | 137 +++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c | 133 +++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c | 120 +++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c | 123 +++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c |   6 +
 gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c  |   2 +-
 gcc/tree-vect-slp.cc                          |  75 ++++++----
 gcc/tree-vectorizer.h                         |  35 +++++
 11 files changed, 884 insertions(+), 26 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c
  

Comments

Richard Biener Feb. 14, 2022, 3:43 p.m. UTC | #1
On Mon, 14 Feb 2022, Richard Sandiford wrote:

> ldp_stp_1.c, ldp_stp_4.c and ldp_stp_5.c have been failing since
> vectorisation was enabled at -O2.  In all three cases SLP is
> generating vector code when scalar code would be better.
> 
> The problem is that the target costs do not model whether STP could
> be used for the scalar or vector code, so the normal latency-based
> costs for store-heavy code can be way off.  It would be good to fix
> that “properly” at some point, but it isn't easy; see the existing
> discussion in aarch64_sve_adjust_stmt_cost for more details.
> 
> This patch therefore adds an on-the-side check for whether the
> code is doing nothing more than set-up+stores.  It then applies
> STP-based costs to those cases only, in addition to the normal
> latency-based costs.  (That is, the vector code has to win on
> both counts rather than on one count individually.)
> 
> However, at the moment, SLP costs one vector set-up instruction
> for every vector in an SLP node, even if the contents are the
> same as a previous vector in the same node.  Fixing the STP costs
> without fixing that would regress other cases, tested in the patch.
> 
> The patch therefore makes the SLP costing code check for duplicates
> within a node.  Ideally we'd check for duplicates more globally,
> but that would require a more global approach to costs: the cost
> of an initialisation should be amortised across all trees that
> use the initialisation, rather than fully counted against one
> arbitrarily-chosen subtree.
> 
> Back on aarch64: an earlier version of the patch tried to apply
> the new heuristic to constant stores.  However, that didn't work
> too well in practice; see the comments for details.  The patch
> therefore just tests the status quo for constant cases, leaving out
> a match if the current choice is dubious.
> 
> ldp_stp_5.c was affected by the same thing.  The test would be
> worth vectorising if we generated better vector code, but:
> 
> (1) We do a bad job of moving the { -1, 1 } constant, given that
>     we have { -1, -1 } and { 1, 1 } to hand.
> 
> (2) The vector code has 6 pairable stores to misaligned offsets.
>     We have peephole patterns to handle such misalignment for
>     4 pairable stores, but not 6.
> 
> So the SLP decision isn't wrong as such.  It's just being let
> down by later codegen.
> 
> The patch therefore adds -mstrict-align to preserve the original
> intention of the test while adding ldp_stp_19.c to check for the
> preferred vector code (XFAILed for now).
> 
> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
> OK for the vectoriser bits?

I'll look at the patch tomorrow but it reminded me of an old
patch I'm still sitting on which reworked the SLP discovery
cache to be based on defs rather than stmts which allows us to
cache and re-use SLP nodes for invariants during SLP discovery.

From 8df9c7003611e690bd08fd5cff0b624527c99bf4 Mon Sep 17 00:00:00 2001
From: Richard Biener <rguenther@suse.de>
Date: Fri, 20 Mar 2020 11:42:47 +0100
Subject: [PATCH] rework SLP caching based on ops and CSE constants
To: gcc-patches@gcc.gnu.org

This reworks SLP caching so that it keys on the defs and not
their defining stmts so we can use it to CSE SLP nodes for
constants and invariants.

2020-03-19  Richard Biener  <rguenther@suse.de>

	* tree-vect-slp.c (): ...
---
 gcc/tree-vect-slp.c | 222 +++++++++++++++++++++++++++++---------------
 1 file changed, 149 insertions(+), 73 deletions(-)

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 1ffbf6f6af9..e545e34e353 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -129,36 +129,38 @@ vect_free_slp_instance (slp_instance instance, bool final_p)
   free (instance);
 }
 
-
-/* Create an SLP node for SCALAR_STMTS.  */
-
-static slp_tree
-vect_create_new_slp_node (vec<stmt_vec_info> scalar_stmts, unsigned nops)
-{
-  slp_tree node = new _slp_tree;
-  SLP_TREE_SCALAR_STMTS (node) = scalar_stmts;
-  SLP_TREE_CHILDREN (node).create (nops);
-  SLP_TREE_DEF_TYPE (node) = vect_internal_def;
-  SLP_TREE_REPRESENTATIVE (node) = scalar_stmts[0];
-  SLP_TREE_LANES (node) = scalar_stmts.length ();
-
-  unsigned i;
-  stmt_vec_info stmt_info;
-  FOR_EACH_VEC_ELT (scalar_stmts, i, stmt_info)
-    STMT_VINFO_NUM_SLP_USES (stmt_info)++;
-
-  return node;
-}
-
 /* Create an SLP node for OPS.  */
 
 static slp_tree
-vect_create_new_slp_node (vec<tree> ops)
+vect_create_new_slp_node (vec_info *vinfo,
+			  vec<tree> ops, unsigned nops = 0,
+			  vect_def_type def_type = vect_external_def)
 {
   slp_tree node = new _slp_tree;
   SLP_TREE_SCALAR_OPS (node) = ops;
-  SLP_TREE_DEF_TYPE (node) = vect_external_def;
   SLP_TREE_LANES (node) = ops.length ();
+  if (nops != 0
+      || (def_type != vect_external_def && def_type != vect_constant_def))
+    {
+      if (nops != 0)
+	SLP_TREE_CHILDREN (node).create (nops);
+      SLP_TREE_DEF_TYPE (node) = vect_internal_def;
+
+      SLP_TREE_SCALAR_STMTS (node).create (ops.length ());
+      unsigned i;
+      tree op;
+      FOR_EACH_VEC_ELT (ops, i, op)
+	{
+	  stmt_vec_info stmt_info = vinfo->lookup_def (op);
+	  STMT_VINFO_NUM_SLP_USES (stmt_info)++;
+	  SLP_TREE_SCALAR_STMTS (node).quick_push (stmt_info);
+	  if (i == 0)
+	    SLP_TREE_REPRESENTATIVE (node) = stmt_info;
+	}
+    }
+  else
+    SLP_TREE_DEF_TYPE (node) = vect_external_def;
+
   return node;
 }
 
@@ -168,8 +170,6 @@ vect_create_new_slp_node (vec<tree> ops)
    node.  */
 typedef struct _slp_oprnd_info
 {
-  /* Def-stmts for the operands.  */
-  vec<stmt_vec_info> def_stmts;
   /* Operands.  */
   vec<tree> ops;
   /* Information about the first statement, its vector def-type, type, the
@@ -194,7 +194,6 @@ vect_create_oprnd_info (int nops, int group_size)
   for (i = 0; i < nops; i++)
     {
       oprnd_info = XNEW (struct _slp_oprnd_info);
-      oprnd_info->def_stmts.create (group_size);
       oprnd_info->ops.create (group_size);
       oprnd_info->first_dt = vect_uninitialized_def;
       oprnd_info->first_op_type = NULL_TREE;
@@ -216,7 +215,6 @@ vect_free_oprnd_info (vec<slp_oprnd_info> &oprnds_info)
 
   FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
     {
-      oprnd_info->def_stmts.release ();
       oprnd_info->ops.release ();
       XDELETE (oprnd_info);
     }
@@ -459,7 +457,10 @@ again:
 	}
 
       if (def_stmt_info && is_pattern_stmt_p (def_stmt_info))
-	oprnd_info->any_pattern = true;
+	{
+	  oprnd_info->any_pattern = true;
+	  oprnd = gimple_get_lhs (def_stmt_info->stmt);
+	}
 
       if (first)
 	{
@@ -541,7 +542,6 @@ again:
 	  oprnd_info->first_dt = vect_external_def;
 	  /* Fallthru.  */
 	case vect_constant_def:
-	  oprnd_info->def_stmts.quick_push (NULL);
 	  oprnd_info->ops.quick_push (oprnd);
 	  break;
 
@@ -559,13 +559,11 @@ again:
 		 us a sane SLP graph (still the stmts are not 100%
 		 correct wrt the initial values).  */
 	      gcc_assert (!first);
-	      oprnd_info->def_stmts.quick_push (oprnd_info->def_stmts[0]);
 	      oprnd_info->ops.quick_push (oprnd_info->ops[0]);
 	      break;
 	    }
 	  /* Fallthru.  */
 	case vect_induction_def:
-	  oprnd_info->def_stmts.quick_push (def_stmt_info);
 	  oprnd_info->ops.quick_push (oprnd);
 	  break;
 
@@ -1096,8 +1094,8 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
    need a special value for deleted that differs from empty.  */
 struct bst_traits
 {
-  typedef vec <stmt_vec_info> value_type;
-  typedef vec <stmt_vec_info> compare_type;
+  typedef vec <tree> value_type;
+  typedef vec <tree> compare_type;
   static inline hashval_t hash (value_type);
   static inline bool equal (value_type existing, value_type candidate);
   static inline bool is_empty (value_type x) { return !x.exists (); }
@@ -1112,7 +1110,10 @@ bst_traits::hash (value_type x)
 {
   inchash::hash h;
   for (unsigned i = 0; i < x.length (); ++i)
-    h.add_int (gimple_uid (x[i]->stmt));
+    /* ???  FP constants are not shared so we can't use simple
+       pointer hashing and equivalence which would work if we'd
+       just care for SSA names here.  */
+    inchash::add_expr (x[i], h, 0);
   return h.end ();
 }
 inline bool
@@ -1121,30 +1122,33 @@ bst_traits::equal (value_type existing, value_type candidate)
   if (existing.length () != candidate.length ())
     return false;
   for (unsigned i = 0; i < existing.length (); ++i)
-    if (existing[i] != candidate[i])
+    if (existing[i] != candidate[i]
+	&& (!types_compatible_p (TREE_TYPE (existing[i]),
+				 TREE_TYPE (candidate[i]))
+	    || !operand_equal_p (existing[i], candidate[i], 0)))
       return false;
   return true;
 }
 
-typedef hash_map <vec <gimple *>, slp_tree,
+typedef hash_map <vec <tree>, slp_tree,
 		  simple_hashmap_traits <bst_traits, slp_tree> >
   scalar_stmts_to_slp_tree_map_t;
 
 static slp_tree
 vect_build_slp_tree_2 (vec_info *vinfo,
-		       vec<stmt_vec_info> stmts, unsigned int group_size,
+		       vec<tree> defs, unsigned int group_size,
 		       poly_uint64 *max_nunits,
 		       bool *matches, unsigned *npermutes, unsigned *tree_size,
 		       scalar_stmts_to_slp_tree_map_t *bst_map);
 
 static slp_tree
 vect_build_slp_tree (vec_info *vinfo,
-		     vec<stmt_vec_info> stmts, unsigned int group_size,
+		     vec<tree> defs, unsigned int group_size,
 		     poly_uint64 *max_nunits,
 		     bool *matches, unsigned *npermutes, unsigned *tree_size,
 		     scalar_stmts_to_slp_tree_map_t *bst_map)
 {
-  if (slp_tree *leader = bst_map->get (stmts))
+  if (slp_tree *leader = bst_map->get (defs))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location, "re-using %sSLP tree %p\n",
@@ -1157,7 +1161,7 @@ vect_build_slp_tree (vec_info *vinfo,
       return *leader;
     }
   poly_uint64 this_max_nunits = 1;
-  slp_tree res = vect_build_slp_tree_2 (vinfo, stmts, group_size,
+  slp_tree res = vect_build_slp_tree_2 (vinfo, defs, group_size,
 					&this_max_nunits,
 					matches, npermutes, tree_size, bst_map);
   if (res)
@@ -1167,7 +1171,7 @@ vect_build_slp_tree (vec_info *vinfo,
       /* Keep a reference for the bst_map use.  */
       res->refcnt++;
     }
-  bst_map->put (stmts.copy (), res);
+  bst_map->put (defs.copy (), res);
   return res;
 }
 
@@ -1180,7 +1184,7 @@ vect_build_slp_tree (vec_info *vinfo,
 
 static slp_tree
 vect_build_slp_tree_2 (vec_info *vinfo,
-		       vec<stmt_vec_info> stmts, unsigned int group_size,
+		       vec<tree> defs, unsigned int group_size,
 		       poly_uint64 *max_nunits,
 		       bool *matches, unsigned *npermutes, unsigned *tree_size,
 		       scalar_stmts_to_slp_tree_map_t *bst_map)
@@ -1189,8 +1193,54 @@ vect_build_slp_tree_2 (vec_info *vinfo,
   poly_uint64 this_max_nunits = *max_nunits;
   slp_tree node;
 
-  matches[0] = false;
+  auto_vec<stmt_vec_info> stmts;
+  stmts.create (defs.length ());
+  vect_def_type dt;
+  vect_def_type def0_type = vect_constant_def;
+  stmt_vec_info def_info;
+  if (!vect_is_simple_use (defs[0], vinfo, &def0_type, &def_info))
+    return NULL;
+  stmts.quick_push (def_info);
+  /* Fail gracefully to allow eventual splitting.  */
+  matches[0] = true;
+  bool fail = false;
+  for (i = 1; i < defs.length (); ++i)
+    {
+      if (!vect_is_simple_use (defs[i], vinfo, &dt, &def_info))
+	return NULL;
+      stmts.quick_push (def_info);
+      if ((def0_type == vect_constant_def
+	   || def0_type == vect_external_def)
+	  != (dt == vect_constant_def
+	      || dt == vect_external_def))
+	{
+	  matches[i] = false;
+	  fail = true;
+	}
+      else
+	matches[i] = true;
+      if (dt == vect_external_def
+	  && def0_type == vect_constant_def)
+	def0_type = vect_external_def;
+    }
+  /* Deal with mismatches in internal vs. invariant/external defs.  */
+  if (fail)
+    return NULL;
+  if (def0_type == vect_external_def
+      || def0_type == vect_constant_def)
+    {
+      tree scalar_type = TREE_TYPE (defs[0]);
+      tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
+						  group_size);
+      if (!vect_record_max_nunits (vinfo, NULL, group_size, vectype,
+				   max_nunits))
+	return NULL;
+      node = vect_create_new_slp_node (vinfo, defs, 0, def0_type);
+      SLP_TREE_VECTYPE (node) = vectype;
+      return node;
+    }
 
+  matches[0] = false;
   stmt_vec_info stmt_info = stmts[0];
   if (gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt))
     nops = gimple_call_num_args (stmt);
@@ -1237,7 +1287,7 @@ vect_build_slp_tree_2 (vec_info *vinfo,
       else
 	return NULL;
       (*tree_size)++;
-      node = vect_create_new_slp_node (stmts, 0);
+      node = vect_create_new_slp_node (vinfo, defs, 0, vect_internal_def);
       SLP_TREE_VECTYPE (node) = vectype;
       return node;
     }
@@ -1325,23 +1375,12 @@ vect_build_slp_tree_2 (vec_info *vinfo,
 	  continue;
 	}
 
-      if (oprnd_info->first_dt != vect_internal_def
-	  && oprnd_info->first_dt != vect_reduction_def
-	  && oprnd_info->first_dt != vect_induction_def)
-	{
-	  slp_tree invnode = vect_create_new_slp_node (oprnd_info->ops);
-	  SLP_TREE_DEF_TYPE (invnode) = oprnd_info->first_dt;
-	  oprnd_info->ops = vNULL;
-	  children.safe_push (invnode);
-	  continue;
-	}
-
-      if ((child = vect_build_slp_tree (vinfo, oprnd_info->def_stmts,
+      if ((child = vect_build_slp_tree (vinfo, oprnd_info->ops,
 					group_size, &this_max_nunits,
 					matches, npermutes,
 					&this_tree_size, bst_map)) != NULL)
 	{
-	  oprnd_info->def_stmts = vNULL;
+	  oprnd_info->ops = vNULL;
 	  children.safe_push (child);
 	  continue;
 	}
@@ -1366,10 +1405,9 @@ vect_build_slp_tree_2 (vec_info *vinfo,
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "Building vector operands from scalars\n");
 	  this_tree_size++;
-	  child = vect_create_new_slp_node (oprnd_info->ops);
+	  child = vect_create_new_slp_node (vinfo, oprnd_info->ops);
 	  children.safe_push (child);
 	  oprnd_info->ops = vNULL;
-	  oprnd_info->def_stmts = vNULL;
 	  continue;
 	}
 
@@ -1424,8 +1462,6 @@ vect_build_slp_tree_2 (vec_info *vinfo,
 	  for (j = 0; j < group_size; ++j)
 	    if (matches[j] == !swap_not_matching)
 	      {
-		std::swap (oprnds_info[0]->def_stmts[j],
-			   oprnds_info[1]->def_stmts[j]);
 		std::swap (oprnds_info[0]->ops[j],
 			   oprnds_info[1]->ops[j]);
 		if (dump_enabled_p ())
@@ -1435,12 +1471,12 @@ vect_build_slp_tree_2 (vec_info *vinfo,
 	    dump_printf (MSG_NOTE, "\n");
 	  /* And try again with scratch 'matches' ... */
 	  bool *tem = XALLOCAVEC (bool, group_size);
-	  if ((child = vect_build_slp_tree (vinfo, oprnd_info->def_stmts,
+	  if ((child = vect_build_slp_tree (vinfo, oprnd_info->ops,
 					    group_size, &this_max_nunits,
 					    tem, npermutes,
 					    &this_tree_size, bst_map)) != NULL)
 	    {
-	      oprnd_info->def_stmts = vNULL;
+	      oprnd_info->ops = vNULL;
 	      children.safe_push (child);
 	      continue;
 	    }
@@ -1513,7 +1549,7 @@ fail:
 
       /* Here we record the original defs since this
 	 node represents the final lane configuration.  */
-      node = vect_create_new_slp_node (stmts, 2);
+      node = vect_create_new_slp_node (vinfo, defs, 2);
       SLP_TREE_VECTYPE (node) = vectype;
       SLP_TREE_CODE (node) = VEC_PERM_EXPR;
       SLP_TREE_CHILDREN (node).quick_push (one);
@@ -1544,7 +1580,7 @@ fail:
       return node;
     }
 
-  node = vect_create_new_slp_node (stmts, nops);
+  node = vect_create_new_slp_node (vinfo, defs, nops);
   SLP_TREE_VECTYPE (node) = vectype;
   SLP_TREE_CHILDREN (node).splice (children);
   return node;
@@ -2070,12 +2106,35 @@ vect_analyze_slp_instance (vec_info *vinfo,
   /* Create a node (a root of the SLP tree) for the packed grouped stores.  */
   scalar_stmts.create (group_size);
   stmt_vec_info next_info = stmt_info;
+  vec<tree> defs;
+  defs.create (group_size);
   if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
       /* Collect the stores and store them in SLP_TREE_SCALAR_STMTS.  */
       while (next_info)
         {
+	  /* Just needed for the root SLP node, otherwise "wrong".  */
 	  scalar_stmts.safe_push (vect_stmt_to_vectorize (next_info));
+	  /* Defs to seed the SLP tree from (excluding the store itself).  */
+	  tree def
+	    = gimple_assign_rhs1 (vect_stmt_to_vectorize (next_info)->stmt);
+	  if (stmt_vec_info defstmt = vinfo->lookup_def (def))
+	    def = gimple_get_lhs (vect_stmt_to_vectorize (defstmt)->stmt);
+	  defs.safe_push (def);
+	  if (is_a <bb_vec_info> (vinfo))
+	    {
+	      /* For BB vectorization we have to perform late vectype
+		 assignment to stores.  */
+	      tree vectype, nunits_vectype;
+	      if (!vect_get_vector_types_for_stmt (vinfo, next_info, &vectype,
+						   &nunits_vectype, group_size)
+		  || !vect_update_shared_vectype (next_info, vectype))
+		{
+		  defs.release ();
+		  scalar_stmts.release ();
+		  return false;
+		}
+	    }
 	  next_info = DR_GROUP_NEXT_ELEMENT (next_info);
         }
     }
@@ -2085,7 +2144,9 @@ vect_analyze_slp_instance (vec_info *vinfo,
 	 SLP_TREE_SCALAR_STMTS.  */
       while (next_info)
         {
-	  scalar_stmts.safe_push (vect_stmt_to_vectorize (next_info));
+	  stmt_vec_info def_info = vect_stmt_to_vectorize (next_info);
+	  scalar_stmts.safe_push (def_info);
+	  defs.quick_push (gimple_get_lhs (def_info->stmt));
 	  next_info = REDUC_GROUP_NEXT_ELEMENT (next_info);
         }
       /* Mark the first element of the reduction chain as reduction to properly
@@ -2110,7 +2171,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
 	      if (!def_info)
 		return false;
 	      def_info = vect_stmt_to_vectorize (def_info);
-	      scalar_stmts.safe_push (def_info);
+	      defs.quick_push (gimple_get_lhs (def_info->stmt));
 	    }
 	  else
 	    return false;
@@ -2125,7 +2186,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
       /* Collect reduction statements.  */
       vec<stmt_vec_info> reductions = as_a <loop_vec_info> (vinfo)->reductions;
       for (i = 0; reductions.iterate (i, &next_info); i++)
-	scalar_stmts.safe_push (next_info);
+	defs.quick_push (gimple_get_lhs (next_info->stmt));
     }
 
   /* Build the tree for the SLP instance.  */
@@ -2133,7 +2194,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
   unsigned npermutes = 0;
   poly_uint64 max_nunits = nunits;
   unsigned tree_size = 0;
-  node = vect_build_slp_tree (vinfo, scalar_stmts, group_size,
+  node = vect_build_slp_tree (vinfo, defs, group_size,
 			      &max_nunits, matches, &npermutes,
 			      &tree_size, bst_map);
   if (node != NULL)
@@ -2238,11 +2299,11 @@ vect_analyze_slp_instance (vec_info *vinfo,
 	      gcc_assert (r);
 	      next_info = vinfo->lookup_stmt (use_stmt);
 	      next_info = vect_stmt_to_vectorize (next_info);
-	      scalar_stmts = vNULL;
-	      scalar_stmts.create (group_size);
+	      vec<tree> scalar_ops;
+	      scalar_ops.create (group_size);
 	      for (unsigned i = 0; i < group_size; ++i)
-		scalar_stmts.quick_push (next_info);
-	      slp_tree conv = vect_create_new_slp_node (scalar_stmts, 1);
+		scalar_ops.quick_push (gimple_get_lhs (next_info->stmt));
+	      slp_tree conv = vect_create_new_slp_node (vinfo, scalar_ops, 1);
 	      SLP_TREE_VECTYPE (conv) = STMT_VINFO_VECTYPE (next_info);
 	      SLP_TREE_CHILDREN (conv).quick_push (node);
 	      SLP_INSTANCE_TREE (new_instance) = conv;
@@ -2252,6 +2313,21 @@ vect_analyze_slp_instance (vec_info *vinfo,
 	      REDUC_GROUP_FIRST_ELEMENT (next_info) = next_info;
 	      REDUC_GROUP_NEXT_ELEMENT (next_info) = NULL;
 	    }
+	  else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+	    {
+	      /* Put the root store group in.  */
+	      slp_tree store = vect_create_new_slp_node (vinfo, vNULL, 1,
+							 vect_internal_def);
+	      SLP_TREE_SCALAR_STMTS (store) = scalar_stmts;
+	      stmt_vec_info stmt;
+	      FOR_EACH_VEC_ELT (scalar_stmts, i, stmt)
+		STMT_VINFO_NUM_SLP_USES (stmt)++;
+	      SLP_TREE_REPRESENTATIVE (store) = scalar_stmts[0];
+	      SLP_TREE_VECTYPE (store) = STMT_VINFO_VECTYPE (scalar_stmts[0]);
+	      SLP_TREE_LANES (store) = scalar_stmts.length ();
+	      SLP_TREE_CHILDREN (store).quick_push (node);
+	      SLP_INSTANCE_TREE (new_instance) = store;
+	    }
 
 	  vinfo->slp_instances.safe_push (new_instance);
  
Richard Sandiford Feb. 14, 2022, 3:50 p.m. UTC | #2
Richard Biener <rguenther@suse.de> writes:
> On Mon, 14 Feb 2022, Richard Sandiford wrote:
>
>> ldp_stp_1.c, ldp_stp_4.c and ldp_stp_5.c have been failing since
>> vectorisation was enabled at -O2.  In all three cases SLP is
>> generating vector code when scalar code would be better.
>> 
>> The problem is that the target costs do not model whether STP could
>> be used for the scalar or vector code, so the normal latency-based
>> costs for store-heavy code can be way off.  It would be good to fix
>> that “properly” at some point, but it isn't easy; see the existing
>> discussion in aarch64_sve_adjust_stmt_cost for more details.
>> 
>> This patch therefore adds an on-the-side check for whether the
>> code is doing nothing more than set-up+stores.  It then applies
>> STP-based costs to those cases only, in addition to the normal
>> latency-based costs.  (That is, the vector code has to win on
>> both counts rather than on one count individually.)
>> 
>> However, at the moment, SLP costs one vector set-up instruction
>> for every vector in an SLP node, even if the contents are the
>> same as a previous vector in the same node.  Fixing the STP costs
>> without fixing that would regress other cases, tested in the patch.
>> 
>> The patch therefore makes the SLP costing code check for duplicates
>> within a node.  Ideally we'd check for duplicates more globally,
>> but that would require a more global approach to costs: the cost
>> of an initialisation should be amortised across all trees that
>> use the initialisation, rather than fully counted against one
>> arbitrarily-chosen subtree.
>> 
>> Back on aarch64: an earlier version of the patch tried to apply
>> the new heuristic to constant stores.  However, that didn't work
>> too well in practice; see the comments for details.  The patch
>> therefore just tests the status quo for constant cases, leaving out
>> a match if the current choice is dubious.
>> 
>> ldp_stp_5.c was affected by the same thing.  The test would be
>> worth vectorising if we generated better vector code, but:
>> 
>> (1) We do a bad job of moving the { -1, 1 } constant, given that
>>     we have { -1, -1 } and { 1, 1 } to hand.
>> 
>> (2) The vector code has 6 pairable stores to misaligned offsets.
>>     We have peephole patterns to handle such misalignment for
>>     4 pairable stores, but not 6.
>> 
>> So the SLP decision isn't wrong as such.  It's just being let
>> down by later codegen.
>> 
>> The patch therefore adds -mstrict-align to preserve the original
>> intention of the test while adding ldp_stp_19.c to check for the
>> preferred vector code (XFAILed for now).
>> 
>> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
>> OK for the vectoriser bits?
>
> I'll look at the patch tomorrow but it reminded me of an old
> patch I'm still sitting on which reworked the SLP discovery
> cache to be based on defs rather than stmts which allows us to
> cache and re-use SLP nodes for invariants during SLP discovery.

Ah, yeah, that should help with the “more global” bit.  I think
in the end we need both though: reduce duplicate nodes, and remove
duplicate vectors (or at least duplicate vector costs) within a node.

Thanks,
Richard

> From 8df9c7003611e690bd08fd5cff0b624527c99bf4 Mon Sep 17 00:00:00 2001
> From: Richard Biener <rguenther@suse.de>
> Date: Fri, 20 Mar 2020 11:42:47 +0100
> Subject: [PATCH] rework SLP caching based on ops and CSE constants
> To: gcc-patches@gcc.gnu.org
>
> This reworks SLP caching so that it keys on the defs and not
> their defining stmts so we can use it to CSE SLP nodes for
> constants and invariants.
>
> 2020-03-19  Richard Biener  <rguenther@suse.de>
>
> 	* tree-vect-slp.c (): ...
> ---
>  gcc/tree-vect-slp.c | 222 +++++++++++++++++++++++++++++---------------
>  1 file changed, 149 insertions(+), 73 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index 1ffbf6f6af9..e545e34e353 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -129,36 +129,38 @@ vect_free_slp_instance (slp_instance instance, bool final_p)
>    free (instance);
>  }
>  
> -
> -/* Create an SLP node for SCALAR_STMTS.  */
> -
> -static slp_tree
> -vect_create_new_slp_node (vec<stmt_vec_info> scalar_stmts, unsigned nops)
> -{
> -  slp_tree node = new _slp_tree;
> -  SLP_TREE_SCALAR_STMTS (node) = scalar_stmts;
> -  SLP_TREE_CHILDREN (node).create (nops);
> -  SLP_TREE_DEF_TYPE (node) = vect_internal_def;
> -  SLP_TREE_REPRESENTATIVE (node) = scalar_stmts[0];
> -  SLP_TREE_LANES (node) = scalar_stmts.length ();
> -
> -  unsigned i;
> -  stmt_vec_info stmt_info;
> -  FOR_EACH_VEC_ELT (scalar_stmts, i, stmt_info)
> -    STMT_VINFO_NUM_SLP_USES (stmt_info)++;
> -
> -  return node;
> -}
> -
>  /* Create an SLP node for OPS.  */
>  
>  static slp_tree
> -vect_create_new_slp_node (vec<tree> ops)
> +vect_create_new_slp_node (vec_info *vinfo,
> +			  vec<tree> ops, unsigned nops = 0,
> +			  vect_def_type def_type = vect_external_def)
>  {
>    slp_tree node = new _slp_tree;
>    SLP_TREE_SCALAR_OPS (node) = ops;
> -  SLP_TREE_DEF_TYPE (node) = vect_external_def;
>    SLP_TREE_LANES (node) = ops.length ();
> +  if (nops != 0
> +      || (def_type != vect_external_def && def_type != vect_constant_def))
> +    {
> +      if (nops != 0)
> +	SLP_TREE_CHILDREN (node).create (nops);
> +      SLP_TREE_DEF_TYPE (node) = vect_internal_def;
> +
> +      SLP_TREE_SCALAR_STMTS (node).create (ops.length ());
> +      unsigned i;
> +      tree op;
> +      FOR_EACH_VEC_ELT (ops, i, op)
> +	{
> +	  stmt_vec_info stmt_info = vinfo->lookup_def (op);
> +	  STMT_VINFO_NUM_SLP_USES (stmt_info)++;
> +	  SLP_TREE_SCALAR_STMTS (node).quick_push (stmt_info);
> +	  if (i == 0)
> +	    SLP_TREE_REPRESENTATIVE (node) = stmt_info;
> +	}
> +    }
> +  else
> +    SLP_TREE_DEF_TYPE (node) = vect_external_def;
> +
>    return node;
>  }
>  
> @@ -168,8 +170,6 @@ vect_create_new_slp_node (vec<tree> ops)
>     node.  */
>  typedef struct _slp_oprnd_info
>  {
> -  /* Def-stmts for the operands.  */
> -  vec<stmt_vec_info> def_stmts;
>    /* Operands.  */
>    vec<tree> ops;
>    /* Information about the first statement, its vector def-type, type, the
> @@ -194,7 +194,6 @@ vect_create_oprnd_info (int nops, int group_size)
>    for (i = 0; i < nops; i++)
>      {
>        oprnd_info = XNEW (struct _slp_oprnd_info);
> -      oprnd_info->def_stmts.create (group_size);
>        oprnd_info->ops.create (group_size);
>        oprnd_info->first_dt = vect_uninitialized_def;
>        oprnd_info->first_op_type = NULL_TREE;
> @@ -216,7 +215,6 @@ vect_free_oprnd_info (vec<slp_oprnd_info> &oprnds_info)
>  
>    FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
>      {
> -      oprnd_info->def_stmts.release ();
>        oprnd_info->ops.release ();
>        XDELETE (oprnd_info);
>      }
> @@ -459,7 +457,10 @@ again:
>  	}
>  
>        if (def_stmt_info && is_pattern_stmt_p (def_stmt_info))
> -	oprnd_info->any_pattern = true;
> +	{
> +	  oprnd_info->any_pattern = true;
> +	  oprnd = gimple_get_lhs (def_stmt_info->stmt);
> +	}
>  
>        if (first)
>  	{
> @@ -541,7 +542,6 @@ again:
>  	  oprnd_info->first_dt = vect_external_def;
>  	  /* Fallthru.  */
>  	case vect_constant_def:
> -	  oprnd_info->def_stmts.quick_push (NULL);
>  	  oprnd_info->ops.quick_push (oprnd);
>  	  break;
>  
> @@ -559,13 +559,11 @@ again:
>  		 us a sane SLP graph (still the stmts are not 100%
>  		 correct wrt the initial values).  */
>  	      gcc_assert (!first);
> -	      oprnd_info->def_stmts.quick_push (oprnd_info->def_stmts[0]);
>  	      oprnd_info->ops.quick_push (oprnd_info->ops[0]);
>  	      break;
>  	    }
>  	  /* Fallthru.  */
>  	case vect_induction_def:
> -	  oprnd_info->def_stmts.quick_push (def_stmt_info);
>  	  oprnd_info->ops.quick_push (oprnd);
>  	  break;
>  
> @@ -1096,8 +1094,8 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
>     need a special value for deleted that differs from empty.  */
>  struct bst_traits
>  {
> -  typedef vec <stmt_vec_info> value_type;
> -  typedef vec <stmt_vec_info> compare_type;
> +  typedef vec <tree> value_type;
> +  typedef vec <tree> compare_type;
>    static inline hashval_t hash (value_type);
>    static inline bool equal (value_type existing, value_type candidate);
>    static inline bool is_empty (value_type x) { return !x.exists (); }
> @@ -1112,7 +1110,10 @@ bst_traits::hash (value_type x)
>  {
>    inchash::hash h;
>    for (unsigned i = 0; i < x.length (); ++i)
> -    h.add_int (gimple_uid (x[i]->stmt));
> +    /* ???  FP constants are not shared so we can't use simple
> +       pointer hashing and equivalence which would work if we'd
> +       just care for SSA names here.  */
> +    inchash::add_expr (x[i], h, 0);
>    return h.end ();
>  }
>  inline bool
> @@ -1121,30 +1122,33 @@ bst_traits::equal (value_type existing, value_type candidate)
>    if (existing.length () != candidate.length ())
>      return false;
>    for (unsigned i = 0; i < existing.length (); ++i)
> -    if (existing[i] != candidate[i])
> +    if (existing[i] != candidate[i]
> +	&& (!types_compatible_p (TREE_TYPE (existing[i]),
> +				 TREE_TYPE (candidate[i]))
> +	    || !operand_equal_p (existing[i], candidate[i], 0)))
>        return false;
>    return true;
>  }
>  
> -typedef hash_map <vec <gimple *>, slp_tree,
> +typedef hash_map <vec <tree>, slp_tree,
>  		  simple_hashmap_traits <bst_traits, slp_tree> >
>    scalar_stmts_to_slp_tree_map_t;
>  
>  static slp_tree
>  vect_build_slp_tree_2 (vec_info *vinfo,
> -		       vec<stmt_vec_info> stmts, unsigned int group_size,
> +		       vec<tree> defs, unsigned int group_size,
>  		       poly_uint64 *max_nunits,
>  		       bool *matches, unsigned *npermutes, unsigned *tree_size,
>  		       scalar_stmts_to_slp_tree_map_t *bst_map);
>  
>  static slp_tree
>  vect_build_slp_tree (vec_info *vinfo,
> -		     vec<stmt_vec_info> stmts, unsigned int group_size,
> +		     vec<tree> defs, unsigned int group_size,
>  		     poly_uint64 *max_nunits,
>  		     bool *matches, unsigned *npermutes, unsigned *tree_size,
>  		     scalar_stmts_to_slp_tree_map_t *bst_map)
>  {
> -  if (slp_tree *leader = bst_map->get (stmts))
> +  if (slp_tree *leader = bst_map->get (defs))
>      {
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_NOTE, vect_location, "re-using %sSLP tree %p\n",
> @@ -1157,7 +1161,7 @@ vect_build_slp_tree (vec_info *vinfo,
>        return *leader;
>      }
>    poly_uint64 this_max_nunits = 1;
> -  slp_tree res = vect_build_slp_tree_2 (vinfo, stmts, group_size,
> +  slp_tree res = vect_build_slp_tree_2 (vinfo, defs, group_size,
>  					&this_max_nunits,
>  					matches, npermutes, tree_size, bst_map);
>    if (res)
> @@ -1167,7 +1171,7 @@ vect_build_slp_tree (vec_info *vinfo,
>        /* Keep a reference for the bst_map use.  */
>        res->refcnt++;
>      }
> -  bst_map->put (stmts.copy (), res);
> +  bst_map->put (defs.copy (), res);
>    return res;
>  }
>  
> @@ -1180,7 +1184,7 @@ vect_build_slp_tree (vec_info *vinfo,
>  
>  static slp_tree
>  vect_build_slp_tree_2 (vec_info *vinfo,
> -		       vec<stmt_vec_info> stmts, unsigned int group_size,
> +		       vec<tree> defs, unsigned int group_size,
>  		       poly_uint64 *max_nunits,
>  		       bool *matches, unsigned *npermutes, unsigned *tree_size,
>  		       scalar_stmts_to_slp_tree_map_t *bst_map)
> @@ -1189,8 +1193,54 @@ vect_build_slp_tree_2 (vec_info *vinfo,
>    poly_uint64 this_max_nunits = *max_nunits;
>    slp_tree node;
>  
> -  matches[0] = false;
> +  auto_vec<stmt_vec_info> stmts;
> +  stmts.create (defs.length ());
> +  vect_def_type dt;
> +  vect_def_type def0_type = vect_constant_def;
> +  stmt_vec_info def_info;
> +  if (!vect_is_simple_use (defs[0], vinfo, &def0_type, &def_info))
> +    return NULL;
> +  stmts.quick_push (def_info);
> +  /* Fail gracefully to allow eventual splitting.  */
> +  matches[0] = true;
> +  bool fail = false;
> +  for (i = 1; i < defs.length (); ++i)
> +    {
> +      if (!vect_is_simple_use (defs[i], vinfo, &dt, &def_info))
> +	return NULL;
> +      stmts.quick_push (def_info);
> +      if ((def0_type == vect_constant_def
> +	   || def0_type == vect_external_def)
> +	  != (dt == vect_constant_def
> +	      || dt == vect_external_def))
> +	{
> +	  matches[i] = false;
> +	  fail = true;
> +	}
> +      else
> +	matches[i] = true;
> +      if (dt == vect_external_def
> +	  && def0_type == vect_constant_def)
> +	def0_type = vect_external_def;
> +    }
> +  /* Deal with mismatches in internal vs. invariant/external defs.  */
> +  if (fail)
> +    return NULL;
> +  if (def0_type == vect_external_def
> +      || def0_type == vect_constant_def)
> +    {
> +      tree scalar_type = TREE_TYPE (defs[0]);
> +      tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
> +						  group_size);
> +      if (!vect_record_max_nunits (vinfo, NULL, group_size, vectype,
> +				   max_nunits))
> +	return NULL;
> +      node = vect_create_new_slp_node (vinfo, defs, 0, def0_type);
> +      SLP_TREE_VECTYPE (node) = vectype;
> +      return node;
> +    }
>  
> +  matches[0] = false;
>    stmt_vec_info stmt_info = stmts[0];
>    if (gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt))
>      nops = gimple_call_num_args (stmt);
> @@ -1237,7 +1287,7 @@ vect_build_slp_tree_2 (vec_info *vinfo,
>        else
>  	return NULL;
>        (*tree_size)++;
> -      node = vect_create_new_slp_node (stmts, 0);
> +      node = vect_create_new_slp_node (vinfo, defs, 0, vect_internal_def);
>        SLP_TREE_VECTYPE (node) = vectype;
>        return node;
>      }
> @@ -1325,23 +1375,12 @@ vect_build_slp_tree_2 (vec_info *vinfo,
>  	  continue;
>  	}
>  
> -      if (oprnd_info->first_dt != vect_internal_def
> -	  && oprnd_info->first_dt != vect_reduction_def
> -	  && oprnd_info->first_dt != vect_induction_def)
> -	{
> -	  slp_tree invnode = vect_create_new_slp_node (oprnd_info->ops);
> -	  SLP_TREE_DEF_TYPE (invnode) = oprnd_info->first_dt;
> -	  oprnd_info->ops = vNULL;
> -	  children.safe_push (invnode);
> -	  continue;
> -	}
> -
> -      if ((child = vect_build_slp_tree (vinfo, oprnd_info->def_stmts,
> +      if ((child = vect_build_slp_tree (vinfo, oprnd_info->ops,
>  					group_size, &this_max_nunits,
>  					matches, npermutes,
>  					&this_tree_size, bst_map)) != NULL)
>  	{
> -	  oprnd_info->def_stmts = vNULL;
> +	  oprnd_info->ops = vNULL;
>  	  children.safe_push (child);
>  	  continue;
>  	}
> @@ -1366,10 +1405,9 @@ vect_build_slp_tree_2 (vec_info *vinfo,
>  	    dump_printf_loc (MSG_NOTE, vect_location,
>  			     "Building vector operands from scalars\n");
>  	  this_tree_size++;
> -	  child = vect_create_new_slp_node (oprnd_info->ops);
> +	  child = vect_create_new_slp_node (vinfo, oprnd_info->ops);
>  	  children.safe_push (child);
>  	  oprnd_info->ops = vNULL;
> -	  oprnd_info->def_stmts = vNULL;
>  	  continue;
>  	}
>  
> @@ -1424,8 +1462,6 @@ vect_build_slp_tree_2 (vec_info *vinfo,
>  	  for (j = 0; j < group_size; ++j)
>  	    if (matches[j] == !swap_not_matching)
>  	      {
> -		std::swap (oprnds_info[0]->def_stmts[j],
> -			   oprnds_info[1]->def_stmts[j]);
>  		std::swap (oprnds_info[0]->ops[j],
>  			   oprnds_info[1]->ops[j]);
>  		if (dump_enabled_p ())
> @@ -1435,12 +1471,12 @@ vect_build_slp_tree_2 (vec_info *vinfo,
>  	    dump_printf (MSG_NOTE, "\n");
>  	  /* And try again with scratch 'matches' ... */
>  	  bool *tem = XALLOCAVEC (bool, group_size);
> -	  if ((child = vect_build_slp_tree (vinfo, oprnd_info->def_stmts,
> +	  if ((child = vect_build_slp_tree (vinfo, oprnd_info->ops,
>  					    group_size, &this_max_nunits,
>  					    tem, npermutes,
>  					    &this_tree_size, bst_map)) != NULL)
>  	    {
> -	      oprnd_info->def_stmts = vNULL;
> +	      oprnd_info->ops = vNULL;
>  	      children.safe_push (child);
>  	      continue;
>  	    }
> @@ -1513,7 +1549,7 @@ fail:
>  
>        /* Here we record the original defs since this
>  	 node represents the final lane configuration.  */
> -      node = vect_create_new_slp_node (stmts, 2);
> +      node = vect_create_new_slp_node (vinfo, defs, 2);
>        SLP_TREE_VECTYPE (node) = vectype;
>        SLP_TREE_CODE (node) = VEC_PERM_EXPR;
>        SLP_TREE_CHILDREN (node).quick_push (one);
> @@ -1544,7 +1580,7 @@ fail:
>        return node;
>      }
>  
> -  node = vect_create_new_slp_node (stmts, nops);
> +  node = vect_create_new_slp_node (vinfo, defs, nops);
>    SLP_TREE_VECTYPE (node) = vectype;
>    SLP_TREE_CHILDREN (node).splice (children);
>    return node;
> @@ -2070,12 +2106,35 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    /* Create a node (a root of the SLP tree) for the packed grouped stores.  */
>    scalar_stmts.create (group_size);
>    stmt_vec_info next_info = stmt_info;
> +  vec<tree> defs;
> +  defs.create (group_size);
>    if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>      {
>        /* Collect the stores and store them in SLP_TREE_SCALAR_STMTS.  */
>        while (next_info)
>          {
> +	  /* Just needed for the root SLP node, otherwise "wrong".  */
>  	  scalar_stmts.safe_push (vect_stmt_to_vectorize (next_info));
> +	  /* Defs to seed the SLP tree from (excluding the store itself).  */
> +	  tree def
> +	    = gimple_assign_rhs1 (vect_stmt_to_vectorize (next_info)->stmt);
> +	  if (stmt_vec_info defstmt = vinfo->lookup_def (def))
> +	    def = gimple_get_lhs (vect_stmt_to_vectorize (defstmt)->stmt);
> +	  defs.safe_push (def);
> +	  if (is_a <bb_vec_info> (vinfo))
> +	    {
> +	      /* For BB vectorization we have to perform late vectype
> +		 assignment to stores.  */
> +	      tree vectype, nunits_vectype;
> +	      if (!vect_get_vector_types_for_stmt (vinfo, next_info, &vectype,
> +						   &nunits_vectype, group_size)
> +		  || !vect_update_shared_vectype (next_info, vectype))
> +		{
> +		  defs.release ();
> +		  scalar_stmts.release ();
> +		  return false;
> +		}
> +	    }
>  	  next_info = DR_GROUP_NEXT_ELEMENT (next_info);
>          }
>      }
> @@ -2085,7 +2144,9 @@ vect_analyze_slp_instance (vec_info *vinfo,
>  	 SLP_TREE_SCALAR_STMTS.  */
>        while (next_info)
>          {
> -	  scalar_stmts.safe_push (vect_stmt_to_vectorize (next_info));
> +	  stmt_vec_info def_info = vect_stmt_to_vectorize (next_info);
> +	  scalar_stmts.safe_push (def_info);
> +	  defs.quick_push (gimple_get_lhs (def_info->stmt));
>  	  next_info = REDUC_GROUP_NEXT_ELEMENT (next_info);
>          }
>        /* Mark the first element of the reduction chain as reduction to properly
> @@ -2110,7 +2171,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>  	      if (!def_info)
>  		return false;
>  	      def_info = vect_stmt_to_vectorize (def_info);
> -	      scalar_stmts.safe_push (def_info);
> +	      defs.quick_push (gimple_get_lhs (def_info->stmt));
>  	    }
>  	  else
>  	    return false;
> @@ -2125,7 +2186,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>        /* Collect reduction statements.  */
>        vec<stmt_vec_info> reductions = as_a <loop_vec_info> (vinfo)->reductions;
>        for (i = 0; reductions.iterate (i, &next_info); i++)
> -	scalar_stmts.safe_push (next_info);
> +	defs.quick_push (gimple_get_lhs (next_info->stmt));
>      }
>  
>    /* Build the tree for the SLP instance.  */
> @@ -2133,7 +2194,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    unsigned npermutes = 0;
>    poly_uint64 max_nunits = nunits;
>    unsigned tree_size = 0;
> -  node = vect_build_slp_tree (vinfo, scalar_stmts, group_size,
> +  node = vect_build_slp_tree (vinfo, defs, group_size,
>  			      &max_nunits, matches, &npermutes,
>  			      &tree_size, bst_map);
>    if (node != NULL)
> @@ -2238,11 +2299,11 @@ vect_analyze_slp_instance (vec_info *vinfo,
>  	      gcc_assert (r);
>  	      next_info = vinfo->lookup_stmt (use_stmt);
>  	      next_info = vect_stmt_to_vectorize (next_info);
> -	      scalar_stmts = vNULL;
> -	      scalar_stmts.create (group_size);
> +	      vec<tree> scalar_ops;
> +	      scalar_ops.create (group_size);
>  	      for (unsigned i = 0; i < group_size; ++i)
> -		scalar_stmts.quick_push (next_info);
> -	      slp_tree conv = vect_create_new_slp_node (scalar_stmts, 1);
> +		scalar_ops.quick_push (gimple_get_lhs (next_info->stmt));
> +	      slp_tree conv = vect_create_new_slp_node (vinfo, scalar_ops, 1);
>  	      SLP_TREE_VECTYPE (conv) = STMT_VINFO_VECTYPE (next_info);
>  	      SLP_TREE_CHILDREN (conv).quick_push (node);
>  	      SLP_INSTANCE_TREE (new_instance) = conv;
> @@ -2252,6 +2313,21 @@ vect_analyze_slp_instance (vec_info *vinfo,
>  	      REDUC_GROUP_FIRST_ELEMENT (next_info) = next_info;
>  	      REDUC_GROUP_NEXT_ELEMENT (next_info) = NULL;
>  	    }
> +	  else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
> +	    {
> +	      /* Put the root store group in.  */
> +	      slp_tree store = vect_create_new_slp_node (vinfo, vNULL, 1,
> +							 vect_internal_def);
> +	      SLP_TREE_SCALAR_STMTS (store) = scalar_stmts;
> +	      stmt_vec_info stmt;
> +	      FOR_EACH_VEC_ELT (scalar_stmts, i, stmt)
> +		STMT_VINFO_NUM_SLP_USES (stmt)++;
> +	      SLP_TREE_REPRESENTATIVE (store) = scalar_stmts[0];
> +	      SLP_TREE_VECTYPE (store) = STMT_VINFO_VECTYPE (scalar_stmts[0]);
> +	      SLP_TREE_LANES (store) = scalar_stmts.length ();
> +	      SLP_TREE_CHILDREN (store).quick_push (node);
> +	      SLP_INSTANCE_TREE (new_instance) = store;
> +	    }
>  
>  	  vinfo->slp_instances.safe_push (new_instance);
  
Richard Biener Feb. 15, 2022, 9:23 a.m. UTC | #3
On Mon, 14 Feb 2022, Richard Sandiford wrote:

> ldp_stp_1.c, ldp_stp_4.c and ldp_stp_5.c have been failing since
> vectorisation was enabled at -O2.  In all three cases SLP is
> generating vector code when scalar code would be better.
> 
> The problem is that the target costs do not model whether STP could
> be used for the scalar or vector code, so the normal latency-based
> costs for store-heavy code can be way off.  It would be good to fix
> that “properly” at some point, but it isn't easy; see the existing
> discussion in aarch64_sve_adjust_stmt_cost for more details.
> 
> This patch therefore adds an on-the-side check for whether the
> code is doing nothing more than set-up+stores.  It then applies
> STP-based costs to those cases only, in addition to the normal
> latency-based costs.  (That is, the vector code has to win on
> both counts rather than on one count individually.)
> 
> However, at the moment, SLP costs one vector set-up instruction
> for every vector in an SLP node, even if the contents are the
> same as a previous vector in the same node.  Fixing the STP costs
> without fixing that would regress other cases, tested in the patch.
> 
> The patch therefore makes the SLP costing code check for duplicates
> within a node.  Ideally we'd check for duplicates more globally,
> but that would require a more global approach to costs: the cost
> of an initialisation should be amortised across all trees that
> use the initialisation, rather than fully counted against one
> arbitrarily-chosen subtree.
> 
> Back on aarch64: an earlier version of the patch tried to apply
> the new heuristic to constant stores.  However, that didn't work
> too well in practice; see the comments for details.  The patch
> therefore just tests the status quo for constant cases, leaving out
> a match if the current choice is dubious.
> 
> ldp_stp_5.c was affected by the same thing.  The test would be
> worth vectorising if we generated better vector code, but:
> 
> (1) We do a bad job of moving the { -1, 1 } constant, given that
>     we have { -1, -1 } and { 1, 1 } to hand.
> 
> (2) The vector code has 6 pairable stores to misaligned offsets.
>     We have peephole patterns to handle such misalignment for
>     4 pairable stores, but not 6.
> 
> So the SLP decision isn't wrong as such.  It's just being let
> down by later codegen.
> 
> The patch therefore adds -mstrict-align to preserve the original
> intention of the test while adding ldp_stp_19.c to check for the
> preferred vector code (XFAILed for now).
> 
> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
> OK for the vectoriser bits?

OK.

Thanks,
Richard.

> Thanks,
> Richard
> 
> 
> gcc/
> 	* tree-vectorizer.h (vect_scalar_ops_slice): New struct.
> 	(vect_scalar_ops_slice_hash): Likewise.
> 	(vect_scalar_ops_slice::op): New function.
> 	* tree-vect-slp.cc (vect_scalar_ops_slice::all_same_p): New function.
> 	(vect_scalar_ops_slice_hash::hash): Likewise.
> 	(vect_scalar_ops_slice_hash::equal): Likewise.
> 	(vect_prologue_cost_for_slp): Check for duplicate vectors.
> 	* config/aarch64/aarch64.cc
> 	(aarch64_vector_costs::m_stp_sequence_cost): New member variable.
> 	(aarch64_aligned_constant_offset_p): New function.
> 	(aarch64_stp_sequence_cost): Likewise.
> 	(aarch64_vector_costs::add_stmt_cost): Handle new STP heuristic.
> 	(aarch64_vector_costs::finish_cost): Likewise.
> 
> gcc/testsuite/
> 	* gcc.target/aarch64/ldp_stp_5.c: Require -mstrict-align.
> 	* gcc.target/aarch64/ldp_stp_14.h,
> 	* gcc.target/aarch64/ldp_stp_14.c: New test.
> 	* gcc.target/aarch64/ldp_stp_15.c: Likewise.
> 	* gcc.target/aarch64/ldp_stp_16.c: Likewise.
> 	* gcc.target/aarch64/ldp_stp_17.c: Likewise.
> 	* gcc.target/aarch64/ldp_stp_18.c: Likewise.
> 	* gcc.target/aarch64/ldp_stp_19.c: Likewise.
> ---
>  gcc/config/aarch64/aarch64.cc                 | 140 ++++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c |  89 +++++++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h |  50 +++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c | 137 +++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c | 133 +++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c | 120 +++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c | 123 +++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c |   6 +
>  gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c  |   2 +-
>  gcc/tree-vect-slp.cc                          |  75 ++++++----
>  gcc/tree-vectorizer.h                         |  35 +++++
>  11 files changed, 884 insertions(+), 26 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c
> 
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index ec479d3055d..ddd0637185c 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -113,6 +113,41 @@ typedef hash_map<tree_operand_hash,
>  		 std::pair<stmt_vec_info, innermost_loop_behavior *> >
>  	  vec_base_alignments;
>  
> +/* Represents elements [START, START + LENGTH) of cyclical array OPS*
> +   (i.e. OPS repeated to give at least START + LENGTH elements)  */
> +struct vect_scalar_ops_slice
> +{
> +  tree op (unsigned int i) const;
> +  bool all_same_p () const;
> +
> +  vec<tree> *ops;
> +  unsigned int start;
> +  unsigned int length;
> +};
> +
> +/* Return element I of the slice.  */
> +inline tree
> +vect_scalar_ops_slice::op (unsigned int i) const
> +{
> +  return (*ops)[(i + start) % ops->length ()];
> +}
> +
> +/* Hash traits for vect_scalar_ops_slice.  */
> +struct vect_scalar_ops_slice_hash : typed_noop_remove<vect_scalar_ops_slice>
> +{
> +  typedef vect_scalar_ops_slice value_type;
> +  typedef vect_scalar_ops_slice compare_type;
> +
> +  static const bool empty_zero_p = true;
> +
> +  static void mark_deleted (value_type &s) { s.length = ~0U; }
> +  static void mark_empty (value_type &s) { s.length = 0; }
> +  static bool is_deleted (const value_type &s) { return s.length == ~0U; }
> +  static bool is_empty (const value_type &s) { return s.length == 0; }
> +  static hashval_t hash (const value_type &);
> +  static bool equal (const value_type &, const compare_type &);
> +};
> +

Patch

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index ec479d3055d..ddd0637185c 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -113,6 +113,41 @@  typedef hash_map<tree_operand_hash,
 		 std::pair<stmt_vec_info, innermost_loop_behavior *> >
 	  vec_base_alignments;
 
+/* Represents elements [START, START + LENGTH) of cyclical array OPS*
+   (i.e. OPS repeated to give at least START + LENGTH elements)  */
+struct vect_scalar_ops_slice
+{
+  tree op (unsigned int i) const;
+  bool all_same_p () const;
+
+  vec<tree> *ops;
+  unsigned int start;
+  unsigned int length;
+};
+
+/* Return element I of the slice.  */
+inline tree
+vect_scalar_ops_slice::op (unsigned int i) const
+{
+  return (*ops)[(i + start) % ops->length ()];
+}
+
+/* Hash traits for vect_scalar_ops_slice.  */
+struct vect_scalar_ops_slice_hash : typed_noop_remove<vect_scalar_ops_slice>
+{
+  typedef vect_scalar_ops_slice value_type;
+  typedef vect_scalar_ops_slice compare_type;
+
+  static const bool empty_zero_p = true;
+
+  static void mark_deleted (value_type &s) { s.length = ~0U; }
+  static void mark_empty (value_type &s) { s.length = 0; }
+  static bool is_deleted (const value_type &s) { return s.length == ~0U; }
+  static bool is_empty (const value_type &s) { return s.length == 0; }
+  static hashval_t hash (const value_type &);
+  static bool equal (const value_type &, const compare_type &);
+};
+
 /************************************************************************
   SLP
  ************************************************************************/
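As a side note (illustration only, not part of the patch): the modular indexing in vect_scalar_ops_slice::op is what lets a slice describe elements beyond the end of OPS, treating the scalar ops as a cyclical array.  A stand-alone C sketch of the same arithmetic, with hypothetical names:

  #include <stdio.h>

  /* Hypothetical model of vect_scalar_ops_slice::op: element I of the
     slice is ops[(start + i) % n_ops], i.e. OPS behaves as if repeated
     indefinitely.  */
  static int
  slice_op (const int *ops, unsigned int n_ops,
            unsigned int start, unsigned int i)
  {
    return ops[(start + i) % n_ops];
  }

  int
  main (void)
  {
    int ops[3] = { 10, 20, 30 };
    /* A slice with start == 2 and length == 4 yields 30, 10, 20, 30.  */
    for (unsigned int i = 0; i < 4; ++i)
      printf ("%d\n", slice_op (ops, 3, 2, i));
    return 0;
  }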
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 273543d37ea..c6b5a0696a2 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -4533,6 +4533,37 @@  vect_slp_convert_to_external (vec_info *vinfo, slp_tree node,
   return true;
 }
 
+/* Return true if all elements of the slice are the same.  */
+bool
+vect_scalar_ops_slice::all_same_p () const
+{
+  for (unsigned int i = 1; i < length; ++i)
+    if (!operand_equal_p (op (0), op (i)))
+      return false;
+  return true;
+}
+
+hashval_t
+vect_scalar_ops_slice_hash::hash (const value_type &s)
+{
+  hashval_t hash = 0;
+  for (unsigned i = 0; i < s.length; ++i)
+    hash = iterative_hash_expr (s.op (i), hash);
+  return hash;
+}
+
+bool
+vect_scalar_ops_slice_hash::equal (const value_type &s1,
+				   const compare_type &s2)
+{
+  if (s1.length != s2.length)
+    return false;
+  for (unsigned i = 0; i < s1.length; ++i)
+    if (!operand_equal_p (s1.op (i), s2.op (i)))
+      return false;
+  return true;
+}
+
 /* Compute the prologue cost for invariant or constant operands represented
    by NODE.  */
 
@@ -4549,45 +4580,39 @@  vect_prologue_cost_for_slp (slp_tree node,
      When all elements are the same we can use a splat.  */
   tree vectype = SLP_TREE_VECTYPE (node);
   unsigned group_size = SLP_TREE_SCALAR_OPS (node).length ();
-  unsigned num_vects_to_check;
   unsigned HOST_WIDE_INT const_nunits;
   unsigned nelt_limit;
+  auto ops = &SLP_TREE_SCALAR_OPS (node);
+  auto_vec<unsigned int> starts (SLP_TREE_NUMBER_OF_VEC_STMTS (node));
   if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
       && ! multiple_p (const_nunits, group_size))
     {
-      num_vects_to_check = SLP_TREE_NUMBER_OF_VEC_STMTS (node);
       nelt_limit = const_nunits;
+      hash_set<vect_scalar_ops_slice_hash> vector_ops;
+      for (unsigned int i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i)
+	if (!vector_ops.add ({ ops, i * const_nunits, const_nunits }))
+	  starts.quick_push (i * const_nunits);
     }
   else
     {
       /* If either the vector has variable length or the vectors
 	 are composed of repeated whole groups we only need to
 	 cost construction once.  All vectors will be the same.  */
-      num_vects_to_check = 1;
       nelt_limit = group_size;
+      starts.quick_push (0);
     }
-  tree elt = NULL_TREE;
-  unsigned nelt = 0;
-  for (unsigned j = 0; j < num_vects_to_check * nelt_limit; ++j)
-    {
-      unsigned si = j % group_size;
-      if (nelt == 0)
-	elt = SLP_TREE_SCALAR_OPS (node)[si];
-      /* ???  We're just tracking whether all operands of a single
-	 vector initializer are the same, ideally we'd check if
-	 we emitted the same one already.  */
-      else if (elt != SLP_TREE_SCALAR_OPS (node)[si])
-	elt = NULL_TREE;
-      nelt++;
-      if (nelt == nelt_limit)
-	{
-	  record_stmt_cost (cost_vec, 1,
-			    SLP_TREE_DEF_TYPE (node) == vect_external_def
-			    ? (elt ? scalar_to_vec : vec_construct)
-			    : vector_load,
-			    NULL, vectype, 0, vect_prologue);
-	  nelt = 0;
-	}
+  /* ???  We're just tracking whether vectors in a single node are the same.
+     Ideally we'd do something more global.  */
+  for (unsigned int start : starts)
+    {
+      vect_cost_for_stmt kind;
+      if (SLP_TREE_DEF_TYPE (node) == vect_constant_def)
+	kind = vector_load;
+      else if (vect_scalar_ops_slice { ops, start, nelt_limit }.all_same_p ())
+	kind = scalar_to_vec;
+      else
+	kind = vec_construct;
+      record_stmt_cost (cost_vec, 1, kind, NULL, vectype, 0, vect_prologue);
     }
 }
 
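As a concrete (hypothetical) example of what the hashing above catches, assuming 128-bit V4SI vectors: the store group below is covered by two SLP vectors whose scalar operands are identical, so the prologue cost now records one vec_construct rather than two.

  /* Hypothetical example.  The operand node for the eight stores holds
     { a, b, a, b, a, b, a, b }, so both SLP vectors are { a, b, a, b }.
     The hash set above spots the repeat and costs the construction
     once.  */
  void
  store_repeated_pair (int *x, int a, int b)
  {
    x[0] = a;  x[1] = b;
    x[2] = a;  x[3] = b;
    x[4] = a;  x[5] = b;
    x[6] = a;  x[7] = b;
  }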
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 7bb97bd48e4..4cf17526e14 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -14932,6 +14932,31 @@  private:
      - If M_VEC_FLAGS & VEC_ANY_SVE is nonzero then we're costing SVE code.  */
   unsigned int m_vec_flags = 0;
 
+  /* At the moment, we do not model LDP and STP in the vector and scalar costs.
+     This means that code such as:
+
+	a[0] = x;
+	a[1] = x;
+
+     will be costed as two scalar instructions and two vector instructions
+     (a scalar_to_vec and an unaligned_store).  For SLP, the vector form
+     wins if the costs are equal, because of the fact that the vector costs
+     include constant initializations whereas the scalar costs don't.
+     We would therefore tend to vectorize the code above, even though
+     the scalar version can use a single STP.
+
+     We should eventually fix this and model LDP and STP in the main costs;
+     see the comment in aarch64_sve_adjust_stmt_cost for some of the problems.
+     Until then, we look specifically for code that does nothing more than
+     STP-like operations.  We cost them on that basis in addition to the
+     normal latency-based costs.
+
+     If the scalar or vector code could be a sequence of STPs +
+     initialization, this variable counts the cost of the sequence,
+     with 2 units per instruction.  The variable is ~0U for other
+     kinds of code.  */
+  unsigned int m_stp_sequence_cost = 0;
+
   /* On some CPUs, SVE and Advanced SIMD provide the same theoretical vector
      throughput, such as 4x128 Advanced SIMD vs. 2x256 SVE.  In those
      situations, we try to predict whether an Advanced SIMD implementation
@@ -15724,6 +15749,104 @@  aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
     }
 }
 
+/* Return true if STMT_INFO contains a memory access and if the constant
+   component of the memory address is aligned to SIZE bytes.  */
+static bool
+aarch64_aligned_constant_offset_p (stmt_vec_info stmt_info,
+				   poly_uint64 size)
+{
+  if (!STMT_VINFO_DATA_REF (stmt_info))
+    return false;
+
+  if (auto first_stmt = DR_GROUP_FIRST_ELEMENT (stmt_info))
+    stmt_info = first_stmt;
+  tree constant_offset = DR_INIT (STMT_VINFO_DATA_REF (stmt_info));
+  /* Needed for gathers & scatters, for example.  */
+  if (!constant_offset)
+    return false;
+
+  return multiple_p (wi::to_poly_offset (constant_offset), size);
+}
+
+/* Check if a scalar or vector stmt could be part of a region of code
+   that does nothing more than store values to memory, in the scalar
+   case using STP.  Return the cost of the stmt if so, counting 2 for
+   one instruction.  Return ~0U otherwise.
+
+   The arguments are a subset of those passed to add_stmt_cost.  */
+unsigned int
+aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
+			   stmt_vec_info stmt_info, tree vectype)
+{
+  /* Code that stores vector constants uses a vector_load to create
+     the constant.  We don't apply the heuristic to that case for two
+     main reasons:
+
+     - At the moment, STPs are only formed via peephole2, and the
+       constant scalar moves would often come between STRs and so
+       prevent STP formation.
+
+     - The scalar code also has to load the constant somehow, and that
+       isn't costed.  */
+  switch (kind)
+    {
+    case scalar_to_vec:
+      /* Count 2 insns for a GPR->SIMD dup and 1 insn for a FPR->SIMD dup.  */
+      return (FLOAT_TYPE_P (vectype) ? 2 : 4) * count;
+
+    case vec_construct:
+      if (FLOAT_TYPE_P (vectype))
+	/* Count 1 insn for the maximum number of FP->SIMD INS
+	   instructions.  */
+	return (vect_nunits_for_cost (vectype) - 1) * 2 * count;
+
+      /* Count 2 insns for a GPR->SIMD move and 2 insns for the
+	 maximum number of GPR->SIMD INS instructions.  */
+      return vect_nunits_for_cost (vectype) * 4 * count;
+
+    case vector_store:
+    case unaligned_store:
+      /* Count 1 insn per vector if we can't form STP Q pairs.  */
+      if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
+	return count * 2;
+      if (aarch64_tune_params.extra_tuning_flags
+	  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
+	return count * 2;
+
+      if (stmt_info)
+	{
+	  /* Assume we won't be able to use STP if the constant offset
+	     component of the address is misaligned.  ??? This could be
+	     removed if we formed STP pairs earlier, rather than relying
+	     on peephole2.  */
+	  auto size = GET_MODE_SIZE (TYPE_MODE (vectype));
+	  if (!aarch64_aligned_constant_offset_p (stmt_info, size))
+	    return count * 2;
+	}
+      return CEIL (count, 2) * 2;
+
+    case scalar_store:
+      if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))
+	{
+	  /* Check for a mode in which STP pairs can be formed.  */
+	  auto size = GET_MODE_SIZE (TYPE_MODE (aarch64_dr_type (stmt_info)));
+	  if (maybe_ne (size, 4) && maybe_ne (size, 8))
+	    return ~0U;
+
+	  /* Assume we won't be able to use STP if the constant offset
+	     component of the address is misaligned.  ??? This could be
+	     removed if we formed STP pairs earlier, rather than relying
+	     on peephole2.  */
+	  if (!aarch64_aligned_constant_offset_p (stmt_info, size))
+	    return ~0U;
+	}
+      return count;
+
+    default:
+      return ~0U;
+    }
+}
+
 unsigned
 aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 				     stmt_vec_info stmt_info, tree vectype,
@@ -15747,6 +15870,14 @@  aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
       m_analyzed_vinfo = true;
     }
 
+  /* Apply the heuristic described above m_stp_sequence_cost.  */
+  if (m_stp_sequence_cost != ~0U)
+    {
+      uint64_t cost = aarch64_stp_sequence_cost (count, kind,
+						 stmt_info, vectype);
+      m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
+    }
+
   /* Try to get a more accurate cost by looking at STMT_INFO instead
      of just looking at KIND.  */
   if (stmt_info && aarch64_use_new_vector_costs_p ())
@@ -16017,6 +16148,15 @@  aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
     m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
 					   m_costs[vect_body]);
 
+  /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
+     the scalar code in the event of a tie, since there is more chance
+     of scalar code being optimized with surrounding operations.  */
+  if (!loop_vinfo
+      && scalar_costs
+      && m_stp_sequence_cost != ~0U
+      && m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
+    m_costs[vect_body] = 2 * scalar_costs->total_cost ();
+
   vector_costs::finish_cost (scalar_costs);
 }
 
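To make the new heuristic concrete (example only, not part of the patch): for a pure store sequence like the one below, the scalar code is a single STP, so the vector code now has to win on the STP-based count as well as the latency-based one, and we keep the scalar version.

  /* Hypothetical example.  The scalar code is one "stp x1, x1, [x0]":
     its two scalar_stores give an m_stp_sequence_cost of 2 (one STP at
     2 units).  The vector alternative needs a GPR->SIMD dup (4 units)
     plus a vector store (2 units), so finish_cost prefers the scalar
     code, matching the dup_2_int64_t expectation in ldp_stp_17.c.  */
  void
  store_same_twice (long long *a, long long x)
  {
    a[0] = x;
    a[1] = x;
  }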
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c
new file mode 100644
index 00000000000..c7b5f7d6b39
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c
@@ -0,0 +1,89 @@ 
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_int16_t_0:
+**	str	wzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, int16_t, 0);
+
+/*
+** const_4_int16_t_0:
+**	str	xzr, \[x0\]
+**	ret
+*/
+CONST_FN (4, int16_t, 0);
+
+/*
+** const_8_int16_t_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (8, int16_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (16, int16_t, 0);
+
+/*
+** const_32_int16_t_0:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (32, int16_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (2, int16_t, 1);
+
+/*
+** const_4_int16_t_1:
+**	movi	v([0-9]+)\.4h, .*
+**	str	d\1, \[x0\]
+**	ret
+*/
+CONST_FN (4, int16_t, 1);
+
+/*
+** const_8_int16_t_1:
+**	movi	v([0-9]+)\.8h, .*
+**	str	q\1, \[x0\]
+**	ret
+*/
+CONST_FN (8, int16_t, 1);
+
+/* Fuzzy match due to PR104387.  */
+/*
+** dup_2_int16_t:
+**	...
+**	strh	w1, \[x0, #?2\]
+**	ret
+*/
+DUP_FN (2, int16_t);
+
+/*
+** dup_4_int16_t:
+**	dup	v([0-9]+)\.4h, w1
+**	str	d\1, \[x0\]
+**	ret
+*/
+DUP_FN (4, int16_t);
+
+/*
+** dup_8_int16_t:
+**	dup	v([0-9]+)\.8h, w1
+**	str	q\1, \[x0\]
+**	ret
+*/
+DUP_FN (8, int16_t);
+
+/*
+** cons2_1_int16_t:
+**	strh	w1, \[x0\]
+**	strh	w2, \[x0, #?2\]
+**	ret
+*/
+CONS2_FN (1, int16_t);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h b/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h
new file mode 100644
index 00000000000..39c463ff240
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h
@@ -0,0 +1,50 @@ 
+#include <stdint.h>
+
+#define PRAGMA(X) _Pragma (#X)
+#define UNROLL(COUNT) PRAGMA (GCC unroll (COUNT))
+
+#define CONST_FN(COUNT, TYPE, VAL)		\
+  void						\
+  const_##COUNT##_##TYPE##_##VAL (TYPE *x)	\
+  {						\
+    UNROLL (COUNT)				\
+    for (int i = 0; i < COUNT; ++i)		\
+      x[i] = VAL;				\
+  }
+
+#define DUP_FN(COUNT, TYPE)			\
+  void						\
+  dup_##COUNT##_##TYPE (TYPE *x, TYPE val)	\
+  {						\
+    UNROLL (COUNT)				\
+    for (int i = 0; i < COUNT; ++i)		\
+      x[i] = val;				\
+  }
+
+#define CONS2_FN(COUNT, TYPE)					\
+  void								\
+  cons2_##COUNT##_##TYPE (TYPE *x, TYPE val0, TYPE val1)	\
+  {								\
+    UNROLL (COUNT)						\
+    for (int i = 0; i < COUNT * 2; i += 2)			\
+      {								\
+	x[i + 0] = val0;					\
+	x[i + 1] = val1;					\
+      }								\
+  }
+
+#define CONS4_FN(COUNT, TYPE)					\
+  void								\
+  cons4_##COUNT##_##TYPE (TYPE *x, TYPE val0, TYPE val1,	\
+			  TYPE val2, TYPE val3)			\
+  {								\
+    UNROLL (COUNT)						\
+    for (int i = 0; i < COUNT * 4; i += 4)			\
+      {								\
+	x[i + 0] = val0;					\
+	x[i + 1] = val1;					\
+	x[i + 2] = val2;					\
+	x[i + 3] = val3;					\
+      }								\
+  }
+
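For reference, one hand-expanded instance of these macros (not part of the patch), showing the shape of code the tests feed to the vectoriser; ldp_stp_15.c expects this particular one to stay scalar and use two STPs.

  #include <stdint.h>

  /* Hand expansion of DUP_FN (4, int32_t) from ldp_stp_14.h, with the
     pragma written directly in place of UNROLL (4).  The loop is fully
     unrolled into four stores of VAL, which ldp_stp_15.c expects to
     become "stp w1, w1, [x0]" and "stp w1, w1, [x0, #8]".  */
  void
  dup_4_int32_t (int32_t *x, int32_t val)
  {
  #pragma GCC unroll 4
    for (int i = 0; i < 4; ++i)
      x[i] = val;
  }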
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
new file mode 100644
index 00000000000..131cd0a63c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
@@ -0,0 +1,137 @@ 
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_int32_t_0:
+**	str	xzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, int32_t, 0);
+
+/*
+** const_4_int32_t_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (4, int32_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (8, int32_t, 0);
+
+/*
+** const_16_int32_t_0:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (16, int32_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (2, int32_t, 1);
+
+/*
+** const_4_int32_t_1:
+**	movi	v([0-9]+)\.4s, .*
+**	str	q\1, \[x0\]
+**	ret
+*/
+CONST_FN (4, int32_t, 1);
+
+/*
+** const_8_int32_t_1:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	ret
+*/
+CONST_FN (8, int32_t, 1);
+
+/*
+** dup_2_int32_t:
+**	stp	w1, w1, \[x0\]
+**	ret
+*/
+DUP_FN (2, int32_t);
+
+/*
+** dup_4_int32_t:
+**	stp	w1, w1, \[x0\]
+**	stp	w1, w1, \[x0, #?8\]
+**	ret
+*/
+DUP_FN (4, int32_t);
+
+/*
+** dup_8_int32_t:
+**	dup	v([0-9]+)\.4s, w1
+**	stp	q\1, q\1, \[x0\]
+**	ret
+*/
+DUP_FN (8, int32_t);
+
+/*
+** cons2_1_int32_t:
+**	stp	w1, w2, \[x0\]
+**	ret
+*/
+CONS2_FN (1, int32_t);
+
+/*
+** cons2_2_int32_t:
+**	stp	w1, w2, \[x0\]
+**	stp	w1, w2, \[x0, #?8\]
+**	ret
+*/
+CONS2_FN (2, int32_t);
+
+/*
+** cons2_4_int32_t:
+**	stp	w1, w2, \[x0\]
+**	stp	w1, w2, \[x0, #?8\]
+**	stp	w1, w2, \[x0, #?16\]
+**	stp	w1, w2, \[x0, #?24\]
+**	ret
+*/
+CONS2_FN (4, int32_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONS2_FN (8, int32_t);
+
+/*
+** cons2_16_int32_t:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS2_FN (16, int32_t);
+
+/*
+** cons4_1_int32_t:
+**	stp	w1, w2, \[x0\]
+**	stp	w3, w4, \[x0, #?8\]
+**	ret
+*/
+CONS4_FN (1, int32_t);
+
+/*
+** cons4_2_int32_t:
+**	stp	w1, w2, \[x0\]
+**	stp	w3, w4, \[x0, #?8\]
+**	stp	w1, w2, \[x0, #?16\]
+**	stp	w3, w4, \[x0, #?24\]
+**	ret
+*/
+CONS4_FN (2, int32_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONS4_FN (4, int32_t);
+
+/*
+** cons4_8_int32_t:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS4_FN (8, int32_t);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
new file mode 100644
index 00000000000..8ab117c4dcd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
@@ -0,0 +1,133 @@ 
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_float_0:
+**	str	xzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, float, 0);
+
+/*
+** const_4_float_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (4, float, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (8, float, 0);
+
+/*
+** const_16_float_0:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (16, float, 0);
+
+/*
+** const_2_float_1:
+**	fmov	v([0-9]+)\.2s, .*
+**	str	d\1, \[x0\]
+**	ret
+*/
+CONST_FN (2, float, 1);
+
+/*
+** const_4_float_1:
+**	fmov	v([0-9]+)\.4s, .*
+**	str	q\1, \[x0\]
+**	ret
+*/
+CONST_FN (4, float, 1);
+
+/*
+** dup_2_float:
+**	stp	s0, s0, \[x0\]
+**	ret
+*/
+DUP_FN (2, float);
+
+/* No preference between vectorizing or not vectorizing here.  */
+DUP_FN (4, float);
+
+/*
+** dup_8_float:
+**	dup	v([0-9]+)\.4s, v0.s\[0\]
+**	stp	q\1, q\1, \[x0\]
+**	ret
+*/
+DUP_FN (8, float);
+
+/*
+** cons2_1_float:
+**	stp	s0, s1, \[x0\]
+**	ret
+*/
+CONS2_FN (1, float);
+
+/*
+** cons2_2_float:
+**	stp	s0, s1, \[x0\]
+**	stp	s0, s1, \[x0, #?8\]
+**	ret
+*/
+CONS2_FN (2, float);
+
+/*
+** cons2_4_float:	{ target aarch64_little_endian }
+**	ins	v0.s\[1\], v1.s\[0\]
+**	stp	d0, d0, \[x0\]
+**	stp	d0, d0, \[x0, #?16\]
+**	ret
+*/
+/*
+** cons2_4_float:	{ target aarch64_big_endian }
+**	ins	v1.s\[1\], v0.s\[0\]
+**	stp	d1, d1, \[x0\]
+**	stp	d1, d1, \[x0, #?16\]
+**	ret
+*/
+CONS2_FN (4, float);
+
+/*
+** cons2_8_float:
+**	dup	v([0-9]+)\.4s, .*
+**	...
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONS2_FN (8, float);
+
+/*
+** cons4_1_float:
+**	stp	s0, s1, \[x0\]
+**	stp	s2, s3, \[x0, #?8\]
+**	ret
+*/
+CONS4_FN (1, float);
+
+/*
+** cons4_2_float:
+**	stp	s0, s1, \[x0\]
+**	stp	s2, s3, \[x0, #?8\]
+**	stp	s0, s1, \[x0, #?16\]
+**	stp	s2, s3, \[x0, #?24\]
+**	ret
+*/
+CONS4_FN (2, float);
+
+/*
+** cons4_4_float:
+**	ins	v([0-9]+)\.s.*
+**	...
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONS4_FN (4, float);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
new file mode 100644
index 00000000000..c1122fc07d5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
@@ -0,0 +1,120 @@ 
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_int64_t_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, int64_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (4, int64_t, 0);
+
+/*
+** const_8_int64_t_0:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (8, int64_t, 0);
+
+/*
+** dup_2_int64_t:
+**	stp	x1, x1, \[x0\]
+**	ret
+*/
+DUP_FN (2, int64_t);
+
+/*
+** dup_4_int64_t:
+**	stp	x1, x1, \[x0\]
+**	stp	x1, x1, \[x0, #?16\]
+**	ret
+*/
+DUP_FN (4, int64_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+DUP_FN (8, int64_t);
+
+/*
+** dup_16_int64_t:
+**	dup	v([0-9])\.2d, x1
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	stp	q\1, q\1, \[x0, #?64\]
+**	stp	q\1, q\1, \[x0, #?96\]
+**	ret
+*/
+DUP_FN (16, int64_t);
+
+/*
+** cons2_1_int64_t:
+**	stp	x1, x2, \[x0\]
+**	ret
+*/
+CONS2_FN (1, int64_t);
+
+/*
+** cons2_2_int64_t:
+**	stp	x1, x2, \[x0\]
+**	stp	x1, x2, \[x0, #?16\]
+**	ret
+*/
+CONS2_FN (2, int64_t);
+
+/*
+** cons2_4_int64_t:
+**	stp	x1, x2, \[x0\]
+**	stp	x1, x2, \[x0, #?16\]
+**	stp	x1, x2, \[x0, #?32\]
+**	stp	x1, x2, \[x0, #?48\]
+**	ret
+*/
+CONS2_FN (4, int64_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONS2_FN (8, int64_t);
+
+/*
+** cons2_16_int64_t:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS2_FN (16, int64_t);
+
+/*
+** cons4_1_int64_t:
+**	stp	x1, x2, \[x0\]
+**	stp	x3, x4, \[x0, #?16\]
+**	ret
+*/
+CONS4_FN (1, int64_t);
+
+/*
+** cons4_2_int64_t:
+**	stp	x1, x2, \[x0\]
+**	stp	x3, x4, \[x0, #?16\]
+**	stp	x1, x2, \[x0, #?32\]
+**	stp	x3, x4, \[x0, #?48\]
+**	ret
+*/
+CONS4_FN (2, int64_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONS4_FN (4, int64_t);
+
+/* We should probably vectorize this, but currently don't.  */
+CONS4_FN (8, int64_t);
+
+/*
+** cons4_16_int64_t:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS4_FN (16, int64_t);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
new file mode 100644
index 00000000000..eaa855c3859
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
@@ -0,0 +1,123 @@ 
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_double_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, double, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (4, double, 0);
+
+/*
+** const_8_double_0:
+**	movi	v([0-9]+)\.2d, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (8, double, 0);
+
+/*
+** dup_2_double:
+**	stp	d0, d0, \[x0\]
+**	ret
+*/
+DUP_FN (2, double);
+
+/*
+** dup_4_double:
+**	stp	d0, d0, \[x0\]
+**	stp	d0, d0, \[x0, #?16\]
+**	ret
+*/
+DUP_FN (4, double);
+
+/*
+** dup_8_double:
+**	dup	v([0-9])\.2d, v0\.d\[0\]
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+DUP_FN (8, double);
+
+/*
+** dup_16_double:
+**	dup	v([0-9])\.2d, v0\.d\[0\]
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	stp	q\1, q\1, \[x0, #?64\]
+**	stp	q\1, q\1, \[x0, #?96\]
+**	ret
+*/
+DUP_FN (16, double);
+
+/*
+** cons2_1_double:
+**	stp	d0, d1, \[x0\]
+**	ret
+*/
+CONS2_FN (1, double);
+
+/*
+** cons2_2_double:
+**	stp	d0, d1, \[x0\]
+**	stp	d0, d1, \[x0, #?16\]
+**	ret
+*/
+CONS2_FN (2, double);
+
+/*
+** cons2_4_double:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS2_FN (4, double);
+
+/*
+** cons2_8_double:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS2_FN (8, double);
+
+/*
+** cons4_1_double:
+**	stp	d0, d1, \[x0\]
+**	stp	d2, d3, \[x0, #?16\]
+**	ret
+*/
+CONS4_FN (1, double);
+
+/*
+** cons4_2_double:
+**	stp	d0, d1, \[x0\]
+**	stp	d2, d3, \[x0, #?16\]
+**	stp	d0, d1, \[x0, #?32\]
+**	stp	d2, d3, \[x0, #?48\]
+**	ret
+*/
+CONS4_FN (2, double);
+
+/*
+** cons4_4_double:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS4_FN (4, double);
+
+/*
+** cons4_8_double:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS4_FN (8, double);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c
new file mode 100644
index 00000000000..9eb41636477
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c
@@ -0,0 +1,6 @@ 
+/* { dg-options "-O2 -mstrict-align" } */
+
+#include "ldp_stp_5.c"
+
+/* { dg-final { scan-assembler-times {stp\tq[0-9]+, q[0-9]} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {str\tq[0-9]+} 1 { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c
index 94266181df7..56d1d3cc555 100644
--- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c
@@ -1,4 +1,4 @@ 
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mstrict-align" } */
 
 double arr[4][4];