From patchwork Mon Feb 14 15:35:03 2022
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 51103
From: Richard Sandiford <richard.sandiford@arm.com>
To: gcc-patches@gcc.gnu.org
Cc: rguenther@suse.de
Subject: [PATCH] vect+aarch64: Fix ldp_stp_* regressions
Date: Mon, 14 Feb 2022 15:35:03 +0000

ldp_stp_1.c, ldp_stp_4.c and ldp_stp_5.c have been failing since
vectorisation was enabled at -O2.  In all three cases SLP is
generating vector code when scalar code would be better.

The problem is that the target costs do not model whether STP could
be used for the scalar or vector code, so the normal latency-based
costs for store-heavy code can be way off.  It would be good to fix
that “properly” at some point, but it isn't easy; see the existing
discussion in aarch64_sve_adjust_stmt_cost for more details.

This patch therefore adds an on-the-side check for whether the code
is doing nothing more than set-up+stores.  It then applies STP-based
costs to those cases only, in addition to the normal latency-based
costs.  (That is, the vector code has to win on both counts rather
than on one count individually.)
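To make the failure mode concrete, here is the kind of store-only
kernel involved (a hypothetical cut-down example in the style of the
new tests; the function name is illustrative.  It matches the
a[0] = x; a[1] = x; case described in the m_stp_sequence_cost
comment in the patch below):

    #include <stdint.h>

    /* Scalar code covers both stores with a single "stp x1, x1, [x0]".
       The vector form needs a GPR->SIMD dup followed by "str q0, [x0]",
       so it should only be chosen if it wins under both the normal
       latency-based costs and the new STP-based costs.  */
    void
    store_pair (int64_t *a, int64_t x)
    {
      a[0] = x;
      a[1] = x;
    }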
However, at the moment, SLP costs one vector set-up instruction for
every vector in an SLP node, even if the contents are the same as a
previous vector in the same node.  Fixing the STP costs without
fixing that would regress other cases, tested in the patch.

The patch therefore makes the SLP costing code check for duplicates
within a node (a concrete example follows the ChangeLog below).
Ideally we'd check for duplicates more globally, but that would
require a more global approach to costs: the cost of an
initialisation should be amortised across all trees that use the
initialisation, rather than fully counted against one arbitrarily
chosen subtree.

Back on aarch64: an earlier version of the patch tried to apply the
new heuristic to constant stores.  However, that didn't work too
well in practice; see the comments for details.  The patch therefore
just tests the status quo for constant cases, leaving out a match if
the current choice is dubious.

ldp_stp_5.c was affected by the same constant-store problem.  The
test would be worth vectorising if we generated better vector code,
but:

(1) We do a bad job of moving the { -1, 1 } constant, given that
    we have { -1, -1 } and { 1, 1 } to hand.

(2) The vector code has 6 pairable stores to misaligned offsets.
    We have peephole patterns to handle such misalignment for
    4 pairable stores, but not 6.

So the SLP decision isn't wrong as such.  It's just being let
down by later codegen.

The patch therefore adds -mstrict-align to preserve the original
intention of the test, while adding ldp_stp_19.c to check for the
preferred vector code (XFAILed for now).

Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
OK for the vectoriser bits?

Thanks,
Richard


gcc/
	* tree-vectorizer.h (vect_scalar_ops_slice): New struct.
	(vect_scalar_ops_slice_hash): Likewise.
	(vect_scalar_ops_slice::op): New function.
	* tree-vect-slp.cc (vect_scalar_ops_slice::all_same_p): New function.
	(vect_scalar_ops_slice_hash::hash): Likewise.
	(vect_scalar_ops_slice_hash::equal): Likewise.
	(vect_prologue_cost_for_slp): Check for duplicate vectors.
	* config/aarch64/aarch64.cc
	(aarch64_vector_costs::m_stp_sequence_cost): New member variable.
	(aarch64_aligned_constant_offset_p): New function.
	(aarch64_stp_sequence_cost): Likewise.
	(aarch64_vector_costs::add_stmt_cost): Handle new STP heuristic.
	(aarch64_vector_costs::finish_cost): Likewise.

gcc/testsuite/
	* gcc.target/aarch64/ldp_stp_5.c: Require -mstrict-align.
	* gcc.target/aarch64/ldp_stp_14.h,
	* gcc.target/aarch64/ldp_stp_14.c: New test.
	* gcc.target/aarch64/ldp_stp_15.c: Likewise.
	* gcc.target/aarch64/ldp_stp_16.c: Likewise.
	* gcc.target/aarch64/ldp_stp_17.c: Likewise.
	* gcc.target/aarch64/ldp_stp_18.c: Likewise.
	* gcc.target/aarch64/ldp_stp_19.c: Likewise.
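As a concrete example of the duplicate-vector case mentioned above,
here is a hand-unrolled version of CONS2_FN (4, int32_t) from the new
ldp_stp_15.c (illustrative only; the test itself generates the
function via macros):

    #include <stdint.h>

    /* With V4SI vectors, SLP covers the eight stores using two vector
       constructions, both of which are { val0, val1, val0, val1 }.
       Costing both constructions would make the vector code look more
       expensive than it really is; with the duplicate check, only the
       first construction is counted.  */
    void
    cons2_4_int32_t (int32_t *x, int32_t val0, int32_t val1)
    {
      x[0] = val0; x[1] = val1;
      x[2] = val0; x[3] = val1;
      x[4] = val0; x[5] = val1;
      x[6] = val0; x[7] = val1;
    }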
---
 gcc/config/aarch64/aarch64.cc                 | 140 ++++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c |  89 +++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h |  50 +++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c | 137 +++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c | 133 +++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c | 120 +++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c | 123 +++++++++++++++
 gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c |   6 +
 gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c  |   2 +-
 gcc/tree-vect-slp.cc                          |  75 ++++++----
 gcc/tree-vectorizer.h                         |  35 +++++
 11 files changed, 884 insertions(+), 26 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index ec479d3055d..ddd0637185c 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -113,6 +113,41 @@ typedef hash_map<tree_operand_hash,
 		 std::pair<stmt_vec_info, innermost_loop_behavior *> >
 	  vec_base_alignments;
 
+/* Represents elements [START, START + LENGTH) of cyclical array OPS*
+   (i.e. OPS repeated to give at least START + LENGTH elements)  */
+struct vect_scalar_ops_slice
+{
+  tree op (unsigned int i) const;
+  bool all_same_p () const;
+
+  vec<tree> *ops;
+  unsigned int start;
+  unsigned int length;
+};
+
+/* Return element I of the slice.  */
+inline tree
+vect_scalar_ops_slice::op (unsigned int i) const
+{
+  return (*ops)[(i + start) % ops->length ()];
+}
+
+/* Hash traits for vect_scalar_ops_slice.  */
+struct vect_scalar_ops_slice_hash : typed_noop_remove<vect_scalar_ops_slice>
+{
+  typedef vect_scalar_ops_slice value_type;
+  typedef vect_scalar_ops_slice compare_type;
+
+  static const bool empty_zero_p = true;
+
+  static void mark_deleted (value_type &s) { s.length = ~0U; }
+  static void mark_empty (value_type &s) { s.length = 0; }
+  static bool is_deleted (const value_type &s) { return s.length == ~0U; }
+  static bool is_empty (const value_type &s) { return s.length == 0; }
+  static hashval_t hash (const value_type &);
+  static bool equal (const value_type &, const compare_type &);
+};
+
 /************************************************************************
   SLP
  ************************************************************************/
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 273543d37ea..c6b5a0696a2 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -4533,6 +4533,37 @@ vect_slp_convert_to_external (vec_info *vinfo, slp_tree node,
   return true;
 }
 
+/* Return true if all elements of the slice are the same.  */
+bool
+vect_scalar_ops_slice::all_same_p () const
+{
+  for (unsigned int i = 1; i < length; ++i)
+    if (!operand_equal_p (op (0), op (i)))
+      return false;
+  return true;
+}
+
+hashval_t
+vect_scalar_ops_slice_hash::hash (const value_type &s)
+{
+  hashval_t hash = 0;
+  for (unsigned i = 0; i < s.length; ++i)
+    hash = iterative_hash_expr (s.op (i), hash);
+  return hash;
+}
+
+bool
+vect_scalar_ops_slice_hash::equal (const value_type &s1,
+				   const compare_type &s2)
+{
+  if (s1.length != s2.length)
+    return false;
+  for (unsigned i = 0; i < s1.length; ++i)
+    if (!operand_equal_p (s1.op (i), s2.op (i)))
+      return false;
+  return true;
+}
+
 /* Compute the prologue cost for invariant or constant operands
    represented by NODE.  */
 
@@ -4549,45 +4580,39 @@ vect_prologue_cost_for_slp (slp_tree node,
      When all elements are the same we can use a splat.  */
   tree vectype = SLP_TREE_VECTYPE (node);
   unsigned group_size = SLP_TREE_SCALAR_OPS (node).length ();
-  unsigned num_vects_to_check;
   unsigned HOST_WIDE_INT const_nunits;
   unsigned nelt_limit;
+  auto ops = &SLP_TREE_SCALAR_OPS (node);
+  auto_vec<unsigned int> starts (SLP_TREE_NUMBER_OF_VEC_STMTS (node));
   if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
       && ! multiple_p (const_nunits, group_size))
     {
-      num_vects_to_check = SLP_TREE_NUMBER_OF_VEC_STMTS (node);
       nelt_limit = const_nunits;
+      hash_set<vect_scalar_ops_slice_hash> vector_ops;
+      for (unsigned int i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i)
+	if (!vector_ops.add ({ ops, i * const_nunits, const_nunits }))
+	  starts.quick_push (i * const_nunits);
     }
   else
     {
      /* If either the vector has variable length or the vectors
	 are composed of repeated whole groups we only need to cost
	 construction once.  All vectors will be the same.  */
-      num_vects_to_check = 1;
      nelt_limit = group_size;
+      starts.quick_push (0);
    }
-  tree elt = NULL_TREE;
-  unsigned nelt = 0;
-  for (unsigned j = 0; j < num_vects_to_check * nelt_limit; ++j)
-    {
-      unsigned si = j % group_size;
-      if (nelt == 0)
-	elt = SLP_TREE_SCALAR_OPS (node)[si];
-      /* ??? We're just tracking whether all operands of a single
-	 vector initializer are the same, ideally we'd check if
-	 we emitted the same one already.  */
-      else if (elt != SLP_TREE_SCALAR_OPS (node)[si])
-	elt = NULL_TREE;
-      nelt++;
-      if (nelt == nelt_limit)
-	{
-	  record_stmt_cost (cost_vec, 1,
-			    SLP_TREE_DEF_TYPE (node) == vect_external_def
-			    ? (elt ? scalar_to_vec : vec_construct)
-			    : vector_load,
-			    NULL, vectype, 0, vect_prologue);
-	  nelt = 0;
-	}
+  /* ??? We're just tracking whether vectors in a single node are the same.
+     Ideally we'd do something more global.  */
+  for (unsigned int start : starts)
+    {
+      vect_cost_for_stmt kind;
+      if (SLP_TREE_DEF_TYPE (node) == vect_constant_def)
+	kind = vector_load;
+      else if (vect_scalar_ops_slice { ops, start, nelt_limit }.all_same_p ())
+	kind = scalar_to_vec;
+      else
+	kind = vec_construct;
+      record_stmt_cost (cost_vec, 1, kind, NULL, vectype, 0, vect_prologue);
     }
 }
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 7bb97bd48e4..4cf17526e14 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -14932,6 +14932,31 @@ private:
      - If M_VEC_FLAGS & VEC_ANY_SVE is nonzero then we're costing SVE code.  */
   unsigned int m_vec_flags = 0;
 
+  /* At the moment, we do not model LDP and STP in the vector and scalar costs.
+     This means that code such as:
+
+	a[0] = x;
+	a[1] = x;
+
+     will be costed as two scalar instructions and two vector instructions
+     (a scalar_to_vec and an unaligned_store).
+     For SLP, the vector form wins if the costs are equal, because of
+     the fact that the vector costs include constant initializations
+     whereas the scalar costs don't.  We would therefore tend to
+     vectorize the code above, even though the scalar version can use
+     a single STP.
+
+     We should eventually fix this and model LDP and STP in the main
+     costs; see the comment in aarch64_sve_adjust_stmt_cost for some of
+     the problems.  Until then, we look specifically for code that does
+     nothing more than STP-like operations.  We cost them on that basis
+     in addition to the normal latency-based costs.
+
+     If the scalar or vector code could be a sequence of STPs +
+     initialization, this variable counts the cost of the sequence,
+     with 2 units per instruction.  The variable is ~0U for other
+     kinds of code.  */
+  unsigned int m_stp_sequence_cost = 0;
+
   /* On some CPUs, SVE and Advanced SIMD provide the same theoretical vector
      throughput, such as 4x128 Advanced SIMD vs. 2x256 SVE.  In those
      situations, we try to predict whether an Advanced SIMD implementation
@@ -15724,6 +15749,104 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
     }
 }
 
+/* Return true if STMT_INFO contains a memory access and if the constant
+   component of the memory address is aligned to SIZE bytes.  */
+static bool
+aarch64_aligned_constant_offset_p (stmt_vec_info stmt_info,
+				   poly_uint64 size)
+{
+  if (!STMT_VINFO_DATA_REF (stmt_info))
+    return false;
+
+  if (auto first_stmt = DR_GROUP_FIRST_ELEMENT (stmt_info))
+    stmt_info = first_stmt;
+  tree constant_offset = DR_INIT (STMT_VINFO_DATA_REF (stmt_info));
+  /* Needed for gathers & scatters, for example.  */
+  if (!constant_offset)
+    return false;
+
+  return multiple_p (wi::to_poly_offset (constant_offset), size);
+}
+
+/* Check if a scalar or vector stmt could be part of a region of code
+   that does nothing more than store values to memory, in the scalar
+   case using STP.  Return the cost of the stmt if so, counting 2 for
+   one instruction.  Return ~0U otherwise.
+
+   The arguments are a subset of those passed to add_stmt_cost.  */
+unsigned int
+aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
+			   stmt_vec_info stmt_info, tree vectype)
+{
+  /* Code that stores vector constants uses a vector_load to create
+     the constant.  We don't apply the heuristic to that case for two
+     main reasons:
+
+     - At the moment, STPs are only formed via peephole2, and the
+       constant scalar moves would often come between STRs and so
+       prevent STP formation.
+
+     - The scalar code also has to load the constant somehow, and that
+       isn't costed.  */
+  switch (kind)
+    {
+    case scalar_to_vec:
+      /* Count 2 insns for a GPR->SIMD dup and 1 insn for a FPR->SIMD dup.  */
+      return (FLOAT_TYPE_P (vectype) ? 2 : 4) * count;
+
+    case vec_construct:
+      if (FLOAT_TYPE_P (vectype))
+	/* Count 1 insn for the maximum number of FP->SIMD INS
+	   instructions.  */
+	return (vect_nunits_for_cost (vectype) - 1) * 2 * count;
+
+      /* Count 2 insns for a GPR->SIMD move and 2 insns for the
+	 maximum number of GPR->SIMD INS instructions.  */
+      return vect_nunits_for_cost (vectype) * 4 * count;
+
+    case vector_store:
+    case unaligned_store:
+      /* Count 1 insn per vector if we can't form STP Q pairs.  */
+      if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
+	return count * 2;
+      if (aarch64_tune_params.extra_tuning_flags
+	  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
+	return count * 2;
+
+      if (stmt_info)
+	{
+	  /* Assume we won't be able to use STP if the constant offset
+	     component of the address is misaligned.  ???
+	     This could be removed if we formed STP pairs earlier,
+	     rather than relying on peephole2.  */
+	  auto size = GET_MODE_SIZE (TYPE_MODE (vectype));
+	  if (!aarch64_aligned_constant_offset_p (stmt_info, size))
+	    return count * 2;
+	}
+      return CEIL (count, 2) * 2;
+
+    case scalar_store:
+      if (stmt_info && STMT_VINFO_DATA_REF (stmt_info))
+	{
+	  /* Check for a mode in which STP pairs can be formed.  */
+	  auto size = GET_MODE_SIZE (TYPE_MODE (aarch64_dr_type (stmt_info)));
+	  if (maybe_ne (size, 4) && maybe_ne (size, 8))
+	    return ~0U;
+
+	  /* Assume we won't be able to use STP if the constant offset
+	     component of the address is misaligned.  ??? This could be
+	     removed if we formed STP pairs earlier, rather than relying
+	     on peephole2.  */
+	  if (!aarch64_aligned_constant_offset_p (stmt_info, size))
+	    return ~0U;
+	}
+      return count;
+
+    default:
+      return ~0U;
+    }
+}
+
 unsigned
 aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
				     stmt_vec_info stmt_info, tree vectype,
@@ -15747,6 +15870,14 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
       m_analyzed_vinfo = true;
     }
 
+  /* Apply the heuristic described above m_stp_sequence_cost.  */
+  if (m_stp_sequence_cost != ~0U)
+    {
+      uint64_t cost = aarch64_stp_sequence_cost (count, kind,
+						 stmt_info, vectype);
+      m_stp_sequence_cost = MIN (m_stp_sequence_cost + cost, ~0U);
+    }
+
   /* Try to get a more accurate cost by looking at STMT_INFO instead
      of just looking at KIND.  */
   if (stmt_info && aarch64_use_new_vector_costs_p ())
@@ -16017,6 +16148,15 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
       m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
					     m_costs[vect_body]);
 
+  /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
+     the scalar code in the event of a tie, since there is more chance
+     of scalar code being optimized with surrounding operations.  */
+  if (!loop_vinfo
+      && scalar_costs
+      && m_stp_sequence_cost != ~0U
+      && m_stp_sequence_cost >= scalar_costs->m_stp_sequence_cost)
+    m_costs[vect_body] = 2 * scalar_costs->total_cost ();
+
   vector_costs::finish_cost (scalar_costs);
 }
 
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c
new file mode 100644
index 00000000000..c7b5f7d6b39
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.c
@@ -0,0 +1,89 @@
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_int16_t_0:
+**	str	wzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, int16_t, 0);
+
+/*
+** const_4_int16_t_0:
+**	str	xzr, \[x0\]
+**	ret
+*/
+CONST_FN (4, int16_t, 0);
+
+/*
+** const_8_int16_t_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (8, int16_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (16, int16_t, 0);
+
+/*
+** const_32_int16_t_0:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (32, int16_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (2, int16_t, 1);
+
+/*
+** const_4_int16_t_1:
+**	movi	v([0-9]+)\.4h, .*
+**	str	d\1, \[x0\]
+**	ret
+*/
+CONST_FN (4, int16_t, 1);
+
+/*
+** const_8_int16_t_1:
+**	movi	v([0-9]+)\.8h, .*
+**	str	q\1, \[x0\]
+**	ret
+*/
+CONST_FN (8, int16_t, 1);
+
+/* Fuzzy match due to PR104387.  */
+/*
+** dup_2_int16_t:
+**	...
+**	strh	w1, \[x0, #?2\]
+**	ret
+*/
+DUP_FN (2, int16_t);
+
+/*
+** dup_4_int16_t:
+**	dup	v([0-9]+)\.4h, w1
+**	str	d\1, \[x0\]
+**	ret
+*/
+DUP_FN (4, int16_t);
+
+/*
+** dup_8_int16_t:
+**	dup	v([0-9]+)\.8h, w1
+**	str	q\1, \[x0\]
+**	ret
+*/
+DUP_FN (8, int16_t);
+
+/*
+** cons2_1_int16_t:
+**	strh	w1, \[x0\]
+**	strh	w2, \[x0, #?2\]
+**	ret
+*/
+CONS2_FN (1, int16_t);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h b/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h
new file mode 100644
index 00000000000..39c463ff240
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_14.h
@@ -0,0 +1,50 @@
+#include <stdint.h>
+
+#define PRAGMA(X) _Pragma (#X)
+#define UNROLL(COUNT) PRAGMA (GCC unroll (COUNT))
+
+#define CONST_FN(COUNT, TYPE, VAL)			\
+  void							\
+  const_##COUNT##_##TYPE##_##VAL (TYPE *x)		\
+  {							\
+    UNROLL (COUNT)					\
+    for (int i = 0; i < COUNT; ++i)			\
+      x[i] = VAL;					\
+  }
+
+#define DUP_FN(COUNT, TYPE)				\
+  void							\
+  dup_##COUNT##_##TYPE (TYPE *x, TYPE val)		\
+  {							\
+    UNROLL (COUNT)					\
+    for (int i = 0; i < COUNT; ++i)			\
+      x[i] = val;					\
+  }
+
+#define CONS2_FN(COUNT, TYPE)				\
+  void							\
+  cons2_##COUNT##_##TYPE (TYPE *x, TYPE val0, TYPE val1)	\
+  {							\
+    UNROLL (COUNT)					\
+    for (int i = 0; i < COUNT * 2; i += 2)		\
+      {							\
+	x[i + 0] = val0;				\
+	x[i + 1] = val1;				\
+      }							\
+  }
+
+#define CONS4_FN(COUNT, TYPE)				\
+  void							\
+  cons4_##COUNT##_##TYPE (TYPE *x, TYPE val0, TYPE val1,	\
+			  TYPE val2, TYPE val3)		\
+  {							\
+    UNROLL (COUNT)					\
+    for (int i = 0; i < COUNT * 4; i += 4)		\
+      {							\
+	x[i + 0] = val0;				\
+	x[i + 1] = val1;				\
+	x[i + 2] = val2;				\
+	x[i + 3] = val3;				\
+      }							\
+  }
+
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
new file mode 100644
index 00000000000..131cd0a63c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_15.c
@@ -0,0 +1,137 @@
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_int32_t_0:
+**	str	xzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, int32_t, 0);
+
+/*
+** const_4_int32_t_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (4, int32_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (8, int32_t, 0);
+
+/*
+** const_16_int32_t_0:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (16, int32_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (2, int32_t, 1);
+
+/*
+** const_4_int32_t_1:
+**	movi	v([0-9]+)\.4s, .*
+**	str	q\1, \[x0\]
+**	ret
+*/
+CONST_FN (4, int32_t, 1);
+
+/*
+** const_8_int32_t_1:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	ret
+*/
+CONST_FN (8, int32_t, 1);
+
+/*
+** dup_2_int32_t:
+**	stp	w1, w1, \[x0\]
+**	ret
+*/
+DUP_FN (2, int32_t);
+
+/*
+** dup_4_int32_t:
+**	stp	w1, w1, \[x0\]
+**	stp	w1, w1, \[x0, #?8\]
+**	ret
+*/
+DUP_FN (4, int32_t);
+
+/*
+** dup_8_int32_t:
+**	dup	v([0-9]+)\.4s, w1
+**	stp	q\1, q\1, \[x0\]
+**	ret
+*/
+DUP_FN (8, int32_t);
+
+/*
+** cons2_1_int32_t:
+**	stp	w1, w2, \[x0\]
+**	ret
+*/
+CONS2_FN (1, int32_t);
+
+/*
+** cons2_2_int32_t:
+**	stp	w1, w2, \[x0\]
+**	stp	w1, w2, \[x0, #?8\]
+**	ret
+*/
+CONS2_FN (2, int32_t);
+
+/*
+** cons2_4_int32_t:
+**	stp	w1, w2, \[x0\]
+**	stp	w1, w2, \[x0, #?8\]
+**	stp	w1, w2, \[x0, #?16\]
+**	stp	w1, w2, \[x0, #?24\]
+**	ret
+*/
+CONS2_FN (4, int32_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONS2_FN (8, int32_t);
+
+/*
+** cons2_16_int32_t:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS2_FN (16, int32_t);
+
+/*
+** cons4_1_int32_t:
+**	stp	w1, w2, \[x0\]
+**	stp	w3, w4, \[x0, #?8\]
+**	ret
+*/
+CONS4_FN (1, int32_t);
+
+/*
+** cons4_2_int32_t:
+**	stp	w1, w2, \[x0\]
+**	stp	w3, w4, \[x0, #?8\]
+**	stp	w1, w2, \[x0, #?16\]
+**	stp	w3, w4, \[x0, #?24\]
+**	ret
+*/
+CONS4_FN (2, int32_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONS4_FN (4, int32_t);
+
+/*
+** cons4_8_int32_t:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS4_FN (8, int32_t);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
new file mode 100644
index 00000000000..8ab117c4dcd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
@@ -0,0 +1,133 @@
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_float_0:
+**	str	xzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, float, 0);
+
+/*
+** const_4_float_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (4, float, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (8, float, 0);
+
+/*
+** const_16_float_0:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (16, float, 0);
+
+/*
+** const_2_float_1:
+**	fmov	v([0-9]+)\.2s, .*
+**	str	d\1, \[x0\]
+**	ret
+*/
+CONST_FN (2, float, 1);
+
+/*
+** const_4_float_1:
+**	fmov	v([0-9]+)\.4s, .*
+**	str	q\1, \[x0\]
+**	ret
+*/
+CONST_FN (4, float, 1);
+
+/*
+** dup_2_float:
+**	stp	s0, s0, \[x0\]
+**	ret
+*/
+DUP_FN (2, float);
+
+/* No preference between vectorizing or not vectorizing here.  */
+DUP_FN (4, float);
+
+/*
+** dup_8_float:
+**	dup	v([0-9]+)\.4s, v0.s\[0\]
+**	stp	q\1, q\1, \[x0\]
+**	ret
+*/
+DUP_FN (8, float);
+
+/*
+** cons2_1_float:
+**	stp	s0, s1, \[x0\]
+**	ret
+*/
+CONS2_FN (1, float);
+
+/*
+** cons2_2_float:
+**	stp	s0, s1, \[x0\]
+**	stp	s0, s1, \[x0, #?8\]
+**	ret
+*/
+CONS2_FN (2, float);
+
+/*
+** cons2_4_float: { target aarch64_little_endian }
+**	ins	v0.s\[1\], v1.s\[0\]
+**	stp	d0, d0, \[x0\]
+**	stp	d0, d0, \[x0, #?16\]
+**	ret
+*/
+/*
+** cons2_4_float: { target aarch64_big_endian }
+**	ins	v1.s\[1\], v0.s\[0\]
+**	stp	d1, d1, \[x0\]
+**	stp	d1, d1, \[x0, #?16\]
+**	ret
+*/
+CONS2_FN (4, float);
+
+/*
+** cons2_8_float:
+**	dup	v([0-9]+)\.4s, .*
+**	...
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONS2_FN (8, float);
+
+/*
+** cons4_1_float:
+**	stp	s0, s1, \[x0\]
+**	stp	s2, s3, \[x0, #?8\]
+**	ret
+*/
+CONS4_FN (1, float);
+
+/*
+** cons4_2_float:
+**	stp	s0, s1, \[x0\]
+**	stp	s2, s3, \[x0, #?8\]
+**	stp	s0, s1, \[x0, #?16\]
+**	stp	s2, s3, \[x0, #?24\]
+**	ret
+*/
+CONS4_FN (2, float);
+
+/*
+** cons4_4_float:
+**	ins	v([0-9]+)\.s.*
+**	...
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONS4_FN (4, float);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
new file mode 100644
index 00000000000..c1122fc07d5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_17.c
@@ -0,0 +1,120 @@
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_int64_t_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, int64_t, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (4, int64_t, 0);
+
+/*
+** const_8_int64_t_0:
+**	movi	v([0-9]+)\.4s, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (8, int64_t, 0);
+
+/*
+** dup_2_int64_t:
+**	stp	x1, x1, \[x0\]
+**	ret
+*/
+DUP_FN (2, int64_t);
+
+/*
+** dup_4_int64_t:
+**	stp	x1, x1, \[x0\]
+**	stp	x1, x1, \[x0, #?16\]
+**	ret
+*/
+DUP_FN (4, int64_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+DUP_FN (8, int64_t);
+
+/*
+** dup_16_int64_t:
+**	dup	v([0-9])\.2d, x1
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	stp	q\1, q\1, \[x0, #?64\]
+**	stp	q\1, q\1, \[x0, #?96\]
+**	ret
+*/
+DUP_FN (16, int64_t);
+
+/*
+** cons2_1_int64_t:
+**	stp	x1, x2, \[x0\]
+**	ret
+*/
+CONS2_FN (1, int64_t);
+
+/*
+** cons2_2_int64_t:
+**	stp	x1, x2, \[x0\]
+**	stp	x1, x2, \[x0, #?16\]
+**	ret
+*/
+CONS2_FN (2, int64_t);
+
+/*
+** cons2_4_int64_t:
+**	stp	x1, x2, \[x0\]
+**	stp	x1, x2, \[x0, #?16\]
+**	stp	x1, x2, \[x0, #?32\]
+**	stp	x1, x2, \[x0, #?48\]
+**	ret
+*/
+CONS2_FN (4, int64_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONS2_FN (8, int64_t);
+
+/*
+** cons2_16_int64_t:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS2_FN (16, int64_t);
+
+/*
+** cons4_1_int64_t:
+**	stp	x1, x2, \[x0\]
+**	stp	x3, x4, \[x0, #?16\]
+**	ret
+*/
+CONS4_FN (1, int64_t);
+
+/*
+** cons4_2_int64_t:
+**	stp	x1, x2, \[x0\]
+**	stp	x3, x4, \[x0, #?16\]
+**	stp	x1, x2, \[x0, #?32\]
+**	stp	x3, x4, \[x0, #?48\]
+**	ret
+*/
+CONS4_FN (2, int64_t);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONS4_FN (4, int64_t);
+
+/* We should probably vectorize this, but currently don't.  */
+CONS4_FN (8, int64_t);
+
+/*
+** cons4_16_int64_t:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS4_FN (16, int64_t);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
new file mode 100644
index 00000000000..eaa855c3859
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_18.c
@@ -0,0 +1,123 @@
+/* { dg-options "-O2 -fno-tree-loop-distribute-patterns" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#include "ldp_stp_14.h"
+
+/*
+** const_2_double_0:
+**	stp	xzr, xzr, \[x0\]
+**	ret
+*/
+CONST_FN (2, double, 0);
+
+/* No preference between vectorizing or not vectorizing here.  */
+CONST_FN (4, double, 0);
+
+/*
+** const_8_double_0:
+**	movi	v([0-9]+)\.2d, .*
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+CONST_FN (8, double, 0);
+
+/*
+** dup_2_double:
+**	stp	d0, d0, \[x0\]
+**	ret
+*/
+DUP_FN (2, double);
+
+/*
+** dup_4_double:
+**	stp	d0, d0, \[x0\]
+**	stp	d0, d0, \[x0, #?16\]
+**	ret
+*/
+DUP_FN (4, double);
+
+/*
+** dup_8_double:
+**	dup	v([0-9])\.2d, v0\.d\[0\]
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	ret
+*/
+DUP_FN (8, double);
+
+/*
+** dup_16_double:
+**	dup	v([0-9])\.2d, v0\.d\[0\]
+**	stp	q\1, q\1, \[x0\]
+**	stp	q\1, q\1, \[x0, #?32\]
+**	stp	q\1, q\1, \[x0, #?64\]
+**	stp	q\1, q\1, \[x0, #?96\]
+**	ret
+*/
+DUP_FN (16, double);
+
+/*
+** cons2_1_double:
+**	stp	d0, d1, \[x0\]
+**	ret
+*/
+CONS2_FN (1, double);
+
+/*
+** cons2_2_double:
+**	stp	d0, d1, \[x0\]
+**	stp	d0, d1, \[x0, #?16\]
+**	ret
+*/
+CONS2_FN (2, double);
+
+/*
+** cons2_4_double:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS2_FN (4, double);
+
+/*
+** cons2_8_double:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS2_FN (8, double);
+
+/*
+** cons4_1_double:
+**	stp	d0, d1, \[x0\]
+**	stp	d2, d3, \[x0, #?16\]
+**	ret
+*/
+CONS4_FN (1, double);
+
+/*
+** cons4_2_double:
+**	stp	d0, d1, \[x0\]
+**	stp	d2, d3, \[x0, #?16\]
+**	stp	d0, d1, \[x0, #?32\]
+**	stp	d2, d3, \[x0, #?48\]
+**	ret
+*/
+CONS4_FN (2, double);
+
+/*
+** cons4_4_double:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS4_FN (4, double);
+
+/*
+** cons4_8_double:
+**	...
+**	stp	q[0-9]+, .*
+**	ret
+*/
+CONS4_FN (8, double);
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c
new file mode 100644
index 00000000000..9eb41636477
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_19.c
@@ -0,0 +1,6 @@
+/* { dg-options "-O2 -mstrict-align" } */
+
+#include "ldp_stp_5.c"
+
+/* { dg-final { scan-assembler-times {stp\tq[0-9]+, q[0-9]} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {str\tq[0-9]+} 1 { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c
index 94266181df7..56d1d3cc555 100644
--- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_5.c
@@ -1,4 +1,4 @@
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mstrict-align" } */
 
 double arr[4][4];