From patchwork Wed Jan 26 02:28:08 2022
X-Patchwork-Id: 50446
Subject: [PATCH v3] [AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers
Date: Tue, 25 Jan 2022 18:28:08 -0800
Message-ID: <1643164088-18048-1-git-send-email-apinski@marvell.com>
Reply-To: apinski@marvell.com
Cc: Andrew Pinski
From: Andrew Pinski

The problem here is that aarch64_expand_setmem does not take the buffer's
alignment into account in the strict-alignment case.

This is version 3 of this patch.  It is based on version 2, but it moves
the check on the number of instructions out of the optimize-for-size case
so that it is always done, and changes the libcall cost for the !size case
to max_set_size/16 + 1 (or 17), which is the same as before when only
max_set_size was handled.  The main change is the handling of the
strict-alignment case, where we now inline at most 17 instructions; past
that point a call to memset is faster and can handle the dynamic alignment
instead of just the static alignment.  The +1 is there to account for
setting up the SIMD duplicate.
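To make the bound concrete, here is a rough sketch of the arithmetic
(assuming max_set_size is 256, which is what the "or 17" above implies;
this is not the GCC code itself):

  /* !size_p libcall bound used above (sketch only).  */
  unsigned max_set_size = 256;                    /* non-MOPS inline limit */
  unsigned libcall_cost = max_set_size / 16 + 1;  /* 256/16 + 1 == 17 */
  /* The largest inline expansion is 16 16-byte stores plus the vector dup,
     i.e. 17 ops, so the inline sequence is kept only while
     inlined_ops <= libcall_cost and anything longer becomes a call to
     memset.  */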
OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.

	PR target/103100

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_expand_setmem): Constrain
	copy_limit to the alignment of the mem if STRICT_ALIGNMENT is true.
	Also constrain the number of instructions for the !size case to
	max_set_size/16 + 1.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/memset-strict-align-1.c: Update test.
	Reduce the size down to 207 and make s1 global and aligned
	to 16 bytes.
	* gcc.target/aarch64/memset-strict-align-2.c: New test.
---
 gcc/config/aarch64/aarch64.cc                 | 55 ++++++++++---------
 .../aarch64/memset-strict-align-1.c           | 20 +++----
 .../aarch64/memset-strict-align-2.c           | 14 +++++
 3 files changed, 53 insertions(+), 36 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 296145e6008..02ecb2154ea 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23831,8 +23831,11 @@ aarch64_expand_setmem (rtx *operands)
      (zero constants can use XZR directly).  */
   unsigned mops_cost = 3 + 1 + cst_val;
   /* A libcall to memset in the worst case takes 3 instructions to prepare
-     the arguments + 1 for the call.  */
-  unsigned libcall_cost = 4;
+     the arguments + 1 for the call.
+     In the case of not optimizing for size the cost of doing a libcall
+     is the max_set_size / 16 + 1 or 17 instructions.  The one instruction
+     is for the vector dup which may or may not be used.  */
+  unsigned libcall_cost = size_p ? 4 : (max_set_size / 16 + 1);
 
   /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
      when available.  */
@@ -23842,12 +23845,12 @@ aarch64_expand_setmem (rtx *operands)
 
   /* Attempt a sequence with a vector broadcast followed by stores.
      Count the number of operations involved to see if it's worth it
-     against the alternatives.  A simple counter simd_ops on the
+     against the alternatives.  A simple counter inlined_ops on the
      algorithmically-relevant operations is used rather than an rtx_insn count
      as all the pointer adjusmtents and mode reinterprets will be optimized
      away later.  */
   start_sequence ();
-  unsigned simd_ops = 0;
+  unsigned inlined_ops = 0;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -23855,15 +23858,22 @@ aarch64_expand_setmem (rtx *operands)
   /* Prepare the val using a DUP/MOVI v0.16B, val.  */
   src = expand_vector_broadcast (V16QImode, val);
   src = force_reg (V16QImode, src);
-  simd_ops++;
+  inlined_ops++;
 
   /* Convert len to bits to make the rest of the code simpler.  */
   n = len * BITS_PER_UNIT;
 
   /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
-  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
-			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
-			  ? GET_MODE_BITSIZE (TImode) : 256;
+  int copy_limit;
+
+  if (aarch64_tune_params.extra_tuning_flags
+      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
+    copy_limit = GET_MODE_BITSIZE (TImode);
+  else
+    copy_limit = 256;
+
+  if (STRICT_ALIGNMENT)
+    copy_limit = MIN (copy_limit, (int)MEM_ALIGN (dst));
 
   while (n > 0)
     {
@@ -23878,7 +23888,7 @@ aarch64_expand_setmem (rtx *operands)
 
       mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
       aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
-      simd_ops++;
+      inlined_ops++;
       n -= mode_bits;
 
       /* Do certain trailing copies as overlapping if it's going to be
@@ -23897,24 +23907,17 @@ aarch64_expand_setmem (rtx *operands)
   rtx_insn *seq = get_insns ();
   end_sequence ();
 
-  if (size_p)
-    {
-      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
-	 call to memset or the MOPS expansion.  */
-      if (TARGET_MOPS
-	  && mops_cost <= libcall_cost
-	  && mops_cost <= simd_ops)
-	return aarch64_expand_setmem_mops (operands);
-      /* If MOPS is not available or not shorter pick a libcall if the SIMD
-	 sequence is too long.  */
-      else if (libcall_cost < simd_ops)
-	return false;
-      emit_insn (seq);
-      return true;
-    }
+  /* When optimizing for size we have 3 options: the inlined sequence,
+     call to memset or the MOPS expansion.  */
+  if (size_p
+      && TARGET_MOPS
+      && mops_cost <= libcall_cost
+      && mops_cost <= inlined_ops)
+    return aarch64_expand_setmem_mops (operands);
+  /* Pick a libcall if the inlined sequence is too long.  */
+  else if (libcall_cost < inlined_ops)
+    return false;
 
-  /* At this point the SIMD broadcast sequence is the best choice when
-     optimizing for speed.  */
   emit_insn (seq);
   return true;
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
index 664d43aee13..5a7acbf2fa9 100644
--- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
@@ -1,28 +1,28 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mstrict-align" } */
 
-struct s { char x[255]; };
+struct s { char x[207]; };
+struct s s1 __attribute__((aligned(16)));
 void foo (struct s *);
-void bar (void) { struct s s1 = {}; foo (&s1); }
+void bar (void) { s1 = (struct s){}; foo (&s1); }
 
-/* memset (s1 = {}, sizeof = 255) should be expanded out
+/* memset (s1 = {}, sizeof = 207) should be expanded out
    such that there are no overlap stores when -mstrict-align
    is in use.
-   so 7 pairs of 16 bytes stores (224 bytes).
-   1 16 byte stores
+   so 5 pairs of 16 bytes stores (160 bytes).
+   2 16 byte stores (FIXME: should be 0)
    1 8 byte store
    1 4 byte store
    1 2 byte store
    1 1 byte store
  */
 
-/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
-/* { dg-final { scan-assembler-times "str\tq" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tq" 5 } } */
+/* { dg-final { scan-assembler-times "str\tq" 2 } } */
 /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
 /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
 
-/* Also one store pair for the frame-pointer and the LR. */
-/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
-
+/* No store pair with 8 byte words is needed as foo is called with a sibcall.  */
+/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
new file mode 100644
index 00000000000..67d994fe39e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstrict-align" } */
+
+struct s { char x[15]; };
+void foo (struct s *);
+void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
+
+/* memset (s1 = {}, sizeof = 15) should be expanded out
+   such that there are no overlap stores when -mstrict-align
+   is in use.  As the alignment of s1 is unknown, byte stores are needed.
+   so 15 byte stores
+ */
+
+/* { dg-final { scan-assembler-times "strb\twzr" 15 } } */
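For reference, the expected stores in the two tests add up like this (my
arithmetic based on the comments above, not part of the patch):

  /* memset-strict-align-1.c: 207 bytes, s1 aligned to 16 bytes:
     5*32 (stp q) + 2*16 (str q) + 8 + 4 + 2 + 1 == 207 bytes.
     memset-strict-align-2.c: 15 bytes, alignment unknown, so under
     -mstrict-align only byte stores can be used; 15 strb stores plus
     the dup is 16 ops, which is still under the 17-op libcall bound,
     so it stays inline.  */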