From patchwork Tue Dec 27 04:02:06 2022
X-Patchwork-Submitter: Alexandre Oliva
X-Patchwork-Id: 62424
To: gcc-patches@gcc.gnu.org
Subject: [RFC] Introduce -finline-memset-loops
Organization: Free thinker, does not speak for AdaCore
Date: Tue, 27 Dec 2022 01:02:06 -0300
From: Alexandre Oliva
Reply-To: Alexandre Oliva
List-Id: Gcc-patches mailing list

try_store_by_multiple_pieces was added not long ago, enabling
variable-sized memset calls to be expanded inline when the worst-case
in-range constant length would be, using conditional blocks with powers
of two to cover all combinations of length and alignment.

This patch extends the memset expansion to start with a loop, so as to
still take advantage of known alignment even with long lengths, but
without necessarily adding store blocks for every power of two.  This
makes it possible for any memset call to be expanded, even if it stores
only a single byte per iteration.  Efficient implementations of memset
can surely do better, with a pre-loop to increase alignment, but that
would likely be excessive for inline expansions of memset.

Still, in some cases users prefer to inline memset, even if it's not as
performant, or when it's known to be performant in ways the compiler
can't tell, to avoid depending on a C runtime library.
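To give a rough idea of the shape of the code this expansion produces, here is a hand-written C sketch: a loop covers blocks of the largest chosen power of two, and conditional stores cover the remaining smaller powers of two. This is an illustration only, not the patch's actual output; the 32-byte loop block and 8-byte store granularity are assumptions picked for the example.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of a looping inline expansion of memset (p, c, n) when the
   compiler knows n is a multiple of 8: instead of one conditional
   store block per power of two up to the maximum length, the largest
   block repeats in a loop, and only smaller powers of two get
   conditional stores.  */
static void *inline_memset_sketch (void *p, int c, size_t n)
{
  unsigned char *ptr = p;
  size_t rem = n;
  /* Splat the fill byte across an 8-byte word.  */
  uint64_t v = (unsigned char) c * 0x0101010101010101ULL;

  /* Loop block: the largest power of two repeats instead of growing.  */
  while (rem >= 32)
    {
      memcpy (ptr, &v, 8);
      memcpy (ptr + 8, &v, 8);
      memcpy (ptr + 16, &v, 8);
      memcpy (ptr + 24, &v, 8);
      ptr += 32;
      rem -= 32;
    }
  /* Conditional blocks for the remaining powers of two down to 8.  */
  if (rem & 16)
    {
      memcpy (ptr, &v, 8);
      memcpy (ptr + 8, &v, 8);
      ptr += 16;
    }
  if (rem & 8)
    memcpy (ptr, &v, 8);
  return p;
}
```

Note how the residue after the loop needs no further bookkeeping: once rem is below 32, its individual bits select the remaining blocks independently, which is why the patch only updates REM and PTR while a later store still needs them.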
With this flag, inline expansion of memset can be selectively requested,
globally or per function, while the underlying infrastructure may later
enable us to introduce per-target tuning that turns on such looping
expansion whenever it is advantageous, even if not explicitly requested.

I realize this is late for new features in this cycle; I'd be happy to
submit it again later, but I wonder whether there's any interest in this
feature, or any objections to it.  FWIW, I've regstrapped this on
x86_64-linux-gnu, and also tested earlier versions of this patch on
earlier GCC branches with RISC-V crosses.  Is this ok for GCC 14?  Maybe
even simple enough for GCC 13, considering it's disabled by default?

TIA,


for  gcc/ChangeLog

	* builtins.cc (try_store_by_multiple_pieces): Support starting
	with a loop.
	* common.opt (finline-memset-loops): New.
	* doc/invoke.texi (-finline-memset-loops): Add.

for  gcc/testsuite/ChangeLog

	* gcc.dg/torture/inline-mem-set-1.c: New.
---
 gcc/builtins.cc                                 | 50 ++++++++++++++++++++++-
 gcc/common.opt                                  |  4 ++
 gcc/doc/invoke.texi                             | 13 ++++++
 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c | 14 ++++++
 4 files changed, 77 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 02c4fefa86f48..388bae58ce49e 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -4361,9 +4361,37 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
   if (max_bits >= 0)
     xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
 		- (HOST_WIDE_INT_1U << ctz_len));
+  bool max_loop = false;
   if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
 			    &valc, align, true))
-    return false;
+    {
+      if (!flag_inline_memset_loops)
+	return false;
+      while (--max_bits >= sctz_len)
+	{
+	  xlenest = ((HOST_WIDE_INT_1U << max_bits) * 2
+		     - (HOST_WIDE_INT_1U << ctz_len));
+	  if (can_store_by_pieces (xlenest + blksize,
+				   builtin_memset_read_str,
+				   &valc, align, true))
+	    {
+	      max_loop = true;
+	      break;
+	    }
+	  if (!blksize)
+	    continue;
+	  if (can_store_by_pieces (xlenest,
+				   builtin_memset_read_str,
+				   &valc, align, true))
+	    {
+	      blksize = 0;
+	      max_loop = true;
+	      break;
+	    }
+	}
+      if (!max_loop)
+	return false;
+    }
 
   by_pieces_constfn constfun;
   void *constfundata;
@@ -4405,6 +4433,7 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
      the least significant bit possibly set in the length.  */
   for (int i = max_bits; i >= sctz_len; i--)
     {
+      rtx_code_label *loop_label = NULL;
       rtx_code_label *label = NULL;
       blksize = HOST_WIDE_INT_1U << i;
@@ -4423,14 +4452,24 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
       else if ((max_len & blksize) == 0)
 	continue;
 
+      if (max_loop && i == max_bits)
+	{
+	  loop_label = gen_label_rtx ();
+	  emit_label (loop_label);
+	  /* Since we may run this multiple times, don't assume we
+	     know anything about the offset.  */
+	  clear_mem_offset (to);
+	}
+
       /* Issue a store of BLKSIZE bytes.  */
+      bool update_needed = i != sctz_len || loop_label;
       to = store_by_pieces (to, blksize,
 			    constfun, constfundata,
 			    align, true,
-			    i != sctz_len ? RETURN_END : RETURN_BEGIN);
+			    update_needed ? RETURN_END : RETURN_BEGIN);
 
       /* Adjust REM and PTR, unless this is the last iteration.  */
-      if (i != sctz_len)
+      if (update_needed)
 	{
 	  emit_move_insn (ptr, force_operand (XEXP (to, 0), NULL_RTX));
 	  to = replace_equiv_address (to, ptr);
@@ -4438,6 +4477,11 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
 	  emit_move_insn (rem, force_operand (rem_minus_blksize, NULL_RTX));
 	}
 
+      if (loop_label)
+	emit_cmp_and_jump_insns (rem, GEN_INT (blksize), GE, NULL,
+				 ptr_mode, 1, loop_label,
+				 profile_probability::likely ());
+
       if (label)
 	{
 	  emit_label (label);
diff --git a/gcc/common.opt b/gcc/common.opt
index 562d73d7f552a..c28af170be896 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1874,6 +1874,10 @@ finline-atomics
 Common Var(flag_inline_atomics) Init(1) Optimization
 Inline __atomic operations when a lock free instruction sequence is available.
+finline-memset-loops
+Common Var(flag_inline_memset_loops) Init(0) Optimization
+Inline memset even if it requires loops.
+
 fcf-protection
 Common RejectNegative Alias(fcf-protection=,full)
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index da9ad1068fbf6..19f436ad46385 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -548,7 +548,8 @@ Objective-C and Objective-C++ Dialects}.
 -fgcse-sm -fhoist-adjacent-loads -fif-conversion @gol
 -fif-conversion2 -findirect-inlining @gol
 -finline-functions -finline-functions-called-once -finline-limit=@var{n} @gol
--finline-small-functions -fipa-modref -fipa-cp -fipa-cp-clone @gol
+-finline-memset-loops -finline-small-functions @gol
+-fipa-modref -fipa-cp -fipa-cp-clone @gol
 -fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const @gol
 -fipa-reference -fipa-reference-addressable @gol
 -fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm} @gol
@@ -11960,6 +11961,16 @@ in its own right.
 Enabled at levels @option{-O1}, @option{-O2}, @option{-O3} and @option{-Os},
 but not @option{-Og}.
 
+@item -finline-memset-loops
+@opindex finline-memset-loops
+Expand @code{memset} calls inline, even when the length is variable or
+big enough as to require looping.  This may enable the compiler to take
+advantage of known alignment and length multipliers, but it will often
+generate code that is less efficient than performant implementations of
+@code{memset}, and grow code size so much that even a less performant
+@code{memset} may run faster due to better use of the code cache.  This
+option is disabled by default.
+
 @item -fearly-inlining
 @opindex fearly-inlining
 Inline functions marked by @code{always_inline} and functions whose body seems
diff --git a/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
new file mode 100644
index 0000000000000..73bd1025f191f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-finline-memset-loops -gno-record-gcc-switches -fno-lto" } */
+
+void *zero (unsigned long long (*p)[32], int n)
+{
+  return __builtin_memset (p, 0, n * sizeof (*p));
+}
+
+void *ones (char (*p)[128], int n)
+{
+  return __builtin_memset (p, -1, n * sizeof (*p));
+}
+
+/* { dg-final { scan-assembler-not "memset" } } */
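For readers skimming the builtins.cc hunk, the fallback search it introduces can be sketched in plain C: when the full worst-case length cannot be covered by store blocks alone, max_bits shrinks until a loop body of that size fits, optionally dropping the leading constant-size block. All names and types below are simplified assumptions, and store_limit stands in for what can_store_by_pieces decides; this is not GCC code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pick the block size (as a power-of-two exponent) for the memset
   loop: try each max_bits from the largest down to sctz_len, first
   with the leading constant block (blksize) kept, then with it
   dropped.  Returns -1 if no size works, mirroring the patch's
   "if (!max_loop) return false;".  */
static int pick_loop_bits (uint64_t blksize, int max_bits, int sctz_len,
                           int ctz_len, uint64_t store_limit,
                           bool *drop_blksize)
{
  *drop_blksize = false;
  while (--max_bits >= sctz_len)
    {
      /* Worst-case length a loop body of 1 << max_bits must cover,
         matching the patch's xlenest computation.  */
      uint64_t xlenest = ((uint64_t) 1 << max_bits) * 2
                         - ((uint64_t) 1 << ctz_len);
      if (xlenest + blksize <= store_limit)
        return max_bits;                /* loop plus leading block fits */
      if (blksize && xlenest <= store_limit)
        {
          *drop_blksize = true;         /* the patch sets blksize = 0 */
          return max_bits;
        }
    }
  return -1;                            /* give up; expansion declines */
}
```

The two-step check preserves the leading block when possible, since dropping it trades a cheap constant store for extra loop iterations.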