From patchwork Thu Nov 18 15:18:02 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Roger Sayle <roger@nextmovesoftware.com>
X-Patchwork-Id: 47889
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 0F9FC385E00D
	for <patchwork@sourceware.org>; Thu, 18 Nov 2021 15:18:23 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from server.nextmovesoftware.com (server.nextmovesoftware.com
 [162.254.253.69])
 by sourceware.org (Postfix) with ESMTPS id F225C3858433
 for <gcc-patches@gcc.gnu.org>; Thu, 18 Nov 2021 15:18:04 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org F225C3858433
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none)
 header.from=nextmovesoftware.com
Authentication-Results: sourceware.org;
 spf=pass smtp.mailfrom=nextmovesoftware.com
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
 d=nextmovesoftware.com; s=default; h=Content-Type:MIME-Version:Message-ID:
 Date:Subject:Cc:To:From:Sender:Reply-To:Content-Transfer-Encoding:Content-ID:
 Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
 :Resent-Message-ID:In-Reply-To:References:List-Id:List-Help:List-Unsubscribe:
 List-Subscribe:List-Post:List-Owner:List-Archive;
 bh=OLoTQU62L20lne9BWb37JwMlIuW6aYB9sP1ZPKNy4bE=; b=DX4CIVfNGrcnncSFF29g+q5w8d
 BFHJ6fbkOhWqXB6uXkQlL8KhEM+xI8o9ccRfnQYD7kRB+0wKgbAATGXO1oyPdQ1Nf18oSUklcca6r
 1/FOY9SlNeXaQkfKf/+qQmfHWarJ5NgD/SPo/VDcumfbXPkuY2GJUk6TWx7yY/lOKbawGu9rWNl09
 kUQ6S7APiJqKuswm0vLmZ1R7Mh5pt13LWx/TAmb5HjbDTjCHvI+xHBmf3r8I9Yw+bedThfiYl0Ri6
 yw0GnzYM68vCyatkHZ+C/W5bYpDLzPekQqumUGmmfQesgr9rhQJiMr5SuS+wl8OrZCiyxXWkP3SUS
 mJVsHLvQ==;
Received: from [185.62.158.67] (port=58090 helo=Dell)
 by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2)
 (envelope-from <roger@nextmovesoftware.com>)
 id 1mnjB2-0007t7-Sn; Thu, 18 Nov 2021 10:18:04 -0500
From: "Roger Sayle" <roger@nextmovesoftware.com>
To: "'Richard Biener'" <richard.guenther@gmail.com>,
 "'Andrew MacLeod'" <amacleod@redhat.com>
Subject: [PATCH take 2] ivopts: Improve code generated for very simple loops.
Date: Thu, 18 Nov 2021 15:18:02 -0000
Message-ID: <007e01d7dc8f$7b87f260$7297d720$@nextmovesoftware.com>
MIME-Version: 1.0
X-Mailer: Microsoft Outlook 16.0
Thread-Index: AdfchxE/VzQaaD02R4KQeI3krlFKEQ==
Content-Language: en-gb
X-AntiAbuse: This header was added to track abuse,
 please include it with any abuse report
X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com
X-AntiAbuse: Original Domain - gcc.gnu.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - nextmovesoftware.com
X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id:
 roger@nextmovesoftware.com
X-Authenticated-Sender: server.nextmovesoftware.com:
 roger@nextmovesoftware.com
X-Source: 
X-Source-Args: 
X-Source-Dir: 
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, SPF_HELO_NONE,
 SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org>
Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org
Sender: "Gcc-patches"
 <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>

Hi Richard,
Many thanks for the patch review.

On Tue, Nov 16, 2021 at 12:38 Richard Biener wrote:
> On Mon, Nov 15, 2021 at 2:04 PM Roger Sayle
> <roger@nextmovesoftware.com> wrote:
> >
> > This patch tidies up the code that GCC generates for simple loops, by
> > selecting/generating a simpler loop bound expression in ivopts.
> > Previously:
> >
> > void v1 (unsigned long *in, unsigned long *out, unsigned int n) {
> >   int i;
> >   for (i = 0; i < n; i++) {
> >     out[i] = in[i];
> >   }
> > }
> >
> > on x86_64 generated:
> > v1:     testl   %edx, %edx
> >         je      .L1
> >         movl    %edx, %edx
> >         xorl    %eax, %eax
> > .L3:    movq    (%rdi,%rax,8), %rcx
> >         movq    %rcx, (%rsi,%rax,8)
> >         addq    $1, %rax
> >         cmpq    %rax, %rdx
> >         jne     .L3
> > .L1:    ret
> >
> > and now instead generates:
> > v1:     testl   %edx, %edx
> >         je      .L1
> >         movl    %edx, %edx
> >         xorl    %eax, %eax
> >         leaq    0(,%rdx,8), %rcx
> > .L3:    movq    (%rdi,%rax), %rdx
> >         movq    %rdx, (%rsi,%rax)
> >         addq    $8, %rax
> >         cmpq    %rax, %rcx
> >         jne     .L3
> > .L1:    ret
> 
> Is that actually better?  IIRC the addressing modes are both complex and we
> now have an extra lea?


Technically the induction variable elimination is removing two multiplies by 8
(or left shifts) from the body of the loop and replacing them with a single
multiply by 8 prior to the loop, which is exactly the induction variable 
optimization that ivopts is designed to do.  It's true that with x86's complex
addressing modes these "multiplications" are free, but ivopts is run on all
targets including those without indexed addressing, and even on x86 there's
a benefit from shorter instruction encodings.

> For this case I see we generate
>   _15 = n_10(D) + 4294967295;
>   _8 = (unsigned long) _15;
>   _7 = _8 + 1;
> 
> where n is unsigned int so if we know that n is not zero we can simplify the
> addition and conveniently the loop header test provides this guarantee.
> IIRC there were some attempts to enhance match.pd for some cases of such
> expressions.

Exactly.  The loop optimizers are generating the expression (x-1)+1 and
assuming that the middle-end will simplify this to x.  Unfortunately, the
change of modes (designed to prevent overflow) actually makes this
impossible to simplify, due to concerns about overflow.

> +  /* If AFTER_ADJUST is required, the code below generates the equivalent
> +   * of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
> +   * BASE + (NITER + 1) * STEP, especially when NITER is often of the form
> +   * SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
> +   * doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
> +   * class for common idioms that we know are safe.  */
> 
> No '* ' each line.
Doh!  Thanks.  Sometimes I hate vi.

> I wonder if the non-overflowing can be captured by
>     integer_onep (iv->step)
>     && max_stmt_executions (loop, &max)

Unfortunately, max_stmt_executions is intended to return the wide_int count
of iterations, either the known constant value or a profile-based estimate, while
the optimizations I'm proposing work with symbolic values from variable iteration
counts.  When the iteration count is a constant, fold-const is already able to
simplify (x-1)+1 to an integer constant even with type conversions.

> if we then do (niter + 1) * step instead of niter*step + step would that do the
> same?

Yes.  I've extended the scope of the patch to now also handle loops of the
form (for i=beg; i<end; i++), where the original niter is (end-beg)-1, and we
now use (end-beg) as the incremented niter, so the invariant expression now
becomes (end-beg)*4 instead of the currently generated:  ((end-beg)-1)*4 + 4.

I'm assuming that the niter*step + step is by design (for the general case),
so I'm only tweaking the (common) corner cases, where it's easy to see that
it's safe to substitute a simpler expression.  For more general affine recurrences
in complex loop nests, niter*step + step may be preferred.

> That said - what the change does is actually ensure that we CSE niter + 1
> with the bound of the simplified exit test?

Not quite, this simply provides a simplified expression for "niter + 1" that
takes advantage of the implicit range information we have.  You're right
in theory that ranger/vrp may be able to help simplify (long)((int)x-1)+1L
more generally.   In this case, I suspect it's just a historical artifact that
much of the literature on (iv) loop optimizations dates back to Fortran
and Algol where arrays were 1-based rather than 0-based, so decades
later gcc 11 on x86_64 sometimes requires an extra inc/dec instruction
for purely historical reasons.

> The patch doesn't add any testcase.

The three new attached tests check that the critical invariants have a
simpler form, and hopefully shouldn't be affected by whether the optimizer
and/or backend costs actually decide to perform this iv substitution or not.

> Thanks,
> Richard.


This revised patch has been tested on x86_64-pc-linux-gnu with a
make bootstrap and make -k check with no new failures.

2021-11-18  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* tree-ssa-loop-ivopts.c (cand_value_at): Take a class
	tree_niter_desc* argument instead of just a tree for NITER.
	If we require the iv candidate value at the end of the final
	loop iteration, try using the original loop bound as the
	NITER for sufficiently simple loops.
	(may_eliminate_iv): Update (only) call to cand_value_at.

gcc/testsuite
	* gcc.dg/tree-ssa/ivopts-5.c: New test case.
	* gcc.dg/tree-ssa/ivopts-6.c: New test case.
	* gcc.dg/tree-ssa/ivopts-7.c: New test case.
	* gcc.dg/wrapped-binop-simplify.c: Update expected test result.


Many thanks in advance (fingers-crossed this is still suitable/grandfathered-in
at the start of Stage 3).

Roger

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 4a498ab..8f98595 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -5034,28 +5034,57 @@ determine_group_iv_cost_address (struct ivopts_data *data,
   return !sum_cost.infinite_cost_p ();
 }
 
-/* Computes value of candidate CAND at position AT in iteration NITER, and
-   stores it to VAL.  */
+/* Computes value of candidate CAND at position AT in iteration DESC->NITER,
+   and stores it to VAL.  */
 
 static void
-cand_value_at (class loop *loop, struct iv_cand *cand, gimple *at, tree niter,
-	       aff_tree *val)
+cand_value_at (class loop *loop, struct iv_cand *cand, gimple *at,
+	       class tree_niter_desc *desc, aff_tree *val)
 {
   aff_tree step, delta, nit;
   struct iv *iv = cand->iv;
   tree type = TREE_TYPE (iv->base);
+  tree niter = desc->niter;
+  bool after_adjust = stmt_after_increment (loop, cand, at);
   tree steptype;
+
   if (POINTER_TYPE_P (type))
     steptype = sizetype;
   else
     steptype = unsigned_type_for (type);
 
+  /* If AFTER_ADJUST is required, the code below generates the equivalent
+     of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
+     BASE + (NITER + 1) * STEP, especially when NITER is often of the form
+     SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
+     doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
+     class for common idioms that we know are safe.  */
+  if (after_adjust
+      && desc->control.no_overflow
+      && integer_onep (desc->control.step)
+      && (desc->cmp == LT_EXPR
+	  || desc->cmp == NE_EXPR)
+      && TREE_CODE (desc->bound) == SSA_NAME)
+    {
+      if (integer_onep (desc->control.base))
+	{
+	  niter = desc->bound;
+	  after_adjust = false;
+	}
+      else if (TREE_CODE (niter) == MINUS_EXPR
+	       && integer_onep (TREE_OPERAND (niter, 1)))
+	{
+	  niter = TREE_OPERAND (niter, 0);
+	  after_adjust = false;
+	}
+    }
+
   tree_to_aff_combination (iv->step, TREE_TYPE (iv->step), &step);
   aff_combination_convert (&step, steptype);
   tree_to_aff_combination (niter, TREE_TYPE (niter), &nit);
   aff_combination_convert (&nit, steptype);
   aff_combination_mult (&nit, &step, &delta);
-  if (stmt_after_increment (loop, cand, at))
+  if (after_adjust)
     aff_combination_add (&delta, &step);
 
   tree_to_aff_combination (iv->base, type, val);
@@ -5406,7 +5435,7 @@ may_eliminate_iv (struct ivopts_data *data,
       return true;
     }
 
-  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
+  cand_value_at (loop, cand, use->stmt, desc, &bnd);
 
   *bound = fold_convert (TREE_TYPE (cand->iv->base),
 			 aff_combination_to_tree (&bnd));
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c
new file mode 100644
index 0000000..7a7661c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-5.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int sz, int val)
+{
+  int i;
+  for (i = 0; i < sz; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c
new file mode 100644
index 0000000..773bfe4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-6.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int sz, int val)
+{
+  int i;
+  for (i = 0; i != sz; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) sz_\[0-9\]\\(D\\) \\* 4 \\+ \\(unsigned long\\) mem_\[0-9\]\\(D\\)" "ivopts" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c
new file mode 100644
index 0000000..3d2de48
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-7.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int*
+foo (int* mem, int beg, int end, int val)
+{
+  int i;
+  for (i = beg; i < end; i++)
+    if (mem[i] == val) 
+      return &mem[i];
+  return 0;
+}
+/* { dg-final { scan-tree-dump "inv_expr \[0-9\]: \\t\\(unsigned long\\) \\(\\(unsigned int\\) end_\[0-9\]\\(D\\) - \\(unsigned int\\) beg_\[0-9\]\\(D\\)\\)" "ivopts" } } */
+
diff --git a/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c b/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
index a5d953b..706eed8 100644
--- a/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
+++ b/gcc/testsuite/gcc.dg/wrapped-binop-simplify.c
@@ -1,6 +1,6 @@
 /* { dg-do compile { target { { i?86-*-* x86_64-*-* s390*-*-* } && lp64 } } } */
 /* { dg-options "-O2 -fdump-tree-vrp2-details" } */
-/* { dg-final { scan-tree-dump-times "gimple_simplified to" 4 "vrp2" } } */
+/* { dg-final { scan-tree-dump-times "gimple_simplified to" 1 "vrp2" } } */
 
 void v1 (unsigned long *in, unsigned long *out, unsigned int n)
 {