From patchwork Fri Jan 13 09:42:36 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lehua Ding <lehua.ding@rivai.ai>
X-Patchwork-Id: 63145
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 776AB385ED7A
	for <patchwork@sourceware.org>; Fri, 13 Jan 2023 09:43:10 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtpbg153.qq.com (smtpbg153.qq.com [13.245.218.24])
 by sourceware.org (Postfix) with ESMTPS id F3AA93858D32
 for <gcc-patches@gcc.gnu.org>; Fri, 13 Jan 2023 09:42:47 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org F3AA93858D32
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=rivai.ai
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=rivai.ai
X-QQ-mid: bizesmtp87t1673602959t5rely2j
Received: from server1.localdomain ( [58.60.1.20])
 by bizesmtp.qq.com (ESMTP) with
 id ; Fri, 13 Jan 2023 17:42:38 +0800 (CST)
X-QQ-SSF: 0140000000200070C000B00A0000000
X-QQ-FEAT: ff4zginWWTeeJ529/lDdaLuaaKu43mu5uWv7afUQeTqrmxlNjlVC44YlwbnfZ
 dtdIzh4fFKL0okQoRKqOsly2o9d7+zCsYfhfl2aS+fXtqQGggwpqEPKMsEsmeKsTDXRgRCl
 66Q+Rwpm8HB1w6Z55bjU+xAyz3EU7rgMuD3hAzEaT0P8acIa/ma0tijY4VgJ0R1OiJdDwyG
 4svZTI1RqjUgzq942kys1KnZUDAAWRuKDTML3OE7bvFnPe3sQokCN9UM+ZOvHWO6jzazkXQ
 6+n17a54/Jaep1bFgtedqlInh4kbPr1gQuMm7xqMZ0hYSERQFRtjab6OJt4LP13U0MjgP3v
 FSiMb3iCZnGs0shC6F5VuJDvKKy5REDVPKXtMcsJJWNADR9qaQ2/MP5xBgKOPDyOfqzUSKe
 EWbL4l3TeMSU5F/wKeaZ5A==
X-QQ-GoodBg: 2
From: lehua.ding@rivai.ai
To: gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com, juzhe.zhong@rivai.ai,
 Lehua Ding <lehua.ding@rivai.ai>
Subject: [PATCH 1/1] [fwprop]: Add the support of forwarding the vec_duplicate
 rtx
Date: Fri, 13 Jan 2023 17:42:36 +0800
Message-Id: <20230113094236.77805-1-lehua.ding@rivai.ai>
X-Mailer: git-send-email 2.36.3
MIME-Version: 1.0
X-QQ-SENDSIZE: 520
Feedback-ID: bizesmtp:rivai.ai:qybglogicsvr:qybglogicsvr2
X-Spam-Status: No, score=-8.7 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_ASCII_DIVIDERS, KAM_DMARC_STATUS, RCVD_IN_BARRACUDACENTRAL,
 RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2, RCVD_IN_SORBS_WEB, SPF_HELO_PASS,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org
Sender: "Gcc-patches"
 <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>

From: Lehua Ding <lehua.ding@rivai.ai>

ps: Resend for adjusting the width of each line of text.

Hi,

When I was adding the new RISC-V auto-vectorization function, I found that
converting `vector-reg1 vop vector-vreg2` to `scalar-reg3 vop vectorreg2`
is not very easy to handle where `vector-reg1` is a vec_duplicate_expr.
For example the bellow gimple IR:

```gimple
<bb2>
vect_cst__51 = [vec_duplicate_expr] z_14(D);

<bb 3>
vect_iftmp.13_53 = .LEN_COND_ADD(mask__40.9_47, vect__6.12_50, vect_cst__51, { 0.0, ... }, curr_cnt_60);
```

I once wanted to add corresponding functions to gimple IR, such as adding
.LEN_COND_ADD_VS, and then convert .LEN_COND_ADD to .LEN_COND_ADD_VS in match.pd.
This method can be realized, but it will cause too many similar internal functions
to be added to gimple IR. It doesn't feel necessary. Later, I tried to combine them
on the combine pass but failed. Finally, I thought of adding the ability to support
forwarding `(vec_duplciate reg)` in fwprop pass, so I have this patch.

Because the current upstream does not support the RISC-V automatic vectorization
function, I found an example in sve that can also be optimized and simply tried
it. For the float type, one instruction can be reduced, for example the bellow C
code. The difference between the new and old assembly code is that the new one
uses the mov instruction to directly move the scalar variable to the vector register.
The old assembly code first moves the scalar variable to the vector register outside
the loop, and then uses the sel instruction. Compared with the entire assembly code,
the new assembly code has one instruction less. In addition, I noticed that some
instructions in the new assembly code are ahead of the `ble .L1` instruction.
I debugged and found that the modification was made in the ce1 pass. This pass
believes that moving up is more beneficial to performance.

In addition, for the int type, compared with the float type, the new assembly code
will have one more `fmov s2, w2` instruction, so I can't judge whether the
performance is better than the previous one. In fact, I mainly do RISC-V development work.

This patch is an exploratory patch and has not been tested too much. I mainly
want to see your suggestions on whether this method is feasible and possible
potential problems.

Best,
Lehua Ding

```c
/* compiler options: -O3 -march=armv8.2-a+sve -S */
void test1 (int *pred, float *x, float z, int n)
{
         for (int i = 0; i < n; i += 1)
           {
                 x[i] = pred[i] != 1 ? x[i] : z;
           }
}
```

The old assembly code like this (compiler explorer link: https://godbolt.org/z/hxTnEhaqY):

```asm
test1:
         cmp w2, 0
         ble.L1
         mov x3, 0
         cntw x4
         mov z0.s, s0
         whilelo p0.s, wzr, w2
         ptrue p2.b, all
.L3:
         ld1w z2.s, p0/z, [x0, x3, lsl 2]
         ld1w z1.s, p0/z, [x1, x3, lsl 2]
         cmpne p1.s, p2/z, z2.s, #1
         sel z1.s, p1, z1.s, z0.s
         st1w z1.s, p0, [x1, x3, lsl 2]
         add x3, x3, x4
         while lo p0.s, w3, w2
         b.any.L3
.L1:
         ret
```

The new assembly code like this:

```asm
test1:
         whilelo p0.s, wzr, w2
         mov x3, 0
         cntw x4
         ptrue p2.b, all
         cmp w2, 0
         ble.L1
.L3:
         ld1w z2.s, p0/z, [x0, x3, lsl 2]
         ld1w z1.s, p0/z, [x1, x3, lsl 2]
         cmpne p1.s, p2/z, z2.s, #1
         mov z1.s, p1/m, s0
         st1w z1.s, p0, [x1, x3, lsl 2]
         add x3, x3, x4
         while lo p0.s, w3, w2
         b.any.L3
.L1:
         ret
```


gcc/ChangeLog:

        * config/aarch64/aarch64-sve.md (@aarch64_sel_dup<mode>_vs): Add new pattern to capture new opeands order
        * fwprop.cc (fwprop_propagation::profitable_p): Add new check
        (reg_single_def_for_src_p): Add new function for src rtx
        (forward_propagate_into): Change to new function call
---
 gcc/config/aarch64/aarch64-sve.md | 20 ++++++++++++++++++++
 gcc/fwprop.cc                     | 16 +++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index b8cc47ef5fc..84d8ed0924d 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -7636,6 +7636,26 @@
   [(set_attr "movprfx" "*,*,yes,yes,yes,yes")]
 )
 
+;; Swap the order of operand 1 and operand 2 so that it matches the above pattern
+(define_insn_and_split "@aarch64_sel_dup<mode>_vs"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=?w, w, ??w, ?&w, ??&w, ?&w")
+	(unspec:SVE_ALL
+	  [(match_operand:<VPRED> 3 "register_operand" "Upl, Upl, Upl, Upl, Upl, Upl")
+           (match_operand:SVE_ALL 1 "aarch64_simd_reg_or_zero" "0, 0, Dz, Dz, w, w")
+	   (vec_duplicate:SVE_ALL
+             (match_operand:<VEL> 2 "register_operand" "r, w, r, w, r, w"))]
+	  UNSPEC_SEL))]
+  "TARGET_SVE"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+        (unspec:SVE_ALL
+          [(match_dup 3)
+           (vec_duplicate:SVE_ALL (match_dup 2))
+           (match_dup 1)]
+          UNSPEC_SEL))]
+)
+
 ;; -------------------------------------------------------------------------
 ;; ---- [INT,FP] Compare and select
 ;; -------------------------------------------------------------------------
diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index ae342f59407..5d921dd3d2f 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -342,6 +342,9 @@ fwprop_propagation::profitable_p () const
   if (CONSTANT_P (to))
     return true;
 
+  if (GET_CODE (to) == VEC_DUPLICATE)
+    return true;
+
   return false;
 }
 
@@ -353,6 +356,17 @@ reg_single_def_p (rtx x)
   return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x));
 }
 
+/* Check that X has a single def or a VEC_DUPLICATE expr whose elements have a
+   single def. */
+static bool
+reg_single_def_for_src_p (rtx x)
+{
+  if (GET_CODE (x) == VEC_DUPLICATE)
+    x = XEXP (x, 0);
+
+  return reg_single_def_p (x);
+}
+
 /* Return true if X contains a paradoxical subreg.  */
 
 static bool
@@ -873,7 +887,7 @@ forward_propagate_into (use_info *use, bool reg_prop_only = false)
   if ((reg_prop_only
        || (def_loop != use_loop
 	   && !flow_loop_nested_p (use_loop, def_loop)))
-      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
+      && (!reg_single_def_p (dest) || !reg_single_def_for_src_p (src)))
     return false;
 
   /* Don't substitute into a non-local goto, this confuses CFG.  */