From patchwork Wed Mar 16 14:59:00 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: "Andre Vieira (lists)" <Andre.SimoesDiasVieira@arm.com>
X-Patchwork-Id: 52006
Message-ID: <698d5b64-5e1b-d0bd-f2ff-f5cd2763dbaa@arm.com>
Date: Wed, 16 Mar 2022 14:59:00 +0000
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 Thunderbird/91.6.2
Content-Language: en-US
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: [aarch64] Implement determine_suggested_unroll_factor
From: "Andre Vieira (lists)" <Andre.SimoesDiasVieira@arm.com>
Reply-To: "Andre Vieira \(lists\)" <andre.simoesdiasvieira@arm.com>
Cc: Richard Sandiford <richard.sandiford@arm.com>

Hi,

This patch implements the costing function
determine_suggested_unroll_factor for aarch64.
For each kind of operation (stores, loads and general ops), it divides
the number of such operations we can issue per cycle by the number of
such operations in the loop body and takes the smallest quotient as the
unroll factor. The operation counts come from the vec_ops analysis done
during vector costing and the issue rates from the available issue_info.
We multiply the dividend by the reduction latency, if there is one, to
improve our pipeline utilization when we would otherwise be stalled
waiting on a particular reduction operation. The result is currently
capped at 4 and rounded up to a power of 2.
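
To illustrate the calculation described above, here is a minimal
standalone sketch for a single set of operation counts. OpCounts,
IssueRates, ceil_div, round_up_pow2 and suggest_unroll are hypothetical
names used only for illustration and are not part of the patch; the
patch itself reads these values from vec_ops and issue_info and takes
the maximum over all entries of m_ops.

```cpp
#include <algorithm>

// Hypothetical stand-ins for the per-loop operation counts and the
// per-cycle issue rates; these are not the GCC data structures.
struct OpCounts
{
  unsigned stores, loads, general_ops, reduction_latency;
};

struct IssueRates
{
  unsigned stores_per_cycle, loads_stores_per_cycle, general_ops_per_cycle;
};

// Integer division rounding up; B must be nonzero.
static unsigned
ceil_div (unsigned a, unsigned b)
{
  return (a + b - 1) / b;
}

// Smallest power of 2 that is >= X (for X >= 1).
static unsigned
round_up_pow2 (unsigned x)
{
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// Suggest an unroll factor for one set of operation counts.  CAP is an
// assumed upper limit, mirroring the limit of 4 mentioned above.
unsigned
suggest_unroll (const OpCounts &ops, const IssueRates &rates,
		unsigned cap = 4)
{
  if (ops.stores + ops.loads + ops.general_ops == 0)
    return 1;

  // Scale the issue rates by the reduction latency, if there is one.
  unsigned factor = std::max (ops.reduction_latency, 1u);
  unsigned unroll = cap;

  if (ops.stores > 0)
    unroll = std::min (unroll, ceil_div (factor * rates.stores_per_cycle,
					 ops.stores));
  if (ops.loads > 0)
    unroll = std::min (unroll,
		       ceil_div (factor * rates.loads_stores_per_cycle,
				 ops.loads));
  if (ops.general_ops > 0)
    unroll = std::min (unroll,
		       ceil_div (factor * rates.general_ops_per_cycle,
				 ops.general_ops));

  return round_up_pow2 (unroll);
}
```

With the made-up issue rates 2/3/4 per cycle, a body with one load and
one general op gives quotients 3 and 4, so the minimum of 3 is rounded
up to 4, while a body with one store, two loads and two general ops
gives 2.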

Right now we also have a workaround for vectorization choices where the
main loop uses a NEON mode and predication is available: if the main
loop makes use of a NEON pattern that is not directly supported by SVE,
we do not unroll, as that might cause performance regressions in cases
where we would otherwise enter the original main loop's VF. For example,
if you have a loop where you could use AVG_CEIL with a V8HI mode, you
would originally get an 8x NEON loop using AVG_CEIL followed by an 8x
SVE predicated epilogue using other instructions. With unrolling you
would instead end up with a 16x AVG_CEIL NEON loop + an 8x SVE
predicated loop, thus skipping the original 8x NEON loop. In the future
we could handle this differently, either by using a different costing
model for epilogues or by potentially vectorizing more than one
epilogue.
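
As a rough illustration of the regression described above, the sketch
below counts how many elements the main loop and the epilogue each
handle for a given trip count. It is a simplified model, not vectorizer
code: Split and split_iterations are hypothetical names, and it assumes
the main loop only runs full iterations and that the predicated epilogue
handles everything that is left.

```cpp
// Hypothetical model of how elements are divided between the main
// vector loop and the predicated epilogue.
struct Split
{
  unsigned main_elems;     // elements handled by the main (NEON) loop
  unsigned epilogue_elems; // elements left for the predicated epilogue
};

// N is the number of elements, MAIN_VF the main loop's vectorization
// factor (8 without unrolling, 16 with an unroll factor of 2).
Split
split_iterations (unsigned n, unsigned main_vf)
{
  unsigned main_elems = (n / main_vf) * main_vf;
  return { main_elems, n - main_elems };
}
```

Under this model, 12 elements with a VF of 8 give 8 elements to the NEON
AVG_CEIL loop and 4 to the epilogue, whereas with a VF of 16 all 12
elements go through the SVE predicated loop.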

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_vector_costs): Declare
	determine_suggested_unroll_factor.
	(aarch64_vector_costs::determine_suggested_unroll_factor): New
	function.
	(aarch64_vector_costs::finish_cost): Use
	determine_suggested_unroll_factor.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b5687aab59f630920e51b742b80a540c3a56c6c8..9d3a607d378d6a2792efa7c6dece2a65c24e4521 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -15680,6 +15680,7 @@ private:
   unsigned int adjust_body_cost (loop_vec_info, const aarch64_vector_costs *,
 				 unsigned int);
   bool prefer_unrolled_loop () const;
+  unsigned int determine_suggested_unroll_factor ();
 
   /* True if we have performed one-time initialization based on the
      vec_info.  */
@@ -16768,6 +16769,105 @@ adjust_body_cost_sve (const aarch64_vec_op_count *ops,
   return sve_cycles_per_iter;
 }
 
+unsigned int
+aarch64_vector_costs::determine_suggested_unroll_factor ()
+{
+  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
+  if (!issue_info)
+    return 1;
+  bool sve = false;
+  if (aarch64_sve_mode_p (m_vinfo->vector_mode))
+    {
+      if (!issue_info->sve)
+	return 1;
+      sve = true;
+    }
+  else
+    {
+      if (!issue_info->advsimd)
+	return 1;
+      /* If we are trying to unroll a NEON main loop that contains patterns
+	 that we do not support with SVE and we might use a predicated
+	 epilogue, we need to be conservative and block unrolling as this might
+	 lead to a less optimal loop for the first and only epilogue using the
+	 original loop's vectorization factor.
+	 TODO: Remove this constraint when we add support for multiple epilogue
+	 vectorization.  */
+      if (partial_vectors_supported_p ()
+	  && param_vect_partial_vector_usage != 0
+	  && !TARGET_SVE2)
+	{
+	  unsigned int i;
+	  stmt_vec_info stmt_vinfo;
+	  FOR_EACH_VEC_ELT (m_vinfo->stmt_vec_infos, i, stmt_vinfo)
+	    {
+	      if (is_pattern_stmt_p (stmt_vinfo))
+		{
+		  gimple *stmt = stmt_vinfo->stmt;
+		  if (is_gimple_call (stmt)
+		      && gimple_call_internal_p (stmt))
+		    {
+		      enum internal_fn ifn
+			= gimple_call_internal_fn (stmt);
+		      switch (ifn)
+			{
+			case IFN_AVG_FLOOR:
+			case IFN_AVG_CEIL:
+			  return 1;
+			default:
+			  break;
+			}
+		    }
+		}
+	    }
+	}
+    }
+
+  unsigned int max_unroll_factor = 1;
+  aarch64_simd_vec_issue_info const *vec_issue
+    = sve ? issue_info->sve : issue_info->advsimd;
+  for (const auto &vec_ops : m_ops)
+    {
+      /* Limit unroll factor to 4 for now.  */
+      unsigned int unroll_factor = 4;
+      unsigned int factor
+	= vec_ops.reduction_latency > 1 ? vec_ops.reduction_latency : 1;
+      unsigned int temp;
+
+      /* Sanity check, this should never happen.  */
+      if ((vec_ops.stores + vec_ops.loads + vec_ops.general_ops) == 0)
+	return 1;
+
+      /* Check stores.  */
+      if (vec_ops.stores > 0)
+	{
+	  temp = CEIL (factor * vec_issue->stores_per_cycle,
+		       vec_ops.stores);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+
+      /* Check loads.  */
+      if (vec_ops.loads > 0)
+	{
+	  temp = CEIL (factor * vec_issue->loads_stores_per_cycle,
+		       vec_ops.loads);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+
+      /* Check general ops.  */
+      if (vec_ops.general_ops > 0)
+	{
+	  temp = CEIL (factor * vec_issue->general_ops_per_cycle,
+		       vec_ops.general_ops);
+	  unroll_factor = MIN (unroll_factor, temp);
+	}
+      max_unroll_factor = MAX (max_unroll_factor, unroll_factor);
+    }
+
+  /* Make sure unroll factor is power of 2.  */
+  return 1 << ceil_log2 (max_unroll_factor);
+}
+
 /* BODY_COST is the cost of a vector loop body.  Adjust the cost as necessary
    and return the new cost.  */
 unsigned int
@@ -16904,8 +17004,11 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
   if (loop_vinfo
       && m_vec_flags
       && aarch64_use_new_vector_costs_p ())
-    m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
-					   m_costs[vect_body]);
+    {
+      m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
+					     m_costs[vect_body]);
+      m_suggested_unroll_factor = determine_suggested_unroll_factor ();
+    }
 
   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
      the scalar code in the event of a tie, since there is more chance