From patchwork Mon Oct 24 02:46:04 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches" <gcc-patches@gcc.gnu.org>
X-Patchwork-Id: 59335
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 2AEC53857030
	for <patchwork@sourceware.org>; Mon, 24 Oct 2022 02:46:57 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 2AEC53857030
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1666579617;
	bh=9EMmajT2MTDxsSAYuulL7SHlUbmkEzWnTpwxHxrlZs0=;
	h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post:
	 List-Help:List-Subscribe:From:Reply-To:Cc:From;
	b=E/LoRn13xDNo01rFD7bYrr5wS60JABQEX6y2jkIeqZlqHaxfLdkZwDHQPu3qCXI1B
	 ZJjcIIjLJvhKdh8Qbc5h49jQlynzz0PKo6oLAqpAM9MxvDHble+qnZ7GW0FLjGZZvr
	 pwCtGrVkzBMrGTWosAKms9O2eln8q9m3tbQ6s5ZM=
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from mga17.intel.com (mga17.intel.com [192.55.52.151])
 by sourceware.org (Postfix) with ESMTPS id B07A53858012
 for <gcc-patches@gcc.gnu.org>; Mon, 24 Oct 2022 02:46:16 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org B07A53858012
X-IronPort-AV: E=McAfee;i="6500,9779,10509"; a="287724035"
X-IronPort-AV: E=Sophos;i="5.95,207,1661842800"; d="scan'208";a="287724035"
Received: from fmsmga005.fm.intel.com ([10.253.24.32])
 by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 23 Oct 2022 19:46:06 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6500,9779,10509"; a="960280436"
X-IronPort-AV: E=Sophos;i="5.95,207,1661842800"; d="scan'208";a="960280436"
Received: from scymds02.sc.intel.com ([10.82.73.244])
 by fmsmga005.fm.intel.com with ESMTP; 23 Oct 2022 19:46:06 -0700
Received: from shgcc10.sh.intel.com (shgcc10.sh.intel.com [10.239.154.125])
 by scymds02.sc.intel.com with ESMTP id 29O2k5DV015332;
 Sun, 23 Oct 2022 19:46:05 -0700
To: gcc-patches@gcc.gnu.org
Subject: [PATCH] ix86: Suggest unroll factor for loop vectorization
Date: Mon, 24 Oct 2022 10:46:04 +0800
Message-Id: <20221024024604.18324-1-lili.cui@intel.com>
X-Mailer: git-send-email 2.17.1
X-Spam-Status: No, score=-11.3 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 KAM_SHORT,
 SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-Patchwork-Original-From: "Cui,Lili via Gcc-patches"
 <gcc-patches@gcc.gnu.org>
From: "Li, Pan2 via Gcc-patches" <gcc-patches@gcc.gnu.org>
Reply-To: "Cui,Lili" <lili.cui@intel.com>
Cc: hongtao.liu@intel.com
Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org
Sender: "Gcc-patches"
 <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>

Hi Hongtao,

This patch adds finish_cost and determine_suggested_unroll_factor to
the x86 backend so that it can suggest an unroll factor for a loop
being vectorized.  The heuristic follows the aarch64 and rs6000
backends and is based on analysis of SPEC2017 performance evaluation
results.
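
For readers who want to experiment with the heuristic outside of GCC,
the decision logic can be approximated by the standalone sketch below.
It is only an illustration: UnrollParams, round_up_pow2 and
suggest_unroll_factor are hypothetical names that do not exist in GCC,
the arguments stand in for values the real code reads from
loop_vec_info, and the checks for an explicit "do not unroll" request
(loop->unroll, -fno-unroll-loops) are left out.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-ins for the three new --param values; the defaults
// are the Init() values from i386.opt in this patch.
struct UnrollParams {
  unsigned limit = 4;                // x86-vect-unroll-limit
  unsigned min_ldst_threshold = 25;  // x86-vect-unroll-min-ldst-threshold
  unsigned max_loop_size = 200;      // x86-vect-unroll-max-loop-size
};

// Smallest power of two >= x for x >= 1; plays the role of
// "1 << ceil_log2 (x)" in the patch.
unsigned round_up_pow2(unsigned x) {
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// Sketch of the decision sequence.  est_niter == -1 means the iteration
// count is unknown; vector_bytes is the vector mode size (64 for zmm).
unsigned suggest_unroll_factor(unsigned nstmts, unsigned nloads,
                               unsigned nstores, unsigned vector_bytes,
                               std::int64_t est_niter, unsigned vf,
                               bool partial_vectors,
                               const UnrollParams &p = UnrollParams()) {
  // Nothing vectorized, or zmm vectors: do not unroll.
  if (nstmts == 0 || vector_bytes == 64)
    return 1;

  // Cap the factor so the unrolled body stays within max_loop_size,
  // clamp to the limit, then round up to a power of two.
  unsigned uf = nstmts < p.max_loop_size ? p.max_loop_size / nstmts : 1;
  uf = std::min(p.limit, uf);
  uf = round_up_pow2(uf);
  if (uf == 1)
    return 1;

  // Load/store heavy bodies are unrolled without looking at trip counts.
  if (nloads + nstores >= p.min_ldst_threshold)
    return uf;

  // Otherwise require a known trip count that covers the unrolled VF.
  std::int64_t unrolled_vf = static_cast<std::int64_t>(vf) * uf;
  if (est_niter == -1 || est_niter < unrolled_vf)
    return 1;

  // Compare the epilogue work with and without unrolling.
  std::int64_t epil_niter_unr = est_niter % unrolled_vf;
  std::int64_t epil_niter = est_niter % vf;
  if (partial_vectors)
    return epil_niter_unr <= vf ? uf : 1;
  return epil_niter_unr <= epil_niter ? uf : 1;
}
```

With the default parameters, a loop body of 30 vector statements of
which 26 are loads or stores gets a factor of 4 in this sketch, while
a body of 250 statements is left alone.  Because the factor is rounded
up to a power of two after clamping, a non-power-of-two limit such as
3 still yields 4 here.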

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.

OK for trunk?


With this patch, SPEC2017 performance evaluation results on
ICX/CLX/ADL/Znver3 are listed below:

For single copy:
  - ICX: 549.fotonik3d_r +6.2%, the others are neutral
  - CLX: 549.fotonik3d_r +1.9%, the others are neutral
  - ADL: 549.fotonik3d_r +4.5%, the others are neutral
  - Znver3: 549.fotonik3d_r +4.8%, the others are neutral

For multi-copy:
  - ADL: 549.fotonik3d_r +2.7%, the others are neutral

gcc/ChangeLog:

	* config/i386/i386.cc (class ix86_vector_costs): Add new members
	 m_nstmts, m_nloads, m_nstores and determine_suggested_unroll_factor.
	(ix86_vector_costs::add_stmt_cost): Update m_nstmts, m_nloads
	and m_nstores.
	(ix86_vector_costs::determine_suggested_unroll_factor): New function.
	(ix86_vector_costs::finish_cost): Ditto.
	* config/i386/i386.opt (x86-vect-unroll-limit): New parameter.
	(x86-vect-unroll-min-ldst-threshold): Likewise.
	(x86-vect-unroll-max-loop-size): Likewise.
	* doc/invoke.texi: Document new parameters.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
	* gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
	* gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
	* gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
	* gcc.target/i386/vect-reduc-1.c: Ditto.
---
 gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
 gcc/config/i386/i386.opt                      |  15 +++
 gcc/doc/invoke.texi                           |  17 +++
 .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
 .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
 .../i386/vect-alignment-peeling-1.c           |   2 +-
 .../i386/vect-alignment-peeling-2.c           |   2 +-
 gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
 8 files changed, 143 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index aeea26ef4be..a939354e55e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
 			      stmt_vec_info stmt_info, slp_tree node,
 			      tree vectype, int misalign,
 			      vect_cost_model_location where) override;
+
+  unsigned int determine_suggested_unroll_factor (loop_vec_info);
+
+  void finish_cost (const vector_costs *) override;
+
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned m_nstmts = 0;
+  /* Total number of loads (loop only).  */
+  unsigned m_nloads = 0;
+  /* Total number of stores (loop only).  */
+  unsigned m_nstores = 0;
 };
 
 /* Implement targetm.vectorize.create_costs.  */
@@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	retval = (retval * 17) / 10;
     }
 
+  if (!m_costing_for_scalar
+      && is_a<loop_vec_info> (m_vinfo)
+      && where == vect_body)
+    {
+      m_nstmts += count;
+      if (kind == scalar_load || kind == vector_load
+	  || kind == unaligned_load || kind == vector_gather_load)
+	m_nloads += count;
+      else if (kind == scalar_store || kind == vector_store
+	       || kind == unaligned_store || kind == vector_scatter_store)
+	m_nstores += count;
+    }
+
   m_costs[where] += retval;
 
   return retval;
@@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   return nunroll;
 }
 
+unsigned int
+ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Don't unroll if it's specified explicitly not to be unrolled.  */
+  if (loop->unroll == 1
+      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
+      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
+    return 1;
+
+  /* Don't unroll if there is no vectorized stmt.  */
+  if (m_nstmts == 0)
+    return 1;
+
+  /* Don't unroll if vector size is zmm, since zmm throughput is lower than other
+     sizes.  */
+  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
+    return 1;
+
+  /* Calc the total number of loads and stores in the loop body.  */
+  unsigned int nstmts_ldst = m_nloads + m_nstores;
+
+  /* Don't unroll if the loop body size is bigger than the threshold; the
+     threshold is a heuristic value inspired by param_max_unrolled_insns.  */
+  unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
+		    ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
+		    : 1;
+  uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
+  uf = 1 << ceil_log2 (uf);
+
+  /* Early return if there is no need to unroll.  */
+  if (uf == 1)
+    return 1;
+
+  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the loop
+     if the number of loads and stores reaches the threshold; unrolling plus
+     software scheduling will reduce the cache miss rate.  */
+  if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
+    return uf;
+
+  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
+  unsigned int vf = vect_vf_for_cost (loop_vinfo);
+  unsigned int unrolled_vf = vf * uf;
+  if (est_niter == -1 || est_niter < unrolled_vf)
+    /* When the estimated iteration of this loop is unknown, it's possible
+       that we are able to vectorize this loop with the original VF but fail
+       to vectorize it with the unrolled VF any more if the actual iteration
+       count is in between.  */
+    return 1;
+  else
+    {
+      unsigned int epil_niter_unr = est_niter % unrolled_vf;
+      unsigned int epil_niter = est_niter % vf;
+      /* Even if we have partial vector support, it can still be inefficient
+	to calculate the length when the iteration count is unknown, so
+	only expect it's good to unroll when the epilogue iteration count
+	is not bigger than VF (only one time length calculation).  */
+      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	  && epil_niter_unr <= vf)
+	return uf;
+      /* Without partial vector support, conservatively unroll this when
+	the epilogue iteration count is less than the original one
+	(epilogue execution time wouldn't be longer than before).  */
+      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	       && epil_niter_unr <= epil_niter)
+	return uf;
+    }
+
+  return 1;
+}
+
+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+
+{
+  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
+    {
+      m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo);
+    }
+  vector_costs::finish_cost (scalar_costs);
+}
 
 /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.  */
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 53d534f6392..8e49b406aa5 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1224,3 +1224,18 @@ mavxvnniint8
 Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
 Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
 AVXVNNIINT8 built-in functions and code generation.
+
+-param=x86-vect-unroll-limit=
+Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param
+Upper bound on the unroll factor that the vectorizer may suggest for a loop.
+The default value is 4.
+
+-param=x86-vect-unroll-min-ldst-threshold=
+Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param
+Minimum number of loads and stores in the main loop that triggers aggressive
+unrolling.  The default value is 25.
+
+-param=x86-vect-unroll-max-loop-size=
+Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param
+This threshold limits the maximum size of the loop body after unrolling.
+The default value is 200.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 09548c4528c..c86d686f2cd 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -15779,6 +15779,23 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
 @item x86-stlf-window-ninsns
 Instructions number above which STFL stall penalty can be compensated.
 
+@item x86-vect-unroll-limit
+The vectorizer will check with target information to determine whether it
+would be beneficial to unroll the main vectorized loop and by how much.  This
+parameter sets the upper bound of how much the vectorizer will unroll the main
+loop.  The default value is four.
+
+@item x86-vect-unroll-min-ldst-threshold
+The vectorizer will check with target information to determine whether to
+unroll the main vectorized loop.  This parameter sets the minimum number of
+loads and stores in the main loop that triggers aggressive unrolling.
+
+@item x86-vect-unroll-max-loop-size
+The vectorizer will check with target information to determine whether to
+unroll the main vectorized loop.  This parameter limits the maximum size of the
+loop body after unrolling.  The default value is 200.
+
+
 @end table
 
 @end table
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
index 78c6600f83b..3bf1fb1b12d 100644
--- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */
+/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */
 /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
 /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
 /* { dg-final { scan-assembler-times "vpmaxsb"  1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
index 117179f2109..ba41fd64386 100644
--- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */
+/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */
 /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
 /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
 /* { dg-final { scan-assembler-times "vpmaxub"  1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
index 4aa536ba86c..fd2f054af4a 100644
--- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
+++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
@@ -2,7 +2,7 @@
 /* This is a test exercising peeling for alignment for a negative step
    vector loop.  We're forcing atom tuning here because that has a higher
    unaligned vs aligned cost unlike most other archs.  */
-/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
+/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
 
 float a[1024], b[1024];
 
diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
index 834bf0f770d..62c0db2bb9a 100644
--- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
+++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
@@ -2,7 +2,7 @@
 /* This is a test exercising peeling for alignment for a positive step
    vector loop.  We're forcing atom tuning here because that has a higher
    unaligned vs aligned cost unlike most other archs.  */
-/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
+/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
 
 float a[1024], b[1024];
 
diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
index 9ee9ba4e736..1ba4be01bea 100644
--- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
+++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
+/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */
 
 #define N 32
 int foo (int *a, int n)