From patchwork Mon Oct 24 02:46:04 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Li, Pan2 via Gcc-patches"
X-Patchwork-Id: 59335
To: gcc-patches@gcc.gnu.org
Subject: [PATCH] ix86: Suggest unroll factor for loop vectorization
Date: Mon, 24 Oct 2022 10:46:04 +0800
Message-Id: <20221024024604.18324-1-lili.cui@intel.com>
X-Mailer: git-send-email 2.17.1
List-Id: Gcc-patches mailing list
X-Patchwork-Original-From: "Cui,Lili via Gcc-patches"
From: "Li, Pan2 via Gcc-patches"
Reply-To: "Cui,Lili"
Cc: hongtao.liu@intel.com

Hi Hongtao,

This patch adds the functions finish_cost and
determine_suggested_unroll_factor to the x86 backend so that it can
suggest an unroll factor for a loop being vectorized.  The heuristic
follows the aarch64 and rs6000 backends and is based on an analysis of
SPEC2017 performance evaluation results.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.

OK for trunk?

With this patch, SPEC2017 performance evaluation results on
ICX/CLX/ADL/Znver3 are listed below:

For single copy:
  - ICX: 549.fotonik3d_r +6.2%, the others are neutral
  - CLX: 549.fotonik3d_r +1.9%, the others are neutral
  - ADL: 549.fotonik3d_r +4.5%, the others are neutral
  - Znver3: 549.fotonik3d_r +4.8%, the others are neutral

For multi-copy:
  - ADL: 549.fotonik3d_r +2.7%, the others are neutral

gcc/ChangeLog:

	* config/i386/i386.cc (class ix86_vector_costs): Add new members
	m_nstmts, m_nloads, m_nstores and determine_suggested_unroll_factor.
	(ix86_vector_costs::add_stmt_cost): Update for m_nstmts, m_nloads
	and m_nstores.
	(ix86_vector_costs::determine_suggested_unroll_factor): New function.
	(ix86_vector_costs::finish_cost): Ditto.
	* config/i386/i386.opt (x86-vect-unroll-limit): New parameter.
	(x86-vect-unroll-min-ldst-threshold): Likewise.
	(x86-vect-unroll-max-loop-size): Likewise.
	* doc/invoke.texi: Document new parameters.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
	* gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
	* gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
	* gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
	* gcc.target/i386/vect-reduc-1.c: Ditto.
---
 gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
 gcc/config/i386/i386.opt                      |  15 +++
 gcc/doc/invoke.texi                           |  17 +++
 .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
 .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
 .../i386/vect-alignment-peeling-1.c           |   2 +-
 .../i386/vect-alignment-peeling-2.c           |   2 +-
 gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
 8 files changed, 143 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index aeea26ef4be..a939354e55e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
                               stmt_vec_info stmt_info, slp_tree node,
                               tree vectype, int misalign,
                               vect_cost_model_location where) override;
+
+  unsigned int determine_suggested_unroll_factor (loop_vec_info);
+
+  void finish_cost (const vector_costs *) override;
+
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned m_nstmts = 0;
+  /* Total number of loads (loop only).  */
+  unsigned m_nloads = 0;
+  /* Total number of stores (loop only).  */
+  unsigned m_nstores = 0;
 };
 
 /* Implement targetm.vectorize.create_costs.  */
@@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
         retval = (retval * 17) / 10;
     }
 
+  if (!m_costing_for_scalar
+      && is_a<loop_vec_info> (m_vinfo)
+      && where == vect_body)
+    {
+      m_nstmts += count;
+      if (kind == scalar_load || kind == vector_load
+          || kind == unaligned_load || kind == vector_gather_load)
+        m_nloads += count;
+      else if (kind == scalar_store || kind == vector_store
+               || kind == unaligned_store || kind == vector_scatter_store)
+        m_nstores += count;
+    }
+
   m_costs[where] += retval;
 
   return retval;
@@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   return nunroll;
 }
 
+unsigned int
+ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Don't unroll if it's specified explicitly not to be unrolled.  */
+  if (loop->unroll == 1
+      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
+      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
+    return 1;
+
+  /* Don't unroll if there is no vectorized stmt.  */
+  if (m_nstmts == 0)
+    return 1;
+
+  /* Don't unroll if vector size is zmm, since zmm throughput is lower than other
+     sizes.  */
+  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
+    return 1;
+
+  /* Calc the total number of loads and stores in the loop body.  */
+  unsigned int nstmts_ldst = m_nloads + m_nstores;
+
+  /* Don't unroll if the loop body size is bigger than the threshold; the
+     threshold is a heuristic value inspired by param_max_unrolled_insns.  */
+  unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
+                    ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
+                    : 1;
+  uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
+  uf = 1 << ceil_log2 (uf);
+
+  /* Early return if we don't need to unroll.  */
+  if (uf == 1)
+    return 1;
+
+  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the loop
+     if the number of loads and stores exceeds the threshold; unrolling plus
+     software scheduling will reduce the cache miss rate.  */
+  if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
+    return uf;
+
+  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
+  unsigned int vf = vect_vf_for_cost (loop_vinfo);
+  unsigned int unrolled_vf = vf * uf;
+  if (est_niter == -1 || est_niter < unrolled_vf)
+    /* When the estimated iteration count of this loop is unknown, it's possible
+       that we are able to vectorize this loop with the original VF but fail
+       to vectorize it with the unrolled VF any more if the actual iteration
+       count is in between.  */
+    return 1;
+  else
+    {
+      unsigned int epil_niter_unr = est_niter % unrolled_vf;
+      unsigned int epil_niter = est_niter % vf;
+      /* Even if we have partial vector support, it can still be inefficient
+         to calculate the length when the iteration count is unknown, so
+         only expect it's good to unroll when the epilogue iteration count
+         is not bigger than VF (only one time length calculation).  */
+      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+          && epil_niter_unr <= vf)
+        return uf;
+      /* Without partial vector support, conservatively unroll this when
+         the epilogue iteration count is not bigger than the original one
+         (epilogue execution time wouldn't be longer than before).  */
+      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+               && epil_niter_unr <= epil_niter)
+        return uf;
+    }
+
+  return 1;
+}
+
+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+
+{
+  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
+    {
+      m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo);
+    }
+  vector_costs::finish_cost (scalar_costs);
+}
 
 /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.  */
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 53d534f6392..8e49b406aa5 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1224,3 +1224,18 @@ mavxvnniint8
 Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
 Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
 AVXVNNIINT8 built-in functions and code generation.
+
+-param=x86-vect-unroll-limit=
+Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param
+Used to limit unroll factor which indicates how much the autovectorizer may
+unroll a loop.  The default value is 4.
+
+-param=x86-vect-unroll-min-ldst-threshold=
+Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param
+Used to limit the minimum number of loads and stores in the main loop.  The
+default value is 25.
+
+-param=x86-vect-unroll-max-loop-size=
+Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param
+This threshold is used to limit the maximum size of loop body after unrolling.
+The default value is 200.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 09548c4528c..c86d686f2cd 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -15779,6 +15779,23 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
 @item x86-stlf-window-ninsns
 Instructions number above which STFL stall penalty can be compensated.
 
+@item x86-vect-unroll-limit
+The vectorizer will check with target information to determine whether it
+would be beneficial to unroll the main vectorized loop and by how much.  This
+parameter sets the upper bound of how much the vectorizer will unroll the main
+loop.  The default value is four.
+
+@item x86-vect-unroll-min-ldst-threshold
+The vectorizer will check with target information to determine whether to
+unroll it.  This parameter sets the minimum number of loads and stores in the
+main loop.
+
+@item x86-vect-unroll-max-loop-size
+The vectorizer will check with target information to determine whether to
+unroll it.  This threshold limits the maximum size of the loop body after
+unrolling.  The default value is 200.
+
+
 @end table
 
 @end table
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
index 78c6600f83b..3bf1fb1b12d 100644
--- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */
+/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */
 /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
 /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
 /* { dg-final { scan-assembler-times "vpmaxsb" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
index 117179f2109..ba41fd64386 100644
--- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */
+/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */
 /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
 /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
 /* { dg-final { scan-assembler-times "vpmaxub" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
index 4aa536ba86c..fd2f054af4a 100644
--- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
+++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
@@ -2,7 +2,7 @@
 /* This is a test exercising peeling for alignment for a negative step
    vector loop.  We're forcing atom tuning here because that has a higher
    unaligned vs aligned cost unlike most other archs.  */
-/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
+/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
 
 float a[1024], b[1024];
 
diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
index 834bf0f770d..62c0db2bb9a 100644
--- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
+++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
@@ -2,7 +2,7 @@
 /* This is a test exercising peeling for alignment for a positive step
    vector loop.  We're forcing atom tuning here because that has a higher
    unaligned vs aligned cost unlike most other archs.  */
-/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
+/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
 
 float a[1024], b[1024];
 
diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
index 9ee9ba4e736..1ba4be01bea 100644
--- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
+++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
+/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */
 
 #define N 32
 int foo (int *a, int n)
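
As a reading aid, the arithmetic in determine_suggested_unroll_factor can be
modeled outside GCC.  The sketch below is a simplified standalone C++14
version, not the patch's code.  Every function and constant name in it is
hypothetical, the three constants just mirror the default values of the new
parameters, and the explicit no-unroll checks and the 64-byte (zmm) vector
check are left out.  The vf == 0 guard is an extra precaution of the sketch.

```cpp
// Simplified model of the unroll-factor heuristic described in the patch.
// All names are hypothetical; this is not GCC code.

// Defaults mirror the three new parameters.
constexpr unsigned kUnrollLimit = 4;        // x86-vect-unroll-limit
constexpr unsigned kMinLdstThreshold = 25;  // x86-vect-unroll-min-ldst-threshold
constexpr unsigned kMaxLoopSize = 200;      // x86-vect-unroll-max-loop-size

// Smallest power of two >= x for x >= 1; stands in for 1 << ceil_log2 (x).
constexpr unsigned round_up_pow2(unsigned x) {
  unsigned p = 1;
  while (p < x) p <<= 1;
  return p;
}

// Candidate factor from the loop body size alone: cap by the size budget,
// then by the limit, then round up to a power of two.
constexpr unsigned candidate_unroll_factor(unsigned nstmts) {
  if (nstmts == 0) return 1;
  unsigned uf = nstmts < kMaxLoopSize ? kMaxLoopSize / nstmts : 1;
  if (uf > kUnrollLimit) uf = kUnrollLimit;
  return round_up_pow2(uf);
}

// Full decision.  est_niter < 0 means "iteration count unknown".
constexpr unsigned suggested_unroll_factor(unsigned nstmts, unsigned nloads,
                                           unsigned nstores,
                                           long long est_niter, unsigned vf,
                                           bool partial_vectors) {
  const unsigned uf = candidate_unroll_factor(nstmts);
  if (uf == 1) return 1;
  // Memory-heavy bodies are unrolled without looking at the trip count.
  if (nloads + nstores >= kMinLdstThreshold) return uf;
  if (vf == 0) return 1;  // guard added for the sketch only
  const long long unrolled_vf = static_cast<long long>(vf) * uf;
  if (est_niter < 0 || est_niter < unrolled_vf) return 1;
  const long long epil_unr = est_niter % unrolled_vf;  // epilogue after unrolling
  const long long epil = est_niter % vf;               // epilogue before unrolling
  const bool ok = partial_vectors ? epil_unr <= vf : epil_unr <= epil;
  return ok ? uf : 1;
}

// Worked values, checked at compile time.
static_assert(candidate_unroll_factor(30) == 4, "200/30 = 6, capped to 4");
static_assert(candidate_unroll_factor(66) == 4, "200/66 = 3, rounded up to 4");
static_assert(candidate_unroll_factor(70) == 2, "200/70 = 2");
static_assert(candidate_unroll_factor(150) == 1, "200/150 = 1, no unrolling");
static_assert(suggested_unroll_factor(30, 20, 6, -1, 8, false) == 4,
              "26 loads and stores reach the threshold of 25");
static_assert(suggested_unroll_factor(30, 5, 5, -1, 8, false) == 1,
              "unknown iteration count");
static_assert(suggested_unroll_factor(30, 5, 5, 1000, 8, false) == 1,
              "epilogue grows from 0 to 8 iterations");
static_assert(suggested_unroll_factor(30, 5, 5, 1000, 8, true) == 4,
              "epilogue of 8 fits in one partial vector");
```

One thing the model makes visible: because the factor is rounded up to a
power of two after the size cap is applied, a 66-statement body gets a factor
of 4, so the unrolled body (264 statements in this model) can exceed the
nominal size of 200.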