Message ID | fe87ee3d-c0e0-24bd-8c39-794095b390b5@arm.com |
---|---|
State | New |
Headers |
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 82A753858031 for <patchwork@sourceware.org>; Tue, 7 Dec 2021 11:27:53 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 82A753858031 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1638876473; bh=Fw0h0kXirtZfFBc1F8NWpKzFbsnhwkwlkm2pXt5IGFU=; h=Date:Subject:To:References:In-Reply-To:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=D8qKY3fboXtxhhFQtbHFmIxDHR0Wc0WZCPg8N2a81m0CJN28SUjVcHvQqwU5cuwSZ kmU4DPmbWxSjyXa+vLAjFuH/NuCIzmxvUCfWjUE3aFW0LYeQ482spxBPgF8vHJoMh1 BlooBzqXBY8HOxxY1OI9cL3AxDLwgsnqyDbCjO1E= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 7D5DA3858401 for <gcc-patches@gcc.gnu.org>; Tue, 7 Dec 2021 11:27:23 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 7D5DA3858401 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 0A87B11FB; Tue, 7 Dec 2021 03:27:23 -0800 (PST) Received: from [10.57.3.27] (unknown [10.57.3.27]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 6F6F03F73B; Tue, 7 Dec 2021 03:27:22 -0800 (PST) Content-Type: multipart/mixed; boundary="------------dqa7UYM6fnmmHKlsG1CvCH9i" Message-ID: <fe87ee3d-c0e0-24bd-8c39-794095b390b5@arm.com> Date: Tue, 7 Dec 2021 11:27:22 +0000 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.3.2 Subject: [vect] Re-analyze all modes for epilogues Content-Language: en-US To: Richard Biener <rguenther@suse.de> References: <4a2e6dde-cc5c-97fe-7a43-bd59d542c2ce@arm.com> <q2n8qs0-p9s-nq31-6sp-79n7o35730ss@fhfr.qr> <e01f9b48-918b-6379-8f0f-5ab8f5402b5a@arm.com> <p4442748-3rq6-4013-4546-51q6r553s756@fhfr.qr> <fb6e6a2f-646a-f338-6f55-4669e593e9c2@arm.com> <4272814n-8538-p793-157q-5n6q16r48n51@fhfr.qr> <623fbfd9-b97c-8c6e-0348-07d6c4496592@arm.com> <pp44o5pq-rp4o-rn56-1070-15ns3n93n4o2@fhfr.qr> <5c887c48-7f7e-c02b-2998-7a7c41b11af8@arm.com> <r938qonn-53n-68o5-o3o6-7n1s773p92nq@fhfr.qr> <mptmtn1xt0e.fsf@arm.com> <33cb143e-bb2e-e214-cd5f-66fd2d1bd20b@arm.com> <5op15ns-4sq8-2sn3-41qs-49q44417sp6@fhfr.qr> <b73e1c6f-0c8c-ecae-7244-7c62489db306@arm.com> <99qs2o2p-pn87-n164-q8n9-9p814r6n75r1@fhfr.qr> <475fae98-9541-5dca-2e60-eaff172ff787@arm.com> <8p72o15s-5894-4or0-409r-oo4p74o238r1@fhfr.qr> <21e3500d-6cf5-ed46-6f95-1f554c5dbc50@arm.com> <r8r6pnqr-nn92-pqpo-p0s4-n6no777ppn6q@fhfr.qr> <5477e0cb-6dc9-e828-7c20-a99de3c6840c@arm.com> <n3r4625s-p112-rq9p-q546-7q625rn081n@fhfr.qr> In-Reply-To: <n3r4625s-p112-rq9p-q546-7q625rn081n@fhfr.qr> X-Spam-Status: No, score=-11.5 required=5.0 tests=BAYES_00, BODY_8BITS, GIT_PATCH_0, KAM_DMARC_STATUS, KAM_SHORT, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> From: "Andre Vieira \(lists\) via Gcc-patches" <gcc-patches@gcc.gnu.org> Reply-To: "Andre Vieira \(lists\)" <andre.simoesdiasvieira@arm.com> Cc: richard.sandiford@arm.com, "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org> Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> |
Series |
[vect] Re-analyze all modes for epilogues
|
|
Commit Message
Andre Vieira (lists)
Dec. 7, 2021, 11:27 a.m. UTC
Hi, I've split this particular part off, since it's not only relevant to unrolling. The new test shows how this is useful for existing (non-unrolling) cases. I also had to fix the costing function, the main_vf / epilogue_vf calculations for old and new didn't take into consideration that the main_vf could be lower, nor did it take into consideration that they were not necessarily always a multiple of each other. So using CEIL here is the correct approach. Bootstrapped and regression tested on aarch64-none-linux-gnu. OK for trunk? gcc/ChangeLog: * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors up for epilogue costing. (vect_analyze_loop): Re-analyze all modes for epilogues. gcc/testsuite/ChangeLog: * gcc.target/aarch64/masked_epilogue.c: New test. On 30/11/2021 13:56, Richard Biener wrote: > On Tue, 30 Nov 2021, Andre Vieira (lists) wrote: > >> On 25/11/2021 12:46, Richard Biener wrote: >>> Oops, my fault, yes, it does. I would suggest to refactor things so >>> that the mode_i = first_loop_i case is there only once. I also wonder >>> if all the argument about starting at 0 doesn't apply to the >>> not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well? So >>> what's the reason to differ here? So in the end I'd just change >>> the existing >>> >>> if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) >>> { >>> >>> to >>> >>> if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo) >>> || first_loop_vinfo->suggested_unroll_factor > 1) >>> { >>> >>> and maybe revisit this when we have an actual testcase showing that >>> doing sth else has a positive effect? >>> >>> Thanks, >>> Richard. >> So I had a quick chat with Richard Sandiford and he is suggesting resetting >> mode_i to 0 for all cases. >> >> He pointed out that for some tunings the SVE mode might come after the NEON >> mode, which means that even for not-unrolled loop_vinfos we could end up with >> a suboptimal choice of mode for the epilogue. I.e. it could be that we pick >> V16QI for main vectorization, but that's VNx16QI + 1 in the array, so we'd not >> try VNx16QI for the epilogue. >> >> This would simplify the mode selecting cases, by just simply restarting at >> mode_i in all epilogue cases. Is that something you'd be OK? > Works for me with an updated comment. Even better with showing a > testcase exercising such tuning. > > Richard.
Comments
Hi, Rebased on top of the epilogue mode patch. OK for trunk? gcc/ChangeLog: * tree-vect-loop.c (vect_estimate_min_profitable_iters): Pass new argument suggested_unroll_factor. (vect_analyze_loop_costing): Likewise. (_loop_vec_info::_loop_vec_info): Initialize new member suggested_unroll_factor. (vect_determine_partial_vectors_and_peeling): Make epilogue of unrolled main loop use partial vectors. (vect_analyze_loop_2): Pass and use new argument suggested_unroll_factor. (vect_analyze_loop_1): Likewise. (vect_analyze_loop): Change to intialize local suggested_unroll_factor and use it. (vectorizable_reduction): Don't use single_defuse_cycle when unrolling. * tree-vectorizer.h (_loop_vec_info::_loop_vec_info): Add new member suggested_unroll_factor. (vector_costs::vector_costs): Add new member m_suggested_unroll_factor. (vector_costs::suggested_unroll_factor): New getter function. (finish_cost): Set return argument suggested_unroll_factor. Regards, Andre On 07/12/2021 11:27, Andre Vieira (lists) via Gcc-patches wrote: > Hi, > > I've split this particular part off, since it's not only relevant to > unrolling. The new test shows how this is useful for existing > (non-unrolling) cases. I also had to fix the costing function, the > main_vf / epilogue_vf calculations for old and new didn't take into > consideration that the main_vf could be lower, nor did it take into > consideration that they were not necessarily always a multiple of each > other. So using CEIL here is the correct approach. > > Bootstrapped and regression tested on aarch64-none-linux-gnu. > > OK for trunk? > > gcc/ChangeLog: > > * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors > up for epilogue costing. > (vect_analyze_loop): Re-analyze all modes for epilogues. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/masked_epilogue.c: New test. > > On 30/11/2021 13:56, Richard Biener wrote: >> On Tue, 30 Nov 2021, Andre Vieira (lists) wrote: >> >>> On 25/11/2021 12:46, Richard Biener wrote: >>>> Oops, my fault, yes, it does. I would suggest to refactor things so >>>> that the mode_i = first_loop_i case is there only once. I also wonder >>>> if all the argument about starting at 0 doesn't apply to the >>>> not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well? So >>>> what's the reason to differ here? So in the end I'd just change >>>> the existing >>>> >>>> if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) >>>> { >>>> >>>> to >>>> >>>> if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo) >>>> || first_loop_vinfo->suggested_unroll_factor > 1) >>>> { >>>> >>>> and maybe revisit this when we have an actual testcase showing that >>>> doing sth else has a positive effect? >>>> >>>> Thanks, >>>> Richard. >>> So I had a quick chat with Richard Sandiford and he is suggesting >>> resetting >>> mode_i to 0 for all cases. >>> >>> He pointed out that for some tunings the SVE mode might come after >>> the NEON >>> mode, which means that even for not-unrolled loop_vinfos we could >>> end up with >>> a suboptimal choice of mode for the epilogue. I.e. it could be that >>> we pick >>> V16QI for main vectorization, but that's VNx16QI + 1 in the array, >>> so we'd not >>> try VNx16QI for the epilogue. >>> >>> This would simplify the mode selecting cases, by just simply >>> restarting at >>> mode_i in all epilogue cases. Is that something you'd be OK? >> Works for me with an updated comment. Even better with showing a >> testcase exercising such tuning. >> >> Richard. diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 17b090170d4a5dad22097a727bc25a63e230e278..29cf14c83ac02e2f1372e56f7f96427dafcd4d11 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -153,7 +153,8 @@ along with GCC; see the file COPYING3. If not see http://gcc.gnu.org/projects/tree-ssa/vectorization.html */ -static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *); +static void vect_estimate_min_profitable_iters (loop_vec_info, int *, int *, + unsigned *); static stmt_vec_info vect_is_simple_reduction (loop_vec_info, stmt_vec_info, bool *, bool *); @@ -828,6 +829,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) skip_main_loop_edge (nullptr), skip_this_loop_edge (nullptr), reusable_accumulators (), + suggested_unroll_factor (1), max_vectorization_factor (0), mask_skip_niters (NULL_TREE), rgroup_compare_type (NULL_TREE), @@ -1811,7 +1813,8 @@ vect_known_niters_smaller_than_vf (loop_vec_info loop_vinfo) definitely no, or -1 if it's worth retrying. */ static int -vect_analyze_loop_costing (loop_vec_info loop_vinfo) +vect_analyze_loop_costing (loop_vec_info loop_vinfo, + unsigned *suggested_unroll_factor) { class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo); @@ -1845,7 +1848,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo) int min_profitable_iters, min_profitable_estimate; vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters, - &min_profitable_estimate); + &min_profitable_estimate, + suggested_unroll_factor); if (min_profitable_iters < 0) { @@ -2129,10 +2133,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo, vectors to the epilogue, with the main loop continuing to operate on full vectors. + If we are unrolling we also do not want to use partial vectors. This + is to avoid the overhead of generating multiple masks and also to + avoid having to execute entire iterations of FALSE masked instructions + when dealing with one or less full iterations. + ??? We could then end up failing to use partial vectors if we decide to peel iterations into a prologue, and if the main loop then ends up processing fewer than VF iterations. */ - if (param_vect_partial_vector_usage == 1 + if ((param_vect_partial_vector_usage == 1 + || loop_vinfo->suggested_unroll_factor > 1) && !LOOP_VINFO_EPILOGUE_P (loop_vinfo) && !vect_known_niters_smaller_than_vf (loop_vinfo)) LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true; @@ -2199,7 +2209,8 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo, for it. The different analyses will record information in the loop_vec_info struct. */ static opt_result -vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal) +vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, + unsigned *suggested_unroll_factor) { opt_result ok = opt_result::success (); int res; @@ -2359,6 +2370,12 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal) set of rgroups. */ gcc_assert (LOOP_VINFO_MASKS (loop_vinfo).is_empty ()); + /* Apply the suggested unrolling factor, this was determined by the backend + during finish_cost the first time we ran the analyzis for this + vector mode. */ + if (loop_vinfo->suggested_unroll_factor > 1) + LOOP_VINFO_VECT_FACTOR (loop_vinfo) *= loop_vinfo->suggested_unroll_factor; + /* This is the point where we can re-start analysis with SLP forced off. */ start_over: @@ -2550,7 +2567,7 @@ start_over: return ok; /* Check the costings of the loop make vectorizing worthwhile. */ - res = vect_analyze_loop_costing (loop_vinfo); + res = vect_analyze_loop_costing (loop_vinfo, suggested_unroll_factor); if (res < 0) { ok = opt_result::failure_at (vect_location, @@ -2953,7 +2970,7 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared, loop_vec_info main_loop_vinfo, const vector_modes &vector_modes, unsigned &mode_i, machine_mode &autodetected_vector_mode, - bool &fatal) + bool &fatal, unsigned int *suggested_unroll_factor = NULL) { loop_vec_info loop_vinfo = vect_create_loop_vinfo (loop, shared, loop_form_info, main_loop_vinfo); @@ -2961,8 +2978,18 @@ vect_analyze_loop_1 (class loop *loop, vec_info_shared *shared, machine_mode vector_mode = vector_modes[mode_i]; loop_vinfo->vector_mode = vector_mode; + /* Don't ask for a suggested unroll factor for an already unrolled loop + vinfo and reset the value for the next analysis. */ + if (suggested_unroll_factor && *suggested_unroll_factor > 1) + { + loop_vinfo->suggested_unroll_factor = *suggested_unroll_factor; + *suggested_unroll_factor = 1; + suggested_unroll_factor = NULL; + } + /* Run the main analysis. */ - opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal); + opt_result res = vect_analyze_loop_2 (loop_vinfo, fatal, + suggested_unroll_factor); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "***** Analysis %s with vector mode %s\n", @@ -3072,6 +3099,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); unsigned int mode_i = 0; unsigned HOST_WIDE_INT simdlen = loop->simdlen; + unsigned int suggested_unroll_factor = 1; /* First determine the main loop vectorization mode, either the first one that works, starting with auto-detecting the vector mode and then @@ -3083,7 +3111,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) opt_loop_vec_info loop_vinfo = vect_analyze_loop_1 (loop, shared, &loop_form_info, NULL, vector_modes, mode_i, - autodetected_vector_mode, fatal); + autodetected_vector_mode, fatal, + &suggested_unroll_factor); if (fatal) break; @@ -3107,7 +3136,18 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) first_loop_vinfo = opt_loop_vec_info::success (NULL); } if (first_loop_vinfo == NULL) - first_loop_vinfo = loop_vinfo; + { + first_loop_vinfo = loop_vinfo; + if (suggested_unroll_factor > 1) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Re-trying analysis for unrolling" + " with unroll factor %d.\n", + suggested_unroll_factor); + continue; + } + } else { delete loop_vinfo; @@ -3169,6 +3209,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) /* ??? If first_loop_vinfo was using VOIDmode then we probably want to instead search for the corresponding mode in vector_modes[]. */ + poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo); while (1) { bool fatal; @@ -3180,6 +3221,22 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) if (fatal) break; + if (loop_vinfo + && known_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), first_vinfo_vf) + && !(LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) + && maybe_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), + first_vinfo_vf))) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Will not use %s mode as an" + " epilogue, since it leads to an higher" + " vectorization factor than main loop\n", + GET_MODE_NAME (loop_vinfo->vector_mode)); + delete loop_vinfo; + loop_vinfo = opt_loop_vec_info::success (NULL); + } + if (loop_vinfo) { if (pick_lowest_cost_p) @@ -3892,7 +3949,8 @@ vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue, static void vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, int *ret_min_profitable_niters, - int *ret_min_profitable_estimate) + int *ret_min_profitable_estimate, + unsigned *suggested_unroll_factor) { int min_profitable_iters; int min_profitable_estimate; @@ -4252,8 +4310,23 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, } /* Complete the target-specific cost calculations. */ - finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), &vec_prologue_cost, - &vec_inside_cost, &vec_epilogue_cost); + finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), + &vec_prologue_cost, &vec_inside_cost, &vec_epilogue_cost, + suggested_unroll_factor); + + if (suggested_unroll_factor && *suggested_unroll_factor > 1 + && LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo) != MAX_VECTORIZATION_FACTOR + && !known_le (LOOP_VINFO_VECT_FACTOR (loop_vinfo) * + *suggested_unroll_factor, + LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo))) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't unroll as unrolled vectorization factor larger" + " than maximum vectorization factor: %d\n", + LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo)); + *suggested_unroll_factor = 1; + } vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost); @@ -7242,7 +7315,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo, participating. */ if (ncopies > 1 && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live) - && reduc_chain_length == 1) + && reduc_chain_length == 1 + && loop_vinfo->suggested_unroll_factor == 1) single_defuse_cycle = true; if (single_defuse_cycle || lane_reduc_code_p) diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index b552e9dccce5bce6a3bbcf5d531e7ccefa719b9a..238e0b4bf21871c518da73a79320f34e55c9201c 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -624,6 +624,13 @@ public: about the reductions that generated them. */ hash_map<tree, vect_reusable_accumulator> reusable_accumulators; + /* The number of times that the target suggested we unroll the vector loop + in order to promote more ILP. This value will be used to re-analyze the + loop for vectorization and if successful the value will be folded into + vectorization_factor (and therefore exactly divides + vectorization_factor). */ + unsigned int suggested_unroll_factor; + /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR if there is no particular limit. */ unsigned HOST_WIDE_INT max_vectorization_factor; @@ -1430,6 +1437,7 @@ public: unsigned int prologue_cost () const; unsigned int body_cost () const; unsigned int epilogue_cost () const; + unsigned int suggested_unroll_factor () const; protected: unsigned int record_stmt_cost (stmt_vec_info, vect_cost_model_location, @@ -1447,6 +1455,9 @@ protected: /* The costs of the three regions, indexed by vect_cost_model_location. */ unsigned int m_costs[3]; + /* The suggested unrolling factor determined at finish_cost. */ + unsigned int m_suggested_unroll_factor; + /* True if finish_cost has been called. */ bool m_finished; }; @@ -1459,6 +1470,7 @@ vector_costs::vector_costs (vec_info *vinfo, bool costing_for_scalar) : m_vinfo (vinfo), m_costing_for_scalar (costing_for_scalar), m_costs (), + m_suggested_unroll_factor(1), m_finished (false) { } @@ -1490,6 +1502,15 @@ vector_costs::epilogue_cost () const return m_costs[vect_epilogue]; } +/* Return the suggested unroll factor. */ + +inline unsigned int +vector_costs::suggested_unroll_factor () const +{ + gcc_checking_assert (m_finished); + return m_suggested_unroll_factor; +} + #define VECT_MAX_COST 1000 /* The maximum number of intermediate steps required in multi-step type @@ -1665,12 +1686,15 @@ add_stmt_cost (vector_costs *costs, stmt_info_for_cost *i) static inline void finish_cost (vector_costs *costs, unsigned *prologue_cost, - unsigned *body_cost, unsigned *epilogue_cost) + unsigned *body_cost, unsigned *epilogue_cost, + unsigned *suggested_unroll_factor = NULL) { costs->finish_cost (); *prologue_cost = costs->prologue_cost (); *body_cost = costs->body_cost (); *epilogue_cost = costs->epilogue_cost (); + if (suggested_unroll_factor) + *suggested_unroll_factor = costs->suggested_unroll_factor (); } inline void
On Tue, 7 Dec 2021, Andre Vieira (lists) wrote: > Hi, > > I've split this particular part off, since it's not only relevant to > unrolling. The new test shows how this is useful for existing (non-unrolling) > cases. I also had to fix the costing function, the main_vf / epilogue_vf > calculations for old and new didn't take into consideration that the main_vf > could be lower, nor did it take into consideration that they were not > necessarily always a multiple of each other. So using CEIL here is the > correct approach. > > Bootstrapped and regression tested on aarch64-none-linux-gnu. > > OK for trunk? + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Re-trying analysis with first vector mode" + " %s for epilogue.\n", + GET_MODE_NAME (vector_modes[mode_i])); that will now unconditionally dump VOIDmode which isn't really useful information. I suggest to dump "Re-trying analysis using mode autodetection for epilouge.\n". OK with that change. Can you check whether, give we know the main VF, the epilogue analysis does not start with am autodetected vector mode that needs a too large VF? Thanks, Richard. > gcc/ChangeLog: > > * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors up for > epilogue costing. > (vect_analyze_loop): Re-analyze all modes for epilogues. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/masked_epilogue.c: New test. > > On 30/11/2021 13:56, Richard Biener wrote: > > On Tue, 30 Nov 2021, Andre Vieira (lists) wrote: > > > >> On 25/11/2021 12:46, Richard Biener wrote: > >>> Oops, my fault, yes, it does. I would suggest to refactor things so > >>> that the mode_i = first_loop_i case is there only once. I also wonder > >>> if all the argument about starting at 0 doesn't apply to the > >>> not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well? So > >>> what's the reason to differ here? So in the end I'd just change > >>> the existing > >>> > >>> if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) > >>> { > >>> > >>> to > >>> > >>> if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo) > >>> || first_loop_vinfo->suggested_unroll_factor > 1) > >>> { > >>> > >>> and maybe revisit this when we have an actual testcase showing that > >>> doing sth else has a positive effect? > >>> > >>> Thanks, > >>> Richard. > >> So I had a quick chat with Richard Sandiford and he is suggesting resetting > >> mode_i to 0 for all cases. > >> > >> He pointed out that for some tunings the SVE mode might come after the NEON > >> mode, which means that even for not-unrolled loop_vinfos we could end up > >> with > >> a suboptimal choice of mode for the epilogue. I.e. it could be that we pick > >> V16QI for main vectorization, but that's VNx16QI + 1 in the array, so we'd > >> not > >> try VNx16QI for the epilogue. > >> > >> This would simplify the mode selecting cases, by just simply restarting at > >> mode_i in all epilogue cases. Is that something you'd be OK? > > Works for me with an updated comment. Even better with showing a > > testcase exercising such tuning. > > > > Richard. >
On Tue, 7 Dec 2021, Andre Vieira (lists) wrote: > Hi, > > Rebased on top of the epilogue mode patch. > > OK for trunk? @@ -7242,7 +7315,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo, participating. */ if (ncopies > 1 && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live) - && reduc_chain_length == 1) + && reduc_chain_length == 1 + && loop_vinfo->suggested_unroll_factor == 1) single_defuse_cycle = true; if (single_defuse_cycle || lane_reduc_code_p) It seems to me that 'ncopies' should include loop_vinfo->suggested_unroll_factor already so the check shouldn't be necessary. Otherwise looks OK to me. Thanks, Richard. > > gcc/ChangeLog: > > * tree-vect-loop.c (vect_estimate_min_profitable_iters): Pass new > argument > suggested_unroll_factor. > (vect_analyze_loop_costing): Likewise. > (_loop_vec_info::_loop_vec_info): Initialize new member > suggested_unroll_factor. > (vect_determine_partial_vectors_and_peeling): Make epilogue of > unrolled > main loop use partial vectors. > (vect_analyze_loop_2): Pass and use new argument > suggested_unroll_factor. > (vect_analyze_loop_1): Likewise. > (vect_analyze_loop): Change to intialize local > suggested_unroll_factor and use it. > (vectorizable_reduction): Don't use single_defuse_cycle when > unrolling. > * tree-vectorizer.h (_loop_vec_info::_loop_vec_info): Add new member > suggested_unroll_factor. > (vector_costs::vector_costs): Add new member > m_suggested_unroll_factor. > (vector_costs::suggested_unroll_factor): New getter function. > (finish_cost): Set return argument suggested_unroll_factor. > > > > Regards, > Andre > > On 07/12/2021 11:27, Andre Vieira (lists) via Gcc-patches wrote: > > Hi, > > > > I've split this particular part off, since it's not only relevant to > > unrolling. The new test shows how this is useful for existing > > (non-unrolling) cases. I also had to fix the costing function, the main_vf / > > epilogue_vf calculations for old and new didn't take into consideration that > > the main_vf could be lower, nor did it take into consideration that they > > were not necessarily always a multiple of each other. So using CEIL here is > > the correct approach. > > > > Bootstrapped and regression tested on aarch64-none-linux-gnu. > > > > OK for trunk? > > > > gcc/ChangeLog: > > > > * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors up for > > epilogue costing. > > (vect_analyze_loop): Re-analyze all modes for epilogues. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/masked_epilogue.c: New test. > > > > On 30/11/2021 13:56, Richard Biener wrote: > >> On Tue, 30 Nov 2021, Andre Vieira (lists) wrote: > >> > >>> On 25/11/2021 12:46, Richard Biener wrote: > >>>> Oops, my fault, yes, it does. I would suggest to refactor things so > >>>> that the mode_i = first_loop_i case is there only once. I also wonder > >>>> if all the argument about starting at 0 doesn't apply to the > >>>> not unrolled LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P as well? So > >>>> what's the reason to differ here? So in the end I'd just change > >>>> the existing > >>>> > >>>> if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) > >>>> { > >>>> > >>>> to > >>>> > >>>> if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo) > >>>> || first_loop_vinfo->suggested_unroll_factor > 1) > >>>> { > >>>> > >>>> and maybe revisit this when we have an actual testcase showing that > >>>> doing sth else has a positive effect? > >>>> > >>>> Thanks, > >>>> Richard. > >>> So I had a quick chat with Richard Sandiford and he is suggesting > >>> resetting > >>> mode_i to 0 for all cases. > >>> > >>> He pointed out that for some tunings the SVE mode might come after the > >>> NEON > >>> mode, which means that even for not-unrolled loop_vinfos we could end up > >>> with > >>> a suboptimal choice of mode for the epilogue. I.e. it could be that we > >>> pick > >>> V16QI for main vectorization, but that's VNx16QI + 1 in the array, so we'd > >>> not > >>> try VNx16QI for the epilogue. > >>> > >>> This would simplify the mode selecting cases, by just simply restarting at > >>> mode_i in all epilogue cases. Is that something you'd be OK? > >> Works for me with an updated comment. Even better with showing a > >> testcase exercising such tuning. > >> > >> Richard. >
Richard Biener <rguenther@suse.de> writes: > On Tue, 7 Dec 2021, Andre Vieira (lists) wrote: > >> Hi, >> >> Rebased on top of the epilogue mode patch. >> >> OK for trunk? > > @@ -7242,7 +7315,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > participating. */ > if (ncopies > 1 > && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live) > - && reduc_chain_length == 1) > + && reduc_chain_length == 1 > + && loop_vinfo->suggested_unroll_factor == 1) > single_defuse_cycle = true; > > if (single_defuse_cycle || lane_reduc_code_p) > > It seems to me that 'ncopies' should include > loop_vinfo->suggested_unroll_factor already so the check shouldn't be > necessary. Yeah, ncopies will be >1 for the unroll case. But the point is that for unrolling, we want each unrolled loop to have its own reduction accumulator, since reducing the aggregate loop-carried latency is one of the main benefits of unrolling a reduction. Thanks, Richard
On Tue, 7 Dec 2021, Richard Sandiford wrote: > Richard Biener <rguenther@suse.de> writes: > > On Tue, 7 Dec 2021, Andre Vieira (lists) wrote: > > > >> Hi, > >> > >> Rebased on top of the epilogue mode patch. > >> > >> OK for trunk? > > > > @@ -7242,7 +7315,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > > participating. */ > > if (ncopies > 1 > > && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live) > > - && reduc_chain_length == 1) > > + && reduc_chain_length == 1 > > + && loop_vinfo->suggested_unroll_factor == 1) > > single_defuse_cycle = true; > > > > if (single_defuse_cycle || lane_reduc_code_p) > > > > It seems to me that 'ncopies' should include > > loop_vinfo->suggested_unroll_factor already so the check shouldn't be > > necessary. > > Yeah, ncopies will be >1 for the unroll case. But the point is that > for unrolling, we want each unrolled loop to have its own reduction > accumulator, since reducing the aggregate loop-carried latency is one > of the main benefits of unrolling a reduction. Ah, I see. Please add a comment above this check then. Richard.
On 07/12/2021 11:45, Richard Biener wrote: > Can you check whether, give we know the main VF, the epilogue analysis > does not start with am autodetected vector mode that needs a too large VF? Hmm struggling to see how we could check this here. AFAIU before we analyze the loop for a given vector mode we won't know the VF? Are you saying that we could reject an autodetected mode which NUNITS > main VF for !LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P? An by reject I mean we'd start with mode_i = 1. Or did I misunderstand something here. FWIW this is just to prevent extra analysis right? If the epilogue's VF isn't appropriate it will be rejected later.
Hi, Added an extra step to skip unusable epilogue modes when we know the target does not support predication. This uses a new function 'support_predication_p' that is generated at build time and checks whether the target supports at least one optab that can be used for predicated code-generation. Bootstrapped and regression tested on aarch64-none-linux-gnu. OK for trunk? gcc/ChangeLog: * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors up for epilogue costing. (vect_analyze_loop): Re-analyze all modes for epilogues, unless we are guaranteed that no predication is possible. (genopinit.c) (support_predication_p): Generate new function. gcc/testsuite/ChangeLog: * gcc.target/aarch64/masked_epilogue.c: New test. diff --git a/gcc/genopinit.c b/gcc/genopinit.c index 195ddf74fa2b7d89760622073dcec9d5d339a097..e0958bc6c849911395341611a53b0fcb69565827 100644 --- a/gcc/genopinit.c +++ b/gcc/genopinit.c @@ -321,6 +321,7 @@ main (int argc, const char **argv) " bool supports_vec_scatter_store_cached;\n" "};\n" "extern void init_all_optabs (struct target_optabs *);\n" + "extern bool support_predication_p (void);\n" "\n" "extern struct target_optabs default_target_optabs;\n" "extern struct target_optabs *this_fn_optabs;\n" @@ -373,6 +374,33 @@ main (int argc, const char **argv) fprintf (s_file, " ena[%u] = HAVE_%s;\n", i, p->name); fprintf (s_file, "}\n\n"); + fprintf (s_file, + "/* Returns TRUE if the target supports any of the predication\n" + " specific optabs: while_ult_optab, len_load_optab or len_store_optab,\n" + " for any mode. */\n" + "bool\nsupport_predication_p (void)\n{\n"); + bool any_match = false; + fprintf (s_file, "\treturn"); + bool first = true; + for (i = 0; patterns.iterate (i, &p); ++i) + { +#define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N))) + if (CMP_NAME("while_ult") || CMP_NAME ("len_load") + || CMP_NAME ("len_store")) + { + if (first) + fprintf (s_file, " HAVE_%s", p->name); + else + fprintf (s_file, " || HAVE_%s", p->name); + first = false; + any_match = true; + } + } + if (!any_match) + fprintf (s_file, " false"); + fprintf (s_file, ";\n}\n"); + + /* Perform a binary search on a pre-encoded optab+mode*2. */ /* ??? Perhaps even better to generate a minimal perfect hash. Using gperf directly is awkward since it's so geared to working diff --git a/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c new file mode 100644 index 0000000000000000000000000000000000000000..286a7be236f337fee4c4650f42da72000855c5e6 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -march=armv8-a+sve -msve-vector-bits=scalable" } */ + +void f(unsigned char y[restrict], + unsigned char x[restrict], int n) { + for (int i = 0; i < n; ++i) + y[i] = (y[i] + x[i] + 1) >> 1; +} + +/* { dg-final { scan-tree-dump {LOOP EPILOGUE VECTORIZED \(MODE=VNx} "vect" } } */ diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index a28bb6321d76b8222bc8cfdade151ca9b4dca406..86e0cb47aef2919fdf7d87228f7f6a8378893e68 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -2824,11 +2824,13 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, { unsigned HOST_WIDE_INT main_vf_max = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX); + unsigned HOST_WIDE_INT old_vf_max + = estimated_poly_value (old_vf, POLY_VALUE_MAX); + unsigned HOST_WIDE_INT new_vf_max + = estimated_poly_value (new_vf, POLY_VALUE_MAX); - old_factor = main_vf_max / estimated_poly_value (old_vf, - POLY_VALUE_MAX); - new_factor = main_vf_max / estimated_poly_value (new_vf, - POLY_VALUE_MAX); + old_factor = CEIL (main_vf_max, old_vf_max); + new_factor = CEIL (main_vf_max, new_vf_max); /* If the loop is not using partial vectors then it will iterate one time less than one that does. It is safe to subtract one here, @@ -3069,8 +3071,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) machine_mode autodetected_vector_mode = VOIDmode; opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); unsigned int mode_i = 0; - unsigned int first_loop_i = 0; - unsigned int first_loop_next_i = 0; unsigned HOST_WIDE_INT simdlen = loop->simdlen; /* First determine the main loop vectorization mode, either the first @@ -3079,7 +3079,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) lowest cost if pick_lowest_cost_p. */ while (1) { - unsigned int loop_vinfo_i = mode_i; bool fatal; opt_loop_vec_info loop_vinfo = vect_analyze_loop_1 (loop, shared, &loop_form_info, @@ -3108,11 +3107,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) first_loop_vinfo = opt_loop_vec_info::success (NULL); } if (first_loop_vinfo == NULL) - { - first_loop_vinfo = loop_vinfo; - first_loop_i = loop_vinfo_i; - first_loop_next_i = mode_i; - } + first_loop_vinfo = loop_vinfo; else { delete loop_vinfo; @@ -3158,30 +3153,40 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) /* Now analyze first_loop_vinfo for epilogue vectorization. */ poly_uint64 lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); - /* Handle the case that the original loop can use partial - vectorization, but want to only adopt it for the epilogue. - The retry should be in the same mode as original. */ - if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) - { - gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (first_loop_vinfo) - && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (first_loop_vinfo)); - if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, - "***** Re-trying analysis with same vector mode" - " %s for epilogue with partial vectors.\n", - GET_MODE_NAME (first_loop_vinfo->vector_mode)); - mode_i = first_loop_i; - } - else - { - mode_i = first_loop_next_i; - if (mode_i == vector_modes.length ()) - return first_loop_vinfo; - } + /* For epilogues start the analysis from the first mode. The motivation + behind starting from the beginning comes from cases where the VECTOR_MODES + array may contain length agnostic and length fixed modes. Their ordering + is not guaranteed, so we could end up picking a mode for the main loop + that is after the epilogue's optimal mode. */ + mode_i = 0; + + /* If the target does not support predication we can shorten the number of + modes to analyze for the epilogue as we know we can't pick a mode that has + at least as many NUNITS as the main loop's vectorization factor, since + that would imply the epilogue's vectorization factor would be at least as + high as the main loop's. */ + poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo); + if (!support_predication_p ()) + { + machine_mode epil_mode = mode_i == 0 + ? autodetected_vector_mode + : vector_modes[mode_i]; + while (maybe_ge (GET_MODE_NUNITS (epil_mode), first_vinfo_vf) + && mode_i < (vector_modes.length() - 1)) + epil_mode = vector_modes[++mode_i]; + } + + if (dump_enabled_p () && mode_i != 0) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Re-trying analysis for epilogue using mode %s\n.", + GET_MODE_NAME (vector_modes[mode_i])); + else if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Re-trying analysis for epilogue using mode" + " autodetection vector mode.\n"); /* ??? If first_loop_vinfo was using VOIDmode then we probably want to instead search for the corresponding mode in vector_modes[]. */ - while (1) { bool fatal;
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes: > Hi, > > Added an extra step to skip unusable epilogue modes when we know the > target does not support predication. This uses a new function > 'support_predication_p' that is generated at build time and checks > whether the target supports at least one optab that can be used for > predicated code-generation. > > Bootstrapped and regression tested on aarch64-none-linux-gnu. > > OK for trunk? Looks good, but see the final comment below about whether we could simplify this a bit. > gcc/ChangeLog: > > * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors up > for epilogue costing. > (vect_analyze_loop): Re-analyze all modes for epilogues, unless > we are guaranteed that no > predication is possible. > (genopinit.c) (support_predication_p): Generate new function. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/masked_epilogue.c: New test. > > diff --git a/gcc/genopinit.c b/gcc/genopinit.c > index 195ddf74fa2b7d89760622073dcec9d5d339a097..e0958bc6c849911395341611a53b0fcb69565827 100644 > --- a/gcc/genopinit.c > +++ b/gcc/genopinit.c > @@ -321,6 +321,7 @@ main (int argc, const char **argv) > " bool supports_vec_scatter_store_cached;\n" > "};\n" > "extern void init_all_optabs (struct target_optabs *);\n" > + "extern bool support_predication_p (void);\n" len_load and len_store aren't really predication (or masking). So maybe: partial_vectors_supported_p ? > "\n" > "extern struct target_optabs default_target_optabs;\n" > "extern struct target_optabs *this_fn_optabs;\n" > @@ -373,6 +374,33 @@ main (int argc, const char **argv) > fprintf (s_file, " ena[%u] = HAVE_%s;\n", i, p->name); > fprintf (s_file, "}\n\n"); > > + fprintf (s_file, > + "/* Returns TRUE if the target supports any of the predication\n" > + " specific optabs: while_ult_optab, len_load_optab or len_store_optab,\n" > + " for any mode. */\n" Similarly here, s/predication specific optabs/partial vector optabs/. > + "bool\nsupport_predication_p (void)\n{\n"); > + bool any_match = false; > + fprintf (s_file, "\treturn"); > + bool first = true; > + for (i = 0; patterns.iterate (i, &p); ++i) > + { > +#define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N))) > + if (CMP_NAME("while_ult") || CMP_NAME ("len_load") > + || CMP_NAME ("len_store")) > + { > + if (first) > + fprintf (s_file, " HAVE_%s", p->name); > + else > + fprintf (s_file, " || HAVE_%s", p->name); > + first = false; > + any_match = true; > + } > + } > + if (!any_match) > + fprintf (s_file, " false"); > + fprintf (s_file, ";\n}\n"); > + > + > /* Perform a binary search on a pre-encoded optab+mode*2. */ > /* ??? Perhaps even better to generate a minimal perfect hash. > Using gperf directly is awkward since it's so geared to working > diff --git a/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c > new file mode 100644 > index 0000000000000000000000000000000000000000..286a7be236f337fee4c4650f42da72000855c5e6 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c > @@ -0,0 +1,10 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -march=armv8-a+sve -msve-vector-bits=scalable" } */ > + > +void f(unsigned char y[restrict], > + unsigned char x[restrict], int n) { > + for (int i = 0; i < n; ++i) > + y[i] = (y[i] + x[i] + 1) >> 1; > +} > + > +/* { dg-final { scan-tree-dump {LOOP EPILOGUE VECTORIZED \(MODE=VNx} "vect" } } */ > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c > index a28bb6321d76b8222bc8cfdade151ca9b4dca406..86e0cb47aef2919fdf7d87228f7f6a8378893e68 100644 > --- a/gcc/tree-vect-loop.c > +++ b/gcc/tree-vect-loop.c > @@ -2824,11 +2824,13 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, > { > unsigned HOST_WIDE_INT main_vf_max > = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX); > + unsigned HOST_WIDE_INT old_vf_max > + = estimated_poly_value (old_vf, POLY_VALUE_MAX); > + unsigned HOST_WIDE_INT new_vf_max > + = estimated_poly_value (new_vf, POLY_VALUE_MAX); > > - old_factor = main_vf_max / estimated_poly_value (old_vf, > - POLY_VALUE_MAX); > - new_factor = main_vf_max / estimated_poly_value (new_vf, > - POLY_VALUE_MAX); > + old_factor = CEIL (main_vf_max, old_vf_max); > + new_factor = CEIL (main_vf_max, new_vf_max); > > /* If the loop is not using partial vectors then it will iterate one > time less than one that does. It is safe to subtract one here, > @@ -3069,8 +3071,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > machine_mode autodetected_vector_mode = VOIDmode; > opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); > unsigned int mode_i = 0; > - unsigned int first_loop_i = 0; > - unsigned int first_loop_next_i = 0; > unsigned HOST_WIDE_INT simdlen = loop->simdlen; > > /* First determine the main loop vectorization mode, either the first > @@ -3079,7 +3079,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > lowest cost if pick_lowest_cost_p. */ > while (1) > { > - unsigned int loop_vinfo_i = mode_i; > bool fatal; > opt_loop_vec_info loop_vinfo > = vect_analyze_loop_1 (loop, shared, &loop_form_info, > @@ -3108,11 +3107,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > first_loop_vinfo = opt_loop_vec_info::success (NULL); > } > if (first_loop_vinfo == NULL) > - { > - first_loop_vinfo = loop_vinfo; > - first_loop_i = loop_vinfo_i; > - first_loop_next_i = mode_i; > - } > + first_loop_vinfo = loop_vinfo; > else > { > delete loop_vinfo; > @@ -3158,30 +3153,40 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > /* Now analyze first_loop_vinfo for epilogue vectorization. */ > poly_uint64 lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); > > - /* Handle the case that the original loop can use partial > - vectorization, but want to only adopt it for the epilogue. > - The retry should be in the same mode as original. */ > - if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) > - { > - gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (first_loop_vinfo) > - && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (first_loop_vinfo)); > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_NOTE, vect_location, > - "***** Re-trying analysis with same vector mode" > - " %s for epilogue with partial vectors.\n", > - GET_MODE_NAME (first_loop_vinfo->vector_mode)); > - mode_i = first_loop_i; > - } > - else > - { > - mode_i = first_loop_next_i; > - if (mode_i == vector_modes.length ()) > - return first_loop_vinfo; > - } > + /* For epilogues start the analysis from the first mode. The motivation > + behind starting from the beginning comes from cases where the VECTOR_MODES > + array may contain length agnostic and length fixed modes. Their ordering length-agnostic and length-specific > + is not guaranteed, so we could end up picking a mode for the main loop > + that is after the epilogue's optimal mode. */ That's true, and good enough on its own, but FTR: this could also happen for non-SVE cases. E.g. the user could use simdlen to force a particular loop to use long vectors even when the current target prefers shorter vectors over longer vectors. The loop could still benefit from using the higher-priority shorter vectors for the epilogue. No need to change the comment to say that. Just thought it was worth saying here for the record. > + mode_i = 0; I think this should be 1 if vector_modes.length () > 1. The documentation for autovectorize_vector_modes says: The first mode should be the @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} for its element mode. (although that probably amounts to a self-quote :-/). That should help with the change below. > + > + /* If the target does not support predication we can shorten the number of > + modes to analyze for the epilogue as we know we can't pick a mode that has > + at least as many NUNITS as the main loop's vectorization factor, since > + that would imply the epilogue's vectorization factor would be at least as > + high as the main loop's. */ > + poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo); > + if (!support_predication_p ()) > + { > + machine_mode epil_mode = mode_i == 0 > + ? autodetected_vector_mode > + : vector_modes[mode_i]; > + while (maybe_ge (GET_MODE_NUNITS (epil_mode), first_vinfo_vf) > + && mode_i < (vector_modes.length() - 1)) > + epil_mode = vector_modes[++mode_i]; > + } > + > + if (dump_enabled_p () && mode_i != 0) > + dump_printf_loc (MSG_NOTE, vect_location, > + "***** Re-trying analysis for epilogue using mode %s\n.", > + GET_MODE_NAME (vector_modes[mode_i])); > + else if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "***** Re-trying analysis for epilogue using mode" > + " autodetection vector mode.\n"); > /* ??? If first_loop_vinfo was using VOIDmode then we probably > want to instead search for the corresponding mode in vector_modes[]. */ > - Could we just stick a continue in the loop below instead? That seems like it could be simpler and more general, in that it would skip any mode that is too big, not just leading modes. It's probably worth computing support_predication_p () outside the loop though. I think this patch removes the need for the ???. We're doing the kind of skip that the ??? anticipated. Thanks, Richard > while (1) > { > bool fatal;
Made the suggested changes. Regarding the name change to partial vectors, I agree in the name change since that is the terminology we are using in the loop_vinfo members too, but is there an actual difference between predication/masking and partial vectors that I am missing? OK for trunk? gcc/ChangeLog: * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors up for epilogue costing. (vect_analyze_loop): Re-analyze all modes for epilogues, unless we are guaranteed that we can't have partial vectors. (genopinit.c) (partial_vectors_supported): Generate new function. gcc/testsuite/ChangeLog: * gcc.target/aarch64/masked_epilogue.c: New test. diff --git a/gcc/genopinit.c b/gcc/genopinit.c index 195ddf74fa2b7d89760622073dcec9d5d339a097..2bc7cdbf53337beae181afd7bb05b366ab068c6a 100644 --- a/gcc/genopinit.c +++ b/gcc/genopinit.c @@ -321,6 +321,7 @@ main (int argc, const char **argv) " bool supports_vec_scatter_store_cached;\n" "};\n" "extern void init_all_optabs (struct target_optabs *);\n" + "extern bool partial_vectors_supported_p (void);\n" "\n" "extern struct target_optabs default_target_optabs;\n" "extern struct target_optabs *this_fn_optabs;\n" @@ -373,6 +374,33 @@ main (int argc, const char **argv) fprintf (s_file, " ena[%u] = HAVE_%s;\n", i, p->name); fprintf (s_file, "}\n\n"); + fprintf (s_file, + "/* Returns TRUE if the target supports any of the partial vector\n" + " optabs: while_ult_optab, len_load_optab or len_store_optab,\n" + " for any mode. */\n" + "bool\npartial_vectors_supported_p (void)\n{\n"); + bool any_match = false; + fprintf (s_file, "\treturn"); + bool first = true; + for (i = 0; patterns.iterate (i, &p); ++i) + { +#define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N))) + if (CMP_NAME("while_ult") || CMP_NAME ("len_load") + || CMP_NAME ("len_store")) + { + if (first) + fprintf (s_file, " HAVE_%s", p->name); + else + fprintf (s_file, " || HAVE_%s", p->name); + first = false; + any_match = true; + } + } + if (!any_match) + fprintf (s_file, " false"); + fprintf (s_file, ";\n}\n"); + + /* Perform a binary search on a pre-encoded optab+mode*2. */ /* ??? Perhaps even better to generate a minimal perfect hash. Using gperf directly is awkward since it's so geared to working diff --git a/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c new file mode 100644 index 0000000000000000000000000000000000000000..286a7be236f337fee4c4650f42da72000855c5e6 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -march=armv8-a+sve -msve-vector-bits=scalable" } */ + +void f(unsigned char y[restrict], + unsigned char x[restrict], int n) { + for (int i = 0; i < n; ++i) + y[i] = (y[i] + x[i] + 1) >> 1; +} + +/* { dg-final { scan-tree-dump {LOOP EPILOGUE VECTORIZED \(MODE=VNx} "vect" } } */ diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index a28bb6321d76b8222bc8cfdade151ca9b4dca406..5af98a36678ae61e99f93beb90920e2d0940c53a 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -2824,11 +2824,13 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, { unsigned HOST_WIDE_INT main_vf_max = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX); + unsigned HOST_WIDE_INT old_vf_max + = estimated_poly_value (old_vf, POLY_VALUE_MAX); + unsigned HOST_WIDE_INT new_vf_max + = estimated_poly_value (new_vf, POLY_VALUE_MAX); - old_factor = main_vf_max / estimated_poly_value (old_vf, - POLY_VALUE_MAX); - new_factor = main_vf_max / estimated_poly_value (new_vf, - POLY_VALUE_MAX); + old_factor = CEIL (main_vf_max, old_vf_max); + new_factor = CEIL (main_vf_max, new_vf_max); /* If the loop is not using partial vectors then it will iterate one time less than one that does. It is safe to subtract one here, @@ -3069,8 +3071,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) machine_mode autodetected_vector_mode = VOIDmode; opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); unsigned int mode_i = 0; - unsigned int first_loop_i = 0; - unsigned int first_loop_next_i = 0; unsigned HOST_WIDE_INT simdlen = loop->simdlen; /* First determine the main loop vectorization mode, either the first @@ -3079,7 +3079,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) lowest cost if pick_lowest_cost_p. */ while (1) { - unsigned int loop_vinfo_i = mode_i; bool fatal; opt_loop_vec_info loop_vinfo = vect_analyze_loop_1 (loop, shared, &loop_form_info, @@ -3108,11 +3107,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) first_loop_vinfo = opt_loop_vec_info::success (NULL); } if (first_loop_vinfo == NULL) - { - first_loop_vinfo = loop_vinfo; - first_loop_i = loop_vinfo_i; - first_loop_next_i = mode_i; - } + first_loop_vinfo = loop_vinfo; else { delete loop_vinfo; @@ -3158,32 +3153,37 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) /* Now analyze first_loop_vinfo for epilogue vectorization. */ poly_uint64 lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); - /* Handle the case that the original loop can use partial - vectorization, but want to only adopt it for the epilogue. - The retry should be in the same mode as original. */ - if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) + /* For epilogues start the analysis from the first mode. The motivation + behind starting from the beginning comes from cases where the VECTOR_MODES + array may contain length-agnostic and length-specific modes. Their + ordering is not guaranteed, so we could end up picking a mode for the main + loop that is after the epilogue's optimal mode. */ + mode_i = 1; + bool supports_partial_vectors = partial_vectors_supported_p (); + poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo); + + while (1) { - gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (first_loop_vinfo) - && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (first_loop_vinfo)); + /* If the target does not support partial vectors we can shorten the + number of modes to analyze for the epilogue as we know we can't pick a + mode that has at least as many NUNITS as the main loop's vectorization + factor, since that would imply the epilogue's vectorization factor + would be at least as high as the main loop's and we would be + vectorizing for more scalar iterations than there would be left. */ + if (!supports_partial_vectors + && maybe_ge (GET_MODE_NUNITS (vector_modes[mode_i]), first_vinfo_vf)) + { + mode_i++; + if (mode_i == vector_modes.length ()) + break; + continue; + } + if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, - "***** Re-trying analysis with same vector mode" - " %s for epilogue with partial vectors.\n", - GET_MODE_NAME (first_loop_vinfo->vector_mode)); - mode_i = first_loop_i; - } - else - { - mode_i = first_loop_next_i; - if (mode_i == vector_modes.length ()) - return first_loop_vinfo; - } - - /* ??? If first_loop_vinfo was using VOIDmode then we probably - want to instead search for the corresponding mode in vector_modes[]. */ + "***** Re-trying epilogue analysis with vector " + "mode %s\n", GET_MODE_NAME (vector_modes[mode_i])); - while (1) - { bool fatal; opt_loop_vec_info loop_vinfo = vect_analyze_loop_1 (loop, shared, &loop_form_info, @@ -3235,11 +3235,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) if (mode_i == vector_modes.length ()) break; - /* Try the next biggest vector size. */ - if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, - "***** Re-trying epilogue analysis with vector " - "mode %s\n", GET_MODE_NAME (vector_modes[mode_i])); } if (!first_loop_vinfo->epilogue_vinfos.is_empty ())
"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes: > Made the suggested changes. > > Regarding the name change to partial vectors, I agree in the name change > since that is the terminology we are using in the loop_vinfo members > too, but is there an actual difference between predication/masking and > partial vectors that I am missing? “Predication/masking” refers to the ability to enable and disable operations on a lane-by-lane basis. E.g. it means that patterns like 10100010 are possible. “Operating on partial vectors” is the ability to operate on just the first N lanes of a vector, for some given N. It means that patterns like 11111100 are possible but patterns like 10100010 might not be. At the moment, “operating on partial vectors” also requires either direct support for loading and storing the first N lanes, or a way of generating a loop predicate/mask from N and using predication/masking. So the two concepts overlap, but support for one doesn't directly imply support for the other. > OK for trunk? > > gcc/ChangeLog: > > * tree-vect-loop.c (vect_better_loop_vinfo_p): Round factors up > for epilogue costing. > (vect_analyze_loop): Re-analyze all modes for epilogues, unless > we are guaranteed that we can't > have partial vectors. > (genopinit.c) (partial_vectors_supported): Generate new function. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/masked_epilogue.c: New test. OK, thanks. Richard > diff --git a/gcc/genopinit.c b/gcc/genopinit.c > index 195ddf74fa2b7d89760622073dcec9d5d339a097..2bc7cdbf53337beae181afd7bb05b366ab068c6a 100644 > --- a/gcc/genopinit.c > +++ b/gcc/genopinit.c > @@ -321,6 +321,7 @@ main (int argc, const char **argv) > " bool supports_vec_scatter_store_cached;\n" > "};\n" > "extern void init_all_optabs (struct target_optabs *);\n" > + "extern bool partial_vectors_supported_p (void);\n" > "\n" > "extern struct target_optabs default_target_optabs;\n" > "extern struct target_optabs *this_fn_optabs;\n" > @@ -373,6 +374,33 @@ main (int argc, const char **argv) > fprintf (s_file, " ena[%u] = HAVE_%s;\n", i, p->name); > fprintf (s_file, "}\n\n"); > > + fprintf (s_file, > + "/* Returns TRUE if the target supports any of the partial vector\n" > + " optabs: while_ult_optab, len_load_optab or len_store_optab,\n" > + " for any mode. */\n" > + "bool\npartial_vectors_supported_p (void)\n{\n"); > + bool any_match = false; > + fprintf (s_file, "\treturn"); > + bool first = true; > + for (i = 0; patterns.iterate (i, &p); ++i) > + { > +#define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N))) > + if (CMP_NAME("while_ult") || CMP_NAME ("len_load") > + || CMP_NAME ("len_store")) > + { > + if (first) > + fprintf (s_file, " HAVE_%s", p->name); > + else > + fprintf (s_file, " || HAVE_%s", p->name); > + first = false; > + any_match = true; > + } > + } > + if (!any_match) > + fprintf (s_file, " false"); > + fprintf (s_file, ";\n}\n"); > + > + > /* Perform a binary search on a pre-encoded optab+mode*2. */ > /* ??? Perhaps even better to generate a minimal perfect hash. > Using gperf directly is awkward since it's so geared to working > diff --git a/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c > new file mode 100644 > index 0000000000000000000000000000000000000000..286a7be236f337fee4c4650f42da72000855c5e6 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c > @@ -0,0 +1,10 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -march=armv8-a+sve -msve-vector-bits=scalable" } */ > + > +void f(unsigned char y[restrict], > + unsigned char x[restrict], int n) { > + for (int i = 0; i < n; ++i) > + y[i] = (y[i] + x[i] + 1) >> 1; > +} > + > +/* { dg-final { scan-tree-dump {LOOP EPILOGUE VECTORIZED \(MODE=VNx} "vect" } } */ > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c > index a28bb6321d76b8222bc8cfdade151ca9b4dca406..5af98a36678ae61e99f93beb90920e2d0940c53a 100644 > --- a/gcc/tree-vect-loop.c > +++ b/gcc/tree-vect-loop.c > @@ -2824,11 +2824,13 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, > { > unsigned HOST_WIDE_INT main_vf_max > = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX); > + unsigned HOST_WIDE_INT old_vf_max > + = estimated_poly_value (old_vf, POLY_VALUE_MAX); > + unsigned HOST_WIDE_INT new_vf_max > + = estimated_poly_value (new_vf, POLY_VALUE_MAX); > > - old_factor = main_vf_max / estimated_poly_value (old_vf, > - POLY_VALUE_MAX); > - new_factor = main_vf_max / estimated_poly_value (new_vf, > - POLY_VALUE_MAX); > + old_factor = CEIL (main_vf_max, old_vf_max); > + new_factor = CEIL (main_vf_max, new_vf_max); > > /* If the loop is not using partial vectors then it will iterate one > time less than one that does. It is safe to subtract one here, > @@ -3069,8 +3071,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > machine_mode autodetected_vector_mode = VOIDmode; > opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); > unsigned int mode_i = 0; > - unsigned int first_loop_i = 0; > - unsigned int first_loop_next_i = 0; > unsigned HOST_WIDE_INT simdlen = loop->simdlen; > > /* First determine the main loop vectorization mode, either the first > @@ -3079,7 +3079,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > lowest cost if pick_lowest_cost_p. */ > while (1) > { > - unsigned int loop_vinfo_i = mode_i; > bool fatal; > opt_loop_vec_info loop_vinfo > = vect_analyze_loop_1 (loop, shared, &loop_form_info, > @@ -3108,11 +3107,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > first_loop_vinfo = opt_loop_vec_info::success (NULL); > } > if (first_loop_vinfo == NULL) > - { > - first_loop_vinfo = loop_vinfo; > - first_loop_i = loop_vinfo_i; > - first_loop_next_i = mode_i; > - } > + first_loop_vinfo = loop_vinfo; > else > { > delete loop_vinfo; > @@ -3158,32 +3153,37 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > /* Now analyze first_loop_vinfo for epilogue vectorization. */ > poly_uint64 lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); > > - /* Handle the case that the original loop can use partial > - vectorization, but want to only adopt it for the epilogue. > - The retry should be in the same mode as original. */ > - if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) > + /* For epilogues start the analysis from the first mode. The motivation > + behind starting from the beginning comes from cases where the VECTOR_MODES > + array may contain length-agnostic and length-specific modes. Their > + ordering is not guaranteed, so we could end up picking a mode for the main > + loop that is after the epilogue's optimal mode. */ > + mode_i = 1; > + bool supports_partial_vectors = partial_vectors_supported_p (); > + poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo); > + > + while (1) > { > - gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (first_loop_vinfo) > - && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (first_loop_vinfo)); > + /* If the target does not support partial vectors we can shorten the > + number of modes to analyze for the epilogue as we know we can't pick a > + mode that has at least as many NUNITS as the main loop's vectorization > + factor, since that would imply the epilogue's vectorization factor > + would be at least as high as the main loop's and we would be > + vectorizing for more scalar iterations than there would be left. */ > + if (!supports_partial_vectors > + && maybe_ge (GET_MODE_NUNITS (vector_modes[mode_i]), first_vinfo_vf)) > + { > + mode_i++; > + if (mode_i == vector_modes.length ()) > + break; > + continue; > + } > + > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > - "***** Re-trying analysis with same vector mode" > - " %s for epilogue with partial vectors.\n", > - GET_MODE_NAME (first_loop_vinfo->vector_mode)); > - mode_i = first_loop_i; > - } > - else > - { > - mode_i = first_loop_next_i; > - if (mode_i == vector_modes.length ()) > - return first_loop_vinfo; > - } > - > - /* ??? If first_loop_vinfo was using VOIDmode then we probably > - want to instead search for the corresponding mode in vector_modes[]. */ > + "***** Re-trying epilogue analysis with vector " > + "mode %s\n", GET_MODE_NAME (vector_modes[mode_i])); > > - while (1) > - { > bool fatal; > opt_loop_vec_info loop_vinfo > = vect_analyze_loop_1 (loop, shared, &loop_form_info, > @@ -3235,11 +3235,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) > if (mode_i == vector_modes.length ()) > break; > > - /* Try the next biggest vector size. */ > - if (dump_enabled_p ()) > - dump_printf_loc (MSG_NOTE, vect_location, > - "***** Re-trying epilogue analysis with vector " > - "mode %s\n", GET_MODE_NAME (vector_modes[mode_i])); > } > > if (!first_loop_vinfo->epilogue_vinfos.is_empty ())
diff --git a/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c new file mode 100644 index 0000000000000000000000000000000000000000..286a7be236f337fee4c4650f42da72000855c5e6 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/masked_epilogue.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -march=armv8-a+sve -msve-vector-bits=scalable" } */ + +void f(unsigned char y[restrict], + unsigned char x[restrict], int n) { + for (int i = 0; i < n; ++i) + y[i] = (y[i] + x[i] + 1) >> 1; +} + +/* { dg-final { scan-tree-dump {LOOP EPILOGUE VECTORIZED \(MODE=VNx} "vect" } } */ diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index a28bb6321d76b8222bc8cfdade151ca9b4dca406..17b090170d4a5dad22097a727bc25a63e230e278 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -2824,11 +2824,13 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo, { unsigned HOST_WIDE_INT main_vf_max = estimated_poly_value (main_poly_vf, POLY_VALUE_MAX); + unsigned HOST_WIDE_INT old_vf_max + = estimated_poly_value (old_vf, POLY_VALUE_MAX); + unsigned HOST_WIDE_INT new_vf_max + = estimated_poly_value (new_vf, POLY_VALUE_MAX); - old_factor = main_vf_max / estimated_poly_value (old_vf, - POLY_VALUE_MAX); - new_factor = main_vf_max / estimated_poly_value (new_vf, - POLY_VALUE_MAX); + old_factor = CEIL (main_vf_max, old_vf_max); + new_factor = CEIL (main_vf_max, new_vf_max); /* If the loop is not using partial vectors then it will iterate one time less than one that does. It is safe to subtract one here, @@ -3069,8 +3071,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) machine_mode autodetected_vector_mode = VOIDmode; opt_loop_vec_info first_loop_vinfo = opt_loop_vec_info::success (NULL); unsigned int mode_i = 0; - unsigned int first_loop_i = 0; - unsigned int first_loop_next_i = 0; unsigned HOST_WIDE_INT simdlen = loop->simdlen; /* First determine the main loop vectorization mode, either the first @@ -3079,7 +3079,6 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) lowest cost if pick_lowest_cost_p. */ while (1) { - unsigned int loop_vinfo_i = mode_i; bool fatal; opt_loop_vec_info loop_vinfo = vect_analyze_loop_1 (loop, shared, &loop_form_info, @@ -3108,11 +3107,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) first_loop_vinfo = opt_loop_vec_info::success (NULL); } if (first_loop_vinfo == NULL) - { - first_loop_vinfo = loop_vinfo; - first_loop_i = loop_vinfo_i; - first_loop_next_i = mode_i; - } + first_loop_vinfo = loop_vinfo; else { delete loop_vinfo; @@ -3158,26 +3153,18 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) /* Now analyze first_loop_vinfo for epilogue vectorization. */ poly_uint64 lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo); - /* Handle the case that the original loop can use partial - vectorization, but want to only adopt it for the epilogue. - The retry should be in the same mode as original. */ - if (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (first_loop_vinfo)) - { - gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (first_loop_vinfo) - && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (first_loop_vinfo)); - if (dump_enabled_p ()) - dump_printf_loc (MSG_NOTE, vect_location, - "***** Re-trying analysis with same vector mode" - " %s for epilogue with partial vectors.\n", - GET_MODE_NAME (first_loop_vinfo->vector_mode)); - mode_i = first_loop_i; - } - else - { - mode_i = first_loop_next_i; - if (mode_i == vector_modes.length ()) - return first_loop_vinfo; - } + /* For epilogues start the analysis from the first mode. The motivation + behind starting from the beginning comes from cases where the VECTOR_MODES + array may contain length agnostic and length fixed modes. Their ordering + is not guaranteed, so we could end up picking a mode for the main loop + that is after the epilogue's optimal mode. */ + mode_i = 0; + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Re-trying analysis with first vector mode" + " %s for epilogue.\n", + GET_MODE_NAME (vector_modes[mode_i])); /* ??? If first_loop_vinfo was using VOIDmode then we probably want to instead search for the corresponding mode in vector_modes[]. */