From patchwork Tue Oct 19 01:47:49 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Xionghu Luo X-Patchwork-Id: 46368 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 556EA3858439 for ; Tue, 19 Oct 2021 01:48:33 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 556EA3858439 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1634608113; bh=wYCHw1gnKNw/Wqs4gX6LmdRe/PecfhfztDkMS6RNXq0=; h=Date:Subject:To:References:In-Reply-To:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=IawPLrucIOPWPRRUXCjVM9WdxnjWh5ne57nQ/jrzME6IEaQ33wfdXUi+Oq5uy8xWn hwUh9aA4rApqzBkvjiUx1uMA8NTh8OmutngFZvrq64Ia/Gj1oPggaPrbeZklfWBcjq RZrzovGzsIUnEURDw8oQyQf1aASHKiWbi0S4d1PE= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by sourceware.org (Postfix) with ESMTPS id 78B943858C27; Tue, 19 Oct 2021 01:48:00 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 78B943858C27 Received: from pps.filterd (m0098420.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.1.2/8.16.1.2) with SMTP id 19J0pXjg019747; Mon, 18 Oct 2021 21:48:00 -0400 Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com with ESMTP id 3bskuk8py6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 18 Oct 2021 21:47:59 -0400 Received: from m0098420.ppops.net (m0098420.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 19J1TSHB006630; Mon, 18 Oct 2021 21:47:59 -0400 Received: from ppma06ams.nl.ibm.com (66.31.33a9.ip4.static.sl-reverse.com [169.51.49.102]) by mx0b-001b2d01.pphosted.com with ESMTP id 3bskuk8pxw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 18 Oct 2021 21:47:58 -0400 Received: from pps.filterd (ppma06ams.nl.ibm.com [127.0.0.1]) by ppma06ams.nl.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 19J1fVqp014970; Tue, 19 Oct 2021 01:47:57 GMT Received: from b06cxnps4075.portsmouth.uk.ibm.com (d06relay12.portsmouth.uk.ibm.com [9.149.109.197]) by ppma06ams.nl.ibm.com with ESMTP id 3bqp0jm3b1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Tue, 19 Oct 2021 01:47:57 +0000 Received: from d06av24.portsmouth.uk.ibm.com (mk.ibm.com [9.149.105.60]) by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 19J1lsk461407574 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 19 Oct 2021 01:47:54 GMT Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A21684204B; Tue, 19 Oct 2021 01:47:54 +0000 (GMT) Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C2C2A4203F; Tue, 19 Oct 2021 01:47:51 +0000 (GMT) Received: from [9.200.59.68] (unknown [9.200.59.68]) by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTPS; Tue, 19 Oct 2021 01:47:51 +0000 (GMT) Message-ID: <549e1f54-3163-d01a-6c17-0eda4b5e59cb@linux.ibm.com> Date: Tue, 19 Oct 2021 09:47:49 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.2.0 Subject: [PATCH v5 2/2] Don't move cold code out of loop by checking bb count Content-Language: en-US To: Richard Biener References: <20210802050501.159058-1-luoxhu@linux.ibm.com> <53b7c729-33c0-138f-fa06-d6efb7a43911@linux.ibm.com> <0b675ba1-cdab-652a-0bba-704b0090f1c5@linux.ibm.com> In-Reply-To: <0b675ba1-cdab-652a-0bba-704b0090f1c5@linux.ibm.com> X-TM-AS-GCONF: 00 X-Proofpoint-GUID: eAMB_0NQIVIWox5TJUDA-4pVSPldtsGT X-Proofpoint-ORIG-GUID: k68bn8yqflHP-M5EiO_EnHyjUwG3Crdq X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.182.1,Aquarius:18.0.790,Hydra:6.0.425,FMLib:17.0.607.475 definitions=2021-10-18_07,2021-10-18_01,2020-04-07_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 impostorscore=0 phishscore=0 mlxlogscore=999 mlxscore=0 clxscore=1011 malwarescore=0 lowpriorityscore=0 priorityscore=1501 adultscore=0 bulkscore=0 suspectscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2109230001 definitions=main-2110190007 X-Spam-Status: No, score=-12.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Xionghu Luo via Gcc-patches From: Xionghu Luo Reply-To: Xionghu Luo Cc: Segher Boessenkool , GCC Patches , linkw@gcc.gnu.org, Bill Schmidt , Jan Hubicka , David Edelsohn Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" On 2021/10/18 12:29, Xionghu Luo via Gcc-patches wrote: > > > On 2021/10/15 16:11, Richard Biener wrote: >> On Sat, Oct 9, 2021 at 5:45 AM Xionghu Luo wrote: >>> >>> Hi, >>> >>> On 2021/9/28 20:09, Richard Biener wrote: >>>> On Fri, Sep 24, 2021 at 8:29 AM Xionghu Luo wrote: >>>>> >>>>> Update the patch to v3, not sure whether you prefer the paste style >>>>> and continue to link the previous thread as Segher dislikes this... >>>>> >>>>> >>>>> [PATCH v3] Don't move cold code out of loop by checking bb count >>>>> >>>>> >>>>> Changes: >>>>> 1. Handle max_loop in determine_max_movement instead of >>>>> outermost_invariant_loop. >>>>> 2. Remove unnecessary changes. >>>>> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. >>>>> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused >>>>> infinite loop when implementing v1 and the iteration is missed to be >>>>> updated actually. >>>>> >>>>> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html >>>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html >>>>> >>>>> There was a patch trying to avoid move cold block out of loop: >>>>> >>>>> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html >>>>> >>>>> Richard suggested to "never hoist anything from a bb with lower execution >>>>> frequency to a bb with higher one in LIM invariantness_dom_walker >>>>> before_dom_children". >>>>> >>>>> In gimple LIM analysis, add find_coldest_out_loop to move invariants to >>>>> expected target loop, if profile count of the loop bb is colder >>>>> than target loop preheader, it won't be hoisted out of loop. >>>>> Likely for store motion, if all locations of the REF in loop is cold, >>>>> don't do store motion of it. >>>>> >>>>> SPEC2017 performance evaluation shows 1% performance improvement for >>>>> intrate GEOMEAN and no obvious regression for others. Especially, >>>>> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is >>>>> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% >>>>> on P8LE. >>>>> >>>>> gcc/ChangeLog: >>>>> >>>>> * loop-invariant.c (find_invariants_bb): Check profile count >>>>> before motion. >>>>> (find_invariants_body): Add argument. >>>>> * tree-ssa-loop-im.c (find_coldest_out_loop): New function. >>>>> (determine_max_movement): Use find_coldest_out_loop. >>>>> (move_computations_worker): Adjust and fix iteration udpate. >>>>> (execute_sm_exit): Check pointer validness. >>>>> (class ref_in_loop_hot_body): New functor. >>>>> (ref_in_loop_hot_body::operator): New. >>>>> (can_sm_ref_p): Use for_all_locs_in_loop. >>>>> >>>>> gcc/testsuite/ChangeLog: >>>>> >>>>> * gcc.dg/tree-ssa/recip-3.c: Adjust. >>>>> * gcc.dg/tree-ssa/ssa-lim-18.c: New test. >>>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. >>>>> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. >>>>> --- >>>>> gcc/loop-invariant.c | 10 ++-- >>>>> gcc/tree-ssa-loop-im.c | 61 ++++++++++++++++++++-- >>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- >>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 +++++++ >>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 27 ++++++++++ >>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++++ >>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 28 ++++++++++ >>>>> 7 files changed, 165 insertions(+), 8 deletions(-) >>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c >>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c >>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c >>>>> >>>>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c >>>>> index fca0c2b24be..5c3be7bf0eb 100644 >>>>> --- a/gcc/loop-invariant.c >>>>> +++ b/gcc/loop-invariant.c >>>>> @@ -1183,9 +1183,14 @@ find_invariants_insn (rtx_insn *insn, bool always_reached, bool always_executed) >>>>> call. */ >>>>> >>>>> static void >>>>> -find_invariants_bb (basic_block bb, bool always_reached, bool always_executed) >>>>> +find_invariants_bb (class loop *loop, basic_block bb, bool always_reached, >>>>> + bool always_executed) >>>>> { >>>>> rtx_insn *insn; >>>>> + basic_block preheader = loop_preheader_edge (loop)->src; >>>>> + >>>>> + if (preheader->count > bb->count) >>>>> + return; >>>>> >>>>> FOR_BB_INSNS (bb, insn) >>>>> { >>>>> @@ -1214,8 +1219,7 @@ find_invariants_body (class loop *loop, basic_block *body, >>>>> unsigned i; >>>>> >>>>> for (i = 0; i < loop->num_nodes; i++) >>>>> - find_invariants_bb (body[i], >>>>> - bitmap_bit_p (always_reached, i), >>>>> + find_invariants_bb (loop, body[i], bitmap_bit_p (always_reached, i), >>>>> bitmap_bit_p (always_executed, i)); >>>>> } >>>>> >>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>>>> index 4b187c2cdaf..655fab03442 100644 >>>>> --- a/gcc/tree-ssa-loop-im.c >>>>> +++ b/gcc/tree-ssa-loop-im.c >>>>> @@ -417,6 +417,28 @@ movement_possibility (gimple *stmt) >>>>> return ret; >>>>> } >>>>> >>>>> +/* Find coldest loop between outmost_loop and loop by comapring profile count. */ >>>>> + >>>>> +static class loop * >>>>> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >>>>> + basic_block curr_bb) >>>>> +{ >>>>> + class loop *cold_loop, *min_loop; >>>>> + cold_loop = min_loop = outmost_loop; >>>>> + profile_count min_count = loop_preheader_edge (min_loop)->src->count; >>>>> + >>>>> + if (curr_bb && curr_bb->count < loop_preheader_edge (loop)->src->count) >>>> >>>> Honza - can you comment on whether we should compare BB counts this way? >>>> >>>> I would suspect that for, say, >>>> >>>> for (...) >>>> if (a) >>>> X; >>>> else >>>> Y; >>>> >>>> that the counts for X and Y will be less than that of the preheader of the loop >>>> only when the loop is estimated to run once. That is, should we really compare >>>> the to the preheader count or maybe better to the _header_ count which >>>> would keep the number of iterations out of the equation? >>> >>> I quickly tried to replace all the loop_preheader_edge (loop)->src with >>> loop_preheader_edge (loop)->dest, it will cause many failures in >>> gcc.dg/tree-ssa/ssa-lim-*.c, I didn't go deep to investigate, but it seems >>> reasonable to compare the bb count with preheader count as both gimple lim >>> and RTL loop-invariant move instructions to *preheader* instead of *header* >>> after analysis? >> >> Hmm, yeah - guess I was confused here. >> >>>> >>>> If we look at maybe_hot_count_p that's a quite sophisticated thing to >>>> compare a count to the "IPA hot", here we're comparing two counts >>>> within a function where it actually matters whether we use a>>> !(a>=b) since 'unordered' is mapped to false (but there's no ordered_p >>>> function). >>>> >>>> Xionghu, you error on the side of not hoisting for unordered counts here >>>> >>>>> + return NULL; >>>>> + >>>>> + while (min_loop != loop) >>>>> + { >>>>> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); >>>>> + if (loop_preheader_edge (min_loop)->src->count < min_count) >>>> >>>> but in the other direction here and on the side of not hoisting >>>> in ref_in_loop_hot_body. >>>> >>>> The three-state relational operator overloads are probably not the >>>> very best idea... >>>> (see profile-count.h for them) >>>> >>> Added new function bb_colder_than_loop_preheader to encapsulate the comparision, >>> if FALSE is returned due to three-state inequality, find_coldest_out_loop >>> will return the original input to lim->max_loop, and ref_in_loop_hot_body::operator () >>> will return true to continue perform store motion, both preserve the previous >>> behavior. >> >> Thanks. But I don't think the abstraction as written is useful: >> >> +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state >> + as stated in profile-count.h, FALSE is returned if inequality cannot be >> + decided. */ >> +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) >> +{ >> + if (count1 < count2) >> + return true; >> + else >> + return false; >> +} >> >> given the following seems to pass the preheader count in place of the BB count. >> >> + if (bb_colder_than_loop_preheader ( >> + loop_preheader_edge (min_loop)->src->count, min_count)) >> + cold_loop = min_loop; >> >> find_coldest_out_loop is also a bit weird, I think we want to find >> the outermost loop between outmost_loop and loop that has a >> lower count than the curr_bb count but >> >> + while (min_loop != loop) >> + { >> + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); >> + if (bb_colder_than_loop_preheader ( >> + loop_preheader_edge (min_loop)->src->count, min_count)) >> + cold_loop = min_loop; >> >> compares the outermost loops count (min_count) against the preheader >> count? So we're searching for a cold loop with respect to its enclosing loop >> here? > > Let me try to explain how it works :) > > find_coldest_out_loop does two steps check: > 1) Check whether curr_bb is cold in it's own loop_father, if it is cold, > just return NULL which means it should not be moved out at all; > 2) curr_bb is NOT cold, assuming the current loop L[m] is the coldest first, > than try to find a cold loop to be hoisted to from {L[1], L[2], ... L[m]}, > if L[i]->count < L[m]->count, set the cold_loop to L[i] until find the loop > that has smallest profile_count. > > > Take the updated ssa-lim-19.c as example, check whether curr_bb(bb 5) is cold in > loop 3, if it is cold, just return NULL, otherwise select the coldest loop in > {loop1, loop2, loop3} and find that loop2 is colder than loop3, return loop2 to > be the target hoist loop. The first check could AVOID hoist if curr_bb is colder > than loop3, but it is still hot than loop1 and loop2. Not sure whether it is possible > to construct such cases? > > > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > > volatile int x; > void > bar (int, char *, char *); > void > foo (int *a, int n, int m, int s, int t) > { > int i; > int j; > int k; > > for (i = 0; i < m; i++) // loop 1 > { > if (__builtin_expect (x, 0)) > for (j = 0; j < n; j++) // loop 2 > for (k = 0; k < n; k++) // loop 3 > { > bar (s / 5, "one", "two"); // curr_bb > a[t] = s; > } > a[t] = t; // curr_bb2 > } > } > > The 4 invariant statements are moved to bb 11(loop2) instead of bb 10(loop1) > with this patch. > There are totally 6 combinations when curr_bb is hotter than loop 3. We need > to compare the "Loop preheader hotness" instead of "every Loop[i] and curr_bb hotness", > returning the coldest loop for this function find_coldest_out_loop, otherwise > unexpected behavior happens. > > L1 > L2 > L3 => return L3 > L1 > L3 > L2 => return L2 > L2 > L1 > L3 => return L3 > L2 > L3 > L1 => return L1 > L3 > L1 > L2 => return L2 > L3 > L2 > L1 => return L1 > > So bb_colder_than_loop_preheader does two kind of checks, one is checking > L3 preheader count with curr_bb count, another is checking L3 preheader count > with L1 preheader count, L2 preheader count, etc... > > > ssa-lim-19.c.138t.lim2: > ... > [local count: 16057869]: // L1 preheader > - _4 = s_22(D) / 5; > - _5 = (long unsigned int) t_24(D); > - _6 = _5 * 4; > - _7 = a_25(D) + _6; > _8 = (long unsigned int) t_24(D); > _9 = _8 * 4; > _10 = a_25(D) + _9; > > [local count: 145980626]: > # i_34 = PHI > x.0_1 ={v} x; > if (x.0_1 != 0) > goto ; [10.00%] > else > goto ; [90.00%] > > [local count: 14598063]: > if (n_20(D) > 0) > goto ; [89.00%] > else > goto ; [11.00%] > > [local count: 12992276]: // L2 preheader > + _4 = s_22(D) / 5; > + _5 = (long unsigned int) t_24(D); > + _6 = _5 * 4; > + _7 = a_25(D) + _6; > goto ; [100.00%] > > [local count: 850510901]: > > [local count: 955630225]: // curr_bb > # k_36 = PHI > bar (_4, "one", "two"); > *_7 = s_22(D); > k_27 = k_36 + 1; > if (n_20(D) > k_27) > goto ; [89.00%] > else > goto ; [11.00%] > > [local count: 118111600]: > j_21 = j_35 + 1; > if (n_20(D) > j_21) > goto ; [89.00%] > else > goto ; [11.00%] > > [local count: 105119324]: > > [local count: 118111600]: // L3 preheader > # j_35 = PHI > goto ; [100.00%] > > [local count: 145980626]: > *_10 = t_24(D); > i_29 = i_34 + 1; > > Re-paste the bb_colder_than_loop_preheader and find_coldest_out_loop: > > +/* Compare the profile count inequality of COUNT1 and COUNT2, it is three-state > + as stated in profile-count.h, FALSE is returned if inequality cannot be > + decided. */ > +bool bb_colder_than_loop_preheader (profile_count count1, profile_count count2) > +{ > + if (count1 < count2) > + return true; > + else > + return false; > +} > + > +/* Find coldest loop between OUTMOST_LOOP and LOOP by comapring profile count. > + */ > + > +static class loop * > +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, > + basic_block curr_bb) > +{ > + class loop *cold_loop, *min_loop; > + cold_loop = min_loop = outmost_loop; > + profile_count min_count = loop_preheader_edge (min_loop)->src->count; > + > + /* If bb_colder_than_loop_preheader returns false due to three-state > + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. > + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. */ > + if (curr_bb > + && bb_colder_than_loop_preheader (curr_bb->count, > + loop_preheader_edge (loop)->src->count)) > + return NULL; > + > + while (min_loop != loop) > + { > + min_loop = superloop_at_depth (loop, loop_depth (min_loop) + 1); > + if (bb_colder_than_loop_preheader ( > + loop_preheader_edge (min_loop)->src->count, min_count)) > + cold_loop = min_loop; > + } > + return cold_loop; > +} > + > > >> >> Why is this function not simply >> >> +static class loop * >> +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, >> + basic_block curr_bb) >> +{ >> while (bb_colder_than_loop_preheader (curr_bb->count, >> loop_preheader_edge (outermost_loop)->src->count)) >> { >> if (outermost_loop == loop) >> return NULL; >> outermost_loop = superloop_at_depth (loop, loop_depth >> (outermost_loop) + 1); >> } >> return outermost_loop; >> } > > If change like this, when processing curr_bb(5), outermost_loop will > return loop 1 since curr_bb->count > Loop1_prehead->count, the while > loop stopped. This doesn't meet what we want. > >> >> ? >> >> Likewise I wonder why ref_in_loop_hot_body::operator () needs to call >> find_coldest_out_loop and why it not simply does >> >> +bool >> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) >> +{ >> + basic_block curr_bb = gimple_bb (loc->stmt); >> if (bb_colder_than_loop_preheader (curr_bb->count, >> loop_preheader_edge (l)->src->count)) >> return false; >> return true; >> } > > Likely for this part, > > +/* Find out the coldest loop between loop L and innermost loop, compare the > + hotness between current BB and coldest loop preheader by profile count. */ > +bool > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) > +{ > + basic_block curr_bb = gimple_bb (loc->stmt); > + class loop *inner_loop = curr_bb->loop_father; > + class loop *cold_loop = l; > + if (l != inner_loop) > + cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb); > + if (!cold_loop) > + return false; > + edge e = loop_preheader_edge (cold_loop); > + /* If bb_colder_than_loop_preheader is false due to three-state inequality > + comparision, TRUE is returned to continue perform store motion. */ > + if (bb_colder_than_loop_preheader (curr_bb->count, e->src->count)) > + return false; > + else > + return true; > +} > > l is the input of ref_in_loop_hot_body, it is an out loop, we need to find a > cold_loop between l and inner_loop. Reason is there may be cold loop between > l and inner_loop, which means we shouldn't do store-motion from curr_bb to l > directly. > After reconsideration, I think the bb_colder_than_loop_preheader could be > removed since curr_bb is checked in find_coldest_out_loop already. And remove > the "l != inner_loop" check: > > +/* Find out the coldest loop between loop L and innermost loop. */ > +bool > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) > +{ > + basic_block curr_bb = gimple_bb (loc->stmt); > + class loop *inner_loop = curr_bb->loop_father; > + class loop *cold_loop = l; > + cold_loop = find_coldest_out_loop (l, inner_loop, curr_bb); > + if (!cold_loop) > + return false; > + return true; > +} > > Refined the functions and comments for better understanding in v5: [PATCH v5 2/2] Don't move cold code out of loop by checking bb count v5 changes: 1. Refine comments for new functions. 2. Use basic_block instead of count in bb_colder_than_loop_preheader to align with function name. 3. Refine with simpler implementation for find_coldest_out_loop and ref_in_loop_hot_body::operator for better understanding. v4 changes: 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader. 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare. 3. Split RTL invariant motion part out. 4. Remove aux changes. v3 changes: 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop. 2. Remove unnecessary changes. 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p. 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused infinite loop when implementing v1 and the iteration is missed to be updated actually. v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html There was a patch trying to avoid move cold block out of loop: https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html Richard suggested to "never hoist anything from a bb with lower execution frequency to a bb with higher one in LIM invariantness_dom_walker before_dom_children". In gimple LIM analysis, add find_coldest_out_loop to move invariants to expected target loop, if profile count of the loop bb is colder than target loop preheader, it won't be hoisted out of loop. Likely for store motion, if all locations of the REF in loop is cold, don't do store motion of it. SPEC2017 performance evaluation shows 1% performance improvement for intrate GEOMEAN and no obvious regression for others. Especially, 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00% on P8LE. gcc/ChangeLog: * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New function. (find_coldest_out_loop): New function. (determine_max_movement): Use find_coldest_out_loop. (move_computations_worker): Adjust and fix iteration udpate. (class ref_in_loop_hot_body): New functor. (ref_in_loop_hot_body::operator): New. (can_sm_ref_p): Use for_all_locs_in_loop. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/recip-3.c: Adjust. * gcc.dg/tree-ssa/ssa-lim-18.c: New test. * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. * gcc.dg/tree-ssa/ssa-lim-21.c: New test. --- gcc/tree-ssa-loop-im.c | 78 +++++++++++++++++++++- gcc/testsuite/gcc.dg/tree-ssa/recip-3.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c | 20 ++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 29 ++++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 25 +++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c | 35 ++++++++++ 6 files changed, 186 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 4b187c2cdaf..6de9be7d4e5 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -417,6 +417,49 @@ movement_possibility (gimple *stmt) return ret; } +/* Compare the profile count inequality of bb and preheader, it is three-state + as stated in profile-count.h, FALSE is returned if inequality cannot be + decided. */ +bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader) +{ + gcc_assert (bb && preheader); + return bb->count < preheader->count; +} + +/* Find coldest loop between OUTMOST_LOOP and LOOP by comparing profile count. + It does two steps check: + 1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just + return NULL which means it should not be moved out at all; + 2) CURR_BB is NOT cold, set LOOP to cold_loop, then iteratively search loops + from {L[outmost_loop], L[outmost_loop+1], ... L[loop]}, if L[i] is colder + than L[cold_loop], reset cold_loop to L[i] until get the loop that has + smallest profile_count. */ + +static class loop * +find_coldest_out_loop (class loop *outmost_loop, class loop *loop, + basic_block curr_bb) +{ + class loop *cold_loop; + + /* If bb_colder_than_loop_preheader returns false due to three-state + comparision, OUTMOST_LOOP is returned finally to preserve the behavior. + Otherwise, return the coldest loop between OUTMOST_LOOP and LOOP. */ + if (curr_bb + && bb_colder_than_loop_preheader (curr_bb, + loop_preheader_edge (loop)->src)) + return NULL; + + cold_loop = loop; + while (outmost_loop != loop) + { + if (bb_colder_than_loop_preheader (loop_preheader_edge (outmost_loop)->src, + loop_preheader_edge (cold_loop)->src)) + cold_loop = outmost_loop; + outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1); + } + return cold_loop; +} + /* Suppose that operand DEF is used inside the LOOP. Returns the outermost loop to that we could move the expression using DEF if it did not have other operands, i.e. the outermost loop enclosing LOOP in that the value @@ -685,7 +728,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec) level = ALWAYS_EXECUTED_IN (bb); else level = superloop_at_depth (loop, 1); - lim_data->max_loop = level; + lim_data->max_loop = find_coldest_out_loop (level, loop, bb); + if (!lim_data->max_loop) + return false; if (gphi *phi = dyn_cast (stmt)) { @@ -1221,7 +1266,10 @@ move_computations_worker (basic_block bb) /* We do not really want to move conditionals out of the loop; we just placed it here to force its operands to be moved if necessary. */ if (gimple_code (stmt) == GIMPLE_COND) - continue; + { + gsi_next (&bsi); + continue; + } if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -2887,6 +2935,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind) return indep_p; } +class ref_in_loop_hot_body +{ +public: + ref_in_loop_hot_body (loop *loop_) : l (loop_) {} + bool operator () (mem_ref_loc *loc); + class loop *l; +}; + +/* Find the coldest loop between loop L and innermost loop. If there is one + cold loop between L and INNER_LOOP, store motion can be performed, otherwise + no cold loop means no store motion. find_coldest_out_loop also handles cases + when l is inner_loop. */ +bool +ref_in_loop_hot_body::operator () (mem_ref_loc *loc) +{ + basic_block curr_bb = gimple_bb (loc->stmt); + class loop *inner_loop = curr_bb->loop_father; + return find_coldest_out_loop (l, inner_loop, curr_bb); +} + /* Returns true if we can perform store motion of REF from LOOP. */ @@ -2941,6 +3009,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref) if (!ref_indep_loop_p (loop, ref, sm_war)) return false; + /* Verify whether the candidate is hot for LOOP. Only do store motion if the + candidate's profile count is hot. Statement in cold BB shouldn't be moved + out of it's loop_father. */ + if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop))) + return false; + return true; } diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c index 638bf38db8c..641c91e719e 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c @@ -23,4 +23,4 @@ float h () F[0] += E / d; } -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */ +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c new file mode 100644 index 00000000000..7326a230b3f --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int k) +{ + int i; + + for (i = 0; i < n; i++) + { + if (__builtin_expect (x, 0)) + bar (k / 5, "one", "two"); + a[i] = k; + } +} + +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..51c1913d003 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,29 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +volatile int x; +void +bar (int, char *, char *); +void +foo (int *a, int n, int m, int s, int t) +{ + int i; + int j; + int k; + + for (i = 0; i < m; i++) // Loop 1 + { + if (__builtin_expect (x, 0)) + for (j = 0; j < n; j++) // Loop 2 + for (k = 0; k < n; k++) // Loop 3 + { + bar (s / 5, "one", "two"); + a[t] = s; + } + a[t] = t; + } +} + +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */ +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..bc60a040a70 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,25 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `count' is not hoisted out of loop when bb is cold. */ + +int count; +volatile int x; + +struct obj { + int data; + struct obj *next; + +} *q; + +void +func (int m) +{ + struct obj *p; + for (int i = 0; i < m; i++) + if (__builtin_expect (x, 0)) + count++; + +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c new file mode 100644 index 00000000000..d843fc55368 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c @@ -0,0 +1,35 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop + when it is in cold loop. */ + +int count; +volatile int x; + +struct obj { + int data; + int data1; + struct obj *next; +}; + +void +func (int m, int n, int k, struct obj *a) +{ + struct obj *q = a; + for (int j = 0; j < m; j++) + if (__builtin_expect (m, 0)) + for (int i = 0; i < m; i++) + { + if (__builtin_expect (x, 0)) + { + count++; + q->data += 3; /* Not hoisted out to inner loop. */ + } + count += n; + q->data1 += k; /* Not hoisted out to outer loop. */ + } +} + +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2" } } */ +