[v7,2/2] Don't move cold code out of loop by checking bb count

Message ID 409e3be5-0bfd-cdc4-2d93-9dbacdbc69cf@linux.ibm.com
State New
Headers
Series None |

Commit Message

Xionghu Luo Nov. 10, 2021, 3:08 a.m. UTC
  On 2021/11/4 21:00, Richard Biener wrote:
> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>> +  while (outmost_loop != loop)
>>> +    {
>>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
>>> (outmost_loop)->src,
>>> +                                        loop_preheader_edge (cold_loop)->src))
>>> +       cold_loop = outmost_loop;
>>> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
>>> +    }
>>>
>>> could be instead written as
>>>
>>>   coldest_loop = coldest_outermost_loop[loop->num];
>>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>>>     return outermost_loop;
>>>   return coldest_loop;
>>>
>>> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
>>> It should be possible to compute such cache in a DFS walk of the loop tree
>>> (the loop iterator by default visits in such order).
>>
>>
>> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
>> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
>> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>>
>>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
>>
>> then function find_coldest_out_loop will return a loop NOT accord with our
>> expectation, that should return second_coldest_loop instead of outermost_loop?
> Hmm, interesting - yes.  I guess the common case will be that the pre-computed
> outermost loop will be the loop at depth 1 since outer loops tend to
> be colder than
> inner loops?  That would then defeat the whole exercise.

It is not easy to construct such cases, But finally I got below results,

1) many cases inner loop is hotter than outer loop, for example:

loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2


2) But there are also cases inner loop is colder than outer loop, like:

loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL


> 
> To optimize the common case but not avoiding iteration in the cases we care
> about we could instead cache the next outermost loop that is _not_ colder
> than loop.  So for your [ ... ] example above we'd have> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
> candidate would then be 'second_coldest_loop' and we'd then iterate
> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
> cold candidate we can compare against?  For the common case we'd
> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
> simply pick 'outermost_loop'.

Thanks.  It was difficult to understand, but finally I got to know what you
want to express :)

We should cache the next loop that is *colder* than loop instead of '_not_ colder
than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
then it makes sense if the coldest loop is outside of outermost loop, continue to
find a colder loop between outermost loop and current loop in
colder_than_inner_loop[loop->num]?  Hope I understood you correctly...

> 
> One comment on the patch itself below.
> 

The loop in fill_cold_out_loop is also removed in the updated v7 patch.



[PATCH v7 2/2] Don't move cold code out of loop by checking bb count

From: Xiong Hu Luo <luoxhu@linux.ibm.com>

v7 changes:
1. Refine get_coldest_out_loop to replace loop with checking
pre-computed coldest_outermost_loop and colder_than_inner_loop.
2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
colder_than_inner_loop recursively without loop.

v6 changes:
1. Add function fill_coldest_out_loop to pre compute the coldest
outermost loop for each loop.
2. Rename find_coldest_out_loop to get_coldest_out_loop.
3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.

v5 changes:
1. Refine comments for new functions.
2. Use basic_block instead of count in bb_colder_than_loop_preheader
to align with function name.
3. Refine with simpler implementation for get_coldest_out_loop and
ref_in_loop_hot_body::operator for better understanding.

v4 changes:
1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
3. Split RTL invariant motion part out.
4. Remove aux changes.

v3 changes:
1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
2. Remove unnecessary changes.
3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
infinite loop when implementing v1 and the iteration is missed to be
updated actually.

v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html

There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add get_coldest_out_loop to move invariants to
expected target loop, if profile count of the loop bb is colder
than target loop preheader, it won't be hoisted out of loop.
Likely for store motion, if all locations of the REF in loop is cold,
don't do store motion of it.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

gcc/ChangeLog:

	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
	function.
	(get_coldest_out_loop): New function.
	(determine_max_movement): Use get_coldest_out_loop.
	(move_computations_worker): Adjust and fix iteration udpate.
	(class ref_in_loop_hot_body): New functor.
	(ref_in_loop_hot_body::operator): New.
	(can_sm_ref_p): Use for_all_locs_in_loop.
	(fill_cold_out_loop): New.
	(tree_ssa_lim_finalize): Free coldest_outermost_loop and
	colder_than_inner_loop.
	(loop_invariant_motion_in_fun): Call fill_cold_out_loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-18.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-21.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-22.c: New test.
---
 gcc/tree-ssa-loop-im.c                     | 140 ++++++++++++++++++++-
 gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c |  20 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 +++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 ++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
 7 files changed, 280 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
  

Comments

Xionghu Luo Nov. 24, 2021, 5:15 a.m. UTC | #1
Gentle ping and is this patch still suitable for stage 3?  Thanks.


[PATCH v7 2/2] Don't move cold code out of loop by checking bb count

https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583911.html



On 2021/11/10 11:08, Xionghu Luo via Gcc-patches wrote:
> 
> 
> On 2021/11/4 21:00, Richard Biener wrote:
>> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>
>>>
>>>> +  while (outmost_loop != loop)
>>>> +    {
>>>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
>>>> (outmost_loop)->src,
>>>> +                                        loop_preheader_edge (cold_loop)->src))
>>>> +       cold_loop = outmost_loop;
>>>> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
>>>> +    }
>>>>
>>>> could be instead written as
>>>>
>>>>   coldest_loop = coldest_outermost_loop[loop->num];
>>>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>>>>     return outermost_loop;
>>>>   return coldest_loop;
>>>>
>>>> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
>>>> It should be possible to compute such cache in a DFS walk of the loop tree
>>>> (the loop iterator by default visits in such order).
>>>
>>>
>>> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
>>> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
>>> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
>>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>>>
>>>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
>>>
>>> then function find_coldest_out_loop will return a loop NOT accord with our
>>> expectation, that should return second_coldest_loop instead of outermost_loop?
>> Hmm, interesting - yes.  I guess the common case will be that the pre-computed
>> outermost loop will be the loop at depth 1 since outer loops tend to
>> be colder than
>> inner loops?  That would then defeat the whole exercise.
> 
> It is not easy to construct such cases, But finally I got below results,
> 
> 1) many cases inner loop is hotter than outer loop, for example:
> 
> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> 
> 
> 2) But there are also cases inner loop is colder than outer loop, like:
> 
> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
> 
> 
>>
>> To optimize the common case but not avoiding iteration in the cases we care
>> about we could instead cache the next outermost loop that is _not_ colder
>> than loop.  So for your [ ... ] example above we'd have> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
>> candidate would then be 'second_coldest_loop' and we'd then iterate
>> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
>> cold candidate we can compare against?  For the common case we'd
>> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
>> simply pick 'outermost_loop'.
> 
> Thanks.  It was difficult to understand, but finally I got to know what you
> want to express :)
> 
> We should cache the next loop that is *colder* than loop instead of '_not_ colder
> than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
> then it makes sense if the coldest loop is outside of outermost loop, continue to
> find a colder loop between outermost loop and current loop in
> colder_than_inner_loop[loop->num]?  Hope I understood you correctly...
> 
>>
>> One comment on the patch itself below.
>>
> 
> The loop in fill_cold_out_loop is also removed in the updated v7 patch.
> 
> 
> 
> [PATCH v7 2/2] Don't move cold code out of loop by checking bb count
> 
> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> 
> v7 changes:
> 1. Refine get_coldest_out_loop to replace loop with checking
> pre-computed coldest_outermost_loop and colder_than_inner_loop.
> 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
> colder_than_inner_loop recursively without loop.
> 
> v6 changes:
> 1. Add function fill_coldest_out_loop to pre compute the coldest
> outermost loop for each loop.
> 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
> 
> v5 changes:
> 1. Refine comments for new functions.
> 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> to align with function name.
> 3. Refine with simpler implementation for get_coldest_out_loop and
> ref_in_loop_hot_body::operator for better understanding.
> 
> v4 changes:
> 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> 3. Split RTL invariant motion part out.
> 4. Remove aux changes.
> 
> v3 changes:
> 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
> 
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
> 
> There was a patch trying to avoid move cold block out of loop:
> 
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> 
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
> 
> In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likely for store motion, if all locations of the REF in loop is cold,
> don't do store motion of it.
> 
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
> 
> gcc/ChangeLog:
> 
> 	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
> 	function.
> 	(get_coldest_out_loop): New function.
> 	(determine_max_movement): Use get_coldest_out_loop.
> 	(move_computations_worker): Adjust and fix iteration udpate.
> 	(class ref_in_loop_hot_body): New functor.
> 	(ref_in_loop_hot_body::operator): New.
> 	(can_sm_ref_p): Use for_all_locs_in_loop.
> 	(fill_cold_out_loop): New.
> 	(tree_ssa_lim_finalize): Free coldest_outermost_loop and
> 	colder_than_inner_loop.
> 	(loop_invariant_motion_in_fun): Call fill_cold_out_loop.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.dg/tree-ssa/recip-3.c: Adjust.
> 	* gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-21.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> ---
>  gcc/tree-ssa-loop-im.c                     | 140 ++++++++++++++++++++-
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c |  20 +++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 +++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 ++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
>  7 files changed, 280 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> 
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 4b187c2cdaf..e9b3d0cba93 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -146,6 +146,11 @@ public:
>  enum dep_kind { lim_raw, sm_war, sm_waw };
>  enum dep_state { dep_unknown, dep_independent, dep_dependent };
> 
> +/* coldest outermost loop for given loop.  */
> +class loop **coldest_outermost_loop;
> +/* colder outer loop nerest to given loop.  */
> +class loop **colder_than_inner_loop;
> +
>  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
> 
>  static void
> @@ -417,6 +422,53 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
> 
> +/* Compare the profile count inequality of bb and preheader, it is three-state
> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> +   decided.  */
> +bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
> +{
> +  gcc_assert (bb && preheader);
> +  return bb->count < preheader->count;
> +}
> +
> +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
> +   count.
> +  It does three steps check:
> +  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
> +  return NULL which means it should not be moved out at all;
> +  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> +  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> +  colder loop between OUTERMOST_LOOP and loop in pre-computed
> +  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
> +
> +static class loop *
> +get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
> +		      basic_block curr_bb)
> +{
> +  gcc_assert (outermost_loop == loop
> +	      || flow_loop_nested_p (outermost_loop, loop));
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
> +  if (curr_bb
> +      && bb_colder_than_loop_preheader (curr_bb,
> +					loop_preheader_edge (loop)->src))
> +    return NULL;
> +
> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> +    {
> +      if (colder_than_inner_loop[loop->num] != NULL
> +	  && loop_depth (outermost_loop)
> +	       < loop_depth (colder_than_inner_loop[loop->num]))
> +	return colder_than_inner_loop[loop->num];
> +      return outermost_loop;
> +    }
> +  return coldest_loop;
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -685,7 +737,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
>      level = superloop_at_depth (loop, 1);
> -  lim_data->max_loop = level;
> +  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
> +  if (!lim_data->max_loop)
> +    return false;
> 
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>      {
> @@ -1221,7 +1275,10 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>  	 placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -	continue;
> +	{
> +	  gsi_next (&bsi);
> +	  continue;
> +	}
> 
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	{
> @@ -2887,6 +2944,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
>    return indep_p;
>  }
> 
> +class ref_in_loop_hot_body
> +{
> +public:
> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> +  bool operator () (mem_ref_loc *loc);
> +  class loop *l;
> +};
> +
> +/* Check the coldest loop between loop L and innermost loop.  If there is one
> +   cold loop between L and INNER_LOOP, store motion can be performed, otherwise
> +   no cold loop means no store motion.  get_coldest_out_loop also handles cases
> +   when l is inner_loop.  */
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  return get_coldest_out_loop (l, inner_loop, curr_bb);
> +}
> +
> 
>  /* Returns true if we can perform store motion of REF from LOOP.  */
> 
> @@ -2941,6 +3018,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
>    if (!ref_indep_loop_p (loop, ref, sm_war))
>      return false;
> 
> +  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
> +    candidate's profile count is hot.  Statement in cold BB shouldn't be moved
> +    out of it's loop_father.  */
> +  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> +    return false;
> +
>    return true;
>  }
> 
> @@ -3153,6 +3236,48 @@ fill_always_executed_in (void)
>      fill_always_executed_in_1 (loop, contains_call);
>  }
> 
> +/* Find the coldest loop preheader for LOOP, also find the nearest colder loop
> +   to LOOP.  Then recursively iterate each inner loop.  */
> +
> +void
> +fill_cold_out_loop (class loop *coldest_loop, class loop *outer_loop,
> +		    class loop *loop)
> +{
> +  class loop *colder_loop = NULL;
> +  if (outer_loop)
> +    {
> +      if (bb_colder_than_loop_preheader (loop_preheader_edge (outer_loop)->src,
> +					 loop_preheader_edge (loop)->src))
> +	colder_loop = outer_loop;
> +      else if (colder_than_inner_loop[outer_loop->num]
> +	       && bb_colder_than_loop_preheader (
> +		 loop_preheader_edge (colder_than_inner_loop[outer_loop->num])
> +		   ->src,
> +		 loop_preheader_edge (loop)->src))
> +	colder_loop = colder_than_inner_loop[outer_loop->num];
> +    }
> +
> +  if (bb_colder_than_loop_preheader (loop_preheader_edge (loop)->src,
> +				     loop_preheader_edge (coldest_loop)->src))
> +    coldest_loop = loop;
> +
> +  coldest_outermost_loop[loop->num] = coldest_loop;
> +  colder_than_inner_loop[loop->num] = colder_loop;
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf (MSG_NOTE, "loop %d's coldest_outermost_loop is %d, ",
> +		   loop->num, coldest_loop->num);
> +      if (colder_loop)
> +	dump_printf (MSG_NOTE, "colder_than_inner_loop is %d\n",
> +		     colder_loop->num);
> +      else
> +	dump_printf (MSG_NOTE, "colder_than_inner_loop is NULL\n");
> +    }
> +
> +  class loop *inner_loop;
> +  for (inner_loop = loop->inner; inner_loop; inner_loop = inner_loop->next)
> +    fill_cold_out_loop (coldest_loop, loop, inner_loop);
> +}
> 
>  /* Compute the global information needed by the loop invariant motion pass.  */
> 
> @@ -3237,6 +3362,9 @@ tree_ssa_lim_finalize (void)
>      free_affine_expand_cache (&memory_accesses.ttae_cache);
> 
>    free (bb_loop_postorder);
> +
> +  free (coldest_outermost_loop);
> +  free (colder_than_inner_loop);
>  }
> 
>  /* Moves invariants from loops.  Only "expensive" invariants are moved out --
> @@ -3256,6 +3384,14 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
>    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
>    fill_always_executed_in ();
> 
> +  /* Pre-compute coldest outermost loop and nearest colder loop  of each loop.
> +   */
> +  class loop *loop;
> +  coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun));
> +  colder_than_inner_loop = XNEWVEC (class loop *, number_of_loops (cfun));
> +  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> +    fill_cold_out_loop (loop, NULL, loop);
> +
>    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> 
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>  	F[0] += E / d;
>  }
> 
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> new file mode 100644
> index 00000000000..7326a230b3f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int k)
> +{
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +	bar (k / 5, "one", "two");
> +      a[i] = k;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..51c1913d003
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int s, int t)
> +{
> +  int i;
> +  int j;
> +  int k;
> +
> +  for (i = 0; i < m; i++) // Loop 1
> +    {
> +      if (__builtin_expect (x, 0))
> +	for (j = 0; j < n; j++) // Loop 2
> +	  for (k = 0; k < n; k++) // Loop 3
> +	    {
> +	      bar (s / 5, "one", "two");
> +	      a[t] = s;
> +	    }
> +      a[t] = t;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> new file mode 100644
> index 00000000000..bc60a040a70
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `count' is not hoisted out of loop when bb is cold.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  struct obj *next;
> +
> +} *q;
> +
> +void
> +func (int m)
> +{
> +  struct obj *p;
> +  for (int i = 0; i < m; i++)
> +    if (__builtin_expect (x, 0))
> +      count++;
> +
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> new file mode 100644
> index 00000000000..ffe6f8f699d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> @@ -0,0 +1,35 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop
> +   when it is in cold loop.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  int data1;
> +  struct obj *next;
> +};
> +
> +void
> +func (int m, int n, int k, struct obj *a)
> +{
> +  struct obj *q = a;
> +  for (int j = 0; j < m; j++)
> +    if (__builtin_expect (m, 0))
> +      for (int i = 0; i < m; i++)
> +	{
> +	  if (__builtin_expect (x, 0))
> +	    {
> +	      count++;
> +	      q->data += 3; /* Not hoisted out to inner loop. */
> +	    }
> +	  count += n;
> +	  q->data1 += k; /* Not hoisted out to outer loop. */
> +	}
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> new file mode 100644
> index 00000000000..16ba4ceb8ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> @@ -0,0 +1,32 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +volatile int y;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int s, int t)
> +{
> +  int i;
> +  int j;
> +  int k;
> +
> +  for (i = 0; i < m; i++) // Loop 1
> +    {
> +      if (__builtin_expect (x, 0))
> +	for (j = 0; j < n; j++) // Loop 2
> +	  if (__builtin_expect (y, 0))
> +	    for (k = 0; k < n; k++) // Loop 3
> +	      {
> +		bar (s / 5, "one", "two");
> +		a[t] = s;
> +	      }
> +      a[t] = t;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> +
>
  
Richard Biener Nov. 24, 2021, 7:35 a.m. UTC | #2
On Wed, Nov 24, 2021 at 6:15 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
> Gentle ping and is this patch still suitable for stage 3?  Thanks.

It's on my list to look at, yes.  I'm worried about accuracy of profile counts
though which is why I keep pushing it back - I see the various profile count
fixes are still pending.

I'm thinking of requiring reliable_p () on the counts (but that will restrict
this to FDO).

Anyway, still on my list to look at - sorry for the delays.

Richard.

>
> [PATCH v7 2/2] Don't move cold code out of loop by checking bb count
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583911.html
>
>
>
> On 2021/11/10 11:08, Xionghu Luo via Gcc-patches wrote:
> >
> >
> > On 2021/11/4 21:00, Richard Biener wrote:
> >> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>>
> >>>
> >>>> +  while (outmost_loop != loop)
> >>>> +    {
> >>>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
> >>>> (outmost_loop)->src,
> >>>> +                                        loop_preheader_edge (cold_loop)->src))
> >>>> +       cold_loop = outmost_loop;
> >>>> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> >>>> +    }
> >>>>
> >>>> could be instead written as
> >>>>
> >>>>   coldest_loop = coldest_outermost_loop[loop->num];
> >>>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> >>>>     return outermost_loop;
> >>>>   return coldest_loop;
> >>>>
> >>>> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
> >>>> It should be possible to compute such cache in a DFS walk of the loop tree
> >>>> (the loop iterator by default visits in such order).
> >>>
> >>>
> >>> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
> >>> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
> >>> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
> >>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
> >>>
> >>>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
> >>>
> >>> then function find_coldest_out_loop will return a loop NOT accord with our
> >>> expectation, that should return second_coldest_loop instead of outermost_loop?
> >> Hmm, interesting - yes.  I guess the common case will be that the pre-computed
> >> outermost loop will be the loop at depth 1 since outer loops tend to
> >> be colder than
> >> inner loops?  That would then defeat the whole exercise.
> >
> > It is not easy to construct such cases, But finally I got below results,
> >
> > 1) many cases inner loop is hotter than outer loop, for example:
> >
> > loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
> > loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
> > loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> > loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> >
> >
> > 2) But there are also cases inner loop is colder than outer loop, like:
> >
> > loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
> > loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
> > loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
> >
> >
> >>
> >> To optimize the common case but not avoiding iteration in the cases we care
> >> about we could instead cache the next outermost loop that is _not_ colder
> >> than loop.  So for your [ ... ] example above we'd have> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
> >> candidate would then be 'second_coldest_loop' and we'd then iterate
> >> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
> >> cold candidate we can compare against?  For the common case we'd
> >> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
> >> simply pick 'outermost_loop'.
> >
> > Thanks.  It was difficult to understand, but finally I got to know what you
> > want to express :)
> >
> > We should cache the next loop that is *colder* than loop instead of '_not_ colder
> > than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
> > then it makes sense if the coldest loop is outside of outermost loop, continue to
> > find a colder loop between outermost loop and current loop in
> > colder_than_inner_loop[loop->num]?  Hope I understood you correctly...
> >
> >>
> >> One comment on the patch itself below.
> >>
> >
> > The loop in fill_cold_out_loop is also removed in the updated v7 patch.
> >
> >
> >
> > [PATCH v7 2/2] Don't move cold code out of loop by checking bb count
> >
> > From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> >
> > v7 changes:
> > 1. Refine get_coldest_out_loop to replace loop with checking
> > pre-computed coldest_outermost_loop and colder_than_inner_loop.
> > 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
> > colder_than_inner_loop recursively without loop.
> >
> > v6 changes:
> > 1. Add function fill_coldest_out_loop to pre compute the coldest
> > outermost loop for each loop.
> > 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> > 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
> >
> > v5 changes:
> > 1. Refine comments for new functions.
> > 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> > to align with function name.
> > 3. Refine with simpler implementation for get_coldest_out_loop and
> > ref_in_loop_hot_body::operator for better understanding.
> >
> > v4 changes:
> > 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> > 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> > 3. Split RTL invariant motion part out.
> > 4. Remove aux changes.
> >
> > v3 changes:
> > 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
> > 2. Remove unnecessary changes.
> > 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> > 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> > infinite loop when implementing v1 and the iteration is missed to be
> > updated actually.
> >
> > v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> > v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> > v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> > v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> > v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
> >
> > There was a patch trying to avoid move cold block out of loop:
> >
> > https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >
> > Richard suggested to "never hoist anything from a bb with lower execution
> > frequency to a bb with higher one in LIM invariantness_dom_walker
> > before_dom_children".
> >
> > In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> > expected target loop, if profile count of the loop bb is colder
> > than target loop preheader, it won't be hoisted out of loop.
> > Likely for store motion, if all locations of the REF in loop is cold,
> > don't do store motion of it.
> >
> > SPEC2017 performance evaluation shows 1% performance improvement for
> > intrate GEOMEAN and no obvious regression for others.  Especially,
> > 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> > largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> > on P8LE.
> >
> > gcc/ChangeLog:
> >
> >       * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
> >       function.
> >       (get_coldest_out_loop): New function.
> >       (determine_max_movement): Use get_coldest_out_loop.
> >       (move_computations_worker): Adjust and fix iteration udpate.
> >       (class ref_in_loop_hot_body): New functor.
> >       (ref_in_loop_hot_body::operator): New.
> >       (can_sm_ref_p): Use for_all_locs_in_loop.
> >       (fill_cold_out_loop): New.
> >       (tree_ssa_lim_finalize): Free coldest_outermost_loop and
> >       colder_than_inner_loop.
> >       (loop_invariant_motion_in_fun): Call fill_cold_out_loop.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >       * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-21.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> > ---
> >  gcc/tree-ssa-loop-im.c                     | 140 ++++++++++++++++++++-
> >  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c |  20 +++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 +++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 ++++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
> >  7 files changed, 280 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> >
> > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> > index 4b187c2cdaf..e9b3d0cba93 100644
> > --- a/gcc/tree-ssa-loop-im.c
> > +++ b/gcc/tree-ssa-loop-im.c
> > @@ -146,6 +146,11 @@ public:
> >  enum dep_kind { lim_raw, sm_war, sm_waw };
> >  enum dep_state { dep_unknown, dep_independent, dep_dependent };
> >
> > +/* coldest outermost loop for given loop.  */
> > +class loop **coldest_outermost_loop;
> > +/* colder outer loop nerest to given loop.  */
> > +class loop **colder_than_inner_loop;
> > +
> >  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
> >
> >  static void
> > @@ -417,6 +422,53 @@ movement_possibility (gimple *stmt)
> >    return ret;
> >  }
> >
> > +/* Compare the profile count inequality of bb and preheader, it is three-state
> > +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> > +   decided.  */
> > +bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
> > +{
> > +  gcc_assert (bb && preheader);
> > +  return bb->count < preheader->count;
> > +}
> > +
> > +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
> > +   count.
> > +  It does three steps check:
> > +  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
> > +  return NULL which means it should not be moved out at all;
> > +  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> > +  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
> > +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> > +  colder loop between OUTERMOST_LOOP and loop in pre-computed
> > +  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
> > +
> > +static class loop *
> > +get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
> > +                   basic_block curr_bb)
> > +{
> > +  gcc_assert (outermost_loop == loop
> > +           || flow_loop_nested_p (outermost_loop, loop));
> > +
> > +  /* If bb_colder_than_loop_preheader returns false due to three-state
> > +    comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
> > +    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
> > +  if (curr_bb
> > +      && bb_colder_than_loop_preheader (curr_bb,
> > +                                     loop_preheader_edge (loop)->src))
> > +    return NULL;
> > +
> > +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> > +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> > +    {
> > +      if (colder_than_inner_loop[loop->num] != NULL
> > +       && loop_depth (outermost_loop)
> > +            < loop_depth (colder_than_inner_loop[loop->num]))
> > +     return colder_than_inner_loop[loop->num];
> > +      return outermost_loop;
> > +    }
> > +  return coldest_loop;
> > +}
> > +
> >  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
> >     loop to that we could move the expression using DEF if it did not have
> >     other operands, i.e. the outermost loop enclosing LOOP in that the value
> > @@ -685,7 +737,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
> >      level = ALWAYS_EXECUTED_IN (bb);
> >    else
> >      level = superloop_at_depth (loop, 1);
> > -  lim_data->max_loop = level;
> > +  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
> > +  if (!lim_data->max_loop)
> > +    return false;
> >
> >    if (gphi *phi = dyn_cast <gphi *> (stmt))
> >      {
> > @@ -1221,7 +1275,10 @@ move_computations_worker (basic_block bb)
> >        /* We do not really want to move conditionals out of the loop; we just
> >        placed it here to force its operands to be moved if necessary.  */
> >        if (gimple_code (stmt) == GIMPLE_COND)
> > -     continue;
> > +     {
> > +       gsi_next (&bsi);
> > +       continue;
> > +     }
> >
> >        if (dump_file && (dump_flags & TDF_DETAILS))
> >       {
> > @@ -2887,6 +2944,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
> >    return indep_p;
> >  }
> >
> > +class ref_in_loop_hot_body
> > +{
> > +public:
> > +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> > +  bool operator () (mem_ref_loc *loc);
> > +  class loop *l;
> > +};
> > +
> > +/* Check the coldest loop between loop L and innermost loop.  If there is one
> > +   cold loop between L and INNER_LOOP, store motion can be performed, otherwise
> > +   no cold loop means no store motion.  get_coldest_out_loop also handles cases
> > +   when l is inner_loop.  */
> > +bool
> > +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> > +{
> > +  basic_block curr_bb = gimple_bb (loc->stmt);
> > +  class loop *inner_loop = curr_bb->loop_father;
> > +  return get_coldest_out_loop (l, inner_loop, curr_bb);
> > +}
> > +
> >
> >  /* Returns true if we can perform store motion of REF from LOOP.  */
> >
> > @@ -2941,6 +3018,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
> >    if (!ref_indep_loop_p (loop, ref, sm_war))
> >      return false;
> >
> > +  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
> > +    candidate's profile count is hot.  Statement in cold BB shouldn't be moved
> > +    out of it's loop_father.  */
> > +  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> > +    return false;
> > +
> >    return true;
> >  }
> >
> > @@ -3153,6 +3236,48 @@ fill_always_executed_in (void)
> >      fill_always_executed_in_1 (loop, contains_call);
> >  }
> >
> > +/* Find the coldest loop preheader for LOOP, also find the nearest colder loop
> > +   to LOOP.  Then recursively iterate each inner loop.  */
> > +
> > +void
> > +fill_cold_out_loop (class loop *coldest_loop, class loop *outer_loop,
> > +                 class loop *loop)
> > +{
> > +  class loop *colder_loop = NULL;
> > +  if (outer_loop)
> > +    {
> > +      if (bb_colder_than_loop_preheader (loop_preheader_edge (outer_loop)->src,
> > +                                      loop_preheader_edge (loop)->src))
> > +     colder_loop = outer_loop;
> > +      else if (colder_than_inner_loop[outer_loop->num]
> > +            && bb_colder_than_loop_preheader (
> > +              loop_preheader_edge (colder_than_inner_loop[outer_loop->num])
> > +                ->src,
> > +              loop_preheader_edge (loop)->src))
> > +     colder_loop = colder_than_inner_loop[outer_loop->num];
> > +    }
> > +
> > +  if (bb_colder_than_loop_preheader (loop_preheader_edge (loop)->src,
> > +                                  loop_preheader_edge (coldest_loop)->src))
> > +    coldest_loop = loop;
> > +
> > +  coldest_outermost_loop[loop->num] = coldest_loop;
> > +  colder_than_inner_loop[loop->num] = colder_loop;
> > +  if (dump_enabled_p ())
> > +    {
> > +      dump_printf (MSG_NOTE, "loop %d's coldest_outermost_loop is %d, ",
> > +                loop->num, coldest_loop->num);
> > +      if (colder_loop)
> > +     dump_printf (MSG_NOTE, "colder_than_inner_loop is %d\n",
> > +                  colder_loop->num);
> > +      else
> > +     dump_printf (MSG_NOTE, "colder_than_inner_loop is NULL\n");
> > +    }
> > +
> > +  class loop *inner_loop;
> > +  for (inner_loop = loop->inner; inner_loop; inner_loop = inner_loop->next)
> > +    fill_cold_out_loop (coldest_loop, loop, inner_loop);
> > +}
> >
> >  /* Compute the global information needed by the loop invariant motion pass.  */
> >
> > @@ -3237,6 +3362,9 @@ tree_ssa_lim_finalize (void)
> >      free_affine_expand_cache (&memory_accesses.ttae_cache);
> >
> >    free (bb_loop_postorder);
> > +
> > +  free (coldest_outermost_loop);
> > +  free (colder_than_inner_loop);
> >  }
> >
> >  /* Moves invariants from loops.  Only "expensive" invariants are moved out --
> > @@ -3256,6 +3384,14 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
> >    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
> >    fill_always_executed_in ();
> >
> > +  /* Pre-compute coldest outermost loop and nearest colder loop  of each loop.
> > +   */
> > +  class loop *loop;
> > +  coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun));
> > +  colder_than_inner_loop = XNEWVEC (class loop *, number_of_loops (cfun));
> > +  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> > +    fill_cold_out_loop (loop, NULL, loop);
> > +
> >    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
> >    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> > index 638bf38db8c..641c91e719e 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> > @@ -23,4 +23,4 @@ float h ()
> >       F[0] += E / d;
> >  }
> >
> > -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> > +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> > new file mode 100644
> > index 00000000000..7326a230b3f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> > +
> > +volatile int x;
> > +void
> > +bar (int, char *, char *);
> > +void
> > +foo (int *a, int n, int k)
> > +{
> > +  int i;
> > +
> > +  for (i = 0; i < n; i++)
> > +    {
> > +      if (__builtin_expect (x, 0))
> > +     bar (k / 5, "one", "two");
> > +      a[i] = k;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> > new file mode 100644
> > index 00000000000..51c1913d003
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> > @@ -0,0 +1,29 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> > +
> > +volatile int x;
> > +void
> > +bar (int, char *, char *);
> > +void
> > +foo (int *a, int n, int m, int s, int t)
> > +{
> > +  int i;
> > +  int j;
> > +  int k;
> > +
> > +  for (i = 0; i < m; i++) // Loop 1
> > +    {
> > +      if (__builtin_expect (x, 0))
> > +     for (j = 0; j < n; j++) // Loop 2
> > +       for (k = 0; k < n; k++) // Loop 3
> > +         {
> > +           bar (s / 5, "one", "two");
> > +           a[t] = s;
> > +         }
> > +      a[t] = t;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
> > +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> > new file mode 100644
> > index 00000000000..bc60a040a70
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do compile  } */
> > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> > +
> > +/* Test that `count' is not hoisted out of loop when bb is cold.  */
> > +
> > +int count;
> > +volatile int x;
> > +
> > +struct obj {
> > +  int data;
> > +  struct obj *next;
> > +
> > +} *q;
> > +
> > +void
> > +func (int m)
> > +{
> > +  struct obj *p;
> > +  for (int i = 0; i < m; i++)
> > +    if (__builtin_expect (x, 0))
> > +      count++;
> > +
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> > new file mode 100644
> > index 00000000000..ffe6f8f699d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> > @@ -0,0 +1,35 @@
> > +/* { dg-do compile  } */
> > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> > +
> > +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop
> > +   when it is in cold loop.  */
> > +
> > +int count;
> > +volatile int x;
> > +
> > +struct obj {
> > +  int data;
> > +  int data1;
> > +  struct obj *next;
> > +};
> > +
> > +void
> > +func (int m, int n, int k, struct obj *a)
> > +{
> > +  struct obj *q = a;
> > +  for (int j = 0; j < m; j++)
> > +    if (__builtin_expect (m, 0))
> > +      for (int i = 0; i < m; i++)
> > +     {
> > +       if (__builtin_expect (x, 0))
> > +         {
> > +           count++;
> > +           q->data += 3; /* Not hoisted out to inner loop. */
> > +         }
> > +       count += n;
> > +       q->data1 += k; /* Not hoisted out to outer loop. */
> > +     }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> > +
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> > new file mode 100644
> > index 00000000000..16ba4ceb8ab
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> > @@ -0,0 +1,32 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> > +
> > +volatile int x;
> > +volatile int y;
> > +void
> > +bar (int, char *, char *);
> > +void
> > +foo (int *a, int n, int m, int s, int t)
> > +{
> > +  int i;
> > +  int j;
> > +  int k;
> > +
> > +  for (i = 0; i < m; i++) // Loop 1
> > +    {
> > +      if (__builtin_expect (x, 0))
> > +     for (j = 0; j < n; j++) // Loop 2
> > +       if (__builtin_expect (y, 0))
> > +         for (k = 0; k < n; k++) // Loop 3
> > +           {
> > +             bar (s / 5, "one", "two");
> > +             a[t] = s;
> > +           }
> > +      a[t] = t;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
> > +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> > +
> > +
> >
>
> --
> Thanks,
> Xionghu
  
Richard Biener Dec. 1, 2021, 10:09 a.m. UTC | #3
On Wed, Nov 10, 2021 at 4:08 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2021/11/4 21:00, Richard Biener wrote:
> > On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>
> >>
> >>> +  while (outmost_loop != loop)
> >>> +    {
> >>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
> >>> (outmost_loop)->src,
> >>> +                                        loop_preheader_edge (cold_loop)->src))
> >>> +       cold_loop = outmost_loop;
> >>> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> >>> +    }
> >>>
> >>> could be instead written as
> >>>
> >>>   coldest_loop = coldest_outermost_loop[loop->num];
> >>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> >>>     return outermost_loop;
> >>>   return coldest_loop;
> >>>
> >>> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
> >>> It should be possible to compute such cache in a DFS walk of the loop tree
> >>> (the loop iterator by default visits in such order).
> >>
> >>
> >> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
> >> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
> >> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
> >> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
> >>
> >>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
> >>
> >> then function find_coldest_out_loop will return a loop NOT accord with our
> >> expectation, that should return second_coldest_loop instead of outermost_loop?
> > Hmm, interesting - yes.  I guess the common case will be that the pre-computed
> > outermost loop will be the loop at depth 1 since outer loops tend to
> > be colder than
> > inner loops?  That would then defeat the whole exercise.
>
> It is not easy to construct such cases, But finally I got below results,
>
> 1) many cases inner loop is hotter than outer loop, for example:
>
> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>
>
> 2) But there are also cases inner loop is colder than outer loop, like:
>
> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
>
>
> >
> > To optimize the common case but not avoiding iteration in the cases we care
> > about we could instead cache the next outermost loop that is _not_ colder
> > than loop.  So for your [ ... ] example above we'd have> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
> > candidate would then be 'second_coldest_loop' and we'd then iterate
> > to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
> > cold candidate we can compare against?  For the common case we'd
> > have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
> > simply pick 'outermost_loop'.
>
> Thanks.  It was difficult to understand, but finally I got to know what you
> want to express :)
>
> We should cache the next loop that is *colder* than loop instead of '_not_ colder
> than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
> then it makes sense if the coldest loop is outside of outermost loop, continue to
> find a colder loop between outermost loop and current loop in
> colder_than_inner_loop[loop->num]?  Hope I understood you correctly...

Heh, looking at the patch - I don't know.

To make the calls to bb_colder_than_loop_preheader more obvious can you
change that function to

+bool bb_colder_than_loop_preheader (basic_block bb, loop *loop)
+{
+  return bb->count < loop_preheader_edge (loop)->src->count;

?

+      if (colder_than_inner_loop[loop->num] != NULL
+         && loop_depth (outermost_loop)
+              < loop_depth (colder_than_inner_loop[loop->num]))
+       return colder_than_inner_loop[loop->num];

that is

+  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
+  colder loop between OUTERMOST_LOOP and loop in pre-computed
+  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */

since we index colder_than_inner_loop by loop->num as well this doesn't find the
coldest outer loop of 'loop' within the nest up to outermost_loop,
correct?  For the
"common" case of each successive outer loop being colder than the immediate
inner loop doesn't that result in only hoisting one level in case the
coldest outer loop
is outside of the outermost loop we can legally hoist to?

My suggestion to cache the next outer "hotter" loop was to make the search for
the optimal hoisting position O(1) for the common case (there is no
hotter outer loop),
allowing to do

   hot_loop = next_hot_loop[loop->num];
   if (!hot_loop || loop_depth (hot_loop) < loop_depth (outermost_loop))
     return outermost_loop;

and do the linear search in the other uncommon case (eventually the cache
can be exploited to optimize that search as well).

The cache should guarantee that for loops between next_hot_loop[loop->num]
and loop each loop is colder than its inner loop, so the immediate inner loop
of 'hot_loop' is the coldest place in the sub-nest.

Basically the cache should make the common case O(1) (well, I _do_ hope
that is the common case ;))

Btw, can you make the cache array a vec<class loop> * please so we get
index checking?

> >
> > One comment on the patch itself below.
> >
>
> The loop in fill_cold_out_loop is also removed in the updated v7 patch.
>
>
>
> [PATCH v7 2/2] Don't move cold code out of loop by checking bb count
>
> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>
> v7 changes:
> 1. Refine get_coldest_out_loop to replace loop with checking
> pre-computed coldest_outermost_loop and colder_than_inner_loop.
> 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
> colder_than_inner_loop recursively without loop.
>
> v6 changes:
> 1. Add function fill_coldest_out_loop to pre compute the coldest
> outermost loop for each loop.
> 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
>
> v5 changes:
> 1. Refine comments for new functions.
> 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> to align with function name.
> 3. Refine with simpler implementation for get_coldest_out_loop and
> ref_in_loop_hot_body::operator for better understanding.
>
> v4 changes:
> 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> 3. Split RTL invariant motion part out.
> 4. Remove aux changes.
>
> v3 changes:
> 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
>
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
>
> There was a patch trying to avoid move cold block out of loop:
>
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
>
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
>
> In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likely for store motion, if all locations of the REF in loop is cold,
> don't do store motion of it.
>
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
>
> gcc/ChangeLog:
>
>         * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
>         function.
>         (get_coldest_out_loop): New function.
>         (determine_max_movement): Use get_coldest_out_loop.
>         (move_computations_worker): Adjust and fix iteration udpate.
>         (class ref_in_loop_hot_body): New functor.
>         (ref_in_loop_hot_body::operator): New.
>         (can_sm_ref_p): Use for_all_locs_in_loop.
>         (fill_cold_out_loop): New.
>         (tree_ssa_lim_finalize): Free coldest_outermost_loop and
>         colder_than_inner_loop.
>         (loop_invariant_motion_in_fun): Call fill_cold_out_loop.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/recip-3.c: Adjust.
>         * gcc.dg/tree-ssa/ssa-lim-18.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-21.c: New test.
>         * gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> ---
>  gcc/tree-ssa-loop-im.c                     | 140 ++++++++++++++++++++-
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c |  20 +++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 +++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 ++++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
>  7 files changed, 280 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
>
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 4b187c2cdaf..e9b3d0cba93 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -146,6 +146,11 @@ public:
>  enum dep_kind { lim_raw, sm_war, sm_waw };
>  enum dep_state { dep_unknown, dep_independent, dep_dependent };
>
> +/* coldest outermost loop for given loop.  */
> +class loop **coldest_outermost_loop;
> +/* colder outer loop nerest to given loop.  */
> +class loop **colder_than_inner_loop;
> +
>  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
>
>  static void
> @@ -417,6 +422,53 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
>
> +/* Compare the profile count inequality of bb and preheader, it is three-state
> +   as stated in profile-count.h, FALSE is returned if inequality cannot be
> +   decided.  */
> +bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
> +{
> +  gcc_assert (bb && preheader);
> +  return bb->count < preheader->count;
> +}
> +
> +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
> +   count.
> +  It does three steps check:
> +  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
> +  return NULL which means it should not be moved out at all;
> +  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> +  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> +  colder loop between OUTERMOST_LOOP and loop in pre-computed
> +  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
> +
> +static class loop *
> +get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
> +                     basic_block curr_bb)
> +{
> +  gcc_assert (outermost_loop == loop
> +             || flow_loop_nested_p (outermost_loop, loop));
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
> +  if (curr_bb
> +      && bb_colder_than_loop_preheader (curr_bb,
> +                                       loop_preheader_edge (loop)->src))
> +    return NULL;
> +
> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> +    {
> +      if (colder_than_inner_loop[loop->num] != NULL
> +         && loop_depth (outermost_loop)
> +              < loop_depth (colder_than_inner_loop[loop->num]))
> +       return colder_than_inner_loop[loop->num];
> +      return outermost_loop;
> +    }
> +  return coldest_loop;
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -685,7 +737,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
>      level = superloop_at_depth (loop, 1);
> -  lim_data->max_loop = level;
> +  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
> +  if (!lim_data->max_loop)
> +    return false;
>
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
>      {
> @@ -1221,7 +1275,10 @@ move_computations_worker (basic_block bb)
>        /* We do not really want to move conditionals out of the loop; we just
>          placed it here to force its operands to be moved if necessary.  */
>        if (gimple_code (stmt) == GIMPLE_COND)
> -       continue;
> +       {
> +         gsi_next (&bsi);
> +         continue;
> +       }
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         {
> @@ -2887,6 +2944,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
>    return indep_p;
>  }
>
> +class ref_in_loop_hot_body
> +{
> +public:
> +  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
> +  bool operator () (mem_ref_loc *loc);
> +  class loop *l;
> +};
> +
> +/* Check the coldest loop between loop L and innermost loop.  If there is one
> +   cold loop between L and INNER_LOOP, store motion can be performed, otherwise
> +   no cold loop means no store motion.  get_coldest_out_loop also handles cases
> +   when l is inner_loop.  */
> +bool
> +ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
> +{
> +  basic_block curr_bb = gimple_bb (loc->stmt);
> +  class loop *inner_loop = curr_bb->loop_father;
> +  return get_coldest_out_loop (l, inner_loop, curr_bb);
> +}
> +
>
>  /* Returns true if we can perform store motion of REF from LOOP.  */
>
> @@ -2941,6 +3018,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
>    if (!ref_indep_loop_p (loop, ref, sm_war))
>      return false;
>
> +  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
> +    candidate's profile count is hot.  Statement in cold BB shouldn't be moved
> +    out of it's loop_father.  */
> +  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> +    return false;
> +
>    return true;
>  }
>
> @@ -3153,6 +3236,48 @@ fill_always_executed_in (void)
>      fill_always_executed_in_1 (loop, contains_call);
>  }
>
> +/* Find the coldest loop preheader for LOOP, also find the nearest colder loop
> +   to LOOP.  Then recursively iterate each inner loop.  */
> +
> +void
> +fill_cold_out_loop (class loop *coldest_loop, class loop *outer_loop,
> +                   class loop *loop)
> +{
> +  class loop *colder_loop = NULL;
> +  if (outer_loop)
> +    {
> +      if (bb_colder_than_loop_preheader (loop_preheader_edge (outer_loop)->src,
> +                                        loop_preheader_edge (loop)->src))
> +       colder_loop = outer_loop;
> +      else if (colder_than_inner_loop[outer_loop->num]
> +              && bb_colder_than_loop_preheader (
> +                loop_preheader_edge (colder_than_inner_loop[outer_loop->num])
> +                  ->src,
> +                loop_preheader_edge (loop)->src))
> +       colder_loop = colder_than_inner_loop[outer_loop->num];
> +    }
> +
> +  if (bb_colder_than_loop_preheader (loop_preheader_edge (loop)->src,
> +                                    loop_preheader_edge (coldest_loop)->src))
> +    coldest_loop = loop;
> +
> +  coldest_outermost_loop[loop->num] = coldest_loop;
> +  colder_than_inner_loop[loop->num] = colder_loop;
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf (MSG_NOTE, "loop %d's coldest_outermost_loop is %d, ",
> +                  loop->num, coldest_loop->num);
> +      if (colder_loop)
> +       dump_printf (MSG_NOTE, "colder_than_inner_loop is %d\n",
> +                    colder_loop->num);
> +      else
> +       dump_printf (MSG_NOTE, "colder_than_inner_loop is NULL\n");
> +    }
> +
> +  class loop *inner_loop;
> +  for (inner_loop = loop->inner; inner_loop; inner_loop = inner_loop->next)
> +    fill_cold_out_loop (coldest_loop, loop, inner_loop);
> +}
>
>  /* Compute the global information needed by the loop invariant motion pass.  */
>
> @@ -3237,6 +3362,9 @@ tree_ssa_lim_finalize (void)
>      free_affine_expand_cache (&memory_accesses.ttae_cache);
>
>    free (bb_loop_postorder);
> +
> +  free (coldest_outermost_loop);
> +  free (colder_than_inner_loop);
>  }
>
>  /* Moves invariants from loops.  Only "expensive" invariants are moved out --
> @@ -3256,6 +3384,14 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
>    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
>    fill_always_executed_in ();
>
> +  /* Pre-compute coldest outermost loop and nearest colder loop  of each loop.
> +   */
> +  class loop *loop;
> +  coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun));
> +  colder_than_inner_loop = XNEWVEC (class loop *, number_of_loops (cfun));
> +  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> +    fill_cold_out_loop (loop, NULL, loop);
> +
>    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> index 638bf38db8c..641c91e719e 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
> @@ -23,4 +23,4 @@ float h ()
>         F[0] += E / d;
>  }
>
> -/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
> +/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> new file mode 100644
> index 00000000000..7326a230b3f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int k)
> +{
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    {
> +      if (__builtin_expect (x, 0))
> +       bar (k / 5, "one", "two");
> +      a[i] = k;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..51c1913d003
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int s, int t)
> +{
> +  int i;
> +  int j;
> +  int k;
> +
> +  for (i = 0; i < m; i++) // Loop 1
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++) // Loop 2
> +         for (k = 0; k < n; k++) // Loop 3
> +           {
> +             bar (s / 5, "one", "two");
> +             a[t] = s;
> +           }
> +      a[t] = t;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> new file mode 100644
> index 00000000000..bc60a040a70
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `count' is not hoisted out of loop when bb is cold.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  struct obj *next;
> +
> +} *q;
> +
> +void
> +func (int m)
> +{
> +  struct obj *p;
> +  for (int i = 0; i < m; i++)
> +    if (__builtin_expect (x, 0))
> +      count++;
> +
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> new file mode 100644
> index 00000000000..ffe6f8f699d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> @@ -0,0 +1,35 @@
> +/* { dg-do compile  } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop
> +   when it is in cold loop.  */
> +
> +int count;
> +volatile int x;
> +
> +struct obj {
> +  int data;
> +  int data1;
> +  struct obj *next;
> +};
> +
> +void
> +func (int m, int n, int k, struct obj *a)
> +{
> +  struct obj *q = a;
> +  for (int j = 0; j < m; j++)
> +    if (__builtin_expect (m, 0))
> +      for (int i = 0; i < m; i++)
> +       {
> +         if (__builtin_expect (x, 0))
> +           {
> +             count++;
> +             q->data += 3; /* Not hoisted out to inner loop. */
> +           }
> +         count += n;
> +         q->data1 += k; /* Not hoisted out to outer loop. */
> +       }
> +}
> +
> +/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
> +
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> new file mode 100644
> index 00000000000..16ba4ceb8ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> @@ -0,0 +1,32 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +volatile int x;
> +volatile int y;
> +void
> +bar (int, char *, char *);
> +void
> +foo (int *a, int n, int m, int s, int t)
> +{
> +  int i;
> +  int j;
> +  int k;
> +
> +  for (i = 0; i < m; i++) // Loop 1
> +    {
> +      if (__builtin_expect (x, 0))
> +       for (j = 0; j < n; j++) // Loop 2
> +         if (__builtin_expect (y, 0))
> +           for (k = 0; k < n; k++) // Loop 3
> +             {
> +               bar (s / 5, "one", "two");
> +               a[t] = s;
> +             }
> +      a[t] = t;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
> +/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
> +
> +
> --
> 2.27.0.90.geebb51ba8c
>
  
Xionghu Luo Dec. 6, 2021, 5:09 a.m. UTC | #4
On 2021/12/1 18:09, Richard Biener wrote:
> On Wed, Nov 10, 2021 at 4:08 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/11/4 21:00, Richard Biener wrote:
>>> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>>
>>>>> +  while (outmost_loop != loop)
>>>>> +    {
>>>>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
>>>>> (outmost_loop)->src,
>>>>> +                                        loop_preheader_edge (cold_loop)->src))
>>>>> +       cold_loop = outmost_loop;
>>>>> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
>>>>> +    }
>>>>>
>>>>> could be instead written as
>>>>>
>>>>>   coldest_loop = coldest_outermost_loop[loop->num];
>>>>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>>>>>     return outermost_loop;
>>>>>   return coldest_loop;
>>>>>
>>>>> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
>>>>> It should be possible to compute such cache in a DFS walk of the loop tree
>>>>> (the loop iterator by default visits in such order).
>>>>
>>>>
>>>> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
>>>> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
>>>> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
>>>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>>>>
>>>>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
>>>>
>>>> then function find_coldest_out_loop will return a loop NOT accord with our
>>>> expectation, that should return second_coldest_loop instead of outermost_loop?
>>> Hmm, interesting - yes.  I guess the common case will be that the pre-computed
>>> outermost loop will be the loop at depth 1 since outer loops tend to
>>> be colder than
>>> inner loops?  That would then defeat the whole exercise.
>>
>> It is not easy to construct such cases, But finally I got below results,
>>
>> 1) many cases inner loop is hotter than outer loop, for example:
>>
>> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
>> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
>> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>>
>>
>> 2) But there are also cases inner loop is colder than outer loop, like:
>>
>> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
>> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
>> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
>>
>>
>>>
>>> To optimize the common case but not avoiding iteration in the cases we care
>>> about we could instead cache the next outermost loop that is _not_ colder
>>> than loop.  So for your [ ... ] example above we'd have> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
>>> candidate would then be 'second_coldest_loop' and we'd then iterate
>>> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
>>> cold candidate we can compare against?  For the common case we'd
>>> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
>>> simply pick 'outermost_loop'.
>>
>> Thanks.  It was difficult to understand, but finally I got to know what you
>> want to express :)
>>
>> We should cache the next loop that is *colder* than loop instead of '_not_ colder
>> than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
>> then it makes sense if the coldest loop is outside of outermost loop, continue to
>> find a colder loop between outermost loop and current loop in
>> colder_than_inner_loop[loop->num]?  Hope I understood you correctly...
> 
> Heh, looking at the patch - I don't know.
> 
> To make the calls to bb_colder_than_loop_preheader more obvious can you
> change that function to
> 
> +bool bb_colder_than_loop_preheader (basic_block bb, loop *loop)
> +{
> +  return bb->count < loop_preheader_edge (loop)->src->count;
> 
> ?
> 
> +      if (colder_than_inner_loop[loop->num] != NULL
> +         && loop_depth (outermost_loop)
> +              < loop_depth (colder_than_inner_loop[loop->num]))
> +       return colder_than_inner_loop[loop->num];
> 
> that is
> 
> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> +  colder loop between OUTERMOST_LOOP and loop in pre-computed
> +  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
> 
> since we index colder_than_inner_loop by loop->num as well this doesn't find the
> coldest outer loop of 'loop' within the nest up to outermost_loop,
> correct?  For the
> "common" case of each successive outer loop being colder than the immediate
> inner loop doesn't that result in only hoisting one level in case the
> coldest outer loop
> is outside of the outermost loop we can legally hoist to?
> 
> My suggestion to cache the next outer "hotter" loop was to make the search for
> the optimal hoisting position O(1) for the common case (there is no
> hotter outer loop),
> allowing to do
> 
>    hot_loop = next_hot_loop[loop->num];
>    if (!hot_loop || loop_depth (hot_loop) < loop_depth (outermost_loop))
>      return outermost_loop;
> 
> and do the linear search in the other uncommon case (eventually the cache
> can be exploited to optimize that search as well).
> 
> The cache should guarantee that for loops between next_hot_loop[loop->num]
> and loop each loop is colder than its inner loop, so the immediate inner loop
> of 'hot_loop' is the coldest place in the sub-nest.
> 
> Basically the cache should make the common case O(1) (well, I _do_ hope
> that is the common case ;))
> 
> Btw, can you make the cache array a vec<class loop> * please so we get
> index checking?
> 

Thanks, updated with your comments to use hotter_than_inner_loop instead of
colder_than_inner_loop, it works.

For typical case like below:

for (int j = 0; j < m; j++) // Loop 1
  if (__builtin_expect (y, 0))
   for (int j2 = 0; j2 < 256; j2++) // Loop 2
      for (int i = 0; i < m; i++) // Loop 3
        for (int s = 0; s < m; s++) // Loop 4
          for (int s1 = 0; s1 < m; s1++) // Loop 5
            if (__builtin_expect (y, 0))
              for (int s2 = 0; s2 < m; s2++) // Loop 6


The output is:

loop 1's coldest_outermost_loop is 1, hotter_than_inner_loop is NULL
loop 2's coldest_outermost_loop is 2, hotter_than_inner_loop is 1
loop 3's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
loop 4's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
loop 5's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
loop 6's coldest_outermost_loop is 2, hotter_than_inner_loop is 5



v8-0002-Don-t-move-cold-code-out-of-loop-by-checking-bb-c.patch


From: Xiong Hu Luo <luoxhu@linux.ibm.com>

v8 changes:
1. Use hotter_than_inner_loop instead of colder to store a hotter loop
nearest to loop.
2. Update the logic in fill_coldest_and_hotter_out_loop and
get_coldest_out_loop to make common case O(1).
3. Update function argument bb_colder_than_loop_preheader.
4. Make cached array to vec<class *loop> for index checking.

v7 changes:
1. Refine get_coldest_out_loop to replace loop with checking
pre-computed coldest_outermost_loop and colder_than_inner_loop.
2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
colder_than_inner_loop recursively without loop.

v6 changes:
1. Add function fill_coldest_out_loop to pre compute the coldest
outermost loop for each loop.
2. Rename find_coldest_out_loop to get_coldest_out_loop.
3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.

v5 changes:
1. Refine comments for new functions.
2. Use basic_block instead of count in bb_colder_than_loop_preheader
to align with function name.
3. Refine with simpler implementation for get_coldest_out_loop and
ref_in_loop_hot_body::operator for better understanding.

v4 changes:
1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
3. Split RTL invariant motion part out.
4. Remove aux changes.

v3 changes:
1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
2. Remove unnecessary changes.
3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
infinite loop when implementing v1 and the iteration is missed to be
updated actually.

v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html

There was a patch trying to avoid move cold block out of loop:

https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html

Richard suggested to "never hoist anything from a bb with lower execution
frequency to a bb with higher one in LIM invariantness_dom_walker
before_dom_children".

In gimple LIM analysis, add get_coldest_out_loop to move invariants to
expected target loop, if profile count of the loop bb is colder
than target loop preheader, it won't be hoisted out of loop.
Likely for store motion, if all locations of the REF in loop is cold,
don't do store motion of it.

SPEC2017 performance evaluation shows 1% performance improvement for
intrate GEOMEAN and no obvious regression for others.  Especially,
500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
on P8LE.

gcc/ChangeLog:

	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
	function.
	(get_coldest_out_loop): New function.
	(determine_max_movement): Use get_coldest_out_loop.
	(move_computations_worker): Adjust and fix iteration udpate.
	(class ref_in_loop_hot_body): New functor.
	(ref_in_loop_hot_body::operator): New.
	(can_sm_ref_p): Use for_all_locs_in_loop.
	(fill_coldest_and_hotter_out_loop): New.
	(tree_ssa_lim_finalize): Free coldest_outermost_loop and
	hotter_than_inner_loop.
	(loop_invariant_motion_in_fun): Call fill_coldet_and_hotter_out_loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/recip-3.c: Adjust.
	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-21.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-22.c: New test.
	* gcc.dg/tree-ssa/ssa-lim-23.c: New test.
---
 gcc/tree-ssa-loop-im.c                     | 150 ++++++++++++++++++++-
 gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c |  21 +++
 7 files changed, 291 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 4b187c2cdaf..0bcc3bca91b 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -146,6 +146,11 @@ public:
 enum dep_kind { lim_raw, sm_war, sm_waw };
 enum dep_state { dep_unknown, dep_independent, dep_dependent };
 
+/* coldest outermost loop for given loop.  */
+vec<class loop *> coldest_outermost_loop;
+/* hotter outer loop nearest to given loop.  */
+vec<class loop *> hotter_than_inner_loop;
+
 /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
 
 static void
@@ -417,6 +422,61 @@ movement_possibility (gimple *stmt)
   return ret;
 }
 
+/* Compare the profile count inequality of bb and loop's preheader, it is
+   three-state as stated in profile-count.h, FALSE is returned if inequality
+   cannot be decided.  */
+bool
+bb_colder_than_loop_preheader (basic_block bb, class loop *loop)
+{
+  gcc_assert (bb && loop);
+  return bb->count < loop_preheader_edge (loop)->src->count;
+}
+
+/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
+   count.
+  It does three steps check:
+  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
+  return NULL which means it should not be moved out at all;
+  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
+  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
+  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
+  hotter loop between OUTERMOST_LOOP and loop in pre-computed
+  HOTTER_THAN_INNER_LOOP, return it's nested inner loop, otherwise return
+  OUTERMOST_LOOP.  */
+
+static class loop *
+get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
+		      basic_block curr_bb)
+{
+  gcc_assert (outermost_loop == loop
+	      || flow_loop_nested_p (outermost_loop, loop));
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+    comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
+    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
+  if (curr_bb && bb_colder_than_loop_preheader (curr_bb, loop))
+    return NULL;
+
+  class loop *coldest_loop = coldest_outermost_loop[loop->num];
+  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
+    {
+      class loop *hotter_loop = hotter_than_inner_loop[loop->num];
+      if (!hotter_loop
+	  || loop_depth (hotter_loop) < loop_depth (outermost_loop))
+	return outermost_loop;
+
+      /*  hotter_loop is between OUTERMOST_LOOP and LOOP like:
+	[loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
+	hotter_loop, second_coldest_loop, ..., loop]
+	return second_coldest_loop to be the hoist target.  */
+      class loop *aloop;
+      for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
+	if (flow_loop_nested_p (aloop, loop))
+	  return aloop;
+    }
+  return coldest_loop;
+}
+
 /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
    loop to that we could move the expression using DEF if it did not have
    other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -685,7 +745,9 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
     level = ALWAYS_EXECUTED_IN (bb);
   else
     level = superloop_at_depth (loop, 1);
-  lim_data->max_loop = level;
+  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
+  if (!lim_data->max_loop)
+    return false;
 
   if (gphi *phi = dyn_cast <gphi *> (stmt))
     {
@@ -1221,7 +1283,10 @@ move_computations_worker (basic_block bb)
       /* We do not really want to move conditionals out of the loop; we just
 	 placed it here to force its operands to be moved if necessary.  */
       if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
@@ -2887,6 +2952,26 @@ ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
   return indep_p;
 }
 
+class ref_in_loop_hot_body
+{
+public:
+  ref_in_loop_hot_body (class loop *loop_) : l (loop_) {}
+  bool operator () (mem_ref_loc *loc);
+  class loop *l;
+};
+
+/* Check the coldest loop between loop L and innermost loop.  If there is one
+   cold loop between L and INNER_LOOP, store motion can be performed, otherwise
+   no cold loop means no store motion.  get_coldest_out_loop also handles cases
+   when l is inner_loop.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  return get_coldest_out_loop (l, inner_loop, curr_bb);
+}
+
 
 /* Returns true if we can perform store motion of REF from LOOP.  */
 
@@ -2941,6 +3026,12 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
   if (!ref_indep_loop_p (loop, ref, sm_war))
     return false;
 
+  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
+    candidate's profile count is hot.  Statement in cold BB shouldn't be moved
+    out of it's loop_father.  */
+  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
+    return false;
+
   return true;
 }
 
@@ -3153,6 +3244,48 @@ fill_always_executed_in (void)
     fill_always_executed_in_1 (loop, contains_call);
 }
 
+/* Find the coldest loop preheader for LOOP, also find the nearest hotter loop
+   to LOOP.  Then recursively iterate each inner loop.  */
+
+void
+fill_coldest_and_hotter_out_loop (class loop *coldest_loop,
+				  class loop *hotter_loop, class loop *loop)
+{
+  if (bb_colder_than_loop_preheader (loop_preheader_edge (loop)->src,
+				     coldest_loop))
+    coldest_loop = loop;
+
+  coldest_outermost_loop[loop->num] = coldest_loop;
+
+  hotter_than_inner_loop[loop->num] = NULL;
+  class loop *outer_loop = loop_outer (loop);
+  if (hotter_loop
+      && bb_colder_than_loop_preheader (loop_preheader_edge (loop)->src,
+					hotter_loop))
+    hotter_than_inner_loop[loop->num] = hotter_loop;
+
+  if (outer_loop && outer_loop != current_loops->tree_root
+      && bb_colder_than_loop_preheader (loop_preheader_edge (loop)->src,
+					outer_loop))
+    hotter_than_inner_loop[loop->num] = outer_loop;
+
+  if (dump_enabled_p ())
+    {
+      dump_printf (MSG_NOTE, "loop %d's coldest_outermost_loop is %d, ",
+		   loop->num, coldest_loop->num);
+      if (hotter_than_inner_loop[loop->num])
+	dump_printf (MSG_NOTE, "hotter_than_inner_loop is %d\n",
+		     hotter_than_inner_loop[loop->num]->num);
+      else
+	dump_printf (MSG_NOTE, "hotter_than_inner_loop is NULL\n");
+    }
+
+  class loop *inner_loop;
+  for (inner_loop = loop->inner; inner_loop; inner_loop = inner_loop->next)
+    fill_coldest_and_hotter_out_loop (coldest_loop,
+				      hotter_than_inner_loop[loop->num],
+				      inner_loop);
+}
 
 /* Compute the global information needed by the loop invariant motion pass.  */
 
@@ -3237,6 +3370,9 @@ tree_ssa_lim_finalize (void)
     free_affine_expand_cache (&memory_accesses.ttae_cache);
 
   free (bb_loop_postorder);
+
+  coldest_outermost_loop.release ();
+  hotter_than_inner_loop.release ();
 }
 
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
@@ -3256,6 +3392,16 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
   fill_always_executed_in ();
 
+  /* Pre-compute coldest outermost loop and nearest hotter loop of each loop.
+   */
+  class loop *loop;
+  coldest_outermost_loop.create (number_of_loops (cfun));
+  coldest_outermost_loop.safe_grow_cleared (number_of_loops (cfun));
+  hotter_than_inner_loop.create (number_of_loops (cfun));
+  hotter_than_inner_loop.safe_grow_cleared (number_of_loops (cfun));
+  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
+    fill_coldest_and_hotter_out_loop (loop, NULL, loop);
+
   int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
   int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@ float h ()
 	F[0] += E / d;
 }
 
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..51c1913d003
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int s, int t)
+{
+  int i;
+  int j;
+  int k;
+
+  for (i = 0; i < m; i++) // Loop 1
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++) // Loop 2
+	  for (k = 0; k < n; k++) // Loop 3
+	    {
+	      bar (s / 5, "one", "two");
+	      a[t] = s;
+	    }
+      a[t] = t;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..bc60a040a70
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,25 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `count' is not hoisted out of loop when bb is cold.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  struct obj *next;
+
+} *q;
+
+void
+func (int m)
+{
+  struct obj *p;
+  for (int i = 0; i < m; i++)
+    if (__builtin_expect (x, 0))
+      count++;
+
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
new file mode 100644
index 00000000000..ffe6f8f699d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
@@ -0,0 +1,35 @@
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop
+   when it is in cold loop.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  int data1;
+  struct obj *next;
+};
+
+void
+func (int m, int n, int k, struct obj *a)
+{
+  struct obj *q = a;
+  for (int j = 0; j < m; j++)
+    if (__builtin_expect (m, 0))
+      for (int i = 0; i < m; i++)
+	{
+	  if (__builtin_expect (x, 0))
+	    {
+	      count++;
+	      q->data += 3; /* Not hoisted out to inner loop. */
+	    }
+	  count += n;
+	  q->data1 += k; /* Not hoisted out to outer loop. */
+	}
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
new file mode 100644
index 00000000000..16ba4ceb8ab
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+volatile int y;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int s, int t)
+{
+  int i;
+  int j;
+  int k;
+
+  for (i = 0; i < m; i++) // Loop 1
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++) // Loop 2
+	  if (__builtin_expect (y, 0))
+	    for (k = 0; k < n; k++) // Loop 3
+	      {
+		bar (s / 5, "one", "two");
+		a[t] = s;
+	      }
+      a[t] = t;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c
new file mode 100644
index 00000000000..e7880746422
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int k)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+    {
+      if (__builtin_expect (x, 0))
+	bar (k / 5, "one", "two");
+      a[i] = k;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
+
  
Xionghu Luo Dec. 6, 2021, 5:26 a.m. UTC | #5
On 2021/12/6 13:09, Xionghu Luo via Gcc-patches wrote:
> 
> 
> On 2021/12/1 18:09, Richard Biener wrote:
>> On Wed, Nov 10, 2021 at 4:08 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>
>>>
>>>
>>> On 2021/11/4 21:00, Richard Biener wrote:
>>>> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>>
>>>>>
>>>>>> +  while (outmost_loop != loop)
>>>>>> +    {
>>>>>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
>>>>>> (outmost_loop)->src,
>>>>>> +                                        loop_preheader_edge (cold_loop)->src))
>>>>>> +       cold_loop = outmost_loop;
>>>>>> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
>>>>>> +    }
>>>>>>
>>>>>> could be instead written as
>>>>>>
>>>>>>   coldest_loop = coldest_outermost_loop[loop->num];
>>>>>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>>>>>>     return outermost_loop;
>>>>>>   return coldest_loop;
>>>>>>
>>>>>> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
>>>>>> It should be possible to compute such cache in a DFS walk of the loop tree
>>>>>> (the loop iterator by default visits in such order).
>>>>>
>>>>>
>>>>> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
>>>>> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
>>>>> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
>>>>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
>>>>>
>>>>>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
>>>>>
>>>>> then function find_coldest_out_loop will return a loop NOT accord with our
>>>>> expectation, that should return second_coldest_loop instead of outermost_loop?
>>>> Hmm, interesting - yes.  I guess the common case will be that the pre-computed
>>>> outermost loop will be the loop at depth 1 since outer loops tend to
>>>> be colder than
>>>> inner loops?  That would then defeat the whole exercise.
>>>
>>> It is not easy to construct such cases, But finally I got below results,
>>>
>>> 1) many cases inner loop is hotter than outer loop, for example:
>>>
>>> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
>>> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
>>> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>>> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
>>>
>>>
>>> 2) But there are also cases inner loop is colder than outer loop, like:
>>>
>>> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
>>> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
>>> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
>>>
>>>
>>>>
>>>> To optimize the common case but not avoiding iteration in the cases we care
>>>> about we could instead cache the next outermost loop that is _not_ colder
>>>> than loop.  So for your [ ... ] example above we'd have> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
>>>> candidate would then be 'second_coldest_loop' and we'd then iterate
>>>> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
>>>> cold candidate we can compare against?  For the common case we'd
>>>> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
>>>> simply pick 'outermost_loop'.
>>>
>>> Thanks.  It was difficult to understand, but finally I got to know what you
>>> want to express :)
>>>
>>> We should cache the next loop that is *colder* than loop instead of '_not_ colder
>>> than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
>>> then it makes sense if the coldest loop is outside of outermost loop, continue to
>>> find a colder loop between outermost loop and current loop in
>>> colder_than_inner_loop[loop->num]?  Hope I understood you correctly...
>>
>> Heh, looking at the patch - I don't know.
>>
>> To make the calls to bb_colder_than_loop_preheader more obvious can you
>> change that function to
>>
>> +bool bb_colder_than_loop_preheader (basic_block bb, loop *loop)
>> +{
>> +  return bb->count < loop_preheader_edge (loop)->src->count;
>>
>> ?
>>
>> +      if (colder_than_inner_loop[loop->num] != NULL
>> +         && loop_depth (outermost_loop)
>> +              < loop_depth (colder_than_inner_loop[loop->num]))
>> +       return colder_than_inner_loop[loop->num];
>>
>> that is
>>
>> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
>> +  colder loop between OUTERMOST_LOOP and loop in pre-computed
>> +  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
>>
>> since we index colder_than_inner_loop by loop->num as well this doesn't find the
>> coldest outer loop of 'loop' within the nest up to outermost_loop,
>> correct?  For the
>> "common" case of each successive outer loop being colder than the immediate
>> inner loop doesn't that result in only hoisting one level in case the
>> coldest outer loop
>> is outside of the outermost loop we can legally hoist to?
>>
>> My suggestion to cache the next outer "hotter" loop was to make the search for
>> the optimal hoisting position O(1) for the common case (there is no
>> hotter outer loop),
>> allowing to do
>>
>>    hot_loop = next_hot_loop[loop->num];
>>    if (!hot_loop || loop_depth (hot_loop) < loop_depth (outermost_loop))
>>      return outermost_loop;
>>
>> and do the linear search in the other uncommon case (eventually the cache
>> can be exploited to optimize that search as well).
>>
>> The cache should guarantee that for loops between next_hot_loop[loop->num]
>> and loop each loop is colder than its inner loop, so the immediate inner loop
>> of 'hot_loop' is the coldest place in the sub-nest.
>>
>> Basically the cache should make the common case O(1) (well, I _do_ hope
>> that is the common case ;))
>>
>> Btw, can you make the cache array a vec<class loop> * please so we get
>> index checking?
>>
> 
> Thanks, updated with your comments to use hotter_than_inner_loop instead of
> colder_than_inner_loop, it works.
> 
> For typical case like below:
> 
> for (int j = 0; j < m; j++) // Loop 1
>   if (__builtin_expect (y, 0))
>    for (int j2 = 0; j2 < 256; j2++) // Loop 2
>       for (int i = 0; i < m; i++) // Loop 3
>         for (int s = 0; s < m; s++) // Loop 4
>           for (int s1 = 0; s1 < m; s1++) // Loop 5
>             if (__builtin_expect (y, 0))
>               for (int s2 = 0; s2 < m; s2++) // Loop 6
> 
> 
> The output is:
> 
> loop 1's coldest_outermost_loop is 1, hotter_than_inner_loop is NULL
> loop 2's coldest_outermost_loop is 2, hotter_than_inner_loop is 1
> loop 3's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 4's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 5's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> loop 6's coldest_outermost_loop is 2, hotter_than_inner_loop is 5
> 
> 
> 
> v8-0002-Don-t-move-cold-code-out-of-loop-by-checking-bb-c.patch
> 
> 
> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> 
> v8 changes:
> 1. Use hotter_than_inner_loop instead of colder to store a hotter loop
> nearest to loop.
> 2. Update the logic in fill_coldest_and_hotter_out_loop and
> get_coldest_out_loop to make common case O(1).
> 3. Update function argument bb_colder_than_loop_preheader.
> 4. Make cached array to vec<class *loop> for index checking.
> 
> v7 changes:
> 1. Refine get_coldest_out_loop to replace loop with checking
> pre-computed coldest_outermost_loop and colder_than_inner_loop.
> 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
> colder_than_inner_loop recursively without loop.
> 
> v6 changes:
> 1. Add function fill_coldest_out_loop to pre compute the coldest
> outermost loop for each loop.
> 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
> 
> v5 changes:
> 1. Refine comments for new functions.
> 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> to align with function name.
> 3. Refine with simpler implementation for get_coldest_out_loop and
> ref_in_loop_hot_body::operator for better understanding.
> 
> v4 changes:
> 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> 3. Split RTL invariant motion part out.
> 4. Remove aux changes.
> 
> v3 changes:
> 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
> 2. Remove unnecessary changes.
> 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> infinite loop when implementing v1 and the iteration is missed to be
> updated actually.
> 
> v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
> 
> There was a patch trying to avoid move cold block out of loop:
> 
> https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> 
> Richard suggested to "never hoist anything from a bb with lower execution
> frequency to a bb with higher one in LIM invariantness_dom_walker
> before_dom_children".
> 
> In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> expected target loop, if profile count of the loop bb is colder
> than target loop preheader, it won't be hoisted out of loop.
> Likely for store motion, if all locations of the REF in loop is cold,
> don't do store motion of it.
> 
> SPEC2017 performance evaluation shows 1% performance improvement for
> intrate GEOMEAN and no obvious regression for others.  Especially,
> 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> on P8LE.
> 
> gcc/ChangeLog:
> 
> 	* tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
> 	function.
> 	(get_coldest_out_loop): New function.
> 	(determine_max_movement): Use get_coldest_out_loop.
> 	(move_computations_worker): Adjust and fix iteration udpate.
> 	(class ref_in_loop_hot_body): New functor.
> 	(ref_in_loop_hot_body::operator): New.
> 	(can_sm_ref_p): Use for_all_locs_in_loop.
> 	(fill_coldest_and_hotter_out_loop): New.
> 	(tree_ssa_lim_finalize): Free coldest_outermost_loop and
> 	hotter_than_inner_loop.
> 	(loop_invariant_motion_in_fun): Call fill_coldet_and_hotter_out_loop.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.dg/tree-ssa/recip-3.c: Adjust.
> 	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-21.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> 	* gcc.dg/tree-ssa/ssa-lim-23.c: New test.
> ---
>  gcc/tree-ssa-loop-im.c                     | 150 ++++++++++++++++++++-
>  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c |  21 +++
>  7 files changed, 291 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c
> 
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 4b187c2cdaf..0bcc3bca91b 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -146,6 +146,11 @@ public:
>  enum dep_kind { lim_raw, sm_war, sm_waw };
>  enum dep_state { dep_unknown, dep_independent, dep_dependent };
> 
> +/* coldest outermost loop for given loop.  */
> +vec<class loop *> coldest_outermost_loop;
> +/* hotter outer loop nearest to given loop.  */
> +vec<class loop *> hotter_than_inner_loop;
> +
>  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
> 
>  static void
> @@ -417,6 +422,61 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
> 
> +/* Compare the profile count inequality of bb and loop's preheader, it is
> +   three-state as stated in profile-count.h, FALSE is returned if inequality
> +   cannot be decided.  */
> +bool
> +bb_colder_than_loop_preheader (basic_block bb, class loop *loop)
> +{
> +  gcc_assert (bb && loop);
> +  return bb->count < loop_preheader_edge (loop)->src->count;
> +}
> +
> +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
> +   count.
> +  It does three steps check:
> +  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
> +  return NULL which means it should not be moved out at all;
> +  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> +  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> +  hotter loop between OUTERMOST_LOOP and loop in pre-computed
> +  HOTTER_THAN_INNER_LOOP, return it's nested inner loop, otherwise return
> +  OUTERMOST_LOOP.  */
> +
> +static class loop *
> +get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
> +		      basic_block curr_bb)
> +{
> +  gcc_assert (outermost_loop == loop
> +	      || flow_loop_nested_p (outermost_loop, loop));
> +
> +  /* If bb_colder_than_loop_preheader returns false due to three-state
> +    comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
> +    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
> +  if (curr_bb && bb_colder_than_loop_preheader (curr_bb, loop))
> +    return NULL;
> +
> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> +    {
> +      class loop *hotter_loop = hotter_than_inner_loop[loop->num];
> +      if (!hotter_loop
> +	  || loop_depth (hotter_loop) < loop_depth (outermost_loop))
> +	return outermost_loop;
> +
> +      /*  hotter_loop is between OUTERMOST_LOOP and LOOP like:
> +	[loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
> +	hotter_loop, second_coldest_loop, ..., loop]
> +	return second_coldest_loop to be the hoist target.  */
> +      class loop *aloop;
> +      for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
> +	if (flow_loop_nested_p (aloop, loop))

should be:

        if (aloop == loop || flow_loop_nested_p (aloop, loop))


> +	  return aloop;
> +    }
> +  return coldest_loop;
> +}
> +
  
Richard Biener Dec. 7, 2021, 12:17 p.m. UTC | #6
On Mon, Dec 6, 2021 at 6:26 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2021/12/6 13:09, Xionghu Luo via Gcc-patches wrote:
> >
> >
> > On 2021/12/1 18:09, Richard Biener wrote:
> >> On Wed, Nov 10, 2021 at 4:08 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2021/11/4 21:00, Richard Biener wrote:
> >>>> On Wed, Nov 3, 2021 at 2:29 PM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
> >>>>>
> >>>>>
> >>>>>> +  while (outmost_loop != loop)
> >>>>>> +    {
> >>>>>> +      if (bb_colder_than_loop_preheader (loop_preheader_edge
> >>>>>> (outmost_loop)->src,
> >>>>>> +                                        loop_preheader_edge (cold_loop)->src))
> >>>>>> +       cold_loop = outmost_loop;
> >>>>>> +      outmost_loop = superloop_at_depth (loop, loop_depth (outmost_loop) + 1);
> >>>>>> +    }
> >>>>>>
> >>>>>> could be instead written as
> >>>>>>
> >>>>>>   coldest_loop = coldest_outermost_loop[loop->num];
> >>>>>>   if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> >>>>>>     return outermost_loop;
> >>>>>>   return coldest_loop;
> >>>>>>
> >>>>>> ?  And in the usual case coldest_outermost_loop[L] would be the loop tree root.
> >>>>>> It should be possible to compute such cache in a DFS walk of the loop tree
> >>>>>> (the loop iterator by default visits in such order).
> >>>>>
> >>>>>
> >>>>> Thanks.  Updated the patch with your suggestion.  Not sure whether it strictly
> >>>>> conforms to your comments.  Though the patch passed all my added tests(coverage not enough),
> >>>>> I am still a bit worried if pre-computed coldest_loop is outside of outermost_loop, but
> >>>>> outermost_loop is not the COLDEST LOOP, i.e. (outer->inner)
> >>>>>
> >>>>>  [loop tree root, coldest_loop, outermost_loop,..., second_coldest_loop, ..., loop],
> >>>>>
> >>>>> then function find_coldest_out_loop will return a loop NOT accord with our
> >>>>> expectation, that should return second_coldest_loop instead of outermost_loop?
> >>>> Hmm, interesting - yes.  I guess the common case will be that the pre-computed
> >>>> outermost loop will be the loop at depth 1 since outer loops tend to
> >>>> be colder than
> >>>> inner loops?  That would then defeat the whole exercise.
> >>>
> >>> It is not easy to construct such cases, But finally I got below results,
> >>>
> >>> 1) many cases inner loop is hotter than outer loop, for example:
> >>>
> >>> loop 1's coldest_outermost_loop is 1, colder_than_inner_loop is NULL
> >>> loop 2's coldest_outermost_loop is 1, colder_than_inner_loop is 1
> >>> loop 3's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> >>> loop 4's coldest_outermost_loop is 1, colder_than_inner_loop is 2
> >>>
> >>>
> >>> 2) But there are also cases inner loop is colder than outer loop, like:
> >>>
> >>> loop 1's coldest outermost loop is 1, colder_than_inner_loop is NULL
> >>> loop 2's coldest outermost loop is 2, colder_than_inner_loop is NULL
> >>> loop 3's coldest outermost loop is 3, colder_than_inner_loop is NULL
> >>>
> >>>
> >>>>
> >>>> To optimize the common case but not avoiding iteration in the cases we care
> >>>> about we could instead cache the next outermost loop that is _not_ colder
> >>>> than loop.  So for your [ ... ] example above we'd have> hotter_than_inner_loop[loop] == outer (second_coldest_loop), where the
> >>>> candidate would then be 'second_coldest_loop' and we'd then iterate
> >>>> to hotter_than_inner_loop[hotter_than_inner_loop[loop]] to find the next
> >>>> cold candidate we can compare against?  For the common case we'd
> >>>> have hotter_than_inner_loop[looo] == NULL (no such loop) and we then
> >>>> simply pick 'outermost_loop'.
> >>>
> >>> Thanks.  It was difficult to understand, but finally I got to know what you
> >>> want to express :)
> >>>
> >>> We should cache the next loop that is *colder* than loop instead of '_not_ colder
> >>> than loop', and 'hotter_than_inner_loop' should be 'colder_than_inner_loop',
> >>> then it makes sense if the coldest loop is outside of outermost loop, continue to
> >>> find a colder loop between outermost loop and current loop in
> >>> colder_than_inner_loop[loop->num]?  Hope I understood you correctly...
> >>
> >> Heh, looking at the patch - I don't know.
> >>
> >> To make the calls to bb_colder_than_loop_preheader more obvious can you
> >> change that function to
> >>
> >> +bool bb_colder_than_loop_preheader (basic_block bb, loop *loop)
> >> +{
> >> +  return bb->count < loop_preheader_edge (loop)->src->count;
> >>
> >> ?
> >>
> >> +      if (colder_than_inner_loop[loop->num] != NULL
> >> +         && loop_depth (outermost_loop)
> >> +              < loop_depth (colder_than_inner_loop[loop->num]))
> >> +       return colder_than_inner_loop[loop->num];
> >>
> >> that is
> >>
> >> +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> >> +  colder loop between OUTERMOST_LOOP and loop in pre-computed
> >> +  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
> >>
> >> since we index colder_than_inner_loop by loop->num as well this doesn't find the
> >> coldest outer loop of 'loop' within the nest up to outermost_loop,
> >> correct?  For the
> >> "common" case of each successive outer loop being colder than the immediate
> >> inner loop doesn't that result in only hoisting one level in case the
> >> coldest outer loop
> >> is outside of the outermost loop we can legally hoist to?
> >>
> >> My suggestion to cache the next outer "hotter" loop was to make the search for
> >> the optimal hoisting position O(1) for the common case (there is no
> >> hotter outer loop),
> >> allowing to do
> >>
> >>    hot_loop = next_hot_loop[loop->num];
> >>    if (!hot_loop || loop_depth (hot_loop) < loop_depth (outermost_loop))
> >>      return outermost_loop;
> >>
> >> and do the linear search in the other uncommon case (eventually the cache
> >> can be exploited to optimize that search as well).
> >>
> >> The cache should guarantee that for loops between next_hot_loop[loop->num]
> >> and loop each loop is colder than its inner loop, so the immediate inner loop
> >> of 'hot_loop' is the coldest place in the sub-nest.
> >>
> >> Basically the cache should make the common case O(1) (well, I _do_ hope
> >> that is the common case ;))
> >>
> >> Btw, can you make the cache array a vec<class loop> * please so we get
> >> index checking?
> >>
> >
> > Thanks, updated with your comments to use hotter_than_inner_loop instead of
> > colder_than_inner_loop, it works.
> >
> > For typical case like below:
> >
> > for (int j = 0; j < m; j++) // Loop 1
> >   if (__builtin_expect (y, 0))
> >    for (int j2 = 0; j2 < 256; j2++) // Loop 2
> >       for (int i = 0; i < m; i++) // Loop 3
> >         for (int s = 0; s < m; s++) // Loop 4
> >           for (int s1 = 0; s1 < m; s1++) // Loop 5
> >             if (__builtin_expect (y, 0))
> >               for (int s2 = 0; s2 < m; s2++) // Loop 6
> >
> >
> > The output is:
> >
> > loop 1's coldest_outermost_loop is 1, hotter_than_inner_loop is NULL
> > loop 2's coldest_outermost_loop is 2, hotter_than_inner_loop is 1
> > loop 3's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> > loop 4's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> > loop 5's coldest_outermost_loop is 2, hotter_than_inner_loop is NULL
> > loop 6's coldest_outermost_loop is 2, hotter_than_inner_loop is 5
> >
> >
> >
> > v8-0002-Don-t-move-cold-code-out-of-loop-by-checking-bb-c.patch
> >
> >
> > From: Xiong Hu Luo <luoxhu@linux.ibm.com>
> >
> > v8 changes:
> > 1. Use hotter_than_inner_loop instead of colder to store a hotter loop
> > nearest to loop.
> > 2. Update the logic in fill_coldest_and_hotter_out_loop and
> > get_coldest_out_loop to make common case O(1).
> > 3. Update function argument bb_colder_than_loop_preheader.
> > 4. Make cached array to vec<class *loop> for index checking.
> >
> > v7 changes:
> > 1. Refine get_coldest_out_loop to replace loop with checking
> > pre-computed coldest_outermost_loop and colder_than_inner_loop.
> > 2. Add function fill_cold_out_loop, compute coldest_outermost_loop and
> > colder_than_inner_loop recursively without loop.
> >
> > v6 changes:
> > 1. Add function fill_coldest_out_loop to pre compute the coldest
> > outermost loop for each loop.
> > 2. Rename find_coldest_out_loop to get_coldest_out_loop.
> > 3. Add testcase ssa-lim-22.c to differentiate with ssa-lim-19.c.
> >
> > v5 changes:
> > 1. Refine comments for new functions.
> > 2. Use basic_block instead of count in bb_colder_than_loop_preheader
> > to align with function name.
> > 3. Refine with simpler implementation for get_coldest_out_loop and
> > ref_in_loop_hot_body::operator for better understanding.
> >
> > v4 changes:
> > 1. Sort out profile_count comparision to function bb_cold_than_loop_preheader.
> > 2. Update ref_in_loop_hot_body::operator () to find cold_loop before compare.
> > 3. Split RTL invariant motion part out.
> > 4. Remove aux changes.
> >
> > v3 changes:
> > 1. Handle max_loop in determine_max_movement instead of outermost_invariant_loop.
> > 2. Remove unnecessary changes.
> > 3. Add for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body) in can_sm_ref_p.
> > 4. "gsi_next (&bsi);" in move_computations_worker is kept since it caused
> > infinite loop when implementing v1 and the iteration is missed to be
> > updated actually.
> >
> > v1: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576488.html
> > v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579086.html
> > v3: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580211.html
> > v4: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581231.html
> > v5: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581961.html
> >
> > There was a patch trying to avoid move cold block out of loop:
> >
> > https://gcc.gnu.org/pipermail/gcc/2014-November/215551.html
> >
> > Richard suggested to "never hoist anything from a bb with lower execution
> > frequency to a bb with higher one in LIM invariantness_dom_walker
> > before_dom_children".
> >
> > In gimple LIM analysis, add get_coldest_out_loop to move invariants to
> > expected target loop, if profile count of the loop bb is colder
> > than target loop preheader, it won't be hoisted out of loop.
> > Likely for store motion, if all locations of the REF in loop is cold,
> > don't do store motion of it.
> >
> > SPEC2017 performance evaluation shows 1% performance improvement for
> > intrate GEOMEAN and no obvious regression for others.  Especially,
> > 500.perlbench_r +7.52% (Perf shows function S_regtry of perlbench is
> > largely improved.), and 548.exchange2_r+1.98%, 526.blender_r +1.00%
> > on P8LE.
> >
> > gcc/ChangeLog:
> >
> >       * tree-ssa-loop-im.c (bb_colder_than_loop_preheader): New
> >       function.
> >       (get_coldest_out_loop): New function.
> >       (determine_max_movement): Use get_coldest_out_loop.
> >       (move_computations_worker): Adjust and fix iteration udpate.
> >       (class ref_in_loop_hot_body): New functor.
> >       (ref_in_loop_hot_body::operator): New.
> >       (can_sm_ref_p): Use for_all_locs_in_loop.
> >       (fill_coldest_and_hotter_out_loop): New.
> >       (tree_ssa_lim_finalize): Free coldest_outermost_loop and
> >       hotter_than_inner_loop.
> >       (loop_invariant_motion_in_fun): Call fill_coldet_and_hotter_out_loop.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.dg/tree-ssa/recip-3.c: Adjust.
> >       * gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-20.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-21.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-22.c: New test.
> >       * gcc.dg/tree-ssa/ssa-lim-23.c: New test.
> > ---
> >  gcc/tree-ssa-loop-im.c                     | 150 ++++++++++++++++++++-
> >  gcc/testsuite/gcc.dg/tree-ssa/recip-3.c    |   2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c |  29 ++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c |  25 ++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c |  35 +++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c |  32 +++++
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c |  21 +++
> >  7 files changed, 291 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-23.c
> >
> > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> > index 4b187c2cdaf..0bcc3bca91b 100644
> > --- a/gcc/tree-ssa-loop-im.c
> > +++ b/gcc/tree-ssa-loop-im.c
> > @@ -146,6 +146,11 @@ public:
> >  enum dep_kind { lim_raw, sm_war, sm_waw };
> >  enum dep_state { dep_unknown, dep_independent, dep_dependent };
> >
> > +/* coldest outermost loop for given loop.  */
> > +vec<class loop *> coldest_outermost_loop;
> > +/* hotter outer loop nearest to given loop.  */
> > +vec<class loop *> hotter_than_inner_loop;
> > +
> >  /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
> >
> >  static void
> > @@ -417,6 +422,61 @@ movement_possibility (gimple *stmt)
> >    return ret;
> >  }
> >
> > +/* Compare the profile count inequality of bb and loop's preheader, it is
> > +   three-state as stated in profile-count.h, FALSE is returned if inequality
> > +   cannot be decided.  */
> > +bool
> > +bb_colder_than_loop_preheader (basic_block bb, class loop *loop)
> > +{
> > +  gcc_assert (bb && loop);
> > +  return bb->count < loop_preheader_edge (loop)->src->count;
> > +}
> > +
> > +/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
> > +   count.
> > +  It does three steps check:
> > +  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
> > +  return NULL which means it should not be moved out at all;
> > +  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
> > +  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
> > +  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
> > +  hotter loop between OUTERMOST_LOOP and loop in pre-computed
> > +  HOTTER_THAN_INNER_LOOP, return it's nested inner loop, otherwise return
> > +  OUTERMOST_LOOP.  */
> > +
> > +static class loop *
> > +get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
> > +                   basic_block curr_bb)
> > +{
> > +  gcc_assert (outermost_loop == loop
> > +           || flow_loop_nested_p (outermost_loop, loop));
> > +
> > +  /* If bb_colder_than_loop_preheader returns false due to three-state
> > +    comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
> > +    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
> > +  if (curr_bb && bb_colder_than_loop_preheader (curr_bb, loop))
> > +    return NULL;
> > +
> > +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> > +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> > +    {
> > +      class loop *hotter_loop = hotter_than_inner_loop[loop->num];
> > +      if (!hotter_loop
> > +       || loop_depth (hotter_loop) < loop_depth (outermost_loop))
> > +     return outermost_loop;
> > +
> > +      /*  hotter_loop is between OUTERMOST_LOOP and LOOP like:
> > +     [loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
> > +     hotter_loop, second_coldest_loop, ..., loop]
> > +     return second_coldest_loop to be the hoist target.  */
> > +      class loop *aloop;
> > +      for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
> > +     if (flow_loop_nested_p (aloop, loop))
>
> should be:
>
>         if (aloop == loop || flow_loop_nested_p (aloop, loop))

OK with that fixed.

Are necessary prerequesites committed to avoid regressions?
I guess we need to keep a watchful eye and eventually revert
(or gate with a --param disabled by default) the new behavior if
severe regressions are discovered.

Thanks and sorry for the repeated delays.
Richard.

>
> > +       return aloop;
> > +    }
> > +  return coldest_loop;
> > +}
> > +
>
>
> --
> Thanks,
> Xionghu
  
Xionghu Luo Dec. 8, 2021, 6:32 a.m. UTC | #7
On 2021/12/7 20:17, Richard Biener wrote:
>>> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
>>> +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>>> +    {
>>> +      class loop *hotter_loop = hotter_than_inner_loop[loop->num];
>>> +      if (!hotter_loop
>>> +       || loop_depth (hotter_loop) < loop_depth (outermost_loop))
>>> +     return outermost_loop;
>>> +
>>> +      /*  hotter_loop is between OUTERMOST_LOOP and LOOP like:
>>> +     [loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
>>> +     hotter_loop, second_coldest_loop, ..., loop]
>>> +     return second_coldest_loop to be the hoist target.  */
>>> +      class loop *aloop;
>>> +      for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
>>> +     if (flow_loop_nested_p (aloop, loop))
>> should be:
>>
>>         if (aloop == loop || flow_loop_nested_p (aloop, loop))
> OK with that fixed.
> 
> Are necessary prerequesites committed to avoid regressions?
> I guess we need to keep a watchful eye and eventually revert
> (or gate with a --param disabled by default) the new behavior if
> severe regressions are discovered.
> 
> Thanks and sorry for the repeated delays.
> Richard.
> 

Thanks for your review, I learned quite a lot and gained very useful
comments & help through the period :)  There are still 3 patches required
to avoid regression or so, I've reorganized them and sent it out.

https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586371.html


In addition, cooked the patch to add option for disable/enable it.
Is it OK to merge it to current patch?


[PATCH] Add option -fhoist-to-cold-loop


gcc/ChangeLog:

	* common.opt: New.
	* loop-invariant.c (find_invariants_bb):
	* tree-ssa-loop-im.c (get_coldest_out_loop):
	(can_sm_ref_p):
	(loop_invariant_motion_in_fun):
---
 gcc/common.opt         |  4 ++++
 gcc/loop-invariant.c   |  2 +-
 gcc/tree-ssa-loop-im.c | 33 ++++++++++++++++++++++-----------
 3 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index b921f5e3b25..62b82bd8b95 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1171,6 +1171,10 @@ fcode-hoisting
 Common Var(flag_code_hoisting) Optimization
 Enable code hoisting.
 
+fhoist-to-cold-loop
+Common Var(flag_hoist_to_cold_loop) Init(1) Optimization
+Enable hoisting code to cold loop.
+
 fcombine-stack-adjustments
 Common Var(flag_combine_stack_adjustments) Optimization
 Looks for opportunities to reduce stack adjustments and stack references.
diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index 5c3be7bf0eb..75b9dd47cd7 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1189,7 +1189,7 @@ find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
   rtx_insn *insn;
   basic_block preheader = loop_preheader_edge (loop)->src;
 
-  if (preheader->count > bb->count)
+  if (flag_hoist_to_cold_loop && preheader->count > bb->count)
     return;
 
   FOR_BB_INSNS (bb, insn)
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 565ee62d3f7..d745f66851b 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -450,6 +450,9 @@ static class loop *
 get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
 		      basic_block curr_bb)
 {
+  if (!flag_hoist_to_cold_loop)
+    return outermost_loop;
+
   gcc_assert (outermost_loop == loop
 	      || flow_loop_nested_p (outermost_loop, loop));
 
@@ -3031,8 +3034,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
   /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
     candidate's profile count is hot.  Statement in cold BB shouldn't be moved
     out of it's loop_father.  */
-  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
-    return false;
+  if (flag_hoist_to_cold_loop)
+    if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
+      return false;
 
   return true;
 }
@@ -3373,8 +3377,11 @@ tree_ssa_lim_finalize (void)
 
   free (bb_loop_postorder);
 
-  coldest_outermost_loop.release ();
-  hotter_than_inner_loop.release ();
+  if (flag_hoist_to_cold_loop)
+    {
+      coldest_outermost_loop.release ();
+      hotter_than_inner_loop.release ();
+    }
 }
 
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
@@ -3396,13 +3403,17 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
 
   /* Pre-compute coldest outermost loop and nearest hotter loop of each loop.
    */
-  class loop *loop;
-  coldest_outermost_loop.create (number_of_loops (cfun));
-  coldest_outermost_loop.safe_grow_cleared (number_of_loops (cfun));
-  hotter_than_inner_loop.create (number_of_loops (cfun));
-  hotter_than_inner_loop.safe_grow_cleared (number_of_loops (cfun));
-  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
-    fill_coldest_and_hotter_out_loop (loop, NULL, loop);
+  if (flag_hoist_to_cold_loop)
+    {
+      class loop *loop;
+      coldest_outermost_loop.create (number_of_loops (cfun));
+      coldest_outermost_loop.safe_grow_cleared (number_of_loops (cfun));
+      hotter_than_inner_loop.create (number_of_loops (cfun));
+      hotter_than_inner_loop.safe_grow_cleared (number_of_loops (cfun));
+      for (loop = current_loops->tree_root->inner; loop != NULL;
+	   loop = loop->next)
+	fill_coldest_and_hotter_out_loop (loop, NULL, loop);
+    }
 
   int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
   int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
  
Richard Biener Dec. 20, 2021, 7:29 a.m. UTC | #8
On Wed, Dec 8, 2021 at 7:32 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>
>
>
> On 2021/12/7 20:17, Richard Biener wrote:
> >>> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
> >>> +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
> >>> +    {
> >>> +      class loop *hotter_loop = hotter_than_inner_loop[loop->num];
> >>> +      if (!hotter_loop
> >>> +       || loop_depth (hotter_loop) < loop_depth (outermost_loop))
> >>> +     return outermost_loop;
> >>> +
> >>> +      /*  hotter_loop is between OUTERMOST_LOOP and LOOP like:
> >>> +     [loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
> >>> +     hotter_loop, second_coldest_loop, ..., loop]
> >>> +     return second_coldest_loop to be the hoist target.  */
> >>> +      class loop *aloop;
> >>> +      for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
> >>> +     if (flow_loop_nested_p (aloop, loop))
> >> should be:
> >>
> >>         if (aloop == loop || flow_loop_nested_p (aloop, loop))
> > OK with that fixed.
> >
> > Are necessary prerequesites committed to avoid regressions?
> > I guess we need to keep a watchful eye and eventually revert
> > (or gate with a --param disabled by default) the new behavior if
> > severe regressions are discovered.
> >
> > Thanks and sorry for the repeated delays.
> > Richard.
> >
>
> Thanks for your review, I learned quite a lot and gained very useful
> comments & help through the period :)  There are still 3 patches required
> to avoid regression or so, I've reorganized them and sent it out.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586371.html
>
>
> In addition, cooked the patch to add option for disable/enable it.
> Is it OK to merge it to current patch?

Hmm, let's go without this flag for now, we can add something like this
when we see a testcase where that makes sense (and where profile info
is not wrecked by other bugs).

Adding a --param would be a no brainer, but for new user-facing options
we should be more careful.

Richard.

>
> [PATCH] Add option -fhoist-to-cold-loop
>
>
> gcc/ChangeLog:
>
>         * common.opt: New.
>         * loop-invariant.c (find_invariants_bb):
>         * tree-ssa-loop-im.c (get_coldest_out_loop):
>         (can_sm_ref_p):
>         (loop_invariant_motion_in_fun):
> ---
>  gcc/common.opt         |  4 ++++
>  gcc/loop-invariant.c   |  2 +-
>  gcc/tree-ssa-loop-im.c | 33 ++++++++++++++++++++++-----------
>  3 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index b921f5e3b25..62b82bd8b95 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1171,6 +1171,10 @@ fcode-hoisting
>  Common Var(flag_code_hoisting) Optimization
>  Enable code hoisting.
>
> +fhoist-to-cold-loop
> +Common Var(flag_hoist_to_cold_loop) Init(1) Optimization
> +Enable hoisting code to cold loop.
> +
>  fcombine-stack-adjustments
>  Common Var(flag_combine_stack_adjustments) Optimization
>  Looks for opportunities to reduce stack adjustments and stack references.
> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> index 5c3be7bf0eb..75b9dd47cd7 100644
> --- a/gcc/loop-invariant.c
> +++ b/gcc/loop-invariant.c
> @@ -1189,7 +1189,7 @@ find_invariants_bb (class loop *loop, basic_block bb, bool always_reached,
>    rtx_insn *insn;
>    basic_block preheader = loop_preheader_edge (loop)->src;
>
> -  if (preheader->count > bb->count)
> +  if (flag_hoist_to_cold_loop && preheader->count > bb->count)
>      return;
>
>    FOR_BB_INSNS (bb, insn)
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 565ee62d3f7..d745f66851b 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -450,6 +450,9 @@ static class loop *
>  get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
>                       basic_block curr_bb)
>  {
> +  if (!flag_hoist_to_cold_loop)
> +    return outermost_loop;
> +
>    gcc_assert (outermost_loop == loop
>               || flow_loop_nested_p (outermost_loop, loop));
>
> @@ -3031,8 +3034,9 @@ can_sm_ref_p (class loop *loop, im_mem_ref *ref)
>    /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
>      candidate's profile count is hot.  Statement in cold BB shouldn't be moved
>      out of it's loop_father.  */
> -  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> -    return false;
> +  if (flag_hoist_to_cold_loop)
> +    if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
> +      return false;
>
>    return true;
>  }
> @@ -3373,8 +3377,11 @@ tree_ssa_lim_finalize (void)
>
>    free (bb_loop_postorder);
>
> -  coldest_outermost_loop.release ();
> -  hotter_than_inner_loop.release ();
> +  if (flag_hoist_to_cold_loop)
> +    {
> +      coldest_outermost_loop.release ();
> +      hotter_than_inner_loop.release ();
> +    }
>  }
>
>  /* Moves invariants from loops.  Only "expensive" invariants are moved out --
> @@ -3396,13 +3403,17 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion)
>
>    /* Pre-compute coldest outermost loop and nearest hotter loop of each loop.
>     */
> -  class loop *loop;
> -  coldest_outermost_loop.create (number_of_loops (cfun));
> -  coldest_outermost_loop.safe_grow_cleared (number_of_loops (cfun));
> -  hotter_than_inner_loop.create (number_of_loops (cfun));
> -  hotter_than_inner_loop.safe_grow_cleared (number_of_loops (cfun));
> -  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> -    fill_coldest_and_hotter_out_loop (loop, NULL, loop);
> +  if (flag_hoist_to_cold_loop)
> +    {
> +      class loop *loop;
> +      coldest_outermost_loop.create (number_of_loops (cfun));
> +      coldest_outermost_loop.safe_grow_cleared (number_of_loops (cfun));
> +      hotter_than_inner_loop.create (number_of_loops (cfun));
> +      hotter_than_inner_loop.safe_grow_cleared (number_of_loops (cfun));
> +      for (loop = current_loops->tree_root->inner; loop != NULL;
> +          loop = loop->next)
> +       fill_coldest_and_hotter_out_loop (loop, NULL, loop);
> +    }
>
>    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> --
> 2.27.0.90.geebb51ba8c
>
>
  
Xionghu Luo Dec. 21, 2021, 3:59 a.m. UTC | #9
On 2021/12/20 15:29, Richard Biener wrote:
> On Wed, Dec 8, 2021 at 7:32 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/12/7 20:17, Richard Biener wrote:
>>>>> +  class loop *coldest_loop = coldest_outermost_loop[loop->num];
>>>>> +  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
>>>>> +    {
>>>>> +      class loop *hotter_loop = hotter_than_inner_loop[loop->num];
>>>>> +      if (!hotter_loop
>>>>> +       || loop_depth (hotter_loop) < loop_depth (outermost_loop))
>>>>> +     return outermost_loop;
>>>>> +
>>>>> +      /*  hotter_loop is between OUTERMOST_LOOP and LOOP like:
>>>>> +     [loop tree root, ..., coldest_loop, ..., outermost_loop, ...,
>>>>> +     hotter_loop, second_coldest_loop, ..., loop]
>>>>> +     return second_coldest_loop to be the hoist target.  */
>>>>> +      class loop *aloop;
>>>>> +      for (aloop = hotter_loop->inner; aloop; aloop = aloop->next)
>>>>> +     if (flow_loop_nested_p (aloop, loop))
>>>> should be:
>>>>
>>>>         if (aloop == loop || flow_loop_nested_p (aloop, loop))
>>> OK with that fixed.
>>>
>>> Are necessary prerequesites committed to avoid regressions?
>>> I guess we need to keep a watchful eye and eventually revert
>>> (or gate with a --param disabled by default) the new behavior if
>>> severe regressions are discovered.
>>>
>>> Thanks and sorry for the repeated delays.
>>> Richard.
>>>
>>
>> Thanks for your review, I learned quite a lot and gained very useful
>> comments & help through the period :)  There are still 3 patches required
>> to avoid regression or so, I've reorganized them and sent it out.
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-December/586371.html
>>
>>
>> In addition, cooked the patch to add option for disable/enable it.
>> Is it OK to merge it to current patch?
> 
> Hmm, let's go without this flag for now, we can add something like this
> when we see a testcase where that makes sense (and where profile info
> is not wrecked by other bugs).
> 
> Adding a --param would be a no brainer, but for new user-facing options
> we should be more careful.
> 

Thanks.  Committed to r12-6087.
  

Patch

diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 4b187c2cdaf..e9b3d0cba93 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -146,6 +146,11 @@  public:
 enum dep_kind { lim_raw, sm_war, sm_waw };
 enum dep_state { dep_unknown, dep_independent, dep_dependent };
 
+/* coldest outermost loop for given loop.  */
+class loop **coldest_outermost_loop;
+/* colder outer loop nerest to given loop.  */
+class loop **colder_than_inner_loop;
+
 /* Populate the loop dependence cache of REF for LOOP, KIND with STATE.  */
 
 static void
@@ -417,6 +422,53 @@  movement_possibility (gimple *stmt)
   return ret;
 }
 
+/* Compare the profile count inequality of bb and preheader, it is three-state
+   as stated in profile-count.h, FALSE is returned if inequality cannot be
+   decided.  */
+bool bb_colder_than_loop_preheader (basic_block bb, basic_block preheader)
+{
+  gcc_assert (bb && preheader);
+  return bb->count < preheader->count;
+}
+
+/* Check coldest loop between OUTERMOST_LOOP and LOOP by comparing profile
+   count.
+  It does three steps check:
+  1) Check whether CURR_BB is cold in it's own loop_father, if it is cold, just
+  return NULL which means it should not be moved out at all;
+  2) CURR_BB is NOT cold, check if pre-computed COLDEST_LOOP is outside of
+  OUTERMOST_LOOP, if it is inside of OUTERMOST_LOOP, return the COLDEST_LOOP;
+  3) If COLDEST_LOOP is outside of OUTERMOST_LOOP, check whether there is a
+  colder loop between OUTERMOST_LOOP and loop in pre-computed
+  COLDER_THAN_INNER_LOOP, return it, otherwise return OUTERMOST_LOOP.  */
+
+static class loop *
+get_coldest_out_loop (class loop *outermost_loop, class loop *loop,
+		      basic_block curr_bb)
+{
+  gcc_assert (outermost_loop == loop
+	      || flow_loop_nested_p (outermost_loop, loop));
+
+  /* If bb_colder_than_loop_preheader returns false due to three-state
+    comparision, OUTERMOST_LOOP is returned finally to preserve the behavior.
+    Otherwise, return the coldest loop between OUTERMOST_LOOP and LOOP.  */
+  if (curr_bb
+      && bb_colder_than_loop_preheader (curr_bb,
+					loop_preheader_edge (loop)->src))
+    return NULL;
+
+  class loop *coldest_loop = coldest_outermost_loop[loop->num];
+  if (loop_depth (coldest_loop) < loop_depth (outermost_loop))
+    {
+      if (colder_than_inner_loop[loop->num] != NULL
+	  && loop_depth (outermost_loop)
+	       < loop_depth (colder_than_inner_loop[loop->num]))
+	return colder_than_inner_loop[loop->num];
+      return outermost_loop;
+    }
+  return coldest_loop;
+}
+
 /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
    loop to that we could move the expression using DEF if it did not have
    other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -685,7 +737,9 @@  determine_max_movement (gimple *stmt, bool must_preserve_exec)
     level = ALWAYS_EXECUTED_IN (bb);
   else
     level = superloop_at_depth (loop, 1);
-  lim_data->max_loop = level;
+  lim_data->max_loop = get_coldest_out_loop (level, loop, bb);
+  if (!lim_data->max_loop)
+    return false;
 
   if (gphi *phi = dyn_cast <gphi *> (stmt))
     {
@@ -1221,7 +1275,10 @@  move_computations_worker (basic_block bb)
       /* We do not really want to move conditionals out of the loop; we just
 	 placed it here to force its operands to be moved if necessary.  */
       if (gimple_code (stmt) == GIMPLE_COND)
-	continue;
+	{
+	  gsi_next (&bsi);
+	  continue;
+	}
 
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
@@ -2887,6 +2944,26 @@  ref_indep_loop_p (class loop *loop, im_mem_ref *ref, dep_kind kind)
   return indep_p;
 }
 
+class ref_in_loop_hot_body
+{
+public:
+  ref_in_loop_hot_body (loop *loop_) : l (loop_) {}
+  bool operator () (mem_ref_loc *loc);
+  class loop *l;
+};
+
+/* Check the coldest loop between loop L and innermost loop.  If there is one
+   cold loop between L and INNER_LOOP, store motion can be performed, otherwise
+   no cold loop means no store motion.  get_coldest_out_loop also handles cases
+   when l is inner_loop.  */
+bool
+ref_in_loop_hot_body::operator () (mem_ref_loc *loc)
+{
+  basic_block curr_bb = gimple_bb (loc->stmt);
+  class loop *inner_loop = curr_bb->loop_father;
+  return get_coldest_out_loop (l, inner_loop, curr_bb);
+}
+
 
 /* Returns true if we can perform store motion of REF from LOOP.  */
 
@@ -2941,6 +3018,12 @@  can_sm_ref_p (class loop *loop, im_mem_ref *ref)
   if (!ref_indep_loop_p (loop, ref, sm_war))
     return false;
 
+  /* Verify whether the candidate is hot for LOOP.  Only do store motion if the
+    candidate's profile count is hot.  Statement in cold BB shouldn't be moved
+    out of it's loop_father.  */
+  if (!for_all_locs_in_loop (loop, ref, ref_in_loop_hot_body (loop)))
+    return false;
+
   return true;
 }
 
@@ -3153,6 +3236,48 @@  fill_always_executed_in (void)
     fill_always_executed_in_1 (loop, contains_call);
 }
 
+/* Find the coldest loop preheader for LOOP, also find the nearest colder loop
+   to LOOP.  Then recursively iterate each inner loop.  */
+
+void
+fill_cold_out_loop (class loop *coldest_loop, class loop *outer_loop,
+		    class loop *loop)
+{
+  class loop *colder_loop = NULL;
+  if (outer_loop)
+    {
+      if (bb_colder_than_loop_preheader (loop_preheader_edge (outer_loop)->src,
+					 loop_preheader_edge (loop)->src))
+	colder_loop = outer_loop;
+      else if (colder_than_inner_loop[outer_loop->num]
+	       && bb_colder_than_loop_preheader (
+		 loop_preheader_edge (colder_than_inner_loop[outer_loop->num])
+		   ->src,
+		 loop_preheader_edge (loop)->src))
+	colder_loop = colder_than_inner_loop[outer_loop->num];
+    }
+
+  if (bb_colder_than_loop_preheader (loop_preheader_edge (loop)->src,
+				     loop_preheader_edge (coldest_loop)->src))
+    coldest_loop = loop;
+
+  coldest_outermost_loop[loop->num] = coldest_loop;
+  colder_than_inner_loop[loop->num] = colder_loop;
+  if (dump_enabled_p ())
+    {
+      dump_printf (MSG_NOTE, "loop %d's coldest_outermost_loop is %d, ",
+		   loop->num, coldest_loop->num);
+      if (colder_loop)
+	dump_printf (MSG_NOTE, "colder_than_inner_loop is %d\n",
+		     colder_loop->num);
+      else
+	dump_printf (MSG_NOTE, "colder_than_inner_loop is NULL\n");
+    }
+
+  class loop *inner_loop;
+  for (inner_loop = loop->inner; inner_loop; inner_loop = inner_loop->next)
+    fill_cold_out_loop (coldest_loop, loop, inner_loop);
+}
 
 /* Compute the global information needed by the loop invariant motion pass.  */
 
@@ -3237,6 +3362,9 @@  tree_ssa_lim_finalize (void)
     free_affine_expand_cache (&memory_accesses.ttae_cache);
 
   free (bb_loop_postorder);
+
+  free (coldest_outermost_loop);
+  free (colder_than_inner_loop);
 }
 
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
@@ -3256,6 +3384,14 @@  loop_invariant_motion_in_fun (function *fun, bool store_motion)
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
   fill_always_executed_in ();
 
+  /* Pre-compute coldest outermost loop and nearest colder loop  of each loop.
+   */
+  class loop *loop;
+  coldest_outermost_loop = XNEWVEC (class loop *, number_of_loops (cfun));
+  colder_than_inner_loop = XNEWVEC (class loop *, number_of_loops (cfun));
+  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
+    fill_cold_out_loop (loop, NULL, loop);
+
   int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
   int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
index 638bf38db8c..641c91e719e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
@@ -23,4 +23,4 @@  float h ()
 	F[0] += E / d;
 }
 
-/* { dg-final { scan-tree-dump-times " / " 1 "recip" } } */
+/* { dg-final { scan-tree-dump-times " / " 5 "recip" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
new file mode 100644
index 00000000000..7326a230b3f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int k)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+    {
+      if (__builtin_expect (x, 0))
+	bar (k / 5, "one", "two");
+      a[i] = k;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "out of loop 1" "lim2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..51c1913d003
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,29 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int s, int t)
+{
+  int i;
+  int j;
+  int k;
+
+  for (i = 0; i < m; i++) // Loop 1
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++) // Loop 2
+	  for (k = 0; k < n; k++) // Loop 3
+	    {
+	      bar (s / 5, "one", "two");
+	      a[t] = s;
+	    }
+      a[t] = t;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 2" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
new file mode 100644
index 00000000000..bc60a040a70
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c
@@ -0,0 +1,25 @@ 
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `count' is not hoisted out of loop when bb is cold.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  struct obj *next;
+
+} *q;
+
+void
+func (int m)
+{
+  struct obj *p;
+  for (int i = 0; i < m; i++)
+    if (__builtin_expect (x, 0))
+      count++;
+
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
new file mode 100644
index 00000000000..ffe6f8f699d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-21.c
@@ -0,0 +1,35 @@ 
+/* { dg-do compile  } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+/* Test that `data' and 'data1' is not hoisted out of inner loop and outer loop
+   when it is in cold loop.  */
+
+int count;
+volatile int x;
+
+struct obj {
+  int data;
+  int data1;
+  struct obj *next;
+};
+
+void
+func (int m, int n, int k, struct obj *a)
+{
+  struct obj *q = a;
+  for (int j = 0; j < m; j++)
+    if (__builtin_expect (m, 0))
+      for (int i = 0; i < m; i++)
+	{
+	  if (__builtin_expect (x, 0))
+	    {
+	      count++;
+	      q->data += 3; /* Not hoisted out to inner loop. */
+	    }
+	  count += n;
+	  q->data1 += k; /* Not hoisted out to outer loop. */
+	}
+}
+
+/* { dg-final { scan-tree-dump-not "Executing store motion of" "lim2"  }  } */
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
new file mode 100644
index 00000000000..16ba4ceb8ab
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-22.c
@@ -0,0 +1,32 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+volatile int x;
+volatile int y;
+void
+bar (int, char *, char *);
+void
+foo (int *a, int n, int m, int s, int t)
+{
+  int i;
+  int j;
+  int k;
+
+  for (i = 0; i < m; i++) // Loop 1
+    {
+      if (__builtin_expect (x, 0))
+	for (j = 0; j < n; j++) // Loop 2
+	  if (__builtin_expect (y, 0))
+	    for (k = 0; k < n; k++) // Loop 3
+	      {
+		bar (s / 5, "one", "two");
+		a[t] = s;
+	      }
+      a[t] = t;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "out of loop 3" 4 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 3 "lim2" } } */
+
+