Support threading of just the exit edge

Message ID: 20220812120145.91C6513AAE@imap2.suse-dmz.suse.de
State: New
Series: Support threading of just the exit edge

Commit Message

Richard Biener Aug. 12, 2022, 12:01 p.m. UTC
This started with noticing we add ENTRY_BLOCK to our threads
just for the sake of simplifying the conditional at the end of
the first block in a function.  That's not really threading
anything, but it ends up duplicating the entry block and
rewriting the result instead of statically folding the jump.

The following tries to handle those by recording simplifications
of the exit conditional as a thread of length one.  That requires
special-casing them in the backward copier, since if we have no
block to copy but instead modify the jump in place and remove the
not-taken edges, this confuses the hell out of the remaining threads.

So back_jt_path_registry::update_cfg now first marks all
edges we know are never taken and then prunes the threading
candidates when they include such an edge.  Then it makes sure
to first perform the unreachable edge removal (so we avoid
copying those edges when other thread paths contain the prevailing
edge) before continuing to apply the remaining threads.
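
Condensed, the new update_cfg flow looks like this (a standalone sketch
with made-up minimal types, not GCC's edge/jump_thread_edge structures;
the real code is the tree-ssa-threadupdate.cc hunk below):

  #include <algorithm>
  #include <vector>

  struct edge_t { int src, dest; bool never_taken = false; };
  struct path_t { std::vector<edge_t *> edges; };  // size 1 == exit edge only

  static void
  update_cfg_sketch (std::vector<path_t> &paths, std::vector<edge_t> &edges)
  {
    // 1) For each length-one path, mark the other outgoing edges of its
    //    source block as never taken.
    for (const path_t &p : paths)
      if (p.edges.size () == 1)
        for (edge_t &e : edges)
          if (e.src == p.edges[0]->src && &e != p.edges[0])
            e.never_taken = true;

    // 2) Prune (cancel) every remaining candidate that threads through a
    //    never-taken edge.
    paths.erase (std::remove_if (paths.begin (), paths.end (),
                                 [] (const path_t &p)
                                 {
                                   for (const edge_t *e : p.edges)
                                     if (e->never_taken)
                                       return true;
                                   return false;
                                 }),
                 paths.end ());

    // 3) Apply the length-one entries first: fold the control statement in
    //    place and remove the dead outgoing edges, then let the normal
    //    copier handle the remaining multi-block paths.
  }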

In the statistics you can see this avoids quite a bunch of useless
threads (I've investigated 3 random files from cc1files with
dropped stats in any of the thread passes).

Still, thinking about it, it would be nice to avoid the work of
discovering those candidates we have to throw away later.  That
could eventually be done by having the backward threader perform
an RPO walk over the CFG, skipping edges that can be statically
determined as not being executed.  Below I'm abusing the path
range query to statically analyze the exit branch, but I assume
there's a simpler way of folding this stmt which could then
integrate better with such a walk.
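
To make that concrete, a standalone model of such a walk (made-up
types, not GCC's CFG representation, and ignoring back edges for
simplicity) could look like:

  #include <vector>

  struct block_t;
  struct edge_t { block_t *dest; bool executable = true; };

  struct block_t
  {
    int rpo_index;                          // position in a precomputed RPO order
    std::vector<edge_t> succs;
    edge_t *(*fold_exit_cond) (block_t *);  // stand-in for folding the exit test
  };

  // Visit blocks in RPO, only through edges still considered executable;
  // whenever a block's exit conditional folds to a single taken edge,
  // mark its other outgoing edges as not executable so blocks reached
  // only through them are never visited.
  static void
  walk_rpo (std::vector<block_t *> &rpo_order)
  {
    std::vector<bool> reachable (rpo_order.size (), false);
    if (!rpo_order.empty ())
      reachable[rpo_order[0]->rpo_index] = true;    // entry block
    for (block_t *bb : rpo_order)
      {
        if (!reachable[bb->rpo_index])
          continue;                                 // statically dead block
        edge_t *taken
          = bb->fold_exit_cond ? bb->fold_exit_cond (bb) : nullptr;
        for (edge_t &e : bb->succs)
          {
            if (taken && &e != taken)
              e.executable = false;                 // dead outgoing edge
            if (e.executable)
              reachable[e.dest->rpo_index] = true;
          }
      }
  }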

In any case it seems worth more consciously handling the
case of exit branches that simplify without path-sensitive
information.
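
For instance, something of this shape (the same situation as the foo()
testcase further down in the thread; the function name is made up):

  /* The dereference of q implies q != 0, so the exit branch of the
     first block simplifies without any path-sensitive information --
     no thread (and no block duplication) is needed.  */
  int
  exit_branch_folds (int **p, int i)
  {
    int *q = *p;
    int res = *q + i;   /* *q implies q is non-null here */
    if (q)              /* simplifies statically to true */
      return res;
    return -1;
  }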

Then the patch also restricts path discovery when we'd produce
threads we'll reject later during copying - the backward threader
copying cannot handle paths where the blocks to duplicate are not
all from exactly the same loop.  I'm probably going to split this
part out.

Any thoughts?

	* gimple-range-path.cc (path_range_query::set_path): Adjust
	assert to allow paths of size one.
	* tree-ssa-threadbackward.cc (back_threader::maybe_register_path):
	Paths of size one are always profitable.
	(back_threader::find_paths_to_names): Likewise.
	Do not walk further if we are leaving the current loop.
	(back_threader::find_taken_edge): Remove assert.  Do not
	walk to ENTRY_BLOCK.
	* tree-ssa-threadupdate.cc (back_jt_path_registry::update_cfg):
	Handle jump threads of just the exit edge by modifying the
	control statement in-place.
---
 gcc/gimple-range-path.cc       |  2 +-
 gcc/tree-ssa-threadbackward.cc | 21 ++++++++-----
 gcc/tree-ssa-threadupdate.cc   | 54 ++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 8 deletions(-)
  

Comments

Aldy Hernandez Aug. 12, 2022, 4:03 p.m. UTC | #1
On Fri, Aug 12, 2022 at 2:01 PM Richard Biener <rguenther@suse.de> wrote:
>
> This started with noticing we add ENTRY_BLOCK to our threads
> just for the sake of simplifying the conditional at the end of
> the first block in a function.  That's not really threading
> anything but it ends up duplicating the entry block, and
> re-writing the result instead of statically fold the jump.

Hmmm, but threading 2 blocks is not really threading at all??  Unless
I'm misunderstanding something, this was even documented in the
backwards threader:

[snip]
     That's not really a jump threading opportunity, but instead is
     simple cprop & simplification.  We could handle it here if we
     wanted by wiring up all the incoming edges.  If we run this
     early in IPA, that might be worth doing.   For now we just
     reject that case.  */
  if (m_path.length () <= 1)
      return false;

Which you undoubtedly ran into because you're specifically eliding the check:

> -         if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
> -                                         &irreducible)
> +         if ((m_path.length () == 1
> +              || m_profit.profitable_path_p (m_path, m_name, taken_edge,
> +                                             &irreducible))

>
> The following tries to handle those by recording simplifications
> of the exit conditional as a thread of length one.  That requires
> special-casing them in the backward copier since if we do not
> have any block to copy but modify the jump in place and remove
> not taken edges this confuses the hell out of remaining threads.
>
> So back_jt_path_registry::update_cfg now first marks all
> edges we know are never taken and then prunes the threading
> candidates when they include such edge.  Then it makes sure
> to first perform unreachable edge removal (so we avoid
> copying them when other thread paths contain the prevailing
> edge) before continuing to apply the remaining threads.

This is all beyond my pay grade.  I'm not very well versed in the
threader per se.  So if y'all think it's a good idea, by all means.
Perhaps Jeff can chime in, or remembers the above comment?

>
> In statistics you can see this avoids quite a bunch of useless
> threads (I've investiated 3 random files from cc1files with
> dropped stats in any of the thread passes).
>
> Still thinking about it it would be nice to avoid the work of
> discovering those candidates we have to throw away later
> which could eventually be done by having the backward threader
> perform a RPO walk over the CFG, skipping edges that can be
> statically determined as not being executed.  Below I'm
> abusing the path range query to statically analyze the exit
> branch but I assume there's a simpler way of folding this stmt
> which could then better integrate with such a walk.

Unreachable paths can be queried with
path_range_query::unreachable_path_p ().  Could you leverage this?
The idea was that if we ever resolved any SSA name to UNDEFINED, the
path itself was unreachable.
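
A rough, untested sketch of what leveraging that could look like,
reusing the m_solver/compute_ranges/m_imports calls that appear
elsewhere in this thread:

  m_solver->compute_ranges (path, m_imports);
  if (m_solver->unreachable_path_p ())
    {
      /* Some SSA name resolved to UNDEFINED along the path, so the path
         can never be taken -- drop the candidate instead of registering
         a thread for it.  */
    }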

Aldy

>
> In any case it seems worth more conciously handling the
> case of exit branches that simplify without path sensitive
> information.
>
> Then the patch also restricts path discovery when we'd produce
> threads we'll reject later during copying - the backward threader
> copying cannot handle paths where the to duplicate blocks are
> not from exactly the same loop.  I'm probably going to split this
> part out.
>
> Any thoughts?
>
>         * gimple-range-path.cc (path_range_query::set_path): Adjust
>         assert to allow paths of size one.
>         * tree-ssa-threadbackward.cc (back_threader::maybe_register_path):
>         Paths of size one are always profitable.
>         (back_threader::find_paths_to_names): Likewise.
>         Do not walk further if we are leaving the current loop.
>         (back_threader::find_taken_edge): Remove assert.  Do not
>         walk to ENTRY_BLOCK.
>         * tree-ssa-threadupdate.cc (back_jt_path_registry::update_cfg):
>         Handle jump threads of just the exit edge by modifying the
>         control statement in-place.
> ---
>  gcc/gimple-range-path.cc       |  2 +-
>  gcc/tree-ssa-threadbackward.cc | 21 ++++++++-----
>  gcc/tree-ssa-threadupdate.cc   | 54 ++++++++++++++++++++++++++++++++++
>  3 files changed, 69 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc
> index 78146f5683e..a7d277c31b8 100644
> --- a/gcc/gimple-range-path.cc
> +++ b/gcc/gimple-range-path.cc
> @@ -220,7 +220,7 @@ path_range_query::unreachable_path_p ()
>  void
>  path_range_query::set_path (const vec<basic_block> &path)
>  {
> -  gcc_checking_assert (path.length () > 1);
> +  gcc_checking_assert (!path.is_empty ());
>    m_path = path.copy ();
>    m_pos = m_path.length () - 1;
>    bitmap_clear (m_has_cache_entry);
> diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
> index b886027fccf..669098e4ec3 100644
> --- a/gcc/tree-ssa-threadbackward.cc
> +++ b/gcc/tree-ssa-threadbackward.cc
> @@ -241,8 +241,9 @@ back_threader::maybe_register_path ()
>        else
>         {
>           bool irreducible = false;
> -         if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
> -                                         &irreducible)
> +         if ((m_path.length () == 1
> +              || m_profit.profitable_path_p (m_path, m_name, taken_edge,
> +                                             &irreducible))
>               && debug_counter ()
>               && m_registry.register_path (m_path, taken_edge))
>             {
> @@ -267,7 +268,6 @@ back_threader::maybe_register_path ()
>  edge
>  back_threader::find_taken_edge (const vec<basic_block> &path)
>  {
> -  gcc_checking_assert (path.length () > 1);
>    switch (gimple_code (m_last_stmt))
>      {
>      case GIMPLE_COND:
> @@ -350,9 +350,15 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
>    m_path.safe_push (bb);
>
>    // Try to resolve the path without looking back.
> -  if (m_path.length () > 1
> -      && (!m_profit.profitable_path_p (m_path, m_name, NULL)
> -         || maybe_register_path ()))
> +  if ((m_path.length () > 1
> +       && !m_profit.profitable_path_p (m_path, m_name, NULL))
> +      || maybe_register_path ())
> +    ;
> +
> +  // The backwards thread copier cannot copy blocks that do not belong
> +  // to the same loop, so when the new source of the path entry no
> +  // longer belongs to it we don't need to search further.
> +  else if (m_path[0]->loop_father != bb->loop_father)
>      ;
>
>    // Continue looking for ways to extend the path but limit the
> @@ -445,7 +451,8 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
>           edge e;
>           FOR_EACH_EDGE (e, iter, bb->preds)
>             {
> -             if (e->flags & EDGE_ABNORMAL
> +             if ((e->flags & EDGE_ABNORMAL)
> +                 || e->src->index == ENTRY_BLOCK
>                   // This is like path_crosses_loops in profitable_path_p but
>                   // more restrictive to avoid peeling off loop iterations (see
>                   // tree-ssa/pr14341.c for an example).
> diff --git a/gcc/tree-ssa-threadupdate.cc b/gcc/tree-ssa-threadupdate.cc
> index 59c268a3567..d40fa7c4cff 100644
> --- a/gcc/tree-ssa-threadupdate.cc
> +++ b/gcc/tree-ssa-threadupdate.cc
> @@ -2613,6 +2613,60 @@ back_jt_path_registry::update_cfg (bool /*peel_loop_headers*/)
>    bool retval = false;
>    hash_set<edge> visited_starting_edges;
>
> +  /* Mark never taken edges from paths that are just jump simplifications.  */
> +  auto_edge_flag never_taken (cfun);
> +  for (auto path : m_paths)
> +    if (path->length () == 1)
> +      {
> +       edge_iterator ei;
> +       edge e;
> +       FOR_EACH_EDGE (e, ei, (*path)[0]->e->src->succs)
> +         if (e != (*path)[0]->e)
> +           e->flags |= never_taken;
> +      }
> +
> +  /* And prune paths that contain such edge before we remove them.  */
> +  for (unsigned i = 0; i < m_paths.length ();)
> +    {
> +      bool remove = false;
> +      for (auto je : *m_paths[i])
> +       {
> +         if (je->e->flags & never_taken)
> +           {
> +             cancel_thread (m_paths[i],
> +                            "Avoiding threading through unreachable edge");
> +             remove = true;
> +             break;
> +           }
> +       }
> +      if (!remove)
> +       ++i;
> +      else
> +       m_paths.unordered_remove (i);
> +    }
> +
> +  /* Finally perform those threads first, this way we avoid copying the
> +     dead outgoing edges when other theads contain the prevailing edge.  */
> +  for (unsigned i = 0; i < m_paths.length ();)
> +    {
> +      vec<jump_thread_edge *> *path = m_paths[i];
> +      if (path->length () != 1)
> +       {
> +         ++i;
> +         continue;
> +       }
> +      edge exit = (*path)[0]->e;
> +      remove_ctrl_stmt_and_useless_edges (exit->src, exit->dest);
> +      exit->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE | EDGE_ABNORMAL);
> +      exit->flags |= EDGE_FALLTHRU;
> +      /* We do not update dominance info.  */
> +      free_dominance_info (CDI_DOMINATORS);
> +      retval = true;
> +      m_num_threaded_edges++;
> +      path->release ();
> +      m_paths.unordered_remove (i);
> +    }
> +
>    while (m_paths.length ())
>      {
>        vec<jump_thread_edge *> *path = m_paths[0];
> --
> 2.35.3
>
  
Richard Biener Aug. 15, 2022, 9:39 a.m. UTC | #2
On Fri, 12 Aug 2022, Aldy Hernandez wrote:

> On Fri, Aug 12, 2022 at 2:01 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > This started with noticing we add ENTRY_BLOCK to our threads
> > just for the sake of simplifying the conditional at the end of
> > the first block in a function.  That's not really threading
> > anything but it ends up duplicating the entry block, and
> > re-writing the result instead of statically fold the jump.
> 
> Hmmm, but threading 2 blocks is not really threading at all??  Unless
> I'm misunderstanding something, this was even documented in the
> backwards threader:
> 
> [snip]
>      That's not really a jump threading opportunity, but instead is
>      simple cprop & simplification.  We could handle it here if we
>      wanted by wiring up all the incoming edges.  If we run this
>      early in IPA, that might be worth doing.   For now we just
>      reject that case.  */
>   if (m_path.length () <= 1)
>       return false;
> 
> Which you undoubtedly ran into because you're specifically eliding the check:
> 
> > -         if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
> > -                                         &irreducible)
> > +         if ((m_path.length () == 1
> > +              || m_profit.profitable_path_p (m_path, m_name, taken_edge,
> > +                                             &irreducible))

Correct.  But currently the threader just "cheats", picks up one more
block and then "threads" the case anyway, doing this simple simplification
in the most expensive way possible ...

> >
> > The following tries to handle those by recording simplifications
> > of the exit conditional as a thread of length one.  That requires
> > special-casing them in the backward copier since if we do not
> > have any block to copy but modify the jump in place and remove
> > not taken edges this confuses the hell out of remaining threads.
> >
> > So back_jt_path_registry::update_cfg now first marks all
> > edges we know are never taken and then prunes the threading
> > candidates when they include such edge.  Then it makes sure
> > to first perform unreachable edge removal (so we avoid
> > copying them when other thread paths contain the prevailing
> > edge) before continuing to apply the remaining threads.
> 
> This is all beyond my pay grade.  I'm not very well versed in the
> threader per se.  So if y'all think it's a good idea, by all means.
> Perhaps Jeff can chime in, or remembers the above comment?
> 
> >
> > In statistics you can see this avoids quite a bunch of useless
> > threads (I've investiated 3 random files from cc1files with
> > dropped stats in any of the thread passes).
> >
> > Still thinking about it it would be nice to avoid the work of
> > discovering those candidates we have to throw away later
> > which could eventually be done by having the backward threader
> > perform a RPO walk over the CFG, skipping edges that can be
> > statically determined as not being executed.  Below I'm
> > abusing the path range query to statically analyze the exit
> > branch but I assume there's a simpler way of folding this stmt
> > which could then better integrate with such a walk.
> 
> Unreachable paths can be queried with
> path_range_query::unreachable_path_p ().  Could you leverage this?
> The idea was that if we ever resolved any SSA name to UNDEFINED, the
> path itself was unreachable.

The situation is that we end up with paths where an intermediate
branch on the path can be simplified to false - but of course only
if we put all intermediate branch dependences into the list of
imports to consider.

I don't much like using the "threading" code to perform the
simplification, but I couldn't figure out a cheap way to do it
without invoking a full EVRP?  That said,
the backwards threader simply does

  basic_block bb;
  FOR_EACH_BB_FN (bb, m_fun)
    if (EDGE_COUNT (bb->succs) > 1)
      maybe_thread_block (bb);

  bool changed = m_registry.thread_through_all_blocks (true);

Instead of that we should only consider edges that may be executable,
by walking the CFG along such edges and simplifying BB exit
conditionals.  Unfortunately EVRP is now a C++ maze, so I couldn't
find out how to actually do such simplification, not knowing how
interacting with ranger influences the path query use either.
If you or Andrew have any suggestions on how to essentially
do a

  if (edge e = find_taken_edge (bb))
    {
...
    }

where find_taken_edge should be at least as powerful as using
the path query for a single bb then I'd be all ears.  As said,
I tried to find the code to cut&paste in EVRP but failed to ...

Thanks,
Richard.

> 
> Aldy
> 
> >
> > In any case it seems worth more conciously handling the
> > case of exit branches that simplify without path sensitive
> > information.
> >
> > Then the patch also restricts path discovery when we'd produce
> > threads we'll reject later during copying - the backward threader
> > copying cannot handle paths where the to duplicate blocks are
> > not from exactly the same loop.  I'm probably going to split this
> > part out.
> >
> > Any thoughts?
> >
> >         * gimple-range-path.cc (path_range_query::set_path): Adjust
> >         assert to allow paths of size one.
> >         * tree-ssa-threadbackward.cc (back_threader::maybe_register_path):
> >         Paths of size one are always profitable.
> >         (back_threader::find_paths_to_names): Likewise.
> >         Do not walk further if we are leaving the current loop.
> >         (back_threader::find_taken_edge): Remove assert.  Do not
> >         walk to ENTRY_BLOCK.
> >         * tree-ssa-threadupdate.cc (back_jt_path_registry::update_cfg):
> >         Handle jump threads of just the exit edge by modifying the
> >         control statement in-place.
> > ---
> >  gcc/gimple-range-path.cc       |  2 +-
> >  gcc/tree-ssa-threadbackward.cc | 21 ++++++++-----
> >  gcc/tree-ssa-threadupdate.cc   | 54 ++++++++++++++++++++++++++++++++++
> >  3 files changed, 69 insertions(+), 8 deletions(-)
> >
> > diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc
> > index 78146f5683e..a7d277c31b8 100644
> > --- a/gcc/gimple-range-path.cc
> > +++ b/gcc/gimple-range-path.cc
> > @@ -220,7 +220,7 @@ path_range_query::unreachable_path_p ()
> >  void
> >  path_range_query::set_path (const vec<basic_block> &path)
> >  {
> > -  gcc_checking_assert (path.length () > 1);
> > +  gcc_checking_assert (!path.is_empty ());
> >    m_path = path.copy ();
> >    m_pos = m_path.length () - 1;
> >    bitmap_clear (m_has_cache_entry);
> > diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
> > index b886027fccf..669098e4ec3 100644
> > --- a/gcc/tree-ssa-threadbackward.cc
> > +++ b/gcc/tree-ssa-threadbackward.cc
> > @@ -241,8 +241,9 @@ back_threader::maybe_register_path ()
> >        else
> >         {
> >           bool irreducible = false;
> > -         if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
> > -                                         &irreducible)
> > +         if ((m_path.length () == 1
> > +              || m_profit.profitable_path_p (m_path, m_name, taken_edge,
> > +                                             &irreducible))
> >               && debug_counter ()
> >               && m_registry.register_path (m_path, taken_edge))
> >             {
> > @@ -267,7 +268,6 @@ back_threader::maybe_register_path ()
> >  edge
> >  back_threader::find_taken_edge (const vec<basic_block> &path)
> >  {
> > -  gcc_checking_assert (path.length () > 1);
> >    switch (gimple_code (m_last_stmt))
> >      {
> >      case GIMPLE_COND:
> > @@ -350,9 +350,15 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
> >    m_path.safe_push (bb);
> >
> >    // Try to resolve the path without looking back.
> > -  if (m_path.length () > 1
> > -      && (!m_profit.profitable_path_p (m_path, m_name, NULL)
> > -         || maybe_register_path ()))
> > +  if ((m_path.length () > 1
> > +       && !m_profit.profitable_path_p (m_path, m_name, NULL))
> > +      || maybe_register_path ())
> > +    ;
> > +
> > +  // The backwards thread copier cannot copy blocks that do not belong
> > +  // to the same loop, so when the new source of the path entry no
> > +  // longer belongs to it we don't need to search further.
> > +  else if (m_path[0]->loop_father != bb->loop_father)
> >      ;
> >
> >    // Continue looking for ways to extend the path but limit the
> > @@ -445,7 +451,8 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
> >           edge e;
> >           FOR_EACH_EDGE (e, iter, bb->preds)
> >             {
> > -             if (e->flags & EDGE_ABNORMAL
> > +             if ((e->flags & EDGE_ABNORMAL)
> > +                 || e->src->index == ENTRY_BLOCK
> >                   // This is like path_crosses_loops in profitable_path_p but
> >                   // more restrictive to avoid peeling off loop iterations (see
> >                   // tree-ssa/pr14341.c for an example).
> > diff --git a/gcc/tree-ssa-threadupdate.cc b/gcc/tree-ssa-threadupdate.cc
> > index 59c268a3567..d40fa7c4cff 100644
> > --- a/gcc/tree-ssa-threadupdate.cc
> > +++ b/gcc/tree-ssa-threadupdate.cc
> > @@ -2613,6 +2613,60 @@ back_jt_path_registry::update_cfg (bool /*peel_loop_headers*/)
> >    bool retval = false;
> >    hash_set<edge> visited_starting_edges;
> >
> > +  /* Mark never taken edges from paths that are just jump simplifications.  */
> > +  auto_edge_flag never_taken (cfun);
> > +  for (auto path : m_paths)
> > +    if (path->length () == 1)
> > +      {
> > +       edge_iterator ei;
> > +       edge e;
> > +       FOR_EACH_EDGE (e, ei, (*path)[0]->e->src->succs)
> > +         if (e != (*path)[0]->e)
> > +           e->flags |= never_taken;
> > +      }
> > +
> > +  /* And prune paths that contain such edge before we remove them.  */
> > +  for (unsigned i = 0; i < m_paths.length ();)
> > +    {
> > +      bool remove = false;
> > +      for (auto je : *m_paths[i])
> > +       {
> > +         if (je->e->flags & never_taken)
> > +           {
> > +             cancel_thread (m_paths[i],
> > +                            "Avoiding threading through unreachable edge");
> > +             remove = true;
> > +             break;
> > +           }
> > +       }
> > +      if (!remove)
> > +       ++i;
> > +      else
> > +       m_paths.unordered_remove (i);
> > +    }
> > +
> > +  /* Finally perform those threads first, this way we avoid copying the
> > +     dead outgoing edges when other theads contain the prevailing edge.  */
> > +  for (unsigned i = 0; i < m_paths.length ();)
> > +    {
> > +      vec<jump_thread_edge *> *path = m_paths[i];
> > +      if (path->length () != 1)
> > +       {
> > +         ++i;
> > +         continue;
> > +       }
> > +      edge exit = (*path)[0]->e;
> > +      remove_ctrl_stmt_and_useless_edges (exit->src, exit->dest);
> > +      exit->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE | EDGE_ABNORMAL);
> > +      exit->flags |= EDGE_FALLTHRU;
> > +      /* We do not update dominance info.  */
> > +      free_dominance_info (CDI_DOMINATORS);
> > +      retval = true;
> > +      m_num_threaded_edges++;
> > +      path->release ();
> > +      m_paths.unordered_remove (i);
> > +    }
> > +
> >    while (m_paths.length ())
> >      {
> >        vec<jump_thread_edge *> *path = m_paths[0];
> > --
> > 2.35.3
> >
> 
>
  
Jeff Law Aug. 15, 2022, 3:22 p.m. UTC | #3
On 8/12/2022 10:03 AM, Aldy Hernandez wrote:
> On Fri, Aug 12, 2022 at 2:01 PM Richard Biener <rguenther@suse.de> wrote:
>> This started with noticing we add ENTRY_BLOCK to our threads
>> just for the sake of simplifying the conditional at the end of
>> the first block in a function.  That's not really threading
>> anything but it ends up duplicating the entry block, and
>> re-writing the result instead of statically fold the jump.
> Hmmm, but threading 2 blocks is not really threading at all??  Unless
> I'm misunderstanding something, this was even documented in the
> backwards threader:
>
> [snip]
>       That's not really a jump threading opportunity, but instead is
>       simple cprop & simplification.  We could handle it here if we
>       wanted by wiring up all the incoming edges.  If we run this
>       early in IPA, that might be worth doing.   For now we just
>       reject that case.  */
>    if (m_path.length () <= 1)
>        return false;
My recollection is that code was supposed to filter out the case where
the threading path is just a definition block and a use block where the
definition block dominates the use block.  For that case, threading
isn't really needed as we can just use const/copy propagation to
propagate the value from the def to the use, which should in turn allow
the use (the conditional branch) to be simplified away -- all without
the block copying and associated CFG updates.
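
A simplified illustration of that case (made-up names):

  /* flag is defined in a block that dominates the block testing it, so
     const/copy propagation alone folds the branch -- no duplication.  */
  int
  def_dominates_use (int x, int y)
  {
    int flag = 1;      /* definition block: flag is a known constant */
    if (y)
      x += y;          /* unrelated control flow in between */
    if (flag)          /* use block, dominated by the def; cprop folds this */
      return x;
    return 0;
  }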

What doesn't make sense to me today is how do we know there's a 
dominator relationship between the two blocks?



Jeff
  
Aldy Hernandez Aug. 15, 2022, 7:09 p.m. UTC | #4
On Mon, Aug 15, 2022 at 11:39 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Fri, 12 Aug 2022, Aldy Hernandez wrote:
>
> > On Fri, Aug 12, 2022 at 2:01 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > This started with noticing we add ENTRY_BLOCK to our threads
> > > just for the sake of simplifying the conditional at the end of
> > > the first block in a function.  That's not really threading
> > > anything but it ends up duplicating the entry block, and
> > > re-writing the result instead of statically fold the jump.
> >
> > Hmmm, but threading 2 blocks is not really threading at all??  Unless
> > I'm misunderstanding something, this was even documented in the
> > backwards threader:
> >
> > [snip]
> >      That's not really a jump threading opportunity, but instead is
> >      simple cprop & simplification.  We could handle it here if we
> >      wanted by wiring up all the incoming edges.  If we run this
> >      early in IPA, that might be worth doing.   For now we just
> >      reject that case.  */
> >   if (m_path.length () <= 1)
> >       return false;
> >
> > Which you undoubtedly ran into because you're specifically eliding the check:
> >
> > > -         if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
> > > -                                         &irreducible)
> > > +         if ((m_path.length () == 1
> > > +              || m_profit.profitable_path_p (m_path, m_name, taken_edge,
> > > +                                             &irreducible))
>
> Correct.  But currently the threader just "cheats", picks up one more
> block and then "threads" the case anyway, doing this simple simplification
> in the most expensive way possible ...

Ah.

>
> > >
> > > The following tries to handle those by recording simplifications
> > > of the exit conditional as a thread of length one.  That requires
> > > special-casing them in the backward copier since if we do not
> > > have any block to copy but modify the jump in place and remove
> > > not taken edges this confuses the hell out of remaining threads.
> > >
> > > So back_jt_path_registry::update_cfg now first marks all
> > > edges we know are never taken and then prunes the threading
> > > candidates when they include such edge.  Then it makes sure
> > > to first perform unreachable edge removal (so we avoid
> > > copying them when other thread paths contain the prevailing
> > > edge) before continuing to apply the remaining threads.
> >
> > This is all beyond my pay grade.  I'm not very well versed in the
> > threader per se.  So if y'all think it's a good idea, by all means.
> > Perhaps Jeff can chime in, or remembers the above comment?
> >
> > >
> > > In statistics you can see this avoids quite a bunch of useless
> > > threads (I've investiated 3 random files from cc1files with
> > > dropped stats in any of the thread passes).
> > >
> > > Still thinking about it it would be nice to avoid the work of
> > > discovering those candidates we have to throw away later
> > > which could eventually be done by having the backward threader
> > > perform a RPO walk over the CFG, skipping edges that can be
> > > statically determined as not being executed.  Below I'm
> > > abusing the path range query to statically analyze the exit
> > > branch but I assume there's a simpler way of folding this stmt
> > > which could then better integrate with such a walk.
> >
> > Unreachable paths can be queried with
> > path_range_query::unreachable_path_p ().  Could you leverage this?
> > The idea was that if we ever resolved any SSA name to UNDEFINED, the
> > path itself was unreachable.
>
> The situation is that we end up with paths where an intermediate
> branch on the path can be simplified to false - but of course only
> if we put all intermediate branch dependences into the list of
> imports to consider.
>
> I don't like it very much to use the "threading" code to perform
> the simplification but I couldn't figure a cheap way to perform
> the simplification without invoking a full EVRP?  That said,
> the backwards threader simply does
>
>   basic_block bb;
>   FOR_EACH_BB_FN (bb, m_fun)
>     if (EDGE_COUNT (bb->succs) > 1)
>       maybe_thread_block (bb);
>
>   bool changed = m_registry.thread_through_all_blocks (true);
>
> instead of that we should only consider edges that may be executable
> by instead walking the CFG along such edges, simplifying BB exit
> conditionals.  Unfortunately EVRP is now a C++ maze so I couldn't
> find how to actually do such simplification, not knowing how
> interacting with ranger influences the path query use either.
> If you or Andrew has any suggestions on how to essentially
> do a
>
>   if (edge e = find_taken_edge (bb))
>     {
> ...
>     }
>
> where find_taken_edge should be at least as powerful as using
> the path query for a single bb then I'd be all ears.  As said,
> I tried to find the code to cut&paste in EVRP but failed to ...

Interesting... If what you need is find_taken_edge(bb), I think we can
do this quite cleanly.

What you're asking is to fold the conditional at the end of a basic
block, and return the edge depending on the folded value.  We have all
the smarts to do the folding (fold_range), and tree-cfg.c has
find_taken_edge(basic_block, tree val), which AFAICT does everything
except the folding of non-trivial statements.
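
A rough sketch of the shape I have in mind (not the attached proof of
concept; the function name is made up, and it assumes the fold_range
overloads Andrew quotes further down in the thread):

  static edge
  find_taken_edge_by_folding (basic_block bb)
  {
    /* Fold BB's control statement with fold_range and hand the
       resulting constant, if any, to the existing
       find_taken_edge (bb, val) overload.  */
    gimple *stmt = last_stmt (bb);
    if (!stmt || gimple_code (stmt) != GIMPLE_COND)
      return NULL;

    int_range<2> r;
    tree val;
    if (!fold_range (r, stmt) || !r.singleton_p (&val))
      return NULL;

    /* find_taken_edge (bb, val) already knows how to map a constant
       condition to the corresponding outgoing edge.  */
    return find_taken_edge (bb, val);
  }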

If you like, I could tweak find_taken_edge() to use a ranger if
available (pass has called enable_ranger()), or otherwise use the
global range query in cfun (SSA_NAME_RANGE_INFO but with the
range_query API).  This should be as fast as poking at
SSA_NAME_RANGE_INFO for the common case, or using an active ranger if
available.

See the attached proof of concept.

With this approach we could handle everything find_taken_edge(bb)
currently handles, plus anything we can fold in ranger (without the
penalty of a full ranger if you don't want to -- we can still fold and
use global ranges for SSA names if available).  Or ultimately, if you
have an active ranger, you can leverage the full ranger functionality.
Either way is a lot better than the "1 != 0", etc. folding the code
currently does ;-).

If you want to fold these blocks from the threader, we could expose
the ranger in the path_query with enable_ranger(), and then
find_taken_edge(basic_block) can pick the ranger up.  Up to you if you
want to run this within the threader, or elsewhere.

Does this make sense?

I'm happy to cobble it up if you find it useful.
Aldy
  
Andrew MacLeod Aug. 15, 2022, 7:24 p.m. UTC | #5
heh. or just


+      int_range<2> r;
+      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
+      || !r.singleton_p (&val))


if you do not provide a range_query to any of the fold_using_range code, 
it defaults to:

fur_source::fur_source (range_query *q)
{
   if (q)
     m_query = q;
   else if (cfun)
     m_query = get_range_query (cfun);
   else
     m_query = get_global_range_query ();
   m_gori = NULL;
}

so it will default to the one you provide, and if there isn't one, it
will try to use the cfun version if cfun is available... otherwise it
defaults to the global range query.  So you don't need to provide the
cfun version.

This applies to the 5 basic routines in gimple-range-fold.h:

// Fold stmt S into range R using range query Q.
bool fold_range (vrange &r, gimple *s, range_query *q = NULL);
// Recalculate stmt S into R using range query Q as if it were on edge ON_EDGE.
bool fold_range (vrange &v, gimple *s, edge on_edge, range_query *q = NULL);

// These routines allow the operands to be specified when manually folding.
// Any excess queries will be drawn from the current range_query.
bool fold_range (vrange &r, gimple *s, vrange &r1);
bool fold_range (vrange &r, gimple *s, vrange &r1, vrange &r2);
bool fold_range (vrange &r, gimple *s, unsigned num_elements, vrange **vector);

Andrew


On 8/15/22 15:09, Aldy Hernandez wrote:
> On Mon, Aug 15, 2022 at 11:39 AM Richard Biener <rguenther@suse.de> wrote:
>> On Fri, 12 Aug 2022, Aldy Hernandez wrote:
>>
>>> On Fri, Aug 12, 2022 at 2:01 PM Richard Biener <rguenther@suse.de> wrote:
>>>> This started with noticing we add ENTRY_BLOCK to our threads
>>>> just for the sake of simplifying the conditional at the end of
>>>> the first block in a function.  That's not really threading
>>>> anything but it ends up duplicating the entry block, and
>>>> re-writing the result instead of statically fold the jump.
>>> Hmmm, but threading 2 blocks is not really threading at all??  Unless
>>> I'm misunderstanding something, this was even documented in the
>>> backwards threader:
>>>
>>> [snip]
>>>       That's not really a jump threading opportunity, but instead is
>>>       simple cprop & simplification.  We could handle it here if we
>>>       wanted by wiring up all the incoming edges.  If we run this
>>>       early in IPA, that might be worth doing.   For now we just
>>>       reject that case.  */
>>>    if (m_path.length () <= 1)
>>>        return false;
>>>
>>> Which you undoubtedly ran into because you're specifically eliding the check:
>>>
>>>> -         if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
>>>> -                                         &irreducible)
>>>> +         if ((m_path.length () == 1
>>>> +              || m_profit.profitable_path_p (m_path, m_name, taken_edge,
>>>> +                                             &irreducible))
>> Correct.  But currently the threader just "cheats", picks up one more
>> block and then "threads" the case anyway, doing this simple simplification
>> in the most expensive way possible ...
> Ah.
>
>>>> The following tries to handle those by recording simplifications
>>>> of the exit conditional as a thread of length one.  That requires
>>>> special-casing them in the backward copier since if we do not
>>>> have any block to copy but modify the jump in place and remove
>>>> not taken edges this confuses the hell out of remaining threads.
>>>>
>>>> So back_jt_path_registry::update_cfg now first marks all
>>>> edges we know are never taken and then prunes the threading
>>>> candidates when they include such edge.  Then it makes sure
>>>> to first perform unreachable edge removal (so we avoid
>>>> copying them when other thread paths contain the prevailing
>>>> edge) before continuing to apply the remaining threads.
>>> This is all beyond my pay grade.  I'm not very well versed in the
>>> threader per se.  So if y'all think it's a good idea, by all means.
>>> Perhaps Jeff can chime in, or remembers the above comment?
>>>
>>>> In statistics you can see this avoids quite a bunch of useless
>>>> threads (I've investiated 3 random files from cc1files with
>>>> dropped stats in any of the thread passes).
>>>>
>>>> Still thinking about it it would be nice to avoid the work of
>>>> discovering those candidates we have to throw away later
>>>> which could eventually be done by having the backward threader
>>>> perform a RPO walk over the CFG, skipping edges that can be
>>>> statically determined as not being executed.  Below I'm
>>>> abusing the path range query to statically analyze the exit
>>>> branch but I assume there's a simpler way of folding this stmt
>>>> which could then better integrate with such a walk.
>>> Unreachable paths can be queried with
>>> path_range_query::unreachable_path_p ().  Could you leverage this?
>>> The idea was that if we ever resolved any SSA name to UNDEFINED, the
>>> path itself was unreachable.
>> The situation is that we end up with paths where an intermediate
>> branch on the path can be simplified to false - but of course only
>> if we put all intermediate branch dependences into the list of
>> imports to consider.
>>
>> I don't like it very much to use the "threading" code to perform
>> the simplification but I couldn't figure a cheap way to perform
>> the simplification without invoking a full EVRP?  That said,
>> the backwards threader simply does
>>
>>    basic_block bb;
>>    FOR_EACH_BB_FN (bb, m_fun)
>>      if (EDGE_COUNT (bb->succs) > 1)
>>        maybe_thread_block (bb);
>>
>>    bool changed = m_registry.thread_through_all_blocks (true);
>>
>> instead of that we should only consider edges that may be executable
>> by instead walking the CFG along such edges, simplifying BB exit
>> conditionals.  Unfortunately EVRP is now a C++ maze so I couldn't
>> find how to actually do such simplification, not knowing how
>> interacting with ranger influences the path query use either.
>> If you or Andrew has any suggestions on how to essentially
>> do a
>>
>>    if (edge e = find_taken_edge (bb))
>>      {
>> ...
>>      }
>>
>> where find_taken_edge should be at least as powerful as using
>> the path query for a single bb then I'd be all ears.  As said,
>> I tried to find the code to cut&paste in EVRP but failed to ...
> Interesting... If what you need is find_taken_edge(bb), I think we can
> do this quite cleanly.
>
> What you're asking is to fold the conditional at the end of a basic
> block, and return the edge depending on the folded value.  We have all
> the smarts to do the folding (fold_range), and tree-cfg.c has
> find_taken_edge(basic_block, tree val), which AFAICT, does everything
> except the folding of non-trivial statements.
>
> If you like, I could tweak find_taken_edge() to use a ranger if
> available (pass has called enable_ranger()), or otherwise use the
> global range query in cfun (SSA_NAME_RANGE_INFO but with the
> range_query API).  This should be as fast as poking at
> SSA_NAME_RANGE_INFO for the common case, or using an active ranger if
> available.
>
> See the attached proof of concept.
>
> With this approach we could handle everything find_taken_edge(bb)
> currently handles, plus anything we can fold in ranger (without the
> penalty of a full ranger if you don't want to-- we can still fold and
> use global ranges for SSA names if available).  Or ultimately, if you
> have an active ranger, you can leverage the full ranger functionality.
> Either way is a lot better than the 1 != 0, etc folding the code
> currently does ;-).
>
> If you want to fold these blocks from the threader, we could expose
> the ranger in the path_query with enable_ranger(), and then
> find_taken_edge(basic_block) can pick the ranger up.  Up to you if you
> want to run this within the threader, or elsewhere.
>
> Does this make sense?
>
> I'm happy to cobble it up if you find it useful.
> Aldy
  
Aldy Hernandez Aug. 15, 2022, 7:29 p.m. UTC | #6
On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> heh. or just
>
>
> +      int_range<2> r;
> +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
> +      || !r.singleton_p (&val))
>
>
> if you do not provide a range_query to any of the fold_using_range code,
> it defaults to:
>
> fur_source::fur_source (range_query *q)
> {
>    if (q)
>      m_query = q;
>    else if (cfun)
>      m_query = get_range_query (cfun);
>    else
>      m_query = get_global_range_query ();
>    m_gori = NULL;
> }
>

Sweet.  Even better!
Aldy
  
Richard Biener Aug. 16, 2022, 9:18 a.m. UTC | #7
On Mon, 15 Aug 2022, Aldy Hernandez wrote:

> On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> >
> > heh. or just
> >
> >
> > +      int_range<2> r;
> > +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
> > +      || !r.singleton_p (&val))
> >
> >
> > if you do not provide a range_query to any of the fold_using_range code,
> > it defaults to:
> >
> > fur_source::fur_source (range_query *q)
> > {
> >    if (q)
> >      m_query = q;
> >    else if (cfun)
> >      m_query = get_range_query (cfun);
> >    else
> >      m_query = get_global_range_query ();
> >    m_gori = NULL;
> > }
> >
> 
> Sweet.  Even better!

So when I do the following incremental change on top of the posted
patch, I see that the path query is able to simplify more
"single BB paths" than the global range folding.

diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
index 669098e4ec3..777e778037f 100644
--- a/gcc/tree-ssa-threadbackward.cc
+++ b/gcc/tree-ssa-threadbackward.cc
@@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const vec<basic_block> &path,
 {
   int_range_max r;
 
+  int_range<2> rf;
+  if (path.length () == 1)
+    {
+      fold_range (rf, cond);
+    }
+
   m_solver->compute_ranges (path, m_imports);
   m_solver->range_of_stmt (r, cond);
 
@@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const vec<basic_block> &path,
 
   if (r == true_range || r == false_range)
     {
+      if (path.length () == 1)
+       gcc_assert  (r == rf);
       edge e_true, e_false;
       basic_block bb = gimple_bb (cond);
       extract_true_false_edges_from_block (bb, &e_true, &e_false);

Even doing the following (I'm not sure what the difference, and in
particular the expense, is compared to the path range query) results
in missed simplifications (checking my set of cc1files).

diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
index 669098e4ec3..1d43a179d08 100644
--- a/gcc/tree-ssa-threadbackward.cc
+++ b/gcc/tree-ssa-threadbackward.cc
@@ -99,6 +99,7 @@ private:
 
   back_threader_registry m_registry;
   back_threader_profitability m_profit;
+  gimple_ranger *m_ranger;
   path_range_query *m_solver;
 
   // Current path being analyzed.
@@ -146,12 +147,14 @@ back_threader::back_threader (function *fun, unsigned flags, bool first)
   // The path solver needs EDGE_DFS_BACK in resolving mode.
   if (flags & BT_RESOLVE)
     mark_dfs_back_edges ();
-  m_solver = new path_range_query (flags & BT_RESOLVE);
+  m_ranger = new gimple_ranger;
+  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
 }
 
 back_threader::~back_threader ()
 {
   delete m_solver;
+  delete m_ranger;
 
   loop_optimizer_finalize ();
 }
@@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const vec<basic_block> &path,
 {
   int_range_max r;
 
+  int_range<2> rf;
+  if (path.length () == 1)
+    {
+      fold_range (rf, cond, m_ranger);
+    }
+
   m_solver->compute_ranges (path, m_imports);
   m_solver->range_of_stmt (r, cond);
 
@@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const vec<basic_block> &path,
 
   if (r == true_range || r == false_range)
     {
+      if (path.length () == 1)
+       gcc_assert  (r == rf);
       edge e_true, e_false;
       basic_block bb = gimple_bb (cond);
       extract_true_false_edges_from_block (bb, &e_true, &e_false);

one example is

<bb 176> [local count: 14414059]:
_128 = node_177(D)->typed.type;
pretmp_413 = MEM[(const union tree_node *)_128].base.code;
_431 = pretmp_413 + 65519;
if (_128 == 0B)
  goto <bb 199>; [18.09%]
else
  goto <bb 177>; [81.91%]

where m_imports for the path is just _128 and the range computed is
false while the ranger query returns VARYING.  But
path_range_query::range_defined_in_block does

  if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
    m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);

which adjusts the range to ~[0, 0], probably because of the
dereference in the following stmt.

Why does fold_range not do this when folding the exit test?  Is there
a way to make it do so?  It looks like the only routine using this
in gimple-range.cc is range_on_edge and there it's used for e->src
after calling range_on_exit for e->src (why's it not done in
range_on_exit?).  A testcase for this is

int foo (int **p, int i)
{
  int *q = *p;
  int res = *q + i;
  if (q)
    return res;
  return -1;
}

which we "thread" with the path, and with the above change it ICEs
because fold_range doesn't get that if (q) is always true.  Without the
patch ethread doesn't want to duplicate the block (it's too large), but
threadfull will if you disable evrp (if you remove the increment by 'i'
it again won't, since nothing interesting prevails and it won't go to
BB 0, failing to pick up a thread of length > 1):

Checking profitability of path (backwards):  bb:2 (6 insns) bb:0
  Control statement insns: 2
  Overall: 4 insns
  [1] Registering jump thread: (0, 2) incoming edge;  (2, 3) nocopy;
path: 0->2->3 SUCCESS
Removing basic block 2
;; basic block 2, loop depth 0
;;  pred:
_1 = *p_6(D);
_2 = (long unsigned int) n_7(D);
_3 = _2 * 4;
q_8 = _1 + _3;
res_9 = *q_8;
if (q_8 != 0B)
  goto <bb 3>; [98.33%]
else
  goto <bb 4>; [1.67%]
;;  succ:       3
;;              4

... (it copied BB 2) ...

int foo (int * * p, int n)
{
  int res;
  int * q;
  int * _10;
  long unsigned int _11;
  long unsigned int _12;

  <bb 2> [local count: 1073741824]:
  _10 = *p_6(D);
  _11 = (long unsigned int) n_7(D);
  _12 = _11 * 4;
  q_13 = _10 + _12;
  res_14 = *q_13;
  return res_14;
  
Aldy Hernandez Aug. 16, 2022, 10:06 a.m. UTC | #8
On Tue, Aug 16, 2022 at 11:18 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Mon, 15 Aug 2022, Aldy Hernandez wrote:
>
> > On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> > >
> > > heh. or just
> > >
> > >
> > > +      int_range<2> r;
> > > +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
> > > +      || !r.singleton_p (&val))
> > >
> > >
> > > if you do not provide a range_query to any of the fold_using_range code,
> > > it defaults to:
> > >
> > > fur_source::fur_source (range_query *q)
> > > {
> > >    if (q)
> > >      m_query = q;
> > >    else if (cfun)
> > >      m_query = get_range_query (cfun);
> > >    else
> > >      m_query = get_global_range_query ();
> > >    m_gori = NULL;
> > > }
> > >
> >
> > Sweet.  Even better!
>
> So when I do the following incremental change ontop of the posted
> patch then I see that the path query is able to simplify more
> "single BB paths" than the global range folding.
>
> diff --git a/gcc/tree-ssa-threadbackward.cc
> b/gcc/tree-ssa-threadbackward.cc
> index 669098e4ec3..777e778037f 100644
> --- a/gcc/tree-ssa-threadbackward.cc
> +++ b/gcc/tree-ssa-threadbackward.cc
> @@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const
> vec<basic_block> &path,
>  {
>    int_range_max r;
>
> +  int_range<2> rf;
> +  if (path.length () == 1)
> +    {
> +      fold_range (rf, cond);
> +    }
> +
>    m_solver->compute_ranges (path, m_imports);
>    m_solver->range_of_stmt (r, cond);
>
> @@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const
> vec<basic_block> &path,
>
>    if (r == true_range || r == false_range)
>      {
> +      if (path.length () == 1)
> +       gcc_assert  (r == rf);
>        edge e_true, e_false;
>        basic_block bb = gimple_bb (cond);
>        extract_true_false_edges_from_block (bb, &e_true, &e_false);
>
> Even doing the following (not sure what's the difference and in
> particular expense over the path range query) results in missed
> simplifications (checking my set of cc1files).
>
> diff --git a/gcc/tree-ssa-threadbackward.cc
> b/gcc/tree-ssa-threadbackward.cc
> index 669098e4ec3..1d43a179d08 100644
> --- a/gcc/tree-ssa-threadbackward.cc
> +++ b/gcc/tree-ssa-threadbackward.cc
> @@ -99,6 +99,7 @@ private:
>
>    back_threader_registry m_registry;
>    back_threader_profitability m_profit;
> +  gimple_ranger *m_ranger;
>    path_range_query *m_solver;
>
>    // Current path being analyzed.
> @@ -146,12 +147,14 @@ back_threader::back_threader (function *fun,
> unsigned flags, bool first)
>    // The path solver needs EDGE_DFS_BACK in resolving mode.
>    if (flags & BT_RESOLVE)
>      mark_dfs_back_edges ();
> -  m_solver = new path_range_query (flags & BT_RESOLVE);
> +  m_ranger = new gimple_ranger;
> +  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
>  }

Passing an allocated ranger here results in fewer simplifications than
letting path_range_query allocate its own?  That's not right.  Or do
you mean that using fold_range() with the m_ranger causes ICEs with
your patch (due to the non-null processing described below)?

>
>  back_threader::~back_threader ()
>  {
>    delete m_solver;
> +  delete m_ranger;
>
>    loop_optimizer_finalize ();
>  }
> @@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const
> vec<basic_block> &path,
>  {
>    int_range_max r;
>
> +  int_range<2> rf;
> +  if (path.length () == 1)
> +    {
> +      fold_range (rf, cond, m_ranger);
> +    }
> +
>    m_solver->compute_ranges (path, m_imports);
>    m_solver->range_of_stmt (r, cond);
>
> @@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const
> vec<basic_block> &path,
>
>    if (r == true_range || r == false_range)
>      {
> +      if (path.length () == 1)
> +       gcc_assert  (r == rf);
>        edge e_true, e_false;
>        basic_block bb = gimple_bb (cond);
>        extract_true_false_edges_from_block (bb, &e_true, &e_false);
>
> one example is
>
> <bb 176> [local count: 14414059]:
> _128 = node_177(D)->typed.type;
> pretmp_413 = MEM[(const union tree_node *)_128].base.code;
> _431 = pretmp_413 + 65519;
> if (_128 == 0B)
>   goto <bb 199>; [18.09%]
> else
>   goto <bb 177>; [81.91%]
>
> where m_imports for the path is just _128 and the range computed is
> false while the ranger query returns VARYING.  But
> path_range_query::range_defined_in_block does
>
>   if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
>     m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);
>
> which adjusts the range to ~[0, 0], probably because of the
> dereference in the following stmt.
>
> Why does fold_range not do this when folding the exit test?  Is there
> a way to make it do so?  It looks like the only routine using this
> in gimple-range.cc is range_on_edge and there it's used for e->src
> after calling range_on_exit for e->src (why's it not done in
> range_on_exit?).  A testcase for this is

Andrew's gonna have to answer this one, because I'm just a user of the
infer_range infrastructure.  But yes, you're right... fold_range
doesn't seem to take into account side-effects such as non-null.

Aldy

>
> int foo (int **p, int i)
> {
>   int *q = *p;
>   int res = *q + i;
>   if (q)
>     return res;
>   return -1;
> }
>
> which we "thread" with the path and with the above ICEs because
> fold_range doesn't get that if (q) is always true.  Without the
> patch ethread doesn't want to duplicate the block (it's too large)
> but threadfull will if you disable evrp (if you remove the increment
> by 'i' it again won't since nothing interesting prevails and it
> won't go to BB 0 and fails to pick up a thread of length > 1):
>
> Checking profitability of path (backwards):  bb:2 (6 insns) bb:0
>   Control statement insns: 2
>   Overall: 4 insns
>   [1] Registering jump thread: (0, 2) incoming edge;  (2, 3) nocopy;
> path: 0->2->3 SUCCESS
> Removing basic block 2
> ;; basic block 2, loop depth 0
> ;;  pred:
> _1 = *p_6(D);
> _2 = (long unsigned int) n_7(D);
> _3 = _2 * 4;
> q_8 = _1 + _3;
> res_9 = *q_8;
> if (q_8 != 0B)
>   goto <bb 3>; [98.33%]
> else
>   goto <bb 4>; [1.67%]
> ;;  succ:       3
> ;;              4
>
> ... (it copied BB 2) ...
>
> int foo (int * * p, int n)
> {
>   int res;
>   int * q;
>   int * _10;
>   long unsigned int _11;
>   long unsigned int _12;
>
>   <bb 2> [local count: 1073741824]:
>   _10 = *p_6(D);
>   _11 = (long unsigned int) n_7(D);
>   _12 = _11 * 4;
>   q_13 = _10 + _12;
>   res_14 = *q_13;
>   return res_14;
>
  
Richard Biener Aug. 16, 2022, 11:32 a.m. UTC | #9
On Tue, 16 Aug 2022, Aldy Hernandez wrote:

> On Tue, Aug 16, 2022 at 11:18 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Mon, 15 Aug 2022, Aldy Hernandez wrote:
> >
> > > On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> > > >
> > > > heh. or just
> > > >
> > > >
> > > > +      int_range<2> r;
> > > > +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
> > > > +      || !r.singleton_p (&val))
> > > >
> > > >
> > > > if you do not provide a range_query to any of the fold_using_range code,
> > > > it defaults to:
> > > >
> > > > fur_source::fur_source (range_query *q)
> > > > {
> > > >    if (q)
> > > >      m_query = q;
> > > >    else if (cfun)
> > > >      m_query = get_range_query (cfun);
> > > >    else
> > > >      m_query = get_global_range_query ();
> > > >    m_gori = NULL;
> > > > }
> > > >
> > >
> > > Sweet.  Even better!
> >
> > So when I do the following incremental change ontop of the posted
> > patch then I see that the path query is able to simplify more
> > "single BB paths" than the global range folding.
> >
> > diff --git a/gcc/tree-ssa-threadbackward.cc
> > b/gcc/tree-ssa-threadbackward.cc
> > index 669098e4ec3..777e778037f 100644
> > --- a/gcc/tree-ssa-threadbackward.cc
> > +++ b/gcc/tree-ssa-threadbackward.cc
> > @@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const
> > vec<basic_block> &path,
> >  {
> >    int_range_max r;
> >
> > +  int_range<2> rf;
> > +  if (path.length () == 1)
> > +    {
> > +      fold_range (rf, cond);
> > +    }
> > +
> >    m_solver->compute_ranges (path, m_imports);
> >    m_solver->range_of_stmt (r, cond);
> >
> > @@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const
> > vec<basic_block> &path,
> >
> >    if (r == true_range || r == false_range)
> >      {
> > +      if (path.length () == 1)
> > +       gcc_assert  (r == rf);
> >        edge e_true, e_false;
> >        basic_block bb = gimple_bb (cond);
> >        extract_true_false_edges_from_block (bb, &e_true, &e_false);
> >
> > Even doing the following (not sure what's the difference and in
> > particular expense over the path range query) results in missed
> > simplifications (checking my set of cc1files).
> >
> > diff --git a/gcc/tree-ssa-threadbackward.cc
> > b/gcc/tree-ssa-threadbackward.cc
> > index 669098e4ec3..1d43a179d08 100644
> > --- a/gcc/tree-ssa-threadbackward.cc
> > +++ b/gcc/tree-ssa-threadbackward.cc
> > @@ -99,6 +99,7 @@ private:
> >
> >    back_threader_registry m_registry;
> >    back_threader_profitability m_profit;
> > +  gimple_ranger *m_ranger;
> >    path_range_query *m_solver;
> >
> >    // Current path being analyzed.
> > @@ -146,12 +147,14 @@ back_threader::back_threader (function *fun,
> > unsigned flags, bool first)
> >    // The path solver needs EDGE_DFS_BACK in resolving mode.
> >    if (flags & BT_RESOLVE)
> >      mark_dfs_back_edges ();
> > -  m_solver = new path_range_query (flags & BT_RESOLVE);
> > +  m_ranger = new gimple_ranger;
> > +  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
> >  }
> 
> Passing an allocated ranger here results in less simplifications over
> letting path_range_query allocate its own?  That's not right.  Or do
> you mean that using fold_range() with the m_ranger causes ICEs with
> your patch (due to the non-null processing described below)?

Yes, I needed a ranger to use fold_range (..., m_ranger), which
I thought might do more than not passing one.

> >
> >  back_threader::~back_threader ()
> >  {
> >    delete m_solver;
> > +  delete m_ranger;
> >
> >    loop_optimizer_finalize ();
> >  }
> > @@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const
> > vec<basic_block> &path,
> >  {
> >    int_range_max r;
> >
> > +  int_range<2> rf;
> > +  if (path.length () == 1)
> > +    {
> > +      fold_range (rf, cond, m_ranger);
> > +    }
> > +
> >    m_solver->compute_ranges (path, m_imports);
> >    m_solver->range_of_stmt (r, cond);
> >
> > @@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const
> > vec<basic_block> &path,
> >
> >    if (r == true_range || r == false_range)
> >      {
> > +      if (path.length () == 1)
> > +       gcc_assert  (r == rf);
> >        edge e_true, e_false;
> >        basic_block bb = gimple_bb (cond);
> >        extract_true_false_edges_from_block (bb, &e_true, &e_false);
> >
> > one example is
> >
> > <bb 176> [local count: 14414059]:
> > _128 = node_177(D)->typed.type;
> > pretmp_413 = MEM[(const union tree_node *)_128].base.code;
> > _431 = pretmp_413 + 65519;
> > if (_128 == 0B)
> >   goto <bb 199>; [18.09%]
> > else
> >   goto <bb 177>; [81.91%]
> >
> > where m_imports for the path is just _128 and the range computed is
> > false while the ranger query returns VARYING.  But
> > path_range_query::range_defined_in_block does
> >
> >   if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
> >     m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);
> >
> > which adjusts the range to ~[0, 0], probably because of the
> > dereference in the following stmt.
> >
> > Why does fold_range not do this when folding the exit test?  Is there
> > a way to make it do so?  It looks like the only routine using this
> > in gimple-range.cc is range_on_edge and there it's used for e->src
> > after calling range_on_exit for e->src (why's it not done in
> > range_on_exit?).  A testcase for this is
> 
> Andrew's gonna have to answer this one, because I'm just a user of the
> infer_range infrastructure.  But yes, you're right... fold_range
> doesn't seem to take into account side-effects such as non-null.

OK, let's see.  I can probably use the path query as well - I'll
see to produce a prototype of doing those simplifications early,
avoiding threadings there and through unreachable parts of the CFG.

Richard.

> Aldy
> 
> >
> > int foo (int **p, int i)
> > {
> >   int *q = *p;
> >   int res = *q + i;
> >   if (q)
> >     return res;
> >   return -1;
> > }
> >
> > which we "thread" with the path and with the above ICEs because
> > fold_range doesn't get that if (q) is always true.  Without the
> > patch ethread doesn't want to duplicate the block (it's too large)
> > but threadfull will if you disable evrp (if you remove the increment
> > by 'i' it again won't since nothing interesting prevails and it
> > won't go to BB 0 and fails to pick up a thread of length > 1):
> >
> > Checking profitability of path (backwards):  bb:2 (6 insns) bb:0
> >   Control statement insns: 2
> >   Overall: 4 insns
> >   [1] Registering jump thread: (0, 2) incoming edge;  (2, 3) nocopy;
> > path: 0->2->3 SUCCESS
> > Removing basic block 2
> > ;; basic block 2, loop depth 0
> > ;;  pred:
> > _1 = *p_6(D);
> > _2 = (long unsigned int) n_7(D);
> > _3 = _2 * 4;
> > q_8 = _1 + _3;
> > res_9 = *q_8;
> > if (q_8 != 0B)
> >   goto <bb 3>; [98.33%]
> > else
> >   goto <bb 4>; [1.67%]
> > ;;  succ:       3
> > ;;              4
> >
> > ... (it copied BB 2) ...
> >
> > int foo (int * * p, int n)
> > {
> >   int res;
> >   int * q;
> >   int * _10;
> >   long unsigned int _11;
> >   long unsigned int _12;
> >
> >   <bb 2> [local count: 1073741824]:
> >   _10 = *p_6(D);
> >   _11 = (long unsigned int) n_7(D);
> >   _12 = _11 * 4;
> >   q_13 = _10 + _12;
> >   res_14 = *q_13;
> >   return res_14;
> >
> 
>
  
Aldy Hernandez Aug. 16, 2022, 11:42 a.m. UTC | #10
On Tue, Aug 16, 2022 at 1:32 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Tue, 16 Aug 2022, Aldy Hernandez wrote:
>
> > On Tue, Aug 16, 2022 at 11:18 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Mon, 15 Aug 2022, Aldy Hernandez wrote:
> > >
> > > > On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> > > > >
> > > > > heh. or just
> > > > >
> > > > >
> > > > > +      int_range<2> r;
> > > > > +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
> > > > > +      || !r.singleton_p (&val))
> > > > >
> > > > >
> > > > > if you do not provide a range_query to any of the fold_using_range code,
> > > > > it defaults to:
> > > > >
> > > > > fur_source::fur_source (range_query *q)
> > > > > {
> > > > >    if (q)
> > > > >      m_query = q;
> > > > >    else if (cfun)
> > > > >      m_query = get_range_query (cfun);
> > > > >    else
> > > > >      m_query = get_global_range_query ();
> > > > >    m_gori = NULL;
> > > > > }
> > > > >
> > > >
> > > > Sweet.  Even better!
> > >
> > > So when I do the following incremental change ontop of the posted
> > > patch then I see that the path query is able to simplify more
> > > "single BB paths" than the global range folding.
> > >
> > > diff --git a/gcc/tree-ssa-threadbackward.cc
> > > b/gcc/tree-ssa-threadbackward.cc
> > > index 669098e4ec3..777e778037f 100644
> > > --- a/gcc/tree-ssa-threadbackward.cc
> > > +++ b/gcc/tree-ssa-threadbackward.cc
> > > @@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const
> > > vec<basic_block> &path,
> > >  {
> > >    int_range_max r;
> > >
> > > +  int_range<2> rf;
> > > +  if (path.length () == 1)
> > > +    {
> > > +      fold_range (rf, cond);
> > > +    }
> > > +
> > >    m_solver->compute_ranges (path, m_imports);
> > >    m_solver->range_of_stmt (r, cond);
> > >
> > > @@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const
> > > vec<basic_block> &path,
> > >
> > >    if (r == true_range || r == false_range)
> > >      {
> > > +      if (path.length () == 1)
> > > +       gcc_assert  (r == rf);
> > >        edge e_true, e_false;
> > >        basic_block bb = gimple_bb (cond);
> > >        extract_true_false_edges_from_block (bb, &e_true, &e_false);
> > >
> > > Even doing the following (not sure what's the difference and in
> > > particular expense over the path range query) results in missed
> > > simplifications (checking my set of cc1files).
> > >
> > > diff --git a/gcc/tree-ssa-threadbackward.cc
> > > b/gcc/tree-ssa-threadbackward.cc
> > > index 669098e4ec3..1d43a179d08 100644
> > > --- a/gcc/tree-ssa-threadbackward.cc
> > > +++ b/gcc/tree-ssa-threadbackward.cc
> > > @@ -99,6 +99,7 @@ private:
> > >
> > >    back_threader_registry m_registry;
> > >    back_threader_profitability m_profit;
> > > +  gimple_ranger *m_ranger;
> > >    path_range_query *m_solver;
> > >
> > >    // Current path being analyzed.
> > > @@ -146,12 +147,14 @@ back_threader::back_threader (function *fun,
> > > unsigned flags, bool first)
> > >    // The path solver needs EDGE_DFS_BACK in resolving mode.
> > >    if (flags & BT_RESOLVE)
> > >      mark_dfs_back_edges ();
> > > -  m_solver = new path_range_query (flags & BT_RESOLVE);
> > > +  m_ranger = new gimple_ranger;
> > > +  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
> > >  }
> >
> > Passing an allocated ranger here results in less simplifications over
> > letting path_range_query allocate its own?  That's not right.  Or do
> > you mean that using fold_range() with the m_ranger causes ICEs with
> > your patch (due to the non-null processing described below)?
>
> Yes, I've needed a ranger to use fold_range (..., m_ranger) which
> I thought might do more than not passing one.

Yeah.  If you don't pass it a ranger, it'll use global ranges (i.e.
SSA_NAME_RANGE_INFO).

More specifically, it will first try to get the ranger you have
enabled in your pass with enable_ranger().  If that's not available,
then it will use global ranges.

So you could also do:

m_ranger = enable_ranger (fun);

and then in the destructor:

disable_ranger (m_fun);
m_ranger = NULL; // for good measure.

Then you could use fold_range() without any parameters and it will
DTRT.  This is what I had in mind when I shared my proof of concept
for tree-cfg's version of find_taken_edge(bb).  If you have enabled a
ranger, it'll use that, otherwise it'll use global SSA_NAME_RANGE_INFO
ranges.
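
Roughly like this (just a sketch of the pattern, using the member names
from the patch under discussion; the fold_range overload without a
query argument then finds the enabled ranger via get_range_query (cfun)):

  back_threader::back_threader (function *fun, unsigned flags, bool first)
  {
    // ...
    m_ranger = enable_ranger (fun);
  }

  back_threader::~back_threader ()
  {
    disable_ranger (m_fun);
    m_ranger = NULL;   // for good measure
    // ...
  }

  // Resolving the exit conditional then needs no explicit query:
  int_range<2> r;
  tree val;
  if (fold_range (r, cond) && r.singleton_p (&val))
    /* the condition is statically known; pick the edge from VAL.  */;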


>
> > >
> > >  back_threader::~back_threader ()
> > >  {
> > >    delete m_solver;
> > > +  delete m_ranger;
> > >
> > >    loop_optimizer_finalize ();
> > >  }
> > > @@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const
> > > vec<basic_block> &path,
> > >  {
> > >    int_range_max r;
> > >
> > > +  int_range<2> rf;
> > > +  if (path.length () == 1)
> > > +    {
> > > +      fold_range (rf, cond, m_ranger);
> > > +    }
> > > +
> > >    m_solver->compute_ranges (path, m_imports);
> > >    m_solver->range_of_stmt (r, cond);
> > >
> > > @@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const
> > > vec<basic_block> &path,
> > >
> > >    if (r == true_range || r == false_range)
> > >      {
> > > +      if (path.length () == 1)
> > > +       gcc_assert  (r == rf);
> > >        edge e_true, e_false;
> > >        basic_block bb = gimple_bb (cond);
> > >        extract_true_false_edges_from_block (bb, &e_true, &e_false);
> > >
> > > one example is
> > >
> > > <bb 176> [local count: 14414059]:
> > > _128 = node_177(D)->typed.type;
> > > pretmp_413 = MEM[(const union tree_node *)_128].base.code;
> > > _431 = pretmp_413 + 65519;
> > > if (_128 == 0B)
> > >   goto <bb 199>; [18.09%]
> > > else
> > >   goto <bb 177>; [81.91%]
> > >
> > > where m_imports for the path is just _128 and the range computed is
> > > false while the ranger query returns VARYING.  But
> > > path_range_query::range_defined_in_block does
> > >
> > >   if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
> > >     m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);
> > >
> > > which adjusts the range to ~[0, 0], probably because of the
> > > dereference in the following stmt.
> > >
> > > Why does fold_range not do this when folding the exit test?  Is there
> > > a way to make it do so?  It looks like the only routine using this
> > > in gimple-range.cc is range_on_edge and there it's used for e->src
> > > after calling range_on_exit for e->src (why's it not done in
> > > range_on_exit?).  A testcase for this is
> >
> > Andrew's gonna have to answer this one, because I'm just a user of the
> > infer_range infrastructure.  But yes, you're right... fold_range
> > doesn't seem to take into account side-effects such as non-null.
>
> OK, let's see.  I can probably use the path query as well - I'll
> see to produce a prototype of doing those simplifications early,
> avoiding threadings there and through unreachable parts of the CFG.

Let's see if Andrew has any bright ideas on having fold_range work
right with non-null, etc.  Otherwise, the path_query idea would be
fine, although it'd be nice to have a global entry point for this,
like what you suggested with find_taken_edge(bb).
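
Something like the sketch below, i.e. a helper that folds the block's
exit condition with whatever query is currently enabled and returns the
statically taken edge (the name is made up and switches are ignored;
this only illustrates the idea):

  static edge
  find_taken_edge_by_folding (basic_block bb)
  {
    gcond *cond = safe_dyn_cast <gcond *> (last_stmt (bb));
    if (!cond)
      return NULL;
    int_range<2> r;
    tree val;
    if (!fold_range (r, cond) || !r.singleton_p (&val))
      return NULL;
    edge e_true, e_false;
    extract_true_false_edges_from_block (bb, &e_true, &e_false);
    return integer_zerop (val) ? e_false : e_true;
  }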

Aldy
  
Richard Biener Aug. 16, 2022, 1:44 p.m. UTC | #11
On Tue, 16 Aug 2022, Richard Biener wrote:

> On Tue, 16 Aug 2022, Aldy Hernandez wrote:
> 
> > On Tue, Aug 16, 2022 at 11:18 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Mon, 15 Aug 2022, Aldy Hernandez wrote:
> > >
> > > > On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> > > > >
> > > > > heh. or just
> > > > >
> > > > >
> > > > > +      int_range<2> r;
> > > > > +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
> > > > > +      || !r.singleton_p (&val))
> > > > >
> > > > >
> > > > > if you do not provide a range_query to any of the fold_using_range code,
> > > > > it defaults to:
> > > > >
> > > > > fur_source::fur_source (range_query *q)
> > > > > {
> > > > >    if (q)
> > > > >      m_query = q;
> > > > >    else if (cfun)
> > > > >      m_query = get_range_query (cfun);
> > > > >    else
> > > > >      m_query = get_global_range_query ();
> > > > >    m_gori = NULL;
> > > > > }
> > > > >
> > > >
> > > > Sweet.  Even better!
> > >
> > > So when I do the following incremental change ontop of the posted
> > > patch then I see that the path query is able to simplify more
> > > "single BB paths" than the global range folding.
> > >
> > > diff --git a/gcc/tree-ssa-threadbackward.cc
> > > b/gcc/tree-ssa-threadbackward.cc
> > > index 669098e4ec3..777e778037f 100644
> > > --- a/gcc/tree-ssa-threadbackward.cc
> > > +++ b/gcc/tree-ssa-threadbackward.cc
> > > @@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const
> > > vec<basic_block> &path,
> > >  {
> > >    int_range_max r;
> > >
> > > +  int_range<2> rf;
> > > +  if (path.length () == 1)
> > > +    {
> > > +      fold_range (rf, cond);
> > > +    }
> > > +
> > >    m_solver->compute_ranges (path, m_imports);
> > >    m_solver->range_of_stmt (r, cond);
> > >
> > > @@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const
> > > vec<basic_block> &path,
> > >
> > >    if (r == true_range || r == false_range)
> > >      {
> > > +      if (path.length () == 1)
> > > +       gcc_assert  (r == rf);
> > >        edge e_true, e_false;
> > >        basic_block bb = gimple_bb (cond);
> > >        extract_true_false_edges_from_block (bb, &e_true, &e_false);
> > >
> > > Even doing the following (not sure what's the difference and in
> > > particular expense over the path range query) results in missed
> > > simplifications (checking my set of cc1files).
> > >
> > > diff --git a/gcc/tree-ssa-threadbackward.cc
> > > b/gcc/tree-ssa-threadbackward.cc
> > > index 669098e4ec3..1d43a179d08 100644
> > > --- a/gcc/tree-ssa-threadbackward.cc
> > > +++ b/gcc/tree-ssa-threadbackward.cc
> > > @@ -99,6 +99,7 @@ private:
> > >
> > >    back_threader_registry m_registry;
> > >    back_threader_profitability m_profit;
> > > +  gimple_ranger *m_ranger;
> > >    path_range_query *m_solver;
> > >
> > >    // Current path being analyzed.
> > > @@ -146,12 +147,14 @@ back_threader::back_threader (function *fun,
> > > unsigned flags, bool first)
> > >    // The path solver needs EDGE_DFS_BACK in resolving mode.
> > >    if (flags & BT_RESOLVE)
> > >      mark_dfs_back_edges ();
> > > -  m_solver = new path_range_query (flags & BT_RESOLVE);
> > > +  m_ranger = new gimple_ranger;
> > > +  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
> > >  }
> > 
> > Passing an allocated ranger here results in less simplifications over
> > letting path_range_query allocate its own?  That's not right.  Or do
> > you mean that using fold_range() with the m_ranger causes ICEs with
> > your patch (due to the non-null processing described below)?
> 
> Yes, I've needed a ranger to use fold_range (..., m_ranger) which
> I thought might do more than not passing one.
> 
> > >
> > >  back_threader::~back_threader ()
> > >  {
> > >    delete m_solver;
> > > +  delete m_ranger;
> > >
> > >    loop_optimizer_finalize ();
> > >  }
> > > @@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const
> > > vec<basic_block> &path,
> > >  {
> > >    int_range_max r;
> > >
> > > +  int_range<2> rf;
> > > +  if (path.length () == 1)
> > > +    {
> > > +      fold_range (rf, cond, m_ranger);
> > > +    }
> > > +
> > >    m_solver->compute_ranges (path, m_imports);
> > >    m_solver->range_of_stmt (r, cond);
> > >
> > > @@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const
> > > vec<basic_block> &path,
> > >
> > >    if (r == true_range || r == false_range)
> > >      {
> > > +      if (path.length () == 1)
> > > +       gcc_assert  (r == rf);
> > >        edge e_true, e_false;
> > >        basic_block bb = gimple_bb (cond);
> > >        extract_true_false_edges_from_block (bb, &e_true, &e_false);
> > >
> > > one example is
> > >
> > > <bb 176> [local count: 14414059]:
> > > _128 = node_177(D)->typed.type;
> > > pretmp_413 = MEM[(const union tree_node *)_128].base.code;
> > > _431 = pretmp_413 + 65519;
> > > if (_128 == 0B)
> > >   goto <bb 199>; [18.09%]
> > > else
> > >   goto <bb 177>; [81.91%]
> > >
> > > where m_imports for the path is just _128 and the range computed is
> > > false while the ranger query returns VARYING.  But
> > > path_range_query::range_defined_in_block does
> > >
> > >   if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
> > >     m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);
> > >
> > > which adjusts the range to ~[0, 0], probably because of the
> > > dereference in the following stmt.
> > >
> > > Why does fold_range not do this when folding the exit test?  Is there
> > > a way to make it do so?  It looks like the only routine using this
> > > in gimple-range.cc is range_on_edge and there it's used for e->src
> > > after calling range_on_exit for e->src (why's it not done in
> > > range_on_exit?).  A testcase for this is
> > 
> > Andrew's gonna have to answer this one, because I'm just a user of the
> > infer_range infrastructure.  But yes, you're right... fold_range
> > doesn't seem to take into account side-effects such as non-null.
> 
> OK, let's see.  I can probably use the path query as well - I'll
> see to produce a prototype of doing those simplifications early,
> avoiding threadings there and through unreachable parts of the CFG.

Something like the following - it processes the CFG first,
simplifying BB conditionals and marking edges so we can later
avoid threading along unreachable paths.  It will no longer
count those simplifications as threaded jumps.

It's basically like an EVRP pass on steroids (because of the path
query doing more than ranger folding of the conditional), but only
simplifying conditionals.  Ideally just using ranger or even
better, fold_stmt (..., m_ranger) plus find_taken_edge ()
would work here.

Especially for ethread doing such simplification seems worthwhile
since EVRP only comes later.  Likewise the first threading pass
after inlining runs before the first VRP pass (and VRP isn't enabled
at -O1 but threading is).

That said, what triggered this is seeing that the backwards threader
needs 0 -> 2 (aka ENTRY_BLOCK -> 2) to simplify the conditional in
BB2 and that it will also copy BB2 for this even though there's only
a single entry into BB2.  I suppose we could say we don't want to
thread those, thus reject any threading attempt with an entry
edge into a block with a single predecessor?  Or optimize for
this case in the copier, not copying such blocks?

I'm somewhat undecided what's the correct thing to do here.

Richard.

diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc
index c99d77dd340..782794d3825 100644
--- a/gcc/gimple-range-path.cc
+++ b/gcc/gimple-range-path.cc
@@ -220,7 +220,6 @@ path_range_query::unreachable_path_p ()
 void
 path_range_query::set_path (const vec<basic_block> &path)
 {
-  gcc_checking_assert (path.length () > 1);
   m_path = path.copy ();
   m_pos = m_path.length () - 1;
   bitmap_clear (m_has_cache_entry);
diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
index 1c362839fab..3ca8c6eb9da 100644
--- a/gcc/tree-ssa-threadbackward.cc
+++ b/gcc/tree-ssa-threadbackward.cc
@@ -91,9 +91,9 @@ private:
   edge maybe_register_path ();
   void maybe_register_path_dump (edge taken_edge);
   void find_paths_to_names (basic_block bb, bitmap imports, unsigned);
-  edge find_taken_edge (const vec<basic_block> &path);
-  edge find_taken_edge_cond (const vec<basic_block> &path, gcond *);
-  edge find_taken_edge_switch (const vec<basic_block> &path, gswitch *);
+  edge find_taken_edge (const vec<basic_block> &path, bool = false);
+  edge find_taken_edge_cond (const vec<basic_block> &path, gcond *, bool);
+  edge find_taken_edge_switch (const vec<basic_block> &path, gswitch *, bool);
   virtual void debug ();
   virtual void dump (FILE *out);
 
@@ -124,6 +124,7 @@ private:
   // each threadfull[12] pass.  This is used to differentiate between
   // the different threading passes so we can set up debug counters.
   bool m_first;
+  auto_edge_flag m_reachable;
 };
 
 // Used to differentiate unreachable edges, so we may stop the search
@@ -132,7 +133,8 @@ const edge back_threader::UNREACHABLE_EDGE = (edge) -1;
 
 back_threader::back_threader (function *fun, unsigned flags, bool first)
   : m_profit (flags & BT_SPEED),
-    m_first (first)
+    m_first (first),
+    m_reachable (fun)
 {
   if (flags & BT_SPEED)
     loop_optimizer_init (LOOPS_HAVE_PREHEADERS | LOOPS_HAVE_SIMPLE_LATCHES);
@@ -265,16 +267,16 @@ back_threader::maybe_register_path ()
 // outgoing edge can be calculated, return NULL.
 
 edge
-back_threader::find_taken_edge (const vec<basic_block> &path)
+back_threader::find_taken_edge (const vec<basic_block> &path, bool modify)
 {
-  gcc_checking_assert (path.length () > 1);
   switch (gimple_code (m_last_stmt))
     {
     case GIMPLE_COND:
-      return find_taken_edge_cond (path, as_a<gcond *> (m_last_stmt));
+      return find_taken_edge_cond (path, as_a<gcond *> (m_last_stmt), modify);
 
     case GIMPLE_SWITCH:
-      return find_taken_edge_switch (path, as_a<gswitch *> (m_last_stmt));
+      return find_taken_edge_switch (path, as_a<gswitch *> (m_last_stmt),
+				     modify);
 
     default:
       return NULL;
@@ -285,7 +287,7 @@ back_threader::find_taken_edge (const vec<basic_block> &path)
 
 edge
 back_threader::find_taken_edge_switch (const vec<basic_block> &path,
-				       gswitch *sw)
+				       gswitch *sw, bool modify)
 {
   tree name = gimple_switch_index (sw);
   int_range_max r;
@@ -303,6 +305,9 @@ back_threader::find_taken_edge_switch (const vec<basic_block> &path,
   if (!label)
     return NULL;
 
+  if (modify)
+    gimple_switch_set_index (sw, CASE_LOW (label));
+
   return find_edge (gimple_bb (sw), label_to_block (cfun, CASE_LABEL (label)));
 }
 
@@ -310,7 +315,7 @@ back_threader::find_taken_edge_switch (const vec<basic_block> &path,
 
 edge
 back_threader::find_taken_edge_cond (const vec<basic_block> &path,
-				     gcond *cond)
+				     gcond *cond, bool modify)
 {
   int_range_max r;
 
@@ -328,6 +333,13 @@ back_threader::find_taken_edge_cond (const vec<basic_block> &path,
       edge e_true, e_false;
       basic_block bb = gimple_bb (cond);
       extract_true_false_edges_from_block (bb, &e_true, &e_false);
+      if (modify)
+	{
+	  if (r == true_range)
+	    gimple_cond_make_true (cond);
+	  else
+	    gimple_cond_make_false (cond);
+	}
       return r == true_range ? e_true : e_false;
     }
   return NULL;
@@ -452,6 +464,7 @@ back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
 	  FOR_EACH_EDGE (e, iter, bb->preds)
 	    {
 	      if (e->flags & EDGE_ABNORMAL
+		  || !(e->flags & m_reachable)
 		  // This is like path_crosses_loops in profitable_path_p but
 		  // more restrictive to avoid peeling off loop iterations (see
 		  // tree-ssa/pr14341.c for an example).
@@ -948,16 +961,101 @@ unsigned int
 back_threader::thread_blocks ()
 {
   basic_block bb;
+  bool changed = false;
+
+  auto_vec<basic_block> worklist (n_basic_blocks_for_fn (cfun));
+  auto_bb_flag visited (cfun);
+  worklist.quick_push (single_succ (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+  worklist[0]->flags |= visited;
+  while (!worklist.is_empty ())
+    {
+      bb = worklist.pop ();
+
+      // Try to resolve the jump at the end of the block
+      // ???  We'd like to use ranger without a fake path of
+      // length one here, but that turns out to be less powerful
+      gimple *stmt = last_stmt (bb);
+      if (stmt
+	  && (gimple_code (stmt) == GIMPLE_COND
+	      || gimple_code (stmt) == GIMPLE_SWITCH))
+	{
+	  bitmap_clear (m_imports);
+	  ssa_op_iter iter;
+	  tree name;
+	  // could use gori exports here
+	  FOR_EACH_SSA_TREE_OPERAND (name, stmt, iter, SSA_OP_USE)
+	    {
+	      if (gimple_range_ssa_p (name))
+		bitmap_set_bit (m_imports, SSA_NAME_VERSION (name));
+	    }
+	  if (!bitmap_empty_p (m_imports))
+	    {
+	      auto_vec<basic_block, 1> path;
+	      path.quick_push (bb);
+	      m_last_stmt = stmt;
+	      if (edge e = find_taken_edge (path, true))
+		{
+		  if (e == UNREACHABLE_EDGE)
+		    continue;
+		  else
+		    {
+		      // The jump was modified to be trivially
+		      // redirectable by CFG cleanup, so make sure
+		      // that runs
+		      changed = true;
+		      if (!(e->dest->flags & visited))
+			{
+			  worklist.quick_push (e->dest);
+			  e->dest->flags |= visited;
+			  e->flags |= m_reachable;
+			  continue;
+			}
+		    }
+		}
+	    }
+	}
+
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  e->flags |= m_reachable;
+	  if (!(e->dest->flags & visited))
+	    {
+	      worklist.quick_push (e->dest);
+	      e->dest->flags |= visited;
+	      e->flags |= m_reachable;
+	    }
+	}
+    }
+
   FOR_EACH_BB_FN (bb, m_fun)
-    if (EDGE_COUNT (bb->succs) > 1)
-      maybe_thread_block (bb);
+    // visited doubles for reachable
+    if (bb->flags & visited)
+      {
+	bb->flags &= ~visited;
+	unsigned cnt = 0;
+	edge_iterator ei;
+	edge e;
+	FOR_EACH_EDGE (e, ei, bb->succs)
+	  if (e->flags & m_reachable)
+	    cnt++;
+	// check we have a yet unresolved exit jump
+	if (cnt > 1)
+	  maybe_thread_block (bb);
+      }
 
-  bool changed = m_registry.thread_through_all_blocks (true);
+  FOR_EACH_BB_FN (bb, m_fun)
+    {
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	e->flags &= ~m_reachable;
+    }
 
-  if (m_flags & BT_SPEED)
-    return changed ? TODO_cleanup_cfg : 0;
+  changed |= m_registry.thread_through_all_blocks (true);
 
-  return false;
+  return changed ? TODO_cleanup_cfg : 0;
 }
 
 namespace {
  
Andrew MacLeod Aug. 16, 2022, 2:30 p.m. UTC | #12
On 8/16/22 05:18, Richard Biener wrote:
> On Mon, 15 Aug 2022, Aldy Hernandez wrote:
>
>> On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>>> heh. or just
>>>
>>>
>>> +      int_range<2> r;
>>> +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
>>> +      || !r.singleton_p (&val))
>>>
>>>
>>> if you do not provide a range_query to any of the fold_using_range code,
>>> it defaults to:
>>>
>>> fur_source::fur_source (range_query *q)
>>> {
>>>     if (q)
>>>       m_query = q;
>>>     else if (cfun)
>>>       m_query = get_range_query (cfun);
>>>     else
>>>       m_query = get_global_range_query ();
>>>     m_gori = NULL;
>>> }
>>>
>> Sweet.  Even better!
> So when I do the following incremental change ontop of the posted
> patch then I see that the path query is able to simplify more
> "single BB paths" than the global range folding.
>
> diff --git a/gcc/tree-ssa-threadbackward.cc
> b/gcc/tree-ssa-threadbackward.cc
> index 669098e4ec3..777e778037f 100644
> --- a/gcc/tree-ssa-threadbackward.cc
> +++ b/gcc/tree-ssa-threadbackward.cc
> @@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const
> vec<basic_block> &path,
>   {
>     int_range_max r;
>   
> +  int_range<2> rf;
> +  if (path.length () == 1)
> +    {
> +      fold_range (rf, cond);
> +    }
> +
>     m_solver->compute_ranges (path, m_imports);
>     m_solver->range_of_stmt (r, cond);
>   
> @@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const
> vec<basic_block> &path,
>   
>     if (r == true_range || r == false_range)
>       {
> +      if (path.length () == 1)
> +       gcc_assert  (r == rf);
>         edge e_true, e_false;
>         basic_block bb = gimple_bb (cond);
>         extract_true_false_edges_from_block (bb, &e_true, &e_false);
>
> Even doing the following (not sure what's the difference and in
> particular expense over the path range query) results in missed
> simplifications (checking my set of cc1files).
>
> diff --git a/gcc/tree-ssa-threadbackward.cc
> b/gcc/tree-ssa-threadbackward.cc
> index 669098e4ec3..1d43a179d08 100644
> --- a/gcc/tree-ssa-threadbackward.cc
> +++ b/gcc/tree-ssa-threadbackward.cc
> @@ -99,6 +99,7 @@ private:
>   
>     back_threader_registry m_registry;
>     back_threader_profitability m_profit;
> +  gimple_ranger *m_ranger;
>     path_range_query *m_solver;
>   
>     // Current path being analyzed.
> @@ -146,12 +147,14 @@ back_threader::back_threader (function *fun,
> unsigned flags, bool first)
>     // The path solver needs EDGE_DFS_BACK in resolving mode.
>     if (flags & BT_RESOLVE)
>       mark_dfs_back_edges ();
> -  m_solver = new path_range_query (flags & BT_RESOLVE);
> +  m_ranger = new gimple_ranger;
> +  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
>   }
>   
>   back_threader::~back_threader ()
>   {
>     delete m_solver;
> +  delete m_ranger;
>   
>     loop_optimizer_finalize ();
>   }
> @@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const
> vec<basic_block> &path,
>   {
>     int_range_max r;
>   
> +  int_range<2> rf;
> +  if (path.length () == 1)
> +    {
> +      fold_range (rf, cond, m_ranger);
> +    }
> +
>     m_solver->compute_ranges (path, m_imports);
>     m_solver->range_of_stmt (r, cond);
>   
> @@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const
> vec<basic_block> &path,
>   
>     if (r == true_range || r == false_range)
>       {
> +      if (path.length () == 1)
> +       gcc_assert  (r == rf);
>         edge e_true, e_false;
>         basic_block bb = gimple_bb (cond);
>         extract_true_false_edges_from_block (bb, &e_true, &e_false);
>
> one example is
>
> <bb 176> [local count: 14414059]:
> _128 = node_177(D)->typed.type;
> pretmp_413 = MEM[(const union tree_node *)_128].base.code;
> _431 = pretmp_413 + 65519;
> if (_128 == 0B)
>    goto <bb 199>; [18.09%]
> else
>    goto <bb 177>; [81.91%]
>
> where m_imports for the path is just _128 and the range computed is
> false while the ranger query returns VARYING.  But
> path_range_query::range_defined_in_block does
>
>    if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
>      m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);
This is the coarse grained "side effect applies somewhere in the block" 
mechanism.  There is no understanding of where in the block it happens.
>
> which adjusts the range to ~[0, 0], probably because of the
> dereference in the following stmt.
>
> Why does fold_range not do this when folding the exit test?  Is there
> a way to make it do so?  It looks like the only routine using this
> in gimple-range.cc is range_on_edge and there it's used for e->src
> after calling range_on_exit for e->src (why's it not done in
> range_on_exit?).  A testcase for this is

fold_range doesn't do this because to it this is simply another 
statement.  It makes no attempt to understand the context in which you 
are folding something; you could be folding that stmt from a different 
location (i.e. recomputing it).  If your context is that you are looking 
for the range after the last statement has been executed, then one 
needs to check to see if there are any side effects.

Ranger uses it for range_on_edge (), because it knows all the 
statements in the block have been executed, and it's safe to apply 
anything seen in the block.  It does it right after range_on_exit () is 
called internally.

Once upon a time, it was integrated with range_on_exit, but it turned 
out there were certain times that it was causing problems.  There have 
been some cleanups since then, so it is probably safe now to return that 
call to range_on_exit, but that doesn't buy us a whole lot by itself... 
except of course I have now OKd using range_on_entry/exit generally :-)

The cache also uses it when walking blocks to pick up inferred values 
during an on-entry cache fill.
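
Roughly, the shape of that is (a sketch only, not the actual
gimple-range.cc implementation):

  // range_on_edge: all of e->src has executed, so inferred ranges
  // recorded for that block (e.g. non-null from a dereference) can be
  // applied on top of the on-exit range before intersecting with
  // whatever the edge itself implies.
  range_on_exit (r, e->src, name);
  m_cache.m_exit.maybe_adjust_range (r, name, e->src);
  // ... then refine R with the edge-specific (GORI) range for NAME.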


> int foo (int **p, int i)
> {
>    int *q = *p;
>    int res = *q + i;
>    if (q)
>      return res;
>    return -1;
> }
>
> which we "thread" with the path and with the above ICEs because
> fold_range doesn't get that if (q) is always true.  Without the

It's a known limitation that, unless you are doing a walk, on-demand 
requests will "miss" some inferred ranges, because they are only 
maintained at the basic block level.  (I.e., we will know that q is 
non-null in BB2, but we don't know where, so we can make no assumptions 
about q at the exit condition in this case.)  The path_query is invoked 
in on-demand mode because it won't walk the entire IL, so the first time 
you ask for the range of an ssa-name, it will quickly zip over the 
immediate use list and "fill" the on-exit structure for any blocks in 
which a non-null reference is seen.  This allows the threader to pick up 
non-null from blocks outside the path that haven't been examined.
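
In other words, something like the following happens lazily on the
first query (an illustrative sketch; record_nonnull_on_exit stands in
for the real bookkeeping in the inferred-range cache):

  // Scan NAME's immediate uses and note, per basic block, that a
  // dereference implies NAME is non-null somewhere in that block.
  imm_use_iterator iter;
  gimple *use_stmt;
  FOR_EACH_IMM_USE_STMT (use_stmt, iter, name)
    if (infer_nonnull_range (use_stmt, name))
      record_nonnull_on_exit (gimple_bb (use_stmt), name);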

VRP does a walk, and during the walk it adjusts ranges on the fly for 
the current block via the gimple_ranger::register_inferred_ranges () 
routine, which is really just a wrapper around 
ranger_cache::apply_inferred_ranges () (in gimple-range-cache.cc).

This is called after every statement and is where we take care of the 
bookkeeping for adjusting values and adding them to the block's list.

If the path query is walking those statements, it could also "adjust" 
the range of q on the fly... but it has to have walked those statements.  
From that routine, the relevant bits use the gimple-infer class to find 
any inferred ranges from the statement, and would look something like:

   gimple_infer_range infer(s);
   for (unsigned x = 0; x < infer.num (); x++)
     {
       tree name = infer.name (x);
       if (!interesting_p (name))
          continue;
       get_current_path_range (r, name);
       if (r.intersect (infer.range (x)))
         set_current_path_range (name, r);
     }

That would adjust the value of q to be non-zero after "int res = *q + i;",

but you need to have walked that statement before you get to the 
condition.  As long as they are all in your list of interesting 
statements to look at, then you'd be golden.

I don't know if, when, or in what direction things are examined.

Andrew
  
Richard Biener Aug. 17, 2022, 7:42 a.m. UTC | #13
On Tue, 16 Aug 2022, Andrew MacLeod wrote:

> 
> On 8/16/22 05:18, Richard Biener wrote:
> > On Mon, 15 Aug 2022, Aldy Hernandez wrote:
> >
> >> On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
> >>> heh. or just
> >>>
> >>>
> >>> +      int_range<2> r;
> >>> +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
> >>> +      || !r.singleton_p (&val))
> >>>
> >>>
> >>> if you do not provide a range_query to any of the fold_using_range code,
> >>> it defaults to:
> >>>
> >>> fur_source::fur_source (range_query *q)
> >>> {
> >>>     if (q)
> >>>       m_query = q;
> >>>     else if (cfun)
> >>>       m_query = get_range_query (cfun);
> >>>     else
> >>>       m_query = get_global_range_query ();
> >>>     m_gori = NULL;
> >>> }
> >>>
> >> Sweet.  Even better!
> > So when I do the following incremental change ontop of the posted
> > patch then I see that the path query is able to simplify more
> > "single BB paths" than the global range folding.
> >
> > diff --git a/gcc/tree-ssa-threadbackward.cc
> > b/gcc/tree-ssa-threadbackward.cc
> > index 669098e4ec3..777e778037f 100644
> > --- a/gcc/tree-ssa-threadbackward.cc
> > +++ b/gcc/tree-ssa-threadbackward.cc
> > @@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const
> > vec<basic_block> &path,
> >   {
> >     int_range_max r;
> >   +  int_range<2> rf;
> > +  if (path.length () == 1)
> > +    {
> > +      fold_range (rf, cond);
> > +    }
> > +
> >     m_solver->compute_ranges (path, m_imports);
> >     m_solver->range_of_stmt (r, cond);
> >   @@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const
> > vec<basic_block> &path,
> >   
> >     if (r == true_range || r == false_range)
> >       {
> > +      if (path.length () == 1)
> > +       gcc_assert  (r == rf);
> >         edge e_true, e_false;
> >         basic_block bb = gimple_bb (cond);
> >         extract_true_false_edges_from_block (bb, &e_true, &e_false);
> >
> > Even doing the following (not sure what's the difference and in
> > particular expense over the path range query) results in missed
> > simplifications (checking my set of cc1files).
> >
> > diff --git a/gcc/tree-ssa-threadbackward.cc
> > b/gcc/tree-ssa-threadbackward.cc
> > index 669098e4ec3..1d43a179d08 100644
> > --- a/gcc/tree-ssa-threadbackward.cc
> > +++ b/gcc/tree-ssa-threadbackward.cc
> > @@ -99,6 +99,7 @@ private:
> >   
> >     back_threader_registry m_registry;
> >     back_threader_profitability m_profit;
> > +  gimple_ranger *m_ranger;
> >     path_range_query *m_solver;
> >   
> >     // Current path being analyzed.
> > @@ -146,12 +147,14 @@ back_threader::back_threader (function *fun,
> > unsigned flags, bool first)
> >     // The path solver needs EDGE_DFS_BACK in resolving mode.
> >     if (flags & BT_RESOLVE)
> >       mark_dfs_back_edges ();
> > -  m_solver = new path_range_query (flags & BT_RESOLVE);
> > +  m_ranger = new gimple_ranger;
> > +  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
> >   }
> >   
> >   back_threader::~back_threader ()
> >   {
> >     delete m_solver;
> > +  delete m_ranger;
> >   
> >     loop_optimizer_finalize ();
> >   }
> > @@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const
> > vec<basic_block> &path,
> >   {
> >     int_range_max r;
> >   +  int_range<2> rf;
> > +  if (path.length () == 1)
> > +    {
> > +      fold_range (rf, cond, m_ranger);
> > +    }
> > +
> >     m_solver->compute_ranges (path, m_imports);
> >     m_solver->range_of_stmt (r, cond);
> >   @@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const
> > vec<basic_block> &path,
> >   
> >     if (r == true_range || r == false_range)
> >       {
> > +      if (path.length () == 1)
> > +       gcc_assert  (r == rf);
> >         edge e_true, e_false;
> >         basic_block bb = gimple_bb (cond);
> >         extract_true_false_edges_from_block (bb, &e_true, &e_false);
> >
> > one example is
> >
> > <bb 176> [local count: 14414059]:
> > _128 = node_177(D)->typed.type;
> > pretmp_413 = MEM[(const union tree_node *)_128].base.code;
> > _431 = pretmp_413 + 65519;
> > if (_128 == 0B)
> >    goto <bb 199>; [18.09%]
> > else
> >    goto <bb 177>; [81.91%]
> >
> > where m_imports for the path is just _128 and the range computed is
> > false while the ranger query returns VARYING.  But
> > path_range_query::range_defined_in_block does
> >
> >    if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
> >      m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);
> This is the coarse grained "side effect applies somewhere in the block"
> mechanism.  There is no understanding of where in the block it happens.
> >
> > which adjusts the range to ~[0, 0], probably because of the
> > dereference in the following stmt.
> >
> > Why does fold_range not do this when folding the exit test?  Is there
> > a way to make it do so?  It looks like the only routine using this
> > in gimple-range.cc is range_on_edge and there it's used for e->src
> > after calling range_on_exit for e->src (why's it not done in
> > range_on_exit?).  A testcase for this is
> 
> Fold_range doesnt do this because it is simply another statement.  It makes no
> attempt to understand the context in which you are folding something. you
> could be folding that stmt from a different location (ie recomputing)   If
> your context is that you are looking for the range after the last statement
> has been executed, then one needs to check to see if there are any side
> effects.

Hmm, but I'm asking it to fold a specific statement - how can that ever
be folded "from a different context"?  In fact it traces as
" at stmt " with the stmt I pass it.  But yes, the issue is of course
that to compute the range of q != 0 it needs to compute the range
of 'q' but it simply does

  // If no name, simply call the base routine.
  if (!name)
    {
      res = fold_range_internal (r, s, NULL_TREE);
      if (res && is_a <gcond *> (s))
        {
          // Update any exports in the cache if this is a gimple cond 
statement.
          tree exp;
          basic_block bb = gimple_bb (s);
          FOR_EACH_GORI_EXPORT_NAME (m_cache.m_gori, bb, exp)
            m_cache.propagate_updated_value (exp, bb);

and fur_depend seems to get the context so it could adjust ranges
in its get_operand which seems to just call range_of_expr with
the stmt as third arg.  Unfortunately gimple_ranger::range_of_expr
is undocumented, but again 'stmt' is dumped as " at stmt " thus
seems to be the "context" here?

      // If name is defined in this block, try to get an range from S.

what is 'S'?  def_stmt or stmt?

      if (def_stmt && gimple_bb (def_stmt) == bb)
        {
          // Declared in this block, if it has a global set, check for an
          // override from a block walk, otherwise calculate it.
          if (m_cache.get_global_range (r, expr))
            m_cache.block_range (r, bb, expr, false);
          else
            range_of_stmt (r, def_stmt, expr);

so, add 

      if (is_ctrl_stmt (stmt))
         m_cache.m_exit.maybe_adjust_range (r, expr, bb);

at the end (also covering the range_on_entry case).  I suppose
range_on_entry does not evaluate all possible paths from definition
to BB, using maybe_adjust_range on blocks in between and unioning
ranges at path merges?  That would likely be prohibitively expensive.
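
Spelled out against the range_of_expr fragment quoted above, the
suggestion would look like this (only the block-local branch is shown,
and it is untested):

      if (def_stmt && gimple_bb (def_stmt) == bb)
        {
          if (m_cache.get_global_range (r, expr))
            m_cache.block_range (r, bb, expr, false);
          else
            range_of_stmt (r, def_stmt, expr);
          // New: STMT is the block's control statement, so all other
          // stmts in BB have executed and the block's inferred
          // (side-effect) ranges can be applied to EXPR.
          if (is_ctrl_stmt (stmt))
            m_cache.m_exit.maybe_adjust_range (r, expr, bb);
        }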

> ranger uses it for range_on_edge (), because  it knows all the statements in
> the block have been executed, and its safe to apply anything seen in the
> block.  It does it right after range_on_exit() is called internally.
> 
> Once upon a time, it was integrated with range-on-exit, but it turned out
> there were certain times that it was causing problems. There have been some
> cleanups since then, it probably safe now to return that call to
> range_on_exit.. but doesnt buy us a whole lot by itself.. except of course I
> have now OKd using range_on-entry/exit generally :-)
> 
> the cache also uses it when walking blocks to pick up inferred values during
> an on-entry cache fill.
> 
> 
> > int foo (int **p, int i)
> > {
> >    int *q = *p;
> >    int res = *q + i;
> >    if (q)
> >      return res;
> >    return -1;
> > }
> >
> > which we "thread" with the path and with the above ICEs because
> > fold_range doesn't get that if (q) is always true.  Without the
> 
> Its a known limitation that, unless you are doing a walk, on-demand requests
> will "miss" some inferred ranges, because they are only maintained at the
> basic block level.  (ie, we will know that q is non-null in BB2,  but we don't
> know where, so we can make no assumptions at the exit condition about q in
> this case. the path_query is invoked in on-demand mode because it wont walk
> the entire IL,. so the first time you ask for the range of an ssa-name, it
> will quickly zip over the immediate use list and "fill" the on-exit structure
> for any blocks which a non-null reference is seen.  This allows the threader
> to pick up non-null from blocks outside the path that havent been examined.
> 
> VRP does a walk, and while during the walk, adjusts ranges on the fly for the
> current block via the gimple_ranger::register_inferred_ranges () routine. 
> which is really just a wrapper around ranger_cache::apply_inferred_ranges () 
> (in gimple-range-cache.cc)
> 
> This is called after every statement and is where we take care of bookkeeping
> for adjusting values, and adding them to the blocks list.
> 
> if the path query is walking those statement, it could also "adjust" the range
> of q on the fly... but it has to have walked those statements.  from that
> routine, the relevant bits use the gimple-infer class to find any inferred
> ranges from the statement, and would look something like:
> 
>   gimple_infer_range infer(s);
>   for (unsigned x = 0; x < infer.num (); x++)
>     {
>       tree name = infer.name (x);
>       if (!interesting_p (name))
>          continue;
>       get_current_path_range (r, name);
>       if (r.intersect (infer.range (x)))
>         set_current_path_range (name, r);
>     }
> 
> That would adjust the value of q to be non-zero after   "int res = *q + i;"
> 
> but you need to have walked the statement before you get to the condition.  
> as long as they are all in your list of interesting statement to look at, then
> you'd be golden.

Ah, so while the above might work it would go against the spirit of this
being only fully useful if applied during a walk, adjusting the
cached ranges?  It's still a bit unfortunate that one needs to walk all
stmts for this - the use case wants to just walk the CFG and control stmts,
and with the path query it would only cover the single BB of each
control stmt (thus the above hack would probably work out).

Meanwhile I'm leaning towards calling this a phase ordering issue
of threading + VRP, but that also means we shouldn't deliberately
try to preserve "threadings" of this kind - in fact we might want
to explicitly reject them?

Richard.
  
Andrew MacLeod Aug. 17, 2022, 2:39 p.m. UTC | #14
On 8/17/22 03:42, Richard Biener wrote:
> On Tue, 16 Aug 2022, Andrew MacLeod wrote:
>
>> On 8/16/22 05:18, Richard Biener wrote:
>>> On Mon, 15 Aug 2022, Aldy Hernandez wrote:
>>>
>>>> On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>>>>> heh. or just
>>>>>
>>>>>
>>>>> +      int_range<2> r;
>>>>> +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
>>>>> +      || !r.singleton_p (&val))
>>>>>
>>>>>
>>>>> if you do not provide a range_query to any of the fold_using_range code,
>>>>> it defaults to:
>>>>>
>>>>> fur_source::fur_source (range_query *q)
>>>>> {
>>>>>      if (q)
>>>>>        m_query = q;
>>>>>      else if (cfun)
>>>>>        m_query = get_range_query (cfun);
>>>>>      else
>>>>>        m_query = get_global_range_query ();
>>>>>      m_gori = NULL;
>>>>> }
>>>>>
>>>> Sweet.  Even better!
>>> So when I do the following incremental change ontop of the posted
>>> patch then I see that the path query is able to simplify more
>>> "single BB paths" than the global range folding.
>>>
>>> diff --git a/gcc/tree-ssa-threadbackward.cc
>>> b/gcc/tree-ssa-threadbackward.cc
>>> index 669098e4ec3..777e778037f 100644
>>> --- a/gcc/tree-ssa-threadbackward.cc
>>> +++ b/gcc/tree-ssa-threadbackward.cc
>>> @@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const
>>> vec<basic_block> &path,
>>>    {
>>>      int_range_max r;
>>>    +  int_range<2> rf;
>>> +  if (path.length () == 1)
>>> +    {
>>> +      fold_range (rf, cond);
>>> +    }
>>> +
>>>      m_solver->compute_ranges (path, m_imports);
>>>      m_solver->range_of_stmt (r, cond);
>>>    @@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const
>>> vec<basic_block> &path,
>>>    
>>>      if (r == true_range || r == false_range)
>>>        {
>>> +      if (path.length () == 1)
>>> +       gcc_assert  (r == rf);
>>>          edge e_true, e_false;
>>>          basic_block bb = gimple_bb (cond);
>>>          extract_true_false_edges_from_block (bb, &e_true, &e_false);
>>>
>>> Even doing the following (not sure what's the difference and in
>>> particular expense over the path range query) results in missed
>>> simplifications (checking my set of cc1files).
>>>
>>> diff --git a/gcc/tree-ssa-threadbackward.cc
>>> b/gcc/tree-ssa-threadbackward.cc
>>> index 669098e4ec3..1d43a179d08 100644
>>> --- a/gcc/tree-ssa-threadbackward.cc
>>> +++ b/gcc/tree-ssa-threadbackward.cc
>>> @@ -99,6 +99,7 @@ private:
>>>    
>>>      back_threader_registry m_registry;
>>>      back_threader_profitability m_profit;
>>> +  gimple_ranger *m_ranger;
>>>      path_range_query *m_solver;
>>>    
>>>      // Current path being analyzed.
>>> @@ -146,12 +147,14 @@ back_threader::back_threader (function *fun,
>>> unsigned flags, bool first)
>>>      // The path solver needs EDGE_DFS_BACK in resolving mode.
>>>      if (flags & BT_RESOLVE)
>>>        mark_dfs_back_edges ();
>>> -  m_solver = new path_range_query (flags & BT_RESOLVE);
>>> +  m_ranger = new gimple_ranger;
>>> +  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
>>>    }
>>>    
>>>    back_threader::~back_threader ()
>>>    {
>>>      delete m_solver;
>>> +  delete m_ranger;
>>>    
>>>      loop_optimizer_finalize ();
>>>    }
>>> @@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const
>>> vec<basic_block> &path,
>>>    {
>>>      int_range_max r;
>>>    +  int_range<2> rf;
>>> +  if (path.length () == 1)
>>> +    {
>>> +      fold_range (rf, cond, m_ranger);
>>> +    }
>>> +
>>>      m_solver->compute_ranges (path, m_imports);
>>>      m_solver->range_of_stmt (r, cond);
>>>    @@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const
>>> vec<basic_block> &path,
>>>    
>>>      if (r == true_range || r == false_range)
>>>        {
>>> +      if (path.length () == 1)
>>> +       gcc_assert  (r == rf);
>>>          edge e_true, e_false;
>>>          basic_block bb = gimple_bb (cond);
>>>          extract_true_false_edges_from_block (bb, &e_true, &e_false);
>>>
>>> one example is
>>>
>>> <bb 176> [local count: 14414059]:
>>> _128 = node_177(D)->typed.type;
>>> pretmp_413 = MEM[(const union tree_node *)_128].base.code;
>>> _431 = pretmp_413 + 65519;
>>> if (_128 == 0B)
>>>     goto <bb 199>; [18.09%]
>>> else
>>>     goto <bb 177>; [81.91%]
>>>
>>> where m_imports for the path is just _128 and the range computed is
>>> false while the ranger query returns VARYING.  But
>>> path_range_query::range_defined_in_block does
>>>
>>>     if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
>>>       m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);
>> This is the coarse grained "side effect applies somewhere in the block"
>> mechanism.  There is no understanding of where in the block it happens.
>>> which adjusts the range to ~[0, 0], probably because of the
>>> dereference in the following stmt.
>>>
>>> Why does fold_range not do this when folding the exit test?  Is there
>>> a way to make it do so?  It looks like the only routine using this
>>> in gimple-range.cc is range_on_edge and there it's used for e->src
>>> after calling range_on_exit for e->src (why's it not done in
>>> range_on_exit?).  A testcase for this is
>> Fold_range doesnt do this because it is simply another statement.  It makes no
>> attempt to understand the context in which you are folding something. you
>> could be folding that stmt from a different location (ie recomputing)   If
>> your context is that you are looking for the range after the last statement
>> has been executed, then one needs to check to see if there are any side
>> effects.
> Hmm, but I'm asking it to fold a specific statement - how can that ever
> be folded "from a different context"?  In fact it traces as
> " at stmt " with the stmt I pass it.  But yes, the issue is of course
> that to compute the range of q != 0 it needs to compute the range
> of 'q' but it simply does
>
>    // If no name, simply call the base routine.
>    if (!name)
>      {
>        res = fold_range_internal (r, s, NULL_TREE);
>        if (res && is_a <gcond *> (s))
>          {
>            // Update any exports in the cache if this is a gimple cond
> statement.
>            tree exp;
>            basic_block bb = gimple_bb (s);
>            FOR_EACH_GORI_EXPORT_NAME (m_cache.m_gori, bb, exp)
>              m_cache.propagate_updated_value (exp, bb);
>
> and fur_depend seems to get the context so it could adjust ranges
> in its get_operand which seems to just call range_of_expr with
> the stmt as third arg.  Unfortunately gimple_ranger::range_of_expr
> is undocumented, but again 'stmt' is dumped as " at stmt " thus
> seems to be the "context" here?
Oh, so you are looking at gimple_ranger::range_of_stmt(), not fold_range.
>        // If name is defined in this block, try to get an range from S.
>
> what is 'S'?  def_stmt or stmt?
>
>        if (def_stmt && gimple_bb (def_stmt) == bb)
>          {
>            // Declared in this block, if it has a global set, check for an
>            // override from a block walk, otherwise calculate it.
>            if (m_cache.get_global_range (r, expr))
>              m_cache.block_range (r, bb, expr, false);
>            else
>              range_of_stmt (r, def_stmt, expr);
>
> so, add
>
>        if (is_ctrl_stmt (stmt))
>           m_cache.m_exit.maybe_adjust_range (r, expr, bb);
>
> at the end (also covering the range_on_entry case).  I suppose
> range_on_entry does not evaluate all possible paths from definition
> to BB, using maybe_adjust_range on blocks inbetween and unioning
> ranges at path merges?  That would likely be prohibitly expensive.

In fact it does.  The cache takes care of filling in the blanks and 
requesting ranges that have not been satisfied yet in a reasonably 
efficient way, and maybe_adjust_range () will deal only with inferred 
ranges that are marked as having occurred in this block.

The problem I ran into when trying to do it in range_on_exit () is 
that if there is an abnormal edge out of the block, it is unsafe to 
assume that anything is true at block exit: when processing the 
abnormal destination, we ask for range_on_exit of anything it uses from 
the src block.

That said, I suppose it should be safe for range_of_stmt to assume the 
statement successfully executed (or there wouldn't be a definition 
produced), and therefore if it is the last stmt in the block, we should 
in theory be able to do as you suggest there.

I'm trying to think whether it's safe to also do it in range_of_expr.  
If is_ctrl_stmt () is true, can I assume that the stmt cannot throw an 
exception under any circumstance?  In which case, it's probably safe to 
put that in range_of_expr.  I've attached a patch which does that if you 
want to try it.

The caveat is that it is only a partial solution... it will only work 
for names on that stmt.  If you have anything more complex, e.g.

if (a == 0 || b == 0), we have a sequence feeding the ctrl stmt:

c_1 = a == 0
c_2 = b == 0
c_3 = c_1 && c_2
if (c_3 == 0)

Only the evaluation of c_3 will have the ctrl stmt as its context; the 
others will be evaluated on their own statement, and thus neither a nor 
b would pick up anything from the block, as they are evaluated and 
cached as they are seen...  unless of course we are doing a walk :-P


>
>> ranger uses it for range_on_edge (), because  it knows all the statements in
>> the block have been executed, and its safe to apply anything seen in the
>> block.  It does it right after range_on_exit() is called internally.
>>
>> Once upon a time, it was integrated with range-on-exit, but it turned out
>> there were certain times that it was causing problems. There have been some
>> cleanups since then, it probably safe now to return that call to
>> range_on_exit.. but doesnt buy us a whole lot by itself.. except of course I
>> have now OKd using range_on-entry/exit generally :-)
>>
>> the cache also uses it when walking blocks to pick up inferred values during
>> an on-entry cache fill.
>>
>>
>>> int foo (int **p, int i)
>>> {
>>>     int *q = *p;
>>>     int res = *q + i;
>>>     if (q)
>>>       return res;
>>>     return -1;
>>> }
>>>
>>> which we "thread" with the path and with the above ICEs because
>>> fold_range doesn't get that if (q) is always true.  Without the
>> Its a known limitation that, unless you are doing a walk, on-demand requests
>> will "miss" some inferred ranges, because they are only maintained at the
>> basic block level.  (ie, we will know that q is non-null in BB2,  but we don't
>> know where, so we can make no assumptions at the exit condition about q in
>> this case. the path_query is invoked in on-demand mode because it wont walk
>> the entire IL,. so the first time you ask for the range of an ssa-name, it
>> will quickly zip over the immediate use list and "fill" the on-exit structure
>> for any blocks which a non-null reference is seen.  This allows the threader
>> to pick up non-null from blocks outside the path that havent been examined.
>>
>> VRP does a walk, and while during the walk, adjusts ranges on the fly for the
>> current block via the gimple_ranger::register_inferred_ranges () routine.
>> which is really just a wrapper around ranger_cache::apply_inferred_ranges ()
>> (in gimple-range-cache.cc)
>>
>> This is called after every statement and is where we take care of bookkeeping
>> for adjusting values, and adding them to the blocks list.
>>
>> if the path query is walking those statement, it could also "adjust" the range
>> of q on the fly... but it has to have walked those statements.  from that
>> routine, the relevant bits use the gimple-infer class to find any inferred
>> ranges from the statement, and would look something like:
>>
>>    gimple_infer_range infer(s);
>>    for (unsigned x = 0; x < infer.num (); x++)
>>      {
>>        tree name = infer.name (x);
>>        if (!interesting_p (name))
>>           continue;
>>        get_current_path_range (r, name);
>>        if (r.intersect (infer.range (x)))
>>          set_current_path_range (name, r);
>>      }
>>
>> That would adjust the value of q to be non-zero after   "int res = *q + i;"
>>
>> but you need to have walked the statement before you get to the condition.
>> as long as they are all in your list of interesting statement to look at, then
>> you'd be golden.
> Ah, so while the above might work it would go against the spirit of this
> being only fully useful if applied during a walk, adjusting the
> cached ranges?  It's still a bit unfortunate that one needs to walk all
> stmts for this - the usecase wants to just walk the CFG and control stmts,
> and with the path query it would only cover the single BB of each
> control stmt (thus the above hack would probably work out).

I wouldn't say it's against the spirit, merely against trying to catch
all cases in a practical way :-).  If we add that, it should work for
this case.  For the threader, we will have visited all the uses of q to
fill the inferred block cache, so q will have a non-null reference in
bb2.  And when calculating the final value, if we are indeed assuming
anything in the block can be applied to the final stmt, then it should
be OK, but we'll still miss anything but the simplest cases.  Maybe
that's OK; better to catch a few cases than none.
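
For concreteness, the a == 0 || b == 0 case I mentioned earlier, written
out as a compilable function, would look something like this (a made-up
example, not a real testcase; bar() is just a placeholder name):

/* Hypothetical example.  The dereferences infer a != 0 and b != 0
   somewhere in this block, but the condition is lowered into the
   c_1/c_2/c_3 sequence, so 'a' and 'b' are evaluated (and cached) at
   their own defining statements; only the final temporary sees the
   branch as its context.  */
int bar (int *a, int *b, int i)
{
  int res = *a + *b + i;
  if (a == 0 || b == 0)
    return -1;
  return res;
}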

Let me think about how, especially for paths, we might be able to take
care of calculating an entire complex condition/expression in the
context of the original call.

So, a longer-term goal was to perhaps have some sort of persistent
ranger across compatible passes: turn it into a pass property, and only
dispose of it when a pass makes enough IL changes to invalidate it.
In the meantime, it should be possible to take a ranger that just
completed a VRP pass and use that as the root ranger for a threading
pass immediately after.  I think there may be some lingering issues with
abnormal edges if we "re-visit" blocks which we claim to have walked, due
to the way I store inferred ranges in those blocks (the expectation
being that we never go back up into the block, so the on-entry cache
works like the "current def" vector in the original EVRP).  I'd have to
think about that too.

>
> Meanwhile I'm leaning towards calling this a phase ordering issue
> of threading + VRP, but that also means we shouldn't deliberately
> try to preserve "threadings" of this kind - in fact we might want
> to explicitely reject them?
We are probably going to want to revisit the pass ordering.
> Richard.
  
Richard Biener Aug. 18, 2022, 7:08 a.m. UTC | #15
On Wed, 17 Aug 2022, Andrew MacLeod wrote:

> 
> On 8/17/22 03:42, Richard Biener wrote:
> > On Tue, 16 Aug 2022, Andrew MacLeod wrote:
> >
> >> On 8/16/22 05:18, Richard Biener wrote:
> >>> On Mon, 15 Aug 2022, Aldy Hernandez wrote:
> >>>
> >>>> On Mon, Aug 15, 2022 at 9:24 PM Andrew MacLeod <amacleod@redhat.com>
> >>>> wrote:
> >>>>> heh. or just
> >>>>>
> >>>>>
> >>>>> +      int_range<2> r;
> >>>>> +      if (!fold_range (r, const_cast <gcond *> (cond_stmt))
> >>>>> +      || !r.singleton_p (&val))
> >>>>>
> >>>>>
> >>>>> if you do not provide a range_query to any of the fold_using_range code,
> >>>>> it defaults to:
> >>>>>
> >>>>> fur_source::fur_source (range_query *q)
> >>>>> {
> >>>>>      if (q)
> >>>>>        m_query = q;
> >>>>>      else if (cfun)
> >>>>>        m_query = get_range_query (cfun);
> >>>>>      else
> >>>>>        m_query = get_global_range_query ();
> >>>>>      m_gori = NULL;
> >>>>> }
> >>>>>
> >>>> Sweet.  Even better!
> >>> So when I do the following incremental change ontop of the posted
> >>> patch then I see that the path query is able to simplify more
> >>> "single BB paths" than the global range folding.
> >>>
> >>> diff --git a/gcc/tree-ssa-threadbackward.cc
> >>> b/gcc/tree-ssa-threadbackward.cc
> >>> index 669098e4ec3..777e778037f 100644
> >>> --- a/gcc/tree-ssa-threadbackward.cc
> >>> +++ b/gcc/tree-ssa-threadbackward.cc
> >>> @@ -314,6 +314,12 @@ back_threader::find_taken_edge_cond (const
> >>> vec<basic_block> &path,
> >>>    {
> >>>      int_range_max r;
> >>>    +  int_range<2> rf;
> >>> +  if (path.length () == 1)
> >>> +    {
> >>> +      fold_range (rf, cond);
> >>> +    }
> >>> +
> >>>      m_solver->compute_ranges (path, m_imports);
> >>>      m_solver->range_of_stmt (r, cond);
> >>>    @@ -325,6 +331,8 @@ back_threader::find_taken_edge_cond (const
> >>> vec<basic_block> &path,
> >>>    
> >>>      if (r == true_range || r == false_range)
> >>>        {
> >>> +      if (path.length () == 1)
> >>> +       gcc_assert  (r == rf);
> >>>          edge e_true, e_false;
> >>>          basic_block bb = gimple_bb (cond);
> >>>          extract_true_false_edges_from_block (bb, &e_true, &e_false);
> >>>
> >>> Even doing the following (not sure what's the difference and in
> >>> particular expense over the path range query) results in missed
> >>> simplifications (checking my set of cc1files).
> >>>
> >>> diff --git a/gcc/tree-ssa-threadbackward.cc
> >>> b/gcc/tree-ssa-threadbackward.cc
> >>> index 669098e4ec3..1d43a179d08 100644
> >>> --- a/gcc/tree-ssa-threadbackward.cc
> >>> +++ b/gcc/tree-ssa-threadbackward.cc
> >>> @@ -99,6 +99,7 @@ private:
> >>>    
> >>>      back_threader_registry m_registry;
> >>>      back_threader_profitability m_profit;
> >>> +  gimple_ranger *m_ranger;
> >>>      path_range_query *m_solver;
> >>>    
> >>>      // Current path being analyzed.
> >>> @@ -146,12 +147,14 @@ back_threader::back_threader (function *fun,
> >>> unsigned flags, bool first)
> >>>      // The path solver needs EDGE_DFS_BACK in resolving mode.
> >>>      if (flags & BT_RESOLVE)
> >>>        mark_dfs_back_edges ();
> >>> -  m_solver = new path_range_query (flags & BT_RESOLVE);
> >>> +  m_ranger = new gimple_ranger;
> >>> +  m_solver = new path_range_query (flags & BT_RESOLVE, m_ranger);
> >>>    }
> >>>    
> >>>    back_threader::~back_threader ()
> >>>    {
> >>>      delete m_solver;
> >>> +  delete m_ranger;
> >>>    
> >>>      loop_optimizer_finalize ();
> >>>    }
> >>> @@ -314,6 +317,12 @@ back_threader::find_taken_edge_cond (const
> >>> vec<basic_block> &path,
> >>>    {
> >>>      int_range_max r;
> >>>    +  int_range<2> rf;
> >>> +  if (path.length () == 1)
> >>> +    {
> >>> +      fold_range (rf, cond, m_ranger);
> >>> +    }
> >>> +
> >>>      m_solver->compute_ranges (path, m_imports);
> >>>      m_solver->range_of_stmt (r, cond);
> >>>    @@ -325,6 +334,8 @@ back_threader::find_taken_edge_cond (const
> >>> vec<basic_block> &path,
> >>>    
> >>>      if (r == true_range || r == false_range)
> >>>        {
> >>> +      if (path.length () == 1)
> >>> +       gcc_assert  (r == rf);
> >>>          edge e_true, e_false;
> >>>          basic_block bb = gimple_bb (cond);
> >>>          extract_true_false_edges_from_block (bb, &e_true, &e_false);
> >>>
> >>> one example is
> >>>
> >>> <bb 176> [local count: 14414059]:
> >>> _128 = node_177(D)->typed.type;
> >>> pretmp_413 = MEM[(const union tree_node *)_128].base.code;
> >>> _431 = pretmp_413 + 65519;
> >>> if (_128 == 0B)
> >>>     goto <bb 199>; [18.09%]
> >>> else
> >>>     goto <bb 177>; [81.91%]
> >>>
> >>> where m_imports for the path is just _128 and the range computed is
> >>> false while the ranger query returns VARYING.  But
> >>> path_range_query::range_defined_in_block does
> >>>
> >>>     if (bb && POINTER_TYPE_P (TREE_TYPE (name)))
> >>>       m_ranger->m_cache.m_exit.maybe_adjust_range (r, name, bb);
> >> This is the coarse grained "side effect applies somewhere in the block"
> >> mechanism.  There is no understanding of where in the block it happens.
> >>> which adjusts the range to ~[0, 0], probably because of the
> >>> dereference in the following stmt.
> >>>
> >>> Why does fold_range not do this when folding the exit test?  Is there
> >>> a way to make it do so?  It looks like the only routine using this
> >>> in gimple-range.cc is range_on_edge and there it's used for e->src
> >>> after calling range_on_exit for e->src (why's it not done in
> >>> range_on_exit?).  A testcase for this is
> >> Fold_range doesnt do this because it is simply another statement.  It makes
> >> no
> >> attempt to understand the context in which you are folding something. you
> >> could be folding that stmt from a different location (ie recomputing)   If
> >> your context is that you are looking for the range after the last statement
> >> has been executed, then one needs to check to see if there are any side
> >> effects.
> > Hmm, but I'm asking it to fold a specific statement - how can that ever
> > be folded "from a different context"?  In fact it traces as
> > " at stmt " with the stmt I pass it.  But yes, the issue is of course
> > that to compute the range of q != 0 it needs to compute the range
> > of 'q' but it simply does
> >
> >    // If no name, simply call the base routine.
> >    if (!name)
> >      {
> >        res = fold_range_internal (r, s, NULL_TREE);
> >        if (res && is_a <gcond *> (s))
> >          {
> >            // Update any exports in the cache if this is a gimple cond
> > statement.
> >            tree exp;
> >            basic_block bb = gimple_bb (s);
> >            FOR_EACH_GORI_EXPORT_NAME (m_cache.m_gori, bb, exp)
> >              m_cache.propagate_updated_value (exp, bb);
> >
> > and fur_depend seems to get the context so it could adjust ranges
> > in its get_operand which seems to just call range_of_expr with
> > the stmt as third arg.  Unfortunately gimple_ranger::range_of_expr
> > is undocumented, but again 'stmt' is dumped as " at stmt " thus
> > seems to be the "context" here?
> Oh, so you are looking at gimple_ranger::range_of_stmt(), not fold_range.
> >        // If name is defined in this block, try to get an range from S.
> >
> > what is 'S'?  def_stmt or stmt?
> >
> >        if (def_stmt && gimple_bb (def_stmt) == bb)
> >          {
> >            // Declared in this block, if it has a global set, check for an
> >            // override from a block walk, otherwise calculate it.
> >            if (m_cache.get_global_range (r, expr))
> >              m_cache.block_range (r, bb, expr, false);
> >            else
> >              range_of_stmt (r, def_stmt, expr);
> >
> > so, add
> >
> >        if (is_ctrl_stmt (stmt))
> >           m_cache.m_exit.maybe_adjust_range (r, expr, bb);
> >
> > at the end (also covering the range_on_entry case).  I suppose
> > range_on_entry does not evaluate all possible paths from definition
> > to BB, using maybe_adjust_range on blocks inbetween and unioning
> > ranges at path merges?  That would likely be prohibitly expensive.
> 
> in fact it does. The cache takes care of filling in the blanks and requesting
> ranges that have not been satisfied yet in a reasonably efficient way. 
> Maybe_adjust_range() will deal only with inferred ranges that are marked as
> having occurred in this block.
> 
> The problem I ran into with trying to doing it in range_on_exit(), is that if
> there is an abnornal edge out of the block, it is unsafe to assume that
> anything is true at block exit..   when processing the abnormal destination,
> we ask for range_on_exit of anything it uses from the src block.
> 
> That said, I suppose is should be safe for range_of_stmt to assume the
> statement successfully executed (or there wouldnt be a defintion produced),
> and therefore if it is the last stmt in the block, we should in theory be able
> to do as you suggest there.
> 
> Im trying to think if its safe to also do it in range_of_expr. if
> is_ctrl_stmt() is true, can I assume that the stmt cannot throw an exception
> under any circumstance?   in which case, its proabbly safe to put that in
> range_of_expr.  Ive attached a patch which does that if you want to try it.

Control stmts cannot throw.

> The caveat is that it is only a partial solution... it will only work for
> names on that stmt.  if you have anything more complex, ie
> 
> if (a == 0 || b == 0)  we have a seqeunce feeding the ctrl stmt..
> 
> c_1 = a == 0
> c_2 = b == 0
> c_3 = c_1 && c_2
> if (c_3 == 0)
> 
> only the evaluation of c_3 will have the ctrl stmt as its context.. the others
> will be evaluted on their own statement, and thus neither a nor b would pick
> up anything from the block as they are evalauted and cached as they are
> seen.    unless of course we are doing a walk :-P

Hmm, but as I traced it, when I do range_of_expr the context stmt I provide
is passed down, and even when processing dependences that context
sticks?  But maybe it doesn't, because that would mean an explosion of the
cache?

But yeah, with the above restriction it would be even more useless.
Same issue as with

  *p = 0;
  if (..)
   / \
 ..   \
      if (p)

here the local adjustment of 'p' in if (p) would not pick up the
p != 0 guarantee from the immediate dominator.
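
As C code, the situation I mean would look something like this (an
invented example, not a testcase; baz() is just a placeholder name):

/* Invented example matching the sketch above.  The store through 'p'
   in the dominating block guarantees p != 0, but a purely block-local
   adjustment when folding 'if (p)' would not pick that up.  */
int baz (int *p, int flag, int x)
{
  *p = 0;
  if (flag)
    return x;
  if (p)
    return x + 1;
  return -1;
}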

> 
> >
> >> ranger uses it for range_on_edge (), because  it knows all the statements
> >> in
> >> the block have been executed, and its safe to apply anything seen in the
> >> block.  It does it right after range_on_exit() is called internally.
> >>
> >> Once upon a time, it was integrated with range-on-exit, but it turned out
> >> there were certain times that it was causing problems. There have been some
> >> cleanups since then, it probably safe now to return that call to
> >> range_on_exit.. but doesnt buy us a whole lot by itself.. except of course
> >> I
> >> have now OKd using range_on-entry/exit generally :-)
> >>
> >> the cache also uses it when walking blocks to pick up inferred values
> >> during
> >> an on-entry cache fill.
> >>
> >>
> >>> int foo (int **p, int i)
> >>> {
> >>>     int *q = *p;
> >>>     int res = *q + i;
> >>>     if (q)
> >>>       return res;
> >>>     return -1;
> >>> }
> >>>
> >>> which we "thread" with the path and with the above ICEs because
> >>> fold_range doesn't get that if (q) is always true.  Without the
> >> Its a known limitation that, unless you are doing a walk, on-demand
> >> requests
> >> will "miss" some inferred ranges, because they are only maintained at the
> >> basic block level.  (ie, we will know that q is non-null in BB2,  but we
> >> don't
> >> know where, so we can make no assumptions at the exit condition about q in
> >> this case. the path_query is invoked in on-demand mode because it wont walk
> >> the entire IL,. so the first time you ask for the range of an ssa-name, it
> >> will quickly zip over the immediate use list and "fill" the on-exit
> >> structure
> >> for any blocks which a non-null reference is seen.  This allows the
> >> threader
> >> to pick up non-null from blocks outside the path that havent been examined.
> >>
> >> VRP does a walk, and while during the walk, adjusts ranges on the fly for
> >> the
> >> current block via the gimple_ranger::register_inferred_ranges () routine.
> >> which is really just a wrapper around ranger_cache::apply_inferred_ranges
> >> ()
> >> (in gimple-range-cache.cc)
> >>
> >> This is called after every statement and is where we take care of
> >> bookkeeping
> >> for adjusting values, and adding them to the blocks list.
> >>
> >> if the path query is walking those statement, it could also "adjust" the
> >> range
> >> of q on the fly... but it has to have walked those statements.  from that
> >> routine, the relevant bits use the gimple-infer class to find any inferred
> >> ranges from the statement, and would look something like:
> >>
> >>    gimple_infer_range infer(s);
> >>    for (unsigned x = 0; x < infer.num (); x++)
> >>      {
> >>        tree name = infer.name (x);
> >>        if (!interesting_p (name))
> >>           continue;
> >>        get_current_path_range (r, name);
> >>        if (r.intersect (infer.range (x)))
> >>          set_current_path_range (name, r);
> >>      }
> >>
> >> That would adjust the value of q to be non-zero after   "int res = *q + i;"
> >>
> >> but you need to have walked the statement before you get to the condition.
> >> as long as they are all in your list of interesting statement to look at,
> >> then
> >> you'd be golden.
> > Ah, so while the above might work it would go against the spirit of this
> > being only fully useful if applied during a walk, adjusting the
> > cached ranges?  It's still a bit unfortunate that one needs to walk all
> > stmts for this - the usecase wants to just walk the CFG and control stmts,
> > and with the path query it would only cover the single BB of each
> > control stmt (thus the above hack would probably work out).
> 
> I wouldn't say its against the spirit, merely against trying to catch all
> cases in a practical way :-).  If we add that, it should work for this case.  
> for the threader, we will have visited all the uses of q to fill the inferred
> block cache q will have a non-null refererence in bb2.  and when calculating
> the final value, if we are indeed assuming anything in the block can be
> applied to the final stmt, then it should be ok. but we'll still mis anything
> but the simpliest cases.  maybe thats OK.   better to catch a few cases than
> none.
> 
> let me think about how, especially for paths, we might be able to take care of
> calculating an entire complex condition/expression in the context of the
> original call.
> 
> So, a longer term goal was to perhaps have some sort of persistent ranger
> across compatible passes.. turn it into a pass property, and only dispose of
> it when a pass makes enough IL changes to invalidate it...

We've tried to do that with the SCEV cache, and it proved to be a hassle
with lots of issues (we're still there, but I think almost all passes
now clear the SCEV cache), so I wouldn't go there.

> In the meantime,
> it should be possible to take a ranger that just completed a VRP pass, and use
> that as the root ranger for a threading pass immediately after.. I think there
> may be some lingering issues with abnormal edges if we "re-visit" blocks which
> we claim to have walked due to the way I store inferred ranges in those
> block.. (the expectation being we never go back up into the block, so the
> on-entry cache works like the "current def" vector in the original EVRP.  I'd
> have to think about that too.

Of course that would essentially do a VRP pass before each threading,
which I think is a bit expensive.  Also, looking at how ranger works
with all its abstraction, having the "old" EVRP-style body walk rather
than abusing the on-demand ranger for such a walk would be a lot more
efficient for this purpose :/

> > Meanwhile I'm leaning towards calling this a phase ordering issue
> > of threading + VRP, but that also means we shouldn't deliberately
> > try to preserve "threadings" of this kind - in fact we might want
> > to explicitely reject them?
> we are probably going to want to visit some pass ordering.

Sure, though there's never going to be a good enough pass ordering :/

Richard.
  
Andrew MacLeod Aug. 18, 2022, 1:18 p.m. UTC | #16
On 8/18/22 03:08, Richard Biener wrote:
>
>> The caveat is that it is only a partial solution... it will only work for
>> names on that stmt.  if you have anything more complex, ie
>>
>> if (a == 0 || b == 0)  we have a seqeunce feeding the ctrl stmt..
>>
>> c_1 = a == 0
>> c_2 = b == 0
>> c_3 = c_1 && c_2
>> if (c_3 == 0)
>>
>> only the evaluation of c_3 will have the ctrl stmt as its context.. the others
>> will be evaluted on their own statement, and thus neither a nor b would pick
>> up anything from the block as they are evalauted and cached as they are
>> seen.    unless of course we are doing a walk :-P
> Hmm, but as I traced it when I do range_of_expr the context stmt I provide
> will be passed down and even when processing dependences that context
> will stick?  But maybe it doesn't because it would mean explosion of the
> cache?
>
> But yeah, with the above restriction it would be even more useless.
> Same issue as with
>
>    *p = 0;
>    if (..)
>     / \
>   ..   \
>        if (p)
>
> here the local adjustment of 'p' in if (p) would not pick up the
> p != 0 guarantee from the immediate dominator.

It certainly should.  The earlier BB will have the ~[0, 0] property in
the on-exit structure, so when the range of 'p' is evaluated on the edge
to the next block, it will be adjusted.  The value for on-entry of 'p' to
that block will therefore be ~[0, 0].  Ranger does this, and the path
query code is *supposed* to; late discussions with Aldy yesterday left
me unclear whether it always does.  It should; that was the entire point
of leaving the on-demand filling of the structure via immediate uses.
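
For the path query side, the sort of case where that propagation
matters would be something like this (a made-up example, not a
testcase; qux() is just a placeholder name):

/* Made-up example.  'if (p)' cannot be folded globally, but along the
   path through the arm that dereferences p, the ~[0, 0] recorded in
   that block's on-exit structure should flow across the edge and
   become the on-entry range of p at the branch, giving the threader a
   path to register.  */
int qux (int *p, int flag)
{
  int t = 0;
  if (flag)
    t = *p;
  if (p)
    return t;
  return -1;
}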



>>   In the meantime,
>> it should be possible to take a ranger that just completed a VRP pass, and use
>> that as the root ranger for a threading pass immediately after.. I think there
>> may be some lingering issues with abnormal edges if we "re-visit" blocks which
>> we claim to have walked due to the way I store inferred ranges in those
>> block.. (the expectation being we never go back up into the block, so the
>> on-entry cache works like the "current def" vector in the original EVRP.  I'd
>> have to think about that too.
> Of course that would essentially do a VRP pass before each threading
> which I think is a bit expensive.  Also looking at how ranger works
> with all its abstraction having the "old" EVRP style body walk rather
> than abusing the on-demand ranger for such a walk would be a lot more
> efficient for this purpose :/

No, I just meant that when we do the VRP pass, rather than throwing away
the fully primed ranger and its values, one could invoke the threader
using it...  But I'm not sure how much extra we'd get anyway.


>>> Meanwhile I'm leaning towards calling this a phase ordering issue
>>> of threading + VRP, but that also means we shouldn't deliberately
>>> try to preserve "threadings" of this kind - in fact we might want
>>> to explicitely reject them?
>> we are probably going to want to visit some pass ordering.
> Sure, though there's never going to be a good enough pass ordering :/
>
> Richard.
  

Patch

diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc
index 78146f5683e..a7d277c31b8 100644
--- a/gcc/gimple-range-path.cc
+++ b/gcc/gimple-range-path.cc
@@ -220,7 +220,7 @@  path_range_query::unreachable_path_p ()
 void
 path_range_query::set_path (const vec<basic_block> &path)
 {
-  gcc_checking_assert (path.length () > 1);
+  gcc_checking_assert (!path.is_empty ());
   m_path = path.copy ();
   m_pos = m_path.length () - 1;
   bitmap_clear (m_has_cache_entry);
diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
index b886027fccf..669098e4ec3 100644
--- a/gcc/tree-ssa-threadbackward.cc
+++ b/gcc/tree-ssa-threadbackward.cc
@@ -241,8 +241,9 @@  back_threader::maybe_register_path ()
       else
 	{
 	  bool irreducible = false;
-	  if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
-					  &irreducible)
+	  if ((m_path.length () == 1
+	       || m_profit.profitable_path_p (m_path, m_name, taken_edge,
+					      &irreducible))
 	      && debug_counter ()
 	      && m_registry.register_path (m_path, taken_edge))
 	    {
@@ -267,7 +268,6 @@  back_threader::maybe_register_path ()
 edge
 back_threader::find_taken_edge (const vec<basic_block> &path)
 {
-  gcc_checking_assert (path.length () > 1);
   switch (gimple_code (m_last_stmt))
     {
     case GIMPLE_COND:
@@ -350,9 +350,15 @@  back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
   m_path.safe_push (bb);
 
   // Try to resolve the path without looking back.
-  if (m_path.length () > 1
-      && (!m_profit.profitable_path_p (m_path, m_name, NULL)
-	  || maybe_register_path ()))
+  if ((m_path.length () > 1
+       && !m_profit.profitable_path_p (m_path, m_name, NULL))
+      || maybe_register_path ())
+    ;
+
+  // The backwards thread copier cannot copy blocks that do not belong
+  // to the same loop, so when the new source of the path entry no
+  // longer belongs to it we don't need to search further.
+  else if (m_path[0]->loop_father != bb->loop_father)
     ;
 
   // Continue looking for ways to extend the path but limit the
@@ -445,7 +451,8 @@  back_threader::find_paths_to_names (basic_block bb, bitmap interesting,
 	  edge e;
 	  FOR_EACH_EDGE (e, iter, bb->preds)
 	    {
-	      if (e->flags & EDGE_ABNORMAL
+	      if ((e->flags & EDGE_ABNORMAL)
+		  || e->src->index == ENTRY_BLOCK
 		  // This is like path_crosses_loops in profitable_path_p but
 		  // more restrictive to avoid peeling off loop iterations (see
 		  // tree-ssa/pr14341.c for an example).
diff --git a/gcc/tree-ssa-threadupdate.cc b/gcc/tree-ssa-threadupdate.cc
index 59c268a3567..d40fa7c4cff 100644
--- a/gcc/tree-ssa-threadupdate.cc
+++ b/gcc/tree-ssa-threadupdate.cc
@@ -2613,6 +2613,60 @@  back_jt_path_registry::update_cfg (bool /*peel_loop_headers*/)
   bool retval = false;
   hash_set<edge> visited_starting_edges;
 
+  /* Mark never taken edges from paths that are just jump simplifications.  */
+  auto_edge_flag never_taken (cfun);
+  for (auto path : m_paths)
+    if (path->length () == 1)
+      {
+	edge_iterator ei;
+	edge e;
+	FOR_EACH_EDGE (e, ei, (*path)[0]->e->src->succs)
+	  if (e != (*path)[0]->e)
+	    e->flags |= never_taken;
+      }
+
+  /* And prune paths that contain such edge before we remove them.  */
+  for (unsigned i = 0; i < m_paths.length ();)
+    {
+      bool remove = false;
+      for (auto je : *m_paths[i])
+	{
+	  if (je->e->flags & never_taken)
+	    {
+	      cancel_thread (m_paths[i],
+			     "Avoiding threading through unreachable edge");
+	      remove = true;
+	      break;
+	    }
+	}
+      if (!remove)
+	++i;
+      else
+	m_paths.unordered_remove (i);
+    }
+
+  /* Finally perform those threads first, this way we avoid copying the
+     dead outgoing edges when other threads contain the prevailing edge.  */
+  for (unsigned i = 0; i < m_paths.length ();)
+    {
+      vec<jump_thread_edge *> *path = m_paths[i];
+      if (path->length () != 1)
+	{
+	  ++i;
+	  continue;
+	}
+      edge exit = (*path)[0]->e;
+      remove_ctrl_stmt_and_useless_edges (exit->src, exit->dest);
+      exit->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE | EDGE_ABNORMAL);
+      exit->flags |= EDGE_FALLTHRU;
+      /* We do not update dominance info.  */
+      free_dominance_info (CDI_DOMINATORS);
+      retval = true;
+      m_num_threaded_edges++;
+      path->release ();
+      m_paths.unordered_remove (i);
+    }
+
   while (m_paths.length ())
     {
       vec<jump_thread_edge *> *path = m_paths[0];