tree-optimization/116166 - forward jump-threading going wild

Message ID 20240806131244.77CAA13981@imap1.dmz-prg2.suse.org
State Committed
Commit 2cf89ae83225f932b226cd57ef2d083a59bcf8a3
Series tree-optimization/116166 - forward jump-threading going wild

Checks

Context Check Description
rivoscibot/toolchain-ci-rivos-lint success Lint passed
rivoscibot/toolchain-ci-rivos-apply-patch success Patch applied
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gc-lp64d-non-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc_zba_zbb_zbc_zbs-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gcv-lp64d-multilib success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gcv-lp64d-multilib success Build passed
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc-lp64d-non-multilib success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
rivoscibot/toolchain-ci-rivos-test success Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed

Commit Message

Richard Biener Aug. 6, 2024, 1:12 p.m. UTC
  Currently the forward threader isn't limited as to the search space
it explores, and with it now using path-ranger for simplifying the
conditions it runs into, it became pretty slow for degenerate cases
like compiling insn-emit.cc for RISC-V, especially when compiling for
a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.

The following makes the forward threader honor the search space
limit I introduced for the backward threader.  This reduces
compile-time from minutes to seconds for the testcase in PR116166.
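
(For those wanting to experiment: the limit is user-tunable.  Assuming
the usual --param spelling that corresponds to
param_max_jump_thread_paths, something like

  gcc -O2 --param max-jump-thread-paths=64 insn-emit.cc

raises or lowers the cutoff.)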

Note this wasn't necessary before we had ranger, but with ranger
the work we do is quadratic in the length of the threading path
we build up (the same is true for the backwards threader).

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

OK if that succeeds?

Thanks,
Richard.

	PR tree-optimization/116166
	* tree-ssa-threadedge.h (jump_threader::thread_around_empty_blocks):
	Add limit parameter.
	(jump_threader::thread_through_normal_block): Likewise.
	* tree-ssa-threadedge.cc (jump_threader::thread_around_empty_blocks):
	Honor and decrement limit parameter.
	(jump_threader::thread_through_normal_block): Likewise.
	(jump_threader::thread_across_edge): Initialize limit from
	param_max_jump_thread_paths and pass it down to workers.
---
 gcc/tree-ssa-threadedge.cc | 30 ++++++++++++++++++++++--------
 gcc/tree-ssa-threadedge.h  |  4 ++--
 2 files changed, 24 insertions(+), 10 deletions(-)
  

Comments

Andrew MacLeod Aug. 6, 2024, 5:05 p.m. UTC | #1
On 8/6/24 09:12, Richard Biener wrote:
> Currently the forward threader isn't limited as to the search space
> it explores, and with it now using path-ranger for simplifying the
> conditions it runs into, it became pretty slow for degenerate cases
> like compiling insn-emit.cc for RISC-V, especially when compiling for
> a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
>
> The following makes the forward threader honor the search space
> limit I introduced for the backward threader.  This reduces
> compile-time from minutes to seconds for the testcase in PR116166.
>
> Note this wasn't necessary before we had ranger, but with ranger
> the work we do is quadratic in the length of the threading path
> we build up (the same is true for the backwards threader).

There's probably some work that can be done in the path processing space
using the new gori_on_edge interface I introduced for the fast VRP pass.

// These APIs are used to query GORI if there are ranges generated on an 
edge.
// GORI_ON_EDGE is used to get all the ranges at once (returned in an
// ssa_cache structure).
// Fill ssa-cache R with any outgoing ranges on edge E, using QUERY.
bool gori_on_edge (class ssa_cache &r, edge e, range_query *query = NULL);

With this, the threader and path calculator can get and collect all the
outgoing ranges for a block in linear time and just keep them, and
decide what they want to use.  I suspect for really large CFGs we'd
want to substitute an alternative ssa_cache implementation, something
like the sbr_sparse_bitmap class ranger's cache uses, which compresses
the vector so it isn't indexed by all the ssa-names, and at the same
time limits it to a max of 14 outgoing ranges.

No one has had any time to investigate that yet.
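
To make that concrete, a rough, untested sketch of a linear per-block
collection step -- gori_on_edge, ssa_cache, hash_map and FOR_EACH_EDGE
are existing GCC facilities, while collect_outgoing_ranges and the
cache map are made-up names for illustration:

/* Sketch only: fill CACHE with one ssa_cache per successor edge of BB,
   each holding all ranges generated on that edge.  */
static void
collect_outgoing_ranges (basic_block bb, range_query *query,
                         hash_map<edge, ssa_cache *> &cache)
{
  edge e;
  edge_iterator ei;
  FOR_EACH_EDGE (e, ei, bb->succs)
    {
      ssa_cache *r = new ssa_cache;
      /* Fill R with any outgoing ranges on edge E.  */
      if (gori_on_edge (*r, e, query))
        cache.put (e, r);
      else
        delete r;  /* E generates no ranges.  */
    }
}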

>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> OK if that succeeds?
OK with me.

Andrew
  
Richard Biener Aug. 7, 2024, 7:13 a.m. UTC | #2
On Tue, 6 Aug 2024, Andrew MacLeod wrote:

> 
> On 8/6/24 09:12, Richard Biener wrote:
> > Currently the forward threader isn't limited as to the search space
> > it explores, and with it now using path-ranger for simplifying the
> > conditions it runs into, it became pretty slow for degenerate cases
> > like compiling insn-emit.cc for RISC-V, especially when compiling for
> > a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
> >
> > The following makes the forward threader honor the search space
> > limit I introduced for the backward threader.  This reduces
> > compile-time from minutes to seconds for the testcase in PR116166.
> >
> > Note this wasn't necessary before we had ranger, but with ranger
> > the work we do is quadratic in the length of the threading path
> > we build up (the same is true for the backwards threader).
> 
> There's probably some work that can be done in the path processing space
> using the new gori_on_edge interface I introduced for the fast VRP pass.
>
> // These APIs are used to query GORI if there are ranges generated on an edge.
> // GORI_ON_EDGE is used to get all the ranges at once (returned in an
> // ssa_cache structure).
> // Fill ssa-cache R with any outgoing ranges on edge E, using QUERY.
> bool gori_on_edge (class ssa_cache &r, edge e, range_query *query = NULL);
>
> With this, the threader and path calculator can get and collect all the
> outgoing ranges for a block in linear time and just keep them, and decide
> what they want to use.  I suspect for really large CFGs we'd want to
> substitute an alternative ssa_cache implementation, something like the
> sbr_sparse_bitmap class ranger's cache uses, which compresses the vector
> so it isn't indexed by all the ssa-names, and at the same time limits it
> to a max of 14 outgoing ranges.
>
> No one has had any time to investigate that yet.

The thing that path-ranger handles in addition to what range_on_edge
gives you is CFG merges where knowing the path into them improves
ranges downstream.  Basically path-ranger improves the range of PHIs
based on the known path taken.  Whenever the decision changes as to
which edge(s) of a PHI are taken, we'd have to wipe cached information.
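
A tiny example (not from the PR) of the kind of merge meant here:

  int f (int a)
  {
    int x;
    if (a > 0)
      x = 1;
    else
      x = 2;
    /* x_3 = PHI <1(then), 2(else)> */
    if (x == 1)
      return 10;
    return 20;
  }

On the path entering the merge through the 'then' arm the PHI result
is known to be 1 and the second condition folds; plain range_on_edge
cannot see that, because it doesn't know which path was taken.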

In PR114855 the gori compute itself shows scalability issues
(I've identified the dominator walk so far, but it might not be
the full story).

> >
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> >
> > OK if that succeeds?
> OK with me.

Thanks, pushed and queued for backport.

Richard.
  
Andrew MacLeod Aug. 9, 2024, 7:25 p.m. UTC | #3
On 8/7/24 03:13, Richard Biener wrote:
> On Tue, 6 Aug 2024, Andrew MacLeod wrote:
>
>> On 8/6/24 09:12, Richard Biener wrote:
>>> Currently the forward threader isn't limited as to the search space
>>> it explores, and with it now using path-ranger for simplifying the
>>> conditions it runs into, it became pretty slow for degenerate cases
>>> like compiling insn-emit.cc for RISC-V, especially when compiling for
>>> a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
>>>
>>> The following makes the forward threader honor the search space
>>> limit I introduced for the backward threader.  This reduces
>>> compile-time from minutes to seconds for the testcase in PR116166.
>>>
>>> Note this wasn't necessary before we had ranger, but with ranger
>>> the work we do is quadratic in the length of the threading path
>>> we build up (the same is true for the backwards threader).
>> There's probably some work that can be done in the path processing space
>> using the new gori_on_edge interface I introduced for the fast VRP pass.
>>
>> // These APIs are used to query GORI if there are ranges generated on an edge.
>> // GORI_ON_EDGE is used to get all the ranges at once (returned in an
>> // ssa_cache structure).
>> // Fill ssa-cache R with any outgoing ranges on edge E, using QUERY.
>> bool gori_on_edge (class ssa_cache &r, edge e, range_query *query = NULL);
>>
>> With this, the threader and path calculator can get and collect all the
>> outgoing ranges for a block in linear time and just keep them, and decide
>> what they want to use.  I suspect for really large CFGs we'd want to
>> substitute an alternative ssa_cache implementation, something like the
>> sbr_sparse_bitmap class ranger's cache uses, which compresses the vector
>> so it isn't indexed by all the ssa-names, and at the same time limits it
>> to a max of 14 outgoing ranges.
>>
>> No one has had any time to investigate that yet.
> The thing that path-ranger handles in addition to what range_on_edge
> gives you is CFG merges where knowing the path into them improves
> ranges downstream.  Basically path-ranger improves the range of PHIs
> based on the known path taken.  Whenever the decision changes as to
> which edge(s) of a PHI are taken, we'd have to wipe cached information.
>
> In PR114855 the gori compute itself shows scalability issues
> (I've identified the dominator walk so far, but it might not be
> the full story).
>
>
The real story is that we do not cache edge calculations, thus the
scalability issues with path-ranger.  Ranger caches on-entry values
which integrate these edge values, and thus avoids recalculating edges,
so there was never a need to cache edges.

After the 2 patches I just committed, the remainder of the time at -O1
in PR114855 is spent by the path ranger recalculating ranges on edges
over and over.  Something like the previously mentioned gori_on_edge ()
approach might allow the path ranger to calculate these static edge
values only once per edge and cache them somewhere.  There would be one
ssa_cache structure per edge you cared about, and it would instantly
tell you the static range of any ssa-name on the edge... if
has_edge_range_p() is true.
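
A hypothetical lookup on top of such a cache could then be (rough
sketch only; ssa_cache::get_range and gori_compute::has_edge_range_p
exist, while the hash_map of per-edge caches is the illustrative one
from my earlier mail):

static bool
cached_range_on_edge (vrange &r, edge e, tree name, gori_compute &gori,
                      hash_map<edge, ssa_cache *> &cache)
{
  /* Cheap gate: does E generate a range for NAME at all?  */
  if (!gori.has_edge_range_p (name, e))
    return false;
  if (ssa_cache **slot = cache.get (e))
    return (*slot)->get_range (r, name);
  return false;  /* Not collected yet; caller recomputes.  */
}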

Thinking about it the last couple of days, I think we can do better.  I
think I can integrate that into GORI proper so that edges can optionally 
be cached as well, and this should alleviate at least some of the 
compile time pressure caused by the lack of caching.  It'd be a bit of 
an experiment, but it dovetails nicely with some range-logging work I'm 
investigating.

I'm on vacation next week, so it'll be at least a week before I can
look at it closer.  I think Aldy is back then too, so maybe we can figure
something out for the remaining compile time issues in DOM.

Andrew
  
Jeff Law Aug. 9, 2024, 8:04 p.m. UTC | #4
On 8/6/24 7:12 AM, Richard Biener wrote:
> Currently the forward threader isn't limited as to the search space
> it explores, and with it now using path-ranger for simplifying the
> conditions it runs into, it became pretty slow for degenerate cases
> like compiling insn-emit.cc for RISC-V, especially when compiling for
> a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
>
> The following makes the forward threader honor the search space
> limit I introduced for the backward threader.  This reduces
> compile-time from minutes to seconds for the testcase in PR116166.
>
> Note this wasn't necessary before we had ranger, but with ranger
> the work we do is quadratic in the length of the threading path
> we build up (the same is true for the backwards threader).
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> 
> OK if that succeeds?
> 
> Thanks,
> Richard.
> 
> 	PR tree-optimization/116166
> 	* tree-ssa-threadedge.h (jump_threader::thread_around_empty_blocks):
> 	Add limit parameter.
> 	(jump_threader::thread_through_normal_block): Likewise.
> 	* tree-ssa-threadedge.cc (jump_threader::thread_around_empty_blocks):
> 	Honor and decrement limit parameter.
> 	(jump_threader::thread_through_normal_block): Likewise.
> 	(jump_threader::thread_across_edge): Initialize limit from
> 	param_max_jump_thread_paths and pass it down to workers.
Definitely OK if it succeeds.  I wouldn't expect any testsuite fallout 
from the various dump/assembly scan tests.

jeff
  
Aldy Hernandez Aug. 26, 2024, 5:54 a.m. UTC | #5
[I'm slowly coming up to speed here after my absence, so please bear with me...]

I suspect there are a few things going on here, both in the forward and
the backwards threader.  For the forward threader, you mention some
very good points in the PR.  First, there's unnecessary recursion in
simplify_control_stmt_condition_1 that ranger should be able to handle
on its own.  Secondly, since we're doing a DOM walk, we should be able
to re-use most of the path_ranger's cache instead of having to reset
all of it on every path, especially when we're just adding empty
blocks.  I can investigate both of these things.

The end game here is to get rid of the forward threader, so we should
really find out why the backwards threader is choking so badly.  I
suspect whatever the cause is, it will affect both threaders.  I thought
you had added some limits in the search space last cycle?  Are they
not being triggered?

For the record, the reason we can't get rid of the forward threader
yet (apart from having to fix whatever is going on in PR114855 at -O2
:)), is that we still rely on the pointer equivalency tracking with
the DOM equiv lookup tables.  Prange does not yet handle pointer
equivs.  Also, we'd need to audit to make sure frange handles whatever
floating point operations were being simplified in the DOM equiv
lookup as well.  I suspect not much, but we still need to make sure.

Minor nit, wouldn't it be cleaner for "limit" to be a class local
variable instead of passing it around as a function parameter?

Thanks for all your work here.
Aldy

On Tue, Aug 6, 2024 at 3:12 PM Richard Biener <rguenther@suse.de> wrote:
>
> Currently the forward threader isn't limited as to the search space
> it explores, and with it now using path-ranger for simplifying the
> conditions it runs into, it became pretty slow for degenerate cases
> like compiling insn-emit.cc for RISC-V, especially when compiling for
> a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
>
> The following makes the forward threader honor the search space
> limit I introduced for the backward threader.  This reduces
> compile-time from minutes to seconds for the testcase in PR116166.
>
> Note this wasn't necessary before we had ranger, but with ranger
> the work we do is quadratic in the length of the threading path
> we build up (the same is true for the backwards threader).
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> OK if that succeeds?
>
> Thanks,
> Richard.
>
>         PR tree-optimization/116166
>         * tree-ssa-threadedge.h (jump_threader::thread_around_empty_blocks):
>         Add limit parameter.
>         (jump_threader::thread_through_normal_block): Likewise.
>         * tree-ssa-threadedge.cc (jump_threader::thread_around_empty_blocks):
>         Honor and decrement limit parameter.
>         (jump_threader::thread_through_normal_block): Likewise.
>         (jump_threader::thread_across_edge): Initialize limit from
>         param_max_jump_thread_paths and pass it down to workers.
> ---
>  gcc/tree-ssa-threadedge.cc | 30 ++++++++++++++++++++++--------
>  gcc/tree-ssa-threadedge.h  |  4 ++--
>  2 files changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/tree-ssa-threadedge.cc b/gcc/tree-ssa-threadedge.cc
> index 7f82639b8ec..0aa2aa85143 100644
> --- a/gcc/tree-ssa-threadedge.cc
> +++ b/gcc/tree-ssa-threadedge.cc
> @@ -786,13 +786,17 @@ propagate_threaded_block_debug_into (basic_block dest, basic_block src)
>  bool
>  jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
>                                            edge taken_edge,
> -                                          bitmap visited)
> +                                          bitmap visited, unsigned &limit)
>  {
>    basic_block bb = taken_edge->dest;
>    gimple_stmt_iterator gsi;
>    gimple *stmt;
>    tree cond;
>
> +  if (limit == 0)
> +    return false;
> +  --limit;
> +
>    /* The key property of these blocks is that they need not be duplicated
>       when threading.  Thus they cannot have visible side effects such
>       as PHI nodes.  */
> @@ -830,7 +834,8 @@ jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
>               m_registry->push_edge (path, taken_edge, EDGE_NO_COPY_SRC_BLOCK);
>               m_state->append_path (taken_edge->dest);
>               bitmap_set_bit (visited, taken_edge->dest->index);
> -             return thread_around_empty_blocks (path, taken_edge, visited);
> +             return thread_around_empty_blocks (path, taken_edge, visited,
> +                                                limit);
>             }
>         }
>
> @@ -872,7 +877,7 @@ jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
>        m_registry->push_edge (path, taken_edge, EDGE_NO_COPY_SRC_BLOCK);
>        m_state->append_path (taken_edge->dest);
>
> -      thread_around_empty_blocks (path, taken_edge, visited);
> +      thread_around_empty_blocks (path, taken_edge, visited, limit);
>        return true;
>      }
>
> @@ -899,8 +904,13 @@ jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
>
>  int
>  jump_threader::thread_through_normal_block (vec<jump_thread_edge *> *path,
> -                                           edge e, bitmap visited)
> +                                           edge e, bitmap visited,
> +                                           unsigned &limit)
>  {
> +  if (limit == 0)
> +    return 0;
> +  limit--;
> +
>    m_state->register_equivs_edge (e);
>
>    /* PHIs create temporary equivalences.
> @@ -989,7 +999,7 @@ jump_threader::thread_through_normal_block (vec<jump_thread_edge *> *path,
>              visited.  This may be overly conservative.  */
>           bitmap_set_bit (visited, dest->index);
>           bitmap_set_bit (visited, e->dest->index);
> -         thread_around_empty_blocks (path, taken_edge, visited);
> +         thread_around_empty_blocks (path, taken_edge, visited, limit);
>           return 1;
>         }
>      }
> @@ -1075,9 +1085,12 @@ jump_threader::thread_across_edge (edge e)
>    bitmap_set_bit (visited, e->src->index);
>    bitmap_set_bit (visited, e->dest->index);
>
> +  /* Limit search space.  */
> +  unsigned limit = param_max_jump_thread_paths;
> +
>    int threaded = 0;
>    if ((e->flags & EDGE_DFS_BACK) == 0)
> -    threaded = thread_through_normal_block (path, e, visited);
> +    threaded = thread_through_normal_block (path, e, visited, limit);
>
>    if (threaded > 0)
>      {
> @@ -1148,11 +1161,12 @@ jump_threader::thread_across_edge (edge e)
>         m_registry->push_edge (path, e, EDGE_START_JUMP_THREAD);
>         m_registry->push_edge (path, taken_edge, EDGE_COPY_SRC_JOINER_BLOCK);
>
> -       found = thread_around_empty_blocks (path, taken_edge, visited);
> +       found = thread_around_empty_blocks (path, taken_edge, visited, limit);
>
>         if (!found)
>           found = thread_through_normal_block (path,
> -                                              path->last ()->e, visited) > 0;
> +                                              path->last ()->e, visited,
> +                                              limit) > 0;
>
>         /* If we were able to thread through a successor of E->dest, then
>            record the jump threading opportunity.  */
> diff --git a/gcc/tree-ssa-threadedge.h b/gcc/tree-ssa-threadedge.h
> index 9f6cbfe9330..245b3506a55 100644
> --- a/gcc/tree-ssa-threadedge.h
> +++ b/gcc/tree-ssa-threadedge.h
> @@ -101,9 +101,9 @@ private:
>                                           unsigned limit);
>
>    bool thread_around_empty_blocks (vec<class jump_thread_edge *> *path,
> -                                  edge, bitmap visited);
> +                                  edge, bitmap visited, unsigned &limit);
>    int thread_through_normal_block (vec<jump_thread_edge *> *path,
> -                                  edge, bitmap visited);
> +                                  edge, bitmap visited, unsigned &limit);
>    void thread_across_edge (edge);
>    bool record_temporary_equivalences_from_phis (edge);
>    gimple *record_temporary_equivalences_from_stmts_at_dest (edge);
> --
> 2.43.0
>
  
Richard Biener Aug. 26, 2024, 8:36 a.m. UTC | #6
On Mon, 26 Aug 2024, Aldy Hernandez wrote:

> [I'm slowly coming up to speed here after my absence, so please bear with me...]
> 
> I suspect there are a few things going on here, both in the forward and
> the backwards threader.  For the forward threader, you mention some
> very good points in the PR.  First, there's unnecessary recursion in
> simplify_control_stmt_condition_1 that ranger should be able to handle
> on its own.  Secondly, since we're doing a DOM walk, we should be able
> to re-use most of the path_ranger's cache instead of having to reset
> all of it on every path, especially when we're just adding empty
> blocks.  I can investigate both of these things.
> 
> The end game here is to get rid of the forward threader, so we should
> really find out why the backwards threader is choking so badly.  I
> suspect whatever the cause is, it will affect both threaders.  I thought
> you had added some limits in the search space last cycle?  Are they
> not being triggered?

They trigger, but they assume that the work ranger does computing
ranges on a path is O(path-size); it seems it's rather
O(function-size), as ranger happily ventures outside of the path and
for those parts (obviously(?)) ends up not using a common cache (aka
only global ranges?).  I've identified the dominator walk in the PR
(or one of the related PRs) which walks the whole function
dominator tree rather than limiting itself.

> For the record, the reason we can't get rid of the forward threader
> yet (apart from having to fix whatever is going on in PR114855 at -O2
> :)), is that we still rely on the pointer equivalency tracking with
> the DOM equiv lookup tables.  Prange does not yet handle pointer
> equivs.  Also, we'd need to audit to make sure frange handles whatever
> floating point operations were being simplified in the DOM equiv
> lookup as well.  I suspect not much, but we still need to make sure.

I see.  I'll note that with forward threading, re-using the cache
when just adding blocks at the tail of the path is obvious (it
might miss optimizations though), while for backwards threading,
backtracking on the head of the path requires pruning the whole
cache (we could eventually keep a "stack" of caches...).
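
Rough sketch of the "stack" idea, purely illustrative -- auto_vec and
ssa_cache exist, the helpers are made up:

static auto_vec<ssa_cache *> cache_stack;

/* Growing the path by one block pushes a fresh cache level.  */
static void
push_path_level ()
{
  cache_stack.safe_push (new ssa_cache);
}

/* Backtracking pops only the stale levels instead of flushing
   the entire cache.  */
static void
pop_path_level ()
{
  delete cache_stack.pop ();
}

A backtrack of one block then costs one pop rather than a full
invalidation.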

> Minor nit, wouldn't it be cleaner for "limit" to be a class local
> variable instead of passing it around as a function parameter?

Maybe, it's an old habit of doing things.

As said, the main problem is ranger itself doing work not within the
bounds that the backwards threader expects.

Richard.

> Thanks for all your work here.
> Aldy
> 
> On Tue, Aug 6, 2024 at 3:12 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > Currently the forward threader isn't limited as to the search space
> > it explores, and with it now using path-ranger for simplifying the
> > conditions it runs into, it became pretty slow for degenerate cases
> > like compiling insn-emit.cc for RISC-V, especially when compiling for
> > a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
> >
> > The following makes the forward threader honor the search space
> > limit I introduced for the backward threader.  This reduces
> > compile-time from minutes to seconds for the testcase in PR116166.
> >
> > Note this wasn't necessary before we had ranger, but with ranger
> > the work we do is quadratic in the length of the threading path
> > we build up (the same is true for the backwards threader).
> >
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> >
> > OK if that succeeds?
> >
> > Thanks,
> > Richard.
> >
> >         PR tree-optimization/116166
> >         * tree-ssa-threadedge.h (jump_threader::thread_around_empty_blocks):
> >         Add limit parameter.
> >         (jump_threader::thread_through_normal_block): Likewise.
> >         * tree-ssa-threadedge.cc (jump_threader::thread_around_empty_blocks):
> >         Honor and decrement limit parameter.
> >         (jump_threader::thread_through_normal_block): Likewise.
> >         (jump_threader::thread_across_edge): Initialize limit from
> >         param_max_jump_thread_paths and pass it down to workers.
> > ---
> >  gcc/tree-ssa-threadedge.cc | 30 ++++++++++++++++++++++--------
> >  gcc/tree-ssa-threadedge.h  |  4 ++--
> >  2 files changed, 24 insertions(+), 10 deletions(-)
> >
> > diff --git a/gcc/tree-ssa-threadedge.cc b/gcc/tree-ssa-threadedge.cc
> > index 7f82639b8ec..0aa2aa85143 100644
> > --- a/gcc/tree-ssa-threadedge.cc
> > +++ b/gcc/tree-ssa-threadedge.cc
> > @@ -786,13 +786,17 @@ propagate_threaded_block_debug_into (basic_block dest, basic_block src)
> >  bool
> >  jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
> >                                            edge taken_edge,
> > -                                          bitmap visited)
> > +                                          bitmap visited, unsigned &limit)
> >  {
> >    basic_block bb = taken_edge->dest;
> >    gimple_stmt_iterator gsi;
> >    gimple *stmt;
> >    tree cond;
> >
> > +  if (limit == 0)
> > +    return false;
> > +  --limit;
> > +
> >    /* The key property of these blocks is that they need not be duplicated
> >       when threading.  Thus they cannot have visible side effects such
> >       as PHI nodes.  */
> > @@ -830,7 +834,8 @@ jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
> >               m_registry->push_edge (path, taken_edge, EDGE_NO_COPY_SRC_BLOCK);
> >               m_state->append_path (taken_edge->dest);
> >               bitmap_set_bit (visited, taken_edge->dest->index);
> > -             return thread_around_empty_blocks (path, taken_edge, visited);
> > +             return thread_around_empty_blocks (path, taken_edge, visited,
> > +                                                limit);
> >             }
> >         }
> >
> > @@ -872,7 +877,7 @@ jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
> >        m_registry->push_edge (path, taken_edge, EDGE_NO_COPY_SRC_BLOCK);
> >        m_state->append_path (taken_edge->dest);
> >
> > -      thread_around_empty_blocks (path, taken_edge, visited);
> > +      thread_around_empty_blocks (path, taken_edge, visited, limit);
> >        return true;
> >      }
> >
> > @@ -899,8 +904,13 @@ jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
> >
> >  int
> >  jump_threader::thread_through_normal_block (vec<jump_thread_edge *> *path,
> > -                                           edge e, bitmap visited)
> > +                                           edge e, bitmap visited,
> > +                                           unsigned &limit)
> >  {
> > +  if (limit == 0)
> > +    return 0;
> > +  limit--;
> > +
> >    m_state->register_equivs_edge (e);
> >
> >    /* PHIs create temporary equivalences.
> > @@ -989,7 +999,7 @@ jump_threader::thread_through_normal_block (vec<jump_thread_edge *> *path,
> >              visited.  This may be overly conservative.  */
> >           bitmap_set_bit (visited, dest->index);
> >           bitmap_set_bit (visited, e->dest->index);
> > -         thread_around_empty_blocks (path, taken_edge, visited);
> > +         thread_around_empty_blocks (path, taken_edge, visited, limit);
> >           return 1;
> >         }
> >      }
> > @@ -1075,9 +1085,12 @@ jump_threader::thread_across_edge (edge e)
> >    bitmap_set_bit (visited, e->src->index);
> >    bitmap_set_bit (visited, e->dest->index);
> >
> > +  /* Limit search space.  */
> > +  unsigned limit = param_max_jump_thread_paths;
> > +
> >    int threaded = 0;
> >    if ((e->flags & EDGE_DFS_BACK) == 0)
> > -    threaded = thread_through_normal_block (path, e, visited);
> > +    threaded = thread_through_normal_block (path, e, visited, limit);
> >
> >    if (threaded > 0)
> >      {
> > @@ -1148,11 +1161,12 @@ jump_threader::thread_across_edge (edge e)
> >         m_registry->push_edge (path, e, EDGE_START_JUMP_THREAD);
> >         m_registry->push_edge (path, taken_edge, EDGE_COPY_SRC_JOINER_BLOCK);
> >
> > -       found = thread_around_empty_blocks (path, taken_edge, visited);
> > +       found = thread_around_empty_blocks (path, taken_edge, visited, limit);
> >
> >         if (!found)
> >           found = thread_through_normal_block (path,
> > -                                              path->last ()->e, visited) > 0;
> > +                                              path->last ()->e, visited,
> > +                                              limit) > 0;
> >
> >         /* If we were able to thread through a successor of E->dest, then
> >            record the jump threading opportunity.  */
> > diff --git a/gcc/tree-ssa-threadedge.h b/gcc/tree-ssa-threadedge.h
> > index 9f6cbfe9330..245b3506a55 100644
> > --- a/gcc/tree-ssa-threadedge.h
> > +++ b/gcc/tree-ssa-threadedge.h
> > @@ -101,9 +101,9 @@ private:
> >                                           unsigned limit);
> >
> >    bool thread_around_empty_blocks (vec<class jump_thread_edge *> *path,
> > -                                  edge, bitmap visited);
> > +                                  edge, bitmap visited, unsigned &limit);
> >    int thread_through_normal_block (vec<jump_thread_edge *> *path,
> > -                                  edge, bitmap visited);
> > +                                  edge, bitmap visited, unsigned &limit);
> >    void thread_across_edge (edge);
> >    bool record_temporary_equivalences_from_phis (edge);
> >    gimple *record_temporary_equivalences_from_stmts_at_dest (edge);
> > --
> > 2.43.0
> >
> 
>
  

Patch

diff --git a/gcc/tree-ssa-threadedge.cc b/gcc/tree-ssa-threadedge.cc
index 7f82639b8ec..0aa2aa85143 100644
--- a/gcc/tree-ssa-threadedge.cc
+++ b/gcc/tree-ssa-threadedge.cc
@@ -786,13 +786,17 @@  propagate_threaded_block_debug_into (basic_block dest, basic_block src)
 bool
 jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
 					   edge taken_edge,
-					   bitmap visited)
+					   bitmap visited, unsigned &limit)
 {
   basic_block bb = taken_edge->dest;
   gimple_stmt_iterator gsi;
   gimple *stmt;
   tree cond;
 
+  if (limit == 0)
+    return false;
+  --limit;
+
   /* The key property of these blocks is that they need not be duplicated
      when threading.  Thus they cannot have visible side effects such
      as PHI nodes.  */
@@ -830,7 +834,8 @@  jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
 	      m_registry->push_edge (path, taken_edge, EDGE_NO_COPY_SRC_BLOCK);
 	      m_state->append_path (taken_edge->dest);
 	      bitmap_set_bit (visited, taken_edge->dest->index);
-	      return thread_around_empty_blocks (path, taken_edge, visited);
+	      return thread_around_empty_blocks (path, taken_edge, visited,
+						 limit);
 	    }
 	}
 
@@ -872,7 +877,7 @@  jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
       m_registry->push_edge (path, taken_edge, EDGE_NO_COPY_SRC_BLOCK);
       m_state->append_path (taken_edge->dest);
 
-      thread_around_empty_blocks (path, taken_edge, visited);
+      thread_around_empty_blocks (path, taken_edge, visited, limit);
       return true;
     }
 
@@ -899,8 +904,13 @@  jump_threader::thread_around_empty_blocks (vec<jump_thread_edge *> *path,
 
 int
 jump_threader::thread_through_normal_block (vec<jump_thread_edge *> *path,
-					    edge e, bitmap visited)
+					    edge e, bitmap visited,
+					    unsigned &limit)
 {
+  if (limit == 0)
+    return 0;
+  limit--;
+
   m_state->register_equivs_edge (e);
 
   /* PHIs create temporary equivalences.
@@ -989,7 +999,7 @@  jump_threader::thread_through_normal_block (vec<jump_thread_edge *> *path,
  	     visited.  This may be overly conservative.  */
 	  bitmap_set_bit (visited, dest->index);
 	  bitmap_set_bit (visited, e->dest->index);
-	  thread_around_empty_blocks (path, taken_edge, visited);
+	  thread_around_empty_blocks (path, taken_edge, visited, limit);
 	  return 1;
 	}
     }
@@ -1075,9 +1085,12 @@  jump_threader::thread_across_edge (edge e)
   bitmap_set_bit (visited, e->src->index);
   bitmap_set_bit (visited, e->dest->index);
 
+  /* Limit search space.  */
+  unsigned limit = param_max_jump_thread_paths;
+
   int threaded = 0;
   if ((e->flags & EDGE_DFS_BACK) == 0)
-    threaded = thread_through_normal_block (path, e, visited);
+    threaded = thread_through_normal_block (path, e, visited, limit);
 
   if (threaded > 0)
     {
@@ -1148,11 +1161,12 @@  jump_threader::thread_across_edge (edge e)
 	m_registry->push_edge (path, e, EDGE_START_JUMP_THREAD);
 	m_registry->push_edge (path, taken_edge, EDGE_COPY_SRC_JOINER_BLOCK);
 
-	found = thread_around_empty_blocks (path, taken_edge, visited);
+	found = thread_around_empty_blocks (path, taken_edge, visited, limit);
 
 	if (!found)
 	  found = thread_through_normal_block (path,
-					       path->last ()->e, visited) > 0;
+					       path->last ()->e, visited,
+					       limit) > 0;
 
 	/* If we were able to thread through a successor of E->dest, then
 	   record the jump threading opportunity.  */
diff --git a/gcc/tree-ssa-threadedge.h b/gcc/tree-ssa-threadedge.h
index 9f6cbfe9330..245b3506a55 100644
--- a/gcc/tree-ssa-threadedge.h
+++ b/gcc/tree-ssa-threadedge.h
@@ -101,9 +101,9 @@  private:
 					  unsigned limit);
 
   bool thread_around_empty_blocks (vec<class jump_thread_edge *> *path,
-				   edge, bitmap visited);
+				   edge, bitmap visited, unsigned &limit);
   int thread_through_normal_block (vec<jump_thread_edge *> *path,
-				   edge, bitmap visited);
+				   edge, bitmap visited, unsigned &limit);
   void thread_across_edge (edge);
   bool record_temporary_equivalences_from_phis (edge);
   gimple *record_temporary_equivalences_from_stmts_at_dest (edge);