[RFC] More jump threading restrictions in the presence of loops.

Message ID 20211004094313.1596556-1-aldyh@redhat.com
State RFC
Series [RFC] More jump threading restrictions in the presence of loops.

Commit Message

Aldy Hernandez Oct. 4, 2021, 9:43 a.m. UTC
  It is frustrating that virtually all the regressions with the hybrid
threader for VRP have not been with the engine itself, but with the
independent restrictions we have agreed upon.

The following patch is the result of discussions with Richi, Jeff,
and Michael Matz regarding jump threading limitations in the presence
of loops, which I hope can lead to further refinements.

As I have mentioned before, most of our threading tests are too
fragile, so in this patch I have distilled various restrictions into
gimple FE tests that I hope can help in maintaining the threader going
forward.  The goal is to have one test with no valid threads
whatsoever, and one with exactly one valid thread per function.
This should make it trivial to maintain this going forward.

I would like to request the relevant experts to not only examine the
patch, but review the tests in this patch, to make sure we agree upon
these restrictions.  I have distilled the smallest possible test for
each restriction and have annotated said tests to make reviewing easy.

Note that the test in ssa-thread-valid.c is a thread that Jeff has
suggested should be an exception to the path crossing loops
restriction, but I have not implemented it yet, because even if we
could loosen the restriction, it would violate Richi's restriction of
a path that rotates a loop.  Comments are highly welcome.

By the way, these restrictions trigger *a lot*.  We seem to be
rotating loops left and right.  So I wouldn't be surprised if this
requires (as usual) a lot of test tweaking.

Untested patch follows.

p.s. Note that I'm just facilitating the discussion.  I'm highly
dependent on the loop experts here ;-).

gcc/ChangeLog:

	* tree-ssa-threadupdate.c
	(jt_path_registry::cancel_invalid_paths): Tweak.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
---
 .../gcc.dg/tree-ssa/ssa-thread-invalid.c      | 102 ++++++++++++++++++
 .../gcc.dg/tree-ssa/ssa-thread-valid.c        |  57 ++++++++++
 gcc/tree-ssa-threadupdate.c                   |  29 ++++-
 3 files changed, 186 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c
  

Comments

Jeff Law Oct. 4, 2021, 1:31 p.m. UTC | #1
On 10/4/2021 3:43 AM, Aldy Hernandez wrote:
> It is frustrating that virtually all the regressions with the hybrid
> threader for VRP, have not been with the engine itself, but with the
> independent restrictions we have agreed upon.
>
> The following patch is a collection of discussions with Richi, Jeff,
> and Michael Matz regarding jump threading limitations in the presence
> of loops, that I hope can lead to further refinements.
>
> As I have mentioned before, most of our threading tests are too
> fragile, so in this patch I have distilled various restrictions into
> gimple FE tests that I hope can help in maintaining the threader going
> forward.  The goal is to have one test with no valid threads
> whatsoever, and one with exactly one valid thread per function.
> This should make it trivial to maintain this going forward.
>
> I would like to request the relevant experts to not only examine the
> patch, but review the tests in this patch, to make sure we agree upon
> these restrictions.  I have distilled the smallest possible test for
> each restriction and have annotated said tests to make reviewing easy.
>
> Note that the test in ssa-thread-valid.c is a thread that Jeff has
> suggested should be an exception to the path crossing loops
> restriction, but I have not implemented it yet, because even if we
> could loosen the restriction, it would violate Richi's restriction of
> a path that rotates a loop.  Comments are highly welcome.
>
> By the way, these restrictions trigger *a lot*.  We seem to be
> rotating loops left and right.  So I wouldn't be surprised if this
> requires (as usual) a lot of test tweaking.
>
> Untested patch follows.
>
> p.s. Note that I'm just facilitating the discussion.  I'm highly
> dependent on the loop experts here ;-).
>
> gcc/ChangeLog:
>
> 	* tree-ssa-threadupdate.c
> 	(jt_path_registry::cancel_invalid_paths): Tweak.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
> 	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
Let me throw that into the tester and see what pops.

Jeff
  
Aldy Hernandez Oct. 4, 2021, 1:36 p.m. UTC | #2
Note that none of the other tests have been adjusted, so it'll likely
result in multiple threading regressions.

Thanks for looking at this.

Aldy

On Mon, Oct 4, 2021, 15:31 Jeff Law <jeffreyalaw@gmail.com> wrote:

>
>
> On 10/4/2021 3:43 AM, Aldy Hernandez wrote:
> > It is frustrating that virtually all the regressions with the hybrid
> > threader for VRP, have not been with the engine itself, but with the
> > independent restrictions we have agreed upon.
> >
> > The following patch is a collection of discussions with Richi, Jeff,
> > and Michael Matz regarding jump threading limitations in the presence
> > of loops, that I hope can lead to further refinements.
> >
> > As I have mentioned before, most of our threading tests are too
> > fragile, so in this patch I have distilled various restrictions into
> > gimple FE tests that I hope can help in maintaining the threader going
> > forward.  The goal is to have one test with no valid threads
> > whatsoever, and one with exactly one valid thread per function.
> > This should make it trivial to maintain this going forward.
> >
> > I would like to request the relevant experts to not only examine the
> > patch, but review the tests in this patch, to make sure we agree upon
> > these restrictions.  I have distilled the smallest possible test for
> > each restriction and have annotated said tests to make reviewing easy.
> >
> > Note that the test in ssa-thread-valid.c is a thread that Jeff has
> > suggested should be an exception to the path crossing loops
> > restriction, but I have not implemented it yet, because even if we
> > could loosen the restriction, it would violate Richi's restriction of
> > a path that rotates a loop.  Comments are highly welcome.
> >
> > By the way, these restrictions trigger *a lot*.  We seem to be
> > rotating loops left and right.  So I wouldn't be surprised if this
> > requires (as usual) a lot of test tweaking.
> >
> > Untested patch follows.
> >
> > p.s. Note that I'm just facilitating the discussion.  I'm highly
> > dependent on the loop experts here ;-).
> >
> > gcc/ChangeLog:
> >
> >       * tree-ssa-threadupdate.c
> >       (jt_path_registry::cancel_invalid_paths): Tweak.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
> >       * gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
> Let me throw that into the tester and see what pops.
>
> Jeff
>
>
  
Jeff Law Oct. 4, 2021, 3:30 p.m. UTC | #3
On 10/4/2021 7:36 AM, Aldy Hernandez wrote:
> Note that none of the other tests have been adjusted so it'll likely 
> result in multiple threading regressions.
Yea ;-)  So maybe we first focus on how those affect visium since that's 
what started us down this path.

The original list of regressions for visium was:
visium-sim: gcc.c-torture/execute/960218-1.c   -Os  (test for excess errors)
visium-sim: gcc.c-torture/execute/961125-1.c   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess errors)
visium-sim: gcc.c-torture/execute/961125-1.c   -O3 -g  (test for excess errors)
visium-sim: gcc.c-torture/execute/pending-4.c   -O1  (test for excess errors)
visium-sim: gcc.c-torture/execute/pr58209.c   -O2  (test for excess errors)
visium-sim: gcc.c-torture/execute/pr58209.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  (test for excess errors)
visium-sim: gcc.c-torture/execute/pr58209.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  (test for excess errors)
visium-sim: gcc.c-torture/execute/pr58209.c   -O3 -g  (test for excess errors)
visium-sim: gcc.c-torture/execute/pr68911.c   -O1  (test for excess errors)

I'm pretty confident these are all cases where we previously threaded 
one or more jumps and as a result we were able to remove all calls to 
abort ().   Your change doesn't affect any of those :(

So probably the first thing to do is get a patch that fixes one or more 
of those failures *or* make a determination that one or more of those 
tests are things we do not want to optimize.   The one we had previously 
looked at (960218-1)  had a thread path from inside the loop to a point 
outside the loop.  My sense is we probably want to allow that.

And just in case it got lost.  Here's the analysis on 960218-1 on visium:

We've got this going into ethread:

int f (int x)
{
   int D.1420;
   int a;

;;   basic block 2, loop depth 0, maybe hot
;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
   a_4 = ~x_3(D);
   goto <bb 4>; [INV]
;;    succ:       4 (FALLTHRU,EXECUTABLE)

;;   basic block 3, loop depth 1, maybe hot
;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
   gl = a_1;
;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)

;;   basic block 4, loop depth 1, maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;    pred:       2 (FALLTHRU,EXECUTABLE)
;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
   # a_1 = PHI <a_4(2), 0(3)>
   if (a_1 != 0)
     goto <bb 3>; [INV]
   else
     goto <bb 5>; [INV]
;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
;;                5 (FALSE_VALUE,EXECUTABLE)

;;   basic block 5, loop depth 0, maybe hot
;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
   return;
;;    succ:       EXIT

}


There's a pretty obvious jump thread in there.  3->4->5.  We used to 
allow this, but it's currently being rejected and I'm not sure it should.

In simplest terms we're threading from inside the loop, through the 
latch to a point outside the loop.  At first glance, that seems safe.

960218-1.c, compiled with -Os

int gl;

g (x)
{
   gl = x;
   return 0;
}

f (x)
{
   int a = ~x;
   while (a)
     a = g (a);
}

main ()
{
   f (3);
   if (gl != -4)
     abort ();
   exit (0);
}


Ideally in the final assembly there would be no calls to abort since it 
can be determined at compile-time that the call can never happen.

jeff
  
Michael Matz Oct. 4, 2021, 4:29 p.m. UTC | #4
Hello,

On Mon, 4 Oct 2021, Jeff Law wrote:

> And just in case it got lost.  Here's the analysis on 960218-1 on visium:
> 
> We've got this going into ethread:
> 
> int f (int x)
> {
>   int D.1420;
>   int a;
> 
> ;;   basic block 2, loop depth 0, maybe hot
> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
>   a_4 = ~x_3(D);
>   goto <bb 4>; [INV]
> ;;    succ:       4 (FALLTHRU,EXECUTABLE)
> 
> ;;   basic block 3, loop depth 1, maybe hot
> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
>   gl = a_1;
> ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
> 
> ;;   basic block 4, loop depth 1, maybe hot
> ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       2 (FALLTHRU,EXECUTABLE)
> ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
>   # a_1 = PHI <a_4(2), 0(3)>
>   if (a_1 != 0)
>     goto <bb 3>; [INV]
>   else
>     goto <bb 5>; [INV]
> ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
> ;;                5 (FALSE_VALUE,EXECUTABLE)
> 
> ;;   basic block 5, loop depth 0, maybe hot
> ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
>   return;
> ;;    succ:       EXIT
> 
> }

First notice that this doesn't have an empty latch block to start with 
(in fact, there is no separate latch block at all), so the loop optimizers 
haven't been initialized for simple latches at this point.  Still, I would 
say that even then we want to be careful with the latch edge, as putting 
code onto it will most probably create a problem downstream once we _do_ 
want to initialize the loop optimizers for simple latches.  So, that we are
careful here is okay, we are just too careful in this specific case.

> There's a pretty obvious jump thread in there.  3->4->5.
> 
> We used to allow this, but it's currently being rejected and I'm not 
> sure it should.
> 
> In simplest terms we're threading from inside the loop, through the 
> latch to a point outside the loop.  At first glance, that seems safe.

Is at least the unrestricted jump threader (the one after loop optimizers) 
picking this up?

Independent of that, I agree that this specific instance of threading 
through the latch block, even though followed by to-be-copied 
instructions, is okay.  We need to determine precisely when that is okay, 
though.  In this case there are several reasons I could see why this is 
so:
(a) while the thread is through the latch block, it's _not_ through the
    latch edge (which is 4->3).  That would create the problem, so here
    we're fine.
(b) even if it were through the latch edge, and would be followed by 
    to-be-copied instructions (and hence fill the latch edge) the ultimate 
    end of the path is outside the loop, so all the edges and blocks that 
    would now carry instructions would also be outside the loop, and hence 
    be no problem

Now, capture this precisely ;-)

I think something like so: we have a candidate path

  S -> L -> B1 (-> B2 ... Bn)

(Start, Latch, Blocks 1 to n following the latch).  (I think in our 
notation that means that the jump in L is redirected to Bn, and all code 
from B1..Bn be copied, right?  Do we even support multiple blocks after 
the to-be-redirected jump?)

Now if L is a latch block, but L->B1 is no latch/back edge : no 
restrictions apply.

Otherwise, L->B1 is a latch/back edge (that means B1 is a loop header): if 
all of B2..Bn-1 don't contain jumps back into the loop, and Bn is outside 
the loop, then the thread is okay as well.  (B2..Bn-1 can be inside the 
loop, as long as they don't contain jumps back into the loop, after 
copying by the threader, they don't create problems: their copies will be 
placed outside the loop and won't generate side entries back into the 
loop; the copied latch edge will not be a latch edge anymore, but a loop 
exit edge).

It's quite possible that the full generality above is not necessary.  
It's probably enough to just not deal with the (b) case above (and 
continue to reject it), and simply only accept the candidate path if none 
of the involved edges is a latch/back edge (despite one block being the 
latch block).  Especially so because I'm not convinced that I handled 
the idea of case (b) correctly above ;-)


Ciao,
Michael.
  
Richard Biener Oct. 5, 2021, 11:22 a.m. UTC | #5
On Mon, Oct 4, 2021 at 6:29 PM Michael Matz <matz@suse.de> wrote:
>
> Hello,
>
> On Mon, 4 Oct 2021, Jeff Law wrote:
>
> > And just in case it got lost.  Here's the analysis on 960218-1 on visium:
> >
> > We've got this going into ethread:
> >
> > int f (int x)
> > {
> >   int D.1420;
> >   int a;
> >
> > ;;   basic block 2, loop depth 0, maybe hot
> > ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
> >   a_4 = ~x_3(D);
> >   goto <bb 4>; [INV]
> > ;;    succ:       4 (FALLTHRU,EXECUTABLE)
> >
> > ;;   basic block 3, loop depth 1, maybe hot
> > ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
> >   gl = a_1;
> > ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
> >
> > ;;   basic block 4, loop depth 1, maybe hot
> > ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       2 (FALLTHRU,EXECUTABLE)
> > ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
> >   # a_1 = PHI <a_4(2), 0(3)>
> >   if (a_1 != 0)
> >     goto <bb 3>; [INV]
> >   else
> >     goto <bb 5>; [INV]
> > ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
> > ;;                5 (FALSE_VALUE,EXECUTABLE)
> >
> > ;;   basic block 5, loop depth 0, maybe hot
> > ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
> >   return;
> > ;;    succ:       EXIT
> >
> > }
>
> First notice that this doesn't have an empty latch block to start with
> (in fact, there is no separate latch block at all), so the loop optimizers
> haven't been initialized for simple latches at this point.  Still, I would
> say that even then we want to be careful with the latch edge, as putting
> code onto it will most probably create a problem downstream once we _do_
> want to initialize the loop optimizers for simple latches.  So, that we are
> careful here is okay, we are just too careful in this specific case.

Not sure if the argument about empty or not empty latches is important...
> > There's a pretty obvious jump thread in there.  3->4->5.
> >
> > We used to allow this, but it's currently being rejected and I'm not
> > sure it should.
> >
> > In simplest terms we're threading from inside the loop, through the
> > latch to a point outside the loop.  At first glance, that seems safe.

All threadings that start inside the loop and end outside of it are OK
in principle.  Even if the path crosses the latch or the header.

We _might_ end up creating a loop with multiple exits though.

> Is at least the unrestricted jump threader (the one after loop optimizers)
> picking this up?
>
> Independent of that, I agree that this specific instance of threading
> through the latch block, even though followed by to-be-copied
> instructions, is okay.  We need to determine precisely when that is okay,
> though.  In this case there are several reasons I could see why this is
> so:
> (a) while the thread is through the latch block, it's _not_ through the
>     latch edge (which is 4->3).  That would create the problem, so here
>     we're fine.
> (b) even if it were through the latch edge, and would be followed by
>     to-be-copied instructions (and hence fill the latch edge) the ultimate
>     end of the path is outside the loop, so all the edges and blocks that
>     would now carry instructions would also be outside the loop, and hence
>     be no problem

Yep.

> Now, capture this precisely ;-)

I tried to capture this with

+  // The path should either start and end in the same loop or exit the
+  // loop it starts in but never enter a loop.  This also catches
+  // creating irreducible loops, not only rotation.
+  if (entry->src->loop_father != exit->dest->loop_father
+      && !flow_loop_nested_p (exit->src->loop_father,
+                             entry->dest->loop_father))
+    {
+      cancel_thread (&path, "Path rotates loop");
+      return true;
+    }

it's not really necessary to have (simple) latches to argue about this
I think.

> I think something like so: we have a candidate path
>
>   S -> L -> B1 (-> B2 ... Bn)
>
> (Start, Latch, Blocks 1 to n following the latch).  (I think in our
> notation that means that the jump in L is redirected to Bn, and all code
> from B1..Bn be copied, right?  Do we even support multiple blocks after
> the to-be-redirected jump?)
>
> Now if L is a latch block, but L->B1 is no latch/back edge : no
> restrictions apply.
>
> Otherwise, L->B1 is a latch/back edge (that means B1 is a loop header): if
> all of B2..Bn-1 don't contain jumps back into the loop, and Bn is outside
> the loop, then the thread is okay as well.  (B2..Bn-1 can be inside the
> loop, as long as they don't contain jumps back into the loop, after
> copying by the threader, they don't create problems: their copies will be
> placed outside the loop and won't generate side entries back into the
> loop; the copied latch edge will not be a latch edge anymore, but a loop
> exit edge).
>
> It's quite possible that the full generality above is not necessary.
> It's probably enough to just not deal with the (b) case above (and
> continue to reject it), and simply only accept the candidate path if none
> of the involved edges is a latch/back edge (despite one block being the
> latch block).  Especially so because I'm not convinced that I handled
> the idea of case (b) correctly above ;-)
>
>
> Ciao,
> Michael.
  
Michael Matz Oct. 5, 2021, 12:43 p.m. UTC | #6
Hello,

On Tue, 5 Oct 2021, Richard Biener wrote:

> > First notice that this doesn't have an empty latch block to start with 
> > (in fact, there is no separate latch block at all), so the loop 
> > optimizers haven't been initialized for simple latches at this point.  
> > Still, I would say that even then we want to be careful with the latch 
> > edge, as putting code onto it will most probably create a problem 
> > downstream once we _do_ want to intialize the loop optimizers for 
> > simple latches.  So, that we are careful here is okay, we are just too 
> > careful in this specific case.
> 
> Not sure if the argument about empty or not empty latches is important...

In this case it's not (as there are no separate latches anyway), but 
generally a latch that is already non-empty (i.e. problematic) only 
becomes more non-empty, so doing the threading doesn't introduce that 
specific problem.

> > > There's a pretty obvious jump thread in there.  3->4->5.
> > >
> > > We used to allow this, but it's currently being rejected and I'm not
> > > sure it should.
> > >
> > > In simplest terms we're threading from inside the loop, through the
> > > latch to a point outside the loop.  At first glance, that seems safe.
> 
> All threadings that start inside the loop and end outside of it are OK
> in principle.  Even if the path crosses the latch or the header.
> 
> We _might_ end up creating a loop with multiple exits though.

And entries (and hence irreducible loops)!  That's why I also added the 
piece about "all of B2..Bn-1 don't contain jumps back into the loop".  
I'm not sure if candidate paths going into our threader can have this 
problem, but if they do, then you also don't want to do the block 
copying.

> > (a) while the thread is through the latch block, it's _not_ through the
> >     latch edge (which is 4->3).  That would create the problem, so here
> >     we're fine.
> > (b) even if it were through the latch edge, and would be followed by
> >     to-be-copied instructions (and hence fill the latch edge) the ultimate
> >     end of the path is outside the loop, so all the edges and blocks that
> >     would now carry instructions would also be outside the loop, and hence
> >     be no problem
> 
> Yep.
> 
> > Now, capture this precisely ;-)
> 
> I tried to capture this with
> 
> +  // The path should either start and end in the same loop or exit the
> +  // loop it starts in but never enter a loop.  This also catches
> +  // creating irreducible loops, not only rotation.
> +  if (entry->src->loop_father != exit->dest->loop_father
> +      && !flow_loop_nested_p (exit->src->loop_father,
> +                             entry->dest->loop_father))
> +    {
> +      cancel_thread (&path, "Path rotates loop");
> +      return true;
> +    }
> 
> it's not really necessary to have (simple) latches to argue about this
> I think.

Correct.  I don't think the above captures the re-entry problem (when 
intermediary blocks contain jumps inside the loop), the one I'm not sure 
we even have, but otherwise I think it does capture the (a) and (b) parts.


Ciao,
Michael.
  
Aldy Hernandez Oct. 5, 2021, 1:33 p.m. UTC | #7
On Mon, Oct 4, 2021 at 6:29 PM Michael Matz <matz@suse.de> wrote:
>
> Hello,
>
> On Mon, 4 Oct 2021, Jeff Law wrote:
>
> > And just in case it got lost.  Here's the analysis on 960218-1 on visium:
> >
> > We've got this going into ethread:
> >
> > int f (int x)
> > {
> >   int D.1420;
> >   int a;
> >
> > ;;   basic block 2, loop depth 0, maybe hot
> > ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
> >   a_4 = ~x_3(D);
> >   goto <bb 4>; [INV]
> > ;;    succ:       4 (FALLTHRU,EXECUTABLE)
> >
> > ;;   basic block 3, loop depth 1, maybe hot
> > ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
> >   gl = a_1;
> > ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
> >
> > ;;   basic block 4, loop depth 1, maybe hot
> > ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       2 (FALLTHRU,EXECUTABLE)
> > ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
> >   # a_1 = PHI <a_4(2), 0(3)>
> >   if (a_1 != 0)
> >     goto <bb 3>; [INV]
> >   else
> >     goto <bb 5>; [INV]
> > ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
> > ;;                5 (FALSE_VALUE,EXECUTABLE)
> >
> > ;;   basic block 5, loop depth 0, maybe hot
> > ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
> >   return;
> > ;;    succ:       EXIT
> >
> > }
>
> First notice that this doesn't have an empty latch block to start with
> (in fact, there is no separate latch block at all), so the loop optimizers
> haven't been initialized for simple latches at this point.  Still, I would
> say that even then we want to be careful with the latch edge, as putting
> code onto it will most probably create a problem downstream once we _do_
> want to initialize the loop optimizers for simple latches.  So, that we are
> careful here is okay, we are just too careful in this specific case.
>
> > There's a pretty obvious jump thread in there.  3->4->5.
> >
> > We used to allow this, but it's currently being rejected and I'm not
> > sure it should.
> >
> > In simplest terms we're threading from inside the loop, through the
> > latch to a point outside the loop.  At first glance, that seems safe.
>
> Is at least the unrestricted jump threader (the one after loop optimizers)
> picking this up?
>
> Independent of that, I agree that this specific instance of threading
> through the latch block, even though followed by to-be-copied
> instructions, is okay.  We need to determine precisely when that is okay,
> though.  In this case there are several reasons I could see why this is
> so:
> (a) while the thread is through the latch block, it's _not_ through the
>     latch edge (which is 4->3).  That would create the problem, so here
>     we're fine.

It seems we all agree Jeff's finding should be allowed, so let's
attack this one first, since it gets almost all of his visium
failures.  I can submit the rest of the cases separately.

The attached patch catches the IL discussed, and adds a relevant
gimple FE test others can use for experimenting :).

Tested on x86-64 and by visual inspection on visium-elf on the
regressions Jeff pointed me at.

OK?

BTW Jeff, this fixes all the regressions you mention except:

1. pr68911.c

The path not being threaded here is 7->10->11->12.  It crosses loops
multiple times, so I believe the restriction code is correct.

7, 10, 12 are in loop1.
11 is in loop 2.

So we have a path going from loop1 -> loop2 -> loop1.  I can't
conceive any scenario where this is ok, but I can attach the entire IL
if I'm missing something.

2. 961125-1.c

This one is a bit trickier.  Here we're trying to thread the following
conditional which, as I mentioned when I contributed this work, we
don't handle (and which happens infrequently enough not to matter):

+  // Loop 4
+  <bb 6> [local count: 114863531]:
+  # ptr_8 = PHI <ptr_2(4), ptr_2(5)>
+  if (ptr_8 < &MEM <char[4]> [(void *)":ab" + 3B])
+    goto <bb 7>; [50.00%]
+  else
+    goto <bb 17>; [50.00%]

The hybrid threader doesn't handle &MEM in the final conditional.  As
I mentioned earlier, if this becomes an issue, we can adapt class
pointer_equiv_analyzer like we did for evrp.  I have a gimple FE test
I will contribute as an XFAIL with an associated PR to keep us honest.

That being said... in this particular test, this is all irrelevant
because the path will be disallowed for two reasons:

a) The path crosses loops, and the reason we didn't realize it in the
old code was because the ASSERT_EXPR had pulled the SSA outside the
loop, so it looks like the entire path is l in the same loop.  If you
look at the original IL, it's not.

b) Now the path actually fits the pattern being discussed in this
patch, where there's an early exit out of a loop, so it looks like we
should handle it.  But...in this case, we would fill a presently empty
latch.  Interestingly, the old code didn't catch it, because....you
guessed it...there was an ASSERT_EXPR in the latch.

So I argue that even in the absence of MEM_REF handling, we shouldn't
thread this path.
  
Jeff Law Oct. 5, 2021, 2:56 p.m. UTC | #8
On 10/5/2021 5:22 AM, Richard Biener wrote:
> On Mon, Oct 4, 2021 at 6:29 PM Michael Matz <matz@suse.de> wrote:
>> Hello,
>>
>> On Mon, 4 Oct 2021, Jeff Law wrote:
>>
>>> And just in case it got lost.  Here's the analysis on 960218-1 on visium:
>>>
>>> We've got this going into ethread:
>>>
>>> int f (int x)
>>> {
>>>    int D.1420;
>>>    int a;
>>>
>>> ;;   basic block 2, loop depth 0, maybe hot
>>> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
>>>    a_4 = ~x_3(D);
>>>    goto <bb 4>; [INV]
>>> ;;    succ:       4 (FALLTHRU,EXECUTABLE)
>>>
>>> ;;   basic block 3, loop depth 1, maybe hot
>>> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
>>>    gl = a_1;
>>> ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
>>>
>>> ;;   basic block 4, loop depth 1, maybe hot
>>> ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       2 (FALLTHRU,EXECUTABLE)
>>> ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
>>>    # a_1 = PHI <a_4(2), 0(3)>
>>>    if (a_1 != 0)
>>>      goto <bb 3>; [INV]
>>>    else
>>>      goto <bb 5>; [INV]
>>> ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
>>> ;;                5 (FALSE_VALUE,EXECUTABLE)
>>>
>>> ;;   basic block 5, loop depth 0, maybe hot
>>> ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
>>>    return;
>>> ;;    succ:       EXIT
>>>
>>> }
>> First notice that this doesn't have an empty latch block to start with
>> (in fact, there is no separate latch block at all), so the loop optimizers
>> haven't been initialized for simple latches at this point.  Still, I would
>> say that even then we want to be careful with the latch edge, as putting
>> code onto it will most probably create a problem downstream once we _do_
>> want to initialize the loop optimizers for simple latches.  So, that we are
>> careful here is okay, we are just too careful in this specific case.
> Not sure if the argument about empty or not empty latches is important...
>>> There's a pretty obvious jump thread in there.  3->4->5.
>>>
>>> We used to allow this, but it's currently being rejected and I'm not
>>> sure it should.
>>>
>>> In simplest terms we're threading from inside the loop, through the
>>> latch to a point outside the loop.  At first glance, that seems safe.
> All threadings that start inside the loop and end outside of it are OK
> in principle.  Even if the path crosses the latch or the header.
I generally agree, but with one caveat.  If the latch is marked as a 
joiner block, then it will not be OK as that'll create multiple 
backedges to the top of the loop.


Jeff
  
Jeff Law Oct. 5, 2021, 3:10 p.m. UTC | #9
On 10/5/2021 7:33 AM, Aldy Hernandez wrote:
> On Mon, Oct 4, 2021 at 6:29 PM Michael Matz <matz@suse.de> wrote:
>> Hello,
>>
>> On Mon, 4 Oct 2021, Jeff Law wrote:
>>
>>> And just in case it got lost.  Here's the analysis on 960218-1 on visium:
>>>
>>> We've got this going into ethread:
>>>
>>> int f (int x)
>>> {
>>>    int D.1420;
>>>    int a;
>>>
>>> ;;   basic block 2, loop depth 0, maybe hot
>>> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
>>>    a_4 = ~x_3(D);
>>>    goto <bb 4>; [INV]
>>> ;;    succ:       4 (FALLTHRU,EXECUTABLE)
>>>
>>> ;;   basic block 3, loop depth 1, maybe hot
>>> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
>>>    gl = a_1;
>>> ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
>>>
>>> ;;   basic block 4, loop depth 1, maybe hot
>>> ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       2 (FALLTHRU,EXECUTABLE)
>>> ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
>>>    # a_1 = PHI <a_4(2), 0(3)>
>>>    if (a_1 != 0)
>>>      goto <bb 3>; [INV]
>>>    else
>>>      goto <bb 5>; [INV]
>>> ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
>>> ;;                5 (FALSE_VALUE,EXECUTABLE)
>>>
>>> ;;   basic block 5, loop depth 0, maybe hot
>>> ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
>>>    return;
>>> ;;    succ:       EXIT
>>>
>>> }
>> First notice that this doesn't have an empty latch block to start with
>> (in fact, there is no separate latch block at all), so the loop optimizers
>> haven't been initialized for simple latches at this point.  Still, I would
>> say that even then we want to be careful with the latch edge, as putting
>> code onto it will most probably create a problem downstream once we _do_
>> want to initialize the loop optimizers for simple latches.  So, being
>> careful here is okay; we are just too careful in this specific case.
>>
>>> There's a pretty obvious jump thread in there.  3->4->5.
>>>
>>> We used to allow this, but it's currently being rejected and I'm not
>>> sure it should.
>>>
>>> In simplest terms we're threading from inside the loop, through the
>>> latch to a point outside the loop.  At first glance, that seems safe.
>> Is at least the unrestricted jump threader (the one after loop optimizers)
>> picking this up?
>>
>> Independend of that, I agree, that this specific instance of threading
>> through the latch block, even though followed by to-be-copied
>> instructions, is okay.  We need to determine precisely when that is okay,
>> though.  In this case there are several reasons I could see why this is
>> so:
>> (a) while the thread is through the latch block, it's _not_ through the
>>      latch edge (which is 4->3).  That would create the problem, so here
>>      we're fine.
> It seems we all agree Jeff's finding should be allowed, so let's
> attack this one first, since it gets almost all of his visium
> failures.  I can submit the rest of the cases separately.
>
> The attached patch catches the IL discussed, and adds a relevant
> gimple FE test others can use for experimenting :).
>
> Tested on x86-64 and by visual inspection on visium-elf on the
> regressions Jeff pointed me at.
>
> OK?
>
> BTW Jeff, this fixes all the regressions you mention except:
>
> 1. pr68911.c
>
> The path not being threaded here is 7->10->11->12.  It crosses loops
> multiple times, so I believe the restriction code is correct.
>
> 7, 10, 12 are in loop1.
> 11 is in loop 2.
>
> So we have a path going from loop1 -> loop2 -> loop1.  I can't
> conceive any scenario where this is ok, but I can attach the entire IL
> if I'm missing something.
>
> 2. 961125-1.c
>
> This one is a bit trickier.  Here we're trying to thread the following
> conditional, which I mentioned when I contributed this work, we don't
> handle (and happens infrequently enough not to matter):
>
> +  // Loop 4
> +  <bb 6> [local count: 114863531]:
> +  # ptr_8 = PHI <ptr_2(4), ptr_2(5)>
> +  if (ptr_8 < &MEM <char[4]> [(void *)":ab" + 3B])
> +    goto <bb 7>; [50.00%]
> +  else
> +    goto <bb 17>; [50.00%]
>
> The hybrid threader doesn't handle &MEM in the final conditional.  As
> I mentioned earlier, if this becomes an issue, we can adapt class
> pointer_equiv_analyzer like we did for evrp.  I have a gimple FE test
> I will contribute as an XFAIL with an associated PR to keep us honest.
>
> That being said... in this particular test, this is all irrelevant
> because the path will be disallowed for two reasons:
>
> a) The path crosses loops, and the reason we didn't realize it in the
> old code was because the ASSERT_EXPR had pulled the SSA outside the
> loop, so it looks like the entire path is all in the same loop.  If you
> look at the original IL, it's not.
>
> b) Now the path actually fits the pattern being discussed in this
> patch, where there's an early exit out of a loop, so it looks like we
> should handle it.  But...in this case, we would fill a presently empty
> latch.  Interestingly, the old code didn't catch it, because....you
> guessed it...there was an ASSERT_EXPR in the latch.
>
> So I argue that even in the absence of MEM_REF handling, we shouldn't
> thread this path.
>
> 0001-Loosen-loop-crossing-restriction-in-threader.patch
>
>  From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
> From: Aldy Hernandez <aldyh@redhat.com>
> Date: Tue, 5 Oct 2021 15:03:34 +0200
> Subject: [PATCH] Loosen loop crossing restriction in threader.
>
> Crossing loops is generally discouraged from the threader, but we can
> make an exception when we don't cross the latch or enter another loop,
> since this is just an early exit out of the loop.
>
> In fact, the whole threaded path is logically outside the loop.  This
> has nice secondary effects.  For example, objects on the threaded path
> will no longer necessarily be live throughout the loop, so we can get
> register allocation improvements.  The threaded path can physically
> move outside the loop resulting in better icache efficiency, etc.
>
> Tested on x86-64 Linux, and on a visium-elf cross making sure that the
> following tests do not have an abort in the final assembly:
>
> gcc.c-torture/execute/960218-1.c
> gcc.c-torture/execute/visium-pending-4.c
> gcc.c-torture/execute/pr58209.c
>
> gcc/ChangeLog:
>
> 	* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
I'll throw it in and see what we get :-)

jeff
  
Jeff Law Oct. 5, 2021, 4:08 p.m. UTC | #10
On 10/5/2021 7:33 AM, Aldy Hernandez wrote:
>
> OK?
>
> BTW Jeff, this fixes all the regressions you mention except:
>
> 1. pr68911.c
>
> The path not being threaded here is 7->10->11->12.  It crosses loops
> multiple times, so I believe the restriction code is correct.
>
> 7, 10, 12 are in loop1.
> 11 is in loop 2.
>
> So we have a path going from loop1 -> loop2 -> loop1.  I can't
> conceive any scenario where this is ok, but I can attach the entire IL
> if I'm missing something.
Wow.  I'm having trouble seeing how we were threading that, but clearly 
not OK.
>
> 2. 961125-1.c
>
> This one is a bit trickier.  Here we're trying to thread the following
> conditional, which I mentioned when I contributed this work, we don't
> handle (and happens infrequently enough not to matter):
Sorry, I'd forgotten about this case.

>
> +  // Loop 4
> +  <bb 6> [local count: 114863531]:
> +  # ptr_8 = PHI <ptr_2(4), ptr_2(5)>
> +  if (ptr_8 < &MEM <char[4]> [(void *)":ab" + 3B])
> +    goto <bb 7>; [50.00%]
> +  else
> +    goto <bb 17>; [50.00%]
>
> The hybrid threader doesn't handle &MEM in the final conditional.  As
> I mentioned earlier, if this becomes an issue, we can adapt class
> pointer_equiv_analyzer like we did for evrp.  I have a gimple FE test
> I will contribute as an XFAIL with an associated PR to keep us honest.
>
> That being said... in this particular test, this is all irrelevant
> because the path will be disallowed for two reasons:
>
> a) The path crosses loops, and the reason we didn't realize it in the
> old code was because the ASSERT_EXPR had pulled the SSA outside the
> loop, so it looks like the entire path is all in the same loop.  If you
> look at the original IL, it's not.
>
> b) Now the path actually fits the pattern being discussed in this
> patch, where there's an early exit out of a loop, so it looks like we
> should handle it.  But...in this case, we would fill a presently empty
> latch.  Interestingly, the old code didn't catch it, because....you
> guessed it...there was an ASSERT_EXPR in the latch.
>
> So I argue that even in the absence of MEM_REF handling, we shouldn't
> thread this path.
I can live with that too.  I'll set new baselines for visium-elf so that 
these don't show up again.

>
> 0001-Loosen-loop-crossing-restriction-in-threader.patch
>
>  From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
> From: Aldy Hernandez <aldyh@redhat.com>
> Date: Tue, 5 Oct 2021 15:03:34 +0200
> Subject: [PATCH] Loosen loop crossing restriction in threader.
>
> Crossing loops is generally discouraged from the threader, but we can
> make an exception when we don't cross the latch or enter another loop,
> since this is just an early exit out of the loop.
>
> In fact, the whole threaded path is logically outside the loop.  This
> has nice secondary effects.  For example, objects on the threaded path
> will no longer necessarily be live throughout the loop, so we can get
> register allocation improvements.  The threaded path can physically
> move outside the loop resulting in better icache efficiency, etc.
>
> Tested on x86-64 Linux, and on a visium-elf cross making sure that the
> following tests do not have an abort in the final assembly:
>
> gcc.c-torture/execute/960218-1.c
> gcc.c-torture/execute/visium-pending-4.c
> gcc.c-torture/execute/pr58209.c
>
> gcc/ChangeLog:
>
> 	* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
OK.  I think there may still be some concern if the latch is marked as a 
joiner.    I think we should always reject those before the loop 
optimizers run as the joiner's clone would introduce a second latch.   I 
think that can be handled in a follow-up refinement.





>
> Co-authored-by: Jeff Law <jeffreyalaw@gmail.com>
I'm not sure I deserve co-authorship :-)  You're doing all the work, I'm 
just reporting the regressions.

jeff
  
Aldy Hernandez Oct. 5, 2021, 4:22 p.m. UTC | #11
On Tue, Oct 5, 2021 at 6:08 PM Jeff Law <jeffreyalaw@gmail.com> wrote:

> OK.  I think there may still be some concern if the latch is marked as a joiner.    I think we should always reject those before the loop optimizers run as the joiner's clone would introduce a second latch.   I think that can be handled in a follow-up refinement.

I've been mumbling, to myself mostly, that the loop threading
restrictions should be owned by the loop experts.  There's not much I
can contribute here, except testcases and patches that I *think* are
what y'all want.  That being said, I will follow-up with the
suggestions made by Richi, because I've already done the work, and
have cobbled up the relevant gimple FE tests ;-).

>
>
>
>
>
>
> Co-authored-by: Jeff Law <jeffreyalaw@gmail.com>
>
> I'm not sure I deserve co-authorship :-)  You're doing all the work, I'm just reporting the regressions.

Heh.  Heh.  Ok.   I honestly thought about putting everyone's name here.

Aldy
  
Aldy Hernandez Oct. 6, 2021, 10:22 a.m. UTC | #12
[Here go the bits by Richi, tested on x86-64 Linux, and ongoing tests
on aarch64 and ppc64le.]

There is a lot of fall-out from this patch, as there were many threading
tests that assumed the restrictions introduced by this patch were valid.
Some tests have merely shifted the threading to after loop
optimizations, but others ended up with no threading opportunities at
all.  Surprisingly some tests ended up with more total threads.  It was
a crapshoot all around.

On a positive note, there are 6 tests that no longer XFAIL, and one
guality test which now passes.

I would appreciate someone reviewing the test changes.  I am unsure
whether some of the tests should be removed, XFAILed, or some other
thing.

I felt a bit queasy about such a fundamental change wrt threading, so I
ran it through my callgrind test harness (.ii files from a bootstrap).
There was no change in overall compilation, DOM, or the VRP threaders.

However, there was a slight increase of 1.63% in the backward threader.
I'm pretty sure we could reduce this if we incorporated the restrictions
into their profitability code.  This way we could stop the search when
we ran into one of these restrictions.  Not sure it's worth it at this
point.

Note, that this ad-hoc testing is not meant to replace a more thorough
SPEC, etc test.

Tested on x86-64 Linux.

OK pending tests on aarch64 and ppc64le?

Co-authored-by: Richard Biener <rguenther@suse.de>
  
Andreas Schwab Oct. 6, 2021, 1:15 p.m. UTC | #13
On Oct 05 2021, Aldy Hernandez via Gcc-patches wrote:

> From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
> From: Aldy Hernandez <aldyh@redhat.com>
> Date: Tue, 5 Oct 2021 15:03:34 +0200
> Subject: [PATCH] Loosen loop crossing restriction in threader.
>
> Crossing loops is generally discouraged from the threader, but we can
> make an exception when we don't cross the latch or enter another loop,
> since this is just an early exit out of the loop.

This breaks bootstrap on aarch64 (in stage2):

In function 'void mark_stack_region_used(poly_uint64, poly_uint64)',
    inlined from 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)' at ../../gcc/calls.c:4536:29:
../../gcc/calls.c:206:26: error: 'const_upper' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  206 |       stack_usage_map[i] = 1;
      |       ~~~~~~~~~~~~~~~~~~~^~~
../../gcc/calls.c: In function 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)':
../../gcc/calls.c:202:39: note: 'const_upper' was declared here
  202 |   unsigned HOST_WIDE_INT const_lower, const_upper;
      |                                       ^~~~~~~~~~~

Andreas.
  
Aldy Hernandez Oct. 6, 2021, 1:47 p.m. UTC | #14
The pending patch I have from Richi fixes this.  Even so, it's the
uninit code that's confused.

Sigh...every single change to the threading code shines the light on
some warning bug.

If you take the calls.ii file from the aarch64 bootstrap and break on
the warning, you can see that the uninitalized use is for
const_upper_3934 here:

 <bb 102> [local count: 315357954]:
  # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
  if (_881 != 0)
    goto <bb 103>; [50.00%]
  else
    goto <bb 106>; [50.00%]

  <bb 103> [local count: 157678977]:
  if (const_upper_3934 > _6699)
    goto <bb 105>; [89.00%]
  else
    goto <bb 294>; [11.00%]

  <bb 294> [local count: 17344687]:

  <bb 104> [local count: 157678977]:
  goto <bb 107>; [100.00%]

  <bb 105> [local count: 140334290]:
  stack_usage_map.481_3930 = stack_usage_map;
  _6441 = const_upper_3934 - _6699;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PROBLEMATIC READ HERE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  _4819 = stack_usage_map.481_3930 + _6699;
  __builtin_memset (_4819, 1, _6441);
  goto <bb 104>; [11.00%]

const_upper_3934 could be undefined if it comes from BB101
(const_upper_3937(D)), but it only gets read for _881 != 0, so it
shouldn't warn.
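
A hypothetical distillation of that shape (made-up names, not the actual
calls.c code): the variable is written and read only under the same
flag, so no uninitialized read is possible, yet once jump threading
reshapes the CFG the warning's predicate analysis can lose track of the
guard:

```c
#include <assert.h>

/* Illustrative reduction of the guarded-use pattern from calls.c:
   upper is only written when cond is nonzero, and only read when cond
   is nonzero, so the read is always of an initialized value -- but
   -Wmaybe-uninitialized may still warn after block duplication.  */
unsigned long
clamp_excess (int cond, unsigned long n, unsigned long bound)
{
  unsigned long upper;          /* deliberately left uninitialized */
  if (cond)
    upper = n;                  /* only definition of upper        */
  /* ... intervening code the threader may duplicate ...           */
  if (cond && upper > bound)    /* only read of upper              */
    return upper - bound;
  return 0;
}
```

Because both the definition and the use are dominated by the same
condition, the uninitialized value can never reach the subtraction.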

I suggest -Wmaybe-uninitialized be turned off for that file until the
warning is fixed.

And yes, the proposed patch will also cure this, but the underlying
problem in the warning is still there.

Aldy

On Wed, Oct 6, 2021 at 3:24 PM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Oct 05 2021, Aldy Hernandez via Gcc-patches wrote:
>
> > From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
> > From: Aldy Hernandez <aldyh@redhat.com>
> > Date: Tue, 5 Oct 2021 15:03:34 +0200
> > Subject: [PATCH] Loosen loop crossing restriction in threader.
> >
> > Crossing loops is generally discouraged from the threader, but we can
> > make an exception when we don't cross the latch or enter another loop,
> > since this is just an early exit out of the loop.
>
> This breaks bootstrap on aarch64 (in stage2):
>
> In function 'void mark_stack_region_used(poly_uint64, poly_uint64)',
>     inlined from 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)' at ../../gcc/calls.c:4536:29:
> ../../gcc/calls.c:206:26: error: 'const_upper' may be used uninitialized in this function [-Werror=maybe-uninitialized]
>   206 |       stack_usage_map[i] = 1;
>       |       ~~~~~~~~~~~~~~~~~~~^~~
> ../../gcc/calls.c: In function 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)':
> ../../gcc/calls.c:202:39: note: 'const_upper' was declared here
>   202 |   unsigned HOST_WIDE_INT const_lower, const_upper;
>       |                                       ^~~~~~~~~~~
>
> Andreas.
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."
>
  
Martin Sebor Oct. 6, 2021, 3:03 p.m. UTC | #15
On 10/6/21 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
> The pending patch I have from Richi fixes this.  Even so, it's the
> uninit code that's confused.
> 
> Sigh...every single change to the threading code shines the light on
> some warning bug.
> 
> If you take the calls.ii file from the aarch64 bootstrap and break on
> the warning, you can see that the uninitialized use is for
> const_upper_3934 here:
> 
>   <bb 102> [local count: 315357954]:
>    # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
>    if (_881 != 0)
>      goto <bb 103>; [50.00%]
>    else
>      goto <bb 106>; [50.00%]
> 
>    <bb 103> [local count: 157678977]:
>    if (const_upper_3934 > _6699)
>      goto <bb 105>; [89.00%]
>    else
>      goto <bb 294>; [11.00%]
> 
>    <bb 294> [local count: 17344687]:
> 
>    <bb 104> [local count: 157678977]:
>    goto <bb 107>; [100.00%]
> 
>    <bb 105> [local count: 140334290]:
>    stack_usage_map.481_3930 = stack_usage_map;
>    _6441 = const_upper_3934 - _6699;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> PROBLEMATIC READ HERE
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    _4819 = stack_usage_map.481_3930 + _6699;
>    __builtin_memset (_4819, 1, _6441);
>    goto <bb 104>; [11.00%]
> 
> const_upper_3934 could be undefined if it comes from BB101
> (const_upper_3937(D)), but it only gets read for _881 != 0, so it
> shouldn't warn.
> 
> I suggest -Wmaybe-uninitialized be turned off for that file until the
> warning is fixed.

-Wmaybe-uninitialized certainly suffers from a high rate of false
positives (I count 40 open PRs).  Even after all this time debugging
it I still struggle with the code for it but in all the bug reports
I've reviewed, only a small subset seems to be due to bugs or even
limitations in it.  Most are inherent in its design goal to warn
for reads from variables that cannot be proven to have been
initialized.

If this one is a bug with some chance of getting fixed it would
be good to distill it to a small test case (even though it's not
unlikely that there already is an existing report with the same
root cause).

That said, from your description and the IL above it sounds
to me like the uninitialized read here is possible (it takes place
when _881 != 0), and so -Wmaybe-uninitialized triggers as it's
supposed to.

Either way, rather than suppressing the warning for the whole file
I would suggest to consider initializing the local variable instead.
Looking at the code in calls.c, I can't convince myself that
the variable cannot, in fact, be used uninitialized.

Martin

> 
> And yes, the proposed patch will also cure this, but the underlying
> problem in the warning is still there.
> 
> Aldy
> 
> On Wed, Oct 6, 2021 at 3:24 PM Andreas Schwab <schwab@linux-m68k.org> wrote:
>>
>> On Oct 05 2021, Aldy Hernandez via Gcc-patches wrote:
>>
>>>  From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
>>> From: Aldy Hernandez <aldyh@redhat.com>
>>> Date: Tue, 5 Oct 2021 15:03:34 +0200
>>> Subject: [PATCH] Loosen loop crossing restriction in threader.
>>>
>>> Crossing loops is generally discouraged from the threader, but we can
>>> make an exception when we don't cross the latch or enter another loop,
>>> since this is just an early exit out of the loop.
>>
>> This breaks bootstrap on aarch64 (in stage2):
>>
>> In function 'void mark_stack_region_used(poly_uint64, poly_uint64)',
>>      inlined from 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)' at ../../gcc/calls.c:4536:29:
>> ../../gcc/calls.c:206:26: error: 'const_upper' may be used uninitialized in this function [-Werror=maybe-uninitialized]
>>    206 |       stack_usage_map[i] = 1;
>>        |       ~~~~~~~~~~~~~~~~~~~^~~
>> ../../gcc/calls.c: In function 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)':
>> ../../gcc/calls.c:202:39: note: 'const_upper' was declared here
>>    202 |   unsigned HOST_WIDE_INT const_lower, const_upper;
>>        |                                       ^~~~~~~~~~~
>>
>> Andreas.
>>
>> --
>> Andreas Schwab, schwab@linux-m68k.org
>> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
>> "And now for something completely different."
>>
>
  
Aldy Hernandez Oct. 6, 2021, 4:22 p.m. UTC | #16
On Wed, Oct 6, 2021 at 5:03 PM Martin Sebor <msebor@gmail.com> wrote:
>
> On 10/6/21 7:47 AM, Aldy Hernandez via Gcc-patches wrote:

> -Wmaybe-uninitialized certainly suffers from a high rate of false
> positives (I count 40 open PRs).  Even after all this time debugging
> it I still struggle with the code for it but in all the bug reports
> I've reviewed, only a small subset seems to be due to bugs or even
> limitations in it.  Most are inherent in its design goal to warn
> for reads from variables that cannot be proven to have been
> initialized.
>
> If this one is a bug with some chance of getting fixed it would
> be good to distill it to a small test case (even though it's not
> unlikely that there already is an existing report with the same
> root cause).
>
> That said, from your description and the IL above it sounds
> to me like the uninitialized read here is possible (it takes place
> when _881 != 0), and so -Wmaybe-uninitialized triggers as it's
> supposed to.
>
> Either way, rather than suppressing the warning for the whole file
> I would suggest to consider initializing the local variable instead.
> Looking at the code in calls.c, I can't convince myself that
> the variable cannot, in fact, be used uninitialized.
>
> Martin

FWIW, libgomp/team.c suffers from the same problem if you remove the
stop-gap in gomp_team_start():

  struct gomp_thread_start_data *start_data = NULL;

The block predicated by "if (nested)" is the only place that can use
start_data without passing through its initialization.  But the only
way to elide the initialization is from:

if (!nested)
{
  ...
 ....
  if (i == nthreads)
    goto do_release;
}

So the use of start_data either crosses the gomp_alloca
initialization, or is predicated by if(nested).

And the gimple shows a very similar pattern to what I described earlier:

  <bb 163> [local count: 59055800]:
  # start_data_876 = PHI <start_data_770(162), start_data_258(288),
start_data_359(307)>
do_release:
  if (_1 != 0)
    goto <bb 164>; [55.90%]
  else
    goto <bb 289>; [44.10%]

  <bb 289> [local count: 26046541]:
  goto <bb 165>; [100.00%]

  <bb 164> [local count: 33009259]:
  _223 = &team_417(D)->barrier;
  gomp_barrier_wait (_223);
  goto <bb 166>; [100.00%]

  <bb 265> [local count: 6962719]:

  <bb 165> [local count: 33009259]:
  # start_data_781 = PHI <start_data_876(289), start_data_518(D)(265)>
 # old_threads_used_887 = PHI <old_threads_used_782(289),
old_threads_used_454(265)>
  # affinity_count_825 = PHI <affinity_count_885(289), affinity_count_343(265)>
  # affinity_thr_904 = PHI <affinity_thr_867(289), 0B(265)>
  # force_display_840 = PHI <force_display_612(289), force_display_192(265)>
  _589 = &MEM[(struct gomp_simple_barrier_t *)pool_410 + 64B].bar;
  gomp_barrier_wait (_589);

  <bb 166> [local count: 66018519]:
  # start_data_870 = PHI <start_data_876(164), start_data_781(165)>

The use of start_data_781 coming in from  BB265 (start_data_518(D))
would be uninitialized if the read wasn't guarded by "if (_1 != 0)".

It seems uninit has issues seeing this pattern.

Unfortunately reducing this has been, ahem...challenging, but the
problem is still there, both within calls.c on aarch64 and on x86-64
for libgomp/team.c.

Aldy
  
Aldy Hernandez Oct. 6, 2021, 5:03 p.m. UTC | #17
FWIW, I've opened a PR with both testcases:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102631
  
Martin Sebor Oct. 6, 2021, 7:11 p.m. UTC | #18
On 10/6/21 11:03 AM, Aldy Hernandez wrote:
> FWIW, I've opened a PR with both testcases:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102631
> 

Okay, thanks.  I see what you mean that reducing it to a test case
is challenging.  I'll see if I can find some time to distill it to
something useful.  Until then, to get past the warning, I recommend
to simply initialize the variable.  For simple scalars there's little
reason not to.

Martin
  
Jeff Law Oct. 6, 2021, 11 p.m. UTC | #19
On 10/6/2021 10:22 AM, Aldy Hernandez via Gcc-patches wrote:
> On Wed, Oct 6, 2021 at 5:03 PM Martin Sebor <msebor@gmail.com> wrote:
>> On 10/6/21 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
>> -Wmaybe-uninitialized certainly suffers from a high rate of false
>> positives (I count 40 open PRs).  Even after all this time debugging
>> it I still struggle with the code for it but in all the bug reports
>> I've reviewed, only a small subset seems to be due to bugs or even
>> limitations in it.  Most are inherent in its design goal to warn
>> for reads from variables that cannot be proven to have been
>> initialized.
>>
>> If this one is a bug with some chance of getting fixed it would
>> be good to distill it to a small test case (even though it's not
>> unlikely that there already is an existing report with the same
>> root cause).
>>
>> That said, from your description and the IL above it sounds
>> to me like the uninitialized read here is possible (it takes place
>> when _881 != 0), and so -Wmaybe-uninitialized triggers as it's
>> supposed to.
>>
>> Either way, rather than suppressing the warning for the whole file
>> I would suggest to consider initializing the local variable instead.
>> Looking at the code in calls.c, I can't convince myself that
>> the variable cannot, in fact, be used uninitialized.
>>
>> Martin
> FWIW, libgomp/team.c suffers from the same problem if you remove the
> stop-gap in gomp_team_start():
>
>    struct gomp_thread_start_data *start_data = NULL;
>
> The block predicated by "if (nested)" is the only place that can use
> start_data without passing through its initialization.  But the only
> way to elide the initialization is from:
>
> if (!nested)
> {
>    ...
>   ....
>    if (i == nthreads)
>      goto do_release;
> }
>
> So the use of start_data either crosses the gomp_alloca
> initialization, or is predicated by if(nested).
It may simply be the case that the uninit analysis gave up or it may 
have failed to simplify its predicates enough to realize the use is 
properly guarded.  These things certainly do happen.

Of course this is all part of why we try so hard to thread jumps to 
aggressively.  Missed jump threads closely correlate with bogus 
uninitialized warnings due to infeasible paths in the CFG.  In my mind 
the predicate analysis we do for uninit is primarily there to catch 
cases that jump threading discovers, but does not optimize due to cost.

Jeff
  
Jeff Law Oct. 6, 2021, 11:06 p.m. UTC | #20
On 10/6/2021 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
> The pending patch I have from Richi fixes this.  Even so, it's the
> uninit code that's confused.
>
> Sigh...every single change to the threading code shines the light on
> some warning bug.
>
> If you take the calls.ii file from the aarch64 bootstrap and break on
> the warning, you can see that the uninitialized use is for
> const_upper_3934 here:
>
>   <bb 102> [local count: 315357954]:
>    # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
>    if (_881 != 0)
>      goto <bb 103>; [50.00%]
>    else
>      goto <bb 106>; [50.00%]
>
>    <bb 103> [local count: 157678977]:
>    if (const_upper_3934 > _6699)
>      goto <bb 105>; [89.00%]
>    else
>      goto <bb 294>; [11.00%]
>
>    <bb 294> [local count: 17344687]:
>
>    <bb 104> [local count: 157678977]:
>    goto <bb 107>; [100.00%]
>
>    <bb 105> [local count: 140334290]:
>    stack_usage_map.481_3930 = stack_usage_map;
>    _6441 = const_upper_3934 - _6699;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> PROBLEMATIC READ HERE
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    _4819 = stack_usage_map.481_3930 + _6699;
>    __builtin_memset (_4819, 1, _6441);
>    goto <bb 104>; [11.00%]
>
> const_upper_3934 could be undefined if it comes from BB101
> (const_upper_3937(D)), but it only gets read for _881 != 0, so it
> shouldn't warn.
>
> I suggest -Wmaybe-uninitialized be turned off for that file until the
> warning is fixed.
>
> And yes, the proposed patch will also cure this, but the underlying
> problem in the warning is still there.
You haven't shown enough context for me to agree or disagree.  Are there 
other preds to bb105?   I hate these slimmed down dumps.  I work better 
with the full pred/succ lists. -fdump-tree-<whatever>-blocks-details :-)

It appears to me that for _881 != 0 we certainly flow into the read of 
_const_upper_3934 in bb103 and bb105.  Why do you think that's safe?

Jeff
  
Aldy Hernandez Oct. 7, 2021, 8:15 a.m. UTC | #21
[Andrew, ranger comment embedded below.]

On 10/7/21 1:06 AM, Jeff Law wrote:
> 
> 
> On 10/6/2021 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
>> The pending patch I have from Richi fixes this.  Even so, it's the
>> uninit code that's confused.
>>
>> Sigh...every single change to the threading code shines the light on
>> some warning bug.
>>
>> If you take the calls.ii file from the aarch64 bootstrap and break on
>> the warning, you can see that the uninitialized use is for
>> const_upper_3934 here:
>>
>>   <bb 102> [local count: 315357954]:
>>    # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
>>    if (_881 != 0)
>>      goto <bb 103>; [50.00%]
>>    else
>>      goto <bb 106>; [50.00%]
>>
>>    <bb 103> [local count: 157678977]:
>>    if (const_upper_3934 > _6699)
>>      goto <bb 105>; [89.00%]
>>    else
>>      goto <bb 294>; [11.00%]
>>
>>    <bb 294> [local count: 17344687]:
>>
>>    <bb 104> [local count: 157678977]:
>>    goto <bb 107>; [100.00%]
>>
>>    <bb 105> [local count: 140334290]:
>>    stack_usage_map.481_3930 = stack_usage_map;
>>    _6441 = const_upper_3934 - _6699;
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> PROBLEMATIC READ HERE
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>    _4819 = stack_usage_map.481_3930 + _6699;
>>    __builtin_memset (_4819, 1, _6441);
>>    goto <bb 104>; [11.00%]
>>
>> const_upper_3934 could be undefined if it comes from BB101
>> (const_upper_3937(D)), but it only gets read for _881 != 0, so it
>> shouldn't warn.
>>
>> I suggest -Wmaybe-uninitialized be turned off for that file until the
>> warning is fixed.
>>
>> And yes, the proposed patch will also cure this, but the underlying
>> problem in the warning is still there.
> You haven't shown enough context for me to agree or disagree.  Are there 
> other preds to bb105?   I hate these slimmed down dumps.  I work better 
> with the full pred/succ lists. -fdump-tree-<whatever>-blocks-details :-)
> 
> It appears to me that for _881 != 0 we certainly flow into the read of 
>> const_upper_3934 in bb103 and bb105.  Why do you think that's safe?

My bad, there's some missing context.

The only way to get to BB101->BB102 is through:

   <bb 100>
   if (_6711 != 0)
     goto <bb 101>; [5.50%]
   else
     goto <bb 293>; [94.50%]

And there's an implicit relation between _6711 and _881:

<bb 86>
...
   if (_6711 != 0)
     goto <bb 287>; [5.50%]
   else
     goto <bb 87>; [94.50%]

   <bb 287> [local count: 17344687]:
   goto <bb 88>; [100.00%]

   <bb 87> [local count: 298013267]:

   <bb 88> [local count: 315357954]:
   # _881 = PHI <1(87), 0(287)>

That is, _6711 == !_881.

[Andrew, it'd be neat if we could teach ranger the relationship between 
_6711 and _881 above.  And also, that _881 is [0,1].  Perhaps with the 
relation oracle, the uninit code could notice that a _6711 guard is also 
a !_881 guard.]

Presumably the threader shuffled things sufficiently so that the above 
relationship was difficult to devise.
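For the record, the pattern that trips -Wmaybe-uninitialized here can be 
modeled outside of GCC with a minimal C sketch.  This is hypothetical 
illustration code, not the actual calls.c source; the names 
`maybe_uninit`, `cond`, `guard` and the constant 42 are made up, with 
`cond` playing the role of _6711 and `guard` that of _881:

```c
/* Sketch of the false-positive pattern: a variable initialized under
   one guard and read under an equivalent-but-differently-spelled
   guard.  */
long
maybe_uninit (long cond)
{
  long upper;                        /* const_upper: set on one arm only.  */
  int guard = (cond != 0) ? 0 : 1;   /* Models _881 = PHI <1(87), 0(287)>.  */

  if (cond == 0)
    upper = 42;                      /* Models _6707 flowing in via bb 293.  */

  if (guard != 0)                    /* guard != 0  <=>  cond == 0.  */
    return upper;                    /* Safe: guard implies upper was set.  */
  return -1;
}
```

Neither path reads `upper` uninitialized, because `guard != 0` can only 
hold when `cond == 0`, which is exactly when `upper` was assigned; 
recognizing that equivalence is what the uninit code is missing.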

Your preferred full context follows ;-).

Aldy

----

Here we see that _881 == 0 when _6711 != 0:

;;   basic block 286, loop depth 1, count 640272214 (estimated locally), 
maybe hot
;;    prev block 85, next block 86, flags: (NEW)
;;    pred:       85 [67.0% (guessed)]  count:640272214 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)
   goto <bb 112>; [100.00%]
;;    succ:       112 [always]  count:640272214 (estimated locally) 
(FALLTHRU)

;;   basic block 86, loop depth 1, count 315357954 (estimated locally), 
maybe hot
;;    prev block 286, next block 287, flags: (NEW, REACHABLE, VISITED)
;;    pred:       85 [33.0% (guessed)]  count:315357953 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
   _4293 = MEM[(long int *)_5338 + 80B];
   _6727 = MEM[(long int *)_5338 + 88B];
   _6715 = MEM[(long int *)_5338 + 32B];
   _4305 = _4293 + _6715;
   _4299 = MEM[(long int *)_5338 + 40B];
   _4300 = _4299 + _6727;
   _6707 = (long unsigned int) _4305;
   _6711 = (long unsigned int) _4300;
   _6699 = (long unsigned int) _4293;
   if (_6711 != 0)
     goto <bb 287>; [5.50%]
   else
     goto <bb 87>; [94.50%]
;;    succ:       287 [5.5% (guessed)]  count:17344687 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;                87 [94.5% (guessed)]  count:298013267 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 287, loop depth 1, count 17344687 (estimated locally), 
maybe hot
;;    prev block 86, next block 87, flags: (NEW)
;;    pred:       86 [5.5% (guessed)]  count:17344687 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
   goto <bb 88>; [100.00%]
;;    succ:       88 [always]  count:17344687 (estimated locally) (FALLTHRU)

;;   basic block 87, loop depth 1, count 298013267 (estimated locally), 
maybe hot
;;    prev block 287, next block 88, flags: (NEW, VISITED)
;;    pred:       86 [94.5% (guessed)]  count:298013267 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)
;;    succ:       88 [always]  count:298013267 (estimated locally) 
(FALLTHRU,EXECUTABLE)

;;   basic block 88, loop depth 1, count 315357954 (estimated locally), 
maybe hot
;;    prev block 87, next block 89, flags: (NEW, REACHABLE, VISITED)
;;    pred:       87 [always]  count:298013267 (estimated locally) 
(FALLTHRU,EXECUTABLE)
;;                287 [always]  count:17344687 (estimated locally) 
(FALLTHRU)
   # const_upper_3863 = PHI <_6707(87), 18446744073709551615(287)>
   # _881 = PHI <1(87), 0(287)>

[snip]
[snip]
[snip]

And here we see that the "undefined" read is guarded by _6711:

;;   basic block 100, loop depth 1, count 315357955 (estimated locally), 
maybe hot
;;    prev block 99, next block 293, flags: (NEW, REACHABLE, VISITED)
;;    pred:       98 [always]  count:94607385 (estimated locally) 
(FALLTHRU,EXECUTABLE)
;;                99 [always]  count:220750569 (estimated locally) 
(FALLTHRU,EXECUTABLE)
   # iftmp.581_254 = PHI <_669(98), _3922(99)>
   MEM[(struct poly_int *)&D.150826] ={v} {CLOBBER};
   _4417 = MEM[(long int *)_5338 + 56B];
   MEM[(struct poly_int *)&D.150826].D.21135.coeffs[0] = _4417;
   _4410 = MEM[(long int *)_5338 + 64B];
   MEM[(struct poly_int *)&D.150826].D.21135.coeffs[1] = _4410;
   _673 = gen_int_mode (D.150826, 16);
   MEM[(struct poly_int *)&D.150832] ={v} {CLOBBER};
   MEM[(struct poly_int *)&D.150832].D.21135.coeffs[0] = 0;
   MEM[(struct poly_int *)&D.150832].D.21135.coeffs[1] = 0;
   emit_push_insn (val_598, mode_597, 0B, 0B, parm_align_601, 
partial_600, reg_599, D.150832, argblock_88, _673, 0, iftmp.581_254, 0);
   D.150832 ={v} {CLOBBER};
   D.150826 ={v} {CLOBBER};
   D.150830 ={v} {CLOBBER};
   D.150828 ={v} {CLOBBER};
   if (_6711 != 0)
     goto <bb 101>; [5.50%]
   else
     goto <bb 293>; [94.50%]
;;    succ:       101 [5.5% (guessed)]  count:17344687 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;                293 [94.5% (guessed)]  count:298013268 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 293, loop depth 1, count 298013268 (estimated locally), 
maybe hot
;;    prev block 100, next block 101, flags: (NEW)
;;    pred:       100 [94.5% (guessed)]  count:298013268 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)
   goto <bb 102>; [100.00%]
;;    succ:       102 [always]  count:298013268 (estimated locally) 
(FALLTHRU)

;;   basic block 101, loop depth 1, count 17344687 (estimated locally), 
maybe hot
;;    prev block 293, next block 102, flags: (NEW, VISITED)
;;    pred:       100 [5.5% (guessed)]  count:17344687 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;    succ:       102 [always]  count:17344687 (estimated locally) 
(FALLTHRU,EXECUTABLE)

;;   basic block 102, loop depth 1, count 315357954 (estimated locally), 
maybe hot
;;    prev block 101, next block 103, flags: (NEW, REACHABLE, VISITED)
;;    pred:       101 [always]  count:17344687 (estimated locally) 
(FALLTHRU,EXECUTABLE)
;;                293 [always]  count:298013268 (estimated locally) 
(FALLTHRU)
   # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
   if (_881 != 0)
     goto <bb 103>; [50.00%]
   else
     goto <bb 106>; [50.00%]
;;    succ:       103 [50.0% (guessed)]  count:157678977 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;                106 [50.0% (guessed)]  count:157678977 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 103, loop depth 1, count 157678977 (estimated locally), 
maybe hot
;;    prev block 102, next block 294, flags: (NEW, REACHABLE, VISITED)
;;    pred:       102 [50.0% (guessed)]  count:157678977 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
   if (const_upper_3934 > _6699)
     goto <bb 105>; [89.00%]
   else
     goto <bb 294>; [11.00%]
;;    succ:       105 [89.0% (guessed)]  count:140334290 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;                294 [11.0% (guessed)]  count:17344687 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 294, loop depth 1, count 17344687 (estimated locally), 
maybe hot
;;    prev block 103, next block 104, flags: (NEW)
;;    pred:       103 [11.0% (guessed)]  count:17344687 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)
;;    succ:       104 [always]  count:17344687 (estimated locally) 
(FALLTHRU)

;;   basic block 104, loop depth 1, count 157678977 (estimated locally), 
maybe hot
;;   Invalid sum of incoming counts 32781459 (estimated locally), should 
be 157678977 (estimated locally)
;;    prev block 294, next block 105, flags: (NEW, VISITED)
;;    pred:       294 [always]  count:17344687 (estimated locally) 
(FALLTHRU)
;;                105 [11.0% (guessed)]  count:15436772 (estimated 
locally) (FALLTHRU,EXECUTABLE)
   goto <bb 107>; [100.00%]
;;    succ:       107 [always]  count:157678977 (estimated locally) 
(FALLTHRU,EXECUTABLE)

;;   basic block 105, loop depth 1, count 140334290 (estimated locally), 
maybe hot
;;    prev block 104, next block 106, flags: (NEW, REACHABLE, VISITED)
;;    pred:       103 [89.0% (guessed)]  count:140334290 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
   stack_usage_map.481_3930 = stack_usage_map;
   _6441 = const_upper_3934 - _6699;
   _4819 = stack_usage_map.481_3930 + _6699;
   __builtin_memset (_4819, 1, _6441);
   goto <bb 104>; [11.00%]
  
Jeff Law Oct. 7, 2021, 1:35 p.m. UTC | #22
On 10/7/2021 2:15 AM, Aldy Hernandez wrote:
> [Andrew, ranger comment embedded below.]
>
> On 10/7/21 1:06 AM, Jeff Law wrote:
>>
>>
>> On 10/6/2021 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
>>> The pending patch I have from Richi fixes this.  Even so, it's the
>>> uninit code that's confused.
>>>
>>> Sigh...every single change to the threading code shines the light on
>>> some warning bug.
>>>
>>> If you take the calls.ii file from the aarch64 bootstrap and break on
>>> the warning, you can see that the uninitialized use is for
>>> const_upper_3934 here:
>>>
>>>   <bb 102> [local count: 315357954]:
>>>    # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
>>>    if (_881 != 0)
>>>      goto <bb 103>; [50.00%]
>>>    else
>>>      goto <bb 106>; [50.00%]
>>>
>>>    <bb 103> [local count: 157678977]:
>>>    if (const_upper_3934 > _6699)
>>>      goto <bb 105>; [89.00%]
>>>    else
>>>      goto <bb 294>; [11.00%]
>>>
>>>    <bb 294> [local count: 17344687]:
>>>
>>>    <bb 104> [local count: 157678977]:
>>>    goto <bb 107>; [100.00%]
>>>
>>>    <bb 105> [local count: 140334290]:
>>>    stack_usage_map.481_3930 = stack_usage_map;
>>>    _6441 = const_upper_3934 - _6699;
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> PROBLEMATIC READ HERE
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>    _4819 = stack_usage_map.481_3930 + _6699;
>>>    __builtin_memset (_4819, 1, _6441);
>>>    goto <bb 104>; [11.00%]
>>>
>>> const_upper_3934 could be undefined if it comes from BB101
>>> (const_upper_3937(D)), but it only gets read for _881 != 0, so it
>>> shouldn't warn.
>>>
>>> I suggest -Wmaybe-uninitialized be turned off for that file until the
>>> warning is fixed.
>>>
>>> And yes, the proposed patch will also cure this, but the underlying
>>> problem in the warning is still there.
>> You haven't shown enough context for me to agree or disagree. Are 
>> there other preds to bb105?   I hate these slimmed down dumps.  I 
>> work better with the full pred/succ lists. 
>> -fdump-tree-<whatever>-blocks-details :-)
>>
>> It appears to me that for _881 != 0 we certainly flow into the read 
>> of const_upper_3934 in bb103 and bb105.  Why do you think that's safe?
>
> My bad, there's some missing context.
>
> The only way to get to BB101->BB102 is through:
>
>   <bb 100>
>   if (_6711 != 0)
>     goto <bb 101>; [5.50%]
>   else
>     goto <bb 293>; [94.50%]
>
> And there's an implicit relation between _6711 and _881:
>
> <bb 86>
> ...
>   if (_6711 != 0)
>     goto <bb 287>; [5.50%]
>   else
>     goto <bb 87>; [94.50%]
>
>   <bb 287> [local count: 17344687]:
>   goto <bb 88>; [100.00%]
>
>   <bb 87> [local count: 298013267]:
>
>   <bb 88> [local count: 315357954]:
>   # _881 = PHI <1(87), 0(287)>
>
> That is, _6711 == !_881.
>
> [Andrew, it'd be neat if we could teach ranger the relationship 
> between _6711 and _881 above.  And also, that _881 is [0,1]. Perhaps 
> with the relation oracle, the uninit code could notice that a _6711 
> guard is also a !_881 guard.]
Ah, yes.  This is one of the most common cases where we fail to detect 
jump threads and/or fail to prune a path in the uninit warnings.  
There are multiple instances of this in BZ, though I don't have the #s 
handy and the testcases are much smaller & easier to understand.  I'm 
pretty sure they're linked to the meta bug for threading or uninit 
warnings.

However, in this instance we may have a way out.   When we record the 
constant boolean equivalence for _881 taking the paths from 87 and 287 
respectively, we could walk up the control dependency graph and derive 
other values such as _6711 in this example.  I'm pretty sure we don't 
have the CDG built for DOM, but I do think we have it in the predicate 
analysis engine, so we could at least prune the warning.
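The derivation Jeff describes can be sketched as a toy, outside of GCC 
internals.  The structs `cond` and `phi` and the function `derive_guard` 
below are hypothetical stand-ins for illustration, not GCC data 
structures: given a PHI merging constants 1/0 from the two arms of 
"if (lhs != 0)", the PHI result pins down a predicate on `lhs`:

```c
enum guard_kind { GUARD_UNKNOWN, GUARD_EQ_ZERO, GUARD_NE_ZERO };

/* Models "if (lhs_ssa != 0) goto then_bb; else goto else_bb;".  */
struct cond { int lhs_ssa; };

/* Models "x = PHI <arg_then(then_bb), arg_else(else_bb)>" where both
   arguments are constants.  */
struct phi { int arg_then; int arg_else; };

/* If the PHI merges 1/0 from the arms of COND, return what "x != 0"
   implies about COND's left-hand side.  */
enum guard_kind
derive_guard (const struct phi *p, const struct cond *c)
{
  (void) c;  /* Only names which SSA the answer is about.  */
  if (p->arg_then == 1 && p->arg_else == 0)
    return GUARD_NE_ZERO;  /* x != 0  <=>  lhs != 0.  */
  if (p->arg_then == 0 && p->arg_else == 1)
    return GUARD_EQ_ZERO;  /* x != 0  <=>  lhs == 0.  */
  return GUARD_UNKNOWN;
}
```

For the dump above, _881 = PHI <1(87), 0(287)> where bb 287 is the 
then-arm of "if (_6711 != 0)", so arg_then = 0 and arg_else = 1: the 
sketch returns GUARD_EQ_ZERO, i.e. a "_881 != 0" guard is also a 
"_6711 == 0" guard.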


Jeff
  
Aldy Hernandez Oct. 14, 2021, 9:25 a.m. UTC | #23
PING.

Note, that there are no PRs and nothing really dependent on this
patch.  This has just been suggested as the right thing to do wrt
loops.

This patch fixes 6 XFAILs in our testsuite and has the added
side-effect of fixing the aarch64 bootstrap problem (though the
problem in the uninit code is still there).

This is a fundamental change in what we've traditionally allowed for
jump threading, but I think it's a good thing.  It also paves the way
for combining the validity models for both the forward and the
backward threaders.

I am happy to field the PRs this may bring about, since every change
in the cost model has us questioning whether we should or shouldn't
thread a path.  But I may need some help from y'all if there's a
missing thread that causes a regression in some other pass.  That
being said, most of the issues that have come with the threader
changes haven't been because we thread less, but because we thread
more-- so perhaps restricting things is a good thing ;-).

Aldy

On Wed, Oct 6, 2021 at 12:22 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>
> [Here go the bits by Richi, tested on x86-64 Linux, and ongoing tests
> on aarch64 and ppc64le.]
>
> There is a lot of fall-out from this patch, as there were many threading
> tests that assumed the restrictions introduced by this patch were valid.
> Some tests have merely shifted the threading to after loop
> optimizations, but others ended up with no threading opportunities at
> all.  Surprisingly some tests ended up with more total threads.  It was
> a crapshoot all around.
>
> On a positive note, there are 6 tests that no longer XFAIL, and one
> guality test which now passes.
>
> I would appreciate someone reviewing the test changes.  I am unsure
> whether some of the tests should be removed, XFAILed, or some other
> thing.
>
> I felt a bit queasy about such a fundamental change wrt threading, so I
> ran it through my callgrind test harness (.ii files from a bootstrap).
> There was no change in overall compilation, DOM, or the VRP threaders.
>
> However, there was a slight increase of 1.63% in the backward threader.
> I'm pretty sure we could reduce this if we incorporated the restrictions
> into their profitability code.  This way we could stop the search when
> we ran into one of these restrictions.  Not sure it's worth it at this
> point.
>
> Note, that this ad-hoc testing is not meant to replace a more thorough
> SPEC, etc test.
>
> Tested on x86-64 Linux.
>
> OK pending tests on aarch64 and ppc64le?
>
> Co-authored-by: Richard Biener <rguenther@suse.de>
  
Jeff Law Oct. 14, 2021, 2:23 p.m. UTC | #24
On 10/14/2021 3:25 AM, Aldy Hernandez wrote:
> PING.
>
> Note, that there are no PRs and nothing really dependent on this
> patch.  This has just been suggested as the right thing to do wrt
> loops.
>
> This patch fixes 6 XFAILs in our testsuite and has the added
> side-effect of fixing the aarch64 bootstrap problem (though the
> problem in the uninit code is still there).
>
> This is a fundamental change in what we've traditionally allowed for
> jump threading, but I think it's a good thing.  It also paves the way
> for combining the validity models for both the forward and the
> backward threaders.
>
> I am happy to field the PRs this may bring about, since every change
> in the cost model has us questioning whether we should or shouldn't
> thread a path.  But I may need some help from y'all if there's a
> missing thread that causes a regression in some other pass.  That
> being said, most of the issues that have come with the threader
> changes haven't been because we thread less, but because we thread
> more-- so perhaps restricting things is a good thing ;-).
Just chasing down fallout from the vectorizer changes first :-)

jeff
  
Jeff Law Oct. 17, 2021, 1:32 a.m. UTC | #25
On 10/6/2021 4:22 AM, Aldy Hernandez wrote:
> [Here go the bits by Richi, tested on x86-64 Linux, and ongoing tests
> on aarch64 and ppc64le.]
>
> There is a lot of fall-out from this patch, as there were many threading
> tests that assumed the restrictions introduced by this patch were valid.
> Some tests have merely shifted the threading to after loop
> optimizations, but others ended up with no threading opportunities at
> all.  Surprisingly some tests ended up with more total threads.  It was
> a crapshoot all around.
>
> On a positive note, there are 6 tests that no longer XFAIL, and one
> guality test which now passes.
>
> I would appreciate someone reviewing the test changes.  I am unsure
> whether some of the tests should be removed, XFAILed, or some other
> thing.
>
> I felt a bit queasy about such a fundamental change wrt threading, so I
> ran it through my callgrind test harness (.ii files from a bootstrap).
> There was no change in overall compilation, DOM, or the VRP threaders.
>
> However, there was a slight increase of 1.63% in the backward threader.
> I'm pretty sure we could reduce this if we incorporated the restrictions
> into their profitability code.  This way we could stop the search when
> we ran into one of these restrictions.  Not sure it's worth it at this
> point.
>
> Note, that this ad-hoc testing is not meant to replace a more thorough
> SPEC, etc test.
>
> Tested on x86-64 Linux.
>
> OK pending tests on aarch64 and ppc64le?
>
> Co-authored-by: Richard Biener <rguenther@suse.de>
>
> 0001-Disallow-loop-rotation-and-loop-header-crossing-in-j.patch
>
>  From 5c0fb55b45733ec38918f5d7a576719da32e4a6c Mon Sep 17 00:00:00 2001
> From: Aldy Hernandez <aldyh@redhat.com>
> Date: Mon, 4 Oct 2021 09:47:02 +0200
> Subject: [PATCH] Disallow loop rotation and loop header crossing in jump
>   threaders.
>
> There is a lot of fall-out from this patch, as there are many threading
> tests that assumed the restrictions introduced by this patch were valid.
> Some tests have merely shifted the threading to after loop
> optimizations, but others ended up with no threading opportunities at
> all.  Surprisingly some tests ended up with more total threads.  It was
> a crapshoot all around.
>
> On a positive note, there are 6 tests that no longer XFAIL, and one
> guality test which now passes.
>
> I would appreciate someone reviewing the test changes.  I am unsure
> whether some of the tests should be removed, XFAILed, or some other
> thing.
>
> I felt a bit queasy about such a fundamental change wrt threading, so I
> ran it through my callgrind test harness (.ii files from a bootstrap).
> There was no change in overall compilation, DOM, or the VRP threaders.
>
> However, there was a slight increase of 1.63% in the backward threader.
> I'm pretty sure we could reduce this if we incorporated the restrictions
> into their profitability code.  This way we could stop the search when
> we ran into one of these restrictions.  Not sure it's worth it at this
> point.
>
> Note, that this ad-hoc testing is not meant to replace a more thorough
> SPEC, etc test.
>
> Tested on x86-64 Linux.
>
> OK pending tests on aarch64 and ppc64le?
>
> Co-authored-by: Richard Biener <rguenther@suse.de>
>
> gcc/ChangeLog:
>
> 	* tree-ssa-threadupdate.c (cancel_thread): Dump threading reason
> 	on the same line as the threading cancellation.
> 	(jt_path_registry::cancel_invalid_paths): Avoid rotating loops.
> 	Avoid threading through loop headers where the path remains in the
> 	loop.
>
> libgomp/ChangeLog:
>
> 	* testsuite/libgomp.graphite/force-parallel-5.c: Remove xfail.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/Warray-bounds-87.c: Remove xfail.
> 	* gcc.dg/analyzer/pr94851-2.c: Remove xfail.
> 	* gcc.dg/graphite/pr69728.c: Remove xfail.
> 	* gcc.dg/graphite/scop-dsyr2k.c: Remove xfail.
> 	* gcc.dg/graphite/scop-dsyrk.c: Remove xfail.
> 	* gcc.dg/shrink-wrap-loop.c: Remove xfail.
> 	* gcc.dg/loop-8.c: Adjust for new threading restrictions.
> 	* gcc.dg/tree-ssa/ifc-20040816-1.c: Same.
> 	* gcc.dg/tree-ssa/pr21559.c: Same.
> 	* gcc.dg/tree-ssa/pr59597.c: Same.
> 	* gcc.dg/tree-ssa/pr71437.c: Same.
> 	* gcc.dg/tree-ssa/pr77445-2.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-2a.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-4.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-6.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Same.
> 	* gcc.dg/vect/bb-slp-16.c: Same.
> 	* gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
So there is some light fallout on our friend visium.  I must say that 
having a port which fails to link if there's a call to abort () left in 
the program is awfully helpful :-)    I reviewed the half-dozen or so 
tests that fail after this change, they all look like any jump threads 
are for loop rotation.  So I'm inclined to ignore those and just 
generate new baselines for the visium port.

Some additional thoughts on the tests below.  I didn't see anything 
that's worth an objection, but at least in a couple cases there's ways 
to save the test.  In others I might have made a slightly different 
decision, but I can live with yours.

I think once we reach a consensus on the tests, this will be good to go.


> diff --git a/gcc/testsuite/gcc.dg/loop-8.c b/gcc/testsuite/gcc.dg/loop-8.c
> index 90ea1c45524..66318fc08dc 100644
> --- a/gcc/testsuite/gcc.dg/loop-8.c
> +++ b/gcc/testsuite/gcc.dg/loop-8.c
> @@ -24,5 +24,9 @@ f (int *a, int *b)
>   
>   /* Load of 42 is moved out of the loop, introducing a new pseudo register.  */
>   /* { dg-final { scan-rtl-dump-times "Decided" 1 "loop2_invariant" } } */
> -/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" } } */
> +
> +
> +/* ?? The expected behavior below depends on threading the 2->3->5 path
> +   in DOM2, but this is invalid since it would rotate the loop.  */
> +/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" { xfail *-*-* } } } */
So maybe the thing to do here since I guess we want to keep the test 
would be to manually rotate the loop in the source.  In theory that 
should restore the test to validating what we want it to validate 
(specifically the behavior of LICM).

> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
> index 0246ebf3c63..f83cefd8d89 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
> @@ -22,6 +22,8 @@
>   
>      All the cases are picked up by VRP1 as jump threads.  */
>   
> -/* There used to be 6 jump threads found by thread1, but they all
> -   depended on threading through distinct loops in ethread.  */
> -/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp-thread1" } } */
> +/* This test should be obsoleted.  We used to catch 2 total threads in
> +   vrp-thread1, but after adding loop rotating restrictions, we get
> +   none.  Interestingly, on x86-64 we now get 1 in DOM2, 5 in DOM3,
> +   and 1 in vrp-thread2.  */
> +/* { dg-final { scan-tree-dump-not "Threaded" "vrp-thread1" } } */
I think that testing nothing was threaded in vrp1 is probably best. 
Though I wouldn't lose any sleep if this just went away.

> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
> index 8f0a12c12ee..68808bd09fc 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
> @@ -1,10 +1,9 @@
>   /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-vrp-thread1-stats -fdump-tree-dom2-stats" } */
> +/* { dg-options "-O2 -fdump-statistics" } */
>   
>   void bla();
>   
> -/* In the following case, we should be able to thread edge through
> -   the loop header.  */
> +/* No one should thread through the loop header.  */
>   
>   void thread_entry_through_header (void)
>   {
> @@ -14,8 +13,4 @@ void thread_entry_through_header (void)
>       bla ();
>   }
>   
> -/* There's a single jump thread that should be handled by the VRP
> -   jump threading pass.  */
> -/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "vrp-thread1"} } */
> -/* { dg-final { scan-tree-dump-times "Jumps threaded: 2" 0 "vrp-thread1"} } */
> -/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */
> +/* { dg-final { scan-tree-dump-not "Jumps threaded" "statistics"} } */
Similarly.

> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
> index b0a7d423475..24de9d57d50 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
> @@ -1,8 +1,12 @@
>   /* { dg-do compile } */
>   /* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */
>   
> -/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
> -/* { dg-final { scan-tree-dump-times "Registering jump" 1 "thread3" } } */
> +/* ?? We should obsolete this test.  All the threads we used to get
> +   in thread1 and thread3 cross the loop header but do not exit the
> +   loop, so they have been deemed invalid.  */
> +
> +/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread1" } } */
> +/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread3" } } */
>   
>   int sum0, sum1, sum2, sum3;
>   int foo (char *s, char **ret)
Similarly.

> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
> index e68a9b62535..fc3adab3fc3 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
> @@ -65,5 +65,9 @@ int main (void)
>     return 0;
>   }
>   
> -/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" } } */
> +/* ?? The check below depends on jump threading.  There are now a
> +   couple of threaded paths that are being rejected because they either
> +   rotate a loop or cross the loop header without exiting the
> +   loop.  */
> +/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" { xfail *-*-* } } } */
So we could manually rotate the loop in the source which in turn would 
allow this test to continue to validate SLP's behavior.
  
Aldy Hernandez Oct. 18, 2021, 10:14 a.m. UTC | #26
On 10/17/21 3:32 AM, Jeff Law wrote:

> I think once we reach a consensus on the tests, this will be good to go.
> 
> 
>> diff --git a/gcc/testsuite/gcc.dg/loop-8.c b/gcc/testsuite/gcc.dg/loop-8.c
>> index 90ea1c45524..66318fc08dc 100644
>> --- a/gcc/testsuite/gcc.dg/loop-8.c
>> +++ b/gcc/testsuite/gcc.dg/loop-8.c
>> @@ -24,5 +24,9 @@ f (int *a, int *b)
>>   
>>   /* Load of 42 is moved out of the loop, introducing a new pseudo register.  */
>>   /* { dg-final { scan-rtl-dump-times "Decided" 1 "loop2_invariant" } } */
>> -/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" } } */
>> +
>> +
>> +/* ?? The expected behavior below depends on threading the 2->3->5 path
>> +   in DOM2, but this is invalid since it would rotate the loop.  */
>> +/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" { xfail *-*-* } } } */
> So maybe the thing to do here since I guess we want to keep the test 
> would be to manually rotate the loop in the source.  In theory that 
> should restore the test to validating what we want it to validate 
> (specifically the behavior of LICM).

I have rotated the loop.  This fixes the xfail I introduced, but the 
"Decided"  test fails.  I've removed that instead.  Let me know if this 
is OK.

> 
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>> index 0246ebf3c63..f83cefd8d89 100644
>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>> @@ -22,6 +22,8 @@
>>   
>>      All the cases are picked up by VRP1 as jump threads.  */
>>   
>> -/* There used to be 6 jump threads found by thread1, but they all
>> -   depended on threading through distinct loops in ethread.  */
>> -/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp-thread1" } } */
>> +/* This test should be obsoleted.  We used to catch 2 total threads in
>> +   vrp-thread1, but after adding loop rotating restrictions, we get
>> +   none.  Interestingly, on x86-64 we now get 1 in DOM2, 5 in DOM3,
>> +   and 1 in vrp-thread2.  */
>> +/* { dg-final { scan-tree-dump-not "Threaded" "vrp-thread1" } } */
> I think that testing nothing was threaded in vrp1 is probably best. 
> Though I wouldn't lose any sleep if this just went away.

I've opted to remove these tests, since I'm testing the exact behavior 
we're disallowing in the gimple FE tests I've included in this patch.

> 
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
>> index 8f0a12c12ee..68808bd09fc 100644
>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
>> @@ -1,10 +1,9 @@
>>   /* { dg-do compile } */
>> -/* { dg-options "-O2 -fdump-tree-vrp-thread1-stats -fdump-tree-dom2-stats" } */
>> +/* { dg-options "-O2 -fdump-statistics" } */
>>   
>>   void bla();
>>   
>> -/* In the following case, we should be able to thread edge through
>> -   the loop header.  */
>> +/* No one should thread through the loop header.  */
>>   
>>   void thread_entry_through_header (void)
>>   {
>> @@ -14,8 +13,4 @@ void thread_entry_through_header (void)
>>       bla ();
>>   }
>>   
>> -/* There's a single jump thread that should be handled by the VRP
>> -   jump threading pass.  */
>> -/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "vrp-thread1"} } */
>> -/* { dg-final { scan-tree-dump-times "Jumps threaded: 2" 0 "vrp-thread1"} } */
>> -/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */
>> +/* { dg-final { scan-tree-dump-not "Jumps threaded" "statistics"} } */
> Similarly.

Same.

> 
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
>> index b0a7d423475..24de9d57d50 100644
>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
>> @@ -1,8 +1,12 @@
>>   /* { dg-do compile } */
>>   /* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */
>>   
>> -/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
>> -/* { dg-final { scan-tree-dump-times "Registering jump" 1 "thread3" } } */
>> +/* ?? We should obsolete this test.  All the threads we used to get
>> +   in thread1 and thread3 cross the loop header but do not exit the
>> +   loop, so they have been deemed invalid.  */
>> +
>> +/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread1" } } */
>> +/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread3" } } */
>>   
>>   int sum0, sum1, sum2, sum3;
>>   int foo (char *s, char **ret)
> Similarly.

Same.

> 
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>> index e68a9b62535..fc3adab3fc3 100644
>> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>> @@ -65,5 +65,9 @@ int main (void)
>>     return 0;
>>   }
>>   
>> -/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" } } */
>> +/* ?? The check below depends on jump threading.  There are now a
>> +   couple of threaded paths that are being rejected because they either
>> +   rotate a loop or cross the loop header without exiting the
>> +   loop.  */
>> +/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" { xfail *-*-* } } } */
> So we could manually rotate the loop in the source which in turn would 
> allow this test to continue to validate SLP's behavior.

That did the trick.  Thanks.

OK?

Aldy
  

Patch

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
new file mode 100644
index 00000000000..bd56a62a4b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
@@ -0,0 +1,102 @@ 
+// { dg-do compile }
+// { dg-options "-O2 -fgimple -fdump-statistics" }
+//
+// This is a collection of seemingly threadable paths that should not be allowed.
+
+void foobar (int);
+
+// Possible thread from 2->4->3, but it would rotate the loop.
+void __GIMPLE (ssa)
+f1 ()
+{
+  int i;
+
+  // Pre-header.
+  __BB(2):
+  goto __BB4;
+
+  // Latch.
+  __BB(3):
+  foobar (i_1);
+  i_5 = i_1 + 1;
+  goto __BB4;
+
+  __BB(4,loop_header(1)):
+  i_1 = __PHI (__BB2: 0, __BB3: i_5);
+  if (i_1 != 101)
+    goto __BB3;
+  else
+    goto __BB5;
+
+  __BB(5):
+  return;
+
+}
+
+// Possible thread from 2->3->5 but threading through the empty latch
+// would create a non-empty latch.
+void __GIMPLE (ssa)
+f2 ()
+{
+  int i;
+
+  // Pre-header.
+  __BB(2):
+  goto __BB3;
+
+  __BB(3,loop_header(1)):
+  i_8 = __PHI (__BB5: i_5, __BB2: 0);
+  foobar (i_8);
+  i_5 = i_8 + 1;
+  if (i_5 != 256)
+    goto __BB5;
+  else
+    goto __BB4;
+
+  // Latch.
+  __BB(5):
+  goto __BB3;
+
+  __BB(4):
+  return;
+
+}
+
+// Possible thread from 3->5->6->3, but this would cross the loop
+// header without exiting the loop.
+int __GIMPLE (ssa)
+f3 (int a)
+{
+  int i;
+
+  __BB(2):
+  goto __BB6;
+
+  __BB(3):
+  if (i_1 != 0)
+    goto __BB4;
+  else
+    goto __BB5;
+
+  __BB(4):
+  foobar (5);
+  goto __BB5;
+
+  // Latch.
+  __BB(5):
+  i_7 = i_1 + 1;
+  goto __BB6;
+
+  __BB(6,loop_header(1)):
+  i_1 = __PHI (__BB2: 1, __BB5: i_7);
+  if (i_1 <= 99)
+    goto __BB3;
+  else
+    goto __BB7;
+
+  __BB(7):
+  return;
+
+}
+
+// { dg-final { scan-tree-dump-not "Jumps threaded" "statistics" } }
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c
new file mode 100644
index 00000000000..e10e877d01a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c
@@ -0,0 +1,57 @@ 
+// { dg-do compile }
+// { dg-options "-O2 -fgimple -fdump-statistics" }
+//
+// This is a collection of threadable paths.  To simplify maintenance,
+// there should only be one threadable path per function.
+
+int global;
+
+// **FIXME: This currently fails.**
+//
+// There is a path crossing loops through 3->4->5.
+//
+// Jeff has stipulated that...
+//
+// This might be an exception since this goes from inside the loop to
+// outside the loop without entering another loop.  That is, we have
+// found an early exit from the loop that doesn't require us to go
+// through the latch.  In fact, the whole threaded path is logically
+// outside the loop.  So the primary effect is to reduce the number of
+// branches necessary when we exit the loop via this path.
+//
+// This has nice secondary effects.  For example, objects on the
+// threaded path will no longer necessarily be live throughout the
+// loop -- so we can get register allocation improvements.  The
+// threaded path can physically move outside the loop resulting in
+// better icache efficiency, etc.
+//
+// The question we'd have to answer is whether or not exposing this
+// alternate exit path mucks up other loop optimizations, but if we
+// restrict to after the loop optimizer is done, then that's a
+// non-issue.
+int __GIMPLE (ssa)
+f1 (int x)
+{
+  int a;
+
+  __BB(2):
+  a_4 = ~x_3(D);
+  goto __BB4;
+
+  // Latch.
+  __BB(3):
+  global = a_1;
+  goto __BB4;
+
+  __BB(4,loop_header(1)):
+  a_1 = __PHI (__BB2: a_4, __BB3: 0);
+  if (a_1 != 0)
+    goto __BB3;
+  else
+    goto __BB5;
+
+  __BB(5):
+  return;
+}
+
+// { dg-final { scan-tree-dump "Jumps threaded: 1" "statistics" { xfail *-*-* } } }
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index dcabfdb30d2..414ebabe895 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2766,10 +2766,12 @@  bool
 jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 {
   gcc_checking_assert (!path.is_empty ());
-  edge taken_edge = path[path.length () - 1]->e;
-  loop_p loop = taken_edge->src->loop_father;
+  edge entry = path[0]->e;
+  edge exit = path[path.length () - 1]->e;
+  loop_p loop = exit->src->loop_father;
   bool seen_latch = false;
   bool path_crosses_loops = false;
+  bool path_crosses_loop_header = false;
 
   for (unsigned int i = 0; i < path.length (); i++)
     {
@@ -2793,6 +2795,14 @@  jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 	  || e->dest->loop_father != loop)
 	path_crosses_loops = true;
 
+      // ?? Avoid threading through a loop header when the path
+      // remains inside the loop, as such threadings tend to create
+      // sub-loops which _might_ be OK ??.
+      if (e->dest->loop_father->header == e->dest
+	  && !flow_loop_nested_p (exit->dest->loop_father,
+				  e->dest->loop_father))
+	path_crosses_loop_header = true;
+
       if (flag_checking && !m_backedge_threads)
 	gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
     }
@@ -2811,6 +2821,21 @@  jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
       cancel_thread (&path, "Path crosses loops");
       return true;
     }
+  // The path should either start and end in the same loop, or exit
+  // the loop it starts in without entering another loop.  This also
+  // catches the creation of irreducible loops, not just rotation.
+  if (entry->src->loop_father != exit->dest->loop_father
+      && !flow_loop_nested_p (exit->src->loop_father,
+			      entry->dest->loop_father))
+    {
+      cancel_thread (&path, "Path rotates loop");
+      return true;
+    }
+  if (path_crosses_loop_header)
+    {
+      cancel_thread (&path, "Path crosses loop header but does not exit it");
+      return true;
+    }
   return false;
 }