[RFC] Remove VRP threader passes in exchange for better threading pre-VRP.

Message ID 20211018134117.410106-1-aldyh@redhat.com
State New
Series [RFC] Remove VRP threader passes in exchange for better threading pre-VRP.

Commit Message

Aldy Hernandez Oct. 18, 2021, 1:41 p.m. UTC
  The jump threading bits seem to have stabilized.  The one or two open
PRs can be fixed by the pending restrictions on threading through loop
rotations and loop headers.  With all the pieces in play, we can
finally explore altering the pipeline to reduce the number of jump
threading passes.

I know the jump threaders have become a confusing mess, and I have
added to this spaghetti, but please bear with me; the goal is to
reduce the threaders to one code base.

As a quick bird's-eye view, we have 2 engines:

1. The new backward threader using a path solver based on the ranger.
It can work in two modes: a quick mode that assumes any SSA name
outside the path is VARYING, and a fully resolving mode.

All the *.thread passes are running with this engine, but in quick
mode.

2. The old forward threader used by VRP and DOM.  The DOM client
uses its internal structures as well as evrp to resolve conditionals,
whereas the VRP threader uses the engine in #1 in fully resolving
mode.

The VRP threaders are running with this engine, but using the solver
from #1.  That is, the VRP threaders use the old forward threader for
path discovery, but the new backward threader to solve candidate
paths (the hybrid threader).  This was always a stop-gap while we
moved everyone to #1.

The DOM threader is the last remaining threader with no changes
whatsoever from the previous release.  It uses the forward threader,
with all the evrp + DOM goo.

It doesn't matter if you're all confused; we're about to make things
simpler.

I've been experimenting with reducing the total number of threading
passes, and I'd like to see if there's consensus/stomach for altering
the pipeline.  Note that the goal is to remove forward threader clients,
not the other way around.  So, we should prefer to remove a VRP threader
instance over a *.thread one immediately before VRP.

After some playing, it looks like if we enable fully-resolving mode in
the *.thread passes immediately preceding VRP, we can remove the VRP
threading passes altogether, thus removing 2 threading passes (and
forward threading passes at that!).

The numbers look really good.  We get 6874 more jump threads over my
bootstrap .ii files, for a total increase of 3.74%.  And we get that
while running marginally faster (0.19% faster, so noise).

The details are:

*** Mainline (with the loop rotation patch):
         ethread:     64722
         dom:         31246
         thread:      73709
         vrp-thread:  14357
         total:      184034

*** Removing all the VRP threaders:
         ethread:     64722
         thread-full: 76493
         dom:         33648
         thread:      16045
         total:      190908

Notice that not only do we get a lot more threads in thread-full
(resolving mode), but even DOM can get more jump threads.

This doesn't come without risks, though.  The main issue is that we
would be replacing one engine (the forward threader) with another one
(the backward threader).  But the good news is that (a) we've been
using the new backward threader for a while now, and (b) even the VRP
threader in mainline is using the backward threader solver.  So all
that would really change is swapping the path discovery bits and
custom copier of the forward threader for the path discovery bits and
generic copier of the backward threader.

I personally don't think this is a big risk, because we've done all
the hard work already and it's all being stressed in one way or another.

The untested patch below is all that would need to happen, albeit
with copious changes to tests.

I'd like to see where we all stand on this before I start chugging
away at testing and other time-consuming tasks.

Note that all the relevant bits will still be tested in this release,
so I'm not gonna cry one way or another.  But it'd be nice to start
reducing passes, especially if we get a 3.74% increase in jump threads
for no time penalty.

Finally, even if we all agree, I think we should give ourselves a week after the
loop rotation restrictions go in, because threading changes always cause
a party of unexpected things to happen.

Shoot!
  

Comments

Aldy Hernandez Oct. 18, 2021, 2:03 p.m. UTC | #1
On 10/18/21 3:41 PM, Aldy Hernandez wrote:

> I've been experimenting with reducing the total number of threading
> passes, and I'd like to see if there's consensus/stomach for altering
> the pipeline.  Note that the goal is to remove forward threader clients,
> not the other way around.  So, we should prefer to remove a VRP threader
> instance over a *.thread one immediately before VRP.
> 
> After some playing, it looks like if we enable fully-resolving mode in
> the *.thread passes immediately preceding VRP, we can remove the VRP
> threading passes altogether, thus removing 2 threading passes (and
> forward threading passes at that!).

It occurs to me that we could also remove the threading passes before
VRP, and enable a fully resolving backward threader after VRP.  I
haven't played with this scenario, but it should be just as good.  That
being said, I don't know the intricacies of why we had both pre- and
post-VRP threading passes, and whether one is inherently better than
the other.

Aldy
  
Richard Biener Oct. 19, 2021, 6:52 a.m. UTC | #2
On Mon, Oct 18, 2021 at 4:03 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>
>
>
> On 10/18/21 3:41 PM, Aldy Hernandez wrote:
>
> > I've been experimenting with reducing the total number of threading
> > passes, and I'd like to see if there's consensus/stomach for altering
> > the pipeline.  Note that the goal is to remove forward threader clients,
> > not the other way around.  So, we should prefer to remove a VRP threader
> > instance over a *.thread one immediately before VRP.
> >
> > After some playing, it looks like if we enable fully-resolving mode in
> > the *.thread passes immediately preceding VRP, we can remove the VRP
> > threading passes altogether, thus removing 2 threading passes (and
> > forward threading passes at that!).
>
> It occurs to me that we could also remove the threading before VRP
> passes, and enable a fully-resolving backward threader after VRP.  I
> haven't played with this scenario, but it should be just as good.  That
> being said, I don't know the intricacies of why we had both pre and post
> VRP threading passes, and if one is ideally better than the other.

It was done because they were different threaders.  Since the new threader
uses built-in VRP it shouldn't really matter whether it's before or after
VRP _for the threading_, but it might be that if threading runs before VRP
then VRP itself can do a better job on cleaning up the IL.

+      /* ?? Is this still needed.  ?? */
       /* Threading can leave many const/copy propagations in the IL.
         Clean them up.  Instead of just copy_prop, we use ccp to
         compute alignment and nonzero bits.  */

Yes, it's still needed but not for the stated reason - the VRP
substitution and folding stage should deal with copy/constant propagation
but we replaced the former copy propagation with CCP to re-compute
nonzero bits & alignment so I'd change the comment to

   /* Run CCP to compute alignment and nonzero bits.  */

Richard.

> Aldy
>
  
Aldy Hernandez Oct. 19, 2021, 7:33 a.m. UTC | #3
On Tue, Oct 19, 2021 at 8:52 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Mon, Oct 18, 2021 at 4:03 PM Aldy Hernandez <aldyh@redhat.com> wrote:
> >
> >
> >
> > On 10/18/21 3:41 PM, Aldy Hernandez wrote:
> >
> > > I've been experimenting with reducing the total number of threading
> > > passes, and I'd like to see if there's consensus/stomach for altering
> > > the pipeline.  Note that the goal is to remove forward threader clients,
> > > not the other way around.  So, we should prefer to remove a VRP threader
> > > instance over a *.thread one immediately before VRP.
> > >
> > > After some playing, it looks like if we enable fully-resolving mode in
> > > the *.thread passes immediately preceding VRP, we can remove the VRP
> > > threading passes altogether, thus removing 2 threading passes (and
> > > forward threading passes at that!).
> >
> > It occurs to me that we could also remove the threading before VRP
> > passes, and enable a fully-resolving backward threader after VRP.  I
> > haven't played with this scenario, but it should be just as good.  That
> > being said, I don't know the intricacies of why we had both pre and post
> > VRP threading passes, and if one is ideally better than the other.
>
> It was done because they were different threaders.  Since the new threader
> uses built-in VRP it shouldn't really matter whether it's before or after
> VRP _for the threading_, but it might be that if threading runs before VRP
> then VRP itself can do a better job on cleaning up the IL.

Good point.

FWIW, earlier this season I played with replacing the VRPs with evrp
instances (which fold far more conditionals) and I found that the
threaders actually find FEWER opportunities after *vrp folds things
away.  I don't know if this is a good or a bad thing.  Perhaps we
should benchmark three alternatives:

1. Mainline
2. Fully resolving threader -> VRP -> No threading.
3. No threading -> VRP -> Full resolving threader.

...and see what the actual effect is, regardless of the number of threaded paths.

Speaking of which, what's the blessed way of benchmarking performance
nowadays?  I've seen some PRs fly by that measure more lightweight
(open source?) benchmarks than a full-blown SPEC run.

>
> +      /* ?? Is this still needed.  ?? */
>        /* Threading can leave many const/copy propagations in the IL.
>          Clean them up.  Instead of just copy_prop, we use ccp to
>          compute alignment and nonzero bits.  */
>
> Yes, it's still needed but not for the stated reason - the VRP
> substitution and folding stage should deal with copy/constant propagation
> but we replaced the former copy propagation with CCP to re-compute
> nonzero bits & alignment so I'd change the comment to
>
>    /* Run CCP to compute alignment and nonzero bits.  */

Ahh..

There's another similar comment after DOM.  Is this comment still relevant?

      NEXT_PASS (pass_dominator, true /* may_peel_loop_headers_p */);
      /* Threading can leave many const/copy propagations in the IL.
	 Clean them up.  Failure to do so well can lead to false
	 positives from warnings for erroneous code.  */
      NEXT_PASS (pass_copy_prop);
      /* Identify paths that should never be executed in a conforming
	 program and isolate those paths.  */
      /* Identify paths that should never be executed in a conforming
     program and isolate those paths.  */

Thanks.
Aldy
  
Richard Biener Oct. 19, 2021, 8:40 a.m. UTC | #4
On Tue, Oct 19, 2021 at 9:33 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>
> On Tue, Oct 19, 2021 at 8:52 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Oct 18, 2021 at 4:03 PM Aldy Hernandez <aldyh@redhat.com> wrote:
> > >
> > >
> > >
> > > On 10/18/21 3:41 PM, Aldy Hernandez wrote:
> > >
> > > > I've been experimenting with reducing the total number of threading
> > > > passes, and I'd like to see if there's consensus/stomach for altering
> > > > the pipeline.  Note that the goal is to remove forward threader clients,
> > > > not the other way around.  So, we should prefer to remove a VRP threader
> > > > instance over a *.thread one immediately before VRP.
> > > >
> > > > After some playing, it looks like if we enable fully-resolving mode in
> > > > the *.thread passes immediately preceding VRP, we can remove the VRP
> > > > threading passes altogether, thus removing 2 threading passes (and
> > > > forward threading passes at that!).
> > >
> > > It occurs to me that we could also remove the threading before VRP
> > > passes, and enable a fully-resolving backward threader after VRP.  I
> > > haven't played with this scenario, but it should be just as good.  That
> > > being said, I don't know the intricacies of why we had both pre and post
> > > VRP threading passes, and if one is ideally better than the other.
> >
> > It was done because they were different threaders.  Since the new threader
> > uses built-in VRP it shouldn't really matter whether it's before or after
> > VRP _for the threading_, but it might be that if threading runs before VRP
> > then VRP itself can do a better job on cleaning up the IL.
>
> Good point.
>
> FWIW, earlier this season I played with replacing the VRPs with evrp
> instances (which fold far more conditionals) and I found that the
> threaders can actually find LESS opportunities after *vrp fold away
> things.  I don't know if this is a good or a bad thing.

Probably a sign that either threading threads stuff that's pointless
(does not consider conditions on the path that always evaluate to false?)
or that after VRP removes redundant conditions the blocks are now
bigger and we run into some --param limit (which would suggest the
limits behave oddly if it would thread when we'd artificially split blocks).

Examples might be interesting to look at to understand what's going "wrong".

> Perhaps we
> should benchmark three alternatives:
>
> 1. Mainline
> 2. Fully resolving threader -> VRP -> No threading.
> 3. No threading -> VRP -> Full resolving threader.
>
> ...and see what the actual effect is, regardless of number of threaded paths.

As said, only 2. makes "sense" to me unless examples show why we really
have the usual pass ordering issue.  As said, I think threading exposes new
VRP (esp. constant/copy prop) opportunities but VRP shouldn't expose new
threading opportunities.

> Speaking of which, what's the blessed way of benchmarking performance
> nowadays?  I've seen some PRs fly that measure some more lightweight
> benchmarks (open source?) than a full blown SPEC.

The answer is SPEC.  The other answer would be GCC bootstrap time
and resulting code size.

> >
> > +      /* ?? Is this still needed.  ?? */
> >        /* Threading can leave many const/copy propagations in the IL.
> >          Clean them up.  Instead of just copy_prop, we use ccp to
> >          compute alignment and nonzero bits.  */
> >
> > Yes, it's still needed but not for the stated reason - the VRP
> > substitution and folding stage should deal with copy/constant propagation
> > but we replaced the former copy propagation with CCP to re-compute
> > nonzero bits & alignment so I'd change the comment to
> >
> >    /* Run CCP to compute alignment and nonzero bits.  */
>
> Ahh..
>
> There's another similar comment after DOM.  Is this comment still relevant?

Yes, since DOM still does threading the threaded paths lack const/copy
propagation.

>       NEXT_PASS (pass_dominator, true /* may_peel_loop_headers_p */);
>       /* Threading can leave many const/copy propagations in the IL.
>      Clean them up.  Failure to do so well can lead to false
>      positives from warnings for erroneous code.  */
>       NEXT_PASS (pass_copy_prop);
>       /* Identify paths that should never be executed in a conforming
>      program and isolate those paths.  */
>
> Thanks.
> Aldy
>
  
Aldy Hernandez Oct. 19, 2021, 9:06 a.m. UTC | #5
On 10/19/21 10:40 AM, Richard Biener wrote:
> On Tue, Oct 19, 2021 at 9:33 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>>
>> On Tue, Oct 19, 2021 at 8:52 AM Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>>
>>> On Mon, Oct 18, 2021 at 4:03 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 10/18/21 3:41 PM, Aldy Hernandez wrote:
>>>>
>>>>> I've been experimenting with reducing the total number of threading
>>>>> passes, and I'd like to see if there's consensus/stomach for altering
>>>>> the pipeline.  Note that the goal is to remove forward threader clients,
>>>>> not the other way around.  So, we should prefer to remove a VRP threader
>>>>> instance over a *.thread one immediately before VRP.
>>>>>
>>>>> After some playing, it looks like if we enable fully-resolving mode in
>>>>> the *.thread passes immediately preceding VRP, we can remove the VRP
>>>>> threading passes altogether, thus removing 2 threading passes (and
>>>>> forward threading passes at that!).
>>>>
>>>> It occurs to me that we could also remove the threading before VRP
>>>> passes, and enable a fully-resolving backward threader after VRP.  I
>>>> haven't played with this scenario, but it should be just as good.  That
>>>> being said, I don't know the intricacies of why we had both pre and post
>>>> VRP threading passes, and if one is ideally better than the other.
>>>
>>> It was done because they were different threaders.  Since the new threader
>>> uses built-in VRP it shouldn't really matter whether it's before or after
>>> VRP _for the threading_, but it might be that if threading runs before VRP
>>> then VRP itself can do a better job on cleaning up the IL.
>>
>> Good point.
>>
>> FWIW, earlier this season I played with replacing the VRPs with evrp
>> instances (which fold far more conditionals) and I found that the
>> threaders can actually find LESS opportunities after *vrp fold away
>> things.  I don't know if this is a good or a bad thing.
> 
> Probably a sign that either threading threads stuff that's pointless
> (does not consider conditions on the path that always evaluate false?)

At least in the backward threader, we don't keep looking back if we can 
resolve the conditional at the end of an in-progress path, so it's 
certainly possible we thread paths that are unreachable.  I'm pretty 
sure that's also possible in the forward threader.

For example, if we have a candidate path that ends in x > 1234 and we
know on entry to the path that x is [2000,3000], there's no need to 
chase further back to see if the path itself is reachable.

> or that after VRP and removing redundant conditions blocks are now
> bigger and we run into some --param limit (that would suggest the
> limits behave odd if it would thread when we'd artificially split blocks).
> 
> Examples might be interesting to look at to understand what's going "wrong".
> 
>> Perhaps we
>> should benchmark three alternatives:
>>
>> 1. Mainline
>> 2. Fully resolving threader -> VRP -> No threading.
>> 3. No threading -> VRP -> Full resolving threader.
>>
>> ...and see what the actual effect is, regardless of number of threaded paths.
> 
> As said, only 2. makes "sense" to me unless examples show why we really
> have the usual pass ordering issue.  As said, I think threading exposes new
> VRP (esp. constant/copy prop) opportunities but VRP shouldn't expose new
> threading opportunities.

Excellent!  This saves me time, as I've mostly played with option #2 ;-).

Thanks.
Aldy
  
Aldy Hernandez Oct. 19, 2021, 9:16 a.m. UTC | #6
On Tue, Oct 19, 2021 at 11:06 AM Aldy Hernandez <aldyh@redhat.com> wrote:
>
>
>
> On 10/19/21 10:40 AM, Richard Biener wrote:
> > On Tue, Oct 19, 2021 at 9:33 AM Aldy Hernandez <aldyh@redhat.com> wrote:
> >>
> >> On Tue, Oct 19, 2021 at 8:52 AM Richard Biener
> >> <richard.guenther@gmail.com> wrote:
> >>>
> >>> On Mon, Oct 18, 2021 at 4:03 PM Aldy Hernandez <aldyh@redhat.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/18/21 3:41 PM, Aldy Hernandez wrote:
> >>>>
> >>>>> I've been experimenting with reducing the total number of threading
> >>>>> passes, and I'd like to see if there's consensus/stomach for altering
> >>>>> the pipeline.  Note that the goal is to remove forward threader clients,
> >>>>> not the other way around.  So, we should prefer to remove a VRP threader
> >>>>> instance over a *.thread one immediately before VRP.
> >>>>>
> >>>>> After some playing, it looks like if we enable fully-resolving mode in
> >>>>> the *.thread passes immediately preceding VRP, we can remove the VRP
> >>>>> threading passes altogether, thus removing 2 threading passes (and
> >>>>> forward threading passes at that!).
> >>>>
> >>>> It occurs to me that we could also remove the threading before VRP
> >>>> passes, and enable a fully-resolving backward threader after VRP.  I
> >>>> haven't played with this scenario, but it should be just as good.  That
> >>>> being said, I don't know the intricacies of why we had both pre and post
> >>>> VRP threading passes, and if one is ideally better than the other.
> >>>
> >>> It was done because they were different threaders.  Since the new threader
> >>> uses built-in VRP it shouldn't really matter whether it's before or after
> >>> VRP _for the threading_, but it might be that if threading runs before VRP
> >>> then VRP itself can do a better job on cleaning up the IL.
> >>
> >> Good point.
> >>
> >> FWIW, earlier this season I played with replacing the VRPs with evrp
> >> instances (which fold far more conditionals) and I found that the
> >> threaders can actually find LESS opportunities after *vrp fold away
> >> things.  I don't know if this is a good or a bad thing.
> >
> > Probably a sign that either threading threads stuff that's pointless
> > (does not consider conditions on the path that always evaluate false?)
>
> At least in the backward threader, we don't keep looking back if we can
> resolve the conditional at the end of an in-progress path, so it's
> certainly possible we thread paths that are unreachable.  I'm pretty
> sure that's also possible in the forward threader.
>
> For example, if we have a candidate path that ends in x > 1234 and we
> know on entry to the path that x is [2000,3000], there's no need to
> chase further back to see if the path itself is reachable.

For that matter, when I was working on replacing the DOM threader, I
found out that the forward threader + evrp routinely tried to thread
paths that were unreachable, and I had to trim them from my comparison
tally.  The new backward threader engine suffers less from this,
because if there is an UNDEFINED range as part of the in-path
calculation, we can trim the path as unreachable (and avoid further
searches in that direction).  However, as I said, if the range is
known on entry, we do no further lookups and happily thread away.

Aldy
  
Jeff Law Oct. 19, 2021, 10:58 p.m. UTC | #7
On 10/18/2021 7:41 AM, Aldy Hernandez wrote:
>
> After some playing, it looks like if we enable fully-resolving mode in
> the *.thread passes immediately preceding VRP, we can remove the VRP
> threading passes altogether, thus removing 2 threading passes (and
> forward threading passes at that!).
Whoo whoo!


>
> The numbers look really good.  We get 6874 more jump threads
> over my bootstrap .ii files for a total increase of 3.74%.  And we get that
> while running marginally faster (0.19% faster, so noise).
>
> The details are:
>
> *** Mainline (with the loop rotation patch):
>            ethread:64722
> 	  dom:31246
> 	  thread:73709
> 	  vrp-thread:14357
> 	  total:  184034
>
> *** Removing all the VRP threaders.
>           ethread:64722
>           thread-full:76493
>           dom:33648
>           thread:16045
>           total:  190908
>
> Notice that not only do we get a lot more threads in thread-full
> (resolving mode), but even DOM can get more jump threads.
>
> This doesn't come without risks though.  The main issue is that we would
> be removing one engine (forward threader), with another one (backward
> threader).  But the good news is that (a) we've been using the new
> backward threader for a while now (b) even the VRP threader in
> mainline is using the backward threader solver.  So, all that would
> really be changing would be the path discovery bits and custom copier
> in the forward threader, with the backward threader bit and the
> generic copier.
>
> I personally don't think this is a big risk, because we've done all
> the hard work already and it's all being stressed in one way or another.
I don't see the risk as significantly different than any other big chunk
of development work.  Furthermore, this is a major step on the path
we've been discussing for the last couple of years.  There'll be some
testing fallout, but I think that's manageable and ultimately worth the
pain to work through.  I can't express how happy I am that we're at the
point of zapping the two VRP threading passes.

>
> The untested patch below is all that would need to happen, albeit with
> copious changes to tests.
>
> I'd like to see where we all stand on this before I start chugging away
> at testing and other time consuming tasks.
>
> Note that all the relevant bits will still be tested in this release,
> so I'm not gonna cry one way or another.  But it'd be nice to start
> reducing passes, especially if we get a 3.74% increase in jump threads
> for no time penalty.
>
> Finally, even if we all agree, I think we should give us a week after the
> loop rotation restrictions go in, because threading changes always cause
> a party of unexpected things to happen.
Sure.  And FWIW, the loop rotation changes are fine IMHO.   So commit 
those when you're ready, then drop this in a week later.  All in time 
for stage1 close :-)

Jeff
  
Jeff Law Oct. 19, 2021, 11 p.m. UTC | #8
On 10/18/2021 8:03 AM, Aldy Hernandez wrote:
>
>
> On 10/18/21 3:41 PM, Aldy Hernandez wrote:
>
>> I've been experimenting with reducing the total number of threading
>> passes, and I'd like to see if there's consensus/stomach for altering
>> the pipeline.  Note that the goal is to remove forward threader 
>> clients,
>> not the other way around.  So, we should prefer to remove a VRP threader
>> instance over a *.thread one immediately before VRP.
>>
>> After some playing, it looks like if we enable fully-resolving mode in
>> the *.thread passes immediately preceding VRP, we can remove the VRP
>> threading passes altogether, thus removing 2 threading passes (and
>> forward threading passes at that!).
>
> It occurs to me that we could also remove the threading before VRP 
> passes, and enable a fully-resolving backward threader after VRP. I 
> haven't played with this scenario, but it should be just as good.  
> That being said, I don't know the intricacies of why we had both pre 
> and post VRP threading passes, and if one is ideally better than the 
> other.
The only post-VRP threading pass that (in my mind) makes sense is the 
one sitting between VRP and DOM and it should replace the DOM based 
threader.

Jeff
  
Jeff Law Oct. 19, 2021, 11:06 p.m. UTC | #9
On 10/19/2021 1:33 AM, Aldy Hernandez wrote:
> On Tue, Oct 19, 2021 at 8:52 AM Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Mon, Oct 18, 2021 at 4:03 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>>>
>>>
>>> On 10/18/21 3:41 PM, Aldy Hernandez wrote:
>>>
>>>> I've been experimenting with reducing the total number of threading
>>>> passes, and I'd like to see if there's consensus/stomach for altering
>>>> the pipeline.  Note that the goal is to remove forward threader clients,
>>>> not the other way around.  So, we should prefer to remove a VRP threader
>>>> instance over a *.thread one immediately before VRP.
>>>>
>>>> After some playing, it looks like if we enable fully-resolving mode in
>>>> the *.thread passes immediately preceeding VRP, we can remove the VRP
>>>> threading passes altogether, thus removing 2 threading passes (and
>>>> forward threading passes at that!).
>>> It occurs to me that we could also remove the threading before VRP
>>> passes, and enable a fully-resolving backward threader after VRP.  I
>>> haven't played with this scenario, but it should be just as good.  That
>>> being said, I don't know the intricacies of why we had both pre and post
>>> VRP threading passes, and if one is ideally better than the other.
>> It was done because they were different threaders.  Since the new threader
>> uses built-in VRP it shouldn't really matter whether it's before or after
>> VRP _for the threading_, but it might be that if threading runs before VRP
>> then VRP itself can do a better job on cleaning up the IL.
> Good point.
>
> FWIW, earlier this season I played with replacing the VRPs with evrp
> instances (which fold far more conditionals) and I found that the
> threaders can actually find LESS opportunities after *vrp fold away
> things.  I don't know if this is a good or a bad thing.  Perhaps we
> should benchmark three alternatives:
This is expected.  VRP and DOM will sometimes find conditionals that
they can fully optimize away.  If those conditionals are left in the IL,
the threaders would sometimes pick them up.

So as we fold more in VRP/DOM, I'm not surprised there are fewer things
for the threaders to find.  In general, if a conditional can be removed
by VRP/DOM, that's the preference.

>
> 1. Mainline
> 2. Fully resolving threader -> VRP -> No threading.
> 3. No threading -> VRP -> Full resolving threader.
>
> ...and see what the actual effect is, regardless of number of threaded paths.
>
> Speaking of which, what's the blessed way of benchmarking performance
> nowadays?  I've seen some PRs fly that measure some more lightweight
> benchmarks (open source?) than a full blown SPEC.
Can't speak for anyone else, but I benchmark these days with SPEC in a
cycle-approximate simulator :-)  I'm isolated from the whims of turbo
modes and the like.  Of course it takes considerably longer than running
on real hardware :-)

>
>> +      /* ?? Is this still needed.  ?? */
>>         /* Threading can leave many const/copy propagations in the IL.
>>           Clean them up.  Instead of just copy_prop, we use ccp to
>>           compute alignment and nonzero bits.  */
>>
>> Yes, it's still needed but not for the stated reason - the VRP
>> substitution and folding stage should deal with copy/constant propagation
>> but we replaced the former copy propagation with CCP to re-compute
>> nonzero bits & alignment so I'd change the comment to
>>
>>     /* Run CCP to compute alignment and nonzero bits.  */
The threaders (really the copiers) don't make much, if any, attempt to 
clean up the IL.  So, for example,  they'll often leave degenerate PHIs 
in the IL.  We need to clean that crap up or we'll get false positives 
in the middle-end diagnostics.

Jeff
  
Aldy Hernandez Oct. 20, 2021, 9:27 a.m. UTC | #10
On Wed, Oct 20, 2021 at 1:00 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 10/18/2021 8:03 AM, Aldy Hernandez wrote:
> >
> >
> > On 10/18/21 3:41 PM, Aldy Hernandez wrote:
> >
> >> I've been experimenting with reducing the total number of threading
> >> passes, and I'd like to see if there's consensus/stomach for altering
> >> the pipeline.  Note that the goal is to remove forward threader
> >> clients,
> >> not the other way around.  So, we should prefer to remove a VRP threader
> >> instance over a *.thread one immediately before VRP.
> >>
> >> After some playing, it looks like if we enable fully-resolving mode in
> >> the *.thread passes immediately preceding VRP, we can remove the VRP
> >> threading passes altogether, thus removing 2 threading passes (and
> >> forward threading passes at that!).
> >
> > It occurs to me that we could also remove the threading before VRP
> > passes, and enable a fully-resolving backward threader after VRP. I
> > haven't played with this scenario, but it should be just as good.
> > That being said, I don't know the intricacies of why we had both pre
> > and post VRP threading passes, and if one is ideally better than the
> > other.
> The only post-VRP threading pass that (in my mind) makes sense is the
> one sitting between VRP and DOM and it should replace the DOM based
> threader.

Yes, that's the goal, but it won't happen in this release because of
floats.  The DOM threader uses the const/avails machinery to thread
conditionals involving floats, something the path solver can't do
because it depends on gori/ranger.  Adding floats to ranger is
probably our #1 task for the next cycle.

Now before Andrew gets clever, the relation oracle is technically
type-agnostic, so it could in theory be used in the DOM threader to
replace all the const/avails stuff.  But I'd like to go on vacation
at some point ;-).

Aldy
  
Andrew MacLeod Oct. 20, 2021, 12:32 p.m. UTC | #11
On 10/20/21 5:27 AM, Aldy Hernandez wrote:
> On Wed, Oct 20, 2021 at 1:00 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>
>>
>> On 10/18/2021 8:03 AM, Aldy Hernandez wrote:
>>>
>>> On 10/18/21 3:41 PM, Aldy Hernandez wrote:
>>>
>>>> I've been experimenting with reducing the total number of threading
>>>> passes, and I'd like to see if there's consensus/stomach for altering
>>>> the pipeline.  Note, that the goal is to remove forward threader
>>>> clients,
>>>> not the other way around.  So, we should prefer to remove a VRP threader
>>>> instance over a *.thread one immediately before VRP.
>>>>
>>>> After some playing, it looks like if we enable fully-resolving mode in
>>>> the *.thread passes immediately preceding VRP, we can remove the VRP
>>>> threading passes altogether, thus removing 2 threading passes (and
>>>> forward threading passes at that!).
>>> It occurs to me that we could also remove the threading before VRP
>>> passes, and enable a fully-resolving backward threader after VRP. I
>>> haven't played with this scenario, but it should be just as good.
>>> That being said, I don't know the intricacies of why we had both pre
>>> and post VRP threading passes, and if one is ideally better than the
>>> other.
>> The only post-VRP threading pass that (in my mind) makes sense is the
>> one sitting between VRP and DOM and it should replace the DOM based
>> threader.
> Yes, that's the goal, but it won't happen on this release because of
> floats.  The DOM threader uses the const/avails machinery to thread
> conditionals involving floats, something the path solver can't do
> because it depends on gori/ranger.  Adding floats to ranger is
> probably our #1 task for the next cycle.
>
> Now before Andrew gets clever, the relation oracle is technically type
> agnostic, so it could theoretically be possible to use it in the DOM
> threader and replace all the const/avails stuff.  But I'd like to go
> on vacation at some point ;-).
>
Oh?  The float stuff isn't range-related, just relations?  You can 
certainly register those and query/fold them....
  
Aldy Hernandez Oct. 20, 2021, 1:08 p.m. UTC | #12
On Wed, Oct 20, 2021 at 2:32 PM Andrew MacLeod <amacleod@redhat.com> wrote:
>
> On 10/20/21 5:27 AM, Aldy Hernandez wrote:

> Oh?  the float stuff isn't range related, just relations?  you can
> certainly register those and query/fold them....

See?  I knew you'd get a bright idea.  No.  There's no time in this release :-).

And yes, the DOM threader's folding of float conditionals is limited
to relationals.  But why spend any time there (in the forward
threader) when next year we'll have floats and can replace the entire
DOM threader with a backward threader?

Aldy
  
Jeff Law Oct. 20, 2021, 5:55 p.m. UTC | #13
On 10/20/2021 3:27 AM, Aldy Hernandez wrote:
> On Wed, Oct 20, 2021 at 1:00 AM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>
>>
>> On 10/18/2021 8:03 AM, Aldy Hernandez wrote:
>>>
>>> On 10/18/21 3:41 PM, Aldy Hernandez wrote:
>>>
>>>> I've been experimenting with reducing the total number of threading
>>>> passes, and I'd like to see if there's consensus/stomach for altering
>>>> the pipeline.  Note, that the goal is to remove forward threader
>>>> clients,
>>>> not the other way around.  So, we should prefer to remove a VRP threader
>>>> instance over a *.thread one immediately before VRP.
>>>>
>>>> After some playing, it looks like if we enable fully-resolving mode in
>>>> the *.thread passes immediately preceding VRP, we can remove the VRP
>>>> threading passes altogether, thus removing 2 threading passes (and
>>>> forward threading passes at that!).
>>> It occurs to me that we could also remove the threading before VRP
>>> passes, and enable a fully-resolving backward threader after VRP. I
>>> haven't played with this scenario, but it should be just as good.
>>> That being said, I don't know the intricacies of why we had both pre
>>> and post VRP threading passes, and if one is ideally better than the
>>> other.
>> The only post-VRP threading pass that (in my mind) makes sense is the
>> one sitting between VRP and DOM and it should replace the DOM based
>> threader.
> Yes, that's the goal, but it won't happen on this release because of
> floats.  The DOM threader uses the const/avails machinery to thread
> conditionals involving floats, something the path solver can't do
> because it depends on gori/ranger.  Adding floats to ranger is
> probably our #1 task for the next cycle.
>
> Now before Andrew gets clever, the relation oracle is technically type
> agnostic, so it could theoretically be possible to use it in the DOM
> threader and replace all the const/avails stuff.  But I'd like to go
> on vacation at some point ;-).
Correct.  I just wanted to make it clear that as the backwards threader 
improves, I see less and less need to run as many threader passes.

For VRP1 -> DOM2, I could see removing the threading pass from DOM2 and 
having a backwards threader between VRP1 and DOM2.  I don't see 
significant value in having a threading pass of any sort after VRP2, as 
the vast majority of things are handled by then.

Jeff
  

Patch

gcc/ChangeLog:

	* passes.def: Replace pass_thread_jumps that happen before VRP
	with pass_thread_jumps_full.  Remove pass_vrp_threader passes.
---
 gcc/passes.def | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/gcc/passes.def b/gcc/passes.def
index c11c237f6d2..96fc230e780 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -210,9 +210,8 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_return_slot);
       NEXT_PASS (pass_fre, true /* may_iterate */);
       NEXT_PASS (pass_merge_phi);
-      NEXT_PASS (pass_thread_jumps);
+      NEXT_PASS (pass_thread_jumps_full);
       NEXT_PASS (pass_vrp, true /* warn_array_bounds_p */);
-      NEXT_PASS (pass_vrp_threader);
       NEXT_PASS (pass_dse);
       NEXT_PASS (pass_dce);
       /* pass_stdarg is always run and at this point we execute
@@ -336,9 +335,9 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_thread_jumps);
       NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
       NEXT_PASS (pass_strlen);
-      NEXT_PASS (pass_thread_jumps);
+      NEXT_PASS (pass_thread_jumps_full);
       NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
-      NEXT_PASS (pass_vrp_threader);
+      /* ?? Is this still needed?  ?? */
       /* Threading can leave many const/copy propagations in the IL.
 	 Clean them up.  Instead of just copy_prop, we use ccp to
 	 compute alignment and nonzero bits.  */