[V2] Disable sched1 in functions that call setjmp

Message ID 20221222173208.13317-1-jose.marchesi@oracle.com
State: New
Series: [V2] Disable sched1 in functions that call setjmp

Commit Message

Jose E. Marchesi Dec. 22, 2022, 5:32 p.m. UTC
  When the following testcase is built with -fschedule-insns on either
x86_64 or aarch64:

  #include <stdio.h>
  #include <setjmp.h>
  #include <stdbool.h>

  jmp_buf ex_buf__;

   #define TRY do{ if( !setjmp(ex_buf__) ){
   #define CATCH } else {
   #define ETRY } }while(0)
   #define THROW longjmp(ex_buf__, 1)

  int f(int x)
  {
    int arr[] = {1,2,6,8,9,10};
    int lo=0;
    int hi=5;

    while(lo<=hi) {
          int mid=(lo+hi)/2;

          if(arr[mid]==x) {
            THROW;
          } else if(arr[mid]<x) {
            lo=mid+1;
          } else if(arr[mid]>x) {
            hi=mid-1;
          }
    }

    return -1;
  }

  int
  main(int argc, char** argv)
  {
    int a=2;
    bool b=false;

    TRY
    {
     a=f(a);
     b=true;
    }
    CATCH
    {
     printf("a : %d\n",a);
     printf("Got Exception!\n");
    }
    ETRY;

    if(b) {
          printf("b is true!\n");
    }
    return 0;
  }

The first instruction scheduler pass reorders instructions in the TRY
block in such a way that `b=true' gets executed before the call to the
function `f'.  This optimization is wrong, because `main' calls setjmp
and `f' is known to call longjmp.

As discussed in BZ 57067, the root cause for this is the fact that
setjmp is not properly modeled in RTL, and therefore the backend
passes have no normalized way to handle this situation.

As Alexander Monakov noted in the BZ, many RTL passes refuse to touch
functions that call setjmp.  This includes for example gcse,
store_motion and cprop.  This patch adds the sched1 pass to that list.

Note that the other instruction scheduling passes are still allowed to
run on these functions, since they reorder instructions within basic
blocks, and therefore they cannot cross function calls.

This doesn't fix the fundamental issue, but it at least ensures that
sched1 won't perform invalid transformations on correct C programs.

Regtested on aarch64-linux-gnu.

gcc/ChangeLog:

	PR rtl-optimization/57067
	* sched-rgn.cc (pass_sched::gate): Disable pass if current
	function calls setjmp.
---
 gcc/sched-rgn.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
  

Comments

Alexander Monakov Dec. 22, 2022, 5:56 p.m. UTC | #1
On Thu, 22 Dec 2022, Jose E. Marchesi via Gcc-patches wrote:

> The first instruction scheduler pass reorders instructions in the TRY
> block in a way `b=true' gets executed before the call to the function
> `f'.  This optimization is wrong, because `main' calls setjmp and `f'
> is known to call longjmp.
> 
> As discussed in BZ 57067, the root cause for this is the fact that
> setjmp is not properly modeled in RTL, and therefore the backend
> passes have no normalized way to handle this situation.
> 
> As Alexander Monakov noted in the BZ, many RTL passes refuse to touch
> functions that call setjmp.  This includes for example gcse,
> store_motion and cprop.  This patch adds the sched1 pass to that list.
> 
> Note that the other instruction scheduling passes are still allowed to
> run on these functions, since they reorder instructions within basic
> blocks, and therefore they cannot cross function calls.
> 
> This doesn't fix the fundamental issue, but at least assures that
> sched1 wont perform invalid transformation in correct C programs.

I think scheduling across calls in the pre-RA scheduler is simply an oversight:
we do not look at dataflow information, and with 50% chance we risk extending
the lifetime of a pseudoregister across a call, causing higher register pressure
at the point of the call, and potentially an extra spill.

Therefore I would suggest indeed solving the root cause, with (untested):

diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
index 948aa0c3b..343fe2bfa 100644
--- a/gcc/sched-deps.cc
+++ b/gcc/sched-deps.cc
@@ -3688,7 +3688,13 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn)

       CANT_MOVE (insn) = 1;

-      if (find_reg_note (insn, REG_SETJMP, NULL))
+      if (!reload_completed)
+       {
+         /* Do not schedule across calls, this is prone to extending lifetime
+            of a pseudo and causing extra spill later on.  */
+         reg_pending_barrier = MOVE_BARRIER;
+       }
+      else if (find_reg_note (insn, REG_SETJMP, NULL))
         {
           /* This is setjmp.  Assume that all registers, not just
              hard registers, may be clobbered by this call.  */

Alexander
  
Qing Zhao Dec. 22, 2022, 7:28 p.m. UTC | #2
> On Dec 22, 2022, at 12:56 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> 
> On Thu, 22 Dec 2022, Jose E. Marchesi via Gcc-patches wrote:
> 
>> The first instruction scheduler pass reorders instructions in the TRY
>> block in a way `b=true' gets executed before the call to the function
>> `f'.  This optimization is wrong, because `main' calls setjmp and `f'
>> is known to call longjmp.
>> 
>> As discussed in BZ 57067, the root cause for this is the fact that
>> setjmp is not properly modeled in RTL, and therefore the backend
>> passes have no normalized way to handle this situation.
>> 
>> As Alexander Monakov noted in the BZ, many RTL passes refuse to touch
>> functions that call setjmp.  This includes for example gcse,
>> store_motion and cprop.  This patch adds the sched1 pass to that list.
>> 
>> Note that the other instruction scheduling passes are still allowed to
>> run on these functions, since they reorder instructions within basic
>> blocks, and therefore they cannot cross function calls.
>> 
>> This doesn't fix the fundamental issue, but at least assures that
>> sched1 wont perform invalid transformation in correct C programs.
> 
> I think scheduling across calls in the pre-RA scheduler is simply an oversight,
> we do not look at dataflow information and with 50% chance risk extending
> lifetime of a pseudoregister across a call, causing higher register pressure at
> the point of the call, and potentially an extra spill.

I am a little confused: you mean the pre-RA scheduler does not look at the
dataflow information at all when scheduling insns across calls currently?

Qing


> 
> Therefore I would suggest to indeed solve the root cause, with (untested):
> 
> diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
> index 948aa0c3b..343fe2bfa 100644
> --- a/gcc/sched-deps.cc
> +++ b/gcc/sched-deps.cc
> @@ -3688,7 +3688,13 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn)
> 
>       CANT_MOVE (insn) = 1;
> 
> -      if (find_reg_note (insn, REG_SETJMP, NULL))
> +      if (!reload_completed)
> +       {
> +         /* Do not schedule across calls, this is prone to extending lifetime
> +            of a pseudo and causing extra spill later on.  */
> +         reg_pending_barrier = MOVE_BARRIER;
> +       }
> +      else if (find_reg_note (insn, REG_SETJMP, NULL))
>         {
>           /* This is setjmp.  Assume that all registers, not just
>              hard registers, may be clobbered by this call.  */
> 
> Alexander
  
Alexander Monakov Dec. 23, 2022, 7:33 a.m. UTC | #3
On Thu, 22 Dec 2022, Qing Zhao wrote:

> > I think scheduling across calls in the pre-RA scheduler is simply an oversight,
> > we do not look at dataflow information and with 50% chance risk extending
> > lifetime of a pseudoregister across a call, causing higher register pressure at
> > the point of the call, and potentially an extra spill.
> 
> I am a little confused, you mean pre-RA scheduler does not look at the data flow
>  information at all when scheduling insns across calls currently?

I think it does not inspect liveness info, and may extend lifetime of a pseudo
across a call, transforming
 
  call foo
  reg = 1
  ...
  use reg

to

  reg = 1
  call foo
  ...
  use reg

but this is undesirable, because now register allocation cannot select a
call-clobbered register for 'reg'.

Alexander
  
Richard Sandiford Dec. 23, 2022, 10:05 a.m. UTC | #4
Alexander Monakov via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Thu, 22 Dec 2022, Jose E. Marchesi via Gcc-patches wrote:
>
>> The first instruction scheduler pass reorders instructions in the TRY
>> block in a way `b=true' gets executed before the call to the function
>> `f'.  This optimization is wrong, because `main' calls setjmp and `f'
>> is known to call longjmp.
>> 
>> As discussed in BZ 57067, the root cause for this is the fact that
>> setjmp is not properly modeled in RTL, and therefore the backend
>> passes have no normalized way to handle this situation.
>> 
>> As Alexander Monakov noted in the BZ, many RTL passes refuse to touch
>> functions that call setjmp.  This includes for example gcse,
>> store_motion and cprop.  This patch adds the sched1 pass to that list.
>> 
>> Note that the other instruction scheduling passes are still allowed to
>> run on these functions, since they reorder instructions within basic
>> blocks, and therefore they cannot cross function calls.
>> 
>> This doesn't fix the fundamental issue, but at least assures that
>> sched1 wont perform invalid transformation in correct C programs.
>
> I think scheduling across calls in the pre-RA scheduler is simply an oversight,
> we do not look at dataflow information and with 50% chance risk extending
> lifetime of a pseudoregister across a call, causing higher register pressure at
> the point of the call, and potentially an extra spill.
>
> Therefore I would suggest to indeed solve the root cause, with (untested):
>
> diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
> index 948aa0c3b..343fe2bfa 100644
> --- a/gcc/sched-deps.cc
> +++ b/gcc/sched-deps.cc
> @@ -3688,7 +3688,13 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn)
>
>        CANT_MOVE (insn) = 1;
>
> -      if (find_reg_note (insn, REG_SETJMP, NULL))
> +      if (!reload_completed)
> +       {
> +         /* Do not schedule across calls, this is prone to extending lifetime
> +            of a pseudo and causing extra spill later on.  */
> +         reg_pending_barrier = MOVE_BARRIER;
> +       }
> +      else if (find_reg_note (insn, REG_SETJMP, NULL))
>          {
>            /* This is setjmp.  Assume that all registers, not just
>               hard registers, may be clobbered by this call.  */

+1 for trying this FWIW.  There's still plenty of time to try an
alternative solution if there are unexpected performance problems.

Richard
  
Jose E. Marchesi Dec. 23, 2022, 10:42 a.m. UTC | #5
> Alexander Monakov via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> On Thu, 22 Dec 2022, Jose E. Marchesi via Gcc-patches wrote:
>>
>>> The first instruction scheduler pass reorders instructions in the TRY
>>> block in a way `b=true' gets executed before the call to the function
>>> `f'.  This optimization is wrong, because `main' calls setjmp and `f'
>>> is known to call longjmp.
>>> 
>>> As discussed in BZ 57067, the root cause for this is the fact that
>>> setjmp is not properly modeled in RTL, and therefore the backend
>>> passes have no normalized way to handle this situation.
>>> 
>>> As Alexander Monakov noted in the BZ, many RTL passes refuse to touch
>>> functions that call setjmp.  This includes for example gcse,
>>> store_motion and cprop.  This patch adds the sched1 pass to that list.
>>> 
>>> Note that the other instruction scheduling passes are still allowed to
>>> run on these functions, since they reorder instructions within basic
>>> blocks, and therefore they cannot cross function calls.
>>> 
>>> This doesn't fix the fundamental issue, but at least assures that
>>> sched1 wont perform invalid transformation in correct C programs.
>>
>> I think scheduling across calls in the pre-RA scheduler is simply an oversight,
>> we do not look at dataflow information and with 50% chance risk extending
>> lifetime of a pseudoregister across a call, causing higher register pressure at
>> the point of the call, and potentially an extra spill.
>>
>> Therefore I would suggest to indeed solve the root cause, with (untested):
>>
>> diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
>> index 948aa0c3b..343fe2bfa 100644
>> --- a/gcc/sched-deps.cc
>> +++ b/gcc/sched-deps.cc
>> @@ -3688,7 +3688,13 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn)
>>
>>        CANT_MOVE (insn) = 1;
>>
>> -      if (find_reg_note (insn, REG_SETJMP, NULL))
>> +      if (!reload_completed)
>> +       {
>> +         /* Do not schedule across calls, this is prone to extending lifetime
>> +            of a pseudo and causing extra spill later on.  */
>> +         reg_pending_barrier = MOVE_BARRIER;
>> +       }
>> +      else if (find_reg_note (insn, REG_SETJMP, NULL))
>>          {
>>            /* This is setjmp.  Assume that all registers, not just
>>               hard registers, may be clobbered by this call.  */
>
> +1 for trying this FWIW.  There's still plenty of time to try an
> alternative solution if there are unexpected performance problems.

Let me see if Alexander's patch fixes the issue at hand (it must) and
will also do some regression testing.
  
Qing Zhao Dec. 23, 2022, 2:34 p.m. UTC | #6
> On Dec 23, 2022, at 2:33 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> 
> On Thu, 22 Dec 2022, Qing Zhao wrote:
> 
>>> I think scheduling across calls in the pre-RA scheduler is simply an oversight,
>>> we do not look at dataflow information and with 50% chance risk extending
>>> lifetime of a pseudoregister across a call, causing higher register pressure at
>>> the point of the call, and potentially an extra spill.
>> 
>> I am a little confused, you mean pre-RA scheduler does not look at the data flow
>> information at all when scheduling insns across calls currently?
> 
> I think it does not inspect liveness info, and may extend lifetime of a pseudo
> across a call, transforming
> 
>  call foo
>  reg = 1
>  ...
>  use reg
> 
> to
> 
>  reg = 1
>  call foo
>  ...
>  use reg
> 
> but this is undesirable, because now register allocation cannot select a
> call-clobbered register for 'reg'.
Okay, thanks for the explanation.

Then, why not just check the liveness info instead of inhibiting all scheduling across calls?

Qing
> 
> Alexander
  
Alexander Monakov Dec. 23, 2022, 3:03 p.m. UTC | #7
On Fri, 23 Dec 2022, Qing Zhao wrote:
> >> I am a little confused, you mean pre-RA scheduler does not look at the data flow
> >> information at all when scheduling insns across calls currently?
> > 
> > I think it does not inspect liveness info, and may extend lifetime of a pseudo
> > across a call, transforming
> > 
> >  call foo
> >  reg = 1
> >  ...
> >  use reg
> > 
> > to
> > 
> >  reg = 1
> >  call foo
> >  ...
> >  use reg
> > 
> > but this is undesirable, because now register allocation cannot select a
> > call-clobbered register for 'reg'.
> Okay, thanks for the explanation.
> 
> Then, why not just check the liveness info instead of inhibiting all scheduling across calls?

Because there's almost nothing to gain from pre-RA scheduling across calls in
the first place. Remember that the call transfers control flow elsewhere and
therefore the scheduler has no idea about the pipeline state after the call
and after the return, so modeling-wise it's a gamble.

For instructions that lie on a critical path such scheduling can be useful when
it substantially reduces the difference between the priority of the call and
nearby instructions of the critical path. But we don't track which instructions
are on critical path(s) and which are not.

(scheduling across calls in sched2 is somewhat dubious as well, but
it doesn't risk register pressure issues, and on VLIW CPUs it at least
can result in better VLIW packing)

Alexander
  
Jose E. Marchesi Dec. 23, 2022, 5:27 p.m. UTC | #8
> On Fri, 23 Dec 2022, Qing Zhao wrote:
>> >> I am a little confused, you mean pre-RA scheduler does not look at the data flow
>> >> information at all when scheduling insns across calls currently?
>> > 
>> > I think it does not inspect liveness info, and may extend lifetime of a pseudo
>> > across a call, transforming
>> > 
>> >  call foo
>> >  reg = 1
>> >  ...
>> >  use reg
>> > 
>> > to
>> > 
>> >  reg = 1
>> >  call foo
>> >  ...
>> >  use reg
>> > 
>> > but this is undesirable, because now register allocation cannot select a
>> > call-clobbered register for 'reg'.
>> Okay, thanks for the explanation.
>> 
>> Then, why not just check the liveness info instead of inhibiting all scheduling across calls?
>
> Because there's almost nothing to gain from pre-RA scheduling across calls in
> the first place. Remember that the call transfers control flow elsewhere and
> therefore the scheduler has no idea about the pipeline state after the call
> and after the return, so modeling-wise it's a gamble.
>
> For instructions that lie on a critical path such scheduling can be useful when
> it substantially reduces the difference between the priority of the call and
> nearby instructions of the critical path. But we don't track which instructions
> are on critical path(s) and which are not.
>
> (scheduling across calls in sched2 is somewhat dubious as well, but
> it doesn't risk register pressure issues, and on VLIW CPUs it at least
> can result in better VLIW packing)

Does sched2 actually schedule across calls?  All the comments in the
source code stress the fact that the second scheduler pass (after
register allocation) works in regions that correspond to basic blocks:
"(after reload, each region is of one block)".
  
Alexander Monakov Dec. 23, 2022, 5:31 p.m. UTC | #9
On Fri, 23 Dec 2022, Jose E. Marchesi wrote:

> > (scheduling across calls in sched2 is somewhat dubious as well, but
> > it doesn't risk register pressure issues, and on VLIW CPUs it at least
> > can result in better VLIW packing)
> 
> Does sched2 actually schedule across calls?  All the comments in the
> source code stress the fact that the second scheduler pass (after
> register allocation) works in regions that correspond to basic blocks:
> "(after reload, each region is of one block)".

A call instruction does not end a basic block.

(also, with -fsched2-use-superblocks sched2 works on regions like sched1)

Alexander
  
Jose E. Marchesi Dec. 23, 2022, 5:45 p.m. UTC | #10
> On Fri, 23 Dec 2022, Jose E. Marchesi wrote:
>
>> > (scheduling across calls in sched2 is somewhat dubious as well, but
>> > it doesn't risk register pressure issues, and on VLIW CPUs it at least
>> > can result in better VLIW packing)
>> 
>> Does sched2 actually schedule across calls?  All the comments in the
>> source code stress the fact that the second scheduler pass (after
>> register allocation) works in regions that correspond to basic blocks:
>> "(after reload, each region is of one block)".
>
> A call instruction does not end a basic block.

Ok, so my original assumption in the patch explaining why I disabled
sched1 but not sched2 was not correct.  Good to know.

> (also, with -fsched2-use-superblocks sched2 works on regions like sched1)
>
> Alexander
  
Qing Zhao Dec. 23, 2022, 7:21 p.m. UTC | #11
Then, sched2 still can move insns across calls?
So does sched2 have the same issue of incorrectly moving an insn across a call which has unknown control flow?

Qing

> On Dec 23, 2022, at 12:31 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> 
> On Fri, 23 Dec 2022, Jose E. Marchesi wrote:
> 
>>> (scheduling across calls in sched2 is somewhat dubious as well, but
>>> it doesn't risk register pressure issues, and on VLIW CPUs it at least
>>> can result in better VLIW packing)
>> 
>> Does sched2 actually schedule across calls?  All the comments in the
>> source code stress the fact that the second scheduler pass (after
>> register allocation) works in regions that correspond to basic blocks:
>> "(after reload, each region is of one block)".
> 
> A call instruction does not end a basic block.
> 
> (also, with -fsched2-use-superblocks sched2 works on regions like sched1)
> 
> Alexander
  
Alexander Monakov Dec. 23, 2022, 7:36 p.m. UTC | #12
On Fri, 23 Dec 2022, Qing Zhao wrote:

> Then, sched2 still can move insn across calls? 
> So does sched2 have the same issue of incorrectly moving  the insn across a call which has unknown control flow?

I think problems are unlikely because register allocator assigns pseudos that
cross setjmp to memory.

I think you hit the problem with sched1 because most testing is done on x86 and
sched1 is not enabled there, otherwise the problem would have been noticed much
earlier.

Alexander
  
Qing Zhao Dec. 23, 2022, 7:57 p.m. UTC | #13
> On Dec 23, 2022, at 2:36 PM, Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> 
> 
> On Fri, 23 Dec 2022, Qing Zhao wrote:
> 
>> Then, sched2 still can move insn across calls? 
>> So does sched2 have the same issue of incorrectly moving  the insn across a call which has unknown control flow?
> 
> I think problems are unlikely because register allocator assigns pseudos that
> cross setjmp to memory.
> 
> I think you hit the problem with sched1 because most testing is done on x86 and
> sched1 is not enabled there, otherwise the problem would have been noticed much
> earlier.


Yes, the problem with this bug is in sched1 on aarch64.  On x86 the same issue will be exposed when sched1 is explicitly enabled with -fschedule-insns.

BTW, why is sched1 not enabled on x86 by default?

Another question is: as discussed in the original bug PR57067
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067), the root cause of this
issue is that the abnormal control flow edges (from setjmp/longjmp) cannot be
represented correctly at the RTL stage.  Shall we fix this root cause instead?

Qing


> Alexander
  
Alexander Monakov Dec. 24, 2022, 8:10 a.m. UTC | #14
On Fri, 23 Dec 2022, Qing Zhao wrote:

> BTW, Why sched1 is not enabled on x86 by default?

Register allocation is tricky on x86 due to small number of general-purpose
registers, and sched1 can make it even more difficult. I think before register
pressure modeling was added, sched1 could not be enabled because then allocation
would sometimes fail, and now there's no incentive to enable it, as it is not so
important for modern x86 CPUs. Perhaps someone else has a more comprehensive
answer.

> Another question is:  As discussed in the original bug PR57067:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067 The root cause of this
> issue related to the abnormal control flow edges (from setjmp/longjmp) cannot
> be represented correctly at RTL stage, shall we fix this root cause instead? 

You'd need an experienced reviewer to work with you, especially on high-level
design decisions such as "How ABNORMAL_DISPATCHER should be represented on RTL".
I'm afraid it's not just a matter of applying a small patch in one place.

Alexander
  
Richard Biener Dec. 24, 2022, 9:26 a.m. UTC | #15
> Am 24.12.2022 um 09:11 schrieb Alexander Monakov via Gcc-patches <gcc-patches@gcc.gnu.org>:
> 
> 
>> On Fri, 23 Dec 2022, Qing Zhao wrote:
>> 
>> BTW, Why sched1 is not enabled on x86 by default?
> 
> Register allocation is tricky on x86 due to small number of general-purpose
> registers, and sched1 can make it even more difficult. I think before register
> pressure modeling was added, sched1 could not be enabled because then allocation
> would sometimes fail, and now there's no incentive to enable it, as it is not so
> important for modern x86 CPUs. Perhaps someone else has a more comprehensive
> answer.
> 
>> Another question is:  As discussed in the original bug PR57067:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067 The root cause of this
>> issue related to the abnormal control flow edges (from setjmp/longjmp) cannot
>> be represented correctly at RTL stage, shall we fix this root cause instead? 
> 
> You'd need an experienced reviewer to work with you, especially on high-level
> design decisions such as "How ABNORMAL_DISPATCHER should be represented on RTL".
> I'm afraid it's not just a matter of applying a small patch in one place.

For nonlocal goto we thread the abnormal dispatcher.  Of course by regenerating abnormal edges, not by keeping and modifying them.  We cannot re-generate the (optimal) set of abnormal edges for setjmp, so we want to preserve those edges.  But as you say it’s a very non-trivial change.

Richard 

> Alexander
  
Jose E. Marchesi Dec. 24, 2022, 9:58 a.m. UTC | #16
>> Am 24.12.2022 um 09:11 schrieb Alexander Monakov via Gcc-patches <gcc-patches@gcc.gnu.org>:
>> 
>> 
>>> On Fri, 23 Dec 2022, Qing Zhao wrote:
>>> 
>>> BTW, Why sched1 is not enabled on x86 by default?
>> 
>> Register allocation is tricky on x86 due to small number of general-purpose
>> registers, and sched1 can make it even more difficult. I think before register
>> pressure modeling was added, sched1 could not be enabled because then allocation
>> would sometimes fail, and now there's no incentive to enable it, as it is not so
>> important for modern x86 CPUs. Perhaps someone else has a more comprehensive
>> answer.
>> 
>>> Another question is:  As discussed in the original bug PR57067:
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067 The root cause of this
>>> issue related to the abnormal control flow edges (from setjmp/longjmp) cannot
>>> be represented correctly at RTL stage, shall we fix this root cause instead? 
>> 
>> You'd need an experienced reviewer to work with you, especially on high-level
>> design decisions such as "How ABNORMAL_DISPATCHER should be represented on RTL".
>> I'm afraid it's not just a matter of applying a small patch in one place.
>
> For nonlocal goto we Thread the abnormal dispatcher.  Of course by
> regenerating abnormal edges, not by keeping and modifying them.  We
> cannot re-generate the (optimal) set of abnormal edges for setjmp so
> we want to preserve those edges.  But as you say it’s a very
> non-trivial change.

Alright, so we have two short-term alternatives to at least remove the
possibility that GCC generates wrong code for valid C when the scheduler
is turned on:

a) To disable sched1 in functions that call setjmp.

b) To change deps_analyze_insn so instructions are not moved across
   function calls before register allocation (!reload_completed).

Both patches fix our particular use cases and are regression free in
aarch64-linux-gnu.

However, there is something I don't understand: wouldn't sched2
introduce the same problem when -fsched2-use-superblocks is specified?
In that case, option a) would need to be expanded to disable sched2
as well, and b) would have no effect (!after_reload)?

Using -fsched2-use-superblocks doesn't trigger the problem in our use
case.
  
Richard Biener Dec. 24, 2022, 10:06 a.m. UTC | #17
> Am 24.12.2022 um 10:54 schrieb Jose E. Marchesi <jose.marchesi@oracle.com>:
> 
> 
>>>> Am 24.12.2022 um 09:11 schrieb Alexander Monakov via Gcc-patches <gcc-patches@gcc.gnu.org>:
>>> 
>>> 
>>>> On Fri, 23 Dec 2022, Qing Zhao wrote:
>>>> 
>>>> BTW, Why sched1 is not enabled on x86 by default?
>>> 
>>> Register allocation is tricky on x86 due to small number of general-purpose
>>> registers, and sched1 can make it even more difficult. I think before register
>>> pressure modeling was added, sched1 could not be enabled because then allocation
>>> would sometimes fail, and now there's no incentive to enable it, as it is not so
>>> important for modern x86 CPUs. Perhaps someone else has a more comprehensive
>>> answer.
>>> 
>>>> Another question is:  As discussed in the original bug PR57067:
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067 The root cause of this
>>>> issue related to the abnormal control flow edges (from setjmp/longjmp) cannot
>>>> be represented correctly at RTL stage, shall we fix this root cause instead? 
>>> 
>>> You'd need an experienced reviewer to work with you, especially on high-level
>>> design decisions such as "How ABNORMAL_DISPATCHER should be represented on RTL".
>>> I'm afraid it's not just a matter of applying a small patch in one place.
>> 
>> For nonlocal goto we Thread the abnormal dispatcher.  Of course by
>> regenerating abnormal edges, not by keeping and modifying them.  We
>> cannot re-generate the (optimal) set of abnormal edges for setjmp so
>> we want to preserve those edges.  But as you say it’s a very
>> non-trivial change.
> 
> Allright, so we have two short-term alternatives for at least remove the
> possibility that GCC generates wrong code for valid C when the scheduler
> is turned on:
> 
> a) To disable sched1 in functions that call setjmp.
> 
> b) To change deps_analyze_insn so instructions are not moved across
>   function calls before register allocation (!reload_completed).
> 
> Both patches fix our particular use cases and are regression free in
> aarch64-linux-gnu.
> 
> However, there is something I don't understand: wouldn't sched2
> introduce the same problem when -fsched2-use-superblocks is specified?
> In that case, the option a) would need to be expanded to disable sched2
> as well, and b) wouldn't have effect (!after_reload)?
> 
> Using -fsched2-use-superblocks doesn't trigger the problem in our use
> case.

I’d go with b) and revisit sched2 when we run into a testcase that’s mishandled.

Richard
  
Alexander Monakov Dec. 24, 2022, 10:18 a.m. UTC | #18
On Sat, 24 Dec 2022, Jose E. Marchesi wrote:

> However, there is something I don't understand: wouldn't sched2
> introduce the same problem when -fsched2-use-superblocks is specified?

Superblocks are irrelevant, a call instruction does not end a basic block
and the problematic motion happens within a BB on your testcase. Didn't you
ask about this already?

> In that case, the option a) would need to be expanded to disable sched2
> as well, and b) wouldn't have effect (!after_reload)?

See my response to Qing Zhao, I think due to special-casing of pseudos
that are live at setjmp during register allocation, sched2 will not move
them in such manner (they should be assigned to memory and I don't expect
sched2 will move such MEMs across calls). But of course there may be holes
in this theory.

On some targets disabling sched2 is not so easy because it's responsible
for VLIW packing (bundling on ia64).

Alexander
  
Segher Boessenkool Dec. 26, 2022, 12:29 a.m. UTC | #19
Hi!

On Sat, Dec 24, 2022 at 10:58:41AM +0100, Jose E. Marchesi via Gcc-patches wrote:
> Allright, so we have two short-term alternatives for at least remove the
> possibility that GCC generates wrong code for valid C when the scheduler
> is turned on:
> 
> a) To disable sched1 in functions that call setjmp.

That is a heavy hammer.

> b) To change deps_analyze_insn so instructions are not moved across
>    function calls before register allocation (!reload_completed).

And this is way heavier still.

OTOH, it is possible b) actually improves code (improves performance) in
general (and maybe even without such a reload_completed check).

> Both patches fix our particular use cases and are regression free in
> aarch64-linux-gnu.

Did you also check for performance regressions?


Segher
  
Qing Zhao Jan. 5, 2023, 6:11 p.m. UTC | #20
Alexander,

(Sorry for the late reply due to holiday vacation).

> On Dec 24, 2022, at 3:10 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> 
> On Fri, 23 Dec 2022, Qing Zhao wrote:
> 
>> BTW, Why sched1 is not enabled on x86 by default?
> 
> Register allocation is tricky on x86 due to small number of general-purpose
> registers, and sched1 can make it even more difficult. I think before register
> pressure modeling was added, sched1 could not be enabled because then allocation
> would sometimes fail, and now there's no incentive to enable it, as it is not so
> important for modern x86 CPUs. Perhaps someone else has a more comprehensive
> answer.

Okay. I see. Thanks for the explanation of the history. 
> 
>> Another question is:  As discussed in the original bug PR57067:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067 The root cause of this
>> issue related to the abnormal control flow edges (from setjmp/longjmp) cannot
>> be represented correctly at RTL stage, shall we fix this root cause instead? 
> 
> You'd need an experienced reviewer to work with you, especially on high-level
> design decisions such as "How ABNORMAL_DISPATCHER should be represented on RTL".
> I'm afraid it's not just a matter of applying a small patch in one place.
I see. (And I guess so, fixing this is not a trivial work).

Qing
> 
> Alexander
  

Patch

diff --git a/gcc/sched-rgn.cc b/gcc/sched-rgn.cc
index 420c45dffb4..c536d0b8dea 100644
--- a/gcc/sched-rgn.cc
+++ b/gcc/sched-rgn.cc
@@ -3847,7 +3847,8 @@  bool
 pass_sched::gate (function *)
 {
 #ifdef INSN_SCHEDULING
-  return optimize > 0 && flag_schedule_insns && dbg_cnt (sched_func);
+  return optimize > 0 && flag_schedule_insns
+    && !cfun->calls_setjmp && dbg_cnt (sched_func);
 #else
   return 0;
 #endif