[1/1] sched-deps.cc (find_modifiable_mems): Avoid exponential behavior

Message ID 20231120120649.672893-2-maxim.kuvyrkov@linaro.org
State New
Headers
Series Avoid exponential behavior in scheduler and better logging |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Testing passed

Commit Message

Maxim Kuvyrkov Nov. 20, 2023, 12:06 p.m. UTC
  This patch avoids sched-deps.cc:find_inc() creating exponential number
of dependencies, which become memory and compilation time hogs.
Consider example (simplified from PR96388) ...
===
sp=sp-4 // sp_insnA
mem_insnA1[sp+A1]
...
mem_insnAN[sp+AN]
sp=sp-4 // sp_insnB
mem_insnB1[sp+B1]
...
mem_insnBM[sp+BM]
===
... in this example find_modifiable_mems() will arrange for mem_insnA*
to be able to pass sp_insnA, and, while doing this, will create
dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
is a consumer of sp_insnA.  After this sp_insnB will have N new
backward dependencies.
Then find_modifiable_mems() gets to mem_insnB*s and starts to create
N new dependencies for _every_ mem_insnB*.  This gets us N*M new
dependencies.

In PR96833's testcase N and M are 10k-15k, which causes RAM usage of
30GB and compilation time of 30 minutes, with sched2 accounting for
95% of both metrics.  After this patch the RAM usage is down to 1GB
and compilation time is down to 3-4 minutes, with sched2 no longer
standing out on -ftime-report or memory usage.

gcc/ChangeLog:

	PR rtl-optimization/96388
	PR rtl-optimization/111554
	* sched-deps.cc (find_inc): Avoid exponential behavior.
---
 gcc/sched-deps.cc | 45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)
  

Comments

Richard Biener Nov. 20, 2023, 1:09 p.m. UTC | #1
On Mon, Nov 20, 2023 at 1:08 PM Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
>
> This patch avoids sched-deps.cc:find_inc() creating exponential number
> of dependencies, which become memory and compilation time hogs.
> Consider example (simplified from PR96388) ...
> ===
> sp=sp-4 // sp_insnA
> mem_insnA1[sp+A1]
> ...
> mem_insnAN[sp+AN]
> sp=sp-4 // sp_insnB
> mem_insnB1[sp+B1]
> ...
> mem_insnBM[sp+BM]
> ===
> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> to be able to pass sp_insnA, and, while doing this, will create
> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> is a consumer of sp_insnA.  After this sp_insnB will have N new
> backward dependencies.
> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> dependencies.
>
> In PR96833's testcase N and M are 10k-15k, which causes RAM usage of
> 30GB and compilation time of 30 minutes, with sched2 accounting for
> 95% of both metrics.  After this patch the RAM usage is down to 1GB
> and compilation time is down to 3-4 minutes, with sched2 no longer
> standing out on -ftime-report or memory usage.
>
> gcc/ChangeLog:
>
>         PR rtl-optimization/96388
>         PR rtl-optimization/111554
>         * sched-deps.cc (find_inc): Avoid exponential behavior.
> ---
>  gcc/sched-deps.cc | 45 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 41 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
> index c23218890f3..397bb9fd462 100644
> --- a/gcc/sched-deps.cc
> +++ b/gcc/sched-deps.cc
> @@ -4779,24 +4779,59 @@ parse_add_or_inc (struct mem_inc_info *mii, rtx_insn *insn, bool before_mem)
>  /* Once a suitable mem reference has been found and the corresponding data
>     in MII has been filled in, this function is called to find a suitable
>     add or inc insn involving the register we found in the memory
> -   reference.  */
> +   reference.
> +   If successful, this function will create additional dependencies between
> +   - mii->inc_insn's producers and mii->mem_insn as a consumer (if backwards)
> +   - mii->inc_insn's consumers and mii->mem_insn as a producer (if !backwards).
> +*/
>
>  static bool
>  find_inc (struct mem_inc_info *mii, bool backwards)
>  {
>    sd_iterator_def sd_it;
>    dep_t dep;
> +  sd_list_types_def mem_deps = backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW;
> +  int n_mem_deps = sd_lists_size (mii->mem_insn, mem_deps);
>
> -  sd_it = sd_iterator_start (mii->mem_insn,
> -                            backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW);
> +  sd_it = sd_iterator_start (mii->mem_insn, mem_deps);
>    while (sd_iterator_cond (&sd_it, &dep))
>      {
>        dep_node_t node = DEP_LINK_NODE (*sd_it.linkp);
>        rtx_insn *pro = DEP_PRO (dep);
>        rtx_insn *con = DEP_CON (dep);
> -      rtx_insn *inc_cand = backwards ? pro : con;
> +      rtx_insn *inc_cand;
> +      int n_inc_deps;
> +
> +      if (backwards)
> +       {
> +         inc_cand = pro;
> +         n_inc_deps = sd_lists_size (inc_cand, SD_LIST_BACK);
> +       }
> +      else
> +       {
> +         inc_cand = con;
> +         n_inc_deps = sd_lists_size (inc_cand, SD_LIST_FORW);
> +       }
> +
> +      /* In the FOR_EACH_DEP loop below we will create additional n_inc_deps
> +        for mem_insn.  This by itself is not a problem, since each mem_insn
> +        will have only a few inc_insns associated with it.  However, if
> +        we consider that a single inc_insn may have a lot of mem_insns, AND,
> +        on top of that, a few other inc_insns associated with it --
> +        those _other inc_insns_ will get (n_mem_deps * number of MEM insns)
> +        dependencies created for them.  This may cause an exponential
> +        growth of memory usage and scheduling time.
> +        See PR96388 for details.
> +        We [heuristically] use n_inc_deps as a proxy for the number of MEM
> +        insns, and drop opportunities for breaking modifiable_mem dependencies
> +        when dependency lists grow beyond reasonable size.  */
> +      if (n_mem_deps * n_inc_deps
> +         >= param_max_pending_list_length * param_max_pending_list_length)
> +       goto next;
> +
>        if (DEP_NONREG (dep) || DEP_MULTIPLE (dep))

it looks like this check is a lot cheaper than computing sd_lists_size so
can we keep that first?  sd_lists_size might be even more expensive
than parse_add_or_inc, and in principle you only need to count up
to param_max_pending_list_length * param_max_pending_list_length
for n_mem_deps and that devided by n_mem_deps when counting
n_inc_deps, but I guess that's premature optimization at that point.

So OK from my side with moving the if (DEP_NONREG (dep) || DEP_MULTIPLE (dep))
before the list counting.

Richard.

>         goto next;
> +
>        if (parse_add_or_inc (mii, inc_cand, backwards))
>         {
>           struct dep_replacement *desc;
> @@ -4838,6 +4873,8 @@ find_inc (struct mem_inc_info *mii, bool backwards)
>           desc->insn = mii->mem_insn;
>           move_dep_link (DEP_NODE_BACK (node), INSN_HARD_BACK_DEPS (con),
>                          INSN_SPEC_BACK_DEPS (con));
> +
> +         gcc_assert (mii->inc_insn == inc_cand);
>           if (backwards)
>             {
>               FOR_EACH_DEP (mii->inc_insn, SD_LIST_BACK, sd_it, dep)
> --
> 2.34.1
>
  
Maxim Kuvyrkov Nov. 20, 2023, 1:42 p.m. UTC | #2
> On Nov 20, 2023, at 17:09, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Mon, Nov 20, 2023 at 1:08 PM Maxim Kuvyrkov
> <maxim.kuvyrkov@linaro.org> wrote:
>> 
>> This patch avoids sched-deps.cc:find_inc() creating exponential number
>> of dependencies, which become memory and compilation time hogs.
>> Consider example (simplified from PR96388) ...
>> ===
>> sp=sp-4 // sp_insnA
>> mem_insnA1[sp+A1]
>> ...
>> mem_insnAN[sp+AN]
>> sp=sp-4 // sp_insnB
>> mem_insnB1[sp+B1]
>> ...
>> mem_insnBM[sp+BM]
>> ===
>> ... in this example find_modifiable_mems() will arrange for mem_insnA*
>> to be able to pass sp_insnA, and, while doing this, will create
>> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
>> is a consumer of sp_insnA.  After this sp_insnB will have N new
>> backward dependencies.
>> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
>> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
>> dependencies.
>> 
>> In PR96833's testcase N and M are 10k-15k, which causes RAM usage of
>> 30GB and compilation time of 30 minutes, with sched2 accounting for
>> 95% of both metrics.  After this patch the RAM usage is down to 1GB
>> and compilation time is down to 3-4 minutes, with sched2 no longer
>> standing out on -ftime-report or memory usage.
>> 
>> gcc/ChangeLog:
>> 
>>        PR rtl-optimization/96388
>>        PR rtl-optimization/111554
>>        * sched-deps.cc (find_inc): Avoid exponential behavior.
>> ---
>> gcc/sched-deps.cc | 45 +++++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 41 insertions(+), 4 deletions(-)
>> 
>> diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
>> index c23218890f3..397bb9fd462 100644
>> --- a/gcc/sched-deps.cc
>> +++ b/gcc/sched-deps.cc
>> @@ -4779,24 +4779,59 @@ parse_add_or_inc (struct mem_inc_info *mii, rtx_insn *insn, bool before_mem)
>> /* Once a suitable mem reference has been found and the corresponding data
>>    in MII has been filled in, this function is called to find a suitable
>>    add or inc insn involving the register we found in the memory
>> -   reference.  */
>> +   reference.
>> +   If successful, this function will create additional dependencies between
>> +   - mii->inc_insn's producers and mii->mem_insn as a consumer (if backwards)
>> +   - mii->inc_insn's consumers and mii->mem_insn as a producer (if !backwards).
>> +*/
>> 
>> static bool
>> find_inc (struct mem_inc_info *mii, bool backwards)
>> {
>>   sd_iterator_def sd_it;
>>   dep_t dep;
>> +  sd_list_types_def mem_deps = backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW;
>> +  int n_mem_deps = sd_lists_size (mii->mem_insn, mem_deps);
>> 
>> -  sd_it = sd_iterator_start (mii->mem_insn,
>> -                            backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW);
>> +  sd_it = sd_iterator_start (mii->mem_insn, mem_deps);
>>   while (sd_iterator_cond (&sd_it, &dep))
>>     {
>>       dep_node_t node = DEP_LINK_NODE (*sd_it.linkp);
>>       rtx_insn *pro = DEP_PRO (dep);
>>       rtx_insn *con = DEP_CON (dep);
>> -      rtx_insn *inc_cand = backwards ? pro : con;
>> +      rtx_insn *inc_cand;
>> +      int n_inc_deps;
>> +
>> +      if (backwards)
>> +       {
>> +         inc_cand = pro;
>> +         n_inc_deps = sd_lists_size (inc_cand, SD_LIST_BACK);
>> +       }
>> +      else
>> +       {
>> +         inc_cand = con;
>> +         n_inc_deps = sd_lists_size (inc_cand, SD_LIST_FORW);
>> +       }
>> +
>> +      /* In the FOR_EACH_DEP loop below we will create additional n_inc_deps
>> +        for mem_insn.  This by itself is not a problem, since each mem_insn
>> +        will have only a few inc_insns associated with it.  However, if
>> +        we consider that a single inc_insn may have a lot of mem_insns, AND,
>> +        on top of that, a few other inc_insns associated with it --
>> +        those _other inc_insns_ will get (n_mem_deps * number of MEM insns)
>> +        dependencies created for them.  This may cause an exponential
>> +        growth of memory usage and scheduling time.
>> +        See PR96388 for details.
>> +        We [heuristically] use n_inc_deps as a proxy for the number of MEM
>> +        insns, and drop opportunities for breaking modifiable_mem dependencies
>> +        when dependency lists grow beyond reasonable size.  */
>> +      if (n_mem_deps * n_inc_deps
>> +         >= param_max_pending_list_length * param_max_pending_list_length)
>> +       goto next;
>> +
>>       if (DEP_NONREG (dep) || DEP_MULTIPLE (dep))
> 
> it looks like this check is a lot cheaper than computing sd_lists_size so
> can we keep that first?  sd_lists_size might be even more expensive
> than parse_add_or_inc,

sd_lists_size() is cheap, it doesn't walk or count dependencies; it just adds 2-3 integers together.

The reason why sd_lists_size() has a loop is to be able to transparently handle dependencies split among sub-lists, e.g., sd_lists_size(SD_LIST_HARD_BACK | SD_LIST_SPEC_BACK) will return total number of backward dependencies.

--
Maxim Kuvyrkov
https://www.linaro.org
  
Richard Biener Nov. 20, 2023, 1:45 p.m. UTC | #3
On Mon, Nov 20, 2023 at 2:42 PM Maxim Kuvyrkov
<maxim.kuvyrkov@linaro.org> wrote:
>
> > On Nov 20, 2023, at 17:09, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Nov 20, 2023 at 1:08 PM Maxim Kuvyrkov
> > <maxim.kuvyrkov@linaro.org> wrote:
> >>
> >> This patch avoids sched-deps.cc:find_inc() creating exponential number
> >> of dependencies, which become memory and compilation time hogs.
> >> Consider example (simplified from PR96388) ...
> >> ===
> >> sp=sp-4 // sp_insnA
> >> mem_insnA1[sp+A1]
> >> ...
> >> mem_insnAN[sp+AN]
> >> sp=sp-4 // sp_insnB
> >> mem_insnB1[sp+B1]
> >> ...
> >> mem_insnBM[sp+BM]
> >> ===
> >> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> >> to be able to pass sp_insnA, and, while doing this, will create
> >> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> >> is a consumer of sp_insnA.  After this sp_insnB will have N new
> >> backward dependencies.
> >> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> >> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> >> dependencies.
> >>
> >> In PR96833's testcase N and M are 10k-15k, which causes RAM usage of
> >> 30GB and compilation time of 30 minutes, with sched2 accounting for
> >> 95% of both metrics.  After this patch the RAM usage is down to 1GB
> >> and compilation time is down to 3-4 minutes, with sched2 no longer
> >> standing out on -ftime-report or memory usage.
> >>
> >> gcc/ChangeLog:
> >>
> >>        PR rtl-optimization/96388
> >>        PR rtl-optimization/111554
> >>        * sched-deps.cc (find_inc): Avoid exponential behavior.
> >> ---
> >> gcc/sched-deps.cc | 45 +++++++++++++++++++++++++++++++++++++++++----
> >> 1 file changed, 41 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
> >> index c23218890f3..397bb9fd462 100644
> >> --- a/gcc/sched-deps.cc
> >> +++ b/gcc/sched-deps.cc
> >> @@ -4779,24 +4779,59 @@ parse_add_or_inc (struct mem_inc_info *mii, rtx_insn *insn, bool before_mem)
> >> /* Once a suitable mem reference has been found and the corresponding data
> >>    in MII has been filled in, this function is called to find a suitable
> >>    add or inc insn involving the register we found in the memory
> >> -   reference.  */
> >> +   reference.
> >> +   If successful, this function will create additional dependencies between
> >> +   - mii->inc_insn's producers and mii->mem_insn as a consumer (if backwards)
> >> +   - mii->inc_insn's consumers and mii->mem_insn as a producer (if !backwards).
> >> +*/
> >>
> >> static bool
> >> find_inc (struct mem_inc_info *mii, bool backwards)
> >> {
> >>   sd_iterator_def sd_it;
> >>   dep_t dep;
> >> +  sd_list_types_def mem_deps = backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW;
> >> +  int n_mem_deps = sd_lists_size (mii->mem_insn, mem_deps);
> >>
> >> -  sd_it = sd_iterator_start (mii->mem_insn,
> >> -                            backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW);
> >> +  sd_it = sd_iterator_start (mii->mem_insn, mem_deps);
> >>   while (sd_iterator_cond (&sd_it, &dep))
> >>     {
> >>       dep_node_t node = DEP_LINK_NODE (*sd_it.linkp);
> >>       rtx_insn *pro = DEP_PRO (dep);
> >>       rtx_insn *con = DEP_CON (dep);
> >> -      rtx_insn *inc_cand = backwards ? pro : con;
> >> +      rtx_insn *inc_cand;
> >> +      int n_inc_deps;
> >> +
> >> +      if (backwards)
> >> +       {
> >> +         inc_cand = pro;
> >> +         n_inc_deps = sd_lists_size (inc_cand, SD_LIST_BACK);
> >> +       }
> >> +      else
> >> +       {
> >> +         inc_cand = con;
> >> +         n_inc_deps = sd_lists_size (inc_cand, SD_LIST_FORW);
> >> +       }
> >> +
> >> +      /* In the FOR_EACH_DEP loop below we will create additional n_inc_deps
> >> +        for mem_insn.  This by itself is not a problem, since each mem_insn
> >> +        will have only a few inc_insns associated with it.  However, if
> >> +        we consider that a single inc_insn may have a lot of mem_insns, AND,
> >> +        on top of that, a few other inc_insns associated with it --
> >> +        those _other inc_insns_ will get (n_mem_deps * number of MEM insns)
> >> +        dependencies created for them.  This may cause an exponential
> >> +        growth of memory usage and scheduling time.
> >> +        See PR96388 for details.
> >> +        We [heuristically] use n_inc_deps as a proxy for the number of MEM
> >> +        insns, and drop opportunities for breaking modifiable_mem dependencies
> >> +        when dependency lists grow beyond reasonable size.  */
> >> +      if (n_mem_deps * n_inc_deps
> >> +         >= param_max_pending_list_length * param_max_pending_list_length)
> >> +       goto next;
> >> +
> >>       if (DEP_NONREG (dep) || DEP_MULTIPLE (dep))
> >
> > it looks like this check is a lot cheaper than computing sd_lists_size so
> > can we keep that first?  sd_lists_size might be even more expensive
> > than parse_add_or_inc,
>
> sd_lists_size() is cheap, it doesn't walk or count dependencies; it just adds 2-3 integers together.
>
> The reason why sd_lists_size() has a loop is to be able to transparently handle dependencies split among sub-lists, e.g., sd_lists_size(SD_LIST_HARD_BACK | SD_LIST_SPEC_BACK) will return total number of backward dependencies.

I see.  It still better to move the DEP_NONREG/DEP_MULTIPLE checks,
they do not depend on any
of the code above it, no?

Richard.

> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
>
  
Alexander Monakov Nov. 20, 2023, 1:52 p.m. UTC | #4
On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:

> This patch avoids sched-deps.cc:find_inc() creating exponential number
> of dependencies, which become memory and compilation time hogs.
> Consider example (simplified from PR96388) ...
> ===
> sp=sp-4 // sp_insnA
> mem_insnA1[sp+A1]
> ...
> mem_insnAN[sp+AN]
> sp=sp-4 // sp_insnB
> mem_insnB1[sp+B1]
> ...
> mem_insnBM[sp+BM]
> ===
> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> to be able to pass sp_insnA, and, while doing this, will create
> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> is a consumer of sp_insnA.  After this sp_insnB will have N new
> backward dependencies.
> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> dependencies.

It's a bit hard to read this without knowing which value of 'backwards'
is assumed.

Say 'backwards' is true and we are inspecting producer sp_insnB of mem_insnB1.
This is a true dependency. We know we can break it by adjusting B1 by -4, but
we need to be careful not to move such modified mem_insnB1 above sp_insnA, so
we need to iterate over *incoming true dependencies* of sp_insnB and add them.

But the code seems to be iterating over *all incoming dependencies*, so it
will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
understanding is correct.

Alexander
  
Maxim Kuvyrkov Nov. 20, 2023, 2:39 p.m. UTC | #5
> On Nov 20, 2023, at 17:52, Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> 
> On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:
> 
>> This patch avoids sched-deps.cc:find_inc() creating exponential number
>> of dependencies, which become memory and compilation time hogs.
>> Consider example (simplified from PR96388) ...
>> ===
>> sp=sp-4 // sp_insnA
>> mem_insnA1[sp+A1]
>> ...
>> mem_insnAN[sp+AN]
>> sp=sp-4 // sp_insnB
>> mem_insnB1[sp+B1]
>> ...
>> mem_insnBM[sp+BM]
>> ===
>> ... in this example find_modifiable_mems() will arrange for mem_insnA*
>> to be able to pass sp_insnA, and, while doing this, will create
>> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
>> is a consumer of sp_insnA.  After this sp_insnB will have N new
>> backward dependencies.
>> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
>> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
>> dependencies.

[For avoidance of doubt, below discussion is about the general implementation of find_modifiable_mems() and not about the patch.]

> 
> It's a bit hard to read this without knowing which value of 'backwards'
> is assumed.
> 
> Say 'backwards' is true and we are inspecting producer sp_insnB of mem_insnB1.
> This is a true dependency. We know we can break it by adjusting B1 by -4, but
> we need to be careful not to move such modified mem_insnB1 above sp_insnA, so
> we need to iterate over *incoming true dependencies* of sp_insnB and add them.
> 
> But the code seems to be iterating over *all incoming dependencies*, so it
> will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
> dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
> understanding is correct.

Yeap, your understanding is correct.  However, this is what find_modifiable_mems() has to do to avoid complicated analysis of second-level dependencies.

I think, the optimization that find_modifiable_mems() does can be implemented as part of main dependency analysis, and I'm going to discuss that with Bernd.  I might be missing something, but it seems that instruction transformations that find_modifiable_mems() is doing are simpler than what ia64 speculation is doing.  So, I think, we should be able to leverage speculation infrastructure, rather than doing this optimization "on the side" of normal scheduling.

--
Maxim Kuvyrkov
https://www.linaro.org
  
Alexander Monakov Nov. 20, 2023, 4:30 p.m. UTC | #6
On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:

> > On Nov 20, 2023, at 17:52, Alexander Monakov <amonakov@ispras.ru> wrote:
> > 
> > 
> > On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:
> > 
> >> This patch avoids sched-deps.cc:find_inc() creating exponential number
> >> of dependencies, which become memory and compilation time hogs.
> >> Consider example (simplified from PR96388) ...
> >> ===
> >> sp=sp-4 // sp_insnA
> >> mem_insnA1[sp+A1]
> >> ...
> >> mem_insnAN[sp+AN]
> >> sp=sp-4 // sp_insnB
> >> mem_insnB1[sp+B1]
> >> ...
> >> mem_insnBM[sp+BM]
> >> ===
> >> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> >> to be able to pass sp_insnA, and, while doing this, will create
> >> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> >> is a consumer of sp_insnA.  After this sp_insnB will have N new
> >> backward dependencies.
> >> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> >> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> >> dependencies.
> 
> [For avoidance of doubt, below discussion is about the general implementation
> of find_modifiable_mems() and not about the patch.]

I was saying the commit message is hard to read (unless it's just me).

> > It's a bit hard to read this without knowing which value of 'backwards'
> > is assumed.
> > 
> > Say 'backwards' is true and we are inspecting producer sp_insnB of mem_insnB1.
> > This is a true dependency. We know we can break it by adjusting B1 by -4, but
> > we need to be careful not to move such modified mem_insnB1 above sp_insnA, so
> > we need to iterate over *incoming true dependencies* of sp_insnB and add them.
> > 
> > But the code seems to be iterating over *all incoming dependencies*, so it
> > will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
> > dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
> > understanding is correct.
> 
> Yeap, your understanding is correct.  However, this is what
> find_modifiable_mems() has to do to avoid complicated analysis of second-level
> dependencies.

What is the reason it cannot simply skip anti-dependencies in the
'if (backwards)' loop, and true dependencies in the 'else' loop?

Alexander
  
Maxim Kuvyrkov Nov. 20, 2023, 6:59 p.m. UTC | #7
> On Nov 20, 2023, at 17:45, Richard Biener <richard.guenther@gmail.com> wrote:
> 
> On Mon, Nov 20, 2023 at 2:42 PM Maxim Kuvyrkov
> <maxim.kuvyrkov@linaro.org> wrote:
>> 
>>> On Nov 20, 2023, at 17:09, Richard Biener <richard.guenther@gmail.com> wrote:
>>> 
>>> On Mon, Nov 20, 2023 at 1:08 PM Maxim Kuvyrkov
>>> <maxim.kuvyrkov@linaro.org> wrote:
>>>> 
>>>> This patch avoids sched-deps.cc:find_inc() creating exponential number
>>>> of dependencies, which become memory and compilation time hogs.
>>>> Consider example (simplified from PR96388) ...
>>>> ===
>>>> sp=sp-4 // sp_insnA
>>>> mem_insnA1[sp+A1]
>>>> ...
>>>> mem_insnAN[sp+AN]
>>>> sp=sp-4 // sp_insnB
>>>> mem_insnB1[sp+B1]
>>>> ...
>>>> mem_insnBM[sp+BM]
>>>> ===
>>>> ... in this example find_modifiable_mems() will arrange for mem_insnA*
>>>> to be able to pass sp_insnA, and, while doing this, will create
>>>> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
>>>> is a consumer of sp_insnA.  After this sp_insnB will have N new
>>>> backward dependencies.
>>>> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
>>>> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
>>>> dependencies.
>>>> 
>>>> In PR96833's testcase N and M are 10k-15k, which causes RAM usage of
>>>> 30GB and compilation time of 30 minutes, with sched2 accounting for
>>>> 95% of both metrics.  After this patch the RAM usage is down to 1GB
>>>> and compilation time is down to 3-4 minutes, with sched2 no longer
>>>> standing out on -ftime-report or memory usage.
>>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>>       PR rtl-optimization/96388
>>>>       PR rtl-optimization/111554
>>>>       * sched-deps.cc (find_inc): Avoid exponential behavior.
>>>> ---
>>>> gcc/sched-deps.cc | 45 +++++++++++++++++++++++++++++++++++++++++----
>>>> 1 file changed, 41 insertions(+), 4 deletions(-)
>>>> 
>>>> diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
>>>> index c23218890f3..397bb9fd462 100644
>>>> --- a/gcc/sched-deps.cc
>>>> +++ b/gcc/sched-deps.cc
>>>> @@ -4779,24 +4779,59 @@ parse_add_or_inc (struct mem_inc_info *mii, rtx_insn *insn, bool before_mem)
>>>> /* Once a suitable mem reference has been found and the corresponding data
>>>>   in MII has been filled in, this function is called to find a suitable
>>>>   add or inc insn involving the register we found in the memory
>>>> -   reference.  */
>>>> +   reference.
>>>> +   If successful, this function will create additional dependencies between
>>>> +   - mii->inc_insn's producers and mii->mem_insn as a consumer (if backwards)
>>>> +   - mii->inc_insn's consumers and mii->mem_insn as a producer (if !backwards).
>>>> +*/
>>>> 
>>>> static bool
>>>> find_inc (struct mem_inc_info *mii, bool backwards)
>>>> {
>>>>  sd_iterator_def sd_it;
>>>>  dep_t dep;
>>>> +  sd_list_types_def mem_deps = backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW;
>>>> +  int n_mem_deps = sd_lists_size (mii->mem_insn, mem_deps);
>>>> 
>>>> -  sd_it = sd_iterator_start (mii->mem_insn,
>>>> -                            backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW);
>>>> +  sd_it = sd_iterator_start (mii->mem_insn, mem_deps);
>>>>  while (sd_iterator_cond (&sd_it, &dep))
>>>>    {
>>>>      dep_node_t node = DEP_LINK_NODE (*sd_it.linkp);
>>>>      rtx_insn *pro = DEP_PRO (dep);
>>>>      rtx_insn *con = DEP_CON (dep);
>>>> -      rtx_insn *inc_cand = backwards ? pro : con;
>>>> +      rtx_insn *inc_cand;
>>>> +      int n_inc_deps;
>>>> +
>>>> +      if (backwards)
>>>> +       {
>>>> +         inc_cand = pro;
>>>> +         n_inc_deps = sd_lists_size (inc_cand, SD_LIST_BACK);
>>>> +       }
>>>> +      else
>>>> +       {
>>>> +         inc_cand = con;
>>>> +         n_inc_deps = sd_lists_size (inc_cand, SD_LIST_FORW);
>>>> +       }
>>>> +
>>>> +      /* In the FOR_EACH_DEP loop below we will create additional n_inc_deps
>>>> +        for mem_insn.  This by itself is not a problem, since each mem_insn
>>>> +        will have only a few inc_insns associated with it.  However, if
>>>> +        we consider that a single inc_insn may have a lot of mem_insns, AND,
>>>> +        on top of that, a few other inc_insns associated with it --
>>>> +        those _other inc_insns_ will get (n_mem_deps * number of MEM insns)
>>>> +        dependencies created for them.  This may cause an exponential
>>>> +        growth of memory usage and scheduling time.
>>>> +        See PR96388 for details.
>>>> +        We [heuristically] use n_inc_deps as a proxy for the number of MEM
>>>> +        insns, and drop opportunities for breaking modifiable_mem dependencies
>>>> +        when dependency lists grow beyond reasonable size.  */
>>>> +      if (n_mem_deps * n_inc_deps
>>>> +         >= param_max_pending_list_length * param_max_pending_list_length)
>>>> +       goto next;
>>>> +
>>>>      if (DEP_NONREG (dep) || DEP_MULTIPLE (dep))
>>> 
>>> it looks like this check is a lot cheaper than computing sd_lists_size so
>>> can we keep that first?  sd_lists_size might be even more expensive
>>> than parse_add_or_inc,
>> 
>> sd_lists_size() is cheap, it doesn't walk or count dependencies; it just adds 2-3 integers together.
>> 
>> The reason why sd_lists_size() has a loop is to be able to transparently handle dependencies split among sub-lists, e.g., sd_lists_size(SD_LIST_HARD_BACK | SD_LIST_SPEC_BACK) will return total number of backward dependencies.
> 
> I see.  It still better to move the DEP_NONREG/DEP_MULTIPLE checks,
> they do not depend on any
> of the code above it, no?

Yes.  Adjusted in v2.

Thanks,

--
Maxim Kuvyrkov
https://www.linaro.org
  
Maxim Kuvyrkov Nov. 21, 2023, 10:32 a.m. UTC | #8
> On Nov 20, 2023, at 20:30, Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> 
> On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:
> 
>>> On Nov 20, 2023, at 17:52, Alexander Monakov <amonakov@ispras.ru> wrote:
>>> 
>>> 
>>> On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:
>>> 
>>>> This patch avoids sched-deps.cc:find_inc() creating exponential number
>>>> of dependencies, which become memory and compilation time hogs.
>>>> Consider example (simplified from PR96388) ...
>>>> ===
>>>> sp=sp-4 // sp_insnA
>>>> mem_insnA1[sp+A1]
>>>> ...
>>>> mem_insnAN[sp+AN]
>>>> sp=sp-4 // sp_insnB
>>>> mem_insnB1[sp+B1]
>>>> ...
>>>> mem_insnBM[sp+BM]
>>>> ===
>>>> ... in this example find_modifiable_mems() will arrange for mem_insnA*
>>>> to be able to pass sp_insnA, and, while doing this, will create
>>>> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
>>>> is a consumer of sp_insnA.  After this sp_insnB will have N new
>>>> backward dependencies.
>>>> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
>>>> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
>>>> dependencies.
>> 
>> [For avoidance of doubt, below discussion is about the general implementation
>> of find_modifiable_mems() and not about the patch.]
> 
> I was saying the commit message is hard to read (unless it's just me).
> 
>>> It's a bit hard to read this without knowing which value of 'backwards'
>>> is assumed.

Oh, sorry, I misunderstood your comment.

In the above example I want to describe outcome that current code generates, without going into details about exactly how it does it.  I'm not sure how to make it more readable, and would appreciate suggestions.


>>> 
>>> Say 'backwards' is true and we are inspecting producer sp_insnB of mem_insnB1.
>>> This is a true dependency. We know we can break it by adjusting B1 by -4, but
>>> we need to be careful not to move such modified mem_insnB1 above sp_insnA, so
>>> we need to iterate over *incoming true dependencies* of sp_insnB and add them.
>>> 
>>> But the code seems to be iterating over *all incoming dependencies*, so it
>>> will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
>>> dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
>>> understanding is correct.
>> 
>> Yeap, your understanding is correct.  However, this is what
>> find_modifiable_mems() has to do to avoid complicated analysis of second-level
>> dependencies.
> 
> What is the reason it cannot simply skip anti-dependencies in the
> 'if (backwards)' loop, and true dependencies in the 'else' loop?

I /think/, this should be possible.  However, rather than improving current implementation my preference is to rework it by integrating with the main dependency analysis.  This should provide both faster and more precise dependency analysis, which would generate breakable addr/mem dependencies.

Thanks,

--
Maxim Kuvyrkov
https://www.linaro.org
  
Alexander Monakov Nov. 21, 2023, 11:05 a.m. UTC | #9
On Tue, 21 Nov 2023, Maxim Kuvyrkov wrote:

> >>>> This patch avoids sched-deps.cc:find_inc() creating exponential number
> >>>> of dependencies, which become memory and compilation time hogs.
> >>>> Consider example (simplified from PR96388) ...
> >>>> ===
> >>>> sp=sp-4 // sp_insnA
> >>>> mem_insnA1[sp+A1]
> >>>> ...
> >>>> mem_insnAN[sp+AN]
> >>>> sp=sp-4 // sp_insnB
> >>>> mem_insnB1[sp+B1]
> >>>> ...
> >>>> mem_insnBM[sp+BM]
> >>>> ===
> >>>> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> >>>> to be able to pass sp_insnA, and, while doing this, will create
> >>>> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> >>>> is a consumer of sp_insnA.  After this sp_insnB will have N new
> >>>> backward dependencies.
> >>>> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> >>>> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> >>>> dependencies.
> >> 
> >> [For avoidance of doubt, below discussion is about the general implementation
> >> of find_modifiable_mems() and not about the patch.]
> > 
> > I was saying the commit message is hard to read (unless it's just me).
> > 
> >>> It's a bit hard to read this without knowing which value of 'backwards'
> >>> is assumed.
> 
> Oh, sorry, I misunderstood your comment.
> 
> In the above example I want to describe outcome that current code generates,
> without going into details about exactly how it does it.  I'm not sure how to
> make it more readable, and would appreciate suggestions.

I think it would be easier to follow if you could fix a specific value of
'backwards' up front, and then ensure all following statements are consistent
with that, like I did in my explanation. Please feel free to pick up my text
into the commit message, if it helps.

> >>> Say 'backwards' is true and we are inspecting producer sp_insnB of mem_insnB1.
> >>> This is a true dependency. We know we can break it by adjusting B1 by -4, but
> >>> we need to be careful not to move such modified mem_insnB1 above sp_insnA, so
> >>> we need to iterate over *incoming true dependencies* of sp_insnB and add them.
> >>> 
> >>> But the code seems to be iterating over *all incoming dependencies*, so it
> >>> will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
> >>> dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
> >>> understanding is correct.
> >> 
> >> Yeap, your understanding is correct.  However, this is what
> >> find_modifiable_mems() has to do to avoid complicated analysis of second-level
> >> dependencies.
> > 
> > What is the reason it cannot simply skip anti-dependencies in the
> > 'if (backwards)' loop, and true dependencies in the 'else' loop?
> 
> I /think/, this should be possible.  However, rather than improving current
> implementation my preference is to rework it by integrating with the main
> dependency analysis.  This should provide both faster and more precise
> dependency analysis, which would generate breakable addr/mem dependencies.

I see, thank you.

Alexander
  

Patch

diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
index c23218890f3..397bb9fd462 100644
--- a/gcc/sched-deps.cc
+++ b/gcc/sched-deps.cc
@@ -4779,24 +4779,59 @@  parse_add_or_inc (struct mem_inc_info *mii, rtx_insn *insn, bool before_mem)
 /* Once a suitable mem reference has been found and the corresponding data
    in MII has been filled in, this function is called to find a suitable
    add or inc insn involving the register we found in the memory
-   reference.  */
+   reference.
+   If successful, this function will create additional dependencies between
+   - mii->inc_insn's producers and mii->mem_insn as a consumer (if backwards)
+   - mii->inc_insn's consumers and mii->mem_insn as a producer (if !backwards).
+*/
 
 static bool
 find_inc (struct mem_inc_info *mii, bool backwards)
 {
   sd_iterator_def sd_it;
   dep_t dep;
+  sd_list_types_def mem_deps = backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW;
+  int n_mem_deps = sd_lists_size (mii->mem_insn, mem_deps);
 
-  sd_it = sd_iterator_start (mii->mem_insn,
-			     backwards ? SD_LIST_HARD_BACK : SD_LIST_FORW);
+  sd_it = sd_iterator_start (mii->mem_insn, mem_deps);
   while (sd_iterator_cond (&sd_it, &dep))
     {
       dep_node_t node = DEP_LINK_NODE (*sd_it.linkp);
       rtx_insn *pro = DEP_PRO (dep);
       rtx_insn *con = DEP_CON (dep);
-      rtx_insn *inc_cand = backwards ? pro : con;
+      rtx_insn *inc_cand;
+      int n_inc_deps;
+
+      if (backwards)
+	{
+	  inc_cand = pro;
+	  n_inc_deps = sd_lists_size (inc_cand, SD_LIST_BACK);
+	}
+      else
+	{
+	  inc_cand = con;
+	  n_inc_deps = sd_lists_size (inc_cand, SD_LIST_FORW);
+	}
+
+      /* In the FOR_EACH_DEP loop below we will create additional n_inc_deps
+	 for mem_insn.  This by itself is not a problem, since each mem_insn
+	 will have only a few inc_insns associated with it.  However, if
+	 we consider that a single inc_insn may have a lot of mem_insns, AND,
+	 on top of that, a few other inc_insns associated with it --
+	 those _other inc_insns_ will get (n_mem_deps * number of MEM insns)
+	 dependencies created for them.  This may cause an exponential
+	 growth of memory usage and scheduling time.
+	 See PR96388 for details.
+	 We [heuristically] use n_inc_deps as a proxy for the number of MEM
+	 insns, and drop opportunities for breaking modifiable_mem dependencies
+	 when dependency lists grow beyond reasonable size.  */
+      if (n_mem_deps * n_inc_deps
+	  >= param_max_pending_list_length * param_max_pending_list_length)
+	goto next;
+
       if (DEP_NONREG (dep) || DEP_MULTIPLE (dep))
 	goto next;
+
       if (parse_add_or_inc (mii, inc_cand, backwards))
 	{
 	  struct dep_replacement *desc;
@@ -4838,6 +4873,8 @@  find_inc (struct mem_inc_info *mii, bool backwards)
 	  desc->insn = mii->mem_insn;
 	  move_dep_link (DEP_NODE_BACK (node), INSN_HARD_BACK_DEPS (con),
 			 INSN_SPEC_BACK_DEPS (con));
+
+	  gcc_assert (mii->inc_insn == inc_cand);
 	  if (backwards)
 	    {
 	      FOR_EACH_DEP (mii->inc_insn, SD_LIST_BACK, sd_it, dep)