[COMMITTED,1/2] sched1: parameterize pressure scheduling spilling aggressiveness [PR/114729]
Checks
Context |
Check |
Description |
rivoscibot/toolchain-ci-rivos-lint |
warning
|
Lint failed
|
rivoscibot/toolchain-ci-rivos-apply-patch |
success
|
Patch applied
|
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gcv-lp64d-multilib |
success
|
Build passed
|
rivoscibot/toolchain-ci-rivos-build--linux-rv64gcv-lp64d-multilib |
success
|
Build passed
|
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc_zba_zbb_zbc_zbs-lp64d-multilib |
success
|
Build passed
|
rivoscibot/toolchain-ci-rivos-build--newlib-rv64gc-lp64d-non-multilib |
success
|
Build passed
|
rivoscibot/toolchain-ci-rivos-build--linux-rv64gc-lp64d-non-multilib |
success
|
Build passed
|
rivoscibot/toolchain-ci-rivos-test |
success
|
Testing passed
|
linaro-tcwg-bot/tcwg_gcc_build--master-arm |
fail
|
Patch failed to apply
|
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 |
fail
|
Patch failed to apply
|
Commit Message
sched1 computes ECC (Excess Change Cost) for each insn, which represents
the register pressure attributed to the insn.
Currently the pressure sensitive scheduling algorithm deliberately ignores
negative ECC values (pressure reduction), making them 0 (neutral), leading
to more spills. This happens due to the assumption that the compiler has
a reasonably accurate processor pipeline scheduling model and thus tries
to aggresively fill pipeline bubbles with spill slots.
This however might not be true, as the model might not be available for
certains uarches or even applicable especially for modern out-of-order cores.
The existing heuristic induces spill frenzy on RISC-V, noticably so on
SPEC2017 507.Cactu. If insn scheduling is disabled completely, the
total dynamic icounts for this workload are reduced in half from
~2.5 trillion insns to ~1.3 (w/ -fno-schedule-insns).
This patch adds --param=cycle-accurate-model={0,1} to gate the spill
behavior.
- The default (1) preserves existing spill behavior.
- targets/uarches sensitive to spilling can override the param to (0)
to get the reverse effect. RISC-V backend does so too.
The actual perf numbers are very promising.
(1) On RISC-V BPI-F3 in-order CPU, -Ofast -march=rv64gcv_zba_zbb_zbs:
Before:
------
Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
4,917,712.97 msec task-clock:u # 1.000 CPUs utilized
5,314 context-switches:u # 1.081 /sec
3 cpu-migrations:u # 0.001 /sec
204,784 page-faults:u # 41.642 /sec
7,868,291,222,513 cycles:u # 1.600 GHz
2,615,069,866,153 instructions:u # 0.33 insn per cycle
10,799,381,890 branches:u # 2.196 M/sec
15,714,572 branch-misses:u # 0.15% of all branches
After:
-----
Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
4,552,979.58 msec task-clock:u # 0.998 CPUs utilized
205,020 context-switches:u # 45.030 /sec
2 cpu-migrations:u # 0.000 /sec
204,221 page-faults:u # 44.854 /sec
7,285,176,204,764 cycles:u (7.4% faster) # 1.600 GHz
2,145,284,345,397 instructions:u (17.96% fewer) # 0.29 insn per cycle
10,799,382,011 branches:u # 2.372 M/sec
16,235,628 branch-misses:u # 0.15% of all branches
(2) Wilco reported 20% perf gains on aarch64 Neoverse V2 runs.
gcc/ChangeLog:
PR target/11472
* params.opt (--param=cycle-accurate-model=): New opt.
* doc/invoke.texi (cycle-accurate-model): Document.
* haifa-sched.cc (model_excess_group_cost): Return negative
delta if param_cycle_accurate_model is 0.
(model_excess_cost): Ceil negative baseECC to 0 only if
param_cycle_accurate_model is 1.
Dump the actual ECC value.
* config/riscv/riscv.cc (riscv_option_override): Set param
to 0.
gcc/testsuite/ChangeLog:
PR target/114729
* gcc.target/riscv/riscv.exp: Enable new tests to build.
* gcc.target/riscv/sched1-spills/spill1.cpp: Add new test.
Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
---
gcc/config/riscv/riscv.cc | 4 +++
gcc/doc/invoke.texi | 10 ++++++
gcc/haifa-sched.cc | 32 ++++++++++++++-----
gcc/params.opt | 4 +++
gcc/testsuite/gcc.target/riscv/riscv.exp | 2 ++
.../gcc.target/riscv/sched1-spills/spill1.cpp | 32 +++++++++++++++++++
6 files changed, 76 insertions(+), 8 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/sched1-spills/spill1.cpp
Comments
On 04/12/24 11:03 -0800, Vineet Gupta wrote:
>sched1 computes ECC (Excess Change Cost) for each insn, which represents
>the register pressure attributed to the insn.
>Currently the pressure sensitive scheduling algorithm deliberately ignores
>negative ECC values (pressure reduction), making them 0 (neutral), leading
>to more spills. This happens due to the assumption that the compiler has
>a reasonably accurate processor pipeline scheduling model and thus tries
>to aggresively fill pipeline bubbles with spill slots.
>
>This however might not be true, as the model might not be available for
>certains uarches or even applicable especially for modern out-of-order cores.
>
>The existing heuristic induces spill frenzy on RISC-V, noticably so on
>SPEC2017 507.Cactu. If insn scheduling is disabled completely, the
>total dynamic icounts for this workload are reduced in half from
>~2.5 trillion insns to ~1.3 (w/ -fno-schedule-insns).
>
>This patch adds --param=cycle-accurate-model={0,1} to gate the spill
>behavior.
>
> - The default (1) preserves existing spill behavior.
>
> - targets/uarches sensitive to spilling can override the param to (0)
> to get the reverse effect. RISC-V backend does so too.
>
>The actual perf numbers are very promising.
>
>(1) On RISC-V BPI-F3 in-order CPU, -Ofast -march=rv64gcv_zba_zbb_zbs:
>
> Before:
> ------
> Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
>
> 4,917,712.97 msec task-clock:u # 1.000 CPUs utilized
> 5,314 context-switches:u # 1.081 /sec
> 3 cpu-migrations:u # 0.001 /sec
> 204,784 page-faults:u # 41.642 /sec
> 7,868,291,222,513 cycles:u # 1.600 GHz
> 2,615,069,866,153 instructions:u # 0.33 insn per cycle
> 10,799,381,890 branches:u # 2.196 M/sec
> 15,714,572 branch-misses:u # 0.15% of all branches
>
> After:
> -----
> Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
>
> 4,552,979.58 msec task-clock:u # 0.998 CPUs utilized
> 205,020 context-switches:u # 45.030 /sec
> 2 cpu-migrations:u # 0.000 /sec
> 204,221 page-faults:u # 44.854 /sec
> 7,285,176,204,764 cycles:u (7.4% faster) # 1.600 GHz
> 2,145,284,345,397 instructions:u (17.96% fewer) # 0.29 insn per cycle
> 10,799,382,011 branches:u # 2.372 M/sec
> 16,235,628 branch-misses:u # 0.15% of all branches
>
>(2) Wilco reported 20% perf gains on aarch64 Neoverse V2 runs.
>
>gcc/ChangeLog:
> PR target/11472
Note that you typo'd the PR number here, so that it added a comment
to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11472
The server-side git hooks that check commit logs didn't notice it,
because you used [PR/114729] in the subject line, which should be in
the form [PR114729]. If you'd used the correct form, the hooks would
have told you that the two PR numbers didn't match.
I see you've been using the PR/nnn form for all your commits, please
use the [PRnnn] form as described at https://gcc.gnu.org/contribute.html#patches
On 04/12/24 19:48 +0000, Jonathan Wakely wrote:
>On 04/12/24 11:03 -0800, Vineet Gupta wrote:
>>sched1 computes ECC (Excess Change Cost) for each insn, which represents
>>the register pressure attributed to the insn.
>>Currently the pressure sensitive scheduling algorithm deliberately ignores
>>negative ECC values (pressure reduction), making them 0 (neutral), leading
>>to more spills. This happens due to the assumption that the compiler has
>>a reasonably accurate processor pipeline scheduling model and thus tries
>>to aggresively fill pipeline bubbles with spill slots.
>>
>>This however might not be true, as the model might not be available for
>>certains uarches or even applicable especially for modern out-of-order cores.
>>
>>The existing heuristic induces spill frenzy on RISC-V, noticably so on
>>SPEC2017 507.Cactu. If insn scheduling is disabled completely, the
>>total dynamic icounts for this workload are reduced in half from
>>~2.5 trillion insns to ~1.3 (w/ -fno-schedule-insns).
>>
>>This patch adds --param=cycle-accurate-model={0,1} to gate the spill
>>behavior.
>>
>>- The default (1) preserves existing spill behavior.
>>
>>- targets/uarches sensitive to spilling can override the param to (0)
>> to get the reverse effect. RISC-V backend does so too.
>>
>>The actual perf numbers are very promising.
>>
>>(1) On RISC-V BPI-F3 in-order CPU, -Ofast -march=rv64gcv_zba_zbb_zbs:
>>
>> Before:
>> ------
>> Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
>>
>> 4,917,712.97 msec task-clock:u # 1.000 CPUs utilized
>> 5,314 context-switches:u # 1.081 /sec
>> 3 cpu-migrations:u # 0.001 /sec
>> 204,784 page-faults:u # 41.642 /sec
>>7,868,291,222,513 cycles:u # 1.600 GHz
>>2,615,069,866,153 instructions:u # 0.33 insn per cycle
>> 10,799,381,890 branches:u # 2.196 M/sec
>> 15,714,572 branch-misses:u # 0.15% of all branches
>>
>> After:
>> -----
>> Performance counter stats for './cactusBSSN_r_base.rivos spec_ref.par':
>>
>> 4,552,979.58 msec task-clock:u # 0.998 CPUs utilized
>> 205,020 context-switches:u # 45.030 /sec
>> 2 cpu-migrations:u # 0.000 /sec
>> 204,221 page-faults:u # 44.854 /sec
>>7,285,176,204,764 cycles:u (7.4% faster) # 1.600 GHz
>>2,145,284,345,397 instructions:u (17.96% fewer) # 0.29 insn per cycle
>> 10,799,382,011 branches:u # 2.372 M/sec
>> 16,235,628 branch-misses:u # 0.15% of all branches
>>
>>(2) Wilco reported 20% perf gains on aarch64 Neoverse V2 runs.
>>
>>gcc/ChangeLog:
>> PR target/11472
>
>Note that you typo'd the PR number here, so that it added a comment
>to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11472
>
>The server-side git hooks that check commit logs didn't notice it,
>because you used [PR/114729] in the subject line, which should be in
>the form [PR114729]. If you'd used the correct form, the hooks would
>have told you that the two PR numbers didn't match.
>
>I see you've been using the PR/nnn form for all your commits, please
>use the [PRnnn] form as described at https://gcc.gnu.org/contribute.html#patches
Also it looks like the actual component in bugzilla is
"rtl-optimization" not "target", so shouldn't the ChangeLog entry use
"PR rtl-optimization/114729" ?
On 12/4/24 11:48, Jonathan Wakely wrote:
>> gcc/ChangeLog:
>> PR target/11472
> Note that you typo'd the PR number here, so that it added a comment
> to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11472
Apologies my bad.
In my defense I ran following:
./contrib/gcc-changelog/git_check_commit.py
./contrib/check_GNU_style.sh
> The server-side git hooks that check commit logs didn't notice it,
> because you used [PR/114729] in the subject line, which should be in
> the form [PR114729]. If you'd used the correct form, the hooks would
> have told you that the two PR numbers didn't match.
>
> I see you've been using the PR/nnn form for all your commits, please
> use the [PRnnn] form as described at https://gcc.gnu.org/contribute.html#patches
Alright I'll do this going fwd.
Thx,
-Vineet
On 12/4/24 11:52, Jonathan Wakely wrote:
>> I see you've been using the PR/nnn form for all your commits, please
>> use the [PRnnn] form as described at https://gcc.gnu.org/contribute.html#patches
> Also it looks like the actual component in bugzilla is
> "rtl-optimization" not "target", so shouldn't the ChangeLog entry use
> "PR rtl-optimization/114729" ?
A silly question perhaps, does RTL component refer to md type changes or any RTL
pass change.
This turned out to be a non-target specific generic change to RTL pass sched1.
-Vineet
On Wed, 4 Dec 2024 at 20:47, Vineet Gupta <vineetg@rivosinc.com> wrote:
>
> On 12/4/24 11:52, Jonathan Wakely wrote:
> >> I see you've been using the PR/nnn form for all your commits, please
> >> use the [PRnnn] form as described at https://gcc.gnu.org/contribute.html#patches
> > Also it looks like the actual component in bugzilla is
> > "rtl-optimization" not "target", so shouldn't the ChangeLog entry use
> > "PR rtl-optimization/114729" ?
>
> A silly question perhaps, does RTL component refer to md type changes or any RTL
> pass change.
No idea, sorry!
> This turned out to be a non-target specific generic change to RTL pass sched1.
>
> -Vineet
>
On Wed, 4 Dec 2024 at 20:43, Vineet Gupta <vineetg@rivosinc.com> wrote:
>
> On 12/4/24 11:48, Jonathan Wakely wrote:
> >> gcc/ChangeLog:
> >> PR target/11472
> > Note that you typo'd the PR number here, so that it added a comment
> > to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11472
>
> Apologies my bad.
>
> In my defense I ran following:
>
> ./contrib/gcc-changelog/git_check_commit.py
Yes, it seems this script doesn't notice this problem. It will
complain if the summary line has [PRnnn] and the ChangeLog part
doesn't have a matching PR, but apparently it doesn't complain the
other way around.
Or maybe it didn't complain because the ChangeLog part contained
*both* PR target/114729 and PR target/11472, and so the PR/114729 in
the summary did match something in the ChangeLog part.
> ./contrib/check_GNU_style.sh
>
>
> > The server-side git hooks that check commit logs didn't notice it,
> > because you used [PR/114729] in the subject line, which should be in
> > the form [PR114729]. If you'd used the correct form, the hooks would
> > have told you that the two PR numbers didn't match.
> >
> > I see you've been using the PR/nnn form for all your commits, please
> > use the [PRnnn] form as described at https://gcc.gnu.org/contribute.html#patches
>
> Alright I'll do this going fwd.
>
> Thx,
> -Vineet
>
On Wed, Dec 4, 2024 at 1:28 PM Jonathan Wakely <jwakely@redhat.com> wrote:
>
> On Wed, 4 Dec 2024 at 20:43, Vineet Gupta <vineetg@rivosinc.com> wrote:
> >
> > On 12/4/24 11:48, Jonathan Wakely wrote:
> > >> gcc/ChangeLog:
> > >> PR target/11472
> > > Note that you typo'd the PR number here, so that it added a comment
> > > to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11472
> >
> > Apologies my bad.
> >
> > In my defense I ran following:
> >
> > ./contrib/gcc-changelog/git_check_commit.py
>
> Yes, it seems this script doesn't notice this problem. It will
> complain if the summary line has [PRnnn] and the ChangeLog part
> doesn't have a matching PR, but apparently it doesn't complain the
> other way around.
> Or maybe it didn't complain because the ChangeLog part contained
> *both* PR target/114729 and PR target/11472, and so the PR/114729 in
> the summary did match something in the ChangeLog part.
Yes it only sees if the summary has a reference to bug #, it is in the
changelog too and not if there is one in the changelog, there is also
one in the summary.
In this case since there is one in the summary, it matches against the
one in the changelog; but it does not check the others.
This is because sometimes you have a few different PRs but there is a
main PR # in the summary.
One thing that is useful is just placing the PR # reference before the
full changelog instead of inside each one. This reduces the change of
a typo (though not all the way).
Thanks,
Andrew Pinski
>
>
> > ./contrib/check_GNU_style.sh
> >
> >
> > > The server-side git hooks that check commit logs didn't notice it,
> > > because you used [PR/114729] in the subject line, which should be in
> > > the form [PR114729]. If you'd used the correct form, the hooks would
> > > have told you that the two PR numbers didn't match.
> > >
> > > I see you've been using the PR/nnn form for all your commits, please
> > > use the [PRnnn] form as described at https://gcc.gnu.org/contribute.html#patches
> >
> > Alright I'll do this going fwd.
> >
> > Thx,
> > -Vineet
> >
>
@@ -10616,6 +10616,10 @@ riscv_option_override (void)
param_sched_pressure_algorithm,
SCHED_PRESSURE_MODEL);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_cycle_accurate_model,
+ 0);
+
/* Function to allocate machine-dependent function status. */
init_machine_status = &riscv_init_machine_status;
@@ -17211,6 +17211,16 @@ With @option{--param=openacc-privatization=quiet}, don't diagnose.
This is the current default.
With @option{--param=openacc-privatization=noisy}, do diagnose.
+@item cycle-accurate-model
+Specifies whether GCC should assume that the scheduling description is mostly
+a cycle-accurate model of the target processor, where the code is intended to
+run on, in the absence of cache misses. Nonzero means that the selected
+scheduling model is accurate and likely describes an in-order processor,
+and that scheduling will aggressively spill to try and fill any pipeline
+bubbles. This is the current default. Zero could mean scheduling descrption
+might not be available/accurate or perhaps not applicale at all such as for
+modern out-of-order processors.
+
@end table
The following choices of @var{name} are available on AArch64 targets:
@@ -2398,11 +2398,18 @@ model_excess_group_cost (struct model_pressure_group *group,
int pressure, cl;
cl = ira_pressure_classes[pci];
- if (delta < 0 && point >= group->limits[pci].point)
+ if (delta < 0)
{
- pressure = MAX (group->limits[pci].orig_pressure,
- curr_reg_pressure[cl] + delta);
- return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]);
+ if (point >= group->limits[pci].point)
+ {
+ pressure = MAX (group->limits[pci].orig_pressure,
+ curr_reg_pressure[cl] + delta);
+ return -model_spill_cost (cl, pressure, curr_reg_pressure[cl]);
+ }
+ /* if target prefers fewer spills, return the -ve delta indicating
+ pressure reduction. */
+ else if (!param_cycle_accurate_model)
+ return delta;
}
if (delta > 0)
@@ -2453,7 +2460,7 @@ model_excess_cost (rtx_insn *insn, bool print_p)
}
if (print_p)
- fprintf (sched_dump, "\n");
+ fprintf (sched_dump, " ECC %d\n", cost);
return cost;
}
@@ -2489,8 +2496,9 @@ model_set_excess_costs (rtx_insn **insns, int count)
bool print_p;
/* Record the baseECC value for each instruction in the model schedule,
- except that negative costs are converted to zero ones now rather than
- later. Do not assign a cost to debug instructions, since they must
+ except that for targets which prefer wider schedules (more spills)
+ negative costs are converted to zero ones now rather than later.
+ Do not assign a cost to debug instructions, since they must
not change code-generation decisions. Experiments suggest we also
get better results by not assigning a cost to instructions from
a different block.
@@ -2512,7 +2520,7 @@ model_set_excess_costs (rtx_insn **insns, int count)
print_p = true;
}
cost = model_excess_cost (insns[i], print_p);
- if (cost <= 0)
+ if (param_cycle_accurate_model && cost <= 0)
{
priority = INSN_PRIORITY (insns[i]) - insn_delay (insns[i]) - cost;
priority_base = MAX (priority_base, priority);
@@ -2523,6 +2531,14 @@ model_set_excess_costs (rtx_insn **insns, int count)
if (print_p)
fprintf (sched_dump, MODEL_BAR);
+ /* Typically in-order cores have a good pipeline scheduling model and the
+ algorithm would try to use that to minimize bubbles, favoring spills.
+ MAX (baseECC, 0) below changes negative baseECC (pressure reduction)
+ to 0 (pressure neutral) thus tending to more spills.
+ Otherwise return. */
+ if (!param_cycle_accurate_model)
+ return;
+
/* Use MAX (baseECC, 0) and baseP to calculcate ECC for each
instruction. */
for (i = 0; i < count; i++)
@@ -66,6 +66,10 @@ Enable asan stack protection.
Common Joined UInteger Var(param_asan_use_after_return) Init(1) IntegerRange(0, 1) Param Optimization
Enable asan detection of use-after-return bugs.
+-param=cycle-accurate-model
+Common Joined UInteger Var(param_cycle_accurate_model) Init(1) IntegerRange(0, 1) Param Optimization
+Whether the scheduling description is mostly a cycle-accurate model of the target processor and is likely to be spill aggressively to fill any pipeline bubbles.
+
-param=hwasan-instrument-stack=
Common Joined UInteger Var(param_hwasan_instrument_stack) Init(1) IntegerRange(0, 1) Param Optimization
Enable hwasan instrumentation of statically sized stack-allocated variables.
@@ -38,6 +38,8 @@ dg-init
# Main loop.
gcc-dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/*.\[cS\]]] \
"" $DEFAULT_CFLAGS
+gcc-dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/sched1-spills/*.{\[cS\],cpp}]] \
+ "" $DEFAULT_CFLAGS
# Saturation alu
foreach opt {
new file mode 100644
@@ -0,0 +1,32 @@
+/* { dg-options "-O2 -march=rv64gc -mabi=lp64d -save-temps -fverbose-asm" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "O1" "-Og" "-Os" "-Oz" } } */
+
+/* Reduced from SPEC2017 Cactu ML_BSSN_Advect.cpp
+ by comparing -fschedule-insn and -fno-schedule-insns builds.
+ Shows up one extra spill (pair of spill markers "sfp") in verbose asm
+ output which the patch fixes. */
+
+void s();
+double b, c, d, e, f, g, h, k, l, m, n, o, p, q, t, u, v;
+int *j;
+double *r, *w;
+long x;
+void y() {
+ double *a((double *)s);
+ for (;;)
+ for (; j[1];)
+ for (int i = 1; i < j[0]; i++) {
+ k = l;
+ m = n;
+ o = p = q;
+ r[0] = t;
+ a[0] = u;
+ x = g;
+ e = f;
+ v = w[x];
+ b = c;
+ d = h;
+ }
+}
+
+/* { dg-final { scan-assembler-not "%sfp" } } */