WIP Move 'pass_fast_rtl_dce' from 'pass_postreload' into 'pass_late_compilation' (was: nvptx vs. [PATCH] Add a late-combine pass [PR106594])
Checks
Commit Message
Hi!
On 2024-06-28T00:41:54+0200, I wrote:
> On 2024-06-27T23:20:18+0200, I wrote:
>> On 2024-06-27T22:27:21+0200, I wrote:
>>> On 2024-06-27T18:49:17+0200, I wrote:
>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>>> This patch adds a combine pass that runs late in the pipeline.
>>>
>>> [After sending, I realized I replied to a previous thread of this work.]
>>>
>>>> I've beek looking a bit through recent nvptx target code generation
>>>> changes for GCC target libraries, and thought I'd also share here my
>>>> findings for the "late-combine" changes in isolation, for nvptx target.
>>>>
>>>> First the unexpected thing:
>>>
>>> So much for "unexpected thing" -- next level of unexpected here...
>>> Appreciated if anyone feels like helping me find my way through this, but
>>> I totally understand if you've got other things to do.
>>
>> OK, I found something already. (Unexpectedly quickly...) ;-)
>>
>>>> there are a few cases where we now see unused
>>>> registers get declared
>
>> But in fact, for both cases
>
> Now tested: 's%both%all'. :-)
>
>> the unexpected difference goes away if after
>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.
The following will be unnecessary assuming that Richard's proposed
"Give fast DCE a separate dirty flag" gets accepted, but may still be
useful if we follow through with the idea to enable (parts of)
'pass_postreload' for nvptx (as discussing with Roger), so, for later:
>> The following makes these two cases work, but evidently needs a lot more
>> analysis: a lot of other passes are enabled that may be anything between
>> beneficial and harmful for 'targetm.no_register_allocation'/nvptx.
>>
>> --- gcc/passes.cc
>> +++ gcc/passes.cc
>> @@ -676,17 +676,17 @@ const pass_data pass_data_postreload =
>> class pass_postreload : public rtl_opt_pass
>> {
>> public:
>> pass_postreload (gcc::context *ctxt)
>> : rtl_opt_pass (pass_data_postreload, ctxt)
>> {}
>>
>> /* opt_pass methods: */
>> - bool gate (function *) final override { return reload_completed; }
>> + bool gate (function *) final override { return reload_completed || targetm.no_register_allocation; }
>> --- gcc/regcprop.cc
>> +++ gcc/regcprop.cc
>> @@ -1305,17 +1305,17 @@ class pass_cprop_hardreg : public rtl_opt_pass
>> public:
>> pass_cprop_hardreg (gcc::context *ctxt)
>> : rtl_opt_pass (pass_data_cprop_hardreg, ctxt)
>> {}
>>
>> /* opt_pass methods: */
>> bool gate (function *) final override
>> {
>> - return (optimize > 0 && (flag_cprop_registers));
>> + return (optimize > 0 && flag_cprop_registers && !targetm.no_register_allocation);
>> }
>
> Also, that quickly ICEs; more '[...] && !targetm.no_register_allocation'
> are needed elsewhere, at least.
>
> The following simpler thing, however, does work; move 'pass_fast_rtl_dce'
> out of 'pass_postreload':
>
> --- gcc/passes.cc
> +++ gcc/passes.cc
> @@ -677,14 +677,15 @@ class pass_postreload : public rtl_opt_pass
> {
> public:
> pass_postreload (gcc::context *ctxt)
> : rtl_opt_pass (pass_data_postreload, ctxt)
> {}
>
> /* opt_pass methods: */
> + opt_pass * clone () final override { return new pass_postreload (m_ctxt); }
> bool gate (function *) final override { return reload_completed; }
>
> }; // class pass_postreload
> --- gcc/passes.def
> +++ gcc/passes.def
> @@ -529,7 +529,10 @@ along with GCC; see the file COPYING3. If not see
> NEXT_PASS (pass_regrename);
> NEXT_PASS (pass_fold_mem_offsets);
> NEXT_PASS (pass_cprop_hardreg);
> - NEXT_PASS (pass_fast_rtl_dce);
> + POP_INSERT_PASSES ()
> + NEXT_PASS (pass_fast_rtl_dce);
> + NEXT_PASS (pass_postreload);
> + PUSH_INSERT_PASSES_WITHIN (pass_postreload)
> NEXT_PASS (pass_reorder_blocks);
> NEXT_PASS (pass_leaf_regs);
> NEXT_PASS (pass_split_before_sched2);
>
> This (only) cleans up "the mess that 'pass_late_combine' created"; no
> further changes in GCC target libraries for nvptx. (For avoidance of
> doubt: "mess" is a great exaggeration here.)
But that then disturbs non-nvptx targets; see (prerequisite)
<https://inbox.sourceware.org/87ed8htdwy.fsf@euler.schwinge.ddns.net>
"Handle 'NUM' in 'PUSH_INSERT_PASSES_WITHIN'" for why.
Then, see the attached -- just for later, for now --
"WIP Move 'pass_fast_rtl_dce' from 'pass_postreload' into 'pass_late_compilation'"
for how to make this work properly. (This also puts back
'pass_fast_rtl_dce' into 'pass_late_compilation' instead of running it
unconditionally, in order to not change any behavior in that regard.)
Grüße
Thomas
>>> But: should we expect '-fno-late-combine-instructions' vs.
>>> '-flate-combine-instructions' to behave in the same way? (After all,
>>> '%r22' remains unused also with '-flate-combine-instructions', and
>>> doesn't need to be emitted.) This could, of course, also be a nvptx back
>>> end issue?
>>>
>>> I'm happy to supply any dump files etc. Also, 'tmp-libc_a-lnumeric.i.xz'
>>> is attached if you'd like to reproduce this with your own nvptx target
>>> 'cc1':
>>>
>>> $ [...]/configure --target=nvptx-none --enable-languages=c
>>> $ make -j12 all-gcc
>>> $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin -o tmp-libc_a-lnumeric.s -fdump-rtl-all # -fno-late-combine-instructions
>>>
>>>
>>> Grüße
>>> Thomas
From ef14e15c3255059f374e04a47d838e9c98c9da2c Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tschwinge@baylibre.com>
Date: Fri, 28 Jun 2024 00:41:54 +0200
Subject: [PATCH] WIP Move 'pass_fast_rtl_dce' from 'pass_postreload' into
'pass_late_compilation'
id:87ed8i2ekt.fsf@euler.schwinge.ddns.net
---
gcc/passes.cc | 8 ++++++++
gcc/passes.def | 6 ++++++
2 files changed, 14 insertions(+)
@@ -685,6 +685,10 @@ public:
{}
/* opt_pass methods: */
+ opt_pass *clone () final override
+ {
+ return new pass_postreload (m_ctxt);
+ }
bool gate (function *) final override
{
if (reload_completed)
@@ -728,6 +732,10 @@ public:
{}
/* opt_pass methods: */
+ opt_pass *clone () final override
+ {
+ return new pass_late_compilation (m_ctxt);
+ }
bool gate (function *) final override
{
return reload_completed || targetm.no_register_allocation;
@@ -529,7 +529,13 @@ along with GCC; see the file COPYING3. If not see
NEXT_PASS (pass_regrename);
NEXT_PASS (pass_fold_mem_offsets);
NEXT_PASS (pass_cprop_hardreg);
+ POP_INSERT_PASSES ()
+ NEXT_PASS (pass_late_compilation);
+ PUSH_INSERT_PASSES_WITHIN (pass_late_compilation)
NEXT_PASS (pass_fast_rtl_dce);
+ POP_INSERT_PASSES ()
+ NEXT_PASS (pass_postreload);
+ PUSH_INSERT_PASSES_WITHIN (pass_postreload)
NEXT_PASS (pass_reorder_blocks);
NEXT_PASS (pass_leaf_regs);
NEXT_PASS (pass_split_before_sched2);
--
2.34.1