lower-subreg, expr: Mitigate inefficiencies derived from "(clobber (reg X))" followed by "(set (subreg (reg X)) (...))"
Message ID | ea289929-5955-d366-9341-34b95f9ceb57@yahoo.co.jp |
---|---|
State | New |
Headers |
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id C114938582AC for <patchwork@sourceware.org>; Wed, 3 Aug 2022 01:38:58 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org C114938582AC DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1659490738; bh=dRBRtst5wrBt8s/zviW2gP2R2/1M0qyOFNH6KxL28lE=; h=Date:To:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=Lp9dtY/CMMqL4t2I5CFYDnvBBy0cEhCzepLqFw/0cyM5t1vXU7dW+lgvBcF2PE0E3 MRmNvswxfoB6aZcfDSk8nQ03SWztzzr8wqkhEHEO9ILlEtRvlmYwJ2XHmGGHWA8EDV 6lmLeJHchNMOrdPfIdowzFChDO/sjpLOM10WXU0k= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from nh605-vm3.bullet.mail.ssk.yahoo.co.jp (nh605-vm3.bullet.mail.ssk.yahoo.co.jp [182.22.90.76]) by sourceware.org (Postfix) with SMTP id 4754F3858D33 for <gcc-patches@gcc.gnu.org>; Wed, 3 Aug 2022 01:38:26 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 4754F3858D33 Received: from [182.22.66.106] by nh605.bullet.mail.ssk.yahoo.co.jp with NNFMP; 03 Aug 2022 01:38:23 -0000 Received: from [182.22.91.131] by t604.bullet.mail.ssk.yahoo.co.jp with NNFMP; 03 Aug 2022 01:38:23 -0000 Received: from [127.0.0.1] by omp604.mail.ssk.yahoo.co.jp with NNFMP; 03 Aug 2022 01:38:23 -0000 X-Yahoo-Newman-Property: ymail-3 X-Yahoo-Newman-Id: 692827.18559.bm@omp604.mail.ssk.yahoo.co.jp Received: (qmail 58254 invoked by alias); 3 Aug 2022 01:38:23 -0000 Received: from unknown (HELO ?192.168.2.3?) (175.177.45.188 with ) by smtp6009.mail.ssk.ynwp.yahoo.co.jp with SMTP; 3 Aug 2022 01:38:23 -0000 X-YMail-JAS: 90JHrd4VM1n2fwe58G7jY57EJuLs7fJlmGnHBc_4TSuCAmk0xd65KgN2WOgOSG75hfU2Nun2Zz953VSJKc4FCX7LUPtZrMLnf.AyS2GUqK0Y72Q5Osvg_R22vw0j.1vvkSdY_F86xw-- X-Apparently-From: <jjsuwa_sys3175@yahoo.co.jp> X-YMail-OSG: UG3lhnAVM1kwPUUXY4bq97dDhUtiK.KUptbHH6Poa8jtJAE rQ97eqyGAhF2pg83mmaYuCghRgPgr8EaiiGjdlnZZ9mxPi4iaZKQR0hPgMXa BbL.0T3E_XLtRjaPu0JMKI8FAN0x4Tlp00m2swu9KeAPeF3ue3NikEKUgOwi 5MV0kpZLAYUzmbh8tPZ4Xotg2kC4TRUXFKYx_t2Rnd5z3jF.gOmQeRwM41yc 8p4ydQaRD1gyN8TIx_ZJkCCpb4W_y1gXlwU6M0qcjpkNTxPypc5T5EQFabn0 OdLrka_wlVCYc8IiMfyladWFUWvrVSmS1kJSEjFtk9RtEQ3PVFeFAoEckr5B omlYr0tefemLE3M0Ry._NpCHCEysjgHH4h_rBgYPbtc0xQbNDEFtspM9pcVu ASiZCsTz09RDSMxHr645kGP6PUILgNUctoS8K8.ArxKnYUiRg5iCkS9.3v5U 998VlKOLsur8idjD8cDIlzdVdWc1SaZP2ssTo0JuVt2ReYilJMf8._vVYjju 5RBuBgO1bK1zZ7RyCTVdGHFnKPc0T13.t_HVuyxjjcjTMqC7T0xunPDPIpT9 Out_6EJ_CgCo_bluSH2QEo5ZfFDroqt2R1Nqw_Qo.mVEQn7qlz5wVnOEt2So TiOYINQI5ZLaPUnO6TLr7ta5pa1MCXtd5D.5jN.Tm0mZaHjvXR62gsQ0R4jY 3WODXS9tJicRHsqNrGltGq_xVUymAPyajpaoCuw2kRVJa9HKQKZAV2lM4m28 Gx.a4ne.GW4C9EzmH89ReESULQCceuu.mLQ2uXafEG1j9ocT9Wok_zqMkzQ5 4a6ADzdLc7K3DDVIYBTPhOjACjX0D1GomG53g5uptLSj_Lr8NZF2q.IgxaKz pv3SHlpsFsIh3S1WPNVL31k6fq0Tnqr4OWjBujeN0Js7JgwxaKIPqtSh2zgl gs_M971dqjo7GvBII Message-ID: <ea289929-5955-d366-9341-34b95f9ceb57@yahoo.co.jp> Date: Wed, 3 Aug 2022 10:35:19 +0900 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.1.0 To: GCC Patches <gcc-patches@gcc.gnu.org> Subject: [PATCH] lower-subreg, expr: Mitigate inefficiencies derived from "(clobber (reg X))" followed by "(set (subreg (reg X)) (...))" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-12.9 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_ENVFROM_END_DIGIT, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> From: Takayuki 'January June' Suwa via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> |
Series |
lower-subreg, expr: Mitigate inefficiencies derived from "(clobber (reg X))" followed by "(set (subreg (reg X)) (...))"
|
|
Commit Message
Takayuki 'January June' Suwa
Aug. 3, 2022, 1:35 a.m. UTC
Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps data flow consistent, but it also increases register allocation pressure and thus often creates many unwanted register-to-register moves that cannot be optimized away. It seems just analogous to partial register stall which is a famous problem on processors that do register renaming. In my opinion, when the register to be clobbered is a composite of hard ones, we should clobber the individual elements separetely, otherwise clear the entire to zero prior to use as the "init-regs" pass does (like partial register stall workarounds on x86 CPUs). Such redundant zero constant assignments will be removed later in the "cprop_hardreg" pass. This patch may give better output code quality for the reasons above, especially on architectures that don't have DFmode hard registers (On architectures with such hard registers, this patch changes virtually nothing). For example (Espressif ESP8266, Xtensa without FP hard regs): /* example */ double _Complex conjugate(double _Complex z) { __imag__(z) *= -1; return z; } ;; before conjugate: movi.n a6, -1 slli a6, a6, 31 mov.n a8, a2 mov.n a9, a3 mov.n a7, a4 xor a6, a5, a6 mov.n a2, a8 mov.n a3, a9 mov.n a4, a7 mov.n a5, a6 ret.n ;; after conjugate: movi.n a6, -1 slli a6, a6, 31 xor a6, a5, a6 mov.n a5, a6 ret.n gcc/ChangeLog: * lower-subreg.cc (resolve_simple_move): Add zero clear of the entire register immediately after the clobber. * expr.cc (emit_move_complex_parts): Change to clobber the real and imaginary parts separately instead of the whole complex register if possible. --- gcc/expr.cc | 26 ++++++++++++++++++++------ gcc/lower-subreg.cc | 7 ++++++- 2 files changed, 26 insertions(+), 7 deletions(-)
Comments
Takayuki 'January June' Suwa via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps > data flow consistent, but it also increases register allocation pressure > and thus often creates many unwanted register-to-register moves that > cannot be optimized away. There are two things here: - If emit_move_complex_parts emits a clobber of a hard register, then that's probably a bug/misfeature. The point of the clobber is to indicate that the register has no useful contents. That's useful for wide pseudos that are written to in parts, since it avoids the need to track the liveness of each part of the pseudo individually. But it shouldn't be necessary for hard registers, since subregs of hard registers are simplified to hard registers wherever possible (which on most targets is "always"). So I think the emit_move_complex_parts clobber should be restricted to !HARD_REGISTER_P, like the lower-subreg clobber is. If that helps (if only partly) then it would be worth doing as its own patch. - I think it'd be worth looking into more detail why a clobber makes a difference to register pressure. A clobber of a pseudo register R shouldn't make R conflict with things that are live at the point of the clobber. > It seems just analogous to partial register > stall which is a famous problem on processors that do register renaming. > > In my opinion, when the register to be clobbered is a composite of hard > ones, we should clobber the individual elements separetely, otherwise > clear the entire to zero prior to use as the "init-regs" pass does (like > partial register stall workarounds on x86 CPUs). Such redundant zero > constant assignments will be removed later in the "cprop_hardreg" pass. I don't think we should rely on the zero being optimised away later. Emitting the zero also makes it harder for the register allocator to elide the move. For example, if we have: (set (subreg:SI (reg:DI P) 0) (reg:SI R0)) (set (subreg:SI (reg:DI P) 4) (reg:SI R1)) then there is at least a chance that the RA could assign hard registers R0:R1 to P, which would turn the moves into nops. If we emit: (set (reg:DI P) (const_int 0)) beforehand then that becomes impossible, since R0 and R1 would then conflict with P. TBH I'm surprised we still run init_regs for LRA. I thought there was a plan to stop doing that, but perhaps I misremember. Thanks, Richard > This patch may give better output code quality for the reasons above, > especially on architectures that don't have DFmode hard registers > (On architectures with such hard registers, this patch changes virtually > nothing). > > For example (Espressif ESP8266, Xtensa without FP hard regs): > > /* example */ > double _Complex conjugate(double _Complex z) { > __imag__(z) *= -1; > return z; > } > > ;; before > conjugate: > movi.n a6, -1 > slli a6, a6, 31 > mov.n a8, a2 > mov.n a9, a3 > mov.n a7, a4 > xor a6, a5, a6 > mov.n a2, a8 > mov.n a3, a9 > mov.n a4, a7 > mov.n a5, a6 > ret.n > > ;; after > conjugate: > movi.n a6, -1 > slli a6, a6, 31 > xor a6, a5, a6 > mov.n a5, a6 > ret.n > > gcc/ChangeLog: > > * lower-subreg.cc (resolve_simple_move): > Add zero clear of the entire register immediately after > the clobber. > * expr.cc (emit_move_complex_parts): > Change to clobber the real and imaginary parts separately > instead of the whole complex register if possible. > --- > gcc/expr.cc | 26 ++++++++++++++++++++------ > gcc/lower-subreg.cc | 7 ++++++- > 2 files changed, 26 insertions(+), 7 deletions(-) > > diff --git a/gcc/expr.cc b/gcc/expr.cc > index 80bb1b8a4c5..9732e8fd4e5 100644 > --- a/gcc/expr.cc > +++ b/gcc/expr.cc > @@ -3775,15 +3775,29 @@ emit_move_complex_push (machine_mode mode, rtx x, rtx y) > rtx_insn * > emit_move_complex_parts (rtx x, rtx y) > { > - /* Show the output dies here. This is necessary for SUBREGs > - of pseudos since we cannot track their lifetimes correctly; > - hard regs shouldn't appear here except as return values. */ > - if (!reload_completed && !reload_in_progress > - && REG_P (x) && !reg_overlap_mentioned_p (x, y)) > - emit_clobber (x); > + rtx_insn *re_insn, *im_insn; > > write_complex_part (x, read_complex_part (y, false), false, true); > + re_insn = get_last_insn (); > write_complex_part (x, read_complex_part (y, true), true, false); > + im_insn = get_last_insn (); > + > + /* Show the output dies here. This is necessary for SUBREGs > + of pseudos since we cannot track their lifetimes correctly. */ > + if (can_create_pseudo_p () > + && REG_P (x) && ! reg_overlap_mentioned_p (x, y)) > + { > + /* Hard regs shouldn't appear here except as return values. */ > + if (HARD_REGISTER_P (x) && REG_NREGS (x) % 2 == 0) > + { > + emit_insn_before (gen_clobber (SET_DEST (PATTERN (re_insn))), > + re_insn); > + emit_insn_before (gen_clobber (SET_DEST (PATTERN (im_insn))), > + im_insn); > + } > + else > + emit_insn_before (gen_clobber (x), re_insn); > + } > > return get_last_insn (); > } > diff --git a/gcc/lower-subreg.cc b/gcc/lower-subreg.cc > index 03e9326c663..4ff0a7d1556 100644 > --- a/gcc/lower-subreg.cc > +++ b/gcc/lower-subreg.cc > @@ -1086,7 +1086,12 @@ resolve_simple_move (rtx set, rtx_insn *insn) > unsigned int i; > > if (REG_P (dest) && !HARD_REGISTER_NUM_P (REGNO (dest))) > - emit_clobber (dest); > + { > + emit_clobber (dest); > + /* We clear the entire of dest with zero after the clobber, > + similar to the "init-regs" pass. */ > + emit_move_insn (dest, CONST0_RTX (GET_MODE (dest))); > + } > > for (i = 0; i < words; ++i) > {
Thanks for your response. On 2022/08/03 16:52, Richard Sandiford wrote: > Takayuki 'January June' Suwa via Gcc-patches <gcc-patches@gcc.gnu.org> writes: >> Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps >> data flow consistent, but it also increases register allocation pressure >> and thus often creates many unwanted register-to-register moves that >> cannot be optimized away. > > There are two things here: > > - If emit_move_complex_parts emits a clobber of a hard register, > then that's probably a bug/misfeature. The point of the clobber is > to indicate that the register has no useful contents. That's useful > for wide pseudos that are written to in parts, since it avoids the > need to track the liveness of each part of the pseudo individually. > But it shouldn't be necessary for hard registers, since subregs of > hard registers are simplified to hard registers wherever possible > (which on most targets is "always"). > > So I think the emit_move_complex_parts clobber should be restricted > to !HARD_REGISTER_P, like the lower-subreg clobber is. If that helps > (if only partly) then it would be worth doing as its own patch. > > - I think it'd be worth looking into more detail why a clobber makes > a difference to register pressure. A clobber of a pseudo register R > shouldn't make R conflict with things that are live at the point of > the clobber. I agree with its worth. In fact, aside from other ports, on the xtensa one, RA in code with frequent D[FC]mode pseudos is terribly bad. For example, in __muldc3 on libgcc2, the size of the stack frame reserved will almost double depending on whether or not this patch is applied. > >> It seems just analogous to partial register >> stall which is a famous problem on processors that do register renaming. >> >> In my opinion, when the register to be clobbered is a composite of hard >> ones, we should clobber the individual elements separetely, otherwise >> clear the entire to zero prior to use as the "init-regs" pass does (like >> partial register stall workarounds on x86 CPUs). Such redundant zero >> constant assignments will be removed later in the "cprop_hardreg" pass. > > I don't think we should rely on the zero being optimised away later. > > Emitting the zero also makes it harder for the register allocator > to elide the move. For example, if we have: > > (set (subreg:SI (reg:DI P) 0) (reg:SI R0)) > (set (subreg:SI (reg:DI P) 4) (reg:SI R1)) > > then there is at least a chance that the RA could assign hard registers > R0:R1 to P, which would turn the moves into nops. If we emit: > > (set (reg:DI P) (const_int 0)) > > beforehand then that becomes impossible, since R0 and R1 would then > conflict with P. Ah, surely, as you pointed out for targets where "(reg: DI)" corresponds to one hard register. > > TBH I'm surprised we still run init_regs for LRA. I thought there was > a plan to stop doing that, but perhaps I misremember. Sorry I am not sure about the status of LRA... because the xtensa port is still using reload. As conclusion, trying to tweak the common code side may have been a bit premature. I'll consider if I can deal with those issues on the side of the target-specific code. > > Thanks, > Richard > >> This patch may give better output code quality for the reasons above, >> especially on architectures that don't have DFmode hard registers >> (On architectures with such hard registers, this patch changes virtually >> nothing). >> >> For example (Espressif ESP8266, Xtensa without FP hard regs): >> >> /* example */ >> double _Complex conjugate(double _Complex z) { >> __imag__(z) *= -1; >> return z; >> } >> >> ;; before >> conjugate: >> movi.n a6, -1 >> slli a6, a6, 31 >> mov.n a8, a2 >> mov.n a9, a3 >> mov.n a7, a4 >> xor a6, a5, a6 >> mov.n a2, a8 >> mov.n a3, a9 >> mov.n a4, a7 >> mov.n a5, a6 >> ret.n >> >> ;; after >> conjugate: >> movi.n a6, -1 >> slli a6, a6, 31 >> xor a6, a5, a6 >> mov.n a5, a6 >> ret.n >> >> gcc/ChangeLog: >> >> * lower-subreg.cc (resolve_simple_move): >> Add zero clear of the entire register immediately after >> the clobber. >> * expr.cc (emit_move_complex_parts): >> Change to clobber the real and imaginary parts separately >> instead of the whole complex register if possible. >> --- >> gcc/expr.cc | 26 ++++++++++++++++++++------ >> gcc/lower-subreg.cc | 7 ++++++- >> 2 files changed, 26 insertions(+), 7 deletions(-) >> >> diff --git a/gcc/expr.cc b/gcc/expr.cc >> index 80bb1b8a4c5..9732e8fd4e5 100644 >> --- a/gcc/expr.cc >> +++ b/gcc/expr.cc >> @@ -3775,15 +3775,29 @@ emit_move_complex_push (machine_mode mode, rtx x, rtx y) >> rtx_insn * >> emit_move_complex_parts (rtx x, rtx y) >> { >> - /* Show the output dies here. This is necessary for SUBREGs >> - of pseudos since we cannot track their lifetimes correctly; >> - hard regs shouldn't appear here except as return values. */ >> - if (!reload_completed && !reload_in_progress >> - && REG_P (x) && !reg_overlap_mentioned_p (x, y)) >> - emit_clobber (x); >> + rtx_insn *re_insn, *im_insn; >> >> write_complex_part (x, read_complex_part (y, false), false, true); >> + re_insn = get_last_insn (); >> write_complex_part (x, read_complex_part (y, true), true, false); >> + im_insn = get_last_insn (); >> + >> + /* Show the output dies here. This is necessary for SUBREGs >> + of pseudos since we cannot track their lifetimes correctly. */ >> + if (can_create_pseudo_p () >> + && REG_P (x) && ! reg_overlap_mentioned_p (x, y)) >> + { >> + /* Hard regs shouldn't appear here except as return values. */ >> + if (HARD_REGISTER_P (x) && REG_NREGS (x) % 2 == 0) >> + { >> + emit_insn_before (gen_clobber (SET_DEST (PATTERN (re_insn))), >> + re_insn); >> + emit_insn_before (gen_clobber (SET_DEST (PATTERN (im_insn))), >> + im_insn); >> + } >> + else >> + emit_insn_before (gen_clobber (x), re_insn); >> + } >> >> return get_last_insn (); >> } >> diff --git a/gcc/lower-subreg.cc b/gcc/lower-subreg.cc >> index 03e9326c663..4ff0a7d1556 100644 >> --- a/gcc/lower-subreg.cc >> +++ b/gcc/lower-subreg.cc >> @@ -1086,7 +1086,12 @@ resolve_simple_move (rtx set, rtx_insn *insn) >> unsigned int i; >> >> if (REG_P (dest) && !HARD_REGISTER_NUM_P (REGNO (dest))) >> - emit_clobber (dest); >> + { >> + emit_clobber (dest); >> + /* We clear the entire of dest with zero after the clobber, >> + similar to the "init-regs" pass. */ >> + emit_move_insn (dest, CONST0_RTX (GET_MODE (dest))); >> + } >> >> for (i = 0; i < words; ++i) >> {
On 8/3/2022 1:52 AM, Richard Sandiford via Gcc-patches wrote: > Takayuki 'January June' Suwa via Gcc-patches <gcc-patches@gcc.gnu.org> writes: >> Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps >> data flow consistent, but it also increases register allocation pressure >> and thus often creates many unwanted register-to-register moves that >> cannot be optimized away. > There are two things here: > > - If emit_move_complex_parts emits a clobber of a hard register, > then that's probably a bug/misfeature. The point of the clobber is > to indicate that the register has no useful contents. That's useful > for wide pseudos that are written to in parts, since it avoids the > need to track the liveness of each part of the pseudo individually. > But it shouldn't be necessary for hard registers, since subregs of > hard registers are simplified to hard registers wherever possible > (which on most targets is "always"). > > So I think the emit_move_complex_parts clobber should be restricted > to !HARD_REGISTER_P, like the lower-subreg clobber is. If that helps > (if only partly) then it would be worth doing as its own patch. Agreed. > > - I think it'd be worth looking into more detail why a clobber makes > a difference to register pressure. A clobber of a pseudo register R > shouldn't make R conflict with things that are live at the point of > the clobber. Also agreed. > >> It seems just analogous to partial register >> stall which is a famous problem on processors that do register renaming. >> >> In my opinion, when the register to be clobbered is a composite of hard >> ones, we should clobber the individual elements separetely, otherwise >> clear the entire to zero prior to use as the "init-regs" pass does (like >> partial register stall workarounds on x86 CPUs). Such redundant zero >> constant assignments will be removed later in the "cprop_hardreg" pass. > I don't think we should rely on the zero being optimised away later. > > Emitting the zero also makes it harder for the register allocator > to elide the move. For example, if we have: > > (set (subreg:SI (reg:DI P) 0) (reg:SI R0)) > (set (subreg:SI (reg:DI P) 4) (reg:SI R1)) > > then there is at least a chance that the RA could assign hard registers > R0:R1 to P, which would turn the moves into nops. If we emit: > > (set (reg:DI P) (const_int 0)) > > beforehand then that becomes impossible, since R0 and R1 would then > conflict with P. > > TBH I'm surprised we still run init_regs for LRA. I thought there was > a plan to stop doing that, but perhaps I misremember. I have vague memories of dealing with some of this nonsense a few release cycles. I don't recall all the details, but init-regs + lower-subreg + regcprop + splitting all conspired to generate poor code on the MIPS targets. See pr87761, though it doesn't include all my findings -- I can't recall if I walked through the entire tortured sequence in the gcc-patches discussion or not. I ended up working around in the mips backend in conjunction with some changes to regcprop IIRC. Jeff
Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On 8/3/2022 1:52 AM, Richard Sandiford via Gcc-patches wrote: >> Takayuki 'January June' Suwa via Gcc-patches <gcc-patches@gcc.gnu.org> writes: >>> Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps >>> data flow consistent, but it also increases register allocation pressure >>> and thus often creates many unwanted register-to-register moves that >>> cannot be optimized away. >> There are two things here: >> >> - If emit_move_complex_parts emits a clobber of a hard register, >> then that's probably a bug/misfeature. The point of the clobber is >> to indicate that the register has no useful contents. That's useful >> for wide pseudos that are written to in parts, since it avoids the >> need to track the liveness of each part of the pseudo individually. >> But it shouldn't be necessary for hard registers, since subregs of >> hard registers are simplified to hard registers wherever possible >> (which on most targets is "always"). >> >> So I think the emit_move_complex_parts clobber should be restricted >> to !HARD_REGISTER_P, like the lower-subreg clobber is. If that helps >> (if only partly) then it would be worth doing as its own patch. > Agreed. > >> >> - I think it'd be worth looking into more detail why a clobber makes >> a difference to register pressure. A clobber of a pseudo register R >> shouldn't make R conflict with things that are live at the point of >> the clobber. > Also agreed. > >> >>> It seems just analogous to partial register >>> stall which is a famous problem on processors that do register renaming. >>> >>> In my opinion, when the register to be clobbered is a composite of hard >>> ones, we should clobber the individual elements separetely, otherwise >>> clear the entire to zero prior to use as the "init-regs" pass does (like >>> partial register stall workarounds on x86 CPUs). Such redundant zero >>> constant assignments will be removed later in the "cprop_hardreg" pass. >> I don't think we should rely on the zero being optimised away later. >> >> Emitting the zero also makes it harder for the register allocator >> to elide the move. For example, if we have: >> >> (set (subreg:SI (reg:DI P) 0) (reg:SI R0)) >> (set (subreg:SI (reg:DI P) 4) (reg:SI R1)) >> >> then there is at least a chance that the RA could assign hard registers >> R0:R1 to P, which would turn the moves into nops. If we emit: >> >> (set (reg:DI P) (const_int 0)) >> >> beforehand then that becomes impossible, since R0 and R1 would then >> conflict with P. >> >> TBH I'm surprised we still run init_regs for LRA. I thought there was >> a plan to stop doing that, but perhaps I misremember. > I have vague memories of dealing with some of this nonsense a few > release cycles. I don't recall all the details, but init-regs + > lower-subreg + regcprop + splitting all conspired to generate poor code > on the MIPS targets. See pr87761, though it doesn't include all my > findings -- I can't recall if I walked through the entire tortured > sequence in the gcc-patches discussion or not. > > I ended up working around in the mips backend in conjunction with some > changes to regcprop IIRC. Thanks for the pointer, hadn't seen that. And yeah, for the early-ish passes, I guess the interaction between lower-subreg and init-regs is important too, not just the interaction between lower-subreg and RA. It probably also ties into the problems with overly-scalarised register moves, like in PR 106106. So maybe I was being too optimistic :-) Richard
Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> writes: > Thanks for your response. > > On 2022/08/03 16:52, Richard Sandiford wrote: >> Takayuki 'January June' Suwa via Gcc-patches <gcc-patches@gcc.gnu.org> writes: >>> Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps >>> data flow consistent, but it also increases register allocation pressure >>> and thus often creates many unwanted register-to-register moves that >>> cannot be optimized away. >> >> There are two things here: >> >> - If emit_move_complex_parts emits a clobber of a hard register, >> then that's probably a bug/misfeature. The point of the clobber is >> to indicate that the register has no useful contents. That's useful >> for wide pseudos that are written to in parts, since it avoids the >> need to track the liveness of each part of the pseudo individually. >> But it shouldn't be necessary for hard registers, since subregs of >> hard registers are simplified to hard registers wherever possible >> (which on most targets is "always"). >> >> So I think the emit_move_complex_parts clobber should be restricted >> to !HARD_REGISTER_P, like the lower-subreg clobber is. If that helps >> (if only partly) then it would be worth doing as its own patch. >> >> - I think it'd be worth looking into more detail why a clobber makes >> a difference to register pressure. A clobber of a pseudo register R >> shouldn't make R conflict with things that are live at the point of >> the clobber. > > I agree with its worth. > In fact, aside from other ports, on the xtensa one, RA in code with frequent D[FC]mode pseudos is terribly bad. > For example, in __muldc3 on libgcc2, the size of the stack frame reserved will almost double depending on whether or not this patch is applied. Yeah, that's a lot. >>> It seems just analogous to partial register >>> stall which is a famous problem on processors that do register renaming. >>> >>> In my opinion, when the register to be clobbered is a composite of hard >>> ones, we should clobber the individual elements separetely, otherwise >>> clear the entire to zero prior to use as the "init-regs" pass does (like >>> partial register stall workarounds on x86 CPUs). Such redundant zero >>> constant assignments will be removed later in the "cprop_hardreg" pass. >> >> I don't think we should rely on the zero being optimised away later. >> >> Emitting the zero also makes it harder for the register allocator >> to elide the move. For example, if we have: >> >> (set (subreg:SI (reg:DI P) 0) (reg:SI R0)) >> (set (subreg:SI (reg:DI P) 4) (reg:SI R1)) >> >> then there is at least a chance that the RA could assign hard registers >> R0:R1 to P, which would turn the moves into nops. If we emit: >> >> (set (reg:DI P) (const_int 0)) >> >> beforehand then that becomes impossible, since R0 and R1 would then >> conflict with P. > > Ah, surely, as you pointed out for targets where "(reg: DI)" corresponds to one hard register. I was thinking here about the case where (reg:DI …) corresponds to 2 hard registers. Each subreg move is then a single hard register copy, but assigning P to the combination R0:R1 can remove both of the subreg moves. >> TBH I'm surprised we still run init_regs for LRA. I thought there was >> a plan to stop doing that, but perhaps I misremember. > > Sorry I am not sure about the status of LRA... because the xtensa port is still using reload. Ah, hadn't realised that. If you have time to work on it, it would be really good to move over to LRA. There are plans to remove old reload. It might be that old reload *does* treat a pseudo clobber as a conflict. I can't remember now. If so, then zeroing the register wouldn't be too bad (for old reload only). > As conclusion, trying to tweak the common code side may have been a bit premature. > I'll consider if I can deal with those issues on the side of the target-specific code. It's likely to be at least partly a target-independent issue, so tweaking the common code makes sense in principle. Does adding !HARD_REGISTER_P (x) to: /* Show the output dies here. This is necessary for SUBREGs of pseudos since we cannot track their lifetimes correctly; hard regs shouldn't appear here except as return values. */ if (!reload_completed && !reload_in_progress && REG_P (x) && !reg_overlap_mentioned_p (x, y)) emit_clobber (x); in emit_move_complex_parts help? If so, I think we should do at least that much. Thanks, Richard
(sorry repost due to the lack of cc here) Hi! On 2022/08/04 18:49, Richard Sandiford wrote: > Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> writes: >> Thanks for your response. >> >> On 2022/08/03 16:52, Richard Sandiford wrote: >>> Takayuki 'January June' Suwa via Gcc-patches <gcc-patches@gcc.gnu.org> writes: >>>> Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps >>>> data flow consistent, but it also increases register allocation pressure >>>> and thus often creates many unwanted register-to-register moves that >>>> cannot be optimized away. >>> >>> There are two things here: >>> >>> - If emit_move_complex_parts emits a clobber of a hard register, >>> then that's probably a bug/misfeature. The point of the clobber is >>> to indicate that the register has no useful contents. That's useful >>> for wide pseudos that are written to in parts, since it avoids the >>> need to track the liveness of each part of the pseudo individually. >>> But it shouldn't be necessary for hard registers, since subregs of >>> hard registers are simplified to hard registers wherever possible >>> (which on most targets is "always"). >>> >>> So I think the emit_move_complex_parts clobber should be restricted >>> to !HARD_REGISTER_P, like the lower-subreg clobber is. If that helps >>> (if only partly) then it would be worth doing as its own patch. >>> >>> - I think it'd be worth looking into more detail why a clobber makes >>> a difference to register pressure. A clobber of a pseudo register R >>> shouldn't make R conflict with things that are live at the point of >>> the clobber. >> >> I agree with its worth. >> In fact, aside from other ports, on the xtensa one, RA in code with frequent D[FC]mode pseudos is terribly bad. >> For example, in __muldc3 on libgcc2, the size of the stack frame reserved will almost double depending on whether or not this patch is applied. > > Yeah, that's a lot. So lots, but almost double might be an overstatement :) BTW after some quick experimentation, I found that turning on -fsplit-wide-types-early would roughly (but not completely) solve the problem. Surely, the output was not so bad in the past... > >>>> It seems just analogous to partial register >>>> stall which is a famous problem on processors that do register renaming. >>>> >>>> In my opinion, when the register to be clobbered is a composite of hard >>>> ones, we should clobber the individual elements separetely, otherwise >>>> clear the entire to zero prior to use as the "init-regs" pass does (like >>>> partial register stall workarounds on x86 CPUs). Such redundant zero >>>> constant assignments will be removed later in the "cprop_hardreg" pass. >>> >>> I don't think we should rely on the zero being optimised away later. >>> >>> Emitting the zero also makes it harder for the register allocator >>> to elide the move. For example, if we have: >>> >>> (set (subreg:SI (reg:DI P) 0) (reg:SI R0)) >>> (set (subreg:SI (reg:DI P) 4) (reg:SI R1)) >>> >>> then there is at least a chance that the RA could assign hard registers >>> R0:R1 to P, which would turn the moves into nops. If we emit: >>> >>> (set (reg:DI P) (const_int 0)) >>> >>> beforehand then that becomes impossible, since R0 and R1 would then >>> conflict with P. >> >> Ah, surely, as you pointed out for targets where "(reg: DI)" corresponds to one hard register. > > I was thinking here about the case where (reg:DI …) corresponds to > 2 hard registers. Each subreg move is then a single hard register > copy, but assigning P to the combination R0:R1 can remove both of > the subreg moves. > >>> TBH I'm surprised we still run init_regs for LRA. I thought there was >>> a plan to stop doing that, but perhaps I misremember. >> >> Sorry I am not sure about the status of LRA... because the xtensa port is still using reload. > > Ah, hadn't realised that. If you have time to work on it, it would be > really good to move over to LRA. There are plans to remove old reload. Alas you do overestimate me :) I've only been working about the GCC development for a little over a year. Well it's a lie that I'm not interested in it, but too much for me. > > It might be that old reload *does* treat a pseudo clobber as a conflict. > I can't remember now. If so, then zeroing the register wouldn't be > too bad (for old reload only). > >> As conclusion, trying to tweak the common code side may have been a bit premature. >> I'll consider if I can deal with those issues on the side of the target-specific code. > > It's likely to be at least partly a target-independent issue, so tweaking > the common code makes sense in principle. > > Does adding !HARD_REGISTER_P (x) to: > > /* Show the output dies here. This is necessary for SUBREGs > of pseudos since we cannot track their lifetimes correctly; > hard regs shouldn't appear here except as return values. */ > if (!reload_completed && !reload_in_progress > && REG_P (x) && !reg_overlap_mentioned_p (x, y)) > emit_clobber (x); > > in emit_move_complex_parts help? If so, I think we should do at Probably yes. Quick test says the abovementioned mod makes the ad-hoc fix I posted earlier (https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596626.html) a thing of the past. > least that much. > > Thanks, > Richard
On 8/4/2022 3:49 AM, Richard Sandiford via Gcc-patches wrote: > >>> TBH I'm surprised we still run init_regs for LRA. I thought there was >>> a plan to stop doing that, but perhaps I misremember. >> Sorry I am not sure about the status of LRA... because the xtensa port is still using reload. > Ah, hadn't realised that. If you have time to work on it, it would be > really good to move over to LRA. There are plans to remove old reload. Definitely worth investigating. With the cc0 removal done I think the last blocker for removing the old reload pass is gone. We just need to get the remaining targets converted to LRA. > It might be that old reload *does* treat a pseudo clobber as a conflict. > I can't remember now. If so, then zeroing the register wouldn't be > too bad (for old reload only). No idea anymore either. I'd be a bit surprised since IIRC the main purpose was to tell the old uninit warning code that the entire object was set by the subsequent libcall sequence. But all that code is long-gone. Which I think raises a question. Do we even need those CLOBBERSs anymore? > >> As conclusion, trying to tweak the common code side may have been a bit premature. >> I'll consider if I can deal with those issues on the side of the target-specific code. > It's likely to be at least partly a target-independent issue, so tweaking > the common code makes sense in principle. > > Does adding !HARD_REGISTER_P (x) to: > > /* Show the output dies here. This is necessary for SUBREGs > of pseudos since we cannot track their lifetimes correctly; > hard regs shouldn't appear here except as return values. */ > if (!reload_completed && !reload_in_progress > && REG_P (x) && !reg_overlap_mentioned_p (x, y)) > emit_clobber (x); > > in emit_move_complex_parts help? If so, I think we should do at > least that much. If we can't remove the CLOBBERs entirely, then this sounds like a good thing, even if it doesn't help this specific case. jeff
On 8/4/2022 6:35 AM, Takayuki 'January June' Suwa via Gcc-patches wrote: > So lots, but almost double might be an overstatement :) > > BTW after some quick experimentation, I found that turning on -fsplit-wide-types-early would roughly (but not completely) solve the problem. Surely, the output was not so bad in the past... It could have been. >> Ah, hadn't realised that. If you have time to work on it, it would be >> really good to move over to LRA. There are plans to remove old reload. > Alas you do overestimate me :) I've only been working about the GCC development for a little over a year. > Well it's a lie that I'm not interested in it, but too much for me. It may actually be trivial -- change TARGET_LRA_P to be hook_bool_void_true in the xtensa port, then rebuild & test. In the couple conversions I've done it's exposed a very small number of issues that were trivially resolved. Given the what I've seen from you I would expect it's within your capabilities and there's folks here that can help if you do run into problems. jeff
Hi Suwa-san, On Fri, Oct 14, 2022 at 4:19 AM Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> wrote: > This patch provides the first step in the transition from Reload to LRA > in Xtensa. > > gcc/ChangeLog: > > * config/xtensa/xtensa-proto.h (xtensa_split1_is_finished_p): > New prototype. > * config/xtensa/xtensa.cc > (xtensa_split1_is_finished_p, xtensa_lra_p): New functions. > (TARGET_LRA_P): Replace the dummy hook with xtensa_lra_p. > (xt_true_regnum): Rework. > * gcc/config/xtensa/xtensa.h (CALL_REALLY_USED_REGISTERS): > Rename from CALL_USED_REGISTERS, and remove what correspond to > FIXED_REGISTERS. > * gcc/config/xtensa/constraints.md (Y): > Use !xtensa_split1_is_finished_p() instead of can_create_pseudo_p(). > * gcc/config/xtensa/predicates.md (move_operand): Ditto. > * gcc/config/xtensa/xtensa.md: > Add new split pattern that puts out-of-constraint integer constants > into the constant pool. > * gcc/config/xtensa/xtensa.opt (-mlra): New target-specific option > for testing purpose. > --- > gcc/config/xtensa/constraints.md | 2 +- > gcc/config/xtensa/predicates.md | 2 +- > gcc/config/xtensa/xtensa-protos.h | 1 + > gcc/config/xtensa/xtensa.cc | 48 ++++++++++++++++++++++++------- > gcc/config/xtensa/xtensa.h | 6 ++-- > gcc/config/xtensa/xtensa.md | 12 ++++++++ > gcc/config/xtensa/xtensa.opt | 4 +++ > 7 files changed, 60 insertions(+), 15 deletions(-) Thank you for doing this, I couldn't find time to get back to it since 2020 ): This change results in a few new regressions in the following tests caused by ICE even when running without -mlra option: +FAIL: gcc.c-torture/execute/pr92904.c -O1 (internal compiler error: in extract_insn, at recog.cc:2791) +FAIL: gcc.c-torture/execute/pr92904.c -O2 (internal compiler error: in extract_insn, at recog.cc:2791) +FAIL: gcc.c-torture/execute/pr92904.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (internal compiler error: in extract_insn, at recog.cc:2791) +FAIL: gcc.c-torture/execute/pr92904.c -O3 -g (internal compiler error: in extract_insn, at recog.cc:2791) +FAIL: gcc.c-torture/execute/pr92904.c -Os (internal compiler error: in extract_insn, at recog.cc:2791) +FAIL: gcc.c-torture/execute/pr92904.c -O2 -flto -fno-use-linker-plugin -flto-partition=none (internal compiler error: in extract_insn, at recog.cc:2791) +FAIL: gcc.c-torture/execute/pr92904.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects (internal compiler error: in extract_insn, at recog.cc:2791) +FAIL: g++.dg/torture/vshuf-v2si.C -O3 -g (internal compiler error: in extract_insn, at recog.cc:2791) +FAIL: g++.dg/torture/vshuf-v8qi.C -O3 -g (internal compiler error: in extract_insn, at recog.cc:2791) The backtraces look like this in all of them: gcc/gcc/testsuite/gcc.c-torture/execute/pr92904.c:395:1: error: unrecognizable insn: (insn 10501 7 10502 2 (set (reg:SI 5913) (const_int 1431655765 [0x55555555])) "gcc/gcc/testsuite/gcc.c-torture/execute/pr92904.c":239:9 -1 (nil)) during RTL pass: subreg3 gcc/gcc/testsuite/gcc.c-torture/execute/pr92904.c:395:1: internal compiler error: in extract_insn, at recog.cc:2791 0x6b17f7 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*) gcc/gcc/rtl-error.cc:108 0x6b187a _fatal_insn_not_found(rtx_def const*, char const*, int, char const*) gcc/gcc/rtl-error.cc:116 0x6a2aa4 extract_insn(rtx_insn*) gcc/gcc/recog.cc:2791 0x179e94d decompose_multiword_subregs gcc/gcc/lower-subreg.cc:1678 0x179ebdd execute gcc/gcc/lower-subreg.cc:1820 There's also the following runtime failures, but only on call0 configuration: +FAIL: gcc.c-torture/execute/20010122-1.c -O1 execution test +FAIL: gcc.c-torture/execute/20010122-1.c -O2 execution test +FAIL: gcc.c-torture/execute/20010122-1.c -O3 -g execution test +FAIL: gcc.c-torture/execute/20010122-1.c -Os execution test +FAIL: gcc.c-torture/execute/20010122-1.c -O2 -flto -fno-use-linker-plugin -flto-partition=none execution test
diff --git a/gcc/expr.cc b/gcc/expr.cc index 80bb1b8a4c5..9732e8fd4e5 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -3775,15 +3775,29 @@ emit_move_complex_push (machine_mode mode, rtx x, rtx y) rtx_insn * emit_move_complex_parts (rtx x, rtx y) { - /* Show the output dies here. This is necessary for SUBREGs - of pseudos since we cannot track their lifetimes correctly; - hard regs shouldn't appear here except as return values. */ - if (!reload_completed && !reload_in_progress - && REG_P (x) && !reg_overlap_mentioned_p (x, y)) - emit_clobber (x); + rtx_insn *re_insn, *im_insn; write_complex_part (x, read_complex_part (y, false), false, true); + re_insn = get_last_insn (); write_complex_part (x, read_complex_part (y, true), true, false); + im_insn = get_last_insn (); + + /* Show the output dies here. This is necessary for SUBREGs + of pseudos since we cannot track their lifetimes correctly. */ + if (can_create_pseudo_p () + && REG_P (x) && ! reg_overlap_mentioned_p (x, y)) + { + /* Hard regs shouldn't appear here except as return values. */ + if (HARD_REGISTER_P (x) && REG_NREGS (x) % 2 == 0) + { + emit_insn_before (gen_clobber (SET_DEST (PATTERN (re_insn))), + re_insn); + emit_insn_before (gen_clobber (SET_DEST (PATTERN (im_insn))), + im_insn); + } + else + emit_insn_before (gen_clobber (x), re_insn); + } return get_last_insn (); } diff --git a/gcc/lower-subreg.cc b/gcc/lower-subreg.cc index 03e9326c663..4ff0a7d1556 100644 --- a/gcc/lower-subreg.cc +++ b/gcc/lower-subreg.cc @@ -1086,7 +1086,12 @@ resolve_simple_move (rtx set, rtx_insn *insn) unsigned int i; if (REG_P (dest) && !HARD_REGISTER_NUM_P (REGNO (dest))) - emit_clobber (dest); + { + emit_clobber (dest); + /* We clear the entire of dest with zero after the clobber, + similar to the "init-regs" pass. */ + emit_move_insn (dest, CONST0_RTX (GET_MODE (dest))); + } for (i = 0; i < words; ++i) {