[RFA] Improve strcmp expansion when one input is a constant string.
Checks
Context                                          Check      Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64   success    Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64   success    Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm       success    Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm       success    Testing passed
Commit Message
While investigating a RISC-V backend patch from Jivan I noticed a
regression in terms of dynamic instruction counts for the omnetpp
benchmark in spec2017.
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620577.html
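A minimal reproducer along these lines triggers the inline expansion (a
hypothetical reconstruction, not the actual j.c; the literal is a stand-in
chosen to match the char[1:12] size and the -108, i.e. 'l', constant
visible in the RTL below):

  /* Hypothetical reconstruction of j.c: strcmp against a short constant
     string is expanded inline, one character at a time.  "lorem_ipsum"
     is an illustrative 11-character literal whose first character is
     'l' (108).  Compile with -O2 on riscv64.  */
  int
  foo (char *x)
  {
    return __builtin_strcmp (x, "lorem_ipsum");
  }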
The code we generate with Jivan's patch at expansion time looks like this
for each character in the input string:
(insn 6 5 7 (set (reg:SI 137)
(zero_extend:SI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM
<char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 -1
(nil))
(insn 7 6 8 (set (reg:DI 138)
(sign_extend:DI (plus:SI (reg:SI 137)
(const_int -108 [0xffffffffffffff94])))) "j.c":5:11 -1
(nil))
(insn 8 7 9 (set (reg:SI 136)
(subreg/s/u:SI (reg:DI 138) 0)) "j.c":5:11 -1
(expr_list:REG_EQUAL (plus:SI (reg:SI 137)
(const_int -108 [0xffffffffffffff94]))
(nil)))
(insn 9 8 10 (set (reg:DI 139)
(sign_extend:DI (reg:SI 136))) "j.c":5:11 -1
(nil))
(jump_insn 10 9 11 (set (pc)
(if_then_else (ne (reg:DI 139)
(const_int 0 [0]))
(label_ref 64)
(pc))) "j.c":5:11 -1
(nil))
Ignore insn 9. fwprop will turn it into a trivial copy from r138->r139
which will ultimately propagate away.
All the paths eventually transfer control to the label in question, either
by jumping to it or by falling through on the last character.  After a bit
of cleanup by fwprop & friends we have:
> (insn 6 3 7 2 (set (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
> (zero_extend:SI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM <char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 114 {zero_extendqisi2}
> (nil))
> (insn 7 6 8 2 (set (reg:DI 138)
> (sign_extend:DI (plus:SI (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
> (const_int -108 [0xffffffffffffff94])))) "j.c":5:11 6 {addsi3_extended}
> (expr_list:REG_DEAD (reg:SI 137 [ MEM <char[1:12]> [(void *)x_2(D)] ])
> (nil)))
> (insn 8 7 10 2 (set (reg:SI 136 [ MEM <char[1:12]> [(void *)x_2(D)]+11 ])
> (subreg/s/u:SI (reg:DI 138) 0)) "j.c":5:11 180 {*movsi_internal}
> (nil))
> (jump_insn 10 8 73 2 (set (pc)
> (if_then_else (ne (reg:DI 138)
> (const_int 0 [0]))
> (label_ref 64)
> (pc))) "j.c":5:11 243 {*branchdi}
> (expr_list:REG_DEAD (reg:DI 138)
> (int_list:REG_BR_PROB 536870916 (nil)))
> -> 64)
insn 8 is the result of wanting the ultimate result of the strcmp to be
an "int" type (SImode). Note that (reg 136) is the result of the
strcmp. It gets set in each fragment of code that compares one element
in the string. It's also live after the strcmp sequence. As a result
combine isn't going to be able to clean this up.
Note how (reg 136) becomes live while (reg 138) is still live, and even
though (reg 136) is a copy of (reg 138), IRA doesn't have the necessary
code to determine that the regs do not conflict.  As a result (reg 136)
and (reg 138) must be allocated different hard registers and we get code
like this:
> lbu a5,0(a0) # 6 [c=28 l=4] zero_extendqisi2/1
> addiw a5,a5,-108 # 7 [c=8 l=4] addsi3_extended/1
> mv a4,a5 # 8 [c=4 l=4] *movsi_internal/0
> bne a5,zero,.L2 # 10 [c=4 l=4] *branchdi
Note the annoying "mv".
Rather than do a conversion for each character, we could do each step in
word_mode and do the conversion once at the end of the whole sequence.
So for each character we expand to:
> (insn 6 5 7 (set (reg:DI 138)
> (zero_extend:DI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM <char[1:12]> [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 -1
> (nil))
>
> (insn 7 6 8 (set (reg:DI 137)
> (plus:DI (reg:DI 138)
> (const_int -108 [0xffffffffffffff94]))) "j.c":5:11 -1
> (nil))
>
> (jump_insn 8 7 9 (set (pc)
> (if_then_else (ne (reg:DI 137)
> (const_int 0 [0]))
> (label_ref 41)
> (pc))) "j.c":5:11 -1
> (nil))
Good. Then at the end of the sequence we have:
> (code_label 41 40 42 2 (nil) [0 uses])
>
> (insn 42 41 43 (set (reg:SI 136)
> (subreg:SI (reg:DI 137) 0)) "j.c":5:11 -1
> (nil))
Which seems like exactly what we want. At the assembly level we get:
lbu a5,0(a0) # 6 [c=28 l=4] zero_extendqidi2/1
addi a0,a5,-108 # 7 [c=4 l=4] adddi3/1
bne a0,zero,.L2 # 8 [c=4 l=4] *branchdi
[ ... ]
At the end of the sequence we realize the narrowing subreg followed by an
extension isn't necessary and just remove them.
The ultimate result is omnetpp goes from a small regression to a small
overall improvement with Jivan's patch.
Bootstrapped and regression tested on x86.  Also built and ran spec2017
on riscv64.
OK for the trunk?
Jeff
Comments
On Sun, Jun 4, 2023 at 11:41 PM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> [Jeff's original message, quoted in full, trimmed here; see the commit
> message above.]
>
But then for example x86 has smaller encoding for byte ops and while
widening is easily done later, truncation is not.
Btw, you failed to update the overall function comment which lists
the conversions applied.
Note I would have expected to use the mode of the load so we truly
elide some extensions, using word_mode looks like just another
mode here? The key to note is probably
      op0 = convert_modes (mode, unit_mode, op0, 1);
      op1 = convert_modes (mode, unit_mode, op1, 1);
      rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
                                      result, 1, OPTAB_WIDEN);
which uses OPTAB_WIDEN - wouldn't it be better to pass in the
unconverted modes and leave the decision which mode to use
to OPTAB_WIDEN? Should we somehow query the target for
the smallest mode from unit_mode it can do both the MINUS
and the compare?
Thanks,
Richard.
On 6/5/23 00:29, Richard Biener wrote:
>
> But then for example x86 has smaller encoding for byte ops and while
> widening is easily done later, truncation is not.
Which probably argues we need to be checking costs.
>
> Btw, you failed to update the overall function comment which lists
> the conversions applied.
ACK. It occurred to me when I woke up today that I also failed to
handle the case where word_mode is actually smaller than an int.
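Something like this untested guard would be needed (my sketch, not part of
the posted patch) so we never pick an intermediate mode narrower than the
result:

  /* Hypothetical guard, not in the posted patch: only use word_mode for
     the intermediate steps when it is at least as wide as the final
     result mode (e.g. a 16-bit-word target where int is two words).  */
  machine_mode imode = word_mode;
  if (known_lt (GET_MODE_SIZE (word_mode), GET_MODE_SIZE (mode)))
    imode = mode;
  rtx intermediate_result = gen_reg_rtx (imode);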
>
> Note I would have expected to use the mode of the load so we truly
> elide some extensions, using word_mode looks like just another
> mode here? The key to note is probably
>
> op0 = convert_modes (mode, unit_mode, op0, 1);
> op1 = convert_modes (mode, unit_mode, op1, 1);
> rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
> result, 1, OPTAB_WIDEN);
On many (most?) targets the loads can cheaply extend to word_mode.
>
> which uses OPTAB_WIDEN - wouldn't it be better to pass in the
> unconverted modes and leave the decision which mode to use
> to OPTAB_WIDEN? Should we somehow query the target for
> the smallest mode from unit_mode it can do both the MINUS
> and the compare?
I'll play with it.  My worry is that the optabs are going to leave things
in SImode.  RV64 has 32-bit add/subtract instructions which implicitly
sign-extend the result to 64 bits, and Jivan's patch models that behavior
with generally very good results.  But again, I'll play with it.
I do agree that we need to be looking at cost modeling more here.  So
I'll poke at that too.
jeff
On 6/5/23 00:29, Richard Biener wrote:
>
> But then for example x86 has smaller encoding for byte ops and while
> widening is easily done later, truncation is not.
Sadly, the x86 costing looks totally bogus here.  We actually emit the
exact same code for a QImode load vs. a zero-extending load from QI to
SI.  But the costing is different and would tend to prefer QImode.  That
in turn is going to force an extension at the end of the sequence, which
would be a regression relative to the current code.  Additionally we may
get partial register stalls for the byte ops that implement the comparison
steps.
The net result is that querying the backend's costs would do the exact
opposite of what I think we want on x86. One could argue the x86
maintainers should improve this situation...
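For concreteness, the kind of cost query at issue looks roughly like this
untested sketch (ptr is a hypothetical address pseudo; names are mine):

  /* Untested sketch of the cost comparison at issue.  On x86 both forms
     assemble to the same movzbl, yet the queried costs can differ and
     steer the expander toward QImode.  */
  bool speed = optimize_insn_for_speed_p ();
  rtx mem = gen_rtx_MEM (QImode, ptr);
  int qi_cost = rtx_cost (mem, QImode, SET, 1, speed);
  int zext_cost = rtx_cost (gen_rtx_ZERO_EXTEND (SImode, mem),
                            SImode, SET, 1, speed);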
>
> Note I would have expected to use the mode of the load so we truly
> elide some extensions, using word_mode looks like just another
> mode here? The key to note is probably
>
> op0 = convert_modes (mode, unit_mode, op0, 1);
> op1 = convert_modes (mode, unit_mode, op1, 1);
> rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
> result, 1, OPTAB_WIDEN);
>
> which uses OPTAB_WIDEN - wouldn't it be better to pass in the
> unconverted modes and leave the decision which mode to use
> to OPTAB_WIDEN? Should we somehow query the target for
> the smallest mode from unit_mode it can do both the MINUS
> and the compare?
And avoiding OPTAB_WIDEN isn't going to help rv64 at all.  The core issue
is that we do define 32-bit ops.  With Jivan's patch those 32-bit ops
expose their sign-extending nature.  So a 32-bit add would look something
like
(set (temp:DI) (sign_extend:DI (plus:SI (op:SI) (op:SI))))
(set (res:SI) (subreg:SI (temp:DI) 0))
where we mark the subreg with SUBREG_PROMOTED_VAR_P.
I'm not sure the best way to proceed now. I could just put this on the
back-burner as it's RISC-V specific and the gains elsewhere dwarf this
issue.
jeff
On Mon, Jun 5, 2023 at 8:41 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
> [Full quote of Jeff's previous message trimmed; see above.]
>
I wonder if there's some more generic target macro we can key the
behavior off - SLOW_BYTE_ACCESS isn't a good fit, WORD_REGISTER_OPERATIONS
is maybe closer but its exact implications are unknown to me.  Maybe
there's something else as well ...
The point about OPTAB_WIDEN above was that I wonder why we
extend 'op0' and 'op1' before emitting the binop when we allow WIDEN
anyway. Yes, we want the result in 'mode' (but why? As you say we
can extend at the end) and there's likely no way to tell expand_simple_binop
to "expand as needed and not narrow the result". So I wonder if we should
emulate that somehow (also taking into consideration the compare).
Richard.
On 6/6/23 00:47, Richard Biener wrote:
>
> I wonder if there's some more generic target macro we can key the
> behavior off - SLOW_BYTE_ACCESS isn't a good fit, WORD_REGISTER_OPERATIONS
> is maybe closer but it's exact implications are unknown to me. Maybe
> there's something else as well ...
LOAD_EXTEND_OP might help here, at least on some targets. Though not on
x86.
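As an untested illustration of that idea (not from the patch), one could
key the intermediate mode off the target's load-extend behavior:

  /* Hypothetical sketch: if loads from UNIT_MODE are implicitly extended
     toward word_mode for free (LOAD_EXTEND_OP != UNKNOWN, as on most
     RISC targets but not x86), do the arithmetic in word_mode; otherwise
     stick with the result mode.  */
  machine_mode imode
    = LOAD_EXTEND_OP (unit_mode) != UNKNOWN ? word_mode : mode;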
>
> The point about OPTAB_WIDEN above was that I wonder why we
> extend 'op0' and 'op1' before emitting the binop when we allow WIDEN
> anyway.
Ahh.  I misunderstood.  However, I think dropping the pre-widening will
result in byte ops on x86, which may not be wise given the partial
register stall problem that exists on some variants.
> Yes, we want the result in 'mode' (but why? As you say we
> can extend at the end) and there's likely no way to tell expand_simple_binop
> to "expand as needed and not narrow the result". So I wonder if we should
> emulate that somehow (also taking into consideration the compare).
That's what I felt I was starting to build.  Essentially looking at
costing (and probably other stuff eventually, like the ability to
compare/branch on narrower modes) to make a determination about whether
or not to do the operations in narrower or wider modes.  With the costing
so mucked up on x86 though, I'm hesitant to pursue this path further at
this time.
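For concreteness, the smallest-workable-mode query Richard suggested might
look like this untested sketch (names and placement are illustrative only,
not from any posted patch):

  /* Untested sketch: walk the integer modes from UNIT_MODE upward and
     pick the first one for which the target can do both the subtract
     and the compare-and-branch directly, falling back to word_mode.  */
  scalar_int_mode m, best = as_a <scalar_int_mode> (word_mode);
  FOR_EACH_MODE_FROM (m, unit_mode)
    if (optab_handler (sub_optab, m) != CODE_FOR_nothing
        && can_compare_p (NE, m, ccp_jump))
      {
        best = m;
        break;
      }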
Jeff
@@ -7135,6 +7135,9 @@ inline_string_cmp (rtx target, tree var_str, const char *const_str,
   scalar_int_mode unit_mode
     = as_a <scalar_int_mode> TYPE_MODE (unit_type_node);
 
+  /* We do the intermediate steps in WORD_MODE, then convert
+     to mode at the end of the sequence.  */
+  rtx intermediate_result = gen_reg_rtx (word_mode);
   start_sequence ();
 
   for (unsigned HOST_WIDE_INT i = 0; i < length; i++)
@@ -7145,25 +7148,27 @@ inline_string_cmp (rtx target, tree var_str, const char *const_str,
       rtx op0 = (const_str_n == 1) ? const_rtx : var_rtx;
       rtx op1 = (const_str_n == 1) ? var_rtx : const_rtx;
 
-      op0 = convert_modes (mode, unit_mode, op0, 1);
-      op1 = convert_modes (mode, unit_mode, op1, 1);
-      rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
-                                      result, 1, OPTAB_WIDEN);
+      op0 = convert_modes (word_mode, unit_mode, op0, 1);
+      op1 = convert_modes (word_mode, unit_mode, op1, 1);
+      rtx diff = expand_simple_binop (word_mode, MINUS, op0, op1,
+                                      intermediate_result, 1, OPTAB_WIDEN);
 
-      /* Force the difference into result register.  We cannot reassign
-         result here ("result = diff") or we may end up returning
-         uninitialized result when expand_simple_binop allocates a new
-         pseudo-register for returning.  */
-      if (diff != result)
-        emit_move_insn (result, diff);
+      /* Force the difference into intermediate result register.  We cannot
+         reassign the intermediate result here ("intermediate_result = diff")
+         or we may end up returning uninitialized result when
+         expand_simple_binop allocates a new pseudo-register for returning.  */
+      if (diff != intermediate_result)
+        emit_move_insn (intermediate_result, diff);
 
       if (i < length - 1)
-        emit_cmp_and_jump_insns (result, CONST0_RTX (mode), NE, NULL_RTX,
-                                 mode, true, ne_label);
+        emit_cmp_and_jump_insns (intermediate_result, CONST0_RTX (word_mode),
+                                 NE, NULL_RTX, word_mode, true, ne_label);
       offset += GET_MODE_SIZE (unit_mode);
     }
 
   emit_label (ne_label);
+  /* Now convert the intermediate result to the final result.  */
+  emit_move_insn (result, gen_lowpart (mode, intermediate_result));
   rtx_insn *insns = get_insns ();
   end_sequence ();
   emit_insn (insns);