[x86_64] Support memory destinations and wide immediate constants in STV.
Checks
Context |
Check |
Description |
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 |
success
|
Build passed
|
linaro-tcwg-bot/tcwg_gcc_build--master-arm |
success
|
Build passed
|
linaro-tcwg-bot/tcwg_gcc_check--master-arm |
success
|
Test passed
|
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 |
success
|
Test passed
|
Commit Message
Hi Uros,
Very many thanks for the quick review and approval. Here's another.
This patch implements two improvements/refinements to the i386 backend's
Scalar-To-Vector (STV) pass. The first is to support memory destinations
in binary logic operations, and the second is to provide more accurate
costs/gains for (wide) immediate constants in binary logic operations.
A motivating example is gcc.target/i386/movti-2.c:
__int128 m;
void foo()
{
m &= ((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL;
}
for which STV1 currently generates a warning/error:
> r100 has non convertible use in insn 6
(insn 5 2 6 2 (set (reg:TI 100)
(const_wide_int 0x123456789abcdef0123456789abcdef)) "movti-2.c":7:7
87 {
*movti_internal}
(nil))
(insn 6 5 0 2 (parallel [
(set (mem/c:TI (symbol_ref:DI ("m") [flags 0x2] <var_decl
0x7f36d1c
27c60 m>) [1 m+0 S16 A128])
(and:TI (mem/c:TI (symbol_ref:DI ("m") [flags 0x2]
<var_decl 0x
7f36d1c27c60 m>) [1 m+0 S16 A128])
(reg:TI 100)))
(clobber (reg:CC 17 flags))
]) "movti-2.c":7:7 645 {*andti3_doubleword}
(expr_list:REG_DEAD (reg:TI 100)
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil))))
and therefore generates the following scalar code with -O2 -mavx
foo: movabsq $81985529216486895, %rax
andq %rax, m(%rip)
andq %rax, m+8(%rip)
ret
with this patch we now support read-modify-write instructions (as STV
candidates), splitting them into explicit read-modify instructions
followed by an explicit write instruction. Hence, we now produce
(when not optimizing for size):
foo: movabsq $81985529216486895, %rax
vmovq %rax, %xmm0
vpunpcklqdq %xmm0, %xmm0, %xmm0
vpand m(%rip), %xmm0, %xmm0
vmovdqa %xmm0, m(%rip)
ret
This code also handles the const_wide_int in example above, correcting
the costs/gains when the hi/lo words are the same. One minor complication
is that the middle-end assumes (when generating memset) that SSE constants
will be shared/amortized across multiple consecutive writes. Hence to
avoid testsuite regressions, we add a heuristic that considers an immediate
constant to be very cheap, if that same immediate value occurs in the
previous instruction or in the instruction after.
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures. Ok for mainline?
2024-08-05 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-features.cc (timode_immed_const_gain): New
function to determine the gain/cost on a CONST_WIDE_INT.
(local_duplicate_constant_p): Helper function to see if the
same immediate constant appears in the previous or next insn.
(timode_scalar_chain::compute_convert_gain): Fix whitespace.
<case CONST_WIDE_INT>: Provide more accurate estimates using
timode_immed_const_gain and local_duplicate_constant_p.
<case AND>: Handle MEM_P (dst) and CONSTANT_SCALAR_INT_P (src).
(timode_scalar_to_vector_candidate_p): Support the first operand
of AND, IOR and XOR being MEM_P (i.e. a read-modify-write insn).
gcc/testsuite/ChangeLog
* gcc.target/i386/movti-2.c: Change dg-options to -Os.
* gcc.target/i386/movti-4.c: Expected output of original movti-2.c.
Thanks again,
Roger
--
Comments
On Mon, Aug 5, 2024 at 5:50 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Very many thanks for the quick review and approval. Here's another.
>
> This patch implements two improvements/refinements to the i386 backend's
> Scalar-To-Vector (STV) pass. The first is to support memory destinations
> in binary logic operations, and the second is to provide more accurate
> costs/gains for (wide) immediate constants in binary logic operations.
Please do not mix together changes made for different reasons, as
advised in "Contributing to GCC" [1], section "Submitting Patches".
[1] https://gcc.gnu.org/contribute.html
Uros.
>
> A motivating example is gcc.target/i386/movti-2.c:
>
> __int128 m;
> void foo()
> {
> m &= ((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL;
> }
>
> for which STV1 currently generates a warning/error:
> > r100 has non convertible use in insn 6
>
> (insn 5 2 6 2 (set (reg:TI 100)
> (const_wide_int 0x123456789abcdef0123456789abcdef)) "movti-2.c":7:7
> 87 {
> *movti_internal}
> (nil))
> (insn 6 5 0 2 (parallel [
> (set (mem/c:TI (symbol_ref:DI ("m") [flags 0x2] <var_decl
> 0x7f36d1c
> 27c60 m>) [1 m+0 S16 A128])
> (and:TI (mem/c:TI (symbol_ref:DI ("m") [flags 0x2]
> <var_decl 0x
> 7f36d1c27c60 m>) [1 m+0 S16 A128])
> (reg:TI 100)))
> (clobber (reg:CC 17 flags))
> ]) "movti-2.c":7:7 645 {*andti3_doubleword}
> (expr_list:REG_DEAD (reg:TI 100)
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (nil))))
>
> and therefore generates the following scalar code with -O2 -mavx
>
> foo: movabsq $81985529216486895, %rax
> andq %rax, m(%rip)
> andq %rax, m+8(%rip)
> ret
>
> with this patch we now support read-modify-write instructions (as STV
> candidates), splitting them into explicit read-modify instructions
> followed by an explicit write instruction. Hence, we now produce
> (when not optimizing for size):
>
> foo: movabsq $81985529216486895, %rax
> vmovq %rax, %xmm0
> vpunpcklqdq %xmm0, %xmm0, %xmm0
> vpand m(%rip), %xmm0, %xmm0
> vmovdqa %xmm0, m(%rip)
> ret
>
> This code also handles the const_wide_int in example above, correcting
> the costs/gains when the hi/lo words are the same. One minor complication
> is that the middle-end assumes (when generating memset) that SSE constants
> will be shared/amortized across multiple consecutive writes. Hence to
> avoid testsuite regressions, we add a heuristic that considers an immediate
> constant to be very cheap, if that same immediate value occurs in the
> previous instruction or in the instruction after.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures. Ok for mainline?
>
>
> 2024-08-05 Roger Sayle <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> * config/i386/i386-features.cc (timode_immed_const_gain): New
> function to determine the gain/cost on a CONST_WIDE_INT.
> (local_duplicate_constant_p): Helper function to see if the
> same immediate constant appears in the previous or next insn.
> (timode_scalar_chain::compute_convert_gain): Fix whitespace.
> <case CONST_WIDE_INT>: Provide more accurate estimates using
> timode_immed_const_gain and local_duplicate_constant_p.
> <case AND>: Handle MEM_P (dst) and CONSTANT_SCALAR_INT_P (src).
> (timode_scalar_to_vector_candidate_p): Support the first operand
> of AND, IOR and XOR being MEM_P (i.e. a read-modify-write insn).
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/movti-2.c: Change dg-options to -Os.
> * gcc.target/i386/movti-4.c: Expected output of original movti-2.c.
>
>
> Thanks again,
> Roger
> --
>
@@ -1503,6 +1503,54 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
df_insn_rescan (insn);
}
+/* Helper function to compute gain for loading an immediate constant.
+ Typically, two movabsq for TImode vs. vmovdqa for V1TImode, but
+ with numerous special cases. */
+
+static int
+timode_immed_const_gain (rtx cst)
+{
+ /* movabsq vs. movabsq+vmovq+vunpacklqdq. */
+ if (CONST_WIDE_INT_P (cst)
+ && CONST_WIDE_INT_NUNITS (cst) == 2
+ && CONST_WIDE_INT_ELT (cst, 0) == CONST_WIDE_INT_ELT (cst, 1))
+ return optimize_insn_for_size_p () ? -COSTS_N_BYTES (9)
+ : -COSTS_N_INSNS (2);
+ /* 2x movabsq ~ vmovdqa. */
+ return 0;
+}
+
+/* Return true if the constant CST in mode MODE is found as an
+ immediate operand in the insn after INSN, or the insn before it. */
+
+static bool
+local_duplicate_constant_p (rtx_insn *insn, machine_mode mode, rtx cst)
+{
+ rtx set;
+
+ rtx_insn *next = NEXT_INSN (insn);
+ if (next)
+ {
+ set = single_set (insn);
+ if (set
+ && GET_MODE (SET_DEST (set)) == mode
+ && rtx_equal_p (SET_SRC (set), cst))
+ return true;
+ }
+
+ rtx_insn *prev = PREV_INSN (insn);
+ if (prev)
+ {
+ set = single_set (insn);
+ if (set
+ && GET_MODE (SET_DEST (set)) == mode
+ && rtx_equal_p (SET_SRC (set), cst))
+ return true;
+ }
+ return false;
+}
+
+
/* Compute a gain for chain conversion. */
int
@@ -1549,7 +1597,17 @@ timode_scalar_chain::compute_convert_gain ()
case CONST_INT:
if (MEM_P (dst)
&& standard_sse_constant_p (src, V1TImode))
- igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (11) : 1;
+ igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (11) : 1;
+ break;
+
+ case CONST_WIDE_INT:
+ igain = local_duplicate_constant_p (insn, TImode, src)
+ ? 0
+ : timode_immed_const_gain (src);
+ /* 2 x mov vs. vmovdqa. */
+ if (MEM_P (dst))
+ igain += optimize_insn_for_size_p () ? COSTS_N_BYTES (3)
+ : COSTS_N_INSNS (1);
break;
case NOT:
@@ -1560,8 +1618,14 @@ timode_scalar_chain::compute_convert_gain ()
case AND:
case XOR:
case IOR:
- if (!MEM_P (dst))
- igain = COSTS_N_INSNS (1);
+ if (MEM_P (dst))
+ igain = optimize_insn_for_size_p () ? -COSTS_N_BYTES (2)
+ : 0;
+ else
+ igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (6)
+ : COSTS_N_INSNS (1);
+ if (CONST_SCALAR_INT_P (XEXP (src, 1)))
+ igain += timode_immed_const_gain (XEXP (src, 1));
break;
case ASHIFT:
@@ -2303,14 +2367,16 @@ timode_scalar_to_vector_candidate_p (rtx_insn *insn)
|| CONST_SCALAR_INT_P (XEXP (src, 1))
|| timode_mem_p (XEXP (src, 1))))
return true;
- return REG_P (XEXP (src, 0))
+ return (REG_P (XEXP (src, 0))
+ || timode_mem_p (XEXP (src, 0)))
&& (REG_P (XEXP (src, 1))
|| CONST_SCALAR_INT_P (XEXP (src, 1))
|| timode_mem_p (XEXP (src, 1)));
case IOR:
case XOR:
- return REG_P (XEXP (src, 0))
+ return (REG_P (XEXP (src, 0))
+ || timode_mem_p (XEXP (src, 0)))
&& (REG_P (XEXP (src, 1))
|| CONST_SCALAR_INT_P (XEXP (src, 1))
|| timode_mem_p (XEXP (src, 1)));
@@ -1,5 +1,5 @@
/* { dg-do compile { target int128 } } */
-/* { dg-options "-O2 -mavx" } */
+/* { dg-options "-Os -mavx" } */
__int128 m;
void foo()
new file mode 100644
@@ -0,0 +1,11 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -mavx" } */
+__int128 m;
+
+void foo()
+{
+ m &= ((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL;
+}
+
+/* { dg-final { scan-assembler-times "movabsq" 1 } } */
+/* { dg-final { scan-assembler-times "vpand" 1 } } */