[x86_64] Support memory destinations and wide immediate constants in STV.

Message ID 00ad01dae74f$2a3c79d0$7eb56d70$@nextmovesoftware.com
State New
Headers
Series [x86_64] Support memory destinations and wide immediate constants in STV. |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed

Commit Message

Roger Sayle Aug. 5, 2024, 3:50 p.m. UTC
  Hi Uros,
Very many thanks for the quick review and approval.  Here's another.

This patch implements two improvements/refinements to the i386 backend's
Scalar-To-Vector (STV) pass.  The first is to support memory destinations
in binary logic operations, and the second is to provide more accurate
costs/gains for (wide) immediate constants in binary logic operations.

A motivating example is gcc.target/i386/movti-2.c:

__int128 m;
void foo()
{
    m &= ((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL;
}

for which STV1 currently generates a warning/error:
> r100 has non convertible use in insn 6

(insn 5 2 6 2 (set (reg:TI 100)
        (const_wide_int 0x123456789abcdef0123456789abcdef)) "movti-2.c":7:7
87 {
*movti_internal}
     (nil))
(insn 6 5 0 2 (parallel [
            (set (mem/c:TI (symbol_ref:DI ("m") [flags 0x2]  <var_decl
0x7f36d1c
27c60 m>) [1 m+0 S16 A128])
                (and:TI (mem/c:TI (symbol_ref:DI ("m") [flags 0x2]
<var_decl 0x
7f36d1c27c60 m>) [1 m+0 S16 A128])
                    (reg:TI 100)))
            (clobber (reg:CC 17 flags))
        ]) "movti-2.c":7:7 645 {*andti3_doubleword}
     (expr_list:REG_DEAD (reg:TI 100)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

and therefore generates the following scalar code with -O2 -mavx

foo:    movabsq $81985529216486895, %rax
        andq    %rax, m(%rip)
        andq    %rax, m+8(%rip)
        ret

with this patch we now support read-modify-write instructions (as STV
candidates), splitting them into explicit read-modify instructions
followed by an explicit write instruction.  Hence, we now produce
(when not optimizing for size):

foo:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq     %xmm0, %xmm0, %xmm0
        vpand   m(%rip), %xmm0, %xmm0
        vmovdqa %xmm0, m(%rip)
        ret

This code also handles the const_wide_int in example above, correcting
the costs/gains when the hi/lo words are the same.  One minor complication
is that the middle-end assumes (when generating memset) that SSE constants
will be shared/amortized across multiple consecutive writes.  Hence to
avoid testsuite regressions, we add a heuristic that considers an immediate
constant to be very cheap, if that same immediate value occurs in the
previous instruction or in the instruction after.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2024-08-05  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-features.cc (timode_immed_const_gain): New
        function to determine the gain/cost on a CONST_WIDE_INT.
        (local_duplicate_constant_p): Helper function to see if the
        same immediate constant appears in the previous or next insn.
        (timode_scalar_chain::compute_convert_gain): Fix whitespace.
        <case CONST_WIDE_INT>: Provide more accurate estimates using
        timode_immed_const_gain and local_duplicate_constant_p.
        <case AND>: Handle MEM_P (dst) and CONSTANT_SCALAR_INT_P (src).
        (timode_scalar_to_vector_candidate_p): Support the first operand
        of AND, IOR and XOR being MEM_P (i.e. a read-modify-write insn).

gcc/testsuite/ChangeLog
        * gcc.target/i386/movti-2.c: Change dg-options to -Os.
        * gcc.target/i386/movti-4.c: Expected output of original movti-2.c.


Thanks again,
Roger
--
  

Comments

Uros Bizjak Aug. 6, 2024, 7:35 a.m. UTC | #1
On Mon, Aug 5, 2024 at 5:50 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Very many thanks for the quick review and approval.  Here's another.
>
> This patch implements two improvements/refinements to the i386 backend's
> Scalar-To-Vector (STV) pass.  The first is to support memory destinations
> in binary logic operations, and the second is to provide more accurate
> costs/gains for (wide) immediate constants in binary logic operations.

Please do not mix together changes made for different reasons, as
advised in "Contributing to GCC" [1], section "Submitting Patches".

[1] https://gcc.gnu.org/contribute.html

Uros.

>
> A motivating example is gcc.target/i386/movti-2.c:
>
> __int128 m;
> void foo()
> {
>     m &= ((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL;
> }
>
> for which STV1 currently generates a warning/error:
> > r100 has non convertible use in insn 6
>
> (insn 5 2 6 2 (set (reg:TI 100)
>         (const_wide_int 0x123456789abcdef0123456789abcdef)) "movti-2.c":7:7
> 87 {
> *movti_internal}
>      (nil))
> (insn 6 5 0 2 (parallel [
>             (set (mem/c:TI (symbol_ref:DI ("m") [flags 0x2]  <var_decl
> 0x7f36d1c
> 27c60 m>) [1 m+0 S16 A128])
>                 (and:TI (mem/c:TI (symbol_ref:DI ("m") [flags 0x2]
> <var_decl 0x
> 7f36d1c27c60 m>) [1 m+0 S16 A128])
>                     (reg:TI 100)))
>             (clobber (reg:CC 17 flags))
>         ]) "movti-2.c":7:7 645 {*andti3_doubleword}
>      (expr_list:REG_DEAD (reg:TI 100)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
>
> and therefore generates the following scalar code with -O2 -mavx
>
> foo:    movabsq $81985529216486895, %rax
>         andq    %rax, m(%rip)
>         andq    %rax, m+8(%rip)
>         ret
>
> with this patch we now support read-modify-write instructions (as STV
> candidates), splitting them into explicit read-modify instructions
> followed by an explicit write instruction.  Hence, we now produce
> (when not optimizing for size):
>
> foo:    movabsq $81985529216486895, %rax
>         vmovq   %rax, %xmm0
>         vpunpcklqdq     %xmm0, %xmm0, %xmm0
>         vpand   m(%rip), %xmm0, %xmm0
>         vmovdqa %xmm0, m(%rip)
>         ret
>
> This code also handles the const_wide_int in example above, correcting
> the costs/gains when the hi/lo words are the same.  One minor complication
> is that the middle-end assumes (when generating memset) that SSE constants
> will be shared/amortized across multiple consecutive writes.  Hence to
> avoid testsuite regressions, we add a heuristic that considers an immediate
> constant to be very cheap, if that same immediate value occurs in the
> previous instruction or in the instruction after.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-08-05  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-features.cc (timode_immed_const_gain): New
>         function to determine the gain/cost on a CONST_WIDE_INT.
>         (local_duplicate_constant_p): Helper function to see if the
>         same immediate constant appears in the previous or next insn.
>         (timode_scalar_chain::compute_convert_gain): Fix whitespace.
>         <case CONST_WIDE_INT>: Provide more accurate estimates using
>         timode_immed_const_gain and local_duplicate_constant_p.
>         <case AND>: Handle MEM_P (dst) and CONSTANT_SCALAR_INT_P (src).
>         (timode_scalar_to_vector_candidate_p): Support the first operand
>         of AND, IOR and XOR being MEM_P (i.e. a read-modify-write insn).
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/movti-2.c: Change dg-options to -Os.
>         * gcc.target/i386/movti-4.c: Expected output of original movti-2.c.
>
>
> Thanks again,
> Roger
> --
>
  

Patch

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index 3da56dd..8ad0ae7 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -1503,6 +1503,54 @@  general_scalar_chain::convert_insn (rtx_insn *insn)
   df_insn_rescan (insn);
 }
 
+/* Helper function to compute gain for loading an immediate constant.
+   Typically, two movabsq for TImode vs. vmovdqa for V1TImode, but
+   with numerous special cases.  */
+
+static int
+timode_immed_const_gain (rtx cst)
+{
+  /* movabsq vs. movabsq+vmovq+vunpacklqdq.  */
+  if (CONST_WIDE_INT_P (cst)
+      && CONST_WIDE_INT_NUNITS (cst) == 2
+      && CONST_WIDE_INT_ELT (cst, 0) == CONST_WIDE_INT_ELT (cst, 1))
+    return optimize_insn_for_size_p () ? -COSTS_N_BYTES (9)
+				       : -COSTS_N_INSNS (2);
+  /* 2x movabsq ~ vmovdqa.  */
+  return 0;
+}
+
+/* Return true if the constant CST in mode MODE is found as an
+   immediate operand in the insn after INSN, or the insn before it.  */
+
+static bool
+local_duplicate_constant_p (rtx_insn *insn, machine_mode mode, rtx cst)
+{
+  rtx set;
+
+  rtx_insn *next = NEXT_INSN (insn);
+  if (next)
+    {
+      set = single_set (insn);
+      if (set
+	  && GET_MODE (SET_DEST (set)) == mode
+	  && rtx_equal_p (SET_SRC (set), cst))
+	return true;
+    }
+
+  rtx_insn *prev = PREV_INSN (insn);
+  if (prev)
+    {
+      set = single_set (insn);
+      if (set
+	  && GET_MODE (SET_DEST (set)) == mode
+	  && rtx_equal_p (SET_SRC (set), cst))
+	return true;
+    }
+  return false;
+}
+
+
 /* Compute a gain for chain conversion.  */
 
 int
@@ -1549,7 +1597,17 @@  timode_scalar_chain::compute_convert_gain ()
 	case CONST_INT:
 	  if (MEM_P (dst)
 	      && standard_sse_constant_p (src, V1TImode))
-	    igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (11) : 1;
+	    igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (11) : 1;
+	  break;
+
+	case CONST_WIDE_INT:
+	  igain = local_duplicate_constant_p (insn, TImode, src)
+		  ? 0
+		  : timode_immed_const_gain (src);
+	  /* 2 x mov vs. vmovdqa.  */
+          if (MEM_P (dst))
+	    igain += optimize_insn_for_size_p () ? COSTS_N_BYTES (3)
+						 : COSTS_N_INSNS (1);
 	  break;
 
 	case NOT:
@@ -1560,8 +1618,14 @@  timode_scalar_chain::compute_convert_gain ()
 	case AND:
 	case XOR:
 	case IOR:
-	  if (!MEM_P (dst))
-	    igain = COSTS_N_INSNS (1);
+	  if (MEM_P (dst))
+	    igain = optimize_insn_for_size_p () ? -COSTS_N_BYTES (2)
+						: 0;
+	  else
+	    igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (6)
+						: COSTS_N_INSNS (1);
+	  if (CONST_SCALAR_INT_P (XEXP (src, 1)))
+	    igain += timode_immed_const_gain (XEXP (src, 1));
 	  break;
 
 	case ASHIFT:
@@ -2303,14 +2367,16 @@  timode_scalar_to_vector_candidate_p (rtx_insn *insn)
 	      || CONST_SCALAR_INT_P (XEXP (src, 1))
 	      || timode_mem_p (XEXP (src, 1))))
 	return true;
-      return REG_P (XEXP (src, 0))
+      return (REG_P (XEXP (src, 0))
+	      || timode_mem_p (XEXP (src, 0)))
 	     && (REG_P (XEXP (src, 1))
 		 || CONST_SCALAR_INT_P (XEXP (src, 1))
 		 || timode_mem_p (XEXP (src, 1)));
 
     case IOR:
     case XOR:
-      return REG_P (XEXP (src, 0))
+      return (REG_P (XEXP (src, 0))
+	      || timode_mem_p (XEXP (src, 0)))
 	     && (REG_P (XEXP (src, 1))
 		 || CONST_SCALAR_INT_P (XEXP (src, 1))
 		 || timode_mem_p (XEXP (src, 1)));
diff --git a/gcc/testsuite/gcc.target/i386/movti-2.c b/gcc/testsuite/gcc.target/i386/movti-2.c
index 73f69d2..c3a6ae3 100644
--- a/gcc/testsuite/gcc.target/i386/movti-2.c
+++ b/gcc/testsuite/gcc.target/i386/movti-2.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile { target int128 } } */
-/* { dg-options "-O2 -mavx" } */
+/* { dg-options "-Os -mavx" } */
 __int128 m;
 
 void foo()
diff --git a/gcc/testsuite/gcc.target/i386/movti-4.c b/gcc/testsuite/gcc.target/i386/movti-4.c
new file mode 100644
index 0000000..eac66fc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/movti-4.c
@@ -0,0 +1,11 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -mavx" } */
+__int128 m;
+
+void foo()
+{
+    m &= ((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL;
+}
+
+/* { dg-final { scan-assembler-times "movabsq" 1 } } */
+/* { dg-final { scan-assembler-times "vpand" 1 } } */