[i386] Add extra cost for unaligned_load which may have store forwarding stall issue.

Message ID 20220316010300.88783-1-hongtao.liu@intel.com
State New

Commit Message

Liu, Hongtao March 16, 2022, 1:03 a.m. UTC
This patch only handles pure SLP for parameters passed by value, which
depends only on the psABI and has nothing to do with IPA. Handling
parameters passed by reference would require IPA.

The patch is aggressive in determining STLF (store-to-load forwarding)
failure: any unaligned_load from a parm_decl passed on the stack is
assumed to have an STLF stall issue. This could lose some performance
where there is no such issue (1 vector_load vs. n scalar_loads + CTOR).
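To make the trade-off concrete, here is a minimal sketch of the two code shapes being compared. It is illustrative only: `struct pair`, `sum_scalar` and `sum_vector` are hypothetical names that are not part of the patch, the struct is assumed to be passed on the stack under the x86-64 psABI (it has the same layout as `struct X` in pr101908-2.c), and an optimizing compiler may rewrite either function into the other shape.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical aggregate; 32 bytes, assumed to be passed in memory.  */
struct pair { double v[4]; };

/* GCC/Clang vector extension: two doubles in 16 bytes.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Shape 1: n scalar loads + CTOR.  Each element is read with an 8-byte
   load and the vector is built from the scalars.  */
void
sum_scalar (struct pair x, struct pair y, double out[2])
{
  v2df a = { x.v[0], x.v[1] };
  v2df b = { y.v[0], y.v[1] };
  v2df r = a + b;
  memcpy (out, &r, sizeof r);
}

/* Shape 2: one 16-byte load per argument.  If the caller wrote the
   argument slot with narrower stores, this wide load may not be
   forwarded from the store buffer.  */
void
sum_vector (struct pair x, struct pair y, double out[2])
{
  v2df a, b, r;
  assert (sizeof a == 2 * sizeof (double));
  memcpy (&a, x.v, sizeof a);
  memcpy (&b, y.v, sizeof b);
  r = a + b;
  memcpy (out, &r, sizeof r);
}
```

Both functions compute the same result; the patch only changes which shape the SLP vectorizer considers cheaper when the loads come from a stack-passed parameter.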

According to the microbenchmark in the PR, the cost of an STLF failure
is generally between 8 and 16 scalar_loads on most recent Intel/AMD
processors.
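As a rough sketch of how the new per-processor value feeds into the statement cost, the arithmetic in add_stmt_cost can be reproduced standalone. The macro below is assumed to match the definition of COSTS_N_INSNS in GCC's rtl.h; `stlf_penalty` and `adjusted_load_cost` are hypothetical helpers written for this illustration and do not exist in the patch.

```c
#include <assert.h>

/* Assumed to match COSTS_N_INSNS in GCC's rtl.h.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper: the extra cost added to an unaligned_load that
   may hit an STLF stall, given the stfs field of processor_costs.
   The division is an integer division, as in the patch.  */
int
stlf_penalty (int stfs)
{
  assert (stfs >= 0);
  return COSTS_N_INSNS (stfs / 2);
}

/* Hypothetical helper: cost of the load after the adjustment.  */
int
adjusted_load_cost (int base_cost, int stfs, int may_stall)
{
  return may_stall ? base_cost + stlf_penalty (stfs) : base_cost;
}
```

With the values from x86-tune-costs.h in this patch, `stlf_penalty (26)` (skylake_cost) evaluates to 52 and `stlf_penalty (54)` (generic_cost) to 108; an odd value such as 21 (amdfam10_cost) is truncated by the integer division and gives 40.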

gcc/ChangeLog:

	PR target/101908
	* config/i386/i386.cc (ix86_load_maybe_stfs_p): New.
	(ix86_vector_costs::add_stmt_cost): Add extra cost for
	unaligned_load which may have store forwarding stall issue.
	* config/i386/i386.h (processor_costs): Add new member
	stfs.
	* config/i386/x86-tune-costs.h (ix86_size_cost): Initialize
	stfs.
	(i386_cost, i486_cost, pentium_cost, lakemont_cost,
	pentiumpro_cost, geode_cost, k6_cost, athlon_cost, k8_cost,
	amdfam10_cost, bdver_cost, znver1_cost, znver2_cost,
	znver3_cost, skylake_cost, icelake_cost, alderlake_cost,
	btver1_cost, btver2_cost, pentium4_cost, nocona_cost,
	atom_cost, slm_cost, tremont_cost, intel_cost, generic_cost,
	core_cost): Ditto.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr101908-1.c: New test.
	* gcc.target/i386/pr101908-2.c: New test.
	* gcc.target/i386/pr101908-3.c: New test.
	* gcc.target/i386/pr101908-v16hi.c: New test.
	* gcc.target/i386/pr101908-v16qi.c: New test.
	* gcc.target/i386/pr101908-v16sf.c: New test.
	* gcc.target/i386/pr101908-v16si.c: New test.
	* gcc.target/i386/pr101908-v2df.c: New test.
	* gcc.target/i386/pr101908-v2di.c: New test.
	* gcc.target/i386/pr101908-v2hi.c: New test.
	* gcc.target/i386/pr101908-v2qi.c: New test.
	* gcc.target/i386/pr101908-v2sf.c: New test.
	* gcc.target/i386/pr101908-v2si.c: New test.
	* gcc.target/i386/pr101908-v4df.c: New test.
	* gcc.target/i386/pr101908-v4di.c: New test.
	* gcc.target/i386/pr101908-v4hi.c: New test.
	* gcc.target/i386/pr101908-v4qi.c: New test.
	* gcc.target/i386/pr101908-v4sf.c: New test.
	* gcc.target/i386/pr101908-v4si.c: New test.
	* gcc.target/i386/pr101908-v8df-adl.c: New test.
	* gcc.target/i386/pr101908-v8df.c: New test.
	* gcc.target/i386/pr101908-v8di-adl.c: New test.
	* gcc.target/i386/pr101908-v8di.c: New test.
	* gcc.target/i386/pr101908-v8hi-adl.c: New test.
	* gcc.target/i386/pr101908-v8hi.c: New test.
	* gcc.target/i386/pr101908-v8qi-adl.c: New test.
	* gcc.target/i386/pr101908-v8qi.c: New test.
	* gcc.target/i386/pr101908-v8sf-adl.c: New test.
	* gcc.target/i386/pr101908-v8sf.c: New test.
	* gcc.target/i386/pr101908-v8si-adl.c: New test.
	* gcc.target/i386/pr101908-v8si.c: New test.
---
 gcc/config/i386/i386.cc                       | 51 +++++++++++
 gcc/config/i386/i386.h                        |  1 +
 gcc/config/i386/x86-tune-costs.h              | 28 ++++++
 gcc/testsuite/gcc.target/i386/pr101908-1.c    | 12 +++
 gcc/testsuite/gcc.target/i386/pr101908-2.c    | 12 +++
 gcc/testsuite/gcc.target/i386/pr101908-3.c    | 90 +++++++++++++++++++
 .../gcc.target/i386/pr101908-v16hi.c          |  6 ++
 .../gcc.target/i386/pr101908-v16qi.c          | 30 +++++++
 .../gcc.target/i386/pr101908-v16sf.c          |  6 ++
 .../gcc.target/i386/pr101908-v16si.c          |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2df.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2di.c |  7 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2hi.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2qi.c | 16 ++++
 gcc/testsuite/gcc.target/i386/pr101908-v2sf.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v2si.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4df.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4di.c |  7 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4hi.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4qi.c | 18 ++++
 gcc/testsuite/gcc.target/i386/pr101908-v4sf.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v4si.c |  6 ++
 .../gcc.target/i386/pr101908-v8df-adl.c       |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8df.c |  6 ++
 .../gcc.target/i386/pr101908-v8di-adl.c       |  7 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8di.c |  7 ++
 .../gcc.target/i386/pr101908-v8hi-adl.c       |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8hi.c |  6 ++
 .../gcc.target/i386/pr101908-v8qi-adl.c       | 22 +++++
 gcc/testsuite/gcc.target/i386/pr101908-v8qi.c | 22 +++++
 .../gcc.target/i386/pr101908-v8sf-adl.c       |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8sf.c |  6 ++
 .../gcc.target/i386/pr101908-v8si-adl.c       |  6 ++
 gcc/testsuite/gcc.target/i386/pr101908-v8si.c |  6 ++
 34 files changed, 444 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v16si.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2di.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v2si.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4di.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v4si.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8df.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8di.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-v8si.c
  

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index d77ad83e437..c01809cc3da 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22988,6 +22988,46 @@  ix86_noce_conversion_profitable_p (rtx_insn *seq, struct noce_if_info *if_info)
   return default_noce_conversion_profitable_p (seq, if_info);
 }
 
+/* Return true if DR may have an STLF stall issue, otherwise false.
+   Any unaligned_load from a parm_decl which is passed by stack
+   is considered to have an STLF stall issue.  */
+static bool
+ix86_load_maybe_stfs_p (data_reference* dr)
+{
+  tree addr = DR_BASE_ADDRESS (dr);
+  if (TREE_CODE (addr) != ADDR_EXPR)
+    return false;
+  addr = get_base_address (TREE_OPERAND (addr, 0));
+
+  if (TREE_CODE (addr) != PARM_DECL)
+    return false;
+  tree type = TREE_TYPE (addr);
+  if (!type)
+    return false;
+
+  machine_mode mode = TYPE_MODE (type);
+
+  /* The check for stack-passed parameters may be inaccurate, i.e. a
+     parameter that could be put in registers may end up passed by
+     stack because registers have run out.  */
+  if (TARGET_64BIT)
+    {
+      /* From function_arg_64.  */
+      enum x86_64_reg_class regclass[MAX_CLASSES];
+      int zero_width_bitfields = 0;
+      return !classify_argument (mode, type, regclass, 0, zero_width_bitfields);
+    }
+  else
+    {
+      /* From function_arg_32.  */
+      return (mode == E_BLKmode
+	      || (AGGREGATE_TYPE_P (type)
+		  && (VECTOR_MODE_P (mode) || mode == TImode)));
+    }
+
+  return false;
+}
+
 /* x86-specific vector costs.  */
 class ix86_vector_costs : public vector_costs
 {
@@ -23218,6 +23258,17 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
 
+  /* Prevent vectorization of loads from parm_decl at O2 to avoid STLF
+     stalls.  Performance may be lost when there is no STLF issue (1
+     vector_load vs n scalar_load + CTOR).
+     TODO: both the extra cost (ix86_cost->stfs / 2) and
+     ix86_load_maybe_stfs_p need to be fine tuned.  */
+  if (kind == unaligned_load && stmt_info
+      && stmt_info->slp_type == pure_slp
+      && STMT_VINFO_DATA_REF (stmt_info)
+      && ix86_load_maybe_stfs_p (STMT_VINFO_DATA_REF (stmt_info)))
+    stmt_cost += COSTS_N_INSNS (ix86_cost->stfs / 2);
+
   /* Penalize DFmode vector operations for Bonnell.  */
   if (TARGET_CPU_P (BONNELL) && kind == vector_stmt
       && vectype && GET_MODE_INNER (TYPE_MODE (vectype)) == DFmode)
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 0d28e57f8f2..341f1c47981 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -168,6 +168,7 @@  struct processor_costs {
 				   in 32bit, 64bit, 128bit, 256bit and 512bit */
   const int sse_unaligned_load[5];/* cost of unaligned load.  */
   const int sse_unaligned_store[5];/* cost of unaligned store.  */
+  const int stfs;		 /* cost of store forward stalls.  */
   const int xmm_move, ymm_move, /* cost of moving XMM and YMM register.  */
 	    zmm_move;
   const int sse_to_integer;	/* cost of moving SSE register to integer.  */
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..3a5fcdeefdd 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -100,6 +100,7 @@  struct processor_costs ix86_size_cost = {/* costs for tuning for size */
 					   in 128bit, 256bit and 512bit */
   {3, 3, 3, 3, 3},			/* cost of unaligned SSE store
 					   in 128bit, 256bit and 512bit */
+  6,					/* cost of store forward stall.  */
   3, 3, 3,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   5, 0,					/* Gather load static, per_elt.  */
@@ -209,6 +210,7 @@  struct processor_costs i386_cost = {	/* 386 specific costs */
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -317,6 +319,7 @@  struct processor_costs i486_cost = {	/* 486 specific costs */
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -427,6 +430,7 @@  struct processor_costs pentium_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -528,6 +532,7 @@  struct processor_costs lakemont_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -644,6 +649,7 @@  struct processor_costs pentiumpro_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 8, 16, 32, 64},			/* cost of unaligned loads.  */
   {4, 8, 16, 32, 64},			/* cost of unaligned stores.  */
+  24,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -751,6 +757,7 @@  struct processor_costs geode_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
   {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
+  14,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   2, 2,					/* Gather load static, per_elt.  */
@@ -858,6 +865,7 @@  struct processor_costs k6_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {2, 2, 8, 16, 32},			/* cost of unaligned loads.  */
   {2, 2, 8, 16, 32},			/* cost of unaligned stores.  */
+  24,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   2, 2,					/* Gather load static, per_elt.  */
@@ -971,6 +979,7 @@  struct processor_costs athlon_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 4, 12, 12, 24},			/* cost of unaligned loads.  */
   {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
+  14,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   5,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -1086,6 +1095,7 @@  struct processor_costs k8_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 3, 12, 12, 24},			/* cost of unaligned loads.  */
   {4, 4, 10, 10, 20},			/* cost of unaligned stores.  */
+  14,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   5,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -1214,6 +1224,7 @@  struct processor_costs amdfam10_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {4, 4, 3, 7, 12},			/* cost of unaligned loads.  */
   {4, 4, 5, 10, 20},			/* cost of unaligned stores.  */
+  21,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   3,					/* cost of moving SSE register to integer.  */
   4, 4,					/* Gather load static, per_elt.  */
@@ -1334,6 +1345,7 @@  const struct processor_costs bdver_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {12, 12, 10, 40, 60},			/* cost of unaligned loads.  */
   {10, 10, 10, 40, 60},			/* cost of unaligned stores.  */
+  54,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   16,					/* cost of moving SSE register to integer.  */
   12, 12,				/* Gather load static, per_elt.  */
@@ -1475,6 +1487,7 @@  struct processor_costs znver1_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 12, 24},			/* cost of unaligned loads.  */
   {8, 8, 8, 16, 32},			/* cost of unaligned stores.  */
+  42,					/* cost of store forward stall.  */
   2, 3, 6,				/* cost of moving XMM,YMM,ZMM register.  */
   6,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
@@ -1630,6 +1643,7 @@  struct processor_costs znver2_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  42,					/* cost of store forward stall.  */
   2, 2, 3,				/* cost of moving XMM,YMM,ZMM
 					   register.  */
   6,					/* cost of moving SSE register to integer.  */
@@ -1762,6 +1776,7 @@  struct processor_costs znver3_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  42,					/* cost of store forward stall.  */
   2, 2, 3,				/* cost of moving XMM,YMM,ZMM
 					   register.  */
   6,					/* cost of moving SSE register to integer.  */
@@ -1907,6 +1922,7 @@  struct processor_costs skylake_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  26,					/* cost of store forward stall.  */
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   20, 8,				/* Gather load static, per_elt.  */
@@ -2033,6 +2049,7 @@  struct processor_costs icelake_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 20},			/* cost of unaligned loads.  */
   {8, 8, 8, 8, 16},			/* cost of unaligned stores.  */
+  26,					/* cost of store forward stall.  */
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   20, 8,				/* Gather load static, per_elt.  */
@@ -2153,6 +2170,7 @@  struct processor_costs alderlake_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
   {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
+  90,					/* cost of store forward stall.  */
   2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   18, 6,				/* Gather load static, per_elt.  */
@@ -2266,6 +2284,7 @@  const struct processor_costs btver1_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
   {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
+  36,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   14,					/* cost of moving SSE register to integer.  */
   10, 10,				/* Gather load static, per_elt.  */
@@ -2376,6 +2395,7 @@  const struct processor_costs btver2_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {10, 10, 12, 48, 96},			/* cost of unaligned loads.  */
   {10, 10, 12, 48, 96},			/* cost of unaligned stores.  */
+  36,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   14,					/* cost of moving SSE register to integer.  */
   10, 10,				/* Gather load static, per_elt.  */
@@ -2485,6 +2505,7 @@  struct processor_costs pentium4_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {32, 32, 32, 64, 128},		/* cost of unaligned loads.  */
   {32, 32, 32, 64, 128},		/* cost of unaligned stores.  */
+  10,					/* cost of store forward stall.  */
   12, 24, 48,				/* cost of moving XMM,YMM,ZMM register */
   20,					/* cost of moving SSE register to integer.  */
   16, 16,				/* Gather load static, per_elt.  */
@@ -2597,6 +2618,7 @@  struct processor_costs nocona_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {24, 24, 24, 48, 96},			/* cost of unaligned loads.  */
   {24, 24, 24, 48, 96},			/* cost of unaligned stores.  */
+  8,					/* cost of store forward stall.  */
   6, 12, 24,				/* cost of moving XMM,YMM,ZMM register */
   20,					/* cost of moving SSE register to integer.  */
   12, 12,				/* Gather load static, per_elt.  */
@@ -2707,6 +2729,7 @@  struct processor_costs atom_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
   {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
+  32,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   8,					/* cost of moving SSE register to integer.  */
   8, 8,					/* Gather load static, per_elt.  */
@@ -2817,6 +2840,7 @@  struct processor_costs slm_cost = {
 					   in SImode, DImode and TImode.  */
   {16, 16, 16, 32, 64},			/* cost of unaligned loads.  */
   {16, 16, 16, 32, 64},			/* cost of unaligned stores.  */
+  48,					/* cost of store forward stall.  */
   2, 4, 8,				/* cost of moving XMM,YMM,ZMM register */
   8,					/* cost of moving SSE register to integer.  */
   8, 8,					/* Gather load static, per_elt.  */
@@ -2939,6 +2963,7 @@  struct processor_costs tremont_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
   {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
+  42,					/* cost of store forward stall.  */
   2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   18, 6,				/* Gather load static, per_elt.  */
@@ -3051,6 +3076,7 @@  struct processor_costs intel_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
   {10, 10, 10, 10, 10},			/* cost of unaligned loads.  */
+  22,					/* cost of store forward stall.  */
   2, 2, 2,				/* cost of moving XMM,YMM,ZMM register */
   4,					/* cost of moving SSE register to integer.  */
   6, 6,					/* Gather load static, per_elt.  */
@@ -3168,6 +3194,7 @@  struct processor_costs generic_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 10, 15},			/* cost of unaligned loads.  */
   {6, 6, 6, 10, 15},			/* cost of unaligned storess.  */
+  54,					/* cost of store forward stall.  */
   2, 3, 4,				/* cost of moving XMM,YMM,ZMM register */
   6,					/* cost of moving SSE register to integer.  */
   18, 6,				/* Gather load static, per_elt.  */
@@ -3291,6 +3318,7 @@  struct processor_costs core_cost = {
 					   in 32bit, 64bit, 128bit, 256bit and 512bit */
   {6, 6, 6, 6, 12},			/* cost of unaligned loads.  */
   {6, 6, 6, 6, 12},			/* cost of unaligned stores.  */
+  26,					/* cost of store forward stall.  */
   2, 2, 4,				/* cost of moving XMM,YMM,ZMM register */
   2,					/* cost of moving SSE register to integer.  */
   /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
new file mode 100644
index 00000000000..f8e0f2e26bb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
+
+struct X { double x[2]; };
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df __attribute__((noipa))
+foo (struct X* x, struct X* y)
+{
+  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
new file mode 100644
index 00000000000..f4ff7a83c82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
+
+struct X { double x[4]; };
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df __attribute__((noipa))
+foo (struct X x, struct X y)
+{
+  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-3.c b/gcc/testsuite/gcc.target/i386/pr101908-3.c
new file mode 100644
index 00000000000..6f853aa7750
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-3.c
@@ -0,0 +1,90 @@ 
+/* PR target/101908.  */
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64 -O2 -mtune=generic -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>.*ray \+ 24B} "slp2" } }  */
+/* This testcase is used to avoid STLF stall.  */
+
+#define sqrt __builtin_sqrt
+#define SQ(x)		((x) * (x))
+struct vec3 {
+  double x, y, z;
+};
+
+struct ray {
+  struct vec3 orig, dir;
+};
+
+struct material {
+  struct vec3 col;	/* color */
+  double spow;		/* specular power */
+  double refl;		/* reflection intensity */
+};
+
+struct sphere {
+  struct vec3 pos;
+  double rad;
+  struct material mat;
+  struct sphere *next;
+};
+
+struct spoint {
+  struct vec3 pos, normal, vref;	/* position, normal and view reflection */
+  double dist;		/* parametric distance of intersection along the ray */
+};
+
+#define ERR_MARGIN		1e-6
+
+#define DOT(a, b)	((a).x * (b).x + (a).y * (b).y + (a).z * (b).z)
+#define NORMALIZE(a)  do {			\
+    double len = sqrt(DOT(a, a));		\
+    (a).x /= len; (a).y /= len; (a).z /= len;	\
+  } while(0);
+
+static struct vec3
+reflect(struct vec3 v, struct vec3 n) {
+  struct vec3 res;
+  double dot = v.x * n.x + v.y * n.y + v.z * n.z;
+  res.x = -(2.0 * dot * n.x - v.x);
+  res.y = -(2.0 * dot * n.y - v.y);
+  res.z = -(2.0 * dot * n.z - v.z);
+  return res;
+}
+
+int ray_sphere(const struct sphere *sph,
+	       struct ray ray, struct spoint *sp) {
+  double a, b, c, d, sqrt_d, t1, t2;
+	
+  a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z);
+  b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
+    2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
+    2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
+  c = SQ(sph->pos.x) + SQ(sph->pos.y) + SQ(sph->pos.z) +
+    SQ(ray.orig.x) + SQ(ray.orig.y) + SQ(ray.orig.z) +
+    2.0 * (-sph->pos.x * ray.orig.x - sph->pos.y * ray.orig.y - sph->pos.z * ray.orig.z) - SQ(sph->rad);
+	
+  if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
+
+  sqrt_d = sqrt(d);
+  t1 = (-b + sqrt_d) / (2.0 * a);
+  t2 = (-b - sqrt_d) / (2.0 * a);
+
+  if((t1 < ERR_MARGIN && t2 < ERR_MARGIN) || (t1 > 1.0 && t2 > 1.0)) return 0;
+
+  if(sp) {
+    if(t1 < ERR_MARGIN) t1 = t2;
+    if(t2 < ERR_MARGIN) t2 = t1;
+    sp->dist = t1 < t2 ? t1 : t2;
+		
+    sp->pos.x = ray.orig.x + ray.dir.x * sp->dist;
+    sp->pos.y = ray.orig.y + ray.dir.y * sp->dist;
+    sp->pos.z = ray.orig.z + ray.dir.z * sp->dist;
+		
+    sp->normal.x = (sp->pos.x - sph->pos.x) / sph->rad;
+    sp->normal.y = (sp->pos.y - sph->pos.y) / sph->rad;
+    sp->normal.z = (sp->pos.z - sph->pos.z) / sph->rad;
+
+    sp->vref = reflect(ray.dir, sp->normal);
+    NORMALIZE(sp->vref);
+  }
+  return 1;
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
new file mode 100644
index 00000000000..fcd3ee8122f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v16hi.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v16qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
new file mode 100644
index 00000000000..6d43788600e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v16qi.c
@@ -0,0 +1,30 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3  -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[0] = x.a[1] + y.a[1];
+  p[1] = x.a[2] + y.a[2];
+  p[2] = x.a[3] + y.a[3];
+  p[3] = x.a[4] + y.a[4];
+  p[4] = x.a[5] + y.a[5];
+  p[5] = x.a[6] + y.a[6];
+  p[6] = x.a[7] + y.a[7];
+  p[7] = x.a[8] + y.a[8];
+  p[8] = x.a[9] + y.a[9];
+  p[9] = x.a[10] + y.a[10];
+  p[10] = x.a[11] + y.a[11];
+  p[11] = x.a[12] + y.a[12];
+  p[12] = x.a[13] + y.a[13];
+  p[13] = x.a[14] + y.a[14];
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
new file mode 100644
index 00000000000..f95b85abbc6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v16sf.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v16qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v16si.c b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
new file mode 100644
index 00000000000..5c48aa5da69
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v16si.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx512f -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(16\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v16qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2df.c b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
new file mode 100644
index 00000000000..9d3f157718c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2df.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) double\>} "slp2" } } */
+
+#define TYPE double
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2di.c b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
new file mode 100644
index 00000000000..c7cf9a71f21
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2di.c
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) long long int\>} "slp2" } } */
+
+typedef long long int64_t;
+#define TYPE int64_t
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
new file mode 100644
index 00000000000..e6024d70780
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2hi.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
new file mode 100644
index 00000000000..cf876cc70d4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2qi.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
new file mode 100644
index 00000000000..eb6349b957e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2sf.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v2si.c b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
new file mode 100644
index 00000000000..ae5fa0749c6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v2si.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt:.*MEM \<vector\(2\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v2qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4df.c b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
new file mode 100644
index 00000000000..94497422704
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4df.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) double\>} "slp2" } } */
+
+#define TYPE double
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4di.c b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
new file mode 100644
index 00000000000..71407aa9fc7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4di.c
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) long long int\>} "slp2" } } */
+
+typedef long long int64_t;
+#define TYPE int64_t
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
new file mode 100644
index 00000000000..4b207b91225
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4hi.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
new file mode 100644
index 00000000000..5292d3442ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4qi.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[12] = x.a[13] + y.a[13];
+  p[13] = x.a[14] + y.a[14];
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
new file mode 100644
index 00000000000..a2c6273120d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4sf.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v4si.c b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
new file mode 100644
index 00000000000..c6824285c74
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v4si.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(4\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v4qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
new file mode 100644
index 00000000000..248c6d0fb91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df-adl.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) double\>} "slp2" } } */
+
+#define TYPE double
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8df.c b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
new file mode 100644
index 00000000000..05eb2dd51d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8df.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) double\>} "slp2" } } */
+
+#define TYPE double
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
new file mode 100644
index 00000000000..b0055d7d2c0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di-adl.c
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx512f -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) long long int\>} "slp2" } } */
+
+typedef long long int64_t;
+#define TYPE int64_t
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8di.c b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
new file mode 100644
index 00000000000..76a393bcc6c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8di.c
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx512f -mtune=generic -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) long long int\>} "slp2" } } */
+
+typedef long long int64_t;
+#define TYPE int64_t
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
new file mode 100644
index 00000000000..28977adae28
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi-adl.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v8qi-adl.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
new file mode 100644
index 00000000000..89b50885366
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8hi.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) short int\>} "slp2" } } */
+
+#define TYPE short
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
new file mode 100644
index 00000000000..be668e5d006
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi-adl.c
@@ -0,0 +1,22 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O3 -march=x86-64 -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[8] = x.a[9] + y.a[9];
+  p[9] = x.a[10] + y.a[10];
+  p[10] = x.a[11] + y.a[11];
+  p[11] = x.a[12] + y.a[12];
+  p[12] = x.a[13] + y.a[13];
+  p[13] = x.a[14] + y.a[14];
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
new file mode 100644
index 00000000000..842c88c8952
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8qi.c
@@ -0,0 +1,22 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O3 -march=x86-64 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) char\>} "slp2" } } */
+
+#ifndef TYPE
+#define TYPE char
+#endif
+
+struct X { TYPE a[128]; };
+
+void __attribute__((noipa))
+foo16 (struct X x, struct X y, TYPE* __restrict p)
+{
+  p[8] = x.a[9] + y.a[9];
+  p[9] = x.a[10] + y.a[10];
+  p[10] = x.a[11] + y.a[11];
+  p[11] = x.a[12] + y.a[12];
+  p[12] = x.a[13] + y.a[13];
+  p[13] = x.a[14] + y.a[14];
+  p[14] = x.a[15] + y.a[15];
+  p[15] = x.a[16] + y.a[16];
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
new file mode 100644
index 00000000000..89d33566a40
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf-adl.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
new file mode 100644
index 00000000000..81557c7b9b7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8sf.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) float\>} "slp2" } } */
+
+#define TYPE float
+#include "pr101908-v8qi.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
new file mode 100644
index 00000000000..883956a0d49
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si-adl.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -mtune=alderlake -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump-not {(?n)add new stmt: vect.*MEM \<vector\(8\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v8qi-adl.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-v8si.c b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
new file mode 100644
index 00000000000..142f46012d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-v8si.c
@@ -0,0 +1,6 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=x86-64 -mavx2 -fdump-tree-slp-details" } */
+/* { dg-final { scan-tree-dump {(?n)add new stmt: vect.*MEM \<vector\(8\) int\>} "slp2" } } */
+
+#define TYPE int
+#include "pr101908-v8qi.c"