[2/4] vect: Refit lane-reducing to be normal operation

Message ID LV2PR01MB7839EFE65507BCE03159DDAAF7A72@LV2PR01MB7839.prod.exchangelabs.com
State New
Headers
Series [1/4] vect: Add a unified vect_get_num_copies for slp and non-slp |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 fail Patch failed to apply
linaro-tcwg-bot/tcwg_gcc_build--master-arm fail Patch failed to apply

Commit Message

Feng Xue OS July 13, 2024, 3:47 p.m. UTC
  Vector stmts number of an operation is calculated based on output vectype.
This is over-estimated for lane-reducing operation, which would cause vector
def/use mismatched when we want to support loop reduction mixed with lane-
reducing and normal operations. One solution is to refit lane-reducing
to make it behave like a normal one, by adding new pass-through copies to
fix possible def/use gap. And resultant superfluous statements could be
optimized away after vectorization.  For example:

  int sum = 1;
  for (i)
    {
      sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
    }

  The vector size is 128-bit,vectorization factor is 16.  Reduction
  statements would be transformed as:

  vector<4> int sum_v0 = { 0, 0, 0, 1 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy
    }

  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

Thanks,
Feng
---
gcc/
	* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
	Calculate effective vector stmts number with generic
	vect_get_num_copies.
	(vect_transform_reduction): Insert copies for lane-reducing so as to
	fix over-estimated vector stmts number.
	(vect_transform_cycle_phi): Calculate vector PHI number only based on
	output vectype.
	* tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove
	adjustment on vector stmts number specific to slp reduction.
---
 gcc/tree-vect-loop.cc | 134 +++++++++++++++++++++++++++++++++++-------
 gcc/tree-vect-slp.cc  |  27 +++------
 2 files changed, 121 insertions(+), 40 deletions(-)
  

Comments

Richard Biener July 15, 2024, 2:03 p.m. UTC | #1
On Sat, Jul 13, 2024 at 5:47 PM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
>
> Vector stmts number of an operation is calculated based on output vectype.
> This is over-estimated for lane-reducing operation, which would cause vector
> def/use mismatched when we want to support loop reduction mixed with lane-
> reducing and normal operations. One solution is to refit lane-reducing
> to make it behave like a normal one, by adding new pass-through copies to
> fix possible def/use gap. And resultant superfluous statements could be
> optimized away after vectorization.  For example:
>
>   int sum = 1;
>   for (i)
>     {
>       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>     }
>
>   The vector size is 128-bit,vectorization factor is 16.  Reduction
>   statements would be transformed as:
>
>   vector<4> int sum_v0 = { 0, 0, 0, 1 };
>   vector<4> int sum_v1 = { 0, 0, 0, 0 };
>   vector<4> int sum_v2 = { 0, 0, 0, 0 };
>   vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>   for (i / 16)
>     {
>       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>       sum_v1 = sum_v1;  // copy
>       sum_v2 = sum_v2;  // copy
>       sum_v3 = sum_v3;  // copy
>     }
>
>   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

OK.

Thanks,
Richard.

> Thanks,
> Feng
> ---
> gcc/
>         * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
>         Calculate effective vector stmts number with generic
>         vect_get_num_copies.
>         (vect_transform_reduction): Insert copies for lane-reducing so as to
>         fix over-estimated vector stmts number.
>         (vect_transform_cycle_phi): Calculate vector PHI number only based on
>         output vectype.
>         * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove
>         adjustment on vector stmts number specific to slp reduction.
> ---
>  gcc/tree-vect-loop.cc | 134 +++++++++++++++++++++++++++++++++++-------
>  gcc/tree-vect-slp.cc  |  27 +++------
>  2 files changed, 121 insertions(+), 40 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index a64b5082bd1..5ac83e76975 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -7468,12 +7468,8 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>                         = get_masked_reduction_fn (reduc_fn, vectype_in);
>        vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
>        vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> -      unsigned nvectors;
> -
> -      if (slp_node)
> -       nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> -      else
> -       nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> +      unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node,
> +                                              vectype_in);
>
>        if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS)
>         vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1);
> @@ -8595,12 +8591,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>    stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
>    gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
>    int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
> -  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +
> +  if (!vectype_in)
> +    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
>
>    if (slp_node)
>      {
>        ncopies = 1;
> -      vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> +      vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in);
>      }
>    else
>      {
> @@ -8658,13 +8657,40 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>    bool lane_reducing = lane_reducing_op_p (code);
>    gcc_assert (single_defuse_cycle || lane_reducing);
>
> +  if (lane_reducing)
> +    {
> +      /* The last operand of lane-reducing op is for reduction.  */
> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> +    }
> +
>    /* Create the destination vector  */
>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
>
> +  if (lane_reducing && !slp_node && !single_defuse_cycle)
> +    {
> +      /* Note: there are still vectorizable cases that can not be handled by
> +        single-lane slp.  Probably it would take some time to evolve the
> +        feature to a mature state.  So we have to keep the below non-slp code
> +        path as failsafe for lane-reducing support.  */
> +      gcc_assert (op.num_ops <= 3);
> +      for (unsigned i = 0; i < op.num_ops; i++)
> +       {
> +         unsigned oprnd_ncopies = ncopies;
> +
> +         if ((int) i == reduc_index)
> +           {
> +             tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +             oprnd_ncopies = vect_get_num_copies (loop_vinfo, vectype);
> +           }
> +
> +         vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, oprnd_ncopies,
> +                                        op.ops[i], &vec_oprnds[i]);
> +       }
> +    }
>    /* Get NCOPIES vector definitions for all operands except the reduction
>       definition.  */
> -  if (!cond_fn_p)
> +  else if (!cond_fn_p)
>      {
>        gcc_assert (reduc_index >= 0 && reduc_index <= 2);
>        vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
> @@ -8702,6 +8728,61 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
>                          &vec_oprnds[2]);
>      }
> +  else if (lane_reducing)
> +    {
> +      /* For normal reduction, consistency between vectorized def/use is
> +        naturally ensured when mapping from scalar statement.  But if lane-
> +        reducing op is involved in reduction, thing would become somewhat
> +        complicated in that the op's result and operand for accumulation are
> +        limited to less lanes than other operands, which certainly causes
> +        def/use mismatch on adjacent statements around the op if do not have
> +        any kind of specific adjustment.  One approach is to refit lane-
> +        reducing op in the way of introducing new trivial pass-through copies
> +        to fix possible def/use gap, so as to make it behave like a normal op.
> +        And vector reduction PHIs are always generated to the full extent, no
> +        matter lane-reducing op exists or not.  If some copies or PHIs are
> +        actually superfluous, they would be cleaned up by passes after
> +        vectorization.  An example for single-lane slp is given as below.
> +        Similarly, this handling is applicable for multiple-lane slp as well.
> +
> +          int sum = 1;
> +          for (i)
> +            {
> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> +            }
> +
> +        The vector size is 128-bit,vectorization factor is 16.  Reduction
> +        statements would be transformed as:
> +
> +          vector<4> int sum_v0 = { 0, 0, 0, 1 };
> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> +
> +          for (i / 16)
> +            {
> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> +              sum_v1 = sum_v1;  // copy
> +              sum_v2 = sum_v2;  // copy
> +              sum_v3 = sum_v3;  // copy
> +            }
> +
> +          sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
> +       */
> +      unsigned effec_ncopies = vec_oprnds[0].length ();
> +      unsigned total_ncopies = vec_oprnds[reduc_index].length ();
> +
> +      gcc_assert (effec_ncopies <= total_ncopies);
> +
> +      if (effec_ncopies < total_ncopies)
> +       {
> +         for (unsigned i = 0; i < op.num_ops - 1; i++)
> +           {
> +             gcc_assert (vec_oprnds[i].length () == effec_ncopies);
> +             vec_oprnds[i].safe_grow_cleared (total_ncopies);
> +           }
> +       }
> +    }
>
>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> @@ -8710,7 +8791,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>      {
>        gimple *new_stmt;
>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> -      if (masked_loop_p && !mask_by_cond_expr)
> +      if (!vop[0] || !vop[1])
> +       {
> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> +
> +         /* If could not generate an effective vector statement for current
> +            portion of reduction operand, insert a trivial copy to simply
> +            handle over the operand to other dependent statements.  */
> +         gcc_assert (reduc_vop);
> +
> +         if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME
> +             && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop))
> +           new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
> +         else
> +           {
> +             new_temp = make_ssa_name (vec_dest);
> +             new_stmt = gimple_build_assign (new_temp, reduc_vop);
> +             vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> +                                          gsi);
> +           }
> +       }
> +      else if (masked_loop_p && !mask_by_cond_expr)
>         {
>           /* No conditional ifns have been defined for lane-reducing op
>              yet.  */
> @@ -8810,23 +8911,16 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
>      /* Leave the scalar phi in place.  */
>      return true;
>
> -  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> -  /* For a nested cycle we do not fill the above.  */
> -  if (!vectype_in)
> -    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> -  gcc_assert (vectype_in);
> -
>    if (slp_node)
>      {
> -      /* The size vect_schedule_slp_instance computes is off for us.  */
> -      vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> -                                     * SLP_TREE_LANES (slp_node), vectype_in);
> +      vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>        ncopies = 1;
>      }
>    else
>      {
>        vec_num = 1;
> -      ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> +      ncopies = vect_get_num_copies (loop_vinfo,
> +                                    STMT_VINFO_VECTYPE (stmt_info));
>      }
>
>    /* Check whether we should use a single PHI node and accumulate
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 4dadbc6854d..55ae496cbb2 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -6554,26 +6554,13 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
>  {
>    stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
>
> -  /* Calculate the number of vector statements to be created for the
> -     scalar stmts in this node.  For SLP reductions it is equal to the
> -     number of vector statements in the children (which has already been
> -     calculated by the recursive call).  Otherwise it is the number of
> -     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> -     VF divided by the number of elements in a vector.  */
> -  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> -      && !STMT_VINFO_DATA_REF (stmt_info)
> -      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> -    {
> -      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
> -       if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def)
> -         {
> -           SLP_TREE_NUMBER_OF_VEC_STMTS (node)
> -             = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
> -           break;
> -         }
> -    }
> -  else
> -    SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
> +  /* Calculate the number of vector statements to be created for the scalar
> +     stmts in this node.  It is the number of scalar elements in one scalar
> +     iteration (DR_GROUP_SIZE) multiplied by VF divided by the number of
> +     elements in a vector.  For single-defuse-cycle, lane-reducing op, and
> +     PHI statement that starts reduction comprised of only lane-reducing ops,
> +     the number is more than effective vector statements actually required.  */
> +  SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
>
>    /* Handle purely internal nodes.  */
>    if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> --
> 2.17.1
  

Patch

From 2b9b22f7f1a19816a17086c79e7ec5f7d0298af6 Mon Sep 17 00:00:00 2001
From: Feng Xue <fxue@os.amperecomputing.com>
Date: Tue, 2 Jul 2024 17:12:00 +0800
Subject: [PATCH 2/4] vect: Refit lane-reducing to be normal operation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Vector stmts number of an operation is calculated based on output vectype.
This is over-estimated for lane-reducing operation, which would cause vector
def/use mismatched when we want to support loop reduction mixed with lane-
reducing and normal operations. One solution is to refit lane-reducing
to make it behave like a normal one, by adding new pass-through copies to
fix possible def/use gap. And resultant superfluous statements could be
optimized away after vectorization.  For example:

  int sum = 1;
  for (i)
    {
      sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
    }

  The vector size is 128-bit,vectorization factor is 16.  Reduction
  statements would be transformed as:

  vector<4> int sum_v0 = { 0, 0, 0, 1 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy
    }

  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

2024-07-02 Feng Xue <fxue@os.amperecomputing.com>

gcc/
	* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
	Calculate effective vector stmts number with generic
	vect_get_num_copies.
	(vect_transform_reduction): Insert copies for lane-reducing so as to
	fix over-estimated vector stmts number.
	(vect_transform_cycle_phi): Calculate vector PHI number only based on
	output vectype.
	* tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove
	adjustment on vector stmts number specific to slp reduction.
---
 gcc/tree-vect-loop.cc | 134 +++++++++++++++++++++++++++++++++++-------
 gcc/tree-vect-slp.cc  |  27 +++------
 2 files changed, 121 insertions(+), 40 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a64b5082bd1..5ac83e76975 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7468,12 +7468,8 @@  vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
 			= get_masked_reduction_fn (reduc_fn, vectype_in);
       vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
       vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
-      unsigned nvectors;
-
-      if (slp_node)
-	nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-      else
-	nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
+      unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node,
+					       vectype_in);
 
       if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS)
 	vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1);
@@ -8595,12 +8591,15 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
   stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
   gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
   int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
-  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+
+  if (!vectype_in)
+    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
 
   if (slp_node)
     {
       ncopies = 1;
-      vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+      vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in);
     }
   else
     {
@@ -8658,13 +8657,40 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);
 
+  if (lane_reducing)
+    {
+      /* The last operand of lane-reducing op is for reduction.  */
+      gcc_assert (reduc_index == (int) op.num_ops - 1);
+    }
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
 
+  if (lane_reducing && !slp_node && !single_defuse_cycle)
+    {
+      /* Note: there are still vectorizable cases that can not be handled by
+	 single-lane slp.  Probably it would take some time to evolve the
+	 feature to a mature state.  So we have to keep the below non-slp code
+	 path as failsafe for lane-reducing support.  */
+      gcc_assert (op.num_ops <= 3);
+      for (unsigned i = 0; i < op.num_ops; i++)
+	{
+	  unsigned oprnd_ncopies = ncopies;
+
+	  if ((int) i == reduc_index)
+	    {
+	      tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+	      oprnd_ncopies = vect_get_num_copies (loop_vinfo, vectype);
+	    }
+
+	  vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, oprnd_ncopies,
+					 op.ops[i], &vec_oprnds[i]);
+	}
+    }
   /* Get NCOPIES vector definitions for all operands except the reduction
      definition.  */
-  if (!cond_fn_p)
+  else if (!cond_fn_p)
     {
       gcc_assert (reduc_index >= 0 && reduc_index <= 2);
       vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
@@ -8702,6 +8728,61 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
 			 reduc_index == 2 ? op.ops[2] : NULL_TREE,
 			 &vec_oprnds[2]);
     }
+  else if (lane_reducing)
+    {
+      /* For normal reduction, consistency between vectorized def/use is
+	 naturally ensured when mapping from scalar statement.  But if lane-
+	 reducing op is involved in reduction, thing would become somewhat
+	 complicated in that the op's result and operand for accumulation are
+	 limited to less lanes than other operands, which certainly causes
+	 def/use mismatch on adjacent statements around the op if do not have
+	 any kind of specific adjustment.  One approach is to refit lane-
+	 reducing op in the way of introducing new trivial pass-through copies
+	 to fix possible def/use gap, so as to make it behave like a normal op.
+	 And vector reduction PHIs are always generated to the full extent, no
+	 matter lane-reducing op exists or not.  If some copies or PHIs are
+	 actually superfluous, they would be cleaned up by passes after
+	 vectorization.  An example for single-lane slp is given as below.
+	 Similarly, this handling is applicable for multiple-lane slp as well.
+
+	   int sum = 1;
+	   for (i)
+	     {
+	       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+	     }
+
+	 The vector size is 128-bit,vectorization factor is 16.  Reduction
+	 statements would be transformed as:
+
+	   vector<4> int sum_v0 = { 0, 0, 0, 1 };
+	   vector<4> int sum_v1 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v2 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+	   for (i / 16)
+	     {
+	       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+	       sum_v1 = sum_v1;  // copy
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+	     }
+
+	   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
+	*/
+      unsigned effec_ncopies = vec_oprnds[0].length ();
+      unsigned total_ncopies = vec_oprnds[reduc_index].length ();
+
+      gcc_assert (effec_ncopies <= total_ncopies);
+
+      if (effec_ncopies < total_ncopies)
+	{
+	  for (unsigned i = 0; i < op.num_ops - 1; i++)
+	    {
+	      gcc_assert (vec_oprnds[i].length () == effec_ncopies);
+	      vec_oprnds[i].safe_grow_cleared (total_ncopies);
+	    }
+	}
+    }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8710,7 +8791,27 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+      if (!vop[0] || !vop[1])
+	{
+	  tree reduc_vop = vec_oprnds[reduc_index][i];
+
+	  /* If could not generate an effective vector statement for current
+	     portion of reduction operand, insert a trivial copy to simply
+	     handle over the operand to other dependent statements.  */
+	  gcc_assert (reduc_vop);
+
+	  if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME
+	      && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop))
+	    new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
+	  else
+	    {
+	      new_temp = make_ssa_name (vec_dest);
+	      new_stmt = gimple_build_assign (new_temp, reduc_vop);
+	      vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
+					   gsi);
+	    }
+	}
+      else if (masked_loop_p && !mask_by_cond_expr)
 	{
 	  /* No conditional ifns have been defined for lane-reducing op
 	     yet.  */
@@ -8810,23 +8911,16 @@  vect_transform_cycle_phi (loop_vec_info loop_vinfo,
     /* Leave the scalar phi in place.  */
     return true;
 
-  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
-  /* For a nested cycle we do not fill the above.  */
-  if (!vectype_in)
-    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
-  gcc_assert (vectype_in);
-
   if (slp_node)
     {
-      /* The size vect_schedule_slp_instance computes is off for us.  */
-      vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
-				      * SLP_TREE_LANES (slp_node), vectype_in);
+      vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       ncopies = 1;
     }
   else
     {
       vec_num = 1;
-      ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+      ncopies = vect_get_num_copies (loop_vinfo,
+				     STMT_VINFO_VECTYPE (stmt_info));
     }
 
   /* Check whether we should use a single PHI node and accumulate
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 4dadbc6854d..55ae496cbb2 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6554,26 +6554,13 @@  vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
 {
   stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
 
-  /* Calculate the number of vector statements to be created for the
-     scalar stmts in this node.  For SLP reductions it is equal to the
-     number of vector statements in the children (which has already been
-     calculated by the recursive call).  Otherwise it is the number of
-     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
-     VF divided by the number of elements in a vector.  */
-  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
-      && !STMT_VINFO_DATA_REF (stmt_info)
-      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
-    {
-      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
-	if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def)
-	  {
-	    SLP_TREE_NUMBER_OF_VEC_STMTS (node)
-	      = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
-	    break;
-	  }
-    }
-  else
-    SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
+  /* Calculate the number of vector statements to be created for the scalar
+     stmts in this node.  It is the number of scalar elements in one scalar
+     iteration (DR_GROUP_SIZE) multiplied by VF divided by the number of
+     elements in a vector.  For single-defuse-cycle, lane-reducing op, and
+     PHI statement that starts reduction comprised of only lane-reducing ops,
+     the number is more than effective vector statements actually required.  */
+  SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
 
   /* Handle purely internal nodes.  */
   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
-- 
2.17.1