[v11,02/12] Implement recording/getting of mask/length for BB SLP

Message ID 20260603151924.53706-3-chris.bazley@arm.com
State New
Series Extend BB SLP vectorization to use predicated tails

Commit Message

Christopher Bazley June 3, 2026, 3:19 p.m. UTC
  Add two new fields to SLP tree nodes, which are accessed as
SLP_TREE_CAN_USE_PARTIAL_VECTORS_P and SLP_TREE_PARTIAL_VECTORS_STYLE.

SLP_TREE_CAN_USE_PARTIAL_VECTORS_P is analogous to the existing
predicate LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P. It is initialized to
true. This flag just records whether the target could vectorize a
node using a partial vector; it does not say anything about
whether the vector actually is partial, or how the target would support
use of a partial vector. Some kinds of node require mask/length for
partial vectors; others don't. In the latter case (e.g., for add
operations), SLP_TREE_CAN_USE_PARTIAL_VECTORS_P will remain true.

SLP_TREE_PARTIAL_VECTORS_STYLE is analogous to the existing field
LOOP_VINFO_PARTIAL_VECTORS_STYLE. Both are initialized to 'none'.
The vect_partial_vectors_avx512 enumerator is not used for BB SLP.
Unlike loop vectorization, a different style of partial vectors can be
chosen for each node during analysis of that node.

Implement the recently-introduced wrapper functions,
vect_record_(len|mask), for BB SLP by setting
SLP_TREE_PARTIAL_VECTORS_STYLE to indicate that a mask or length should
be used for a given SLP node. The passed-in vec_info is ignored.

Implement the vect_fully_(masked|with_length)_p wrapper functions for
BB SLP by checking the SLP_TREE_PARTIAL_VECTORS_STYLE. This should be
sufficient because at most one of vect_record_(len|mask) and
vect_cannot_use_partial_vectors is expected to be called for any
given SLP node. SLP_TREE_CAN_USE_PARTIAL_VECTORS_P should be true if
the style is not 'none', but its value isn't used beyond the analysis
phase.
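
As an illustration only, the per-node bookkeeping described above can be
modelled outside GCC roughly as follows. This is a minimal standalone
sketch, not the patch itself: slp_node_model, record_mask, record_len and
the style_* enumerators are hypothetical stand-ins for _slp_tree,
vect_slp_record_bb_mask, vect_slp_record_bb_len and
vect_partial_vector_style, and plain assert stands in for gcc_assert.

```cpp
#include <cassert>

// Standalone model only: these names mirror the patch but are not GCC code.
enum partial_vector_style { style_none, style_while_ult, style_avx512, style_len };

struct slp_node_model {
  partial_vector_style style = style_none;  // initialized to 'none'
  bool can_use_partial_vectors = true;      // initialized to true
  int num_partial_vectors = 0;              // used for costing only
};

// Set the style once; any later call must agree with it, so a node can
// never flip-flop between mask and length.
inline void record_style(slp_node_model &node, partial_vector_style style) {
  assert(style != style_none && style != style_avx512);
  if (node.style == style_none)
    node.style = style;
  else
    assert(node.style == style);
}

inline void record_mask(slp_node_model &node) {
  record_style(node, style_while_ult);
  ++node.num_partial_vectors;  // worst-case count, may overestimate
}

inline void record_len(slp_node_model &node) {
  record_style(node, style_len);
  ++node.num_partial_vectors;  // worst-case count, may overestimate
}

// The "fully masked" and "fully with length" queries only inspect the style.
inline bool fully_masked_p(const slp_node_model &node) {
  return node.style == style_while_ult;
}

inline bool fully_with_length_p(const slp_node_model &node) {
  return node.style == style_len;
}
```

In this model a node that never records a mask or length answers false to
both queries, which matches the 'none' default described above.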

The implementations of vect_get_mask and vect_get_len for BB SLP are
non-trivial (albeit simpler than for loop vectorization), therefore they
are delegated to SLP-specific functions defined in tree-vect-slp.cc.

Implement the vect_cannot_use_partial_vectors wrapper function by
setting the SLP_TREE_CAN_USE_PARTIAL_VECTORS_P flag to false.
To prevent regressions, vect_can_use_partial_vectors_p still returns
false for BB SLP regardless (for now). This prevents vect_record_mask
or vect_record_len from being called.

gcc/ChangeLog:

	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize new
	partial_vector_style, can_use_partial_vectors and
	num_partial_vectors members.
	(vect_slp_analyze_node_operations): Account for worst-case
	prologue costs of per-node partial-vector mask or length
	materialisation.
	(vect_slp_record_bb_style): Set the partial vector style of an
	SLP node, checking that the style does not flip-flop between mask
	and length.
	(vect_slp_record_bb_mask): Use vect_slp_record_bb_style to set
	the partial vector style of the SLP tree node to
	vect_partial_vectors_while_ult.
	(vect_slp_get_bb_mask): New function to materialize a mask for
	basic block SLP vectorization.
	(vect_slp_record_bb_len): Use vect_slp_record_bb_style to set
	the partial vector style of the SLP tree node to
	vect_partial_vectors_len.
	(vect_slp_get_bb_len): New function to materialize a length for
	basic block SLP vectorization.
	* tree-vect-stmts.cc (vectorizable_internal_function):
	(vect_record_mask): Handle the basic block SLP use case by
	delegating to vect_slp_record_bb_mask.
	(vect_get_mask): Handle the basic block SLP use case by
	delegating to vect_slp_get_bb_mask.
	(vect_record_len): Handle the basic block SLP use case by
	delegating to vect_slp_record_bb_len.
	(vect_get_len): Handle the basic block SLP use case by
	delegating to vect_slp_get_bb_len.
	(vect_gen_while_ssa_name): New function containing code
	refactored out of vect_gen_while for reuse by
	vect_slp_get_bb_mask.
	(vect_gen_while): Use vect_gen_while_ssa_name instead of custom
	code for some of the implementation.
	* tree-vectorizer.h (enum vect_partial_vector_style): Move this
	definition earlier to allow reuse by struct _slp_tree.
	(struct _slp_tree): Add a partial_vector_style member to record
	whether to use a length or mask for the SLP tree node, if
	partial vectors are required and supported.
	Add a can_use_partial_vectors member to record whether partial
	vectors are supported for the SLP tree node.
	Add a num_partial_vectors member for costing.
	(SLP_TREE_PARTIAL_VECTORS_STYLE): New member accessor macro.
	(SLP_TREE_CAN_USE_PARTIAL_VECTORS_P): New member accessor macro.
	(SLP_TREE_NUM_PARTIAL_VECTORS): New member accessor macro.
	(vect_gen_while_ssa_name): Declaration of a new function.
	(vect_slp_get_bb_mask): As above.
	(vect_slp_get_bb_len): As above.
	(vect_cannot_use_partial_vectors): Handle the basic block SLP
	use-case by setting SLP_TREE_CAN_USE_PARTIAL_VECTORS_P to
	false.
	(vect_fully_with_length_p): Handle the basic block SLP use
	case by checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
	vect_partial_vectors_len.
	(vect_fully_masked_p): Handle the basic block SLP use case by
	checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
	vect_partial_vectors_while_ult.
---
 gcc/tree-vect-slp.cc   | 182 +++++++++++++++++++++++++++++++++++++++++
 gcc/tree-vect-stmts.cc |  52 +++++++-----
 gcc/tree-vectorizer.h  |  52 ++++++++----
 3 files changed, 247 insertions(+), 39 deletions(-)
  

Comments

Hongtao Liu June 9, 2026, 6:38 a.m. UTC | #1
On Wed, Jun 3, 2026 at 11:25 PM Christopher Bazley <chris.bazley@arm.com> wrote:
>
> Add two new fields to SLP tree nodes, which are accessed as
> SLP_TREE_CAN_USE_PARTIAL_VECTORS_P and SLP_TREE_PARTIAL_VECTORS_STYLE.
>
> SLP_TREE_CAN_USE_PARTIAL_VECTORS_P is analogous to the existing
> predicate LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P. It is initialized to
> true. This flag just records whether the target could vectorize a
> node using a partial vector; it does not say anything about
> whether the vector actually is partial, or how the target would support
> use of a partial vector. Some kinds of node require mask/length for
> partial vectors; others don't. In the latter case (e.g., for add
> operations), SLP_TREE_CAN_USE_PARTIAL_VECTORS_P will remain true.
>
> SLP_TREE_PARTIAL_VECTORS_STYLE is analogous to the existing field
> LOOP_VINFO_PARTIAL_VECTORS_STYLE. Both are initialized to 'none'.
> The vect_partial_vectors_avx512 enumerator is not used for BB SLP.
> Unlike loop vectorization, a different style of partial vectors can be
> chosen for each node during analysis of that node.
>
> Implement the recently-introduced wrapper functions,
> vect_record_(len|mask), for BB SLP by setting
> SLP_TREE_PARTIAL_VECTORS_STYLE to indicate that a mask or length should
> be used for a given SLP node. The passed-in vec_info is ignored.
>
> Implement the vect_fully_(masked|with_length)_p wrapper functions for
> BB SLP by checking the SLP_TREE_PARTIAL_VECTORS_STYLE. This should be
> sufficient because at most one of vect_record_(len|mask) and
> vect_cannot_use_partial_vectors are expected to be called for any
> given SLP node. SLP_TREE_CAN_USE_PARTIAL_VECTORS_P should be true if
> the style is not 'none', but its value isn't used beyond the analysis
> phase.
>
> The implementations of vect_get_mask and vect_get_len for BB SLP are
> non-trivial (albeit simpler than for loop vectorization), therefore they
> are delegated to SLP-specific functions defined in tree-vect-slp.cc.
>
> Implement the vect_cannot_use_partial_vectors wrapper function by
> setting the SLP_TREE_CAN_USE_PARTIAL_VECTORS_P flag to false.
> To prevent regressions, vect_can_use_partial_vectors_p still returns
> false for BB SLP regardless (for now). This prevents vect_record_mask
> or vect_record_len from being called.
>
> gcc/ChangeLog:
>
>         * tree-vect-slp.cc (_slp_tree::_slp_tree): initialize new
>         partial_vector_style, can_use_partial_vectors and
>         num_partial_vectors members.
>         (vect_slp_analyze_node_operations): Account for worst-case
>         prologue costs of per-node partial-vector mask or length
>         materialisation.
>         (vect_slp_record_bb_style): Set the partial vector style of an
>         SLP node, checking that the style does not flip-flop between mask
>         and length.
>         (vect_slp_record_bb_mask): Use vect_slp_record_bb_style to set
>         the partial vector style of the SLP tree node to
>         vect_partial_vectors_while_ult.
>         (vect_slp_get_bb_mask): New function to materialize a mask for
>         basic block SLP vectorization.
>         (vect_slp_record_bb_len): Use vect_slp_record_bb_style to set
>         the partial vector style of the SLP tree node to
>         vect_partial_vectors_len.
>         (vect_slp_get_bb_len): New function to materialize a length for
>         basic block SLP vectorization.
>         * tree-vect-stmts.cc (vectorizable_internal_function):
>         (vect_record_mask): Handle the basic block SLP use case by
>         delegating to vect_slp_record_bb_mask.
>         (vect_get_mask): Handle the basic block SLP use case by
>         delegating to vect_slp_get_bb_mask.
>         (vect_record_len): Handle the basic block SLP use case by
>         delegating to vect_slp_record_bb_len.
>         (vect_get_len): Handle the basic block SLP use case by
>         delegating to vect_slp_get_bb_len.
>         (vect_gen_while_ssa_name): New function containing code
>         refactored out of vect_gen_while for reuse by
>         vect_slp_get_bb_mask.
>         (vect_gen_while): Use vect_gen_while_ssa_name instead of custom
>         code for some of the implementation.
>         * tree-vectorizer.h (enum vect_partial_vector_style): Move this
>         definition earlier to allow reuse by struct _slp_tree.
>         (struct _slp_tree): Add a partial_vector_style member to record
>         whether to use a length or mask for the SLP tree node, if
>         partial vectors are required and supported.
>         Add a can_use_partial_vectors member to record whether partial
>         vectors are supported for the SLP tree node.
>         Add a num_partial_vectors member for costing.
>         (SLP_TREE_PARTIAL_VECTORS_STYLE): New member accessor macro.
>         (SLP_TREE_CAN_USE_PARTIAL_VECTORS_P): New member accessor macro.
>         (SLP_TREE_NUM_PARTIAL_VECTORS): New member accessor macro.
>         (vect_gen_while_ssa_name): Declaration of a new function.
>         (vect_slp_get_bb_mask): As above.
>         (vect_slp_get_bb_len): As above.
>         (vect_cannot_use_partial_vectors): Handle the basic block SLP
>         use-case by setting SLP_TREE_CAN_USE_PARTIAL_VECTORS_P to
>         false.
>         (vect_fully_with_length_p): Handle the basic block SLP use
>         case by checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
>         vect_partial_vectors_len.
>         (vect_fully_masked_p): Handle the basic block SLP use case by
>         checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
>         vect_partial_vectors_while_ult.
> ---
>  gcc/tree-vect-slp.cc   | 182 +++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-vect-stmts.cc |  52 +++++++-----
>  gcc/tree-vectorizer.h  |  52 ++++++++----
>  3 files changed, 247 insertions(+), 39 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 075e93f04a9..4dd7e6e1e21 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -125,6 +125,9 @@ _slp_tree::_slp_tree ()
>    SLP_TREE_GS_BASE (this) = NULL_TREE;
>    this->ldst_lanes = false;
>    this->avoid_stlf_fail = false;
> +  SLP_TREE_PARTIAL_VECTORS_STYLE (this) = vect_partial_vectors_none;
> +  SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (this) = true;
> +  SLP_TREE_NUM_PARTIAL_VECTORS (this) = 0;
>    SLP_TREE_VECTYPE (this) = NULL_TREE;
>    SLP_TREE_REPRESENTATIVE (this) = NULL;
>    this->cycle_info.id = -1;
> @@ -8958,6 +8961,40 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
>           vect_prologue_cost_for_slp (vinfo, child, cost_vec);
>         }
>
> +  if (res)
> +    {
> +      /* Take care of special costs for partial vectors.
> +        Costing each partial vector is excessive for many SLP instances,
> +        because it is common to materialise identical masks/lengths for related
> +        operations (e.g., for vector loads and stores of the same length).
> +        Masks/lengths can also be shared between SLP subgraphs or eliminated by
> +        pattern-based lowering during instruction selection.  However, it's
> +        simpler and safer to use the worst-case cost; if this ends up being the
> +        tie-breaker between vectorizing or not, then it's probably better not
> +        to vectorize.  */
> +      const int num_partial_vectors = SLP_TREE_NUM_PARTIAL_VECTORS (node);
> +
> +      if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
> +         == vect_partial_vectors_while_ult)
> +       {
> +         gcc_assert (num_partial_vectors > 0);
> +         record_stmt_cost (cost_vec, num_partial_vectors, vector_stmt, NULL,
> +                           NULL, NULL_TREE, 0, vect_prologue);
> +       }
> +      else if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
> +              == vect_partial_vectors_len)
> +       {
> +         /* Need to set up a length in the prologue.  */
> +         gcc_assert (num_partial_vectors > 0);
> +         record_stmt_cost (cost_vec, num_partial_vectors, scalar_stmt, NULL,
> +                           NULL, NULL_TREE, 0, vect_prologue);
> +       }
> +      else
> +       {
> +         gcc_assert (num_partial_vectors == 0);
> +       }
> +    }
> +
>    /* If this node or any of its children can't be vectorized, try pruning
>       the tree here rather than felling the whole thing.  */
>    if (!res && vect_slp_convert_to_external (vinfo, node, node_instance))
> @@ -12441,3 +12478,148 @@ vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances)
>          }
>      }
>  }
> +
> +/* Record that a specific partial vector style could be used to vectorize
> +   SLP_NODE if required.  */
> +
> +static void
> +vect_slp_record_bb_style (slp_tree slp_node, vect_partial_vector_style style)
> +{
> +  gcc_assert (style != vect_partial_vectors_none);
> +  gcc_assert (style != vect_partial_vectors_avx512);
> +
> +  if (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == vect_partial_vectors_none)
> +    SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) = style;
> +  else
> +    gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == style);
> +}
> +
> +/* Record that a complete set of masks associated with SLP_NODE would need to
> +   contain a sequence of NVECTORS masks that each control a vector of type
> +   VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> +   these vector masks with the vector version of SCALAR_MASK.  */
> +void
> +vect_slp_record_bb_mask (slp_tree slp_node, unsigned int /* nvectors */,
> +                        tree /* vectype */, tree /* scalar_mask */)
> +{
> +  vect_slp_record_bb_style (slp_node, vect_partial_vectors_while_ult);
> +
> +  /* FORNOW: this often overestimates the number of masks for costing purposes
> +     because, after lowering, masks have often been eliminated, shared between
> +     SLP nodes, or even shared between SLP subgraphs.  */
> +  SLP_TREE_NUM_PARTIAL_VECTORS(slp_node) ++;
> +}
> +
> +/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
> +   Insert any set-up statements before GSI.  */
> +
> +tree
> +vect_slp_get_bb_mask (slp_tree slp_node, gimple_stmt_iterator *gsi,
> +                     unsigned int nvectors, tree vectype, unsigned int index)
> +{
> +  gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
> +             == vect_partial_vectors_while_ult);
> +  gcc_assert (nvectors >= 1);
> +  gcc_assert (index < nvectors);
> +
> +  const poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  const unsigned int group_size = SLP_TREE_LANES (slp_node);
> +  unsigned int mask_size = group_size;
> +  const tree masktype = truth_type_for (vectype);
> +
> +  if (nunits.is_constant ())
> +    {
> +      /* Only the last vector can be a partial vector.  */
> +      if (index + 1 < nvectors)
> +       return build_minus_one_cst (masktype);
> +
> +      /* Return a mask for a possibly-partial tail vector. */
> +      const unsigned int const_nunits = nunits.to_constant ();
> +      const unsigned int head_size = (nvectors - 1) * const_nunits;
> +      gcc_assert (head_size <= group_size);
> +      mask_size = group_size - head_size;
> +
> +      if (mask_size == const_nunits)
> +       return build_minus_one_cst (masktype);
> +    }
> +  else
> +    {
> +      /* Return a mask for a single variable-length vector. */
> +      gcc_assert (nvectors == 1);
> +      gcc_assert (known_le (mask_size, nunits));
> +    }
> +
> +  /* FORNOW: don't bother maintaining a set of mask constants to allow
> +     sharing between nodes belonging to the same instance of bb_vec_info
> +     or even within the same SLP subgraph.  */
> +  gimple_seq stmts = NULL;
> +  const tree cmp_type = size_type_node;
> +  const tree start_index = build_zero_cst (cmp_type);
> +  const tree end_index = build_int_cst (cmp_type, mask_size);
> +  const tree mask = make_temp_ssa_name (masktype, NULL, "slp_mask");
> +  vect_gen_while_ssa_name (&stmts, masktype, start_index, end_index, mask);

This is not a review, but I encountered an ICE when compiling for x86 with AVX512:

./gcc/xgcc -B ./gcc -O3 -march=sapphirerapids slp_pred_1.c -S

during GIMPLE pass: slp
slp_pred_1.c: In function ‘f’:
slp_pred_1.c:11:1: internal compiler error: in
vect_gen_while_ssa_name, at tree-vect-stmts.cc:14883
   11 | f (uint8_t *x)
      | ^
0x26038eb internal_error(char const*, ...)
        ../../slp_pred_tail/gcc/diagnostic-global-context.cc:787
0x9e8768 fancy_abort(char const*, int, char const*)
        ../../slp_pred_tail/gcc/diagnostics/context.cc:1813
0x8dca22 vect_gen_while_ssa_name(gimple**, tree_node*, tree_node*,
tree_node*, tree_node*)
        ../../slp_pred_tail/gcc/tree-vect-stmts.cc:14883
0x14f182a vect_slp_get_bb_mask(_slp_tree*, gimple_stmt_iterator*,
unsigned int, tree_node*, unsigned int)
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:12688
0x149cab7 vectorizable_load
        ../../slp_pred_tail/gcc/tree-vect-stmts.cc:11522
0x14ad760 vect_transform_stmt(vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, _slp_tree*, _slp_instance*)
        ../../slp_pred_tail/gcc/tree-vect-stmts.cc:13581
0x14eee89 vect_schedule_slp_node
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:12171
0x15123d1 vect_schedule_slp_node
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:11940
0x15123d1 vect_schedule_scc
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:12418
0x151236a vect_schedule_scc
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:12399
0x151236a vect_schedule_scc
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:12399
0x1512a49 vect_schedule_slp(vec_info*, vec<_slp_instance*, va_heap,
vl_ptr> const&)
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:12563
0x15145af vect_slp_region
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:10445
0x151640b vect_slp_bbs
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:10557
0x15169b4 vect_slp_function(function*)
        ../../slp_pred_tail/gcc/tree-vect-slp.cc:10679
0x1521ad2 execute
        ../../slp_pred_tail/gcc/tree-vectorizer.cc:1570

The patch materializes BB SLP tail masks with IFN_WHILE_ULT, which x86
does not support, so the assertion in vect_gen_while_ssa_name fails.
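
For illustration, when the number of units is constant the lane
arithmetic in vect_slp_get_bb_mask already yields a compile-time value,
so a constant mask could stand in for WHILE_ULT. The following is a
minimal standalone sketch under that assumption. tail_lanes and
constant_mask are hypothetical helpers, the mask is modelled as a
std::vector<bool> rather than a GIMPLE vector constant, and the code is
not taken from any tested change.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not a GCC function.  Number of active lanes in
// vector INDEX of NVECTORS when GROUP_SIZE scalar lanes are packed into
// vectors of NUNITS lanes each.  Only the last vector can be partial.
inline unsigned tail_lanes(unsigned group_size, unsigned nunits,
                           unsigned nvectors, unsigned index) {
  assert(nvectors >= 1 && index < nvectors);
  if (index + 1 < nvectors)
    return nunits;
  const unsigned head_size = (nvectors - 1) * nunits;
  assert(head_size <= group_size);
  const unsigned tail = group_size - head_size;
  assert(tail <= nunits);  // extra sanity check added for this sketch
  return tail;
}

// Hypothetical helper: a constant mask with the first ACTIVE lanes set,
// which is the value WHILE_ULT (0, ACTIVE) would produce.
inline std::vector<bool> constant_mask(unsigned nunits, unsigned active) {
  assert(active <= nunits);
  std::vector<bool> mask(nunits, false);
  for (unsigned i = 0; i < active; ++i)
    mask[i] = true;
  return mask;
}
```

With 16 byte lanes in one 32-lane vector, tail_lanes gives 16 and the
resulting mask has its low 16 lanes set and its high 16 lanes clear.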


After manually switching to a constant mask for AVX512, I ran into a
separate performance issue.
If I change slp_pred_1.c to:
void
f (uint8_t *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
  x[6] += 1;
  x[7] += 2;
  x[8] += 1;
  x[9] += 2;
  x[10] += 1;
  x[11] += 2;
  x[12] += 1;
  x[13] += 2;
  x[14] += 1;
  x[15] += 4;
}
then with -march=sapphirerapids -O3 it generates:

 <bb 2> [local count: 1073741824]:
 vectp.4_51 = x_34(D);
 vect__1.5_52 = .MASK_LOAD (vectp.4_51, 8B, { -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0 }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 });
 vect__2.6_53 = vect__1.5_52 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 4, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,
4 };
 _1 = *x_34(D);

A 128-bit vector without a mask should be used here, rather than a
256-bit vector with the upper 128 bits masked off:

 <bb 2> [local count: 1073741824]:
 vectp.4_51 = x_34(D);
 vect__1.5_52 = MEM <vector(16) unsigned char> [(uint8_t *)vectp.4_51];
 vect__2.6_53 = vect__1.5_52 + { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 4 };
 _1 = *x_34(D);
 _2 = _1 + 1;
 _3 = MEM[(uint8_t *)x_34(D) + 1B];

Similarly, for the original slp-pred-1.c, a 128-bit vector with a mask
should be used instead of a 256-bit vector.
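
The selection being asked for could be sketched as picking the narrowest
vector width that still holds every lane, and masking only when lanes are
left over. This is a minimal standalone sketch: smallest_vector_bits and
needs_mask are hypothetical helpers, and passing the candidate widths as
a list is an assumption made for illustration, not how GCC queries the
target's vector modes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not a GCC function.  Return the smallest width in
// bits from WIDTHS (assumed sorted ascending) that can hold LANES
// elements of ELEM_BITS bits each, or 0 if none is wide enough.
inline unsigned smallest_vector_bits(unsigned lanes, unsigned elem_bits,
                                     const std::vector<unsigned> &widths) {
  for (unsigned width : widths)
    if (width >= lanes * elem_bits)
      return width;
  return 0;
}

// Hypothetical helper: a mask (or length) is only needed when the lanes
// do not fill the chosen vector exactly.
inline bool needs_mask(unsigned lanes, unsigned elem_bits,
                       unsigned vector_bits) {
  return lanes * elem_bits < vector_bits;
}
```

For the 16 uint8_t lanes in the example above, this picks 128 bits with
no mask, whereas a 256-bit vector would need one.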

> +  gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
> +  return mask;
> +}
> +
> +/* Record that a complete set of lengths associated with SLP_NODE would need to
> +   contain a sequence of NVECTORS lengths for controlling an operation on
> +   VECTYPE.  The operation splits each element of VECTYPE into FACTOR separate
> +   subelements, measuring the length as a number of these subelements.  */
> +
> +void
> +vect_slp_record_bb_len (slp_tree slp_node, unsigned int /* nvectors */,
> +                       tree /* vectype */, unsigned int /* factor */)
> +{
> +  vect_slp_record_bb_style (slp_node, vect_partial_vectors_len);
> +
> +  /* FORNOW: this probably overestimates the number of lengths for costing
> +     purposes because, after lowering, lengths might have been eliminated,
> +     shared between SLP nodes, or even shared between SLP subgraphs.  */
> +  SLP_TREE_NUM_PARTIAL_VECTORS (slp_node)++;
> +}
> +
> +/* Materialize length number INDEX for a group of scalar stmts in SLP_NODE that
> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
> +   Return a value that contains FACTOR multiplied by the number of elements that
> +   should be processed.  */
> +
> +tree
> +vect_slp_get_bb_len (slp_tree slp_node, unsigned int nvectors, tree vectype,
> +                    unsigned int index, unsigned int factor, bool adjusted)
> +{
> +  gcc_checking_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
> +                      == vect_partial_vectors_len);
> +  gcc_assert (nvectors >= 1);
> +  gcc_assert (index < nvectors);
> +  (void) adjusted;
> +
> +  const poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  const unsigned int group_size = SLP_TREE_LANES (slp_node);
> +  unsigned int len = group_size;
> +
> +  if (nunits.is_constant ())
> +    {
> +      const unsigned int const_nunits = nunits.to_constant ();
> +
> +      /* Only the last vector can be a partial vector.  */
> +      if (index + 1 < nvectors)
> +       len = const_nunits;
> +      else
> +       {
> +         /* Return a length for a possibly-partial tail vector. */
> +         const unsigned int head_size = (nvectors - 1) * const_nunits;
> +         gcc_assert (head_size <= group_size);
> +         len = group_size - head_size;
> +       }
> +    }
> +  else
> +    {
> +      /* Return a length for a single variable-length vector. */
> +      gcc_assert (nvectors == 1);
> +      gcc_assert (known_le (len, nunits));
> +    }
> +
> +  return size_int (len * factor);
> +}
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 15fca17a407..ecad74e7cbf 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1385,7 +1385,9 @@ vectorizable_internal_function (combined_fn cfn, tree fndecl,
>  /* Record that a complete set of masks associated with VINFO would need to
>     contain a sequence of NVECTORS masks that each control a vector of type
>     VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> -   these vector masks with the vector version of SCALAR_MASK.  */
> +   these vector masks with the vector version of SCALAR_MASK.  Alternatively,
> +   if doing basic block vectorization, record that a mask could be used to
> +   vectorize SLP_NODE if required.  */
>  static void
>  vect_record_mask (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
>                   tree vectype, tree scalar_mask)
> @@ -1395,7 +1397,7 @@ vect_record_mask (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
>      vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo), nvectors,
>                            vectype, scalar_mask);
>    else
> -    (void) slp_node; /* FORNOW */
> +    vect_slp_record_bb_mask (slp_node, nvectors, vectype, scalar_mask);
>  }
>
>  /* Given a complete set of masks associated with VINFO, extract mask number
> @@ -1413,16 +1415,15 @@ vect_get_mask (vec_info *vinfo, slp_tree slp_node, gimple_stmt_iterator *gsi,
>      return vect_get_loop_mask (loop_vinfo, gsi, &LOOP_VINFO_MASKS (loop_vinfo),
>                                nvectors, vectype, index);
>    else
> -    {
> -      (void) slp_node; /* FORNOW */
> -      return NULL_TREE;
> -    }
> +    return vect_slp_get_bb_mask (slp_node, gsi, nvectors, vectype, index);
>  }
>
>  /* Record that a complete set of lengths associated with VINFO would need to
>     contain a sequence of NVECTORS lengths for controlling an operation on
>     VECTYPE.  The operation splits each element of VECTYPE into FACTOR separate
> -   subelements, measuring the length as a number of these subelements.  */
> +   subelements, measuring the length as a number of these subelements.
> +   Alternatively, if doing basic block vectorization, record that a length limit
> +   could be used to vectorize SLP_NODE if required.  */
>  static void
>  vect_record_len (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
>                  tree vectype, unsigned int factor)
> @@ -1432,7 +1433,7 @@ vect_record_len (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
>      vect_record_loop_len (loop_vinfo, &LOOP_VINFO_LENS (loop_vinfo), nvectors,
>                           vectype, factor);
>    else
> -    (void) slp_node; /* FORNOW */
> +    vect_slp_record_bb_len (slp_node, nvectors, vectype, factor);
>  }
>
>  /* Given a complete set of lengths associated with VINFO, extract length number
> @@ -1453,10 +1454,8 @@ vect_get_len (vec_info *vinfo, slp_tree slp_node, gimple_stmt_iterator *gsi,
>      return vect_get_loop_len (loop_vinfo, gsi, &LOOP_VINFO_LENS (loop_vinfo),
>                               nvectors, vectype, index, factor, adjusted);
>    else
> -    {
> -      (void) slp_node; /* FORNOW */
> -      return NULL_TREE;
> -    }
> +    return vect_slp_get_bb_len (slp_node, nvectors, vectype, index, factor,
> +                               adjusted);
>  }
>
>  static tree permute_vec_elements (vec_info *, tree, tree, tree, stmt_vec_info,
> @@ -14710,24 +14709,35 @@ supportable_indirect_convert_operation (code_helper code,
>     mask[I] is true iff J + START_INDEX < END_INDEX for all J <= I.
>     Add the statements to SEQ.  */
>
> +void
> +vect_gen_while_ssa_name (gimple_seq *seq, tree mask_type, tree start_index,
> +                        tree end_index, tree ssa_name)
> +{
> +  tree cmp_type = TREE_TYPE (start_index);
> +  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT, cmp_type,
> +                                                      mask_type,
> +                                                      OPTIMIZE_FOR_SPEED));
> +  gcall *call
> +    = gimple_build_call_internal (IFN_WHILE_ULT, 3, start_index, end_index,
> +                                 build_zero_cst (mask_type));
> +  gimple_call_set_lhs (call, ssa_name);
> +  gimple_seq_add_stmt (seq, call);
> +}
> +
> +/*  Like vect_gen_while_ssa_name except that it creates a new SSA_NAME node
> +    for type MASK_TYPE defined in the created GIMPLE_CALL statement.  If NAME
> +    is not a null pointer then it is used for the SSA_NAME in dumps.  */
> +
>  tree
>  vect_gen_while (gimple_seq *seq, tree mask_type, tree start_index,
>                 tree end_index, const char *name)
>  {
> -  tree cmp_type = TREE_TYPE (start_index);
> -  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT,
> -                                                      cmp_type, mask_type,
> -                                                      OPTIMIZE_FOR_SPEED));
> -  gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
> -                                           start_index, end_index,
> -                                           build_zero_cst (mask_type));
>    tree tmp;
>    if (name)
>      tmp = make_temp_ssa_name (mask_type, NULL, name);
>    else
>      tmp = make_ssa_name (mask_type);
> -  gimple_call_set_lhs (call, tmp);
> -  gimple_seq_add_stmt (seq, call);
> +  vect_gen_while_ssa_name (seq, mask_type, start_index, end_index, tmp);
>    return tmp;
>  }
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index a3855568b09..f79f04ff8ac 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -312,6 +312,13 @@ struct vect_load_store_data : vect_data {
>    bool subchain_p; // VMAT_STRIDED_SLP and VMAT_GATHER_SCATTER
>  };
>
> +enum vect_partial_vector_style {
> +  vect_partial_vectors_none,
> +  vect_partial_vectors_while_ult,
> +  vect_partial_vectors_avx512,
> +  vect_partial_vectors_len
> +};
> +
>  /* A computation tree of an SLP instance.  Each node corresponds to a group of
>     stmts to be packed in a SIMD stmt.  */
>  struct _slp_tree {
> @@ -377,7 +384,16 @@ struct _slp_tree {
>    /* For BB vect, flag to indicate this load node should be vectorized
>       as to avoid STLF fails because of related stores.  */
>    bool avoid_stlf_fail;
> -
> +  /* The style used for implementing partial vectors if LANES is less than
> +     the minimum number of lanes implied by the VECTYPE.  */
> +  vect_partial_vector_style partial_vector_style;
> +  /* Flag to indicate whether we still have the option of vectorizing this node
> +     using partial vectors (i.e.  using lengths or masks to prevent use of
> +     inactive scalar lanes).  */
> +  bool can_use_partial_vectors;
> +  /* Number of partial vectors, for costing purposes. Should be 0 unless a
> +     partial vector style has been set.  */
> +  int num_partial_vectors;
>    int vertex;
>
>    /* The kind of operation as determined by analysis and optional
> @@ -476,6 +492,9 @@ public:
>  #define SLP_TREE_GS_BASE(S)                     (S)->gs_base
>  #define SLP_TREE_REDUC_IDX(S)                   (S)->cycle_info.reduc_idx
>  #define SLP_TREE_PERMUTE_P(S)                   ((S)->code == VEC_PERM_EXPR)
> +#define SLP_TREE_PARTIAL_VECTORS_STYLE(S)       (S)->partial_vector_style
> +#define SLP_TREE_CAN_USE_PARTIAL_VECTORS_P(S)   (S)->can_use_partial_vectors
> +#define SLP_TREE_NUM_PARTIAL_VECTORS(S)                 (S)->num_partial_vectors
>
>  inline vect_memory_access_type
>  SLP_TREE_MEMORY_ACCESS_TYPE (slp_tree node)
> @@ -486,13 +505,6 @@ SLP_TREE_MEMORY_ACCESS_TYPE (slp_tree node)
>    return VMAT_UNINITIALIZED;
>  }
>
> -enum vect_partial_vector_style {
> -    vect_partial_vectors_none,
> -    vect_partial_vectors_while_ult,
> -    vect_partial_vectors_avx512,
> -    vect_partial_vectors_len
> -};
> -
>  /* Key for map that records association between
>     scalar conditions and corresponding loop mask, and
>     is populated by vect_record_loop_mask.  */
> @@ -2607,6 +2619,7 @@ extern tree vect_gen_perm_mask_checked (tree, const vec_perm_indices &);
>  extern void optimize_mask_stores (class loop*);
>  extern tree vect_gen_while (gimple_seq *, tree, tree, tree,
>                             const char * = nullptr);
> +extern void vect_gen_while_ssa_name (gimple_seq *, tree, tree, tree, tree);
>  extern tree vect_gen_while_not (gimple_seq *, tree, tree, tree);
>  extern opt_result vect_get_vector_types_for_stmt (vec_info *,
>                                                   stmt_vec_info, tree *,
> @@ -2788,7 +2801,14 @@ extern slp_tree vect_create_new_slp_node (unsigned, tree_code);
>  extern void vect_free_slp_tree (slp_tree);
>  extern bool compatible_calls_p (gcall *, gcall *, bool);
>  extern int vect_slp_child_index_for_operand (const stmt_vec_info, int op);
> -
> +extern void vect_slp_record_bb_mask (slp_tree slp_node, unsigned int nvectors,
> +                                    tree vectype, tree scalar_mask);
> +extern tree vect_slp_get_bb_mask (slp_tree, gimple_stmt_iterator *,
> +                                 unsigned int, tree, unsigned int);
> +extern void vect_slp_record_bb_len (slp_tree slp_node, unsigned int nvectors,
> +                                   tree vectype, unsigned int factor);
> +extern tree vect_slp_get_bb_len (slp_tree, unsigned int, tree, unsigned int,
> +                                unsigned int, bool);
>  extern tree prepare_vec_mask (vec_info *, tree, tree, tree,
>                               gimple_stmt_iterator *);
>  extern tree vect_get_mask_load_else (int, tree);
> @@ -2953,7 +2973,7 @@ vect_cannot_use_partial_vectors (vec_info *vinfo, slp_tree slp_node)
>    if (loop_vinfo)
>      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>    else
> -    (void) slp_node; /* FORNOW */
> +    SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (slp_node) = false;
>  }
>
>  /* Return true if VINFO is vectorizer state for loop vectorization, we've
> @@ -2967,10 +2987,8 @@ vect_fully_with_length_p (vec_info *vinfo, slp_tree slp_node)
>    if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
>      return LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
>    else
> -    {
> -      (void) slp_node; /* FORNOW */
> -      return false;
> -    }
> +    return SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
> +          == vect_partial_vectors_len;
>  }
>
>  /* Return true if VINFO is vectorizer state for loop vectorization, we've
> @@ -2984,10 +3002,8 @@ vect_fully_masked_p (vec_info *vinfo, slp_tree slp_node)
>    if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
>      return LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>    else
> -    {
> -      (void) slp_node; /* FORNOW */
> -      return false;
> -    }
> +    return SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
> +          == vect_partial_vectors_while_ult;
>  }
>
>  /* If STMT_INFO describes a reduction, return the vect_reduction_type
> --
> 2.43.0
>
  
Richard Biener June 9, 2026, 9:24 a.m. UTC | #2
On Wed, Jun 3, 2026 at 5:20 PM Christopher Bazley <chris.bazley@arm.com> wrote:
>
> Add two new fields to SLP tree nodes, which are accessed as
> SLP_TREE_CAN_USE_PARTIAL_VECTORS_P and SLP_TREE_PARTIAL_VECTORS_STYLE.
>
> SLP_TREE_CAN_USE_PARTIAL_VECTORS_P is analogous to the existing
> predicate LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P. It is initialized to
> true. This flag just records whether the target could vectorize a
> node using a partial vector; it does not say anything about
> whether the vector actually is partial, or how the target would support
> use of a partial vector. Some kinds of node require mask/length for
> partial vectors; others don't. In the latter case (e.g., for add
> operations), SLP_TREE_CAN_USE_PARTIAL_VECTORS_P will remain true.
>
> SLP_TREE_PARTIAL_VECTORS_STYLE is analogous to the existing field
> LOOP_VINFO_PARTIAL_VECTORS_STYLE. Both are initialized to 'none'.
> The vect_partial_vectors_avx512 enumerator is not used for BB SLP.
> Unlike loop vectorization, a different style of partial vectors can be
> chosen for each node during analysis of that node.
>
> Implement the recently-introduced wrapper functions,
> vect_record_(len|mask), for BB SLP by setting
> SLP_TREE_PARTIAL_VECTORS_STYLE to indicate that a mask or length should
> be used for a given SLP node. The passed-in vec_info is ignored.
>
> Implement the vect_fully_(masked|with_length)_p wrapper functions for
> BB SLP by checking the SLP_TREE_PARTIAL_VECTORS_STYLE. This should be
> sufficient because at most one of vect_record_(len|mask) and
> vect_cannot_use_partial_vectors are expected to be called for any
> given SLP node. SLP_TREE_CAN_USE_PARTIAL_VECTORS_P should be true if
> the style is not 'none', but its value isn't used beyond the analysis
> phase.
>
> The implementations of vect_get_mask and vect_get_len for BB SLP are
> non-trivial (albeit simpler than for loop vectorization), therefore they
> are delegated to SLP-specific functions defined in tree-vect-slp.cc.
>
> Implement the vect_cannot_use_partial_vectors wrapper function by
> setting the SLP_TREE_CAN_USE_PARTIAL_VECTORS_P flag to false.
> To prevent regressions, vect_can_use_partial_vectors_p still returns
> false for BB SLP regardless (for now). This prevents vect_record_mask
> or vect_record_len from being called.
>
> gcc/ChangeLog:
>
>         * tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize new
>         partial_vector_style, can_use_partial_vectors and
>         num_partial_vectors members.
>         (vect_slp_analyze_node_operations): Account for worst-case
>         prologue costs of per-node partial-vector mask or length
>         materialisation.
>         (vect_slp_record_bb_style): Set the partial vector style of an
>         SLP node, checking that the style does not flip-flop between mask
>         and length.
>         (vect_slp_record_bb_mask): Use vect_slp_record_bb_style to set
>         the partial vector style of the SLP tree node to
>         vect_partial_vectors_while_ult.
>         (vect_slp_get_bb_mask): New function to materialize a mask for
>         basic block SLP vectorization.
>         (vect_slp_record_bb_len): Use vect_slp_record_bb_style to set
>         the partial vector style of the SLP tree node to
>         vect_partial_vectors_len.
>         (vect_slp_get_bb_len): New function to materialize a length for
>         basic block SLP vectorization.
>         * tree-vect-stmts.cc (vectorizable_internal_function):
>         (vect_record_mask): Handle the basic block SLP use case by
>         delegating to vect_slp_record_bb_mask.
>         (vect_get_mask): Handle the basic block SLP use case by
>         delegating to vect_slp_get_bb_mask.
>         (vect_record_len): Handle the basic block SLP use case by
>         delegating to vect_slp_record_bb_len.
>         (vect_get_len): Handle the basic block SLP use case by
>         delegating to vect_slp_get_bb_len.
>         (vect_gen_while_ssa_name): New function containing code
>         refactored out of vect_gen_while for reuse by
>         vect_slp_get_bb_mask.
>         (vect_gen_while): Use vect_gen_while_ssa_name instead of custom
>         code for some of the implementation.
>         * tree-vectorizer.h (enum vect_partial_vector_style): Move this
>         definition earlier to allow reuse by struct _slp_tree.
>         (struct _slp_tree): Add a partial_vector_style member to record
>         whether to use a length or mask for the SLP tree node, if
>         partial vectors are required and supported.
>         Add a can_use_partial_vectors member to record whether partial
>         vectors are supported for the SLP tree node.
>         Add a num_partial_vectors member for costing.
>         (SLP_TREE_PARTIAL_VECTORS_STYLE): New member accessor macro.
>         (SLP_TREE_CAN_USE_PARTIAL_VECTORS_P): New member accessor macro.
>         (SLP_TREE_NUM_PARTIAL_VECTORS): New member accessor macro.
>         (vect_gen_while_ssa_name): Declaration of a new function.
>         (vect_slp_get_bb_mask): As above.
>         (vect_slp_get_bb_len): As above.
>         (vect_cannot_use_partial_vectors): Handle the basic block SLP
>         use-case by setting SLP_TREE_CAN_USE_PARTIAL_VECTORS_P to
>         false.
>         (vect_fully_with_length_p): Handle the basic block SLP use
>         case by checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
>         vect_partial_vectors_len.
>         (vect_fully_masked_p): Handle the basic block SLP use case by
>         checking whether the SLP_TREE_PARTIAL_VECTORS_STYLE is
>         vect_partial_vectors_while_ult.
> ---
>  gcc/tree-vect-slp.cc   | 182 +++++++++++++++++++++++++++++++++++++++++
>  gcc/tree-vect-stmts.cc |  52 +++++++-----
>  gcc/tree-vectorizer.h  |  52 ++++++++----
>  3 files changed, 247 insertions(+), 39 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 075e93f04a9..4dd7e6e1e21 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -125,6 +125,9 @@ _slp_tree::_slp_tree ()
>    SLP_TREE_GS_BASE (this) = NULL_TREE;
>    this->ldst_lanes = false;
>    this->avoid_stlf_fail = false;
> +  SLP_TREE_PARTIAL_VECTORS_STYLE (this) = vect_partial_vectors_none;
> +  SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (this) = true;
> +  SLP_TREE_NUM_PARTIAL_VECTORS (this) = 0;
>    SLP_TREE_VECTYPE (this) = NULL_TREE;
>    SLP_TREE_REPRESENTATIVE (this) = NULL;
>    this->cycle_info.id = -1;
> @@ -8958,6 +8961,40 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
>           vect_prologue_cost_for_slp (vinfo, child, cost_vec);
>         }
>
> +  if (res)
> +    {
> +      /* Take care of special costs for partial vectors.
> +        Costing each partial vector is excessive for many SLP instances,
> +        because it is common to materialise identical masks/lengths for related
> +        operations (e.g., for vector loads and stores of the same length).
> +        Masks/lengths can also be shared between SLP subgraphs or eliminated by
> +        pattern-based lowering during instruction selection.  However, it's
> +        simpler and safer to use the worst-case cost; if this ends up being the
> +        tie-breaker between vectorizing or not, then it's probably better not
> +        to vectorize.  */

I'd prefer to do this per SLP subgraph group, based on recorded
requirements, similar to how loop masking is set up.
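
To make the difference concrete, here is a minimal standalone sketch of
per-node versus per-subgraph costing.  It is not GCC code: `style`,
`requirement`, `cost_per_node` and `cost_per_subgraph` are hypothetical names
invented for illustration, and it assumes that two uses with the same style,
vector lane count and active lane count could share one materialised
mask/length.

```cpp
#include <cstddef>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are GCC types.
enum class style { none, mask, len };

struct requirement
{
  style kind;            // whether a mask or a length was recorded
  unsigned nunits;       // lanes in the vector type
  unsigned active_lanes; // lanes that stay active in the tail vector

  bool operator< (const requirement &o) const
  {
    return std::tie (kind, nunits, active_lanes)
	   < std::tie (o.kind, o.nunits, o.active_lanes);
  }
};

// Per-node costing, as in the patch: one prologue stmt per recorded use.
inline std::size_t
cost_per_node (const std::vector<requirement> &recorded)
{
  std::size_t n = 0;
  for (const requirement &r : recorded)
    if (r.kind != style::none)
      ++n;
  return n;
}

// Per-subgraph costing: identical requirements are costed only once.
inline std::size_t
cost_per_subgraph (const std::vector<requirement> &recorded)
{
  std::set<requirement> distinct;
  for (const requirement &r : recorded)
    if (r.kind != style::none)
      distinct.insert (r);
  return distinct.size ();
}
```

With a load and a store that both need the same 3-of-4 lane mask, the
per-node count charges two prologue statements while the per-subgraph count
charges one.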

> +      const int num_partial_vectors = SLP_TREE_NUM_PARTIAL_VECTORS (node);
> +
> +      if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
> +         == vect_partial_vectors_while_ult)
> +       {
> +         gcc_assert (num_partial_vectors > 0);
> +         record_stmt_cost (cost_vec, num_partial_vectors, vector_stmt, NULL,
> +                           NULL, NULL_TREE, 0, vect_prologue);
> +       }
> +      else if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
> +              == vect_partial_vectors_len)
> +       {
> +         /* Need to set up a length in the prologue.  */
> +         gcc_assert (num_partial_vectors > 0);
> +         record_stmt_cost (cost_vec, num_partial_vectors, scalar_stmt, NULL,
> +                           NULL, NULL_TREE, 0, vect_prologue);
> +       }
> +      else
> +       {
> +         gcc_assert (num_partial_vectors == 0);
> +       }
> +    }
> +
>    /* If this node or any of its children can't be vectorized, try pruning
>       the tree here rather than felling the whole thing.  */
>    if (!res && vect_slp_convert_to_external (vinfo, node, node_instance))
> @@ -12441,3 +12478,148 @@ vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances)
>          }
>      }
>  }
> +
> +/* Record that a specific partial vector style could be used to vectorize
> +   SLP_NODE if required.  */
> +
> +static void
> +vect_slp_record_bb_style (slp_tree slp_node, vect_partial_vector_style style)
> +{
> +  gcc_assert (style != vect_partial_vectors_none);
> +  gcc_assert (style != vect_partial_vectors_avx512);
> +
> +  if (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == vect_partial_vectors_none)
> +    SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) = style;
> +  else
> +    gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == style);
> +}
> +
> +/* Record that a complete set of masks associated with SLP_NODE would need to
> +   contain a sequence of NVECTORS masks that each control a vector of type
> +   VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> +   these vector masks with the vector version of SCALAR_MASK.  */
> +void
> +vect_slp_record_bb_mask (slp_tree slp_node, unsigned int /* nvectors */,
> +                        tree /* vectype */, tree /* scalar_mask */)
> +{
> +  vect_slp_record_bb_style (slp_node, vect_partial_vectors_while_ult);
> +
> +  /* FORNOW: this often overestimates the number of masks for costing purposes
> +     because, after lowering, masks have often been eliminated, shared between
> +     SLP nodes, or even shared between SLP subgraphs.  */
> +  SLP_TREE_NUM_PARTIAL_VECTORS(slp_node) ++;
> +}
> +
> +/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
> +   Insert any set-up statements before GSI.  */
> +
> +tree
> +vect_slp_get_bb_mask (slp_tree slp_node, gimple_stmt_iterator *gsi,
> +                     unsigned int nvectors, tree vectype, unsigned int index)
> +{
> +  gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
> +             == vect_partial_vectors_while_ult);
> +  gcc_assert (nvectors >= 1);
> +  gcc_assert (index < nvectors);
> +
> +  const poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  const unsigned int group_size = SLP_TREE_LANES (slp_node);
> +  unsigned int mask_size = group_size;
> +  const tree masktype = truth_type_for (vectype);
> +
> +  if (nunits.is_constant ())
> +    {
> +      /* Only the last vector can be a partial vector.  */
> +      if (index + 1 < nvectors)
> +       return build_minus_one_cst (masktype);
> +
> +      /* Return a mask for a possibly-partial tail vector. */
> +      const unsigned int const_nunits = nunits.to_constant ();
> +      const unsigned int head_size = (nvectors - 1) * const_nunits;
> +      gcc_assert (head_size <= group_size);
> +      mask_size = group_size - head_size;
> +
> +      if (mask_size == const_nunits)
> +       return build_minus_one_cst (masktype);
> +    }
> +  else
> +    {
> +      /* Return a mask for a single variable-length vector. */
> +      gcc_assert (nvectors == 1);
> +      gcc_assert (known_le (mask_size, nunits));
> +    }
> +
> +  /* FORNOW: don't bother maintaining a set of mask constants to allow
> +     sharing between nodes belonging to the same instance of bb_vec_info
> +     or even within the same SLP subgraph.  */

See above.  The loop code should already have everything set up for
caching.  Why not reuse that?
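
As a rough standalone illustration of what sharing could look like (this is
not the loop code's actual mechanism, and `active_lanes` and `mask_cache` are
hypothetical names): the fixed-length tail arithmetic from the hunk above
reduces each request to a (lanes per vector, active lanes) pair, which can
serve as a cache key so that identical masks are materialised once.

```cpp
#include <map>
#include <utility>

// Active lanes in vector INDEX of NVECTORS when GROUP_SIZE scalar lanes are
// spread over vectors of NUNITS lanes.  Fixed-length vectors only; assumes
// (NVECTORS - 1) * NUNITS <= GROUP_SIZE, as the patch asserts.
inline unsigned
active_lanes (unsigned group_size, unsigned nunits, unsigned nvectors,
	      unsigned index)
{
  if (index + 1 < nvectors)
    return nunits;	// Only the last vector can be partial.
  return group_size - (nvectors - 1) * nunits;
}

// Hypothetical cache: the int id stands in for the SSA name of a mask.
struct mask_cache
{
  std::map<std::pair<unsigned, unsigned>, int> ids;
  int next_id = 0;

  // Return the id of the mask for NUNITS lanes with LANES active, creating
  // a new one only if no identical mask has been requested before.
  int get (unsigned nunits, unsigned lanes)
  {
    auto it = ids.find (std::make_pair (nunits, lanes));
    if (it != ids.end ())
      return it->second;
    const int id = next_id++;
    ids.emplace (std::make_pair (nunits, lanes), id);
    return id;
  }
};
```

Two nodes that each end in a 3-of-4 lane tail then receive the same id, which
is the sharing the FORNOW comment defers.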

> +  gimple_seq stmts = NULL;
> +  const tree cmp_type = size_type_node;
> +  const tree start_index = build_zero_cst (cmp_type);
> +  const tree end_index = build_int_cst (cmp_type, mask_size);
> +  const tree mask = make_temp_ssa_name (masktype, NULL, "slp_mask");
> +  vect_gen_while_ssa_name (&stmts, masktype, start_index, end_index, mask);
> +  gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
> +  return mask;
> +}
> +
> +/* Record that a complete set of lengths associated with SLP_NODE would need to
> +   contain a sequence of NVECTORS lengths for controlling an operation on
> +   VECTYPE.  The operation splits each element of VECTYPE into FACTOR separate
> +   subelements, measuring the length as a number of these subelements.  */
> +
> +void
> +vect_slp_record_bb_len (slp_tree slp_node, unsigned int /* nvectors */,
> +                       tree /* vectype */, unsigned int /* factor */)
> +{
> +  vect_slp_record_bb_style (slp_node, vect_partial_vectors_len);
> +
> +  /* FORNOW: this probably overestimates the number of lengths for costing
> +     purposes because, after lowering, lengths might have been eliminated,
> +     shared between SLP nodes, or even shared between SLP subgraphs.  */
> +  SLP_TREE_NUM_PARTIAL_VECTORS (slp_node)++;
> +}
> +
> +/* Materialize length number INDEX for a group of scalar stmts in SLP_NODE that
> +   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
> +   Return a value that contains FACTOR multiplied by the number of elements that
> +   should be processed.  */
> +
> +tree
> +vect_slp_get_bb_len (slp_tree slp_node, unsigned int nvectors, tree vectype,
> +                    unsigned int index, unsigned int factor, bool adjusted)
> +{
> +  gcc_checking_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
> +                      == vect_partial_vectors_len);
> +  gcc_assert (nvectors >= 1);
> +  gcc_assert (index < nvectors);
> +  (void) adjusted;
> +
> +  const poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  const unsigned int group_size = SLP_TREE_LANES (slp_node);
> +  unsigned int len = group_size;
> +
> +  if (nunits.is_constant ())
> +    {
> +      const unsigned int const_nunits = nunits.to_constant ();
> +
> +      /* Only the last vector can be a partial vector.  */
> +      if (index + 1 < nvectors)
> +       len = const_nunits;
> +      else
> +       {
> +         /* Return a length for a possibly-partial tail vector. */
> +         const unsigned int head_size = (nvectors - 1) * const_nunits;
> +         gcc_assert (head_size <= group_size);
> +         len = group_size - head_size;
> +       }
> +    }
> +  else
> +    {
> +      /* Return a length for a single variable-length vector. */
> +      gcc_assert (nvectors == 1);
> +      gcc_assert (known_le (len, nunits));
> +    }
> +
> +  return size_int (len * factor);
> +}
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 15fca17a407..ecad74e7cbf 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1385,7 +1385,9 @@ vectorizable_internal_function (combined_fn cfn, tree fndecl,
>  /* Record that a complete set of masks associated with VINFO would need to
>     contain a sequence of NVECTORS masks that each control a vector of type
>     VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
> -   these vector masks with the vector version of SCALAR_MASK.  */
> +   these vector masks with the vector version of SCALAR_MASK.  Alternatively,
> +   if doing basic block vectorization, record that a mask could be used to
> +   vectorize SLP_NODE if required.  */
>  static void
>  vect_record_mask (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
>                   tree vectype, tree scalar_mask)
> @@ -1395,7 +1397,7 @@ vect_record_mask (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
>      vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo), nvectors,
>                            vectype, scalar_mask);
>    else
> -    (void) slp_node; /* FORNOW */
> +    vect_slp_record_bb_mask (slp_node, nvectors, vectype, scalar_mask);
>  }
>
>  /* Given a complete set of masks associated with VINFO, extract mask number
> @@ -1413,16 +1415,15 @@ vect_get_mask (vec_info *vinfo, slp_tree slp_node, gimple_stmt_iterator *gsi,
>      return vect_get_loop_mask (loop_vinfo, gsi, &LOOP_VINFO_MASKS (loop_vinfo),
>                                nvectors, vectype, index);
>    else
> -    {
> -      (void) slp_node; /* FORNOW */
> -      return NULL_TREE;
> -    }
> +    return vect_slp_get_bb_mask (slp_node, gsi, nvectors, vectype, index);
>  }
>
>  /* Record that a complete set of lengths associated with VINFO would need to
>     contain a sequence of NVECTORS lengths for controlling an operation on
>     VECTYPE.  The operation splits each element of VECTYPE into FACTOR separate
> -   subelements, measuring the length as a number of these subelements.  */
> +   subelements, measuring the length as a number of these subelements.
> +   Alternatively, if doing basic block vectorization, record that a length limit
> +   could be used to vectorize SLP_NODE if required.  */
>  static void
>  vect_record_len (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
>                  tree vectype, unsigned int factor)
> @@ -1432,7 +1433,7 @@ vect_record_len (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
>      vect_record_loop_len (loop_vinfo, &LOOP_VINFO_LENS (loop_vinfo), nvectors,
>                           vectype, factor);
>    else
> -    (void) slp_node; /* FORNOW */
> +    vect_slp_record_bb_len (slp_node, nvectors, vectype, factor);
>  }
>
>  /* Given a complete set of lengths associated with VINFO, extract length number
> @@ -1453,10 +1454,8 @@ vect_get_len (vec_info *vinfo, slp_tree slp_node, gimple_stmt_iterator *gsi,
>      return vect_get_loop_len (loop_vinfo, gsi, &LOOP_VINFO_LENS (loop_vinfo),
>                               nvectors, vectype, index, factor, adjusted);
>    else
> -    {
> -      (void) slp_node; /* FORNOW */
> -      return NULL_TREE;
> -    }
> +    return vect_slp_get_bb_len (slp_node, nvectors, vectype, index, factor,
> +                               adjusted);
>  }
>
>  static tree permute_vec_elements (vec_info *, tree, tree, tree, stmt_vec_info,
> @@ -14710,24 +14709,35 @@ supportable_indirect_convert_operation (code_helper code,
>     mask[I] is true iff J + START_INDEX < END_INDEX for all J <= I.
>     Add the statements to SEQ.  */
>
> +void
> +vect_gen_while_ssa_name (gimple_seq *seq, tree mask_type, tree start_index,
> +                        tree end_index, tree ssa_name)
> +{
> +  tree cmp_type = TREE_TYPE (start_index);
> +  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT, cmp_type,
> +                                                      mask_type,
> +                                                      OPTIMIZE_FOR_SPEED));
> +  gcall *call
> +    = gimple_build_call_internal (IFN_WHILE_ULT, 3, start_index, end_index,
> +                                 build_zero_cst (mask_type));
> +  gimple_call_set_lhs (call, ssa_name);
> +  gimple_seq_add_stmt (seq, call);
> +}
> +
> +/*  Like vect_gen_while_ssa_name except that it creates a new SSA_NAME node
> +    for type MASK_TYPE defined in the created GIMPLE_CALL statement.  If NAME
> +    is not a null pointer then it is used for the SSA_NAME in dumps.  */
> +
>  tree
>  vect_gen_while (gimple_seq *seq, tree mask_type, tree start_index,
>                 tree end_index, const char *name)
>  {
> -  tree cmp_type = TREE_TYPE (start_index);
> -  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT,
> -                                                      cmp_type, mask_type,
> -                                                      OPTIMIZE_FOR_SPEED));
> -  gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
> -                                           start_index, end_index,
> -                                           build_zero_cst (mask_type));
>    tree tmp;
>    if (name)
>      tmp = make_temp_ssa_name (mask_type, NULL, name);
>    else
>      tmp = make_ssa_name (mask_type);
> -  gimple_call_set_lhs (call, tmp);
> -  gimple_seq_add_stmt (seq, call);
> +  vect_gen_while_ssa_name (seq, mask_type, start_index, end_index, tmp);
>    return tmp;
>  }
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index a3855568b09..f79f04ff8ac 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -312,6 +312,13 @@ struct vect_load_store_data : vect_data {
>    bool subchain_p; // VMAT_STRIDED_SLP and VMAT_GATHER_SCATTER
>  };
>
> +enum vect_partial_vector_style {
> +  vect_partial_vectors_none,
> +  vect_partial_vectors_while_ult,
> +  vect_partial_vectors_avx512,
> +  vect_partial_vectors_len
> +};
> +
>  /* A computation tree of an SLP instance.  Each node corresponds to a group of
>     stmts to be packed in a SIMD stmt.  */
>  struct _slp_tree {
> @@ -377,7 +384,16 @@ struct _slp_tree {
>    /* For BB vect, flag to indicate this load node should be vectorized
>       as to avoid STLF fails because of related stores.  */
>    bool avoid_stlf_fail;
> -
> +  /* The style used for implementing partial vectors if LANES is less than
> +     the minimum number of lanes implied by the VECTYPE.  */
> +  vect_partial_vector_style partial_vector_style;

I wonder if we want to / need to mix styles across the SLP subgraph, and
likewise whether we really need to track can_use_partial_vectors per SLP
node as opposed to per subgraph.  I also wonder if we want to deal with the
case of parts of the graph being unsupported because of a lack of masking
support, which we could fix by promoting that part to extern (not covered)
rather than failing the whole subgraph.

That is, I'm questioning (maybe again?) the overall tracking/analysis phase.

> +  /* Flag to indicate whether we still have the option of vectorizing this node
> +     using partial vectors (i.e.  using lengths or masks to prevent use of
> +     inactive scalar lanes).  */
> +  bool can_use_partial_vectors;
> +  /* Number of partial vectors, for costing purposes. Should be 0 unless a
> +     partial vector style has been set.  */
> +  int num_partial_vectors;
>    int vertex;
>
>    /* The kind of operation as determined by analysis and optional
> @@ -476,6 +492,9 @@ public:
>  #define SLP_TREE_GS_BASE(S)                     (S)->gs_base
>  #define SLP_TREE_REDUC_IDX(S)                   (S)->cycle_info.reduc_idx
>  #define SLP_TREE_PERMUTE_P(S)                   ((S)->code == VEC_PERM_EXPR)
> +#define SLP_TREE_PARTIAL_VECTORS_STYLE(S)       (S)->partial_vector_style
> +#define SLP_TREE_CAN_USE_PARTIAL_VECTORS_P(S)   (S)->can_use_partial_vectors
> +#define SLP_TREE_NUM_PARTIAL_VECTORS(S)                 (S)->num_partial_vectors
>
>  inline vect_memory_access_type
>  SLP_TREE_MEMORY_ACCESS_TYPE (slp_tree node)
> @@ -486,13 +505,6 @@ SLP_TREE_MEMORY_ACCESS_TYPE (slp_tree node)
>    return VMAT_UNINITIALIZED;
>  }
>
> -enum vect_partial_vector_style {
> -    vect_partial_vectors_none,
> -    vect_partial_vectors_while_ult,
> -    vect_partial_vectors_avx512,
> -    vect_partial_vectors_len
> -};
> -
>  /* Key for map that records association between
>     scalar conditions and corresponding loop mask, and
>     is populated by vect_record_loop_mask.  */
> @@ -2607,6 +2619,7 @@ extern tree vect_gen_perm_mask_checked (tree, const vec_perm_indices &);
>  extern void optimize_mask_stores (class loop*);
>  extern tree vect_gen_while (gimple_seq *, tree, tree, tree,
>                             const char * = nullptr);
> +extern void vect_gen_while_ssa_name (gimple_seq *, tree, tree, tree, tree);
>  extern tree vect_gen_while_not (gimple_seq *, tree, tree, tree);
>  extern opt_result vect_get_vector_types_for_stmt (vec_info *,
>                                                   stmt_vec_info, tree *,
> @@ -2788,7 +2801,14 @@ extern slp_tree vect_create_new_slp_node (unsigned, tree_code);
>  extern void vect_free_slp_tree (slp_tree);
>  extern bool compatible_calls_p (gcall *, gcall *, bool);
>  extern int vect_slp_child_index_for_operand (const stmt_vec_info, int op);
> -
> +extern void vect_slp_record_bb_mask (slp_tree slp_node, unsigned int nvectors,
> +                                    tree vectype, tree scalar_mask);
> +extern tree vect_slp_get_bb_mask (slp_tree, gimple_stmt_iterator *,
> +                                 unsigned int, tree, unsigned int);
> +extern void vect_slp_record_bb_len (slp_tree slp_node, unsigned int nvectors,
> +                                   tree vectype, unsigned int factor);
> +extern tree vect_slp_get_bb_len (slp_tree, unsigned int, tree, unsigned int,
> +                                unsigned int, bool);
>  extern tree prepare_vec_mask (vec_info *, tree, tree, tree,
>                               gimple_stmt_iterator *);
>  extern tree vect_get_mask_load_else (int, tree);
> @@ -2953,7 +2973,7 @@ vect_cannot_use_partial_vectors (vec_info *vinfo, slp_tree slp_node)
>    if (loop_vinfo)
>      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>    else
> -    (void) slp_node; /* FORNOW */
> +    SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (slp_node) = false;
>  }
>
>  /* Return true if VINFO is vectorizer state for loop vectorization, we've
> @@ -2967,10 +2987,8 @@ vect_fully_with_length_p (vec_info *vinfo, slp_tree slp_node)
>    if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
>      return LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
>    else
> -    {
> -      (void) slp_node; /* FORNOW */
> -      return false;
> -    }
> +    return SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
> +          == vect_partial_vectors_len;
>  }
>
>  /* Return true if VINFO is vectorizer state for loop vectorization, we've
> @@ -2984,10 +3002,8 @@ vect_fully_masked_p (vec_info *vinfo, slp_tree slp_node)
>    if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
>      return LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>    else
> -    {
> -      (void) slp_node; /* FORNOW */
> -      return false;
> -    }
> +    return SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
> +          == vect_partial_vectors_while_ult;
>  }
>
>  /* If STMT_INFO describes a reduction, return the vect_reduction_type
> --
> 2.43.0
>
  

Patch

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 075e93f04a9..4dd7e6e1e21 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -125,6 +125,9 @@  _slp_tree::_slp_tree ()
   SLP_TREE_GS_BASE (this) = NULL_TREE;
   this->ldst_lanes = false;
   this->avoid_stlf_fail = false;
+  SLP_TREE_PARTIAL_VECTORS_STYLE (this) = vect_partial_vectors_none;
+  SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (this) = true;
+  SLP_TREE_NUM_PARTIAL_VECTORS (this) = 0;
   SLP_TREE_VECTYPE (this) = NULL_TREE;
   SLP_TREE_REPRESENTATIVE (this) = NULL;
   this->cycle_info.id = -1;
@@ -8958,6 +8961,40 @@  vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
 	  vect_prologue_cost_for_slp (vinfo, child, cost_vec);
 	}
 
+  if (res)
+    {
+      /* Take care of special costs for partial vectors.
+	 Costing each partial vector is excessive for many SLP instances,
+	 because it is common to materialise identical masks/lengths for related
+	 operations (e.g., for vector loads and stores of the same length).
+	 Masks/lengths can also be shared between SLP subgraphs or eliminated by
+	 pattern-based lowering during instruction selection.  However, it's
+	 simpler and safer to use the worst-case cost; if this ends up being the
+	 tie-breaker between vectorizing or not, then it's probably better not
+	 to vectorize.  */
+      const int num_partial_vectors = SLP_TREE_NUM_PARTIAL_VECTORS (node);
+
+      if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
+	  == vect_partial_vectors_while_ult)
+	{
+	  gcc_assert (num_partial_vectors > 0);
+	  record_stmt_cost (cost_vec, num_partial_vectors, vector_stmt, NULL,
+			    NULL, NULL_TREE, 0, vect_prologue);
+	}
+      else if (SLP_TREE_PARTIAL_VECTORS_STYLE (node)
+	       == vect_partial_vectors_len)
+	{
+	  /* Need to set up a length in the prologue.  */
+	  gcc_assert (num_partial_vectors > 0);
+	  record_stmt_cost (cost_vec, num_partial_vectors, scalar_stmt, NULL,
+			    NULL, NULL_TREE, 0, vect_prologue);
+	}
+      else
+	{
+	  gcc_assert (num_partial_vectors == 0);
+	}
+    }
+
   /* If this node or any of its children can't be vectorized, try pruning
      the tree here rather than felling the whole thing.  */
   if (!res && vect_slp_convert_to_external (vinfo, node, node_instance))
@@ -12441,3 +12478,148 @@  vect_schedule_slp (vec_info *vinfo, const vec<slp_instance> &slp_instances)
         }
     }
 }
+
+/* Record that a specific partial vector style could be used to vectorize
+   SLP_NODE if required.  */
+
+static void
+vect_slp_record_bb_style (slp_tree slp_node, vect_partial_vector_style style)
+{
+  gcc_assert (style != vect_partial_vectors_none);
+  gcc_assert (style != vect_partial_vectors_avx512);
+
+  if (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == vect_partial_vectors_none)
+    SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) = style;
+  else
+    gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node) == style);
+}
+
+/* Record that a complete set of masks associated with SLP_NODE would need to
+   contain a sequence of NVECTORS masks that each control a vector of type
+   VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
+   these vector masks with the vector version of SCALAR_MASK.  */
+void
+vect_slp_record_bb_mask (slp_tree slp_node, unsigned int /* nvectors */,
+			 tree /* vectype */, tree /* scalar_mask */)
+{
+  vect_slp_record_bb_style (slp_node, vect_partial_vectors_while_ult);
+
+  /* FORNOW: this often overestimates the number of masks for costing purposes
+     because, after lowering, masks have often been eliminated, shared between
+     SLP nodes, or even shared between SLP subgraphs.  */
+  SLP_TREE_NUM_PARTIAL_VECTORS (slp_node)++;
+}
+
+/* Materialize mask number INDEX for a group of scalar stmts in SLP_NODE that
+   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
+   Insert any set-up statements before GSI.  */
+
+tree
+vect_slp_get_bb_mask (slp_tree slp_node, gimple_stmt_iterator *gsi,
+		      unsigned int nvectors, tree vectype, unsigned int index)
+{
+  gcc_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
+	      == vect_partial_vectors_while_ult);
+  gcc_assert (nvectors >= 1);
+  gcc_assert (index < nvectors);
+
+  const poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  const unsigned int group_size = SLP_TREE_LANES (slp_node);
+  unsigned int mask_size = group_size;
+  const tree masktype = truth_type_for (vectype);
+
+  if (nunits.is_constant ())
+    {
+      /* Only the last vector can be a partial vector.  */
+      if (index + 1 < nvectors)
+	return build_minus_one_cst (masktype);
+
+      /* Return a mask for a possibly-partial tail vector.  */
+      const unsigned int const_nunits = nunits.to_constant ();
+      const unsigned int head_size = (nvectors - 1) * const_nunits;
+      gcc_assert (head_size <= group_size);
+      mask_size = group_size - head_size;
+
+      if (mask_size == const_nunits)
+	return build_minus_one_cst (masktype);
+    }
+  else
+    {
+      /* Return a mask for a single variable-length vector.  */
+      gcc_assert (nvectors == 1);
+      gcc_assert (known_le (mask_size, nunits));
+    }
+
+  /* FORNOW: don't bother maintaining a set of mask constants to allow
+     sharing between nodes belonging to the same instance of bb_vec_info
+     or even within the same SLP subgraph.  */
+  gimple_seq stmts = NULL;
+  const tree cmp_type = size_type_node;
+  const tree start_index = build_zero_cst (cmp_type);
+  const tree end_index = build_int_cst (cmp_type, mask_size);
+  const tree mask = make_temp_ssa_name (masktype, NULL, "slp_mask");
+  vect_gen_while_ssa_name (&stmts, masktype, start_index, end_index, mask);
+  gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
+  return mask;
+}
+
+/* Record that a length could be used to vectorize SLP_NODE if required.  The
+   NVECTORS, VECTYPE and FACTOR arguments mirror those of
+   vect_record_loop_len but are currently unused; each call counts one
+   partial vector for costing purposes.  */
+
+void
+vect_slp_record_bb_len (slp_tree slp_node, unsigned int /* nvectors */,
+			tree /* vectype */, unsigned int /* factor */)
+{
+  vect_slp_record_bb_style (slp_node, vect_partial_vectors_len);
+
+  /* FORNOW: this probably overestimates the number of lengths for costing
+     purposes because, after lowering, lengths might have been eliminated,
+     shared between SLP nodes, or even shared between SLP subgraphs.  */
+  SLP_TREE_NUM_PARTIAL_VECTORS (slp_node)++;
+}
+
+/* Materialize length number INDEX for a group of scalar stmts in SLP_NODE that
+   operate on NVECTORS vectors of type VECTYPE, where 0 <= INDEX < NVECTORS.
+   Return a value that contains FACTOR multiplied by the number of elements that
+   should be processed.  */
+
+tree
+vect_slp_get_bb_len (slp_tree slp_node, unsigned int nvectors, tree vectype,
+		     unsigned int index, unsigned int factor, bool adjusted)
+{
+  gcc_checking_assert (SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
+		       == vect_partial_vectors_len);
+  gcc_assert (nvectors >= 1);
+  gcc_assert (index < nvectors);
+  (void) adjusted;
+
+  const poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  const unsigned int group_size = SLP_TREE_LANES (slp_node);
+  unsigned int len = group_size;
+
+  if (nunits.is_constant ())
+    {
+      const unsigned int const_nunits = nunits.to_constant ();
+
+      /* Only the last vector can be a partial vector.  */
+      if (index + 1 < nvectors)
+	len = const_nunits;
+      else
+	{
+	  /* Return a length for a possibly-partial tail vector.  */
+	  const unsigned int head_size = (nvectors - 1) * const_nunits;
+	  gcc_assert (head_size <= group_size);
+	  len = group_size - head_size;
+	}
+    }
+  else
+    {
+      /* Return a length for a single variable-length vector.  */
+      gcc_assert (nvectors == 1);
+      gcc_assert (known_le (len, nunits));
+    }
+
+  return size_int (len * factor);
+}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 15fca17a407..ecad74e7cbf 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1385,7 +1385,9 @@  vectorizable_internal_function (combined_fn cfn, tree fndecl,
 /* Record that a complete set of masks associated with VINFO would need to
    contain a sequence of NVECTORS masks that each control a vector of type
    VECTYPE.  If SCALAR_MASK is nonnull, the fully-masked loop would AND
-   these vector masks with the vector version of SCALAR_MASK.  */
+   these vector masks with the vector version of SCALAR_MASK.  Alternatively,
+   if doing basic block vectorization, record that a mask could be used to
+   vectorize SLP_NODE if required.  */
 static void
 vect_record_mask (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
 		  tree vectype, tree scalar_mask)
@@ -1395,7 +1397,7 @@  vect_record_mask (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
     vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo), nvectors,
 			   vectype, scalar_mask);
   else
-    (void) slp_node; /* FORNOW */
+    vect_slp_record_bb_mask (slp_node, nvectors, vectype, scalar_mask);
 }
 
 /* Given a complete set of masks associated with VINFO, extract mask number
@@ -1413,16 +1415,15 @@  vect_get_mask (vec_info *vinfo, slp_tree slp_node, gimple_stmt_iterator *gsi,
     return vect_get_loop_mask (loop_vinfo, gsi, &LOOP_VINFO_MASKS (loop_vinfo),
 			       nvectors, vectype, index);
   else
-    {
-      (void) slp_node; /* FORNOW */
-      return NULL_TREE;
-    }
+    return vect_slp_get_bb_mask (slp_node, gsi, nvectors, vectype, index);
 }
 
 /* Record that a complete set of lengths associated with VINFO would need to
    contain a sequence of NVECTORS lengths for controlling an operation on
    VECTYPE.  The operation splits each element of VECTYPE into FACTOR separate
-   subelements, measuring the length as a number of these subelements.  */
+   subelements, measuring the length as a number of these subelements.
+   Alternatively, if doing basic block vectorization, record that a length limit
+   could be used to vectorize SLP_NODE if required.  */
 static void
 vect_record_len (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
 		 tree vectype, unsigned int factor)
@@ -1432,7 +1433,7 @@  vect_record_len (vec_info *vinfo, slp_tree slp_node, unsigned int nvectors,
     vect_record_loop_len (loop_vinfo, &LOOP_VINFO_LENS (loop_vinfo), nvectors,
 			  vectype, factor);
   else
-    (void) slp_node; /* FORNOW */
+    vect_slp_record_bb_len (slp_node, nvectors, vectype, factor);
 }
 
 /* Given a complete set of lengths associated with VINFO, extract length number
@@ -1453,10 +1454,8 @@  vect_get_len (vec_info *vinfo, slp_tree slp_node, gimple_stmt_iterator *gsi,
     return vect_get_loop_len (loop_vinfo, gsi, &LOOP_VINFO_LENS (loop_vinfo),
 			      nvectors, vectype, index, factor, adjusted);
   else
-    {
-      (void) slp_node; /* FORNOW */
-      return NULL_TREE;
-    }
+    return vect_slp_get_bb_len (slp_node, nvectors, vectype, index, factor,
+				adjusted);
 }
 
 static tree permute_vec_elements (vec_info *, tree, tree, tree, stmt_vec_info,
@@ -14710,24 +14709,35 @@  supportable_indirect_convert_operation (code_helper code,
    mask[I] is true iff J + START_INDEX < END_INDEX for all J <= I.
    Add the statements to SEQ.  */
 
+void
+vect_gen_while_ssa_name (gimple_seq *seq, tree mask_type, tree start_index,
+			 tree end_index, tree ssa_name)
+{
+  tree cmp_type = TREE_TYPE (start_index);
+  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT, cmp_type,
+						       mask_type,
+						       OPTIMIZE_FOR_SPEED));
+  gcall *call
+    = gimple_build_call_internal (IFN_WHILE_ULT, 3, start_index, end_index,
+				  build_zero_cst (mask_type));
+  gimple_call_set_lhs (call, ssa_name);
+  gimple_seq_add_stmt (seq, call);
+}
+
+/* Like vect_gen_while_ssa_name except that it creates a new SSA_NAME node
+   for type MASK_TYPE defined in the created GIMPLE_CALL statement.  If NAME
+   is not a null pointer then it is used for the SSA_NAME in dumps.  */
+
 tree
 vect_gen_while (gimple_seq *seq, tree mask_type, tree start_index,
 		tree end_index, const char *name)
 {
-  tree cmp_type = TREE_TYPE (start_index);
-  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT,
-						       cmp_type, mask_type,
-						       OPTIMIZE_FOR_SPEED));
-  gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
-					    start_index, end_index,
-					    build_zero_cst (mask_type));
   tree tmp;
   if (name)
     tmp = make_temp_ssa_name (mask_type, NULL, name);
   else
     tmp = make_ssa_name (mask_type);
-  gimple_call_set_lhs (call, tmp);
-  gimple_seq_add_stmt (seq, call);
+  vect_gen_while_ssa_name (seq, mask_type, start_index, end_index, tmp);
   return tmp;
 }
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index a3855568b09..f79f04ff8ac 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -312,6 +312,13 @@  struct vect_load_store_data : vect_data {
   bool subchain_p; // VMAT_STRIDED_SLP and VMAT_GATHER_SCATTER
 };
 
+enum vect_partial_vector_style {
+  vect_partial_vectors_none,
+  vect_partial_vectors_while_ult,
+  vect_partial_vectors_avx512,
+  vect_partial_vectors_len
+};
+
 /* A computation tree of an SLP instance.  Each node corresponds to a group of
    stmts to be packed in a SIMD stmt.  */
 struct _slp_tree {
@@ -377,7 +384,16 @@  struct _slp_tree {
   /* For BB vect, flag to indicate this load node should be vectorized
      as to avoid STLF fails because of related stores.  */
   bool avoid_stlf_fail;
-
+  /* The style used for implementing partial vectors if LANES is less than
+     the minimum number of lanes implied by the VECTYPE.  */
+  vect_partial_vector_style partial_vector_style;
+  /* Flag to indicate whether we still have the option of vectorizing this node
+     using partial vectors (i.e.  using lengths or masks to prevent use of
+     inactive scalar lanes).  */
+  bool can_use_partial_vectors;
+  /* Number of partial vectors, for costing purposes.  Should be 0 unless a
+     partial vector style has been set.  */
+  int num_partial_vectors;
   int vertex;
 
   /* The kind of operation as determined by analysis and optional
@@ -476,6 +492,9 @@  public:
 #define SLP_TREE_GS_BASE(S)			 (S)->gs_base
 #define SLP_TREE_REDUC_IDX(S)			 (S)->cycle_info.reduc_idx
 #define SLP_TREE_PERMUTE_P(S)			 ((S)->code == VEC_PERM_EXPR)
+#define SLP_TREE_PARTIAL_VECTORS_STYLE(S)	 (S)->partial_vector_style
+#define SLP_TREE_CAN_USE_PARTIAL_VECTORS_P(S)	 (S)->can_use_partial_vectors
+#define SLP_TREE_NUM_PARTIAL_VECTORS(S)		 (S)->num_partial_vectors
 
 inline vect_memory_access_type
 SLP_TREE_MEMORY_ACCESS_TYPE (slp_tree node)
@@ -486,13 +505,6 @@  SLP_TREE_MEMORY_ACCESS_TYPE (slp_tree node)
   return VMAT_UNINITIALIZED;
 }
 
-enum vect_partial_vector_style {
-    vect_partial_vectors_none,
-    vect_partial_vectors_while_ult,
-    vect_partial_vectors_avx512,
-    vect_partial_vectors_len
-};
-
 /* Key for map that records association between
    scalar conditions and corresponding loop mask, and
    is populated by vect_record_loop_mask.  */
@@ -2607,6 +2619,7 @@  extern tree vect_gen_perm_mask_checked (tree, const vec_perm_indices &);
 extern void optimize_mask_stores (class loop*);
 extern tree vect_gen_while (gimple_seq *, tree, tree, tree,
 			    const char * = nullptr);
+extern void vect_gen_while_ssa_name (gimple_seq *, tree, tree, tree, tree);
 extern tree vect_gen_while_not (gimple_seq *, tree, tree, tree);
 extern opt_result vect_get_vector_types_for_stmt (vec_info *,
 						  stmt_vec_info, tree *,
@@ -2788,7 +2801,14 @@  extern slp_tree vect_create_new_slp_node (unsigned, tree_code);
 extern void vect_free_slp_tree (slp_tree);
 extern bool compatible_calls_p (gcall *, gcall *, bool);
 extern int vect_slp_child_index_for_operand (const stmt_vec_info, int op);
-
+extern void vect_slp_record_bb_mask (slp_tree slp_node, unsigned int nvectors,
+				     tree vectype, tree scalar_mask);
+extern tree vect_slp_get_bb_mask (slp_tree, gimple_stmt_iterator *,
+				  unsigned int, tree, unsigned int);
+extern void vect_slp_record_bb_len (slp_tree slp_node, unsigned int nvectors,
+				    tree vectype, unsigned int factor);
+extern tree vect_slp_get_bb_len (slp_tree, unsigned int, tree, unsigned int,
+				 unsigned int, bool);
 extern tree prepare_vec_mask (vec_info *, tree, tree, tree,
 			      gimple_stmt_iterator *);
 extern tree vect_get_mask_load_else (int, tree);
@@ -2953,7 +2973,7 @@  vect_cannot_use_partial_vectors (vec_info *vinfo, slp_tree slp_node)
   if (loop_vinfo)
     LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
   else
-    (void) slp_node; /* FORNOW */
+    SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (slp_node) = false;
 }
 
 /* Return true if VINFO is vectorizer state for loop vectorization, we've
@@ -2967,10 +2987,8 @@  vect_fully_with_length_p (vec_info *vinfo, slp_tree slp_node)
   if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
     return LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
   else
-    {
-      (void) slp_node; /* FORNOW */
-      return false;
-    }
+    return SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
+	   == vect_partial_vectors_len;
 }
 
 /* Return true if VINFO is vectorizer state for loop vectorization, we've
@@ -2984,10 +3002,8 @@  vect_fully_masked_p (vec_info *vinfo, slp_tree slp_node)
   if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
     return LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
   else
-    {
-      (void) slp_node; /* FORNOW */
-      return false;
-    }
+    return SLP_TREE_PARTIAL_VECTORS_STYLE (slp_node)
+	   == vect_partial_vectors_while_ult;
 }
 
 /* If STMT_INFO describes a reduction, return the vect_reduction_type