[v10,09/11] Extend BB SLP vectorization to use predicated tails

Message ID 20260603131548.50668-10-chris.bazley@arm.com
State New
Series Extend BB SLP vectorization to use predicated tails

Checks

Context                                      Check    Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm   success  Build passed

Commit Message

Christopher Bazley June 3, 2026, 1:15 p.m. UTC
  This enables use of a predicate mask or length limit for
vectorization of basic blocks in cases where previously only the
equivalent rolled (i.e. loop) form of some source code would have
been vectorized. Predication is used for groups whose size
is not neatly divisible into vectors of lengths that can be
supported directly by the target.

The initial vector mode for an SLP region is "autodetected" by calling
aarch64_preferred_simd_mode, which prefers SVE modes (e.g. VNx4SI for
int) if they are supported, unless configured otherwise. If at least one
profitable subgraph can be scheduled then GCC does not try to vectorise
the region using any other modes, even though their estimated costs
might otherwise have been lower.

For example, if analysis of a 24-byte group succeeds with vector mode
V16QI (using types vector(16) and vector(8) char) then the estimated
cost of the vectorised code is 11+11=22. If analysis of the same group
succeeds with vector mode VNx16QI (using type vector([16,16]) char for
both subtrees) then the estimated cost is 15+15=30. In both cases, the
estimated vectorised cost would beat the estimated scalar cost of
96+48=144, so vector([16,16]) wins because VNx16QI is tried first.
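
The selection behaviour described above can be sketched as follows. This is
a simplified model, not GCC code: the struct and function names are
hypothetical, and the real vectorizer decides per subgraph rather than from
a single cost per mode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one vector mode and its estimated cost.  */
struct mode_cost
{
  const char *mode;
  unsigned vector_cost;
};

/* Return the index of the first mode, in try order, whose estimated
   vector cost beats SCALAR_COST, or -1 if none does.  Later modes are
   never compared against the winner, even if they would be cheaper.  */
static int
first_profitable_mode (const struct mode_cost *modes, size_t count,
                       unsigned scalar_cost)
{
  assert (modes != NULL || count == 0);
  for (size_t i = 0; i < count; ++i)
    if (modes[i].vector_cost < scalar_cost)
      return (int) i;
  return -1;
}
```

With the costs from the example (VNx16QI at 30, then V16QI at 22, against a
scalar cost of 144), this model picks VNx16QI because it is tried first.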

This is mitigated by the fact that a sequence of GIMPLE stmts such as:

  vectp.14_86 = x_50(D) + 16;
  slp_mask_87 = .WHILE_ULT (0, 8, { 0, ... });
  .MASK_STORE (vectp.14_86, 8B, slp_mask_87, vect__34.12_85);

is lowered to a fixed-length vector store (e.g., str d30, [x0, 16]) if
possible, instead of a more literal interpretation such as:

  add	x0, x0, 16
  ptrue	p7.b, vl8
  st1b	z30.b, p7, [x0]

The vect_record_max_nunits function used during building of an SLP
tree is updated to prevent it returning failure for BB SLP if the
group size is not an integral multiple of the number of lanes in the
vector type; it now allows such cases if the group size is known not
to exceed the number of lanes in the vector type.
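
For fixed-length vector types, the relaxed check can be sketched as below.
The helper names are hypothetical; the patch itself expresses the condition
with maybe_gt and multiple_p so that it also covers variable-length types.

```c
#include <assert.h>
#include <stdbool.h>

/* Old rule: reject any group size that is not a multiple of the
   number of lanes.  */
static bool
rejected_before (unsigned group_size, unsigned nunits)
{
  assert (nunits != 0);
  return group_size % nunits != 0;
}

/* New rule: an indivisible group size is rejected only if it also
   exceeds the number of lanes.  */
static bool
rejected_after (unsigned group_size, unsigned nunits)
{
  assert (nunits != 0);
  return group_size > nunits && group_size % nunits != 0;
}
```

Under this model a group of 3 with a 4-lane vector type is now accepted,
whereas a group of 6 is still rejected.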

Instead of giving up if vect_get_vector_types_for_stmt
fails for the specified group size, vect_build_slp_tree_1
now calls vect_get_vector_types_for_stmt again without
a group size (which defaults to 0) as a fallback.
If this succeeds then the initial failure is treated as a
'soft' failure that results in the group being split.
Consequently, assertions that "For BB vectorization, we
should always have a group size once we've constructed the
SLP tree" were deleted in get_vectype_for_scalar_type and
vect_get_vector_types_for_stmt.
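
The choice of splitting point after a soft failure can be sketched as
follows, mirroring the arithmetic in the vect_build_slp_tree_1 hunk of this
patch. The function name and the standalone matches array are assumptions
made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simulate a mismatch at the point where a group of GROUP_SIZE lanes
   should be split, given MIN_NUNITS, the known minimum number of lanes
   in the fallback vector type.  MATCHES must hold GROUP_SIZE entries.  */
static void
mark_split_point (bool *matches, unsigned group_size, unsigned min_nunits)
{
  assert (matches != NULL && min_nunits != 0);
  if (min_nunits > group_size)
    {
      /* The group is too small to split usefully.  */
      matches[0] = false;
      return;
    }
  /* The mask below is only valid for a power-of-two lane count.  */
  assert ((min_nunits & (min_nunits - 1)) == 0);
  unsigned tail = group_size & (min_nunits - 1);
  if (tail == 0)
    tail = min_nunits;
  memset (&matches[group_size - tail], 0, sizeof (bool) * tail);
}
```

For a group of 6 and a minimum of 4 lanes, the last two entries are cleared,
so the group would be split after the fourth element.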

For BB SLP, vect_analyze_slp_instance previously gave up after
building an SLP tree if it could not prove that the group size was
at least the maximum lane count across all of the vector types in
the SLP tree (which is unprovable for scalable vector types), or
attempted to split the group if it could prove that the group size
was greater than this maximum but not exactly divisible by it
(which is also unprovable for scalable vector types).

This function will now provisionally create a new SLP instance if the
group size definitely does not exceed the minimum number of lanes,
even if the group size otherwise satisfies conditions that would
require a loop to be unrolled (e.g., a group of size 3 that uses a
mixture of V4SI and V8HI types). If the group size lies between the
minimum and maximum number of lanes then vectorization is still
abandoned (e.g., a group of size 3 that uses a mixture of
V2DI and V4SI types).
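
The outcomes described in the two paragraphs above can be summarised, for
fixed-length vector types only, by the sketch below. It is reconstructed
from the prose rather than from the patch's code, and the enum and function
names are hypothetical.

```c
#include <assert.h>

enum bb_slp_outcome
{
  CREATE_INSTANCE, /* Possibly using partial vectors.  */
  SPLIT_GROUP,
  GIVE_UP
};

/* Classify a BB SLP group of GROUP_SIZE scalars, given the minimum and
   maximum number of lanes across the vector types in its SLP tree.  */
static enum bb_slp_outcome
classify_group (unsigned group_size, unsigned min_nunits,
                unsigned max_nunits)
{
  assert (min_nunits != 0 && min_nunits <= max_nunits);
  if (group_size <= min_nunits)
    return CREATE_INSTANCE;
  if (group_size < max_nunits)
    return GIVE_UP;
  if (group_size % max_nunits != 0)
    return SPLIT_GROUP;
  return CREATE_INSTANCE;
}
```

A group of 3 using V4SI and V8HI (4 and 8 lanes) yields CREATE_INSTANCE,
while a group of 3 using V2DI and V4SI (2 and 4 lanes) yields GIVE_UP.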

With BB SLP, there is no need for agreement between different SLP
nodes about whether to use masks or lengths to support partial vectors.
Instead, that decision is made early and per individual SLP node, by
vect_analyze_stmt. If a partial vector is required (i.e. if the number
of subparts in the vector type may be greater than the number of active
lanes for the node) then vect_analyze_stmt now requires
SLP_TREE_CAN_USE_PARTIAL_VECTORS_P to be true; otherwise it clears any
SLP_TREE_PARTIAL_VECTORS_STYLE that could have been set.

The vect_get_num_copies function used during statement analysis
is updated to return early with 1 if a vector type is long enough for
the specified SLP tree node. This avoids an ICE in vect_get_num_vectors,
which cannot cope with SVE vector types.

When checking whether a value that is used outside the vectorized
region can be supported, the vectorizable_live_operation function
calculates which vector contains the result, and which lane of that
vector we need. Previously, this calculation gave the wrong answer
for BB SLP with a variable-length vector type (eventually generating
invalid offsets such as BIT_FIELD_REF <_251, 32, POLY_INT_CST
[96, 128]> to access the third element of a group using type VNx4SI)
because it reused logic intended for loop vectorization, which selects
the 'last' occurrence of a scalar index relative to the group size
(which is a multiple of the vector length). For BB SLP with a
predicate mask, only the first SLP_TREE_LANES elements are well
defined.
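
The invalid offset quoted above can be reproduced with a small model of the
position calculation. The struct below is a hypothetical stand-in for a
two-coefficient poly_uint64 (value c0 + c1 * x), and the function is a
sketch of the calculation before and after the change, not GCC code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a poly_uint64 with two coefficients:
   the value is c0 + c1 * x for some runtime x >= 0.  */
struct poly2
{
  unsigned long long c0, c1;
};

/* Lane position of the live value among the concatenated vectors.
   For loops, count back from the end of the last vector; for BB SLP,
   the position is simply the scalar's index within the group.  */
static struct poly2
live_lane_position (bool loop_p, unsigned num_vec, struct poly2 nunits,
                    unsigned num_scalar, unsigned slp_index)
{
  struct poly2 pos = { slp_index, 0 };
  if (loop_p)
    {
      assert (num_vec * nunits.c0 + slp_index >= num_scalar);
      pos.c0 += num_vec * nunits.c0 - num_scalar;
      pos.c1 += num_vec * nunits.c1;
    }
  return pos;
}
```

For one VNx4SI vector (4 + 4x lanes), a group of 3 and the third element
(index 2), the loop formula gives lane 3 + 4x, i.e. bit offset
[96, 128] for 32-bit elements; the BB SLP formula gives lane 2, i.e. bit
offset 64.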

vect_create_vectorized_promotion_stmts no longer pushes
more stmts than implied by vect_get_num_copies because it could
previously overrun the number of slots allocated for an SLP node
(based on its number of lanes and type). For example, four defs were
pushed for a promotion of V8HI to V2DI (8/2=4) even if only two
lanes of the V8HI were active. Allowing this later caused an ICE in
vectorizable_operation for a parent node, because binary ops
require both operands to be the same length.
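
The mismatch in the V8HI to V2DI example can be illustrated with the sketch
below. Both helpers are hypothetical, and copies_needed is only a
fixed-length approximation of what vect_get_num_copies returns for a BB SLP
node.

```c
#include <assert.h>

/* Number of output vectors produced by promoting one full input vector
   of IN_NUNITS lanes to vectors of OUT_NUNITS lanes.  */
static unsigned
defs_from_full_promotion (unsigned in_nunits, unsigned out_nunits)
{
  assert (out_nunits != 0 && in_nunits % out_nunits == 0);
  return in_nunits / out_nunits;
}

/* Number of output vectors needed to hold LANES active lanes.  */
static unsigned
copies_needed (unsigned lanes, unsigned out_nunits)
{
  assert (out_nunits != 0);
  return (lanes + out_nunits - 1) / out_nunits;
}
```

With two active lanes, a full promotion yields four V2DI defs although only
one is needed to hold the result.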

Since promotion no longer produces redundant definitions,
vectorizable_conversion also had to be modified so that demotion no
longer relies on an even number of defs being produced. If
necessary, it now pushes a single constant zero def.

The whole change is enabled by wiring the wrapper function
vect_can_use_partial_vectors_p to SLP_TREE_CAN_USE_PARTIAL_VECTORS_P
when invoked for BB SLP vectorization.

Update test expectations for gcc.dg/vect/vect-over-widen-*.c,
gcc.target/aarch64/sve/slp_6.c and
gcc.target/aarch64/sve/vec_construct_*.c.

The vec_construct_*.c tests previously expected their output
to use Advanced SIMD instead of SVE despite their use of
vector length agnostic types such as svint16_t and despite
the fact that they are in the aarch64/sve directory. Since
BB SLP can now vectorize these tests using VLA types such
as 'vector([8,8]) char', and because (with one exception) the
resultant code is deemed profitable relative to scalar code,
GCC no longer considers vectorizing using non-VLA types such
as 'vector(8) char' (although the estimated cost with non-VLA
types might have been lower, had it been calculated).
Instruction selection is not the focus of these tests, so
I updated them to expect SVE instead (e.g. st1b instead of str)
and added --param=aarch64-autovec-preference=sve-only to reduce
future churn.

Because the cost model takes into account predicate mask
generation for BB SLP with VLA types, the threshold at which
vectorized code wins against scalar code is higher than
before. The number of elements stored by vec_construct_3.c was
increased just enough to allow for that.

gcc/ChangeLog:

	* tree-vect-loop.cc (vectorizable_live_operation): Simplify the
	calculation of the index of the final result to avoid
	generating invalid polynomial offsets relative to the end of
	variable-length vector types, which is what happens if the code
	for loop vectorization is reused for basic block SLP.
	* tree-vect-slp.cc (vect_record_max_nunits): Allow group sizes that
	are indivisible by the vector length.
	(vect_build_slp_tree_1): In case of failure of
	vect_get_vector_types_for_stmt, try to get fallback vector
	types and continue analysis to allow splitting of groups.
	(vect_build_slp_tree_2): Don't call
	can_duplicate_and_interleave_p when doing basic block SLP
	vectorization.
	(vect_update_slp_min_nunits_for_node): New recursive function.
	Update min_nunits to reflect the minimum number of subparts for
	all of the vector types used by an SLP subgraph.
	(vect_slp_tree_min_nunits): New function. Initialize min_nunits
	then call vect_update_slp_min_nunits_for_node.
	(vect_analyze_slp_instance): For BB SLP vectorization, create
	a new SLP instance if the group size definitely does not exceed
	the minimum number of subparts for all of the vector types used
	in the SLP tree, even if the group size otherwise satisfies
	conditions that would require a loop to be unrolled.
	(vectorizable_slp_permutation_1): Instead of asserting that an
	SLP tree node's number of lanes is compatible with the chosen
	vector width, return a failure indication if incompatible.
	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
	When calculating the number of vectors, get the group size from
	SLP_TREE_LANES instead of a parameter (e.g., DR_GROUP_SIZE) if
	doing BB SLP vectorization. Don't assume it can be divided by
	the number of subparts in the vector type to get a compile-time
	constant.
	(vect_get_data_ptr_increment): Require a parameter of type
	loop_vec_info instead of vec_info *.
	(vect_create_vectorized_promotion_stmts): Require an SLP tree
	node to be passed by the caller, for use by
	vect_get_num_copies.
	Stop pushing more stmts than implied by vect_get_num_copies.
	(vectorizable_conversion): Pass SLP tree node to
	vect_create_vectorized_promotion_stmts.
	Demotion no longer relies on an even number of definitions
	being produced by promotion. If necessary, push a single constant
	zero definition.
	(vectorizable_load): Pass loop_vec_info instead of vec_info *
	when calling vect_get_data_ptr_increment.
	(vect_analyze_stmt): For BB SLP vectorization, check whether
	the group needs partial vectors. If it does then return a
	failure indication if SLP_TREE_CAN_USE_PARTIAL_VECTORS_P was
	cleared by a callee of this function; if it doesn't need
	partial vectors then clear any partial vectors style that might
	have been chosen by callees of this function.
	(get_vectype_for_scalar_type): For BB SLP vectorization, allow
	invocation of this function with a group size of zero even if
	one or more SLP instances have been created.
	If the number of subparts in the natural choice of vector type
	could be greater than the group size then pick a shorter vector
	type only if the target does not support partial vectors.
	(vect_maybe_update_slp_op_vectype): Reject external definitions
	that have a number of lanes not divisible by the number of
	subparts in a vector type naively inferred from the scalar
	type.
	(vect_get_vector_types_for_stmt): Add a new output parameter of
	Boolean type. Set it to true if the statement can't be
	vectorized because it uses a data type that the target doesn't
	support in vector form for a group of the given size, otherwise
	false.
	* tree-vectorizer.h (vect_get_num_copies): Return early with 1
	if a vector type is long enough for the specified SLP tree
	node to avoid an ICE in vect_get_num_vectors.
	(vect_get_vector_types_for_stmt): Update function declaration.
	(vect_can_use_partial_vectors_p): Handle the BB SLP use-case by
	returning the result of SLP_TREE_CAN_USE_PARTIAL_VECTORS_P.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-over-widen-10.c: Update test expectations to
	avoid spurious matching of scan-tree-dump-not pattern.
	* gcc.dg/vect/vect-over-widen-13.c: As above.
	* gcc.dg/vect/vect-over-widen-14.c: As above.
	* gcc.dg/vect/vect-over-widen-17.c: As above.
	* gcc.dg/vect/vect-over-widen-18.c: As above.
	* gcc.dg/vect/vect-over-widen-5.c: As above.
	* gcc.dg/vect/vect-over-widen-6.c: As above.
	* gcc.dg/vect/vect-over-widen-7.c: As above.
	* gcc.dg/vect/vect-over-widen-8.c: As above.
	* gcc.dg/vect/vect-over-widen-9.c: As above.
	* gcc.target/aarch64/sve/vec_construct_1.c:
	  Expect SVE instead of ASIMD instructions and add
	  --param=aarch64-autovec-preference=sve-only to stop
	  flip-flopping.
	* gcc.target/aarch64/sve/vec_construct_2.c: As above.
	* gcc.target/aarch64/sve/vec_construct_3.c:
	  Expect SVE instead of ASIMD instructions and add
	  --param=aarch64-autovec-preference=sve-only to avoid
	  flip-flopping. Increase the number of elements
	  stored to ensure vectorization using SVE is deemed
	  profitable despite predicate mask costs.
	* gcc.target/aarch64/sve/vec_construct_4.c:
	  Expect SVE instead of ASIMD instructions and add
	  --param=aarch64-autovec-preference=sve-only to stop
	  flip-flopping.
	* gcc.target/aarch64/sve/vec_construct_5.c: As above.
---
 .../gcc.dg/vect/vect-over-widen-10.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-13.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-14.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-17.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-18.c          |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c |   2 +-
 gcc/testsuite/gcc.target/aarch64/sve/slp_6.c  |   3 -
 .../gcc.target/aarch64/sve/vec_construct_1.c  |   6 +-
 .../gcc.target/aarch64/sve/vec_construct_2.c  |   4 +-
 .../gcc.target/aarch64/sve/vec_construct_3.c  |  20 +-
 .../gcc.target/aarch64/sve/vec_construct_4.c  |   5 +-
 .../gcc.target/aarch64/sve/vec_construct_5.c  |   6 +-
 gcc/tree-vect-loop.cc                         |  14 +-
 gcc/tree-vect-slp.cc                          | 141 ++++++++++--
 gcc/tree-vect-stmts.cc                        | 202 +++++++++++++-----
 gcc/tree-vectorizer.h                         |  13 +-
 20 files changed, 317 insertions(+), 117 deletions(-)
  

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
index f0140e4ef6d..6efcf739db9 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
@@ -16,5 +16,5 @@ 
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
index 08a65ea5518..720353716cf 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
@@ -48,5 +48,5 @@  main (void)
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* / 2} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* = \(signed char\)} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
index dfa09f5d2ca..f1d5f95c543 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
@@ -15,5 +15,5 @@ 
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* = \(unsigned char\)} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
index 53fcfd0c06c..ac1a0f86727 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
@@ -46,5 +46,5 @@  main (void)
    adopts realign_load scheme.  It requires rs6000_builtin_mask_for_load to
    generate mask whose return type is vector char.  */
 /* { dg-final { scan-tree-dump-not {vector[^\n]*char} "vect" { target vect_hw_misalign } } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
index aa58cd1c957..3ebfaa78270 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
@@ -47,5 +47,5 @@  main (void)
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* |} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* <<} "vect" } } */
 /* { dg-final { scan-tree-dump {vector[^\n]*char} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
index c2ab11a9d32..1d89789a86d 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
@@ -49,5 +49,5 @@  main (void)
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(signed char\)} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
index bda92c965e0..62d5a52587e 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
@@ -13,5 +13,5 @@ 
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
index 1d55e13fb1f..6e09631009a 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
@@ -51,5 +51,5 @@  main (void)
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(signed char\)} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
index 553c0712a79..b6d650beab4 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
@@ -16,5 +16,5 @@ 
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
index 36bfc68e053..e82f8a571da 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
@@ -56,5 +56,5 @@  main (void)
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
 /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(signed char\)} "vect" } } */
-/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
index 44d128477d2..1c9ac15a699 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
@@ -37,9 +37,6 @@  vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
 TEST_ALL (VEC_PERM)
 
 /* These loops can't use SLP.  */
-/* { dg-final { scan-assembler-not {\tld1b\t} } } */
-/* { dg-final { scan-assembler-not {\tld1h\t} } } */
-/* { dg-final { scan-assembler-not {\tld1w\t} } } */
 /* { dg-final { scan-assembler-not {\tld1d\t} } } */
 /* { dg-final { scan-assembler {\tld3b\t} } } */
 /* { dg-final { scan-assembler {\tld3h\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
index 2f8ce6808a9..eea13c28e49 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-slp-vectorize" } */
+/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
 
 /* Test that a group of stores of 8 elements derived from a horizontal
    reduction is vectorized by constructing a vector and storing it.
@@ -30,8 +30,8 @@  foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
   s.h = svaddv_s8 (all, src7);
 }
 
-/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
-/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.h, h[0-9]+\n} 7 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.h, p[0-9]+, \[x[0-9]+\]\n} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
 /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
index 6715118d7b0..2bf537e13e2 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-slp-vectorize" } */
+/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
 
 /* Test that a group of stores of 8 elements derived from the results of calls
    to a function that has only vector parameters and returns a scalar result is
@@ -40,3 +40,5 @@  foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
 
 /* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.b\[[0-9]+\], w[0-9]+\n} } } */
 /* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
+/* { dg-final { scan-assembler-not {\tfmov\th[0-9]+, h[0-9]+\n} } } */
+/* { dg-final { scan-assembler-not {\tinsr\tz[0-9]+\.b, w[0-9]+\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
index 8143d0050ad..ccadaccbcb4 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
@@ -1,7 +1,7 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-slp-vectorize" } */
+/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
 
-/* Test that a group of stores of 8 elements derived from a horizontal
+/* Test that a group of stores of 14 elements derived from a horizontal
    reduction is vectorized by constructing a vector and storing it
    even if the results of the reductions are narrowed.
    Since there are no GPR-to-SIMD register transfers, there is no
@@ -13,12 +13,14 @@ 
 
 struct S
 {
-  char a, b, c, d, e, f, g, h;
+  char a, b, c, d, e, f, g, h, i, j, k, l, m, n;
 } s;
 
 void
 foo (svint16_t src0, svint32_t src1, svint16_t src2, svint32_t src3,
-     svint32_t src4, svint16_t src5, svint32_t src6, svint16_t src7)
+     svint32_t src4, svint16_t src5, svint32_t src6, svint16_t src7,
+     svint16_t src8, svint32_t src9, svint16_t src10, svint32_t src11,
+     svint32_t src12, svint16_t src13)
 {
   svbool_t all16 = svptrue_b16 ();
   svbool_t all32 = svptrue_b32 ();
@@ -30,10 +32,16 @@  foo (svint16_t src0, svint32_t src1, svint16_t src2, svint32_t src3,
   s.f = svminv_s16 (all16, src5);
   s.g = svlastb_s32 (svptrue_pat_b32 (SV_VL1), src6);
   s.h = svaddv_s16 (all16, src7);
+  s.i = svmaxv_s16 (all16, src8);
+  s.j = svminv_s32 (all32, src9);
+  s.k = svlastb_s16 (svptrue_pat_b16 (SV_VL1), src10);
+  s.l = svaddv_s32 (all32, src11);
+  s.m = svmaxv_s32 (all32, src12);
+  s.n = svminv_s16 (all16, src13);
 }
 
-/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
-/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.b, b[0-9]+\n} 13 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-9]+, \[x[0-9]\]\n} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
 /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
index 49f8114b64c..3d41af684a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-slp-vectorize" } */
+/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
 
 /* Test that a group of stores of 8 elements derived from a horizontal
    reduction is not vectorized by constructing a vector and storing it
@@ -33,5 +33,6 @@  foo (svint16_t src0, svint8_t src1, svint16_t src2, svint8_t src3,
 /* { dg-final { scan-assembler-times {\tstp\tw[0-9]+, w[0-9]+,} 4 } } */
 
 /* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.s\[[0-9]+\], w[0-9]+\n} } } */
-/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } }
+/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } } */
 /* { dg-final { scan-assembler-not {\tstp\tq[0-9]+, q[0-9]+,} } } */
+/* { dg-final { scan-assembler-not {\tinsr\tz[0-9]+.s, w[0-9]+\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
index 983d6c69ebc..89e57406c0e 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-slp-vectorize" } */
+/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
 
 /* Test that a group of stores of 8 elements derived from lane extractions is
    vectorized by constructing a vector and storing it.  Since there are no
@@ -30,8 +30,8 @@  foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
   s.h = svlastb_s8 (p, src7);
 }
 
-/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
-/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
+/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.h, h[0-9]+\n} 7 } } */
+/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.h, p[0-9]+, \[x[0-9]+\]\n} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
 /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6d602c67108..7503fd084cf 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10227,12 +10227,16 @@  vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
 
   gcc_assert (slp_index >= 0);
 
-  /* Get the last occurrence of the scalar index from the concatenation of
-     all the slp vectors. Calculate which slp vector it is and the index
-     within.  */
-  int num_scalar = SLP_TREE_LANES (slp_node);
   int num_vec = vect_get_num_copies (vinfo, slp_node);
-  poly_uint64 pos = (num_vec * nunits) - num_scalar + slp_index;
+  poly_uint64 pos = slp_index;
+  if (loop_vinfo)
+    {
+      /* Get the last occurrence of the scalar index from the concatenation of
+	 all the slp vectors. Calculate which slp vector it is and the index
+	 within.  */
+      int num_scalar = SLP_TREE_LANES (slp_node);
+      pos += (num_vec * nunits) - num_scalar;
+    }
 
   /* Calculate which vector contains the result, and which lane of
      that vector we need.  */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 1850af4e753..6af13e65e19 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1117,8 +1117,12 @@  vect_record_max_nunits (vec_info *vinfo, stmt_vec_info stmt_info,
     }
 
   /* If populating the vector type requires unrolling then fail
-     before adjusting *max_nunits for basic-block vectorization.  */
+     before adjusting *max_nunits for basic-block vectorization.
+     Allow group sizes that are indivisible by the vector length only if they
+     are known not to exceed the vector length.  We may be able to support such
+     cases by generating constant masks.  */
   if (is_a <bb_vec_info> (vinfo)
+      && maybe_gt (group_size, TYPE_VECTOR_SUBPARTS (vectype))
       && !multiple_p (group_size, TYPE_VECTOR_SUBPARTS (vectype)))
     {
       if (dump_enabled_p ())
@@ -1170,12 +1174,29 @@  vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
   tree soft_fail_nunits_vectype = NULL_TREE;
 
   tree vectype, nunits_vectype;
+  bool unsupported_datatype = false;
   if (!vect_get_vector_types_for_stmt (vinfo, first_stmt_info, &vectype,
-				       &nunits_vectype, group_size))
+				       &nunits_vectype, &unsupported_datatype,
+				       group_size))
     {
-      /* Fatal mismatch.  */
-      matches[0] = false;
-      return false;
+      /* Try to get fallback vector types and continue analysis, producing
+	 matches[] as if vectype was not an issue.  This allows splitting of
+	 groups to happen.  */
+      if (unsupported_datatype
+	  && vect_get_vector_types_for_stmt (vinfo, first_stmt_info, &vectype,
+					     &nunits_vectype,
+					     &unsupported_datatype))
+	{
+	  gcc_assert (is_a<bb_vec_info> (vinfo));
+	  maybe_soft_fail = true;
+	  soft_fail_nunits_vectype = nunits_vectype;
+	}
+      else
+	{
+	  /* Fatal mismatch.  */
+	  matches[0] = false;
+	  return false;
+	}
     }
   if (is_a <bb_vec_info> (vinfo)
       && known_le (TYPE_VECTOR_SUBPARTS (vectype), 1U))
@@ -1705,16 +1726,22 @@  vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
 
   if (maybe_soft_fail)
     {
-      unsigned HOST_WIDE_INT const_nunits;
-      if (!TYPE_VECTOR_SUBPARTS
-	    (soft_fail_nunits_vectype).is_constant (&const_nunits)
-	  || const_nunits > group_size)
+      /* Use the known minimum number of subparts for VLA because we still need
+	 to choose a splitting point although the choice is more arbitrary.  */
+      unsigned HOST_WIDE_INT const_nunits = constant_lower_bound (
+	  TYPE_VECTOR_SUBPARTS (soft_fail_nunits_vectype));
+
+      if (const_nunits > group_size)
 	matches[0] = false;
       else
 	{
 	  /* With constant vector elements simulate a mismatch at the
 	     point we need to split.  */
+	  gcc_assert ((const_nunits & (const_nunits - 1)) == 0);
 	  unsigned tail = group_size & (const_nunits - 1);
+	  if (tail == 0)
+	    tail = const_nunits;
+	  gcc_assert (group_size >= tail);
 	  memset (&matches[group_size - tail], 0, sizeof (bool) * tail);
 	}
       return false;
@@ -2446,13 +2473,21 @@  vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 		  /* Check whether we can build the invariant.  If we can't
 		     we never will be able to.  */
 		  tree type = TREE_TYPE (chains[0][n].op);
-		  if (!GET_MODE_SIZE (vinfo->vector_mode).is_constant ()
-		      && (TREE_CODE (type) == BOOLEAN_TYPE
-			  || !can_duplicate_and_interleave_p (vinfo, group_size,
-							      type)))
+		  if (!GET_MODE_SIZE (vinfo->vector_mode).is_constant ())
 		    {
-		      matches[0] = false;
-		      goto out;
+		      if (TREE_CODE (type) == BOOLEAN_TYPE)
+			{
+			  matches[0] = false;
+			  goto out;
+			}
+
+		      if (!is_a<bb_vec_info> (vinfo)
+			  && !can_duplicate_and_interleave_p (vinfo, group_size,
+							      type))
+			{
+			  matches[0] = false;
+			  goto out;
+			}
 		    }
 		}
 	      else if (dt != vect_internal_def)
@@ -2881,7 +2916,7 @@  out:
 		    uniform_val = NULL_TREE;
 		    break;
 		  }
-	      if (!uniform_val
+	      if (!uniform_val && !is_a<bb_vec_info> (vinfo)
 		  && !can_duplicate_and_interleave_p (vinfo,
 						      oprnd_info->ops.length (),
 						      TREE_TYPE (op0)))
@@ -4993,6 +5028,53 @@  vect_analyze_slp_reductions (loop_vec_info loop_vinfo,
   return true;
 }
 
+/* Update MIN_NUNITS to reflect the minimum number of subparts for all of the
+   vector types used by the SLP subgraph rooted at NODE.  VISITED is used to
+   avoid reevaluating any node in the subgraph; it thereby prevents infinite
+   recursion should a cycle be encountered.  The value of MIN_NUNITS will only
+   be updated if any node in the subgraph has a vector type with a number of
+   subparts that is smaller than the passed-in value of MIN_NUNITS.  Before
+   calling this function for the first time, initialize MIN_NUNITS to
+   UINT64_MAX.  */
+
+static void
+vect_update_slp_min_nunits_for_node (slp_tree node, poly_uint64 &min_nunits,
+				     hash_set<slp_tree> &visited)
+{
+  if (!node || SLP_TREE_DEF_TYPE (node) != vect_internal_def)
+    return;
+
+  if (visited.add (node))
+    return;
+
+  for (slp_tree child : SLP_TREE_CHILDREN (node))
+    vect_update_slp_min_nunits_for_node (child, min_nunits, visited);
+
+  tree vectype = SLP_TREE_VECTYPE (node);
+  if (!vectype)
+    return;
+
+  /* All unit counts have the form vec_info::vector_size * X for some
+     rational X, therefore we know the values are ordered.  */
+  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  min_nunits = known_eq (min_nunits, UINT64_MAX)
+		 ? nunits
+		 : ordered_min (min_nunits, nunits);
+}
+
+/* Return the minimum number of subparts among all of the vector types used
+   by the SLP subgraph rooted at NODE.  */
+
+static poly_uint64
+vect_slp_tree_min_nunits (slp_tree node)
+{
+  poly_uint64 min_nunits = UINT64_MAX;
+  hash_set<slp_tree> visited;
+  vect_update_slp_min_nunits_for_node (node, min_nunits, visited);
+  gcc_checking_assert (known_ne (min_nunits, UINT64_MAX));
+  return min_nunits;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the group.  */
@@ -5062,8 +5144,8 @@  vect_analyze_slp_instance (vec_info *vinfo,
       poly_uint64 unrolling_factor
 	= calculate_unrolling_factor (max_nunits, group_size);
 
-      if (maybe_ne (unrolling_factor, 1U)
-	  && is_a <bb_vec_info> (vinfo))
+      if (maybe_ne (unrolling_factor, 1U) && is_a<bb_vec_info> (vinfo)
+	  && !known_ge (vect_slp_tree_min_nunits (node), group_size))
 	{
 	  unsigned HOST_WIDE_INT const_max_nunits;
 	  if (!max_nunits.is_constant (&const_max_nunits)
@@ -5148,9 +5230,10 @@  vect_analyze_slp_instance (vec_info *vinfo,
 	    = TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (stmt_info)));
 	  tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
 						      1 << floor_log2 (i));
-	  unsigned HOST_WIDE_INT const_nunits;
-	  if (vectype
-	      && TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits))
+	  unsigned HOST_WIDE_INT const_nunits
+	    = vectype ? constant_lower_bound (TYPE_VECTOR_SUBPARTS (vectype))
+		      : 0;
+	  if (const_nunits > 1 && (i % const_nunits) == 0)
 	    {
 	      /* Split into two groups at the first vector boundary.  */
 	      gcc_assert ((const_nunits & (const_nunits - 1)) == 0);
@@ -11596,7 +11679,21 @@  vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
       unpack_factor = 1;
     }
   unsigned olanes = unpack_factor * ncopies * SLP_TREE_LANES (node);
-  gcc_assert (repeating_p || multiple_p (olanes, nunits));
+
+  /* With fully-predicated BB-SLP, an external node's number of lanes can be
+     incompatible with the chosen vector width (e.g., lane packs of 3 with a
+     natural 2-lane vector type).  */
+  if (!repeating_p && !multiple_p (olanes, nunits))
+    {
+      if (dump_p)
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "unsupported permutation %p: vector type %T,"
+			 " nunits=" HOST_WIDE_INT_PRINT_UNSIGNED
+			 " ncopies=%" PRIu64 ", lanes=%u and unpack=%u\n",
+			 (void *) node, vectype, estimated_poly_value (nunits),
+			 ncopies, SLP_TREE_LANES (node), unpack_factor);
+      return -1;
+    }
 
   /* Compute the { { SLP operand, vector index}, lane } permutation sequence
      from the { SLP operand, scalar lane } permutation as recorded in the
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index cd3ba6fa1cb..367a9c63ea4 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1671,23 +1671,27 @@  check_load_store_for_partial_vectors (vec_info *vinfo, tree vectype,
     unsigned int nvectors;
     if (can_div_away_from_zero_p (size, nunits, &nvectors))
       return nvectors;
-    gcc_unreachable ();
+
+    gcc_assert (known_le (size, nunits));
+    return 1u;
   };
 
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
-  poly_uint64 vf = loop_vinfo ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) : 1;
+  poly_uint64 size = loop_vinfo
+		       ? group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo)
+		       : SLP_TREE_LANES (slp_node);
   unsigned factor;
   vect_partial_vector_style partial_vector_style
     = vect_get_partial_vector_style (vectype, is_load, &factor, elsvals);
 
   if (partial_vector_style == vect_partial_vectors_len)
     {
-      nvectors = group_memory_nvectors (group_size * vf, nunits);
+      nvectors = group_memory_nvectors (size, nunits);
       vect_record_len (vinfo, slp_node, nvectors, vectype, factor);
     }
   else if (partial_vector_style == vect_partial_vectors_while_ult)
     {
-      nvectors = group_memory_nvectors (group_size * vf, nunits);
+      nvectors = group_memory_nvectors (size, nunits);
       vect_record_mask (vinfo, slp_node, nvectors, vectype, scalar_mask);
     }
   else
@@ -3362,12 +3366,11 @@  vect_get_strided_load_store_ops (stmt_vec_info stmt_info, slp_tree node,
 
 static tree
 vect_get_loop_variant_data_ptr_increment (
-  vec_info *vinfo, tree aggr_type, gimple_stmt_iterator *gsi,
+  loop_vec_info loop_vinfo, tree aggr_type, gimple_stmt_iterator *gsi,
   vec_loop_lens *loop_lens, dr_vec_info *dr_info,
   vect_memory_access_type memory_access_type)
 {
-  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
-  tree step = vect_dr_behavior (vinfo, dr_info)->step;
+  tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
 
   /* gather/scatter never reach here.  */
   gcc_assert (!mat_gather_scatter_p (memory_access_type));
@@ -3411,7 +3414,7 @@  vect_get_data_ptr_increment (vec_info *vinfo, gimple_stmt_iterator *gsi,
 
   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
   if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
-    return vect_get_loop_variant_data_ptr_increment (vinfo, aggr_type, gsi,
+    return vect_get_loop_variant_data_ptr_increment (loop_vinfo, aggr_type, gsi,
 						     loop_lens, dr_info,
 						     memory_access_type);
 
@@ -5297,7 +5300,7 @@  vect_create_vectorized_demotion_stmts (vec_info *vinfo, vec<tree> *vec_oprnds,
    call the function recursively.  */
 
 static void
-vect_create_vectorized_promotion_stmts (vec_info *vinfo,
+vect_create_vectorized_promotion_stmts (vec_info *vinfo, slp_tree slp_node,
 					vec<tree> *vec_oprnds0,
 					vec<tree> *vec_oprnds1,
 					stmt_vec_info stmt_info, tree vec_dest,
@@ -5310,37 +5313,39 @@  vect_create_vectorized_promotion_stmts (vec_info *vinfo,
   gimple *new_stmt1, *new_stmt2;
   vec<tree> vec_tmp = vNULL;
 
-  vec_tmp.create (vec_oprnds0->length () * 2);
+  const unsigned ncopies = vect_get_num_copies (vinfo, slp_node);
+  vec_tmp.create (ncopies);
+  gcc_assert (vec_oprnds0->length () <= ncopies);
   FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
     {
+      if (vec_tmp.length () >= ncopies)
+	break;
+
       if (op_type == binary_op)
 	vop1 = (*vec_oprnds1)[i];
       else
 	vop1 = NULL_TREE;
 
       /* Generate the two halves of promotion operation.  */
-      new_stmt1 = vect_gen_widened_results_half (vinfo, ch1, vop0, vop1,
-						 op_type, vec_dest, gsi,
-						 stmt_info);
-      new_stmt2 = vect_gen_widened_results_half (vinfo, ch2, vop0, vop1,
-						 op_type, vec_dest, gsi,
-						 stmt_info);
-      if (is_gimple_call (new_stmt1))
-	{
-	  new_tmp1 = gimple_call_lhs (new_stmt1);
-	  new_tmp2 = gimple_call_lhs (new_stmt2);
-	}
-      else
+      new_stmt1
+	= vect_gen_widened_results_half (vinfo, ch1, vop0, vop1, op_type,
+					 vec_dest, gsi, stmt_info);
+      new_tmp1 = is_gimple_call (new_stmt1) ? gimple_call_lhs (new_stmt1)
+					    : gimple_assign_lhs (new_stmt1);
+      vec_tmp.quick_push (new_tmp1);
+
+      if (vec_tmp.length () < ncopies)
 	{
-	  new_tmp1 = gimple_assign_lhs (new_stmt1);
-	  new_tmp2 = gimple_assign_lhs (new_stmt2);
+	  new_stmt2
+	    = vect_gen_widened_results_half (vinfo, ch2, vop0, vop1, op_type,
+					     vec_dest, gsi, stmt_info);
+	  new_tmp2 = is_gimple_call (new_stmt2) ? gimple_call_lhs (new_stmt2)
+						: gimple_assign_lhs (new_stmt2);
+	  vec_tmp.quick_push (new_tmp2);
 	}
-
-      /* Store the results for the next step.  */
-      vec_tmp.quick_push (new_tmp1);
-      vec_tmp.quick_push (new_tmp2);
     }
 
+  gcc_assert (vec_tmp.length () <= ncopies);
   vec_oprnds0->release ();
   *vec_oprnds0 = vec_tmp;
 }
@@ -5553,6 +5558,7 @@  vectorizable_conversion (vec_info *vinfo,
      from the scalar type.  */
   if (!vectype_in)
     vectype_in = get_vectype_for_scalar_type (vinfo, rhs_type, slp_node);
+
   if (!cost_vec)
     gcc_assert (vectype_in);
   if (!vectype_in)
@@ -5961,12 +5967,15 @@  vectorizable_conversion (vec_info *vinfo,
 					     stmt_info, this_dest, gsi, c1,
 					     op_type);
 	  else
-	    vect_create_vectorized_promotion_stmts (vinfo, &vec_oprnds0,
-						    &vec_oprnds1, stmt_info,
-						    this_dest, gsi,
+	    vect_create_vectorized_promotion_stmts (vinfo, slp_node,
+						    &vec_oprnds0, &vec_oprnds1,
+						    stmt_info, this_dest, gsi,
 						    c1, c2, op_type);
 	}
 
+      gcc_assert (vec_oprnds0.length ()
+		  == vect_get_num_copies (vinfo, slp_node));
+
       FOR_EACH_VEC_ELT (vec_oprnds0, i, vop0)
 	{
 	  gimple *new_stmt;
@@ -5990,6 +5999,16 @@  vectorizable_conversion (vec_info *vinfo,
 	 generate more than one vector stmt - i.e - we need to "unroll"
 	 the vector stmt by a factor VF/nunits.  */
       vect_get_vec_defs (vinfo, slp_node, op0, &vec_oprnds0);
+
+      /* Promotion no longer produces redundant defs (since support was
+	 added for length/mask-predicated BB SLP of awkward-sized groups),
+	 therefore demotion now has to handle that case too.  */
+      if (vec_oprnds0.length () % 2 != 0)
+	{
+	  tree vectype = TREE_TYPE (vec_oprnds0[0]);
+	  vec_oprnds0.safe_push (build_zero_cst (vectype));
+	}
+
       /* Arguments are ready.  Create the new vector stmts.  */
       if (cvt_type && modifier == NARROW_DST)
 	FOR_EACH_VEC_ELT (vec_oprnds0, i, vop0)
@@ -10803,7 +10822,7 @@  vectorizable_load (vec_info *vinfo,
 
       aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
       if (!costing_p)
-	bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type,
+	bump = vect_get_data_ptr_increment (loop_vinfo, gsi, dr_info, aggr_type,
 					    memory_access_type, loop_lens);
 
       unsigned int inside_cost = 0, prologue_cost = 0;
@@ -13460,6 +13479,38 @@  vect_analyze_stmt (vec_info *vinfo,
 				   " live stmt not supported: %G",
 				   stmt_info->stmt);
 
+  if (bb_vinfo)
+    {
+      unsigned int group_size = SLP_TREE_LANES (node);
+      tree vectype = SLP_TREE_VECTYPE (node);
+      poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+      bool needs_partial = maybe_lt (group_size, nunits);
+      if (needs_partial)
+	{
+	  /* If partial vectors are required then they must be supported by the
+	     target; however, don't assume that a partial vectors style has
+	     been set because a mask or length may not be required for the
+	     statement.  */
+	  if (!SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (node))
+	    return opt_result::failure_at (stmt_info->stmt,
+					   "not vectorized: SLP node needs but "
+					   "cannot use partial vectors: %G",
+					   stmt_info->stmt);
+	}
+      else
+	{
+	  /* If we don't need partial vectors then we don't care about whether
+	     they are supported or not; however, we need to clear any partial
+	     vectors style that might have been chosen because it will be used
+	     to control generation of lengths or masks.  */
+	  SLP_TREE_PARTIAL_VECTORS_STYLE (node) = vect_partial_vectors_none;
+	  SLP_TREE_NUM_PARTIAL_VECTORS (node) = 0;
+	}
+
+      if (maybe_gt (group_size, nunits))
+	gcc_assert (multiple_p (group_size, nunits));
+    }
+
   return opt_result::success ();
 }
 
@@ -13767,13 +13818,7 @@  tree
 get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
 			     unsigned int group_size)
 {
-  /* For BB vectorization, we should always have a group size once we've
-     constructed the SLP tree; the only valid uses of zero GROUP_SIZEs
-     are tentative requests during things like early data reference
-     analysis and pattern recognition.  */
-  if (is_a <bb_vec_info> (vinfo))
-    gcc_assert (vinfo->slp_instances.is_empty () || group_size != 0);
-  else
+  if (!is_a <bb_vec_info> (vinfo))
     group_size = 0;
 
   tree vectype = get_related_vectype_for_scalar_type (vinfo->vector_mode,
@@ -13787,10 +13832,18 @@  get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
     vinfo->used_vector_modes.add (TYPE_MODE (vectype));
 
   /* If the natural choice of vector type doesn't satisfy GROUP_SIZE,
-     try again with an explicit number of elements.  */
-  if (vectype
-      && group_size
-      && maybe_ge (TYPE_VECTOR_SUBPARTS (vectype), group_size))
+     try again with an explicit number of elements.  A vector type
+     satisfies GROUP_SIZE if it is definitely not too long to store
+     the whole group, or we are able to generate masks to handle the
+     unknown number of excess lanes that might exist.  Otherwise, we
+     must substitute a vector type that can be used to carve up the
+     group.  */
+  if (vectype && group_size
+      && maybe_gt (TYPE_VECTOR_SUBPARTS (vectype), group_size)
+      && (vect_get_partial_vector_style (vectype, true)
+	    == vect_partial_vectors_none
+	  || vect_get_partial_vector_style (vectype, false)
+	       == vect_partial_vectors_none))
     {
       /* Start with the biggest number of units that fits within
 	 GROUP_SIZE and halve it until we find a valid vector type.
@@ -14106,7 +14159,36 @@  vect_maybe_update_slp_op_vectype (vec_info *vinfo, slp_tree op, tree vectype)
       && SLP_TREE_DEF_TYPE (op) == vect_external_def
       && SLP_TREE_LANES (op) > 1)
     return false;
-  (void) vinfo; /* FORNOW */
+
+  /* When the vectorizer falls back to building vector operands from scalars,
+     it can create SLP trees with external defs that have a number of lanes not
+     divisible by the number of subparts in a vector type naively inferred from
+     the scalar type.  Reject such types to avoid an ICE when later computing
+     the prologue cost for invariant operands.  */
+  if (SLP_TREE_DEF_TYPE (op) == vect_external_def)
+    {
+      poly_uint64 vf = 1;
+
+      if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
+	vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+      vf *= SLP_TREE_LANES (op);
+
+      if (maybe_lt (TYPE_VECTOR_SUBPARTS (vectype), vf)
+	  && !multiple_p (vf, TYPE_VECTOR_SUBPARTS (vectype)))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "lanes=" HOST_WIDE_INT_PRINT_UNSIGNED
+			     " is not divisible by "
+			     "subparts=" HOST_WIDE_INT_PRINT_UNSIGNED ".\n",
+			     estimated_poly_value (vf),
+			     estimated_poly_value (
+			       TYPE_VECTOR_SUBPARTS (vectype)));
+	  return false;
+	}
+    }
+
   SLP_TREE_VECTYPE (op) = vectype;
   return true;
 }
@@ -14814,27 +14896,32 @@  vect_gen_while_not (gimple_seq *seq, tree mask_type, tree start_index,
 
    - Set *NUNITS_VECTYPE_OUT to the vector type that contains the maximum
      number of units needed to vectorize STMT_INFO, or NULL_TREE if the
-     statement does not help to determine the overall number of units.  */
+     statement does not help to determine the overall number of units.
+
+   - Set *UNSUPPORTED_DATATYPE to false.
+
+   On failure:
+
+   - Set *UNSUPPORTED_DATATYPE to true if the statement can't be
+     vectorized because it uses a data type that the target doesn't
+     support in vector form for a group of the given
+     GROUP_SIZE.  */
 
 opt_result
 vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
 				tree *stmt_vectype_out,
 				tree *nunits_vectype_out,
+				bool *unsupported_datatype,
 				unsigned int group_size)
 {
   gimple *stmt = stmt_info->stmt;
 
-  /* For BB vectorization, we should always have a group size once we've
-     constructed the SLP tree; the only valid uses of zero GROUP_SIZEs
-     are tentative requests during things like early data reference
-     analysis and pattern recognition.  */
-  if (is_a <bb_vec_info> (vinfo))
-    gcc_assert (vinfo->slp_instances.is_empty () || group_size != 0);
-  else
+  if (!is_a<bb_vec_info> (vinfo))
     group_size = 0;
 
   *stmt_vectype_out = NULL_TREE;
   *nunits_vectype_out = NULL_TREE;
+  *unsupported_datatype = false;
 
   if (gimple_get_lhs (stmt) == NULL_TREE
       /* Allow vector conditionals through here.  */
@@ -14907,10 +14994,13 @@  vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
 	}
       vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size);
       if (!vectype)
-	return opt_result::failure_at (stmt,
-				       "not vectorized:"
-				       " unsupported data-type %T\n",
-				       scalar_type);
+	{
+	  *unsupported_datatype = true;
+	  return opt_result::failure_at (stmt,
+					 "not vectorized:"
+					 " unsupported data-type %T\n",
+					 scalar_type);
+	}
 
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location, "vectype: %T\n", vectype);
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 62f6ad320f0..1b9103b6f5f 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2353,6 +2353,8 @@  vect_get_num_copies (vec_info *vinfo, slp_tree node)
 
   vf *= SLP_TREE_LANES (node);
   tree vectype = SLP_TREE_VECTYPE (node);
+  if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
+    return 1;
 
   return vect_get_num_vectors (vf, vectype);
 }
@@ -2621,9 +2623,9 @@  extern tree vect_gen_while (gimple_seq *, tree, tree, tree,
 			    const char * = nullptr);
 extern void vect_gen_while_ssa_name (gimple_seq *, tree, tree, tree, tree);
 extern tree vect_gen_while_not (gimple_seq *, tree, tree, tree);
-extern opt_result vect_get_vector_types_for_stmt (vec_info *,
-						  stmt_vec_info, tree *,
-						  tree *, unsigned int = 0);
+extern opt_result vect_get_vector_types_for_stmt (vec_info *, stmt_vec_info,
+						  tree *, tree *,
+						  bool *, unsigned int = 0);
 extern opt_tree vect_get_mask_type_for_stmt (stmt_vec_info, unsigned int = 0);
 
 /* In tree-if-conv.cc.  */
@@ -2956,9 +2958,8 @@  vect_can_use_partial_vectors_p (vec_info *vinfo, slp_tree slp_node)
   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
   if (loop_vinfo)
     return LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo);
-
-  (void) slp_node; /* FORNOW */
-  return false;
+  else
+    return SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (slp_node);
 }
 
 /* If VINFO is vectorizer state for loop vectorization then record that we no