[RFC] main loop masked vectorization with --param vect-partial-vector-usage=1

Message ID 20230614094650.20C373858414@sourceware.org
State New

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Testing passed

Commit Message

Richard Biener June 14, 2023, 9:46 a.m. UTC
  Currently vect_determine_partial_vectors_and_peeling will decide
to apply full masking to the main loop despite
--param vect-partial-vector-usage=1 when the currently analyzed
vector mode results in a vectorization factor that's bigger
than the number of scalar iterations.  That's undesirable for
targets where a vector mode can handle both partial vector and
non-partial vector vectorization.  I understand that for AARCH64
we have SVE and NEON but SVE can only do partial vector and
NEON only non-partial vector vectorization, plus the target
chooses to let cost comparison decide the vector mode to use.

For x86 and the upcoming AVX512 partial vector support the
story is different: the target chooses the first (and largest)
vector mode that can be successfully used for vectorization.  But
that means with --param vect-partial-vector-usage=1 we will
always choose AVX512 with partial vectors for the main loop
even if, for example, V4SI would be a perfect fit with full
vectors and no required epilog!
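To make the V4SI case concrete, here is a minimal sketch of the kind of
loop in question.  The function name add4 and the trip count are made up
for illustration and are not taken from a testcase.  With four int
iterations a 128-bit V4SI vector covers the loop exactly, while a 512-bit
mode holds 16 ints, so its vectorization factor exceeds the iteration
count and masking would be needed.

```cpp
// Hypothetical example: known trip count of 4 ints.
// One V4SI (4 x 32-bit) vector covers all iterations with no epilog;
// a 512-bit mode (16 x 32-bit) would only be partially filled.
void add4(int *__restrict a, const int *__restrict b)
{
  for (int i = 0; i < 4; ++i)
    a[i] += b[i];
}
```

Something like -O3 -mavx512f --param vect-partial-vector-usage=1
-fopt-info-vec should show which mode gets picked, though this is an
assumption about a suitable command line, and GCC may completely unroll
such a small loop before the loop vectorizer sees it.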

The following tries to find the appropriate condition for
this - I suppose simply refusing to set LOOP_VINFO_USING_PARTIAL_VECTORS_P
on the main loop when --param vect-partial-vector-usage=1 will
hurt AARCH64?  Incidentally looking up the docs for
vect-partial-vector-usage suggests that it's not supposed to
control epilog vectorization but instead
"1 allows partial vector loads and stores if vectorization removes the
need for the code to iterate".  That's probably OK in the end
but if there's a fixed size vector mode that allows the same thing
without using masking that would be better.

I wonder if we should special-case known niter (bounds) somehow
when analyzing the vector modes and override the target's sorting?

Maybe we want a new --param in addition to vect-epilogues-nomask
and vect-partial-vector-usage to say we want masked epilogues?

	* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
	For non-VLA vectorization interpret param_vect_partial_vector_usage == 1
	as only applying to epilogues.
---
 gcc/tree-vect-loop.cc | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
  

Patch

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 9be66b8fbc5..9323aa572d4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2478,7 +2478,15 @@  vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
-      else
+      /* Avoid using a large fixed size vectorization mode with masking
+	 for the main loop when we were asked to only use masking for
+	 the epilog.
+	 ???  Ideally we'd start analysis with a better sized mode,
+	 the param_vect_partial_vector_usage == 2 case suffers from
+	 this as well.  But there's a catch-22.  */
+      else if (!(!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+		 && param_vect_partial_vector_usage == 1
+		 && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
 	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
     }
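
The branch structure after this hunk can be modelled in isolation as
follows.  This is only a sketch: LoopInfo, Decision and decide are
hypothetical names, the booleans stand in for the LOOP_VINFO_* macros and
vect_known_niters_smaller_than_vf, and the leading
"param_vect_partial_vector_usage == 1" test of the first branch is assumed
because it lies outside the hunk.  Earlier branches of the real function
are not modelled.

```cpp
// Stand-ins for the loop_vec_info queries used in the hunk (hypothetical).
struct LoopInfo
{
  bool is_epilogue;            // models LOOP_VINFO_EPILOGUE_P
  bool niters_smaller_than_vf; // models vect_known_niters_smaller_than_vf
  bool vf_is_constant;         // models LOOP_VINFO_VECT_FACTOR (..).is_constant ()
};

enum class Decision
{
  None,            // neither flag is set
  EpiloguePartial, // LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P = true
  MainPartial      // LOOP_VINFO_USING_PARTIAL_VECTORS_P = true
};

// usage models param_vect_partial_vector_usage.
Decision decide(int usage, const LoopInfo &l)
{
  // Assumed first branch: only the epilogue of this main loop gets
  // partial vectors.
  if (usage == 1 && !l.is_epilogue && !l.niters_smaller_than_vf)
    return Decision::EpiloguePartial;

  // Patched branch: refuse masking for a main loop with a constant
  // vectorization factor when usage == 1.
  if (!(!l.is_epilogue && usage == 1 && l.vf_is_constant))
    return Decision::MainPartial;

  return Decision::None;
}
```

In this model a main loop with a constant VF, usage == 1 and fewer
iterations than VF ends up with neither flag set, whereas the same loop
with a variable-length VF, with usage == 2, or analyzed as an epilogue
still gets partial vectors.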