[vect] PR103997: Fix epilogue mode skipping

Message ID 51da8b34-511c-4359-b7d8-786935417c70@arm.com
State New
Headers
Series [vect] PR103997: Fix epilogue mode skipping |

Commit Message

Andre Vieira (lists) Jan. 13, 2022, 9:48 a.m. UTC
  This time to the list too (sorry for double email)

Hi,

The original patch '[vect] Re-analyze all modes for epilogues', skipped 
modes that should not be skipped since it used the vector mode provided 
by autovectorize_vector_modes to derive the minimum VF required for it. 
However, those modes should only really be used to dictate vector size, 
so instead this patch looks for the mode in 'used_vector_modes' with the 
largest element size, and constructs a vector mode with the smae size as 
the current vector_modes[mode_i]. Since we are using the largest element 
size the NUNITs for this mode is the smallest possible VF required for 
an epilogue with this mode and should thus skip only the modes we are 
certain can not be used.

Passes bootstrap and regression on x86_64 and aarch64.

gcc/ChangeLog:

         PR 103997
         * tree-vect-loop.c (vect_analyze_loop): Fix mode skipping for 
epilogue
         vectorization.
  

Comments

Richard Biener Jan. 13, 2022, 12:36 p.m. UTC | #1
On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:

> This time to the list too (sorry for double email)
> 
> Hi,
> 
> The original patch '[vect] Re-analyze all modes for epilogues', skipped modes
> that should not be skipped since it used the vector mode provided by
> autovectorize_vector_modes to derive the minimum VF required for it. However,
> those modes should only really be used to dictate vector size, so instead this
> patch looks for the mode in 'used_vector_modes' with the largest element size,
> and constructs a vector mode with the smae size as the current
> vector_modes[mode_i]. Since we are using the largest element size the NUNITs
> for this mode is the smallest possible VF required for an epilogue with this
> mode and should thus skip only the modes we are certain can not be used.
> 
> Passes bootstrap and regression on x86_64 and aarch64.

Clearly

+         /* To make sure we are conservative as to what modes we skip, we
+            should use check the smallest possible NUNITS which would be
+            derived from the mode in USED_VECTOR_MODES with the largest
+            element size.  */
+         scalar_mode max_elsize_mode = GET_MODE_INNER
(vector_modes[mode_i]);
+         for (vec_info::mode_set::iterator i =
+               first_loop_vinfo->used_vector_modes.begin ();
+             i != first_loop_vinfo->used_vector_modes.end (); ++i)
+           {
+             if (VECTOR_MODE_P (*i)
+                 && GET_MODE_SIZE (GET_MODE_INNER (*i))
+                 > GET_MODE_SIZE (max_elsize_mode))
+               max_elsize_mode = GET_MODE_INNER (*i);
+           }

can be done once before iterating over the modes for the epilogue.

Richard maybe knows whether we should take care to look at the
size of the vector mode as well since related_vector_mode when
passed 0 as nunits produces a vector mode with the same size
as vector_modes[mode_i] but not all used_vector_modes may be
of the same size (and you probably also want to exclude
VECTOR_BOOLEAN_TYPE_P from the search?)

Thanks,
Richard.

> gcc/ChangeLog:
> 
>         PR 103997
>         * tree-vect-loop.c (vect_analyze_loop): Fix mode skipping for 
> epilogue
>         vectorization.
>
  
Andre Vieira (lists) Jan. 13, 2022, 1:51 p.m. UTC | #2
On 13/01/2022 12:36, Richard Biener wrote:
> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
>
>> This time to the list too (sorry for double email)
>>
>> Hi,
>>
>> The original patch '[vect] Re-analyze all modes for epilogues', skipped modes
>> that should not be skipped since it used the vector mode provided by
>> autovectorize_vector_modes to derive the minimum VF required for it. However,
>> those modes should only really be used to dictate vector size, so instead this
>> patch looks for the mode in 'used_vector_modes' with the largest element size,
>> and constructs a vector mode with the smae size as the current
>> vector_modes[mode_i]. Since we are using the largest element size the NUNITs
>> for this mode is the smallest possible VF required for an epilogue with this
>> mode and should thus skip only the modes we are certain can not be used.
>>
>> Passes bootstrap and regression on x86_64 and aarch64.
> Clearly
>
> +         /* To make sure we are conservative as to what modes we skip, we
> +            should use check the smallest possible NUNITS which would be
> +            derived from the mode in USED_VECTOR_MODES with the largest
> +            element size.  */
> +         scalar_mode max_elsize_mode = GET_MODE_INNER
> (vector_modes[mode_i]);
> +         for (vec_info::mode_set::iterator i =
> +               first_loop_vinfo->used_vector_modes.begin ();
> +             i != first_loop_vinfo->used_vector_modes.end (); ++i)
> +           {
> +             if (VECTOR_MODE_P (*i)
> +                 && GET_MODE_SIZE (GET_MODE_INNER (*i))
> +                 > GET_MODE_SIZE (max_elsize_mode))
> +               max_elsize_mode = GET_MODE_INNER (*i);
> +           }
>
> can be done once before iterating over the modes for the epilogue.
True, I'll start with QImode instead of the inner of 
vector_modes[mode_i] too since we can't guarantee the mode is a 
VECTOR_MODE_P and it is actually better too since we can't possible 
guarantee the element size of the USED_VECTOR_MODES is smaller than that 
of the first vector mode...

> Richard maybe knows whether we should take care to look at the
> size of the vector mode as well since related_vector_mode when
> passed 0 as nunits produces a vector mode with the same size
> as vector_modes[mode_i] but not all used_vector_modes may be
> of the same size
I suspect that should be fine though, since if we use the largest 
element size of all used_vector_modes then that should gives us the 
least possible number of NUNITS and thus only conservatively skip. That 
said, that does assume that no vector mode used may be larger than the 
size of the loop's vector_mode. Can I assume that?
>
> (and you probably also want to exclude
> VECTOR_BOOLEAN_TYPE_P from the search?)
Yeah I think so too, thanks!

I keep going back to thinking (as I brought up in the bugzilla ticket), 
maybe we ought to only skip if the NUNITS of the vector mode with the 
same vector size as vector_modes[mode_i] is larger than first_info_vf, 
or just don't skip at all...
  
Richard Biener Jan. 13, 2022, 2:25 p.m. UTC | #3
On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:

> 
> On 13/01/2022 12:36, Richard Biener wrote:
> > On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
> >
> >> This time to the list too (sorry for double email)
> >>
> >> Hi,
> >>
> >> The original patch '[vect] Re-analyze all modes for epilogues', skipped
> >> modes
> >> that should not be skipped since it used the vector mode provided by
> >> autovectorize_vector_modes to derive the minimum VF required for it.
> >> However,
> >> those modes should only really be used to dictate vector size, so instead
> >> this
> >> patch looks for the mode in 'used_vector_modes' with the largest element
> >> size,
> >> and constructs a vector mode with the smae size as the current
> >> vector_modes[mode_i]. Since we are using the largest element size the
> >> NUNITs
> >> for this mode is the smallest possible VF required for an epilogue with
> >> this
> >> mode and should thus skip only the modes we are certain can not be used.
> >>
> >> Passes bootstrap and regression on x86_64 and aarch64.
> > Clearly
> >
> > +         /* To make sure we are conservative as to what modes we skip, we
> > +            should use check the smallest possible NUNITS which would be
> > +            derived from the mode in USED_VECTOR_MODES with the largest
> > +            element size.  */
> > +         scalar_mode max_elsize_mode = GET_MODE_INNER
> > (vector_modes[mode_i]);
> > +         for (vec_info::mode_set::iterator i =
> > +               first_loop_vinfo->used_vector_modes.begin ();
> > +             i != first_loop_vinfo->used_vector_modes.end (); ++i)
> > +           {
> > +             if (VECTOR_MODE_P (*i)
> > +                 && GET_MODE_SIZE (GET_MODE_INNER (*i))
> > +                 > GET_MODE_SIZE (max_elsize_mode))
> > +               max_elsize_mode = GET_MODE_INNER (*i);
> > +           }
> >
> > can be done once before iterating over the modes for the epilogue.
> True, I'll start with QImode instead of the inner of vector_modes[mode_i] too
> since we can't guarantee the mode is a VECTOR_MODE_P and it is actually better
> too since we can't possible guarantee the element size of the
> USED_VECTOR_MODES is smaller than that of the first vector mode...
> 
> > Richard maybe knows whether we should take care to look at the
> > size of the vector mode as well since related_vector_mode when
> > passed 0 as nunits produces a vector mode with the same size
> > as vector_modes[mode_i] but not all used_vector_modes may be
> > of the same size
> I suspect that should be fine though, since if we use the largest element size
> of all used_vector_modes then that should gives us the least possible number
> of NUNITS and thus only conservatively skip. That said, that does assume that
> no vector mode used may be larger than the size of the loop's vector_mode. Can
> I assume that?

No idea, but I would lean towards a no ;)  I think the loops vector_mode
doesn't have to match vector_modes[mode_i] either, does it?  At least
autodetected_vector_mode will be not QImode based.

> >
> > (and you probably also want to exclude
> > VECTOR_BOOLEAN_TYPE_P from the search?)
> Yeah I think so too, thanks!
> 
> I keep going back to thinking (as I brought up in the bugzilla ticket), maybe
> we ought to only skip if the NUNITS of the vector mode with the same vector
> size as vector_modes[mode_i] is larger than first_info_vf, or just don't skip
> at all...

The question is how much work we do before realizing the chosen mode
cannot be used because there's not enough iterations?  Maybe we can
improve there easily?

Also for targets that for the main loop do not perform cost
comparison (like x86) but have lots of vector modes the previous
mode of operation really made sense (start at next_mode_i or
mode_i when unrolling).
  
Andre Vieira (lists) Jan. 13, 2022, 3:38 p.m. UTC | #4
On 13/01/2022 14:25, Richard Biener wrote:
> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
>
>> On 13/01/2022 12:36, Richard Biener wrote:
>>> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
>>>
>>>> This time to the list too (sorry for double email)
>>>>
>>>> Hi,
>>>>
>>>> The original patch '[vect] Re-analyze all modes for epilogues', skipped
>>>> modes
>>>> that should not be skipped since it used the vector mode provided by
>>>> autovectorize_vector_modes to derive the minimum VF required for it.
>>>> However,
>>>> those modes should only really be used to dictate vector size, so instead
>>>> this
>>>> patch looks for the mode in 'used_vector_modes' with the largest element
>>>> size,
>>>> and constructs a vector mode with the smae size as the current
>>>> vector_modes[mode_i]. Since we are using the largest element size the
>>>> NUNITs
>>>> for this mode is the smallest possible VF required for an epilogue with
>>>> this
>>>> mode and should thus skip only the modes we are certain can not be used.
>>>>
>>>> Passes bootstrap and regression on x86_64 and aarch64.
>>> Clearly
>>>
>>> +         /* To make sure we are conservative as to what modes we skip, we
>>> +            should use check the smallest possible NUNITS which would be
>>> +            derived from the mode in USED_VECTOR_MODES with the largest
>>> +            element size.  */
>>> +         scalar_mode max_elsize_mode = GET_MODE_INNER
>>> (vector_modes[mode_i]);
>>> +         for (vec_info::mode_set::iterator i =
>>> +               first_loop_vinfo->used_vector_modes.begin ();
>>> +             i != first_loop_vinfo->used_vector_modes.end (); ++i)
>>> +           {
>>> +             if (VECTOR_MODE_P (*i)
>>> +                 && GET_MODE_SIZE (GET_MODE_INNER (*i))
>>> +                 > GET_MODE_SIZE (max_elsize_mode))
>>> +               max_elsize_mode = GET_MODE_INNER (*i);
>>> +           }
>>>
>>> can be done once before iterating over the modes for the epilogue.
>> True, I'll start with QImode instead of the inner of vector_modes[mode_i] too
>> since we can't guarantee the mode is a VECTOR_MODE_P and it is actually better
>> too since we can't possible guarantee the element size of the
>> USED_VECTOR_MODES is smaller than that of the first vector mode...
>>
>>> Richard maybe knows whether we should take care to look at the
>>> size of the vector mode as well since related_vector_mode when
>>> passed 0 as nunits produces a vector mode with the same size
>>> as vector_modes[mode_i] but not all used_vector_modes may be
>>> of the same size
>> I suspect that should be fine though, since if we use the largest element size
>> of all used_vector_modes then that should gives us the least possible number
>> of NUNITS and thus only conservatively skip. That said, that does assume that
>> no vector mode used may be larger than the size of the loop's vector_mode. Can
>> I assume that?
> No idea, but I would lean towards a no ;)  I think the loops vector_mode
> doesn't have to match vector_modes[mode_i] either, does it?  At least
> autodetected_vector_mode will be not QImode based.
The mode doesn't but both vector modes have to be the same vector size 
surely, I'm not referring to the element size here.
What I was trying to ask was whether all vector modes in 
used_vector_modes had the same vector size as the loops vector mode (and 
the vector_modes[mode_i] it originated from).
>
>>> (and you probably also want to exclude
>>> VECTOR_BOOLEAN_TYPE_P from the search?)
>> Yeah I think so too, thanks!
>>
>> I keep going back to thinking (as I brought up in the bugzilla ticket), maybe
>> we ought to only skip if the NUNITS of the vector mode with the same vector
>> size as vector_modes[mode_i] is larger than first_info_vf, or just don't skip
>> at all...
> The question is how much work we do before realizing the chosen mode
> cannot be used because there's not enough iterations?  Maybe we can
> improve there easily?
IIUC the VF can change depending on whether we decide to use SLP, so 
really we can only check if after we have determined whether or not to 
use SLP, so either:
* When SLP fully succeeds, so somewhere between the last 'goto again;' 
and return success, but there is very little left to do there
* When SLP fails: here we could save on some work.

>
> Also for targets that for the main loop do not perform cost
> comparison (like x86) but have lots of vector modes the previous
> mode of operation really made sense (start at next_mode_i or
> mode_i when unrolling).
Are you hinting at maybe creating different paths here based on some 
target configurable thing? Could be something we ask vector_costs?
  
Richard Biener Jan. 14, 2022, 7:08 a.m. UTC | #5
On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:

> 
> On 13/01/2022 14:25, Richard Biener wrote:
> > On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
> >
> >> On 13/01/2022 12:36, Richard Biener wrote:
> >>> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
> >>>
> >>>> This time to the list too (sorry for double email)
> >>>>
> >>>> Hi,
> >>>>
> >>>> The original patch '[vect] Re-analyze all modes for epilogues', skipped
> >>>> modes
> >>>> that should not be skipped since it used the vector mode provided by
> >>>> autovectorize_vector_modes to derive the minimum VF required for it.
> >>>> However,
> >>>> those modes should only really be used to dictate vector size, so instead
> >>>> this
> >>>> patch looks for the mode in 'used_vector_modes' with the largest element
> >>>> size,
> >>>> and constructs a vector mode with the smae size as the current
> >>>> vector_modes[mode_i]. Since we are using the largest element size the
> >>>> NUNITs
> >>>> for this mode is the smallest possible VF required for an epilogue with
> >>>> this
> >>>> mode and should thus skip only the modes we are certain can not be used.
> >>>>
> >>>> Passes bootstrap and regression on x86_64 and aarch64.
> >>> Clearly
> >>>
> >>> +         /* To make sure we are conservative as to what modes we skip, we
> >>> +            should use check the smallest possible NUNITS which would be
> >>> +            derived from the mode in USED_VECTOR_MODES with the largest
> >>> +            element size.  */
> >>> +         scalar_mode max_elsize_mode = GET_MODE_INNER
> >>> (vector_modes[mode_i]);
> >>> +         for (vec_info::mode_set::iterator i =
> >>> +               first_loop_vinfo->used_vector_modes.begin ();
> >>> +             i != first_loop_vinfo->used_vector_modes.end (); ++i)
> >>> +           {
> >>> +             if (VECTOR_MODE_P (*i)
> >>> +                 && GET_MODE_SIZE (GET_MODE_INNER (*i))
> >>> +                 > GET_MODE_SIZE (max_elsize_mode))
> >>> +               max_elsize_mode = GET_MODE_INNER (*i);
> >>> +           }
> >>>
> >>> can be done once before iterating over the modes for the epilogue.
> >> True, I'll start with QImode instead of the inner of vector_modes[mode_i]
> >> too
> >> since we can't guarantee the mode is a VECTOR_MODE_P and it is actually
> >> better
> >> too since we can't possible guarantee the element size of the
> >> USED_VECTOR_MODES is smaller than that of the first vector mode...
> >>
> >>> Richard maybe knows whether we should take care to look at the
> >>> size of the vector mode as well since related_vector_mode when
> >>> passed 0 as nunits produces a vector mode with the same size
> >>> as vector_modes[mode_i] but not all used_vector_modes may be
> >>> of the same size
> >> I suspect that should be fine though, since if we use the largest element
> >> size
> >> of all used_vector_modes then that should gives us the least possible
> >> number
> >> of NUNITS and thus only conservatively skip. That said, that does assume
> >> that
> >> no vector mode used may be larger than the size of the loop's vector_mode.
> >> Can
> >> I assume that?
> > No idea, but I would lean towards a no ;)  I think the loops vector_mode
> > doesn't have to match vector_modes[mode_i] either, does it?  At least
> > autodetected_vector_mode will be not QImode based.
> The mode doesn't but both vector modes have to be the same vector size surely,
> I'm not referring to the element size here.
> What I was trying to ask was whether all vector modes in used_vector_modes had
> the same vector size as the loops vector mode (and the vector_modes[mode_i] it
> originated from).

Definitely not I think.

> >>> (and you probably also want to exclude
> >>> VECTOR_BOOLEAN_TYPE_P from the search?)
> >> Yeah I think so too, thanks!
> >>
> >> I keep going back to thinking (as I brought up in the bugzilla ticket),
> >> maybe
> >> we ought to only skip if the NUNITS of the vector mode with the same vector
> >> size as vector_modes[mode_i] is larger than first_info_vf, or just don't
> >> skip
> >> at all...
> > The question is how much work we do before realizing the chosen mode
> > cannot be used because there's not enough iterations?  Maybe we can
> > improve there easily?
> IIUC the VF can change depending on whether we decide to use SLP, so really we
> can only check if after we have determined whether or not to use SLP, so
> either:
> * When SLP fully succeeds, so somewhere between the last 'goto again;' and
> return success, but there is very little left to do there
> * When SLP fails: here we could save on some work.

Hmm, yeah.  Guess it's quite expensive then in the end so worth to
avoid doing useless stuff.  I do wonder whether we could cache
analysis fails (and VFs in case of success but worse cost) of the
main loop analysis.

> > Also for targets that for the main loop do not perform cost
> > comparison (like x86) but have lots of vector modes the previous
> > mode of operation really made sense (start at next_mode_i or
> > mode_i when unrolling).
> Are you hinting at maybe creating different paths here based on some target
> configurable thing? Could be something we ask vector_costs?

That would be an option, yes.  We could re-use the VECT_COMPARE_COSTS
bit from autovectorize_vector_modes, if we are not supposed to compare
costs then the old scheme makes sense.  We could of course also ask
the target for the first (auto-detect) mode to try for the epilogue,
telling it the first loops mode and VF (again if not comparing costs)
with a new target hook.

But at this point lets try to fix the skipping heuristic and if that
fails just go back to the old iteration scheme, at least for the first
mode to try?  Thus, maybe set vector_modes[0] to that previously chosen
next mode and iterate from that.  Shouldn't matter for aarcht64 since
we'd compare costs of the other modes anyway.

Richard.
  
Andre Vieira (lists) Jan. 14, 2022, 9:35 a.m. UTC | #6
On 14/01/2022 07:08, Richard Biener wrote:
> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
>
>> On 13/01/2022 14:25, Richard Biener wrote:
>>> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
>>>
>>>> On 13/01/2022 12:36, Richard Biener wrote:
>>>>> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
>>>>>
>>>>>> This time to the list too (sorry for double email)
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The original patch '[vect] Re-analyze all modes for epilogues', skipped
>>>>>> modes
>>>>>> that should not be skipped since it used the vector mode provided by
>>>>>> autovectorize_vector_modes to derive the minimum VF required for it.
>>>>>> However,
>>>>>> those modes should only really be used to dictate vector size, so instead
>>>>>> this
>>>>>> patch looks for the mode in 'used_vector_modes' with the largest element
>>>>>> size,
>>>>>> and constructs a vector mode with the smae size as the current
>>>>>> vector_modes[mode_i]. Since we are using the largest element size the
>>>>>> NUNITs
>>>>>> for this mode is the smallest possible VF required for an epilogue with
>>>>>> this
>>>>>> mode and should thus skip only the modes we are certain can not be used.
>>>>>>
>>>>>> Passes bootstrap and regression on x86_64 and aarch64.
>>>>> Clearly
>>>>>
>>>>> +         /* To make sure we are conservative as to what modes we skip, we
>>>>> +            should use check the smallest possible NUNITS which would be
>>>>> +            derived from the mode in USED_VECTOR_MODES with the largest
>>>>> +            element size.  */
>>>>> +         scalar_mode max_elsize_mode = GET_MODE_INNER
>>>>> (vector_modes[mode_i]);
>>>>> +         for (vec_info::mode_set::iterator i =
>>>>> +               first_loop_vinfo->used_vector_modes.begin ();
>>>>> +             i != first_loop_vinfo->used_vector_modes.end (); ++i)
>>>>> +           {
>>>>> +             if (VECTOR_MODE_P (*i)
>>>>> +                 && GET_MODE_SIZE (GET_MODE_INNER (*i))
>>>>> +                 > GET_MODE_SIZE (max_elsize_mode))
>>>>> +               max_elsize_mode = GET_MODE_INNER (*i);
>>>>> +           }
>>>>>
>>>>> can be done once before iterating over the modes for the epilogue.
>>>> True, I'll start with QImode instead of the inner of vector_modes[mode_i]
>>>> too
>>>> since we can't guarantee the mode is a VECTOR_MODE_P and it is actually
>>>> better
>>>> too since we can't possible guarantee the element size of the
>>>> USED_VECTOR_MODES is smaller than that of the first vector mode...
>>>>
>>>>> Richard maybe knows whether we should take care to look at the
>>>>> size of the vector mode as well since related_vector_mode when
>>>>> passed 0 as nunits produces a vector mode with the same size
>>>>> as vector_modes[mode_i] but not all used_vector_modes may be
>>>>> of the same size
>>>> I suspect that should be fine though, since if we use the largest element
>>>> size
>>>> of all used_vector_modes then that should gives us the least possible
>>>> number
>>>> of NUNITS and thus only conservatively skip. That said, that does assume
>>>> that
>>>> no vector mode used may be larger than the size of the loop's vector_mode.
>>>> Can
>>>> I assume that?
>>> No idea, but I would lean towards a no ;)  I think the loops vector_mode
>>> doesn't have to match vector_modes[mode_i] either, does it?  At least
>>> autodetected_vector_mode will be not QImode based.
>> The mode doesn't but both vector modes have to be the same vector size surely,
>> I'm not referring to the element size here.
>> What I was trying to ask was whether all vector modes in used_vector_modes had
>> the same vector size as the loops vector mode (and the vector_modes[mode_i] it
>> originated from).
> Definitely not I think.
Hmmm I'm still struggling to understand what we use that initial 
vector_mode for then. I thought it was a combination of limiting vector 
size (By that I mean NUNITS * element size) and ISA choice.
If we can use vector modes within a loop with a size different from the 
initial one, is it at least a guarantee that the input vector mode's 
size is an upper bound to the sizes of the modes in used_vector_modes?
>>>>> (and you probably also want to exclude
>>>>> VECTOR_BOOLEAN_TYPE_P from the search?)
>>>> Yeah I think so too, thanks!
>>>>
>>>> I keep going back to thinking (as I brought up in the bugzilla ticket),
>>>> maybe
>>>> we ought to only skip if the NUNITS of the vector mode with the same vector
>>>> size as vector_modes[mode_i] is larger than first_info_vf, or just don't
>>>> skip
>>>> at all...
>>> The question is how much work we do before realizing the chosen mode
>>> cannot be used because there's not enough iterations?  Maybe we can
>>> improve there easily?
>> IIUC the VF can change depending on whether we decide to use SLP, so really we
>> can only check if after we have determined whether or not to use SLP, so
>> either:
>> * When SLP fully succeeds, so somewhere between the last 'goto again;' and
>> return success, but there is very little left to do there
>> * When SLP fails: here we could save on some work.
> Hmm, yeah.  Guess it's quite expensive then in the end so worth to
> avoid doing useless stuff.  I do wonder whether we could cache
> analysis fails (and VFs in case of success but worse cost) of the
> main loop analysis.
Hmmm, a quick look doesn't show any cases where the main loop may fail 
where epilogue may succeed (other than costing). So this could save on 
some analysis. Though it is potentially a band-aid if we end up having 
to skip for VF, since that wouldn't fail for the main loop, unless ofc 
it's a known iteration count. It sounds like a lot of work for maybe a 
bit of saving?
>>> Also for targets that for the main loop do not perform cost
>>> comparison (like x86) but have lots of vector modes the previous
>>> mode of operation really made sense (start at next_mode_i or
>>> mode_i when unrolling).
>> Are you hinting at maybe creating different paths here based on some target
>> configurable thing? Could be something we ask vector_costs?
> That would be an option, yes.  We could re-use the VECT_COMPARE_COSTS
> bit from autovectorize_vector_modes, if we are not supposed to compare
> costs then the old scheme makes sense.  We could of course also ask
> the target for the first (auto-detect) mode to try for the epilogue,
> telling it the first loops mode and VF (again if not comparing costs)
> with a new target hook.
>
> But at this point lets try to fix the skipping heuristic and if that
> fails just go back to the old iteration scheme, at least for the first
> mode to try?  Thus, maybe set vector_modes[0] to that previously chosen
> next mode and iterate from that.  Shouldn't matter for aarcht64 since
> we'd compare costs of the other modes anyway.
I like the idea of pushing the old 'next mode' to vector_modes[mode_i] 
idea, aarch64 will still analyze more than it needs to though. So so 
yeah the skipping might be preferable if we can get it to work.  The 
original patch seems to work fine on current testsuite for aarch64 and 
x86_64. Though, I am not confident of the 'max element size of 
USED_VECTOR_MODES' if we can't guarantee that the vector size of the 
input vector_mode for the main loop is an upper bound for the sizes in 
used_vector_modes...
  
Richard Biener Jan. 14, 2022, 9:57 a.m. UTC | #7
On Fri, 14 Jan 2022, Andre Vieira (lists) wrote:

> 
> On 14/01/2022 07:08, Richard Biener wrote:
> > On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
> >
> >> On 13/01/2022 14:25, Richard Biener wrote:
> >>> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
> >>>
> >>>> On 13/01/2022 12:36, Richard Biener wrote:
> >>>>> On Thu, 13 Jan 2022, Andre Vieira (lists) wrote:
> >>>>>
> >>>>>> This time to the list too (sorry for double email)
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> The original patch '[vect] Re-analyze all modes for epilogues', skipped
> >>>>>> modes
> >>>>>> that should not be skipped since it used the vector mode provided by
> >>>>>> autovectorize_vector_modes to derive the minimum VF required for it.
> >>>>>> However,
> >>>>>> those modes should only really be used to dictate vector size, so
> >>>>>> instead
> >>>>>> this
> >>>>>> patch looks for the mode in 'used_vector_modes' with the largest
> >>>>>> element
> >>>>>> size,
> >>>>>> and constructs a vector mode with the smae size as the current
> >>>>>> vector_modes[mode_i]. Since we are using the largest element size the
> >>>>>> NUNITs
> >>>>>> for this mode is the smallest possible VF required for an epilogue with
> >>>>>> this
> >>>>>> mode and should thus skip only the modes we are certain can not be
> >>>>>> used.
> >>>>>>
> >>>>>> Passes bootstrap and regression on x86_64 and aarch64.
> >>>>> Clearly
> >>>>>
> >>>>> +         /* To make sure we are conservative as to what modes we skip,
> >>>>> we
> >>>>> +            should use check the smallest possible NUNITS which would
> >>>>> be
> >>>>> +            derived from the mode in USED_VECTOR_MODES with the largest
> >>>>> +            element size.  */
> >>>>> +         scalar_mode max_elsize_mode = GET_MODE_INNER
> >>>>> (vector_modes[mode_i]);
> >>>>> +         for (vec_info::mode_set::iterator i =
> >>>>> +               first_loop_vinfo->used_vector_modes.begin ();
> >>>>> +             i != first_loop_vinfo->used_vector_modes.end (); ++i)
> >>>>> +           {
> >>>>> +             if (VECTOR_MODE_P (*i)
> >>>>> +                 && GET_MODE_SIZE (GET_MODE_INNER (*i))
> >>>>> +                 > GET_MODE_SIZE (max_elsize_mode))
> >>>>> +               max_elsize_mode = GET_MODE_INNER (*i);
> >>>>> +           }
> >>>>>
> >>>>> can be done once before iterating over the modes for the epilogue.
> >>>> True, I'll start with QImode instead of the inner of vector_modes[mode_i]
> >>>> too
> >>>> since we can't guarantee the mode is a VECTOR_MODE_P and it is actually
> >>>> better
> >>>> too since we can't possible guarantee the element size of the
> >>>> USED_VECTOR_MODES is smaller than that of the first vector mode...
> >>>>
> >>>>> Richard maybe knows whether we should take care to look at the
> >>>>> size of the vector mode as well since related_vector_mode when
> >>>>> passed 0 as nunits produces a vector mode with the same size
> >>>>> as vector_modes[mode_i] but not all used_vector_modes may be
> >>>>> of the same size
> >>>> I suspect that should be fine though, since if we use the largest element
> >>>> size
> >>>> of all used_vector_modes then that should gives us the least possible
> >>>> number
> >>>> of NUNITS and thus only conservatively skip. That said, that does assume
> >>>> that
> >>>> no vector mode used may be larger than the size of the loop's
> >>>> vector_mode.
> >>>> Can
> >>>> I assume that?
> >>> No idea, but I would lean towards a no ;)  I think the loops vector_mode
> >>> doesn't have to match vector_modes[mode_i] either, does it?  At least
> >>> autodetected_vector_mode will be not QImode based.
> >> The mode doesn't but both vector modes have to be the same vector size
> >> surely,
> >> I'm not referring to the element size here.
> >> What I was trying to ask was whether all vector modes in used_vector_modes
> >> had
> >> the same vector size as the loops vector mode (and the vector_modes[mode_i]
> >> it
> >> originated from).
> > Definitely not I think.
> Hmmm I'm still struggling to understand what we use that initial vector_mode
> for then. I thought it was a combination of limiting vector size (By that I
> mean NUNITS * element size) and ISA choice.
> If we can use vector modes within a loop with a size different from the
> initial one, is it at least a guarantee that the input vector mode's size is
> an upper bound to the sizes of the modes in used_vector_modes?

I don't think so - via related_vector_mode we can get to both smaller
and larger vector types (if they are supported).  The input vector mode
really just determines what we use as first guess.  But maybe for
loop vectorization that's really all - Richard should know more here.
But since this is all to be used as heuristic there's no need to be
precise (and I think it's impossible to be).

> >>>>> (and you probably also want to exclude
> >>>>> VECTOR_BOOLEAN_TYPE_P from the search?)
> >>>> Yeah I think so too, thanks!
> >>>>
> >>>> I keep going back to thinking (as I brought up in the bugzilla ticket),
> >>>> maybe
> >>>> we ought to only skip if the NUNITS of the vector mode with the same
> >>>> vector
> >>>> size as vector_modes[mode_i] is larger than first_info_vf, or just don't
> >>>> skip
> >>>> at all...
> >>> The question is how much work we do before realizing the chosen mode
> >>> cannot be used because there's not enough iterations?  Maybe we can
> >>> improve there easily?
> >> IIUC the VF can change depending on whether we decide to use SLP, so really
> >> we
> >> can only check if after we have determined whether or not to use SLP, so
> >> either:
> >> * When SLP fully succeeds, so somewhere between the last 'goto again;' and
> >> return success, but there is very little left to do there
> >> * When SLP fails: here we could save on some work.
> > Hmm, yeah.  Guess it's quite expensive then in the end so worth to
> > avoid doing useless stuff.  I do wonder whether we could cache
> > analysis fails (and VFs in case of success but worse cost) of the
> > main loop analysis.
> Hmmm, a quick look doesn't show any cases where the main loop may fail where
> epilogue may succeed (other than costing). So this could save on some
> analysis. Though it is potentially a band-aid if we end up having to skip for
> VF, since that wouldn't fail for the main loop, unless ofc it's a known
> iteration count. It sounds like a lot of work for maybe a bit of saving?
> >>> Also for targets that for the main loop do not perform cost
> >>> comparison (like x86) but have lots of vector modes the previous
> >>> mode of operation really made sense (start at next_mode_i or
> >>> mode_i when unrolling).
> >> Are you hinting at maybe creating different paths here based on some target
> >> configurable thing? Could be something we ask vector_costs?
> > That would be an option, yes.  We could re-use the VECT_COMPARE_COSTS
> > bit from autovectorize_vector_modes, if we are not supposed to compare
> > costs then the old scheme makes sense.  We could of course also ask
> > the target for the first (auto-detect) mode to try for the epilogue,
> > telling it the first loops mode and VF (again if not comparing costs)
> > with a new target hook.
> >
> > But at this point lets try to fix the skipping heuristic and if that
> > fails just go back to the old iteration scheme, at least for the first
> > mode to try?  Thus, maybe set vector_modes[0] to that previously chosen
> > next mode and iterate from that.  Shouldn't matter for aarcht64 since
> > we'd compare costs of the other modes anyway.
> I like the idea of pushing the old 'next mode' to vector_modes[mode_i] idea,
> aarch64 will still analyze more than it needs to though. So so yeah the
> skipping might be preferable if we can get it to work.  The original patch
> seems to work fine on current testsuite for aarch64 and x86_64. Though, I am
> not confident of the 'max element size of USED_VECTOR_MODES' if we can't
> guarantee that the vector size of the input vector_mode for the main loop is
> an upper bound for the sizes in used_vector_modes...

The 'used_vector_modes' is also a heuristic by itself since it registers
every vector type we query, not only those that are used in the end ...

So it's really all heuristics that can eventually go bad.

IMHO remembering the VF that we ended up with (maybe w/o unrolling)
for each analyzed vector_mode[] might be really the easiest thing to do,
that should make it easy to skip those modes where the VF is larger
or equal as the VF of the main loop for the purpose of epilogue 
vectorization.  Likewise those vector_mode[] that failed analysis can
be remembered (with -1U VF for example).

Richard.
  
Andre Vieira (lists) Jan. 18, 2022, 6:14 p.m. UTC | #8
On 14/01/2022 09:57, Richard Biener wrote:
>
> The 'used_vector_modes' is also a heuristic by itself since it registers
> every vector type we query, not only those that are used in the end ...
>
> So it's really all heuristics that can eventually go bad.
>
> IMHO remembering the VF that we ended up with (maybe w/o unrolling)
> for each analyzed vector_mode[] might be really the easiest thing to do,
> that should make it easy to skip those modes where the VF is larger
> or equal as the VF of the main loop for the purpose of epilogue
> vectorization.  Likewise those vector_mode[] that failed analysis can
> be remembered (with -1U VF for example).
>
> Richard.

I liked the caching suggestion, so here it is. Sorry for the delay, 
wanted to post this after pushing the vect unroll which was waiting on 
some retesting for the rebase.

gcc/ChangeLog:

         PR 103997
         * tree-vect-loop.c (vect_analyze_loop): Fix mode skipping for 
epilogue
         vectorization.
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 0fe3529b2d1cf36617c04c1d0f1c4c7bb363607c..6d3dbb2187b4096a7db947b6343257a47cd9fb3d 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3004,6 +3004,12 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   unsigned int mode_i = 0;
   unsigned HOST_WIDE_INT simdlen = loop->simdlen;
 
+  /* Keep track of the VF for each mode.  Initialize all to 0 which indicates
+     a mode has not been analyzed.  */
+  auto_vec<poly_uint64, 8> cached_vf_per_mode;
+  for (unsigned i = 0; i < vector_modes.length (); ++i)
+    cached_vf_per_mode.safe_push (0);
+
   /* First determine the main loop vectorization mode, either the first
      one that works, starting with auto-detecting the vector mode and then
      following the targets order of preference, or the one with the
@@ -3011,6 +3017,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   while (1)
     {
       bool fatal;
+      unsigned int last_mode_i = mode_i;
+      /* Set cached VF to -1 prior to analysis, which indicates a mode has
+	 failed.  */
+      cached_vf_per_mode[last_mode_i] = -1;
       opt_loop_vec_info loop_vinfo
 	= vect_analyze_loop_1 (loop, shared, &loop_form_info,
 			       NULL, vector_modes, mode_i,
@@ -3020,6 +3030,12 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       if (loop_vinfo)
 	{
+	  /*  Analyzis has been successful so update the VF value.  The
+	      VF should always be a multiple of unroll_factor and we want to
+	      capture the original VF here.  */
+	  cached_vf_per_mode[last_mode_i]
+	    = exact_div (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			 loop_vinfo->suggested_unroll_factor);
 	  /* Once we hit the desired simdlen for the first time,
 	     discard any previous attempts.  */
 	  if (simdlen
@@ -3105,7 +3121,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	 would be at least as high as the main loop's and we would be
 	 vectorizing for more scalar iterations than there would be left.  */
       if (!supports_partial_vectors
-	  && maybe_ge (GET_MODE_NUNITS (vector_modes[mode_i]), first_vinfo_vf))
+	  && maybe_ge (cached_vf_per_mode[mode_i], first_vinfo_vf))
 	{
 	  mode_i++;
 	  if (mode_i == vector_modes.length ())
  
Richard Biener Jan. 19, 2022, 11:04 a.m. UTC | #9
On Tue, 18 Jan 2022, Andre Vieira (lists) wrote:

> 
> On 14/01/2022 09:57, Richard Biener wrote:
> >
> > The 'used_vector_modes' is also a heuristic by itself since it registers
> > every vector type we query, not only those that are used in the end ...
> >
> > So it's really all heuristics that can eventually go bad.
> >
> > IMHO remembering the VF that we ended up with (maybe w/o unrolling)
> > for each analyzed vector_mode[] might be really the easiest thing to do,
> > that should make it easy to skip those modes where the VF is larger
> > or equal as the VF of the main loop for the purpose of epilogue
> > vectorization.  Likewise those vector_mode[] that failed analysis can
> > be remembered (with -1U VF for example).
> >
> > Richard.
> 
> I liked the caching suggestion, so here it is. Sorry for the delay, wanted to
> post this after pushing the vect unroll which was waiting on some retesting
> for the rebase.

LGTM.

Thanks,
Richard.

> gcc/ChangeLog:
> 
>         PR 103997
>         * tree-vect-loop.c (vect_analyze_loop): Fix mode skipping for 
> epilogue
>         vectorization.
>
  
Andre Vieira (lists) Jan. 19, 2022, 2:15 p.m. UTC | #10
On 19/01/2022 11:04, Richard Biener wrote:
> On Tue, 18 Jan 2022, Andre Vieira (lists) wrote:
>
>> On 14/01/2022 09:57, Richard Biener wrote:
>>> The 'used_vector_modes' is also a heuristic by itself since it registers
>>> every vector type we query, not only those that are used in the end ...
>>>
>>> So it's really all heuristics that can eventually go bad.
>>>
>>> IMHO remembering the VF that we ended up with (maybe w/o unrolling)
>>> for each analyzed vector_mode[] might be really the easiest thing to do,
>>> that should make it easy to skip those modes where the VF is larger
>>> or equal as the VF of the main loop for the purpose of epilogue
>>> vectorization.  Likewise those vector_mode[] that failed analysis can
>>> be remembered (with -1U VF for example).
>>>
>>> Richard.
>> I liked the caching suggestion, so here it is. Sorry for the delay, wanted to
>> post this after pushing the vect unroll which was waiting on some retesting
>> for the rebase.
> LGTM.
>
> Thanks,
> Richard.
>
>> gcc/ChangeLog:
>>
>>          PR 103997
>>          * tree-vect-loop.c (vect_analyze_loop): Fix mode skipping for
>> epilogue
>>          vectorization.


Thanks! Committed the following patch. I updated the comment above the 
last change as I realized it still described the old behaviour.
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 0fe3529b2d1cf36617c04c1d0f1c4c7bb363607c..e15738ee6c4a1d2cf6bfa4c291a4dc46faaaf7a5 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3004,6 +3004,12 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   unsigned int mode_i = 0;
   unsigned HOST_WIDE_INT simdlen = loop->simdlen;
 
+  /* Keep track of the VF for each mode.  Initialize all to 0 which indicates
+     a mode has not been analyzed.  */
+  auto_vec<poly_uint64, 8> cached_vf_per_mode;
+  for (unsigned i = 0; i < vector_modes.length (); ++i)
+    cached_vf_per_mode.safe_push (0);
+
   /* First determine the main loop vectorization mode, either the first
      one that works, starting with auto-detecting the vector mode and then
      following the targets order of preference, or the one with the
@@ -3011,6 +3017,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
   while (1)
     {
       bool fatal;
+      unsigned int last_mode_i = mode_i;
+      /* Set cached VF to -1 prior to analysis, which indicates a mode has
+	 failed.  */
+      cached_vf_per_mode[last_mode_i] = -1;
       opt_loop_vec_info loop_vinfo
 	= vect_analyze_loop_1 (loop, shared, &loop_form_info,
 			       NULL, vector_modes, mode_i,
@@ -3020,6 +3030,12 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       if (loop_vinfo)
 	{
+	  /*  Analyzis has been successful so update the VF value.  The
+	      VF should always be a multiple of unroll_factor and we want to
+	      capture the original VF here.  */
+	  cached_vf_per_mode[last_mode_i]
+	    = exact_div (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			 loop_vinfo->suggested_unroll_factor);
 	  /* Once we hit the desired simdlen for the first time,
 	     discard any previous attempts.  */
 	  if (simdlen
@@ -3100,12 +3116,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
     {
       /* If the target does not support partial vectors we can shorten the
 	 number of modes to analyze for the epilogue as we know we can't pick a
-	 mode that has at least as many NUNITS as the main loop's vectorization
-	 factor, since that would imply the epilogue's vectorization factor
-	 would be at least as high as the main loop's and we would be
-	 vectorizing for more scalar iterations than there would be left.  */
+	 mode that would lead to a VF at least as big as the
+	 FIRST_VINFO_VF.  */
       if (!supports_partial_vectors
-	  && maybe_ge (GET_MODE_NUNITS (vector_modes[mode_i]), first_vinfo_vf))
+	  && maybe_ge (cached_vf_per_mode[mode_i], first_vinfo_vf))
 	{
 	  mode_i++;
 	  if (mode_i == vector_modes.length ())
  

Patch

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index ba67de490bbd033b6db6217c8f9f9ca04cec323b..87b5ec5b4c6cb40e922b1e04bb7777ce74233af8 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3038,12 +3038,37 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	 would be at least as high as the main loop's and we would be
 	 vectorizing for more scalar iterations than there would be left.  */
       if (!supports_partial_vectors
-	  && maybe_ge (GET_MODE_NUNITS (vector_modes[mode_i]), first_vinfo_vf))
-	{
-	  mode_i++;
-	  if (mode_i == vector_modes.length ())
-	    break;
-	  continue;
+	  && VECTOR_MODE_P (vector_modes[mode_i]))
+	{
+	  /* To make sure we are conservative as to what modes we skip, we
+	     should use check the smallest possible NUNITS which would be
+	     derived from the mode in USED_VECTOR_MODES with the largest
+	     element size.  */
+	  scalar_mode max_elsize_mode = GET_MODE_INNER (vector_modes[mode_i]);
+	  for (vec_info::mode_set::iterator i =
+		first_loop_vinfo->used_vector_modes.begin ();
+	      i != first_loop_vinfo->used_vector_modes.end (); ++i)
+	    {
+	      if (VECTOR_MODE_P (*i)
+		  && GET_MODE_SIZE (GET_MODE_INNER (*i))
+		  > GET_MODE_SIZE (max_elsize_mode))
+		max_elsize_mode = GET_MODE_INNER (*i);
+	    }
+	  /* After finding the largest element size used in the main loop, find
+	     the related vector mode with the same size as the mode
+	     corresponding to the current MODE_I.  */
+	  machine_mode max_elsize_vector_mode =
+	    related_vector_mode (vector_modes[mode_i], max_elsize_mode,
+				 0).else_void ();
+	  if (VECTOR_MODE_P (max_elsize_vector_mode)
+	      && maybe_ge (GET_MODE_NUNITS (max_elsize_vector_mode),
+			   first_vinfo_vf))
+	    {
+	      mode_i++;
+	      if (mode_i == vector_modes.length ())
+	      break;
+	      continue;
+	    }
 	}
 
       if (dump_enabled_p ())