rs6000: Modify the way for extra penalized cost

Message ID d26e0a72-a029-c765-75ab-9b31de44f114@linux.ibm.com
State New
Headers
Series rs6000: Modify the way for extra penalized cost |

Commit Message

Kewen.Lin Sept. 16, 2021, 1:14 a.m. UTC
  Hi,

This patch follows the discussion here[1], where Segher pointed
out the existing way to guard the extra penalized cost for
strided/elementwise loads with a magic bound doesn't scale.

The way with nunits * stmt_cost can get one much exaggerated
penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
that's why we need one bound.  To make it scale, this patch
doesn't use nunits * stmt_cost any more, but it still keeps
nunits since there are actually nunits scalar loads there.  So
it uses one cost adjusted from stmt_cost, since the current
stmt_cost sort of considers nunits, we can stablize the cost
for big nunits and retain the cost for small nunits.  After
some tries, this patch gets the adjusted cost as:

    stmt_cost / (log2(nunits) * log2(nunits))

For V16QI, the adjusted cost would be 1 and total penalized
cost is 16, it isn't exaggerated.  For V2DI, the adjusted
cost would be 2 and total penalized cost is 4, which is the
same as before.  btw, I tried to use one single log2(nunits),
but the penalized cost is still big enough and can't fix the
degraded bmk blender_r.

The separated SPEC2017 evaluations on Power8, Power9 and Power10
at option sets O2-vect and Ofast-unroll showed this change is
neutral (that is same effect as before).

Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
	the way to compute extra penalized cost.

---
 gcc/config/rs6000/rs6000.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

--
2.25.1
  

Comments

Li, Pan2 via Gcc-patches Sept. 17, 2021, 4:34 p.m. UTC | #1
Hi Kewen,

On 9/15/21 8:14 PM, Kewen.Lin wrote:
> Hi,
>
> This patch follows the discussion here[1], where Segher pointed
> out the existing way to guard the extra penalized cost for
> strided/elementwise loads with a magic bound doesn't scale.
>
> The way with nunits * stmt_cost can get one much exaggerated
> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
> that's why we need one bound.  To make it scale, this patch
> doesn't use nunits * stmt_cost any more, but it still keeps
> nunits since there are actually nunits scalar loads there.  So
> it uses one cost adjusted from stmt_cost, since the current
> stmt_cost sort of considers nunits, we can stablize the cost
> for big nunits and retain the cost for small nunits.  After
> some tries, this patch gets the adjusted cost as:
>
>      stmt_cost / (log2(nunits) * log2(nunits))
>
> For V16QI, the adjusted cost would be 1 and total penalized
> cost is 16, it isn't exaggerated.  For V2DI, the adjusted
> cost would be 2 and total penalized cost is 4, which is the
> same as before.  btw, I tried to use one single log2(nunits),
> but the penalized cost is still big enough and can't fix the
> degraded bmk blender_r.
>
> The separated SPEC2017 evaluations on Power8, Power9 and Power10
> at option sets O2-vect and Ofast-unroll showed this change is
> neutral (that is same effect as before).
>
> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>
> Is it ok for trunk?
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
> 	the way to compute extra penalized cost.
>
> ---
>   gcc/config/rs6000/rs6000.c | 28 +++++++++++++++++-----------
>   1 file changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 4ab23b0ab33..e08b94c0447 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5454,17 +5454,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>   	{
>   	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>   	  unsigned int nunits = vect_nunits_for_cost (vectype);
> -	  unsigned int extra_cost = nunits * stmt_cost;
> -	  /* As function rs6000_builtin_vectorization_cost shows, we have
> -	     priced much on V16QI/V8HI vector construction as their units,
> -	     if we penalize them with nunits * stmt_cost, it can result in
> -	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
> -	     is 20 and nunits is 16, the extra cost is 320 which looks
> -	     much exaggerated.  So let's use one maximum bound for the
> -	     extra penalized cost for vector construction here.  */
> -	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> -	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> -	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> +	  /* As function rs6000_builtin_vectorization_cost shows, we
> +	     have priced much on V16QI/V8HI vector construction by
> +	     considering their units, if we penalize them with nunits
> +	     * stmt_cost here, it can result in an unreliable body cost,

This might be confusing to the reader, since you have deleted the 
calculation of nunits * stmt_cost.  Could you instead write this to 
indicate that we used to adjust in this way, and it had this particular 
downside, so that's why you're choosing this heuristic? It's a minor 
thing but I think people reading the code will be confused otherwise.

I think the heuristic is generally reasonable, and certainly better than 
what we had before!

LGTM with adjusted commentary, so recommend maintainers approve.

Thanks for the patch!
Bill
> +	     eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
> +	     the penalty will be 320 which looks much exaggerated.  But
> +	     there are actually nunits scalar loads, so we try to adopt
> +	     one reasonable penalized cost for each load rather than
> +	     stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
> +	     we can still retain the necessary penalty for small nunits
> +	     meanwhile stabilize the penalty for big nunits.  */
> +	  int nunits_log2 = exact_log2 (nunits);
> +	  gcc_assert (nunits_log2 > 0);
> +	  unsigned int nunits_sq = nunits_log2 * nunits_log2;
> +	  unsigned int adjusted_cost = stmt_cost / nunits_sq;
> +	  gcc_assert (adjusted_cost > 0);
> +	  unsigned int extra_cost = nunits * adjusted_cost;
>   	  data->extra_ctor_cost += extra_cost;
>   	}
>       }
> --
> 2.25.1
  
Segher Boessenkool Sept. 17, 2021, 10:01 p.m. UTC | #2
Hi!

On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
> The way with nunits * stmt_cost can get one much exaggerated
> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
> that's why we need one bound.  To make it scale, this patch
> doesn't use nunits * stmt_cost any more, but it still keeps
> nunits since there are actually nunits scalar loads there.  So
> it uses one cost adjusted from stmt_cost, since the current
> stmt_cost sort of considers nunits, we can stablize the cost
> for big nunits and retain the cost for small nunits.  After
> some tries, this patch gets the adjusted cost as:
> 
>     stmt_cost / (log2(nunits) * log2(nunits))

So for  V16QI it gives *16/(4*4) so *1
        V8HI  it gives *8/(3*3)  so *8/9
        V4SI  it gives *4/(2*2)  so *1
        V2DI  it gives *2/(1*1)  so *2
and for V1TI  it gives *1/(0*0) which is UB (no, does not crash for us,
just gives wildly wrong answers; the div returns 0 on recent systems).

> For V16QI, the adjusted cost would be 1 and total penalized
> cost is 16, it isn't exaggerated.  For V2DI, the adjusted
> cost would be 2 and total penalized cost is 4, which is the
> same as before.  btw, I tried to use one single log2(nunits),
> but the penalized cost is still big enough and can't fix the
> degraded bmk blender_r.

Does it make sense to treat V2DI (and V2DF) as twice more expensive than
other vectors, which are all pretty much equal cost (except those that
end up with cost 0)?  If so, there are simpler ways to do that.

> +	  int nunits_log2 = exact_log2 (nunits);
> +	  gcc_assert (nunits_log2 > 0);
> +	  unsigned int nunits_sq = nunits_log2 * nunits_log2;

>= 0

This of course is assuming nunits will always be a power of 2, but I'm
sure that we have many other places in the compiler assuming that
already, so that is fine.  And if one day this stops being true we will
get a nice ICE, pretty much the best we could hope for.

> +	  unsigned int adjusted_cost = stmt_cost / nunits_sq;

But this can divide by 0.  Or are we somehow guaranteed that nunits
will never be 1?  Yes the log2 check above, sure, but that ICEs if this
is violated; is there anything that actually guarantees it is true?

> +	  gcc_assert (adjusted_cost > 0);

I don't see how you guarantee this, either.


A magic crazy formula like this is no good.  If you want to make the
cost of everything but V2D* be the same, and that of V2D* be twice that,
that is a weird heuristic, but we can live with that perhaps.  But that
beats completely unexplained (and unexplainable) magic!

Sorry.


Segher
  
Kewen.Lin Sept. 21, 2021, 2:20 a.m. UTC | #3
Hi Bill,

Thanks for the review!

on 2021/9/18 上午12:34, Bill Schmidt wrote:
> Hi Kewen,
> 
> On 9/15/21 8:14 PM, Kewen.Lin wrote:
>> Hi,
>>
>> This patch follows the discussion here[1], where Segher pointed
>> out the existing way to guard the extra penalized cost for
>> strided/elementwise loads with a magic bound doesn't scale.
>>
>> The way with nunits * stmt_cost can get one much exaggerated
>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
>> that's why we need one bound.  To make it scale, this patch
>> doesn't use nunits * stmt_cost any more, but it still keeps
>> nunits since there are actually nunits scalar loads there.  So
>> it uses one cost adjusted from stmt_cost, since the current
>> stmt_cost sort of considers nunits, we can stablize the cost
>> for big nunits and retain the cost for small nunits.  After
>> some tries, this patch gets the adjusted cost as:
>>
>>      stmt_cost / (log2(nunits) * log2(nunits))
>>
>> For V16QI, the adjusted cost would be 1 and total penalized
>> cost is 16, it isn't exaggerated.  For V2DI, the adjusted
>> cost would be 2 and total penalized cost is 4, which is the
>> same as before.  btw, I tried to use one single log2(nunits),
>> but the penalized cost is still big enough and can't fix the
>> degraded bmk blender_r.
>>
>> The separated SPEC2017 evaluations on Power8, Power9 and Power10
>> at option sets O2-vect and Ofast-unroll showed this change is
>> neutral (that is same effect as before).
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>>     * config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>     the way to compute extra penalized cost.
>>
>> ---
>>   gcc/config/rs6000/rs6000.c | 28 +++++++++++++++++-----------
>>   1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 4ab23b0ab33..e08b94c0447 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5454,17 +5454,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>       {
>>         tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>         unsigned int nunits = vect_nunits_for_cost (vectype);
>> -      unsigned int extra_cost = nunits * stmt_cost;
>> -      /* As function rs6000_builtin_vectorization_cost shows, we have
>> -         priced much on V16QI/V8HI vector construction as their units,
>> -         if we penalize them with nunits * stmt_cost, it can result in
>> -         an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> -         is 20 and nunits is 16, the extra cost is 320 which looks
>> -         much exaggerated.  So let's use one maximum bound for the
>> -         extra penalized cost for vector construction here.  */
>> -      const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> -      if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> -        extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +      /* As function rs6000_builtin_vectorization_cost shows, we
>> +         have priced much on V16QI/V8HI vector construction by
>> +         considering their units, if we penalize them with nunits
>> +         * stmt_cost here, it can result in an unreliable body cost,
> 
> This might be confusing to the reader, since you have deleted the calculation of nunits * stmt_cost.  Could you instead write this to indicate that we used to adjust in this way, and it had this particular downside, so that's why you're choosing this heuristic? It's a minor thing but I think people reading the code will be confused otherwise.
> 

Good point!  I'll update the commentary to explain it, thanks!!

BR,
Kewen 

> I think the heuristic is generally reasonable, and certainly better than what we had before!
> 
> LGTM with adjusted commentary, so recommend maintainers approve.
> 
> Thanks for the patch!
> Bill
>> +         eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
>> +         the penalty will be 320 which looks much exaggerated.  But
>> +         there are actually nunits scalar loads, so we try to adopt
>> +         one reasonable penalized cost for each load rather than
>> +         stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
>> +         we can still retain the necessary penalty for small nunits
>> +         meanwhile stabilize the penalty for big nunits.  */
>> +      int nunits_log2 = exact_log2 (nunits);
>> +      gcc_assert (nunits_log2 > 0);
>> +      unsigned int nunits_sq = nunits_log2 * nunits_log2;
>> +      unsigned int adjusted_cost = stmt_cost / nunits_sq;
>> +      gcc_assert (adjusted_cost > 0);
>> +      unsigned int extra_cost = nunits * adjusted_cost;
>>         data->extra_ctor_cost += extra_cost;
>>       }
>>       }
>> -- 
>> 2.25.1
>
  
Kewen.Lin Sept. 21, 2021, 3:24 a.m. UTC | #4
Hi Segher,

Thanks for the review!

on 2021/9/18 上午6:01, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
>> The way with nunits * stmt_cost can get one much exaggerated
>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
>> that's why we need one bound.  To make it scale, this patch
>> doesn't use nunits * stmt_cost any more, but it still keeps
>> nunits since there are actually nunits scalar loads there.  So
>> it uses one cost adjusted from stmt_cost, since the current
>> stmt_cost sort of considers nunits, we can stablize the cost
>> for big nunits and retain the cost for small nunits.  After
>> some tries, this patch gets the adjusted cost as:
>>
>>     stmt_cost / (log2(nunits) * log2(nunits))
> 
> So for  V16QI it gives *16/(4*4) so *1
>         V8HI  it gives *8/(3*3)  so *8/9
>         V4SI  it gives *4/(2*2)  so *1
>         V2DI  it gives *2/(1*1)  so *2
> and for V1TI  it gives *1/(0*0) which is UB (no, does not crash for us,
> just gives wildly wrong answers; the div returns 0 on recent systems).
> 

I don't expected we will have V1TI for strided/elementwise load,
if it's one unit vector, it's the whole vector itself.
Besides, the below assertion should exclude it already.

>> For V16QI, the adjusted cost would be 1 and total penalized
>> cost is 16, it isn't exaggerated.  For V2DI, the adjusted
>> cost would be 2 and total penalized cost is 4, which is the
>> same as before.  btw, I tried to use one single log2(nunits),
>> but the penalized cost is still big enough and can't fix the
>> degraded bmk blender_r.
> 
> Does it make sense to treat V2DI (and V2DF) as twice more expensive than
> other vectors, which are all pretty much equal cost (except those that
> end up with cost 0)?  If so, there are simpler ways to do that.
> 

Yeah, from the SPEC2017 evaluation, it's good with this.  The costing
framework of vectorization doesn't consider the dependent insn chain
and available #unit etc. like local scheduling (it can't either), so
we have to use some heuristics to handle some special cases.  For more
units vector construction, the used instructions are more.  It has more
chances to schedule them better (even run in parallelly when enough
available units at the time), so we don't need to penalize more for them.
For V2DI, the load result is fed into construction directly, the current
stmt_cost is to consider merging and only 2, penalizing it with one is
not enough from the bwaves experiment.

>> +	  int nunits_log2 = exact_log2 (nunits);
>> +	  gcc_assert (nunits_log2 > 0);
>> +	  unsigned int nunits_sq = nunits_log2 * nunits_log2;
> 
>> = 0
> 
> This of course is assuming nunits will always be a power of 2, but I'm
> sure that we have many other places in the compiler assuming that
> already, so that is fine.  And if one day this stops being true we will
> get a nice ICE, pretty much the best we could hope for.
> 

Yeah, exact_log2 returns -1 for non power of 2 input, for example:

input     output
0     ->    -1
1     ->    0
2     ->    1
3     ->    -1

>> +	  unsigned int adjusted_cost = stmt_cost / nunits_sq;
> 
> But this can divide by 0.  Or are we somehow guaranteed that nunits
> will never be 1?  Yes the log2 check above, sure, but that ICEs if this
> is violated; is there anything that actually guarantees it is true?
> 

As I mentioned above, I don't expect we can have nunits 1 strided/ew load,
and the ICE should check this and ensure dividing by zero never happens.  :)

>> +	  gcc_assert (adjusted_cost > 0);
> 
> I don't see how you guarantee this, either.
> 

It's mainly to prevent that one day we tweak the cost for construction
in rs6000_builtin_vectorization_cost then make some unexpected values
generated here.  But now these expected values are guaranteed as the
current costs and the formula.

> 
> A magic crazy formula like this is no good.  If you want to make the
> cost of everything but V2D* be the same, and that of V2D* be twice that,
> that is a weird heuristic, but we can live with that perhaps.  But that
> beats completely unexplained (and unexplainable) magic!
> 
> Sorry.
> 

That's all right, thanks for the comments!  let's improve it.  :)
How about just assigning 2 for V2DI and 1 for the others for the
penalized_cost_per_load with some detailed commentary, it should have
the same effect with this "magic crazy formula", but I guess it can
be more clear.

BR,
Kewen
  
Segher Boessenkool Sept. 22, 2021, 10:36 p.m. UTC | #5
Hi!

On Tue, Sep 21, 2021 at 11:24:08AM +0800, Kewen.Lin wrote:
> on 2021/9/18 上午6:01, Segher Boessenkool wrote:
> > On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
> >> The way with nunits * stmt_cost can get one much exaggerated
> >> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
> >> that's why we need one bound.  To make it scale, this patch
> >> doesn't use nunits * stmt_cost any more, but it still keeps
> >> nunits since there are actually nunits scalar loads there.  So
> >> it uses one cost adjusted from stmt_cost, since the current
> >> stmt_cost sort of considers nunits, we can stablize the cost
> >> for big nunits and retain the cost for small nunits.  After
> >> some tries, this patch gets the adjusted cost as:
> >>
> >>     stmt_cost / (log2(nunits) * log2(nunits))
> > 
> > So for  V16QI it gives *16/(4*4) so *1
> >         V8HI  it gives *8/(3*3)  so *8/9
> >         V4SI  it gives *4/(2*2)  so *1
> >         V2DI  it gives *2/(1*1)  so *2
> > and for V1TI  it gives *1/(0*0) which is UB (no, does not crash for us,
> > just gives wildly wrong answers; the div returns 0 on recent systems).
> 
> I don't expected we will have V1TI for strided/elementwise load,
> if it's one unit vector, it's the whole vector itself.
> Besides, the below assertion should exclude it already.

Yes.  But ignoring the UB for unexpectedly large vector components, the
1 / 1.111 / 1 / 2  scoring does not make much sense.  The formulas
"look" smooth and even sort of reasonable, but as soon as you look at
what it *means*, and realise the domain if the function is discrete
(only four or five possible inputs), and then see how the function
behaves on that...  Hrm :-)

> > This of course is assuming nunits will always be a power of 2, but I'm
> > sure that we have many other places in the compiler assuming that
> > already, so that is fine.  And if one day this stops being true we will
> > get a nice ICE, pretty much the best we could hope for.
> 
> Yeah, exact_log2 returns -1 for non power of 2 input, for example:

Exactly.

> >> +	  unsigned int adjusted_cost = stmt_cost / nunits_sq;
> > 
> > But this can divide by 0.  Or are we somehow guaranteed that nunits
> > will never be 1?  Yes the log2 check above, sure, but that ICEs if this
> > is violated; is there anything that actually guarantees it is true?
> 
> As I mentioned above, I don't expect we can have nunits 1 strided/ew load,
> and the ICE should check this and ensure dividing by zero never happens.  :)

Can you assert that *directly* then please?

> > A magic crazy formula like this is no good.  If you want to make the
> > cost of everything but V2D* be the same, and that of V2D* be twice that,
> > that is a weird heuristic, but we can live with that perhaps.  But that
> > beats completely unexplained (and unexplainable) magic!
> > 
> > Sorry.
> 
> That's all right, thanks for the comments!  let's improve it.  :)

I like that spirit :-)

> How about just assigning 2 for V2DI and 1 for the others for the
> penalized_cost_per_load with some detailed commentary, it should have
> the same effect with this "magic crazy formula", but I guess it can
> be more clear.

That is fine yes!  (Well, V2DF the same I guess?  Or you'll need very
detailed commentary :-) )

It is fine to say "this is just a heuristic without much supporting
theory" in places.  That is what most of our --param= are as well, for
example.  If counting two-element vectors as twice as expensive as all
other vectors helps performance, then so be it: if there is no better
way to cost things (or we do not know one), then what else are we to do?


Segher
  
Kewen.Lin Sept. 28, 2021, 8:39 a.m. UTC | #6
Hi Segher,

on 2021/9/23 上午6:36, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Sep 21, 2021 at 11:24:08AM +0800, Kewen.Lin wrote:
>> on 2021/9/18 上午6:01, Segher Boessenkool wrote:
>>> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
>>>> The way with nunits * stmt_cost can get one much exaggerated
>>>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
>>>> that's why we need one bound.  To make it scale, this patch
>>>> doesn't use nunits * stmt_cost any more, but it still keeps
>>>> nunits since there are actually nunits scalar loads there.  So
>>>> it uses one cost adjusted from stmt_cost, since the current
>>>> stmt_cost sort of considers nunits, we can stablize the cost
>>>> for big nunits and retain the cost for small nunits.  After
>>>> some tries, this patch gets the adjusted cost as:
>>>>
>>>>     stmt_cost / (log2(nunits) * log2(nunits))
>>>
>>> So for  V16QI it gives *16/(4*4) so *1
>>>         V8HI  it gives *8/(3*3)  so *8/9
>>>         V4SI  it gives *4/(2*2)  so *1
>>>         V2DI  it gives *2/(1*1)  so *2
>>> and for V1TI  it gives *1/(0*0) which is UB (no, does not crash for us,
>>> just gives wildly wrong answers; the div returns 0 on recent systems).
>>
>> I don't expected we will have V1TI for strided/elementwise load,
>> if it's one unit vector, it's the whole vector itself.
>> Besides, the below assertion should exclude it already.
> 
> Yes.  But ignoring the UB for unexpectedly large vector components, the
> 1 / 1.111 / 1 / 2  scoring does not make much sense.  The formulas
> "look" smooth and even sort of reasonable, but as soon as you look at
> what it *means*, and realise the domain if the function is discrete
> (only four or five possible inputs), and then see how the function
> behaves on that...  Hrm :-)
> 
>>> This of course is assuming nunits will always be a power of 2, but I'm
>>> sure that we have many other places in the compiler assuming that
>>> already, so that is fine.  And if one day this stops being true we will
>>> get a nice ICE, pretty much the best we could hope for.
>>
>> Yeah, exact_log2 returns -1 for non power of 2 input, for example:
> 
> Exactly.
> 
>>>> +	  unsigned int adjusted_cost = stmt_cost / nunits_sq;
>>>
>>> But this can divide by 0.  Or are we somehow guaranteed that nunits
>>> will never be 1?  Yes the log2 check above, sure, but that ICEs if this
>>> is violated; is there anything that actually guarantees it is true?
>>
>> As I mentioned above, I don't expect we can have nunits 1 strided/ew load,
>> and the ICE should check this and ensure dividing by zero never happens.  :)
> 
> Can you assert that *directly* then please?
> 

Fix in v2.

>>> A magic crazy formula like this is no good.  If you want to make the
>>> cost of everything but V2D* be the same, and that of V2D* be twice that,
>>> that is a weird heuristic, but we can live with that perhaps.  But that
>>> beats completely unexplained (and unexplainable) magic!
>>>
>>> Sorry.
>>
>> That's all right, thanks for the comments!  let's improve it.  :)
> 
> I like that spirit :-)
> 
>> How about just assigning 2 for V2DI and 1 for the others for the
>> penalized_cost_per_load with some detailed commentary, it should have
>> the same effect with this "magic crazy formula", but I guess it can
>> be more clear.
> 
> That is fine yes!  (Well, V2DF the same I guess?  Or you'll need very
> detailed commentary :-) )
> 
> It is fine to say "this is just a heuristic without much supporting
> theory" in places.  That is what most of our --param= are as well, for
> example.  If counting two-element vectors as twice as expensive as all
> other vectors helps performance, then so be it: if there is no better
> way to cost things (or we do not know one), then what else are we to do?
> 
> 

Thanks a lot for the suggestion, I just posted v2:

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen
  

Patch

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 4ab23b0ab33..e08b94c0447 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5454,17 +5454,23 @@  rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
 	{
 	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
 	  unsigned int nunits = vect_nunits_for_cost (vectype);
-	  unsigned int extra_cost = nunits * stmt_cost;
-	  /* As function rs6000_builtin_vectorization_cost shows, we have
-	     priced much on V16QI/V8HI vector construction as their units,
-	     if we penalize them with nunits * stmt_cost, it can result in
-	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
-	     is 20 and nunits is 16, the extra cost is 320 which looks
-	     much exaggerated.  So let's use one maximum bound for the
-	     extra penalized cost for vector construction here.  */
-	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
-	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
-	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+	  /* As function rs6000_builtin_vectorization_cost shows, we
+	     have priced much on V16QI/V8HI vector construction by
+	     considering their units, if we penalize them with nunits
+	     * stmt_cost here, it can result in an unreliable body cost,
+	     eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
+	     the penalty will be 320 which looks much exaggerated.  But
+	     there are actually nunits scalar loads, so we try to adopt
+	     one reasonable penalized cost for each load rather than
+	     stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
+	     we can still retain the necessary penalty for small nunits
+	     meanwhile stabilize the penalty for big nunits.  */
+	  int nunits_log2 = exact_log2 (nunits);
+	  gcc_assert (nunits_log2 > 0);
+	  unsigned int nunits_sq = nunits_log2 * nunits_log2;
+	  unsigned int adjusted_cost = stmt_cost / nunits_sq;
+	  gcc_assert (adjusted_cost > 0);
+	  unsigned int extra_cost = nunits * adjusted_cost;
 	  data->extra_ctor_cost += extra_cost;
 	}
     }