Message ID  d26e0a72a029c76575ab9b31de44f114@linux.ibm.com 

State  New 
Headers
From: "Kewen.Lin" <linkw@linux.ibm.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Bill Schmidt <wschmidt@linux.ibm.com>, David Edelsohn <dje.gcc@gmail.com>, Segher Boessenkool <segher@kernel.crashing.org>
Subject: [PATCH] rs6000: Modify the way for extra penalized cost
Date: Thu, 16 Sep 2021 09:14:15 +0800
Message-ID: <d26e0a72a029c76575ab9b31de44f114@linux.ibm.com>
Series 
rs6000: Modify the way for extra penalized cost


Commit Message
Kewen.Lin
Sept. 16, 2021, 1:14 a.m. UTC
Hi,

This patch follows the discussion here[1], where Segher pointed out that the existing way to guard the extra penalized cost for strided/elementwise loads with a magic bound doesn't scale.

The way with nunits * stmt_cost can get one much exaggerated penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320, that's why we need one bound.  To make it scale, this patch doesn't use nunits * stmt_cost any more, but it still keeps nunits since there are actually nunits scalar loads there.  So it uses one cost adjusted from stmt_cost; since the current stmt_cost sort of considers nunits, we can stabilize the cost for big nunits and retain the cost for small nunits.  After some tries, this patch gets the adjusted cost as:

    stmt_cost / (log2(nunits) * log2(nunits))

For V16QI, the adjusted cost would be 1 and the total penalized cost is 16, it isn't exaggerated.  For V2DI, the adjusted cost would be 2 and the total penalized cost is 4, which is the same as before.  Btw, I tried to use one single log2(nunits), but the penalized cost is still big enough and can't fix the degraded bmk blender_r.

The separated SPEC2017 evaluations on Power8, Power9 and Power10 at option sets O2-vect and Ofast-unroll showed this change is neutral (that is, the same effect as before).

Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
	the way to compute extra penalized cost.
---
 gcc/config/rs6000/rs6000.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)
-- 
2.25.1
Comments
Hi Kewen,

On 9/15/21 8:14 PM, Kewen.Lin wrote:
> Hi,
>
> This patch follows the discussion here[1], where Segher pointed out the existing way to guard the extra penalized cost for strided/elementwise loads with a magic bound doesn't scale.
>
> The way with nunits * stmt_cost can get one much exaggerated penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320, that's why we need one bound.  To make it scale, this patch doesn't use nunits * stmt_cost any more, but it still keeps nunits since there are actually nunits scalar loads there.  So it uses one cost adjusted from stmt_cost; since the current stmt_cost sort of considers nunits, we can stabilize the cost for big nunits and retain the cost for small nunits.  After some tries, this patch gets the adjusted cost as:
>
> stmt_cost / (log2(nunits) * log2(nunits))
>
> For V16QI, the adjusted cost would be 1 and the total penalized cost is 16, it isn't exaggerated.  For V2DI, the adjusted cost would be 2 and the total penalized cost is 4, which is the same as before.  Btw, I tried to use one single log2(nunits), but the penalized cost is still big enough and can't fix the degraded bmk blender_r.
>
> The separated SPEC2017 evaluations on Power8, Power9 and Power10 at option sets O2-vect and Ofast-unroll showed this change is neutral (that is, the same effect as before).
>
> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>
> Is it ok for trunk?
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
> 	the way to compute extra penalized cost.
>
> ---
>  gcc/config/rs6000/rs6000.c | 28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 4ab23b0ab33..e08b94c0447 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5454,17 +5454,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>      {
>        tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>        unsigned int nunits = vect_nunits_for_cost (vectype);
> -      unsigned int extra_cost = nunits * stmt_cost;
> -      /* As function rs6000_builtin_vectorization_cost shows, we have
> -	 priced much on V16QI/V8HI vector construction as their units,
> -	 if we penalize them with nunits * stmt_cost, it can result in
> -	 an unreliable body cost, eg: for V16QI on Power8, stmt_cost
> -	 is 20 and nunits is 16, the extra cost is 320 which looks
> -	 much exaggerated.  So let's use one maximum bound for the
> -	 extra penalized cost for vector construction here.  */
> -      const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> -      if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> -	extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> +      /* As function rs6000_builtin_vectorization_cost shows, we
> +	 have priced much on V16QI/V8HI vector construction by
> +	 considering their units, if we penalize them with nunits
> +	 * stmt_cost here, it can result in an unreliable body cost,

This might be confusing to the reader, since you have deleted the calculation of nunits * stmt_cost.  Could you instead write this to indicate that we used to adjust in this way, and it had this particular downside, so that's why you're choosing this heuristic?  It's a minor thing but I think people reading the code will be confused otherwise.

I think the heuristic is generally reasonable, and certainly better than what we had before!

LGTM with adjusted commentary, so recommend maintainers approve.

Thanks for the patch!
Bill

> +	 eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
> +	 the penalty will be 320 which looks much exaggerated.  But
> +	 there are actually nunits scalar loads, so we try to adopt
> +	 one reasonable penalized cost for each load rather than
> +	 stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
> +	 we can still retain the necessary penalty for small nunits
> +	 meanwhile stabilize the penalty for big nunits.  */
> +      int nunits_log2 = exact_log2 (nunits);
> +      gcc_assert (nunits_log2 > 0);
> +      unsigned int nunits_sq = nunits_log2 * nunits_log2;
> +      unsigned int adjusted_cost = stmt_cost / nunits_sq;
> +      gcc_assert (adjusted_cost > 0);
> +      unsigned int extra_cost = nunits * adjusted_cost;
>        data->extra_ctor_cost += extra_cost;
>      }
>  }
> --
> 2.25.1
Hi!

On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
> The way with nunits * stmt_cost can get one much exaggerated penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320, that's why we need one bound.  To make it scale, this patch doesn't use nunits * stmt_cost any more, but it still keeps nunits since there are actually nunits scalar loads there.  So it uses one cost adjusted from stmt_cost; since the current stmt_cost sort of considers nunits, we can stabilize the cost for big nunits and retain the cost for small nunits.  After some tries, this patch gets the adjusted cost as:
>
> stmt_cost / (log2(nunits) * log2(nunits))

So for V16QI it gives *16/(4*4), so *1;
   V8HI  it gives *8/(3*3),  so *8/9;
   V4SI  it gives *4/(2*2),  so *1;
   V2DI  it gives *2/(1*1),  so *2;
and for V1TI it gives *1/(0*0), which is UB (no, it does not crash for us, it just gives wildly wrong answers; the div returns 0 on recent systems).

> For V16QI, the adjusted cost would be 1 and the total penalized cost is 16, it isn't exaggerated.  For V2DI, the adjusted cost would be 2 and the total penalized cost is 4, which is the same as before.  Btw, I tried to use one single log2(nunits), but the penalized cost is still big enough and can't fix the degraded bmk blender_r.

Does it make sense to treat V2DI (and V2DF) as twice more expensive than other vectors, which are all pretty much equal cost (except those that end up with cost 0)?  If so, there are simpler ways to do that.

> +      int nunits_log2 = exact_log2 (nunits);
> +      gcc_assert (nunits_log2 > 0);
> +      unsigned int nunits_sq = nunits_log2 * nunits_log2;

>= 0

This of course is assuming nunits will always be a power of 2, but I'm sure that we have many other places in the compiler assuming that already, so that is fine.  And if one day this stops being true we will get a nice ICE, pretty much the best we could hope for.

> +      unsigned int adjusted_cost = stmt_cost / nunits_sq;

But this can divide by 0.  Or are we somehow guaranteed that nunits will never be 1?  Yes the log2 check above, sure, but that ICEs if this is violated; is there anything that actually guarantees it is true?

> +      gcc_assert (adjusted_cost > 0);

I don't see how you guarantee this, either.

A magic crazy formula like this is no good.  If you want to make the cost of everything but V2D* be the same, and that of V2D* be twice that, that is a weird heuristic, but we can live with that perhaps.  But that beats completely unexplained (and unexplainable) magic!

Sorry.


Segher
Hi Bill,

Thanks for the review!

on 2021/9/18 12:34 AM, Bill Schmidt wrote:
> Hi Kewen,
>
> On 9/15/21 8:14 PM, Kewen.Lin wrote:
>> Hi,
>>
>> This patch follows the discussion here[1], where Segher pointed out the existing way to guard the extra penalized cost for strided/elementwise loads with a magic bound doesn't scale.
>>
>> The way with nunits * stmt_cost can get one much exaggerated penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320, that's why we need one bound.  To make it scale, this patch doesn't use nunits * stmt_cost any more, but it still keeps nunits since there are actually nunits scalar loads there.  So it uses one cost adjusted from stmt_cost; since the current stmt_cost sort of considers nunits, we can stabilize the cost for big nunits and retain the cost for small nunits.  After some tries, this patch gets the adjusted cost as:
>>
>> stmt_cost / (log2(nunits) * log2(nunits))
>>
>> For V16QI, the adjusted cost would be 1 and the total penalized cost is 16, it isn't exaggerated.  For V2DI, the adjusted cost would be 2 and the total penalized cost is 4, which is the same as before.  Btw, I tried to use one single log2(nunits), but the penalized cost is still big enough and can't fix the degraded bmk blender_r.
>>
>> The separated SPEC2017 evaluations on Power8, Power9 and Power10 at option sets O2-vect and Ofast-unroll showed this change is neutral (that is, the same effect as before).
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>> 	the way to compute extra penalized cost.
>>
>> ---
>>  gcc/config/rs6000/rs6000.c | 28 +++++++++++++++++-----------
>>  1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 4ab23b0ab33..e08b94c0447 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5454,17 +5454,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>      {
>>        tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>        unsigned int nunits = vect_nunits_for_cost (vectype);
>> -      unsigned int extra_cost = nunits * stmt_cost;
>> -      /* As function rs6000_builtin_vectorization_cost shows, we have
>> -	 priced much on V16QI/V8HI vector construction as their units,
>> -	 if we penalize them with nunits * stmt_cost, it can result in
>> -	 an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> -	 is 20 and nunits is 16, the extra cost is 320 which looks
>> -	 much exaggerated.  So let's use one maximum bound for the
>> -	 extra penalized cost for vector construction here.  */
>> -      const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> -      if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> -	extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +      /* As function rs6000_builtin_vectorization_cost shows, we
>> +	 have priced much on V16QI/V8HI vector construction by
>> +	 considering their units, if we penalize them with nunits
>> +	 * stmt_cost here, it can result in an unreliable body cost,
>
> This might be confusing to the reader, since you have deleted the calculation of nunits * stmt_cost.  Could you instead write this to indicate that we used to adjust in this way, and it had this particular downside, so that's why you're choosing this heuristic?  It's a minor thing but I think people reading the code will be confused otherwise.
>

Good point!  I'll update the commentary to explain it, thanks!!

BR,
Kewen

> I think the heuristic is generally reasonable, and certainly better than what we had before!
>
> LGTM with adjusted commentary, so recommend maintainers approve.
>
> Thanks for the patch!
> Bill
>> +	 eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
>> +	 the penalty will be 320 which looks much exaggerated.  But
>> +	 there are actually nunits scalar loads, so we try to adopt
>> +	 one reasonable penalized cost for each load rather than
>> +	 stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
>> +	 we can still retain the necessary penalty for small nunits
>> +	 meanwhile stabilize the penalty for big nunits.  */
>> +      int nunits_log2 = exact_log2 (nunits);
>> +      gcc_assert (nunits_log2 > 0);
>> +      unsigned int nunits_sq = nunits_log2 * nunits_log2;
>> +      unsigned int adjusted_cost = stmt_cost / nunits_sq;
>> +      gcc_assert (adjusted_cost > 0);
>> +      unsigned int extra_cost = nunits * adjusted_cost;
>>        data->extra_ctor_cost += extra_cost;
>>      }
>>  }
>> --
>> 2.25.1
>
Hi Segher,

Thanks for the review!

on 2021/9/18 6:01 AM, Segher Boessenkool wrote:
> Hi!
>
> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
>> The way with nunits * stmt_cost can get one much exaggerated penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320, that's why we need one bound.  To make it scale, this patch doesn't use nunits * stmt_cost any more, but it still keeps nunits since there are actually nunits scalar loads there.  So it uses one cost adjusted from stmt_cost; since the current stmt_cost sort of considers nunits, we can stabilize the cost for big nunits and retain the cost for small nunits.  After some tries, this patch gets the adjusted cost as:
>>
>> stmt_cost / (log2(nunits) * log2(nunits))
>
> So for V16QI it gives *16/(4*4), so *1;
>    V8HI  it gives *8/(3*3),  so *8/9;
>    V4SI  it gives *4/(2*2),  so *1;
>    V2DI  it gives *2/(1*1),  so *2;
> and for V1TI it gives *1/(0*0), which is UB (no, does not crash for us, just gives wildly wrong answers; the div returns 0 on recent systems).
>

I don't expect we will have V1TI for strided/elementwise load; if it's a one-unit vector, it's the whole vector itself.  Besides, the below assertion should exclude it already.

>> For V16QI, the adjusted cost would be 1 and the total penalized cost is 16, it isn't exaggerated.  For V2DI, the adjusted cost would be 2 and the total penalized cost is 4, which is the same as before.  Btw, I tried to use one single log2(nunits), but the penalized cost is still big enough and can't fix the degraded bmk blender_r.
>
> Does it make sense to treat V2DI (and V2DF) as twice more expensive than other vectors, which are all pretty much equal cost (except those that end up with cost 0)?  If so, there are simpler ways to do that.
>

Yeah, from the SPEC2017 evaluation, it's good with this.  The costing framework of vectorization doesn't consider the dependent insn chain and the available #units etc. like local scheduling does (it can't either), so we have to use some heuristics to handle some special cases.  For vector construction with more units, more instructions are used.  There are more chances to schedule them better (even run them in parallel when there are enough available units at the time), so we don't need to penalize them more.  For V2DI, the load result is fed into the construction directly, and the current stmt_cost considers only the merging and is just 2; penalizing it with one is not enough, from the bwaves experiment.

>> +      int nunits_log2 = exact_log2 (nunits);
>> +      gcc_assert (nunits_log2 > 0);
>> +      unsigned int nunits_sq = nunits_log2 * nunits_log2;
>
> >= 0
>
> This of course is assuming nunits will always be a power of 2, but I'm sure that we have many other places in the compiler assuming that already, so that is fine.  And if one day this stops being true we will get a nice ICE, pretty much the best we could hope for.
>

Yeah, exact_log2 returns -1 for non power of 2 input, for example:

    input   output
      0       -1
      1        0
      2        1
      3       -1

>> +      unsigned int adjusted_cost = stmt_cost / nunits_sq;
>
> But this can divide by 0.  Or are we somehow guaranteed that nunits will never be 1?  Yes the log2 check above, sure, but that ICEs if this is violated; is there anything that actually guarantees it is true?
>

As I mentioned above, I don't expect we can have a nunits 1 strided/ew load, and the ICE should check this and ensure dividing by zero never happens.  :)

>> +      gcc_assert (adjusted_cost > 0);
>
> I don't see how you guarantee this, either.
>

It's mainly to prevent that one day we tweak the cost for construction in rs6000_builtin_vectorization_cost and then make some unexpected values be generated here.  But for now these expected values are guaranteed by the current costs and the formula.

> A magic crazy formula like this is no good.  If you want to make the cost of everything but V2D* be the same, and that of V2D* be twice that, that is a weird heuristic, but we can live with that perhaps.  But that beats completely unexplained (and unexplainable) magic!
>
> Sorry.
>

That's all right, thanks for the comments!  Let's improve it.  :)

How about just assigning 2 for V2DI and 1 for the others for the penalized_cost_per_load, with some detailed commentary?  It should have the same effect as this "magic crazy formula", but I guess it can be more clear.

BR,
Kewen
Hi!

On Tue, Sep 21, 2021 at 11:24:08AM +0800, Kewen.Lin wrote:
> on 2021/9/18 6:01 AM, Segher Boessenkool wrote:
>> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
>>> The way with nunits * stmt_cost can get one much exaggerated penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320, that's why we need one bound.  To make it scale, this patch doesn't use nunits * stmt_cost any more, but it still keeps nunits since there are actually nunits scalar loads there.  So it uses one cost adjusted from stmt_cost; since the current stmt_cost sort of considers nunits, we can stabilize the cost for big nunits and retain the cost for small nunits.  After some tries, this patch gets the adjusted cost as:
>>>
>>> stmt_cost / (log2(nunits) * log2(nunits))
>>
>> So for V16QI it gives *16/(4*4), so *1;
>>    V8HI  it gives *8/(3*3),  so *8/9;
>>    V4SI  it gives *4/(2*2),  so *1;
>>    V2DI  it gives *2/(1*1),  so *2;
>> and for V1TI it gives *1/(0*0), which is UB (no, does not crash for us, just gives wildly wrong answers; the div returns 0 on recent systems).
>
> I don't expect we will have V1TI for strided/elementwise load; if it's a one-unit vector, it's the whole vector itself.  Besides, the below assertion should exclude it already.

Yes.  But ignoring the UB for unexpectedly large vector components, the 1 / 1.111 / 1 / 2 scoring does not make much sense.  The formulas "look" smooth and even sort of reasonable, but as soon as you look at what it *means*, and realise the domain of the function is discrete (only four or five possible inputs), and then see how the function behaves on that...  Hrm :-)

>> This of course is assuming nunits will always be a power of 2, but I'm sure that we have many other places in the compiler assuming that already, so that is fine.  And if one day this stops being true we will get a nice ICE, pretty much the best we could hope for.
>
> Yeah, exact_log2 returns -1 for non power of 2 input, for example:

Exactly.

>>> +      unsigned int adjusted_cost = stmt_cost / nunits_sq;
>>
>> But this can divide by 0.  Or are we somehow guaranteed that nunits will never be 1?  Yes the log2 check above, sure, but that ICEs if this is violated; is there anything that actually guarantees it is true?
>
> As I mentioned above, I don't expect we can have a nunits 1 strided/ew load, and the ICE should check this and ensure dividing by zero never happens.  :)

Can you assert that *directly* then please?

>> A magic crazy formula like this is no good.  If you want to make the cost of everything but V2D* be the same, and that of V2D* be twice that, that is a weird heuristic, but we can live with that perhaps.  But that beats completely unexplained (and unexplainable) magic!
>>
>> Sorry.
>
> That's all right, thanks for the comments!  Let's improve it.  :)

I like that spirit :-)

> How about just assigning 2 for V2DI and 1 for the others for the penalized_cost_per_load, with some detailed commentary?  It should have the same effect as this "magic crazy formula", but I guess it can be more clear.

That is fine yes!  (Well, V2DF the same I guess?  Or you'll need very detailed commentary :-) )

It is fine to say "this is just a heuristic without much supporting theory" in places.  That is what most of our --param= are as well, for example.  If counting two-element vectors as twice as expensive as all other vectors helps performance, then so be it: if there is no better way to cost things (or we do not know one), then what else are we to do?


Segher
Hi Segher,

on 2021/9/23 6:36 AM, Segher Boessenkool wrote:
> Hi!
>
> On Tue, Sep 21, 2021 at 11:24:08AM +0800, Kewen.Lin wrote:
>> on 2021/9/18 6:01 AM, Segher Boessenkool wrote:
>>> On Thu, Sep 16, 2021 at 09:14:15AM +0800, Kewen.Lin wrote:
>>>> The way with nunits * stmt_cost can get one much exaggerated penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320, that's why we need one bound.  To make it scale, this patch doesn't use nunits * stmt_cost any more, but it still keeps nunits since there are actually nunits scalar loads there.  So it uses one cost adjusted from stmt_cost; since the current stmt_cost sort of considers nunits, we can stabilize the cost for big nunits and retain the cost for small nunits.  After some tries, this patch gets the adjusted cost as:
>>>>
>>>> stmt_cost / (log2(nunits) * log2(nunits))
>>>
>>> So for V16QI it gives *16/(4*4), so *1;
>>>    V8HI  it gives *8/(3*3),  so *8/9;
>>>    V4SI  it gives *4/(2*2),  so *1;
>>>    V2DI  it gives *2/(1*1),  so *2;
>>> and for V1TI it gives *1/(0*0), which is UB (no, does not crash for us, just gives wildly wrong answers; the div returns 0 on recent systems).
>>
>> I don't expect we will have V1TI for strided/elementwise load; if it's a one-unit vector, it's the whole vector itself.  Besides, the below assertion should exclude it already.
>
> Yes.  But ignoring the UB for unexpectedly large vector components, the 1 / 1.111 / 1 / 2 scoring does not make much sense.  The formulas "look" smooth and even sort of reasonable, but as soon as you look at what it *means*, and realise the domain of the function is discrete (only four or five possible inputs), and then see how the function behaves on that...  Hrm :-)
>
>>> This of course is assuming nunits will always be a power of 2, but I'm sure that we have many other places in the compiler assuming that already, so that is fine.  And if one day this stops being true we will get a nice ICE, pretty much the best we could hope for.
>>
>> Yeah, exact_log2 returns -1 for non power of 2 input, for example:
>
> Exactly.
>
>>>> +      unsigned int adjusted_cost = stmt_cost / nunits_sq;
>>>
>>> But this can divide by 0.  Or are we somehow guaranteed that nunits will never be 1?  Yes the log2 check above, sure, but that ICEs if this is violated; is there anything that actually guarantees it is true?
>>
>> As I mentioned above, I don't expect we can have a nunits 1 strided/ew load, and the ICE should check this and ensure dividing by zero never happens.  :)
>
> Can you assert that *directly* then please?
>

Fixed in v2.

>>> A magic crazy formula like this is no good.  If you want to make the cost of everything but V2D* be the same, and that of V2D* be twice that, that is a weird heuristic, but we can live with that perhaps.  But that beats completely unexplained (and unexplainable) magic!
>>>
>>> Sorry.
>>
>> That's all right, thanks for the comments!  Let's improve it.  :)
>
> I like that spirit :-)
>
>> How about just assigning 2 for V2DI and 1 for the others for the penalized_cost_per_load, with some detailed commentary?  It should have the same effect as this "magic crazy formula", but I guess it can be more clear.
>
> That is fine yes!  (Well, V2DF the same I guess?  Or you'll need very detailed commentary :-) )
>
> It is fine to say "this is just a heuristic without much supporting theory" in places.  That is what most of our --param= are as well, for example.  If counting two-element vectors as twice as expensive as all other vectors helps performance, then so be it: if there is no better way to cost things (or we do not know one), then what else are we to do?
>

Thanks a lot for the suggestion, I just posted v2:
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580358.html

BR,
Kewen
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 4ab23b0ab33..e08b94c0447 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5454,17 +5454,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
     {
       tree vectype = STMT_VINFO_VECTYPE (stmt_info);
       unsigned int nunits = vect_nunits_for_cost (vectype);
-      unsigned int extra_cost = nunits * stmt_cost;
-      /* As function rs6000_builtin_vectorization_cost shows, we have
-	 priced much on V16QI/V8HI vector construction as their units,
-	 if we penalize them with nunits * stmt_cost, it can result in
-	 an unreliable body cost, eg: for V16QI on Power8, stmt_cost
-	 is 20 and nunits is 16, the extra cost is 320 which looks
-	 much exaggerated.  So let's use one maximum bound for the
-	 extra penalized cost for vector construction here.  */
-      const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
-      if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
-	extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+      /* As function rs6000_builtin_vectorization_cost shows, we
+	 have priced much on V16QI/V8HI vector construction by
+	 considering their units, if we penalize them with nunits
+	 * stmt_cost here, it can result in an unreliable body cost,
+	 eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
+	 the penalty will be 320 which looks much exaggerated.  But
+	 there are actually nunits scalar loads, so we try to adopt
+	 one reasonable penalized cost for each load rather than
+	 stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
+	 we can still retain the necessary penalty for small nunits
+	 meanwhile stabilize the penalty for big nunits.  */
+      int nunits_log2 = exact_log2 (nunits);
+      gcc_assert (nunits_log2 > 0);
+      unsigned int nunits_sq = nunits_log2 * nunits_log2;
+      unsigned int adjusted_cost = stmt_cost / nunits_sq;
+      gcc_assert (adjusted_cost > 0);
+      unsigned int extra_cost = nunits * adjusted_cost;
       data->extra_ctor_cost += extra_cost;
     }
 }