rs6000: Don't use optimize_function_for_speed_p too early [PR108184]

Message ID 197abd1f-081c-3206-4dd5-45f0b098612a@linux.ibm.com
State New
Headers
Series rs6000: Don't use optimize_function_for_speed_p too early [PR108184] |

Commit Message

Kewen.Lin Jan. 4, 2023, 9:20 a.m. UTC
  Hi,

As Honza pointed out in [1], the current uses of function
optimize_function_for_speed_p in rs6000_option_override_internal
are too early, since the query results from the functions
optimize_function_for_{speed,size}_p could be changed later due
to profile feedback and some function attributes handlings etc.

This patch is to move optimize_function_for_speed_p to all the
use places of the corresponding flags, which follows the existing
practices.  Maybe we can cache it somewhere at an appropriate
timing, but that's another thing.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607527.html

Bootstrapped and regtested on powerpc64-linux-gnu P8 and
powerpc64le-linux-gnu P9 and P10.

I'm going to push this soon if no objections.

BR,
Kewen
-----
	PR target/108184

gcc/ChangeLog:

	* config/rs6000/rs6000.cc (rs6000_option_override_internal): Remove
	all optimize_function_for_speed_p uses.
	* config/rs6000/predicates.md (fusion_gpr_mem_load): Call
	optimize_function_for_speed_p along with TARGET_P8_FUSION_SIGN.
	(fusion_gpr_load_p): Likewise.
	(expand_fusion_gpr_load): Likewise.
	(rs6000_call_aix): Call optimize_function_for_speed_p along with
	TARGET_SAVE_TOC_INDIRECT.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/pr108184.c: New test.
---
 gcc/config/rs6000/predicates.md             |  4 +++-
 gcc/config/rs6000/rs6000.cc                 | 16 ++++++++++------
 gcc/testsuite/gcc.target/powerpc/pr108184.c | 15 +++++++++++++++
 3 files changed, 28 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184.c

--
2.27.0
  

Comments

Segher Boessenkool Jan. 4, 2023, 10:46 a.m. UTC | #1
On Wed, Jan 04, 2023 at 05:20:14PM +0800, Kewen.Lin wrote:
> As Honza pointed out in [1], the current uses of function
> optimize_function_for_speed_p in rs6000_option_override_internal
> are too early, since the query results from the functions
> optimize_function_for_{speed,size}_p could be changed later due
> to profile feedback and some function attributes handlings etc.
> 
> This patch is to move optimize_function_for_speed_p to all the
> use places of the corresponding flags, which follows the existing
> practices.  Maybe we can cache it somewhere at an appropriate
> timing, but that's another thing.

> @@ -25604,7 +25602,9 @@ rs6000_call_aix (rtx value, rtx func_desc, rtx tlsarg, rtx cookie)
> 
>  	  /* Can we optimize saving the TOC in the prologue or
>  	     do we need to do it at every call?  */
> -	  if (TARGET_SAVE_TOC_INDIRECT && !cfun->calls_alloca)
> +	  if (TARGET_SAVE_TOC_INDIRECT
> +	      && !cfun->calls_alloca
> +	      && optimize_function_for_speed_p (cfun))
>  	    cfun->machine->save_toc_in_prologue = true;

Is this correct?  If so, it really needs a separate testcase.

The rest looks good.  Thanks!


Segher
  
Kewen.Lin Jan. 4, 2023, 12:15 p.m. UTC | #2
Hi Segher,

Thanks for the comments.

on 2023/1/4 18:46, Segher Boessenkool wrote:
> On Wed, Jan 04, 2023 at 05:20:14PM +0800, Kewen.Lin wrote:
>> As Honza pointed out in [1], the current uses of function
>> optimize_function_for_speed_p in rs6000_option_override_internal
>> are too early, since the query results from the functions
>> optimize_function_for_{speed,size}_p could be changed later due
>> to profile feedback and some function attributes handlings etc.
>>
>> This patch is to move optimize_function_for_speed_p to all the
>> use places of the corresponding flags, which follows the existing
>> practices.  Maybe we can cache it somewhere at an appropriate
>> timing, but that's another thing.
> 
>> @@ -25604,7 +25602,9 @@ rs6000_call_aix (rtx value, rtx func_desc, rtx tlsarg, rtx cookie)
>>
>>  	  /* Can we optimize saving the TOC in the prologue or
>>  	     do we need to do it at every call?  */
>> -	  if (TARGET_SAVE_TOC_INDIRECT && !cfun->calls_alloca)
>> +	  if (TARGET_SAVE_TOC_INDIRECT
>> +	      && !cfun->calls_alloca
>> +	      && optimize_function_for_speed_p (cfun))
>>  	    cfun->machine->save_toc_in_prologue = true;
> 
> Is this correct?  If so, it really needs a separate testcase.
> 

Yes, it just moves the condition from:

--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -3978,8 +3978,7 @@ rs6000_option_override_internal (bool global_init_p)
   /* If we can shrink-wrap the TOC register save separately, then use
      -msave-toc-indirect unless explicitly disabled.  */
   if ((rs6000_isa_flags_explicit & OPTION_MASK_SAVE_TOC_INDIRECT) == 0
-      && flag_shrink_wrap_separate
-      && optimize_function_for_speed_p (cfun))
+      && flag_shrink_wrap_separate)
     rs6000_isa_flags |= OPTION_MASK_SAVE_TOC_INDIRECT;

here.

I tried to find one test case before, but failed to find one which is not fragile
to test.  And I thought the associated test case has demonstrated why the use of
optimize_function_for_{speed,size}_p is too early in function
rs6000_option_override_internal, so I gave up then.  Do you worry about that we
could revert it unexpectedly in future and no sensitive test case is on it?


BR,
Kewen
  
Segher Boessenkool Jan. 4, 2023, 2:02 p.m. UTC | #3
Hi!

On Wed, Jan 04, 2023 at 08:15:03PM +0800, Kewen.Lin wrote:
> on 2023/1/4 18:46, Segher Boessenkool wrote:
> >> @@ -25604,7 +25602,9 @@ rs6000_call_aix (rtx value, rtx func_desc, rtx tlsarg, rtx cookie)
> >>
> >>  	  /* Can we optimize saving the TOC in the prologue or
> >>  	     do we need to do it at every call?  */
> >> -	  if (TARGET_SAVE_TOC_INDIRECT && !cfun->calls_alloca)
> >> +	  if (TARGET_SAVE_TOC_INDIRECT
> >> +	      && !cfun->calls_alloca
> >> +	      && optimize_function_for_speed_p (cfun))
> >>  	    cfun->machine->save_toc_in_prologue = true;
> > 
> > Is this correct?  If so, it really needs a separate testcase.
> 
> Yes, it just moves the condition from:
> 
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -3978,8 +3978,7 @@ rs6000_option_override_internal (bool global_init_p)
>    /* If we can shrink-wrap the TOC register save separately, then use
>       -msave-toc-indirect unless explicitly disabled.  */
>    if ((rs6000_isa_flags_explicit & OPTION_MASK_SAVE_TOC_INDIRECT) == 0
> -      && flag_shrink_wrap_separate
> -      && optimize_function_for_speed_p (cfun))
> +      && flag_shrink_wrap_separate)
>      rs6000_isa_flags |= OPTION_MASK_SAVE_TOC_INDIRECT;
> 
> here.

That "just" reinforces that this really needs a testcase!  It is all
action at a distance, none of this is trivial (if it was there would
not be a bug here in the first place, of course).

> I tried to find one test case before, but failed to find one which is not fragile
> to test.  And I thought the associated test case has demonstrated why the use of
> optimize_function_for_{speed,size}_p is too early in function
> rs6000_option_override_internal, so I gave up then.  Do you worry about that we
> could revert it unexpectedly in future and no sensitive test case is on it?

I worry that it might contradict what some other code does.  I also
worry that it just is not a sensible thing to do.

I do not worry that your patch is not an improvement.  But the resulting
code more clearly (than the original) is problematic.  Where is r2 saved
to the frame if save_toc_in_prologue is false?


Segher
  
Kewen.Lin Jan. 5, 2023, 4:04 a.m. UTC | #4
on 2023/1/4 22:02, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Jan 04, 2023 at 08:15:03PM +0800, Kewen.Lin wrote:
>> on 2023/1/4 18:46, Segher Boessenkool wrote:
>>>> @@ -25604,7 +25602,9 @@ rs6000_call_aix (rtx value, rtx func_desc, rtx tlsarg, rtx cookie)
>>>>
>>>>  	  /* Can we optimize saving the TOC in the prologue or
>>>>  	     do we need to do it at every call?  */
>>>> -	  if (TARGET_SAVE_TOC_INDIRECT && !cfun->calls_alloca)
>>>> +	  if (TARGET_SAVE_TOC_INDIRECT
>>>> +	      && !cfun->calls_alloca
>>>> +	      && optimize_function_for_speed_p (cfun))
>>>>  	    cfun->machine->save_toc_in_prologue = true;
>>>
>>> Is this correct?  If so, it really needs a separate testcase.
>>
>> Yes, it just moves the condition from:
>>
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -3978,8 +3978,7 @@ rs6000_option_override_internal (bool global_init_p)
>>    /* If we can shrink-wrap the TOC register save separately, then use
>>       -msave-toc-indirect unless explicitly disabled.  */
>>    if ((rs6000_isa_flags_explicit & OPTION_MASK_SAVE_TOC_INDIRECT) == 0
>> -      && flag_shrink_wrap_separate
>> -      && optimize_function_for_speed_p (cfun))
>> +      && flag_shrink_wrap_separate)
>>      rs6000_isa_flags |= OPTION_MASK_SAVE_TOC_INDIRECT;
>>
>> here.
> 
> That "just" reinforces that this really needs a testcase!  It is all
> action at a distance, none of this is trivial (if it was there would
> not be a bug here in the first place, of course).

OK, I'll make a test case for it. :)

> 
>> I tried to find one test case before, but failed to find one which is not fragile
>> to test.  And I thought the associated test case has demonstrated why the use of
>> optimize_function_for_{speed,size}_p is too early in function
>> rs6000_option_override_internal, so I gave up then.  Do you worry about that we
>> could revert it unexpectedly in future and no sensitive test case is on it?
> 
> I worry that it might contradict what some other code does.  I also
> worry that it just is not a sensible thing to do.
> 
> I do not worry that your patch is not an improvement.  But the resulting
> code more clearly (than the original) is problematic.  Where is r2 saved
> to the frame if save_toc_in_prologue is false?

If save_toc_in_prologue is false, the r2 saving to frame would occur at each
indirect call.  Currently separate shrink-wrapping will check save_toc_in_prologue
to decide whether to consider saving toc as one component, I think that's why
we enable save-toc-indirect implicitly (going to set save_toc_in_prologue)
if it's not specified explicitly and doing separate shrink-wrapping.

BR,
Kewen
  

Patch

diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
index b1fcc69bb60..11f1779e7bf 100644
--- a/gcc/config/rs6000/predicates.md
+++ b/gcc/config/rs6000/predicates.md
@@ -1878,7 +1878,9 @@  (define_predicate "fusion_gpr_mem_load"

   /* Handle sign/zero extend.  */
   if (GET_CODE (op) == ZERO_EXTEND
-      || (TARGET_P8_FUSION_SIGN && GET_CODE (op) == SIGN_EXTEND))
+      || (TARGET_P8_FUSION_SIGN
+	  && GET_CODE (op) == SIGN_EXTEND
+	  && optimize_function_for_speed_p (cfun)))
     {
       op = XEXP (op, 0);
       mode = GET_MODE (op);
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index eb7ad5e954f..bbf829f45d9 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -3978,8 +3978,7 @@  rs6000_option_override_internal (bool global_init_p)
   /* If we can shrink-wrap the TOC register save separately, then use
      -msave-toc-indirect unless explicitly disabled.  */
   if ((rs6000_isa_flags_explicit & OPTION_MASK_SAVE_TOC_INDIRECT) == 0
-      && flag_shrink_wrap_separate
-      && optimize_function_for_speed_p (cfun))
+      && flag_shrink_wrap_separate)
     rs6000_isa_flags |= OPTION_MASK_SAVE_TOC_INDIRECT;

   /* Enable power8 fusion if we are tuning for power8, even if we aren't
@@ -4013,7 +4012,6 @@  rs6000_option_override_internal (bool global_init_p)
      zero extending load, and an explicit sign extension.  */
   if (TARGET_P8_FUSION
       && !(rs6000_isa_flags_explicit & OPTION_MASK_P8_FUSION_SIGN)
-      && optimize_function_for_speed_p (cfun)
       && optimize >= 3)
     rs6000_isa_flags |= OPTION_MASK_P8_FUSION_SIGN;

@@ -25604,7 +25602,9 @@  rs6000_call_aix (rtx value, rtx func_desc, rtx tlsarg, rtx cookie)

 	  /* Can we optimize saving the TOC in the prologue or
 	     do we need to do it at every call?  */
-	  if (TARGET_SAVE_TOC_INDIRECT && !cfun->calls_alloca)
+	  if (TARGET_SAVE_TOC_INDIRECT
+	      && !cfun->calls_alloca
+	      && optimize_function_for_speed_p (cfun))
 	    cfun->machine->save_toc_in_prologue = true;
 	  else
 	    {
@@ -27471,7 +27471,9 @@  fusion_gpr_load_p (rtx addis_reg,	/* register set via addis.  */

   /* Allow sign/zero extension.  */
   if (GET_CODE (mem) == ZERO_EXTEND
-      || (GET_CODE (mem) == SIGN_EXTEND && TARGET_P8_FUSION_SIGN))
+      || (GET_CODE (mem) == SIGN_EXTEND
+	  && TARGET_P8_FUSION_SIGN
+	  && optimize_function_for_speed_p (cfun)))
     mem = XEXP (mem, 0);

   if (!MEM_P (mem))
@@ -27535,7 +27537,9 @@  expand_fusion_gpr_load (rtx *operands)
   enum rtx_code extend = UNKNOWN;

   if (GET_CODE (orig_mem) == ZERO_EXTEND
-      || (TARGET_P8_FUSION_SIGN && GET_CODE (orig_mem) == SIGN_EXTEND))
+      || (TARGET_P8_FUSION_SIGN
+	  && GET_CODE (orig_mem) == SIGN_EXTEND
+	  && optimize_function_for_speed_p (cfun)))
     {
       extend = GET_CODE (orig_mem);
       orig_mem = XEXP (orig_mem, 0);
diff --git a/gcc/testsuite/gcc.target/powerpc/pr108184.c b/gcc/testsuite/gcc.target/powerpc/pr108184.c
new file mode 100644
index 00000000000..8f1e91d9258
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr108184.c
@@ -0,0 +1,15 @@ 
+/* Only possible to fuse sign extended loads with the addis when
+   optimize >= 3 and Power8 fusion takes effects.  */
+/* { dg-options "-mdejagnu-tune=power8 -O3" } */
+
+/* Verify it doesn't optimize this function for speed as it's cold,
+   by checking it doesn't try to fuse sign extended loads with addis,
+   which results in a zero extended load and a sign extension.  */
+
+__attribute__ ((cold)) int
+fusion_short (short *p)
+{
+  return p[0x12345];
+}
+
+/* { dg-final { scan-assembler-not {\mextsh\M} } } */