ira: Cap callee-saved register cost scale to 300

Message ID 20250202065854.133973-1-hjl.tools@gmail.com
State New
Headers
Series ira: Cap callee-saved register cost scale to 300 |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm fail Test failed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed

Commit Message

H.J. Lu Feb. 2, 2025, 6:58 a.m. UTC
  Don't increase callee-saved register cost by 1000x, which leads to that
callee-saved registers aren't used to preserve local variable values
across calls, by capping the scale to 300.

	PR rtl-optimization/111673
	PR rtl-optimization/115932
	PR rtl-optimization/116028
	PR rtl-optimization/117081
	PR rtl-optimization/118497
	* ira-color.cc (assign_hard_reg): Cap callee-saved register cost
	scale to 300.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
---
 gcc/ira-color.cc | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)
  

Comments

Richard Biener Feb. 2, 2025, 7:33 a.m. UTC | #1
> Am 02.02.2025 um 08:00 schrieb H.J. Lu <hjl.tools@gmail.com>:
> 
> Don't increase callee-saved register cost by 1000x, which leads to that
> callee-saved registers aren't used to preserve local variable values
> across calls, by capping the scale to 300.

>    PR rtl-optimization/111673
>    PR rtl-optimization/115932
>    PR rtl-optimization/116028
>    PR rtl-optimization/117081
>    PR rtl-optimization/118497
>    * ira-color.cc (assign_hard_reg): Cap callee-saved register cost
>    scale to 300.
> 
> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> ---
> gcc/ira-color.cc | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> index 0699b349a1a..707ff188250 100644
> --- a/gcc/ira-color.cc
> +++ b/gcc/ira-color.cc
> @@ -2175,13 +2175,25 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
>      /* We need to save/restore the hard register in
>         epilogue/prologue.  Therefore we increase the cost.  */
>      {
> +        int scale;
> +        if (optimize_size)
> +          scale = 1;
> +        else
> +          {
> +        scale = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
> +        /* Don't increase callee-saved register cost by 1000x,
> +           which leads to that callee-saved registers aren't
> +           used to preserve local variable values across calls,
> +           by capping the scale to 300.  */
> +        if (REG_FREQ_MAX == 1000 && scale == REG_FREQ_MAX)
> +          scale = 300;

That leads to 300 for 1000 but 999 for 999 which is odd.  I’d have expected to scale this down to [0, 300] or is MAX a magic value?

> +          }
>        rclass = REGNO_REG_CLASS (hard_regno);
>        add_cost = ((ira_memory_move_cost[mode][rclass][0]
>                 + ira_memory_move_cost[mode][rclass][1])
>                * saved_nregs / hard_regno_nregs (hard_regno,
>                              mode) - 1)
> -               * (optimize_size ? 1 :
> -              REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
> +               * scale;
>        cost += add_cost;
>        full_cost += add_cost;
>      }
> --
> 2.48.1
>
  
H.J. Lu Feb. 2, 2025, 7:58 a.m. UTC | #2
On Sun, Feb 2, 2025 at 3:33 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
>
>
> > Am 02.02.2025 um 08:00 schrieb H.J. Lu <hjl.tools@gmail.com>:
> >
> > Don't increase callee-saved register cost by 1000x, which leads to that
> > callee-saved registers aren't used to preserve local variable values
> > across calls, by capping the scale to 300.
>
> >    PR rtl-optimization/111673
> >    PR rtl-optimization/115932
> >    PR rtl-optimization/116028
> >    PR rtl-optimization/117081
> >    PR rtl-optimization/118497
> >    * ira-color.cc (assign_hard_reg): Cap callee-saved register cost
> >    scale to 300.
> >
> > Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> > ---
> > gcc/ira-color.cc | 16 ++++++++++++++--
> > 1 file changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> > index 0699b349a1a..707ff188250 100644
> > --- a/gcc/ira-color.cc
> > +++ b/gcc/ira-color.cc
> > @@ -2175,13 +2175,25 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
> >      /* We need to save/restore the hard register in
> >         epilogue/prologue.  Therefore we increase the cost.  */
> >      {
> > +        int scale;
> > +        if (optimize_size)
> > +          scale = 1;
> > +        else
> > +          {
> > +        scale = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
> > +        /* Don't increase callee-saved register cost by 1000x,
> > +           which leads to that callee-saved registers aren't
> > +           used to preserve local variable values across calls,
> > +           by capping the scale to 300.  */
> > +        if (REG_FREQ_MAX == 1000 && scale == REG_FREQ_MAX)
> > +          scale = 300;
>
> That leads to 300 for 1000 but 999 for 999 which is odd.  I’d have expected to scale this down to [0, 300] or is MAX a magic value?

There are

* The weights for each insn varies from 0 to REG_FREQ_BASE.
   This constant does not need to be high, as in infrequently executed
   regions we want to count instructions equivalently to optimize for
   size instead of speed.  */
#define REG_FREQ_MAX 1000

/* Compute register frequency from the BB frequency.  When optimizing for size,
   or profile driven feedback is available and the function is never executed,
   frequency is always equivalent.  Otherwise rescale the basic block
   frequency.  */
#define REG_FREQ_FROM_BB(bb) ((optimize_function_for_size_p (cfun)            \
                               || !cfun->cfg->count_max.initialized_p ())     \
                              ? REG_FREQ_MAX                                  \
                              : ((bb)->count.to_frequency (cfun)              \
                                * REG_FREQ_MAX / BB_FREQ_MAX)                 \
                              ? ((bb)->count.to_frequency (cfun)              \
                                 * REG_FREQ_MAX / BB_FREQ_MAX)                \
                              : 1)

1000 is the default.  If it isn't 1000, it isn't the default.  I only want
to get a more reasonable default scale, instead of 1000.   Lower
scale will fail the PR rtl-optimization/111673 test on powerpc64.


> > +          }
> >        rclass = REGNO_REG_CLASS (hard_regno);
> >        add_cost = ((ira_memory_move_cost[mode][rclass][0]
> >                 + ira_memory_move_cost[mode][rclass][1])
> >                * saved_nregs / hard_regno_nregs (hard_regno,
> >                              mode) - 1)
> > -               * (optimize_size ? 1 :
> > -              REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
> > +               * scale;
> >        cost += add_cost;
> >        full_cost += add_cost;
> >      }
> > --
> > 2.48.1
> >
  
Richard Biener Feb. 2, 2025, 8:19 a.m. UTC | #3
> Am 02.02.2025 um 08:59 schrieb H.J. Lu <hjl.tools@gmail.com>:
> 
> On Sun, Feb 2, 2025 at 3:33 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
>> 
>> 
>> 
>>>> Am 02.02.2025 um 08:00 schrieb H.J. Lu <hjl.tools@gmail.com>:
>>> 
>>> Don't increase callee-saved register cost by 1000x, which leads to that
>>> callee-saved registers aren't used to preserve local variable values
>>> across calls, by capping the scale to 300.
>> 
>>>   PR rtl-optimization/111673
>>>   PR rtl-optimization/115932
>>>   PR rtl-optimization/116028
>>>   PR rtl-optimization/117081
>>>   PR rtl-optimization/118497
>>>   * ira-color.cc (assign_hard_reg): Cap callee-saved register cost
>>>   scale to 300.
>>> 
>>> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
>>> ---
>>> gcc/ira-color.cc | 16 ++++++++++++++--
>>> 1 file changed, 14 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
>>> index 0699b349a1a..707ff188250 100644
>>> --- a/gcc/ira-color.cc
>>> +++ b/gcc/ira-color.cc
>>> @@ -2175,13 +2175,25 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
>>>     /* We need to save/restore the hard register in
>>>        epilogue/prologue.  Therefore we increase the cost.  */
>>>     {
>>> +        int scale;
>>> +        if (optimize_size)
>>> +          scale = 1;
>>> +        else
>>> +          {
>>> +        scale = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
>>> +        /* Don't increase callee-saved register cost by 1000x,
>>> +           which leads to that callee-saved registers aren't
>>> +           used to preserve local variable values across calls,
>>> +           by capping the scale to 300.  */
>>> +        if (REG_FREQ_MAX == 1000 && scale == REG_FREQ_MAX)
>>> +          scale = 300;
>> 
>> That leads to 300 for 1000 but 999 for 999 which is odd.  I’d have expected to scale this down to [0, 300] or is MAX a magic value?
> 
> There are
> 
> * The weights for each insn varies from 0 to REG_FREQ_BASE.
>   This constant does not need to be high, as in infrequently executed
>   regions we want to count instructions equivalently to optimize for
>   size instead of speed.  */
> #define REG_FREQ_MAX 1000
> 
> /* Compute register frequency from the BB frequency.  When optimizing for size,
>   or profile driven feedback is available and the function is never executed,
>   frequency is always equivalent.  Otherwise rescale the basic block
>   frequency.  */
> #define REG_FREQ_FROM_BB(bb) ((optimize_function_for_size_p (cfun)            \
>                               || !cfun->cfg->count_max.initialized_p ())     \
>                              ? REG_FREQ_MAX                                  \
>                              : ((bb)->count.to_frequency (cfun)              \
>                                * REG_FREQ_MAX / BB_FREQ_MAX)                 \
>                              ? ((bb)->count.to_frequency (cfun)              \
>                                 * REG_FREQ_MAX / BB_FREQ_MAX)                \
>                              : 1)
> 
> 1000 is the default.  If it isn't 1000, it isn't the default.  I only want
> to get a more reasonable default scale, instead of 1000.   Lower
> scale will fail the PR rtl-optimization/111673 test on powerpc64.

I see.  Why not adjust the above macro then?  That would be a bit more obvious.  Like use MAX/2 or so?

> 
> 
>>> +          }
>>>       rclass = REGNO_REG_CLASS (hard_regno);
>>>       add_cost = ((ira_memory_move_cost[mode][rclass][0]
>>>                + ira_memory_move_cost[mode][rclass][1])
>>>               * saved_nregs / hard_regno_nregs (hard_regno,
>>>                             mode) - 1)
>>> -               * (optimize_size ? 1 :
>>> -              REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
>>> +               * scale;
>>>       cost += add_cost;
>>>       full_cost += add_cost;
>>>     }
>>> --
>>> 2.48.1
>>> 
> 
> 
> 
> --
> H.J.
  
H.J. Lu Feb. 2, 2025, 8:28 a.m. UTC | #4
On Sun, Feb 2, 2025 at 4:20 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
>
>
> > Am 02.02.2025 um 08:59 schrieb H.J. Lu <hjl.tools@gmail.com>:
> >
> > On Sun, Feb 2, 2025 at 3:33 PM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> >>
> >>
> >>
> >>>> Am 02.02.2025 um 08:00 schrieb H.J. Lu <hjl.tools@gmail.com>:
> >>>
> >>> Don't increase callee-saved register cost by 1000x, which leads to that
> >>> callee-saved registers aren't used to preserve local variable values
> >>> across calls, by capping the scale to 300.
> >>
> >>>   PR rtl-optimization/111673
> >>>   PR rtl-optimization/115932
> >>>   PR rtl-optimization/116028
> >>>   PR rtl-optimization/117081
> >>>   PR rtl-optimization/118497
> >>>   * ira-color.cc (assign_hard_reg): Cap callee-saved register cost
> >>>   scale to 300.
> >>>
> >>> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> >>> ---
> >>> gcc/ira-color.cc | 16 ++++++++++++++--
> >>> 1 file changed, 14 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> >>> index 0699b349a1a..707ff188250 100644
> >>> --- a/gcc/ira-color.cc
> >>> +++ b/gcc/ira-color.cc
> >>> @@ -2175,13 +2175,25 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
> >>>     /* We need to save/restore the hard register in
> >>>        epilogue/prologue.  Therefore we increase the cost.  */
> >>>     {
> >>> +        int scale;
> >>> +        if (optimize_size)
> >>> +          scale = 1;
> >>> +        else
> >>> +          {
> >>> +        scale = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
> >>> +        /* Don't increase callee-saved register cost by 1000x,
> >>> +           which leads to that callee-saved registers aren't
> >>> +           used to preserve local variable values across calls,
> >>> +           by capping the scale to 300.  */
> >>> +        if (REG_FREQ_MAX == 1000 && scale == REG_FREQ_MAX)
> >>> +          scale = 300;
> >>
> >> That leads to 300 for 1000 but 999 for 999 which is odd.  I’d have expected to scale this down to [0, 300] or is MAX a magic value?
> >
> > There are
> >
> > * The weights for each insn varies from 0 to REG_FREQ_BASE.
> >   This constant does not need to be high, as in infrequently executed
> >   regions we want to count instructions equivalently to optimize for
> >   size instead of speed.  */
> > #define REG_FREQ_MAX 1000
> >
> > /* Compute register frequency from the BB frequency.  When optimizing for size,
> >   or profile driven feedback is available and the function is never executed,
> >   frequency is always equivalent.  Otherwise rescale the basic block
> >   frequency.  */
> > #define REG_FREQ_FROM_BB(bb) ((optimize_function_for_size_p (cfun)            \
> >                               || !cfun->cfg->count_max.initialized_p ())     \
> >                              ? REG_FREQ_MAX                                  \
> >                              : ((bb)->count.to_frequency (cfun)              \
> >                                * REG_FREQ_MAX / BB_FREQ_MAX)                 \
> >                              ? ((bb)->count.to_frequency (cfun)              \
> >                                 * REG_FREQ_MAX / BB_FREQ_MAX)                \
> >                              : 1)
> >
> > 1000 is the default.  If it isn't 1000, it isn't the default.  I only want
> > to get a more reasonable default scale, instead of 1000.   Lower
> > scale will fail the PR rtl-optimization/111673 test on powerpc64.
>
> I see.  Why not adjust the above macro then?  That would be a bit more obvious.  Like use MAX/2 or so?

commit 3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
Author: Surya Kumari Jangala <jskumari@linux.ibm.com>
Date:   Tue Jun 25 08:37:49 2024 -0500

    ira: Scale save/restore costs of callee save registers with block frequency

uses REG_FREQ_FROM_BB as the cost scale.  I don't know if it is a misuse.
I don't want to change REG_FREQ_FROM_BB since it is used in other places,
not as a cost scale.  Maybe the above commit should be reverted and we add
a target hook for callee-saved register cost scale.  Each target can choose
a proper cost scale, install of increasing the cost by 1000x for everyone.

> >
> >
> >>> +          }
> >>>       rclass = REGNO_REG_CLASS (hard_regno);
> >>>       add_cost = ((ira_memory_move_cost[mode][rclass][0]
> >>>                + ira_memory_move_cost[mode][rclass][1])
> >>>               * saved_nregs / hard_regno_nregs (hard_regno,
> >>>                             mode) - 1)
> >>> -               * (optimize_size ? 1 :
> >>> -              REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
> >>> +               * scale;
> >>>       cost += add_cost;
> >>>       full_cost += add_cost;
> >>>     }
> >>> --
> >>> 2.48.1
> >>>
> >
> >
> >
> > --
> > H.J.
  
Richard Biener Feb. 3, 2025, 9:21 a.m. UTC | #5
On Sun, Feb 2, 2025 at 9:29 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Sun, Feb 2, 2025 at 4:20 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> >
> >
> > > Am 02.02.2025 um 08:59 schrieb H.J. Lu <hjl.tools@gmail.com>:
> > >
> > > On Sun, Feb 2, 2025 at 3:33 PM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > >>
> > >>
> > >>
> > >>>> Am 02.02.2025 um 08:00 schrieb H.J. Lu <hjl.tools@gmail.com>:
> > >>>
> > >>> Don't increase callee-saved register cost by 1000x, which leads to that
> > >>> callee-saved registers aren't used to preserve local variable values
> > >>> across calls, by capping the scale to 300.
> > >>
> > >>>   PR rtl-optimization/111673
> > >>>   PR rtl-optimization/115932
> > >>>   PR rtl-optimization/116028
> > >>>   PR rtl-optimization/117081
> > >>>   PR rtl-optimization/118497
> > >>>   * ira-color.cc (assign_hard_reg): Cap callee-saved register cost
> > >>>   scale to 300.
> > >>>
> > >>> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> > >>> ---
> > >>> gcc/ira-color.cc | 16 ++++++++++++++--
> > >>> 1 file changed, 14 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> > >>> index 0699b349a1a..707ff188250 100644
> > >>> --- a/gcc/ira-color.cc
> > >>> +++ b/gcc/ira-color.cc
> > >>> @@ -2175,13 +2175,25 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
> > >>>     /* We need to save/restore the hard register in
> > >>>        epilogue/prologue.  Therefore we increase the cost.  */
> > >>>     {
> > >>> +        int scale;
> > >>> +        if (optimize_size)
> > >>> +          scale = 1;
> > >>> +        else
> > >>> +          {
> > >>> +        scale = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
> > >>> +        /* Don't increase callee-saved register cost by 1000x,
> > >>> +           which leads to that callee-saved registers aren't
> > >>> +           used to preserve local variable values across calls,
> > >>> +           by capping the scale to 300.  */
> > >>> +        if (REG_FREQ_MAX == 1000 && scale == REG_FREQ_MAX)
> > >>> +          scale = 300;
> > >>
> > >> That leads to 300 for 1000 but 999 for 999 which is odd.  I’d have expected to scale this down to [0, 300] or is MAX a magic value?
> > >
> > > There are
> > >
> > > * The weights for each insn varies from 0 to REG_FREQ_BASE.
> > >   This constant does not need to be high, as in infrequently executed
> > >   regions we want to count instructions equivalently to optimize for
> > >   size instead of speed.  */
> > > #define REG_FREQ_MAX 1000
> > >
> > > /* Compute register frequency from the BB frequency.  When optimizing for size,
> > >   or profile driven feedback is available and the function is never executed,
> > >   frequency is always equivalent.  Otherwise rescale the basic block
> > >   frequency.  */
> > > #define REG_FREQ_FROM_BB(bb) ((optimize_function_for_size_p (cfun)            \
> > >                               || !cfun->cfg->count_max.initialized_p ())     \
> > >                              ? REG_FREQ_MAX                                  \
> > >                              : ((bb)->count.to_frequency (cfun)              \
> > >                                * REG_FREQ_MAX / BB_FREQ_MAX)                 \
> > >                              ? ((bb)->count.to_frequency (cfun)              \
> > >                                 * REG_FREQ_MAX / BB_FREQ_MAX)                \
> > >                              : 1)
> > >
> > > 1000 is the default.  If it isn't 1000, it isn't the default.  I only want
> > > to get a more reasonable default scale, instead of 1000.   Lower
> > > scale will fail the PR rtl-optimization/111673 test on powerpc64.
> >
> > I see.  Why not adjust the above macro then?  That would be a bit more obvious.  Like use MAX/2 or so?
>
> commit 3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
> Author: Surya Kumari Jangala <jskumari@linux.ibm.com>
> Date:   Tue Jun 25 08:37:49 2024 -0500
>
>     ira: Scale save/restore costs of callee save registers with block frequency
>
> uses REG_FREQ_FROM_BB as the cost scale.  I don't know if it is a misuse.
> I don't want to change REG_FREQ_FROM_BB since it is used in other places,
> not as a cost scale.  Maybe the above commit should be reverted and we add
> a target hook for callee-saved register cost scale.  Each target can choose
> a proper cost scale, install of increasing the cost by 1000x for everyone.

I believe testing cfun->cfg->count_max.initialized_p () is a bit odd at least,
as it doesn't seem to be used.  The comment talks about profile feedback,
but for example with -fprofile-correction or -fpartial-profile this
test looks odd.
In fact optimize_function_for_size_p should already handle this correctly.

Also REG_FREQ_FROM_BB simply documents that in this case the
frequency will be equivalent for all BBs and not any particular value.  The new
use might indeed not have the same constraints as others, instead of
a target hook making the "same value" another macro argument might be
a good first step.

That said - does removing the || !cfun->cfg->count_max.initialized_p ()
test "fix" things?

Richard.

> > >
> > >
> > >>> +          }
> > >>>       rclass = REGNO_REG_CLASS (hard_regno);
> > >>>       add_cost = ((ira_memory_move_cost[mode][rclass][0]
> > >>>                + ira_memory_move_cost[mode][rclass][1])
> > >>>               * saved_nregs / hard_regno_nregs (hard_regno,
> > >>>                             mode) - 1)
> > >>> -               * (optimize_size ? 1 :
> > >>> -              REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
> > >>> +               * scale;
> > >>>       cost += add_cost;
> > >>>       full_cost += add_cost;
> > >>>     }
> > >>> --
> > >>> 2.48.1
> > >>>
> > >
> > >
> > >
> > > --
> > > H.J.
>
>
>
> --
> H.J.
  
H.J. Lu Feb. 3, 2025, 9:28 a.m. UTC | #6
On Mon, Feb 3, 2025 at 5:21 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Sun, Feb 2, 2025 at 9:29 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Sun, Feb 2, 2025 at 4:20 PM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > >
> > >
> > > > Am 02.02.2025 um 08:59 schrieb H.J. Lu <hjl.tools@gmail.com>:
> > > >
> > > > On Sun, Feb 2, 2025 at 3:33 PM Richard Biener
> > > > <richard.guenther@gmail.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >>>> Am 02.02.2025 um 08:00 schrieb H.J. Lu <hjl.tools@gmail.com>:
> > > >>>
> > > >>> Don't increase callee-saved register cost by 1000x, which leads to that
> > > >>> callee-saved registers aren't used to preserve local variable values
> > > >>> across calls, by capping the scale to 300.
> > > >>
> > > >>>   PR rtl-optimization/111673
> > > >>>   PR rtl-optimization/115932
> > > >>>   PR rtl-optimization/116028
> > > >>>   PR rtl-optimization/117081
> > > >>>   PR rtl-optimization/118497
> > > >>>   * ira-color.cc (assign_hard_reg): Cap callee-saved register cost
> > > >>>   scale to 300.
> > > >>>
> > > >>> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> > > >>> ---
> > > >>> gcc/ira-color.cc | 16 ++++++++++++++--
> > > >>> 1 file changed, 14 insertions(+), 2 deletions(-)
> > > >>>
> > > >>> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> > > >>> index 0699b349a1a..707ff188250 100644
> > > >>> --- a/gcc/ira-color.cc
> > > >>> +++ b/gcc/ira-color.cc
> > > >>> @@ -2175,13 +2175,25 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
> > > >>>     /* We need to save/restore the hard register in
> > > >>>        epilogue/prologue.  Therefore we increase the cost.  */
> > > >>>     {
> > > >>> +        int scale;
> > > >>> +        if (optimize_size)
> > > >>> +          scale = 1;
> > > >>> +        else
> > > >>> +          {
> > > >>> +        scale = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
> > > >>> +        /* Don't increase callee-saved register cost by 1000x,
> > > >>> +           which leads to that callee-saved registers aren't
> > > >>> +           used to preserve local variable values across calls,
> > > >>> +           by capping the scale to 300.  */
> > > >>> +        if (REG_FREQ_MAX == 1000 && scale == REG_FREQ_MAX)
> > > >>> +          scale = 300;
> > > >>
> > > >> That leads to 300 for 1000 but 999 for 999 which is odd.  I’d have expected to scale this down to [0, 300] or is MAX a magic value?
> > > >
> > > > There are
> > > >
> > > > * The weights for each insn varies from 0 to REG_FREQ_BASE.
> > > >   This constant does not need to be high, as in infrequently executed
> > > >   regions we want to count instructions equivalently to optimize for
> > > >   size instead of speed.  */
> > > > #define REG_FREQ_MAX 1000
> > > >
> > > > /* Compute register frequency from the BB frequency.  When optimizing for size,
> > > >   or profile driven feedback is available and the function is never executed,
> > > >   frequency is always equivalent.  Otherwise rescale the basic block
> > > >   frequency.  */
> > > > #define REG_FREQ_FROM_BB(bb) ((optimize_function_for_size_p (cfun)            \
> > > >                               || !cfun->cfg->count_max.initialized_p ())     \
> > > >                              ? REG_FREQ_MAX                                  \
> > > >                              : ((bb)->count.to_frequency (cfun)              \
> > > >                                * REG_FREQ_MAX / BB_FREQ_MAX)                 \
> > > >                              ? ((bb)->count.to_frequency (cfun)              \
> > > >                                 * REG_FREQ_MAX / BB_FREQ_MAX)                \
> > > >                              : 1)
> > > >
> > > > 1000 is the default.  If it isn't 1000, it isn't the default.  I only want
> > > > to get a more reasonable default scale, instead of 1000.   Lower
> > > > scale will fail the PR rtl-optimization/111673 test on powerpc64.
> > >
> > > I see.  Why not adjust the above macro then?  That would be a bit more obvious.  Like use MAX/2 or so?
> >
> > commit 3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
> > Author: Surya Kumari Jangala <jskumari@linux.ibm.com>
> > Date:   Tue Jun 25 08:37:49 2024 -0500
> >
> >     ira: Scale save/restore costs of callee save registers with block frequency
> >
> > uses REG_FREQ_FROM_BB as the cost scale.  I don't know if it is a misuse.
> > I don't want to change REG_FREQ_FROM_BB since it is used in other places,
> > not as a cost scale.  Maybe the above commit should be reverted and we add
> > a target hook for callee-saved register cost scale.  Each target can choose
> > a proper cost scale, install of increasing the cost by 1000x for everyone.
>
> I believe testing cfun->cfg->count_max.initialized_p () is a bit odd at least,
> as it doesn't seem to be used.  The comment talks about profile feedback,
> but for example with -fprofile-correction or -fpartial-profile this
> test looks odd.
> In fact optimize_function_for_size_p should already handle this correctly.
>
> Also REG_FREQ_FROM_BB simply documents that in this case the
> frequency will be equivalent for all BBs and not any particular value.  The new
> use might indeed not have the same constraints as others, instead of
> a target hook making the "same value" another macro argument might be
> a good first step.
>
> That said - does removing the || !cfun->cfg->count_max.initialized_p ()
> test "fix" things?

I don't think so.  Frequency value can be anything, 1000 or 100000.
It is relative.  It is very odd to use it as the cost scale.  Here is the v2
patch

https://gcc.gnu.org/pipermail/gcc-patches/2025-February/674940.html

to add a target hook for callee-saved register cost scale and restore the
old behavior for x86.
  
Jan Hubicka Feb. 3, 2025, 12:55 p.m. UTC | #7
> On Mon, Feb 3, 2025 at 5:21 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Sun, Feb 2, 2025 at 9:29 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Sun, Feb 2, 2025 at 4:20 PM Richard Biener
> > > <richard.guenther@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > > Am 02.02.2025 um 08:59 schrieb H.J. Lu <hjl.tools@gmail.com>:
> > > > >
> > > > > On Sun, Feb 2, 2025 at 3:33 PM Richard Biener
> > > > > <richard.guenther@gmail.com> wrote:
> > > > >>
> > > > >>
> > > > >>
> > > > >>>> Am 02.02.2025 um 08:00 schrieb H.J. Lu <hjl.tools@gmail.com>:
> > > > >>>
> > > > >>> Don't increase callee-saved register cost by 1000x, which leads to that
> > > > >>> callee-saved registers aren't used to preserve local variable values
> > > > >>> across calls, by capping the scale to 300.
> > > > >>
> > > > >>>   PR rtl-optimization/111673
> > > > >>>   PR rtl-optimization/115932
> > > > >>>   PR rtl-optimization/116028
> > > > >>>   PR rtl-optimization/117081
> > > > >>>   PR rtl-optimization/118497
> > > > >>>   * ira-color.cc (assign_hard_reg): Cap callee-saved register cost
> > > > >>>   scale to 300.
> > > > >>>
> > > > >>> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> > > > >>> ---
> > > > >>> gcc/ira-color.cc | 16 ++++++++++++++--
> > > > >>> 1 file changed, 14 insertions(+), 2 deletions(-)
> > > > >>>
> > > > >>> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> > > > >>> index 0699b349a1a..707ff188250 100644
> > > > >>> --- a/gcc/ira-color.cc
> > > > >>> +++ b/gcc/ira-color.cc
> > > > >>> @@ -2175,13 +2175,25 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
> > > > >>>     /* We need to save/restore the hard register in
> > > > >>>        epilogue/prologue.  Therefore we increase the cost.  */
> > > > >>>     {
> > > > >>> +        int scale;
> > > > >>> +        if (optimize_size)
> > > > >>> +          scale = 1;
> > > > >>> +        else
> > > > >>> +          {
> > > > >>> +        scale = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
> > > > >>> +        /* Don't increase callee-saved register cost by 1000x,
> > > > >>> +           which leads to that callee-saved registers aren't
> > > > >>> +           used to preserve local variable values across calls,
> > > > >>> +           by capping the scale to 300.  */
> > > > >>> +        if (REG_FREQ_MAX == 1000 && scale == REG_FREQ_MAX)
> > > > >>> +          scale = 300;
> > > > >>
> > > > >> That leads to 300 for 1000 but 999 for 999 which is odd.  I’d have expected to scale this down to [0, 300] or is MAX a magic value?
> > > > >
> > > > > There are
> > > > >
> > > > > * The weights for each insn varies from 0 to REG_FREQ_BASE.
> > > > >   This constant does not need to be high, as in infrequently executed
> > > > >   regions we want to count instructions equivalently to optimize for
> > > > >   size instead of speed.  */
> > > > > #define REG_FREQ_MAX 1000
> > > > >
> > > > > /* Compute register frequency from the BB frequency.  When optimizing for size,
> > > > >   or profile driven feedback is available and the function is never executed,
> > > > >   frequency is always equivalent.  Otherwise rescale the basic block
> > > > >   frequency.  */
> > > > > #define REG_FREQ_FROM_BB(bb) ((optimize_function_for_size_p (cfun)            \
> > > > >                               || !cfun->cfg->count_max.initialized_p ())     \
> > > > >                              ? REG_FREQ_MAX                                  \
> > > > >                              : ((bb)->count.to_frequency (cfun)              \
> > > > >                                * REG_FREQ_MAX / BB_FREQ_MAX)                 \
> > > > >                              ? ((bb)->count.to_frequency (cfun)              \
> > > > >                                 * REG_FREQ_MAX / BB_FREQ_MAX)                \
> > > > >                              : 1)
> > > > >
> > > > > 1000 is the default.  If it isn't 1000, it isn't the default.  I only want
> > > > > to get a more reasonable default scale, instead of 1000.   Lower
> > > > > scale will fail the PR rtl-optimization/111673 test on powerpc64.
> > > >
> > > > I see.  Why not adjust the above macro then?  That would be a bit more obvious.  Like use MAX/2 or so?
> > >
> > > commit 3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
> > > Author: Surya Kumari Jangala <jskumari@linux.ibm.com>
> > > Date:   Tue Jun 25 08:37:49 2024 -0500
> > >
> > >     ira: Scale save/restore costs of callee save registers with block frequency
> > >
> > > uses REG_FREQ_FROM_BB as the cost scale.  I don't know if it is a misuse.
> > > I don't want to change REG_FREQ_FROM_BB since it is used in other places,
> > > not as a cost scale.  Maybe the above commit should be reverted and we add
> > > a target hook for callee-saved register cost scale.  Each target can choose
> > > a proper cost scale, install of increasing the cost by 1000x for everyone.
> >
> > I believe testing cfun->cfg->count_max.initialized_p () is a bit odd at least,
> > as it doesn't seem to be used.  The comment talks about profile feedback,
> > but for example with -fprofile-correction or -fpartial-profile this
> > test looks odd.
> > In fact optimize_function_for_size_p should already handle this correctly.

count_max is computed in update_max_bb_count and if some basic blocks do
not have initialized count, they are ignored. So if you have any
initialized count, count_max should be greater or equal to it
after update call.  This may be later changed if multiple code paths are
merged to one and then BB count can be somewhat larger than max.

With -fprofile-correction or -fpartial-profile all BBs sould have counts
initialized. If profile feedback is missing we keep guessed profile if
feedback is incoherent we only attemt to correct and it and drop quality
info.  If some passes later create new BBs with unitialized profile,
they should be fixed.

What can happen is inlining function without profile (-O0) to function
with profile.  In that case you get CFG with both initialized and
uninitialized BBs.  But we probably do not care that much about
performance of htis.
> >
> > Also REG_FREQ_FROM_BB simply documents that in this case the
> > frequency will be equivalent for all BBs and not any particular value.  The new
> > use might indeed not have the same constraints as others, instead of
> > a target hook making the "same value" another macro argument might be
> > a good first step.
> >
> > That said - does removing the || !cfun->cfg->count_max.initialized_p ()
> > test "fix" things?
> 
> I don't think so.  Frequency value can be anything, 1000 or 100000.
> It is relative.  It is very odd to use it as the cost scale.  Here is the v2
> patch
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2025-February/674940.html
> 
> to add a target hook for callee-saved register cost scale and restore the
> old behavior for x86.

Original prupose of REG_FREQ_FROM_BB macro was to translate new profile
counts to old BB frequencies.  Those were saled to be in the range
0....REG_FREQ_MAX where REG_FREQ_MAX was set to 1000 so one can do third
power in 32bit value chaply.  In early 2000s when this code was written
64bit arithmetics was expensive.

It also tries to imitate old code in register allocator which used
constnat value of 1000 for everything if profile was missing or we
optimize for size (and thus static count of reloads).

These days it is probably smoother to write code that cares about
relative frequencies using sreals unless it is very performance
critical.

Honza
  
Jan Hubicka Feb. 3, 2025, 12:57 p.m. UTC | #8
> > > > #define REG_FREQ_FROM_BB(bb) ((optimize_function_for_size_p (cfun)            \
> > > >                               || !cfun->cfg->count_max.initialized_p ())     \
> > > >                              ? REG_FREQ_MAX                                  \
> > > >                              : ((bb)->count.to_frequency (cfun)              \
> > > >                                * REG_FREQ_MAX / BB_FREQ_MAX)                 \
> > > >                              ? ((bb)->count.to_frequency (cfun)              \
> > > >                                 * REG_FREQ_MAX / BB_FREQ_MAX)                \
> > > >                              : 1)
> > > >
> > > > 1000 is the default.  If it isn't 1000, it isn't the default.  I only want
> > > > to get a more reasonable default scale, instead of 1000.   Lower
> > > > scale will fail the PR rtl-optimization/111673 test on powerpc64.
> > >
> > > I see.  Why not adjust the above macro then?  That would be a bit more obvious.  Like use MAX/2 or so?
> >
> > commit 3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
> > Author: Surya Kumari Jangala <jskumari@linux.ibm.com>
> > Date:   Tue Jun 25 08:37:49 2024 -0500
> >
> >     ira: Scale save/restore costs of callee save registers with block frequency
> >
> > uses REG_FREQ_FROM_BB as the cost scale.  I don't know if it is a misuse.
> > I don't want to change REG_FREQ_FROM_BB since it is used in other places,
> > not as a cost scale.  Maybe the above commit should be reverted and we add
> > a target hook for callee-saved register cost scale.  Each target can choose
> > a proper cost scale, install of increasing the cost by 1000x for everyone.
> 
> I believe testing cfun->cfg->count_max.initialized_p () is a bit odd at least,
> as it doesn't seem to be used.  The comment talks about profile feedback,

It is used by count.to_frequency, which basically computes count/max_count * REG_FREQ_MAX.
It aborts if max_count is uninitialized rather than returning arbitrary
value...

Honza
  

Patch

diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
index 0699b349a1a..707ff188250 100644
--- a/gcc/ira-color.cc
+++ b/gcc/ira-color.cc
@@ -2175,13 +2175,25 @@  assign_hard_reg (ira_allocno_t a, bool retry_p)
 	  /* We need to save/restore the hard register in
 	     epilogue/prologue.  Therefore we increase the cost.  */
 	  {
+	    int scale;
+	    if (optimize_size)
+	      scale = 1;
+	    else
+	      {
+		scale = REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+		/* Don't increase callee-saved register cost by 1000x,
+		   which leads to that callee-saved registers aren't
+		   used to preserve local variable values across calls,
+		   by capping the scale to 300.  */
+		if (REG_FREQ_MAX == 1000 && scale == REG_FREQ_MAX)
+		  scale = 300;
+	      }
 	    rclass = REGNO_REG_CLASS (hard_regno);
 	    add_cost = ((ira_memory_move_cost[mode][rclass][0]
 		         + ira_memory_move_cost[mode][rclass][1])
 		        * saved_nregs / hard_regno_nregs (hard_regno,
 							  mode) - 1)
-		       * (optimize_size ? 1 :
-			  REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
+		       * scale;
 	    cost += add_cost;
 	    full_cost += add_cost;
 	  }