AArch64: Tune case-values-threshold

Message ID VE1PR08MB559988D252EEABF494E24BAD83BC9@VE1PR08MB5599.eurprd08.prod.outlook.com
State Committed
Commit 9c751b88def09f71e1edbf623d41776478ead6fd
Headers
Series AArch64: Tune case-values-threshold |

Commit Message

Wilco Dijkstra Oct. 18, 2021, 4:19 p.m. UTC
  Tune the case-values-threshold setting for modern cores.  A value of 11 improves
SPECINT2017 by 0.2% and reduces codesize by 0.04%.  With -Os use value 8 which
reduces codesize by 0.07%.

Passes regress, OK for commit?

ChangeLog:

2021-10-18  Wilco Dijkstra  <wdijkstr@arm.com>

        * config/aarch64/aarch64.c (aarch64_case_values_threshold):
        Change to 8 with -Os, 11 otherwise.

---
  

Comments

Richard Sandiford Oct. 18, 2021, 4:31 p.m. UTC | #1
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Tune the case-values-threshold setting for modern cores.  A value of 11 improves
> SPECINT2017 by 0.2% and reduces codesize by 0.04%.  With -Os use value 8 which
> reduces codesize by 0.07%.
>
> Passes regress, OK for commit?
>
> ChangeLog:
>
> 2021-10-18  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         * config/aarch64/aarch64.c (aarch64_case_values_threshold):
>         Change to 8 with -Os, 11 otherwise.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index f5b25a7f7041645921e6ad85714efda73b993492..adc5256c5ccc1182710d87cc6a1091083d888663 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -9360,8 +9360,8 @@ aarch64_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>     The expansion for a table switch is quite expensive due to the number
>     of instructions, the table lookup and hard to predict indirect jump.
>     When optimizing for speed, and -O3 enabled, use the per-core tuning if
> -   set, otherwise use tables for > 16 cases as a tradeoff between size and
> -   performance.  When optimizing for size, use the default setting.  */
> +   set, otherwise use tables for >= 11 cases as a tradeoff between size and
> +   performance.  When optimizing for size, use 8 for smallest codesize.  */

I'm just concerned that here we're using the same explanation but with
different numbers.  Why are the new numbers more right than the old ones
(especially when it comes to code size, where the trade-off hasn't
really changed)?

It would be good to have more discussion of why certain numbers are
too small or too high, and why 8 is the right pivot point for -Os.

Thanks,
Richard

>
>  static unsigned int
>  aarch64_case_values_threshold (void)
> @@ -9372,7 +9372,7 @@ aarch64_case_values_threshold (void)
>        && selected_cpu->tune->max_case_values != 0)
>      return selected_cpu->tune->max_case_values;
>    else
> -    return optimize_size ? default_case_values_threshold () : 17;
> +    return optimize_size ? 8 : 11;
>  }
>
>  /* Return true if register REGNO is a valid index register.
  
Wilco Dijkstra Oct. 19, 2021, 12:57 p.m. UTC | #2
Hi Richard,

> I'm just concerned that here we're using the same explanation but with
> different numbers.  Why are the new numbers more right than the old ones
> (especially when it comes to code size, where the trade-off hasn't
> really changed)?

Like all tuning/costing parameters, these values are never fixed but change
over time due to optimizations, micro architectures and workloads.
The previous values were out of date so that's why I retuned them by
benchmarking different values and choosing the best combinations.

> It would be good to have more discussion of why certain numbers are
> too small or too high, and why 8 is the right pivot point for -Os.

You mean add more discussion in the comment? That comment is already overly
large and too specific - it would be better to reduce it. The -Os value was never
tuned, and 8 turns out to be faster and smaller than GCC's default.

Cheers,
Wilco
  
Richard Sandiford Oct. 19, 2021, 1:23 p.m. UTC | #3
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi Richard,
>
>> I'm just concerned that here we're using the same explanation but with
>> different numbers.  Why are the new numbers more right than the old ones
>> (especially when it comes to code size, where the trade-off hasn't
>> really changed)?
>
> Like all tuning/costing parameters, these values are never fixed but change
> over time due to optimizations, micro architectures and workloads.
> The previous values were out of date so that's why I retuned them by
> benchmarking different values and choosing the best combinations.
>
>> It would be good to have more discussion of why certain numbers are
>> too small or too high, and why 8 is the right pivot point for -Os.
>
> You mean add more discussion in the comment? That comment is already overly
> large and too specific - it would be better to reduce it. The -Os value was never
> tuned, and 8 turns out to be faster and smaller than GCC's default.

The problem is that you're effectively asking for these values to be
taken on faith without providing any analysis and without describing
how you arrived at the new numbers.  Did you try other values too?
If so, how did they compare with the numbers that you finally chose?
At least that would give an indication of where the boundaries are.

For example, it's easier to believe that 8 is the right value for -Os if
you say that you tried 9 and 7 as well, and they were worse than 8 by X%
and Y%.  This would also help anyone who wants to tweak the numbers
again in future.

BTW, which CPU did you use to do the experiments?  Are the tuning
parameters for that CPU already consistent with the new generic values?

Thanks,
Richard
  
Bernhard Reutner-Fischer Oct. 19, 2021, 2:37 p.m. UTC | #4
On 19 October 2021 15:23:58 CEST, Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>> Hi Richard,
>>
>>> I'm just concerned that here we're using the same explanation but with
>>> different numbers.  Why are the new numbers more right than the old ones
>>> (especially when it comes to code size, where the trade-off hasn't
>>> really changed)?
>>
>> Like all tuning/costing parameters, these values are never fixed but change
>> over time due to optimizations, micro architectures and workloads.
>> The previous values were out of date so that's why I retuned them by
>> benchmarking different values and choosing the best combinations.
>>
>>> It would be good to have more discussion of why certain numbers are
>>> too small or too high, and why 8 is the right pivot point for -Os.
>>
>> You mean add more discussion in the comment? That comment is already overly
>> large and too specific - it would be better to reduce it. The -Os value was never
>> tuned, and 8 turns out to be faster and smaller than GCC's default.
>
>The problem is that you're effectively asking for these values to be
>taken on faith without providing any analysis and without describing
>how you arrived at the new numbers.  Did you try other values too?
>If so, how did they compare with the numbers that you finally chose?
>At least that would give an indication of where the boundaries are.

Maybe you can show csibe benchmark numbers to show the effects:
http://szeged.github.io/csibe/

thanks,
>
>For example, it's easier to believe that 8 is the right value for -Os if
>you say that you tried 9 and 7 as well, and they were worse than 8 by X%
>and Y%.  This would also help anyone who wants to tweak the numbers
>again in future.
>
>BTW, which CPU did you use to do the experiments?  Are the tuning
>parameters for that CPU already consistent with the new generic values?
>
>Thanks,
>Richard
  
Wilco Dijkstra Oct. 19, 2021, 2:55 p.m. UTC | #5
Hi Richard,

> The problem is that you're effectively asking for these values to be
> taken on faith without providing any analysis and without describing
> how you arrived at the new numbers.  Did you try other values too?
> If so, how did they compare with the numbers that you finally chose?
> At least that would give an indication of where the boundaries are.

Yes, I obviously tried other values, pretty much all in range 1-20. There is
generally a range of 4-5 values that are very similar in size, and then you
choose one in the middle which also looks good for performance.

> For example, it's easier to believe that 8 is the right value for -Os if
> you say that you tried 9 and 7 as well, and they were worse than 8 by X%
> and Y%.  This would also help anyone who wants to tweak the numbers
> again in future.

For -Os, the size range for values 6-10 is within 0.01% so they are virtually
identical and I picked the median. Whether this will remain best in the future
is unclear since it depends on so many things, so at some point it needs
to be looked at again, just like most other tunings.

> BTW, which CPU did you use to do the experiments?  Are the tuning
> parameters for that CPU already consistent with the new generic values?

This was done on Neoverse N1. Almost no CPUs use per-CPU tuning for this.

Cheers,
Wilco
  
Richard Sandiford Oct. 20, 2021, 11:30 a.m. UTC | #6
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi Richard,
>
>> The problem is that you're effectively asking for these values to be
>> taken on faith without providing any analysis and without describing
>> how you arrived at the new numbers.  Did you try other values too?
>> If so, how did they compare with the numbers that you finally chose?
>> At least that would give an indication of where the boundaries are.
>
> Yes, I obviously tried other values, pretty much all in range 1-20. There is
> generally a range of 4-5 values that are very similar in size, and then you
> choose one in the middle which also looks good for performance.
>
>> For example, it's easier to believe that 8 is the right value for -Os if
>> you say that you tried 9 and 7 as well, and they were worse than 8 by X%
>> and Y%.  This would also help anyone who wants to tweak the numbers
>> again in future.
>
> For -Os, the size range for values 6-10 is within 0.01% so they are virtually
> identical and I picked the median. Whether this will remain best in the future
> is unclear since it depends on so many things, so at some point it needs
> to be looked at again, just like most other tunings.

Thanks.  These details are useful.  For example, if someone finds
a compelling reason to bump the new values by +/-2 (to help with a
particular test case) then it sounds we should accept that, since it
wouldn't conflict with your work.

So the patch is OK, thanks.

(FWIW, I tried building a linux kernel I had lying around at -Os,
which also showed an improvement of ~0.07%.)

Richard
  

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f5b25a7f7041645921e6ad85714efda73b993492..adc5256c5ccc1182710d87cc6a1091083d888663 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -9360,8 +9360,8 @@  aarch64_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
    The expansion for a table switch is quite expensive due to the number
    of instructions, the table lookup and hard to predict indirect jump.
    When optimizing for speed, and -O3 enabled, use the per-core tuning if 
-   set, otherwise use tables for > 16 cases as a tradeoff between size and
-   performance.  When optimizing for size, use the default setting.  */
+   set, otherwise use tables for >= 11 cases as a tradeoff between size and
+   performance.  When optimizing for size, use 8 for smallest codesize.  */
 
 static unsigned int
 aarch64_case_values_threshold (void)
@@ -9372,7 +9372,7 @@  aarch64_case_values_threshold (void)
       && selected_cpu->tune->max_case_values != 0)
     return selected_cpu->tune->max_case_values;
   else
-    return optimize_size ? default_case_values_threshold () : 17;
+    return optimize_size ? 8 : 11;
 }
 
 /* Return true if register REGNO is a valid index register.