rs6000: Suggest unroll factor for loop vectorization

Message ID	dd251673-29f8-3310-988f-a957c98b7dab@linux.ibm.com
State	New
Headers	DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org B8FA73854815
	Message-ID: <dd251673-29f8-3310-988f-a957c98b7dab@linux.ibm.com>
	Date: Wed, 20 Jul 2022 17:30:35 +0800
	MIME-Version: 1.0
	User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.6.1
	Content-Language: en-US
	To: GCC Patches <gcc-patches@gcc.gnu.org>
	Subject: [PATCH] rs6000: Suggest unroll factor for loop vectorization
	Content-Type: text/plain; charset=UTF-8
	Content-Transfer-Encoding: 7bit
	Precedence: list
	From: "Kewen.Lin via Gcc-patches" <gcc-patches@gcc.gnu.org>
	Reply-To: "Kewen.Lin" <linkw@linux.ibm.com>
	Cc: Richard Sandiford <richard.sandiford@arm.com>, Peter Bergner <bergner@linux.ibm.com>, David Edelsohn <dje.gcc@gmail.com>, Segher Boessenkool <segher@kernel.crashing.org>
	Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org
	Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>
Series	rs6000: Suggest unroll factor for loop vectorization

Commit Message

Kewen.Lin July 20, 2022, 9:30 a.m. UTC

  Hi,

Commit r12-6679-g7ca1582ca60dc8 made the vectorizer accept an
unroll factor to be applied to the vectorization factor when
vectorizing the main loop; the target suggests this factor
during costing.

This patch introduces the function
determine_suggested_unroll_factor for the rs6000 port, so that
it can suggest an unroll factor for a given loop being
vectorized.  Following the aarch64 port and based on analysis
of SPEC2017 performance evaluation results, it mainly
considers these aspects:
  1) the unroll option and pragma, which can disable unrolling
     for the given loop;
  2) a simple hardware resource model based on how many
     non-memory-access vector insns can be issued per cycle;
  3) aggressive heuristics when the iteration count is unknown:
     - the reduction case, to break the cross-iteration
       dependency;
     - emulated gather load;
  4) the estimated iteration count when the iteration count is
     unknown.
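
For illustration only (not part of the patch): a minimal standalone
sketch of aspect 2, the resource-model computation, assuming the
parameter defaults given in this patch.  The names base_unroll_factor
and round_up_pow2 are hypothetical; the patch itself uses GCC's CEIL,
MIN and ceil_log2 inside determine_suggested_unroll_factor.

```cpp
#include <algorithm>

// Smallest power of two >= x, for x >= 1.  Stands in for the patch's
// "1 << ceil_log2 (uf)".
unsigned round_up_pow2(unsigned x) {
  unsigned p = 1;
  while (p < x)
    p <<= 1;
  return p;
}

// Hypothetical helper: base unroll factor from the simple resource model.
// issue_width and limit default to the values this patch gives for
// rs6000-vect-unroll-issue and rs6000-vect-unroll-limit.
unsigned base_unroll_factor(unsigned nstmts, unsigned nloads,
                            unsigned nstores, unsigned reduc_factor,
                            unsigned issue_width = 4, unsigned limit = 4) {
  unsigned nonldst = nstmts - nloads - nstores;
  if (nonldst == 0)
    return 1;  // Only memory accesses: do not unroll.

  unsigned rf = std::max(reduc_factor, 1u);
  unsigned uf = (rf * issue_width + nonldst - 1) / nonldst;  // CEIL
  uf = std::min(limit, uf);
  return round_up_pow2(uf);
}
```

With the defaults, a loop with 5 vector stmts (2 loads, 1 store) and a
reduction factor of 2 gets CEIL (2 * 4, 2) = 4.  Because the rounding to
a power of 2 happens after the clamp, a non-power-of-2 limit such as 3
can still yield 4 in this sketch.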

With this patch, SPEC2017 performance evaluation results on
Power8/9/10 are listed below (speedup pct.):

  * Power10
    - O2: all are neutral (excluding some noise);
    - Ofast: 510.parest_r +6.67%, the others are neutral
             (abbreviated as ... below);
    - Ofast + unroll: 510.parest_r +5.91%, ...
    - Ofast + LTO + PGO: 510.parest_r +3.00%, ...
    - Ofast + cheap vect cost: 510.parest_r +6.23%, ...
    - Ofast + very-cheap vect cost: all are neutral;

  * Power9
    - Ofast: 510.parest_r +8.73%, 538.imagick_r +11.18%
             (likely noise), 500.perlbench_r +1.84%, ...

  * Power8
    - Ofast: 510.parest_r +5.43%, ...;

This patch also introduces one documented parameter,
rs6000-vect-unroll-limit=, similar to what aarch64 provides.
Evaluation on P8/P9/P10 showed that the default value 4 is
slightly better than other choices such as 2 and 8.

It also parameterizes two other values as undocumented
parameters for future tweaking.  The first is
rs6000-vect-unroll-issue, which simply models the hardware
resources for non-memory-access vector instructions to avoid
excessive unrolling.  Initially I tried to use the value from
the hook rs6000_issue_rate, but the evaluation showed poor
results, so I evaluated the values 2/4/6/8 on P8/P9/P10 at
Ofast; the results showed that the default value 4 is good
enough on these different architectures.  For the record,
choice 8 could make 510.parest_r's gain smaller or disappear
on P8/P9/P10; choice 6 could make 503.bwaves_r degrade by more
than 1% on P8/P10; and choice 2 could make 538.imagick_r
degrade by 3.8%.  The second is
rs6000-vect-unroll-reduc-threshold.  It is mainly inspired by
510.parest_r and tuned for it; evaluating the values 0/1/2/3
for the threshold showed that 1 is the best choice.  For the
record, choice 0 could make 525.x264_r degrade by 2% and
527.cam4_r degrade by 2.95% on P10, and 548.exchange2_r
degrade by 1.41% and 527.cam4_r degrade by 2.54% on P8;
choices 2 and above could make 510.parest_r's gain smaller.
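
For illustration only (not part of the patch): a minimal standalone
sketch of how the reduction factor interacts with
rs6000-vect-unroll-reduc-threshold when the iteration count is unknown.
ReductionStmt, reduction_factor and prefers_aggressive_unroll are
hypothetical names; the cost of 2 for fp and 1 for int reductions and
the default threshold of 1 are taken from this patch.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical record for one reduction stmt seen during costing.
struct ReductionStmt {
  bool is_float;   // fp reductions are assumed to take one more cycle
  unsigned count;  // number of copies of the stmt
};

// The maximum of basic_cost * count over all reduction stmts,
// or 0 when the loop has no reduction.
unsigned reduction_factor(const std::vector<ReductionStmt> &stmts) {
  unsigned factor = 0;
  for (const ReductionStmt &s : stmts) {
    unsigned basic_cost = s.is_float ? 2 : 1;
    factor = std::max(basic_cost * s.count, factor);
  }
  return factor;
}

// True when the aggressive heuristics would accept the unroll factor
// without consulting the estimated iteration count.
bool prefers_aggressive_unroll(unsigned reduc_factor, bool gather_load,
                               unsigned threshold = 1) {
  unsigned rf = std::max(reduc_factor, 1u);  // clamped to at least 1
  return rf > threshold || gather_load;
}
```

In this sketch a single fp reduction gives factor 2 and passes the
default threshold, while a single int reduction gives factor 1 and does
not.  Since the factor is clamped to at least 1, a threshold of 0 lets
every loop through.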

Bootstrapped and regtested on powerpc64-linux-gnu P7 and P8,
and powerpc64le-linux-gnu P9.  Bootstrapped on
powerpc64le-linux-gnu P10, but one failure was exposed during
regression testing there.  It has been identified as a missed
optimization that can be reproduced without this patch;
PR106365 was opened to track it.

Is it OK for trunk?

BR,
Kewen
------
gcc/ChangeLog:

	* config/rs6000/rs6000.cc (class rs6000_cost_data): Add new members
	m_nstores, m_reduc_factor, m_gather_load and member function
	determine_suggested_unroll_factor.
	(rs6000_cost_data::update_target_cost_per_stmt): Update for m_nstores,
	m_reduc_factor and m_gather_load.
	(rs6000_cost_data::determine_suggested_unroll_factor): New function.
	(rs6000_cost_data::finish_cost): Use determine_suggested_unroll_factor.
	* config/rs6000/rs6000.opt (rs6000-vect-unroll-limit): New parameter.
	(rs6000-vect-unroll-issue): Likewise.
	(rs6000-vect-unroll-reduc-threshold): Likewise.
	* doc/invoke.texi (rs6000-vect-unroll-limit): Document new parameter.

---
 gcc/config/rs6000/rs6000.cc  | 125 ++++++++++++++++++++++++++++++++++-
 gcc/config/rs6000/rs6000.opt |  18 +++++
 gcc/doc/invoke.texi          |   7 ++
 3 files changed, 147 insertions(+), 3 deletions(-)

--
2.27.0

Comments

Kewen.Lin Aug. 15, 2022, 8:05 a.m. UTC | #1

Hi,

Gentle ping: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598601.html

BR,
Kewen

on 2022/7/20 17:30, Kewen.Lin via Gcc-patches wrote:
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 3ff16b8ae04..d0f107d70a8 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -5208,16 +5208,23 @@ protected:
>  				    vect_cost_model_location, unsigned int);
>    void density_test (loop_vec_info);
>    void adjust_vect_cost_per_loop (loop_vec_info);
> +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
> 
>    /* Total number of vectorized stmts (loop only).  */
>    unsigned m_nstmts = 0;
>    /* Total number of loads (loop only).  */
>    unsigned m_nloads = 0;
> +  /* Total number of stores (loop only).  */
> +  unsigned m_nstores = 0;
> +  /* Reduction factor for suggesting unroll factor (loop only).  */
> +  unsigned m_reduc_factor = 0;
>    /* Possible extra penalized cost on vector construction (loop only).  */
>    unsigned m_extra_ctor_cost = 0;
>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>       instruction is needed by the vectorization.  */
>    bool m_vect_nonmem = false;
> +  /* If this loop gets vectorized with emulated gather load.  */
> +  bool m_gather_load = false;
>  };
> 
>  /* Test for likely overcommitment of vector hardware resources.  If a
> @@ -5368,9 +5375,34 @@ rs6000_cost_data::update_target_cost_per_stmt (vect_cost_for_stmt kind,
>      {
>        m_nstmts += orig_count;
> 
> -      if (kind == scalar_load || kind == vector_load
> -	  || kind == unaligned_load || kind == vector_gather_load)
> -	m_nloads += orig_count;
> +      if (kind == scalar_load
> +	  || kind == vector_load
> +	  || kind == unaligned_load
> +	  || kind == vector_gather_load)
> +	{
> +	  m_nloads += orig_count;
> +	  if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> +	    m_gather_load = true;
> +	}
> +      else if (kind == scalar_store
> +	       || kind == vector_store
> +	       || kind == unaligned_store
> +	       || kind == vector_scatter_store)
> +	m_nstores += orig_count;
> +      else if ((kind == scalar_stmt
> +		|| kind == vector_stmt
> +		|| kind == vec_to_scalar)
> +	       && stmt_info
> +	       && vect_is_reduction (stmt_info))
> +	{
> +	  /* Loop body contains normal int or fp operations and epilogue
> +	     contains vector reduction.  For simplicity, we assume int
> +	     operation takes one cycle and fp operation takes one more.  */
> +	  tree lhs = gimple_get_lhs (stmt_info->stmt);
> +	  bool is_float = FLOAT_TYPE_P (TREE_TYPE (lhs));
> +	  unsigned int basic_cost = is_float ? 2 : 1;
> +	  m_reduc_factor = MAX (basic_cost * orig_count, m_reduc_factor);
> +	}
> 
>        /* Power processors do not currently have instructions for strided
>  	 and elementwise loads, and instead we must generate multiple
> @@ -5462,6 +5494,90 @@ rs6000_cost_data::adjust_vect_cost_per_loop (loop_vec_info loop_vinfo)
>      }
>  }
> 
> +/* Determine suggested unroll factor by considering some below factors:
> +
> +    - unroll option/pragma which can disable unrolling for this loop;
> +    - simple hardware resource model for non memory vector insns;
> +    - aggressive heuristics when iteration count is unknown:
> +      - reduction case to break cross iteration dependency;
> +      - emulated gather load;
> +    - estimated iteration count when iteration count is unknown;
> +*/
> +
> +
> +unsigned int
> +rs6000_cost_data::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
> +{
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +
> +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> +  if (loop->unroll == 1
> +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> +    return 1;
> +
> +  unsigned int nstmts_nonldst = m_nstmts - m_nloads - m_nstores;
> +  /* Don't unroll if no vector instructions excepting for memory access.  */
> +  if (nstmts_nonldst == 0)
> +    return 1;
> +
> +  /* Consider breaking cross iteration dependency for reduction.  */
> +  unsigned int reduc_factor = m_reduc_factor > 1 ? m_reduc_factor : 1;
> +
> +  /* Use this simple hardware resource model that how many non ld/st
> +     vector instructions can be issued per cycle.  */
> +  unsigned int issue_width = rs6000_vect_unroll_issue;
> +  unsigned int uf = CEIL (reduc_factor * issue_width, nstmts_nonldst);
> +  uf = MIN ((unsigned int) rs6000_vect_unroll_limit, uf);
> +  /* Make sure it is power of 2.  */
> +  uf = 1 << ceil_log2 (uf);
> +
> +  /* If the iteration count is known, the costing would be exact enough,
> +     don't worry it could be worse.  */
> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +    return uf;
> +
> +  /* Inspired by SPEC2017 parest_r, we want to aggressively unroll the
> +     loop if either condition is satisfied:
> +       - reduction factor exceeds the threshold;
> +       - emulated gather load adopted.  */
> +  if (reduc_factor > (unsigned int) rs6000_vect_unroll_reduc_threshold
> +      || m_gather_load)
> +    return uf;
> +
> +  /* Check if we can conclude it's good to unroll from the estimated
> +     iteration count.  */
> +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> +  unsigned int unrolled_vf = vf * uf;
> +  if (est_niter == -1 || est_niter < unrolled_vf)
> +    /* When the estimated iteration of this loop is unknown, it's possible
> +       that we are able to vectorize this loop with the original VF but fail
> +       to vectorize it with the unrolled VF any more if the actual iteration
> +       count is in between.  */
> +    return 1;
> +  else
> +    {
> +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> +      unsigned int epil_niter = est_niter % vf;
> +      /* Even if we have partial vector support, it can be still inefficient
> +	 to calculate the length when the iteration count is unknown, so
> +	 only expect it's good to unroll when the epilogue iteration count
> +	 is not bigger than VF (only one time length calculation).  */
> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +	  && epil_niter_unr <= vf)
> +	return uf;
> +      /* Without partial vector support, conservatively unroll this when
> +	 the epilogue iteration count is less than the original one
> +	 (epilogue execution time wouldn't be longer than before).  */
> +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +	       && epil_niter_unr <= epil_niter)
> +	return uf;
> +    }
> +
> +  return 1;
> +}
> +
>  void
>  rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>  {
> @@ -5478,6 +5594,9 @@ rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>  	  && LOOP_VINFO_VECT_FACTOR (loop_vinfo) == 2
>  	  && LOOP_REQUIRES_VERSIONING (loop_vinfo))
>  	m_costs[vect_body] += 10000;
> +
> +      m_suggested_unroll_factor
> +	= determine_suggested_unroll_factor (loop_vinfo);
>      }
> 
>    vector_costs::finish_cost (scalar_costs);
> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
> index 4931d781c4e..80c2c61a9de 100644
> --- a/gcc/config/rs6000/rs6000.opt
> +++ b/gcc/config/rs6000/rs6000.opt
> @@ -624,6 +624,14 @@ mieee128-constant
>  Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
>  Generate (do not generate) code that uses the LXVKQ instruction.
> 
> +; Documented parameters
> +
> +-param=rs6000-vect-unroll-limit=
> +Target Joined UInteger Var(rs6000_vect_unroll_limit) Init(4) IntegerRange(1, 64) Param
> +Used to limit unroll factor which indicates how much the autovectorizer may
> +unroll a loop.  The default value is 4.
> +
> +; Undocumented parameters
>  -param=rs6000-density-pct-threshold=
>  Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) Init(85) IntegerRange(0, 100) Param
>  When costing for loop vectorization, we probably need to penalize the loop body
> @@ -661,3 +669,13 @@ Like parameter rs6000-density-load-pct-threshold, we also check if the total
>  number of load statements exceeds the threshold specified by this parameter,
>  and penalize only if it's satisfied.  The default value is 20.
> 
> +-param=rs6000-vect-unroll-issue=
> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_issue) Init(4) IntegerRange(1, 128) Param
> +Indicate how many non memory access vector instructions can be issued per
> +cycle, it's used in unroll factor determination for autovectorizer.  The
> +default value is 4.
> +
> +-param=rs6000-vect-unroll-reduc-threshold=
> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_reduc_threshold) Init(1) Param
> +When reduction factor computed for a loop exceeds the threshold specified by
> +this parameter, prefer to unroll this loop.  The default value is 1.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 84d6f0f9860..097ab1d5563 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -29658,6 +29658,13 @@ Generate (do not generate) code that will run in privileged state.
>  @opindex no-block-ops-unaligned-vsx
>  Generate (do not generate) unaligned vsx loads and stores for
>  inline expansion of @code{memcpy} and @code{memmove}.
> +
> +@item --param rs6000-vect-unroll-limit=
> +The vectorizer will check with target information to determine whether it
> +would be beneficial to unroll the main vectorized loop and by how much.  This
> +parameter sets the upper bound of how much the vectorizer will unroll the main
> +loop.  The default value is four.
> +
>  @end table
> 
>  @node RX Options
> --
> 2.27.0
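
For illustration only (not part of the patch): a minimal standalone
sketch of the last step of determine_suggested_unroll_factor in the
quoted patch, which decides from the estimated iteration count whether
to keep the unroll factor.  unroll_from_estimate is a hypothetical name,
and a negative est_niter stands in for the -1 ("unknown") that the patch
tests for.

```cpp
#include <cstdint>

// Returns uf when unrolling looks safe for the estimated iteration
// count, otherwise 1.  Assumes vf >= 1 and uf >= 1.
unsigned unroll_from_estimate(std::int64_t est_niter, unsigned vf,
                              unsigned uf, bool partial_vectors) {
  std::int64_t unrolled_vf = static_cast<std::int64_t>(vf) * uf;

  // Unknown estimate, or too few iterations to fill one unrolled
  // vector iteration: keep the original VF.
  if (est_niter < 0 || est_niter < unrolled_vf)
    return 1;

  std::int64_t epil_niter_unr = est_niter % unrolled_vf;
  std::int64_t epil_niter = est_niter % vf;

  // With partial vectors, accept when the epilogue needs at most one
  // length calculation, i.e. it covers no more than VF iterations.
  if (partial_vectors)
    return epil_niter_unr <= static_cast<std::int64_t>(vf) ? uf : 1;

  // Without partial vectors, accept only when the epilogue does not
  // get longer than it was before unrolling.
  return epil_niter_unr <= epil_niter ? uf : 1;
}
```

For example, with vf = 4 and uf = 4 an estimate of 100 iterations leaves
100 % 16 = 4 epilogue iterations after unrolling against 100 % 4 = 0
before, so this sketch unrolls only when partial vectors are usable.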

Kewen.Lin Aug. 29, 2022, 6:22 a.m. UTC | #2

Hi,

Gentle ping: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598601.html

BR,
Kewen

> 
> on 2022/7/20 17:30, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Commit r12-6679-g7ca1582ca60dc8 made vectorizer accept one
>> unroll factor to be applied to vectorization factor when
>> vectorizing the main loop, it would be suggested by target
>> when doing costing.
>>
>> This patch introduces function determine_suggested_unroll_factor
>> for rs6000 port, to make it be able to suggest the unroll factor
>> for a given loop being vectorized.  Referring to aarch64 port
>> and basing on the analysis on SPEC2017 performance evaluation
>> results, it mainly considers these aspects:
>>   1) unroll option and pragma which can disable unrolling for the
>>      given loop;
>>   2) simple hardware resource model with issued non memory access
>>      vector insn per cycle;
>>   3) aggressive heuristics when iteration count is unknown:
>>      - reduction case to break cross iteration dependency;
>>      - emulated gather load;
>>   4) estimated iteration count when iteration count is unknown;
>>
>> With this patch, SPEC2017 performance evaluation results on
>> Power8/9/10 are listed below (speedup pct.):
>>
>>   * Power10
>>     - O2: all are neutral (excluding some noises);
>>     - Ofast: 510.parest_r +6.67%, the others are neutral
>>              (use ... for the followings);
>>     - Ofast + unroll: 510.parest_r +5.91%, ...
>>     - Ofast + LTO + PGO: 510.parest_r +3.00%, ...
>>     - Ofast + cheap vect cost: 510.parest_r +6.23%, ...
>>     - Ofast + very-cheap vect cost: all are neutral;
>>
>>   * Power9
>>     - Ofast: 510.parest_r +8.73%, 538.imagick_r +11.18%
>>              (likely noise), 500.perlbench_r +1.84%, ...
>>
>>   * Power8
>>     - Ofast: 510.parest_r +5.43%, ...;
>>
>> This patch also introduces one documented parameter
>> rs6000-vect-unroll-limit= similar to what aarch64 proposes,
>> by evaluating on P8/P9/P10, the default value 4 is slightly
>> better than the other choices like 2 and 8.
>>
>> It also parameterizes two other values as undocumented
>> parameters for future tweaking.  One parameter is
>> rs6000-vect-unroll-issue, it's to simply model hardware
>> resource for non memory access vector instructions to avoid
>> excessive unrolling, initially I tried to use the value in
>> the hook rs6000_issue_rate, but the evaluation showed it's
>> bad, so I evaluated different values 2/4/6/8 on P8/P9/P10 at
>> Ofast, the results showed the default value 4 is good enough
>> on these different architectures.  For a record, choice 8
>> could make 510.parest_r's gain become smaller or gone on
>> P8/P9/P10; choice 6 could make 503.bwaves_r degrade by more
>> than 1% on P8/P10; and choice 2 could make 538.imagick_r
>> degrade by 3.8%.  The other parameter is
>> rs6000-vect-unroll-reduc-threshold.  It's mainly inspired by
>> 510.parest_r and tweaked as it, evaluating with different
>> values 0/1/2/3 for the threshold, it showed value 1 is the
>> best choice.  For a record, choice 0 could make 525.x264_r
>> degrade by 2% and 527.cam4_r degrade by 2.95% on P10,
>> 548.exchange2_r degrade by 1.41% and 527.cam4_r degrade by
>> 2.54% on P8; choice 2 and bigger values could make
>> 510.parest_r's gain become smaller.
>>
>> Bootstrapped and regtested on powerpc64-linux-gnu P7 and P8,
>> and powerpc64le-linux-gnu P9.  Bootstrapped on
>> powerpc64le-linux-gnu P10, but one failure was exposed during
>> regression testing there, it's identified as one miss
>> optimization and can be reproduced without this support,
>> PR106365 was opened for further tracking.
>>
>> Is it for trunk?
>>
>> BR,
>> Kewen
>> ------
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.cc (class rs6000_cost_data): Add new members
>> 	m_nstores, m_reduc_factor, m_gather_load and member function
>> 	determine_suggested_unroll_factor.
>> 	(rs6000_cost_data::update_target_cost_per_stmt): Update for m_nstores,
>> 	m_reduc_factor and m_gather_load.
>> 	(rs6000_cost_data::determine_suggested_unroll_factor): New function.
>> 	(rs6000_cost_data::finish_cost): Use determine_suggested_unroll_factor.
>> 	* config/rs6000/rs6000.opt (rs6000-vect-unroll-limit): New parameter.
>> 	(rs6000-vect-unroll-issue): Likewise.
>> 	(rs6000-vect-unroll-reduc-threshold): Likewise.
>> 	* doc/invoke.texi (rs6000-vect-unroll-limit): Document new parameter.
>>
>> ---
>>  gcc/config/rs6000/rs6000.cc  | 125 ++++++++++++++++++++++++++++++++++-
>>  gcc/config/rs6000/rs6000.opt |  18 +++++
>>  gcc/doc/invoke.texi          |   7 ++
>>  3 files changed, 147 insertions(+), 3 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index 3ff16b8ae04..d0f107d70a8 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -5208,16 +5208,23 @@ protected:
>>  				    vect_cost_model_location, unsigned int);
>>    void density_test (loop_vec_info);
>>    void adjust_vect_cost_per_loop (loop_vec_info);
>> +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
>>
>>    /* Total number of vectorized stmts (loop only).  */
>>    unsigned m_nstmts = 0;
>>    /* Total number of loads (loop only).  */
>>    unsigned m_nloads = 0;
>> +  /* Total number of stores (loop only).  */
>> +  unsigned m_nstores = 0;
>> +  /* Reduction factor for suggesting unroll factor (loop only).  */
>> +  unsigned m_reduc_factor = 0;
>>    /* Possible extra penalized cost on vector construction (loop only).  */
>>    unsigned m_extra_ctor_cost = 0;
>>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>>       instruction is needed by the vectorization.  */
>>    bool m_vect_nonmem = false;
>> +  /* If this loop gets vectorized with emulated gather load.  */
>> +  bool m_gather_load = false;
>>  };
>>
>>  /* Test for likely overcommitment of vector hardware resources.  If a
>> @@ -5368,9 +5375,34 @@ rs6000_cost_data::update_target_cost_per_stmt (vect_cost_for_stmt kind,
>>      {
>>        m_nstmts += orig_count;
>>
>> -      if (kind == scalar_load || kind == vector_load
>> -	  || kind == unaligned_load || kind == vector_gather_load)
>> -	m_nloads += orig_count;
>> +      if (kind == scalar_load
>> +	  || kind == vector_load
>> +	  || kind == unaligned_load
>> +	  || kind == vector_gather_load)
>> +	{
>> +	  m_nloads += orig_count;
>> +	  if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +	    m_gather_load = true;
>> +	}
>> +      else if (kind == scalar_store
>> +	       || kind == vector_store
>> +	       || kind == unaligned_store
>> +	       || kind == vector_scatter_store)
>> +	m_nstores += orig_count;
>> +      else if ((kind == scalar_stmt
>> +		|| kind == vector_stmt
>> +		|| kind == vec_to_scalar)
>> +	       && stmt_info
>> +	       && vect_is_reduction (stmt_info))
>> +	{
>> +	  /* Loop body contains normal int or fp operations and epilogue
>> +	     contains vector reduction.  For simplicity, we assume int
>> +	     operation takes one cycle and fp operation takes one more.  */
>> +	  tree lhs = gimple_get_lhs (stmt_info->stmt);
>> +	  bool is_float = FLOAT_TYPE_P (TREE_TYPE (lhs));
>> +	  unsigned int basic_cost = is_float ? 2 : 1;
>> +	  m_reduc_factor = MAX (basic_cost * orig_count, m_reduc_factor);
>> +	}
>>
>>        /* Power processors do not currently have instructions for strided
>>  	 and elementwise loads, and instead we must generate multiple
>> @@ -5462,6 +5494,90 @@ rs6000_cost_data::adjust_vect_cost_per_loop (loop_vec_info loop_vinfo)
>>      }
>>  }
>>
>> +/* Determine suggested unroll factor by considering some below factors:
>> +
>> +    - unroll option/pragma which can disable unrolling for this loop;
>> +    - simple hardware resource model for non memory vector insns;
>> +    - aggressive heuristics when iteration count is unknown:
>> +      - reduction case to break cross iteration dependency;
>> +      - emulated gather load;
>> +    - estimated iteration count when iteration count is unknown;
>> +*/
>> +
>> +
>> +unsigned int
>> +rs6000_cost_data::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
>> +{
>> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> +
>> +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
>> +  if (loop->unroll == 1
>> +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
>> +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
>> +    return 1;
>> +
>> +  unsigned int nstmts_nonldst = m_nstmts - m_nloads - m_nstores;
>> +  /* Don't unroll if no vector instructions excepting for memory access.  */
>> +  if (nstmts_nonldst == 0)
>> +    return 1;
>> +
>> +  /* Consider breaking cross iteration dependency for reduction.  */
>> +  unsigned int reduc_factor = m_reduc_factor > 1 ? m_reduc_factor : 1;
>> +
>> +  /* Use this simple hardware resource model that how many non ld/st
>> +     vector instructions can be issued per cycle.  */
>> +  unsigned int issue_width = rs6000_vect_unroll_issue;
>> +  unsigned int uf = CEIL (reduc_factor * issue_width, nstmts_nonldst);
>> +  uf = MIN ((unsigned int) rs6000_vect_unroll_limit, uf);
>> +  /* Make sure it is power of 2.  */
>> +  uf = 1 << ceil_log2 (uf);
>> +
>> +  /* If the iteration count is known, the costing would be exact enough,
>> +     don't worry it could be worse.  */
>> +  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>> +    return uf;
>> +
>> +  /* Inspired by SPEC2017 parest_r, we want to aggressively unroll the
>> +     loop if either condition is satisfied:
>> +       - reduction factor exceeds the threshold;
>> +       - emulated gather load adopted.  */
>> +  if (reduc_factor > (unsigned int) rs6000_vect_unroll_reduc_threshold
>> +      || m_gather_load)
>> +    return uf;
>> +
>> +  /* Check if we can conclude it's good to unroll from the estimated
>> +     iteration count.  */
>> +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
>> +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
>> +  unsigned int unrolled_vf = vf * uf;
>> +  if (est_niter == -1 || est_niter < unrolled_vf)
>> +    /* When the estimated iteration of this loop is unknown, it's possible
>> +       that we are able to vectorize this loop with the original VF but fail
>> +       to vectorize it with the unrolled VF any more if the actual iteration
>> +       count is in between.  */
>> +    return 1;
>> +  else
>> +    {
>> +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
>> +      unsigned int epil_niter = est_niter % vf;
>> +      /* Even if we have partial vector support, it can be still inefficent
>> +	 to calculate the length when the iteration count is unknown, so
>> +	 only expect it's good to unroll when the epilogue iteration count
>> +	 is not bigger than VF (only one time length calculation).  */
>> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> +	  && epil_niter_unr <= vf)
>> +	return uf;
>> +      /* Without partial vector support, conservatively unroll this when
>> +	 the epilogue iteration count is less than the original one
>> +	 (epilogue execution time wouldn't be longer than before).  */
>> +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>> +	       && epil_niter_unr <= epil_niter)
>> +	return uf;
>> +    }
>> +
>> +  return 1;
>> +}
>> +
>>  void
>>  rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>>  {
>> @@ -5478,6 +5594,9 @@ rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
>>  	  && LOOP_VINFO_VECT_FACTOR (loop_vinfo) == 2
>>  	  && LOOP_REQUIRES_VERSIONING (loop_vinfo))
>>  	m_costs[vect_body] += 10000;
>> +
>> +      m_suggested_unroll_factor
>> +	= determine_suggested_unroll_factor (loop_vinfo);
>>      }
>>
>>    vector_costs::finish_cost (scalar_costs);
>> diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
>> index 4931d781c4e..80c2c61a9de 100644
>> --- a/gcc/config/rs6000/rs6000.opt
>> +++ b/gcc/config/rs6000/rs6000.opt
>> @@ -624,6 +624,14 @@ mieee128-constant
>>  Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
>>  Generate (do not generate) code that uses the LXVKQ instruction.
>>
>> +; Documented parameters
>> +
>> +-param=rs6000-vect-unroll-limit=
>> +Target Joined UInteger Var(rs6000_vect_unroll_limit) Init(4) IntegerRange(1, 64) Param
>> +Used to limit unroll factor which indicates how much the autovectorizer may
>> +unroll a loop.  The default value is 4.
>> +
>> +; Undocumented parameters
>>  -param=rs6000-density-pct-threshold=
>>  Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) Init(85) IntegerRange(0, 100) Param
>>  When costing for loop vectorization, we probably need to penalize the loop body
>> @@ -661,3 +669,13 @@ Like parameter rs6000-density-load-pct-threshold, we also check if the total
>>  number of load statements exceeds the threshold specified by this parameter,
>>  and penalize only if it's satisfied.  The default value is 20.
>>
>> +-param=rs6000-vect-unroll-issue=
>> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_issue) Init(4) IntegerRange(1, 128) Param
>> +Number of non memory access vector instructions that can be issued per
>> +cycle; used in unroll factor determination for the autovectorizer.  The
>> +default value is 4.
>> +
>> +-param=rs6000-vect-unroll-reduc-threshold=
>> +Target Undocumented Joined UInteger Var(rs6000_vect_unroll_reduc_threshold) Init(1) Param
>> +When reduction factor computed for a loop exceeds the threshold specified by
>> +this parameter, prefer to unroll this loop.  The default value is 1.
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index 84d6f0f9860..097ab1d5563 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -29658,6 +29658,13 @@ Generate (do not generate) code that will run in privileged state.
>>  @opindex no-block-ops-unaligned-vsx
>>  Generate (do not generate) unaligned vsx loads and stores for
>>  inline expansion of @code{memcpy} and @code{memmove}.
>> +
>> +@item --param rs6000-vect-unroll-limit=
>> +The vectorizer will check with target information to determine whether it
>> +would be beneficial to unroll the main vectorized loop and by how much.  This
>> +parameter sets the upper bound of how much the vectorizer will unroll the main
>> +loop.  The default value is four.
>> +
>>  @end table
>>
>>  @node RX Options
>> --
>> 2.27.0
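
To make the arithmetic in determine_suggested_unroll_factor easier to
follow, here is a minimal standalone sketch of its two numeric steps: the
issue-width model and the epilogue check.  It is not the GCC
implementation.  The names model_unroll_factor, model_epilogue_check and
ceil_log2_u are hypothetical, the defaults of 4 mirror the Init(4) values
of rs6000-vect-unroll-issue and rs6000-vect-unroll-limit, and the early
returns for a known iteration count, reductions and emulated gather loads
are deliberately left out.

```cpp
#include <algorithm>

// Smallest e with (1 << e) >= x, for x >= 1.  Hypothetical stand-in for
// GCC's ceil_log2.
static unsigned ceil_log2_u (unsigned x)
{
  unsigned e = 0;
  while ((1u << e) < x)
    ++e;
  return e;
}

// Hypothetical model of the resource-based step: enough copies of the loop
// body to fill issue_width non ld/st issue slots, scaled by the reduction
// factor, capped by limit, then rounded up to a power of two.
static unsigned model_unroll_factor (unsigned nstmts, unsigned nloads,
				     unsigned nstores, unsigned reduc_factor,
				     unsigned issue_width = 4,
				     unsigned limit = 4)
{
  unsigned nonldst = nstmts - nloads - nstores;
  if (nonldst == 0)
    return 1;  // Only memory accesses: nothing to gain in this model.
  unsigned reduc = std::max (reduc_factor, 1u);
  unsigned uf = (reduc * issue_width + nonldst - 1) / nonldst;  // CEIL
  uf = std::min (limit, uf);
  return 1u << ceil_log2_u (uf);
}

// Hypothetical model of the final step, used when the iteration count is
// only estimated.  est_niter == -1 means "no estimate"; vf and uf are
// assumed to be at least 1.
static unsigned model_epilogue_check (long long est_niter, unsigned vf,
				      unsigned uf, bool partial_vectors)
{
  unsigned unrolled_vf = vf * uf;
  if (est_niter == -1 || est_niter < (long long) unrolled_vf)
    return 1;
  unsigned epil_unr = (unsigned) (est_niter % unrolled_vf);
  unsigned epil = (unsigned) (est_niter % vf);
  if (partial_vectors)
    return epil_unr <= vf ? uf : 1;
  return epil_unr <= epil ? uf : 1;
}
```

In this model, a loop with 6 statements of which 2 are loads and 1 is a
store gives model_unroll_factor (6, 2, 1, 1) == 2.  With vf = 4 and uf = 2,
an estimate of 100 iterations is accepted only with partial vectors
(remainder 4 versus 0), while an estimate of 96 is accepted either way.
Note also that a limit that is not a power of two, such as 3, is rounded up
to 4 by the last step of model_unroll_factor.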

Patch

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 3ff16b8ae04..d0f107d70a8 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -5208,16 +5208,23 @@  protected:
 				    vect_cost_model_location, unsigned int);
   void density_test (loop_vec_info);
   void adjust_vect_cost_per_loop (loop_vec_info);
+  unsigned int determine_suggested_unroll_factor (loop_vec_info);

   /* Total number of vectorized stmts (loop only).  */
   unsigned m_nstmts = 0;
   /* Total number of loads (loop only).  */
   unsigned m_nloads = 0;
+  /* Total number of stores (loop only).  */
+  unsigned m_nstores = 0;
+  /* Reduction factor for suggesting unroll factor (loop only).  */
+  unsigned m_reduc_factor = 0;
   /* Possible extra penalized cost on vector construction (loop only).  */
   unsigned m_extra_ctor_cost = 0;
   /* For each vectorized loop, this var holds TRUE iff a non-memory vector
      instruction is needed by the vectorization.  */
   bool m_vect_nonmem = false;
+  /* If this loop gets vectorized with emulated gather load.  */
+  bool m_gather_load = false;
 };

 /* Test for likely overcommitment of vector hardware resources.  If a
@@ -5368,9 +5375,34 @@  rs6000_cost_data::update_target_cost_per_stmt (vect_cost_for_stmt kind,
     {
       m_nstmts += orig_count;

-      if (kind == scalar_load || kind == vector_load
-	  || kind == unaligned_load || kind == vector_gather_load)
-	m_nloads += orig_count;
+      if (kind == scalar_load
+	  || kind == vector_load
+	  || kind == unaligned_load
+	  || kind == vector_gather_load)
+	{
+	  m_nloads += orig_count;
+	  if (stmt_info && STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	    m_gather_load = true;
+	}
+      else if (kind == scalar_store
+	       || kind == vector_store
+	       || kind == unaligned_store
+	       || kind == vector_scatter_store)
+	m_nstores += orig_count;
+      else if ((kind == scalar_stmt
+		|| kind == vector_stmt
+		|| kind == vec_to_scalar)
+	       && stmt_info
+	       && vect_is_reduction (stmt_info))
+	{
+	  /* Loop body contains normal int or fp operations and epilogue
+	     contains vector reduction.  For simplicity, we assume int
+	     operation takes one cycle and fp operation takes one more.  */
+	  tree lhs = gimple_get_lhs (stmt_info->stmt);
+	  bool is_float = FLOAT_TYPE_P (TREE_TYPE (lhs));
+	  unsigned int basic_cost = is_float ? 2 : 1;
+	  m_reduc_factor = MAX (basic_cost * orig_count, m_reduc_factor);
+	}

       /* Power processors do not currently have instructions for strided
 	 and elementwise loads, and instead we must generate multiple
@@ -5462,6 +5494,90 @@  rs6000_cost_data::adjust_vect_cost_per_loop (loop_vec_info loop_vinfo)
     }
 }

+/* Determine the suggested unroll factor by considering the factors below:
+
+    - unroll option/pragma which can disable unrolling for this loop;
+    - simple hardware resource model for non memory vector insns;
+    - aggressive heuristics when iteration count is unknown:
+      - reduction case to break cross iteration dependency;
+      - emulated gather load;
+    - estimated iteration count when iteration count is unknown;
+*/
+
+
+unsigned int
+rs6000_cost_data::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Don't unroll if it's specified explicitly not to be unrolled.  */
+  if (loop->unroll == 1
+      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
+      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
+    return 1;
+
+  unsigned int nstmts_nonldst = m_nstmts - m_nloads - m_nstores;
+  /* Don't unroll if all vector instructions are memory accesses.  */
+  if (nstmts_nonldst == 0)
+    return 1;
+
+  /* Consider breaking cross iteration dependency for reduction.  */
+  unsigned int reduc_factor = m_reduc_factor > 1 ? m_reduc_factor : 1;
+
+  /* Use a simple hardware resource model: how many non ld/st vector
+     instructions can be issued per cycle.  */
+  unsigned int issue_width = rs6000_vect_unroll_issue;
+  unsigned int uf = CEIL (reduc_factor * issue_width, nstmts_nonldst);
+  uf = MIN ((unsigned int) rs6000_vect_unroll_limit, uf);
+  /* Make sure it is power of 2.  */
+  uf = 1 << ceil_log2 (uf);
+
+  /* If the iteration count is known, the costing is exact enough that
+     we need not worry about unrolling making things worse.  */
+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+    return uf;
+
+  /* Inspired by SPEC2017 parest_r, we want to aggressively unroll the
+     loop if either condition is satisfied:
+       - reduction factor exceeds the threshold;
+       - emulated gather load adopted.  */
+  if (reduc_factor > (unsigned int) rs6000_vect_unroll_reduc_threshold
+      || m_gather_load)
+    return uf;
+
+  /* Check if we can conclude it's good to unroll from the estimated
+     iteration count.  */
+  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
+  unsigned int vf = vect_vf_for_cost (loop_vinfo);
+  unsigned int unrolled_vf = vf * uf;
+  if (est_niter == -1 || est_niter < unrolled_vf)
+    /* When the estimated iteration count is unknown or smaller than the
+       unrolled VF, it's possible that we are able to vectorize this loop
+       with the original VF but fail to vectorize it with the unrolled VF
+       if the actual iteration count is in between.  */
+    return 1;
+  else
+    {
+      unsigned int epil_niter_unr = est_niter % unrolled_vf;
+      unsigned int epil_niter = est_niter % vf;
+      /* Even if we have partial vector support, it can still be inefficient
+	 to calculate the length when the iteration count is unknown, so
+	 only consider unrolling beneficial when the epilogue iteration
+	 count is no bigger than VF (a single length calculation).  */
+      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	  && epil_niter_unr <= vf)
+	return uf;
+      /* Without partial vector support, conservatively unroll this when
+	 the epilogue iteration count is no greater than the original one
+	 (epilogue execution time wouldn't be longer than before).  */
+      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	       && epil_niter_unr <= epil_niter)
+	return uf;
+    }
+
+  return 1;
+}
+
 void
 rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
 {
@@ -5478,6 +5594,9 @@  rs6000_cost_data::finish_cost (const vector_costs *scalar_costs)
 	  && LOOP_VINFO_VECT_FACTOR (loop_vinfo) == 2
 	  && LOOP_REQUIRES_VERSIONING (loop_vinfo))
 	m_costs[vect_body] += 10000;
+
+      m_suggested_unroll_factor
+	= determine_suggested_unroll_factor (loop_vinfo);
     }

   vector_costs::finish_cost (scalar_costs);
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index 4931d781c4e..80c2c61a9de 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -624,6 +624,14 @@  mieee128-constant
 Target Var(TARGET_IEEE128_CONSTANT) Init(1) Save
 Generate (do not generate) code that uses the LXVKQ instruction.

+; Documented parameters
+
+-param=rs6000-vect-unroll-limit=
+Target Joined UInteger Var(rs6000_vect_unroll_limit) Init(4) IntegerRange(1, 64) Param
+Upper bound on the unroll factor, that is, how much the autovectorizer may
+unroll a loop.  The default value is 4.
+
+; Undocumented parameters
 -param=rs6000-density-pct-threshold=
 Target Undocumented Joined UInteger Var(rs6000_density_pct_threshold) Init(85) IntegerRange(0, 100) Param
 When costing for loop vectorization, we probably need to penalize the loop body
@@ -661,3 +669,13 @@  Like parameter rs6000-density-load-pct-threshold, we also check if the total
 number of load statements exceeds the threshold specified by this parameter,
 and penalize only if it's satisfied.  The default value is 20.

+-param=rs6000-vect-unroll-issue=
+Target Undocumented Joined UInteger Var(rs6000_vect_unroll_issue) Init(4) IntegerRange(1, 128) Param
+Number of non memory access vector instructions that can be issued per
+cycle; used in unroll factor determination for the autovectorizer.  The
+default value is 4.
+
+-param=rs6000-vect-unroll-reduc-threshold=
+Target Undocumented Joined UInteger Var(rs6000_vect_unroll_reduc_threshold) Init(1) Param
+When the reduction factor computed for a loop exceeds the threshold specified
+by this parameter, prefer to unroll this loop.  The default value is 1.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 84d6f0f9860..097ab1d5563 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -29658,6 +29658,13 @@  Generate (do not generate) code that will run in privileged state.
 @opindex no-block-ops-unaligned-vsx
 Generate (do not generate) unaligned vsx loads and stores for
 inline expansion of @code{memcpy} and @code{memmove}.
+
+@item --param rs6000-vect-unroll-limit=
+The vectorizer will check with target information to determine whether it
+would be beneficial to unroll the main vectorized loop and by how much.  This
+parameter sets the upper bound of how much the vectorizer will unroll the main
+loop.  The default value is four.
+
 @end table

 @node RX Options