ix86: Suggest unroll factor for loop vectorization

Message ID 20221024024604.18324-1-lili.cui@intel.com
State New
Headers
Series ix86: Suggest unroll factor for loop vectorization |

Commit Message

Li, Pan2 via Gcc-patches Oct. 24, 2022, 2:46 a.m. UTC
  Hi Hongtao,

This patch introduces function finish_cost and 
determine_suggested_unroll_factor for x86 backend, to make it be
able to suggest the unroll factor for a given loop being vectorized.
Referring to aarch64, RS6000 backends and basing on the analysis on
SPEC2017 performance evaluation results.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.

OK for trunk?



With this patch, SPEC2017 performance evaluation results on
ICX/CLX/ADL/Znver3 are listed below:

For single copy:
  - ICX: 549.fotonik3d_r +6.2%, the others are neutral
  - CLX: 549.fotonik3d_r +1.9%, the others are neutral
  - ADL: 549.fotonik3d_r +4.5%, the others are neutral
  - Znver3: 549.fotonik3d_r +4.8%, the others are neutral

For multi-copy:
  - ADL: 549.fotonik3d_r +2.7%, the others are neutral

gcc/ChangeLog:

	* config/i386/i386.cc (class ix86_vector_costs): Add new members
	 m_nstmts, m_nloads m_nstores and determine_suggested_unroll_factor.
	(ix86_vector_costs::add_stmt_cost): Update for m_nstores, m_nloads
	and m_nstores.
	(ix86_vector_costs::determine_suggested_unroll_factor): New function.
	(ix86_vector_costs::finish_cost): Diito.
	* config/i386/i386.opt:(x86-vect-unroll-limit): New parameter.
	(x86-vect-unroll-min-ldst-threshold): Likewise.
	(x86-vect-unroll-max-loop-size): Likewise.
	* doc/invoke.texi: Document new parameter.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
	* gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
	* gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
	* gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
	* gcc.target/i386/vect-reduc-1.c: Ditto.
---
 gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
 gcc/config/i386/i386.opt                      |  15 +++
 gcc/doc/invoke.texi                           |  17 +++
 .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
 .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
 .../i386/vect-alignment-peeling-1.c           |   2 +-
 .../i386/vect-alignment-peeling-2.c           |   2 +-
 gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
 8 files changed, 143 insertions(+), 5 deletions(-)
  

Comments

Hongtao Liu Oct. 25, 2022, 5:49 a.m. UTC | #1
Any comments?

On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi Hongtao,
>
> This patch introduces function finish_cost and
> determine_suggested_unroll_factor for x86 backend, to make it be
> able to suggest the unroll factor for a given loop being vectorized.
> Referring to aarch64, RS6000 backends and basing on the analysis on
> SPEC2017 performance evaluation results.
>
> Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
>
> OK for trunk?
>
>
>
> With this patch, SPEC2017 performance evaluation results on
> ICX/CLX/ADL/Znver3 are listed below:
>
> For single copy:
>   - ICX: 549.fotonik3d_r +6.2%, the others are neutral
>   - CLX: 549.fotonik3d_r +1.9%, the others are neutral
>   - ADL: 549.fotonik3d_r +4.5%, the others are neutral
>   - Znver3: 549.fotonik3d_r +4.8%, the others are neutral
>
> For multi-copy:
>   - ADL: 549.fotonik3d_r +2.7%, the others are neutral
>
> gcc/ChangeLog:
>
>         * config/i386/i386.cc (class ix86_vector_costs): Add new members
>          m_nstmts, m_nloads m_nstores and determine_suggested_unroll_factor.
>         (ix86_vector_costs::add_stmt_cost): Update for m_nstores, m_nloads
>         and m_nstores.
>         (ix86_vector_costs::determine_suggested_unroll_factor): New function.
>         (ix86_vector_costs::finish_cost): Diito.
>         * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter.
>         (x86-vect-unroll-min-ldst-threshold): Likewise.
>         (x86-vect-unroll-max-loop-size): Likewise.
>         * doc/invoke.texi: Document new parameter.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
>         * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
>         * gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
>         * gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
>         * gcc.target/i386/vect-reduc-1.c: Ditto.
> ---
>  gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
>  gcc/config/i386/i386.opt                      |  15 +++
>  gcc/doc/invoke.texi                           |  17 +++
>  .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
>  .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
>  .../i386/vect-alignment-peeling-1.c           |   2 +-
>  .../i386/vect-alignment-peeling-2.c           |   2 +-
>  gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
>  8 files changed, 143 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index aeea26ef4be..a939354e55e 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
>                               stmt_vec_info stmt_info, slp_tree node,
>                               tree vectype, int misalign,
>                               vect_cost_model_location where) override;
> +
> +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
> +
> +  void finish_cost (const vector_costs *) override;
> +
> +  /* Total number of vectorized stmts (loop only).  */
> +  unsigned m_nstmts = 0;
> +  /* Total number of loads (loop only).  */
> +  unsigned m_nloads = 0;
> +  /* Total number of stores (loop only).  */
> +  unsigned m_nstores = 0;
>  };
>
>  /* Implement targetm.vectorize.create_costs.  */
> @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>         retval = (retval * 17) / 10;
>      }
>
> +  if (!m_costing_for_scalar
> +      && is_a<loop_vec_info> (m_vinfo)
> +      && where == vect_body)
> +    {
> +      m_nstmts += count;
> +      if (kind == scalar_load || kind == vector_load
> +         || kind == unaligned_load || kind == vector_gather_load)
> +       m_nloads += count;
> +      else if (kind == scalar_store || kind == vector_store
> +              || kind == unaligned_store || kind == vector_scatter_store)
> +       m_nstores += count;
> +    }
> +
>    m_costs[where] += retval;
>
>    return retval;
> @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
>    return nunroll;
>  }
>
> +unsigned int
> +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
> +{
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +
> +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> +  if (loop->unroll == 1
> +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> +    return 1;
> +
> +  /* Don't unroll if there is no vectorized stmt.  */
> +  if (m_nstmts == 0)
> +    return 1;
> +
> +  /* Don't unroll if vector size is zmm, since zmm throughput is lower than other
> +     sizes.  */
> +  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> +    return 1;
> +
> +  /* Calc the total number of loads and stores in the loop body.  */
> +  unsigned int nstmts_ldst = m_nloads + m_nstores;
> +
> +  /* Don't unroll if loop body size big than threshold, the threshold
> +     is a heuristic value inspired by param_max_unrolled_insns.  */
> +  unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
> +                   ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
> +                   : 1;
> +  uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
> +  uf = 1 << ceil_log2 (uf);
> +
> +  /* Early return if don't need to unroll.  */
> +  if (uf == 1)
> +    return 1;
> +
> +  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the loop
> +     if the number of loads and stores exceeds the threshold, unroll + software
> +     schedule will reduce cache miss rate.  */
> +  if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
> +    return uf;
> +
> +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> +  unsigned int unrolled_vf = vf * uf;
> +  if (est_niter == -1 || est_niter < unrolled_vf)
> +    /* When the estimated iteration of this loop is unknown, it's possible
> +       that we are able to vectorize this loop with the original VF but fail
> +       to vectorize it with the unrolled VF any more if the actual iteration
> +       count is in between.  */
> +    return 1;
> +  else
> +    {
> +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> +      unsigned int epil_niter = est_niter % vf;
> +      /* Even if we have partial vector support, it can be still inefficent
> +       to calculate the length when the iteration count is unknown, so
> +       only expect it's good to unroll when the epilogue iteration count
> +       is not bigger than VF (only one time length calculation).  */
> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +         && epil_niter_unr <= vf)
> +       return uf;
> +      /* Without partial vector support, conservatively unroll this when
> +       the epilogue iteration count is less than the original one
> +       (epilogue execution time wouldn't be longer than before).  */
> +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +              && epil_niter_unr <= epil_niter)
> +       return uf;
> +    }
> +
> +  return 1;
> +}
> +
> +void
> +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> +
> +{
> +  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> +    {
> +      m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo);
> +    }
> +  vector_costs::finish_cost (scalar_costs);
> +}
>
>  /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.  */
>
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index 53d534f6392..8e49b406aa5 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -1224,3 +1224,18 @@ mavxvnniint8
>  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
>  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
>  AVXVNNIINT8 built-in functions and code generation.
> +
> +-param=x86-vect-unroll-limit=
> +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param
> +Used to limit unroll factor which indicates how much the autovectorizer may
> +unroll a loop.  The default value is 4.
> +
> +-param=x86-vect-unroll-min-ldst-threshold=
> +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param
> +Used to limit the mininum of loads and stores in the main loop.  The default
> +value is 25.
> +
> +-param=x86-vect-unroll-max-loop-size=
> +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param
> +This threshold is used to limit the maxnum size of loop body after unrolling.
> +The default value is 200.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 09548c4528c..c86d686f2cd 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -15779,6 +15779,23 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
>  @item x86-stlf-window-ninsns
>  Instructions number above which STFL stall penalty can be compensated.
>
> +@item x86-vect-unroll-limit
> +The vectorizer will check with target information to determine whether it
> +would be beneficial to unroll the main vectorized loop and by how much.  This
> +parameter sets the upper bound of how much the vectorizer will unroll the main
> +loop.  The default value is four.
> +
> +@item x86-vect-unroll-min-ldst-threshold
> +The vectorizer will check with target information to determine whether unroll
> +it. This parameter is used to limit the mininum of loads and stores in the main
> +loop.
> +
> +@item x86-vect-unroll-max-loop-size
> +The vectorizer will check with target information to determine whether unroll
> +it. This threshold is used to limit the max size of loop body after unrolling.
> +The default value is 200.
> +
> +
>  @end table
>
>  @end table
> diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> index 78c6600f83b..3bf1fb1b12d 100644
> --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */
>  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
>  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
>  /* { dg-final { scan-assembler-times "vpmaxsb"  1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> index 117179f2109..ba41fd64386 100644
> --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */
>  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
>  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
>  /* { dg-final { scan-assembler-times "vpmaxub"  1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> index 4aa536ba86c..fd2f054af4a 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> @@ -2,7 +2,7 @@
>  /* This is a test exercising peeling for alignment for a negative step
>     vector loop.  We're forcing atom tuning here because that has a higher
>     unaligned vs aligned cost unlike most other archs.  */
> -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
>
>  float a[1024], b[1024];
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> index 834bf0f770d..62c0db2bb9a 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> @@ -2,7 +2,7 @@
>  /* This is a test exercising peeling for alignment for a positive step
>     vector loop.  We're forcing atom tuning here because that has a higher
>     unaligned vs aligned cost unlike most other archs.  */
> -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
>
>  float a[1024], b[1024];
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> index 9ee9ba4e736..1ba4be01bea 100644
> --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
> +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */
>
>  #define N 32
>  int foo (int *a, int n)
> --
> 2.17.1
>
> Thanks,
> Lili.
  
Richard Biener Oct. 25, 2022, 11:53 a.m. UTC | #2
On Tue, Oct 25, 2022 at 7:46 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Any comments?

+@item x86-vect-unroll-min-ldst-threshold
+The vectorizer will check with target information to determine whether unroll
+it. This parameter is used to limit the mininum of loads and stores in the main
+loop.

It's odd to "limit" the minimum number of something.  I think this
warrants clarification
that for some (unknow to me ;)) reason we think that when we have many loads
and (or?) stores it is beneficial to unroll to get even more loads and
stores in a
single iteration.  Btw, does the parameter limit the number of loads and stores
_after_ unrolling or before?

+@item x86-vect-unroll-max-loop-size
+The vectorizer will check with target information to determine whether unroll
+it. This threshold is used to limit the max size of loop body after unrolling.
+The default value is 200.

it should probably say not "size" but "number of instructions".  Note that 200
is quite large given we are talking about vector instructions here which have
larger encodings than scalar instructions.  Optimistically assuming
4 byte encoding (quite optimistic give we're looking at loops with many
loads/stores) that would be an 800 byte loop body which would be 25 cache lines.
ISTR that at least the loop discovery is limited to a lot smaller cases (but
we are likely not targeting that).  The limit probably still works to
fit the loop
body in the u-op caches though.

That said, the heuristic made me think "what the heck".  Can we explain
in u-arch terms why the unrolling is beneficial instead of just defering to
SPEC CPU 2017 fotonik?

> On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Hi Hongtao,
> >
> > This patch introduces function finish_cost and
> > determine_suggested_unroll_factor for x86 backend, to make it be
> > able to suggest the unroll factor for a given loop being vectorized.
> > Referring to aarch64, RS6000 backends and basing on the analysis on
> > SPEC2017 performance evaluation results.
> >
> > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
> >
> > OK for trunk?
> >
> >
> >
> > With this patch, SPEC2017 performance evaluation results on
> > ICX/CLX/ADL/Znver3 are listed below:
> >
> > For single copy:
> >   - ICX: 549.fotonik3d_r +6.2%, the others are neutral
> >   - CLX: 549.fotonik3d_r +1.9%, the others are neutral
> >   - ADL: 549.fotonik3d_r +4.5%, the others are neutral
> >   - Znver3: 549.fotonik3d_r +4.8%, the others are neutral
> >
> > For multi-copy:
> >   - ADL: 549.fotonik3d_r +2.7%, the others are neutral
> >
> > gcc/ChangeLog:
> >
> >         * config/i386/i386.cc (class ix86_vector_costs): Add new members
> >          m_nstmts, m_nloads m_nstores and determine_suggested_unroll_factor.
> >         (ix86_vector_costs::add_stmt_cost): Update for m_nstores, m_nloads
> >         and m_nstores.
> >         (ix86_vector_costs::determine_suggested_unroll_factor): New function.
> >         (ix86_vector_costs::finish_cost): Diito.
> >         * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter.
> >         (x86-vect-unroll-min-ldst-threshold): Likewise.
> >         (x86-vect-unroll-max-loop-size): Likewise.
> >         * doc/invoke.texi: Document new parameter.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
> >         * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
> >         * gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
> >         * gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
> >         * gcc.target/i386/vect-reduc-1.c: Ditto.
> > ---
> >  gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
> >  gcc/config/i386/i386.opt                      |  15 +++
> >  gcc/doc/invoke.texi                           |  17 +++
> >  .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
> >  .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
> >  .../i386/vect-alignment-peeling-1.c           |   2 +-
> >  .../i386/vect-alignment-peeling-2.c           |   2 +-
> >  gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
> >  8 files changed, 143 insertions(+), 5 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index aeea26ef4be..a939354e55e 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
> >                               stmt_vec_info stmt_info, slp_tree node,
> >                               tree vectype, int misalign,
> >                               vect_cost_model_location where) override;
> > +
> > +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
> > +
> > +  void finish_cost (const vector_costs *) override;
> > +
> > +  /* Total number of vectorized stmts (loop only).  */
> > +  unsigned m_nstmts = 0;
> > +  /* Total number of loads (loop only).  */
> > +  unsigned m_nloads = 0;
> > +  /* Total number of stores (loop only).  */
> > +  unsigned m_nstores = 0;
> >  };
> >
> >  /* Implement targetm.vectorize.create_costs.  */
> > @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >         retval = (retval * 17) / 10;
> >      }
> >
> > +  if (!m_costing_for_scalar
> > +      && is_a<loop_vec_info> (m_vinfo)
> > +      && where == vect_body)
> > +    {
> > +      m_nstmts += count;
> > +      if (kind == scalar_load || kind == vector_load
> > +         || kind == unaligned_load || kind == vector_gather_load)
> > +       m_nloads += count;
> > +      else if (kind == scalar_store || kind == vector_store
> > +              || kind == unaligned_store || kind == vector_scatter_store)
> > +       m_nstores += count;
> > +    }
> > +
> >    m_costs[where] += retval;
> >
> >    return retval;
> > @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> >    return nunroll;
> >  }
> >
> > +unsigned int
> > +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
> > +{
> > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > +
> > +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> > +  if (loop->unroll == 1
> > +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> > +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> > +    return 1;
> > +
> > +  /* Don't unroll if there is no vectorized stmt.  */
> > +  if (m_nstmts == 0)
> > +    return 1;
> > +
> > +  /* Don't unroll if vector size is zmm, since zmm throughput is lower than other
> > +     sizes.  */
> > +  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> > +    return 1;
> > +
> > +  /* Calc the total number of loads and stores in the loop body.  */
> > +  unsigned int nstmts_ldst = m_nloads + m_nstores;
> > +
> > +  /* Don't unroll if loop body size big than threshold, the threshold
> > +     is a heuristic value inspired by param_max_unrolled_insns.  */
> > +  unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
> > +                   ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
> > +                   : 1;
> > +  uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
> > +  uf = 1 << ceil_log2 (uf);
> > +
> > +  /* Early return if don't need to unroll.  */
> > +  if (uf == 1)
> > +    return 1;
> > +
> > +  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the loop
> > +     if the number of loads and stores exceeds the threshold, unroll + software
> > +     schedule will reduce cache miss rate.  */
> > +  if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
> > +    return uf;
> > +
> > +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> > +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> > +  unsigned int unrolled_vf = vf * uf;
> > +  if (est_niter == -1 || est_niter < unrolled_vf)
> > +    /* When the estimated iteration of this loop is unknown, it's possible
> > +       that we are able to vectorize this loop with the original VF but fail
> > +       to vectorize it with the unrolled VF any more if the actual iteration
> > +       count is in between.  */
> > +    return 1;
> > +  else
> > +    {
> > +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> > +      unsigned int epil_niter = est_niter % vf;
> > +      /* Even if we have partial vector support, it can be still inefficent
> > +       to calculate the length when the iteration count is unknown, so
> > +       only expect it's good to unroll when the epilogue iteration count
> > +       is not bigger than VF (only one time length calculation).  */
> > +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > +         && epil_niter_unr <= vf)
> > +       return uf;
> > +      /* Without partial vector support, conservatively unroll this when
> > +       the epilogue iteration count is less than the original one
> > +       (epilogue execution time wouldn't be longer than before).  */
> > +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > +              && epil_niter_unr <= epil_niter)
> > +       return uf;
> > +    }
> > +
> > +  return 1;
> > +}
> > +
> > +void
> > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > +
> > +{
> > +  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> > +    {
> > +      m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo);
> > +    }
> > +  vector_costs::finish_cost (scalar_costs);
> > +}
> >
> >  /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.  */
> >
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > index 53d534f6392..8e49b406aa5 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -1224,3 +1224,18 @@ mavxvnniint8
> >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> >  AVXVNNIINT8 built-in functions and code generation.
> > +
> > +-param=x86-vect-unroll-limit=
> > +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param
> > +Used to limit unroll factor which indicates how much the autovectorizer may
> > +unroll a loop.  The default value is 4.
> > +
> > +-param=x86-vect-unroll-min-ldst-threshold=
> > +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param
> > +Used to limit the mininum of loads and stores in the main loop.  The default
> > +value is 25.
> > +
> > +-param=x86-vect-unroll-max-loop-size=
> > +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param
> > +This threshold is used to limit the maxnum size of loop body after unrolling.
> > +The default value is 200.
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 09548c4528c..c86d686f2cd 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -15779,6 +15779,23 @@ The following choices of @var{name} are available on i386 and x86_64 targets:
> >  @item x86-stlf-window-ninsns
> >  Instructions number above which STFL stall penalty can be compensated.
> >
> > +@item x86-vect-unroll-limit
> > +The vectorizer will check with target information to determine whether it
> > +would be beneficial to unroll the main vectorized loop and by how much.  This
> > +parameter sets the upper bound of how much the vectorizer will unroll the main
> > +loop.  The default value is four.
> > +
> > +@item x86-vect-unroll-min-ldst-threshold
> > +The vectorizer will check with target information to determine whether unroll
> > +it. This parameter is used to limit the mininum of loads and stores in the main
> > +loop.
> > +
> > +@item x86-vect-unroll-max-loop-size
> > +The vectorizer will check with target information to determine whether unroll
> > +it. This threshold is used to limit the max size of loop body after unrolling.
> > +The default value is 200.
> > +
> > +
> >  @end table
> >
> >  @end table
> > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > index 78c6600f83b..3bf1fb1b12d 100644
> > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */
> > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */
> >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> >  /* { dg-final { scan-assembler-times "vpmaxsb"  1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > index 117179f2109..ba41fd64386 100644
> > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */
> > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */
> >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> >  /* { dg-final { scan-assembler-times "vpmaxub"  1 } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > index 4aa536ba86c..fd2f054af4a 100644
> > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > @@ -2,7 +2,7 @@
> >  /* This is a test exercising peeling for alignment for a negative step
> >     vector loop.  We're forcing atom tuning here because that has a higher
> >     unaligned vs aligned cost unlike most other archs.  */
> > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
> >
> >  float a[1024], b[1024];
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > index 834bf0f770d..62c0db2bb9a 100644
> > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > @@ -2,7 +2,7 @@
> >  /* This is a test exercising peeling for alignment for a positive step
> >     vector loop.  We're forcing atom tuning here because that has a higher
> >     unaligned vs aligned cost unlike most other archs.  */
> > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
> > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
> >
> >  float a[1024], b[1024];
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > index 9ee9ba4e736..1ba4be01bea 100644
> > --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
> > +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */
> >
> >  #define N 32
> >  int foo (int *a, int n)
> > --
> > 2.17.1
> >
> > Thanks,
> > Lili.
>
>
>
> --
> BR,
> Hongtao
  
Li, Pan2 via Gcc-patches Oct. 26, 2022, 11:38 a.m. UTC | #3
Hi Richard,

> +@item x86-vect-unroll-min-ldst-threshold
> +The vectorizer will check with target information to determine whether
> +unroll it. This parameter is used to limit the mininum of loads and
> +stores in the main loop.
> 
> It's odd to "limit" the minimum number of something.  I think this warrants
> clarification that for some (unknow to me ;)) reason we think that when we
> have many loads and (or?) stores it is beneficial to unroll to get even more
> loads and stores in a single iteration.  Btw, does the parameter limit the
> number of loads and stores _after_ unrolling or before?
> 
When the number of loads/stores exceeds the threshold, the loads/stores are more likely to conflict with loop itself in the L1 cache(Assuming that address of loads are scattered).
Unroll + software scheduling will make 2 or 4 address contiguous loads/stores closer together, it will reduce cache miss rate.

> +@item x86-vect-unroll-max-loop-size
> +The vectorizer will check with target information to determine whether
> +unroll it. This threshold is used to limit the max size of loop body after
> unrolling.
> +The default value is 200.
> 
> it should probably say not "size" but "number of instructions".  Note that 200
> is quite large given we are talking about vector instructions here which have
> larger encodings than scalar instructions.  Optimistically assuming
> 4 byte encoding (quite optimistic give we're looking at loops with many
> loads/stores) that would be an 800 byte loop body which would be 25 cache
> lines.
> ISTR that at least the loop discovery is limited to a lot smaller cases (but we
> are likely not targeting that).  The limit probably still works to fit the loop
> body in the u-op caches though.
> 
Agree with you, it should be "x86-vect-unroll-max-loop-insns". Thanks for the reminder about larger encodings, I checked the skylake uop cache, it can hold 1.5k uOPs, 200 * 2 (1~3 uops/instruction) = 400 uops. I think 200 still work.

> That said, the heuristic made me think "what the heck".  Can we explain in u-
> arch terms why the unrolling is beneficial instead of just defering to SPEC
> CPU 2017 fotonik?
> 
Regarding the benefits,  I explained in the first answer, I checked 5 hottest functions in the 549, they all benefit from it, it improves the cache hit ratio.

Thanks,
Lili.

> > On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > Hi Hongtao,
> > >
> > > This patch introduces function finish_cost and
> > > determine_suggested_unroll_factor for x86 backend, to make it be
> > > able to suggest the unroll factor for a given loop being vectorized.
> > > Referring to aarch64, RS6000 backends and basing on the analysis on
> > > SPEC2017 performance evaluation results.
> > >
> > > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > >
> > > OK for trunk?
> > >
> > >
> > >
> > > With this patch, SPEC2017 performance evaluation results on
> > > ICX/CLX/ADL/Znver3 are listed below:
> > >
> > > For single copy:
> > >   - ICX: 549.fotonik3d_r +6.2%, the others are neutral
> > >   - CLX: 549.fotonik3d_r +1.9%, the others are neutral
> > >   - ADL: 549.fotonik3d_r +4.5%, the others are neutral
> > >   - Znver3: 549.fotonik3d_r +4.8%, the others are neutral
> > >
> > > For multi-copy:
> > >   - ADL: 549.fotonik3d_r +2.7%, the others are neutral
> > >
> > > gcc/ChangeLog:
> > >
> > >         * config/i386/i386.cc (class ix86_vector_costs): Add new members
> > >          m_nstmts, m_nloads m_nstores and
> determine_suggested_unroll_factor.
> > >         (ix86_vector_costs::add_stmt_cost): Update for m_nstores,
> m_nloads
> > >         and m_nstores.
> > >         (ix86_vector_costs::determine_suggested_unroll_factor): New
> function.
> > >         (ix86_vector_costs::finish_cost): Diito.
> > >         * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter.
> > >         (x86-vect-unroll-min-ldst-threshold): Likewise.
> > >         (x86-vect-unroll-max-loop-size): Likewise.
> > >         * doc/invoke.texi: Document new parameter.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
> > >         * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
> > >         * gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
> > >         * gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
> > >         * gcc.target/i386/vect-reduc-1.c: Ditto.
> > > ---
> > >  gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
> > >  gcc/config/i386/i386.opt                      |  15 +++
> > >  gcc/doc/invoke.texi                           |  17 +++
> > >  .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
> > >  .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
> > >  .../i386/vect-alignment-peeling-1.c           |   2 +-
> > >  .../i386/vect-alignment-peeling-2.c           |   2 +-
> > >  gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
> > >  8 files changed, 143 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index
> > > aeea26ef4be..a939354e55e 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
> > >                               stmt_vec_info stmt_info, slp_tree node,
> > >                               tree vectype, int misalign,
> > >                               vect_cost_model_location where)
> > > override;
> > > +
> > > +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
> > > +
> > > +  void finish_cost (const vector_costs *) override;
> > > +
> > > +  /* Total number of vectorized stmts (loop only).  */  unsigned
> > > + m_nstmts = 0;
> > > +  /* Total number of loads (loop only).  */  unsigned m_nloads = 0;
> > > +  /* Total number of stores (loop only).  */  unsigned m_nstores =
> > > + 0;
> > >  };
> > >
> > >  /* Implement targetm.vectorize.create_costs.  */
> > > @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
> > >         retval = (retval * 17) / 10;
> > >      }
> > >
> > > +  if (!m_costing_for_scalar
> > > +      && is_a<loop_vec_info> (m_vinfo)
> > > +      && where == vect_body)
> > > +    {
> > > +      m_nstmts += count;
> > > +      if (kind == scalar_load || kind == vector_load
> > > +         || kind == unaligned_load || kind == vector_gather_load)
> > > +       m_nloads += count;
> > > +      else if (kind == scalar_store || kind == vector_store
> > > +              || kind == unaligned_store || kind == vector_scatter_store)
> > > +       m_nstores += count;
> > > +    }
> > > +
> > >    m_costs[where] += retval;
> > >
> > >    return retval;
> > > @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll,
> class loop *loop)
> > >    return nunroll;
> > >  }
> > >
> > > +unsigned int
> > > +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info
> loop_vinfo)
> > > +{
> > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > +
> > > +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> > > +  if (loop->unroll == 1
> > > +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> > > +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> > > +    return 1;
> > > +
> > > +  /* Don't unroll if there is no vectorized stmt.  */
> > > +  if (m_nstmts == 0)
> > > +    return 1;
> > > +
> > > +  /* Don't unroll if vector size is zmm, since zmm throughput is lower
> than other
> > > +     sizes.  */
> > > +  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> > > +    return 1;
> > > +
> > > +  /* Calc the total number of loads and stores in the loop body.  */
> > > +  unsigned int nstmts_ldst = m_nloads + m_nstores;
> > > +
> > > +  /* Don't unroll if loop body size big than threshold, the threshold
> > > +     is a heuristic value inspired by param_max_unrolled_insns.  */
> > > +  unsigned int uf = m_nstmts < (unsigned
> int)x86_vect_unroll_max_loop_size
> > > +                   ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
> > > +                   : 1;
> > > +  uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
> > > +  uf = 1 << ceil_log2 (uf);
> > > +
> > > +  /* Early return if don't need to unroll.  */
> > > +  if (uf == 1)
> > > +    return 1;
> > > +
> > > +  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll
> the loop
> > > +     if the number of loads and stores exceeds the threshold, unroll +
> software
> > > +     schedule will reduce cache miss rate.  */
> > > +  if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
> > > +    return uf;
> > > +
> > > +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> > > +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> > > +  unsigned int unrolled_vf = vf * uf;
> > > +  if (est_niter == -1 || est_niter < unrolled_vf)
> > > +    /* When the estimated iteration of this loop is unknown, it's possible
> > > +       that we are able to vectorize this loop with the original VF but fail
> > > +       to vectorize it with the unrolled VF any more if the actual iteration
> > > +       count is in between.  */
> > > +    return 1;
> > > +  else
> > > +    {
> > > +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> > > +      unsigned int epil_niter = est_niter % vf;
> > > +      /* Even if we have partial vector support, it can be still inefficent
> > > +       to calculate the length when the iteration count is unknown, so
> > > +       only expect it's good to unroll when the epilogue iteration count
> > > +       is not bigger than VF (only one time length calculation).  */
> > > +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > +         && epil_niter_unr <= vf)
> > > +       return uf;
> > > +      /* Without partial vector support, conservatively unroll this when
> > > +       the epilogue iteration count is less than the original one
> > > +       (epilogue execution time wouldn't be longer than before).  */
> > > +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > +              && epil_niter_unr <= epil_niter)
> > > +       return uf;
> > > +    }
> > > +
> > > +  return 1;
> > > +}
> > > +
> > > +void
> > > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > > +
> > > +{
> > > +  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> > > +    {
> > > +      m_suggested_unroll_factor = determine_suggested_unroll_factor
> (loop_vinfo);
> > > +    }
> > > +  vector_costs::finish_cost (scalar_costs);
> > > +}
> > >
> > >  /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.
> */
> > >
> > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > index 53d534f6392..8e49b406aa5 100644
> > > --- a/gcc/config/i386/i386.opt
> > > +++ b/gcc/config/i386/i386.opt
> > > @@ -1224,3 +1224,18 @@ mavxvnniint8
> > >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> > >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> > >  AVXVNNIINT8 built-in functions and code generation.
> > > +
> > > +-param=x86-vect-unroll-limit=
> > > +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4)
> IntegerRange(1, 8) Param
> > > +Used to limit unroll factor which indicates how much the autovectorizer
> may
> > > +unroll a loop.  The default value is 4.
> > > +
> > > +-param=x86-vect-unroll-min-ldst-threshold=
> > > +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold)
> Init(25) Param
> > > +Used to limit the mininum of loads and stores in the main loop.  The
> default
> > > +value is 25.
> > > +
> > > +-param=x86-vect-unroll-max-loop-size=
> > > +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200)
> Param
> > > +This threshold is used to limit the maxnum size of loop body after
> unrolling.
> > > +The default value is 200.
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 09548c4528c..c86d686f2cd 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -15779,6 +15779,23 @@ The following choices of @var{name} are
> available on i386 and x86_64 targets:
> > >  @item x86-stlf-window-ninsns
> > >  Instructions number above which STFL stall penalty can be compensated.
> > >
> > > +@item x86-vect-unroll-limit
> > > +The vectorizer will check with target information to determine whether
> it
> > > +would be beneficial to unroll the main vectorized loop and by how much.
> This
> > > +parameter sets the upper bound of how much the vectorizer will unroll
> the main
> > > +loop.  The default value is four.
> > > +
> > > +@item x86-vect-unroll-min-ldst-threshold
> > > +The vectorizer will check with target information to determine whether
> unroll
> > > +it. This parameter is used to limit the mininum of loads and stores in the
> main
> > > +loop.
> > > +
> > > +@item x86-vect-unroll-max-loop-size
> > > +The vectorizer will check with target information to determine whether
> unroll
> > > +it. This threshold is used to limit the max size of loop body after unrolling.
> > > +The default value is 200.
> > > +
> > > +
> > >  @end table
> > >
> > >  @end table
> > > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > index 78c6600f83b..3bf1fb1b12d 100644
> > > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > @@ -1,5 +1,5 @@
> > >  /* { dg-do compile } */
> > > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-
> optimized" } */
> > > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-
> loops -fdump-tree-optimized" } */
> > >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> > >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> > >  /* { dg-final { scan-assembler-times "vpmaxsb"  1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > index 117179f2109..ba41fd64386 100644
> > > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > @@ -1,5 +1,5 @@
> > >  /* { dg-do compile } */
> > > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-
> optimized" } */
> > > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-
> loops -fdump-tree-optimized" } */
> > >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> > >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> > >  /* { dg-final { scan-assembler-times "vpmaxub"  1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > index 4aa536ba86c..fd2f054af4a 100644
> > > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > @@ -2,7 +2,7 @@
> > >  /* This is a test exercising peeling for alignment for a negative step
> > >     vector loop.  We're forcing atom tuning here because that has a higher
> > >     unaligned vs aligned cost unlike most other archs.  */
> > > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-
> details -save-temps" } */
> > > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -
> fdump-tree-vect-details -save-temps" } */
> > >
> > >  float a[1024], b[1024];
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > index 834bf0f770d..62c0db2bb9a 100644
> > > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > @@ -2,7 +2,7 @@
> > >  /* This is a test exercising peeling for alignment for a positive step
> > >     vector loop.  We're forcing atom tuning here because that has a higher
> > >     unaligned vs aligned cost unlike most other archs.  */
> > > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-
> details -save-temps" } */
> > > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -
> fdump-tree-vect-details -save-temps" } */
> > >
> > >  float a[1024], b[1024];
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > index 9ee9ba4e736..1ba4be01bea 100644
> > > --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > @@ -1,5 +1,5 @@
> > >  /* { dg-do compile } */
> > > -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" }
> */
> > > +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-
> tree-vect-details" } */
> > >
> > >  #define N 32
> > >  int foo (int *a, int n)
> > > --
> > > 2.17.1
> > >
> > > Thanks,
> > > Lili.
> >
> >
> >
> > --
> > BR,
> > Hongtao
  
Richard Biener Oct. 28, 2022, 7:12 a.m. UTC | #4
On Wed, Oct 26, 2022 at 1:38 PM Cui, Lili <lili.cui@intel.com> wrote:
>
> Hi Richard,
>
> > +@item x86-vect-unroll-min-ldst-threshold
> > +The vectorizer will check with target information to determine whether
> > +unroll it. This parameter is used to limit the mininum of loads and
> > +stores in the main loop.
> >
> > It's odd to "limit" the minimum number of something.  I think this warrants
> > clarification that for some (unknow to me ;)) reason we think that when we
> > have many loads and (or?) stores it is beneficial to unroll to get even more
> > loads and stores in a single iteration.  Btw, does the parameter limit the
> > number of loads and stores _after_ unrolling or before?
> >
> When the number of loads/stores exceeds the threshold, the loads/stores are more likely to conflict with loop itself in the L1 cache(Assuming that address of loads are scattered).
> Unroll + software scheduling will make 2 or 4 address contiguous loads/stores closer together, it will reduce cache miss rate.

Ah, nice.  Can we express the default as a function of L1 data cache
size, L1 cache line size and
more importantly, the size of the vector memory access?

Btw, I was looking into making a more meaningful cost modeling for loop
distribution.  Similar reasoning might apply there - try to _reduce_ the
number of memory streams so L1 cache utilization allows re-use of a
cache line in the next [next N] iteration[s]?  OTOH given L1D is quite
large I'd expect the loops affected to be either quite huge or bottlenecked
by load/store bandwith (there are 1024 L1D cache lines in zen2 for
example) - what's the effective L1D load you are keying off?.
Btw, how does L1D allocation on stores play a role here?

> > +@item x86-vect-unroll-max-loop-size
> > +The vectorizer will check with target information to determine whether
> > +unroll it. This threshold is used to limit the max size of loop body after
> > unrolling.
> > +The default value is 200.
> >
> > it should probably say not "size" but "number of instructions".  Note that 200
> > is quite large given we are talking about vector instructions here which have
> > larger encodings than scalar instructions.  Optimistically assuming
> > 4 byte encoding (quite optimistic give we're looking at loops with many
> > loads/stores) that would be an 800 byte loop body which would be 25 cache
> > lines.
> > ISTR that at least the loop discovery is limited to a lot smaller cases (but we
> > are likely not targeting that).  The limit probably still works to fit the loop
> > body in the u-op caches though.
> >
> Agree with you, it should be "x86-vect-unroll-max-loop-insns". Thanks for the reminder about larger encodings, I checked the skylake uop cache, it can hold 1.5k uOPs, 200 * 2 (1~3 uops/instruction) = 400 uops. I think 200 still work.
>
> > That said, the heuristic made me think "what the heck".  Can we explain in u-
> > arch terms why the unrolling is beneficial instead of just defering to SPEC
> > CPU 2017 fotonik?
> >
> Regarding the benefits,  I explained in the first answer, I checked 5 hottest functions in the 549, they all benefit from it, it improves the cache hit ratio.
>
> Thanks,
> Lili.
>
> > > On Mon, Oct 24, 2022 at 10:46 AM Cui,Lili via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > Hi Hongtao,
> > > >
> > > > This patch introduces function finish_cost and
> > > > determine_suggested_unroll_factor for x86 backend, to make it be
> > > > able to suggest the unroll factor for a given loop being vectorized.
> > > > Referring to aarch64, RS6000 backends and basing on the analysis on
> > > > SPEC2017 performance evaluation results.
> > > >
> > > > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > >
> > > > OK for trunk?
> > > >
> > > >
> > > >
> > > > With this patch, SPEC2017 performance evaluation results on
> > > > ICX/CLX/ADL/Znver3 are listed below:
> > > >
> > > > For single copy:
> > > >   - ICX: 549.fotonik3d_r +6.2%, the others are neutral
> > > >   - CLX: 549.fotonik3d_r +1.9%, the others are neutral
> > > >   - ADL: 549.fotonik3d_r +4.5%, the others are neutral
> > > >   - Znver3: 549.fotonik3d_r +4.8%, the others are neutral
> > > >
> > > > For multi-copy:
> > > >   - ADL: 549.fotonik3d_r +2.7%, the others are neutral
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * config/i386/i386.cc (class ix86_vector_costs): Add new members
> > > >          m_nstmts, m_nloads m_nstores and
> > determine_suggested_unroll_factor.
> > > >         (ix86_vector_costs::add_stmt_cost): Update for m_nstores,
> > m_nloads
> > > >         and m_nstores.
> > > >         (ix86_vector_costs::determine_suggested_unroll_factor): New
> > function.
> > > >         (ix86_vector_costs::finish_cost): Diito.
> > > >         * config/i386/i386.opt:(x86-vect-unroll-limit): New parameter.
> > > >         (x86-vect-unroll-min-ldst-threshold): Likewise.
> > > >         (x86-vect-unroll-max-loop-size): Likewise.
> > > >         * doc/invoke.texi: Document new parameter.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.target/i386/cond_op_maxmin_b-1.c: Add -fno-unroll-loops.
> > > >         * gcc.target/i386/cond_op_maxmin_ub-1.c: Ditto.
> > > >         * gcc.target/i386/vect-alignment-peeling-1.c: Ditto.
> > > >         * gcc.target/i386/vect-alignment-peeling-2.c: Ditto.
> > > >         * gcc.target/i386/vect-reduc-1.c: Ditto.
> > > > ---
> > > >  gcc/config/i386/i386.cc                       | 106 ++++++++++++++++++
> > > >  gcc/config/i386/i386.opt                      |  15 +++
> > > >  gcc/doc/invoke.texi                           |  17 +++
> > > >  .../gcc.target/i386/cond_op_maxmin_b-1.c      |   2 +-
> > > >  .../gcc.target/i386/cond_op_maxmin_ub-1.c     |   2 +-
> > > >  .../i386/vect-alignment-peeling-1.c           |   2 +-
> > > >  .../i386/vect-alignment-peeling-2.c           |   2 +-
> > > >  gcc/testsuite/gcc.target/i386/vect-reduc-1.c  |   2 +-
> > > >  8 files changed, 143 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index
> > > > aeea26ef4be..a939354e55e 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -23336,6 +23336,17 @@ class ix86_vector_costs : public vector_costs
> > > >                               stmt_vec_info stmt_info, slp_tree node,
> > > >                               tree vectype, int misalign,
> > > >                               vect_cost_model_location where)
> > > > override;
> > > > +
> > > > +  unsigned int determine_suggested_unroll_factor (loop_vec_info);
> > > > +
> > > > +  void finish_cost (const vector_costs *) override;
> > > > +
> > > > +  /* Total number of vectorized stmts (loop only).  */  unsigned
> > > > + m_nstmts = 0;
> > > > +  /* Total number of loads (loop only).  */  unsigned m_nloads = 0;
> > > > +  /* Total number of stores (loop only).  */  unsigned m_nstores =
> > > > + 0;
> > > >  };
> > > >
> > > >  /* Implement targetm.vectorize.create_costs.  */
> > > > @@ -23579,6 +23590,19 @@ ix86_vector_costs::add_stmt_cost (int count,
> > vect_cost_for_stmt kind,
> > > >         retval = (retval * 17) / 10;
> > > >      }
> > > >
> > > > +  if (!m_costing_for_scalar
> > > > +      && is_a<loop_vec_info> (m_vinfo)
> > > > +      && where == vect_body)
> > > > +    {
> > > > +      m_nstmts += count;
> > > > +      if (kind == scalar_load || kind == vector_load
> > > > +         || kind == unaligned_load || kind == vector_gather_load)
> > > > +       m_nloads += count;
> > > > +      else if (kind == scalar_store || kind == vector_store
> > > > +              || kind == unaligned_store || kind == vector_scatter_store)
> > > > +       m_nstores += count;
> > > > +    }
> > > > +
> > > >    m_costs[where] += retval;
> > > >
> > > >    return retval;
> > > > @@ -23850,6 +23874,88 @@ ix86_loop_unroll_adjust (unsigned nunroll,
> > class loop *loop)
> > > >    return nunroll;
> > > >  }
> > > >
> > > > +unsigned int
> > > > +ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info
> > loop_vinfo)
> > > > +{
> > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > +
> > > > +  /* Don't unroll if it's specified explicitly not to be unrolled.  */
> > > > +  if (loop->unroll == 1
> > > > +      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
> > > > +      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
> > > > +    return 1;
> > > > +
> > > > +  /* Don't unroll if there is no vectorized stmt.  */
> > > > +  if (m_nstmts == 0)
> > > > +    return 1;
> > > > +
> > > > +  /* Don't unroll if vector size is zmm, since zmm throughput is lower
> > than other
> > > > +     sizes.  */
> > > > +  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> > > > +    return 1;
> > > > +
> > > > +  /* Calc the total number of loads and stores in the loop body.  */
> > > > +  unsigned int nstmts_ldst = m_nloads + m_nstores;
> > > > +
> > > > +  /* Don't unroll if loop body size big than threshold, the threshold
> > > > +     is a heuristic value inspired by param_max_unrolled_insns.  */
> > > > +  unsigned int uf = m_nstmts < (unsigned
> > int)x86_vect_unroll_max_loop_size
> > > > +                   ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
> > > > +                   : 1;
> > > > +  uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
> > > > +  uf = 1 << ceil_log2 (uf);
> > > > +
> > > > +  /* Early return if don't need to unroll.  */
> > > > +  if (uf == 1)
> > > > +    return 1;
> > > > +
> > > > +  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll
> > the loop
> > > > +     if the number of loads and stores exceeds the threshold, unroll +
> > software
> > > > +     schedule will reduce cache miss rate.  */
> > > > +  if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
> > > > +    return uf;
> > > > +
> > > > +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> > > > +  unsigned int vf = vect_vf_for_cost (loop_vinfo);
> > > > +  unsigned int unrolled_vf = vf * uf;
> > > > +  if (est_niter == -1 || est_niter < unrolled_vf)
> > > > +    /* When the estimated iteration of this loop is unknown, it's possible
> > > > +       that we are able to vectorize this loop with the original VF but fail
> > > > +       to vectorize it with the unrolled VF any more if the actual iteration
> > > > +       count is in between.  */
> > > > +    return 1;
> > > > +  else
> > > > +    {
> > > > +      unsigned int epil_niter_unr = est_niter % unrolled_vf;
> > > > +      unsigned int epil_niter = est_niter % vf;
> > > > +      /* Even if we have partial vector support, it can be still inefficent
> > > > +       to calculate the length when the iteration count is unknown, so
> > > > +       only expect it's good to unroll when the epilogue iteration count
> > > > +       is not bigger than VF (only one time length calculation).  */
> > > > +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > > +         && epil_niter_unr <= vf)
> > > > +       return uf;
> > > > +      /* Without partial vector support, conservatively unroll this when
> > > > +       the epilogue iteration count is less than the original one
> > > > +       (epilogue execution time wouldn't be longer than before).  */
> > > > +      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > > +              && epil_niter_unr <= epil_niter)
> > > > +       return uf;
> > > > +    }
> > > > +
> > > > +  return 1;
> > > > +}
> > > > +
> > > > +void
> > > > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > > > +
> > > > +{
> > > > +  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> > > > +    {
> > > > +      m_suggested_unroll_factor = determine_suggested_unroll_factor
> > (loop_vinfo);
> > > > +    }
> > > > +  vector_costs::finish_cost (scalar_costs);
> > > > +}
> > > >
> > > >  /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.
> > */
> > > >
> > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > > index 53d534f6392..8e49b406aa5 100644
> > > > --- a/gcc/config/i386/i386.opt
> > > > +++ b/gcc/config/i386/i386.opt
> > > > @@ -1224,3 +1224,18 @@ mavxvnniint8
> > > >  Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
> > > >  Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
> > > >  AVXVNNIINT8 built-in functions and code generation.
> > > > +
> > > > +-param=x86-vect-unroll-limit=
> > > > +Target Joined UInteger Var(x86_vect_unroll_limit) Init(4)
> > IntegerRange(1, 8) Param
> > > > +Used to limit unroll factor which indicates how much the autovectorizer
> > may
> > > > +unroll a loop.  The default value is 4.
> > > > +
> > > > +-param=x86-vect-unroll-min-ldst-threshold=
> > > > +Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold)
> > Init(25) Param
> > > > +Used to limit the mininum of loads and stores in the main loop.  The
> > default
> > > > +value is 25.
> > > > +
> > > > +-param=x86-vect-unroll-max-loop-size=
> > > > +Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200)
> > Param
> > > > +This threshold is used to limit the maxnum size of loop body after
> > unrolling.
> > > > +The default value is 200.
> > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > index 09548c4528c..c86d686f2cd 100644
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -15779,6 +15779,23 @@ The following choices of @var{name} are
> > available on i386 and x86_64 targets:
> > > >  @item x86-stlf-window-ninsns
> > > >  Instructions number above which STFL stall penalty can be compensated.
> > > >
> > > > +@item x86-vect-unroll-limit
> > > > +The vectorizer will check with target information to determine whether
> > it
> > > > +would be beneficial to unroll the main vectorized loop and by how much.
> > This
> > > > +parameter sets the upper bound of how much the vectorizer will unroll
> > the main
> > > > +loop.  The default value is four.
> > > > +
> > > > +@item x86-vect-unroll-min-ldst-threshold
> > > > +The vectorizer will check with target information to determine whether
> > unroll
> > > > +it. This parameter is used to limit the mininum of loads and stores in the
> > main
> > > > +loop.
> > > > +
> > > > +@item x86-vect-unroll-max-loop-size
> > > > +The vectorizer will check with target information to determine whether
> > unroll
> > > > +it. This threshold is used to limit the max size of loop body after unrolling.
> > > > +The default value is 200.
> > > > +
> > > > +
> > > >  @end table
> > > >
> > > >  @end table
> > > > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > > index 78c6600f83b..3bf1fb1b12d 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
> > > > @@ -1,5 +1,5 @@
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-
> > optimized" } */
> > > > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-
> > loops -fdump-tree-optimized" } */
> > > >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> > > >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> > > >  /* { dg-final { scan-assembler-times "vpmaxsb"  1 } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > > index 117179f2109..ba41fd64386 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
> > > > @@ -1,5 +1,5 @@
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-
> > optimized" } */
> > > > +/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-
> > loops -fdump-tree-optimized" } */
> > > >  /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
> > > >  /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
> > > >  /* { dg-final { scan-assembler-times "vpmaxub"  1 } } */
> > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > > index 4aa536ba86c..fd2f054af4a 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
> > > > @@ -2,7 +2,7 @@
> > > >  /* This is a test exercising peeling for alignment for a negative step
> > > >     vector loop.  We're forcing atom tuning here because that has a higher
> > > >     unaligned vs aligned cost unlike most other archs.  */
> > > > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-
> > details -save-temps" } */
> > > > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -
> > fdump-tree-vect-details -save-temps" } */
> > > >
> > > >  float a[1024], b[1024];
> > > >
> > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > > index 834bf0f770d..62c0db2bb9a 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
> > > > @@ -2,7 +2,7 @@
> > > >  /* This is a test exercising peeling for alignment for a positive step
> > > >     vector loop.  We're forcing atom tuning here because that has a higher
> > > >     unaligned vs aligned cost unlike most other archs.  */
> > > > -/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-
> > details -save-temps" } */
> > > > +/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -
> > fdump-tree-vect-details -save-temps" } */
> > > >
> > > >  float a[1024], b[1024];
> > > >
> > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > > index 9ee9ba4e736..1ba4be01bea 100644
> > > > --- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > > +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
> > > > @@ -1,5 +1,5 @@
> > > >  /* { dg-do compile } */
> > > > -/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" }
> > */
> > > > +/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-
> > tree-vect-details" } */
> > > >
> > > >  #define N 32
> > > >  int foo (int *a, int n)
> > > > --
> > > > 2.17.1
> > > >
> > > > Thanks,
> > > > Lili.
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
  
Li, Pan2 via Gcc-patches Nov. 2, 2022, 9:37 a.m. UTC | #5
> > > +@item x86-vect-unroll-min-ldst-threshold
> > > +The vectorizer will check with target information to determine
> > > +whether unroll it. This parameter is used to limit the mininum of
> > > +loads and stores in the main loop.
> > >
> > > It's odd to "limit" the minimum number of something.  I think this
> > > warrants clarification that for some (unknow to me ;)) reason we
> > > think that when we have many loads and (or?) stores it is beneficial
> > > to unroll to get even more loads and stores in a single iteration.
> > > Btw, does the parameter limit the number of loads and stores _after_
> unrolling or before?
> > >
> > When the number of loads/stores exceeds the threshold, the loads/stores
> are more likely to conflict with loop itself in the L1 cache(Assuming that
> address of loads are scattered).
> > Unroll + software scheduling will make 2 or 4 address contiguous
> loads/stores closer together, it will reduce cache miss rate.
> 
> Ah, nice.  Can we express the default as a function of L1 data cache size, L1
> cache line size and more importantly, the size of the vector memory access?
> 
> Btw, I was looking into making a more meaningful cost modeling for loop
> distribution.  Similar reasoning might apply there - try to _reduce_ the
> number of memory streams so L1 cache utilization allows re-use of a cache
> line in the next [next N] iteration[s]?  OTOH given L1D is quite large I'd expect
> the loops affected to be either quite huge or bottlenecked by load/store
> bandwith (there are 1024 L1D cache lines in zen2 for
> example) - what's the effective L1D load you are keying off?.
> Btw, how does L1D allocation on stores play a role here?
> 
Hi Richard,
To answer your question, I rechecked 549, I found that the 549 improvement comes from load reduction, it has a 3-level loop and 8 scalar loads in inner loop are loop invariants (due to high register pressure, these loop invariants all spill to the stack).
After unrolling the inner loop, those scalar parts are not doubled,  so unrolling reduces load instructions and L1/L2/L3 accesses. In the inner loop there are 8 different three-dimensional arrays, which size like this "a[128][480][128]". Although the size of the 3-layer array is very large,
but it doesn't support the theory I said before, Sorry for that. I need to hold this patch to see if we can do something about this scenario. 

Thanks,
Lili.
  

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index aeea26ef4be..a939354e55e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23336,6 +23336,17 @@  class ix86_vector_costs : public vector_costs
 			      stmt_vec_info stmt_info, slp_tree node,
 			      tree vectype, int misalign,
 			      vect_cost_model_location where) override;
+
+  unsigned int determine_suggested_unroll_factor (loop_vec_info);
+
+  void finish_cost (const vector_costs *) override;
+
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned m_nstmts = 0;
+  /* Total number of loads (loop only).  */
+  unsigned m_nloads = 0;
+  /* Total number of stores (loop only).  */
+  unsigned m_nstores = 0;
 };
 
 /* Implement targetm.vectorize.create_costs.  */
@@ -23579,6 +23590,19 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	retval = (retval * 17) / 10;
     }
 
+  if (!m_costing_for_scalar
+      && is_a<loop_vec_info> (m_vinfo)
+      && where == vect_body)
+    {
+      m_nstmts += count;
+      if (kind == scalar_load || kind == vector_load
+	  || kind == unaligned_load || kind == vector_gather_load)
+	m_nloads += count;
+      else if (kind == scalar_store || kind == vector_store
+	       || kind == unaligned_store || kind == vector_scatter_store)
+	m_nstores += count;
+    }
+
   m_costs[where] += retval;
 
   return retval;
@@ -23850,6 +23874,88 @@  ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   return nunroll;
 }
 
+unsigned int
+ix86_vector_costs::determine_suggested_unroll_factor (loop_vec_info loop_vinfo)
+{
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+
+  /* Don't unroll if it's specified explicitly not to be unrolled.  */
+  if (loop->unroll == 1
+      || (OPTION_SET_P (flag_unroll_loops) && !flag_unroll_loops)
+      || (OPTION_SET_P (flag_unroll_all_loops) && !flag_unroll_all_loops))
+    return 1;
+
+  /* Don't unroll if there is no vectorized stmt.  */
+  if (m_nstmts == 0)
+    return 1;
+
+  /* Don't unroll if vector size is zmm, since zmm throughput is lower than other
+     sizes.  */
+  if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
+    return 1;
+
+  /* Calc the total number of loads and stores in the loop body.  */
+  unsigned int nstmts_ldst = m_nloads + m_nstores;
+
+  /* Don't unroll if loop body size big than threshold, the threshold
+     is a heuristic value inspired by param_max_unrolled_insns.  */
+  unsigned int uf = m_nstmts < (unsigned int)x86_vect_unroll_max_loop_size
+		    ? ((unsigned int)x86_vect_unroll_max_loop_size / m_nstmts)
+		    : 1;
+  uf = MIN ((unsigned int)x86_vect_unroll_limit, uf);
+  uf = 1 << ceil_log2 (uf);
+
+  /* Early return if don't need to unroll.  */
+  if (uf == 1)
+    return 1;
+
+  /* Inspired by SPEC2017 fotonik3d_r, we want to aggressively unroll the loop
+     if the number of loads and stores exceeds the threshold, unroll + software
+     schedule will reduce cache miss rate.  */
+  if (nstmts_ldst >= (unsigned int)x86_vect_unroll_min_ldst_threshold)
+    return uf;
+
+  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
+  unsigned int vf = vect_vf_for_cost (loop_vinfo);
+  unsigned int unrolled_vf = vf * uf;
+  if (est_niter == -1 || est_niter < unrolled_vf)
+    /* When the estimated iteration of this loop is unknown, it's possible
+       that we are able to vectorize this loop with the original VF but fail
+       to vectorize it with the unrolled VF any more if the actual iteration
+       count is in between.  */
+    return 1;
+  else
+    {
+      unsigned int epil_niter_unr = est_niter % unrolled_vf;
+      unsigned int epil_niter = est_niter % vf;
+      /* Even if we have partial vector support, it can be still inefficent
+	to calculate the length when the iteration count is unknown, so
+	only expect it's good to unroll when the epilogue iteration count
+	is not bigger than VF (only one time length calculation).  */
+      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	  && epil_niter_unr <= vf)
+       return uf;
+      /* Without partial vector support, conservatively unroll this when
+	the epilogue iteration count is less than the original one
+	(epilogue execution time wouldn't be longer than before).  */
+      else if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	       && epil_niter_unr <= epil_niter)
+       return uf;
+    }
+
+  return 1;
+}
+
+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+
+{
+  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
+    {
+      m_suggested_unroll_factor = determine_suggested_unroll_factor (loop_vinfo);
+    }
+  vector_costs::finish_cost (scalar_costs);
+}
 
 /* Implement TARGET_FLOAT_EXCEPTIONS_ROUNDING_SUPPORTED_P.  */
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 53d534f6392..8e49b406aa5 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1224,3 +1224,18 @@  mavxvnniint8
 Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
 Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
 AVXVNNIINT8 built-in functions and code generation.
+
+-param=x86-vect-unroll-limit=
+Target Joined UInteger Var(x86_vect_unroll_limit) Init(4) IntegerRange(1, 8) Param
+Used to limit unroll factor which indicates how much the autovectorizer may
+unroll a loop.  The default value is 4.
+
+-param=x86-vect-unroll-min-ldst-threshold=
+Target Joined UInteger Var(x86_vect_unroll_min_ldst_threshold) Init(25) Param
+Used to limit the mininum of loads and stores in the main loop.  The default
+value is 25.
+
+-param=x86-vect-unroll-max-loop-size=
+Target Joined UInteger Var(x86_vect_unroll_max_loop_size) Init(200) Param
+This threshold is used to limit the maxnum size of loop body after unrolling.
+The default value is 200.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 09548c4528c..c86d686f2cd 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -15779,6 +15779,23 @@  The following choices of @var{name} are available on i386 and x86_64 targets:
 @item x86-stlf-window-ninsns
 Instructions number above which STFL stall penalty can be compensated.
 
+@item x86-vect-unroll-limit
+The vectorizer will check with target information to determine whether it
+would be beneficial to unroll the main vectorized loop and by how much.  This
+parameter sets the upper bound of how much the vectorizer will unroll the main
+loop.  The default value is four.
+
+@item x86-vect-unroll-min-ldst-threshold
+The vectorizer will check with target information to determine whether unroll
+it. This parameter is used to limit the mininum of loads and stores in the main
+loop.
+
+@item x86-vect-unroll-max-loop-size
+The vectorizer will check with target information to determine whether unroll
+it. This threshold is used to limit the max size of loop body after unrolling.
+The default value is 200.
+
+
 @end table
 
 @end table
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
index 78c6600f83b..3bf1fb1b12d 100644
--- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_b-1.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fdump-tree-optimized" } */
+/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=int8 -fno-unroll-loops -fdump-tree-optimized" } */
 /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
 /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
 /* { dg-final { scan-assembler-times "vpmaxsb"  1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
index 117179f2109..ba41fd64386 100644
--- a/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
+++ b/gcc/testsuite/gcc.target/i386/cond_op_maxmin_ub-1.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fdump-tree-optimized" } */
+/* { dg-options "-O2 -march=skylake-avx512 -DTYPE=uint8 -fno-unroll-loops -fdump-tree-optimized" } */
 /* { dg-final { scan-tree-dump ".COND_MAX" "optimized" } } */
 /* { dg-final { scan-tree-dump ".COND_MIN" "optimized" } } */
 /* { dg-final { scan-assembler-times "vpmaxub"  1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
index 4aa536ba86c..fd2f054af4a 100644
--- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
+++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-1.c
@@ -2,7 +2,7 @@ 
 /* This is a test exercising peeling for alignment for a negative step
    vector loop.  We're forcing atom tuning here because that has a higher
    unaligned vs aligned cost unlike most other archs.  */
-/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
+/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
 
 float a[1024], b[1024];
 
diff --git a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
index 834bf0f770d..62c0db2bb9a 100644
--- a/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
+++ b/gcc/testsuite/gcc.target/i386/vect-alignment-peeling-2.c
@@ -2,7 +2,7 @@ 
 /* This is a test exercising peeling for alignment for a positive step
    vector loop.  We're forcing atom tuning here because that has a higher
    unaligned vs aligned cost unlike most other archs.  */
-/* { dg-options "-O3 -march=x86-64 -mtune=atom -fdump-tree-vect-details -save-temps" } */
+/* { dg-options "-O3 -march=x86-64 -mtune=atom -fno-unroll-loops -fdump-tree-vect-details -save-temps" } */
 
 float a[1024], b[1024];
 
diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
index 9ee9ba4e736..1ba4be01bea 100644
--- a/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
+++ b/gcc/testsuite/gcc.target/i386/vect-reduc-1.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */
+/* { dg-options "-O3 -mavx2 -mno-avx512f -fno-unroll-loops -fdump-tree-vect-details" } */
 
 #define N 32
 int foo (int *a, int n)