Take register pressure into account for vec_construct when the components are not loaded from memory.

Message ID 20231128075212.3526692-1-hongtao.liu@intel.com
State New
Headers
Series Take register pressure into account for vec_construct when the components are not loaded from memory. |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Testing passed

Commit Message

Liu, Hongtao Nov. 28, 2023, 7:52 a.m. UTC
  For vec_contruct, the components must be live at the same time if
they're not loaded from memory, when the number of those components
exceeds available registers, spill happens. Try to account that with a
rough estimation.
??? Ideally, we should have an overall estimation of register pressure
if we know the live range of all variables.

The patch can avoid regressions due to .i.e. vec_contruct with 32 char.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.

Ok for trunk?

gcc/ChangeLog:

	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Take
	register pressure into account for vec_construct when the
	components are not loaded from memory.
---
 gcc/config/i386/i386.cc | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
  

Comments

Richard Biener Nov. 29, 2023, 7:47 a.m. UTC | #1
On Tue, Nov 28, 2023 at 8:54 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> For vec_contruct, the components must be live at the same time if
> they're not loaded from memory, when the number of those components
> exceeds available registers, spill happens. Try to account that with a
> rough estimation.
> ??? Ideally, we should have an overall estimation of register pressure
> if we know the live range of all variables.
>
> The patch can avoid regressions due to .i.e. vec_contruct with 32 char.
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>
> Ok for trunk?

Hmm, I would suggest you put reg_needed into the class and accumulate
over all vec_construct, with your patch you pessimize a single v32qi
over two separate v16qi for example.  Also currently the whole block is
gated with INTEGRAL_TYPE_P but register pressure would be also
a concern for floating point vectors.  finish_cost would then apply an
adjustment.

'target_avail_regs' is for GENERAL_REGS, does that include APX regs?
I don't see anything similar for FP regs, but I guess the target should know
or maybe there's a #regs in regclass query already.

That said, this kind of adjustment looks somewhat appealing.

Richard.

> gcc/ChangeLog:
>
>         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Take
>         register pressure into account for vec_construct when the
>         components are not loaded from memory.
> ---
>  gcc/config/i386/i386.cc | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 683ac643bc8..f8417555930 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -24706,6 +24706,7 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>        unsigned i;
>        tree op;
> +      unsigned reg_needed = 0;
>        FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
>         if (TREE_CODE (op) == SSA_NAME)
>           TREE_VISITED (op) = 0;
> @@ -24737,11 +24738,30 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>                   && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
>                       || !VECTOR_TYPE_P (TREE_TYPE
>                                 (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> -           stmt_cost += ix86_cost->sse_to_integer;
> +           {
> +             stmt_cost += ix86_cost->sse_to_integer;
> +             reg_needed++;
> +           }
>         }
>        FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
>         if (TREE_CODE (op) == SSA_NAME)
>           TREE_VISITED (op) = 0;
> +
> +      /* For vec_contruct, the components must be live at the same time if
> +        they're not loaded from memory, when the number of those components
> +        exceeds available registers, spill happens. Try to account that with a
> +        rough estimation. Currently only handle integral modes since scalar fp
> +        shares sse_regs with vectors.
> +        ??? Ideally, we should have an overall estimation of register pressure
> +        if we know the live range of all variables.  */
> +      if (!fp && kind == vec_construct
> +         && reg_needed > target_avail_regs)
> +       {
> +         unsigned spill_cost = ix86_builtin_vectorization_cost (scalar_store,
> +                                                                vectype,
> +                                                                misalign);
> +         stmt_cost += spill_cost * (reg_needed - target_avail_regs);
> +       }
>      }
>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> --
> 2.31.1
>
  
Hongtao Liu Nov. 29, 2023, 8 a.m. UTC | #2
On Wed, Nov 29, 2023 at 3:47 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Nov 28, 2023 at 8:54 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > For vec_contruct, the components must be live at the same time if
> > they're not loaded from memory, when the number of those components
> > exceeds available registers, spill happens. Try to account that with a
> > rough estimation.
> > ??? Ideally, we should have an overall estimation of register pressure
> > if we know the live range of all variables.
> >
> > The patch can avoid regressions due to .i.e. vec_contruct with 32 char.
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> >
> > Ok for trunk?
>
> Hmm, I would suggest you put reg_needed into the class and accumulate
> over all vec_construct, with your patch you pessimize a single v32qi
> over two separate v16qi for example.  Also currently the whole block is
2 separate v16qi(or 4 v8qi) may have different live ranges, but yes,
vec_construct is probably an entry for vectorized code, so counting
them all also makes sense.

> gated with INTEGRAL_TYPE_P but register pressure would be also
> a concern for floating point vectors.  finish_cost would then apply an
> adjustment.
>
> 'target_avail_regs' is for GENERAL_REGS, does that include APX regs?
Yes, I saw it's used by ivopt which mostly cares about integers.
> I don't see anything similar for FP regs, but I guess the target should know
> or maybe there's a #regs in regclass query already.
For x86, float shares sse regs with vector registers, for that case,
we may need to count all vector temporary variables to get an
estimation for sse_regs?
but still we don't know the live range, so probably let's just start
with vec_constuct.
>
> That said, this kind of adjustment looks somewhat appealing.
>
> Richard.
>
> > gcc/ChangeLog:
> >
> >         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Take
> >         register pressure into account for vec_construct when the
> >         components are not loaded from memory.
> > ---
> >  gcc/config/i386/i386.cc | 22 +++++++++++++++++++++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 683ac643bc8..f8417555930 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -24706,6 +24706,7 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >        stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> >        unsigned i;
> >        tree op;
> > +      unsigned reg_needed = 0;
> >        FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> >         if (TREE_CODE (op) == SSA_NAME)
> >           TREE_VISITED (op) = 0;
> > @@ -24737,11 +24738,30 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >                   && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> >                       || !VECTOR_TYPE_P (TREE_TYPE
> >                                 (TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
> > -           stmt_cost += ix86_cost->sse_to_integer;
> > +           {
> > +             stmt_cost += ix86_cost->sse_to_integer;
> > +             reg_needed++;
> > +           }
> >         }
> >        FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
> >         if (TREE_CODE (op) == SSA_NAME)
> >           TREE_VISITED (op) = 0;
> > +
> > +      /* For vec_contruct, the components must be live at the same time if
> > +        they're not loaded from memory, when the number of those components
> > +        exceeds available registers, spill happens. Try to account that with a
> > +        rough estimation. Currently only handle integral modes since scalar fp
> > +        shares sse_regs with vectors.
> > +        ??? Ideally, we should have an overall estimation of register pressure
> > +        if we know the live range of all variables.  */
> > +      if (!fp && kind == vec_construct
> > +         && reg_needed > target_avail_regs)
> > +       {
> > +         unsigned spill_cost = ix86_builtin_vectorization_cost (scalar_store,
> > +                                                                vectype,
> > +                                                                misalign);
> > +         stmt_cost += spill_cost * (reg_needed - target_avail_regs);
> > +       }
> >      }
> >    if (stmt_cost == -1)
> >      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
> > --
> > 2.31.1
> >
  

Patch

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 683ac643bc8..f8417555930 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24706,6 +24706,7 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
       unsigned i;
       tree op;
+      unsigned reg_needed = 0;
       FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
 	if (TREE_CODE (op) == SSA_NAME)
 	  TREE_VISITED (op) = 0;
@@ -24737,11 +24738,30 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 		  && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
 		      || !VECTOR_TYPE_P (TREE_TYPE
 				(TREE_OPERAND (gimple_assign_rhs1 (def), 0))))))
-	    stmt_cost += ix86_cost->sse_to_integer;
+	    {
+	      stmt_cost += ix86_cost->sse_to_integer;
+	      reg_needed++;
+	    }
 	}
       FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_OPS (node), i, op)
 	if (TREE_CODE (op) == SSA_NAME)
 	  TREE_VISITED (op) = 0;
+
+      /* For vec_contruct, the components must be live at the same time if
+	 they're not loaded from memory, when the number of those components
+	 exceeds available registers, spill happens. Try to account that with a
+	 rough estimation. Currently only handle integral modes since scalar fp
+	 shares sse_regs with vectors.
+	 ??? Ideally, we should have an overall estimation of register pressure
+	 if we know the live range of all variables.  */
+      if (!fp && kind == vec_construct
+	  && reg_needed > target_avail_regs)
+	{
+	  unsigned spill_cost = ix86_builtin_vectorization_cost (scalar_store,
+								 vectype,
+								 misalign);
+	  stmt_cost += spill_cost * (reg_needed - target_avail_regs);
+	}
     }
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);