[aarch64] Code-gen for vector initialization involving constants

Message ID CAAgBjMnwGk4fOc3PTM_agTXXFvt=767a3-AWOfSr23Xja6K81w@mail.gmail.com
State New
Series [aarch64] Code-gen for vector initialization involving constants

Commit Message

Prathamesh Kulkarni Feb. 3, 2023, 7:16 a.m. UTC
  Hi Richard,
While digging thru aarch64_expand_vector_init, I noticed it gives
priority to loading a constant first:
 /* Initialise a vector which is part-variable.  We want to first try
     to build those lanes which are constant in the most efficient way we
     can.  */

which results in suboptimal code-gen for the following case:
int16x8_t f_s16(int16_t x)
{
  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
}

code-gen trunk:
f_s16:
        movi    v0.8h, 0x1
        ins     v0.h[0], w0
        ins     v0.h[1], w0
        ins     v0.h[2], w0
        ins     v0.h[3], w0
        ins     v0.h[4], w0
        ins     v0.h[5], w0
        ins     v0.h[6], w0
        ret

The attached patch tweaks the following condition:
if (n_var == n_elts && n_elts <= 16)
  {
    ...
  }

to pass if maxv >= 80% of n_elts, with 80% being an
arbitrary "high enough" threshold. The intent is to dup
the most repeated variable if its repetition count
is "high enough" and insert the constants, which should be "better" than
loading the constant first and inserting variables like in the above case.
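
As a rough sketch of that tweak (the attached patch is the
authoritative version; since maxv is only computed inside the guarded
block, the actual change has to restructure the check a little):

/* Sketch only: also pass when the most repeated variable element
   covers at least 80% of the lanes, i.e. maxv * 5 >= n_elts * 4
   in integer arithmetic.  */
if ((n_var == n_elts || maxv * 5 >= n_elts * 4) && n_elts <= 16)
  {
    ...
  }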

Alternatively, I suppose we could remove the threshold and, for
constants, generate both sequences and check which one is more
efficient ?

code-gen with patch:
f_s16:
        dup     v0.8h, w0
        movi    v1.4h, 0x1
        ins     v0.h[7], v1.h[0]
        ret

The patch is lightly tested to verify that the vec[t]-init-*.c tests
pass, with bootstrap+test in progress.
Does this look OK ?

Thanks,
Prathamesh
  

Comments

Prathamesh Kulkarni Feb. 13, 2023, 6:28 a.m. UTC | #1
On Fri, 3 Feb 2023 at 12:46, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> Hi Richard,
> While digging thru aarch64_expand_vector_init, I noticed it gives
> priority to loading a constant first:
>  /* Initialise a vector which is part-variable.  We want to first try
>      to build those lanes which are constant in the most efficient way we
>      can.  */
>
> which results in suboptimal code-gen for the following case:
> int16x8_t f_s16(int16_t x)
> {
>   return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> }
>
> code-gen trunk:
> f_s16:
>         movi    v0.8h, 0x1
>         ins     v0.h[0], w0
>         ins     v0.h[1], w0
>         ins     v0.h[2], w0
>         ins     v0.h[3], w0
>         ins     v0.h[4], w0
>         ins     v0.h[5], w0
>         ins     v0.h[6], w0
>         ret
>
> The attached patch tweaks the following condition:
> if (n_var == n_elts && n_elts <= 16)
>   {
>     ...
>   }
>
> to pass if maxv >= 80% of n_elts, with 80% being an
> arbitrary "high enough" threshold. The intent is to dup
> the most repeated variable if its repetition count
> is "high enough" and insert the constants, which should be "better" than
> loading the constant first and inserting variables like in the above case.
>
> Alternatively, I suppose we could remove the threshold and, for
> constants, generate both sequences and check which one is more
> efficient ?
>
> code-gen with patch:
> f_s16:
>         dup     v0.8h, w0
>         movi    v1.4h, 0x1
>         ins     v0.h[7], v1.h[0]
>         ret
>
> The patch is lightly tested to verify that the vec[t]-init-*.c tests
> pass, with bootstrap+test in progress.
> Does this look OK ?
Hi Richard,
ping https://gcc.gnu.org/pipermail/gcc-patches/2023-February/611243.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
  
Prathamesh Kulkarni April 3, 2023, 6:12 p.m. UTC | #2
On Mon, 13 Feb 2023 at 11:58, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Fri, 3 Feb 2023 at 12:46, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > Hi Richard,
> > While digging thru aarch64_expand_vector_init, I noticed it gives
> > priority to loading a constant first:
> >  /* Initialise a vector which is part-variable.  We want to first try
> >      to build those lanes which are constant in the most efficient way we
> >      can.  */
> >
> > which results in suboptimal code-gen for the following case:
> > int16x8_t f_s16(int16_t x)
> > {
> >   return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> > }
> >
> > code-gen trunk:
> > f_s16:
> >         movi    v0.8h, 0x1
> >         ins     v0.h[0], w0
> >         ins     v0.h[1], w0
> >         ins     v0.h[2], w0
> >         ins     v0.h[3], w0
> >         ins     v0.h[4], w0
> >         ins     v0.h[5], w0
> >         ins     v0.h[6], w0
> >         ret
> >
> > The attached patch tweaks the following condition:
> > if (n_var == n_elts && n_elts <= 16)
> >   {
> >     ...
> >   }
> >
> > to pass if maxv >= 80% of n_elts, with 80% being an
> > arbitrary "high enough" threshold. The intent is to dup
> > the most repeated variable if its repetition count
> > is "high enough" and insert the constants, which should be "better" than
> > loading the constant first and inserting variables like in the above case.
> >
> > Alternatively, I suppose we could remove the threshold and, for
> > constants, generate both sequences and check which one is more
> > efficient ?
> >
> > code-gen with patch:
> > f_s16:
> >         dup     v0.8h, w0
> >         movi    v1.4h, 0x1
> >         ins     v0.h[7], v1.h[0]
> >         ret
> >
> > The patch is lightly tested to verify that the vec[t]-init-*.c tests
> > pass, with bootstrap+test in progress.
> > Does this look OK ?
> Hi Richard,
> ping https://gcc.gnu.org/pipermail/gcc-patches/2023-February/611243.html
Hi Richard,
ping * 2: https://gcc.gnu.org/pipermail/gcc-patches/2023-February/611243.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
  
Richard Sandiford April 25, 2023, 10:59 a.m. UTC | #3
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> Hi Richard,
> While digging thru aarch64_expand_vector_init, I noticed it gives
> priority to loading a constant first:
>  /* Initialise a vector which is part-variable.  We want to first try
>      to build those lanes which are constant in the most efficient way we
>      can.  */
>
> which results in suboptimal code-gen for the following case:
> int16x8_t f_s16(int16_t x)
> {
>   return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> }
>
> code-gen trunk:
> f_s16:
>         movi    v0.8h, 0x1
>         ins     v0.h[0], w0
>         ins     v0.h[1], w0
>         ins     v0.h[2], w0
>         ins     v0.h[3], w0
>         ins     v0.h[4], w0
>         ins     v0.h[5], w0
>         ins     v0.h[6], w0
>         ret
>
> The attached patch tweaks the following condition:
> if (n_var == n_elts && n_elts <= 16)
>   {
>     ...
>   }
>
> to pass if maxv >= 80% of n_elts, with 80% being an
> arbitrary "high enough" threshold. The intent is to dup
> the most repeated variable if its repetition count
> is "high enough" and insert the constants, which should be "better" than
> loading the constant first and inserting variables like in the above case.

I'm not too keen on the 80%.  Like you say, it seems a bit arbitrary.

The case above can also be handled by relaxing n_var == n_elts to
n_var >= n_elts - 1, so that if there's just one constant element,
we look for duplicated variable elements.  If there are none
(maxv == 1), but there is a constant element, we can duplicate
the constant element into a register.

The case when there's more than one constant element needs more thought
(and testcases :-)).  E.g. after a certain point, it would probably be
better to load the variable and constant parts separately and blend them
using TBL.  It also matters whether the constants are equal or not.

There are also cases that could be handled using EXT.

Plus, if we're inserting many variable elements that are already
in GPRs, we can probably do better by coalescing them into bigger
GPR values and inserting them as wider elements.

Because of things like that, I think we should stick to the
single-constant case for now.
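
For illustration, a hypothetical multi-constant case (not from the
patch; it reuses the arm_neon.h types quoted above) would be:

/* Distinct constants mixed with repeated variables: past some point,
   loading the constant lanes separately and blending them in with TBL
   may beat a chain of scalar ins instructions.  */
int16x8_t g_s16 (int16_t x, int16_t y)
{
  return (int16x8_t) { x, 1, y, 2, x, 3, y, 4 };
}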

Thanks,
Richard
  
Prathamesh Kulkarni May 2, 2023, 5:41 a.m. UTC | #4
On Tue, 25 Apr 2023 at 16:29, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > Hi Richard,
> > While digging thru aarch64_expand_vector_init, I noticed it gives
> > priority to loading a constant first:
> >  /* Initialise a vector which is part-variable.  We want to first try
> >      to build those lanes which are constant in the most efficient way we
> >      can.  */
> >
> > which results in suboptimal code-gen for the following case:
> > int16x8_t f_s16(int16_t x)
> > {
> >   return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> > }
> >
> > code-gen trunk:
> > f_s16:
> >         movi    v0.8h, 0x1
> >         ins     v0.h[0], w0
> >         ins     v0.h[1], w0
> >         ins     v0.h[2], w0
> >         ins     v0.h[3], w0
> >         ins     v0.h[4], w0
> >         ins     v0.h[5], w0
> >         ins     v0.h[6], w0
> >         ret
> >
> > The attached patch tweaks the following condition:
> > if (n_var == n_elts && n_elts <= 16)
> >   {
> >     ...
> >   }
> >
> > to pass if maxv >= 80% of n_elts, with 80% being an
> > arbitrary "high enough" threshold. The intent is to dup
> > the most repeated variable if its repetition count
> > is "high enough" and insert the constants, which should be "better" than
> > loading the constant first and inserting variables like in the above case.
>
> I'm not too keen on the 80%.  Like you say, it seems a bit arbitrary.
>
> The case above can also be handled by relaxing n_var == n_elts to
> n_var >= n_elts - 1, so that if there's just one constant element,
> we look for duplicated variable elements.  If there are none
> (maxv == 1), but there is a constant element, we can duplicate
> the constant element into a register.
>
> The case when there's more than one constant element needs more thought
> (and testcases :-)).  E.g. after a certain point, it would probably be
> better to load the variable and constant parts separately and blend them
> using TBL.  It also matters whether the constants are equal or not.
>
> There are also cases that could be handled using EXT.
>
> Plus, if we're inserting many variable elements that are already
> in GPRs, we can probably do better by coalescing them into bigger
> GPR values and inserting them as wider elements.
>
> Because of things like that, I think we should stick to the
> single-constant case for now.
Hi Richard,
Thanks for the suggestions. The attached patch only handles the single
constant case.
Bootstrap+test in progress on aarch64-linux-gnu.
Does it look OK ?

Thanks,
Prathamesh
>
> Thanks,
> Richard
[aarch64] Improve code-gen for vector initialization with single constant element.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
	if (n_var == n_elts && n_elts <= 16) to allow a single constant,
	and if maxv == 1, use the constant element for duplicating into a register.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/vec-init-single-const.c: New test.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2b0de7ca038..f46750133a6 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
      and matches[X][1] with the count of duplicate elements (if X is the
      earliest element which has duplicates).  */
 
-  if (n_var == n_elts && n_elts <= 16)
+  if ((n_var >= n_elts - 1) && n_elts <= 16)
     {
       int matches[16][2] = {0};
       for (int i = 0; i < n_elts; i++)
@@ -22227,6 +22227,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	     vector register.  For big-endian we want that position to hold
 	     the last element of VALS.  */
 	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
+
+	  /* If we have a single constant element, use that for duplicating
+	     instead.  */
+	  if (n_var == n_elts - 1)
+	    for (int i = 0; i < n_elts; i++)
+	      if (CONST_INT_P (XVECEXP (vals, 0, i))
+		  || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
+		{
+		  maxelement = i;
+		  break;
+		}
+
 	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
 	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
 	}
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
new file mode 100644
index 00000000000..517f47b13ec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
@@ -0,0 +1,66 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <arm_neon.h>
+
+/*
+** f_s8:
+**	...
+**	dup	v[0-9]+\.16b, w[0-9]+
+**	movi	v[0-9]+\.8b, 0x1
+**	ins	v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
+**	...
+**	ret
+*/
+
+int8x16_t f_s8(int8_t x)
+{
+  return (int8x16_t) { x, x, x, x, x, x, x, x,
+                       x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s16:
+**	...
+**	dup	v[0-9]+\.8h, w[0-9]+
+**	movi	v[0-9]+\.4h, 0x1
+**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
+**	...
+**	ret
+*/
+
+int16x8_t f_s16(int16_t x)
+{
+  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s32:
+**	...
+**	movi	v[0-9]\.2s, 0x1
+**	dup	v[0-9]\.4s, w[0-9]+
+**	ins	v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
+**	...
+**	ret
+*/
+
+int32x4_t f_s32(int32_t x)
+{
+  return (int32x4_t) { x, x, x, 1 };
+}
+
+/*
+** f_s64:
+**	...
+**	fmov	d[0-9]+, x[0-9]+
+**	mov	x[0-9]+, 1
+**	ins	v[0-9]+\.d\[1\], x[0-9]+
+**	...
+**	ret
+*/
+
+int64x2_t f_s64(int64_t x)
+{
+  return (int64x2_t) { x, 1 };
+}
  
Richard Sandiford May 2, 2023, 9:25 a.m. UTC | #5
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 25 Apr 2023 at 16:29, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > Hi Richard,
>> > While digging thru aarch64_expand_vector_init, I noticed it gives
>> > priority to loading a constant first:
>> >  /* Initialise a vector which is part-variable.  We want to first try
>> >      to build those lanes which are constant in the most efficient way we
>> >      can.  */
>> >
>> > which results in suboptimal code-gen for the following case:
>> > int16x8_t f_s16(int16_t x)
>> > {
>> >   return (int16x8_t) { x, x, x, x, x, x, x, 1 };
>> > }
>> >
>> > code-gen trunk:
>> > f_s16:
>> >         movi    v0.8h, 0x1
>> >         ins     v0.h[0], w0
>> >         ins     v0.h[1], w0
>> >         ins     v0.h[2], w0
>> >         ins     v0.h[3], w0
>> >         ins     v0.h[4], w0
>> >         ins     v0.h[5], w0
>> >         ins     v0.h[6], w0
>> >         ret
>> >
>> > The attached patch tweaks the following condition:
>> > if (n_var == n_elts && n_elts <= 16)
>> >   {
>> >     ...
>> >   }
>> >
>> > to pass if maxv >= 80% of n_elts, with 80% being an
>> > arbitrary "high enough" threshold. The intent is to dup
>> > the most repeated variable if its repetition count
>> > is "high enough" and insert the constants, which should be "better" than
>> > loading the constant first and inserting variables like in the above case.
>>
>> I'm not too keen on the 80%.  Like you say, it seems a bit arbitrary.
>>
>> The case above can also be handled by relaxing n_var == n_elts to
>> n_var >= n_elts - 1, so that if there's just one constant element,
>> we look for duplicated variable elements.  If there are none
>> (maxv == 1), but there is a constant element, we can duplicate
>> the constant element into a register.
>>
>> The case when there's more than one constant element needs more thought
>> (and testcases :-)).  E.g. after a certain point, it would probably be
>> better to load the variable and constant parts separately and blend them
>> using TBL.  It also matters whether the constants are equal or not.
>>
>> There are also cases that could be handled using EXT.
>>
>> Plus, if we're inserting many variable elements that are already
>> in GPRs, we can probably do better by coalescing them into bigger
>> GPR values and inserting them as wider elements.
>>
>> Because of things like that, I think we should stick to the
>> single-constant case for now.
> Hi Richard,
> Thanks for the suggestions. The attached patch only handles the single
> constant case.
> Bootstrap+test in progress on aarch64-linux-gnu.
> Does it look OK ?
>
> Thanks,
> Prathamesh
>>
>> Thanks,
>> Richard
>
> [aarch64] Improve code-gen for vector initialization with single constant element.
>
> gcc/ChangeLog:
> 	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> 	if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> 	and if maxv == 1, use the constant element for duplicating into a register.
>
> gcc/testsuite/ChangeLog:
> 	* gcc.target/aarch64/vec-init-single-const.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 2b0de7ca038..f46750133a6 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>       and matches[X][1] with the count of duplicate elements (if X is the
>       earliest element which has duplicates).  */
>  
> -  if (n_var == n_elts && n_elts <= 16)
> +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>      {
>        int matches[16][2] = {0};
>        for (int i = 0; i < n_elts; i++)
> @@ -22227,6 +22227,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>  	     vector register.  For big-endian we want that position to hold
>  	     the last element of VALS.  */
>  	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> +
> +	  /* If we have a single constant element, use that for duplicating
> +	     instead.  */
> +	  if (n_var == n_elts - 1)
> +	    for (int i = 0; i < n_elts; i++)
> +	      if (CONST_INT_P (XVECEXP (vals, 0, i))
> +		  || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> +		{
> +		  maxelement = i;
> +		  break;
> +		}
> +
>  	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>  	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));

We don't want to force the constant into a register though.

>  	}
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> new file mode 100644
> index 00000000000..517f47b13ec
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> @@ -0,0 +1,66 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +**	...
> +**	dup	v[0-9]+\.16b, w[0-9]+
> +**	movi	v[0-9]+\.8b, 0x1
> +**	ins	v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
> +**	...
> +**	ret
> +*/
> +
> +int8x16_t f_s8(int8_t x)
> +{
> +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> +                       x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s16:
> +**	...
> +**	dup	v[0-9]+\.8h, w[0-9]+
> +**	movi	v[0-9]+\.4h, 0x1
> +**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
> +**	...
> +**	ret
> +*/
> +
> +int16x8_t f_s16(int16_t x)
> +{
> +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s32:
> +**	...
> +**	movi	v[0-9]\.2s, 0x1
> +**	dup	v[0-9]\.4s, w[0-9]+
> +**	ins	v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
> +**	...
> +**	ret
> +*/
> +
> +int32x4_t f_s32(int32_t x)
> +{
> +  return (int32x4_t) { x, x, x, 1 };
> +}
> +
> +/*
> +** f_s64:
> +**	...
> +**	fmov	d[0-9]+, x[0-9]+
> +**	mov	x[0-9]+, 1
> +**	ins	v[0-9]+\.d\[1\], x[0-9]+
> +**	...
> +**	ret
> +*/
> +
> +int64x2_t f_s64(int64_t x)
> +{
> +  return (int64x2_t) { x, 1 };
> +}
  
Prathamesh Kulkarni May 2, 2023, 10:22 a.m. UTC | #6
On Tue, 2 May 2023 at 14:56, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Tue, 25 Apr 2023 at 16:29, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > Hi Richard,
> >> > While digging thru aarch64_expand_vector_init, I noticed it gives
> >> > priority to loading a constant first:
> >> >  /* Initialise a vector which is part-variable.  We want to first try
> >> >      to build those lanes which are constant in the most efficient way we
> >> >      can.  */
> >> >
> >> > which results in suboptimal code-gen for the following case:
> >> > int16x8_t f_s16(int16_t x)
> >> > {
> >> >   return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> >> > }
> >> >
> >> > code-gen trunk:
> >> > f_s16:
> >> >         movi    v0.8h, 0x1
> >> >         ins     v0.h[0], w0
> >> >         ins     v0.h[1], w0
> >> >         ins     v0.h[2], w0
> >> >         ins     v0.h[3], w0
> >> >         ins     v0.h[4], w0
> >> >         ins     v0.h[5], w0
> >> >         ins     v0.h[6], w0
> >> >         ret
> >> >
> >> > The attached patch tweaks the following condition:
> >> > if (n_var == n_elts && n_elts <= 16)
> >> >   {
> >> >     ...
> >> >   }
> >> >
> >> > to pass if maxv >= 80% of n_elts, with 80% being an
> >> > arbitrary "high enough" threshold. The intent is to dup
> >> > the most repeated variable if its repetition count
> >> > is "high enough" and insert the constants, which should be "better" than
> >> > loading the constant first and inserting variables like in the above case.
> >>
> >> I'm not too keen on the 80%.  Like you say, it seems a bit arbitrary.
> >>
> >> The case above can also be handled by relaxing n_var == n_elts to
> >> n_var >= n_elts - 1, so that if there's just one constant element,
> >> we look for duplicated variable elements.  If there are none
> >> (maxv == 1), but there is a constant element, we can duplicate
> >> the constant element into a register.
> >>
> >> The case when there's more than one constant element needs more thought
> >> (and testcases :-)).  E.g. after a certain point, it would probably be
> >> better to load the variable and constant parts separately and blend them
> >> using TBL.  It also matters whether the constants are equal or not.
> >>
> >> There are also cases that could be handled using EXT.
> >>
> >> Plus, if we're inserting many variable elements that are already
> >> in GPRs, we can probably do better by coalescing them into bigger
> >> GPR values and inserting them as wider elements.
> >>
> >> Because of things like that, I think we should stick to the
> >> single-constant case for now.
> > Hi Richard,
> > Thanks for the suggestions. The attached patch only handles the single
> > constant case.
> > Bootstrap+test in progress on aarch64-linux-gnu.
> > Does it look OK ?
> >
> > Thanks,
> > Prathamesh
> >>
> >> Thanks,
> >> Richard
> >
> > [aarch64] Improve code-gen for vector initialization with single constant element.
> >
> > gcc/ChangeLog:
> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> >       and if maxv == 1, use the constant element for duplicating into a register.
> >
> > gcc/testsuite/ChangeLog:
> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 2b0de7ca038..f46750133a6 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >       and matches[X][1] with the count of duplicate elements (if X is the
> >       earliest element which has duplicates).  */
> >
> > -  if (n_var == n_elts && n_elts <= 16)
> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
> >      {
> >        int matches[16][2] = {0};
> >        for (int i = 0; i < n_elts; i++)
> > @@ -22227,6 +22227,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >            vector register.  For big-endian we want that position to hold
> >            the last element of VALS.  */
> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> > +
> > +       /* If we have a single constant element, use that for duplicating
> > +          instead.  */
> > +       if (n_var == n_elts - 1)
> > +         for (int i = 0; i < n_elts; i++)
> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> > +             {
> > +               maxelement = i;
> > +               break;
> > +             }
> > +
> >         rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> >         aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>
> We don't want to force the constant into a register though.
OK right, sorry.
With the attached patch, for the following test-case:
int64x2_t f_s64(int64_t x)
{
  return (int64x2_t) { x, 1 };
}

it loads the constant from memory (same code-gen as without the patch).
f_s64:
        adrp    x1, .LC0
        ldr     q0, [x1, #:lo12:.LC0]
        ins     v0.d[0], x0
        ret

Does the patch look OK ?

Thanks,
Prathamesh


>
> >       }
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> > new file mode 100644
> > index 00000000000..517f47b13ec
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/*
> > +** f_s8:
> > +**   ...
> > +**   dup     v[0-9]+\.16b, w[0-9]+
> > +**   movi    v[0-9]+\.8b, 0x1
> > +**   ins     v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
> > +**   ...
> > +**   ret
> > +*/
> > +
> > +int8x16_t f_s8(int8_t x)
> > +{
> > +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> > +                       x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s16:
> > +**   ...
> > +**   dup     v[0-9]+\.8h, w[0-9]+
> > +**   movi    v[0-9]+\.4h, 0x1
> > +**   ins     v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
> > +**   ...
> > +**   ret
> > +*/
> > +
> > +int16x8_t f_s16(int16_t x)
> > +{
> > +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s32:
> > +**   ...
> > +**   movi    v[0-9]\.2s, 0x1
> > +**   dup     v[0-9]\.4s, w[0-9]+
> > +**   ins     v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
> > +**   ...
> > +**   ret
> > +*/
> > +
> > +int32x4_t f_s32(int32_t x)
> > +{
> > +  return (int32x4_t) { x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s64:
> > +**   ...
> > +**   fmov    d[0-9]+, x[0-9]+
> > +**   mov     x[0-9]+, 1
> > +**   ins     v[0-9]+\.d\[1\], x[0-9]+
> > +**   ...
> > +**   ret
> > +*/
> > +
> > +int64x2_t f_s64(int64_t x)
> > +{
> > +  return (int64x2_t) { x, 1 };
> > +}
[aarch64] Improve code-gen for vector initialization with single constant element.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
	if (n_var == n_elts && n_elts <= 16) to allow a single constant,
	and if maxv == 1, use the constant element for duplicating into a register.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/vec-init-single-const.c: New test.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2b0de7ca038..97309ddec4f 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
      and matches[X][1] with the count of duplicate elements (if X is the
      earliest element which has duplicates).  */
 
-  if (n_var == n_elts && n_elts <= 16)
+  if ((n_var >= n_elts - 1) && n_elts <= 16)
     {
       int matches[16][2] = {0};
       for (int i = 0; i < n_elts; i++)
@@ -22227,8 +22227,26 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	     vector register.  For big-endian we want that position to hold
 	     the last element of VALS.  */
 	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
-	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+
+	  /* If we have a single constant element, use that for duplicating
+	     instead.  */
+	  if (n_var == n_elts - 1)
+	    for (int i = 0; i < n_elts; i++)
+	      if (CONST_INT_P (XVECEXP (vals, 0, i))
+		  || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
+		{
+		  maxelement = i;
+		  break;
+		}
+
+	  rtx maxval = XVECEXP (vals, 0, maxelement);
+	  if (!(CONST_INT_P (maxval) || CONST_DOUBLE_P (maxval)))
+	    {
+	      rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	    }
+	  else
+	    aarch64_emit_move (target, gen_vec_duplicate (mode, maxval));
 	}
       else
 	{
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
new file mode 100644
index 00000000000..682fd43439a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
@@ -0,0 +1,66 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <arm_neon.h>
+
+/*
+** f_s8:
+**	...
+**	dup	v[0-9]+\.16b, w[0-9]+
+**	movi	v[0-9]+\.8b, 0x1
+**	ins	v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
+**	...
+**	ret
+*/
+
+int8x16_t f_s8(int8_t x)
+{
+  return (int8x16_t) { x, x, x, x, x, x, x, x,
+                       x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s16:
+**	...
+**	dup	v[0-9]+\.8h, w[0-9]+
+**	movi	v[0-9]+\.4h, 0x1
+**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
+**	...
+**	ret
+*/
+
+int16x8_t f_s16(int16_t x)
+{
+  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s32:
+**	...
+**	movi	v[0-9]\.2s, 0x1
+**	dup	v[0-9]\.4s, w[0-9]+
+**	ins	v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
+**	...
+**	ret
+*/
+
+int32x4_t f_s32(int32_t x)
+{
+  return (int32x4_t) { x, x, x, 1 };
+}
+
+/*
+** f_s64:
+**	...
+**	adrp	x[0-9]+, .LC[0-9]+
+**	ldr	q[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
+**	ins	v[0-9]+\.d\[0\], x[0-9]+
+**	...
+**	ret
+*/
+
+int64x2_t f_s64(int64_t x)
+{
+  return (int64x2_t) { x, 1 };
+}
  
Richard Sandiford May 2, 2023, 12:02 p.m. UTC | #7
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 2 May 2023 at 14:56, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> > [aarch64] Improve code-gen for vector initialization with single constant element.
>> >
>> > gcc/ChangeLog:
>> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
>> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>> >       and if maxv == 1, use the constant element for duplicating into a register.
>> >
>> > gcc/testsuite/ChangeLog:
>> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 2b0de7ca038..f46750133a6 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >       and matches[X][1] with the count of duplicate elements (if X is the
>> >       earliest element which has duplicates).  */
>> >
>> > -  if (n_var == n_elts && n_elts <= 16)
>> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>> >      {
>> >        int matches[16][2] = {0};
>> >        for (int i = 0; i < n_elts; i++)
>> > @@ -22227,6 +22227,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >            vector register.  For big-endian we want that position to hold
>> >            the last element of VALS.  */
>> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> > +
>> > +       /* If we have a single constant element, use that for duplicating
>> > +          instead.  */
>> > +       if (n_var == n_elts - 1)
>> > +         for (int i = 0; i < n_elts; i++)
>> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
>> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
>> > +             {
>> > +               maxelement = i;
>> > +               break;
>> > +             }
>> > +
>> >         rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> >         aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>>
>> We don't want to force the constant into a register though.
> OK right, sorry.
> With the attached patch, for the following test-case:
> int64x2_t f_s64(int64_t x)
> {
>   return (int64x2_t) { x, 1 };
> }
>
> it loads the constant from memory (same code-gen as without the patch).
> f_s64:
>         adrp    x1, .LC0
>         ldr     q0, [x1, #:lo12:.LC0]
>         ins     v0.d[0], x0
>         ret
>
> Does the patch look OK ?
>
> Thanks,
> Prathamesh
> [...]
> [aarch64] Improve code-gen for vector initialization with single constant element.
>
> gcc/ChangeLog:
> 	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> 	if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> 	and if maxv == 1, use the constant element for duplicating into a register.
>
> gcc/testsuite/ChangeLog:
> 	* gcc.target/aarch64/vec-init-single-const.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 2b0de7ca038..97309ddec4f 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>       and matches[X][1] with the count of duplicate elements (if X is the
>       earliest element which has duplicates).  */
>  
> -  if (n_var == n_elts && n_elts <= 16)
> +  if ((n_var >= n_elts - 1) && n_elts <= 16)

No need for the extra brackets.

>      {
>        int matches[16][2] = {0};
>        for (int i = 0; i < n_elts; i++)
> @@ -22227,8 +22227,26 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>  	     vector register.  For big-endian we want that position to hold
>  	     the last element of VALS.  */
>  	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> -	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> -	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +
> +	  /* If we have a single constant element, use that for duplicating
> +	     instead.  */
> +	  if (n_var == n_elts - 1)
> +	    for (int i = 0; i < n_elts; i++)
> +	      if (CONST_INT_P (XVECEXP (vals, 0, i))
> +		  || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> +		{
> +		  maxelement = i;
> +		  break;
> +		}
> +
> +	  rtx maxval = XVECEXP (vals, 0, maxelement);
> +	  if (!(CONST_INT_P (maxval) || CONST_DOUBLE_P (maxval)))
> +	    {
> +	      rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +	      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +	    }
> +	  else
> +	    aarch64_emit_move (target, gen_vec_duplicate (mode, maxval));
>  	}
>        else
>  	{

This seems a bit convoluted.  It might be easier to record whether
we see a CONST_INT_P or a CONST_DOUBLE_P during the previous loop,
and if so what the constant is.  Then handle that case first,
as a separate arm of the "if".
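
Something along these lines, say (sketch only, where const_elem is the
constant rtx recorded during that loop, or NULL_RTX if there was none):

	  if (const_elem)
	    /* Duplicate the single constant element directly.  */
	    aarch64_emit_move (target, gen_vec_duplicate (mode, const_elem));
	  else
	    {
	      rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
	      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
	    }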

> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> new file mode 100644
> index 00000000000..682fd43439a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> @@ -0,0 +1,66 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +**	...
> +**	dup	v[0-9]+\.16b, w[0-9]+
> +**	movi	v[0-9]+\.8b, 0x1
> +**	ins	v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
> +**	...
> +**	ret

Like with the divide-and-conquer patch, there's nothing that requires
the first two instructions to be in that order.

What is the second ... hiding?  What sequences do we actually generate?

BTW, remember to say how patches were tested :-)

Thanks,
Richard

> +*/
> +
> +int8x16_t f_s8(int8_t x)
> +{
> +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> +                       x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s16:
> +**	...
> +**	dup	v[0-9]+\.8h, w[0-9]+
> +**	movi	v[0-9]+\.4h, 0x1
> +**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
> +**	...
> +**	ret
> +*/
> +
> +int16x8_t f_s16(int16_t x)
> +{
> +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s32:
> +**	...
> +**	movi	v[0-9]\.2s, 0x1
> +**	dup	v[0-9]\.4s, w[0-9]+
> +**	ins	v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
> +**	...
> +**	ret
> +*/
> +
> +int32x4_t f_s32(int32_t x)
> +{
> +  return (int32x4_t) { x, x, x, 1 };
> +}
> +
> +/*
> +** f_s64:
> +**	...
> +**	adrp	x[0-9]+, .LC[0-9]+
> +**	ldr	q[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> +**	ins	v[0-9]+\.d\[0\], x[0-9]+
> +**	...
> +**	ret
> +*/
> +
> +int64x2_t f_s64(int64_t x)
> +{
> +  return (int64x2_t) { x, 1 };
> +}
  
Prathamesh Kulkarni May 2, 2023, 12:38 p.m. UTC | #8
On Tue, 2 May 2023 at 17:32, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Tue, 2 May 2023 at 14:56, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >> > [aarch64] Improve code-gen for vector initialization with single constant element.
> >> >
> >> > gcc/ChangeLog:
> >> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> >> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> >> >       and if maxv == 1, use the constant element for duplicating into a register.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
> >> >
> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> > index 2b0de7ca038..f46750133a6 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >> >       and matches[X][1] with the count of duplicate elements (if X is the
> >> >       earliest element which has duplicates).  */
> >> >
> >> > -  if (n_var == n_elts && n_elts <= 16)
> >> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
> >> >      {
> >> >        int matches[16][2] = {0};
> >> >        for (int i = 0; i < n_elts; i++)
> >> > @@ -22227,6 +22227,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >> >            vector register.  For big-endian we want that position to hold
> >> >            the last element of VALS.  */
> >> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> >> > +
> >> > +       /* If we have a single constant element, use that for duplicating
> >> > +          instead.  */
> >> > +       if (n_var == n_elts - 1)
> >> > +         for (int i = 0; i < n_elts; i++)
> >> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
> >> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> >> > +             {
> >> > +               maxelement = i;
> >> > +               break;
> >> > +             }
> >> > +
> >> >         rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> >> >         aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> >>
> >> We don't want to force the constant into a register though.
> > OK right, sorry.
> > With the attached patch, for the following test-case:
> > int64x2_t f_s64(int64_t x)
> > {
> >   return (int64x2_t) { x, 1 };
> > }
> >
> > it loads the constant from memory (same code-gen as without the patch).
> > f_s64:
> >         adrp    x1, .LC0
> >         ldr     q0, [x1, #:lo12:.LC0]
> >         ins     v0.d[0], x0
> >         ret
> >
> > Does the patch look OK ?
> >
> > Thanks,
> > Prathamesh
> > [...]
> > [aarch64] Improve code-gen for vector initialization with single constant element.
> >
> > gcc/ChangeLog:
> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> >       and if maxv == 1, use the constant element for duplicating into a register.
> >
> > gcc/testsuite/ChangeLog:
> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 2b0de7ca038..97309ddec4f 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >       and matches[X][1] with the count of duplicate elements (if X is the
> >       earliest element which has duplicates).  */
> >
> > -  if (n_var == n_elts && n_elts <= 16)
> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>
> No need for the extra brackets.
Adjusted, thanks. Sorry if this sounds like a silly question, but why
do we need the n_elts <= 16 check ?
Won't n_elts always be <= 16, since the max number of elements in a
vector would be 16 for V16QI ?
>
> >      {
> >        int matches[16][2] = {0};
> >        for (int i = 0; i < n_elts; i++)
> > @@ -22227,8 +22227,26 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >            vector register.  For big-endian we want that position to hold
> >            the last element of VALS.  */
> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> > -       rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > -       aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> > +
> > +       /* If we have a single constant element, use that for duplicating
> > +          instead.  */
> > +       if (n_var == n_elts - 1)
> > +         for (int i = 0; i < n_elts; i++)
> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> > +             {
> > +               maxelement = i;
> > +               break;
> > +             }
> > +
> > +       rtx maxval = XVECEXP (vals, 0, maxelement);
> > +       if (!(CONST_INT_P (maxval) || CONST_DOUBLE_P (maxval)))
> > +         {
> > +           rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > +           aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> > +         }
> > +       else
> > +         aarch64_emit_move (target, gen_vec_duplicate (mode, maxval));
> >       }
> >        else
> >       {
>
> This seems a bit convoluted.  It might be easier to record whether
> we see a CONST_INT_P or a CONST_DOUBLE_P during the previous loop,
> and if so what the constant is.  Then handle that case first,
> as a separate arm of the "if".
Adjusted in the attached patch. Does it look OK ?
>
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> > new file mode 100644
> > index 00000000000..682fd43439a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> > @@ -0,0 +1,66 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/*
> > +** f_s8:
> > +**   ...
> > +**   dup     v[0-9]+\.16b, w[0-9]+
> > +**   movi    v[0-9]+\.8b, 0x1
> > +**   ins     v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
> > +**   ...
> > +**   ret
>
> Like with the divide-and-conquer patch, there's nothing that requires
> the first two instructions to be in that order.
Hmm, will it be OK to disable scheduling by passing
-fno-schedule-insns -fno-schedule-insns2
for the test ?
>
> What is the second ... hiding?  What sequences do we actually generate?
Sorry, added them by mistake. They were the exact sequences. Adjusted
tests in the patch.
>
> BTW, remember to say how patches were tested :-)
Right, sorry. The patch is under bootstrap+test on aarch64-linux-gnu.
OK to commit if it passes ?

Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> > +*/
> > +
> > +int8x16_t f_s8(int8_t x)
> > +{
> > +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> > +                       x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s16:
> > +**   ...
> > +**   dup     v[0-9]+\.8h, w[0-9]+
> > +**   movi    v[0-9]+\.4h, 0x1
> > +**   ins     v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
> > +**   ...
> > +**   ret
> > +*/
> > +
> > +int16x8_t f_s16(int16_t x)
> > +{
> > +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s32:
> > +**   ...
> > +**   movi    v[0-9]\.2s, 0x1
> > +**   dup     v[0-9]\.4s, w[0-9]+
> > +**   ins     v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
> > +**   ...
> > +**   ret
> > +*/
> > +
> > +int32x4_t f_s32(int32_t x)
> > +{
> > +  return (int32x4_t) { x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s64:
> > +**   ...
> > +**   adrp    x[0-9]+, .LC[0-9]+
> > +**   ldr     q[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> > +**   ins     v[0-9]+\.d\[0\], x[0-9]+
> > +**   ...
> > +**   ret
> > +*/
> > +
> > +int64x2_t f_s64(int64_t x)
> > +{
> > +  return (int64x2_t) { x, 1 };
> > +}
[aarch64] Improve code-gen for vector initialization with single constant element.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
	if (n_var == n_elts && n_elts <= 16) to allow a single constant,
	and if maxv == 1, use the constant element for duplicating into a register.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/vec-init-single-const.c: New test.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2b0de7ca038..31319977ffd 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
      and matches[X][1] with the count of duplicate elements (if X is the
      earliest element which has duplicates).  */
 
-  if (n_var == n_elts && n_elts <= 16)
+  if (n_var >= n_elts - 1 && n_elts <= 16)
     {
       int matches[16][2] = {0};
       for (int i = 0; i < n_elts; i++)
@@ -22227,8 +22227,27 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	     vector register.  For big-endian we want that position to hold
 	     the last element of VALS.  */
 	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
-	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+
+	  /* If we have a single constant element, use that for duplicating
+	     instead.  */
+	  if (n_var == n_elts - 1)
+	    {
+	      for (int i = 0; i < n_elts; i++)
+		if (CONST_INT_P (XVECEXP (vals, 0, i))
+		    || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
+		  {
+		    maxelement = i;
+		    rtx const_elem = XVECEXP (vals, 0, maxelement);
+		    aarch64_emit_move (target,
+				       gen_vec_duplicate (mode, const_elem));
+		    break;
+		  }
+	    }
+	  else
+	    {
+	      rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	    }
 	}
       else
 	{
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
new file mode 100644
index 00000000000..790c90b48ce
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <arm_neon.h>
+
+/*
+** f_s8:
+**	dup	v[0-9]+\.16b, w[0-9]+
+**	movi	v[0-9]+\.8b, 0x1
+**	ins	v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
+**	ret
+*/
+
+int8x16_t f_s8(int8_t x)
+{
+  return (int8x16_t) { x, x, x, x, x, x, x, x,
+                       x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s16:
+**	dup	v[0-9]+\.8h, w[0-9]+
+**	movi	v[0-9]+\.4h, 0x1
+**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
+**	ret
+*/
+
+int16x8_t f_s16(int16_t x)
+{
+  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s32:
+**	dup	v[0-9]\.4s, w[0-9]+
+**	movi	v[0-9]\.2s, 0x1
+**	ins	v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
+**	ret
+*/
+
+int32x4_t f_s32(int32_t x)
+{
+  return (int32x4_t) { x, x, x, 1 };
+}
+
+/*
+** f_s64:
+**	adrp	x[0-9]+, .LC[0-9]+
+**	ldr	q[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
+**	ins	v[0-9]+\.d\[0\], x[0-9]+
+**	ret
+*/
+
+int64x2_t f_s64(int64_t x)
+{
+  return (int64x2_t) { x, 1 };
+}
  
Richard Sandiford May 2, 2023, 12:52 p.m. UTC | #9
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 2 May 2023 at 17:32, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Tue, 2 May 2023 at 14:56, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >> > [aarch64] Improve code-gen for vector initialization with single constant element.
>> >> >
>> >> > gcc/ChangeLog:
>> >> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
>> >> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>> >> >       and if maxv == 1, use the constant element for duplicating into a register.
>> >> >
>> >> > gcc/testsuite/ChangeLog:
>> >> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >> >
>> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> >> > index 2b0de7ca038..f46750133a6 100644
>> >> > --- a/gcc/config/aarch64/aarch64.cc
>> >> > +++ b/gcc/config/aarch64/aarch64.cc
>> >> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >> >       and matches[X][1] with the count of duplicate elements (if X is the
>> >> >       earliest element which has duplicates).  */
>> >> >
>> >> > -  if (n_var == n_elts && n_elts <= 16)
>> >> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>> >> >      {
>> >> >        int matches[16][2] = {0};
>> >> >        for (int i = 0; i < n_elts; i++)
>> >> > @@ -22227,6 +22227,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >> >            vector register.  For big-endian we want that position to hold
>> >> >            the last element of VALS.  */
>> >> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> >> > +
>> >> > +       /* If we have a single constant element, use that for duplicating
>> >> > +          instead.  */
>> >> > +       if (n_var == n_elts - 1)
>> >> > +         for (int i = 0; i < n_elts; i++)
>> >> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
>> >> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
>> >> > +             {
>> >> > +               maxelement = i;
>> >> > +               break;
>> >> > +             }
>> >> > +
>> >> >         rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> >> >         aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>> >>
>> >> We don't want to force the constant into a register though.
>> > OK right, sorry.
>> > With the attached patch, for the following test-case:
>> > int64x2_t f_s64(int64_t x)
>> > {
>> >   return (int64x2_t) { x, 1 };
>> > }
>> >
>> > it loads the constant from memory (same code-gen as without the patch).
>> > f_s64:
>> >         adrp    x1, .LC0
>> >         ldr     q0, [x1, #:lo12:.LC0]
>> >         ins     v0.d[0], x0
>> >         ret
>> >
>> > Does the patch look OK ?
>> >
>> > Thanks,
>> > Prathamesh
>> > [...]
>> > [aarch64] Improve code-gen for vector initialization with single constant element.
>> >
>> > gcc/ChangeLog:
>> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
>> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
>> >       and if maxv == 1, use the constant element for duplicating into a register.
>> >
>> > gcc/testsuite/ChangeLog:
>> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
>> >
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 2b0de7ca038..97309ddec4f 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >       and matches[X][1] with the count of duplicate elements (if X is the
>> >       earliest element which has duplicates).  */
>> >
>> > -  if (n_var == n_elts && n_elts <= 16)
>> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
>>
>> No need for the extra brackets.
> Adjusted, thanks. Sorry if this sounds like a silly question, but why
> do we need the n_elts <= 16 check ?
> Won't n_elts always be <= 16, since the max number of elements in a
> vector would be 16 for V16QI ?

Was wondering the same thing :)

Let's leave it though.

>> >      {
>> >        int matches[16][2] = {0};
>> >        for (int i = 0; i < n_elts; i++)
>> > @@ -22227,8 +22227,26 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >            vector register.  For big-endian we want that position to hold
>> >            the last element of VALS.  */
>> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> > -       rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> > -       aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>> > +
>> > +       /* If we have a single constant element, use that for duplicating
>> > +          instead.  */
>> > +       if (n_var == n_elts - 1)
>> > +         for (int i = 0; i < n_elts; i++)
>> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
>> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
>> > +             {
>> > +               maxelement = i;
>> > +               break;
>> > +             }
>> > +
>> > +       rtx maxval = XVECEXP (vals, 0, maxelement);
>> > +       if (!(CONST_INT_P (maxval) || CONST_DOUBLE_P (maxval)))
>> > +         {
>> > +           rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> > +           aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>> > +         }
>> > +       else
>> > +         aarch64_emit_move (target, gen_vec_duplicate (mode, maxval));
>> >       }
>> >        else
>> >       {
>>
>> This seems a bit convoluted.  It might be easier to record whether
>> we see a CONST_INT_P or a CONST_DOUBLE_P during the previous loop,
>> and if so what the constant is.  Then handle that case first,
>> as a separate arm of the "if".
> Adjusted in the attached patch. Does it look OK ?

I meant: adjust

      int maxelement = 0;
      int maxv = 0;
      for (int i = 0; i < n_elts; i++)
	if (matches[i][1] > maxv)
	  {
	    maxelement = i;
	    maxv = matches[i][1];
	  }

so that it also records any CONST_INT or CONST_DOUBLE (as an rtx).
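
i.e. roughly (sketch only):

      int maxelement = 0;
      int maxv = 0;
      rtx const_elem = NULL_RTX;
      for (int i = 0; i < n_elts; i++)
	{
	  rtx x = XVECEXP (vals, 0, i);
	  if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
	    const_elem = x;
	  if (matches[i][1] > maxv)
	    {
	      maxelement = i;
	      maxv = matches[i][1];
	    }
	}

so that the single-constant arm can then use const_elem directly
instead of rescanning vals.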

>> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
>> > new file mode 100644
>> > index 00000000000..682fd43439a
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
>> > @@ -0,0 +1,66 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-options "-O2" } */
>> > +/* { dg-final { check-function-bodies "**" "" "" } } */
>> > +
>> > +#include <arm_neon.h>
>> > +
>> > +/*
>> > +** f_s8:
>> > +**   ...
>> > +**   dup     v[0-9]+\.16b, w[0-9]+
>> > +**   movi    v[0-9]+\.8b, 0x1
>> > +**   ins     v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
>> > +**   ...
>> > +**   ret
>>
>> Like with the divide-and-conquer patch, there's nothing that requires
>> the first two instructions to be in that order.
> Hmm, will it be OK to disable scheduling by passing
> -fno-schedule-insns -fno-schedule-insns2
> for the test ?

Guess we might as well try that for now.

Elsewhere I've used:

  (
     first sequence
  |
     second sequence
  )
     common part

but we probably have enough control over the unscheduled sequence
for that not to be necessary here.

>> What is the second ... hiding?  What sequences do we actually generate?
> Sorry, added them by mistake. They were the exact sequences. Adjusted
> tests in the patch.
>>
>> BTW, remember to say how patches were tested :-)
> Right, sorry. The patch is under bootstrap+test on aarch64-linux-gnu.

Please also test the new tests on big-endian.

> +/*
> +** f_s8:
> +**	dup	v[0-9]+\.16b, w[0-9]+

Without the ...s, this must be v0 and w0 respectively

> +**	movi	v[0-9]+\.8b, 0x1

Would be good to capture the register number here and use \1 in the
following line.

> +**	ins	v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]

Similarly v0 for the first operand here.
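
Putting those together, the f_s8 body might end up looking something
like this (sketch):

/*
** f_s8:
**	dup	v0\.16b, w0
**	movi	v([0-9]+)\.8b, 0x1
**	ins	v0\.b\[15\], v\1\.b\[0\]
**	ret
*/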

Thanks,
Richard

> +**	ret
> +*/
> +
> +int8x16_t f_s8(int8_t x)
> +{
> +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> +                       x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s16:
> +**	dup	v[0-9]+\.8h, w[0-9]+
> +**	movi	v[0-9]+\.4h, 0x1
> +**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
> +**	ret
> +*/
> +
> +int16x8_t f_s16(int16_t x)
> +{
> +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s32:
> +**	dup	v[0-9]\.4s, w[0-9]+
> +**	movi	v[0-9]\.2s, 0x1
> +**	ins	v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
> +**	ret
> +*/
> +
> +int32x4_t f_s32(int32_t x)
> +{
> +  return (int32x4_t) { x, x, x, 1 };
> +}
> +
> +/*
> +** f_s64:
> +**	adrp	x[0-9]+, .LC[0-9]+
> +**	ldr	q[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> +**	ins	v[0-9]+\.d\[0\], x[0-9]+
> +**	ret
> +*/
> +
> +int64x2_t f_s64(int64_t x)
> +{
> +  return (int64x2_t) { x, 1 };
> +}
  
Prathamesh Kulkarni May 3, 2023, 11:28 a.m. UTC | #10
On Tue, 2 May 2023 at 18:22, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Tue, 2 May 2023 at 17:32, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > On Tue, 2 May 2023 at 14:56, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >> > [aarch64] Improve code-gen for vector initialization with single constant element.
> >> >> >
> >> >> > gcc/ChangeLog:
> >> >> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> >> >> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> >> >> >       and if maxv == 1, use constant element for duplicating into register.
> >> >> >
> >> >> > gcc/testsuite/ChangeLog:
> >> >> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
> >> >> >
> >> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> >> > index 2b0de7ca038..f46750133a6 100644
> >> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> >> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >> >> >       and matches[X][1] with the count of duplicate elements (if X is the
> >> >> >       earliest element which has duplicates).  */
> >> >> >
> >> >> > -  if (n_var == n_elts && n_elts <= 16)
> >> >> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
> >> >> >      {
> >> >> >        int matches[16][2] = {0};
> >> >> >        for (int i = 0; i < n_elts; i++)
> >> >> > @@ -22227,6 +22227,18 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >> >> >            vector register.  For big-endian we want that position to hold
> >> >> >            the last element of VALS.  */
> >> >> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> >> >> > +
> >> >> > +       /* If we have a single constant element, use that for duplicating
> >> >> > +          instead.  */
> >> >> > +       if (n_var == n_elts - 1)
> >> >> > +         for (int i = 0; i < n_elts; i++)
> >> >> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
> >> >> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> >> >> > +             {
> >> >> > +               maxelement = i;
> >> >> > +               break;
> >> >> > +             }
> >> >> > +
> >> >> >         rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> >> >> >         aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> >> >>
> >> >> We don't want to force the constant into a register though.
> >> > OK right, sorry.
> >> > With the attached patch, for the following test-case:
> >> > int64x2_t f_s64(int64_t x)
> >> > {
> >> >   return (int64x2_t) { x, 1 };
> >> > }
> >> >
> >> > it loads constant from memory (same code-gen as without patch).
> >> > f_s64:
> >> >         adrp    x1, .LC0
> >> >         ldr     q0, [x1, #:lo12:.LC0]
> >> >         ins     v0.d[0], x0
> >> >         ret
> >> >
> >> > Does the patch look OK ?
> >> >
> >> > Thanks,
> >> > Prathamesh
> >> > [...]
> >> > [aarch64] Improve code-gen for vector initialization with single constant element.
> >> >
> >> > gcc/ChangeLog:
> >> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> >> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> >> >       and if maxv == 1, use constant element for duplicating into register.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
> >> >
> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> > index 2b0de7ca038..97309ddec4f 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >> >       and matches[X][1] with the count of duplicate elements (if X is the
> >> >       earliest element which has duplicates).  */
> >> >
> >> > -  if (n_var == n_elts && n_elts <= 16)
> >> > +  if ((n_var >= n_elts - 1) && n_elts <= 16)
> >>
> >> No need for the extra brackets.
> > Adjusted, thanks. Sorry if this sounds like a silly question, but why
> > do we need the n_elts <= 16 check ?
> > Won't n_elts be always <= 16 since max number of elements in a vector
> > would be 16 for V16QI ?
>
> Was wondering the same thing :)
>
> Let's leave it though.
>
> >> >      {
> >> >        int matches[16][2] = {0};
> >> >        for (int i = 0; i < n_elts; i++)
> >> > @@ -22227,8 +22227,26 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >> >            vector register.  For big-endian we want that position to hold
> >> >            the last element of VALS.  */
> >> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> >> > -       rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> >> > -       aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> >> > +
> >> > +       /* If we have a single constant element, use that for duplicating
> >> > +          instead.  */
> >> > +       if (n_var == n_elts - 1)
> >> > +         for (int i = 0; i < n_elts; i++)
> >> > +           if (CONST_INT_P (XVECEXP (vals, 0, i))
> >> > +               || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> >> > +             {
> >> > +               maxelement = i;
> >> > +               break;
> >> > +             }
> >> > +
> >> > +       rtx maxval = XVECEXP (vals, 0, maxelement);
> >> > +       if (!(CONST_INT_P (maxval) || CONST_DOUBLE_P (maxval)))
> >> > +         {
> >> > +           rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> >> > +           aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> >> > +         }
> >> > +       else
> >> > +         aarch64_emit_move (target, gen_vec_duplicate (mode, maxval));
> >> >       }
> >> >        else
> >> >       {
> >>
> >> This seems a bit convoluted.  It might be easier to record whether
> >> we see a CONST_INT_P or a CONST_DOUBLE_P during the previous loop,
> >> and if so what the constant is.  Then handle that case first,
> >> as a separate arm of the "if".
> > Adjusted in the attached patch. Does it look OK ?
>
> I meant: adjust
>
>       int maxelement = 0;
>       int maxv = 0;
>       for (int i = 0; i < n_elts; i++)
>         if (matches[i][1] > maxv)
>           {
>             maxelement = i;
>             maxv = matches[i][1];
>           }
>
> so that it also records any CONST_INT or CONST_DOUBLE (as an rtx).
Oh right. Adjusted in the attached patch, but I also added const_elem_pos
to keep track of the constant's position, and set maxelement to it, since
maxelement is later used to skip the duplicated element here (e.g. for
{ x, y, z, 1 }, lane 3 holds the dup'ed constant and must not be
re-inserted):

    /* Insert the rest.  */
      for (int i = 0; i < n_elts; i++)
        {
          rtx x = XVECEXP (vals, 0, i);
          if (matches[i][0] == maxelement)
            continue;
          x = force_reg (inner_mode, x);
          emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
        }
      return;

Does that look OK ?
>
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> >> > new file mode 100644
> >> > index 00000000000..682fd43439a
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> >> > @@ -0,0 +1,66 @@
> >> > +/* { dg-do compile } */
> >> > +/* { dg-options "-O2" } */
> >> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> >> > +
> >> > +#include <arm_neon.h>
> >> > +
> >> > +/*
> >> > +** f_s8:
> >> > +**   ...
> >> > +**   dup     v[0-9]+\.16b, w[0-9]+
> >> > +**   movi    v[0-9]+\.8b, 0x1
> >> > +**   ins     v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
> >> > +**   ...
> >> > +**   ret
> >>
> >> Like with the divide-and-conquer patch, there's nothing that requires
> >> the first two instructions to be in that order.
> > Hmm, will it be OK to disable scheduling by passing
> > -fno-schedule-insns -fno-schedule-insns2
> > for the test ?
>
> Guess we might as well try that for now.
>
> Elsewhere I've used:
>
>   (
>      first sequence
>   |
>      second sequence
>   )
>      common part
>
> but we probably have enough control over the unscheduled sequence
> for that not to be necessary here.
>
> >> What is the second ... hiding?  What sequences do we actually generate?
> > Sorry, added them by mistake. They were the exact sequences. Adjusted
> > tests in the patch.
> >>
> >> BTW, remember to say how patches were tested :-)
> > Right, sorry. The patch is under bootstrap+test on aarch64-linux-gnu.
>
> Please also test the new tests on big-endian.
Done, thanks.
>
> > +/*
> > +** f_s8:
> > +**   dup     v[0-9]+\.16b, w[0-9]+
>
> Without the ...s, this must be v0 and w0 respectively
>
> > +**   movi    v[0-9]+\.8b, 0x1
>
> Would be good to capture the register number here and use \1 in the
> following line.
>
> > +**   ins     v[0-9]+\.b\[15\], v[0-9]+\.b\[0\]
>
> Similarly v0 for the first operand here.
Done, thanks.
I verified the big-endian test passes on aarch64_be-linux-gnu, and
patch is under bootstrap+test on aarch64-linux-gnu.
OK to commit if passes ?

Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> > +**   ret
> > +*/
> > +
> > +int8x16_t f_s8(int8_t x)
> > +{
> > +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> > +                       x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s16:
> > +**   dup     v[0-9]+\.8h, w[0-9]+
> > +**   movi    v[0-9]+\.4h, 0x1
> > +**   ins     v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
> > +**   ret
> > +*/
> > +
> > +int16x8_t f_s16(int16_t x)
> > +{
> > +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s32:
> > +**   dup     v[0-9]\.4s, w[0-9]+
> > +**   movi    v[0-9]\.2s, 0x1
> > +**   ins     v[0-9]+\.s\[3\], v[0-9]+\.s\[0\]
> > +**   ret
> > +*/
> > +
> > +int32x4_t f_s32(int32_t x)
> > +{
> > +  return (int32x4_t) { x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s64:
> > +**   adrp    x[0-9]+, .LC[0-9]+
> > +**   ldr     q[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> > +**   ins     v[0-9]+\.d\[0\], x[0-9]+
> > +**   ret
> > +*/
> > +
> > +int64x2_t f_s64(int64_t x)
> > +{
> > +  return (int64x2_t) { x, 1 };
> > +}
[aarch64] Improve code-gen for vector initialization with single constant element.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
	if (n_var == n_elts && n_elts <= 16) to also allow a single constant,
	and if maxv == 1, use the constant element for duplicating into a register.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/vec-init-single-const.c: New test.
	* gcc.target/aarch64/vec-init-single-const-be.c: Likewise.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 2b0de7ca038..1ae8cf530e9 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22167,7 +22167,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
      and matches[X][1] with the count of duplicate elements (if X is the
      earliest element which has duplicates).  */
 
-  if (n_var == n_elts && n_elts <= 16)
+  if (n_var >= n_elts - 1 && n_elts <= 16)
     {
       int matches[16][2] = {0};
       for (int i = 0; i < n_elts; i++)
@@ -22184,12 +22184,23 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	}
       int maxelement = 0;
       int maxv = 0;
+      rtx const_elem = NULL_RTX;
+      int const_elem_pos = 0;
+
       for (int i = 0; i < n_elts; i++)
-	if (matches[i][1] > maxv)
-	  {
-	    maxelement = i;
-	    maxv = matches[i][1];
-	  }
+	{
+	  if (matches[i][1] > maxv)
+	    {
+	      maxelement = i;
+	      maxv = matches[i][1];
+	    }
+	  if (CONST_INT_P (XVECEXP (vals, 0, i))
+	      || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
+	    {
+	      const_elem_pos = i; 
+	      const_elem = XVECEXP (vals, 0, i);
+	    }
+	}
 
       /* Create a duplicate of the most common element, unless all elements
 	 are equally useless to us, in which case just immediately set the
@@ -22227,8 +22238,19 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	     vector register.  For big-endian we want that position to hold
 	     the last element of VALS.  */
 	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
-	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+
+	  /* If we have a single constant element, use that for duplicating
+	     instead.  */
+	  if (const_elem)
+	    {
+	      maxelement = const_elem_pos;
+	      aarch64_emit_move (target, gen_vec_duplicate (mode, const_elem));
+	    }
+	  else
+	    {
+	      rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	    }
 	}
       else
 	{
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c
new file mode 100644
index 00000000000..f84befa4c11
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { be } } } } */
+
+#include <arm_neon.h>
+
+/*
+** f_s8:
+**	dup	v0.16b, w0
+**	movi	(v[0-9]+)\.8b, 0x1
+**	ins	v0.b\[0\], \1\.b\[0\]
+**	ret
+*/
+
+int8x16_t f_s8(int8_t x)
+{
+  return (int8x16_t) { x, x, x, x, x, x, x, x,
+                       x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s16:
+**	dup	v0.8h, w0
+**	movi	(v[0-9]+)\.4h, 0x1
+**	ins	v0.h\[0\], \1\.h\[0\]
+**	ret
+*/
+
+int16x8_t f_s16(int16_t x)
+{
+  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s32:
+**	dup	v0.4s, w0
+**	movi	(v[0-9])\.2s, 0x1
+**	ins	v0.s\[0\], \1\.s\[0\]
+**	ret
+*/
+
+int32x4_t f_s32(int32_t x)
+{
+  return (int32x4_t) { x, x, x, 1 };
+}
+
+/*
+** f_s64:
+**	adrp	x[0-9]+, .LC[0-9]+
+**	ldr	q0, \[x[0-9]+, #:lo12:.LC[0-9]+\]
+**	ins	v0\.d\[1\], x0
+**	ret
+*/
+
+int64x2_t f_s64(int64_t x)
+{
+  return (int64x2_t) { x, 1 };
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
new file mode 100644
index 00000000000..f736bfc3b68
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+#include <arm_neon.h>
+
+/*
+** f_s8:
+**	dup	v0.16b, w0
+**	movi	(v[0-9]+)\.8b, 0x1
+**	ins	v0.b\[15\], \1\.b\[0\]
+**	ret
+*/
+
+int8x16_t f_s8(int8_t x)
+{
+  return (int8x16_t) { x, x, x, x, x, x, x, x,
+                       x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s16:
+**	dup	v0.8h, w0
+**	movi	(v[0-9]+)\.4h, 0x1
+**	ins	v0.h\[7\], \1\.h\[0\]
+**	ret
+*/
+
+int16x8_t f_s16(int16_t x)
+{
+  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s32:
+**	dup	v0.4s, w0
+**	movi	(v[0-9])\.2s, 0x1
+**	ins	v0.s\[3\], \1\.s\[0\]
+**	ret
+*/
+
+int32x4_t f_s32(int32_t x)
+{
+  return (int32x4_t) { x, x, x, 1 };
+}
+
+/*
+** f_s64:
+**	adrp	x[0-9]+, .LC[0-9]+
+**	ldr	q0, \[x[0-9]+, #:lo12:.LC[0-9]+\]
+**	ins	v0\.d\[0\], x0
+**	ret
+*/
+
+int64x2_t f_s64(int64_t x)
+{
+  return (int64x2_t) { x, 1 };
+}
  
Richard Sandiford May 11, 2023, 7:15 p.m. UTC | #11
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:

> [...]
>>
>> I meant: adjust
>>
>>       int maxelement = 0;
>>       int maxv = 0;
>>       for (int i = 0; i < n_elts; i++)
>>         if (matches[i][1] > maxv)
>>           {
>>             maxelement = i;
>>             maxv = matches[i][1];
>>           }
>>
>> so that it also records any CONST_INT or CONST_DOUBLE (as an rtx).
> Oh right. Adjusted in the attached patch, but I also added const_elem_pos
> to keep track of the constant's position, and set maxelement to it, since
> maxelement is later used to skip the duplicated element here (e.g. for
> { x, y, z, 1 }, lane 3 holds the dup'ed constant and must not be
> re-inserted):
>
>     /* Insert the rest.  */
>       for (int i = 0; i < n_elts; i++)
>         {
>           rtx x = XVECEXP (vals, 0, i);
>           if (matches[i][0] == maxelement)
>             continue;
>           x = force_reg (inner_mode, x);
>           emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
>         }
>       return;
>
> Does that look OK ?

Yeah, looks good.

> [...]
> Done, thanks.
> I verified the big-endian test passes on aarch64_be-linux-gnu, and
> patch is under bootstrap+test on aarch64-linux-gnu.
> OK to commit if passes ?

OK, thanks.

Richard

> [...]
  
Prathamesh Kulkarni May 15, 2023, 2:09 p.m. UTC | #12
On Fri, 12 May 2023 at 00:45, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>
> > [...]
> > Done, thanks.
> > I verified the big-endian test passes on aarch64_be-linux-gnu, and
> > patch is under bootstrap+test on aarch64-linux-gnu.
> > OK to commit if passes ?
>
> OK, thanks.
Hi Richard,
After committing the interleave+zip1 patch for vector initialization,
it seems to regress the s32 case from this patch:

int32x4_t f_s32(int32_t x)
{
  return (int32x4_t) { x, x, x, 1 };
}

code-gen:
f_s32:
        movi    v30.2s, 0x1
        fmov    s31, w0
        dup     v0.2s, v31.s[0]
        ins     v30.s[0], v31.s[0]
        zip1    v0.4s, v0.4s, v30.4s
        ret

instead of expected code-gen:
f_s32:
        movi    v31.2s, 0x1
        dup     v0.4s, w0
        ins     v0.s[3], v31.s[0]
        ret

Cost for fallback sequence: 16
Cost for interleave and zip sequence: 12

For the above case, the cost for the interleave+zip1 sequence is computed as:
halves[0]:
(set (reg:V2SI 96)
    (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
cost = 8

halves[1]:
(set (reg:V2SI 97)
    (const_vector:V2SI [
            (const_int 1 [0x1]) repeated x2
        ]))
(set (reg:V2SI 97)
    (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
        (reg:V2SI 97)
        (const_int 1 [0x1])))
cost = 8

followed by:
(set (reg:V4SI 95)
    (unspec:V4SI [
            (subreg:V4SI (reg:V2SI 96) 0)
            (subreg:V4SI (reg:V2SI 97) 0)
        ] UNSPEC_ZIP1))
cost = 4

So the total cost becomes
max(costs[0], costs[1]) + zip1_insn_cost
= max(8, 8) + 4
= 12

While the fallback rtl sequence is:
(set (reg:V4SI 95)
    (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
cost = 8
(set (reg:SI 98)
    (const_int 1 [0x1]))
cost = 4
(set (reg:V4SI 95)
    (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
        (reg:V4SI 95)
        (const_int 8 [0x8])))
cost = 4

So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
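
For reference, the two numbers come from computations of roughly this
shape (seq_cost and insn_cost are the usual RTL costing helpers; the
variable names here are assumptions, loosely following the
interleave+zip1 patch):

  /* Fallback: a straight-line sequence, costed serially.  */
  unsigned fallback_cost = seq_cost (fallback_seq, /*speed=*/true); /* 8+4+4 = 16 */

  /* Interleave+zip1: only the more expensive half is counted, on the
     assumption that the two halves execute in parallel.  */
  unsigned zip_cost = MAX (costs[0], costs[1])                 /* max (8, 8) */
		      + insn_cost (zip1_insn, /*speed=*/true); /* + 4 = 12 */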

I think the issue is probably that for the interleave+zip1 sequence we take
max(costs[0], costs[1]) to reflect that the two halves can be built in
parallel, but for the fallback sequence we use seq_cost, which assumes
serial execution of the insns in the sequence.
For the above fallback sequence,
(set (reg:V4SI 95)
    (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
and
(set (reg:SI 98)
    (const_int 1 [0x1]))
could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.

I was wondering: should we make the cost for the interleave+zip1 sequence
more conservative by summing up costs[0] + costs[1] instead of taking the
max, even for speed?
For this case,
that would be 8 + 8 + 4 = 20.
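
Concretely, the proposed change amounts to a one-line difference (a
sketch, with the same assumed names as above):

  /* Cost the two halves serially even for speed, to be conservative
     about how much parallelism the core can actually extract.  */
  unsigned zip_cost = costs[0] + costs[1]                      /* 8 + 8, was MAX */
		      + insn_cost (zip1_insn, /*speed=*/true); /* + 4 = 20 */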

With that change, it generates the fallback sequence for the other cases
(s8, s16, s64) from the test-case.

Thanks,
Prathamesh
>
> Richard
>
> > [...]
  
Richard Sandiford May 15, 2023, 6:59 p.m. UTC | #13
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> Hi Richard,
> [...]

Agreed.

A good-enough substitute for this might be to ignore scalar moves
(for both alternatives) when costing for speed.
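
Concretely, that could look something like the sketch below; the
predicate is an assumption about what ought to count as a scalar move,
not a finished definition:

  /* Return true if INSN just sets a scalar register from a constant
     or from another scalar register.  */
  static bool
  scalar_move_insn_p (rtx_insn *insn)
  {
    rtx set = single_set (insn);
    if (!set
	|| !REG_P (SET_DEST (set))
	|| VECTOR_MODE_P (GET_MODE (SET_DEST (set))))
      return false;
    rtx src = SET_SRC (set);
    return REG_P (src) || CONST_INT_P (src) || CONST_DOUBLE_P (src);
  }

  /* Like seq_cost, but ignore scalar moves, which can usually issue
     in parallel with the surrounding vector instructions.  */
  static unsigned
  seq_cost_ignoring_scalar_moves (rtx_insn *seq, bool speed)
  {
    unsigned cost = 0;
    for (rtx_insn *insn = seq; insn; insn = NEXT_INSN (insn))
      if (NONDEBUG_INSN_P (insn) && !scalar_move_insn_p (insn))
	cost += insn_cost (insn, speed);
    return cost;
  }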

> I was wondering: should we make the cost for the interleave+zip1 sequence
> more conservative by summing up costs[0] + costs[1] instead of taking the
> max, even for speed?
> For this case,
> that would be 8 + 8 + 4 = 20.
>
> With that change, it generates the fallback sequence for the other cases
> (s8, s16, s64) from the test-case.

What does it do for the tests in the interleave+zip1 patch?  If it doesn't
make a difference there then it sounds like we don't have enough tests. :)

Summing is only conservative if the fallback sequence is somehow "safer".
But I don't think it is.  Building an N-element vector from N scalars
can be done using N instructions in the fallback case and N+1 instructions
in the interleave+zip1 case.  But the interleave+zip1 case is still
better (speedwise) for N==16.

Thanks,
Richard
  
Prathamesh Kulkarni May 17, 2023, 3:23 p.m. UTC | #14
On Tue, 16 May 2023 at 00:29, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > Hi Richard,
> > After committing the interleave+zip1 patch for vector initialization,
> > it seems to regress the s32 case for this patch:
> >
> > int32x4_t f_s32(int32_t x)
> > {
> >   return (int32x4_t) { x, x, x, 1 };
> > }
> >
> > code-gen:
> > f_s32:
> >         movi    v30.2s, 0x1
> >         fmov    s31, w0
> >         dup     v0.2s, v31.s[0]
> >         ins     v30.s[0], v31.s[0]
> >         zip1    v0.4s, v0.4s, v30.4s
> >         ret
> >
> > instead of expected code-gen:
> > f_s32:
> >         movi    v31.2s, 0x1
> >         dup     v0.4s, w0
> >         ins     v0.s[3], v31.s[0]
> >         ret
> >
> > Cost for fallback sequence: 16
> > Cost for interleave and zip sequence: 12
> >
> > For the above case, the cost for interleave+zip1 sequence is computed as:
> > halves[0]:
> > (set (reg:V2SI 96)
> >     (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
> > cost = 8
> >
> > halves[1]:
> > (set (reg:V2SI 97)
> >     (const_vector:V2SI [
> >             (const_int 1 [0x1]) repeated x2
> >         ]))
> > (set (reg:V2SI 97)
> >     (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
> >         (reg:V2SI 97)
> >         (const_int 1 [0x1])))
> > cost = 8
> >
> > followed by:
> > (set (reg:V4SI 95)
> >     (unspec:V4SI [
> >             (subreg:V4SI (reg:V2SI 96) 0)
> >             (subreg:V4SI (reg:V2SI 97) 0)
> >         ] UNSPEC_ZIP1))
> > cost = 4
> >
> > So the total cost becomes
> > max(costs[0], costs[1]) + zip1_insn_cost
> > = max(8, 8) + 4
> > = 12
> >
> > While the fallback rtl sequence is:
> > (set (reg:V4SI 95)
> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> > cost = 8
> > (set (reg:SI 98)
> >     (const_int 1 [0x1]))
> > cost = 4
> > (set (reg:V4SI 95)
> >     (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
> >         (reg:V4SI 95)
> >         (const_int 8 [0x8])))
> > cost = 4
> >
> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
> >
> > I think the issue is probably that for the interleave+zip1 sequence we take
> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
> > but for the fallback seq we use seq_cost, which assumes serial execution
> > of insns in the sequence.
> > For above fallback sequence,
> > set (reg:V4SI 95)
> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> > and
> > (set (reg:SI 98)
> >     (const_int 1 [0x1]))
> > could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.
>
> Agreed.
>
> A good-enough substitute for this might be to ignore scalar moves
> (for both alternatives) when costing for speed.
Thanks for the suggestions. Just wondering: for aarch64, is there an easy
way to check whether an insn is a scalar move, similar to riscv's
scalar_move_insn_p, which checks whether get_attr_type (insn) is
TYPE_VIMOVXV or TYPE_VFMOVFV?
>
> > I was wondering if we should we make cost for interleave+zip1 sequence
> > more conservative
> > by not taking max, but summing up costs[0] + costs[1] even for speed ?
> > For this case,
> > that would be 8 + 8 + 4 = 20.
> >
> > It generates the fallback sequence for other cases (s8, s16, s64) from
> > the test-case.
>
> What does it do for the tests in the interleave+zip1 patch?  If it doesn't
> make a difference there then it sounds like we don't have enough tests. :)
Oh right, the tests in the interleave+zip1 patch only check the s16 case,
sorry about that :/
Looking briefly at the code generated for the s8, s32 and s64 cases:
(a) s8 and s16 seem to use the same sequence for all cases.
(b) s64 seems to use the fallback sequence.
(c) For vec-init-21.c, the s8 and s16 cases prefer the fallback sequence
because the costs are tied, while the s32 case prefers interleave+zip1:

int32x4_t f_s32(int32_t x, int32_t y)
{
  return (int32x4_t) { x, y, 1, 2 };
}

Code-gen with interleave+zip1 sequence:
f_s32:
        movi    v31.2s, 0x1
        movi    v0.2s, 0x2
        ins     v31.s[0], w0
        ins     v0.s[0], w1
        zip1    v0.4s, v31.4s, v0.4s
        ret

Code-gen with fallback sequence:
f_s32:
        adrp    x2, .LC0
        ldr     q0, [x2, #:lo12:.LC0]
        ins     v0.s[0], w0
        ins     v0.s[1], w1
        ret

Fallback sequence cost = 20
interleave+zip1 sequence cost = 12
I assume the interleave+zip1 sequence is better in this case (and it is the one currently chosen)?

I will send a follow-up patch soon to add tests for the s8, s16 and s64 cases.
>
> Summing is only conservative if the fallback sequence is somehow "safer".
> But I don't think it is.   Building an N-element vector from N scalars
> can be done using N instructions in the fallback case and N+1 instructions
> in the interleave+zip1 case.  But the interleave+zip1 case is still
> better (speedwise) for N==16.
Ack, thanks.
Should we also prefer interleave+zip1 when the costs are tied?
For example, for the following case:
int32x4_t f_s32(int32_t x)
{
  return (int32x4_t) { x, 1, x, 1 };
}

The costs for both the fallback and interleave+zip1 sequences are 12, and
we currently choose the fallback sequence.
Code-gen:
f_s32:
        movi    v0.4s, 0x1
        fmov    s31, w0
        ins     v0.s[0], v31.s[0]
        ins     v0.s[2], v31.s[0]
        ret

while, if we choose interleave+zip1, code-gen is:
f_s32:
        dup     v31.2s, w0
        movi    v0.2s, 0x1
        zip1    v0.4s, v31.4s, v0.4s
        ret

I suppose the interleave+zip1 sequence is better in this case?
And more generally, if the costs are tied, would it be OK to prefer the
interleave+zip1 sequence, since it executes the two halves in parallel,
which may not always be the case with the fallback sequence?

Also, would it be OK to commit the above patch, which addresses the
issue with the single-constant case, and xfail the s32 case for now?

Thanks,
Prathamesh
>
> Thanks,
> Richard
  
Richard Sandiford May 18, 2023, 8:07 a.m. UTC | #15
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 16 May 2023 at 00:29, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > Hi Richard,
>> > After committing the interleave+zip1 patch for vector initialization,
>> > it seems to regress the s32 case for this patch:
>> >
>> > int32x4_t f_s32(int32_t x)
>> > {
>> >   return (int32x4_t) { x, x, x, 1 };
>> > }
>> >
>> > code-gen:
>> > f_s32:
>> >         movi    v30.2s, 0x1
>> >         fmov    s31, w0
>> >         dup     v0.2s, v31.s[0]
>> >         ins     v30.s[0], v31.s[0]
>> >         zip1    v0.4s, v0.4s, v30.4s
>> >         ret
>> >
>> > instead of expected code-gen:
>> > f_s32:
>> >         movi    v31.2s, 0x1
>> >         dup     v0.4s, w0
>> >         ins     v0.s[3], v31.s[0]
>> >         ret
>> >
>> > Cost for fallback sequence: 16
>> > Cost for interleave and zip sequence: 12
>> >
>> > For the above case, the cost for interleave+zip1 sequence is computed as:
>> > halves[0]:
>> > (set (reg:V2SI 96)
>> >     (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
>> > cost = 8
>> >
>> > halves[1]:
>> > (set (reg:V2SI 97)
>> >     (const_vector:V2SI [
>> >             (const_int 1 [0x1]) repeated x2
>> >         ]))
>> > (set (reg:V2SI 97)
>> >     (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
>> >         (reg:V2SI 97)
>> >         (const_int 1 [0x1])))
>> > cost = 8
>> >
>> > followed by:
>> > (set (reg:V4SI 95)
>> >     (unspec:V4SI [
>> >             (subreg:V4SI (reg:V2SI 96) 0)
>> >             (subreg:V4SI (reg:V2SI 97) 0)
>> >         ] UNSPEC_ZIP1))
>> > cost = 4
>> >
>> > So the total cost becomes
>> > max(costs[0], costs[1]) + zip1_insn_cost
>> > = max(8, 8) + 4
>> > = 12
>> >
>> > While the fallback rtl sequence is:
>> > (set (reg:V4SI 95)
>> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> > cost = 8
>> > (set (reg:SI 98)
>> >     (const_int 1 [0x1]))
>> > cost = 4
>> > (set (reg:V4SI 95)
>> >     (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
>> >         (reg:V4SI 95)
>> >         (const_int 8 [0x8])))
>> > cost = 4
>> >
>> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
>> >
>> > I think the issue is probably that for the interleave+zip1 sequence we take
>> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
>> > but for the fallback seq we use seq_cost, which assumes serial execution
>> > of insns in the sequence.
>> > For above fallback sequence,
>> > set (reg:V4SI 95)
>> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> > and
>> > (set (reg:SI 98)
>> >     (const_int 1 [0x1]))
>> > could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.
>>
>> Agreed.
>>
>> A good-enough substitute for this might be to ignore scalar moves
>> (for both alternatives) when costing for speed.
> Thanks for the suggestions. Just wondering for aarch64, if there's an easy
> way we can check if insn is a scalar move, similar to riscv's scalar_move_insn_p
> that checks if get_attr_type(insn) is TYPE_VIMOVXV or TYPE_VFMOVFV ?

It should be enough to check that the pattern is a SET:

(a) whose SET_DEST has a scalar mode and
(b) whose SET_SRC is an aarch64_mov_operand

>> > I was wondering if we should we make cost for interleave+zip1 sequence
>> > more conservative
>> > by not taking max, but summing up costs[0] + costs[1] even for speed ?
>> > For this case,
>> > that would be 8 + 8 + 4 = 20.
>> >
>> > It generates the fallback sequence for other cases (s8, s16, s64) from
>> > the test-case.
>>
>> What does it do for the tests in the interleave+zip1 patch?  If it doesn't
>> make a difference there then it sounds like we don't have enough tests. :)
> Oh right, the tests in interleave+zip1 patch only check for s16 case,
> sorry about that :/
> Looking briefly at the code generated for s8, s32 and s64 case,
> (a) s8, and s16 seem to use same sequence for all cases.
> (b) s64 seems to use fallback sequence.
> (c) For vec-init-21.c, s8 and s16 cases prefer fallback sequence
> because costs are tied,
> while s32 case prefers interleave+zip1:
>
> int32x4_t f_s32(int32_t x, int32_t y)
> {
>   return (int32x4_t) { x, y, 1, 2 };
> }
>
> Code-gen with interleave+zip1 sequence:
> f_s32:
>         movi    v31.2s, 0x1
>         movi    v0.2s, 0x2
>         ins     v31.s[0], w0
>         ins     v0.s[0], w1
>         zip1    v0.4s, v31.4s, v0.4s
>         ret
>
> Code-gen with fallback sequence:
> f_s32:
>         adrp    x2, .LC0
>         ldr     q0, [x2, #:lo12:.LC0]
>         ins     v0.s[0], w0
>         ins     v0.s[1], w1
>         ret
>
> Fallback sequence cost = 20
> interleave+zip1 sequence cost = 12
> I assume interleave+zip1 sequence is better in this case (chosen currently) ?
>
> I will send a patch to add cases for s8, s16 and s64 in a follow up patch soon.
>>
>> Summing is only conservative if the fallback sequence is somehow "safer".
>> But I don't think it is.   Building an N-element vector from N scalars
>> can be done using N instructions in the fallback case and N+1 instructions
>> in the interleave+zip1 case.  But the interleave+zip1 case is still
>> better (speedwise) for N==16.
> Ack, thanks.
> Should we also prefer interleave+zip1 when the costs are tied ?

No, because the ZIP1 approach requires more temporary registers (in
general).  And we're making an optimistic (but reasonable) assumption
that enough vector pipes are free to do the work in parallel.

> For eg, for the following case:
> int32x4_t f_s32(int32_t x)
> {
>   return (int32x4_t) { x, 1, x, 1 };
> }
>
> costs for both fallback and interleave+zip1 sequence = 12, and we
> currently choose fallback sequence.
> Code-gen:
> f_s32:
>         movi    v0.4s, 0x1
>         fmov    s31, w0
>         ins     v0.s[0], v31.s[0]
>         ins     v0.s[2], v31.s[0]
>         ret
>
> while, if we choose interleave+zip1, code-gen is:
> f_s32:
>         dup     v31.2s, w0
>         movi    v0.2s, 0x1
>         zip1    v0.4s, v31.4s, v0.4s
>         ret
>
> I suppose the interleave+zip1 sequence is better in this case ?
> And more generally, if the costs are tied, would it be OK to prefer
> interleave+zip1 sequence since it will
> have parallel execution of two halves, which may not always be the
> case with fallback sequence ?

But when looking at these sequences, it's important to ask not just
which sequence wins, but why it wins.  If the zip1 version is better
(I agree it probably is), then the question is why that isn't showing
up in the costs.

> Also, would it be OK to commit the above patch that addresses the
> issue with single constant case and xfail the s32 case for now ?

I think it'd be better to wait until we can avoid the XFAIL.  That way
it's easier to see that we're making forward progress.

Thanks,
Richard
  
Prathamesh Kulkarni May 18, 2023, 2:41 p.m. UTC | #16
On Thu, 18 May 2023 at 13:37, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Tue, 16 May 2023 at 00:29, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > Hi Richard,
> >> > After committing the interleave+zip1 patch for vector initialization,
> >> > it seems to regress the s32 case for this patch:
> >> >
> >> > int32x4_t f_s32(int32_t x)
> >> > {
> >> >   return (int32x4_t) { x, x, x, 1 };
> >> > }
> >> >
> >> > code-gen:
> >> > f_s32:
> >> >         movi    v30.2s, 0x1
> >> >         fmov    s31, w0
> >> >         dup     v0.2s, v31.s[0]
> >> >         ins     v30.s[0], v31.s[0]
> >> >         zip1    v0.4s, v0.4s, v30.4s
> >> >         ret
> >> >
> >> > instead of expected code-gen:
> >> > f_s32:
> >> >         movi    v31.2s, 0x1
> >> >         dup     v0.4s, w0
> >> >         ins     v0.s[3], v31.s[0]
> >> >         ret
> >> >
> >> > Cost for fallback sequence: 16
> >> > Cost for interleave and zip sequence: 12
> >> >
> >> > For the above case, the cost for interleave+zip1 sequence is computed as:
> >> > halves[0]:
> >> > (set (reg:V2SI 96)
> >> >     (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
> >> > cost = 8
> >> >
> >> > halves[1]:
> >> > (set (reg:V2SI 97)
> >> >     (const_vector:V2SI [
> >> >             (const_int 1 [0x1]) repeated x2
> >> >         ]))
> >> > (set (reg:V2SI 97)
> >> >     (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
> >> >         (reg:V2SI 97)
> >> >         (const_int 1 [0x1])))
> >> > cost = 8
> >> >
> >> > followed by:
> >> > (set (reg:V4SI 95)
> >> >     (unspec:V4SI [
> >> >             (subreg:V4SI (reg:V2SI 96) 0)
> >> >             (subreg:V4SI (reg:V2SI 97) 0)
> >> >         ] UNSPEC_ZIP1))
> >> > cost = 4
> >> >
> >> > So the total cost becomes
> >> > max(costs[0], costs[1]) + zip1_insn_cost
> >> > = max(8, 8) + 4
> >> > = 12
> >> >
> >> > While the fallback rtl sequence is:
> >> > (set (reg:V4SI 95)
> >> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> >> > cost = 8
> >> > (set (reg:SI 98)
> >> >     (const_int 1 [0x1]))
> >> > cost = 4
> >> > (set (reg:V4SI 95)
> >> >     (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
> >> >         (reg:V4SI 95)
> >> >         (const_int 8 [0x8])))
> >> > cost = 4
> >> >
> >> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
> >> >
> >> > I think the issue is probably that for the interleave+zip1 sequence we take
> >> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
> >> > but for the fallback seq we use seq_cost, which assumes serial execution
> >> > of insns in the sequence.
> >> > For above fallback sequence,
> >> > set (reg:V4SI 95)
> >> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> >> > and
> >> > (set (reg:SI 98)
> >> >     (const_int 1 [0x1]))
> > > could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.
> >>
> >> Agreed.
> >>
> >> A good-enough substitute for this might be to ignore scalar moves
> >> (for both alternatives) when costing for speed.
> > Thanks for the suggestions. Just wondering for aarch64, if there's an easy
> > way we can check if insn is a scalar move, similar to riscv's scalar_move_insn_p
> > that checks if get_attr_type(insn) is TYPE_VIMOVXV or TYPE_VFMOVFV ?
>
> It should be enough to check that the pattern is a SET:
>
> (a) whose SET_DEST has a scalar mode and
> (b) whose SET_SRC is an aarch64_mov_operand
Hi Richard,
Thanks for the suggestions. The attached patch calls seq_cost to compute
the cost of the sequence and then subtracts the cost of each scalar move
insn from it. Does that look OK?
The patch is under bootstrap+test on aarch64-linux-gnu.

After applying the single-constant-case patch on top, the cost of the
fallback sequence is now 12 instead of 16:
Cost before ignoring scalar moves: 16
Ignoring cost = 4 for: (set (reg:SI 98)
    (const_int 1 [0x1]))
Cost after ignoring scalar moves: 12
fallback_seq_cost = 12, zip1_seq_cost = 12

fallback_seq:
(set (reg:V4SI 95)
    (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
(set (reg:SI 98)
    (const_int 1 [0x1]))
(set (reg:V4SI 95)
    (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
        (reg:V4SI 95)
        (const_int 8 [0x8])))

zip1_seq:
(set (reg:V2SI 96)
    (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
(set (reg:V2SI 97)
    (const_vector:V2SI [
            (const_int 1 [0x1]) repeated x2
        ]))
(set (reg:V2SI 97)
    (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
        (reg:V2SI 97)
        (const_int 1 [0x1])))
(set (reg:V4SI 95)
    (unspec:V4SI [
            (subreg:V4SI (reg:V2SI 96) 0)
            (subreg:V4SI (reg:V2SI 97) 0)
        ] UNSPEC_ZIP1))

So now the costs for both sequences are tied at 12, and we choose the
fallback sequence, which "fixes" this case. More generally, though, if
the costs for both sequences are tied, how do we evaluate which sequence
would be better? Currently we choose the fallback sequence whenever the
costs are the same, since the strict < in the final emit_insn breaks
ties in favour of the fallback.
>
> >> > I was wondering if we should we make cost for interleave+zip1 sequence
> >> > more conservative
> >> > by not taking max, but summing up costs[0] + costs[1] even for speed ?
> >> > For this case,
> >> > that would be 8 + 8 + 4 = 20.
> >> >
> >> > It generates the fallback sequence for other cases (s8, s16, s64) from
> >> > the test-case.
> >>
> >> What does it do for the tests in the interleave+zip1 patch?  If it doesn't
> >> make a difference there then it sounds like we don't have enough tests. :)
> > Oh right, the tests in interleave+zip1 patch only check for s16 case,
> > sorry about that :/
> > Looking briefly at the code generated for s8, s32 and s64 case,
> > (a) s8, and s16 seem to use same sequence for all cases.
> > (b) s64 seems to use fallback sequence.
> > (c) For vec-init-21.c, s8 and s16 cases prefer fallback sequence
> > because costs are tied,
> > while s32 case prefers interleave+zip1:
> >
> > int32x4_t f_s32(int32_t x, int32_t y)
> > {
> >   return (int32x4_t) { x, y, 1, 2 };
> > }
> >
> > Code-gen with interleave+zip1 sequence:
> > f_s32:
> >         movi    v31.2s, 0x1
> >         movi    v0.2s, 0x2
> >         ins     v31.s[0], w0
> >         ins     v0.s[0], w1
> >         zip1    v0.4s, v31.4s, v0.4s
> >         ret
> >
> > Code-gen with fallback sequence:
> > f_s32:
> >         adrp    x2, .LC0
> >         ldr     q0, [x2, #:lo12:.LC0]
> >         ins     v0.s[0], w0
> >         ins     v0.s[1], w1
> >         ret
> >
> > Fallback sequence cost = 20
> > interleave+zip1 sequence cost = 12
> > I assume interleave+zip1 sequence is better in this case (chosen currently) ?
> >
> > I will send a patch to add cases for s8, s16 and s64 in a follow up patch soon.
> >>
> >> Summing is only conservative if the fallback sequence is somehow "safer".
> >> But I don't think it is.   Building an N-element vector from N scalars
> >> can be done using N instructions in the fallback case and N+1 instructions
> >> in the interleave+zip1 case.  But the interleave+zip1 case is still
> >> better (speedwise) for N==16.
> > Ack, thanks.
> > Should we also prefer interleave+zip1 when the costs are tied ?
>
> No, because the ZIP1 approach requires more temporary registers (in
> general).  And we're making an optimistic (but reasonable) assumption
> that enough vector pipes are free to do the work in parallel.
Oh right, thanks, the zip1 sequence would also increase register pressure.
>
> > For eg, for the following case:
> > int32x4_t f_s32(int32_t x)
> > {
> >   return (int32x4_t) { x, 1, x, 1 };
> > }
> >
> > costs for both fallback and interleave+zip1 sequence = 12, and we
> > currently choose fallback sequence.
> > Code-gen:
> > f_s32:
> >         movi    v0.4s, 0x1
> >         fmov    s31, w0
> >         ins     v0.s[0], v31.s[0]
> >         ins     v0.s[2], v31.s[0]
> >         ret
> >
> > while, if we choose interleave+zip1, code-gen is:
> > f_s32:
> >         dup     v31.2s, w0
> >         movi    v0.2s, 0x1
> >         zip1    v0.4s, v31.4s, v0.4s
> >         ret
> >
> > I suppose the interleave+zip1 sequence is better in this case ?
> > And more generally, if the costs are tied, would it be OK to prefer
> > interleave+zip1 sequence since it will
> > have parallel execution of two halves, which may not always be the
> > case with fallback sequence ?
>
> But when looking at these sequences, it's important to ask not just
> which sequence wins, but why it wins.  If the zip1 version is better
> (I agree it probably is), then the question is why that isn't showing
> up in the costs.
>
> > Also, would it be OK to commit the above patch that addresses the
> > issue with single constant case and xfail the s32 case for now ?
>
> I think it'd be better to wait until we can avoid the XFAIL.  That way
> it's easier to see that we're making forward progress.
Sure.

Thanks,
Prathamesh
>
> Thanks,
> Richard
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 29dbacfa917..7efd896d364 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22332,6 +22332,32 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
   return gen_rtx_PARALLEL (new_mode, vec);
 }
 
+/* Return true if INSN is a scalar move.  */
+
+static bool
+scalar_move_insn_p (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  if (!set)
+    return false;
+  rtx src = SET_SRC (set);
+  rtx dest = SET_DEST (set);
+  return is_a<scalar_mode>(GET_MODE (dest)) && aarch64_mov_operand_p (src, GET_MODE (src));
+}
+
+/* Ignore cost for scalar moves from cost of sequence. This function is called
+   for calculating sequence costs in aarch64_expand_vector_init.  */
+
+static unsigned
+seq_cost_ignore_scalar_moves (rtx_insn *seq, bool speed)
+{
+  unsigned cost = seq_cost (seq, speed);
+  for (; seq; seq = NEXT_INSN (seq))
+    if (scalar_move_insn_p (seq))
+      cost -= insn_cost (seq, speed);
+  return cost;
+}
+
 /* Expand a vector initialization sequence, such that TARGET is
    initialized to contain VALS.  */
 
@@ -22367,7 +22393,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
       halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
       rtx_insn *rec_seq = get_insns ();
       end_sequence ();
-      costs[i] = seq_cost (rec_seq, !optimize_size);
+      costs[i] = seq_cost_ignore_scalar_moves (rec_seq, !optimize_size);
       emit_insn (rec_seq);
     }
 
@@ -22384,7 +22410,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
   start_sequence ();
   aarch64_expand_vector_init_fallback (target, vals);
   rtx_insn *fallback_seq = get_insns ();
-  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
+  unsigned fallback_seq_cost = seq_cost_ignore_scalar_moves (fallback_seq, !optimize_size);
   end_sequence ();
 
   emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
  
Richard Sandiford May 18, 2023, 4:34 p.m. UTC | #17
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Thu, 18 May 2023 at 13:37, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Tue, 16 May 2023 at 00:29, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> > Hi Richard,
>> >> > After committing the interleave+zip1 patch for vector initialization,
>> >> > it seems to regress the s32 case for this patch:
>> >> >
>> >> > int32x4_t f_s32(int32_t x)
>> >> > {
>> >> >   return (int32x4_t) { x, x, x, 1 };
>> >> > }
>> >> >
>> >> > code-gen:
>> >> > f_s32:
>> >> >         movi    v30.2s, 0x1
>> >> >         fmov    s31, w0
>> >> >         dup     v0.2s, v31.s[0]
>> >> >         ins     v30.s[0], v31.s[0]
>> >> >         zip1    v0.4s, v0.4s, v30.4s
>> >> >         ret
>> >> >
>> >> > instead of expected code-gen:
>> >> > f_s32:
>> >> >         movi    v31.2s, 0x1
>> >> >         dup     v0.4s, w0
>> >> >         ins     v0.s[3], v31.s[0]
>> >> >         ret
>> >> >
>> >> > Cost for fallback sequence: 16
>> >> > Cost for interleave and zip sequence: 12
>> >> >
>> >> > For the above case, the cost for interleave+zip1 sequence is computed as:
>> >> > halves[0]:
>> >> > (set (reg:V2SI 96)
>> >> >     (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
>> >> > cost = 8
>> >> >
>> >> > halves[1]:
>> >> > (set (reg:V2SI 97)
>> >> >     (const_vector:V2SI [
>> >> >             (const_int 1 [0x1]) repeated x2
>> >> >         ]))
>> >> > (set (reg:V2SI 97)
>> >> >     (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
>> >> >         (reg:V2SI 97)
>> >> >         (const_int 1 [0x1])))
>> >> > cost = 8
>> >> >
>> >> > followed by:
>> >> > (set (reg:V4SI 95)
>> >> >     (unspec:V4SI [
>> >> >             (subreg:V4SI (reg:V2SI 96) 0)
>> >> >             (subreg:V4SI (reg:V2SI 97) 0)
>> >> >         ] UNSPEC_ZIP1))
>> >> > cost = 4
>> >> >
>> >> > So the total cost becomes
>> >> > max(costs[0], costs[1]) + zip1_insn_cost
>> >> > = max(8, 8) + 4
>> >> > = 12
>> >> >
>> >> > While the fallback rtl sequence is:
>> >> > (set (reg:V4SI 95)
>> >> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> >> > cost = 8
>> >> > (set (reg:SI 98)
>> >> >     (const_int 1 [0x1]))
>> >> > cost = 4
>> >> > (set (reg:V4SI 95)
>> >> >     (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
>> >> >         (reg:V4SI 95)
>> >> >         (const_int 8 [0x8])))
>> >> > cost = 4
>> >> >
>> >> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
>> >> >
>> >> > I think the issue is probably that for the interleave+zip1 sequence we take
>> >> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
>> >> > but for the fallback seq we use seq_cost, which assumes serial execution
>> >> > of insns in the sequence.
>> >> > For above fallback sequence,
>> >> > set (reg:V4SI 95)
>> >> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
>> >> > and
>> >> > (set (reg:SI 98)
>> >> >     (const_int 1 [0x1]))
>> >> > could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.
>> >>
>> >> Agreed.
>> >>
>> >> A good-enough substitute for this might be to ignore scalar moves
>> >> (for both alternatives) when costing for speed.
>> > Thanks for the suggestions. Just wondering for aarch64, if there's an easy
>> > way we can check if insn is a scalar move, similar to riscv's scalar_move_insn_p
>> > that checks if get_attr_type(insn) is TYPE_VIMOVXV or TYPE_VFMOVFV ?
>>
>> It should be enough to check that the pattern is a SET:
>>
>> (a) whose SET_DEST has a scalar mode and
>> (b) whose SET_SRC is an aarch64_mov_operand
> Hi Richard,
> Thanks for the suggestions, the attached patch calls seq_cost to compute
> cost for sequence and then subtracts cost of each scalar move insn from it.
> Does that look OK ?
> The patch is under bootstrap+test on aarch64-linux-gnu.

Yeah, the patch looks reasonable (some comments below).  The testing
for this kind of patch is more than a formality though, so it would
be good to wait to see if the tests pass.

> [...]
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 29dbacfa917..7efd896d364 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22332,6 +22332,32 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
>    return gen_rtx_PARALLEL (new_mode, vec);
>  }
>  
> +/* Return true if INSN is a scalar move.  */
> +
> +static bool
> +scalar_move_insn_p (rtx_insn *insn)
> +{
> +  rtx set = single_set (insn);
> +  if (!set)
> +    return false;
> +  rtx src = SET_SRC (set);
> +  rtx dest = SET_DEST (set);
> +  return is_a<scalar_mode>(GET_MODE (dest)) && aarch64_mov_operand_p (src, GET_MODE (src));

Long line.

> +}
> +
> +/* Ignore cost for scalar moves from cost of sequence. This function is called
> +   for calculating sequence costs in aarch64_expand_vector_init.  */
> +
> +static unsigned
> +seq_cost_ignore_scalar_moves (rtx_insn *seq, bool speed)

Maybe more readable as "ignoring" rather than "ignore".

> +{
> +  unsigned cost = seq_cost (seq, speed);
> +  for (; seq; seq = NEXT_INSN (seq))
> +    if (scalar_move_insn_p (seq))
> +      cost -= insn_cost (seq, speed);
> +  return cost;
> +}
> +

seq_cost uses set_rtx_cost rather than insn_cost for single sets.

To avoid that kind of inconsistency, I think it'd better to duplicate and
adjust seq_cost.  Then scalar_move_insn_p can be passed the single set.
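
Something along these lines, perhaps (an illustrative sketch only,
untested; the predicate choice is kept as in the patch above, and the
clone of seq_cost hands the single set straight to it):

/* Illustrative sketch: a local clone of seq_cost whose single-set
   case can consult the set directly.  */
static bool
scalar_move_p (rtx set)
{
  rtx src = SET_SRC (set);
  rtx dest = SET_DEST (set);
  return (is_a<scalar_mode> (GET_MODE (dest))
	  && aarch64_mov_operand_p (src, GET_MODE (src)));
}

static unsigned
seq_cost_ignoring_scalar_moves (const rtx_insn *seq, bool speed)
{
  unsigned cost = 0;
  for (; seq; seq = NEXT_INSN (seq))
    if (NONDEBUG_INSN_P (seq))
      {
	if (rtx set = single_set (seq))
	  {
	    if (!scalar_move_p (set))
	      cost += set_rtx_cost (set, speed);
	  }
	else
	  cost += insn_cost (CONST_CAST_RTX_INSN (seq), speed);
      }
  return cost;
}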

>  /* Expand a vector initialization sequence, such that TARGET is
>     initialized to contain VALS.  */
>  
> @@ -22367,7 +22393,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>        halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
>        rtx_insn *rec_seq = get_insns ();
>        end_sequence ();
> -      costs[i] = seq_cost (rec_seq, !optimize_size);
> +      costs[i] = seq_cost_ignore_scalar_moves (rec_seq, !optimize_size);
>        emit_insn (rec_seq);
>      }
>  
> @@ -22384,7 +22410,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>    start_sequence ();
>    aarch64_expand_vector_init_fallback (target, vals);
>    rtx_insn *fallback_seq = get_insns ();
> -  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
> +  unsigned fallback_seq_cost = seq_cost_ignore_scalar_moves (fallback_seq, !optimize_size);

Long line.

Thanks,
Richard

>    end_sequence ();
>  
>    emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
  
Prathamesh Kulkarni May 19, 2023, 10:56 a.m. UTC | #18
On Thu, 18 May 2023 at 22:04, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Thu, 18 May 2023 at 13:37, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > On Tue, 16 May 2023 at 00:29, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> > Hi Richard,
> >> >> > After committing the interleave+zip1 patch for vector initialization,
> >> >> > it seems to regress the s32 case for this patch:
> >> >> >
> >> >> > int32x4_t f_s32(int32_t x)
> >> >> > {
> >> >> >   return (int32x4_t) { x, x, x, 1 };
> >> >> > }
> >> >> >
> >> >> > code-gen:
> >> >> > f_s32:
> >> >> >         movi    v30.2s, 0x1
> >> >> >         fmov    s31, w0
> >> >> >         dup     v0.2s, v31.s[0]
> >> >> >         ins     v30.s[0], v31.s[0]
> >> >> >         zip1    v0.4s, v0.4s, v30.4s
> >> >> >         ret
> >> >> >
> >> >> > instead of expected code-gen:
> >> >> > f_s32:
> >> >> >         movi    v31.2s, 0x1
> >> >> >         dup     v0.4s, w0
> >> >> >         ins     v0.s[3], v31.s[0]
> >> >> >         ret
> >> >> >
> >> >> > Cost for fallback sequence: 16
> >> >> > Cost for interleave and zip sequence: 12
> >> >> >
> >> >> > For the above case, the cost for interleave+zip1 sequence is computed as:
> >> >> > halves[0]:
> >> >> > (set (reg:V2SI 96)
> >> >> >     (vec_duplicate:V2SI (reg/v:SI 93 [ x ])))
> >> >> > cost = 8
> >> >> >
> >> >> > halves[1]:
> >> >> > (set (reg:V2SI 97)
> >> >> >     (const_vector:V2SI [
> >> >> >             (const_int 1 [0x1]) repeated x2
> >> >> >         ]))
> >> >> > (set (reg:V2SI 97)
> >> >> >     (vec_merge:V2SI (vec_duplicate:V2SI (reg/v:SI 93 [ x ]))
> >> >> >         (reg:V2SI 97)
> >> >> >         (const_int 1 [0x1])))
> >> >> > cost = 8
> >> >> >
> >> >> > followed by:
> >> >> > (set (reg:V4SI 95)
> >> >> >     (unspec:V4SI [
> >> >> >             (subreg:V4SI (reg:V2SI 96) 0)
> >> >> >             (subreg:V4SI (reg:V2SI 97) 0)
> >> >> >         ] UNSPEC_ZIP1))
> >> >> > cost = 4
> >> >> >
> >> >> > So the total cost becomes
> >> >> > max(costs[0], costs[1]) + zip1_insn_cost
> >> >> > = max(8, 8) + 4
> >> >> > = 12
> >> >> >
> >> >> > While the fallback rtl sequence is:
> >> >> > (set (reg:V4SI 95)
> >> >> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> >> >> > cost = 8
> >> >> > (set (reg:SI 98)
> >> >> >     (const_int 1 [0x1]))
> >> >> > cost = 4
> >> >> > (set (reg:V4SI 95)
> >> >> >     (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 98))
> >> >> >         (reg:V4SI 95)
> >> >> >         (const_int 8 [0x8])))
> >> >> > cost = 4
> >> >> >
> >> >> > So total cost = 8 + 4 + 4 = 16, and we choose the interleave+zip1 sequence.
> >> >> >
> >> >> > I think the issue is probably that for the interleave+zip1 sequence we take
> >> >> > max(costs[0], costs[1]) to reflect that both halves are interleaved,
> >> >> > but for the fallback seq we use seq_cost, which assumes serial execution
> >> >> > of insns in the sequence.
> >> >> > For above fallback sequence,
> >> >> > set (reg:V4SI 95)
> >> >> >     (vec_duplicate:V4SI (reg/v:SI 93 [ x ])))
> >> >> > and
> >> >> > (set (reg:SI 98)
> >> >> >     (const_int 1 [0x1]))
> >> >> > could be executed in parallel, which would make its cost max(8, 4) + 4 = 12.
> >> >>
> >> >> Agreed.
> >> >>
> >> >> A good-enough substitute for this might be to ignore scalar moves
> >> >> (for both alternatives) when costing for speed.
> >> > Thanks for the suggestions. Just wondering for aarch64, if there's an easy
> >> > way we can check if insn is a scalar move, similar to riscv's scalar_move_insn_p
> >> > that checks if get_attr_type(insn) is TYPE_VIMOVXV or TYPE_VFMOVFV ?
> >>
> >> It should be enough to check that the pattern is a SET:
> >>
> >> (a) whose SET_DEST has a scalar mode and
> >> (b) whose SET_SRC is an aarch64_mov_operand
> > Hi Richard,
> > Thanks for the suggestions, the attached patch calls seq_cost to compute
> > cost for sequence and then subtracts cost of each scalar move insn from it.
> > Does that look OK ?
> > The patch is under bootstrap+test on aarch64-linux-gnu.
>
> Yeah, the patch looks reasonable (some comments below).  The testing
> for this kind of patch is more than a formality though, so it would
> be good to wait to see if the tests pass.
>
> > [...]
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 29dbacfa917..7efd896d364 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -22332,6 +22332,32 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> >    return gen_rtx_PARALLEL (new_mode, vec);
> >  }
> >
> > +/* Return true if INSN is a scalar move.  */
> > +
> > +static bool
> > +scalar_move_insn_p (rtx_insn *insn)
> > +{
> > +  rtx set = single_set (insn);
> > +  if (!set)
> > +    return false;
> > +  rtx src = SET_SRC (set);
> > +  rtx dest = SET_DEST (set);
> > +  return is_a<scalar_mode>(GET_MODE (dest)) && aarch64_mov_operand_p (src, GET_MODE (src));
>
> Long line.
>
> > +}
> > +
> > +/* Ignore cost for scalar moves from cost of sequence. This function is called
> > +   for calculating sequence costs in aarch64_expand_vector_init.  */
> > +
> > +static unsigned
> > +seq_cost_ignore_scalar_moves (rtx_insn *seq, bool speed)
>
> Maybe more readable as "ignoring" rather than "ignore".
>
> > +{
> > +  unsigned cost = seq_cost (seq, speed);
> > +  for (; seq; seq = NEXT_INSN (seq))
> > +    if (scalar_move_insn_p (seq))
> > +      cost -= insn_cost (seq, speed);
> > +  return cost;
> > +}
> > +
>
> seq_cost uses set_rtx_cost rather than insn_cost for single sets.
>
> To avoid that kind of inconsistency, I think it'd better to duplicate and
> adjust seq_cost.  Then scalar_move_insn_p can be passed the single set.
>
> >  /* Expand a vector initialization sequence, such that TARGET is
> >     initialized to contain VALS.  */
> >
> > @@ -22367,7 +22393,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >        halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
> >        rtx_insn *rec_seq = get_insns ();
> >        end_sequence ();
> > -      costs[i] = seq_cost (rec_seq, !optimize_size);
> > +      costs[i] = seq_cost_ignore_scalar_moves (rec_seq, !optimize_size);
> >        emit_insn (rec_seq);
> >      }
> >
> > @@ -22384,7 +22410,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >    start_sequence ();
> >    aarch64_expand_vector_init_fallback (target, vals);
> >    rtx_insn *fallback_seq = get_insns ();
> > -  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
> > +  unsigned fallback_seq_cost = seq_cost_ignore_scalar_moves (fallback_seq, !optimize_size);
>
> Long line.
Hi Richard,
Thanks for the suggestions. Does the attached patch look OK?
Bootstrap+test in progress on aarch64-linux-gnu.

Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> >    end_sequence ();
> >
> >    emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
[aarch64] Ignore cost of scalar moves for seq in vector initialization.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (scalar_move_insn_p): New function. 
	(seq_cost_ignoring_scalar_moves): Likewise.
	(aarch64_expand_vector_init): Call seq_cost_ignoring_scalar_moves.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 29dbacfa917..e611a7cca25 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
   return gen_rtx_PARALLEL (new_mode, vec);
 }
 
+/* Return true if INSN is a scalar move.  */
+
+static bool
+scalar_move_insn_p (const rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  if (!set)
+    return false;
+  rtx src = SET_SRC (set);
+  rtx dest = SET_DEST (set);
+  return is_a<scalar_mode>(GET_MODE (dest))
+	 && aarch64_mov_operand_p (src, GET_MODE (src));
+}
+
+/* Similar to seq_cost, but ignore cost for scalar moves. This function
+   is called from aarch64_expand_vector_init.  */
+
+static unsigned
+seq_cost_ignoring_scalar_moves (const rtx_insn *seq, bool speed)
+{
+  unsigned cost = 0;
+  rtx set;
+
+  for (; seq; seq = NEXT_INSN (seq))
+    if (NONDEBUG_INSN_P (seq)
+	&& !scalar_move_insn_p (seq))
+      {
+	int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
+	if (this_cost > 0)
+	  cost += this_cost;
+	else
+	  cost++;
+      }
+
+  return cost;
+}
+
 /* Expand a vector initialization sequence, such that TARGET is
    initialized to contain VALS.  */
 
@@ -22367,7 +22404,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
       halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
       rtx_insn *rec_seq = get_insns ();
       end_sequence ();
-      costs[i] = seq_cost (rec_seq, !optimize_size);
+      costs[i] = seq_cost_ignoring_scalar_moves (rec_seq, !optimize_size);
       emit_insn (rec_seq);
     }
 
@@ -22384,7 +22421,8 @@ aarch64_expand_vector_init (rtx target, rtx vals)
   start_sequence ();
   aarch64_expand_vector_init_fallback (target, vals);
   rtx_insn *fallback_seq = get_insns ();
-  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
+  unsigned fallback_seq_cost
+    = seq_cost_ignoring_scalar_moves (fallback_seq, !optimize_size);
   end_sequence ();
 
   emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
  
Richard Sandiford May 22, 2023, 8:48 a.m. UTC | #19
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> Hi Richard,
> Thanks for the suggestions. Does the attached patch look OK ?
> Bootstrap+test in progress on aarch64-linux-gnu.

Like I say, please wait for the tests to complete before sending an RFA.
It saves a review cycle if the tests don't in fact pass.

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 29dbacfa917..e611a7cca25 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
>    return gen_rtx_PARALLEL (new_mode, vec);
>  }
>  
> +/* Return true if INSN is a scalar move.  */
> +
> +static bool
> +scalar_move_insn_p (const rtx_insn *insn)
> +{
> +  rtx set = single_set (insn);
> +  if (!set)
> +    return false;
> +  rtx src = SET_SRC (set);
> +  rtx dest = SET_DEST (set);
> +  return is_a<scalar_mode>(GET_MODE (dest))
> +	 && aarch64_mov_operand_p (src, GET_MODE (src));

Formatting:

  return (is_a<scalar_mode>(GET_MODE (dest))
	  && aarch64_mov_operand_p (src, GET_MODE (src)));

OK with that change if the tests pass, thanks.

Richard
  
Prathamesh Kulkarni May 24, 2023, 9:29 a.m. UTC | #20
On Mon, 22 May 2023 at 14:18, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > Hi Richard,
> > Thanks for the suggestions. Does the attached patch look OK ?
> > Boostrap+test in progress on aarch64-linux-gnu.
>
> Like I say, please wait for the tests to complete before sending an RFA.
> It saves a review cycle if the tests don't in fact pass.
Right, sorry; from now on I will post patches only after testing completes.
>
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 29dbacfa917..e611a7cca25 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> >    return gen_rtx_PARALLEL (new_mode, vec);
> >  }
> >
> > +/* Return true if INSN is a scalar move.  */
> > +
> > +static bool
> > +scalar_move_insn_p (const rtx_insn *insn)
> > +{
> > +  rtx set = single_set (insn);
> > +  if (!set)
> > +    return false;
> > +  rtx src = SET_SRC (set);
> > +  rtx dest = SET_DEST (set);
> > +  return is_a<scalar_mode>(GET_MODE (dest))
> > +      && aarch64_mov_operand_p (src, GET_MODE (src));
>
> Formatting:
>
>   return (is_a<scalar_mode>(GET_MODE (dest))
>           && aarch64_mov_operand_p (src, GET_MODE (src)));
>
> OK with that change if the tests pass, thanks.
Unfortunately, the patch regressed vec-init-21.c:

int8x16_t f_s8(int8_t x, int8_t y)
{
  return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
                       7, 8, 9, 10, 11, 12, 13, 14 };
}

-O3 code-gen trunk:
f_s8:
        adrp    x2, .LC0
        ldr     q0, [x2, #:lo12:.LC0]
        ins     v0.b[0], w0
        ins     v0.b[1], w1
        ret

-O3 code-gen patch:
f_s8:
        adrp    x2, .LC0
        ldr     d31, [x2, #:lo12:.LC0]
        adrp    x2, .LC1
        ldr     d0, [x2, #:lo12:.LC1]
        ins     v31.b[0], w0
        ins     v0.b[0], w1
        zip1    v0.16b, v31.16b, v0.16b
        ret

With trunk, it chooses the fallback sequence because both the fallback
and zip1 sequences had cost = 20; however, with the patch applied, we
end up with a zip1 sequence cost of 24 and a fallback sequence cost
of 28.

This happens because insn_cost is used instead of set_rtx_cost for the
following expression:
(set (reg:QI 100)
    (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
set_rtx_cost returns 0 for the above expression, but insn_cost returns 4
(presumably because insn_cost rounds a zero-cost single set up to
COSTS_N_INSNS (1)).

This expression template appears twice in the fallback sequence, raising
its cost from 20 to 28, while it appears once in each half of the zip1
sequence, raising its cost from 20 to 24, so the zip1 sequence is now
preferred instead.

I assumed this expression would be ignored because it looks like a scalar move,
but that doesn't seem to be the case?
aarch64_classify_symbolic_expression returns
SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
and thus aarch64_mov_operand_p returns false.

Another issue with the zip1 sequence above is that it uses the same
register x2 for loading the other half of the constant in:
adrp    x2, .LC1

I guess this will create an output dependency from adrp x2, .LC0 ->
adrp x2, .LC1 and an anti-dependency from ldr d31, [x2, #:lo12:.LC0] ->
adrp x2, .LC1, essentially forcing almost the entire sequence (except
the ins instructions) to execute sequentially?

Fallback sequence rtl, cost = 28
(set (reg:V16QI 96)
    (const_vector:V16QI [
            (const_int 7 [0x7])
            (const_int 8 [0x8])
            (const_int 1 [0x1])
            (const_int 2 [0x2])
            (const_int 3 [0x3])
            (const_int 4 [0x4])
            (const_int 5 [0x5])
            (const_int 6 [0x6])
            (const_int 7 [0x7])
            (const_int 8 [0x8])
            (const_int 9 [0x9])
            (const_int 10 [0xa])
            (const_int 11 [0xb])
            (const_int 12 [0xc])
            (const_int 13 [0xd])
            (const_int 14 [0xe])
        ]))
cost = 12
(set (reg:QI 101)
    (subreg/s/u:QI (reg/v:SI 93 [ x ]) 0))
cost = 4
(set (reg:V16QI 96)
    (vec_merge:V16QI (vec_duplicate:V16QI (reg:QI 101))
        (reg:V16QI 96)
        (const_int 1 [0x1])))
cost = 4
(set (reg:QI 102)
    (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
cost = 4
(set (reg:V16QI 96)
    (vec_merge:V16QI (vec_duplicate:V16QI (reg:QI 102))
        (reg:V16QI 96)
        (const_int 2 [0x2])))
cost = 4

zip1 sequence rtl, cost = 24
(set (reg:V8QI 97)
    (const_vector:V8QI [
            (const_int 7 [0x7])
            (const_int 1 [0x1])
            (const_int 3 [0x3])
            (const_int 5 [0x5])
            (const_int 7 [0x7])
            (const_int 9 [0x9])
            (const_int 11 [0xb])
            (const_int 13 [0xd])
        ]))
cost = 12
(set (reg:QI 98)
    (subreg/s/u:QI (reg/v:SI 93 [ x ]) 0))
cost = 4
(set (reg:V8QI 97)
    (vec_merge:V8QI (vec_duplicate:V8QI (reg:QI 98))
        (reg:V8QI 97)
        (const_int 1 [0x1])))
cost = 4
(set (reg:V8QI 99)
    (const_vector:V8QI [
            (const_int 8 [0x8])
            (const_int 2 [0x2])
            (const_int 4 [0x4])
            (const_int 6 [0x6])
            (const_int 8 [0x8])
            (const_int 10 [0xa])
            (const_int 12 [0xc])
            (const_int 14 [0xe])
        ]))
cost = 12
(set (reg:QI 100)
    (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
cost = 4
(set (reg:V8QI 99)
    (vec_merge:V8QI (vec_duplicate:V8QI (reg:QI 100))
        (reg:V8QI 99)
        (const_int 1 [0x1])))
cost = 4
(set (reg:V16QI 96)
    (unspec:V16QI [
            (subreg:V16QI (reg:V8QI 97) 0)
            (subreg:V16QI (reg:V8QI 99) 0)
        ] UNSPEC_ZIP1))
cost = 4

Thanks,
Prathamesh
>
> Richard
  
Richard Sandiford May 24, 2023, 10:10 a.m. UTC | #21
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Mon, 22 May 2023 at 14:18, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > Hi Richard,
>> > Thanks for the suggestions. Does the attached patch look OK ?
>> > Bootstrap+test in progress on aarch64-linux-gnu.
>>
>> Like I say, please wait for the tests to complete before sending an RFA.
>> It saves a review cycle if the tests don't in fact pass.
> Right, sorry, will post patches after completion of testing henceforth.
>>
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 29dbacfa917..e611a7cca25 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
>> >    return gen_rtx_PARALLEL (new_mode, vec);
>> >  }
>> >
>> > +/* Return true if INSN is a scalar move.  */
>> > +
>> > +static bool
>> > +scalar_move_insn_p (const rtx_insn *insn)
>> > +{
>> > +  rtx set = single_set (insn);
>> > +  if (!set)
>> > +    return false;
>> > +  rtx src = SET_SRC (set);
>> > +  rtx dest = SET_DEST (set);
>> > +  return is_a<scalar_mode>(GET_MODE (dest))
>> > +      && aarch64_mov_operand_p (src, GET_MODE (src));
>>
>> Formatting:
>>
>>   return (is_a<scalar_mode>(GET_MODE (dest))
>>           && aarch64_mov_operand_p (src, GET_MODE (src)));
>>
>> OK with that change if the tests pass, thanks.
> Unfortunately, the patch regressed vec-init-21.c:
>
> int8x16_t f_s8(int8_t x, int8_t y)
> {
>   return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
>                        7, 8, 9, 10, 11, 12, 13, 14 };
> }
>
> -O3 code-gen trunk:
> f_s8:
>         adrp    x2, .LC0
>         ldr     q0, [x2, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         ins     v0.b[1], w1
>         ret
>
> -O3 code-gen patch:
> f_s8:
>         adrp    x2, .LC0
>         ldr     d31, [x2, #:lo12:.LC0]
>         adrp    x2, .LC1
>         ldr     d0, [x2, #:lo12:.LC1]
>         ins     v31.b[0], w0
>         ins     v0.b[0], w1
>         zip1    v0.16b, v31.16b, v0.16b
>         ret
>
> With trunk, it chooses the fallback sequence because both fallback
> and zip1 sequence had cost = 20, however with patch applied,
> we end up with zip1 sequence cost = 24 and fallback sequence
> cost = 28.
>
> This happens because of using insn_cost instead of
> set_rtx_cost for the following expression:
> (set (reg:QI 100)
>     (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
> set_rtx_cost returns 0 for above expression but insn_cost returns 4.

Yeah, was wondering why you'd dropped the set_rtx_cost thing,
but decided not to question it since using insn_cost seemed
reasonable if it worked.

> This expression template appears twice in fallback sequence, which raises
> the cost to 28 from 20, while it appears once in each half of zip1 sequence,
> which raises the cost to 24 from 20, and so it now prefers zip1 sequence
> instead.
>
> I assumed this expression would be ignored because it looks like a scalar move,
> but that doesn't seem to be the case ?
> aarch64_classify_symbolic_expression returns
> SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
> and thus aarch64_mov_operand_p returns false.

Ah, I guess it should be aarch64_mov_operand instead.  Confusing that
they're so different...

> Another issue with the zip1 sequence above is using same register x2
> for loading another half of constant in:
> adrp    x2, .LC1
>
> I guess this will create an output dependency from adrp x2, .LC0 ->
> adrp x2, .LC1
> and anti-dependency from  ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1
> essentially forcing almost the entire sequence (except ins
> instructions) to execute sequentially ?

I'd expect modern cores to handle that via renaming.

Thanks,
Richard
  
Prathamesh Kulkarni May 24, 2023, 7:50 p.m. UTC | #22
On Wed, 24 May 2023 at 15:40, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Mon, 22 May 2023 at 14:18, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > Hi Richard,
> >> > Thanks for the suggestions. Does the attached patch look OK ?
> >> > Bootstrap+test in progress on aarch64-linux-gnu.
> >>
> >> Like I say, please wait for the tests to complete before sending an RFA.
> >> It saves a review cycle if the tests don't in fact pass.
> > Right, sorry, will post patches after completion of testing henceforth.
> >>
> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> > index 29dbacfa917..e611a7cca25 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> >> >    return gen_rtx_PARALLEL (new_mode, vec);
> >> >  }
> >> >
> >> > +/* Return true if INSN is a scalar move.  */
> >> > +
> >> > +static bool
> >> > +scalar_move_insn_p (const rtx_insn *insn)
> >> > +{
> >> > +  rtx set = single_set (insn);
> >> > +  if (!set)
> >> > +    return false;
> >> > +  rtx src = SET_SRC (set);
> >> > +  rtx dest = SET_DEST (set);
> >> > +  return is_a<scalar_mode>(GET_MODE (dest))
> >> > +      && aarch64_mov_operand_p (src, GET_MODE (src));
> >>
> >> Formatting:
> >>
> >>   return (is_a<scalar_mode>(GET_MODE (dest))
> >>           && aarch64_mov_operand_p (src, GET_MODE (src)));
> >>
> >> OK with that change if the tests pass, thanks.
> > Unfortunately, the patch regressed vec-init-21.c:
> >
> > int8x16_t f_s8(int8_t x, int8_t y)
> > {
> >   return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
> >                        7, 8, 9, 10, 11, 12, 13, 14 };
> > }
> >
> > -O3 code-gen trunk:
> > f_s8:
> >         adrp    x2, .LC0
> >         ldr     q0, [x2, #:lo12:.LC0]
> >         ins     v0.b[0], w0
> >         ins     v0.b[1], w1
> >         ret
> >
> > -O3 code-gen patch:
> > f_s8:
> >         adrp    x2, .LC0
> >         ldr     d31, [x2, #:lo12:.LC0]
> >         adrp    x2, .LC1
> >         ldr     d0, [x2, #:lo12:.LC1]
> >         ins     v31.b[0], w0
> >         ins     v0.b[0], w1
> >         zip1    v0.16b, v31.16b, v0.16b
> >         ret
> >
> > With trunk, it chooses the fallback sequence because both fallback
> > and zip1 sequence had cost = 20, however with patch applied,
> > we end up with zip1 sequence cost = 24 and fallback sequence
> > cost = 28.
> >
> > This happens because of using insn_cost instead of
> > set_rtx_cost for the following expression:
> > (set (reg:QI 100)
> >     (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
> > set_rtx_cost returns 0 for above expression but insn_cost returns 4.
>
> Yeah, was wondering why you'd dropped the set_rtx_cost thing,
> but decided not to question it since using insn_cost seemed
> reasonable if it worked.
[reposting because my reply got blocked for moderator approval]

The attached patch uses set_rtx_cost for single sets and insn_cost
otherwise for non-debug insns, similar to seq_cost.
>
> > This expression template appears twice in fallback sequence, which raises
> > the cost to 28 from 20, while it appears once in each half of zip1 sequence,
> > which raises the cost to 24 from 20, and so it now prefers zip1 sequence
> > instead.
> >
> > I assumed this expression would be ignored because it looks like a scalar move,
> > but that doesn't seem to be the case ?
> > aarch64_classify_symbolic_expression returns
> > SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
> > and thus aarch64_mov_operand_p returns false.
>
> Ah, I guess it should be aarch64_mov_operand instead.  Confusing that
> they're so different...
Thanks, using aarch64_mov_operand worked.
>
> > Another issue with the zip1 sequence above is using same register x2
> > for loading another half of constant in:
> > adrp    x2, .LC1
> >
> > I guess this will create an output dependency from adrp x2, .LC0 ->
> > adrp x2, .LC1
> > and anti-dependency from  ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1
> > essentially forcing almost the entire sequence (except ins
> > instructions) to execute sequentially ?
>
> I'd expect modern cores to handle that via renaming.
Ah right, thanks for the clarification.

For some reason, it seems git diff is not formatting the patch correctly :/
Or perhaps I am doing something wrong.
For example, it shows:
+  return is_a<scalar_mode>(GET_MODE (dest))
+        && aarch64_mov_operand (src, GET_MODE (src));
but after applying the patch, it's formatted correctly, with the
"&& aarch64..." right below is_a<scalar_mode>, both on column 10.

Similarly, for the following hunk in seq_cost_ignoring_scalar_moves:
+    if (NONDEBUG_INSN_P (seq)
+       && !scalar_move_insn_p (seq))
After applying the patch, the "&&" is below the N, not the '('; both
are in column 9.

And for the following just below:
+      {
+       if (rtx set = single_set (seq))

The diff shows only one space of difference between the '{' and the
following if, but after applying the patch it's formatted correctly,
with two spaces after the curly brace.

This is the entire aarch64.cc file with the patch applied:
https://people.linaro.org/~prathamesh.kulkarni/aarch64.cc
Does the formatting look OK for scalar_move_insn_p and
seq_cost_ignoring_scalar_moves in the above aarch64.cc?

The patch passes bootstrap+test on aarch64-linux-gnu with no regressions.
OK to commit?

Thanks,
Prathamesh
>
> Thanks,
> Richard
[aarch64] Ignore cost of scalar moves for seq in vector initialization.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (scalar_move_insn_p): New function. 
	(seq_cost_ignoring_scalar_moves): Likewise.
	(aarch64_expand_vector_init): Call seq_cost_ignoring_scalar_moves.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d6fc94015fa..598f2f86417 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22332,6 +22332,47 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
   return gen_rtx_PARALLEL (new_mode, vec);
 }
 
+/* Return true if INSN is a scalar move.  */
+
+static bool
+scalar_move_insn_p (const rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  if (!set)
+    return false;
+  rtx src = SET_SRC (set);
+  rtx dest = SET_DEST (set);
+  return is_a<scalar_mode>(GET_MODE (dest))
+	 && aarch64_mov_operand (src, GET_MODE (src));
+}
+
+/* Similar to seq_cost, but ignore cost for scalar moves.  This function
+   is called from aarch64_expand_vector_init.  */
+
+static unsigned
+seq_cost_ignoring_scalar_moves (const rtx_insn *seq, bool speed)
+{
+  unsigned cost = 0;
+
+  for (; seq; seq = NEXT_INSN (seq))
+    if (NONDEBUG_INSN_P (seq)
+	&& !scalar_move_insn_p (seq))
+      {
+	if (rtx set = single_set (seq))
+	  cost += set_rtx_cost (set, speed);
+	else
+	  {
+	    int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
+	    if (this_cost > 0)
+	      cost += this_cost;
+	    else
+	      cost++;
+	  }
+      }
+
+  return cost;
+}
+
 /* Expand a vector initialization sequence, such that TARGET is
    initialized to contain VALS.  */
 
@@ -22367,7 +22408,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
       halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
       rtx_insn *rec_seq = get_insns ();
       end_sequence ();
-      costs[i] = seq_cost (rec_seq, !optimize_size);
+      costs[i] = seq_cost_ignoring_scalar_moves (rec_seq, !optimize_size);
       emit_insn (rec_seq);
     }
 
@@ -22384,7 +22425,8 @@ aarch64_expand_vector_init (rtx target, rtx vals)
   start_sequence ();
   aarch64_expand_vector_init_fallback (target, vals);
   rtx_insn *fallback_seq = get_insns ();
-  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
+  unsigned fallback_seq_cost
+    = seq_cost_ignoring_scalar_moves (fallback_seq, !optimize_size);
   end_sequence ();
 
   emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
  
Richard Sandiford May 24, 2023, 7:58 p.m. UTC | #23
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Wed, 24 May 2023 at 15:40, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Mon, 22 May 2023 at 14:18, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> > Hi Richard,
>> >> > Thanks for the suggestions. Does the attached patch look OK ?
>> >> > Bootstrap+test in progress on aarch64-linux-gnu.
>> >>
>> >> Like I say, please wait for the tests to complete before sending an RFA.
>> >> It saves a review cycle if the tests don't in fact pass.
>> > Right, sorry, will post patches after completion of testing henceforth.
>> >>
>> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> >> > index 29dbacfa917..e611a7cca25 100644
>> >> > --- a/gcc/config/aarch64/aarch64.cc
>> >> > +++ b/gcc/config/aarch64/aarch64.cc
>> >> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
>> >> >    return gen_rtx_PARALLEL (new_mode, vec);
>> >> >  }
>> >> >
>> >> > +/* Return true if INSN is a scalar move.  */
>> >> > +
>> >> > +static bool
>> >> > +scalar_move_insn_p (const rtx_insn *insn)
>> >> > +{
>> >> > +  rtx set = single_set (insn);
>> >> > +  if (!set)
>> >> > +    return false;
>> >> > +  rtx src = SET_SRC (set);
>> >> > +  rtx dest = SET_DEST (set);
>> >> > +  return is_a<scalar_mode>(GET_MODE (dest))
>> >> > +      && aarch64_mov_operand_p (src, GET_MODE (src));
>> >>
>> >> Formatting:
>> >>
>> >>   return (is_a<scalar_mode>(GET_MODE (dest))
>> >>           && aarch64_mov_operand_p (src, GET_MODE (src)));
>> >>
>> >> OK with that change if the tests pass, thanks.
>> > Unfortunately, the patch regressed vec-init-21.c:
>> >
>> > int8x16_t f_s8(int8_t x, int8_t y)
>> > {
>> >   return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
>> >                        7, 8, 9, 10, 11, 12, 13, 14 };
>> > }
>> >
>> > -O3 code-gen trunk:
>> > f_s8:
>> >         adrp    x2, .LC0
>> >         ldr     q0, [x2, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         ins     v0.b[1], w1
>> >         ret
>> >
>> > -O3 code-gen patch:
>> > f_s8:
>> >         adrp    x2, .LC0
>> >         ldr     d31, [x2, #:lo12:.LC0]
>> >         adrp    x2, .LC1
>> >         ldr     d0, [x2, #:lo12:.LC1]
>> >         ins     v31.b[0], w0
>> >         ins     v0.b[0], w1
>> >         zip1    v0.16b, v31.16b, v0.16b
>> >         ret
>> >
>> > With trunk, it chooses the fallback sequence because both fallback
>> > and zip1 sequence had cost = 20, however with patch applied,
>> > we end up with zip1 sequence cost = 24 and fallback sequence
>> > cost = 28.
>> >
>> > This happens because of using insn_cost instead of
>> > set_rtx_cost for the following expression:
>> > (set (reg:QI 100)
>> >     (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
>> > set_rtx_cost returns 0 for above expression but insn_cost returns 4.
>>
>> Yeah, was wondering why you'd dropped the set_rtx_cost thing,
>> but decided not to question it since using insn_cost seemed
>> reasonable if it worked.
> The attached patch uses set_rtx_cost for single_set and insn_cost
> otherwise for non debug insns similar to seq_cost.

FWIW, I think with the aarch64_mov_operand fix, the old way of using
insn_cost for everything would have worked too.  But either way is fine.
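
To put numbers on the discrepancy discussed above, the insn in question
was

  (set (reg:QI 100)
      (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))

for which set_rtx_cost returns 0 while insn_cost returns 4; as reported
above, that is what moved the fallback sequence from cost 20 to 28 and
the zip1 sequence from 20 to 24.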

>> > This expression template appears twice in fallback sequence, which raises
>> > the cost to 28 from 20, while it appears once in each half of zip1 sequence,
>> > which raises the cost to 24 from 20, and so it now prefers zip1 sequence
>> > instead.
>> >
>> > I assumed this expression would be ignored because it looks like a scalar move,
>> > but that doesn't seem to be the case ?
>> > aarch64_classify_symbolic_expression returns
>> > SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
>> > and thus aarch64_mov_operand_p returns false.
>>
>> Ah, I guess it should be aarch64_mov_operand instead.  Confusing that
>> they're so different...
> Thanks, using aarch64_mov_operand worked.
>>
>> > Another issue with the zip1 sequence above is using same register x2
>> > for loading another half of constant in:
>> > adrp    x2, .LC1
>> >
>> > I guess this will create an output dependency from adrp x2, .LC0 ->
>> > adrp x2, .LC1
>> > and anti-dependency from  ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1
>> > essentially forcing almost the entire sequence (except ins
>> > instructions) to execute sequentially ?
>>
>> I'd expect modern cores to handle that via renaming.
> Ah right, thanks for the clarification.
>
> For some reason, it seems git diff is not formatting the patch correctly :/
> Or perhaps I am doing something wrong.

No, I think it's fine.  It's just tabs vs. spaces.  A leading
"+" followed by a tab is still only indented 8 columns, whereas
"+" followed by 6 spaces is indented 7 columns.  So indentation
can look a bit weird in the diff.

I was accounting for that though. :)
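
Concretely, with 8-column tab stops the two kinds of continuation line
render as:

  +<TAB>&& aarch64_mov_operand (src, GET_MODE (src));
  +      && aarch64_mov_operand (src, GET_MODE (src));

The first ('+' then a tab) starts the "&&" at column 8; the second
('+' then six spaces) starts it at column 7.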

> For eg, it shows:
> +  return is_a<scalar_mode>(GET_MODE (dest))
> +        && aarch64_mov_operand (src, GET_MODE (src));
> but after applying the patch, it's formatted correctly with "&&
> aarch64..."  right below is_a<scalar_mode>, both on column 10.

Yeah, the indentation itself was OK.  But there's an “emacs rule”
that says that parens should be used when splitting an expression
over multiple lines like this.  So:

-------
Formatting:

  return (is_a<scalar_mode>(GET_MODE (dest))
	  && aarch64_mov_operand_p (src, GET_MODE (src)));
-------

was about adding the parens.

> +  for (; seq; seq = NEXT_INSN (seq))
> +    if (NONDEBUG_INSN_P (seq)
> +	&& !scalar_move_insn_p (seq))
> +      {
> +	if (rtx set = single_set (seq))
> +	  cost += set_rtx_cost (set, speed);
> +	else
> +	  {
> +	    int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
> +	    if (this_cost > 0)
> +	      cost += this_cost;
> +	    else
> +	      cost++;
> +	  }
> +      }

I think it'd be better to do the single_set first, and pass the set
to scalar_move_insn_p.  I.e.

  for (; seq; seq = NEXT_INSN (seq))
    if (NONDEBUG_INSN_P (seq))
      {
	if (rtx set = single_set (seq))
	  {
	    if (!scalar_move_insn_p (set))
	      cost += set_rtx_cost (set, speed);
	  }
	else
	  {
	    int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
	    if (this_cost > 0)
	      cost += this_cost;
	    else
	      cost++;
	  }
      }

Then scalar_move_insn_p can just be the last three statements,
adjusted as follows:

  rtx src = SET_SRC (set);
  rtx dest = SET_DEST (set);
  return (is_a<scalar_mode> (GET_MODE (dest))
	  && aarch64_mov_operand (src, GET_MODE (dest)));

Note the space after ">", and that the mode passed to aarch64_mov_operand
is the destination mode (since the source mode might be VOIDmode).
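
For instance, with a constant source the set is something like

  (set (reg:QI 100) (const_int 1))

where GET_MODE of the source is VOIDmode (CONST_INTs carry no mode),
but GET_MODE of the destination is QImode, which is what the predicate
needs to see.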

OK with that change, thanks.

Thanks,
Richard
  
Prathamesh Kulkarni May 25, 2023, 6:47 a.m. UTC | #24
On Thu, 25 May 2023 at 01:28, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Wed, 24 May 2023 at 15:40, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > On Mon, 22 May 2023 at 14:18, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> > Hi Richard,
> >> >> > Thanks for the suggestions. Does the attached patch look OK ?
> >> >> > Bootstrap+test in progress on aarch64-linux-gnu.
> >> >>
> >> >> Like I say, please wait for the tests to complete before sending an RFA.
> >> >> It saves a review cycle if the tests don't in fact pass.
> >> > Right, sorry, will post patches after completion of testing henceforth.
> >> >>
> >> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> >> > index 29dbacfa917..e611a7cca25 100644
> >> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> >> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> >> >> >    return gen_rtx_PARALLEL (new_mode, vec);
> >> >> >  }
> >> >> >
> >> >> > +/* Return true if INSN is a scalar move.  */
> >> >> > +
> >> >> > +static bool
> >> >> > +scalar_move_insn_p (const rtx_insn *insn)
> >> >> > +{
> >> >> > +  rtx set = single_set (insn);
> >> >> > +  if (!set)
> >> >> > +    return false;
> >> >> > +  rtx src = SET_SRC (set);
> >> >> > +  rtx dest = SET_DEST (set);
> >> >> > +  return is_a<scalar_mode>(GET_MODE (dest))
> >> >> > +      && aarch64_mov_operand_p (src, GET_MODE (src));
> >> >>
> >> >> Formatting:
> >> >>
> >> >>   return (is_a<scalar_mode>(GET_MODE (dest))
> >> >>           && aarch64_mov_operand_p (src, GET_MODE (src)));
> >> >>
> >> >> OK with that change if the tests pass, thanks.
> >> > Unfortunately, the patch regressed vec-init-21.c:
> >> >
> >> > int8x16_t f_s8(int8_t x, int8_t y)
> >> > {
> >> >   return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
> >> >                        7, 8, 9, 10, 11, 12, 13, 14 };
> >> > }
> >> >
> >> > -O3 code-gen trunk:
> >> > f_s8:
> >> >         adrp    x2, .LC0
> >> >         ldr     q0, [x2, #:lo12:.LC0]
> >> >         ins     v0.b[0], w0
> >> >         ins     v0.b[1], w1
> >> >         ret
> >> >
> >> > -O3 code-gen patch:
> >> > f_s8:
> >> >         adrp    x2, .LC0
> >> >         ldr     d31, [x2, #:lo12:.LC0]
> >> >         adrp    x2, .LC1
> >> >         ldr     d0, [x2, #:lo12:.LC1]
> >> >         ins     v31.b[0], w0
> >> >         ins     v0.b[0], w1
> >> >         zip1    v0.16b, v31.16b, v0.16b
> >> >         ret
> >> >
> >> > With trunk, it chooses the fallback sequence because both fallback
> >> > and zip1 sequence had cost = 20, however with patch applied,
> >> > we end up with zip1 sequence cost = 24 and fallback sequence
> >> > cost = 28.
> >> >
> >> > This happens because of using insn_cost instead of
> >> > set_rtx_cost for the following expression:
> >> > (set (reg:QI 100)
> >> >     (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
> >> > set_rtx_cost returns 0 for above expression but insn_cost returns 4.
> >>
> >> Yeah, was wondering why you'd dropped the set_rtx_cost thing,
> >> but decided not to question it since using insn_cost seemed
> >> reasonable if it worked.
> > The attached patch uses set_rtx_cost for single_set and insn_cost
> > otherwise for non debug insns similar to seq_cost.
>
> FWIW, I think with the aarch64_mov_operand fix, the old way of using
> insn_cost for everything would have worked too.  But either way is fine.
>
> >> > This expression template appears twice in fallback sequence, which raises
> >> > the cost to 28 from 20, while it appears once in each half of zip1 sequence,
> >> > which raises the cost to 24 from 20, and so it now prefers zip1 sequence
> >> > instead.
> >> >
> >> > I assumed this expression would be ignored because it looks like a scalar move,
> >> > but that doesn't seem to be the case ?
> >> > aarch64_classify_symbolic_expression returns
> >> > SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
> >> > and thus aarch64_mov_operand_p returns false.
> >>
> >> Ah, I guess it should be aarch64_mov_operand instead.  Confusing that
> >> they're so different...
> > Thanks, using aarch64_mov_operand worked.
> >>
> >> > Another issue with the zip1 sequence above is using same register x2
> >> > for loading another half of constant in:
> >> > adrp    x2, .LC1
> >> >
> >> > I guess this will create an output dependency from adrp x2, .LC0 ->
> >> > adrp x2, .LC1
> >> > and anti-dependency from  ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1
> >> > essentially forcing almost the entire sequence (except ins
> >> > instructions) to execute sequentially ?
> >>
> >> I'd expect modern cores to handle that via renaming.
> > Ah right, thanks for the clarification.
> >
> > For some reason, it seems git diff is not formatting the patch correctly :/
> > Or perhaps I am doing something wrong.
>
> No, I think it's fine.  It's just tabs vs. spaces.  A leading
> "+" followed by a tab is still only indented 8 columns, whereas
> "+" followed by 6 spaces is indented 7 columns.  So indentation
> can look a bit weird in the diff.
>
> I was accounting for that though. :)
>
> > For eg, it shows:
> > +  return is_a<scalar_mode>(GET_MODE (dest))
> > +        && aarch64_mov_operand (src, GET_MODE (src));
> > but after applying the patch, it's formatted correctly with "&&
> > aarch64..."  right below is_a<scalar_mode>, both on column 10.
>
> Yeah, the indentation itself was OK.  But there's an “emacs rule”
> that says that parens should be used when splitting an expression
> over multiple lines like this.  So:
>
> -------
> Formatting:
>
>   return (is_a<scalar_mode>(GET_MODE (dest))
>           && aarch64_mov_operand_p (src, GET_MODE (src)));
> -------
>
> was about adding the parens.
>
> > +  for (; seq; seq = NEXT_INSN (seq))
> > +    if (NONDEBUG_INSN_P (seq)
> > +     && !scalar_move_insn_p (seq))
> > +      {
> > +     if (rtx set = single_set (seq))
> > +       cost += set_rtx_cost (set, speed);
> > +     else
> > +       {
> > +         int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
> > +         if (this_cost > 0)
> > +           cost += this_cost;
> > +         else
> > +           cost++;
> > +       }
> > +      }
>
> I think it'd be better to do the single_set first, and pass the set
> to scalar_move_insn_p.  I.e.
>
>   for (; seq; seq = NEXT_INSN (seq))
>     if (NONDEBUG_INSN_P (seq))
>       {
>         if (rtx set = single_set (seq))
>           {
>             if (!scalar_move_insn_p (set))
>               cost += set_rtx_cost (set, speed);
>           }
>         else
>           {
>             int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
>             if (this_cost > 0)
>               cost += this_cost;
>             else
>               cost++;
>           }
>       }
>
> Then scalar_move_insn_p can just be the last three statements,
> adjusted as follows:
>
>   rtx src = SET_SRC (set);
>   rtx dest = SET_DEST (set);
>   return (is_a<scalar_mode> (GET_MODE (dest))
>           && aarch64_mov_operand (src, GET_MODE (dest)));
>
> Note the space after ">", and that the mode passed to aarch64_mov_operand
> is the destination mode (since the source mode might be VOIDmode).
>
> OK with that change, thanks.
Hi Richard,
Thanks for the suggestions, and sorry for being a bit daft in my
previous replies.
Does the attached patch look OK ?
Bootstrapped+tested on aarch64-linux-gnu.

Thanks,
Prathamesh
>
> Thanks,
> Richard
[aarch64] Ignore cost of scalar moves for seq in vector initialization.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (scalar_move_insn_p): New function. 
	(seq_cost_ignoring_scalar_moves): Likewise.
	(aarch64_expand_vector_init): Call seq_cost_ignoring_scalar_moves.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d6fc94015fa..db7ca4c28c3 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22332,6 +22332,46 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
   return gen_rtx_PARALLEL (new_mode, vec);
 }
 
+/* Return true if INSN is a scalar move.  */
+
+static bool
+scalar_move_insn_p (rtx set)
+{
+  rtx src = SET_SRC (set);
+  rtx dest = SET_DEST (set);
+  return (is_a<scalar_mode> (GET_MODE (dest))
+	  && aarch64_mov_operand (src, GET_MODE (dest)));
+}
+
+/* Similar to seq_cost, but ignore cost for scalar moves.  This function
+   is called from aarch64_expand_vector_init.  */
+
+static unsigned
+seq_cost_ignoring_scalar_moves (const rtx_insn *seq, bool speed)
+{
+  unsigned cost = 0;
+
+  for (; seq; seq = NEXT_INSN (seq))
+    if (NONDEBUG_INSN_P (seq))
+      {
+	if (rtx set = single_set (seq))
+	  {
+	    if (!scalar_move_insn_p (set))
+	      cost += set_rtx_cost (set, speed);
+	  }
+	else
+	  {
+	    int this_cost = insn_cost (CONST_CAST_RTX_INSN (seq), speed);
+	    if (this_cost > 0)
+	      cost += this_cost;
+	    else
+	      cost++;
+	  }
+      }
+
+  return cost;
+}
+
 /* Expand a vector initialization sequence, such that TARGET is
    initialized to contain VALS.  */
 
@@ -22367,7 +22407,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
       halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);
       rtx_insn *rec_seq = get_insns ();
       end_sequence ();
-      costs[i] = seq_cost (rec_seq, !optimize_size);
+      costs[i] = seq_cost_ignoring_scalar_moves (rec_seq, !optimize_size);
       emit_insn (rec_seq);
     }
 
@@ -22384,7 +22424,8 @@ aarch64_expand_vector_init (rtx target, rtx vals)
   start_sequence ();
   aarch64_expand_vector_init_fallback (target, vals);
   rtx_insn *fallback_seq = get_insns ();
-  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
+  unsigned fallback_seq_cost
+    = seq_cost_ignoring_scalar_moves (fallback_seq, !optimize_size);
   end_sequence ();
 
   emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
  
Richard Sandiford May 25, 2023, 7:34 a.m. UTC | #25
LGTM, just a couple of comment tweaks:

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d6fc94015fa..db7ca4c28c3 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22332,6 +22332,46 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
>    return gen_rtx_PARALLEL (new_mode, vec);
>  }
>  
> +/* Return true if INSN is a scalar move.  */

s/INSN/SET/

> +
> +static bool
> +scalar_move_insn_p (rtx set)
> +{
> +  rtx src = SET_SRC (set);
> +  rtx dest = SET_DEST (set);
> +  return (is_a<scalar_mode> (GET_MODE (dest))
> +	  && aarch64_mov_operand (src, GET_MODE (dest)));
> +}
> +
> +/* Similar to seq_cost, but ignore cost for scalar moves.  This function
> +   is called from aarch64_expand_vector_init.  */

Probably best to drop the second sentence.

OK with those changes, thanks (no need to retest).

Richard
  
Prathamesh Kulkarni May 25, 2023, 9:56 a.m. UTC | #26
On Thu, 25 May 2023 at 13:04, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> LGTM, just a couple of comment tweaks:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index d6fc94015fa..db7ca4c28c3 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -22332,6 +22332,46 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> >    return gen_rtx_PARALLEL (new_mode, vec);
> >  }
> >
> > +/* Return true if INSN is a scalar move.  */
>
> s/INSN/SET/
>
> > +
> > +static bool
> > +scalar_move_insn_p (rtx set)
> > +{
> > +  rtx src = SET_SRC (set);
> > +  rtx dest = SET_DEST (set);
> > +  return (is_a<scalar_mode> (GET_MODE (dest))
> > +       && aarch64_mov_operand (src, GET_MODE (dest)));
> > +}
> > +
> > +/* Similar to seq_cost, but ignore cost for scalar moves.  This function
> > +   is called from aarch64_expand_vector_init.  */
>
> Probably best to drop the second sentence.
>
> OK with those changes, thanks (no need to retest).
Thanks, committed as ea9154dbc8fc86d4c617503ca5e6f02fed3a6a56.

Thanks,
Prathamesh
>
> Richard
  
Prathamesh Kulkarni May 26, 2023, 3:04 a.m. UTC | #27
On Thu, 25 May 2023 at 15:26, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Thu, 25 May 2023 at 13:04, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > LGTM, just a couple of comment tweaks:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > index d6fc94015fa..db7ca4c28c3 100644
> > > --- a/gcc/config/aarch64/aarch64.cc
> > > +++ b/gcc/config/aarch64/aarch64.cc
> > > @@ -22332,6 +22332,46 @@ aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p)
> > >    return gen_rtx_PARALLEL (new_mode, vec);
> > >  }
> > >
> > > +/* Return true if INSN is a scalar move.  */
> >
> > s/INSN/SET/
> >
> > > +
> > > +static bool
> > > +scalar_move_insn_p (rtx set)
> > > +{
> > > +  rtx src = SET_SRC (set);
> > > +  rtx dest = SET_DEST (set);
> > > +  return (is_a<scalar_mode> (GET_MODE (dest))
> > > +       && aarch64_mov_operand (src, GET_MODE (dest)));
> > > +}
> > > +
> > > +/* Similar to seq_cost, but ignore cost for scalar moves.  This function
> > > +   is called from aarch64_expand_vector_init.  */
> >
> > Probably best to drop the second sentence.
> >
> > OK with those changes, thanks (no need to retest).
> Thanks, committed as ea9154dbc8fc86d4c617503ca5e6f02fed3a6a56.
Hi Richard,
The s32 case for the single-constant patch no longer regresses after
the above commit.
Bootstrapped+tested on aarch64-linux-gnu, and verified that the new
tests pass for aarch64_be-linux-gnu.
Is it OK to commit ?

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Richard
[aarch64] Improve code-gen for vector initialization with single constant element.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
	if (n_var == n_elts && n_elts <= 16) to allow a single constant,
	and if maxv == 1, use constant element for duplicating into register.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/vec-init-single-const.c: New test.
	* gcc.target/aarch64/vec-init-single-const-be.c: Likewise.
	* gcc.target/aarch64/vec-init-single-const-2.c: Likewise.

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 5b046d32b37..30d6e3e8d83 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22192,7 +22192,7 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
      and matches[X][1] with the count of duplicate elements (if X is the
      earliest element which has duplicates).  */
 
-  if (n_var == n_elts && n_elts <= 16)
+  if (n_var >= n_elts - 1 && n_elts <= 16)
     {
       int matches[16][2] = {0};
       for (int i = 0; i < n_elts; i++)
@@ -22209,12 +22209,23 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
 	}
       int maxelement = 0;
       int maxv = 0;
+      rtx const_elem = NULL_RTX;
+      int const_elem_pos = 0;
+
       for (int i = 0; i < n_elts; i++)
-	if (matches[i][1] > maxv)
-	  {
-	    maxelement = i;
-	    maxv = matches[i][1];
-	  }
+	{
+	  if (matches[i][1] > maxv)
+	    {
+	      maxelement = i;
+	      maxv = matches[i][1];
+	    }
+	  if (CONST_INT_P (XVECEXP (vals, 0, i))
+	      || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
+	    {
+	      const_elem_pos = i;
+	      const_elem = XVECEXP (vals, 0, i);
+	    }
+	}
 
       /* Create a duplicate of the most common element, unless all elements
 	 are equally useless to us, in which case just immediately set the
@@ -22252,8 +22263,19 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
 	     vector register.  For big-endian we want that position to hold
 	     the last element of VALS.  */
 	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
-	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+
+	  /* If we have a single constant element, use that for duplicating
+	     instead.  */
+	  if (const_elem)
+	    {
+	      maxelement = const_elem_pos;
+	      aarch64_emit_move (target, gen_vec_duplicate (mode, const_elem));
+	    }
+	  else
+	    {
+	      rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	    }
 	}
       else
 	{
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c
new file mode 100644
index 00000000000..f4dcab429c1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#include <arm_neon.h>
+
+/* In case where there are no duplicate elements in vector initializer,
+   check that the constant is used for duplication.  */
+
+int8x16_t f_s8(int8_t a0, int8_t a1, int8_t a2, int8_t a3, int8_t a4,
+                int8_t a5, int8_t a6, int8_t a7, int8_t a8, int8_t a9,
+                int8_t a10, int8_t a11, int8_t a12, int8_t a13, int8_t a14)
+{
+  return (int8x16_t) { a0, a1, a2, a3, a4, a5, a6, a7,
+                       a8, a9, a10, a11, a12, a13, a14, 1 };
+}
+
+int16x8_t f_s16(int16_t a0, int16_t a1, int16_t a2, int16_t a3, int16_t a4,
+		int16_t a5, int16_t a6)
+{
+  return (int16x8_t) { a0, a1, a2, a3, a4, a5, a6, 1 };
+}
+
+int32x4_t f_s32(int32_t a0, int32_t a1, int32_t a2)
+{
+  return (int32x4_t) { a0, a1, a2, 1 };
+}
+
+/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.8b, 0x1} } } */ 
+/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.4h, 0x1} } } */ 
+/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.2s, 0x1} } } */ 
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c
new file mode 100644
index 00000000000..3140e007b5d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { be } } } } */
+
+#include <arm_neon.h>
+
+/*
+** f_s8:
+**	dup	v0.16b, w0
+**	movi	(v[0-9]+)\.8b, 0x1
+**	ins	v0.b\[0\], \1\.b\[0\]
+**	ret
+*/
+
+int8x16_t f_s8(int8_t x)
+{
+  return (int8x16_t) { x, x, x, x, x, x, x, x,
+                       x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s16:
+**	dup	v0.8h, w0
+**	movi	(v[0-9]+)\.4h, 0x1
+**	ins	v0.h\[0\], \1\.h\[0\]
+**	ret
+*/
+
+int16x8_t f_s16(int16_t x)
+{
+  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s32:
+**	dup	v0.4s, w0
+**	movi	(v[0-9]+)\.2s, 0x1
+**	ins	v0.s\[0\], \1\.s\[0\]
+**	ret
+*/
+
+int32x4_t f_s32(int32_t x)
+{
+  return (int32x4_t) { x, x, x, 1 };
+}
+
+/*
+** f_s64:
+**	adrp	x[0-9]+, .LC[0-9]+
+**	ldr	q0, \[x[0-9]+, #:lo12:.LC[0-9]+\]
+**	ins	v0\.d\[1\], x0
+**	ret
+*/
+
+int64x2_t f_s64(int64_t x)
+{
+  return (int64x2_t) { x, 1 };
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
new file mode 100644
index 00000000000..274b0b39ac4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
+
+#include <arm_neon.h>
+
+/*
+** f_s8:
+**	dup	v0.16b, w0
+**	movi	(v[0-9]+)\.8b, 0x1
+**	ins	v0.b\[15\], \1\.b\[0\]
+**	ret
+*/
+
+int8x16_t f_s8(int8_t x)
+{
+  return (int8x16_t) { x, x, x, x, x, x, x, x,
+                       x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s16:
+**	dup	v0.8h, w0
+**	movi	(v[0-9]+)\.4h, 0x1
+**	ins	v0.h\[7\], \1\.h\[0\]
+**	ret
+*/
+
+int16x8_t f_s16(int16_t x)
+{
+  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
+}
+
+/*
+** f_s32:
+**	dup	v0.4s, w0
+**	movi	(v[0-9]+)\.2s, 0x1
+**	ins	v0.s\[3\], \1\.s\[0\]
+**	ret
+*/
+
+int32x4_t f_s32(int32_t x)
+{
+  return (int32x4_t) { x, x, x, 1 };
+}
+
+/*
+** f_s64:
+**	adrp	x[0-9]+, .LC[0-9]+
+**	ldr	q0, \[x[0-9]+, #:lo12:.LC[0-9]+\]
+**	ins	v0\.d\[0\], x0
+**	ret
+*/
+
+int64x2_t f_s64(int64_t x)
+{
+  return (int64x2_t) { x, 1 };
+}
  
Richard Sandiford May 30, 2023, 6:53 p.m. UTC | #28
Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> Hi Richard,
> The s32 case for the single-constant patch no longer regresses after
> the above commit.
> Bootstrapped+tested on aarch64-linux-gnu, and verified that the new
> tests pass for aarch64_be-linux-gnu.
> Is it OK to commit ?
>
> Thanks,
> Prathamesh
>
> [aarch64] Improve code-gen for vector initialization with single constant element.
>
> gcc/ChangeLog:
> 	* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> 	if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> 	and if maxv == 1, use constant element for duplicating into register.
>
> gcc/testsuite/ChangeLog:
> 	* gcc.target/aarch64/vec-init-single-const.c: New test.
> 	* gcc.target/aarch64/vec-init-single-const-be.c: Likewise.
> 	* gcc.target/aarch64/vec-init-single-const-2.c: Likewise.

OK, thanks.

Richard

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5b046d32b37..30d6e3e8d83 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -22192,7 +22192,7 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>       and matches[X][1] with the count of duplicate elements (if X is the
>       earliest element which has duplicates).  */
>  
> -  if (n_var == n_elts && n_elts <= 16)
> +  if (n_var >= n_elts - 1 && n_elts <= 16)
>      {
>        int matches[16][2] = {0};
>        for (int i = 0; i < n_elts; i++)
> @@ -22209,12 +22209,23 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>  	}
>        int maxelement = 0;
>        int maxv = 0;
> +      rtx const_elem = NULL_RTX;
> +      int const_elem_pos = 0;
> +
>        for (int i = 0; i < n_elts; i++)
> -	if (matches[i][1] > maxv)
> -	  {
> -	    maxelement = i;
> -	    maxv = matches[i][1];
> -	  }
> +	{
> +	  if (matches[i][1] > maxv)
> +	    {
> +	      maxelement = i;
> +	      maxv = matches[i][1];
> +	    }
> +	  if (CONST_INT_P (XVECEXP (vals, 0, i))
> +	      || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> +	    {
> +	      const_elem_pos = i;
> +	      const_elem = XVECEXP (vals, 0, i);
> +	    }
> +	}
>  
>        /* Create a duplicate of the most common element, unless all elements
>  	 are equally useless to us, in which case just immediately set the
> @@ -22252,8 +22263,19 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>  	     vector register.  For big-endian we want that position to hold
>  	     the last element of VALS.  */
>  	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> -	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> -	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +
> +	  /* If we have a single constant element, use that for duplicating
> +	     instead.  */
> +	  if (const_elem)
> +	    {
> +	      maxelement = const_elem_pos;
> +	      aarch64_emit_move (target, gen_vec_duplicate (mode, const_elem));
> +	    }
> +	  else
> +	    {
> +	      rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +	      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +	    }
>  	}
>        else
>  	{
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c
> new file mode 100644
> index 00000000000..f4dcab429c1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#include <arm_neon.h>
> +
> +/* In case where there are no duplicate elements in vector initializer,
> +   check that the constant is used for duplication.  */
> +
> +int8x16_t f_s8(int8_t a0, int8_t a1, int8_t a2, int8_t a3, int8_t a4,
> +                int8_t a5, int8_t a6, int8_t a7, int8_t a8, int8_t a9,
> +                int8_t a10, int8_t a11, int8_t a12, int8_t a13, int8_t a14)
> +{
> +  return (int8x16_t) { a0, a1, a2, a3, a4, a5, a6, a7,
> +                       a8, a9, a10, a11, a12, a13, a14, 1 };
> +}
> +
> +int16x8_t f_s16(int16_t a0, int16_t a1, int16_t a2, int16_t a3, int16_t a4,
> +		int16_t a5, int16_t a6)
> +{
> +  return (int16x8_t) { a0, a1, a2, a3, a4, a5, a6, 1 };
> +}
> +
> +int32x4_t f_s32(int32_t a0, int32_t a1, int32_t a2)
> +{
> +  return (int32x4_t) { a0, a1, a2, 1 };
> +}
> +
> +/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.8b, 0x1} } } */ 
> +/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.4h, 0x1} } } */ 
> +/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.2s, 0x1} } } */ 
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c
> new file mode 100644
> index 00000000000..3140e007b5d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c
> @@ -0,0 +1,58 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target { be } } } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +**	dup	v0.16b, w0
> +**	movi	(v[0-9]+)\.8b, 0x1
> +**	ins	v0.b\[0\], \1\.b\[0\]
> +**	ret
> +*/
> +
> +int8x16_t f_s8(int8_t x)
> +{
> +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> +                       x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s16:
> +**	dup	v0.8h, w0
> +**	movi	(v[0-9]+)\.4h, 0x1
> +**	ins	v0.h\[0\], \1\.h\[0\]
> +**	ret
> +*/
> +
> +int16x8_t f_s16(int16_t x)
> +{
> +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s32:
> +**	dup	v0.4s, w0
> +**	movi	(v[0-9]+)\.2s, 0x1
> +**	ins	v0.s\[0\], \1\.s\[0\]
> +**	ret
> +*/
> +
> +int32x4_t f_s32(int32_t x)
> +{
> +  return (int32x4_t) { x, x, x, 1 };
> +}
> +
> +/*
> +** f_s64:
> +**	adrp	x[0-9]+, .LC[0-9]+
> +**	ldr	q0, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> +**	ins	v0\.d\[1\], x0
> +**	ret
> +*/
> +
> +int64x2_t f_s64(int64_t x)
> +{
> +  return (int64x2_t) { x, 1 };
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> new file mode 100644
> index 00000000000..274b0b39ac4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> @@ -0,0 +1,58 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +**	dup	v0.16b, w0
> +**	movi	(v[0-9]+)\.8b, 0x1
> +**	ins	v0.b\[15\], \1\.b\[0\]
> +**	ret
> +*/
> +
> +int8x16_t f_s8(int8_t x)
> +{
> +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> +                       x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s16:
> +**	dup	v0.8h, w0
> +**	movi	(v[0-9]+)\.4h, 0x1
> +**	ins	v0.h\[7\], \1\.h\[0\]
> +**	ret
> +*/
> +
> +int16x8_t f_s16(int16_t x)
> +{
> +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> +}
> +
> +/*
> +** f_s32:
> +**	dup	v0.4s, w0
> +**	movi	(v[0-9]+)\.2s, 0x1
> +**	ins	v0.s\[3\], \1\.s\[0\]
> +**	ret
> +*/
> +
> +int32x4_t f_s32(int32_t x)
> +{
> +  return (int32x4_t) { x, x, x, 1 };
> +}
> +
> +/*
> +** f_s64:
> +**	adrp	x[0-9]+, .LC[0-9]+
> +**	ldr	q0, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> +**	ins	v0\.d\[0\], x0
> +**	ret
> +*/
> +
> +int64x2_t f_s64(int64_t x)
> +{
> +  return (int64x2_t) { x, 1 };
> +}
  
Prathamesh Kulkarni June 12, 2023, 5:52 p.m. UTC | #29
On Wed, 31 May 2023 at 00:23, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > Hi Richard,
> > The s32 case for the single-constant patch no longer regresses after
> > the above commit.
> > Bootstrapped+tested on aarch64-linux-gnu, and verified that the new
> > tests pass for aarch64_be-linux-gnu.
> > Is it OK to commit ?
> >
> > Thanks,
> > Prathamesh
> >
> > [aarch64] Improve code-gen for vector initialization with single constant element.
> >
> > gcc/ChangeLog:
> >       * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition
> >       if (n_var == n_elts && n_elts <= 16) to allow a single constant,
> >       and if maxv == 1, use constant element for duplicating into register.
> >
> > gcc/testsuite/ChangeLog:
> >       * gcc.target/aarch64/vec-init-single-const.c: New test.
> >       * gcc.target/aarch64/vec-init-single-const-be.c: Likewise.
> >       * gcc.target/aarch64/vec-init-single-const-2.c: Likewise.
>
> OK, thanks.
Hi Richard,
Sorry for the delay, I was away on vacation. Committed the patch after
rebasing on ToT, and verifying bootstrap+test passes on
aarch64-linux-gnu:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=9eb757d11746c006c044ff45538b956be7f5859c

Thanks,
Prathamesh
>
> Richard
>
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 5b046d32b37..30d6e3e8d83 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -22192,7 +22192,7 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
> >       and matches[X][1] with the count of duplicate elements (if X is the
> >       earliest element which has duplicates).  */
> >
> > -  if (n_var == n_elts && n_elts <= 16)
> > +  if (n_var >= n_elts - 1 && n_elts <= 16)
> >      {
> >        int matches[16][2] = {0};
> >        for (int i = 0; i < n_elts; i++)
> > @@ -22209,12 +22209,23 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
> >       }
> >        int maxelement = 0;
> >        int maxv = 0;
> > +      rtx const_elem = NULL_RTX;
> > +      int const_elem_pos = 0;
> > +
> >        for (int i = 0; i < n_elts; i++)
> > -     if (matches[i][1] > maxv)
> > -       {
> > -         maxelement = i;
> > -         maxv = matches[i][1];
> > -       }
> > +     {
> > +       if (matches[i][1] > maxv)
> > +         {
> > +           maxelement = i;
> > +           maxv = matches[i][1];
> > +         }
> > +       if (CONST_INT_P (XVECEXP (vals, 0, i))
> > +           || CONST_DOUBLE_P (XVECEXP (vals, 0, i)))
> > +         {
> > +           const_elem_pos = i;
> > +           const_elem = XVECEXP (vals, 0, i);
> > +         }
> > +     }
> >
> >        /* Create a duplicate of the most common element, unless all elements
> >        are equally useless to us, in which case just immediately set the
> > @@ -22252,8 +22263,19 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
> >            vector register.  For big-endian we want that position to hold
> >            the last element of VALS.  */
> >         maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> > -       rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > -       aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> > +
> > +       /* If we have a single constant element, use that for duplicating
> > +          instead.  */
> > +       if (const_elem)
> > +         {
> > +           maxelement = const_elem_pos;
> > +           aarch64_emit_move (target, gen_vec_duplicate (mode, const_elem));
> > +         }
> > +       else
> > +         {
> > +           rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > +           aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> > +         }
> >       }
> >        else
> >       {
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c
> > new file mode 100644
> > index 00000000000..f4dcab429c1
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-2.c
> > @@ -0,0 +1,30 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2" } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/* In case where there are no duplicate elements in vector initializer,
> > +   check that the constant is used for duplication.  */
> > +
> > +int8x16_t f_s8(int8_t a0, int8_t a1, int8_t a2, int8_t a3, int8_t a4,
> > +                int8_t a5, int8_t a6, int8_t a7, int8_t a8, int8_t a9,
> > +                int8_t a10, int8_t a11, int8_t a12, int8_t a13, int8_t a14)
> > +{
> > +  return (int8x16_t) { a0, a1, a2, a3, a4, a5, a6, a7,
> > +                       a8, a9, a10, a11, a12, a13, a14, 1 };
> > +}
> > +
> > +int16x8_t f_s16(int16_t a0, int16_t a1, int16_t a2, int16_t a3, int16_t a4,
> > +             int16_t a5, int16_t a6)
> > +{
> > +  return (int16x8_t) { a0, a1, a2, a3, a4, a5, a6, 1 };
> > +}
> > +
> > +int32x4_t f_s32(int32_t a0, int32_t a1, int32_t a2)
> > +{
> > +  return (int32x4_t) { a0, a1, a2, 1 };
> > +}
> > +
> > +/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.8b, 0x1} } } */
> > +/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.4h, 0x1} } } */
> > +/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.2s, 0x1} } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c
> > new file mode 100644
> > index 00000000000..3140e007b5d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const-be.c
> > @@ -0,0 +1,58 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> > +/* { dg-final { check-function-bodies "**" "" "" { target { be } } } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/*
> > +** f_s8:
> > +**   dup     v0.16b, w0
> > +**   movi    (v[0-9]+)\.8b, 0x1
> > +**   ins     v0.b\[0\], \1\.b\[0\]
> > +**   ret
> > +*/
> > +
> > +int8x16_t f_s8(int8_t x)
> > +{
> > +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> > +                       x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s16:
> > +**   dup     v0.8h, w0
> > +**   movi    (v[0-9]+)\.4h, 0x1
> > +**   ins     v0.h\[0\], \1\.h\[0\]
> > +**   ret
> > +*/
> > +
> > +int16x8_t f_s16(int16_t x)
> > +{
> > +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s32:
> > +**   dup     v0.4s, w0
> > +**   movi    (v[0-9]+)\.2s, 0x1
> > +**   ins     v0.s\[0\], \1\.s\[0\]
> > +**   ret
> > +*/
> > +
> > +int32x4_t f_s32(int32_t x)
> > +{
> > +  return (int32x4_t) { x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s64:
> > +**   adrp    x[0-9]+, .LC[0-9]+
> > +**   ldr     q0, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> > +**   ins     v0\.d\[1\], x0
> > +**   ret
> > +*/
> > +
> > +int64x2_t f_s64(int64_t x)
> > +{
> > +  return (int64x2_t) { x, 1 };
> > +}
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> > new file mode 100644
> > index 00000000000..274b0b39ac4
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-single-const.c
> > @@ -0,0 +1,58 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> > +/* { dg-final { check-function-bodies "**" "" "" { target { le } } } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/*
> > +** f_s8:
> > +**   dup     v0.16b, w0
> > +**   movi    (v[0-9]+)\.8b, 0x1
> > +**   ins     v0.b\[15\], \1\.b\[0\]
> > +**   ret
> > +*/
> > +
> > +int8x16_t f_s8(int8_t x)
> > +{
> > +  return (int8x16_t) { x, x, x, x, x, x, x, x,
> > +                       x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s16:
> > +**   dup     v0.8h, w0
> > +**   movi    (v[0-9]+)\.4h, 0x1
> > +**   ins     v0.h\[7\], \1\.h\[0\]
> > +**   ret
> > +*/
> > +
> > +int16x8_t f_s16(int16_t x)
> > +{
> > +  return (int16x8_t) { x, x, x, x, x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s32:
> > +**   dup     v0.4s, w0
> > +**   movi    (v[0-9]+)\.2s, 0x1
> > +**   ins     v0.s\[3\], \1\.s\[0\]
> > +**   ret
> > +*/
> > +
> > +int32x4_t f_s32(int32_t x)
> > +{
> > +  return (int32x4_t) { x, x, x, 1 };
> > +}
> > +
> > +/*
> > +** f_s64:
> > +**   adrp    x[0-9]+, .LC[0-9]+
> > +**   ldr     q0, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> > +**   ins     v0\.d\[0\], x0
> > +**   ret
> > +*/
> > +
> > +int64x2_t f_s64(int64_t x)
> > +{
> > +  return (int64x2_t) { x, 1 };
> > +}
  

Patch

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index acc0cfe5f94..df33509c6e4 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -22079,30 +22079,36 @@  aarch64_expand_vector_init (rtx target, rtx vals)
      and matches[X][1] with the count of duplicate elements (if X is the
      earliest element which has duplicates).  */
 
-  if (n_var == n_elts && n_elts <= 16)
+  int matches[16][2] = {0};
+  for (int i = 0; i < n_elts; i++)
     {
-      int matches[16][2] = {0};
-      for (int i = 0; i < n_elts; i++)
+      for (int j = 0; j <= i; j++)
 	{
-	  for (int j = 0; j <= i; j++)
+	  if (rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, j)))
 	    {
-	      if (rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, j)))
-		{
-		  matches[i][0] = j;
-		  matches[j][1]++;
-		  break;
-		}
+	      matches[i][0] = j;
+	      matches[j][1]++;
+	      break;
 	    }
 	}
-      int maxelement = 0;
-      int maxv = 0;
-      for (int i = 0; i < n_elts; i++)
-	if (matches[i][1] > maxv)
-	  {
-	    maxelement = i;
-	    maxv = matches[i][1];
-	  }
+    }
 
+  int maxelement = 0;
+  int maxv = 0;
+  for (int i = 0; i < n_elts; i++)
+    if (matches[i][1] > maxv)
+      {
+	maxelement = i;
+	maxv = matches[i][1];
+      }
+
+  rtx max_elem = XVECEXP (vals, 0, maxelement); 
+  if (n_elts <= 16
+      && ((n_var == n_elts)
+	   || (maxv >= (int)(0.8 * n_elts)
+	       && !CONST_INT_P (max_elem)
+	       && !CONST_DOUBLE_P (max_elem))))
+    {
       /* Create a duplicate of the most common element, unless all elements
 	 are equally useless to us, in which case just immediately set the
 	 vector register using the first element.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-18.c b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
new file mode 100644
index 00000000000..e20b813559e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
@@ -0,0 +1,53 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <arm_neon.h>
+
+/*
+** f1_s16:
+**	...
+**	dup	v[0-9]+\.8h, w[0-9]+
+**	movi	v[0-9]+\.4h, 0x1
+**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
+**	...
+**	ret
+*/
+
+int16x8_t f1_s16(int16_t x)
+{
+  return (int16x8_t) {x, x, x, x, x, x, x, 1};
+}
+
+/*
+** f2_s16:
+**	...
+**	dup	v[0-9]+\.8h, w[0-9]+
+**	movi	v[0-9]+\.4h, 0x1
+**	movi	v[0-9]+\.4h, 0x2
+**	ins	v[0-9]+\.h\[6\], v[0-9]+\.h\[0\]
+**	ins	v[0-9]+\.h\[7\], v[0-9]+\.h\[0\]
+**	...
+**	ret
+*/
+
+int16x8_t f2_s16(int16_t x)
+{
+  return (int16x8_t) { x, x, x, x, x, x, 1, 2 };
+}
+
+/*
+** f3_s16:
+**	...
+**	movi	v[0-9]+\.8h, 0x1
+**	ins	v[0-9]+\.h\[0\], w0
+**	ins	v[0-9]+\.h\[1\], w0
+**	ins	v[0-9]+\.h\[2\], w0
+**	...
+**	ret
+*/
+
+int16x8_t f3_s16(int16_t x)
+{
+  return (int16x8_t) {x, x, x, 1, 1, 1, 1, 1};
+}