aarch64: suppress duplication into sub-64-bit AdvSIMD vectors [PR125538]

Message ID 20260603115451.1250666-1-artemiy.volkov@arm.com
State Superseded
Series aarch64: suppress duplication into sub-64-bit AdvSIMD vectors [PR125538]

Checks

Context                                                               Check    Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm                            success  Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm                            success  Test passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64                        success  Build passed
linaro-tcwg-bot/tcwg_simplebootstrap_build--master-aarch64-bootstrap  success  Build passed
linaro-tcwg-bot/tcwg_simplebootstrap_build--master-arm-bootstrap      success  Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64                        success  Test passed

Commit Message

Artemiy Volkov June 3, 2026, 11:54 a.m. UTC
As we don't have RTL support for duplicating values into partial AdvSIMD
vector modes, any expression like (vec_duplicate:V4QI (reg:QI)) is going
to be malformed.  The ICE reported in PR125538 occurred because in
r17-897-g4ddae2a94a032d we started generating such expressions when doing
a splat of the most common element at aarch64.cc:25876.

To address the problem, this patch introduces the
aarch64_gen_vec_duplicate () wrapper, which handles the case of a
sub-64-bit destination mode by duplicating the source value into 64 bits
and wrapping that into a SUBREG expression.  The alternative here would be
to add some more vec_duplicate RTL patterns, but that would lead to some
code churn in aarch64-simd.md and break a long-standing invariant for no
obvious benefit.
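
To make the mode arithmetic concrete, here is a small standalone C sketch
(not GCC code; the helper name dup_lanes_for is made up for illustration).
It mirrors the 64 / GET_MODE_BITSIZE (GET_MODE_INNER (mode)) expression used
by the wrapper to pick the 64-bit container mode, assuming only 8-bit and
16-bit elements need handling:

```c
#include <assert.h>

/* Hypothetical helper, for illustration only: number of lanes in the
   64-bit vector used as the duplication container for elements that are
   ELEM_BITS wide.  Only 8-bit and 16-bit elements can form a sub-64-bit
   vector of more than one lane here, hence the assertion.  */
static unsigned
dup_lanes_for (unsigned elem_bits)
{
  assert (elem_bits == 8 || elem_bits == 16);
  return 64 / elem_bits;
}
```

Under that model a V4QI destination is duplicated as V8QI (8 lanes) and a
V2HI destination as V4HI (4 lanes), and the lowpart SUBREG then selects the
narrower view.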

I've added the reduced testcase from the PR, as well as appended some
similar tests to vec_init_5.c and vec-init-23.c.

Regtested and bootstrapped on aarch64-linux-gnu.

	PR target/125538

gcc/ChangeLog:

	* config/aarch64/aarch64-protos.h (aarch64_gen_vec_duplicate):
	Declare new function.
	* config/aarch64/aarch64.cc (aarch64_gen_vec_duplicate): Define
	it.
	(aarch64_expand_vector_init_fallback): Use
	aarch64_gen_vec_duplicate () instead of gen_vec_duplicate ().

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/vec_init_5.c: Add new 8/16-bit testcases.
	* gcc.target/aarch64/vec-init-23.c: Likewise.
	* gcc.target/aarch64/pr125538.c: New test.
---
 gcc/config/aarch64/aarch64-protos.h           |  1 +
 gcc/config/aarch64/aarch64.cc                 | 36 ++++++++--
 gcc/testsuite/gcc.target/aarch64/pr125538.c   | 20 ++++++
 .../gcc.target/aarch64/sve/vec_init_5.c       | 69 ++++++++++++++++++
 .../gcc.target/aarch64/vec-init-23.c          | 71 ++++++++++++++++++-
 5 files changed, 191 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr125538.c
  

Comments

Andrew Pinski June 3, 2026, 7:58 p.m. UTC | #1
On Wed, Jun 3, 2026 at 4:56 AM Artemiy Volkov <artemiy.volkov@arm.com> wrote:
>
> As we don't have RTL support for duplicating values into partial AdvSIMD
> vector modes, any expression like (vec_duplicate:V4QI (reg:QI)) is going
> to be malformed.  The ICE reported in PR125538 occurred because in
> r17-897-g4ddae2a94a032d we started generating such expressions when doing
> a splat of the most common element at aarch64.cc:25876.
>
> To address the problem, this patch introduces the
> aarch64_gen_vec_duplicate () wrapper, which handles the case of a
> sub-64-bit destination mode by duplicating the source value into 64 bits
> and wrapping that into a SUBREG expression.  The alternative here would be
> to add some more vec_duplicate RTL patterns, but that would lead to some
> code churn in aarch64-simd.md and break a long-standing invariant for no
> obvious benefit.

I am not so sure there is no obvious benefit.
Take:
```
#define vect4 __attribute__((vector_size(4)))

void f(vect4 signed char *a, signed char b)
{
    *a = (vect4 signed char){b,b,b,b};
}
void f1(signed char *a, signed char b)
{
    a[0] = b;
    a[1] = b;
    a[2] = b;
    a[3] = b;
}
```
These could benefit from having a vec_dup of V4QI.
And it would be a good step towards having V4QI/V2HI be more than just a
container for initializations.
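
The same shape exists for 16-bit elements. Below is a sketch of a V2HI
analogue of the example above, again using the GNU vector extension. The
names vect4s, g and g1 are made up for illustration, and no claim is made
here about which instructions GCC currently emits for them:

```c
/* 4-byte vector holding two 16-bit elements (the V2HI case).  */
typedef short vect4s __attribute__((vector_size(4)));

/* Splat written as a vector constructor.  */
void g(vect4s *a, short b)
{
    *a = (vect4s){b, b};
}

/* The same splat written as two scalar stores.  */
void g1(short *a, short b)
{
    a[0] = b;
    a[1] = b;
}
```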

Thanks,
Andrew



>
> I've added the reduced testcase from the PR, as well as appended some
> similar tests to vec_init_5.c and vec-init-23.c.
>
> Regtested and bootstrapped on aarch64-linux-gnu.
>
>         PR target/125538
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64-protos.h (aarch64_gen_vec_duplicate):
>         Declare new function.
>         * config/aarch64/aarch64.cc (aarch64_gen_vec_duplicate): Define
>         it.
>         (aarch64_expand_vector_init_fallback): Use
>         aarch64_gen_vec_duplicate () instead of gen_vec_duplicate ().
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/sve/vec_init_5.c: Add new 8/16-bit testcases.
>         * gcc.target/aarch64/vec-init-23.c: Likewise.
>         * gcc.target/aarch64/pr125538.c: New test.
> ---
>  gcc/config/aarch64/aarch64-protos.h           |  1 +
>  gcc/config/aarch64/aarch64.cc                 | 36 ++++++++--
>  gcc/testsuite/gcc.target/aarch64/pr125538.c   | 20 ++++++
>  .../gcc.target/aarch64/sve/vec_init_5.c       | 69 ++++++++++++++++++
>  .../gcc.target/aarch64/vec-init-23.c          | 71 ++++++++++++++++++-
>  5 files changed, 191 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr125538.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 513b556398f..3e679f6d36a 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1014,6 +1014,7 @@ rtx aarch64_mask_from_zextract_ops (rtx, rtx);
>  rtx aarch64_return_addr_rtx (void);
>  rtx aarch64_return_addr (int, rtx);
>  rtx aarch64_simd_gen_const_vector_dup (machine_mode, HOST_WIDE_INT);
> +rtx aarch64_gen_vec_duplicate (machine_mode, rtx);
>  rtx aarch64_gen_shareable_zero (machine_mode);
>  bool aarch64_split_simd_shift_p (rtx_insn *);
>  bool aarch64_simd_mem_operand_p (rtx);
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5a859e12b1a..4c7173c162a 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25039,6 +25039,30 @@ aarch64_gen_shareable_zero (machine_mode mode)
>    return lowpart_subreg (mode, reg, GET_MODE (reg));
>  }
>
> +/* Duplicate value X into a vector of type MODE.  In case MODE is a
> +   sub-64-bit mode and the result isn't a const_vector, duplicate into a
> +   full register and take a SUBREG of that.  */
> +
> +rtx
> +aarch64_gen_vec_duplicate (machine_mode mode, rtx x)
> +{
> +  gcc_assert (VECTOR_MODE_P (mode));
> +
> +  if (!aarch64_advsimd_sub_dword_mode_p (mode))
> +    return gen_vec_duplicate (mode, x);
> +
> +  if (valid_for_const_vector_p (mode, x))
> +    return gen_const_vec_duplicate (mode, x);
> +
> +  machine_mode dup_mode = mode_for_vector (GET_MODE_INNER (mode),
> +                               64 / GET_MODE_BITSIZE (GET_MODE_INNER (mode)))
> +                         .require ();
> +
> +  rtx reg = gen_reg_rtx (dup_mode);
> +  aarch64_emit_move (reg, gen_rtx_VEC_DUPLICATE (dup_mode, x));
> +  return lowpart_subreg (mode, reg, dup_mode);
> +}
> +
>  /* INSN is some form of extension or shift that can be split into a
>     permutation involving a shared zero.  Return true if we should
>     perform such a split.
> @@ -25699,7 +25723,7 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>                                2 * GET_MODE_SIZE (narrow_mode)));
>        if (rtx_equal_p (v0, v1))
>         aarch64_emit_move (target,
> -                         gen_vec_duplicate (mode,
> +                         aarch64_gen_vec_duplicate (mode,
>                                              force_reg (narrow_mode, v0)));
>        else
>         emit_insn (gen_aarch64_vec_concat (narrow_mode, target,
> @@ -25733,7 +25757,7 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>    if (all_same)
>      {
>        rtx x = force_reg (inner_mode, v0);
> -      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> +      aarch64_emit_move (target, aarch64_gen_vec_duplicate (mode, x));
>        return;
>      }
>
> @@ -25769,7 +25793,8 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>               RTVEC_ELT (new_vals, i) = XVECEXP (vals, 0, i);
>             aarch64_expand_vector_init (new_target,
>                                         gen_rtx_PARALLEL (subv_mode, new_vals));
> -           aarch64_emit_move (target, gen_vec_duplicate (mode, new_target));
> +           aarch64_emit_move (target,
> +                              aarch64_gen_vec_duplicate (mode, new_target));
>             return;
>           }
>      }
> @@ -25862,7 +25887,8 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>           if (const_elem)
>             {
>               maxelement = const_elem_pos;
> -             aarch64_emit_move (target, gen_vec_duplicate (mode, const_elem));
> +             aarch64_emit_move (target,
> +                                aarch64_gen_vec_duplicate (mode, const_elem));
>             }
>           else
>             {
> @@ -25873,7 +25899,7 @@ aarch64_expand_vector_init_fallback (rtx target, rtx vals)
>        else
>         {
>           rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> -         aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> +         aarch64_emit_move (target, aarch64_gen_vec_duplicate (mode, x));
>         }
>
>        /* Insert the rest.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr125538.c b/gcc/testsuite/gcc.target/aarch64/pr125538.c
> new file mode 100644
> index 00000000000..f0cdcd58dfb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr125538.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=armv9.5-a" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#define vect16 __attribute__((vector_size(16)))
> +
> +vect16 char fff(char _292, char _145, char _231)
> +{
> +    return (vect16 char) {_292, _145, _145, _231, _292, _145, _145, _231, _292, _145, _145, _231, _292, _145, _145, _231};
> +}
> +
> +/*
> +** fff:
> +**     bfi     w0, w1, 8, 8
> +**     bfi     w1, w2, 8, 8
> +**     dup     v31\.4h, w0
> +**     dup     v0\.4h, w1
> +**     zip1    v0\.16b, v31\.16b, v0\.16b
> +**     ret
> +*/
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> index 99e04aac265..112a0eafc7a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> @@ -152,6 +152,27 @@
>  **     ret
>  */
>
> +/*
> +** test_int8_14:
> +**     bfi     w0, w1, 8, 8
> +**     bfi     w1, w2, 8, 8
> +**     dup     v31\.4h, w0
> +**     dup     v30\.4h, w1
> +**     zip1    v31\.16b, v31\.16b, v30\.16b
> +**     dup     z0\.q, z31\.q\[0\]
> +**     ret
> +*/
> +
> +/*
> +** test_int8_15:
> +**     bfi     w0, w2, 8, 8
> +**     dup     v30\.8b, w1
> +**     dup     v31\.4h, w0
> +**     zip1    v31\.16b, v31\.16b, v30\.16b
> +**     dup     z0\.q, z31\.q\[0\]
> +**     ret
> +*/
> +
>  /*
>  ** test_float16_1:
>  **     fcvt    h0, s0
> @@ -236,6 +257,33 @@
>  **     ret
>  */
>
> +/*
> +** test_float16_9:
> +**     fcvt    h1, s1
> +**     fcvt    h2, s2
> +**     fcvt    h0, s0
> +**     uzp1    v0\.4h, v0\.4h, v1\.4h
> +**     uzp1    v1\.4h, v1\.4h, v2\.4h
> +**     dup     v0\.2s, v0\.s\[0\]
> +**     dup     v1\.2s, v1\.s\[0\]
> +**     zip1    v0\.8h, v0\.8h, v1\.8h
> +**     dup     z0\.q, z0\.q\[0\]
> +**     ret
> +*/
> +
> +/*
> +** test_float16_10:
> +**     fcvt    h2, s2
> +**     fcvt    h0, s0
> +**     fcvt    h1, s1
> +**     uzp1    v0\.4h, v0\.4h, v2\.4h
> +**     dup     v1\.4h, v1\.h\[0\]
> +**     dup     v0\.2s, v0\.s\[0\]
> +**     zip1    v0\.8h, v0\.8h, v1\.8h
> +**     dup     z0\.q, z0\.q\[0\]
> +**     ret
> +*/
> +
>  /*
>  ** test_int16_1:
>  **     mov     z0\.h, w0
> @@ -310,6 +358,27 @@
>  **     ret
>  */
>
> +/*
> +** test_int16_9:
> +**     bfi     w0, w1, 16, 16
> +**     bfi     w1, w2, 16, 16
> +**     dup     v31\.2s, w0
> +**     dup     v30\.2s, w1
> +**     zip1    v31\.8h, v31\.8h, v30\.8h
> +**     dup     z0\.q, z31\.q\[0\]
> +**     ret
> +*/
> +
> +/*
> +** test_int16_10:
> +**     bfi     w0, w2, 16, 16
> +**     dup     v30\.4h, w1
> +**     dup     v31\.2s, w0
> +**     zip1    v31\.8h, v31\.8h, v30\.8h
> +**     dup     z0\.q, z31\.q\[0\]
> +**     ret
> +*/
> +
>  /*
>  ** test_float32_1:
>  **     mov     z0\.s, s0
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> index 8c154f3680d..4721b068366 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> @@ -41,7 +41,11 @@
>      TESTCASE (TYPE, ETYPE, T, 8, 12, x16, x0, x1, 0, 1, x2, x3, 2, 3,\
>                                x0, x1, 0, 1, x2, x3, 2, 3)\
>      TESTCASE (TYPE, ETYPE, T, 8, 13, x16, 0, 1, x0, x1, 2, 3, x2, x3,\
> -                              0, 1, x0, x1, 2, 3, x2, x3)
> +                              0, 1, x0, x1, 2, 3, x2, x3) \
> +    TESTCASE (TYPE, ETYPE, T, 8, 14, x16, x0, x1, x1, x2, x0, x1, x1, x2,\
> +                              x0, x1, x1, x2, x0, x1, x1, x2) \
> +    TESTCASE (TYPE, ETYPE, T, 8, 15, x16, x0, x1, x2, x1, x0, x1, x2, x1,\
> +                              x0, x1, x2, x1, x0, x1, x2, x1)
>
>  #define TEST_16(TYPE, ETYPE, T)\
>      TESTCASE (TYPE, ETYPE, T, 16, 1, x8, x0, x0, x0, x0, x0, x0, x0, x0)\
> @@ -52,6 +56,8 @@
>      TESTCASE (TYPE, ETYPE, T, 16, 6, x8, x0, x1, 0, 1, x0, x1, 0, 1)\
>      TESTCASE (TYPE, ETYPE, T, 16, 7, x8, 0, 1, x0, x1, 0, 1, x0, x1)\
>      TESTCASE (TYPE, ETYPE, T, 16, 8, x8, 0, x0, 1, x1, 0, x0, 1, x1)\
> +    TESTCASE (TYPE, ETYPE, T, 16, 9, x8, x0, x1, x1, x2, x0, x1, x1, x2)\
> +    TESTCASE (TYPE, ETYPE, T, 16, 10, x8, x0, x1, x2, x1, x0, x1, x2, x1)
>
>  #define TEST_32(TYPE, ETYPE, T)\
>      TESTCASE (TYPE, ETYPE, T, 32, 1, x4, x0, x0, x0, x0)\
> @@ -205,6 +211,25 @@ TEST_64(int, int64_t, s)
>  **     ret
>  */
>
> +/*
> +** test_int8_14:
> +**     bfi     w0, w1, 8, 8
> +**     bfi     w1, w2, 8, 8
> +**     dup     v31\.4h, w0
> +**     dup     v0\.4h, w1
> +**     zip1    v0\.16b, v31\.16b, v0\.16b
> +**     ret
> +*/
> +
> +/*
> +** test_int8_15:
> +**     bfi     w0, w2, 8, 8
> +**     dup     v0\.8b, w1
> +**     dup     v31\.4h, w0
> +**     zip1    v0\.16b, v31\.16b, v0\.16b
> +**     ret
> +*/
> +
>  /*
>  ** test_float16_1:
>  **     fcvt    h0, s0
> @@ -286,6 +311,31 @@ TEST_64(int, int64_t, s)
>  **     ret
>  */
>
> +/*
> +** test_float16_9:
> +**     fcvt    h1, s1
> +**     fcvt    h2, s2
> +**     fcvt    h0, s0
> +**     uzp1    v0\.4h, v0\.4h, v1\.4h
> +**     uzp1    v1\.4h, v1\.4h, v2\.4h
> +**     dup     v0\.2s, v0\.s\[0\]
> +**     dup     v1\.2s, v1\.s\[0\]
> +**     zip1    v0\.8h, v0\.8h, v1\.8h
> +**     ret
> +*/
> +
> +/*
> +** test_float16_10:
> +**     fcvt    h2, s2
> +**     fcvt    h0, s0
> +**     fcvt    h1, s1
> +**     uzp1    v0\.4h, v0\.4h, v2\.4h
> +**     dup     v1\.4h, v1\.h\[0\]
> +**     dup     v0\.2s, v0\.s\[0\]
> +**     zip1    v0\.8h, v0\.8h, v1\.8h
> +**     ret
> +*/
> +
>  /*
>  ** test_int16_1:
>  **     dup     v0\.8h, w0
> @@ -356,6 +406,25 @@ TEST_64(int, int64_t, s)
>  **     ret
>  */
>
> +/*
> +** test_int16_9:
> +**     bfi     w0, w1, 16, 16
> +**     bfi     w1, w2, 16, 16
> +**     dup     v31\.2s, w0
> +**     dup     v0\.2s, w1
> +**     zip1    v0\.8h, v31\.8h, v0\.8h
> +**     ret
> +*/
> +
> +/*
> +** test_int16_10:
> +**     bfi     w0, w2, 16, 16
> +**     dup     v0\.4h, w1
> +**     dup     v31\.2s, w0
> +**     zip1    v0\.8h, v31\.8h, v0\.8h
> +**     ret
> +*/
> +
>  /*
>  ** test_float32_1:
>  **     dup     v0\.4s, v0\.s\[0\]
> --
> 2.43.0
>
  
Tamar Christina June 3, 2026, 8:23 p.m. UTC | #2
> -----Original Message-----
> From: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
> Sent: 03 June 2026 20:58
> To: Artemiy Volkov <Artemiy.Volkov@arm.com>
> Cc: gcc-patches@gcc.gnu.org; Tamar Christina <Tamar.Christina@arm.com>;
> Wilco Dijkstra <Wilco.Dijkstra@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; ktkachov@nvidia.com; Alice Carlotti
> <Alice.Carlotti@arm.com>; Alex Coplan <Alex.Coplan@arm.com>
> Subject: Re: [PATCH] aarch64: suppress duplication into sub-64-bit AdvSIMD
> vectors [PR125538]
> 
> On Wed, Jun 3, 2026 at 4:56 AM Artemiy Volkov <artemiy.volkov@arm.com>
> wrote:
> >
> > As we don't have RTL support for duplicating values into partial AdvSIMD
> > vector modes, any expression like (vec_duplicate:V4QI (reg:QI)) is going
> > to be malformed.  The ICE reported in PR125538 occurred because in
> > r17-897-g4ddae2a94a032d we started generating such expressions when
> doing
> > a splat of the most common element at aarch64.cc:25876.
> >
> > To address the problem, this patch introduces the
> > aarch64_gen_vec_duplicate () wrapper, which handles the case of a
> > sub-64-bit destination mode by duplicating the source value into 64 bits
> > and wrapping that into a SUBREG expression.  The alternative here would be
> > to add some more vec_duplicate RTL patterns, but that would lead to some
> > code churn in aarch64-simd.md and break a long-standing invariant for no
> > obvious benefit.
> 
> I am not so sure there is no obvious benefit.
> Take:
> ```
> #define vect4 __attribute__((vector_size(4)))
> 
> void f(vect4 signed char *a, signed char b)
> {
>     *a = (vect4 signed char){b,b,b,b};
> }
> void f1(signed char *a, signed char b)
> {
>     a[0] = b;
>     a[1] = b;
>     a[2] = b;
>     a[3] = b;
> }
> ```
> These could benefit from having a vec_dup of V4QI.
> And it would be a good step towards having V4QI/V2HI be more than just a
> container for initializations.

I was busy writing an elaborate response to this but Pinski beat me to it.

I would also indeed rather have you just add the RTL patterns for vec_duplicate
of the partial modes. I don't think they really require that much churn.

Whether you take a subreg or just use a 128-bit/64-bit vector dup, the semantics
remain the same: as long as we access the registers in their intended modes,
the result is sound.
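
As a plain-C illustration of that point (a model of the register contents
only, not GCC internals; splat_bytes and low4_of_dup8 are hypothetical
names): splatting a byte across 8 lanes and then reading only the first 4
lanes yields the same 4 bytes as a direct 4-lane splat.

```c
#include <stdint.h>
#include <string.h>

/* Fill N lanes of OUT with the byte X (models a vec_duplicate).  */
static void
splat_bytes (uint8_t *out, size_t n, uint8_t x)
{
  memset (out, x, n);
}

/* Model of "duplicate into 64 bits, then view the result in the narrow
   mode": splat X into 8 lanes and copy the first 4 lanes to OUT4.  */
static void
low4_of_dup8 (uint8_t out4[4], uint8_t x)
{
  uint8_t wide[8];
  splat_bytes (wide, sizeof wide, x);
  memcpy (out4, wide, 4);
}
```

Because every lane of a splat holds the same value, it does not matter which
4 of the 8 lanes are viewed, so the model does not depend on endianness.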

So I think you should really just define the vec_duplicates; you only need
two new RTL patterns.

Additionally, I'd not want to lose the property that gen_vec_duplicate just works.
Requiring a target-specific function is always annoying. It works better if we can
just support the generic abstractions.

Thanks,
Tamar

> 
> Thanks,
> Andrew
> 
> 
> 
> >
> > I've added the reduced testcase from the PR, as well as appended some
> > similar tests to vec_init_5.c and vec-init-23.c.
> >
> > Regtested and bootstrapped on aarch64-linux-gnu.
> >
> >         PR target/125538
> >
> > gcc/ChangeLog:
> >
> >         * config/aarch64/aarch64-protos.h (aarch64_gen_vec_duplicate):
> >         Declare new function.
> >         * config/aarch64/aarch64.cc (aarch64_gen_vec_duplicate): Define
> >         it.
> >         (aarch64_expand_vector_init_fallback): Use
> >         aarch64_gen_vec_duplicate () instead of gen_vec_duplicate ().
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/aarch64/sve/vec_init_5.c: Add new 8/16-bit testcases.
> >         * gcc.target/aarch64/vec-init-23.c: Likewise.
> >         * gcc.target/aarch64/pr125538.c: New test.
> > ---
> >  gcc/config/aarch64/aarch64-protos.h           |  1 +
> >  gcc/config/aarch64/aarch64.cc                 | 36 ++++++++--
> >  gcc/testsuite/gcc.target/aarch64/pr125538.c   | 20 ++++++
> >  .../gcc.target/aarch64/sve/vec_init_5.c       | 69 ++++++++++++++++++
> >  .../gcc.target/aarch64/vec-init-23.c          | 71 ++++++++++++++++++-
> >  5 files changed, 191 insertions(+), 6 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr125538.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> > index 513b556398f..3e679f6d36a 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -1014,6 +1014,7 @@ rtx aarch64_mask_from_zextract_ops (rtx, rtx);
> >  rtx aarch64_return_addr_rtx (void);
> >  rtx aarch64_return_addr (int, rtx);
> >  rtx aarch64_simd_gen_const_vector_dup (machine_mode,
> HOST_WIDE_INT);
> > +rtx aarch64_gen_vec_duplicate (machine_mode, rtx);
> >  rtx aarch64_gen_shareable_zero (machine_mode);
> >  bool aarch64_split_simd_shift_p (rtx_insn *);
> >  bool aarch64_simd_mem_operand_p (rtx);
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 5a859e12b1a..4c7173c162a 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -25039,6 +25039,30 @@ aarch64_gen_shareable_zero
> (machine_mode mode)
> >    return lowpart_subreg (mode, reg, GET_MODE (reg));
> >  }
> >
> > +/* Duplicate value X into a vector of type MODE.  In case MODE is a
> > +   sub-64-bit mode and the result isn't a const_vector, duplicate into a
> > +   full register and take a SUBREG of that.  */
> > +
> > +rtx
> > +aarch64_gen_vec_duplicate (machine_mode mode, rtx x)
> > +{
> > +  gcc_assert (VECTOR_MODE_P (mode));
> > +
> > +  if (!aarch64_advsimd_sub_dword_mode_p (mode))
> > +    return gen_vec_duplicate (mode, x);
> > +
> > +  if (valid_for_const_vector_p (mode, x))
> > +    return gen_const_vec_duplicate (mode, x);
> > +
> > +  machine_mode dup_mode = mode_for_vector (GET_MODE_INNER
> (mode),
> > +                               64 / GET_MODE_BITSIZE (GET_MODE_INNER (mode)))
> > +                         .require ();
> > +
> > +  rtx reg = gen_reg_rtx (dup_mode);
> > +  aarch64_emit_move (reg, gen_rtx_VEC_DUPLICATE (dup_mode, x));
> > +  return lowpart_subreg (mode, reg, dup_mode);
> > +}
> > +
> >  /* INSN is some form of extension or shift that can be split into a
> >     permutation involving a shared zero.  Return true if we should
> >     perform such a split.
> > @@ -25699,7 +25723,7 @@ aarch64_expand_vector_init_fallback (rtx
> target, rtx vals)
> >                                2 * GET_MODE_SIZE (narrow_mode)));
> >        if (rtx_equal_p (v0, v1))
> >         aarch64_emit_move (target,
> > -                         gen_vec_duplicate (mode,
> > +                         aarch64_gen_vec_duplicate (mode,
> >                                              force_reg (narrow_mode, v0)));
> >        else
> >         emit_insn (gen_aarch64_vec_concat (narrow_mode, target,
> > @@ -25733,7 +25757,7 @@ aarch64_expand_vector_init_fallback (rtx
> target, rtx vals)
> >    if (all_same)
> >      {
> >        rtx x = force_reg (inner_mode, v0);
> > -      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> > +      aarch64_emit_move (target, aarch64_gen_vec_duplicate (mode, x));
> >        return;
> >      }
> >
> > @@ -25769,7 +25793,8 @@ aarch64_expand_vector_init_fallback (rtx
> target, rtx vals)
> >               RTVEC_ELT (new_vals, i) = XVECEXP (vals, 0, i);
> >             aarch64_expand_vector_init (new_target,
> >                                         gen_rtx_PARALLEL (subv_mode, new_vals));
> > -           aarch64_emit_move (target, gen_vec_duplicate (mode, new_target));
> > +           aarch64_emit_move (target,
> > +                              aarch64_gen_vec_duplicate (mode, new_target));
> >             return;
> >           }
> >      }
> > @@ -25862,7 +25887,8 @@ aarch64_expand_vector_init_fallback (rtx
> target, rtx vals)
> >           if (const_elem)
> >             {
> >               maxelement = const_elem_pos;
> > -             aarch64_emit_move (target, gen_vec_duplicate (mode,
> const_elem));
> > +             aarch64_emit_move (target,
> > +                                aarch64_gen_vec_duplicate (mode, const_elem));
> >             }
> >           else
> >             {
> > @@ -25873,7 +25899,7 @@ aarch64_expand_vector_init_fallback (rtx
> target, rtx vals)
> >        else
> >         {
> >           rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > -         aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> > +         aarch64_emit_move (target, aarch64_gen_vec_duplicate (mode, x));
> >         }
> >
> >        /* Insert the rest.  */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr125538.c
> b/gcc/testsuite/gcc.target/aarch64/pr125538.c
> > new file mode 100644
> > index 00000000000..f0cdcd58dfb
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr125538.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=armv9.5-a" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#define vect16 __attribute__((vector_size(16)))
> > +
> > +vect16 char fff(char _292, char _145, char _231)
> > +{
> > +    return (vect16 char) {_292, _145, _145, _231, _292, _145, _145, _231,
> _292, _145, _145, _231, _292, _145, _145, _231};
> > +}
> > +
> > +/*
> > +** fff:
> > +**     bfi     w0, w1, 8, 8
> > +**     bfi     w1, w2, 8, 8
> > +**     dup     v31\.4h, w0
> > +**     dup     v0\.4h, w1
> > +**     zip1    v0\.16b, v31\.16b, v0\.16b
> > +**     ret
> > +*/
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> b/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> > index 99e04aac265..112a0eafc7a 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> > @@ -152,6 +152,27 @@
> >  **     ret
> >  */
> >
> > +/*
> > +** test_int8_14:
> > +**     bfi     w0, w1, 8, 8
> > +**     bfi     w1, w2, 8, 8
> > +**     dup     v31\.4h, w0
> > +**     dup     v30\.4h, w1
> > +**     zip1    v31\.16b, v31\.16b, v30\.16b
> > +**     dup     z0\.q, z31\.q\[0\]
> > +**     ret
> > +*/
> > +
> > +/*
> > +** test_int8_15:
> > +**     bfi     w0, w2, 8, 8
> > +**     dup     v30\.8b, w1
> > +**     dup     v31\.4h, w0
> > +**     zip1    v31\.16b, v31\.16b, v30\.16b
> > +**     dup     z0\.q, z31\.q\[0\]
> > +**     ret
> > +*/
> > +
> >  /*
> >  ** test_float16_1:
> >  **     fcvt    h0, s0
> > @@ -236,6 +257,33 @@
> >  **     ret
> >  */
> >
> > +/*
> > +** test_float16_9:
> > +**     fcvt    h1, s1
> > +**     fcvt    h2, s2
> > +**     fcvt    h0, s0
> > +**     uzp1    v0\.4h, v0\.4h, v1\.4h
> > +**     uzp1    v1\.4h, v1\.4h, v2\.4h
> > +**     dup     v0\.2s, v0\.s\[0\]
> > +**     dup     v1\.2s, v1\.s\[0\]
> > +**     zip1    v0\.8h, v0\.8h, v1\.8h
> > +**     dup     z0\.q, z0\.q\[0\]
> > +**     ret
> > +*/
> > +
> > +/*
> > +** test_float16_10:
> > +**     fcvt    h2, s2
> > +**     fcvt    h0, s0
> > +**     fcvt    h1, s1
> > +**     uzp1    v0\.4h, v0\.4h, v2\.4h
> > +**     dup     v1\.4h, v1\.h\[0\]
> > +**     dup     v0\.2s, v0\.s\[0\]
> > +**     zip1    v0\.8h, v0\.8h, v1\.8h
> > +**     dup     z0\.q, z0\.q\[0\]
> > +**     ret
> > +*/
> > +
> >  /*
> >  ** test_int16_1:
> >  **     mov     z0\.h, w0
> > @@ -310,6 +358,27 @@
> >  **     ret
> >  */
> >
> > +/*
> > +** test_int16_9:
> > +**     bfi     w0, w1, 16, 16
> > +**     bfi     w1, w2, 16, 16
> > +**     dup     v31\.2s, w0
> > +**     dup     v30\.2s, w1
> > +**     zip1    v31\.8h, v31\.8h, v30\.8h
> > +**     dup     z0\.q, z31\.q\[0\]
> > +**     ret
> > +*/
> > +
> > +/*
> > +** test_int16_10:
> > +**     bfi     w0, w2, 16, 16
> > +**     dup     v30\.4h, w1
> > +**     dup     v31\.2s, w0
> > +**     zip1    v31\.8h, v31\.8h, v30\.8h
> > +**     dup     z0\.q, z31\.q\[0\]
> > +**     ret
> > +*/
> > +
> >  /*
> >  ** test_float32_1:
> >  **     mov     z0\.s, s0
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > index 8c154f3680d..4721b068366 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > @@ -41,7 +41,11 @@
> >      TESTCASE (TYPE, ETYPE, T, 8, 12, x16, x0, x1, 0, 1, x2, x3, 2, 3,\
> >                                x0, x1, 0, 1, x2, x3, 2, 3)\
> >      TESTCASE (TYPE, ETYPE, T, 8, 13, x16, 0, 1, x0, x1, 2, 3, x2, x3,\
> > -                              0, 1, x0, x1, 2, 3, x2, x3)
> > +                              0, 1, x0, x1, 2, 3, x2, x3) \
> > +    TESTCASE (TYPE, ETYPE, T, 8, 14, x16, x0, x1, x1, x2, x0, x1, x1, x2,\
> > +                              x0, x1, x1, x2, x0, x1, x1, x2) \
> > +    TESTCASE (TYPE, ETYPE, T, 8, 15, x16, x0, x1, x2, x1, x0, x1, x2, x1,\
> > +                              x0, x1, x2, x1, x0, x1, x2, x1)
> >
> >  #define TEST_16(TYPE, ETYPE, T)\
> >      TESTCASE (TYPE, ETYPE, T, 16, 1, x8, x0, x0, x0, x0, x0, x0, x0, x0)\
> > @@ -52,6 +56,8 @@
> >      TESTCASE (TYPE, ETYPE, T, 16, 6, x8, x0, x1, 0, 1, x0, x1, 0, 1)\
> >      TESTCASE (TYPE, ETYPE, T, 16, 7, x8, 0, 1, x0, x1, 0, 1, x0, x1)\
> >      TESTCASE (TYPE, ETYPE, T, 16, 8, x8, 0, x0, 1, x1, 0, x0, 1, x1)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 9, x8, x0, x1, x1, x2, x0, x1, x1, x2)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 10, x8, x0, x1, x2, x1, x0, x1, x2, x1)
> >
> >  #define TEST_32(TYPE, ETYPE, T)\
> >      TESTCASE (TYPE, ETYPE, T, 32, 1, x4, x0, x0, x0, x0)\
> > @@ -205,6 +211,25 @@ TEST_64(int, int64_t, s)
> >  **     ret
> >  */
> >
> > +/*
> > +** test_int8_14:
> > +**     bfi     w0, w1, 8, 8
> > +**     bfi     w1, w2, 8, 8
> > +**     dup     v31\.4h, w0
> > +**     dup     v0\.4h, w1
> > +**     zip1    v0\.16b, v31\.16b, v0\.16b
> > +**     ret
> > +*/
> > +
> > +/*
> > +** test_int8_15:
> > +**     bfi     w0, w2, 8, 8
> > +**     dup     v0\.8b, w1
> > +**     dup     v31\.4h, w0
> > +**     zip1    v0\.16b, v31\.16b, v0\.16b
> > +**     ret
> > +*/
> > +
> >  /*
> >  ** test_float16_1:
> >  **     fcvt    h0, s0
> > @@ -286,6 +311,31 @@ TEST_64(int, int64_t, s)
> >  **     ret
> >  */
> >
> > +/*
> > +** test_float16_9:
> > +**     fcvt    h1, s1
> > +**     fcvt    h2, s2
> > +**     fcvt    h0, s0
> > +**     uzp1    v0\.4h, v0\.4h, v1\.4h
> > +**     uzp1    v1\.4h, v1\.4h, v2\.4h
> > +**     dup     v0\.2s, v0\.s\[0\]
> > +**     dup     v1\.2s, v1\.s\[0\]
> > +**     zip1    v0\.8h, v0\.8h, v1\.8h
> > +**     ret
> > +*/
> > +
> > +/*
> > +** test_float16_10:
> > +**     fcvt    h2, s2
> > +**     fcvt    h0, s0
> > +**     fcvt    h1, s1
> > +**     uzp1    v0\.4h, v0\.4h, v2\.4h
> > +**     dup     v1\.4h, v1\.h\[0\]
> > +**     dup     v0\.2s, v0\.s\[0\]
> > +**     zip1    v0\.8h, v0\.8h, v1\.8h
> > +**     ret
> > +*/
> > +
> >  /*
> >  ** test_int16_1:
> >  **     dup     v0\.8h, w0
> > @@ -356,6 +406,25 @@ TEST_64(int, int64_t, s)
> >  **     ret
> >  */
> >
> > +/*
> > +** test_int16_9:
> > +**     bfi     w0, w1, 16, 16
> > +**     bfi     w1, w2, 16, 16
> > +**     dup     v31\.2s, w0
> > +**     dup     v0\.2s, w1
> > +**     zip1    v0\.8h, v31\.8h, v0\.8h
> > +**     ret
> > +*/
> > +
> > +/*
> > +** test_int16_10:
> > +**     bfi     w0, w2, 16, 16
> > +**     dup     v0\.4h, w1
> > +**     dup     v31\.2s, w0
> > +**     zip1    v0\.8h, v31\.8h, v0\.8h
> > +**     ret
> > +*/
> > +
> >  /*
> >  ** test_float32_1:
> >  **     dup     v0\.4s, v0\.s\[0\]
> > --
> > 2.43.0
> >
  
Artemiy Volkov June 4, 2026, 10:40 a.m. UTC | #3
On Wed, Jun 03, 2026 at 09:23:54PM +0100, Tamar Christina wrote:
> > -----Original Message-----
> > From: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
> > Sent: 03 June 2026 20:58
> > To: Artemiy Volkov <Artemiy.Volkov@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; Tamar Christina <Tamar.Christina@arm.com>;
> > Wilco Dijkstra <Wilco.Dijkstra@arm.com>; Richard Earnshaw
> > <Richard.Earnshaw@arm.com>; ktkachov@nvidia.com; Alice Carlotti
> > <Alice.Carlotti@arm.com>; Alex Coplan <Alex.Coplan@arm.com>
> > Subject: Re: [PATCH] aarch64: suppress duplication into sub-64-bit AdvSIMD
> > vectors [PR125538]
> > 
> > On Wed, Jun 3, 2026 at 4:56 AM Artemiy Volkov <artemiy.volkov@arm.com>
> > wrote:
> > >
> > > As we don't have RTL support for duplicating values into partial AdvSIMD
> > > vector modes, any expression like (vec_duplicate:V4QI (reg:QI)) is going
> > > to be malformed.  The ICE reported in PR125538 occurred because in
> > > > r17-897-g4ddae2a94a032d we started generating such expressions when doing
> > > > a splat of the most common element at aarch64.cc:25876.
> > >
> > > To address the problem, this patch introduces the
> > > aarch64_gen_vec_duplicate () wrapper, which handles the case of a
> > > sub-64-bit destination mode by duplicating the source value into 64 bits
> > > and wrapping that into a SUBREG expression.  The alternative here would be
> > > to add some more vec_duplicate RTL patterns, but that would lead to some
> > > code churn in aarch64-simd.md and break a long-standing invariant for no
> > > obvious benefit.
> > 
> > > I am not so sure there is no obvious benefit.
> > Take:
> > ```
> > #define vect4 __attribute__((vector_size(4)))
> > 
> > void f(vect4 signed char *a, signed char b)
> > {
> >     *a = (vect4 signed char){b,b,b,b};
> > }
> > void f1(signed char *a, signed char b)
> > {
> >     a[0] = b;
> >     a[1] = b;
> >     a[2] = b;
> >     a[3] = b;
> > }
> > ```
> > > These could benefit from having a vec_dup of V4QI.
> > > And it would be a good step towards having V4QI/V2HI be more than just a
> > > container for initializations.
> 
> I was busy writing an elaborate response to this but Pinski beat me to it.
> 
> I would also indeed rather have you just add the RTL patterns for vec_duplicate
> of the partial modes. I don't think they really require that much churn.
> 
> Whether you take a subreg or just use a 128-bit/64-bit vector dup, the semantics
> remain the same: as long as we access the register in its intended mode,
> the result is sound.
> 
> So I think you should really just define the vec_duplicates and you only really need
> 2 new RTL patterns.
> 
> Additionally, I'd not want to lose the property that gen_vec_duplicate just works.
> Requiring a target-specific function is always annoying. It works better if we can
> just support the generic abstractions.
> 
> Thanks,
> Tamar

Thank you both, it looks like my intuition was completely wrong on that
one.

Will post the alternative solution shortly.

Kind regards,
Artemiy

> 
> > 
> > Thanks,
> > Andrea
> > 
> > 
> > 
> > >
> > > I've added the reduced testcase from the PR, as well as appended some
> > > similar tests to vec_init_5.c and vec-init-23.c.
> > >
> > > Regtested and bootstrapped on aarch64-linux-gnu.
> > >
> > >         PR target/125538
> > >
> > > gcc/ChangeLog:
> > >
> > >         * config/aarch64/aarch64-protos.h (aarch64_gen_vec_duplicate):
> > >         Declare new function.
> > >         * config/aarch64/aarch64.cc (aarch64_gen_vec_duplicate): Define
> > >         it.
> > >         (aarch64_expand_vector_init_fallback): Use
> > >         aarch64_gen_vec_duplicate () instead of gen_vec_duplicate ().
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.target/aarch64/sve/vec_init_5.c: Add new 8/16-bit testcases.
> > >         * gcc.target/aarch64/vec-init-23.c: Likewise.
> > >         * gcc.target/aarch64/pr125538.c: New test.
> > > ---
> > >  gcc/config/aarch64/aarch64-protos.h           |  1 +
> > >  gcc/config/aarch64/aarch64.cc                 | 36 ++++++++--
> > >  gcc/testsuite/gcc.target/aarch64/pr125538.c   | 20 ++++++
> > >  .../gcc.target/aarch64/sve/vec_init_5.c       | 69 ++++++++++++++++++
> > >  .../gcc.target/aarch64/vec-init-23.c          | 71 ++++++++++++++++++-
> > >  5 files changed, 191 insertions(+), 6 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr125538.c
> > >
> > > diff --git a/gcc/config/aarch64/aarch64-protos.h
> > b/gcc/config/aarch64/aarch64-protos.h
> > > index 513b556398f..3e679f6d36a 100644
> > > --- a/gcc/config/aarch64/aarch64-protos.h
> > > +++ b/gcc/config/aarch64/aarch64-protos.h
> > > @@ -1014,6 +1014,7 @@ rtx aarch64_mask_from_zextract_ops (rtx, rtx);
> > >  rtx aarch64_return_addr_rtx (void);
> > >  rtx aarch64_return_addr (int, rtx);
> > >  rtx aarch64_simd_gen_const_vector_dup (machine_mode,
> > HOST_WIDE_INT);
> > > +rtx aarch64_gen_vec_duplicate (machine_mode, rtx);
> > >  rtx aarch64_gen_shareable_zero (machine_mode);
> > >  bool aarch64_split_simd_shift_p (rtx_insn *);
> > >  bool aarch64_simd_mem_operand_p (rtx);
> > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > > index 5a859e12b1a..4c7173c162a 100644
> > > --- a/gcc/config/aarch64/aarch64.cc
> > > +++ b/gcc/config/aarch64/aarch64.cc
> > > @@ -25039,6 +25039,30 @@ aarch64_gen_shareable_zero
> > (machine_mode mode)
> > >    return lowpart_subreg (mode, reg, GET_MODE (reg));
> > >  }
> > >
> > > +/* Duplicate value X into a vector of type MODE.  In case MODE is a
> > > +   sub-64-bit mode and the result isn't a const_vector, duplicate into a
> > > +   full register and take a SUBREG of that.  */
> > > +
> > > +rtx
> > > +aarch64_gen_vec_duplicate (machine_mode mode, rtx x)
> > > +{
> > > +  gcc_assert (VECTOR_MODE_P (mode));
> > > +
> > > +  if (!aarch64_advsimd_sub_dword_mode_p (mode))
> > > +    return gen_vec_duplicate (mode, x);
> > > +
> > > +  if (valid_for_const_vector_p (mode, x))
> > > +    return gen_const_vec_duplicate (mode, x);
> > > +
> > > +  machine_mode dup_mode = mode_for_vector (GET_MODE_INNER
> > (mode),
> > > +                               64 / GET_MODE_BITSIZE (GET_MODE_INNER (mode)))
> > > +                         .require ();
> > > +
> > > +  rtx reg = gen_reg_rtx (dup_mode);
> > > +  aarch64_emit_move (reg, gen_rtx_VEC_DUPLICATE (dup_mode, x));
> > > +  return lowpart_subreg (mode, reg, dup_mode);
> > > +}
> > > +
> > >  /* INSN is some form of extension or shift that can be split into a
> > >     permutation involving a shared zero.  Return true if we should
> > >     perform such a split.
> > > @@ -25699,7 +25723,7 @@ aarch64_expand_vector_init_fallback (rtx
> > target, rtx vals)
> > >                                2 * GET_MODE_SIZE (narrow_mode)));
> > >        if (rtx_equal_p (v0, v1))
> > >         aarch64_emit_move (target,
> > > -                         gen_vec_duplicate (mode,
> > > +                         aarch64_gen_vec_duplicate (mode,
> > >                                              force_reg (narrow_mode, v0)));
> > >        else
> > >         emit_insn (gen_aarch64_vec_concat (narrow_mode, target,
> > > @@ -25733,7 +25757,7 @@ aarch64_expand_vector_init_fallback (rtx
> > target, rtx vals)
> > >    if (all_same)
> > >      {
> > >        rtx x = force_reg (inner_mode, v0);
> > > -      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> > > +      aarch64_emit_move (target, aarch64_gen_vec_duplicate (mode, x));
> > >        return;
> > >      }
> > >
> > > @@ -25769,7 +25793,8 @@ aarch64_expand_vector_init_fallback (rtx
> > target, rtx vals)
> > >               RTVEC_ELT (new_vals, i) = XVECEXP (vals, 0, i);
> > >             aarch64_expand_vector_init (new_target,
> > >                                         gen_rtx_PARALLEL (subv_mode, new_vals));
> > > -           aarch64_emit_move (target, gen_vec_duplicate (mode, new_target));
> > > +           aarch64_emit_move (target,
> > > +                              aarch64_gen_vec_duplicate (mode, new_target));
> > >             return;
> > >           }
> > >      }
> > > @@ -25862,7 +25887,8 @@ aarch64_expand_vector_init_fallback (rtx
> > target, rtx vals)
> > >           if (const_elem)
> > >             {
> > >               maxelement = const_elem_pos;
> > > -             aarch64_emit_move (target, gen_vec_duplicate (mode,
> > const_elem));
> > > +             aarch64_emit_move (target,
> > > +                                aarch64_gen_vec_duplicate (mode, const_elem));
> > >             }
> > >           else
> > >             {
> > > @@ -25873,7 +25899,7 @@ aarch64_expand_vector_init_fallback (rtx
> > target, rtx vals)
> > >        else
> > >         {
> > >           rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> > > -         aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> > > +         aarch64_emit_move (target, aarch64_gen_vec_duplicate (mode, x));
> > >         }
> > >
> > >        /* Insert the rest.  */
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr125538.c
> > b/gcc/testsuite/gcc.target/aarch64/pr125538.c
> > > new file mode 100644
> > > index 00000000000..f0cdcd58dfb
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/pr125538.c
> > > @@ -0,0 +1,20 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=armv9.5-a" } */
> > > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > > +
> > > +#define vect16 __attribute__((vector_size(16)))
> > > +
> > > +vect16 char fff(char _292, char _145, char _231)
> > > +{
> > > +    return (vect16 char) {_292, _145, _145, _231, _292, _145, _145, _231,
> > _292, _145, _145, _231, _292, _145, _145, _231};
> > > +}
> > > +
> > > +/*
> > > +** fff:
> > > +**     bfi     w0, w1, 8, 8
> > > +**     bfi     w1, w2, 8, 8
> > > +**     dup     v31\.4h, w0
> > > +**     dup     v0\.4h, w1
> > > +**     zip1    v0\.16b, v31\.16b, v0\.16b
> > > +**     ret
> > > +*/
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> > b/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> > > index 99e04aac265..112a0eafc7a 100644
> > > --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
> > > @@ -152,6 +152,27 @@
> > >  **     ret
> > >  */
> > >
> > > +/*
> > > +** test_int8_14:
> > > +**     bfi     w0, w1, 8, 8
> > > +**     bfi     w1, w2, 8, 8
> > > +**     dup     v31\.4h, w0
> > > +**     dup     v30\.4h, w1
> > > +**     zip1    v31\.16b, v31\.16b, v30\.16b
> > > +**     dup     z0\.q, z31.q\[0\]
> > > +**     ret
> > > +*/
> > > +
> > > +/*
> > > +** test_int8_15:
> > > +**     bfi     w0, w2, 8, 8
> > > +**     dup     v30\.8b, w1
> > > +**     dup     v31\.4h, w0
> > > +**     zip1    v31\.16b, v31\.16b, v30\.16b
> > > +**     dup     z0\.q, z31\.q\[0\]
> > > +**     ret
> > > +*/
> > > +
> > >  /*
> > >  ** test_float16_1:
> > >  **     fcvt    h0, s0
> > > @@ -236,6 +257,33 @@
> > >  **     ret
> > >  */
> > >
> > > +/*
> > > +** test_float16_9:
> > > +**     fcvt    h1, s1
> > > +**     fcvt    h2, s2
> > > +**     fcvt    h0, s0
> > > +**     uzp1    v0\.4h, v0\.4h, v1\.4h
> > > +**     uzp1    v1\.4h, v1\.4h, v2\.4h
> > > +**     dup     v0\.2s, v0\.s\[0\]
> > > +**     dup     v1\.2s, v1\.s\[0\]
> > > +**     zip1    v0\.8h, v0\.8h, v1\.8h
> > > +**     dup     z0\.q, z0.q\[0\]
> > > +**     ret
> > > +*/
> > > +
> > > +/*
> > > +** test_float16_10:
> > > +**     fcvt    h2, s2
> > > +**     fcvt    h0, s0
> > > +**     fcvt    h1, s1
> > > +**     uzp1    v0\.4h, v0\.4h, v2\.4h
> > > +**     dup     v1\.4h, v1\.h\[0\]
> > > +**     dup     v0\.2s, v0\.s\[0\]
> > > +**     zip1    v0\.8h, v0\.8h, v1\.8h
> > > +**     dup     z0\.q, z0.q\[0\]
> > > +**     ret
> > > +*/
> > > +
> > >  /*
> > >  ** test_int16_1:
> > >  **     mov     z0\.h, w0
> > > @@ -310,6 +358,27 @@
> > >  **     ret
> > >  */
> > >
> > > +/*
> > > +** test_int16_9:
> > > +**     bfi     w0, w1, 16, 16
> > > +**     bfi     w1, w2, 16, 16
> > > +**     dup     v31\.2s, w0
> > > +**     dup     v30\.2s, w1
> > > +**     zip1    v31\.8h, v31\.8h, v30\.8h
> > > +**     dup     z0\.q, z31\.q\[0\]
> > > +**     ret
> > > +*/
> > > +
> > > +/*
> > > +** test_int16_10:
> > > +**     bfi     w0, w2, 16, 16
> > > +**     dup     v30\.4h, w1
> > > +**     dup     v31\.2s, w0
> > > +**     zip1    v31\.8h, v31\.8h, v30\.8h
> > > +**     dup     z0\.q, z31\.q\[0\]
> > > +**     ret
> > > +*/
> > > +
> > >  /*
> > >  ** test_float32_1:
> > >  **     mov     z0\.s, s0
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > > index 8c154f3680d..4721b068366 100644
> > > --- a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > > @@ -41,7 +41,11 @@
> > >      TESTCASE (TYPE, ETYPE, T, 8, 12, x16, x0, x1, 0, 1, x2, x3, 2, 3,\
> > >                                x0, x1, 0, 1, x2, x3, 2, 3)\
> > >      TESTCASE (TYPE, ETYPE, T, 8, 13, x16, 0, 1, x0, x1, 2, 3, x2, x3,\
> > > -                              0, 1, x0, x1, 2, 3, x2, x3)
> > > +                              0, 1, x0, x1, 2, 3, x2, x3) \
> > > +    TESTCASE (TYPE, ETYPE, T, 8, 14, x16, x0, x1, x1, x2, x0, x1, x1, x2,\
> > > +                              x0, x1, x1, x2, x0, x1, x1, x2) \
> > > +    TESTCASE (TYPE, ETYPE, T, 8, 15, x16, x0, x1, x2, x1, x0, x1, x2, x1,\
> > > +                              x0, x1, x2, x1, x0, x1, x2, x1)
> > >
> > >  #define TEST_16(TYPE, ETYPE, T)\
> > >      TESTCASE (TYPE, ETYPE, T, 16, 1, x8, x0, x0, x0, x0, x0, x0, x0, x0)\
> > > @@ -52,6 +56,8 @@
> > >      TESTCASE (TYPE, ETYPE, T, 16, 6, x8, x0, x1, 0, 1, x0, x1, 0, 1)\
> > >      TESTCASE (TYPE, ETYPE, T, 16, 7, x8, 0, 1, x0, x1, 0, 1, x0, x1)\
> > >      TESTCASE (TYPE, ETYPE, T, 16, 8, x8, 0, x0, 1, x1, 0, x0, 1, x1)\
> > > +    TESTCASE (TYPE, ETYPE, T, 16, 9, x8, x0, x1, x1, x2, x0, x1, x1, x2)\
> > > +    TESTCASE (TYPE, ETYPE, T, 16, 10, x8, x0, x1, x2, x1, x0, x1, x2, x1)
> > >
> > >  #define TEST_32(TYPE, ETYPE, T)\
> > >      TESTCASE (TYPE, ETYPE, T, 32, 1, x4, x0, x0, x0, x0)\
> > > @@ -205,6 +211,25 @@ TEST_64(int, int64_t, s)
> > >  **     ret
> > >  */
> > >
> > > +/*
> > > +** test_int8_14:
> > > +**     bfi     w0, w1, 8, 8
> > > +**     bfi     w1, w2, 8, 8
> > > +**     dup     v31\.4h, w0
> > > +**     dup     v0\.4h, w1
> > > +**     zip1    v0\.16b, v31\.16b, v0\.16b
> > > +**     ret
> > > +*/
> > > +
> > > +/*
> > > +** test_int8_15:
> > > +**     bfi     w0, w2, 8, 8
> > > +**     dup     v0.8b, w1
> > > +**     dup     v31.4h, w0
> > > +**     zip1    v0.16b, v31.16b, v0.16b
> > > +**     ret
> > > +*/
> > > +
> > >  /*
> > >  ** test_float16_1:
> > >  **     fcvt    h0, s0
> > > @@ -286,6 +311,31 @@ TEST_64(int, int64_t, s)
> > >  **     ret
> > >  */
> > >
> > > +/*
> > > +** test_float16_9:
> > > +**     fcvt    h1, s1
> > > +**     fcvt    h2, s2
> > > +**     fcvt    h0, s0
> > > +**     uzp1    v0\.4h, v0\.4h, v1\.4h
> > > +**     uzp1    v1\.4h, v1\.4h, v2\.4h
> > > +**     dup     v0\.2s, v0\.s\[0\]
> > > +**     dup     v1\.2s, v1\.s\[0\]
> > > +**     zip1    v0\.8h, v0\.8h, v1\.8h
> > > +**     ret
> > > +*/
> > > +
> > > +/*
> > > +** test_float16_10:
> > > +**     fcvt    h2, s2
> > > +**     fcvt    h0, s0
> > > +**     fcvt    h1, s1
> > > +**     uzp1    v0\.4h, v0\.4h, v2\.4h
> > > +**     dup     v1\.4h, v1\.h\[0\]
> > > +**     dup     v0\.2s, v0\.s\[0\]
> > > +**     zip1    v0\.8h, v0\.8h, v1\.8h
> > > +**     ret
> > > +*/
> > > +
> > >  /*
> > >  ** test_int16_1:
> > >  **     dup     v0\.8h, w0
> > > @@ -356,6 +406,25 @@ TEST_64(int, int64_t, s)
> > >  **     ret
> > >  */
> > >
> > > +/*
> > > +** test_int16_9:
> > > +**     bfi     w0, w1, 16, 16
> > > +**     bfi     w1, w2, 16, 16
> > > +**     dup     v31\.2s, w0
> > > +**     dup     v0\.2s, w1
> > > +**     zip1    v0\.8h, v31\.8h, v0\.8h
> > > +**     ret
> > > +*/
> > > +
> > > +/*
> > > +** test_int16_10:
> > > +**     bfi     w0, w2, 16, 16
> > > +**     dup     v0\.4h, w1
> > > +**     dup     v31\.2s, w0
> > > +**     zip1    v0\.8h, v31\.8h, v0\.8h
> > > +**     ret
> > > +*/
> > > +
> > >  /*
> > >  ** test_float32_1:
> > >  **     dup     v0\.4s, v0\.s\[0\]
> > > --
> > > 2.43.0
> > >
  

Patch
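
The wrapper in the diff below relies on one observation: splatting a scalar
into a full 64-bit vector and then reading only the low 32 bits yields the
same lanes as a direct 4-byte splat. A minimal C sketch of that equivalence
follows, using GNU vector extensions as in the example from the review. The
helper names (splat8, low_half, splat4_direct, check_splat_equivalence) are
hypothetical, and memcpy only stands in for the lowpart SUBREG, so this
models the semantics rather than the RTL the patch emits.

```c
#include <assert.h>
#include <string.h>

typedef signed char v8qi __attribute__((vector_size(8)));
typedef signed char v4qi __attribute__((vector_size(4)));

/* Splat B into all eight lanes of a 64-bit vector.  */
v8qi splat8(signed char b)
{
    return (v8qi){b, b, b, b, b, b, b, b};
}

/* Hypothetical stand-in for taking the low part of the wide vector:
   copy its first four bytes into a 4-byte vector.  */
v4qi low_half(v8qi wide)
{
    v4qi narrow;
    memcpy(&narrow, &wide, sizeof narrow);
    return narrow;
}

/* Direct 4-byte splat, i.e. what a V4QI vec_duplicate would mean.  */
v4qi splat4_direct(signed char b)
{
    return (v4qi){b, b, b, b};
}

/* Assert that both routes produce identical bytes for B.  */
void check_splat_equivalence(signed char b)
{
    v4qi via_wide = low_half(splat8(b));
    v4qi direct = splat4_direct(b);
    assert(memcmp(&via_wide, &direct, sizeof direct) == 0);
}
```

Since every lane holds the same value, the comparison does not depend on
which four bytes of the wide vector are taken, so the sketch behaves the
same on either endianness.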

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 513b556398f..3e679f6d36a 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1014,6 +1014,7 @@  rtx aarch64_mask_from_zextract_ops (rtx, rtx);
 rtx aarch64_return_addr_rtx (void);
 rtx aarch64_return_addr (int, rtx);
 rtx aarch64_simd_gen_const_vector_dup (machine_mode, HOST_WIDE_INT);
+rtx aarch64_gen_vec_duplicate (machine_mode, rtx);
 rtx aarch64_gen_shareable_zero (machine_mode);
 bool aarch64_split_simd_shift_p (rtx_insn *);
 bool aarch64_simd_mem_operand_p (rtx);
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 5a859e12b1a..4c7173c162a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25039,6 +25039,30 @@  aarch64_gen_shareable_zero (machine_mode mode)
   return lowpart_subreg (mode, reg, GET_MODE (reg));
 }
 
+/* Duplicate value X into a vector of type MODE.  In case MODE is a
+   sub-64-bit mode and the result isn't a const_vector, duplicate into a
+   full register and take a SUBREG of that.  */
+
+rtx
+aarch64_gen_vec_duplicate (machine_mode mode, rtx x)
+{
+  gcc_assert (VECTOR_MODE_P (mode));
+
+  if (!aarch64_advsimd_sub_dword_mode_p (mode))
+    return gen_vec_duplicate (mode, x);
+
+  if (valid_for_const_vector_p (mode, x))
+    return gen_const_vec_duplicate (mode, x);
+
+  machine_mode dup_mode = mode_for_vector (GET_MODE_INNER (mode),
+				64 / GET_MODE_BITSIZE (GET_MODE_INNER (mode)))
+			  .require ();
+
+  rtx reg = gen_reg_rtx (dup_mode);
+  aarch64_emit_move (reg, gen_rtx_VEC_DUPLICATE (dup_mode, x));
+  return lowpart_subreg (mode, reg, dup_mode);
+}
+
 /* INSN is some form of extension or shift that can be split into a
    permutation involving a shared zero.  Return true if we should
    perform such a split.
@@ -25699,7 +25723,7 @@  aarch64_expand_vector_init_fallback (rtx target, rtx vals)
 			       2 * GET_MODE_SIZE (narrow_mode)));
       if (rtx_equal_p (v0, v1))
        aarch64_emit_move (target,
-			  gen_vec_duplicate (mode,
+			  aarch64_gen_vec_duplicate (mode,
 					     force_reg (narrow_mode, v0)));
       else
        emit_insn (gen_aarch64_vec_concat (narrow_mode, target,
@@ -25733,7 +25757,7 @@  aarch64_expand_vector_init_fallback (rtx target, rtx vals)
   if (all_same)
     {
       rtx x = force_reg (inner_mode, v0);
-      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+      aarch64_emit_move (target, aarch64_gen_vec_duplicate (mode, x));
       return;
     }
 
@@ -25769,7 +25793,8 @@  aarch64_expand_vector_init_fallback (rtx target, rtx vals)
 	      RTVEC_ELT (new_vals, i) = XVECEXP (vals, 0, i);
 	    aarch64_expand_vector_init (new_target,
 					gen_rtx_PARALLEL (subv_mode, new_vals));
-	    aarch64_emit_move (target, gen_vec_duplicate (mode, new_target));
+	    aarch64_emit_move (target,
+			       aarch64_gen_vec_duplicate (mode, new_target));
 	    return;
 	  }
     }
@@ -25862,7 +25887,8 @@  aarch64_expand_vector_init_fallback (rtx target, rtx vals)
 	  if (const_elem)
 	    {
 	      maxelement = const_elem_pos;
-	      aarch64_emit_move (target, gen_vec_duplicate (mode, const_elem));
+	      aarch64_emit_move (target,
+				 aarch64_gen_vec_duplicate (mode, const_elem));
 	    }
 	  else
 	    {
@@ -25873,7 +25899,7 @@  aarch64_expand_vector_init_fallback (rtx target, rtx vals)
       else
 	{
 	  rtx x = force_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+	  aarch64_emit_move (target, aarch64_gen_vec_duplicate (mode, x));
 	}
 
       /* Insert the rest.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr125538.c b/gcc/testsuite/gcc.target/aarch64/pr125538.c
new file mode 100644
index 00000000000..f0cdcd58dfb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr125538.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=armv9.5-a" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#define vect16 __attribute__((vector_size(16)))
+
+vect16 char fff(char _292, char _145, char _231)
+{
+    return (vect16 char) {_292, _145, _145, _231, _292, _145, _145, _231, _292, _145, _145, _231, _292, _145, _145, _231};
+}
+
+/*
+** fff:
+**	bfi	w0, w1, 8, 8
+**	bfi	w1, w2, 8, 8
+**	dup	v31\.4h, w0
+**	dup	v0\.4h, w1
+**	zip1	v0\.16b, v31\.16b, v0\.16b
+**	ret
+*/
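
The expected body of fff above can be checked by hand. A rough C model of
its four data instructions follows. All function names are hypothetical,
vectors are modeled as byte arrays in little-endian lane order, and only
the bits the test depends on are modeled (the low 16 bits of each w
register and the low 64 bits of each v register).

```c
#include <assert.h>
#include <stdint.h>

/* Model of "bfi wD, wS, 8, 8", low 16 bits only: keep the low byte of D
   and place the low byte of S into bits 8..15.  */
uint16_t bfi_8_8(uint32_t d, uint32_t s)
{
    return (uint16_t)((d & 0xffu) | ((s & 0xffu) << 8));
}

/* Model of "dup vN.4h, wM": four copies of a 16-bit value, stored as
   eight bytes with the low byte of each lane first.  */
void dup_4h(uint8_t out[8], uint16_t h)
{
    for (int i = 0; i < 4; i++) {
        out[2 * i] = (uint8_t)(h & 0xffu);
        out[2 * i + 1] = (uint8_t)(h >> 8);
    }
}

/* Model of "zip1 vD.16b, vA.16b, vB.16b": interleave the low eight
   bytes of A and B, starting with A.  */
void zip1_16b(uint8_t out[16], const uint8_t a[8], const uint8_t b[8])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* The whole sequence: pack (x0, x1) and (x1, x2), splat each pair,
   then interleave the two splats byte by byte.  */
void model_fff(uint8_t out[16], uint8_t x0, uint8_t x1, uint8_t x2)
{
    uint16_t w0 = bfi_8_8(x0, x1); /* bfi w0, w1, 8, 8 */
    uint16_t w1 = bfi_8_8(x1, x2); /* bfi w1, w2, 8, 8 */
    uint8_t a[8], b[8];
    dup_4h(a, w0);                 /* dup v31.4h, w0 */
    dup_4h(b, w1);                 /* dup v0.4h, w1 */
    zip1_16b(out, a, b);           /* zip1 v0.16b, v31.16b, v0.16b */
}

/* Assert that the model reproduces the initializer used in fff:
   {x0, x1, x1, x2} repeated four times.  */
void check_model(uint8_t x0, uint8_t x1, uint8_t x2)
{
    uint8_t out[16];
    model_fff(out, x0, x1, x2);
    for (int i = 0; i < 16; i += 4) {
        assert(out[i] == x0 && out[i + 1] == x1);
        assert(out[i + 2] == x1 && out[i + 3] == x2);
    }
}
```

The first bfi reads w1 before the second bfi overwrites its upper byte,
which is why the model computes both packed values from the original
arguments.
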
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
index 99e04aac265..112a0eafc7a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_init_5.c
@@ -152,6 +152,27 @@ 
 **	ret
 */
 
+/*
+** test_int8_14:
+**	bfi	w0, w1, 8, 8
+**	bfi	w1, w2, 8, 8
+**	dup	v31\.4h, w0
+**	dup	v30\.4h, w1
+**	zip1	v31\.16b, v31\.16b, v30\.16b
+**	dup	z0\.q, z31.q\[0\]
+**	ret
+*/
+
+/*
+** test_int8_15:
+**	bfi	w0, w2, 8, 8
+**	dup	v30\.8b, w1
+**	dup	v31\.4h, w0
+**	zip1	v31\.16b, v31\.16b, v30\.16b
+**	dup	z0\.q, z31\.q\[0\]
+**	ret
+*/
+
 /*
 ** test_float16_1:
 **	fcvt	h0, s0
@@ -236,6 +257,33 @@ 
 **	ret
 */
 
+/*
+** test_float16_9:
+**	fcvt	h1, s1
+**	fcvt	h2, s2
+**	fcvt	h0, s0
+**	uzp1	v0\.4h, v0\.4h, v1\.4h
+**	uzp1	v1\.4h, v1\.4h, v2\.4h
+**	dup	v0\.2s, v0\.s\[0\]
+**	dup	v1\.2s, v1\.s\[0\]
+**	zip1	v0\.8h, v0\.8h, v1\.8h
+**	dup	z0\.q, z0.q\[0\]
+**	ret
+*/
+
+/*
+** test_float16_10:
+**	fcvt	h2, s2
+**	fcvt	h0, s0
+**	fcvt	h1, s1
+**	uzp1	v0\.4h, v0\.4h, v2\.4h
+**	dup	v1\.4h, v1\.h\[0\]
+**	dup	v0\.2s, v0\.s\[0\]
+**	zip1	v0\.8h, v0\.8h, v1\.8h
+**	dup	z0\.q, z0.q\[0\]
+**	ret
+*/
+
 /*
 ** test_int16_1:
 **	mov	z0\.h, w0
@@ -310,6 +358,27 @@ 
 **	ret
 */
 
+/*
+** test_int16_9:
+**	bfi	w0, w1, 16, 16
+**	bfi	w1, w2, 16, 16
+**	dup	v31\.2s, w0
+**	dup	v30\.2s, w1
+**	zip1	v31\.8h, v31\.8h, v30\.8h
+**	dup	z0\.q, z31\.q\[0\]
+**	ret
+*/
+
+/*
+** test_int16_10:
+**	bfi	w0, w2, 16, 16
+**	dup	v30\.4h, w1
+**	dup	v31\.2s, w0
+**	zip1	v31\.8h, v31\.8h, v30\.8h
+**	dup	z0\.q, z31\.q\[0\]
+**	ret
+*/
+
 /*
 ** test_float32_1:
 **	mov	z0\.s, s0
diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
index 8c154f3680d..4721b068366 100644
--- a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
+++ b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
@@ -41,7 +41,11 @@ 
     TESTCASE (TYPE, ETYPE, T, 8, 12, x16, x0, x1, 0, 1, x2, x3, 2, 3,\
 			       x0, x1, 0, 1, x2, x3, 2, 3)\
     TESTCASE (TYPE, ETYPE, T, 8, 13, x16, 0, 1, x0, x1, 2, 3, x2, x3,\
-			       0, 1, x0, x1, 2, 3, x2, x3)
+			       0, 1, x0, x1, 2, 3, x2, x3) \
+    TESTCASE (TYPE, ETYPE, T, 8, 14, x16, x0, x1, x1, x2, x0, x1, x1, x2,\
+			       x0, x1, x1, x2, x0, x1, x1, x2) \
+    TESTCASE (TYPE, ETYPE, T, 8, 15, x16, x0, x1, x2, x1, x0, x1, x2, x1,\
+			       x0, x1, x2, x1, x0, x1, x2, x1)
 
 #define TEST_16(TYPE, ETYPE, T)\
     TESTCASE (TYPE, ETYPE, T, 16, 1, x8, x0, x0, x0, x0, x0, x0, x0, x0)\
@@ -52,6 +56,8 @@ 
     TESTCASE (TYPE, ETYPE, T, 16, 6, x8, x0, x1, 0, 1, x0, x1, 0, 1)\
     TESTCASE (TYPE, ETYPE, T, 16, 7, x8, 0, 1, x0, x1, 0, 1, x0, x1)\
     TESTCASE (TYPE, ETYPE, T, 16, 8, x8, 0, x0, 1, x1, 0, x0, 1, x1)\
+    TESTCASE (TYPE, ETYPE, T, 16, 9, x8, x0, x1, x1, x2, x0, x1, x1, x2)\
+    TESTCASE (TYPE, ETYPE, T, 16, 10, x8, x0, x1, x2, x1, x0, x1, x2, x1)
 
 #define TEST_32(TYPE, ETYPE, T)\
     TESTCASE (TYPE, ETYPE, T, 32, 1, x4, x0, x0, x0, x0)\
@@ -205,6 +211,25 @@  TEST_64(int, int64_t, s)
 **	ret
 */
 
+/*
+** test_int8_14:
+**	bfi	w0, w1, 8, 8
+**	bfi	w1, w2, 8, 8
+**	dup	v31\.4h, w0
+**	dup	v0\.4h, w1
+**	zip1	v0\.16b, v31\.16b, v0\.16b
+**	ret
+*/
+
+/*
+** test_int8_15:
+**	bfi	w0, w2, 8, 8
+**	dup	v0.8b, w1
+**	dup	v31.4h, w0
+**	zip1	v0.16b, v31.16b, v0.16b
+**	ret
+*/
+
 /*
 ** test_float16_1:
 **	fcvt	h0, s0
@@ -286,6 +311,31 @@  TEST_64(int, int64_t, s)
 **	ret
 */
 
+/*
+** test_float16_9:
+**	fcvt	h1, s1
+**	fcvt	h2, s2
+**	fcvt	h0, s0
+**	uzp1	v0\.4h, v0\.4h, v1\.4h
+**	uzp1	v1\.4h, v1\.4h, v2\.4h
+**	dup	v0\.2s, v0\.s\[0\]
+**	dup	v1\.2s, v1\.s\[0\]
+**	zip1	v0\.8h, v0\.8h, v1\.8h
+**	ret
+*/
+
+/*
+** test_float16_10:
+**	fcvt	h2, s2
+**	fcvt	h0, s0
+**	fcvt	h1, s1
+**	uzp1	v0\.4h, v0\.4h, v2\.4h
+**	dup	v1\.4h, v1\.h\[0\]
+**	dup	v0\.2s, v0\.s\[0\]
+**	zip1	v0\.8h, v0\.8h, v1\.8h
+**	ret
+*/
+
 /*
 ** test_int16_1:
 **	dup	v0\.8h, w0
@@ -356,6 +406,25 @@  TEST_64(int, int64_t, s)
 **	ret
 */
 
+/*
+** test_int16_9:
+**	bfi	w0, w1, 16, 16
+**	bfi	w1, w2, 16, 16
+**	dup	v31\.2s, w0
+**	dup	v0\.2s, w1
+**	zip1	v0\.8h, v31\.8h, v0\.8h
+**	ret
+*/
+
+/*
+** test_int16_10:
+**	bfi	w0, w2, 16, 16
+**	dup	v0\.4h, w1
+**	dup	v31\.2s, w0
+**	zip1	v0\.8h, v31\.8h, v0\.8h
+**	ret
+*/
+
 /*
 ** test_float32_1:
 **	dup	v0\.4s, v0\.s\[0\]