[2/2] AArch64: support encoding integer immediates using floating point moves

Message ID ZvqPvaRg/nmYIrSz@arm.com
State New
Series [1/2] AArch64: refactor aarch64_float_const_representable_p to take additional mode param

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed

Commit Message

Tamar Christina Sept. 30, 2024, 11:47 a.m. UTC
  Hi All,

This patch extends our immediate SIMD generation cases to support generating
integer immediates using a floating point operation if the integer immediate maps
to an exact FP value.

As an example:

uint32x4_t f1() {
    return vdupq_n_u32(0x3f800000);
}

currently generates:

f1:
        adrp    x0, .LC0
        ldr     q0, [x0, #:lo12:.LC0]
        ret

i.e. a load, but with this change:

f1:
        fmov    v0.4s, 1.0e+0
        ret

Such immediates are common in e.g. our math routines in glibc, where they are
used as masks to extract or mark parts of an FP value.
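
For illustration, a minimal, hypothetical example (not taken from glibc) of
the kind of code that produces such a constant: the bit pattern of 1.0f
(0x3f800000) is OR-ed into a set of mantissa bits to build a float in [1.0, 2.0):

#include <arm_neon.h>

/* Hypothetical example: turn 23 mantissa bits into a float in [0.0, 1.0)
   by OR-ing in the bit pattern of 1.0f and subtracting 1.0f.  */
float32x4_t bits_to_unit_float (uint32x4_t mantissa_bits) {
    uint32x4_t one_bits = vdupq_n_u32 (0x3f800000);
    uint32x4_t frac = vandq_u32 (mantissa_bits, vdupq_n_u32 (0x007fffff));
    float32x4_t x = vreinterpretq_f32_u32 (vorrq_u32 (one_bits, frac));
    return vsubq_f32 (x, vdupq_n_f32 (1.0f));
}

Here the 0x3f800000 constant is exactly the case the patch can now
materialise with a single fmov.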

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-protos.h (aarch64_float_const_representable_p):
	Add overload.
	* config/aarch64/aarch64.cc (aarch64_float_const_zero_rtx_p): Reject
	integer modes.
	(aarch64_simd_valid_immediate, aarch64_float_const_representable_p):
	Check if integer value maps to an exact FP constant.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/const_create_using_fmov.c: New test.

---





Comments

Richard Sandiford Sept. 30, 2024, 5:33 p.m. UTC | #1
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> This patch extends our immediate SIMD generation cases to support generating
> integer immediates using floating point operation if the integer immediate maps
> to an exact FP value.
>
> As an example:
>
> uint32x4_t f1() {
>     return vdupq_n_u32(0x3f800000);
> }
>
> currently generates:
>
> f1:
>         adrp    x0, .LC0
>         ldr     q0, [x0, #:lo12:.LC0]
>         ret
>
> i.e. a load, but with this change:
>
> f1:
>         fmov    v0.4s, 1.0e+0
>         ret
>
> Such immediates are common in e.g. our Math routines in glibc because they are
> created to extract or mark part of an FP immediate as masks.

I agree this is a good thing to do.  The current code is too beholden
to the original vector mode.  This patch relaxes it so that it isn't
beholden to the original mode's class (integer vs. float), but it would
still be beholden to the original mode's element size.

It looks like an alternative would be to remove:

  scalar_float_mode elt_float_mode;
  if (n_elts == 1
      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
    {
      rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
      if (aarch64_float_const_zero_rtx_p (elt)
	  || aarch64_float_const_representable_p (elt))
	{
	  if (info)
	    *info = simd_immediate_info (elt_float_mode, elt);
	  return true;
	}
    }

and instead insert code:

  /* Get the repeating 8-byte value as an integer.  No endian correction
     is needed here because bytes is already in lsb-first order.  */
  unsigned HOST_WIDE_INT val64 = 0;
  for (unsigned int i = 0; i < 8; i++)
    val64 |= ((unsigned HOST_WIDE_INT) bytes[i % nbytes]
	      << (i * BITS_PER_UNIT));

---> here

  if (vec_flags & VEC_SVE_DATA)
    return aarch64_sve_valid_immediate (val64, info);
  else
    return aarch64_advsimd_valid_immediate (val64, info, which);

that tries to reduce val64 to the smallest repeating pattern,
then tries to interpret that pattern as a float.  The reduction step
could reuse the first part of aarch64_sve_valid_immediate, which
calculates the narrowest repeating integer mode:

  scalar_int_mode mode = DImode;
  unsigned int val32 = val64 & 0xffffffff;
  if (val32 == (val64 >> 32))
    {
      mode = SImode;
      unsigned int val16 = val32 & 0xffff;
      if (val16 == (val32 >> 16))
	{
	  mode = HImode;
	  unsigned int val8 = val16 & 0xff;
	  if (val8 == (val16 >> 8))
	    mode = QImode;
	}
    }

This would give us the candidate integer mode, to which we could
apply float_mode_for_size (...).exists, as in the patch.

In this case we would have the value as an integer, rather than
as an rtx, so I think it would make sense to split out the part of
aarch64_float_const_representable_p that processes the REAL_VALUE_TYPE.
aarch64_simd_valid_immediate could then use the patch's:

> +      long int as_long_ints[2];
> +      as_long_ints[0] = buf & 0xFFFFFFFF;
> +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> [...]
> +      real_from_target (&r, as_long_ints, fmode);

with "buf" being "val64" in the code above, and "fmode" being the result
of float_mode_for_size (...).exists.  aarch64_simd_valid_immediate
would then pass "r" and "fmode" to the new, split-out variant of
aarch64_float_const_representable_p.  (I haven't checked the endianness
requirements for real_from_target.)

The split-out variant would still perform the HFmode test in:

  if (GET_MODE (x) == VOIDmode
      || (GET_MODE (x) == HFmode && !TARGET_FP_F16INST))
    return false;

The VOIDmode test is redundant and can be dropped.  AArch64 has always
been a CONST_WIDE_INT target.
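
To make the above concrete, here is a rough sketch of the flow I mean
(aarch64_real_float_const_representable_p is just a placeholder name for
the split-out REAL_VALUE_TYPE variant, and I haven't tried to compile it):

  /* Assumes val64 and the narrowest repeating integer mode int_mode
     have been computed as above.  */
  scalar_float_mode fmode;
  if (which == AARCH64_CHECK_MOV
      && float_mode_for_size (GET_MODE_BITSIZE (int_mode)).exists (&fmode))
    {
      /* real_from_target wants the bits split into two 32-bit halves.  */
      long as_long_ints[2];
      as_long_ints[0] = val64 & 0xFFFFFFFF;
      as_long_ints[1] = (val64 >> 32) & 0xFFFFFFFF;
      REAL_VALUE_TYPE r;
      real_from_target (&r, as_long_ints, fmode);
      /* Placeholder for the split-out variant.  */
      if (aarch64_real_float_const_representable_p (r, fmode))
        {
          if (info)
            *info = simd_immediate_info (fmode,
                                         const_double_from_real_value (r, fmode));
          return true;
        }
    }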

If we do that, we should probably also pass the integer mode calculated
by the code quoted above down to aarch64_sve_valid_immediate (where it
came from) and aarch64_advsimd_valid_immediate, since both of them would
find it useful.  E.g.:

      /* Try using a replicated byte.  */
      if (which == AARCH64_CHECK_MOV
	  && val16 == (val32 >> 16)
	  && val8 == (val16 >> 8))
	{
	  if (info)
	    *info = simd_immediate_info (QImode, val8);
	  return true;
	}

would become:

  /* Try using a replicated byte.  */
  if (which == AARCH64_CHECK_MOV && mode == QImode)
    {
      if (info)
        *info = simd_immediate_info (QImode, val8);
      return true;
    }

I realise that's quite a bit different from the patch as posted, sorry,
and I've made it sound more complicated than it actually is.  But I think
it should be both more general (because it ignores the element size as
well as the mode class) and a little simpler.

The proposed split of aarch64_float_const_representable_p would be
a replacement for patch 1 in the series.  The current rtx version
of aarch64_float_const_representable_p would not need to take a mode,
but the REAL_VALUE_TYPE interface would.

Thanks,
Richard

>
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-goin> issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-protos.h (aarch64_float_const_representable_p):
> 	Add overload.
> 	* config/aarch64/aarch64.cc (aarch64_float_const_zero_rtx_p): Reject
> 	integer modes.
> 	(aarch64_simd_valid_immediate, aarch64_float_const_representable_p):
> 	Check if integer value maps to an exact FP constant.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/const_create_using_fmov.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 7a84acc59569da0b50af2300615db561a5de460a..6c683ea2d93e1b733cfe49fac38381ea6451fd55 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -974,6 +974,7 @@ void aarch64_split_simd_move (rtx, rtx);
>  
>  /* Check for a legitimate floating point constant for FMOV.  */
>  bool aarch64_float_const_representable_p (rtx, machine_mode);
> +bool aarch64_float_const_representable_p (rtx *, rtx, machine_mode);
>  
>  extern int aarch64_epilogue_uses (int);
>  
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 1842f6ecf6330f11a64545d0903240c89b104ffc..2d44608d93b8e7542ea8d5eb4c3f99c9f88e70ed 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -10991,7 +10991,8 @@ aarch64_float_const_zero_rtx_p (rtx x)
>    /* 0.0 in Decimal Floating Point cannot be represented by #0 or
>       zr as our callers expect, so no need to check the actual
>       value if X is of Decimal Floating Point type.  */
> -  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT)
> +  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT
> +      || !CONST_DOUBLE_P (x))
>      return false;
>  
>    if (REAL_VALUE_MINUS_ZERO (*CONST_DOUBLE_REAL_VALUE (x)))
> @@ -23026,17 +23027,30 @@ aarch64_simd_valid_immediate (rtx op, simd_immediate_info *info,
>    else
>      return false;
>  
> -  scalar_float_mode elt_float_mode;
> -  if (n_elts == 1
> -      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
> +  if (n_elts == 1)
>      {
>        rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
> +      rtx new_elt = NULL_RTX;
>        if (aarch64_float_const_zero_rtx_p (elt)
> -	  || aarch64_float_const_representable_p (elt, elt_mode))
> -	{
> -	  if (info)
> -	    *info = simd_immediate_info (elt_float_mode, elt);
> -	  return true;
> +	  || aarch64_float_const_representable_p (&new_elt, elt, elt_mode))
> +	{
> +	  scalar_float_mode elt_float_mode;
> +	  auto bitsize = GET_MODE_UNIT_BITSIZE (elt_mode);
> +	  if (is_a <scalar_float_mode> (elt_mode))
> +	    elt_float_mode = as_a <scalar_float_mode> (elt_mode);
> +	  else if (which == AARCH64_CHECK_MOV
> +		   && new_elt
> +		   && float_mode_for_size (bitsize).exists (&elt_float_mode))
> +	    elt = new_elt;
> +	  else
> +	    elt = NULL_RTX;
> +
> +	  if (elt != NULL_RTX)
> +	    {
> +	      if (info)
> +		*info = simd_immediate_info (elt_float_mode, elt);
> +	      return true;
> +	    }
>  	}
>      }
>  
> @@ -25121,8 +25135,22 @@ aarch64_c_mode_for_suffix (char suffix)
>  
>  /* Return true iff X with mode MODE can be represented by a quarter-precision
>     floating point immediate operand X.  Note, we cannot represent 0.0.  */
> +
>  bool
>  aarch64_float_const_representable_p (rtx x, machine_mode mode)
> +{
> +  return aarch64_float_const_representable_p (NULL, x, mode);
> +}
> +
> +
> +/* Return true iff X with mode MODE can be represented by a quarter-precision
> +   floating point immediate operand X.  Note, we cannot represent 0.0.
> +   If the value is a CONST_INT that can be represented as an exact floating
> +   point then OUT will contain the new floating point value to emit to generate
> +   the integer constant.  */
> +
> +bool
> +aarch64_float_const_representable_p (rtx *out, rtx x, machine_mode mode)
>  {
>    /* This represents our current view of how many bits
>       make up the mantissa.  */
> @@ -25134,14 +25162,45 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
>  
>    x = unwrap_const_vec_duplicate (x);
>    mode = GET_MODE_INNER (mode);
> -  if (!CONST_DOUBLE_P (x))
> +  if (!CONST_DOUBLE_P (x)
> +      && !CONST_INT_P (x))
>      return false;
>  
>    if (mode == VOIDmode
> -      || (mode == HFmode && !TARGET_FP_F16INST))
> +      || ((mode == HFmode || mode == HImode) && !TARGET_FP_F16INST))
>      return false;
>  
> -  r = *CONST_DOUBLE_REAL_VALUE (x);
> +  /* If we have an integer bit pattern, decode it back into a real.
> +     real_from_target requires the representation to be split into
> +     32-bit values and then put into two host wide ints.  */
> +  if (CONST_INT_P (x))
> +    {
> +      HOST_WIDE_INT buf = INTVAL (x);
> +      long int as_long_ints[2];
> +      as_long_ints[0] = buf & 0xFFFFFFFF;
> +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> +      machine_mode fmode;
> +      switch (mode)
> +      {
> +      case HImode:
> +	fmode = HFmode;
> +	break;
> +      case SImode:
> +	fmode = SFmode;
> +	break;
> +      case DImode:
> +	fmode = DFmode;
> +	break;
> +      default:
> +	return false;
> +      }
> +
> +      real_from_target (&r, as_long_ints, fmode);
> +      if (out)
> +	*out = const_double_from_real_value (r, fmode);
> +    }
> +  else
> +    r = *CONST_DOUBLE_REAL_VALUE (x);
>  
>    /* We cannot represent infinities, NaNs or +/-zero.  We won't
>       know if we have +zero until we analyse the mantissa, but we
> @@ -25170,6 +25229,7 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
>       the value.  */
>    if (w.ulow () != 0)
>      return false;
> +
>    /* We have rejected the lower HOST_WIDE_INT, so update our
>       understanding of how many bits lie in the mantissa and
>       look only at the high HOST_WIDE_INT.  */
> @@ -25205,9 +25265,9 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
>    return (exponent >= 0 && exponent <= 7);
>  }
>  
> -/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR or BIC
> -   immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether to
> -   output MOVI/MVNI, ORR or BIC immediate.  */
> +/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR, BIC or
> +   FMOV immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether
> +   to output MOVI/MVNI, ORR or BIC immediate.  */
>  char*
>  aarch64_output_simd_mov_immediate (rtx const_vector, unsigned width,
>  				   enum simd_immediate_check which)
> diff --git a/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..e080afed8aa3578660027979335bfc859ca6bc91
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> @@ -0,0 +1,87 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Ofast" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** g:
> +** 	fmov	v0\.4s, 1\.0e\+0
> +** 	ret
> +*/
> +float32x4_t g(){
> +    return vdupq_n_f32(1);
> +}
> +
> +/*
> +** h:
> +** 	fmov	v0\.4s, 1\.0e\+0
> +** 	ret
> +*/
> +uint32x4_t h() {
> +    return vreinterpretq_u32_f32(g());
> +}
> +
> +/*
> +** f1:
> +** 	fmov	v0\.4s, 1\.0e\+0
> +** 	ret
> +*/
> +uint32x4_t f1() {
> +    return vdupq_n_u32(0x3f800000);
> +}
> +
> +/*
> +** f2:
> +** 	fmov	v0\.4s, 1\.5e\+0
> +** 	ret
> +*/
> +uint32x4_t f2() {
> +    return vdupq_n_u32(0x3FC00000);
> +}
> +
> +/*
> +** f3:
> +** 	fmov	v0\.4s, 1\.25e\+0
> +** 	ret
> +*/
> +uint32x4_t f3() {
> +    return vdupq_n_u32(0x3FA00000);
> +}
> +
> +/*
> +** f4:
> +** 	fmov	v0\.2d, 1\.0e\+0
> +** 	ret
> +*/
> +uint64x2_t f4() {
> +    return vdupq_n_u64(0x3FF0000000000000);
> +}
> +
> +/*
> +** fn4:
> +** 	fmov	v0\.2d, -1\.0e\+0
> +** 	ret
> +*/
> +uint64x2_t fn4() {
> +    return vdupq_n_u64(0xBFF0000000000000);
> +}
> +
> +/*
> +** f5:
> +** 	fmov	v0\.8h, 1\.5e\+0
> +** 	ret
> +*/
> +uint16x8_t f5() {
> +    return vdupq_n_u16(0x3E00);
> +}
> +
> +/*
> +** f6:
> +** 	adrp	x0, \.LC0
> +** 	ldr	q0, \[x0, #:lo12:\.LC0\]
> +** 	ret
> +*/
> +uint32x4_t f6() {
> +    return vdupq_n_u32(0x4f800000);
> +}
  
Tamar Christina Sept. 30, 2024, 5:43 p.m. UTC | #2
Thanks for the review,
Will get started on it but one question...

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, September 30, 2024 6:33 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH 2/2]AArch64: support encoding integer immediates using
> floating point moves
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > This patch extends our immediate SIMD generation cases to support generating
> > integer immediates using floating point operation if the integer immediate maps
> > to an exact FP value.
> >
> > As an example:
> >
> > uint32x4_t f1() {
> >     return vdupq_n_u32(0x3f800000);
> > }
> >
> > currently generates:
> >
> > f1:
> >         adrp    x0, .LC0
> >         ldr     q0, [x0, #:lo12:.LC0]
> >         ret
> >
> > i.e. a load, but with this change:
> >
> > f1:
> >         fmov    v0.4s, 1.0e+0
> >         ret
> >
> > Such immediates are common in e.g. our Math routines in glibc because they are
> > created to extract or mark part of an FP immediate as masks.
> 
> I agree this is a good thing to do.  The current code is too beholden
> to the original vector mode.  This patch relaxes it so that it isn't
> beholden to the original mode's class (integer vs. float), but it would
> still be beholden to the original mode's element size.
> 
> It looks like an alternative would be to remove:
> 
>   scalar_float_mode elt_float_mode;
>   if (n_elts == 1
>       && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
>     {
>       rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
>       if (aarch64_float_const_zero_rtx_p (elt)
> 	  || aarch64_float_const_representable_p (elt))
> 	{
> 	  if (info)
> 	    *info = simd_immediate_info (elt_float_mode, elt);
> 	  return true;
> 	}
>     }
> 
> and instead insert code:
> 
>   /* Get the repeating 8-byte value as an integer.  No endian correction
>      is needed here because bytes is already in lsb-first order.  */
>   unsigned HOST_WIDE_INT val64 = 0;
>   for (unsigned int i = 0; i < 8; i++)
>     val64 |= ((unsigned HOST_WIDE_INT) bytes[i % nbytes]
> 	      << (i * BITS_PER_UNIT));
> 
> ---> here
> 
>   if (vec_flags & VEC_SVE_DATA)
>     return aarch64_sve_valid_immediate (val64, info);
>   else
>     return aarch64_advsimd_valid_immediate (val64, info, which);
> 
> that tries to reduce val64 to the smallest repeating pattern,
> then tries to interpret that pattern as a float.  The reduction step
> could reuse the first part of aarch64_sve_valid_immediate, which
> calculates the narrowest repeating integer mode:
> 
>   scalar_int_mode mode = DImode;
>   unsigned int val32 = val64 & 0xffffffff;
>   if (val32 == (val64 >> 32))
>     {
>       mode = SImode;
>       unsigned int val16 = val32 & 0xffff;
>       if (val16 == (val32 >> 16))
> 	{
> 	  mode = HImode;
> 	  unsigned int val8 = val16 & 0xff;
> 	  if (val8 == (val16 >> 8))
> 	    mode = QImode;
> 	}
>     }
> 
> This would give us the candidate integer mode, to which we could
> apply float_mode_for_size (...).exists, as in the patch.
> 

I was doubting whether it's safe to use this or not.  That's why I listed
the modes explicitly using a switch statement.  Namely, I'm concerned about the
multiple 16-bit float formats.  From looking at the source of
float_mode_for_size, it just returns the first matching float mode, which makes it
pretty sensitive to the order of definition in aarch64/aarch64-modes.def.

Is it safe to assume that storage-only formats like BF16 will always be
listed after the general compute types?

Thanks,
Tamar

> In this case we would have the value as an integer, rather than
> as an rtx, so I think it would make sense to split out the part of
> aarch64_float_const_representable_p that processes the REAL_VALUE_TYPE.
> aarch64_simd_valid_immediate could then use the patch's:
> 
> > +      long int as_long_ints[2];
> > +      as_long_ints[0] = buf & 0xFFFFFFFF;
> > +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> > [...]
> > +      real_from_target (&r, as_long_ints, fmode);
> 
> with "buf" being "val64" in the code above, and "fmode" being the result
> of float_mode_for_size (...).exists.  aarch64_simd_valid_immediate
> would then pass "r" and and "fmode" to the new, split-out variant of
> aarch64_float_const_representable_p.  (I haven't checked the endiannes
> requirements for real_from_target.)
> 
> The split-out variant would still perform the HFmode test in:
> 
>   if (GET_MODE (x) == VOIDmode
>       || (GET_MODE (x) == HFmode && !TARGET_FP_F16INST))
>     return false;
> 
> The VOIDmode test is redundant and can be dropped.  AArch64 has always
> been a CONST_WIDE_INT target.
> 
> If we do that, we should probably also pass the integer mode calculated
> by the code quoted above down to aarch64_sve_valid_immediate (where it
> came from) and aarch64_advsimd_valid_immediate, since both of them would
> find it useful.  E.g.:
> 
>       /* Try using a replicated byte.  */
>       if (which == AARCH64_CHECK_MOV
> 	  && val16 == (val32 >> 16)
> 	  && val8 == (val16 >> 8))
> 	{
> 	  if (info)
> 	    *info = simd_immediate_info (QImode, val8);
> 	  return true;
> 	}
> 
> would become:
> 
>   /* Try using a replicated byte.  */
>   if (which == AARCH64_CHECK_MOV && mode == QImode)
>     {
>       if (info)
>         *info = simd_immediate_info (QImode, val8);
>       return true;
>     }
> 
> I realise that's quite a bit different from the patch as posted, sorry,
> and I've made it sound more complicated than it actually is.  But I think
> it should be both more general (because it ignores the element size as
> well as the mode class) and a little simpler.
> 
> The proposed split of aarch64_float_const_representable_p would be
> a replacement for patch 1 in the series.  The current rtx version
> of aarch64_float_const_representable_p would not need to take a mode,
> but the REAL_VALUE_TYPE interface would.
> 
> Thanks,
> Richard
> 
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-goin> issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-protos.h
> (aarch64_float_const_representable_p):
> > 	Add overload.
> > 	* config/aarch64/aarch64.cc (aarch64_float_const_zero_rtx_p): Reject
> > 	integer modes.
> > 	(aarch64_simd_valid_immediate, aarch64_float_const_representable_p):
> > 	Check if integer value maps to an exact FP constant.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/const_create_using_fmov.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> > index
> 7a84acc59569da0b50af2300615db561a5de460a..6c683ea2d93e1b733cfe49fac
> 38381ea6451fd55 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -974,6 +974,7 @@ void aarch64_split_simd_move (rtx, rtx);
> >
> >  /* Check for a legitimate floating point constant for FMOV.  */
> >  bool aarch64_float_const_representable_p (rtx, machine_mode);
> > +bool aarch64_float_const_representable_p (rtx *, rtx, machine_mode);
> >
> >  extern int aarch64_epilogue_uses (int);
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index
> 1842f6ecf6330f11a64545d0903240c89b104ffc..2d44608d93b8e7542ea8d5eb
> 4c3f99c9f88e70ed 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -10991,7 +10991,8 @@ aarch64_float_const_zero_rtx_p (rtx x)
> >    /* 0.0 in Decimal Floating Point cannot be represented by #0 or
> >       zr as our callers expect, so no need to check the actual
> >       value if X is of Decimal Floating Point type.  */
> > -  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT)
> > +  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT
> > +      || !CONST_DOUBLE_P (x))
> >      return false;
> >
> >    if (REAL_VALUE_MINUS_ZERO (*CONST_DOUBLE_REAL_VALUE (x)))
> > @@ -23026,17 +23027,30 @@ aarch64_simd_valid_immediate (rtx op,
> simd_immediate_info *info,
> >    else
> >      return false;
> >
> > -  scalar_float_mode elt_float_mode;
> > -  if (n_elts == 1
> > -      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
> > +  if (n_elts == 1)
> >      {
> >        rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
> > +      rtx new_elt = NULL_RTX;
> >        if (aarch64_float_const_zero_rtx_p (elt)
> > -	  || aarch64_float_const_representable_p (elt, elt_mode))
> > -	{
> > -	  if (info)
> > -	    *info = simd_immediate_info (elt_float_mode, elt);
> > -	  return true;
> > +	  || aarch64_float_const_representable_p (&new_elt, elt, elt_mode))
> > +	{
> > +	  scalar_float_mode elt_float_mode;
> > +	  auto bitsize = GET_MODE_UNIT_BITSIZE (elt_mode);
> > +	  if (is_a <scalar_float_mode> (elt_mode))
> > +	    elt_float_mode = as_a <scalar_float_mode> (elt_mode);
> > +	  else if (which == AARCH64_CHECK_MOV
> > +		   && new_elt
> > +		   && float_mode_for_size (bitsize).exists (&elt_float_mode))
> > +	    elt = new_elt;
> > +	  else
> > +	    elt = NULL_RTX;
> > +
> > +	  if (elt != NULL_RTX)
> > +	    {
> > +	      if (info)
> > +		*info = simd_immediate_info (elt_float_mode, elt);
> > +	      return true;
> > +	    }
> >  	}
> >      }
> >
> > @@ -25121,8 +25135,22 @@ aarch64_c_mode_for_suffix (char suffix)
> >
> >  /* Return true iff X with mode MODE can be represented by a quarter-precision
> >     floating point immediate operand X.  Note, we cannot represent 0.0.  */
> > +
> >  bool
> >  aarch64_float_const_representable_p (rtx x, machine_mode mode)
> > +{
> > +  return aarch64_float_const_representable_p (NULL, x, mode);
> > +}
> > +
> > +
> > +/* Return true iff X with mode MODE can be represented by a quarter-precision
> > +   floating point immediate operand X.  Note, we cannot represent 0.0.
> > +   If the value is a CONST_INT that can be represented as an exact floating
> > +   point then OUT will contain the new floating point value to emit to generate
> > +   the integer constant.  */
> > +
> > +bool
> > +aarch64_float_const_representable_p (rtx *out, rtx x, machine_mode mode)
> >  {
> >    /* This represents our current view of how many bits
> >       make up the mantissa.  */
> > @@ -25134,14 +25162,45 @@ aarch64_float_const_representable_p (rtx x,
> machine_mode mode)
> >
> >    x = unwrap_const_vec_duplicate (x);
> >    mode = GET_MODE_INNER (mode);
> > -  if (!CONST_DOUBLE_P (x))
> > +  if (!CONST_DOUBLE_P (x)
> > +      && !CONST_INT_P (x))
> >      return false;
> >
> >    if (mode == VOIDmode
> > -      || (mode == HFmode && !TARGET_FP_F16INST))
> > +      || ((mode == HFmode || mode == HImode) && !TARGET_FP_F16INST))
> >      return false;
> >
> > -  r = *CONST_DOUBLE_REAL_VALUE (x);
> > +  /* If we have an integer bit pattern, decode it back into a real.
> > +     real_from_target requires the representation to be split into
> > +     32-bit values and then put into two host wide ints.  */
> > +  if (CONST_INT_P (x))
> > +    {
> > +      HOST_WIDE_INT buf = INTVAL (x);
> > +      long int as_long_ints[2];
> > +      as_long_ints[0] = buf & 0xFFFFFFFF;
> > +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> > +      machine_mode fmode;
> > +      switch (mode)
> > +      {
> > +      case HImode:
> > +	fmode = HFmode;
> > +	break;
> > +      case SImode:
> > +	fmode = SFmode;
> > +	break;
> > +      case DImode:
> > +	fmode = DFmode;
> > +	break;
> > +      default:
> > +	return false;
> > +      }
> > +
> > +      real_from_target (&r, as_long_ints, fmode);
> > +      if (out)
> > +	*out = const_double_from_real_value (r, fmode);
> > +    }
> > +  else
> > +    r = *CONST_DOUBLE_REAL_VALUE (x);
> >
> >    /* We cannot represent infinities, NaNs or +/-zero.  We won't
> >       know if we have +zero until we analyse the mantissa, but we
> > @@ -25170,6 +25229,7 @@ aarch64_float_const_representable_p (rtx x,
> machine_mode mode)
> >       the value.  */
> >    if (w.ulow () != 0)
> >      return false;
> > +
> >    /* We have rejected the lower HOST_WIDE_INT, so update our
> >       understanding of how many bits lie in the mantissa and
> >       look only at the high HOST_WIDE_INT.  */
> > @@ -25205,9 +25265,9 @@ aarch64_float_const_representable_p (rtx x,
> machine_mode mode)
> >    return (exponent >= 0 && exponent <= 7);
> >  }
> >
> > -/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR or BIC
> > -   immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects
> whether to
> > -   output MOVI/MVNI, ORR or BIC immediate.  */
> > +/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR, BIC
> or
> > +   FMOV immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH
> selects whether
> > +   to output MOVI/MVNI, ORR or BIC immediate.  */
> >  char*
> >  aarch64_output_simd_mov_immediate (rtx const_vector, unsigned width,
> >  				   enum simd_immediate_check which)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..e080afed8aa35786600279
> 79335bfc859ca6bc91
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> > @@ -0,0 +1,87 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-march=armv9-a -Ofast" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/*
> > +** g:
> > +** 	fmov	v0\.4s, 1\.0e\+0
> > +** 	ret
> > +*/
> > +float32x4_t g(){
> > +    return vdupq_n_f32(1);
> > +}
> > +
> > +/*
> > +** h:
> > +** 	fmov	v0\.4s, 1\.0e\+0
> > +** 	ret
> > +*/
> > +uint32x4_t h() {
> > +    return vreinterpretq_u32_f32(g());
> > +}
> > +
> > +/*
> > +** f1:
> > +** 	fmov	v0\.4s, 1\.0e\+0
> > +** 	ret
> > +*/
> > +uint32x4_t f1() {
> > +    return vdupq_n_u32(0x3f800000);
> > +}
> > +
> > +/*
> > +** f2:
> > +** 	fmov	v0\.4s, 1\.5e\+0
> > +** 	ret
> > +*/
> > +uint32x4_t f2() {
> > +    return vdupq_n_u32(0x3FC00000);
> > +}
> > +
> > +/*
> > +** f3:
> > +** 	fmov	v0\.4s, 1\.25e\+0
> > +** 	ret
> > +*/
> > +uint32x4_t f3() {
> > +    return vdupq_n_u32(0x3FA00000);
> > +}
> > +
> > +/*
> > +** f4:
> > +** 	fmov	v0\.2d, 1\.0e\+0
> > +** 	ret
> > +*/
> > +uint64x2_t f4() {
> > +    return vdupq_n_u64(0x3FF0000000000000);
> > +}
> > +
> > +/*
> > +** fn4:
> > +** 	fmov	v0\.2d, -1\.0e\+0
> > +** 	ret
> > +*/
> > +uint64x2_t fn4() {
> > +    return vdupq_n_u64(0xBFF0000000000000);
> > +}
> > +
> > +/*
> > +** f5:
> > +** 	fmov	v0\.8h, 1\.5e\+0
> > +** 	ret
> > +*/
> > +uint16x8_t f5() {
> > +    return vdupq_n_u16(0x3E00);
> > +}
> > +
> > +/*
> > +** f6:
> > +** 	adrp	x0, \.LC0
> > +** 	ldr	q0, \[x0, #:lo12:\.LC0\]
> > +** 	ret
> > +*/
> > +uint32x4_t f6() {
> > +    return vdupq_n_u32(0x4f800000);
> > +}
  
Richard Sandiford Oct. 1, 2024, 8:27 a.m. UTC | #3
Tamar Christina <Tamar.Christina@arm.com> writes:
> Thanks for the review,
> Will get started on it but one question...
>
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, September 30, 2024 6:33 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
>> Subject: Re: [PATCH 2/2]AArch64: support encoding integer immediates using
>> floating point moves
>> 
>> Tamar Christina <tamar.christina@arm.com> writes:
>> > Hi All,
>> >
>> > This patch extends our immediate SIMD generation cases to support generating
>> > integer immediates using floating point operation if the integer immediate maps
>> > to an exact FP value.
>> >
>> > As an example:
>> >
>> > uint32x4_t f1() {
>> >     return vdupq_n_u32(0x3f800000);
>> > }
>> >
>> > currently generates:
>> >
>> > f1:
>> >         adrp    x0, .LC0
>> >         ldr     q0, [x0, #:lo12:.LC0]
>> >         ret
>> >
>> > i.e. a load, but with this change:
>> >
>> > f1:
>> >         fmov    v0.4s, 1.0e+0
>> >         ret
>> >
>> > Such immediates are common in e.g. our Math routines in glibc because they are
>> > created to extract or mark part of an FP immediate as masks.
>> 
>> I agree this is a good thing to do.  The current code is too beholden
>> to the original vector mode.  This patch relaxes it so that it isn't
>> beholden to the original mode's class (integer vs. float), but it would
>> still be beholden to the original mode's element size.
>> 
>> It looks like an alternative would be to remove:
>> 
>>   scalar_float_mode elt_float_mode;
>>   if (n_elts == 1
>>       && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
>>     {
>>       rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
>>       if (aarch64_float_const_zero_rtx_p (elt)
>> 	  || aarch64_float_const_representable_p (elt))
>> 	{
>> 	  if (info)
>> 	    *info = simd_immediate_info (elt_float_mode, elt);
>> 	  return true;
>> 	}
>>     }
>> 
>> and instead insert code:
>> 
>>   /* Get the repeating 8-byte value as an integer.  No endian correction
>>      is needed here because bytes is already in lsb-first order.  */
>>   unsigned HOST_WIDE_INT val64 = 0;
>>   for (unsigned int i = 0; i < 8; i++)
>>     val64 |= ((unsigned HOST_WIDE_INT) bytes[i % nbytes]
>> 	      << (i * BITS_PER_UNIT));
>> 
>> ---> here
>> 
>>   if (vec_flags & VEC_SVE_DATA)
>>     return aarch64_sve_valid_immediate (val64, info);
>>   else
>>     return aarch64_advsimd_valid_immediate (val64, info, which);
>> 
>> that tries to reduce val64 to the smallest repeating pattern,
>> then tries to interpret that pattern as a float.  The reduction step
>> could reuse the first part of aarch64_sve_valid_immediate, which
>> calculates the narrowest repeating integer mode:
>> 
>>   scalar_int_mode mode = DImode;
>>   unsigned int val32 = val64 & 0xffffffff;
>>   if (val32 == (val64 >> 32))
>>     {
>>       mode = SImode;
>>       unsigned int val16 = val32 & 0xffff;
>>       if (val16 == (val32 >> 16))
>> 	{
>> 	  mode = HImode;
>> 	  unsigned int val8 = val16 & 0xff;
>> 	  if (val8 == (val16 >> 8))
>> 	    mode = QImode;
>> 	}
>>     }
>> 
>> This would give us the candidate integer mode, to which we could
>> apply float_mode_for_size (...).exists, as in the patch.
>> 
>
> I was doubting whether it's safe to use this or not.  That's why I listed
> the modes using a switch statement.  Namely I'm concerned about the
> multiple float 16 format.  It looks like from looking at the source of
> float_mode_for_size that it just returns the first float mode, so makes it
> pretty sensitive to the order of definition in aarch64/aarch64-modes.def.
>
> Is it safe to assume that storage only formats like BF16 will always be
> listed after general compute types?

Ah yeah, fair point.  In that case, I agree it'd be better to be explicit.

Thanks,
Richard
  
Tamar Christina Oct. 2, 2024, 7:04 p.m. UTC | #4
Hi,

> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, September 30, 2024 6:33 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH 2/2]AArch64: support encoding integer immediates using
> floating point moves
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > This patch extends our immediate SIMD generation cases to support generating
> > integer immediates using floating point operation if the integer immediate maps
> > to an exact FP value.
> >
> > As an example:
> >
> > uint32x4_t f1() {
> >     return vdupq_n_u32(0x3f800000);
> > }
> >
> > currently generates:
> >
> > f1:
> >         adrp    x0, .LC0
> >         ldr     q0, [x0, #:lo12:.LC0]
> >         ret
> >
> > i.e. a load, but with this change:
> >
> > f1:
> >         fmov    v0.4s, 1.0e+0
> >         ret
> >
> > Such immediates are common in e.g. our Math routines in glibc because they are
> > created to extract or mark part of an FP immediate as masks.
> 
> I agree this is a good thing to do.  The current code is too beholden
> to the original vector mode.  This patch relaxes it so that it isn't
> beholden to the original mode's class (integer vs. float), but it would
> still be beholden to the original mode's element size.

I've implemented this approach and it works, but I'm struggling with an inconsistency
in how zeros are created.

There are about 800 SVE ACLE tests like acge_f16.c that check that a zero is created
using a mov of the same element size as the usage.  So I added an exception for
zero to use the original input element mode.

But then there are about 400 other SVE ACLE tests that actually check that zeros are
created using byte moves, like dup_128_s16_z even though they're used as ints.

So these two are in conflict.  Do you care which way I resolve this?  Since it's zero
it shouldn't matter how it's created, but perhaps there's a reason why some
tests check for the specific instruction?
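
For illustration (the exact scan patterns differ per test), the first group
effectively expects the zero to match the element size of the use, e.g.

	mov	z0.h, #0

while the second group expects the byte form regardless of how the value is
used afterwards, e.g.

	mov	z0.b, #0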

Thanks,
Tamar
> 
> It looks like an alternative would be to remove:
> 
>   scalar_float_mode elt_float_mode;
>   if (n_elts == 1
>       && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
>     {
>       rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
>       if (aarch64_float_const_zero_rtx_p (elt)
> 	  || aarch64_float_const_representable_p (elt))
> 	{
> 	  if (info)
> 	    *info = simd_immediate_info (elt_float_mode, elt);
> 	  return true;
> 	}
>     }
> 
> and instead insert code:
> 
>   /* Get the repeating 8-byte value as an integer.  No endian correction
>      is needed here because bytes is already in lsb-first order.  */
>   unsigned HOST_WIDE_INT val64 = 0;
>   for (unsigned int i = 0; i < 8; i++)
>     val64 |= ((unsigned HOST_WIDE_INT) bytes[i % nbytes]
> 	      << (i * BITS_PER_UNIT));
> 
> ---> here
> 
>   if (vec_flags & VEC_SVE_DATA)
>     return aarch64_sve_valid_immediate (val64, info);
>   else
>     return aarch64_advsimd_valid_immediate (val64, info, which);
> 
> that tries to reduce val64 to the smallest repeating pattern,
> then tries to interpret that pattern as a float.  The reduction step
> could reuse the first part of aarch64_sve_valid_immediate, which
> calculates the narrowest repeating integer mode:
> 
>   scalar_int_mode mode = DImode;
>   unsigned int val32 = val64 & 0xffffffff;
>   if (val32 == (val64 >> 32))
>     {
>       mode = SImode;
>       unsigned int val16 = val32 & 0xffff;
>       if (val16 == (val32 >> 16))
> 	{
> 	  mode = HImode;
> 	  unsigned int val8 = val16 & 0xff;
> 	  if (val8 == (val16 >> 8))
> 	    mode = QImode;
> 	}
>     }
> 
> This would give us the candidate integer mode, to which we could
> apply float_mode_for_size (...).exists, as in the patch.
> 
> In this case we would have the value as an integer, rather than
> as an rtx, so I think it would make sense to split out the part of
> aarch64_float_const_representable_p that processes the REAL_VALUE_TYPE.
> aarch64_simd_valid_immediate could then use the patch's:
> 
> > +      long int as_long_ints[2];
> > +      as_long_ints[0] = buf & 0xFFFFFFFF;
> > +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> > [...]
> > +      real_from_target (&r, as_long_ints, fmode);
> 
> with "buf" being "val64" in the code above, and "fmode" being the result
> of float_mode_for_size (...).exists.  aarch64_simd_valid_immediate
> would then pass "r" and and "fmode" to the new, split-out variant of
> aarch64_float_const_representable_p.  (I haven't checked the endiannes
> requirements for real_from_target.)
> 
> The split-out variant would still perform the HFmode test in:
> 
>   if (GET_MODE (x) == VOIDmode
>       || (GET_MODE (x) == HFmode && !TARGET_FP_F16INST))
>     return false;
> 
> The VOIDmode test is redundant and can be dropped.  AArch64 has always
> been a CONST_WIDE_INT target.
> 
> If we do that, we should probably also pass the integer mode calculated
> by the code quoted above down to aarch64_sve_valid_immediate (where it
> came from) and aarch64_advsimd_valid_immediate, since both of them would
> find it useful.  E.g.:
> 
>       /* Try using a replicated byte.  */
>       if (which == AARCH64_CHECK_MOV
> 	  && val16 == (val32 >> 16)
> 	  && val8 == (val16 >> 8))
> 	{
> 	  if (info)
> 	    *info = simd_immediate_info (QImode, val8);
> 	  return true;
> 	}
> 
> would become:
> 
>   /* Try using a replicated byte.  */
>   if (which == AARCH64_CHECK_MOV && mode == QImode)
>     {
>       if (info)
>         *info = simd_immediate_info (QImode, val8);
>       return true;
>     }
> 
> I realise that's quite a bit different from the patch as posted, sorry,
> and I've made it sound more complicated than it actually is.  But I think
> it should be both more general (because it ignores the element size as
> well as the mode class) and a little simpler.
> 
> The proposed split of aarch64_float_const_representable_p would be
> a replacement for patch 1 in the series.  The current rtx version
> of aarch64_float_const_representable_p would not need to take a mode,
> but the REAL_VALUE_TYPE interface would.
> 
> Thanks,
> Richard
> 
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-goin> issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-protos.h
> (aarch64_float_const_representable_p):
> > 	Add overload.
> > 	* config/aarch64/aarch64.cc (aarch64_float_const_zero_rtx_p): Reject
> > 	integer modes.
> > 	(aarch64_simd_valid_immediate, aarch64_float_const_representable_p):
> > 	Check if integer value maps to an exact FP constant.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/const_create_using_fmov.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h
> b/gcc/config/aarch64/aarch64-protos.h
> > index
> 7a84acc59569da0b50af2300615db561a5de460a..6c683ea2d93e1b733cfe49fac
> 38381ea6451fd55 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -974,6 +974,7 @@ void aarch64_split_simd_move (rtx, rtx);
> >
> >  /* Check for a legitimate floating point constant for FMOV.  */
> >  bool aarch64_float_const_representable_p (rtx, machine_mode);
> > +bool aarch64_float_const_representable_p (rtx *, rtx, machine_mode);
> >
> >  extern int aarch64_epilogue_uses (int);
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index
> 1842f6ecf6330f11a64545d0903240c89b104ffc..2d44608d93b8e7542ea8d5eb
> 4c3f99c9f88e70ed 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -10991,7 +10991,8 @@ aarch64_float_const_zero_rtx_p (rtx x)
> >    /* 0.0 in Decimal Floating Point cannot be represented by #0 or
> >       zr as our callers expect, so no need to check the actual
> >       value if X is of Decimal Floating Point type.  */
> > -  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT)
> > +  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT
> > +      || !CONST_DOUBLE_P (x))
> >      return false;
> >
> >    if (REAL_VALUE_MINUS_ZERO (*CONST_DOUBLE_REAL_VALUE (x)))
> > @@ -23026,17 +23027,30 @@ aarch64_simd_valid_immediate (rtx op,
> simd_immediate_info *info,
> >    else
> >      return false;
> >
> > -  scalar_float_mode elt_float_mode;
> > -  if (n_elts == 1
> > -      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
> > +  if (n_elts == 1)
> >      {
> >        rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
> > +      rtx new_elt = NULL_RTX;
> >        if (aarch64_float_const_zero_rtx_p (elt)
> > -	  || aarch64_float_const_representable_p (elt, elt_mode))
> > -	{
> > -	  if (info)
> > -	    *info = simd_immediate_info (elt_float_mode, elt);
> > -	  return true;
> > +	  || aarch64_float_const_representable_p (&new_elt, elt, elt_mode))
> > +	{
> > +	  scalar_float_mode elt_float_mode;
> > +	  auto bitsize = GET_MODE_UNIT_BITSIZE (elt_mode);
> > +	  if (is_a <scalar_float_mode> (elt_mode))
> > +	    elt_float_mode = as_a <scalar_float_mode> (elt_mode);
> > +	  else if (which == AARCH64_CHECK_MOV
> > +		   && new_elt
> > +		   && float_mode_for_size (bitsize).exists (&elt_float_mode))
> > +	    elt = new_elt;
> > +	  else
> > +	    elt = NULL_RTX;
> > +
> > +	  if (elt != NULL_RTX)
> > +	    {
> > +	      if (info)
> > +		*info = simd_immediate_info (elt_float_mode, elt);
> > +	      return true;
> > +	    }
> >  	}
> >      }
> >
> > @@ -25121,8 +25135,22 @@ aarch64_c_mode_for_suffix (char suffix)
> >
> >  /* Return true iff X with mode MODE can be represented by a quarter-precision
> >     floating point immediate operand X.  Note, we cannot represent 0.0.  */
> > +
> >  bool
> >  aarch64_float_const_representable_p (rtx x, machine_mode mode)
> > +{
> > +  return aarch64_float_const_representable_p (NULL, x, mode);
> > +}
> > +
> > +
> > +/* Return true iff X with mode MODE can be represented by a quarter-precision
> > +   floating point immediate operand X.  Note, we cannot represent 0.0.
> > +   If the value is a CONST_INT that can be represented as an exact floating
> > +   point then OUT will contain the new floating point value to emit to generate
> > +   the integer constant.  */
> > +
> > +bool
> > +aarch64_float_const_representable_p (rtx *out, rtx x, machine_mode mode)
> >  {
> >    /* This represents our current view of how many bits
> >       make up the mantissa.  */
> > @@ -25134,14 +25162,45 @@ aarch64_float_const_representable_p (rtx x,
> machine_mode mode)
> >
> >    x = unwrap_const_vec_duplicate (x);
> >    mode = GET_MODE_INNER (mode);
> > -  if (!CONST_DOUBLE_P (x))
> > +  if (!CONST_DOUBLE_P (x)
> > +      && !CONST_INT_P (x))
> >      return false;
> >
> >    if (mode == VOIDmode
> > -      || (mode == HFmode && !TARGET_FP_F16INST))
> > +      || ((mode == HFmode || mode == HImode) && !TARGET_FP_F16INST))
> >      return false;
> >
> > -  r = *CONST_DOUBLE_REAL_VALUE (x);
> > +  /* If we have an integer bit pattern, decode it back into a real.
> > +     real_from_target requires the representation to be split into
> > +     32-bit values and then put into two host wide ints.  */
> > +  if (CONST_INT_P (x))
> > +    {
> > +      HOST_WIDE_INT buf = INTVAL (x);
> > +      long int as_long_ints[2];
> > +      as_long_ints[0] = buf & 0xFFFFFFFF;
> > +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> > +      machine_mode fmode;
> > +      switch (mode)
> > +      {
> > +      case HImode:
> > +	fmode = HFmode;
> > +	break;
> > +      case SImode:
> > +	fmode = SFmode;
> > +	break;
> > +      case DImode:
> > +	fmode = DFmode;
> > +	break;
> > +      default:
> > +	return false;
> > +      }
> > +
> > +      real_from_target (&r, as_long_ints, fmode);
> > +      if (out)
> > +	*out = const_double_from_real_value (r, fmode);
> > +    }
> > +  else
> > +    r = *CONST_DOUBLE_REAL_VALUE (x);
> >
> >    /* We cannot represent infinities, NaNs or +/-zero.  We won't
> >       know if we have +zero until we analyse the mantissa, but we
> > @@ -25170,6 +25229,7 @@ aarch64_float_const_representable_p (rtx x,
> machine_mode mode)
> >       the value.  */
> >    if (w.ulow () != 0)
> >      return false;
> > +
> >    /* We have rejected the lower HOST_WIDE_INT, so update our
> >       understanding of how many bits lie in the mantissa and
> >       look only at the high HOST_WIDE_INT.  */
> > @@ -25205,9 +25265,9 @@ aarch64_float_const_representable_p (rtx x,
> machine_mode mode)
> >    return (exponent >= 0 && exponent <= 7);
> >  }
> >
> > -/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR or BIC
> > -   immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects
> whether to
> > -   output MOVI/MVNI, ORR or BIC immediate.  */
> > +/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR, BIC
> or
> > +   FMOV immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH
> selects whether
> > +   to output MOVI/MVNI, ORR or BIC immediate.  */
> >  char*
> >  aarch64_output_simd_mov_immediate (rtx const_vector, unsigned width,
> >  				   enum simd_immediate_check which)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..e080afed8aa35786600279
> 79335bfc859ca6bc91
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> > @@ -0,0 +1,87 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-march=armv9-a -Ofast" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/*
> > +** g:
> > +** 	fmov	v0\.4s, 1\.0e\+0
> > +** 	ret
> > +*/
> > +float32x4_t g(){
> > +    return vdupq_n_f32(1);
> > +}
> > +
> > +/*
> > +** h:
> > +** 	fmov	v0\.4s, 1\.0e\+0
> > +** 	ret
> > +*/
> > +uint32x4_t h() {
> > +    return vreinterpretq_u32_f32(g());
> > +}
> > +
> > +/*
> > +** f1:
> > +** 	fmov	v0\.4s, 1\.0e\+0
> > +** 	ret
> > +*/
> > +uint32x4_t f1() {
> > +    return vdupq_n_u32(0x3f800000);
> > +}
> > +
> > +/*
> > +** f2:
> > +** 	fmov	v0\.4s, 1\.5e\+0
> > +** 	ret
> > +*/
> > +uint32x4_t f2() {
> > +    return vdupq_n_u32(0x3FC00000);
> > +}
> > +
> > +/*
> > +** f3:
> > +** 	fmov	v0\.4s, 1\.25e\+0
> > +** 	ret
> > +*/
> > +uint32x4_t f3() {
> > +    return vdupq_n_u32(0x3FA00000);
> > +}
> > +
> > +/*
> > +** f4:
> > +** 	fmov	v0\.2d, 1\.0e\+0
> > +** 	ret
> > +*/
> > +uint64x2_t f4() {
> > +    return vdupq_n_u64(0x3FF0000000000000);
> > +}
> > +
> > +/*
> > +** fn4:
> > +** 	fmov	v0\.2d, -1\.0e\+0
> > +** 	ret
> > +*/
> > +uint64x2_t fn4() {
> > +    return vdupq_n_u64(0xBFF0000000000000);
> > +}
> > +
> > +/*
> > +** f5:
> > +** 	fmov	v0\.8h, 1\.5e\+0
> > +** 	ret
> > +*/
> > +uint16x8_t f5() {
> > +    return vdupq_n_u16(0x3E00);
> > +}
> > +
> > +/*
> > +** f6:
> > +** 	adrp	x0, \.LC0
> > +** 	ldr	q0, \[x0, #:lo12:\.LC0\]
> > +** 	ret
> > +*/
> > +uint32x4_t f6() {
> > +    return vdupq_n_u32(0x4f800000);
> > +}
  
Richard Sandiford Oct. 3, 2024, 8:17 a.m. UTC | #5
Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi,
>
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, September 30, 2024 6:33 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
>> Subject: Re: [PATCH 2/2]AArch64: support encoding integer immediates using
>> floating point moves
>> 
>> Tamar Christina <tamar.christina@arm.com> writes:
>> > Hi All,
>> >
>> > This patch extends our immediate SIMD generation cases to support generating
>> > integer immediates using floating point operation if the integer immediate maps
>> > to an exact FP value.
>> >
>> > As an example:
>> >
>> > uint32x4_t f1() {
>> >     return vdupq_n_u32(0x3f800000);
>> > }
>> >
>> > currently generates:
>> >
>> > f1:
>> >         adrp    x0, .LC0
>> >         ldr     q0, [x0, #:lo12:.LC0]
>> >         ret
>> >
>> > i.e. a load, but with this change:
>> >
>> > f1:
>> >         fmov    v0.4s, 1.0e+0
>> >         ret
>> >
>> > Such immediates are common in e.g. our Math routines in glibc because they are
>> > created to extract or mark part of an FP immediate as masks.
>> 
>> I agree this is a good thing to do.  The current code is too beholden
>> to the original vector mode.  This patch relaxes it so that it isn't
>> beholden to the original mode's class (integer vs. float), but it would
>> still be beholden to the original mode's element size.
>
> I've implemented this approach and it works but I'm struggling with an inconsistency
> in how zeros are created.
>
> There are about 800 SVE ACLE tests like acge_f16.c that check that a zero is created
> using a mov of the same sized register as the usage.  So I added an exception for
> zero to use the original input element mode.
>
> But then there are about 400 other SVE ACLE tests that actually check that zeros are
> created using byte moves, like dup_128_s16_z even though they're used as ints.
>
> So these two are in conflict.  Do you care which way I resolve this?  since it's zero
> it shouldn't matter how they're created but perhaps there's a reason why some
> test check for the specific instruction?

No, I think it was an oversight.  Any element size would be correct.

Using byte moves sounds like a good thing.  It would be good to
share constants at some point (like we do with ptrues) and using
the smallest element size would then be the natural choice.

Sorry for the drudge work in updating all the tests.  Hope that
generalising them to be size-agnostic turns out to be sed-able,
or at least a simple script.

Thanks,
Richard
  

Patch

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 7a84acc59569da0b50af2300615db561a5de460a..6c683ea2d93e1b733cfe49fac38381ea6451fd55 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -974,6 +974,7 @@  void aarch64_split_simd_move (rtx, rtx);
 
 /* Check for a legitimate floating point constant for FMOV.  */
 bool aarch64_float_const_representable_p (rtx, machine_mode);
+bool aarch64_float_const_representable_p (rtx *, rtx, machine_mode);
 
 extern int aarch64_epilogue_uses (int);
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 1842f6ecf6330f11a64545d0903240c89b104ffc..2d44608d93b8e7542ea8d5eb4c3f99c9f88e70ed 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -10991,7 +10991,8 @@  aarch64_float_const_zero_rtx_p (rtx x)
   /* 0.0 in Decimal Floating Point cannot be represented by #0 or
      zr as our callers expect, so no need to check the actual
      value if X is of Decimal Floating Point type.  */
-  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT)
+  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT
+      || !CONST_DOUBLE_P (x))
     return false;
 
   if (REAL_VALUE_MINUS_ZERO (*CONST_DOUBLE_REAL_VALUE (x)))
@@ -23026,17 +23027,30 @@  aarch64_simd_valid_immediate (rtx op, simd_immediate_info *info,
   else
     return false;
 
-  scalar_float_mode elt_float_mode;
-  if (n_elts == 1
-      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
+  if (n_elts == 1)
     {
       rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
+      rtx new_elt = NULL_RTX;
       if (aarch64_float_const_zero_rtx_p (elt)
-	  || aarch64_float_const_representable_p (elt, elt_mode))
-	{
-	  if (info)
-	    *info = simd_immediate_info (elt_float_mode, elt);
-	  return true;
+	  || aarch64_float_const_representable_p (&new_elt, elt, elt_mode))
+	{
+	  scalar_float_mode elt_float_mode;
+	  auto bitsize = GET_MODE_UNIT_BITSIZE (elt_mode);
+	  if (is_a <scalar_float_mode> (elt_mode))
+	    elt_float_mode = as_a <scalar_float_mode> (elt_mode);
+	  else if (which == AARCH64_CHECK_MOV
+		   && new_elt
+		   && float_mode_for_size (bitsize).exists (&elt_float_mode))
+	    elt = new_elt;
+	  else
+	    elt = NULL_RTX;
+
+	  if (elt != NULL_RTX)
+	    {
+	      if (info)
+		*info = simd_immediate_info (elt_float_mode, elt);
+	      return true;
+	    }
 	}
     }
 
@@ -25121,8 +25135,22 @@  aarch64_c_mode_for_suffix (char suffix)
 
 /* Return true iff X with mode MODE can be represented by a quarter-precision
    floating point immediate operand X.  Note, we cannot represent 0.0.  */
+
 bool
 aarch64_float_const_representable_p (rtx x, machine_mode mode)
+{
+  return aarch64_float_const_representable_p (NULL, x, mode);
+}
+
+
+/* Return true iff X with mode MODE can be represented by a quarter-precision
+   floating point immediate operand X.  Note, we cannot represent 0.0.
+   If the value is a CONST_INT that can be represented as an exact floating
+   point then OUT will contain the new floating point value to emit to generate
+   the integer constant.  */
+
+bool
+aarch64_float_const_representable_p (rtx *out, rtx x, machine_mode mode)
 {
   /* This represents our current view of how many bits
      make up the mantissa.  */
@@ -25134,14 +25162,45 @@  aarch64_float_const_representable_p (rtx x, machine_mode mode)
 
   x = unwrap_const_vec_duplicate (x);
   mode = GET_MODE_INNER (mode);
-  if (!CONST_DOUBLE_P (x))
+  if (!CONST_DOUBLE_P (x)
+      && !CONST_INT_P (x))
     return false;
 
   if (mode == VOIDmode
-      || (mode == HFmode && !TARGET_FP_F16INST))
+      || ((mode == HFmode || mode == HImode) && !TARGET_FP_F16INST))
     return false;
 
-  r = *CONST_DOUBLE_REAL_VALUE (x);
+  /* If we have an integer bit pattern, decode it back into a real.
+     real_from_target requires the representation to be split into
+     32-bit values and then put into two host wide ints.  */
+  if (CONST_INT_P (x))
+    {
+      HOST_WIDE_INT buf = INTVAL (x);
+      long int as_long_ints[2];
+      as_long_ints[0] = buf & 0xFFFFFFFF;
+      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
+      machine_mode fmode;
+      switch (mode)
+      {
+      case HImode:
+	fmode = HFmode;
+	break;
+      case SImode:
+	fmode = SFmode;
+	break;
+      case DImode:
+	fmode = DFmode;
+	break;
+      default:
+	return false;
+      }
+
+      real_from_target (&r, as_long_ints, fmode);
+      if (out)
+	*out = const_double_from_real_value (r, fmode);
+    }
+  else
+    r = *CONST_DOUBLE_REAL_VALUE (x);
 
   /* We cannot represent infinities, NaNs or +/-zero.  We won't
      know if we have +zero until we analyse the mantissa, but we
@@ -25170,6 +25229,7 @@  aarch64_float_const_representable_p (rtx x, machine_mode mode)
      the value.  */
   if (w.ulow () != 0)
     return false;
+
   /* We have rejected the lower HOST_WIDE_INT, so update our
      understanding of how many bits lie in the mantissa and
      look only at the high HOST_WIDE_INT.  */
@@ -25205,9 +25265,9 @@  aarch64_float_const_representable_p (rtx x, machine_mode mode)
   return (exponent >= 0 && exponent <= 7);
 }
 
-/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR or BIC
-   immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether to
-   output MOVI/MVNI, ORR or BIC immediate.  */
+/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR, BIC or
+   FMOV immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether
+   to output MOVI/MVNI, ORR or BIC immediate.  */
 char*
 aarch64_output_simd_mov_immediate (rtx const_vector, unsigned width,
 				   enum simd_immediate_check which)
diff --git a/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
new file mode 100644
index 0000000000000000000000000000000000000000..e080afed8aa3578660027979335bfc859ca6bc91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
@@ -0,0 +1,87 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-march=armv9-a -Ofast" } */
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
+#include <arm_neon.h>
+
+/*
+** g:
+** 	fmov	v0\.4s, 1\.0e\+0
+** 	ret
+*/
+float32x4_t g(){
+    return vdupq_n_f32(1);
+}
+
+/*
+** h:
+** 	fmov	v0\.4s, 1\.0e\+0
+** 	ret
+*/
+uint32x4_t h() {
+    return vreinterpretq_u32_f32(g());
+}
+
+/*
+** f1:
+** 	fmov	v0\.4s, 1\.0e\+0
+** 	ret
+*/
+uint32x4_t f1() {
+    return vdupq_n_u32(0x3f800000);
+}
+
+/*
+** f2:
+** 	fmov	v0\.4s, 1\.5e\+0
+** 	ret
+*/
+uint32x4_t f2() {
+    return vdupq_n_u32(0x3FC00000);
+}
+
+/*
+** f3:
+** 	fmov	v0\.4s, 1\.25e\+0
+** 	ret
+*/
+uint32x4_t f3() {
+    return vdupq_n_u32(0x3FA00000);
+}
+
+/*
+** f4:
+** 	fmov	v0\.2d, 1\.0e\+0
+** 	ret
+*/
+uint64x2_t f4() {
+    return vdupq_n_u64(0x3FF0000000000000);
+}
+
+/*
+** fn4:
+** 	fmov	v0\.2d, -1\.0e\+0
+** 	ret
+*/
+uint64x2_t fn4() {
+    return vdupq_n_u64(0xBFF0000000000000);
+}
+
+/*
+** f5:
+** 	fmov	v0\.8h, 1\.5e\+0
+** 	ret
+*/
+uint16x8_t f5() {
+    return vdupq_n_u16(0x3E00);
+}
+
+/*
+** f6:
+** 	adrp	x0, \.LC0
+** 	ldr	q0, \[x0, #:lo12:\.LC0\]
+** 	ret
+*/
+uint32x4_t f6() {
+    return vdupq_n_u32(0x4f800000);
+}