[1/2] Add support for conditional xorsign [PR96373]

Message ID mptpmb0kzir.fsf@arm.com
State Committed
Commit 7486fe153adaa868f36248b72f3e78d18b1b3ba1
Headers
Series [1/2] Add support for conditional xorsign [PR96373] |

Commit Message

Richard Sandiford Jan. 27, 2023, 11:06 a.m. UTC
  This patch is an optimisation, but it's also a prerequisite for
fixing PR96373 without regressing vect-xorsign_exec.c.

Currently the vectoriser vectorises:

  for (i = 0; i < N; i++)
    r[i] = a[i] * __builtin_copysignf (1.0f, b[i]);

as two unconditional operations (copysign and mult).
tree-ssa-math-opts.cc later combines them into an "xorsign" function.
This works for both Advanced SIMD and SVE.

However, with the fix for PR96373, the vectoriser will instead
generate a conditional multiplication (IFN_COND_MUL).  Something then
needs to fold copysign & IFN_COND_MUL to the equivalent of a conditional
xorsign.  Three obvious options were:

(1) Extend tree-ssa-math-opts.cc.
(2) Do the fold in match.pd.
(3) Leave it to rtl combine.

I'm against (3), because this isn't a target-specific optimisation.
(1) would be possible, but would involve open-coding a lot of what
match.pd does for us.  And, in contrast to doing the current
tree-ssa-math-opts.cc optimisation in match.pd, there should be
no danger of (2) happening too early.  If we have an IFN_COND_MUL
then we're already past the stage of simplifying the original
source code.

There was also a choice between adding a conditional xorsign ifn
and simply open-coding the xorsign.  The latter seems simpler,
and means less boiler-plate for target-specific code.

The signed_or_unsigned_type_for change is needed to make sure
that we stay in "SVE space" when doing the optimisation on 128-bit
fixed-length SVE.

Tested on aarch64-linux-gnu.  OK to install?

Richard


gcc/
	PR tree-optimization/96373
	* tree.h (sign_mask_for): Declare.
	* tree.cc (sign_mask_for): New function.
	(signed_or_unsigned_type_for): For vector types, try to use the
	related_int_vector_mode.
	* genmatch.cc (commutative_op): Handle conditional internal functions.
	* match.pd: Fold an IFN_COND_MUL+copysign into an IFN_COND_XOR+and.

gcc/testsuite/
	PR tree-optimization/96373
	* gcc.target/aarch64/sve/cond_xorsign_1.c: New test.
	* gcc.target/aarch64/sve/cond_xorsign_2.c: Likewise.
---
 gcc/genmatch.cc                               | 15 ++++++++
 gcc/match.pd                                  | 14 ++++++++
 .../gcc.target/aarch64/sve/cond_xorsign_1.c   | 34 +++++++++++++++++++
 .../gcc.target/aarch64/sve/cond_xorsign_2.c   | 17 ++++++++++
 gcc/tree.cc                                   | 33 ++++++++++++++++++
 gcc/tree.h                                    |  1 +
 6 files changed, 114 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_2.c
  

Comments

Richard Biener Jan. 27, 2023, 11:36 a.m. UTC | #1
On Fri, 27 Jan 2023, Richard Sandiford wrote:

> This patch is an optimisation, but it's also a prerequisite for
> fixing PR96373 without regressing vect-xorsign_exec.c.
> 
> Currently the vectoriser vectorises:
> 
>   for (i = 0; i < N; i++)
>     r[i] = a[i] * __builtin_copysignf (1.0f, b[i]);
> 
> as two unconditional operations (copysign and mult).
> tree-ssa-math-opts.cc later combines them into an "xorsign" function.
> This works for both Advanced SIMD and SVE.
> 
> However, with the fix for PR96373, the vectoriser will instead
> generate a conditional multiplication (IFN_COND_MUL).  Something then
> needs to fold copysign & IFN_COND_MUL to the equivalent of a conditional
> xorsign.  Three obvious options were:
> 
> (1) Extend tree-ssa-math-opts.cc.
> (2) Do the fold in match.pd.
> (3) Leave it to rtl combine.
> 
> I'm against (3), because this isn't a target-specific optimisation.
> (1) would be possible, but would involve open-coding a lot of what
> match.pd does for us.  And, in contrast to doing the current
> tree-ssa-math-opts.cc optimisation in match.pd, there should be
> no danger of (2) happening too early.  If we have an IFN_COND_MUL
> then we're already past the stage of simplifying the original
> source code.
> 
> There was also a choice between adding a conditional xorsign ifn
> and simply open-coding the xorsign.  The latter seems simpler,
> and means less boiler-plate for target-specific code.
> 
> The signed_or_unsigned_type_for change is needed to make sure
> that we stay in "SVE space" when doing the optimisation on 128-bit
> fixed-length SVE.
> 
> Tested on aarch64-linux-gnu.  OK to install?

OK.

Thanks,
Richard.

> Richard
> 
> 
> gcc/
> 	PR tree-optimization/96373
> 	* tree.h (sign_mask_for): Declare.
> 	* tree.cc (sign_mask_for): New function.
> 	(signed_or_unsigned_type_for): For vector types, try to use the
> 	related_int_vector_mode.
> 	* genmatch.cc (commutative_op): Handle conditional internal functions.
> 	* match.pd: Fold an IFN_COND_MUL+copysign into an IFN_COND_XOR+and.
> 
> gcc/testsuite/
> 	PR tree-optimization/96373
> 	* gcc.target/aarch64/sve/cond_xorsign_1.c: New test.
> 	* gcc.target/aarch64/sve/cond_xorsign_2.c: Likewise.
> ---
>  gcc/genmatch.cc                               | 15 ++++++++
>  gcc/match.pd                                  | 14 ++++++++
>  .../gcc.target/aarch64/sve/cond_xorsign_1.c   | 34 +++++++++++++++++++
>  .../gcc.target/aarch64/sve/cond_xorsign_2.c   | 17 ++++++++++
>  gcc/tree.cc                                   | 33 ++++++++++++++++++
>  gcc/tree.h                                    |  1 +
>  6 files changed, 114 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_2.c
> 
> diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
> index d4cb439a851..e147ab9db7a 100644
> --- a/gcc/genmatch.cc
> +++ b/gcc/genmatch.cc
> @@ -489,6 +489,21 @@ commutative_op (id_base *id)
>        case CFN_FNMS:
>  	return 0;
>  
> +      case CFN_COND_ADD:
> +      case CFN_COND_MUL:
> +      case CFN_COND_MIN:
> +      case CFN_COND_MAX:
> +      case CFN_COND_FMIN:
> +      case CFN_COND_FMAX:
> +      case CFN_COND_AND:
> +      case CFN_COND_IOR:
> +      case CFN_COND_XOR:
> +      case CFN_COND_FMA:
> +      case CFN_COND_FMS:
> +      case CFN_COND_FNMA:
> +      case CFN_COND_FNMS:
> +	return 1;
> +
>        default:
>  	return -1;
>        }
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 56ac743aa6d..f605b798c44 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -339,6 +339,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0)))
>    (COPYSIGN_ALL (negate @0) @1)))
>  
> +/* Transform c ? x * copysign (1, y) : z to c ? x ^ signs(y) : z.
> +   tree-ssa-math-opts.cc does the corresponding optimization for
> +   unconditional multiplications (via xorsign).  */
> +(simplify
> + (IFN_COND_MUL:c @0 @1 (IFN_COPYSIGN real_onep @2) @3)
> + (with { tree signs = sign_mask_for (type); }
> +  (if (signs)
> +   (with { tree inttype = TREE_TYPE (signs); }
> +    (view_convert:type
> +     (IFN_COND_XOR:inttype @0
> +      (view_convert:inttype @1)
> +      (bit_and (view_convert:inttype @2) { signs; })
> +      (view_convert:inttype @3)))))))
> +
>  /* (x >= 0 ? x : 0) + (x <= 0 ? -x : 0) -> abs x.  */
>  (simplify
>    (plus:c (max @0 integer_zerop) (max (negate @0) integer_zerop))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_1.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_1.c
> new file mode 100644
> index 00000000000..338ca605923
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_1.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define xorsign(A, B, SUFFIX) ((A) * __builtin_copysign##SUFFIX (1.0, B))
> +
> +#define DEF_LOOP(TYPE, SUFFIX)					\
> +  void __attribute__ ((noinline, noclone))			\
> +  test_##TYPE (TYPE *__restrict r, TYPE *__restrict a,		\
> +	       TYPE *__restrict b, TYPE *__restrict c,		\
> +	       int n)						\
> +  {								\
> +    for (int i = 0; i < n; ++i)					\
> +      r[i] = a[i] < 20 ? xorsign(b[i], c[i], SUFFIX) : b[i];	\
> +  }
> +
> +#define TEST_ALL(T) \
> +  T (_Float16, f16) \
> +  T (float, f) \
> +  T (double, )
> +
> +TEST_ALL (DEF_LOOP)
> +
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, z[0-9]+\.h,} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.s, z[0-9]+\.s,} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d,} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m,} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
> +
> +/* { dg-final { scan-assembler-not {\tfmul} } } */
> +/* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
> +/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
> +/* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_2.c
> new file mode 100644
> index 00000000000..274dd0ede59
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_2.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msve-vector-bits=128 --param aarch64-autovec-preference=2" } */
> +
> +#include "cond_xorsign_1.c"
> +
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, z[0-9]+\.h,} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.s, z[0-9]+\.s,} 1 } } */
> +/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d,} 1 } } */
> +
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m,} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
> +/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
> +
> +/* { dg-final { scan-assembler-not {\tfmul} } } */
> +/* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
> +/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
> +/* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/tree.cc b/gcc/tree.cc
> index 7473912a065..94d20eaba17 100644
> --- a/gcc/tree.cc
> +++ b/gcc/tree.cc
> @@ -2669,6 +2669,35 @@ build_zero_cst (tree type)
>      }
>  }
>  
> +/* If floating-point type TYPE has an IEEE-style sign bit, return an
> +   unsigned constant in which only the sign bit is set.  Return null
> +   otherwise.  */
> +
> +tree
> +sign_mask_for (tree type)
> +{
> +  /* Avoid having to choose between a real-only sign and a pair of signs.
> +     This could be relaxed if the choice becomes obvious later.  */
> +  if (TREE_CODE (type) == COMPLEX_TYPE)
> +    return NULL_TREE;
> +
> +  auto eltmode = as_a<scalar_float_mode> (element_mode (type));
> +  auto bits = REAL_MODE_FORMAT (eltmode)->ieee_bits;
> +  if (!bits || !pow2p_hwi (bits))
> +    return NULL_TREE;
> +
> +  tree inttype = unsigned_type_for (type);
> +  if (!inttype)
> +    return NULL_TREE;
> +
> +  auto mask = wi::set_bit_in_zero (bits - 1, bits);
> +  if (TREE_CODE (inttype) == VECTOR_TYPE)
> +    {
> +      tree elt = wide_int_to_tree (TREE_TYPE (inttype), mask);
> +      return build_vector_from_val (inttype, elt);
> +    }
> +  return wide_int_to_tree (inttype, mask);
> +}
>  
>  /* Build a BINFO with LEN language slots.  */
>  
> @@ -10961,6 +10990,10 @@ signed_or_unsigned_type_for (int unsignedp, tree type)
>  	return NULL_TREE;
>        if (inner == inner2)
>  	return type;
> +      machine_mode new_mode;
> +      if (VECTOR_MODE_P (TYPE_MODE (type))
> +	  && related_int_vector_mode (TYPE_MODE (type)).exists (&new_mode))
> +	return build_vector_type_for_mode (inner2, new_mode);
>        return build_vector_type (inner2, TYPE_VECTOR_SUBPARTS (type));
>      }
>  
> diff --git a/gcc/tree.h b/gcc/tree.h
> index e730a2a3e56..c656cd5b7bf 100644
> --- a/gcc/tree.h
> +++ b/gcc/tree.h
> @@ -4675,6 +4675,7 @@ extern tree build_one_cst (tree);
>  extern tree build_minus_one_cst (tree);
>  extern tree build_all_ones_cst (tree);
>  extern tree build_zero_cst (tree);
> +extern tree sign_mask_for (tree);
>  extern tree build_string (unsigned, const char * = NULL);
>  extern tree build_poly_int_cst (tree, const poly_wide_int_ref &);
>  extern tree build_tree_list (tree, tree CXX_MEM_STAT_INFO);
>
  

Patch

diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
index d4cb439a851..e147ab9db7a 100644
--- a/gcc/genmatch.cc
+++ b/gcc/genmatch.cc
@@ -489,6 +489,21 @@  commutative_op (id_base *id)
       case CFN_FNMS:
 	return 0;
 
+      case CFN_COND_ADD:
+      case CFN_COND_MUL:
+      case CFN_COND_MIN:
+      case CFN_COND_MAX:
+      case CFN_COND_FMIN:
+      case CFN_COND_FMAX:
+      case CFN_COND_AND:
+      case CFN_COND_IOR:
+      case CFN_COND_XOR:
+      case CFN_COND_FMA:
+      case CFN_COND_FMS:
+      case CFN_COND_FNMA:
+      case CFN_COND_FNMS:
+	return 1;
+
       default:
 	return -1;
       }
diff --git a/gcc/match.pd b/gcc/match.pd
index 56ac743aa6d..f605b798c44 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -339,6 +339,20 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (if (REAL_VALUE_NEGATIVE (TREE_REAL_CST (@0)))
   (COPYSIGN_ALL (negate @0) @1)))
 
+/* Transform c ? x * copysign (1, y) : z to c ? x ^ signs(y) : z.
+   tree-ssa-math-opts.cc does the corresponding optimization for
+   unconditional multiplications (via xorsign).  */
+(simplify
+ (IFN_COND_MUL:c @0 @1 (IFN_COPYSIGN real_onep @2) @3)
+ (with { tree signs = sign_mask_for (type); }
+  (if (signs)
+   (with { tree inttype = TREE_TYPE (signs); }
+    (view_convert:type
+     (IFN_COND_XOR:inttype @0
+      (view_convert:inttype @1)
+      (bit_and (view_convert:inttype @2) { signs; })
+      (view_convert:inttype @3)))))))
+
 /* (x >= 0 ? x : 0) + (x <= 0 ? -x : 0) -> abs x.  */
 (simplify
   (plus:c (max @0 integer_zerop) (max (negate @0) integer_zerop))
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_1.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_1.c
new file mode 100644
index 00000000000..338ca605923
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_1.c
@@ -0,0 +1,34 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define xorsign(A, B, SUFFIX) ((A) * __builtin_copysign##SUFFIX (1.0, B))
+
+#define DEF_LOOP(TYPE, SUFFIX)					\
+  void __attribute__ ((noinline, noclone))			\
+  test_##TYPE (TYPE *__restrict r, TYPE *__restrict a,		\
+	       TYPE *__restrict b, TYPE *__restrict c,		\
+	       int n)						\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      r[i] = a[i] < 20 ? xorsign(b[i], c[i], SUFFIX) : b[i];	\
+  }
+
+#define TEST_ALL(T) \
+  T (_Float16, f16) \
+  T (float, f) \
+  T (double, )
+
+TEST_ALL (DEF_LOOP)
+
+/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, z[0-9]+\.h,} 1 } } */
+/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.s, z[0-9]+\.s,} 1 } } */
+/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d,} 1 } } */
+
+/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m,} 1 } } */
+/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
+/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
+
+/* { dg-final { scan-assembler-not {\tfmul} } } */
+/* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_2.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_2.c
new file mode 100644
index 00000000000..274dd0ede59
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_xorsign_2.c
@@ -0,0 +1,17 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -msve-vector-bits=128 --param aarch64-autovec-preference=2" } */
+
+#include "cond_xorsign_1.c"
+
+/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.h, z[0-9]+\.h,} 1 } } */
+/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.s, z[0-9]+\.s,} 1 } } */
+/* { dg-final { scan-assembler-times {\tand\tz[0-9]+\.d, z[0-9]+\.d,} 1 } } */
+
+/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.h, p[0-7]/m,} 1 } } */
+/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
+/* { dg-final { scan-assembler-times {\teor\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
+
+/* { dg-final { scan-assembler-not {\tfmul} } } */
+/* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
+/* { dg-final { scan-assembler-not {\tmovprfx\t} } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/tree.cc b/gcc/tree.cc
index 7473912a065..94d20eaba17 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -2669,6 +2669,35 @@  build_zero_cst (tree type)
     }
 }
 
+/* If floating-point type TYPE has an IEEE-style sign bit, return an
+   unsigned constant in which only the sign bit is set.  Return null
+   otherwise.  */
+
+tree
+sign_mask_for (tree type)
+{
+  /* Avoid having to choose between a real-only sign and a pair of signs.
+     This could be relaxed if the choice becomes obvious later.  */
+  if (TREE_CODE (type) == COMPLEX_TYPE)
+    return NULL_TREE;
+
+  auto eltmode = as_a<scalar_float_mode> (element_mode (type));
+  auto bits = REAL_MODE_FORMAT (eltmode)->ieee_bits;
+  if (!bits || !pow2p_hwi (bits))
+    return NULL_TREE;
+
+  tree inttype = unsigned_type_for (type);
+  if (!inttype)
+    return NULL_TREE;
+
+  auto mask = wi::set_bit_in_zero (bits - 1, bits);
+  if (TREE_CODE (inttype) == VECTOR_TYPE)
+    {
+      tree elt = wide_int_to_tree (TREE_TYPE (inttype), mask);
+      return build_vector_from_val (inttype, elt);
+    }
+  return wide_int_to_tree (inttype, mask);
+}
 
 /* Build a BINFO with LEN language slots.  */
 
@@ -10961,6 +10990,10 @@  signed_or_unsigned_type_for (int unsignedp, tree type)
 	return NULL_TREE;
       if (inner == inner2)
 	return type;
+      machine_mode new_mode;
+      if (VECTOR_MODE_P (TYPE_MODE (type))
+	  && related_int_vector_mode (TYPE_MODE (type)).exists (&new_mode))
+	return build_vector_type_for_mode (inner2, new_mode);
       return build_vector_type (inner2, TYPE_VECTOR_SUBPARTS (type));
     }
 
diff --git a/gcc/tree.h b/gcc/tree.h
index e730a2a3e56..c656cd5b7bf 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -4675,6 +4675,7 @@  extern tree build_one_cst (tree);
 extern tree build_minus_one_cst (tree);
 extern tree build_all_ones_cst (tree);
 extern tree build_zero_cst (tree);
+extern tree sign_mask_for (tree);
 extern tree build_string (unsigned, const char * = NULL);
 extern tree build_poly_int_cst (tree, const poly_wide_int_ref &);
 extern tree build_tree_list (tree, tree CXX_MEM_STAT_INFO);