[1/1] aarch64: remove extra XTN in vector concatenation

Message ID 20241202145435.2535-2-Akram.Ahmad@arm.com
State New
Series aarch64: remove extra XTN in vector concatenation

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed

Commit Message

Akram Ahmad Dec. 2, 2024, 2:54 p.m. UTC
  GIMPLE code which performs a narrowing truncation on the result of a
vector concatenation currently results in an unnecessary XTN being
emitted after the UZP1 that concatenates the operands. In such cases,
UZP1 should instead use a smaller arrangement specifier, making the
extra XTN instruction redundant. This can be seen in the following
GIMPLE example:

	int32x2_t foo (svint64_t a, svint64_t b)
	{
	  vector(2) int vect__2.8;
	  long int _1;
	  long int _3;
	  vector(2) long int _12;

	  <bb 2> [local count: 1073741824]:
	  _1 = svaddv_s64 ({ -1, 0, 0, 0, 0, 0, 0, 0, ... }, a_6(D));
	  _3 = svaddv_s64 ({ -1, 0, 0, 0, 0, 0, 0, 0, ... }, b_7(D));
	  _12 = {_1, _3};
	  vect__2.8_13 = (vector(2) int) _12;
	  return vect__2.8_13;

	}
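
For reference, C intrinsics along the lines of the new test produce the
GIMPLE above (a sketch mirroring the test added by this patch; the
function name follows the GIMPLE dump):

	#include <arm_neon.h>
	#include <arm_sve.h>

	int32x2_t foo (svint64_t a, svint64_t b)
	{
	  int32x2_t ab = vdup_n_s32 (0);
	  ab = vset_lane_s32 ((int) svaddv_s64 (svptrue_b64 (), a), ab, 0);
	  ab = vset_lane_s32 ((int) svaddv_s64 (svptrue_b64 (), b), ab, 1);
	  return ab;
	}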

Original assembly generated:

	bar:
	        ptrue   p3.b, all
	        uaddv   d0, p3, z0.d
	        uaddv   d1, p3, z1.d
	        uzp1    v0.2d, v0.2d, v1.2d
	        xtn     v0.2s, v0.2d
	        ret

This patch therefore defines a new insn, *aarch64_trunc_concat<mode>,
which truncates the result of the concatenation rather than
concatenating already-truncated operands (as *aarch64_narrow_trunc<mode>
does; the two RTL shapes are sketched after the assembly below),
resulting in the following optimised assembly being emitted:

	bar:
	        ptrue   p3.b, all
	        uaddv   d0, p3, z0.d
	        uaddv   d1, p3, z1.d
	        uzp1    v0.2s, v0.2s, v1.2s
	        ret
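
The difference between the two patterns can be sketched in simplified
RTL (the exact definition is in the diff below; the
*aarch64_narrow_trunc<mode> shape is paraphrased from its existing
definition):

	;; Existing *aarch64_narrow_trunc<mode>: truncate each operand
	;; first, then concatenate the narrowed halves.
	(vec_concat (truncate op1) (truncate op2))

	;; New *aarch64_trunc_concat<mode>: concatenate the full-width
	;; halves, then truncate the result, matching the GIMPLE above.
	(truncate (vec_concat op1 op2))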

This patch passes all regression tests on aarch64 with no new failures.
A supporting test for this optimisation has also been added and passes.

OK for master? I do not have commit rights so I cannot push the patch
myself.

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (*aarch64_trunc_concat<mode>):
	New insn definition.
	* config/aarch64/iterators.md (VDQHSD_F): New mode iterator.
	(VTRUNCD): New mode attribute for truncated modes.
	(Vtruncd): New mode attribute for arrangement specifiers.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/truncated_concatenation_1.c: New test
	for the above example and the int64x2_t version of it.
---
 gcc/config/aarch64/aarch64-simd.md            | 16 ++++++++++++++
 gcc/config/aarch64/iterators.md               | 12 ++++++++++
 .../aarch64/sve/truncated_concatenation_1.c   | 22 +++++++++++++++++++
 3 files changed, 50 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/truncated_concatenation_1.c
  

Comments

Kyrylo Tkachov Dec. 2, 2024, 3:09 p.m. UTC | #1
Hi Akram,

> On 2 Dec 2024, at 15:54, Akram Ahmad <Akram.Ahmad@arm.com> wrote:
> 
> GIMPLE code which performs a narrowing truncation on the result of a
> vector concatenation currently results in an unnecessary XTN being
> emitted following a UZP1 to concatenate the operands. In cases such as this,
> UZP1 should instead use a smaller arrangement specifier to replace the
> XTN instruction. This is seen in cases such as in this GIMPLE example:
> 
> int32x2_t foo (svint64_t a, svint64_t b)
> {
>  vector(2) int vect__2.8;
>  long int _1;
>  long int _3;
>  vector(2) long int _12;
> 
>  <bb 2> [local count: 1073741824]:
>  _1 = svaddv_s64 ({ -1, 0, 0, 0, 0, 0, 0, 0, ... }, a_6(D));
>  _3 = svaddv_s64 ({ -1, 0, 0, 0, 0, 0, 0, 0, ... }, b_7(D));
>  _12 = {_1, _3};
>  vect__2.8_13 = (vector(2) int) _12;
>  return vect__2.8_13;
> 
> }
> 
> Original assembly generated:
> 
> bar:
>        ptrue   p3.b, all
>        uaddv   d0, p3, z0.d
>        uaddv   d1, p3, z1.d
>        uzp1    v0.2d, v0.2d, v1.2d
>        xtn     v0.2s, v0.2d
>        ret
> 
> This patch therefore defines the *aarch64_trunc_concat<mode> insn which
> truncates the concatenation result, rather than concatenating the
> truncated operands (such as in *aarch64_narrow_trunc<mode>), resulting
> in the following optimised assembly being emitted:
> 
> bar:
>        ptrue   p3.b, all
>        uaddv   d0, p3, z0.d
>        uaddv   d1, p3, z1.d
>        uzp1    v0.2s, v0.2s, v1.2s
>        ret
> 
> This patch passes all regression tests on aarch64 with no new failures.
> A supporting test for this optimisation is also written and passes.
> 
> OK for master? I do not have commit rights so I cannot push the patch
> myself.

Thanks for the patch. As this is sent after the end of stage1 and is not finishing support for an architecture feature, perhaps we should stage this for GCC 16.
But if it fixes a performance problem in a real app or, better yet, fixes a performance regression, then we should consider it for this cycle.
That said...


> 
> gcc/ChangeLog:
> 
> * config/aarch64/aarch64-simd.md: (*aarch64_trunc_concat) new
>  insn definition.
> * config/aarch64/iterators.md: (VDQHSD_F): new mode iterator.
>  (VTRUNCD): new mode attribute for truncated modes.
>  (Vtruncd): new mode attribute for arrangement specifier.
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.target/aarch64/sve/truncated_concatenation_1.c: new test
>  for the above example and the int64x2 version of the above.
> ---
> gcc/config/aarch64/aarch64-simd.md            | 16 ++++++++++++++
> gcc/config/aarch64/iterators.md               | 12 ++++++++++
> .../aarch64/sve/truncated_concatenation_1.c   | 22 +++++++++++++++++++
> 3 files changed, 50 insertions(+)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/truncated_concatenation_1.c
> 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index cfe95bd4c31..de3dd444ecd 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1872,6 +1872,22 @@
>   [(set_attr "type" "neon_permute<q>")]
> )
> 
> +(define_insn "*aarch64_trunc_concat<mode>"
> +  [(set (match_operand:<VTRUNCD> 0 "register_operand" "=w")
> + (truncate:<VTRUNCD>
> +  (vec_concat:VDQHSD_F
> +            (match_operand:<VHALF> 1 "register_operand" "w")
> +    (match_operand:<VHALF> 2 "register_operand" "w"))))]
> +  "TARGET_SIMD"
> +{
> +  if (!BYTES_BIG_ENDIAN)
> +    return "uzp1\\t%0.<Vtruncd>, %1.<Vtruncd>, %2.<Vtruncd>";
> +  else
> +    return "uzp1\\t%0.<Vtruncd>, %2.<Vtruncd>, %1.<Vtruncd>";
> +}

… The UZP1 instruction doesn’t accept .2h operands, so I don’t think this pattern is valid for the V2SF value of VDQHSD_F.
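
For reference, the vector forms of UZP1 accept only the 8b, 16b, 4h, 8h,
2s, 4s and 2d arrangements (per the Arm architecture reference), so the
.2h form that the V2SF entry would require does not exist:

	uzp1	v0.2s, v1.2s, v2.2s	// valid arrangement
	uzp1	v0.2h, v1.2h, v2.2h	// not encodable; what V2SF would need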


> +  [(set_attr "type" "neon_permute<q>")]
> +)
> +
> ;; Packing doubles.
> 
> (define_expand "vec_pack_trunc_<mode>"
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index d7cb27e1885..3b28b2fae0c 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -290,6 +290,10 @@
> ;; Advanced SIMD modes for H, S and D types.
> (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
> 
> +;; Advanced SIMD modes that can be truncated whilst preserving
> +;; the number of vector elements.
> +(define_mode_iterator VDQHSD_F [V8HI V4SI V2DI V2SF V4SF V2DF])
> +
> (define_mode_iterator VDQHSD_V1DI [VDQHSD V1DI])
> 
> ;; Advanced SIMD and scalar integer modes for H and S.
> @@ -1722,6 +1726,14 @@
> (define_mode_attr Vnarrowq2 [(V8HI "v16qi") (V4SI "v8hi")
>     (V2DI "v4si")])
> 
> +;; Truncated Advanced SIMD modes which preserve the number of lanes.
> +(define_mode_attr VTRUNCD [(V8HI "V8QI") (V4SI "V4HI")
> +   (V2SF "V2HF") (V4SF "V4HF")
> +   (V2DI "V2SI") (V2DF "V2SF")])
> +(define_mode_attr Vtruncd [(V8HI "8b") (V4SI "4h")
> +   (V2SF "2h") (V4SF "4h")
> +   (V2DI "2s") (V2DF "2s")])
> +
> ;; Narrowed modes of vector modes.
> (define_mode_attr VNARROW [(VNx8HI "VNx16QI")
>   (VNx4SI "VNx8HI") (VNx4SF "VNx8HF")
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/truncated_concatenation_1.c b/gcc/testsuite/gcc.target/aarch64/sve/truncated_concatenation_1.c
> new file mode 100644
> index 00000000000..e0ad4209206
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/truncated_concatenation_1.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -Wall -march=armv8.2-a+sve" } */
> +
> +#include <arm_neon.h>
> +#include <arm_sve.h>
> +
> +int32x2_t foo (svint64_t a, svint64_t b) {
> +    int32x2_t ab = vdup_n_s32 (0);
> +    ab = vset_lane_s32 ((int)svaddv_s64 (svptrue_b64 (), a), ab, 0);
> +    ab = vset_lane_s32 ((int)svaddv_s64 (svptrue_b64 (), b), ab, 1);
> +    return ab;
> +}
> +
> +int64x2_t bar (svint64_t a, svint64_t b) {
> +    int64x2_t ab = vdupq_n_s64(0);
> +    ab = vsetq_lane_s64 ((int)svaddv_s64 (svptrue_b64 (), a), ab, 0);
> +    ab = vsetq_lane_s64 ((int)svaddv_s64 (svptrue_b64 (), b), ab, 1);
> +    return ab;
> +}

We should have tests for the various sizes that the new pattern covers.
Thanks,
Kyrill


> +
> +/* { dg-final { scan-assembler-not {\txtn\t} } }*/
> +/* { dg-final { scan-assembler-times {\tuzp1\tv[0-9]+\.2s, v[0-9]+\.2s, v[0-9]+\.2s} 2 } }*/
> \ No newline at end of file
> -- 
> 2.34.1
>
  
Akram Ahmad Dec. 3, 2024, 2:33 p.m. UTC | #2
Hi Kyrill, thanks for the very quick response!

On 02/12/2024 15:09, Kyrylo Tkachov wrote:
> Thanks for the patch. As this is sent after the end of stage1 and is not finishing support for an architecture feature perhaps we should stage this for GCC 16.
> But if it fixes a performance problem in a real app or, better yet, fixes a performance regression then we should consider it for this cycle.
Sorry, I should have specified in the cover letter that this was
originally intended for GCC 16... although it would improve performance
in some video codecs, as this is where the issue was first raised. I'll
try and find out a bit more about this if needed.
> … The UZP1 instruction doesn’t accept .2h operands so I don’t think this pattern is valid for the V2SF value of VDQHSD_F
> We should have tests for the various sizes that the new pattern covers.

Okay, I'll correct the modes and then write tests for the ones that remain.

Many thanks,
Akram
  

Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index cfe95bd4c31..de3dd444ecd 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1872,6 +1872,22 @@ 
   [(set_attr "type" "neon_permute<q>")]
 )
 
+(define_insn "*aarch64_trunc_concat<mode>"
+  [(set (match_operand:<VTRUNCD> 0 "register_operand" "=w")
+	(truncate:<VTRUNCD>
+	  (vec_concat:VDQHSD_F
+            (match_operand:<VHALF> 1 "register_operand" "w")
+	    (match_operand:<VHALF> 2 "register_operand" "w"))))]
+  "TARGET_SIMD"
+{
+  if (!BYTES_BIG_ENDIAN)
+    return "uzp1\\t%0.<Vtruncd>, %1.<Vtruncd>, %2.<Vtruncd>";
+  else
+    return "uzp1\\t%0.<Vtruncd>, %2.<Vtruncd>, %1.<Vtruncd>";
+}
+  [(set_attr "type" "neon_permute<q>")]
+)
+
 ;; Packing doubles.
 
 (define_expand "vec_pack_trunc_<mode>"
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index d7cb27e1885..3b28b2fae0c 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -290,6 +290,10 @@ 
 ;; Advanced SIMD modes for H, S and D types.
 (define_mode_iterator VDQHSD [V4HI V8HI V2SI V4SI V2DI])
 
+;; Advanced SIMD modes that can be truncated whilst preserving
+;; the number of vector elements.
+(define_mode_iterator VDQHSD_F [V8HI V4SI V2DI V2SF V4SF V2DF])
+
 (define_mode_iterator VDQHSD_V1DI [VDQHSD V1DI])
 
 ;; Advanced SIMD and scalar integer modes for H and S.
@@ -1722,6 +1726,14 @@ 
 (define_mode_attr Vnarrowq2 [(V8HI "v16qi") (V4SI "v8hi")
 			     (V2DI "v4si")])
 
+;; Truncated Advanced SIMD modes which preserve the number of lanes.
+(define_mode_attr VTRUNCD [(V8HI "V8QI") (V4SI "V4HI")
+			   (V2SF "V2HF") (V4SF "V4HF")
+			   (V2DI "V2SI") (V2DF "V2SF")])
+(define_mode_attr Vtruncd [(V8HI "8b") (V4SI "4h")
+			   (V2SF "2h") (V4SF "4h")
+			   (V2DI "2s") (V2DF "2s")])
+
 ;; Narrowed modes of vector modes.
 (define_mode_attr VNARROW [(VNx8HI "VNx16QI")
 			   (VNx4SI "VNx8HI") (VNx4SF "VNx8HF")
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/truncated_concatenation_1.c b/gcc/testsuite/gcc.target/aarch64/sve/truncated_concatenation_1.c
new file mode 100644
index 00000000000..e0ad4209206
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/truncated_concatenation_1.c
@@ -0,0 +1,22 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -Wall -march=armv8.2-a+sve" } */
+
+#include <arm_neon.h>
+#include <arm_sve.h>
+
+int32x2_t foo (svint64_t a, svint64_t b) {
+    int32x2_t ab = vdup_n_s32 (0);
+    ab = vset_lane_s32 ((int)svaddv_s64 (svptrue_b64 (), a), ab, 0);
+    ab = vset_lane_s32 ((int)svaddv_s64 (svptrue_b64 (), b), ab, 1);
+    return ab;
+}
+
+int64x2_t bar (svint64_t a, svint64_t b) {
+    int64x2_t ab = vdupq_n_s64(0);
+    ab = vsetq_lane_s64 ((int)svaddv_s64 (svptrue_b64 (), a), ab, 0);
+    ab = vsetq_lane_s64 ((int)svaddv_s64 (svptrue_b64 (), b), ab, 1);
+    return ab;
+}
+
+/* { dg-final { scan-assembler-not {\txtn\t} } }*/
+/* { dg-final { scan-assembler-times {\tuzp1\tv[0-9]+\.2s, v[0-9]+\.2s, v[0-9]+\.2s} 2 } }*/
\ No newline at end of file