[v1] LoongArch: Optimize the cost of vec_construct.

Message ID 20250107044738.57951-1-chenxiaolong@loongson.cn
State New
Headers
Series [v1] LoongArch: Optimize the cost of vec_construct.

Checks

Context                                         Check    Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm      success  Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64  success  Build passed

Commit Message

chenxiaolong Jan. 7, 2025, 4:47 a.m. UTC
  While analyzing 525.x264_r on LoongArch, we found that the for loop in
the hot function x264_pixel_satd_8x4 could not be vectorized with 256-bit
vectors because of the vec_construct cost setting. After adjusting the
vec_construct cost, the performance of 525.x264_r improved by 16.57%.
This function can already be vectorized on the aarch64 and x86
architectures, see [PR98138].

Co-Authored-By: Deng Jianbo <dengjianbo@loongson.cn>

gcc/ChangeLog:

	* config/loongarch/loongarch.cc
	(loongarch_builtin_vectorization_cost): Adjust the cost
	of vec_construct.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/vect-slp-two-operator.c: New test.
---
 gcc/config/loongarch/loongarch.cc             |  6 ++--
 .../loongarch/vect-slp-two-operator.c         | 36 +++++++++++++++++++
 2 files changed, 39 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/loongarch/vect-slp-two-operator.c
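
The cost change described above can be sketched as plain arithmetic. This is a minimal model, not the GCC code: the function names and the boolean parameters are hypothetical stand-ins for ISA_HAS_LASX and for the condition LASX_SUPPORTED_MODE_P (mode) && !LSX_SUPPORTED_MODE_P (mode), and it assumes that condition selects modes available only as 256-bit LASX vectors.

```c
#include <assert.h>
#include <stdbool.h>

/* Cost before the patch: it depends only on whether LASX is enabled,
   not on the vector mode being constructed.  */
int
old_vec_construct_cost (int elements, bool isa_has_lasx)
{
  assert (elements > 0);
  return isa_has_lasx ? elements + 1 : elements;
}

/* Cost after the patch: it depends on the vector mode.  LASX_ONLY_MODE is
   a hypothetical flag standing in for
   LASX_SUPPORTED_MODE_P (mode) && !LSX_SUPPORTED_MODE_P (mode).  */
int
new_vec_construct_cost (int elements, bool lasx_only_mode)
{
  assert (elements > 0);
  return lasx_only_mode ? elements / 2 + 3 : elements / 2 + 1;
}
```

Under this model, with LASX enabled, an 8-element 256-bit vector drops from a cost of 9 to 7, and a 4-element 128-bit vector drops from 5 to 3.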
  

Comments

Lulu Cheng Jan. 7, 2025, 12:06 p.m. UTC | #1
On 2025/1/7 at 12:47 PM, chenxiaolong wrote:
>    While analyzing 525.x264_r on LoongArch, we found that the for loop in
> the hot function x264_pixel_satd_8x4 could not be vectorized with 256-bit
> vectors because of the vec_construct cost setting. After adjusting the
> vec_construct cost, the performance of 525.x264_r improved by 16.57%.
> This function can already be vectorized on the aarch64 and x86
> architectures, see [PR98138].
>
> Co-Authored-By: Deng Jianbo <dengjianbo@loongson.cn>
>
> gcc/ChangeLog:
>
> 	* config/loongarch/loongarch.cc
> 	(loongarch_builtin_vectorization_cost): Adjust the cost
> 	of vec_construct.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/loongarch/vect-slp-two-operator.c: New test.
> ---
>   gcc/config/loongarch/loongarch.cc             |  6 ++--
>   .../loongarch/vect-slp-two-operator.c         | 36 +++++++++++++++++++
>   2 files changed, 39 insertions(+), 3 deletions(-)
>   create mode 100644 gcc/testsuite/gcc.target/loongarch/vect-slp-two-operator.c
>
> diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
> index 89237c377e7..ff27b96c31e 100644
> --- a/gcc/config/loongarch/loongarch.cc
> +++ b/gcc/config/loongarch/loongarch.cc
> @@ -4127,10 +4127,10 @@ loongarch_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>   
>         case vec_construct:
>   	elements = TYPE_VECTOR_SUBPARTS (vectype);
> -	if (ISA_HAS_LASX)
> -	  return elements + 1;
> +	if (LASX_SUPPORTED_MODE_P (mode) && !LSX_SUPPORTED_MODE_P (mode))
> +	  return elements / 2 + 3;
>   	else
> -	  return elements;
> +	  return elements / 2 + 1;
>   
>         default:
>   	gcc_unreachable ();
> diff --git a/gcc/testsuite/gcc.target/loongarch/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/loongarch/vect-slp-two-operator.c
> new file mode 100644
> index 00000000000..f27492e10f0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/loongarch/vect-slp-two-operator.c
> @@ -0,0 +1,36 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mlasx -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */
> +
> +typedef unsigned char uint8_t;
> +typedef unsigned int uint32_t;
> +
> +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> +    int t0 = s0 + s1;\
> +    int t1 = s0 - s1;\
> +    int t2 = s2 + s3;\
> +    int t3 = s2 - s3;\
> +    d0 = t0 + t2;\
> +    d1 = t1 + t3;\
> +    d2 = t0 - t2;\
> +    d3 = t1 - t3;\
> +}
> +
> +void sink(uint32_t tmp[4][4]);
> +
> +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )

Hi, xiao long:

The code formatting is wrong here: there must be a space before '('.

> +{
> +    uint32_t tmp[4][4];
> +    int sum = 0;
> +    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
> +    {
> +        uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> +        uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> +        uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> +        uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> +        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
> +    }
> +    sink(tmp);
> +}
> +
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
  

Patch

diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
index 89237c377e7..ff27b96c31e 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -4127,10 +4127,10 @@  loongarch_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
 
       case vec_construct:
 	elements = TYPE_VECTOR_SUBPARTS (vectype);
-	if (ISA_HAS_LASX)
-	  return elements + 1;
+	if (LASX_SUPPORTED_MODE_P (mode) && !LSX_SUPPORTED_MODE_P (mode))
+	  return elements / 2 + 3;
 	else
-	  return elements;
+	  return elements / 2 + 1;
 
       default:
 	gcc_unreachable ();
diff --git a/gcc/testsuite/gcc.target/loongarch/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/loongarch/vect-slp-two-operator.c
new file mode 100644
index 00000000000..f27492e10f0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/vect-slp-two-operator.c
@@ -0,0 +1,36 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -mlasx -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */
+
+typedef unsigned char uint8_t;
+typedef unsigned int uint32_t;
+
+#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
+    int t0 = s0 + s1;\
+    int t1 = s0 - s1;\
+    int t2 = s2 + s3;\
+    int t3 = s2 - s3;\
+    d0 = t0 + t2;\
+    d1 = t1 + t3;\
+    d2 = t0 - t2;\
+    d3 = t1 - t3;\
+}
+
+void sink(uint32_t tmp[4][4]);
+
+int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
+{
+    uint32_t tmp[4][4];
+    int sum = 0;
+    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
+    {
+        uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
+        uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
+        uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
+        uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
+        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
+    }
+    sink(tmp);
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */