# [4/4] vect: Optimize order of lane-reducing statements in loop def-use cycles

Message ID LV2PR01MB783994175ADE5AA6E522F684F7A72@LV2PR01MB7839.prod.exchangelabs.com New [1/4] vect: Add a unified vect_get_num_copies for slp and non-slp |

## Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 fail Patch failed to apply
linaro-tcwg-bot/tcwg_gcc_build--master-arm fail Patch failed to apply

## Commit Message

Feng Xue OS July 13, 2024, 3:49 p.m. UTC
```  When transforming multiple lane-reducing operations in a loop reduction chain,
originally, corresponding vectorized statements are generated into def-use
cycles starting from 0. The def-use cycle with smaller index, would contain
more statements, which means more instruction dependency. For example:

int sum = 1;
for (i)
{
sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
sum += w[i];               // widen-sum <vector(16) char>
sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
sum += n[i];               // normal <vector(4) int>
}

Original transformation result:

for (i / 16)
{
sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
sum_v1 = sum_v1;  // copy
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
sum_v1 = sum_v1;  // copy
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

...
}

For a higher instruction parallelism in final vectorized loop, an optimal
means is to make those effective vector lane-reducing ops be distributed
evenly among all def-use cycles. Transformed as the below, DOT_PROD,
WIDEN_SUM and SADs are generated into disparate cycles, instruction
dependency among them could be eliminated.

for (i / 16)
{
sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
sum_v1 = sum_v1;  // copy
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

sum_v0 = sum_v0;  // copy
sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

sum_v0 = sum_v0;  // copy
sum_v1 = sum_v1;  // copy
sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

...
}

Thanks,
Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
gcc/tree-vect-loop.cc | 64 ++++++++++++++++++++++++++++++++++++++-----
gcc/tree-vectorizer.h |  6 ++++
2 files changed, 63 insertions(+), 7 deletions(-)
```

Richard Biener July 15, 2024, 2:05 p.m. UTC | #1
```On Sat, Jul 13, 2024 at 5:49 PM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
>
> When transforming multiple lane-reducing operations in a loop reduction chain,
> originally, corresponding vectorized statements are generated into def-use
> cycles starting from 0. The def-use cycle with smaller index, would contain
> more statements, which means more instruction dependency. For example:
>
>    int sum = 1;
>    for (i)
>      {
>        sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>        sum += w[i];               // widen-sum <vector(16) char>
>        sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
>        sum += n[i];               // normal <vector(4) int>
>      }
>
> Original transformation result:
>
>    for (i / 16)
>      {
>        sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>        sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        ...
>      }
>
> For a higher instruction parallelism in final vectorized loop, an optimal
> means is to make those effective vector lane-reducing ops be distributed
> evenly among all def-use cycles. Transformed as the below, DOT_PROD,
> WIDEN_SUM and SADs are generated into disparate cycles, instruction
> dependency among them could be eliminated.
>
>    for (i / 16)
>      {
>        sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = sum_v0;  // copy
>        sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
>        sum_v2 = sum_v2;  // copy
>        sum_v3 = sum_v3;  // copy
>
>        sum_v0 = sum_v0;  // copy
>        sum_v1 = sum_v1;  // copy
>        sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
>        sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
>
>        ...
>      }

OK.

Thanks,
Richard.

> Thanks,
> Feng
> ---
> gcc/
>         PR tree-optimization/114440
>         * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
>         reduc_result_pos.
>         * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
>         statements in an optimized order.
> ---
>  gcc/tree-vect-loop.cc | 64 ++++++++++++++++++++++++++++++++++++++-----
>  gcc/tree-vectorizer.h |  6 ++++
>  2 files changed, 63 insertions(+), 7 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index e72d692ffa3..5bc6e526d43 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -8841,6 +8841,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>                sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>                sum += w[i];               // widen-sum <vector(16) char>
>                sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> +              sum += n[i];               // normal <vector(4) int>
>              }
>
>          The vector size is 128-bit，vectorization factor is 16.  Reduction
> @@ -8858,19 +8859,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>                sum_v2 = sum_v2;  // copy
>                sum_v3 = sum_v3;  // copy
>
> -              sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
> -              sum_v1 = sum_v1;  // copy
> +              sum_v0 = sum_v0;  // copy
> +              sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
>                sum_v2 = sum_v2;  // copy
>                sum_v3 = sum_v3;  // copy
>
> -              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> -              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> -              sum_v2 = sum_v2;  // copy
> +              sum_v0 = sum_v0;  // copy
> +              sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
> +              sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
>                sum_v3 = sum_v3;  // copy
> +
> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> +              sum_v2 += n_v2[i: 8  ~ 11];
> +              sum_v3 += n_v3[i: 12 ~ 15];
>              }
>
> -          sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
> -       */
> +        Moreover, for a higher instruction parallelism in final vectorized
> +        loop, it is considered to make those effective vector lane-reducing
> +        ops be distributed evenly among all def-use cycles.  In the above
> +        example, DOT_PROD, WIDEN_SUM and SADs are generated into disparate
> +        cycles, instruction dependency among them could be eliminated.  */
>        unsigned effec_ncopies = vec_oprnds[0].length ();
>        unsigned total_ncopies = vec_oprnds[reduc_index].length ();
>
> @@ -8884,6 +8893,47 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>               vec_oprnds[i].safe_grow_cleared (total_ncopies);
>             }
>         }
> +
> +      tree reduc_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> +      gcc_assert (reduc_vectype_in);
> +
> +      unsigned effec_reduc_ncopies
> +       = vect_get_num_copies (loop_vinfo, slp_node, reduc_vectype_in);
> +
> +      gcc_assert (effec_ncopies <= effec_reduc_ncopies);
> +
> +      if (effec_ncopies < effec_reduc_ncopies)
> +       {
> +         /* Find suitable def-use cycles to generate vectorized statements
> +            into, and reorder operands based on the selection.  */
> +         unsigned curr_pos = reduc_info->reduc_result_pos;
> +         unsigned next_pos = (curr_pos + effec_ncopies) % effec_reduc_ncopies;
> +
> +         gcc_assert (curr_pos < effec_reduc_ncopies);
> +          reduc_info->reduc_result_pos = next_pos;
> +
> +         if (curr_pos)
> +           {
> +             unsigned count = effec_reduc_ncopies - effec_ncopies;
> +             unsigned start = curr_pos - count;
> +
> +             if ((int) start < 0)
> +               {
> +                 count = curr_pos;
> +                 start = 0;
> +               }
> +
> +             for (unsigned i = 0; i < op.num_ops - 1; i++)
> +               {
> +                 for (unsigned j = effec_ncopies; j > start; j--)
> +                   {
> +                     unsigned k = j - 1;
> +                     std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
> +                     gcc_assert (!vec_oprnds[i][k]);
> +                   }
> +               }
> +           }
> +       }
>      }
>
>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 62121f63f18..b6fdbc651d6 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -1402,6 +1402,12 @@ public:
>    /* The vector type for performing the actual reduction.  */
>    tree reduc_vectype;
>
> +  /* For loop reduction with multiple vectorized results (ncopies > 1), a
> +     lane-reducing operation participating in it may not use all of those
> +     results, this field specifies result index starting from which any
> +     following land-reducing operation would be assigned to.  */
> +  unsigned int reduc_result_pos;
> +
>    /* If IS_REDUC_INFO is true and if the vector code is performing
>       N scalar reductions in parallel, this variable gives the initial
>       scalar values of those N reductions.  */
> --
> 2.17.1
```

## Patch

```From f3d2bff96f8e29f775e2cb12ef43ad464b819fcf Mon Sep 17 00:00:00 2001
From: Feng Xue <fxue@os.amperecomputing.com>
Date: Wed, 29 May 2024 17:28:14 +0800
Subject: [PATCH 4/4] vect: Optimize order of lane-reducing statements in loop
def-use cycles

When transforming multiple lane-reducing operations in a loop reduction chain,
originally, corresponding vectorized statements are generated into def-use
cycles starting from 0. The def-use cycle with smaller index, would contain
more statements, which means more instruction dependency. For example:

int sum = 1;
for (i)
{
sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
sum += w[i];               // widen-sum <vector(16) char>
sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
sum += n[i];               // normal <vector(4) int>
}

Original transformation result:

for (i / 16)
{
sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
sum_v1 = sum_v1;  // copy
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
sum_v1 = sum_v1;  // copy
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

...
}

For a higher instruction parallelism in final vectorized loop, an optimal
means is to make those effective vector lane-reducing ops be distributed
evenly among all def-use cycles. Transformed as the below, DOT_PROD,
WIDEN_SUM and SADs are generated into disparate cycles, instruction
dependency among them could be eliminated.

for (i / 16)
{
sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
sum_v1 = sum_v1;  // copy
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

sum_v0 = sum_v0;  // copy
sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

sum_v0 = sum_v0;  // copy
sum_v1 = sum_v1;  // copy
sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

...
}

2024-03-22 Feng Xue <fxue@os.amperecomputing.com>

gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
gcc/tree-vect-loop.cc | 64 ++++++++++++++++++++++++++++++++++++++-----
gcc/tree-vectorizer.h |  6 ++++
2 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index e72d692ffa3..5bc6e526d43 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8841,6 +8841,7 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
sum += w[i];               // widen-sum <vector(16) char>
sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
+	       sum += n[i];               // normal <vector(4) int>
}

The vector size is 128-bit，vectorization factor is 16.  Reduction
@@ -8858,19 +8859,27 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

-	       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
-	       sum_v1 = sum_v1;  // copy
+	       sum_v0 = sum_v0;  // copy
+	       sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
sum_v2 = sum_v2;  // copy
sum_v3 = sum_v3;  // copy

-	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-	       sum_v2 = sum_v2;  // copy
+	       sum_v0 = sum_v0;  // copy
+	       sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+	       sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
sum_v3 = sum_v3;  // copy
+
+	       sum_v0 += n_v0[i: 0  ~ 3 ];
+	       sum_v1 += n_v1[i: 4  ~ 7 ];
+	       sum_v2 += n_v2[i: 8  ~ 11];
+	       sum_v3 += n_v3[i: 12 ~ 15];
}

-	   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
-	*/
+	 Moreover, for a higher instruction parallelism in final vectorized
+	 loop, it is considered to make those effective vector lane-reducing
+	 ops be distributed evenly among all def-use cycles.  In the above
+	 example, DOT_PROD, WIDEN_SUM and SADs are generated into disparate
+	 cycles, instruction dependency among them could be eliminated.  */
unsigned effec_ncopies = vec_oprnds[0].length ();
unsigned total_ncopies = vec_oprnds[reduc_index].length ();

@@ -8884,6 +8893,47 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
vec_oprnds[i].safe_grow_cleared (total_ncopies);
}
}
+
+      tree reduc_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+      gcc_assert (reduc_vectype_in);
+
+      unsigned effec_reduc_ncopies
+	= vect_get_num_copies (loop_vinfo, slp_node, reduc_vectype_in);
+
+      gcc_assert (effec_ncopies <= effec_reduc_ncopies);
+
+      if (effec_ncopies < effec_reduc_ncopies)
+	{
+	  /* Find suitable def-use cycles to generate vectorized statements
+	     into, and reorder operands based on the selection.  */
+	  unsigned curr_pos = reduc_info->reduc_result_pos;
+	  unsigned next_pos = (curr_pos + effec_ncopies) % effec_reduc_ncopies;
+
+	  gcc_assert (curr_pos < effec_reduc_ncopies);
+          reduc_info->reduc_result_pos = next_pos;
+
+	  if (curr_pos)
+	    {
+	      unsigned count = effec_reduc_ncopies - effec_ncopies;
+	      unsigned start = curr_pos - count;
+
+	      if ((int) start < 0)
+		{
+		  count = curr_pos;
+		  start = 0;
+		}
+
+	      for (unsigned i = 0; i < op.num_ops - 1; i++)
+		{
+		  for (unsigned j = effec_ncopies; j > start; j--)
+		    {
+		      unsigned k = j - 1;
+		      std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+		      gcc_assert (!vec_oprnds[i][k]);
+		    }
+		}
+	    }
+	}
}

bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 62121f63f18..b6fdbc651d6 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@  public:
/* The vector type for performing the actual reduction.  */
tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results, this field specifies result index starting from which any
+     following land-reducing operation would be assigned to.  */
+  unsigned int reduc_result_pos;
+
/* If IS_REDUC_INFO is true and if the vector code is performing
N scalar reductions in parallel, this variable gives the initial
scalar values of those N reductions.  */
--
2.17.1

```