[V2] Use subscalar mode to move struct block for parameter

Message ID 20221117061549.178481-1-guojiufu@linux.ibm.com
State New
Headers
Series [V2] Use subscalar mode to move struct block for parameter |

Commit Message

Jiufu Guo Nov. 17, 2022, 6:15 a.m. UTC
  Hi,

As mentioned in the previous version patch:
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
The suboptimal code is generated for "assigning from parameter" or
"assigning to return value".
This patch enhances the assignment from parameters like the below
cases:
/////case1.c
typedef struct SA {double a[3];long l; } A;
A ret_arg (A a) {return a;}
void st_arg (A a, A *p) {*p = a;}

////case2.c
typedef struct SA {double a[3];} A;
A ret_arg (A a) {return a;}
void st_arg (A a, A *p) {*p = a;}

For this patch, bootstrap and regtest pass on ppc64{,le}
and x86_64.
* Besides asking for help reviewing this patch, I would like to
consult comments about enhancing for "assigning to returns".

On some targets(ppc64), for below case:
////case3.c
typedef struct SA {double a[3]; long l; } A;
A ret_arg_pt (A *a) {return *a;}

The optimized GIMPLE code looks like:
  <retval> = *a_2(D);
  return <retval>;
Here, <retval>(aka. RESULT_DECL) is MEM, and "aggregate_value_p"
returns true for <retval>.

* While for below case, the generated code is still suboptimal.
////case4.c
typedef struct SA {double a[3];} A;
A ret_arg_pt (A *a) {return *a;}

The optimized GIMPLE code looks like:
  D.3951 = *a_2(D);
  return D.3951;
The "return/assign" stmts are using D.3951(VAR_DECL) instead
"<retval>(RESULT_DECL)".  The mode of D.3951/<retval> is BLK.
The RTL of D.3951 is MEM, and RTL of <retval> is PARALLEL. For
PARALLEL, aggregate_value_p returns false.

In function expand_assignment, there is code:
  if (TREE_CODE (to) == RESULT_DECL
      && (REG_P (to_rtx) || GET_CODE (to_rtx) == PARALLEL))
This code can handle "<retval>", but can not handle "D.3951".

I'm thinking of one way to handle this issue is to update the
GIMPLE sequence as: "<retval> = *a_2(D); return <retval>;"
Or, collecting VARs which are used by return stmts; and for
assignments to those VARs, using sub scalar mode for the block
move.

Thanks for any comments and suggestions!


BR,
Jeff (Jiufu)

---
 gcc/expr.cc | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
  

Comments

Jiufu Guo Nov. 21, 2022, 3:07 a.m. UTC | #1
Jiufu Guo <guojiufu@linux.ibm.com> writes:

> Hi,
>
> As mentioned in the previous version patch:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
> The suboptimal code is generated for "assigning from parameter" or
> "assigning to return value".
> This patch enhances the assignment from parameters like the below
> cases:
> /////case1.c
> typedef struct SA {double a[3];long l; } A;
> A ret_arg (A a) {return a;}
> void st_arg (A a, A *p) {*p = a;}
>
> ////case2.c
> typedef struct SA {double a[3];} A;
> A ret_arg (A a) {return a;}
> void st_arg (A a, A *p) {*p = a;}
>
> For this patch, bootstrap and regtest pass on ppc64{,le}
> and x86_64.
> * Besides asking for help reviewing this patch, I would like to
> consult comments about enhancing for "assigning to returns".

I updated the patch to fix the issue for returns.  This patch
adds a flag DECL_USEDBY_RETURN_P to indicate if a var is used
by a return stmt.  This patch fix the issue in expand pass only,
so, we would try to update the patch to avoid this flag.

diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index dd29ffffc03..09b8ec64cea 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -2158,6 +2158,20 @@ expand_used_vars (bitmap forced_stack_vars)
     frame_phase = off ? align - off : 0;
   }
 
+  /* Collect VARs on returns.  */
+  if (DECL_RESULT (current_function_decl))
+    {
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
+	if (greturn *ret = safe_dyn_cast<greturn *> (last_stmt (e->src)))
+	  {
+	    tree val = gimple_return_retval (ret);
+	    if (val && VAR_P (val))
+	      DECL_USEDBY_RETURN_P (val) = 1;
+	  }
+    }
+
   /* Set TREE_USED on all variables in the local_decls.  */
   FOR_EACH_LOCAL_DECL (cfun, i, var)
     TREE_USED (var) = 1;
diff --git a/gcc/expr.cc b/gcc/expr.cc
index d9407432ea5..20973649963 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -6045,6 +6045,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
       return;
     }
 
+  if ((TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
+       && TYPE_MODE (TREE_TYPE (from)) == BLKmode
+       && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
+	   || REG_P (DECL_INCOMING_RTL (from))))
+      || (VAR_P (to) && DECL_USEDBY_RETURN_P (to)
+	  && TYPE_MODE (TREE_TYPE (to)) == BLKmode
+	  && GET_CODE (DECL_RTL (DECL_RESULT (current_function_decl)))
+	       == PARALLEL))
+    {
+      push_temp_slots ();
+      rtx par_ret;
+      machine_mode mode;
+      par_ret = TREE_CODE (from) == PARM_DECL
+		  ? DECL_INCOMING_RTL (from)
+		  : DECL_RTL (DECL_RESULT (current_function_decl));
+      mode = GET_CODE (par_ret) == PARALLEL
+	       ? GET_MODE (XEXP (XVECEXP (par_ret, 0, 0), 0))
+	       : word_mode;
+      int mode_size = GET_MODE_SIZE (mode).to_constant ();
+      int size = INTVAL (expr_size (from));
+
+      /* If/How the parameter using submode, it dependes on the size and
+	 position of the parameter.  Here using heurisitic number.  */
+      int hurstc_num = 8;
+      if (size < mode_size || (size % mode_size) != 0
+	  || size > (mode_size * hurstc_num))
+	result = store_expr (from, to_rtx, 0, nontemporal, false);
+      else
+	{
+	  rtx from_rtx
+	    = expand_expr (from, NULL_RTX, GET_MODE (to_rtx), EXPAND_NORMAL);
+	  for (int i = 0; i < size / mode_size; i++)
+	    {
+	      rtx temp = gen_reg_rtx (mode);
+	      rtx src = adjust_address (from_rtx, mode, mode_size * i);
+	      rtx dest = adjust_address (to_rtx, mode, mode_size * i);
+	      emit_move_insn (temp, src);
+	      emit_move_insn (dest, temp);
+	    }
+	  result = to_rtx;
+	}
+      preserve_temp_slots (result);
+      pop_temp_slots ();
+      return;
+    }
+
   /* Compute FROM and store the value in the rtx we got.  */
 
   push_temp_slots ();
diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index af75522504f..be42e1464de 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -1808,7 +1808,8 @@ struct GTY(()) tree_decl_common {
      In VAR_DECL, PARM_DECL and RESULT_DECL, this is
      DECL_HAS_VALUE_EXPR_P.  */
   unsigned decl_flag_2 : 1;
-  /* In FIELD_DECL, this is DECL_PADDING_P.  */
+  /* In FIELD_DECL, this is DECL_PADDING_P
+     In VAR_DECL, this is DECL_USEDBY_RETURN_P.  */
   unsigned decl_flag_3 : 1;
   /* Logically, these two would go in a theoretical base shared by var and
      parm decl. */
diff --git a/gcc/tree.h b/gcc/tree.h
index a863d2e50e5..73c0314dac1 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -3011,6 +3011,10 @@ extern void decl_value_expr_insert (tree, tree);
 #define DECL_PADDING_P(NODE) \
   (FIELD_DECL_CHECK (NODE)->decl_common.decl_flag_3)
 
+/* Used in a VAR_DECL to indicate that it is used by a return stmt.  */
+#define DECL_USEDBY_RETURN_P(NODE) \
+  (VAR_DECL_CHECK (NODE)->decl_common.decl_flag_3)
+
 /* Used in a FIELD_DECL to indicate whether this field is not a flexible
    array member. This is only valid for the last array type field of a
    structure.  */

>
> On some targets(ppc64), for below case:
> ////case3.c
> typedef struct SA {double a[3]; long l; } A;
> A ret_arg_pt (A *a) {return *a;}
>
> The optimized GIMPLE code looks like:
>   <retval> = *a_2(D);
>   return <retval>;
> Here, <retval>(aka. RESULT_DECL) is MEM, and "aggregate_value_p"
> returns true for <retval>.
>
> * While for below case, the generated code is still suboptimal.
> ////case4.c
> typedef struct SA {double a[3];} A;
> A ret_arg_pt (A *a) {return *a;}
>
> The optimized GIMPLE code looks like:
>   D.3951 = *a_2(D);
>   return D.3951;
> The "return/assign" stmts are using D.3951(VAR_DECL) instead
> "<retval>(RESULT_DECL)".  The mode of D.3951/<retval> is BLK.
> The RTL of D.3951 is MEM, and RTL of <retval> is PARALLEL. For
> PARALLEL, aggregate_value_p returns false.
>
> In function expand_assignment, there is code:
>   if (TREE_CODE (to) == RESULT_DECL
>       && (REG_P (to_rtx) || GET_CODE (to_rtx) == PARALLEL))
> This code can handle "<retval>", but can not handle "D.3951".
>
> I'm thinking of one way to handle this issue is to update the
> GIMPLE sequence as: "<retval> = *a_2(D); return <retval>;"
> Or, collecting VARs which are used by return stmts; and for
> assignments to those VARs, using sub scalar mode for the block
> move.
>
> Thanks for any comments and suggestions!
>
>
> BR,
> Jeff (Jiufu)
>
> ---
>  gcc/expr.cc | 40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
>
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index d9407432ea5..420f9cf3662 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -6045,6 +6045,46 @@ expand_assignment (tree to, tree from, bool nontemporal)
>        return;
>      }
>  
> +  if (TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
> +      && TYPE_MODE (TREE_TYPE (from)) == BLKmode
> +      && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
> +	  || REG_P (DECL_INCOMING_RTL (from))))
> +    {
> +      rtx parm = DECL_INCOMING_RTL (from);
> +
> +      push_temp_slots ();
> +      machine_mode mode;
> +      mode = GET_CODE (parm) == PARALLEL
> +	       ? GET_MODE (XEXP (XVECEXP (parm, 0, 0), 0))
> +	       : word_mode;
> +      int mode_size = GET_MODE_SIZE (mode).to_constant ();
> +      int size = INTVAL (expr_size (from));
> +
> +      /* If/How the parameter using submode, it dependes on the size and
> +	 position of the parameter.  Here using heurisitic number.  */
> +      int hurstc_num = 8;
> +      if (size < mode_size || (size % mode_size) != 0
> +	  || size > (mode_size * hurstc_num))
> +	result = store_expr (from, to_rtx, 0, nontemporal, false);
> +      else
> +	{
> +	  rtx from_rtx
> +	    = expand_expr (from, NULL_RTX, GET_MODE (to_rtx), EXPAND_NORMAL);
> +	  for (int i = 0; i < size / mode_size; i++)
> +	    {
> +	      rtx temp = gen_reg_rtx (mode);
> +	      rtx src = adjust_address (from_rtx, mode, mode_size * i);
> +	      rtx dest = adjust_address (to_rtx, mode, mode_size * i);
> +	      emit_move_insn (temp, src);
> +	      emit_move_insn (dest, temp);
> +	    }
> +	  result = to_rtx;
> +	}
> +      preserve_temp_slots (result);
> +      pop_temp_slots ();
> +      return;
> +    }
> +
>    /* Compute FROM and store the value in the rtx we got.  */
>  
>    push_temp_slots ();
  
Jeff Law Nov. 22, 2022, 9:57 p.m. UTC | #2
On 11/20/22 20:07, Jiufu Guo wrote:
> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>
>> Hi,
>>
>> As mentioned in the previous version patch:
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
>> The suboptimal code is generated for "assigning from parameter" or
>> "assigning to return value".
>> This patch enhances the assignment from parameters like the below
>> cases:
>> /////case1.c
>> typedef struct SA {double a[3];long l; } A;
>> A ret_arg (A a) {return a;}
>> void st_arg (A a, A *p) {*p = a;}
>>
>> ////case2.c
>> typedef struct SA {double a[3];} A;
>> A ret_arg (A a) {return a;}
>> void st_arg (A a, A *p) {*p = a;}
>>
>> For this patch, bootstrap and regtest pass on ppc64{,le}
>> and x86_64.
>> * Besides asking for help reviewing this patch, I would like to
>> consult comments about enhancing for "assigning to returns".
> I updated the patch to fix the issue for returns.  This patch
> adds a flag DECL_USEDBY_RETURN_P to indicate if a var is used
> by a return stmt.  This patch fix the issue in expand pass only,
> so, we would try to update the patch to avoid this flag.
>
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index dd29ffffc03..09b8ec64cea 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -2158,6 +2158,20 @@ expand_used_vars (bitmap forced_stack_vars)
>       frame_phase = off ? align - off : 0;
>     }
>   
> +  /* Collect VARs on returns.  */
> +  if (DECL_RESULT (current_function_decl))
> +    {
> +      edge_iterator ei;
> +      edge e;
> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> +	if (greturn *ret = safe_dyn_cast<greturn *> (last_stmt (e->src)))
> +	  {
> +	    tree val = gimple_return_retval (ret);
> +	    if (val && VAR_P (val))
> +	      DECL_USEDBY_RETURN_P (val) = 1;
> +	  }
> +    }
> +
>     /* Set TREE_USED on all variables in the local_decls.  */
>     FOR_EACH_LOCAL_DECL (cfun, i, var)
>       TREE_USED (var) = 1;
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index d9407432ea5..20973649963 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -6045,6 +6045,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
>         return;
>       }
>   
> +  if ((TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
> +       && TYPE_MODE (TREE_TYPE (from)) == BLKmode
> +       && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
> +	   || REG_P (DECL_INCOMING_RTL (from))))
> +      || (VAR_P (to) && DECL_USEDBY_RETURN_P (to)
> +	  && TYPE_MODE (TREE_TYPE (to)) == BLKmode
> +	  && GET_CODE (DECL_RTL (DECL_RESULT (current_function_decl)))
> +	       == PARALLEL))
> +    {
> +      push_temp_slots ();
> +      rtx par_ret;
> +      machine_mode mode;
> +      par_ret = TREE_CODE (from) == PARM_DECL
> +		  ? DECL_INCOMING_RTL (from)
> +		  : DECL_RTL (DECL_RESULT (current_function_decl));
> +      mode = GET_CODE (par_ret) == PARALLEL
> +	       ? GET_MODE (XEXP (XVECEXP (par_ret, 0, 0), 0))
> +	       : word_mode;
> +      int mode_size = GET_MODE_SIZE (mode).to_constant ();
> +      int size = INTVAL (expr_size (from));
> +
> +      /* If/How the parameter using submode, it dependes on the size and
> +	 position of the parameter.  Here using heurisitic number.  */
> +      int hurstc_num = 8;

Where did this come from and what does it mean?


Note that BLKmode subword values passed in registers can be either right 
or left justified.  I think you also need to worry about endianness here.


Jeff
  
Jiufu Guo Nov. 23, 2022, 2:58 a.m. UTC | #3
Hi Jeff,

Thanks a lot for your comments!

Jeff Law <jeffreyalaw@gmail.com> writes:

> On 11/20/22 20:07, Jiufu Guo wrote:
>> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>>
>>> Hi,
>>>
>>> As mentioned in the previous version patch:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
>>> The suboptimal code is generated for "assigning from parameter" or
>>> "assigning to return value".
>>> This patch enhances the assignment from parameters like the below
>>> cases:
>>> /////case1.c
>>> typedef struct SA {double a[3];long l; } A;
>>> A ret_arg (A a) {return a;}
>>> void st_arg (A a, A *p) {*p = a;}
>>>
>>> ////case2.c
>>> typedef struct SA {double a[3];} A;
>>> A ret_arg (A a) {return a;}
>>> void st_arg (A a, A *p) {*p = a;}
>>>
>>> For this patch, bootstrap and regtest pass on ppc64{,le}
>>> and x86_64.
>>> * Besides asking for help reviewing this patch, I would like to
>>> consult comments about enhancing for "assigning to returns".
>> I updated the patch to fix the issue for returns.  This patch
>> adds a flag DECL_USEDBY_RETURN_P to indicate if a var is used
>> by a return stmt.  This patch fix the issue in expand pass only,
>> so, we would try to update the patch to avoid this flag.
>>
>> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
>> index dd29ffffc03..09b8ec64cea 100644
>> --- a/gcc/cfgexpand.cc
>> +++ b/gcc/cfgexpand.cc
>> @@ -2158,6 +2158,20 @@ expand_used_vars (bitmap forced_stack_vars)
>>       frame_phase = off ? align - off : 0;
>>     }
>>   +  /* Collect VARs on returns.  */
>> +  if (DECL_RESULT (current_function_decl))
>> +    {
>> +      edge_iterator ei;
>> +      edge e;
>> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
>> +	if (greturn *ret = safe_dyn_cast<greturn *> (last_stmt (e->src)))
>> +	  {
>> +	    tree val = gimple_return_retval (ret);
>> +	    if (val && VAR_P (val))
>> +	      DECL_USEDBY_RETURN_P (val) = 1;
>> +	  }
>> +    }
>> +
>>     /* Set TREE_USED on all variables in the local_decls.  */
>>     FOR_EACH_LOCAL_DECL (cfun, i, var)
>>       TREE_USED (var) = 1;
>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> index d9407432ea5..20973649963 100644
>> --- a/gcc/expr.cc
>> +++ b/gcc/expr.cc
>> @@ -6045,6 +6045,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
>>         return;
>>       }
>>   +  if ((TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
>> +       && TYPE_MODE (TREE_TYPE (from)) == BLKmode
>> +       && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
>> +	   || REG_P (DECL_INCOMING_RTL (from))))
>> +      || (VAR_P (to) && DECL_USEDBY_RETURN_P (to)
>> +	  && TYPE_MODE (TREE_TYPE (to)) == BLKmode
>> +	  && GET_CODE (DECL_RTL (DECL_RESULT (current_function_decl)))
>> +	       == PARALLEL))
>> +    {
>> +      push_temp_slots ();
>> +      rtx par_ret;
>> +      machine_mode mode;
>> +      par_ret = TREE_CODE (from) == PARM_DECL
>> +		  ? DECL_INCOMING_RTL (from)
>> +		  : DECL_RTL (DECL_RESULT (current_function_decl));
>> +      mode = GET_CODE (par_ret) == PARALLEL
>> +	       ? GET_MODE (XEXP (XVECEXP (par_ret, 0, 0), 0))
>> +	       : word_mode;
>> +      int mode_size = GET_MODE_SIZE (mode).to_constant ();
>> +      int size = INTVAL (expr_size (from));
>> +
>> +      /* If/How the parameter using submode, it dependes on the size and
>> +	 position of the parameter.  Here using heurisitic number.  */
>> +      int hurstc_num = 8;
>
> Where did this come from and what does it mean?
Sorry for does not make this clear. We know that an aggregate arg may be
on registers partially or totally, as assign_parm_adjust_entry_rtl.
For an example, if a parameter with 12 words and the target/ABI only
allow 8 gprs for arguments, then the parameter could use 8 regs at most
and left part in stack.

>
>
> Note that BLKmode subword values passed in registers can be either
> right or left justified.  I think you also need to worry about
> endianness here.
Since the subword is used to move block(read from source mem and then
store to destination mem with register mode), and this would keep to use
the same endianness on reg like move_block_from_reg. So, the patch does
not check the endianness.

If any concerns and sugguestions, please point out, thanks!

BR,
Jeff (Jiufu)
>
>
> Jeff
  
Richard Biener Nov. 24, 2022, 7:31 a.m. UTC | #4
On Wed, 23 Nov 2022, Jiufu Guo wrote:

> Hi Jeff,
> 
> Thanks a lot for your comments!

Sorry for the late response ...

> Jeff Law <jeffreyalaw@gmail.com> writes:
> 
> > On 11/20/22 20:07, Jiufu Guo wrote:
> >> Jiufu Guo <guojiufu@linux.ibm.com> writes:
> >>
> >>> Hi,
> >>>
> >>> As mentioned in the previous version patch:
> >>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
> >>> The suboptimal code is generated for "assigning from parameter" or
> >>> "assigning to return value".
> >>> This patch enhances the assignment from parameters like the below
> >>> cases:
> >>> /////case1.c
> >>> typedef struct SA {double a[3];long l; } A;
> >>> A ret_arg (A a) {return a;}
> >>> void st_arg (A a, A *p) {*p = a;}
> >>>
> >>> ////case2.c
> >>> typedef struct SA {double a[3];} A;
> >>> A ret_arg (A a) {return a;}
> >>> void st_arg (A a, A *p) {*p = a;}
> >>>
> >>> For this patch, bootstrap and regtest pass on ppc64{,le}
> >>> and x86_64.
> >>> * Besides asking for help reviewing this patch, I would like to
> >>> consult comments about enhancing for "assigning to returns".
> >> I updated the patch to fix the issue for returns.  This patch
> >> adds a flag DECL_USEDBY_RETURN_P to indicate if a var is used
> >> by a return stmt.  This patch fix the issue in expand pass only,
> >> so, we would try to update the patch to avoid this flag.
> >>
> >> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> >> index dd29ffffc03..09b8ec64cea 100644
> >> --- a/gcc/cfgexpand.cc
> >> +++ b/gcc/cfgexpand.cc
> >> @@ -2158,6 +2158,20 @@ expand_used_vars (bitmap forced_stack_vars)
> >>       frame_phase = off ? align - off : 0;
> >>     }
> >>   +  /* Collect VARs on returns.  */
> >> +  if (DECL_RESULT (current_function_decl))
> >> +    {
> >> +      edge_iterator ei;
> >> +      edge e;
> >> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> >> +	if (greturn *ret = safe_dyn_cast<greturn *> (last_stmt (e->src)))
> >> +	  {
> >> +	    tree val = gimple_return_retval (ret);
> >> +	    if (val && VAR_P (val))
> >> +	      DECL_USEDBY_RETURN_P (val) = 1;

you probably want to check && auto_var_in_fn (val, ...) since val
might be global?

> >> +	  }
> >> +    }
> >> +
> >>     /* Set TREE_USED on all variables in the local_decls.  */
> >>     FOR_EACH_LOCAL_DECL (cfun, i, var)
> >>       TREE_USED (var) = 1;
> >> diff --git a/gcc/expr.cc b/gcc/expr.cc
> >> index d9407432ea5..20973649963 100644
> >> --- a/gcc/expr.cc
> >> +++ b/gcc/expr.cc
> >> @@ -6045,6 +6045,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
> >>         return;
> >>       }

I miss an explanatory comment here on that the following is heuristics
and its reasoning.

> >>   +  if ((TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
> >> +       && TYPE_MODE (TREE_TYPE (from)) == BLKmode

Why check TYPE_MODE here?  Do you want AGGREGATE_TYPE_P on the type
instead?

> >> +       && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
> >> +	   || REG_P (DECL_INCOMING_RTL (from))))
> >> +      || (VAR_P (to) && DECL_USEDBY_RETURN_P (to)
> >> +	  && TYPE_MODE (TREE_TYPE (to)) == BLKmode

Likewise.

> >> +	  && GET_CODE (DECL_RTL (DECL_RESULT (current_function_decl)))
> >> +	       == PARALLEL))

Not REG_P here?

> >> +    {
> >> +      push_temp_slots ();
> >> +      rtx par_ret;
> >> +      machine_mode mode;
> >> +      par_ret = TREE_CODE (from) == PARM_DECL
> >> +		  ? DECL_INCOMING_RTL (from)
> >> +		  : DECL_RTL (DECL_RESULT (current_function_decl));
> >> +      mode = GET_CODE (par_ret) == PARALLEL
> >> +	       ? GET_MODE (XEXP (XVECEXP (par_ret, 0, 0), 0))
> >> +	       : word_mode;
> >> +      int mode_size = GET_MODE_SIZE (mode).to_constant ();
> >> +      int size = INTVAL (expr_size (from));
> >> +
> >> +      /* If/How the parameter using submode, it dependes on the size and
> >> +	 position of the parameter.  Here using heurisitic number.  */
> >> +      int hurstc_num = 8;
> >
> > Where did this come from and what does it mean?
> Sorry for does not make this clear. We know that an aggregate arg may be
> on registers partially or totally, as assign_parm_adjust_entry_rtl.
> For an example, if a parameter with 12 words and the target/ABI only
> allow 8 gprs for arguments, then the parameter could use 8 regs at most
> and left part in stack.

I also wonder about the exact semantics of the parallels we get here.

+      int size = INTVAL (expr_size (from));

esp. when you use sth as simple as this.  Shouldn't you instead look
at to_rtx since that's already expanded?  For returns that should
be the desired layout to match 'from' to, no?  Maybe it's better
to not try sharing the code for both incoming and return copies
for clarity?

Also, what happens if there's a copy from a PARM_DECL to a
DECL_USEDBY_RETURN_P decl?  Which heuristic takes precedent?

I think that at least the place of the copy improvement and the
way you compute DECL_USEDBY_RETURN_P is reasonable.

Thanks,
Richard.

> >
> >
> > Note that BLKmode subword values passed in registers can be either
> > right or left justified.  I think you also need to worry about
> > endianness here.
> Since the subword is used to move block(read from source mem and then
> store to destination mem with register mode), and this would keep to use
> the same endianness on reg like move_block_from_reg. So, the patch does
> not check the endianness.
> 
> If any concerns and sugguestions, please point out, thanks!
> 
> BR,
> Jeff (Jiufu)
> >
> >
> > Jeff
>
  
Jiufu Guo Nov. 25, 2022, 5:05 a.m. UTC | #5
Hi Richard,

Thanks a lot for your comments!

Richard Biener <rguenther@suse.de> writes:

> On Wed, 23 Nov 2022, Jiufu Guo wrote:
>
>> Hi Jeff,
>> 
>> Thanks a lot for your comments!
>
> Sorry for the late response ...
>
>> Jeff Law <jeffreyalaw@gmail.com> writes:
>> 
>> > On 11/20/22 20:07, Jiufu Guo wrote:
>> >> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>> >>
>> >>> Hi,
>> >>>
>> >>> As mentioned in the previous version patch:
>> >>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
>> >>> The suboptimal code is generated for "assigning from parameter" or
>> >>> "assigning to return value".
>> >>> This patch enhances the assignment from parameters like the below
>> >>> cases:
>> >>> /////case1.c
>> >>> typedef struct SA {double a[3];long l; } A;
>> >>> A ret_arg (A a) {return a;}
>> >>> void st_arg (A a, A *p) {*p = a;}
>> >>>
>> >>> ////case2.c
>> >>> typedef struct SA {double a[3];} A;
>> >>> A ret_arg (A a) {return a;}
>> >>> void st_arg (A a, A *p) {*p = a;}
>> >>>
>> >>> For this patch, bootstrap and regtest pass on ppc64{,le}
>> >>> and x86_64.
>> >>> * Besides asking for help reviewing this patch, I would like to
>> >>> consult comments about enhancing for "assigning to returns".
>> >> I updated the patch to fix the issue for returns.  This patch
>> >> adds a flag DECL_USEDBY_RETURN_P to indicate if a var is used
>> >> by a return stmt.  This patch fix the issue in expand pass only,
>> >> so, we would try to update the patch to avoid this flag.
>> >>
>> >> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
>> >> index dd29ffffc03..09b8ec64cea 100644
>> >> --- a/gcc/cfgexpand.cc
>> >> +++ b/gcc/cfgexpand.cc
>> >> @@ -2158,6 +2158,20 @@ expand_used_vars (bitmap forced_stack_vars)
>> >>       frame_phase = off ? align - off : 0;
>> >>     }
>> >>   +  /* Collect VARs on returns.  */
>> >> +  if (DECL_RESULT (current_function_decl))
>> >> +    {
>> >> +      edge_iterator ei;
>> >> +      edge e;
>> >> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
>> >> +	if (greturn *ret = safe_dyn_cast<greturn *> (last_stmt (e->src)))
>> >> +	  {
>> >> +	    tree val = gimple_return_retval (ret);
>> >> +	    if (val && VAR_P (val))
>> >> +	      DECL_USEDBY_RETURN_P (val) = 1;
>
> you probably want to check && auto_var_in_fn (val, ...) since val
> might be global?
Since we are collecting the return vals, it would be built in function
gimplify_return_expr.  In this function, create_tmp_reg is used and
a local temp.  So it would not be a global var here.

code piece in gimplify_return_expr:
  if (!result_decl)
    result = NULL_TREE;
  else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl)))
    {
    ....
      result = result_decl;
    }
  else if (gimplify_ctxp->return_temp)
    result = gimplify_ctxp->return_temp;
  else
    {
      result = create_tmp_reg (TREE_TYPE (result_decl));

Here, for "typedef struct SA {double a[3];}", aggregate_value_p returns
false for target like ppc64le, because result of "hard_function_value"
is a "rtx with PARALLELL code".
And then a DECL_VAR is built via "create_tmp_reg" (actually it is not a
reg here. it built a DECL_VAR with RECORD type and BLK mode).

I also tried the way to use RESULT_DECL for this kind of type, or
let aggregate_value_p accept this kind of type.  But it seems not easy,
because "<retval> (RESULT_DECL with PARALLEL)" is not ok for address
operations.


>
>> >> +	  }
>> >> +    }
>> >> +
>> >>     /* Set TREE_USED on all variables in the local_decls.  */
>> >>     FOR_EACH_LOCAL_DECL (cfun, i, var)
>> >>       TREE_USED (var) = 1;
>> >> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> >> index d9407432ea5..20973649963 100644
>> >> --- a/gcc/expr.cc
>> >> +++ b/gcc/expr.cc
>> >> @@ -6045,6 +6045,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
>> >>         return;
>> >>       }
>
> I miss an explanatory comment here on that the following is heuristics
> and its reasoning.
>
>> >>   +  if ((TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
>> >> +       && TYPE_MODE (TREE_TYPE (from)) == BLKmode
>
> Why check TYPE_MODE here?  Do you want AGGREGATE_TYPE_P on the type
> instead?
Checking BLK, because I want make sure the param should occur on
register and stack (saved from register).
Actualy, my intention is checking:
GET_MODE (DECL_INCOMING_RTL (from)) == BLKmode

For code:
GET_MODE (DECL_INCOMING_RTL (from)) == BLKmode
&& (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
    || REG_P (DECL_INCOMING_RTL (from)))
This checking indicates if the param may be passing via 2 or more
registers.

Using "AGGREGATE_TYPE_P && (PARALLEL || REG_P)" may be ok and more
readable. I would have a test.

>
>> >> +       && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
>> >> +	   || REG_P (DECL_INCOMING_RTL (from))))
>> >> +      || (VAR_P (to) && DECL_USEDBY_RETURN_P (to)
>> >> +	  && TYPE_MODE (TREE_TYPE (to)) == BLKmode
>
> Likewise.
>
>> >> +	  && GET_CODE (DECL_RTL (DECL_RESULT (current_function_decl)))
>> >> +	       == PARALLEL))
>
> Not REG_P here?
REG_P with BLK on return would means return in memory, instead return
through registers, so, REG_P was not added here, and let it use
orignal behavior.
>
>> >> +    {
>> >> +      push_temp_slots ();
>> >> +      rtx par_ret;
>> >> +      machine_mode mode;
>> >> +      par_ret = TREE_CODE (from) == PARM_DECL
>> >> +		  ? DECL_INCOMING_RTL (from)
>> >> +		  : DECL_RTL (DECL_RESULT (current_function_decl));
>> >> +      mode = GET_CODE (par_ret) == PARALLEL
>> >> +	       ? GET_MODE (XEXP (XVECEXP (par_ret, 0, 0), 0))
>> >> +	       : word_mode;
>> >> +      int mode_size = GET_MODE_SIZE (mode).to_constant ();
>> >> +      int size = INTVAL (expr_size (from));
>> >> +
>> >> +      /* If/How the parameter using submode, it dependes on the size and
>> >> +	 position of the parameter.  Here using heurisitic number.  */
>> >> +      int hurstc_num = 8;
>> >
>> > Where did this come from and what does it mean?
>> Sorry for does not make this clear. We know that an aggregate arg may be
>> on registers partially or totally, as assign_parm_adjust_entry_rtl.
>> For an example, if a parameter with 12 words and the target/ABI only
>> allow 8 gprs for arguments, then the parameter could use 8 regs at most
>> and left part in stack.
>
> I also wonder about the exact semantics of the parallels we get here.
>
> +      int size = INTVAL (expr_size (from));
>
> esp. when you use sth as simple as this.  Shouldn't you instead look
> at to_rtx since that's already expanded?  For returns that should
Yes, I use "expr_size (from)" is just because the size should be same
between "from", "to" and "to_rtx" for these cases.
> be the desired layout to match 'from' to, no?  Maybe it's better
> to not try sharing the code for both incoming and return copies
> for clarity?
OK, thanks, good sugguestion.
>
> Also, what happens if there's a copy from a PARM_DECL to a
> DECL_USEDBY_RETURN_P decl?  Which heuristic takes precedent?
I thinking below code reflects this senario:

typedef struct SA {double a[3];} A;
A ret_arg (A a) {return a;}

The gimple seq:
  D.3951 = a;
  return D.3951;

For this patch, since "PARM_DECL" is checked before "VAR_P",
so, PARM_DECL is checked as true first.

>
> I think that at least the place of the copy improvement and the
> way you compute DECL_USEDBY_RETURN_P is reasonable.

I'm also thinking to mark this flag for return variables during
it was built in gimplify_return_expr.  This would be more
straightforward and no need to query/walk return stmts.  One
possible concern: it is too far between "gimplify" phase and
"expand".

I will update the patch accordingly.

Thanks for suggestions and comments!


BR,
Jeff (Jiufu)

>
> Thanks,
> Richard.
>
>> >
>> >
>> > Note that BLKmode subword values passed in registers can be either
>> > right or left justified.  I think you also need to worry about
>> > endianness here.
>> Since the subword is used to move block(read from source mem and then
>> store to destination mem with register mode), and this would keep to use
>> the same endianness on reg like move_block_from_reg. So, the patch does
>> not check the endianness.
>> 
>> If any concerns and sugguestions, please point out, thanks!
>> 
>> BR,
>> Jeff (Jiufu)
>> >
>> >
>> > Jeff
>>
  
Jiufu Guo Nov. 25, 2022, 12:29 p.m. UTC | #6
Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:

> Hi Richard,
>
> Thanks a lot for your comments!
>
> Richard Biener <rguenther@suse.de> writes:
>
>> On Wed, 23 Nov 2022, Jiufu Guo wrote:
>>
>>> Hi Jeff,
>>> 
>>> Thanks a lot for your comments!
>>
>> Sorry for the late response ...
>>
>>> Jeff Law <jeffreyalaw@gmail.com> writes:
>>> 
>>> > On 11/20/22 20:07, Jiufu Guo wrote:
>>> >> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>> As mentioned in the previous version patch:
>>> >>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
>>> >>> The suboptimal code is generated for "assigning from parameter" or
>>> >>> "assigning to return value".
>>> >>> This patch enhances the assignment from parameters like the below
>>> >>> cases:
>>> >>> /////case1.c
cut...
>>> >> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
>>> >> +	if (greturn *ret = safe_dyn_cast<greturn *> (last_stmt (e->src)))
>>> >> +	  {
>>> >> +	    tree val = gimple_return_retval (ret);
>>> >> +	    if (val && VAR_P (val))
>>> >> +	      DECL_USEDBY_RETURN_P (val) = 1;
>>
>> you probably want to check && auto_var_in_fn (val, ...) since val
>> might be global?
> Since we are collecting the return vals, it would be built in function
> gimplify_return_expr.  In this function, create_tmp_reg is used and
> a local temp.  So it would not be a global var here.
>
> code piece in gimplify_return_expr:
>   if (!result_decl)
>     result = NULL_TREE;
>   else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl)))
>     {
>     ....
>       result = result_decl;
>     }
>   else if (gimplify_ctxp->return_temp)
>     result = gimplify_ctxp->return_temp;
>   else
>     {
>       result = create_tmp_reg (TREE_TYPE (result_decl));
>
> Here, for "typedef struct SA {double a[3];}", aggregate_value_p returns
> false for target like ppc64le, because result of "hard_function_value"
> is a "rtx with PARALLELL code".
> And then a DECL_VAR is built via "create_tmp_reg" (actually it is not a
> reg here. it built a DECL_VAR with RECORD type and BLK mode).
>
> I also tried the way to use RESULT_DECL for this kind of type, or
> let aggregate_value_p accept this kind of type.  But it seems not easy,
> because "<retval> (RESULT_DECL with PARALLEL)" is not ok for address
> operations.
>
>
>>
>>> >> +	  }
>>> >> +    }
>>> >> +
>>> >>     /* Set TREE_USED on all variables in the local_decls.  */
>>> >>     FOR_EACH_LOCAL_DECL (cfun, i, var)
>>> >>       TREE_USED (var) = 1;
>>> >> diff --git a/gcc/expr.cc b/gcc/expr.cc
>>> >> index d9407432ea5..20973649963 100644
>>> >> --- a/gcc/expr.cc
>>> >> +++ b/gcc/expr.cc
>>> >> @@ -6045,6 +6045,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
>>> >>         return;
>>> >>       }
>>
>> I miss an explanatory comment here on that the following is heuristics
>> and its reasoning.
>>
>>> >>   +  if ((TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
>>> >> +       && TYPE_MODE (TREE_TYPE (from)) == BLKmode
>>
>> Why check TYPE_MODE here?  Do you want AGGREGATE_TYPE_P on the type
>> instead?
> Checking BLK, because I want make sure the param should occur on
> register and stack (saved from register).
> Actualy, my intention is checking:
> GET_MODE (DECL_INCOMING_RTL (from)) == BLKmode
>
> For code:
> GET_MODE (DECL_INCOMING_RTL (from)) == BLKmode
> && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
>     || REG_P (DECL_INCOMING_RTL (from)))
> This checking indicates if the param may be passing via 2 or more
> registers.
>
> Using "AGGREGATE_TYPE_P && (PARALLEL || REG_P)" may be ok and more
> readable. I would have a test.
Oh, AGGREGATE_TYPE_P seems not the intention of this patch, since if
struct size can be represented by an INT size, the mode would be that
INT mode, and no need to use block move for the assignment.

BR,
Jeff (Jiufu)
>
>>
>>> >> +       && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
>>> >> +	   || REG_P (DECL_INCOMING_RTL (from))))
>>> >> +      || (VAR_P (to) && DECL_USEDBY_RETURN_P (to)
>>> >> +	  && TYPE_MODE (TREE_TYPE (to)) == BLKmode
>>
>> Likewise.
>>
>>> >> +	  && GET_CODE (DECL_RTL (DECL_RESULT (current_function_decl)))
>>> >> +	       == PARALLEL))
>>
>> Not REG_P here?
> REG_P with BLK on return would means return in memory, instead return
> through registers, so, REG_P was not added here, and let it use
> orignal behavior.
>>
>>> >> +    {
>>> >> +      push_temp_slots ();
>>> >> +      rtx par_ret;
>>> >> +      machine_mode mode;
>>> >> +      par_ret = TREE_CODE (from) == PARM_DECL
>>> >> +		  ? DECL_INCOMING_RTL (from)
>>> >> +		  : DECL_RTL (DECL_RESULT (current_function_decl));
>>> >> +      mode = GET_CODE (par_ret) == PARALLEL
>>> >> +	       ? GET_MODE (XEXP (XVECEXP (par_ret, 0, 0), 0))
>>> >> +	       : word_mode;
>>> >> +      int mode_size = GET_MODE_SIZE (mode).to_constant ();
>>> >> +      int size = INTVAL (expr_size (from));
>>> >> +
>>> >> +      /* If/How the parameter using submode, it dependes on the size and
>>> >> +	 position of the parameter.  Here using heurisitic number.  */
>>> >> +      int hurstc_num = 8;
>>> >
>>> > Where did this come from and what does it mean?
>>> Sorry for does not make this clear. We know that an aggregate arg may be
>>> on registers partially or totally, as assign_parm_adjust_entry_rtl.
>>> For an example, if a parameter with 12 words and the target/ABI only
>>> allow 8 gprs for arguments, then the parameter could use 8 regs at most
>>> and left part in stack.
>>
>> I also wonder about the exact semantics of the parallels we get here.
>>
>> +      int size = INTVAL (expr_size (from));
>>
>> esp. when you use sth as simple as this.  Shouldn't you instead look
>> at to_rtx since that's already expanded?  For returns that should
> Yes, I use "expr_size (from)" is just because the size should be same
> between "from", "to" and "to_rtx" for these cases.
>> be the desired layout to match 'from' to, no?  Maybe it's better
>> to not try sharing the code for both incoming and return copies
>> for clarity?
> OK, thanks, good sugguestion.
>>
cut...
>>> Since the subword is used to move block(read from source mem and then
>>> store to destination mem with register mode), and this would keep to use
>>> the same endianness on reg like move_block_from_reg. So, the patch does
>>> not check the endianness.
>>> 
>>> If any concerns and sugguestions, please point out, thanks!
>>> 
>>> BR,
>>> Jeff (Jiufu)
>>> >
>>> >
>>> > Jeff
>>>
  
Jeff Law Nov. 28, 2022, 5 p.m. UTC | #7
On 11/22/22 19:58, Jiufu Guo wrote:
> Hi Jeff,
>
> Thanks a lot for your comments!
>
> Jeff Law <jeffreyalaw@gmail.com> writes:
>
>> On 11/20/22 20:07, Jiufu Guo wrote:
>>> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>>>
>>>> Hi,
>>>>
>>>> As mentioned in the previous version patch:
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
>>>> The suboptimal code is generated for "assigning from parameter" or
>>>> "assigning to return value".
>>>> This patch enhances the assignment from parameters like the below
>>>> cases:
>>>> /////case1.c
>>>> typedef struct SA {double a[3];long l; } A;
>>>> A ret_arg (A a) {return a;}
>>>> void st_arg (A a, A *p) {*p = a;}
>>>>
>>>> ////case2.c
>>>> typedef struct SA {double a[3];} A;
>>>> A ret_arg (A a) {return a;}
>>>> void st_arg (A a, A *p) {*p = a;}
>>>>
>>>> For this patch, bootstrap and regtest pass on ppc64{,le}
>>>> and x86_64.
>>>> * Besides asking for help reviewing this patch, I would like to
>>>> consult comments about enhancing for "assigning to returns".
>>> I updated the patch to fix the issue for returns.  This patch
>>> adds a flag DECL_USEDBY_RETURN_P to indicate if a var is used
>>> by a return stmt.  This patch fix the issue in expand pass only,
>>> so, we would try to update the patch to avoid this flag.
>>>
>>> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
>>> index dd29ffffc03..09b8ec64cea 100644
>>> --- a/gcc/cfgexpand.cc
>>> +++ b/gcc/cfgexpand.cc
>>> @@ -2158,6 +2158,20 @@ expand_used_vars (bitmap forced_stack_vars)
>>>        frame_phase = off ? align - off : 0;
>>>      }
>>>    +  /* Collect VARs on returns.  */
>>> +  if (DECL_RESULT (current_function_decl))
>>> +    {
>>> +      edge_iterator ei;
>>> +      edge e;
>>> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
>>> +	if (greturn *ret = safe_dyn_cast<greturn *> (last_stmt (e->src)))
>>> +	  {
>>> +	    tree val = gimple_return_retval (ret);
>>> +	    if (val && VAR_P (val))
>>> +	      DECL_USEDBY_RETURN_P (val) = 1;
>>> +	  }
>>> +    }
>>> +
>>>      /* Set TREE_USED on all variables in the local_decls.  */
>>>      FOR_EACH_LOCAL_DECL (cfun, i, var)
>>>        TREE_USED (var) = 1;
>>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>>> index d9407432ea5..20973649963 100644
>>> --- a/gcc/expr.cc
>>> +++ b/gcc/expr.cc
>>> @@ -6045,6 +6045,52 @@ expand_assignment (tree to, tree from, bool nontemporal)
>>>          return;
>>>        }
>>>    +  if ((TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
>>> +       && TYPE_MODE (TREE_TYPE (from)) == BLKmode
>>> +       && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
>>> +	   || REG_P (DECL_INCOMING_RTL (from))))
>>> +      || (VAR_P (to) && DECL_USEDBY_RETURN_P (to)
>>> +	  && TYPE_MODE (TREE_TYPE (to)) == BLKmode
>>> +	  && GET_CODE (DECL_RTL (DECL_RESULT (current_function_decl)))
>>> +	       == PARALLEL))
>>> +    {
>>> +      push_temp_slots ();
>>> +      rtx par_ret;
>>> +      machine_mode mode;
>>> +      par_ret = TREE_CODE (from) == PARM_DECL
>>> +		  ? DECL_INCOMING_RTL (from)
>>> +		  : DECL_RTL (DECL_RESULT (current_function_decl));
>>> +      mode = GET_CODE (par_ret) == PARALLEL
>>> +	       ? GET_MODE (XEXP (XVECEXP (par_ret, 0, 0), 0))
>>> +	       : word_mode;
>>> +      int mode_size = GET_MODE_SIZE (mode).to_constant ();
>>> +      int size = INTVAL (expr_size (from));
>>> +
>>> +      /* If/How the parameter using submode, it dependes on the size and
>>> +	 position of the parameter.  Here using heurisitic number.  */
>>> +      int hurstc_num = 8;
>> Where did this come from and what does it mean?
> Sorry for does not make this clear. We know that an aggregate arg may be
> on registers partially or totally, as assign_parm_adjust_entry_rtl.
> For an example, if a parameter with 12 words and the target/ABI only
> allow 8 gprs for arguments, then the parameter could use 8 regs at most
> and left part in stack.

Right, but the number of registers is target dependent, so I don't see 
how using "8" or any number of that matter is correct here.


>
>>
>> Note that BLKmode subword values passed in registers can be either
>> right or left justified.  I think you also need to worry about
>> endianness here.
> Since the subword is used to move block(read from source mem and then
> store to destination mem with register mode), and this would keep to use
> the same endianness on reg like move_block_from_reg. So, the patch does
> not check the endianness.

Hmm, I was clear enough here, particularly using the endianness term.  
Don't you need to query the ABI to ensure that you're not changing left 
vs right justification for a partially in register argument.     On the 
PA we have:

/* Specify padding for the last element of a block move between registers
    and memory.

    The 64-bit runtime specifies that objects need to be left justified
    (i.e., the normal justification for a big endian target).  The 32-bit
    runtime specifies right justification for objects smaller than 64 bits.
    We use a DImode register in the parallel for 5 to 7 byte structures
    so that there is only one element.  This allows the object to be
    correctly padded.  */
#define BLOCK_REG_PADDING(MODE, TYPE, FIRST) \
   targetm.calls.function_arg_padding ((MODE), (TYPE))


Jeff
  
Jiufu Guo Nov. 29, 2022, 3:53 a.m. UTC | #8
Hi Jeff,

Thanks a lot for your comments!

Jeff Law <jeffreyalaw@gmail.com> writes:

> On 11/22/22 19:58, Jiufu Guo wrote:
>> Hi Jeff,
>>
>> Thanks a lot for your comments!
>>
>> Jeff Law <jeffreyalaw@gmail.com> writes:
>>
>>> On 11/20/22 20:07, Jiufu Guo wrote:
>>>> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> As mentioned in the previous version patch:
>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604646.html
>>>>> The suboptimal code is generated for "assigning from parameter" or
>>>>> "assigning to return value".
>>>>> This patch enhances the assignment from parameters like the below
>>>>> cases:
>>>>> /////case1.c
>>>>> typedef struct SA {double a[3];long l; } A;
>>>>> A ret_arg (A a) {return a;}
>>>>> void st_arg (A a, A *p) {*p = a;}
>>>>>
>>>>> ////case2.c
>>>>> typedef struct SA {double a[3];} A;
>>>>> A ret_arg (A a) {return a;}
>>>>> void st_arg (A a, A *p) {*p = a;}
>>>>>
>>>>> For this patch, bootstrap and regtest pass on ppc64{,le}
>>>>> and x86_64.
cut...
>>>> +	       : word_mode;
>>>> +      int mode_size = GET_MODE_SIZE (mode).to_constant ();
>>>> +      int size = INTVAL (expr_size (from));
>>>> +
>>>> +      /* If/How the parameter using submode, it dependes on the size and
>>>> +	 position of the parameter.  Here using heurisitic number.  */
>>>> +      int hurstc_num = 8;
>>> Where did this come from and what does it mean?
>> Sorry for does not make this clear. We know that an aggregate arg may be
>> on registers partially or totally, as assign_parm_adjust_entry_rtl.
>> For an example, if a parameter with 12 words and the target/ABI only
>> allow 8 gprs for arguments, then the parameter could use 8 regs at most
>> and left part in stack.
>
> Right, but the number of registers is target dependent, so I don't see
> how using "8" or any number of that matter is correct here.
I understand.  And even for the same struct type, using how many
registers to pass a parameter, it also dependends on the size of the
parameter and how many leading parameters there is.
So, as you said, "8" or any numbers are not always accurate.

Because, the enhancement in this patch is just make "block move" to be
more friendly for follow optiomizations(cse/dse/dce...) by moving
through sub-mode.  So, I just selected one arbitrary number which may
not too large and also tolerable.
I also through to query the max number registers from targets for
param/ret passing, but it may not very accurate neither.

Any sugguestions are welcome! and thanks!

>
>
>>
>>>
>>> Note that BLKmode subword values passed in registers can be either
>>> right or left justified.  I think you also need to worry about
>>> endianness here.
>> Since the subword is used to move block(read from source mem and then
>> store to destination mem with register mode), and this would keep to use
>> the same endianness on reg like move_block_from_reg. So, the patch does
>> not check the endianness.
>
> Hmm, I was clear enough here, particularly using the endianness term. 
> Don't you need to query the ABI to ensure that you're not changing
> left vs right justification for a partially in register argument.    
> On the PA we have:
>
> /* Specify padding for the last element of a block move between registers
>    and memory.
>
>    The 64-bit runtime specifies that objects need to be left justified
>    (i.e., the normal justification for a big endian target).  The 32-bit
>    runtime specifies right justification for objects smaller than 64 bits.
>    We use a DImode register in the parallel for 5 to 7 byte structures
>    so that there is only one element.  This allows the object to be
>    correctly padded.  */
> #define BLOCK_REG_PADDING(MODE, TYPE, FIRST) \
>   targetm.calls.function_arg_padding ((MODE), (TYPE))

Yes. We should be careful when store registers to stack
(assign_parms/assign_parm_setup_xx/block/reg), or load to returns.

For this patch, only simple stuffs are handled like "D.xxx = param_1",
where the source and dest of the assignment are all in memory which is
the DECL_RTL(of D.xx/param_xx) in MEM_P/BLK.
And to avoid complicate, this patch only handle the case where
"(size % mode_size) == 0".

If any misunderstandings, please point out, thanks.
And thanks for comments! 


BR,
Jeff (Jiufu)

>
>
> Jeff
  
Jeff Law Dec. 5, 2022, 4:48 p.m. UTC | #9
On 11/28/22 20:53, Jiufu Guo wrote:

>>
>> Right, but the number of registers is target dependent, so I don't see
>> how using "8" or any number of that matter is correct here.
> I understand.  And even for the same struct type, using how many
> registers to pass a parameter, it also dependends on the size of the
> parameter and how many leading parameters there is.
> So, as you said, "8" or any numbers are not always accurate.
> 
> Because, the enhancement in this patch is just make "block move" to be
> more friendly for follow optiomizations(cse/dse/dce...) by moving
> through sub-mode.  So, I just selected one arbitrary number which may
> not too large and also tolerable.
> I also through to query the max number registers from targets for
> param/ret passing, but it may not very accurate neither.
> 
> Any sugguestions are welcome! and thanks!
OK, so it's just a magic number and (in theory) any number should still 
generate correct code -- the number merely places limits on when we'll 
consider performing this optimization.

It may be overkill, but you might consider making it a PARAM that can be 
adjusted.


> 
> For this patch, only simple stuffs are handled like "D.xxx = param_1",
> where the source and dest of the assignment are all in memory which is
> the DECL_RTL(of D.xx/param_xx) in MEM_P/BLK.
> And to avoid complicate, this patch only handle the case where
> "(size % mode_size) == 0".
> 
> If any misunderstandings, please point out, thanks.
> And thanks for comments!
How values are justified varies on the PA depending on whether the 
parameter is passed in registers or in memory.   Though thinking more 
about things, I don't think you're changing how the parameter is passed. 
  Just how it's subsequently pulled out of memory.

jeff
  
Jiufu Guo Dec. 6, 2022, 2:36 a.m. UTC | #10
Hi,

Jeff Law <jeffreyalaw@gmail.com> writes:

> On 11/28/22 20:53, Jiufu Guo wrote:
>
>>>
>>> Right, but the number of registers is target dependent, so I don't see
>>> how using "8" or any number of that matter is correct here.
>> I understand.  And even for the same struct type, using how many
>> registers to pass a parameter, it also dependends on the size of the
>> parameter and how many leading parameters there is.
>> So, as you said, "8" or any numbers are not always accurate.
>>
>> Because, the enhancement in this patch is just make "block move" to be
>> more friendly for follow optiomizations(cse/dse/dce...) by moving
>> through sub-mode.  So, I just selected one arbitrary number which may
>> not too large and also tolerable.
>> I also through to query the max number registers from targets for
>> param/ret passing, but it may not very accurate neither.
>>
>> Any sugguestions are welcome! and thanks!
> OK, so it's just a magic number and (in theory) any number should
> still generate correct code -- the number merely places limits on when
> we'll consider performing this optimization.
Right.
>
> It may be overkill, but you might consider making it a PARAM that can
> be adjusted.

Yes. I also feel the magic number is not perfect. I'm thinking of a way
to mark if a param is stored to stack from the register when setup, and
then reference the marker during this optimization. This would be more
accurate and able to avoid the magic number.

>
>
>>
>> For this patch, only simple stuffs are handled like "D.xxx = param_1",
>> where the source and dest of the assignment are all in memory which is
>> the DECL_RTL(of D.xx/param_xx) in MEM_P/BLK.
>> And to avoid complicate, this patch only handle the case where
>> "(size % mode_size) == 0".
>>
>> If any misunderstandings, please point out, thanks.
>> And thanks for comments!
> How values are justified varies on the PA depending on whether the
> parameter is passed in registers or in memory.  Though thinking more
> about things, I don't think you're changing how the parameter is
> passed. Just how it's subsequently pulled out of memory.
Right!

Thanks a lot for your comments!

BR,
Jeff (Jiufu)
>
> jeff
  

Patch

diff --git a/gcc/expr.cc b/gcc/expr.cc
index d9407432ea5..420f9cf3662 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -6045,6 +6045,46 @@  expand_assignment (tree to, tree from, bool nontemporal)
       return;
     }
 
+  if (TREE_CODE (from) == PARM_DECL && DECL_INCOMING_RTL (from)
+      && TYPE_MODE (TREE_TYPE (from)) == BLKmode
+      && (GET_CODE (DECL_INCOMING_RTL (from)) == PARALLEL
+	  || REG_P (DECL_INCOMING_RTL (from))))
+    {
+      rtx parm = DECL_INCOMING_RTL (from);
+
+      push_temp_slots ();
+      machine_mode mode;
+      mode = GET_CODE (parm) == PARALLEL
+	       ? GET_MODE (XEXP (XVECEXP (parm, 0, 0), 0))
+	       : word_mode;
+      int mode_size = GET_MODE_SIZE (mode).to_constant ();
+      int size = INTVAL (expr_size (from));
+
+      /* If/How the parameter using submode, it dependes on the size and
+	 position of the parameter.  Here using heurisitic number.  */
+      int hurstc_num = 8;
+      if (size < mode_size || (size % mode_size) != 0
+	  || size > (mode_size * hurstc_num))
+	result = store_expr (from, to_rtx, 0, nontemporal, false);
+      else
+	{
+	  rtx from_rtx
+	    = expand_expr (from, NULL_RTX, GET_MODE (to_rtx), EXPAND_NORMAL);
+	  for (int i = 0; i < size / mode_size; i++)
+	    {
+	      rtx temp = gen_reg_rtx (mode);
+	      rtx src = adjust_address (from_rtx, mode, mode_size * i);
+	      rtx dest = adjust_address (to_rtx, mode, mode_size * i);
+	      emit_move_insn (temp, src);
+	      emit_move_insn (dest, temp);
+	    }
+	  result = to_rtx;
+	}
+      preserve_temp_slots (result);
+      pop_temp_slots ();
+      return;
+    }
+
   /* Compute FROM and store the value in the rtx we got.  */
 
   push_temp_slots ();