[V3] Use reg mode to move sub blocks for parameters and returns

Message ID 20221207120008.126895-1-guojiufu@linux.ibm.com
State New
Headers
Series [V3] Use reg mode to move sub blocks for parameters and returns |

Commit Message

Jiufu Guo Dec. 7, 2022, noon UTC
  Hi,

When assigning a parameter to a variable, or assigning a variable to
return value with struct type, "block move" are used to expand
the assignment. It would be better to use the register mode according
to the target/ABI to move the blocks if the parameter/return is passed
through registers. And then this would raise more opportunities for
other optimization passes(cse/dse/xprop).

As the example code (like code in PR65421):

typedef struct SA {double a[3];} A;
A ret_arg_pt (A *a) {return *a;} // on ppc64le, expect only 3 lfd(s)
A ret_arg (A a) {return a;} // just empty fun body
void st_arg (A a, A *p) {*p = a;} //only 3 stfd(s)

This patches check the "from" and "to" of an assignment in
"expand_assignment", if it is about param/ret which may passing via
register, then use the register mode to move sub-blocks for the
assignning.

This patches are based on the discussions for previous version:
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606498.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607480.html

Comparing with previous version, this patch uses a flag to mark if
a parameter is passed via registers and stored to stack. And when
expanding an assignment, this flag is checked.  This could avoid
inaccurate guess during expanding the assignment.  In this version,
testcase pr65421-1.c is updated to remove unnecessary checks.

Bootstrap and regtest pass on ppc64{,le} and x86_64.
Is this ok for trunk?

BR,
Jeff (Jiufu)


	PR target/65421

gcc/ChangeLog:

	* cfgexpand.cc (expand_used_vars): Update to mark DECL_USEDBY_RETURN_P
	for returns.
	* expr.cc (move_sub_blocks): New function.
	(expand_assignment): Update to call move_sub_blocks for returns or
	parameters.
	* function.cc (assign_parm_setup_block): Update to mark
	DECL_STACK_REGS_P for parameter.
	* tree-core.h (struct tree_decl_common): Add comment.
	* tree.h (DECL_USEDBY_RETURN_P): New define.
	(DECL_STACK_REGS_P): New define.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/pr65421-1.c: New test.
	* gcc.target/powerpc/pr65421.c: New test.

---
 gcc/cfgexpand.cc                             | 14 ++++
 gcc/expr.cc                                  | 81 ++++++++++++++++++++
 gcc/function.cc                              |  3 +
 gcc/tree-core.h                              |  4 +-
 gcc/tree.h                                   |  9 +++
 gcc/testsuite/gcc.target/powerpc/pr65421-1.c | 15 ++++
 gcc/testsuite/gcc.target/powerpc/pr65421.c   | 24 ++++++
 7 files changed, 149 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421.c
  

Comments

Segher Boessenkool Dec. 7, 2022, 3:23 p.m. UTC | #1
Hi!

On Wed, Dec 07, 2022 at 08:00:08PM +0800, Jiufu Guo wrote:
> When assigning a parameter to a variable, or assigning a variable to
> return value with struct type, "block move" are used to expand
> the assignment. It would be better to use the register mode according
> to the target/ABI to move the blocks if the parameter/return is passed
> through registers. And then this would raise more opportunities for
> other optimization passes(cse/dse/xprop).
> 
> As the example code (like code in PR65421):
> 
> typedef struct SA {double a[3];} A;
> A ret_arg_pt (A *a) {return *a;} // on ppc64le, expect only 3 lfd(s)
> A ret_arg (A a) {return a;} // just empty fun body
> void st_arg (A a, A *p) {*p = a;} //only 3 stfd(s)

What is this like if you use [5] instead?  Or use an ABI without
homogeneous aggregates?

> +static void
> +move_sub_blocks (rtx to_rtx, tree from, machine_mode sub_mode, bool nontemporal)
> +{
> +  HOST_WIDE_INT size, sub_size;
> +  int len;
> +
> +  gcc_assert (MEM_P (to_rtx));
> +
> +  size = MEM_SIZE (to_rtx).to_constant ();
> +  sub_size = GET_MODE_SIZE (sub_mode).to_constant ();
> +  len = size / sub_size;

Unrelated, but a pet peeve: it is much more modern (and imo much better
taste) to not put all declarations at the start; just declare at first
use:

  gcc_assert (MEM_P (to_rtx));

  HOST_WIDE_INT size = MEM_SIZE (to_rtx).to_constant ();
  HOST_WIDE_INT sub_size = GET_MODE_SIZE (sub_mode).to_constant ();
  int len = size / sub_size;

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
> @@ -0,0 +1,15 @@
> +/* PR target/65421 */
> +/* { dg-options "-O2" } */
> +/* { dg-require-effective-target has_arch_ppc64 } */
> +
> +typedef struct SA
> +{
> +  double a[2];
> +  long l;
> +} A;
> +
> +/* std 3 param regs to return slot */
> +A ret_arg (A a) {return a;}
> +/* { dg-final { scan-assembler-times {\mstd 4,0\(3\)\s} 1 } } */
> +/* { dg-final { scan-assembler-times {\mstd 5,8\(3\)\s} 1 } } *
> +/* { dg-final { scan-assembler-times {\mstd 6,16\(3\)\s} 1 } } */

This is only correct on certain ABIs, probably only ELFv2 even.


We certainly can improve the homogeneous aggregates stuff, but please
make sure you don't degrade all other stuff?  Older, as well as when
things are not an homogeneous aggregate, for example too big.  Can you
please add tests for such cases?


Segher
  
Jiufu Guo Dec. 8, 2022, 1:17 p.m. UTC | #2
Hi Segher,

Thanks a lot for your review!

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi!
>
> On Wed, Dec 07, 2022 at 08:00:08PM +0800, Jiufu Guo wrote:
>> When assigning a parameter to a variable, or assigning a variable to
>> return value with struct type, "block move" are used to expand
>> the assignment. It would be better to use the register mode according
>> to the target/ABI to move the blocks if the parameter/return is passed
>> through registers. And then this would raise more opportunities for
>> other optimization passes(cse/dse/xprop).
>> 
>> As the example code (like code in PR65421):
>> 
>> typedef struct SA {double a[3];} A;
>> A ret_arg_pt (A *a) {return *a;} // on ppc64le, expect only 3 lfd(s)
>> A ret_arg (A a) {return a;} // just empty fun body
>> void st_arg (A a, A *p) {*p = a;} //only 3 stfd(s)
>
> What is this like if you use [5] instead?  Or use an ABI without
> homogeneous aggregates?
Thanks for this question!  I also tested the cases on different array
types or different sizes, or mixed field types.

If it is out of the number of registers for passing the param
or return, it is treated as a mem block.
For parameter, it is partially passed via registers, and partially
passing via stack.
For return, it is returned via a pointer (with one invisible pointer
parameter). And the <retval> of the function is not with parallel code.

This patch does not cover these cases.

>
>> +static void
>> +move_sub_blocks (rtx to_rtx, tree from, machine_mode sub_mode, bool nontemporal)
>> +{
>> +  HOST_WIDE_INT size, sub_size;
>> +  int len;
>> +
>> +  gcc_assert (MEM_P (to_rtx));
>> +
>> +  size = MEM_SIZE (to_rtx).to_constant ();
>> +  sub_size = GET_MODE_SIZE (sub_mode).to_constant ();
>> +  len = size / sub_size;
>
> Unrelated, but a pet peeve: it is much more modern (and imo much better
> taste) to not put all declarations at the start; just declare at first
> use:
>
>   gcc_assert (MEM_P (to_rtx));
>
>   HOST_WIDE_INT size = MEM_SIZE (to_rtx).to_constant ();
>   HOST_WIDE_INT sub_size = GET_MODE_SIZE (sub_mode).to_constant ();
>   int len = size / sub_size;
>
Thanks for catch this!

>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
>> @@ -0,0 +1,15 @@
>> +/* PR target/65421 */
>> +/* { dg-options "-O2" } */
>> +/* { dg-require-effective-target has_arch_ppc64 } */
>> +
>> +typedef struct SA
>> +{
>> +  double a[2];
>> +  long l;
>> +} A;
>> +
>> +/* std 3 param regs to return slot */
>> +A ret_arg (A a) {return a;}
>> +/* { dg-final { scan-assembler-times {\mstd 4,0\(3\)\s} 1 } } */
>> +/* { dg-final { scan-assembler-times {\mstd 5,8\(3\)\s} 1 } } *
>> +/* { dg-final { scan-assembler-times {\mstd 6,16\(3\)\s} 1 } } */
>
> This is only correct on certain ABIs, probably only ELFv2 even.
Thanks for point out this!
This is only correct if the ABI allows this struct to be passed
through integer registers, and return through the mem block.

In the previous version, I added a requirement on ELFv2. As tested on
BE environments, this case also pass. So, I deleted the requirement.
(While on BE environments, there is another issue: some unnecessary
memory stores are not deleted.)

But with more reading of the code 'rs6000_function_arg', as you said,
I'm not sure if this behavior meets other ABIs (at least, it seems,
this is not correct on darwin64).
So, as you said, we may add a requirement on ELFv2; Or leave this
case there, and add "! target" when hitting failure?

>
>
> We certainly can improve the homogeneous aggregates stuff, but please
> make sure you don't degrade all other stuff?  Older, as well as when
> things are not an homogeneous aggregate, for example too big.  Can you
> please add tests for such cases?
Sure, thanks!  I encounter one issue in this kind of case (large struct)
on a previous version path.

Thanks again for your comments and suggestions!

BR,
Jeff (Jiufu)

>
>
> Segher
  
Segher Boessenkool Dec. 8, 2022, 2:58 p.m. UTC | #3
On Thu, Dec 08, 2022 at 09:17:38PM +0800, Jiufu Guo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Wed, Dec 07, 2022 at 08:00:08PM +0800, Jiufu Guo wrote:
> >> typedef struct SA {double a[3];} A;
> >> A ret_arg_pt (A *a) {return *a;} // on ppc64le, expect only 3 lfd(s)
> >> A ret_arg (A a) {return a;} // just empty fun body
> >> void st_arg (A a, A *p) {*p = a;} //only 3 stfd(s)
> >
> > What is this like if you use [5] instead?  Or use an ABI without
> > homogeneous aggregates?
> Thanks for this question!  I also tested the cases on different array
> types or different sizes, or mixed field types.
> 
> If it is out of the number of registers for passing the param
> or return, it is treated as a mem block.
> For parameter, it is partially passed via registers, and partially
> passing via stack.
> For return, it is returned via a pointer (with one invisible pointer
> parameter). And the <retval> of the function is not with parallel code.
> 
> This patch does not cover these cases.

Understood, sure; but my point is, can it degrade code quality in such
cases?  I don't see anything in the patch that precludes that.

> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
> >> @@ -0,0 +1,15 @@
> >> +/* PR target/65421 */
> >> +/* { dg-options "-O2" } */
> >> +/* { dg-require-effective-target has_arch_ppc64 } */
> >> +
> >> +typedef struct SA
> >> +{
> >> +  double a[2];
> >> +  long l;
> >> +} A;
> >> +
> >> +/* std 3 param regs to return slot */
> >> +A ret_arg (A a) {return a;}
> >> +/* { dg-final { scan-assembler-times {\mstd 4,0\(3\)\s} 1 } } */
> >> +/* { dg-final { scan-assembler-times {\mstd 5,8\(3\)\s} 1 } } *
> >> +/* { dg-final { scan-assembler-times {\mstd 6,16\(3\)\s} 1 } } */
> >
> > This is only correct on certain ABIs, probably only ELFv2 even.
> Thanks for point out this!
> This is only correct if the ABI allows this struct to be passed
> through integer registers, and return through the mem block.

And it needs to be in those specific registers / at those specific
offsets as well.

Btw, please leave out the \s?

> In the previous version, I added a requirement on ELFv2. As tested on
> BE environments, this case also pass. So, I deleted the requirement.

BE for ELFv2 also exists, fwiw.

> (While on BE environments, there is another issue: some unnecessary
> memory stores are not deleted.)

Huh.  Does that happen with the current compiler as well?  Do you have
an example?

> But with more reading of the code 'rs6000_function_arg', as you said,
> I'm not sure if this behavior meets other ABIs (at least, it seems,
> this is not correct on darwin64).
> So, as you said, we may add a requirement on ELFv2; Or leave this
> case there, and add "! target" when hitting failure?

If you do !target the testcase won't test much at all anymore ;-)

> > We certainly can improve the homogeneous aggregates stuff, but please
> > make sure you don't degrade all other stuff?  Older, as well as when
> > things are not an homogeneous aggregate, for example too big.  Can you
> > please add tests for such cases?
> Sure, thanks!  I encounter one issue in this kind of case (large struct)
> on a previous version path.

Perhaps it would be better to have a hook so that every target (and
subtarget) can fine tune exactly when this is done.  Then again, perhaps
I worry too much.


Segher
  
Jiufu Guo Dec. 9, 2022, 3:46 a.m. UTC | #4
Hi Segher,

Thanks a lot for your helpful comments!

Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Thu, Dec 08, 2022 at 09:17:38PM +0800, Jiufu Guo wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > On Wed, Dec 07, 2022 at 08:00:08PM +0800, Jiufu Guo wrote:
>> >> typedef struct SA {double a[3];} A;
>> >> A ret_arg_pt (A *a) {return *a;} // on ppc64le, expect only 3 lfd(s)
>> >> A ret_arg (A a) {return a;} // just empty fun body
>> >> void st_arg (A a, A *p) {*p = a;} //only 3 stfd(s)
>> >
>> > What is this like if you use [5] instead?  Or use an ABI without
>> > homogeneous aggregates?
>> Thanks for this question!  I also tested the cases on different array
>> types or different sizes, or mixed field types.
>> 
>> If it is out of the number of registers for passing the param
>> or return, it is treated as a mem block.
>> For parameter, it is partially passed via registers, and partially
>> passing via stack.
>> For return, it is returned via a pointer (with one invisible pointer
>> parameter). And the <retval> of the function is not with parallel code.
>> 
>> This patch does not cover these cases.
>
> Understood, sure; but my point is, can it degrade code quality in such
> cases?  I don't see anything in the patch that precludes that.

No, the behavior of such cases is not affected in this patch.
The preclude code is in "assign_parm_setup_block". This patch only shows
the different parts, the context is not shown.

In assign_parm_setup_block, this patch marks "DECL_STACK_REGS_P" only
for "REG_P (entry_parm) || GET_CODE (entry_parm) == PARALLEL" which
indicates the registers are enough to pass the param.

>
>> >> --- /dev/null
>> >> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
>> >> @@ -0,0 +1,15 @@
>> >> +/* PR target/65421 */
>> >> +/* { dg-options "-O2" } */
>> >> +/* { dg-require-effective-target has_arch_ppc64 } */
>> >> +
>> >> +typedef struct SA
>> >> +{
>> >> +  double a[2];
>> >> +  long l;
>> >> +} A;
>> >> +
>> >> +/* std 3 param regs to return slot */
>> >> +A ret_arg (A a) {return a;}
>> >> +/* { dg-final { scan-assembler-times {\mstd 4,0\(3\)\s} 1 } } */
>> >> +/* { dg-final { scan-assembler-times {\mstd 5,8\(3\)\s} 1 } } *
>> >> +/* { dg-final { scan-assembler-times {\mstd 6,16\(3\)\s} 1 } } */
>> >
>> > This is only correct on certain ABIs, probably only ELFv2 even.
>> Thanks for point out this!
>> This is only correct if the ABI allows this struct to be passed
>> through integer registers, and return through the mem block.
>
> And it needs to be in those specific registers / at those specific
> offsets as well.
Yes.
>
> Btw, please leave out the \s?
Thanks! 
>
>> In the previous version, I added a requirement on ELFv2. As tested on
>> BE environments, this case also pass. So, I deleted the requirement.
>
> BE for ELFv2 also exists, fwiw.
Yeap! We have -mabi=elfv2.
>
>> (While on BE environments, there is another issue: some unnecessary
>> memory stores are not deleted.)
>
> Huh.  Does that happen with the current compiler as well?  Do you have
> an example?

We can use the test case (pr65421-1.c) as the example -:)

typedef struct SA {double a[2]; long l; } A;
A ret_arg (A a) {return a;}

For this case, without the patch, below is generated:
        std 4,56(1)
        std 5,64(1)
        li 10,56
        std 6,72(1)
        std 6,16(3)
        lxvd2x 0,1,10
        stxvd2x 0,0,3
With the patch, below is generated:
        std 4,56(1)
        std 5,64(1)
        std 6,72(1)
        std 4,0(3)
        std 5,8(3)
        std 6,16(3)
The first 3 std insns are reductant.  This is an unrelated issue.
With -mabi=elfv2, code can be optimized, and those 3 insns are deleted.

I think it would be fine to just test this case on powerpc_elfv2.
I would merge pr65421-1.c into pr65421.c (with dg-require elfv2).

>
>> But with more reading of the code 'rs6000_function_arg', as you said,
>> I'm not sure if this behavior meets other ABIs (at least, it seems,
>> this is not correct on darwin64).
>> So, as you said, we may add a requirement on ELFv2; Or leave this
>> case there, and add "! target" when hitting failure?
>
> If you do !target the testcase won't test much at all anymore ;-)

Right. we could use this method to exclude the sub-targets which are not
using r4,5,6 for the param for this case.

>
>> > We certainly can improve the homogeneous aggregates stuff, but please
>> > make sure you don't degrade all other stuff?  Older, as well as when
>> > things are not an homogeneous aggregate, for example too big.  Can you
>> > please add tests for such cases?
>> Sure, thanks!  I encounter one issue in this kind of case (large struct)
>> on a previous version path.
>
> Perhaps it would be better to have a hook so that every target (and
> subtarget) can fine tune exactly when this is done.  Then again, perhaps
> I worry too much.

Understand, I also thought about using a hook for targets to tune!
The good news is that: a few target hooks are used by generic code for
arg&ret, and the target-related info (e.g. if a struct param is passed
via registers) are available (indirectly).
And I also feel that I may need to add a hook when the required target
info is not accessible(or not suitable) for generic code.

Thanks again for your review!

BR,
Jeff (Jiufu)
>
>
> Segher
  

Patch

diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index dd29ffffc03..09b8ec64cea 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -2158,6 +2158,20 @@  expand_used_vars (bitmap forced_stack_vars)
     frame_phase = off ? align - off : 0;
   }
 
+  /* Collect VARs on returns.  */
+  if (DECL_RESULT (current_function_decl))
+    {
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
+	if (greturn *ret = safe_dyn_cast<greturn *> (last_stmt (e->src)))
+	  {
+	    tree val = gimple_return_retval (ret);
+	    if (val && VAR_P (val))
+	      DECL_USEDBY_RETURN_P (val) = 1;
+	  }
+    }
+
   /* Set TREE_USED on all variables in the local_decls.  */
   FOR_EACH_LOCAL_DECL (cfun, i, var)
     TREE_USED (var) = 1;
diff --git a/gcc/expr.cc b/gcc/expr.cc
index d9407432ea5..d61669d5662 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -5559,6 +5559,55 @@  mem_ref_refers_to_non_mem_p (tree ref)
   return non_mem_decl_p (base);
 }
 
+/* Sub routine of expand_assignment, invoked when assigning from a
+   parameter or assigning to a return val on struct type which may
+   be passed through registers.  The mode of register is used to
+   move the content for the assignment.
+
+   This routine generates code for expression FROM which is BLKmode,
+   and move the generated content to TO_RTX by su-blocks in SUB_MODE.  */
+
+static void
+move_sub_blocks (rtx to_rtx, tree from, machine_mode sub_mode, bool nontemporal)
+{
+  HOST_WIDE_INT size, sub_size;
+  int len;
+
+  gcc_assert (MEM_P (to_rtx));
+
+  size = MEM_SIZE (to_rtx).to_constant ();
+  sub_size = GET_MODE_SIZE (sub_mode).to_constant ();
+  len = size / sub_size;
+
+  /* It would be not profitable to move through sub-modes, if the size does
+     not meet register mode.  */
+  if ((size % sub_size) != 0)
+    {
+      push_temp_slots ();
+      rtx result = store_expr (from, to_rtx, 0, nontemporal, false);
+      preserve_temp_slots (result);
+      pop_temp_slots ();
+      return;
+    }
+
+  push_temp_slots ();
+
+  rtx from_rtx;
+  from_rtx = expand_expr (from, NULL_RTX, GET_MODE (to_rtx), EXPAND_NORMAL);
+  for (int i = 0; i < len; i++)
+    {
+      rtx temp = gen_reg_rtx (sub_mode);
+      rtx src = adjust_address (from_rtx, sub_mode, sub_size * i);
+      rtx dest = adjust_address (to_rtx, sub_mode, sub_size * i);
+      emit_move_insn (temp, src);
+      emit_move_insn (dest, temp);
+    }
+
+  preserve_temp_slots (to_rtx);
+  pop_temp_slots ();
+  return;
+}
+
 /* Expand an assignment that stores the value of FROM into TO.  If NONTEMPORAL
    is true, try generating a nontemporal store.  */
 
@@ -6045,6 +6094,38 @@  expand_assignment (tree to, tree from, bool nontemporal)
       return;
     }
 
+  /* If it is assigning from a struct param which may be passed via registers,
+     It would be better to use the register's mode to move sub-blocks for the
+     assignment.  */
+  if (TREE_CODE (from) == PARM_DECL && mode == BLKmode
+      && DECL_STACK_REGS_P (from))
+    {
+      rtx parm = DECL_INCOMING_RTL (from);
+      gcc_assert (REG_P (parm) || GET_CODE (parm) == PARALLEL);
+
+      machine_mode sub_mode;
+      if (REG_P (parm))
+	sub_mode = word_mode;
+      else
+	sub_mode = GET_MODE (XEXP (XVECEXP (parm, 0, 0), 0));
+
+      move_sub_blocks (to_rtx, from, sub_mode, nontemporal);
+      return;
+    }
+
+  /* If it is assigning to a struct var which will be returned, and the
+     function is returning via registers, it would be better to use the
+     register's mode to move sub-blocks for the assignment.  */
+  if (VAR_P (to) && DECL_USEDBY_RETURN_P (to) && mode == BLKmode
+      && TREE_CODE (from) != CONSTRUCTOR
+      && GET_CODE (DECL_RTL (DECL_RESULT (current_function_decl))) == PARALLEL)
+    {
+      rtx ret = DECL_RTL (DECL_RESULT (current_function_decl));
+      machine_mode sub_mode = GET_MODE (XEXP (XVECEXP (ret, 0, 0), 0));
+      move_sub_blocks (to_rtx, from, sub_mode, nontemporal);
+      return;
+    }
+
   /* Compute FROM and store the value in the rtx we got.  */
 
   push_temp_slots ();
diff --git a/gcc/function.cc b/gcc/function.cc
index dc333c27e92..a118f98a203 100644
--- a/gcc/function.cc
+++ b/gcc/function.cc
@@ -2991,6 +2991,9 @@  assign_parm_setup_block (struct assign_parm_data_all *all,
 
       mem = validize_mem (copy_rtx (stack_parm));
 
+      if (MEM_P (mem))
+	DECL_STACK_REGS_P (parm) = 1;
+
       /* Handle values in multiple non-contiguous locations.  */
       if (GET_CODE (entry_parm) == PARALLEL && !MEM_P (mem))
 	emit_group_store (mem, entry_parm, data->arg.type, size);
diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index e146b133dbd..1a2d6c4ca62 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -1808,7 +1808,9 @@  struct GTY(()) tree_decl_common {
      In VAR_DECL, PARM_DECL and RESULT_DECL, this is
      DECL_HAS_VALUE_EXPR_P.  */
   unsigned decl_flag_2 : 1;
-  /* In FIELD_DECL, this is DECL_PADDING_P.  */
+  /* In FIELD_DECL, this is DECL_PADDING_P
+     In VAR_DECL, this is DECL_USEDBY_RETURN_P
+     In PARM_DECL, this is DECL_STACK_REGS_P.  */
   unsigned decl_flag_3 : 1;
   /* Logically, these two would go in a theoretical base shared by var and
      parm decl. */
diff --git a/gcc/tree.h b/gcc/tree.h
index 4a19de1c94d..f9c2e8491a0 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -3007,6 +3007,15 @@  extern void decl_value_expr_insert (tree, tree);
 #define DECL_PADDING_P(NODE) \
   (FIELD_DECL_CHECK (NODE)->decl_common.decl_flag_3)
 
+/* Used in a VAR_DECL to indicate that it is used by a return stmt.  */
+#define DECL_USEDBY_RETURN_P(NODE) \
+  (VAR_DECL_CHECK (NODE)->decl_common.decl_flag_3)
+
+/* Used in a PARM_DECL to indicate that it is struct parameter passed
+   by registers totally and stored to stack during setup.  */
+#define DECL_STACK_REGS_P(NODE) \
+  (PARM_DECL_CHECK (NODE)->decl_common.decl_flag_3)
+
 /* Used in a FIELD_DECL to indicate whether this field is not a flexible
    array member. This is only valid for the last array type field of a
    structure.  */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421-1.c b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
new file mode 100644
index 00000000000..ceae0119c19
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr65421-1.c
@@ -0,0 +1,15 @@ 
+/* PR target/65421 */
+/* { dg-options "-O2" } */
+/* { dg-require-effective-target has_arch_ppc64 } */
+
+typedef struct SA
+{
+  double a[2];
+  long l;
+} A;
+
+/* std 3 param regs to return slot */
+A ret_arg (A a) {return a;}
+/* { dg-final { scan-assembler-times {\mstd 4,0\(3\)\s} 1 } } */
+/* { dg-final { scan-assembler-times {\mstd 5,8\(3\)\s} 1 } } *
+/* { dg-final { scan-assembler-times {\mstd 6,16\(3\)\s} 1 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421.c b/gcc/testsuite/gcc.target/powerpc/pr65421.c
new file mode 100644
index 00000000000..db7c465b5a4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr65421.c
@@ -0,0 +1,24 @@ 
+/* PR target/65421 */
+/* { dg-options "-O2" } */
+/* { dg-require-effective-target powerpc_elfv2 } */
+/* { dg-require-effective-target has_arch_ppc64 } */
+
+typedef struct SA
+{
+  double a[3];
+} A;
+
+/* 3 lfd */
+A ret_arg_pt (A *a){return *a;}
+
+/* blr */
+A ret_arg (A a) {return a;}
+
+/* 3 stfd */
+void st_arg (A a, A *p) {*p = a;}
+
+/* { dg-final { scan-assembler-times {\mlfd\M} 3 } } */
+/* { dg-final { scan-assembler-times {\mstfd\M} 3 } } */
+
+/* count insns */
+/* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9 } } */