[1/3] middle-end: Add the ability to let the target decide the method of argument promotions.

Message ID patch-15721-tamar@arm.com
State Dropped
Headers
Series [1/3] middle-end: Add the ability to let the target decide the method of argument promotions. |

Commit Message

Tamar Christina May 13, 2022, 5:11 p.m. UTC
  Hi All,

Some targets require function parameters to be promoted to a different
type on expand time because the target may not have native instructions
to work on such types.  As an example the AArch64 port does not have native
instructions working on integer 8- or 16-bit values.  As such it promotes
every parameter of these types to 32-bits.

This promotion could be done by a target for two reasons:

1. For correctness.  This may be an APCS requirement for instance.
2. For efficiency.  By promoting the argument at expansion time we don't have
   to keep promoting the type back and forth after each operation on it.
   i.e. the promotion simplies the RTL.

This patch adds the ability for a target to decide whether during the expansion
to use an extend to handle promotion or to use a paradoxical subreg.

A pradoxical subreg can be used when there's no correctness issues and when you
still want the RTL efficiency of not doing the promotion constantly.

This also allows the target to not need to generate any code when the top bits
are not significant.

An example is in AArch64 the following extend is unneeded:

uint8_t fd2 (uint8_t xr){
    return xr + 1;
}

currently generates:

fd2:
        and     w0, w0, 255
        add     w0, w0, 1
        ret

instead of

fd2:
        add     w0, w0, #1
        ret

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Bootstrapped on x86_64-pc-linux-gnu and no issues

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* cfgexpand.cc (set_rtl): Check for function promotion.
	* tree-outof-ssa.cc (insert_value_copy_on_edge): Likewise.
	* function.cc (assign_parm_setup_reg): Likewise.
	* hooks.cc (hook_bool_mode_mode_int_tree_false,
	hook_bool_mode_mode_int_tree_true): New.
	* hooks.h (hook_bool_mode_mode_int_tree_false,
	hook_bool_mode_mode_int_tree_true): New.
	* target.def (promote_function_args_subreg_p): New.
	* doc/tm.texi: Document it.
	* doc/tm.texi.in: Likewise.

--- inline copy of patch -- 
diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index d3cc77d2ca98f620b29623fc5696410bad9bc184..df95184cfa185312c2a46cb92daa051718d9f4f3 100644




--
diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index d3cc77d2ca98f620b29623fc5696410bad9bc184..df95184cfa185312c2a46cb92daa051718d9f4f3 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -206,14 +206,20 @@ set_rtl (tree t, rtx x)
      have to compute it ourselves.  For RESULT_DECLs, we accept mode
      mismatches too, as long as we have BLKmode or are not coalescing
      across variables, so that we don't reject BLKmode PARALLELs or
-     unpromoted REGs.  */
+     unpromoted REGs.  For any promoted types that result in a
+     paradoxical subreg also accept the argument.  */
   gcc_checking_assert (!x || x == pc_rtx || TREE_CODE (t) != SSA_NAME
 		       || (SSAVAR (t)
 			   && TREE_CODE (SSAVAR (t)) == RESULT_DECL
 			   && (promote_ssa_mode (t, NULL) == BLKmode
 			       || !flag_tree_coalesce_vars))
 		       || !use_register_for_decl (t)
-		       || GET_MODE (x) == promote_ssa_mode (t, NULL));
+		       || GET_MODE (x) == promote_ssa_mode (t, NULL)
+		       || targetm.calls.promote_function_args_subreg_p (
+					  GET_MODE (x),
+					  promote_ssa_mode (t, NULL),
+					  TYPE_UNSIGNED (TREE_TYPE (t)),
+					  SSAVAR (t)));
 
   if (x)
     {
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 2f92d37da8c0091e9879a493cfe8a361eb1d9299..6314cd83a2488dc225d4a1a15599e8e51e639f7f 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -3906,6 +3906,15 @@ cases of mismatch, it also makes for better code on certain machines.
 The default is to not promote prototypes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_PROMOTE_FUNCTION_ARGS_SUBREG_P (machine_mode @var{mode}, machine_mode @var{promoted_mode}, int @var{unsignedp}, tree @var{v})
+When a function argument is promoted with @code{PROMOTE_MODE} then this
+hook is used to determine whether the bits of the promoted type are all
+significant in the expression pointed to by V.  If they are an extend is
+generated, if they are not a paradoxical subreg is created for the argument
+from @code{mode} to @code{promoted_mode}.
+The default is to promote using an extend.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_PUSH_ARGUMENT (unsigned int @var{npush})
 This target hook returns @code{true} if push instructions will be
 used to pass outgoing arguments.  When the push instruction usage is
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index f869ddd5e5b8b7acbd8e9765fb103af24a1085b6..35f955803ec0a5a93be18a028fa1043f90858982 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -3103,6 +3103,8 @@ control passing certain arguments in registers.
 
 @hook TARGET_PROMOTE_PROTOTYPES
 
+@hook TARGET_PROMOTE_FUNCTION_ARGS_SUBREG_P
+
 @hook TARGET_PUSH_ARGUMENT
 
 @defmac PUSH_ARGS_REVERSED
diff --git a/gcc/function.cc b/gcc/function.cc
index d5ed51a6a663a1ef472f5b1c090543f359c18f42..92f469bfd5d1ebfb09cc94d9be854715cd2f90f8 100644
--- a/gcc/function.cc
+++ b/gcc/function.cc
@@ -3161,7 +3161,7 @@ assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
   machine_mode promoted_nominal_mode;
   int unsignedp = TYPE_UNSIGNED (TREE_TYPE (parm));
   bool did_conversion = false;
-  bool need_conversion, moved;
+  bool need_conversion, moved, use_subregs;
   enum insn_code icode;
   rtx rtl;
 
@@ -3172,7 +3172,20 @@ assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
     = promote_function_mode (data->nominal_type, data->nominal_mode, &unsignedp,
 			     TREE_TYPE (current_function_decl), 2);
 
-  parmreg = gen_reg_rtx (promoted_nominal_mode);
+  /* Check to see how the target wants the promotion of function arguments to
+     be handled.  */
+  use_subregs
+    = targetm.calls.promote_function_args_subreg_p (data->nominal_mode,
+						    promoted_nominal_mode,
+						    unsignedp, parm);
+
+  /* If we're promoting using a paradoxical subreg then we need to keep using
+     the unpromoted type because that's the only fully defined value.  */
+  if (use_subregs)
+    parmreg = gen_reg_rtx (data->nominal_mode);
+  else
+    parmreg = gen_reg_rtx (promoted_nominal_mode);
+
   if (!DECL_ARTIFICIAL (parm))
     mark_user_reg (parmreg);
 
@@ -3256,9 +3269,19 @@ assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
 	    }
 	  else
 	    t = op1;
-	  rtx_insn *pat = gen_extend_insn (op0, t, promoted_nominal_mode,
-					   data->passed_mode, unsignedp);
-	  emit_insn (pat);
+
+	  /* Promote the argument itself now if a target wants it.  This
+	     prevents unneeded back and forth convertions in RTL between
+	     the original and promoted type.  */
+	  if (use_subregs)
+	    emit_move_insn (op0, lowpart_subreg (promoted_nominal_mode, t,
+						 data->nominal_mode));
+	  else
+	    {
+	      rtx_insn *pat = gen_extend_insn (op0, t, promoted_nominal_mode,
+					       data->passed_mode, unsignedp);
+	      emit_insn (pat);
+	    }
 	  insns = get_insns ();
 
 	  moved = true;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index 1056e1e9e4dc3e6ce298557351047caa2f84227f..8d68de5cdb9adaea0a79ebf6de599f66b40aa67a 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -31,6 +31,8 @@ extern bool hook_bool_const_int_const_int_true (const int, const int);
 extern bool hook_bool_mode_false (machine_mode);
 extern bool hook_bool_mode_true (machine_mode);
 extern bool hook_bool_mode_mode_true (machine_mode, machine_mode);
+extern bool hook_bool_mode_mode_int_tree_false (machine_mode, machine_mode, int, tree);
+extern bool hook_bool_mode_mode_int_tree_true (machine_mode, machine_mode, int, tree);
 extern bool hook_bool_mode_const_rtx_false (machine_mode, const_rtx);
 extern bool hook_bool_mode_const_rtx_true (machine_mode, const_rtx);
 extern bool hook_bool_mode_rtx_false (machine_mode, rtx);
diff --git a/gcc/hooks.cc b/gcc/hooks.cc
index b29233f4f852fb81ede75a5065d743cd16cc9219..7647774f9e8efbbe13d5607e4a4b2f1c9d22f045 100644
--- a/gcc/hooks.cc
+++ b/gcc/hooks.cc
@@ -89,6 +89,22 @@ hook_bool_mode_mode_true (machine_mode, machine_mode)
   return true;
 }
 
+/* Generic hook that takes (machine_mode, machine_mode, int, tree) and
+   returns false.  */
+bool
+hook_bool_mode_mode_int_tree_false (machine_mode, machine_mode, int, tree)
+{
+  return false;
+}
+
+/* Generic hook that takes (machine_mode, machine_mode, int, tree) and
+   returns true.  */
+bool
+hook_bool_mode_mode_int_tree_true (machine_mode, machine_mode, int, tree)
+{
+  return true;
+}
+
 /* Generic hook that takes (machine_mode, const_rtx) and returns false.  */
 bool
 hook_bool_mode_const_rtx_false (machine_mode, const_rtx)
diff --git a/gcc/target.def b/gcc/target.def
index 72c2e1ef756cf70a1c92abe81f8a6577eaaa2501..bdbacf8c5fd7b0626a37951f6f6ec649f3194977 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4561,6 +4561,17 @@ The default is to not promote prototypes.",
  bool, (const_tree fntype),
  hook_bool_const_tree_false)
 
+DEFHOOK
+(promote_function_args_subreg_p,
+ "When a function argument is promoted with @code{PROMOTE_MODE} then this\n\
+hook is used to determine whether the bits of the promoted type are all\n\
+significant in the expression pointed to by V.  If they are an extend is\n\
+generated, if they are not a paradoxical subreg is created for the argument\n\
+from @code{mode} to @code{promoted_mode}.\n\
+The default is to promote using an extend.",
+ bool, (machine_mode mode, machine_mode promoted_mode, int unsignedp, tree v),
+ hook_bool_mode_mode_int_tree_false)
+
 DEFHOOK
 (struct_value_rtx,
  "This target hook should return the location of the structure value\n\
diff --git a/gcc/tree-outof-ssa.cc b/gcc/tree-outof-ssa.cc
index ec883126ad86d86a2c2dafee4592b8d83e2ed917..0f437023983baa0f23da25221f7bce8fc559a8b8 100644
--- a/gcc/tree-outof-ssa.cc
+++ b/gcc/tree-outof-ssa.cc
@@ -45,6 +45,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-coalesce.h"
 #include "tree-outof-ssa.h"
 #include "dojump.h"
+#include "target.h"
 
 /* FIXME: A lot of code here deals with expanding to RTL.  All that code
    should be in cfgexpand.cc.  */
@@ -333,7 +334,10 @@ insert_value_copy_on_edge (edge e, int dest, tree src, location_t locus)
   dest_mode = GET_MODE (dest_rtx);
   gcc_assert (src_mode == TYPE_MODE (TREE_TYPE (name)));
   gcc_assert (!REG_P (dest_rtx)
-	      || dest_mode == promote_ssa_mode (name, &unsignedp));
+	      || dest_mode == promote_ssa_mode (name, &unsignedp)
+	      || targetm.calls.promote_function_args_subreg_p (
+			dest_mode, promote_ssa_mode (name, NULL), unsignedp,
+			name));
 
   if (src_mode != dest_mode)
     {
  

Comments

Richard Biener May 16, 2022, 6:31 a.m. UTC | #1
On Fri, 13 May 2022, Tamar Christina wrote:

> Hi All,
> 
> Some targets require function parameters to be promoted to a different
> type on expand time because the target may not have native instructions
> to work on such types.  As an example the AArch64 port does not have native
> instructions working on integer 8- or 16-bit values.  As such it promotes
> every parameter of these types to 32-bits.
> 
> This promotion could be done by a target for two reasons:
> 
> 1. For correctness.  This may be an APCS requirement for instance.
> 2. For efficiency.  By promoting the argument at expansion time we don't have
>    to keep promoting the type back and forth after each operation on it.
>    i.e. the promotion simplies the RTL.
> 
> This patch adds the ability for a target to decide whether during the expansion
> to use an extend to handle promotion or to use a paradoxical subreg.
> 
> A pradoxical subreg can be used when there's no correctness issues and when you
> still want the RTL efficiency of not doing the promotion constantly.
> 
> This also allows the target to not need to generate any code when the top bits
> are not significant.
> 
> An example is in AArch64 the following extend is unneeded:
> 
> uint8_t fd2 (uint8_t xr){
>     return xr + 1;
> }
> 
> currently generates:
> 
> fd2:
>         and     w0, w0, 255
>         add     w0, w0, 1
>         ret
> 
> instead of
> 
> fd2:
>         add     w0, w0, #1
>         ret
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Bootstrapped on x86_64-pc-linux-gnu and no issues
> 
> Ok for master?

Why do we need a target hook for this?  Why doesn't the targets
function_arg(?) hook just return (subreg:SI (reg:QI 0)) here instead
when no zero-extension is required and (reg:QI 0) when it is?

That said, an extra hook looks like a bad design to me, the existing
cummulative args way of querying the target ABI shoud be enough,
and if not, should be extended in a less hackish way.

But of course I am not familiar at all with the current state, but
since you specifially CCed me ...

Richard.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* cfgexpand.cc (set_rtl): Check for function promotion.
> 	* tree-outof-ssa.cc (insert_value_copy_on_edge): Likewise.
> 	* function.cc (assign_parm_setup_reg): Likewise.
> 	* hooks.cc (hook_bool_mode_mode_int_tree_false,
> 	hook_bool_mode_mode_int_tree_true): New.
> 	* hooks.h (hook_bool_mode_mode_int_tree_false,
> 	hook_bool_mode_mode_int_tree_true): New.
> 	* target.def (promote_function_args_subreg_p): New.
> 	* doc/tm.texi: Document it.
> 	* doc/tm.texi.in: Likewise.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index d3cc77d2ca98f620b29623fc5696410bad9bc184..df95184cfa185312c2a46cb92daa051718d9f4f3 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -206,14 +206,20 @@ set_rtl (tree t, rtx x)
>       have to compute it ourselves.  For RESULT_DECLs, we accept mode
>       mismatches too, as long as we have BLKmode or are not coalescing
>       across variables, so that we don't reject BLKmode PARALLELs or
> -     unpromoted REGs.  */
> +     unpromoted REGs.  For any promoted types that result in a
> +     paradoxical subreg also accept the argument.  */
>    gcc_checking_assert (!x || x == pc_rtx || TREE_CODE (t) != SSA_NAME
>  		       || (SSAVAR (t)
>  			   && TREE_CODE (SSAVAR (t)) == RESULT_DECL
>  			   && (promote_ssa_mode (t, NULL) == BLKmode
>  			       || !flag_tree_coalesce_vars))
>  		       || !use_register_for_decl (t)
> -		       || GET_MODE (x) == promote_ssa_mode (t, NULL));
> +		       || GET_MODE (x) == promote_ssa_mode (t, NULL)
> +		       || targetm.calls.promote_function_args_subreg_p (
> +					  GET_MODE (x),
> +					  promote_ssa_mode (t, NULL),
> +					  TYPE_UNSIGNED (TREE_TYPE (t)),
> +					  SSAVAR (t)));
>  
>    if (x)
>      {
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 2f92d37da8c0091e9879a493cfe8a361eb1d9299..6314cd83a2488dc225d4a1a15599e8e51e639f7f 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -3906,6 +3906,15 @@ cases of mismatch, it also makes for better code on certain machines.
>  The default is to not promote prototypes.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} bool TARGET_PROMOTE_FUNCTION_ARGS_SUBREG_P (machine_mode @var{mode}, machine_mode @var{promoted_mode}, int @var{unsignedp}, tree @var{v})
> +When a function argument is promoted with @code{PROMOTE_MODE} then this
> +hook is used to determine whether the bits of the promoted type are all
> +significant in the expression pointed to by V.  If they are an extend is
> +generated, if they are not a paradoxical subreg is created for the argument
> +from @code{mode} to @code{promoted_mode}.
> +The default is to promote using an extend.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} bool TARGET_PUSH_ARGUMENT (unsigned int @var{npush})
>  This target hook returns @code{true} if push instructions will be
>  used to pass outgoing arguments.  When the push instruction usage is
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index f869ddd5e5b8b7acbd8e9765fb103af24a1085b6..35f955803ec0a5a93be18a028fa1043f90858982 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -3103,6 +3103,8 @@ control passing certain arguments in registers.
>  
>  @hook TARGET_PROMOTE_PROTOTYPES
>  
> +@hook TARGET_PROMOTE_FUNCTION_ARGS_SUBREG_P
> +
>  @hook TARGET_PUSH_ARGUMENT
>  
>  @defmac PUSH_ARGS_REVERSED
> diff --git a/gcc/function.cc b/gcc/function.cc
> index d5ed51a6a663a1ef472f5b1c090543f359c18f42..92f469bfd5d1ebfb09cc94d9be854715cd2f90f8 100644
> --- a/gcc/function.cc
> +++ b/gcc/function.cc
> @@ -3161,7 +3161,7 @@ assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
>    machine_mode promoted_nominal_mode;
>    int unsignedp = TYPE_UNSIGNED (TREE_TYPE (parm));
>    bool did_conversion = false;
> -  bool need_conversion, moved;
> +  bool need_conversion, moved, use_subregs;
>    enum insn_code icode;
>    rtx rtl;
>  
> @@ -3172,7 +3172,20 @@ assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
>      = promote_function_mode (data->nominal_type, data->nominal_mode, &unsignedp,
>  			     TREE_TYPE (current_function_decl), 2);
>  
> -  parmreg = gen_reg_rtx (promoted_nominal_mode);
> +  /* Check to see how the target wants the promotion of function arguments to
> +     be handled.  */
> +  use_subregs
> +    = targetm.calls.promote_function_args_subreg_p (data->nominal_mode,
> +						    promoted_nominal_mode,
> +						    unsignedp, parm);
> +
> +  /* If we're promoting using a paradoxical subreg then we need to keep using
> +     the unpromoted type because that's the only fully defined value.  */
> +  if (use_subregs)
> +    parmreg = gen_reg_rtx (data->nominal_mode);
> +  else
> +    parmreg = gen_reg_rtx (promoted_nominal_mode);
> +
>    if (!DECL_ARTIFICIAL (parm))
>      mark_user_reg (parmreg);
>  
> @@ -3256,9 +3269,19 @@ assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
>  	    }
>  	  else
>  	    t = op1;
> -	  rtx_insn *pat = gen_extend_insn (op0, t, promoted_nominal_mode,
> -					   data->passed_mode, unsignedp);
> -	  emit_insn (pat);
> +
> +	  /* Promote the argument itself now if a target wants it.  This
> +	     prevents unneeded back and forth convertions in RTL between
> +	     the original and promoted type.  */
> +	  if (use_subregs)
> +	    emit_move_insn (op0, lowpart_subreg (promoted_nominal_mode, t,
> +						 data->nominal_mode));
> +	  else
> +	    {
> +	      rtx_insn *pat = gen_extend_insn (op0, t, promoted_nominal_mode,
> +					       data->passed_mode, unsignedp);
> +	      emit_insn (pat);
> +	    }
>  	  insns = get_insns ();
>  
>  	  moved = true;
> diff --git a/gcc/hooks.h b/gcc/hooks.h
> index 1056e1e9e4dc3e6ce298557351047caa2f84227f..8d68de5cdb9adaea0a79ebf6de599f66b40aa67a 100644
> --- a/gcc/hooks.h
> +++ b/gcc/hooks.h
> @@ -31,6 +31,8 @@ extern bool hook_bool_const_int_const_int_true (const int, const int);
>  extern bool hook_bool_mode_false (machine_mode);
>  extern bool hook_bool_mode_true (machine_mode);
>  extern bool hook_bool_mode_mode_true (machine_mode, machine_mode);
> +extern bool hook_bool_mode_mode_int_tree_false (machine_mode, machine_mode, int, tree);
> +extern bool hook_bool_mode_mode_int_tree_true (machine_mode, machine_mode, int, tree);
>  extern bool hook_bool_mode_const_rtx_false (machine_mode, const_rtx);
>  extern bool hook_bool_mode_const_rtx_true (machine_mode, const_rtx);
>  extern bool hook_bool_mode_rtx_false (machine_mode, rtx);
> diff --git a/gcc/hooks.cc b/gcc/hooks.cc
> index b29233f4f852fb81ede75a5065d743cd16cc9219..7647774f9e8efbbe13d5607e4a4b2f1c9d22f045 100644
> --- a/gcc/hooks.cc
> +++ b/gcc/hooks.cc
> @@ -89,6 +89,22 @@ hook_bool_mode_mode_true (machine_mode, machine_mode)
>    return true;
>  }
>  
> +/* Generic hook that takes (machine_mode, machine_mode, int, tree) and
> +   returns false.  */
> +bool
> +hook_bool_mode_mode_int_tree_false (machine_mode, machine_mode, int, tree)
> +{
> +  return false;
> +}
> +
> +/* Generic hook that takes (machine_mode, machine_mode, int, tree) and
> +   returns true.  */
> +bool
> +hook_bool_mode_mode_int_tree_true (machine_mode, machine_mode, int, tree)
> +{
> +  return true;
> +}
> +
>  /* Generic hook that takes (machine_mode, const_rtx) and returns false.  */
>  bool
>  hook_bool_mode_const_rtx_false (machine_mode, const_rtx)
> diff --git a/gcc/target.def b/gcc/target.def
> index 72c2e1ef756cf70a1c92abe81f8a6577eaaa2501..bdbacf8c5fd7b0626a37951f6f6ec649f3194977 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -4561,6 +4561,17 @@ The default is to not promote prototypes.",
>   bool, (const_tree fntype),
>   hook_bool_const_tree_false)
>  
> +DEFHOOK
> +(promote_function_args_subreg_p,
> + "When a function argument is promoted with @code{PROMOTE_MODE} then this\n\
> +hook is used to determine whether the bits of the promoted type are all\n\
> +significant in the expression pointed to by V.  If they are an extend is\n\
> +generated, if they are not a paradoxical subreg is created for the argument\n\
> +from @code{mode} to @code{promoted_mode}.\n\
> +The default is to promote using an extend.",
> + bool, (machine_mode mode, machine_mode promoted_mode, int unsignedp, tree v),
> + hook_bool_mode_mode_int_tree_false)
> +
>  DEFHOOK
>  (struct_value_rtx,
>   "This target hook should return the location of the structure value\n\
> diff --git a/gcc/tree-outof-ssa.cc b/gcc/tree-outof-ssa.cc
> index ec883126ad86d86a2c2dafee4592b8d83e2ed917..0f437023983baa0f23da25221f7bce8fc559a8b8 100644
> --- a/gcc/tree-outof-ssa.cc
> +++ b/gcc/tree-outof-ssa.cc
> @@ -45,6 +45,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-coalesce.h"
>  #include "tree-outof-ssa.h"
>  #include "dojump.h"
> +#include "target.h"
>  
>  /* FIXME: A lot of code here deals with expanding to RTL.  All that code
>     should be in cfgexpand.cc.  */
> @@ -333,7 +334,10 @@ insert_value_copy_on_edge (edge e, int dest, tree src, location_t locus)
>    dest_mode = GET_MODE (dest_rtx);
>    gcc_assert (src_mode == TYPE_MODE (TREE_TYPE (name)));
>    gcc_assert (!REG_P (dest_rtx)
> -	      || dest_mode == promote_ssa_mode (name, &unsignedp));
> +	      || dest_mode == promote_ssa_mode (name, &unsignedp)
> +	      || targetm.calls.promote_function_args_subreg_p (
> +			dest_mode, promote_ssa_mode (name, NULL), unsignedp,
> +			name));
>  
>    if (src_mode != dest_mode)
>      {
> 
> 
> 
> 
>
  
Tamar Christina May 16, 2022, 8:26 a.m. UTC | #2
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Monday, May 16, 2022 7:31 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jeffreyalaw@gmail.com;
> Richard Sandiford <Richard.Sandiford@arm.com>
> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
> the method of argument promotions.
> 
> On Fri, 13 May 2022, Tamar Christina wrote:
> 
> > Hi All,
> >
> > Some targets require function parameters to be promoted to a different
> > type on expand time because the target may not have native
> > instructions to work on such types.  As an example the AArch64 port
> > does not have native instructions working on integer 8- or 16-bit
> > values.  As such it promotes every parameter of these types to 32-bits.
> >
> > This promotion could be done by a target for two reasons:
> >
> > 1. For correctness.  This may be an APCS requirement for instance.
> > 2. For efficiency.  By promoting the argument at expansion time we don't
> have
> >    to keep promoting the type back and forth after each operation on it.
> >    i.e. the promotion simplies the RTL.
> >
> > This patch adds the ability for a target to decide whether during the
> > expansion to use an extend to handle promotion or to use a paradoxical
> subreg.
> >
> > A pradoxical subreg can be used when there's no correctness issues and
> > when you still want the RTL efficiency of not doing the promotion
> constantly.
> >
> > This also allows the target to not need to generate any code when the
> > top bits are not significant.
> >
> > An example is in AArch64 the following extend is unneeded:
> >
> > uint8_t fd2 (uint8_t xr){
> >     return xr + 1;
> > }
> >
> > currently generates:
> >
> > fd2:
> >         and     w0, w0, 255
> >         add     w0, w0, 1
> >         ret
> >
> > instead of
> >
> > fd2:
> >         add     w0, w0, #1
> >         ret
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > Bootstrapped on x86_64-pc-linux-gnu and no issues
> >
> > Ok for master?
> 
> Why do we need a target hook for this?  Why doesn't the targets
> function_arg(?) hook just return (subreg:SI (reg:QI 0)) here instead when no
> zero-extension is required and (reg:QI 0) when it is?
> 

Because I'm not sure it's allowed to. From what I can tell function_arg expects
return registers to be hardregs. i.e. the actual APCS determined register for the
argument.  And since it's a hardreg can't make it a paradoxical subreg, it'll just
resize the register to the new mode.

assign_parm_setup_reg gets around this by creating a new pseudo and then
assigning the resulting rtl to the param set_parm_rtl (parm, rtl); which as far
as I can tell changes what expand will expand the parm to without actually
changing the actual APCS register.  Which is why I think the current code does
the promotions there.

> That said, an extra hook looks like a bad design to me, the existing
> cummulative args way of querying the target ABI shoud be enough, and if
> not, should be extended in a less hackish way.

I could perhaps modify function_arg_info to have a field for the subreg extension
which I can then use in function_arg for the promotion.  However I'll have a problem
with  the asserts in insert_value_copy_on_edge and set_rtl both of which check that
either the in/out types match, or the out type is a promoted in type.

set_rtl I can extend with enough information to determine whether the subreg was
intentional but insert_value_copy_on_edge doesn't give me access to enough information..

Which is probably why the current code re-queries the target..  Can I get to the parm information
Here?

I also initially tried to extend PROMOTE_MODE itself, however this is used by promote_mode, which
is used in many other places and I am not sure some of these usages are safe to change, such as
promote_ssa_mode.

The new hook I know only interacts with parms.  Any thoughts appreciated ?

Cheers,
Tamar

> 
> But of course I am not familiar at all with the current state, but since you
> specifially CCed me ...
> 
> Richard.
> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* cfgexpand.cc (set_rtl): Check for function promotion.
> > 	* tree-outof-ssa.cc (insert_value_copy_on_edge): Likewise.
> > 	* function.cc (assign_parm_setup_reg): Likewise.
> > 	* hooks.cc (hook_bool_mode_mode_int_tree_false,
> > 	hook_bool_mode_mode_int_tree_true): New.
> > 	* hooks.h (hook_bool_mode_mode_int_tree_false,
> > 	hook_bool_mode_mode_int_tree_true): New.
> > 	* target.def (promote_function_args_subreg_p): New.
> > 	* doc/tm.texi: Document it.
> > 	* doc/tm.texi.in: Likewise.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index
> >
> d3cc77d2ca98f620b29623fc5696410bad9bc184..df95184cfa185312c2a46cb92da
> a
> > 051718d9f4f3 100644
> > --- a/gcc/cfgexpand.cc
> > +++ b/gcc/cfgexpand.cc
> > @@ -206,14 +206,20 @@ set_rtl (tree t, rtx x)
> >       have to compute it ourselves.  For RESULT_DECLs, we accept mode
> >       mismatches too, as long as we have BLKmode or are not coalescing
> >       across variables, so that we don't reject BLKmode PARALLELs or
> > -     unpromoted REGs.  */
> > +     unpromoted REGs.  For any promoted types that result in a
> > +     paradoxical subreg also accept the argument.  */
> >    gcc_checking_assert (!x || x == pc_rtx || TREE_CODE (t) != SSA_NAME
> >  		       || (SSAVAR (t)
> >  			   && TREE_CODE (SSAVAR (t)) == RESULT_DECL
> >  			   && (promote_ssa_mode (t, NULL) == BLKmode
> >  			       || !flag_tree_coalesce_vars))
> >  		       || !use_register_for_decl (t)
> > -		       || GET_MODE (x) == promote_ssa_mode (t, NULL));
> > +		       || GET_MODE (x) == promote_ssa_mode (t, NULL)
> > +		       || targetm.calls.promote_function_args_subreg_p (
> > +					  GET_MODE (x),
> > +					  promote_ssa_mode (t, NULL),
> > +					  TYPE_UNSIGNED (TREE_TYPE (t)),
> > +					  SSAVAR (t)));
> >
> >    if (x)
> >      {
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index
> >
> 2f92d37da8c0091e9879a493cfe8a361eb1d9299..6314cd83a2488dc225d4a1a15
> 599
> > e8e51e639f7f 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -3906,6 +3906,15 @@ cases of mismatch, it also makes for better code
> on certain machines.
> >  The default is to not promote prototypes.
> >  @end deftypefn
> >
> > +@deftypefn {Target Hook} bool
> TARGET_PROMOTE_FUNCTION_ARGS_SUBREG_P
> > +(machine_mode @var{mode}, machine_mode @var{promoted_mode},
> int
> > +@var{unsignedp}, tree @var{v}) When a function argument is promoted
> > +with @code{PROMOTE_MODE} then this hook is used to determine
> whether
> > +the bits of the promoted type are all significant in the expression
> > +pointed to by V.  If they are an extend is generated, if they are not a
> paradoxical subreg is created for the argument from @code{mode} to
> @code{promoted_mode}.
> > +The default is to promote using an extend.
> > +@end deftypefn
> > +
> >  @deftypefn {Target Hook} bool TARGET_PUSH_ARGUMENT (unsigned int
> > @var{npush})  This target hook returns @code{true} if push
> > instructions will be  used to pass outgoing arguments.  When the push
> > instruction usage is diff --git a/gcc/doc/tm.texi.in
> > b/gcc/doc/tm.texi.in index
> >
> f869ddd5e5b8b7acbd8e9765fb103af24a1085b6..35f955803ec0a5a93be18a028
> fa1
> > 043f90858982 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -3103,6 +3103,8 @@ control passing certain arguments in registers.
> >
> >  @hook TARGET_PROMOTE_PROTOTYPES
> >
> > +@hook TARGET_PROMOTE_FUNCTION_ARGS_SUBREG_P
> > +
> >  @hook TARGET_PUSH_ARGUMENT
> >
> >  @defmac PUSH_ARGS_REVERSED
> > diff --git a/gcc/function.cc b/gcc/function.cc index
> >
> d5ed51a6a663a1ef472f5b1c090543f359c18f42..92f469bfd5d1ebfb09cc94d9be
> 85
> > 4715cd2f90f8 100644
> > --- a/gcc/function.cc
> > +++ b/gcc/function.cc
> > @@ -3161,7 +3161,7 @@ assign_parm_setup_reg (struct
> assign_parm_data_all *all, tree parm,
> >    machine_mode promoted_nominal_mode;
> >    int unsignedp = TYPE_UNSIGNED (TREE_TYPE (parm));
> >    bool did_conversion = false;
> > -  bool need_conversion, moved;
> > +  bool need_conversion, moved, use_subregs;
> >    enum insn_code icode;
> >    rtx rtl;
> >
> > @@ -3172,7 +3172,20 @@ assign_parm_setup_reg (struct
> assign_parm_data_all *all, tree parm,
> >      = promote_function_mode (data->nominal_type, data-
> >nominal_mode, &unsignedp,
> >  			     TREE_TYPE (current_function_decl), 2);
> >
> > -  parmreg = gen_reg_rtx (promoted_nominal_mode);
> > +  /* Check to see how the target wants the promotion of function
> arguments to
> > +     be handled.  */
> > +  use_subregs
> > +    = targetm.calls.promote_function_args_subreg_p (data-
> >nominal_mode,
> > +						    promoted_nominal_mode,
> > +						    unsignedp, parm);
> > +
> > +  /* If we're promoting using a paradoxical subreg then we need to keep
> using
> > +     the unpromoted type because that's the only fully defined value.
> > + */  if (use_subregs)
> > +    parmreg = gen_reg_rtx (data->nominal_mode);  else
> > +    parmreg = gen_reg_rtx (promoted_nominal_mode);
> > +
> >    if (!DECL_ARTIFICIAL (parm))
> >      mark_user_reg (parmreg);
> >
> > @@ -3256,9 +3269,19 @@ assign_parm_setup_reg (struct
> assign_parm_data_all *all, tree parm,
> >  	    }
> >  	  else
> >  	    t = op1;
> > -	  rtx_insn *pat = gen_extend_insn (op0, t,
> promoted_nominal_mode,
> > -					   data->passed_mode, unsignedp);
> > -	  emit_insn (pat);
> > +
> > +	  /* Promote the argument itself now if a target wants it.  This
> > +	     prevents unneeded back and forth convertions in RTL between
> > +	     the original and promoted type.  */
> > +	  if (use_subregs)
> > +	    emit_move_insn (op0, lowpart_subreg
> (promoted_nominal_mode, t,
> > +						 data->nominal_mode));
> > +	  else
> > +	    {
> > +	      rtx_insn *pat = gen_extend_insn (op0, t,
> promoted_nominal_mode,
> > +					       data->passed_mode, unsignedp);
> > +	      emit_insn (pat);
> > +	    }
> >  	  insns = get_insns ();
> >
> >  	  moved = true;
> > diff --git a/gcc/hooks.h b/gcc/hooks.h index
> >
> 1056e1e9e4dc3e6ce298557351047caa2f84227f..8d68de5cdb9adaea0a79ebf6d
> e59
> > 9f66b40aa67a 100644
> > --- a/gcc/hooks.h
> > +++ b/gcc/hooks.h
> > @@ -31,6 +31,8 @@ extern bool hook_bool_const_int_const_int_true
> > (const int, const int);  extern bool hook_bool_mode_false
> > (machine_mode);  extern bool hook_bool_mode_true (machine_mode);
> > extern bool hook_bool_mode_mode_true (machine_mode,
> machine_mode);
> > +extern bool hook_bool_mode_mode_int_tree_false (machine_mode,
> > +machine_mode, int, tree); extern bool
> > +hook_bool_mode_mode_int_tree_true (machine_mode,
> machine_mode, int,
> > +tree);
> >  extern bool hook_bool_mode_const_rtx_false (machine_mode,
> const_rtx);
> > extern bool hook_bool_mode_const_rtx_true (machine_mode,
> const_rtx);
> > extern bool hook_bool_mode_rtx_false (machine_mode, rtx); diff --git
> > a/gcc/hooks.cc b/gcc/hooks.cc index
> >
> b29233f4f852fb81ede75a5065d743cd16cc9219..7647774f9e8efbbe13d5607e4
> a4b
> > 2f1c9d22f045 100644
> > --- a/gcc/hooks.cc
> > +++ b/gcc/hooks.cc
> > @@ -89,6 +89,22 @@ hook_bool_mode_mode_true (machine_mode,
> machine_mode)
> >    return true;
> >  }
> >
> > +/* Generic hook that takes (machine_mode, machine_mode, int, tree)
> and
> > +   returns false.  */
> > +bool
> > +hook_bool_mode_mode_int_tree_false (machine_mode,
> machine_mode, int,
> > +tree) {
> > +  return false;
> > +}
> > +
> > +/* Generic hook that takes (machine_mode, machine_mode, int, tree)
> and
> > +   returns true.  */
> > +bool
> > +hook_bool_mode_mode_int_tree_true (machine_mode,
> machine_mode, int,
> > +tree) {
> > +  return true;
> > +}
> > +
> >  /* Generic hook that takes (machine_mode, const_rtx) and returns
> > false.  */  bool  hook_bool_mode_const_rtx_false (machine_mode,
> > const_rtx) diff --git a/gcc/target.def b/gcc/target.def index
> >
> 72c2e1ef756cf70a1c92abe81f8a6577eaaa2501..bdbacf8c5fd7b0626a37951f6f6
> e
> > c649f3194977 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -4561,6 +4561,17 @@ The default is to not promote prototypes.",
> >   bool, (const_tree fntype),
> >   hook_bool_const_tree_false)
> >
> > +DEFHOOK
> > +(promote_function_args_subreg_p,
> > + "When a function argument is promoted with @code{PROMOTE_MODE}
> then
> > +this\n\ hook is used to determine whether the bits of the promoted
> > +type are all\n\ significant in the expression pointed to by V.  If
> > +they are an extend is\n\ generated, if they are not a paradoxical
> > +subreg is created for the argument\n\ from @code{mode} to
> > +@code{promoted_mode}.\n\ The default is to promote using an
> extend.",
> > +bool, (machine_mode mode, machine_mode promoted_mode, int
> unsignedp,
> > +tree v),
> > + hook_bool_mode_mode_int_tree_false)
> > +
> >  DEFHOOK
> >  (struct_value_rtx,
> >   "This target hook should return the location of the structure
> > value\n\ diff --git a/gcc/tree-outof-ssa.cc b/gcc/tree-outof-ssa.cc
> > index
> >
> ec883126ad86d86a2c2dafee4592b8d83e2ed917..0f437023983baa0f23da25221
> f7b
> > ce8fc559a8b8 100644
> > --- a/gcc/tree-outof-ssa.cc
> > +++ b/gcc/tree-outof-ssa.cc
> > @@ -45,6 +45,7 @@ along with GCC; see the file COPYING3.  If not see
> > #include "tree-ssa-coalesce.h"
> >  #include "tree-outof-ssa.h"
> >  #include "dojump.h"
> > +#include "target.h"
> >
> >  /* FIXME: A lot of code here deals with expanding to RTL.  All that code
> >     should be in cfgexpand.cc.  */
> > @@ -333,7 +334,10 @@ insert_value_copy_on_edge (edge e, int dest,
> tree src, location_t locus)
> >    dest_mode = GET_MODE (dest_rtx);
> >    gcc_assert (src_mode == TYPE_MODE (TREE_TYPE (name)));
> >    gcc_assert (!REG_P (dest_rtx)
> > -	      || dest_mode == promote_ssa_mode (name, &unsignedp));
> > +	      || dest_mode == promote_ssa_mode (name, &unsignedp)
> > +	      || targetm.calls.promote_function_args_subreg_p (
> > +			dest_mode, promote_ssa_mode (name, NULL),
> unsignedp,
> > +			name));
> >
> >    if (src_mode != dest_mode)
> >      {
> >
> >
> >
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
  
Richard Sandiford May 16, 2022, 11:36 a.m. UTC | #3
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> Some targets require function parameters to be promoted to a different
> type on expand time because the target may not have native instructions
> to work on such types.  As an example the AArch64 port does not have native
> instructions working on integer 8- or 16-bit values.  As such it promotes
> every parameter of these types to 32-bits.

This doesn't seem specific to parameters though.  It applies to any
8- or 16-bit variable.  E.g.:

#include <stdint.h>
uint8_t foo(uint32_t x, uint32_t y) {
    uint8_t z = x != 0 ? x : y;
    return z + 1;
}

generates:

foo:
        cmp     w0, 0
        and     w1, w1, 255
        and     w0, w0, 255
        csel    w0, w1, w0, eq
        add     w0, w0, 1
        ret

So I think the new behaviour is really a modification of the PROMOTE_MODE
behaviour rather than the PROMOTE_FUNCTION_MODE behaviour.

FWIW, I agree with Richard that it would be better not to add a new hook.
I think we're really making PROMOTE_MODE choose between SIGN_EXTEND,
ZERO_EXTEND or SUBREG (what LLVM would call “any extend”) rather
than the current SIGN_EXTEND vs. ZERO_EXTEND choice.

Thanks,
Richard
  
Tamar Christina May 16, 2022, 11:49 a.m. UTC | #4
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, May 16, 2022 12:36 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> jeffreyalaw@gmail.com
> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
> the method of argument promotions.
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > Some targets require function parameters to be promoted to a different
> > type on expand time because the target may not have native
> > instructions to work on such types.  As an example the AArch64 port
> > does not have native instructions working on integer 8- or 16-bit
> > values.  As such it promotes every parameter of these types to 32-bits.
> 
> This doesn't seem specific to parameters though.  It applies to any
> 8- or 16-bit variable.  E.g.:
> 
> #include <stdint.h>
> uint8_t foo(uint32_t x, uint32_t y) {
>     uint8_t z = x != 0 ? x : y;
>     return z + 1;
> }
> 
> generates:
> 
> foo:
>         cmp     w0, 0
>         and     w1, w1, 255
>         and     w0, w0, 255
>         csel    w0, w1, w0, eq
>         add     w0, w0, 1
>         ret
> 
> So I think the new behaviour is really a modification of the PROMOTE_MODE
> behaviour rather than the PROMOTE_FUNCTION_MODE behaviour.
> 
> FWIW, I agree with Richard that it would be better not to add a new hook.
> I think we're really making PROMOTE_MODE choose between
> SIGN_EXTEND, ZERO_EXTEND or SUBREG (what LLVM would call “any
> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND choice.

Ah, I hadn't realized this also applied to locals.. ok I can modify PROMOTE_MODE then,
but I also need the actual SSA_NAME and not just the type so will have to pass this along.

From a practical point of view.. the actual hook however is implemented by 34 targets,
would I need to CC maintainers for each of them or would global maintainer approval
suffice for these mostly mechanical changes?

Cheers,
Tamar

> 
> Thanks,
> Richard
  
Richard Sandiford May 16, 2022, 12:14 p.m. UTC | #5
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, May 16, 2022 12:36 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
>> jeffreyalaw@gmail.com
>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
>> the method of argument promotions.
>> 
>> Tamar Christina <tamar.christina@arm.com> writes:
>> > Hi All,
>> >
>> > Some targets require function parameters to be promoted to a different
>> > type on expand time because the target may not have native
>> > instructions to work on such types.  As an example the AArch64 port
>> > does not have native instructions working on integer 8- or 16-bit
>> > values.  As such it promotes every parameter of these types to 32-bits.
>> 
>> This doesn't seem specific to parameters though.  It applies to any
>> 8- or 16-bit variable.  E.g.:
>> 
>> #include <stdint.h>
>> uint8_t foo(uint32_t x, uint32_t y) {
>>     uint8_t z = x != 0 ? x : y;
>>     return z + 1;
>> }
>> 
>> generates:
>> 
>> foo:
>>         cmp     w0, 0
>>         and     w1, w1, 255
>>         and     w0, w0, 255
>>         csel    w0, w1, w0, eq
>>         add     w0, w0, 1
>>         ret
>> 
>> So I think the new behaviour is really a modification of the PROMOTE_MODE
>> behaviour rather than the PROMOTE_FUNCTION_MODE behaviour.
>> 
>> FWIW, I agree with Richard that it would be better not to add a new hook.
>> I think we're really making PROMOTE_MODE choose between
>> SIGN_EXTEND, ZERO_EXTEND or SUBREG (what LLVM would call “any
>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND choice.
>
> Ah, I hadn't realized this also applied to locals.. ok I can modify PROMOTE_MODE then,
> but I also need the actual SSA_NAME and not just the type so will have to pass this along.
>
> From a practical point of view.. the actual hook however is implemented by 34 targets,
> would I need to CC maintainers for each of them or would global maintainer approval
> suffice for these mostly mechanical changes?

Yeah, single approval should be enough mechanical changes.  It would be
good to do the interface change and mechanical target changes as a
separate prepatch if possible though.

I'm not sure about passing the SSA name to the target though, or the
way that the aarch64 hook uses the info.  It looks like a single cold
comparison could defeat the optimisation for hot code.

If we do try to make the decision based on uses at expand time, it might
be better for the analysis to be in target-independent code, with help
from the target to decide where extensions are cheap.  It still feels
a bit hacky though.

What stops us from forming cbz/cbnz when the extension is done close to
the comparison (from the comment in 2/3)?  If we can solve that, could
we simply do an any-extend all the time, and treat removing redundant
extensions as a global availability problem?

What kind of code do we emit when do an extension just before
an operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say,
then it should be safe to do the extension directly into R:

  (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))

which avoids the problem of having two values live at once
(the zero-extended value and the any-extended value).

Having identical instructions for each extension would also make
it easier for any global pass to move them and remove redundancies.

Thanks,
Richard
  
Richard Sandiford May 16, 2022, 12:18 p.m. UTC | #6
Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Tamar Christina <Tamar.Christina@arm.com> writes:
>>> -----Original Message-----
>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>> Sent: Monday, May 16, 2022 12:36 PM
>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
>>> jeffreyalaw@gmail.com
>>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
>>> the method of argument promotions.
>>> 
>>> Tamar Christina <tamar.christina@arm.com> writes:
>>> > Hi All,
>>> >
>>> > Some targets require function parameters to be promoted to a different
>>> > type on expand time because the target may not have native
>>> > instructions to work on such types.  As an example the AArch64 port
>>> > does not have native instructions working on integer 8- or 16-bit
>>> > values.  As such it promotes every parameter of these types to 32-bits.
>>> 
>>> This doesn't seem specific to parameters though.  It applies to any
>>> 8- or 16-bit variable.  E.g.:
>>> 
>>> #include <stdint.h>
>>> uint8_t foo(uint32_t x, uint32_t y) {
>>>     uint8_t z = x != 0 ? x : y;
>>>     return z + 1;
>>> }
>>> 
>>> generates:
>>> 
>>> foo:
>>>         cmp     w0, 0
>>>         and     w1, w1, 255
>>>         and     w0, w0, 255
>>>         csel    w0, w1, w0, eq
>>>         add     w0, w0, 1
>>>         ret
>>> 
>>> So I think the new behaviour is really a modification of the PROMOTE_MODE
>>> behaviour rather than the PROMOTE_FUNCTION_MODE behaviour.
>>> 
>>> FWIW, I agree with Richard that it would be better not to add a new hook.
>>> I think we're really making PROMOTE_MODE choose between
>>> SIGN_EXTEND, ZERO_EXTEND or SUBREG (what LLVM would call “any
>>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND choice.
>>
>> Ah, I hadn't realized this also applied to locals.. ok I can modify PROMOTE_MODE then,
>> but I also need the actual SSA_NAME and not just the type so will have to pass this along.
>>
>> From a practical point of view.. the actual hook however is implemented by 34 targets,
>> would I need to CC maintainers for each of them or would global maintainer approval
>> suffice for these mostly mechanical changes?
>
> Yeah, single approval should be enough mechanical changes.  It would be
> good to do the interface change and mechanical target changes as a
> separate prepatch if possible though.
>
> I'm not sure about passing the SSA name to the target though, or the
> way that the aarch64 hook uses the info.  It looks like a single cold
> comparison could defeat the optimisation for hot code.
>
> If we do try to make the decision based on uses at expand time, it might
> be better for the analysis to be in target-independent code, with help
> from the target to decide where extensions are cheap.  It still feels
> a bit hacky though.
>
> What stops us from forming cbz/cbnz when the extension is done close to
> the comparison (from the comment in 2/3)?  If we can solve that, could
> we simply do an any-extend all the time, and treat removing redundant
> extensions as a global availability problem?

(which would run after combine)

>
> What kind of code do we emit when do an extension just before
> an operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say,
> then it should be safe to do the extension directly into R:
>
>   (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))

Oops, that should of course be:

  (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))

> which avoids the problem of having two values live at once
> (the zero-extended value and the any-extended value).
>
> Having identical instructions for each extension would also make
> it easier for any global pass to move them and remove redundancies.
>
> Thanks,
> Richard
  
Tamar Christina May 16, 2022, 1:02 p.m. UTC | #7
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, May 16, 2022 1:18 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> jeffreyalaw@gmail.com
> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
> the method of argument promotions.
> 
> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Tamar Christina <Tamar.Christina@arm.com> writes:
> >>> -----Original Message-----
> >>> From: Richard Sandiford <richard.sandiford@arm.com>
> >>> Sent: Monday, May 16, 2022 12:36 PM
> >>> To: Tamar Christina <Tamar.Christina@arm.com>
> >>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> >>> jeffreyalaw@gmail.com
> >>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
> >>> target decide the method of argument promotions.
> >>>
> >>> Tamar Christina <tamar.christina@arm.com> writes:
> >>> > Hi All,
> >>> >
> >>> > Some targets require function parameters to be promoted to a
> >>> > different type on expand time because the target may not have
> >>> > native instructions to work on such types.  As an example the
> >>> > AArch64 port does not have native instructions working on integer
> >>> > 8- or 16-bit values.  As such it promotes every parameter of these
> types to 32-bits.
> >>>
> >>> This doesn't seem specific to parameters though.  It applies to any
> >>> 8- or 16-bit variable.  E.g.:
> >>>
> >>> #include <stdint.h>
> >>> uint8_t foo(uint32_t x, uint32_t y) {
> >>>     uint8_t z = x != 0 ? x : y;
> >>>     return z + 1;
> >>> }
> >>>
> >>> generates:
> >>>
> >>> foo:
> >>>         cmp     w0, 0
> >>>         and     w1, w1, 255
> >>>         and     w0, w0, 255
> >>>         csel    w0, w1, w0, eq
> >>>         add     w0, w0, 1
> >>>         ret
> >>>
> >>> So I think the new behaviour is really a modification of the
> >>> PROMOTE_MODE behaviour rather than the
> PROMOTE_FUNCTION_MODE behaviour.
> >>>
> >>> FWIW, I agree with Richard that it would be better not to add a new
> hook.
> >>> I think we're really making PROMOTE_MODE choose between
> SIGN_EXTEND,
> >>> ZERO_EXTEND or SUBREG (what LLVM would call “any
> >>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND
> choice.
> >>
> >> Ah, I hadn't realized this also applied to locals.. ok I can modify
> >> PROMOTE_MODE then, but I also need the actual SSA_NAME and not just
> the type so will have to pass this along.
> >>
> >> From a practical point of view.. the actual hook however is
> >> implemented by 34 targets, would I need to CC maintainers for each of
> >> them or would global maintainer approval suffice for these mostly
> mechanical changes?
> >
> > Yeah, single approval should be enough mechanical changes.  It would
> > be good to do the interface change and mechanical target changes as a
> > separate prepatch if possible though.
> >
> > I'm not sure about passing the SSA name to the target though, or the
> > way that the aarch64 hook uses the info.  It looks like a single cold
> > comparison could defeat the optimisation for hot code.

I'm not sure I follow why the likelihood of the comparison matters in this case..
I'll expand on it below..

> >
> > If we do try to make the decision based on uses at expand time, it
> > might be better for the analysis to be in target-independent code,
> > with help from the target to decide where extensions are cheap.  It
> > still feels a bit hacky though.

I thought about it but can't see most target having this problem. I did go
with an optimistic heuristics. There are of course various ways to defeat it
but looking through the corpus of code I don't see any but the simple cases
in practice. (more below).

> >
> > What stops us from forming cbz/cbnz when the extension is done close
> > to the comparison (from the comment in 2/3)?  If we can solve that,
> > could we simply do an any-extend all the time, and treat removing
> > redundant extensions as a global availability problem?
> 

In such cases there's no gain from doing the extension at all, e.g.
and w0, w0, 255
cmp w0, 0
b.eq .Lfoo

will be optimized to

tst w0, 0xff
b.ne .Lfoo

already.

In RTL the problem occurs when you have nested control flow like nested if and switch statements
The example in 2/3 was intended to show that before what we'd do is

and w22, w0, 255
.... <code that clobbers cc and caller saves>
<switch1>
cbz w22, .Lfoo1
....
<switch2>
cbz w22, .Lfoo2

If we have a single comparison we already sink the zero_extend today.

Now if we instead any-extend w0 we end up with:

mov w22, w0
.... <code that clobbers cc and caller saves>
<switch1>
tst w22, 0xff
b.ne .Lfoo1
....
<switch2>
tst w22, 0xff
b.ne .Lfoo2

So you get an additional tst before each branch. You also can't perform the tst higher since CC is clobbered.
The cbz is nice because the zero extend doesn't use CC of course and so having the value live allows us to optimize
The branch.

I don't think branch likeliness matters here as you must keep w22 live in both cases somehow. In the SPECCPU 2017
Benchmark perlbench (which uses a lot of nested switches) this pattern is responsible for an extra 0.3% codesize increase
which the approach in 2/3 prevents.

> (which would run after combine)
> 
> >
> > What kind of code do we emit when do an extension just before an
> > operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say, then it
> > should be safe to do the extension directly into R:
> >
> >   (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))
> 
> Oops, that should of course be:
> 
>   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
> 
> > which avoids the problem of having two values live at once (the
> > zero-extended value and the any-extended value).

I'm not sure it does, as the any-extended value must remain live. i.e. above you can't get rid of w22,
you can only choose between having it be zero of any extended.  But I am not sure how you would do
that after expand.

Kind Regards,
Tamar

> >
> > Having identical instructions for each extension would also make it
> > easier for any global pass to move them and remove redundancies.
> >
> > Thanks,
> > Richard
  
Richard Sandiford May 16, 2022, 1:24 p.m. UTC | #8
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, May 16, 2022 1:18 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
>> jeffreyalaw@gmail.com
>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
>> the method of argument promotions.
>> 
>> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Tamar Christina <Tamar.Christina@arm.com> writes:
>> >>> -----Original Message-----
>> >>> From: Richard Sandiford <richard.sandiford@arm.com>
>> >>> Sent: Monday, May 16, 2022 12:36 PM
>> >>> To: Tamar Christina <Tamar.Christina@arm.com>
>> >>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
>> >>> jeffreyalaw@gmail.com
>> >>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
>> >>> target decide the method of argument promotions.
>> >>>
>> >>> Tamar Christina <tamar.christina@arm.com> writes:
>> >>> > Hi All,
>> >>> >
>> >>> > Some targets require function parameters to be promoted to a
>> >>> > different type on expand time because the target may not have
>> >>> > native instructions to work on such types.  As an example the
>> >>> > AArch64 port does not have native instructions working on integer
>> >>> > 8- or 16-bit values.  As such it promotes every parameter of these
>> types to 32-bits.
>> >>>
>> >>> This doesn't seem specific to parameters though.  It applies to any
>> >>> 8- or 16-bit variable.  E.g.:
>> >>>
>> >>> #include <stdint.h>
>> >>> uint8_t foo(uint32_t x, uint32_t y) {
>> >>>     uint8_t z = x != 0 ? x : y;
>> >>>     return z + 1;
>> >>> }
>> >>>
>> >>> generates:
>> >>>
>> >>> foo:
>> >>>         cmp     w0, 0
>> >>>         and     w1, w1, 255
>> >>>         and     w0, w0, 255
>> >>>         csel    w0, w1, w0, eq
>> >>>         add     w0, w0, 1
>> >>>         ret
>> >>>
>> >>> So I think the new behaviour is really a modification of the
>> >>> PROMOTE_MODE behaviour rather than the
>> PROMOTE_FUNCTION_MODE behaviour.
>> >>>
>> >>> FWIW, I agree with Richard that it would be better not to add a new
>> hook.
>> >>> I think we're really making PROMOTE_MODE choose between
>> SIGN_EXTEND,
>> >>> ZERO_EXTEND or SUBREG (what LLVM would call “any
>> >>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND
>> choice.
>> >>
>> >> Ah, I hadn't realized this also applied to locals.. ok I can modify
>> >> PROMOTE_MODE then, but I also need the actual SSA_NAME and not just
>> the type so will have to pass this along.
>> >>
>> >> From a practical point of view.. the actual hook however is
>> >> implemented by 34 targets, would I need to CC maintainers for each of
>> >> them or would global maintainer approval suffice for these mostly
>> mechanical changes?
>> >
>> > Yeah, single approval should be enough mechanical changes.  It would
>> > be good to do the interface change and mechanical target changes as a
>> > separate prepatch if possible though.
>> >
>> > I'm not sure about passing the SSA name to the target though, or the
>> > way that the aarch64 hook uses the info.  It looks like a single cold
>> > comparison could defeat the optimisation for hot code.
>
> I'm not sure I follow why the likelihood of the comparison matters in this case..
> I'll expand on it below..

I meant the likelihood that the comparison is executed at all,
not which outcome is more likely.  E.g. suppose the only comparison
occurs on a failure path that eventually calls abort, and that there are
other paths (without comparisons of the same value) that would benefit
from the any-extend optimisation.  We'd prioritise the cold comparison
over optimising the other (hot) code.

I'm just suspicious of heuristics along the lines of “don't do X
if there is a single instance of Y”. :-)

>> > If we do try to make the decision based on uses at expand time, it
>> > might be better for the analysis to be in target-independent code,
>> > with help from the target to decide where extensions are cheap.  It
>> > still feels a bit hacky though.
>
> I thought about it but can't see most target having this problem. I did go
> with an optimistic heuristics. There are of course various ways to defeat it
> but looking through the corpus of code I don't see any but the simple cases
> in practice. (more below).
>
>> >
>> > What stops us from forming cbz/cbnz when the extension is done close
>> > to the comparison (from the comment in 2/3)?  If we can solve that,
>> > could we simply do an any-extend all the time, and treat removing
>> > redundant extensions as a global availability problem?
>> 
>
> In such cases there's no gain from doing the extension at all, e.g.
> and w0, w0, 255
> cmp w0, 0
> b.eq .Lfoo
>
> will be optimized to
>
> tst w0, 0xff
> b.ne .Lfoo
>
> already.
>
> In RTL the problem occurs when you have nested control flow like nested if and switch statements
> The example in 2/3 was intended to show that before what we'd do is
>
> and w22, w0, 255
> .... <code that clobbers cc and caller saves>
> <switch1>
> cbz w22, .Lfoo1
> ....
> <switch2>
> cbz w22, .Lfoo2
>
> If we have a single comparison we already sink the zero_extend today.
>
> Now if we instead any-extend w0 we end up with:
>
> mov w22, w0
> .... <code that clobbers cc and caller saves>
> <switch1>
> tst w22, 0xff
> b.ne .Lfoo1
> ....
> <switch2>
> tst w22, 0xff
> b.ne .Lfoo2
>
> So you get an additional tst before each branch. You also can't perform the tst higher since CC is clobbered.
> The cbz is nice because the zero extend doesn't use CC of course and so having the value live allows us to optimize
> The branch.

Once the cbz has been formed (in combine), where does the optimisation
of it happen?

> I don't think branch likeliness matters here as you must keep w22 live in both cases somehow. In the SPECCPU 2017
> Benchmark perlbench (which uses a lot of nested switches) this pattern is responsible for an extra 0.3% codesize increase
> which the approach in 2/3 prevents.
>
>> (which would run after combine)
>> 
>> >
>> > What kind of code do we emit when do an extension just before an
>> > operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say, then it
>> > should be safe to do the extension directly into R:
>> >
>> >   (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))
>> 
>> Oops, that should of course be:
>> 
>>   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
>> 
>> > which avoids the problem of having two values live at once (the
>> > zero-extended value and the any-extended value).
>
> I'm not sure it does, as the any-extended value must remain live. i.e. above you can't get rid of w22,
> you can only choose between having it be zero of any extended.  But I am not sure how you would do
> that after expand.

These per-operation extends are emitted during expand.  The question is
whether we do them into fresh registers:

   (set (reg:SI Rtmp) (zero_extend:SI (subreg:QI (reg:SI R))))

which leaves both R and Rtmp live at the same time, or whether we
do them in-situ:

   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))

Expand should know that the latter is valid, given the DECL_RTL.

Thanks,
Richard
  
Tamar Christina May 16, 2022, 3:29 p.m. UTC | #9
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, May 16, 2022 2:24 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> jeffreyalaw@gmail.com
> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
> the method of argument promotions.
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Monday, May 16, 2022 1:18 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> >> jeffreyalaw@gmail.com
> >> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target
> >> decide the method of argument promotions.
> >>
> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >>> -----Original Message-----
> >> >>> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >>> Sent: Monday, May 16, 2022 12:36 PM
> >> >>> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> rguenther@suse.de;
> >> >>> jeffreyalaw@gmail.com
> >> >>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
> >> >>> target decide the method of argument promotions.
> >> >>>
> >> >>> Tamar Christina <tamar.christina@arm.com> writes:
> >> >>> > Hi All,
> >> >>> >
> >> >>> > Some targets require function parameters to be promoted to a
> >> >>> > different type on expand time because the target may not have
> >> >>> > native instructions to work on such types.  As an example the
> >> >>> > AArch64 port does not have native instructions working on
> >> >>> > integer
> >> >>> > 8- or 16-bit values.  As such it promotes every parameter of
> >> >>> > these
> >> types to 32-bits.
> >> >>>
> >> >>> This doesn't seem specific to parameters though.  It applies to
> >> >>> any
> >> >>> 8- or 16-bit variable.  E.g.:
> >> >>>
> >> >>> #include <stdint.h>
> >> >>> uint8_t foo(uint32_t x, uint32_t y) {
> >> >>>     uint8_t z = x != 0 ? x : y;
> >> >>>     return z + 1;
> >> >>> }
> >> >>>
> >> >>> generates:
> >> >>>
> >> >>> foo:
> >> >>>         cmp     w0, 0
> >> >>>         and     w1, w1, 255
> >> >>>         and     w0, w0, 255
> >> >>>         csel    w0, w1, w0, eq
> >> >>>         add     w0, w0, 1
> >> >>>         ret
> >> >>>
> >> >>> So I think the new behaviour is really a modification of the
> >> >>> PROMOTE_MODE behaviour rather than the
> >> PROMOTE_FUNCTION_MODE behaviour.
> >> >>>
> >> >>> FWIW, I agree with Richard that it would be better not to add a
> >> >>> new
> >> hook.
> >> >>> I think we're really making PROMOTE_MODE choose between
> >> SIGN_EXTEND,
> >> >>> ZERO_EXTEND or SUBREG (what LLVM would call “any
> >> >>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND
> >> choice.
> >> >>
> >> >> Ah, I hadn't realized this also applied to locals.. ok I can
> >> >> modify PROMOTE_MODE then, but I also need the actual SSA_NAME
> and
> >> >> not just
> >> the type so will have to pass this along.
> >> >>
> >> >> From a practical point of view.. the actual hook however is
> >> >> implemented by 34 targets, would I need to CC maintainers for each
> >> >> of them or would global maintainer approval suffice for these
> >> >> mostly
> >> mechanical changes?
> >> >
> >> > Yeah, single approval should be enough mechanical changes.  It
> >> > would be good to do the interface change and mechanical target
> >> > changes as a separate prepatch if possible though.
> >> >
> >> > I'm not sure about passing the SSA name to the target though, or
> >> > the way that the aarch64 hook uses the info.  It looks like a
> >> > single cold comparison could defeat the optimisation for hot code.
> >
> > I'm not sure I follow why the likelihood of the comparison matters in this
> case..
> > I'll expand on it below..
> 
> I meant the likelihood that the comparison is executed at all, not which
> outcome is more likely.  E.g. suppose the only comparison occurs on a failure
> path that eventually calls abort, and that there are other paths (without
> comparisons of the same value) that would benefit from the any-extend
> optimisation.  We'd prioritise the cold comparison over optimising the other
> (hot) code.
> 
> I'm just suspicious of heuristics along the lines of “don't do X if there is a
> single instance of Y”. :-)

I'm probably very dense here sorry.. but if there's

1 use: the zero extend gets pushed down into the branch which needs it.

i.e. in:

extern void foo ();
extern void bar ();

uint8_t f (uint8_t a, uint8_t b)
{
  if (b) {
    if (a)
      foo ();
    else
      return f (a, b);
  } else {
      bar ();
  }
  return b;
}

The zero extend of a is only done in the true branch for if (b).  Secondly the zero
extended form is the basis for all other patterns we form, such as ands, which is
the combination of the zero extend and compare.

2 uses, both live:

extern void foo ();
extern void bar (uint8_t);

uint8_t f (uint8_t a, uint8_t b)
{
  if (b) {
    if (a)
      foo ();
    else
      return f (a, b);
  } else {
      bar (a);
  }
  return b;
}

In which case the extend of a is done before the if (b) and only the extended values
used.

Even if you had multiple cold/unused branches, I struggle to see any case where the
any-extend would be better.  Reload must keep the value live as it's a param. Either you:

1. have enough registers to keep the value live, in which case, instead of doing a "mov" to
     To copy the value and then later an AND or TST, it's better to just do an and instead of the mov.
      You keep the same number of registers live but in the best case you have 1 instruction less, and
       The worse case you have 0 instructions more.
2. You don't have enough registers to keep the value live, in which case the zero extended value is
     Still better because on the reload it can simply use ldrb ..., cbz as we use the load for an implicit
     zero extend. Which is still better than ldrb ..., tst, cbnz for an any-extend.

So I am genuinely struggling to see a case where any-extend is better for comparison. And the only
reason I am singling out comparisons is because in GIMPLE integer constants don't get an explicit
promotion to int.  Otherwise I wouldn't have needed to as it would have always required an extend
here.

> 
> >> > If we do try to make the decision based on uses at expand time, it
> >> > might be better for the analysis to be in target-independent code,
> >> > with help from the target to decide where extensions are cheap.  It
> >> > still feels a bit hacky though.
> >
> > I thought about it but can't see most target having this problem. I
> > did go with an optimistic heuristics. There are of course various ways
> > to defeat it but looking through the corpus of code I don't see any
> > but the simple cases in practice. (more below).
> >
> >> >
> >> > What stops us from forming cbz/cbnz when the extension is done
> >> > close to the comparison (from the comment in 2/3)?  If we can solve
> >> > that, could we simply do an any-extend all the time, and treat
> >> > removing redundant extensions as a global availability problem?
> >>
> >
> > In such cases there's no gain from doing the extension at all, e.g.
> > and w0, w0, 255
> > cmp w0, 0
> > b.eq .Lfoo
> >
> > will be optimized to
> >
> > tst w0, 0xff
> > b.ne .Lfoo
> >
> > already.
> >
> > In RTL the problem occurs when you have nested control flow like
> > nested if and switch statements The example in 2/3 was intended to
> > show that before what we'd do is
> >
> > and w22, w0, 255
> > .... <code that clobbers cc and caller saves> <switch1> cbz w22,
> > .Lfoo1 ....
> > <switch2>
> > cbz w22, .Lfoo2
> >
> > If we have a single comparison we already sink the zero_extend today.
> >
> > Now if we instead any-extend w0 we end up with:
> >
> > mov w22, w0
> > .... <code that clobbers cc and caller saves> <switch1> tst w22, 0xff
> > b.ne .Lfoo1 ....
> > <switch2>
> > tst w22, 0xff
> > b.ne .Lfoo2
> >
> > So you get an additional tst before each branch. You also can't perform the
> tst higher since CC is clobbered.
> > The cbz is nice because the zero extend doesn't use CC of course and
> > so having the value live allows us to optimize The branch.
> 
> Once the cbz has been formed (in combine), where does the optimisation of
> it happen?

There's no real "optimization". Combine combines the cmp 0 and br leaving the AND
behind.  Because of the live range required for the value reload must copy it away from
a caller save.  It chooses to move it to w22 in this case.

and w0, w0, 255
mov w22, w0

this simply gets simplified into and w22, w0, 255 by a zero extending move pattern.
The only optimization here is when the pattern isn't single use, it's simply not moved/folded.

The only options available to combine are

cmp, br = tst + br (in the case of a subreg where it can't tell what the top bits are)
and, cmp, br = ands + br (if value is single use)
cmp, br = cbz (in the case it knows that the top bits are 0).

If we emit a zero extend both operations above are possible, and we emit them depending on
value being single use or not.  If we emit a paradoxical subreg, we never form cbz unless the value
comes from an operation where GIMPLE has maintained C semantics.

But I am probably missing something.. so I'll just make the changes and see where we land ?

> 
> > I don't think branch likeliness matters here as you must keep w22 live
> > in both cases somehow. In the SPECCPU 2017 Benchmark perlbench (which
> > uses a lot of nested switches) this pattern is responsible for an extra 0.3%
> codesize increase which the approach in 2/3 prevents.
> >
> >> (which would run after combine)
> >>
> >> >
> >> > What kind of code do we emit when do an extension just before an
> >> > operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say, then
> >> > it should be safe to do the extension directly into R:
> >> >
> >> >   (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))
> >>
> >> Oops, that should of course be:
> >>
> >>   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
> >>
> >> > which avoids the problem of having two values live at once (the
> >> > zero-extended value and the any-extended value)

I'm assuming R here is the hardreg which has the parameter? In which case
wouldn't the subreg be folded away? I.e you end up with

(set (reg:SI R) (zero_extend:SI (reg:QI R)))

? But that SET isn’t paradoxical, we wouldn't generate it.

We generate for e.g.:

#include <stdint.h>

uint16_t f8 (uint8_t xr, uint8_t xc){
    return (uint8_t)(xr * xc);
}

(insn 9 6 10 2 (set (reg:HI 101)                                                                                                                                                                                                                                                     (zero_extend:HI (reg/v:QI 96 [ xr ]))) "prom.c":4:16 -1                                                                                                                                                                                                                   (nil))                                                                                                                                                                                                                                                                  (insn 10 9 11 2 (set (reg:HI 102)                                                                                                                                                                                                                                                    (zero_extend:HI (reg/v:QI 98 [ xc ]))) "prom.c":4:16 -1                                                                                                                                                                                                                   (nil))                                                                                                                                                                                                                                                                  (insn 11 10 12 2 (set (reg:SI 103)                                                                                                                                                                                                                                                   (mult:SI (subreg:SI (reg:HI 101) 0)                                                                                                                                                                                                                                              (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1                                                                                                                                                                                                                         (nil))

Out of expand. The paradoxical subreg isn't generated at all out of expand
unless it's needed. It does keep the original params around as unused:

(insn 2 7 4 2 (set (reg:QI 97)                                                                                                                                                                                                                                                       (reg:QI 0 x0 [ xr ])) "prom.c":3:37 -1                                                                                                                                                                                                                                    (nil))                                                                                                                                                                                                                                                                  (insn 4 2 3 2 (set (reg:QI 99)                                                                                                                                                                                                                                                       (reg:QI 1 x1 [ xc ])) "prom.c":3:37 -1                                                                                                                                                                                                                                    (nil))  

And the paradoxical subreg is moved into the first operation requiring it:

(insn 11 10 12 2 (set (reg:SI 103)                                                                                                                                                                                                                                                   (mult:SI (subreg:SI (reg:HI 101) 0)                                                                                                                                                                                                                                              (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1                                                                                                                                                                                                                         (nil))

In any case, I'm still not totally sure what the objection here is.  Afaik,
compares need to be treated specially because in GIMPLE they already
are.  Afaik, C integer promotion rules state that in the comparison 0 should
have been promoted to an integer constant of rank int and so the comparison itself
should have been done as integer. i.e. extended.  And most of our patterns
are based around this.

Gimple however doesn't do this, the comparison is done in the rank of the
variable and there is no explicit conversion.  This happened to be fixed up
before during the forced promotion.  So to me the heuristic doesn't seem
to be that crazy..

But I'll respin the patch without the hook and see where things land.

Thanks,
Tamar

> >
> > I'm not sure it does, as the any-extended value must remain live. i.e.
> > above you can't get rid of w22, you can only choose between having it
> > be zero of any extended.  But I am not sure how you would do that after
> expand.
> 
> These per-operation extends are emitted during expand.  The question is
> whether we do them into fresh registers:
> 
>    (set (reg:SI Rtmp) (zero_extend:SI (subreg:QI (reg:SI R))))
> 
> which leaves both R and Rtmp live at the same time, or whether we do them
> in-situ:
> 
>    (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
> 
> Expand should know that the latter is valid, given the DECL_RTL.
> 
> Thanks,
> Richard
  
Richard Sandiford May 16, 2022, 4:48 p.m. UTC | #10
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Monday, May 16, 2022 2:24 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
>> jeffreyalaw@gmail.com
>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
>> the method of argument promotions.
>> 
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Monday, May 16, 2022 1:18 PM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
>> >> jeffreyalaw@gmail.com
>> >> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target
>> >> decide the method of argument promotions.
>> >>
>> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> > Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> >>> -----Original Message-----
>> >> >>> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> >>> Sent: Monday, May 16, 2022 12:36 PM
>> >> >>> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> >>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
>> rguenther@suse.de;
>> >> >>> jeffreyalaw@gmail.com
>> >> >>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
>> >> >>> target decide the method of argument promotions.
>> >> >>>
>> >> >>> Tamar Christina <tamar.christina@arm.com> writes:
>> >> >>> > Hi All,
>> >> >>> >
>> >> >>> > Some targets require function parameters to be promoted to a
>> >> >>> > different type on expand time because the target may not have
>> >> >>> > native instructions to work on such types.  As an example the
>> >> >>> > AArch64 port does not have native instructions working on
>> >> >>> > integer
>> >> >>> > 8- or 16-bit values.  As such it promotes every parameter of
>> >> >>> > these
>> >> types to 32-bits.
>> >> >>>
>> >> >>> This doesn't seem specific to parameters though.  It applies to
>> >> >>> any
>> >> >>> 8- or 16-bit variable.  E.g.:
>> >> >>>
>> >> >>> #include <stdint.h>
>> >> >>> uint8_t foo(uint32_t x, uint32_t y) {
>> >> >>>     uint8_t z = x != 0 ? x : y;
>> >> >>>     return z + 1;
>> >> >>> }
>> >> >>>
>> >> >>> generates:
>> >> >>>
>> >> >>> foo:
>> >> >>>         cmp     w0, 0
>> >> >>>         and     w1, w1, 255
>> >> >>>         and     w0, w0, 255
>> >> >>>         csel    w0, w1, w0, eq
>> >> >>>         add     w0, w0, 1
>> >> >>>         ret
>> >> >>>
>> >> >>> So I think the new behaviour is really a modification of the
>> >> >>> PROMOTE_MODE behaviour rather than the
>> >> PROMOTE_FUNCTION_MODE behaviour.
>> >> >>>
>> >> >>> FWIW, I agree with Richard that it would be better not to add a
>> >> >>> new
>> >> hook.
>> >> >>> I think we're really making PROMOTE_MODE choose between
>> >> SIGN_EXTEND,
>> >> >>> ZERO_EXTEND or SUBREG (what LLVM would call “any
>> >> >>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND
>> >> choice.
>> >> >>
>> >> >> Ah, I hadn't realized this also applied to locals.. ok I can
>> >> >> modify PROMOTE_MODE then, but I also need the actual SSA_NAME
>> and
>> >> >> not just
>> >> the type so will have to pass this along.
>> >> >>
>> >> >> From a practical point of view.. the actual hook however is
>> >> >> implemented by 34 targets, would I need to CC maintainers for each
>> >> >> of them or would global maintainer approval suffice for these
>> >> >> mostly
>> >> mechanical changes?
>> >> >
>> >> > Yeah, single approval should be enough mechanical changes.  It
>> >> > would be good to do the interface change and mechanical target
>> >> > changes as a separate prepatch if possible though.
>> >> >
>> >> > I'm not sure about passing the SSA name to the target though, or
>> >> > the way that the aarch64 hook uses the info.  It looks like a
>> >> > single cold comparison could defeat the optimisation for hot code.
>> >
>> > I'm not sure I follow why the likelihood of the comparison matters in this
>> case..
>> > I'll expand on it below..
>> 
>> I meant the likelihood that the comparison is executed at all, not which
>> outcome is more likely.  E.g. suppose the only comparison occurs on a failure
>> path that eventually calls abort, and that there are other paths (without
>> comparisons of the same value) that would benefit from the any-extend
>> optimisation.  We'd prioritise the cold comparison over optimising the other
>> (hot) code.
>> 
>> I'm just suspicious of heuristics along the lines of “don't do X if there is a
>> single instance of Y”. :-)
>
> I'm probably very dense here sorry.. but if there's
>
> 1 use: the zero extend gets pushed down into the branch which needs it.
>
> i.e. in:
>
> extern void foo ();
> extern void bar ();
>
> uint8_t f (uint8_t a, uint8_t b)
> {
>   if (b) {
>     if (a)
>       foo ();
>     else
>       return f (a, b);
>   } else {
>       bar ();
>   }
>   return b;
> }
>
> The zero extend of a is only done in the true branch for if (b).  Secondly the zero
> extended form is the basis for all other patterns we form, such as ands, which is
> the combination of the zero extend and compare.
>
> 2 uses, both live:
>
> extern void foo ();
> extern void bar (uint8_t);
>
> uint8_t f (uint8_t a, uint8_t b)
> {
>   if (b) {
>     if (a)
>       foo ();
>     else
>       return f (a, b);
>   } else {
>       bar (a);
>   }
>   return b;
> }
>
> In which case the extend of a is done before the if (b) and only the extended values
> used.
>
> Even if you had multiple cold/unused branches, I struggle to see any case where the
> any-extend would be better.  Reload must keep the value live as it's a param. Either you:
>
> 1. have enough registers to keep the value live, in which case, instead of doing a "mov" to
>      To copy the value and then later an AND or TST, it's better to just do an and instead of the mov.
>       You keep the same number of registers live but in the best case you have 1 instruction less, and
>        The worse case you have 0 instructions more.
> 2. You don't have enough registers to keep the value live, in which case the zero extended value is
>      Still better because on the reload it can simply use ldrb ..., cbz as we use the load for an implicit
>      zero extend. Which is still better than ldrb ..., tst, cbnz for an any-extend.
>
> So I am genuinely struggling to see a case where any-extend is better for comparison. And the only
> reason I am singling out comparisons is because in GIMPLE integer constants don't get an explicit
> promotion to int.  Otherwise I wouldn't have needed to as it would have always required an extend
> here.

IIUC, you're talking about cases involving multiple comparisons.  I was
instead talking about the case where there is 1 cold comparison that doesn't
benefit from any-extend and multiple hot operations (not comparisons)
that do benefit.  The patch then seemed to avoid any-extend because of
the cold comparison.

E.g. does the patch avoid the AND in:

#include <stdint.h>
uint8_t foo(uint8_t x, int y) {
    if (y) {
        printf("Foo %d\n", x ? 1 : 2);
        __builtin_abort ();
    }
    return x + 1;
}

?

>> >> > If we do try to make the decision based on uses at expand time, it
>> >> > might be better for the analysis to be in target-independent code,
>> >> > with help from the target to decide where extensions are cheap.  It
>> >> > still feels a bit hacky though.
>> >
>> > I thought about it but can't see most target having this problem. I
>> > did go with an optimistic heuristics. There are of course various ways
>> > to defeat it but looking through the corpus of code I don't see any
>> > but the simple cases in practice. (more below).
>> >
>> >> >
>> >> > What stops us from forming cbz/cbnz when the extension is done
>> >> > close to the comparison (from the comment in 2/3)?  If we can solve
>> >> > that, could we simply do an any-extend all the time, and treat
>> >> > removing redundant extensions as a global availability problem?
>> >>
>> >
>> > In such cases there's no gain from doing the extension at all, e.g.
>> > and w0, w0, 255
>> > cmp w0, 0
>> > b.eq .Lfoo
>> >
>> > will be optimized to
>> >
>> > tst w0, 0xff
>> > b.ne .Lfoo
>> >
>> > already.
>> >
>> > In RTL the problem occurs when you have nested control flow like
>> > nested if and switch statements The example in 2/3 was intended to
>> > show that before what we'd do is
>> >
>> > and w22, w0, 255
>> > .... <code that clobbers cc and caller saves> <switch1> cbz w22,
>> > .Lfoo1 ....
>> > <switch2>
>> > cbz w22, .Lfoo2
>> >
>> > If we have a single comparison we already sink the zero_extend today.
>> >
>> > Now if we instead any-extend w0 we end up with:
>> >
>> > mov w22, w0
>> > .... <code that clobbers cc and caller saves> <switch1> tst w22, 0xff
>> > b.ne .Lfoo1 ....
>> > <switch2>
>> > tst w22, 0xff
>> > b.ne .Lfoo2
>> >
>> > So you get an additional tst before each branch. You also can't perform the
>> tst higher since CC is clobbered.
>> > The cbz is nice because the zero extend doesn't use CC of course and
>> > so having the value live allows us to optimize The branch.
>> 
>> Once the cbz has been formed (in combine), where does the optimisation of
>> it happen?
>
> There's no real "optimization". Combine combines the cmp 0 and br leaving the AND
> behind.  Because of the live range required for the value reload must copy it away from
> a caller save.  It chooses to move it to w22 in this case.
>
> and w0, w0, 255
> mov w22, w0
>
> this simply gets simplified into and w22, w0, 255 by a zero extending move pattern.
> The only optimization here is when the pattern isn't single use, it's simply not moved/folded.
>
> The only options available to combine are
>
> cmp, br = tst + br (in the case of a subreg where it can't tell what the top bits are)
> and, cmp, br = ands + br (if value is single use)
> cmp, br = cbz (in the case it knows that the top bits are 0).
>
> If we emit a zero extend both operations above are possible, and we emit them depending on
> value being single use or not.  If we emit a paradoxical subreg, we never form cbz unless the value
> comes from an operation where GIMPLE has maintained C semantics.
>
> But I am probably missing something.. so I'll just make the changes and see where we land ?

No, I agree/was agreeing with the description of the combine behaviour.
I guess I just misunderstood what you meant by “the cbz is nice because
the zero extend doesn't use CC of course and so having the value live
allows us to optimize the branch”.

>> > I don't think branch likeliness matters here as you must keep w22 live
>> > in both cases somehow. In the SPECCPU 2017 Benchmark perlbench (which
>> > uses a lot of nested switches) this pattern is responsible for an extra 0.3%
>> codesize increase which the approach in 2/3 prevents.
>> >
>> >> (which would run after combine)
>> >>
>> >> >
>> >> > What kind of code do we emit when do an extension just before an
>> >> > operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say, then
>> >> > it should be safe to do the extension directly into R:
>> >> >
>> >> >   (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))
>> >>
>> >> Oops, that should of course be:
>> >>
>> >>   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
>> >>
>> >> > which avoids the problem of having two values live at once (the
>> >> > zero-extended value and the any-extended value)
>
> I'm assuming R here is the hardreg which has the parameter? In which case
> wouldn't the subreg be folded away? I.e you end up with
>
> (set (reg:SI R) (zero_extend:SI (reg:QI R)))

No, R is the pseudo that holds the DECL_RTL (for both VAR_DECLs and
PARM_DECLs).

> ? But that SET isn’t paradoxical, we wouldn't generate it.
>
> We generate for e.g.:
>
> #include <stdint.h>
>
> uint16_t f8 (uint8_t xr, uint8_t xc){
>     return (uint8_t)(xr * xc);
> }
>
> (insn 9 6 10 2 (set (reg:HI 101)                                                                                                                                                                                                                                                     (zero_extend:HI (reg/v:QI 96 [ xr ]))) "prom.c":4:16 -1                                                                                                                                                                                                                   (nil))                                                                                                                                                                                                                                                                  (insn 10 9 11 2 (set (reg:HI 102)                                                                                                                                                                                                                                                    (zero_extend:HI (reg/v:QI 98 [ xc ]))) "prom.c":4:16 -1                                                                                                                                                                                                                   (nil))                                                                                                                                                                                                                                                                  (insn 11 10 12 2 (set (reg:SI 103)                                                                                                                                                                                                                                                   (mult:SI (subreg:SI (reg:HI 101) 0)                                                                                                                                                                                                                                              (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1                                                                                                                                                                                                                         (nil))
>
> Out of expand. The paradoxical subreg isn't generated at all out of expand
> unless it's needed. It does keep the original params around as unused:
>
> (insn 2 7 4 2 (set (reg:QI 97)                                                                                                                                                                                                                                                       (reg:QI 0 x0 [ xr ])) "prom.c":3:37 -1                                                                                                                                                                                                                                    (nil))                                                                                                                                                                                                                                                                  (insn 4 2 3 2 (set (reg:QI 99)                                                                                                                                                                                                                                                       (reg:QI 1 x1 [ xc ])) "prom.c":3:37 -1                                                                                                                                                                                                                                    (nil))  
>
> And the paradoxical subreg is moved into the first operation requiring it:
>
> (insn 11 10 12 2 (set (reg:SI 103)                                                                                                                                                                                                                                                   (mult:SI (subreg:SI (reg:HI 101) 0)                                                                                                                                                                                                                                              (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1                                                                                                                                                                                                                         (nil))

Ah, OK, this isn't what I'd imaagined.  I thought the xr and xc registers
would be SIs and the DECL_RTLs would be QI subregs of those SI regs.
I think that might work better, for the reasons above.  (That is,
whenever we need the register in extended form, we can simply extend
the existing reg rather than create a new one.)

I think that's where confusion was coming from.

> In any case, I'm still not totally sure what the objection here is.  Afaik,
> compares need to be treated specially because in GIMPLE they already
> are.  Afaik, C integer promotion rules state that in the comparison 0 should
> have been promoted to an integer constant of rank int and so the comparison itself
> should have been done as integer. i.e. extended.  And most of our patterns
> are based around this.
>
> Gimple however doesn't do this, the comparison is done in the rank of the
> variable and there is no explicit conversion.  This happened to be fixed up
> before during the forced promotion.  So to me the heuristic doesn't seem
> to be that crazy..

I guess I still don't see the distinction.  C says the same thing about
+, -, >>, etc.  And gimple is free to do those operations in narrow types
if it wants to, and if that doesn't change the semantics.  (Not that gimple
always does them in narrow types.  But it is valid gimple.)

The optimisation problem doesn't come from C or gimple semantics,
but from the fact that AArch64 (unlike x86 say) doesn't have byte add,
byte compare, byte right shift, etc.  We therefore need to promote
8-bit and 16-bit operations to 32 bits first.

For add, subtract, multiply, left shift, and logic ops, we can avoid
defining the upper bits of the inputs when we do these extensions,
because the upper bits of the inputs don't affect the useful bits of
the result.  But for comparisons, right shifts, and divides, etc.,
we do need to extend.

AIUI, the comparison case is special because (for AArch64-specific
reasons), we prefer extend + cbz to tst + branch, especially when the
extend can be shared.

Thanks,
Richard
  
Tamar Christina May 17, 2022, 7:55 a.m. UTC | #11
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, May 16, 2022 5:48 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> jeffreyalaw@gmail.com
> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target decide
> the method of argument promotions.
> 
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Monday, May 16, 2022 2:24 PM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> >> jeffreyalaw@gmail.com
> >> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the target
> >> decide the method of argument promotions.
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Monday, May 16, 2022 1:18 PM
> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de;
> >> >> jeffreyalaw@gmail.com
> >> >> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
> >> >> target decide the method of argument promotions.
> >> >>
> >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> > Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> >>> -----Original Message-----
> >> >> >>> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> >>> Sent: Monday, May 16, 2022 12:36 PM
> >> >> >>> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> >>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> >> rguenther@suse.de;
> >> >> >>> jeffreyalaw@gmail.com
> >> >> >>> Subject: Re: [PATCH 1/3]middle-end: Add the ability to let the
> >> >> >>> target decide the method of argument promotions.
> >> >> >>>
> >> >> >>> Tamar Christina <tamar.christina@arm.com> writes:
> >> >> >>> > Hi All,
> >> >> >>> >
> >> >> >>> > Some targets require function parameters to be promoted to a
> >> >> >>> > different type on expand time because the target may not
> >> >> >>> > have native instructions to work on such types.  As an
> >> >> >>> > example the
> >> >> >>> > AArch64 port does not have native instructions working on
> >> >> >>> > integer
> >> >> >>> > 8- or 16-bit values.  As such it promotes every parameter of
> >> >> >>> > these
> >> >> types to 32-bits.
> >> >> >>>
> >> >> >>> This doesn't seem specific to parameters though.  It applies
> >> >> >>> to any
> >> >> >>> 8- or 16-bit variable.  E.g.:
> >> >> >>>
> >> >> >>> #include <stdint.h>
> >> >> >>> uint8_t foo(uint32_t x, uint32_t y) {
> >> >> >>>     uint8_t z = x != 0 ? x : y;
> >> >> >>>     return z + 1;
> >> >> >>> }
> >> >> >>>
> >> >> >>> generates:
> >> >> >>>
> >> >> >>> foo:
> >> >> >>>         cmp     w0, 0
> >> >> >>>         and     w1, w1, 255
> >> >> >>>         and     w0, w0, 255
> >> >> >>>         csel    w0, w1, w0, eq
> >> >> >>>         add     w0, w0, 1
> >> >> >>>         ret
> >> >> >>>
> >> >> >>> So I think the new behaviour is really a modification of the
> >> >> >>> PROMOTE_MODE behaviour rather than the
> >> >> PROMOTE_FUNCTION_MODE behaviour.
> >> >> >>>
> >> >> >>> FWIW, I agree with Richard that it would be better not to add
> >> >> >>> a new
> >> >> hook.
> >> >> >>> I think we're really making PROMOTE_MODE choose between
> >> >> SIGN_EXTEND,
> >> >> >>> ZERO_EXTEND or SUBREG (what LLVM would call “any
> >> >> >>> extend”) rather than the current SIGN_EXTEND vs. ZERO_EXTEND
> >> >> choice.
> >> >> >>
> >> >> >> Ah, I hadn't realized this also applied to locals.. ok I can
> >> >> >> modify PROMOTE_MODE then, but I also need the actual
> SSA_NAME
> >> and
> >> >> >> not just
> >> >> the type so will have to pass this along.
> >> >> >>
> >> >> >> From a practical point of view.. the actual hook however is
> >> >> >> implemented by 34 targets, would I need to CC maintainers for
> >> >> >> each of them or would global maintainer approval suffice for
> >> >> >> these mostly
> >> >> mechanical changes?
> >> >> >
> >> >> > Yeah, single approval should be enough mechanical changes.  It
> >> >> > would be good to do the interface change and mechanical target
> >> >> > changes as a separate prepatch if possible though.
> >> >> >
> >> >> > I'm not sure about passing the SSA name to the target though, or
> >> >> > the way that the aarch64 hook uses the info.  It looks like a
> >> >> > single cold comparison could defeat the optimisation for hot code.
> >> >
> >> > I'm not sure I follow why the likelihood of the comparison matters
> >> > in this
> >> case..
> >> > I'll expand on it below..
> >>
> >> I meant the likelihood that the comparison is executed at all, not
> >> which outcome is more likely.  E.g. suppose the only comparison
> >> occurs on a failure path that eventually calls abort, and that there
> >> are other paths (without comparisons of the same value) that would
> >> benefit from the any-extend optimisation.  We'd prioritise the cold
> >> comparison over optimising the other
> >> (hot) code.
> >>
> >> I'm just suspicious of heuristics along the lines of “don't do X if
> >> there is a single instance of Y”. :-)
> >
> > I'm probably very dense here sorry.. but if there's
> >
> > 1 use: the zero extend gets pushed down into the branch which needs it.
> >
> > i.e. in:
> >
> > extern void foo ();
> > extern void bar ();
> >
> > uint8_t f (uint8_t a, uint8_t b)
> > {
> >   if (b) {
> >     if (a)
> >       foo ();
> >     else
> >       return f (a, b);
> >   } else {
> >       bar ();
> >   }
> >   return b;
> > }
> >
> > The zero extend of a is only done in the true branch for if (b).
> > Secondly the zero extended form is the basis for all other patterns we
> > form, such as ands, which is the combination of the zero extend and
> compare.
> >
> > 2 uses, both live:
> >
> > extern void foo ();
> > extern void bar (uint8_t);
> >
> > uint8_t f (uint8_t a, uint8_t b)
> > {
> >   if (b) {
> >     if (a)
> >       foo ();
> >     else
> >       return f (a, b);
> >   } else {
> >       bar (a);
> >   }
> >   return b;
> > }
> >
> > In which case the extend of a is done before the if (b) and only the
> > extended values used.
> >
> > Even if you had multiple cold/unused branches, I struggle to see any
> > case where the any-extend would be better.  Reload must keep the value
> live as it's a param. Either you:
> >
> > 1. have enough registers to keep the value live, in which case, instead of
> doing a "mov" to
> >      To copy the value and then later an AND or TST, it's better to just do an
> and instead of the mov.
> >       You keep the same number of registers live but in the best case you
> have 1 instruction less, and
> >        The worse case you have 0 instructions more.
> > 2. You don't have enough registers to keep the value live, in which case the
> zero extended value is
> >      Still better because on the reload it can simply use ldrb ..., cbz as we use
> the load for an implicit
> >      zero extend. Which is still better than ldrb ..., tst, cbnz for an any-extend.
> >
> > So I am genuinely struggling to see a case where any-extend is better
> > for comparison. And the only reason I am singling out comparisons is
> > because in GIMPLE integer constants don't get an explicit promotion to
> > int.  Otherwise I wouldn't have needed to as it would have always required
> an extend here.
> 
> IIUC, you're talking about cases involving multiple comparisons.  I was instead
> talking about the case where there is 1 cold comparison that doesn't benefit
> from any-extend and multiple hot operations (not comparisons) that do
> benefit.  The patch then seemed to avoid any-extend because of the cold
> comparison.
> 
> E.g. does the patch avoid the AND in:
> 
> #include <stdint.h>
> uint8_t foo(uint8_t x, int y) {
>     if (y) {
>         printf("Foo %d\n", x ? 1 : 2);
>         __builtin_abort ();
>     }
>     return x + 1;
> }
> 
> ?

Morning,

It does actually, it generates:

foo:
        cbnz    w1, .L9
        add     w0, w0, 1
        ret
.L9:
        tst     w0, 255
        stp     x29, x30, [sp, -16]!
        cset    w1, ne
        add     w1, w1, 1
        mov     x29, sp
        adrp    x0, .LC0
        add     x0, x0, :lo12:.LC0
        bl      printf
        bl      abort
        .size   foo, .-foo

Now I will admit that this isn't because of a grand master design, but
purely because the patch works around the cases seen in SPEC.  In those
cases the comparisons in question were floated out of the if statement.

The heuristic in patch 2/3 allows this because it only looks for compares in
gimple assigns whereas in this case the compare is in the Gimple cond
directly.

> 
> >> >> > If we do try to make the decision based on uses at expand time,
> >> >> > it might be better for the analysis to be in target-independent
> >> >> > code, with help from the target to decide where extensions are
> >> >> > cheap.  It still feels a bit hacky though.
> >> >
> >> > I thought about it but can't see most target having this problem. I
> >> > did go with an optimistic heuristics. There are of course various
> >> > ways to defeat it but looking through the corpus of code I don't
> >> > see any but the simple cases in practice. (more below).
> >> >
> >> >> >
> >> >> > What stops us from forming cbz/cbnz when the extension is done
> >> >> > close to the comparison (from the comment in 2/3)?  If we can
> >> >> > solve that, could we simply do an any-extend all the time, and
> >> >> > treat removing redundant extensions as a global availability problem?
> >> >>
> >> >
> >> > In such cases there's no gain from doing the extension at all, e.g.
> >> > and w0, w0, 255
> >> > cmp w0, 0
> >> > b.eq .Lfoo
> >> >
> >> > will be optimized to
> >> >
> >> > tst w0, 0xff
> >> > b.ne .Lfoo
> >> >
> >> > already.
> >> >
> >> > In RTL the problem occurs when you have nested control flow like
> >> > nested if and switch statements The example in 2/3 was intended to
> >> > show that before what we'd do is
> >> >
> >> > and w22, w0, 255
> >> > .... <code that clobbers cc and caller saves> <switch1> cbz w22,
> >> > .Lfoo1 ....
> >> > <switch2>
> >> > cbz w22, .Lfoo2
> >> >
> >> > If we have a single comparison we already sink the zero_extend today.
> >> >
> >> > Now if we instead any-extend w0 we end up with:
> >> >
> >> > mov w22, w0
> >> > .... <code that clobbers cc and caller saves> <switch1> tst w22,
> >> > 0xff b.ne .Lfoo1 ....
> >> > <switch2>
> >> > tst w22, 0xff
> >> > b.ne .Lfoo2
> >> >
> >> > So you get an additional tst before each branch. You also can't
> >> > perform the
> >> tst higher since CC is clobbered.
> >> > The cbz is nice because the zero extend doesn't use CC of course
> >> > and so having the value live allows us to optimize The branch.
> >>
> >> Once the cbz has been formed (in combine), where does the
> >> optimisation of it happen?
> >
> > There's no real "optimization". Combine combines the cmp 0 and br
> > leaving the AND behind.  Because of the live range required for the
> > value reload must copy it away from a caller save.  It chooses to move it to
> w22 in this case.
> >
> > and w0, w0, 255
> > mov w22, w0
> >
> > this simply gets simplified into and w22, w0, 255 by a zero extending move
> pattern.
> > The only optimization here is when the pattern isn't single use, it's simply
> not moved/folded.
> >
> > The only options available to combine are
> >
> > cmp, br = tst + br (in the case of a subreg where it can't tell what
> > the top bits are) and, cmp, br = ands + br (if value is single use)
> > cmp, br = cbz (in the case it knows that the top bits are 0).
> >
> > If we emit a zero extend both operations above are possible, and we
> > emit them depending on value being single use or not.  If we emit a
> > paradoxical subreg, we never form cbz unless the value comes from an
> operation where GIMPLE has maintained C semantics.
> >
> > But I am probably missing something.. so I'll just make the changes
> > and see where we land ?
> 
> No, I agree/was agreeing with the description of the combine behaviour.
> I guess I just misunderstood what you meant by “the cbz is nice because the
> zero extend doesn't use CC of course and so having the value live allows us
> to optimize the branch”.
> 
> >> > I don't think branch likeliness matters here as you must keep w22
> >> > live in both cases somehow. In the SPECCPU 2017 Benchmark perlbench
> >> > (which uses a lot of nested switches) this pattern is responsible
> >> > for an extra 0.3%
> >> codesize increase which the approach in 2/3 prevents.
> >> >
> >> >> (which would run after combine)
> >> >>
> >> >> >
> >> >> > What kind of code do we emit when do an extension just before an
> >> >> > operation?  If the DECL_RTL is (subreg:QI (reg:SI R) 0), say,
> >> >> > then it should be safe to do the extension directly into R:
> >> >> >
> >> >> >   (set (reg:SI X) (zero_extend:SI (subreg:QI (reg:SI X))))
> >> >>
> >> >> Oops, that should of course be:
> >> >>
> >> >>   (set (reg:SI R) (zero_extend:SI (subreg:QI (reg:SI R))))
> >> >>
> >> >> > which avoids the problem of having two values live at once (the
> >> >> > zero-extended value and the any-extended value)
> >
> > I'm assuming R here is the hardreg which has the parameter? In which
> > case wouldn't the subreg be folded away? I.e you end up with
> >
> > (set (reg:SI R) (zero_extend:SI (reg:QI R)))
> 
> No, R is the pseudo that holds the DECL_RTL (for both VAR_DECLs and
> PARM_DECLs).
> 
> > ? But that SET isn’t paradoxical, we wouldn't generate it.
> >
> > We generate for e.g.:
> >
> > #include <stdint.h>
> >
> > uint16_t f8 (uint8_t xr, uint8_t xc){
> >     return (uint8_t)(xr * xc);
> > }
> >
> > (insn 9 6 10 2 (set (reg:HI 101)
> (zero_extend:HI (reg/v:QI 96 [ xr ]))) "prom.c":4:16 -1
> (nil))
> (insn 10 9 11 2 (set (reg:HI 102)
> (zero_extend:HI (reg/v:QI 98 [ xc ]))) "prom.c":4:16 -1
> (nil))
> (insn 11 10 12 2 (set (reg:SI 103)
> (mult:SI (subreg:SI (reg:HI 101) 0)
> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
> (nil))
> >
> > Out of expand. The paradoxical subreg isn't generated at all out of
> > expand unless it's needed. It does keep the original params around as
> unused:
> >
> > (insn 2 7 4 2 (set (reg:QI 97)
> (reg:QI 0 x0 [ xr ])) "prom.c":3:37 -1
> (nil))
> (insn 4 2 3 2 (set (reg:QI 99)
> (reg:QI 1 x1 [ xc ])) "prom.c":3:37 -1
> (nil))
> >
> > And the paradoxical subreg is moved into the first operation requiring it:
> >
> > (insn 11 10 12 2 (set (reg:SI 103)
> (mult:SI (subreg:SI (reg:HI 101) 0)
> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
> (nil))
> 
> Ah, OK, this isn't what I'd imaagined.  I thought the xr and xc registers would
> be SIs and the DECL_RTLs would be QI subregs of those SI regs.
> I think that might work better, for the reasons above.  (That is, whenever we
> need the register in extended form, we can simply extend the existing reg
> rather than create a new one.)

Ah, I see, no, I explicitly avoid this. When doing the type promotions I tell it that
size of the copies of xr and xc is still the original size, e.g. QI (i.e. I don't change 97 and 99).
This is different from what we do with extends where 97 and 99 *would* be changed.

The reason is that if I make this SI the compiler thinks it knows the value of all the bits
in the register which led to various miscompares as it thinks it can use the SI value directly.

This happens because again the xr and xc are hard regs. So having 97 be

(set (reg:SI 97) (subreg:SI (reg:QI 0 x0 [ xr ]) 0))

gets folded to an incorrect

(set (reg:SI 97) (reg:SI 0 x0 [ xr ]))

And now 97 is free to be used without any zero extension, as 97 on it's own is an invalid RTX.

So I have to keep the intermediate copy QI mode, after which the RTX optimizations
being done during expand generates the forms above.

> 
> I think that's where confusion was coming from.
> 
> > In any case, I'm still not totally sure what the objection here is.
> > Afaik, compares need to be treated specially because in GIMPLE they
> > already are.  Afaik, C integer promotion rules state that in the
> > comparison 0 should have been promoted to an integer constant of rank
> > int and so the comparison itself should have been done as integer.
> > i.e. extended.  And most of our patterns are based around this.
> >
> > Gimple however doesn't do this, the comparison is done in the rank of
> > the variable and there is no explicit conversion.  This happened to be
> > fixed up before during the forced promotion.  So to me the heuristic
> > doesn't seem to be that crazy..
> 
> I guess I still don't see the distinction.  C says the same thing about
> +, -, >>, etc.  And gimple is free to do those operations in narrow
> +types
> if it wants to, and if that doesn't change the semantics.  (Not that gimple
> always does them in narrow types.  But it is valid gimple.)
> 
> The optimisation problem doesn't come from C or gimple semantics, but
> from the fact that AArch64 (unlike x86 say) doesn't have byte add, byte
> compare, byte right shift, etc.  We therefore need to promote 8-bit and 16-
> bit operations to 32 bits first.
> 
> For add, subtract, multiply, left shift, and logic ops, we can avoid defining the
> upper bits of the inputs when we do these extensions, because the upper
> bits of the inputs don't affect the useful bits of the result.  But for
> comparisons, right shifts, and divides, etc., we do need to extend.
> 
> AIUI, the comparison case is special because (for AArch64-specific reasons),
> we prefer extend + cbz to tst + branch, especially when the extend can be
> shared.

Agreed, so I think we agree on this ? I guess the disagreement is where this
should be done. I'll admit that the testcase above works by coincidence. But if
we don't do it during expand time, the only place I can think of to introduce
the zero extends is to add various patterns to do an early split of any-extend
+ cmp.  But wouldn't that be more fragile? At least at expand time all comparisons
are tcc_comparisons. 

Kind regards,
Tamar
> 
> Thanks,
> Richard
  
Richard Sandiford May 17, 2022, 9:03 a.m. UTC | #12
Tamar Christina <Tamar.Christina@arm.com> writes:
[…]
>> E.g. does the patch avoid the AND in:
>> 
>> #include <stdint.h>
>> uint8_t foo(uint8_t x, int y) {
>>     if (y) {
>>         printf("Foo %d\n", x ? 1 : 2);
>>         __builtin_abort ();
>>     }
>>     return x + 1;
>> }
>> 
>> ?
>
> Morning,
>
> It does actually, it generates:
>
> foo:
>         cbnz    w1, .L9
>         add     w0, w0, 1
>         ret
> .L9:
>         tst     w0, 255
>         stp     x29, x30, [sp, -16]!
>         cset    w1, ne
>         add     w1, w1, 1
>         mov     x29, sp
>         adrp    x0, .LC0
>         add     x0, x0, :lo12:.LC0
>         bl      printf
>         bl      abort
>         .size   foo, .-foo

Ah, nice.

> Now I will admit that this isn't because of a grand master design, but
> purely because the patch works around the cases seen in SPEC.  In those
> cases the comparisons in question were floated out of the if statement.
>
> The heuristic in patch 2/3 allows this because it only looks for compares in
> gimple assigns whereas in this case the compare is in the Gimple cond
> directly.

OK.

[…]
>> > We generate for e.g.:
>> >
>> > #include <stdint.h>
>> >
>> > uint16_t f8 (uint8_t xr, uint8_t xc){
>> >     return (uint8_t)(xr * xc);
>> > }
>> >
>> > (insn 9 6 10 2 (set (reg:HI 101)
>> (zero_extend:HI (reg/v:QI 96 [ xr ]))) "prom.c":4:16 -1
>> (nil))
>> (insn 10 9 11 2 (set (reg:HI 102)
>> (zero_extend:HI (reg/v:QI 98 [ xc ]))) "prom.c":4:16 -1
>> (nil))
>> (insn 11 10 12 2 (set (reg:SI 103)
>> (mult:SI (subreg:SI (reg:HI 101) 0)
>> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
>> (nil))
>> >
>> > Out of expand. The paradoxical subreg isn't generated at all out of
>> > expand unless it's needed. It does keep the original params around as
>> unused:
>> >
>> > (insn 2 7 4 2 (set (reg:QI 97)
>> (reg:QI 0 x0 [ xr ])) "prom.c":3:37 -1
>> (nil))
>> (insn 4 2 3 2 (set (reg:QI 99)
>> (reg:QI 1 x1 [ xc ])) "prom.c":3:37 -1
>> (nil))
>> >
>> > And the paradoxical subreg is moved into the first operation requiring it:
>> >
>> > (insn 11 10 12 2 (set (reg:SI 103)
>> (mult:SI (subreg:SI (reg:HI 101) 0)
>> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
>> (nil))
>> 
>> Ah, OK, this isn't what I'd imaagined.  I thought the xr and xc registers would
>> be SIs and the DECL_RTLs would be QI subregs of those SI regs.
>> I think that might work better, for the reasons above.  (That is, whenever we
>> need the register in extended form, we can simply extend the existing reg
>> rather than create a new one.)
>
> Ah, I see, no, I explicitly avoid this. When doing the type promotions I tell it that
> size of the copies of xr and xc is still the original size, e.g. QI (i.e. I don't change 97 and 99).
> This is different from what we do with extends where 97 and 99 *would* be changed.
>
> The reason is that if I make this SI the compiler thinks it knows the value of all the bits
> in the register which led to various miscompares as it thinks it can use the SI value directly.
>
> This happens because again the xr and xc are hard regs. So having 97 be
>
> (set (reg:SI 97) (subreg:SI (reg:QI 0 x0 [ xr ]) 0))
>
> gets folded to an incorrect
>
> (set (reg:SI 97) (reg:SI 0 x0 [ xr ]))

This part I would expect (and hope for :-)).

> And now 97 is free to be used without any zero extension, as 97 on it's own is an invalid RTX.

But the way I'd imagined it working, expand would need to insert an
extension before any operation that needs the upper 24 bits to be
defined (e.g. comparisons, right shifts).  If the DECL_RTL is
(subreg:QI (reg:SI x) 0) then the upper bits are not defined,
since SUBREG_PROMOTED_VAR_P would/should be false for the subreg.

E.g. for:

  int8_t foo(int8_t x) { return x >> 1; }

x would have a DECL_RTL of (subreg:QI (reg:SI x) 0), the parameter
assignment would be expanded as:

  (set (reg:SI x) (reg:SI x0))

the shift would be expanded as:

  (set (reg:SI x) (zero_extend:SI (subreg:QI (reg:SI x) 0)))
  (set (reg:SI x) (ashiftrt:SI (reg:SI x) (const_int 1)))

and the return assignment would be expanded as:

  (set (reg:SI x0) (reg:SI x))

x + 1 would instead be expanded to just:

  (set (reg:SI x) (plus:SI (reg:SI x) (const_int 1)))

(without an extension).

I realised later though that, although reusing the DECL_RTL reg for
the extension has the nice RA property of avoiding multiple live values,
it would make it harder to combine the extension into the operation
if the variable is still live afterwards.  So I guess we lose something
both ways.

Maybe we need a different approach, not based on changing PROMOTE_MODE.

I wonder how easy it would be to do the promotion in gimple,
then reuse backprop to determine when a sign/zero-extension
(i.e. a normal gimple cast) can be converted into an “any extend”
(probably represented as a new ifn).

> So I have to keep the intermediate copy QI mode, after which the RTX optimizations
> being done during expand generates the forms above.
>
>> 
>> I think that's where confusion was coming from.
>> 
>> > In any case, I'm still not totally sure what the objection here is.
>> > Afaik, compares need to be treated specially because in GIMPLE they
>> > already are.  Afaik, C integer promotion rules state that in the
>> > comparison 0 should have been promoted to an integer constant of rank
>> > int and so the comparison itself should have been done as integer.
>> > i.e. extended.  And most of our patterns are based around this.
>> >
>> > Gimple however doesn't do this, the comparison is done in the rank of
>> > the variable and there is no explicit conversion.  This happened to be
>> > fixed up before during the forced promotion.  So to me the heuristic
>> > doesn't seem to be that crazy..
>> 
>> I guess I still don't see the distinction.  C says the same thing about
>> +, -, >>, etc.  And gimple is free to do those operations in narrow
>> +types
>> if it wants to, and if that doesn't change the semantics.  (Not that gimple
>> always does them in narrow types.  But it is valid gimple.)
>> 
>> The optimisation problem doesn't come from C or gimple semantics, but
>> from the fact that AArch64 (unlike x86 say) doesn't have byte add, byte
>> compare, byte right shift, etc.  We therefore need to promote 8-bit and 16-
>> bit operations to 32 bits first.
>> 
>> For add, subtract, multiply, left shift, and logic ops, we can avoid defining the
>> upper bits of the inputs when we do these extensions, because the upper
>> bits of the inputs don't affect the useful bits of the result.  But for
>> comparisons, right shifts, and divides, etc., we do need to extend.
>> 
>> AIUI, the comparison case is special because (for AArch64-specific reasons),
>> we prefer extend + cbz to tst + branch, especially when the extend can be
>> shared.
>
> Agreed, so I think we agree on this ? I guess the disagreement is where this
> should be done. I'll admit that the testcase above works by coincidence. But if
> we don't do it during expand time, the only place I can think of to introduce
> the zero extends is to add various patterns to do an early split of any-extend
> + cmp.  But wouldn't that be more fragile? At least at expand time all comparisons
> are tcc_comparisons. 

I guess one question is: is the patch without the comparison handling
just exacerbating an existing problem?  Do we already make similar bad
choices between extend+cbz and tst+branch in cases where the variables
aren't short, but where intermediate calculations involve &s?  If so,
it might be something worth tackling in its own right, regardless of
where the &s or extensions come from.

But yeah, I'm not sure how easily that would fit into existing passes.

Thanks,
Richard
  
Tamar Christina May 17, 2022, 5:45 p.m. UTC | #13
[…]
> >> > We generate for e.g.:
> >> >
> >> > #include <stdint.h>
> >> >
> >> > uint16_t f8 (uint8_t xr, uint8_t xc){
> >> >     return (uint8_t)(xr * xc);
> >> > }
> >> >
> >> > (insn 9 6 10 2 (set (reg:HI 101)
> >> (zero_extend:HI (reg/v:QI 96 [ xr ]))) "prom.c":4:16 -1
> >> (nil))
> >> (insn 10 9 11 2 (set (reg:HI 102)
> >> (zero_extend:HI (reg/v:QI 98 [ xc ]))) "prom.c":4:16 -1
> >> (nil))
> >> (insn 11 10 12 2 (set (reg:SI 103)
> >> (mult:SI (subreg:SI (reg:HI 101) 0)
> >> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
> >> (nil))
> >> >
> >> > Out of expand. The paradoxical subreg isn't generated at all out of
> >> > expand unless it's needed. It does keep the original params around
> >> > as
> >> unused:
> >> >
> >> > (insn 2 7 4 2 (set (reg:QI 97)
> >> (reg:QI 0 x0 [ xr ])) "prom.c":3:37 -1
> >> (nil))
> >> (insn 4 2 3 2 (set (reg:QI 99)
> >> (reg:QI 1 x1 [ xc ])) "prom.c":3:37 -1
> >> (nil))
> >> >
> >> > And the paradoxical subreg is moved into the first operation requiring it:
> >> >
> >> > (insn 11 10 12 2 (set (reg:SI 103)
> >> (mult:SI (subreg:SI (reg:HI 101) 0)
> >> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
> >> (nil))
> >>
> >> Ah, OK, this isn't what I'd imaagined.  I thought the xr and xc
> >> registers would be SIs and the DECL_RTLs would be QI subregs of those SI
> regs.
> >> I think that might work better, for the reasons above.  (That is,
> >> whenever we need the register in extended form, we can simply extend
> >> the existing reg rather than create a new one.)
> >
> > Ah, I see, no, I explicitly avoid this. When doing the type promotions
> > I tell it that size of the copies of xr and xc is still the original size, e.g. QI (i.e. I
> don't change 97 and 99).
> > This is different from what we do with extends where 97 and 99 *would*
> be changed.
> >
> > The reason is that if I make this SI the compiler thinks it knows the
> > value of all the bits in the register which led to various miscompares as it
> thinks it can use the SI value directly.
> >
> > This happens because again the xr and xc are hard regs. So having 97
> > be
> >
> > (set (reg:SI 97) (subreg:SI (reg:QI 0 x0 [ xr ]) 0))
> >
> > gets folded to an incorrect
> >
> > (set (reg:SI 97) (reg:SI 0 x0 [ xr ]))
> 
> This part I would expect (and hope for :-)).
> 
> > And now 97 is free to be used without any zero extension, as 97 on it's own
> is an invalid RTX.
> 
> But the way I'd imagined it working, expand would need to insert an
> extension before any operation that needs the upper 24 bits to be defined
> (e.g. comparisons, right shifts).  If the DECL_RTL is (subreg:QI (reg:SI x) 0)
> then the upper bits are not defined, since SUBREG_PROMOTED_VAR_P
> would/should be false for the subreg.

Ah I see, my fear here was that if we have a pattern which splits out the zero-extend for whatever reason
that if it gets folded it would be invalid.  But I think I understand what you meant.  In your case
we'd never again use the hardreg, but that everything goes through 97. Got it.

> 
> E.g. for:
> 
>   int8_t foo(int8_t x) { return x >> 1; }
> 
> x would have a DECL_RTL of (subreg:QI (reg:SI x) 0), the parameter
> assignment would be expanded as:
> 
>   (set (reg:SI x) (reg:SI x0))
> 
> the shift would be expanded as:
> 
>   (set (reg:SI x) (zero_extend:SI (subreg:QI (reg:SI x) 0)))
>   (set (reg:SI x) (ashiftrt:SI (reg:SI x) (const_int 1)))
> 
> and the return assignment would be expanded as:
> 
>   (set (reg:SI x0) (reg:SI x))
> 
> x + 1 would instead be expanded to just:
> 
>   (set (reg:SI x) (plus:SI (reg:SI x) (const_int 1)))
> 
> (without an extension).
> 
> I realised later though that, although reusing the DECL_RTL reg for the
> extension has the nice RA property of avoiding multiple live values, it would
> make it harder to combine the extension into the operation if the variable is
> still live afterwards.  So I guess we lose something both ways.
> 
> Maybe we need a different approach, not based on changing
> PROMOTE_MODE.
> 
> I wonder how easy it would be to do the promotion in gimple, then reuse
> backprop to determine when a sign/zero-extension (i.e. a normal gimple cast)
> can be converted into an “any extend”
> (probably represented as a new ifn).

Do you mean without changing the hook implementation but keeping the current promotion?

I guess the problem here is that it's the inverse cases that's the problem isn't it? It's not that in
gimple there are unneeded extends, it's that some operations require an any-extend no?

like in gimple ~a where a is an 8-bit quantity requires an any-extend, but no cast would be there
in gimple.

So for instance

#include <stdint.h>

uint8_t f (uint8_t a)
{
    return ~a;
}

Is just simply:

f (uint8_t a)
{
  uint8_t _2;

  <bb 2> [local count: 1073741824]:
  _2 = ~a_1(D);
  return _2;

}

In gimple. I'm also slightly worried about interfering with phi opts. Backprop runs
before ifcombine and pihops for instance and there are various phi opts like ifcombine_ifandif
that rely on the BB containing only the phi node.  Adding an any-extend in between would break this.

I also wonder if the IFN won't interfere with range analysis of expressions. Unless we manage to strategically
Insert the IFNs on entire expressions and not in the intermediate SSA components.

> 
> > So I have to keep the intermediate copy QI mode, after which the RTX
> > optimizations being done during expand generates the forms above.
> >
> >>
> >> I think that's where confusion was coming from.
> >>
> >> > In any case, I'm still not totally sure what the objection here is.
> >> > Afaik, compares need to be treated specially because in GIMPLE they
> >> > already are.  Afaik, C integer promotion rules state that in the
> >> > comparison 0 should have been promoted to an integer constant of
> >> > rank int and so the comparison itself should have been done as integer.
> >> > i.e. extended.  And most of our patterns are based around this.
> >> >
> >> > Gimple however doesn't do this, the comparison is done in the rank
> >> > of the variable and there is no explicit conversion.  This happened
> >> > to be fixed up before during the forced promotion.  So to me the
> >> > heuristic doesn't seem to be that crazy..
> >>
> >> I guess I still don't see the distinction.  C says the same thing
> >> about
> >> +, -, >>, etc.  And gimple is free to do those operations in narrow
> >> +types
> >> if it wants to, and if that doesn't change the semantics.  (Not that
> >> gimple always does them in narrow types.  But it is valid gimple.)
> >>
> >> The optimisation problem doesn't come from C or gimple semantics, but
> >> from the fact that AArch64 (unlike x86 say) doesn't have byte add,
> >> byte compare, byte right shift, etc.  We therefore need to promote
> >> 8-bit and 16- bit operations to 32 bits first.
> >>
> >> For add, subtract, multiply, left shift, and logic ops, we can avoid
> >> defining the upper bits of the inputs when we do these extensions,
> >> because the upper bits of the inputs don't affect the useful bits of
> >> the result.  But for comparisons, right shifts, and divides, etc., we do need
> to extend.
> >>
> >> AIUI, the comparison case is special because (for AArch64-specific
> >> reasons), we prefer extend + cbz to tst + branch, especially when the
> >> extend can be shared.
> >
> > Agreed, so I think we agree on this ? I guess the disagreement is
> > where this should be done. I'll admit that the testcase above works by
> > coincidence. But if we don't do it during expand time, the only place
> > I can think of to introduce the zero extends is to add various
> > patterns to do an early split of any-extend
> > + cmp.  But wouldn't that be more fragile? At least at expand time all
> > + comparisons
> > are tcc_comparisons.
> 
> I guess one question is: is the patch without the comparison handling just
> exacerbating an existing problem?  Do we already make similar bad choices
> between extend+cbz and tst+branch in cases where the variables aren't
> short, but where intermediate calculations involve &s?  If so, it might be
> something worth tackling in its own right, regardless of where the &s or
> extensions come from.

I've tried various cases but they all look correct.  This is mostly because
we already have actual instructions for these. We generate a chain of
and + tst in most != 0 cases, in the rest we do the least amount of &s
and then a normal cmp. 

For the case in spec, we correctly only get an additional mov to copy
the value.  So it looks like it's only an issue with short cases.

Cheers,
Tamar

> 
> But yeah, I'm not sure how easily that would fit into existing passes.
> 
> Thanks,
> Richard
  
Richard Sandiford May 18, 2022, 7:49 a.m. UTC | #14
Tamar Christina <Tamar.Christina@arm.com> writes:
>  […]
>> >> > We generate for e.g.:
>> >> >
>> >> > #include <stdint.h>
>> >> >
>> >> > uint16_t f8 (uint8_t xr, uint8_t xc){
>> >> >     return (uint8_t)(xr * xc);
>> >> > }
>> >> >
>> >> > (insn 9 6 10 2 (set (reg:HI 101)
>> >> (zero_extend:HI (reg/v:QI 96 [ xr ]))) "prom.c":4:16 -1
>> >> (nil))
>> >> (insn 10 9 11 2 (set (reg:HI 102)
>> >> (zero_extend:HI (reg/v:QI 98 [ xc ]))) "prom.c":4:16 -1
>> >> (nil))
>> >> (insn 11 10 12 2 (set (reg:SI 103)
>> >> (mult:SI (subreg:SI (reg:HI 101) 0)
>> >> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
>> >> (nil))
>> >> >
>> >> > Out of expand. The paradoxical subreg isn't generated at all out of
>> >> > expand unless it's needed. It does keep the original params around
>> >> > as
>> >> unused:
>> >> >
>> >> > (insn 2 7 4 2 (set (reg:QI 97)
>> >> (reg:QI 0 x0 [ xr ])) "prom.c":3:37 -1
>> >> (nil))
>> >> (insn 4 2 3 2 (set (reg:QI 99)
>> >> (reg:QI 1 x1 [ xc ])) "prom.c":3:37 -1
>> >> (nil))
>> >> >
>> >> > And the paradoxical subreg is moved into the first operation requiring it:
>> >> >
>> >> > (insn 11 10 12 2 (set (reg:SI 103)
>> >> (mult:SI (subreg:SI (reg:HI 101) 0)
>> >> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
>> >> (nil))
>> >>
>> >> Ah, OK, this isn't what I'd imaagined.  I thought the xr and xc
>> >> registers would be SIs and the DECL_RTLs would be QI subregs of those SI
>> regs.
>> >> I think that might work better, for the reasons above.  (That is,
>> >> whenever we need the register in extended form, we can simply extend
>> >> the existing reg rather than create a new one.)
>> >
>> > Ah, I see, no, I explicitly avoid this. When doing the type promotions
>> > I tell it that size of the copies of xr and xc is still the original size, e.g. QI (i.e. I
>> don't change 97 and 99).
>> > This is different from what we do with extends where 97 and 99 *would*
>> be changed.
>> >
>> > The reason is that if I make this SI the compiler thinks it knows the
>> > value of all the bits in the register which led to various miscompares as it
>> thinks it can use the SI value directly.
>> >
>> > This happens because again the xr and xc are hard regs. So having 97
>> > be
>> >
>> > (set (reg:SI 97) (subreg:SI (reg:QI 0 x0 [ xr ]) 0))
>> >
>> > gets folded to an incorrect
>> >
>> > (set (reg:SI 97) (reg:SI 0 x0 [ xr ]))
>> 
>> This part I would expect (and hope for :-)).
>> 
>> > And now 97 is free to be used without any zero extension, as 97 on it's own
>> is an invalid RTX.
>> 
>> But the way I'd imagined it working, expand would need to insert an
>> extension before any operation that needs the upper 24 bits to be defined
>> (e.g. comparisons, right shifts).  If the DECL_RTL is (subreg:QI (reg:SI x) 0)
>> then the upper bits are not defined, since SUBREG_PROMOTED_VAR_P
>> would/should be false for the subreg.
>
> Ah I see, my fear here was that if we have a pattern which splits out the zero-extend for whatever reason
> that if it gets folded it would be invalid.  But I think I understand what you meant.  In your case
> we'd never again use the hardreg, but that everything goes through 97. Got it.

Yeah.  The expand code is supposed to move the hard register into a
pseudo at the earliest opportunity (at the head of the function) and
then everything else should use the pseudo.  Using the hard register
later could lead to spill failures, or to attempts to keep the register
live across calls.

>> E.g. for:
>> 
>>   int8_t foo(int8_t x) { return x >> 1; }
>> 
>> x would have a DECL_RTL of (subreg:QI (reg:SI x) 0), the parameter
>> assignment would be expanded as:
>> 
>>   (set (reg:SI x) (reg:SI x0))
>> 
>> the shift would be expanded as:
>> 
>>   (set (reg:SI x) (zero_extend:SI (subreg:QI (reg:SI x) 0)))
>>   (set (reg:SI x) (ashiftrt:SI (reg:SI x) (const_int 1)))
>> 
>> and the return assignment would be expanded as:
>> 
>>   (set (reg:SI x0) (reg:SI x))
>> 
>> x + 1 would instead be expanded to just:
>> 
>>   (set (reg:SI x) (plus:SI (reg:SI x) (const_int 1)))
>> 
>> (without an extension).
>> 
>> I realised later though that, although reusing the DECL_RTL reg for the
>> extension has the nice RA property of avoiding multiple live values, it would
>> make it harder to combine the extension into the operation if the variable is
>> still live afterwards.  So I guess we lose something both ways.
>> 
>> Maybe we need a different approach, not based on changing
>> PROMOTE_MODE.
>> 
>> I wonder how easy it would be to do the promotion in gimple, then reuse
>> backprop to determine when a sign/zero-extension (i.e. a normal gimple cast)
>> can be converted into an “any extend”
>> (probably represented as a new ifn).
>
> Do you mean without changing the hook implementation but keeping the current promotion?

Yeah, keep the hook definitions as they are now, but do the promotion
in gimple by widening types where necessary.

> I guess the problem here is that it's the inverse cases that's the problem isn't it? It's not that in
> gimple there are unneeded extends, it's that some operations require an any-extend no?
>
> like in gimple ~a where a is an 8-bit quantity requires an any-extend, but no cast would be there
> in gimple.
>
> So for instance
>
> #include <stdint.h>
>
> uint8_t f (uint8_t a)
> {
>     return ~a;
> }
>
> Is just simply:
>
> f (uint8_t a)
> {
>   uint8_t _2;
>
>   <bb 2> [local count: 1073741824]:
>   _2 = ~a_1(D);
>   return _2;
>
> }

Right.  But the idea was to run a late-ish isel-like pass that converts
this to:

f (uint8_t a)
{
  uint8_t _2;
  uint32_t _3;
  uint8_t _4;

  <bb 2> [local count: 1073741824]:
  _2 = .ANY_EXTEND(a_1(D), (uint32_t)0);
  _3 = ~_2;
  _4 = (uint8_t)_3;
  return _4;
}

We'd need to experiment with the best placing of the pass.

> In gimple. I'm also slightly worried about interfering with phi opts. Backprop runs
> before ifcombine and pihops for instance and there are various phi opts like ifcombine_ifandif
> that rely on the BB containing only the phi node.  Adding an any-extend in between would break this.

Yeah, the current backprop pass runs quite early.  But I meant that we
could reuse the code (or simply run another instance of the pass,
extended to handle this case) at the appropriate point.

> I also wonder if the IFN won't interfere with range analysis of expressions. Unless we manage to strategically
> Insert the IFNs on entire expressions and not in the intermediate SSA components.

This would be part of the trade-off in placing the promotion pass.

Thanks,
Richard
  

Patch

--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -206,14 +206,20 @@  set_rtl (tree t, rtx x)
      have to compute it ourselves.  For RESULT_DECLs, we accept mode
      mismatches too, as long as we have BLKmode or are not coalescing
      across variables, so that we don't reject BLKmode PARALLELs or
-     unpromoted REGs.  */
+     unpromoted REGs.  For any promoted types that result in a
+     paradoxical subreg also accept the argument.  */
   gcc_checking_assert (!x || x == pc_rtx || TREE_CODE (t) != SSA_NAME
 		       || (SSAVAR (t)
 			   && TREE_CODE (SSAVAR (t)) == RESULT_DECL
 			   && (promote_ssa_mode (t, NULL) == BLKmode
 			       || !flag_tree_coalesce_vars))
 		       || !use_register_for_decl (t)
-		       || GET_MODE (x) == promote_ssa_mode (t, NULL));
+		       || GET_MODE (x) == promote_ssa_mode (t, NULL)
+		       || targetm.calls.promote_function_args_subreg_p (
+					  GET_MODE (x),
+					  promote_ssa_mode (t, NULL),
+					  TYPE_UNSIGNED (TREE_TYPE (t)),
+					  SSAVAR (t)));
 
   if (x)
     {
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 2f92d37da8c0091e9879a493cfe8a361eb1d9299..6314cd83a2488dc225d4a1a15599e8e51e639f7f 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -3906,6 +3906,15 @@  cases of mismatch, it also makes for better code on certain machines.
 The default is to not promote prototypes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_PROMOTE_FUNCTION_ARGS_SUBREG_P (machine_mode @var{mode}, machine_mode @var{promoted_mode}, int @var{unsignedp}, tree @var{v})
+When a function argument is promoted with @code{PROMOTE_MODE} then this
+hook is used to determine whether the bits of the promoted type are all
+significant in the expression pointed to by V.  If they are an extend is
+generated, if they are not a paradoxical subreg is created for the argument
+from @code{mode} to @code{promoted_mode}.
+The default is to promote using an extend.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_PUSH_ARGUMENT (unsigned int @var{npush})
 This target hook returns @code{true} if push instructions will be
 used to pass outgoing arguments.  When the push instruction usage is
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index f869ddd5e5b8b7acbd8e9765fb103af24a1085b6..35f955803ec0a5a93be18a028fa1043f90858982 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -3103,6 +3103,8 @@  control passing certain arguments in registers.
 
 @hook TARGET_PROMOTE_PROTOTYPES
 
+@hook TARGET_PROMOTE_FUNCTION_ARGS_SUBREG_P
+
 @hook TARGET_PUSH_ARGUMENT
 
 @defmac PUSH_ARGS_REVERSED
diff --git a/gcc/function.cc b/gcc/function.cc
index d5ed51a6a663a1ef472f5b1c090543f359c18f42..92f469bfd5d1ebfb09cc94d9be854715cd2f90f8 100644
--- a/gcc/function.cc
+++ b/gcc/function.cc
@@ -3161,7 +3161,7 @@  assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
   machine_mode promoted_nominal_mode;
   int unsignedp = TYPE_UNSIGNED (TREE_TYPE (parm));
   bool did_conversion = false;
-  bool need_conversion, moved;
+  bool need_conversion, moved, use_subregs;
   enum insn_code icode;
   rtx rtl;
 
@@ -3172,7 +3172,20 @@  assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
     = promote_function_mode (data->nominal_type, data->nominal_mode, &unsignedp,
 			     TREE_TYPE (current_function_decl), 2);
 
-  parmreg = gen_reg_rtx (promoted_nominal_mode);
+  /* Check to see how the target wants the promotion of function arguments to
+     be handled.  */
+  use_subregs
+    = targetm.calls.promote_function_args_subreg_p (data->nominal_mode,
+						    promoted_nominal_mode,
+						    unsignedp, parm);
+
+  /* If we're promoting using a paradoxical subreg then we need to keep using
+     the unpromoted type because that's the only fully defined value.  */
+  if (use_subregs)
+    parmreg = gen_reg_rtx (data->nominal_mode);
+  else
+    parmreg = gen_reg_rtx (promoted_nominal_mode);
+
   if (!DECL_ARTIFICIAL (parm))
     mark_user_reg (parmreg);
 
@@ -3256,9 +3269,19 @@  assign_parm_setup_reg (struct assign_parm_data_all *all, tree parm,
 	    }
 	  else
 	    t = op1;
-	  rtx_insn *pat = gen_extend_insn (op0, t, promoted_nominal_mode,
-					   data->passed_mode, unsignedp);
-	  emit_insn (pat);
+
+	  /* Promote the argument itself now if a target wants it.  This
+	     prevents unneeded back and forth convertions in RTL between
+	     the original and promoted type.  */
+	  if (use_subregs)
+	    emit_move_insn (op0, lowpart_subreg (promoted_nominal_mode, t,
+						 data->nominal_mode));
+	  else
+	    {
+	      rtx_insn *pat = gen_extend_insn (op0, t, promoted_nominal_mode,
+					       data->passed_mode, unsignedp);
+	      emit_insn (pat);
+	    }
 	  insns = get_insns ();
 
 	  moved = true;
diff --git a/gcc/hooks.h b/gcc/hooks.h
index 1056e1e9e4dc3e6ce298557351047caa2f84227f..8d68de5cdb9adaea0a79ebf6de599f66b40aa67a 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -31,6 +31,8 @@  extern bool hook_bool_const_int_const_int_true (const int, const int);
 extern bool hook_bool_mode_false (machine_mode);
 extern bool hook_bool_mode_true (machine_mode);
 extern bool hook_bool_mode_mode_true (machine_mode, machine_mode);
+extern bool hook_bool_mode_mode_int_tree_false (machine_mode, machine_mode, int, tree);
+extern bool hook_bool_mode_mode_int_tree_true (machine_mode, machine_mode, int, tree);
 extern bool hook_bool_mode_const_rtx_false (machine_mode, const_rtx);
 extern bool hook_bool_mode_const_rtx_true (machine_mode, const_rtx);
 extern bool hook_bool_mode_rtx_false (machine_mode, rtx);
diff --git a/gcc/hooks.cc b/gcc/hooks.cc
index b29233f4f852fb81ede75a5065d743cd16cc9219..7647774f9e8efbbe13d5607e4a4b2f1c9d22f045 100644
--- a/gcc/hooks.cc
+++ b/gcc/hooks.cc
@@ -89,6 +89,22 @@  hook_bool_mode_mode_true (machine_mode, machine_mode)
   return true;
 }
 
+/* Generic hook that takes (machine_mode, machine_mode, int, tree) and
+   returns false.  */
+bool
+hook_bool_mode_mode_int_tree_false (machine_mode, machine_mode, int, tree)
+{
+  return false;
+}
+
+/* Generic hook that takes (machine_mode, machine_mode, int, tree) and
+   returns true.  */
+bool
+hook_bool_mode_mode_int_tree_true (machine_mode, machine_mode, int, tree)
+{
+  return true;
+}
+
 /* Generic hook that takes (machine_mode, const_rtx) and returns false.  */
 bool
 hook_bool_mode_const_rtx_false (machine_mode, const_rtx)
diff --git a/gcc/target.def b/gcc/target.def
index 72c2e1ef756cf70a1c92abe81f8a6577eaaa2501..bdbacf8c5fd7b0626a37951f6f6ec649f3194977 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4561,6 +4561,17 @@  The default is to not promote prototypes.",
  bool, (const_tree fntype),
  hook_bool_const_tree_false)
 
+DEFHOOK
+(promote_function_args_subreg_p,
+ "When a function argument is promoted with @code{PROMOTE_MODE} then this\n\
+hook is used to determine whether the bits of the promoted type are all\n\
+significant in the expression pointed to by V.  If they are an extend is\n\
+generated, if they are not a paradoxical subreg is created for the argument\n\
+from @code{mode} to @code{promoted_mode}.\n\
+The default is to promote using an extend.",
+ bool, (machine_mode mode, machine_mode promoted_mode, int unsignedp, tree v),
+ hook_bool_mode_mode_int_tree_false)
+
 DEFHOOK
 (struct_value_rtx,
  "This target hook should return the location of the structure value\n\
diff --git a/gcc/tree-outof-ssa.cc b/gcc/tree-outof-ssa.cc
index ec883126ad86d86a2c2dafee4592b8d83e2ed917..0f437023983baa0f23da25221f7bce8fc559a8b8 100644
--- a/gcc/tree-outof-ssa.cc
+++ b/gcc/tree-outof-ssa.cc
@@ -45,6 +45,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-coalesce.h"
 #include "tree-outof-ssa.h"
 #include "dojump.h"
+#include "target.h"
 
 /* FIXME: A lot of code here deals with expanding to RTL.  All that code
    should be in cfgexpand.cc.  */
@@ -333,7 +334,10 @@  insert_value_copy_on_edge (edge e, int dest, tree src, location_t locus)
   dest_mode = GET_MODE (dest_rtx);
   gcc_assert (src_mode == TYPE_MODE (TREE_TYPE (name)));
   gcc_assert (!REG_P (dest_rtx)
-	      || dest_mode == promote_ssa_mode (name, &unsignedp));
+	      || dest_mode == promote_ssa_mode (name, &unsignedp)
+	      || targetm.calls.promote_function_args_subreg_p (
+			dest_mode, promote_ssa_mode (name, NULL), unsignedp,
+			name));
 
   if (src_mode != dest_mode)
     {