Check if loading const from mem is faster

Message ID 20220222065313.2040127-1-guojiufu@linux.ibm.com
State New
Headers
Series Check if loading const from mem is faster |

Commit Message

Jiufu Guo Feb. 22, 2022, 6:53 a.m. UTC
  Hi,

For constants, there are some codes to check: if it is able to put
to instruction as an immediate operand or it is profitable to load from
mem.  There are still some places that could be improved for platforms.

This patch could handle PR63281/57836.  This patch does not change
too much on the code like force_const_mem and legitimate_constant_p.
We may integrate these APIs for passes like expand/cse/combine
as a whole solution in the future (maybe better for stage1?).

Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
Thanks for comments!


BR,
Jiufu

gcc/ChangeLog:

	PR target/94393
	PR rtl-optimization/63281
	* config/rs6000/rs6000.cc (TARGET_FASTER_LOADING_CONSTANT):
	New hook.
	(rs6000_faster_loading_const): New hook.
	(rs6000_cannot_force_const_mem): Update.
	(rs6000_emit_move): Use rs6000_faster_loading_const.
	* cse.cc (cse_insn): Use new hook.
	* doc/tm.texi.in: New hook TARGET_FASTER_LOADING_CONSTANT.
	* doc/tm.texi: Regenerate.
	* target.def: New hook faster_loading_constant.
	* targhooks.cc (default_faster_loading_constant): New.
	* targhooks.h (default_faster_loading_constant): Declare it.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/medium_offset.c: Update.
	* gcc.target/powerpc/pr93012.c: Update.
	* gcc.target/powerpc/pr63281.c: New test.

---
 gcc/config/rs6000/rs6000.cc                   | 38 ++++++++++++++-----
 gcc/cse.cc                                    |  5 +++
 gcc/doc/tm.texi                               |  5 +++
 gcc/doc/tm.texi.in                            |  2 +
 gcc/target.def                                |  8 ++++
 gcc/targhooks.cc                              |  6 +++
 .../gcc.target/powerpc/medium_offset.c        |  2 +-
 gcc/testsuite/gcc.target/powerpc/pr63281.c    | 14 +++++++
 gcc/testsuite/gcc.target/powerpc/pr93012.c    |  2 +-
 9 files changed, 71 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
  

Comments

Richard Biener Feb. 22, 2022, 7:26 a.m. UTC | #1
On Tue, 22 Feb 2022, Jiufu Guo wrote:

> Hi,
> 
> For constants, there are some codes to check: if it is able to put
> to instruction as an immediate operand or it is profitable to load from
> mem.  There are still some places that could be improved for platforms.
> 
> This patch could handle PR63281/57836.  This patch does not change
> too much on the code like force_const_mem and legitimate_constant_p.
> We may integrate these APIs for passes like expand/cse/combine
> as a whole solution in the future (maybe better for stage1?).
> 
> Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
> Thanks for comments!

I'm not sure whether we need a new hook here, but iff, then I think
whether loading a constant (from memory?) is faster or not depends
on the context.  So what's the exact situation and the two variants
you are costing against each other?  I assume (since you are
touching CSE) you are costing

  (set (...) (mem))  (before CSE)

against

  (set (...) (immediate))  (what CSE does now)

vs.

  (set (...) (mem))  (original, no CSE)

?  With the new hook you are skipping _all_ of the following loops
logic which does look like a quite bad design and hack (not that
I am very familiar with the candidate / costing logic in there).

We already have TARGET_INSN_COST which you could ask for a cost.
Like if we'd have a single_set then just temporarily substitute
the RHS with the candidate and cost the insns and compare against
the original insn cost.  So why exactly do you need a new hook
for this particular situation?

Thanks,
Richard.


> 
> BR,
> Jiufu
> 
> gcc/ChangeLog:
> 
> 	PR target/94393
> 	PR rtl-optimization/63281
> 	* config/rs6000/rs6000.cc (TARGET_FASTER_LOADING_CONSTANT):
> 	New hook.
> 	(rs6000_faster_loading_const): New hook.
> 	(rs6000_cannot_force_const_mem): Update.
> 	(rs6000_emit_move): Use rs6000_faster_loading_const.
> 	* cse.cc (cse_insn): Use new hook.
> 	* doc/tm.texi.in: New hook TARGET_FASTER_LOADING_CONSTANT.
> 	* doc/tm.texi: Regenerate.
> 	* target.def: New hook faster_loading_constant.
> 	* targhooks.cc (default_faster_loading_constant): New.
> 	* targhooks.h (default_faster_loading_constant): Declare it.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/powerpc/medium_offset.c: Update.
> 	* gcc.target/powerpc/pr93012.c: Update.
> 	* gcc.target/powerpc/pr63281.c: New test.
> 
> ---
>  gcc/config/rs6000/rs6000.cc                   | 38 ++++++++++++++-----
>  gcc/cse.cc                                    |  5 +++
>  gcc/doc/tm.texi                               |  5 +++
>  gcc/doc/tm.texi.in                            |  2 +
>  gcc/target.def                                |  8 ++++
>  gcc/targhooks.cc                              |  6 +++
>  .../gcc.target/powerpc/medium_offset.c        |  2 +-
>  gcc/testsuite/gcc.target/powerpc/pr63281.c    | 14 +++++++
>  gcc/testsuite/gcc.target/powerpc/pr93012.c    |  2 +-
>  9 files changed, 71 insertions(+), 11 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index d7a7cfe860f..beb21a0bb2b 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1361,6 +1361,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
>  #undef TARGET_CANNOT_FORCE_CONST_MEM
>  #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
>  
> +#undef TARGET_FASTER_LOADING_CONSTANT
> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
> +
>  #undef TARGET_DELEGITIMIZE_ADDRESS
>  #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
>  
> @@ -9684,8 +9687,7 @@ rs6000_init_stack_protect_guard (void)
>  static bool
>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>  {
> -  if (GET_CODE (x) == HIGH
> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
> +  if (GET_CODE (x) == HIGH)
>      return true;
>  
>    /* A TLS symbol in the TOC cannot contain a sum.  */
> @@ -10483,6 +10485,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, machine_mode mode)
>    return false;
>  }
>  
> +
> +/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
> +
> +static bool
> +rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
> +{
> +  gcc_assert (CONSTANT_P (src));
> +
> +  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
> +    return false;
> +  if (GET_CODE (src) == HIGH)
> +    return false;
> +  if (toc_relative_expr_p (src, false, NULL, NULL))
> +    return false;
> +  if (rs6000_cannot_force_const_mem (mode, src))
> +    return false;
> +
> +  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
> +    return true;
> +  if (!CONST_INT_P (src))
> +    return true;
> +  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
> +}
> +
>  /* Emit a move from SOURCE to DEST in mode MODE.  */
>  void
>  rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
> @@ -10860,13 +10886,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>  	operands[1] = create_TOC_reference (operands[1], operands[0]);
>        else if (mode == Pmode
>  	       && CONSTANT_P (operands[1])
> -	       && GET_CODE (operands[1]) != HIGH
> -	       && ((REG_P (operands[0])
> -		    && FP_REGNO_P (REGNO (operands[0])))
> -		   || !CONST_INT_P (operands[1])
> -		   || (num_insns_constant (operands[1], mode)
> -		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
> -	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
> +	       && rs6000_faster_loading_const (mode, operands[0], operands[1])
>  	       && (TARGET_CMODEL == CMODEL_SMALL
>  		   || can_create_pseudo_p ()
>  		   || (REG_P (operands[0])
> diff --git a/gcc/cse.cc b/gcc/cse.cc
> index a18b599d324..c33d3c7e911 100644
> --- a/gcc/cse.cc
> +++ b/gcc/cse.cc
> @@ -5158,6 +5158,11 @@ cse_insn (rtx_insn *insn)
>        if (dest == pc_rtx && src_const && GET_CODE (src_const) == LABEL_REF)
>  	src_folded = src_const, src_folded_cost = src_folded_regcost = -1;
>  
> +      /* Check if src_foled is contant which is fastster to load pool.  */
> +      if (src_folded && CONSTANT_P (src_folded) && src && MEM_P (src)
> +	  && targetm.faster_loading_constant (mode, dest, src_folded))
> +	src_folded = NULL_RTX, src_folded_cost = src_folded_regcost = -1;
> +
>        /* Terminate loop when replacement made.  This must terminate since
>           the current contents will be tested and will always be valid.  */
>        while (1)
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 962bbb8caaf..65685311249 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6014,6 +6014,11 @@ holding the constant.  This restriction is often true of addresses
>  of TLS symbols for various targets.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} bool TARGET_FASTER_LOADING_CONSTANT (machine_mode @var{mode}, rtx @var{y}, rtx @var{x})
> +It returns true if loading contant @var{x} is faster than building
> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} bool TARGET_USE_BLOCKS_FOR_CONSTANT_P (machine_mode @var{mode}, const_rtx @var{x})
>  This hook should return true if pool entries for constant @var{x} can
>  be placed in an @code{object_block} structure.  @var{mode} is the mode
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 394b59e70a6..918914f0e30 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4148,6 +4148,8 @@ address;  but often a machine-dependent strategy can generate better code.
>  
>  @hook TARGET_CANNOT_FORCE_CONST_MEM
>  
> +@hook TARGET_FASTER_LOADING_CONSTANT
> +
>  @hook TARGET_USE_BLOCKS_FOR_CONSTANT_P
>  
>  @hook TARGET_USE_BLOCKS_FOR_DECL_P
> diff --git a/gcc/target.def b/gcc/target.def
> index 57e64b20eef..8b007aca9dc 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2987,6 +2987,14 @@ locally.  The default is to return false.",
>   bool, (void),
>   hook_bool_void_false)
>  
> +/* Check if it is profitable to load the constant from constant pool.  */
> +DEFHOOK
> +(faster_loading_constant,
> + "It returns true if loading contant @var{x} is faster than building\n\
> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.",
> + bool, (machine_mode mode, rtx y, rtx x),
> + default_faster_loading_constant)
> +
>  /* True if it is OK to do sibling call optimization for the specified
>     call expression EXP.  DECL will be the called function, or NULL if
>     this is an indirect call.  */
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index fc49235eb38..1d9340d1c14 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -173,6 +173,12 @@ default_return_in_memory (const_tree type,
>    return (TYPE_MODE (type) == BLKmode);
>  }
>  
> +bool
> +default_faster_loading_constant (machine_mode mode, rtx dest, rtx src)
> +{
> +  return false;
> +}
> +
>  rtx
>  default_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
>  			    machine_mode mode ATTRIBUTE_UNUSED)
> diff --git a/gcc/testsuite/gcc.target/powerpc/medium_offset.c b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> index f29eba08c38..4889e8fa8ec 100644
> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile { target { powerpc*-*-* } } } */
>  /* { dg-require-effective-target lp64 } */
>  /* { dg-options "-O" } */
> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
>  
>  static int x;
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> new file mode 100644
> index 00000000000..58ba074d427
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define CONST1 0x100803004101001
> +#define CONST2 0x000406002105007
> +
> +void __attribute__ ((noinline))
> +foo (unsigned long long *a, unsigned long long *b)
> +{
> +  *a = CONST1;
> +  *b = CONST2;
> +}
> +
> +/* { dg-final { scan-assembler-times {\mld|pld\M} 2 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> index 4f764d0576f..5afb4f79c45 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
>  unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
>  unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
>  
> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
>
  
Segher Boessenkool Feb. 22, 2022, 5:30 p.m. UTC | #2
Hi Jiu Fu,

On Tue, Feb 22, 2022 at 02:53:13PM +0800, Jiufu Guo wrote:
>  static bool
>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>  {
> -  if (GET_CODE (x) == HIGH
> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
> +  if (GET_CODE (x) == HIGH)
>      return true;

This isn't explained anywhere.  "Update" is not enough ;-)


CSE is the pass that is most ancient and still causing problems left and
right.  It should be rewritten sooner rather than later.

The problem with that is that the pass does so much more than just CSE,
and we don't want to lose all those other things.  So it will be a slow
arduous affair of peeling off bits into separate passes, I think :-(

Doing actual CSE without all the restrictive restrictions our pass has
historically had isn't the hard part!


Segher
  
Jiufu Guo Feb. 23, 2022, 3:31 a.m. UTC | #3
On 2022-02-23 01:30, Segher Boessenkool wrote:
> Hi Jiu Fu,
> 
> On Tue, Feb 22, 2022 at 02:53:13PM +0800, Jiufu Guo wrote:
>>  static bool
>>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, 
>> rtx x)
>>  {
>> -  if (GET_CODE (x) == HIGH
>> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
>> +  if (GET_CODE (x) == HIGH)
>>      return true;
> 
Hi Segher,

> This isn't explained anywhere.  "Update" is not enough ;-)
Thanks! I will add explanations for it.   This excludes all 'HIGH' for 
'x' code,
like function "rs6000_emit_move" also check if the code is 'HIGH'.

And on P10, I also encounter this kind of case like:
  (high:DI (symbol_ref:DI ("var_1") [flags 0xc0] <var_decl 0x200000b003f0 
var_1>))
Which fail to store into .rodata.

> 
> 
> CSE is the pass that is most ancient and still causing problems left 
> and
> right.  It should be rewritten sooner rather than later.
> 
> The problem with that is that the pass does so much more than just CSE,
> and we don't want to lose all those other things.  So it will be a slow
> arduous affair of peeling off bits into separate passes, I think :-(

Yes, it does a lot of work. One of the additional works is checking 
'folding out
constants and putting constant in memory'.

BR,
Jiufu

> 
> Doing actual CSE without all the restrictive restrictions our pass has
> historically had isn't the hard part!
> 
> 
> Segher
  
Jiufu Guo Feb. 23, 2022, 11:32 a.m. UTC | #4
On 2/22/22 PM3:26, Richard Biener wrote:
> On Tue, 22 Feb 2022, Jiufu Guo wrote:
> 
>> Hi,
>>
>> For constants, there are some codes to check: if it is able to put
>> to instruction as an immediate operand or it is profitable to load from
>> mem.  There are still some places that could be improved for platforms.
>>
>> This patch could handle PR63281/57836.  This patch does not change
>> too much on the code like force_const_mem and legitimate_constant_p.
>> We may integrate these APIs for passes like expand/cse/combine
>> as a whole solution in the future (maybe better for stage1?).
>>
>> Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
>> Thanks for comments!
> 
> I'm not sure whether we need a new hook here, but iff, then I think
> whether loading a constant (from memory?) is faster or not depends
> on the context.  So what's the exact situation and the two variants
> you are costing against each other?  I assume (since you are
> touching CSE) you are costing

Hi Richard,

Thanks for your review!

In some contexts, it may be faster to load from memory for some
constant value, and for some constant value, it would be faster
to put into immediate of few (1 or 2) instructions.

For example 0x1234567812345678, on ppc64, we may need 3 instructions
to build it, and then it would be better to put it in .rodata, and
then load it from memory.

Currently, we already have hooks TARGET_CANNOT_FORCE_CONST_MEM and
TARGET_LEGITIMATE_CONSTANT_P.

TARGET_CANNOT_FORCE_CONST_MEM is used to check if one 'rtx' can be
store into the constant pool.
On some targets (e.g. alpha), TARGET_LEGITIMATE_CONSTANT_P does the
behavior like what we expect:

I once thought to use TARGET_LEGITIMATE_CONSTANT_P too.
But in general, it seems this hook is designed to check if one
'rtx' could be used as an immediate instruction. This hook is used
in RTL passes: ira/reload. It is also used in recog.cc and expr.cc.

In other words, I feel, whether putting a constant in the constant
pool, we could check:
- If TARGET_CANNOT_FORCE_CONST_MEM returns true, we should not put
the 'constant' in the constant pool.
- If TARGET_LEGITIMATE_CONSTANT_P returns true, then the 'constant'
would be immediate of **one** instruction, and not put to constant
pool.
- If the new hook TARGET_FASTER_LOADING_CONSTANT returns true, then
the 'constant' would be stored in the constant pool.
Otherwise, it would be better to use an instructions-seq to build the
'constant'.
This is why I introduce a new hook.

We may also use the new hook at other places, e.g. expand/combining...
where is calling force_const_mem.

Any suggestions?

> 
>    (set (...) (mem))  (before CSE)
> 
> against
> 
>    (set (...) (immediate))  (what CSE does now)
> 
> vs.
> 
>    (set (...) (mem))  (original, no CSE)
> 
> ?  With the new hook you are skipping _all_ of the following loops
> logic which does look like a quite bad design and hack (not that
> I am very familiar with the candidate / costing logic in there).

At cse_insn, in the following loop of the code, it is also testing
the constant and try to put into memory:

           else if (crtl->uses_const_pool
                    && CONSTANT_P (trial)
                    && !CONST_INT_P (trial)
                    && (src_folded == 0 || !MEM_P (src_folded))
                    && GET_MODE_CLASS (mode) != MODE_CC
                    && mode != VOIDmode)
             {
               src_folded = force_const_mem (mode, trial);
               if (src_folded)
                 {
                   src_folded_cost = COST (src_folded, mode);
                   src_folded_regcost = approx_reg_cost (src_folded);
                 }
             }

This code is at the end of the loop, it would only be tested for
the next iteration. It may be better to test "if need to put the
constant into memory" for all iterations.

The current patch is adding an additional test before the loop.
I will update the patch to integrate these two places!

> 
> We already have TARGET_INSN_COST which you could ask for a cost.
> Like if we'd have a single_set then just temporarily substitute
> the RHS with the candidate and cost the insns and compare against
> the original insn cost.  So why exactly do you need a new hook
> for this particular situation?

Thanks for pointing out this! Segher also mentioned this before.
Currently, CSE is using rtx_cost. Using insn_cost to replace
rtx_cost would be a good idea for all necessary places including CSE.

For this particular case: check the cost for constants.
I did not use insn_cost. Because to use insn_cost, we may need
to create a recognizable insn temporarily, and for some kind of
constants we may need to create a sequence instructions on some
platform, e.g. "li xx; ori ; sldi .." on ppc64, and check the
sum cost of those instructions. If only create one fake
instruction, the insn_cost may not return the accurate cost either.

BR,
Jiufu

> 
> Thanks,
> Richard.
> 
> 
>>
>> BR,
>> Jiufu
>>
>> gcc/ChangeLog:
>>
>> 	PR target/94393
>> 	PR rtl-optimization/63281
>> 	* config/rs6000/rs6000.cc (TARGET_FASTER_LOADING_CONSTANT):
>> 	New hook.
>> 	(rs6000_faster_loading_const): New hook.
>> 	(rs6000_cannot_force_const_mem): Update.
>> 	(rs6000_emit_move): Use rs6000_faster_loading_const.
>> 	* cse.cc (cse_insn): Use new hook.
>> 	* doc/tm.texi.in: New hook TARGET_FASTER_LOADING_CONSTANT.
>> 	* doc/tm.texi: Regenerate.
>> 	* target.def: New hook faster_loading_constant.
>> 	* targhooks.cc (default_faster_loading_constant): New.
>> 	* targhooks.h (default_faster_loading_constant): Declare it.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* gcc.target/powerpc/medium_offset.c: Update.
>> 	* gcc.target/powerpc/pr93012.c: Update.
>> 	* gcc.target/powerpc/pr63281.c: New test.
>>
>> ---
>>   gcc/config/rs6000/rs6000.cc                   | 38 ++++++++++++++-----
>>   gcc/cse.cc                                    |  5 +++
>>   gcc/doc/tm.texi                               |  5 +++
>>   gcc/doc/tm.texi.in                            |  2 +
>>   gcc/target.def                                |  8 ++++
>>   gcc/targhooks.cc                              |  6 +++
>>   .../gcc.target/powerpc/medium_offset.c        |  2 +-
>>   gcc/testsuite/gcc.target/powerpc/pr63281.c    | 14 +++++++
>>   gcc/testsuite/gcc.target/powerpc/pr93012.c    |  2 +-
>>   9 files changed, 71 insertions(+), 11 deletions(-)
>>   create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
>>
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index d7a7cfe860f..beb21a0bb2b 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -1361,6 +1361,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
>>   #undef TARGET_CANNOT_FORCE_CONST_MEM
>>   #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
>>   
>> +#undef TARGET_FASTER_LOADING_CONSTANT
>> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
>> +
>>   #undef TARGET_DELEGITIMIZE_ADDRESS
>>   #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
>>   
>> @@ -9684,8 +9687,7 @@ rs6000_init_stack_protect_guard (void)
>>   static bool
>>   rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>>   {
>> -  if (GET_CODE (x) == HIGH
>> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
>> +  if (GET_CODE (x) == HIGH)
>>       return true;
>>   
>>     /* A TLS symbol in the TOC cannot contain a sum.  */
>> @@ -10483,6 +10485,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, machine_mode mode)
>>     return false;
>>   }
>>   
>> +
>> +/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
>> +
>> +static bool
>> +rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
>> +{
>> +  gcc_assert (CONSTANT_P (src));
>> +
>> +  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
>> +    return false;
>> +  if (GET_CODE (src) == HIGH)
>> +    return false;
>> +  if (toc_relative_expr_p (src, false, NULL, NULL))
>> +    return false;
>> +  if (rs6000_cannot_force_const_mem (mode, src))
>> +    return false;
>> +
>> +  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
>> +    return true;
>> +  if (!CONST_INT_P (src))
>> +    return true;
>> +  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
>> +}
>> +
>>   /* Emit a move from SOURCE to DEST in mode MODE.  */
>>   void
>>   rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>> @@ -10860,13 +10886,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>>   	operands[1] = create_TOC_reference (operands[1], operands[0]);
>>         else if (mode == Pmode
>>   	       && CONSTANT_P (operands[1])
>> -	       && GET_CODE (operands[1]) != HIGH
>> -	       && ((REG_P (operands[0])
>> -		    && FP_REGNO_P (REGNO (operands[0])))
>> -		   || !CONST_INT_P (operands[1])
>> -		   || (num_insns_constant (operands[1], mode)
>> -		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
>> -	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
>> +	       && rs6000_faster_loading_const (mode, operands[0], operands[1])
>>   	       && (TARGET_CMODEL == CMODEL_SMALL
>>   		   || can_create_pseudo_p ()
>>   		   || (REG_P (operands[0])
>> diff --git a/gcc/cse.cc b/gcc/cse.cc
>> index a18b599d324..c33d3c7e911 100644
>> --- a/gcc/cse.cc
>> +++ b/gcc/cse.cc
>> @@ -5158,6 +5158,11 @@ cse_insn (rtx_insn *insn)
>>         if (dest == pc_rtx && src_const && GET_CODE (src_const) == LABEL_REF)
>>   	src_folded = src_const, src_folded_cost = src_folded_regcost = -1;
>>   
>> +      /* Check if src_foled is contant which is fastster to load pool.  */
>> +      if (src_folded && CONSTANT_P (src_folded) && src && MEM_P (src)
>> +	  && targetm.faster_loading_constant (mode, dest, src_folded))
>> +	src_folded = NULL_RTX, src_folded_cost = src_folded_regcost = -1;
>> +
>>         /* Terminate loop when replacement made.  This must terminate since
>>            the current contents will be tested and will always be valid.  */
>>         while (1)
>> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
>> index 962bbb8caaf..65685311249 100644
>> --- a/gcc/doc/tm.texi
>> +++ b/gcc/doc/tm.texi
>> @@ -6014,6 +6014,11 @@ holding the constant.  This restriction is often true of addresses
>>   of TLS symbols for various targets.
>>   @end deftypefn
>>   
>> +@deftypefn {Target Hook} bool TARGET_FASTER_LOADING_CONSTANT (machine_mode @var{mode}, rtx @var{y}, rtx @var{x})
>> +It returns true if loading contant @var{x} is faster than building
>> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.
>> +@end deftypefn
>> +
>>   @deftypefn {Target Hook} bool TARGET_USE_BLOCKS_FOR_CONSTANT_P (machine_mode @var{mode}, const_rtx @var{x})
>>   This hook should return true if pool entries for constant @var{x} can
>>   be placed in an @code{object_block} structure.  @var{mode} is the mode
>> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
>> index 394b59e70a6..918914f0e30 100644
>> --- a/gcc/doc/tm.texi.in
>> +++ b/gcc/doc/tm.texi.in
>> @@ -4148,6 +4148,8 @@ address;  but often a machine-dependent strategy can generate better code.
>>   
>>   @hook TARGET_CANNOT_FORCE_CONST_MEM
>>   
>> +@hook TARGET_FASTER_LOADING_CONSTANT
>> +
>>   @hook TARGET_USE_BLOCKS_FOR_CONSTANT_P
>>   
>>   @hook TARGET_USE_BLOCKS_FOR_DECL_P
>> diff --git a/gcc/target.def b/gcc/target.def
>> index 57e64b20eef..8b007aca9dc 100644
>> --- a/gcc/target.def
>> +++ b/gcc/target.def
>> @@ -2987,6 +2987,14 @@ locally.  The default is to return false.",
>>    bool, (void),
>>    hook_bool_void_false)
>>   
>> +/* Check if it is profitable to load the constant from constant pool.  */
>> +DEFHOOK
>> +(faster_loading_constant,
>> + "It returns true if loading contant @var{x} is faster than building\n\
>> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.",
>> + bool, (machine_mode mode, rtx y, rtx x),
>> + default_faster_loading_constant)
>> +
>>   /* True if it is OK to do sibling call optimization for the specified
>>      call expression EXP.  DECL will be the called function, or NULL if
>>      this is an indirect call.  */
>> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
>> index fc49235eb38..1d9340d1c14 100644
>> --- a/gcc/targhooks.cc
>> +++ b/gcc/targhooks.cc
>> @@ -173,6 +173,12 @@ default_return_in_memory (const_tree type,
>>     return (TYPE_MODE (type) == BLKmode);
>>   }
>>   
>> +bool
>> +default_faster_loading_constant (machine_mode mode, rtx dest, rtx src)
>> +{
>> +  return false;
>> +}
>> +
>>   rtx
>>   default_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
>>   			    machine_mode mode ATTRIBUTE_UNUSED)
>> diff --git a/gcc/testsuite/gcc.target/powerpc/medium_offset.c b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> index f29eba08c38..4889e8fa8ec 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> @@ -1,7 +1,7 @@
>>   /* { dg-do compile { target { powerpc*-*-* } } } */
>>   /* { dg-require-effective-target lp64 } */
>>   /* { dg-options "-O" } */
>> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
>> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
>>   
>>   static int x;
>>   
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>> new file mode 100644
>> index 00000000000..58ba074d427
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2" } */
>> +
>> +#define CONST1 0x100803004101001
>> +#define CONST2 0x000406002105007
>> +
>> +void __attribute__ ((noinline))
>> +foo (unsigned long long *a, unsigned long long *b)
>> +{
>> +  *a = CONST1;
>> +  *b = CONST2;
>> +}
>> +
>> +/* { dg-final { scan-assembler-times {\mld|pld\M} 2 } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c b/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> index 4f764d0576f..5afb4f79c45 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
>>   unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
>>   unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
>>   
>> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
>> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
>>
>
  
Richard Biener Feb. 23, 2022, 1:02 p.m. UTC | #5
On Wed, 23 Feb 2022, guojiufu wrote:

> 
> 
> On 2/22/22 PM3:26, Richard Biener wrote:
> > On Tue, 22 Feb 2022, Jiufu Guo wrote:
> > 
> >> Hi,
> >>
> >> For constants, there are some codes to check: if it is able to put
> >> to instruction as an immediate operand or it is profitable to load from
> >> mem.  There are still some places that could be improved for platforms.
> >>
> >> This patch could handle PR63281/57836.  This patch does not change
> >> too much on the code like force_const_mem and legitimate_constant_p.
> >> We may integrate these APIs for passes like expand/cse/combine
> >> as a whole solution in the future (maybe better for stage1?).
> >>
> >> Bootstrap and regtest pass on ppc64le and x86_64. Is this ok for trunk?
> >> Thanks for comments!
> > 
> > I'm not sure whether we need a new hook here, but iff, then I think
> > whether loading a constant (from memory?) is faster or not depends
> > on the context.  So what's the exact situation and the two variants
> > you are costing against each other?  I assume (since you are
> > touching CSE) you are costing
> 
> Hi Richard,
> 
> Thanks for your review!
> 
> In some contexts, it may be faster to load from memory for some
> constant value, and for some constant value, it would be faster
> to put into immediate of few (1 or 2) instructions.
> 
> For example 0x1234567812345678, on ppc64, we may need 3 instructions
> to build it, and then it would be better to put it in .rodata, and
> then load it from memory.
> 
> Currently, we already have hooks TARGET_CANNOT_FORCE_CONST_MEM and
> TARGET_LEGITIMATE_CONSTANT_P.
> 
> TARGET_CANNOT_FORCE_CONST_MEM is used to check if one 'rtx' can be
> store into the constant pool.
> On some targets (e.g. alpha), TARGET_LEGITIMATE_CONSTANT_P does the
> behavior like what we expect:
> 
> I once thought to use TARGET_LEGITIMATE_CONSTANT_P too.
> But in general, it seems this hook is designed to check if one
> 'rtx' could be used as an immediate instruction. This hook is used
> in RTL passes: ira/reload. It is also used in recog.cc and expr.cc.
> 
> In other words, I feel, whether putting a constant in the constant
> pool, we could check:
> - If TARGET_CANNOT_FORCE_CONST_MEM returns true, we should not put
> the 'constant' in the constant pool.
> - If TARGET_LEGITIMATE_CONSTANT_P returns true, then the 'constant'
> would be immediate of **one** instruction, and not put to constant
> pool.
> - If the new hook TARGET_FASTER_LOADING_CONSTANT returns true, then
> the 'constant' would be stored in the constant pool.
> Otherwise, it would be better to use an instructions-seq to build the
> 'constant'.
> This is why I introduce a new hook.

I agree TARGET_CANNOT_FORCE_CONST_MEM and TARGET_LEGITIMATE_CONSTANT_P
are not the correct vehicle for a cost based consideration, they
are used for correctness checks.

> We may also use the new hook at other places, e.g. expand/combining...
> where is calling force_const_mem.
> 
> Any suggestions?
> 
> > 
> >    (set (...) (mem))  (before CSE)
> > 
> > against
> > 
> >    (set (...) (immediate))  (what CSE does now)
> > 
> > vs.
> > 
> >    (set (...) (mem))  (original, no CSE)
> > 
> > ?  With the new hook you are skipping _all_ of the following loops
> > logic which does look like a quite bad design and hack (not that
> > I am very familiar with the candidate / costing logic in there).
> 
> At cse_insn, in the following loop of the code, it is also testing
> the constant and try to put into memory:
> 
>           else if (crtl->uses_const_pool
>                    && CONSTANT_P (trial)
>                    && !CONST_INT_P (trial)
>                    && (src_folded == 0 || !MEM_P (src_folded))
>                    && GET_MODE_CLASS (mode) != MODE_CC
>                    && mode != VOIDmode)
>             {
>               src_folded = force_const_mem (mode, trial);
>               if (src_folded)
>                 {
>                   src_folded_cost = COST (src_folded, mode);
>                   src_folded_regcost = approx_reg_cost (src_folded);
>                 }
>             }
> 
> This code is at the end of the loop, it would only be tested for
> the next iteration. It may be better to test "if need to put the
> constant into memory" for all iterations.
>
> The current patch is adding an additional test before the loop.
> I will update the patch to integrate these two places!

I'm assuming we're always dealing with

  (set (reg:MODE ..) <src_folded>)

here and CSE is not substituting into random places of an
instruction(?).  I don't know what 'rtx_cost' should evaluate
to for a constant, if it should implicitely evaluate the cost
of putting the result into a register for example.

The RTX_COST target hook at least has some context
(outer_code and opno) and COST passes SET and 1 here.

So why is adjusting the RTX_COST hook not enough then for this case?
Using RTX_COST with SET and 1 at least looks no worse than using
your proposed new target hook and comparing it with the original
unfolded src (again with SET and 1).

Richard.
 
> > 
> > We already have TARGET_INSN_COST which you could ask for a cost.
> > Like if we'd have a single_set then just temporarily substitute
> > the RHS with the candidate and cost the insns and compare against
> > the original insn cost.  So why exactly do you need a new hook
> > for this particular situation?
> 
> Thanks for pointing out this! Segher also mentioned this before.
> Currently, CSE is using rtx_cost. Using insn_cost to replace
> rtx_cost would be a good idea for all necessary places including CSE.
> 
> For this particular case: check the cost for constants.
> I did not use insn_cost. Because to use insn_cost, we may need
> to create a recognizable insn temporarily, and for some kind of
> constants we may need to create a sequence instructions on some
> platform, e.g. "li xx; ori ; sldi .." on ppc64, and check the
> sum cost of those instructions. If only create one fake
> instruction, the insn_cost may not return the accurate cost either.
> 
> BR,
> Jiufu
> 
> > 
> > Thanks,
> > Richard.
> > 
> > 
> >>
> >> BR,
> >> Jiufu
> >>
> >> gcc/ChangeLog:
> >>
> >>  PR target/94393
> >>  PR rtl-optimization/63281
> >>  * config/rs6000/rs6000.cc (TARGET_FASTER_LOADING_CONSTANT):
> >>  New hook.
> >>  (rs6000_faster_loading_const): New hook.
> >>  (rs6000_cannot_force_const_mem): Update.
> >>  (rs6000_emit_move): Use rs6000_faster_loading_const.
> >>  * cse.cc (cse_insn): Use new hook.
> >>  * doc/tm.texi.in: New hook TARGET_FASTER_LOADING_CONSTANT.
> >>  * doc/tm.texi: Regenerate.
> >>  * target.def: New hook faster_loading_constant.
> >>  * targhooks.cc (default_faster_loading_constant): New.
> >>  * targhooks.h (default_faster_loading_constant): Declare it.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >>  * gcc.target/powerpc/medium_offset.c: Update.
> >>  * gcc.target/powerpc/pr93012.c: Update.
> >>  * gcc.target/powerpc/pr63281.c: New test.
> >>
> >> ---
> >>   gcc/config/rs6000/rs6000.cc                   | 38 ++++++++++++++-----
> >>   gcc/cse.cc                                    |  5 +++
> >>   gcc/doc/tm.texi                               |  5 +++
> >>   gcc/doc/tm.texi.in                            |  2 +
> >>   gcc/target.def                                |  8 ++++
> >>   gcc/targhooks.cc                              |  6 +++
> >>   .../gcc.target/powerpc/medium_offset.c        |  2 +-
> >>   gcc/testsuite/gcc.target/powerpc/pr63281.c    | 14 +++++++
> >>   gcc/testsuite/gcc.target/powerpc/pr93012.c    |  2 +-
> >>   9 files changed, 71 insertions(+), 11 deletions(-)
> >>   create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
> >>
> >> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> >> index d7a7cfe860f..beb21a0bb2b 100644
> >> --- a/gcc/config/rs6000/rs6000.cc
> >> +++ b/gcc/config/rs6000/rs6000.cc
> >> @@ -1361,6 +1361,9 @@ static const struct attribute_spec
> >> rs6000_attribute_table[] =
> >>   #undef TARGET_CANNOT_FORCE_CONST_MEM
> >>   #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
> >>   
> >> +#undef TARGET_FASTER_LOADING_CONSTANT
> >> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
> >> +
> >>   #undef TARGET_DELEGITIMIZE_ADDRESS
> >>   #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
> >>   
> >> @@ -9684,8 +9687,7 @@ rs6000_init_stack_protect_guard (void)
> >>   static bool
> >>   rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
> >>   {
> >> -  if (GET_CODE (x) == HIGH
> >> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
> >> +  if (GET_CODE (x) == HIGH)
> >>       return true;
> >>   
> >>     /* A TLS symbol in the TOC cannot contain a sum.  */
> >> @@ -10483,6 +10485,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx
> >> source, machine_mode mode)
> >>     return false;
> >>   }
> >>   
> >> +
> >> +/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
> >> +
> >> +static bool
> >> +rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
> >> +{
> >> +  gcc_assert (CONSTANT_P (src));
> >> +
> >> +  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
> >> +    return false;
> >> +  if (GET_CODE (src) == HIGH)
> >> +    return false;
> >> +  if (toc_relative_expr_p (src, false, NULL, NULL))
> >> +    return false;
> >> +  if (rs6000_cannot_force_const_mem (mode, src))
> >> +    return false;
> >> +
> >> +  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
> >> +    return true;
> >> +  if (!CONST_INT_P (src))
> >> +    return true;
> >> +  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
> >> +}
> >> +
> >>   /* Emit a move from SOURCE to DEST in mode MODE.  */
> >>   void
> >>   rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
> >> @@ -10860,13 +10886,7 @@ rs6000_emit_move (rtx dest, rtx source,
> >> machine_mode mode)
> >>    operands[1] = create_TOC_reference (operands[1], operands[0]);
> >>         else if (mode == Pmode
> >>   	       && CONSTANT_P (operands[1])
> >> -	       && GET_CODE (operands[1]) != HIGH
> >> -	       && ((REG_P (operands[0])
> >> -		    && FP_REGNO_P (REGNO (operands[0])))
> >> -		   || !CONST_INT_P (operands[1])
> >> -		   || (num_insns_constant (operands[1], mode)
> >> -		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
> >> -	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
> >> +	       && rs6000_faster_loading_const (mode, operands[0], operands[1])
> >>           && (TARGET_CMODEL == CMODEL_SMALL
> >>        || can_create_pseudo_p ()
> >>        || (REG_P (operands[0])
> >> diff --git a/gcc/cse.cc b/gcc/cse.cc
> >> index a18b599d324..c33d3c7e911 100644
> >> --- a/gcc/cse.cc
> >> +++ b/gcc/cse.cc
> >> @@ -5158,6 +5158,11 @@ cse_insn (rtx_insn *insn)
> >>          if (dest == pc_rtx && src_const && GET_CODE (src_const) ==
> >>    LABEL_REF)
> >>    src_folded = src_const, src_folded_cost = src_folded_regcost = -1;
> >>   +      /* Check if src_foled is contant which is fastster to load pool.
> >> */
> >> +      if (src_folded && CONSTANT_P (src_folded) && src && MEM_P (src)
> >> +	  && targetm.faster_loading_constant (mode, dest, src_folded))
> >> +	src_folded = NULL_RTX, src_folded_cost = src_folded_regcost = -1;
> >> +
> >>         /* Terminate loop when replacement made.  This must terminate since
> >>            the current contents will be tested and will always be valid.
> >>         */
> >>         while (1)
> >> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> >> index 962bbb8caaf..65685311249 100644
> >> --- a/gcc/doc/tm.texi
> >> +++ b/gcc/doc/tm.texi
> >> @@ -6014,6 +6014,11 @@ holding the constant.  This restriction is often
> >> true of addresses
> >>   of TLS symbols for various targets.
> >>   @end deftypefn
> >>   
> >> +@deftypefn {Target Hook} bool TARGET_FASTER_LOADING_CONSTANT (machine_mode
> >> @var{mode}, rtx @var{y}, rtx @var{x})
> >> +It returns true if loading contant @var{x} is faster than building
> >> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.
> >> +@end deftypefn
> >> +
> >>   @deftypefn {Target Hook} bool TARGET_USE_BLOCKS_FOR_CONSTANT_P
> >>   (machine_mode @var{mode}, const_rtx @var{x})
> >>   This hook should return true if pool entries for constant @var{x} can
> >>   be placed in an @code{object_block} structure.  @var{mode} is the mode
> >> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> >> index 394b59e70a6..918914f0e30 100644
> >> --- a/gcc/doc/tm.texi.in
> >> +++ b/gcc/doc/tm.texi.in
> >> @@ -4148,6 +4148,8 @@ address;  but often a machine-dependent strategy can
> >> generate better code.
> >>   
> >>   @hook TARGET_CANNOT_FORCE_CONST_MEM
> >>   
> >> +@hook TARGET_FASTER_LOADING_CONSTANT
> >> +
> >>   @hook TARGET_USE_BLOCKS_FOR_CONSTANT_P
> >>   
> >>   @hook TARGET_USE_BLOCKS_FOR_DECL_P
> >> diff --git a/gcc/target.def b/gcc/target.def
> >> index 57e64b20eef..8b007aca9dc 100644
> >> --- a/gcc/target.def
> >> +++ b/gcc/target.def
> >> @@ -2987,6 +2987,14 @@ locally.  The default is to return false.",
> >>    bool, (void),
> >>    hook_bool_void_false)
> >>   +/* Check if it is profitable to load the constant from constant pool.
> >> */
> >> +DEFHOOK
> >> +(faster_loading_constant,
> >> + "It returns true if loading contant @var{x} is faster than building\n\
> >> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.",
> >> + bool, (machine_mode mode, rtx y, rtx x),
> >> + default_faster_loading_constant)
> >> +
> >>   /* True if it is OK to do sibling call optimization for the specified
> >>      call expression EXP.  DECL will be the called function, or NULL if
> >>      this is an indirect call.  */
> >> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> >> index fc49235eb38..1d9340d1c14 100644
> >> --- a/gcc/targhooks.cc
> >> +++ b/gcc/targhooks.cc
> >> @@ -173,6 +173,12 @@ default_return_in_memory (const_tree type,
> >>     return (TYPE_MODE (type) == BLKmode);
> >>   }
> >>   
> >> +bool
> >> +default_faster_loading_constant (machine_mode mode, rtx dest, rtx src)
> >> +{
> >> +  return false;
> >> +}
> >> +
> >>   rtx
> >>   default_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
> >>   			    machine_mode mode ATTRIBUTE_UNUSED)
> >> diff --git a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> >> b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> >> index f29eba08c38..4889e8fa8ec 100644
> >> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> >> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> >> @@ -1,7 +1,7 @@
> >>   /* { dg-do compile { target { powerpc*-*-* } } } */
> >>   /* { dg-require-effective-target lp64 } */
> >>   /* { dg-options "-O" } */
> >> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
> >> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
> >>   
> >>   static int x;
> >>   
> >> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c
> >> b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> >> new file mode 100644
> >> index 00000000000..58ba074d427
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> >> @@ -0,0 +1,14 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-options "-O2" } */
> >> +
> >> +#define CONST1 0x100803004101001
> >> +#define CONST2 0x000406002105007
> >> +
> >> +void __attribute__ ((noinline))
> >> +foo (unsigned long long *a, unsigned long long *b)
> >> +{
> >> +  *a = CONST1;
> >> +  *b = CONST2;
> >> +}
> >> +
> >> +/* { dg-final { scan-assembler-times {\mld|pld\M} 2 } } */
> >> diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c
> >> b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> >> index 4f764d0576f..5afb4f79c45 100644
> >> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
> >> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> >> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return
> >> 0xffff9234ffff9234ULL; }
> >>   unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
> >>   unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
> >>   
> >> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
> >> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
> >>
> > 
> 
>
  
Segher Boessenkool Feb. 23, 2022, 9:14 p.m. UTC | #6
On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
> I'm assuming we're always dealing with
> 
>   (set (reg:MODE ..) <src_folded>)
> 
> here and CSE is not substituting into random places of an
> instruction(?).  I don't know what 'rtx_cost' should evaluate
> to for a constant, if it should implicitely evaluate the cost
> of putting the result into a register for example.

rtx_cost is no good here (and in most places).  rtx_cost should be 0
for anything that is used as input in a machine instruction -- but you
need much more context to determine that.  insn_cost is much simpler and
much easier to use.

> Using RTX_COST with SET and 1 at least looks no worse than using
> your proposed new target hook and comparing it with the original
> unfolded src (again with SET and 1).

It is required to generate valid instructions no matter what, before
the pass has finished that is.  On all more modern architectures it is
futile to think you can usefully consider the cost of an RTL expression
and derive a real-world cost of the generated code from that.

But there is so much more wrong with cse.c :-(


Segher
  
Segher Boessenkool Feb. 23, 2022, 9:27 p.m. UTC | #7
On Wed, Feb 23, 2022 at 07:32:55PM +0800, guojiufu wrote:
> >We already have TARGET_INSN_COST which you could ask for a cost.
> >Like if we'd have a single_set then just temporarily substitute
> >the RHS with the candidate and cost the insns and compare against
> >the original insn cost.  So why exactly do you need a new hook
> >for this particular situation?
> 
> Thanks for pointing out this! Segher also mentioned this before.
> Currently, CSE is using rtx_cost. Using insn_cost to replace
> rtx_cost would be a good idea for all necessary places including CSE.

I have updated many places that used rtx_cost to use insn_cost instead,
over the years (as a fallback the generic insn_cost will use rtx_cost).
CSE is the biggest remaining thing.  There is a long tail left as well
of course.

> For this particular case: check the cost for constants.
> I did not use insn_cost. Because to use insn_cost, we may need
> to create a recognizable insn temporarily, and for some kind of
> constants we may need to create a sequence instructions on some
> platform, e.g. "li xx; ori ; sldi .." on ppc64, and check the
> sum cost of those instructions. If only create one fake
> instruction, the insn_cost may not return the accurate cost either.

That is the problem yes.  You need insns to call insn_cost on.  You can
look in combine.c:combine_validate_cost to see how this can be done; but
you need to have some code to generate in the first place, and for CSE
it isn't always clear what code to generate, it really is based on RTL
expressions having a cost.


Segher
  
Jiufu Guo Feb. 24, 2022, 5:56 a.m. UTC | #8
Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> I'm assuming we're always dealing with
>> 
>>   (set (reg:MODE ..) <src_folded>)
>> 
>> here and CSE is not substituting into random places of an
>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> to for a constant, if it should implicitely evaluate the cost
>> of putting the result into a register for example.
>
> rtx_cost is no good here (and in most places).  rtx_cost should be 0
> for anything that is used as input in a machine instruction -- but you
> need much more context to determine that.  insn_cost is much simpler and
> much easier to use.
>
>> Using RTX_COST with SET and 1 at least looks no worse than using
>> your proposed new target hook and comparing it with the original
>> unfolded src (again with SET and 1).
>
> It is required to generate valid instructions no matter what, before
> the pass has finished that is.  On all more modern architectures it is
> futile to think you can usefully consider the cost of an RTL expression
> and derive a real-world cost of the generated code from that.

Thanks Segher for pointing out these!  Here is  another reason that I
did not use rtx_cost: in a few passes, there are codes to check the
constants and store them in constant pool.  I'm thinking to integerate
those codes in a consistent way.


BR,
Jiufu

>
> But there is so much more wrong with cse.c :-(
>
>
> Segher
  
Jiufu Guo Feb. 24, 2022, 6:33 a.m. UTC | #9
Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:

> Segher Boessenkool <segher@kernel.crashing.org> writes:
>
>> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>>> I'm assuming we're always dealing with
>>> 
>>>   (set (reg:MODE ..) <src_folded>)
>>> 
>>> here and CSE is not substituting into random places of an
>>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>>> to for a constant, if it should implicitely evaluate the cost
>>> of putting the result into a register for example.
>>
>> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> for anything that is used as input in a machine instruction -- but you
>> need much more context to determine that.  insn_cost is much simpler and
>> much easier to use.
>>
>>> Using RTX_COST with SET and 1 at least looks no worse than using
>>> your proposed new target hook and comparing it with the original
>>> unfolded src (again with SET and 1).
>>
>> It is required to generate valid instructions no matter what, before
>> the pass has finished that is.  On all more modern architectures it is
>> futile to think you can usefully consider the cost of an RTL expression
>> and derive a real-world cost of the generated code from that.
>
> Thanks Segher for pointing out these!  Here is  another reason that I
> did not use rtx_cost: in a few passes, there are codes to check the
> constants and store them in constant pool.  I'm thinking to integerate
> those codes in a consistent way.

Hi Segher, Richard!

I'm thinking the way like: For a constant,
1. if the constant could be used as an immediate for the
instruction, then retreated as an operand;
2. otherwise if the constant can not be stored into a
constant pool, then handle through instructions;
3. if it is faster to access constant from pool, then emit
constant as data(.rodata);
4. otherwise, handle the constant by instructions.

And to store the constant into a pool, besides force_const_mem,
create reference (toc) may be needed on some platforms.

For this particular issue in CSE, there is already code that
tries to put constant into a pool (invoke force_const_mem).
While the code is too late.  So, we may check the constant
earlier and store it into constant pool if profitable.

And another thing as Segher pointed out, CSE is doing too
much work.  It may be ok to separate the constant handling
logic from CSE.

I update a new version patch as follow (did not seprate CSE):

Thanks for the comments and suggestions again!


BR,
Jiufu

---
 gcc/config/rs6000/rs6000.cc                   | 39 ++++++++++++++-----
 gcc/cse.cc                                    | 36 ++++++++---------
 gcc/doc/tm.texi                               |  5 +++
 gcc/doc/tm.texi.in                            |  2 +
 gcc/target.def                                |  8 ++++
 gcc/targhooks.cc                              |  6 +++
 gcc/targhooks.h                               |  1 +
 .../gcc.target/powerpc/medium_offset.c        |  2 +-
 gcc/testsuite/gcc.target/powerpc/pr63281.c    | 14 +++++++
 gcc/testsuite/gcc.target/powerpc/pr93012.c    |  2 +-
 10 files changed, 84 insertions(+), 31 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..0a8f487d516 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1361,6 +1361,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_CANNOT_FORCE_CONST_MEM
 #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
 
+#undef TARGET_FASTER_LOADING_CONSTANT
+#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
+
 #undef TARGET_DELEGITIMIZE_ADDRESS
 #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
 
@@ -9684,8 +9687,8 @@ rs6000_init_stack_protect_guard (void)
 static bool
 rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
 {
-  if (GET_CODE (x) == HIGH
-      && GET_CODE (XEXP (x, 0)) == UNSPEC)
+  /* Exclude CONSTANT HIGH part.  */
+  if (GET_CODE (x) == HIGH)
     return true;
 
   /* A TLS symbol in the TOC cannot contain a sum.  */
@@ -10483,6 +10486,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, machine_mode mode)
   return false;
 }
 
+
+/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
+
+static bool
+rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
+{
+  gcc_assert (CONSTANT_P (src));
+
+  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
+    return false;
+  if (GET_CODE (src) == HIGH)
+    return false;
+  if (toc_relative_expr_p (src, false, NULL, NULL))
+    return false;
+  if (rs6000_cannot_force_const_mem (mode, src))
+    return false;
+
+  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
+    return true;
+  if (!CONST_INT_P (src))
+    return true;
+  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
+}
+
 /* Emit a move from SOURCE to DEST in mode MODE.  */
 void
 rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
@@ -10860,13 +10887,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
 	operands[1] = create_TOC_reference (operands[1], operands[0]);
       else if (mode == Pmode
 	       && CONSTANT_P (operands[1])
-	       && GET_CODE (operands[1]) != HIGH
-	       && ((REG_P (operands[0])
-		    && FP_REGNO_P (REGNO (operands[0])))
-		   || !CONST_INT_P (operands[1])
-		   || (num_insns_constant (operands[1], mode)
-		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
-	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
+	       && rs6000_faster_loading_const (mode, operands[0], operands[1])
 	       && (TARGET_CMODEL == CMODEL_SMALL
 		   || can_create_pseudo_p ()
 		   || (REG_P (operands[0])
diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..53a7b3556b3 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -5192,6 +5192,22 @@ cse_insn (rtx_insn *insn)
 	      src_elt_regcost = elt->regcost;
 	    }
 
+	  /* If the current function uses a constant pool and this is a
+	     constant, try making a pool entry. Put it in src_folded
+	     unless we already have done this since that is where it
+	     likely came from.  */
+
+	  if (crtl->uses_const_pool && src_folded && CONSTANT_P (src_folded)
+	      && targetm.faster_loading_constant (mode, dest, src_folded))
+	    {
+	      src_folded = force_const_mem (mode, src_folded);
+	      if (src_folded)
+		{
+		  src_folded_cost = COST (src_folded, mode);
+		  src_folded_regcost = approx_reg_cost (src_folded);
+		}
+	    }
+
 	  /* Find cheapest and skip it for the next time.   For items
 	     of equal cost, use this order:
 	     src_folded, src, src_eqv, src_related and hash table entry.  */
@@ -5390,26 +5406,6 @@ cse_insn (rtx_insn *insn)
 
 	      break;
 	    }
-
-	  /* If the current function uses a constant pool and this is a
-	     constant, try making a pool entry. Put it in src_folded
-	     unless we already have done this since that is where it
-	     likely came from.  */
-
-	  else if (crtl->uses_const_pool
-		   && CONSTANT_P (trial)
-		   && !CONST_INT_P (trial)
-		   && (src_folded == 0 || !MEM_P (src_folded))
-		   && GET_MODE_CLASS (mode) != MODE_CC
-		   && mode != VOIDmode)
-	    {
-	      src_folded = force_const_mem (mode, trial);
-	      if (src_folded)
-		{
-		  src_folded_cost = COST (src_folded, mode);
-		  src_folded_regcost = approx_reg_cost (src_folded);
-		}
-	    }
 	}
 
       /* If we changed the insn too much, handle this set from scratch.  */
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 962bbb8caaf..65685311249 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6014,6 +6014,11 @@ holding the constant.  This restriction is often true of addresses
 of TLS symbols for various targets.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_FASTER_LOADING_CONSTANT (machine_mode @var{mode}, rtx @var{y}, rtx @var{x})
+It returns true if loading contant @var{x} is faster than building
+from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_USE_BLOCKS_FOR_CONSTANT_P (machine_mode @var{mode}, const_rtx @var{x})
 This hook should return true if pool entries for constant @var{x} can
 be placed in an @code{object_block} structure.  @var{mode} is the mode
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 394b59e70a6..918914f0e30 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4148,6 +4148,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_CANNOT_FORCE_CONST_MEM
 
+@hook TARGET_FASTER_LOADING_CONSTANT
+
 @hook TARGET_USE_BLOCKS_FOR_CONSTANT_P
 
 @hook TARGET_USE_BLOCKS_FOR_DECL_P
diff --git a/gcc/target.def b/gcc/target.def
index 57e64b20eef..8b007aca9dc 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2987,6 +2987,14 @@ locally.  The default is to return false.",
  bool, (void),
  hook_bool_void_false)
 
+/* Check if it is profitable to load the constant from constant pool.  */
+DEFHOOK
+(faster_loading_constant,
+ "It returns true if loading contant @var{x} is faster than building\n\
+from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.",
+ bool, (machine_mode mode, rtx y, rtx x),
+ default_faster_loading_constant)
+
 /* True if it is OK to do sibling call optimization for the specified
    call expression EXP.  DECL will be the called function, or NULL if
    this is an indirect call.  */
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index fc49235eb38..12de61a7ed8 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -173,6 +173,12 @@ default_return_in_memory (const_tree type,
   return (TYPE_MODE (type) == BLKmode);
 }
 
+bool
+default_faster_loading_constant (machine_mode, rtx, rtx)
+{
+  return false;
+}
+
 rtx
 default_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
 			    machine_mode mode ATTRIBUTE_UNUSED)
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index ecfa11287ef..4db21cd59f6 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -295,5 +295,6 @@ extern rtx default_memtag_extract_tag (rtx, rtx);
 extern rtx default_memtag_untagged_pointer (rtx, rtx);
 
 extern HOST_WIDE_INT default_gcov_type_size (void);
+extern bool default_faster_loading_constant (machine_mode, rtx, rtx);
 
 #endif /* GCC_TARGHOOKS_H */
diff --git a/gcc/testsuite/gcc.target/powerpc/medium_offset.c b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
index f29eba08c38..4889e8fa8ec 100644
--- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
+++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
@@ -1,7 +1,7 @@
 /* { dg-do compile { target { powerpc*-*-* } } } */
 /* { dg-require-effective-target lp64 } */
 /* { dg-options "-O" } */
-/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
+/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
 
 static int x;
 
diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
new file mode 100644
index 00000000000..58ba074d427
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define CONST1 0x100803004101001
+#define CONST2 0x000406002105007
+
+void __attribute__ ((noinline))
+foo (unsigned long long *a, unsigned long long *b)
+{
+  *a = CONST1;
+  *b = CONST2;
+}
+
+/* { dg-final { scan-assembler-times {\mld|pld\M} 2 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c b/gcc/testsuite/gcc.target/powerpc/pr93012.c
index 4f764d0576f..5afb4f79c45 100644
--- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
+++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
@@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
 unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
 unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
 
-/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
+/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
  
Jiufu Guo Feb. 24, 2022, 7:48 a.m. UTC | #10
Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Wed, Feb 23, 2022 at 07:32:55PM +0800, guojiufu wrote:
>> >We already have TARGET_INSN_COST which you could ask for a cost.
>> >Like if we'd have a single_set then just temporarily substitute
>> >the RHS with the candidate and cost the insns and compare against
>> >the original insn cost.  So why exactly do you need a new hook
>> >for this particular situation?
>> 
>> Thanks for pointing out this! Segher also mentioned this before.
>> Currently, CSE is using rtx_cost. Using insn_cost to replace
>> rtx_cost would be a good idea for all necessary places including CSE.
>
> I have updated many places that used rtx_cost to use insn_cost instead,
> over the years (as a fallback the generic insn_cost will use rtx_cost).
> CSE is the biggest remaining thing.  There is a long tail left as well
> of course.
>
>> For this particular case: check the cost for constants.
>> I did not use insn_cost. Because to use insn_cost, we may need
>> to create a recognizable insn temporarily, and for some kind of
>> constants we may need to create a sequence instructions on some
>> platform, e.g. "li xx; ori ; sldi .." on ppc64, and check the
>> sum cost of those instructions. If only create one fake
>> instruction, the insn_cost may not return the accurate cost either.
>
> That is the problem yes.  You need insns to call insn_cost on.  You can
> look in combine.c:combine_validate_cost to see how this can be done; but
> you need to have some code to generate in the first place, and for CSE
> it isn't always clear what code to generate, it really is based on RTL
> expressions having a cost.

Hi Segher,

Thanks! combine_validate_cost is useful to help me on
evaluating the costs of several instructions or replacements.

As you pointed out, at CSE, it may not be clear to know what
extact insn sequences will be generated. Actually,  the same
issue also exists on RTL expression.  At CSE, it may not clear
the exact cost, since the real instructions maybe emitted in
very late passes.

To get the accurate cost, we may analyze the constant in the
hook(insn_cost or rtx_cost) and estimate the possible final
instructions and then calculate the costs.

We discussed one idea: let the hook insn_cost accept
any interim instruction, and estimate the real instruction
base on the interim insn, and then return the estimated
costs.

For example: input insn "r119:DI=0x100803004101001" to
insn_cost; and in rs6000_insn_cost (for ppc), analyze
constant "0x100803004101001" which would need 5 insns;
then rs6000_insn_cost sumarize the cost of 5 insns.

A minor concern: because we know that reading this
constant from the pool is faster than building it by insns,
we will generate instructions to load constant from the pool
finally, do not emit 5 real instructions to build the value.
So, we are more interested in if it is faster to load from
pool or not.


BR,
Jiufu 

>
>
> Segher
  
Richard Biener Feb. 24, 2022, 8:50 a.m. UTC | #11
On Thu, 24 Feb 2022, Jiufu Guo wrote:

> Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> 
> > Segher Boessenkool <segher@kernel.crashing.org> writes:
> >
> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
> >>> I'm assuming we're always dealing with
> >>> 
> >>>   (set (reg:MODE ..) <src_folded>)
> >>> 
> >>> here and CSE is not substituting into random places of an
> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
> >>> to for a constant, if it should implicitely evaluate the cost
> >>> of putting the result into a register for example.
> >>
> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
> >> for anything that is used as input in a machine instruction -- but you
> >> need much more context to determine that.  insn_cost is much simpler and
> >> much easier to use.
> >>
> >>> Using RTX_COST with SET and 1 at least looks no worse than using
> >>> your proposed new target hook and comparing it with the original
> >>> unfolded src (again with SET and 1).
> >>
> >> It is required to generate valid instructions no matter what, before
> >> the pass has finished that is.  On all more modern architectures it is
> >> futile to think you can usefully consider the cost of an RTL expression
> >> and derive a real-world cost of the generated code from that.
> >
> > Thanks Segher for pointing out these!  Here is  another reason that I
> > did not use rtx_cost: in a few passes, there are codes to check the
> > constants and store them in constant pool.  I'm thinking to integerate
> > those codes in a consistent way.
> 
> Hi Segher, Richard!
> 
> I'm thinking the way like: For a constant,
> 1. if the constant could be used as an immediate for the
> instruction, then retreated as an operand;
> 2. otherwise if the constant can not be stored into a
> constant pool, then handle through instructions;
> 3. if it is faster to access constant from pool, then emit
> constant as data(.rodata);
> 4. otherwise, handle the constant by instructions.
> 
> And to store the constant into a pool, besides force_const_mem,
> create reference (toc) may be needed on some platforms.
> 
> For this particular issue in CSE, there is already code that
> tries to put constant into a pool (invoke force_const_mem).
> While the code is too late.  So, we may check the constant
> earlier and store it into constant pool if profitable.
> 
> And another thing as Segher pointed out, CSE is doing too
> much work.  It may be ok to separate the constant handling
> logic from CSE.

Not sure - CSE just is value numbering, I don't see that it does
more than that.  Yes, it might have developed "heuristics" over
the years what to CSE and to what and where to substitute and
where not.  But in the end it does just value numbering.

> 
> I update a new version patch as follow (did not seprate CSE):

How is the new target hook better in any way compared to rtx_cost
or insn_cost?  It looks like a total hack.

I suppose the actual way of materializing a constant is done
behind GCCs back and not exposed anywhere?  But instead you
claim the constants are valid when they actually are not?
Isn't the problem then that the rs6000 backend lies?

Btw, all of this is of course not appropriate for stage4 and changes
to CSE need testing on more than one target.

Richard.

> Thanks for the comments and suggestions again!
> 
> 
> BR,
> Jiufu
> 
> ---
>  gcc/config/rs6000/rs6000.cc                   | 39 ++++++++++++++-----
>  gcc/cse.cc                                    | 36 ++++++++---------
>  gcc/doc/tm.texi                               |  5 +++
>  gcc/doc/tm.texi.in                            |  2 +
>  gcc/target.def                                |  8 ++++
>  gcc/targhooks.cc                              |  6 +++
>  gcc/targhooks.h                               |  1 +
>  .../gcc.target/powerpc/medium_offset.c        |  2 +-
>  gcc/testsuite/gcc.target/powerpc/pr63281.c    | 14 +++++++
>  gcc/testsuite/gcc.target/powerpc/pr93012.c    |  2 +-
>  10 files changed, 84 insertions(+), 31 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index d7a7cfe860f..0a8f487d516 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1361,6 +1361,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
>  #undef TARGET_CANNOT_FORCE_CONST_MEM
>  #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
>  
> +#undef TARGET_FASTER_LOADING_CONSTANT
> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
> +
>  #undef TARGET_DELEGITIMIZE_ADDRESS
>  #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
>  
> @@ -9684,8 +9687,8 @@ rs6000_init_stack_protect_guard (void)
>  static bool
>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>  {
> -  if (GET_CODE (x) == HIGH
> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
> +  /* Exclude CONSTANT HIGH part.  */
> +  if (GET_CODE (x) == HIGH)
>      return true;
>  
>    /* A TLS symbol in the TOC cannot contain a sum.  */
> @@ -10483,6 +10486,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, machine_mode mode)
>    return false;
>  }
>  
> +
> +/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
> +
> +static bool
> +rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
> +{
> +  gcc_assert (CONSTANT_P (src));
> +
> +  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
> +    return false;
> +  if (GET_CODE (src) == HIGH)
> +    return false;
> +  if (toc_relative_expr_p (src, false, NULL, NULL))
> +    return false;
> +  if (rs6000_cannot_force_const_mem (mode, src))
> +    return false;
> +
> +  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
> +    return true;
> +  if (!CONST_INT_P (src))
> +    return true;
> +  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
> +}
> +
>  /* Emit a move from SOURCE to DEST in mode MODE.  */
>  void
>  rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
> @@ -10860,13 +10887,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>  	operands[1] = create_TOC_reference (operands[1], operands[0]);
>        else if (mode == Pmode
>  	       && CONSTANT_P (operands[1])
> -	       && GET_CODE (operands[1]) != HIGH
> -	       && ((REG_P (operands[0])
> -		    && FP_REGNO_P (REGNO (operands[0])))
> -		   || !CONST_INT_P (operands[1])
> -		   || (num_insns_constant (operands[1], mode)
> -		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
> -	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
> +	       && rs6000_faster_loading_const (mode, operands[0], operands[1])
>  	       && (TARGET_CMODEL == CMODEL_SMALL
>  		   || can_create_pseudo_p ()
>  		   || (REG_P (operands[0])
> diff --git a/gcc/cse.cc b/gcc/cse.cc
> index a18b599d324..53a7b3556b3 100644
> --- a/gcc/cse.cc
> +++ b/gcc/cse.cc
> @@ -5192,6 +5192,22 @@ cse_insn (rtx_insn *insn)
>  	      src_elt_regcost = elt->regcost;
>  	    }
>  
> +	  /* If the current function uses a constant pool and this is a
> +	     constant, try making a pool entry. Put it in src_folded
> +	     unless we already have done this since that is where it
> +	     likely came from.  */
> +
> +	  if (crtl->uses_const_pool && src_folded && CONSTANT_P (src_folded)
> +	      && targetm.faster_loading_constant (mode, dest, src_folded))
> +	    {
> +	      src_folded = force_const_mem (mode, src_folded);
> +	      if (src_folded)
> +		{
> +		  src_folded_cost = COST (src_folded, mode);
> +		  src_folded_regcost = approx_reg_cost (src_folded);
> +		}
> +	    }
> +
>  	  /* Find cheapest and skip it for the next time.   For items
>  	     of equal cost, use this order:
>  	     src_folded, src, src_eqv, src_related and hash table entry.  */
> @@ -5390,26 +5406,6 @@ cse_insn (rtx_insn *insn)
>  
>  	      break;
>  	    }
> -
> -	  /* If the current function uses a constant pool and this is a
> -	     constant, try making a pool entry. Put it in src_folded
> -	     unless we already have done this since that is where it
> -	     likely came from.  */
> -
> -	  else if (crtl->uses_const_pool
> -		   && CONSTANT_P (trial)
> -		   && !CONST_INT_P (trial)
> -		   && (src_folded == 0 || !MEM_P (src_folded))
> -		   && GET_MODE_CLASS (mode) != MODE_CC
> -		   && mode != VOIDmode)
> -	    {
> -	      src_folded = force_const_mem (mode, trial);
> -	      if (src_folded)
> -		{
> -		  src_folded_cost = COST (src_folded, mode);
> -		  src_folded_regcost = approx_reg_cost (src_folded);
> -		}
> -	    }
>  	}
>  
>        /* If we changed the insn too much, handle this set from scratch.  */
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 962bbb8caaf..65685311249 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6014,6 +6014,11 @@ holding the constant.  This restriction is often true of addresses
>  of TLS symbols for various targets.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} bool TARGET_FASTER_LOADING_CONSTANT (machine_mode @var{mode}, rtx @var{y}, rtx @var{x})
> +It returns true if loading contant @var{x} is faster than building
> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} bool TARGET_USE_BLOCKS_FOR_CONSTANT_P (machine_mode @var{mode}, const_rtx @var{x})
>  This hook should return true if pool entries for constant @var{x} can
>  be placed in an @code{object_block} structure.  @var{mode} is the mode
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 394b59e70a6..918914f0e30 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4148,6 +4148,8 @@ address;  but often a machine-dependent strategy can generate better code.
>  
>  @hook TARGET_CANNOT_FORCE_CONST_MEM
>  
> +@hook TARGET_FASTER_LOADING_CONSTANT
> +
>  @hook TARGET_USE_BLOCKS_FOR_CONSTANT_P
>  
>  @hook TARGET_USE_BLOCKS_FOR_DECL_P
> diff --git a/gcc/target.def b/gcc/target.def
> index 57e64b20eef..8b007aca9dc 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2987,6 +2987,14 @@ locally.  The default is to return false.",
>   bool, (void),
>   hook_bool_void_false)
>  
> +/* Check if it is profitable to load the constant from constant pool.  */
> +DEFHOOK
> +(faster_loading_constant,
> + "It returns true if loading contant @var{x} is faster than building\n\
> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.",
> + bool, (machine_mode mode, rtx y, rtx x),
> + default_faster_loading_constant)
> +
>  /* True if it is OK to do sibling call optimization for the specified
>     call expression EXP.  DECL will be the called function, or NULL if
>     this is an indirect call.  */
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index fc49235eb38..12de61a7ed8 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -173,6 +173,12 @@ default_return_in_memory (const_tree type,
>    return (TYPE_MODE (type) == BLKmode);
>  }
>  
> +bool
> +default_faster_loading_constant (machine_mode, rtx, rtx)
> +{
> +  return false;
> +}
> +
>  rtx
>  default_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
>  			    machine_mode mode ATTRIBUTE_UNUSED)
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index ecfa11287ef..4db21cd59f6 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -295,5 +295,6 @@ extern rtx default_memtag_extract_tag (rtx, rtx);
>  extern rtx default_memtag_untagged_pointer (rtx, rtx);
>  
>  extern HOST_WIDE_INT default_gcov_type_size (void);
> +extern bool default_faster_loading_constant (machine_mode, rtx, rtx);
>  
>  #endif /* GCC_TARGHOOKS_H */
> diff --git a/gcc/testsuite/gcc.target/powerpc/medium_offset.c b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> index f29eba08c38..4889e8fa8ec 100644
> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile { target { powerpc*-*-* } } } */
>  /* { dg-require-effective-target lp64 } */
>  /* { dg-options "-O" } */
> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
>  
>  static int x;
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> new file mode 100644
> index 00000000000..58ba074d427
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define CONST1 0x100803004101001
> +#define CONST2 0x000406002105007
> +
> +void __attribute__ ((noinline))
> +foo (unsigned long long *a, unsigned long long *b)
> +{
> +  *a = CONST1;
> +  *b = CONST2;
> +}
> +
> +/* { dg-final { scan-assembler-times {\mld|pld\M} 2 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> index 4f764d0576f..5afb4f79c45 100644
> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
>  unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
>  unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
>  
> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
>
  
Jiufu Guo Feb. 25, 2022, 4:35 a.m. UTC | #12
Richard Biener <rguenther@suse.de> writes:

> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>
>> Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> 
>> > Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >
>> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> >>> I'm assuming we're always dealing with
>> >>> 
>> >>>   (set (reg:MODE ..) <src_folded>)
>> >>> 
>> >>> here and CSE is not substituting into random places of an
>> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> >>> to for a constant, if it should implicitely evaluate the cost
>> >>> of putting the result into a register for example.
>> >>
>> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> >> for anything that is used as input in a machine instruction -- but you
>> >> need much more context to determine that.  insn_cost is much simpler and
>> >> much easier to use.
>> >>
>> >>> Using RTX_COST with SET and 1 at least looks no worse than using
>> >>> your proposed new target hook and comparing it with the original
>> >>> unfolded src (again with SET and 1).
>> >>
>> >> It is required to generate valid instructions no matter what, before
>> >> the pass has finished that is.  On all more modern architectures it is
>> >> futile to think you can usefully consider the cost of an RTL expression
>> >> and derive a real-world cost of the generated code from that.
>> >
>> > Thanks Segher for pointing out these!  Here is  another reason that I
>> > did not use rtx_cost: in a few passes, there are codes to check the
>> > constants and store them in constant pool.  I'm thinking to integerate
>> > those codes in a consistent way.
>> 
>> Hi Segher, Richard!
>> 
>> I'm thinking the way like: For a constant,
>> 1. if the constant could be used as an immediate for the
>> instruction, then retreated as an operand;
>> 2. otherwise if the constant can not be stored into a
>> constant pool, then handle through instructions;
>> 3. if it is faster to access constant from pool, then emit
>> constant as data(.rodata);
>> 4. otherwise, handle the constant by instructions.
>> 
>> And to store the constant into a pool, besides force_const_mem,
>> create reference (toc) may be needed on some platforms.
>> 
>> For this particular issue in CSE, there is already code that
>> tries to put constant into a pool (invoke force_const_mem).
>> While the code is too late.  So, we may check the constant
>> earlier and store it into constant pool if profitable.
>> 
>> And another thing as Segher pointed out, CSE is doing too
>> much work.  It may be ok to separate the constant handling
>> logic from CSE.
>
> Not sure - CSE just is value numbering, I don't see that it does
> more than that.  Yes, it might have developed "heuristics" over
> the years what to CSE and to what and where to substitute and
> where not.  But in the end it does just value numbering.
>
>> 
>> I update a new version patch as follow (did not seprate CSE):
>
> How is the new target hook better in any way compared to rtx_cost
> or insn_cost?  It looks like a total hack.
>
> I suppose the actual way of materializing a constant is done
> behind GCCs back and not exposed anywhere?  But instead you
> claim the constants are valid when they actually are not?
> Isn't the problem then that the rs6000 backend lies?

Hi Richard,

Thanks for your comments and sugguestions!

Materializing a constant should be done behind GCC.
On rs6000, in expand pass, during emit_move, the constant is
checked and store into constant pool if necessary.
Some other platforms are doing a similar thing, e.g.
ix86_expand_vector_move, alpha_expand_mov,...
mips_legitimize_const_move.

But, it does not as we expect, force_const_mem is also 
exposed other places (not only ira/reload for stack reference).

CSE is one place, for example, CSE first retrieve the constant
from insn's equal, but it also tries to put constant into
pool for some condition (the condition was introduced at
early age of cse.c, and it is rare to run into in trunk).
In some aspects, IMHO, this seems not a great work of CSE.

And this is how the 'invalid(or say slow)' constant occurs.
e.g.  before cse:
    7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
      REG_EQUAL 0x100803004101001
after cse: 
    7: r119:DI=0x100803004101001
      REG_EQUAL 0x100803004101001

As you pointed out: we can also avoid this transformation through
rtx_cost/insn_cost by estimating the COST more accurately for
 "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
be a real final instruction.)

Is it necessary to refine this constant handling for CSE, or even
to eliminate the logic on constant extracting for an insn, beside
updating rtx_cost/insn_cost?
Any sugguestions?

>
> Btw, all of this is of course not appropriate for stage4 and changes
> to CSE need testing on more than one target.
I would do more evaluation, thanks!

Jiufu

>
> Richard.
>
>> Thanks for the comments and suggestions again!
>> 
>> 
>> BR,
>> Jiufu
>> 
>> ---
>>  gcc/config/rs6000/rs6000.cc                   | 39 ++++++++++++++-----
>>  gcc/cse.cc                                    | 36 ++++++++---------
>>  gcc/doc/tm.texi                               |  5 +++
>>  gcc/doc/tm.texi.in                            |  2 +
>>  gcc/target.def                                |  8 ++++
>>  gcc/targhooks.cc                              |  6 +++
>>  gcc/targhooks.h                               |  1 +
>>  .../gcc.target/powerpc/medium_offset.c        |  2 +-
>>  gcc/testsuite/gcc.target/powerpc/pr63281.c    | 14 +++++++
>>  gcc/testsuite/gcc.target/powerpc/pr93012.c    |  2 +-
>>  10 files changed, 84 insertions(+), 31 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
>> 
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index d7a7cfe860f..0a8f487d516 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -1361,6 +1361,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
>>  #undef TARGET_CANNOT_FORCE_CONST_MEM
>>  #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
>>  
>> +#undef TARGET_FASTER_LOADING_CONSTANT
>> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
>> +
>>  #undef TARGET_DELEGITIMIZE_ADDRESS
>>  #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
>>  
>> @@ -9684,8 +9687,8 @@ rs6000_init_stack_protect_guard (void)
>>  static bool
>>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>>  {
>> -  if (GET_CODE (x) == HIGH
>> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
>> +  /* Exclude CONSTANT HIGH part.  */
>> +  if (GET_CODE (x) == HIGH)
>>      return true;
>>  
>>    /* A TLS symbol in the TOC cannot contain a sum.  */
>> @@ -10483,6 +10486,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, machine_mode mode)
>>    return false;
>>  }
>>  
>> +
>> +/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
>> +
>> +static bool
>> +rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
>> +{
>> +  gcc_assert (CONSTANT_P (src));
>> +
>> +  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
>> +    return false;
>> +  if (GET_CODE (src) == HIGH)
>> +    return false;
>> +  if (toc_relative_expr_p (src, false, NULL, NULL))
>> +    return false;
>> +  if (rs6000_cannot_force_const_mem (mode, src))
>> +    return false;
>> +
>> +  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
>> +    return true;
>> +  if (!CONST_INT_P (src))
>> +    return true;
>> +  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
>> +}
>> +
>>  /* Emit a move from SOURCE to DEST in mode MODE.  */
>>  void
>>  rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>> @@ -10860,13 +10887,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>>  	operands[1] = create_TOC_reference (operands[1], operands[0]);
>>        else if (mode == Pmode
>>  	       && CONSTANT_P (operands[1])
>> -	       && GET_CODE (operands[1]) != HIGH
>> -	       && ((REG_P (operands[0])
>> -		    && FP_REGNO_P (REGNO (operands[0])))
>> -		   || !CONST_INT_P (operands[1])
>> -		   || (num_insns_constant (operands[1], mode)
>> -		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
>> -	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
>> +	       && rs6000_faster_loading_const (mode, operands[0], operands[1])
>>  	       && (TARGET_CMODEL == CMODEL_SMALL
>>  		   || can_create_pseudo_p ()
>>  		   || (REG_P (operands[0])
>> diff --git a/gcc/cse.cc b/gcc/cse.cc
>> index a18b599d324..53a7b3556b3 100644
>> --- a/gcc/cse.cc
>> +++ b/gcc/cse.cc
>> @@ -5192,6 +5192,22 @@ cse_insn (rtx_insn *insn)
>>  	      src_elt_regcost = elt->regcost;
>>  	    }
>>  
>> +	  /* If the current function uses a constant pool and this is a
>> +	     constant, try making a pool entry. Put it in src_folded
>> +	     unless we already have done this since that is where it
>> +	     likely came from.  */
>> +
>> +	  if (crtl->uses_const_pool && src_folded && CONSTANT_P (src_folded)
>> +	      && targetm.faster_loading_constant (mode, dest, src_folded))
>> +	    {
>> +	      src_folded = force_const_mem (mode, src_folded);
>> +	      if (src_folded)
>> +		{
>> +		  src_folded_cost = COST (src_folded, mode);
>> +		  src_folded_regcost = approx_reg_cost (src_folded);
>> +		}
>> +	    }
>> +
>>  	  /* Find cheapest and skip it for the next time.   For items
>>  	     of equal cost, use this order:
>>  	     src_folded, src, src_eqv, src_related and hash table entry.  */
>> @@ -5390,26 +5406,6 @@ cse_insn (rtx_insn *insn)
>>  
>>  	      break;
>>  	    }
>> -
>> -	  /* If the current function uses a constant pool and this is a
>> -	     constant, try making a pool entry. Put it in src_folded
>> -	     unless we already have done this since that is where it
>> -	     likely came from.  */
>> -
>> -	  else if (crtl->uses_const_pool
>> -		   && CONSTANT_P (trial)
>> -		   && !CONST_INT_P (trial)
>> -		   && (src_folded == 0 || !MEM_P (src_folded))
>> -		   && GET_MODE_CLASS (mode) != MODE_CC
>> -		   && mode != VOIDmode)
>> -	    {
>> -	      src_folded = force_const_mem (mode, trial);
>> -	      if (src_folded)
>> -		{
>> -		  src_folded_cost = COST (src_folded, mode);
>> -		  src_folded_regcost = approx_reg_cost (src_folded);
>> -		}
>> -	    }
>>  	}
>>  
>>        /* If we changed the insn too much, handle this set from scratch.  */
>> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
>> index 962bbb8caaf..65685311249 100644
>> --- a/gcc/doc/tm.texi
>> +++ b/gcc/doc/tm.texi
>> @@ -6014,6 +6014,11 @@ holding the constant.  This restriction is often true of addresses
>>  of TLS symbols for various targets.
>>  @end deftypefn
>>  
>> +@deftypefn {Target Hook} bool TARGET_FASTER_LOADING_CONSTANT (machine_mode @var{mode}, rtx @var{y}, rtx @var{x})
>> +It returns true if loading contant @var{x} is faster than building
>> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.
>> +@end deftypefn
>> +
>>  @deftypefn {Target Hook} bool TARGET_USE_BLOCKS_FOR_CONSTANT_P (machine_mode @var{mode}, const_rtx @var{x})
>>  This hook should return true if pool entries for constant @var{x} can
>>  be placed in an @code{object_block} structure.  @var{mode} is the mode
>> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
>> index 394b59e70a6..918914f0e30 100644
>> --- a/gcc/doc/tm.texi.in
>> +++ b/gcc/doc/tm.texi.in
>> @@ -4148,6 +4148,8 @@ address;  but often a machine-dependent strategy can generate better code.
>>  
>>  @hook TARGET_CANNOT_FORCE_CONST_MEM
>>  
>> +@hook TARGET_FASTER_LOADING_CONSTANT
>> +
>>  @hook TARGET_USE_BLOCKS_FOR_CONSTANT_P
>>  
>>  @hook TARGET_USE_BLOCKS_FOR_DECL_P
>> diff --git a/gcc/target.def b/gcc/target.def
>> index 57e64b20eef..8b007aca9dc 100644
>> --- a/gcc/target.def
>> +++ b/gcc/target.def
>> @@ -2987,6 +2987,14 @@ locally.  The default is to return false.",
>>   bool, (void),
>>   hook_bool_void_false)
>>  
>> +/* Check if it is profitable to load the constant from constant pool.  */
>> +DEFHOOK
>> +(faster_loading_constant,
>> + "It returns true if loading contant @var{x} is faster than building\n\
>> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.",
>> + bool, (machine_mode mode, rtx y, rtx x),
>> + default_faster_loading_constant)
>> +
>>  /* True if it is OK to do sibling call optimization for the specified
>>     call expression EXP.  DECL will be the called function, or NULL if
>>     this is an indirect call.  */
>> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
>> index fc49235eb38..12de61a7ed8 100644
>> --- a/gcc/targhooks.cc
>> +++ b/gcc/targhooks.cc
>> @@ -173,6 +173,12 @@ default_return_in_memory (const_tree type,
>>    return (TYPE_MODE (type) == BLKmode);
>>  }
>>  
>> +bool
>> +default_faster_loading_constant (machine_mode, rtx, rtx)
>> +{
>> +  return false;
>> +}
>> +
>>  rtx
>>  default_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
>>  			    machine_mode mode ATTRIBUTE_UNUSED)
>> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
>> index ecfa11287ef..4db21cd59f6 100644
>> --- a/gcc/targhooks.h
>> +++ b/gcc/targhooks.h
>> @@ -295,5 +295,6 @@ extern rtx default_memtag_extract_tag (rtx, rtx);
>>  extern rtx default_memtag_untagged_pointer (rtx, rtx);
>>  
>>  extern HOST_WIDE_INT default_gcov_type_size (void);
>> +extern bool default_faster_loading_constant (machine_mode, rtx, rtx);
>>  
>>  #endif /* GCC_TARGHOOKS_H */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/medium_offset.c b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> index f29eba08c38..4889e8fa8ec 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
>> @@ -1,7 +1,7 @@
>>  /* { dg-do compile { target { powerpc*-*-* } } } */
>>  /* { dg-require-effective-target lp64 } */
>>  /* { dg-options "-O" } */
>> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
>> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
>>  
>>  static int x;
>>  
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>> new file mode 100644
>> index 00000000000..58ba074d427
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2" } */
>> +
>> +#define CONST1 0x100803004101001
>> +#define CONST2 0x000406002105007
>> +
>> +void __attribute__ ((noinline))
>> +foo (unsigned long long *a, unsigned long long *b)
>> +{
>> +  *a = CONST1;
>> +  *b = CONST2;
>> +}
>> +
>> +/* { dg-final { scan-assembler-times {\mld|pld\M} 2 } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c b/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> index 4f764d0576f..5afb4f79c45 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
>> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
>>  unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
>>  unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
>>  
>> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
>> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
>>
  
Richard Biener Feb. 25, 2022, 8:45 a.m. UTC | #13
On Fri, 25 Feb 2022, Jiufu Guo wrote:

> Richard Biener <rguenther@suse.de> writes:
> 
> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
> >
> >> Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> 
> >> > Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> >
> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
> >> >>> I'm assuming we're always dealing with
> >> >>> 
> >> >>>   (set (reg:MODE ..) <src_folded>)
> >> >>> 
> >> >>> here and CSE is not substituting into random places of an
> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
> >> >>> to for a constant, if it should implicitely evaluate the cost
> >> >>> of putting the result into a register for example.
> >> >>
> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
> >> >> for anything that is used as input in a machine instruction -- but you
> >> >> need much more context to determine that.  insn_cost is much simpler and
> >> >> much easier to use.
> >> >>
> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
> >> >>> your proposed new target hook and comparing it with the original
> >> >>> unfolded src (again with SET and 1).
> >> >>
> >> >> It is required to generate valid instructions no matter what, before
> >> >> the pass has finished that is.  On all more modern architectures it is
> >> >> futile to think you can usefully consider the cost of an RTL expression
> >> >> and derive a real-world cost of the generated code from that.
> >> >
> >> > Thanks Segher for pointing out these!  Here is  another reason that I
> >> > did not use rtx_cost: in a few passes, there are codes to check the
> >> > constants and store them in constant pool.  I'm thinking to integerate
> >> > those codes in a consistent way.
> >> 
> >> Hi Segher, Richard!
> >> 
> >> I'm thinking the way like: For a constant,
> >> 1. if the constant could be used as an immediate for the
> >> instruction, then retreated as an operand;
> >> 2. otherwise if the constant can not be stored into a
> >> constant pool, then handle through instructions;
> >> 3. if it is faster to access constant from pool, then emit
> >> constant as data(.rodata);
> >> 4. otherwise, handle the constant by instructions.
> >> 
> >> And to store the constant into a pool, besides force_const_mem,
> >> create reference (toc) may be needed on some platforms.
> >> 
> >> For this particular issue in CSE, there is already code that
> >> tries to put constant into a pool (invoke force_const_mem).
> >> While the code is too late.  So, we may check the constant
> >> earlier and store it into constant pool if profitable.
> >> 
> >> And another thing as Segher pointed out, CSE is doing too
> >> much work.  It may be ok to separate the constant handling
> >> logic from CSE.
> >
> > Not sure - CSE just is value numbering, I don't see that it does
> > more than that.  Yes, it might have developed "heuristics" over
> > the years what to CSE and to what and where to substitute and
> > where not.  But in the end it does just value numbering.
> >
> >> 
> >> I update a new version patch as follow (did not seprate CSE):
> >
> > How is the new target hook better in any way compared to rtx_cost
> > or insn_cost?  It looks like a total hack.
> >
> > I suppose the actual way of materializing a constant is done
> > behind GCCs back and not exposed anywhere?  But instead you
> > claim the constants are valid when they actually are not?
> > Isn't the problem then that the rs6000 backend lies?
> 
> Hi Richard,
> 
> Thanks for your comments and sugguestions!
> 
> Materializing a constant should be done behind GCC.
> On rs6000, in expand pass, during emit_move, the constant is
> checked and store into constant pool if necessary.
> Some other platforms are doing a similar thing, e.g.
> ix86_expand_vector_move, alpha_expand_mov,...
> mips_legitimize_const_move.
> 
> But, it does not as we expect, force_const_mem is also 
> exposed other places (not only ira/reload for stack reference).
> 
> CSE is one place, for example, CSE first retrieve the constant
> from insn's equal, but it also tries to put constant into
> pool for some condition (the condition was introduced at
> early age of cse.c, and it is rare to run into in trunk).
> In some aspects, IMHO, this seems not a great work of CSE.
> 
> And this is how the 'invalid(or say slow)' constant occurs.
> e.g.  before cse:
>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>       REG_EQUAL 0x100803004101001
> after cse: 
>     7: r119:DI=0x100803004101001
>       REG_EQUAL 0x100803004101001
> 
> As you pointed out: we can also avoid this transformation through
> rtx_cost/insn_cost by estimating the COST more accurately for
>  "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
> be a real final instruction.)

At which point does this become the final instruction?  I suppose
CSE belives this constant is legitimate and the insn is recognized
just fine in CSE?  (that's what I mean with "behind GCCs back")

> Is it necessary to refine this constant handling for CSE, or even
> to eliminate the logic on constant extracting for an insn, beside
> updating rtx_cost/insn_cost?

So if CSE really just transforms

>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>       REG_EQUAL 0x100803004101001

to

>     7: r119:DI=0x100803004101001
>       REG_EQUAL 0x100803004101001

then why can't rtx_cost (SET, 1) or insn_cost () be used to
accurately tell it that the immediate is going to be a lot
more expensive?

That is, rtx_cost (CONST_INT-rtx, DI, SET, 1, ...) is accurate
enough to be treated as an actual insn (it _might_ be part of
a parallel I guess, but that's unlikely) and thus the backend
should have a very good idea of the many instruction it needs
for this.  Likewise rtx_cost ([unspec[`*.LC0',%r2:DI] 47], DI, SET, 1, 
...)
should receive accurate cost that can be compared to other
rtx_cost (..., DI, SET, 1, ...)

And CSE doesn't even need to create fake insns here since IMHO
rtx_cost is good enough to capture the full insn.  Targets can
choose to split out a set_cost from their rtx_cost/insn_cost
hooks for this case for example.

Richard.

> Any sugguestions?
> 
> >
> > Btw, all of this is of course not appropriate for stage4 and changes
> > to CSE need testing on more than one target.
> I would do more evaluation, thanks!
> 
> Jiufu
> 
> >
> > Richard.
> >
> >> Thanks for the comments and suggestions again!
> >> 
> >> 
> >> BR,
> >> Jiufu
> >> 
> >> ---
> >>  gcc/config/rs6000/rs6000.cc                   | 39 ++++++++++++++-----
> >>  gcc/cse.cc                                    | 36 ++++++++---------
> >>  gcc/doc/tm.texi                               |  5 +++
> >>  gcc/doc/tm.texi.in                            |  2 +
> >>  gcc/target.def                                |  8 ++++
> >>  gcc/targhooks.cc                              |  6 +++
> >>  gcc/targhooks.h                               |  1 +
> >>  .../gcc.target/powerpc/medium_offset.c        |  2 +-
> >>  gcc/testsuite/gcc.target/powerpc/pr63281.c    | 14 +++++++
> >>  gcc/testsuite/gcc.target/powerpc/pr93012.c    |  2 +-
> >>  10 files changed, 84 insertions(+), 31 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
> >> 
> >> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> >> index d7a7cfe860f..0a8f487d516 100644
> >> --- a/gcc/config/rs6000/rs6000.cc
> >> +++ b/gcc/config/rs6000/rs6000.cc
> >> @@ -1361,6 +1361,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
> >>  #undef TARGET_CANNOT_FORCE_CONST_MEM
> >>  #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
> >>  
> >> +#undef TARGET_FASTER_LOADING_CONSTANT
> >> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
> >> +
> >>  #undef TARGET_DELEGITIMIZE_ADDRESS
> >>  #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
> >>  
> >> @@ -9684,8 +9687,8 @@ rs6000_init_stack_protect_guard (void)
> >>  static bool
> >>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
> >>  {
> >> -  if (GET_CODE (x) == HIGH
> >> -      && GET_CODE (XEXP (x, 0)) == UNSPEC)
> >> +  /* Exclude CONSTANT HIGH part.  */
> >> +  if (GET_CODE (x) == HIGH)
> >>      return true;
> >>  
> >>    /* A TLS symbol in the TOC cannot contain a sum.  */
> >> @@ -10483,6 +10486,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, machine_mode mode)
> >>    return false;
> >>  }
> >>  
> >> +
> >> +/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
> >> +
> >> +static bool
> >> +rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
> >> +{
> >> +  gcc_assert (CONSTANT_P (src));
> >> +
> >> +  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
> >> +    return false;
> >> +  if (GET_CODE (src) == HIGH)
> >> +    return false;
> >> +  if (toc_relative_expr_p (src, false, NULL, NULL))
> >> +    return false;
> >> +  if (rs6000_cannot_force_const_mem (mode, src))
> >> +    return false;
> >> +
> >> +  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
> >> +    return true;
> >> +  if (!CONST_INT_P (src))
> >> +    return true;
> >> +  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
> >> +}
> >> +
> >>  /* Emit a move from SOURCE to DEST in mode MODE.  */
> >>  void
> >>  rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
> >> @@ -10860,13 +10887,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
> >>  	operands[1] = create_TOC_reference (operands[1], operands[0]);
> >>        else if (mode == Pmode
> >>  	       && CONSTANT_P (operands[1])
> >> -	       && GET_CODE (operands[1]) != HIGH
> >> -	       && ((REG_P (operands[0])
> >> -		    && FP_REGNO_P (REGNO (operands[0])))
> >> -		   || !CONST_INT_P (operands[1])
> >> -		   || (num_insns_constant (operands[1], mode)
> >> -		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
> >> -	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
> >> +	       && rs6000_faster_loading_const (mode, operands[0], operands[1])
> >>  	       && (TARGET_CMODEL == CMODEL_SMALL
> >>  		   || can_create_pseudo_p ()
> >>  		   || (REG_P (operands[0])
> >> diff --git a/gcc/cse.cc b/gcc/cse.cc
> >> index a18b599d324..53a7b3556b3 100644
> >> --- a/gcc/cse.cc
> >> +++ b/gcc/cse.cc
> >> @@ -5192,6 +5192,22 @@ cse_insn (rtx_insn *insn)
> >>  	      src_elt_regcost = elt->regcost;
> >>  	    }
> >>  
> >> +	  /* If the current function uses a constant pool and this is a
> >> +	     constant, try making a pool entry. Put it in src_folded
> >> +	     unless we already have done this since that is where it
> >> +	     likely came from.  */
> >> +
> >> +	  if (crtl->uses_const_pool && src_folded && CONSTANT_P (src_folded)
> >> +	      && targetm.faster_loading_constant (mode, dest, src_folded))
> >> +	    {
> >> +	      src_folded = force_const_mem (mode, src_folded);
> >> +	      if (src_folded)
> >> +		{
> >> +		  src_folded_cost = COST (src_folded, mode);
> >> +		  src_folded_regcost = approx_reg_cost (src_folded);
> >> +		}
> >> +	    }
> >> +
> >>  	  /* Find cheapest and skip it for the next time.   For items
> >>  	     of equal cost, use this order:
> >>  	     src_folded, src, src_eqv, src_related and hash table entry.  */
> >> @@ -5390,26 +5406,6 @@ cse_insn (rtx_insn *insn)
> >>  
> >>  	      break;
> >>  	    }
> >> -
> >> -	  /* If the current function uses a constant pool and this is a
> >> -	     constant, try making a pool entry. Put it in src_folded
> >> -	     unless we already have done this since that is where it
> >> -	     likely came from.  */
> >> -
> >> -	  else if (crtl->uses_const_pool
> >> -		   && CONSTANT_P (trial)
> >> -		   && !CONST_INT_P (trial)
> >> -		   && (src_folded == 0 || !MEM_P (src_folded))
> >> -		   && GET_MODE_CLASS (mode) != MODE_CC
> >> -		   && mode != VOIDmode)
> >> -	    {
> >> -	      src_folded = force_const_mem (mode, trial);
> >> -	      if (src_folded)
> >> -		{
> >> -		  src_folded_cost = COST (src_folded, mode);
> >> -		  src_folded_regcost = approx_reg_cost (src_folded);
> >> -		}
> >> -	    }
> >>  	}
> >>  
> >>        /* If we changed the insn too much, handle this set from scratch.  */
> >> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> >> index 962bbb8caaf..65685311249 100644
> >> --- a/gcc/doc/tm.texi
> >> +++ b/gcc/doc/tm.texi
> >> @@ -6014,6 +6014,11 @@ holding the constant.  This restriction is often true of addresses
> >>  of TLS symbols for various targets.
> >>  @end deftypefn
> >>  
> >> +@deftypefn {Target Hook} bool TARGET_FASTER_LOADING_CONSTANT (machine_mode @var{mode}, rtx @var{y}, rtx @var{x})
> >> +It returns true if loading contant @var{x} is faster than building
> >> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.
> >> +@end deftypefn
> >> +
> >>  @deftypefn {Target Hook} bool TARGET_USE_BLOCKS_FOR_CONSTANT_P (machine_mode @var{mode}, const_rtx @var{x})
> >>  This hook should return true if pool entries for constant @var{x} can
> >>  be placed in an @code{object_block} structure.  @var{mode} is the mode
> >> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> >> index 394b59e70a6..918914f0e30 100644
> >> --- a/gcc/doc/tm.texi.in
> >> +++ b/gcc/doc/tm.texi.in
> >> @@ -4148,6 +4148,8 @@ address;  but often a machine-dependent strategy can generate better code.
> >>  
> >>  @hook TARGET_CANNOT_FORCE_CONST_MEM
> >>  
> >> +@hook TARGET_FASTER_LOADING_CONSTANT
> >> +
> >>  @hook TARGET_USE_BLOCKS_FOR_CONSTANT_P
> >>  
> >>  @hook TARGET_USE_BLOCKS_FOR_DECL_P
> >> diff --git a/gcc/target.def b/gcc/target.def
> >> index 57e64b20eef..8b007aca9dc 100644
> >> --- a/gcc/target.def
> >> +++ b/gcc/target.def
> >> @@ -2987,6 +2987,14 @@ locally.  The default is to return false.",
> >>   bool, (void),
> >>   hook_bool_void_false)
> >>  
> >> +/* Check if it is profitable to load the constant from constant pool.  */
> >> +DEFHOOK
> >> +(faster_loading_constant,
> >> + "It returns true if loading contant @var{x} is faster than building\n\
> >> +from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.",
> >> + bool, (machine_mode mode, rtx y, rtx x),
> >> + default_faster_loading_constant)
> >> +
> >>  /* True if it is OK to do sibling call optimization for the specified
> >>     call expression EXP.  DECL will be the called function, or NULL if
> >>     this is an indirect call.  */
> >> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> >> index fc49235eb38..12de61a7ed8 100644
> >> --- a/gcc/targhooks.cc
> >> +++ b/gcc/targhooks.cc
> >> @@ -173,6 +173,12 @@ default_return_in_memory (const_tree type,
> >>    return (TYPE_MODE (type) == BLKmode);
> >>  }
> >>  
> >> +bool
> >> +default_faster_loading_constant (machine_mode, rtx, rtx)
> >> +{
> >> +  return false;
> >> +}
> >> +
> >>  rtx
> >>  default_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
> >>  			    machine_mode mode ATTRIBUTE_UNUSED)
> >> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> >> index ecfa11287ef..4db21cd59f6 100644
> >> --- a/gcc/targhooks.h
> >> +++ b/gcc/targhooks.h
> >> @@ -295,5 +295,6 @@ extern rtx default_memtag_extract_tag (rtx, rtx);
> >>  extern rtx default_memtag_untagged_pointer (rtx, rtx);
> >>  
> >>  extern HOST_WIDE_INT default_gcov_type_size (void);
> >> +extern bool default_faster_loading_constant (machine_mode, rtx, rtx);
> >>  
> >>  #endif /* GCC_TARGHOOKS_H */
> >> diff --git a/gcc/testsuite/gcc.target/powerpc/medium_offset.c b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> >> index f29eba08c38..4889e8fa8ec 100644
> >> --- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> >> +++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
> >> @@ -1,7 +1,7 @@
> >>  /* { dg-do compile { target { powerpc*-*-* } } } */
> >>  /* { dg-require-effective-target lp64 } */
> >>  /* { dg-options "-O" } */
> >> -/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
> >> +/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
> >>  
> >>  static int x;
> >>  
> >> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> >> new file mode 100644
> >> index 00000000000..58ba074d427
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
> >> @@ -0,0 +1,14 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-options "-O2" } */
> >> +
> >> +#define CONST1 0x100803004101001
> >> +#define CONST2 0x000406002105007
> >> +
> >> +void __attribute__ ((noinline))
> >> +foo (unsigned long long *a, unsigned long long *b)
> >> +{
> >> +  *a = CONST1;
> >> +  *b = CONST2;
> >> +}
> >> +
> >> +/* { dg-final { scan-assembler-times {\mld|pld\M} 2 } } */
> >> diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> >> index 4f764d0576f..5afb4f79c45 100644
> >> --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
> >> +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
> >> @@ -10,4 +10,4 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
> >>  unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
> >>  unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
> >>  
> >> -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
> >> +/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */
> >> 
>
  
Jiufu Guo Feb. 25, 2022, 1:32 p.m. UTC | #14
Richard Biener <rguenther@suse.de> writes:

> On Fri, 25 Feb 2022, Jiufu Guo wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> 
>> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> >
>> >> Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> 
>> >> > Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >> >
>> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> >> >>> I'm assuming we're always dealing with
>> >> >>> 
>> >> >>>   (set (reg:MODE ..) <src_folded>)
>> >> >>> 
>> >> >>> here and CSE is not substituting into random places of an
>> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> >> >>> to for a constant, if it should implicitely evaluate the cost
>> >> >>> of putting the result into a register for example.
>> >> >>
>> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> >> >> for anything that is used as input in a machine instruction -- but you
>> >> >> need much more context to determine that.  insn_cost is much simpler and
>> >> >> much easier to use.
>> >> >>
>> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
>> >> >>> your proposed new target hook and comparing it with the original
>> >> >>> unfolded src (again with SET and 1).
>> >> >>
>> >> >> It is required to generate valid instructions no matter what, before
>> >> >> the pass has finished that is.  On all more modern architectures it is
>> >> >> futile to think you can usefully consider the cost of an RTL expression
>> >> >> and derive a real-world cost of the generated code from that.
>> >> >
>> >> > Thanks Segher for pointing out these!  Here is  another reason that I
>> >> > did not use rtx_cost: in a few passes, there are codes to check the
>> >> > constants and store them in constant pool.  I'm thinking to integerate
>> >> > those codes in a consistent way.
>> >> 
>> >> Hi Segher, Richard!
>> >> 
>> >> I'm thinking the way like: For a constant,
>> >> 1. if the constant could be used as an immediate for the
>> >> instruction, then retreated as an operand;
>> >> 2. otherwise if the constant can not be stored into a
>> >> constant pool, then handle through instructions;
>> >> 3. if it is faster to access constant from pool, then emit
>> >> constant as data(.rodata);
>> >> 4. otherwise, handle the constant by instructions.
>> >> 
>> >> And to store the constant into a pool, besides force_const_mem,
>> >> create reference (toc) may be needed on some platforms.
>> >> 
>> >> For this particular issue in CSE, there is already code that
>> >> tries to put constant into a pool (invoke force_const_mem).
>> >> While the code is too late.  So, we may check the constant
>> >> earlier and store it into constant pool if profitable.
>> >> 
>> >> And another thing as Segher pointed out, CSE is doing too
>> >> much work.  It may be ok to separate the constant handling
>> >> logic from CSE.
>> >
>> > Not sure - CSE just is value numbering, I don't see that it does
>> > more than that.  Yes, it might have developed "heuristics" over
>> > the years what to CSE and to what and where to substitute and
>> > where not.  But in the end it does just value numbering.
>> >
>> >> 
>> >> I update a new version patch as follow (did not seprate CSE):
>> >
>> > How is the new target hook better in any way compared to rtx_cost
>> > or insn_cost?  It looks like a total hack.
>> >
>> > I suppose the actual way of materializing a constant is done
>> > behind GCCs back and not exposed anywhere?  But instead you
>> > claim the constants are valid when they actually are not?
>> > Isn't the problem then that the rs6000 backend lies?
>> 
>> Hi Richard,
>> 
>> Thanks for your comments and sugguestions!
>> 
>> Materializing a constant should be done behind GCC.
>> On rs6000, in expand pass, during emit_move, the constant is
>> checked and store into constant pool if necessary.
>> Some other platforms are doing a similar thing, e.g.
>> ix86_expand_vector_move, alpha_expand_mov,...
>> mips_legitimize_const_move.
>> 
>> But, it does not as we expect, force_const_mem is also 
>> exposed other places (not only ira/reload for stack reference).
>> 
>> CSE is one place, for example, CSE first retrieve the constant
>> from insn's equal, but it also tries to put constant into
>> pool for some condition (the condition was introduced at
>> early age of cse.c, and it is rare to run into in trunk).
>> In some aspects, IMHO, this seems not a great work of CSE.
>> 
>> And this is how the 'invalid(or say slow)' constant occurs.
>> e.g.  before cse:
>>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>>       REG_EQUAL 0x100803004101001
>> after cse: 
>>     7: r119:DI=0x100803004101001
>>       REG_EQUAL 0x100803004101001
>> 
>> As you pointed out: we can also avoid this transformation through
>> rtx_cost/insn_cost by estimating the COST more accurately for
>>  "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
>> be a real final instruction.)
>
> At which point does this become the final instruction?  I suppose
> CSE belives this constant is legitimate and the insn is recognized
> just fine in CSE?  (that's what I mean with "behind GCCs back")
>
It would become final instructions during split pass on rs6000,
other target, e.g. alpha, seem also do split it.
>> Is it necessary to refine this constant handling for CSE, or even
>> to eliminate the logic on constant extracting for an insn, beside
>> updating rtx_cost/insn_cost?
>
> So if CSE really just transforms
>
>>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>>       REG_EQUAL 0x100803004101001
>
> to
>
>>     7: r119:DI=0x100803004101001
>>       REG_EQUAL 0x100803004101001
>
> then why can't rtx_cost (SET, 1) or insn_cost () be used to
> accurately tell it that the immediate is going to be a lot
> more expensive?
>
> That is, rtx_cost (CONST_INT-rtx, DI, SET, 1, ...) is accurate
> enough to be treated as an actual insn (it _might_ be part of
> a parallel I guess, but that's unlikely) and thus the backend
> should have a very good idea of the many instruction it needs
> for this.  Likewise rtx_cost ([unspec[`*.LC0',%r2:DI] 47], DI, SET, 1, 
> ...)
> should receive accurate cost that can be compared to other
> rtx_cost (..., DI, SET, 1, ...)
>
> And CSE doesn't even need to create fake insns here since IMHO
> rtx_cost is good enough to capture the full insn.  Targets can
> choose to split out a set_cost from their rtx_cost/insn_cost
> hooks for this case for example.

Hi Richard,

Right, we can fix this issue by updating rtx_cost/insn_cost to
tell that this kind of constants is a lot of expansive.

To update rtx_cost, we can use a trivial patch (shown as end of
this mail) to fix this particular issue.
To use insn_cost, additional work is replacing rtx_cost with
insn_cost for CSE, maybe more suitable for stage1.

So, it would be too far to fix this by refactoring the logic of
constant handling. :-)

Thanks for your comments!

BR,
Jiufu

>
> Richard.
>
>> Any sugguestions?
>> 
>> >
>> > Btw, all of this is of course not appropriate for stage4 and changes
>> > to CSE need testing on more than one target.
>> I would do more evaluation, thanks!
>> 
>> Jiufu
>> 
>> >
>> > Richard.
>> >
>> >> Thanks for the comments and suggestions again!
>> >> 
>> >> 
>> >> BR,
>> >> Jiufu
>> >> 

Part of the experimental patch for rs6000, which could help
to mitigate inaccurate rtx cost on constant.

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index e82a47f4c0e..e429ae2bcf0 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -21884,6 +21884,14 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
 
     case CONST_DOUBLE:
     case CONST_WIDE_INT:
+      /* Set const to a reg, it may needs a few insns.  */
+      if (outer_code == SET)
+	{
+	  *total = COSTS_N_INSNS (num_insns_constant (x, mode));
+	  return true;
+	}
+      /* FALLTHRU */
+
     case CONST:
     case HIGH:
     case SYMBOL_REF:
  
Richard Biener Feb. 25, 2022, 1:57 p.m. UTC | #15
On Fri, 25 Feb 2022, Jiufu Guo wrote:

> Richard Biener <rguenther@suse.de> writes:
> 
> > On Fri, 25 Feb 2022, Jiufu Guo wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> 
> >> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
> >> >
> >> >> Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> 
> >> >> > Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> >> >
> >> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
> >> >> >>> I'm assuming we're always dealing with
> >> >> >>> 
> >> >> >>>   (set (reg:MODE ..) <src_folded>)
> >> >> >>> 
> >> >> >>> here and CSE is not substituting into random places of an
> >> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
> >> >> >>> to for a constant, if it should implicitely evaluate the cost
> >> >> >>> of putting the result into a register for example.
> >> >> >>
> >> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
> >> >> >> for anything that is used as input in a machine instruction -- but you
> >> >> >> need much more context to determine that.  insn_cost is much simpler and
> >> >> >> much easier to use.
> >> >> >>
> >> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
> >> >> >>> your proposed new target hook and comparing it with the original
> >> >> >>> unfolded src (again with SET and 1).
> >> >> >>
> >> >> >> It is required to generate valid instructions no matter what, before
> >> >> >> the pass has finished that is.  On all more modern architectures it is
> >> >> >> futile to think you can usefully consider the cost of an RTL expression
> >> >> >> and derive a real-world cost of the generated code from that.
> >> >> >
> >> >> > Thanks Segher for pointing out these!  Here is  another reason that I
> >> >> > did not use rtx_cost: in a few passes, there are codes to check the
> >> >> > constants and store them in constant pool.  I'm thinking to integerate
> >> >> > those codes in a consistent way.
> >> >> 
> >> >> Hi Segher, Richard!
> >> >> 
> >> >> I'm thinking the way like: For a constant,
> >> >> 1. if the constant could be used as an immediate for the
> >> >> instruction, then retreated as an operand;
> >> >> 2. otherwise if the constant can not be stored into a
> >> >> constant pool, then handle through instructions;
> >> >> 3. if it is faster to access constant from pool, then emit
> >> >> constant as data(.rodata);
> >> >> 4. otherwise, handle the constant by instructions.
> >> >> 
> >> >> And to store the constant into a pool, besides force_const_mem,
> >> >> create reference (toc) may be needed on some platforms.
> >> >> 
> >> >> For this particular issue in CSE, there is already code that
> >> >> tries to put constant into a pool (invoke force_const_mem).
> >> >> While the code is too late.  So, we may check the constant
> >> >> earlier and store it into constant pool if profitable.
> >> >> 
> >> >> And another thing as Segher pointed out, CSE is doing too
> >> >> much work.  It may be ok to separate the constant handling
> >> >> logic from CSE.
> >> >
> >> > Not sure - CSE just is value numbering, I don't see that it does
> >> > more than that.  Yes, it might have developed "heuristics" over
> >> > the years what to CSE and to what and where to substitute and
> >> > where not.  But in the end it does just value numbering.
> >> >
> >> >> 
> >> >> I update a new version patch as follow (did not seprate CSE):
> >> >
> >> > How is the new target hook better in any way compared to rtx_cost
> >> > or insn_cost?  It looks like a total hack.
> >> >
> >> > I suppose the actual way of materializing a constant is done
> >> > behind GCCs back and not exposed anywhere?  But instead you
> >> > claim the constants are valid when they actually are not?
> >> > Isn't the problem then that the rs6000 backend lies?
> >> 
> >> Hi Richard,
> >> 
> >> Thanks for your comments and sugguestions!
> >> 
> >> Materializing a constant should be done behind GCC.
> >> On rs6000, in expand pass, during emit_move, the constant is
> >> checked and store into constant pool if necessary.
> >> Some other platforms are doing a similar thing, e.g.
> >> ix86_expand_vector_move, alpha_expand_mov,...
> >> mips_legitimize_const_move.
> >> 
> >> But, it does not as we expect, force_const_mem is also 
> >> exposed other places (not only ira/reload for stack reference).
> >> 
> >> CSE is one place, for example, CSE first retrieve the constant
> >> from insn's equal, but it also tries to put constant into
> >> pool for some condition (the condition was introduced at
> >> early age of cse.c, and it is rare to run into in trunk).
> >> In some aspects, IMHO, this seems not a great work of CSE.
> >> 
> >> And this is how the 'invalid(or say slow)' constant occurs.
> >> e.g.  before cse:
> >>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
> >>       REG_EQUAL 0x100803004101001
> >> after cse: 
> >>     7: r119:DI=0x100803004101001
> >>       REG_EQUAL 0x100803004101001
> >> 
> >> As you pointed out: we can also avoid this transformation through
> >> rtx_cost/insn_cost by estimating the COST more accurately for
> >>  "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
> >> be a real final instruction.)
> >
> > At which point does this become the final instruction?  I suppose
> > CSE belives this constant is legitimate and the insn is recognized
> > just fine in CSE?  (that's what I mean with "behind GCCs back")
> >
> It would become final instructions during split pass on rs6000,
> other target, e.g. alpha, seem also do split it.
> >> Is it necessary to refine this constant handling for CSE, or even
> >> to eliminate the logic on constant extracting for an insn, beside
> >> updating rtx_cost/insn_cost?
> >
> > So if CSE really just transforms
> >
> >>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
> >>       REG_EQUAL 0x100803004101001
> >
> > to
> >
> >>     7: r119:DI=0x100803004101001
> >>       REG_EQUAL 0x100803004101001
> >
> > then why can't rtx_cost (SET, 1) or insn_cost () be used to
> > accurately tell it that the immediate is going to be a lot
> > more expensive?
> >
> > That is, rtx_cost (CONST_INT-rtx, DI, SET, 1, ...) is accurate
> > enough to be treated as an actual insn (it _might_ be part of
> > a parallel I guess, but that's unlikely) and thus the backend
> > should have a very good idea of the many instruction it needs
> > for this.  Likewise rtx_cost ([unspec[`*.LC0',%r2:DI] 47], DI, SET, 1, 
> > ...)
> > should receive accurate cost that can be compared to other
> > rtx_cost (..., DI, SET, 1, ...)
> >
> > And CSE doesn't even need to create fake insns here since IMHO
> > rtx_cost is good enough to capture the full insn.  Targets can
> > choose to split out a set_cost from their rtx_cost/insn_cost
> > hooks for this case for example.
> 
> Hi Richard,
> 
> Right, we can fix this issue by updating rtx_cost/insn_cost to
> tell that this kind of constants is a lot of expansive.
> 
> To update rtx_cost, we can use a trivial patch (shown as end of
> this mail) to fix this particular issue.
> To use insn_cost, additional work is replacing rtx_cost with
> insn_cost for CSE, maybe more suitable for stage1.
> 
> So, it would be too far to fix this by refactoring the logic of
> constant handling. :-)
> 
> Thanks for your comments!
> 
> BR,
> Jiufu
> 
> >
> > Richard.
> >
> >> Any sugguestions?
> >> 
> >> >
> >> > Btw, all of this is of course not appropriate for stage4 and changes
> >> > to CSE need testing on more than one target.
> >> I would do more evaluation, thanks!
> >> 
> >> Jiufu
> >> 
> >> >
> >> > Richard.
> >> >
> >> >> Thanks for the comments and suggestions again!
> >> >> 
> >> >> 
> >> >> BR,
> >> >> Jiufu
> >> >> 
> 
> Part of the experimental patch for rs6000, which could help
> to mitigate inaccurate rtx cost on constant.

Yes, that sort of thing is exactly what I had in mind.  I assume
the corresponding insn_costs hook does the same?

Richard.

> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index e82a47f4c0e..e429ae2bcf0 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -21884,6 +21884,14 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>  
>      case CONST_DOUBLE:
>      case CONST_WIDE_INT:
> +      /* Set const to a reg, it may needs a few insns.  */
> +      if (outer_code == SET)
> +	{
> +	  *total = COSTS_N_INSNS (num_insns_constant (x, mode));
> +	  return true;
> +	}
> +      /* FALLTHRU */
> +
>      case CONST:
>      case HIGH:
>      case SYMBOL_REF:
  
Jiufu Guo Feb. 28, 2022, 9:15 a.m. UTC | #16
Richard Biener <rguenther@suse.de> writes:

> On Fri, 25 Feb 2022, Jiufu Guo wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> 
>> > On Fri, 25 Feb 2022, Jiufu Guo wrote:
>> >
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> 
>> >> > On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> >> >
>> >> >> Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> >> 
>> >> >> > Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >> >> >
>> >> >> >> On Wed, Feb 23, 2022 at 02:02:59PM +0100, Richard Biener wrote:
>> >> >> >>> I'm assuming we're always dealing with
>> >> >> >>> 
>> >> >> >>>   (set (reg:MODE ..) <src_folded>)
>> >> >> >>> 
>> >> >> >>> here and CSE is not substituting into random places of an
>> >> >> >>> instruction(?).  I don't know what 'rtx_cost' should evaluate
>> >> >> >>> to for a constant, if it should implicitely evaluate the cost
>> >> >> >>> of putting the result into a register for example.
>> >> >> >>
>> >> >> >> rtx_cost is no good here (and in most places).  rtx_cost should be 0
>> >> >> >> for anything that is used as input in a machine instruction -- but you
>> >> >> >> need much more context to determine that.  insn_cost is much simpler and
>> >> >> >> much easier to use.
>> >> >> >>
>> >> >> >>> Using RTX_COST with SET and 1 at least looks no worse than using
>> >> >> >>> your proposed new target hook and comparing it with the original
>> >> >> >>> unfolded src (again with SET and 1).
>> >> >> >>
>> >> >> >> It is required to generate valid instructions no matter what, before
>> >> >> >> the pass has finished that is.  On all more modern architectures it is
>> >> >> >> futile to think you can usefully consider the cost of an RTL expression
>> >> >> >> and derive a real-world cost of the generated code from that.
>> >> >> >
>> >> >> > Thanks Segher for pointing out these!  Here is  another reason that I
>> >> >> > did not use rtx_cost: in a few passes, there are codes to check the
>> >> >> > constants and store them in constant pool.  I'm thinking to integerate
>> >> >> > those codes in a consistent way.
>> >> >> 
>> >> >> Hi Segher, Richard!
>> >> >> 
>> >> >> I'm thinking the way like: For a constant,
>> >> >> 1. if the constant could be used as an immediate for the
>> >> >> instruction, then retreated as an operand;
>> >> >> 2. otherwise if the constant can not be stored into a
>> >> >> constant pool, then handle through instructions;
>> >> >> 3. if it is faster to access constant from pool, then emit
>> >> >> constant as data(.rodata);
>> >> >> 4. otherwise, handle the constant by instructions.
>> >> >> 
>> >> >> And to store the constant into a pool, besides force_const_mem,
>> >> >> create reference (toc) may be needed on some platforms.
>> >> >> 
>> >> >> For this particular issue in CSE, there is already code that
>> >> >> tries to put constant into a pool (invoke force_const_mem).
>> >> >> While the code is too late.  So, we may check the constant
>> >> >> earlier and store it into constant pool if profitable.
>> >> >> 
>> >> >> And another thing as Segher pointed out, CSE is doing too
>> >> >> much work.  It may be ok to separate the constant handling
>> >> >> logic from CSE.
>> >> >
>> >> > Not sure - CSE just is value numbering, I don't see that it does
>> >> > more than that.  Yes, it might have developed "heuristics" over
>> >> > the years what to CSE and to what and where to substitute and
>> >> > where not.  But in the end it does just value numbering.
>> >> >
>> >> >> 
>> >> >> I update a new version patch as follow (did not seprate CSE):
>> >> >
>> >> > How is the new target hook better in any way compared to rtx_cost
>> >> > or insn_cost?  It looks like a total hack.
>> >> >
>> >> > I suppose the actual way of materializing a constant is done
>> >> > behind GCCs back and not exposed anywhere?  But instead you
>> >> > claim the constants are valid when they actually are not?
>> >> > Isn't the problem then that the rs6000 backend lies?
>> >> 
>> >> Hi Richard,
>> >> 
>> >> Thanks for your comments and sugguestions!
>> >> 
>> >> Materializing a constant should be done behind GCC.
>> >> On rs6000, in expand pass, during emit_move, the constant is
>> >> checked and store into constant pool if necessary.
>> >> Some other platforms are doing a similar thing, e.g.
>> >> ix86_expand_vector_move, alpha_expand_mov,...
>> >> mips_legitimize_const_move.
>> >> 
>> >> But, it does not as we expect, force_const_mem is also 
>> >> exposed other places (not only ira/reload for stack reference).
>> >> 
>> >> CSE is one place, for example, CSE first retrieve the constant
>> >> from insn's equal, but it also tries to put constant into
>> >> pool for some condition (the condition was introduced at
>> >> early age of cse.c, and it is rare to run into in trunk).
>> >> In some aspects, IMHO, this seems not a great work of CSE.
>> >> 
>> >> And this is how the 'invalid(or say slow)' constant occurs.
>> >> e.g.  before cse:
>> >>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>> >>       REG_EQUAL 0x100803004101001
>> >> after cse: 
>> >>     7: r119:DI=0x100803004101001
>> >>       REG_EQUAL 0x100803004101001
>> >> 
>> >> As you pointed out: we can also avoid this transformation through
>> >> rtx_cost/insn_cost by estimating the COST more accurately for
>> >>  "r119:DI=0x100803004101001". (r119:DI=0x100803004101001 will not
>> >> be a real final instruction.)
>> >
>> > At which point does this become the final instruction?  I suppose
>> > CSE belives this constant is legitimate and the insn is recognized
>> > just fine in CSE?  (that's what I mean with "behind GCCs back")
>> >
>> It would become final instructions during split pass on rs6000,
>> other target, e.g. alpha, seem also do split it.
>> >> Is it necessary to refine this constant handling for CSE, or even
>> >> to eliminate the logic on constant extracting for an insn, beside
>> >> updating rtx_cost/insn_cost?
>> >
>> > So if CSE really just transforms
>> >
>> >>     7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
>> >>       REG_EQUAL 0x100803004101001
>> >
>> > to
>> >
>> >>     7: r119:DI=0x100803004101001
>> >>       REG_EQUAL 0x100803004101001
>> >
>> > then why can't rtx_cost (SET, 1) or insn_cost () be used to
>> > accurately tell it that the immediate is going to be a lot
>> > more expensive?
>> >
>> > That is, rtx_cost (CONST_INT-rtx, DI, SET, 1, ...) is accurate
>> > enough to be treated as an actual insn (it _might_ be part of
>> > a parallel I guess, but that's unlikely) and thus the backend
>> > should have a very good idea of the many instruction it needs
>> > for this.  Likewise rtx_cost ([unspec[`*.LC0',%r2:DI] 47], DI, SET, 1, 
>> > ...)
>> > should receive accurate cost that can be compared to other
>> > rtx_cost (..., DI, SET, 1, ...)
>> >
>> > And CSE doesn't even need to create fake insns here since IMHO
>> > rtx_cost is good enough to capture the full insn.  Targets can
>> > choose to split out a set_cost from their rtx_cost/insn_cost
>> > hooks for this case for example.
>> 
>> Hi Richard,
>> 
>> Right, we can fix this issue by updating rtx_cost/insn_cost to
>> tell that this kind of constants is a lot of expansive.
>> 
>> To update rtx_cost, we can use a trivial patch (shown as end of
>> this mail) to fix this particular issue.
>> To use insn_cost, additional work is replacing rtx_cost with
>> insn_cost for CSE, maybe more suitable for stage1.
>> 
>> So, it would be too far to fix this by refactoring the logic of
>> constant handling. :-)
>> 
>> Thanks for your comments!
>> 
>> BR,
>> Jiufu
>> 
>> >
>> > Richard.
>> >
>> >> Any sugguestions?
>> >> 
>> >> >
>> >> > Btw, all of this is of course not appropriate for stage4 and changes
>> >> > to CSE need testing on more than one target.
>> >> I would do more evaluation, thanks!
>> >> 
>> >> Jiufu
>> >> 
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks for the comments and suggestions again!
>> >> >> 
>> >> >> 
>> >> >> BR,
>> >> >> Jiufu
>> >> >> 
>> 
>> Part of the experimental patch for rs6000, which could help
>> to mitigate inaccurate rtx cost on constant.
>
> Yes, that sort of thing is exactly what I had in mind.  I assume
> the corresponding insn_costs hook does the same?

Hi Richard!

Thanks for your comments!

On most targets, insn_cost also uses code like COSTS_N_INSNS, while
insn_cost is also checking insn's attributes (like length,cost, size).
Since cse is using rtx_cost, to use insn_cost, we would need to
construct an insn first.

For this idea, I hacked a patch, which is using insn_cost partially,
and could be more aggressively enhanced.

BR,
Jiufu

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..b6ca1283b18 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -22131,6 +22131,16 @@ rs6000_debug_rtx_costs (rtx x, machine_mode mode, int outer_code,
 static int
 rs6000_insn_cost (rtx_insn *insn, bool speed)
 {
+  /* Handle special 'invalid' insn. */
+  rtx set = PATTERN (insn);
+  if (GET_CODE (set) == SET && CONSTANT_P (SET_SRC (set)))
+    {
+      rtx src = SET_SRC (set);
+      machine_mode mode = GET_MODE (SET_DEST (set));
+      if (CONST_INT_P (src) || CONST_WIDE_INT_P (src) || CONST_DOUBLE_P (src))
+	return COSTS_N_INSNS (num_insns_constant (src, mode));
+    }
+  
   if (recog_memoized (insn) < 0)
     return 0;
 
diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..53fdb3fff58 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -456,6 +456,8 @@ struct table_elt
   (REG_P (X) ? 0 : notreg_cost (X, MODE, SET, 1))
 #define COST_IN(X, MODE, OUTER, OPNO)					\
   (REG_P (X) ? 0 : notreg_cost (X, MODE, OUTER, OPNO))
+#define COST_SRC(I, X, MODE)				                \
+  (REG_P (X) ? 0 : notreg_cost (X, MODE, SET, 1, I))
 
 /* Get the number of times this register has been updated in this
    basic block.  */
@@ -523,7 +525,8 @@ static bitmap cse_ebb_live_in, cse_ebb_live_out;
 static sbitmap cse_visited_basic_blocks;
 
 static bool fixed_base_plus_p (rtx x);
-static int notreg_cost (rtx, machine_mode, enum rtx_code, int);
+static int notreg_cost (rtx, machine_mode, enum rtx_code, int, rtx_insn*);
+static int insn_cost_x (rtx_insn *, rtx);
 static int preferable (int, int, int, int);
 static void new_basic_block (void);
 static void make_new_qty (unsigned int, machine_mode);
@@ -698,7 +701,8 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
    from COST macro to keep it simple.  */
 
 static int
-notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
+notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno,
+	     rtx_insn *insn = NULL)
 {
   scalar_int_mode int_mode, inner_mode;
   return ((GET_CODE (x) == SUBREG
@@ -709,9 +713,20 @@ notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
 	   && subreg_lowpart_p (x)
 	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
 	  ? 0
+	  : insn != NULL ? insn_cost_x (insn, x)
 	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
 }
 
+/* Internal function, to get cost when use X to replace source of insn
+   which is a SET.  */
+
+static int
+insn_cost_x (rtx_insn *insn, rtx x)
+{
+  SET_SRC (PATTERN (insn)) = x;
+  return insn_cost (insn, optimize_this_for_speed_p);
+}
+
 
 /* Initialize CSE_REG_INFO_TABLE.  */
 
@@ -4603,6 +4618,7 @@ cse_insn (rtx_insn *insn)
 
      Nothing in this loop changes the hash table or the register chains.  */
 
+  rtx_insn *tmp_insn = NULL;
   for (i = 0; i < n_sets; i++)
     {
       bool repeat = false;
@@ -4638,6 +4654,10 @@ cse_insn (rtx_insn *insn)
       mode = GET_MODE (src) == VOIDmode ? GET_MODE (dest) : GET_MODE (src);
       sets[i].mode = mode;
 
+      if (tmp_insn == NULL_RTX && src && dest && dest != pc_rtx
+	  && src != pc_rtx)
+	tmp_insn = gen_move_insn (copy_rtx (dest), src);
+
       if (src_eqv)
 	{
 	  machine_mode eqvmode = mode;
@@ -5103,7 +5123,7 @@ cse_insn (rtx_insn *insn)
 	    src_cost = src_regcost = -1;
 	  else
 	    {
-	      src_cost = COST (src, mode);
+	      src_cost = COST_SRC (tmp_insn, src, mode);
 	      src_regcost = approx_reg_cost (src);
 	    }
 	}
@@ -5114,7 +5134,7 @@ cse_insn (rtx_insn *insn)
 	    src_eqv_cost = src_eqv_regcost = -1;
 	  else
 	    {
-	      src_eqv_cost = COST (src_eqv_here, mode);
+	      src_eqv_cost = COST_SRC (tmp_insn, src_eqv_here, mode);
 	      src_eqv_regcost = approx_reg_cost (src_eqv_here);
 	    }
 	}
@@ -5125,7 +5145,7 @@ cse_insn (rtx_insn *insn)
 	    src_folded_cost = src_folded_regcost = -1;
 	  else
 	    {
-	      src_folded_cost = COST (src_folded, mode);
+	      src_folded_cost = COST_SRC (tmp_insn, src_folded, mode);
 	      src_folded_regcost = approx_reg_cost (src_folded);
 	    }
 	}
@@ -5136,7 +5156,7 @@ cse_insn (rtx_insn *insn)
 	    src_related_cost = src_related_regcost = -1;
 	  else
 	    {
-	      src_related_cost = COST (src_related, mode);
+	      src_related_cost = COST_SRC (tmp_insn, src_related, mode);
 	      src_related_regcost = approx_reg_cost (src_related);
 
 	      /* If a const-anchor is used to synthesize a constant that
@@ -5406,7 +5426,7 @@ cse_insn (rtx_insn *insn)
 	      src_folded = force_const_mem (mode, trial);
 	      if (src_folded)
 		{
-		  src_folded_cost = COST (src_folded, mode);
+		  src_folded_cost = COST_SRC (tmp_insn, src_folded, mode);
 		  src_folded_regcost = approx_reg_cost (src_folded);
 		}
 	    }
@@ -5460,7 +5480,9 @@ cse_insn (rtx_insn *insn)
 		  /* If we had a constant that is cheaper than what we are now
 		     setting SRC to, use that constant.  We ignored it when we
 		     thought we could make this into a no-op.  */
-		  if (src_const && COST (src_const, mode) < COST (src, mode)
+		  if (src_const
+		      && COST_SRC (tmp_insn, src_const, mode) 
+			   < COST_SRC (tmp_insn, src, mode)
 		      && validate_change (insn, &SET_SRC (sets[i].rtl),
 					  src_const, 0))
 		    src = src_const;

>
> Richard.
>
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index e82a47f4c0e..e429ae2bcf0 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -21884,6 +21884,14 @@ rs6000_rtx_costs (rtx x, machine_mode mode, int outer_code,
>>  
>>      case CONST_DOUBLE:
>>      case CONST_WIDE_INT:
>> +      /* Set const to a reg, it may needs a few insns.  */
>> +      if (outer_code == SET)
>> +	{
>> +	  *total = COSTS_N_INSNS (num_insns_constant (x, mode));
>> +	  return true;
>> +	}
>> +      /* FALLTHRU */
>> +
>>      case CONST:
>>      case HIGH:
>>      case SYMBOL_REF:
  
Segher Boessenkool Feb. 28, 2022, 4:45 p.m. UTC | #17
Hi!

On Thu, Feb 24, 2022 at 03:48:54PM +0800, Jiufu Guo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > That is the problem yes.  You need insns to call insn_cost on.  You can
> > look in combine.c:combine_validate_cost to see how this can be done; but
> > you need to have some code to generate in the first place, and for CSE
> > it isn't always clear what code to generate, it really is based on RTL
> > expressions having a cost.
> 
> Hi Segher,
> 
> Thanks! combine_validate_cost is useful to help me on
> evaluating the costs of several instructions or replacements.
> 
> As you pointed out, at CSE, it may not be clear to know what
> extact insn sequences will be generated. Actually,  the same
> issue also exists on RTL expression.  At CSE, it may not clear
> the exact cost, since the real instructions maybe emitted in
> very late passes.

But there will be RTL insns already.  Those may not correspond 1-1 to
the eventual machine insns (ideally they do most of the time though),
but it should be possible to estimate pretty accurate costs for them.

Costs are never exact anyway, it is only one number, while in reality
there are many dimensions.

> To get the accurate cost, we may analyze the constant in the
> hook(insn_cost or rtx_cost) and estimate the possible final
> instructions and then calculate the costs.

Yes.

> We discussed one idea: let the hook insn_cost accept
> any interim instruction, and estimate the real instruction
> base on the interim insn, and then return the estimated
> costs.

No.  insn_cost is only for correct, existing instructions, not for
made-up nonsense.  I created insn_cost precisely to get away from that
aspect of rtx_cost (and some other issues, like, it is incredibly hard
and cumbersome to write a correct rtx_cost).

> For example: input insn "r119:DI=0x100803004101001" to
> insn_cost; and in rs6000_insn_cost (for ppc), analyze
> constant "0x100803004101001" which would need 5 insns;
> then rs6000_insn_cost sumarize the cost of 5 insns.
> 
> A minor concern: because we know that reading this
> constant from the pool is faster than building it by insns,
> we will generate instructions to load constant from the pool
> finally, do not emit 5 real instructions to build the value.
> So, we are more interested in if it is faster to load from
> pool or not.

That is one reason why it is better to generate (close to) machine
insns as early as possible: it makes it much easier to estimate
realistic costs.  (Another important reason is it allows other
optimisations, without us having to do any work for it!)


Segher
  
Segher Boessenkool Feb. 28, 2022, 5:03 p.m. UTC | #18
On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
> On Thu, 24 Feb 2022, Jiufu Guo wrote:
> > And another thing as Segher pointed out, CSE is doing too
> > much work.  It may be ok to separate the constant handling
> > logic from CSE.
> 
> Not sure - CSE just is value numbering, I don't see that it does
> more than that.  Yes, it might have developed "heuristics" over
> the years what to CSE and to what and where to substitute and
> where not.  But in the end it does just value numbering.

It also does various micro-optimisations, like all the CC things it
does.

It is not very good at doing the CSE job, but it cannot easily be
replaced by a better implementation because it does many other small
optimisations (that are not done elsewhere).


Segher
  
Jiufu Guo March 1, 2022, 2:59 a.m. UTC | #19
Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
>> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> > And another thing as Segher pointed out, CSE is doing too
>> > much work.  It may be ok to separate the constant handling
>> > logic from CSE.
>> 
>> Not sure - CSE just is value numbering, I don't see that it does
>> more than that.  Yes, it might have developed "heuristics" over
>> the years what to CSE and to what and where to substitute and
>> where not.  But in the end it does just value numbering.
>
> It also does various micro-optimisations, like all the CC things it
> does.
>
> It is not very good at doing the CSE job, but it cannot easily be
> replaced by a better implementation because it does many other small
> optimisations (that are not done elsewhere).
>

Thanks a lot for these comments! I'm also wondering if we would
rewrite this cse.cc or refactor it in some aspects.

BR,
Jiufu

>
> Segher
  
Richard Biener March 1, 2022, 7:47 a.m. UTC | #20
On Tue, 1 Mar 2022, Jiufu Guo wrote:

> Segher Boessenkool <segher@kernel.crashing.org> writes:
> 
> > On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
> >> On Thu, 24 Feb 2022, Jiufu Guo wrote:
> >> > And another thing as Segher pointed out, CSE is doing too
> >> > much work.  It may be ok to separate the constant handling
> >> > logic from CSE.
> >> 
> >> Not sure - CSE just is value numbering, I don't see that it does
> >> more than that.  Yes, it might have developed "heuristics" over
> >> the years what to CSE and to what and where to substitute and
> >> where not.  But in the end it does just value numbering.
> >
> > It also does various micro-optimisations, like all the CC things it
> > does.
> >
> > It is not very good at doing the CSE job, but it cannot easily be
> > replaced by a better implementation because it does many other small
> > optimisations (that are not done elsewhere).
> >
> 
> Thanks a lot for these comments! I'm also wondering if we would
> rewrite this cse.cc or refactor it in some aspects.

I think time is better spent elsewhere ... I don't think CSE is as
bad as Segher depicts it - it might do "CC things" and other bits
but in the end that's going to be instruction/expression combination
things that "fit" likely because a value lattice (or just nonzero bits
in the cselib variant) existed.

So what might be interesting would be to work towards cleansing
CSE of those, producing testcases and making sure a better fit
pass (combine? fwprop? compare-elim?) performs the desired
optimization.

But I'm not really sure what Segher is talking about - I suppose
it must be magic done inside cselib (which only does analysis),
not in cse.cc itself.

Richard.

> BR,
> Jiufu
> 
> >
> > Segher
>
  
Jiufu Guo March 1, 2022, 1:47 p.m. UTC | #21
Richard Biener <rguenther@suse.de> writes:

> On Tue, 1 Mar 2022, Jiufu Guo wrote:
>
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> 
>> > On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
>> >> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>> >> > And another thing as Segher pointed out, CSE is doing too
>> >> > much work.  It may be ok to separate the constant handling
>> >> > logic from CSE.
>> >> 
>> >> Not sure - CSE just is value numbering, I don't see that it does
>> >> more than that.  Yes, it might have developed "heuristics" over
>> >> the years what to CSE and to what and where to substitute and
>> >> where not.  But in the end it does just value numbering.
>> >
>> > It also does various micro-optimisations, like all the CC things it
>> > does.
>> >
>> > It is not very good at doing the CSE job, but it cannot easily be
>> > replaced by a better implementation because it does many other small
>> > optimisations (that are not done elsewhere).
>> >
>> 
>> Thanks a lot for these comments! I'm also wondering if we would
>> rewrite this cse.cc or refactor it in some aspects.
>
> I think time is better spent elsewhere ... I don't think CSE is as
> bad as Segher depicts it - it might do "CC things" and other bits
> but in the end that's going to be instruction/expression combination
> things that "fit" likely because a value lattice (or just nonzero bits
> in the cselib variant) existed.
>
> So what might be interesting would be to work towards cleansing
> CSE of those, producing testcases and making sure a better fit
> pass (combine? fwprop? compare-elim?) performs the desired
> optimization.
Hi Richard,

I'm not sure. I guess, those micro-optimizations may be also the
the things which could be move to other pass, like fwprop/combining...
Segher would correct me. :-)
cse.cc contains the code to take care of constant/ const_anchors,
expr folding, condition code/jump...

BR,
Jiufu 

>
> But I'm not really sure what Segher is talking about - I suppose
> it must be magic done inside cselib (which only does analysis),
> not in cse.cc itself.
>
> Richard.
>
>> BR,
>> Jiufu
>> 
>> >
>> > Segher
>>
  
Jiufu Guo March 1, 2022, 2:28 p.m. UTC | #22
Segher Boessenkool <segher@kernel.crashing.org> writes:

Hi!

> Hi!
>
> On Thu, Feb 24, 2022 at 03:48:54PM +0800, Jiufu Guo wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > That is the problem yes.  You need insns to call insn_cost on.  You can
>> > look in combine.c:combine_validate_cost to see how this can be done; but
>> > you need to have some code to generate in the first place, and for CSE
>> > it isn't always clear what code to generate, it really is based on RTL
>> > expressions having a cost.
>> 
>> Hi Segher,
>> 
>> Thanks! combine_validate_cost is useful to help me on
>> evaluating the costs of several instructions or replacements.
>> 
>> As you pointed out, at CSE, it may not be clear to know what
>> extact insn sequences will be generated. Actually,  the same
>> issue also exists on RTL expression.  At CSE, it may not clear
>> the exact cost, since the real instructions maybe emitted in
>> very late passes.
>
> But there will be RTL insns already.  Those may not correspond 1-1 to
> the eventual machine insns (ideally they do most of the time though),
> but it should be possible to estimate pretty accurate costs for them.
>
> Costs are never exact anyway, it is only one number, while in reality
> there are many dimensions.
>
>> To get the accurate cost, we may analyze the constant in the
>> hook(insn_cost or rtx_cost) and estimate the possible final
>> instructions and then calculate the costs.
>
> Yes.
>
>> We discussed one idea: let the hook insn_cost accept
>> any interim instruction, and estimate the real instruction
>> base on the interim insn, and then return the estimated
>> costs.
>
> No.  insn_cost is only for correct, existing instructions, not for
> made-up nonsense.  I created insn_cost precisely to get away from that
> aspect of rtx_cost (and some other issues, like, it is incredibly hard
> and cumbersome to write a correct rtx_cost).

Thanks! The implementations of hook insn_cost are align with this
design, they are  checking insn's attributes and COSTS_N_INSNS.

One question on the speciall case: 
For instruction: "r119:DI=0x100803004101001"
Would we treat it as valid instruction?

A patch, which is attached the end of this mail, accepts
"r119:DI=0x100803004101001" as input of insn_cost.
In this patch, 
- A tmp instruction is generated via make_insn_raw.
- A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
- In hook of insn_cost, checking the special 'constant' instruction.
Are these make sense?

>
>> For example: input insn "r119:DI=0x100803004101001" to
>> insn_cost; and in rs6000_insn_cost (for ppc), analyze
>> constant "0x100803004101001" which would need 5 insns;
>> then rs6000_insn_cost sumarize the cost of 5 insns.
>> 
>> A minor concern: because we know that reading this
>> constant from the pool is faster than building it by insns,
>> we will generate instructions to load constant from the pool
>> finally, do not emit 5 real instructions to build the value.
>> So, we are more interested in if it is faster to load from
>> pool or not.
>
> That is one reason why it is better to generate (close to) machine
> insns as early as possible: it makes it much easier to estimate
> realistic costs.  (Another important reason is it allows other
> optimisations, without us having to do any work for it!)
Get it!  In the middle of an optimization pass, 'interim'
instruction maybe acceptable.  While it would better to outputs
only contains 'valid machine insn' from any RTL passes.

BR,
Jiufu
>
>
> Segher


diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..1184e6ad65b 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -22131,6 +22131,16 @@ rs6000_debug_rtx_costs (rtx x, machine_mode mode, int outer_code,
 static int
 rs6000_insn_cost (rtx_insn *insn, bool speed)
 {
+  /* Handle special 'constant int' insn. */
+  rtx set = PATTERN (insn);
+  if (GET_CODE (set) == SET && CONSTANT_P (SET_SRC (set)))
+    {
+      rtx src = SET_SRC (set);
+      machine_mode mode = GET_MODE (SET_DEST (set));
+      if (CONST_INT_P (src) || CONST_WIDE_INT_P (src))
+	return COSTS_N_INSNS (num_insns_constant (src, mode));
+    }
+  
   if (recog_memoized (insn) < 0)
     return 0;
 
diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..4705ad82084 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -456,6 +456,8 @@ struct table_elt
   (REG_P (X) ? 0 : notreg_cost (X, MODE, SET, 1))
 #define COST_IN(X, MODE, OUTER, OPNO)					\
   (REG_P (X) ? 0 : notreg_cost (X, MODE, OUTER, OPNO))
+#define COST_SRC(I, X, MODE)				                \
+  (REG_P (X) ? 0 : notreg_cost (X, MODE, SET, 1, I))
 
 /* Get the number of times this register has been updated in this
    basic block.  */
@@ -523,7 +525,8 @@ static bitmap cse_ebb_live_in, cse_ebb_live_out;
 static sbitmap cse_visited_basic_blocks;
 
 static bool fixed_base_plus_p (rtx x);
-static int notreg_cost (rtx, machine_mode, enum rtx_code, int);
+static int notreg_cost (rtx, machine_mode, enum rtx_code, int, rtx_insn*);
+static int insn_cost_x (rtx_insn *, rtx);
 static int preferable (int, int, int, int);
 static void new_basic_block (void);
 static void make_new_qty (unsigned int, machine_mode);
@@ -698,7 +701,8 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
    from COST macro to keep it simple.  */
 
 static int
-notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
+notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno,
+	     rtx_insn *insn = NULL)
 {
   scalar_int_mode int_mode, inner_mode;
   return ((GET_CODE (x) == SUBREG
@@ -709,9 +713,21 @@ notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
 	   && subreg_lowpart_p (x)
 	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
 	  ? 0
+	  : insn != NULL ? insn_cost_x (insn, x)
 	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
 }
 
+/* Internal function, to get cost when use X to replace source of insn
+   which is a SET.  */
+
+static int
+insn_cost_x (rtx_insn *insn, rtx x)
+{
+  INSN_CODE (insn) = -1;
+  SET_SRC (PATTERN (insn)) = x;
+  return insn_cost (insn, optimize_this_for_speed_p);
+}
+
 
 /* Initialize CSE_REG_INFO_TABLE.  */
 
@@ -4603,6 +4619,7 @@ cse_insn (rtx_insn *insn)
 
      Nothing in this loop changes the hash table or the register chains.  */
 
+  rtx_insn *tmp_insn = NULL;
   for (i = 0; i < n_sets; i++)
     {
       bool repeat = false;
@@ -4638,6 +4655,10 @@ cse_insn (rtx_insn *insn)
       mode = GET_MODE (src) == VOIDmode ? GET_MODE (dest) : GET_MODE (src);
       sets[i].mode = mode;
 
+      if (tmp_insn == NULL_RTX && src && dest && dest != pc_rtx
+	  && src != pc_rtx)
+	tmp_insn = make_insn_raw (gen_rtx_SET (copy_rtx (dest), copy_rtx(src)));
+
       if (src_eqv)
 	{
 	  machine_mode eqvmode = mode;
@@ -5103,7 +5124,7 @@ cse_insn (rtx_insn *insn)
 	    src_cost = src_regcost = -1;
 	  else
 	    {
-	      src_cost = COST (src, mode);
+	      src_cost = COST_SRC (tmp_insn, src, mode);
 	      src_regcost = approx_reg_cost (src);
 	    }
 	}
@@ -5114,7 +5135,7 @@ cse_insn (rtx_insn *insn)
 	    src_eqv_cost = src_eqv_regcost = -1;
 	  else
 	    {
-	      src_eqv_cost = COST (src_eqv_here, mode);
+	      src_eqv_cost = COST_SRC (tmp_insn, src_eqv_here, mode);
 	      src_eqv_regcost = approx_reg_cost (src_eqv_here);
 	    }
 	}
@@ -5125,7 +5146,7 @@ cse_insn (rtx_insn *insn)
 	    src_folded_cost = src_folded_regcost = -1;
 	  else
 	    {
-	      src_folded_cost = COST (src_folded, mode);
+	      src_folded_cost = COST_SRC (tmp_insn, src_folded, mode);
 	      src_folded_regcost = approx_reg_cost (src_folded);
 	    }
 	}
@@ -5136,7 +5157,7 @@ cse_insn (rtx_insn *insn)
 	    src_related_cost = src_related_regcost = -1;
 	  else
 	    {
-	      src_related_cost = COST (src_related, mode);
+	      src_related_cost = COST_SRC (tmp_insn, src_related, mode);
 	      src_related_regcost = approx_reg_cost (src_related);
 
 	      /* If a const-anchor is used to synthesize a constant that
@@ -5406,7 +5427,7 @@ cse_insn (rtx_insn *insn)
 	      src_folded = force_const_mem (mode, trial);
 	      if (src_folded)
 		{
-		  src_folded_cost = COST (src_folded, mode);
+		  src_folded_cost = COST_SRC (tmp_insn, src_folded, mode);
 		  src_folded_regcost = approx_reg_cost (src_folded);
 		}
 	    }
@@ -5460,7 +5481,9 @@ cse_insn (rtx_insn *insn)
 		  /* If we had a constant that is cheaper than what we are now
 		     setting SRC to, use that constant.  We ignored it when we
 		     thought we could make this into a no-op.  */
-		  if (src_const && COST (src_const, mode) < COST (src, mode)
+		  if (src_const
+		      && COST_SRC (tmp_insn, src_const, mode) 
+			   < COST_SRC (tmp_insn, src, mode)
 		      && validate_change (insn, &SET_SRC (sets[i].rtl),
 					  src_const, 0))
 		    src = src_const;
  
Jeff Law March 2, 2022, 7:15 p.m. UTC | #23
On 3/1/2022 12:47 AM, Richard Biener via Gcc-patches wrote:
> On Tue, 1 Mar 2022, Jiufu Guo wrote:
>
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>>
>>> On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
>>>> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>>>>> And another thing as Segher pointed out, CSE is doing too
>>>>> much work.  It may be ok to separate the constant handling
>>>>> logic from CSE.
>>>> Not sure - CSE just is value numbering, I don't see that it does
>>>> more than that.  Yes, it might have developed "heuristics" over
>>>> the years what to CSE and to what and where to substitute and
>>>> where not.  But in the end it does just value numbering.
>>> It also does various micro-optimisations, like all the CC things it
>>> does.
>>>
>>> It is not very good at doing the CSE job, but it cannot easily be
>>> replaced by a better implementation because it does many other small
>>> optimisations (that are not done elsewhere).
>>>
>> Thanks a lot for these comments! I'm also wondering if we would
>> rewrite this cse.cc or refactor it in some aspects.
> I think time is better spent elsewhere ... I don't think CSE is as
> bad as Segher depicts it - it might do "CC things" and other bits
> but in the end that's going to be instruction/expression combination
> things that "fit" likely because a value lattice (or just nonzero bits
> in the cselib variant) existed.
>
> So what might be interesting would be to work towards cleansing
> CSE of those, producing testcases and making sure a better fit
> pass (combine? fwprop? compare-elim?) performs the desired
> optimization.
>
> But I'm not really sure what Segher is talking about - I suppose
> it must be magic done inside cselib (which only does analysis),
> not in cse.cc itself.
I also don't see cse.cc has being *that* bad.     It's largely the same 
bits as we had before SSA and I wouldn't be surprised if there are 
things in there that aren't needed anymore.

For example, while we have excised HAVE_cc0, I think there are still 
remnants of those concepts lying around (see this_insn_cc0 and friends).

I'd always hoped we'd get to a point where we could eliminate the 
follow-jumps and cse-around-loops bits, but we never managed to 
accomplish that.  When I last looked (a long long time ago), there were 
still things that got exposed when we lowered from gimple to RTL that 
were picked up by those options.   Changing to work  with dominators to 
find EBBs would be a nice cleanup, but the code as it stands right now 
works.

I wouldn't be surprised if the costing stuff could use some serious 
cleanup.  But I don't see it as inherently broken, just very dated.

But in the end, it's really just value-numbering as you say.  It could 
be rewritten to be more modern, but I doubt it's going to make much of a 
real difference in the end.  If there are things that fit logically 
elsehwere, sure move them where they more logically fit, but I doubt 
there's a lot of this stuff.

jeff
  
Segher Boessenkool March 2, 2022, 8:24 p.m. UTC | #24
On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > No.  insn_cost is only for correct, existing instructions, not for
> > made-up nonsense.  I created insn_cost precisely to get away from that
> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
> > and cumbersome to write a correct rtx_cost).
> 
> Thanks! The implementations of hook insn_cost are align with this
> design, they are  checking insn's attributes and COSTS_N_INSNS.
> 
> One question on the speciall case: 
> For instruction: "r119:DI=0x100803004101001"
> Would we treat it as valid instruction?

Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
This is costed as 5 insns (cost=20).

It generally is better to split things into patterns close to the
eventual machine isntructions as early as possible: all the more generic
optimisations can take advantage of that then.

> A patch, which is attached the end of this mail, accepts
> "r119:DI=0x100803004101001" as input of insn_cost.
> In this patch, 
> - A tmp instruction is generated via make_insn_raw.
> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
> - In hook of insn_cost, checking the special 'constant' instruction.
> Are these make sense?

I'll review that patch inline.

> > That is one reason why it is better to generate (close to) machine
> > insns as early as possible: it makes it much easier to estimate
> > realistic costs.  (Another important reason is it allows other
> > optimisations, without us having to do any work for it!)
> Get it!  In the middle of an optimization pass, 'interim'
> instruction maybe acceptable.  While it would better to outputs
> only contains 'valid machine insn' from any RTL passes.

Acceptable only if there is a very good reason for it, really :-(

> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -22131,6 +22131,16 @@ rs6000_debug_rtx_costs (rtx x, machine_mode mode, int outer_code,
>  static int
>  rs6000_insn_cost (rtx_insn *insn, bool speed)
>  {
> +  /* Handle special 'constant int' insn. */
> +  rtx set = PATTERN (insn);
> +  if (GET_CODE (set) == SET && CONSTANT_P (SET_SRC (set)))
> +    {
> +      rtx src = SET_SRC (set);
> +      machine_mode mode = GET_MODE (SET_DEST (set));
> +      if (CONST_INT_P (src) || CONST_WIDE_INT_P (src))
> +	return COSTS_N_INSNS (num_insns_constant (src, mode));
> +    }
> +  
>    if (recog_memoized (insn) < 0)
>      return 0;

Why would such a set not recog()?

Needs a comment in any case, to say what this is a workaround for.

> +static int insn_cost_x (rtx_insn *, rtx);

Don't declare functions, just put their definitions before their first
use.  (And use a better name please :-) )

>  static int
> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
> +notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno,
> +	     rtx_insn *insn = NULL)

Don't use default arguments like this, it is an abomination.

> @@ -709,9 +713,21 @@ notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
>  	   && subreg_lowpart_p (x)
>  	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>  	  ? 0
> +	  : insn != NULL ? insn_cost_x (insn, x)
>  	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
>  }

You can just always use insn_cost?  insn_cost -> pattern_cost ->
set_src_cost -> rtx_cost.  That works for COST at least, not sure about
COST_IN, maybe that needs a little more care (cse.c works with invalid
insns all over the place :-( )

>  
> +/* Internal function, to get cost when use X to replace source of insn
> +   which is a SET.  */
> +
> +static int
> +insn_cost_x (rtx_insn *insn, rtx x)
> +{
> +  INSN_CODE (insn) = -1;
> +  SET_SRC (PATTERN (insn)) = x;
> +  return insn_cost (insn, optimize_this_for_speed_p);
> +}

You need to restore stuff as well?

> @@ -4603,6 +4619,7 @@ cse_insn (rtx_insn *insn)
>  
>       Nothing in this loop changes the hash table or the register chains.  */
>  
> +  rtx_insn *tmp_insn = NULL;
>    for (i = 0; i < n_sets; i++)
>      {
>        bool repeat = false;
> @@ -4638,6 +4655,10 @@ cse_insn (rtx_insn *insn)
>        mode = GET_MODE (src) == VOIDmode ? GET_MODE (dest) : GET_MODE (src);
>        sets[i].mode = mode;
>  
> +      if (tmp_insn == NULL_RTX && src && dest && dest != pc_rtx
> +	  && src != pc_rtx)
> +	tmp_insn = make_insn_raw (gen_rtx_SET (copy_rtx (dest), copy_rtx(src)));

src and dest are always non-nil here.  I'll have to read the code better
to know about the (pc) stuff.

> @@ -5103,7 +5124,7 @@ cse_insn (rtx_insn *insn)
>  	    src_cost = src_regcost = -1;
>  	  else
>  	    {
> -	      src_cost = COST (src, mode);
> +	      src_cost = COST_SRC (tmp_insn, src, mode);

I think you can just leave this as COST?


Segher
  
Jiufu Guo March 3, 2022, 10:08 a.m. UTC | #25
Hi,

Jeff Law <jeffreyalaw@gmail.com> writes:

> On 3/1/2022 12:47 AM, Richard Biener via Gcc-patches wrote:
>> On Tue, 1 Mar 2022, Jiufu Guo wrote:
>>
>>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>>>
>>>> On Thu, Feb 24, 2022 at 09:50:28AM +0100, Richard Biener wrote:
>>>>> On Thu, 24 Feb 2022, Jiufu Guo wrote:
>>>>>> And another thing as Segher pointed out, CSE is doing too
>>>>>> much work.  It may be ok to separate the constant handling
>>>>>> logic from CSE.
>>>>> Not sure - CSE just is value numbering, I don't see that it does
>>>>> more than that.  Yes, it might have developed "heuristics" over
>>>>> the years what to CSE and to what and where to substitute and
>>>>> where not.  But in the end it does just value numbering.
>>>> It also does various micro-optimisations, like all the CC things it
>>>> does.
>>>>
>>>> It is not very good at doing the CSE job, but it cannot easily be
>>>> replaced by a better implementation because it does many other small
>>>> optimisations (that are not done elsewhere).
>>>>
>>> Thanks a lot for these comments! I'm also wondering if we would
>>> rewrite this cse.cc or refactor it in some aspects.
>> I think time is better spent elsewhere ... I don't think CSE is as
>> bad as Segher depicts it - it might do "CC things" and other bits
>> but in the end that's going to be instruction/expression combination
>> things that "fit" likely because a value lattice (or just nonzero bits
>> in the cselib variant) existed.
>>
>> So what might be interesting would be to work towards cleansing
>> CSE of those, producing testcases and making sure a better fit
>> pass (combine? fwprop? compare-elim?) performs the desired
>> optimization.
>>
>> But I'm not really sure what Segher is talking about - I suppose
>> it must be magic done inside cselib (which only does analysis),
>> not in cse.cc itself.
> I also don't see cse.cc has being *that* bad.     It's largely the
> same bits as we had before SSA and I wouldn't be surprised if there
> are things in there that aren't needed anymore.
>
> For example, while we have excised HAVE_cc0, I think there are still
> remnants of those concepts lying around (see this_insn_cc0 and
> friends).
>
> I'd always hoped we'd get to a point where we could eliminate the
> follow-jumps and cse-around-loops bits, but we never managed to
> accomplish that.  When I last looked (a long long time ago), there
> were still things that got exposed when we lowered from gimple to RTL
> that were picked up by those options.   Changing to work  with
> dominators to find EBBs would be a nice cleanup, but the code as it
> stands right now works.
>
> I wouldn't be surprised if the costing stuff could use some serious
> cleanup.  But I don't see it as inherently broken, just very dated.
>
> But in the end, it's really just value-numbering as you say.  It could
> be rewritten to be more modern, but I doubt it's going to make much of
> a real difference in the end.  If there are things that fit logically
> elsehwere, sure move them where they more logically fit, but I doubt
> there's a lot of this stuff.
>
> jeff

Thanks for your comments and suggestions!

I had a quick test about cse1/cse2 on spec2017. Compare with "-O3",
we can see both positive and negative impacts for "-O3
-fdisable-rtl-cse1 -fdisable-rtl-cse2".  we can see performance gain on
500.perlbench_r(+1.83%),  538.imagick_r(0.7%) 520.omnetpp_r(+0.6%);
and performance recession on  548.exchange2_r(-1.97%) 557.xz_r(-0.7%)
on Power9.  The performance change on 500.perlbench_r would relate
to the behavior of 'constant loading'.

The performance impactions may be directly or indirectly caused by
sub-behaviors of current cse.cc.  And the data would change on
on different targets. Anyway, this data may also indicate that
we could clean up or enhance some functionalities in cse.cc.

BR,
Jiufu
  
Jiufu Guo March 3, 2022, 10:09 a.m. UTC | #26
Hi Sehger,

Segher Boessenkool <segher@kernel.crashing.org> writes:

> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > No.  insn_cost is only for correct, existing instructions, not for
>> > made-up nonsense.  I created insn_cost precisely to get away from that
>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
>> > and cumbersome to write a correct rtx_cost).
>> 
>> Thanks! The implementations of hook insn_cost are align with this
>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>> 
>> One question on the speciall case: 
>> For instruction: "r119:DI=0x100803004101001"
>> Would we treat it as valid instruction?
>
> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
> This is costed as 5 insns (cost=20).
>
> It generally is better to split things into patterns close to the
> eventual machine isntructions as early as possible: all the more generic
> optimisations can take advantage of that then.
Get it!
>
>> A patch, which is attached the end of this mail, accepts
>> "r119:DI=0x100803004101001" as input of insn_cost.
>> In this patch, 
>> - A tmp instruction is generated via make_insn_raw.
>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>> - In hook of insn_cost, checking the special 'constant' instruction.
>> Are these make sense?
>
> I'll review that patch inline.
>
>> > That is one reason why it is better to generate (close to) machine
>> > insns as early as possible: it makes it much easier to estimate
>> > realistic costs.  (Another important reason is it allows other
>> > optimisations, without us having to do any work for it!)
>> Get it!  In the middle of an optimization pass, 'interim'
>> instruction maybe acceptable.  While it would better to outputs
>> only contains 'valid machine insn' from any RTL passes.
>
> Acceptable only if there is a very good reason for it, really :-(
>
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -22131,6 +22131,16 @@ rs6000_debug_rtx_costs (rtx x, machine_mode mode, int outer_code,
>>  static int
>>  rs6000_insn_cost (rtx_insn *insn, bool speed)
>>  {
>> +  /* Handle special 'constant int' insn. */
>> +  rtx set = PATTERN (insn);
>> +  if (GET_CODE (set) == SET && CONSTANT_P (SET_SRC (set)))
>> +    {
>> +      rtx src = SET_SRC (set);
>> +      machine_mode mode = GET_MODE (SET_DEST (set));
>> +      if (CONST_INT_P (src) || CONST_WIDE_INT_P (src))
>> +	return COSTS_N_INSNS (num_insns_constant (src, mode));
>> +    }
>> +  
>>    if (recog_memoized (insn) < 0)
>>      return 0;
>
> Why would such a set not recog()?
Thanks.  This code is not need at the top of function insn_cost.
recog_memoized could check insn_code on 'insn'.
>
> Needs a comment in any case, to say what this is a workaround for.
>
>> +static int insn_cost_x (rtx_insn *, rtx);
>
> Don't declare functions, just put their definitions before their first
> use.  (And use a better name please :-) )
Get it. :-)
>
>>  static int
>> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
>> +notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno,
>> +	     rtx_insn *insn = NULL)
>
> Don't use default arguments like this, it is an abomination.
Thanks.
>
>> @@ -709,9 +713,21 @@ notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
>>  	   && subreg_lowpart_p (x)
>>  	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>>  	  ? 0
>> +	  : insn != NULL ? insn_cost_x (insn, x)
>>  	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
>>  }
>
> You can just always use insn_cost?  insn_cost -> pattern_cost ->
> set_src_cost -> rtx_cost.  That works for COST at least, not sure about
> COST_IN, maybe that needs a little more care (cse.c works with invalid
> insns all over the place :-( )
>
This experiement patch just replace part of rtx_cost with insn_cost.
In case, COST is called outside cse_insn, 'insn' may not be set, and
then 'insn_cost' may not work.   This would need to be enhanced.
>>  
>> +/* Internal function, to get cost when use X to replace source of insn
>> +   which is a SET.  */
>> +
>> +static int
>> +insn_cost_x (rtx_insn *insn, rtx x)
>> +{
>> +  INSN_CODE (insn) = -1;
>> +  SET_SRC (PATTERN (insn)) = x;
>> +  return insn_cost (insn, optimize_this_for_speed_p);
>> +}
>
> You need to restore stuff as well?
In this patch, this function is called on a tmp_insn, so I did not
restore it.  If using the original 'insn' of cse_insn to invoked
'insn_cost_x', fields of 'insn' should be restored.
>
>> @@ -4603,6 +4619,7 @@ cse_insn (rtx_insn *insn)
>>  
>>       Nothing in this loop changes the hash table or the register chains.  */
>>  
>> +  rtx_insn *tmp_insn = NULL;
>>    for (i = 0; i < n_sets; i++)
>>      {
>>        bool repeat = false;
>> @@ -4638,6 +4655,10 @@ cse_insn (rtx_insn *insn)
>>        mode = GET_MODE (src) == VOIDmode ? GET_MODE (dest) : GET_MODE (src);
>>        sets[i].mode = mode;
>>  
>> +      if (tmp_insn == NULL_RTX && src && dest && dest != pc_rtx
>> +	  && src != pc_rtx)
>> +	tmp_insn = make_insn_raw (gen_rtx_SET (copy_rtx (dest), copy_rtx(src)));
>
> src and dest are always non-nil here.  I'll have to read the code better
> to know about the (pc) stuff.
>
>> @@ -5103,7 +5124,7 @@ cse_insn (rtx_insn *insn)
>>  	    src_cost = src_regcost = -1;
>>  	  else
>>  	    {
>> -	      src_cost = COST (src, mode);
>> +	      src_cost = COST_SRC (tmp_insn, src, mode);
>
> I think you can just leave this as COST?
>
Yes, it would be better to just use COST, and update COST macro
to use insn_cost.

Thanks a lot for your greate help!

BR,
Jiufu

>
> Segher
  
Jiufu Guo March 8, 2022, 11:25 a.m. UTC | #27
Jiufu Guo <guojiufu@linux.ibm.com> writes:

Hi!

> Hi Sehger,
>
> Segher Boessenkool <segher@kernel.crashing.org> writes:
>
>> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>>> > No.  insn_cost is only for correct, existing instructions, not for
>>> > made-up nonsense.  I created insn_cost precisely to get away from that
>>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
>>> > and cumbersome to write a correct rtx_cost).
>>> 
>>> Thanks! The implementations of hook insn_cost are align with this
>>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>>> 
>>> One question on the speciall case: 
>>> For instruction: "r119:DI=0x100803004101001"
>>> Would we treat it as valid instruction?
>>
>> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
>> This is costed as 5 insns (cost=20).
>>
>> It generally is better to split things into patterns close to the
>> eventual machine isntructions as early as possible: all the more generic
>> optimisations can take advantage of that then.
> Get it!
>>
>>> A patch, which is attached the end of this mail, accepts
>>> "r119:DI=0x100803004101001" as input of insn_cost.
>>> In this patch, 
>>> - A tmp instruction is generated via make_insn_raw.
>>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>>> - In hook of insn_cost, checking the special 'constant' instruction.
>>> Are these make sense?
>>
>> I'll review that patch inline.

I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
Different from the previous partial patch, this patch replaces all usage
of rtx_cost. It may be better/aggressive than previous one.

With this patch, bootstrap pass.
From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
change seems as expected.


BR,
Jiufu

diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..e623ad298db 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
 static rtx_insn *this_insn;
 static bool optimize_this_for_speed_p;
 
+/* Used for insn_cost. */
+static rtx_insn *estimate_insn;
+
 /* Index by register number, gives the number of the next (or
    previous) register in the chain of registers sharing the same
    value.
@@ -445,7 +448,7 @@ struct table_elt
 /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
    hard registers and pointers into the frame are the cheapest with a cost
    of 0.  Next come pseudos with a cost of one and other hard registers with
-   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
+   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
 
 #define CHEAP_REGNO(N)							\
   (REGNO_PTR_FRAME_P (N)						\
@@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
    from COST macro to keep it simple.  */
 
 static int
-notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
+notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
 {
   scalar_int_mode int_mode, inner_mode;
-  return ((GET_CODE (x) == SUBREG
-	   && REG_P (SUBREG_REG (x))
-	   && is_int_mode (mode, &int_mode)
-	   && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
-	   && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
-	   && subreg_lowpart_p (x)
-	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
-	  ? 0
-	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
+  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
+      && is_int_mode (mode, &int_mode)
+      && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
+      && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
+      && subreg_lowpart_p (x)
+      && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
+    return 0;
+
+  if (estimate_insn == NULL)
+    {
+      estimate_insn = make_insn_raw (
+	gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
+      SET_PREV_INSN (estimate_insn) = NULL_RTX;
+      SET_NEXT_INSN (estimate_insn) = NULL_RTX;
+      INSN_LOCATION (estimate_insn) = 0;
+    }
+  else
+    {
+      /* Update for new context.  */
+      INSN_CODE (estimate_insn) = -1;
+      PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
+      SET_SRC (PATTERN (estimate_insn)) = x;
+    }
+  return insn_cost (estimate_insn, optimize_this_for_speed_p);
 }
 
 
@@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
 
   init_recog ();
   init_alias_analysis ();
+  estimate_insn = NULL;
 
   reg_eqv_table = XNEWVEC (struct reg_eqv_elem, nregs);
  
Richard Biener March 8, 2022, 11:50 a.m. UTC | #28
On Tue, 8 Mar 2022, Jiufu Guo wrote:

> Jiufu Guo <guojiufu@linux.ibm.com> writes:
> 
> Hi!
> 
> > Hi Sehger,
> >
> > Segher Boessenkool <segher@kernel.crashing.org> writes:
> >
> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
> >>> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >>> > No.  insn_cost is only for correct, existing instructions, not for
> >>> > made-up nonsense.  I created insn_cost precisely to get away from that
> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
> >>> > and cumbersome to write a correct rtx_cost).
> >>> 
> >>> Thanks! The implementations of hook insn_cost are align with this
> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
> >>> 
> >>> One question on the speciall case: 
> >>> For instruction: "r119:DI=0x100803004101001"
> >>> Would we treat it as valid instruction?
> >>
> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
> >> This is costed as 5 insns (cost=20).
> >>
> >> It generally is better to split things into patterns close to the
> >> eventual machine isntructions as early as possible: all the more generic
> >> optimisations can take advantage of that then.
> > Get it!
> >>
> >>> A patch, which is attached the end of this mail, accepts
> >>> "r119:DI=0x100803004101001" as input of insn_cost.
> >>> In this patch, 
> >>> - A tmp instruction is generated via make_insn_raw.
> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
> >>> - In hook of insn_cost, checking the special 'constant' instruction.
> >>> Are these make sense?
> >>
> >> I'll review that patch inline.
> 
> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
> Different from the previous partial patch, this patch replaces all usage
> of rtx_cost. It may be better/aggressive than previous one.

I think there's no advantage for using insn_cost over rtx_cost for
the simple SET case.

Richard.

> With this patch, bootstrap pass.
> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
> change seems as expected.
> 
> 
> BR,
> Jiufu
> 
> diff --git a/gcc/cse.cc b/gcc/cse.cc
> index a18b599d324..e623ad298db 100644
> --- a/gcc/cse.cc
> +++ b/gcc/cse.cc
> @@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
>  static rtx_insn *this_insn;
>  static bool optimize_this_for_speed_p;
>  
> +/* Used for insn_cost. */
> +static rtx_insn *estimate_insn;
> +
>  /* Index by register number, gives the number of the next (or
>     previous) register in the chain of registers sharing the same
>     value.
> @@ -445,7 +448,7 @@ struct table_elt
>  /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
>     hard registers and pointers into the frame are the cheapest with a cost
>     of 0.  Next come pseudos with a cost of one and other hard registers with
> -   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
> +   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
>  
>  #define CHEAP_REGNO(N)							\
>    (REGNO_PTR_FRAME_P (N)						\
> @@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
>     from COST macro to keep it simple.  */
>  
>  static int
> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
> +notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
>  {
>    scalar_int_mode int_mode, inner_mode;
> -  return ((GET_CODE (x) == SUBREG
> -	   && REG_P (SUBREG_REG (x))
> -	   && is_int_mode (mode, &int_mode)
> -	   && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
> -	   && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
> -	   && subreg_lowpart_p (x)
> -	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> -	  ? 0
> -	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
> +  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
> +      && is_int_mode (mode, &int_mode)
> +      && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
> +      && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
> +      && subreg_lowpart_p (x)
> +      && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> +    return 0;
> +
> +  if (estimate_insn == NULL)
> +    {
> +      estimate_insn = make_insn_raw (
> +	gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
> +      SET_PREV_INSN (estimate_insn) = NULL_RTX;
> +      SET_NEXT_INSN (estimate_insn) = NULL_RTX;
> +      INSN_LOCATION (estimate_insn) = 0;
> +    }
> +  else
> +    {
> +      /* Update for new context.  */
> +      INSN_CODE (estimate_insn) = -1;
> +      PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
> +      SET_SRC (PATTERN (estimate_insn)) = x;
> +    }
> +  return insn_cost (estimate_insn, optimize_this_for_speed_p);
>  }
>  
>  
> @@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
>  
>    init_recog ();
>    init_alias_analysis ();
> +  estimate_insn = NULL;
>  
>    reg_eqv_table = XNEWVEC (struct reg_eqv_elem, nregs);
>  
>
  
Jiufu Guo March 9, 2022, 4:37 a.m. UTC | #29
Hi!

Richard Biener <rguenther@suse.de> writes:

> On Tue, 8 Mar 2022, Jiufu Guo wrote:
>
>> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>> 
>> Hi!
>> 
>> > Hi Sehger,
>> >
>> > Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >
>> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>> >>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >>> > No.  insn_cost is only for correct, existing instructions, not for
>> >>> > made-up nonsense.  I created insn_cost precisely to get away from that
>> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
>> >>> > and cumbersome to write a correct rtx_cost).
>> >>> 
>> >>> Thanks! The implementations of hook insn_cost are align with this
>> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>> >>> 
>> >>> One question on the speciall case: 
>> >>> For instruction: "r119:DI=0x100803004101001"
>> >>> Would we treat it as valid instruction?
>> >>
>> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
>> >> This is costed as 5 insns (cost=20).
>> >>
>> >> It generally is better to split things into patterns close to the
>> >> eventual machine isntructions as early as possible: all the more generic
>> >> optimisations can take advantage of that then.
>> > Get it!
>> >>
>> >>> A patch, which is attached the end of this mail, accepts
>> >>> "r119:DI=0x100803004101001" as input of insn_cost.
>> >>> In this patch, 
>> >>> - A tmp instruction is generated via make_insn_raw.
>> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>> >>> - In hook of insn_cost, checking the special 'constant' instruction.
>> >>> Are these make sense?
>> >>
>> >> I'll review that patch inline.
>> 
>> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
>> Different from the previous partial patch, this patch replaces all usage
>> of rtx_cost. It may be better/aggressive than previous one.
>
> I think there's no advantage for using insn_cost over rtx_cost for
> the simple SET case.

Thanks for your comments and raise this concern.

For those targets which do not implement insn_cost, insn_cost calls
rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.

While, for those targets which have insn_cost, it seems insn_cost would
be better(or say more accurate/consistent?) than rtx_cost. Since:
- insn_cost recog the insn first, and compute cost through something
(like length/cost attributes from .md file) for the 'machine insn'.
- rtx_cost estimates the cost through analyzing the 'rtx content'.
The accurate estimation relates to the context.

For a special example: "%r100 = C", as a previous patch, by tunning
target's rtx_cost hook, cost could be computed according to the value
of C. insn_cost may just model the cost in the define of the machine
instruction.

These reasons are my initial thoughts.  Segher may have better
explain. :-) 

To replace rtx_cost with insn_cost, this patch build a SET instruction:
"%r = rtx_expr", then using "%r = rtx_expr" from insn_cost to simulate
the cost of "rtx_expr" from rtx_cost.


BR,
Jiufu

>
> Richard.
>
>> With this patch, bootstrap pass.
>> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
>> change seems as expected.
>> 
>> 
>> BR,
>> Jiufu
>> 
>> diff --git a/gcc/cse.cc b/gcc/cse.cc
>> index a18b599d324..e623ad298db 100644
>> --- a/gcc/cse.cc
>> +++ b/gcc/cse.cc
>> @@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
>>  static rtx_insn *this_insn;
>>  static bool optimize_this_for_speed_p;
>>  
>> +/* Used for insn_cost. */
>> +static rtx_insn *estimate_insn;
>> +
>>  /* Index by register number, gives the number of the next (or
>>     previous) register in the chain of registers sharing the same
>>     value.
>> @@ -445,7 +448,7 @@ struct table_elt
>>  /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
>>     hard registers and pointers into the frame are the cheapest with a cost
>>     of 0.  Next come pseudos with a cost of one and other hard registers with
>> -   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
>> +   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
>>  
>>  #define CHEAP_REGNO(N)							\
>>    (REGNO_PTR_FRAME_P (N)						\
>> @@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
>>     from COST macro to keep it simple.  */
>>  
>>  static int
>> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
>> +notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
>>  {
>>    scalar_int_mode int_mode, inner_mode;
>> -  return ((GET_CODE (x) == SUBREG
>> -	   && REG_P (SUBREG_REG (x))
>> -	   && is_int_mode (mode, &int_mode)
>> -	   && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
>> -	   && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
>> -	   && subreg_lowpart_p (x)
>> -	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>> -	  ? 0
>> -	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
>> +  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
>> +      && is_int_mode (mode, &int_mode)
>> +      && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
>> +      && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
>> +      && subreg_lowpart_p (x)
>> +      && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>> +    return 0;
>> +
>> +  if (estimate_insn == NULL)
>> +    {
>> +      estimate_insn = make_insn_raw (
>> +	gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
>> +      SET_PREV_INSN (estimate_insn) = NULL_RTX;
>> +      SET_NEXT_INSN (estimate_insn) = NULL_RTX;
>> +      INSN_LOCATION (estimate_insn) = 0;
>> +    }
>> +  else
>> +    {
>> +      /* Update for new context.  */
>> +      INSN_CODE (estimate_insn) = -1;
>> +      PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
>> +      SET_SRC (PATTERN (estimate_insn)) = x;
>> +    }
>> +  return insn_cost (estimate_insn, optimize_this_for_speed_p);
>>  }
>>  
>>  
>> @@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
>>  
>>    init_recog ();
>>    init_alias_analysis ();
>> +  estimate_insn = NULL;
>>  
>>    reg_eqv_table = XNEWVEC (struct reg_eqv_elem, nregs);
>>  
>>
  
Richard Biener March 9, 2022, 7:41 a.m. UTC | #30
On Wed, 9 Mar 2022, Jiufu Guo wrote:

> 
> Hi!
> 
> Richard Biener <rguenther@suse.de> writes:
> 
> > On Tue, 8 Mar 2022, Jiufu Guo wrote:
> >
> >> Jiufu Guo <guojiufu@linux.ibm.com> writes:
> >> 
> >> Hi!
> >> 
> >> > Hi Sehger,
> >> >
> >> > Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> >
> >> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
> >> >>> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> >>> > No.  insn_cost is only for correct, existing instructions, not for
> >> >>> > made-up nonsense.  I created insn_cost precisely to get away from that
> >> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
> >> >>> > and cumbersome to write a correct rtx_cost).
> >> >>> 
> >> >>> Thanks! The implementations of hook insn_cost are align with this
> >> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
> >> >>> 
> >> >>> One question on the speciall case: 
> >> >>> For instruction: "r119:DI=0x100803004101001"
> >> >>> Would we treat it as valid instruction?
> >> >>
> >> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
> >> >> This is costed as 5 insns (cost=20).
> >> >>
> >> >> It generally is better to split things into patterns close to the
> >> >> eventual machine isntructions as early as possible: all the more generic
> >> >> optimisations can take advantage of that then.
> >> > Get it!
> >> >>
> >> >>> A patch, which is attached the end of this mail, accepts
> >> >>> "r119:DI=0x100803004101001" as input of insn_cost.
> >> >>> In this patch, 
> >> >>> - A tmp instruction is generated via make_insn_raw.
> >> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
> >> >>> - In hook of insn_cost, checking the special 'constant' instruction.
> >> >>> Are these make sense?
> >> >>
> >> >> I'll review that patch inline.
> >> 
> >> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
> >> Different from the previous partial patch, this patch replaces all usage
> >> of rtx_cost. It may be better/aggressive than previous one.
> >
> > I think there's no advantage for using insn_cost over rtx_cost for
> > the simple SET case.
> 
> Thanks for your comments and raise this concern.
> 
> For those targets which do not implement insn_cost, insn_cost calls
> rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.
> 
> While, for those targets which have insn_cost, it seems insn_cost would
> be better(or say more accurate/consistent?) than rtx_cost. Since:
> - insn_cost recog the insn first, and compute cost through something

target hooks are expected to call recog on the insn but the generic
fallback does not!?  Or do you say a target _could_ call recog?
I think it would be valid to only expect recognized insns here
and thus cse.cc obligation would be to call regoc on the "fake"
instruction which then raises the obvious issue whether you should
ever call recog on something "fake".

I also see that rs6000_insn_cost does

static int
rs6000_insn_cost (rtx_insn *insn, bool speed)
{
  if (recog_memoized (insn) < 0)
    return 0;

so not recognized insns become quite cheap - it looks like
insn_cost has no documented failure mode and initial implementors
chickened out asserting that an insn is / can be recognized.
In fact I'd assert that INSN_CODE (insn) >= 0 simply because there's
no failure mode and the _caller_ should have made sure the
insn can be recognized.  That said, the insn_cost should do that.
(is INSN_CODE () == 0 a real thing?)

> (like length/cost attributes from .md file) for the 'machine insn'.
> - rtx_cost estimates the cost through analyzing the 'rtx content'.
> The accurate estimation relates to the context.
> 
> For a special example: "%r100 = C", as a previous patch, by tunning
> target's rtx_cost hook, cost could be computed according to the value
> of C. insn_cost may just model the cost in the define of the machine
> instruction.
> 
> These reasons are my initial thoughts.  Segher may have better
> explain. :-) 

OK, so in theory the targets insn_cost can use insn attributes
which allows to keep the cost parts of an instruction close to the
instruction patterns which is probably a good thing for
maintainability.  But as you point out this requires recognizing
of possibly generated instructions.  And before reload it has
the issue that the alternative is not determined which means the
true cost cannot be determined without walking the pattern,
like on x86 where many instruction operands have both register
and memory alternatives.

> To replace rtx_cost with insn_cost, this patch build a SET instruction:
> "%r = rtx_expr", then using "%r = rtx_expr" from insn_cost to simulate
> the cost of "rtx_expr" from rtx_cost.
> 
> 
> BR,
> Jiufu
> 
> >
> > Richard.
> >
> >> With this patch, bootstrap pass.
> >> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
> >> change seems as expected.
> >> 
> >> 
> >> BR,
> >> Jiufu
> >> 
> >> diff --git a/gcc/cse.cc b/gcc/cse.cc
> >> index a18b599d324..e623ad298db 100644
> >> --- a/gcc/cse.cc
> >> +++ b/gcc/cse.cc
> >> @@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
> >>  static rtx_insn *this_insn;
> >>  static bool optimize_this_for_speed_p;
> >>  
> >> +/* Used for insn_cost. */
> >> +static rtx_insn *estimate_insn;
> >> +
> >>  /* Index by register number, gives the number of the next (or
> >>     previous) register in the chain of registers sharing the same
> >>     value.
> >> @@ -445,7 +448,7 @@ struct table_elt
> >>  /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
> >>     hard registers and pointers into the frame are the cheapest with a cost
> >>     of 0.  Next come pseudos with a cost of one and other hard registers with
> >> -   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
> >> +   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
> >>  
> >>  #define CHEAP_REGNO(N)							\
> >>    (REGNO_PTR_FRAME_P (N)						\
> >> @@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
> >>     from COST macro to keep it simple.  */
> >>  
> >>  static int
> >> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
> >> +notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
> >>  {
> >>    scalar_int_mode int_mode, inner_mode;
> >> -  return ((GET_CODE (x) == SUBREG
> >> -	   && REG_P (SUBREG_REG (x))
> >> -	   && is_int_mode (mode, &int_mode)
> >> -	   && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
> >> -	   && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
> >> -	   && subreg_lowpart_p (x)
> >> -	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> >> -	  ? 0
> >> -	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
> >> +  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
> >> +      && is_int_mode (mode, &int_mode)
> >> +      && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
> >> +      && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
> >> +      && subreg_lowpart_p (x)
> >> +      && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> >> +    return 0;
> >> +
> >> +  if (estimate_insn == NULL)
> >> +    {
> >> +      estimate_insn = make_insn_raw (
> >> +	gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
> >> +      SET_PREV_INSN (estimate_insn) = NULL_RTX;
> >> +      SET_NEXT_INSN (estimate_insn) = NULL_RTX;
> >> +      INSN_LOCATION (estimate_insn) = 0;
> >> +    }
> >> +  else
> >> +    {
> >> +      /* Update for new context.  */
> >> +      INSN_CODE (estimate_insn) = -1;
> >> +      PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
> >> +      SET_SRC (PATTERN (estimate_insn)) = x;
> >> +    }
> >> +  return insn_cost (estimate_insn, optimize_this_for_speed_p);
> >>  }
> >>  
> >>  
> >> @@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
> >>  
> >>    init_recog ();
> >>    init_alias_analysis ();
> >> +  estimate_insn = NULL;
> >>  
> >>    reg_eqv_table = XNEWVEC (struct reg_eqv_elem, nregs);
> >>  
> >> 
>
  
Jiufu Guo March 10, 2022, 2:09 a.m. UTC | #31
Hi!

Richard Biener <rguenther@suse.de> writes:

> On Wed, 9 Mar 2022, Jiufu Guo wrote:
>
>> 
>> Hi!
>> 
>> Richard Biener <rguenther@suse.de> writes:
>> 
>> > On Tue, 8 Mar 2022, Jiufu Guo wrote:
>> >
>> >> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>> >> 
>> >> Hi!
>> >> 
>> >> > Hi Sehger,
>> >> >
>> >> > Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >> >
>> >> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>> >> >>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >> >>> > No.  insn_cost is only for correct, existing instructions, not for
>> >> >>> > made-up nonsense.  I created insn_cost precisely to get away from that
>> >> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
>> >> >>> > and cumbersome to write a correct rtx_cost).
>> >> >>> 
>> >> >>> Thanks! The implementations of hook insn_cost are align with this
>> >> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>> >> >>> 
>> >> >>> One question on the speciall case: 
>> >> >>> For instruction: "r119:DI=0x100803004101001"
>> >> >>> Would we treat it as valid instruction?
>> >> >>
>> >> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
>> >> >> This is costed as 5 insns (cost=20).
>> >> >>
>> >> >> It generally is better to split things into patterns close to the
>> >> >> eventual machine isntructions as early as possible: all the more generic
>> >> >> optimisations can take advantage of that then.
>> >> > Get it!
>> >> >>
>> >> >>> A patch, which is attached the end of this mail, accepts
>> >> >>> "r119:DI=0x100803004101001" as input of insn_cost.
>> >> >>> In this patch, 
>> >> >>> - A tmp instruction is generated via make_insn_raw.
>> >> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>> >> >>> - In hook of insn_cost, checking the special 'constant' instruction.
>> >> >>> Are these make sense?
>> >> >>
>> >> >> I'll review that patch inline.
>> >> 
>> >> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
>> >> Different from the previous partial patch, this patch replaces all usage
>> >> of rtx_cost. It may be better/aggressive than previous one.
>> >
>> > I think there's no advantage for using insn_cost over rtx_cost for
>> > the simple SET case.
>> 
>> Thanks for your comments and raise this concern.
>> 
>> For those targets which do not implement insn_cost, insn_cost calls
>> rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.
>> 
>> While, for those targets which have insn_cost, it seems insn_cost would
>> be better(or say more accurate/consistent?) than rtx_cost. Since:
>> - insn_cost recog the insn first, and compute cost through something
>
> target hooks are expected to call recog on the insn but the generic
> fallback does not!?  Or do you say a target _could_ call recog?
> I think it would be valid to only expect recognized insns here
> and thus cse.cc obligation would be to call regoc on the "fake"
> instruction which then raises the obvious issue whether you should
> ever call recog on something "fake".
Thanks Richard! I also feel this is a main point of insn_cost.
From my understanding: it would be better to let insn_cost check
the valid recognizable insns; this would be the major purpose of
insn_cost.  While, I'm also wondering, we may let it go for 'fake'
instruction for current implementations (e.g. arm_insn_cost) which
behaviors to walk/check pattern. 

>
> I also see that rs6000_insn_cost does
>
> static int
> rs6000_insn_cost (rtx_insn *insn, bool speed)
> {
>   if (recog_memoized (insn) < 0)
>     return 0;
>
> so not recognized insns become quite cheap - it looks like
> insn_cost has no documented failure mode and initial implementors
> chickened out asserting that an insn is / can be recognized.
> In fact I'd assert that INSN_CODE (insn) >= 0 simply because there's
> no failure mode and the _caller_ should have made sure the
> insn can be recognized.  That said, the insn_cost should do that.
Thanks! Using the assert would help to catch un-recog failures.

> (is INSN_CODE () == 0 a real thing?)
0 is specialy, it is some thing like CODE_FOR_nothing. :-)

>
>> (like length/cost attributes from .md file) for the 'machine insn'.
>> - rtx_cost estimates the cost through analyzing the 'rtx content'.
>> The accurate estimation relates to the context.
>> 
>> For a special example: "%r100 = C", as a previous patch, by tunning
>> target's rtx_cost hook, cost could be computed according to the value
>> of C. insn_cost may just model the cost in the define of the machine
>> instruction.
>> 
>> These reasons are my initial thoughts.  Segher may have better
>> explain. :-) 
>
> OK, so in theory the targets insn_cost can use insn attributes
> which allows to keep the cost parts of an instruction close to the
> instruction patterns which is probably a good thing for
> maintainability.  But as you point out this requires recognizing
> of possibly generated instructions.  And before reload it has
> the issue that the alternative is not determined which means the
> true cost cannot be determined without walking the pattern,
> like on x86 where many instruction operands have both register
> and memory alternatives.
Right, when there is no alternative fits for true cost, it may have to
check the pattern like 'rtx_cost' then. :-)


BR,
Jiufu

>
>> To replace rtx_cost with insn_cost, this patch build a SET instruction:
>> "%r = rtx_expr", then using "%r = rtx_expr" from insn_cost to simulate
>> the cost of "rtx_expr" from rtx_cost.
>> 
>> 
>> BR,
>> Jiufu
>> 
>> >
>> > Richard.
>> >
>> >> With this patch, bootstrap pass.
>> >> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
>> >> change seems as expected.
>> >> 
>> >> 
>> >> BR,
>> >> Jiufu
>> >> 
>> >> diff --git a/gcc/cse.cc b/gcc/cse.cc
>> >> index a18b599d324..e623ad298db 100644
>> >> --- a/gcc/cse.cc
>> >> +++ b/gcc/cse.cc
>> >> @@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
>> >>  static rtx_insn *this_insn;
>> >>  static bool optimize_this_for_speed_p;
>> >>  
>> >> +/* Used for insn_cost. */
>> >> +static rtx_insn *estimate_insn;
>> >> +
>> >>  /* Index by register number, gives the number of the next (or
>> >>     previous) register in the chain of registers sharing the same
>> >>     value.
>> >> @@ -445,7 +448,7 @@ struct table_elt
>> >>  /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
>> >>     hard registers and pointers into the frame are the cheapest with a cost
>> >>     of 0.  Next come pseudos with a cost of one and other hard registers with
>> >> -   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
>> >> +   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
>> >>  
>> >>  #define CHEAP_REGNO(N)							\
>> >>    (REGNO_PTR_FRAME_P (N)						\
>> >> @@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
>> >>     from COST macro to keep it simple.  */
>> >>  
>> >>  static int
>> >> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
>> >> +notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
>> >>  {
>> >>    scalar_int_mode int_mode, inner_mode;
>> >> -  return ((GET_CODE (x) == SUBREG
>> >> -	   && REG_P (SUBREG_REG (x))
>> >> -	   && is_int_mode (mode, &int_mode)
>> >> -	   && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
>> >> -	   && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
>> >> -	   && subreg_lowpart_p (x)
>> >> -	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>> >> -	  ? 0
>> >> -	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
>> >> +  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
>> >> +      && is_int_mode (mode, &int_mode)
>> >> +      && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
>> >> +      && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
>> >> +      && subreg_lowpart_p (x)
>> >> +      && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>> >> +    return 0;
>> >> +
>> >> +  if (estimate_insn == NULL)
>> >> +    {
>> >> +      estimate_insn = make_insn_raw (
>> >> +	gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
>> >> +      SET_PREV_INSN (estimate_insn) = NULL_RTX;
>> >> +      SET_NEXT_INSN (estimate_insn) = NULL_RTX;
>> >> +      INSN_LOCATION (estimate_insn) = 0;
>> >> +    }
>> >> +  else
>> >> +    {
>> >> +      /* Update for new context.  */
>> >> +      INSN_CODE (estimate_insn) = -1;
>> >> +      PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
>> >> +      SET_SRC (PATTERN (estimate_insn)) = x;
>> >> +    }
>> >> +  return insn_cost (estimate_insn, optimize_this_for_speed_p);
>> >>  }
>> >>  
>> >>  
>> >> @@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
>> >>  
>> >>    init_recog ();
>> >>    init_alias_analysis ();
>> >> +  estimate_insn = NULL;
>> >>  
>> >>    reg_eqv_table = XNEWVEC (struct reg_eqv_elem, nregs);
>> >>  
>> >> 
>>
  
Richard Biener March 10, 2022, 7:09 a.m. UTC | #32
On Thu, 10 Mar 2022, Jiufu Guo wrote:

> 
> Hi!
> 
> Richard Biener <rguenther@suse.de> writes:
> 
> > On Wed, 9 Mar 2022, Jiufu Guo wrote:
> >
> >> 
> >> Hi!
> >> 
> >> Richard Biener <rguenther@suse.de> writes:
> >> 
> >> > On Tue, 8 Mar 2022, Jiufu Guo wrote:
> >> >
> >> >> Jiufu Guo <guojiufu@linux.ibm.com> writes:
> >> >> 
> >> >> Hi!
> >> >> 
> >> >> > Hi Sehger,
> >> >> >
> >> >> > Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> >> >
> >> >> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
> >> >> >>> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> >> >>> > No.  insn_cost is only for correct, existing instructions, not for
> >> >> >>> > made-up nonsense.  I created insn_cost precisely to get away from that
> >> >> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
> >> >> >>> > and cumbersome to write a correct rtx_cost).
> >> >> >>> 
> >> >> >>> Thanks! The implementations of hook insn_cost are align with this
> >> >> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
> >> >> >>> 
> >> >> >>> One question on the speciall case: 
> >> >> >>> For instruction: "r119:DI=0x100803004101001"
> >> >> >>> Would we treat it as valid instruction?
> >> >> >>
> >> >> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
> >> >> >> This is costed as 5 insns (cost=20).
> >> >> >>
> >> >> >> It generally is better to split things into patterns close to the
> >> >> >> eventual machine isntructions as early as possible: all the more generic
> >> >> >> optimisations can take advantage of that then.
> >> >> > Get it!
> >> >> >>
> >> >> >>> A patch, which is attached the end of this mail, accepts
> >> >> >>> "r119:DI=0x100803004101001" as input of insn_cost.
> >> >> >>> In this patch, 
> >> >> >>> - A tmp instruction is generated via make_insn_raw.
> >> >> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
> >> >> >>> - In hook of insn_cost, checking the special 'constant' instruction.
> >> >> >>> Are these make sense?
> >> >> >>
> >> >> >> I'll review that patch inline.
> >> >> 
> >> >> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
> >> >> Different from the previous partial patch, this patch replaces all usage
> >> >> of rtx_cost. It may be better/aggressive than previous one.
> >> >
> >> > I think there's no advantage for using insn_cost over rtx_cost for
> >> > the simple SET case.
> >> 
> >> Thanks for your comments and raise this concern.
> >> 
> >> For those targets which do not implement insn_cost, insn_cost calls
> >> rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.
> >> 
> >> While, for those targets which have insn_cost, it seems insn_cost would
> >> be better(or say more accurate/consistent?) than rtx_cost. Since:
> >> - insn_cost recog the insn first, and compute cost through something
> >
> > target hooks are expected to call recog on the insn but the generic
> > fallback does not!?  Or do you say a target _could_ call recog?
> > I think it would be valid to only expect recognized insns here
> > and thus cse.cc obligation would be to call regoc on the "fake"
> > instruction which then raises the obvious issue whether you should
> > ever call recog on something "fake".
> Thanks Richard! I also feel this is a main point of insn_cost.
> From my understanding: it would be better to let insn_cost check
> the valid recognizable insns; this would be the major purpose of
> insn_cost.  While, I'm also wondering, we may let it go for 'fake'
> instruction for current implementations (e.g. arm_insn_cost) which
> behaviors to walk/check pattern. 

Note for the CSE case you'll always end up with the single move
pattern so it's somewhat pointless to do the recog & insn_cost
dance there.  For move cost using rtx_cost (SET, ..) should
be good enough.  One could argue we should standardize a (wrapping)
API like move_cost (enum machine_mode mode, rtx to, rtx from),
with to/from optional (if omitted a REG of MODE).  But the existing
rtx_cost target hook implementation should be sufficient to handle
it, without building a fake insn and without doing (the pointless,
if not failing) recog process.

Richard.

> >
> > I also see that rs6000_insn_cost does
> >
> > static int
> > rs6000_insn_cost (rtx_insn *insn, bool speed)
> > {
> >   if (recog_memoized (insn) < 0)
> >     return 0;
> >
> > so not recognized insns become quite cheap - it looks like
> > insn_cost has no documented failure mode and initial implementors
> > chickened out asserting that an insn is / can be recognized.
> > In fact I'd assert that INSN_CODE (insn) >= 0 simply because there's
> > no failure mode and the _caller_ should have made sure the
> > insn can be recognized.  That said, the insn_cost should do that.
> Thanks! Using the assert would help to catch un-recog failures.
> 
> > (is INSN_CODE () == 0 a real thing?)
> 0 is specialy, it is some thing like CODE_FOR_nothing. :-)
> 
> >
> >> (like length/cost attributes from .md file) for the 'machine insn'.
> >> - rtx_cost estimates the cost through analyzing the 'rtx content'.
> >> The accurate estimation relates to the context.
> >> 
> >> For a special example: "%r100 = C", as a previous patch, by tunning
> >> target's rtx_cost hook, cost could be computed according to the value
> >> of C. insn_cost may just model the cost in the define of the machine
> >> instruction.
> >> 
> >> These reasons are my initial thoughts.  Segher may have better
> >> explain. :-) 
> >
> > OK, so in theory the targets insn_cost can use insn attributes
> > which allows to keep the cost parts of an instruction close to the
> > instruction patterns which is probably a good thing for
> > maintainability.  But as you point out this requires recognizing
> > of possibly generated instructions.  And before reload it has
> > the issue that the alternative is not determined which means the
> > true cost cannot be determined without walking the pattern,
> > like on x86 where many instruction operands have both register
> > and memory alternatives.
> Right, when there is no alternative fits for true cost, it may have to
> check the pattern like 'rtx_cost' then. :-)
> 
> 
> BR,
> Jiufu
> 
> >
> >> To replace rtx_cost with insn_cost, this patch build a SET instruction:
> >> "%r = rtx_expr", then using "%r = rtx_expr" from insn_cost to simulate
> >> the cost of "rtx_expr" from rtx_cost.
> >> 
> >> 
> >> BR,
> >> Jiufu
> >> 
> >> >
> >> > Richard.
> >> >
> >> >> With this patch, bootstrap pass.
> >> >> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
> >> >> change seems as expected.
> >> >> 
> >> >> 
> >> >> BR,
> >> >> Jiufu
> >> >> 
> >> >> diff --git a/gcc/cse.cc b/gcc/cse.cc
> >> >> index a18b599d324..e623ad298db 100644
> >> >> --- a/gcc/cse.cc
> >> >> +++ b/gcc/cse.cc
> >> >> @@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
> >> >>  static rtx_insn *this_insn;
> >> >>  static bool optimize_this_for_speed_p;
> >> >>  
> >> >> +/* Used for insn_cost. */
> >> >> +static rtx_insn *estimate_insn;
> >> >> +
> >> >>  /* Index by register number, gives the number of the next (or
> >> >>     previous) register in the chain of registers sharing the same
> >> >>     value.
> >> >> @@ -445,7 +448,7 @@ struct table_elt
> >> >>  /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
> >> >>     hard registers and pointers into the frame are the cheapest with a cost
> >> >>     of 0.  Next come pseudos with a cost of one and other hard registers with
> >> >> -   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
> >> >> +   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
> >> >>  
> >> >>  #define CHEAP_REGNO(N)							\
> >> >>    (REGNO_PTR_FRAME_P (N)						\
> >> >> @@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
> >> >>     from COST macro to keep it simple.  */
> >> >>  
> >> >>  static int
> >> >> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
> >> >> +notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
> >> >>  {
> >> >>    scalar_int_mode int_mode, inner_mode;
> >> >> -  return ((GET_CODE (x) == SUBREG
> >> >> -	   && REG_P (SUBREG_REG (x))
> >> >> -	   && is_int_mode (mode, &int_mode)
> >> >> -	   && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
> >> >> -	   && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
> >> >> -	   && subreg_lowpart_p (x)
> >> >> -	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> >> >> -	  ? 0
> >> >> -	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
> >> >> +  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
> >> >> +      && is_int_mode (mode, &int_mode)
> >> >> +      && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
> >> >> +      && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
> >> >> +      && subreg_lowpart_p (x)
> >> >> +      && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
> >> >> +    return 0;
> >> >> +
> >> >> +  if (estimate_insn == NULL)
> >> >> +    {
> >> >> +      estimate_insn = make_insn_raw (
> >> >> +	gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
> >> >> +      SET_PREV_INSN (estimate_insn) = NULL_RTX;
> >> >> +      SET_NEXT_INSN (estimate_insn) = NULL_RTX;
> >> >> +      INSN_LOCATION (estimate_insn) = 0;
> >> >> +    }
> >> >> +  else
> >> >> +    {
> >> >> +      /* Update for new context.  */
> >> >> +      INSN_CODE (estimate_insn) = -1;
> >> >> +      PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
> >> >> +      SET_SRC (PATTERN (estimate_insn)) = x;
> >> >> +    }
> >> >> +  return insn_cost (estimate_insn, optimize_this_for_speed_p);
> >> >>  }
> >> >>  
> >> >>  
> >> >> @@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
> >> >>  
> >> >>    init_recog ();
> >> >>    init_alias_analysis ();
> >> >> +  estimate_insn = NULL;
> >> >>  
> >> >>    reg_eqv_table = XNEWVEC (struct reg_eqv_elem, nregs);
> >> >>  
> >> >> 
> >> 
>
  
Jiufu Guo March 11, 2022, 6:33 a.m. UTC | #33
Hi!

Richard Biener <rguenther@suse.de> writes:

> On Thu, 10 Mar 2022, Jiufu Guo wrote:
>
>> 
>> Hi!
>> 
>> Richard Biener <rguenther@suse.de> writes:
>> 
>> > On Wed, 9 Mar 2022, Jiufu Guo wrote:
>> >
>> >> 
>> >> Hi!
>> >> 
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> 
>> >> > On Tue, 8 Mar 2022, Jiufu Guo wrote:
>> >> >
>> >> >> Jiufu Guo <guojiufu@linux.ibm.com> writes:
>> >> >> 
>> >> >> Hi!
>> >> >> 
>> >> >> > Hi Sehger,
>> >> >> >
>> >> >> > Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >> >> >
>> >> >> >> On Tue, Mar 01, 2022 at 10:28:57PM +0800, Jiufu Guo wrote:
>> >> >> >>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >> >> >>> > No.  insn_cost is only for correct, existing instructions, not for
>> >> >> >>> > made-up nonsense.  I created insn_cost precisely to get away from that
>> >> >> >>> > aspect of rtx_cost (and some other issues, like, it is incredibly hard
>> >> >> >>> > and cumbersome to write a correct rtx_cost).
>> >> >> >>> 
>> >> >> >>> Thanks! The implementations of hook insn_cost are align with this
>> >> >> >>> design, they are  checking insn's attributes and COSTS_N_INSNS.
>> >> >> >>> 
>> >> >> >>> One question on the speciall case: 
>> >> >> >>> For instruction: "r119:DI=0x100803004101001"
>> >> >> >>> Would we treat it as valid instruction?
>> >> >> >>
>> >> >> >> Currently we do, alternative 6 in *movdi_internal64: we allow any r<-n.
>> >> >> >> This is costed as 5 insns (cost=20).
>> >> >> >>
>> >> >> >> It generally is better to split things into patterns close to the
>> >> >> >> eventual machine isntructions as early as possible: all the more generic
>> >> >> >> optimisations can take advantage of that then.
>> >> >> > Get it!
>> >> >> >>
>> >> >> >>> A patch, which is attached the end of this mail, accepts
>> >> >> >>> "r119:DI=0x100803004101001" as input of insn_cost.
>> >> >> >>> In this patch, 
>> >> >> >>> - A tmp instruction is generated via make_insn_raw.
>> >> >> >>> - A few calls to rtx_cost (in cse_insn) is replaced by insn_cost.
>> >> >> >>> - In hook of insn_cost, checking the special 'constant' instruction.
>> >> >> >>> Are these make sense?
>> >> >> >>
>> >> >> >> I'll review that patch inline.
>> >> >> 
>> >> >> I drafted a new patch that replace rtx_cost with insn_cost for cse.cc.
>> >> >> Different from the previous partial patch, this patch replaces all usage
>> >> >> of rtx_cost. It may be better/aggressive than previous one.
>> >> >
>> >> > I think there's no advantage for using insn_cost over rtx_cost for
>> >> > the simple SET case.
>> >> 
>> >> Thanks for your comments and raise this concern.
>> >> 
>> >> For those targets which do not implement insn_cost, insn_cost calls
>> >> rtx_cost through pattern_cost, then insn_cost is equal to rtx_cost.
>> >> 
>> >> While, for those targets which have insn_cost, it seems insn_cost would
>> >> be better(or say more accurate/consistent?) than rtx_cost. Since:
>> >> - insn_cost recog the insn first, and compute cost through something
>> >
>> > target hooks are expected to call recog on the insn but the generic
>> > fallback does not!?  Or do you say a target _could_ call recog?
>> > I think it would be valid to only expect recognized insns here
>> > and thus cse.cc obligation would be to call regoc on the "fake"
>> > instruction which then raises the obvious issue whether you should
>> > ever call recog on something "fake".
>> Thanks Richard! I also feel this is a main point of insn_cost.
>> From my understanding: it would be better to let insn_cost check
>> the valid recognizable insns; this would be the major purpose of
>> insn_cost.  While, I'm also wondering, we may let it go for 'fake'
>> instruction for current implementations (e.g. arm_insn_cost) which
>> behaviors to walk/check pattern. 
>
> Note for the CSE case you'll always end up with the single move
> pattern so it's somewhat pointless to do the recog & insn_cost
> dance there.  For move cost using rtx_cost (SET, ..) should
> be good enough.  One could argue we should standardize a (wrapping)
> API like move_cost (enum machine_mode mode, rtx to, rtx from),
> with to/from optional (if omitted a REG of MODE).  But the existing
> rtx_cost target hook implementation should be sufficient to handle
> it, without building a fake insn and without doing (the pointless,
> if not failing) recog process.

Thanks so much for your comments and suggestions!

Using rtx_cost is able to handle those things in cse.cc.
For insn_cost, it is good for checking an instruction.  To simulate
'rtx_cost' using insn_cost, there is excess work: 'recog' on the setting
of rtx expr and alternative estimation. 

I just wondering we may prefer to use insn_cost for its consistency and
accuracy. :-) So, the patch is prepared.


BR,
Jiufu

>
> Richard.
>
>> >
>> > I also see that rs6000_insn_cost does
>> >
>> > static int
>> > rs6000_insn_cost (rtx_insn *insn, bool speed)
>> > {
>> >   if (recog_memoized (insn) < 0)
>> >     return 0;
>> >
>> > so not recognized insns become quite cheap - it looks like
>> > insn_cost has no documented failure mode and initial implementors
>> > chickened out asserting that an insn is / can be recognized.
>> > In fact I'd assert that INSN_CODE (insn) >= 0 simply because there's
>> > no failure mode and the _caller_ should have made sure the
>> > insn can be recognized.  That said, the insn_cost should do that.
>> Thanks! Using the assert would help to catch un-recog failures.
>> 
>> > (is INSN_CODE () == 0 a real thing?)
>> 0 is specialy, it is some thing like CODE_FOR_nothing. :-)
>> 
>> >
>> >> (like length/cost attributes from .md file) for the 'machine insn'.
>> >> - rtx_cost estimates the cost through analyzing the 'rtx content'.
>> >> The accurate estimation relates to the context.
>> >> 
>> >> For a special example: "%r100 = C", as a previous patch, by tunning
>> >> target's rtx_cost hook, cost could be computed according to the value
>> >> of C. insn_cost may just model the cost in the define of the machine
>> >> instruction.
>> >> 
>> >> These reasons are my initial thoughts.  Segher may have better
>> >> explain. :-) 
>> >
>> > OK, so in theory the targets insn_cost can use insn attributes
>> > which allows to keep the cost parts of an instruction close to the
>> > instruction patterns which is probably a good thing for
>> > maintainability.  But as you point out this requires recognizing
>> > of possibly generated instructions.  And before reload it has
>> > the issue that the alternative is not determined which means the
>> > true cost cannot be determined without walking the pattern,
>> > like on x86 where many instruction operands have both register
>> > and memory alternatives.
>> Right, when there is no alternative fits for true cost, it may have to
>> check the pattern like 'rtx_cost' then. :-)
>> 
>> 
>> BR,
>> Jiufu
>> 
>> >
>> >> To replace rtx_cost with insn_cost, this patch build a SET instruction:
>> >> "%r = rtx_expr", then using "%r = rtx_expr" from insn_cost to simulate
>> >> the cost of "rtx_expr" from rtx_cost.
>> >> 
>> >> 
>> >> BR,
>> >> Jiufu
>> >> 
>> >> >
>> >> > Richard.
>> >> >
>> >> >> With this patch, bootstrap pass.
>> >> >> From regtest, only output of fusion-p10-ldcmpi.c is changed, and the
>> >> >> change seems as expected.
>> >> >> 
>> >> >> 
>> >> >> BR,
>> >> >> Jiufu
>> >> >> 
>> >> >> diff --git a/gcc/cse.cc b/gcc/cse.cc
>> >> >> index a18b599d324..e623ad298db 100644
>> >> >> --- a/gcc/cse.cc
>> >> >> +++ b/gcc/cse.cc
>> >> >> @@ -262,6 +262,9 @@ static struct qty_table_elem *qty_table;
>> >> >>  static rtx_insn *this_insn;
>> >> >>  static bool optimize_this_for_speed_p;
>> >> >>  
>> >> >> +/* Used for insn_cost. */
>> >> >> +static rtx_insn *estimate_insn;
>> >> >> +
>> >> >>  /* Index by register number, gives the number of the next (or
>> >> >>     previous) register in the chain of registers sharing the same
>> >> >>     value.
>> >> >> @@ -445,7 +448,7 @@ struct table_elt
>> >> >>  /* Compute cost of X, as stored in the `cost' field of a table_elt.  Fixed
>> >> >>     hard registers and pointers into the frame are the cheapest with a cost
>> >> >>     of 0.  Next come pseudos with a cost of one and other hard registers with
>> >> >> -   a cost of 2.  Aside from these special cases, call `rtx_cost'.  */
>> >> >> +   a cost of 2.  Aside from these special cases, call `insn_cost'.  */
>> >> >>  
>> >> >>  #define CHEAP_REGNO(N)							\
>> >> >>    (REGNO_PTR_FRAME_P (N)						\
>> >> >> @@ -698,18 +701,33 @@ preferable (int cost_a, int regcost_a, int cost_b, int regcost_b)
>> >> >>     from COST macro to keep it simple.  */
>> >> >>  
>> >> >>  static int
>> >> >> -notreg_cost (rtx x, machine_mode mode, enum rtx_code outer, int opno)
>> >> >> +notreg_cost (rtx x, machine_mode mode, enum rtx_code /*outer*/, int /*opno*/)
>> >> >>  {
>> >> >>    scalar_int_mode int_mode, inner_mode;
>> >> >> -  return ((GET_CODE (x) == SUBREG
>> >> >> -	   && REG_P (SUBREG_REG (x))
>> >> >> -	   && is_int_mode (mode, &int_mode)
>> >> >> -	   && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
>> >> >> -	   && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
>> >> >> -	   && subreg_lowpart_p (x)
>> >> >> -	   && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>> >> >> -	  ? 0
>> >> >> -	  : rtx_cost (x, mode, outer, opno, optimize_this_for_speed_p) * 2);
>> >> >> +  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x))
>> >> >> +      && is_int_mode (mode, &int_mode)
>> >> >> +      && is_int_mode (GET_MODE (SUBREG_REG (x)), &inner_mode)
>> >> >> +      && GET_MODE_SIZE (int_mode) < GET_MODE_SIZE (inner_mode)
>> >> >> +      && subreg_lowpart_p (x)
>> >> >> +      && TRULY_NOOP_TRUNCATION_MODES_P (int_mode, inner_mode))
>> >> >> +    return 0;
>> >> >> +
>> >> >> +  if (estimate_insn == NULL)
>> >> >> +    {
>> >> >> +      estimate_insn = make_insn_raw (
>> >> >> +	gen_rtx_SET (gen_rtx_REG (mode, LAST_VIRTUAL_REGISTER + 1), x));
>> >> >> +      SET_PREV_INSN (estimate_insn) = NULL_RTX;
>> >> >> +      SET_NEXT_INSN (estimate_insn) = NULL_RTX;
>> >> >> +      INSN_LOCATION (estimate_insn) = 0;
>> >> >> +    }
>> >> >> +  else
>> >> >> +    {
>> >> >> +      /* Update for new context.  */
>> >> >> +      INSN_CODE (estimate_insn) = -1;
>> >> >> +      PUT_MODE (SET_DEST (PATTERN (estimate_insn)), mode);
>> >> >> +      SET_SRC (PATTERN (estimate_insn)) = x;
>> >> >> +    }
>> >> >> +  return insn_cost (estimate_insn, optimize_this_for_speed_p);
>> >> >>  }
>> >> >>  
>> >> >>  
>> >> >> @@ -6667,6 +6685,7 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
>> >> >>  
>> >> >>    init_recog ();
>> >> >>    init_alias_analysis ();
>> >> >> +  estimate_insn = NULL;
>> >> >>  
>> >> >>    reg_eqv_table = XNEWVEC (struct reg_eqv_elem, nregs);
>> >> >>  
>> >> >> 
>> >> 
>>
  

Patch

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..beb21a0bb2b 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1361,6 +1361,9 @@  static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_CANNOT_FORCE_CONST_MEM
 #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
 
+#undef TARGET_FASTER_LOADING_CONSTANT
+#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
+
 #undef TARGET_DELEGITIMIZE_ADDRESS
 #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
 
@@ -9684,8 +9687,7 @@  rs6000_init_stack_protect_guard (void)
 static bool
 rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
 {
-  if (GET_CODE (x) == HIGH
-      && GET_CODE (XEXP (x, 0)) == UNSPEC)
+  if (GET_CODE (x) == HIGH)
     return true;
 
   /* A TLS symbol in the TOC cannot contain a sum.  */
@@ -10483,6 +10485,30 @@  rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, machine_mode mode)
   return false;
 }
 
+
+/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
+
+static bool
+rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
+{
+  gcc_assert (CONSTANT_P (src));
+
+  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
+    return false;
+  if (GET_CODE (src) == HIGH)
+    return false;
+  if (toc_relative_expr_p (src, false, NULL, NULL))
+    return false;
+  if (rs6000_cannot_force_const_mem (mode, src))
+    return false;
+
+  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
+    return true;
+  if (!CONST_INT_P (src))
+    return true;
+  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
+}
+
 /* Emit a move from SOURCE to DEST in mode MODE.  */
 void
 rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
@@ -10860,13 +10886,7 @@  rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
 	operands[1] = create_TOC_reference (operands[1], operands[0]);
       else if (mode == Pmode
 	       && CONSTANT_P (operands[1])
-	       && GET_CODE (operands[1]) != HIGH
-	       && ((REG_P (operands[0])
-		    && FP_REGNO_P (REGNO (operands[0])))
-		   || !CONST_INT_P (operands[1])
-		   || (num_insns_constant (operands[1], mode)
-		       > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
-	       && !toc_relative_expr_p (operands[1], false, NULL, NULL)
+	       && rs6000_faster_loading_const (mode, operands[0], operands[1])
 	       && (TARGET_CMODEL == CMODEL_SMALL
 		   || can_create_pseudo_p ()
 		   || (REG_P (operands[0])
diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..c33d3c7e911 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -5158,6 +5158,11 @@  cse_insn (rtx_insn *insn)
       if (dest == pc_rtx && src_const && GET_CODE (src_const) == LABEL_REF)
 	src_folded = src_const, src_folded_cost = src_folded_regcost = -1;
 
+      /* Check if src_foled is contant which is fastster to load pool.  */
+      if (src_folded && CONSTANT_P (src_folded) && src && MEM_P (src)
+	  && targetm.faster_loading_constant (mode, dest, src_folded))
+	src_folded = NULL_RTX, src_folded_cost = src_folded_regcost = -1;
+
       /* Terminate loop when replacement made.  This must terminate since
          the current contents will be tested and will always be valid.  */
       while (1)
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 962bbb8caaf..65685311249 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6014,6 +6014,11 @@  holding the constant.  This restriction is often true of addresses
 of TLS symbols for various targets.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_FASTER_LOADING_CONSTANT (machine_mode @var{mode}, rtx @var{y}, rtx @var{x})
+It returns true if loading contant @var{x} is faster than building
+from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_USE_BLOCKS_FOR_CONSTANT_P (machine_mode @var{mode}, const_rtx @var{x})
 This hook should return true if pool entries for constant @var{x} can
 be placed in an @code{object_block} structure.  @var{mode} is the mode
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 394b59e70a6..918914f0e30 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4148,6 +4148,8 @@  address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_CANNOT_FORCE_CONST_MEM
 
+@hook TARGET_FASTER_LOADING_CONSTANT
+
 @hook TARGET_USE_BLOCKS_FOR_CONSTANT_P
 
 @hook TARGET_USE_BLOCKS_FOR_DECL_P
diff --git a/gcc/target.def b/gcc/target.def
index 57e64b20eef..8b007aca9dc 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2987,6 +2987,14 @@  locally.  The default is to return false.",
  bool, (void),
  hook_bool_void_false)
 
+/* Check if it is profitable to load the constant from constant pool.  */
+DEFHOOK
+(faster_loading_constant,
+ "It returns true if loading contant @var{x} is faster than building\n\
+from scratch, and set to @var{y}.  @var{mode} is the mode of @var{x}.",
+ bool, (machine_mode mode, rtx y, rtx x),
+ default_faster_loading_constant)
+
 /* True if it is OK to do sibling call optimization for the specified
    call expression EXP.  DECL will be the called function, or NULL if
    this is an indirect call.  */
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index fc49235eb38..1d9340d1c14 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -173,6 +173,12 @@  default_return_in_memory (const_tree type,
   return (TYPE_MODE (type) == BLKmode);
 }
 
+bool
+default_faster_loading_constant (machine_mode mode, rtx dest, rtx src)
+{
+  return false;
+}
+
 rtx
 default_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
 			    machine_mode mode ATTRIBUTE_UNUSED)
diff --git a/gcc/testsuite/gcc.target/powerpc/medium_offset.c b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
index f29eba08c38..4889e8fa8ec 100644
--- a/gcc/testsuite/gcc.target/powerpc/medium_offset.c
+++ b/gcc/testsuite/gcc.target/powerpc/medium_offset.c
@@ -1,7 +1,7 @@ 
 /* { dg-do compile { target { powerpc*-*-* } } } */
 /* { dg-require-effective-target lp64 } */
 /* { dg-options "-O" } */
-/* { dg-final { scan-assembler-not "\\+4611686018427387904" } } */
+/* { dg-final { scan-assembler-times {\msldi|pld\M} 1 } } */
 
 static int x;
 
diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
new file mode 100644
index 00000000000..58ba074d427
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define CONST1 0x100803004101001
+#define CONST2 0x000406002105007
+
+void __attribute__ ((noinline))
+foo (unsigned long long *a, unsigned long long *b)
+{
+  *a = CONST1;
+  *b = CONST2;
+}
+
+/* { dg-final { scan-assembler-times {\mld|pld\M} 2 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c b/gcc/testsuite/gcc.target/powerpc/pr93012.c
index 4f764d0576f..5afb4f79c45 100644
--- a/gcc/testsuite/gcc.target/powerpc/pr93012.c
+++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c
@@ -10,4 +10,4 @@  unsigned long long mskh1() { return 0xffff9234ffff9234ULL; }
 unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; }
 unsigned long long mskse() { return 0xffff1234ffff1234ULL; }
 
-/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */
+/* { dg-final { scan-assembler-times {\mrldimi|ld|pld\M} 7 } } */